Compare commits
13 Commits: her-fixes ... peterz_tes

| Author | SHA1 | Date |
|---|---|---|
| | 2c818245d6 | |
| | ae8e7fd16b | |
| | f272969325 | |
| | a6b1bc70f1 | |
| | 36ee5d1707 | |
| | 24fe3d6576 | |
| | 9cb7ece338 | |
| | 9cf95a0054 | |
| | 8b781038cc | |
| | 69f25c6028 | |
| | 2b0283b9db | |
| | 1f8a03f3a6 | |
| | 3cc7df0608 | |
.travis.yml (Normal file, 14 lines)
@@ -0,0 +1,14 @@
language: python
python:
- "3.6"

services:
- docker

install:
- pip install flake8
- docker build . -t baselines-test

script:
- flake8 --select=F baselines/common
- docker run baselines-test pytest
Dockerfile (Normal file, 20 lines)
@@ -0,0 +1,20 @@
FROM ubuntu:16.04

RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake
ENV CODE_DIR /root/code
ENV VENV /root/venv

COPY . $CODE_DIR/baselines
RUN \
    pip install virtualenv && \
    virtualenv $VENV --python=python3 && \
    . $VENV/bin/activate && \
    cd $CODE_DIR && \
    pip install --upgrade pip && \
    pip install -e baselines && \
    pip install pytest

ENV PATH=$VENV/bin:$PATH
WORKDIR $CODE_DIR/baselines

CMD /bin/bash
README.md (54)
@@ -1,4 +1,4 @@
<img src="data/logo.jpg" width=25% align="right" />
<img src="data/logo.jpg" width=25% align="right" /> [](https://travis-ci.org/openai/baselines)

# Baselines

@@ -6,13 +6,63 @@ OpenAI Baselines is a set of high-quality implementations of reinforcement learn

These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.

You can install it by typing:
## Prerequisites
Baselines requires python3 (>=3.5) with the development headers. You'll also need system packages CMake, OpenMPI and zlib. Those can be installed as follows
### Ubuntu

```bash
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
```

### Mac OS X
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
```bash
brew install cmake openmpi
```

## Virtual environment
From the general python package sanity perspective, it is a good idea to use virtual environments (virtualenvs) to make sure packages from different projects do not interfere with each other. You can install virtualenv (which is itself a pip package) via
```bash
pip install virtualenv
```
Virtualenvs are essentially folders that have copies of python executable and all python packages.
To create a virtualenv called venv with python3, one runs
```bash
virtualenv /path/to/venv --python=python3
```
To activate a virtualenv:
```
. /path/to/venv/bin/activate
```
More thorough tutorial on virtualenvs and options can be found [here](https://virtualenv.pypa.io/en/stable/)


## Installation
Clone the repo and cd into it:
```bash
git clone https://github.com/openai/baselines.git
cd baselines
```
If using virtualenv, create a new virtualenv and activate it
```bash
virtualenv env --python=python3
. env/bin/activate
```
Install baselines package
```bash
pip install -e .
```
### MuJoCo
Some of the baselines examples use [MuJoCo](http://www.mujoco.org) (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py)

## Testing the installation
All unit tests in baselines can be run using pytest runner:
```
pip install pytest
pytest
```

## Subpackages

- [A2C](baselines/a2c)
- [ACER](baselines/acer)
@@ -1,16 +1,12 @@
import os
import os.path as osp
import gym
import time
import joblib
import logging
import numpy as np
import tensorflow as tf
from baselines import logger

from baselines.common import set_global_seeds, explained_variance
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.atari_wrappers import wrap_deepmind
from baselines.common.runners import AbstractEnvRunner
from baselines.common import tf_util

from baselines.a2c.utils import discount_with_dones
@@ -24,7 +20,6 @@ class Model(object):
alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):

sess = tf_util.make_session()
nact = ac_space.n
nbatch = nenvs*nsteps

A = tf.placeholder(tf.int32, [nbatch])
@@ -75,7 +70,7 @@ class Model(object):
restores = []
for p, loaded_p in zip(params, loaded_params):
restores.append(p.assign(loaded_p))
ps = sess.run(restores)
sess.run(restores)

self.train = train
self.train_model = train_model
@@ -87,21 +82,11 @@ class Model(object):
self.load = load
tf.global_variables_initializer().run(session=sess)

class Runner(object):
class Runner(AbstractEnvRunner):

def __init__(self, env, model, nsteps=5, gamma=0.99):
self.env = env
self.model = model
nh, nw, nc = env.observation_space.shape
nenv = env.num_envs
self.batch_ob_shape = (nenv*nsteps, nh, nw, nc)
self.obs = np.zeros((nenv, nh, nw, nc), dtype=np.uint8)
self.nc = nc
obs = env.reset()
super().__init__(env=env, model=model, nsteps=nsteps)
self.gamma = gamma
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]

def run(self):
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
@@ -146,7 +131,6 @@ class Runner(object):
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values

def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5, lr=7e-4, lrschedule='linear', epsilon=1e-5, alpha=0.99, gamma=0.99, log_interval=100):
tf.reset_default_graph()
set_global_seeds(seed)

nenvs = env.num_envs
@@ -173,3 +157,4 @@ def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, e
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()
env.close()
return model
@@ -2,39 +2,36 @@ import numpy as np
import tensorflow as tf
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
from baselines.common.distributions import make_pdtype
from baselines.common.input import observation_input

def nature_cnn(unscaled_images):
def nature_cnn(unscaled_images, **conv_kwargs):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2)))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2)))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2)))
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
**conv_kwargs))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))

class LnLstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
nenv = nbatch // nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
X, processed_x = observation_input(ob_space, nbatch)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
self.pdtype = make_pdtype(ac_space)
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
h = nature_cnn(processed_x)
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi = fc(h5, 'pi', nact)
vf = fc(h5, 'v', 1)

self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pi)
self.pd, self.pi = self.pdtype.pdfromlatent(h5)

v0 = vf[:, 0]
a0 = self.pd.sample()
@@ -50,7 +47,6 @@ class LnLstmPolicy(object):
self.X = X
self.M = M
self.S = S
self.pi = pi
self.vf = vf
self.step = step
self.value = value
@@ -59,11 +55,9 @@ class LstmPolicy(object):

def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
nenv = nbatch // nsteps
self.pdtype = make_pdtype(ac_space)
X, processed_x = observation_input(ob_space, nbatch)

nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
@@ -72,11 +66,8 @@ class LstmPolicy(object):
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi = fc(h5, 'pi', nact)
vf = fc(h5, 'v', 1)

self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pi)
self.pd, self.pi = self.pdtype.pdfromlatent(h5)

v0 = vf[:, 0]
a0 = self.pd.sample()
@@ -92,25 +83,19 @@ class LstmPolicy(object):
self.X = X
self.M = M
self.S = S
self.pi = pi
self.vf = vf
self.step = step
self.value = value

class CnnPolicy(object):

def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
pi = fc(h, 'pi', nact, init_scale=0.01)
vf = fc(h, 'v', 1)[:,0]

def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False, **conv_kwargs): #pylint: disable=W0613
self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pi)
X, processed_x = observation_input(ob_space, nbatch)
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(processed_x, **conv_kwargs)
vf = fc(h, 'v', 1)[:,0]
self.pd, self.pi = self.pdtype.pdfromlatent(h, init_scale=0.01)

a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
@@ -124,31 +109,25 @@ class CnnPolicy(object):
return sess.run(vf, {X:ob})

self.X = X
self.pi = pi
self.vf = vf
self.step = step
self.value = value

class MlpPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
ob_shape = (nbatch,) + ob_space.shape
actdim = ac_space.shape[0]
X = tf.placeholder(tf.float32, ob_shape, name='Ob') #obs
with tf.variable_scope("model", reuse=reuse):
activ = tf.tanh
h1 = activ(fc(X, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
h2 = activ(fc(h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
pi = fc(h2, 'pi', actdim, init_scale=0.01)
h1 = activ(fc(X, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
h2 = activ(fc(h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
vf = fc(h2, 'vf', 1)[:,0]
logstd = tf.get_variable(name="logstd", shape=[1, actdim],
initializer=tf.zeros_initializer())

pdparam = tf.concat([pi, pi * 0.0 + logstd], axis=1)

self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pdparam)
with tf.variable_scope("model", reuse=reuse):
X, processed_x = observation_input(ob_space, nbatch)
activ = tf.tanh
processed_x = tf.layers.flatten(processed_x)
pi_h1 = activ(fc(processed_x, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
pi_h2 = activ(fc(pi_h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
vf_h1 = activ(fc(processed_x, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
vf_h2 = activ(fc(vf_h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
vf = fc(vf_h2, 'vf', 1)[:,0]

self.pd, self.pi = self.pdtype.pdfromlatent(pi_h2, init_scale=0.01)


a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
@@ -162,7 +141,6 @@ class MlpPolicy(object):
return sess.run(vf, {X:ob})

self.X = X
self.pi = pi
self.vf = vf
self.step = step
self.value = value
@@ -39,7 +39,7 @@ def ortho_init(scale=1.0):
return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
return _ortho_init

def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC'):
def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC', one_dim_bias=False):
if data_format == 'NHWC':
channel_ax = 3
strides = [1, stride, stride, 1]
@@ -50,12 +50,14 @@ def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='
bshape = [1, nf, 1, 1]
else:
raise NotImplementedError
bias_var_shape = [nf] if one_dim_bias else [1, nf, 1, 1]
nin = x.get_shape()[channel_ax].value
wshape = [rf, rf, nin, nf]
with tf.variable_scope(scope):
w = tf.get_variable("w", wshape, initializer=ortho_init(init_scale))
b = tf.get_variable("b", [1, nf, 1, 1], initializer=tf.constant_initializer(0.0))
if data_format == 'NHWC': b = tf.reshape(b, bshape)
b = tf.get_variable("b", bias_var_shape, initializer=tf.constant_initializer(0.0))
if not one_dim_bias and data_format == 'NHWC':
b = tf.reshape(b, bshape)
return b + tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format)

def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
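The `one_dim_bias` switch above only changes how the bias variable is stored ([nf] instead of [1, nf, 1, 1]). A minimal usage sketch, assuming an NHWC batch of Atari-sized frames (the placeholder shape and scope name are illustrative, not taken from this diff):

```python
import numpy as np
import tensorflow as tf
from baselines.a2c.utils import conv

# Illustrative input: a batch of 84x84 frame stacks in NHWC layout (shape is an assumption).
x = tf.placeholder(tf.float32, [None, 84, 84, 4])
# With one_dim_bias=True the bias is created with shape [nf]; broadcasting against the
# conv2d output still works, but checkpoints now store a flat bias vector.
h = conv(x, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2), one_dim_bias=True)
```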
@@ -5,6 +5,7 @@ import tensorflow as tf
from baselines import logger

from baselines.common import set_global_seeds
from baselines.common.runners import AbstractEnvRunner

from baselines.a2c.utils import batch_to_seq, seq_to_batch
from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
@@ -13,6 +14,8 @@ from baselines.a2c.utils import EpisodeStats
from baselines.a2c.utils import get_by_index, check_shape, avg_norm, gradient_add, q_explained_variance
from baselines.acer.buffer import Buffer

import os.path as osp

# remove last step
def strip(var, nenvs, nsteps, flat = False):
vars = batch_to_seq(var, nenvs, nsteps + 1, flat)
@@ -209,11 +212,10 @@ class Model(object):
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)

class Runner(object):
class Runner(AbstractEnvRunner):
def __init__(self, env, model, nsteps, nstack):
self.env = env
super().__init__(env=env, model=model, nsteps=nsteps)
self.nstack = nstack
self.model = model
nh, nw, nc = env.observation_space.shape
self.nc = nc  # nc = 1 for atari, but just in case
self.nenv = nenv = env.num_envs
@@ -223,9 +225,6 @@ class Runner(object):
self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
obs = env.reset()
self.update_obs(obs)
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]

def update_obs(self, obs, dones=None):
if dones is not None:
@@ -1,5 +1,7 @@
#!/usr/bin/env python3

from functools import partial

from baselines import logger
from baselines.acktr.acktr_disc import learn
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
@@ -8,7 +10,7 @@ from baselines.ppo2.policies import CnnPolicy

def train(env_id, num_timesteps, seed, num_cpu):
env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
policy_fn = CnnPolicy
policy_fn = partial(CnnPolicy, one_dim_bias=True)
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
env.close()
@@ -9,6 +9,8 @@ _atariexpl7 = ['Freeway', 'Gravitar', 'MontezumaRevenge', 'Pitfall', 'PrivateEye
_BENCHMARKS = []

remove_version_re = re.compile(r'-v\d+$')


def register_benchmark(benchmark):
for b in _BENCHMARKS:
if b['name'] == benchmark['name']:
@@ -138,3 +140,11 @@ register_benchmark({
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari50]
})

# HER DDPG

register_benchmark({
'name': 'HerDdpg',
'description': 'Smoke-test only benchmark of HER',
'tasks': [{'trials': 1, 'env_id': 'FetchReach-v1'}]
})
@@ -1,3 +1,4 @@
# flake8: noqa F403
from baselines.common.console_util import *
from baselines.common.dataset import Dataset
from baselines.common.math_util import *
@@ -98,9 +98,6 @@ class MaxAndSkipEnv(gym.Wrapper):
self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
self._skip = skip

def reset(self):
return self.env.reset()

def step(self, action):
"""Repeat action, sum reward, and max over last observations."""
total_reward = 0.0
@@ -3,6 +3,7 @@ Helpers for scripts like run_atari.py.
"""

import os
from mpi4py import MPI
import gym
from gym.wrappers import FlattenDictWrapper
from baselines import logger
@@ -10,7 +11,6 @@ from baselines.bench import Monitor
from baselines.common import set_global_seeds
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from mpi4py import MPI

def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
"""
@@ -31,9 +31,10 @@ def make_mujoco_env(env_id, seed):
"""
Create a wrapped, monitored gym.Env for MuJoCo.
"""
set_global_seeds(seed)
rank = MPI.COMM_WORLD.Get_rank()
set_global_seeds(seed + 10000 * rank)
env = gym.make(env_id)
env = Monitor(env, logger.get_dir())
env = Monitor(env, os.path.join(logger.get_dir(), str(rank)))
env.seed(seed)
return env

@@ -75,6 +76,7 @@ def mujoco_arg_parser():
parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
parser.add_argument('--play', default=False, action='store_true')
return parser

def robotics_arg_parser():
@@ -1,6 +1,7 @@
import tensorflow as tf
import numpy as np
import baselines.common.tf_util as U
from baselines.a2c.utils import fc
from tensorflow.python.ops import math_ops

class Pd(object):
@@ -31,6 +32,8 @@ class PdType(object):
raise NotImplementedError
def pdfromflat(self, flat):
return self.pdclass()(flat)
def pdfromlatent(self, latent_vector):
raise NotImplementedError
def param_shape(self):
raise NotImplementedError
def sample_shape(self):
@@ -48,6 +51,10 @@ class CategoricalPdType(PdType):
self.ncat = ncat
def pdclass(self):
return CategoricalPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
pdparam = fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
return self.pdfromflat(pdparam), pdparam

def param_shape(self):
return [self.ncat]
def sample_shape(self):
@@ -75,6 +82,13 @@ class DiagGaussianPdType(PdType):
self.size = size
def pdclass(self):
return DiagGaussianPd

def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
mean = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
logstd = tf.get_variable(name='logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
return self.pdfromflat(pdparam), mean

def param_shape(self):
return [2*self.size]
def sample_shape(self):
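A small sketch of what the new `pdfromlatent` hook is for, assuming a discrete action space and a 512-unit latent vector (both made up for illustration): the PdType builds its own `pi` head from the latent features and returns the distribution together with the raw parameters, which is how the rewritten policies above obtain `self.pd, self.pi`.

```python
import tensorflow as tf
from gym.spaces import Discrete
from baselines.common.distributions import make_pdtype

ac_space = Discrete(6)                                       # assumed action space
latent = tf.placeholder(tf.float32, [None, 512], 'latent')   # e.g. nature_cnn features
pdtype = make_pdtype(ac_space)                               # -> CategoricalPdType(6)
pd, pi = pdtype.pdfromlatent(latent, init_scale=0.01)        # builds the 'pi' fc layer internally
a0 = pd.sample()                                             # sampled action
neglogp0 = pd.neglogp(a0)                                    # its negative log-probability
```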
baselines/common/identity_env.py (Normal file, 30 lines)
@@ -0,0 +1,30 @@
from gym import Env
from gym.spaces import Discrete


class IdentityEnv(Env):
    def __init__(
            self,
            dim,
            ep_length=100,
    ):

        self.action_space = Discrete(dim)
        self.reset()

    def reset(self):
        self._choose_next_state()
        self.observation_space = self.action_space

        return self.state

    def step(self, actions):
        rew = self._get_reward(actions)
        self._choose_next_state()
        return self.state, rew, False, {}

    def _choose_next_state(self):
        self.state = self.action_space.sample()

    def _get_reward(self, actions):
        return 1 if self.state == actions else 0
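A rough usage sketch of the new environment (not part of the diff): the reward is 1 whenever the action matches the current observation, so the "identity" policy scores one point per step, while a random policy on `IdentityEnv(10)` averages about 0.1.

```python
from baselines.common.identity_env import IdentityEnv

env = IdentityEnv(10)
obs = env.reset()
total = 0
for _ in range(100):
    obs, rew, done, _ = env.step(obs)   # echo the observation back as the action
    total += rew
print(total)                            # 100 for the perfect identity policy
```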
baselines/common/input.py (Normal file, 30 lines)
@@ -0,0 +1,30 @@
import tensorflow as tf
from gym.spaces import Discrete, Box

def observation_input(ob_space, batch_size=None, name='Ob'):
    '''
    Build observation input with encoding depending on the
    observation space type
    Params:

    ob_space: observation space (should be one of gym.spaces)
    batch_size: batch size for input (default is None, so that resulting input placeholder can take tensors with any batch size)
    name: tensorflow variable name for input placeholder

    returns: tuple (input_placeholder, processed_input_tensor)
    '''
    if isinstance(ob_space, Discrete):
        input_x = tf.placeholder(shape=(batch_size,), dtype=tf.int32, name=name)
        processed_x = tf.to_float(tf.one_hot(input_x, ob_space.n))
        return input_x, processed_x

    elif isinstance(ob_space, Box):
        input_shape = (batch_size,) + ob_space.shape
        input_x = tf.placeholder(shape=input_shape, dtype=ob_space.dtype, name=name)
        processed_x = tf.to_float(input_x)
        return input_x, processed_x

    else:
        raise NotImplementedError
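A minimal sketch of `observation_input` on the two supported space types (the concrete shapes are assumptions for illustration): Discrete spaces become an int placeholder plus a float one-hot tensor, while Box spaces keep their own dtype and are cast to float32.

```python
import numpy as np
import tensorflow as tf
from gym.spaces import Box, Discrete
from baselines.common.input import observation_input

# Discrete(5): placeholder shape (?,) int32, processed tensor shape (?, 5) float32
x_d, proc_d = observation_input(Discrete(5))

# Box uint8 frames: placeholder keeps uint8, processed tensor is the float32 cast
frames = Box(low=0, high=255, shape=(84, 84, 4), dtype=np.uint8)
x_b, proc_b = observation_input(frames, batch_size=32)
```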
@@ -2,6 +2,7 @@ from mpi4py import MPI
import numpy as np
from baselines.common import zipsame


def mpi_mean(x, axis=0, comm=None, keepdims=False):
x = np.asarray(x)
assert x.ndim > 0
baselines/common/runners.py (Normal file, 18 lines)
@@ -0,0 +1,18 @@
import numpy as np
from abc import ABC, abstractmethod

class AbstractEnvRunner(ABC):
    def __init__(self, *, env, model, nsteps):
        self.env = env
        self.model = model
        nenv = env.num_envs
        self.batch_ob_shape = (nenv*nsteps,) + env.observation_space.shape
        self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
        self.obs[:] = env.reset()
        self.nsteps = nsteps
        self.states = model.initial_state
        self.dones = [False for _ in range(nenv)]

    @abstractmethod
    def run(self):
        raise NotImplementedError
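A bare-bones subclass sketch (hypothetical, not from the diff) of the contract the refactored a2c/acer runners now share: the base class owns env, model, obs, states and dones, and concrete runners only implement `run()`. It assumes a model whose `step()` returns (actions, values, states, neglogps), as the baselines policies do.

```python
import numpy as np
from baselines.common.runners import AbstractEnvRunner

class MinimalRunner(AbstractEnvRunner):
    def run(self):
        mb_obs, mb_rewards = [], []
        for _ in range(self.nsteps):
            # step the model on the current observations, then step the vectorized env
            actions, values, self.states, _ = self.model.step(self.obs, self.states, self.dones)
            obs, rewards, self.dones, _ = self.env.step(actions)
            mb_obs.append(np.copy(self.obs))
            mb_rewards.append(rewards)
            self.obs[:] = obs
        return np.asarray(mb_obs), np.asarray(mb_rewards)
```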
@@ -12,10 +12,9 @@ class SegmentTree(object):

a) setting item's value is slightly slower.
It is O(lg capacity) instead of O(1).
b) user has access to an efficient `reduce`
operation which reduces `operation` over
a contiguous subsequence of items in the
array.
b) user has access to an efficient ( O(log segment size) )
`reduce` operation which reduces `operation` over
a contiguous subsequence of items in the array.

Paramters
---------
@@ -23,8 +22,8 @@ class SegmentTree(object):
Total size of the array - must be a power of two.
operation: lambda obj, obj -> obj
and operation for combining elements (eg. sum, max)
must for a mathematical group together with the set of
possible values for array elements.
must form a mathematical group together with the set of
possible values for array elements (i.e. be associative)
neutral_element: obj
neutral element for the operation above. eg. float('-inf')
for max and 0 for sum.
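For context, a tiny illustration (values made up) of the `reduce` operation the docstring describes, via the `SumSegmentTree` subclass that the prioritized replay buffer uses:

```python
from baselines.common.segment_tree import SumSegmentTree

tree = SumSegmentTree(8)   # capacity must be a power of two
tree[0] = 0.1
tree[1] = 0.5
tree[2] = 0.2
tree[3] = 0.7
print(tree.sum())          # 1.5: reduce (here: sum) over all items, in O(log capacity)
```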
baselines/common/test_identity.py (Normal file, 44 lines)
@@ -0,0 +1,44 @@
import pytest
import tensorflow as tf
import random
import numpy as np
from gym.spaces import np_random

from baselines.a2c import a2c
from baselines.ppo2 import ppo2
from baselines.common.identity_env import IdentityEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2.policies import MlpPolicy


learn_func_list = [
    lambda e: a2c.learn(policy=MlpPolicy, env=e, seed=0, total_timesteps=50000),
    lambda e: ppo2.learn(policy=MlpPolicy, env=e, total_timesteps=50000, lr=1e-3, nsteps=128, ent_coef=0.01)
]


@pytest.mark.slow
@pytest.mark.parametrize("learn_func", learn_func_list)
def test_identity(learn_func):
    '''
    Test if the algorithm (with a given policy)
    can learn an identity transformation (i.e. return observation as an action)
    '''
    np.random.seed(0)
    np_random.seed(0)
    random.seed(0)

    env = DummyVecEnv([lambda: IdentityEnv(10)])

    with tf.Graph().as_default(), tf.Session().as_default():
        tf.set_random_seed(0)
        model = learn_func(env)

        N_TRIALS = 1000
        sum_rew = 0
        obs = env.reset()
        for i in range(N_TRIALS):
            obs, rew, done, _ = env.step(model.step(obs)[0])
            sum_rew += rew

        assert sum_rew > 0.9 * N_TRIALS
@@ -33,7 +33,6 @@ def test_multikwargs():
initialize()
assert lin(2) == 6
assert lin(2, 2) == 10
expt_caught = False


if __name__ == '__main__':
@@ -48,18 +48,17 @@ def huber_loss(x, delta=1.0):
# Global session
# ================================================================

def make_session(num_cpu=None, make_default=False):
def make_session(num_cpu=None, make_default=False, graph=None):
"""Returns a session that will use <num_cpu> CPU's only"""
if num_cpu is None:
num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
tf_config = tf.ConfigProto(
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
tf_config.gpu_options.allocator_type = 'BFC'
if make_default:
return tf.InteractiveSession(config=tf_config)
return tf.InteractiveSession(config=tf_config, graph=graph)
else:
return tf.Session(config=tf_config)
return tf.Session(config=tf_config, graph=graph)

def single_threaded_session():
"""Returns a session which will only use a single CPU"""
@@ -84,10 +83,10 @@ def initialize():
# Model components
# ================================================================

def normc_initializer(std=1.0):
def normc_initializer(std=1.0, axis=0):
def _initializer(shape, dtype=None, partition_info=None): # pylint: disable=W0613
out = np.random.randn(*shape).astype(np.float32)
out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
out *= std / np.sqrt(np.square(out).sum(axis=axis, keepdims=True))
return tf.constant(out)
return _initializer

@@ -273,8 +272,33 @@ def display_var_info(vars):
for v in vars:
name = v.name
if "/Adam" in name or "beta1_power" in name or "beta2_power" in name: continue
count_params += np.prod(v.shape.as_list())
if "/b:" in name: continue # Wx+b, bias is not interesting to look at => count params, but not print
logger.info(" %s%s%s" % (name, " "*(55-len(name)), str(v.shape)))
logger.info("Total model parameters: %0.1f million" % (count_params*1e-6))
v_params = np.prod(v.shape.as_list())
count_params += v_params
if "/b:" in name or "/biases" in name: continue    # Wx+b, bias is not interesting to look at => count params, but not print
logger.info(" %s%s %i params %s" % (name, " "*(55-len(name)), v_params, str(v.shape)))

logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))


def get_available_gpus():
# recipe from here:
# https://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
return [x.name for x in local_device_protos if x.device_type == 'GPU']

# ================================================================
# Saving variables
# ================================================================

def load_state(fname):
saver = tf.train.Saver()
saver.restore(tf.get_default_session(), fname)

def save_state(fname):
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(tf.get_default_session(), fname)
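A quick sketch of the new `graph` argument (the graph and tensor names are illustrative): it lets callers pin a session to an explicit `tf.Graph`, which the new identity tests rely on instead of always using the default graph.

```python
import tensorflow as tf
from baselines.common.tf_util import make_session

g = tf.Graph()
with g.as_default():
    x = tf.constant(1.0)

sess = make_session(num_cpu=1, graph=g)   # session bound to g, not the default graph
print(sess.run(x))                        # 1.0
```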
baselines/common/tile_images.py (Normal file, 23 lines)
@@ -0,0 +1,23 @@
import numpy as np

def tile_images(img_nhwc):
    """
    Tile N images into one big PxQ image
    (P,Q) are chosen to be as close as possible, and if N
    is square, then P=Q.

    input: img_nhwc, list or array of images, ndim=4 once turned into array
        n = batch index, h = height, w = width, c = channel
    returns:
        bigim_HWc, ndarray with ndim=3
    """
    img_nhwc = np.asarray(img_nhwc)
    N, h, w, c = img_nhwc.shape
    H = int(np.ceil(np.sqrt(N)))
    W = int(np.ceil(float(N)/H))
    img_nhwc = np.array(list(img_nhwc) + [img_nhwc[0]*0 for _ in range(N, H*W)])
    img_HWhwc = img_nhwc.reshape(H, W, h, w, c)
    img_HhWwc = img_HWhwc.transpose(0, 2, 1, 3, 4)
    img_Hh_Ww_c = img_HhWwc.reshape(H*h, W*w, c)
    return img_Hh_Ww_c
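A small sketch of `tile_images` on dummy data (shapes are made up): five 32x32 RGB frames are padded to a 3x2 grid and stitched into one image.

```python
import numpy as np
from baselines.common.tile_images import tile_images

frames = np.random.randint(0, 255, size=(5, 32, 32, 3), dtype=np.uint8)
big = tile_images(frames)
print(big.shape)   # (96, 64, 3): H = ceil(sqrt(5)) = 3 rows, W = ceil(5/3) = 2 columns
```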
@@ -77,9 +77,16 @@ class VecEnv(ABC):
self.step_async(actions)
return self.step_wait()

def render(self):
def render(self, mode='human'):
logger.warn('Render not defined for %s'%self)

@property
def unwrapped(self):
if isinstance(self, VecEnvWrapper):
return self.venv.unwrapped
else:
return self

class VecEnvWrapper(VecEnv):
def __init__(self, venv, observation_space=None, action_space=None):
self.venv = venv
@@ -1,5 +1,6 @@
import numpy as np
import gym
from gym import spaces
from collections import OrderedDict
from . import VecEnv

class DummyVecEnv(VecEnv):
@@ -7,9 +8,22 @@ class DummyVecEnv(VecEnv):
self.envs = [fn() for fn in env_fns]
env = self.envs[0]
VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
shapes, dtypes = {}, {}
self.keys = []
obs_space = env.observation_space

obs_spaces = self.observation_space.spaces if isinstance(self.observation_space, gym.spaces.Tuple) else (self.observation_space,)
self.buf_obs = [np.zeros((self.num_envs,) + tuple(s.shape), s.dtype) for s in obs_spaces]
if isinstance(obs_space, spaces.Dict):
assert isinstance(obs_space.spaces, OrderedDict)
subspaces = obs_space.spaces
else:
subspaces = {None: obs_space}

for key, box in subspaces.items():
shapes[key] = box.shape
dtypes[key] = box.dtype
self.keys.append(key)

self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool)
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
self.buf_infos = [{} for _ in range(self.num_envs)]
@@ -19,33 +33,35 @@ class DummyVecEnv(VecEnv):
self.actions = actions

def step_wait(self):
for i in range(self.num_envs):
obs_tuple, self.buf_rews[i], self.buf_dones[i], self.buf_infos[i] = self.envs[i].step(self.actions[i])
if self.buf_dones[i]:
obs_tuple = self.envs[i].reset()
if isinstance(obs_tuple, (tuple, list)):
for t,x in enumerate(obs_tuple):
self.buf_obs[t][i] = x
else:
self.buf_obs[0][i] = obs_tuple
for e in range(self.num_envs):
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(self.actions[e])
if self.buf_dones[e]:
obs = self.envs[e].reset()
self._save_obs(e, obs)
return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones),
self.buf_infos.copy())

def reset(self):
for i in range(self.num_envs):
obs_tuple = self.envs[i].reset()
if isinstance(obs_tuple, (tuple, list)):
for t,x in enumerate(obs_tuple):
self.buf_obs[t][i] = x
else:
self.buf_obs[0][i] = obs_tuple
for e in range(self.num_envs):
obs = self.envs[e].reset()
self._save_obs(e, obs)
return self._obs_from_buf()

def close(self):
return

def render(self, mode='human'):
return [e.render(mode=mode) for e in self.envs]

def _save_obs(self, e, obs):
for k in self.keys:
if k is None:
self.buf_obs[k][e] = obs
else:
self.buf_obs[k][e] = obs[k]

def _obs_from_buf(self):
if len(self.buf_obs) == 1:
return np.copy(self.buf_obs[0])
if self.keys==[None]:
return self.buf_obs[None]
else:
return tuple(np.copy(x) for x in self.buf_obs)
return self.buf_obs
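What the rewrite enables, as a sketch (this assumes a gym installation with the robotics environments available): DummyVecEnv can now buffer dict observations such as the goal-based Fetch tasks used by HER, keyed by sub-space instead of requiring a single flat array.

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = DummyVecEnv([lambda: gym.make('FetchReach-v1')])   # Dict observation space
obs = venv.reset()
print(sorted(obs.keys()))   # ['achieved_goal', 'desired_goal', 'observation']
```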
@@ -1,6 +1,7 @@
import numpy as np
from multiprocessing import Process, Pipe
from baselines.common.vec_env import VecEnv, CloudpickleWrapper
from baselines.common.tile_images import tile_images


def worker(remote, parent_remote, env_fn_wrapper):
@@ -16,9 +17,8 @@ def worker(remote, parent_remote, env_fn_wrapper):
elif cmd == 'reset':
ob = env.reset()
remote.send(ob)
elif cmd == 'reset_task':
ob = env.reset_task()
remote.send(ob)
elif cmd == 'render':
remote.send(env.render(mode='rgb_array'))
elif cmd == 'close':
remote.close()
break
@@ -81,3 +81,17 @@ class SubprocVecEnv(VecEnv):
for p in self.ps:
p.join()
self.closed = True

def render(self, mode='human'):
for pipe in self.remotes:
pipe.send(('render', None))
imgs = [pipe.recv() for pipe in self.remotes]
bigimg = tile_images(imgs)
if mode == 'human':
import cv2
cv2.imshow('vecenv', bigimg[:,:,::-1])
cv2.waitKey(1)
elif mode == 'rgb_array':
return bigimg
else:
raise NotImplementedError
@@ -9,7 +9,7 @@ import baselines.common.tf_util as U
from baselines import logger
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer
from baselines.deepq.utils import BatchInput
from baselines.deepq.utils import ObservationInput
from baselines.common.schedules import LinearSchedule


@@ -28,7 +28,7 @@ if __name__ == '__main__':
env = gym.make("CartPole-v0")
# Create all the functions necessary to train the model
act, train, update_target, debug = deepq.build_train(
make_obs_ph=lambda name: BatchInput(env.observation_space.shape, name=name),
make_obs_ph=lambda name: ObservationInput(env.observation_space, name=name),
q_func=model,
num_actions=env.action_space.n,
optimizer=tf.train.AdamOptimizer(learning_rate=5e-4),
@@ -5,13 +5,18 @@ import argparse
from baselines import logger
from baselines.common.atari_wrappers import make_atari


def main():
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--prioritized', type=int, default=1)
parser.add_argument('--prioritized-replay-alpha', type=float, default=0.6)
parser.add_argument('--dueling', type=int, default=1)
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
parser.add_argument('--checkpoint-freq', type=int, default=10000)
parser.add_argument('--checkpoint-path', type=str, default=None)

args = parser.parse_args()
logger.configure()
set_global_seeds(args.seed)
@@ -23,7 +28,8 @@ def main():
hiddens=[256],
dueling=bool(args.dueling),
)
act = deepq.learn(

deepq.learn(
env,
q_func=model,
lr=1e-4,
@@ -35,9 +41,12 @@ def main():
learning_starts=10000,
target_network_update_freq=1000,
gamma=0.99,
prioritized_replay=bool(args.prioritized)
prioritized_replay=bool(args.prioritized),
prioritized_replay_alpha=args.prioritized_replay_alpha,
checkpoint_freq=args.checkpoint_freq,
checkpoint_path=args.checkpoint_path,
)
# act.save("pong_model.pkl") XXX

env.close()
@@ -86,7 +86,7 @@ class PrioritizedReplayBuffer(ReplayBuffer):
ReplayBuffer.__init__
"""
super(PrioritizedReplayBuffer, self).__init__(size)
assert alpha > 0
assert alpha >= 0
self._alpha = alpha

it_capacity = 1
@@ -6,13 +6,15 @@ import zipfile
import cloudpickle
import numpy as np

import gym
import baselines.common.tf_util as U
from baselines.common.tf_util import load_state, save_state
from baselines import logger
from baselines.common.schedules import LinearSchedule
from baselines.common.input import observation_input

from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
from baselines.deepq.utils import BatchInput, load_state, save_state
from baselines.deepq.utils import ObservationInput


class ActWrapper(object):
@@ -88,6 +90,7 @@ def learn(env,
batch_size=32,
print_freq=100,
checkpoint_freq=10000,
checkpoint_path=None,
learning_starts=1000,
gamma=1.0,
target_network_update_freq=500,
@@ -170,9 +173,9 @@ def learn(env,

# capture the shape outside the closure so that the env object is not serialized
# by cloudpickle when serializing make_obs_ph
observation_space_shape = env.observation_space.shape

def make_obs_ph(name):
return BatchInput(observation_space_shape, name=name)
return ObservationInput(env.observation_space, name=name)

act, train, update_target, debug = deepq.build_train(
make_obs_ph=make_obs_ph,
@@ -216,9 +219,17 @@ def learn(env,
saved_mean_reward = None
obs = env.reset()
reset = True

with tempfile.TemporaryDirectory() as td:
model_saved = False
td = checkpoint_path or td

model_file = os.path.join(td, "model")
model_saved = False
if tf.train.latest_checkpoint(td) is not None:
load_state(model_file)
logger.log('Loaded model from {}'.format(model_file))
model_saved = True

for t in range(max_timesteps):
if callback is not None:
if callback(locals(), globals()):
baselines/deepq/test_identity.py (Normal file, 43 lines)
@@ -0,0 +1,43 @@
import tensorflow as tf
import random

from baselines import deepq
from baselines.common.identity_env import IdentityEnv


def test_identity():

    with tf.Graph().as_default():
        env = IdentityEnv(10)
        random.seed(0)

        tf.set_random_seed(0)

        param_noise = False
        model = deepq.models.mlp([32])
        act = deepq.learn(
            env,
            q_func=model,
            lr=1e-3,
            max_timesteps=10000,
            buffer_size=50000,
            exploration_fraction=0.1,
            exploration_final_eps=0.02,
            print_freq=10,
            param_noise=param_noise,
        )

        tf.set_random_seed(0)

        N_TRIALS = 1000
        sum_rew = 0
        obs = env.reset()
        for i in range(N_TRIALS):
            obs, rew, done, _ = env.step(act([obs]))
            sum_rew += rew

        assert sum_rew > 0.9 * N_TRIALS


if __name__ == '__main__':
    test_identity()
@@ -1,24 +1,12 @@
import os
from baselines.common.input import observation_input

import tensorflow as tf

# ================================================================
# Saving variables
# ================================================================

def load_state(fname):
saver = tf.train.Saver()
saver.restore(tf.get_default_session(), fname)

def save_state(fname):
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(tf.get_default_session(), fname)

# ================================================================
# Placeholders
# ================================================================


class TfInput(object):
def __init__(self, name="(unnamed)"):
"""Generalized Tensorflow placeholder. The main differences are:
@@ -50,20 +38,6 @@ class PlaceholderTfInput(TfInput):
def make_feed_dict(self, data):
return {self._placeholder: data}

class BatchInput(PlaceholderTfInput):
def __init__(self, shape, dtype=tf.float32, name=None):
"""Creates a placeholder for a batch of tensors of a given shape and dtype

Parameters
----------
shape: [int]
shape of a single elemenet of the batch
dtype: tf.dtype
number representation used for tensor contents
name: str
name of the underlying placeholder
"""
super().__init__(tf.placeholder(dtype, [None] + list(shape), name=name))

class Uint8Input(PlaceholderTfInput):
def __init__(self, shape, name=None):
@@ -85,4 +59,25 @@ class Uint8Input(PlaceholderTfInput):
self._output = tf.cast(super().get(), tf.float32) / 255.0

def get(self):
return self._output
return self._output


class ObservationInput(PlaceholderTfInput):
def __init__(self, observation_space, name=None):
"""Creates an input placeholder tailored to a specific observation space

Parameters
----------

observation_space:
observation space of the environment. Should be one of the gym.spaces types
name: str
tensorflow name of the underlying placeholder
"""
inpt, self.processed_inpt = observation_input(observation_space, name=name)
super().__init__(inpt)

def get(self):
return self.processed_inpt
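The replacement pattern used throughout this diff, sketched for a toy environment: build the placeholder from the environment's observation space rather than from a raw shape, so Discrete and uint8 spaces get the right dtype and preprocessing automatically.

```python
import gym
from baselines.deepq.utils import ObservationInput

env = gym.make("CartPole-v0")

def make_obs_ph(name):
    # ObservationInput infers shape, dtype and any scaling from the space itself
    return ObservationInput(env.observation_space, name=name)
```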
@@ -47,18 +47,12 @@ class Mujoco_Dset(object):
obs = traj_data['obs'][:traj_limitation]
acs = traj_data['acs'][:traj_limitation]

def flatten(x):
# x.shape = (E,), or (E, L, D)
_, size = x[0].shape
episode_length = [len(i) for i in x]
y = np.zeros((sum(episode_length), size))
start_idx = 0
for l, x_i in zip(episode_length, x):
y[start_idx:(start_idx+l)] = x_i
start_idx += l
return y
self.obs = np.array(flatten(obs))
self.acs = np.array(flatten(acs))
# obs, acs: shape (N, L, ) + S where N = # episodes, L = episode length
# and S is the environment observation/action space.
# Flatten to (N * L, prod(S))
self.obs = np.reshape(obs, [-1, np.prod(obs.shape[2:])])
self.acs = np.reshape(acs, [-1, np.prod(acs.shape[2:])])

self.rets = traj_data['ep_rets'][:traj_limitation]
self.avg_ret = sum(self.rets)/len(self.rets)
self.std_ret = np.std(np.array(self.rets))
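A tiny numpy illustration (shapes made up) of the simplification above: when every episode has the same length L, the old pad-and-copy `flatten` helper reduces to a single reshape.

```python
import numpy as np

obs = np.random.randn(3, 50, 11)                       # N=3 episodes, L=50 steps, 11 obs dims
flat = np.reshape(obs, [-1, np.prod(obs.shape[2:])])   # -> shape (150, 11)
print(flat.shape)
```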
@@ -1,5 +1,5 @@
# Hindsight Experience Replay
For details on Hindsight Experience Replay (HER), please read the [paper](https://arxiv.org/pdf/1707.01495.pdf).
For details on Hindsight Experience Replay (HER), please read the [paper](https://arxiv.org/abs/1707.01495).

## How to use Hindsight Experience Replay

@@ -22,14 +22,11 @@ You can try it right now with the results of the training step (the script print
This should visualize the current policy for 10 episodes and will also print statistics.


### Advanced usage
The train script comes with advanced features like MPI support, that allows to scale across all cores of a single machine.
To see all available options, simply run this command:
### Reproducing results
In order to reproduce the results from [Plappert et al. (2018)](https://arxiv.org/abs/1802.09464), run the following command:
```bash
python -m baselines.her.experiment.train --help
python -m baselines.her.experiment.train --num_cpu 19
```
To run on, say, 20 CPU cores, you can use the following command:
```bash
python -m baselines.her.experiment.train --num_cpu 20
```
That's it, you are now running rollouts using 20 MPI workers and average gradients for network updates across all 20 core.
This will require a machine with sufficient amount of physical CPU cores. In our experiments,
we used [Azure's D15v2 instances](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes),
which have 20 physical cores. We only scheduled the experiment on 19 of those to leave some head-room on the system.
@@ -1,7 +1,4 @@
from copy import deepcopy
import numpy as np
import json
import os
import gym

from baselines import logger
@@ -10,7 +7,7 @@ from baselines.her.her import make_sample_her_transitions


DEFAULT_ENV_PARAMS = {
'FetchReach-v0': {
'FetchReach-v1': {
'n_cycles': 10,
},
}
@@ -51,6 +48,8 @@ DEFAULT_PARAMS = {


CACHED_ENVS = {}


def cached_make_env(make_env):
"""
Only creates a new environment from the provided function if one has not yet already been
@@ -68,6 +67,7 @@ def prepare_params(kwargs):
ddpg_params = dict()

env_name = kwargs['env_name']

def make_env():
return gym.make(env_name)
kwargs['make_env'] = make_env
@@ -75,7 +75,7 @@ def prepare_params(kwargs):
assert hasattr(tmp_env, '_max_episode_steps')
kwargs['T'] = tmp_env._max_episode_steps
tmp_env.reset()
kwargs['max_u'] = np.array(kwargs['max_u']) if type(kwargs['max_u']) == list else kwargs['max_u']
kwargs['max_u'] = np.array(kwargs['max_u']) if isinstance(kwargs['max_u'], list) else kwargs['max_u']
kwargs['gamma'] = 1. - 1. / kwargs['T']
if 'lr' in kwargs:
kwargs['pi_lr'] = kwargs['lr']
@@ -83,7 +83,7 @@ def prepare_params(kwargs):
del kwargs['lr']
for name in ['buffer_size', 'hidden', 'layers',
'network_class',
'polyak',
'polyak',
'batch_size', 'Q_lr', 'pi_lr',
'norm_eps', 'norm_clip', 'max_u',
'action_l2', 'clip_obs', 'scope', 'relative_goals']:
@@ -103,6 +103,7 @@ def log_params(params, logger=logger):
def configure_her(params):
env = cached_make_env(params['make_env'])
env.reset()

def reward_fun(ag_2, g, info):  # vectorized
return env.compute_reward(achieved_goal=ag_2, desired_goal=g, info=info)
@@ -13,6 +13,8 @@ import baselines.her.experiment.config as config
from baselines.her.rollout import RolloutWorker
from baselines.her.util import mpi_fork

from subprocess import CalledProcessError


def mpi_average(value):
if value == []:
@@ -81,12 +83,17 @@ def train(policy, rollout_worker, evaluator,


def launch(
env_name, logdir, n_epochs, num_cpu, seed, replay_strategy, policy_save_interval, clip_return,
env, logdir, n_epochs, num_cpu, seed, replay_strategy, policy_save_interval, clip_return,
override_params={}, save_policies=True
):
# Fork for multi-CPU MPI implementation.
if num_cpu > 1:
whoami = mpi_fork(num_cpu)
try:
whoami = mpi_fork(num_cpu, ['--bind-to', 'core'])
except CalledProcessError:
# fancy version of mpi call failed, try simple version
whoami = mpi_fork(num_cpu)

if whoami == 'parent':
sys.exit(0)
import baselines.common.tf_util as U
@@ -109,10 +116,10 @@ def launch(

# Prepare params.
params = config.DEFAULT_PARAMS
params['env_name'] = env_name
params['env_name'] = env
params['replay_strategy'] = replay_strategy
if env_name in config.DEFAULT_ENV_PARAMS:
params.update(config.DEFAULT_ENV_PARAMS[env_name])  # merge env-specific parameters in
if env in config.DEFAULT_ENV_PARAMS:
params.update(config.DEFAULT_ENV_PARAMS[env])  # merge env-specific parameters in
params.update(**override_params)  # makes it possible to override any parameter
with open(os.path.join(logger.get_dir(), 'params.json'), 'w') as f:
json.dump(params, f)
@@ -126,7 +133,7 @@ def launch(
'You are running HER with just a single MPI worker. This will work, but the ' +
'experiments that we report in Plappert et al. (2018, https://arxiv.org/abs/1802.09464) ' +
'were obtained with --num_cpu 19. This makes a significant difference and if you ' +
'are looking to reproduce those results, be aware of this. Please also refer to ' +
'are looking to reproduce those results, be aware of this. Please also refer to ' +
'https://github.com/openai/baselines/issues/314 for further details.')
logger.warn('****************')
logger.warn()
@@ -168,7 +175,7 @@ def launch(


@click.command()
@click.option('--env_name', type=str, default='FetchReach-v0', help='the name of the OpenAI Gym environment that you want to train on')
@click.option('--env', type=str, default='FetchReach-v1', help='the name of the OpenAI Gym environment that you want to train on')
@click.option('--logdir', type=str, default=None, help='the path to where logs and policy pickles should go. If not specified, creates a folder in /tmp/')
@click.option('--n_epochs', type=int, default=50, help='the number of training epochs to run')
@click.option('--num_cpu', type=int, default=1, help='the number of CPU cores to use (using MPI)')
@@ -58,12 +58,12 @@ def nn(input, layers_sizes, reuse=None, flatten=False, name=""):
"""Creates a simple neural network
"""
for i, size in enumerate(layers_sizes):
activation = tf.nn.relu if i < len(layers_sizes)-1 else None
activation = tf.nn.relu if i < len(layers_sizes) - 1 else None
input = tf.layers.dense(inputs=input,
units=size,
kernel_initializer=tf.contrib.layers.xavier_initializer(),
reuse=reuse,
name=name+'_'+str(i))
name=name + '_' + str(i))
if activation:
input = activation(input)
if flatten:
@@ -85,7 +85,7 @@ def install_mpi_excepthook():
sys.excepthook = new_hook


def mpi_fork(n):
def mpi_fork(n, extra_mpi_args=[]):
"""Re-launches the current script with workers
Returns "parent" for original parent, "child" for MPI children
"""
@@ -99,14 +99,10 @@ def mpi_fork(n):
IN_MPI="1"
)
# "-bind-to core" is crucial for good performance
args = [
"mpirun",
"-np",
str(n),
"-bind-to",
"core",
sys.executable
]
args = ["mpirun", "-np", str(n)] + \
extra_mpi_args + \
[sys.executable]

args += sys.argv
subprocess.check_call(args, env=env)
return "parent"
@@ -140,5 +136,5 @@ def reshape_for_broadcasting(source, target):
before broadcasting it with MPI.
"""
dim = len(target.get_shape())
shape = ([1] * (dim-1)) + [-1]
shape = ([1] * (dim - 1)) + [-1]
return tf.reshape(tf.cast(source, target.dtype), shape)
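How the new `extra_mpi_args` hook is meant to be used, sketched after the pattern in train.py above: try the performance-critical `--bind-to core` flags first and fall back to a plain `mpirun` if the local MPI build rejects them.

```python
from subprocess import CalledProcessError
from baselines.her.util import mpi_fork

try:
    whoami = mpi_fork(4, ['--bind-to', 'core'])   # 4 workers is an arbitrary example
except CalledProcessError:
    whoami = mpi_fork(4)                          # fall back to the plain invocation
```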
@@ -8,9 +8,6 @@ import datetime
|
||||
import tempfile
|
||||
from collections import defaultdict
|
||||
|
||||
LOG_OUTPUT_FORMATS = ['stdout', 'log', 'csv']
|
||||
# Also valid: json, tensorboard
|
||||
|
||||
DEBUG = 10
|
||||
INFO = 20
|
||||
WARN = 30
|
||||
@@ -74,8 +71,11 @@ class HumanOutputFormat(KVWriter, SeqWriter):
|
||||
return s[:20] + '...' if len(s) > 23 else s
|
||||
|
||||
def writeseq(self, seq):
|
||||
for arg in seq:
|
||||
self.file.write(arg)
|
||||
seq = list(seq)
|
||||
for (i, elem) in enumerate(seq):
|
||||
self.file.write(elem)
|
||||
if i < len(seq) - 1: # add space unless this is the last one
|
||||
self.file.write(' ')
|
||||
self.file.write('\n')
|
||||
self.file.flush()
|
||||
|
||||
@@ -355,10 +355,19 @@ def configure(dir=None, format_strs=None):
    assert isinstance(dir, str)
    os.makedirs(dir, exist_ok=True)

    log_suffix = ''
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
    if rank > 0:
        log_suffix = "-rank%03i" % rank

    if format_strs is None:
        strs = os.getenv('OPENAI_LOG_FORMAT')
        format_strs = strs.split(',') if strs else LOG_OUTPUT_FORMATS
    output_formats = [make_output_format(f, dir) for f in format_strs]
        if rank == 0:
            format_strs = os.getenv('OPENAI_LOG_FORMAT', 'stdout,log,csv').split(',')
        else:
            format_strs = os.getenv('OPENAI_LOG_FORMAT_MPI', 'log').split(',')
    format_strs = filter(None, format_strs)
    output_formats = [make_output_format(f, dir, log_suffix) for f in format_strs]

    Logger.CURRENT = Logger(dir=dir, output_formats=output_formats)
    log('Logging to %s'%dir)
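A hedged sketch of how the reworked `configure()` is driven from the environment under MPI; the directory and format strings here are examples, and `mpi4py` must be importable because the new code imports it unconditionally:

```python
# Hedged sketch: rank 0 reads OPENAI_LOG_FORMAT (default 'stdout,log,csv');
# every other rank reads OPENAI_LOG_FORMAT_MPI (default 'log') and gets a
# '-rank%03i' suffix on its output files.
import os
from baselines import logger

os.environ['OPENAI_LOG_FORMAT'] = 'stdout,csv'   # used only by rank 0
os.environ['OPENAI_LOG_FORMAT_MPI'] = 'log'      # used by ranks > 0
logger.configure(dir='/tmp/ppo_logs')            # example directory
logger.log('logging configured for this rank')
```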
@@ -5,3 +5,5 @@
- `mpirun -np 8 python -m baselines.ppo1.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.ppo1.run_mujoco` runs the algorithm for 1M frames on a Mujoco environment.

- Train mujoco 3d humanoid (with optimal-ish hyperparameters): `mpirun -np 16 python -m baselines.ppo1.run_humanoid --model-path=/path/to/model`
- Render the 3d humanoid: `python -m baselines.ppo1.run_humanoid --play --model-path=/path/to/model`

@@ -212,5 +212,7 @@ def learn(env, policy_fn, *,
        if MPI.COMM_WORLD.Get_rank()==0:
            logger.dump_tabular()

    return pi

def flatten_lists(listoflists):
    return [el for list_ in listoflists for el in list_]
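A one-line, runnable illustration of the `flatten_lists()` helper above, which (presumably) merges per-worker lists gathered over MPI into one flat list:

```python
# Hedged illustration of flatten_lists(): a list of lists becomes one flat list.
def flatten_lists(listoflists):
    return [el for list_ in listoflists for el in list_]

print(flatten_lists([[1, 2], [3], [4, 5]]))  # [1, 2, 3, 4, 5]
```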
75
baselines/ppo1/run_humanoid.py
Normal file
@@ -0,0 +1,75 @@
#!/usr/bin/env python3
import os
from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
from baselines.common import tf_util as U
from baselines import logger

import gym

def train(num_timesteps, seed, model_path=None):
    env_id = 'Humanoid-v2'
    from baselines.ppo1 import mlp_policy, pposgd_simple
    U.make_session(num_cpu=1).__enter__()
    def policy_fn(name, ob_space, ac_space):
        return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
            hid_size=64, num_hid_layers=2)
    env = make_mujoco_env(env_id, seed)

    # parameters below were the best found in a simple random search
    # these are good enough to make humanoid walk, but whether those are
    # an absolute best or not is not certain
    env = RewScale(env, 0.1)
    pi = pposgd_simple.learn(env, policy_fn,
            max_timesteps=num_timesteps,
            timesteps_per_actorbatch=2048,
            clip_param=0.2, entcoeff=0.0,
            optim_epochs=10,
            optim_stepsize=3e-4,
            optim_batchsize=64,
            gamma=0.99,
            lam=0.95,
            schedule='linear',
        )
    env.close()
    if model_path:
        U.save_state(model_path)

    return pi

class RewScale(gym.RewardWrapper):
    def __init__(self, env, scale):
        gym.RewardWrapper.__init__(self, env)
        self.scale = scale
    def reward(self, r):
        return r * self.scale

def main():
    logger.configure()
    parser = mujoco_arg_parser()
    parser.add_argument('--model-path', default=os.path.join(logger.get_dir(), 'humanoid_policy'))
    parser.set_defaults(num_timesteps=int(2e7))

    args = parser.parse_args()

    if not args.play:
        # train the model
        train(num_timesteps=args.num_timesteps, seed=args.seed, model_path=args.model_path)
    else:
        # construct the model object, load pre-trained model and render
        pi = train(num_timesteps=1, seed=args.seed)
        U.load_state(args.model_path)
        env = make_mujoco_env('Humanoid-v2', seed=0)

        ob = env.reset()
        while True:
            action = pi.act(stochastic=False, ob=ob)[0]
            ob, _, done, _ = env.step(action)
            env.render()
            if done:
                ob = env.reset()


if __name__ == '__main__':
    main()
40
baselines/ppo1/run_robotics.py
Normal file
@@ -0,0 +1,40 @@
#!/usr/bin/env python3

from mpi4py import MPI
from baselines.common import set_global_seeds
from baselines import logger
from baselines.common.cmd_util import make_robotics_env, robotics_arg_parser
import mujoco_py


def train(env_id, num_timesteps, seed):
    from baselines.ppo1 import mlp_policy, pposgd_simple
    import baselines.common.tf_util as U
    rank = MPI.COMM_WORLD.Get_rank()
    sess = U.single_threaded_session()
    sess.__enter__()
    mujoco_py.ignore_mujoco_warnings().__enter__()
    workerseed = seed + 10000 * rank
    set_global_seeds(workerseed)
    env = make_robotics_env(env_id, workerseed, rank=rank)
    def policy_fn(name, ob_space, ac_space):
        return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
            hid_size=256, num_hid_layers=3)

    pposgd_simple.learn(env, policy_fn,
        max_timesteps=num_timesteps,
        timesteps_per_actorbatch=2048,
        clip_param=0.2, entcoeff=0.0,
        optim_epochs=5, optim_stepsize=3e-4, optim_batchsize=256,
        gamma=0.99, lam=0.95, schedule='linear',
        )
    env.close()


def main():
    args = robotics_arg_parser().parse_args()
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)


if __name__ == '__main__':
    main()
@@ -2,39 +2,36 @@ import numpy as np
import tensorflow as tf
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
from baselines.common.distributions import make_pdtype
from baselines.common.input import observation_input

def nature_cnn(unscaled_images):
def nature_cnn(unscaled_images, **conv_kwargs):
    """
    CNN from Nature paper.
    """
    scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
    activ = tf.nn.relu
    h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2)))
    h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2)))
    h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2)))
    h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
                   **conv_kwargs))
    h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
    h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
    h3 = conv_to_fc(h3)
    return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))

class LnLstmPolicy(object):
    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
        nenv = nbatch // nsteps
        nh, nw, nc = ob_space.shape
        ob_shape = (nbatch, nh, nw, nc)
        nact = ac_space.n
        X = tf.placeholder(tf.uint8, ob_shape) #obs
        X, processed_x = observation_input(ob_space, nbatch)
        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
        self.pdtype = make_pdtype(ac_space)
        with tf.variable_scope("model", reuse=reuse):
            h = nature_cnn(X)
            h = nature_cnn(processed_x)
            xs = batch_to_seq(h, nenv, nsteps)
            ms = batch_to_seq(M, nenv, nsteps)
            h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
            h5 = seq_to_batch(h5)
            pi = fc(h5, 'pi', nact)
            vf = fc(h5, 'v', 1)

            self.pdtype = make_pdtype(ac_space)
            self.pd = self.pdtype.pdfromflat(pi)
            self.pd, self.pi = self.pdtype.pdfromlatent(h5)

        v0 = vf[:, 0]
        a0 = self.pd.sample()
@@ -50,7 +47,6 @@ class LnLstmPolicy(object):
        self.X = X
        self.M = M
        self.S = S
        self.pi = pi
        self.vf = vf
        self.step = step
        self.value = value
@@ -59,11 +55,9 @@ class LstmPolicy(object):

    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
        nenv = nbatch // nsteps
        self.pdtype = make_pdtype(ac_space)
        X, processed_x = observation_input(ob_space, nbatch)

        nh, nw, nc = ob_space.shape
        ob_shape = (nbatch, nh, nw, nc)
        nact = ac_space.n
        X = tf.placeholder(tf.uint8, ob_shape) #obs
        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
        with tf.variable_scope("model", reuse=reuse):
@@ -72,11 +66,8 @@ class LstmPolicy(object):
            ms = batch_to_seq(M, nenv, nsteps)
            h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
            h5 = seq_to_batch(h5)
            pi = fc(h5, 'pi', nact)
            vf = fc(h5, 'v', 1)

            self.pdtype = make_pdtype(ac_space)
            self.pd = self.pdtype.pdfromflat(pi)
            self.pd, self.pi = self.pdtype.pdfromlatent(h5)

        v0 = vf[:, 0]
        a0 = self.pd.sample()
@@ -92,25 +83,19 @@ class LstmPolicy(object):
        self.X = X
        self.M = M
        self.S = S
        self.pi = pi
        self.vf = vf
        self.step = step
        self.value = value

class CnnPolicy(object):

    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
        nh, nw, nc = ob_space.shape
        ob_shape = (nbatch, nh, nw, nc)
        nact = ac_space.n
        X = tf.placeholder(tf.uint8, ob_shape) #obs
        with tf.variable_scope("model", reuse=reuse):
            h = nature_cnn(X)
            pi = fc(h, 'pi', nact, init_scale=0.01)
            vf = fc(h, 'v', 1)[:,0]

    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False, **conv_kwargs): #pylint: disable=W0613
        self.pdtype = make_pdtype(ac_space)
        self.pd = self.pdtype.pdfromflat(pi)
        X, processed_x = observation_input(ob_space, nbatch)
        with tf.variable_scope("model", reuse=reuse):
            h = nature_cnn(processed_x, **conv_kwargs)
            vf = fc(h, 'v', 1)[:,0]
            self.pd, self.pi = self.pdtype.pdfromlatent(h, init_scale=0.01)

        a0 = self.pd.sample()
        neglogp0 = self.pd.neglogp(a0)
@@ -124,31 +109,25 @@ class CnnPolicy(object):
            return sess.run(vf, {X:ob})

        self.X = X
        self.pi = pi
        self.vf = vf
        self.step = step
        self.value = value

class MlpPolicy(object):
    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
        ob_shape = (nbatch,) + ob_space.shape
        actdim = ac_space.shape[0]
        X = tf.placeholder(tf.float32, ob_shape, name='Ob') #obs
        with tf.variable_scope("model", reuse=reuse):
            activ = tf.tanh
            h1 = activ(fc(X, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
            h2 = activ(fc(h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
            pi = fc(h2, 'pi', actdim, init_scale=0.01)
            h1 = activ(fc(X, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
            h2 = activ(fc(h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
            vf = fc(h2, 'vf', 1)[:,0]
            logstd = tf.get_variable(name="logstd", shape=[1, actdim],
                initializer=tf.zeros_initializer())

        pdparam = tf.concat([pi, pi * 0.0 + logstd], axis=1)

        self.pdtype = make_pdtype(ac_space)
        self.pd = self.pdtype.pdfromflat(pdparam)
        with tf.variable_scope("model", reuse=reuse):
            X, processed_x = observation_input(ob_space, nbatch)
            activ = tf.tanh
            processed_x = tf.layers.flatten(processed_x)
            pi_h1 = activ(fc(processed_x, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
            pi_h2 = activ(fc(pi_h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
            vf_h1 = activ(fc(processed_x, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
            vf_h2 = activ(fc(vf_h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
            vf = fc(vf_h2, 'vf', 1)[:,0]

            self.pd, self.pi = self.pdtype.pdfromlatent(pi_h2, init_scale=0.01)

        a0 = self.pd.sample()
        neglogp0 = self.pd.neglogp(a0)
@@ -162,7 +141,6 @@ class MlpPolicy(object):
            return sess.run(vf, {X:ob})

        self.X = X
        self.pi = pi
        self.vf = vf
        self.step = step
        self.value = value
@@ -7,6 +7,7 @@ import tensorflow as tf
from baselines import logger
from collections import deque
from baselines.common import explained_variance
from baselines.common.runners import AbstractEnvRunner

class Model(object):
    def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
@@ -84,19 +85,12 @@ class Model(object):
        self.load = load
        tf.global_variables_initializer().run(session=sess) #pylint: disable=E1101

class Runner(object):
class Runner(AbstractEnvRunner):

    def __init__(self, *, env, model, nsteps, gamma, lam):
        self.env = env
        self.model = model
        nenv = env.num_envs
        self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=model.train_model.X.dtype.name)
        self.obs[:] = env.reset()
        self.gamma = gamma
        super().__init__(env=env, model=model, nsteps=nsteps)
        self.lam = lam
        self.nsteps = nsteps
        self.states = model.initial_state
        self.dones = [False for _ in range(nenv)]
        self.gamma = gamma

    def run(self):
        mb_obs, mb_rewards, mb_actions, mb_values, mb_dones, mb_neglogpacs = [],[],[],[],[],[]
@@ -154,7 +148,7 @@ def constfn(val):
def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
            vf_coef=0.5, max_grad_norm=0.5, gamma=0.99, lam=0.95,
            log_interval=10, nminibatches=4, noptepochs=4, cliprange=0.2,
            save_interval=0):
            save_interval=0, load_path=None):

    if isinstance(lr, float): lr = constfn(lr)
    else: assert callable(lr)
@@ -176,6 +170,8 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
        with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
            fh.write(cloudpickle.dumps(make_model))
    model = make_model()
    if load_path is not None:
        model.load(load_path)
    runner = Runner(env=env, model=model, nsteps=nsteps, gamma=gamma, lam=lam)

    epinfobuf = deque(maxlen=100)
@@ -240,6 +236,7 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
            print('Saving to', savepath)
            model.save(savepath)
    env.close()
    return model

def safemean(xs):
    return np.nan if len(xs) == 0 else np.mean(xs)
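For orientation, a hedged sketch of what the new `load_path` argument and `return model` make possible: reloading a checkpoint and continuing training. The environment, hyperparameters and checkpoint handling below are illustrative and not from this diff, and a default TF session must be entered first, as the run scripts in this branch do:

```python
# Hedged sketch: resume PPO2 training from a checkpoint written by model.save().
# Assumes gym, TensorFlow 1.x and this branch of baselines; values are examples.
import gym
import tensorflow as tf
from baselines import bench, logger
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2 import ppo2
from baselines.ppo2.policies import MlpPolicy

logger.configure()
tf.Session().__enter__()  # ppo2.Model uses the default session
env = DummyVecEnv([lambda: bench.Monitor(gym.make('Pendulum-v0'),
                                         logger.get_dir(), allow_early_resets=True)])
model = ppo2.learn(policy=MlpPolicy, env=env, nsteps=128, nminibatches=4,
                   lam=0.95, gamma=0.99, noptepochs=4, ent_coef=0.0,
                   lr=3e-4, cliprange=0.2, total_timesteps=10000,
                   load_path=None)  # set to a previous checkpoint path to resume
```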
@@ -4,7 +4,7 @@ from baselines import logger
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.ppo2 import ppo2
from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy, MlpPolicy
import multiprocessing
import tensorflow as tf

@@ -20,7 +20,7 @@ def train(env_id, num_timesteps, seed, policy):
    tf.Session(config=config).__enter__()

    env = VecFrameStack(make_atari_env(env_id, 8, seed), 4)
    policy = {'cnn' : CnnPolicy, 'lstm' : LstmPolicy, 'lnlstm' : LnLstmPolicy}[policy]
    policy = {'cnn' : CnnPolicy, 'lstm' : LstmPolicy, 'lnlstm' : LnLstmPolicy, 'mlp': MlpPolicy}[policy]
    ppo2.learn(policy=policy, env=env, nsteps=128, nminibatches=4,
        lam=0.95, gamma=0.99, noptepochs=4, log_interval=1,
        ent_coef=.01,
@@ -30,7 +30,7 @@ def train(env_id, num_timesteps, seed, policy):

def main():
    parser = atari_arg_parser()
    parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
    parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm', 'mlp'], default='cnn')
    args = parser.parse_args()
    logger.configure()
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
@@ -1,8 +1,9 @@
#!/usr/bin/env python3
import argparse
import numpy as np
from baselines.common.cmd_util import mujoco_arg_parser
from baselines import bench, logger


def train(env_id, num_timesteps, seed):
    from baselines.common import set_global_seeds
    from baselines.common.vec_env.vec_normalize import VecNormalize
@@ -16,27 +17,40 @@ def train(env_id, num_timesteps, seed):
        intra_op_parallelism_threads=ncpu,
        inter_op_parallelism_threads=ncpu)
    tf.Session(config=config).__enter__()

    def make_env():
        env = gym.make(env_id)
        env = bench.Monitor(env, logger.get_dir())
        env = bench.Monitor(env, logger.get_dir(), allow_early_resets=True)
        return env

    env = DummyVecEnv([make_env])
    env = VecNormalize(env)

    set_global_seeds(seed)
    policy = MlpPolicy
    ppo2.learn(policy=policy, env=env, nsteps=2048, nminibatches=32,
        lam=0.95, gamma=0.99, noptepochs=10, log_interval=1,
        ent_coef=0.0,
        lr=3e-4,
        cliprange=0.2,
        total_timesteps=num_timesteps)
    model = ppo2.learn(policy=policy, env=env, nsteps=2048, nminibatches=32,
        lam=0.95, gamma=0.99, noptepochs=10, log_interval=1,
        ent_coef=0.0,
        lr=3e-4,
        cliprange=0.2,
        total_timesteps=num_timesteps)

    return model, env


def main():
    args = mujoco_arg_parser().parse_args()
    logger.configure()
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
    model, env = train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)

    if args.play:
        logger.log("Running trained model")
        obs = np.zeros((env.num_envs,) + env.observation_space.shape)
        obs[:] = env.reset()
        while True:
            actions = model.step(obs)[0]
            obs[:] = env.step(actions)[0]
            env.render()


if __name__ == '__main__':