lstm network builders using tf lstm

merged refactor
GAIL: bugfix in dataset loading (#447 )
2018-08-10 14:21:30 -07:00 · 2018-08-10 14:14:46 -07:00 · 2018-07-06 16:12:14 -07:00 · 2018-06-08 09:41:45 -07:00 · 2018-06-06 11:39:13 -07:00 · 2018-05-21 15:24:00 -07:00
103 changed files with 5170 additions and 1125 deletions
--- a/.benchmark_pattern
+++ b/.benchmark_pattern
@@ -0,0 +1 @@
+ppo2
--- a/.gitignore
+++ b/.gitignore
@@ -34,5 +34,3 @@ src
 .cache

 MUJOCO_LOG.TXT
-
-
--- a/.travis.yml
+++ b/.travis.yml
@@ -0,0 +1,14 @@
+language: python
+python:
+    - "3.6"
+
+services:
+    - docker
+
+install:
+    - pip install flake8
+    - docker build . -t baselines-test
+
+script:
+    - flake8 --select=F,E999 baselines/common baselines/trpo_mpi baselines/ppo2 baselines/a2c baselines/deepq baselines/acer
+    - docker run baselines-test pytest --runslow
--- a/Dockerfile
+++ b/Dockerfile
@@ -0,0 +1,24 @@
+FROM ubuntu:16.04
+
+RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake python-opencv
+ENV CODE_DIR /root/code
+ENV VENV /root/venv
+
+RUN \
+    pip install virtualenv && \
+    virtualenv $VENV --python=python3 && \
+    . $VENV/bin/activate && \
+    pip install --upgrade pip
+
+ENV PATH=$VENV/bin:$PATH
+
+COPY . $CODE_DIR/baselines
+WORKDIR $CODE_DIR/baselines
+
+# Clean up pycache and pyc files
+RUN rm -rf __pycache__ && \
+    find . -name "*.pyc" -delete && \
+    pip install -e .[test]
+
+
+CMD /bin/bash
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-<img src="data/logo.jpg" width=25% align="right" />
+<img src="data/logo.jpg" width=25% align="right" /> [![Build status](https://travis-ci.org/openai/baselines.svg?branch=master)](https://travis-ci.org/openai/baselines)

 # Baselines

@@ -6,13 +6,117 @@ OpenAI Baselines is a set of high-quality implementations of reinforcement learn

 These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones. 

-You can install it by typing:
+## Prerequisites 
+Baselines requires python3 (>=3.5) with the development headers. You'll also need the system packages CMake, OpenMPI and zlib, which can be installed as follows:
+### Ubuntu 
    
+```bash
+sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
+```
+    
+### Mac OS X
+Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
+```bash
+brew install cmake openmpi
+```
+    
+## Virtual environment
+From the general python package sanity perspective, it is a good idea to use virtual environments (virtualenvs) to make sure packages from different projects do not interfere with each other. You can install virtualenv (which is itself a pip package) via
+```bash
+pip install virtualenv
+```
+Virtualenvs are essentially folders that contain copies of the python executable and all python packages.
+To create a virtualenv called venv with python3, run:
+```bash
+virtualenv /path/to/venv --python=python3
+```
+To activate a virtualenv: 
+```
+. /path/to/venv/bin/activate
+```
+A more thorough tutorial on virtualenvs and their options can be found [here](https://virtualenv.pypa.io/en/stable/).
+
+
+## Installation
+Clone the repo and cd into it:
 ```bash
 git clone https://github.com/openai/baselines.git
 cd baselines
+```
+If using virtualenv, create a new virtualenv and activate it
+```bash
+    virtualenv env --python=python3
+    . env/bin/activate
+```
+Install baselines package
+```bash
 pip install -e .
 ```
+### MuJoCo
+Some of the baselines examples use the [MuJoCo](http://www.mujoco.org) (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (a temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py).
+
+## Testing the installation
+All unit tests in baselines can be run using pytest runner:
+```
+pip install pytest
+pytest
+```
+
+## Subpackages
+
+## Testing the installation
+All unit tests in baselines can be run using pytest runner:
+```
+pip install pytest
+pytest
+```
+
+## Training models
+Most of the algorithms in baselines repo are used as follows:
+```bash
+    python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
+```
+### Example 1. PPO with MuJoCo Humanoid
+For instance, to train a fully-connected network controlling MuJoCo humanoid using a2c for 20M timesteps
+```bash
+    python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
+```
+Note that for MuJoCo environments a fully-connected network is the default, so we can omit `--network=mlp`.
+The hyperparameters for both the network and the learning algorithm can be controlled via the command line, for instance:
+```bash
+    python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
+```
+will set the entropy coefficient to 0.1, construct a fully connected network with 3 layers of 32 hidden units each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same).
+
+See the docstrings in [common/models.py](common/models.py) for a description of the network parameters for each type of model, and the
+docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) for a description of the ppo2 hyperparameters.
+
+### Example 2. DQN on Atari 
+DQN on Atari is at this point a classic benchmark. To run the baselines implementation of DQN on Atari Pong:
+```
+    python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
+```
+
+## Saving, loading and visualizing models
+The algorithms' serialization API is not properly unified yet; however, there is a simple method to save / restore trained models.
+The `--save_path` and `--load_path` command-line options save the tensorflow state to a given path after training and load it from a given path before training, respectively.
+Let's imagine you'd like to train ppo2 on Atari Pong, save the model and then later visualize what it has learnt.
+```bash
+    python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
+```
+This should reach a mean reward per episode of about 5k. To load and visualize the model, we'll do the following: load the model, train it for 0 steps, and then visualize:
+```bash
+    python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
+```
+
+*NOTE:* At the moment MuJoCo training uses the VecNormalize wrapper for the environment, which is not saved correctly, so loading models trained on MuJoCo will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd with TfRunningMeanStd in [baselines/common/vec_env/vec_normalize.py](baselines/common/vec_env/vec_normalize.py#L12). This way, the mean and std of the environment normalizing wrapper will be saved in tensorflow variables and included in the model file; however, training is slower that way, which is why it is not the default.
+
+
+
+
+
+
+## Subpackages

 - [A2C](baselines/a2c)
 - [ACER](baselines/acer)
@@ -20,6 +124,7 @@ pip install -e .
 - [DDPG](baselines/ddpg)
 - [DQN](baselines/deepq)
 - [GAIL](baselines/gail)
+- [HER](baselines/her)
 - [PPO1](baselines/ppo1) (Multi-CPU using MPI)
 - [PPO2](baselines/ppo2) (Optimized for GPU)
 - [TRPO](baselines/trpo_mpi)
--- a/baselines/a2c/a2c.py
+++ b/baselines/a2c/a2c.py
@@ -1,47 +1,48 @@
-import os
-import os.path as osp
-import gym
 import time
-import joblib
-import logging
-import numpy as np
+import functools
 import tensorflow as tf
+
 from baselines import logger

 from baselines.common import set_global_seeds, explained_variance
-from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
-from baselines.common.atari_wrappers import wrap_deepmind
 from baselines.common import tf_util
+from baselines.common.policies import build_policy

-from baselines.a2c.utils import discount_with_dones
-from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
-from baselines.a2c.utils import cat_entropy, mse
+
+from baselines.a2c.utils import Scheduler, find_trainable_variables
+from baselines.a2c.runner import Runner
+
+from tensorflow import losses

 class Model(object):

-    def __init__(self, policy, ob_space, ac_space, nenvs, nsteps,
+    def __init__(self, policy, env, nsteps,
            ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5, lr=7e-4,
            alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):

-        sess = tf_util.make_session()
-        nact = ac_space.n
+        sess = tf_util.get_session()
+        nenvs = env.num_envs
        nbatch = nenvs*nsteps

-        A = tf.placeholder(tf.int32, [nbatch])
+
+        with tf.variable_scope('a2c_model', reuse=tf.AUTO_REUSE):
+            step_model = policy(nenvs, 1, sess)
+            train_model = policy(nbatch, nsteps, sess)
+
+        A = tf.placeholder(train_model.action.dtype, train_model.action.shape)
        ADV = tf.placeholder(tf.float32, [nbatch])
        R = tf.placeholder(tf.float32, [nbatch])
        LR = tf.placeholder(tf.float32, [])

-        step_model = policy(sess, ob_space, ac_space, nenvs, 1, reuse=False)
-        train_model = policy(sess, ob_space, ac_space, nenvs*nsteps, nsteps, reuse=True)
+        neglogpac = train_model.pd.neglogp(A)
+        entropy = tf.reduce_mean(train_model.pd.entropy())

-        neglogpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
        pg_loss = tf.reduce_mean(ADV * neglogpac)
-        vf_loss = tf.reduce_mean(mse(tf.squeeze(train_model.vf), R))
-        entropy = tf.reduce_mean(cat_entropy(train_model.pi))
+        vf_loss = losses.mean_squared_error(tf.squeeze(train_model.vf), R)
+
        loss = pg_loss - entropy*ent_coef + vf_loss * vf_coef

-        params = find_trainable_variables("model")
+        params = find_trainable_variables("a2c_model")
        grads = tf.gradients(loss, params)
        if max_grad_norm is not None:
            grads, grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
@@ -55,6 +56,7 @@ class Model(object):
            advs = rewards - values
            for step in range(len(obs)):
                cur_lr = lr.value()
+
            td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, LR:cur_lr}
            if states is not None:
                td_map[train_model.S] = states
@@ -65,17 +67,6 @@ class Model(object):
            )
            return policy_loss, value_loss, policy_entropy

-        def save(save_path):
-            ps = sess.run(params)
-            make_path(save_path)
-            joblib.dump(ps, save_path)
-
-        def load(load_path):
-            loaded_params = joblib.load(load_path)
-            restores = []
-            for p, loaded_p in zip(params, loaded_params):
-                restores.append(p.assign(loaded_p))
-            ps = sess.run(restores)

        self.train = train
        self.train_model = train_model
@@ -83,77 +74,87 @@ class Model(object):
        self.step = step_model.step
        self.value = step_model.value
        self.initial_state = step_model.initial_state
-        self.save = save
-        self.load = load
+        self.save = functools.partial(tf_util.save_variables, sess=sess)
+        self.load = functools.partial(tf_util.load_variables, sess=sess)
        tf.global_variables_initializer().run(session=sess)

-class Runner(object):

-    def __init__(self, env, model, nsteps=5, gamma=0.99):
-        self.env = env
-        self.model = model
-        nh, nw, nc = env.observation_space.shape
-        nenv = env.num_envs
-        self.batch_ob_shape = (nenv*nsteps, nh, nw, nc)
-        self.obs = np.zeros((nenv, nh, nw, nc), dtype=np.uint8)
-        self.nc = nc
-        obs = env.reset()
-        self.gamma = gamma
-        self.nsteps = nsteps
-        self.states = model.initial_state
-        self.dones = [False for _ in range(nenv)]
+def learn(
+    network,
+    env,
+    seed=None,
+    nsteps=5,
+    total_timesteps=int(80e6),
+    vf_coef=0.5,
+    ent_coef=0.01,
+    max_grad_norm=0.5,
+    lr=7e-4,
+    lrschedule='linear',
+    epsilon=1e-5,
+    alpha=0.99,
+    gamma=0.99,
+    log_interval=100,
+    load_path=None,
+    **network_kwargs):
+
+    ''' 
+    Main entrypoint for A2C algorithm. Train a policy with given network architecture on a given environment using a2c algorithm.
+
+    Parameters:
+    -----------
+
+    network:            policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
+                        specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns 
+                        tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
+                        neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
+                        See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
+                
+
+    env:                RL environment. Should implement interface similar to VecEnv (baselines.common/vec_env) or be wrapped with DummyVecEnv (baselines.common/vec_env/dummy_vec_env.py)
+                    
+
+    seed:               seed to make random number sequence in the algorithm reproducible. By default is None which means seed from system noise generator (not reproducible)
+
+    nsteps:             int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
+                        nenv is number of environment copies simulated in parallel)
+
+    total_timesteps:    int, total number of timesteps to train on (default: 80M)
+
+    vf_coef:            float, coefficient in front of value function loss in the total loss function (default: 0.5)
+
+    ent_coef:           float, coefficient in front of the policy entropy in the total loss function (default: 0.01)
+
+    max_grad_norm:      float, gradient is clipped to have global L2 norm no more than this value (default: 0.5)
+
+    lr:                 float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)
+
+    lrschedule:         schedule of learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes fraction of the training progress as input and 
+                        returns fraction of the learning rate (specified as lr) as output
+
+    epsilon:            float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
+
+    alpha:              float, RMSProp decay parameter (default: 0.99)
+
+    gamma:              float, reward discounting parameter (default: 0.99)
+
+    log_interval:       int, specifies how frequently the logs are printed out (default: 100)
+
+    **network_kwargs:   keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
+                        For instance, 'mlp' network architecture has arguments num_hidden and num_layers. 
+
+    '''
+    

-    def run(self):
-        mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
-        mb_states = self.states
-        for n in range(self.nsteps):
-            actions, values, states, _ = self.model.step(self.obs, self.states, self.dones)
-            mb_obs.append(np.copy(self.obs))
-            mb_actions.append(actions)
-            mb_values.append(values)
-            mb_dones.append(self.dones)
-            obs, rewards, dones, _ = self.env.step(actions)
-            self.states = states
-            self.dones = dones
-            for n, done in enumerate(dones):
-                if done:
-                    self.obs[n] = self.obs[n]*0
-            self.obs = obs
-            mb_rewards.append(rewards)
-        mb_dones.append(self.dones)
-        #batch of steps to batch of rollouts
-        mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0).reshape(self.batch_ob_shape)
-        mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
-        mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
-        mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
-        mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
-        mb_masks = mb_dones[:, :-1]
-        mb_dones = mb_dones[:, 1:]
-        last_values = self.model.value(self.obs, self.states, self.dones).tolist()
-        #discount/bootstrap off value fn
-        for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
-            rewards = rewards.tolist()
-            dones = dones.tolist()
-            if dones[-1] == 0:
-                rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
-            else:
-                rewards = discount_with_dones(rewards, dones, self.gamma)
-            mb_rewards[n] = rewards
-        mb_rewards = mb_rewards.flatten()
-        mb_actions = mb_actions.flatten()
-        mb_values = mb_values.flatten()
-        mb_masks = mb_masks.flatten()
-        return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values

-def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5, lr=7e-4, lrschedule='linear', epsilon=1e-5, alpha=0.99, gamma=0.99, log_interval=100):
-    tf.reset_default_graph()
    set_global_seeds(seed)

    nenvs = env.num_envs
-    ob_space = env.observation_space
-    ac_space = env.action_space
-    model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
+    policy = build_policy(env, network, **network_kwargs)
+   
+    model = Model(policy=policy, env=env, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
        max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule)
+    if load_path is not None:
+        model.load(load_path)
    runner = Runner(env, model, nsteps=nsteps, gamma=gamma)

    nbatch = nenvs*nsteps
@@ -173,3 +174,5 @@ def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, e
            logger.record_tabular("explained_variance", float(ev))
            logger.dump_tabular()
    env.close()
+    return model
+
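Note on `lrschedule`: the `learn()` docstring in this file describes the schedule as a function from training progress in [0, 1] to a fraction of `lr`. The plain-Python sketch below illustrates that contract only. The helper names (`constant`, `linear`, `current_lr`) are hypothetical and are not taken from `baselines.a2c.utils.Scheduler`; the assumption that a linear schedule decays the rate to zero by the end of training is also an illustration, not something this diff confirms.

```python
def constant(progress):
    """Hypothetical schedule: always use the full learning rate."""
    return 1.0


def linear(progress):
    """Hypothetical schedule: decay linearly from 1 to 0 over training (assumed)."""
    return 1.0 - progress


def current_lr(lr, schedule, step, total_steps):
    """Scale the base rate `lr` by the schedule's output at the current progress."""
    # Clamp progress to [0, 1] so that overshooting total_steps is harmless.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr * schedule(progress)


if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, current_lr(7e-4, linear, step, 100))
```

Any callable with the same shape (progress in, fraction out) could be passed in place of `linear`, which is the flexibility the docstring describes.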
--- a/baselines/a2c/policies.py
+++ b/baselines/a2c/policies.py
@@ -1,168 +0,0 @@
-import numpy as np
-import tensorflow as tf
-from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
-from baselines.common.distributions import make_pdtype
-
-def nature_cnn(unscaled_images):
-    """
-    CNN from Nature paper.
-    """
-    scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
-    activ = tf.nn.relu
-    h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2)))
-    h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2)))
-    h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2)))
-    h3 = conv_to_fc(h3)
-    return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
-
-class LnLstmPolicy(object):
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
-        nenv = nbatch // nsteps
-        nh, nw, nc = ob_space.shape
-        ob_shape = (nbatch, nh, nw, nc)
-        nact = ac_space.n
-        X = tf.placeholder(tf.uint8, ob_shape) #obs
-        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
-        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(X)
-            xs = batch_to_seq(h, nenv, nsteps)
-            ms = batch_to_seq(M, nenv, nsteps)
-            h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
-            h5 = seq_to_batch(h5)
-            pi = fc(h5, 'pi', nact)
-            vf = fc(h5, 'v', 1)
-
-        self.pdtype = make_pdtype(ac_space)
-        self.pd = self.pdtype.pdfromflat(pi)
-
-        v0 = vf[:, 0]
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
-
-        def step(ob, state, mask):
-            return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
-
-        def value(ob, state, mask):
-            return sess.run(v0, {X:ob, S:state, M:mask})
-
-        self.X = X
-        self.M = M
-        self.S = S
-        self.pi = pi
-        self.vf = vf
-        self.step = step
-        self.value = value
-
-class LstmPolicy(object):
-
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
-        nenv = nbatch // nsteps
-
-        nh, nw, nc = ob_space.shape
-        ob_shape = (nbatch, nh, nw, nc)
-        nact = ac_space.n
-        X = tf.placeholder(tf.uint8, ob_shape) #obs
-        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
-        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(X)
-            xs = batch_to_seq(h, nenv, nsteps)
-            ms = batch_to_seq(M, nenv, nsteps)
-            h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
-            h5 = seq_to_batch(h5)
-            pi = fc(h5, 'pi', nact)
-            vf = fc(h5, 'v', 1)
-
-        self.pdtype = make_pdtype(ac_space)
-        self.pd = self.pdtype.pdfromflat(pi)
-
-        v0 = vf[:, 0]
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
-
-        def step(ob, state, mask):
-            return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
-
-        def value(ob, state, mask):
-            return sess.run(v0, {X:ob, S:state, M:mask})
-
-        self.X = X
-        self.M = M
-        self.S = S
-        self.pi = pi
-        self.vf = vf
-        self.step = step
-        self.value = value
-
-class CnnPolicy(object):
-
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
-        nh, nw, nc = ob_space.shape
-        ob_shape = (nbatch, nh, nw, nc)
-        nact = ac_space.n
-        X = tf.placeholder(tf.uint8, ob_shape) #obs
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(X)
-            pi = fc(h, 'pi', nact, init_scale=0.01)
-            vf = fc(h, 'v', 1)[:,0]
-
-        self.pdtype = make_pdtype(ac_space)
-        self.pd = self.pdtype.pdfromflat(pi)
-
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = None
-
-        def step(ob, *_args, **_kwargs):
-            a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
-            return a, v, self.initial_state, neglogp
-
-        def value(ob, *_args, **_kwargs):
-            return sess.run(vf, {X:ob})
-
-        self.X = X
-        self.pi = pi
-        self.vf = vf
-        self.step = step
-        self.value = value
-
-class MlpPolicy(object):
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
-        ob_shape = (nbatch,) + ob_space.shape
-        actdim = ac_space.shape[0]
-        X = tf.placeholder(tf.float32, ob_shape, name='Ob') #obs
-        with tf.variable_scope("model", reuse=reuse):
-            activ = tf.tanh
-            h1 = activ(fc(X, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
-            h2 = activ(fc(h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
-            pi = fc(h2, 'pi', actdim, init_scale=0.01)
-            h1 = activ(fc(X, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
-            h2 = activ(fc(h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
-            vf = fc(h2, 'vf', 1)[:,0]
-            logstd = tf.get_variable(name="logstd", shape=[1, actdim],
-                initializer=tf.zeros_initializer())
-
-        pdparam = tf.concat([pi, pi * 0.0 + logstd], axis=1)
-
-        self.pdtype = make_pdtype(ac_space)
-        self.pd = self.pdtype.pdfromflat(pdparam)
-
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = None
-
-        def step(ob, *_args, **_kwargs):
-            a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
-            return a, v, self.initial_state, neglogp
-
-        def value(ob, *_args, **_kwargs):
-            return sess.run(vf, {X:ob})
-
-        self.X = X
-        self.pi = pi
-        self.vf = vf
-        self.step = step
-        self.value = value
--- a/baselines/a2c/run_atari.py
+++ b/baselines/a2c/run_atari.py
@@ -1,30 +0,0 @@
-#!/usr/bin/env python3
-
-from baselines import logger
-from baselines.common.cmd_util import make_atari_env, atari_arg_parser
-from baselines.common.vec_env.vec_frame_stack import VecFrameStack
-from baselines.a2c.a2c import learn
-from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
-
-def train(env_id, num_timesteps, seed, policy, lrschedule, num_env):
-    if policy == 'cnn':
-        policy_fn = CnnPolicy
-    elif policy == 'lstm':
-        policy_fn = LstmPolicy
-    elif policy == 'lnlstm':
-        policy_fn = LnLstmPolicy
-    env = VecFrameStack(make_atari_env(env_id, num_env, seed), 4)
-    learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
-    env.close()
-
-def main():
-    parser = atari_arg_parser()
-    parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
-    parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
-    args = parser.parse_args()
-    logger.configure()
-    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
-        policy=args.policy, lrschedule=args.lrschedule, num_env=16)
-
-if __name__ == '__main__':
-    main()
--- a/baselines/a2c/runner.py
+++ b/baselines/a2c/runner.py
@@ -0,0 +1,60 @@
+import numpy as np
+from baselines.a2c.utils import discount_with_dones
+from baselines.common.runners import AbstractEnvRunner
+
+class Runner(AbstractEnvRunner):
+
+    def __init__(self, env, model, nsteps=5, gamma=0.99):
+        super().__init__(env=env, model=model, nsteps=nsteps)
+        self.gamma = gamma
+        self.batch_action_shape = [x if x is not None else -1 for x in model.train_model.action.shape.as_list()]
+        self.ob_dtype = model.train_model.X.dtype.as_numpy_dtype
+    
+    def run(self):
+        mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
+        mb_states = self.states
+        for n in range(self.nsteps):
+            actions, values, states, _ = self.model.step(self.obs, S=self.states, M=self.dones)
+            mb_obs.append(np.copy(self.obs))
+            mb_actions.append(actions)
+            mb_values.append(values)
+            mb_dones.append(self.dones)
+            obs, rewards, dones, _ = self.env.step(actions)
+            self.states = states
+            self.dones = dones
+            for n, done in enumerate(dones):
+                if done:
+                    self.obs[n] = self.obs[n]*0
+            self.obs = obs
+            mb_rewards.append(rewards)
+        mb_dones.append(self.dones)
+        #batch of steps to batch of rollouts
+
+        mb_obs = np.asarray(mb_obs, dtype=self.ob_dtype).swapaxes(1, 0).reshape(self.batch_ob_shape)
+        mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
+        mb_actions = np.asarray(mb_actions, dtype=self.model.train_model.action.dtype.name).swapaxes(1, 0)
+        mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
+        mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
+        mb_masks = mb_dones[:, :-1]
+        mb_dones = mb_dones[:, 1:]
+
+
+        if self.gamma > 0.0:
+            #discount/bootstrap off value fn
+            last_values = self.model.value(self.obs, S=self.states, M=self.dones).tolist()
+            for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
+                rewards = rewards.tolist()
+                dones = dones.tolist()
+                if dones[-1] == 0:
+                    rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
+                else:
+                    rewards = discount_with_dones(rewards, dones, self.gamma)
+
+                mb_rewards[n] = rewards
+    
+        mb_actions = mb_actions.reshape(self.batch_action_shape)
+
+        mb_rewards = mb_rewards.flatten()
+        mb_values = mb_values.flatten()
+        mb_masks = mb_masks.flatten()
+        return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
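Note on the bootstrapping branch in `Runner.run()`: when the last step of a rollout is not terminal, the value estimate is appended as an extra "reward", discounted through, and then dropped. The sketch below re-implements that idea in plain Python under stated assumptions. The function names `discounted_returns` and `rollout_returns` are hypothetical; the body of `discount_with_dones` is not shown in this diff, so the recursion used here (a done flag at step t cuts off everything after t) is an assumption about its behavior.

```python
def discounted_returns(rewards, dones, gamma):
    """Discounted return per step; a done flag at step t zeroes the future after t."""
    returns = []
    running = 0.0
    # Walk the rollout backwards, accumulating the discounted sum.
    for reward, done in zip(reversed(rewards), reversed(dones)):
        running = reward + gamma * running * (1.0 - float(done))
        returns.append(running)
    return returns[::-1]


def rollout_returns(rewards, dones, last_value, gamma):
    """Bootstrap from `last_value` only if the rollout did not end in a terminal step."""
    if not dones[-1]:
        extended = discounted_returns(rewards + [last_value], dones + [False], gamma)
        return extended[:-1]  # drop the bootstrap entry itself
    return discounted_returns(rewards, dones, gamma)


if __name__ == "__main__":
    print(rollout_returns([1.0, 1.0, 1.0], [False, False, False], 4.0, 0.5))
    print(rollout_returns([1.0, 1.0, 1.0], [False, False, True], 4.0, 0.5))
```

In the first call the value estimate 4.0 flows back into every step; in the second the episode ended inside the rollout, so the estimate is ignored.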
--- a/baselines/a2c/utils.py
+++ b/baselines/a2c/utils.py
@@ -1,8 +1,6 @@
 import os
-import gym
 import numpy as np
 import tensorflow as tf
-from gym import spaces
 from collections import deque

 def sample(logits):
@@ -10,18 +8,15 @@ def sample(logits):
    return tf.argmax(logits - tf.log(-tf.log(noise)), 1)

 def cat_entropy(logits):
-    a0 = logits - tf.reduce_max(logits, 1, keep_dims=True)
+    a0 = logits - tf.reduce_max(logits, 1, keepdims=True)
    ea0 = tf.exp(a0)
-    z0 = tf.reduce_sum(ea0, 1, keep_dims=True)
+    z0 = tf.reduce_sum(ea0, 1, keepdims=True)
    p0 = ea0 / z0
    return tf.reduce_sum(p0 * (tf.log(z0) - a0), 1)

 def cat_entropy_softmax(p0):
    return - tf.reduce_sum(p0 * tf.log(p0 + 1e-6), axis = 1)

-def mse(pred, target):
-    return tf.square(pred-target)/2.
-
 def ortho_init(scale=1.0):
    def _ortho_init(shape, dtype, partition_info=None):
        #lasagne ortho init for tf
@@ -39,12 +34,26 @@ def ortho_init(scale=1.0):
        return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
    return _ortho_init

-def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0):
+def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC', one_dim_bias=False):
+    if data_format == 'NHWC':
+        channel_ax = 3
+        strides = [1, stride, stride, 1]
+        bshape = [1, 1, 1, nf]
+    elif data_format == 'NCHW':
+        channel_ax = 1
+        strides = [1, 1, stride, stride]
+        bshape = [1, nf, 1, 1]
+    else:
+        raise NotImplementedError
+    bias_var_shape = [nf] if one_dim_bias else [1, nf, 1, 1]
+    nin = x.get_shape()[channel_ax].value
+    wshape = [rf, rf, nin, nf]
    with tf.variable_scope(scope):
-        nin = x.get_shape()[3].value
-        w = tf.get_variable("w", [rf, rf, nin, nf], initializer=ortho_init(init_scale))
-        b = tf.get_variable("b", [nf], initializer=tf.constant_initializer(0.0))
-        return tf.nn.conv2d(x, w, strides=[1, stride, stride, 1], padding=pad)+b
+        w = tf.get_variable("w", wshape, initializer=ortho_init(init_scale))
+        b = tf.get_variable("b", bias_var_shape, initializer=tf.constant_initializer(0.0))
+        if not one_dim_bias and data_format == 'NHWC':
+            b = tf.reshape(b, bshape)
+        return tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format) + b

 def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
    with tf.variable_scope(scope):
@@ -71,7 +80,6 @@ def seq_to_batch(h, flat = False):

 def lstm(xs, ms, s, scope, nh, init_scale=1.0):
    nbatch, nin = [v.value for v in xs[0].get_shape()]
-    nsteps = len(xs)
    with tf.variable_scope(scope):
        wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
        wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
@@ -101,7 +109,6 @@ def _ln(x, g, b, e=1e-5, axes=[1]):

 def lnlstm(xs, ms, s, scope, nh, init_scale=1.0):
    nbatch, nin = [v.value for v in xs[0].get_shape()]
-    nsteps = len(xs)
    with tf.variable_scope(scope):
        wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
        gx = tf.get_variable("gx", [nh*4], initializer=tf.constant_initializer(1.0))
@@ -146,8 +153,7 @@ def discount_with_dones(rewards, dones, gamma):
    return discounted[::-1]

 def find_trainable_variables(key):
-    with tf.variable_scope(key):
-        return tf.trainable_variables()
+    return tf.trainable_variables(key)

 def make_path(f):
    return os.makedirs(f, exist_ok=True)
--- a/baselines/acer/acer_simple.py
+++ b/baselines/acer/acer_simple.py
@@ -1,17 +1,20 @@
 import time
-import joblib
+import functools
 import numpy as np
 import tensorflow as tf
 from baselines import logger

 from baselines.common import set_global_seeds
+from baselines.common.policies import build_policy
+from baselines.common.tf_util import get_session, save_variables

 from baselines.a2c.utils import batch_to_seq, seq_to_batch
-from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
 from baselines.a2c.utils import cat_entropy_softmax
+from baselines.a2c.utils import Scheduler, find_trainable_variables
 from baselines.a2c.utils import EpisodeStats
 from baselines.a2c.utils import get_by_index, check_shape, avg_norm, gradient_add, q_explained_variance
 from baselines.acer.buffer import Buffer
+from baselines.acer.runner import Runner

 # remove last step
 def strip(var, nenvs, nsteps, flat = False):
@@ -56,10 +59,8 @@ class Model(object):
                 ent_coef, q_coef, gamma, max_grad_norm, lr,
                 rprop_alpha, rprop_epsilon, total_timesteps, lrschedule,
                 c, trust_region, alpha, delta):
-        config = tf.ConfigProto(allow_soft_placement=True,
-                                intra_op_parallelism_threads=num_procs,
-                                inter_op_parallelism_threads=num_procs)
-        sess = tf.Session(config=config)
+
+        sess = get_session()
        nact = ac_space.n
        nbatch = nenvs * nsteps

@@ -70,10 +71,15 @@ class Model(object):
        LR = tf.placeholder(tf.float32, [])
        eps = 1e-6
    
-        step_model = policy(sess, ob_space, ac_space, nenvs, 1, nstack, reuse=False)
-        train_model = policy(sess, ob_space, ac_space, nenvs, nsteps + 1, nstack, reuse=True)
+        step_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs,) + ob_space.shape[:-1] + (ob_space.shape[-1] * nstack,))
+        train_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs*(nsteps+1),) + ob_space.shape[:-1] + (ob_space.shape[-1] * nstack,))
+        with tf.variable_scope('acer_model', reuse=tf.AUTO_REUSE):

-        params = find_trainable_variables("model")
+            step_model = policy(observ_placeholder=step_ob_placeholder, sess=sess)
+            train_model = policy(observ_placeholder=train_ob_placeholder, sess=sess)
+
+    
+        params = find_trainable_variables("acer_model")
        print("Params {}".format(len(params)))
        for var in params:
            print(var)
@@ -87,14 +93,20 @@ class Model(object):
            print(v.name)
            return v

-        with tf.variable_scope("", custom_getter=custom_getter, reuse=True):
-            polyak_model = policy(sess, ob_space, ac_space, nenvs, nsteps + 1, nstack, reuse=True)
+        with tf.variable_scope("acer_model", custom_getter=custom_getter, reuse=True):
+            polyak_model = policy(observ_placeholder=train_ob_placeholder, sess=sess)

        # Notation: (var) = batch variable, (var)s = seqeuence variable, (var)_i = variable index by action at step i
-        v = tf.reduce_sum(train_model.pi * train_model.q, axis = -1) # shape is [nenvs * (nsteps + 1)]
+        
+        # action probability distributions according to train_model, polyak_model and step_model
+        # policy.pi holds the distribution parameters (logits); take a softmax to obtain a distribution that sums to 1
+        train_model_p = tf.nn.softmax(train_model.pi)  
+        polyak_model_p = tf.nn.softmax(polyak_model.pi)
+        step_model_p = tf.nn.softmax(step_model.pi)
+        v = tf.reduce_sum(train_model_p * train_model.q, axis = -1) # shape is [nenvs * (nsteps + 1)]

        # strip off last step
-        f, f_pol, q = map(lambda var: strip(var, nenvs, nsteps), [train_model.pi, polyak_model.pi, train_model.q])
+        f, f_pol, q = map(lambda var: strip(var, nenvs, nsteps), [train_model_p, polyak_model_p, train_model.q])
        # Get pi and q values for actions taken
        f_i = get_by_index(f, A)
        q_i = get_by_index(q, A)
@@ -108,6 +120,7 @@ class Model(object):

        # Calculate losses
        # Entropy   
+        # entropy = tf.reduce_mean(strip(train_model.pd.entropy(), nenvs, nsteps))
        entropy = tf.reduce_mean(cat_entropy_softmax(f))

        # Policy Graident loss, with truncated importance sampling & bias correction
@@ -189,84 +202,29 @@ class Model(object):
        def train(obs, actions, rewards, dones, mus, states, masks, steps):
            cur_lr = lr.value_steps(steps)
            td_map = {train_model.X: obs, polyak_model.X: obs, A: actions, R: rewards, D: dones, MU: mus, LR: cur_lr}
-            if states != []:
+            if states is not None:
                td_map[train_model.S] = states
                td_map[train_model.M] = masks
                td_map[polyak_model.S] = states
                td_map[polyak_model.M] = masks
+
            return names_ops, sess.run(run_ops, td_map)[1:]  # strip off _train

-        def save(save_path):
-            ps = sess.run(params)
-            make_path(save_path)
-            joblib.dump(ps, save_path)
+        def _step(observation, **kwargs):
+            return step_model._evaluate([step_model.action, step_model_p, step_model.state], observation, **kwargs)
+                
+                    

        self.train = train
-        self.save = save
+        self.save = functools.partial(save_variables, sess=sess, variables=params)
        self.train_model = train_model
        self.step_model = step_model
-        self.step = step_model.step
+        self._step = _step
+        self.step = self.step_model.step
+
        self.initial_state = step_model.initial_state
        tf.global_variables_initializer().run(session=sess)

-class Runner(object):
-    def __init__(self, env, model, nsteps, nstack):
-        self.env = env
-        self.nstack = nstack
-        self.model = model
-        nh, nw, nc = env.observation_space.shape
-        self.nc = nc  # nc = 1 for atari, but just in case
-        self.nenv = nenv = env.num_envs
-        self.nact = env.action_space.n
-        self.nbatch = nenv * nsteps
-        self.batch_ob_shape = (nenv*(nsteps+1), nh, nw, nc*nstack)
-        self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
-        obs = env.reset()
-        self.update_obs(obs)
-        self.nsteps = nsteps
-        self.states = model.initial_state
-        self.dones = [False for _ in range(nenv)]
-
-    def update_obs(self, obs, dones=None):
-        if dones is not None:
-            self.obs *= (1 - dones.astype(np.uint8))[:, None, None, None]
-        self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
-        self.obs[:, :, :, -self.nc:] = obs[:, :, :, :]
-
-    def run(self):
-        enc_obs = np.split(self.obs, self.nstack, axis=3)  # so now list of obs steps
-        mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
-        for _ in range(self.nsteps):
-            actions, mus, states = self.model.step(self.obs, state=self.states, mask=self.dones)
-            mb_obs.append(np.copy(self.obs))
-            mb_actions.append(actions)
-            mb_mus.append(mus)
-            mb_dones.append(self.dones)
-            obs, rewards, dones, _ = self.env.step(actions)
-            # states information for statefull models like LSTM
-            self.states = states
-            self.dones = dones
-            self.update_obs(obs, dones)
-            mb_rewards.append(rewards)
-            enc_obs.append(obs)
-        mb_obs.append(np.copy(self.obs))
-        mb_dones.append(self.dones)
-
-        enc_obs = np.asarray(enc_obs, dtype=np.uint8).swapaxes(1, 0)
-        mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0)
-        mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
-        mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
-        mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)
-
-        mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
-
-        mb_masks = mb_dones # Used for statefull models like LSTM's to mask state when done
-        mb_dones = mb_dones[:, 1:] # Used for calculating returns. The dones array is now aligned with rewards
-
-        # shapes are now [nenv, nsteps, []]
-        # When pulling from buffer, arrays will now be reshaped in place, preventing a deep copy.
-
-        return enc_obs, mb_obs, mb_actions, mb_rewards, mb_mus, mb_dones, mb_masks

 class Acer():
    def __init__(self, runner, model, buffer, log_interval):
@@ -312,19 +270,84 @@ class Acer():
            logger.dump_tabular()


-def learn(policy, env, seed, nsteps=20, nstack=4, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
+def learn(network, env, seed=None, nsteps=20, nstack=4, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
          max_grad_norm=10, lr=7e-4, lrschedule='linear', rprop_epsilon=1e-5, rprop_alpha=0.99, gamma=0.99,
          log_interval=100, buffer_size=50000, replay_ratio=4, replay_start=10000, c=10.0,
-          trust_region=True, alpha=0.99, delta=1):
+          trust_region=True, alpha=0.99, delta=1, load_path=None, **network_kwargs):
+
+    '''
+    Main entrypoint for ACER (Actor-Critic with Experience Replay) algorithm (https://arxiv.org/pdf/1611.01224.pdf)
+    Train an agent with given network architecture on a given environment using ACER.
+
+    Parameters:
+    ----------
+
+    network:            policy network architecture. Either a string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
+                        specifying a standard network architecture, or a function that takes a tensorflow tensor as input and returns a
+                        tuple (output_tensor, extra_feed), where output_tensor is the last network layer output, extra_feed is None for feed-forward
+                        neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
+                        See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
+
+    env:                environment. Needs to be vectorized for parallel environment simulation. 
+                        The environments produced by gym.make can be wrapped using baselines.common.vec_env.DummyVecEnv class.
+
+    nsteps:             int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
+                        nenv is number of environment copies simulated in parallel) (default: 20)
+
+    nstack:             int, size of the frame stack, i.e. number of the frames passed to the step model. Frames are stacked along channel dimension 
+                        (last image dimension) (default: 4)
+
+    total_timesteps:    int, number of timesteps (i.e. number of actions taken in the environment) (default: 80M)
+
+    q_coef:             float, value function loss coefficient in the optimization objective (analog of vf_coef for other actor-critic methods)
+
+    ent_coef:           float, policy entropy coefficient in the optimization objective (default: 0.01)
+
+    max_grad_norm:      float, gradient norm clipping coefficient. If set to None, no clipping. (default: 10)
+    
+    lr:                 float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)
+
+    lrschedule:         schedule of learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes fraction of the training progress as input and 
+                        returns fraction of the learning rate (specified as lr) as output
+
+    rprop_epsilon:      float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
+
+    rprop_alpha:        float, RMSProp decay parameter (default: 0.99)
+
+    gamma:              float, reward discounting factor (default: 0.99)
+
+    log_interval:       int, number of updates between logging events (default: 100)
+
+    buffer_size:        int, size of the replay buffer (default: 50k)
+
+    replay_ratio:       int, how many (on average) batches of data to sample from the replay buffer after each batch from the environment (default: 4)
+
+    replay_start:       int, the sampling from the replay buffer does not start until replay buffer has at least that many samples (default: 10k)
+
+    c:                  float, importance weight clipping factor (default: 10)
+    
+    trust_region:       bool, whether or not the algorithm estimates the gradient of the KL divergence between the old and updated policy and uses it to determine step size (default: True)
+
+    delta:              float, max KL divergence between the old policy and updated policy (default: 1)
+
+    alpha:              float, momentum factor in the Polyak (exponential moving average) averaging of the model parameters (default: 0.99) 
+
+    load_path:          str, path to load the model from (default: None)
+
+    **network_kwargs:               keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
+                                    For instance, 'mlp' network architecture has arguments num_hidden and num_layers. 
+
+    '''
+
    print("Running Acer Simple")
    print(locals())
-    tf.reset_default_graph()
    set_global_seeds(seed)
+    policy = build_policy(env, network, estimate_q=True, **network_kwargs)

    nenvs = env.num_envs
    ob_space = env.observation_space
    ac_space = env.action_space
-    num_procs = len(env.remotes) # HACK
+    num_procs = len(env.remotes) if hasattr(env, 'remotes') else 1  # HACK
    model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, nstack=nstack,
                  num_procs=num_procs, ent_coef=ent_coef, q_coef=q_coef, gamma=gamma,
                  max_grad_norm=max_grad_norm, lr=lr, rprop_alpha=rprop_alpha, rprop_epsilon=rprop_epsilon,
@@ -339,6 +362,7 @@ def learn(policy, env, seed, nsteps=20, nstack=4, total_timesteps=int(80e6), q_c
    nbatch = nenvs*nsteps
    acer = Acer(runner, model, buffer, log_interval)
    acer.tstart = time.time()
+
    for acer.steps in range(0, total_timesteps, nbatch): #nbatch samples, 1 on_policy call and multiple off-policy calls
        acer.call(on_policy=True)
        if replay_ratio > 0 and buffer.has_atleast(replay_start):
@@ -347,3 +371,4 @@ def learn(policy, env, seed, nsteps=20, nstack=4, total_timesteps=int(80e6), q_c
                acer.call(on_policy=False)  # no simulation steps in this

    env.close()
+    return model
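Note on the softmax change in acer_simple.py above: the new policy objects expose `pi` as unnormalized distribution parameters, so the diff applies a softmax before computing `v` as the probability-weighted sum of Q-values. The following is a minimal pure-Python sketch of that arithmetic for a single state, not the TensorFlow code from the patch; the function names and the example numbers are hypothetical.

```python
import math


def softmax(logits):
    """Convert unnormalized logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def state_value(logits, q_values):
    """V(s) = sum over actions of pi(a|s) * Q(s, a)."""
    probs = softmax(logits)
    return sum(p * q for p, q in zip(probs, q_values))


# Hypothetical inputs: 4 actions with equal logits, so pi is uniform.
logits = [0.0, 0.0, 0.0, 0.0]
q_values = [1.0, 2.0, 3.0, 4.0]
probs = softmax(logits)
value = state_value(logits, q_values)
```

With uniform probabilities the value is simply the mean of the Q-values. Multiplying raw logits by Q-values instead would not give an expectation, which is presumably why the patch normalizes first.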
--- a/baselines/acer/defaults.py
+++ b/baselines/acer/defaults.py
@@ -0,0 +1,4 @@
+def atari():
+    return dict(
+        lrschedule='constant'
+    )
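Note on `lrschedule`: the ACER docstring describes a schedule as a mapping from training progress in [0, 1] to a fraction of the base learning rate. The sketch below illustrates that mapping for the 'constant' and 'linear' cases. It is an assumption about the intended semantics, not a copy of the `Scheduler` class in baselines.a2c.utils, and the helper `lr_at` is hypothetical.

```python
def constant(progress):
    """Always use the full base learning rate."""
    return 1.0


def linear(progress):
    """Decay linearly from the full learning rate to zero."""
    return 1.0 - progress


def lr_at(step, total_steps, base_lr, schedule):
    """Learning rate at a given step (hypothetical helper)."""
    progress = min(step / total_steps, 1.0)  # clamp at the end of training
    return base_lr * schedule(progress)


start_lr = lr_at(0, 100, 7e-4, linear)
half_lr = lr_at(50, 100, 1.0, linear)
const_lr = lr_at(50, 100, 7e-4, constant)
```

Under this reading, the Atari default of 'constant' added above keeps the learning rate at `lr` for the whole run.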
--- a/baselines/acer/policies.py
+++ b/baselines/acer/policies.py
@@ -1,6 +1,6 @@
 import numpy as np
 import tensorflow as tf
-from baselines.ppo2.policies import nature_cnn
+from baselines.common.policies import nature_cnn
 from baselines.a2c.utils import fc, batch_to_seq, seq_to_batch, lstm, sample


@@ -18,11 +18,13 @@ class AcerCnnPolicy(object):
            pi = tf.nn.softmax(pi_logits)
            q = fc(h, 'q', nact)

-        a = sample(pi_logits)  # could change this to use self.pi instead
+        a = sample(tf.nn.softmax(pi_logits))  # could change this to use self.pi instead
        self.initial_state = []  # not stateful
        self.X = X
        self.pi = pi  # actual policy params now
+        self.pi_logits = pi_logits
        self.q = q
+        self.vf = q

        def step(ob, *args, **kwargs):
            # returns actions, mus, states
--- a/baselines/acer/run_atari.py
+++ b/baselines/acer/run_atari.py
@@ -1,30 +0,0 @@
-#!/usr/bin/env python3
-from baselines import logger
-from baselines.acer.acer_simple import learn
-from baselines.acer.policies import AcerCnnPolicy, AcerLstmPolicy
-from baselines.common.cmd_util import make_atari_env, atari_arg_parser
-
-def train(env_id, num_timesteps, seed, policy, lrschedule, num_cpu):
-    env = make_atari_env(env_id, num_cpu, seed)
-    if policy == 'cnn':
-        policy_fn = AcerCnnPolicy
-    elif policy == 'lstm':
-        policy_fn = AcerLstmPolicy
-    else:
-        print("Policy {} not implemented".format(policy))
-        return
-    learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
-    env.close()
-
-def main():
-    parser = atari_arg_parser()
-    parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
-    parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
-    parser.add_argument('--logdir', help ='Directory for logging')
-    args = parser.parse_args()
-    logger.configure(args.logdir)
-    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
-          policy=args.policy, lrschedule=args.lrschedule, num_cpu=16)
-
-if __name__ == '__main__':
-    main()
--- a/baselines/acer/runner.py
+++ b/baselines/acer/runner.py
@@ -0,0 +1,60 @@
+import numpy as np
+from baselines.common.runners import AbstractEnvRunner
+
+class Runner(AbstractEnvRunner):
+
+    def __init__(self, env, model, nsteps, nstack):
+        super().__init__(env=env, model=model, nsteps=nsteps)
+        self.nstack = nstack
+        nh, nw, nc = env.observation_space.shape
+        self.nc = nc  # nc = 1 for atari, but just in case
+        self.nact = env.action_space.n
+        nenv = self.nenv
+        self.nbatch = nenv * nsteps
+        self.batch_ob_shape = (nenv*(nsteps+1), nh, nw, nc*nstack)
+        self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
+        obs = env.reset()
+        self.update_obs(obs)
+
+    def update_obs(self, obs, dones=None):
+        #self.obs = obs
+        if dones is not None:
+            self.obs *= (1 - dones.astype(np.uint8))[:, None, None, None]
+        self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
+        self.obs[:, :, :, -self.nc:] = obs[:, :, :, :]
+
+    def run(self):
+        enc_obs = np.split(self.obs, self.nstack, axis=3)  # so now list of obs steps
+        mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
+        for _ in range(self.nsteps):
+            actions, mus, states = self.model._step(self.obs, S=self.states, M=self.dones)
+            mb_obs.append(np.copy(self.obs))
+            mb_actions.append(actions)
+            mb_mus.append(mus)
+            mb_dones.append(self.dones)
+            obs, rewards, dones, _ = self.env.step(actions)
+            # state information for stateful models like LSTM
+            self.states = states
+            self.dones = dones
+            self.update_obs(obs, dones)
+            mb_rewards.append(rewards)
+            enc_obs.append(obs)
+        mb_obs.append(np.copy(self.obs))
+        mb_dones.append(self.dones)
+
+        enc_obs = np.asarray(enc_obs, dtype=np.uint8).swapaxes(1, 0)
+        mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0)
+        mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
+        mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
+        mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)
+
+        mb_dones = np.asarray(mb_dones, dtype=bool).swapaxes(1, 0)
+
+        mb_masks = mb_dones # Used for stateful models like LSTMs to mask state when done
+        mb_dones = mb_dones[:, 1:] # Used for calculating returns. The dones array is now aligned with rewards
+
+        # shapes are now [nenv, nsteps, []]
+        # When pulling from buffer, arrays will now be reshaped in place, preventing a deep copy.
+
+        return enc_obs, mb_obs, mb_actions, mb_rewards, mb_mus, mb_dones, mb_masks
+
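Note on `Runner.update_obs` above: it keeps a rolling stack of the last `nstack` frames per environment, clears the stack when an episode ends, then shifts out the oldest frame and writes the newest one at the end. The sketch below reproduces that behavior for a single environment with plain Python lists instead of `np.roll` on the channel axis. The function name and the tuple-based frames are hypothetical simplifications.

```python
def update_stack(stack, frame, done=False):
    """Return a new frame stack (oldest first) after observing `frame`.

    If the previous episode ended, the old frames are zeroed first so the
    new episode does not see observations from the previous one.
    """
    if done:
        stack = [tuple(0 for _ in f) for f in stack]
    # Drop the oldest frame and append the newest one.
    return stack[1:] + [frame]


# Hypothetical one-pixel frames and a stack of three.
stack = [(0,), (0,), (0,)]
stack = update_stack(stack, (1,))
stack = update_stack(stack, (2,))
after_two = list(stack)
stack = update_stack(stack, (3,), done=True)
after_reset = list(stack)
```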
--- a/baselines/acktr/acktr.py
+++ b/baselines/acktr/acktr.py
@@ -0,0 +1 @@
+from baselines.acktr.acktr_disc import *
--- a/baselines/acktr/acktr_disc.py
+++ b/baselines/acktr/acktr_disc.py
@@ -1,16 +1,17 @@
 import os.path as osp
 import time
-import joblib
+import functools
 import numpy as np
 import tensorflow as tf
 from baselines import logger

 from baselines.common import set_global_seeds, explained_variance
+from baselines.common.policies import build_policy
+from baselines.common.tf_util import get_session, save_variables, load_variables

-from baselines.a2c.a2c import Runner
+from baselines.a2c.runner import Runner
 from baselines.a2c.utils import discount_with_dones
 from baselines.a2c.utils import Scheduler, find_trainable_variables
-from baselines.a2c.utils import cat_entropy, mse
 from baselines.acktr import kfac


@@ -19,11 +20,8 @@ class Model(object):
    def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, nprocs=32, nsteps=20,
                 ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
                 kfac_clip=0.001, lrschedule='linear'):
-        config = tf.ConfigProto(allow_soft_placement=True,
-                                intra_op_parallelism_threads=nprocs,
-                                inter_op_parallelism_threads=nprocs)
-        config.gpu_options.allow_growth = True
-        self.sess = sess = tf.Session(config=config)
+
+        self.sess = sess = get_session()
        nact = ac_space.n
        nbatch = nenvs * nsteps
        A = tf.placeholder(tf.int32, [nbatch])
@@ -32,27 +30,28 @@ class Model(object):
        PG_LR = tf.placeholder(tf.float32, [])
        VF_LR = tf.placeholder(tf.float32, [])

-        self.model = step_model = policy(sess, ob_space, ac_space, nenvs, 1, reuse=False)
-        self.model2 = train_model = policy(sess, ob_space, ac_space, nenvs*nsteps, nsteps, reuse=True)
+        with tf.variable_scope('acktr_model', reuse=tf.AUTO_REUSE):
+            self.model = step_model = policy(nenvs, 1, sess=sess)
+            self.model2 = train_model = policy(nenvs*nsteps, nsteps, sess=sess)

-        logpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
+        neglogpac = train_model.pd.neglogp(A)
        self.logits = logits = train_model.pi

        ##training loss
-        pg_loss = tf.reduce_mean(ADV*logpac)
-        entropy = tf.reduce_mean(cat_entropy(train_model.pi))
+        pg_loss = tf.reduce_mean(ADV*neglogpac)
+        entropy = tf.reduce_mean(train_model.pd.entropy())
        pg_loss = pg_loss - ent_coef * entropy
-        vf_loss = tf.reduce_mean(mse(tf.squeeze(train_model.vf), R))
+        vf_loss = tf.losses.mean_squared_error(tf.squeeze(train_model.vf), R)
        train_loss = pg_loss + vf_coef * vf_loss


        ##Fisher loss construction
-        self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(logpac)
+        self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(neglogpac)
        sample_net = train_model.vf + tf.random_normal(tf.shape(train_model.vf))
        self.vf_fisher = vf_fisher_loss = - vf_fisher_coef*tf.reduce_mean(tf.pow(train_model.vf - tf.stop_gradient(sample_net), 2))
        self.joint_fisher = joint_fisher_loss = pg_fisher_loss + vf_fisher_loss

-        self.params=params = find_trainable_variables("model")
+        self.params=params = find_trainable_variables("acktr_model")

        self.grads_check = grads = tf.gradients(train_loss,params)

@@ -82,22 +81,10 @@ class Model(object):
            )
            return policy_loss, value_loss, policy_entropy

-        def save(save_path):
-            ps = sess.run(params)
-            joblib.dump(ps, save_path)
-
-        def load(load_path):
-            loaded_params = joblib.load(load_path)
-            restores = []
-            for p, loaded_p in zip(params, loaded_params):
-                restores.append(p.assign(loaded_p))
-            sess.run(restores)
-
-

        self.train = train
-        self.save = save
-        self.load = load
+        self.save = functools.partial(save_variables, sess=sess)
+        self.load = functools.partial(load_variables, sess=sess)
        self.train_model = train_model
        self.step_model = step_model
        self.step = step_model.step
@@ -105,12 +92,17 @@ class Model(object):
        self.initial_state = step_model.initial_state
        tf.global_variables_initializer().run(session=sess)

-def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
+def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
                 ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
-                 kfac_clip=0.001, save_interval=None, lrschedule='linear'):
-    tf.reset_default_graph()
+                 kfac_clip=0.001, save_interval=None, lrschedule='linear', load_path=None, **network_kwargs):
    set_global_seeds(seed)

+    
+    if network == 'cnn':
+        network_kwargs['one_dim_bias'] = True
+
+    policy = build_policy(env, network, **network_kwargs)
+
    nenvs = env.num_envs
    ob_space = env.observation_space
    ac_space = env.action_space
@@ -124,6 +116,9 @@ def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval
            fh.write(cloudpickle.dumps(make_model))
    model = make_model()
            
+    if load_path is not None:
+        model.load(load_path)
+
    runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
    nbatch = nenvs*nsteps
    tstart = time.time()
@@ -153,3 +148,4 @@ def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval
    coord.request_stop()
    coord.join(enqueue_threads)
    env.close()
+    return model
--- a/baselines/acktr/run_atari.py
+++ b/baselines/acktr/run_atari.py
@@ -1,14 +1,16 @@
 #!/usr/bin/env python3

+from functools import partial
+
 from baselines import logger
 from baselines.acktr.acktr_disc import learn
 from baselines.common.cmd_util import make_atari_env, atari_arg_parser
 from baselines.common.vec_env.vec_frame_stack import VecFrameStack
-from baselines.ppo2.policies import CnnPolicy
+from baselines.common.policies import cnn

 def train(env_id, num_timesteps, seed, num_cpu):
    env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
-    policy_fn = CnnPolicy
+    policy_fn = cnn(env=env, one_dim_bias=True)
    learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
    env.close()

--- a/baselines/bench/benchmarks.py
+++ b/baselines/bench/benchmarks.py
@@ -9,6 +9,8 @@ _atariexpl7 = ['Freeway', 'Gravitar', 'MontezumaRevenge', 'Pitfall', 'PrivateEye
 _BENCHMARKS = []

 remove_version_re = re.compile(r'-v\d+$')
+
+
 def register_benchmark(benchmark):
    for b in _BENCHMARKS:
        if b['name'] == benchmark['name']:
@@ -57,7 +59,7 @@ register_benchmark({
 register_benchmark({
    'name': 'Atari10M',
    'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
-    'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari7]
+    'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 6, 'num_timesteps': int(10e6)} for _game in _atari7]
 })

 register_benchmark({
@@ -82,8 +84,9 @@ _mujocosmall = [
 register_benchmark({
    'name': 'Mujoco1M',
    'description': 'Some small 2D MuJoCo tasks, run for 1M timesteps',
-    'tasks': [{'env_id': _envid, 'trials': 3, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
+    'tasks': [{'env_id': _envid, 'trials': 6, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
 })
+
 register_benchmark({
    'name': 'MujocoWalkers',
    'description': 'MuJoCo forward walkers, run for 8M, humanoid 100M',
@@ -138,3 +141,11 @@ register_benchmark({
    'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari50]
 })

+# HER DDPG
+
+register_benchmark({
+    'name': 'HerDdpg',
+    'description': 'Smoke-test only benchmark of HER',
+    'tasks': [{'trials': 1, 'env_id': 'FetchReach-v1'}]
+})
+
--- a/baselines/bench/monitor.py
+++ b/baselines/bench/monitor.py
@@ -7,12 +7,13 @@ from glob import glob
 import csv
 import os.path as osp
 import json
+import numpy as np

 class Monitor(Wrapper):
    EXT = "monitor.csv"
    f = None

-    def __init__(self, env, filename, allow_early_resets=False, reset_keywords=()):
+    def __init__(self, env, filename, allow_early_resets=False, reset_keywords=(), info_keywords=()):
        Wrapper.__init__(self, env=env)
        self.tstart = time.time()
        if filename is None:
@@ -26,10 +27,12 @@ class Monitor(Wrapper):
                    filename = filename + "." + Monitor.EXT
            self.f = open(filename, "wt")
            self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, 'env_id' : env.spec and env.spec.id}))
-            self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords)
+            self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords+info_keywords)
            self.logger.writeheader()
+            self.f.flush()

        self.reset_keywords = reset_keywords
+        self.info_keywords = info_keywords
        self.allow_early_resets = allow_early_resets
        self.rewards = None
        self.needs_reset = True
@@ -61,6 +64,8 @@ class Monitor(Wrapper):
            eprew = sum(self.rewards)
            eplen = len(self.rewards)
            epinfo = {"r": round(eprew, 6), "l": eplen, "t": round(time.time() - self.tstart, 6)}
+            for k in self.info_keywords:
+                epinfo[k] = info[k]
            self.episode_rewards.append(eprew)
            self.episode_lengths.append(eplen)
            self.episode_times.append(time.time() - self.tstart)
@@ -107,6 +112,8 @@ def load_results(dir):
        with open(fname, 'rt') as fh:
            if fname.endswith('csv'):
                firstline = fh.readline()
+                if not firstline:
+                    continue
                assert firstline[0] == '#'
                header = json.loads(firstline[1:])
                df = pandas.read_csv(fh, index_col=None)
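Note on `info_keywords` in monitor.py above: at the end of an episode the Monitor now copies the listed keys from the environment's `info` dict into the episode record, and the CSV header is extended to match. The sketch below shows that flow with the standard `csv` module only. The helper `write_episode` and the key `is_success` are hypothetical, chosen for illustration.

```python
import csv
import io


def write_episode(writer, rewards, elapsed, info, info_keywords=()):
    """Build the episode record and write it as one CSV row."""
    epinfo = {"r": round(sum(rewards), 6), "l": len(rewards), "t": round(elapsed, 6)}
    for k in info_keywords:
        epinfo[k] = info[k]  # only the requested keys are logged
    writer.writerow(epinfo)
    return epinfo


buf = io.StringIO()
info_keywords = ("is_success",)
writer = csv.DictWriter(buf, fieldnames=("r", "l", "t") + info_keywords)
writer.writeheader()
row = write_episode(writer, [1.0, 0.5], 2.25,
                    {"is_success": 1.0, "other": 7}, info_keywords)
lines = buf.getvalue().splitlines()
```

Keys that are present in `info` but not listed in `info_keywords` (here `other`) are ignored, so the header and the rows stay consistent.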
--- a/baselines/common/__init__.py
+++ b/baselines/common/__init__.py
@@ -1,3 +1,4 @@
+# flake8: noqa F403
 from baselines.common.console_util import *
 from baselines.common.dataset import Dataset
 from baselines.common.math_util import *
--- a/baselines/common/atari_wrappers.py
+++ b/baselines/common/atari_wrappers.py
@@ -1,4 +1,6 @@
 import numpy as np
+import os
+os.environ.setdefault('PATH', '')
 from collections import deque
 import gym
 from gym import spaces
@@ -98,9 +100,6 @@ class MaxAndSkipEnv(gym.Wrapper):
        self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
        self._skip       = skip

-    def reset(self):
-        return self.env.reset()
-
    def step(self, action):
        """Repeat action, sum reward, and max over last observations."""
        total_reward = 0.0
--- a/baselines/common/cmd_util.py
+++ b/baselines/common/cmd_util.py
@@ -3,36 +3,62 @@ Helpers for scripts like run_atari.py.
 """

 import os
+try:
+    from mpi4py import MPI
+except ImportError:
+    MPI = None
+
 import gym
+from gym.wrappers import FlattenDictWrapper
 from baselines import logger
 from baselines.bench import Monitor
 from baselines.common import set_global_seeds
 from baselines.common.atari_wrappers import make_atari, wrap_deepmind
 from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
-from mpi4py import MPI

 def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
    """
    Create a wrapped, monitored SubprocVecEnv for Atari.
    """
    if wrapper_kwargs is None: wrapper_kwargs = {}
+    mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
    def make_env(rank): # pylint: disable=C0111
        def _thunk():
            env = make_atari(env_id)
-            env.seed(seed + rank)
-            env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
+            env.seed(seed + 10000*mpi_rank + rank if seed is not None else None)
+            env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(rank)))
            return wrap_deepmind(env, **wrapper_kwargs)
        return _thunk
    set_global_seeds(seed)
    return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])

-def make_mujoco_env(env_id, seed):
+def make_mujoco_env(env_id, seed, reward_scale=1.0):
+    """
+    Create a wrapped, monitored gym.Env for MuJoCo.
+    """
+    rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
+    myseed = seed  + 1000 * rank if seed is not None else None
+    set_global_seeds(myseed)
+    env = gym.make(env_id)
+    env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)), allow_early_resets=True)
+    env.seed(seed)
+
+    if reward_scale != 1.0:
+        from baselines.common.retro_wrappers import RewardScaler
+        env = RewardScaler(env, reward_scale)
+
+    return env
+
+def make_robotics_env(env_id, seed, rank=0):
    """
    Create a wrapped, monitored gym.Env for MuJoCo.
    """
    set_global_seeds(seed)
    env = gym.make(env_id)
-    env = Monitor(env, logger.get_dir())
+    env = FlattenDictWrapper(env, ['observation', 'desired_goal'])
+    env = Monitor(
+        env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)),
+        info_keywords=('is_success',))
    env.seed(seed)
    return env

@@ -47,18 +73,54 @@ def atari_arg_parser():
    """
    Create an argparse.ArgumentParser for run_atari.py.
    """
-    parser = arg_parser()
-    parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
-    parser.add_argument('--seed', help='RNG seed', type=int, default=0)
-    parser.add_argument('--num-timesteps', type=int, default=int(10e6))
-    return parser
+    print('Obsolete - use common_arg_parser instead')
+    return common_arg_parser()

 def mujoco_arg_parser():
+    print('Obsolete - use common_arg_parser instead')
+    return common_arg_parser()
+
+def common_arg_parser():
    """
    Create an argparse.ArgumentParser for run_mujoco.py.
    """
    parser = arg_parser()
-    parser.add_argument('--env', help='environment ID', type=str, default="Reacher-v1")
-    parser.add_argument('--seed', help='RNG seed', type=int, default=0)
+    parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
+    parser.add_argument('--seed', help='RNG seed', type=int, default=None)
+    parser.add_argument('--alg', help='Algorithm', type=str, default='ppo2')
+    parser.add_argument('--num_timesteps', type=float, default=1e6)
+    parser.add_argument('--network', help='network type (mlp, cnn, lstm, cnn_lstm, conv_only)', default=None)
+    parser.add_argument('--gamestate', help='game state to load (so far only used in retro games)', default=None)
+    parser.add_argument('--num_env', help='Number of environment copies being run in parallel. When not specified, set to number of cpus for Atari, and to 1 for Mujoco', default=None, type=int)
+    parser.add_argument('--reward_scale', help='Reward scale factor. Default: 1.0', default=1.0, type=float)
+    parser.add_argument('--save_path', help='Path to save trained model to', default=None, type=str)
+    parser.add_argument('--play', default=False, action='store_true')
+    return parser
+
+def robotics_arg_parser():
+    """
+    Create an argparse.ArgumentParser for run_mujoco.py.
+    """
+    parser = arg_parser()
+    parser.add_argument('--env', help='environment ID', type=str, default='FetchReach-v0')
+    parser.add_argument('--seed', help='RNG seed', type=int, default=None)
    parser.add_argument('--num-timesteps', type=int, default=int(1e6))
    return parser
+
+
+def parse_unknown_args(args):
+    """
+    Parse arguments not consumed by the arg parser into a dictionary
+    """
+    retval = {}
+    for arg in args:
+        assert arg.startswith('--')
+        assert '=' in arg, 'cannot parse arg {}'.format(arg)
+        key = arg.split('=', 1)[0][2:]
+        value = arg.split('=', 1)[1]
+        retval[key] = value
+
+    return retval
+
+
+
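Note on parse_unknown_args: a minimal usage sketch, assuming the helper is paired with argparse's parse_known_args so that leftover --key=value flags become string-valued keyword arguments. The helper is restated here only so the example is self-contained; the flag names (--nsteps, --lr) are illustrative and not taken from this change.

```python
import argparse


def parse_unknown_args(args):
    """Parse arguments not consumed by the arg parser into a dictionary."""
    retval = {}
    for arg in args:
        assert arg.startswith('--')
        assert '=' in arg, 'cannot parse arg {}'.format(arg)
        key = arg.split('=')[0][2:]
        value = arg.split('=')[1]
        retval[key] = value
    return retval


parser = argparse.ArgumentParser()
parser.add_argument('--env', type=str, default='Reacher-v2')

# parse_known_args returns (namespace, list of unrecognized argument strings).
known, unknown = parser.parse_known_args(
    ['--env=Hopper-v2', '--nsteps=128', '--lr=3e-4'])
extra = parse_unknown_args(unknown)

# Values stay strings; the caller is responsible for any type conversion.
print(known.env, extra)
```

Because the assertions require the --key=value form, a space-separated flag such as "--nsteps 128" would be rejected.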
--- a/baselines/common/console_util.py
+++ b/baselines/common/console_util.py
@@ -16,7 +16,12 @@ def fmt_item(x, l):
    if isinstance(x, np.ndarray):
        assert x.ndim==0
        x = x.item()
-    if isinstance(x, float): rep = "%g"%x
+    if isinstance(x, (float, np.float32, np.float64)):
+        v = abs(x)
+        if (v < 1e-4 or v > 1e+4) and v > 0:
+            rep = "%7.2e" % x
+        else:
+            rep = "%7.5f" % x
    else: rep = str(x)
    return " "*(l - len(rep)) + rep

--- a/baselines/common/distributions.py
+++ b/baselines/common/distributions.py
@@ -1,6 +1,7 @@
 import tensorflow as tf
 import numpy as np
 import baselines.common.tf_util as U
+from baselines.a2c.utils import fc
 from tensorflow.python.ops import math_ops

 class Pd(object):
@@ -31,6 +32,8 @@ class PdType(object):
        raise NotImplementedError
    def pdfromflat(self, flat):
        return self.pdclass()(flat)
+    def pdfromlatent(self, latent_vector):
+        raise NotImplementedError
    def param_shape(self):
        raise NotImplementedError
    def sample_shape(self):
@@ -48,6 +51,10 @@ class CategoricalPdType(PdType):
        self.ncat = ncat
    def pdclass(self):
        return CategoricalPd
+    def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
+        pdparam = fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
+        return self.pdfromflat(pdparam), pdparam
+
    def param_shape(self):
        return [self.ncat]
    def sample_shape(self):
@@ -75,6 +82,13 @@ class DiagGaussianPdType(PdType):
        self.size = size
    def pdclass(self):
        return DiagGaussianPd
+
+    def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
+        mean = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
+        logstd = tf.get_variable(name='pi/logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
+        pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
+        return self.pdfromflat(pdparam), mean
+
    def param_shape(self):
        return [2*self.size]
    def sample_shape(self):
@@ -129,26 +143,26 @@ class CategoricalPd(Pd):
        # Note: we can't use sparse_softmax_cross_entropy_with_logits because
        #       the implementation does not allow second-order derivatives...
        one_hot_actions = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
-        return tf.nn.softmax_cross_entropy_with_logits(
+        return tf.nn.softmax_cross_entropy_with_logits_v2(
            logits=self.logits,
            labels=one_hot_actions)
    def kl(self, other):
-        a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keep_dims=True)
-        a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keep_dims=True)
+        a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
+        a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keepdims=True)
        ea0 = tf.exp(a0)
        ea1 = tf.exp(a1)
-        z0 = tf.reduce_sum(ea0, axis=-1, keep_dims=True)
-        z1 = tf.reduce_sum(ea1, axis=-1, keep_dims=True)
+        z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
+        z1 = tf.reduce_sum(ea1, axis=-1, keepdims=True)
        p0 = ea0 / z0
        return tf.reduce_sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
    def entropy(self):
-        a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keep_dims=True)
+        a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
        ea0 = tf.exp(a0)
-        z0 = tf.reduce_sum(ea0, axis=-1, keep_dims=True)
+        z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
        p0 = ea0 / z0
        return tf.reduce_sum(p0 * (tf.log(z0) - a0), axis=-1)
    def sample(self):
-        u = tf.random_uniform(tf.shape(self.logits))
+        u = tf.random_uniform(tf.shape(self.logits), dtype=self.logits.dtype)
        return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
    @classmethod
    def fromflat(cls, flat):
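Note on CategoricalPd.sample: the expression argmax(logits - log(-log(u))) is the Gumbel-max trick. A minimal pure-Python sketch follows (no TensorFlow; the names gumbel_max_sample and softmax and the example logits are illustrative assumptions). It compares empirical sample frequencies with the softmax probabilities.

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Draw one index with probability softmax(logits)[i] using Gumbel noise."""
    best_index, best_score = None, -math.inf
    for i, logit in enumerate(logits):
        # Keep u strictly inside (0, 1) so that both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        score = logit - math.log(-math.log(u))
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def softmax(logits):
    """Softmax with the max subtracted first, as kl() and entropy() do."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


rng = random.Random(0)
logits = [1.0, 0.0, -1.0]
n = 20000
counts = [0] * len(logits)
for _ in range(n):
    counts[gumbel_max_sample(logits, rng)] += 1

freqs = [c / n for c in counts]
probs = softmax(logits)
print(freqs, probs)
```

The two printed lists should agree to within sampling noise.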
--- a/baselines/common/filters.py
+++ b/baselines/common/filters.py
--- a/baselines/common/identity_env.py
+++ b/baselines/common/identity_env.py
@@ -0,0 +1,30 @@
+from gym import Env
+from gym.spaces import Discrete
+
+
+class IdentityEnv(Env):
+    def __init__(
+            self,
+            dim,
+            ep_length=100,
+    ):
+
+        self.action_space = Discrete(dim)
+        self.reset()
+
+    def reset(self):
+        self._choose_next_state()
+        self.observation_space = self.action_space
+
+        return self.state
+
+    def step(self, actions):
+        rew = self._get_reward(actions)
+        self._choose_next_state()
+        return self.state, rew, False, {}
+
+    def _choose_next_state(self):
+        self.state = self.action_space.sample()
+
+    def _get_reward(self, actions):
+        return 1 if self.state == actions else 0
--- a/baselines/common/input.py
+++ b/baselines/common/input.py
@@ -0,0 +1,56 @@
+import tensorflow as tf
+from gym.spaces import Discrete, Box
+
+def observation_placeholder(ob_space, batch_size=None, name='Ob'):
+    ''' 
+    Create a placeholder for feeding observations, with a shape appropriate to the observation space
+    
+    Parameters:
+    ----------
+
+    ob_space: gym.Space     observation space
+    
+    batch_size: int         size of the batch to be fed into input. Can be left None in most cases. 
+
+    name: str               name of the placeholder
+
+    Returns:
+    -------
+
+    tensorflow placeholder tensor
+    '''
+
+    assert isinstance(ob_space, Discrete) or isinstance(ob_space, Box), \
+        'Can only deal with Discrete and Box observation spaces for now'
+
+    return tf.placeholder(shape=(batch_size,) + ob_space.shape, dtype=ob_space.dtype, name=name)
+
+
+def observation_input(ob_space, batch_size=None, name='Ob'):
+    ''' 
+    Create a placeholder for feeding observations, with a shape appropriate to the observation space,
+    and add an input encoder of the appropriate type.
+    '''
+
+    placeholder = observation_placeholder(ob_space, batch_size, name)
+    return placeholder, encode_observation(ob_space, placeholder)
+
+def encode_observation(ob_space, placeholder):
+    '''
+    Encode input in the way that is appropriate to the observation space
+
+    Parameters:
+    ----------
+    
+    ob_space: gym.Space             observation space
+    
+    placeholder: tf.placeholder     observation input placeholder
+    '''
+    if isinstance(ob_space, Discrete):
+        return tf.to_float(tf.one_hot(placeholder, ob_space.n))
+
+    elif isinstance(ob_space, Box):
+        return tf.to_float(placeholder)
+    else:
+        raise NotImplementedError
+
--- a/baselines/common/misc_util.py
+++ b/baselines/common/misc_util.py
@@ -67,14 +67,21 @@ class EzPickle(object):


 def set_global_seeds(i):
+    try:
+        from mpi4py import MPI
+        rank = MPI.COMM_WORLD.Get_rank()
+    except ImportError:
+        rank = 0
+
+    myseed = i  + 1000 * rank if i is not None else None
    try:
        import tensorflow as tf
    except ImportError:
        pass
    else:
-        tf.set_random_seed(i)
-    np.random.seed(i)
-    random.seed(i)
+        tf.set_random_seed(myseed)
+    np.random.seed(myseed)
+    random.seed(myseed)


 def pretty_eta(seconds_left):
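Note on the per-rank seed in set_global_seeds: the expression "i + 1000 * rank if i is not None else None" parses as "(i + 1000 * rank) if i is not None else None", so a missing seed stays None and is never added to. A small sketch of the same rule, assuming a hypothetical helper name derive_seed:

```python
import random


def derive_seed(base_seed, rank, stride=1000):
    """Return a distinct seed per process rank, or None if no seed was given."""
    if base_seed is None:
        return None
    return base_seed + stride * rank


# Four hypothetical MPI ranks sharing the base seed 0.
seeds = [derive_seed(0, rank) for rank in range(4)]

# Seeding with the derived value makes each rank reproducible on its own.
first_draws = []
for seed in seeds:
    rng = random.Random(seed)
    first_draws.append(rng.random())

# The same rank and base seed always reproduce the same stream.
repeat = random.Random(derive_seed(0, 2)).random()
print(seeds, repeat == first_draws[2])
```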
--- a/baselines/common/models.py
+++ b/baselines/common/models.py
@@ -0,0 +1,177 @@
+import numpy as np
+import tensorflow as tf
+from baselines.a2c import utils
+from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch
+from baselines.common.mpi_running_mean_std import RunningMeanStd
+import tensorflow.contrib.layers as layers
+
+
+def nature_cnn(unscaled_images, **conv_kwargs):
+    """
+    CNN from Nature paper.
+    """
+    scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
+    activ = tf.nn.relu
+    h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
+                   **conv_kwargs))
+    h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
+    h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
+    h3 = conv_to_fc(h3)
+    return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
+
+
+def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
+    """
+    Simple fully connected layer policy. Separate stacks of fully-connected layers are used for policy and value function estimation.
+    More customized fully-connected policies can be obtained by using the PolicyWithValue class directly.
+
+    Parameters:
+    ----------
+
+    num_layers: int                 number of fully-connected layers (default: 2)
+    
+    num_hidden: int                 size of fully-connected layers (default: 64)
+    
+    activation:                     activation function (default: tf.tanh)
+        
+    Returns:
+    -------
+
+    function that builds fully connected network with a given input placeholder
+    """        
+    def network_fn(X):
+        h = tf.layers.flatten(X)
+        for i in range(num_layers):
+            h = activation(fc(h, 'mlp_fc{}'.format(i), nh=num_hidden, init_scale=np.sqrt(2)))
+        return h, None
+
+    return network_fn
+  
+
+def cnn(**conv_kwargs):
+    def network_fn(X):
+        return nature_cnn(X, **conv_kwargs), None
+    return network_fn
+
+def cnn_small(**conv_kwargs):
+    def network_fn(X):
+        h = tf.cast(X, tf.float32) / 255.
+        
+        activ = tf.nn.relu
+        h = activ(conv(h, 'c1', nf=8, rf=8, stride=4, init_scale=np.sqrt(2), **conv_kwargs))
+        h = activ(conv(h, 'c2', nf=16, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
+        h = conv_to_fc(h)
+        h = activ(fc(h, 'fc1', nh=128, init_scale=np.sqrt(2)))
+        return h, None
+    return network_fn
+
+
+
+def lstm(nlstm=128, layer_norm=False):
+    def network_fn(X, nenv=1):
+        nbatch = X.shape[0] 
+        nsteps = nbatch // nenv
+         
+        h = tf.layers.flatten(X)
+
+        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
+        S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states
+
+        xs = batch_to_seq(h, nenv, nsteps)
+        ms = batch_to_seq(M, nenv, nsteps)
+
+        if layer_norm:
+            h5, snew = utils.lnlstm(xs, ms, S, scope='lnlstm', nh=nlstm)
+        else:
+            h5, snew = utils.lstm(xs, ms, S, scope='lstm', nh=nlstm)
+            
+        h = seq_to_batch(h5)
+        initial_state = np.zeros(S.shape.as_list(), dtype=float)
+
+        return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}
+
+    return network_fn
+
+
+def cnn_lstm(nlstm=128, layer_norm=False, **conv_kwargs):
+    def network_fn(X, nenv=1):
+        nbatch = X.shape[0] 
+        nsteps = nbatch // nenv
+         
+        h = nature_cnn(X, **conv_kwargs)
+       
+        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
+        S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states
+
+        xs = batch_to_seq(h, nenv, nsteps)
+        ms = batch_to_seq(M, nenv, nsteps)
+
+        if layer_norm:
+            h5, snew = utils.lnlstm(xs, ms, S, scope='lnlstm', nh=nlstm)
+        else:
+            h5, snew = utils.lstm(xs, ms, S, scope='lstm', nh=nlstm)
+            
+        h = seq_to_batch(h5)
+        initial_state = np.zeros(S.shape.as_list(), dtype=float)
+
+        return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}
+
+    return network_fn
+
+def cnn_lnlstm(nlstm=128, **conv_kwargs):
+    return cnn_lstm(nlstm, layer_norm=True, **conv_kwargs)
+
+
+def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
+    ''' 
+    convolutions-only net
+
+    Parameters:
+    ----------
+
+    convs:      list of triples (filter_number, filter_size, stride) specifying parameters for each layer.
+
+    Returns:
+
+    function that takes tensorflow tensor as input and returns the output of the last convolutional layer
+    
+    '''
+
+    def network_fn(X):
+        out = X
+        with tf.variable_scope("convnet"):
+            for num_outputs, kernel_size, stride in convs:
+                out = layers.convolution2d(out,
+                                           num_outputs=num_outputs,
+                                           kernel_size=kernel_size,
+                                           stride=stride,
+                                           activation_fn=tf.nn.relu,
+                                           **conv_kwargs)
+
+        return out, None
+    return network_fn
+
+def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
+    rms = RunningMeanStd(shape=x.shape[1:])
+    norm_x = tf.clip_by_value((x - rms.mean) / rms.std, min(clip_range), max(clip_range))
+    return norm_x, rms
+    
+
+def get_network_builder(name):
+    # TODO: replace with reflection? 
+    if name == 'cnn':
+        return cnn
+    elif name == 'cnn_small':
+        return cnn_small
+    elif name == 'conv_only':
+        return conv_only
+    elif name == 'mlp':
+        return mlp
+    elif name == 'lstm':
+        return lstm
+    elif name == 'cnn_lstm':
+        return cnn_lstm
+    elif name == 'cnn_lnlstm':
+        return cnn_lnlstm
+    else:
+        raise ValueError('Unknown network type: {}'.format(name))
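Note on the TODO in get_network_builder: one possible alternative to the if/elif chain, sketched here as an assumption and not as what this change implements, is a decorator-based registry. The names _REGISTRY and register are hypothetical, and the mlp body is a stand-in for the real graph-building code.

```python
_REGISTRY = {}


def register(name):
    """Decorator that records a network builder under the given name."""
    def _decorator(builder):
        _REGISTRY[name] = builder
        return builder
    return _decorator


@register('mlp')
def mlp(num_layers=2, num_hidden=64):
    def network_fn(x):
        # Stand-in: return a description instead of building TensorFlow ops.
        # The second element mirrors the (latent, recurrent_tensors) convention.
        return ('mlp', num_layers, num_hidden, x), None
    return network_fn


def get_network_builder(name):
    """Look up a builder by name, keeping the original error message."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise ValueError('Unknown network type: {}'.format(name)) from None


latent, recurrent = get_network_builder('mlp')(num_layers=3)('obs')
print(latent, recurrent)
```

With this layout, adding a new network type only requires decorating its builder; the lookup function does not change.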
--- a/baselines/common/mpi_adam_optimizer.py
+++ b/baselines/common/mpi_adam_optimizer.py
@@ -0,0 +1,31 @@
+import numpy as np
+import tensorflow as tf
+from mpi4py import MPI
+
+class MpiAdamOptimizer(tf.train.AdamOptimizer):
+    """Adam optimizer that averages gradients across mpi processes."""
+    def __init__(self, comm, **kwargs):
+        self.comm = comm
+        tf.train.AdamOptimizer.__init__(self, **kwargs)
+    def compute_gradients(self, loss, var_list, **kwargs):
+        grads_and_vars = tf.train.AdamOptimizer.compute_gradients(self, loss, var_list, **kwargs)
+        grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None]
+        flat_grad = tf.concat([tf.reshape(g, (-1,)) for g, v in grads_and_vars], axis=0)
+        shapes = [v.shape.as_list() for g, v in grads_and_vars]
+        sizes = [int(np.prod(s)) for s in shapes]
+
+        num_tasks = self.comm.Get_size()
+        buf = np.zeros(sum(sizes), np.float32)
+
+        def _collect_grads(flat_grad):
+            self.comm.Allreduce(flat_grad, buf, op=MPI.SUM)
+            np.divide(buf, float(num_tasks), out=buf)
+            return buf
+
+        avg_flat_grad = tf.py_func(_collect_grads, [flat_grad], tf.float32)
+        avg_flat_grad.set_shape(flat_grad.shape)
+        avg_grads = tf.split(avg_flat_grad, sizes, axis=0)
+        avg_grads_and_vars = [(tf.reshape(g, v.shape), v)
+                    for g, (_, v) in zip(avg_grads, grads_and_vars)]
+
+        return avg_grads_and_vars
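Note on MpiAdamOptimizer.compute_gradients: the method flattens all gradients into one buffer, sums that buffer across processes, divides by the number of processes, and splits the result back per variable. A pure-Python sketch of that data flow, with a plain list of workers standing in for MPI (the function name average_flat_grads and the sample numbers are assumptions; reshaping to the variable shapes is omitted):

```python
def average_flat_grads(per_worker_grads):
    """Average gradients across workers.

    per_worker_grads: one entry per worker; each entry is a list of
    per-variable gradients, each already flattened to a list of floats.
    """
    sizes = [len(g) for g in per_worker_grads[0]]

    # Concatenate each worker's gradients into a single flat buffer.
    flat = [[x for g in grads for x in g] for grads in per_worker_grads]
    num_tasks = len(flat)

    # Elementwise sum across workers (the role of Allreduce with MPI.SUM),
    # followed by division by the number of workers.
    avg = [sum(column) / num_tasks for column in zip(*flat)]

    # Split the averaged buffer back into per-variable chunks.
    chunks, start = [], 0
    for size in sizes:
        chunks.append(avg[start:start + size])
        start += size
    return chunks


# Two hypothetical workers, two variables of sizes 2 and 1.
workers = [
    [[1.0, 2.0], [3.0]],
    [[3.0, 4.0], [5.0]],
]
averaged = average_flat_grads(workers)
print(averaged)
```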
--- a/baselines/common/mpi_moments.py
+++ b/baselines/common/mpi_moments.py
@@ -2,6 +2,7 @@ from mpi4py import MPI
 import numpy as np
 from baselines.common import zipsame

+
 def mpi_mean(x, axis=0, comm=None, keepdims=False):
    x = np.asarray(x)
    assert x.ndim > 0
--- a/baselines/common/mpi_util.py
+++ b/baselines/common/mpi_util.py
@@ -0,0 +1,101 @@
+from collections import defaultdict
+from mpi4py import MPI
+import os, numpy as np
+import platform
+import shutil
+import subprocess
+
+def sync_from_root(sess, variables, comm=None):
+    """
+    Send the root node's parameters to every worker.
+    Arguments:
+      sess: the TensorFlow session.
+      variables: all parameter variables including optimizer's
+    """
+    if comm is None: comm = MPI.COMM_WORLD
+    rank = comm.Get_rank()
+    for var in variables:
+        if rank == 0:
+            comm.Bcast(sess.run(var))
+        else:
+            import tensorflow as tf
+            returned_var = np.empty(var.shape, dtype='float32')
+            comm.Bcast(returned_var)
+            sess.run(tf.assign(var, returned_var))
+
+def gpu_count():
+    """
+    Count the GPUs on this machine.
+    """
+    if shutil.which('nvidia-smi') is None:
+        return 0
+    output = subprocess.check_output(['nvidia-smi', '--query-gpu=gpu_name', '--format=csv'])
+    return max(0, len(output.split(b'\n')) - 2)
+
+def setup_mpi_gpus():
+    """
+    Set CUDA_VISIBLE_DEVICES using MPI.
+    """
+    num_gpus = gpu_count()
+    if num_gpus == 0:
+        return
+    local_rank, _ = get_local_rank_size(MPI.COMM_WORLD)
+    os.environ['CUDA_VISIBLE_DEVICES'] = str(local_rank % num_gpus)
+
+def get_local_rank_size(comm):
+    """
+    Returns the local rank of this process on its machine and the number of processes on that machine
+    The processes on a given machine will be assigned ranks
+        0, 1, 2, ..., N-1,
+    where N is the number of processes on this machine.
+
+    Useful if you want to assign one gpu per machine
+    """
+    this_node = platform.node()
+    ranks_nodes = comm.allgather((comm.Get_rank(), this_node))
+    node2rankssofar = defaultdict(int)
+    local_rank = None
+    for (rank, node) in ranks_nodes:
+        if rank == comm.Get_rank():
+            local_rank = node2rankssofar[node]
+        node2rankssofar[node] += 1
+    assert local_rank is not None
+    return local_rank, node2rankssofar[this_node]
+
+def share_file(comm, path):
+    """
+    Copies the file from rank 0 to all other ranks
+    Puts it in the same place on all machines
+    """
+    localrank, _ = get_local_rank_size(comm)
+    if comm.Get_rank() == 0:
+        with open(path, 'rb') as fh:
+            data = fh.read()
+        comm.bcast(data)
+    else:
+        data = comm.bcast(None)
+        if localrank == 0:
+            os.makedirs(os.path.dirname(path), exist_ok=True)
+            with open(path, 'wb') as fh:
+                fh.write(data)
+    comm.Barrier()
+
+def dict_gather(comm, d, op='mean', assert_all_have_data=True):
+    if comm is None: return d
+    alldicts = comm.allgather(d)
+    size = comm.size
+    k2li = defaultdict(list)
+    for d in alldicts:
+        for (k,v) in d.items():
+            k2li[k].append(v)
+    result = {}
+    for (k,li) in k2li.items():
+        if assert_all_have_data:
+            assert len(li)==size, "only %i out of %i MPI workers have sent '%s'" % (len(li), size, k)
+        if op=='mean':
+            result[k] = np.mean(li, axis=0)
+        elif op=='sum':
+            result[k] = np.sum(li, axis=0)
+        else:
+            assert 0, op
+    return result
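Note on get_local_rank_size and setup_mpi_gpus: the local rank is obtained by counting, in gathered order, how many processes on the same host were seen before the current one. A sketch without MPI, where a hard-coded list stands in for comm.allgather (the host names, the helper name local_ranks, and the GPU count are assumptions):

```python
from collections import defaultdict


def local_ranks(ranks_nodes):
    """Map each global rank to (local_rank, processes_on_same_node)."""
    seen = defaultdict(int)
    local = {}
    for rank, node in ranks_nodes:
        local[rank] = seen[node]
        seen[node] += 1
    return {rank: (local[rank], seen[node]) for rank, node in ranks_nodes}


# Stand-in for comm.allgather((comm.Get_rank(), platform.node())).
gathered = [(0, 'hostA'), (1, 'hostA'), (2, 'hostB'), (3, 'hostA')]
result = local_ranks(gathered)

# As in setup_mpi_gpus: spread local ranks over the GPUs of a machine.
num_gpus = 2
gpu_for_rank = {rank: lr % num_gpus for rank, (lr, _) in result.items()}
print(result, gpu_for_rank)
```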
--- a/baselines/common/policies.py
+++ b/baselines/common/policies.py
@@ -0,0 +1,179 @@
+import tensorflow as tf
+from baselines.common import tf_util
+from baselines.a2c.utils import fc
+from baselines.common.distributions import make_pdtype
+from baselines.common.input import observation_placeholder, encode_observation
+from baselines.common.tf_util import adjust_shape
+from baselines.common.mpi_running_mean_std import RunningMeanStd
+from baselines.common.models import get_network_builder
+
+import gym
+
+
+class PolicyWithValue(object):
+    """
+    Encapsulates fields and methods for RL policy and value function estimation with shared parameters
+    """
+
+    def __init__(self, env, observations, latent, estimate_q=False, vf_latent=None, sess=None, **tensors):
+        """
+        Parameters:
+        ----------
+        env             RL environment
+
+        observations    tensorflow placeholder in which the observations will be fed
+
+        latent          latent state from which policy distribution parameters should be inferred
+
+        vf_latent       latent state from which value function should be inferred (if None, then latent is used)
+
+        sess            tensorflow session to run calculations in (if None, default session is used)
+
+        **tensors       tensorflow tensors for additional attributes such as state or mask
+
+        """
+            
+        self.X = observations
+        self.state = tf.constant([])
+        self.initial_state = None
+        self.__dict__.update(tensors)
+
+        vf_latent = vf_latent if vf_latent is not None else latent
+
+        vf_latent = tf.layers.flatten(vf_latent)
+        latent = tf.layers.flatten(latent)
+
+        self.pdtype = make_pdtype(env.action_space)
+
+        self.pd, self.pi = self.pdtype.pdfromlatent(latent, init_scale=0.01)
+
+        self.action = self.pd.sample()
+        self.neglogp = self.pd.neglogp(self.action)
+        self.sess = sess
+
+        if estimate_q:
+            assert isinstance(env.action_space, gym.spaces.Discrete)
+            self.q = fc(vf_latent, 'q', env.action_space.n)
+            self.vf = self.q
+        else:
+            self.vf = fc(vf_latent, 'vf', 1)
+            self.vf = self.vf[:,0]
+
+    def _evaluate(self, variables, observation, **extra_feed):
+        sess = self.sess or tf.get_default_session()
+        feed_dict = {self.X: adjust_shape(self.X, observation)}
+        for inpt_name, data in extra_feed.items():
+            if inpt_name in self.__dict__.keys():
+                inpt = self.__dict__[inpt_name]
+                if isinstance(inpt, tf.Tensor) and inpt._op.type == 'Placeholder':
+                    feed_dict[inpt] = adjust_shape(inpt, data)
+
+        return sess.run(variables, feed_dict)
+
+    def step(self, observation, **extra_feed):
+        """
+        Compute next action(s) given the observation(s)
+
+        Parameters:
+        ----------
+
+        observation     observation data (either single or a batch)
+
+        **extra_feed    additional data such as state or mask (names of the arguments should match the ones in constructor, see __init__)
+
+        Returns:
+        -------
+        (action, value estimate, next state, negative log likelihood of the action under current policy parameters) tuple
+        """
+    
+        a, v, state, neglogp = self._evaluate([self.action, self.vf, self.state, self.neglogp], observation, **extra_feed)
+        if state.size == 0:
+            state = None
+        return a, v, state, neglogp
+
+    def value(self, ob, *args, **kwargs):
+        """
+        Compute value estimate(s) given the observation(s)
+
+        Parameters:
+        ----------
+
+        observation     observation data (either single or a batch)
+
+        **extra_feed    additional data such as state or mask (names of the arguments should match the ones in constructor, see __init__)
+
+        Returns:
+        -------
+        value estimate
+        """
+        return self._evaluate(self.vf, ob, *args, **kwargs)      
+
+    def save(self, save_path):
+        tf_util.save_state(save_path, sess=self.sess)
+
+    def load(self, load_path):
+        tf_util.load_state(load_path, sess=self.sess)
+  
+def build_policy(env, policy_network, value_network=None,  normalize_observations=False, estimate_q=False, **policy_kwargs):
+    if isinstance(policy_network, str):
+        network_type = policy_network
+        policy_network = get_network_builder(network_type)(**policy_kwargs)
+
+    def policy_fn(nbatch=None, nsteps=None, sess=None, observ_placeholder=None):
+        ob_space = env.observation_space
+
+        X = observ_placeholder if observ_placeholder is not None else observation_placeholder(ob_space, batch_size=nbatch)
+        
+        extra_tensors = {}
+
+        if normalize_observations and X.dtype == tf.float32:
+            encoded_x, rms = _normalize_clip_observation(X)
+            extra_tensors['rms'] = rms
+        else:
+            encoded_x = X
+
+        encoded_x = encode_observation(ob_space, encoded_x)
+
+        with tf.variable_scope('pi', reuse=tf.AUTO_REUSE):
+            policy_latent, recurrent_tensors = policy_network(encoded_x)
+
+            if recurrent_tensors is not None:
+                # recurrent architecture, need a few more steps
+                nenv = nbatch // nsteps
+                assert nenv > 0, 'Bad input for recurrent policy: batch size {} smaller than nsteps {}'.format(nbatch, nsteps)
+                policy_latent, recurrent_tensors = policy_network(encoded_x, nenv)
+                extra_tensors.update(recurrent_tensors)
+
+            
+        _v_net = value_network
+
+        if _v_net is None or _v_net == 'shared':
+            vf_latent = policy_latent
+        else:
+            if _v_net == 'copy':
+                _v_net = policy_network
+            else:
+                assert callable(_v_net)
+ 
+            with tf.variable_scope('vf', reuse=tf.AUTO_REUSE):
+                vf_latent, _ = _v_net(encoded_x)
+        
+        policy = PolicyWithValue(
+            env=env,
+            observations=X,
+            latent=policy_latent,
+            vf_latent=vf_latent,
+            sess=sess,
+            estimate_q=estimate_q,
+            **extra_tensors
+        )
+        return policy
+
+    return policy_fn
+
+
+def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
+    rms = RunningMeanStd(shape=x.shape[1:])
+    norm_x = tf.clip_by_value((x - rms.mean) / rms.std, min(clip_range), max(clip_range))
+    return norm_x, rms
+    
--- a/baselines/common/retro_wrappers.py
+++ b/baselines/common/retro_wrappers.py
@@ -0,0 +1,293 @@
+# flake8: noqa F403, F405
+from .atari_wrappers import *
+import numpy as np
+import gym
+
+class TimeLimit(gym.Wrapper):
+    def __init__(self, env, max_episode_steps=None):
+        super(TimeLimit, self).__init__(env)
+        self._max_episode_steps = max_episode_steps
+        self._elapsed_steps = 0
+
+    def step(self, ac):
+        observation, reward, done, info = self.env.step(ac)
+        self._elapsed_steps += 1
+        if self._max_episode_steps is not None and self._elapsed_steps >= self._max_episode_steps:
+            done = True
+            info['TimeLimit.truncated'] = True
+        return observation, reward, done, info
+
+    def reset(self, **kwargs):
+        self._elapsed_steps = 0
+        return self.env.reset(**kwargs)
+
+class StochasticFrameSkip(gym.Wrapper):
+    def __init__(self, env, n, stickprob):
+        gym.Wrapper.__init__(self, env)
+        self.n = n
+        self.stickprob = stickprob
+        self.curac = None
+        self.rng = np.random.RandomState()
+        self.supports_want_render = hasattr(env, "supports_want_render")
+
+    def reset(self, **kwargs):
+        self.curac = None
+        return self.env.reset(**kwargs)
+
+    def step(self, ac):
+        done = False
+        totrew = 0
+        for i in range(self.n):
+            # First step after reset, use action
+            if self.curac is None:
+                self.curac = ac
+            # First substep, delay with probability=stickprob
+            elif i==0:
+                if self.rng.rand() > self.stickprob:
+                    self.curac = ac
+            # Second substep, new action definitely kicks in
+            elif i==1:
+                self.curac = ac
+            if self.supports_want_render and i<self.n-1:
+                ob, rew, done, info = self.env.step(self.curac, want_render=False)
+            else:
+                ob, rew, done, info = self.env.step(self.curac)
+            totrew += rew
+            if done: break
+        return ob, totrew, done, info
+
+    def seed(self, s):
+        self.rng.seed(s)
+
+class PartialFrameStack(gym.Wrapper):
+    def __init__(self, env, k, channel=1):
+        """
+        Stack one channel (channel keyword) from previous frames
+        """
+        gym.Wrapper.__init__(self, env)
+        shp = env.observation_space.shape
+        self.channel = channel
+        self.observation_space = gym.spaces.Box(low=0, high=255,
+            shape=(shp[0], shp[1], shp[2] + k - 1),
+            dtype=env.observation_space.dtype)
+        self.k = k
+        self.frames = deque([], maxlen=k)
+        shp = env.observation_space.shape
+
+    def reset(self):
+        ob = self.env.reset()
+        assert ob.shape[2] > self.channel
+        for _ in range(self.k):
+            self.frames.append(ob)
+        return self._get_ob()
+
+    def step(self, ac):
+        ob, reward, done, info = self.env.step(ac)
+        self.frames.append(ob)
+        return self._get_ob(), reward, done, info
+
+    def _get_ob(self):
+        assert len(self.frames) == self.k
+        return np.concatenate([frame if i==self.k-1 else frame[:,:,self.channel:self.channel+1]
+            for (i, frame) in enumerate(self.frames)], axis=2)
+
+class Downsample(gym.ObservationWrapper):
+    def __init__(self, env, ratio):
+        """
+        Downsample images by a factor of ratio
+        """
+        gym.ObservationWrapper.__init__(self, env)
+        (oldh, oldw, oldc) = env.observation_space.shape
+        newshape = (oldh//ratio, oldw//ratio, oldc)
+        self.observation_space = spaces.Box(low=0, high=255,
+            shape=newshape, dtype=np.uint8)
+
+    def observation(self, frame):
+        height, width, _ = self.observation_space.shape
+        frame = cv2.resize(frame, (width, height), interpolation=cv2.INTER_AREA)
+        if frame.ndim == 2:
+            frame = frame[:,:,None]
+        return frame
+
+class Rgb2gray(gym.ObservationWrapper):
+    def __init__(self, env):
+        """
+        Convert RGB frames to single-channel grayscale
+        """
+        gym.ObservationWrapper.__init__(self, env)
+        (oldh, oldw, _oldc) = env.observation_space.shape
+        self.observation_space = spaces.Box(low=0, high=255,
+            shape=(oldh, oldw, 1), dtype=np.uint8)
+
+    def observation(self, frame):
+        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
+        return frame[:,:,None]
+
+
+class MovieRecord(gym.Wrapper):
+    def __init__(self, env, savedir, k):
+        gym.Wrapper.__init__(self, env)
+        self.savedir = savedir
+        self.k = k
+        self.epcount = 0
+    def reset(self):
+        if self.epcount % self.k == 0:
+            print('saving movie this episode', self.savedir)
+            self.env.unwrapped.movie_path = self.savedir
+        else:
+            print('not saving this episode')
+            self.env.unwrapped.movie_path = None
+            self.env.unwrapped.movie = None
+        self.epcount += 1
+        return self.env.reset()
+
+class AppendTimeout(gym.Wrapper):
+    def __init__(self, env):
+        gym.Wrapper.__init__(self, env)
+        self.action_space = env.action_space
+        self.timeout_space = gym.spaces.Box(low=np.array([0.0]), high=np.array([1.0]), dtype=np.float32)
+        self.original_os = env.observation_space
+        if isinstance(self.original_os, gym.spaces.Dict):
+            import copy
+            ordered_dict = copy.deepcopy(self.original_os.spaces)
+            ordered_dict['value_estimation_timeout'] = self.timeout_space
+            self.observation_space = gym.spaces.Dict(ordered_dict)
+            self.dict_mode = True
+        else:
+            self.observation_space = gym.spaces.Dict({
+                'original': self.original_os,
+                'value_estimation_timeout': self.timeout_space
+                })
+            self.dict_mode = False
+        self.ac_count = None
+        while 1:
+            if not hasattr(env, "_max_episode_steps"):  # Looking for TimeLimit wrapper that has this field
+                env = env.env
+                continue
+            break
+        self.timeout = env._max_episode_steps
+
+    def step(self, ac):
+        self.ac_count += 1
+        ob, rew, done, info = self.env.step(ac)
+        return self._process(ob), rew, done, info
+
+    def reset(self):
+        self.ac_count = 0
+        return self._process(self.env.reset())
+
+    def _process(self, ob):
+        fracmissing = 1 - self.ac_count / self.timeout
+        if self.dict_mode:
+            ob['value_estimation_timeout'] = fracmissing
+            return ob
+        return { 'original': ob, 'value_estimation_timeout': fracmissing }
+
+class StartDoingRandomActionsWrapper(gym.Wrapper):
+    """
+    Warning: can eat info dicts, not good if you depend on them
+    """
+    def __init__(self, env, max_random_steps, on_startup=True, every_episode=False):
+        gym.Wrapper.__init__(self, env)
+        self.on_startup = on_startup
+        self.every_episode = every_episode
+        self.random_steps = max_random_steps
+        self.last_obs = None
+        if on_startup:
+            self.some_random_steps()
+
+    def some_random_steps(self):
+        self.last_obs = self.env.reset()
+        n = np.random.randint(self.random_steps)
+        #print("running for random %i frames" % n)
+        for _ in range(n):
+            self.last_obs, _, done, _ = self.env.step(self.env.action_space.sample())
+            if done: self.last_obs = self.env.reset()
+
+    def reset(self):
+        return self.last_obs
+
+    def step(self, a):
+        self.last_obs, rew, done, info = self.env.step(a)
+        if done:
+            self.last_obs = self.env.reset()
+            if self.every_episode:
+                self.some_random_steps()
+        return self.last_obs, rew, done, info
+
+def make_retro(*, game, state, max_episode_steps, **kwargs):
+    import retro
+    env = retro.make(game, state, **kwargs)
+    env = StochasticFrameSkip(env, n=4, stickprob=0.25)
+    if max_episode_steps is not None:
+        env = TimeLimit(env, max_episode_steps=max_episode_steps)
+    return env
+
+def wrap_deepmind_retro(env, scale=True, frame_stack=4):
+    """
+    Configure environment for retro games, using config similar to DeepMind-style Atari in wrap_deepmind
+    """
+    env = WarpFrame(env)
+    env = ClipRewardEnv(env)
+    env = FrameStack(env, frame_stack)
+    if scale:
+        env = ScaledFloatFrame(env)
+    return env
+
+class SonicDiscretizer(gym.ActionWrapper):
+    """
+    Wrap a gym-retro environment and make it use discrete
+    actions for the Sonic game.
+    """
+    def __init__(self, env):
+        super(SonicDiscretizer, self).__init__(env)
+        buttons = ["B", "A", "MODE", "START", "UP", "DOWN", "LEFT", "RIGHT", "C", "Y", "X", "Z"]
+        actions = [['LEFT'], ['RIGHT'], ['LEFT', 'DOWN'], ['RIGHT', 'DOWN'], ['DOWN'],
+                   ['DOWN', 'B'], ['B']]
+        self._actions = []
+        for action in actions:
+            arr = np.array([False] * 12)
+            for button in action:
+                arr[buttons.index(button)] = True
+            self._actions.append(arr)
+        self.action_space = gym.spaces.Discrete(len(self._actions))
+
+    def action(self, a): # pylint: disable=W0221
+        return self._actions[a].copy()
+
+class RewardScaler(gym.RewardWrapper):
+    """
+    Bring rewards to a reasonable scale for PPO.
+    This is incredibly important and affects performance
+    drastically.
+    """
+    def __init__(self, env, scale=0.01):
+        super(RewardScaler, self).__init__(env)
+        self.scale = scale
+
+    def reward(self, reward):
+        return reward * self.scale
+
+class AllowBacktracking(gym.Wrapper):
+    """
+    Use deltas in max(X) as the reward, rather than deltas
+    in X. This way, agents are not discouraged too heavily
+    from exploring backwards if there is no way to advance
+    head-on in the level.
+    """
+    def __init__(self, env):
+        super(AllowBacktracking, self).__init__(env)
+        self._cur_x = 0
+        self._max_x = 0
+
+    def reset(self, **kwargs): # pylint: disable=E0202
+        self._cur_x = 0
+        self._max_x = 0
+        return self.env.reset(**kwargs)
+
+    def step(self, action): # pylint: disable=E0202
+        obs, rew, done, info = self.env.step(action)
+        self._cur_x += rew
+        rew = max(0, self._cur_x - self._max_x)
+        self._max_x = max(self._max_x, self._cur_x)
+        return obs, rew, done, info
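
Review note on AllowBacktracking: the reward transformation above can be illustrated without gym. The snippet below is a minimal sketch, assuming the per-step reward is the change in horizontal position; `backtracking_rewards` is a hypothetical helper written for this note, not part of baselines.

```python
def backtracking_rewards(deltas):
    """Turn per-step position deltas into rewards for new maximum progress only."""
    cur_x = 0.0   # accumulated position, mirrors self._cur_x
    max_x = 0.0   # furthest position reached so far, mirrors self._max_x
    shaped = []
    for delta in deltas:
        cur_x += delta
        # Pay out only the part of the move that exceeds the previous maximum.
        shaped.append(max(0.0, cur_x - max_x))
        max_x = max(max_x, cur_x)
    return shaped


example = backtracking_rewards([5, -3, 2, 4, -1])
```

Moving backwards or re-covering old ground yields zero reward rather than a penalty, so the shaped rewards sum to the furthest position reached.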
--- a/baselines/common/runners.py
+++ b/baselines/common/runners.py
@@ -0,0 +1,19 @@
+import numpy as np
+from abc import ABC, abstractmethod
+
+class AbstractEnvRunner(ABC):
+    def __init__(self, *, env, model, nsteps):
+        self.env = env
+        self.model = model
+        self.nenv = nenv = env.num_envs if hasattr(env, 'num_envs') else 1
+        self.batch_ob_shape = (nenv*nsteps,) + env.observation_space.shape
+        self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
+        self.obs[:] = env.reset()
+        self.nsteps = nsteps
+        self.states = model.initial_state
+        self.dones = [False for _ in range(nenv)]
+
+    @abstractmethod
+    def run(self):
+        raise NotImplementedError
+
--- a/baselines/common/running_mean_std.py
+++ b/baselines/common/running_mean_std.py
@@ -1,4 +1,7 @@
+import tensorflow as tf
 import numpy as np
+from baselines.common.tf_util import get_session
+
 class RunningMeanStd(object):
    # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
    def __init__(self, epsilon=1e-4, shape=()):
@@ -13,20 +16,71 @@ class RunningMeanStd(object):
        self.update_from_moments(batch_mean, batch_var, batch_count)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
-        delta = batch_mean - self.mean
-        tot_count = self.count + batch_count
+        self.mean, self.var, self.count = update_mean_var_count_from_moments(
+            self.mean, self.var, self.count, batch_mean, batch_var, batch_count)

-        new_mean = self.mean + delta * batch_count / tot_count        
-        m_a = self.var * (self.count)
-        m_b = batch_var * (batch_count)
-        M2 = m_a + m_b + np.square(delta) * self.count * batch_count / (self.count + batch_count)
-        new_var = M2 / (self.count + batch_count)
+def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
+    delta = batch_mean - mean
+    tot_count = count + batch_count
+
+    new_mean = mean + delta * batch_count / tot_count        
+    m_a = var * count
+    m_b = batch_var * batch_count
+    M2 = m_a + m_b + np.square(delta) * count * batch_count / (count + batch_count)
+    new_var = M2 / (count + batch_count)
+    new_count = batch_count + count
+    
+    return new_mean, new_var, new_count
+    
+
+class TfRunningMeanStd(object):
+    # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
+    '''
+    TensorFlow variables-based implementation of computing running mean and std
+    Benefit of this implementation is that it can be saved / loaded together with the tensorflow model
+    '''
+    def __init__(self, epsilon=1e-4, shape=(), scope=''):
+        sess = get_session()
+
+        self._new_mean = tf.placeholder(shape=shape, dtype=tf.float64)
+        self._new_var = tf.placeholder(shape=shape, dtype=tf.float64)
+        self._new_count = tf.placeholder(shape=(), dtype=tf.float64)
+
+        
+        with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
+            self._mean  = tf.get_variable('mean',  initializer=np.zeros(shape, 'float64'),      dtype=tf.float64)
+            self._var   = tf.get_variable('std',   initializer=np.ones(shape, 'float64'),       dtype=tf.float64)    
+            self._count = tf.get_variable('count', initializer=np.full((), epsilon, 'float64'), dtype=tf.float64)
+
+        self.update_ops = tf.group([
+            self._var.assign(self._new_var),
+            self._mean.assign(self._new_mean),
+            self._count.assign(self._new_count)
+        ])
+
+        sess.run(tf.variables_initializer([self._mean, self._var, self._count]))
+        self.sess = sess
+        self._set_mean_var_count()
+    
+    def _set_mean_var_count(self):
+        self.mean, self.var, self.count = self.sess.run([self._mean, self._var, self._count])                    
+         
+    def update(self, x):
+        batch_mean = np.mean(x, axis=0)
+        batch_var = np.var(x, axis=0)
+        batch_count = x.shape[0]
+
+        new_mean, new_var, new_count = update_mean_var_count_from_moments(self.mean, self.var, self.count, batch_mean, batch_var, batch_count)
+
+        self.sess.run(self.update_ops, feed_dict={
+            self._new_mean: new_mean,
+            self._new_var: new_var, 
+            self._new_count: new_count
+        })
+
+        self._set_mean_var_count()

-        new_count = batch_count + self.count
        
-        self.mean = new_mean
-        self.var = new_var
-        self.count = new_count    

 def test_runningmeanstd():
    for (x1, x2, x3) in [
@@ -43,4 +97,91 @@ def test_runningmeanstd():
        rms.update(x3)
        ms2 = [rms.mean, rms.var]

-        assert np.allclose(ms1, ms2)
+        np.testing.assert_allclose(ms1, ms2)
+
+def test_tf_runningmeanstd():
+    for (x1, x2, x3) in [
+        (np.random.randn(3), np.random.randn(4), np.random.randn(5)),
+        (np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
+        ]:
+
+        rms = TfRunningMeanStd(epsilon=0.0, shape=x1.shape[1:], scope='running_mean_std' + str(np.random.randint(0, 128)))
+
+        x = np.concatenate([x1, x2, x3], axis=0)
+        ms1 = [x.mean(axis=0), x.var(axis=0)]
+        rms.update(x1)
+        rms.update(x2)
+        rms.update(x3)
+        ms2 = [rms.mean, rms.var]
+
+        np.testing.assert_allclose(ms1, ms2)
+
+
+def profile_tf_runningmeanstd():
+    import time
+    from baselines.common import tf_util
+
+    tf_util.get_session( config=tf.ConfigProto(
+        inter_op_parallelism_threads=1,
+        intra_op_parallelism_threads=1,
+        allow_soft_placement=True
+    ))
+
+    x = np.random.random((376,))
+
+    n_trials = 10000
+    rms = RunningMeanStd()
+    tfrms = TfRunningMeanStd()
+
+    tic1 = time.time()
+    for _ in range(n_trials):
+        rms.update(x)
+
+    tic2 = time.time()
+    for _ in range(n_trials):
+        tfrms.update(x)
+
+    tic3 = time.time()
+
+    print('rms update time ({} trials): {} s'.format(n_trials, tic2 - tic1))
+    print('tfrms update time ({} trials): {} s'.format(n_trials, tic3 - tic2))
+    
+
+    tic1 = time.time()
+    for _ in range(n_trials):
+        z1 = rms.mean
+
+    tic2 = time.time()
+    for _ in range(n_trials):
+        z2 = tfrms.mean
+
+    assert z1 == z2
+
+    tic3 = time.time()
+
+    print('rms get mean time ({} trials): {} s'.format(n_trials, tic2 - tic1))
+    print('tfrms get mean time ({} trials): {} s'.format(n_trials, tic3 - tic2))
+         
+    
+
+    '''
+    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) #pylint: disable=E1101
+    run_metadata = tf.RunMetadata()
+    profile_opts = dict(options=options, run_metadata=run_metadata)
+
+    
+
+    from tensorflow.python.client import timeline
+    fetched_timeline = timeline.Timeline(run_metadata.step_stats) #pylint: disable=E1101
+    chrome_trace = fetched_timeline.generate_chrome_trace_format()
+    outfile = '/tmp/timeline.json'
+    with open(outfile, 'wt') as f: 
+        f.write(chrome_trace)
+    print(f'Successfully saved profile to {outfile}. Exiting.')
+    exit(0)
+    '''
+
+
+
+if __name__ == '__main__':
+   profile_tf_runningmeanstd() 
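
Review note on update_mean_var_count_from_moments: the merge rule is the parallel variance algorithm referenced in the class comment. The snippet below is a sketch in plain Python (no NumPy or TensorFlow) that checks it against a direct computation; `merge_moments` and `summarize` are hypothetical names used only for this illustration.

```python
def merge_moments(mean, var, count, batch_mean, batch_var, batch_count):
    """Combine two (mean, population variance, count) summaries into one."""
    delta = batch_mean - mean
    total = count + batch_count
    new_mean = mean + delta * batch_count / total
    # Squared deviations of both parts, plus a correction for the gap between their means.
    m2 = var * count + batch_var * batch_count + delta ** 2 * count * batch_count / total
    return new_mean, m2 / total, total


def summarize(xs):
    """Return (mean, population variance, count) of a non-empty list of floats."""
    n = len(xs)
    mu = sum(xs) / n
    return mu, sum((x - mu) ** 2 for x in xs) / n, n


first = [1.0, 2.0, 3.0]
second = [10.0, 20.0, 30.0, 40.0]
merged = merge_moments(*summarize(first), *summarize(second))
direct = summarize(first + second)
```

Here `merged` and `direct` should agree up to floating-point error, which is the same property the two unit tests in this file check with `np.testing.assert_allclose`.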
--- a/baselines/common/running_stat.py
+++ b/baselines/common/running_stat.py
--- a/baselines/common/segment_tree.py
+++ b/baselines/common/segment_tree.py
@@ -12,10 +12,9 @@ class SegmentTree(object):

            a) setting item's value is slightly slower.
               It is O(lg capacity) instead of O(1).
-            b) user has access to an efficient `reduce`
-               operation which reduces `operation` over
-               a contiguous subsequence of items in the
-               array.
+            b) user has access to an efficient ( O(log segment size) )
+               `reduce` operation which reduces `operation` over
+               a contiguous subsequence of items in the array.

        Paramters
        ---------
@@ -23,8 +22,8 @@ class SegmentTree(object):
            Total size of the array - must be a power of two.
        operation: lambda obj, obj -> obj
            and operation for combining elements (eg. sum, max)
-            must for a mathematical group together with the set of
-            possible values for array elements.
+            must form a mathematical group together with the set of
+            possible values for array elements (i.e. be associative)
        neutral_element: obj
            neutral element for the operation above. eg. float('-inf')
            for max and 0 for sum.
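
Review note on the SegmentTree docstring: the logarithmic-time `reduce` it describes can be sketched as follows. This is a minimal, independent illustration under the docstring's stated assumptions (power-of-two capacity, associative operation, neutral element); `TinySegmentTree` is a hypothetical class and not the implementation in baselines/common/segment_tree.py.

```python
import operator


class TinySegmentTree:
    def __init__(self, capacity, operation, neutral_element):
        assert capacity > 0 and capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.capacity = capacity
        self.operation = operation
        self.neutral = neutral_element
        # Leaves live at indices [capacity, 2 * capacity); node i has children 2i and 2i + 1.
        self.values = [neutral_element] * (2 * capacity)

    def set(self, idx, value):
        """Write one item and refresh its ancestors: O(log capacity)."""
        pos = idx + self.capacity
        self.values[pos] = value
        pos //= 2
        while pos >= 1:
            self.values[pos] = self.operation(self.values[2 * pos], self.values[2 * pos + 1])
            pos //= 2

    def reduce(self, start, end):
        """Apply the operation over items[start:end] (end exclusive) in O(log n)."""
        left_acc = self.neutral
        right_acc = self.neutral
        lo = start + self.capacity
        hi = end + self.capacity
        while lo < hi:
            if lo & 1:  # lo is a right child: take it and step past it
                left_acc = self.operation(left_acc, self.values[lo])
                lo += 1
            if hi & 1:  # hi is exclusive: step back onto a left child and take it
                hi -= 1
                right_acc = self.operation(self.values[hi], right_acc)
            lo //= 2
            hi //= 2
        return self.operation(left_acc, right_acc)


sum_tree = TinySegmentTree(8, operator.add, 0)
max_tree = TinySegmentTree(8, max, float('-inf'))
for i, v in enumerate([3, 1, 4, 1, 5, 9, 2, 6]):
    sum_tree.set(i, v)
    max_tree.set(i, v)
```

The left and right accumulators are kept separate so the combination order matches the array order, which only requires the operation to be associative, not commutative.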
--- a/baselines/common/tests/__init__.py
+++ b/baselines/common/tests/__init__.py
--- a/baselines/common/tests/envs/__init__.py
+++ b/baselines/common/tests/envs/__init__.py
--- a/baselines/common/tests/envs/fixed_sequence_env.py
+++ b/baselines/common/tests/envs/fixed_sequence_env.py
@@ -0,0 +1,44 @@
+import numpy as np
+from gym import Env
+from gym.spaces import Discrete
+
+
+class FixedSequenceEnv(Env):
+    def __init__(
+            self,
+            n_actions=10,
+            seed=0,
+            episode_len=100
+    ):
+        self.np_random = np.random.RandomState()
+        self.np_random.seed(seed)
+        self.sequence = [self.np_random.randint(0, n_actions-1) for _ in range(episode_len)]
+
+        self.action_space = Discrete(n_actions)
+        self.observation_space = Discrete(1)
+
+        self.episode_len = episode_len
+        self.time = 0
+        self.reset()
+
+    def reset(self):
+        self.time = 0
+        return 0
+
+    def step(self, actions):
+        rew = self._get_reward(actions)
+        self._choose_next_state()
+        done = False
+        if self.episode_len and self.time >= self.episode_len:
+            rew = 0
+            done = True
+
+        return 0, rew, done, {}
+
+    def _choose_next_state(self):
+        self.time += 1
+
+    def _get_reward(self, actions):
+        return 1 if actions == self.sequence[self.time] else 0
+        
+
--- a/baselines/common/tests/envs/identity_env.py
+++ b/baselines/common/tests/envs/identity_env.py
@@ -0,0 +1,70 @@
+import numpy as np
+from abc import abstractmethod
+from gym import Env
+from gym.spaces import Discrete, Box
+
+
+class IdentityEnv(Env):
+    def __init__(
+            self,
+            episode_len=None
+    ):
+
+        self.episode_len = episode_len
+        self.time = 0
+        self.reset()
+
+    def reset(self):
+        self._choose_next_state()
+        self.time = 0
+        self.observation_space = self.action_space
+
+        return self.state
+
+    def step(self, actions):
+        rew = self._get_reward(actions)
+        self._choose_next_state()
+        done = False
+        if self.episode_len and self.time >= self.episode_len:
+            rew = 0
+            done = True
+
+        return self.state, rew, done, {}
+
+    def _choose_next_state(self):
+        self.state = self.action_space.sample()
+        self.time += 1
+
+    @abstractmethod
+    def _get_reward(self, actions):
+        raise NotImplementedError
+
+
+class DiscreteIdentityEnv(IdentityEnv):
+    def __init__(
+            self,
+            dim,
+            episode_len=None,
+    ):
+
+        self.action_space = Discrete(dim)
+        super().__init__(episode_len=episode_len)
+
+    def _get_reward(self, actions):
+        return 1 if self.state == actions else 0
+
+
+class BoxIdentityEnv(IdentityEnv):
+    def __init__(
+            self,
+            shape,
+            episode_len=None,
+    ):
+
+        self.action_space = Box(low=-1.0, high=1.0, shape=shape)
+        super().__init__(episode_len=episode_len)
+
+    def _get_reward(self, actions):
+        diff = actions - self.state
+        diff = diff[:]
+        return -0.5 * np.dot(diff, diff)
--- a/baselines/common/tests/envs/mnist_env.py
+++ b/baselines/common/tests/envs/mnist_env.py
@@ -0,0 +1,70 @@
+import os.path as osp
+import numpy as np
+import tempfile
+import filelock
+from gym import Env
+from gym.spaces import Discrete, Box
+
+
+
+class MnistEnv(Env):
+    def __init__(
+            self,
+            seed=0,
+            episode_len=None,
+            no_images=None
+    ):
+        from tensorflow.examples.tutorials.mnist import input_data
+        # we could use temporary directory for this with a context manager and 
+        # TemporaryDirectory, but then each test that uses mnist would re-download the data
+        # this way the data is not cleaned up, but we only download it once per machine
+        mnist_path = osp.join(tempfile.gettempdir(), 'MNIST_data')
+        with filelock.FileLock(mnist_path + '.lock'):
+            self.mnist = input_data.read_data_sets(mnist_path)
+
+        self.np_random = np.random.RandomState()
+        self.np_random.seed(seed)
+
+        self.observation_space = Box(low=0.0, high=1.0, shape=(28,28,1))
+        self.action_space = Discrete(10)
+        self.episode_len = episode_len
+        self.time = 0
+        self.no_images = no_images
+
+        self.train_mode()
+        self.reset()
+        
+    def reset(self):
+        self._choose_next_state()
+        self.time = 0
+
+        return self.state[0]
+
+    def step(self, actions):
+        rew = self._get_reward(actions)
+        self._choose_next_state()
+        done = False
+        if self.episode_len and self.time >= self.episode_len:
+            rew = 0
+            done = True
+
+        return self.state[0], rew, done, {}
+
+    def train_mode(self):
+        self.dataset = self.mnist.train
+
+    def test_mode(self):
+        self.dataset = self.mnist.test
+
+    def _choose_next_state(self):
+        max_index = (self.no_images if self.no_images is not None else self.dataset.num_examples) - 1
+        index = self.np_random.randint(0, max_index)
+        image = self.dataset.images[index].reshape(28,28,1)*255
+        label = self.dataset.labels[index]
+        self.state = (image, label)
+        self.time += 1
+
+    def _get_reward(self, actions):
+        return 1 if self.state[1] == actions else 0
+
+
--- a/baselines/common/tests/test_cartpole.py
+++ b/baselines/common/tests/test_cartpole.py
@@ -0,0 +1,40 @@
+import pytest
+import gym
+
+from baselines.run import get_learn_function
+from baselines.common.tests.util import reward_per_episode_test
+
+common_kwargs = dict(
+    total_timesteps=30000,
+    network='mlp',
+    gamma=1.0,
+    seed=0,
+)
+   
+learn_kwargs = {
+    'a2c' : dict(nsteps=32, value_network='copy', lr=0.05),
+    'acktr': dict(nsteps=32, value_network='copy'),
+    'deepq': {},
+    'ppo2': dict(value_network='copy'),
+    'trpo_mpi': {}
+}
+
+@pytest.mark.slow
+@pytest.mark.parametrize("alg", learn_kwargs.keys())
+def test_cartpole(alg):
+    '''
+    Test if the algorithm (with an mlp policy)
+    can learn to balance the cartpole
+    '''
+
+    kwargs = common_kwargs.copy()
+    kwargs.update(learn_kwargs[alg])
+
+    learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
+    def env_fn(): 
+        
+        env = gym.make('CartPole-v0')
+        env.seed(0)
+        return env
+
+    reward_per_episode_test(env_fn, learn_fn, 100)
--- a/baselines/common/tests/test_fixed_sequence.py
+++ b/baselines/common/tests/test_fixed_sequence.py
@@ -0,0 +1,52 @@
+import pytest
+from baselines.common.tests.envs.fixed_sequence_env import FixedSequenceEnv
+
+from baselines.common.tests.util import simple_test
+from baselines.run import get_learn_function
+
+common_kwargs = dict(
+    seed=0,
+    total_timesteps=20000,
+    nlstm=64
+)
+    
+learn_kwargs = {
+    'a2c': {},
+    'ppo2': dict(nsteps=10, ent_coef=0.0, nminibatches=1),
+    # TODO enable sequential models for trpo_mpi (proper handling of nbatch and nsteps)
+    # github issue: https://github.com/openai/baselines/issues/188
+    # 'trpo_mpi': lambda e, p: trpo_mpi.learn(policy_fn=p(env=e), env=e, max_timesteps=30000, timesteps_per_batch=100, cg_iters=10, gamma=0.9, lam=1.0, max_kl=0.001)
+}
+
+
+alg_list = learn_kwargs.keys()
+rnn_list = ['lstm', 'tflstm', 'tflstm_static']
+
+@pytest.mark.slow
+@pytest.mark.parametrize("alg", alg_list)
+@pytest.mark.parametrize("rnn", rnn_list)
+def test_fixed_sequence(alg, rnn):
+    '''
+    Test if the algorithm (with a given recurrent policy)
+    can learn a fixed sequence of actions (i.e. output the expected action at each time step)
+    '''
+
+    kwargs = dict(learn_kwargs[alg])  # copy so the shared module-level dict is not mutated
+    kwargs.update(common_kwargs)
+
+    episode_len = 5
+    env_fn = lambda: FixedSequenceEnv(10, episode_len=episode_len)
+    learn = lambda e: get_learn_function(alg)(
+        env=e, 
+        network=rnn,
+        **kwargs
+    )
+
+    simple_test(env_fn, learn, 0.3)
+
+
+if __name__ == '__main__':
+    test_fixed_sequence('ppo2', 'tflstm')
+
+    
+
--- a/baselines/common/tests/test_identity.py
+++ b/baselines/common/tests/test_identity.py
@@ -0,0 +1,55 @@
+import pytest
+from baselines.common.tests.envs.identity_env import DiscreteIdentityEnv, BoxIdentityEnv
+from baselines.run import get_learn_function
+from baselines.common.tests.util import simple_test
+
+common_kwargs = dict(
+    total_timesteps=30000,
+    network='mlp',
+    gamma=0.9,
+    seed=0,
+)
+   
+learn_kwargs = {
+    'a2c' : {},
+    'acktr': {},
+    'deepq': {},
+    'ppo2': dict(lr=1e-3, nsteps=64, ent_coef=0.0),
+    'trpo_mpi': dict(timesteps_per_batch=100, cg_iters=10, gamma=0.9, lam=1.0, max_kl=0.01)
+}
+
+
+@pytest.mark.slow
+@pytest.mark.parametrize("alg", learn_kwargs.keys())
+def test_discrete_identity(alg):
+    '''
+    Test if the algorithm (with an mlp policy)
+    can learn an identity transformation (i.e. return observation as an action)
+    '''
+
+    kwargs = dict(learn_kwargs[alg])  # copy so the shared module-level dict is not mutated
+    kwargs.update(common_kwargs)
+
+    learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
+    env_fn = lambda: DiscreteIdentityEnv(10, episode_len=100)
+    simple_test(env_fn, learn_fn, 0.9)
+
+@pytest.mark.slow
+@pytest.mark.parametrize("alg", ['a2c', 'ppo2', 'trpo_mpi'])
+def test_continuous_identity(alg):
+    '''
+    Test if the algorithm (with an mlp policy)
+    can learn an identity transformation (i.e. return observation as an action)
+    to a required precision
+    '''
+
+    kwargs = dict(learn_kwargs[alg])  # copy so the shared module-level dict is not mutated
+    kwargs.update(common_kwargs)
+    learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
+
+    env_fn = lambda: BoxIdentityEnv((1,), episode_len=100)
+    simple_test(env_fn, learn_fn, -0.1)
+
+if __name__ == '__main__':
+    test_continuous_identity('a2c')    
+
--- a/baselines/common/tests/test_mnist.py
+++ b/baselines/common/tests/test_mnist.py
@@ -0,0 +1,50 @@
+import pytest
+
+# from baselines.acer import acer_simple as acer
+from baselines.common.tests.envs.mnist_env import MnistEnv
+from baselines.common.tests.util import simple_test
+from baselines.run import get_learn_function
+
+
+# TODO investigate a2c and ppo2 failures - is it due to bad hyperparameters for this problem?  
+# GitHub issue https://github.com/openai/baselines/issues/189
+common_kwargs = {
+    'seed': 0,
+    'network':'cnn',
+    'gamma':0.9,
+    'pad':'SAME'
+}
+
+learn_args = {
+    'a2c': dict(total_timesteps=50000),
+    # TODO need to resolve inference (step) API differences for acer; also slow
+    # 'acer': dict(seed=0, total_timesteps=1000),
+    'deepq': dict(total_timesteps=5000),
+    'acktr': dict(total_timesteps=30000),
+    'ppo2': dict(total_timesteps=50000, lr=1e-3, nsteps=128, ent_coef=0.0),
+    'trpo_mpi': dict(total_timesteps=80000, timesteps_per_batch=100, cg_iters=10, lam=1.0, max_kl=0.001)
+}
+
+ 
+# Tests pass, but are too slow on Travis. The same algorithms are covered
+# by other tests with less compute-hungry networks and by benchmarks.
+@pytest.mark.skip 
+@pytest.mark.slow
+@pytest.mark.parametrize("alg", learn_args.keys())
+def test_mnist(alg):
+    '''
+    Test if the algorithm can learn to classify MNIST digits. 
+    Uses CNN policy. 
+    '''
+    
+    learn_kwargs = dict(learn_args[alg])  # copy so the shared module-level dict is not mutated
+    learn_kwargs.update(common_kwargs)
+    
+    learn = get_learn_function(alg)
+    learn_fn = lambda e: learn(env=e, **learn_kwargs)
+    env_fn = lambda: MnistEnv(seed=0, episode_len=100)
+
+    simple_test(env_fn, learn_fn, 0.6)
+
+if __name__ == '__main__':
+    test_mnist('deepq')
--- a/baselines/common/tests/test_serialization.py
+++ b/baselines/common/tests/test_serialization.py
@@ -0,0 +1,97 @@
+import os
+import tempfile
+import pytest
+import tensorflow as tf
+import numpy as np
+
+from baselines.common.tests.envs.mnist_env import MnistEnv
+from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
+from baselines.run import get_learn_function
+from baselines.common.tf_util import make_session, get_session
+
+from functools import partial
+
+
+learn_kwargs = {
+    'deepq': {},
+    'a2c': {}, 
+    'acktr': {},
+    'ppo2': {'nminibatches': 1, 'nsteps': 10},
+    'trpo_mpi': {},
+}
+
+network_kwargs = {
+    'mlp': {}, 
+    'cnn': {'pad': 'SAME'}, 
+    'lstm': {},
+    'cnn_lnlstm': {'pad': 'SAME'}
+}
+
+
+@pytest.mark.parametrize("learn_fn", learn_kwargs.keys())
+@pytest.mark.parametrize("network_fn", network_kwargs.keys())
+def test_serialization(learn_fn, network_fn):
+    '''
+    Test if the trained model can be serialized 
+    '''
+
+    
+    if network_fn.endswith('lstm') and learn_fn in ['acktr', 'trpo_mpi', 'deepq']:
+        # TODO make acktr work with recurrent policies
+        # and test
+        # github issue: https://github.com/openai/baselines/issues/194
+        return
+
+    env = DummyVecEnv([lambda: MnistEnv(10, episode_len=100)])
+    ob = env.reset().copy()
+    learn = get_learn_function(learn_fn)
+
+    kwargs = {}
+    kwargs.update(network_kwargs[network_fn])
+    kwargs.update(learn_kwargs[learn_fn])
+
+
+    learn = partial(learn, env=env, network=network_fn, seed=0, **kwargs)
+
+    with tempfile.TemporaryDirectory() as td:
+        model_path = os.path.join(td, 'serialization_test_model')
+
+        with tf.Graph().as_default(), make_session().as_default():
+            model = learn(total_timesteps=100)
+            model.save(model_path)
+            mean1, std1 = _get_action_stats(model, ob)
+            variables_dict1 = _serialize_variables()
+
+        with tf.Graph().as_default(), make_session().as_default():
+            model = learn(total_timesteps=0, load_path=model_path)
+            mean2, std2 = _get_action_stats(model, ob)
+            variables_dict2 = _serialize_variables()
+
+        for k, v in variables_dict1.items():
+            np.testing.assert_allclose(v, variables_dict2[k], atol=0.01,
+                err_msg='saved and loaded variable {} value mismatch'.format(k))
+
+        np.testing.assert_allclose(mean1, mean2, atol=0.5)
+        np.testing.assert_allclose(std1, std2, atol=0.5)
+
+ 
+
+def _serialize_variables():
+    sess = get_session()
+    variables = tf.trainable_variables()    
+    values = sess.run(variables)
+    return {var.name: value for var, value in zip(variables, values)}
+    
+
+def _get_action_stats(model, ob):
+    ntrials = 1000
+    if model.initial_state is None or model.initial_state == []:
+        actions = np.array([model.step(ob)[0] for _ in range(ntrials)])
+    else:
+        actions = np.array([model.step(ob, S=model.initial_state, M=[False])[0] for _ in range(ntrials)])
+
+    mean = np.mean(actions, axis=0)
+    std = np.std(actions, axis=0)
+
+    return mean, std
+
--- a/baselines/common/tests/test_tf_util.py
+++ b/baselines/common/tests/test_tf_util.py
@@ -33,7 +33,6 @@ def test_multikwargs():
            initialize()
            assert lin(2) == 6
            assert lin(2, 2) == 10
-            expt_caught = False


 if __name__ == '__main__':
--- a/baselines/common/tests/util.py
+++ b/baselines/common/tests/util.py
@@ -0,0 +1,92 @@
+import tensorflow as tf
+import numpy as np
+from gym.spaces import np_random
+from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
+from baselines.bench.monitor import Monitor
+
+N_TRIALS = 10000
+N_EPISODES = 100
+
+def simple_test(env_fn, learn_fn, min_reward_fraction, n_trials=N_TRIALS):
+    np.random.seed(0)
+    np_random.seed(0)
+
+    env = DummyVecEnv([lambda: Monitor(env_fn(), None, allow_early_resets=True)])
+
+
+    with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():
+        tf.set_random_seed(0)
+
+        model = learn_fn(env)
+
+        sum_rew = 0
+        done = True
+
+        for i in range(n_trials):
+            if done:
+                obs = env.reset()
+                state = model.initial_state
+
+            if state is not None:
+                a, v, state, _ = model.step(obs, S=state, M=[False])
+            else:
+                a, v, _, _ = model.step(obs)
+            
+            obs, rew, done, _ = env.step(a)
+            sum_rew += float(rew)
+
+        print("Reward in {} trials is {}".format(n_trials, sum_rew))
+        assert sum_rew > min_reward_fraction * n_trials, \
+            'sum of rewards {} is less than {} of the total number of trials {}'.format(sum_rew, min_reward_fraction, n_trials)
+
+
+
+def reward_per_episode_test(env_fn, learn_fn, min_avg_reward, n_trials=N_EPISODES):
+    env = DummyVecEnv([env_fn])
+
+    with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():
+        model = learn_fn(env)
+
+        # Roll out n_trials full episodes and sum the rewards within each one.
+
+        observations, actions, rewards = rollout(env, model, n_trials)
+        rewards = [sum(r) for r in rewards]
+
+        avg_rew = sum(rewards) / n_trials
+        print("Average reward in {} episodes is {}".format(n_trials, avg_rew))
+        assert avg_rew > min_avg_reward, \
+            'average reward in {} episodes ({}) is less than {}'.format(n_trials, avg_rew, min_avg_reward)
+
+def rollout(env, model, n_trials):
+    rewards = []
+    actions = []
+    observations = []
+
+    for i in range(n_trials):
+        obs = env.reset()
+        state = model.initial_state
+        episode_rew = []
+        episode_actions = []
+        episode_obs = []
+
+        while True:
+            if state is not None:
+                a, v, state, _ = model.step(obs, S=state, M=[False])
+            else:
+                a, v, _, _ = model.step(obs)
+
+            obs, rew, done, _ = env.step(a)
+
+            episode_rew.append(rew)
+            episode_actions.append(a)
+            episode_obs.append(obs)
+
+            if done:
+                break
+
+        rewards.append(episode_rew)
+        actions.append(episode_actions)
+        observations.append(episode_obs)
+
+    return observations, actions, rewards
+
--- a/baselines/common/tf_util.py
+++ b/baselines/common/tf_util.py
@@ -1,3 +1,4 @@
+import joblib
 import numpy as np
 import tensorflow as tf  # pylint: ignore-module
 import copy
@@ -48,18 +49,28 @@ def huber_loss(x, delta=1.0):
 # Global session
 # ================================================================

-def make_session(num_cpu=None, make_default=False):
+def get_session(config=None):
+    """Get default session or create one with a given config"""
+    sess = tf.get_default_session()
+    if sess is None:
+        sess = make_session(config=config, make_default=True)
+    return sess
+
+def make_session(config=None, num_cpu=None, make_default=False, graph=None):
    """Returns a session that will use <num_cpu> CPU's only"""
    if num_cpu is None:
        num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
-    tf_config = tf.ConfigProto(
-        inter_op_parallelism_threads=num_cpu,
-        intra_op_parallelism_threads=num_cpu)
-    tf_config.gpu_options.allocator_type = 'BFC'
+    if config is None:
+        config = tf.ConfigProto(
+            allow_soft_placement=True, 
+            inter_op_parallelism_threads=num_cpu,
+            intra_op_parallelism_threads=num_cpu)
+        config.gpu_options.allow_growth = True
+
    if make_default:
-        return tf.InteractiveSession(config=tf_config)
+        return tf.InteractiveSession(config=config, graph=graph)
    else:
-        return tf.Session(config=tf_config)
+        return tf.Session(config=config, graph=graph)

 def single_threaded_session():
    """Returns a session which will only use a single CPU"""
@@ -77,17 +88,17 @@ ALREADY_INITIALIZED = set()
 def initialize():
    """Initialize all the uninitialized variables in the global scope."""
    new_variables = set(tf.global_variables()) - ALREADY_INITIALIZED
-    tf.get_default_session().run(tf.variables_initializer(new_variables))
+    get_session().run(tf.variables_initializer(new_variables))
    ALREADY_INITIALIZED.update(new_variables)

 # ================================================================
 # Model components
 # ================================================================

-def normc_initializer(std=1.0):
+def normc_initializer(std=1.0, axis=0):
    def _initializer(shape, dtype=None, partition_info=None):  # pylint: disable=W0613
-        out = np.random.randn(*shape).astype(np.float32)
-        out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
+        out = np.random.randn(*shape).astype(dtype.as_numpy_dtype)
+        out *= std / np.sqrt(np.square(out).sum(axis=axis, keepdims=True))
        return tf.constant(out)
    return _initializer

@@ -180,7 +191,7 @@ class _Function(object):
        if hasattr(inpt, 'make_feed_dict'):
            feed_dict.update(inpt.make_feed_dict(value))
        else:
-            feed_dict[inpt] = value
+            feed_dict[inpt] = adjust_shape(inpt, value)

    def __call__(self, *args):
        assert len(args) <= len(self.inputs), "Too many arguments provided"
@@ -190,8 +201,8 @@ class _Function(object):
            self._feed_input(feed_dict, inpt, value)
        # Update feed dict with givens.
        for inpt in self.givens:
-            feed_dict[inpt] = feed_dict.get(inpt, self.givens[inpt])
-        results = tf.get_default_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
+            feed_dict[inpt] = adjust_shape(inpt, feed_dict.get(inpt, self.givens[inpt]))
+        results = get_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
        return results

 # ================================================================
@@ -244,20 +255,151 @@ class GetFlat(object):
    def __call__(self):
        return tf.get_default_session().run(self.op)

+def flattenallbut0(x):
+    return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
+
+# ================================================================
+# TF placeholders management
+# ================================================================
+
 _PLACEHOLDER_CACHE = {}  # name -> (placeholder, dtype, shape)

 def get_placeholder(name, dtype, shape):
    if name in _PLACEHOLDER_CACHE:
        out, dtype1, shape1 = _PLACEHOLDER_CACHE[name]
-        assert dtype1 == dtype and shape1 == shape
-        return out
-    else:
-        out = tf.placeholder(dtype=dtype, shape=shape, name=name)
-        _PLACEHOLDER_CACHE[name] = (out, dtype, shape)
-        return out
+        if out.graph == tf.get_default_graph():
+            assert dtype1 == dtype and shape1 == shape, \
+                'Placeholder with name {} has already been registered and has shape {}, different from requested {}'.format(name, shape1, shape)
+            return out
+
+    out = tf.placeholder(dtype=dtype, shape=shape, name=name)
+    _PLACEHOLDER_CACHE[name] = (out, dtype, shape)
+    return out

 def get_placeholder_cached(name):
    return _PLACEHOLDER_CACHE[name][0]

-def flattenallbut0(x):
-    return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
+
+
+# ================================================================
+# Diagnostics
+# ================================================================
+
+def display_var_info(vars):
+    from baselines import logger
+    count_params = 0
+    for v in vars:
+        name = v.name
+        if "/Adam" in name or "beta1_power" in name or "beta2_power" in name: continue
+        v_params = np.prod(v.shape.as_list())
+        count_params += v_params
+        if "/b:" in name or "/biases" in name: continue    # Wx+b, bias is not interesting to look at => count params, but not print
+        logger.info("   %s%s %i params %s" % (name, " "*(55-len(name)), v_params, str(v.shape)))
+
+    logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))
+
+
+def get_available_gpus():
+    # recipe from here:
+    # https://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
+
+    from tensorflow.python.client import device_lib
+    local_device_protos = device_lib.list_local_devices()
+    return [x.name for x in local_device_protos if x.device_type == 'GPU']
+
+# ================================================================
+# Saving variables
+# ================================================================
+
+def load_state(fname, sess=None):
+    sess = sess or get_session()
+    saver = tf.train.Saver()
+    saver.restore(sess, fname)
+
+def save_state(fname, sess=None):
+    sess = sess or get_session()
+    os.makedirs(os.path.dirname(fname), exist_ok=True)
+    saver = tf.train.Saver()
+    saver.save(sess, fname)
+
+# The methods above and below clearly do the same thing, and in a rather similar way
+# TODO: ensure there are no subtle differences and remove one
+
+def save_variables(save_path, variables=None, sess=None):
+    sess = sess or get_session()
+    variables = variables or tf.trainable_variables()
+    
+    ps = sess.run(variables)
+    save_dict = {v.name: value for v, value in zip(variables, ps)}
+    os.makedirs(os.path.dirname(save_path), exist_ok=True)
+    joblib.dump(save_dict, save_path)
+
+def load_variables(load_path, variables=None, sess=None):
+    sess = sess or get_session()
+    variables = variables or tf.trainable_variables()
+
+    loaded_params = joblib.load(os.path.expanduser(load_path))
+    restores = []
+    for v in variables:
+        restores.append(v.assign(loaded_params[v.name]))
+    sess.run(restores)
+
+
+# ================================================================
+# Shape adjustment for feeding into tf placeholders
+# ================================================================
+def adjust_shape(placeholder, data):
+    '''
+    Adjust the shape of the data to the shape of the placeholder, if possible.
+    If the shapes are incompatible, an AssertionError is raised.
+
+    Parameters:
+        placeholder     tensorflow input placeholder
+
+        data            input data to be (potentially) reshaped to be fed into placeholder
+
+    Returns:
+        reshaped data
+    '''
+
+    if not isinstance(data, np.ndarray) and not isinstance(data, list):
+        return data
+    if isinstance(data, list):
+        data = np.array(data)
+    
+    placeholder_shape = [x or -1 for x in placeholder.shape.as_list()]
+    
+    assert _check_shape(placeholder_shape, data.shape), \
+        'Shape of data {} is not compatible with shape of the placeholder {}'.format(data.shape, placeholder_shape)
+
+    return np.reshape(data, placeholder_shape)  
+    
+
+def _check_shape(placeholder_shape, data_shape):
+    ''' check if two shapes are compatible (i.e. differ only by dimensions of size 1, or by the batch dimension)'''
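+    # NOTE: the check is short-circuited here, so every shape is accepted and the code below never runs
+    return True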
+    squeezed_placeholder_shape = _squeeze_shape(placeholder_shape)
+    squeezed_data_shape = _squeeze_shape(data_shape)
+    
+    for i, s_data in enumerate(squeezed_data_shape):
+        s_placeholder = squeezed_placeholder_shape[i]
+        if s_placeholder != -1 and s_data != s_placeholder:
+            return False
+
+    return True
+
+
+def _squeeze_shape(shape):
+    return [x for x in shape if x != 1]
+# ================================================================
+# Tensorboard interfacing
+# ================================================================
+
+def launch_tensorboard_in_background(log_dir):
+    from tensorboard import main as tb
+    import threading
+    tf.flags.FLAGS.logdir = log_dir
+    t = threading.Thread(target=tb.main, args=([]))
+    t.start()
+
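The shape adjustment added to tf_util.py above can be illustrated without TensorFlow. The following is a minimal sketch, not the code from this change: `adjust_to_placeholder_shape` is a hypothetical stand-in that takes a plain shape list (with `None` for the batch dimension) instead of a placeholder, and it assumes NumPy is available.

```python
import numpy as np


def adjust_to_placeholder_shape(placeholder_shape, data):
    """Reshape `data` to `placeholder_shape`; None marks an unknown (batch) dimension.

    Hypothetical stand-in for adjust_shape: it receives a plain shape list
    rather than a TensorFlow placeholder.
    """
    # Scalars and other non-array inputs are passed through untouched.
    if not isinstance(data, (np.ndarray, list)):
        return data
    data = np.asarray(data)
    # None (unknown size) becomes -1 so that np.reshape infers that dimension.
    target = [dim or -1 for dim in placeholder_shape]
    return np.reshape(data, target)


# A single observation of length 4 gains a leading batch dimension.
single_ob = adjust_to_placeholder_shape([None, 4], [0.1, 0.2, 0.3, 0.4])
print(single_ob.shape)
```

This is what lets callers feed one unbatched observation where a batched placeholder is expected. Data whose element count cannot fit the target shape makes `np.reshape` raise a `ValueError`.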
--- a/baselines/common/tile_images.py
+++ b/baselines/common/tile_images.py
@@ -0,0 +1,23 @@
+import numpy as np
+
+def tile_images(img_nhwc):
+    """
+    Tile N images into one big PxQ image
+    (P,Q) are chosen to be as close as possible, and if N
+    is square, then P=Q.
+
+    input: img_nhwc, list or array of images, ndim=4 once turned into array
+        n = batch index, h = height, w = width, c = channel
+    returns:
+        bigim_HWc, ndarray with ndim=3
+    """
+    img_nhwc = np.asarray(img_nhwc)
+    N, h, w, c = img_nhwc.shape
+    H = int(np.ceil(np.sqrt(N)))
+    W = int(np.ceil(float(N)/H))
+    img_nhwc = np.array(list(img_nhwc) + [img_nhwc[0]*0 for _ in range(N, H*W)])
+    img_HWhwc = img_nhwc.reshape(H, W, h, w, c)
+    img_HhWwc = img_HWhwc.transpose(0, 2, 1, 3, 4)
+    img_Hh_Ww_c = img_HhWwc.reshape(H*h, W*w, c)
+    return img_Hh_Ww_c
+
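To make the tiling layout concrete, here is a small self-contained sketch that follows the same steps as tile_images (pad to a full grid, reshape, transpose, reshape). The name `tile_grid` and the sample frames are assumptions made for illustration, and NumPy is assumed to be installed.

```python
import numpy as np


def tile_grid(images):
    """Tile N images of shape (h, w, c) into one (rows*h, cols*w, c) image."""
    images = np.asarray(images)
    n, h, w, c = images.shape
    rows = int(np.ceil(np.sqrt(n)))
    cols = int(np.ceil(float(n) / rows))
    # Pad with black tiles so that the grid is completely filled.
    padding = np.zeros((rows * cols - n, h, w, c), dtype=images.dtype)
    grid = np.concatenate([images, padding], axis=0)
    # (rows, cols, h, w, c) -> (rows, h, cols, w, c) -> (rows*h, cols*w, c)
    grid = grid.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


# Five 2x3 single-channel frames, each filled with its 1-based index.
frames = np.stack([np.full((2, 3, 1), i + 1, dtype=np.uint8) for i in range(5)])
big = tile_grid(frames)
print(big.shape)
```

With five frames the grid is 3 rows by 2 columns, so one tile in the bottom-right corner is left black.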
--- a/baselines/common/vec_env/__init__.py
+++ b/baselines/common/vec_env/__init__.py
@@ -20,20 +20,19 @@ class NotSteppingError(Exception):
        Exception.__init__(self, msg)

 class VecEnv(ABC):
-
+    """
+    An abstract asynchronous, vectorized environment.
+    """
    def __init__(self, num_envs, observation_space, action_space):
        self.num_envs = num_envs
        self.observation_space = observation_space
        self.action_space = action_space

-    """
-    An abstract asynchronous, vectorized environment.
-    """
    @abstractmethod
    def reset(self):
        """
        Reset all the environments and return an array of
-        observations.
+        observations, or a tuple of observation arrays.

        If step_async is still doing work, that work will
        be cancelled and step_wait() should not be called
@@ -59,10 +58,11 @@ class VecEnv(ABC):
        Wait for the step taken with step_async().

        Returns (obs, rews, dones, infos):
-         - obs: an array of observations
+         - obs: an array of observations, or a tuple of
+                arrays of observations.
         - rews: an array of rewards
         - dones: an array of "episode done" booleans
-         - infos: an array of info objects
+         - infos: a sequence of info objects
        """
        pass

@@ -77,9 +77,16 @@ class VecEnv(ABC):
        self.step_async(actions)
        return self.step_wait()

-    def render(self):
+    def render(self, mode='human'):
        logger.warn('Render not defined for %s'%self)

+    @property
+    def unwrapped(self):
+        if isinstance(self, VecEnvWrapper):
+            return self.venv.unwrapped
+        else:
+            return self
+
 class VecEnvWrapper(VecEnv):
    def __init__(self, venv, observation_space=None, action_space=None):
        self.venv = venv
--- a/baselines/common/vec_env/dummy_vec_env.py
+++ b/baselines/common/vec_env/dummy_vec_env.py
@@ -1,4 +1,6 @@
 import numpy as np
+from gym import spaces
+from collections import OrderedDict
 from . import VecEnv

 class DummyVecEnv(VecEnv):
@@ -6,26 +8,75 @@ class DummyVecEnv(VecEnv):
        self.envs = [fn() for fn in env_fns]
        env = self.envs[0]
        VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
-        self.ts = np.zeros(len(self.envs), dtype='int')        
+        shapes, dtypes = {}, {}
+        self.keys = []
+        obs_space = env.observation_space
+
+        if isinstance(obs_space, spaces.Dict):
+            assert isinstance(obs_space.spaces, OrderedDict)
+            subspaces = obs_space.spaces
+        else:
+            subspaces = {None: obs_space}
+
+        for key, box in subspaces.items():
+            shapes[key] = box.shape
+            dtypes[key] = box.dtype
+            self.keys.append(key)
+        
+        self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
+        self.buf_dones = np.zeros((self.num_envs,), dtype=bool)
+        self.buf_rews  = np.zeros((self.num_envs,), dtype=np.float32)
+        self.buf_infos = [{} for _ in range(self.num_envs)]
        self.actions = None

    def step_async(self, actions):
-        self.actions = actions
+        listify = True
+        try:
+            if len(actions) == self.num_envs:
+                listify = False
+        except TypeError:
+            pass
+
+        if not listify:
+            self.actions = actions
+        else:
+            assert self.num_envs == 1, "actions {} is either not a list or has a wrong size - cannot match to {} environments".format(actions, self.num_envs)
+            self.actions = [actions]

    def step_wait(self):
-        results = [env.step(a) for (a,env) in zip(self.actions, self.envs)]
-        obs, rews, dones, infos = map(np.array, zip(*results))
-        self.ts += 1
-        for (i, done) in enumerate(dones):
-            if done: 
-                obs[i] = self.envs[i].reset()
-                self.ts[i] = 0
-        self.actions = None
-        return np.array(obs), np.array(rews), np.array(dones), infos
+        for e in range(self.num_envs):
+            action = self.actions[e]
+            if isinstance(self.envs[e].action_space, spaces.Discrete):
+                action = int(action)
+
+            obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(action)
+            if self.buf_dones[e]:
+                obs = self.envs[e].reset()
+            self._save_obs(e, obs)
+        return (np.copy(self._obs_from_buf()), np.copy(self.buf_rews), np.copy(self.buf_dones),
+                self.buf_infos.copy())

    def reset(self):
-        results = [env.reset() for env in self.envs]
-        return np.array(results)
+        for e in range(self.num_envs):
+            obs = self.envs[e].reset()
+            self._save_obs(e, obs)
+        return self._obs_from_buf()

    def close(self):
        return
+
+    def render(self, mode='human'):
+        return [e.render(mode=mode) for e in self.envs]
+
+    def _save_obs(self, e, obs):
+        for k in self.keys:
+            if k is None:
+                self.buf_obs[k][e] = obs
+            else:
+                self.buf_obs[k][e] = obs[k]
+
+    def _obs_from_buf(self):
+        if self.keys==[None]:
+            return self.buf_obs[None]
+        else:
+            return self.buf_obs
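The action handling in step_async above accepts either one action per environment or, for a single environment, a bare action. A minimal pure-Python sketch of that rule follows; `as_action_list` is a hypothetical helper name, not part of this change.

```python
def as_action_list(actions, num_envs):
    """Return `actions` as a sequence with one entry per environment."""
    try:
        # Already one action per environment: use it unchanged.
        if len(actions) == num_envs:
            return actions
    except TypeError:
        # Scalars (e.g. a discrete action index) have no len().
        pass
    # Wrapping a bare action is only unambiguous for a single environment.
    assert num_envs == 1, \
        "actions {} cannot be matched to {} environments".format(actions, num_envs)
    return [actions]


print(as_action_list(3, 1))
print(as_action_list([0, 1], 2))
```

Note the ambiguity this rule leaves open: with one environment, a length-1 vector action is treated as an already wrapped list rather than being wrapped again.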
--- a/baselines/common/vec_env/subproc_vec_env.py
+++ b/baselines/common/vec_env/subproc_vec_env.py
@@ -1,32 +1,36 @@
 import numpy as np
 from multiprocessing import Process, Pipe
 from baselines.common.vec_env import VecEnv, CloudpickleWrapper
+from baselines.common.tile_images import tile_images


 def worker(remote, parent_remote, env_fn_wrapper):
    parent_remote.close()
    env = env_fn_wrapper.x()
-    while True:
-        cmd, data = remote.recv()
-        if cmd == 'step':
-            ob, reward, done, info = env.step(data)
-            if done:
+    try:
+        while True:
+            cmd, data = remote.recv()
+            if cmd == 'step':
+                ob, reward, done, info = env.step(data)
+                if done:
+                    ob = env.reset()
+                remote.send((ob, reward, done, info))
+            elif cmd == 'reset':
                ob = env.reset()
-            remote.send((ob, reward, done, info))
-        elif cmd == 'reset':
-            ob = env.reset()
-            remote.send(ob)
-        elif cmd == 'reset_task':
-            ob = env.reset_task()
-            remote.send(ob)
-        elif cmd == 'close':
-            remote.close()
-            break
-        elif cmd == 'get_spaces':
-            remote.send((env.observation_space, env.action_space))
-        else:
-            raise NotImplementedError
-
+                remote.send(ob)
+            elif cmd == 'render':
+                remote.send(env.render(mode='rgb_array'))
+            elif cmd == 'close':
+                remote.close()
+                break
+            elif cmd == 'get_spaces':
+                remote.send((env.observation_space, env.action_space))
+            else:
+                raise NotImplementedError
+    except KeyboardInterrupt:
+        print('SubprocVecEnv worker: got KeyboardInterrupt')
+    finally:
+        env.close()

 class SubprocVecEnv(VecEnv):
    def __init__(self, env_fns, spaces=None):
@@ -81,3 +85,17 @@ class SubprocVecEnv(VecEnv):
        for p in self.ps:
            p.join()
        self.closed = True
+
+    def render(self, mode='human'):
+        for pipe in self.remotes:
+            pipe.send(('render', None))
+        imgs = [pipe.recv() for pipe in self.remotes]
+        bigimg = tile_images(imgs)
+        if mode == 'human':
+            import cv2
+            cv2.imshow('vecenv', bigimg[:,:,::-1])
+            cv2.waitKey(1)
+        elif mode == 'rgb_array':
+            return bigimg
+        else:
+            raise NotImplementedError
--- a/baselines/common/vec_env/vec_normalize.py
+++ b/baselines/common/vec_env/vec_normalize.py
@@ -10,6 +10,8 @@ class VecNormalize(VecEnvWrapper):
        VecEnvWrapper.__init__(self, venv)
        self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
        self.ret_rms = RunningMeanStd(shape=()) if ret else None
+        #self.ob_rms = TfRunningMeanStd(shape=self.observation_space.shape, scope='observation_running_mean_std') if ob else None
+        #self.ret_rms = TfRunningMeanStd(shape=(), scope='return_running_mean_std') if ret else None
        self.clipob = clipob
        self.cliprew = cliprew
        self.ret = np.zeros(self.num_envs)
--- a/baselines/ddpg/ddpg.py
+++ b/baselines/ddpg/ddpg.py
@@ -26,9 +26,9 @@ def reduce_std(x, axis=None, keepdims=False):
    return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))

 def reduce_var(x, axis=None, keepdims=False):
-    m = tf.reduce_mean(x, axis=axis, keep_dims=True)
+    m = tf.reduce_mean(x, axis=axis, keepdims=True)
    devs_squared = tf.square(x - m)
-    return tf.reduce_mean(devs_squared, axis=axis, keep_dims=keepdims)
+    return tf.reduce_mean(devs_squared, axis=axis, keepdims=keepdims)

 def get_target_updates(vars, target_vars, tau):
    logger.info('setting up target updates ...')
--- a/baselines/ddpg/training.py
+++ b/baselines/ddpg/training.py
@@ -109,7 +109,7 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
                epoch_adaptive_distances = []
                for t_train in range(nb_train_steps):
                    # Adapt param noise, if necessary.
-                    if memory.nb_entries >= batch_size and t % param_noise_adaption_interval == 0:
+                    if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
                        distance = agent.adapt_param_noise()
                        epoch_adaptive_distances.append(distance)

@@ -189,4 +189,3 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
                if eval_env and hasattr(eval_env, 'get_state'):
                    with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
                        pickle.dump(eval_env.get_state(), f)
-
--- a/baselines/deepq/__init__.py
+++ b/baselines/deepq/__init__.py
@@ -1,6 +1,6 @@
 from baselines.deepq import models  # noqa
 from baselines.deepq.build_graph import build_act, build_train  # noqa
-from baselines.deepq.simple import learn, load  # noqa
+from baselines.deepq.deepq import learn, load_act  # noqa
 from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer  # noqa

 def wrap_atari_dqn(env):
--- a/baselines/deepq/simple.py
+++ b/baselines/deepq/simple.py
@@ -6,22 +6,28 @@ import zipfile
 import cloudpickle
 import numpy as np

-import gym
 import baselines.common.tf_util as U
+from baselines.common.tf_util import load_state, save_state
 from baselines import logger
 from baselines.common.schedules import LinearSchedule
+from baselines.common import set_global_seeds
+
 from baselines import deepq
 from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
-from baselines.deepq.utils import BatchInput, load_state, save_state
+from baselines.deepq.utils import ObservationInput
+
+from baselines.common.tf_util import get_session
+from baselines.deepq.models import build_q_func


 class ActWrapper(object):
    def __init__(self, act, act_params):
        self._act = act
        self._act_params = act_params
+        self.initial_state = None

    @staticmethod
-    def load(path):
+    def load_act(path):
        with open(path, "rb") as f:
            model_data, act_params = cloudpickle.load(f)
        act = deepq.build_act(**act_params)
@@ -40,7 +46,10 @@ class ActWrapper(object):
    def __call__(self, *args, **kwargs):
        return self._act(*args, **kwargs)

-    def save(self, path=None):
+    def step(self, observation, **kwargs):
+        return self._act([observation], **kwargs), None, None, None
+
+    def save_act(self, path=None):
        """Save model to a pickle located at `path`"""
        if path is None:
            path = os.path.join(logger.get_dir(), "model.pkl")
@@ -59,8 +68,11 @@ class ActWrapper(object):
        with open(path, "wb") as f:
            cloudpickle.dump((model_data, self._act_params), f)

+    def save(self, path):
+        save_state(path)

-def load(path):
+
+def load_act(path):
    """Load act function that was returned by learn function.

    Parameters
@@ -74,13 +86,14 @@ def load(path):
        function that takes a batch of observations
        and returns actions.
    """
-    return ActWrapper.load(path)
+    return ActWrapper.load_act(path)


 def learn(env,
-          q_func,
+          network,
+          seed=None,
          lr=5e-4,
-          max_timesteps=100000,
+          total_timesteps=100000,
          buffer_size=50000,
          exploration_fraction=0.1,
          exploration_final_eps=0.02,
@@ -88,6 +101,7 @@ def learn(env,
          batch_size=32,
          print_freq=100,
          checkpoint_freq=10000,
+          checkpoint_path=None,
          learning_starts=1000,
          gamma=1.0,
          target_network_update_freq=500,
@@ -97,7 +111,10 @@ def learn(env,
          prioritized_replay_beta_iters=None,
          prioritized_replay_eps=1e-6,
          param_noise=False,
-          callback=None):
+          callback=None,
+          load_path=None,
+          **network_kwargs
+            ):
    """Train a deepq model.

    Parameters
@@ -116,7 +133,7 @@ def learn(env,
        and returns a tensor of shape (batch_size, num_actions) with values of every action.
    lr: float
        learning rate for adam optimizer
-    max_timesteps: int
+    total_timesteps: int
        number of env steps to optimizer for
    buffer_size: int
        size of the replay buffer
@@ -150,12 +167,16 @@ def learn(env,
        initial value of beta for prioritized replay buffer
    prioritized_replay_beta_iters: int
        number of iterations over which beta will be annealed from initial value
-        to 1.0. If set to None equals to max_timesteps.
+        to 1.0. If set to None equals to total_timesteps.
    prioritized_replay_eps: float
        epsilon to add to the TD errors when updating priorities.
    callback: (locals, globals) -> None
        function called at every steps with state of the algorithm.
        If callback returns true training stops.
+    load_path: str
+        path to load the model from. (default: None)
+    **network_kwargs
+        additional keyword arguments to pass to the network builder. 

    Returns
    -------
@@ -165,14 +186,16 @@ def learn(env,
    """
    # Create all the functions necessary to train the model

-    sess = tf.Session()
-    sess.__enter__()
+    sess = get_session()
+    set_global_seeds(seed)
+
+    q_func = build_q_func(network, **network_kwargs)

    # capture the shape outside the closure so that the env object is not serialized
    # by cloudpickle when serializing make_obs_ph
-    observation_space_shape = env.observation_space.shape
+    observation_space = env.observation_space
     def make_obs_ph(name):
-        return BatchInput(observation_space_shape, name=name)
+        return ObservationInput(observation_space, name=name)

    act, train, update_target, debug = deepq.build_train(
        make_obs_ph=make_obs_ph,
@@ -196,7 +219,7 @@ def learn(env,
    if prioritized_replay:
        replay_buffer = PrioritizedReplayBuffer(buffer_size, alpha=prioritized_replay_alpha)
        if prioritized_replay_beta_iters is None:
-            prioritized_replay_beta_iters = max_timesteps
+            prioritized_replay_beta_iters = total_timesteps
        beta_schedule = LinearSchedule(prioritized_replay_beta_iters,
                                       initial_p=prioritized_replay_beta0,
                                       final_p=1.0)
@@ -204,7 +227,7 @@ def learn(env,
        replay_buffer = ReplayBuffer(buffer_size)
        beta_schedule = None
    # Create the schedule for exploration starting from 1.
-    exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * max_timesteps),
+    exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * total_timesteps),
                                 initial_p=1.0,
                                 final_p=exploration_final_eps)

@@ -216,10 +239,23 @@ def learn(env,
    saved_mean_reward = None
    obs = env.reset()
    reset = True
+
    with tempfile.TemporaryDirectory() as td:
-        model_saved = False
+        td = checkpoint_path or td
+
        model_file = os.path.join(td, "model")
-        for t in range(max_timesteps):
+        model_saved = False
+        
+        if tf.train.latest_checkpoint(td) is not None:
+            load_state(model_file)
+            logger.log('Loaded model from {}'.format(model_file))
+            model_saved = True
+        elif load_path is not None:
+            load_state(load_path)
+            logger.log('Loaded model from {}'.format(load_path))
+        
+
+        for t in range(total_timesteps):
            if callback is not None:
                if callback(locals(), globals()):
                    break
--- a/baselines/deepq/defaults.py
+++ b/baselines/deepq/defaults.py
@@ -0,0 +1,21 @@
+def atari():
+    return dict(
+        network='conv_only',
+        lr=1e-4,
+        buffer_size=10000,
+        exploration_fraction=0.1,
+        exploration_final_eps=0.01,
+        train_freq=4,
+        learning_starts=10000,
+        target_network_update_freq=1000,
+        gamma=0.99,
+        prioritized_replay=True,
+        prioritized_replay_alpha=0.6,
+        checkpoint_freq=10000,
+        checkpoint_path=None,
+        dueling=True
+    )
+
+def retro():
+    return atari()
+
--- a/baselines/deepq/experiments/custom_cartpole.py
+++ b/baselines/deepq/experiments/custom_cartpole.py
@@ -9,7 +9,7 @@ import baselines.common.tf_util as U
 from baselines import logger
 from baselines import deepq
 from baselines.deepq.replay_buffer import ReplayBuffer
-from baselines.deepq.utils import BatchInput
+from baselines.deepq.utils import ObservationInput
 from baselines.common.schedules import LinearSchedule


@@ -28,7 +28,7 @@ if __name__ == '__main__':
        env = gym.make("CartPole-v0")
        # Create all the functions necessary to train the model
        act, train, update_target, debug = deepq.build_train(
-            make_obs_ph=lambda name: BatchInput(env.observation_space.shape, name=name),
+            make_obs_ph=lambda name: ObservationInput(env.observation_space, name=name),
            q_func=model,
            num_actions=env.action_space.n,
            optimizer=tf.train.AdamOptimizer(learning_rate=5e-4),
--- a/baselines/deepq/experiments/enjoy_retro.py
+++ b/baselines/deepq/experiments/enjoy_retro.py
@@ -0,0 +1,34 @@
+import argparse
+
+import numpy as np
+
+from baselines import deepq
+from baselines.common import retro_wrappers
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--env', help='environment ID', default='SuperMarioBros-Nes')
+    parser.add_argument('--gamestate', help='game state to load', default='Level1-1')
+    parser.add_argument('--model', help='model pickle file from ActWrapper.save', default='model.pkl')
+    args = parser.parse_args()
+
+    env = retro_wrappers.make_retro(game=args.env, state=args.gamestate, max_episode_steps=None)
+    env = retro_wrappers.wrap_deepmind_retro(env)
+    act = deepq.load(args.model)
+
+    while True:
+        obs, done = env.reset(), False
+        episode_rew = 0
+        while not done:
+            env.render()
+            action = act(obs[None])[0]
+            env_action = np.zeros(env.action_space.n)
+            env_action[action] = 1
+            obs, rew, done, _ = env.step(env_action)
+            episode_rew += rew
+        print('Episode reward', episode_rew)
+
+
+if __name__ == '__main__':
+    main()
--- a/baselines/deepq/experiments/run_atari.py
+++ b/baselines/deepq/experiments/run_atari.py
@@ -5,13 +5,18 @@ import argparse
 from baselines import logger
 from baselines.common.atari_wrappers import make_atari

+
 def main():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
    parser.add_argument('--seed', help='RNG seed', type=int, default=0)
    parser.add_argument('--prioritized', type=int, default=1)
+    parser.add_argument('--prioritized-replay-alpha', type=float, default=0.6)
    parser.add_argument('--dueling', type=int, default=1)
    parser.add_argument('--num-timesteps', type=int, default=int(10e6))
+    parser.add_argument('--checkpoint-freq', type=int, default=10000)
+    parser.add_argument('--checkpoint-path', type=str, default=None)
+
    args = parser.parse_args()
    logger.configure()
    set_global_seeds(args.seed)
@@ -23,7 +28,8 @@ def main():
        hiddens=[256],
        dueling=bool(args.dueling),
    )
-    act = deepq.learn(
+
+    deepq.learn(
        env,
        q_func=model,
        lr=1e-4,
@@ -35,9 +41,12 @@ def main():
        learning_starts=10000,
        target_network_update_freq=1000,
        gamma=0.99,
-        prioritized_replay=bool(args.prioritized)
+        prioritized_replay=bool(args.prioritized),
+        prioritized_replay_alpha=args.prioritized_replay_alpha,
+        checkpoint_freq=args.checkpoint_freq,
+        checkpoint_path=args.checkpoint_path,
    )
-    # act.save("pong_model.pkl") XXX
+
    env.close()


--- a/baselines/deepq/experiments/run_retro.py
+++ b/baselines/deepq/experiments/run_retro.py
@@ -0,0 +1,49 @@
+import argparse
+
+from baselines import deepq
+from baselines.common import set_global_seeds
+from baselines import bench
+from baselines import logger
+from baselines.common import retro_wrappers
+import retro
+
+
+def main():
+    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+    parser.add_argument('--env', help='environment ID', default='SuperMarioBros-Nes')
+    parser.add_argument('--gamestate', help='game state to load', default='Level1-1')
+    parser.add_argument('--seed', help='seed', type=int, default=0)
+    parser.add_argument('--num-timesteps', type=int, default=int(10e6))
+    args = parser.parse_args()
+    logger.configure()
+    set_global_seeds(args.seed)
+    env = retro_wrappers.make_retro(game=args.env, state=args.gamestate, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE)
+    env.seed(args.seed)
+    env = bench.Monitor(env, logger.get_dir())
+    env = retro_wrappers.wrap_deepmind_retro(env)
+
+    model = deepq.models.cnn_to_mlp(
+        convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
+        hiddens=[256],
+        dueling=True
+    )
+    act = deepq.learn(
+        env,
+        q_func=model,
+        lr=1e-4,
+        max_timesteps=args.num_timesteps,
+        buffer_size=10000,
+        exploration_fraction=0.1,
+        exploration_final_eps=0.01,
+        train_freq=4,
+        learning_starts=10000,
+        target_network_update_freq=1000,
+        gamma=0.99,
+        prioritized_replay=True
+    )
+    act.save()
+    env.close()
+
+
+if __name__ == '__main__':
+    main()
--- a/baselines/deepq/models.py
+++ b/baselines/deepq/models.py
@@ -89,3 +89,41 @@ def cnn_to_mlp(convs, hiddens, dueling=False, layer_norm=False):

    return lambda *args, **kwargs: _cnn_to_mlp(convs, hiddens, dueling, layer_norm=layer_norm, *args, **kwargs)

+
+
+def build_q_func(network, hiddens=[256], dueling=True, layer_norm=False, **network_kwargs):
+    if isinstance(network, str):
+        from baselines.common.models import get_network_builder
+        network = get_network_builder(network)(**network_kwargs)   
+        
+    def q_func_builder(input_placeholder, num_actions, scope, reuse=False):
+        with tf.variable_scope(scope, reuse=reuse):
+            latent, _ = network(input_placeholder)
+            latent = layers.flatten(latent)
+
+            with tf.variable_scope("action_value"):
+                action_out = latent
+                for hidden in hiddens:
+                    action_out = layers.fully_connected(action_out, num_outputs=hidden, activation_fn=None)
+                    if layer_norm:
+                        action_out = layers.layer_norm(action_out, center=True, scale=True)
+                    action_out = tf.nn.relu(action_out)
+                action_scores = layers.fully_connected(action_out, num_outputs=num_actions, activation_fn=None)
+
+            if dueling:
+                with tf.variable_scope("state_value"):
+                    state_out = latent
+                    for hidden in hiddens:
+                        state_out = layers.fully_connected(state_out, num_outputs=hidden, activation_fn=None)
+                        if layer_norm:
+                            state_out = layers.layer_norm(state_out, center=True, scale=True)
+                        state_out = tf.nn.relu(state_out)
+                    state_score = layers.fully_connected(state_out, num_outputs=1, activation_fn=None)
+                action_scores_mean = tf.reduce_mean(action_scores, 1)
+                action_scores_centered = action_scores - tf.expand_dims(action_scores_mean, 1)
+                q_out = state_score + action_scores_centered
+            else:
+                q_out = action_scores
+            return q_out
+            
+    return q_func_builder
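
The dueling branch of `build_q_func` combines the state value with mean-centered action scores. A minimal pure-Python sketch of that aggregation for a single state follows; the function name `dueling_q_values` is hypothetical and exists only for illustration.

```python
def dueling_q_values(state_value, action_scores):
    """Combine V(s) and per-action scores as Q = V + (A - mean(A))."""
    mean_score = sum(action_scores) / len(action_scores)
    return [state_value + (a - mean_score) for a in action_scores]


q = dueling_q_values(0.5, [1.0, 2.0, 3.0])
print(q)  # [-0.5, 0.5, 1.5]
```

Centering the action scores means that adding a constant to every score leaves the resulting Q-values unchanged, so the state-value stream alone sets their mean.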
--- a/baselines/deepq/replay_buffer.py
+++ b/baselines/deepq/replay_buffer.py
@@ -86,7 +86,7 @@ class PrioritizedReplayBuffer(ReplayBuffer):
        ReplayBuffer.__init__
        """
        super(PrioritizedReplayBuffer, self).__init__(size)
-        assert alpha > 0
+        assert alpha >= 0
        self._alpha = alpha

        it_capacity = 1
--- a/baselines/deepq/utils.py
+++ b/baselines/deepq/utils.py
@@ -1,24 +1,13 @@
-import os
+from baselines.common.input import observation_input
+from baselines.common.tf_util import adjust_shape

 import tensorflow as tf

-# ================================================================
-# Saving variables
-# ================================================================
-
-def load_state(fname):
-    saver = tf.train.Saver()
-    saver.restore(tf.get_default_session(), fname)
-
-def save_state(fname):
-    os.makedirs(os.path.dirname(fname), exist_ok=True)
-    saver = tf.train.Saver()
-    saver.save(tf.get_default_session(), fname)
-
 # ================================================================
 # Placeholders
 # ================================================================

+
 class TfInput(object):
    def __init__(self, name="(unnamed)"):
        """Generalized Tensorflow placeholder. The main differences are:
@@ -48,22 +37,8 @@ class PlaceholderTfInput(TfInput):
        return self._placeholder

    def make_feed_dict(self, data):
-        return {self._placeholder: data}
+        return {self._placeholder: adjust_shape(self._placeholder, data)}

-class BatchInput(PlaceholderTfInput):
-    def __init__(self, shape, dtype=tf.float32, name=None):
-        """Creates a placeholder for a batch of tensors of a given shape and dtype
-
-        Parameters
-        ----------
-        shape: [int]
-            shape of a single elemenet of the batch
-        dtype: tf.dtype
-            number representation used for tensor contents
-        name: str
-            name of the underlying placeholder
-        """
-        super().__init__(tf.placeholder(dtype, [None] + list(shape), name=name))

 class Uint8Input(PlaceholderTfInput):
    def __init__(self, shape, name=None):
@@ -86,3 +61,24 @@ class Uint8Input(PlaceholderTfInput):

    def get(self):
        return self._output
+
+
+class ObservationInput(PlaceholderTfInput):
+    def __init__(self, observation_space, name=None):
+        """Creates an input placeholder tailored to a specific observation space
+        
+        Parameters
+        ----------
+
+        observation_space: 
+                observation space of the environment. Should be one of the gym.spaces types
+        name: str 
+                tensorflow name of the underlying placeholder
+        """
+        inpt, self.processed_inpt = observation_input(observation_space, name=name)
+        super().__init__(inpt)
+
+    def get(self):
+        return self.processed_inpt
+    
+    
--- a/baselines/gail/dataset/mujoco_dset.py
+++ b/baselines/gail/dataset/mujoco_dset.py
@@ -47,18 +47,12 @@ class Mujoco_Dset(object):
        obs = traj_data['obs'][:traj_limitation]
        acs = traj_data['acs'][:traj_limitation]

-        def flatten(x):
-            # x.shape = (E,), or (E, L, D)
-            _, size = x[0].shape
-            episode_length = [len(i) for i in x]
-            y = np.zeros((sum(episode_length), size))
-            start_idx = 0
-            for l, x_i in zip(episode_length, x):
-                y[start_idx:(start_idx+l)] = x_i
-                start_idx += l
-                return y
-        self.obs = np.array(flatten(obs))
-        self.acs = np.array(flatten(acs))
+        # obs, acs: shape (N, L, ) + S where N = # episodes, L = episode length
+        # and S is the environment observation/action space.
+        # Flatten to (N * L, prod(S))
+        self.obs = np.reshape(obs, [-1, np.prod(obs.shape[2:])])
+        self.acs = np.reshape(acs, [-1, np.prod(acs.shape[2:])])
+
        self.rets = traj_data['ep_rets'][:traj_limitation]
        self.avg_ret = sum(self.rets)/len(self.rets)
        self.std_ret = np.std(np.array(self.rets))
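
The replacement flattening relies on `np.reshape` collapsing the episode and time axes. A small sketch with toy data, assuming (as the new comment states) that all episodes share the same length so the trajectories form a regular array:

```python
import numpy as np

# Toy data: N=2 episodes, L=3 steps, per-step observation shape S=(2, 2).
obs = np.arange(2 * 3 * 2 * 2).reshape(2, 3, 2, 2)

# Collapse (N, L) into one axis and flatten S into a single feature axis.
flat = np.reshape(obs, [-1, np.prod(obs.shape[2:])])
print(flat.shape)  # (6, 4)
```

Each row of `flat` is one time step, in episode order.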
--- a/baselines/her/README.md
+++ b/baselines/her/README.md
@@ -0,0 +1,32 @@
+# Hindsight Experience Replay
+For details on Hindsight Experience Replay (HER), please read the [paper](https://arxiv.org/abs/1707.01495).
+
+## How to use Hindsight Experience Replay
+
+### Getting started
+Training an agent is very simple:
+```bash
+python -m baselines.her.experiment.train
+```
+This will train a DDPG+HER agent on the `FetchReach` environment.
+You should see the success rate go up quickly to `1.0`, which means that the agent achieves the
+desired goal in 100% of the cases.
+The training script logs other diagnostics as well and pickles the best policy so far (w.r.t. its test success rate),
+the latest policy, and, if enabled, a history of policies every K epochs.
+
+To inspect what the agent has learned, use the play script:
+```bash
+python -m baselines.her.experiment.play /path/to/an/experiment/policy_best.pkl
+```
+You can try it right now with the results of the training step (the script prints out the path for you).
+This should visualize the current policy for 10 episodes and will also print statistics.
+
+
+### Reproducing results
+In order to reproduce the results from [Plappert et al. (2018)](https://arxiv.org/abs/1802.09464), run the following command:
+```bash
+python -m baselines.her.experiment.train --num_cpu 19
+```
+This will require a machine with a sufficient number of physical CPU cores. In our experiments,
+we used [Azure's D15v2 instances](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes),
+which have 20 physical cores. We only scheduled the experiment on 19 of those to leave some head-room on the system.
--- a/baselines/her/__init__.py
+++ b/baselines/her/__init__.py
--- a/baselines/her/actor_critic.py
+++ b/baselines/her/actor_critic.py
@@ -0,0 +1,44 @@
+import tensorflow as tf
+from baselines.her.util import store_args, nn
+
+
+class ActorCritic:
+    @store_args
+    def __init__(self, inputs_tf, dimo, dimg, dimu, max_u, o_stats, g_stats, hidden, layers,
+                 **kwargs):
+        """The actor-critic network and related training code.
+
+        Args:
+            inputs_tf (dict of tensors): all necessary inputs for the network: the
+                observation (o), the goal (g), and the action (u)
+            dimo (int): the dimension of the observations
+            dimg (int): the dimension of the goals
+            dimu (int): the dimension of the actions
+            max_u (float): the maximum magnitude of actions; action outputs will be scaled
+                accordingly
+            o_stats (baselines.her.Normalizer): normalizer for observations
+            g_stats (baselines.her.Normalizer): normalizer for goals
+            hidden (int): number of hidden units that should be used in hidden layers
+            layers (int): number of hidden layers
+        """
+        self.o_tf = inputs_tf['o']
+        self.g_tf = inputs_tf['g']
+        self.u_tf = inputs_tf['u']
+
+        # Prepare inputs for actor and critic.
+        o = self.o_stats.normalize(self.o_tf)
+        g = self.g_stats.normalize(self.g_tf)
+        input_pi = tf.concat(axis=1, values=[o, g])  # for actor
+
+        # Networks.
+        with tf.variable_scope('pi'):
+            self.pi_tf = self.max_u * tf.tanh(nn(
+                input_pi, [self.hidden] * self.layers + [self.dimu]))
+        with tf.variable_scope('Q'):
+            # for policy training
+            input_Q = tf.concat(axis=1, values=[o, g, self.pi_tf / self.max_u])
+            self.Q_pi_tf = nn(input_Q, [self.hidden] * self.layers + [1])
+            # for critic training
+            input_Q = tf.concat(axis=1, values=[o, g, self.u_tf / self.max_u])
+            self._input_Q = input_Q  # exposed for tests
+            self.Q_tf = nn(input_Q, [self.hidden] * self.layers + [1], reuse=True)
--- a/baselines/her/ddpg.py
+++ b/baselines/her/ddpg.py
@@ -0,0 +1,340 @@
+from collections import OrderedDict
+
+import numpy as np
+import tensorflow as tf
+from tensorflow.contrib.staging import StagingArea
+
+from baselines import logger
+from baselines.her.util import (
+    import_function, store_args, flatten_grads, transitions_in_episode_batch)
+from baselines.her.normalizer import Normalizer
+from baselines.her.replay_buffer import ReplayBuffer
+from baselines.common.mpi_adam import MpiAdam
+
+
+def dims_to_shapes(input_dims):
+    return {key: tuple([val]) if val > 0 else tuple() for key, val in input_dims.items()}
+
+
+class DDPG(object):
+    @store_args
+    def __init__(self, input_dims, buffer_size, hidden, layers, network_class, polyak, batch_size,
+                 Q_lr, pi_lr, norm_eps, norm_clip, max_u, action_l2, clip_obs, scope, T,
+                 rollout_batch_size, subtract_goals, relative_goals, clip_pos_returns, clip_return,
+                 sample_transitions, gamma, reuse=False, **kwargs):
+        """Implementation of DDPG that is used in combination with Hindsight Experience Replay (HER).
+
+        Args:
+            input_dims (dict of ints): dimensions for the observation (o), the goal (g), and the
+                actions (u)
+            buffer_size (int): number of transitions that are stored in the replay buffer
+            hidden (int): number of units in the hidden layers
+            layers (int): number of hidden layers
+            network_class (str): the network class that should be used (e.g. 'baselines.her.ActorCritic')
+            polyak (float): coefficient for Polyak-averaging of the target network
+            batch_size (int): batch size for training
+            Q_lr (float): learning rate for the Q (critic) network
+            pi_lr (float): learning rate for the pi (actor) network
+            norm_eps (float): a small value used in the normalizer to avoid numerical instabilities
+            norm_clip (float): normalized inputs are clipped to be in [-norm_clip, norm_clip]
+            max_u (float): maximum action magnitude, i.e. actions are in [-max_u, max_u]
+            action_l2 (float): coefficient for L2 penalty on the actions
+            clip_obs (float): clip observations before normalization to be in [-clip_obs, clip_obs]
+            scope (str): the scope used for the TensorFlow graph
+            T (int): the time horizon for rollouts
+            rollout_batch_size (int): number of parallel rollouts per DDPG agent
+            subtract_goals (function): function that subtracts goals from each other
+            relative_goals (boolean): whether or not relative goals should be fed into the network
+            clip_pos_returns (boolean): whether or not positive returns should be clipped
+            clip_return (float): clip returns to be in [-clip_return, clip_return]
+            sample_transitions (function): function that samples from the replay buffer
+            gamma (float): gamma used for Q learning updates
+            reuse (boolean): whether or not the networks should be reused
+        """
+        if self.clip_return is None:
+            self.clip_return = np.inf
+
+        self.create_actor_critic = import_function(self.network_class)
+
+        input_shapes = dims_to_shapes(self.input_dims)
+        self.dimo = self.input_dims['o']
+        self.dimg = self.input_dims['g']
+        self.dimu = self.input_dims['u']
+
+        # Prepare staging area for feeding data to the model.
+        stage_shapes = OrderedDict()
+        for key in sorted(self.input_dims.keys()):
+            if key.startswith('info_'):
+                continue
+            stage_shapes[key] = (None, *input_shapes[key])
+        for key in ['o', 'g']:
+            stage_shapes[key + '_2'] = stage_shapes[key]
+        stage_shapes['r'] = (None,)
+        self.stage_shapes = stage_shapes
+
+        # Create network.
+        with tf.variable_scope(self.scope):
+            self.staging_tf = StagingArea(
+                dtypes=[tf.float32 for _ in self.stage_shapes.keys()],
+                shapes=list(self.stage_shapes.values()))
+            self.buffer_ph_tf = [
+                tf.placeholder(tf.float32, shape=shape) for shape in self.stage_shapes.values()]
+            self.stage_op = self.staging_tf.put(self.buffer_ph_tf)
+
+            self._create_network(reuse=reuse)
+
+        # Configure the replay buffer.
+        buffer_shapes = {key: (self.T if key != 'o' else self.T+1, *input_shapes[key])
+                         for key, val in input_shapes.items()}
+        buffer_shapes['g'] = (buffer_shapes['g'][0], self.dimg)
+        buffer_shapes['ag'] = (self.T+1, self.dimg)
+
+        buffer_size = (self.buffer_size // self.rollout_batch_size) * self.rollout_batch_size
+        self.buffer = ReplayBuffer(buffer_shapes, buffer_size, self.T, self.sample_transitions)
+
+    def _random_action(self, n):
+        return np.random.uniform(low=-self.max_u, high=self.max_u, size=(n, self.dimu))
+
+    def _preprocess_og(self, o, ag, g):
+        if self.relative_goals:
+            g_shape = g.shape
+            g = g.reshape(-1, self.dimg)
+            ag = ag.reshape(-1, self.dimg)
+            g = self.subtract_goals(g, ag)
+            g = g.reshape(*g_shape)
+        o = np.clip(o, -self.clip_obs, self.clip_obs)
+        g = np.clip(g, -self.clip_obs, self.clip_obs)
+        return o, g
+
+    def get_actions(self, o, ag, g, noise_eps=0., random_eps=0., use_target_net=False,
+                    compute_Q=False):
+        o, g = self._preprocess_og(o, ag, g)
+        policy = self.target if use_target_net else self.main
+        # values to compute
+        vals = [policy.pi_tf]
+        if compute_Q:
+            vals += [policy.Q_pi_tf]
+        # feed
+        feed = {
+            policy.o_tf: o.reshape(-1, self.dimo),
+            policy.g_tf: g.reshape(-1, self.dimg),
+            policy.u_tf: np.zeros((o.size // self.dimo, self.dimu), dtype=np.float32)
+        }
+
+        ret = self.sess.run(vals, feed_dict=feed)
+        # action postprocessing
+        u = ret[0]
+        noise = noise_eps * self.max_u * np.random.randn(*u.shape)  # gaussian noise
+        u += noise
+        u = np.clip(u, -self.max_u, self.max_u)
+        u += np.random.binomial(1, random_eps, u.shape[0]).reshape(-1, 1) * (self._random_action(u.shape[0]) - u)  # eps-greedy
+        if u.shape[0] == 1:
+            u = u[0]
+        u = u.copy()
+        ret[0] = u
+
+        if len(ret) == 1:
+            return ret[0]
+        else:
+            return ret
+
+    def store_episode(self, episode_batch, update_stats=True):
+        """
+        episode_batch: array of batch_size x (T or T+1) x dim_key
+                       'o' is of size T+1, others are of size T
+        """
+
+        self.buffer.store_episode(episode_batch)
+
+        if update_stats:
+            # add transitions to normalizer
+            episode_batch['o_2'] = episode_batch['o'][:, 1:, :]
+            episode_batch['ag_2'] = episode_batch['ag'][:, 1:, :]
+            num_normalizing_transitions = transitions_in_episode_batch(episode_batch)
+            transitions = self.sample_transitions(episode_batch, num_normalizing_transitions)
+
+            o, o_2, g, ag = transitions['o'], transitions['o_2'], transitions['g'], transitions['ag']
+            transitions['o'], transitions['g'] = self._preprocess_og(o, ag, g)
+            # No need to preprocess the o_2 and g_2 since this is only used for stats
+
+            self.o_stats.update(transitions['o'])
+            self.g_stats.update(transitions['g'])
+
+            self.o_stats.recompute_stats()
+            self.g_stats.recompute_stats()
+
+    def get_current_buffer_size(self):
+        return self.buffer.get_current_size()
+
+    def _sync_optimizers(self):
+        self.Q_adam.sync()
+        self.pi_adam.sync()
+
+    def _grads(self):
+        # Avoid feed_dict here for performance!
+        critic_loss, actor_loss, Q_grad, pi_grad = self.sess.run([
+            self.Q_loss_tf,
+            self.main.Q_pi_tf,
+            self.Q_grad_tf,
+            self.pi_grad_tf
+        ])
+        return critic_loss, actor_loss, Q_grad, pi_grad
+
+    def _update(self, Q_grad, pi_grad):
+        self.Q_adam.update(Q_grad, self.Q_lr)
+        self.pi_adam.update(pi_grad, self.pi_lr)
+
+    def sample_batch(self):
+        transitions = self.buffer.sample(self.batch_size)
+        o, o_2, g = transitions['o'], transitions['o_2'], transitions['g']
+        ag, ag_2 = transitions['ag'], transitions['ag_2']
+        transitions['o'], transitions['g'] = self._preprocess_og(o, ag, g)
+        transitions['o_2'], transitions['g_2'] = self._preprocess_og(o_2, ag_2, g)
+
+        transitions_batch = [transitions[key] for key in self.stage_shapes.keys()]
+        return transitions_batch
+
+    def stage_batch(self, batch=None):
+        if batch is None:
+            batch = self.sample_batch()
+        assert len(self.buffer_ph_tf) == len(batch)
+        self.sess.run(self.stage_op, feed_dict=dict(zip(self.buffer_ph_tf, batch)))
+
+    def train(self, stage=True):
+        if stage:
+            self.stage_batch()
+        critic_loss, actor_loss, Q_grad, pi_grad = self._grads()
+        self._update(Q_grad, pi_grad)
+        return critic_loss, actor_loss
+
+    def _init_target_net(self):
+        self.sess.run(self.init_target_net_op)
+
+    def update_target_net(self):
+        self.sess.run(self.update_target_net_op)
+
+    def clear_buffer(self):
+        self.buffer.clear_buffer()
+
+    def _vars(self, scope):
+        res = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.scope + '/' + scope)
+        assert len(res) > 0
+        return res
+
+    def _global_vars(self, scope):
+        res = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.scope + '/' + scope)
+        return res
+
+    def _create_network(self, reuse=False):
+        logger.info("Creating a DDPG agent with action space %d x %s..." % (self.dimu, self.max_u))
+
+        self.sess = tf.get_default_session()
+        if self.sess is None:
+            self.sess = tf.InteractiveSession()
+
+        # running averages
+        with tf.variable_scope('o_stats') as vs:
+            if reuse:
+                vs.reuse_variables()
+            self.o_stats = Normalizer(self.dimo, self.norm_eps, self.norm_clip, sess=self.sess)
+        with tf.variable_scope('g_stats') as vs:
+            if reuse:
+                vs.reuse_variables()
+            self.g_stats = Normalizer(self.dimg, self.norm_eps, self.norm_clip, sess=self.sess)
+
+        # mini-batch sampling.
+        batch = self.staging_tf.get()
+        batch_tf = OrderedDict([(key, batch[i])
+                                for i, key in enumerate(self.stage_shapes.keys())])
+        batch_tf['r'] = tf.reshape(batch_tf['r'], [-1, 1])
+
+        # networks
+        with tf.variable_scope('main') as vs:
+            if reuse:
+                vs.reuse_variables()
+            self.main = self.create_actor_critic(batch_tf, net_type='main', **self.__dict__)
+            vs.reuse_variables()
+        with tf.variable_scope('target') as vs:
+            if reuse:
+                vs.reuse_variables()
+            target_batch_tf = batch_tf.copy()
+            target_batch_tf['o'] = batch_tf['o_2']
+            target_batch_tf['g'] = batch_tf['g_2']
+            self.target = self.create_actor_critic(
+                target_batch_tf, net_type='target', **self.__dict__)
+            vs.reuse_variables()
+        assert len(self._vars("main")) == len(self._vars("target"))
+
+        # loss functions
+        target_Q_pi_tf = self.target.Q_pi_tf
+        clip_range = (-self.clip_return, 0. if self.clip_pos_returns else np.inf)
+        target_tf = tf.clip_by_value(batch_tf['r'] + self.gamma * target_Q_pi_tf, *clip_range)
+        self.Q_loss_tf = tf.reduce_mean(tf.square(tf.stop_gradient(target_tf) - self.main.Q_tf))
+        self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
+        self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
+        Q_grads_tf = tf.gradients(self.Q_loss_tf, self._vars('main/Q'))
+        pi_grads_tf = tf.gradients(self.pi_loss_tf, self._vars('main/pi'))
+        assert len(self._vars('main/Q')) == len(Q_grads_tf)
+        assert len(self._vars('main/pi')) == len(pi_grads_tf)
+        self.Q_grads_vars_tf = zip(Q_grads_tf, self._vars('main/Q'))
+        self.pi_grads_vars_tf = zip(pi_grads_tf, self._vars('main/pi'))
+        self.Q_grad_tf = flatten_grads(grads=Q_grads_tf, var_list=self._vars('main/Q'))
+        self.pi_grad_tf = flatten_grads(grads=pi_grads_tf, var_list=self._vars('main/pi'))
+
+        # optimizers
+        self.Q_adam = MpiAdam(self._vars('main/Q'), scale_grad_by_procs=False)
+        self.pi_adam = MpiAdam(self._vars('main/pi'), scale_grad_by_procs=False)
+
+        # polyak averaging
+        self.main_vars = self._vars('main/Q') + self._vars('main/pi')
+        self.target_vars = self._vars('target/Q') + self._vars('target/pi')
+        self.stats_vars = self._global_vars('o_stats') + self._global_vars('g_stats')
+        self.init_target_net_op = list(
+            map(lambda v: v[0].assign(v[1]), zip(self.target_vars, self.main_vars)))
+        self.update_target_net_op = list(
+            map(lambda v: v[0].assign(self.polyak * v[0] + (1. - self.polyak) * v[1]), zip(self.target_vars, self.main_vars)))
+
+        # initialize all variables
+        tf.variables_initializer(self._global_vars('')).run()
+        self._sync_optimizers()
+        self._init_target_net()
+
+    def logs(self, prefix=''):
+        logs = []
+        logs += [('stats_o/mean', np.mean(self.sess.run([self.o_stats.mean])))]
+        logs += [('stats_o/std', np.mean(self.sess.run([self.o_stats.std])))]
+        logs += [('stats_g/mean', np.mean(self.sess.run([self.g_stats.mean])))]
+        logs += [('stats_g/std', np.mean(self.sess.run([self.g_stats.std])))]
+
+        if prefix != '' and not prefix.endswith('/'):
+            return [(prefix + '/' + key, val) for key, val in logs]
+        else:
+            return logs
+
+    def __getstate__(self):
+        """Our policies can be loaded from pkl, but after unpickling you cannot continue training.
+        """
+        excluded_subnames = ['_tf', '_op', '_vars', '_adam', 'buffer', 'sess', '_stats',
+                             'main', 'target', 'lock', 'env', 'sample_transitions',
+                             'stage_shapes', 'create_actor_critic']
+
+        state = {k: v for k, v in self.__dict__.items() if all([subname not in k for subname in excluded_subnames])}
+        state['buffer_size'] = self.buffer_size
+        state['tf'] = self.sess.run([x for x in self._global_vars('') if 'buffer' not in x.name])
+        return state
+
+    def __setstate__(self, state):
+        if 'sample_transitions' not in state:
+            # We don't need this for playing the policy.
+            state['sample_transitions'] = None
+
+        self.__init__(**state)
+        # set up stats (they are overwritten in __init__)
+        for k, v in state.items():
+            if k[-6:] == '_stats':
+                self.__dict__[k] = v
+        # load TF variables
+        vars = [x for x in self._global_vars('') if 'buffer' not in x.name]
+        assert(len(vars) == len(state["tf"]))
+        node = [tf.assign(var, val) for var, val in zip(vars, state["tf"])]
+        self.sess.run(node)
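
The `update_target_net_op` defined in `_create_network` moves each target variable towards its main-network counterpart by Polyak averaging. A minimal sketch of the same update on plain Python floats; the helper `polyak_update` is hypothetical, and the coefficient 0.95 is borrowed from the default configuration.

```python
def polyak_update(target, main, polyak=0.95):
    """Return new target values: polyak * target + (1 - polyak) * main."""
    return [polyak * t + (1.0 - polyak) * m for t, m in zip(target, main)]


target = [0.0, 10.0]
main = [1.0, 0.0]
target = polyak_update(target, main)  # each entry moves 5% of the way towards main
```

A coefficient close to 1 makes the target network change slowly.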
--- a/baselines/her/experiment/__init__.py
+++ b/baselines/her/experiment/__init__.py
--- a/baselines/her/experiment/config.py
+++ b/baselines/her/experiment/config.py
@@ -0,0 +1,171 @@
+import numpy as np
+import gym
+
+from baselines import logger
+from baselines.her.ddpg import DDPG
+from baselines.her.her import make_sample_her_transitions
+
+
+DEFAULT_ENV_PARAMS = {
+    'FetchReach-v1': {
+        'n_cycles': 10,
+    },
+}
+
+
+DEFAULT_PARAMS = {
+    # env
+    'max_u': 1.,  # max absolute value of actions on different coordinates
+    # ddpg
+    'layers': 3,  # number of layers in the critic/actor networks
+    'hidden': 256,  # number of neurons in each hidden layer
+    'network_class': 'baselines.her.actor_critic:ActorCritic',
+    'Q_lr': 0.001,  # critic learning rate
+    'pi_lr': 0.001,  # actor learning rate
+    'buffer_size': int(1E6),  # for experience replay
+    'polyak': 0.95,  # polyak averaging coefficient
+    'action_l2': 1.0,  # quadratic penalty on actions (before rescaling by max_u)
+    'clip_obs': 200.,
+    'scope': 'ddpg',  # can be tweaked for testing
+    'relative_goals': False,
+    # training
+    'n_cycles': 50,  # per epoch
+    'rollout_batch_size': 2,  # per mpi thread
+    'n_batches': 40,  # training batches per cycle
+    'batch_size': 256,  # per mpi thread, measured in transitions and reduced to even multiple of chunk_length.
+    'n_test_rollouts': 10,  # number of test rollouts per epoch, each consists of rollout_batch_size rollouts
+    'test_with_polyak': False,  # run test episodes with the target network
+    # exploration
+    'random_eps': 0.3,  # percentage of time a random action is taken
+    'noise_eps': 0.2,  # std of gaussian noise added to not-completely-random actions as a percentage of max_u
+    # HER
+    'replay_strategy': 'future',  # supported modes: future, none
+    'replay_k': 4,  # number of additional goals used for replay, only used if replay_strategy=future
+    # normalization
+    'norm_eps': 0.01,  # epsilon used for observation normalization
+    'norm_clip': 5,  # normalized observations are clipped to this value
+}
+
+
+CACHED_ENVS = {}
+
+
+def cached_make_env(make_env):
+    """
+    Only creates a new environment from the provided function if one has not already been
+    created. This is useful here because we need to infer certain properties of the env, e.g.
+    its observation and action spaces, without any intent of actually using it.
+    """
+    if make_env not in CACHED_ENVS:
+        env = make_env()
+        CACHED_ENVS[make_env] = env
+    return CACHED_ENVS[make_env]
+
+
+def prepare_params(kwargs):
+    # DDPG params
+    ddpg_params = dict()
+
+    env_name = kwargs['env_name']
+
+    def make_env():
+        return gym.make(env_name)
+    kwargs['make_env'] = make_env
+    tmp_env = cached_make_env(kwargs['make_env'])
+    assert hasattr(tmp_env, '_max_episode_steps')
+    kwargs['T'] = tmp_env._max_episode_steps
+    tmp_env.reset()
+    kwargs['max_u'] = np.array(kwargs['max_u']) if isinstance(kwargs['max_u'], list) else kwargs['max_u']
+    kwargs['gamma'] = 1. - 1. / kwargs['T']
+    if 'lr' in kwargs:
+        kwargs['pi_lr'] = kwargs['lr']
+        kwargs['Q_lr'] = kwargs['lr']
+        del kwargs['lr']
+    for name in ['buffer_size', 'hidden', 'layers',
+                 'network_class',
+                 'polyak',
+                 'batch_size', 'Q_lr', 'pi_lr',
+                 'norm_eps', 'norm_clip', 'max_u',
+                 'action_l2', 'clip_obs', 'scope', 'relative_goals']:
+        ddpg_params[name] = kwargs[name]
+        kwargs['_' + name] = kwargs[name]
+        del kwargs[name]
+    kwargs['ddpg_params'] = ddpg_params
+
+    return kwargs
+
+
+def log_params(params, logger=logger):
+    for key in sorted(params.keys()):
+        logger.info('{}: {}'.format(key, params[key]))
+
+
+def configure_her(params):
+    env = cached_make_env(params['make_env'])
+    env.reset()
+
+    def reward_fun(ag_2, g, info):  # vectorized
+        return env.compute_reward(achieved_goal=ag_2, desired_goal=g, info=info)
+
+    # Prepare configuration for HER.
+    her_params = {
+        'reward_fun': reward_fun,
+    }
+    for name in ['replay_strategy', 'replay_k']:
+        her_params[name] = params[name]
+        params['_' + name] = her_params[name]
+        del params[name]
+    sample_her_transitions = make_sample_her_transitions(**her_params)
+
+    return sample_her_transitions
+
+
+def simple_goal_subtract(a, b):
+    assert a.shape == b.shape
+    return a - b
+
+
+def configure_ddpg(dims, params, reuse=False, use_mpi=True, clip_return=True):
+    sample_her_transitions = configure_her(params)
+    # Extract relevant parameters.
+    gamma = params['gamma']
+    rollout_batch_size = params['rollout_batch_size']
+    ddpg_params = params['ddpg_params']
+
+    input_dims = dims.copy()
+
+    # DDPG agent
+    env = cached_make_env(params['make_env'])
+    env.reset()
+    ddpg_params.update({'input_dims': input_dims,  # agent takes observations as input
+                        'T': params['T'],
+                        'clip_pos_returns': True,  # clip positive returns
+                        'clip_return': (1. / (1. - gamma)) if clip_return else np.inf,  # max abs of return
+                        'rollout_batch_size': rollout_batch_size,
+                        'subtract_goals': simple_goal_subtract,
+                        'sample_transitions': sample_her_transitions,
+                        'gamma': gamma,
+                        })
+    ddpg_params['info'] = {
+        'env_name': params['env_name'],
+    }
+    policy = DDPG(reuse=reuse, **ddpg_params, use_mpi=use_mpi)
+    return policy
+
+
+def configure_dims(params):
+    env = cached_make_env(params['make_env'])
+    env.reset()
+    obs, _, _, info = env.step(env.action_space.sample())
+
+    dims = {
+        'o': obs['observation'].shape[0],
+        'u': env.action_space.shape[0],
+        'g': obs['desired_goal'].shape[0],
+    }
+    for key, value in info.items():
+        value = np.array(value)
+        if value.ndim == 0:
+            value = value.reshape(1)
+        dims['info_{}'.format(key)] = value.shape[0]
+    return dims
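
Review note on `config.py`: `prepare_params` sets `gamma = 1 - 1/T` and `configure_ddpg` then clips returns at `1 / (1 - gamma)`, which works out to the episode horizon `T`. The sketch below reproduces that arithmetic in isolation; the function name `discount_and_return_clip` is hypothetical and not part of the diff.

```python
def discount_and_return_clip(horizon):
    """Derive the discount factor and return clip from the episode horizon T."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    gamma = 1.0 - 1.0 / horizon
    # Geometric series: sum of gamma**t over t >= 0 equals 1 / (1 - gamma),
    # so a return built from rewards with |r| <= 1 cannot exceed this bound.
    clip_return = 1.0 / (1.0 - gamma)
    return gamma, clip_return


# Example with a horizon of 50 steps.
gamma, clip_return = discount_and_return_clip(50)
```

Up to floating-point rounding, `clip_return` equals the horizon, so the clip value grows with the episode length rather than being a free hyperparameter.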
--- a/baselines/her/experiment/play.py
+++ b/baselines/her/experiment/play.py
@@ -0,0 +1,60 @@
+import click
+import numpy as np
+import pickle
+
+from baselines import logger
+from baselines.common import set_global_seeds
+import baselines.her.experiment.config as config
+from baselines.her.rollout import RolloutWorker
+
+
+@click.command()
+@click.argument('policy_file', type=str)
+@click.option('--seed', type=int, default=0)
+@click.option('--n_test_rollouts', type=int, default=10)
+@click.option('--render', type=int, default=1)
+def main(policy_file, seed, n_test_rollouts, render):
+    set_global_seeds(seed)
+
+    # Load policy.
+    with open(policy_file, 'rb') as f:
+        policy = pickle.load(f)
+    env_name = policy.info['env_name']
+
+    # Prepare params.
+    params = config.DEFAULT_PARAMS.copy()  # copy: prepare_params deletes keys from the dict it receives
+    if env_name in config.DEFAULT_ENV_PARAMS:
+        params.update(config.DEFAULT_ENV_PARAMS[env_name])  # merge env-specific parameters in
+    params['env_name'] = env_name
+    params = config.prepare_params(params)
+    config.log_params(params, logger=logger)
+
+    dims = config.configure_dims(params)
+
+    eval_params = {
+        'exploit': True,
+        'use_target_net': params['test_with_polyak'],
+        'compute_Q': True,
+        'rollout_batch_size': 1,
+        'render': bool(render),
+    }
+
+    for name in ['T', 'gamma', 'noise_eps', 'random_eps']:
+        eval_params[name] = params[name]
+
+    evaluator = RolloutWorker(params['make_env'], policy, dims, logger, **eval_params)
+    evaluator.seed(seed)
+
+    # Run evaluation.
+    evaluator.clear_history()
+    for _ in range(n_test_rollouts):
+        evaluator.generate_rollouts()
+
+    # record logs
+    for key, val in evaluator.logs('test'):
+        logger.record_tabular(key, np.mean(val))
+    logger.dump_tabular()
+
+
+if __name__ == '__main__':
+    main()
--- a/baselines/her/experiment/plot.py
+++ b/baselines/her/experiment/plot.py
@@ -0,0 +1,118 @@
+import os
+import matplotlib.pyplot as plt
+import numpy as np
+import json
+import seaborn as sns; sns.set()
+import glob2
+import argparse
+
+
+def smooth_reward_curve(x, y):
+    halfwidth = int(np.ceil(len(x) / 60))  # Halfwidth of our smoothing convolution
+    k = halfwidth
+    xsmoo = x
+    ysmoo = np.convolve(y, np.ones(2 * k + 1), mode='same') / np.convolve(np.ones_like(y), np.ones(2 * k + 1),
+        mode='same')
+    return xsmoo, ysmoo
+
+
+def load_results(file):
+    if not os.path.exists(file):
+        return None
+    with open(file, 'r') as f:
+        lines = [line for line in f]
+    if len(lines) < 2:
+        return None
+    keys = [name.strip() for name in lines[0].split(',')]
+    data = np.genfromtxt(file, delimiter=',', skip_header=1, filling_values=0.)
+    if data.ndim == 1:
+        data = data.reshape(1, -1)
+    assert data.ndim == 2
+    assert data.shape[-1] == len(keys)
+    result = {}
+    for idx, key in enumerate(keys):
+        result[key] = data[:, idx]
+    return result
+
+
+def pad(xs, value=np.nan):
+    maxlen = np.max([len(x) for x in xs])
+
+    padded_xs = []
+    for x in xs:
+        if x.shape[0] >= maxlen:
+            padded_xs.append(x)
+            continue  # already full length; without this, x would be appended a second time below
+        padding = np.ones((maxlen - x.shape[0],) + x.shape[1:]) * value
+        x_padded = np.concatenate([x, padding], axis=0)
+        assert x_padded.shape[1:] == x.shape[1:]
+        assert x_padded.shape[0] == maxlen
+        padded_xs.append(x_padded)
+    return np.array(padded_xs)
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument('dir', type=str)
+parser.add_argument('--smooth', type=int, default=1)
+args = parser.parse_args()
+
+# Load all data.
+data = {}
+paths = [os.path.abspath(os.path.join(path, '..')) for path in glob2.glob(os.path.join(args.dir, '**', 'progress.csv'))]
+for curr_path in paths:
+    if not os.path.isdir(curr_path):
+        continue
+    results = load_results(os.path.join(curr_path, 'progress.csv'))
+    if not results:
+        print('skipping {}'.format(curr_path))
+        continue
+    print('loading {} ({})'.format(curr_path, len(results['epoch'])))
+    with open(os.path.join(curr_path, 'params.json'), 'r') as f:
+        params = json.load(f)
+
+    success_rate = np.array(results['test/success_rate'])
+    epoch = np.array(results['epoch']) + 1
+    env_id = params['env_name']
+    replay_strategy = params['replay_strategy']
+
+    if replay_strategy == 'future':
+        config = 'her'
+    else:
+        config = 'ddpg'
+    if 'Dense' in env_id:
+        config += '-dense'
+    else:
+        config += '-sparse'
+    env_id = env_id.replace('Dense', '')
+
+    # Process and smooth data.
+    assert success_rate.shape == epoch.shape
+    x = epoch
+    y = success_rate
+    if args.smooth:
+        x, y = smooth_reward_curve(epoch, success_rate)
+    assert x.shape == y.shape
+
+    if env_id not in data:
+        data[env_id] = {}
+    if config not in data[env_id]:
+        data[env_id][config] = []
+    data[env_id][config].append((x, y))
+
+# Plot data.
+for env_id in sorted(data.keys()):
+    print('exporting {}'.format(env_id))
+    plt.clf()
+
+    for config in sorted(data[env_id].keys()):
+        xs, ys = zip(*data[env_id][config])
+        xs, ys = pad(xs), pad(ys)
+        assert xs.shape == ys.shape
+
+        plt.plot(xs[0], np.nanmedian(ys, axis=0), label=config)
+        plt.fill_between(xs[0], np.nanpercentile(ys, 25, axis=0), np.nanpercentile(ys, 75, axis=0), alpha=0.25)
+    plt.title(env_id)
+    plt.xlabel('Epoch')
+    plt.ylabel('Median Success Rate')
+    plt.legend()
+    plt.savefig(os.path.join(args.dir, 'fig_{}.png'.format(env_id)))
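
Review note on `plot.py`: `smooth_reward_curve` divides a box-filter convolution of `y` by the same convolution applied to an array of ones, which amounts to a moving average whose window is truncated at the ends of the series instead of being zero-padded. A dependency-free sketch of the same idea follows. It assumes the series is at least as long as the window (`2 * halfwidth + 1`), and `smooth_box` is a hypothetical name, not a function from the diff.

```python
def smooth_box(y, halfwidth):
    """Moving average whose window is truncated at the ends of the series."""
    n = len(y)
    smoothed = []
    for i in range(n):
        lo = max(0, i - halfwidth)          # first index inside the window
        hi = min(n, i + halfwidth + 1)      # one past the last index inside the window
        window = y[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


result = smooth_box([0.0, 3.0, 6.0], halfwidth=1)
```

Dividing by the number of in-range samples, rather than by the full window width, keeps the curve from being pulled toward zero at the first and last epochs.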
--- a/baselines/her/experiment/train.py
+++ b/baselines/her/experiment/train.py
@@ -0,0 +1,191 @@
+import os
+import sys
+
+import click
+import numpy as np
+import json
+from mpi4py import MPI
+
+from baselines import logger
+from baselines.common import set_global_seeds
+from baselines.common.mpi_moments import mpi_moments
+import baselines.her.experiment.config as config
+from baselines.her.rollout import RolloutWorker
+from baselines.her.util import mpi_fork
+
+from subprocess import CalledProcessError
+
+
+def mpi_average(value):
+    if value == []:
+        value = [0.]
+    if not isinstance(value, list):
+        value = [value]
+    return mpi_moments(np.array(value))[0]
+
+
+def train(policy, rollout_worker, evaluator,
+          n_epochs, n_test_rollouts, n_cycles, n_batches, policy_save_interval,
+          save_policies, **kwargs):
+    rank = MPI.COMM_WORLD.Get_rank()
+
+    latest_policy_path = os.path.join(logger.get_dir(), 'policy_latest.pkl')
+    best_policy_path = os.path.join(logger.get_dir(), 'policy_best.pkl')
+    periodic_policy_path = os.path.join(logger.get_dir(), 'policy_{}.pkl')
+
+    logger.info("Training...")
+    best_success_rate = -1
+    for epoch in range(n_epochs):
+        # train
+        rollout_worker.clear_history()
+        for _ in range(n_cycles):
+            episode = rollout_worker.generate_rollouts()
+            policy.store_episode(episode)
+            for _ in range(n_batches):
+                policy.train()
+            policy.update_target_net()
+
+        # test
+        evaluator.clear_history()
+        for _ in range(n_test_rollouts):
+            evaluator.generate_rollouts()
+
+        # record logs
+        logger.record_tabular('epoch', epoch)
+        for key, val in evaluator.logs('test'):
+            logger.record_tabular(key, mpi_average(val))
+        for key, val in rollout_worker.logs('train'):
+            logger.record_tabular(key, mpi_average(val))
+        for key, val in policy.logs():
+            logger.record_tabular(key, mpi_average(val))
+
+        if rank == 0:
+            logger.dump_tabular()
+
+        # save the policy if it's better than the previous ones
+        success_rate = mpi_average(evaluator.current_success_rate())
+        if rank == 0 and success_rate >= best_success_rate and save_policies:
+            best_success_rate = success_rate
+            logger.info('New best success rate: {}. Saving policy to {} ...'.format(best_success_rate, best_policy_path))
+            evaluator.save_policy(best_policy_path)
+            evaluator.save_policy(latest_policy_path)
+        if rank == 0 and policy_save_interval > 0 and epoch % policy_save_interval == 0 and save_policies:
+            policy_path = periodic_policy_path.format(epoch)
+            logger.info('Saving periodic policy to {} ...'.format(policy_path))
+            evaluator.save_policy(policy_path)
+
+        # make sure that different threads have different seeds
+        local_uniform = np.random.uniform(size=(1,))
+        root_uniform = local_uniform.copy()
+        MPI.COMM_WORLD.Bcast(root_uniform, root=0)
+        if rank != 0:
+            assert local_uniform[0] != root_uniform[0]
+
+
+def launch(
+    env, logdir, n_epochs, num_cpu, seed, replay_strategy, policy_save_interval, clip_return,
+    override_params={}, save_policies=True
+):
+    # Fork for multi-CPU MPI implementation.
+    if num_cpu > 1:
+        try:
+            whoami = mpi_fork(num_cpu, ['--bind-to', 'core'])
+        except CalledProcessError:
+            # fancy version of mpi call failed, try simple version
+            whoami = mpi_fork(num_cpu)
+
+        if whoami == 'parent':
+            sys.exit(0)
+        import baselines.common.tf_util as U
+        U.single_threaded_session().__enter__()
+    rank = MPI.COMM_WORLD.Get_rank()
+
+    # Configure logging
+    if rank == 0:
+        if logdir or logger.get_dir() is None:
+            logger.configure(dir=logdir)
+    else:
+        logger.configure()
+    logdir = logger.get_dir()
+    assert logdir is not None
+    os.makedirs(logdir, exist_ok=True)
+
+    # Seed everything.
+    rank_seed = seed + 1000000 * rank
+    set_global_seeds(rank_seed)
+
+    # Prepare params.
+    params = config.DEFAULT_PARAMS.copy()  # copy: prepare_params deletes keys from the dict it receives
+    params['env_name'] = env
+    params['replay_strategy'] = replay_strategy
+    if env in config.DEFAULT_ENV_PARAMS:
+        params.update(config.DEFAULT_ENV_PARAMS[env])  # merge env-specific parameters in
+    params.update(**override_params)  # makes it possible to override any parameter
+    with open(os.path.join(logger.get_dir(), 'params.json'), 'w') as f:
+        json.dump(params, f)
+    params = config.prepare_params(params)
+    config.log_params(params, logger=logger)
+
+    if num_cpu == 1:
+        logger.warn()
+        logger.warn('*** Warning ***')
+        logger.warn(
+            'You are running HER with just a single MPI worker. This will work, but the ' +
+            'experiments that we report in Plappert et al. (2018, https://arxiv.org/abs/1802.09464) ' +
+            'were obtained with --num_cpu 19. This makes a significant difference and if you ' +
+            'are looking to reproduce those results, be aware of this. Please also refer to ' +
+            'https://github.com/openai/baselines/issues/314 for further details.')
+        logger.warn('****************')
+        logger.warn()
+
+    dims = config.configure_dims(params)
+    policy = config.configure_ddpg(dims=dims, params=params, clip_return=clip_return)
+
+    rollout_params = {
+        'exploit': False,
+        'use_target_net': False,
+        'use_demo_states': True,
+        'compute_Q': False,
+        'T': params['T'],
+    }
+
+    eval_params = {
+        'exploit': True,
+        'use_target_net': params['test_with_polyak'],
+        'use_demo_states': False,
+        'compute_Q': True,
+        'T': params['T'],
+    }
+
+    for name in ['T', 'rollout_batch_size', 'gamma', 'noise_eps', 'random_eps']:
+        rollout_params[name] = params[name]
+        eval_params[name] = params[name]
+
+    rollout_worker = RolloutWorker(params['make_env'], policy, dims, logger, **rollout_params)
+    rollout_worker.seed(rank_seed)
+
+    evaluator = RolloutWorker(params['make_env'], policy, dims, logger, **eval_params)
+    evaluator.seed(rank_seed)
+
+    train(
+        logdir=logdir, policy=policy, rollout_worker=rollout_worker,
+        evaluator=evaluator, n_epochs=n_epochs, n_test_rollouts=params['n_test_rollouts'],
+        n_cycles=params['n_cycles'], n_batches=params['n_batches'],
+        policy_save_interval=policy_save_interval, save_policies=save_policies)
+
+
+@click.command()
+@click.option('--env', type=str, default='FetchReach-v1', help='the name of the OpenAI Gym environment that you want to train on')
+@click.option('--logdir', type=str, default=None, help='the path to where logs and policy pickles should go. If not specified, creates a folder in /tmp/')
+@click.option('--n_epochs', type=int, default=50, help='the number of training epochs to run')
+@click.option('--num_cpu', type=int, default=1, help='the number of CPU cores to use (using MPI)')
+@click.option('--seed', type=int, default=0, help='the random seed used to seed both the environment and the training code')
+@click.option('--policy_save_interval', type=int, default=5, help='the interval with which policy pickles are saved. If set to 0, only the best and latest policy will be pickled.')
+@click.option('--replay_strategy', type=click.Choice(['future', 'none']), default='future', help='the HER replay strategy to be used. "future" uses HER, "none" disables HER.')
+@click.option('--clip_return', type=int, default=1, help='whether or not returns should be clipped')
+def main(**kwargs):
+    launch(**kwargs)
+
+
+if __name__ == '__main__':
+    main()
--- a/baselines/her/her.py
+++ b/baselines/her/her.py
@@ -0,0 +1,63 @@
+import numpy as np
+
+
+def make_sample_her_transitions(replay_strategy, replay_k, reward_fun):
+    """Creates a sample function that can be used for HER experience replay.
+
+    Args:
+        replay_strategy (in ['future', 'none']): the HER replay strategy; if set to 'none',
+            regular DDPG experience replay is used
+        replay_k (int): the ratio between HER replays and regular replays (e.g. k = 4 -> 4 times
+            as many HER replays as regular replays are used)
+        reward_fun (function): function to re-compute the reward with substituted goals
+    """
+    if replay_strategy == 'future':
+        future_p = 1 - (1. / (1 + replay_k))
+    else:  # 'replay_strategy' == 'none'
+        future_p = 0
+
+    def _sample_her_transitions(episode_batch, batch_size_in_transitions):
+        """episode_batch is {key: array(buffer_size x T x dim_key)}
+        """
+        T = episode_batch['u'].shape[1]
+        rollout_batch_size = episode_batch['u'].shape[0]
+        batch_size = batch_size_in_transitions
+
+        # Select which episodes and time steps to use.
+        episode_idxs = np.random.randint(0, rollout_batch_size, batch_size)
+        t_samples = np.random.randint(T, size=batch_size)
+        transitions = {key: episode_batch[key][episode_idxs, t_samples].copy()
+                       for key in episode_batch.keys()}
+
+        # Select which transitions to relabel: each one is chosen with probability future_p.
+        # The chosen transitions are used for HER replay by substituting in future goals.
+        her_indexes = np.where(np.random.uniform(size=batch_size) < future_p)
+        future_offset = np.random.uniform(size=batch_size) * (T - t_samples)
+        future_offset = future_offset.astype(int)
+        future_t = (t_samples + 1 + future_offset)[her_indexes]
+
+        # Replace goal with achieved goal but only for the previously-selected
+        # HER transitions (as defined by her_indexes). For the other transitions,
+        # keep the original goal.
+        future_ag = episode_batch['ag'][episode_idxs[her_indexes], future_t]
+        transitions['g'][her_indexes] = future_ag
+
+        # Reconstruct info dictionary for reward computation.
+        info = {}
+        for key, value in transitions.items():
+            if key.startswith('info_'):
+                info[key.replace('info_', '')] = value
+
+        # Re-compute reward since we may have substituted the goal.
+        reward_params = {k: transitions[k] for k in ['ag_2', 'g']}
+        reward_params['info'] = info
+        transitions['r'] = reward_fun(**reward_params)
+
+        transitions = {k: transitions[k].reshape(batch_size, *transitions[k].shape[1:])
+                       for k in transitions.keys()}
+
+        assert(transitions['u'].shape[0] == batch_size_in_transitions)
+
+        return transitions
+
+    return _sample_her_transitions
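
Review note on `her.py`: with the `future` strategy, a transition is relabeled with probability `1 - 1 / (1 + replay_k)`, and the substitute goal is the achieved goal at a time step drawn uniformly from the steps after `t`, up to and including `T`. The sketch below isolates those two pieces for a single transition using only the standard library. Both function names are hypothetical; the diff does the same thing in vectorized NumPy.

```python
import random


def her_relabel_probability(replay_k):
    """Fraction of sampled transitions whose goal is replaced under 'future'."""
    return 1.0 - 1.0 / (1.0 + replay_k)


def sample_future_index(t, horizon, rng):
    """Pick a time index in [t + 1, horizon] from which to take the achieved goal."""
    # rng.random() lies in [0, 1), so the offset is an integer in [0, horizon - t - 1].
    offset = int(rng.random() * (horizon - t))
    return t + 1 + offset


rng = random.Random(0)
future_p = her_relabel_probability(4)
samples = [sample_future_index(t=10, horizon=50, rng=rng) for _ in range(1000)]
```

Index `horizon` is valid here because the achieved-goal buffer holds `T + 1` entries per episode, one more than the action buffer.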
--- a/baselines/her/normalizer.py
+++ b/baselines/her/normalizer.py
@@ -0,0 +1,140 @@
+import threading
+
+import numpy as np
+from mpi4py import MPI
+import tensorflow as tf
+
+from baselines.her.util import reshape_for_broadcasting
+
+
+class Normalizer:
+    def __init__(self, size, eps=1e-2, default_clip_range=np.inf, sess=None):
+        """A normalizer that ensures that observations are approximately distributed according to
+        a standard Normal distribution (i.e. have mean zero and variance one).
+
+        Args:
+            size (int): the size of the observation to be normalized
+            eps (float): lower bound on the std estimate, which avoids division by near-zero values
+            default_clip_range (float): normalized observations are clipped to be in
+                [-default_clip_range, default_clip_range]
+            sess (object): the TensorFlow session to be used
+        """
+        self.size = size
+        self.eps = eps
+        self.default_clip_range = default_clip_range
+        self.sess = sess if sess is not None else tf.get_default_session()
+
+        self.local_sum = np.zeros(self.size, np.float32)
+        self.local_sumsq = np.zeros(self.size, np.float32)
+        self.local_count = np.zeros(1, np.float32)
+
+        self.sum_tf = tf.get_variable(
+            initializer=tf.zeros_initializer(), shape=self.local_sum.shape, name='sum',
+            trainable=False, dtype=tf.float32)
+        self.sumsq_tf = tf.get_variable(
+            initializer=tf.zeros_initializer(), shape=self.local_sumsq.shape, name='sumsq',
+            trainable=False, dtype=tf.float32)
+        self.count_tf = tf.get_variable(
+            initializer=tf.ones_initializer(), shape=self.local_count.shape, name='count',
+            trainable=False, dtype=tf.float32)
+        self.mean = tf.get_variable(
+            initializer=tf.zeros_initializer(), shape=(self.size,), name='mean',
+            trainable=False, dtype=tf.float32)
+        self.std = tf.get_variable(
+            initializer=tf.ones_initializer(), shape=(self.size,), name='std',
+            trainable=False, dtype=tf.float32)
+        self.count_pl = tf.placeholder(name='count_pl', shape=(1,), dtype=tf.float32)
+        self.sum_pl = tf.placeholder(name='sum_pl', shape=(self.size,), dtype=tf.float32)
+        self.sumsq_pl = tf.placeholder(name='sumsq_pl', shape=(self.size,), dtype=tf.float32)
+
+        self.update_op = tf.group(
+            self.count_tf.assign_add(self.count_pl),
+            self.sum_tf.assign_add(self.sum_pl),
+            self.sumsq_tf.assign_add(self.sumsq_pl)
+        )
+        self.recompute_op = tf.group(
+            tf.assign(self.mean, self.sum_tf / self.count_tf),
+            tf.assign(self.std, tf.sqrt(tf.maximum(
+                tf.square(self.eps),
+                self.sumsq_tf / self.count_tf - tf.square(self.sum_tf / self.count_tf)
+            ))),
+        )
+        self.lock = threading.Lock()
+
+    def update(self, v):
+        v = v.reshape(-1, self.size)
+
+        with self.lock:
+            self.local_sum += v.sum(axis=0)
+            self.local_sumsq += (np.square(v)).sum(axis=0)
+            self.local_count[0] += v.shape[0]
+
+    def normalize(self, v, clip_range=None):
+        if clip_range is None:
+            clip_range = self.default_clip_range
+        mean = reshape_for_broadcasting(self.mean, v)
+        std = reshape_for_broadcasting(self.std,  v)
+        return tf.clip_by_value((v - mean) / std, -clip_range, clip_range)
+
+    def denormalize(self, v):
+        mean = reshape_for_broadcasting(self.mean, v)
+        std = reshape_for_broadcasting(self.std,  v)
+        return mean + v * std
+
+    def _mpi_average(self, x):
+        buf = np.zeros_like(x)
+        MPI.COMM_WORLD.Allreduce(x, buf, op=MPI.SUM)
+        buf /= MPI.COMM_WORLD.Get_size()
+        return buf
+
+    def synchronize(self, local_sum, local_sumsq, local_count, root=None):
+        local_sum[...] = self._mpi_average(local_sum)
+        local_sumsq[...] = self._mpi_average(local_sumsq)
+        local_count[...] = self._mpi_average(local_count)
+        return local_sum, local_sumsq, local_count
+
+    def recompute_stats(self):
+        with self.lock:
+            # Copy over results.
+            local_count = self.local_count.copy()
+            local_sum = self.local_sum.copy()
+            local_sumsq = self.local_sumsq.copy()
+
+            # Reset.
+            self.local_count[...] = 0
+            self.local_sum[...] = 0
+            self.local_sumsq[...] = 0
+
+        # We perform the synchronization outside of the lock to keep the critical section as short
+        # as possible.
+        synced_sum, synced_sumsq, synced_count = self.synchronize(
+            local_sum=local_sum, local_sumsq=local_sumsq, local_count=local_count)
+
+        self.sess.run(self.update_op, feed_dict={
+            self.count_pl: synced_count,
+            self.sum_pl: synced_sum,
+            self.sumsq_pl: synced_sumsq,
+        })
+        self.sess.run(self.recompute_op)
+
+
+class IdentityNormalizer:
+    def __init__(self, size, std=1.):
+        self.size = size
+        self.mean = tf.zeros(self.size, tf.float32)
+        self.std = std * tf.ones(self.size, tf.float32)
+
+    def update(self, x):
+        pass
+
+    def normalize(self, x, clip_range=None):
+        return x / self.std
+
+    def denormalize(self, x):
+        return self.std * x
+
+    def synchronize(self):
+        pass
+
+    def recompute_stats(self):
+        pass
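
Review note on `normalizer.py`: the class keeps a running sum, sum of squares, and count, and recomputes the standard deviation as `sqrt(max(eps^2, E[x^2] - E[x]^2))`. The sketch below shows that computation for scalars in plain Python. `RunningNormalizer` is a hypothetical stand-in: it omits the TensorFlow variables, the lock, and the MPI averaging, and it starts the count at zero, whereas the diff initializes `count_tf` to one.

```python
import math


class RunningNormalizer:
    """Scalar running mean/std estimate with a floor on the std."""

    def __init__(self, eps=1e-2):
        self.eps = eps
        self.total = 0.0
        self.total_sq = 0.0
        self.count = 0

    def update(self, values):
        for v in values:
            self.total += v
            self.total_sq += v * v
            self.count += 1

    def stats(self):
        if self.count == 0:
            return 0.0, 1.0  # no data yet: behave like the identity
        mean = self.total / self.count
        variance = self.total_sq / self.count - mean * mean
        # Floor the variance at eps**2 so the std never drops below eps.
        std = math.sqrt(max(self.eps ** 2, variance))
        return mean, std

    def normalize(self, value, clip_range=5.0):
        mean, std = self.stats()
        return max(-clip_range, min(clip_range, (value - mean) / std))


normalizer = RunningNormalizer()
normalizer.update([1.0, 2.0, 3.0, 4.0])
mean, std = normalizer.stats()
```

The floor matters for observation dimensions that never change: their empirical variance is zero, and dividing by it would otherwise produce infinities.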
--- a/baselines/her/replay_buffer.py
+++ b/baselines/her/replay_buffer.py
@@ -0,0 +1,108 @@
+import threading
+
+import numpy as np
+
+
+class ReplayBuffer:
+    def __init__(self, buffer_shapes, size_in_transitions, T, sample_transitions):
+        """Creates a replay buffer.
+
+        Args:
+            buffer_shapes (dict of tuples): the per-episode shape of each buffer used in the
+                replay buffer
+            size_in_transitions (int): the size of the buffer, measured in transitions
+            T (int): the time horizon for episodes
+            sample_transitions (function): a function that samples from the replay buffer
+        """
+        self.buffer_shapes = buffer_shapes
+        self.size = size_in_transitions // T
+        self.T = T
+        self.sample_transitions = sample_transitions
+
+        # self.buffers is {key: array(size_in_episodes x T or T+1 x dim_key)}
+        self.buffers = {key: np.empty([self.size, *shape])
+                        for key, shape in buffer_shapes.items()}
+
+        # memory management
+        self.current_size = 0
+        self.n_transitions_stored = 0
+
+        self.lock = threading.Lock()
+
+    @property
+    def full(self):
+        with self.lock:
+            return self.current_size == self.size
+
+    def sample(self, batch_size):
+        """Returns a dict {key: array(batch_size x shapes[key])}
+        """
+        buffers = {}
+
+        with self.lock:
+            assert self.current_size > 0
+            for key in self.buffers.keys():
+                buffers[key] = self.buffers[key][:self.current_size]
+
+        buffers['o_2'] = buffers['o'][:, 1:, :]
+        buffers['ag_2'] = buffers['ag'][:, 1:, :]
+
+        transitions = self.sample_transitions(buffers, batch_size)
+
+        for key in (['r', 'o_2', 'ag_2'] + list(self.buffers.keys())):
+            assert key in transitions, "key %s missing from transitions" % key
+
+        return transitions
+
+    def store_episode(self, episode_batch):
+        """episode_batch: array(batch_size x (T or T+1) x dim_key)
+        """
+        batch_sizes = [len(episode_batch[key]) for key in episode_batch.keys()]
+        assert np.all(np.array(batch_sizes) == batch_sizes[0])
+        batch_size = batch_sizes[0]
+
+        with self.lock:
+            idxs = self._get_storage_idx(batch_size)
+
+            # load inputs into buffers
+            for key in self.buffers.keys():
+                self.buffers[key][idxs] = episode_batch[key]
+
+            self.n_transitions_stored += batch_size * self.T
+
+    def get_current_episode_size(self):
+        with self.lock:
+            return self.current_size
+
+    def get_current_size(self):
+        with self.lock:
+            return self.current_size * self.T
+
+    def get_transitions_stored(self):
+        with self.lock:
+            return self.n_transitions_stored
+
+    def clear_buffer(self):
+        with self.lock:
+            self.current_size = 0
+
+    def _get_storage_idx(self, inc=None):
+        inc = inc or 1   # size increment
+        assert inc <= self.size, "Batch committed to replay is too large!"
+        # go consecutively until you hit the end, and then go randomly.
+        if self.current_size+inc <= self.size:
+            idx = np.arange(self.current_size, self.current_size+inc)
+        elif self.current_size < self.size:
+            overflow = inc - (self.size - self.current_size)
+            idx_a = np.arange(self.current_size, self.size)
+            idx_b = np.random.randint(0, self.current_size, overflow)
+            idx = np.concatenate([idx_a, idx_b])
+        else:
+            idx = np.random.randint(0, self.size, inc)
+
+        # update replay size
+        self.current_size = min(self.size, self.current_size+inc)
+
+        if inc == 1:
+            idx = idx[0]
+        return idx
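
Review note on `replay_buffer.py`: `_get_storage_idx` fills the buffer consecutively until it is full and then overwrites randomly chosen episodes. The sketch below restates that policy as a pure function so the three branches are easy to follow. `storage_indices` is a hypothetical name, and it always returns a list, whereas the diff returns a scalar index when `inc == 1`.

```python
import random


def storage_indices(current_size, capacity, inc, rng):
    """Return (indices, new_size) for storing `inc` episodes."""
    if inc > capacity:
        raise ValueError("batch is larger than the buffer")
    if current_size + inc <= capacity:
        # Enough free slots: append consecutively.
        idx = list(range(current_size, current_size + inc))
    elif current_size < capacity:
        # Partly free: fill the remaining slots, then overwrite random old episodes.
        overflow = inc - (capacity - current_size)
        idx = list(range(current_size, capacity))
        idx += [rng.randrange(current_size) for _ in range(overflow)]
    else:
        # Full: overwrite random episodes.
        idx = [rng.randrange(capacity) for _ in range(inc)]
    return idx, min(capacity, current_size + inc)


rng = random.Random(0)
first, size = storage_indices(0, 3, 2, rng)
second, size = storage_indices(size, 3, 2, rng)
third, size = storage_indices(size, 3, 1, rng)
```

Random replacement, rather than a ring buffer, means old episodes are evicted with equal probability regardless of age.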
--- a/baselines/her/rollout.py
+++ b/baselines/her/rollout.py
@@ -0,0 +1,188 @@
+from collections import deque
+
+import numpy as np
+import pickle
+from mujoco_py import MujocoException
+
+from baselines.her.util import convert_episode_to_batch_major, store_args
+
+
+class RolloutWorker:
+
+    @store_args
+    def __init__(self, make_env, policy, dims, logger, T, rollout_batch_size=1,
+                 exploit=False, use_target_net=False, compute_Q=False, noise_eps=0,
+                 random_eps=0, history_len=100, render=False, **kwargs):
+        """Rollout worker generates experience by interacting with one or many environments.
+
+        Args:
+            make_env (function): a factory function that creates a new environment instance
+            policy (object): the policy that is used to act
+            dims (dict of ints): the dimensions for observations (o), goals (g), and actions (u)
+            logger (object): the logger that is used by the rollout worker
+            T (int): the time horizon, i.e. the number of steps in each rollout
+            rollout_batch_size (int): the number of parallel rollouts that should be used
+            exploit (boolean): whether or not to exploit, i.e. to act optimally according to the
+                current policy without any exploration
+            use_target_net (boolean): whether or not to use the target net for rollouts
+            compute_Q (boolean): whether or not to compute the Q values alongside the actions
+            noise_eps (float): scale of the additive Gaussian noise
+            random_eps (float): probability of selecting a completely random action
+            history_len (int): length of history for statistics smoothing
+            render (boolean): whether or not to render the rollouts
+        """
+        self.envs = [make_env() for _ in range(rollout_batch_size)]
+        assert self.T > 0
+
+        self.info_keys = [key.replace('info_', '') for key in dims.keys() if key.startswith('info_')]
+
+        self.success_history = deque(maxlen=history_len)
+        self.Q_history = deque(maxlen=history_len)
+
+        self.n_episodes = 0
+        self.g = np.empty((self.rollout_batch_size, self.dims['g']), np.float32)  # goals
+        self.initial_o = np.empty((self.rollout_batch_size, self.dims['o']), np.float32)  # observations
+        self.initial_ag = np.empty((self.rollout_batch_size, self.dims['g']), np.float32)  # achieved goals
+        self.reset_all_rollouts()
+        self.clear_history()
+
+    def reset_rollout(self, i):
+        """Resets the `i`-th rollout environment, re-samples a new goal, and updates the `initial_o`
+        and `g` arrays accordingly.
+        """
+        obs = self.envs[i].reset()
+        self.initial_o[i] = obs['observation']
+        self.initial_ag[i] = obs['achieved_goal']
+        self.g[i] = obs['desired_goal']
+
+    def reset_all_rollouts(self):
+        """Resets all `rollout_batch_size` rollout workers.
+        """
+        for i in range(self.rollout_batch_size):
+            self.reset_rollout(i)
+
+    def generate_rollouts(self):
+        """Performs `rollout_batch_size` rollouts in parallel for time horizon `T`, acting with
+        the current policy at every step.
+        """
+        self.reset_all_rollouts()
+
+        # compute observations
+        o = np.empty((self.rollout_batch_size, self.dims['o']), np.float32)  # observations
+        ag = np.empty((self.rollout_batch_size, self.dims['g']), np.float32)  # achieved goals
+        o[:] = self.initial_o
+        ag[:] = self.initial_ag
+
+        # generate episodes
+        obs, achieved_goals, acts, goals, successes = [], [], [], [], []
+        info_values = [np.empty((self.T, self.rollout_batch_size, self.dims['info_' + key]), np.float32) for key in self.info_keys]
+        Qs = []
+        for t in range(self.T):
+            policy_output = self.policy.get_actions(
+                o, ag, self.g,
+                compute_Q=self.compute_Q,
+                noise_eps=self.noise_eps if not self.exploit else 0.,
+                random_eps=self.random_eps if not self.exploit else 0.,
+                use_target_net=self.use_target_net)
+
+            if self.compute_Q:
+                u, Q = policy_output
+                Qs.append(Q)
+            else:
+                u = policy_output
+
+            if u.ndim == 1:
+                # The non-batched case should still have a reasonable shape.
+                u = u.reshape(1, -1)
+
+            o_new = np.empty((self.rollout_batch_size, self.dims['o']))
+            ag_new = np.empty((self.rollout_batch_size, self.dims['g']))
+            success = np.zeros(self.rollout_batch_size)
+            # compute new states and observations
+            for i in range(self.rollout_batch_size):
+                try:
+                    # We fully ignore the reward here because it will have to be re-computed
+                    # for HER.
+                    curr_o_new, _, _, info = self.envs[i].step(u[i])
+                    if 'is_success' in info:
+                        success[i] = info['is_success']
+                    o_new[i] = curr_o_new['observation']
+                    ag_new[i] = curr_o_new['achieved_goal']
+                    for idx, key in enumerate(self.info_keys):
+                        info_values[idx][t, i] = info[key]
+                    if self.render:
+                        self.envs[i].render()
+                except MujocoException:
+                    return self.generate_rollouts()
+
+            if np.isnan(o_new).any():
+                self.logger.warning('NaN caught during rollout generation. Trying again...')
+                self.reset_all_rollouts()
+                return self.generate_rollouts()
+
+            obs.append(o.copy())
+            achieved_goals.append(ag.copy())
+            successes.append(success.copy())
+            acts.append(u.copy())
+            goals.append(self.g.copy())
+            o[...] = o_new
+            ag[...] = ag_new
+        obs.append(o.copy())
+        achieved_goals.append(ag.copy())
+        self.initial_o[:] = o
+
+        episode = dict(o=obs,
+                       u=acts,
+                       g=goals,
+                       ag=achieved_goals)
+        for key, value in zip(self.info_keys, info_values):
+            episode['info_{}'.format(key)] = value
+
+        # stats
+        successful = np.array(successes)[-1, :]
+        assert successful.shape == (self.rollout_batch_size,)
+        success_rate = np.mean(successful)
+        self.success_history.append(success_rate)
+        if self.compute_Q:
+            self.Q_history.append(np.mean(Qs))
+        self.n_episodes += self.rollout_batch_size
+
+        return convert_episode_to_batch_major(episode)
+
+    def clear_history(self):
+        """Clears all histories that are used for statistics
+        """
+        self.success_history.clear()
+        self.Q_history.clear()
+
+    def current_success_rate(self):
+        return np.mean(self.success_history)
+
+    def current_mean_Q(self):
+        return np.mean(self.Q_history)
+
+    def save_policy(self, path):
+        """Pickles the current policy for later inspection.
+        """
+        with open(path, 'wb') as f:
+            pickle.dump(self.policy, f)
+
+    def logs(self, prefix='worker'):
+        """Generates a list of (key, value) pairs that contains all collected statistics.
+        """
+        logs = []
+        logs += [('success_rate', np.mean(self.success_history))]
+        if self.compute_Q:
+            logs += [('mean_Q', np.mean(self.Q_history))]
+        logs += [('episode', self.n_episodes)]
+
+        if prefix != '' and not prefix.endswith('/'):
+            return [(prefix + '/' + key, val) for key, val in logs]
+        else:
+            return logs
+
+    def seed(self, seed):
+        """Seeds each environment with a distinct seed derived from the passed in global seed.
+        """
+        for idx, env in enumerate(self.envs):
+            env.seed(seed + 1000 * idx)
--- a/baselines/her/util.py
+++ b/baselines/her/util.py
@@ -0,0 +1,140 @@
+import os
+import subprocess
+import sys
+import importlib
+import inspect
+import functools
+
+import tensorflow as tf
+import numpy as np
+
+from baselines.common import tf_util as U
+
+
+def store_args(method):
+    """Stores provided method args as instance attributes.
+    """
+    argspec = inspect.getfullargspec(method)
+    defaults = {}
+    if argspec.defaults is not None:
+        defaults = dict(
+            zip(argspec.args[-len(argspec.defaults):], argspec.defaults))
+    if argspec.kwonlydefaults is not None:
+        defaults.update(argspec.kwonlydefaults)
+    arg_names = argspec.args[1:]
+
+    @functools.wraps(method)
+    def wrapper(*positional_args, **keyword_args):
+        self = positional_args[0]
+        # Get default arg values
+        args = defaults.copy()
+        # Add provided arg values
+        for name, value in zip(arg_names, positional_args[1:]):
+            args[name] = value
+        args.update(keyword_args)
+        self.__dict__.update(args)
+        return method(*positional_args, **keyword_args)
+
+    return wrapper
+
+
+def import_function(spec):
+    """Import a function identified by a string like "pkg.module:fn_name".
+    """
+    mod_name, fn_name = spec.split(':')
+    module = importlib.import_module(mod_name)
+    fn = getattr(module, fn_name)
+    return fn
+
+
+def flatten_grads(var_list, grads):
+    """Flattens the gradients of a list of variables into a single 1-D tensor.
+    """
+    return tf.concat([tf.reshape(grad, [U.numel(v)])
+                      for (v, grad) in zip(var_list, grads)], 0)
+
+
+def nn(input, layers_sizes, reuse=None, flatten=False, name=""):
+    """Creates a simple neural network
+    """
+    for i, size in enumerate(layers_sizes):
+        activation = tf.nn.relu if i < len(layers_sizes) - 1 else None
+        input = tf.layers.dense(inputs=input,
+                                units=size,
+                                kernel_initializer=tf.contrib.layers.xavier_initializer(),
+                                reuse=reuse,
+                                name=name + '_' + str(i))
+        if activation:
+            input = activation(input)
+    if flatten:
+        assert layers_sizes[-1] == 1
+        input = tf.reshape(input, [-1])
+    return input
+
+
+def install_mpi_excepthook():
+    import sys
+    from mpi4py import MPI
+    old_hook = sys.excepthook
+
+    def new_hook(a, b, c):
+        old_hook(a, b, c)
+        sys.stdout.flush()
+        sys.stderr.flush()
+        MPI.COMM_WORLD.Abort()
+    sys.excepthook = new_hook
+
+
+def mpi_fork(n, extra_mpi_args=[]):
+    """Re-launches the current script with workers
+    Returns "parent" for original parent, "child" for MPI children
+    """
+    if n <= 1:
+        return "child"
+    if os.getenv("IN_MPI") is None:
+        env = os.environ.copy()
+        env.update(
+            MKL_NUM_THREADS="1",
+            OMP_NUM_THREADS="1",
+            IN_MPI="1"
+        )
+        # "-bind-to core" is crucial for good performance; it can be supplied through extra_mpi_args
+        args = ["mpirun", "-np", str(n)] + \
+            extra_mpi_args + \
+            [sys.executable]
+
+        args += sys.argv
+        subprocess.check_call(args, env=env)
+        return "parent"
+    else:
+        install_mpi_excepthook()
+        return "child"
+
+
+def convert_episode_to_batch_major(episode):
+    """Converts an episode to have the batch dimension in the major (first)
+    dimension.
+    """
+    episode_batch = {}
+    for key in episode.keys():
+        val = np.array(episode[key]).copy()
+        # make inputs batch-major instead of time-major
+        episode_batch[key] = val.swapaxes(0, 1)
+
+    return episode_batch
+
+
+def transitions_in_episode_batch(episode_batch):
+    """Number of transitions in a given episode batch.
+    """
+    shape = episode_batch['u'].shape
+    return shape[0] * shape[1]
+
+
+def reshape_for_broadcasting(source, target):
+    """Reshapes a tensor (source) to have the correct shape and dtype of the target
+    before broadcasting it with MPI.
+    """
+    dim = len(target.get_shape())
+    shape = ([1] * (dim - 1)) + [-1]
+    return tf.reshape(tf.cast(source, target.dtype), shape)
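
How `store_args` behaves may be easier to see on a toy class. The sketch below repeats the decorator from the diff above so that it runs on its own. `Worker` and its parameters are hypothetical and only mimic the shape of `RolloutWorker.__init__`.

```python
import functools
import inspect


def store_args(method):
    """Stores provided method args as instance attributes (same logic as in the diff)."""
    argspec = inspect.getfullargspec(method)
    defaults = {}
    if argspec.defaults is not None:
        defaults = dict(zip(argspec.args[-len(argspec.defaults):], argspec.defaults))
    if argspec.kwonlydefaults is not None:
        defaults.update(argspec.kwonlydefaults)
    arg_names = argspec.args[1:]  # skip `self`

    @functools.wraps(method)
    def wrapper(*positional_args, **keyword_args):
        self = positional_args[0]
        args = defaults.copy()                       # start from the declared defaults
        for name, value in zip(arg_names, positional_args[1:]):
            args[name] = value                       # then positional arguments
        args.update(keyword_args)                    # then keyword arguments
        self.__dict__.update(args)
        return method(*positional_args, **keyword_args)

    return wrapper


class Worker:  # hypothetical example class
    @store_args
    def __init__(self, T, rollout_batch_size=1, **kwargs):
        # The attributes already exist when the body runs,
        # which is why RolloutWorker can assert on self.T right away.
        assert self.T > 0


w = Worker(50, render=True)
```

Keyword arguments that are only captured by `**kwargs` are stored as attributes too, so a misspelled keyword silently becomes a new attribute instead of raising an error.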
--- a/baselines/logger.py
+++ b/baselines/logger.py
@@ -6,9 +6,7 @@ import json
 import time
 import datetime
 import tempfile
-
-LOG_OUTPUT_FORMATS = ['stdout', 'log', 'csv']
-# Also valid: json, tensorboard
+from collections import defaultdict

 DEBUG = 10
 INFO = 20
@@ -73,8 +71,11 @@ class HumanOutputFormat(KVWriter, SeqWriter):
        return s[:20] + '...' if len(s) > 23 else s

    def writeseq(self, seq):
-        for arg in seq:
-            self.file.write(arg)
+        seq = list(seq)
+        for (i, elem) in enumerate(seq):
+            self.file.write(elem)
+            if i < len(seq) - 1: # add space unless this is the last one
+                self.file.write(' ')
        self.file.write('\n')
        self.file.flush()

@@ -124,7 +125,7 @@ class CSVOutputFormat(KVWriter):
            if i > 0:
                self.file.write(',')
            v = kvs.get(k)
-            if v:
+            if v is not None:
                self.file.write(str(v))
        self.file.write('\n')
        self.file.flush()
@@ -168,24 +169,18 @@ class TensorBoardOutputFormat(KVWriter):
            self.writer.Close()
            self.writer = None

-def make_output_format(format, ev_dir):
-    from mpi4py import MPI
+def make_output_format(format, ev_dir, log_suffix=''):
    os.makedirs(ev_dir, exist_ok=True)
-    rank = MPI.COMM_WORLD.Get_rank()
    if format == 'stdout':
        return HumanOutputFormat(sys.stdout)
    elif format == 'log':
-        suffix = "" if rank==0 else ("-mpi%03i"%rank)
-        return HumanOutputFormat(osp.join(ev_dir, 'log%s.txt' % suffix))
+        return HumanOutputFormat(osp.join(ev_dir, 'log%s.txt' % log_suffix))
    elif format == 'json':
-        assert rank==0
-        return JSONOutputFormat(osp.join(ev_dir, 'progress.json'))
+        return JSONOutputFormat(osp.join(ev_dir, 'progress%s.json' % log_suffix))
    elif format == 'csv':
-        assert rank==0
-        return CSVOutputFormat(osp.join(ev_dir, 'progress.csv'))
+        return CSVOutputFormat(osp.join(ev_dir, 'progress%s.csv' % log_suffix))
    elif format == 'tensorboard':
-        assert rank==0
-        return TensorBoardOutputFormat(osp.join(ev_dir, 'tb'))
+        return TensorBoardOutputFormat(osp.join(ev_dir, 'tb%s' % log_suffix))
    else:
        raise ValueError('Unknown format specified: %s' % (format,))

@@ -197,9 +192,16 @@ def logkv(key, val):
    """
    Log a value of some diagnostic
    Call this once for each diagnostic quantity, each iteration
+    If called many times, the last value will be used.
    """
    Logger.CURRENT.logkv(key, val)

+def logkv_mean(key, val):
+    """
+    The same as logkv(), but if called many times, the values are averaged.
+    """
+    Logger.CURRENT.logkv_mean(key, val)
+
 def logkvs(d):
    """
    Log a dictionary of key-value pairs
@@ -255,6 +257,33 @@ def get_dir():
 record_tabular = logkv
 dump_tabular = dumpkvs

+class ProfileKV:
+    """
+    Usage:
+    with logger.ProfileKV("interesting_scope"):
+        code
+    """
+    def __init__(self, n):
+        self.n = "wait_" + n
+    def __enter__(self):
+        self.t1 = time.time()
+    def __exit__(self, type, value, traceback):
+        Logger.CURRENT.name2val[self.n] += time.time() - self.t1
+
+def profile(n):
+    """
+    Usage:
+    @profile("my_func")
+    def my_func(): code
+    """
+    def decorator_with_name(func):
+        def func_wrapper(*args, **kwargs):
+            with ProfileKV(n):
+                return func(*args, **kwargs)
+        return func_wrapper
+    return decorator_with_name
+
+
 # ================================================================
 # Backend
 # ================================================================
@@ -265,7 +294,8 @@ class Logger(object):
    CURRENT = None  # Current logger being used by the free functions above

    def __init__(self, dir, output_formats):
-        self.name2val = {}  # values this iteration
+        self.name2val = defaultdict(float)  # values this iteration
+        self.name2cnt = defaultdict(int)
        self.level = INFO
        self.dir = dir
        self.output_formats = output_formats
@@ -275,12 +305,21 @@ class Logger(object):
    def logkv(self, key, val):
        self.name2val[key] = val

+    def logkv_mean(self, key, val):
+        if val is None:
+            self.name2val[key] = None
+            return
+        oldval, cnt = self.name2val[key], self.name2cnt[key]
+        self.name2val[key] = oldval*cnt/(cnt+1) + val/(cnt+1)
+        self.name2cnt[key] = cnt + 1
+
    def dumpkvs(self):
        if self.level == DISABLED: return
        for fmt in self.output_formats:
            if isinstance(fmt, KVWriter):
                fmt.writekvs(self.name2val)
        self.name2val.clear()
+        self.name2cnt.clear()

    def log(self, *args, level=INFO):
        if self.level <= level:
@@ -316,10 +355,19 @@ def configure(dir=None, format_strs=None):
    assert isinstance(dir, str)
    os.makedirs(dir, exist_ok=True)

+    log_suffix = ''
+    from mpi4py import MPI
+    rank = MPI.COMM_WORLD.Get_rank()
+    if rank > 0:
+        log_suffix = "-rank%03i" % rank
+
    if format_strs is None:
-        strs = os.getenv('OPENAI_LOG_FORMAT')
-        format_strs = strs.split(',') if strs else LOG_OUTPUT_FORMATS
-    output_formats = [make_output_format(f, dir) for f in format_strs]
+        if rank == 0:
+            format_strs = os.getenv('OPENAI_LOG_FORMAT', 'stdout,log,csv').split(',')
+        else:
+            format_strs = os.getenv('OPENAI_LOG_FORMAT_MPI', 'log').split(',')
+    format_strs = filter(None, format_strs)
+    output_formats = [make_output_format(f, dir, log_suffix) for f in format_strs]

    Logger.CURRENT = Logger(dir=dir, output_formats=output_formats)
    log('Logging to %s'%dir)
@@ -360,6 +408,11 @@ def _demo():
    logkv("a", 5.5)
    dumpkvs()
    info("^^^ should see a = 5.5")
+    logkv_mean("b", -22.5)
+    logkv_mean("b", -44.4)
+    logkv("a", 5.5)
+    dumpkvs()
+    info("^^^ should see b close to -33.45 (the mean of -22.5 and -44.4)")

    logkv("b", -2.5)
    dumpkvs()
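
The incremental update in `Logger.logkv_mean` can be checked in isolation. The following is a minimal sketch that uses module-level dictionaries instead of a `Logger` instance (an assumption made to keep it short) and omits the `None` handling.

```python
from collections import defaultdict

name2val = defaultdict(float)  # running mean per key
name2cnt = defaultdict(int)    # number of samples per key


def logkv_mean(key, val):
    """Incremental mean, mirroring the update in Logger.logkv_mean."""
    oldval, cnt = name2val[key], name2cnt[key]
    # new_mean = old_mean * cnt / (cnt + 1) + val / (cnt + 1)
    name2val[key] = oldval * cnt / (cnt + 1) + val / (cnt + 1)
    name2cnt[key] = cnt + 1


logkv_mean("b", -22.5)
logkv_mean("b", -44.4)
print(name2val["b"])  # the mean of the two logged values
```

Because `dumpkvs()` clears both dictionaries, the mean restarts with every dump.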
--- a/baselines/ppo1/README.md
+++ b/baselines/ppo1/README.md
@@ -5,3 +5,5 @@
 - `mpirun -np 8 python -m baselines.ppo1.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
 - `python -m baselines.ppo1.run_mujoco` runs the algorithm for 1M frames on a Mujoco environment.

+- Train mujoco 3d humanoid (with optimal-ish hyperparameters): `mpirun -np 16 python -m baselines.ppo1.run_humanoid --model-path=/path/to/model`
+- Render the 3d humanoid: `python -m baselines.ppo1.run_humanoid --play --model-path=/path/to/model`
--- a/baselines/ppo1/pposgd_simple.py
+++ b/baselines/ppo1/pposgd_simple.py
@@ -212,5 +212,7 @@ def learn(env, policy_fn, *,
        if MPI.COMM_WORLD.Get_rank()==0:
            logger.dump_tabular()

+    return pi
+
 def flatten_lists(listoflists):
    return [el for list_ in listoflists for el in list_]
--- a/baselines/ppo1/run_atari.py
+++ b/baselines/ppo1/run_atari.py
@@ -18,7 +18,7 @@ def train(env_id, num_timesteps, seed):
        logger.configure()
    else:
        logger.configure(format_strs=[])
-    workerseed = seed + 10000 * MPI.COMM_WORLD.Get_rank()
+    workerseed = seed + 10000 * MPI.COMM_WORLD.Get_rank() if seed is not None else None
    set_global_seeds(workerseed)
    env = make_atari(env_id)
    def policy_fn(name, ob_space, ac_space): #pylint: disable=W0613
--- a/baselines/ppo1/run_humanoid.py
+++ b/baselines/ppo1/run_humanoid.py
@@ -0,0 +1,75 @@
+#!/usr/bin/env python3
+import os
+from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
+from baselines.common import tf_util as U
+from baselines import logger
+
+import gym
+
+def train(num_timesteps, seed, model_path=None):
+    env_id = 'Humanoid-v2'
+    from baselines.ppo1 import mlp_policy, pposgd_simple
+    U.make_session(num_cpu=1).__enter__()
+    def policy_fn(name, ob_space, ac_space):
+        return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
+            hid_size=64, num_hid_layers=2)
+    env = make_mujoco_env(env_id, seed)
+
+    # The parameters below were the best found in a simple random search.
+    # They are good enough to make the humanoid walk, but they are not
+    # guaranteed to be optimal.
+    env = RewScale(env, 0.1)
+    pi = pposgd_simple.learn(env, policy_fn,
+            max_timesteps=num_timesteps,
+            timesteps_per_actorbatch=2048,
+            clip_param=0.2, entcoeff=0.0,
+            optim_epochs=10,
+            optim_stepsize=3e-4,
+            optim_batchsize=64,
+            gamma=0.99,
+
+            lam=0.95,
+            schedule='linear',
+        )
+    env.close()
+    if model_path:
+        U.save_state(model_path)
+        
+    return pi
+
+class RewScale(gym.RewardWrapper):
+    def __init__(self, env, scale):
+        gym.RewardWrapper.__init__(self, env)
+        self.scale = scale
+    def reward(self, r):
+        return r * self.scale
+
+def main():
+    logger.configure()
+    parser = mujoco_arg_parser()
+    parser.add_argument('--model-path', default=os.path.join(logger.get_dir(), 'humanoid_policy'))
+    parser.set_defaults(num_timesteps=int(2e7))
+   
+    args = parser.parse_args()
+    
+    if not args.play:
+        # train the model
+        train(num_timesteps=args.num_timesteps, seed=args.seed, model_path=args.model_path)
+    else:       
+        # construct the model object, load pre-trained model and render
+        pi = train(num_timesteps=1, seed=args.seed)
+        U.load_state(args.model_path)
+        env = make_mujoco_env('Humanoid-v2', seed=0)
+
+        ob = env.reset()        
+        while True:
+            action = pi.act(stochastic=False, ob=ob)[0]
+            ob, _, done, _ = env.step(action)
+            env.render()
+            if done:
+                ob = env.reset()
+        
+        
+    
+
+if __name__ == '__main__':
+    main()
--- a/baselines/ppo1/run_robotics.py
+++ b/baselines/ppo1/run_robotics.py
@@ -0,0 +1,40 @@
+#!/usr/bin/env python3
+
+from mpi4py import MPI
+from baselines.common import set_global_seeds
+from baselines import logger
+from baselines.common.cmd_util import make_robotics_env, robotics_arg_parser
+import mujoco_py
+
+
+def train(env_id, num_timesteps, seed):
+    from baselines.ppo1 import mlp_policy, pposgd_simple
+    import baselines.common.tf_util as U
+    rank = MPI.COMM_WORLD.Get_rank()
+    sess = U.single_threaded_session()
+    sess.__enter__()
+    mujoco_py.ignore_mujoco_warnings().__enter__()
+    workerseed = seed + 10000 * rank
+    set_global_seeds(workerseed)
+    env = make_robotics_env(env_id, workerseed, rank=rank)
+    def policy_fn(name, ob_space, ac_space):
+        return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
+            hid_size=256, num_hid_layers=3)
+
+    pposgd_simple.learn(env, policy_fn,
+            max_timesteps=num_timesteps,
+            timesteps_per_actorbatch=2048,
+            clip_param=0.2, entcoeff=0.0,
+            optim_epochs=5, optim_stepsize=3e-4, optim_batchsize=256,
+            gamma=0.99, lam=0.95, schedule='linear',
+        )
+    env.close()
+
+
+def main():
+    args = robotics_arg_parser().parse_args()
+    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
+
+
+if __name__ == '__main__':
+    main()
--- a/baselines/ppo2/defaults.py
+++ b/baselines/ppo2/defaults.py
@@ -0,0 +1,22 @@
+def mujoco():
+    return dict(
+        nsteps=2048,
+        nminibatches=32,
+        lam=0.95,
+        gamma=0.99,
+        noptepochs=10,
+        log_interval=1,
+        ent_coef=0.0,
+        lr=lambda f: 3e-4 * f,
+        cliprange=0.2,
+        value_network='copy'
+    )
+
+def atari():
+    return dict(
+        nsteps=128, nminibatches=4,
+        lam=0.95, gamma=0.99, noptepochs=4, log_interval=1,
+        ent_coef=.01,
+        lr=lambda f : f * 2.5e-4,
+        cliprange=lambda f : f * 0.1,
+    )
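
The `lr` and `cliprange` entries above are callables of a single argument `f`. Below is a minimal sketch of how such a schedule could be evaluated, assuming the training loop passes the fraction of training remaining (1.0 at the first update, decreasing toward 0). That convention and the helper `schedule` are assumptions for illustration and are not shown in this part of the diff.

```python
def mujoco_lr(f):
    # Same form as `lr=lambda f: 3e-4 * f` in mujoco() above.
    return 3e-4 * f


def schedule(fn, nupdates):
    """Evaluate a schedule callable once per update.

    Assumption: f is the fraction of training remaining, 1.0 at the first update.
    """
    return [fn(1.0 - (update - 1.0) / nupdates) for update in range(1, nupdates + 1)]


lrs = schedule(mujoco_lr, nupdates=4)  # f takes the values 1.0, 0.75, 0.5, 0.25
```

Under this assumption the learning rate decays linearly from `3e-4` during training.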
--- a/baselines/ppo2/policies.py
+++ b/baselines/ppo2/policies.py
@@ -1,168 +0,0 @@
-import numpy as np
-import tensorflow as tf
-from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
-from baselines.common.distributions import make_pdtype
-
-def nature_cnn(unscaled_images):
-    """
-    CNN from Nature paper.
-    """
-    scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
-    activ = tf.nn.relu
-    h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2)))
-    h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2)))
-    h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2)))
-    h3 = conv_to_fc(h3)
-    return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
-
-class LnLstmPolicy(object):
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
-        nenv = nbatch // nsteps
-        nh, nw, nc = ob_space.shape
-        ob_shape = (nbatch, nh, nw, nc)
-        nact = ac_space.n
-        X = tf.placeholder(tf.uint8, ob_shape) #obs
-        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
-        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(X)
-            xs = batch_to_seq(h, nenv, nsteps)
-            ms = batch_to_seq(M, nenv, nsteps)
-            h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
-            h5 = seq_to_batch(h5)
-            pi = fc(h5, 'pi', nact)
-            vf = fc(h5, 'v', 1)
-
-        self.pdtype = make_pdtype(ac_space)
-        self.pd = self.pdtype.pdfromflat(pi)
-
-        v0 = vf[:, 0]
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
-
-        def step(ob, state, mask):
-            return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
-
-        def value(ob, state, mask):
-            return sess.run(v0, {X:ob, S:state, M:mask})
-
-        self.X = X
-        self.M = M
-        self.S = S
-        self.pi = pi
-        self.vf = vf
-        self.step = step
-        self.value = value
-
-class LstmPolicy(object):
-
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
-        nenv = nbatch // nsteps
-
-        nh, nw, nc = ob_space.shape
-        ob_shape = (nbatch, nh, nw, nc)
-        nact = ac_space.n
-        X = tf.placeholder(tf.uint8, ob_shape) #obs
-        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
-        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(X)
-            xs = batch_to_seq(h, nenv, nsteps)
-            ms = batch_to_seq(M, nenv, nsteps)
-            h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
-            h5 = seq_to_batch(h5)
-            pi = fc(h5, 'pi', nact)
-            vf = fc(h5, 'v', 1)
-
-        self.pdtype = make_pdtype(ac_space)
-        self.pd = self.pdtype.pdfromflat(pi)
-
-        v0 = vf[:, 0]
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
-
-        def step(ob, state, mask):
-            return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
-
-        def value(ob, state, mask):
-            return sess.run(v0, {X:ob, S:state, M:mask})
-
-        self.X = X
-        self.M = M
-        self.S = S
-        self.pi = pi
-        self.vf = vf
-        self.step = step
-        self.value = value
-
-class CnnPolicy(object):
-
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
-        nh, nw, nc = ob_space.shape
-        ob_shape = (nbatch, nh, nw, nc)
-        nact = ac_space.n
-        X = tf.placeholder(tf.uint8, ob_shape) #obs
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(X)
-            pi = fc(h, 'pi', nact, init_scale=0.01)
-            vf = fc(h, 'v', 1)[:,0]
-
-        self.pdtype = make_pdtype(ac_space)
-        self.pd = self.pdtype.pdfromflat(pi)
-
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = None
-
-        def step(ob, *_args, **_kwargs):
-            a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
-            return a, v, self.initial_state, neglogp
-
-        def value(ob, *_args, **_kwargs):
-            return sess.run(vf, {X:ob})
-
-        self.X = X
-        self.pi = pi
-        self.vf = vf
-        self.step = step
-        self.value = value
-
-class MlpPolicy(object):
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
-        ob_shape = (nbatch,) + ob_space.shape
-        actdim = ac_space.shape[0]
-        X = tf.placeholder(tf.float32, ob_shape, name='Ob') #obs
-        with tf.variable_scope("model", reuse=reuse):
-            activ = tf.tanh
-            h1 = activ(fc(X, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
-            h2 = activ(fc(h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
-            pi = fc(h2, 'pi', actdim, init_scale=0.01)
-            h1 = activ(fc(X, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
-            h2 = activ(fc(h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
-            vf = fc(h2, 'vf', 1)[:,0]
-            logstd = tf.get_variable(name="logstd", shape=[1, actdim],
-                initializer=tf.zeros_initializer())
-
-        pdparam = tf.concat([pi, pi * 0.0 + logstd], axis=1)
-
-        self.pdtype = make_pdtype(ac_space)
-        self.pd = self.pdtype.pdfromflat(pdparam)
-
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = None
-
-        def step(ob, *_args, **_kwargs):
-            a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
-            return a, v, self.initial_state, neglogp
-
-        def value(ob, *_args, **_kwargs):
-            return sess.run(vf, {X:ob})
-
-        self.X = X
-        self.pi = pi
-        self.vf = vf
-        self.step = step
-        self.value = value
--- a/baselines/ppo2/ppo2.py
+++ b/baselines/ppo2/ppo2.py
@@ -1,20 +1,29 @@
 import os
 import time
-import joblib
+import functools
 import numpy as np
 import os.path as osp
 import tensorflow as tf
 from baselines import logger
 from collections import deque
-from baselines.common import explained_variance
+from baselines.common import explained_variance, set_global_seeds
+from baselines.common.policies import build_policy
+from baselines.common.runners import AbstractEnvRunner
+from baselines.common.tf_util import get_session, save_variables, load_variables
+from baselines.common.mpi_adam_optimizer import MpiAdamOptimizer
+
+from mpi4py import MPI
+from baselines.common.tf_util import initialize
+from baselines.common.mpi_util import sync_from_root

 class Model(object):
    def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
                nsteps, ent_coef, vf_coef, max_grad_norm):
-        sess = tf.get_default_session()
+        sess = get_session()

-        act_model = policy(sess, ob_space, ac_space, nbatch_act, 1, reuse=False)
-        train_model = policy(sess, ob_space, ac_space, nbatch_train, nsteps, reuse=True)
+        with tf.variable_scope('ppo2_model', reuse=tf.AUTO_REUSE):
+            act_model = policy(nbatch_act, 1, sess)
+            train_model = policy(nbatch_train, nsteps, sess)

        A = train_model.pdtype.sample_placeholder([None])
        ADV = tf.placeholder(tf.float32, [None])
@@ -39,14 +48,16 @@ class Model(object):
        approxkl = .5 * tf.reduce_mean(tf.square(neglogpac - OLDNEGLOGPAC))
        clipfrac = tf.reduce_mean(tf.to_float(tf.greater(tf.abs(ratio - 1.0), CLIPRANGE)))
        loss = pg_loss - entropy * ent_coef + vf_loss * vf_coef
-        with tf.variable_scope('model'):
-            params = tf.trainable_variables()
-        grads = tf.gradients(loss, params)
+        params = tf.trainable_variables('ppo2_model')
+        trainer = MpiAdamOptimizer(MPI.COMM_WORLD, learning_rate=LR, epsilon=1e-5)
+        grads_and_var = trainer.compute_gradients(loss, params)
+        grads, var = zip(*grads_and_var)
+
        if max_grad_norm is not None:
            grads, _grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
-        grads = list(zip(grads, params))
-        trainer = tf.train.AdamOptimizer(learning_rate=LR, epsilon=1e-5)
-        _train = trainer.apply_gradients(grads)
+        grads_and_var = list(zip(grads, var))
+
+        _train = trainer.apply_gradients(grads_and_var)

        def train(lr, cliprange, obs, returns, masks, actions, values, neglogpacs, states=None):
            advs = returns - values
@@ -62,17 +73,6 @@ class Model(object):
            )[:-1]
        self.loss_names = ['policy_loss', 'value_loss', 'policy_entropy', 'approxkl', 'clipfrac']

-        def save(save_path):
-            ps = sess.run(params)
-            joblib.dump(ps, save_path)
-
-        def load(load_path):
-            loaded_params = joblib.load(load_path)
-            restores = []
-            for p, loaded_p in zip(params, loaded_params):
-                restores.append(p.assign(loaded_p))
-            sess.run(restores)
-            # If you want to load weights, also save/load observation scaling inside VecNormalize

        self.train = train
        self.train_model = train_model
@@ -80,30 +80,28 @@ class Model(object):
        self.step = act_model.step
        self.value = act_model.value
        self.initial_state = act_model.initial_state
-        self.save = save
-        self.load = load
-        tf.global_variables_initializer().run(session=sess) #pylint: disable=E1101

-class Runner(object):
+        self.save = functools.partial(save_variables, sess=sess)
+        self.load = functools.partial(load_variables, sess=sess)
+
+        if MPI.COMM_WORLD.Get_rank() == 0:
+            initialize()
+        global_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="")
+        sync_from_root(sess, global_variables) #pylint: disable=E1101
+
+class Runner(AbstractEnvRunner):

    def __init__(self, *, env, model, nsteps, gamma, lam):
-        self.env = env
-        self.model = model
-        nenv = env.num_envs
-        self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=model.train_model.X.dtype.name)
-        self.obs[:] = env.reset()
-        self.gamma = gamma
+        super().__init__(env=env, model=model, nsteps=nsteps)
        self.lam = lam
-        self.nsteps = nsteps
-        self.states = model.initial_state
-        self.dones = [False for _ in range(nenv)]
+        self.gamma = gamma

    def run(self):
        mb_obs, mb_rewards, mb_actions, mb_values, mb_dones, mb_neglogpacs = [],[],[],[],[],[]
        mb_states = self.states
        epinfos = []
        for _ in range(self.nsteps):
-            actions, values, self.states, neglogpacs = self.model.step(self.obs, self.states, self.dones)
+            actions, values, self.states, neglogpacs = self.model.step(self.obs, S=self.states, M=self.dones)
            mb_obs.append(self.obs.copy())
            mb_actions.append(actions)
            mb_values.append(values)
@@ -121,7 +119,7 @@ class Runner(object):
        mb_values = np.asarray(mb_values, dtype=np.float32)
        mb_neglogpacs = np.asarray(mb_neglogpacs, dtype=np.float32)
        mb_dones = np.asarray(mb_dones, dtype=np.bool)
-        last_values = self.model.value(self.obs, self.states, self.dones)
+        last_values = self.model.value(self.obs, S=self.states, M=self.dones)
        #discount/bootstrap off value fn
        mb_returns = np.zeros_like(mb_rewards)
        mb_advs = np.zeros_like(mb_rewards)
@@ -151,10 +149,65 @@ def constfn(val):
        return val
    return f

-def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
+def learn(*, network, env, total_timesteps, seed=None, nsteps=2048, ent_coef=0.0, lr=3e-4,
            vf_coef=0.5,  max_grad_norm=0.5, gamma=0.99, lam=0.95,
            log_interval=10, nminibatches=4, noptepochs=4, cliprange=0.2,
-            save_interval=0):
+            save_interval=0, load_path=None, **network_kwargs):
+    '''
+    Learn policy using PPO algorithm (https://arxiv.org/abs/1707.06347)
+    
+    Parameters:
+    ----------
+
+    network:                          policy network architecture. Either a string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines/common/models.py for the full list)
+                                      specifying a standard network architecture, or a function that takes a tensorflow tensor as input and returns a
+                                      tuple (output_tensor, extra_feed), where output_tensor is the last network layer output, extra_feed is None for feed-forward
+                                      neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
+                                      See lstm in baselines/common/models.py for more details on using recurrent nets in policies
+
+    env: baselines.common.vec_env.VecEnv     environment. Needs to be vectorized for parallel environment simulation. 
+                                      The environments produced by gym.make can be wrapped using baselines.common.vec_env.DummyVecEnv class.
+
+    
+    nsteps: int                       number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
+                                      nenv is number of environment copies simulated in parallel)
+
+    total_timesteps: int              number of timesteps (i.e. number of actions taken in the environment)
+
+    ent_coef: float                   policy entropy coefficient in the optimization objective
+
+    lr: float or function             learning rate, constant or a schedule function [0,1] -> R+ where 1 is beginning of the 
+                                      training and 0 is the end of the training.
+
+    vf_coef: float                    value function loss coefficient in the optimization objective
+
+    max_grad_norm: float or None      gradient norm clipping coefficient
+    
+    gamma: float                      discounting factor
+
+    lam: float                        advantage estimation discounting factor (lambda in the paper)
+
+    log_interval: int                 number of updates between logging events
+
+    nminibatches: int                 number of training minibatches per update
+
+    noptepochs: int                   number of training epochs per update
+
+    cliprange: float or function      clipping range, constant or schedule function [0,1] -> R+ where 1 is beginning of the training 
+                                      and 0 is the end of the training 
+
+    save_interval: int                number of updates between saving events (0 disables saving)
+
+    load_path: str                    path to load the model from
+
+    **network_kwargs:                 keyword arguments to the policy / network builder. See build_policy in baselines/common/policies.py and the arguments of the particular network type.
+                                      For instance, the 'mlp' network architecture has arguments num_hidden and num_layers.
+
+    
+
+    '''
+    
+    set_global_seeds(seed)

    if isinstance(lr, float): lr = constfn(lr)
    else: assert callable(lr)
@@ -162,6 +215,8 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
    else: assert callable(cliprange)
    total_timesteps = int(total_timesteps)

+    policy = build_policy(env, network, **network_kwargs)
+
    nenvs = env.num_envs
    ob_space = env.observation_space
    ac_space = env.action_space
@@ -176,6 +231,8 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
        with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
            fh.write(cloudpickle.dumps(make_model))
    model = make_model()
+    if load_path is not None:
+        model.load(load_path)
    runner = Runner(env=env, model=model, nsteps=nsteps, gamma=gamma, lam=lam)

    epinfobuf = deque(maxlen=100)
@@ -184,7 +241,6 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
    nupdates = total_timesteps//nbatch
    for update in range(1, nupdates+1):
        assert nbatch % nminibatches == 0
-        nbatch_train = nbatch // nminibatches
        tstart = time.time()
        frac = 1.0 - (update - 1.0) / nupdates
        lrnow = lr(frac)
@@ -232,14 +288,19 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
            logger.logkv('time_elapsed', tnow - tfirststart)
            for (lossval, lossname) in zip(lossvals, model.loss_names):
                logger.logkv(lossname, lossval)
-            logger.dumpkvs()
-        if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
+            if MPI.COMM_WORLD.Get_rank() == 0:
+                logger.dumpkvs()
+        if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir() and MPI.COMM_WORLD.Get_rank() == 0:
            checkdir = osp.join(logger.get_dir(), 'checkpoints')
            os.makedirs(checkdir, exist_ok=True)
            savepath = osp.join(checkdir, '%.5i'%update)
            print('Saving to', savepath)
            model.save(savepath)
    env.close()
+    return model

 def safemean(xs):
    return np.nan if len(xs) == 0 else np.mean(xs)
+
+
+
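The `lr` and `cliprange` parameters documented in the `learn` docstring accept either a constant or a schedule over the remaining-training fraction, as in the `lambda f : f * 2.5e-4` pattern used by `run_atari.py`. The following is a minimal sketch of that convention, reusing only the `frac` formula visible in the diff; `linear_schedule` and `remaining_fraction` are hypothetical helpers for illustration and are not part of baselines.

```python
def constfn(val):
    """Wrap a constant so it can be called like a schedule."""
    def f(_):
        return val
    return f


def linear_schedule(initial):
    """Hypothetical helper: scale `initial` linearly by the remaining fraction."""
    def f(frac):
        return frac * initial
    return f


def remaining_fraction(update, nupdates):
    # Same formula as in learn(): 1.0 on the first update,
    # shrinking towards 0 as training proceeds.
    return 1.0 - (update - 1.0) / nupdates


lr = linear_schedule(2.5e-4)
nupdates = 4
# One learning rate per update, decreasing from the initial value.
rates = [lr(remaining_fraction(u, nupdates)) for u in range(1, nupdates + 1)]
```

Note that with this formula the last update still sees a fraction of `1 / nupdates` rather than exactly 0.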
--- a/baselines/ppo2/run_atari.py
+++ b/baselines/ppo2/run_atari.py
@@ -1,40 +0,0 @@
-#!/usr/bin/env python3
-import sys
-from baselines import logger
-from baselines.common.cmd_util import make_atari_env, atari_arg_parser
-from baselines.common.vec_env.vec_frame_stack import VecFrameStack
-from baselines.ppo2 import ppo2
-from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
-import multiprocessing
-import tensorflow as tf
-
-
-def train(env_id, num_timesteps, seed, policy):
-
-    ncpu = multiprocessing.cpu_count()
-    if sys.platform == 'darwin': ncpu //= 2
-    config = tf.ConfigProto(allow_soft_placement=True,
-                            intra_op_parallelism_threads=ncpu,
-                            inter_op_parallelism_threads=ncpu)
-    config.gpu_options.allow_growth = True #pylint: disable=E1101
-    tf.Session(config=config).__enter__()
-
-    env = VecFrameStack(make_atari_env(env_id, 8, seed), 4)
-    policy = {'cnn' : CnnPolicy, 'lstm' : LstmPolicy, 'lnlstm' : LnLstmPolicy}[policy]
-    ppo2.learn(policy=policy, env=env, nsteps=128, nminibatches=4,
-        lam=0.95, gamma=0.99, noptepochs=4, log_interval=1,
-        ent_coef=.01,
-        lr=lambda f : f * 2.5e-4,
-        cliprange=lambda f : f * 0.1,
-        total_timesteps=int(num_timesteps * 1.1))
-
-def main():
-    parser = atari_arg_parser()
-    parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
-    args = parser.parse_args()
-    logger.configure()
-    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
-        policy=args.policy)
-
-if __name__ == '__main__':
-    main()
--- a/baselines/ppo2/run_mujoco.py
+++ b/baselines/ppo2/run_mujoco.py
@@ -1,43 +0,0 @@
-#!/usr/bin/env python3
-import argparse
-from baselines.common.cmd_util import mujoco_arg_parser
-from baselines import bench, logger
-
-def train(env_id, num_timesteps, seed):
-    from baselines.common import set_global_seeds
-    from baselines.common.vec_env.vec_normalize import VecNormalize
-    from baselines.ppo2 import ppo2
-    from baselines.ppo2.policies import MlpPolicy
-    import gym
-    import tensorflow as tf
-    from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
-    ncpu = 1
-    config = tf.ConfigProto(allow_soft_placement=True,
-                            intra_op_parallelism_threads=ncpu,
-                            inter_op_parallelism_threads=ncpu)
-    tf.Session(config=config).__enter__()
-    def make_env():
-        env = gym.make(env_id)
-        env = bench.Monitor(env, logger.get_dir())
-        return env
-    env = DummyVecEnv([make_env])
-    env = VecNormalize(env)
-
-    set_global_seeds(seed)
-    policy = MlpPolicy
-    ppo2.learn(policy=policy, env=env, nsteps=2048, nminibatches=32,
-        lam=0.95, gamma=0.99, noptepochs=10, log_interval=1,
-        ent_coef=0.0,
-        lr=3e-4,
-        cliprange=0.2,
-        total_timesteps=num_timesteps)
-
-
-def main():
-    args = mujoco_arg_parser().parse_args()
-    logger.configure()
-    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
-
-
-if __name__ == '__main__':
-    main()
--- a/baselines/run.py
+++ b/baselines/run.py
@@ -0,0 +1,230 @@
+import sys
+import multiprocessing 
+import os
+import os.path as osp
+import gym
+from collections import defaultdict
+import tensorflow as tf
+
+from baselines.common.vec_env.vec_frame_stack import VecFrameStack
+from baselines.common.cmd_util import common_arg_parser, parse_unknown_args, make_mujoco_env, make_atari_env
+from baselines.common.tf_util import save_state, load_state, get_session
+from baselines import bench, logger
+from importlib import import_module
+
+from baselines.common.vec_env.vec_normalize import VecNormalize
+from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
+from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
+from baselines.common import atari_wrappers, retro_wrappers
+
+try:
+    from mpi4py import MPI
+except ImportError:
+    MPI = None
+
+_game_envs = defaultdict(set)
+for env in gym.envs.registry.all():
+    # solve this with regexes
+    env_type = env._entry_point.split(':')[0].split('.')[-1]
+    _game_envs[env_type].add(env.id)
+
+# reading benchmark names directly from retro requires 
+# importing retro here, and for some reason that crashes tensorflow 
+# in ubuntu 
+_game_envs['retro'] = set([
+    'BubbleBobble-Nes',
+    'SuperMarioBros-Nes',
+    'TwinBee3PokoPokoDaimaou-Nes',
+    'SpaceHarrier-Nes',
+    'SonicTheHedgehog-Genesis',
+    'Vectorman-Genesis',
+    'FinalFight-Snes',
+    'SpaceInvaders-Snes',
+])
+
+
+def train(args, extra_args):
+    env_type, env_id = get_env_type(args.env)
+        
+    total_timesteps = int(args.num_timesteps)
+    seed = args.seed
+
+    learn = get_learn_function(args.alg)
+    alg_kwargs = get_learn_function_defaults(args.alg, env_type)
+    alg_kwargs.update(extra_args)
+
+    env = build_env(args)
+
+    if args.network:
+        alg_kwargs['network'] = args.network
+    else:
+        if alg_kwargs.get('network') is None:
+            alg_kwargs['network'] = get_default_network(env_type)
+ 
+       
+    
+    print('Training {} on {}:{} with arguments \n{}'.format(args.alg, env_type, env_id, alg_kwargs))
+
+    model = learn(
+        env=env,  
+        seed=seed,
+        total_timesteps=total_timesteps,
+        **alg_kwargs
+    )
+
+    return model, env
+
+
+def build_env(args, render=False):
+    ncpu = multiprocessing.cpu_count()
+    if sys.platform == 'darwin': ncpu //= 2
+    nenv = args.num_env or ncpu if not render else 1
+    alg = args.alg
+    rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
+    seed = args.seed    
+
+    env_type, env_id = get_env_type(args.env)
+    if env_type == 'mujoco':
+        get_session(tf.ConfigProto(allow_soft_placement=True,
+                                   intra_op_parallelism_threads=1, 
+                                   inter_op_parallelism_threads=1))
+
+        if args.num_env:
+            env = SubprocVecEnv([lambda i=i: make_mujoco_env(env_id, seed + i if seed is not None else None, args.reward_scale) for i in range(args.num_env)])
+        else:
+            env = DummyVecEnv([lambda: make_mujoco_env(env_id, seed, args.reward_scale)])
+
+        env = VecNormalize(env)
+
+    elif env_type == 'atari':
+        if alg == 'acer':
+            env = make_atari_env(env_id, nenv, seed)
+        elif alg == 'deepq':
+            env = atari_wrappers.make_atari(env_id)
+            env.seed(seed)
+            env = bench.Monitor(env, logger.get_dir())
+            env = atari_wrappers.wrap_deepmind(env, frame_stack=True, scale=True)
+        elif alg == 'trpo_mpi':
+            env = atari_wrappers.make_atari(env_id)
+            env.seed(seed)
+            env = bench.Monitor(env, logger.get_dir() and osp.join(logger.get_dir(), str(rank)))
+            env = atari_wrappers.wrap_deepmind(env)
+            # TODO check if the second seeding is necessary, and eventually remove
+            env.seed(seed)
+        else:
+            frame_stack_size = 4
+            env = VecFrameStack(make_atari_env(env_id, nenv, seed), frame_stack_size)
+
+    elif env_type == 'retro':
+        import retro
+        gamestate = args.gamestate or 'Level1-1'
+        env = retro_wrappers.make_retro(game=args.env, state=gamestate, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE)
+        env.seed(args.seed)
+        env = bench.Monitor(env, logger.get_dir())
+        env = retro_wrappers.wrap_deepmind_retro(env)
+        
+    elif env_type == 'classic':
+        def make_env():
+            e = gym.make(env_id)
+            e.seed(seed)
+            return e
+            
+        env = DummyVecEnv([make_env])
+ 
+    return env
+
+
+def get_env_type(env_id):
+    if env_id in _game_envs.keys():
+        env_type = env_id
+        env_id =  [g for g in _game_envs[env_type]][0]
+    else:
+        env_type = None
+        for g, e in _game_envs.items():
+            if env_id in e:
+                env_type = g
+                break 
+        assert env_type is not None, 'env_id {} is not recognized in env types {}'.format(env_id, _game_envs.keys())
+
+    return env_type, env_id
+
+def get_default_network(env_type):
+    if env_type == 'mujoco' or env_type=='classic':
+        return 'mlp'
+    if env_type == 'atari':
+        return 'cnn'
+
+    raise ValueError('Unknown env_type {}'.format(env_type))
+    
+def get_alg_module(alg, submodule=None):
+    submodule = submodule or alg
+    try:
+        # first try to import the alg module from baselines
+        alg_module = import_module('.'.join(['baselines', alg, submodule]))
+    except ImportError:
+        # then from rl_algs
+        alg_module = import_module('.'.join(['rl_' + 'algs', alg, submodule]))
+    
+    return alg_module
+        
+
+def get_learn_function(alg):
+    return get_alg_module(alg).learn
+
+def get_learn_function_defaults(alg, env_type):
+    try:
+        alg_defaults = get_alg_module(alg, 'defaults')
+        kwargs = getattr(alg_defaults, env_type)()
+    except (ImportError, AttributeError):
+        kwargs = {}       
+    return kwargs
+    
+def parse(v): 
+    '''
+    convert value of a command-line arg to a python object if possible, otherwise, keep as string
+    '''
+
+    assert isinstance(v, str)
+    try:
+        return eval(v) 
+    except (NameError, SyntaxError): 
+        return v
+
+
+def main():
+    # configure logger, disable logging in child MPI processes (with rank > 0) 
+            
+    arg_parser = common_arg_parser()
+    args, unknown_args = arg_parser.parse_known_args()
+    extra_args = {k: parse(v) for k,v in parse_unknown_args(unknown_args).items()}
+
+    
+    if MPI is None or MPI.COMM_WORLD.Get_rank() == 0:
+        rank = 0
+        logger.configure()
+    else:
+        logger.configure(format_strs = [])
+        rank = MPI.COMM_WORLD.Get_rank()
+
+    model, _ = train(args, extra_args)
+
+    if args.save_path is not None and rank == 0:
+        save_path = osp.expanduser(args.save_path)
+        model.save(save_path)
+    
+
+    if args.play:
+        logger.log("Running trained model")
+        env = build_env(args, render=True)
+        obs = env.reset()
+        while True:
+            actions = model.step(obs)[0]
+            obs, _, done, _  = env.step(actions)
+            env.render()
+            if done:
+                obs = env.reset()
+            
+
+
+if __name__ == '__main__':
+    main()
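`build_env` creates one environment factory per worker inside a list comprehension, with the seed offset by the loop index. The sketch below is independent of baselines and shows why the index has to be bound when each factory is created rather than when it is called. The factories return the seed instead of building an environment, and both function names are hypothetical.

```python
def make_factories_late(seed, n):
    # Each lambda looks up `i` when it is *called*, so every factory
    # sees the final value of the loop variable.
    return [lambda: seed + i for i in range(n)]


def make_factories_bound(seed, n):
    # A default argument is evaluated when the lambda is created,
    # so each factory keeps its own worker index.
    return [lambda i=i: seed + i for i in range(n)]


late = [f() for f in make_factories_late(100, 3)]
bound = [f() for f in make_factories_bound(100, 3)]
```

With late binding every worker ends up with the same seed, which would make parallel environments produce identical rollouts.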
--- a/baselines/trpo_mpi/defaults.py
+++ b/baselines/trpo_mpi/defaults.py
@@ -0,0 +1,30 @@
+from baselines.common.models import mlp, cnn_small
+
+
+def atari():
+    return dict(
+        network = cnn_small(),
+        timesteps_per_batch=512, 
+        max_kl=0.001,
+        cg_iters=10,
+        cg_damping=1e-3,
+        gamma=0.98,
+        lam=1.0,
+        vf_iters=3,
+        vf_stepsize=1e-4,
+        entcoeff=0.00,
+    )
+
+def mujoco():
+    return dict(
+        network = mlp(num_hidden=32, num_layers=2),
+        timesteps_per_batch=1024,
+        max_kl=0.01,
+        cg_iters=10,
+        cg_damping=0.1,
+        gamma=0.99,
+        lam=0.98,
+        vf_iters=5,
+        vf_stepsize=1e-3,
+        normalize_observations=True, 
+    )
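`run.py` looks up these per-environment defaults by calling the function named after the environment type, falls back to an empty dict when none exists, and then lets command-line extras override the result. Below is a minimal sketch of that lookup under stated assumptions: `get_defaults` and `fake_defaults` are hypothetical stand-ins (a real run imports the `defaults` module with `import_module`), and the values are copied from the `mujoco()` entry.

```python
import types


def get_defaults(alg_defaults, env_type):
    """Lookup-with-fallback, mirroring get_learn_function_defaults."""
    try:
        return getattr(alg_defaults, env_type)()
    except AttributeError:
        # No defaults are defined for this environment type.
        return {}


# Stand-in for a defaults module (assumption for illustration only).
fake_defaults = types.SimpleNamespace(
    mujoco=lambda: dict(timesteps_per_batch=1024, max_kl=0.01, gamma=0.99, lam=0.98),
)

alg_kwargs = get_defaults(fake_defaults, 'mujoco')
alg_kwargs.update({'gamma': 0.995})  # command-line extras override defaults
missing = get_defaults(fake_defaults, 'retro')
```

Because the fallback is silent, a misspelled environment type or a failing import inside the defaults module yields an empty dict rather than an error.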
--- a/baselines/trpo_mpi/nosharing_cnn_policy.py
+++ b/baselines/trpo_mpi/nosharing_cnn_policy.py
@@ -1,56 +0,0 @@
-import baselines.common.tf_util as U
-import tensorflow as tf
-import gym
-from baselines.common.distributions import make_pdtype
-
-class CnnPolicy(object):
-    recurrent = False
-    def __init__(self, name, ob_space, ac_space):
-        with tf.variable_scope(name):
-            self._init(ob_space, ac_space)
-            self.scope = tf.get_variable_scope().name
-
-    def _init(self, ob_space, ac_space):
-        assert isinstance(ob_space, gym.spaces.Box)
-
-        self.pdtype = pdtype = make_pdtype(ac_space)
-        sequence_length = None
-
-        ob = U.get_placeholder(name="ob", dtype=tf.float32, shape=[sequence_length] + list(ob_space.shape))
-
-        obscaled = ob / 255.0
-
-        with tf.variable_scope("pol"):
-            x = obscaled
-            x = tf.nn.relu(U.conv2d(x, 8, "l1", [8, 8], [4, 4], pad="VALID"))
-            x = tf.nn.relu(U.conv2d(x, 16, "l2", [4, 4], [2, 2], pad="VALID"))
-            x = U.flattenallbut0(x)
-            x = tf.nn.relu(tf.layers.dense(x, 128, name='lin', kernel_initializer=U.normc_initializer(1.0)))
-            logits = tf.layers.dense(x, pdtype.param_shape()[0], name='logits', kernel_initializer=U.normc_initializer(0.01))
-            self.pd = pdtype.pdfromflat(logits)
-        with tf.variable_scope("vf"):
-            x = obscaled
-            x = tf.nn.relu(U.conv2d(x, 8, "l1", [8, 8], [4, 4], pad="VALID"))
-            x = tf.nn.relu(U.conv2d(x, 16, "l2", [4, 4], [2, 2], pad="VALID"))
-            x = U.flattenallbut0(x)
-            x = tf.nn.relu(tf.layers.dense(x, 128, name='lin', kernel_initializer=U.normc_initializer(1.0)))
-            self.vpred = tf.layers.dense(x, 1, name='value', kernel_initializer=U.normc_initializer(1.0))
-            self.vpredz = self.vpred
-
-        self.state_in = []
-        self.state_out = []
-
-        stochastic = tf.placeholder(dtype=tf.bool, shape=())
-        ac = self.pd.sample()
-        self._act = U.function([stochastic, ob], [ac, self.vpred])
-
-    def act(self, stochastic, ob):
-        ac1, vpred1 =  self._act(stochastic, ob[None])
-        return ac1[0], vpred1[0]
-    def get_variables(self):
-        return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, self.scope)
-    def get_trainable_variables(self):
-        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)
-    def get_initial_state(self):
-        return []
-
--- a/baselines/trpo_mpi/run_atari.py
+++ b/baselines/trpo_mpi/run_atari.py
@@ -1,43 +0,0 @@
-    #!/usr/bin/env python3
-from mpi4py import MPI
-from baselines.common import set_global_seeds
-import os.path as osp
-import gym, logging
-from baselines import logger
-from baselines import bench
-from baselines.common.atari_wrappers import make_atari, wrap_deepmind
-from baselines.common.cmd_util import atari_arg_parser
-
-def train(env_id, num_timesteps, seed):
-    from baselines.trpo_mpi.nosharing_cnn_policy import CnnPolicy
-    from baselines.trpo_mpi import trpo_mpi
-    import baselines.common.tf_util as U
-    rank = MPI.COMM_WORLD.Get_rank()
-    sess = U.single_threaded_session()
-    sess.__enter__()
-    if rank == 0:
-        logger.configure()
-    else:
-        logger.configure(format_strs=[])
-
-    workerseed = seed + 10000 * MPI.COMM_WORLD.Get_rank()
-    set_global_seeds(workerseed)
-    env = make_atari(env_id)
-    def policy_fn(name, ob_space, ac_space): #pylint: disable=W0613
-        return CnnPolicy(name=name, ob_space=env.observation_space, ac_space=env.action_space)
-    env = bench.Monitor(env, logger.get_dir() and osp.join(logger.get_dir(), str(rank)))
-    env.seed(workerseed)
-
-    env = wrap_deepmind(env)
-    env.seed(workerseed)
-
-    trpo_mpi.learn(env, policy_fn, timesteps_per_batch=512, max_kl=0.001, cg_iters=10, cg_damping=1e-3,
-        max_timesteps=int(num_timesteps * 1.1), gamma=0.98, lam=1.0, vf_iters=3, vf_stepsize=1e-4, entcoeff=0.00)
-    env.close()
-
-def main():
-    args = atari_arg_parser().parse_args()
-    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
-
-if __name__ == "__main__":
-    main()
--- a/baselines/trpo_mpi/run_mujoco.py
+++ b/baselines/trpo_mpi/run_mujoco.py
@@ -1,36 +0,0 @@
-#!/usr/bin/env python3
-# noinspection PyUnresolvedReferences
-from mpi4py import MPI
-from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
-from baselines import logger
-from baselines.ppo1.mlp_policy import MlpPolicy
-from baselines.trpo_mpi import trpo_mpi
-
-def train(env_id, num_timesteps, seed):
-    import baselines.common.tf_util as U
-    sess = U.single_threaded_session()
-    sess.__enter__()
-
-    rank = MPI.COMM_WORLD.Get_rank()
-    if rank == 0:
-        logger.configure()
-    else:
-        logger.configure(format_strs=[])
-        logger.set_level(logger.DISABLED)
-    workerseed = seed + 10000 * MPI.COMM_WORLD.Get_rank()
-    def policy_fn(name, ob_space, ac_space):
-        return MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
-            hid_size=32, num_hid_layers=2)
-    env = make_mujoco_env(env_id, workerseed)
-    trpo_mpi.learn(env, policy_fn, timesteps_per_batch=1024, max_kl=0.01, cg_iters=10, cg_damping=0.1,
-        max_timesteps=num_timesteps, gamma=0.99, lam=0.98, vf_iters=5, vf_stepsize=1e-3)
-    env.close()
-
-def main():
-    args = mujoco_arg_parser().parse_args()
-    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
-
-
-if __name__ == '__main__':
-    main()
-
Author	SHA1	Message	Date
Peter Zhokhov	2ebdc28791	lstm network builders using tf lstm	2018-08-10 14:21:30 -07:00
Peter Zhokhov	217b111c88	merged refactor	2018-08-10 14:14:46 -07:00
Adam Gleave	f272969325	GAIL: bugfix in dataset loading (#447 ) * Fix silly typo * Replace ad-hoc function with NumPy code	2018-07-06 16:12:14 -07:00
pzhokhov	a6b1bc70f1	re-import internal; fix missing tile_images.py (#427 ) * import rl-algs from 2e3a166 commit * extra import of the baselines badge * exported commit with identity test * proper rng seeding in the test_identity * import internal * adding missing tile_images.py	2018-06-08 09:41:45 -07:00
pzhokhov	36ee5d1707	Import internal changes (#422 ) * import rl-algs from 2e3a166 commit * extra import of the baselines badge * exported commit with identity test * proper rng seeding in the test_identity * import internal	2018-06-06 11:39:13 -07:00
pzhokhov	24fe3d6576	Import internal repo (#409 ) * import rl-algs from 2e3a166 commit * extra import of the baselines badge * exported commit with identity test * proper rng seeding in the test_identity	2018-05-21 15:24:00 -07:00
pzhokhov	9cb7ece338	add opencv-python to the dependencies (#407 )	2018-05-14 10:52:19 -07:00
pzhokhov	9cf95a0054	setup travis ci build (#388 ) * simple .travis.yml file * added static syntax checks of common to .travis.yml * dockerizing the build * fix Dockerfile, adding build shield * cleaning up workdir in Dockerfile and .travis.yml * .travis.yml fixed common -> baselines/common for style check	2018-05-03 09:43:28 -07:00
pzhokhov	8b781038cc	put filters and running_stat files in common instead of acktr (#389 )	2018-05-02 18:42:48 -07:00
pzhokhov	69f25c6028	import internal repo (#385 )	2018-05-01 16:54:04 -07:00
pzhokhov	2b0283b9db	Readme.md detailed installation instructions (#377 ) * changes to README.md files with more detailed installation instructions * md-fying the changes better * link on the word homebrew in readme.md * typos in README.md * README.md * removed extra comma sign * removed sudo from brew command	2018-04-25 17:40:48 -07:00
Matthias Plappert	1f8a03f3a6	Update README	2018-03-26 16:50:22 +02:00
Matthias Plappert	3cc7df0608	Minor fixes to HER release (#319 ) * Fix plotting script * Add warning if num_cpu = 1	2018-03-05 11:06:17 +01:00
Alex Nichol	8b3a6c2051	fix DummyVecEnv reusing buffers	2018-03-02 17:18:07 -08:00
Alex Nichol	569bd42629	Merge pull request #308 from araffin/master Bug fix in saving ACER model	2018-03-01 10:45:04 -08:00
Daniel Ziegler	f49a9c3d85	Fix bug in DDPG parameter space noise adaptation (#306 ) The training loop used the rollout step variable `t` rather than the training step variable `t_train` to decide when to adapt the scale of the parameter space noise.	2018-03-01 18:00:34 +01:00
Antonin RAFFIN	14f2f9328c	Bug fix in saving ACER model	2018-03-01 10:24:14 +01:00
Alex Nichol	6bdf2f55a2	Merge pull request #132 from bhatiaabhinav/bug_fixes Bug fix in saving a2c model.	2018-02-27 19:00:37 -08:00
Alex Nichol	97be70d6c8	fixes for DummyVecEnv Fixes various problems running MuJoCo tasks.	2018-02-27 18:55:10 -08:00
Matthias Plappert	b71152eea0	Adds support for Hindsight Experience Replay (HER) (#299 ) * Add Hindsight Experience Replay (HER) * Minor improvements	2018-02-26 17:40:16 +01:00
Abhinav Bhatia	3d1e171b3a	Bug fix in saving a2c model.	2017-09-12 02:35:43 +08:00