Compare commits


239 Commits

Author SHA1 Message Date
Greg Brockman
6bbc4635e6 Update cmd_util with initializer, env_kwargs, and force_dummy 2019-03-18 17:53:42 -07:00
Rishav1
5b41c926c7 fix #795: Making tf_util._Function consistent (#796)
* fix #795: Making tf_util._Function consistent

The fix involves using the placeholder name to cross-reference the passed
kwargs values, just as tf_util.function expects. Also, the givens
are updated before the parameters to make it behave the way it is supposed to.

* test: Adding test for issue #795
2019-01-31 10:23:38 -08:00
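The kwargs-to-placeholder matching described above can be sketched without TensorFlow. Below is a minimal, framework-free illustration of the lookup pattern (the real tf_util._Function also handles givens, sessions, and update ops); FakePlaceholder and MiniFunction are hypothetical names used only for this sketch.

```
# Sketch only: keyword arguments are cross-referenced via the placeholder name,
# positional arguments fill placeholders in declaration order.
class FakePlaceholder:
    def __init__(self, name):
        self.name = name + ":0"   # TF appends the output index to tensor names

class MiniFunction:
    def __init__(self, inputs):
        self.inputs = inputs

    def __call__(self, *args, **kwargs):
        feed = {}
        for ph, value in zip(self.inputs, args):          # positional args
            feed[ph.name] = value
        for ph in self.inputs[len(args):]:                # remaining placeholders
            short_name = ph.name.split(":")[0]
            if short_name in kwargs:
                feed[ph.name] = kwargs[short_name]
        return feed

f = MiniFunction([FakePlaceholder("x"), FakePlaceholder("y")])
print(f(1, y=2))   # {'x:0': 1, 'y:0': 2}
```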
Peter Zhokhov
ab02fae71d fixes related to new gym and new flake8 2019-01-30 16:21:57 -08:00
ethanwaldie
b55eda1dde Added required arguments to the policy builder in the ACER model to (#784)
* Added required arguments to the policy builder in the ACER model to
fix the issue #783

* Changed the step model from nbatch to nenvs

* Updated nsteps to be 1.
2019-01-22 19:22:28 -08:00
pzhokhov
57e05eb420 remove noop code (#781) 2019-01-09 22:30:52 -08:00
Nikhil Barhate
01ab1d8ef7 fixed typo (#779) 2019-01-09 11:21:53 -08:00
Alex Ray
73683435ff Merge pull request #777 from openai/aray-extra-imports
add an argument for importing extra modules from run
2019-01-04 15:49:51 -08:00
Alex Ray
4d0746b957 add an argument for importing extra modules from run 2019-01-03 11:33:31 -08:00
Ankesh Anand
5115707ce9 Recognize nightly tf builds (#763)
* Recognize nightly tf builds

* Use LooseVersion instead of StrictVersion to recognize nightly build numbers

Nightly version numbers are of the form `1.3.0.dev20181215`, which is not a valid version number for `StrictVersion`, while `LooseVersion` still recognizes it.
2018-12-21 12:47:48 -08:00
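For reference, a small sketch of the difference the fix relies on. distutils.version was the standard tool at the time; it is deprecated (and removed in newer Pythons), so treat this as illustrative.

```
from distutils.version import LooseVersion, StrictVersion

nightly = "1.3.0.dev20181215"   # the nightly-style string from the commit message

# LooseVersion tolerates the .devNNNNNNNN suffix and still orders sensibly.
print(LooseVersion(nightly) >= LooseVersion("1.0.0"))   # True

# StrictVersion only accepts X.Y[.Z] with an optional aN/bN pre-release tag,
# so a nightly build number raises ValueError.
try:
    StrictVersion(nightly)
except ValueError as err:
    print("StrictVersion rejects it:", err)
```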
pzhokhov
6c44fb28fe refactor HER - phase 1 (#767)
* joshim5 changes (width and height to WarpFrame wrapper)

* match network output with action distribution via a linear layer only if necessary (#167)

* support color vs. grayscale option in WarpFrame wrapper (#166)

* support color vs. grayscale option in WarpFrame wrapper

* Support color in other wrappers

* Updated per Peter's suggestions

* fixing test failures

* ppo2 with microbatches (#168)

* pass microbatch_size to the model during construction

* microbatch fixes and test (#169)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* Peterz joshim5 subclass ppo2 model (#170)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* subclassing the model to make microbatched version of model WIP

* made microbatched model a subclass of ppo2 Model

* flake8 complaint

* mpi-less ppo2 (resolving merge conflict)

* flake8 and mpi4py imports in ppo2/model.py

* more un-mpying

* merge master

* updates to the benchmark viewer code + autopep8 (#184)

* viz docs and syntactic sugar wip

* update viewer yaml to use persistent volume claims

* move plot_util to baselines.common, update links

* use 1Tb hard drive for results viewer

* small updates to benchmark visualizer code

* autopep8

* autopep8

* any folder can be a benchmark

* massage games image a little bit

* fixed --preload option in app.py

* remove preload from run_viewer.sh

* remove pdb breakpoints

* update bench-viewer.yaml

* fixed bug (#185)

* fixed bug 

it is wrong to use the else statement, because no other nodes would start.

* changed the fix slightly

* Refactor her phase 1 (#194)

* add monitor to the rollout envs in her RUN BENCHMARKS her

* Slice -> Slide in her benchmarks RUN BENCHMARKS her

* run her benchmark for 200 epochs

* dummy commit to RUN BENCHMARKS her

* her benchmark for 500 epochs RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* disable saving of policies in her benchmark RUN BENCHMARKS her

* run fetch benchmarks with ppo2 and ddpg RUN BENCHMARKS Fetch

* run fetch benchmarks with ppo2 and ddpg RUN BENCHMARKS Fetch

* launcher refactor wip

* wip

* her works on FetchReach

* her runner refactor RUN BENCHMARKS Fetch1M

* unit test for her

* fixing warnings in mpi_average in her, skip test_fetchreach if mujoco is not present

* pickle-based serialization in her

* remove extra import from subproc_vec_env.py

* investigating differences in rollout.py

* try with old rollout code RUN BENCHMARKS her

* temporarily use DummyVecEnv in cmd_util.py RUN BENCHMARKS her

* dummy commit to RUN BENCHMARKS her

* set info_values in rollout worker in her RUN BENCHMARKS her

* bug in rollout_new.py RUN BENCHMARKS her

* fixed bug in rollout_new.py RUN BENCHMARKS her

* do not use last step because vecenv calls reset and returns obs after reset RUN BENCHMARKS her

* updated buffer sizes RUN BENCHMARKS her

* fixed loading/saving via joblib

* dust off learning from demonstrations in HER, docs, refactor

* add deprecation notice on her play and plot files

* address comments by Matthias
2018-12-19 14:44:08 -08:00
Timothy Lee
146bbf886b Removed code that prevented changes to actor loss when training with demos (#740) 2018-11-29 17:28:08 -08:00
pzhokhov
f3a5abaeeb added smoke tests of ddpg (#734) 2018-11-26 17:57:25 -08:00
pzhokhov
97e039127f Fix ppo2 with MPI bug, other minor fixes (#735)
* joshim5 changes (width and height to WarpFrame wrapper)

* match network output with action distribution via a linear layer only if necessary (#167)

* support color vs. grayscale option in WarpFrame wrapper (#166)

* support color vs. grayscale option in WarpFrame wrapper

* Support color in other wrappers

* Updated per Peter's suggestions

* fixing test failures

* ppo2 with microbatches (#168)

* pass microbatch_size to the model during construction

* microbatch fixes and test (#169)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* Peterz joshim5 subclass ppo2 model (#170)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* subclassing the model to make microbatched version of model WIP

* made microbatched model a subclass of ppo2 Model

* flake8 complaint

* mpi-less ppo2 (resolving merge conflict)

* flake8 and mpi4py imports in ppo2/model.py

* more un-mpying

* merge master

* updates to the benchmark viewer code + autopep8 (#184)

* viz docs and syntactic sugar wip

* update viewer yaml to use persistent volume claims

* move plot_util to baselines.common, update links

* use 1Tb hard drive for results viewer

* small updates to benchmark visualizer code

* autopep8

* autopep8

* any folder can be a benchmark

* massage games image a little bit

* fixed --preload option in app.py

* remove preload from run_viewer.sh

* remove pdb breakpoints

* update bench-viewer.yaml

* fixed bug (#185)

* fixed bug 

it is wrong to use the else statement, because no other nodes would start.

* changed the fix slightly
2018-11-26 17:56:41 -08:00
pzhokhov
25ecb64821 fixed issue with wrong output layer variable names in ddpg (#733) 2018-11-26 16:30:37 -08:00
Prabhat Nagarajan
7dc6bc7c70 fixes typo (#732)
* fixes typo

* adds apostrophe
2018-11-26 16:19:09 -08:00
Christopher Hesse
7139a66d33 Merge pull request #728 from openai/christopherhesse-patch-1
Update README.md
2018-11-21 15:00:51 -08:00
Christopher Hesse
8607dca99e Update README.md 2018-11-21 14:57:10 -08:00
pzhokhov
9f9835fe38 Update __init__.py 2018-11-21 12:51:15 -08:00
sedand
d3fed181b5 Fixed comment on example usage in jupyter-notebook (#396)
Cause of error: Import name must be results_plotter, not log_viewer.
2018-11-14 14:50:59 -08:00
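The corrected import, with a hedged usage line: plot_results and X_TIMESTEPS are the names I recall from baselines.results_plotter, and the ./log directory is hypothetical, so check the module for the exact signature.

```
from baselines import results_plotter   # not "log_viewer"
import matplotlib.pyplot as plt

# Hypothetical log directory; plot_results reads the *.monitor.csv files in it.
results_plotter.plot_results(["./log"], 1e6, results_plotter.X_TIMESTEPS, "Breakout")
plt.show()
```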
Roman Ring
339d5640b9 add docs for layer_norm param in DQN baseline (#107) 2018-11-14 12:22:42 -08:00
Buck Shlegeris
a75bc37a40 fix typo in a comment (#161) 2018-11-14 12:20:55 -08:00
Peter Zhokhov
87b3a04a38 autopep8 2018-11-14 12:16:53 -08:00
Brent Komer
c5b1a1b643 typo fix (#230) 2018-11-13 13:08:32 -08:00
JohannesAck
c59a10947d Parameter documentation for tf_util.function (#349)
* Added parameter documentation

This parameter was thus far undocumented and is non-intuitive for anyone unfamiliar with tf.

* Added parameter documentation
2018-11-13 13:03:48 -08:00
James Alan Preiss
5cd66010dc case-insensitive sort for human-readable logger (#289) 2018-11-13 11:09:11 -08:00
Xiaoquan Kong
0a13da8dfe Change variable name from inpt to input_ (#297) 2018-11-13 11:08:21 -08:00
Vladislav Zavadskyy
18b6390be6 Typo fix (#287) 2018-11-13 11:03:55 -08:00
pzhokhov
52255beda5 microbatches in ppo2, custom frame size in WarpFrame, matching fc layer only when needed (#707)
* joshim5 changes (width and height to WarpFrame wrapper)

* match network output with action distribution via a linear layer only if necessary (#167)

* support color vs. grayscale option in WarpFrame wrapper (#166)

* support color vs. grayscale option in WarpFrame wrapper

* Support color in other wrappers

* Updated per Peter's suggestions

* fixing test failures

* ppo2 with microbatches (#168)

* pass microbatch_size to the model during construction

* microbatch fixes and test (#169)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* Peterz joshim5 subclass ppo2 model (#170)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* subclassing the model to make microbatched version of model WIP

* made microbatched model a subclass of ppo2 Model

* flake8 complaint

* mpi-less ppo2 (resolving merge conflict)

* flake8 and mpi4py imports in ppo2/model.py

* more un-mpying
2018-11-09 11:18:05 -08:00
AurelianTactics
d80acbb4d1 Removing print spam from Wrapper (#705)
* DDPG has unused 'seed' argument

DeepQ, PPO2, ACER, trpo_mpi, A2C, and ACKTR have the code for:

```
from baselines.common import set_global_seeds
...
def learn(...):
...
   set_global_seeds(seed)
```

DDPG has the argument 'seed=None' but doesn't have the two lines of code needed to set the global seeds.

* DDPG: duplicate variable assignment

variable nb_actions assigned same value twice in space of 10 lines
nb_actions = env.action_space.shape[-1]

* DDPG: noise_type 'normal_x' and 'ou_x' cause assert

The noise_type default 'adaptive-param_0.2' works, but the arguments that switch from parameter noise to action noise (like 'normal_0.2' and 'ou_0.2') cause an assertion failure and DDPG does not run. The issue is the following block:
```
        if self.action_noise is not None and apply_noise:
            noise = self.action_noise()
            assert noise.shape == action.shape
            action += noise
```

noise is not nested: [number_of_actions]
action is nested: [[number_of_actions]]
Either nest the noise or un-nest the action.

* Revert "DDPG: noise_type 'normal_x' and 'ou_x' cause assert"

* DDPG: noise_type 'normal_x' and 'ou_x' cause AssertionError

The noise_type default 'adaptive-param_0.2' works, but the arguments that switch from parameter noise to action noise (like 'normal_0.2' and 'ou_0.2') cause an AssertionError and DDPG does not run. The issue is the following block:
```
        if self.action_noise is not None and apply_noise:
            noise = self.action_noise()
            assert noise.shape == action.shape
            action += noise
```

noise is not nested: [number_of_actions]
action is nested: [[number_of_actions]]
Hence the shapes do not pass the assert, even though the `action += noise` line itself is correct.

* Removing Print Spam from Wrapper

Prints a line every time a video is saved or not saved. Seems unnecessary.
2018-11-08 10:13:07 -08:00
pzhokhov
556b198454 Internal minifixes (#694)
* joshim5 changes (width and height to WarpFrame wrapper)

* match network output with action distribution via a linear layer only if necessary (#167)

* support color vs. grayscale option in WarpFrame wrapper (#166)

* support color vs. grayscale option in WarpFrame wrapper

* Support color in other wrappers

* Updated per Peter's suggestions

* fixing test failures
2018-11-08 10:11:45 -08:00
pzhokhov
cc88804042 Update viz.ipynb 2018-11-07 17:20:52 -08:00
pzhokhov
c14d307834 move viz docs to a notebook entirely (#704)
* viz docs

* writing visualization docs

* documenting plot_util

* docstrings in plot_util

* autopep8 and flake8

* spelling (using the default vim spellchecker and ignoring things like dataframe, docstring, etc.)

* rephrased viz.md a little bit

* more examples of viz code usage in the docs

* replaced visualization doc with notebook
2018-11-07 17:19:42 -08:00
pzhokhov
0b71d4c6c4 remove unused args of DDPG class (#702) 2018-11-07 17:19:25 -08:00
pzhokhov
7bb405c7a7 Update viz.md 2018-11-07 14:25:35 -08:00
pzhokhov
8b95576a92 more viz + build fixes (#703)
* viz docs

* writing visualization docs

* documenting plot_util

* docstrings in plot_util

* autopep8 and flake8

* spelling (using the default vim spellchecker and ignoring things like dataframe, docstring, etc.)

* rephrased viz.md a little bit

* more examples of viz code usage in the docs
2018-11-06 17:02:20 -08:00
Peter Zhokhov
9d4fb76ef0 making num_envs and video length smaller in test_video_recorder to prevent hanging on travis 2018-11-06 09:58:43 -08:00
Peter Zhokhov
664ec6faf0 catch bugfixes in gym 2018-11-05 19:19:39 -08:00
Peter Zhokhov
3917321fbe revert over-spellchecking 2018-11-05 17:00:40 -08:00
coord.e
6e607efa90 Add video recorder (#666)
* Fix: Return the result of rendering from dummyvecenv

* Add: Add a video recorder wrapper for vecenv

* Change: Use VecVideoRecorder with --video_monitor flag

* Change: Overwrite the metadata only when it isn't defined

* Add: Define __del__ to make the file correctly closed in exit

* Fix: Bump epidode_id in reset()

* Fix: Use hasattr to check the existence of .metadata

* Fix: Make directory when it doesn't exist

* Change: Keep recording for `video_length` steps, then close

Because reset() does not behave the way it does in a normal gym.Env

* Add: Enable to specify video_length from command line argument

* Delete: Delete default value, None, of video_callable

* Change: Use self.recorded_frames and self.recording to manage intervals

* Add: Log the status of video recording

* Fix: Fix saving path

* Change: Place metadata in the base VecEnv

* Delete: Delete unused imports

* Fix: epidode_id => step_id

* Fix: Refine the flag name

* Change: Unify the flag name following the previous change

* [WIP] Add: Add a test of VecVideoRecorder

* Fix: Use PongNoFrameskip-v0 because SimpleEnv doesn't have render()

* Change: Use TemporaryDirectory

* Fix: minimal successful test

* Add: Test against parallel environments

* Add: Test against different type of VecEnvs

* Change: Test against different length and interval of video capture

* Delete: Reduce the number of tests

* Change: Test if the output video is not empty

* Add: Add some comments

* Fix: Fix the flag name

* Add: Add docstrings

* Fix: Install ffmpeg in testing container for VecVideoRecorder's test

* Fix: Delete unused things

* Fix: Replace `video_callable` with `record_video_trigger`

* Fix: Improve the explanation of `record_video_trigger` argument

* Fix: Close owning vecenv in VecVideoRecorder.close to resolve memory
leak
2018-11-05 14:32:17 -08:00
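A hedged usage sketch of the wrapper added here. The argument names follow the commit message (record_video_trigger, video_length); check baselines.common.vec_env.vec_video_recorder for the exact signature.

```
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.vec_video_recorder import VecVideoRecorder

venv = DummyVecEnv([lambda: gym.make("CartPole-v0")])
venv = VecVideoRecorder(
    venv,
    "videos",                                              # output directory
    record_video_trigger=lambda step: step % 10000 == 0,   # start a clip every 10k steps
    video_length=200,                                      # record 200 steps, then close the file
)

obs = venv.reset()
for _ in range(400):
    obs, rewards, dones, infos = venv.step([venv.action_space.sample()])
venv.close()
```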
pzhokhov
c74ce02b9d visualization code docs / bugfixes (#701)
* viz docs

* writing visualization docs

* documenting plot_util

* docstrings in plot_util

* autopep8 and flake8

* spelling (using the default vim spellchecker and ignoring things like dataframe, docstring, etc.)

* rephrased viz.md a little bit
2018-11-05 14:31:15 -08:00
pzhokhov
ab59de6922 mpi-less baselines (#689)
* make baselines run without mpi wip

* squash-merged latest master

* further removing MPI references where unnecessary

* more MPI removal

* syntax and flake8

* MpiAdam becomes regular Adam if Mpi not present

* autopep8

* add assertion to test in mpi_adam; fix trpo_mpi failure without MPI on cartpole

* mpiless ddpg
2018-10-31 11:15:41 -07:00
Mathieu Poliquin
a071fa7630 Add retro to ppo2 defaults (#682)
* Adds retro to ppo2 defaults

Created defaults for retro, copied from Atari defaults for now. Tested with SuperMarioBros-Nes

* ppo2 retro defaults to atari
2018-10-30 10:17:46 -07:00
Mathieu Poliquin
637bf55da7 Use deepmind wrapper for retro (#685)
* Use deepmind wrapper for retro

* moved wrap_deepmind_retro after Monitor wrapper
2018-10-30 10:16:15 -07:00
AurelianTactics
165c622572 DDPG: noise_type 'normal_x' and 'ou_x' cause AssertionError (#680)
* DDPG has unused 'seed' argument

DeepQ, PPO2, ACER, trpo_mpi, A2C, and ACKTR have the code for:

```
from baselines.common import set_global_seeds
...
def learn(...):
...
   set_global_seeds(seed)
```

DDPG has the argument 'seed=None' but doesn't have the two lines of code needed to set the global seeds.

* DDPG: duplicate variable assignment

variable nb_actions assigned same value twice in space of 10 lines
nb_actions = env.action_space.shape[-1]

* DDPG: noise_type 'normal_x' and 'ou_x' cause assert

The noise_type default 'adaptive-param_0.2' works, but the arguments that switch from parameter noise to action noise (like 'normal_0.2' and 'ou_0.2') cause an assertion failure and DDPG does not run. The issue is the following block:
```
        if self.action_noise is not None and apply_noise:
            noise = self.action_noise()
            assert noise.shape == action.shape
            action += noise
```

noise is not nested: [number_of_actions]
action is nested: [[number_of_actions]]
Either nest the noise or un-nest the action.

* Revert "DDPG: noise_type 'normal_x' and 'ou_x' cause assert"

* DDPG: noise_type 'normal_x' and 'ou_x' cause AssertionError

The noise_type default 'adaptive-param_0.2' works, but the arguments that switch from parameter noise to action noise (like 'normal_0.2' and 'ou_0.2') cause an AssertionError and DDPG does not run. The issue is the following block:
```
        if self.action_noise is not None and apply_noise:
            noise = self.action_noise()
            assert noise.shape == action.shape
            action += noise
```

noise is not nested: [number_of_actions]
action is nested: [[number_of_actions]]
Hence the shapes do not pass the assert, even though the `action += noise` line itself is correct.
2018-10-30 10:13:39 -07:00
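The shape mismatch described above is easy to reproduce with plain numpy; the sketch below is illustrative only, not the actual DDPG code.

```
import numpy as np

nb_actions = 4
action = np.zeros((1, nb_actions))            # batched: [[a0, a1, a2, a3]]
noise = np.random.normal(size=nb_actions)     # flat:    [n0, n1, n2, n3]

print(noise.shape == action.shape)   # False -> this is the failing assert
print((action + noise).shape)        # (1, 4): broadcasting still adds correctly

# Either nest the noise to match the batched action...
assert noise[None, :].shape == action.shape
# ...or compare against the un-batched action row.
assert noise.shape == action[0].shape
```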
Peter Zhokhov
93c7cc202c Merge branch 'master' of github.com:openai/baselines 2018-10-29 15:25:38 -07:00
Peter Zhokhov
de36116e3b update tensorflow version check regex to parse version like 1.2.3rc4 (previously only 1.2.3-rc4) 2018-10-29 15:25:31 -07:00
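A hedged illustration of the looser pattern this commit describes: make the separator before the rc suffix optional so both 1.2.3-rc4 and 1.2.3rc4 parse. This is not necessarily the exact regex used in the code.

```
import re

# Optional "-" before the release-candidate suffix.
version_re = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-?rc\d+)?$")

for v in ("1.2.3", "1.2.3-rc4", "1.2.3rc4", "1.2"):
    print(v, bool(version_re.match(v)))   # the last one fails: no patch component
```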
Mathieu Poliquin
e2b41828af Set 'cnn' as default network for retro (#683) 2018-10-29 13:30:41 -07:00
pzhokhov
8e56ddeac2 Multidiscrete action space compatibility for policy gradient-based methods (#677)
* multidiscrete space compatibility

* flake8 and syntax
2018-10-24 11:01:59 -07:00
Juliano Laganá
c3bd8cea66 Adds description of param_noise parameter in deepq.learn method (#675) 2018-10-24 10:00:31 -07:00
AurelianTactics
84ea7aa1fd DDPG has unused 'seed' argument (#676)
DeepQ, PPO2, ACER, trpo_mpi, A2C, and ACKTR have the code for:

```
from baselines.common import set_global_seeds
...
def learn(...):
...
   set_global_seeds(seed)
```

DDPG has the argument 'seed=None' but doesn't have the two lines of code needed to set the global seeds.
2018-10-24 09:59:46 -07:00
Peter Zhokhov
88300ed54c fix raise NotImplemented() complaints of latest flake8 2018-10-24 09:57:57 -07:00
pzhokhov
583ba082a2 Update cmd_util.py 2018-10-23 11:22:27 -07:00
pzhokhov
014a5597b1 refactor ACER (#664)
* make acer use vecframestack

* acer passes mnist test with 20k steps

* acer with non-image observations and tests

* flake8

* test acer serialization with non-recurrent policies
2018-10-23 10:01:25 -07:00
Isaac Poulton
4ed1350326 Fixed TypeError on creating atari vec envs (#671) 2018-10-23 10:00:09 -07:00
Rishabh Jangir
8513d73355 HER : new functionality, enables demo based training (#474)
* Add, initialize, normalize and sample from a demo buffer

* Modify losses and add cloning loss

* Add demo file parameter to train.py

* Introduce new params in config.py for demo based training

* Change logger.warning to logger.warn in rollout.py; bug

* Add data generation file for Fetch environments

* Update README file
2018-10-22 19:04:40 -07:00
Xingdong Zuo
c28acb2203 [Clean-up]: delete running_stat and filters as they are replaced by running_mean_std and not used anymore (#614)
* Delete filters.py

* Delete running_stat.py
2018-10-22 19:01:26 -07:00
pzhokhov
c5d9c4a1b2 wrap retro envs correctly for other (non-deepq) algorithms (#669)
* wrap retro envs correctly for other (non-deepq) algorithms

* flake and csh comments

* flake and csh comments
2018-10-22 18:36:39 -07:00
pzhokhov
c0fa11a3a7 minor fixes from internal (#665)
* sync internal changes. Make ddpg work with vecenvs

* B -> nenvs for consistency with other algos, small cleanups

* eval_done[d]==True -> eval_done[d]

* flake8 and numpy.random.random_integers deprecation warning

* Merge branch 'master' of github.com:openai/games into peterz_track_baselines_branch
2018-10-22 09:15:04 -07:00
Peter Zhokhov
bd390c2ade updated docstring for deepq 2018-10-19 17:50:54 -07:00
pzhokhov
d0cc325e14 store session at policy creation time (#655)
* sync internal changes. Make ddpg work with vecenvs

* B -> nenvs for consistency with other algos, small cleanups

* eval_done[d]==True -> eval_done[d]

* flake8 and numpy.random.random_integers deprecation warning

* store session at policy creation time

* coexistence tests

* fix a typo

* autopep8

* ... and flake8

* updated todo links in test_serialization
2018-10-19 08:54:21 -07:00
pzhokhov
fc7f9cec49 disable gym subpackages in setup.py (#661)
* disable gym subpackages in setup.py

* include gym[atari] in test requirements

* gym[atari] -> atari-py in test requirements
2018-10-18 16:07:14 -07:00
Matthew Rahtz
3677dc1b23 Set allow_growth=True for MuJoCo session (#643) 2018-10-18 13:54:39 -07:00
Matthew Rahtz
ef96f3835b Drop S and M args so that --play works (#636) 2018-10-16 16:28:23 -07:00
pzhokhov
a03dacd68d sync internal changes. Make ddpg work with vecenvs (#654)
* sync internal changes. Make ddpg work with vecenvs

* B -> nenvs for consistency with other algos, small cleanups

* eval_done[d]==True -> eval_done[d]

* flake8 and numpy.random.random_integers deprecation warning
2018-10-16 16:26:46 -07:00
Tianhong Dai
e57f81becc revise the readme of ddpg (#653) 2018-10-16 16:22:06 -07:00
Peter Zhokhov
28aca637d0 update benchmark results 2018-10-09 09:48:31 -07:00
Erik Doffagne
7bfbcf177e Fixed typos in README (#635) 2018-10-04 10:31:22 -07:00
pzhokhov
394339deb5 Update README.md 2018-10-03 20:53:58 -07:00
pzhokhov
10c205c159 Debug codegen ppo (#123)
* disabled tests, running benchmarks only

* dummy commit to RUN BENCHMARKS

* benchmark ppo_metal; disable all but Bullet benchmarks

* ppo2, codegen ppo and ppo_metal on Bullet RUN BENCHMARKS

* run benchmarks on Roboschool instead RUN BENCHMARKS

* run ppo_metal on Roboschool as well RUN BENCHMARKS

* install roboschool in cron rcall user_config

* dummy commit to RUN BENCHMARKS

* import roboschool in codegen/contcontrol_prob.py RUN BENCHMARKS

* re-enable tests, flake8

* get entropy from a distribution in Pred RUN BENCHMARKS

* gin for hyperparameter injection; try codegen ppo close to baselines ppo RUN BENCHMARKS

* provide default value for cg2/bmv_net_ops.py

* dummy commit to RUN BENCHMARKS

* make tests and benchmarks parallel; use relative path to gin file for rcall compatibility RUN BENCHMARKS

* syntax error in run-benchmarks-new.py RUN BENCHMARKS

* syntax error in run-benchmarks-new.py RUN BENCHMARKS

* path relative to codegen/training for gin files RUN BENCHMARKS

* another reconciliation attempt between codegen ppo and baselines ppo RUN BENCHMARKS

* value_network=copy for ppo2 on roboschool RUN BENCHMARKS

* make None seed work with torch seeding RUN BENCHMARKS

* try sequential batches with ppo2 RUN BENCHMARKS

* try ppo without advantage normalization RUN BENCHMARKS

* use Distribution to compute ema NLL RUN BENCHMARKS

* autopep8

* clip gradient norm in algo_agent RUN BENCHMARKS

* try ppo2 without vfloss clipping RUN BENCHMARKS

* trying with gamma=0.0 - assumption is, both algos should be equally bad RUN BENCHMARKS

* set gamma=0 in ppo2 RUN BENCHMARKS

* try with ppo2 with single minibatch RUN BENCHMARKS

* try with nminibatches=4, value_network=copy RUN BENCHMARKS

* try with nminibatches=1 take two RUN BENCHMARKS

* try initialization for vf=0.01 RUN BENCHMARKS

* fix the problem with min_istart >= max_istart

* i have no idea RUN BENCHMARKS

* fix non-shared variance between old and new RUN BENCHMARKS

* restored baselines.common.policies

* 16 minibatches in ppo_roboschool.gin

* fixing results of merge

* cleanups

* cleanups

* fix run-benchmarks-new RUN BENCHMARKS Roboschool8M

* fix syntax in run-benchmarks-new RUN BENCHMARKS Roboschool8M

* fix test failures

* moved gin requirement to codegen/setup.py

* remove duplicated build_softq in get_algo.py

* linting

* run softq on continuous action spaces RUN BENCHMARKS Roboschool8M
2018-10-03 14:38:32 -07:00
pzhokhov
62fe7c4717 disable async acktr (#129)
* disable async acktr

* linting

* linting

* linting
2018-10-03 14:38:32 -07:00
Xingyou Song
fbdf55ffee Xsong lqr ddpg (#125)
* allows vec_envs to work

* allows vec_envs to work

* fixed branch with correct ddpg

* running experiments jointly now

* changed to subproc

* changed to subproc

* changed to subproc

* small fix md

* removed placeholder

* removed placeholder

* added ppotest

* probably fixed ddpg hyperparam issues

* checkpoint

* edited readme

* added orthogonal

* added orthogonal

* added ddpg-vecenv

* reverted ddpg to old baselines
2018-10-03 14:38:32 -07:00
Christopher Hesse
9ee804c384 minor change to install.py and baselines run.py (#121) 2018-10-03 14:38:32 -07:00
John Schulman
4cf7dc9644 Big refactor (#124)
* massive revision inspired by soup: algo folder works

* porting rl commands, WIP

* various

* git subrepo push --remote=git@github.com:openai/codegen.git --branch=refactor codegen

subrepo:
  subdir:   "codegen"
  merged:   "aa27e069"
upstream:
  origin:   "git@github.com:openai/codegen.git"
  branch:   "refactor"
  commit:   "aa27e069"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* various

* rewrite RL stuff in new framework

* fix almost everything

* woohoo tests pass

* more tests

* reformatting

* fixes

* write tests for embeddings

* re-remove cg2

* pylint

* minor

* move smooth_helpers import; seems to cause nondeterministic failure in parallel pytest
2018-10-03 14:38:32 -07:00
Xingyou Song
e820b86fdc ppo2 now has eval stats (#120)
* ppo2 now has eval stats

* fixed spaces

* fixed kwargs ordering

* whitespace fix
2018-10-03 14:38:32 -07:00
pzhokhov
858afa8d7e Refactor DDPG (#111)
* run ddpg on Mujoco benchmark RUN BENCHMARKS

* autopep8

* fixed all syntax in refactored ddpg

* a little bit more refactoring

* autopep8

* identity test with ddpg WIP

* enable test_identity with ddpg

* refactored ddpg RUN BENCHMARKS

* autopep8

* include ddpg into style check

* fixing tests RUN BENCHMARKS

* set default seed to None RUN BENCHMARKS

* run tests and benchmarks in separate buildkite steps RUN BENCHMARKS

* cleanup pdb usage

* flake8 and cleanups

* re-enabled all benchmarks in run-benchmarks-new.py

* flake8 complaints

* deepq model builder compatible with network functions returning single tensor

* remove ddpg test with test_discrete_identity

* make ppo_metal use make_vec_env instead of make_atari_env

* make ppo_metal use make_vec_env instead of make_atari_env

* fixed syntax in ppo_metal.run_atari
2018-10-03 14:38:32 -07:00
pzhokhov
4121d9c1a8 fix DQN learning bug (#632)
* Update run.py

* Update utils.py

* Update utils.py
2018-10-03 14:37:40 -07:00
Peter Zhokhov
34ae3194b4 add a note about DQN algorithms not performing well 2018-09-27 12:51:43 -07:00
Thomas Simonini
4402b8eba6 Updated A2C and PPO2 comments (#612)
* Updated A2C and PPO2 comments

* Fixed format errors to respect PEP 8 style guide
2018-09-24 09:54:41 -07:00
ahuhn
555a5cbbb2 Adding num_env to readme example (#609)
* Adding num_env to readme example

* Updated readme example fix
2018-09-21 17:22:56 -07:00
Thomas Simonini
8158f35611 Wrote some comments to explain the A2C and PPO2 implementation (#607)
* added comments in A2C and PPO2

* Fixed format errors to respect PEP 8 style guide
2018-09-21 13:12:31 -07:00
cclauss
a7fd8a4477 Run flake8 to find syntax errors and undefined names (#439)
__E901,E999,F821,F822,F823__ are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. The other flake8 issues are merely "style violations" -- useful for readability, but they do not affect runtime safety. This PR therefore recommends a flake8 run of those tests on the entire codebase.
* F821: undefined name `name`
* F822: undefined name `name` in `__all__`
* F823: local variable `name` referenced before assignment
* E901: SyntaxError or IndentationError
* E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree
2018-09-20 16:40:03 -07:00
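A small sketch of why E999 and its siblings are "showstoppers": a file that fails to compile into an AST can never run, regardless of style. ast.parse reproduces the same class of failure that flake8 reports as E999.

```
import ast

bad_source = "def f(:\n    return 1\n"   # deliberately malformed signature

try:
    ast.parse(bad_source)
except SyntaxError as err:
    # flake8 would flag this file as E999 SyntaxError on line 1
    print(f"E999-style failure: {err.msg} (line {err.lineno})")
```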
John Schulman
e791565a60 Codegen more abstract abstract classes 3a (#106)
* Soup code, arch search on CIFAR-10

* Oh I understood how choice_sequence() worked

* Undo some pointless changes

* Some beautification 1

* Some beautification 2

* An attempt to debug test_get_algo_outputs() number 70, unsuccessful.

* Code style warning

* Code style warnings, more

* wip

* wip

* wip

* fix almost everything; soup machine still broken

* revert mpi_eda changes

* minor fixes
2018-09-20 16:19:07 -07:00
XFFXFF
7859f603cd prioritized experience replay bug (#527) 2018-09-20 16:16:44 -07:00
pzhokhov
0f4ae2fb2a refactor acktr (#560)
* refactor acktr

* setup.cfg now tests style/syntax in acktr as well

* flake8 complaints

* added note about continuous action spaces for acktr into the README.md
2018-09-20 16:05:26 -07:00
pzhokhov
0e7048b89f Update README.md 2018-09-19 15:04:54 -07:00
pzhokhov
75983bab64 Update README.md 2018-09-19 15:04:01 -07:00
Alfredo Canziani
85be74500d Add possibility of plotting timesteps vs episodes (#578)
* Add possibility of plotting timesteps vs episodes

* Remove leftover from personal project patch

* Auto plt.tight_layout() on resize window event

Calls `plt.tight_layout()` if a `resize_event` is issued.
This means that the plot will look good even after the user has resized the plotting window.
2018-09-19 09:43:45 -07:00
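A minimal sketch of the resize hook described above, assuming a standard interactive matplotlib backend.

```
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(100), label="episode reward")
ax.set_xlabel("timesteps")
ax.legend()

# Re-pack the layout whenever the window is resized, so labels never get clipped.
fig.canvas.mpl_connect("resize_event", lambda event: fig.tight_layout())

plt.show()
```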
Geoffrey Irving
115b59d28b Merge pull request #598 from openai/irving-rc
Fix setup.py for tensorflow -rc versions
2018-09-18 15:52:57 -07:00
Xingdong Zuo
d34049cab4 Update running_mean_std.py (#585) 2018-09-18 14:14:38 -07:00
pzhokhov
59662fff78 rename entcoeff to ent_coef in trpo_mpi for compatibility with other algos (#581) 2018-09-18 14:13:05 -07:00
Geoffrey Irving
a42c4eb2bb Fix setup.py for tensorflow -rc versions 2018-09-18 11:35:43 -07:00
R1ckF
68a29d0ab3 --play now works with LSTM (#595) 2018-09-17 14:33:39 -07:00
Xingdong Zuo
0c6f357936 Delete identity_env.py (#588) 2018-09-17 09:53:34 -07:00
pzhokhov
4dc697e670 codegen test fixes (#95)
* fix discovered test failures

* autopep8

* test indices up to 123

* testing from index 124 on

* add scope to logstd

* fix flakiness in test_train_mle

* autopep8
2018-09-14 15:43:50 -07:00
Peter Zhokhov
e790f5214b define mean for CategoricalPd (as softmax of logits) 2018-09-14 15:43:50 -07:00
pzhokhov
fe06c6b4db continuous action spaces for codegen + some benchmarking (#82)
* add some docstrings

* start making big changes

* state machine redesign

* sampling seems to work

* some reorg

* fixed sampling of real vals

* json conversion

* made it possible to register new commands
got nontrivial version of Pred working

* consolidate command definitions

* add more macro blocks

* revived visualization

* rename Userdata -> CmdInterpreter
make AlgoSmInstance subclass of SmInstance that uses appropriate userdata argument

* replace userdata by ci when appropriate

* minor test fixes

* revamped handmade dir, can run ppo_metal

* seed to avoid random test failure

* implement AlgoAgent

* Autogenerated object that performs all ops and macros

* more CmdRecorder changes

* move files around

* move MatchProb and JtftProb

* remove obsolete

* fix tests involving AlgoAgent (pending the next commit on ppo_metal code)

* ppo_metal: reduce duplication in policy_gen, make sess an attribute of PpoAgent and StochasticPolicy instead of using get_default_session everywhere.

* maze_env reformatting, move algo_search script (but still broken)

* move agent.py

* fix test on handcrafted agents

* tuning/fixing ppo_metal baseline

* minor

* Fix ppo_metal baseline

* Don’t set epcount, tcount unless they’re being used

* get rid of old ppo_metal baseline

* fixes for handmade/run.py tuning

* fix codegen ppo

* fix handmade ppo hps

* fix test, go back to safe_div

* switch to more complex filtering

* make sure all handcrafted algos have finite probability

* train to maximize logprob of provided samples
Trex changes to avoid segfault

* AlgoSm also includes global hyperparams

* don’t duplicate global hyperparam defaults

* create generic_ob_ac_space function

* use sorted list of outkeys

* revive tsne

* todo changes

* determinism test

* todo + test fix

* remove a few deprecated files, rename other tests so they don’t run automatically, fix real test failure

* continuous control with codegen

* continuous control with codegen

* implement continuous action space algodistr

* ppo with trex RUN BENCHMARKS

* wrap trex in a monitor

* dummy commit to RUN BENCHMARKS

* adding monitor to trex env RUN BENCHMARKS

* adding monitor to trex RUN BENCHMARKS

* include monitor into trex env RUN BENCHMARKS

* generate nll and predmean using Distribution node

* dummy commit to RUN BENCHMARKS

* include pybullet into baselines optional dependencies

* dummy commit to RUN BENCHMARKS

* install games for cron rcall user RUN BENCHMARKS

* add --yes flag to install.py in rcall config for cron user RUN BENCHMARKS

* both continuous and discrete versions seem to run

* fixes to monitor to work with vecenv-like info and rewards RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* removed shape check from one-hot encoding logic in distributions.CategoricalPd

* reset logger configuration in codegen/handmade/run.py to be in-line with baselines RUN BENCHMARKS

* merged peterz_codegen_benchmarks RUN BENCHMARKS

* skip tests RUN BENCHMARKS

* working on test failures

* save benchmark dicts RUN BENCHMARK

* merged peterz_codegen_benchmark RUN BENCHMARKS

* add get_git_commit_message to the baselines.common.console_util

* dummy commit to RUN BENCHMARKS

* merged fixes from peterz_codegen_benchmark RUN BENCHMARKS

* fixing failure in test_algo_nll WIP

* test_algo_nll passes with both ppo and softq

* re-enabled tests

* run trex on gpus for 100k total (horizon=100k / 16) RUN BENCHMARKS

* merged latest peterz_codegen_benchmarks RUN BENCHMARKS

* fixing codegen test failures (logging-related)

* fixed name collision in run-benchmarks-new.py RUN BENCHMARKS

* fixed name collision in run-benchmarks-new.py RUN BENCHMARKS

* fixed import in node_filters.py

* test_algo_search passes

* some cleanup

* dummy commit to RUN BENCHMARKS

* merge fast fail for subprocvecenv RUN BENCHMARKS

* use SubprocVecEnv in sonic_prob

* added deprecation note to shmem_vec_env

* allow indexing of distributions

* add timeout to pipeline.yaml

* typo in pipeline.yml

* run tests with --forked option

* resolved merge conflict in rl_algs.bench.benchmarks

* re-enable parallel tests

* fix remaining merge conflicts and syntax

* Update trex_prob.py

* fixes to ResultsWriter

* take baselines/run.py from peterz_codegen branch

* actually save stuff to file in VecMonitor RUN BENCHMARKS

* enable parallel tests

* merge stricter flake8

* merge peterz_codegen_benchmark, resolve conflicts

* autopep8

* remove traces of Monitor from trex env, check shapes before encoding in CategoricalPd

* asserts and warnings to make q -> distribution change more explicit

* fixed assert in CategoricalPd

* add header to vec_monitor output file RUN BENCHMARKS

* make VecMonitor write header to the output file

* remove deprecation message from shmem_vec_env RUN BENCHMARKS

* autopep8

* proper shape test in distributions.py

* ResultsWriter can take dict headers

* dummy commit to RUN BENCHMARKS

* replace assert len(qs)==1 with warning RUN BENCHMARKS

* removed pdb from ppo2 RUN BENCHMARKS
2018-09-14 15:43:49 -07:00
Peter Zhokhov
1f99a562e3 autopep8 2018-09-11 13:21:52 -07:00
Peter Zhokhov
4e2a888273 Merge commit 'refs/subrepo/baselines/fetch' into subrepo/baselines 2018-09-11 13:19:39 -07:00
Peter Zhokhov
c5b2918607 git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "2742f819"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "5c5a9f4b"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-11 13:18:43 -07:00
Peter Zhokhov
3bf31a4330 git subrepo commit (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "0846932a"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "c5d6f299"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-11 13:18:43 -07:00
pzhokhov
9070ee7ef3 tighten flake8, autopep8 to fix trailing whitespaces and blank lines with whitespaces (#87) 2018-09-11 13:18:43 -07:00
Peter Zhokhov
e56803491f git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "5c6a1fd9"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "23b23332"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-11 13:18:42 -07:00
pzhokhov
b3bc25d99a add fast failure when calling methods on a closed subprocvecenv (#84) 2018-09-11 13:18:42 -07:00
Peter Zhokhov
5c5a9f4b31 autopep8 on deepq/experiments 2018-09-11 12:47:50 -07:00
Peter Zhokhov
5183fa9f29 autopep8 on deepq/experiments 2018-09-11 12:47:50 -07:00
Peter Zhokhov
3bf35cb468 added peterz to baselines authorlist 2018-09-11 12:44:51 -07:00
Peter Zhokhov
5c62f5c7dd added peterz to baselines authorlist 2018-09-11 12:44:51 -07:00
Peter Zhokhov
29bf587d15 Merge branch 'master' of github.com:openai/baselines 2018-09-11 12:40:29 -07:00
Peter Zhokhov
c5d6f2996c Merge branch 'master' of github.com:openai/baselines 2018-09-11 12:40:29 -07:00
Peter Zhokhov
06bdc2860c docstrings about vecenvs 2018-09-11 12:40:23 -07:00
pzhokhov
adaa8aefa8 baselines issue #564 (#574)
* fixes to enjoy_cartpole, enjoy_mountaincar.py

* fixed {train,enjoy}_pong, removed enjoy_retro

* set number of timesteps to 1e7 in train_pong

* flake8 complaints

* use synchronous version of acktr in test_env_after_learn

* flake8
2018-09-10 11:50:59 -07:00
pzhokhov
23b2333238 baselines issue #564 (#574)
* fixes to enjoy_cartpole, enjoy_mountaincar.py

* fixed {train,enjoy}_pong, removed enjoy_retro

* set number of timesteps to 1e7 in train_pong

* flake8 complaints

* use synchronous version of acktr in test_env_after_learn

* flake8
2018-09-10 11:50:59 -07:00
Peter Zhokhov
8614c4ddbf flake8 2018-09-10 10:41:29 -07:00
Peter Zhokhov
59a7ffb84d fixed tests of test_env_after_learn 2018-09-10 10:32:42 -07:00
Daniel Angelov
58b1021b28 Add tensorboard start command for convenience (#569) 2018-09-07 17:04:02 -07:00
Peter Zhokhov
a60e88bff9 git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "8785db28"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "35e95ee8"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-07 16:35:00 -07:00
pzhokhov
75b93b890e implement pdfromlatent in BernoulliPdType (#81)
* implement pdfromlatent in BernoulliPdType

* remove env.close() at the end of algorithms

* test case for environment after learn

* closing env in run.py

* fixes for acktr and trpo_mpi

* add make_session with new graph for every call in test_env_after_learn

* remove extra prints from test_env_after_learn
2018-09-07 16:35:00 -07:00
John Schulman
565b2153d7 Add lots of docstrings (#76)
* Add lots of docstrings
Change hyperparameter transformations for slightly better efficiency and to avoid circular dependency.
Now all parameters are stored in a “human-readable” form.

* improve pretty-print of nodes and trees

* newlines at end-of-file, return graph in render(), assert_valid() fix

* split run_algo_search.py into several simpler scripts

* add joint_train option to get_prob

* minor changes to soln_db and embedding script

* Arguments: -> Args:

* fix replay, part 1

* fix behavior when using unpickled algos

* re-add retrieve_weights

* make training scripts more consistent

* lint

* lint

* lint + remove some rendering functionality from trex env as it’s also elsewhere

* get rid of warnings

* refactor functionality for getting final q-function and losses. revive code for removing useless terms & tests for simplification.

* fix vecenv closing

* finish removing algo folder (most useful functionality has been moved out of it)

* control verbosity of trex

* fix tests

* rename spec => choice_spec, some comments, asserts, debug prints

* fix some tests
2018-09-07 16:34:59 -07:00
Peter Zhokhov
35e95ee85a fix python 3.5 string format compatibility 2018-09-06 12:00:19 -07:00
Isaac Lascasas
ad219e205d VecNormalize: set env. returns to zero on resets. (#556)
* VecNormalize: set env. returns to zero on resets.

* VecNormalize: returns reset in step_wait after ret_rms.update.
2018-09-06 10:21:50 -07:00
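A rough, self-contained sketch of the behaviour this change fixes (not the actual VecNormalize code): the running return used for reward normalization is fed to the statistics first, then zeroed for the environments whose episodes just ended. ReturnTracker is a hypothetical stand-in class.

```
import numpy as np

class ReturnTracker:
    """Tracks the discounted return per env, mimicking VecNormalize's `ret`."""
    def __init__(self, num_envs, gamma=0.99):
        self.gamma = gamma
        self.ret = np.zeros(num_envs)

    def step_wait(self, rews, dones):
        self.ret = self.ret * self.gamma + rews
        self._update_ret_rms(self.ret)   # stand-in for self.ret_rms.update(self.ret)
        self.ret[dones] = 0.0            # the fix: zero the return once an episode ends
        return rews

    def _update_ret_rms(self, ret):
        pass  # running mean/std update omitted in this sketch

tracker = ReturnTracker(num_envs=2)
tracker.step_wait(np.array([1.0, 1.0]), dones=np.array([False, True]))
print(tracker.ret)   # second env's return was reset to 0
```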
Peter Zhokhov
be9118bcd8 git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "f2a9b8f2"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "cc4215ef"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-06 10:18:13 -07:00
pzhokhov
02a5e7aed5 fixes to readme and baselines/run.py (#80)
* fixes to readme and baselines/run.py

* polish installation section of baselines README

* polish installation section of baselines README
2018-09-06 10:18:13 -07:00
pzhokhov
87ac8bc317 install roboschool in install.py (#55)
* putting instructions from README.md into a script

* install roboschool as a part of setup.py

* install roboschool from install.py

* export pkg_config_path

* remove compilation step from roboschool/setup.py

* removed roboschool install from games install due to extra compilation step

* removed unused import from roboschool/setup.py
2018-09-06 10:18:13 -07:00
Tom
cc4215ef4b refactor common.models via registering reflection (#565) 2018-09-06 10:16:06 -07:00
Clayton Thorrez
1e9051e87e fixed warning (#464) 2018-09-05 15:12:01 -07:00
uronce-cc
43ed76944b Fix mean reward per episode after training Pong. (#562)
* Fix mean reward per episode after training Pong.

* Fix typo.
2018-09-05 15:06:29 -07:00
Peter Zhokhov
7f08c675bb git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "39f8be8f"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "0a40206c"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-04 10:23:40 -07:00
pzhokhov
b3f966aa02 use env.render in dummy_vec_env.render when num_envs == 1 (#74)
* use env.render in dummy_vec_env.render when num_envs == 1

* use shorter super() syntax per Alex's suggestion
2018-09-04 10:23:40 -07:00
pzhokhov
51cefc933b make load_variables compatible with old list format (#71)
* make load_variables compatible with old list format

* cosmetic fixes
2018-09-04 10:23:39 -07:00
Christopher Hesse
7bccb2969f baselines: default logger similar to configure() logger, rcall: don't call logger.configure() for new rl_algs
* error if logger looks wrong

* check version of logger, call logger.configure() on import

* remove changes entry

* add version to rl-algs

* fix typo

* add comment

* switch version to string

* set logger env variable
2018-09-04 10:23:39 -07:00
uronce-cc
0a40206c6c ncpu needs to be an integer. (#558) 2018-08-31 09:02:18 -07:00
Alfredo Canziani
1937826784 Fix alien syntax and apply PEP 8 style (#554) 2018-08-30 17:21:25 -07:00
pzhokhov
b29c8020d7 remove saving model as a pickle file in ppo2 (tries to pull environment in; bad idea - may need to use constructor argument pickling or somesuch if at all necessary) (#69) 2018-08-30 13:41:38 -07:00
Peter Zhokhov
4ec308aaa4 fixed syntax 2018-08-30 13:41:38 -07:00
Peter Zhokhov
3bbf3f3511 allow_early_resets=True in create_vec_env 2018-08-30 13:41:38 -07:00
Joshua Meier
e5de29a954 instructions for tensorboard (#61) 2018-08-30 13:41:37 -07:00
Joshua Meier
2507d335f9 Tensorboard util (#60)
* separate_validation_set was not imported

* launching tensorboard automatically
2018-08-30 13:41:37 -07:00
Damien Lancry
bdd4d385a6 Fix result_plotters in vectorized mujoco environments (#533)
* I investigated running training in a vectorized, monitored mujoco env and found that the 0.monitor.csv file could not be plotted using the baselines.results_plotter.py functions. Moreover, the seed is the same in every parallel environment due to the particular behaviour of lambda. This fixes both issues without breaking the function in other files (baselines.acktr.run_mujoco still works)

* unifies make_atari_env and make_mujoco_env

* redefine make_mujoco_env because run_mujoco in acktr is not compatible with DummyVecEnv and SubprocVecEnv

* fix if else

* Update run.py
2018-08-28 17:48:56 -07:00
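The lambda behaviour mentioned above is the usual late-binding pitfall; a framework-free sketch (the env thunks are reduced to plain integers for clarity):

```
# Closures capture the loop variable by reference, so without an early binding
# every env thunk sees the final rank and therefore the same seed.
def make_env_fns_buggy(num_envs, seed):
    return [lambda: seed + rank for rank in range(num_envs)]

def make_env_fns_fixed(num_envs, seed):
    return [lambda rank=rank: seed + rank for rank in range(num_envs)]

print([fn() for fn in make_env_fns_buggy(4, 100)])   # [103, 103, 103, 103]
print([fn() for fn in make_env_fns_fixed(4, 100)])   # [100, 101, 102, 103]
```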
Peter Zhokhov
0961f5dd94 git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "95a81e86"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "c6c0f45c"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-08-27 16:40:14 -07:00
Christopher Hesse
337d913a8f remove reset_task from subproc vec env (#45) 2018-08-27 16:40:14 -07:00
Karl Cobbe
34af61a132 baselines: fix dummy vec env render mode (#42) 2018-08-27 16:40:14 -07:00
Christopher Hesse
1ea5ec647c export SimpleEnv and assert_envs_equal, fix minor bug in action space (#46) 2018-08-27 16:40:14 -07:00
pzhokhov
2fc7a1cbee Trigger benchmarks from buildkite (#40)
* rig buildkite pipeline to run benchmarks when commit ends with RUN BENCHMARKS

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file - merge test and benchmark steps

* fix the buildkite pipeline file - merge test and benchmark steps

* fix buildkite pipeline file

* fix buildkite pipeline file

* dry RUN BENCHMARKS

* dry RUN BENCHMARKS

* dry not run BENCHMARKS

* not run benchmarks

* not running benchmarks

* no running benchmarks

* no running benchmarks

* still not running benchmarks

* dummy commit to RUN BENCHMARKS

* trigger benchmarks from buildkite RUN BENCHMARKS

* specifying RCALL_KUBE_CLUSTER RUN BENCHMARKS

* remove rl-algs/run-benchmarks-new.py (moved to ci), merged baselines/common/console_util and baselines/common/util.py

* added missing imports in console_util

* clone subrepo over https
2018-08-27 16:40:14 -07:00
John Schulman
14c1d69ef4 Reduce duplication in VecEnv subclasses. (#38)
* Reduce duplication in VecEnv subclasses.
Now VecEnv base class handles rendering and closing; subclasses should provide get_images and (optionally) close_extras.

* fix tests

* minor docstring change

* raise NotImplementedError
2018-08-27 16:40:13 -07:00
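A hedged outline of the split this commit describes. The hook names follow the commit message; the real VecEnv carries much more machinery (observation/action spaces, async stepping, tiling of rendered frames), and VecEnvSketch/DummyImagesEnv are names invented for this sketch.

```
import numpy as np

class VecEnvSketch:
    """Base class: owns render() and close(); subclasses fill in the hooks."""
    closed = False

    def get_images(self):
        """Return a list of RGB frames, one per sub-environment."""
        raise NotImplementedError

    def close_extras(self):
        """Optional subclass cleanup (processes, pipes, viewers)."""
        pass

    def render(self, mode="rgb_array"):
        return np.stack(self.get_images())   # the real class tiles these into one image

    def close(self):
        if not self.closed:
            self.close_extras()
            self.closed = True

class DummyImagesEnv(VecEnvSketch):
    def get_images(self):
        return [np.zeros((4, 4, 3), dtype=np.uint8) for _ in range(2)]

env = DummyImagesEnv()
print(env.render().shape)   # (2, 4, 4, 3)
env.close()
```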
pzhokhov
c8f6d8bac7 address rl-algs issue #169 (missing util functions from rcall) (#30)
* copied parts of util.py to baselines.common from rcall

* merged fix for baselines.logger, resolved conflicts

* copied ccap to baselines/baselines/common/util.py
2018-08-27 16:40:13 -07:00
pzhokhov
3a006ba50e flake8 fixes (#35)
* flake8 fixes

* added baselines/setup.cfg

* style checks using setup.cfg in baselines
2018-08-27 16:40:13 -07:00
Tom
c6c0f45cb1 fix 'async' is a reserved word in Python >= 3.7 (#495) (#542) 2018-08-27 12:36:43 -07:00
wangjksjtu
e92a6ad8f4 Update README.md (#537)
1. Delete repetitive section
2. Align the commands
2018-08-27 12:35:48 -07:00
HelgeS
92b9a37257 Updated example commands to run ppo2 (#534)
The headline mentions PPO, but the command was for A2C
2018-08-23 15:58:27 -07:00
Armin Primadi
cb14da96ca Fix typo on policies documentation (#535) 2018-08-23 15:56:13 -07:00
pzhokhov
3900f2a447 baselines issue 146 (remove tensorflow from setup.py) (#34)
* baselines does not reinstall tensorflow

* fix the version check in baselines/setup.py

* replace print and assert with assert, str (thanks @csh)
2018-08-21 16:59:05 -07:00
pzhokhov
20d22a5d79 Fix baselines build (fails due to lack of mujoco in public baselines container) (#29)
* make nminibatches = min(nminibatches, nenv)

* clarify the usage of lstm policy, add an example and a test

* cleaned up example, added assert to the test

* remove nminibatches -> min(nminibatches, num_env)

* removed code snippet from the docstring, pointing to the file

* add _mujoco_present flag to skip the tests that require mujoco if mujoco is not present

* re-format skip message in test_doc_examples

* flake8 complaints
2018-08-21 10:08:24 -07:00
pzhokhov
caf7b08b4d Baselines issue #525 (lack of docs for recurrent policies) (#27)
* make nminibatches = min(nminibatches, nenv)

* clarify the usage of lstm policy, add an example and a test

* cleaned up example, added assert to the test

* remove nminibatches -> min(nminibatches, num_env)

* removed code snippet from the docstring, pointing to the file
2018-08-20 13:55:35 -07:00
Peter Zhokhov
ca0165cdf5 flake8 complaints 2018-08-17 18:11:00 -07:00
pzhokhov
eb5b605f86 restore subrepo conftest.py files (#22)
* restore conftest.py in subrepos

* remove conftest files from subrepos in the docker image

* remove runslow flag from baselines .travis.yml and rl-algs ci/runtests.sh

* move import of rendering module into the code to fix tests that don't require a display

* restore the dockerfile
2018-08-17 17:02:39 -07:00
Peter Zhokhov
a89bee3c8d Merge commit 'refs/subrepo/baselines/fetch' into subrepo/baselines 2018-08-17 13:55:27 -07:00
pzhokhov
353bb15e90 deduplicate algorithms in rl-algs and baselines (#18)
* move vec_env

* cleaning up rl_common

* tests are passing (but mosts tests are deleted as moved to baselines)

* add benchmark runner for smoke tests

* removed duplicated algos

* route references to rl_algs.a2c to baselines.a2c

* route references to rl_algs.a2c to baselines.a2c

* unify conftest.py

* removing references to duplicated algs from codegen

* removing references to duplicated algs from codegen

* alex's changes to dummy_vec_env

* fixed the test_cartpole[deepq] test case by decreasing the number of training steps... alex's changes seem to have fixed the bug and made it train better, but at seed=0 there is a dip in the training curve at 30k steps that fails the test

* codegen tests with atol=1e-6 seem to be unstable

* rl_common.vec_env -> baselines.common.vec_env mass replace

* fixed reference in trpo_mpi

* a2c.util references

* restored rl_algs.bench in sonic_prob

* fix reference in ci/runtests.sh

* simplified expression in baselines/common/cmd_util

* further increased rtol to 1e-3 in codegen tests

* switched vecenvs to use SimpleImageViewer from gym instead of cv2

* make run.py --play option work with num_envs > 1

* make rosenbrock test reproducible

* git subrepo pull (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "e23524a5"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "bcde04e7"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* updated baselines README (num-timesteps --> num_timesteps)

* typo in deepq/README.md
2018-08-17 13:54:11 -07:00
pzhokhov
64c0c0a043 Setup travis (#12)
* re-setting up travis

* re-setting up travis

* resolved merge conflicts, added missing dependency for codegen

* removed parallel tests (workers are failing for some reason)

* try test baselines only

* added language options - some weirdness in rcall image that requires them?

* added verbosity to tests

* try tests in baselines only

* ci/runtests.sh tests codegen (some failure on baselines specifically on travis, trying to narrow down the problem)

* removed render from codegen test - maybe that's the problem?

* trying even simpler command within the image to figure out the problem

* print out system info in ci/runtests.sh

* print system info outside of docker as well

* trying single test file in codegen

* install graphviz in the docker image

* git subrepo pull baselines

subrepo:
  subdir:   "baselines"
  merged:   "8c2aea2"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "8c2aea2"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* added graphviz to the dockerfile (need both graphviz-dev and graphviz)

* only tests in codegen/algo/test_algo_builder.py

* run baselines tests only. still no clue why collection of codegen tests fails

* update baselines setup to install filelock for tests

* run slow tests

* skip slow tests in baselines

* single test file in baselines

* try reinstalling tensorflow

* running slow tests

* try full baselines and codegen test suite

* in the test Dockerfile, reinstall tensorflow

* using fake display for codegen render tests

* fixed display-related failures by adding a custom entrypoint to the docker image

* set LC_ALL and LANG env variables in docker image

* try sequential tests

* include psutil in requirements; increase relative tolerance in test_low_level_algo_distr

* trying to fix codegen failures on travis

* git subrepo commit (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "9ce84da"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "b222dd0"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* syntax in install.py

* changing the order of package installation

* removed supervised-reptile from installation list

* cron uses the full games repo in rcall

* flake8 complaints

* rewrite all extras logic in baselines, install.py always uses [all]
2018-08-17 13:54:10 -07:00
pzhokhov
5fee99e771 Setup travis (#12)
* re-setting up travis

* re-setting up travis

* resolved merge conflicts, added missing dependency for codegen

* removed parallel tests (workers are failing for some reason)

* try test baselines only

* added language options - some weirdness in rcall image that requires them?

* added verbosity to tests

* try tests in baselines only

* ci/runtests.sh tests codegen (some failure on baselines specifically on travis, trying to narrow down the problem)

* removed render from codegen test - maybe that's the problem?

* trying even simpler command within the image to figure out the problem

* print out system info in ci/runtests.sh

* print system info outside of docker as well

* trying single test file in codegen

* install graphviz in the docker image

* git subrepo pull baselines

subrepo:
  subdir:   "baselines"
  merged:   "8c2aea2"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "8c2aea2"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* added graphviz to the dockerfile (need both graphviz-dev and graphviz)

* only tests in codegen/algo/test_algo_builder.py

* run baselines tests only. still no clue why collection of codegen tests fails

* update baselines setup to install filelock for tests

* run slow tests

* skip slow tests in baselines

* single test file in baselines

* try reinstalling tensorflow

* running slow tests

* try full baselines and codegen test suite

* in the test Dockerfile, reinstall tensorflow

* using fake display for codegen render tests

* fixed display-related failures by adding a custom entrypoint to the docker image

* set LC_ALL and LANG env variables in docker image

* try sequential tests

* include psutil in requirements; increase relative tolerance in test_low_level_algo_distr

* trying to fix codegen failures on travis

* git subrepo commit (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "9ce84da"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "b222dd0"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* syntax in install.py

* changing the order of package installation

* removed supervised-reptile from installation list

* cron uses the full games repo in rcall

* flake8 complaints

* rewrite all extras logic in baselines, install.py always uses [all]
2018-08-17 13:40:02 -07:00
Youngjin Kim
5edcd6886e Fix argument error in deepq (#508)
* Fix argument error in deepq

* Fix argument error in deepq
2018-08-16 14:55:57 -07:00
Youngjin Kim
bcde04e710 Fix argument error in deepq (#508)
* Fix argument error in deepq

* Fix argument error in deepq
2018-08-16 14:55:57 -07:00
pzhokhov
cd375ab209 update readmes (#514)
* update per-algorithm READMEs to reflect new way of running algorithms

* adding a link to repo-wide README

* updated README files and deepq.train_cartpole example
2018-08-16 14:53:49 -07:00
pzhokhov
5622a09fa4 update readmes (#514)
* update per-algorithm READMEs to reflect new way of running algorithms

* adding a link to repo-wide README

* updated README files and deepq.train_cartpole example
2018-08-16 14:53:49 -07:00
Pim de Haan
e2da7cd42f Several bugfixes for #504, #505, #506 related to Classic Control and deepq (#507)
* Several bugfixes

* Fixed ActWrapper.step bug
2018-08-16 12:08:53 -07:00
Peter Zhokhov
b222dd0610 updated links in README to point to master 2018-08-13 16:01:24 -07:00
pzhokhov
1870685071 Publish benchmark results (#502)
* updated benchmark pages with final rewards

* use htmlpreview to render pages

* use htmlpreview to render pages

* use htmlpreview to render pages

* updated README to reflect ppo1 being obsolete

* removed navbars from published benchmark pages

* fixed link in README
2018-08-13 15:59:43 -07:00
pzhokhov
8c2aea2add refactor a2c, acer, acktr, ppo2, deepq, and trpo_mpi (#490)
* exported rl-algs

* more stuff from rl-algs

* run slow tests

* re-exported rl_algs

* re-exported rl_algs - fixed problems with serialization test and test_cartpole

* replaced atari_arg_parser with common_arg_parser

* run.py can run algos from both baselines and rl_algs

* added approximate humanoid reward with ppo2 into the README for reference

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* very dummy commit to RUN BENCHMARKS

* serialize variables as a dict, not as a list

* running_mean_std uses tensorflow variables

* fixed import in vec_normalize

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* flake8 complaints

* save all variables to make sure we save the vec_normalize normalization

* benchmarks on ppo2 only RUN BENCHMARKS

* make_atari_env compatible with mpi

* run ppo_mpi benchmarks only RUN BENCHMARKS

* hardcode names of retro environments

* add defaults

* changed default ppo2 lr schedule to linear RUN BENCHMARKS

* non-tf normalization benchmark RUN BENCHMARKS

* use ncpu=1 for mujoco sessions - gives a bit of a performance speedup

* reverted running_mean_std to user property decorators for mean, var, count

* reverted VecNormalize to use RunningMeanStd (no tf)

* reverted VecNormalize to use RunningMeanStd (no tf)

* profiling wip

* use VecNormalize with regular RunningMeanStd

* added acer runner (missing import)

* flake8 complaints

* added a note in README about TfRunningMeanStd and serialization of VecNormalize

* dummy commit to RUN BENCHMARKS

* merged benchmarks branch
2018-08-13 09:56:44 -07:00
Tony Yu Cao
366f486e34 Update README.md (#416)
Update Atari example
2018-08-08 10:42:10 -07:00
Adam Gleave
f272969325 GAIL: bugfix in dataset loading (#447)
* Fix silly typo

* Replace ad-hoc function with NumPy code
2018-07-06 16:12:14 -07:00
pzhokhov
a6b1bc70f1 re-import internal; fix missing tile_images.py (#427)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity

* import internal

* adding missing tile_images.py
2018-06-08 09:41:45 -07:00
pzhokhov
36ee5d1707 Import internal changes (#422)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity

* import internal
2018-06-06 11:39:13 -07:00
pzhokhov
24fe3d6576 Import internal repo (#409)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity
2018-05-21 15:24:00 -07:00
pzhokhov
9cb7ece338 add opencv-python to the dependencies (#407) 2018-05-14 10:52:19 -07:00
pzhokhov
9cf95a0054 setup travis ci build (#388)
* simple .travis.yml file

* added static syntax checks of common to .travis.yml

* dockerizing the build

* fix Dockerfile, adding build shield

* cleaning up workdir in Dockerfile and .travis.yml

* .travis.yml fixed common -> baselines/common for style check
2018-05-03 09:43:28 -07:00
pzhokhov
8b781038cc put filters and running_stat files in common instead of acktr (#389) 2018-05-02 18:42:48 -07:00
pzhokhov
69f25c6028 import internal repo (#385) 2018-05-01 16:54:04 -07:00
pzhokhov
2b0283b9db Readme.md detailed installation instructions (#377)
* changes to README.md files with more detailed installation instructions

* md-fying the changes better

* link on the word homebrew in readme.md

* typos in README.md

* README.md

* removed extra comma sign

* removed sudo from brew command
2018-04-25 17:40:48 -07:00
Matthias Plappert
1f8a03f3a6 Update README 2018-03-26 16:50:22 +02:00
Matthias Plappert
3cc7df0608 Minor fixes to HER release (#319)
* Fix plotting script

* Add warning if num_cpu = 1
2018-03-05 11:06:17 +01:00
Alex Nichol
8b3a6c2051 fix DummyVecEnv reusing buffers 2018-03-02 17:18:07 -08:00
Alex Nichol
569bd42629 Merge pull request #308 from araffin/master
Bug fix in saving ACER model
2018-03-01 10:45:04 -08:00
Daniel Ziegler
f49a9c3d85 Fix bug in DDPG parameter space noise adaptation (#306)
The training loop used the rollout step variable `t` rather than the
training step variable `t_train` to decide when to adapt the scale of
the parameter space noise.
2018-03-01 18:00:34 +01:00
Antonin RAFFIN
14f2f9328c Bug fix in saving ACER model 2018-03-01 10:24:14 +01:00
Alex Nichol
6bdf2f55a2 Merge pull request #132 from bhatiaabhinav/bug_fixes
Bug fix in saving a2c model.
2018-02-27 19:00:37 -08:00
Alex Nichol
97be70d6c8 fixes for DummyVecEnv
Fixes various problems running MuJoCo tasks.
2018-02-27 18:55:10 -08:00
Matthias Plappert
b71152eea0 Adds support for Hindsight Experience Replay (HER) (#299)
* Add Hindsight Experience Replay (HER)

* Minor improvements
2018-02-26 17:40:16 +01:00
Christopher Hesse
df2e846ab7 export: fix accidental rename 2018-02-14 22:01:16 -08:00
Christopher Hesse
edb52c22a5 export: Fix deepq param noise refactoring, remove atari experiments and azure dependency 2018-02-14 21:42:22 -08:00
Andrei Kashin
98257ef8c9 Flush temporary file before compressing it.
We need to flush the buffer after `pickle.dump`, otherwise the resulting zip archive might be incomplete (reproducible, if the state consists of a single integer).
2018-02-06 07:04:44 -08:00
Oleg Klimov
d9b36601d9 comment about loading weights in ppo2 2018-02-05 12:25:05 -08:00
Oleg Klimov
2793971c10 fix gail tf_util usage 2018-02-05 07:51:27 -08:00
John Schulman
16d7d23b7d Merge pull request #271 from simontudo/add-requirement-cloudpickle
added cloudpickle to requirements
2018-02-02 23:04:53 -08:00
John Schulman
9175b770c6 Merge pull request #273 from simontudo/videorecorder-import
updated videorecorder import
2018-02-02 23:03:51 -08:00
simontudo
615870ad6b updated videorecorder import 2018-02-01 12:09:08 +01:00
simontudo
7bd264e0e9 added cloudpickle to requirements 2018-01-31 10:43:17 +01:00
John Schulman
8d03102d4d Merge pull request #265 from 20chase/patch-1
fix logger error for trpo_mpi
2018-01-29 00:54:51 -08:00
20chase
4a77855529 using mujoco_arg_parser as args
remove origin parser
2018-01-29 16:52:01 +08:00
John Schulman
2e29b41592 Merge pull request #268 from ei-grad/master
Fix fc call in AcerLstmPolicy
2018-01-27 18:42:31 -08:00
Andrew Grigorev
634e37c5b8 Fix fc call in AcerLstmPolicy
The `act` keyword was removed from baselines.a2c.utils.fc in commit 9fa8e1b.
2018-01-27 23:18:02 +03:00
20chase
452b548c2a Merge branch 'master' into patch-1 2018-01-26 14:34:01 +08:00
John Schulman
ebb8afff2e fix trpo_mpi bug where logstd wasn’t included 2018-01-25 21:17:40 -08:00
John Schulman
c9613b2293 Merge pull request #259 from andrewliao11/openai_gail
Add gail maintainer list
2018-01-25 20:54:34 -08:00
John Schulman
459f007bcc Merge pull request #260 from uidilr/master
Add GAIL
2018-01-25 20:54:20 -08:00
John Schulman
9fa8e1baf1 Lots of cleanups
Fixes for new gym version
Add @olegklimov and @unixpickle to authors list
2018-01-25 18:54:24 -08:00
20chase
ac2ea4f31f fix logger error for MPI
Can't run logger.configure() if rank != 0
2018-01-25 22:09:00 +08:00
Yusuke Nakata
d8cce2309f Add GAIL 2018-01-23 12:02:03 +09:00
andrew
0c207f0185 fix typo 2018-01-21 22:13:01 -08:00
andrew
41d41fabe3 add gail maintainer list 2018-01-21 22:12:03 -08:00
John Schulman
b5be53dc92 Merge pull request #229 from andrewliao11/gail
GAIL implementation
2018-01-21 20:30:20 -05:00
Matthias Plappert
49c1a8ec26 Fix bug in parameter space noise DQN 2018-01-16 10:24:30 -08:00
andrew
e5a714b070 fix relative import 2018-01-12 15:12:45 -08:00
John Schulman
f9d1d3349a remove mpirun from ppo2 instructions 2018-01-12 11:05:29 -08:00
Alex Nichol
8c90f67560 don't list TensorFlow as a requirement
fixes #146

A better (more involved) solution might be to check for a TensorFlow installation manually in setup.py and deal with that accordingly.
2017-12-15 15:54:43 -08:00
Andrew
f22bee085d Add files via upload 2017-12-12 19:03:42 -08:00
andrew
4acc71fe23 add x, y, axis name 2017-12-12 18:58:57 -08:00
andrew
2f1b629ecc Merge branch 'gail' of https://github.com/andrewliao11/baselines into gail 2017-12-12 18:56:00 -08:00
andrew
00573cf5e9 add x, y axis name 2017-12-12 18:54:03 -08:00
Andrew
cfa1236d78 Update README.md 2017-12-11 21:21:56 -08:00
Andrew
64288f9f84 Update gail-result.md 2017-12-11 21:19:47 -08:00
Andrew
5f647d4d34 Update README.md 2017-12-11 21:18:05 -08:00
Andrew
6723455b75 Update gail-result.md 2017-12-11 21:15:30 -08:00
Andrew
45a93cf2b9 add training curve from tensorboard 2017-12-11 21:06:04 -08:00
andrew
11604f7cc9 add download link to readme and add description to python file 2017-12-07 12:08:20 -08:00
John Schulman
2444034d11 Merge pull request #194 from ryanjulian/env_lines
Force shebang lines to Python 3
2017-12-04 14:07:01 -08:00
John Schulman
041b6b76b7 Merge pull request #215 from chris-chris/feature/typo-2017-11-19
fix misspellings
2017-12-04 14:02:49 -08:00
John Schulman
5d62b5bdaa Merge pull request #221 from jvmancuso/patch-1
Docstring fix
2017-12-04 14:01:38 -08:00
John Schulman
2fcc9b9572 Merge pull request #226 from definitelyuncertain/master
Call ppo2 and not ppo1 in ppo2 README.md
2017-12-04 14:01:12 -08:00
Andrew
000033973b Update gail-result.md 2017-12-03 15:50:24 -08:00
andrew
6090ee8292 add comparison for expert/BC/gail 2017-12-03 15:46:52 -08:00
andrew
7954327c5f add behavior cloning learn/eval code 2017-12-03 13:55:44 -08:00
andrew
8495890534 add gail, file_writer for tf.summary, and allow specifying var_list for tf.train.Saver 2017-12-03 01:49:42 -08:00
definitelyuncertain
643184935e Call ppo2 and not ppo1 2017-12-02 22:00:28 +05:30
jvmancuso
36e074da56 Update replay_buffer.py 2017-11-27 14:45:50 -05:00
Ubuntu
c33640932f fix misspellings 2017-11-19 01:29:30 +00:00
John Schulman
b05be68c55 add missing files, fix Issue #209 2017-11-16 22:14:30 -08:00
John Schulman
2dd7d307d7 Add ACER, PPO2, and results_plotter.py 2017-11-16 10:02:32 -08:00
Ryan Julian
df889caf11 Force shebang lines to Python 3
This is a Python 3-only library. A shebang with `#!/usr/bin/env python`
will launch python2 on many systems which do not have python3
installed. Setting the shebang to `#!/usr/bin/env python3` will show a
useful error on systems without Python 3.
2017-11-05 15:22:16 -08:00
John Schulman
6a3cbb4bc5 switch append mode to write mode 2017-10-25 22:20:30 -04:00
Abhinav Bhatia
3d1e171b3a Bug fix in saving a2c model. 2017-09-12 02:35:43 +08:00
201 changed files with 34237 additions and 3912 deletions

.benchmark_pattern (new file, 1 line)

@@ -0,0 +1 @@

.gitignore (3 lines changed)

@@ -1,6 +1,8 @@
*.swp
*.pyc
*.pkl
*.py~
.pytest_cache
.DS_Store
.idea
@@ -32,4 +34,3 @@ src
.cache
MUJOCO_LOG.TXT

.travis.yml (new file, 14 lines)

@@ -0,0 +1,14 @@
language: python
python:
- "3.6"
services:
- docker
install:
- pip install flake8
- docker build . -t baselines-test
script:
- flake8 . --show-source --statistics
- docker run baselines-test pytest -v --forked .

Dockerfile (new file, 18 lines)

@@ -0,0 +1,18 @@
FROM python:3.6
RUN apt-get -y update && apt-get -y install ffmpeg
# RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake python-opencv
ENV CODE_DIR /root/code
COPY . $CODE_DIR/baselines
WORKDIR $CODE_DIR/baselines
# Clean up pycache and pyc files
RUN rm -rf __pycache__ && \
find . -name "*.pyc" -delete && \
pip install tensorflow && \
pip install -e .[test]
CMD /bin/bash

README.md (132 lines changed)

@@ -1,4 +1,6 @@
<img src="data/logo.jpg" width=25% align="right" />
**Status:** Active (under active development, breaking changes may occur)
<img src="data/logo.jpg" width=25% align="right" /> [![Build status](https://travis-ci.org/openai/baselines.svg?branch=master)](https://travis-ci.org/openai/baselines)
# Baselines
@@ -6,28 +8,142 @@ OpenAI Baselines is a set of high-quality implementations of reinforcement learn
These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.
You can install it by typing:
## Prerequisites
Baselines requires python3 (>=3.5) with the development headers. You'll also need system packages CMake, OpenMPI and zlib. Those can be installed as follows
### Ubuntu
```bash
git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
```
### Mac OS X
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
```bash
brew install cmake openmpi
```
## Virtual environment
To keep packages from different projects from interfering with each other, it is a good idea to use virtual environments (virtualenvs). You can install virtualenv (which is itself a pip package) via
```bash
pip install virtualenv
```
Virtualenvs are essentially folders that contain copies of the python executable and all python packages.
To create a virtualenv called venv with python3, run
```bash
virtualenv /path/to/venv --python=python3
```
To activate a virtualenv:
```
. /path/to/venv/bin/activate
```
A more thorough tutorial on virtualenvs and their options can be found [here](https://virtualenv.pypa.io/en/stable/)
## Installation
- Clone the repo and cd into it:
```bash
git clone https://github.com/openai/baselines.git
cd baselines
```
- If you don't have TensorFlow installed already, install your favourite flavor of TensorFlow. In most cases,
```bash
pip install tensorflow-gpu # if you have a CUDA-compatible gpu and proper drivers
```
or
```bash
pip install tensorflow
```
should be sufficient. Refer to [TensorFlow installation guide](https://www.tensorflow.org/install/)
for more details.
- Install baselines package
```bash
pip install -e .
```
### MuJoCo
Some of the baselines examples use the [MuJoCo](http://www.mujoco.org) (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (a temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py)
## Testing the installation
All unit tests in baselines can be run using pytest runner:
```
pip install pytest
pytest
```
## Training models
Most of the algorithms in the baselines repo are used as follows:
```bash
python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
```
### Example 1. PPO with MuJoCo Humanoid
For instance, to train a fully-connected network controlling MuJoCo humanoid using PPO2 for 20M timesteps
```bash
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
```
Note that for MuJoCo environments the fully-connected network is the default, so we can omit `--network=mlp`
The hyperparameters for both network and the learning algorithm can be controlled via the command line, for instance:
```bash
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
```
will set the entropy coefficient to 0.1, construct a fully connected network with 3 layers of 32 hidden units each, and create a separate network for value function estimation (its parameters are not shared with the policy network, but the structure is the same)
See the docstrings in [common/models.py](baselines/common/models.py) for a description of the network parameters for each type of model, and
the docstring of [baselines/ppo2/ppo2.py/learn()](baselines/ppo2/ppo2.py#L152) for a description of the ppo2 hyperparameters.
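The same training run can also be launched from python instead of the command line. A minimal sketch of the equivalent call (keyword names here are assumptions; the `ppo2.learn` docstring in your checkout is authoritative):
```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

# wrap a single Humanoid-v2 copy in a vectorized env, as the algorithms expect
env = DummyVecEnv([lambda: gym.make('Humanoid-v2')])

# num_hidden, num_layers and value_network are forwarded to the policy builder,
# mirroring the command-line flags above
model = ppo2.learn(network='mlp', env=env, total_timesteps=int(2e7),
                   ent_coef=0.1, num_hidden=32, num_layers=3,
                   value_network='copy')
```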
### Example 2. DQN on Atari
DQN with Atari is at this point a classic benchmark. To run the baselines implementation of DQN on Atari Pong:
```
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
```
## Saving, loading and visualizing models
The algorithms serialization API is not properly unified yet; however, there is a simple method to save / restore trained models.
The `--save_path` and `--load_path` command-line options save the tensorflow state to a given path after training and load it from a given path before training, respectively.
Let's imagine you'd like to train ppo2 on Atari Pong, save the model and then later visualize what it has learnt.
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
```
This should get the mean reward per episode to about 20. To load and visualize the model, we'll do the following: load the model, train it for 0 steps, and then visualize:
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
```
*NOTE:* At the moment MuJoCo training uses the VecNormalize wrapper for the environment, which is not saved correctly, so loading models trained on MuJoCo will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd with TfRunningMeanStd in [baselines/common/vec_env/vec_normalize.py](baselines/common/vec_env/vec_normalize.py#L12). This way, the mean and std of the environment-normalizing wrapper will be stored in tensorflow variables and included in the model file; however, training is slower that way, hence it is not enabled by default.
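To make that substitution concrete, a minimal sketch of the tensorflow-backed statistics the note refers to (constructor and attribute names are assumptions; check `baselines/common/running_mean_std.py` in your checkout):
```python
import numpy as np
from baselines.common.running_mean_std import RunningMeanStd, TfRunningMeanStd

# numpy-backed statistics (what VecNormalize uses by default): they live
# outside the tf graph, so saving the model does not capture them
ob_rms = RunningMeanStd(shape=(4,))
ob_rms.update(np.random.randn(16, 4))

# tf-backed statistics: stored in tf variables, hence serialized with the model
# (shape/scope arguments assumed from the current signature)
tf_ob_rms = TfRunningMeanStd(shape=(4,), scope='ob_rms')
tf_ob_rms.update(np.random.randn(16, 4))
print(ob_rms.mean, tf_ob_rms.mean)
```
Swapping the former for the latter inside `VecNormalize` is the substitution the note describes.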
## Loading and visualizing learning curves and other training metrics
See [here](docs/viz/viz.ipynb) for instructions on how to load and display the training data.
## Subpackages
- [A2C](baselines/a2c)
- [ACER](baselines/acer)
- [ACKTR](baselines/acktr)
- [DDPG](baselines/ddpg)
- [DQN](baselines/deepq)
- [PPO](baselines/ppo1)
- [GAIL](baselines/gail)
- [HER](baselines/her)
- [PPO1](baselines/ppo1) (obsolete version, left here temporarily)
- [PPO2](baselines/ppo2)
- [TRPO](baselines/trpo_mpi)
## Benchmarks
Results of benchmarks on Mujoco (1M timesteps) and Atari (10M timesteps) are available
[here for Mujoco](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm)
and
[here for Atari](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_atari10M.htm)
respectively. Note that these results may not be from the latest version of the code; the particular commit hash with which the results were obtained is specified on the benchmarks page.
To cite this repository in publications:
@misc{baselines,
author = {Hesse, Christopher and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai and Zhokhov, Peter},
title = {OpenAI Baselines},
year = {2017},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/openai/baselines}},
}

baselines/a2c/README.md

@@ -2,4 +2,12 @@
- Original paper: https://arxiv.org/abs/1602.01783
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.a2c.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options
- also refer to the repo-wide [README.md](../../README.md#training-models)
## Files
- `run_atari`: file used to run the algorithm.
- `policies.py`: contains the different versions of the A2C architecture (MlpPolicy, CNNPolicy, LstmPolicy...).
- `a2c.py`: - Model : class used to initialize the step_model (sampling) and train_model (training)
- learn : Main entrypoint for the A2C algorithm. Trains a policy with a given network architecture on a given environment using A2C (see the usage sketch below).
- `runner.py`: class used to generate a batch of experiences
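A minimal python-API sketch of the `learn` entrypoint (the `make_atari_env` helper is taken from `baselines.common.cmd_util`; argument names are assumptions, not part of this README):
```python
from baselines.a2c import a2c
from baselines.common.cmd_util import make_atari_env
from baselines.common.vec_env.vec_frame_stack import VecFrameStack

# four parallel copies of Pong with the usual 4-frame stacking
env = VecFrameStack(make_atari_env('PongNoFrameskip-v4', num_env=4, seed=0), 4)
model = a2c.learn(network='cnn', env=env, seed=0, total_timesteps=int(1e6))

obs = env.reset()
actions, values, states, _ = model.step(obs)
env.close()
```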

baselines/a2c/a2c.py

@@ -1,64 +1,99 @@
import os.path as osp
import gym
import time
import joblib
import logging
import numpy as np
import functools
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.atari_wrappers import wrap_deepmind
from baselines.common import tf_util
from baselines.common.policies import build_policy
from baselines.a2c.utils import discount_with_dones
from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
from baselines.a2c.policies import CnnPolicy
from baselines.a2c.utils import cat_entropy, mse
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.runner import Runner
from tensorflow import losses
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, nstack, num_procs,
"""
We use this class to :
__init__:
- Creates the step_model
- Creates the train_model
train():
- Make the training part (feedforward and backpropagation of gradients)
save/load():
- Save load the model
"""
def __init__(self, policy, env, nsteps,
ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5, lr=7e-4,
alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=num_procs,
inter_op_parallelism_threads=num_procs)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
nact = ac_space.n
sess = tf_util.get_session()
nenvs = env.num_envs
nbatch = nenvs*nsteps
A = tf.placeholder(tf.int32, [nbatch])
with tf.variable_scope('a2c_model', reuse=tf.AUTO_REUSE):
# step_model is used for sampling
step_model = policy(nenvs, 1, sess)
# train_model is used to train our network
train_model = policy(nbatch, nsteps, sess)
A = tf.placeholder(train_model.action.dtype, train_model.action.shape)
ADV = tf.placeholder(tf.float32, [nbatch])
R = tf.placeholder(tf.float32, [nbatch])
LR = tf.placeholder(tf.float32, [])
step_model = policy(sess, ob_space, ac_space, nenvs, 1, nstack, reuse=False)
train_model = policy(sess, ob_space, ac_space, nenvs, nsteps, nstack, reuse=True)
# Calculate the loss
# Total loss = Policy gradient loss - entropy * entropy coefficient + Value coefficient * value loss
neglogpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
# Policy loss
neglogpac = train_model.pd.neglogp(A)
# L = A(s,a) * -logpi(a|s)
pg_loss = tf.reduce_mean(ADV * neglogpac)
vf_loss = tf.reduce_mean(mse(tf.squeeze(train_model.vf), R))
entropy = tf.reduce_mean(cat_entropy(train_model.pi))
# Entropy is used to improve exploration by limiting the premature convergence to suboptimal policy.
entropy = tf.reduce_mean(train_model.pd.entropy())
# Value loss
vf_loss = losses.mean_squared_error(tf.squeeze(train_model.vf), R)
loss = pg_loss - entropy*ent_coef + vf_loss * vf_coef
params = find_trainable_variables("model")
# Update parameters using loss
# 1. Get the model parameters
params = find_trainable_variables("a2c_model")
# 2. Calculate the gradients
grads = tf.gradients(loss, params)
if max_grad_norm is not None:
# Clip the gradients (normalize)
grads, grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
grads = list(zip(grads, params))
# zip aggregate each gradient with parameters associated
# For instance zip(ABCD, xyza) => Ax, By, Cz, Da
# 3. Make op for one policy and value update step of A2C
trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=alpha, epsilon=epsilon)
_train = trainer.apply_gradients(grads)
lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
def train(obs, states, rewards, masks, actions, values):
# Here we calculate advantage A(s,a) = R + yV(s') - V(s)
# rewards = R + yV(s')
advs = rewards - values
for step in range(len(obs)):
cur_lr = lr.value()
td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, LR:cur_lr}
if states != []:
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks
policy_loss, value_loss, policy_entropy, _ = sess.run(
@@ -67,17 +102,6 @@ class Model(object):
)
return policy_loss, value_loss, policy_entropy
def save(save_path):
ps = sess.run(params)
make_path(save_path)
joblib.dump(ps, save_path)
def load(load_path):
loaded_params = joblib.load(load_path)
restores = []
for p, loaded_p in zip(params, loaded_params):
restores.append(p.assign(loaded_p))
ps = sess.run(restores)
self.train = train
self.train_model = train_model
@@ -85,95 +109,111 @@ class Model(object):
self.step = step_model.step
self.value = step_model.value
self.initial_state = step_model.initial_state
self.save = save
self.load = load
self.save = functools.partial(tf_util.save_variables, sess=sess)
self.load = functools.partial(tf_util.load_variables, sess=sess)
tf.global_variables_initializer().run(session=sess)
class Runner(object):
def __init__(self, env, model, nsteps=5, nstack=4, gamma=0.99):
self.env = env
self.model = model
nh, nw, nc = env.observation_space.shape
nenv = env.num_envs
self.batch_ob_shape = (nenv*nsteps, nh, nw, nc*nstack)
self.obs = np.zeros((nenv, nh, nw, nc*nstack), dtype=np.uint8)
self.nc = nc
obs = env.reset()
self.update_obs(obs)
self.gamma = gamma
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
def learn(
network,
env,
seed=None,
nsteps=5,
total_timesteps=int(80e6),
vf_coef=0.5,
ent_coef=0.01,
max_grad_norm=0.5,
lr=7e-4,
lrschedule='linear',
epsilon=1e-5,
alpha=0.99,
gamma=0.99,
log_interval=100,
load_path=None,
**network_kwargs):
'''
Main entrypoint for A2C algorithm. Train a policy with given network architecture on a given environment using a2c algorithm.
Parameters:
-----------
network: policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
env: RL environment. Should implement interface similar to VecEnv (baselines.common/vec_env) or be wrapped with DummyVecEnv (baselines.common/vec_env/dummy_vec_env.py)
seed: seed to make the random number sequence in the algorithm reproducible. Defaults to None, which means the seed is taken from the system noise generator (not reproducible)
nsteps: int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
nenv is number of environment copies simulated in parallel)
total_timesteps: int, total number of timesteps to train on (default: 80M)
vf_coef: float, coefficient in front of value function loss in the total loss function (default: 0.5)
ent_coef: float, coefficient in front of the policy entropy in the total loss function (default: 0.01)
max_grad_norm: float, gradient is clipped to have global L2 norm no more than this value (default: 0.5)
lr: float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)
lrschedule: schedule of learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes fraction of the training progress as input and
returns fraction of the learning rate (specified as lr) as output
epsilon: float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
alpha: float, RMSProp decay parameter (default: 0.99)
gamma: float, reward discounting parameter (default: 0.99)
log_interval: int, specifies how frequently the logs are printed out (default: 100)
**network_kwargs: keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
For instance, 'mlp' network architecture has arguments num_hidden and num_layers.
'''
def update_obs(self, obs):
# Do frame-stacking here instead of the FrameStack wrapper to reduce
# IPC overhead
self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
self.obs[:, :, :, -self.nc:] = obs
def run(self):
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
mb_states = self.states
for n in range(self.nsteps):
actions, values, states = self.model.step(self.obs, self.states, self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_values.append(values)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
self.states = states
self.dones = dones
for n, done in enumerate(dones):
if done:
self.obs[n] = self.obs[n]*0
self.update_obs(obs)
mb_rewards.append(rewards)
mb_dones.append(self.dones)
#batch of steps to batch of rollouts
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0).reshape(self.batch_ob_shape)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones[:, :-1]
mb_dones = mb_dones[:, 1:]
last_values = self.model.value(self.obs, self.states, self.dones).tolist()
#discount/bootstrap off value fn
for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
rewards = rewards.tolist()
dones = dones.tolist()
if dones[-1] == 0:
rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
else:
rewards = discount_with_dones(rewards, dones, self.gamma)
mb_rewards[n] = rewards
mb_rewards = mb_rewards.flatten()
mb_actions = mb_actions.flatten()
mb_values = mb_values.flatten()
mb_masks = mb_masks.flatten()
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
def learn(policy, env, seed, nsteps=5, nstack=4, total_timesteps=int(80e6), vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5, lr=7e-4, lrschedule='linear', epsilon=1e-5, alpha=0.99, gamma=0.99, log_interval=100):
tf.reset_default_graph()
set_global_seeds(seed)
# Get the nb of env
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
num_procs = len(env.remotes) # HACK
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, nstack=nstack, num_procs=num_procs, ent_coef=ent_coef, vf_coef=vf_coef,
max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule)
runner = Runner(env, model, nsteps=nsteps, nstack=nstack, gamma=gamma)
policy = build_policy(env, network, **network_kwargs)
# Instantiate the model object (that creates step_model and train_model)
model = Model(policy=policy, env=env, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule)
if load_path is not None:
model.load(load_path)
# Instantiate the runner object
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
# Calculate the batch_size
nbatch = nenvs*nsteps
# Start total timer
tstart = time.time()
for update in range(1, total_timesteps//nbatch+1):
# Get mini batch of experiences
obs, states, rewards, masks, actions, values = runner.run()
policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
nseconds = time.time()-tstart
# Calculate the fps (frame per second)
fps = int((update*nbatch)/nseconds)
if update % log_interval == 0 or update == 1:
# Calculates if value function is a good predictor of the returns (ev > 1)
# or if it's just worse than predicting nothing (ev <= 0)
ev = explained_variance(values, rewards)
logger.record_tabular("nupdates", update)
logger.record_tabular("total_timesteps", update*nbatch)
@@ -182,7 +222,5 @@ def learn(policy, env, seed, nsteps=5, nstack=4, total_timesteps=int(80e6), vf_c
logger.record_tabular("value_loss", float(value_loss))
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()
env.close()
return model
if __name__ == '__main__':
main()

baselines/a2c/policies.py (deleted)

@@ -1,120 +0,0 @@
import numpy as np
import tensorflow as tf
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm, sample
class LnLstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, nlstm=256, reuse=False):
nbatch = nenv*nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc*nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = conv(tf.cast(X, tf.float32)/255., 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2))
h2 = conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2))
h3 = conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2))
h3 = conv_to_fc(h3)
h4 = fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2))
xs = batch_to_seq(h4, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi = fc(h5, 'pi', nact, act=lambda x:x)
vf = fc(h5, 'v', 1, act=lambda x:x)
v0 = vf[:, 0]
a0 = sample(pi)
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
def step(ob, state, mask):
a, v, s = sess.run([a0, v0, snew], {X:ob, S:state, M:mask})
return a, v, s
def value(ob, state, mask):
return sess.run(v0, {X:ob, S:state, M:mask})
self.X = X
self.M = M
self.S = S
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class LstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, nlstm=256, reuse=False):
nbatch = nenv*nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc*nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = conv(tf.cast(X, tf.float32)/255., 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2))
h2 = conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2))
h3 = conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2))
h3 = conv_to_fc(h3)
h4 = fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2))
xs = batch_to_seq(h4, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi = fc(h5, 'pi', nact, act=lambda x:x)
vf = fc(h5, 'v', 1, act=lambda x:x)
v0 = vf[:, 0]
a0 = sample(pi)
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
def step(ob, state, mask):
a, v, s = sess.run([a0, v0, snew], {X:ob, S:state, M:mask})
return a, v, s
def value(ob, state, mask):
return sess.run(v0, {X:ob, S:state, M:mask})
self.X = X
self.M = M
self.S = S
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class CnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False):
nbatch = nenv*nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc*nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
with tf.variable_scope("model", reuse=reuse):
h = conv(tf.cast(X, tf.float32)/255., 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2))
h2 = conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2))
h3 = conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2))
h3 = conv_to_fc(h3)
h4 = fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2))
pi = fc(h4, 'pi', nact, act=lambda x:x)
vf = fc(h4, 'v', 1, act=lambda x:x)
v0 = vf[:, 0]
a0 = sample(pi)
self.initial_state = [] #not stateful
def step(ob, *_args, **_kwargs):
a, v = sess.run([a0, v0], {X:ob})
return a, v, [] #dummy state
def value(ob, *_args, **_kwargs):
return sess.run(v0, {X:ob})
self.X = X
self.pi = pi
self.vf = vf
self.step = step
self.value = value

baselines/a2c/run_atari.py (deleted)

@@ -1,45 +0,0 @@
#!/usr/bin/env python
import os, logging, gym
from baselines import logger
from baselines.common import set_global_seeds
from baselines import bench
from baselines.a2c.a2c import learn
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.a2c.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
def train(env_id, num_timesteps, seed, policy, lrschedule, num_cpu):
def make_env(rank):
def _thunk():
env = make_atari(env_id)
env.seed(seed + rank)
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
gym.logger.setLevel(logging.WARN)
return wrap_deepmind(env)
return _thunk
set_global_seeds(seed)
env = SubprocVecEnv([make_env(i) for i in range(num_cpu)])
if policy == 'cnn':
policy_fn = CnnPolicy
elif policy == 'lstm':
policy_fn = LstmPolicy
elif policy == 'lnlstm':
policy_fn = LnLstmPolicy
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
env.close()
def main():
import argparse
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
args = parser.parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
policy=args.policy, lrschedule=args.lrschedule, num_cpu=16)
if __name__ == '__main__':
main()

baselines/a2c/runner.py (new file, 72 lines)

@@ -0,0 +1,72 @@
import numpy as np
from baselines.a2c.utils import discount_with_dones
from baselines.common.runners import AbstractEnvRunner
class Runner(AbstractEnvRunner):
"""
We use this class to generate batches of experiences
__init__:
- Initialize the runner
run():
- Make a mini batch of experiences
"""
def __init__(self, env, model, nsteps=5, gamma=0.99):
super().__init__(env=env, model=model, nsteps=nsteps)
self.gamma = gamma
self.batch_action_shape = [x if x is not None else -1 for x in model.train_model.action.shape.as_list()]
self.ob_dtype = model.train_model.X.dtype.as_numpy_dtype
def run(self):
# We initialize the lists that will contain the mb of experiences
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
mb_states = self.states
for n in range(self.nsteps):
# Given observations, take action and value (V(s))
# We already have self.obs because the Runner superclass runs self.obs[:] = env.reset() on init
actions, values, states, _ = self.model.step(self.obs, S=self.states, M=self.dones)
# Append the experiences
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_values.append(values)
mb_dones.append(self.dones)
# Take actions in env and look at the results
obs, rewards, dones, _ = self.env.step(actions)
self.states = states
self.dones = dones
self.obs = obs
mb_rewards.append(rewards)
mb_dones.append(self.dones)
# Batch of steps to batch of rollouts
mb_obs = np.asarray(mb_obs, dtype=self.ob_dtype).swapaxes(1, 0).reshape(self.batch_ob_shape)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=self.model.train_model.action.dtype.name).swapaxes(1, 0)
mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones[:, :-1]
mb_dones = mb_dones[:, 1:]
if self.gamma > 0.0:
# Discount/bootstrap off value fn
last_values = self.model.value(self.obs, S=self.states, M=self.dones).tolist()
for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
rewards = rewards.tolist()
dones = dones.tolist()
if dones[-1] == 0:
rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
else:
rewards = discount_with_dones(rewards, dones, self.gamma)
mb_rewards[n] = rewards
mb_actions = mb_actions.reshape(self.batch_action_shape)
mb_rewards = mb_rewards.flatten()
mb_values = mb_values.flatten()
mb_masks = mb_masks.flatten()
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
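
The bootstrapping branch above relies on `discount_with_dones`; a tiny worked example of what it computes:
```python
from baselines.a2c.utils import discount_with_dones

gamma = 0.99

# episode terminates at the last step: plain discounted returns, no bootstrap
print(discount_with_dones([1.0, 1.0, 1.0], [0, 0, 1], gamma))
# -> approximately [2.9701, 1.99, 1.0]

# rollout cut off mid-episode: Runner.run appends the critic's estimate
# V(s_T) = 0.5 plus a fake done=0, discounts, and drops the extra entry again
print(discount_with_dones([1.0, 1.0, 1.0, 0.5], [0, 0, 0, 0], gamma)[:-1])
# -> approximately [3.4552, 2.4800, 1.495]
```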

baselines/a2c/utils.py

@@ -1,8 +1,6 @@
import os
import gym
import numpy as np
import tensorflow as tf
from gym import spaces
from collections import deque
def sample(logits):
@@ -10,18 +8,15 @@ def sample(logits):
return tf.argmax(logits - tf.log(-tf.log(noise)), 1)
def cat_entropy(logits):
a0 = logits - tf.reduce_max(logits, 1, keep_dims=True)
a0 = logits - tf.reduce_max(logits, 1, keepdims=True)
ea0 = tf.exp(a0)
z0 = tf.reduce_sum(ea0, 1, keep_dims=True)
z0 = tf.reduce_sum(ea0, 1, keepdims=True)
p0 = ea0 / z0
return tf.reduce_sum(p0 * (tf.log(z0) - a0), 1)
def cat_entropy_softmax(p0):
return - tf.reduce_sum(p0 * tf.log(p0 + 1e-6), axis = 1)
def mse(pred, target):
return tf.square(pred-target)/2.
def ortho_init(scale=1.0):
def _ortho_init(shape, dtype, partition_info=None):
#lasagne ortho init for tf
@@ -39,23 +34,33 @@ def ortho_init(scale=1.0):
return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
return _ortho_init
def conv(x, scope, nf, rf, stride, pad='VALID', act=tf.nn.relu, init_scale=1.0):
def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC', one_dim_bias=False):
if data_format == 'NHWC':
channel_ax = 3
strides = [1, stride, stride, 1]
bshape = [1, 1, 1, nf]
elif data_format == 'NCHW':
channel_ax = 1
strides = [1, 1, stride, stride]
bshape = [1, nf, 1, 1]
else:
raise NotImplementedError
bias_var_shape = [nf] if one_dim_bias else [1, nf, 1, 1]
nin = x.get_shape()[channel_ax].value
wshape = [rf, rf, nin, nf]
with tf.variable_scope(scope):
nin = x.get_shape()[3].value
w = tf.get_variable("w", [rf, rf, nin, nf], initializer=ortho_init(init_scale))
b = tf.get_variable("b", [nf], initializer=tf.constant_initializer(0.0))
z = tf.nn.conv2d(x, w, strides=[1, stride, stride, 1], padding=pad)+b
h = act(z)
return h
w = tf.get_variable("w", wshape, initializer=ortho_init(init_scale))
b = tf.get_variable("b", bias_var_shape, initializer=tf.constant_initializer(0.0))
if not one_dim_bias and data_format == 'NHWC':
b = tf.reshape(b, bshape)
return tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format) + b
def fc(x, scope, nh, act=tf.nn.relu, init_scale=1.0):
def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
with tf.variable_scope(scope):
nin = x.get_shape()[1].value
w = tf.get_variable("w", [nin, nh], initializer=ortho_init(init_scale))
b = tf.get_variable("b", [nh], initializer=tf.constant_initializer(0.0))
z = tf.matmul(x, w)+b
h = act(z)
return h
b = tf.get_variable("b", [nh], initializer=tf.constant_initializer(init_bias))
return tf.matmul(x, w)+b
def batch_to_seq(h, nbatch, nsteps, flat=False):
if flat:
@@ -75,7 +80,6 @@ def seq_to_batch(h, flat = False):
def lstm(xs, ms, s, scope, nh, init_scale=1.0):
nbatch, nin = [v.value for v in xs[0].get_shape()]
nsteps = len(xs)
with tf.variable_scope(scope):
wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
@@ -105,7 +109,6 @@ def _ln(x, g, b, e=1e-5, axes=[1]):
def lnlstm(xs, ms, s, scope, nh, init_scale=1.0):
nbatch, nin = [v.value for v in xs[0].get_shape()]
nsteps = len(xs)
with tf.variable_scope(scope):
wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
gx = tf.get_variable("gx", [nh*4], initializer=tf.constant_initializer(1.0))
@@ -150,8 +153,7 @@ def discount_with_dones(rewards, dones, gamma):
return discounted[::-1]
def find_trainable_variables(key):
with tf.variable_scope(key):
return tf.trainable_variables()
return tf.trainable_variables(key)
def make_path(f):
return os.makedirs(f, exist_ok=True)
@@ -162,9 +164,34 @@ def constant(p):
def linear(p):
return 1-p
def middle_drop(p):
eps = 0.75
if 1-p<eps:
return eps*0.1
return 1-p
def double_linear_con(p):
p *= 2
eps = 0.125
if 1-p<eps:
return eps
return 1-p
def double_middle_drop(p):
eps1 = 0.75
eps2 = 0.25
if 1-p<eps1:
if 1-p<eps2:
return eps2*0.5
return eps1*0.1
return 1-p
schedules = {
'linear':linear,
'constant':constant
'constant':constant,
'double_linear_con': double_linear_con,
'middle_drop': middle_drop,
'double_middle_drop': double_middle_drop
}
class Scheduler(object):
@@ -238,7 +265,7 @@ def check_shape(ts,shapes):
def avg_norm(t):
return tf.reduce_mean(tf.sqrt(tf.reduce_sum(tf.square(t), axis=-1)))
def myadd(g1, g2, param):
def gradient_add(g1, g2, param):
print([g1, g2, param.name])
assert (not (g1 is None and g2 is None)), param.name
if g1 is None:
@@ -248,7 +275,7 @@ def myadd(g1, g2, param):
else:
return g1 + g2
def my_explained_variance(qpred, q):
def q_explained_variance(qpred, q):
_, vary = tf.nn.moments(q, axes=[0, 1])
_, varpred = tf.nn.moments(q - qpred, axes=[0, 1])
check_shape([vary, varpred], [[]] * 2)
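
The schedules added above ('middle_drop', 'double_linear_con', 'double_middle_drop') map training progress p in [0, 1] to a multiplier of the base learning rate and are consumed through the Scheduler class; a minimal sketch:
```python
from baselines.a2c.utils import Scheduler

# 'middle_drop': multiplier is (1 - p) while p <= 0.25, then a flat 0.75 * 0.1
lr = Scheduler(v=7e-4, nvalues=1000, schedule='middle_drop')
for _ in range(3):
    print(lr.value())  # each call advances the schedule by one of the nvalues steps
```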

baselines/acer/README.md (new file, 6 lines)

@@ -0,0 +1,6 @@
# ACER
- Original paper: https://arxiv.org/abs/1611.01224
- `python -m baselines.run --alg=acer --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)

baselines/acer/acer.py (new file, 377 lines)

@@ -0,0 +1,377 @@
import time
import functools
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.a2c.utils import batch_to_seq, seq_to_batch
from baselines.a2c.utils import cat_entropy_softmax
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.utils import EpisodeStats
from baselines.a2c.utils import get_by_index, check_shape, avg_norm, gradient_add, q_explained_variance
from baselines.acer.buffer import Buffer
from baselines.acer.runner import Runner
# remove last step
def strip(var, nenvs, nsteps, flat = False):
vars = batch_to_seq(var, nenvs, nsteps + 1, flat)
return seq_to_batch(vars[:-1], flat)
def q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma):
"""
Calculates q_retrace targets
:param R: Rewards
:param D: Dones
:param q_i: Q values for actions taken
:param v: V values
:param rho_i: Importance weight for each action
:return: Q_retrace values
"""
rho_bar = batch_to_seq(tf.minimum(1.0, rho_i), nenvs, nsteps, True) # list of len steps, shape [nenvs]
rs = batch_to_seq(R, nenvs, nsteps, True) # list of len steps, shape [nenvs]
ds = batch_to_seq(D, nenvs, nsteps, True) # list of len steps, shape [nenvs]
q_is = batch_to_seq(q_i, nenvs, nsteps, True)
vs = batch_to_seq(v, nenvs, nsteps + 1, True)
v_final = vs[-1]
qret = v_final
qrets = []
for i in range(nsteps - 1, -1, -1):
check_shape([qret, ds[i], rs[i], rho_bar[i], q_is[i], vs[i]], [[nenvs]] * 6)
qret = rs[i] + gamma * qret * (1.0 - ds[i])
qrets.append(qret)
qret = (rho_bar[i] * (qret - q_is[i])) + vs[i]
qrets = qrets[::-1]
qret = seq_to_batch(qrets, flat=True)
return qret
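
For intuition, the same recursion can be written in a few lines of numpy (single-environment sketch, for illustration only):
```python
import numpy as np

def q_retrace_np(rewards, dones, q_i, values, rho_i, gamma):
    # values carries one extra entry, V(s_T), used to bootstrap the final step
    qret = values[-1]
    qrets = np.empty(len(rewards))
    for i in reversed(range(len(rewards))):
        qret = rewards[i] + gamma * qret * (1.0 - dones[i])
        qrets[i] = qret
        # shrink towards V(s_i) using the truncated importance weight
        qret = min(1.0, rho_i[i]) * (qret - q_i[i]) + values[i]
    return qrets

# e.g. q_retrace_np([1.0, 1.0], [0.0, 1.0], [0.8, 0.9], [0.7, 0.6, 0.5], [1.2, 0.4], 0.99)
```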
# For ACER with PPO clipping instead of trust region
# def clip(ratio, eps_clip):
# # assume 0 <= eps_clip <= 1
# return tf.minimum(1 + eps_clip, tf.maximum(1 - eps_clip, ratio))
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, ent_coef, q_coef, gamma, max_grad_norm, lr,
rprop_alpha, rprop_epsilon, total_timesteps, lrschedule,
c, trust_region, alpha, delta):
sess = get_session()
nact = ac_space.n
nbatch = nenvs * nsteps
A = tf.placeholder(tf.int32, [nbatch]) # actions
D = tf.placeholder(tf.float32, [nbatch]) # dones
R = tf.placeholder(tf.float32, [nbatch]) # rewards, not returns
MU = tf.placeholder(tf.float32, [nbatch, nact]) # mu's
LR = tf.placeholder(tf.float32, [])
eps = 1e-6
step_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs,) + ob_space.shape)
train_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs*(nsteps+1),) + ob_space.shape)
with tf.variable_scope('acer_model', reuse=tf.AUTO_REUSE):
step_model = policy(nbatch=nenvs, nsteps=1, observ_placeholder=step_ob_placeholder, sess=sess)
train_model = policy(nbatch=nbatch, nsteps=nsteps, observ_placeholder=train_ob_placeholder, sess=sess)
params = find_trainable_variables("acer_model")
print("Params {}".format(len(params)))
for var in params:
print(var)
# create polyak averaged model
ema = tf.train.ExponentialMovingAverage(alpha)
ema_apply_op = ema.apply(params)
def custom_getter(getter, *args, **kwargs):
v = ema.average(getter(*args, **kwargs))
print(v.name)
return v
with tf.variable_scope("acer_model", custom_getter=custom_getter, reuse=True):
polyak_model = policy(nbatch=nbatch, nsteps=nsteps, observ_placeholder=train_ob_placeholder, sess=sess)
# Notation: (var) = batch variable, (var)s = sequence variable, (var)_i = variable indexed by action at step i
# action probability distributions according to train_model, polyak_model and step_model
# policy.pi holds the probability distribution parameters; to obtain a distribution that sums to 1, take the softmax
train_model_p = tf.nn.softmax(train_model.pi)
polyak_model_p = tf.nn.softmax(polyak_model.pi)
step_model_p = tf.nn.softmax(step_model.pi)
v = tf.reduce_sum(train_model_p * train_model.q, axis = -1) # shape is [nenvs * (nsteps + 1)]
# strip off last step
f, f_pol, q = map(lambda var: strip(var, nenvs, nsteps), [train_model_p, polyak_model_p, train_model.q])
# Get pi and q values for actions taken
f_i = get_by_index(f, A)
q_i = get_by_index(q, A)
# Compute ratios for importance truncation
rho = f / (MU + eps)
rho_i = get_by_index(rho, A)
# Calculate Q_retrace targets
qret = q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma)
# Calculate losses
# Entropy
# entropy = tf.reduce_mean(strip(train_model.pd.entropy(), nenvs, nsteps))
entropy = tf.reduce_mean(cat_entropy_softmax(f))
# Policy Gradient loss, with truncated importance sampling & bias correction
v = strip(v, nenvs, nsteps, True)
check_shape([qret, v, rho_i, f_i], [[nenvs * nsteps]] * 4)
check_shape([rho, f, q], [[nenvs * nsteps, nact]] * 3)
# Truncated importance sampling
adv = qret - v
logf = tf.log(f_i + eps)
gain_f = logf * tf.stop_gradient(adv * tf.minimum(c, rho_i)) # [nenvs * nsteps]
loss_f = -tf.reduce_mean(gain_f)
# Bias correction for the truncation
adv_bc = (q - tf.reshape(v, [nenvs * nsteps, 1])) # [nenvs * nsteps, nact]
logf_bc = tf.log(f + eps) # / (f_old + eps)
check_shape([adv_bc, logf_bc], [[nenvs * nsteps, nact]]*2)
gain_bc = tf.reduce_sum(logf_bc * tf.stop_gradient(adv_bc * tf.nn.relu(1.0 - (c / (rho + eps))) * f), axis = 1) #IMP: This is sum, as expectation wrt f
loss_bc= -tf.reduce_mean(gain_bc)
loss_policy = loss_f + loss_bc
# Value/Q function loss, and explained variance
check_shape([qret, q_i], [[nenvs * nsteps]]*2)
ev = q_explained_variance(tf.reshape(q_i, [nenvs, nsteps]), tf.reshape(qret, [nenvs, nsteps]))
loss_q = tf.reduce_mean(tf.square(tf.stop_gradient(qret) - q_i)*0.5)
# Net loss
check_shape([loss_policy, loss_q, entropy], [[]] * 3)
loss = loss_policy + q_coef * loss_q - ent_coef * entropy
if trust_region:
g = tf.gradients(- (loss_policy - ent_coef * entropy) * nsteps * nenvs, f) #[nenvs * nsteps, nact]
# k = tf.gradients(KL(f_pol || f), f)
k = - f_pol / (f + eps) #[nenvs * nsteps, nact] # Directly computed gradient of KL divergence wrt f
k_dot_g = tf.reduce_sum(k * g, axis=-1)
adj = tf.maximum(0.0, (tf.reduce_sum(k * g, axis=-1) - delta) / (tf.reduce_sum(tf.square(k), axis=-1) + eps)) #[nenvs * nsteps]
# Calculate stats (before doing adjustment) for logging.
avg_norm_k = avg_norm(k)
avg_norm_g = avg_norm(g)
avg_norm_k_dot_g = tf.reduce_mean(tf.abs(k_dot_g))
avg_norm_adj = tf.reduce_mean(tf.abs(adj))
g = g - tf.reshape(adj, [nenvs * nsteps, 1]) * k
grads_f = -g/(nenvs*nsteps) # These are trust region adjusted gradients wrt f, i.e. statistics of policy pi
grads_policy = tf.gradients(f, params, grads_f)
grads_q = tf.gradients(loss_q * q_coef, params)
grads = [gradient_add(g1, g2, param) for (g1, g2, param) in zip(grads_policy, grads_q, params)]
avg_norm_grads_f = avg_norm(grads_f) * (nsteps * nenvs)
norm_grads_q = tf.global_norm(grads_q)
norm_grads_policy = tf.global_norm(grads_policy)
else:
grads = tf.gradients(loss, params)
if max_grad_norm is not None:
grads, norm_grads = tf.clip_by_global_norm(grads, max_grad_norm)
grads = list(zip(grads, params))
trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=rprop_alpha, epsilon=rprop_epsilon)
_opt_op = trainer.apply_gradients(grads)
# so when you call _train, you first do the gradient step, then you apply ema
with tf.control_dependencies([_opt_op]):
_train = tf.group(ema_apply_op)
lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
# Ops/Summaries to run, and their names for logging
run_ops = [_train, loss, loss_q, entropy, loss_policy, loss_f, loss_bc, ev, norm_grads]
names_ops = ['loss', 'loss_q', 'entropy', 'loss_policy', 'loss_f', 'loss_bc', 'explained_variance',
'norm_grads']
if trust_region:
run_ops = run_ops + [norm_grads_q, norm_grads_policy, avg_norm_grads_f, avg_norm_k, avg_norm_g, avg_norm_k_dot_g,
avg_norm_adj]
names_ops = names_ops + ['norm_grads_q', 'norm_grads_policy', 'avg_norm_grads_f', 'avg_norm_k', 'avg_norm_g',
'avg_norm_k_dot_g', 'avg_norm_adj']
def train(obs, actions, rewards, dones, mus, states, masks, steps):
cur_lr = lr.value_steps(steps)
td_map = {train_model.X: obs, polyak_model.X: obs, A: actions, R: rewards, D: dones, MU: mus, LR: cur_lr}
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks
td_map[polyak_model.S] = states
td_map[polyak_model.M] = masks
return names_ops, sess.run(run_ops, td_map)[1:] # strip off _train
def _step(observation, **kwargs):
return step_model._evaluate([step_model.action, step_model_p, step_model.state], observation, **kwargs)
self.train = train
self.save = functools.partial(save_variables, sess=sess, variables=params)
self.train_model = train_model
self.step_model = step_model
self._step = _step
self.step = self.step_model.step
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
class Acer():
def __init__(self, runner, model, buffer, log_interval):
self.runner = runner
self.model = model
self.buffer = buffer
self.log_interval = log_interval
self.tstart = None
self.episode_stats = EpisodeStats(runner.nsteps, runner.nenv)
self.steps = None
def call(self, on_policy):
runner, model, buffer, steps = self.runner, self.model, self.buffer, self.steps
if on_policy:
enc_obs, obs, actions, rewards, mus, dones, masks = runner.run()
self.episode_stats.feed(rewards, dones)
if buffer is not None:
buffer.put(enc_obs, actions, rewards, mus, dones, masks)
else:
# get obs, actions, rewards, mus, dones from buffer.
obs, actions, rewards, mus, dones, masks = buffer.get()
# reshape stuff correctly
obs = obs.reshape(runner.batch_ob_shape)
actions = actions.reshape([runner.nbatch])
rewards = rewards.reshape([runner.nbatch])
mus = mus.reshape([runner.nbatch, runner.nact])
dones = dones.reshape([runner.nbatch])
masks = masks.reshape([runner.batch_ob_shape[0]])
names_ops, values_ops = model.train(obs, actions, rewards, dones, mus, model.initial_state, masks, steps)
if on_policy and (int(steps/runner.nbatch) % self.log_interval == 0):
logger.record_tabular("total_timesteps", steps)
logger.record_tabular("fps", int(steps/(time.time() - self.tstart)))
# IMP: In EpisodicLife env, during training, we get done=True at each loss of life, not just at the terminal state.
# Thus, this is mean until end of life, not end of episode.
# For true episode rewards, see the monitor files in the log folder.
logger.record_tabular("mean_episode_length", self.episode_stats.mean_length())
logger.record_tabular("mean_episode_reward", self.episode_stats.mean_reward())
for name, val in zip(names_ops, values_ops):
logger.record_tabular(name, float(val))
logger.dump_tabular()
def learn(network, env, seed=None, nsteps=20, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
max_grad_norm=10, lr=7e-4, lrschedule='linear', rprop_epsilon=1e-5, rprop_alpha=0.99, gamma=0.99,
log_interval=100, buffer_size=50000, replay_ratio=4, replay_start=10000, c=10.0,
trust_region=True, alpha=0.99, delta=1, load_path=None, **network_kwargs):
'''
Main entrypoint for ACER (Actor-Critic with Experience Replay) algorithm (https://arxiv.org/pdf/1611.01224.pdf)
Train an agent with given network architecture on a given environment using ACER.
Parameters:
----------
network: policy network architecture. Either a string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for the full list)
specifying a standard network architecture, or a function that takes a tensorflow tensor as input and returns
a tuple (output_tensor, extra_feed), where output_tensor is the last network layer output and extra_feed is None for feed-forward
neural nets or a dictionary describing how to feed state into the network for recurrent neural nets.
See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
env: environment. Needs to be vectorized for parallel environment simulation.
The environments produced by gym.make can be wrapped using baselines.common.vec_env.DummyVecEnv class.
nsteps: int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
nenv is number of environment copies simulated in parallel) (default: 20)
nstack: int, size of the frame stack, i.e. number of frames passed to the step model. Frames are stacked along the channel
(last image) dimension; the stack size is determined by the VecFrameStack wrapper around env rather than by an argument of this function (default: 4)
total_timesteps: int, number of timesteps (i.e. number of actions taken in the environment) (default: 80M)
q_coef: float, value function loss coefficient in the optimization objective (analog of vf_coef for other actor-critic methods)
ent_coef: float, policy entropy coefficient in the optimization objective (default: 0.01)
max_grad_norm: float, gradient norm clipping coefficient. If set to None, no clipping. (default: 10),
lr: float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)
lrschedule: schedule of the learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes the fraction of training progress as input and
returns the fraction of the learning rate (specified as lr) as output
rprop_epsilon: float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
rprop_alpha: float, RMSProp decay parameter (default: 0.99)
gamma: float, reward discounting factor (default: 0.99)
log_interval: int, number of updates between logging events (default: 100)
buffer_size: int, size of the replay buffer (default: 50k)
replay_ratio: int, how many (on average) batches of data to sample from the replay buffer for each batch collected from the environment (default: 4)
replay_start: int, sampling from the replay buffer does not start until the replay buffer has at least that many samples (default: 10k)
c: float, importance weight clipping factor (default: 10)
trust_region: bool, whether or not the algorithm estimates the gradient of the KL divergence between the (Polyak-averaged) old policy and the updated policy and uses it to limit the step size (default: True)
delta: float, max KL divergence between the old policy and updated policy (default: 1)
alpha: float, momentum factor in the Polyak (exponential moving average) averaging of the model parameters (default: 0.99)
load_path: str, path to load the model from (default: None)
**network_kwargs: keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and the arguments to a particular type of network.
For instance, the 'mlp' network architecture has arguments num_hidden and num_layers.
'''
print("Running Acer Simple")
print(locals())
set_global_seeds(seed)
if not isinstance(env, VecFrameStack):
env = VecFrameStack(env, 1)
policy = build_policy(env, network, estimate_q=True, **network_kwargs)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
nstack = env.nstack
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps,
ent_coef=ent_coef, q_coef=q_coef, gamma=gamma,
max_grad_norm=max_grad_norm, lr=lr, rprop_alpha=rprop_alpha, rprop_epsilon=rprop_epsilon,
total_timesteps=total_timesteps, lrschedule=lrschedule, c=c,
trust_region=trust_region, alpha=alpha, delta=delta)
runner = Runner(env=env, model=model, nsteps=nsteps)
if replay_ratio > 0:
buffer = Buffer(env=env, nsteps=nsteps, size=buffer_size)
else:
buffer = None
nbatch = nenvs*nsteps
acer = Acer(runner, model, buffer, log_interval)
acer.tstart = time.time()
for acer.steps in range(0, total_timesteps, nbatch): #nbatch samples, 1 on_policy call and multiple off-policy calls
acer.call(on_policy=True)
if replay_ratio > 0 and buffer.has_atleast(replay_start):
n = np.random.poisson(replay_ratio)
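# a Poisson-distributed number of replay calls gives replay_ratio off-policy updates per
# on-policy batch in expectation, matching the replay scheme of the ACER paper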
for _ in range(n):
acer.call(on_policy=False) # off-policy replay updates: no environment simulation steps here
return model
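# A minimal usage sketch of the refactored entrypoint, assuming the baselines.acer.acer module path
# and the make_atari_env / VecFrameStack helpers from baselines.common; the environment id and
# step counts are illustrative:
from baselines.common.cmd_util import make_atari_env
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.acer.acer import learn
# ACER expects a vectorized, frame-stacked env; 4-frame stacking is the usual Atari setup
# (learn() falls back to wrapping the env in a 1-frame VecFrameStack if it is not stacked already)
venv = VecFrameStack(make_atari_env('PongNoFrameskip-v4', num_env=4, seed=0), nstack=4)
model = learn(network='cnn', env=venv, seed=0, nsteps=20, total_timesteps=int(1e5), lrschedule='constant')
model.save('acer_pong_checkpoint')  # save_variables-based checkpoint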

baselines/acer/buffer.py

@@ -0,0 +1,156 @@
import numpy as np
class Buffer(object):
# gets obs, actions, rewards, mu's, (states, masks), dones
def __init__(self, env, nsteps, size=50000):
self.nenv = env.num_envs
self.nsteps = nsteps
# self.nh, self.nw, self.nc = env.observation_space.shape
self.obs_shape = env.observation_space.shape
self.obs_dtype = env.observation_space.dtype
self.ac_dtype = env.action_space.dtype
self.nc = self.obs_shape[-1]
self.nstack = env.nstack
self.nc //= self.nstack
self.nbatch = self.nenv * self.nsteps
self.size = size // (self.nsteps) # Each loc contains nenv * nsteps frames, thus total buffer is nenv * size frames
# Memory
self.enc_obs = None
self.actions = None
self.rewards = None
self.mus = None
self.dones = None
self.masks = None
# Size indexes
self.next_idx = 0
self.num_in_buffer = 0
def has_atleast(self, frames):
# Frames per env, so total (nenv * frames) Frames needed
# Each buffer loc has nenv * nsteps frames
return self.num_in_buffer >= (frames // self.nsteps)
def can_sample(self):
return self.num_in_buffer > 0
# Generate stacked frames
def decode(self, enc_obs, dones):
# enc_obs has shape [nenvs, nsteps + nstack, nh, nw, nc]
# dones has shape [nenvs, nsteps]
# returns stacked obs of shape [nenv, (nsteps + 1), nh, nw, nstack*nc]
return _stack_obs(enc_obs, dones,
nsteps=self.nsteps)
def put(self, enc_obs, actions, rewards, mus, dones, masks):
# enc_obs [nenv, (nsteps + nstack), nh, nw, nc]
# actions, rewards, dones [nenv, nsteps]
# mus [nenv, nsteps, nact]
if self.enc_obs is None:
self.enc_obs = np.empty([self.size] + list(enc_obs.shape), dtype=self.obs_dtype)
self.actions = np.empty([self.size] + list(actions.shape), dtype=self.ac_dtype)
self.rewards = np.empty([self.size] + list(rewards.shape), dtype=np.float32)
self.mus = np.empty([self.size] + list(mus.shape), dtype=np.float32)
self.dones = np.empty([self.size] + list(dones.shape), dtype=np.bool)
self.masks = np.empty([self.size] + list(masks.shape), dtype=np.bool)
self.enc_obs[self.next_idx] = enc_obs
self.actions[self.next_idx] = actions
self.rewards[self.next_idx] = rewards
self.mus[self.next_idx] = mus
self.dones[self.next_idx] = dones
self.masks[self.next_idx] = masks
self.next_idx = (self.next_idx + 1) % self.size
self.num_in_buffer = min(self.size, self.num_in_buffer + 1)
def take(self, x, idx, envx):
nenv = self.nenv
out = np.empty([nenv] + list(x.shape[2:]), dtype=x.dtype)
for i in range(nenv):
out[i] = x[idx[i], envx[i]]
return out
def get(self):
# returns
# obs [nenv, (nsteps + 1), nh, nw, nstack*nc]
# actions, rewards, dones [nenv, nsteps]
# mus [nenv, nsteps, nact]
nenv = self.nenv
assert self.can_sample()
# Sample exactly one rollout index per env. Sampling indices across envs would increase correlation between samples drawn from the same env.
idx = np.random.randint(0, self.num_in_buffer, nenv)
envx = np.arange(nenv)
take = lambda x: self.take(x, idx, envx)  # gather the sampled rollout for each env
dones = take(self.dones)
enc_obs = take(self.enc_obs)
obs = self.decode(enc_obs, dones)
actions = take(self.actions)
rewards = take(self.rewards)
mus = take(self.mus)
masks = take(self.masks)
return obs, actions, rewards, mus, dones, masks
def _stack_obs_ref(enc_obs, dones, nsteps):
nenv = enc_obs.shape[0]
nstack = enc_obs.shape[1] - nsteps
nh, nw, nc = enc_obs.shape[2:]
obs_dtype = enc_obs.dtype
obs_shape = (nh, nw, nc*nstack)
mask = np.empty([nsteps + nstack - 1, nenv, 1, 1, 1], dtype=np.float32)
obs = np.zeros([nstack, nsteps + nstack, nenv, nh, nw, nc], dtype=obs_dtype)
x = np.reshape(enc_obs, [nenv, nsteps + nstack, nh, nw, nc]).swapaxes(1, 0) # [nsteps + nstack, nenv, nh, nw, nc]
mask[nstack-1:] = np.reshape(1.0 - dones, [nenv, nsteps, 1, 1, 1]).swapaxes(1, 0) # keep
mask[:nstack-1] = 1.0
# y = np.reshape(1 - dones, [nenvs, nsteps, 1, 1, 1])
for i in range(nstack):
obs[-(i + 1), i:] = x
# obs[:,i:,:,:,-(i+1),:] = x
x = x[:-1] * mask
mask = mask[1:]
return np.reshape(obs[:, (nstack-1):].transpose((2, 1, 3, 4, 0, 5)), (nenv, (nsteps + 1)) + obs_shape)
def _stack_obs(enc_obs, dones, nsteps):
nenv = enc_obs.shape[0]
nstack = enc_obs.shape[1] - nsteps
nc = enc_obs.shape[-1]
obs_ = np.zeros((nenv, nsteps + 1) + enc_obs.shape[2:-1] + (enc_obs.shape[-1] * nstack, ), dtype=enc_obs.dtype)
mask = np.ones((nenv, nsteps+1), dtype=enc_obs.dtype)
mask[:, 1:] = 1.0 - dones
mask = mask.reshape(mask.shape + tuple(np.ones(len(enc_obs.shape)-2, dtype=np.uint8)))
for i in range(nstack-1, -1, -1):
obs_[..., i * nc : (i + 1) * nc] = enc_obs[:, i : i + nsteps + 1, :]
if i < nstack-1:
obs_[..., i * nc : (i + 1) * nc] *= mask
mask[:, 1:, ...] *= mask[:, :-1, ...]
return obs_
def test_stack_obs():
nstack = 7
nenv = 1
nsteps = 5
obs_shape = (2, 3, nstack)
enc_obs_shape = (nenv, nsteps + nstack) + obs_shape[:-1] + (1,)
enc_obs = np.random.random(enc_obs_shape)
dones = np.random.randint(low=0, high=2, size=(nenv, nsteps))
stacked_obs_ref = _stack_obs_ref(enc_obs, dones, nsteps=nsteps)
stacked_obs_test = _stack_obs(enc_obs, dones, nsteps=nsteps)
np.testing.assert_allclose(stacked_obs_ref, stacked_obs_test)


@@ -0,0 +1,4 @@
def atari():
return dict(
lrschedule='constant'
)


@@ -0,0 +1,81 @@
import numpy as np
import tensorflow as tf
from baselines.common.policies import nature_cnn
from baselines.a2c.utils import fc, batch_to_seq, seq_to_batch, lstm, sample
class AcerCnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False):
nbatch = nenv * nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc * nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) # obs
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
pi_logits = fc(h, 'pi', nact, init_scale=0.01)
pi = tf.nn.softmax(pi_logits)
q = fc(h, 'q', nact)
a = sample(tf.nn.softmax(pi_logits)) # could change this to use self.pi instead
self.initial_state = [] # not stateful
self.X = X
self.pi = pi # actual policy params now
self.pi_logits = pi_logits
self.q = q
self.vf = q
def step(ob, *args, **kwargs):
# returns actions, mus, states
a0, pi0 = sess.run([a, pi], {X: ob})
return a0, pi0, [] # dummy state
def out(ob, *args, **kwargs):
pi0, q0 = sess.run([pi, q], {X: ob})
return pi0, q0
def act(ob, *args, **kwargs):
return sess.run(a, {X: ob})
self.step = step
self.out = out
self.act = act
class AcerLstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False, nlstm=256):
nbatch = nenv * nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc * nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) # obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
# lstm
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi_logits = fc(h5, 'pi', nact, init_scale=0.01)
pi = tf.nn.softmax(pi_logits)
q = fc(h5, 'q', nact)
a = sample(pi_logits) # could change this to use self.pi instead
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
self.X = X
self.M = M
self.S = S
self.pi = pi # actual policy params now
self.q = q
def step(ob, state, mask, *args, **kwargs):
# returns actions, mus, states
a0, pi0, s = sess.run([a, pi, snew], {X: ob, S: state, M: mask})
return a0, pi0, s
self.step = step

baselines/acer/runner.py

@@ -0,0 +1,61 @@
import numpy as np
from baselines.common.runners import AbstractEnvRunner
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from gym import spaces
class Runner(AbstractEnvRunner):
def __init__(self, env, model, nsteps):
super().__init__(env=env, model=model, nsteps=nsteps)
assert isinstance(env.action_space, spaces.Discrete), 'This ACER implementation works only with discrete action spaces!'
assert isinstance(env, VecFrameStack)
self.nact = env.action_space.n
nenv = self.nenv
self.nbatch = nenv * nsteps
self.batch_ob_shape = (nenv*(nsteps+1),) + env.observation_space.shape
self.obs = env.reset()
self.obs_dtype = env.observation_space.dtype
self.ac_dtype = env.action_space.dtype
self.nstack = self.env.nstack
self.nc = self.batch_ob_shape[-1] // self.nstack
def run(self):
# enc_obs = np.split(self.obs, self.nstack, axis=3) # so now list of obs steps
enc_obs = np.split(self.env.stackedobs, self.env.nstack, axis=-1)
mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
for _ in range(self.nsteps):
actions, mus, states = self.model._step(self.obs, S=self.states, M=self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_mus.append(mus)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
# states information for stateful models like LSTM
self.states = states
self.dones = dones
self.obs = obs
mb_rewards.append(rewards)
enc_obs.append(obs[..., -self.nc:])
mb_obs.append(np.copy(self.obs))
mb_dones.append(self.dones)
enc_obs = np.asarray(enc_obs, dtype=self.obs_dtype).swapaxes(1, 0)
mb_obs = np.asarray(mb_obs, dtype=self.obs_dtype).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=self.ac_dtype).swapaxes(1, 0)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones # Used for stateful models like LSTMs to mask state when done
mb_dones = mb_dones[:, 1:] # Used for calculating returns. The dones array is now aligned with rewards
# shapes are now [nenv, nsteps, []]
# When pulling from buffer, arrays will now be reshaped in place, preventing a deep copy.
return enc_obs, mb_obs, mb_actions, mb_rewards, mb_mus, mb_dones, mb_masks


@@ -2,4 +2,8 @@
- Original paper: https://arxiv.org/abs/1708.05144
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.acktr.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.run --alg=acktr --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)
## ACKTR with continuous action spaces
The code of ACKTR has been refactored to handle both discrete and continuous action spaces uniformly. In the original version, discrete and continuous action spaces were handled by different code (acktr_disc.py and acktr_cont.py) with little overlap. If you are interested in the original version of ACKTR for continuous action spaces, use the `old_acktr_cont` branch. Note that the original code performs better on the MuJoCo tasks than the refactored version; we are still investigating why.
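With the refactored code, a continuous-control MuJoCo run goes through the same repo-wide entrypoint (assuming a working MuJoCo / mujoco-py installation; the environment id is illustrative), e.g.
- `python -m baselines.run --alg=acktr --env=HalfCheetah-v2 --num_timesteps=1e6`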

baselines/acktr/acktr.py

@@ -0,0 +1,152 @@
import os.path as osp
import time
import functools
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables, load_variables
from baselines.a2c.runner import Runner
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.acktr import kfac
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, lrschedule='linear', is_async=True):
self.sess = sess = get_session()
nbatch = nenvs * nsteps
with tf.variable_scope('acktr_model', reuse=tf.AUTO_REUSE):
self.model = step_model = policy(nenvs, 1, sess=sess)
self.model2 = train_model = policy(nenvs*nsteps, nsteps, sess=sess)
A = train_model.pdtype.sample_placeholder([None])
ADV = tf.placeholder(tf.float32, [nbatch])
R = tf.placeholder(tf.float32, [nbatch])
PG_LR = tf.placeholder(tf.float32, [])
VF_LR = tf.placeholder(tf.float32, [])
neglogpac = train_model.pd.neglogp(A)
self.logits = train_model.pi
##training loss
pg_loss = tf.reduce_mean(ADV*neglogpac)
entropy = tf.reduce_mean(train_model.pd.entropy())
pg_loss = pg_loss - ent_coef * entropy
vf_loss = tf.losses.mean_squared_error(tf.squeeze(train_model.vf), R)
train_loss = pg_loss + vf_coef * vf_loss
##Fisher loss construction
self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(neglogpac)
sample_net = train_model.vf + tf.random_normal(tf.shape(train_model.vf))
self.vf_fisher = vf_fisher_loss = - vf_fisher_coef*tf.reduce_mean(tf.pow(train_model.vf - tf.stop_gradient(sample_net), 2))
self.joint_fisher = joint_fisher_loss = pg_fisher_loss + vf_fisher_loss
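# KFAC builds its curvature estimate from Fisher-style losses on the model outputs: the policy term
# is the mean log-likelihood of the batch actions, and the value term treats the value head as a
# unit-variance Gaussian around a noise-perturbed target (sample_net), as in the ACKTR paper.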
self.params=params = find_trainable_variables("acktr_model")
self.grads_check = grads = tf.gradients(train_loss,params)
with tf.device('/gpu:0'):
self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,\
momentum=0.9, kfac_update=1, epsilon=0.01,\
stats_decay=0.99, is_async=is_async, cold_iter=10, max_grad_norm=max_grad_norm)
# update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
self.q_runner = q_runner
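# with is_async=True the covariance statistics are updated asynchronously through the TF queue
# runner returned above; learn() below starts its threads via a tf.train.Coordinator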
self.lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
def train(obs, states, rewards, masks, actions, values):
advs = rewards - values
for step in range(len(obs)):
cur_lr = self.lr.value()
td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, PG_LR:cur_lr, VF_LR:cur_lr}
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks
policy_loss, value_loss, policy_entropy, _ = sess.run(
[pg_loss, vf_loss, entropy, train_op],
td_map
)
return policy_loss, value_loss, policy_entropy
self.train = train
self.save = functools.partial(save_variables, sess=sess)
self.load = functools.partial(load_variables, sess=sess)
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
self.value = step_model.value
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, save_interval=None, lrschedule='linear', load_path=None, is_async=True, **network_kwargs):
set_global_seeds(seed)
if network == 'cnn':
network_kwargs['one_dim_bias'] = True
policy = build_policy(env, network, **network_kwargs)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
make_model = lambda: Model(policy, ob_space, ac_space, nenvs, total_timesteps, nprocs=nprocs,
nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef, vf_fisher_coef=vf_fisher_coef,
lr=lr, max_grad_norm=max_grad_norm, kfac_clip=kfac_clip,
lrschedule=lrschedule, is_async=is_async)
if save_interval and logger.get_dir():
import cloudpickle
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
fh.write(cloudpickle.dumps(make_model))
model = make_model()
if load_path is not None:
model.load(load_path)
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
nbatch = nenvs*nsteps
tstart = time.time()
coord = tf.train.Coordinator()
if is_async:
enqueue_threads = model.q_runner.create_threads(model.sess, coord=coord, start=True)
else:
enqueue_threads = []
for update in range(1, total_timesteps//nbatch+1):
obs, states, rewards, masks, actions, values = runner.run()
policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
model.old_obs = obs
nseconds = time.time()-tstart
fps = int((update*nbatch)/nseconds)
if update % log_interval == 0 or update == 1:
ev = explained_variance(values, rewards)
logger.record_tabular("nupdates", update)
logger.record_tabular("total_timesteps", update*nbatch)
logger.record_tabular("fps", fps)
logger.record_tabular("policy_entropy", float(policy_entropy))
logger.record_tabular("policy_loss", float(policy_loss))
logger.record_tabular("value_loss", float(value_loss))
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
savepath = osp.join(logger.get_dir(), 'checkpoint%.5i'%update)
print('Saving to', savepath)
model.save(savepath)
coord.request_stop()
coord.join(enqueue_threads)
return model


@@ -1,142 +0,0 @@
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines import common
from baselines.common import tf_util as U
from baselines.acktr import kfac
from baselines.acktr.filters import ZFilter
def pathlength(path):
return path["reward"].shape[0]# Loss function that we'll differentiate to get the policy gradient
def rollout(env, policy, max_pathlength, animate=False, obfilter=None):
"""
Simulate the env and policy for max_pathlength steps
"""
ob = env.reset()
prev_ob = np.float32(np.zeros(ob.shape))
if obfilter: ob = obfilter(ob)
terminated = False
obs = []
acs = []
ac_dists = []
logps = []
rewards = []
for _ in range(max_pathlength):
if animate:
env.render()
state = np.concatenate([ob, prev_ob], -1)
obs.append(state)
ac, ac_dist, logp = policy.act(state)
acs.append(ac)
ac_dists.append(ac_dist)
logps.append(logp)
prev_ob = np.copy(ob)
scaled_ac = env.action_space.low + (ac + 1.) * 0.5 * (env.action_space.high - env.action_space.low)
scaled_ac = np.clip(scaled_ac, env.action_space.low, env.action_space.high)
ob, rew, done, _ = env.step(scaled_ac)
if obfilter: ob = obfilter(ob)
rewards.append(rew)
if done:
terminated = True
break
return {"observation" : np.array(obs), "terminated" : terminated,
"reward" : np.array(rewards), "action" : np.array(acs),
"action_dist": np.array(ac_dists), "logp" : np.array(logps)}
def learn(env, policy, vf, gamma, lam, timesteps_per_batch, num_timesteps,
animate=False, callback=None, desired_kl=0.002):
obfilter = ZFilter(env.observation_space.shape)
max_pathlength = env.spec.timestep_limit
stepsize = tf.Variable(initial_value=np.float32(np.array(0.03)), name='stepsize')
inputs, loss, loss_sampled = policy.update_info
optim = kfac.KfacOptimizer(learning_rate=stepsize, cold_lr=stepsize*(1-0.9), momentum=0.9, kfac_update=2,\
epsilon=1e-2, stats_decay=0.99, async=1, cold_iter=1,
weight_decay_dict=policy.wd_dict, max_grad_norm=None)
pi_var_list = []
for var in tf.trainable_variables():
if "pi" in var.name:
pi_var_list.append(var)
update_op, q_runner = optim.minimize(loss, loss_sampled, var_list=pi_var_list)
do_update = U.function(inputs, update_op)
U.initialize()
# start queue runners
enqueue_threads = []
coord = tf.train.Coordinator()
for qr in [q_runner, vf.q_runner]:
assert (qr != None)
enqueue_threads.extend(qr.create_threads(U.get_session(), coord=coord, start=True))
i = 0
timesteps_so_far = 0
while True:
if timesteps_so_far > num_timesteps:
break
logger.log("********** Iteration %i ************"%i)
# Collect paths until we have enough timesteps
timesteps_this_batch = 0
paths = []
while True:
path = rollout(env, policy, max_pathlength, animate=(len(paths)==0 and (i % 10 == 0) and animate), obfilter=obfilter)
paths.append(path)
n = pathlength(path)
timesteps_this_batch += n
timesteps_so_far += n
if timesteps_this_batch > timesteps_per_batch:
break
# Estimate advantage function
vtargs = []
advs = []
for path in paths:
rew_t = path["reward"]
return_t = common.discount(rew_t, gamma)
vtargs.append(return_t)
vpred_t = vf.predict(path)
vpred_t = np.append(vpred_t, 0.0 if path["terminated"] else vpred_t[-1])
delta_t = rew_t + gamma*vpred_t[1:] - vpred_t[:-1]
adv_t = common.discount(delta_t, gamma * lam)
advs.append(adv_t)
# Update value function
vf.fit(paths, vtargs)
# Build arrays for policy update
ob_no = np.concatenate([path["observation"] for path in paths])
action_na = np.concatenate([path["action"] for path in paths])
oldac_dist = np.concatenate([path["action_dist"] for path in paths])
adv_n = np.concatenate(advs)
standardized_adv_n = (adv_n - adv_n.mean()) / (adv_n.std() + 1e-8)
# Policy update
do_update(ob_no, action_na, standardized_adv_n)
min_stepsize = np.float32(1e-8)
max_stepsize = np.float32(1e0)
# Adjust stepsize
kl = policy.compute_kl(ob_no, oldac_dist)
if kl > desired_kl * 2:
logger.log("kl too high")
U.eval(tf.assign(stepsize, tf.maximum(min_stepsize, stepsize / 1.5)))
elif kl < desired_kl / 2:
logger.log("kl too low")
U.eval(tf.assign(stepsize, tf.minimum(max_stepsize, stepsize * 1.5)))
else:
logger.log("kl just right!")
logger.record_tabular("EpRewMean", np.mean([path["reward"].sum() for path in paths]))
logger.record_tabular("EpRewSEM", np.std([path["reward"].sum()/np.sqrt(len(paths)) for path in paths]))
logger.record_tabular("EpLenMean", np.mean([pathlength(path) for path in paths]))
logger.record_tabular("KL", kl)
if callback:
callback()
logger.dump_tabular()
i += 1
coord.request_stop()
coord.join(enqueue_threads)


@@ -1,216 +0,0 @@
import os.path as osp
import time
import joblib
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.acktr.utils import discount_with_dones
from baselines.acktr.utils import Scheduler, find_trainable_variables
from baselines.acktr.utils import cat_entropy, mse
from baselines.acktr import kfac
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, nprocs=32, nsteps=20,
nstack=4, ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, lrschedule='linear'):
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=nprocs,
inter_op_parallelism_threads=nprocs)
config.gpu_options.allow_growth = True
self.sess = sess = tf.Session(config=config)
nact = ac_space.n
nbatch = nenvs * nsteps
A = tf.placeholder(tf.int32, [nbatch])
ADV = tf.placeholder(tf.float32, [nbatch])
R = tf.placeholder(tf.float32, [nbatch])
PG_LR = tf.placeholder(tf.float32, [])
VF_LR = tf.placeholder(tf.float32, [])
self.model = step_model = policy(sess, ob_space, ac_space, nenvs, 1, nstack, reuse=False)
self.model2 = train_model = policy(sess, ob_space, ac_space, nenvs, nsteps, nstack, reuse=True)
logpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
self.logits = logits = train_model.pi
##training loss
pg_loss = tf.reduce_mean(ADV*logpac)
entropy = tf.reduce_mean(cat_entropy(train_model.pi))
pg_loss = pg_loss - ent_coef * entropy
vf_loss = tf.reduce_mean(mse(tf.squeeze(train_model.vf), R))
train_loss = pg_loss + vf_coef * vf_loss
##Fisher loss construction
self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(logpac)
sample_net = train_model.vf + tf.random_normal(tf.shape(train_model.vf))
self.vf_fisher = vf_fisher_loss = - vf_fisher_coef*tf.reduce_mean(tf.pow(train_model.vf - tf.stop_gradient(sample_net), 2))
self.joint_fisher = joint_fisher_loss = pg_fisher_loss + vf_fisher_loss
self.params=params = find_trainable_variables("model")
self.grads_check = grads = tf.gradients(train_loss,params)
with tf.device('/gpu:0'):
self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,\
momentum=0.9, kfac_update=1, epsilon=0.01,\
stats_decay=0.99, async=1, cold_iter=10, max_grad_norm=max_grad_norm)
update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
self.q_runner = q_runner
self.lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
def train(obs, states, rewards, masks, actions, values):
advs = rewards - values
for step in range(len(obs)):
cur_lr = self.lr.value()
td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, PG_LR:cur_lr}
if states != []:
td_map[train_model.S] = states
td_map[train_model.M] = masks
policy_loss, value_loss, policy_entropy, _ = sess.run(
[pg_loss, vf_loss, entropy, train_op],
td_map
)
return policy_loss, value_loss, policy_entropy
def save(save_path):
ps = sess.run(params)
joblib.dump(ps, save_path)
def load(load_path):
loaded_params = joblib.load(load_path)
restores = []
for p, loaded_p in zip(params, loaded_params):
restores.append(p.assign(loaded_p))
sess.run(restores)
self.train = train
self.save = save
self.load = load
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
self.value = step_model.value
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
class Runner(object):
def __init__(self, env, model, nsteps, nstack, gamma):
self.env = env
self.model = model
nh, nw, nc = env.observation_space.shape
nenv = env.num_envs
self.batch_ob_shape = (nenv*nsteps, nh, nw, nc*nstack)
self.obs = np.zeros((nenv, nh, nw, nc*nstack), dtype=np.uint8)
obs = env.reset()
self.update_obs(obs)
self.gamma = gamma
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
def update_obs(self, obs):
self.obs = np.roll(self.obs, shift=-1, axis=3)
self.obs[:, :, :, -1] = obs[:, :, :, 0]
def run(self):
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
mb_states = self.states
for n in range(self.nsteps):
actions, values, states = self.model.step(self.obs, self.states, self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_values.append(values)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
self.states = states
self.dones = dones
for n, done in enumerate(dones):
if done:
self.obs[n] = self.obs[n]*0
self.update_obs(obs)
mb_rewards.append(rewards)
mb_dones.append(self.dones)
#batch of steps to batch of rollouts
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0).reshape(self.batch_ob_shape)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones[:, :-1]
mb_dones = mb_dones[:, 1:]
last_values = self.model.value(self.obs, self.states, self.dones).tolist()
#discount/bootstrap off value fn
for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
rewards = rewards.tolist()
dones = dones.tolist()
if dones[-1] == 0:
rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
else:
rewards = discount_with_dones(rewards, dones, self.gamma)
mb_rewards[n] = rewards
mb_rewards = mb_rewards.flatten()
mb_actions = mb_actions.flatten()
mb_values = mb_values.flatten()
mb_masks = mb_masks.flatten()
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
nstack=4, ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, save_interval=None, lrschedule='linear'):
tf.reset_default_graph()
set_global_seeds(seed)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
make_model = lambda : Model(policy, ob_space, ac_space, nenvs, total_timesteps, nprocs=nprocs, nsteps
=nsteps, nstack=nstack, ent_coef=ent_coef, vf_coef=vf_coef, vf_fisher_coef=
vf_fisher_coef, lr=lr, max_grad_norm=max_grad_norm, kfac_clip=kfac_clip,
lrschedule=lrschedule)
if save_interval and logger.get_dir():
import cloudpickle
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
fh.write(cloudpickle.dumps(make_model))
model = make_model()
runner = Runner(env, model, nsteps=nsteps, nstack=nstack, gamma=gamma)
nbatch = nenvs*nsteps
tstart = time.time()
coord = tf.train.Coordinator()
enqueue_threads = model.q_runner.create_threads(model.sess, coord=coord, start=True)
for update in range(1, total_timesteps//nbatch+1):
obs, states, rewards, masks, actions, values = runner.run()
policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
model.old_obs = obs
nseconds = time.time()-tstart
fps = int((update*nbatch)/nseconds)
if update % log_interval == 0 or update == 1:
ev = explained_variance(values, rewards)
logger.record_tabular("nupdates", update)
logger.record_tabular("total_timesteps", update*nbatch)
logger.record_tabular("fps", fps)
logger.record_tabular("policy_entropy", float(policy_entropy))
logger.record_tabular("policy_loss", float(policy_loss))
logger.record_tabular("value_loss", float(value_loss))
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
savepath = osp.join(logger.get_dir(), 'checkpoint%.5i'%update)
print('Saving to', savepath)
model.save(savepath)
coord.request_stop()
coord.join(enqueue_threads)
env.close()


@@ -0,0 +1,5 @@
def mujoco():
return dict(
nsteps=2500,
value_network='copy'
)


@@ -1,98 +0,0 @@
from baselines.acktr.running_stat import RunningStat
from collections import deque
import numpy as np
class Filter(object):
def __call__(self, x, update=True):
raise NotImplementedError
def reset(self):
pass
class IdentityFilter(Filter):
def __call__(self, x, update=True):
return x
class CompositionFilter(Filter):
def __init__(self, fs):
self.fs = fs
def __call__(self, x, update=True):
for f in self.fs:
x = f(x)
return x
def output_shape(self, input_space):
out = input_space.shape
for f in self.fs:
out = f.output_shape(out)
return out
class ZFilter(Filter):
"""
y = (x-mean)/std
using running estimates of mean,std
"""
def __init__(self, shape, demean=True, destd=True, clip=10.0):
self.demean = demean
self.destd = destd
self.clip = clip
self.rs = RunningStat(shape)
def __call__(self, x, update=True):
if update: self.rs.push(x)
if self.demean:
x = x - self.rs.mean
if self.destd:
x = x / (self.rs.std+1e-8)
if self.clip:
x = np.clip(x, -self.clip, self.clip)
return x
def output_shape(self, input_space):
return input_space.shape
class AddClock(Filter):
def __init__(self):
self.count = 0
def reset(self):
self.count = 0
def __call__(self, x, update=True):
return np.append(x, self.count/100.0)
def output_shape(self, input_space):
return (input_space.shape[0]+1,)
class FlattenFilter(Filter):
def __call__(self, x, update=True):
return x.ravel()
def output_shape(self, input_space):
return (int(np.prod(input_space.shape)),)
class Ind2OneHotFilter(Filter):
def __init__(self, n):
self.n = n
def __call__(self, x, update=True):
out = np.zeros(self.n)
out[x] = 1
return out
def output_shape(self, input_space):
return (input_space.n,)
class DivFilter(Filter):
def __init__(self, divisor):
self.divisor = divisor
def __call__(self, x, update=True):
return x / self.divisor
def output_shape(self, input_space):
return input_space.shape
class StackFilter(Filter):
def __init__(self, length):
self.stack = deque(maxlen=length)
def reset(self):
self.stack.clear()
def __call__(self, x, update=True):
self.stack.append(x)
while len(self.stack) < self.stack.maxlen:
self.stack.append(x)
return np.concatenate(self.stack, axis=-1)
def output_shape(self, input_space):
return input_space.shape[:-1] + (input_space.shape[-1] * self.stack.maxlen,)


@@ -1,6 +1,8 @@
import tensorflow as tf
import numpy as np
import re
# flake8: noqa F403, F405
from baselines.acktr.kfac_utils import *
from functools import reduce
@@ -10,14 +12,14 @@ KFAC_DEBUG = False
class KfacOptimizer():
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, is_async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
self.max_grad_norm = max_grad_norm
self._lr = learning_rate
self._momentum = momentum
self._clip_kl = clip_kl
self._channel_fac = channel_fac
self._kfac_update = kfac_update
self._async = async
self._async = is_async
self._async_stats = async_stats
self._epsilon = epsilon
self._stats_decay = stats_decay
@@ -228,7 +230,7 @@ class KfacOptimizer():
Ow = bpropFactor.get_shape()[2]
if Oh == 1 and Ow == 1 and self._channel_fac:
# factorization along the channels
# assume independence bewteen input channels and spatial
# assume independence between input channels and spatial
# 2K-1 x 2K-1 covariance matrix and C x C covariance matrix
# factorization along the channels do not
# support homogeneous coordinate, assnBias


@@ -1,93 +1,55 @@
import tensorflow as tf
import numpy as np
def gmatmul(a, b, transpose_a=False, transpose_b=False, reduce_dim=None):
if reduce_dim == None:
# general batch matmul
if len(a.get_shape()) == 3 and len(b.get_shape()) == 3:
return tf.batch_matmul(a, b, adj_x=transpose_a, adj_y=transpose_b)
elif len(a.get_shape()) == 3 and len(b.get_shape()) == 2:
if transpose_b:
N = b.get_shape()[0].value
else:
N = b.get_shape()[1].value
B = a.get_shape()[0].value
if transpose_a:
K = a.get_shape()[1].value
a = tf.reshape(tf.transpose(a, [0, 2, 1]), [-1, K])
else:
K = a.get_shape()[-1].value
a = tf.reshape(a, [-1, K])
result = tf.matmul(a, b, transpose_b=transpose_b)
result = tf.reshape(result, [B, -1, N])
return result
elif len(a.get_shape()) == 2 and len(b.get_shape()) == 3:
if transpose_a:
M = a.get_shape()[1].value
else:
M = a.get_shape()[0].value
B = b.get_shape()[0].value
if transpose_b:
K = b.get_shape()[-1].value
b = tf.transpose(tf.reshape(b, [-1, K]), [1, 0])
else:
K = b.get_shape()[1].value
b = tf.transpose(tf.reshape(
tf.transpose(b, [0, 2, 1]), [-1, K]), [1, 0])
result = tf.matmul(a, b, transpose_a=transpose_a)
result = tf.transpose(tf.reshape(result, [M, B, -1]), [1, 0, 2])
return result
else:
return tf.matmul(a, b, transpose_a=transpose_a, transpose_b=transpose_b)
else:
# weird batch matmul
if len(a.get_shape()) == 2 and len(b.get_shape()) > 2:
# reshape reduce_dim to the left most dim in b
b_shape = b.get_shape()
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(reduce_dim)
b_dims.insert(0, reduce_dim)
b = tf.transpose(b, b_dims)
b_t_shape = b.get_shape()
b = tf.reshape(b, [int(b_shape[reduce_dim]), -1])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, b_t_shape)
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(0)
b_dims.insert(reduce_dim, 0)
result = tf.transpose(result, b_dims)
return result
assert reduce_dim is not None
elif len(a.get_shape()) > 2 and len(b.get_shape()) == 2:
# reshape reduce_dim to the right most dim in a
a_shape = a.get_shape()
outter_dim = len(a_shape) - 1
reduce_dim = len(a_shape) - reduce_dim - 1
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(reduce_dim)
a_dims.insert(outter_dim, reduce_dim)
a = tf.transpose(a, a_dims)
a_t_shape = a.get_shape()
a = tf.reshape(a, [-1, int(a_shape[reduce_dim])])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, a_t_shape)
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(outter_dim)
a_dims.insert(reduce_dim, outter_dim)
result = tf.transpose(result, a_dims)
return result
# weird batch matmul
if len(a.get_shape()) == 2 and len(b.get_shape()) > 2:
# reshape reduce_dim to the left most dim in b
b_shape = b.get_shape()
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(reduce_dim)
b_dims.insert(0, reduce_dim)
b = tf.transpose(b, b_dims)
b_t_shape = b.get_shape()
b = tf.reshape(b, [int(b_shape[reduce_dim]), -1])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, b_t_shape)
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(0)
b_dims.insert(reduce_dim, 0)
result = tf.transpose(result, b_dims)
return result
elif len(a.get_shape()) == 2 and len(b.get_shape()) == 2:
return tf.matmul(a, b, transpose_a=transpose_a, transpose_b=transpose_b)
elif len(a.get_shape()) > 2 and len(b.get_shape()) == 2:
# reshape reduce_dim to the right most dim in a
a_shape = a.get_shape()
outter_dim = len(a_shape) - 1
reduce_dim = len(a_shape) - reduce_dim - 1
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(reduce_dim)
a_dims.insert(outter_dim, reduce_dim)
a = tf.transpose(a, a_dims)
a_t_shape = a.get_shape()
a = tf.reshape(a, [-1, int(a_shape[reduce_dim])])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, a_t_shape)
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(outter_dim)
a_dims.insert(reduce_dim, outter_dim)
result = tf.transpose(result, a_dims)
return result
assert False, 'something went wrong'
elif len(a.get_shape()) == 2 and len(b.get_shape()) == 2:
return tf.matmul(a, b, transpose_a=transpose_a, transpose_b=transpose_b)
assert False, 'something went wrong'
def clipoutNeg(vec, threshold=1e-6):


@@ -1,77 +0,0 @@
import numpy as np
import tensorflow as tf
from baselines.acktr.utils import conv, fc, dense, conv_to_fc, sample, kl_div
import baselines.common.tf_util as U
class CnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False):
nbatch = nenv*nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc*nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
with tf.variable_scope("model", reuse=reuse):
h = conv(tf.cast(X, tf.float32)/255., 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2))
h2 = conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2))
h3 = conv(h2, 'c3', nf=32, rf=3, stride=1, init_scale=np.sqrt(2))
h3 = conv_to_fc(h3)
h4 = fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2))
pi = fc(h4, 'pi', nact, act=lambda x:x)
vf = fc(h4, 'v', 1, act=lambda x:x)
v0 = vf[:, 0]
a0 = sample(pi)
self.initial_state = [] #not stateful
def step(ob, *_args, **_kwargs):
a, v = sess.run([a0, v0], {X:ob})
return a, v, [] #dummy state
def value(ob, *_args, **_kwargs):
return sess.run(v0, {X:ob})
self.X = X
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class GaussianMlpPolicy(object):
def __init__(self, ob_dim, ac_dim):
# Here we'll construct a bunch of expressions, which will be used in two places:
# (1) When sampling actions
# (2) When computing loss functions, for the policy update
# Variables specific to (1) have the word "sampled" in them,
# whereas variables specific to (2) have the word "old" in them
ob_no = tf.placeholder(tf.float32, shape=[None, ob_dim*2], name="ob") # batch of observations
oldac_na = tf.placeholder(tf.float32, shape=[None, ac_dim], name="ac") # batch of actions previous actions
oldac_dist = tf.placeholder(tf.float32, shape=[None, ac_dim*2], name="oldac_dist") # batch of actions previous action distributions
adv_n = tf.placeholder(tf.float32, shape=[None], name="adv") # advantage function estimate
wd_dict = {}
h1 = tf.nn.tanh(dense(ob_no, 64, "h1", weight_init=U.normc_initializer(1.0), bias_init=0.0, weight_loss_dict=wd_dict))
h2 = tf.nn.tanh(dense(h1, 64, "h2", weight_init=U.normc_initializer(1.0), bias_init=0.0, weight_loss_dict=wd_dict))
mean_na = dense(h2, ac_dim, "mean", weight_init=U.normc_initializer(0.1), bias_init=0.0, weight_loss_dict=wd_dict) # Mean control output
self.wd_dict = wd_dict
self.logstd_1a = logstd_1a = tf.get_variable("logstd", [ac_dim], tf.float32, tf.zeros_initializer()) # Variance on outputs
logstd_1a = tf.expand_dims(logstd_1a, 0)
std_1a = tf.exp(logstd_1a)
std_na = tf.tile(std_1a, [tf.shape(mean_na)[0], 1])
ac_dist = tf.concat([tf.reshape(mean_na, [-1, ac_dim]), tf.reshape(std_na, [-1, ac_dim])], 1)
sampled_ac_na = tf.random_normal(tf.shape(ac_dist[:,ac_dim:])) * ac_dist[:,ac_dim:] + ac_dist[:,:ac_dim] # This is the sampled action we'll perform.
logprobsampled_n = - U.sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * U.sum(tf.square(ac_dist[:,:ac_dim] - sampled_ac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of sampled action
logprob_n = - U.sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * U.sum(tf.square(ac_dist[:,:ac_dim] - oldac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of previous actions under CURRENT policy (whereas oldlogprob_n is under OLD policy)
kl = U.mean(kl_div(oldac_dist, ac_dist, ac_dim))
#kl = .5 * U.mean(tf.square(logprob_n - oldlogprob_n)) # Approximation of KL divergence between old policy used to generate actions, and new policy used to compute logprob_n
surr = - U.mean(adv_n * logprob_n) # Loss function that we'll differentiate to get the policy gradient
surr_sampled = - U.mean(logprob_n) # Sampled loss of the policy
self._act = U.function([ob_no], [sampled_ac_na, ac_dist, logprobsampled_n]) # Generate a new action and its logprob
#self.compute_kl = U.function([ob_no, oldac_na, oldlogprob_n], kl) # Compute (approximate) KL divergence between old policy and new policy
self.compute_kl = U.function([ob_no, oldac_dist], kl)
self.update_info = ((ob_no, oldac_na, adv_n), surr, surr_sampled) # Input and output variables needed for computing loss
U.initialize() # Initialize uninitialized TF variables
def act(self, ob):
ac, ac_dist, logp = self._act(ob[None])
return ac[0], ac_dist[0], logp[0]


@@ -1,38 +0,0 @@
#!/usr/bin/env python
import os, logging, gym
from baselines import logger
from baselines.common import set_global_seeds
from baselines import bench
from baselines.acktr.acktr_disc import learn
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.acktr.policies import CnnPolicy
def train(env_id, num_timesteps, seed, num_cpu):
def make_env(rank):
def _thunk():
env = make_atari(env_id)
env.seed(seed + rank)
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
gym.logger.setLevel(logging.WARN)
return wrap_deepmind(env)
return _thunk
set_global_seeds(seed)
env = SubprocVecEnv([make_env(i) for i in range(num_cpu)])
policy_fn = CnnPolicy
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
env.close()
def main():
import argparse
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
args = parser.parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed, num_cpu=32)
if __name__ == '__main__':
main()


@@ -1,43 +0,0 @@
#!/usr/bin/env python
import argparse
import logging
import os
import tensorflow as tf
import gym
from baselines import logger
from baselines.common import set_global_seeds
from baselines import bench
from baselines.acktr.acktr_cont import learn
from baselines.acktr.policies import GaussianMlpPolicy
from baselines.acktr.value_functions import NeuralNetValueFunction
def train(env_id, num_timesteps, seed):
env=gym.make(env_id)
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
set_global_seeds(seed)
env.seed(seed)
gym.logger.setLevel(logging.WARN)
with tf.Session(config=tf.ConfigProto()):
ob_dim = env.observation_space.shape[0]
ac_dim = env.action_space.shape[0]
with tf.variable_scope("vf"):
vf = NeuralNetValueFunction(ob_dim, ac_dim)
with tf.variable_scope("pi"):
policy = GaussianMlpPolicy(ob_dim, ac_dim)
learn(env, policy=policy, vf=vf,
gamma=0.99, lam=0.97, timesteps_per_batch=2500,
desired_kl=0.002,
num_timesteps=num_timesteps, animate=False)
env.close()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Run Mujoco benchmark.')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--env', help='environment ID', type=str, default="Reacher-v1")
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
args = parser.parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)


@@ -1,46 +0,0 @@
import numpy as np
# http://www.johndcook.com/blog/standard_deviation/
class RunningStat(object):
def __init__(self, shape):
self._n = 0
self._M = np.zeros(shape)
self._S = np.zeros(shape)
def push(self, x):
x = np.asarray(x)
assert x.shape == self._M.shape
self._n += 1
if self._n == 1:
self._M[...] = x
else:
oldM = self._M.copy()
self._M[...] = oldM + (x - oldM)/self._n
self._S[...] = self._S + (x - oldM)*(x - self._M)
@property
def n(self):
return self._n
@property
def mean(self):
return self._M
@property
def var(self):
return self._S/(self._n - 1) if self._n > 1 else np.square(self._M)
@property
def std(self):
return np.sqrt(self.var)
@property
def shape(self):
return self._M.shape
def test_running_stat():
for shp in ((), (3,), (3,4)):
li = []
rs = RunningStat(shp)
for _ in range(5):
val = np.random.randn(*shp)
rs.push(val)
li.append(val)
m = np.mean(li, axis=0)
assert np.allclose(rs.mean, m)
v = np.square(m) if (len(li) == 1) else np.var(li, ddof=1, axis=0)
assert np.allclose(rs.var, v)


@@ -1,69 +1,8 @@
import os
import numpy as np
import tensorflow as tf
import baselines.common.tf_util as U
from collections import deque
def sample(logits):
noise = tf.random_uniform(tf.shape(logits))
return tf.argmax(logits - tf.log(-tf.log(noise)), 1)
def std(x):
mean = tf.reduce_mean(x)
var = tf.reduce_mean(tf.square(x-mean))
return tf.sqrt(var)
def cat_entropy(logits):
a0 = logits - tf.reduce_max(logits, 1, keep_dims=True)
ea0 = tf.exp(a0)
z0 = tf.reduce_sum(ea0, 1, keep_dims=True)
p0 = ea0 / z0
return tf.reduce_sum(p0 * (tf.log(z0) - a0), 1)
def cat_entropy_softmax(p0):
return - tf.reduce_sum(p0 * tf.log(p0 + 1e-6), axis = 1)
def mse(pred, target):
return tf.square(pred-target)/2.
def ortho_init(scale=1.0):
def _ortho_init(shape, dtype, partition_info=None):
#lasagne ortho init for tf
shape = tuple(shape)
if len(shape) == 2:
flat_shape = shape
elif len(shape) == 4: # assumes NHWC
flat_shape = (np.prod(shape[:-1]), shape[-1])
else:
raise NotImplementedError
a = np.random.normal(0.0, 1.0, flat_shape)
u, _, v = np.linalg.svd(a, full_matrices=False)
q = u if u.shape == flat_shape else v # pick the one with the correct shape
q = q.reshape(shape)
return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
return _ortho_init
def conv(x, scope, nf, rf, stride, pad='VALID', act=tf.nn.relu, init_scale=1.0):
with tf.variable_scope(scope):
nin = x.get_shape()[3].value
w = tf.get_variable("w", [rf, rf, nin, nf], initializer=ortho_init(init_scale))
b = tf.get_variable("b", [nf], initializer=tf.constant_initializer(0.0))
z = tf.nn.conv2d(x, w, strides=[1, stride, stride, 1], padding=pad)+b
h = act(z)
return h
def fc(x, scope, nh, act=tf.nn.relu, init_scale=1.0):
with tf.variable_scope(scope):
nin = x.get_shape()[1].value
w = tf.get_variable("w", [nin, nh], initializer=ortho_init(init_scale))
b = tf.get_variable("b", [nh], initializer=tf.constant_initializer(0.0))
z = tf.matmul(x, w)+b
h = act(z)
return h
def dense(x, size, name, weight_init=None, bias_init=0, weight_loss_dict=None, reuse=None):
with tf.variable_scope(name, reuse=reuse):
assert (len(U.scope_name().split('/')) == 2)
assert (len(tf.get_variable_scope().name.split('/')) == 2)
w = tf.get_variable("w", [x.get_shape()[1], size], initializer=weight_init)
b = tf.get_variable("b", [size], initializer=tf.constant_initializer(bias_init))
@@ -75,15 +14,10 @@ def dense(x, size, name, weight_init=None, bias_init=0, weight_loss_dict=None, r
weight_loss_dict[w] = weight_decay_fc
weight_loss_dict[b] = 0.0
tf.add_to_collection(U.scope_name().split('/')[0] + '_' + 'losses', weight_decay)
tf.add_to_collection(tf.get_variable_scope().name.split('/')[0] + '_' + 'losses', weight_decay)
return tf.nn.bias_add(tf.matmul(x, w), b)
def conv_to_fc(x):
nh = np.prod([v.value for v in x.get_shape()[1:]])
x = tf.reshape(x, [-1, nh])
return x
def kl_div(action_dist1, action_dist2, action_size):
mean1, std1 = action_dist1[:, :action_size], action_dist1[:, action_size:]
mean2, std2 = action_dist2[:, :action_size], action_dist2[:, action_size:]
@@ -92,109 +26,3 @@ def kl_div(action_dist1, action_dist2, action_size):
denominator = 2 * tf.square(std2) + 1e-8
return tf.reduce_sum(
numerator/denominator + tf.log(std2) - tf.log(std1),reduction_indices=-1)
def discount_with_dones(rewards, dones, gamma):
discounted = []
r = 0
for reward, done in zip(rewards[::-1], dones[::-1]):
r = reward + gamma*r*(1.-done) # fixed off by one bug
discounted.append(r)
return discounted[::-1]
def find_trainable_variables(key):
with tf.variable_scope(key):
return tf.trainable_variables()
def make_path(f):
return os.makedirs(f, exist_ok=True)
def constant(p):
return 1
def linear(p):
return 1-p
def middle_drop(p):
eps = 0.75
if 1-p<eps:
return eps*0.1
return 1-p
def double_linear_con(p):
p *= 2
eps = 0.125
if 1-p<eps:
return eps
return 1-p
def double_middle_drop(p):
eps1 = 0.75
eps2 = 0.25
if 1-p<eps1:
if 1-p<eps2:
return eps2*0.5
return eps1*0.1
return 1-p
schedules = {
'linear':linear,
'constant':constant,
'double_linear_con':double_linear_con,
'middle_drop':middle_drop,
'double_middle_drop':double_middle_drop
}
class Scheduler(object):
def __init__(self, v, nvalues, schedule):
self.n = 0.
self.v = v
self.nvalues = nvalues
self.schedule = schedules[schedule]
def value(self):
current_value = self.v*self.schedule(self.n/self.nvalues)
self.n += 1.
return current_value
def value_steps(self, steps):
return self.v*self.schedule(steps/self.nvalues)
class EpisodeStats:
def __init__(self, nsteps, nenvs):
self.episode_rewards = []
for i in range(nenvs):
self.episode_rewards.append([])
self.lenbuffer = deque(maxlen=40) # rolling buffer for episode lengths
self.rewbuffer = deque(maxlen=40) # rolling buffer for episode rewards
self.nsteps = nsteps
self.nenvs = nenvs
def feed(self, rewards, masks):
rewards = np.reshape(rewards, [self.nenvs, self.nsteps])
masks = np.reshape(masks, [self.nenvs, self.nsteps])
for i in range(0, self.nenvs):
for j in range(0, self.nsteps):
self.episode_rewards[i].append(rewards[i][j])
if masks[i][j]:
l = len(self.episode_rewards[i])
s = sum(self.episode_rewards[i])
self.lenbuffer.append(l)
self.rewbuffer.append(s)
self.episode_rewards[i] = []
def mean_length(self):
if self.lenbuffer:
return np.mean(self.lenbuffer)
else:
return 0 # on the first params dump, no episodes are finished
def mean_reward(self):
if self.rewbuffer:
return np.mean(self.rewbuffer)
else:
return 0
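# Illustrative usage with rollout arrays of length nenvs * nsteps:
episode_stats = EpisodeStats(nsteps=5, nenvs=2)
episode_stats.feed(np.ones(10), np.zeros(10))  # no episode has ended yet
assert episode_stats.mean_length() == 0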


@@ -1,50 +0,0 @@
from baselines import logger
import numpy as np
from baselines import common
from baselines.common import tf_util as U
import tensorflow as tf
from baselines.acktr import kfac
from baselines.acktr.utils import dense
class NeuralNetValueFunction(object):
def __init__(self, ob_dim, ac_dim): #pylint: disable=W0613
X = tf.placeholder(tf.float32, shape=[None, ob_dim*2+ac_dim*2+2]) # batch of observations
vtarg_n = tf.placeholder(tf.float32, shape=[None], name='vtarg')
wd_dict = {}
h1 = tf.nn.elu(dense(X, 64, "h1", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict))
h2 = tf.nn.elu(dense(h1, 64, "h2", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict))
vpred_n = dense(h2, 1, "hfinal", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict)[:,0]
sample_vpred_n = vpred_n + tf.random_normal(tf.shape(vpred_n))
wd_loss = tf.get_collection("vf_losses", None)
loss = U.mean(tf.square(vpred_n - vtarg_n)) + tf.add_n(wd_loss)
loss_sampled = U.mean(tf.square(vpred_n - tf.stop_gradient(sample_vpred_n)))
self._predict = U.function([X], vpred_n)
optim = kfac.KfacOptimizer(learning_rate=0.001, cold_lr=0.001*(1-0.9), momentum=0.9, \
clip_kl=0.3, epsilon=0.1, stats_decay=0.95, \
async=1, kfac_update=2, cold_iter=50, \
weight_decay_dict=wd_dict, max_grad_norm=None)
vf_var_list = []
for var in tf.trainable_variables():
if "vf" in var.name:
vf_var_list.append(var)
update_op, self.q_runner = optim.minimize(loss, loss_sampled, var_list=vf_var_list)
self.do_update = U.function([X, vtarg_n], update_op) #pylint: disable=E1101
U.initialize() # Initialize uninitialized TF variables
def _preproc(self, path):
l = pathlength(path)
al = np.arange(l).reshape(-1,1)/10.0
act = path["action_dist"].astype('float32')
X = np.concatenate([path['observation'], act, al, np.ones((l, 1))], axis=1)
return X
def predict(self, path):
return self._predict(self._preproc(path))
def fit(self, paths, targvals):
X = np.concatenate([self._preproc(p) for p in paths])
y = np.concatenate(targvals)
logger.record_tabular("EVBefore", common.explained_variance(self._predict(X), y))
for _ in range(25): self.do_update(X, y)
logger.record_tabular("EVAfter", common.explained_variance(self._predict(X), y))
def pathlength(path):
return path["reward"].shape[0]


@@ -1,3 +1,2 @@
from baselines.bench.benchmarks import *
from baselines.bench.monitor import *
from baselines.bench.simple_bench import simple_bench


@@ -1,15 +1,26 @@
import re
import os.path as osp
import os
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
_atari7 = ['BeamRider', 'Breakout', 'Enduro', 'Pong', 'Qbert', 'Seaquest', 'SpaceInvaders']
_atariexpl7 = ['Freeway', 'Gravitar', 'MontezumaRevenge', 'Pitfall', 'PrivateEye', 'Solaris', 'Venture']
_BENCHMARKS = []
remove_version_re = re.compile(r'-v\d+$')
def register_benchmark(benchmark):
for b in _BENCHMARKS:
if b['name'] == benchmark['name']:
raise ValueError('Benchmark with name %s already registered!' % b['name'])
# automatically add a description if it is not present
if 'tasks' in benchmark:
for t in benchmark['tasks']:
if 'desc' not in t:
t['desc'] = remove_version_re.sub('', t['env_id'])
_BENCHMARKS.append(benchmark)
@@ -42,41 +53,40 @@ _ATARI_SUFFIX = 'NoFrameskip-v4'
register_benchmark({
'name': 'Atari50M',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 50M timesteps',
'tasks': [{'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(50e6)} for _game in _atari7]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(50e6)} for _game in _atari7]
})
register_benchmark({
'name': 'Atari10M',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
'tasks': [{'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari7]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 6, 'num_timesteps': int(10e6)} for _game in _atari7]
})
register_benchmark({
'name': 'Atari1Hr',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 1 hour of walltime',
'tasks': [{'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_seconds': 60 * 60} for _game in _atari7]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_seconds': 60 * 60} for _game in _atari7]
})
register_benchmark({
'name': 'AtariExploration10M',
'description': '7 Atari games emphasizing exploration, with pixel observations, 10M timesteps',
'tasks': [{'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atariexpl7]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atariexpl7]
})
# MuJoCo
_mujocosmall = [
'InvertedDoublePendulum-v1', 'InvertedPendulum-v1',
'HalfCheetah-v1', 'Hopper-v1', 'Walker2d-v1',
'Reacher-v1', 'Swimmer-v1']
'InvertedDoublePendulum-v2', 'InvertedPendulum-v2',
'HalfCheetah-v2', 'Hopper-v2', 'Walker2d-v2',
'Reacher-v2', 'Swimmer-v2']
register_benchmark({
'name': 'Mujoco1M',
'description': 'Some small 2D MuJoCo tasks, run for 1M timesteps',
'tasks': [{'env_id': _envid, 'trials': 3, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
'tasks': [{'env_id': _envid, 'trials': 6, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
})
register_benchmark({
'name': 'MujocoWalkers',
'description': 'MuJoCo forward walkers, run for 8M, humanoid 100M',
@@ -87,6 +97,19 @@ register_benchmark({
]
})
# Bullet
_bulletsmall = [
'InvertedDoublePendulum', 'InvertedPendulum', 'HalfCheetah', 'Reacher', 'Walker2D', 'Hopper', 'Ant'
]
_bulletsmall = [e + 'BulletEnv-v0' for e in _bulletsmall]
register_benchmark({
'name': 'Bullet1M',
'description': '6 mujoco-like tasks from bullet, 1M steps',
'tasks': [{'env_id': e, 'trials': 6, 'num_timesteps': int(1e6)} for e in _bulletsmall]
})
# Roboschool
register_benchmark({
@@ -128,5 +151,15 @@ _atari50 = [ # actually 47
register_benchmark({
'name': 'Atari50_10M',
'description': '47 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
'tasks': [{'env_id': _game + _ATARI_SUFFIX, 'trials': 3, 'num_timesteps': int(10e6)} for _game in _atari50]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari50]
})
# HER DDPG
_fetch_tasks = ['FetchReach-v1', 'FetchPush-v1', 'FetchSlide-v1']
register_benchmark({
'name': 'Fetch1M',
'description': 'Fetch* benchmarks for 1M timesteps',
'tasks': [{'trials': 6, 'env_id': env_id, 'num_timesteps': int(1e6)} for env_id in _fetch_tasks]
})


@@ -7,43 +7,33 @@ from glob import glob
import csv
import os.path as osp
import json
import numpy as np
class Monitor(Wrapper):
EXT = "monitor.csv"
f = None
def __init__(self, env, filename, allow_early_resets=False, reset_keywords=()):
def __init__(self, env, filename, allow_early_resets=False, reset_keywords=(), info_keywords=()):
Wrapper.__init__(self, env=env)
self.tstart = time.time()
if filename is None:
self.f = None
self.logger = None
else:
if not filename.endswith(Monitor.EXT):
if osp.isdir(filename):
filename = osp.join(filename, Monitor.EXT)
else:
filename = filename + "." + Monitor.EXT
self.f = open(filename, "wt")
self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, "gym_version": gym.__version__,
"env_id": env.spec.id if env.spec else 'Unknown'}))
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords)
self.logger.writeheader()
self.results_writer = ResultsWriter(
filename,
header={"t_start": time.time(), 'env_id' : env.spec and env.spec.id},
extra_keys=reset_keywords + info_keywords
)
self.reset_keywords = reset_keywords
self.info_keywords = info_keywords
self.allow_early_resets = allow_early_resets
self.rewards = None
self.needs_reset = True
self.episode_rewards = []
self.episode_lengths = []
self.episode_times = []
self.total_steps = 0
self.current_reset_info = {} # extra info about the current episode, that was passed in during reset()
def _reset(self, **kwargs):
if not self.allow_early_resets and not self.needs_reset:
raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
self.rewards = []
self.needs_reset = False
def reset(self, **kwargs):
self.reset_state()
for k in self.reset_keywords:
v = kwargs.get(k)
if v is None:
@@ -51,25 +41,39 @@ class Monitor(Wrapper):
self.current_reset_info[k] = v
return self.env.reset(**kwargs)
def _step(self, action):
def reset_state(self):
if not self.allow_early_resets and not self.needs_reset:
raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
self.rewards = []
self.needs_reset = False
def step(self, action):
if self.needs_reset:
raise RuntimeError("Tried to step environment that needs reset")
ob, rew, done, info = self.env.step(action)
self.update(ob, rew, done, info)
return (ob, rew, done, info)
def update(self, ob, rew, done, info):
self.rewards.append(rew)
if done:
self.needs_reset = True
eprew = sum(self.rewards)
eplen = len(self.rewards)
epinfo = {"r": round(eprew, 6), "l": eplen, "t": round(time.time() - self.tstart, 6)}
epinfo.update(self.current_reset_info)
if self.logger:
self.logger.writerow(epinfo)
self.f.flush()
for k in self.info_keywords:
epinfo[k] = info[k]
self.episode_rewards.append(eprew)
self.episode_lengths.append(eplen)
info['episode'] = epinfo
self.episode_times.append(time.time() - self.tstart)
epinfo.update(self.current_reset_info)
self.results_writer.write_row(epinfo)
if isinstance(info, dict):
info['episode'] = epinfo
self.total_steps += 1
return (ob, rew, done, info)
def close(self):
if self.f is not None:
@@ -84,15 +88,48 @@ class Monitor(Wrapper):
def get_episode_lengths(self):
return self.episode_lengths
def get_episode_times(self):
return self.episode_times
class LoadMonitorResultsError(Exception):
pass
class ResultsWriter(object):
def __init__(self, filename=None, header='', extra_keys=()):
self.extra_keys = extra_keys
if filename is None:
self.f = None
self.logger = None
else:
if not filename.endswith(Monitor.EXT):
if osp.isdir(filename):
filename = osp.join(filename, Monitor.EXT)
else:
filename = filename + "." + Monitor.EXT
self.f = open(filename, "wt")
if isinstance(header, dict):
header = '# {} \n'.format(json.dumps(header))
self.f.write(header)
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+tuple(extra_keys))
self.logger.writeheader()
self.f.flush()
def write_row(self, epinfo):
if self.logger:
self.logger.writerow(epinfo)
self.f.flush()
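# Illustrative standalone usage of ResultsWriter (the path and extra key are
# made up for this sketch), mirroring how Monitor uses it above:
writer = ResultsWriter('/tmp/example_run', header={'t_start': time.time()}, extra_keys=('stage',))
writer.write_row({'r': 1.5, 'l': 200, 't': 12.3, 'stage': 'warmup'})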
def get_monitor_files(dir):
return glob(osp.join(dir, "*" + Monitor.EXT))
def load_results(dir):
import pandas
monitor_files = glob(osp.join(dir, "*monitor.*")) # get both csv and (old) json files
monitor_files = (
glob(osp.join(dir, "*monitor.json")) +
glob(osp.join(dir, "*monitor.csv"))) # get both csv and (old) json files
if not monitor_files:
raise LoadMonitorResultsError("no monitor files of the form *%s found in %s" % (Monitor.EXT, dir))
dfs = []
@@ -101,6 +138,8 @@ def load_results(dir):
with open(fname, 'rt') as fh:
if fname.endswith('csv'):
firstline = fh.readline()
if not firstline:
continue
assert firstline[0] == '#'
header = json.loads(firstline[1:])
df = pandas.read_csv(fh, index_col=None)
@@ -114,10 +153,37 @@ def load_results(dir):
episode = json.loads(line)
episodes.append(episode)
df = pandas.DataFrame(episodes)
df['t'] += header['t_start']
else:
assert 0, 'unreachable'
df['t'] += header['t_start']
dfs.append(df)
df = pandas.concat(dfs)
df.sort_values('t', inplace=True)
df.reset_index(inplace=True)
df['t'] -= min(header['t_start'] for header in headers)
df.headers = headers # HACK to preserve backwards compatibility
return df
return df
def test_monitor():
env = gym.make("CartPole-v1")
env.seed(0)
mon_file = "/tmp/baselines-test-%s.monitor.csv" % uuid.uuid4()
menv = Monitor(env, mon_file)
menv.reset()
for _ in range(1000):
_, _, done, _ = menv.step(0)
if done:
menv.reset()
f = open(mon_file, 'rt')
firstline = f.readline()
assert firstline.startswith('#')
metadata = json.loads(firstline[1:])
assert metadata['env_id'] == "CartPole-v1"
assert set(metadata.keys()) == {'env_id', 'gym_version', 't_start'}, "Incorrect keys in monitor metadata"
last_logline = pandas.read_csv(f, index_col=None)
assert set(last_logline.keys()) == {'l', 't', 'r'}, "Incorrect keys in monitor logline"
f.close()
os.remove(mon_file)


@@ -1,3 +1,4 @@
# flake8: noqa F403
from baselines.common.console_util import *
from baselines.common.dataset import Dataset
from baselines.common.math_util import *


@@ -1,8 +1,11 @@
import numpy as np
import os
os.environ.setdefault('PATH', '')
from collections import deque
import gym
from gym import spaces
import cv2
cv2.ocl.setUseOpenCL(False)
class NoopResetEnv(gym.Wrapper):
def __init__(self, env, noop_max=30):
@@ -12,14 +15,10 @@ class NoopResetEnv(gym.Wrapper):
gym.Wrapper.__init__(self, env)
self.noop_max = noop_max
self.override_num_noops = None
if isinstance(env.action_space, gym.spaces.MultiBinary):
self.noop_action = np.zeros(self.env.action_space.n, dtype=np.int64)
else:
# used for atari environments
self.noop_action = 0
assert env.unwrapped.get_action_meanings()[0] == 'NOOP'
self.noop_action = 0
assert env.unwrapped.get_action_meanings()[0] == 'NOOP'
def _reset(self, **kwargs):
def reset(self, **kwargs):
""" Do no-op action for a number of steps in [1, noop_max]."""
self.env.reset(**kwargs)
if self.override_num_noops is not None:
@@ -34,6 +33,9 @@ class NoopResetEnv(gym.Wrapper):
obs = self.env.reset(**kwargs)
return obs
def step(self, ac):
return self.env.step(ac)
class FireResetEnv(gym.Wrapper):
def __init__(self, env):
"""Take action on reset for environments that are fixed until firing."""
@@ -41,7 +43,7 @@ class FireResetEnv(gym.Wrapper):
assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
assert len(env.unwrapped.get_action_meanings()) >= 3
def _reset(self, **kwargs):
def reset(self, **kwargs):
self.env.reset(**kwargs)
obs, _, done, _ = self.env.step(1)
if done:
@@ -51,6 +53,9 @@ class FireResetEnv(gym.Wrapper):
self.env.reset(**kwargs)
return obs
def step(self, ac):
return self.env.step(ac)
class EpisodicLifeEnv(gym.Wrapper):
def __init__(self, env):
"""Make end-of-life == end-of-episode, but only reset on true game over.
@@ -60,21 +65,21 @@ class EpisodicLifeEnv(gym.Wrapper):
self.lives = 0
self.was_real_done = True
def _step(self, action):
def step(self, action):
obs, reward, done, info = self.env.step(action)
self.was_real_done = done
# check current lives, make loss of life terminal,
# then update lives to handle bonus lives
lives = self.env.unwrapped.ale.lives()
if lives < self.lives and lives > 0:
# for Qbert somtimes we stay in lives == 0 condtion for a few frames
# so its important to keep lives > 0, so that we only reset once
# for Qbert sometimes we stay in lives == 0 condition for a few frames
# so it's important to keep lives > 0, so that we only reset once
# the environment advertises done.
done = True
self.lives = lives
return obs, reward, done, info
def _reset(self, **kwargs):
def reset(self, **kwargs):
"""Reset only when lives are exhausted.
This way all states are still reachable even though lives are episodic,
and the learner need not know about any of this behind-the-scenes.
@@ -92,10 +97,10 @@ class MaxAndSkipEnv(gym.Wrapper):
"""Return only every `skip`-th frame"""
gym.Wrapper.__init__(self, env)
# most recent raw observations (for max pooling across time steps)
self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype='uint8')
self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
self._skip = skip
def _step(self, action):
def step(self, action):
"""Repeat action, sum reward, and max over last observations."""
total_reward = 0.0
done = None
@@ -112,23 +117,38 @@ class MaxAndSkipEnv(gym.Wrapper):
return max_frame, total_reward, done, info
def reset(self, **kwargs):
return self.env.reset(**kwargs)
class ClipRewardEnv(gym.RewardWrapper):
def _reward(self, reward):
def __init__(self, env):
gym.RewardWrapper.__init__(self, env)
def reward(self, reward):
"""Bin reward to {+1, 0, -1} by its sign."""
return np.sign(reward)
class WarpFrame(gym.ObservationWrapper):
def __init__(self, env):
def __init__(self, env, width=84, height=84, grayscale=True):
"""Warp frames to 84x84 as done in the Nature paper and later work."""
gym.ObservationWrapper.__init__(self, env)
self.width = 84
self.height = 84
self.observation_space = spaces.Box(low=0, high=255, shape=(self.height, self.width, 1))
self.width = width
self.height = height
self.grayscale = grayscale
if self.grayscale:
self.observation_space = spaces.Box(low=0, high=255,
shape=(self.height, self.width, 1), dtype=np.uint8)
else:
self.observation_space = spaces.Box(low=0, high=255,
shape=(self.height, self.width, 3), dtype=np.uint8)
def _observation(self, frame):
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
def observation(self, frame):
if self.grayscale:
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
return frame[:, :, None]
if self.grayscale:
frame = np.expand_dims(frame, -1)
return frame
class FrameStack(gym.Wrapper):
def __init__(self, env, k):
@@ -144,15 +164,15 @@ class FrameStack(gym.Wrapper):
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k))
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[:-1] + (shp[-1] * k,)), dtype=env.observation_space.dtype)
def _reset(self):
def reset(self):
ob = self.env.reset()
for _ in range(self.k):
self.frames.append(ob)
return self._get_ob()
def _step(self, action):
def step(self, action):
ob, reward, done, info = self.env.step(action)
self.frames.append(ob)
return self._get_ob(), reward, done, info
@@ -162,7 +182,11 @@ class FrameStack(gym.Wrapper):
return LazyFrames(list(self.frames))
class ScaledFloatFrame(gym.ObservationWrapper):
def _observation(self, observation):
def __init__(self, env):
gym.ObservationWrapper.__init__(self, env)
self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)
def observation(self, observation):
# careful! This undoes the memory optimization, use
# with smaller replay buffers only.
return np.array(observation).astype(np.float32) / 255.0
@@ -175,17 +199,33 @@ class LazyFrames(object):
This object should only be converted to numpy array before being passed to the model.
You'd not belive how complex the previous solution was."""
You'd not believe how complex the previous solution was."""
self._frames = frames
self._out = None
def _force(self):
if self._out is None:
self._out = np.concatenate(self._frames, axis=-1)
self._frames = None
return self._out
def __array__(self, dtype=None):
out = np.concatenate(self._frames, axis=2)
out = self._force()
if dtype is not None:
out = out.astype(dtype)
return out
def make_atari(env_id):
def __len__(self):
return len(self._force())
def __getitem__(self, i):
return self._force()[i]
def make_atari(env_id, timelimit=True):
# XXX(john): remove timelimit argument after gym is upgraded to allow double wrapping
env = gym.make(env_id)
if not timelimit:
env = env.env
assert 'NoFrameskip' in env.spec.id
env = NoopResetEnv(env, noop_max=30)
env = MaxAndSkipEnv(env, skip=4)
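# Illustrative usage (wrap_deepmind, defined further down in this file,
# composes the DeepMind-style wrappers above on top of the bare Atari env):
env = wrap_deepmind(make_atari('PongNoFrameskip-v4'))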


@@ -1,154 +0,0 @@
import os
import tempfile
import zipfile
from azure.common import AzureMissingResourceHttpError
try:
from azure.storage.blob import BlobService
except ImportError:
from azure.storage.blob import BlockBlobService as BlobService
from shutil import unpack_archive
from threading import Event
# TODOS: use Azure snapshots instead of hacky backups
def fixed_list_blobs(service, *args, **kwargs):
"""By defualt list_containers only returns a subset of results.
This function attempts to fix this.
"""
res = []
next_marker = None
while next_marker is None or len(next_marker) > 0:
kwargs['marker'] = next_marker
gen = service.list_blobs(*args, **kwargs)
for b in gen:
res.append(b.name)
next_marker = gen.next_marker
return res
def make_archive(source_path, dest_path):
if source_path.endswith(os.path.sep):
source_path = source_path.rstrip(os.path.sep)
prefix_path = os.path.dirname(source_path)
with zipfile.ZipFile(dest_path, "w", compression=zipfile.ZIP_STORED) as zf:
if os.path.isdir(source_path):
for dirname, _subdirs, files in os.walk(source_path):
zf.write(dirname, os.path.relpath(dirname, prefix_path))
for filename in files:
filepath = os.path.join(dirname, filename)
zf.write(filepath, os.path.relpath(filepath, prefix_path))
else:
zf.write(source_path, os.path.relpath(source_path, prefix_path))
class Container(object):
services = {}
def __init__(self, account_name, account_key, container_name, maybe_create=False):
self._account_name = account_name
self._container_name = container_name
if account_name not in Container.services:
Container.services[account_name] = BlobService(account_name, account_key)
self._service = Container.services[account_name]
if maybe_create:
self._service.create_container(self._container_name, fail_on_exist=False)
def put(self, source_path, blob_name, callback=None):
"""Upload a file or directory from `source_path` to azure blob `blob_name`.
Upload progress can be traced by an optional callback.
"""
upload_done = Event()
def progress_callback(current, total):
if callback:
callback(current, total)
if current >= total:
upload_done.set()
# Attempt to make backup if an existing version is already available
try:
x_ms_copy_source = "https://{}.blob.core.windows.net/{}/{}".format(
self._account_name,
self._container_name,
blob_name
)
self._service.copy_blob(
container_name=self._container_name,
blob_name=blob_name + ".backup",
x_ms_copy_source=x_ms_copy_source
)
except AzureMissingResourceHttpError:
pass
with tempfile.TemporaryDirectory() as td:
arcpath = os.path.join(td, "archive.zip")
make_archive(source_path, arcpath)
self._service.put_block_blob_from_path(
container_name=self._container_name,
blob_name=blob_name,
file_path=arcpath,
max_connections=4,
progress_callback=progress_callback,
max_retries=10)
upload_done.wait()
def get(self, dest_path, blob_name, callback=None):
"""Download a file or directory to `dest_path` to azure blob `blob_name`.
Warning! If directory is downloaded the `dest_path` is the parent directory.
Upload progress can be traced by an optional callback.
"""
download_done = Event()
def progress_callback(current, total):
if callback:
callback(current, total)
if current >= total:
download_done.set()
with tempfile.TemporaryDirectory() as td:
arcpath = os.path.join(td, "archive.zip")
for backup_blob_name in [blob_name, blob_name + '.backup']:
try:
properties = self._service.get_blob_properties(
blob_name=backup_blob_name,
container_name=self._container_name
)
if hasattr(properties, 'properties'):
# Annoyingly, Azure has changed the API and this now returns a blob
# instead of its properties with the up-to-date azure package.
blob_size = properties.properties.content_length
else:
blob_size = properties['content-length']
if int(blob_size) > 0:
self._service.get_blob_to_path(
container_name=self._container_name,
blob_name=backup_blob_name,
file_path=arcpath,
max_connections=4,
progress_callback=progress_callback)
unpack_archive(arcpath, dest_path)
download_done.wait()
return True
except AzureMissingResourceHttpError:
pass
return False
def list(self, prefix=None):
"""List all blobs in the container."""
return fixed_list_blobs(self._service, self._container_name, prefix=prefix)
def exists(self, blob_name):
"""Returns true if `blob_name` exists in container."""
try:
self._service.get_blob_properties(
blob_name=blob_name,
container_name=self._container_name
)
return True
except AzureMissingResourceHttpError:
return False


@@ -31,4 +31,4 @@ def cg(f_Ax, b, cg_iters=10, callback=None, verbose=False, residual_tol=1e-10):
if callback is not None:
callback(x)
if verbose: print(fmtstr % (i+1, rdotr, np.linalg.norm(x))) # pylint: disable=W0631
return x
return x


@@ -0,0 +1,192 @@
"""
Helpers for scripts like run_atari.py.
"""
import os
try:
from mpi4py import MPI
except ImportError:
MPI = None
import gym
from gym.wrappers import FlattenDictWrapper
from baselines import logger
from baselines.bench import Monitor
from baselines.common import set_global_seeds
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common import retro_wrappers
def make_vec_env(env_id, env_type, num_env, seed,
wrapper_kwargs=None,
start_index=0,
reward_scale=1.0,
flatten_dict_observations=True,
gamestate=None,
initializer=None,
env_kwargs=None,
force_dummy=False):
"""
Create a wrapped, monitored SubprocVecEnv for Atari and MuJoCo.
"""
wrapper_kwargs = wrapper_kwargs or {}
mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
seed = seed + 10000 * mpi_rank if seed is not None else None
logger_dir = logger.get_dir()
def make_thunk(rank, initializer=None):
return lambda: make_env(
env_id=env_id,
env_type=env_type,
mpi_rank=mpi_rank,
subrank=rank,
seed=seed,
reward_scale=reward_scale,
gamestate=gamestate,
flatten_dict_observations=flatten_dict_observations,
wrapper_kwargs=wrapper_kwargs,
logger_dir=logger_dir,
initializer=initializer,
env_kwargs=env_kwargs,
)
set_global_seeds(seed)
if not force_dummy and num_env > 1:
return SubprocVecEnv([make_thunk(i + start_index, initializer=initializer) for i in range(num_env)])
else:
return DummyVecEnv([make_thunk(i + start_index, initializer=None) for i in range(num_env)])
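# Illustrative usage: four monitored, seeded Atari workers in subprocesses
# (a single env or force_dummy=True falls back to DummyVecEnv):
venv = make_vec_env('PongNoFrameskip-v4', 'atari', num_env=4, seed=0)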
def make_env(env_id, env_type, mpi_rank=0, subrank=0, seed=None, reward_scale=1.0, gamestate=None, flatten_dict_observations=True, wrapper_kwargs=None, logger_dir=None, initializer=None, env_kwargs=None):
if initializer is not None:
initializer(mpi_rank=mpi_rank, subrank=subrank)
wrapper_kwargs = wrapper_kwargs or {}
if env_type == 'atari':
env = make_atari(env_id)
elif env_type == 'retro':
import retro
gamestate = gamestate or retro.State.DEFAULT
env = retro_wrappers.make_retro(game=env_id, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE, state=gamestate)
else:
env = gym.make(env_id, **(env_kwargs or {}))
if flatten_dict_observations and isinstance(env.observation_space, gym.spaces.Dict):
keys = env.observation_space.spaces.keys()
env = gym.wrappers.FlattenDictWrapper(env, dict_keys=list(keys))
env.seed(seed + subrank if seed is not None else None)
env = Monitor(env,
logger_dir and os.path.join(logger_dir, str(mpi_rank) + '.' + str(subrank)),
allow_early_resets=True)
if env_type == 'atari':
env = wrap_deepmind(env, **wrapper_kwargs)
elif env_type == 'retro':
env = retro_wrappers.wrap_deepmind_retro(env, **wrapper_kwargs)
if reward_scale != 1:
env = retro_wrappers.RewardScaler(env, reward_scale)
return env
def make_mujoco_env(env_id, seed, reward_scale=1.0):
"""
Create a wrapped, monitored gym.Env for MuJoCo.
"""
rank = MPI.COMM_WORLD.Get_rank()
myseed = seed + 1000 * rank if seed is not None else None
set_global_seeds(myseed)
env = gym.make(env_id)
logger_path = None if logger.get_dir() is None else os.path.join(logger.get_dir(), str(rank))
env = Monitor(env, logger_path, allow_early_resets=True)
env.seed(seed)
if reward_scale != 1.0:
from baselines.common.retro_wrappers import RewardScaler
env = RewardScaler(env, reward_scale)
return env
def make_robotics_env(env_id, seed, rank=0):
"""
Create a wrapped, monitored gym.Env for MuJoCo.
"""
set_global_seeds(seed)
env = gym.make(env_id)
env = FlattenDictWrapper(env, ['observation', 'desired_goal'])
env = Monitor(
env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)),
info_keywords=('is_success',))
env.seed(seed)
return env
def arg_parser():
"""
Create an empty argparse.ArgumentParser.
"""
import argparse
return argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
def atari_arg_parser():
"""
Create an argparse.ArgumentParser for run_atari.py.
"""
print('Obsolete - use common_arg_parser instead')
return common_arg_parser()
def mujoco_arg_parser():
print('Obsolete - use common_arg_parser instead')
return common_arg_parser()
def common_arg_parser():
"""
Create an argparse.ArgumentParser for run_mujoco.py.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
parser.add_argument('--env_type', help='type of environment, used when the environment type cannot be automatically determined', type=str)
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
parser.add_argument('--alg', help='Algorithm', type=str, default='ppo2')
parser.add_argument('--num_timesteps', type=float, default=1e6),
parser.add_argument('--network', help='network type (mlp, cnn, lstm, cnn_lstm, conv_only)', default=None)
parser.add_argument('--gamestate', help='game state to load (so far only used in retro games)', default=None)
parser.add_argument('--num_env', help='Number of environment copies being run in parallel. When not specified, set to number of cpus for Atari, and to 1 for Mujoco', default=None, type=int)
parser.add_argument('--reward_scale', help='Reward scale factor. Default: 1.0', default=1.0, type=float)
parser.add_argument('--save_path', help='Path to save trained model to', default=None, type=str)
parser.add_argument('--save_video_interval', help='Save video every x steps (0 = disabled)', default=0, type=int)
parser.add_argument('--save_video_length', help='Length of recorded video. Default: 200', default=200, type=int)
parser.add_argument('--play', default=False, action='store_true')
parser.add_argument('--extra_import', help='Extra module to import to access external environments', type=str, default=None)
return parser
def robotics_arg_parser():
"""
Create an argparse.ArgumentParser for run_mujoco.py.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', type=str, default='FetchReach-v0')
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
return parser
def parse_unknown_args(args):
"""
Parse arguments not consumed by the arg parser into a dictionary
"""
retval = {}
preceded_by_key = False
for arg in args:
if arg.startswith('--'):
if '=' in arg:
key = arg.split('=')[0][2:]
value = arg.split('=')[1]
retval[key] = value
else:
key = arg[2:]
preceded_by_key = True
elif preceded_by_key:
retval[key] = arg
preceded_by_key = False
return retval
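# Illustrative example covering both accepted forms ('--key=value' and
# '--key value'); note that values are kept as strings:
assert parse_unknown_args(['--lr=3e-4', '--nsteps', '2048']) == {'lr': '3e-4', 'nsteps': '2048'}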


@@ -2,6 +2,8 @@ from __future__ import print_function
from contextlib import contextmanager
import numpy as np
import time
import shlex
import subprocess
# ================================================================
# Misc
@@ -16,7 +18,12 @@ def fmt_item(x, l):
if isinstance(x, np.ndarray):
assert x.ndim==0
x = x.item()
if isinstance(x, float): rep = "%g"%x
if isinstance(x, (float, np.float32, np.float64)):
v = abs(x)
if (v < 1e-4 or v > 1e+4) and v > 0:
rep = "%7.2e" % x
else:
rep = "%7.5f" % x
else: rep = str(x)
return " "*(l - len(rep)) + rep
@@ -32,7 +39,7 @@ color2num = dict(
crimson=38
)
def colorize(string, color, bold=False, highlight=False):
def colorize(string, color='green', bold=False, highlight=False):
attr = []
num = color2num[color]
if highlight: num += 10
@@ -40,6 +47,25 @@ def colorize(string, color, bold=False, highlight=False):
if bold: attr.append('1')
return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string)
def print_cmd(cmd, dry=False):
if isinstance(cmd, str): # for shell=True
pass
else:
cmd = ' '.join(shlex.quote(arg) for arg in cmd)
print(colorize(('CMD: ' if not dry else 'DRY: ') + cmd))
def get_git_commit(cwd=None):
return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'], cwd=cwd).decode('utf8')
def get_git_commit_message(cwd=None):
return subprocess.check_output(['git', 'show', '-s', '--format=%B', 'HEAD'], cwd=cwd).decode('utf8')
def ccap(cmd, dry=False, env=None, **kwargs):
print_cmd(cmd, dry)
if not dry:
subprocess.check_call(cmd, env=env, **kwargs)
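# Illustrative usage: echo a command without running it (dry run):
ccap(['git', 'status'], dry=True)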
MESSAGE_DEPTH = 0


@@ -1,6 +1,7 @@
import tensorflow as tf
import numpy as np
import baselines.common.tf_util as U
from baselines.a2c.utils import fc
from tensorflow.python.ops import math_ops
class Pd(object):
@@ -22,6 +23,13 @@ class Pd(object):
raise NotImplementedError
def logp(self, x):
return - self.neglogp(x)
def get_shape(self):
return self.flatparam().shape
@property
def shape(self):
return self.get_shape()
def __getitem__(self, idx):
return self.__class__(self.flatparam()[idx])
class PdType(object):
"""
@@ -31,6 +39,8 @@ class PdType(object):
raise NotImplementedError
def pdfromflat(self, flat):
return self.pdclass()(flat)
def pdfromlatent(self, latent_vector, init_scale, init_bias):
raise NotImplementedError
def param_shape(self):
raise NotImplementedError
def sample_shape(self):
@@ -43,11 +53,18 @@ class PdType(object):
def sample_placeholder(self, prepend_shape, name=None):
return tf.placeholder(dtype=self.sample_dtype(), shape=prepend_shape+self.sample_shape(), name=name)
def __eq__(self, other):
return (type(self) == type(other)) and (self.__dict__ == other.__dict__)
class CategoricalPdType(PdType):
def __init__(self, ncat):
self.ncat = ncat
def pdclass(self):
return CategoricalPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
pdparam = _matching_fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
return self.pdfromflat(pdparam), pdparam
def param_shape(self):
return [self.ncat]
def sample_shape(self):
@@ -57,14 +74,18 @@ class CategoricalPdType(PdType):
class MultiCategoricalPdType(PdType):
def __init__(self, low, high):
self.low = low
self.high = high
self.ncats = high - low + 1
def __init__(self, nvec):
self.ncats = nvec.astype('int32')
assert (self.ncats > 0).all()
def pdclass(self):
return MultiCategoricalPd
def pdfromflat(self, flat):
return MultiCategoricalPd(self.low, self.high, flat)
return MultiCategoricalPd(self.ncats, flat)
def pdfromlatent(self, latent, init_scale=1.0, init_bias=0.0):
pdparam = _matching_fc(latent, 'pi', self.ncats.sum(), init_scale=init_scale, init_bias=init_bias)
return self.pdfromflat(pdparam), pdparam
def param_shape(self):
return [sum(self.ncats)]
def sample_shape(self):
@@ -77,6 +98,13 @@ class DiagGaussianPdType(PdType):
self.size = size
def pdclass(self):
return DiagGaussianPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
mean = _matching_fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
logstd = tf.get_variable(name='pi/logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
return self.pdfromflat(pdparam), mean
def param_shape(self):
return [2*self.size]
def sample_shape(self):
@@ -95,6 +123,9 @@ class BernoulliPdType(PdType):
return [self.size]
def sample_dtype(self):
return tf.int32
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
pdparam = _matching_fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
return self.pdfromflat(pdparam), pdparam
# WRONG SECOND DERIVATIVES
# class CategoricalPd(Pd):
@@ -125,56 +156,69 @@ class CategoricalPd(Pd):
def flatparam(self):
return self.logits
def mode(self):
return U.argmax(self.logits, axis=-1)
return tf.argmax(self.logits, axis=-1)
@property
def mean(self):
return tf.nn.softmax(self.logits)
def neglogp(self, x):
# return tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
# Note: we can't use sparse_softmax_cross_entropy_with_logits because
# the implementation does not allow second-order derivatives...
one_hot_actions = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
return tf.nn.softmax_cross_entropy_with_logits(
if x.dtype in {tf.uint8, tf.int32, tf.int64}:
# one-hot encoding
x_shape_list = x.shape.as_list()
logits_shape_list = self.logits.get_shape().as_list()[:-1]
for xs, ls in zip(x_shape_list, logits_shape_list):
if xs is not None and ls is not None:
assert xs == ls, 'shape mismatch: {} in x vs {} in logits'.format(xs, ls)
x = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
else:
# already encoded
assert x.shape.as_list() == self.logits.shape.as_list()
return tf.nn.softmax_cross_entropy_with_logits_v2(
logits=self.logits,
labels=one_hot_actions)
labels=x)
def kl(self, other):
a0 = self.logits - U.max(self.logits, axis=-1, keepdims=True)
a1 = other.logits - U.max(other.logits, axis=-1, keepdims=True)
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keepdims=True)
ea0 = tf.exp(a0)
ea1 = tf.exp(a1)
z0 = U.sum(ea0, axis=-1, keepdims=True)
z1 = U.sum(ea1, axis=-1, keepdims=True)
z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
z1 = tf.reduce_sum(ea1, axis=-1, keepdims=True)
p0 = ea0 / z0
return U.sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
return tf.reduce_sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
def entropy(self):
a0 = self.logits - U.max(self.logits, axis=-1, keepdims=True)
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
ea0 = tf.exp(a0)
z0 = U.sum(ea0, axis=-1, keepdims=True)
z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
p0 = ea0 / z0
return U.sum(p0 * (tf.log(z0) - a0), axis=-1)
return tf.reduce_sum(p0 * (tf.log(z0) - a0), axis=-1)
def sample(self):
u = tf.random_uniform(tf.shape(self.logits))
u = tf.random_uniform(tf.shape(self.logits), dtype=self.logits.dtype)
return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
@classmethod
def fromflat(cls, flat):
return cls(flat)
class MultiCategoricalPd(Pd):
def __init__(self, low, high, flat):
def __init__(self, nvec, flat):
self.flat = flat
self.low = tf.constant(low, dtype=tf.int32)
self.categoricals = list(map(CategoricalPd, tf.split(flat, high - low + 1, axis=len(flat.get_shape()) - 1)))
self.categoricals = list(map(CategoricalPd, tf.split(flat, nvec, axis=-1)))
def flatparam(self):
return self.flat
def mode(self):
return self.low + tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
return tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
def neglogp(self, x):
return tf.add_n([p.neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x - self.low, axis=len(x.get_shape()) - 1))])
return tf.add_n([p.neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x, axis=-1))])
def kl(self, other):
return tf.add_n([
p.kl(q) for p, q in zip(self.categoricals, other.categoricals)
])
return tf.add_n([p.kl(q) for p, q in zip(self.categoricals, other.categoricals)])
def entropy(self):
return tf.add_n([p.entropy() for p in self.categoricals])
def sample(self):
return self.low + tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32)
return tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32)
@classmethod
def fromflat(cls, flat):
raise NotImplementedError
@@ -191,34 +235,38 @@ class DiagGaussianPd(Pd):
def mode(self):
return self.mean
def neglogp(self, x):
return 0.5 * U.sum(tf.square((x - self.mean) / self.std), axis=-1) \
return 0.5 * tf.reduce_sum(tf.square((x - self.mean) / self.std), axis=-1) \
+ 0.5 * np.log(2.0 * np.pi) * tf.to_float(tf.shape(x)[-1]) \
+ U.sum(self.logstd, axis=-1)
+ tf.reduce_sum(self.logstd, axis=-1)
def kl(self, other):
assert isinstance(other, DiagGaussianPd)
return U.sum(other.logstd - self.logstd + (tf.square(self.std) + tf.square(self.mean - other.mean)) / (2.0 * tf.square(other.std)) - 0.5, axis=-1)
return tf.reduce_sum(other.logstd - self.logstd + (tf.square(self.std) + tf.square(self.mean - other.mean)) / (2.0 * tf.square(other.std)) - 0.5, axis=-1)
def entropy(self):
return U.sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
return tf.reduce_sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
def sample(self):
return self.mean + self.std * tf.random_normal(tf.shape(self.mean))
@classmethod
def fromflat(cls, flat):
return cls(flat)
class BernoulliPd(Pd):
def __init__(self, logits):
self.logits = logits
self.ps = tf.sigmoid(logits)
def flatparam(self):
return self.logits
@property
def mean(self):
return self.ps
def mode(self):
return tf.round(self.ps)
def neglogp(self, x):
return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=-1)
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=-1)
def kl(self, other):
return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=-1) - U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=-1) - tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
def entropy(self):
return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
def sample(self):
u = tf.random_uniform(tf.shape(self.ps))
return tf.to_float(math_ops.less(u, self.ps))
@@ -234,7 +282,7 @@ def make_pdtype(ac_space):
elif isinstance(ac_space, spaces.Discrete):
return CategoricalPdType(ac_space.n)
elif isinstance(ac_space, spaces.MultiDiscrete):
return MultiCategoricalPdType(ac_space.low, ac_space.high)
return MultiCategoricalPdType(ac_space.nvec)
elif isinstance(ac_space, spaces.MultiBinary):
return BernoulliPdType(ac_space.n)
else:
@@ -259,6 +307,11 @@ def test_probtypes():
categorical = CategoricalPdType(pdparam_categorical.size) #pylint: disable=E1101
validate_probtype(categorical, pdparam_categorical)
nvec = [1,2,3]
pdparam_multicategorical = np.array([-.2, .3, .5, .1, 1, -.1])
multicategorical = MultiCategoricalPdType(nvec) #pylint: disable=E1101
validate_probtype(multicategorical, pdparam_multicategorical)
pdparam_bernoulli = np.array([-.2, .3, .5])
bernoulli = BernoulliPdType(pdparam_bernoulli.size) #pylint: disable=E1101
validate_probtype(bernoulli, pdparam_bernoulli)
@@ -270,10 +323,10 @@ def validate_probtype(probtype, pdparam):
Mval = np.repeat(pdparam[None, :], N, axis=0)
M = probtype.param_placeholder([N])
X = probtype.sample_placeholder([N])
pd = probtype.pdclass()(M)
pd = probtype.pdfromflat(M)
calcloglik = U.function([X, M], pd.logp(X))
calcent = U.function([M], pd.entropy())
Xval = U.eval(pd.sample(), feed_dict={M:Mval})
Xval = tf.get_default_session().run(pd.sample(), feed_dict={M:Mval})
logliks = calcloglik(Xval, Mval)
entval_ll = - logliks.mean() #pylint: disable=E1101
entval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
@@ -282,7 +335,7 @@ def validate_probtype(probtype, pdparam):
# Check to see if kldiv[p,q] = - ent[p] - E_p[log q]
M2 = probtype.param_placeholder([N])
pd2 = probtype.pdclass()(M2)
pd2 = probtype.pdfromflat(M2)
q = pdparam + np.random.randn(pdparam.size) * 0.1
Mval2 = np.repeat(q[None, :], N, axis=0)
calckl = U.function([M, M2], pd.kl(pd2))
@@ -291,3 +344,11 @@ def validate_probtype(probtype, pdparam):
klval_ll = - entval - logliks.mean() #pylint: disable=E1101
klval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
assert np.abs(klval - klval_ll) < 3 * klval_ll_stderr # within 3 sigmas
print('ok on', probtype, pdparam)
def _matching_fc(tensor, name, size, init_scale, init_bias):
if tensor.shape[-1] == size:
return tensor
else:
return fc(tensor, name, size, init_scale=init_scale, init_bias=init_bias)
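# Illustrative usage: build a distribution head from a gym action space and a
# latent tensor (the 128-unit latent is made up for this sketch):
from gym import spaces
latent = tf.placeholder(tf.float32, [None, 128])
pdtype = make_pdtype(spaces.Discrete(6))
pd, pi_logits = pdtype.pdfromlatent(latent)
action = pd.sample()
neglogp_action = pd.neglogp(action)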

baselines/common/input.py

@@ -0,0 +1,64 @@
import numpy as np
import tensorflow as tf
from gym.spaces import Discrete, Box, MultiDiscrete
def observation_placeholder(ob_space, batch_size=None, name='Ob'):
'''
Create placeholder to feed observations into of the size appropriate to the observation space
Parameters:
----------
ob_space: gym.Space observation space
batch_size: int size of the batch to be fed into input. Can be left None in most cases.
name: str name of the placeholder
Returns:
-------
tensorflow placeholder tensor
'''
assert isinstance(ob_space, Discrete) or isinstance(ob_space, Box) or isinstance(ob_space, MultiDiscrete), \
'Can only deal with Discrete, Box, and MultiDiscrete observation spaces for now'
dtype = ob_space.dtype
if dtype == np.int8:
dtype = np.uint8
return tf.placeholder(shape=(batch_size,) + ob_space.shape, dtype=dtype, name=name)
def observation_input(ob_space, batch_size=None, name='Ob'):
'''
Create placeholder to feed observations into of the size appropriate to the observation space, and add input
encoder of the appropriate type.
'''
placeholder = observation_placeholder(ob_space, batch_size, name)
return placeholder, encode_observation(ob_space, placeholder)
def encode_observation(ob_space, placeholder):
'''
Encode input in the way that is appropriate to the observation space
Parameters:
----------
ob_space: gym.Space observation space
placeholder: tf.placeholder observation input placeholder
'''
if isinstance(ob_space, Discrete):
return tf.to_float(tf.one_hot(placeholder, ob_space.n))
elif isinstance(ob_space, Box):
return tf.to_float(placeholder)
elif isinstance(ob_space, MultiDiscrete):
placeholder = tf.cast(placeholder, tf.int32)
one_hots = [tf.to_float(tf.one_hot(placeholder[..., i], ob_space.nvec[i])) for i in range(placeholder.shape[-1])]
return tf.concat(one_hots, axis=-1)
else:
raise NotImplementedError
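# Illustrative usage for an Atari-style image space (shape and dtype made up
# here); X is the uint8 placeholder to feed, processed_x the float32 tensor:
ob_space = Box(low=0, high=255, shape=(84, 84, 4), dtype=np.uint8)
X, processed_x = observation_input(ob_space)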


@@ -82,4 +82,4 @@ def test_discount_with_boundaries():
2 + gamma * 3,
3,
4
])
])


@@ -67,14 +67,20 @@ class EzPickle(object):
def set_global_seeds(i):
try:
import MPI
rank = MPI.COMM_WORLD.Get_rank()
except ImportError:
rank = 0
myseed = i + 1000 * rank if i is not None else None
try:
import tensorflow as tf
tf.set_random_seed(myseed)
except ImportError:
pass
else:
tf.set_random_seed(i)
np.random.seed(i)
random.seed(i)
np.random.seed(myseed)
random.seed(myseed)
def pretty_eta(seconds_left):
@@ -224,6 +230,7 @@ def relatively_safe_pickle_dump(obj, path, compression=False):
# Using gzip here would be simpler, but the size is limited to 2GB
with tempfile.NamedTemporaryFile() as uncompressed_file:
pickle.dump(obj, uncompressed_file)
uncompressed_file.file.flush()
with zipfile.ZipFile(temp_storage, "w", compression=zipfile.ZIP_DEFLATED) as myzip:
myzip.write(uncompressed_file.name, "data")
else:

baselines/common/models.py

@@ -0,0 +1,224 @@
import numpy as np
import tensorflow as tf
from baselines.a2c import utils
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch
from baselines.common.mpi_running_mean_std import RunningMeanStd
import tensorflow.contrib.layers as layers
mapping = {}
def register(name):
def _thunk(func):
mapping[name] = func
return func
return _thunk
def nature_cnn(unscaled_images, **conv_kwargs):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
**conv_kwargs))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
@register("mlp")
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh, layer_norm=False):
"""
Stack of fully-connected layers to be used in a policy / q-function approximator
Parameters:
----------
num_layers: int number of fully-connected layers (default: 2)
num_hidden: int size of fully-connected layers (default: 64)
activation: activation function (default: tf.tanh)
Returns:
-------
function that builds fully connected network with a given input tensor / placeholder
"""
def network_fn(X):
h = tf.layers.flatten(X)
for i in range(num_layers):
h = fc(h, 'mlp_fc{}'.format(i), nh=num_hidden, init_scale=np.sqrt(2))
if layer_norm:
h = tf.contrib.layers.layer_norm(h, center=True, scale=True)
h = activation(h)
return h
return network_fn
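# Illustrative usage: a builder returns a network_fn that the policy builder
# applies to the observation placeholder:
network_fn = mlp(num_layers=3, num_hidden=128)
h_example = network_fn(tf.placeholder(tf.float32, [None, 17]))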
@register("cnn")
def cnn(**conv_kwargs):
def network_fn(X):
return nature_cnn(X, **conv_kwargs)
return network_fn
@register("cnn_small")
def cnn_small(**conv_kwargs):
def network_fn(X):
h = tf.cast(X, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(h, 'c1', nf=8, rf=8, stride=4, init_scale=np.sqrt(2), **conv_kwargs))
h = activ(conv(h, 'c2', nf=16, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h = conv_to_fc(h)
h = activ(fc(h, 'fc1', nh=128, init_scale=np.sqrt(2)))
return h
return network_fn
@register("lstm")
def lstm(nlstm=128, layer_norm=False):
"""
Builds an LSTM (Long Short-Term Memory) network to be used in a policy.
Note that the resulting function returns not only the output of the LSTM
(i.e. hidden state of lstm for each step in the sequence), but also a dictionary
with auxiliary tensors to be set as policy attributes.
Specifically,
S is a placeholder to feed current state (LSTM state has to be managed outside policy)
M is a placeholder for the mask (used to mask out observations after the end of the episode, but can be used for other purposes too)
initial_state is a numpy array containing initial lstm state (usually zeros)
state is the output LSTM state (to be fed into S at the next call)
An example of usage of lstm-based policy can be found here: common/tests/test_doc_examples.py/test_lstm_example
Parameters:
----------
nlstm: int LSTM hidden state size
layer_norm: bool if True, layer-normalized version of LSTM is used
Returns:
-------
function that builds LSTM with a given input tensor / placeholder
"""
def network_fn(X, nenv=1):
nbatch = X.shape[0]
nsteps = nbatch // nenv
h = tf.layers.flatten(X)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
if layer_norm:
h5, snew = utils.lnlstm(xs, ms, S, scope='lnlstm', nh=nlstm)
else:
h5, snew = utils.lstm(xs, ms, S, scope='lstm', nh=nlstm)
h = seq_to_batch(h5)
initial_state = np.zeros(S.shape.as_list(), dtype=float)
return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}
return network_fn
@register("cnn_lstm")
def cnn_lstm(nlstm=128, layer_norm=False, **conv_kwargs):
def network_fn(X, nenv=1):
nbatch = X.shape[0]
nsteps = nbatch // nenv
h = nature_cnn(X, **conv_kwargs)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
if layer_norm:
h5, snew = utils.lnlstm(xs, ms, S, scope='lnlstm', nh=nlstm)
else:
h5, snew = utils.lstm(xs, ms, S, scope='lstm', nh=nlstm)
h = seq_to_batch(h5)
initial_state = np.zeros(S.shape.as_list(), dtype=float)
return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}
return network_fn
@register("cnn_lnlstm")
def cnn_lnlstm(nlstm=128, **conv_kwargs):
return cnn_lstm(nlstm, layer_norm=True, **conv_kwargs)
@register("conv_only")
def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
'''
convolutions-only net
Parameters:
----------
convs: list of triples (filter_number, filter_size, stride) specifying parameters for each layer.
Returns:
function that takes tensorflow tensor as input and returns the output of the last convolutional layer
'''
def network_fn(X):
out = tf.cast(X, tf.float32) / 255.
with tf.variable_scope("convnet"):
for num_outputs, kernel_size, stride in convs:
out = layers.convolution2d(out,
num_outputs=num_outputs,
kernel_size=kernel_size,
stride=stride,
activation_fn=tf.nn.relu,
**conv_kwargs)
return out
return network_fn
def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
rms = RunningMeanStd(shape=x.shape[1:])
norm_x = tf.clip_by_value((x - rms.mean) / rms.std, min(clip_range), max(clip_range))
return norm_x, rms
def get_network_builder(name):
"""
If you want to register your own network outside models.py, you just need:
Usage Example:
-------------
from baselines.common.models import register
@register("your_network_name")
def your_network_define(**net_kwargs):
...
return network_fn
"""
if callable(name):
return name
elif name in mapping:
return mapping[name]
else:
raise ValueError('Unknown network type: {}'.format(name))
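# Illustrative retrieval by name (a callable passes straight through unchanged):
cnn_builder = get_network_builder('cnn')
network_fn = cnn_builder()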


@@ -1,7 +1,11 @@
from mpi4py import MPI
import baselines.common.tf_util as U
import tensorflow as tf
import numpy as np
try:
from mpi4py import MPI
except ImportError:
MPI = None
class MpiAdam(object):
def __init__(self, var_list, *, beta1=0.9, beta2=0.999, epsilon=1e-08, scale_grad_by_procs=True, comm=None):
@@ -16,16 +20,19 @@ class MpiAdam(object):
self.t = 0
self.setfromflat = U.SetFromFlat(var_list)
self.getflat = U.GetFlat(var_list)
self.comm = MPI.COMM_WORLD if comm is None else comm
self.comm = MPI.COMM_WORLD if comm is None and MPI is not None else comm
def update(self, localg, stepsize):
if self.t % 100 == 0:
self.check_synced()
localg = localg.astype('float32')
globalg = np.zeros_like(localg)
self.comm.Allreduce(localg, globalg, op=MPI.SUM)
if self.scale_grad_by_procs:
globalg /= self.comm.Get_size()
if self.comm is not None:
globalg = np.zeros_like(localg)
self.comm.Allreduce(localg, globalg, op=MPI.SUM)
if self.scale_grad_by_procs:
globalg /= self.comm.Get_size()
else:
globalg = np.copy(localg)
self.t += 1
a = stepsize * np.sqrt(1 - self.beta2**self.t)/(1 - self.beta1**self.t)
@@ -35,11 +42,15 @@ class MpiAdam(object):
self.setfromflat(self.getflat() + step)
def sync(self):
if self.comm is None:
return
theta = self.getflat()
self.comm.Bcast(theta, root=0)
self.setfromflat(theta)
def check_synced(self):
if self.comm is None:
return
if self.comm.Get_rank() == 0: # this is root
theta = self.getflat()
self.comm.Bcast(theta, root=0)
@@ -53,7 +64,7 @@ class MpiAdam(object):
def test_MpiAdam():
np.random.seed(0)
tf.set_random_seed(0)
a = tf.Variable(np.random.randn(3).astype('float32'))
b = tf.Variable(np.random.randn(2,5).astype('float32'))
loss = tf.reduce_sum(tf.square(a)) + tf.reduce_sum(tf.sin(b))
@@ -63,17 +74,30 @@ def test_MpiAdam():
do_update = U.function([], loss, updates=[update_op])
tf.get_default_session().run(tf.global_variables_initializer())
losslist_ref = []
for i in range(10):
print(i,do_update())
l = do_update()
print(i, l)
losslist_ref.append(l)
tf.set_random_seed(0)
tf.get_default_session().run(tf.global_variables_initializer())
var_list = [a,b]
lossandgrad = U.function([], [loss, U.flatgrad(loss, var_list)], updates=[update_op])
lossandgrad = U.function([], [loss, U.flatgrad(loss, var_list)])
adam = MpiAdam(var_list)
losslist_test = []
for i in range(10):
l,g = lossandgrad()
adam.update(g, stepsize)
print(i,l)
print(i,l)
losslist_test.append(l)
np.testing.assert_allclose(np.array(losslist_ref), np.array(losslist_test), atol=1e-4)
if __name__ == '__main__':
test_MpiAdam()


@@ -0,0 +1,31 @@
import numpy as np
import tensorflow as tf
from mpi4py import MPI
class MpiAdamOptimizer(tf.train.AdamOptimizer):
"""Adam optimizer that averages gradients across mpi processes."""
def __init__(self, comm, **kwargs):
self.comm = comm
tf.train.AdamOptimizer.__init__(self, **kwargs)
def compute_gradients(self, loss, var_list, **kwargs):
grads_and_vars = tf.train.AdamOptimizer.compute_gradients(self, loss, var_list, **kwargs)
grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None]
flat_grad = tf.concat([tf.reshape(g, (-1,)) for g, v in grads_and_vars], axis=0)
shapes = [v.shape.as_list() for g, v in grads_and_vars]
sizes = [int(np.prod(s)) for s in shapes]
num_tasks = self.comm.Get_size()
buf = np.zeros(sum(sizes), np.float32)
def _collect_grads(flat_grad):
self.comm.Allreduce(flat_grad, buf, op=MPI.SUM)
np.divide(buf, float(num_tasks), out=buf)
return buf
avg_flat_grad = tf.py_func(_collect_grads, [flat_grad], tf.float32)
avg_flat_grad.set_shape(flat_grad.shape)
avg_grads = tf.split(avg_flat_grad, sizes, axis=0)
avg_grads_and_vars = [(tf.reshape(g, v.shape), v)
for g, (_, v) in zip(avg_grads, grads_and_vars)]
return avg_grads_and_vars
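A hedged sketch of how this optimizer might be driven, assuming an MPI launch (e.g. mpirun -np 2) and the module path baselines/common/mpi_adam_optimizer.py; the flatten / Allreduce / split logic above runs inside compute_gradients, so the caller only sees ordinary grads-and-vars:

# illustrative; run under e.g.: mpirun -np 2 python script.py
import numpy as np
import tensorflow as tf
from mpi4py import MPI
from baselines.common.mpi_adam_optimizer import MpiAdamOptimizer

w = tf.Variable(np.zeros(3, dtype=np.float32))
# each rank pulls toward a different target, so averaging the gradients matters
loss = tf.reduce_sum(tf.square(w - float(MPI.COMM_WORLD.Get_rank())))
opt = MpiAdamOptimizer(MPI.COMM_WORLD, learning_rate=1e-2)
train_op = opt.apply_gradients(opt.compute_gradients(loss, [w]))
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        sess.run(train_op)   # every rank applies the same rank-averaged gradient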

View File

@@ -4,7 +4,7 @@ def mpi_fork(n, bind_to_core=False):
"""Re-launches the current script with workers
Returns "parent" for original parent, "child" for MPI children
"""
if n<=1:
if n<=1:
return "child"
if os.getenv("IN_MPI") is None:
env = os.environ.copy()

View File

@@ -2,29 +2,42 @@ from mpi4py import MPI
import numpy as np
from baselines.common import zipsame
def mpi_moments(x, axis=0):
x = np.asarray(x, dtype='float64')
newshape = list(x.shape)
newshape.pop(axis)
n = np.prod(newshape,dtype=int)
totalvec = np.zeros(n*2+1, 'float64')
addvec = np.concatenate([x.sum(axis=axis).ravel(),
np.square(x).sum(axis=axis).ravel(),
np.array([x.shape[axis]],dtype='float64')])
MPI.COMM_WORLD.Allreduce(addvec, totalvec, op=MPI.SUM)
sum = totalvec[:n]
sumsq = totalvec[n:2*n]
count = totalvec[2*n]
if count == 0:
mean = np.empty(newshape); mean[:] = np.nan
std = np.empty(newshape); std[:] = np.nan
else:
mean = sum/count
std = np.sqrt(np.maximum(sumsq/count - np.square(mean),0))
def mpi_mean(x, axis=0, comm=None, keepdims=False):
x = np.asarray(x)
assert x.ndim > 0
if comm is None: comm = MPI.COMM_WORLD
xsum = x.sum(axis=axis, keepdims=keepdims)
n = xsum.size
localsum = np.zeros(n+1, x.dtype)
localsum[:n] = xsum.ravel()
localsum[n] = x.shape[axis]
globalsum = np.zeros_like(localsum)
comm.Allreduce(localsum, globalsum, op=MPI.SUM)
return globalsum[:n].reshape(xsum.shape) / globalsum[n], globalsum[n]
def mpi_moments(x, axis=0, comm=None, keepdims=False):
x = np.asarray(x)
assert x.ndim > 0
mean, count = mpi_mean(x, axis=axis, comm=comm, keepdims=True)
sqdiffs = np.square(x - mean)
meansqdiff, count1 = mpi_mean(sqdiffs, axis=axis, comm=comm, keepdims=True)
assert count1 == count
std = np.sqrt(meansqdiff)
if not keepdims:
newshape = mean.shape[:axis] + mean.shape[axis+1:]
mean = mean.reshape(newshape)
std = std.reshape(newshape)
return mean, std, count
def test_runningmeanstd():
import subprocess
subprocess.check_call(['mpirun', '-np', '3',
'python','-c',
'from baselines.common.mpi_moments import _helper_runningmeanstd; _helper_runningmeanstd()'])
def _helper_runningmeanstd():
comm = MPI.COMM_WORLD
np.random.seed(0)
for (triple,axis) in [
@@ -45,6 +58,3 @@ def test_runningmeanstd():
assert np.allclose(a1, a2)
print("ok!")
if __name__ == "__main__":
#mpirun -np 3 python <script>
test_runningmeanstd()
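A quick single-process sanity sketch of the refactored mpi_moments (assumes mpi4py is importable; with a communicator of size one the Allreduce is a no-op, so the result should match plain numpy):

import numpy as np
from baselines.common.mpi_moments import mpi_moments

x = np.random.randn(100, 4)
mean, std, count = mpi_moments(x, axis=0)   # defaults to MPI.COMM_WORLD (size 1 here)
assert count == 100
assert np.allclose(mean, x.mean(axis=0))
assert np.allclose(std, x.std(axis=0))      # population std, matching np.std's default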

View File

@@ -1,4 +1,8 @@
from mpi4py import MPI
try:
from mpi4py import MPI
except ImportError:
MPI = None
import tensorflow as tf, baselines.common.tf_util as U, numpy as np
class RunningMeanStd(object):
@@ -39,7 +43,8 @@ class RunningMeanStd(object):
n = int(np.prod(self.shape))
totalvec = np.zeros(n*2+1, 'float64')
addvec = np.concatenate([x.sum(axis=0).ravel(), np.square(x).sum(axis=0).ravel(), np.array([len(x)],dtype='float64')])
MPI.COMM_WORLD.Allreduce(addvec, totalvec, op=MPI.SUM)
if MPI is not None:
MPI.COMM_WORLD.Allreduce(addvec, totalvec, op=MPI.SUM)
self.incfiltparams(totalvec[0:n].reshape(self.shape), totalvec[n:2*n].reshape(self.shape), totalvec[2*n])
@U.in_session
@@ -57,7 +62,7 @@ def test_runningmeanstd():
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = U.eval([rms.mean, rms.std])
ms2 = [rms.mean.eval(), rms.std.eval()]
assert np.allclose(ms1, ms2)
@@ -94,11 +99,11 @@ def test_dist():
assert checkallclose(
bigvec.mean(axis=0),
U.eval(rms.mean)
rms.mean.eval(),
)
assert checkallclose(
bigvec.std(axis=0),
U.eval(rms.std)
rms.std.eval(),
)

View File

@@ -0,0 +1,101 @@
from collections import defaultdict
from mpi4py import MPI
import os, numpy as np
import platform
import shutil
import subprocess
def sync_from_root(sess, variables, comm=None):
"""
Send the root node's parameters to every worker.
Arguments:
sess: the TensorFlow session.
variables: all parameter variables including optimizer's
"""
if comm is None: comm = MPI.COMM_WORLD
rank = comm.Get_rank()
for var in variables:
if rank == 0:
comm.Bcast(sess.run(var))
else:
import tensorflow as tf
returned_var = np.empty(var.shape, dtype='float32')
comm.Bcast(returned_var)
sess.run(tf.assign(var, returned_var))
def gpu_count():
"""
Count the GPUs on this machine.
"""
if shutil.which('nvidia-smi') is None:
return 0
output = subprocess.check_output(['nvidia-smi', '--query-gpu=gpu_name', '--format=csv'])
return max(0, len(output.split(b'\n')) - 2)
def setup_mpi_gpus():
"""
Set CUDA_VISIBLE_DEVICES using MPI.
"""
num_gpus = gpu_count()
if num_gpus == 0:
return
local_rank, _ = get_local_rank_size(MPI.COMM_WORLD)
os.environ['CUDA_VISIBLE_DEVICES'] = str(local_rank % num_gpus)
def get_local_rank_size(comm):
"""
Returns the rank of each process on its machine
The processes on a given machine will be assigned ranks
0, 1, 2, ..., N-1,
where N is the number of processes on this machine.
Useful if you want to assign one gpu per machine
"""
this_node = platform.node()
ranks_nodes = comm.allgather((comm.Get_rank(), this_node))
node2rankssofar = defaultdict(int)
local_rank = None
for (rank, node) in ranks_nodes:
if rank == comm.Get_rank():
local_rank = node2rankssofar[node]
node2rankssofar[node] += 1
assert local_rank is not None
return local_rank, node2rankssofar[this_node]
def share_file(comm, path):
"""
Copies the file from rank 0 to all other ranks
Puts it in the same place on all machines
"""
localrank, _ = get_local_rank_size(comm)
if comm.Get_rank() == 0:
with open(path, 'rb') as fh:
data = fh.read()
comm.bcast(data)
else:
data = comm.bcast(None)
if localrank == 0:
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, 'wb') as fh:
fh.write(data)
comm.Barrier()
def dict_gather(comm, d, op='mean', assert_all_have_data=True):
if comm is None: return d
alldicts = comm.allgather(d)
size = comm.size
k2li = defaultdict(list)
for d in alldicts:
for (k,v) in d.items():
k2li[k].append(v)
result = {}
for (k,li) in k2li.items():
if assert_all_have_data:
assert len(li)==size, "only %i out of %i MPI workers have sent '%s'" % (len(li), size, k)
if op=='mean':
result[k] = np.mean(li, axis=0)
elif op=='sum':
result[k] = np.sum(li, axis=0)
else:
assert 0, op
return result
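An illustrative sketch of the helpers above under an MPI launch (assumes mpi4py and the module path baselines/common/mpi_util.py; the values in the comments are what two ranks on a single node would see):

from mpi4py import MPI
from baselines.common.mpi_util import get_local_rank_size, dict_gather

comm = MPI.COMM_WORLD
local_rank, local_size = get_local_rank_size(comm)   # e.g. (0, 2) on rank 0 and (1, 2) on rank 1
stats = dict_gather(comm, {'reward': float(comm.Get_rank())}, op='mean')
# every rank receives the same averaged dict, e.g. {'reward': 0.5} with two ranks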

View File

@@ -0,0 +1,404 @@
import matplotlib.pyplot as plt
import os.path as osp
import json
import os
import numpy as np
import pandas
from collections import defaultdict, namedtuple
from baselines.bench import monitor
from baselines.logger import read_json, read_csv
def smooth(y, radius, mode='two_sided', valid_only=False):
'''
Smooth signal y, where radius determines the size of the window
mode='two_sided':
average over the window [max(index - radius, 0), min(index + radius, len(y)-1)]
mode='causal':
average over the window [max(index - radius, 0), index]
valid_only: put nan in entries where the full-sized window is not available
'''
assert mode in ('two_sided', 'causal')
if len(y) < 2*radius+1:
return np.ones_like(y) * y.mean()
elif mode == 'two_sided':
convkernel = np.ones(2 * radius+1)
out = np.convolve(y, convkernel,mode='same') / np.convolve(np.ones_like(y), convkernel, mode='same')
if valid_only:
out[:radius] = out[-radius:] = np.nan
elif mode == 'causal':
convkernel = np.ones(radius)
out = np.convolve(y, convkernel,mode='full') / np.convolve(np.ones_like(y), convkernel, mode='full')
out = out[:-radius+1]
if valid_only:
out[:radius] = np.nan
return out
def one_sided_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_counts_threshold=1e-8):
'''
perform one-sided (causal) EMA (exponential moving average)
smoothing and resampling to an even grid with n points.
Does not do extrapolation, so we assume
xolds[0] <= low && high <= xolds[-1]
Arguments:
xolds: array or list - x values of data. Needs to be sorted in ascending order
yolds: array or list - y values of data. Has to have the same length as xolds
low: float - min value of the new x grid. By default equals to xolds[0]
high: float - max value of the new x grid. By default equals to xolds[-1]
n: int - number of points in new x grid
decay_steps: float - EMA decay factor, expressed in new x grid steps.
low_counts_threshold: float or int
- y values with counts less than this value will be set to NaN
Returns:
tuple xs, ys, count_ys where
xs - array with new x grid
ys - array of EMA of y at each point of the new x grid
count_ys - array of EMA of y counts at each point of the new x grid
'''
low = xolds[0] if low is None else low
high = xolds[-1] if high is None else high
assert xolds[0] <= low, 'low = {} < xolds[0] = {} - extrapolation not permitted!'.format(low, xolds[0])
assert xolds[-1] >= high, 'high = {} > xolds[-1] = {} - extrapolation not permitted!'.format(high, xolds[-1])
assert len(xolds) == len(yolds), 'length of xolds ({}) and yolds ({}) do not match!'.format(len(xolds), len(yolds))
xolds = xolds.astype('float64')
yolds = yolds.astype('float64')
luoi = 0 # last unused old index
sum_y = 0.
count_y = 0.
xnews = np.linspace(low, high, n)
decay_period = (high - low) / (n - 1) * decay_steps
interstep_decay = np.exp(- 1. / decay_steps)
sum_ys = np.zeros_like(xnews)
count_ys = np.zeros_like(xnews)
for i in range(n):
xnew = xnews[i]
sum_y *= interstep_decay
count_y *= interstep_decay
while True:
xold = xolds[luoi]
if xold <= xnew:
decay = np.exp(- (xnew - xold) / decay_period)
sum_y += decay * yolds[luoi]
count_y += decay
luoi += 1
else:
break
if luoi >= len(xolds):
break
sum_ys[i] = sum_y
count_ys[i] = count_y
ys = sum_ys / count_ys
ys[count_ys < low_counts_threshold] = np.nan
return xnews, ys, count_ys
def symmetric_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_counts_threshold=1e-8):
'''
perform symmetric EMA (exponential moving average)
smoothing and resampling to an even grid with n points.
Does not do extrapolation, so we assume
xolds[0] <= low && high <= xolds[-1]
Arguments:
xolds: array or list - x values of data. Needs to be sorted in ascending order
yolds: array or list - y values of data. Has to have the same length as xolds
low: float - min value of the new x grid. By default equals to xolds[0]
high: float - max value of the new x grid. By default equals to xolds[-1]
n: int - number of points in new x grid
decay_steps: float - EMA decay factor, expressed in new x grid steps.
low_counts_threshold: float or int
- y values with counts less than this value will be set to NaN
Returns:
tuple xs, ys, count_ys where
xs - array with new x grid
ys - array of EMA of y at each point of the new x grid
count_ys - array of EMA of y counts at each point of the new x grid
'''
xs, ys1, count_ys1 = one_sided_ema(xolds, yolds, low, high, n, decay_steps, low_counts_threshold=0)
_, ys2, count_ys2 = one_sided_ema(-xolds[::-1], yolds[::-1], -high, -low, n, decay_steps, low_counts_threshold=0)
ys2 = ys2[::-1]
count_ys2 = count_ys2[::-1]
count_ys = count_ys1 + count_ys2
ys = (ys1 * count_ys1 + ys2 * count_ys2) / count_ys
ys[count_ys < low_counts_threshold] = np.nan
return xs, ys, count_ys
Result = namedtuple('Result', 'monitor progress dirname metadata')
Result.__new__.__defaults__ = (None,) * len(Result._fields)
def load_results(root_dir_or_dirs, enable_progress=True, enable_monitor=True, verbose=False):
'''
load summaries of runs from a list of directories (including subdirectories)
Arguments:
enable_progress: bool - if True, will attempt to load data from progress.csv files (data saved by logger). Default: True
enable_monitor: bool - if True, will attempt to load data from monitor.csv files (data saved by Monitor environment wrapper). Default: True
verbose: bool - if True, will print out list of directories from which the data is loaded. Default: False
Returns:
List of Result objects with the following fields:
- dirname - path to the directory data was loaded from
- metadata - run metadata (such as command-line arguments and anything else in the metadata.json file)
- monitor - if enable_monitor is True, this field contains pandas dataframe with loaded monitor.csv file (or aggregate of all *.monitor.csv files in the directory)
- progress - if enable_progress is True, this field contains pandas dataframe with loaded progress.csv file
'''
import re
if isinstance(root_dir_or_dirs, str):
rootdirs = [osp.expanduser(root_dir_or_dirs)]
else:
rootdirs = [osp.expanduser(d) for d in root_dir_or_dirs]
allresults = []
for rootdir in rootdirs:
assert osp.exists(rootdir), "%s doesn't exist"%rootdir
for dirname, dirs, files in os.walk(rootdir):
if '-proc' in dirname:
files[:] = []
continue
monitor_re = re.compile(r'(\d+\.)?(\d+\.)?monitor\.csv')
if set(['metadata.json', 'monitor.json', 'progress.json', 'progress.csv']).intersection(files) or \
any([f for f in files if monitor_re.match(f)]): # also match monitor files like 0.1.monitor.csv
# used to be uncommented, which means do not go deeper than current directory if any of the data files
# are found
# dirs[:] = []
result = {'dirname' : dirname}
if "metadata.json" in files:
with open(osp.join(dirname, "metadata.json"), "r") as fh:
result['metadata'] = json.load(fh)
progjson = osp.join(dirname, "progress.json")
progcsv = osp.join(dirname, "progress.csv")
if enable_progress:
if osp.exists(progjson):
result['progress'] = pandas.DataFrame(read_json(progjson))
elif osp.exists(progcsv):
try:
result['progress'] = read_csv(progcsv)
except pandas.errors.EmptyDataError:
print('skipping progress file in ', dirname, 'empty data')
else:
if verbose: print('skipping %s: no progress file'%dirname)
if enable_monitor:
try:
result['monitor'] = pandas.DataFrame(monitor.load_results(dirname))
except monitor.LoadMonitorResultsError:
print('skipping %s: no monitor files'%dirname)
except Exception as e:
print('exception loading monitor file in %s: %s'%(dirname, e))
if result.get('monitor') is not None or result.get('progress') is not None:
allresults.append(Result(**result))
if verbose:
print('successfully loaded %s'%dirname)
if verbose: print('loaded %i results'%len(allresults))
return allresults
COLORS = ['blue', 'green', 'red', 'cyan', 'magenta', 'yellow', 'black', 'purple', 'pink',
'brown', 'orange', 'teal', 'lightblue', 'lime', 'lavender', 'turquoise',
'darkgreen', 'tan', 'salmon', 'gold', 'darkred', 'darkblue']
def default_xy_fn(r):
x = np.cumsum(r.monitor.l)
y = smooth(r.monitor.r, radius=10)
return x,y
def default_split_fn(r):
import re
# match name between slash and -<digits> at the end of the string
# (slash in the beginning or -<digits> in the end or either may be missing)
match = re.search(r'[^/-]+(?=(-\d+)?\Z)', r.dirname)
if match:
return match.group(0)
def plot_results(
allresults, *,
xy_fn=default_xy_fn,
split_fn=default_split_fn,
group_fn=default_split_fn,
average_group=False,
shaded_std=True,
shaded_err=True,
figsize=None,
legend_outside=False,
resample=0,
smooth_step=1.0,
):
'''
Plot multiple Results objects
xy_fn: function Result -> x,y - function that converts results objects into tuple of x and y values.
By default, x is cumsum of episode lengths, and y is episode rewards
split_fn: function Result -> hashable - function that converts results objects into keys to split curves into sub-panels by.
That is, the results r for which split_fn(r) is different will be put on different sub-panels.
By default, the portion of r.dirname between last / and -<digits> is returned. The sub-panels are
stacked vertically in the figure.
group_fn: function Result -> hashable - function that converts results objects into keys to group curves by.
That is, the results r for which group_fn(r) is the same will be put into the same group.
Curves in the same group have the same color (if average_group is False), or averaged over
(if average_group is True). The default value is the same as default value for split_fn
average_group: bool - if True, will average the curves in the same group and plot the mean. Enables resampling
(if resample = 0, will use 512 steps)
shaded_std: bool - if True (default), the shaded region corresponding to standard deviation of the group of curves will be
shown (only applicable if average_group = True)
shaded_err: bool - if True (default), the shaded region corresponding to error in mean estimate of the group of curves
(that is, standard deviation divided by square root of number of curves) will be
shown (only applicable if average_group = True)
figsize: tuple or None - size of the resulting figure (including sub-panels). By default, width is 6 and height is 6 times number of
sub-panels.
legend_outside: bool - if True, will place the legend outside of the sub-panels.
resample: int - if not zero, size of the uniform grid in x direction to resample onto. Resampling is performed via symmetric
EMA smoothing (see the docstring for symmetric_ema).
Default is zero (no resampling). Note that if average_group is True, resampling is necessary; in that case, default
value is 512.
smooth_step: float - when resampling (i.e. when resample > 0 or average_group is True), use this EMA decay parameter (in units of the new grid step).
See docstrings for decay_steps in symmetric_ema or one_sided_ema functions.
'''
if split_fn is None: split_fn = lambda _ : ''
if group_fn is None: group_fn = lambda _ : ''
sk2r = defaultdict(list) # splitkey2results
for result in allresults:
splitkey = split_fn(result)
sk2r[splitkey].append(result)
assert len(sk2r) > 0
assert isinstance(resample, int), "0: don't resample. <integer>: that many samples"
nrows = len(sk2r)
ncols = 1
figsize = figsize or (6, 6 * nrows)
f, axarr = plt.subplots(nrows, ncols, sharex=False, squeeze=False, figsize=figsize)
groups = list(set(group_fn(result) for result in allresults))
default_samples = 512
if average_group:
resample = resample or default_samples
for (isplit, sk) in enumerate(sorted(sk2r.keys())):
g2l = {}
g2c = defaultdict(int)
sresults = sk2r[sk]
gresults = defaultdict(list)
ax = axarr[isplit][0]
for result in sresults:
group = group_fn(result)
g2c[group] += 1
x, y = xy_fn(result)
if x is None: x = np.arange(len(y))
x, y = map(np.asarray, (x, y))
if average_group:
gresults[group].append((x,y))
else:
if resample:
x, y, counts = symmetric_ema(x, y, x[0], x[-1], resample, decay_steps=smooth_step)
l, = ax.plot(x, y, color=COLORS[groups.index(group) % len(COLORS)])
g2l[group] = l
if average_group:
for group in sorted(groups):
xys = gresults[group]
if not any(xys):
continue
color = COLORS[groups.index(group) % len(COLORS)]
origxs = [xy[0] for xy in xys]
minxlen = min(map(len, origxs))
def allequal(qs):
return all((q==qs[0]).all() for q in qs[1:])
if resample:
low = max(x[0] for x in origxs)
high = min(x[-1] for x in origxs)
usex = np.linspace(low, high, resample)
ys = []
for (x, y) in xys:
ys.append(symmetric_ema(x, y, low, high, resample, decay_steps=smooth_step)[1])
else:
assert allequal([x[:minxlen] for x in origxs]),\
'If you want to average unevenly sampled data, set resample=<number of samples you want>'
usex = origxs[0]
ys = [xy[1][:minxlen] for xy in xys]
ymean = np.mean(ys, axis=0)
ystd = np.std(ys, axis=0)
ystderr = ystd / np.sqrt(len(ys))
l, = axarr[isplit][0].plot(usex, ymean, color=color)
g2l[group] = l
if shaded_err:
ax.fill_between(usex, ymean - ystderr, ymean + ystderr, color=color, alpha=.4)
if shaded_std:
ax.fill_between(usex, ymean - ystd, ymean + ystd, color=color, alpha=.2)
# https://matplotlib.org/users/legend_guide.html
plt.tight_layout()
if any(g2l.keys()):
ax.legend(
g2l.values(),
['%s (%i)'%(g, g2c[g]) for g in g2l] if average_group else g2l.keys(),
loc=2 if legend_outside else None,
bbox_to_anchor=(1,1) if legend_outside else None)
ax.set_title(sk)
return f, axarr
def regression_analysis(df):
xcols = list(df.columns.copy())
xcols.remove('score')
ycols = ['score']
import statsmodels.api as sm
mod = sm.OLS(df[ycols], sm.add_constant(df[xcols]), hasconst=False)
res = mod.fit()
print(res.summary())
def test_smooth():
norig = 100
nup = 300
ndown = 30
xs = np.cumsum(np.random.rand(norig) * 10 / norig)
yclean = np.sin(xs)
ys = yclean + .1 * np.random.randn(yclean.size)
xup, yup, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), nup, decay_steps=nup/ndown)
xdown, ydown, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), ndown, decay_steps=ndown/ndown)
xsame, ysame, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), norig, decay_steps=norig/ndown)
plt.plot(xs, ys, label='orig', marker='x')
plt.plot(xup, yup, label='up', marker='x')
plt.plot(xdown, ydown, label='down', marker='x')
plt.plot(xsame, ysame, label='same', marker='x')
plt.plot(xs, yclean, label='clean', marker='x')
plt.legend()
plt.show()
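A hedged end-to-end sketch of the plotting helpers above; the log directory is a placeholder and is assumed to contain monitor.csv / progress.csv files written by the Monitor wrapper or the logger:

import matplotlib.pyplot as plt
from baselines.common.plot_util import load_results, plot_results

results = load_results('~/logs/my_experiment')   # placeholder path
# one sub-panel per split key; curves in the same group are averaged with shaded std/err bands
fig, axarr = plot_results(results, average_group=True, shaded_std=True, resample=512)
plt.show()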

View File

@@ -0,0 +1,186 @@
import tensorflow as tf
from baselines.common import tf_util
from baselines.a2c.utils import fc
from baselines.common.distributions import make_pdtype
from baselines.common.input import observation_placeholder, encode_observation
from baselines.common.tf_util import adjust_shape
from baselines.common.mpi_running_mean_std import RunningMeanStd
from baselines.common.models import get_network_builder
import gym
class PolicyWithValue(object):
"""
Encapsulates fields and methods for RL policy and value function estimation with shared parameters
"""
def __init__(self, env, observations, latent, estimate_q=False, vf_latent=None, sess=None, **tensors):
"""
Parameters:
----------
env RL environment
observations tensorflow placeholder in which the observations will be fed
latent latent state from which policy distribution parameters should be inferred
vf_latent latent state from which value function should be inferred (if None, then latent is used)
sess tensorflow session to run calculations in (if None, default session is used)
**tensors tensorflow tensors for additional attributes such as state or mask
"""
self.X = observations
self.state = tf.constant([])
self.initial_state = None
self.__dict__.update(tensors)
vf_latent = vf_latent if vf_latent is not None else latent
vf_latent = tf.layers.flatten(vf_latent)
latent = tf.layers.flatten(latent)
# Based on the action space, will select what probability distribution type
self.pdtype = make_pdtype(env.action_space)
self.pd, self.pi = self.pdtype.pdfromlatent(latent, init_scale=0.01)
# Take an action
self.action = self.pd.sample()
# Calculate the neg log of our probability
self.neglogp = self.pd.neglogp(self.action)
self.sess = sess or tf.get_default_session()
if estimate_q:
assert isinstance(env.action_space, gym.spaces.Discrete)
self.q = fc(vf_latent, 'q', env.action_space.n)
self.vf = self.q
else:
self.vf = fc(vf_latent, 'vf', 1)
self.vf = self.vf[:,0]
def _evaluate(self, variables, observation, **extra_feed):
sess = self.sess
feed_dict = {self.X: adjust_shape(self.X, observation)}
for inpt_name, data in extra_feed.items():
if inpt_name in self.__dict__.keys():
inpt = self.__dict__[inpt_name]
if isinstance(inpt, tf.Tensor) and inpt._op.type == 'Placeholder':
feed_dict[inpt] = adjust_shape(inpt, data)
return sess.run(variables, feed_dict)
def step(self, observation, **extra_feed):
"""
Compute next action(s) given the observation(s)
Parameters:
----------
observation observation data (either single or a batch)
**extra_feed additional data such as state or mask (names of the arguments should match the ones in constructor, see __init__)
Returns:
-------
(action, value estimate, next state, negative log likelihood of the action under current policy parameters) tuple
"""
a, v, state, neglogp = self._evaluate([self.action, self.vf, self.state, self.neglogp], observation, **extra_feed)
if state.size == 0:
state = None
return a, v, state, neglogp
def value(self, ob, *args, **kwargs):
"""
Compute value estimate(s) given the observation(s)
Parameters:
----------
observation observation data (either single or a batch)
**extra_feed additional data such as state or mask (names of the arguments should match the ones in constructor, see __init__)
Returns:
-------
value estimate
"""
return self._evaluate(self.vf, ob, *args, **kwargs)
def save(self, save_path):
tf_util.save_state(save_path, sess=self.sess)
def load(self, load_path):
tf_util.load_state(load_path, sess=self.sess)
def build_policy(env, policy_network, value_network=None, normalize_observations=False, estimate_q=False, **policy_kwargs):
if isinstance(policy_network, str):
network_type = policy_network
policy_network = get_network_builder(network_type)(**policy_kwargs)
def policy_fn(nbatch=None, nsteps=None, sess=None, observ_placeholder=None):
ob_space = env.observation_space
X = observ_placeholder if observ_placeholder is not None else observation_placeholder(ob_space, batch_size=nbatch)
extra_tensors = {}
if normalize_observations and X.dtype == tf.float32:
encoded_x, rms = _normalize_clip_observation(X)
extra_tensors['rms'] = rms
else:
encoded_x = X
encoded_x = encode_observation(ob_space, encoded_x)
with tf.variable_scope('pi', reuse=tf.AUTO_REUSE):
policy_latent = policy_network(encoded_x)
if isinstance(policy_latent, tuple):
policy_latent, recurrent_tensors = policy_latent
if recurrent_tensors is not None:
# recurrent architecture, need a few more steps
nenv = nbatch // nsteps
assert nenv > 0, 'Bad input for recurrent policy: batch size {} smaller than nsteps {}'.format(nbatch, nsteps)
policy_latent, recurrent_tensors = policy_network(encoded_x, nenv)
extra_tensors.update(recurrent_tensors)
_v_net = value_network
if _v_net is None or _v_net == 'shared':
vf_latent = policy_latent
else:
if _v_net == 'copy':
_v_net = policy_network
else:
assert callable(_v_net)
with tf.variable_scope('vf', reuse=tf.AUTO_REUSE):
# TODO recurrent architectures are not supported with value_network=copy yet
vf_latent = _v_net(encoded_x)
policy = PolicyWithValue(
env=env,
observations=X,
latent=policy_latent,
vf_latent=vf_latent,
sess=sess,
estimate_q=estimate_q,
**extra_tensors
)
return policy
return policy_fn
def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
rms = RunningMeanStd(shape=x.shape[1:])
norm_x = tf.clip_by_value((x - rms.mean) / rms.std, min(clip_range), max(clip_range))
return norm_x, rms
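A minimal feed-forward counterpart to the recurrent example further below, sketching how build_policy and PolicyWithValue.step are meant to be driven (CartPole and the 'mlp' network are illustrative choices):

import gym
import tensorflow as tf
from baselines.common.policies import build_policy
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = DummyVecEnv([lambda: gym.make('CartPole-v0')])
with tf.Session() as sess:
    policy = build_policy(venv, 'mlp')(nbatch=1, nsteps=1, sess=sess)
    sess.run(tf.global_variables_initializer())
    obs = venv.reset()
    action, value, state, neglogp = policy.step(obs)   # state is None for a feed-forward policy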

View File

@@ -0,0 +1,291 @@
# flake8: noqa F403, F405
from .atari_wrappers import *
import numpy as np
import gym
class TimeLimit(gym.Wrapper):
def __init__(self, env, max_episode_steps=None):
super(TimeLimit, self).__init__(env)
self._max_episode_steps = max_episode_steps
self._elapsed_steps = 0
def step(self, ac):
observation, reward, done, info = self.env.step(ac)
self._elapsed_steps += 1
if self._elapsed_steps >= self._max_episode_steps:
done = True
info['TimeLimit.truncated'] = True
return observation, reward, done, info
def reset(self, **kwargs):
self._elapsed_steps = 0
return self.env.reset(**kwargs)
class StochasticFrameSkip(gym.Wrapper):
def __init__(self, env, n, stickprob):
gym.Wrapper.__init__(self, env)
self.n = n
self.stickprob = stickprob
self.curac = None
self.rng = np.random.RandomState()
self.supports_want_render = hasattr(env, "supports_want_render")
def reset(self, **kwargs):
self.curac = None
return self.env.reset(**kwargs)
def step(self, ac):
done = False
totrew = 0
for i in range(self.n):
# First step after reset, use action
if self.curac is None:
self.curac = ac
# First substep, delay with probability=stickprob
elif i==0:
if self.rng.rand() > self.stickprob:
self.curac = ac
# Second substep, new action definitely kicks in
elif i==1:
self.curac = ac
if self.supports_want_render and i<self.n-1:
ob, rew, done, info = self.env.step(self.curac, want_render=False)
else:
ob, rew, done, info = self.env.step(self.curac)
totrew += rew
if done: break
return ob, totrew, done, info
def seed(self, s):
self.rng.seed(s)
class PartialFrameStack(gym.Wrapper):
def __init__(self, env, k, channel=1):
"""
Stack one channel (channel keyword) from previous frames
"""
gym.Wrapper.__init__(self, env)
shp = env.observation_space.shape
self.channel = channel
self.observation_space = gym.spaces.Box(low=0, high=255,
shape=(shp[0], shp[1], shp[2] + k - 1),
dtype=env.observation_space.dtype)
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
def reset(self):
ob = self.env.reset()
assert ob.shape[2] > self.channel
for _ in range(self.k):
self.frames.append(ob)
return self._get_ob()
def step(self, ac):
ob, reward, done, info = self.env.step(ac)
self.frames.append(ob)
return self._get_ob(), reward, done, info
def _get_ob(self):
assert len(self.frames) == self.k
return np.concatenate([frame if i==self.k-1 else frame[:,:,self.channel:self.channel+1]
for (i, frame) in enumerate(self.frames)], axis=2)
class Downsample(gym.ObservationWrapper):
def __init__(self, env, ratio):
"""
Downsample images by a factor of ratio
"""
gym.ObservationWrapper.__init__(self, env)
(oldh, oldw, oldc) = env.observation_space.shape
newshape = (oldh//ratio, oldw//ratio, oldc)
self.observation_space = spaces.Box(low=0, high=255,
shape=newshape, dtype=np.uint8)
def observation(self, frame):
height, width, _ = self.observation_space.shape
frame = cv2.resize(frame, (width, height), interpolation=cv2.INTER_AREA)
if frame.ndim == 2:
frame = frame[:,:,None]
return frame
class Rgb2gray(gym.ObservationWrapper):
def __init__(self, env):
"""
Convert RGB images to grayscale
"""
gym.ObservationWrapper.__init__(self, env)
(oldh, oldw, _oldc) = env.observation_space.shape
self.observation_space = spaces.Box(low=0, high=255,
shape=(oldh, oldw, 1), dtype=np.uint8)
def observation(self, frame):
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
return frame[:,:,None]
class MovieRecord(gym.Wrapper):
def __init__(self, env, savedir, k):
gym.Wrapper.__init__(self, env)
self.savedir = savedir
self.k = k
self.epcount = 0
def reset(self):
if self.epcount % self.k == 0:
self.env.unwrapped.movie_path = self.savedir
else:
self.env.unwrapped.movie_path = None
self.env.unwrapped.movie = None
self.epcount += 1
return self.env.reset()
class AppendTimeout(gym.Wrapper):
def __init__(self, env):
gym.Wrapper.__init__(self, env)
self.action_space = env.action_space
self.timeout_space = gym.spaces.Box(low=np.array([0.0]), high=np.array([1.0]), dtype=np.float32)
self.original_os = env.observation_space
if isinstance(self.original_os, gym.spaces.Dict):
import copy
ordered_dict = copy.deepcopy(self.original_os.spaces)
ordered_dict['value_estimation_timeout'] = self.timeout_space
self.observation_space = gym.spaces.Dict(ordered_dict)
self.dict_mode = True
else:
self.observation_space = gym.spaces.Dict({
'original': self.original_os,
'value_estimation_timeout': self.timeout_space
})
self.dict_mode = False
self.ac_count = None
while 1:
if not hasattr(env, "_max_episode_steps"): # Looking for TimeLimit wrapper that has this field
env = env.env
continue
break
self.timeout = env._max_episode_steps
def step(self, ac):
self.ac_count += 1
ob, rew, done, info = self.env.step(ac)
return self._process(ob), rew, done, info
def reset(self):
self.ac_count = 0
return self._process(self.env.reset())
def _process(self, ob):
fracmissing = 1 - self.ac_count / self.timeout
if self.dict_mode:
ob['value_estimation_timeout'] = fracmissing
return ob
else:
return { 'original': ob, 'value_estimation_timeout': fracmissing }
class StartDoingRandomActionsWrapper(gym.Wrapper):
"""
Warning: can eat info dicts, not good if you depend on them
"""
def __init__(self, env, max_random_steps, on_startup=True, every_episode=False):
gym.Wrapper.__init__(self, env)
self.on_startup = on_startup
self.every_episode = every_episode
self.random_steps = max_random_steps
self.last_obs = None
if on_startup:
self.some_random_steps()
def some_random_steps(self):
self.last_obs = self.env.reset()
n = np.random.randint(self.random_steps)
#print("running for random %i frames" % n)
for _ in range(n):
self.last_obs, _, done, _ = self.env.step(self.env.action_space.sample())
if done: self.last_obs = self.env.reset()
def reset(self):
return self.last_obs
def step(self, a):
self.last_obs, rew, done, info = self.env.step(a)
if done:
self.last_obs = self.env.reset()
if self.every_episode:
self.some_random_steps()
return self.last_obs, rew, done, info
def make_retro(*, game, state, max_episode_steps, **kwargs):
import retro
env = retro.make(game, state, **kwargs)
env = StochasticFrameSkip(env, n=4, stickprob=0.25)
if max_episode_steps is not None:
env = TimeLimit(env, max_episode_steps=max_episode_steps)
return env
def wrap_deepmind_retro(env, scale=True, frame_stack=4):
"""
Configure environment for retro games, using config similar to DeepMind-style Atari in wrap_deepmind
"""
env = WarpFrame(env)
env = ClipRewardEnv(env)
env = FrameStack(env, frame_stack)
if scale:
env = ScaledFloatFrame(env)
return env
class SonicDiscretizer(gym.ActionWrapper):
"""
Wrap a gym-retro environment and make it use discrete
actions for the Sonic game.
"""
def __init__(self, env):
super(SonicDiscretizer, self).__init__(env)
buttons = ["B", "A", "MODE", "START", "UP", "DOWN", "LEFT", "RIGHT", "C", "Y", "X", "Z"]
actions = [['LEFT'], ['RIGHT'], ['LEFT', 'DOWN'], ['RIGHT', 'DOWN'], ['DOWN'],
['DOWN', 'B'], ['B']]
self._actions = []
for action in actions:
arr = np.array([False] * 12)
for button in action:
arr[buttons.index(button)] = True
self._actions.append(arr)
self.action_space = gym.spaces.Discrete(len(self._actions))
def action(self, a): # pylint: disable=W0221
return self._actions[a].copy()
class RewardScaler(gym.RewardWrapper):
"""
Bring rewards to a reasonable scale for PPO.
This is incredibly important and affects performance
drastically.
"""
def __init__(self, env, scale=0.01):
super(RewardScaler, self).__init__(env)
self.scale = scale
def reward(self, reward):
return reward * self.scale
class AllowBacktracking(gym.Wrapper):
"""
Use deltas in max(X) as the reward, rather than deltas
in X. This way, agents are not discouraged too heavily
from exploring backwards if there is no way to advance
head-on in the level.
"""
def __init__(self, env):
super(AllowBacktracking, self).__init__(env)
self._cur_x = 0
self._max_x = 0
def reset(self, **kwargs): # pylint: disable=E0202
self._cur_x = 0
self._max_x = 0
return self.env.reset(**kwargs)
def step(self, action): # pylint: disable=E0202
obs, rew, done, info = self.env.step(action)
self._cur_x += rew
rew = max(0, self._cur_x - self._max_x)
self._max_x = max(self._max_x, self._cur_x)
return obs, rew, done, info
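A composition sketch for the wrappers above, assuming gym-retro is installed and the named game/state (placeholders here) has been imported locally:

from baselines.common.retro_wrappers import make_retro, wrap_deepmind_retro, SonicDiscretizer, AllowBacktracking

env = make_retro(game='SonicTheHedgehog-Genesis', state='GreenHillZone.Act1', max_episode_steps=4500)
env = SonicDiscretizer(env)       # 7 discrete button combinations instead of the raw 12-button space
env = AllowBacktracking(env)      # reward deltas in max(x) rather than x
env = wrap_deepmind_retro(env)    # warp frames, clip rewards, stack 4 frames, scale to [0, 1]
obs = env.reset()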

View File

@@ -0,0 +1,19 @@
import numpy as np
from abc import ABC, abstractmethod
class AbstractEnvRunner(ABC):
def __init__(self, *, env, model, nsteps):
self.env = env
self.model = model
self.nenv = nenv = env.num_envs if hasattr(env, 'num_envs') else 1
self.batch_ob_shape = (nenv*nsteps,) + env.observation_space.shape
self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
self.obs[:] = env.reset()
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
@abstractmethod
def run(self):
raise NotImplementedError
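A toy subclass sketch of the contract above: run() collects nsteps of experience, using self.obs and self.dones as rolling state. The random-action policy is purely illustrative (real runners query self.model):

import numpy as np
from baselines.common.runners import AbstractEnvRunner

class RandomRunner(AbstractEnvRunner):
    def run(self):
        mb_obs, mb_rewards, mb_dones = [], [], []
        for _ in range(self.nsteps):
            actions = np.array([self.env.action_space.sample() for _ in range(self.nenv)])
            self.obs[:], rewards, self.dones, _ = self.env.step(actions)
            mb_obs.append(self.obs.copy())
            mb_rewards.append(rewards)
            mb_dones.append(self.dones)
        return np.asarray(mb_obs), np.asarray(mb_rewards), np.asarray(mb_dones)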

View File

@@ -0,0 +1,187 @@
import tensorflow as tf
import numpy as np
from baselines.common.tf_util import get_session
class RunningMeanStd(object):
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
def __init__(self, epsilon=1e-4, shape=()):
self.mean = np.zeros(shape, 'float64')
self.var = np.ones(shape, 'float64')
self.count = epsilon
def update(self, x):
batch_mean = np.mean(x, axis=0)
batch_var = np.var(x, axis=0)
batch_count = x.shape[0]
self.update_from_moments(batch_mean, batch_var, batch_count)
def update_from_moments(self, batch_mean, batch_var, batch_count):
self.mean, self.var, self.count = update_mean_var_count_from_moments(
self.mean, self.var, self.count, batch_mean, batch_var, batch_count)
def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
delta = batch_mean - mean
tot_count = count + batch_count
new_mean = mean + delta * batch_count / tot_count
m_a = var * count
m_b = batch_var * batch_count
M2 = m_a + m_b + np.square(delta) * count * batch_count / tot_count
new_var = M2 / tot_count
new_count = tot_count
return new_mean, new_var, new_count
class TfRunningMeanStd(object):
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
'''
TensorFlow variables-based implementation of computing running mean and std
Benefit of this implementation is that it can be saved / loaded together with the tensorflow model
'''
def __init__(self, epsilon=1e-4, shape=(), scope=''):
sess = get_session()
self._new_mean = tf.placeholder(shape=shape, dtype=tf.float64)
self._new_var = tf.placeholder(shape=shape, dtype=tf.float64)
self._new_count = tf.placeholder(shape=(), dtype=tf.float64)
with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
self._mean = tf.get_variable('mean', initializer=np.zeros(shape, 'float64'), dtype=tf.float64)
self._var = tf.get_variable('std', initializer=np.ones(shape, 'float64'), dtype=tf.float64)
self._count = tf.get_variable('count', initializer=np.full((), epsilon, 'float64'), dtype=tf.float64)
self.update_ops = tf.group([
self._var.assign(self._new_var),
self._mean.assign(self._new_mean),
self._count.assign(self._new_count)
])
sess.run(tf.variables_initializer([self._mean, self._var, self._count]))
self.sess = sess
self._set_mean_var_count()
def _set_mean_var_count(self):
self.mean, self.var, self.count = self.sess.run([self._mean, self._var, self._count])
def update(self, x):
batch_mean = np.mean(x, axis=0)
batch_var = np.var(x, axis=0)
batch_count = x.shape[0]
new_mean, new_var, new_count = update_mean_var_count_from_moments(self.mean, self.var, self.count, batch_mean, batch_var, batch_count)
self.sess.run(self.update_ops, feed_dict={
self._new_mean: new_mean,
self._new_var: new_var,
self._new_count: new_count
})
self._set_mean_var_count()
def test_runningmeanstd():
for (x1, x2, x3) in [
(np.random.randn(3), np.random.randn(4), np.random.randn(5)),
(np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
]:
rms = RunningMeanStd(epsilon=0.0, shape=x1.shape[1:])
x = np.concatenate([x1, x2, x3], axis=0)
ms1 = [x.mean(axis=0), x.var(axis=0)]
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = [rms.mean, rms.var]
np.testing.assert_allclose(ms1, ms2)
def test_tf_runningmeanstd():
for (x1, x2, x3) in [
(np.random.randn(3), np.random.randn(4), np.random.randn(5)),
(np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
]:
rms = TfRunningMeanStd(epsilon=0.0, shape=x1.shape[1:], scope='running_mean_std' + str(np.random.randint(0, 128)))
x = np.concatenate([x1, x2, x3], axis=0)
ms1 = [x.mean(axis=0), x.var(axis=0)]
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = [rms.mean, rms.var]
np.testing.assert_allclose(ms1, ms2)
def profile_tf_runningmeanstd():
import time
from baselines.common import tf_util
tf_util.get_session( config=tf.ConfigProto(
inter_op_parallelism_threads=1,
intra_op_parallelism_threads=1,
allow_soft_placement=True
))
x = np.random.random((376,))
n_trials = 10000
rms = RunningMeanStd()
tfrms = TfRunningMeanStd()
tic1 = time.time()
for _ in range(n_trials):
rms.update(x)
tic2 = time.time()
for _ in range(n_trials):
tfrms.update(x)
tic3 = time.time()
print('rms update time ({} trials): {} s'.format(n_trials, tic2 - tic1))
print('tfrms update time ({} trials): {} s'.format(n_trials, tic3 - tic2))
tic1 = time.time()
for _ in range(n_trials):
z1 = rms.mean
tic2 = time.time()
for _ in range(n_trials):
z2 = tfrms.mean
assert z1 == z2
tic3 = time.time()
print('rms get mean time ({} trials): {} s'.format(n_trials, tic2 - tic1))
print('tfrms get mean time ({} trials): {} s'.format(n_trials, tic3 - tic2))
'''
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) #pylint: disable=E1101
run_metadata = tf.RunMetadata()
profile_opts = dict(options=options, run_metadata=run_metadata)
from tensorflow.python.client import timeline
fetched_timeline = timeline.Timeline(run_metadata.step_stats) #pylint: disable=E1101
chrome_trace = fetched_timeline.generate_chrome_trace_format()
outfile = '/tmp/timeline.json'
with open(outfile, 'wt') as f:
f.write(chrome_trace)
print(f'Successfully saved profile to {outfile}. Exiting.')
exit(0)
'''
if __name__ == '__main__':
profile_tf_runningmeanstd()
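A quick numeric sanity sketch of update_mean_var_count_from_moments above: merging the exact moments of two batches should reproduce the moments of their concatenation (this is the parallel-variance update referenced in the comment):

import numpy as np
from baselines.common.running_mean_std import update_mean_var_count_from_moments

a, b = np.random.randn(100), np.random.randn(50)
mean, var, count = update_mean_var_count_from_moments(
    a.mean(), a.var(), len(a), b.mean(), b.var(), len(b))
both = np.concatenate([a, b])
assert np.allclose([mean, var, count], [both.mean(), both.var(), len(both)])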

View File

@@ -12,10 +12,9 @@ class SegmentTree(object):
a) setting item's value is slightly slower.
It is O(lg capacity) instead of O(1).
b) user has access to an efficient `reduce`
operation which reduces `operation` over
a contiguous subsequence of items in the
array.
b) user has access to an efficient ( O(log segment size) )
`reduce` operation which reduces `operation` over
a contiguous subsequence of items in the array.
Parameters
---------
@@ -23,8 +22,8 @@ class SegmentTree(object):
Total size of the array - must be a power of two.
operation: lambda obj, obj -> obj
an operation for combining elements (e.g. sum, max)
must for a mathematical group together with the set of
possible values for array elements.
must form a mathematical group together with the set of
possible values for array elements (i.e. be associative)
neutral_element: obj
neutral element for the operation above. eg. float('-inf')
for max and 0 for sum.
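For concreteness, a small sketch of the reduce operation described above, using the SumSegmentTree subclass from the same module (its existing interface is assumed: item assignment for updates and sum(start, end) with an exclusive end for the range reduction):

from baselines.common.segment_tree import SumSegmentTree

tree = SumSegmentTree(capacity=8)            # capacity must be a power of two
for i, v in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree[i] = v                              # O(log capacity) per update
assert tree.sum(0, 4) == 10.0                # reduce '+' over a contiguous range, O(log capacity)
assert tree.sum(1, 3) == 5.0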

View File

@@ -0,0 +1,44 @@
import numpy as np
from gym import Env
from gym.spaces import Discrete
class FixedSequenceEnv(Env):
def __init__(
self,
n_actions=10,
seed=0,
episode_len=100
):
self.np_random = np.random.RandomState()
self.np_random.seed(seed)
self.sequence = [self.np_random.randint(0, n_actions-1) for _ in range(episode_len)]
self.action_space = Discrete(n_actions)
self.observation_space = Discrete(1)
self.episode_len = episode_len
self.time = 0
self.reset()
def reset(self):
self.time = 0
return 0
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
done = False
if self.episode_len and self.time >= self.episode_len:
rew = 0
done = True
return 0, rew, done, {}
def _choose_next_state(self):
self.time += 1
def _get_reward(self, actions):
return 1 if actions == self.sequence[self.time] else 0

View File

@@ -0,0 +1,83 @@
import numpy as np
from abc import abstractmethod
from gym import Env
from gym.spaces import MultiDiscrete, Discrete, Box
class IdentityEnv(Env):
def __init__(
self,
episode_len=None
):
self.episode_len = episode_len
self.time = 0
self.reset()
def reset(self):
self._choose_next_state()
self.time = 0
self.observation_space = self.action_space
return self.state
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
done = False
if self.episode_len and self.time >= self.episode_len:
rew = 0
done = True
return self.state, rew, done, {}
def _choose_next_state(self):
self.state = self.action_space.sample()
self.time += 1
@abstractmethod
def _get_reward(self, actions):
raise NotImplementedError
class DiscreteIdentityEnv(IdentityEnv):
def __init__(
self,
dim,
episode_len=None,
):
self.action_space = Discrete(dim)
super().__init__(episode_len=episode_len)
def _get_reward(self, actions):
return 1 if self.state == actions else 0
class MultiDiscreteIdentityEnv(IdentityEnv):
def __init__(
self,
dims,
episode_len=None,
):
self.action_space = MultiDiscrete(dims)
super().__init__(episode_len=episode_len)
def _get_reward(self, actions):
return 1 if all(self.state == actions) else 0
class BoxIdentityEnv(IdentityEnv):
def __init__(
self,
shape,
episode_len=None,
):
self.action_space = Box(low=-1.0, high=1.0, shape=shape)
super().__init__(episode_len=episode_len)
def _get_reward(self, actions):
diff = actions - self.state
diff = diff[:]
return -0.5 * np.dot(diff, diff)

View File

@@ -0,0 +1,70 @@
import os.path as osp
import numpy as np
import tempfile
from gym import Env
from gym.spaces import Discrete, Box
class MnistEnv(Env):
def __init__(
self,
seed=0,
episode_len=None,
no_images=None
):
import filelock
from tensorflow.examples.tutorials.mnist import input_data
# we could use temporary directory for this with a context manager and
# TemporaryDirectory, but then each test that uses mnist would re-download the data
# this way the data is not cleaned up, but we only download it once per machine
mnist_path = osp.join(tempfile.gettempdir(), 'MNIST_data')
with filelock.FileLock(mnist_path + '.lock'):
self.mnist = input_data.read_data_sets(mnist_path)
self.np_random = np.random.RandomState()
self.np_random.seed(seed)
self.observation_space = Box(low=0.0, high=1.0, shape=(28,28,1))
self.action_space = Discrete(10)
self.episode_len = episode_len
self.time = 0
self.no_images = no_images
self.train_mode()
self.reset()
def reset(self):
self._choose_next_state()
self.time = 0
return self.state[0]
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
done = False
if self.episode_len and self.time >= self.episode_len:
rew = 0
done = True
return self.state[0], rew, done, {}
def train_mode(self):
self.dataset = self.mnist.train
def test_mode(self):
self.dataset = self.mnist.test
def _choose_next_state(self):
max_index = (self.no_images if self.no_images is not None else self.dataset.num_examples) - 1
index = self.np_random.randint(0, max_index)
image = self.dataset.images[index].reshape(28,28,1)*255
label = self.dataset.labels[index]
self.state = (image, label)
self.time += 1
def _get_reward(self, actions):
return 1 if self.state[1] == actions else 0

View File

@@ -0,0 +1,44 @@
import pytest
import gym
from baselines.run import get_learn_function
from baselines.common.tests.util import reward_per_episode_test
common_kwargs = dict(
total_timesteps=30000,
network='mlp',
gamma=1.0,
seed=0,
)
learn_kwargs = {
'a2c' : dict(nsteps=32, value_network='copy', lr=0.05),
'acer': dict(value_network='copy'),
'acktr': dict(nsteps=32, value_network='copy', is_async=False),
'deepq': dict(total_timesteps=20000),
'ppo2': dict(value_network='copy'),
'trpo_mpi': {}
}
@pytest.mark.slow
@pytest.mark.parametrize("alg", learn_kwargs.keys())
def test_cartpole(alg):
'''
Test if the algorithm (with an mlp policy)
can learn to balance the cartpole
'''
kwargs = common_kwargs.copy()
kwargs.update(learn_kwargs[alg])
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
def env_fn():
env = gym.make('CartPole-v0')
env.seed(0)
return env
reward_per_episode_test(env_fn, learn_fn, 100)
if __name__ == '__main__':
test_cartpole('acer')

View File

@@ -0,0 +1,48 @@
import pytest
try:
import mujoco_py
_mujoco_present = True
except BaseException:
mujoco_py = None
_mujoco_present = False
@pytest.mark.skipif(
not _mujoco_present,
reason='error loading mujoco - either mujoco / mujoco key not present, or LD_LIBRARY_PATH is not pointing to mujoco library'
)
def test_lstm_example():
import tensorflow as tf
from baselines.common import policies, models, cmd_util
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
# create vectorized environment
venv = DummyVecEnv([lambda: cmd_util.make_mujoco_env('Reacher-v2', seed=0)])
with tf.Session() as sess:
# build policy based on lstm network with 128 units
policy = policies.build_policy(venv, models.lstm(128))(nbatch=1, nsteps=1)
# initialize tensorflow variables
sess.run(tf.global_variables_initializer())
# prepare environment variables
ob = venv.reset()
state = policy.initial_state
done = [False]
step_counter = 0
# run a single episode until the end (i.e. until done)
while True:
action, _, state, _ = policy.step(ob, S=state, M=done)
ob, reward, done, _ = venv.step(action)
step_counter += 1
if done:
break
assert step_counter > 5

View File

@@ -0,0 +1,27 @@
import pytest
import gym
import tensorflow as tf
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.run import get_learn_function
from baselines.common.tf_util import make_session
algos = ['a2c', 'acer', 'acktr', 'deepq', 'ppo2', 'trpo_mpi']
@pytest.mark.parametrize('algo', algos)
def test_env_after_learn(algo):
def make_env():
# acktr requires too much RAM, fails on travis
env = gym.make('CartPole-v1' if algo == 'acktr' else 'PongNoFrameskip-v4')
return env
make_session(make_default=True, graph=tf.Graph())
env = SubprocVecEnv([make_env])
learn = get_learn_function(algo)
# Commenting out the following line resolves the issue, though crash happens at env.reset().
learn(network='mlp', env=env, total_timesteps=0, load_path=None, seed=None)
env.reset()
env.close()

View File

@@ -0,0 +1,39 @@
import pytest
import gym
from baselines.run import get_learn_function
from baselines.common.tests.util import reward_per_episode_test
pytest.importorskip('mujoco_py')
common_kwargs = dict(
network='mlp',
seed=0,
)
learn_kwargs = {
'her': dict(total_timesteps=2000)
}
@pytest.mark.slow
@pytest.mark.parametrize("alg", learn_kwargs.keys())
def test_fetchreach(alg):
'''
Test if the algorithm (with an mlp policy)
can learn the FetchReach task
'''
kwargs = common_kwargs.copy()
kwargs.update(learn_kwargs[alg])
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
def env_fn():
env = gym.make('FetchReach-v1')
env.seed(0)
return env
reward_per_episode_test(env_fn, learn_fn, -15)
if __name__ == '__main__':
test_fetchreach('her')

View File

@@ -0,0 +1,51 @@
import pytest
from baselines.common.tests.envs.fixed_sequence_env import FixedSequenceEnv
from baselines.common.tests.util import simple_test
from baselines.run import get_learn_function
common_kwargs = dict(
seed=0,
total_timesteps=50000,
)
learn_kwargs = {
'a2c': {},
'ppo2': dict(nsteps=10, ent_coef=0.0, nminibatches=1),
# TODO enable sequential models for trpo_mpi (proper handling of nbatch and nsteps)
# github issue: https://github.com/openai/baselines/issues/188
# 'trpo_mpi': lambda e, p: trpo_mpi.learn(policy_fn=p(env=e), env=e, max_timesteps=30000, timesteps_per_batch=100, cg_iters=10, gamma=0.9, lam=1.0, max_kl=0.001)
}
alg_list = learn_kwargs.keys()
rnn_list = ['lstm']
@pytest.mark.slow
@pytest.mark.parametrize("alg", alg_list)
@pytest.mark.parametrize("rnn", rnn_list)
def test_fixed_sequence(alg, rnn):
'''
Test if the algorithm (with a given policy)
can learn to reproduce a fixed sequence of actions
'''
kwargs = learn_kwargs[alg]
kwargs.update(common_kwargs)
episode_len = 5
env_fn = lambda: FixedSequenceEnv(10, episode_len=episode_len)
learn = lambda e: get_learn_function(alg)(
env=e,
network=rnn,
**kwargs
)
simple_test(env_fn, learn, 0.7)
if __name__ == '__main__':
test_fixed_sequence('ppo2', 'lstm')

View File

@@ -0,0 +1,75 @@
import pytest
from baselines.common.tests.envs.identity_env import DiscreteIdentityEnv, BoxIdentityEnv, MultiDiscreteIdentityEnv
from baselines.run import get_learn_function
from baselines.common.tests.util import simple_test
common_kwargs = dict(
total_timesteps=30000,
network='mlp',
gamma=0.9,
seed=0,
)
learn_kwargs = {
'a2c' : {},
'acktr': {},
'deepq': {},
'ddpg': dict(layer_norm=True),
'ppo2': dict(lr=1e-3, nsteps=64, ent_coef=0.0),
'trpo_mpi': dict(timesteps_per_batch=100, cg_iters=10, gamma=0.9, lam=1.0, max_kl=0.01)
}
algos_disc = ['a2c', 'acktr', 'deepq', 'ppo2', 'trpo_mpi']
algos_multidisc = ['a2c', 'acktr', 'ppo2', 'trpo_mpi']
algos_cont = ['a2c', 'acktr', 'ddpg', 'ppo2', 'trpo_mpi']
@pytest.mark.slow
@pytest.mark.parametrize("alg", algos_disc)
def test_discrete_identity(alg):
'''
Test if the algorithm (with an mlp policy)
can learn an identity transformation (i.e. return observation as an action)
'''
kwargs = learn_kwargs[alg]
kwargs.update(common_kwargs)
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
env_fn = lambda: DiscreteIdentityEnv(10, episode_len=100)
simple_test(env_fn, learn_fn, 0.9)
@pytest.mark.slow
@pytest.mark.parametrize("alg", algos_multidisc)
def test_multidiscrete_identity(alg):
'''
Test if the algorithm (with an mlp policy)
can learn an identity transformation (i.e. return observation as an action)
'''
kwargs = learn_kwargs[alg]
kwargs.update(common_kwargs)
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
env_fn = lambda: MultiDiscreteIdentityEnv((3,3), episode_len=100)
simple_test(env_fn, learn_fn, 0.9)
@pytest.mark.slow
@pytest.mark.parametrize("alg", algos_cont)
def test_continuous_identity(alg):
'''
Test if the algorithm (with an mlp policy)
can learn an identity transformation (i.e. return observation as an action)
to a required precision
'''
kwargs = learn_kwargs[alg]
kwargs.update(common_kwargs)
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
env_fn = lambda: BoxIdentityEnv((1,), episode_len=100)
simple_test(env_fn, learn_fn, -0.1)
if __name__ == '__main__':
test_multidiscrete_identity('acktr')

View File

@@ -0,0 +1,49 @@
import pytest
# from baselines.acer import acer_simple as acer
from baselines.common.tests.envs.mnist_env import MnistEnv
from baselines.common.tests.util import simple_test
from baselines.run import get_learn_function
# TODO investigate a2c and ppo2 failures - is it due to bad hyperparameters for this problem?
# GitHub issue https://github.com/openai/baselines/issues/189
common_kwargs = {
'seed': 0,
'network':'cnn',
'gamma':0.9,
'pad':'SAME'
}
learn_args = {
'a2c': dict(total_timesteps=50000),
'acer': dict(total_timesteps=20000),
'deepq': dict(total_timesteps=5000),
'acktr': dict(total_timesteps=30000),
'ppo2': dict(total_timesteps=50000, lr=1e-3, nsteps=128, ent_coef=0.0),
'trpo_mpi': dict(total_timesteps=80000, timesteps_per_batch=100, cg_iters=10, lam=1.0, max_kl=0.001)
}
#tests pass, but are too slow on travis. Same algorithms are covered
# by other tests with less compute-hungry nn's and by benchmarks
@pytest.mark.skip
@pytest.mark.slow
@pytest.mark.parametrize("alg", learn_args.keys())
def test_mnist(alg):
'''
Test if the algorithm can learn to classify MNIST digits.
Uses CNN policy.
'''
learn_kwargs = learn_args[alg]
learn_kwargs.update(common_kwargs)
learn = get_learn_function(alg)
learn_fn = lambda e: learn(env=e, **learn_kwargs)
env_fn = lambda: MnistEnv(seed=0, episode_len=100)
simple_test(env_fn, learn_fn, 0.6)
if __name__ == '__main__':
test_mnist('acer')

View File

@@ -0,0 +1,134 @@
import os
import gym
import tempfile
import pytest
import tensorflow as tf
import numpy as np
from baselines.common.tests.envs.mnist_env import MnistEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.run import get_learn_function
from baselines.common.tf_util import make_session, get_session
from functools import partial
learn_kwargs = {
'deepq': {},
'a2c': {},
'acktr': {},
'acer': {},
'ppo2': {'nminibatches': 1, 'nsteps': 10},
'trpo_mpi': {},
}
network_kwargs = {
'mlp': {},
'cnn': {'pad': 'SAME'},
'lstm': {},
'cnn_lnlstm': {'pad': 'SAME'}
}
@pytest.mark.parametrize("learn_fn", learn_kwargs.keys())
@pytest.mark.parametrize("network_fn", network_kwargs.keys())
def test_serialization(learn_fn, network_fn):
'''
Test if the trained model can be serialized
'''
if network_fn.endswith('lstm') and learn_fn in ['acer', 'acktr', 'trpo_mpi', 'deepq']:
# TODO make acktr work with recurrent policies
# and test
# github issue: https://github.com/openai/baselines/issues/660
return
env = DummyVecEnv([lambda: MnistEnv(10, episode_len=100)])
ob = env.reset().copy()
learn = get_learn_function(learn_fn)
kwargs = {}
kwargs.update(network_kwargs[network_fn])
kwargs.update(learn_kwargs[learn_fn])
learn = partial(learn, env=env, network=network_fn, seed=0, **kwargs)
with tempfile.TemporaryDirectory() as td:
model_path = os.path.join(td, 'serialization_test_model')
with tf.Graph().as_default(), make_session().as_default():
model = learn(total_timesteps=100)
model.save(model_path)
mean1, std1 = _get_action_stats(model, ob)
variables_dict1 = _serialize_variables()
with tf.Graph().as_default(), make_session().as_default():
model = learn(total_timesteps=0, load_path=model_path)
mean2, std2 = _get_action_stats(model, ob)
variables_dict2 = _serialize_variables()
for k, v in variables_dict1.items():
np.testing.assert_allclose(v, variables_dict2[k], atol=0.01,
err_msg='saved and loaded variable {} value mismatch'.format(k))
np.testing.assert_allclose(mean1, mean2, atol=0.5)
np.testing.assert_allclose(std1, std2, atol=0.5)
@pytest.mark.parametrize("learn_fn", learn_kwargs.keys())
@pytest.mark.parametrize("network_fn", ['mlp'])
def test_coexistence(learn_fn, network_fn):
'''
Test if more than one model can exist at a time
'''
if learn_fn == 'deepq':
# TODO enable multiple DQN models to be useable at the same time
# github issue https://github.com/openai/baselines/issues/656
return
if network_fn.endswith('lstm') and learn_fn in ['acktr', 'trpo_mpi', 'deepq']:
# TODO make acktr work with recurrent policies
# and test
# github issue: https://github.com/openai/baselines/issues/660
return
env = DummyVecEnv([lambda: gym.make('CartPole-v0')])
learn = get_learn_function(learn_fn)
kwargs = {}
kwargs.update(network_kwargs[network_fn])
kwargs.update(learn_kwargs[learn_fn])
learn = partial(learn, env=env, network=network_fn, total_timesteps=0, **kwargs)
make_session(make_default=True, graph=tf.Graph())
model1 = learn(seed=1)
make_session(make_default=True, graph=tf.Graph())
model2 = learn(seed=2)
model1.step(env.observation_space.sample())
model2.step(env.observation_space.sample())
def _serialize_variables():
sess = get_session()
variables = tf.trainable_variables()
values = sess.run(variables)
return {var.name: value for var, value in zip(variables, values)}
def _get_action_stats(model, ob):
ntrials = 1000
if model.initial_state is None or model.initial_state == []:
actions = np.array([model.step(ob)[0] for _ in range(ntrials)])
else:
actions = np.array([model.step(ob, S=model.initial_state, M=[False])[0] for _ in range(ntrials)])
mean = np.mean(actions, axis=0)
std = np.std(actions, axis=0)
return mean, std
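For reference, a standalone sketch of the save/load round trip these tests exercise; the choice of 'ppo2' and the tiny timestep budget are illustrative assumptions, not prescribed values.

# Illustrative round trip mirroring test_serialization above (assumed hyperparameters).
import os
import tempfile
import tensorflow as tf
from baselines.common.tests.envs.mnist_env import MnistEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.tf_util import make_session
from baselines.run import get_learn_function

env = DummyVecEnv([lambda: MnistEnv(10, episode_len=100)])
learn = get_learn_function('ppo2')
with tempfile.TemporaryDirectory() as td:
    model_path = os.path.join(td, 'model')
    # train briefly in one graph and save the variables
    with tf.Graph().as_default(), make_session().as_default():
        model = learn(env=env, network='mlp', nminibatches=1, nsteps=10,
                      seed=0, total_timesteps=100)
        model.save(model_path)
    # restore into a fresh graph without any further training
    with tf.Graph().as_default(), make_session().as_default():
        model = learn(env=env, network='mlp', nminibatches=1, nsteps=10,
                      seed=0, total_timesteps=0, load_path=model_path)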


@@ -3,67 +3,40 @@ import tensorflow as tf
from baselines.common.tf_util import (
function,
initialize,
set_value,
single_threaded_session
)
def test_set_value():
a = tf.Variable(42.)
with single_threaded_session():
set_value(a, 5)
assert a.eval() == 5
g = tf.get_default_graph()
g.finalize()
set_value(a, 6)
assert a.eval() == 6
# test the test
try:
assert a.eval() == 7
except AssertionError:
pass
else:
assert False, "assertion should have failed"
def test_function():
tf.reset_default_graph()
x = tf.placeholder(tf.int32, (), name="x")
y = tf.placeholder(tf.int32, (), name="y")
z = 3 * x + 2 * y
lin = function([x, y], z, givens={y: 0})
with tf.Graph().as_default():
x = tf.placeholder(tf.int32, (), name="x")
y = tf.placeholder(tf.int32, (), name="y")
z = 3 * x + 2 * y
lin = function([x, y], z, givens={y: 0})
with single_threaded_session():
initialize()
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(x=3) == 9
assert lin(2, 2) == 10
assert lin(x=2, y=3) == 12
assert lin(2) == 6
assert lin(x=3) == 9
assert lin(2, 2) == 10
assert lin(x=2, y=3) == 12
def test_multikwargs():
tf.reset_default_graph()
x = tf.placeholder(tf.int32, (), name="x")
with tf.variable_scope("other"):
x2 = tf.placeholder(tf.int32, (), name="x")
z = 3 * x + 2 * x2
with tf.Graph().as_default():
x = tf.placeholder(tf.int32, (), name="x")
with tf.variable_scope("other"):
x2 = tf.placeholder(tf.int32, (), name="x")
z = 3 * x + 2 * x2
lin = function([x, x2], z, givens={x2: 0})
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(2, 2) == 10
expt_caught = False
try:
lin(x=2)
except AssertionError:
expt_caught = True
assert expt_caught
lin = function([x, x2], z, givens={x2: 0})
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(2, 2) == 10
if __name__ == '__main__':
test_set_value()
test_function()
test_multikwargs()


@@ -0,0 +1,91 @@
import tensorflow as tf
import numpy as np
from gym.spaces import np_random
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
N_TRIALS = 10000
N_EPISODES = 100
def simple_test(env_fn, learn_fn, min_reward_fraction, n_trials=N_TRIALS):
np.random.seed(0)
np_random.seed(0)
env = DummyVecEnv([env_fn])
with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():
tf.set_random_seed(0)
model = learn_fn(env)
sum_rew = 0
done = True
for i in range(n_trials):
if done:
obs = env.reset()
state = model.initial_state
if state is not None:
a, v, state, _ = model.step(obs, S=state, M=[False])
else:
a, v, _, _ = model.step(obs)
obs, rew, done, _ = env.step(a)
sum_rew += float(rew)
print("Reward in {} trials is {}".format(n_trials, sum_rew))
assert sum_rew > min_reward_fraction * n_trials, \
'sum of rewards {} is less than {} of the total number of trials {}'.format(sum_rew, min_reward_fraction, n_trials)
def reward_per_episode_test(env_fn, learn_fn, min_avg_reward, n_trials=N_EPISODES):
env = DummyVecEnv([env_fn])
with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():
model = learn_fn(env)
        observations, actions, rewards = rollout(env, model, n_trials)
        rewards = [sum(r) for r in rewards]
        avg_rew = sum(rewards) / n_trials
print("Average reward in {} episodes is {}".format(n_trials, avg_rew))
assert avg_rew > min_avg_reward, \
'average reward in {} episodes ({}) is less than {}'.format(n_trials, avg_rew, min_avg_reward)
def rollout(env, model, n_trials):
rewards = []
actions = []
observations = []
for i in range(n_trials):
obs = env.reset()
state = model.initial_state if hasattr(model, 'initial_state') else None
episode_rew = []
episode_actions = []
episode_obs = []
while True:
if state is not None:
a, v, state, _ = model.step(obs, S=state, M=[False])
else:
                a, v, _, _ = model.step(obs)
obs, rew, done, _ = env.step(a)
episode_rew.append(rew)
episode_actions.append(a)
episode_obs.append(obs)
if done:
break
rewards.append(episode_rew)
actions.append(episode_actions)
observations.append(episode_obs)
return observations, actions, rewards
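Callers typically bind the algorithm and its hyperparameters into a one-argument learn function before handing it to simple_test, as the MNIST test at the top of this section does; the algorithm choice and values below are illustrative assumptions.

# Illustrative invocation of simple_test (assumed hyperparameters).
from baselines.common.tests.envs.mnist_env import MnistEnv
from baselines.run import get_learn_function

learn = get_learn_function('a2c')
# bind everything except env; simple_test will call learn_fn(env)
learn_fn = lambda e: learn(env=e, network='mlp', seed=0, total_timesteps=50000)
env_fn = lambda: MnistEnv(seed=0, episode_len=100)
# require reward on at least 60% of the steps, as in test_mnist above
simple_test(env_fn, learn_fn, 0.6)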


@@ -1,45 +1,11 @@
import joblib
import numpy as np
import tensorflow as tf # pylint: ignore-module
import builtins
import functools
import copy
import os
import functools
import collections
# ================================================================
# Make consistent with numpy
# ================================================================
clip = tf.clip_by_value
def sum(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_sum(x, axis=axis, keep_dims=keepdims)
def mean(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_mean(x, axis=axis, keep_dims=keepdims)
def var(x, axis=None, keepdims=False):
meanx = mean(x, axis=axis, keepdims=keepdims)
return mean(tf.square(x - meanx), axis=axis, keepdims=keepdims)
def std(x, axis=None, keepdims=False):
return tf.sqrt(var(x, axis=axis, keepdims=keepdims))
def max(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_max(x, axis=axis, keep_dims=keepdims)
def min(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_min(x, axis=axis, keep_dims=keepdims)
def concatenate(arrs, axis=0):
return tf.concat(axis=axis, values=arrs)
def argmax(x, axis=None):
return tf.argmax(x, axis=axis)
import multiprocessing
def switch(condition, then_expression, else_expression):
"""Switches between two operations depending on a scalar value (int or bool).
@@ -62,105 +28,11 @@ def switch(condition, then_expression, else_expression):
# Extras
# ================================================================
def l2loss(params):
if len(params) == 0:
return tf.constant(0.0)
else:
return tf.add_n([sum(tf.square(p)) for p in params])
def lrelu(x, leak=0.2):
f1 = 0.5 * (1 + leak)
f2 = 0.5 * (1 - leak)
return f1 * x + f2 * abs(x)
def categorical_sample_logits(X):
# https://github.com/tensorflow/tensorflow/issues/456
U = tf.random_uniform(tf.shape(X))
return argmax(X - tf.log(-tf.log(U)), axis=1)
# ================================================================
# Inputs
# ================================================================
def is_placeholder(x):
return type(x) is tf.Tensor and len(x.op.inputs) == 0
class TfInput(object):
def __init__(self, name="(unnamed)"):
"""Generalized Tensorflow placeholder. The main differences are:
- possibly uses multiple placeholders internally and returns multiple values
        - can apply light postprocessing to the value fed to the placeholder.
"""
self.name = name
def get(self):
"""Return the tf variable(s) representing the possibly postprocessed value
of placeholder(s).
"""
        raise NotImplementedError()
    def make_feed_dict(self, data):
        """Given data, input it to the placeholder(s)."""
        raise NotImplementedError()
class PlacholderTfInput(TfInput):
def __init__(self, placeholder):
"""Wrapper for regular tensorflow placeholder."""
super().__init__(placeholder.name)
self._placeholder = placeholder
def get(self):
return self._placeholder
def make_feed_dict(self, data):
return {self._placeholder: data}
class BatchInput(PlacholderTfInput):
def __init__(self, shape, dtype=tf.float32, name=None):
"""Creates a placeholder for a batch of tensors of a given shape and dtype
Parameters
----------
shape: [int]
            shape of a single element of the batch
dtype: tf.dtype
number representation used for tensor contents
name: str
name of the underlying placeholder
"""
super().__init__(tf.placeholder(dtype, [None] + list(shape), name=name))
class Uint8Input(PlacholderTfInput):
def __init__(self, shape, name=None):
"""Takes input in uint8 format which is cast to float32 and divided by 255
before passing it to the model.
On GPU this ensures lower data transfer times.
Parameters
----------
shape: [int]
shape of the tensor.
name: str
name of the underlying placeholder
"""
super().__init__(tf.placeholder(tf.uint8, [None] + list(shape), name=name))
self._shape = shape
self._output = tf.cast(super().get(), tf.float32) / 255.0
def get(self):
return self._output
def ensure_tf_input(thing):
"""Takes either tf.placeholder of TfInput and outputs equivalent TfInput"""
if isinstance(thing, TfInput):
return thing
elif is_placeholder(thing):
return PlacholderTfInput(thing)
else:
raise ValueError("Must be a placeholder or TfInput")
# ================================================================
# Mathematical utils
# ================================================================
@@ -173,39 +45,43 @@ def huber_loss(x, delta=1.0):
delta * (tf.abs(x) - 0.5 * delta)
)
# ================================================================
# Optimizer utils
# ================================================================
def minimize_and_clip(optimizer, objective, var_list, clip_val=10):
"""Minimized `objective` using `optimizer` w.r.t. variables in
`var_list` while ensure the norm of the gradients for each
variable is clipped to `clip_val`
"""
gradients = optimizer.compute_gradients(objective, var_list=var_list)
for i, (grad, var) in enumerate(gradients):
if grad is not None:
gradients[i] = (tf.clip_by_norm(grad, clip_val), var)
return optimizer.apply_gradients(gradients)
# ================================================================
# Global session
# ================================================================
def get_session():
"""Returns recently made Tensorflow session"""
return tf.get_default_session()
def get_session(config=None):
"""Get default session or create one with a given config"""
sess = tf.get_default_session()
if sess is None:
sess = make_session(config=config, make_default=True)
return sess
def make_session(num_cpu):
def make_session(config=None, num_cpu=None, make_default=False, graph=None):
"""Returns a session that will use <num_cpu> CPU's only"""
tf_config = tf.ConfigProto(
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
return tf.Session(config=tf_config)
if num_cpu is None:
num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
if config is None:
config = tf.ConfigProto(
allow_soft_placement=True,
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
config.gpu_options.allow_growth = True
if make_default:
return tf.InteractiveSession(config=config, graph=graph)
else:
return tf.Session(config=config, graph=graph)
def single_threaded_session():
"""Returns a session which will only use a single CPU"""
return make_session(1)
return make_session(num_cpu=1)
def in_session(f):
@functools.wraps(f)
def newfunc(*args, **kwargs):
with tf.Session():
f(*args, **kwargs)
return newfunc
ALREADY_INITIALIZED = set()
@@ -215,44 +91,14 @@ def initialize():
get_session().run(tf.variables_initializer(new_variables))
ALREADY_INITIALIZED.update(new_variables)
def eval(expr, feed_dict=None):
if feed_dict is None:
feed_dict = {}
return get_session().run(expr, feed_dict=feed_dict)
VALUE_SETTERS = collections.OrderedDict()
def set_value(v, val):
global VALUE_SETTERS
if v in VALUE_SETTERS:
set_op, set_endpoint = VALUE_SETTERS[v]
else:
set_endpoint = tf.placeholder(v.dtype)
set_op = v.assign(set_endpoint)
VALUE_SETTERS[v] = (set_op, set_endpoint)
get_session().run(set_op, feed_dict={set_endpoint: val})
# ================================================================
# Saving variables
# ================================================================
def load_state(fname):
saver = tf.train.Saver()
saver.restore(get_session(), fname)
def save_state(fname):
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(get_session(), fname)
# ================================================================
# Model components
# ================================================================
def normc_initializer(std=1.0):
def normc_initializer(std=1.0, axis=0):
def _initializer(shape, dtype=None, partition_info=None): # pylint: disable=W0613
out = np.random.randn(*shape).astype(np.float32)
out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
out = np.random.randn(*shape).astype(dtype.as_numpy_dtype)
out *= std / np.sqrt(np.square(out).sum(axis=axis, keepdims=True))
return tf.constant(out)
return _initializer
@@ -285,36 +131,6 @@ def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad="SAME",
return tf.nn.conv2d(x, w, stride_shape, pad) + b
def dense(x, size, name, weight_init=None, bias=True):
w = tf.get_variable(name + "/w", [x.get_shape()[1], size], initializer=weight_init)
ret = tf.matmul(x, w)
if bias:
b = tf.get_variable(name + "/b", [size], initializer=tf.zeros_initializer())
return ret + b
else:
return ret
def wndense(x, size, name, init_scale=1.0):
v = tf.get_variable(name + "/V", [int(x.get_shape()[1]), size],
initializer=tf.random_normal_initializer(0, 0.05))
g = tf.get_variable(name + "/g", [size], initializer=tf.constant_initializer(init_scale))
b = tf.get_variable(name + "/b", [size], initializer=tf.constant_initializer(0.0))
# use weight normalization (Salimans & Kingma, 2016)
x = tf.matmul(x, v)
scaler = g / tf.sqrt(sum(tf.square(v), axis=0, keepdims=True))
return tf.reshape(scaler, [1, size]) * x + tf.reshape(b, [1, size])
def densenobias(x, size, name, weight_init=None):
return dense(x, size, name, weight_init=weight_init, bias=False)
def dropout(x, pkeep, phase=None, mask=None):
mask = tf.floor(pkeep + tf.random_uniform(tf.shape(x))) if mask is None else mask
if phase is None:
return mask * x
else:
return switch(phase, mask * x, pkeep * x)
# ================================================================
# Theano-like Function
# ================================================================
@@ -344,11 +160,15 @@ def function(inputs, outputs, updates=None, givens=None):
Parameters
----------
inputs: [tf.placeholder or TfInput]
inputs: [tf.placeholder, tf.constant, or object with make_feed_dict method]
list of input arguments
outputs: [tf.Variable] or tf.Variable
list of outputs or a single output to be returned from function. Returned
value will also have the same shape.
updates: [tf.Operation] or tf.Operation
list of update functions or single update function that will be run whenever
the function is called. The return is ignored.
"""
if isinstance(outputs, list):
return _Function(inputs, outputs, updates, givens=givens)
@@ -359,183 +179,39 @@ def function(inputs, outputs, updates=None, givens=None):
f = _Function(inputs, [outputs], updates, givens=givens)
return lambda *args, **kwargs: f(*args, **kwargs)[0]
class _Function(object):
def __init__(self, inputs, outputs, updates, givens, check_nan=False):
def __init__(self, inputs, outputs, updates, givens):
for inpt in inputs:
if not issubclass(type(inpt), TfInput):
assert len(inpt.op.inputs) == 0, "inputs should all be placeholders of baselines.common.TfInput"
if not hasattr(inpt, 'make_feed_dict') and not (type(inpt) is tf.Tensor and len(inpt.op.inputs) == 0):
assert False, "inputs should all be placeholders, constants, or have a make_feed_dict method"
self.inputs = inputs
self.input_names = {inp.name.split("/")[-1].split(":")[0]: inp for inp in inputs}
updates = updates or []
self.update_group = tf.group(*updates)
self.outputs_update = list(outputs) + [self.update_group]
self.givens = {} if givens is None else givens
self.check_nan = check_nan
def _feed_input(self, feed_dict, inpt, value):
if issubclass(type(inpt), TfInput):
if hasattr(inpt, 'make_feed_dict'):
feed_dict.update(inpt.make_feed_dict(value))
elif is_placeholder(inpt):
feed_dict[inpt] = value
else:
feed_dict[inpt] = adjust_shape(inpt, value)
def __call__(self, *args, **kwargs):
assert len(args) <= len(self.inputs), "Too many arguments provided"
assert len(args) + len(kwargs) <= len(self.inputs), "Too many arguments provided"
feed_dict = {}
# Update feed dict with givens.
for inpt in self.givens:
feed_dict[inpt] = adjust_shape(inpt, feed_dict.get(inpt, self.givens[inpt]))
# Update the args
for inpt, value in zip(self.inputs, args):
self._feed_input(feed_dict, inpt, value)
# Update the kwargs
kwargs_passed_inpt_names = set()
for inpt in self.inputs[len(args):]:
inpt_name = inpt.name.split(':')[0]
inpt_name = inpt_name.split('/')[-1]
assert inpt_name not in kwargs_passed_inpt_names, \
"this function has two arguments with the same name \"{}\", so kwargs cannot be used.".format(inpt_name)
if inpt_name in kwargs:
kwargs_passed_inpt_names.add(inpt_name)
self._feed_input(feed_dict, inpt, kwargs.pop(inpt_name))
else:
assert inpt in self.givens, "Missing argument " + inpt_name
assert len(kwargs) == 0, "Function got extra arguments " + str(list(kwargs.keys()))
# Update feed dict with givens.
for inpt in self.givens:
feed_dict[inpt] = feed_dict.get(inpt, self.givens[inpt])
for inpt_name, value in kwargs.items():
self._feed_input(feed_dict, self.input_names[inpt_name], value)
results = get_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
if self.check_nan:
if any(np.isnan(r).any() for r in results):
raise RuntimeError("Nan detected")
return results
def mem_friendly_function(nondata_inputs, data_inputs, outputs, batch_size):
if isinstance(outputs, list):
return _MemFriendlyFunction(nondata_inputs, data_inputs, outputs, batch_size)
else:
f = _MemFriendlyFunction(nondata_inputs, data_inputs, [outputs], batch_size)
return lambda *inputs: f(*inputs)[0]
class _MemFriendlyFunction(object):
def __init__(self, nondata_inputs, data_inputs, outputs, batch_size):
self.nondata_inputs = nondata_inputs
self.data_inputs = data_inputs
self.outputs = list(outputs)
self.batch_size = batch_size
def __call__(self, *inputvals):
assert len(inputvals) == len(self.nondata_inputs) + len(self.data_inputs)
nondata_vals = inputvals[0:len(self.nondata_inputs)]
data_vals = inputvals[len(self.nondata_inputs):]
feed_dict = dict(zip(self.nondata_inputs, nondata_vals))
n = data_vals[0].shape[0]
for v in data_vals[1:]:
assert v.shape[0] == n
for i_start in range(0, n, self.batch_size):
slice_vals = [v[i_start:builtins.min(i_start + self.batch_size, n)] for v in data_vals]
for (var, val) in zip(self.data_inputs, slice_vals):
feed_dict[var] = val
results = tf.get_default_session().run(self.outputs, feed_dict=feed_dict)
if i_start == 0:
sum_results = results
else:
for i in range(len(results)):
sum_results[i] = sum_results[i] + results[i]
for i in range(len(results)):
sum_results[i] = sum_results[i] / n
return sum_results
# ================================================================
# Modules
# ================================================================
class Module(object):
def __init__(self, name):
self.name = name
self.first_time = True
self.scope = None
self.cache = {}
def __call__(self, *args):
if args in self.cache:
print("(%s) retrieving value from cache" % (self.name,))
return self.cache[args]
with tf.variable_scope(self.name, reuse=not self.first_time):
scope = tf.get_variable_scope().name
if self.first_time:
self.scope = scope
print("(%s) running function for the first time" % (self.name,))
else:
assert self.scope == scope, "Tried calling function with a different scope"
print("(%s) running function on new inputs" % (self.name,))
self.first_time = False
out = self._call(*args)
self.cache[args] = out
return out
def _call(self, *args):
raise NotImplementedError
@property
def trainable_variables(self):
assert self.scope is not None, "need to call module once before getting variables"
return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)
@property
def variables(self):
assert self.scope is not None, "need to call module once before getting variables"
return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, self.scope)
def module(name):
@functools.wraps
def wrapper(f):
class WrapperModule(Module):
def _call(self, *args):
return f(*args)
return WrapperModule(name)
return wrapper
# ================================================================
# Graph traversal
# ================================================================
VARIABLES = {}
def get_parents(node):
return node.op.inputs
def topsorted(outputs):
"""
Topological sort via non-recursive depth-first search
"""
assert isinstance(outputs, (list, tuple))
marks = {}
out = []
stack = [] # pylint: disable=W0621
# i: node
# jidx = number of children visited so far from that node
# marks: state of each node, which is one of
# 0: haven't visited
# 1: have visited, but not done visiting children
# 2: done visiting children
for x in outputs:
stack.append((x, 0))
while stack:
(i, jidx) = stack.pop()
if jidx == 0:
m = marks.get(i, 0)
if m == 0:
marks[i] = 1
elif m == 1:
raise ValueError("not a dag")
else:
continue
ps = get_parents(i)
if jidx == len(ps):
marks[i] = 2
out.append(i)
else:
stack.append((i, jidx + 1))
j = ps[jidx]
stack.append((j, 0))
return out
# ================================================================
# Flat vectors
# ================================================================
@@ -577,110 +253,185 @@ class SetFromFlat(object):
self.op = tf.group(*assigns)
def __call__(self, theta):
get_session().run(self.op, feed_dict={self.theta: theta})
tf.get_default_session().run(self.op, feed_dict={self.theta: theta})
class GetFlat(object):
def __init__(self, var_list):
self.op = tf.concat(axis=0, values=[tf.reshape(v, [numel(v)]) for v in var_list])
def __call__(self):
return get_session().run(self.op)
return tf.get_default_session().run(self.op)
# ================================================================
# Misc
# ================================================================
def flattenallbut0(x):
return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
def fancy_slice_2d(X, inds0, inds1):
"""
like numpy X[inds0, inds1]
XXX this implementation is bad
"""
inds0 = tf.cast(inds0, tf.int64)
inds1 = tf.cast(inds1, tf.int64)
shape = tf.cast(tf.shape(X), tf.int64)
ncols = shape[1]
Xflat = tf.reshape(X, [-1])
return tf.gather(Xflat, inds0 * ncols + inds1)
# ================================================================
# Scopes
# ================================================================
def scope_vars(scope, trainable_only=False):
"""
Get variables inside a scope
The scope can be specified as a string
Parameters
----------
scope: str or VariableScope
scope in which the variables reside.
trainable_only: bool
whether or not to return only the variables that were marked as trainable.
Returns
-------
vars: [tf.Variable]
list of variables in `scope`.
"""
return tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.GLOBAL_VARIABLES,
scope=scope if isinstance(scope, str) else scope.name
)
def scope_name():
"""Returns the name of current scope as a string, e.g. deepq/q_func"""
return tf.get_variable_scope().name
def absolute_scope_name(relative_scope_name):
"""Appends parent scope name to `relative_scope_name`"""
return scope_name() + "/" + relative_scope_name
def lengths_to_mask(lengths_b, max_length):
"""
Turns a vector of lengths into a boolean mask
Args:
lengths_b: an integer vector of lengths
max_length: maximum length to fill the mask
Returns:
a boolean array of shape (batch_size, max_length)
row[i] consists of True repeated lengths_b[i] times, followed by False
"""
lengths_b = tf.convert_to_tensor(lengths_b)
assert lengths_b.get_shape().ndims == 1
mask_bt = tf.expand_dims(tf.range(max_length), 0) < tf.expand_dims(lengths_b, 1)
return mask_bt
def in_session(f):
@functools.wraps(f)
def newfunc(*args, **kwargs):
with tf.Session():
f(*args, **kwargs)
return newfunc
# =============================================================
# TF placeholders management
# ============================================================
_PLACEHOLDER_CACHE = {} # name -> (placeholder, dtype, shape)
def get_placeholder(name, dtype, shape):
if name in _PLACEHOLDER_CACHE:
out, dtype1, shape1 = _PLACEHOLDER_CACHE[name]
assert dtype1 == dtype and shape1 == shape
return out
else:
out = tf.placeholder(dtype=dtype, shape=shape, name=name)
_PLACEHOLDER_CACHE[name] = (out, dtype, shape)
return out
if out.graph == tf.get_default_graph():
assert dtype1 == dtype and shape1 == shape, \
'Placeholder with name {} has already been registered and has shape {}, different from requested {}'.format(name, shape1, shape)
return out
out = tf.placeholder(dtype=dtype, shape=shape, name=name)
_PLACEHOLDER_CACHE[name] = (out, dtype, shape)
return out
def get_placeholder_cached(name):
return _PLACEHOLDER_CACHE[name][0]
def flattenallbut0(x):
return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
def reset():
global _PLACEHOLDER_CACHE
global VARIABLES
_PLACEHOLDER_CACHE = {}
VARIABLES = {}
tf.reset_default_graph()
# ================================================================
# Diagnostics
# ================================================================
def display_var_info(vars):
from baselines import logger
count_params = 0
for v in vars:
name = v.name
if "/Adam" in name or "beta1_power" in name or "beta2_power" in name: continue
v_params = np.prod(v.shape.as_list())
count_params += v_params
if "/b:" in name or "/bias" in name: continue # Wx+b, bias is not interesting to look at => count params, but not print
logger.info(" %s%s %i params %s" % (name, " "*(55-len(name)), v_params, str(v.shape)))
logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))
def get_available_gpus():
# recipe from here:
# https://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
return [x.name for x in local_device_protos if x.device_type == 'GPU']
# ================================================================
# Saving variables
# ================================================================
def load_state(fname, sess=None):
from baselines import logger
logger.warn('load_state method is deprecated, please use load_variables instead')
sess = sess or get_session()
saver = tf.train.Saver()
saver.restore(tf.get_default_session(), fname)
def save_state(fname, sess=None):
from baselines import logger
logger.warn('save_state method is deprecated, please use save_variables instead')
sess = sess or get_session()
dirname = os.path.dirname(fname)
if any(dirname):
os.makedirs(dirname, exist_ok=True)
saver = tf.train.Saver()
saver.save(tf.get_default_session(), fname)
# The methods above and below are clearly doing the same thing, and in a rather similar way
# TODO: ensure there is no subtle differences and remove one
def save_variables(save_path, variables=None, sess=None):
sess = sess or get_session()
variables = variables or tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
ps = sess.run(variables)
save_dict = {v.name: value for v, value in zip(variables, ps)}
dirname = os.path.dirname(save_path)
if any(dirname):
os.makedirs(dirname, exist_ok=True)
joblib.dump(save_dict, save_path)
def load_variables(load_path, variables=None, sess=None):
sess = sess or get_session()
variables = variables or tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
loaded_params = joblib.load(os.path.expanduser(load_path))
restores = []
if isinstance(loaded_params, list):
        assert len(loaded_params) == len(variables), 'number of loaded variables does not match len(variables)'
for d, v in zip(loaded_params, variables):
restores.append(v.assign(d))
else:
for v in variables:
restores.append(v.assign(loaded_params[v.name]))
sess.run(restores)
# ================================================================
# Shape adjustment for feeding into tf placeholders
# ================================================================
def adjust_shape(placeholder, data):
'''
adjust shape of the data to the shape of the placeholder if possible.
If shape is incompatible, AssertionError is thrown
Parameters:
placeholder tensorflow input placeholder
data input data to be (potentially) reshaped to be fed into placeholder
Returns:
reshaped data
'''
if not isinstance(data, np.ndarray) and not isinstance(data, list):
return data
if isinstance(data, list):
data = np.array(data)
placeholder_shape = [x or -1 for x in placeholder.shape.as_list()]
assert _check_shape(placeholder_shape, data.shape), \
'Shape of data {} is not compatible with shape of the placeholder {}'.format(data.shape, placeholder_shape)
return np.reshape(data, placeholder_shape)
def _check_shape(placeholder_shape, data_shape):
''' check if two shapes are compatible (i.e. differ only by dimensions of size 1, or by the batch dimension)'''
return True
squeezed_placeholder_shape = _squeeze_shape(placeholder_shape)
squeezed_data_shape = _squeeze_shape(data_shape)
for i, s_data in enumerate(squeezed_data_shape):
s_placeholder = squeezed_placeholder_shape[i]
if s_placeholder != -1 and s_data != s_placeholder:
return False
return True
def _squeeze_shape(shape):
return [x for x in shape if x != 1]
# ================================================================
# Tensorboard interfacing
# ================================================================
def launch_tensorboard_in_background(log_dir):
'''
To log the Tensorflow graph when using rl-algs
algorithms, you can run the following code
in your main script:
import threading, time
def start_tensorboard(session):
time.sleep(10) # Wait until graph is setup
tb_path = osp.join(logger.get_dir(), 'tb')
summary_writer = tf.summary.FileWriter(tb_path, graph=session.graph)
summary_op = tf.summary.merge_all()
launch_tensorboard_in_background(tb_path)
session = tf.get_default_session()
t = threading.Thread(target=start_tensorboard, args=([session]))
t.start()
'''
import subprocess
subprocess.Popen(['tensorboard', '--logdir', log_dir])
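As a pointer for readers skimming this file, a minimal usage sketch of the Theano-style `function` helper, mirroring test_function from the test file earlier in this diff.

# Sketch of tf_util.function usage (mirrors the unit test above).
import tensorflow as tf
from baselines.common.tf_util import function, initialize, single_threaded_session

with tf.Graph().as_default():
    x = tf.placeholder(tf.int32, (), name="x")
    y = tf.placeholder(tf.int32, (), name="y")
    z = 3 * x + 2 * y
    # y falls back to the given value 0 unless supplied by the caller
    lin = function([x, y], z, givens={y: 0})
    with single_threaded_session():
        initialize()
        assert lin(2) == 6          # positional args fill placeholders in order
        assert lin(x=2, y=3) == 12  # keyword args are matched by placeholder name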


@@ -0,0 +1,23 @@
import numpy as np
def tile_images(img_nhwc):
"""
Tile N images into one big PxQ image
(P,Q) are chosen to be as close as possible, and if N
is square, then P=Q.
input: img_nhwc, list or array of images, ndim=4 once turned into array
n = batch index, h = height, w = width, c = channel
returns:
bigim_HWc, ndarray with ndim=3
"""
img_nhwc = np.asarray(img_nhwc)
N, h, w, c = img_nhwc.shape
H = int(np.ceil(np.sqrt(N)))
W = int(np.ceil(float(N)/H))
img_nhwc = np.array(list(img_nhwc) + [img_nhwc[0]*0 for _ in range(N, H*W)])
img_HWhwc = img_nhwc.reshape(H, W, h, w, c)
img_HhWwc = img_HWhwc.transpose(0, 2, 1, 3, 4)
img_Hh_Ww_c = img_HhWwc.reshape(H*h, W*w, c)
return img_Hh_Ww_c
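An illustrative call: a batch of 7 frames is padded with blank images and tiled into a 3x3 grid (the frame size here is an arbitrary assumption).

# Illustrative use of tile_images.
import numpy as np

frames = np.zeros((7, 64, 64, 3), dtype=np.uint8)  # N, h, w, c
big = tile_images(frames)
assert big.shape == (3 * 64, 3 * 64, 3)            # H*h, W*w, c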


@@ -1,19 +1,185 @@
from abc import ABC, abstractmethod
from baselines.common.tile_images import tile_images
class VecEnv(object):
    """
    Vectorized environment base class
    """
    def step(self, vac):
        """
        Apply sequence of actions to sequence of environments
        actions -> (observations, rewards, news)
        where 'news' is a boolean vector indicating whether each element is new.
        """
        raise NotImplementedError
class AlreadySteppingError(Exception):
"""
Raised when an asynchronous step is running while
step_async() is called again.
"""
def __init__(self):
msg = 'already running an async step'
Exception.__init__(self, msg)
class NotSteppingError(Exception):
"""
Raised when an asynchronous step is not running but
step_wait() is called.
"""
def __init__(self):
msg = 'not running an async step'
Exception.__init__(self, msg)
class VecEnv(ABC):
"""
An abstract asynchronous, vectorized environment.
    Used to batch data from multiple copies of an environment, so that
    each observation becomes a batch of observations, and the expected
    action is a batch of actions to be applied per-environment.
"""
closed = False
viewer = None
metadata = {
'render.modes': ['human', 'rgb_array']
}
def __init__(self, num_envs, observation_space, action_space):
self.num_envs = num_envs
self.observation_space = observation_space
self.action_space = action_space
@abstractmethod
def reset(self):
"""
Reset all environments
Reset all the environments and return an array of
observations, or a dict of observation arrays.
If step_async is still doing work, that work will
be cancelled and step_wait() should not be called
until step_async() is invoked again.
"""
pass
@abstractmethod
def step_async(self, actions):
"""
Tell all the environments to start taking a step
with the given actions.
Call step_wait() to get the results of the step.
You should not call this if a step_async run is
already pending.
"""
pass
@abstractmethod
def step_wait(self):
"""
Wait for the step taken with step_async().
Returns (obs, rews, dones, infos):
- obs: an array of observations, or a dict of
arrays of observations.
- rews: an array of rewards
- dones: an array of "episode done" booleans
- infos: a sequence of info objects
"""
pass
def close_extras(self):
"""
Clean up the extra resources, beyond what's in this base class.
Only runs when not self.closed.
"""
pass
def close(self):
if self.closed:
return
if self.viewer is not None:
self.viewer.close()
self.close_extras()
self.closed = True
def step(self, actions):
"""
Step the environments synchronously.
This is available for backwards compatibility.
"""
self.step_async(actions)
return self.step_wait()
def render(self, mode='human'):
imgs = self.get_images()
bigimg = tile_images(imgs)
if mode == 'human':
self.get_viewer().imshow(bigimg)
return self.get_viewer().isopen
elif mode == 'rgb_array':
return bigimg
else:
raise NotImplementedError
def get_images(self):
"""
Return RGB images from each environment
"""
raise NotImplementedError
@property
def unwrapped(self):
if isinstance(self, VecEnvWrapper):
return self.venv.unwrapped
else:
return self
def get_viewer(self):
if self.viewer is None:
from gym.envs.classic_control import rendering
self.viewer = rendering.SimpleImageViewer()
return self.viewer
class VecEnvWrapper(VecEnv):
"""
An environment wrapper that applies to an entire batch
of environments at once.
"""
def __init__(self, venv, observation_space=None, action_space=None):
self.venv = venv
VecEnv.__init__(self,
num_envs=venv.num_envs,
observation_space=observation_space or venv.observation_space,
action_space=action_space or venv.action_space)
def step_async(self, actions):
self.venv.step_async(actions)
@abstractmethod
def reset(self):
pass
@abstractmethod
def step_wait(self):
pass
def close(self):
pass
return self.venv.close()
def render(self, mode='human'):
return self.venv.render(mode=mode)
def get_images(self):
return self.venv.get_images()
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)
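As an illustration of the wrapper interface above, a hypothetical reward-scaling wrapper (not part of baselines) only has to implement reset() and step_wait(); everything else is inherited from VecEnvWrapper.

# Hypothetical example wrapper: multiply every reward by a constant factor.
class VecRewardScale(VecEnvWrapper):
    def __init__(self, venv, scale=0.1):
        VecEnvWrapper.__init__(self, venv)
        self.scale = scale
    def reset(self):
        return self.venv.reset()
    def step_wait(self):
        obs, rews, dones, infos = self.venv.step_wait()
        return obs, rews * self.scale, dones, infos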


@@ -0,0 +1,82 @@
import numpy as np
from gym import spaces
from . import VecEnv
from .util import copy_obs_dict, dict_to_obs, obs_space_info
class DummyVecEnv(VecEnv):
"""
    VecEnv that runs multiple environments sequentially, that is,
    the step and reset commands are sent to one environment at a time.
    Useful for debugging and when num_env == 1 (in the latter case,
    it avoids communication overhead).
"""
def __init__(self, env_fns):
"""
Arguments:
        env_fns: iterable of callables - functions that build environments
"""
self.envs = [fn() for fn in env_fns]
env = self.envs[0]
VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
obs_space = env.observation_space
self.keys, shapes, dtypes = obs_space_info(obs_space)
self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool)
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
self.buf_infos = [{} for _ in range(self.num_envs)]
self.actions = None
self.specs = [e.spec for e in self.envs]
def step_async(self, actions):
listify = True
try:
if len(actions) == self.num_envs:
listify = False
except TypeError:
pass
if not listify:
self.actions = actions
else:
assert self.num_envs == 1, "actions {} is either not a list or has a wrong size - cannot match to {} environments".format(actions, self.num_envs)
self.actions = [actions]
def step_wait(self):
for e in range(self.num_envs):
action = self.actions[e]
if isinstance(self.envs[e].action_space, spaces.Discrete):
action = int(action)
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(action)
if self.buf_dones[e]:
obs = self.envs[e].reset()
self._save_obs(e, obs)
return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones),
self.buf_infos.copy())
def reset(self):
for e in range(self.num_envs):
obs = self.envs[e].reset()
self._save_obs(e, obs)
return self._obs_from_buf()
def _save_obs(self, e, obs):
for k in self.keys:
if k is None:
self.buf_obs[k][e] = obs
else:
self.buf_obs[k][e] = obs[k]
def _obs_from_buf(self):
return dict_to_obs(copy_obs_dict(self.buf_obs))
def get_images(self):
return [env.render(mode='rgb_array') for env in self.envs]
def render(self, mode='human'):
if self.num_envs == 1:
return self.envs[0].render(mode=mode)
else:
return super().render(mode=mode)
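A minimal usage sketch: a single CartPole instance exposed through the vectorized interface, so observations and rewards gain a leading batch dimension of size 1 (the environment choice is an illustrative assumption).

# Illustrative use of DummyVecEnv.
import gym
import numpy as np

venv = DummyVecEnv([lambda: gym.make('CartPole-v0')])
obs = venv.reset()                                  # shape (1, 4)
obs, rews, dones, infos = venv.step(np.array([0]))  # one action per environment
venv.close()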


@@ -0,0 +1,138 @@
"""
An interface for asynchronous vectorized environments.
"""
from multiprocessing import Pipe, Array, Process
import numpy as np
from . import VecEnv, CloudpickleWrapper
import ctypes
from baselines import logger
from .util import dict_to_obs, obs_space_info, obs_to_dict
_NP_TO_CT = {np.float32: ctypes.c_float,
np.int32: ctypes.c_int32,
np.int8: ctypes.c_int8,
np.uint8: ctypes.c_char,
np.bool: ctypes.c_bool}
class ShmemVecEnv(VecEnv):
"""
Optimized version of SubprocVecEnv that uses shared variables to communicate observations.
"""
def __init__(self, env_fns, spaces=None):
"""
        If you don't specify the observation and action spaces, we'll have to create
        a dummy environment to get them.
"""
if spaces:
observation_space, action_space = spaces
else:
logger.log('Creating dummy env object to get spaces')
with logger.scoped_configure(format_strs=[]):
dummy = env_fns[0]()
observation_space, action_space = dummy.observation_space, dummy.action_space
dummy.close()
del dummy
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
self.obs_keys, self.obs_shapes, self.obs_dtypes = obs_space_info(observation_space)
self.obs_bufs = [
{k: Array(_NP_TO_CT[self.obs_dtypes[k].type], int(np.prod(self.obs_shapes[k]))) for k in self.obs_keys}
for _ in env_fns]
self.parent_pipes = []
self.procs = []
for env_fn, obs_buf in zip(env_fns, self.obs_bufs):
wrapped_fn = CloudpickleWrapper(env_fn)
parent_pipe, child_pipe = Pipe()
proc = Process(target=_subproc_worker,
args=(child_pipe, parent_pipe, wrapped_fn, obs_buf, self.obs_shapes, self.obs_dtypes, self.obs_keys))
proc.daemon = True
self.procs.append(proc)
self.parent_pipes.append(parent_pipe)
proc.start()
child_pipe.close()
self.waiting_step = False
self.specs = [f().spec for f in env_fns]
self.viewer = None
def reset(self):
if self.waiting_step:
logger.warn('Called reset() while waiting for the step to complete')
self.step_wait()
for pipe in self.parent_pipes:
pipe.send(('reset', None))
return self._decode_obses([pipe.recv() for pipe in self.parent_pipes])
def step_async(self, actions):
assert len(actions) == len(self.parent_pipes)
for pipe, act in zip(self.parent_pipes, actions):
pipe.send(('step', act))
def step_wait(self):
outs = [pipe.recv() for pipe in self.parent_pipes]
obs, rews, dones, infos = zip(*outs)
return self._decode_obses(obs), np.array(rews), np.array(dones), infos
def close_extras(self):
if self.waiting_step:
self.step_wait()
for pipe in self.parent_pipes:
pipe.send(('close', None))
for pipe in self.parent_pipes:
pipe.recv()
pipe.close()
for proc in self.procs:
proc.join()
def get_images(self, mode='human'):
for pipe in self.parent_pipes:
pipe.send(('render', None))
return [pipe.recv() for pipe in self.parent_pipes]
def _decode_obses(self, obs):
result = {}
for k in self.obs_keys:
bufs = [b[k] for b in self.obs_bufs]
o = [np.frombuffer(b.get_obj(), dtype=self.obs_dtypes[k]).reshape(self.obs_shapes[k]) for b in bufs]
result[k] = np.array(o)
return dict_to_obs(result)
def _subproc_worker(pipe, parent_pipe, env_fn_wrapper, obs_bufs, obs_shapes, obs_dtypes, keys):
"""
Control a single environment instance using IPC and
shared memory.
"""
def _write_obs(maybe_dict_obs):
flatdict = obs_to_dict(maybe_dict_obs)
for k in keys:
dst = obs_bufs[k].get_obj()
dst_np = np.frombuffer(dst, dtype=obs_dtypes[k]).reshape(obs_shapes[k]) # pylint: disable=W0212
np.copyto(dst_np, flatdict[k])
env = env_fn_wrapper.x()
parent_pipe.close()
try:
while True:
cmd, data = pipe.recv()
if cmd == 'reset':
pipe.send(_write_obs(env.reset()))
elif cmd == 'step':
obs, reward, done, info = env.step(data)
if done:
obs = env.reset()
pipe.send((_write_obs(obs), reward, done, info))
elif cmd == 'render':
pipe.send(env.render(mode='rgb_array'))
elif cmd == 'close':
pipe.send(None)
break
else:
raise RuntimeError('Got unrecognized cmd %s' % cmd)
except KeyboardInterrupt:
print('ShmemVecEnv worker: got KeyboardInterrupt')
finally:
env.close()
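A usage sketch passing the spaces argument up front; PongNoFrameskip-v4 is only an illustrative choice (its uint8 observations map onto the supported ctypes above) and requires the Atari dependencies.

# Illustrative use of ShmemVecEnv with explicit spaces.
import gym

def make_env():
    return gym.make('PongNoFrameskip-v4')

probe = make_env()
spaces = (probe.observation_space, probe.action_space)
probe.close()

venv = ShmemVecEnv([make_env for _ in range(4)], spaces=spaces)
obs = venv.reset()                            # (4, 210, 160, 3) uint8 batch
obs, rews, dones, infos = venv.step([0] * 4)  # step() = step_async() + step_wait()
venv.close()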


@@ -1,94 +1,114 @@
import numpy as np
from multiprocessing import Process, Pipe
from baselines.common.vec_env import VecEnv
from . import VecEnv, CloudpickleWrapper
def worker(remote, parent_remote, env_fn_wrapper):
parent_remote.close()
env = env_fn_wrapper.x()
while True:
cmd, data = remote.recv()
if cmd == 'step':
ob, reward, done, info = env.step(data)
if done:
try:
while True:
cmd, data = remote.recv()
if cmd == 'step':
ob, reward, done, info = env.step(data)
if done:
ob = env.reset()
remote.send((ob, reward, done, info))
elif cmd == 'reset':
ob = env.reset()
remote.send((ob, reward, done, info))
elif cmd == 'reset':
ob = env.reset()
remote.send(ob)
elif cmd == 'reset_task':
ob = env.reset_task()
remote.send(ob)
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces':
remote.send((env.action_space, env.observation_space))
else:
raise NotImplementedError
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)
remote.send(ob)
elif cmd == 'render':
remote.send(env.render(mode='rgb_array'))
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces':
remote.send((env.observation_space, env.action_space))
else:
raise NotImplementedError
except KeyboardInterrupt:
print('SubprocVecEnv worker: got KeyboardInterrupt')
finally:
env.close()
class SubprocVecEnv(VecEnv):
def __init__(self, env_fns):
"""
    VecEnv that runs multiple environments in parallel in subprocesses and communicates with them via pipes.
Recommended to use when num_envs > 1 and step() can be a bottleneck.
"""
def __init__(self, env_fns, spaces=None):
"""
envs: list of gym environments to run in subprocesses
Arguments:
env_fns: iterable of callables - functions that create environments to run in subprocesses. Need to be cloud-pickleable
"""
self.waiting = False
self.closed = False
nenvs = len(env_fns)
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for p in self.ps:
p.daemon = True # if the main process crashes, we should not cause things to hang
p.daemon = True # if the main process crashes, we should not cause things to hang
p.start()
for remote in self.work_remotes:
remote.close()
self.remotes[0].send(('get_spaces', None))
self.action_space, self.observation_space = self.remotes[0].recv()
observation_space, action_space = self.remotes[0].recv()
self.viewer = None
self.specs = [f().spec for f in env_fns]
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
def step(self, actions):
def step_async(self, actions):
self._assert_not_closed()
for remote, action in zip(self.remotes, actions):
remote.send(('step', action))
self.waiting = True
def step_wait(self):
self._assert_not_closed()
results = [remote.recv() for remote in self.remotes]
self.waiting = False
obs, rews, dones, infos = zip(*results)
return np.stack(obs), np.stack(rews), np.stack(dones), infos
return _flatten_obs(obs), np.stack(rews), np.stack(dones), infos
def reset(self):
self._assert_not_closed()
for remote in self.remotes:
remote.send(('reset', None))
return np.stack([remote.recv() for remote in self.remotes])
def reset_task(self):
for remote in self.remotes:
remote.send(('reset_task', None))
return np.stack([remote.recv() for remote in self.remotes])
def close(self):
if self.closed:
return
return _flatten_obs([remote.recv() for remote in self.remotes])
def close_extras(self):
self.closed = True
if self.waiting:
for remote in self.remotes:
remote.recv()
for remote in self.remotes:
remote.send(('close', None))
for p in self.ps:
p.join()
self.closed = True
@property
def num_envs(self):
return len(self.remotes)
def get_images(self):
self._assert_not_closed()
for pipe in self.remotes:
pipe.send(('render', None))
imgs = [pipe.recv() for pipe in self.remotes]
return imgs
def _assert_not_closed(self):
assert not self.closed, "Trying to operate on a SubprocVecEnv after calling close()"
def _flatten_obs(obs):
assert isinstance(obs, list) or isinstance(obs, tuple)
assert len(obs) > 0
if isinstance(obs[0], dict):
import collections
assert isinstance(obs, collections.OrderedDict)
keys = obs[0].keys()
return {k: np.stack([o[k] for o in obs]) for k in keys}
else:
return np.stack(obs)
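A usage sketch with several CartPole copies; the seeding closure and the environment choice are illustrative assumptions, and the thunks must be cloud-pickleable, as the constructor docstring notes.

# Illustrative use of SubprocVecEnv.
import gym
import numpy as np

def make_env(seed):
    def _thunk():
        env = gym.make('CartPole-v0')
        env.seed(seed)
        return env
    return _thunk

venv = SubprocVecEnv([make_env(i) for i in range(8)])
obs = venv.reset()                                           # shape (8, 4)
obs, rews, dones, infos = venv.step(np.zeros(8, dtype=np.int64))
venv.close()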


@@ -0,0 +1,101 @@
"""
Tests for asynchronous vectorized environments.
"""
import gym
import numpy as np
import pytest
from .dummy_vec_env import DummyVecEnv
from .shmem_vec_env import ShmemVecEnv
from .subproc_vec_env import SubprocVecEnv
def assert_envs_equal(env1, env2, num_steps):
"""
Compare two environments over num_steps steps and make sure
that the observations produced by each are the same when given
the same actions.
"""
assert env1.num_envs == env2.num_envs
assert env1.action_space.shape == env2.action_space.shape
assert env1.action_space.dtype == env2.action_space.dtype
joint_shape = (env1.num_envs,) + env1.action_space.shape
try:
obs1, obs2 = env1.reset(), env2.reset()
assert np.array(obs1).shape == np.array(obs2).shape
assert np.array(obs1).shape == joint_shape
assert np.allclose(obs1, obs2)
np.random.seed(1337)
for _ in range(num_steps):
actions = np.array(np.random.randint(0, 0x100, size=joint_shape),
dtype=env1.action_space.dtype)
for env in [env1, env2]:
env.step_async(actions)
outs1 = env1.step_wait()
outs2 = env2.step_wait()
for out1, out2 in zip(outs1[:3], outs2[:3]):
assert np.array(out1).shape == np.array(out2).shape
assert np.allclose(out1, out2)
assert list(outs1[3]) == list(outs2[3])
finally:
env1.close()
env2.close()
@pytest.mark.parametrize('klass', (ShmemVecEnv, SubprocVecEnv))
@pytest.mark.parametrize('dtype', ('uint8', 'float32'))
def test_vec_env(klass, dtype): # pylint: disable=R0914
"""
Test that a vectorized environment is equivalent to
    DummyVecEnv, since DummyVecEnv is less error prone.
"""
num_envs = 3
num_steps = 100
shape = (3, 8)
def make_fn(seed):
"""
Get an environment constructor with a seed.
"""
return lambda: SimpleEnv(seed, shape, dtype)
fns = [make_fn(i) for i in range(num_envs)]
env1 = DummyVecEnv(fns)
env2 = klass(fns)
assert_envs_equal(env1, env2, num_steps=num_steps)
class SimpleEnv(gym.Env):
"""
An environment with a pre-determined observation space
and RNG seed.
"""
def __init__(self, seed, shape, dtype):
np.random.seed(seed)
self._dtype = dtype
self._start_obs = np.array(np.random.randint(0, 0x100, size=shape),
dtype=dtype)
self._max_steps = seed + 1
self._cur_obs = None
self._cur_step = 0
# this is 0xFF instead of 0x100 because the Box space includes
# the high end, while randint does not
self.action_space = gym.spaces.Box(low=0, high=0xFF, shape=shape, dtype=dtype)
self.observation_space = self.action_space
def step(self, action):
self._cur_obs += np.array(action, dtype=self._dtype)
self._cur_step += 1
done = self._cur_step >= self._max_steps
reward = self._cur_step / self._max_steps
return self._cur_obs, reward, done, {'foo': 'bar' + str(reward)}
def reset(self):
self._cur_obs = self._start_obs
self._cur_step = 0
return self._cur_obs
def render(self, mode=None):
raise NotImplementedError


@@ -0,0 +1,49 @@
"""
Tests for asynchronous vectorized environments.
"""
import gym
import pytest
import os
import glob
import tempfile
from .dummy_vec_env import DummyVecEnv
from .shmem_vec_env import ShmemVecEnv
from .subproc_vec_env import SubprocVecEnv
from .vec_video_recorder import VecVideoRecorder
@pytest.mark.parametrize('klass', (DummyVecEnv, ShmemVecEnv, SubprocVecEnv))
@pytest.mark.parametrize('num_envs', (1, 4))
@pytest.mark.parametrize('video_length', (10, 100))
@pytest.mark.parametrize('video_interval', (1, 50))
def test_video_recorder(klass, num_envs, video_length, video_interval):
"""
    Wrap an existing VecEnv with VecVideoRecorder,
    make (video_interval + video_length + 1) steps,
    then check that the recorded video files are present.
"""
def make_fn():
env = gym.make('PongNoFrameskip-v4')
return env
fns = [make_fn for _ in range(num_envs)]
env = klass(fns)
with tempfile.TemporaryDirectory() as video_path:
env = VecVideoRecorder(env, video_path, record_video_trigger=lambda x: x % video_interval == 0, video_length=video_length)
env.reset()
for _ in range(video_interval + video_length + 1):
env.step([0] * num_envs)
env.close()
recorded_video = glob.glob(os.path.join(video_path, "*.mp4"))
# first and second step
assert len(recorded_video) == 2
# Files are not empty
assert all(os.stat(p).st_size != 0 for p in recorded_video)


@@ -0,0 +1,59 @@
"""
Helpers for dealing with vectorized environments.
"""
from collections import OrderedDict
import gym
import numpy as np
def copy_obs_dict(obs):
"""
Deep-copy an observation dict.
"""
return {k: np.copy(v) for k, v in obs.items()}
def dict_to_obs(obs_dict):
"""
Convert an observation dict into a raw array if the
original observation space was not a Dict space.
"""
if set(obs_dict.keys()) == {None}:
return obs_dict[None]
return obs_dict
def obs_space_info(obs_space):
"""
Get dict-structured information about a gym.Space.
Returns:
A tuple (keys, shapes, dtypes):
keys: a list of dict keys.
shapes: a dict mapping keys to shapes.
dtypes: a dict mapping keys to dtypes.
"""
if isinstance(obs_space, gym.spaces.Dict):
assert isinstance(obs_space.spaces, OrderedDict)
subspaces = obs_space.spaces
else:
subspaces = {None: obs_space}
keys = []
shapes = {}
dtypes = {}
for key, box in subspaces.items():
keys.append(key)
shapes[key] = box.shape
dtypes[key] = box.dtype
return keys, shapes, dtypes
def obs_to_dict(obs):
"""
Convert an observation into a dict.
"""
if isinstance(obs, dict):
return obs
return {None: obs}
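An illustrative call to obs_space_info on a Dict observation space (the key names are arbitrary assumptions); a non-Dict space comes back under the single key None.

# Illustrative use of obs_space_info.
import gym
import numpy as np

space = gym.spaces.Dict({
    'goal': gym.spaces.Discrete(5),
    'position': gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32),
})
keys, shapes, dtypes = obs_space_info(space)
# keys   == ['goal', 'position']
# shapes == {'goal': (), 'position': (3,)}
# dtypes == {'goal': <integer dtype>, 'position': float32}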


@@ -0,0 +1,30 @@
from . import VecEnvWrapper
import numpy as np
from gym import spaces
class VecFrameStack(VecEnvWrapper):
def __init__(self, venv, nstack):
self.venv = venv
self.nstack = nstack
wos = venv.observation_space # wrapped ob space
low = np.repeat(wos.low, self.nstack, axis=-1)
high = np.repeat(wos.high, self.nstack, axis=-1)
self.stackedobs = np.zeros((venv.num_envs,) + low.shape, low.dtype)
observation_space = spaces.Box(low=low, high=high, dtype=venv.observation_space.dtype)
VecEnvWrapper.__init__(self, venv, observation_space=observation_space)
def step_wait(self):
obs, rews, news, infos = self.venv.step_wait()
self.stackedobs = np.roll(self.stackedobs, shift=-1, axis=-1)
for (i, new) in enumerate(news):
if new:
self.stackedobs[i] = 0
self.stackedobs[..., -obs.shape[-1]:] = obs
return self.stackedobs, rews, news, infos
def reset(self):
obs = self.venv.reset()
self.stackedobs[...] = 0
self.stackedobs[..., -obs.shape[-1]:] = obs
return self.stackedobs
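A usage sketch stacking the last 4 frames along the channel axis; the Atari environment is an illustrative assumption and requires the Atari dependencies.

# Illustrative use of VecFrameStack.
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = DummyVecEnv([lambda: gym.make('PongNoFrameskip-v4')])
stacked = VecFrameStack(venv, nstack=4)
obs = stacked.reset()   # channel axis grows 4x: (1, 210, 160, 12) for raw Pong frames
stacked.close()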


@@ -0,0 +1,37 @@
from . import VecEnvWrapper
from baselines.bench.monitor import ResultsWriter
import numpy as np
import time
class VecMonitor(VecEnvWrapper):
def __init__(self, venv, filename=None):
VecEnvWrapper.__init__(self, venv)
self.eprets = None
self.eplens = None
self.tstart = time.time()
self.results_writer = ResultsWriter(filename, header={'t_start': self.tstart})
def reset(self):
obs = self.venv.reset()
self.eprets = np.zeros(self.num_envs, 'f')
self.eplens = np.zeros(self.num_envs, 'i')
return obs
def step_wait(self):
obs, rews, dones, infos = self.venv.step_wait()
self.eprets += rews
self.eplens += 1
newinfos = []
for (i, (done, ret, eplen, info)) in enumerate(zip(dones, self.eprets, self.eplens, infos)):
info = info.copy()
if done:
epinfo = {'r': ret, 'l': eplen, 't': round(time.time() - self.tstart, 6)}
info['episode'] = epinfo
self.eprets[i] = 0
self.eplens[i] = 0
self.results_writer.write_row(epinfo)
newinfos.append(info)
return obs, rews, dones, newinfos
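A usage sketch: when an episode in any sub-environment finishes, its return and length show up under info['episode']. The environment choice and loop length are arbitrary assumptions, and the sketch relies on the filename=None default above.

# Illustrative use of VecMonitor.
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = VecMonitor(DummyVecEnv([lambda: gym.make('CartPole-v0')]))
obs = venv.reset()
for _ in range(500):
    obs, rews, dones, infos = venv.step([venv.action_space.sample()])
    for info in infos:
        if 'episode' in info:
            print('return %.1f, length %d' % (info['episode']['r'], info['episode']['l']))
venv.close()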


@@ -0,0 +1,43 @@
from . import VecEnvWrapper
from baselines.common.running_mean_std import RunningMeanStd
import numpy as np
class VecNormalize(VecEnvWrapper):
"""
A vectorized wrapper that normalizes the observations
and returns from an environment.
"""
def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
VecEnvWrapper.__init__(self, venv)
self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
self.ret_rms = RunningMeanStd(shape=()) if ret else None
self.clipob = clipob
self.cliprew = cliprew
self.ret = np.zeros(self.num_envs)
self.gamma = gamma
self.epsilon = epsilon
def step_wait(self):
obs, rews, news, infos = self.venv.step_wait()
self.ret = self.ret * self.gamma + rews
obs = self._obfilt(obs)
if self.ret_rms:
self.ret_rms.update(self.ret)
rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
self.ret[news] = 0.
return obs, rews, news, infos
def _obfilt(self, obs):
if self.ob_rms:
self.ob_rms.update(obs)
obs = np.clip((obs - self.ob_rms.mean) / np.sqrt(self.ob_rms.var + self.epsilon), -self.clipob, self.clipob)
return obs
else:
return obs
def reset(self):
self.ret = np.zeros(self.num_envs)
obs = self.venv.reset()
return self._obfilt(obs)
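A usage sketch on a continuous-control task; Pendulum-v0 is an arbitrary illustrative choice. The running statistics keep updating as the wrapper is stepped.

# Illustrative use of VecNormalize.
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = VecNormalize(DummyVecEnv([lambda: gym.make('Pendulum-v0')]))
obs = venv.reset()                          # normalized and clipped to [-clipob, clipob]
for _ in range(100):
    obs, rews, dones, infos = venv.step([venv.action_space.sample()])
venv.close()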


@@ -0,0 +1,89 @@
import os
from baselines import logger
from baselines.common.vec_env import VecEnvWrapper
from gym.wrappers.monitoring import video_recorder
class VecVideoRecorder(VecEnvWrapper):
"""
Wrap VecEnv to record rendered image as mp4 video.
"""
def __init__(self, venv, directory, record_video_trigger, video_length=200):
"""
# Arguments
venv: VecEnv to wrap
directory: Where to save videos
record_video_trigger:
Function that defines when to start recording.
                The function takes the current step number,
and returns whether we should start recording or not.
video_length: Length of recorded video
"""
VecEnvWrapper.__init__(self, venv)
self.record_video_trigger = record_video_trigger
self.video_recorder = None
self.directory = os.path.abspath(directory)
if not os.path.exists(self.directory): os.mkdir(self.directory)
self.file_prefix = "vecenv"
self.file_infix = '{}'.format(os.getpid())
self.step_id = 0
self.video_length = video_length
self.recording = False
self.recorded_frames = 0
def reset(self):
obs = self.venv.reset()
self.start_video_recorder()
return obs
def start_video_recorder(self):
self.close_video_recorder()
base_path = os.path.join(self.directory, '{}.video.{}.video{:06}'.format(self.file_prefix, self.file_infix, self.step_id))
self.video_recorder = video_recorder.VideoRecorder(
env=self.venv,
base_path=base_path,
metadata={'step_id': self.step_id}
)
self.video_recorder.capture_frame()
self.recorded_frames = 1
self.recording = True
def _video_enabled(self):
return self.record_video_trigger(self.step_id)
def step_wait(self):
obs, rews, dones, infos = self.venv.step_wait()
self.step_id += 1
if self.recording:
self.video_recorder.capture_frame()
self.recorded_frames += 1
if self.recorded_frames > self.video_length:
logger.info("Saving video to ", self.video_recorder.path)
self.close_video_recorder()
elif self._video_enabled():
self.start_video_recorder()
return obs, rews, dones, infos
def close_video_recorder(self):
if self.recording:
self.video_recorder.close()
self.recording = False
self.recorded_frames = 0
def close(self):
VecEnvWrapper.close(self)
self.close_video_recorder()
def __del__(self):
self.close()

baselines/ddpg/README.md (2 changed lines, Normal file → Executable file)

@@ -2,4 +2,4 @@
- Original paper: https://arxiv.org/abs/1509.02971
- Baselines post: https://blog.openai.com/better-exploration-with-parameter-noise/
- `python -m baselines.ddpg.main` runs the algorithm for 1M frames = 10M timesteps on a Mujoco environment. See help (`-h`) for more options.
- `python -m baselines.run --alg=ddpg --env=HalfCheetah-v2 --num_timesteps=1e6` runs the algorithm for 1M frames = 10M timesteps on a Mujoco environment. See help (`-h`) for more options.

0
baselines/ddpg/__init__.py Normal file → Executable file

615
baselines/ddpg/ddpg.py Normal file → Executable file

@@ -1,372 +1,273 @@
from copy import copy
from functools import reduce
import os
import time
from collections import deque
import pickle
import numpy as np
import tensorflow as tf
import tensorflow.contrib as tc
from baselines.ddpg.ddpg_learner import DDPG
from baselines.ddpg.models import Actor, Critic
from baselines.ddpg.memory import Memory
from baselines.ddpg.noise import AdaptiveParamNoiseSpec, NormalActionNoise, OrnsteinUhlenbeckActionNoise
from baselines.common import set_global_seeds
import baselines.common.tf_util as U
from baselines import logger
from baselines.common.mpi_adam import MpiAdam
from baselines.common.mpi_running_mean_std import RunningMeanStd
from baselines.ddpg.util import reduce_std, mpi_mean
try:
from mpi4py import MPI
except ImportError:
MPI = None
def learn(network, env,
seed=None,
total_timesteps=None,
nb_epochs=None, # with default settings, perform 1M steps total
nb_epoch_cycles=20,
nb_rollout_steps=100,
reward_scale=1.0,
render=False,
render_eval=False,
noise_type='adaptive-param_0.2',
normalize_returns=False,
normalize_observations=True,
critic_l2_reg=1e-2,
actor_lr=1e-4,
critic_lr=1e-3,
popart=False,
gamma=0.99,
clip_norm=None,
nb_train_steps=50, # per epoch cycle and MPI worker,
nb_eval_steps=100,
batch_size=64, # per MPI worker
tau=0.01,
eval_env=None,
param_noise_adaption_interval=50,
**network_kwargs):
set_global_seeds(seed)
if total_timesteps is not None:
assert nb_epochs is None
nb_epochs = int(total_timesteps) // (nb_epoch_cycles * nb_rollout_steps)
else:
nb_epochs = 500
if MPI is not None:
rank = MPI.COMM_WORLD.Get_rank()
else:
rank = 0
nb_actions = env.action_space.shape[-1]
assert (np.abs(env.action_space.low) == env.action_space.high).all() # we assume symmetric actions.
memory = Memory(limit=int(1e6), action_shape=env.action_space.shape, observation_shape=env.observation_space.shape)
critic = Critic(network=network, **network_kwargs)
actor = Actor(nb_actions, network=network, **network_kwargs)
action_noise = None
param_noise = None
if noise_type is not None:
for current_noise_type in noise_type.split(','):
current_noise_type = current_noise_type.strip()
if current_noise_type == 'none':
pass
elif 'adaptive-param' in current_noise_type:
_, stddev = current_noise_type.split('_')
param_noise = AdaptiveParamNoiseSpec(initial_stddev=float(stddev), desired_action_stddev=float(stddev))
elif 'normal' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = NormalActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
elif 'ou' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = OrnsteinUhlenbeckActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
else:
raise RuntimeError('unknown noise type "{}"'.format(current_noise_type))
max_action = env.action_space.high
logger.info('scaling actions by {} before executing in env'.format(max_action))
agent = DDPG(actor, critic, memory, env.observation_space.shape, env.action_space.shape,
gamma=gamma, tau=tau, normalize_returns=normalize_returns, normalize_observations=normalize_observations,
batch_size=batch_size, action_noise=action_noise, param_noise=param_noise, critic_l2_reg=critic_l2_reg,
actor_lr=actor_lr, critic_lr=critic_lr, enable_popart=popart, clip_norm=clip_norm,
reward_scale=reward_scale)
logger.info('Using agent with the following configuration:')
logger.info(str(agent.__dict__.items()))
eval_episode_rewards_history = deque(maxlen=100)
episode_rewards_history = deque(maxlen=100)
sess = U.get_session()
# Prepare everything.
agent.initialize(sess)
sess.graph.finalize()
agent.reset()
obs = env.reset()
if eval_env is not None:
eval_obs = eval_env.reset()
nenvs = obs.shape[0]
episode_reward = np.zeros(nenvs, dtype = np.float32) #vector
episode_step = np.zeros(nenvs, dtype = int) # vector
episodes = 0 #scalar
t = 0 # scalar
epoch = 0
def normalize(x, stats):
if stats is None:
return x
return (x - stats.mean) / stats.std
start_time = time.time()
epoch_episode_rewards = []
epoch_episode_steps = []
epoch_actions = []
epoch_qs = []
epoch_episodes = 0
for epoch in range(nb_epochs):
for cycle in range(nb_epoch_cycles):
# Perform rollouts.
if nenvs > 1:
# if simulating multiple envs in parallel, impossible to reset agent at the end of the episode in each
# of the environments, so resetting here instead
agent.reset()
for t_rollout in range(nb_rollout_steps):
# Predict next action.
action, q, _, _ = agent.step(obs, apply_noise=True, compute_Q=True)
# Execute next action.
if rank == 0 and render:
env.render()
# max_action is of dimension A, whereas action is dimension (nenvs, A) - the multiplication gets broadcasted to the batch
new_obs, r, done, info = env.step(max_action * action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
# note these outputs are batched from vecenv
t += 1
if rank == 0 and render:
env.render()
episode_reward += r
episode_step += 1
# Book-keeping.
epoch_actions.append(action)
epoch_qs.append(q)
agent.store_transition(obs, action, r, new_obs, done) #the batched data will be unrolled in memory.py's append.
obs = new_obs
for d in range(len(done)):
if done[d]:
# Episode done.
epoch_episode_rewards.append(episode_reward[d])
episode_rewards_history.append(episode_reward[d])
epoch_episode_steps.append(episode_step[d])
episode_reward[d] = 0.
episode_step[d] = 0
epoch_episodes += 1
episodes += 1
if nenvs == 1:
agent.reset()
def denormalize(x, stats):
if stats is None:
return x
return x * stats.std + stats.mean
# Train.
epoch_actor_losses = []
epoch_critic_losses = []
epoch_adaptive_distances = []
for t_train in range(nb_train_steps):
# Adapt param noise, if necessary.
if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
distance = agent.adapt_param_noise()
epoch_adaptive_distances.append(distance)
def get_target_updates(vars, target_vars, tau):
logger.info('setting up target updates ...')
soft_updates = []
init_updates = []
assert len(vars) == len(target_vars)
for var, target_var in zip(vars, target_vars):
logger.info(' {} <- {}'.format(target_var.name, var.name))
init_updates.append(tf.assign(target_var, var))
soft_updates.append(tf.assign(target_var, (1. - tau) * target_var + tau * var))
assert len(init_updates) == len(vars)
assert len(soft_updates) == len(vars)
return tf.group(*init_updates), tf.group(*soft_updates)
cl, al = agent.train()
epoch_critic_losses.append(cl)
epoch_actor_losses.append(al)
agent.update_target_net()
# Evaluate.
eval_episode_rewards = []
eval_qs = []
if eval_env is not None:
nenvs_eval = eval_obs.shape[0]
eval_episode_reward = np.zeros(nenvs_eval, dtype = np.float32)
for t_rollout in range(nb_eval_steps):
eval_action, eval_q, _, _ = agent.step(eval_obs, apply_noise=False, compute_Q=True)
eval_obs, eval_r, eval_done, eval_info = eval_env.step(max_action * eval_action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
if render_eval:
eval_env.render()
eval_episode_reward += eval_r
def get_perturbed_actor_updates(actor, perturbed_actor, param_noise_stddev):
assert len(actor.vars) == len(perturbed_actor.vars)
assert len(actor.perturbable_vars) == len(perturbed_actor.perturbable_vars)
eval_qs.append(eval_q)
for d in range(len(eval_done)):
if eval_done[d]:
eval_episode_rewards.append(eval_episode_reward[d])
eval_episode_rewards_history.append(eval_episode_reward[d])
eval_episode_reward[d] = 0.0
updates = []
for var, perturbed_var in zip(actor.vars, perturbed_actor.vars):
if var in actor.perturbable_vars:
logger.info(' {} <- {} + noise'.format(perturbed_var.name, var.name))
updates.append(tf.assign(perturbed_var, var + tf.random_normal(tf.shape(var), mean=0., stddev=param_noise_stddev)))
if MPI is not None:
mpi_size = MPI.COMM_WORLD.Get_size()
else:
logger.info(' {} <- {}'.format(perturbed_var.name, var.name))
updates.append(tf.assign(perturbed_var, var))
assert len(updates) == len(actor.vars)
return tf.group(*updates)
mpi_size = 1
# Log stats.
# XXX shouldn't call np.mean on variable length lists
duration = time.time() - start_time
stats = agent.get_stats()
combined_stats = stats.copy()
combined_stats['rollout/return'] = np.mean(epoch_episode_rewards)
combined_stats['rollout/return_history'] = np.mean(episode_rewards_history)
combined_stats['rollout/episode_steps'] = np.mean(epoch_episode_steps)
combined_stats['rollout/actions_mean'] = np.mean(epoch_actions)
combined_stats['rollout/Q_mean'] = np.mean(epoch_qs)
combined_stats['train/loss_actor'] = np.mean(epoch_actor_losses)
combined_stats['train/loss_critic'] = np.mean(epoch_critic_losses)
combined_stats['train/param_noise_distance'] = np.mean(epoch_adaptive_distances)
combined_stats['total/duration'] = duration
combined_stats['total/steps_per_second'] = float(t) / float(duration)
combined_stats['total/episodes'] = episodes
combined_stats['rollout/episodes'] = epoch_episodes
combined_stats['rollout/actions_std'] = np.std(epoch_actions)
# Evaluation statistics.
if eval_env is not None:
combined_stats['eval/return'] = eval_episode_rewards
combined_stats['eval/return_history'] = np.mean(eval_episode_rewards_history)
combined_stats['eval/Q'] = eval_qs
combined_stats['eval/episodes'] = len(eval_episode_rewards)
def as_scalar(x):
if isinstance(x, np.ndarray):
assert x.size == 1
return x[0]
elif np.isscalar(x):
return x
else:
raise ValueError('expected scalar, got %s'%x)
combined_stats_sums = np.array([ np.array(x).flatten()[0] for x in combined_stats.values()])
if MPI is not None:
combined_stats_sums = MPI.COMM_WORLD.allreduce(combined_stats_sums)
combined_stats = {k : v / mpi_size for (k,v) in zip(combined_stats.keys(), combined_stats_sums)}
# Total statistics.
combined_stats['total/epochs'] = epoch + 1
combined_stats['total/steps'] = t
for key in sorted(combined_stats.keys()):
logger.record_tabular(key, combined_stats[key])
if rank == 0:
logger.dump_tabular()
logger.info('')
logdir = logger.get_dir()
if rank == 0 and logdir:
if hasattr(env, 'get_state'):
with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:
pickle.dump(env.get_state(), f)
if eval_env and hasattr(eval_env, 'get_state'):
with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
pickle.dump(eval_env.get_state(), f)
class DDPG(object):
def __init__(self, actor, critic, memory, observation_shape, action_shape, param_noise=None, action_noise=None,
gamma=0.99, tau=0.001, normalize_returns=False, enable_popart=False, normalize_observations=True,
batch_size=128, observation_range=(-5., 5.), action_range=(-1., 1.), return_range=(-np.inf, np.inf),
adaptive_param_noise=True, adaptive_param_noise_policy_threshold=.1,
critic_l2_reg=0., actor_lr=1e-4, critic_lr=1e-3, clip_norm=None, reward_scale=1.):
# Inputs.
self.obs0 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs0')
self.obs1 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs1')
self.terminals1 = tf.placeholder(tf.float32, shape=(None, 1), name='terminals1')
self.rewards = tf.placeholder(tf.float32, shape=(None, 1), name='rewards')
self.actions = tf.placeholder(tf.float32, shape=(None,) + action_shape, name='actions')
self.critic_target = tf.placeholder(tf.float32, shape=(None, 1), name='critic_target')
self.param_noise_stddev = tf.placeholder(tf.float32, shape=(), name='param_noise_stddev')
# Parameters.
self.gamma = gamma
self.tau = tau
self.memory = memory
self.normalize_observations = normalize_observations
self.normalize_returns = normalize_returns
self.action_noise = action_noise
self.param_noise = param_noise
self.action_range = action_range
self.return_range = return_range
self.observation_range = observation_range
self.critic = critic
self.actor = actor
self.actor_lr = actor_lr
self.critic_lr = critic_lr
self.clip_norm = clip_norm
self.enable_popart = enable_popart
self.reward_scale = reward_scale
self.batch_size = batch_size
self.stats_sample = None
self.critic_l2_reg = critic_l2_reg
# Observation normalization.
if self.normalize_observations:
with tf.variable_scope('obs_rms'):
self.obs_rms = RunningMeanStd(shape=observation_shape)
else:
self.obs_rms = None
normalized_obs0 = tf.clip_by_value(normalize(self.obs0, self.obs_rms),
self.observation_range[0], self.observation_range[1])
normalized_obs1 = tf.clip_by_value(normalize(self.obs1, self.obs_rms),
self.observation_range[0], self.observation_range[1])
# Return normalization.
if self.normalize_returns:
with tf.variable_scope('ret_rms'):
self.ret_rms = RunningMeanStd()
else:
self.ret_rms = None
# Create target networks.
target_actor = copy(actor)
target_actor.name = 'target_actor'
self.target_actor = target_actor
target_critic = copy(critic)
target_critic.name = 'target_critic'
self.target_critic = target_critic
# Create networks and core TF parts that are shared across setup parts.
self.actor_tf = actor(normalized_obs0)
self.normalized_critic_tf = critic(normalized_obs0, self.actions)
self.critic_tf = denormalize(tf.clip_by_value(self.normalized_critic_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
self.normalized_critic_with_actor_tf = critic(normalized_obs0, self.actor_tf, reuse=True)
self.critic_with_actor_tf = denormalize(tf.clip_by_value(self.normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
Q_obs1 = denormalize(target_critic(normalized_obs1, target_actor(normalized_obs1)), self.ret_rms)
self.target_Q = self.rewards + (1. - self.terminals1) * gamma * Q_obs1
# Set up parts.
if self.param_noise is not None:
self.setup_param_noise(normalized_obs0)
self.setup_actor_optimizer()
self.setup_critic_optimizer()
if self.normalize_returns and self.enable_popart:
self.setup_popart()
self.setup_stats()
self.setup_target_network_updates()
def setup_target_network_updates(self):
actor_init_updates, actor_soft_updates = get_target_updates(self.actor.vars, self.target_actor.vars, self.tau)
critic_init_updates, critic_soft_updates = get_target_updates(self.critic.vars, self.target_critic.vars, self.tau)
self.target_init_updates = [actor_init_updates, critic_init_updates]
self.target_soft_updates = [actor_soft_updates, critic_soft_updates]
def setup_param_noise(self, normalized_obs0):
assert self.param_noise is not None
# Configure perturbed actor.
param_noise_actor = copy(self.actor)
param_noise_actor.name = 'param_noise_actor'
self.perturbed_actor_tf = param_noise_actor(normalized_obs0)
logger.info('setting up param noise')
self.perturb_policy_ops = get_perturbed_actor_updates(self.actor, param_noise_actor, self.param_noise_stddev)
# Configure separate copy for stddev adoption.
adaptive_param_noise_actor = copy(self.actor)
adaptive_param_noise_actor.name = 'adaptive_param_noise_actor'
adaptive_actor_tf = adaptive_param_noise_actor(normalized_obs0)
self.perturb_adaptive_policy_ops = get_perturbed_actor_updates(self.actor, adaptive_param_noise_actor, self.param_noise_stddev)
self.adaptive_policy_distance = tf.sqrt(tf.reduce_mean(tf.square(self.actor_tf - adaptive_actor_tf)))
def setup_actor_optimizer(self):
logger.info('setting up actor optimizer')
self.actor_loss = -tf.reduce_mean(self.critic_with_actor_tf)
actor_shapes = [var.get_shape().as_list() for var in self.actor.trainable_vars]
actor_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in actor_shapes])
logger.info(' actor shapes: {}'.format(actor_shapes))
logger.info(' actor params: {}'.format(actor_nb_params))
self.actor_grads = U.flatgrad(self.actor_loss, self.actor.trainable_vars, clip_norm=self.clip_norm)
self.actor_optimizer = MpiAdam(var_list=self.actor.trainable_vars,
beta1=0.9, beta2=0.999, epsilon=1e-08)
def setup_critic_optimizer(self):
logger.info('setting up critic optimizer')
normalized_critic_target_tf = tf.clip_by_value(normalize(self.critic_target, self.ret_rms), self.return_range[0], self.return_range[1])
self.critic_loss = tf.reduce_mean(tf.square(self.normalized_critic_tf - normalized_critic_target_tf))
if self.critic_l2_reg > 0.:
critic_reg_vars = [var for var in self.critic.trainable_vars if 'kernel' in var.name and 'output' not in var.name]
for var in critic_reg_vars:
logger.info(' regularizing: {}'.format(var.name))
logger.info(' applying l2 regularization with {}'.format(self.critic_l2_reg))
critic_reg = tc.layers.apply_regularization(
tc.layers.l2_regularizer(self.critic_l2_reg),
weights_list=critic_reg_vars
)
self.critic_loss += critic_reg
critic_shapes = [var.get_shape().as_list() for var in self.critic.trainable_vars]
critic_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in critic_shapes])
logger.info(' critic shapes: {}'.format(critic_shapes))
logger.info(' critic params: {}'.format(critic_nb_params))
self.critic_grads = U.flatgrad(self.critic_loss, self.critic.trainable_vars, clip_norm=self.clip_norm)
self.critic_optimizer = MpiAdam(var_list=self.critic.trainable_vars,
beta1=0.9, beta2=0.999, epsilon=1e-08)
def setup_popart(self):
# See https://arxiv.org/pdf/1602.07714.pdf for details.
self.old_std = tf.placeholder(tf.float32, shape=[1], name='old_std')
new_std = self.ret_rms.std
self.old_mean = tf.placeholder(tf.float32, shape=[1], name='old_mean')
new_mean = self.ret_rms.mean
self.renormalize_Q_outputs_op = []
for vs in [self.critic.output_vars, self.target_critic.output_vars]:
assert len(vs) == 2
M, b = vs
assert 'kernel' in M.name
assert 'bias' in b.name
assert M.get_shape()[-1] == 1
assert b.get_shape()[-1] == 1
self.renormalize_Q_outputs_op += [M.assign(M * self.old_std / new_std)]
self.renormalize_Q_outputs_op += [b.assign((b * self.old_std + self.old_mean - new_mean) / new_std)]
def setup_stats(self):
ops = []
names = []
if self.normalize_returns:
ops += [self.ret_rms.mean, self.ret_rms.std]
names += ['ret_rms_mean', 'ret_rms_std']
if self.normalize_observations:
ops += [tf.reduce_mean(self.obs_rms.mean), tf.reduce_mean(self.obs_rms.std)]
names += ['obs_rms_mean', 'obs_rms_std']
ops += [tf.reduce_mean(self.critic_tf)]
names += ['reference_Q_mean']
ops += [reduce_std(self.critic_tf)]
names += ['reference_Q_std']
ops += [tf.reduce_mean(self.critic_with_actor_tf)]
names += ['reference_actor_Q_mean']
ops += [reduce_std(self.critic_with_actor_tf)]
names += ['reference_actor_Q_std']
ops += [tf.reduce_mean(self.actor_tf)]
names += ['reference_action_mean']
ops += [reduce_std(self.actor_tf)]
names += ['reference_action_std']
if self.param_noise:
ops += [tf.reduce_mean(self.perturbed_actor_tf)]
names += ['reference_perturbed_action_mean']
ops += [reduce_std(self.perturbed_actor_tf)]
names += ['reference_perturbed_action_std']
self.stats_ops = ops
self.stats_names = names
def pi(self, obs, apply_noise=True, compute_Q=True):
if self.param_noise is not None and apply_noise:
actor_tf = self.perturbed_actor_tf
else:
actor_tf = self.actor_tf
feed_dict = {self.obs0: [obs]}
if compute_Q:
action, q = self.sess.run([actor_tf, self.critic_with_actor_tf], feed_dict=feed_dict)
else:
action = self.sess.run(actor_tf, feed_dict=feed_dict)
q = None
action = action.flatten()
if self.action_noise is not None and apply_noise:
noise = self.action_noise()
assert noise.shape == action.shape
action += noise
action = np.clip(action, self.action_range[0], self.action_range[1])
return action, q
def store_transition(self, obs0, action, reward, obs1, terminal1):
reward *= self.reward_scale
self.memory.append(obs0, action, reward, obs1, terminal1)
if self.normalize_observations:
self.obs_rms.update(np.array([obs0]))
def train(self):
# Get a batch.
batch = self.memory.sample(batch_size=self.batch_size)
if self.normalize_returns and self.enable_popart:
old_mean, old_std, target_Q = self.sess.run([self.ret_rms.mean, self.ret_rms.std, self.target_Q], feed_dict={
self.obs1: batch['obs1'],
self.rewards: batch['rewards'],
self.terminals1: batch['terminals1'].astype('float32'),
})
self.ret_rms.update(target_Q.flatten())
self.sess.run(self.renormalize_Q_outputs_op, feed_dict={
self.old_std : np.array([old_std]),
self.old_mean : np.array([old_mean]),
})
# Run sanity check. Disabled by default since it slows down things considerably.
# print('running sanity check')
# target_Q_new, new_mean, new_std = self.sess.run([self.target_Q, self.ret_rms.mean, self.ret_rms.std], feed_dict={
# self.obs1: batch['obs1'],
# self.rewards: batch['rewards'],
# self.terminals1: batch['terminals1'].astype('float32'),
# })
# print(target_Q_new, target_Q, new_mean, new_std)
# assert (np.abs(target_Q - target_Q_new) < 1e-3).all()
else:
target_Q = self.sess.run(self.target_Q, feed_dict={
self.obs1: batch['obs1'],
self.rewards: batch['rewards'],
self.terminals1: batch['terminals1'].astype('float32'),
})
# Get all gradients and perform a synced update.
ops = [self.actor_grads, self.actor_loss, self.critic_grads, self.critic_loss]
actor_grads, actor_loss, critic_grads, critic_loss = self.sess.run(ops, feed_dict={
self.obs0: batch['obs0'],
self.actions: batch['actions'],
self.critic_target: target_Q,
})
self.actor_optimizer.update(actor_grads, stepsize=self.actor_lr)
self.critic_optimizer.update(critic_grads, stepsize=self.critic_lr)
return critic_loss, actor_loss
def initialize(self, sess):
self.sess = sess
self.sess.run(tf.global_variables_initializer())
self.actor_optimizer.sync()
self.critic_optimizer.sync()
self.sess.run(self.target_init_updates)
def update_target_net(self):
self.sess.run(self.target_soft_updates)
def get_stats(self):
if self.stats_sample is None:
# Get a sample and keep that fixed for all further computations.
# This allows us to estimate the change in value for the same set of inputs.
self.stats_sample = self.memory.sample(batch_size=self.batch_size)
values = self.sess.run(self.stats_ops, feed_dict={
self.obs0: self.stats_sample['obs0'],
self.actions: self.stats_sample['actions'],
})
names = self.stats_names[:]
assert len(names) == len(values)
stats = dict(zip(names, values))
if self.param_noise is not None:
stats = {**stats, **self.param_noise.get_stats()}
return stats
def adapt_param_noise(self):
if self.param_noise is None:
return 0.
# Perturb a separate copy of the policy to adjust the scale for the next "real" perturbation.
batch = self.memory.sample(batch_size=self.batch_size)
self.sess.run(self.perturb_adaptive_policy_ops, feed_dict={
self.param_noise_stddev: self.param_noise.current_stddev,
})
distance = self.sess.run(self.adaptive_policy_distance, feed_dict={
self.obs0: batch['obs0'],
self.param_noise_stddev: self.param_noise.current_stddev,
})
mean_distance = mpi_mean(distance)
self.param_noise.adapt(mean_distance)
return mean_distance
def reset(self):
# Reset internal state after an episode is complete.
if self.action_noise is not None:
self.action_noise.reset()
if self.param_noise is not None:
self.sess.run(self.perturb_policy_ops, feed_dict={
self.param_noise_stddev: self.param_noise.current_stddev,
})
return agent
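A sketch of calling the refactored `learn()` entry point directly; the environment and hyperparameters are illustrative (not part of the diff), and `baselines.run` performs equivalent wiring from the command line:

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ddpg.ddpg import learn

# Vectorized continuous-control env with symmetric action range.
env = DummyVecEnv([lambda: gym.make('Pendulum-v0')])

agent = learn(network='mlp', env=env, total_timesteps=10000,
              noise_type='ou_0.2', normalize_observations=True)
```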

401
baselines/ddpg/ddpg_learner.py Executable file

@@ -0,0 +1,401 @@
from copy import copy
from functools import reduce
import numpy as np
import tensorflow as tf
import tensorflow.contrib as tc
from baselines import logger
from baselines.common.mpi_adam import MpiAdam
import baselines.common.tf_util as U
from baselines.common.mpi_running_mean_std import RunningMeanStd
try:
from mpi4py import MPI
except ImportError:
MPI = None
def normalize(x, stats):
if stats is None:
return x
return (x - stats.mean) / stats.std
def denormalize(x, stats):
if stats is None:
return x
return x * stats.std + stats.mean
def reduce_std(x, axis=None, keepdims=False):
return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
def reduce_var(x, axis=None, keepdims=False):
m = tf.reduce_mean(x, axis=axis, keepdims=True)
devs_squared = tf.square(x - m)
return tf.reduce_mean(devs_squared, axis=axis, keepdims=keepdims)
def get_target_updates(vars, target_vars, tau):
logger.info('setting up target updates ...')
soft_updates = []
init_updates = []
assert len(vars) == len(target_vars)
for var, target_var in zip(vars, target_vars):
logger.info(' {} <- {}'.format(target_var.name, var.name))
init_updates.append(tf.assign(target_var, var))
soft_updates.append(tf.assign(target_var, (1. - tau) * target_var + tau * var))
assert len(init_updates) == len(vars)
assert len(soft_updates) == len(vars)
return tf.group(*init_updates), tf.group(*soft_updates)
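For clarity, the soft updates built here are the usual "Polyak" target-network update; a plain-numpy illustration (not part of the diff):

```python
import numpy as np

tau = 0.001
source = np.array([1.0, 2.0])      # e.g. current actor/critic weights
target = np.zeros(2)               # e.g. target-network weights
# target <- (1 - tau) * target + tau * source: target creeps toward source
target = (1.0 - tau) * target + tau * source
```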
def get_perturbed_actor_updates(actor, perturbed_actor, param_noise_stddev):
assert len(actor.vars) == len(perturbed_actor.vars)
assert len(actor.perturbable_vars) == len(perturbed_actor.perturbable_vars)
updates = []
for var, perturbed_var in zip(actor.vars, perturbed_actor.vars):
if var in actor.perturbable_vars:
logger.info(' {} <- {} + noise'.format(perturbed_var.name, var.name))
updates.append(tf.assign(perturbed_var, var + tf.random_normal(tf.shape(var), mean=0., stddev=param_noise_stddev)))
else:
logger.info(' {} <- {}'.format(perturbed_var.name, var.name))
updates.append(tf.assign(perturbed_var, var))
assert len(updates) == len(actor.vars)
return tf.group(*updates)
class DDPG(object):
def __init__(self, actor, critic, memory, observation_shape, action_shape, param_noise=None, action_noise=None,
gamma=0.99, tau=0.001, normalize_returns=False, enable_popart=False, normalize_observations=True,
batch_size=128, observation_range=(-5., 5.), action_range=(-1., 1.), return_range=(-np.inf, np.inf),
critic_l2_reg=0., actor_lr=1e-4, critic_lr=1e-3, clip_norm=None, reward_scale=1.):
# Inputs.
self.obs0 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs0')
self.obs1 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs1')
self.terminals1 = tf.placeholder(tf.float32, shape=(None, 1), name='terminals1')
self.rewards = tf.placeholder(tf.float32, shape=(None, 1), name='rewards')
self.actions = tf.placeholder(tf.float32, shape=(None,) + action_shape, name='actions')
self.critic_target = tf.placeholder(tf.float32, shape=(None, 1), name='critic_target')
self.param_noise_stddev = tf.placeholder(tf.float32, shape=(), name='param_noise_stddev')
# Parameters.
self.gamma = gamma
self.tau = tau
self.memory = memory
self.normalize_observations = normalize_observations
self.normalize_returns = normalize_returns
self.action_noise = action_noise
self.param_noise = param_noise
self.action_range = action_range
self.return_range = return_range
self.observation_range = observation_range
self.critic = critic
self.actor = actor
self.actor_lr = actor_lr
self.critic_lr = critic_lr
self.clip_norm = clip_norm
self.enable_popart = enable_popart
self.reward_scale = reward_scale
self.batch_size = batch_size
self.stats_sample = None
self.critic_l2_reg = critic_l2_reg
# Observation normalization.
if self.normalize_observations:
with tf.variable_scope('obs_rms'):
self.obs_rms = RunningMeanStd(shape=observation_shape)
else:
self.obs_rms = None
normalized_obs0 = tf.clip_by_value(normalize(self.obs0, self.obs_rms),
self.observation_range[0], self.observation_range[1])
normalized_obs1 = tf.clip_by_value(normalize(self.obs1, self.obs_rms),
self.observation_range[0], self.observation_range[1])
# Return normalization.
if self.normalize_returns:
with tf.variable_scope('ret_rms'):
self.ret_rms = RunningMeanStd()
else:
self.ret_rms = None
# Create target networks.
target_actor = copy(actor)
target_actor.name = 'target_actor'
self.target_actor = target_actor
target_critic = copy(critic)
target_critic.name = 'target_critic'
self.target_critic = target_critic
# Create networks and core TF parts that are shared across setup parts.
self.actor_tf = actor(normalized_obs0)
self.normalized_critic_tf = critic(normalized_obs0, self.actions)
self.critic_tf = denormalize(tf.clip_by_value(self.normalized_critic_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
self.normalized_critic_with_actor_tf = critic(normalized_obs0, self.actor_tf, reuse=True)
self.critic_with_actor_tf = denormalize(tf.clip_by_value(self.normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
Q_obs1 = denormalize(target_critic(normalized_obs1, target_actor(normalized_obs1)), self.ret_rms)
self.target_Q = self.rewards + (1. - self.terminals1) * gamma * Q_obs1
# Set up parts.
if self.param_noise is not None:
self.setup_param_noise(normalized_obs0)
self.setup_actor_optimizer()
self.setup_critic_optimizer()
if self.normalize_returns and self.enable_popart:
self.setup_popart()
self.setup_stats()
self.setup_target_network_updates()
self.initial_state = None # recurrent architectures not supported yet
def setup_target_network_updates(self):
actor_init_updates, actor_soft_updates = get_target_updates(self.actor.vars, self.target_actor.vars, self.tau)
critic_init_updates, critic_soft_updates = get_target_updates(self.critic.vars, self.target_critic.vars, self.tau)
self.target_init_updates = [actor_init_updates, critic_init_updates]
self.target_soft_updates = [actor_soft_updates, critic_soft_updates]
def setup_param_noise(self, normalized_obs0):
assert self.param_noise is not None
# Configure perturbed actor.
param_noise_actor = copy(self.actor)
param_noise_actor.name = 'param_noise_actor'
self.perturbed_actor_tf = param_noise_actor(normalized_obs0)
logger.info('setting up param noise')
self.perturb_policy_ops = get_perturbed_actor_updates(self.actor, param_noise_actor, self.param_noise_stddev)
# Configure separate copy for stddev adoption.
adaptive_param_noise_actor = copy(self.actor)
adaptive_param_noise_actor.name = 'adaptive_param_noise_actor'
adaptive_actor_tf = adaptive_param_noise_actor(normalized_obs0)
self.perturb_adaptive_policy_ops = get_perturbed_actor_updates(self.actor, adaptive_param_noise_actor, self.param_noise_stddev)
self.adaptive_policy_distance = tf.sqrt(tf.reduce_mean(tf.square(self.actor_tf - adaptive_actor_tf)))
def setup_actor_optimizer(self):
logger.info('setting up actor optimizer')
self.actor_loss = -tf.reduce_mean(self.critic_with_actor_tf)
actor_shapes = [var.get_shape().as_list() for var in self.actor.trainable_vars]
actor_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in actor_shapes])
logger.info(' actor shapes: {}'.format(actor_shapes))
logger.info(' actor params: {}'.format(actor_nb_params))
self.actor_grads = U.flatgrad(self.actor_loss, self.actor.trainable_vars, clip_norm=self.clip_norm)
self.actor_optimizer = MpiAdam(var_list=self.actor.trainable_vars,
beta1=0.9, beta2=0.999, epsilon=1e-08)
def setup_critic_optimizer(self):
logger.info('setting up critic optimizer')
normalized_critic_target_tf = tf.clip_by_value(normalize(self.critic_target, self.ret_rms), self.return_range[0], self.return_range[1])
self.critic_loss = tf.reduce_mean(tf.square(self.normalized_critic_tf - normalized_critic_target_tf))
if self.critic_l2_reg > 0.:
critic_reg_vars = [var for var in self.critic.trainable_vars if var.name.endswith('/w:0') and 'output' not in var.name]
for var in critic_reg_vars:
logger.info(' regularizing: {}'.format(var.name))
logger.info(' applying l2 regularization with {}'.format(self.critic_l2_reg))
critic_reg = tc.layers.apply_regularization(
tc.layers.l2_regularizer(self.critic_l2_reg),
weights_list=critic_reg_vars
)
self.critic_loss += critic_reg
critic_shapes = [var.get_shape().as_list() for var in self.critic.trainable_vars]
critic_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in critic_shapes])
logger.info(' critic shapes: {}'.format(critic_shapes))
logger.info(' critic params: {}'.format(critic_nb_params))
self.critic_grads = U.flatgrad(self.critic_loss, self.critic.trainable_vars, clip_norm=self.clip_norm)
self.critic_optimizer = MpiAdam(var_list=self.critic.trainable_vars,
beta1=0.9, beta2=0.999, epsilon=1e-08)
def setup_popart(self):
# See https://arxiv.org/pdf/1602.07714.pdf for details.
self.old_std = tf.placeholder(tf.float32, shape=[1], name='old_std')
new_std = self.ret_rms.std
self.old_mean = tf.placeholder(tf.float32, shape=[1], name='old_mean')
new_mean = self.ret_rms.mean
self.renormalize_Q_outputs_op = []
for vs in [self.critic.output_vars, self.target_critic.output_vars]:
assert len(vs) == 2
M, b = vs
assert 'kernel' in M.name
assert 'bias' in b.name
assert M.get_shape()[-1] == 1
assert b.get_shape()[-1] == 1
self.renormalize_Q_outputs_op += [M.assign(M * self.old_std / new_std)]
self.renormalize_Q_outputs_op += [b.assign((b * self.old_std + self.old_mean - new_mean) / new_std)]
def setup_stats(self):
ops = []
names = []
if self.normalize_returns:
ops += [self.ret_rms.mean, self.ret_rms.std]
names += ['ret_rms_mean', 'ret_rms_std']
if self.normalize_observations:
ops += [tf.reduce_mean(self.obs_rms.mean), tf.reduce_mean(self.obs_rms.std)]
names += ['obs_rms_mean', 'obs_rms_std']
ops += [tf.reduce_mean(self.critic_tf)]
names += ['reference_Q_mean']
ops += [reduce_std(self.critic_tf)]
names += ['reference_Q_std']
ops += [tf.reduce_mean(self.critic_with_actor_tf)]
names += ['reference_actor_Q_mean']
ops += [reduce_std(self.critic_with_actor_tf)]
names += ['reference_actor_Q_std']
ops += [tf.reduce_mean(self.actor_tf)]
names += ['reference_action_mean']
ops += [reduce_std(self.actor_tf)]
names += ['reference_action_std']
if self.param_noise:
ops += [tf.reduce_mean(self.perturbed_actor_tf)]
names += ['reference_perturbed_action_mean']
ops += [reduce_std(self.perturbed_actor_tf)]
names += ['reference_perturbed_action_std']
self.stats_ops = ops
self.stats_names = names
def step(self, obs, apply_noise=True, compute_Q=True):
if self.param_noise is not None and apply_noise:
actor_tf = self.perturbed_actor_tf
else:
actor_tf = self.actor_tf
feed_dict = {self.obs0: U.adjust_shape(self.obs0, [obs])}
if compute_Q:
action, q = self.sess.run([actor_tf, self.critic_with_actor_tf], feed_dict=feed_dict)
else:
action = self.sess.run(actor_tf, feed_dict=feed_dict)
q = None
if self.action_noise is not None and apply_noise:
noise = self.action_noise()
assert noise.shape == action[0].shape
action += noise
action = np.clip(action, self.action_range[0], self.action_range[1])
return action, q, None, None
def store_transition(self, obs0, action, reward, obs1, terminal1):
reward *= self.reward_scale
B = obs0.shape[0]
for b in range(B):
self.memory.append(obs0[b], action[b], reward[b], obs1[b], terminal1[b])
if self.normalize_observations:
self.obs_rms.update(np.array([obs0[b]]))
def train(self):
# Get a batch.
batch = self.memory.sample(batch_size=self.batch_size)
if self.normalize_returns and self.enable_popart:
old_mean, old_std, target_Q = self.sess.run([self.ret_rms.mean, self.ret_rms.std, self.target_Q], feed_dict={
self.obs1: batch['obs1'],
self.rewards: batch['rewards'],
self.terminals1: batch['terminals1'].astype('float32'),
})
self.ret_rms.update(target_Q.flatten())
self.sess.run(self.renormalize_Q_outputs_op, feed_dict={
self.old_std : np.array([old_std]),
self.old_mean : np.array([old_mean]),
})
# Run sanity check. Disabled by default since it slows down things considerably.
# print('running sanity check')
# target_Q_new, new_mean, new_std = self.sess.run([self.target_Q, self.ret_rms.mean, self.ret_rms.std], feed_dict={
# self.obs1: batch['obs1'],
# self.rewards: batch['rewards'],
# self.terminals1: batch['terminals1'].astype('float32'),
# })
# print(target_Q_new, target_Q, new_mean, new_std)
# assert (np.abs(target_Q - target_Q_new) < 1e-3).all()
else:
target_Q = self.sess.run(self.target_Q, feed_dict={
self.obs1: batch['obs1'],
self.rewards: batch['rewards'],
self.terminals1: batch['terminals1'].astype('float32'),
})
# Get all gradients and perform a synced update.
ops = [self.actor_grads, self.actor_loss, self.critic_grads, self.critic_loss]
actor_grads, actor_loss, critic_grads, critic_loss = self.sess.run(ops, feed_dict={
self.obs0: batch['obs0'],
self.actions: batch['actions'],
self.critic_target: target_Q,
})
self.actor_optimizer.update(actor_grads, stepsize=self.actor_lr)
self.critic_optimizer.update(critic_grads, stepsize=self.critic_lr)
return critic_loss, actor_loss
def initialize(self, sess):
self.sess = sess
self.sess.run(tf.global_variables_initializer())
self.actor_optimizer.sync()
self.critic_optimizer.sync()
self.sess.run(self.target_init_updates)
def update_target_net(self):
self.sess.run(self.target_soft_updates)
def get_stats(self):
if self.stats_sample is None:
# Get a sample and keep that fixed for all further computations.
# This allows us to estimate the change in value for the same set of inputs.
self.stats_sample = self.memory.sample(batch_size=self.batch_size)
values = self.sess.run(self.stats_ops, feed_dict={
self.obs0: self.stats_sample['obs0'],
self.actions: self.stats_sample['actions'],
})
names = self.stats_names[:]
assert len(names) == len(values)
stats = dict(zip(names, values))
if self.param_noise is not None:
stats = {**stats, **self.param_noise.get_stats()}
return stats
def adapt_param_noise(self):
try:
from mpi4py import MPI
except ImportError:
MPI = None
if self.param_noise is None:
return 0.
# Perturb a separate copy of the policy to adjust the scale for the next "real" perturbation.
batch = self.memory.sample(batch_size=self.batch_size)
self.sess.run(self.perturb_adaptive_policy_ops, feed_dict={
self.param_noise_stddev: self.param_noise.current_stddev,
})
distance = self.sess.run(self.adaptive_policy_distance, feed_dict={
self.obs0: batch['obs0'],
self.param_noise_stddev: self.param_noise.current_stddev,
})
if MPI is not None:
mean_distance = MPI.COMM_WORLD.allreduce(distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
else:
mean_distance = distance
self.param_noise.adapt(mean_distance)
return mean_distance
def reset(self):
# Reset internal state after an episode is complete.
if self.action_noise is not None:
self.action_noise.reset()
if self.param_noise is not None:
self.sess.run(self.perturb_policy_ops, feed_dict={
self.param_noise_stddev: self.param_noise.current_stddev,
})
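A small numeric sketch (not part of the diff) of the invariant that `setup_popart()` above preserves when rescaling the critic's output layer after the return statistics change (see https://arxiv.org/abs/1602.07714):

```python
import numpy as np

old_mean, old_std = 1.0, 2.0
new_mean, new_std = 1.5, 3.0
W, b, x = 0.4, 0.1, 5.0

y_before = (W * x + b) * old_std + old_mean      # denormalized prediction

# Rescale exactly as in setup_popart(): W <- W*old_std/new_std,
# b <- (b*old_std + old_mean - new_mean)/new_std.
W2 = W * old_std / new_std
b2 = (b * old_std + old_mean - new_mean) / new_std
y_after = (W2 * x + b2) * new_std + new_mean     # same denormalized value

assert abs(y_before - y_after) < 1e-9
```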


@@ -1,124 +0,0 @@
import argparse
import time
import os
import logging
from baselines import logger, bench
from baselines.common.misc_util import (
set_global_seeds,
boolean_flag,
)
import baselines.ddpg.training as training
from baselines.ddpg.models import Actor, Critic
from baselines.ddpg.memory import Memory
from baselines.ddpg.noise import *
import gym
import tensorflow as tf
from mpi4py import MPI
def run(env_id, seed, noise_type, layer_norm, evaluation, **kwargs):
# Configure things.
rank = MPI.COMM_WORLD.Get_rank()
if rank != 0:
logger.set_level(logger.DISABLED)
# Create envs.
env = gym.make(env_id)
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
gym.logger.setLevel(logging.WARN)
if evaluation and rank==0:
eval_env = gym.make(env_id)
eval_env = bench.Monitor(eval_env, os.path.join(logger.get_dir(), 'gym_eval'))
env = bench.Monitor(env, None)
else:
eval_env = None
# Parse noise_type
action_noise = None
param_noise = None
nb_actions = env.action_space.shape[-1]
for current_noise_type in noise_type.split(','):
current_noise_type = current_noise_type.strip()
if current_noise_type == 'none':
pass
elif 'adaptive-param' in current_noise_type:
_, stddev = current_noise_type.split('_')
param_noise = AdaptiveParamNoiseSpec(initial_stddev=float(stddev), desired_action_stddev=float(stddev))
elif 'normal' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = NormalActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
elif 'ou' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = OrnsteinUhlenbeckActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
else:
raise RuntimeError('unknown noise type "{}"'.format(current_noise_type))
# Configure components.
memory = Memory(limit=int(1e6), action_shape=env.action_space.shape, observation_shape=env.observation_space.shape)
critic = Critic(layer_norm=layer_norm)
actor = Actor(nb_actions, layer_norm=layer_norm)
# Seed everything to make things reproducible.
seed = seed + 1000000 * rank
logger.info('rank {}: seed={}, logdir={}'.format(rank, seed, logger.get_dir()))
tf.reset_default_graph()
set_global_seeds(seed)
env.seed(seed)
if eval_env is not None:
eval_env.seed(seed)
# Disable logging for rank != 0 to avoid noise.
if rank == 0:
start_time = time.time()
training.train(env=env, eval_env=eval_env, param_noise=param_noise,
action_noise=action_noise, actor=actor, critic=critic, memory=memory, **kwargs)
env.close()
if eval_env is not None:
eval_env.close()
if rank == 0:
logger.info('total runtime: {}s'.format(time.time() - start_time))
def parse_args():
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env-id', type=str, default='HalfCheetah-v1')
boolean_flag(parser, 'render-eval', default=False)
boolean_flag(parser, 'layer-norm', default=True)
boolean_flag(parser, 'render', default=False)
boolean_flag(parser, 'normalize-returns', default=False)
boolean_flag(parser, 'normalize-observations', default=True)
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--critic-l2-reg', type=float, default=1e-2)
parser.add_argument('--batch-size', type=int, default=64) # per MPI worker
parser.add_argument('--actor-lr', type=float, default=1e-4)
parser.add_argument('--critic-lr', type=float, default=1e-3)
boolean_flag(parser, 'popart', default=False)
parser.add_argument('--gamma', type=float, default=0.99)
parser.add_argument('--reward-scale', type=float, default=1.)
parser.add_argument('--clip-norm', type=float, default=None)
parser.add_argument('--nb-epochs', type=int, default=500) # with default settings, perform 1M steps total
parser.add_argument('--nb-epoch-cycles', type=int, default=20)
parser.add_argument('--nb-train-steps', type=int, default=50) # per epoch cycle and MPI worker
parser.add_argument('--nb-eval-steps', type=int, default=100) # per epoch cycle and MPI worker
parser.add_argument('--nb-rollout-steps', type=int, default=100) # per epoch cycle and MPI worker
parser.add_argument('--noise-type', type=str, default='adaptive-param_0.2') # choices are adaptive-param_xx, ou_xx, normal_xx, none
parser.add_argument('--num-timesteps', type=int, default=None)
boolean_flag(parser, 'evaluation', default=False)
args = parser.parse_args()
# we don't directly specify timesteps for this script, so make sure that if we do specify them
# they agree with the other parameters
if args.num_timesteps is not None:
assert(args.num_timesteps == args.nb_epochs * args.nb_epoch_cycles * args.nb_rollout_steps)
dict_args = vars(args)
del dict_args['num_timesteps']
return dict_args
if __name__ == '__main__':
args = parse_args()
if MPI.COMM_WORLD.Get_rank() == 0:
logger.configure()
# Run actual script.
run(**args)

4
baselines/ddpg/memory.py Normal file → Executable file

@@ -51,7 +51,7 @@ class Memory(object):
def sample(self, batch_size):
# Draw such that we always have a succeeding element.
batch_idxs = np.random.random_integers(self.nb_entries - 2, size=batch_size)
batch_idxs = np.random.randint(self.nb_entries - 2, size=batch_size)
obs0_batch = self.observations0.get_batch(batch_idxs)
obs1_batch = self.observations1.get_batch(batch_idxs)
@@ -71,7 +71,7 @@ class Memory(object):
def append(self, obs0, action, reward, obs1, terminal1, training=True):
if not training:
return
self.observations0.append(obs0)
self.actions.append(action)
self.rewards.append(reward)
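The sampling change above swaps a deprecated call for `np.random.randint`, which also shifts the index range; a quick illustration (not part of the diff):

```python
import numpy as np

n = 5
idxs = np.random.randint(n, size=1000)           # uniform over [0, n - 1]
assert idxs.min() >= 0 and idxs.max() <= n - 1
# The old np.random.random_integers(n) sampled from [1, n] and is deprecated.
```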

54
baselines/ddpg/models.py Normal file → Executable file

@@ -1,10 +1,11 @@
import tensorflow as tf
import tensorflow.contrib as tc
from baselines.common.models import get_network_builder
class Model(object):
def __init__(self, name):
def __init__(self, name, network='mlp', **network_kwargs):
self.name = name
self.network_builder = get_network_builder(network)(**network_kwargs)
@property
def vars(self):
@@ -20,55 +21,28 @@ class Model(object):
class Actor(Model):
def __init__(self, nb_actions, name='actor', layer_norm=True):
super(Actor, self).__init__(name=name)
def __init__(self, nb_actions, name='actor', network='mlp', **network_kwargs):
super().__init__(name=name, network=network, **network_kwargs)
self.nb_actions = nb_actions
self.layer_norm = layer_norm
def __call__(self, obs, reuse=False):
with tf.variable_scope(self.name) as scope:
if reuse:
scope.reuse_variables()
x = obs
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
with tf.variable_scope(self.name, reuse=tf.AUTO_REUSE):
x = self.network_builder(obs)
x = tf.layers.dense(x, self.nb_actions, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
x = tf.nn.tanh(x)
return x
class Critic(Model):
def __init__(self, name='critic', layer_norm=True):
super(Critic, self).__init__(name=name)
self.layer_norm = layer_norm
def __init__(self, name='critic', network='mlp', **network_kwargs):
super().__init__(name=name, network=network, **network_kwargs)
self.layer_norm = True
def __call__(self, obs, action, reuse=False):
with tf.variable_scope(self.name) as scope:
if reuse:
scope.reuse_variables()
x = obs
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
x = tf.concat([x, action], axis=-1)
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
x = tf.layers.dense(x, 1, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
with tf.variable_scope(self.name, reuse=tf.AUTO_REUSE):
x = tf.concat([obs, action], axis=-1) # this assumes observation and action can be concatenated
x = self.network_builder(x)
x = tf.layers.dense(x, 1, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3), name='output')
return x
@property
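A sketch of constructing the refactored models; the `mlp` keyword arguments are assumptions about the shared network builder's parameters, not part of this diff:

```python
from baselines.ddpg.models import Actor, Critic

# Both models now delegate their hidden layers to a named network builder.
actor = Actor(nb_actions=2, network='mlp', num_layers=2, num_hidden=64)
critic = Critic(network='mlp', num_layers=2, num_hidden=64)
```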

0
baselines/ddpg/noise.py Normal file → Executable file


@@ -0,0 +1,17 @@
from baselines.run import main as M
def _run(argstr):
M(('--alg=ddpg --env=Pendulum-v0 --num_timesteps=0 ' + argstr).split(' '))
def test_popart():
_run('--normalize_returns=True --popart=True')
def test_noise_normal():
_run('--noise_type=normal_0.1')
def test_noise_ou():
_run('--noise_type=ou_0.1')
def test_noise_adaptive():
_run('--noise_type=adaptive-param_0.2,normal_0.1')


@@ -1,189 +0,0 @@
import os
import time
from collections import deque
import pickle
from baselines.ddpg.ddpg import DDPG
from baselines.ddpg.util import mpi_mean, mpi_std, mpi_max, mpi_sum
import baselines.common.tf_util as U
from baselines import logger
import numpy as np
import tensorflow as tf
from mpi4py import MPI
def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, param_noise, actor, critic,
normalize_returns, normalize_observations, critic_l2_reg, actor_lr, critic_lr, action_noise,
popart, gamma, clip_norm, nb_train_steps, nb_rollout_steps, nb_eval_steps, batch_size, memory,
tau=0.01, eval_env=None, param_noise_adaption_interval=50):
rank = MPI.COMM_WORLD.Get_rank()
assert (np.abs(env.action_space.low) == env.action_space.high).all() # we assume symmetric actions.
max_action = env.action_space.high
logger.info('scaling actions by {} before executing in env'.format(max_action))
agent = DDPG(actor, critic, memory, env.observation_space.shape, env.action_space.shape,
gamma=gamma, tau=tau, normalize_returns=normalize_returns, normalize_observations=normalize_observations,
batch_size=batch_size, action_noise=action_noise, param_noise=param_noise, critic_l2_reg=critic_l2_reg,
actor_lr=actor_lr, critic_lr=critic_lr, enable_popart=popart, clip_norm=clip_norm,
reward_scale=reward_scale)
logger.info('Using agent with the following configuration:')
logger.info(str(agent.__dict__.items()))
# Set up logging stuff only for a single worker.
if rank == 0:
saver = tf.train.Saver()
else:
saver = None
step = 0
episode = 0
eval_episode_rewards_history = deque(maxlen=100)
episode_rewards_history = deque(maxlen=100)
with U.single_threaded_session() as sess:
# Prepare everything.
agent.initialize(sess)
sess.graph.finalize()
agent.reset()
obs = env.reset()
if eval_env is not None:
eval_obs = eval_env.reset()
done = False
episode_reward = 0.
episode_step = 0
episodes = 0
t = 0
epoch = 0
start_time = time.time()
epoch_episode_rewards = []
epoch_episode_steps = []
epoch_episode_eval_rewards = []
epoch_episode_eval_steps = []
epoch_start_time = time.time()
epoch_actions = []
epoch_qs = []
epoch_episodes = 0
for epoch in range(nb_epochs):
for cycle in range(nb_epoch_cycles):
# Perform rollouts.
for t_rollout in range(nb_rollout_steps):
# Predict next action.
action, q = agent.pi(obs, apply_noise=True, compute_Q=True)
assert action.shape == env.action_space.shape
# Execute next action.
if rank == 0 and render:
env.render()
assert max_action.shape == action.shape
new_obs, r, done, info = env.step(max_action * action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
t += 1
if rank == 0 and render:
env.render()
episode_reward += r
episode_step += 1
# Book-keeping.
epoch_actions.append(action)
epoch_qs.append(q)
agent.store_transition(obs, action, r, new_obs, done)
obs = new_obs
if done:
# Episode done.
epoch_episode_rewards.append(episode_reward)
episode_rewards_history.append(episode_reward)
epoch_episode_steps.append(episode_step)
episode_reward = 0.
episode_step = 0
epoch_episodes += 1
episodes += 1
agent.reset()
obs = env.reset()
# Train.
epoch_actor_losses = []
epoch_critic_losses = []
epoch_adaptive_distances = []
for t_train in range(nb_train_steps):
# Adapt param noise, if necessary.
if memory.nb_entries >= batch_size and t % param_noise_adaption_interval == 0:
distance = agent.adapt_param_noise()
epoch_adaptive_distances.append(distance)
cl, al = agent.train()
epoch_critic_losses.append(cl)
epoch_actor_losses.append(al)
agent.update_target_net()
# Evaluate.
eval_episode_rewards = []
eval_qs = []
if eval_env is not None:
eval_episode_reward = 0.
for t_rollout in range(nb_eval_steps):
eval_action, eval_q = agent.pi(eval_obs, apply_noise=False, compute_Q=True)
eval_obs, eval_r, eval_done, eval_info = eval_env.step(max_action * eval_action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
if render_eval:
eval_env.render()
eval_episode_reward += eval_r
eval_qs.append(eval_q)
if eval_done:
eval_obs = eval_env.reset()
eval_episode_rewards.append(eval_episode_reward)
eval_episode_rewards_history.append(eval_episode_reward)
eval_episode_reward = 0.
# Log stats.
epoch_train_duration = time.time() - epoch_start_time
duration = time.time() - start_time
stats = agent.get_stats()
combined_stats = {}
for key in sorted(stats.keys()):
combined_stats[key] = mpi_mean(stats[key])
# Rollout statistics.
combined_stats['rollout/return'] = mpi_mean(epoch_episode_rewards)
combined_stats['rollout/return_history'] = mpi_mean(np.mean(episode_rewards_history))
combined_stats['rollout/episode_steps'] = mpi_mean(epoch_episode_steps)
combined_stats['rollout/episodes'] = mpi_sum(epoch_episodes)
combined_stats['rollout/actions_mean'] = mpi_mean(epoch_actions)
combined_stats['rollout/actions_std'] = mpi_std(epoch_actions)
combined_stats['rollout/Q_mean'] = mpi_mean(epoch_qs)
# Train statistics.
combined_stats['train/loss_actor'] = mpi_mean(epoch_actor_losses)
combined_stats['train/loss_critic'] = mpi_mean(epoch_critic_losses)
combined_stats['train/param_noise_distance'] = mpi_mean(epoch_adaptive_distances)
# Evaluation statistics.
if eval_env is not None:
combined_stats['eval/return'] = mpi_mean(eval_episode_rewards)
combined_stats['eval/return_history'] = mpi_mean(np.mean(eval_episode_rewards_history))
combined_stats['eval/Q'] = mpi_mean(eval_qs)
combined_stats['eval/episodes'] = mpi_mean(len(eval_episode_rewards))
# Total statistics.
combined_stats['total/duration'] = mpi_mean(duration)
combined_stats['total/steps_per_second'] = mpi_mean(float(t) / float(duration))
combined_stats['total/episodes'] = mpi_mean(episodes)
combined_stats['total/epochs'] = epoch + 1
combined_stats['total/steps'] = t
for key in sorted(combined_stats.keys()):
logger.record_tabular(key, combined_stats[key])
logger.dump_tabular()
logger.info('')
logdir = logger.get_dir()
if rank == 0 and logdir:
if hasattr(env, 'get_state'):
with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:
pickle.dump(env.get_state(), f)
if eval_env and hasattr(eval_env, 'get_state'):
with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
pickle.dump(eval_env.get_state(), f)


@@ -1,44 +0,0 @@
import numpy as np
import tensorflow as tf
from mpi4py import MPI
from baselines.common.mpi_moments import mpi_moments
def reduce_var(x, axis=None, keepdims=False):
m = tf.reduce_mean(x, axis=axis, keep_dims=True)
devs_squared = tf.square(x - m)
return tf.reduce_mean(devs_squared, axis=axis, keep_dims=keepdims)
def reduce_std(x, axis=None, keepdims=False):
return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
def mpi_mean(value):
if value == []:
value = [0.]
if not isinstance(value, list):
value = [value]
return mpi_moments(np.array(value))[0][0]
def mpi_std(value):
if value == []:
value = [0.]
if not isinstance(value, list):
value = [value]
return mpi_moments(np.array(value))[1][0]
def mpi_max(value):
global_max = np.zeros(1, dtype='float64')
local_max = np.max(value).astype('float64')
MPI.COMM_WORLD.Reduce(local_max, global_max, op=MPI.MAX)
return global_max[0]
def mpi_sum(value):
global_sum = np.zeros(1, dtype='float64')
local_sum = np.sum(np.array(value)).astype('float64')
MPI.COMM_WORLD.Reduce(local_sum, global_sum, op=MPI.SUM)
return global_sum[0]


@@ -9,44 +9,29 @@ Here's a list of commands to run to quickly get a working example:
```bash
# Train model and save the results to cartpole_model.pkl
python -m baselines.deepq.experiments.train_cartpole
python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5
# Load the model saved in cartpole_model.pkl and visualize the learned policy
python -m baselines.deepq.experiments.enjoy_cartpole
python -m baselines.run --alg=deepq --env=CartPole-v0 --load_path=./cartpole_model.pkl --num_timesteps=0 --play
```
Be sure to check out the source code of [both](experiments/train_cartpole.py) [files](experiments/enjoy_cartpole.py)!
## If you wish to apply DQN to solve a problem.
Check out our simple agent trained with the one-stop-shop `deepq.learn` function.
- [baselines/deepq/experiments/train_cartpole.py](experiments/train_cartpole.py) - train a Cartpole agent.
- [baselines/deepq/experiments/train_pong.py](experiments/train_pong.py) - train a Pong agent using convolutional neural networks.
In particular notice that once `deepq.learn` finishes training it returns `act` function which can be used to select actions in the environment. Once trained you can easily save it and load at later time. For both of the files listed above there are complimentary files `enjoy_cartpole.py` and `enjoy_pong.py` respectively, that load and visualize the learned policy.
In particular, notice that once `deepq.learn` finishes training it returns an `act` function, which can be used to select actions in the environment. Once trained, you can easily save it and load it at a later time. The complementary file `enjoy_cartpole.py` loads and visualizes the learned policy.
## If you wish to experiment with the algorithm
##### Check out the examples
- [baselines/deepq/experiments/custom_cartpole.py](experiments/custom_cartpole.py) - Cartpole training with more fine grained control over the internals of DQN algorithm.
- [baselines/deepq/experiments/atari/train.py](experiments/atari/train.py) - more robust setup for training at scale.
##### Download a pretrained Atari agent
For some research projects it is sometimes useful to have an already trained agent handy. There's a variety of models to choose from. You can list them all by running:
- [baselines/deepq/defaults.py](defaults.py) - settings for training on atari. Run
```bash
python -m baselines.deepq.experiments.atari.download_model
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4
```
to train on Atari Pong (see more in repo-wide [README.md](../../README.md#training-models))
Once you pick a model, you can download it and visualize the learned policy. Be sure to pass `--dueling` flag to visualization script when using dueling models.
```bash
python -m baselines.deepq.experiments.atari.download_model --blob model-atari-duel-pong-1 --model-dir /tmp/models
python -m baselines.deepq.experiments.atari.enjoy --model-dir /tmp/models/model-atari-duel-pong-1 --env Pong --dueling
```


@@ -1,8 +1,8 @@
from baselines.deepq import models # noqa
from baselines.deepq.build_graph import build_act, build_train # noqa
from baselines.deepq.simple import learn, load # noqa
from baselines.deepq.deepq import learn, load_act # noqa
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer # noqa
def wrap_atari_dqn(env):
from baselines.common.atari_wrappers import wrap_deepmind
return wrap_deepmind(env, frame_stack=True, scale=True)
return wrap_deepmind(env, frame_stack=True, scale=False)

View File

@@ -33,7 +33,7 @@ The functions in this file are used to create the following functions:
stochastic: bool
if set to False all the actions are always deterministic (default True)
update_eps_ph: float
update epsilon a new value, if negative not update happens
update epsilon to a new value, if negative no update happens
(default: no update)
reset_ph: bool
reset the perturbed policy by sampling a new perturbation
@@ -97,6 +97,37 @@ import tensorflow as tf
import baselines.common.tf_util as U
def scope_vars(scope, trainable_only=False):
"""
Get variables inside a scope
The scope can be specified as a string
Parameters
----------
scope: str or VariableScope
scope in which the variables reside.
trainable_only: bool
whether or not to return only the variables that were marked as trainable.
Returns
-------
vars: [tf.Variable]
list of variables in `scope`.
"""
return tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.GLOBAL_VARIABLES,
scope=scope if isinstance(scope, str) else scope.name
)
def scope_name():
"""Returns the name of current scope as a string, e.g. deepq/q_func"""
return tf.get_variable_scope().name
def absolute_scope_name(relative_scope_name):
"""Appends parent scope name to `relative_scope_name`"""
return scope_name() + "/" + relative_scope_name
def default_param_noise_filter(var):
if var not in tf.trainable_variables():
# We never perturb non-trainable vars.
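A small illustrative sketch of how the scope helpers above compose (TF1 graph mode; assumes `scope_vars` and `absolute_scope_name` are in scope, and the variable names are hypothetical):

```python
import tensorflow as tf

with tf.variable_scope("deepq"):
    with tf.variable_scope("q_func"):
        tf.get_variable("w", shape=[4, 2])
    # inside the "deepq" scope, absolute_scope_name("q_func") resolves to "deepq/q_func"
    q_vars = scope_vars(absolute_scope_name("q_func"), trainable_only=True)

print([v.name for v in q_vars])  # expected: ['deepq/q_func/w:0']
```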
@@ -143,7 +174,7 @@ def build_act(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None):
See the top of the file for details.
"""
with tf.variable_scope(scope, reuse=reuse):
observations_ph = U.ensure_tf_input(make_obs_ph("observation"))
observations_ph = make_obs_ph("observation")
stochastic_ph = tf.placeholder(tf.bool, (), name="stochastic")
update_eps_ph = tf.placeholder(tf.float32, (), name="update_eps")
@@ -159,10 +190,12 @@ def build_act(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None):
output_actions = tf.cond(stochastic_ph, lambda: stochastic_actions, lambda: deterministic_actions)
update_eps_expr = eps.assign(tf.cond(update_eps_ph >= 0, lambda: update_eps_ph, lambda: eps))
act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph],
_act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph],
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True},
updates=[update_eps_expr])
def act(ob, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps)
return act
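For context, a hypothetical rollout loop showing how the returned `act` callable is typically used (`env` and the enclosing TF session are assumed to exist):

```python
# act expects a batch of observations and returns a batch of actions;
# epsilon is only updated when update_eps >= 0 (see update_eps_expr above).
obs = env.reset()
for _ in range(1000):
    action = act(obs[None], update_eps=0.1)[0]
    obs, reward, done, _ = env.step(action)
    if done:
        obs = env.reset()
```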
@@ -203,7 +236,7 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
param_noise_filter_func = default_param_noise_filter
with tf.variable_scope(scope, reuse=reuse):
observations_ph = U.ensure_tf_input(make_obs_ph("observation"))
observations_ph = make_obs_ph("observation")
stochastic_ph = tf.placeholder(tf.bool, (), name="stochastic")
update_eps_ph = tf.placeholder(tf.float32, (), name="update_eps")
update_param_noise_threshold_ph = tf.placeholder(tf.float32, (), name="update_param_noise_threshold")
@@ -223,8 +256,8 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
# https://stackoverflow.com/questions/37063952/confused-by-the-behavior-of-tf-cond for
# a more detailed discussion.
def perturb_vars(original_scope, perturbed_scope):
all_vars = U.scope_vars(U.absolute_scope_name("q_func"))
all_perturbed_vars = U.scope_vars(U.absolute_scope_name("perturbed_q_func"))
all_vars = scope_vars(absolute_scope_name(original_scope))
all_perturbed_vars = scope_vars(absolute_scope_name(perturbed_scope))
assert len(all_vars) == len(all_perturbed_vars)
perturb_ops = []
for var, perturbed_var in zip(all_vars, all_perturbed_vars):
@@ -272,10 +305,12 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
tf.cond(update_param_noise_scale_ph, lambda: update_scale(), lambda: tf.Variable(0., trainable=False)),
update_param_noise_threshold_expr,
]
act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph, reset_ph, update_param_noise_threshold_ph, update_param_noise_scale_ph],
_act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph, reset_ph, update_param_noise_threshold_ph, update_param_noise_scale_ph],
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True, reset_ph: False, update_param_noise_threshold_ph: False, update_param_noise_scale_ph: False},
updates=updates)
def act(ob, reset=False, update_param_noise_threshold=False, update_param_noise_scale=False, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps, reset, update_param_noise_threshold, update_param_noise_scale)
return act
@@ -342,20 +377,20 @@ def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=
with tf.variable_scope(scope, reuse=reuse):
# set up placeholders
obs_t_input = U.ensure_tf_input(make_obs_ph("obs_t"))
obs_t_input = make_obs_ph("obs_t")
act_t_ph = tf.placeholder(tf.int32, [None], name="action")
rew_t_ph = tf.placeholder(tf.float32, [None], name="reward")
obs_tp1_input = U.ensure_tf_input(make_obs_ph("obs_tp1"))
obs_tp1_input = make_obs_ph("obs_tp1")
done_mask_ph = tf.placeholder(tf.float32, [None], name="done")
importance_weights_ph = tf.placeholder(tf.float32, [None], name="weight")
# q network evaluation
q_t = q_func(obs_t_input.get(), num_actions, scope="q_func", reuse=True) # reuse parameters from act
q_func_vars = U.scope_vars(U.absolute_scope_name("q_func"))
q_func_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=tf.get_variable_scope().name + "/q_func")
# target q network evaluation
q_tp1 = q_func(obs_tp1_input.get(), num_actions, scope="target_q_func")
target_q_func_vars = U.scope_vars(U.absolute_scope_name("target_q_func"))
target_q_func_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=tf.get_variable_scope().name + "/target_q_func")
# q scores for actions which we know were selected in the given state.
q_t_selected = tf.reduce_sum(q_t * tf.one_hot(act_t_ph, num_actions), 1)
@@ -363,7 +398,7 @@ def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=
# compute estimate of best possible value starting from state at t + 1
if double_q:
q_tp1_using_online_net = q_func(obs_tp1_input.get(), num_actions, scope="q_func", reuse=True)
q_tp1_best_using_online_net = tf.arg_max(q_tp1_using_online_net, 1)
q_tp1_best_using_online_net = tf.argmax(q_tp1_using_online_net, 1)
q_tp1_best = tf.reduce_sum(q_tp1 * tf.one_hot(q_tp1_best_using_online_net, num_actions), 1)
else:
q_tp1_best = tf.reduce_max(q_tp1, 1)
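To make the double-Q branch concrete, an equivalent NumPy sketch of the target action selection (illustrative data only):

```python
import numpy as np

batch, num_actions = 3, 4
q_tp1_online = np.random.rand(batch, num_actions)  # online net: picks the argmax action
q_tp1_target = np.random.rand(batch, num_actions)  # target net: evaluates that action

best_actions = q_tp1_online.argmax(axis=1)
q_tp1_best = q_tp1_target[np.arange(batch), best_actions]  # double-Q estimate of the next-state value
```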
@@ -379,10 +414,11 @@ def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=
# compute optimization op (potentially with gradient clipping)
if grad_norm_clipping is not None:
optimize_expr = U.minimize_and_clip(optimizer,
weighted_error,
var_list=q_func_vars,
clip_val=grad_norm_clipping)
gradients = optimizer.compute_gradients(weighted_error, var_list=q_func_vars)
for i, (grad, var) in enumerate(gradients):
if grad is not None:
gradients[i] = (tf.clip_by_norm(grad, grad_norm_clipping), var)
optimize_expr = optimizer.apply_gradients(gradients)
else:
optimize_expr = optimizer.minimize(weighted_error, var_list=q_func_vars)
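As a standalone illustration of the per-variable clipping pattern above (TF1 graph mode; the toy loss and variable are placeholders for whatever the real graph defines):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 4])
w = tf.get_variable("w", shape=[4, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
grads_and_vars = optimizer.compute_gradients(loss, var_list=[w])
# clip each gradient's norm individually, skipping variables with no gradient
clipped = [(tf.clip_by_norm(g, 10.0), v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)
```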

Some files were not shown because too many files have changed in this diff