entrypoint variable made public (#970) and Fix RuntimeError (#910) (#1015) (#1032)

flake8 complaints
tf2: Updated setup.py dependencies. (#1002)
2019-11-08 15:20:54 -08:00 · 2019-11-08 15:15:38 -08:00 · 2019-10-25 15:50:04 -07:00 · 2019-08-08 11:03:17 -07:00 · 2019-06-27 10:12:38 -07:00 · 2019-06-24 10:19:01 -07:00
184 changed files with 28991 additions and 8847 deletions
--- a/.benchmark_pattern
+++ b/.benchmark_pattern
@@ -0,0 +1 @@
+
--- a/.gitignore
+++ b/.gitignore
@@ -34,5 +34,3 @@ src
 .cache

 MUJOCO_LOG.TXT
-
-
--- a/.travis.yml
+++ b/.travis.yml
@@ -10,5 +10,5 @@ install:
    - docker build . -t baselines-test

 script:
-    - flake8 --select=F baselines/common
-    - docker run baselines-test pytest
+    - flake8 . --show-source --statistics --exclude=baselines/her
+    - docker run -e RUNSLOW=1 baselines-test pytest -v .
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,20 +1,18 @@
-FROM ubuntu:16.04
+FROM python:3.6
+
+RUN apt-get -y update && apt-get -y install ffmpeg
+# RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake python-opencv

-RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake
 ENV CODE_DIR /root/code
-ENV VENV /root/venv

 COPY . $CODE_DIR/baselines
-RUN \
-    pip install virtualenv && \
-    virtualenv $VENV --python=python3 && \
-    . $VENV/bin/activate && \
-    cd $CODE_DIR && \
-    pip install --upgrade pip && \
-    pip install -e baselines && \
-    pip install pytest
-
-ENV PATH=$VENV/bin:$PATH
 WORKDIR $CODE_DIR/baselines

+# Clean up pycache and pyc files
+RUN rm -rf __pycache__ && \
+    find . -name "*.pyc" -delete && \
+    pip install tensorflow && \
+    pip install -e .[test]
+
+
 CMD /bin/bash
--- a/README.md
+++ b/README.md
@@ -1,3 +1,5 @@
+**Status:** Active (under active development, breaking changes may occur)
+
 <img src="data/logo.jpg" width=25% align="right" /> [![Build status](https://travis-ci.org/openai/baselines.svg?branch=master)](https://travis-ci.org/openai/baselines)

 # Baselines
@@ -15,7 +17,7 @@ sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zli
 ```
    
 ### Mac OS X
-Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the follwing:
+Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
 ```bash
 brew install cmake openmpi
 ```
@@ -38,20 +40,27 @@ More thorough tutorial on virtualenvs and options can be found [here](https://vi


 ## Installation
-Clone the repo and cd into it:
-```bash
-git clone https://github.com/openai/baselines.git
-cd baselines
-```
-If using virtualenv, create a new virtualenv and activate it
-```bash
-    virtualenv env --python=python3
-    . env/bin/activate
-```
-Install baselines package
-```bash
-pip install -e .
-```
+- Clone the repo and cd into it:
+    ```bash
+    git clone https://github.com/openai/baselines.git
+    cd baselines
+    ```
+- If you don't have TensorFlow installed already, install your favourite flavor of TensorFlow. In most cases, 
+    ```bash 
+    pip install tensorflow-gpu # if you have a CUDA-compatible gpu and proper drivers
+    ```
+    or 
+    ```bash
+    pip install tensorflow
+    ```
+    should be sufficient. Refer to [TensorFlow installation guide](https://www.tensorflow.org/install/)
+    for more details. 
+
+- Install baselines package
+    ```bash
+    pip install -e .
+    ```
+
 ### MuJoCo
 Some of the baselines examples use [MuJoCo](http://www.mujoco.org) (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py)

@@ -62,6 +71,60 @@ pip install pytest
 pytest
 ```

+## Training models
+Most of the algorithms in the baselines repo are used as follows:
+```bash
+python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
+```
+### Example 1. PPO with MuJoCo Humanoid
+For instance, to train a fully-connected network controlling the MuJoCo humanoid using PPO2 for 20M timesteps:
+```bash
+python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
+```
+Note that for MuJoCo environments a fully-connected network is the default, so we can omit `--network=mlp`.
+The hyperparameters for both the network and the learning algorithm can be controlled via the command line, for instance:
+```bash
+python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
+```
+will set the entropy coefficient to 0.1, construct a fully connected network with 3 layers of 32 hidden units each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same).
+
+See the docstrings in [common/models.py](baselines/common/models.py) for a description of the network parameters for each type of model, and
+the docstring for [baselines/ppo2/ppo2.py/learn()](baselines/ppo2/ppo2.py#L152) for a description of the ppo2 hyperparameters.
+
+### Example 2. DQN on Atari 
+DQN on Atari is by now a classic benchmark. To run the baselines implementation of DQN on Atari Pong:
+```
+python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
+```
+
+## Saving, loading and visualizing models
+
+### Saving and loading the model
+The algorithms' serialization API is not properly unified yet; however, there is a simple method to save / restore trained models.
+The `--save_path` and `--load_path` command-line options save the TensorFlow state to a given path after training and load it from a given path before training, respectively.
+Let's imagine you'd like to train ppo2 on Atari Pong, save the model, and later visualize what it has learnt.
+```bash
+python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
+```
+This should reach a mean reward per episode of about 20. To load and visualize the model, we'll do the following: load the model, train it for 0 steps, and then visualize:
+```bash
+python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
+```
+
+*NOTE:* MuJoCo environments require normalization to work properly, so we wrap them with the VecNormalize wrapper. Currently, to ensure the models are saved with normalization (so that trained models can be restored and run without further training), the normalization coefficients are saved as TensorFlow variables. This can decrease performance somewhat, so if you require high-throughput steps with MuJoCo and do not need to save/restore the models, it may make sense to use numpy normalization instead. To do that, set `use_tf=False` in [baselines/run.py](baselines/run.py#L116).
+
+### Logging and visualizing learning curves and other training metrics
+By default, all summary data, including progress and standard output, is saved to a unique directory in a temp folder, specified by a call to Python's [tempfile.gettempdir()](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir).
+The directory can be changed with the `--log_path` command-line option.
+```bash
+python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2 --log_path=~/logs/Pong/
+```
+*NOTE:* Please be aware that the logger will overwrite files of the same name in an existing directory, so it's recommended that folder names be given a unique timestamp to prevent logs from being overwritten.
+
+The log directory can also be changed through the `$OPENAI_LOGDIR` environment variable.
+
+For examples on how to load and display the training data, see [here](docs/viz/viz.ipynb).
+
 ## Subpackages

 - [A2C](baselines/a2c)
@@ -71,17 +134,27 @@ pytest
 - [DQN](baselines/deepq)
 - [GAIL](baselines/gail)
 - [HER](baselines/her)
-- [PPO1](baselines/ppo1) (Multi-CPU using MPI)
-- [PPO2](baselines/ppo2) (Optimized for GPU)
+- [PPO1](baselines/ppo1) (obsolete version, left here temporarily)
+- [PPO2](baselines/ppo2) 
 - [TRPO](baselines/trpo_mpi)

+
+
+## Benchmarks
+Results of benchmarks on Mujoco (1M timesteps) and Atari (10M timesteps) are available 
+[here for Mujoco](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm) 
+and
+[here for Atari](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_atari10M.htm) 
+respectively. Note that these results may not be from the latest version of the code; the particular commit hash with which the results were obtained is specified on the benchmarks page.
+
 To cite this repository in publications:

    @misc{baselines,
-      author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
+      author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Tan, Zhenyu and Wu, Yuhuai and Zhokhov, Peter},
      title = {OpenAI Baselines},
      year = {2017},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/openai/baselines}},
    }
+
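The README note about the logger overwriting files suggests giving each log folder a unique timestamp. A minimal sketch of one way to do that is below. The helper names `timestamped_log_dir` and `build_command` are hypothetical and not part of baselines; only the `--alg`, `--env`, `--num_timesteps` and `--log_path` flags come from the README text in this diff.

```python
import os
import time


def timestamped_log_dir(base_dir, run_name):
    """Return a log directory path with a timestamp suffix so repeated runs do not collide.

    Hypothetical helper for illustration; not part of baselines.
    """
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S")
    return os.path.join(os.path.expanduser(base_dir), f"{run_name}-{stamp}")


def build_command(alg, env_id, num_timesteps, log_dir):
    """Assemble the baselines.run command line described in the README."""
    return [
        "python", "-m", "baselines.run",
        f"--alg={alg}",
        f"--env={env_id}",
        f"--num_timesteps={num_timesteps}",
        f"--log_path={log_dir}",
    ]


log_dir = timestamped_log_dir("~/logs", "Pong")
command = build_command("ppo2", "PongNoFrameskip-v4", "2e7", log_dir)
print(" ".join(command))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run`. Runs started within the same second would still collide, so a run counter or random suffix may be worth adding.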
--- a/baselines/a2c/README.md
+++ b/baselines/a2c/README.md
@@ -2,4 +2,12 @@

 - Original paper: https://arxiv.org/abs/1602.01783
 - Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
-- `python -m baselines.a2c.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
+- `python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on Atari Pong. See help (`-h`) for more options.
+- also refer to the repo-wide [README.md](../../README.md#training-models)
+
+## Files
+- `run_atari`: file used to run the algorithm.
+- `policies.py`: contains the different versions of the A2C architecture (MlpPolicy, CNNPolicy, LstmPolicy...).
+- `a2c.py`: - Model : class used to initialize the step_model (sampling) and train_model (training)
+	- learn : Main entrypoint for A2C algorithm. Train a policy with given network architecture on a given environment using a2c algorithm.
+- `runner.py`: class used to generate a batch of experiences
--- a/baselines/a2c/a2c.py
+++ b/baselines/a2c/a2c.py
@@ -1,153 +1,194 @@
-import os.path as osp
 import time
-import joblib
-import numpy as np
 import tensorflow as tf
+
 from baselines import logger

 from baselines.common import set_global_seeds, explained_variance
-from baselines.common.runners import AbstractEnvRunner
-from baselines.common import tf_util
+from baselines.common.models import get_network_builder
+from baselines.common.policies import PolicyWithValue

-from baselines.a2c.utils import discount_with_dones
-from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
-from baselines.a2c.utils import cat_entropy, mse
+from baselines.a2c.utils import InverseLinearTimeDecay
+from baselines.a2c.runner import Runner
+from baselines.ppo2.ppo2 import safemean
+import os.path as osp
+from collections import deque

-class Model(object):
+class Model(tf.keras.Model):

-    def __init__(self, policy, ob_space, ac_space, nenvs, nsteps,
+    """
+    We use this class to :
+        __init__:
+        - Creates the step_model
+        - Creates the train_model
+
+        train():
+        - Runs the training step (forward pass and backpropagation of gradients)
+
+        save/load():
+        - Save/load the model
+    """
+    def __init__(self, *, ac_space, policy_network, nupdates,
            ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5, lr=7e-4,
-            alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):
+            alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6)):

-        sess = tf_util.make_session()
-        nbatch = nenvs*nsteps
+        super(Model, self).__init__(name='A2CModel')
+        self.train_model = PolicyWithValue(ac_space, policy_network, value_network=None, estimate_q=False)
+        lr_schedule = InverseLinearTimeDecay(initial_learning_rate=lr, nupdates=nupdates)
+        self.optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule, rho=alpha, epsilon=epsilon)

-        A = tf.placeholder(tf.int32, [nbatch])
-        ADV = tf.placeholder(tf.float32, [nbatch])
-        R = tf.placeholder(tf.float32, [nbatch])
-        LR = tf.placeholder(tf.float32, [])
+        self.ent_coef = ent_coef
+        self.vf_coef = vf_coef
+        self.max_grad_norm = max_grad_norm
+        self.step = self.train_model.step
+        self.value = self.train_model.value
+        self.initial_state = self.train_model.initial_state

-        step_model = policy(sess, ob_space, ac_space, nenvs, 1, reuse=False)
-        train_model = policy(sess, ob_space, ac_space, nenvs*nsteps, nsteps, reuse=True)
+    @tf.function
+    def train(self, obs, states, rewards, masks, actions, values):
+        advs = rewards - values
+        with tf.GradientTape() as tape:
+            policy_latent = self.train_model.policy_network(obs)
+            pd, _ = self.train_model.pdtype.pdfromlatent(policy_latent)
+            neglogpac = pd.neglogp(actions)
+            entropy = tf.reduce_mean(pd.entropy())
+            vpred = self.train_model.value(obs)
+            vf_loss = tf.reduce_mean(tf.square(vpred - rewards))
+            pg_loss = tf.reduce_mean(advs * neglogpac)
+            loss = pg_loss - entropy * self.ent_coef + vf_loss * self.vf_coef

-        neglogpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
-        pg_loss = tf.reduce_mean(ADV * neglogpac)
-        vf_loss = tf.reduce_mean(mse(tf.squeeze(train_model.vf), R))
-        entropy = tf.reduce_mean(cat_entropy(train_model.pi))
-        loss = pg_loss - entropy*ent_coef + vf_loss * vf_coef
+        var_list = tape.watched_variables()
+        grads = tape.gradient(loss, var_list)
+        grads, _ = tf.clip_by_global_norm(grads, self.max_grad_norm)
+        grads_and_vars = list(zip(grads, var_list))
+        self.optimizer.apply_gradients(grads_and_vars)

-        params = find_trainable_variables("model")
-        grads = tf.gradients(loss, params)
-        if max_grad_norm is not None:
-            grads, grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
-        grads = list(zip(grads, params))
-        trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=alpha, epsilon=epsilon)
-        _train = trainer.apply_gradients(grads)
+        return pg_loss, vf_loss, entropy

-        lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)

-        def train(obs, states, rewards, masks, actions, values):
-            advs = rewards - values
-            for step in range(len(obs)):
-                cur_lr = lr.value()
-            td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, LR:cur_lr}
-            if states is not None:
-                td_map[train_model.S] = states
-                td_map[train_model.M] = masks
-            policy_loss, value_loss, policy_entropy, _ = sess.run(
-                [pg_loss, vf_loss, entropy, _train],
-                td_map
-            )
-            return policy_loss, value_loss, policy_entropy
+def learn(
+    network,
+    env,
+    seed=None,
+    nsteps=5,
+    total_timesteps=int(80e6),
+    vf_coef=0.5,
+    ent_coef=0.01,
+    max_grad_norm=0.5,
+    lr=7e-4,
+    lrschedule='linear',
+    epsilon=1e-5,
+    alpha=0.99,
+    gamma=0.99,
+    log_interval=100,
+    load_path=None,
+    **network_kwargs):

-        def save(save_path):
-            ps = sess.run(params)
-            make_path(osp.dirname(save_path))
-            joblib.dump(ps, save_path)
+    '''
+    Main entrypoint for A2C algorithm. Train a policy with given network architecture on a given environment using a2c algorithm.

-        def load(load_path):
-            loaded_params = joblib.load(load_path)
-            restores = []
-            for p, loaded_p in zip(params, loaded_params):
-                restores.append(p.assign(loaded_p))
-            sess.run(restores)
+    Parameters:
+    -----------

-        self.train = train
-        self.train_model = train_model
-        self.step_model = step_model
-        self.step = step_model.step
-        self.value = step_model.value
-        self.initial_state = step_model.initial_state
-        self.save = save
-        self.load = load
-        tf.global_variables_initializer().run(session=sess)
+    network:            policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
+                        specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
+                        tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
+                        neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
+                        See baselines.common/policies.py/lstm for more details on using recurrent nets in policies

-class Runner(AbstractEnvRunner):

-    def __init__(self, env, model, nsteps=5, gamma=0.99):
-        super().__init__(env=env, model=model, nsteps=nsteps)
-        self.gamma = gamma
+    env:                RL environment. Should implement interface similar to VecEnv (baselines.common/vec_env) or be wrapped with DummyVecEnv (baselines.common/vec_env/dummy_vec_env.py)
+
+
+    seed:               seed to make random number sequence in the algorithm reproducible. By default is None which means seed from system noise generator (not reproducible)
+
+    nsteps:             int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
+                        nenv is number of environment copies simulated in parallel)
+
+    total_timesteps:    int, total number of timesteps to train on (default: 80M)
+
+    vf_coef:            float, coefficient in front of value function loss in the total loss function (default: 0.5)
+
+    ent_coef:           float, coefficient in front of the policy entropy in the total loss function (default: 0.01)
+
+    max_grad_norm:      float, gradient is clipped to have global L2 norm no more than this value (default: 0.5)
+
+    lr:                 float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)
+
+    lrschedule:         schedule of learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes fraction of the training progress as input and
+                        returns fraction of the learning rate (specified as lr) as output
+
+    epsilon:            float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
+
+    alpha:              float, RMSProp decay parameter (default: 0.99)
+
+    gamma:              float, reward discounting parameter (default: 0.99)
+
+    log_interval:       int, specifies how frequently the logs are printed out (default: 100)
+
+    **network_kwargs:   keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
+                        For instance, 'mlp' network architecture has arguments num_hidden and num_layers.
+
+    '''
+

-    def run(self):
-        mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
-        mb_states = self.states
-        for n in range(self.nsteps):
-            actions, values, states, _ = self.model.step(self.obs, self.states, self.dones)
-            mb_obs.append(np.copy(self.obs))
-            mb_actions.append(actions)
-            mb_values.append(values)
-            mb_dones.append(self.dones)
-            obs, rewards, dones, _ = self.env.step(actions)
-            self.states = states
-            self.dones = dones
-            for n, done in enumerate(dones):
-                if done:
-                    self.obs[n] = self.obs[n]*0
-            self.obs = obs
-            mb_rewards.append(rewards)
-        mb_dones.append(self.dones)
-        #batch of steps to batch of rollouts
-        mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0).reshape(self.batch_ob_shape)
-        mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
-        mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
-        mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
-        mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
-        mb_masks = mb_dones[:, :-1]
-        mb_dones = mb_dones[:, 1:]
-        last_values = self.model.value(self.obs, self.states, self.dones).tolist()
-        #discount/bootstrap off value fn
-        for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
-            rewards = rewards.tolist()
-            dones = dones.tolist()
-            if dones[-1] == 0:
-                rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
-            else:
-                rewards = discount_with_dones(rewards, dones, self.gamma)
-            mb_rewards[n] = rewards
-        mb_rewards = mb_rewards.flatten()
-        mb_actions = mb_actions.flatten()
-        mb_values = mb_values.flatten()
-        mb_masks = mb_masks.flatten()
-        return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values

-def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5, lr=7e-4, lrschedule='linear', epsilon=1e-5, alpha=0.99, gamma=0.99, log_interval=100):
    set_global_seeds(seed)

+    total_timesteps = int(total_timesteps)
+
+    # Get the number of environments
    nenvs = env.num_envs
+
+    # Get observation_space and action_space
    ob_space = env.observation_space
    ac_space = env.action_space
-    model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
-        max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule)
-    runner = Runner(env, model, nsteps=nsteps, gamma=gamma)

-    nbatch = nenvs*nsteps
+    if isinstance(network, str):
+        network_type = network
+        policy_network_fn = get_network_builder(network_type)(**network_kwargs)
+        policy_network = policy_network_fn(ob_space.shape)
+
+    # Calculate the batch_size
+    nbatch = nenvs * nsteps
+    nupdates = total_timesteps // nbatch
+
+    # Instantiate the model object (that creates step_model and train_model)
+    model = Model(ac_space=ac_space, policy_network=policy_network, nupdates=nupdates, ent_coef=ent_coef, vf_coef=vf_coef,
+        max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps)
+
+    if load_path is not None:
+        load_path = osp.expanduser(load_path)
+        ckpt = tf.train.Checkpoint(model=model)
+        manager = tf.train.CheckpointManager(ckpt, load_path, max_to_keep=None)
+        ckpt.restore(manager.latest_checkpoint)
+
+    # Instantiate the runner object
+    runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
+    epinfobuf = deque(maxlen=100)
+
+    # Start total timer
    tstart = time.time()
-    for update in range(1, total_timesteps//nbatch+1):
-        obs, states, rewards, masks, actions, values = runner.run()
+
+    for update in range(1, nupdates+1):
+        # Get mini batch of experiences
+        obs, states, rewards, masks, actions, values, epinfos = runner.run()
+        epinfobuf.extend(epinfos)
+
+        obs = tf.constant(obs)
+        if states is not None:
+            states = tf.constant(states)
+        rewards = tf.constant(rewards)
+        masks = tf.constant(masks)
+        actions = tf.constant(actions)
+        values = tf.constant(values)
        policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
        nseconds = time.time()-tstart
+
+        # Calculate the fps (frames per second)
        fps = int((update*nbatch)/nseconds)
        if update % log_interval == 0 or update == 1:
+            # Calculates if value function is a good predictor of the returns (ev close to 1)
+            # or if it's no better than predicting nothing (ev <= 0)
            ev = explained_variance(values, rewards)
            logger.record_tabular("nupdates", update)
            logger.record_tabular("total_timesteps", update*nbatch)
@@ -155,6 +196,8 @@ def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, e
            logger.record_tabular("policy_entropy", float(policy_entropy))
            logger.record_tabular("value_loss", float(value_loss))
            logger.record_tabular("explained_variance", float(ev))
+            logger.record_tabular("eprewmean", safemean([epinfo['r'] for epinfo in epinfobuf]))
+            logger.record_tabular("eplenmean", safemean([epinfo['l'] for epinfo in epinfobuf]))
            logger.dump_tabular()
-    env.close()
    return model
+
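The logging code above reports `explained_variance(values, rewards)`. As a rough illustration of what that metric measures, here is a standalone sketch of the usual definition, 1 - Var[returns - predicted] / Var[returns]. The function name `explained_variance_sketch` is hypothetical, and the formula is assumed to match the helper in `baselines.common`, which this diff does not show.

```python
from statistics import pvariance


def explained_variance_sketch(predicted, returns):
    """Fraction of the variance in `returns` that `predicted` accounts for.

    Assumed definition: 1 - Var[returns - predicted] / Var[returns].
    Returns NaN when the returns have zero variance.
    """
    var_returns = pvariance(returns)
    if var_returns == 0:
        return float("nan")
    residuals = [r - p for r, p in zip(returns, predicted)]
    return 1.0 - pvariance(residuals) / var_returns


returns = [1.0, 2.0, 3.0, 4.0]
perfect = explained_variance_sketch(returns, returns)                 # predictions equal the returns
constant = explained_variance_sketch([2.5] * 4, returns)              # always predicts the mean return
inverted = explained_variance_sketch([-r for r in returns], returns)  # anti-correlated predictions
print(perfect, constant, inverted)
```

Under this definition a value near 1 means the value function tracks the returns closely, 0 means it does no better than a constant prediction, and negative values mean it does worse.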
--- a/baselines/a2c/policies.py
+++ b/baselines/a2c/policies.py
@@ -1,146 +0,0 @@
-import numpy as np
-import tensorflow as tf
-from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
-from baselines.common.distributions import make_pdtype
-from baselines.common.input import observation_input
-
-def nature_cnn(unscaled_images, **conv_kwargs):
-    """
-    CNN from Nature paper.
-    """
-    scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
-    activ = tf.nn.relu
-    h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
-                   **conv_kwargs))
-    h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
-    h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
-    h3 = conv_to_fc(h3)
-    return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
-
-class LnLstmPolicy(object):
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
-        nenv = nbatch // nsteps
-        X, processed_x = observation_input(ob_space, nbatch)
-        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
-        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
-        self.pdtype = make_pdtype(ac_space)
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(processed_x)
-            xs = batch_to_seq(h, nenv, nsteps)
-            ms = batch_to_seq(M, nenv, nsteps)
-            h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
-            h5 = seq_to_batch(h5)
-            vf = fc(h5, 'v', 1)
-            self.pd, self.pi = self.pdtype.pdfromlatent(h5)
-
-        v0 = vf[:, 0]
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
-
-        def step(ob, state, mask):
-            return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
-
-        def value(ob, state, mask):
-            return sess.run(v0, {X:ob, S:state, M:mask})
-
-        self.X = X
-        self.M = M
-        self.S = S
-        self.vf = vf
-        self.step = step
-        self.value = value
-
-class LstmPolicy(object):
-
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
-        nenv = nbatch // nsteps
-        self.pdtype = make_pdtype(ac_space)
-        X, processed_x = observation_input(ob_space, nbatch)
-
-        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
-        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(X)
-            xs = batch_to_seq(h, nenv, nsteps)
-            ms = batch_to_seq(M, nenv, nsteps)
-            h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
-            h5 = seq_to_batch(h5)
-            vf = fc(h5, 'v', 1)
-            self.pd, self.pi = self.pdtype.pdfromlatent(h5)
-
-        v0 = vf[:, 0]
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
-
-        def step(ob, state, mask):
-            return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
-
-        def value(ob, state, mask):
-            return sess.run(v0, {X:ob, S:state, M:mask})
-
-        self.X = X
-        self.M = M
-        self.S = S
-        self.vf = vf
-        self.step = step
-        self.value = value
-
-class CnnPolicy(object):
-
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False, **conv_kwargs): #pylint: disable=W0613
-        self.pdtype = make_pdtype(ac_space)
-        X, processed_x = observation_input(ob_space, nbatch)
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(processed_x, **conv_kwargs)
-            vf = fc(h, 'v', 1)[:,0]
-            self.pd, self.pi = self.pdtype.pdfromlatent(h, init_scale=0.01)
-
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = None
-
-        def step(ob, *_args, **_kwargs):
-            a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
-            return a, v, self.initial_state, neglogp
-
-        def value(ob, *_args, **_kwargs):
-            return sess.run(vf, {X:ob})
-
-        self.X = X
-        self.vf = vf
-        self.step = step
-        self.value = value
-
-class MlpPolicy(object):
-    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
-        self.pdtype = make_pdtype(ac_space)
-        with tf.variable_scope("model", reuse=reuse):
-            X, processed_x = observation_input(ob_space, nbatch)
-            activ = tf.tanh
-            processed_x = tf.layers.flatten(processed_x)
-            pi_h1 = activ(fc(processed_x, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
-            pi_h2 = activ(fc(pi_h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
-            vf_h1 = activ(fc(processed_x, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
-            vf_h2 = activ(fc(vf_h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
-            vf = fc(vf_h2, 'vf', 1)[:,0]
-
-            self.pd, self.pi = self.pdtype.pdfromlatent(pi_h2, init_scale=0.01)
-
-
-        a0 = self.pd.sample()
-        neglogp0 = self.pd.neglogp(a0)
-        self.initial_state = None
-
-        def step(ob, *_args, **_kwargs):
-            a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
-            return a, v, self.initial_state, neglogp
-
-        def value(ob, *_args, **_kwargs):
-            return sess.run(vf, {X:ob})
-
-        self.X = X
-        self.vf = vf
-        self.step = step
-        self.value = value
--- a/baselines/a2c/run_atari.py
+++ b/baselines/a2c/run_atari.py
@@ -1,30 +0,0 @@
-#!/usr/bin/env python3
-
-from baselines import logger
-from baselines.common.cmd_util import make_atari_env, atari_arg_parser
-from baselines.common.vec_env.vec_frame_stack import VecFrameStack
-from baselines.a2c.a2c import learn
-from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
-
-def train(env_id, num_timesteps, seed, policy, lrschedule, num_env):
-    if policy == 'cnn':
-        policy_fn = CnnPolicy
-    elif policy == 'lstm':
-        policy_fn = LstmPolicy
-    elif policy == 'lnlstm':
-        policy_fn = LnLstmPolicy
-    env = VecFrameStack(make_atari_env(env_id, num_env, seed), 4)
-    learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
-    env.close()
-
-def main():
-    parser = atari_arg_parser()
-    parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
-    parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
-    args = parser.parse_args()
-    logger.configure()
-    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
-        policy=args.policy, lrschedule=args.lrschedule, num_env=16)
-
-if __name__ == '__main__':
-    main()
--- a/baselines/a2c/runner.py
+++ b/baselines/a2c/runner.py
@@ -0,0 +1,80 @@
+import tensorflow as tf
+import numpy as np
+from baselines.a2c.utils import discount_with_dones
+from baselines.common.runners import AbstractEnvRunner
+
+class Runner(AbstractEnvRunner):
+    """
+    We use this class to generate batches of experiences
+
+    __init__:
+    - Initialize the runner
+
+    run():
+    - Make a mini batch of experiences
+    """
+    def __init__(self, env, model, nsteps=5, gamma=0.99):
+        super().__init__(env=env, model=model, nsteps=nsteps)
+        self.gamma = gamma
+
+    def run(self):
+        # Initialize the lists that will hold the minibatch (mb) of experiences
+        mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
+        mb_states = self.states
+        epinfos = []
+        for _ in range(self.nsteps):
+            # Given observations, get the action and the value estimate V(s)
+            # self.obs is already set: the AbstractEnvRunner superclass runs self.obs[:] = env.reset() on init
+            obs = tf.constant(self.obs)
+            actions, values, self.states, _ = self.model.step(obs)
+            actions = actions.numpy()
+            # Append the experiences
+            mb_obs.append(self.obs.copy())
+            mb_actions.append(actions)
+            mb_values.append(values.numpy())
+            mb_dones.append(self.dones)
+
+            # Take actions in the env and observe the results
+            self.obs[:], rewards, self.dones, infos = self.env.step(actions)
+            for info in infos:
+                maybeepinfo = info.get('episode')
+                if maybeepinfo: epinfos.append(maybeepinfo)
+            mb_rewards.append(rewards)
+
+        mb_dones.append(self.dones)
+
+        # Batch of steps to batch of rollouts
+        mb_obs = sf01(np.asarray(mb_obs, dtype=self.obs.dtype))
+        mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
+        mb_actions = sf01(np.asarray(mb_actions, dtype=actions.dtype))
+        mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
+        mb_dones = np.asarray(mb_dones, dtype=bool).swapaxes(1, 0)
+        mb_masks = mb_dones[:, :-1]
+        mb_dones = mb_dones[:, 1:]
+
+
+        if self.gamma > 0.0:
+            # Discount/bootstrap off value fn
+            last_values = self.model.value(tf.constant(self.obs)).numpy().tolist()
+            for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
+                rewards = rewards.tolist()
+                dones = dones.tolist()
+                if dones[-1] == 0:
+                    rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
+                else:
+                    rewards = discount_with_dones(rewards, dones, self.gamma)
+
+                mb_rewards[n] = rewards
+
+
+        mb_rewards = mb_rewards.flatten()
+        mb_values = mb_values.flatten()
+        mb_masks = mb_masks.flatten()
+        return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values, epinfos
+
+def sf01(arr):
+    """
+    swap and then flatten axes 0 and 1
+    """
+    s = arr.shape
+    return arr.swapaxes(0, 1).reshape(s[0] * s[1], *s[2:])
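Review note on `runner.py`: the reshaping and bootstrapping in `run()` are easy to misread, so here is a minimal NumPy-only sketch. `sf01` is copied from the hunk above. The loop body of `discount_with_dones` is not fully visible in this diff, so the recursion below (`r = reward + gamma * r * (1 - done)`) is an assumption, and the sample rewards, dones and `last_value` are made-up illustration values.

```python
import numpy as np

def sf01(arr):
    """Swap and then flatten axes 0 and 1 (copied from the hunk above)."""
    s = arr.shape
    return arr.swapaxes(0, 1).reshape(s[0] * s[1], *s[2:])

def discount_with_dones(rewards, dones, gamma):
    # Assumed body: backward pass that resets the running return at episode ends.
    discounted = []
    r = 0.0
    for reward, done in zip(rewards[::-1], dones[::-1]):
        r = reward + gamma * r * (1.0 - done)
        discounted.append(r)
    return discounted[::-1]

# 3 steps x 2 envs, stored step-major as the runner collects them.
steps_by_env = np.arange(6).reshape(3, 2)
flat = sf01(steps_by_env)  # env-major: all steps of env 0, then all steps of env 1

rewards, gamma = [1.0, 1.0, 1.0], 0.5
last_value = 4.0  # hypothetical V(s) for the observation after the last step

# Rollout cut off mid-episode: append the value estimate, then drop it again.
bootstrapped = discount_with_dones(rewards + [last_value], [0, 0, 0] + [0], gamma)[:-1]

# Rollout that ends exactly at a terminal step: no bootstrapping.
terminated = discount_with_dones(rewards, [0, 0, 1], gamma)

print(flat.tolist(), bootstrapped, terminated)
```

The two calls mirror the `if dones[-1] == 0` branch in `run()`: the value estimate only enters the return when the last collected step did not end the episode.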
--- a/baselines/a2c/utils.py
+++ b/baselines/a2c/utils.py
@@ -1,26 +1,5 @@
-import os
-import gym
 import numpy as np
 import tensorflow as tf
-from gym import spaces
-from collections import deque
-
-def sample(logits):
-    noise = tf.random_uniform(tf.shape(logits))
-    return tf.argmax(logits - tf.log(-tf.log(noise)), 1)
-
-def cat_entropy(logits):
-    a0 = logits - tf.reduce_max(logits, 1, keep_dims=True)
-    ea0 = tf.exp(a0)
-    z0 = tf.reduce_sum(ea0, 1, keep_dims=True)
-    p0 = ea0 / z0
-    return tf.reduce_sum(p0 * (tf.log(z0) - a0), 1)
-
-def cat_entropy_softmax(p0):
-    return - tf.reduce_sum(p0 * tf.log(p0 + 1e-6), axis = 1)
-
-def mse(pred, target):
-    return tf.square(pred-target)/2.

 def ortho_init(scale=1.0):
    def _ortho_init(shape, dtype, partition_info=None):
@@ -39,117 +18,18 @@ def ortho_init(scale=1.0):
        return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
    return _ortho_init

-def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC', one_dim_bias=False):
-    if data_format == 'NHWC':
-        channel_ax = 3
-        strides = [1, stride, stride, 1]
-        bshape = [1, 1, 1, nf]
-    elif data_format == 'NCHW':
-        channel_ax = 1
-        strides = [1, 1, stride, stride]
-        bshape = [1, nf, 1, 1]
-    else:
-        raise NotImplementedError
-    bias_var_shape = [nf] if one_dim_bias else [1, nf, 1, 1]
-    nin = x.get_shape()[channel_ax].value
-    wshape = [rf, rf, nin, nf]
-    with tf.variable_scope(scope):
-        w = tf.get_variable("w", wshape, initializer=ortho_init(init_scale))
-        b = tf.get_variable("b", bias_var_shape, initializer=tf.constant_initializer(0.0))
-        if not one_dim_bias and data_format == 'NHWC':
-            b = tf.reshape(b, bshape)
-        return b + tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format)
+def conv(scope, *, nf, rf, stride, activation, pad='valid', init_scale=1.0, data_format='channels_last'):
+    with tf.name_scope(scope):
+        layer = tf.keras.layers.Conv2D(filters=nf, kernel_size=rf, strides=stride, padding=pad, activation=activation,
+                                       data_format=data_format, kernel_initializer=ortho_init(init_scale))
+    return layer

-def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
-    with tf.variable_scope(scope):
-        nin = x.get_shape()[1].value
-        w = tf.get_variable("w", [nin, nh], initializer=ortho_init(init_scale))
-        b = tf.get_variable("b", [nh], initializer=tf.constant_initializer(init_bias))
-        return tf.matmul(x, w)+b
-
-def batch_to_seq(h, nbatch, nsteps, flat=False):
-    if flat:
-        h = tf.reshape(h, [nbatch, nsteps])
-    else:
-        h = tf.reshape(h, [nbatch, nsteps, -1])
-    return [tf.squeeze(v, [1]) for v in tf.split(axis=1, num_or_size_splits=nsteps, value=h)]
-
-def seq_to_batch(h, flat = False):
-    shape = h[0].get_shape().as_list()
-    if not flat:
-        assert(len(shape) > 1)
-        nh = h[0].get_shape()[-1].value
-        return tf.reshape(tf.concat(axis=1, values=h), [-1, nh])
-    else:
-        return tf.reshape(tf.stack(values=h, axis=1), [-1])
-
-def lstm(xs, ms, s, scope, nh, init_scale=1.0):
-    nbatch, nin = [v.value for v in xs[0].get_shape()]
-    nsteps = len(xs)
-    with tf.variable_scope(scope):
-        wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
-        wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
-        b = tf.get_variable("b", [nh*4], initializer=tf.constant_initializer(0.0))
-
-    c, h = tf.split(axis=1, num_or_size_splits=2, value=s)
-    for idx, (x, m) in enumerate(zip(xs, ms)):
-        c = c*(1-m)
-        h = h*(1-m)
-        z = tf.matmul(x, wx) + tf.matmul(h, wh) + b
-        i, f, o, u = tf.split(axis=1, num_or_size_splits=4, value=z)
-        i = tf.nn.sigmoid(i)
-        f = tf.nn.sigmoid(f)
-        o = tf.nn.sigmoid(o)
-        u = tf.tanh(u)
-        c = f*c + i*u
-        h = o*tf.tanh(c)
-        xs[idx] = h
-    s = tf.concat(axis=1, values=[c, h])
-    return xs, s
-
-def _ln(x, g, b, e=1e-5, axes=[1]):
-    u, s = tf.nn.moments(x, axes=axes, keep_dims=True)
-    x = (x-u)/tf.sqrt(s+e)
-    x = x*g+b
-    return x
-
-def lnlstm(xs, ms, s, scope, nh, init_scale=1.0):
-    nbatch, nin = [v.value for v in xs[0].get_shape()]
-    nsteps = len(xs)
-    with tf.variable_scope(scope):
-        wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
-        gx = tf.get_variable("gx", [nh*4], initializer=tf.constant_initializer(1.0))
-        bx = tf.get_variable("bx", [nh*4], initializer=tf.constant_initializer(0.0))
-
-        wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
-        gh = tf.get_variable("gh", [nh*4], initializer=tf.constant_initializer(1.0))
-        bh = tf.get_variable("bh", [nh*4], initializer=tf.constant_initializer(0.0))
-
-        b = tf.get_variable("b", [nh*4], initializer=tf.constant_initializer(0.0))
-
-        gc = tf.get_variable("gc", [nh], initializer=tf.constant_initializer(1.0))
-        bc = tf.get_variable("bc", [nh], initializer=tf.constant_initializer(0.0))
-
-    c, h = tf.split(axis=1, num_or_size_splits=2, value=s)
-    for idx, (x, m) in enumerate(zip(xs, ms)):
-        c = c*(1-m)
-        h = h*(1-m)
-        z = _ln(tf.matmul(x, wx), gx, bx) + _ln(tf.matmul(h, wh), gh, bh) + b
-        i, f, o, u = tf.split(axis=1, num_or_size_splits=4, value=z)
-        i = tf.nn.sigmoid(i)
-        f = tf.nn.sigmoid(f)
-        o = tf.nn.sigmoid(o)
-        u = tf.tanh(u)
-        c = f*c + i*u
-        h = o*tf.tanh(_ln(c, gc, bc))
-        xs[idx] = h
-    s = tf.concat(axis=1, values=[c, h])
-    return xs, s
-
-def conv_to_fc(x):
-    nh = np.prod([v.value for v in x.get_shape()[1:]])
-    x = tf.reshape(x, [-1, nh])
-    return x
+def fc(input_shape, scope, nh, *, init_scale=1.0, init_bias=0.0):
+    with tf.name_scope(scope):
+        layer = tf.keras.layers.Dense(units=nh, kernel_initializer=ortho_init(init_scale),
+                                      bias_initializer=tf.keras.initializers.Constant(init_bias))
+        layer.build(input_shape)
+    return layer

 def discount_with_dones(rewards, dones, gamma):
    discounted = []
@@ -159,132 +39,25 @@ def discount_with_dones(rewards, dones, gamma):
        discounted.append(r)
    return discounted[::-1]

-def find_trainable_variables(key):
-    with tf.variable_scope(key):
-        return tf.trainable_variables()
+class InverseLinearTimeDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
+    def __init__(self, initial_learning_rate, nupdates, name="InverseLinearTimeDecay"):
+        super(InverseLinearTimeDecay, self).__init__()
+        self.initial_learning_rate = initial_learning_rate
+        self.nupdates = nupdates
+        self.name = name

-def make_path(f):
-    return os.makedirs(f, exist_ok=True)
+    def __call__(self, step):
+        with tf.name_scope(self.name):
+            initial_learning_rate = tf.convert_to_tensor(self.initial_learning_rate, name="initial_learning_rate")
+            dtype = initial_learning_rate.dtype
+            step_t = tf.cast(step, dtype)
+            nupdates_t = tf.convert_to_tensor(self.nupdates, dtype=dtype)
+            tf.assert_less(step_t, nupdates_t)
+            return initial_learning_rate * (1. - step_t / nupdates_t)

-def constant(p):
-    return 1
-
-def linear(p):
-    return 1-p
-
-def middle_drop(p):
-    eps = 0.75
-    if 1-p<eps:
-        return eps*0.1
-    return 1-p
-
-def double_linear_con(p):
-    p *= 2
-    eps = 0.125
-    if 1-p<eps:
-        return eps
-    return 1-p
-
-def double_middle_drop(p):
-    eps1 = 0.75
-    eps2 = 0.25
-    if 1-p<eps1:
-        if 1-p<eps2:
-            return eps2*0.5
-        return eps1*0.1
-    return 1-p
-
-schedules = {
-    'linear':linear,
-    'constant':constant,
-    'double_linear_con': double_linear_con,
-    'middle_drop': middle_drop,
-    'double_middle_drop': double_middle_drop
-}
-
-class Scheduler(object):
-
-    def __init__(self, v, nvalues, schedule):
-        self.n = 0.
-        self.v = v
-        self.nvalues = nvalues
-        self.schedule = schedules[schedule]
-
-    def value(self):
-        current_value = self.v*self.schedule(self.n/self.nvalues)
-        self.n += 1.
-        return current_value
-
-    def value_steps(self, steps):
-        return self.v*self.schedule(steps/self.nvalues)
-
-
-class EpisodeStats:
-    def __init__(self, nsteps, nenvs):
-        self.episode_rewards = []
-        for i in range(nenvs):
-            self.episode_rewards.append([])
-        self.lenbuffer = deque(maxlen=40)  # rolling buffer for episode lengths
-        self.rewbuffer = deque(maxlen=40)  # rolling buffer for episode rewards
-        self.nsteps = nsteps
-        self.nenvs = nenvs
-
-    def feed(self, rewards, masks):
-        rewards = np.reshape(rewards, [self.nenvs, self.nsteps])
-        masks = np.reshape(masks, [self.nenvs, self.nsteps])
-        for i in range(0, self.nenvs):
-            for j in range(0, self.nsteps):
-                self.episode_rewards[i].append(rewards[i][j])
-                if masks[i][j]:
-                    l = len(self.episode_rewards[i])
-                    s = sum(self.episode_rewards[i])
-                    self.lenbuffer.append(l)
-                    self.rewbuffer.append(s)
-                    self.episode_rewards[i] = []
-
-    def mean_length(self):
-        if self.lenbuffer:
-            return np.mean(self.lenbuffer)
-        else:
-            return 0  # on the first params dump, no episodes are finished
-
-    def mean_reward(self):
-        if self.rewbuffer:
-            return np.mean(self.rewbuffer)
-        else:
-            return 0
-
-
-# For ACER
-def get_by_index(x, idx):
-    assert(len(x.get_shape()) == 2)
-    assert(len(idx.get_shape()) == 1)
-    idx_flattened = tf.range(0, x.shape[0]) * x.shape[1] + idx
-    y = tf.gather(tf.reshape(x, [-1]),  # flatten input
-                  idx_flattened)  # use flattened indices
-    return y
-
-def check_shape(ts,shapes):
-    i = 0
-    for (t,shape) in zip(ts,shapes):
-        assert t.get_shape().as_list()==shape, "id " + str(i) + " shape " + str(t.get_shape()) + str(shape)
-        i += 1
-
-def avg_norm(t):
-    return tf.reduce_mean(tf.sqrt(tf.reduce_sum(tf.square(t), axis=-1)))
-
-def gradient_add(g1, g2, param):
-    print([g1, g2, param.name])
-    assert (not (g1 is None and g2 is None)), param.name
-    if g1 is None:
-        return g2
-    elif g2 is None:
-        return g1
-    else:
-        return g1 + g2
-
-def q_explained_variance(qpred, q):
-    _, vary = tf.nn.moments(q, axes=[0, 1])
-    _, varpred = tf.nn.moments(q - qpred, axes=[0, 1])
-    check_shape([vary, varpred], [[]] * 2)
-    return 1.0 - (varpred / vary)
+    def get_config(self):
+        return {
+            "initial_learning_rate": self.initial_learning_rate,
+            "nupdates": self.nupdates,
+            "name": self.name
+        }
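Review note on `InverseLinearTimeDecay`: the schedule computes `initial_learning_rate * (1 - step / nupdates)` and asserts `step < nupdates`. The sketch below restates that formula in plain Python so the values can be checked without TensorFlow. The function name `inverse_linear_time_decay` and the sample numbers are hypothetical and not part of the patch.

```python
def inverse_linear_time_decay(initial_learning_rate, nupdates, step):
    """Plain-Python restatement of the decay formula in __call__ above."""
    # Mirrors tf.assert_less(step_t, nupdates_t) in the patch.
    if not step < nupdates:
        raise ValueError("step must be smaller than nupdates")
    return initial_learning_rate * (1.0 - step / nupdates)

# Illustrative values: 4 updates starting from a learning rate of 7e-4.
schedule = [inverse_linear_time_decay(7e-4, 4, step) for step in range(4)]
print(schedule)
```

With these inputs the rate falls in equal steps from the initial value towards zero, and it never reaches zero because the last permitted step is `nupdates - 1`.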
--- a/baselines/acer/README.md
+++ b/baselines/acer/README.md
@@ -1,4 +0,0 @@
-# ACER
-
- Original paper: https://arxiv.org/abs/1611.01224
- `python -m baselines.acer.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
--- a/baselines/acer/__init__.py
+++ b/baselines/acer/__init__.py
--- a/baselines/acer/acer_simple.py
+++ b/baselines/acer/acer_simple.py
@@ -1,348 +0,0 @@
-import time
-import joblib
-import numpy as np
-import tensorflow as tf
-from baselines import logger
-
-from baselines.common import set_global_seeds
-from baselines.common.runners import AbstractEnvRunner
-
-from baselines.a2c.utils import batch_to_seq, seq_to_batch
-from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
-from baselines.a2c.utils import cat_entropy_softmax
-from baselines.a2c.utils import EpisodeStats
-from baselines.a2c.utils import get_by_index, check_shape, avg_norm, gradient_add, q_explained_variance
-from baselines.acer.buffer import Buffer
-
-import os.path as osp
-
-# remove last step
-def strip(var, nenvs, nsteps, flat = False):
-    vars = batch_to_seq(var, nenvs, nsteps + 1, flat)
-    return seq_to_batch(vars[:-1], flat)
-
-def q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma):
-    """
-    Calculates q_retrace targets
-
-    :param R: Rewards
-    :param D: Dones
-    :param q_i: Q values for actions taken
-    :param v: V values
-    :param rho_i: Importance weight for each action
-    :return: Q_retrace values
-    """
-    rho_bar = batch_to_seq(tf.minimum(1.0, rho_i), nenvs, nsteps, True)  # list of len steps, shape [nenvs]
-    rs = batch_to_seq(R, nenvs, nsteps, True)  # list of len steps, shape [nenvs]
-    ds = batch_to_seq(D, nenvs, nsteps, True)  # list of len steps, shape [nenvs]
-    q_is = batch_to_seq(q_i, nenvs, nsteps, True)
-    vs = batch_to_seq(v, nenvs, nsteps + 1, True)
-    v_final = vs[-1]
-    qret = v_final
-    qrets = []
-    for i in range(nsteps - 1, -1, -1):
-        check_shape([qret, ds[i], rs[i], rho_bar[i], q_is[i], vs[i]], [[nenvs]] * 6)
-        qret = rs[i] + gamma * qret * (1.0 - ds[i])
-        qrets.append(qret)
-        qret = (rho_bar[i] * (qret - q_is[i])) + vs[i]
-    qrets = qrets[::-1]
-    qret = seq_to_batch(qrets, flat=True)
-    return qret
-
-# For ACER with PPO clipping instead of trust region
-# def clip(ratio, eps_clip):
-#     # assume 0 <= eps_clip <= 1
-#     return tf.minimum(1 + eps_clip, tf.maximum(1 - eps_clip, ratio))
-
-class Model(object):
-    def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, nstack, num_procs,
-                 ent_coef, q_coef, gamma, max_grad_norm, lr,
-                 rprop_alpha, rprop_epsilon, total_timesteps, lrschedule,
-                 c, trust_region, alpha, delta):
-        config = tf.ConfigProto(allow_soft_placement=True,
-                                intra_op_parallelism_threads=num_procs,
-                                inter_op_parallelism_threads=num_procs)
-        sess = tf.Session(config=config)
-        nact = ac_space.n
-        nbatch = nenvs * nsteps
-
-        A = tf.placeholder(tf.int32, [nbatch]) # actions
-        D = tf.placeholder(tf.float32, [nbatch]) # dones
-        R = tf.placeholder(tf.float32, [nbatch]) # rewards, not returns
-        MU = tf.placeholder(tf.float32, [nbatch, nact]) # mu's
-        LR = tf.placeholder(tf.float32, [])
-        eps = 1e-6
-
-        step_model = policy(sess, ob_space, ac_space, nenvs, 1, nstack, reuse=False)
-        train_model = policy(sess, ob_space, ac_space, nenvs, nsteps + 1, nstack, reuse=True)
-
-        params = find_trainable_variables("model")
-        print("Params {}".format(len(params)))
-        for var in params:
-            print(var)
-
-        # create polyak averaged model
-        ema = tf.train.ExponentialMovingAverage(alpha)
-        ema_apply_op = ema.apply(params)
-
-        def custom_getter(getter, *args, **kwargs):
-            v = ema.average(getter(*args, **kwargs))
-            print(v.name)
-            return v
-
-        with tf.variable_scope("", custom_getter=custom_getter, reuse=True):
-            polyak_model = policy(sess, ob_space, ac_space, nenvs, nsteps + 1, nstack, reuse=True)
-
-        # Notation: (var) = batch variable, (var)s = seqeuence variable, (var)_i = variable index by action at step i
-        v = tf.reduce_sum(train_model.pi * train_model.q, axis = -1) # shape is [nenvs * (nsteps + 1)]
-
-        # strip off last step
-        f, f_pol, q = map(lambda var: strip(var, nenvs, nsteps), [train_model.pi, polyak_model.pi, train_model.q])
-        # Get pi and q values for actions taken
-        f_i = get_by_index(f, A)
-        q_i = get_by_index(q, A)
-
-        # Compute ratios for importance truncation
-        rho = f / (MU + eps)
-        rho_i = get_by_index(rho, A)
-
-        # Calculate Q_retrace targets
-        qret = q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma)
-
-        # Calculate losses
-        # Entropy
-        entropy = tf.reduce_mean(cat_entropy_softmax(f))
-
-        # Policy Graident loss, with truncated importance sampling & bias correction
-        v = strip(v, nenvs, nsteps, True)
-        check_shape([qret, v, rho_i, f_i], [[nenvs * nsteps]] * 4)
-        check_shape([rho, f, q], [[nenvs * nsteps, nact]] * 2)
-
-        # Truncated importance sampling
-        adv = qret - v
-        logf = tf.log(f_i + eps)
-        gain_f = logf * tf.stop_gradient(adv * tf.minimum(c, rho_i))  # [nenvs * nsteps]
-        loss_f = -tf.reduce_mean(gain_f)
-
-        # Bias correction for the truncation
-        adv_bc = (q - tf.reshape(v, [nenvs * nsteps, 1]))  # [nenvs * nsteps, nact]
-        logf_bc = tf.log(f + eps) # / (f_old + eps)
-        check_shape([adv_bc, logf_bc], [[nenvs * nsteps, nact]]*2)
-        gain_bc = tf.reduce_sum(logf_bc * tf.stop_gradient(adv_bc * tf.nn.relu(1.0 - (c / (rho + eps))) * f), axis = 1) #IMP: This is sum, as expectation wrt f
-        loss_bc= -tf.reduce_mean(gain_bc)
-
-        loss_policy = loss_f + loss_bc
-
-        # Value/Q function loss, and explained variance
-        check_shape([qret, q_i], [[nenvs * nsteps]]*2)
-        ev = q_explained_variance(tf.reshape(q_i, [nenvs, nsteps]), tf.reshape(qret, [nenvs, nsteps]))
-        loss_q = tf.reduce_mean(tf.square(tf.stop_gradient(qret) - q_i)*0.5)
-
-        # Net loss
-        check_shape([loss_policy, loss_q, entropy], [[]] * 3)
-        loss = loss_policy + q_coef * loss_q - ent_coef * entropy
-
-        if trust_region:
-            g = tf.gradients(- (loss_policy - ent_coef * entropy) * nsteps * nenvs, f) #[nenvs * nsteps, nact]
-            # k = tf.gradients(KL(f_pol || f), f)
-            k = - f_pol / (f + eps) #[nenvs * nsteps, nact] # Directly computed gradient of KL divergence wrt f
-            k_dot_g = tf.reduce_sum(k * g, axis=-1)
-            adj = tf.maximum(0.0, (tf.reduce_sum(k * g, axis=-1) - delta) / (tf.reduce_sum(tf.square(k), axis=-1) + eps)) #[nenvs * nsteps]
-
-            # Calculate stats (before doing adjustment) for logging.
-            avg_norm_k = avg_norm(k)
-            avg_norm_g = avg_norm(g)
-            avg_norm_k_dot_g = tf.reduce_mean(tf.abs(k_dot_g))
-            avg_norm_adj = tf.reduce_mean(tf.abs(adj))
-
-            g = g - tf.reshape(adj, [nenvs * nsteps, 1]) * k
-            grads_f = -g/(nenvs*nsteps) # These are turst region adjusted gradients wrt f ie statistics of policy pi
-            grads_policy = tf.gradients(f, params, grads_f)
-            grads_q = tf.gradients(loss_q * q_coef, params)
-            grads = [gradient_add(g1, g2, param) for (g1, g2, param) in zip(grads_policy, grads_q, params)]
-
-            avg_norm_grads_f = avg_norm(grads_f) * (nsteps * nenvs)
-            norm_grads_q = tf.global_norm(grads_q)
-            norm_grads_policy = tf.global_norm(grads_policy)
-        else:
-            grads = tf.gradients(loss, params)
-
-        if max_grad_norm is not None:
-            grads, norm_grads = tf.clip_by_global_norm(grads, max_grad_norm)
-        grads = list(zip(grads, params))
-        trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=rprop_alpha, epsilon=rprop_epsilon)
-        _opt_op = trainer.apply_gradients(grads)
-
-        # so when you call _train, you first do the gradient step, then you apply ema
-        with tf.control_dependencies([_opt_op]):
-            _train = tf.group(ema_apply_op)
-
-        lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
-
-        # Ops/Summaries to run, and their names for logging
-        run_ops = [_train, loss, loss_q, entropy, loss_policy, loss_f, loss_bc, ev, norm_grads]
-        names_ops = ['loss', 'loss_q', 'entropy', 'loss_policy', 'loss_f', 'loss_bc', 'explained_variance',
-                     'norm_grads']
-        if trust_region:
-            run_ops = run_ops + [norm_grads_q, norm_grads_policy, avg_norm_grads_f, avg_norm_k, avg_norm_g, avg_norm_k_dot_g,
-                                 avg_norm_adj]
-            names_ops = names_ops + ['norm_grads_q', 'norm_grads_policy', 'avg_norm_grads_f', 'avg_norm_k', 'avg_norm_g',
-                                     'avg_norm_k_dot_g', 'avg_norm_adj']
-
-        def train(obs, actions, rewards, dones, mus, states, masks, steps):
-            cur_lr = lr.value_steps(steps)
-            td_map = {train_model.X: obs, polyak_model.X: obs, A: actions, R: rewards, D: dones, MU: mus, LR: cur_lr}
-            if states != []:
-                td_map[train_model.S] = states
-                td_map[train_model.M] = masks
-                td_map[polyak_model.S] = states
-                td_map[polyak_model.M] = masks
-            return names_ops, sess.run(run_ops, td_map)[1:]  # strip off _train
-
-        def save(save_path):
-            ps = sess.run(params)
-            make_path(osp.dirname(save_path))
-            joblib.dump(ps, save_path)
-
-        self.train = train
-        self.save = save
-        self.train_model = train_model
-        self.step_model = step_model
-        self.step = step_model.step
-        self.initial_state = step_model.initial_state
-        tf.global_variables_initializer().run(session=sess)
-
-class Runner(AbstractEnvRunner):
-    def __init__(self, env, model, nsteps, nstack):
-        super().__init__(env=env, model=model, nsteps=nsteps)
-        self.nstack = nstack
-        nh, nw, nc = env.observation_space.shape
-        self.nc = nc  # nc = 1 for atari, but just in case
-        self.nenv = nenv = env.num_envs
-        self.nact = env.action_space.n
-        self.nbatch = nenv * nsteps
-        self.batch_ob_shape = (nenv*(nsteps+1), nh, nw, nc*nstack)
-        self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
-        obs = env.reset()
-        self.update_obs(obs)
-
-    def update_obs(self, obs, dones=None):
-        if dones is not None:
-            self.obs *= (1 - dones.astype(np.uint8))[:, None, None, None]
-        self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
-        self.obs[:, :, :, -self.nc:] = obs[:, :, :, :]
-
-    def run(self):
-        enc_obs = np.split(self.obs, self.nstack, axis=3)  # so now list of obs steps
-        mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
-        for _ in range(self.nsteps):
-            actions, mus, states = self.model.step(self.obs, state=self.states, mask=self.dones)
-            mb_obs.append(np.copy(self.obs))
-            mb_actions.append(actions)
-            mb_mus.append(mus)
-            mb_dones.append(self.dones)
-            obs, rewards, dones, _ = self.env.step(actions)
-            # states information for statefull models like LSTM
-            self.states = states
-            self.dones = dones
-            self.update_obs(obs, dones)
-            mb_rewards.append(rewards)
-            enc_obs.append(obs)
-        mb_obs.append(np.copy(self.obs))
-        mb_dones.append(self.dones)
-
-        enc_obs = np.asarray(enc_obs, dtype=np.uint8).swapaxes(1, 0)
-        mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0)
-        mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
-        mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
-        mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)
-
-        mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
-
-        mb_masks = mb_dones # Used for statefull models like LSTM's to mask state when done
-        mb_dones = mb_dones[:, 1:] # Used for calculating returns. The dones array is now aligned with rewards
-
-        # shapes are now [nenv, nsteps, []]
-        # When pulling from buffer, arrays will now be reshaped in place, preventing a deep copy.
-
-        return enc_obs, mb_obs, mb_actions, mb_rewards, mb_mus, mb_dones, mb_masks
-
-class Acer():
-    def __init__(self, runner, model, buffer, log_interval):
-        self.runner = runner
-        self.model = model
-        self.buffer = buffer
-        self.log_interval = log_interval
-        self.tstart = None
-        self.episode_stats = EpisodeStats(runner.nsteps, runner.nenv)
-        self.steps = None
-
-    def call(self, on_policy):
-        runner, model, buffer, steps = self.runner, self.model, self.buffer, self.steps
-        if on_policy:
-            enc_obs, obs, actions, rewards, mus, dones, masks = runner.run()
-            self.episode_stats.feed(rewards, dones)
-            if buffer is not None:
-                buffer.put(enc_obs, actions, rewards, mus, dones, masks)
-        else:
-            # get obs, actions, rewards, mus, dones from buffer.
-            obs, actions, rewards, mus, dones, masks = buffer.get()
-
-        # reshape stuff correctly
-        obs = obs.reshape(runner.batch_ob_shape)
-        actions = actions.reshape([runner.nbatch])
-        rewards = rewards.reshape([runner.nbatch])
-        mus = mus.reshape([runner.nbatch, runner.nact])
-        dones = dones.reshape([runner.nbatch])
-        masks = masks.reshape([runner.batch_ob_shape[0]])
-
-        names_ops, values_ops = model.train(obs, actions, rewards, dones, mus, model.initial_state, masks, steps)
-
-        if on_policy and (int(steps/runner.nbatch) % self.log_interval == 0):
-            logger.record_tabular("total_timesteps", steps)
-            logger.record_tabular("fps", int(steps/(time.time() - self.tstart)))
-            # IMP: In EpisodicLife env, during training, we get done=True at each loss of life, not just at the terminal state.
-            # Thus, this is mean until end of life, not end of episode.
-            # For true episode rewards, see the monitor files in the log folder.
-            logger.record_tabular("mean_episode_length", self.episode_stats.mean_length())
-            logger.record_tabular("mean_episode_reward", self.episode_stats.mean_reward())
-            for name, val in zip(names_ops, values_ops):
-                logger.record_tabular(name, float(val))
-            logger.dump_tabular()
-
-
-def learn(policy, env, seed, nsteps=20, nstack=4, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
-          max_grad_norm=10, lr=7e-4, lrschedule='linear', rprop_epsilon=1e-5, rprop_alpha=0.99, gamma=0.99,
-          log_interval=100, buffer_size=50000, replay_ratio=4, replay_start=10000, c=10.0,
-          trust_region=True, alpha=0.99, delta=1):
-    print("Running Acer Simple")
-    print(locals())
-    tf.reset_default_graph()
-    set_global_seeds(seed)
-
-    nenvs = env.num_envs
-    ob_space = env.observation_space
-    ac_space = env.action_space
-    num_procs = len(env.remotes) # HACK
-    model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, nstack=nstack,
-                  num_procs=num_procs, ent_coef=ent_coef, q_coef=q_coef, gamma=gamma,
-                  max_grad_norm=max_grad_norm, lr=lr, rprop_alpha=rprop_alpha, rprop_epsilon=rprop_epsilon,
-                  total_timesteps=total_timesteps, lrschedule=lrschedule, c=c,
-                  trust_region=trust_region, alpha=alpha, delta=delta)
-
-    runner = Runner(env=env, model=model, nsteps=nsteps, nstack=nstack)
-    if replay_ratio > 0:
-        buffer = Buffer(env=env, nsteps=nsteps, nstack=nstack, size=buffer_size)
-    else:
-        buffer = None
-    nbatch = nenvs*nsteps
-    acer = Acer(runner, model, buffer, log_interval)
-    acer.tstart = time.time()
-    for acer.steps in range(0, total_timesteps, nbatch): #nbatch samples, 1 on_policy call and multiple off-policy calls
-        acer.call(on_policy=True)
-        if replay_ratio > 0 and buffer.has_atleast(replay_start):
-            n = np.random.poisson(replay_ratio)
-            for _ in range(n):
-                acer.call(on_policy=False)  # no simulation steps in this
-
-    env.close()
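Review note on the removed `q_retrace`: for anyone who needs to check the deleted recursion against a port, here is a minimal NumPy sketch of the same backward pass. It assumes time-major arrays of shape `[nsteps, nenvs]` (and `[nsteps + 1, nenvs]` for the values) instead of the flat batches that `batch_to_seq` splits in the original. The name `q_retrace_np` and the sample inputs are hypothetical.

```python
import numpy as np

def q_retrace_np(rewards, dones, q_taken, values, rho, gamma):
    """Retrace targets, following the loop in the removed q_retrace.

    rewards, dones, q_taken, rho: arrays of shape [nsteps, nenvs]
    values: array of shape [nsteps + 1, nenvs]
    """
    rho_bar = np.minimum(1.0, rho)           # truncated importance weights
    qret = values[-1]                        # start from the final value estimate
    targets = np.zeros_like(rewards, dtype=np.float64)
    for i in reversed(range(rewards.shape[0])):
        qret = rewards[i] + gamma * qret * (1.0 - dones[i])
        targets[i] = qret
        # Correct towards the value estimate before moving one step back.
        qret = rho_bar[i] * (qret - q_taken[i]) + values[i]
    return targets

# One environment, two steps, made-up numbers.
rewards = np.array([[1.0], [1.0]])
dones = np.array([[0.0], [0.0]])
q_taken = np.array([[0.5], [0.5]])
values = np.array([[0.0], [0.0], [2.0]])
rho = np.array([[2.0], [0.5]])
targets = q_retrace_np(rewards, dones, q_taken, values, rho, gamma=0.5)
print(targets.ravel().tolist())
```

The sketch leaves out the `check_shape` assertions and the final `seq_to_batch` flattening, so it only covers the arithmetic of the recursion.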
--- a/baselines/acer/buffer.py
+++ b/baselines/acer/buffer.py
@@ -1,103 +0,0 @@
-import numpy as np
-
-class Buffer(object):
-    # gets obs, actions, rewards, mu's, (states, masks), dones
-    def __init__(self, env, nsteps, nstack, size=50000):
-        self.nenv = env.num_envs
-        self.nsteps = nsteps
-        self.nh, self.nw, self.nc = env.observation_space.shape
-        self.nstack = nstack
-        self.nbatch = self.nenv * self.nsteps
-        self.size = size // (self.nsteps)  # Each loc contains nenv * nsteps frames, thus total buffer is nenv * size frames
-
-        # Memory
-        self.enc_obs = None
-        self.actions = None
-        self.rewards = None
-        self.mus = None
-        self.dones = None
-        self.masks = None
-
-        # Size indexes
-        self.next_idx = 0
-        self.num_in_buffer = 0
-
-    def has_atleast(self, frames):
-        # Frames per env, so total (nenv * frames) Frames needed
-        # Each buffer loc has nenv * nsteps frames
-        return self.num_in_buffer >= (frames // self.nsteps)
-
-    def can_sample(self):
-        return self.num_in_buffer > 0
-
-    # Generate stacked frames
-    def decode(self, enc_obs, dones):
-        # enc_obs has shape [nenvs, nsteps + nstack, nh, nw, nc]
-        # dones has shape [nenvs, nsteps, nh, nw, nc]
-        # returns stacked obs of shape [nenv, (nsteps + 1), nh, nw, nstack*nc]
-        nstack, nenv, nsteps, nh, nw, nc = self.nstack, self.nenv, self.nsteps, self.nh, self.nw, self.nc
-        y = np.empty([nsteps + nstack - 1, nenv, 1, 1, 1], dtype=np.float32)
-        obs = np.zeros([nstack, nsteps + nstack, nenv, nh, nw, nc], dtype=np.uint8)
-        x = np.reshape(enc_obs, [nenv, nsteps + nstack, nh, nw, nc]).swapaxes(1,
-                                                                              0)  # [nsteps + nstack, nenv, nh, nw, nc]
-        y[3:] = np.reshape(1.0 - dones, [nenv, nsteps, 1, 1, 1]).swapaxes(1, 0)  # keep
-        y[:3] = 1.0
-        # y = np.reshape(1 - dones, [nenvs, nsteps, 1, 1, 1])
-        for i in range(nstack):
-            obs[-(i + 1), i:] = x
-            # obs[:,i:,:,:,-(i+1),:] = x
-            x = x[:-1] * y
-            y = y[1:]
-        return np.reshape(obs[:, 3:].transpose((2, 1, 3, 4, 0, 5)), [nenv, (nsteps + 1), nh, nw, nstack * nc])
-
-    def put(self, enc_obs, actions, rewards, mus, dones, masks):
-        # enc_obs [nenv, (nsteps + nstack), nh, nw, nc]
-        # actions, rewards, dones [nenv, nsteps]
-        # mus [nenv, nsteps, nact]
-
-        if self.enc_obs is None:
-            self.enc_obs = np.empty([self.size] + list(enc_obs.shape), dtype=np.uint8)
-            self.actions = np.empty([self.size] + list(actions.shape), dtype=np.int32)
-            self.rewards = np.empty([self.size] + list(rewards.shape), dtype=np.float32)
-            self.mus = np.empty([self.size] + list(mus.shape), dtype=np.float32)
-            self.dones = np.empty([self.size] + list(dones.shape), dtype=np.bool)
-            self.masks = np.empty([self.size] + list(masks.shape), dtype=np.bool)
-
-        self.enc_obs[self.next_idx] = enc_obs
-        self.actions[self.next_idx] = actions
-        self.rewards[self.next_idx] = rewards
-        self.mus[self.next_idx] = mus
-        self.dones[self.next_idx] = dones
-        self.masks[self.next_idx] = masks
-
-        self.next_idx = (self.next_idx + 1) % self.size
-        self.num_in_buffer = min(self.size, self.num_in_buffer + 1)
-
-    def take(self, x, idx, envx):
-        nenv = self.nenv
-        out = np.empty([nenv] + list(x.shape[2:]), dtype=x.dtype)
-        for i in range(nenv):
-            out[i] = x[idx[i], envx[i]]
-        return out
-
-    def get(self):
-        # returns
-        # obs [nenv, (nsteps + 1), nh, nw, nstack*nc]
-        # actions, rewards, dones [nenv, nsteps]
-        # mus [nenv, nsteps, nact]
-        nenv = self.nenv
-        assert self.can_sample()
-
-        # Sample exactly one id per env. If you sample across envs, then higher correlation in samples from same env.
-        idx = np.random.randint(0, self.num_in_buffer, nenv)
-        envx = np.arange(nenv)
-
-        take = lambda x: self.take(x, idx, envx)  # for i in range(nenv)], axis = 0)
-        dones = take(self.dones)
-        enc_obs = take(self.enc_obs)
-        obs = self.decode(enc_obs, dones)
-        actions = take(self.actions)
-        rewards = take(self.rewards)
-        mus = take(self.mus)
-        masks = take(self.masks)
-        return obs, actions, rewards, mus, dones, masks
--- a/baselines/acer/policies.py
+++ b/baselines/acer/policies.py
@@ -1,79 +0,0 @@
-import numpy as np
-import tensorflow as tf
-from baselines.ppo2.policies import nature_cnn
-from baselines.a2c.utils import fc, batch_to_seq, seq_to_batch, lstm, sample
-
-
-class AcerCnnPolicy(object):
-
-    def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False):
-        nbatch = nenv * nsteps
-        nh, nw, nc = ob_space.shape
-        ob_shape = (nbatch, nh, nw, nc * nstack)
-        nact = ac_space.n
-        X = tf.placeholder(tf.uint8, ob_shape)  # obs
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(X)
-            pi_logits = fc(h, 'pi', nact, init_scale=0.01)
-            pi = tf.nn.softmax(pi_logits)
-            q = fc(h, 'q', nact)
-
-        a = sample(pi_logits)  # could change this to use self.pi instead
-        self.initial_state = []  # not stateful
-        self.X = X
-        self.pi = pi  # actual policy params now
-        self.q = q
-
-        def step(ob, *args, **kwargs):
-            # returns actions, mus, states
-            a0, pi0 = sess.run([a, pi], {X: ob})
-            return a0, pi0, []  # dummy state
-
-        def out(ob, *args, **kwargs):
-            pi0, q0 = sess.run([pi, q], {X: ob})
-            return pi0, q0
-
-        def act(ob, *args, **kwargs):
-            return sess.run(a, {X: ob})
-
-        self.step = step
-        self.out = out
-        self.act = act
-
-class AcerLstmPolicy(object):
-
-    def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False, nlstm=256):
-        nbatch = nenv * nsteps
-        nh, nw, nc = ob_space.shape
-        ob_shape = (nbatch, nh, nw, nc * nstack)
-        nact = ac_space.n
-        X = tf.placeholder(tf.uint8, ob_shape)  # obs
-        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
-        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
-        with tf.variable_scope("model", reuse=reuse):
-            h = nature_cnn(X)
-
-            # lstm
-            xs = batch_to_seq(h, nenv, nsteps)
-            ms = batch_to_seq(M, nenv, nsteps)
-            h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
-            h5 = seq_to_batch(h5)
-
-            pi_logits = fc(h5, 'pi', nact, init_scale=0.01)
-            pi = tf.nn.softmax(pi_logits)
-            q = fc(h5, 'q', nact)
-
-        a = sample(pi_logits)  # could change this to use self.pi instead
-        self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
-        self.X = X
-        self.M = M
-        self.S = S
-        self.pi = pi  # actual policy params now
-        self.q = q
-
-        def step(ob, state, mask, *args, **kwargs):
-            # returns actions, mus, states
-            a0, pi0, s = sess.run([a, pi, snew], {X: ob, S: state, M: mask})
-            return a0, pi0, s
-
-        self.step = step
--- a/baselines/acer/run_atari.py
+++ b/baselines/acer/run_atari.py
@@ -1,30 +0,0 @@
-#!/usr/bin/env python3
-from baselines import logger
-from baselines.acer.acer_simple import learn
-from baselines.acer.policies import AcerCnnPolicy, AcerLstmPolicy
-from baselines.common.cmd_util import make_atari_env, atari_arg_parser
-
-def train(env_id, num_timesteps, seed, policy, lrschedule, num_cpu):
-    env = make_atari_env(env_id, num_cpu, seed)
-    if policy == 'cnn':
-        policy_fn = AcerCnnPolicy
-    elif policy == 'lstm':
-        policy_fn = AcerLstmPolicy
-    else:
-        print("Policy {} not implemented".format(policy))
-        return
-    learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
-    env.close()
-
-def main():
-    parser = atari_arg_parser()
-    parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
-    parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
-    parser.add_argument('--logdir', help ='Directory for logging')
-    args = parser.parse_args()
-    logger.configure(args.logdir)
-    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
-          policy=args.policy, lrschedule=args.lrschedule, num_cpu=16)
-
-if __name__ == '__main__':
-    main()
--- a/baselines/acktr/README.md
+++ b/baselines/acktr/README.md
@@ -1,5 +0,0 @@
-# ACKTR
-
- Original paper: https://arxiv.org/abs/1708.05144
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.acktr.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
--- a/baselines/acktr/__init__.py
+++ b/baselines/acktr/__init__.py
--- a/baselines/acktr/acktr_cont.py
+++ b/baselines/acktr/acktr_cont.py
@@ -1,142 +0,0 @@
-import numpy as np
-import tensorflow as tf
-from baselines import logger
-import baselines.common as common
-from baselines.common import tf_util as U
-from baselines.acktr import kfac
-from baselines.common.filters import ZFilter
-
-def pathlength(path):
-    return path["reward"].shape[0]# Loss function that we'll differentiate to get the policy gradient
-
-def rollout(env, policy, max_pathlength, animate=False, obfilter=None):
-    """
-    Simulate the env and policy for max_pathlength steps
-    """
-    ob = env.reset()
-    prev_ob = np.float32(np.zeros(ob.shape))
-    if obfilter: ob = obfilter(ob)
-    terminated = False
-
-    obs = []
-    acs = []
-    ac_dists = []
-    logps = []
-    rewards = []
-    for _ in range(max_pathlength):
-        if animate:
-            env.render()
-        state = np.concatenate([ob, prev_ob], -1)
-        obs.append(state)
-        ac, ac_dist, logp = policy.act(state)
-        acs.append(ac)
-        ac_dists.append(ac_dist)
-        logps.append(logp)
-        prev_ob = np.copy(ob)
-        scaled_ac = env.action_space.low + (ac + 1.) * 0.5 * (env.action_space.high - env.action_space.low)
-        scaled_ac = np.clip(scaled_ac, env.action_space.low, env.action_space.high)
-        ob, rew, done, _ = env.step(scaled_ac)
-        if obfilter: ob = obfilter(ob)
-        rewards.append(rew)
-        if done:
-            terminated = True
-            break
-    return {"observation" : np.array(obs), "terminated" : terminated,
-            "reward" : np.array(rewards), "action" : np.array(acs),
-            "action_dist": np.array(ac_dists), "logp" : np.array(logps)}
-
-def learn(env, policy, vf, gamma, lam, timesteps_per_batch, num_timesteps,
-    animate=False, callback=None, desired_kl=0.002):
-
-    obfilter = ZFilter(env.observation_space.shape)
-
-    max_pathlength = env.spec.timestep_limit
-    stepsize = tf.Variable(initial_value=np.float32(np.array(0.03)), name='stepsize')
-    inputs, loss, loss_sampled = policy.update_info
-    optim = kfac.KfacOptimizer(learning_rate=stepsize, cold_lr=stepsize*(1-0.9), momentum=0.9, kfac_update=2,\
-                                epsilon=1e-2, stats_decay=0.99, async=1, cold_iter=1,
-                                weight_decay_dict=policy.wd_dict, max_grad_norm=None)
-    pi_var_list = []
-    for var in tf.trainable_variables():
-        if "pi" in var.name:
-            pi_var_list.append(var)
-
-    update_op, q_runner = optim.minimize(loss, loss_sampled, var_list=pi_var_list)
-    do_update = U.function(inputs, update_op)
-    U.initialize()
-
-    # start queue runners
-    enqueue_threads = []
-    coord = tf.train.Coordinator()
-    for qr in [q_runner, vf.q_runner]:
-        assert (qr != None)
-        enqueue_threads.extend(qr.create_threads(tf.get_default_session(), coord=coord, start=True))
-
-    i = 0
-    timesteps_so_far = 0
-    while True:
-        if timesteps_so_far > num_timesteps:
-            break
-        logger.log("********** Iteration %i ************"%i)
-
-        # Collect paths until we have enough timesteps
-        timesteps_this_batch = 0
-        paths = []
-        while True:
-            path = rollout(env, policy, max_pathlength, animate=(len(paths)==0 and (i % 10 == 0) and animate), obfilter=obfilter)
-            paths.append(path)
-            n = pathlength(path)
-            timesteps_this_batch += n
-            timesteps_so_far += n
-            if timesteps_this_batch > timesteps_per_batch:
-                break
-
-        # Estimate advantage function
-        vtargs = []
-        advs = []
-        for path in paths:
-            rew_t = path["reward"]
-            return_t = common.discount(rew_t, gamma)
-            vtargs.append(return_t)
-            vpred_t = vf.predict(path)
-            vpred_t = np.append(vpred_t, 0.0 if path["terminated"] else vpred_t[-1])
-            delta_t = rew_t + gamma*vpred_t[1:] - vpred_t[:-1]
-            adv_t = common.discount(delta_t, gamma * lam)
-            advs.append(adv_t)
-        # Update value function
-        vf.fit(paths, vtargs)
-
-        # Build arrays for policy update
-        ob_no = np.concatenate([path["observation"] for path in paths])
-        action_na = np.concatenate([path["action"] for path in paths])
-        oldac_dist = np.concatenate([path["action_dist"] for path in paths])
-        adv_n = np.concatenate(advs)
-        standardized_adv_n = (adv_n - adv_n.mean()) / (adv_n.std() + 1e-8)
-
-        # Policy update
-        do_update(ob_no, action_na, standardized_adv_n)
-
-        min_stepsize = np.float32(1e-8)
-        max_stepsize = np.float32(1e0)
-        # Adjust stepsize
-        kl = policy.compute_kl(ob_no, oldac_dist)
-        if kl > desired_kl * 2:
-            logger.log("kl too high")
-            tf.assign(stepsize, tf.maximum(min_stepsize, stepsize / 1.5)).eval()
-        elif kl < desired_kl / 2:
-            logger.log("kl too low")
-            tf.assign(stepsize, tf.minimum(max_stepsize, stepsize * 1.5)).eval()
-        else:
-            logger.log("kl just right!")
-
-        logger.record_tabular("EpRewMean", np.mean([path["reward"].sum() for path in paths]))
-        logger.record_tabular("EpRewSEM", np.std([path["reward"].sum()/np.sqrt(len(paths)) for path in paths]))
-        logger.record_tabular("EpLenMean", np.mean([pathlength(path) for path in paths]))
-        logger.record_tabular("KL", kl)
-        if callback:
-            callback()
-        logger.dump_tabular()
-        i += 1
-
-    coord.request_stop()
-    coord.join(enqueue_threads)
--- a/baselines/acktr/acktr_disc.py
+++ b/baselines/acktr/acktr_disc.py
@@ -1,155 +0,0 @@
-import os.path as osp
-import time
-import joblib
-import numpy as np
-import tensorflow as tf
-from baselines import logger
-
-from baselines.common import set_global_seeds, explained_variance
-
-from baselines.a2c.a2c import Runner
-from baselines.a2c.utils import discount_with_dones
-from baselines.a2c.utils import Scheduler, find_trainable_variables
-from baselines.a2c.utils import cat_entropy, mse
-from baselines.acktr import kfac
-
-
-class Model(object):
-
-    def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, nprocs=32, nsteps=20,
-                 ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
-                 kfac_clip=0.001, lrschedule='linear'):
-        config = tf.ConfigProto(allow_soft_placement=True,
-                                intra_op_parallelism_threads=nprocs,
-                                inter_op_parallelism_threads=nprocs)
-        config.gpu_options.allow_growth = True
-        self.sess = sess = tf.Session(config=config)
-        nact = ac_space.n
-        nbatch = nenvs * nsteps
-        A = tf.placeholder(tf.int32, [nbatch])
-        ADV = tf.placeholder(tf.float32, [nbatch])
-        R = tf.placeholder(tf.float32, [nbatch])
-        PG_LR = tf.placeholder(tf.float32, [])
-        VF_LR = tf.placeholder(tf.float32, [])
-
-        self.model = step_model = policy(sess, ob_space, ac_space, nenvs, 1, reuse=False)
-        self.model2 = train_model = policy(sess, ob_space, ac_space, nenvs*nsteps, nsteps, reuse=True)
-
-        logpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
-        self.logits = logits = train_model.pi
-
-        ##training loss
-        pg_loss = tf.reduce_mean(ADV*logpac)
-        entropy = tf.reduce_mean(cat_entropy(train_model.pi))
-        pg_loss = pg_loss - ent_coef * entropy
-        vf_loss = tf.reduce_mean(mse(tf.squeeze(train_model.vf), R))
-        train_loss = pg_loss + vf_coef * vf_loss
-
-
-        ##Fisher loss construction
-        self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(logpac)
-        sample_net = train_model.vf + tf.random_normal(tf.shape(train_model.vf))
-        self.vf_fisher = vf_fisher_loss = - vf_fisher_coef*tf.reduce_mean(tf.pow(train_model.vf - tf.stop_gradient(sample_net), 2))
-        self.joint_fisher = joint_fisher_loss = pg_fisher_loss + vf_fisher_loss
-
-        self.params=params = find_trainable_variables("model")
-
-        self.grads_check = grads = tf.gradients(train_loss,params)
-
-        with tf.device('/gpu:0'):
-            self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,\
-                momentum=0.9, kfac_update=1, epsilon=0.01,\
-                stats_decay=0.99, async=1, cold_iter=10, max_grad_norm=max_grad_norm)
-
-            update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
-            train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
-        self.q_runner = q_runner
-        self.lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
-
-        def train(obs, states, rewards, masks, actions, values):
-            advs = rewards - values
-            for step in range(len(obs)):
-                cur_lr = self.lr.value()
-
-            td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, PG_LR:cur_lr}
-            if states is not None:
-                td_map[train_model.S] = states
-                td_map[train_model.M] = masks
-
-            policy_loss, value_loss, policy_entropy, _ = sess.run(
-                [pg_loss, vf_loss, entropy, train_op],
-                td_map
-            )
-            return policy_loss, value_loss, policy_entropy
-
-        def save(save_path):
-            ps = sess.run(params)
-            joblib.dump(ps, save_path)
-
-        def load(load_path):
-            loaded_params = joblib.load(load_path)
-            restores = []
-            for p, loaded_p in zip(params, loaded_params):
-                restores.append(p.assign(loaded_p))
-            sess.run(restores)
-
-
-
-        self.train = train
-        self.save = save
-        self.load = load
-        self.train_model = train_model
-        self.step_model = step_model
-        self.step = step_model.step
-        self.value = step_model.value
-        self.initial_state = step_model.initial_state
-        tf.global_variables_initializer().run(session=sess)
-
-def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
-                 ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
-                 kfac_clip=0.001, save_interval=None, lrschedule='linear'):
-    tf.reset_default_graph()
-    set_global_seeds(seed)
-
-    nenvs = env.num_envs
-    ob_space = env.observation_space
-    ac_space = env.action_space
-    make_model = lambda : Model(policy, ob_space, ac_space, nenvs, total_timesteps, nprocs=nprocs, nsteps
-                                =nsteps, ent_coef=ent_coef, vf_coef=vf_coef, vf_fisher_coef=
-                                vf_fisher_coef, lr=lr, max_grad_norm=max_grad_norm, kfac_clip=kfac_clip,
-                                lrschedule=lrschedule)
-    if save_interval and logger.get_dir():
-        import cloudpickle
-        with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
-            fh.write(cloudpickle.dumps(make_model))
-    model = make_model()
-
-    runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
-    nbatch = nenvs*nsteps
-    tstart = time.time()
-    coord = tf.train.Coordinator()
-    enqueue_threads = model.q_runner.create_threads(model.sess, coord=coord, start=True)
-    for update in range(1, total_timesteps//nbatch+1):
-        obs, states, rewards, masks, actions, values = runner.run()
-        policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
-        model.old_obs = obs
-        nseconds = time.time()-tstart
-        fps = int((update*nbatch)/nseconds)
-        if update % log_interval == 0 or update == 1:
-            ev = explained_variance(values, rewards)
-            logger.record_tabular("nupdates", update)
-            logger.record_tabular("total_timesteps", update*nbatch)
-            logger.record_tabular("fps", fps)
-            logger.record_tabular("policy_entropy", float(policy_entropy))
-            logger.record_tabular("policy_loss", float(policy_loss))
-            logger.record_tabular("value_loss", float(value_loss))
-            logger.record_tabular("explained_variance", float(ev))
-            logger.dump_tabular()
-
-        if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
-            savepath = osp.join(logger.get_dir(), 'checkpoint%.5i'%update)
-            print('Saving to', savepath)
-            model.save(savepath)
-    coord.request_stop()
-    coord.join(enqueue_threads)
-    env.close()
--- a/baselines/acktr/kfac.py
+++ b/baselines/acktr/kfac.py
@@ -1,926 +0,0 @@
-import tensorflow as tf
-import numpy as np
-import re
-from baselines.acktr.kfac_utils import *
-from functools import reduce
-
-KFAC_OPS = ['MatMul', 'Conv2D', 'BiasAdd']
-KFAC_DEBUG = False
-
-
-class KfacOptimizer():
-
-    def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
-        self.max_grad_norm = max_grad_norm
-        self._lr = learning_rate
-        self._momentum = momentum
-        self._clip_kl = clip_kl
-        self._channel_fac = channel_fac
-        self._kfac_update = kfac_update
-        self._async = async
-        self._async_stats = async_stats
-        self._epsilon = epsilon
-        self._stats_decay = stats_decay
-        self._blockdiag_bias = blockdiag_bias
-        self._approxT2 = approxT2
-        self._use_float64 = use_float64
-        self._factored_damping = factored_damping
-        self._cold_iter = cold_iter
-        if cold_lr == None:
-            # good heuristics
-            self._cold_lr = self._lr# * 3.
-        else:
-            self._cold_lr = cold_lr
-        self._stats_accum_iter = stats_accum_iter
-        self._weight_decay_dict = weight_decay_dict
-        self._diag_init_coeff = 0.
-        self._full_stats_init = full_stats_init
-        if not self._full_stats_init:
-            self._stats_accum_iter = self._cold_iter
-
-        self.sgd_step = tf.Variable(0, name='KFAC/sgd_step', trainable=False)
-        self.global_step = tf.Variable(
-            0, name='KFAC/global_step', trainable=False)
-        self.cold_step = tf.Variable(0, name='KFAC/cold_step', trainable=False)
-        self.factor_step = tf.Variable(
-            0, name='KFAC/factor_step', trainable=False)
-        self.stats_step = tf.Variable(
-            0, name='KFAC/stats_step', trainable=False)
-        self.vFv = tf.Variable(0., name='KFAC/vFv', trainable=False)
-
-        self.factors = {}
-        self.param_vars = []
-        self.stats = {}
-        self.stats_eigen = {}
-
-    def getFactors(self, g, varlist):
-        graph = tf.get_default_graph()
-        factorTensors = {}
-        fpropTensors = []
-        bpropTensors = []
-        opTypes = []
-        fops = []
-
-        def searchFactors(gradient, graph):
-            # hard coded search stratergy
-            bpropOp = gradient.op
-            bpropOp_name = bpropOp.name
-
-            bTensors = []
-            fTensors = []
-
-            # combining additive gradient, assume they are the same op type and
-            # indepedent
-            if 'AddN' in bpropOp_name:
-                factors = []
-                for g in gradient.op.inputs:
-                    factors.append(searchFactors(g, graph))
-                op_names = [item['opName'] for item in factors]
-                # TO-DO: need to check all the attribute of the ops as well
-                print (gradient.name)
-                print (op_names)
-                print (len(np.unique(op_names)))
-                assert len(np.unique(op_names)) == 1, gradient.name + \
-                    ' is shared among different computation OPs'
-
-                bTensors = reduce(lambda x, y: x + y,
-                                  [item['bpropFactors'] for item in factors])
-                if len(factors[0]['fpropFactors']) > 0:
-                    fTensors = reduce(
-                        lambda x, y: x + y, [item['fpropFactors'] for item in factors])
-                fpropOp_name = op_names[0]
-                fpropOp = factors[0]['op']
-            else:
-                fpropOp_name = re.search(
-                    'gradientsSampled(_[0-9]+|)/(.+?)_grad', bpropOp_name).group(2)
-                fpropOp = graph.get_operation_by_name(fpropOp_name)
-                if fpropOp.op_def.name in KFAC_OPS:
-                    # Known OPs
-                    ###
-                    bTensor = [
-                        i for i in bpropOp.inputs if 'gradientsSampled' in i.name][-1]
-                    bTensorShape = fpropOp.outputs[0].get_shape()
-                    if bTensor.get_shape()[0].value == None:
-                        bTensor.set_shape(bTensorShape)
-                    bTensors.append(bTensor)
-                    ###
-                    if fpropOp.op_def.name == 'BiasAdd':
-                        fTensors = []
-                    else:
-                        fTensors.append(
-                            [i for i in fpropOp.inputs if param.op.name not in i.name][0])
-                    fpropOp_name = fpropOp.op_def.name
-                else:
-                    # unknown OPs, block approximation used
-                    bInputsList = [i for i in bpropOp.inputs[
-                        0].op.inputs if 'gradientsSampled' in i.name if 'Shape' not in i.name]
-                    if len(bInputsList) > 0:
-                        bTensor = bInputsList[0]
-                        bTensorShape = fpropOp.outputs[0].get_shape()
-                        if len(bTensor.get_shape()) > 0 and bTensor.get_shape()[0].value == None:
-                            bTensor.set_shape(bTensorShape)
-                        bTensors.append(bTensor)
-                    fpropOp_name = opTypes.append('UNK-' + fpropOp.op_def.name)
-
-            return {'opName': fpropOp_name, 'op': fpropOp, 'fpropFactors': fTensors, 'bpropFactors': bTensors}
-
-        for t, param in zip(g, varlist):
-            if KFAC_DEBUG:
-                print(('get factor for '+param.name))
-            factors = searchFactors(t, graph)
-            factorTensors[param] = factors
-
-        ########
-        # check associated weights and bias for homogeneous coordinate representation
-        # and check redundent factors
-        # TO-DO: there may be a bug to detect associate bias and weights for
-        # forking layer, e.g. in inception models.
-        for param in varlist:
-            factorTensors[param]['assnWeights'] = None
-            factorTensors[param]['assnBias'] = None
-        for param in varlist:
-            if factorTensors[param]['opName'] == 'BiasAdd':
-                factorTensors[param]['assnWeights'] = None
-                for item in varlist:
-                    if len(factorTensors[item]['bpropFactors']) > 0:
-                        if (set(factorTensors[item]['bpropFactors']) == set(factorTensors[param]['bpropFactors'])) and (len(factorTensors[item]['fpropFactors']) > 0):
-                            factorTensors[param]['assnWeights'] = item
-                            factorTensors[item]['assnBias'] = param
-                            factorTensors[param]['bpropFactors'] = factorTensors[
-                                item]['bpropFactors']
-
-        ########
-
-        ########
-        # concatenate the additive gradients along the batch dimension, i.e.
-        # assuming independence structure
-        for key in ['fpropFactors', 'bpropFactors']:
-            for i, param in enumerate(varlist):
-                if len(factorTensors[param][key]) > 0:
-                    if (key + '_concat') not in factorTensors[param]:
-                        name_scope = factorTensors[param][key][0].name.split(':')[
-                            0]
-                        with tf.name_scope(name_scope):
-                            factorTensors[param][
-                                key + '_concat'] = tf.concat(factorTensors[param][key], 0)
-                else:
-                    factorTensors[param][key + '_concat'] = None
-                for j, param2 in enumerate(varlist[(i + 1):]):
-                    if (len(factorTensors[param][key]) > 0) and (set(factorTensors[param2][key]) == set(factorTensors[param][key])):
-                        factorTensors[param2][key] = factorTensors[param][key]
-                        factorTensors[param2][
-                            key + '_concat'] = factorTensors[param][key + '_concat']
-        ########
-
-        if KFAC_DEBUG:
-            for items in zip(varlist, fpropTensors, bpropTensors, opTypes):
-                print((items[0].name, factorTensors[item]))
-        self.factors = factorTensors
-        return factorTensors
-
-    def getStats(self, factors, varlist):
-        if len(self.stats) == 0:
-            # initialize stats variables on CPU because eigen decomp is
-            # computed on CPU
-            with tf.device('/cpu'):
-                tmpStatsCache = {}
-
-                # search for tensor factors and
-                # use block diag approx for the bias units
-                for var in varlist:
-                    fpropFactor = factors[var]['fpropFactors_concat']
-                    bpropFactor = factors[var]['bpropFactors_concat']
-                    opType = factors[var]['opName']
-                    if opType == 'Conv2D':
-                        Kh = var.get_shape()[0]
-                        Kw = var.get_shape()[1]
-                        C = fpropFactor.get_shape()[-1]
-
-                        Oh = bpropFactor.get_shape()[1]
-                        Ow = bpropFactor.get_shape()[2]
-                        if Oh == 1 and Ow == 1 and self._channel_fac:
-                            # factorization along the channels do not support
-                            # homogeneous coordinate
-                            var_assnBias = factors[var]['assnBias']
-                            if var_assnBias:
-                                factors[var]['assnBias'] = None
-                                factors[var_assnBias]['assnWeights'] = None
-                ##
-
-                for var in varlist:
-                    fpropFactor = factors[var]['fpropFactors_concat']
-                    bpropFactor = factors[var]['bpropFactors_concat']
-                    opType = factors[var]['opName']
-                    self.stats[var] = {'opName': opType,
-                                       'fprop_concat_stats': [],
-                                       'bprop_concat_stats': [],
-                                       'assnWeights': factors[var]['assnWeights'],
-                                       'assnBias': factors[var]['assnBias'],
-                                       }
-                    if fpropFactor is not None:
-                        if fpropFactor not in tmpStatsCache:
-                            if opType == 'Conv2D':
-                                Kh = var.get_shape()[0]
-                                Kw = var.get_shape()[1]
-                                C = fpropFactor.get_shape()[-1]
-
-                                Oh = bpropFactor.get_shape()[1]
-                                Ow = bpropFactor.get_shape()[2]
-                                if Oh == 1 and Ow == 1 and self._channel_fac:
-                                    # factorization along the channels
-                                    # assume independence between input channels and spatial
-                                    # 2K-1 x 2K-1 covariance matrix and C x C covariance matrix
-                                    # factorization along the channels does not
-                                    # support homogeneous coordinates, so assnBias
-                                    # is always None
-                                    fpropFactor2_size = Kh * Kw
-                                    slot_fpropFactor_stats2 = tf.Variable(tf.diag(tf.ones(
-                                        [fpropFactor2_size])) * self._diag_init_coeff, name='KFAC_STATS/' + fpropFactor.op.name, trainable=False)
-                                    self.stats[var]['fprop_concat_stats'].append(
-                                        slot_fpropFactor_stats2)
-
-                                    fpropFactor_size = C
-                                else:
-                                    # 2K-1 x 2K-1 x C x C covariance matrix
-                                    # assume BHWC
-                                    fpropFactor_size = Kh * Kw * C
-                            else:
-                                # D x D covariance matrix
-                                fpropFactor_size = fpropFactor.get_shape()[-1]
-
-                            # use homogeneous coordinate
-                            if not self._blockdiag_bias and self.stats[var]['assnBias']:
-                                fpropFactor_size += 1
-
-                            slot_fpropFactor_stats = tf.Variable(tf.diag(tf.ones(
-                                [fpropFactor_size])) * self._diag_init_coeff, name='KFAC_STATS/' + fpropFactor.op.name, trainable=False)
-                            self.stats[var]['fprop_concat_stats'].append(
-                                slot_fpropFactor_stats)
-                            if opType != 'Conv2D':
-                                tmpStatsCache[fpropFactor] = self.stats[
-                                    var]['fprop_concat_stats']
-                        else:
-                            self.stats[var][
-                                'fprop_concat_stats'] = tmpStatsCache[fpropFactor]
-
-                    if bpropFactor is not None:
-                        # no need to collect backward stats for bias vectors if
-                        # using homogeneous coordinates
-                        if not((not self._blockdiag_bias) and self.stats[var]['assnWeights']):
-                            if bpropFactor not in tmpStatsCache:
-                                slot_bpropFactor_stats = tf.Variable(tf.diag(tf.ones([bpropFactor.get_shape(
-                                )[-1]])) * self._diag_init_coeff, name='KFAC_STATS/' + bpropFactor.op.name, trainable=False)
-                                self.stats[var]['bprop_concat_stats'].append(
-                                    slot_bpropFactor_stats)
-                                tmpStatsCache[bpropFactor] = self.stats[
-                                    var]['bprop_concat_stats']
-                            else:
-                                self.stats[var][
-                                    'bprop_concat_stats'] = tmpStatsCache[bpropFactor]
-
-        return self.stats
-
-    def compute_and_apply_stats(self, loss_sampled, var_list=None):
-        varlist = var_list
-        if varlist is None:
-            varlist = tf.trainable_variables()
-
-        stats = self.compute_stats(loss_sampled, var_list=varlist)
-        return self.apply_stats(stats)
-
-    def compute_stats(self, loss_sampled, var_list=None):
-        varlist = var_list
-        if varlist is None:
-            varlist = tf.trainable_variables()
-
-        gs = tf.gradients(loss_sampled, varlist, name='gradientsSampled')
-        self.gs = gs
-        factors = self.getFactors(gs, varlist)
-        stats = self.getStats(factors, varlist)
-
-        updateOps = []
-        statsUpdates = {}
-        statsUpdates_cache = {}
-        for var in varlist:
-            opType = factors[var]['opName']
-            fops = factors[var]['op']
-            fpropFactor = factors[var]['fpropFactors_concat']
-            fpropStats_vars = stats[var]['fprop_concat_stats']
-            bpropFactor = factors[var]['bpropFactors_concat']
-            bpropStats_vars = stats[var]['bprop_concat_stats']
-            SVD_factors = {}
-            for stats_var in fpropStats_vars:
-                stats_var_dim = int(stats_var.get_shape()[0])
-                if stats_var not in statsUpdates_cache:
-                    old_fpropFactor = fpropFactor
-                    B = (tf.shape(fpropFactor)[0])  # batch size
-                    if opType == 'Conv2D':
-                        strides = fops.get_attr("strides")
-                        padding = fops.get_attr("padding")
-                        convkernel_size = var.get_shape()[0:3]
-
-                        KH = int(convkernel_size[0])
-                        KW = int(convkernel_size[1])
-                        C = int(convkernel_size[2])
-                        flatten_size = int(KH * KW * C)
-
-                        Oh = int(bpropFactor.get_shape()[1])
-                        Ow = int(bpropFactor.get_shape()[2])
-
-                        if Oh == 1 and Ow == 1 and self._channel_fac:
-                            # factorization along the channels
-                            # assume independence among input channels
-                            # factor = B x 1 x 1 x (KH x KW x C)
-                            # patches = B x Oh x Ow x (KH x KW x C)
-                            if len(SVD_factors) == 0:
-                                if KFAC_DEBUG:
-                                    print(('approx %s act factor with rank-1 SVD factors' % (var.name)))
-                                # find closest rank-1 approx to the feature map
-                                S, U, V = tf.batch_svd(tf.reshape(
-                                    fpropFactor, [-1, KH * KW, C]))
-                                # get rank-1 approx slices
-                                sqrtS1 = tf.expand_dims(tf.sqrt(S[:, 0, 0]), 1)
-                                patches_k = U[:, :, 0] * sqrtS1  # B x KH*KW
-                                full_factor_shape = fpropFactor.get_shape()
-                                patches_k.set_shape(
-                                    [full_factor_shape[0], KH * KW])
-                                patches_c = V[:, :, 0] * sqrtS1  # B x C
-                                patches_c.set_shape([full_factor_shape[0], C])
-                                SVD_factors[C] = patches_c
-                                SVD_factors[KH * KW] = patches_k
-                            fpropFactor = SVD_factors[stats_var_dim]
-
-                        else:
-                            # poor mem usage implementation
-                            patches = tf.extract_image_patches(fpropFactor, ksizes=[1, convkernel_size[
-                                                               0], convkernel_size[1], 1], strides=strides, rates=[1, 1, 1, 1], padding=padding)
-
-                            if self._approxT2:
-                                if KFAC_DEBUG:
-                                    print(('approxT2 act fisher for %s' % (var.name)))
-                                # T^2 terms * 1/T^2, size: B x C
-                                fpropFactor = tf.reduce_mean(patches, [1, 2])
-                            else:
-                                # size: (B x Oh x Ow) x C
-                                fpropFactor = tf.reshape(
-                                    patches, [-1, flatten_size]) / Oh / Ow
-                    fpropFactor_size = int(fpropFactor.get_shape()[-1])
-                    if stats_var_dim == (fpropFactor_size + 1) and not self._blockdiag_bias:
-                        if opType == 'Conv2D' and not self._approxT2:
-                            # correct padding for numerical stability (we
-                            # divided out OhxOw from activations for T1 approx)
-                            fpropFactor = tf.concat([fpropFactor, tf.ones(
-                                [tf.shape(fpropFactor)[0], 1]) / Oh / Ow], 1)
-                        else:
-                            # use homogeneous coordinates
-                            fpropFactor = tf.concat(
-                                [fpropFactor, tf.ones([tf.shape(fpropFactor)[0], 1])], 1)
-
-                    # average over the number of data points in a batch
-                    # divided by B
-                    cov = tf.matmul(fpropFactor, fpropFactor,
-                                    transpose_a=True) / tf.cast(B, tf.float32)
-                    updateOps.append(cov)
-                    statsUpdates[stats_var] = cov
-                    if opType != 'Conv2D':
-                        # HACK: for convolution we recompute fprop stats for
-                        # every layer including forking layers
-                        statsUpdates_cache[stats_var] = cov
-
-            for stats_var in bpropStats_vars:
-                stats_var_dim = int(stats_var.get_shape()[0])
-                if stats_var not in statsUpdates_cache:
-                    old_bpropFactor = bpropFactor
-                    bpropFactor_shape = bpropFactor.get_shape()
-                    B = tf.shape(bpropFactor)[0]  # batch size
-                    C = int(bpropFactor_shape[-1])  # num channels
-                    if opType == 'Conv2D' or len(bpropFactor_shape) == 4:
-                        if fpropFactor is not None:
-                            if self._approxT2:
-                                if KFAC_DEBUG:
-                                    print(('approxT2 grad fisher for %s' % (var.name)))
-                                bpropFactor = tf.reduce_sum(
-                                    bpropFactor, [1, 2])  # T^2 terms * 1/T^2
-                            else:
-                                bpropFactor = tf.reshape(
-                                    bpropFactor, [-1, C]) * Oh * Ow  # T * 1/T terms
-                        else:
-                            # just doing block diag approx. spatial independent
-                            # structure does not apply here. summing over
-                            # spatial locations
-                            if KFAC_DEBUG:
-                                print(('block diag approx fisher for %s' % (var.name)))
-                            bpropFactor = tf.reduce_sum(bpropFactor, [1, 2])
-
-                    # assume sampled loss is averaged. TO-DO:figure out better
-                    # way to handle this
-                    bpropFactor *= tf.to_float(B)
-                    ##
-
-                    cov_b = tf.matmul(
-                        bpropFactor, bpropFactor, transpose_a=True) / tf.to_float(tf.shape(bpropFactor)[0])
-
-                    updateOps.append(cov_b)
-                    statsUpdates[stats_var] = cov_b
-                    statsUpdates_cache[stats_var] = cov_b
-
-        if KFAC_DEBUG:
-            aKey = list(statsUpdates.keys())[0]
-            statsUpdates[aKey] = tf.Print(statsUpdates[aKey],
-                                          [tf.convert_to_tensor('step:'),
-                                           self.global_step,
-                                           tf.convert_to_tensor(
-                                               'computing stats'),
-                                           ])
-        self.statsUpdates = statsUpdates
-        return statsUpdates
-
-    def apply_stats(self, statsUpdates):
-        """ compute stats and update/apply the new stats to the running average
-        """
-
-        def updateAccumStats():
-            if self._full_stats_init:
-                return tf.cond(tf.greater(self.sgd_step, self._cold_iter), lambda: tf.group(*self._apply_stats(statsUpdates, accumulate=True, accumulateCoeff=1. / self._stats_accum_iter)), tf.no_op)
-            else:
-                return tf.group(*self._apply_stats(statsUpdates, accumulate=True, accumulateCoeff=1. / self._stats_accum_iter))
-
-        def updateRunningAvgStats(statsUpdates, fac_iter=1):
-            # return tf.cond(tf.greater_equal(self.factor_step,
-            # tf.convert_to_tensor(fac_iter)), lambda:
-            # tf.group(*self._apply_stats(stats_list, varlist)), tf.no_op)
-            return tf.group(*self._apply_stats(statsUpdates))
-
-        if self._async_stats:
-            # asynchronous stats update
-            update_stats = self._apply_stats(statsUpdates)
-
-            queue = tf.FIFOQueue(1, [item.dtype for item in update_stats], shapes=[
-                                 item.get_shape() for item in update_stats])
-            enqueue_op = queue.enqueue(update_stats)
-
-            def dequeue_stats_op():
-                return queue.dequeue()
-            self.qr_stats = tf.train.QueueRunner(queue, [enqueue_op])
-            update_stats_op = tf.cond(tf.equal(queue.size(), tf.convert_to_tensor(
-                0)), tf.no_op, lambda: tf.group(*[dequeue_stats_op(), ]))
-        else:
-            # synchronous stats update
-            update_stats_op = tf.cond(tf.greater_equal(
-                self.stats_step, self._stats_accum_iter), lambda: updateRunningAvgStats(statsUpdates), updateAccumStats)
-        self._update_stats_op = update_stats_op
-        return update_stats_op
-
-    def _apply_stats(self, statsUpdates, accumulate=False, accumulateCoeff=0.):
-        updateOps = []
-        # obtain the stats var list
-        for stats_var in statsUpdates:
-            stats_new = statsUpdates[stats_var]
-            if accumulate:
-                # simple superbatch averaging
-                update_op = tf.assign_add(
-                    stats_var, accumulateCoeff * stats_new, use_locking=True)
-            else:
-                # exponential running averaging
-                update_op = tf.assign(
-                    stats_var, stats_var * self._stats_decay, use_locking=True)
-                update_op = tf.assign_add(
-                    update_op, (1. - self._stats_decay) * stats_new, use_locking=True)
-            updateOps.append(update_op)
-
-        with tf.control_dependencies(updateOps):
-            stats_step_op = tf.assign_add(self.stats_step, 1)
-
-        if KFAC_DEBUG:
-            stats_step_op = (tf.Print(stats_step_op,
-                                      [tf.convert_to_tensor('step:'),
-                                       self.global_step,
-                                       tf.convert_to_tensor('fac step:'),
-                                       self.factor_step,
-                                       tf.convert_to_tensor('sgd step:'),
-                                       self.sgd_step,
-                                       tf.convert_to_tensor('Accum:'),
-                                       tf.convert_to_tensor(accumulate),
-                                       tf.convert_to_tensor('Accum coeff:'),
-                                       tf.convert_to_tensor(accumulateCoeff),
-                                       tf.convert_to_tensor('stat step:'),
-                                       self.stats_step, updateOps[0], updateOps[1]]))
-        return [stats_step_op, ]
-
-    def getStatsEigen(self, stats=None):
-        if len(self.stats_eigen) == 0:
-            stats_eigen = {}
-            if stats is None:
-                stats = self.stats
-
-            tmpEigenCache = {}
-            with tf.device('/cpu:0'):
-                for var in stats:
-                    for key in ['fprop_concat_stats', 'bprop_concat_stats']:
-                        for stats_var in stats[var][key]:
-                            if stats_var not in tmpEigenCache:
-                                stats_dim = stats_var.get_shape()[1].value
-                                e = tf.Variable(tf.ones(
-                                    [stats_dim]), name='KFAC_FAC/' + stats_var.name.split(':')[0] + '/e', trainable=False)
-                                Q = tf.Variable(tf.diag(tf.ones(
-                                    [stats_dim])), name='KFAC_FAC/' + stats_var.name.split(':')[0] + '/Q', trainable=False)
-                                stats_eigen[stats_var] = {'e': e, 'Q': Q}
-                                tmpEigenCache[
-                                    stats_var] = stats_eigen[stats_var]
-                            else:
-                                stats_eigen[stats_var] = tmpEigenCache[
-                                    stats_var]
-            self.stats_eigen = stats_eigen
-        return self.stats_eigen
-
-    def computeStatsEigen(self):
-        """ compute the eigen decomp using copied var stats to avoid concurrent read/write from other queue """
-        # TO-DO: figure out why this op has delays (possibly moving
-        # eigenvectors around?)
-        with tf.device('/cpu:0'):
-            def removeNone(tensor_list):
-                local_list = []
-                for item in tensor_list:
-                    if item is not None:
-                        local_list.append(item)
-                return local_list
-
-            def copyStats(var_list):
-                print("copying stats to buffer tensors before eigen decomp")
-                redundant_stats = {}
-                copied_list = []
-                for item in var_list:
-                    if item is not None:
-                        if item not in redundant_stats:
-                            if self._use_float64:
-                                redundant_stats[item] = tf.cast(
-                                    tf.identity(item), tf.float64)
-                            else:
-                                redundant_stats[item] = tf.identity(item)
-                        copied_list.append(redundant_stats[item])
-                    else:
-                        copied_list.append(None)
-                return copied_list
-            #stats = [copyStats(self.fStats), copyStats(self.bStats)]
-            #stats = [self.fStats, self.bStats]
-
-            stats_eigen = self.stats_eigen
-            computedEigen = {}
-            eigen_reverse_lookup = {}
-            updateOps = []
-            # sync copied stats
-            # with tf.control_dependencies(removeNone(stats[0]) +
-            # removeNone(stats[1])):
-            with tf.control_dependencies([]):
-                for stats_var in stats_eigen:
-                    if stats_var not in computedEigen:
-                        eigens = tf.self_adjoint_eig(stats_var)
-                        e = eigens[0]
-                        Q = eigens[1]
-                        if self._use_float64:
-                            e = tf.cast(e, tf.float32)
-                            Q = tf.cast(Q, tf.float32)
-                        updateOps.append(e)
-                        updateOps.append(Q)
-                        computedEigen[stats_var] = {'e': e, 'Q': Q}
-                        eigen_reverse_lookup[e] = stats_eigen[stats_var]['e']
-                        eigen_reverse_lookup[Q] = stats_eigen[stats_var]['Q']
-
-            self.eigen_reverse_lookup = eigen_reverse_lookup
-            self.eigen_update_list = updateOps
-
-            if KFAC_DEBUG:
-                self.eigen_update_list = [item for item in updateOps]
-                with tf.control_dependencies(updateOps):
-                    updateOps.append(tf.Print(tf.constant(
-                        0.), [tf.convert_to_tensor('computed factor eigen')]))
-
-        return updateOps
-
-    def applyStatsEigen(self, eigen_list):
-        updateOps = []
-        print(('updating %d eigenvalue/vectors' % len(eigen_list)))
-        for i, (tensor, mark) in enumerate(zip(eigen_list, self.eigen_update_list)):
-            stats_eigen_var = self.eigen_reverse_lookup[mark]
-            updateOps.append(
-                tf.assign(stats_eigen_var, tensor, use_locking=True))
-
-        with tf.control_dependencies(updateOps):
-            factor_step_op = tf.assign_add(self.factor_step, 1)
-            updateOps.append(factor_step_op)
-            if KFAC_DEBUG:
-                updateOps.append(tf.Print(tf.constant(
-                    0.), [tf.convert_to_tensor('updated kfac factors')]))
-        return updateOps
-
-    def getKfacPrecondUpdates(self, gradlist, varlist):
-        updatelist = []
-        vg = 0.
-
-        assert len(self.stats) > 0
-        assert len(self.stats_eigen) > 0
-        assert len(self.factors) > 0
-        counter = 0
-
-        grad_dict = {var: grad for grad, var in zip(gradlist, varlist)}
-
-        for grad, var in zip(gradlist, varlist):
-            GRAD_RESHAPE = False
-            GRAD_TRANSPOSE = False
-
-            fpropFactoredFishers = self.stats[var]['fprop_concat_stats']
-            bpropFactoredFishers = self.stats[var]['bprop_concat_stats']
-
-            if (len(fpropFactoredFishers) + len(bpropFactoredFishers)) > 0:
-                counter += 1
-                GRAD_SHAPE = grad.get_shape()
-                if len(grad.get_shape()) > 2:
-                    # reshape conv kernel parameters
-                    KW = int(grad.get_shape()[0])
-                    KH = int(grad.get_shape()[1])
-                    C = int(grad.get_shape()[2])
-                    D = int(grad.get_shape()[3])
-
-                    if len(fpropFactoredFishers) > 1 and self._channel_fac:
-                        # reshape conv kernel parameters into tensor
-                        grad = tf.reshape(grad, [KW * KH, C, D])
-                    else:
-                        # reshape conv kernel parameters into 2D grad
-                        grad = tf.reshape(grad, [-1, D])
-                    GRAD_RESHAPE = True
-                elif len(grad.get_shape()) == 1:
-                    # reshape bias or 1D parameters
-                    D = int(grad.get_shape()[0])
-
-                    grad = tf.expand_dims(grad, 0)
-                    GRAD_RESHAPE = True
-                else:
-                    # 2D parameters
-                    C = int(grad.get_shape()[0])
-                    D = int(grad.get_shape()[1])
-
-                if (self.stats[var]['assnBias'] is not None) and not self._blockdiag_bias:
-                    # using homogeneous coordinates only works for 2D grad.
-                    # TO-DO: figure out how to factorize bias grad
-                    # stack bias grad
-                    var_assnBias = self.stats[var]['assnBias']
-                    grad = tf.concat(
-                        [grad, tf.expand_dims(grad_dict[var_assnBias], 0)], 0)
-
-                # project gradient to eigen space and reshape the eigenvalues
-                # for broadcasting
-                eigVals = []
-
-                for idx, stats in enumerate(self.stats[var]['fprop_concat_stats']):
-                    Q = self.stats_eigen[stats]['Q']
-                    e = detectMinVal(self.stats_eigen[stats][
-                                     'e'], var, name='act', debug=KFAC_DEBUG)
-
-                    Q, e = factorReshape(Q, e, grad, facIndx=idx, ftype='act')
-                    eigVals.append(e)
-                    grad = gmatmul(Q, grad, transpose_a=True, reduce_dim=idx)
-
-                for idx, stats in enumerate(self.stats[var]['bprop_concat_stats']):
-                    Q = self.stats_eigen[stats]['Q']
-                    e = detectMinVal(self.stats_eigen[stats][
-                                     'e'], var, name='grad', debug=KFAC_DEBUG)
-
-                    Q, e = factorReshape(Q, e, grad, facIndx=idx, ftype='grad')
-                    eigVals.append(e)
-                    grad = gmatmul(grad, Q, transpose_b=False, reduce_dim=idx)
-                ##
-
-                #####
-                # whiten using eigenvalues
-                weightDecayCoeff = 0.
-                if var in self._weight_decay_dict:
-                    weightDecayCoeff = self._weight_decay_dict[var]
-                    if KFAC_DEBUG:
-                        print(('weight decay coeff for %s is %f' % (var.name, weightDecayCoeff)))
-
-                if self._factored_damping:
-                    if KFAC_DEBUG:
-                        print(('use factored damping for %s' % (var.name)))
-                    coeffs = 1.
-                    num_factors = len(eigVals)
-                    # compute the ratio of the trace norms of the left and right
-                    # KFAC matrices, and their generalization
-                    if len(eigVals) == 1:
-                        damping = self._epsilon + weightDecayCoeff
-                    else:
-                        damping = tf.pow(
-                            self._epsilon + weightDecayCoeff, 1. / num_factors)
-                    eigVals_tnorm_avg = [tf.reduce_mean(
-                        tf.abs(e)) for e in eigVals]
-                    for e, e_tnorm in zip(eigVals, eigVals_tnorm_avg):
-                        eig_tnorm_negList = [
-                            item for item in eigVals_tnorm_avg if item != e_tnorm]
-                        if len(eigVals) == 1:
-                            adjustment = 1.
-                        elif len(eigVals) == 2:
-                            adjustment = tf.sqrt(
-                                e_tnorm / eig_tnorm_negList[0])
-                        else:
-                            eig_tnorm_negList_prod = reduce(
-                                lambda x, y: x * y, eig_tnorm_negList)
-                            adjustment = tf.pow(
-                                tf.pow(e_tnorm, num_factors - 1.) / eig_tnorm_negList_prod, 1. / num_factors)
-                        coeffs *= (e + adjustment * damping)
-                else:
-                    coeffs = 1.
-                    damping = (self._epsilon + weightDecayCoeff)
-                    for e in eigVals:
-                        coeffs *= e
-                    coeffs += damping
-
-                #grad = tf.Print(grad, [tf.convert_to_tensor('1'), tf.convert_to_tensor(var.name), grad.get_shape()])
-
-                grad /= coeffs
-
-                #grad = tf.Print(grad, [tf.convert_to_tensor('2'), tf.convert_to_tensor(var.name), grad.get_shape()])
-                #####
-                # project gradient back to euclidean space
-                for idx, stats in enumerate(self.stats[var]['fprop_concat_stats']):
-                    Q = self.stats_eigen[stats]['Q']
-                    grad = gmatmul(Q, grad, transpose_a=False, reduce_dim=idx)
-
-                for idx, stats in enumerate(self.stats[var]['bprop_concat_stats']):
-                    Q = self.stats_eigen[stats]['Q']
-                    grad = gmatmul(grad, Q, transpose_b=True, reduce_dim=idx)
-                ##
-
-                #grad = tf.Print(grad, [tf.convert_to_tensor('3'), tf.convert_to_tensor(var.name), grad.get_shape()])
-                if (self.stats[var]['assnBias'] is not None) and not self._blockdiag_bias:
-                    # using homogeneous coordinates only works for 2D grad.
-                    # TO-DO: figure out how to factorize bias grad
-                    # un-stack bias grad
-                    var_assnBias = self.stats[var]['assnBias']
-                    C_plus_one = int(grad.get_shape()[0])
-                    grad_assnBias = tf.reshape(tf.slice(grad,
-                                                        begin=[
-                                                            C_plus_one - 1, 0],
-                                                        size=[1, -1]), var_assnBias.get_shape())
-                    grad_assnWeights = tf.slice(grad,
-                                                begin=[0, 0],
-                                                size=[C_plus_one - 1, -1])
-                    grad_dict[var_assnBias] = grad_assnBias
-                    grad = grad_assnWeights
-
-                #grad = tf.Print(grad, [tf.convert_to_tensor('4'), tf.convert_to_tensor(var.name), grad.get_shape()])
-                if GRAD_RESHAPE:
-                    grad = tf.reshape(grad, GRAD_SHAPE)
-
-                grad_dict[var] = grad
-
-        print(('projecting %d gradient matrices' % counter))
-
-        for g, var in zip(gradlist, varlist):
-            grad = grad_dict[var]
-            ### clipping ###
-            if KFAC_DEBUG:
-                print(('apply clipping to %s' % (var.name)))
-            tf.Print(grad, [tf.sqrt(tf.reduce_sum(tf.pow(grad, 2)))], "Euclidean norm of new grad")
-            local_vg = tf.reduce_sum(grad * g * (self._lr * self._lr))
-            vg += local_vg
-
-        # rescale everything
-        if KFAC_DEBUG:
-            print('apply vFv clipping')
-
-        scaling = tf.minimum(1., tf.sqrt(self._clip_kl / vg))
-        if KFAC_DEBUG:
-            scaling = tf.Print(scaling, [tf.convert_to_tensor(
-                'clip: '), scaling, tf.convert_to_tensor(' vFv: '), vg])
-        with tf.control_dependencies([tf.assign(self.vFv, vg)]):
-            updatelist = [grad_dict[var] for var in varlist]
-            for i, item in enumerate(updatelist):
-                updatelist[i] = scaling * item
-
-        return updatelist
-
-    def compute_gradients(self, loss, var_list=None):
-        varlist = var_list
-        if varlist is None:
-            varlist = tf.trainable_variables()
-        g = tf.gradients(loss, varlist)
-
-        return [(a, b) for a, b in zip(g, varlist)]
-
-    def apply_gradients_kfac(self, grads):
-        g, varlist = list(zip(*grads))
-
-        if len(self.stats_eigen) == 0:
-            self.getStatsEigen()
-
-        qr = None
-        # launch eigen-decomp on a queue thread
-        if self._async:
-            print('Use async eigen decomp')
-            # get a list of factor loading tensors
-            factorOps_dummy = self.computeStatsEigen()
-
-            # define a queue for the list of factor loading tensors
-            queue = tf.FIFOQueue(1, [item.dtype for item in factorOps_dummy], shapes=[
-                                 item.get_shape() for item in factorOps_dummy])
-            enqueue_op = tf.cond(tf.logical_and(tf.equal(tf.mod(self.stats_step, self._kfac_update), tf.convert_to_tensor(
-                0)), tf.greater_equal(self.stats_step, self._stats_accum_iter)), lambda: queue.enqueue(self.computeStatsEigen()), tf.no_op)
-
-            def dequeue_op():
-                return queue.dequeue()
-
-            qr = tf.train.QueueRunner(queue, [enqueue_op])
-
-        updateOps = []
-        global_step_op = tf.assign_add(self.global_step, 1)
-        updateOps.append(global_step_op)
-
-        with tf.control_dependencies([global_step_op]):
-
-            # compute updates
-            assert self._update_stats_op is not None
-            updateOps.append(self._update_stats_op)
-            dependency_list = []
-            if not self._async:
-                dependency_list.append(self._update_stats_op)
-
-            with tf.control_dependencies(dependency_list):
-                def no_op_wrapper():
-                    return tf.group(*[tf.assign_add(self.cold_step, 1)])
-
-                if not self._async:
-                    # synchronous eigen-decomp updates
-                    updateFactorOps = tf.cond(tf.logical_and(tf.equal(tf.mod(self.stats_step, self._kfac_update),
-                                                                      tf.convert_to_tensor(0)),
-                                                             tf.greater_equal(self.stats_step, self._stats_accum_iter)), lambda: tf.group(*self.applyStatsEigen(self.computeStatsEigen())), no_op_wrapper)
-                else:
-                    # asynchronous eigen-decomp updates using queue
-                    updateFactorOps = tf.cond(tf.greater_equal(self.stats_step, self._stats_accum_iter),
-                                              lambda: tf.cond(tf.equal(queue.size(), tf.convert_to_tensor(0)),
-                                                              tf.no_op,
-
-                                                              lambda: tf.group(
-                                                                  *self.applyStatsEigen(dequeue_op())),
-                                                              ),
-                                              no_op_wrapper)
-
-                updateOps.append(updateFactorOps)
-
-                with tf.control_dependencies([updateFactorOps]):
-                    def gradOp():
-                        return list(g)
-
-                    def getKfacGradOp():
-                        return self.getKfacPrecondUpdates(g, varlist)
-                    u = tf.cond(tf.greater(self.factor_step,
-                                           tf.convert_to_tensor(0)), getKfacGradOp, gradOp)
-
-                    optim = tf.train.MomentumOptimizer(
-                        self._lr * (1. - self._momentum), self._momentum)
-                    #optim = tf.train.AdamOptimizer(self._lr, epsilon=0.01)
-
-                    def optimOp():
-                        def updateOptimOp():
-                            if self._full_stats_init:
-                                return tf.cond(tf.greater(self.factor_step, tf.convert_to_tensor(0)), lambda: optim.apply_gradients(list(zip(u, varlist))), tf.no_op)
-                            else:
-                                return optim.apply_gradients(list(zip(u, varlist)))
-                        if self._full_stats_init:
-                            return tf.cond(tf.greater_equal(self.stats_step, self._stats_accum_iter), updateOptimOp, tf.no_op)
-                        else:
-                            return tf.cond(tf.greater_equal(self.sgd_step, self._cold_iter), updateOptimOp, tf.no_op)
-                    updateOps.append(optimOp())
-
-        return tf.group(*updateOps), qr
-
-    def apply_gradients(self, grads):
-        coldOptim = tf.train.MomentumOptimizer(
-            self._cold_lr, self._momentum)
-
-        def coldSGDstart():
-            sgd_grads, sgd_var = zip(*grads)
-
-            if self.max_grad_norm != None:
-                sgd_grads, sgd_grad_norm = tf.clip_by_global_norm(sgd_grads,self.max_grad_norm)
-
-            sgd_grads = list(zip(sgd_grads,sgd_var))
-
-            sgd_step_op = tf.assign_add(self.sgd_step, 1)
-            coldOptim_op = coldOptim.apply_gradients(sgd_grads)
-            if KFAC_DEBUG:
-                with tf.control_dependencies([sgd_step_op, coldOptim_op]):
-                    sgd_step_op = tf.Print(
-                        sgd_step_op, [self.sgd_step, tf.convert_to_tensor('doing cold sgd step')])
-            return tf.group(*[sgd_step_op, coldOptim_op])
-
-        kfacOptim_op, qr = self.apply_gradients_kfac(grads)
-
-        def warmKFACstart():
-            return kfacOptim_op
-
-        return tf.cond(tf.greater(self.sgd_step, self._cold_iter), warmKFACstart, coldSGDstart), qr
-
-    def minimize(self, loss, loss_sampled, var_list=None):
-        grads = self.compute_gradients(loss, var_list=var_list)
-        update_stats_op = self.compute_and_apply_stats(
-            loss_sampled, var_list=var_list)
-        return self.apply_gradients(grads)
--- a/baselines/acktr/kfac_utils.py
+++ b/baselines/acktr/kfac_utils.py
@@ -1,86 +0,0 @@
-import tensorflow as tf
-
-def gmatmul(a, b, transpose_a=False, transpose_b=False, reduce_dim=None):
-    assert reduce_dim is not None
-
-    # weird batch matmul
-    if len(a.get_shape()) == 2 and len(b.get_shape()) > 2:
-        # reshape reduce_dim to the left most dim in b
-        b_shape = b.get_shape()
-        if reduce_dim != 0:
-            b_dims = list(range(len(b_shape)))
-            b_dims.remove(reduce_dim)
-            b_dims.insert(0, reduce_dim)
-            b = tf.transpose(b, b_dims)
-        b_t_shape = b.get_shape()
-        b = tf.reshape(b, [int(b_shape[reduce_dim]), -1])
-        result = tf.matmul(a, b, transpose_a=transpose_a,
-                           transpose_b=transpose_b)
-        result = tf.reshape(result, b_t_shape)
-        if reduce_dim != 0:
-            b_dims = list(range(len(b_shape)))
-            b_dims.remove(0)
-            b_dims.insert(reduce_dim, 0)
-            result = tf.transpose(result, b_dims)
-        return result
-
-    elif len(a.get_shape()) > 2 and len(b.get_shape()) == 2:
-        # reshape reduce_dim to the right most dim in a
-        a_shape = a.get_shape()
-        outter_dim = len(a_shape) - 1
-        reduce_dim = len(a_shape) - reduce_dim - 1
-        if reduce_dim != outter_dim:
-            a_dims = list(range(len(a_shape)))
-            a_dims.remove(reduce_dim)
-            a_dims.insert(outter_dim, reduce_dim)
-            a = tf.transpose(a, a_dims)
-        a_t_shape = a.get_shape()
-        a = tf.reshape(a, [-1, int(a_shape[reduce_dim])])
-        result = tf.matmul(a, b, transpose_a=transpose_a,
-                           transpose_b=transpose_b)
-        result = tf.reshape(result, a_t_shape)
-        if reduce_dim != outter_dim:
-            a_dims = list(range(len(a_shape)))
-            a_dims.remove(outter_dim)
-            a_dims.insert(reduce_dim, outter_dim)
-            result = tf.transpose(result, a_dims)
-        return result
-
-    elif len(a.get_shape()) == 2 and len(b.get_shape()) == 2:
-        return tf.matmul(a, b, transpose_a=transpose_a, transpose_b=transpose_b)
-
-    assert False, 'something went wrong'
-
-
-def clipoutNeg(vec, threshold=1e-6):
-    mask = tf.cast(vec > threshold, tf.float32)
-    return mask * vec
-
-
-def detectMinVal(input_mat, var, threshold=1e-6, name='', debug=False):
-    eigen_min = tf.reduce_min(input_mat)
-    eigen_max = tf.reduce_max(input_mat)
-    eigen_ratio = eigen_max / eigen_min
-    input_mat_clipped = clipoutNeg(input_mat, threshold)
-
-    if debug:
-        input_mat_clipped = tf.cond(tf.logical_or(tf.greater(eigen_ratio, 0.), tf.less(eigen_ratio, -500)), lambda: input_mat_clipped, lambda: tf.Print(
-            input_mat_clipped, [tf.convert_to_tensor('screwed ratio ' + name + ' eigen values!!!'), tf.convert_to_tensor(var.name), eigen_min, eigen_max, eigen_ratio]))
-
-    return input_mat_clipped
-
-
-def factorReshape(Q, e, grad, facIndx=0, ftype='act'):
-    grad_shape = grad.get_shape()
-    if ftype == 'act':
-        assert e.get_shape()[0] == grad_shape[facIndx]
-        expanded_shape = [1, ] * len(grad_shape)
-        expanded_shape[facIndx] = -1
-        e = tf.reshape(e, expanded_shape)
-    if ftype == 'grad':
-        assert e.get_shape()[0] == grad_shape[len(grad_shape) - facIndx - 1]
-        expanded_shape = [1, ] * len(grad_shape)
-        expanded_shape[len(grad_shape) - facIndx - 1] = -1
-        e = tf.reshape(e, expanded_shape)
-
-    return Q, e
--- a/baselines/acktr/policies.py
+++ b/baselines/acktr/policies.py
@@ -1,42 +0,0 @@
-import numpy as np
-import tensorflow as tf
-from baselines.acktr.utils import dense, kl_div
-import baselines.common.tf_util as U
-
-class GaussianMlpPolicy(object):
-    def __init__(self, ob_dim, ac_dim):
-        # Here we'll construct a bunch of expressions, which will be used in two places:
-        # (1) When sampling actions
-        # (2) When computing loss functions, for the policy update
-        # Variables specific to (1) have the word "sampled" in them,
-        # whereas variables specific to (2) have the word "old" in them
-        ob_no = tf.placeholder(tf.float32, shape=[None, ob_dim*2], name="ob") # batch of observations
-        oldac_na = tf.placeholder(tf.float32, shape=[None, ac_dim], name="ac") # batch of actions previous actions
-        oldac_dist = tf.placeholder(tf.float32, shape=[None, ac_dim*2], name="oldac_dist") # batch of actions previous action distributions
-        adv_n = tf.placeholder(tf.float32, shape=[None], name="adv") # advantage function estimate
-        wd_dict = {}
-        h1 = tf.nn.tanh(dense(ob_no, 64, "h1", weight_init=U.normc_initializer(1.0), bias_init=0.0, weight_loss_dict=wd_dict))
-        h2 = tf.nn.tanh(dense(h1, 64, "h2", weight_init=U.normc_initializer(1.0), bias_init=0.0, weight_loss_dict=wd_dict))
-        mean_na = dense(h2, ac_dim, "mean", weight_init=U.normc_initializer(0.1), bias_init=0.0, weight_loss_dict=wd_dict) # Mean control output
-        self.wd_dict = wd_dict
-        self.logstd_1a = logstd_1a = tf.get_variable("logstd", [ac_dim], tf.float32, tf.zeros_initializer()) # Variance on outputs
-        logstd_1a = tf.expand_dims(logstd_1a, 0)
-        std_1a = tf.exp(logstd_1a)
-        std_na = tf.tile(std_1a, [tf.shape(mean_na)[0], 1])
-        ac_dist = tf.concat([tf.reshape(mean_na, [-1, ac_dim]), tf.reshape(std_na, [-1, ac_dim])], 1)
-        sampled_ac_na = tf.random_normal(tf.shape(ac_dist[:,ac_dim:])) * ac_dist[:,ac_dim:] + ac_dist[:,:ac_dim] # This is the sampled action we'll perform.
-        logprobsampled_n = - tf.reduce_sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * tf.reduce_sum(tf.square(ac_dist[:,:ac_dim] - sampled_ac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of sampled action
-        logprob_n = - tf.reduce_sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * tf.reduce_sum(tf.square(ac_dist[:,:ac_dim] - oldac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of previous actions under CURRENT policy (whereas oldlogprob_n is under OLD policy)
-        kl = tf.reduce_mean(kl_div(oldac_dist, ac_dist, ac_dim))
-        #kl = .5 * tf.reduce_mean(tf.square(logprob_n - oldlogprob_n)) # Approximation of KL divergence between old policy used to generate actions, and new policy used to compute logprob_n
-        surr = - tf.reduce_mean(adv_n * logprob_n) # Loss function that we'll differentiate to get the policy gradient
-        surr_sampled = - tf.reduce_mean(logprob_n) # Sampled loss of the policy
-        self._act = U.function([ob_no], [sampled_ac_na, ac_dist, logprobsampled_n]) # Generate a new action and its logprob
-        #self.compute_kl = U.function([ob_no, oldac_na, oldlogprob_n], kl) # Compute (approximate) KL divergence between old policy and new policy
-        self.compute_kl = U.function([ob_no, oldac_dist], kl)
-        self.update_info = ((ob_no, oldac_na, adv_n), surr, surr_sampled) # Input and output variables needed for computing loss
-        U.initialize() # Initialize uninitialized TF variables
-
-    def act(self, ob):
-        ac, ac_dist, logp = self._act(ob[None])
-        return ac[0], ac_dist[0], logp[0]
--- a/baselines/acktr/run_atari.py
+++ b/baselines/acktr/run_atari.py
@@ -1,23 +0,0 @@
-#!/usr/bin/env python3
-
-from functools import partial
-
-from baselines import logger
-from baselines.acktr.acktr_disc import learn
-from baselines.common.cmd_util import make_atari_env, atari_arg_parser
-from baselines.common.vec_env.vec_frame_stack import VecFrameStack
-from baselines.ppo2.policies import CnnPolicy
-
-def train(env_id, num_timesteps, seed, num_cpu):
-    env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
-    policy_fn = partial(CnnPolicy, one_dim_bias=True)
-    learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
-    env.close()
-
-def main():
-    args = atari_arg_parser().parse_args()
-    logger.configure()
-    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed, num_cpu=32)
-
-if __name__ == '__main__':
-    main()
--- a/baselines/acktr/run_mujoco.py
+++ b/baselines/acktr/run_mujoco.py
@@ -1,34 +0,0 @@
-#!/usr/bin/env python3
-
-import tensorflow as tf
-from baselines import logger
-from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
-from baselines.acktr.acktr_cont import learn
-from baselines.acktr.policies import GaussianMlpPolicy
-from baselines.acktr.value_functions import NeuralNetValueFunction
-
-def train(env_id, num_timesteps, seed):
-    env = make_mujoco_env(env_id, seed)
-
-    with tf.Session(config=tf.ConfigProto()):
-        ob_dim = env.observation_space.shape[0]
-        ac_dim = env.action_space.shape[0]
-        with tf.variable_scope("vf"):
-            vf = NeuralNetValueFunction(ob_dim, ac_dim)
-        with tf.variable_scope("pi"):
-            policy = GaussianMlpPolicy(ob_dim, ac_dim)
-
-        learn(env, policy=policy, vf=vf,
-            gamma=0.99, lam=0.97, timesteps_per_batch=2500,
-            desired_kl=0.002,
-            num_timesteps=num_timesteps, animate=False)
-
-        env.close()
-
-def main():
-    args = mujoco_arg_parser().parse_args()
-    logger.configure()
-    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
-
-if __name__ == "__main__":
-    main()
--- a/baselines/acktr/utils.py
+++ b/baselines/acktr/utils.py
@@ -1,28 +0,0 @@
-import tensorflow as tf
-
-def dense(x, size, name, weight_init=None, bias_init=0, weight_loss_dict=None, reuse=None):
-    with tf.variable_scope(name, reuse=reuse):
-        assert (len(tf.get_variable_scope().name.split('/')) == 2)
-
-        w = tf.get_variable("w", [x.get_shape()[1], size], initializer=weight_init)
-        b = tf.get_variable("b", [size], initializer=tf.constant_initializer(bias_init))
-        weight_decay_fc = 3e-4
-
-        if weight_loss_dict is not None:
-            weight_decay = tf.multiply(tf.nn.l2_loss(w), weight_decay_fc, name='weight_decay_loss')
-            if weight_loss_dict is not None:
-                weight_loss_dict[w] = weight_decay_fc
-                weight_loss_dict[b] = 0.0
-
-            tf.add_to_collection(tf.get_variable_scope().name.split('/')[0] + '_' + 'losses', weight_decay)
-
-        return tf.nn.bias_add(tf.matmul(x, w), b)
-
-def kl_div(action_dist1, action_dist2, action_size):
-    mean1, std1 = action_dist1[:, :action_size], action_dist1[:, action_size:]
-    mean2, std2 = action_dist2[:, :action_size], action_dist2[:, action_size:]
-
-    numerator = tf.square(mean1 - mean2) + tf.square(std1) - tf.square(std2)
-    denominator = 2 * tf.square(std2) + 1e-8
-    return tf.reduce_sum(
-        numerator/denominator + tf.log(std2) - tf.log(std1),reduction_indices=-1)
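Note on the removed `kl_div` helper: it evaluates the closed-form KL divergence between two diagonal Gaussians and sums it over the action dimensions. The sketch below reproduces the per-dimension expression in plain Python so it can be checked by hand. The name `gaussian_kl` and the scalar inputs are illustrative assumptions and are not part of baselines.

```python
import math


def gaussian_kl(mean1, std1, mean2, std2, eps=1e-8):
    """Per-dimension KL(N(mean1, std1^2) || N(mean2, std2^2)).

    Mirrors the arithmetic of the removed TensorFlow helper:
    ((m1 - m2)^2 + s1^2 - s2^2) / (2 * s2^2 + eps) + log(s2) - log(s1)
    """
    numerator = (mean1 - mean2) ** 2 + std1 ** 2 - std2 ** 2
    denominator = 2.0 * std2 ** 2 + eps  # eps guards against division by zero
    return numerator / denominator + math.log(std2) - math.log(std1)
```

Identical distributions give a divergence of zero, and shifting the mean by one standard deviation gives roughly 0.5.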
--- a/baselines/acktr/value_functions.py
+++ b/baselines/acktr/value_functions.py
@@ -1,50 +0,0 @@
-from baselines import logger
-import numpy as np
-import baselines.common as common
-from baselines.common import tf_util as U
-import tensorflow as tf
-from baselines.acktr import kfac
-from baselines.acktr.utils import dense
-
-class NeuralNetValueFunction(object):
-    def __init__(self, ob_dim, ac_dim): #pylint: disable=W0613
-        X = tf.placeholder(tf.float32, shape=[None, ob_dim*2+ac_dim*2+2]) # batch of observations
-        vtarg_n = tf.placeholder(tf.float32, shape=[None], name='vtarg')
-        wd_dict = {}
-        h1 = tf.nn.elu(dense(X, 64, "h1", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict))
-        h2 = tf.nn.elu(dense(h1, 64, "h2", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict))
-        vpred_n = dense(h2, 1, "hfinal", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict)[:,0]
-        sample_vpred_n = vpred_n + tf.random_normal(tf.shape(vpred_n))
-        wd_loss = tf.get_collection("vf_losses", None)
-        loss = tf.reduce_mean(tf.square(vpred_n - vtarg_n)) + tf.add_n(wd_loss)
-        loss_sampled = tf.reduce_mean(tf.square(vpred_n - tf.stop_gradient(sample_vpred_n)))
-        self._predict = U.function([X], vpred_n)
-        optim = kfac.KfacOptimizer(learning_rate=0.001, cold_lr=0.001*(1-0.9), momentum=0.9, \
-                                    clip_kl=0.3, epsilon=0.1, stats_decay=0.95, \
-                                    async=1, kfac_update=2, cold_iter=50, \
-                                    weight_decay_dict=wd_dict, max_grad_norm=None)
-        vf_var_list = []
-        for var in tf.trainable_variables():
-            if "vf" in var.name:
-                vf_var_list.append(var)
-
-        update_op, self.q_runner = optim.minimize(loss, loss_sampled, var_list=vf_var_list)
-        self.do_update = U.function([X, vtarg_n], update_op) #pylint: disable=E1101
-        U.initialize() # Initialize uninitialized TF variables
-    def _preproc(self, path):
-        l = pathlength(path)
-        al = np.arange(l).reshape(-1,1)/10.0
-        act = path["action_dist"].astype('float32')
-        X = np.concatenate([path['observation'], act, al, np.ones((l, 1))], axis=1)
-        return X
-    def predict(self, path):
-        return self._predict(self._preproc(path))
-    def fit(self, paths, targvals):
-        X = np.concatenate([self._preproc(p) for p in paths])
-        y = np.concatenate(targvals)
-        logger.record_tabular("EVBefore", common.explained_variance(self._predict(X), y))
-        for _ in range(25): self.do_update(X, y)
-        logger.record_tabular("EVAfter", common.explained_variance(self._predict(X), y))
-
-def pathlength(path):
-    return path["reward"].shape[0]
--- a/baselines/bench/__init__.py
+++ b/baselines/bench/__init__.py
@@ -1,2 +1,2 @@
-from baselines.bench.benchmarks import *
-from baselines.bench.monitor import *
+from baselines.bench.benchmarks import * # noqa: F403 F401
+from baselines.bench.monitor import * # noqa: F403 F401
--- a/baselines/bench/benchmarks.py
+++ b/baselines/bench/benchmarks.py
@@ -1,5 +1,4 @@
 import re
-import os.path as osp
 import os
 SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

@@ -20,7 +19,7 @@ def register_benchmark(benchmark):
    if 'tasks' in benchmark:
        for t in benchmark['tasks']:
            if 'desc' not in t:
-                t['desc'] = remove_version_re.sub('', t['env_id'])
+                t['desc'] = remove_version_re.sub('', t.get('env_id', t.get('id')))
    _BENCHMARKS.append(benchmark)


@@ -59,7 +58,7 @@ register_benchmark({
 register_benchmark({
    'name': 'Atari10M',
    'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
-    'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari7]
+    'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 6, 'num_timesteps': int(10e6)} for _game in _atari7]
 })

 register_benchmark({
@@ -84,8 +83,9 @@ _mujocosmall = [
 register_benchmark({
    'name': 'Mujoco1M',
    'description': 'Some small 2D MuJoCo tasks, run for 1M timesteps',
-    'tasks': [{'env_id': _envid, 'trials': 3, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
+    'tasks': [{'env_id': _envid, 'trials': 6, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
 })
+
 register_benchmark({
    'name': 'MujocoWalkers',
    'description': 'MuJoCo forward walkers, run for 8M, humanoid 100M',
@@ -96,6 +96,19 @@ register_benchmark({
    ]
 })

+# Bullet
+_bulletsmall = [
+    'InvertedDoublePendulum', 'InvertedPendulum', 'HalfCheetah', 'Reacher', 'Walker2D', 'Hopper', 'Ant'
+]
+_bulletsmall = [e + 'BulletEnv-v0' for e in _bulletsmall]
+
+register_benchmark({
+    'name': 'Bullet1M',
+    'description': '6 mujoco-like tasks from bullet, 1M steps',
+    'tasks': [{'env_id': e, 'trials': 6, 'num_timesteps': int(1e6)} for e in _bulletsmall]
+})
+
+
 # Roboschool

 register_benchmark({
@@ -142,9 +155,10 @@ register_benchmark({

 # HER DDPG

+_fetch_tasks = ['FetchReach-v1', 'FetchPush-v1', 'FetchSlide-v1']
 register_benchmark({
-    'name': 'HerDdpg',
-    'description': 'Smoke-test only benchmark of HER',
-    'tasks': [{'trials': 1, 'env_id': 'FetchReach-v1'}]
+    'name': 'Fetch1M',
+    'description': 'Fetch* benchmarks for 1M timesteps',
+    'tasks': [{'trials': 6, 'env_id': env_id, 'num_timesteps': int(1e6)} for env_id in _fetch_tasks]
 })
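Note on the `register_benchmark` change above: a task's description now falls back to the `id` key when `env_id` is absent. The sketch below isolates that lookup. The regular expression is an assumption (the definition of `remove_version_re` is not shown in this diff), and `task_desc` is a hypothetical helper name.

```python
import re

# Assumption: strips a trailing "-v<N>" version suffix from an environment id.
remove_version_re = re.compile(r'-v\d+$')


def task_desc(task):
    """Return the task description, deriving it from the env id when absent."""
    if 'desc' in task:
        return task['desc']
    # Prefer 'env_id'; fall back to 'id' as the patched code does.
    env_id = task.get('env_id', task.get('id'))
    return remove_version_re.sub('', env_id)
```

If a task carries neither key, `env_id` is `None` and `re.sub` raises a `TypeError`, so the fallback widens what is accepted but does not make the key optional.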

--- a/baselines/bench/monitor.py
+++ b/baselines/bench/monitor.py
@@ -1,13 +1,11 @@
 __all__ = ['Monitor', 'get_monitor_files', 'load_results']

-import gym
 from gym.core import Wrapper
 import time
 from glob import glob
 import csv
 import os.path as osp
 import json
-import numpy as np

 class Monitor(Wrapper):
    EXT = "monitor.csv"
@@ -16,21 +14,13 @@ class Monitor(Wrapper):
    def __init__(self, env, filename, allow_early_resets=False, reset_keywords=(), info_keywords=()):
        Wrapper.__init__(self, env=env)
        self.tstart = time.time()
-        if filename is None:
-            self.f = None
-            self.logger = None
+        if filename:
+            self.results_writer = ResultsWriter(filename,
+                header={"t_start": time.time(), 'env_id' : env.spec and env.spec.id},
+                extra_keys=reset_keywords + info_keywords
+            )
        else:
-            if not filename.endswith(Monitor.EXT):
-                if osp.isdir(filename):
-                    filename = osp.join(filename, Monitor.EXT)
-                else:
-                    filename = filename + "." + Monitor.EXT
-            self.f = open(filename, "wt")
-            self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, 'env_id' : env.spec and env.spec.id}))
-            self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords+info_keywords)
-            self.logger.writeheader()
-            self.f.flush()
-
+            self.results_writer = None
        self.reset_keywords = reset_keywords
        self.info_keywords = info_keywords
        self.allow_early_resets = allow_early_resets
@@ -43,10 +33,7 @@ class Monitor(Wrapper):
        self.current_reset_info = {} # extra info about the current episode, that was passed in during reset()

    def reset(self, **kwargs):
-        if not self.allow_early_resets and not self.needs_reset:
-            raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
-        self.rewards = []
-        self.needs_reset = False
+        self.reset_state()
        for k in self.reset_keywords:
            v = kwargs.get(k)
            if v is None:
@@ -54,10 +41,21 @@ class Monitor(Wrapper):
            self.current_reset_info[k] = v
        return self.env.reset(**kwargs)

+    def reset_state(self):
+        if not self.allow_early_resets and not self.needs_reset:
+            raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
+        self.rewards = []
+        self.needs_reset = False
+
+
    def step(self, action):
        if self.needs_reset:
            raise RuntimeError("Tried to step environment that needs reset")
        ob, rew, done, info = self.env.step(action)
+        self.update(ob, rew, done, info)
+        return (ob, rew, done, info)
+
+    def update(self, ob, rew, done, info):
        self.rewards.append(rew)
        if done:
            self.needs_reset = True
@@ -70,12 +68,13 @@ class Monitor(Wrapper):
            self.episode_lengths.append(eplen)
            self.episode_times.append(time.time() - self.tstart)
            epinfo.update(self.current_reset_info)
-            if self.logger:
-                self.logger.writerow(epinfo)
-                self.f.flush()
-            info['episode'] = epinfo
+            if self.results_writer:
+                self.results_writer.write_row(epinfo)
+            assert isinstance(info, dict)
+            if isinstance(info, dict):
+                info['episode'] = epinfo
+
        self.total_steps += 1
-        return (ob, rew, done, info)

    def close(self):
        if self.f is not None:
@@ -96,13 +95,37 @@ class Monitor(Wrapper):
 class LoadMonitorResultsError(Exception):
    pass

+
+class ResultsWriter(object):
+    def __init__(self, filename, header='', extra_keys=()):
+        self.extra_keys = extra_keys
+        assert filename is not None
+        if not filename.endswith(Monitor.EXT):
+            if osp.isdir(filename):
+                filename = osp.join(filename, Monitor.EXT)
+            else:
+                filename = filename + "." + Monitor.EXT
+        self.f = open(filename, "wt")
+        if isinstance(header, dict):
+            header = '# {} \n'.format(json.dumps(header))
+        self.f.write(header)
+        self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+tuple(extra_keys))
+        self.logger.writeheader()
+        self.f.flush()
+
+    def write_row(self, epinfo):
+        if self.logger:
+            self.logger.writerow(epinfo)
+            self.f.flush()
+
+
 def get_monitor_files(dir):
    return glob(osp.join(dir, "*" + Monitor.EXT))

 def load_results(dir):
    import pandas
    monitor_files = (
-        glob(osp.join(dir, "*monitor.json")) + 
+        glob(osp.join(dir, "*monitor.json")) +
        glob(osp.join(dir, "*monitor.csv"))) # get both csv and (old) json files
    if not monitor_files:
        raise LoadMonitorResultsError("no monitor files of the form *%s found in %s" % (Monitor.EXT, dir))
@@ -112,6 +135,8 @@ def load_results(dir):
        with open(fname, 'rt') as fh:
            if fname.endswith('csv'):
                firstline = fh.readline()
+                if not firstline:
+                    continue
                assert firstline[0] == '#'
                header = json.loads(firstline[1:])
                df = pandas.read_csv(fh, index_col=None)
@@ -135,27 +160,3 @@ def load_results(dir):
    df['t'] -= min(header['t_start'] for header in headers)
    df.headers = headers # HACK to preserve backwards compatibility
    return df
-
-def test_monitor():
-    env = gym.make("CartPole-v1")
-    env.seed(0)
-    mon_file = "/tmp/baselines-test-%s.monitor.csv" % uuid.uuid4()
-    menv = Monitor(env, mon_file)
-    menv.reset()
-    for _ in range(1000):
-        _, _, done, _ = menv.step(0)
-        if done:
-            menv.reset()
-
-    f = open(mon_file, 'rt')
-
-    firstline = f.readline()
-    assert firstline.startswith('#')
-    metadata = json.loads(firstline[1:])
-    assert metadata['env_id'] == "CartPole-v1"
-    assert set(metadata.keys()) == {'env_id', 'gym_version', 't_start'},  "Incorrect keys in monitor metadata"
-
-    last_logline = pandas.read_csv(f, index_col=None)
-    assert set(last_logline.keys()) == {'l', 't', 'r'}, "Incorrect keys in monitor logline"
-    f.close()
-    os.remove(mon_file)
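Note on the monitor file layout: `ResultsWriter` writes one `#`-prefixed JSON header line followed by CSV rows with the columns `r`, `l`, `t` plus any extra keys, and `load_results` now skips files whose first line is empty. The round trip below is a minimal sketch of that layout using only the standard library. `write_monitor` and `read_monitor` are hypothetical names, and the reader uses `csv.DictReader` in place of pandas.

```python
import csv
import io
import json


def write_monitor(fh, header, rows, extra_keys=()):
    """Write a JSON header line, then CSV rows, in the monitor layout."""
    fh.write('# {} \n'.format(json.dumps(header)))
    writer = csv.DictWriter(fh, fieldnames=('r', 'l', 't') + tuple(extra_keys))
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def read_monitor(fh):
    """Return (header, rows); an empty file yields (None, [])."""
    firstline = fh.readline()
    if not firstline:
        return None, []
    assert firstline[0] == '#'
    header = json.loads(firstline[1:])
    # csv returns every field as a string; convert as needed downstream.
    return header, list(csv.DictReader(fh))
```

A quick check is to write into an `io.StringIO`, rewind it with `seek(0)`, and read it back.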
--- a/baselines/common/atari_wrappers.py
+++ b/baselines/common/atari_wrappers.py
@@ -1,9 +1,13 @@
 import numpy as np
+import os
+os.environ.setdefault('PATH', '')
 from collections import deque
 import gym
 from gym import spaces
 import cv2
 cv2.ocl.setUseOpenCL(False)
+from .wrappers import TimeLimit
+

 class NoopResetEnv(gym.Wrapper):
    def __init__(self, env, noop_max=30):
@@ -70,8 +74,8 @@ class EpisodicLifeEnv(gym.Wrapper):
        # then update lives to handle bonus lives
        lives = self.env.unwrapped.ale.lives()
        if lives < self.lives and lives > 0:
-            # for Qbert sometimes we stay in lives == 0 condtion for a few frames
-            # so its important to keep lives > 0, so that we only reset once
+            # for Qbert sometimes we stay in lives == 0 condition for a few frames
+            # so it's important to keep lives > 0, so that we only reset once
            # the environment advertises done.
            done = True
        self.lives = lives
@@ -126,19 +130,60 @@ class ClipRewardEnv(gym.RewardWrapper):
        """Bin reward to {+1, 0, -1} by its sign."""
        return np.sign(reward)

-class WarpFrame(gym.ObservationWrapper):
-    def __init__(self, env):
-        """Warp frames to 84x84 as done in the Nature paper and later work."""
-        gym.ObservationWrapper.__init__(self, env)
-        self.width = 84
-        self.height = 84
-        self.observation_space = spaces.Box(low=0, high=255,
-            shape=(self.height, self.width, 1), dtype=np.uint8)

-    def observation(self, frame):
-        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
-        frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
-        return frame[:, :, None]
+class WarpFrame(gym.ObservationWrapper):
+    def __init__(self, env, width=84, height=84, grayscale=True, dict_space_key=None):
+        """
+        Warp frames to 84x84 as done in the Nature paper and later work.
+
+        If the environment uses dictionary observations, `dict_space_key` can be specified which indicates which
+        observation should be warped.
+        """
+        super().__init__(env)
+        self._width = width
+        self._height = height
+        self._grayscale = grayscale
+        self._key = dict_space_key
+        if self._grayscale:
+            num_colors = 1
+        else:
+            num_colors = 3
+
+        new_space = gym.spaces.Box(
+            low=0,
+            high=255,
+            shape=(self._height, self._width, num_colors),
+            dtype=np.uint8,
+        )
+        if self._key is None:
+            original_space = self.observation_space
+            self.observation_space = new_space
+        else:
+            original_space = self.observation_space.spaces[self._key]
+            self.observation_space.spaces[self._key] = new_space
+        assert original_space.dtype == np.uint8 and len(original_space.shape) == 3
+
+    def observation(self, obs):
+        if self._key is None:
+            frame = obs
+        else:
+            frame = obs[self._key]
+
+        if self._grayscale:
+            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
+        frame = cv2.resize(
+            frame, (self._width, self._height), interpolation=cv2.INTER_AREA
+        )
+        if self._grayscale:
+            frame = np.expand_dims(frame, -1)
+
+        if self._key is None:
+            obs = frame
+        else:
+            obs = obs.copy()
+            obs[self._key] = frame
+        return obs
+

 class FrameStack(gym.Wrapper):
    def __init__(self, env, k):
@@ -154,7 +199,7 @@ class FrameStack(gym.Wrapper):
        self.k = k
        self.frames = deque([], maxlen=k)
        shp = env.observation_space.shape
-        self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=np.uint8)
+        self.observation_space = spaces.Box(low=0, high=255, shape=(shp[:-1] + (shp[-1] * k,)), dtype=env.observation_space.dtype)

    def reset(self):
        ob = self.env.reset()
@@ -174,6 +219,7 @@ class FrameStack(gym.Wrapper):
 class ScaledFloatFrame(gym.ObservationWrapper):
    def __init__(self, env):
        gym.ObservationWrapper.__init__(self, env)
+        self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)

    def observation(self, observation):
        # careful! This undoes the memory optimization, use
@@ -194,7 +240,7 @@ class LazyFrames(object):

    def _force(self):
        if self._out is None:
-            self._out = np.concatenate(self._frames, axis=2)
+            self._out = np.concatenate(self._frames, axis=-1)
            self._frames = None
        return self._out

@@ -210,11 +256,20 @@ class LazyFrames(object):
    def __getitem__(self, i):
        return self._force()[i]

-def make_atari(env_id):
+    def count(self):
+        frames = self._force()
+        return frames.shape[frames.ndim - 1]
+
+    def frame(self, i):
+        return self._force()[..., i]
+
+def make_atari(env_id, max_episode_steps=None):
    env = gym.make(env_id)
    assert 'NoFrameskip' in env.spec.id
    env = NoopResetEnv(env, noop_max=30)
    env = MaxAndSkipEnv(env, skip=4)
+    if max_episode_steps is not None:
+        env = TimeLimit(env, max_episode_steps=max_episode_steps)
    return env

 def wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack=False, scale=False):
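Note on the FrameStack hunk above: the stacked observation shape is now derived from the last axis instead of assuming a rank-3 (H, W, C) input, and LazyFrames concatenates along axis=-1 to match. A minimal sketch of that tuple arithmetic follows; the helper name `stacked_shape` is hypothetical and not part of baselines.

```python
def stacked_shape(shape, k):
    """Return the shape after concatenating k frames along the last axis.

    Mirrors the expression shp[:-1] + (shp[-1] * k,) from the diff; the
    helper itself is illustrative only and not part of baselines.
    """
    shape = tuple(shape)
    if not shape:
        raise ValueError("frames must have at least one axis")
    return shape[:-1] + (shape[-1] * k,)


if __name__ == "__main__":
    print(stacked_shape((84, 84, 1), 4))  # grayscale image frames
    print(stacked_shape((17,), 4))        # flat vector observations
```

The same expression covers image and vector observations, which is presumably why the hard-coded three-axis indexing was dropped.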
--- a/baselines/common/cg.py
+++ b/baselines/common/cg.py
@@ -31,4 +31,4 @@ def cg(f_Ax, b, cg_iters=10, callback=None, verbose=False, residual_tol=1e-10):
    if callback is not None:
        callback(x)
    if verbose: print(fmtstr % (i+1, rdotr, np.linalg.norm(x)))  # pylint: disable=W0631
-    return x
+    return x
--- a/baselines/common/cmd_util.py
+++ b/baselines/common/cmd_util.py
@@ -3,7 +3,11 @@ Helpers for scripts like run_atari.py.
 """

 import os
-from mpi4py import MPI
+try:
+    from mpi4py import MPI
+except ImportError:
+    MPI = None
+
 import gym
 from gym.wrappers import FlattenDictWrapper
 from baselines import logger
@@ -11,31 +15,90 @@ from baselines.bench import Monitor
 from baselines.common import set_global_seeds
 from baselines.common.atari_wrappers import make_atari, wrap_deepmind
 from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
+from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
+from baselines.common import retro_wrappers

-def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
+def make_vec_env(env_id, env_type, num_env, seed,
+                 wrapper_kwargs=None,
+                 start_index=0,
+                 reward_scale=1.0,
+                 flatten_dict_observations=True,
+                 gamestate=None):
    """
-    Create a wrapped, monitored SubprocVecEnv for Atari.
+    Create a wrapped, monitored VecEnv (SubprocVecEnv if num_env > 1, otherwise DummyVecEnv) for Atari, Retro, or other Gym envs such as MuJoCo.
    """
-    if wrapper_kwargs is None: wrapper_kwargs = {}
-    def make_env(rank): # pylint: disable=C0111
-        def _thunk():
-            env = make_atari(env_id)
-            env.seed(seed + rank)
-            env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
-            return wrap_deepmind(env, **wrapper_kwargs)
-        return _thunk
+    wrapper_kwargs = wrapper_kwargs or {}
+    mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
+    seed = seed + 10000 * mpi_rank if seed is not None else None
+    logger_dir = logger.get_dir()
+    def make_thunk(rank):
+        return lambda: make_env(
+            env_id=env_id,
+            env_type=env_type,
+            mpi_rank=mpi_rank,
+            subrank=rank,
+            seed=seed,
+            reward_scale=reward_scale,
+            gamestate=gamestate,
+            flatten_dict_observations=flatten_dict_observations,
+            wrapper_kwargs=wrapper_kwargs,
+            logger_dir=logger_dir
+        )
+
    set_global_seeds(seed)
-    return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
+    if num_env > 1:
+        return SubprocVecEnv([make_thunk(i + start_index) for i in range(num_env)])
+    else:
+        return DummyVecEnv([make_thunk(start_index)])

-def make_mujoco_env(env_id, seed):
+
+def make_env(env_id, env_type, mpi_rank=0, subrank=0, seed=None, reward_scale=1.0, gamestate=None, flatten_dict_observations=True, wrapper_kwargs=None, logger_dir=None):
+    wrapper_kwargs = wrapper_kwargs or {}
+    if env_type == 'atari':
+        env = make_atari(env_id)
+    elif env_type == 'retro':
+        import retro
+        gamestate = gamestate or retro.State.DEFAULT
+        env = retro_wrappers.make_retro(game=env_id, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE, state=gamestate)
+    else:
+        env = gym.make(env_id)
+
+    if flatten_dict_observations and isinstance(env.observation_space, gym.spaces.Dict):
+        keys = env.observation_space.spaces.keys()
+        env = gym.wrappers.FlattenDictWrapper(env, dict_keys=list(keys))
+
+    env.seed(seed + subrank if seed is not None else None)
+    env = Monitor(env,
+                  logger_dir and os.path.join(logger_dir, str(mpi_rank) + '.' + str(subrank)),
+                  allow_early_resets=True)
+
+    if env_type == 'atari':
+        env = wrap_deepmind(env, **wrapper_kwargs)
+    elif env_type == 'retro':
+        if 'frame_stack' not in wrapper_kwargs:
+            wrapper_kwargs['frame_stack'] = 1
+        env = retro_wrappers.wrap_deepmind_retro(env, **wrapper_kwargs)
+
+    if reward_scale != 1:
+        env = retro_wrappers.RewardScaler(env, reward_scale)
+
+    return env
+
+
+def make_mujoco_env(env_id, seed, reward_scale=1.0):
    """
    Create a wrapped, monitored gym.Env for MuJoCo.
    """
    rank = MPI.COMM_WORLD.Get_rank()
-    set_global_seeds(seed + 10000 * rank)
+    myseed = seed + 1000 * rank if seed is not None else None
+    set_global_seeds(myseed)
    env = gym.make(env_id)
-    env = Monitor(env, os.path.join(logger.get_dir(), str(rank)))
+    logger_path = None if logger.get_dir() is None else os.path.join(logger.get_dir(), str(rank))
+    env = Monitor(env, logger_path, allow_early_resets=True)
    env.seed(seed)
+    if reward_scale != 1.0:
+        from baselines.common.retro_wrappers import RewardScaler
+        env = RewardScaler(env, reward_scale)
    return env

 def make_robotics_env(env_id, seed, rank=0):
@@ -62,20 +125,31 @@ def atari_arg_parser():
    """
    Create an argparse.ArgumentParser for run_atari.py.
    """
-    parser = arg_parser()
-    parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
-    parser.add_argument('--seed', help='RNG seed', type=int, default=0)
-    parser.add_argument('--num-timesteps', type=int, default=int(10e6))
-    return parser
+    print('Obsolete - use common_arg_parser instead')
+    return common_arg_parser()

 def mujoco_arg_parser():
+    print('Obsolete - use common_arg_parser instead')
+    return common_arg_parser()
+
+def common_arg_parser():
    """
    Create an argparse.ArgumentParser for run_mujoco.py.
    """
    parser = arg_parser()
    parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
-    parser.add_argument('--seed', help='RNG seed', type=int, default=0)
-    parser.add_argument('--num-timesteps', type=int, default=int(1e6))
+    parser.add_argument('--env_type', help='type of environment, used when the environment type cannot be automatically determined', type=str)
+    parser.add_argument('--seed', help='RNG seed', type=int, default=None)
+    parser.add_argument('--alg', help='Algorithm', type=str, default='ppo2')
+    parser.add_argument('--num_timesteps', type=float, default=1e6)
+    parser.add_argument('--network', help='network type (mlp, cnn, lstm, cnn_lstm, conv_only)', default=None)
+    parser.add_argument('--gamestate', help='game state to load (so far only used in retro games)', default=None)
+    parser.add_argument('--num_env', help='Number of environment copies being run in parallel. When not specified, set to the number of CPUs for Atari, and to 1 for MuJoCo', default=None, type=int)
+    parser.add_argument('--reward_scale', help='Reward scale factor. Default: 1.0', default=1.0, type=float)
+    parser.add_argument('--save_path', help='Path to save trained model to', default=None, type=str)
+    parser.add_argument('--save_video_interval', help='Save video every x steps (0 = disabled)', default=0, type=int)
+    parser.add_argument('--save_video_length', help='Length of recorded video. Default: 200', default=200, type=int)
+    parser.add_argument('--log_path', help='Directory to save learning curve data.', default=None, type=str)
    parser.add_argument('--play', default=False, action='store_true')
    return parser

@@ -85,6 +159,28 @@ def robotics_arg_parser():
    """
    parser = arg_parser()
    parser.add_argument('--env', help='environment ID', type=str, default='FetchReach-v0')
-    parser.add_argument('--seed', help='RNG seed', type=int, default=0)
+    parser.add_argument('--seed', help='RNG seed', type=int, default=None)
    parser.add_argument('--num-timesteps', type=int, default=int(1e6))
    return parser
+
+
+def parse_unknown_args(args):
+    """
+    Parse arguments not consumed by arg parser into a dictionary
+    """
+    retval = {}
+    preceded_by_key = False
+    for arg in args:
+        if arg.startswith('--'):
+            if '=' in arg:
+                key = arg.split('=')[0][2:]
+                value = arg.split('=', 1)[1]
+                retval[key] = value
+            else:
+                key = arg[2:]
+                preceded_by_key = True
+        elif preceded_by_key:
+            retval[key] = arg
+            preceded_by_key = False
+
+    return retval
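Note on parse_unknown_args above: a minimal, self-contained sketch of how it can be combined with argparse's `parse_known_args`. The function body is a standalone copy of the logic in the diff, except that the value is split on the first `=` only (an adjustment made for this sketch). The argument names `--lr` and `--nsteps` are hypothetical examples, not flags defined by the parser above.

```python
import argparse


def parse_unknown_args(args):
    """Collect '--key=value' and '--key value' pairs into a dict of strings."""
    retval = {}
    preceded_by_key = False
    for arg in args:
        if arg.startswith('--'):
            if '=' in arg:
                key = arg.split('=')[0][2:]
                # Sketch assumption: split on the first '=' only.
                value = arg.split('=', 1)[1]
                retval[key] = value
            else:
                key = arg[2:]
                preceded_by_key = True
        elif preceded_by_key:
            retval[key] = arg
            preceded_by_key = False
    return retval


parser = argparse.ArgumentParser()
parser.add_argument('--env', default='Reacher-v2')

# Arguments the parser does not know are returned as a leftover list.
known, unknown = parser.parse_known_args(
    ['--env', 'PongNoFrameskip-v4', '--lr=3e-4', '--nsteps', '128'])
extra = parse_unknown_args(unknown)
print(known.env, extra)
```

Values stay strings here; any conversion to numbers is left to the caller.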
--- a/baselines/common/console_util.py
+++ b/baselines/common/console_util.py
@@ -2,6 +2,8 @@ from __future__ import print_function
 from contextlib import contextmanager
 import numpy as np
 import time
+import shlex
+import subprocess

 # ================================================================
 # Misc
@@ -37,7 +39,7 @@ color2num = dict(
    crimson=38
 )

-def colorize(string, color, bold=False, highlight=False):
+def colorize(string, color='green', bold=False, highlight=False):
    attr = []
    num = color2num[color]
    if highlight: num += 10
@@ -45,6 +47,25 @@ def colorize(string, color, bold=False, highlight=False):
    if bold: attr.append('1')
    return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string)

+def print_cmd(cmd, dry=False):
+    if isinstance(cmd, str):  # for shell=True
+        pass
+    else:
+        cmd = ' '.join(shlex.quote(arg) for arg in cmd)
+    print(colorize(('CMD: ' if not dry else 'DRY: ') + cmd))
+
+
+def get_git_commit(cwd=None):
+    return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'], cwd=cwd).decode('utf8')
+
+def get_git_commit_message(cwd=None):
+    return subprocess.check_output(['git', 'show', '-s', '--format=%B', 'HEAD'], cwd=cwd).decode('utf8')
+
+def ccap(cmd, dry=False, env=None, **kwargs):
+    print_cmd(cmd, dry)
+    if not dry:
+        subprocess.check_call(cmd, env=env, **kwargs)
+

 MESSAGE_DEPTH = 0

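Note on print_cmd above: list-form commands are joined with `shlex.quote`, so the printed line can be pasted back into a shell. A minimal sketch of that formatting without the color codes follows; `format_cmd` is a hypothetical name used only for illustration.

```python
import shlex


def format_cmd(cmd, dry=False):
    """Return the line print_cmd would show, minus the ANSI color codes."""
    if not isinstance(cmd, str):
        # Quote each argument so spaces and shell metacharacters survive.
        cmd = ' '.join(shlex.quote(arg) for arg in cmd)
    return ('DRY: ' if dry else 'CMD: ') + cmd


line = format_cmd(['git', 'commit', '-m', 'fix flake8 complaints'], dry=True)
print(line)
```

A command that is already a string (the shell=True case) is passed through unchanged.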
--- a/baselines/common/distributions.py
+++ b/baselines/common/distributions.py
@@ -1,8 +1,6 @@
 import tensorflow as tf
 import numpy as np
-import baselines.common.tf_util as U
 from baselines.a2c.utils import fc
-from tensorflow.python.ops import math_ops

 class Pd(object):
    """
@@ -23,8 +21,15 @@ class Pd(object):
        raise NotImplementedError
    def logp(self, x):
        return - self.neglogp(x)
+    def get_shape(self):
+        return self.flatparam().shape
+    @property
+    def shape(self):
+        return self.get_shape()
+    def __getitem__(self, idx):
+        return self.__class__(self.flatparam()[idx])

-class PdType(object):
+class PdType(tf.Module):
    """
    Parametrized family of probability distributions
    """
@@ -41,18 +46,18 @@ class PdType(object):
    def sample_dtype(self):
        raise NotImplementedError

-    def param_placeholder(self, prepend_shape, name=None):
-        return tf.placeholder(dtype=tf.float32, shape=prepend_shape+self.param_shape(), name=name)
-    def sample_placeholder(self, prepend_shape, name=None):
-        return tf.placeholder(dtype=self.sample_dtype(), shape=prepend_shape+self.sample_shape(), name=name)
+    def __eq__(self, other):
+        return (type(self) == type(other)) and (self.__dict__ == other.__dict__)

 class CategoricalPdType(PdType):
-    def __init__(self, ncat):
+    def __init__(self, latent_shape, ncat, init_scale=1.0, init_bias=0.0):
        self.ncat = ncat
+        self.matching_fc = _matching_fc(latent_shape, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
+
    def pdclass(self):
        return CategoricalPd
-    def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
-        pdparam = fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
+    def pdfromlatent(self, latent_vector):
+        pdparam = self.matching_fc(latent_vector)
        return self.pdfromflat(pdparam), pdparam

    def param_shape(self):
@@ -62,31 +67,18 @@ class CategoricalPdType(PdType):
    def sample_dtype(self):
        return tf.int32

-
-class MultiCategoricalPdType(PdType):
-    def __init__(self, nvec):
-        self.ncats = nvec
-    def pdclass(self):
-        return MultiCategoricalPd
-    def pdfromflat(self, flat):
-        return MultiCategoricalPd(self.ncats, flat)
-    def param_shape(self):
-        return [sum(self.ncats)]
-    def sample_shape(self):
-        return [len(self.ncats)]
-    def sample_dtype(self):
-        return tf.int32
-
 class DiagGaussianPdType(PdType):
-    def __init__(self, size):
+    def __init__(self, latent_shape, size, init_scale=1.0, init_bias=0.0):
        self.size = size
+        self.matching_fc = _matching_fc(latent_shape, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
+        self.logstd = tf.Variable(np.zeros((1, self.size)), name='pi/logstd', dtype=tf.float32)
+
    def pdclass(self):
        return DiagGaussianPd

-    def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
-        mean = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
-        logstd = tf.get_variable(name='logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
-        pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
+    def pdfromlatent(self, latent_vector):
+        mean = self.matching_fc(latent_vector)
+        pdparam = tf.concat([mean, mean * 0.0 + self.logstd], axis=1)
        return self.pdfromflat(pdparam), mean

    def param_shape(self):
@@ -96,40 +88,6 @@ class DiagGaussianPdType(PdType):
    def sample_dtype(self):
        return tf.float32

-class BernoulliPdType(PdType):
-    def __init__(self, size):
-        self.size = size
-    def pdclass(self):
-        return BernoulliPd
-    def param_shape(self):
-        return [self.size]
-    def sample_shape(self):
-        return [self.size]
-    def sample_dtype(self):
-        return tf.int32
-
-# WRONG SECOND DERIVATIVES
-# class CategoricalPd(Pd):
-#     def __init__(self, logits):
-#         self.logits = logits
-#         self.ps = tf.nn.softmax(logits)
-#     @classmethod
-#     def fromflat(cls, flat):
-#         return cls(flat)
-#     def flatparam(self):
-#         return self.logits
-#     def mode(self):
-#         return U.argmax(self.logits, axis=-1)
-#     def logp(self, x):
-#         return -tf.nn.sparse_softmax_cross_entropy_with_logits(self.logits, x)
-#     def kl(self, other):
-#         return tf.nn.softmax_cross_entropy_with_logits(other.logits, self.ps) \
-#                 - tf.nn.softmax_cross_entropy_with_logits(self.logits, self.ps)
-#     def entropy(self):
-#         return tf.nn.softmax_cross_entropy_with_logits(self.logits, self.ps)
-#     def sample(self):
-#         u = tf.random_uniform(tf.shape(self.logits))
-#         return U.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)

 class CategoricalPd(Pd):
    def __init__(self, logits):
@@ -138,56 +96,53 @@ class CategoricalPd(Pd):
        return self.logits
    def mode(self):
        return tf.argmax(self.logits, axis=-1)
+
+    @property
+    def mean(self):
+        return tf.nn.softmax(self.logits)
+
    def neglogp(self, x):
        # return tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
        # Note: we can't use sparse_softmax_cross_entropy_with_logits because
        #       the implementation does not allow second-order derivatives...
-        one_hot_actions = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
-        return tf.nn.softmax_cross_entropy_with_logits(
-            logits=self.logits,
-            labels=one_hot_actions)
+        if x.dtype in {tf.uint8, tf.int32, tf.int64}:
+            # one-hot encoding
+            x_shape_list = x.shape.as_list()
+            logits_shape_list = self.logits.get_shape().as_list()[:-1]
+            for xs, ls in zip(x_shape_list, logits_shape_list):
+                if xs is not None and ls is not None:
+                    assert xs == ls, 'shape mismatch: {} in x vs {} in logits'.format(xs, ls)
+
+            x = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
+        else:
+            # already encoded
+            print('logits is {}'.format(self.logits))
+            assert x.shape.as_list() == self.logits.shape.as_list()
+
+        return tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
+
    def kl(self, other):
-        a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keep_dims=True)
-        a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keep_dims=True)
+        a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
+        a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keepdims=True)
        ea0 = tf.exp(a0)
        ea1 = tf.exp(a1)
-        z0 = tf.reduce_sum(ea0, axis=-1, keep_dims=True)
-        z1 = tf.reduce_sum(ea1, axis=-1, keep_dims=True)
+        z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
+        z1 = tf.reduce_sum(ea1, axis=-1, keepdims=True)
        p0 = ea0 / z0
-        return tf.reduce_sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
+        return tf.reduce_sum(p0 * (a0 - tf.math.log(z0) - a1 + tf.math.log(z1)), axis=-1)
    def entropy(self):
-        a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keep_dims=True)
+        a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
        ea0 = tf.exp(a0)
-        z0 = tf.reduce_sum(ea0, axis=-1, keep_dims=True)
+        z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
        p0 = ea0 / z0
-        return tf.reduce_sum(p0 * (tf.log(z0) - a0), axis=-1)
+        return tf.reduce_sum(p0 * (tf.math.log(z0) - a0), axis=-1)
    def sample(self):
-        u = tf.random_uniform(tf.shape(self.logits))
-        return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
+        u = tf.random.uniform(tf.shape(self.logits), dtype=self.logits.dtype, seed=0)
+        return tf.argmax(self.logits - tf.math.log(-tf.math.log(u)), axis=-1)
    @classmethod
    def fromflat(cls, flat):
        return cls(flat)

-class MultiCategoricalPd(Pd):
-    def __init__(self, nvec, flat):
-        self.flat = flat
-        self.categoricals = list(map(CategoricalPd, tf.split(flat, nvec, axis=-1)))
-    def flatparam(self):
-        return self.flat
-    def mode(self):
-        return tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
-    def neglogp(self, x):
-        return tf.add_n([p.neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x, axis=-1))])
-    def kl(self, other):
-        return tf.add_n([p.kl(q) for p, q in zip(self.categoricals, other.categoricals)])
-    def entropy(self):
-        return tf.add_n([p.entropy() for p in self.categoricals])
-    def sample(self):
-        return tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32)
-    @classmethod
-    def fromflat(cls, flat):
-        raise NotImplementedError
-
 class DiagGaussianPd(Pd):
    def __init__(self, flat):
        self.flat = flat
@@ -201,7 +156,7 @@ class DiagGaussianPd(Pd):
        return self.mean
    def neglogp(self, x):
        return 0.5 * tf.reduce_sum(tf.square((x - self.mean) / self.std), axis=-1) \
-               + 0.5 * np.log(2.0 * np.pi) * tf.to_float(tf.shape(x)[-1]) \
+               + 0.5 * np.log(2.0 * np.pi) * tf.cast(tf.shape(x)[-1], dtype=tf.float32) \
               + tf.reduce_sum(self.logstd, axis=-1)
    def kl(self, other):
        assert isinstance(other, DiagGaussianPd)
@@ -209,101 +164,23 @@ class DiagGaussianPd(Pd):
    def entropy(self):
        return tf.reduce_sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
    def sample(self):
-        return self.mean + self.std * tf.random_normal(tf.shape(self.mean))
+        return self.mean + self.std * tf.random.normal(tf.shape(self.mean))
    @classmethod
    def fromflat(cls, flat):
        return cls(flat)

-class BernoulliPd(Pd):
-    def __init__(self, logits):
-        self.logits = logits
-        self.ps = tf.sigmoid(logits)
-    def flatparam(self):
-        return self.logits
-    def mode(self):
-        return tf.round(self.ps)
-    def neglogp(self, x):
-        return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=-1)
-    def kl(self, other):
-        return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=-1) - tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
-    def entropy(self):
-        return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
-    def sample(self):
-        u = tf.random_uniform(tf.shape(self.ps))
-        return tf.to_float(math_ops.less(u, self.ps))
-    @classmethod
-    def fromflat(cls, flat):
-        return cls(flat)
-
-def make_pdtype(ac_space):
+def make_pdtype(latent_shape, ac_space, init_scale=1.0):
    from gym import spaces
    if isinstance(ac_space, spaces.Box):
        assert len(ac_space.shape) == 1
-        return DiagGaussianPdType(ac_space.shape[0])
+        return DiagGaussianPdType(latent_shape, ac_space.shape[0], init_scale)
    elif isinstance(ac_space, spaces.Discrete):
-        return CategoricalPdType(ac_space.n)
-    elif isinstance(ac_space, spaces.MultiDiscrete):
-        return MultiCategoricalPdType(ac_space.nvec)
-    elif isinstance(ac_space, spaces.MultiBinary):
-        return BernoulliPdType(ac_space.n)
+        return CategoricalPdType(latent_shape, ac_space.n, init_scale)
    else:
-        raise NotImplementedError
+        raise ValueError('No implementation for {}'.format(ac_space))

-def shape_el(v, i):
-    maybe = v.get_shape()[i]
-    if maybe is not None:
-        return maybe
+def _matching_fc(tensor_shape, name, size, init_scale, init_bias):
+    if tensor_shape[-1] == size:
+        return lambda x: x
    else:
-        return tf.shape(v)[i]
-
-@U.in_session
-def test_probtypes():
-    np.random.seed(0)
-
-    pdparam_diag_gauss = np.array([-.2, .3, .4, -.5, .1, -.5, .1, 0.8])
-    diag_gauss = DiagGaussianPdType(pdparam_diag_gauss.size // 2) #pylint: disable=E1101
-    validate_probtype(diag_gauss, pdparam_diag_gauss)
-
-    pdparam_categorical = np.array([-.2, .3, .5])
-    categorical = CategoricalPdType(pdparam_categorical.size) #pylint: disable=E1101
-    validate_probtype(categorical, pdparam_categorical)
-
-    nvec = [1,2,3]
-    pdparam_multicategorical = np.array([-.2, .3, .5, .1, 1, -.1])
-    multicategorical = MultiCategoricalPdType(nvec) #pylint: disable=E1101
-    validate_probtype(multicategorical, pdparam_multicategorical)
-
-    pdparam_bernoulli = np.array([-.2, .3, .5])
-    bernoulli = BernoulliPdType(pdparam_bernoulli.size) #pylint: disable=E1101
-    validate_probtype(bernoulli, pdparam_bernoulli)
-
-
-def validate_probtype(probtype, pdparam):
-    N = 100000
-    # Check to see if mean negative log likelihood == differential entropy
-    Mval = np.repeat(pdparam[None, :], N, axis=0)
-    M = probtype.param_placeholder([N])
-    X = probtype.sample_placeholder([N])
-    pd = probtype.pdfromflat(M)
-    calcloglik = U.function([X, M], pd.logp(X))
-    calcent = U.function([M], pd.entropy())
-    Xval = tf.get_default_session().run(pd.sample(), feed_dict={M:Mval})
-    logliks = calcloglik(Xval, Mval)
-    entval_ll = - logliks.mean() #pylint: disable=E1101
-    entval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
-    entval = calcent(Mval).mean() #pylint: disable=E1101
-    assert np.abs(entval - entval_ll) < 3 * entval_ll_stderr # within 3 sigmas
-
-    # Check to see if kldiv[p,q] = - ent[p] - E_p[log q]
-    M2 = probtype.param_placeholder([N])
-    pd2 = probtype.pdfromflat(M2)
-    q = pdparam + np.random.randn(pdparam.size) * 0.1
-    Mval2 = np.repeat(q[None, :], N, axis=0)
-    calckl = U.function([M, M2], pd.kl(pd2))
-    klval = calckl(Mval, Mval2).mean() #pylint: disable=E1101
-    logliks = calcloglik(Xval, Mval2)
-    klval_ll = - entval - logliks.mean() #pylint: disable=E1101
-    klval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
-    assert np.abs(klval - klval_ll) < 3 * klval_ll_stderr # within 3 sigmas
-    print('ok on', probtype, pdparam)
-
+        return fc(tensor_shape, name, size, init_scale=init_scale, init_bias=init_bias)
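Note on CategoricalPd.kl and CategoricalPd.entropy above: both subtract the maximum logit before exponentiating, which keeps exp() from overflowing. A minimal pure-Python sketch of the same formulas for a single logit vector follows; it uses plain lists instead of tensors, and the function names are chosen for this sketch only.

```python
import math


def _normalize(logits):
    """Shift by the max logit; return (shifted logits, log of the partition sum)."""
    m = max(logits)
    a = [x - m for x in logits]
    log_z = math.log(sum(math.exp(x) for x in a))
    return a, log_z


def entropy(logits):
    """sum_i p_i * (log z - a_i), with p_i = exp(a_i) / z."""
    a, log_z = _normalize(logits)
    return sum(math.exp(x - log_z) * (log_z - x) for x in a)


def kl(logits_p, logits_q):
    """sum_i p_i * (a0_i - log z0 - a1_i + log z1)."""
    a0, log_z0 = _normalize(logits_p)
    a1, log_z1 = _normalize(logits_q)
    return sum(math.exp(x0 - log_z0) * (x0 - log_z0 - x1 + log_z1)
               for x0, x1 in zip(a0, a1))


# Large logits would overflow a naive softmax; the shifted form does not.
print(entropy([1000.0, 1000.0]), kl([1000.0, 0.0], [0.0, 1000.0]))
```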
--- a/baselines/common/filters.py
+++ b/baselines/common/filters.py
@@ -1,98 +0,0 @@
-from .running_stat import RunningStat
-from collections import deque
-import numpy as np
-
-class Filter(object):
-    def __call__(self, x, update=True):
-        raise NotImplementedError
-    def reset(self):
-        pass
-
-class IdentityFilter(Filter):
-    def __call__(self, x, update=True):
-        return x
-
-class CompositionFilter(Filter):
-    def __init__(self, fs):
-        self.fs = fs
-    def __call__(self, x, update=True):
-        for f in self.fs:
-            x = f(x)
-        return x
-    def output_shape(self, input_space):
-        out = input_space.shape
-        for f in self.fs:
-            out = f.output_shape(out)
-        return out
-
-class ZFilter(Filter):
-    """
-    y = (x-mean)/std
-    using running estimates of mean,std
-    """
-
-    def __init__(self, shape, demean=True, destd=True, clip=10.0):
-        self.demean = demean
-        self.destd = destd
-        self.clip = clip
-
-        self.rs = RunningStat(shape)
-
-    def __call__(self, x, update=True):
-        if update: self.rs.push(x)
-        if self.demean:
-            x = x - self.rs.mean
-        if self.destd:
-            x = x / (self.rs.std+1e-8)
-        if self.clip:
-            x = np.clip(x, -self.clip, self.clip)
-        return x
-    def output_shape(self, input_space):
-        return input_space.shape
-
-class AddClock(Filter):
-    def __init__(self):
-        self.count = 0
-    def reset(self):
-        self.count = 0
-    def __call__(self, x, update=True):
-        return np.append(x, self.count/100.0)
-    def output_shape(self, input_space):
-        return (input_space.shape[0]+1,)
-
-class FlattenFilter(Filter):
-    def __call__(self, x, update=True):
-        return x.ravel()
-    def output_shape(self, input_space):
-        return (int(np.prod(input_space.shape)),)
-
-class Ind2OneHotFilter(Filter):
-    def __init__(self, n):
-        self.n = n
-    def __call__(self, x, update=True):
-        out = np.zeros(self.n)
-        out[x] = 1
-        return out
-    def output_shape(self, input_space):
-        return (input_space.n,)
-
-class DivFilter(Filter):
-    def __init__(self, divisor):
-        self.divisor = divisor
-    def __call__(self, x, update=True):
-        return x / self.divisor
-    def output_shape(self, input_space):
-        return input_space.shape
-
-class StackFilter(Filter):
-    def __init__(self, length):
-        self.stack = deque(maxlen=length)
-    def reset(self):
-        self.stack.clear()
-    def __call__(self, x, update=True):
-        self.stack.append(x)
-        while len(self.stack) < self.stack.maxlen:
-            self.stack.append(x)
-        return np.concatenate(self.stack, axis=-1)
-    def output_shape(self, input_space):
-        return input_space.shape[:-1] + (input_space.shape[-1] * self.stack.maxlen,)
--- a/baselines/common/identity_env.py
+++ b/baselines/common/identity_env.py
@@ -1,30 +0,0 @@
-from gym import Env
-from gym.spaces import Discrete
-
-
-class IdentityEnv(Env):
-    def __init__(
-            self,
-            dim,
-            ep_length=100,
-    ):
-
-        self.action_space = Discrete(dim)
-        self.reset()
-
-    def reset(self):
-        self._choose_next_state()
-        self.observation_space = self.action_space
-
-        return self.state
-
-    def step(self, actions):
-        rew = self._get_reward(actions)
-        self._choose_next_state()
-        return self.state, rew, False, {}
-
-    def _choose_next_state(self):
-        self.state = self.action_space.sample()
-
-    def _get_reward(self, actions):
-        return 1 if self.state == actions else 0
--- a/baselines/common/input.py
+++ b/baselines/common/input.py
@@ -1,30 +0,0 @@
-import tensorflow as tf
-from gym.spaces import Discrete, Box
-
-def observation_input(ob_space, batch_size=None, name='Ob'):
-    '''
-    Build observation input with encoding depending on the 
-    observation space type
-    Params:
-    
-    ob_space: observation space (should be one of gym.spaces)
-    batch_size: batch size for input (default is None, so that resulting input placeholder can take tensors with any batch size)
-    name: tensorflow variable name for input placeholder
-
-    returns: tuple (input_placeholder, processed_input_tensor)
-    '''
-    if isinstance(ob_space, Discrete):
-        input_x  = tf.placeholder(shape=(batch_size,), dtype=tf.int32, name=name)
-        processed_x = tf.to_float(tf.one_hot(input_x, ob_space.n))
-        return input_x, processed_x
-
-    elif isinstance(ob_space, Box):
-        input_shape = (batch_size,) + ob_space.shape
-        input_x = tf.placeholder(shape=input_shape, dtype=ob_space.dtype, name=name)
-        processed_x = tf.to_float(input_x)
-        return input_x, processed_x
-
-    else:
-        raise NotImplementedError
-
- 
--- a/baselines/common/math_util.py
+++ b/baselines/common/math_util.py
@@ -82,4 +82,4 @@ def test_discount_with_boundaries():
        2 + gamma * 3,
        3,
        4
-    ])
+    ])
--- a/baselines/common/misc_util.py
+++ b/baselines/common/misc_util.py
@@ -13,27 +13,6 @@ def zipsame(*seqs):
    return zip(*seqs)


-def unpack(seq, sizes):
-    """
-    Unpack 'seq' into a sequence of lists, with lengths specified by 'sizes'.
-    None = just one bare element, not a list
-
-    Example:
-    unpack([1,2,3,4,5,6], [3,None,2]) -> ([1,2,3], 4, [5,6])
-    """
-    seq = list(seq)
-    it = iter(seq)
-    assert sum(1 if s is None else s for s in sizes) == len(seq), "Trying to unpack %s into %s" % (seq, sizes)
-    for size in sizes:
-        if size is None:
-            yield it.__next__()
-        else:
-            li = []
-            for _ in range(size):
-                li.append(it.__next__())
-            yield li
-
-
 class EzPickle(object):
    """Objects that are pickled and unpickled via their constructor
    arguments.
@@ -67,14 +46,20 @@ class EzPickle(object):


 def set_global_seeds(i):
+    try:
+        from mpi4py import MPI
+        rank = MPI.COMM_WORLD.Get_rank()
+    except ImportError:
+        rank = 0
+
+    myseed = i + 1000 * rank if i is not None else None
    try:
        import tensorflow as tf
+        tf.random.set_seed(myseed)
    except ImportError:
        pass
-    else:
-        tf.set_random_seed(i)
-    np.random.seed(i)
-    random.seed(i)
+    np.random.seed(myseed)
+    random.seed(myseed)


 def pretty_eta(seconds_left):
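Note on set_global_seeds above: the seed is offset per MPI rank, and a seed of None is passed through so that libraries fall back to their own nondeterministic seeding. A minimal sketch of that rule follows; `worker_seed` and its `stride` parameter are hypothetical names (the diff uses a stride of 1000 here and 10000 in make_vec_env).

```python
def worker_seed(seed, rank, stride=1000):
    """Offset a base seed by the worker rank; None stays None."""
    return seed + stride * rank if seed is not None else None


seeds = [worker_seed(0, rank) for rank in range(3)]
print(seeds, worker_seed(None, 2))
```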
--- a/baselines/common/models.py
+++ b/baselines/common/models.py
@@ -0,0 +1,117 @@
+import numpy as np
+import tensorflow as tf
+from baselines.a2c.utils import ortho_init, conv
+
+mapping = {}
+
+def register(name):
+    def _thunk(func):
+        mapping[name] = func
+        return func
+    return _thunk
+
+
+def nature_cnn(input_shape, **conv_kwargs):
+    """
+    CNN from Nature paper.
+    """
+    print('input shape is {}'.format(input_shape))
+    x_input = tf.keras.Input(shape=input_shape, dtype=tf.uint8)
+    h = x_input
+    h = tf.cast(h, tf.float32) / 255.
+    h = conv('c1', nf=32, rf=8, stride=4, activation='relu', init_scale=np.sqrt(2))(h)
+    h2 = conv('c2', nf=64, rf=4, stride=2, activation='relu', init_scale=np.sqrt(2))(h)
+    h3 = conv('c3', nf=64, rf=3, stride=1, activation='relu', init_scale=np.sqrt(2))(h2)
+    h3 = tf.keras.layers.Flatten()(h3)
+    h3 = tf.keras.layers.Dense(units=512, kernel_initializer=ortho_init(np.sqrt(2)),
+                               name='fc1', activation='relu')(h3)
+    network = tf.keras.Model(inputs=[x_input], outputs=[h3])
+    return network
+
+@register("mlp")
+def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
+    """
+    Stack of fully-connected layers to be used in a policy / q-function approximator
+
+    Parameters:
+    ----------
+
+    num_layers: int                 number of fully-connected layers (default: 2)
+
+    num_hidden: int                 size of fully-connected layers (default: 64)
+
+    activation:                     activation function (default: tf.tanh)
+
+    Returns:
+    -------
+
+    function that takes an input shape and returns a tf.keras.Model of stacked fully connected layers
+    """
+    def network_fn(input_shape):
+        print('input shape is {}'.format(input_shape))
+        x_input = tf.keras.Input(shape=input_shape)
+        # h = tf.keras.layers.Flatten(x_input)
+        h = x_input
+        for i in range(num_layers):
+            h = tf.keras.layers.Dense(units=num_hidden, kernel_initializer=ortho_init(np.sqrt(2)),
+                                      name='mlp_fc{}'.format(i), activation=activation)(h)
+
+        network = tf.keras.Model(inputs=[x_input], outputs=[h])
+        return network
+
+    return network_fn
+
+
+@register("cnn")
+def cnn(**conv_kwargs):
+    def network_fn(input_shape):
+        return nature_cnn(input_shape, **conv_kwargs)
+    return network_fn
+
+
+@register("conv_only")
+def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
+    '''
+    convolutions-only net
+    Parameters:
+    ----------
+    convs:      list of triples (filter_number, filter_size, stride) specifying parameters for each layer.
+    Returns:
+    function that takes an input shape and returns a tf.keras.Model whose output is that of the last convolutional layer
+    '''
+
+    def network_fn(input_shape):
+        print('input shape is {}'.format(input_shape))
+        x_input = tf.keras.Input(shape=input_shape, dtype=tf.uint8)
+        h = x_input
+        h = tf.cast(h, tf.float32) / 255.
+        with tf.name_scope("convnet"):
+            for num_outputs, kernel_size, stride in convs:
+                h = tf.keras.layers.Conv2D(
+                    filters=num_outputs, kernel_size=kernel_size, strides=stride,
+                    activation='relu', **conv_kwargs)(h)
+
+        network = tf.keras.Model(inputs=[x_input], outputs=[h])
+        return network
+    return network_fn
+
+
+def get_network_builder(name):
+    """
+    If you want to register your own network outside models.py, you just need:
+
+    Usage Example:
+    -------------
+    from baselines.common.models import register
+    @register("your_network_name")
+    def your_network_define(**net_kwargs):
+        ...
+        return network_fn
+
+    """
+    if callable(name):
+        return name
+    elif name in mapping:
+        return mapping[name]
+    else:
+        raise ValueError('Unknown network type: {}'.format(name))
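The registry pattern used by `register` and `get_network_builder` in `models.py` can be exercised without TensorFlow. The sketch below copies those two functions from the diff; `scaled_identity` is a made-up builder that returns a plain dict where a real builder would return a `tf.keras.Model`.

```python
mapping = {}


def register(name):
    """Decorator factory: store the decorated builder under `name`."""
    def _thunk(func):
        mapping[name] = func
        return func
    return _thunk


def get_network_builder(name):
    """Return a callable unchanged, otherwise look the name up in the registry."""
    if callable(name):
        return name
    elif name in mapping:
        return mapping[name]
    else:
        raise ValueError('Unknown network type: {}'.format(name))


# `scaled_identity` is a hypothetical builder used only for this illustration.
@register("scaled_identity")
def scaled_identity(scale=1.0):
    def network_fn(input_shape):
        # A real builder would construct and return a tf.keras.Model here.
        return {"input_shape": tuple(input_shape), "scale": scale}
    return network_fn


builder = get_network_builder("scaled_identity")
network = builder(scale=2.0)((84, 84, 4))
```

The two-level call (`builder(**kwargs)` and then `network_fn(input_shape)`) matches how `mlp`, `cnn`, and `conv_only` are structured: hyperparameters are bound first, and the input shape is supplied later.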
--- a/baselines/common/mpi_adam.py
+++ b/baselines/common/mpi_adam.py
@@ -1,7 +1,10 @@
-from mpi4py import MPI
 import baselines.common.tf_util as U
-import tensorflow as tf
 import numpy as np
+try:
+    from mpi4py import MPI
+except ImportError:
+    MPI = None
+

 class MpiAdam(object):
    def __init__(self, var_list, *, beta1=0.9, beta2=0.999, epsilon=1e-08, scale_grad_by_procs=True, comm=None):
@@ -16,16 +19,19 @@ class MpiAdam(object):
        self.t = 0
        self.setfromflat = U.SetFromFlat(var_list)
        self.getflat = U.GetFlat(var_list)
-        self.comm = MPI.COMM_WORLD if comm is None else comm
+        self.comm = MPI.COMM_WORLD if comm is None and MPI is not None else comm

    def update(self, localg, stepsize):
        if self.t % 100 == 0:
            self.check_synced()
        localg = localg.astype('float32')
-        globalg = np.zeros_like(localg)
-        self.comm.Allreduce(localg, globalg, op=MPI.SUM)
-        if self.scale_grad_by_procs:
-            globalg /= self.comm.Get_size()
+        if self.comm is not None:
+            globalg = np.zeros_like(localg)
+            self.comm.Allreduce(localg, globalg, op=MPI.SUM)
+            if self.scale_grad_by_procs:
+                globalg /= self.comm.Get_size()
+        else:
+            globalg = np.copy(localg)

        self.t += 1
        a = stepsize * np.sqrt(1 - self.beta2**self.t)/(1 - self.beta1**self.t)
@@ -35,11 +41,15 @@ class MpiAdam(object):
        self.setfromflat(self.getflat() + step)

    def sync(self):
+        if self.comm is None:
+            return
        theta = self.getflat()
        self.comm.Bcast(theta, root=0)
        self.setfromflat(theta)

    def check_synced(self):
+        if self.comm is None:
+            return
        if self.comm.Get_rank() == 0: # this is root
            theta = self.getflat()
            self.comm.Bcast(theta, root=0)
@@ -48,32 +58,3 @@ class MpiAdam(object):
            thetaroot = np.empty_like(thetalocal)
            self.comm.Bcast(thetaroot, root=0)
            assert (thetaroot == thetalocal).all(), (thetaroot, thetalocal)
-
-@U.in_session
-def test_MpiAdam():
-    np.random.seed(0)
-    tf.set_random_seed(0)
-
-    a = tf.Variable(np.random.randn(3).astype('float32'))
-    b = tf.Variable(np.random.randn(2,5).astype('float32'))
-    loss = tf.reduce_sum(tf.square(a)) + tf.reduce_sum(tf.sin(b))
-
-    stepsize = 1e-2
-    update_op = tf.train.AdamOptimizer(stepsize).minimize(loss)
-    do_update = U.function([], loss, updates=[update_op])
-
-    tf.get_default_session().run(tf.global_variables_initializer())
-    for i in range(10):
-        print(i,do_update())
-
-    tf.set_random_seed(0)
-    tf.get_default_session().run(tf.global_variables_initializer())
-
-    var_list = [a,b]
-    lossandgrad = U.function([], [loss, U.flatgrad(loss, var_list)], updates=[update_op])
-    adam = MpiAdam(var_list)
-
-    for i in range(10):
-        l,g = lossandgrad()
-        adam.update(g, stepsize)
-        print(i,l)
--- a/baselines/common/mpi_adam_optimizer.py
+++ b/baselines/common/mpi_adam_optimizer.py
@@ -0,0 +1,59 @@
+import numpy as np
+import tensorflow as tf
+try:
+    from mpi4py import MPI
+except ImportError:
+    MPI = None
+
+class MpiAdamOptimizer(tf.Module):
+    """Adam optimizer that averages gradients across mpi processes."""
+    def __init__(self, comm, var_list):
+        self.var_list = var_list
+        self.comm = comm
+        self.beta1 = 0.9
+        self.beta2 = 0.999
+        self.epsilon = 1e-08
+        self.t = tf.Variable(0, name='step', dtype=tf.int32)
+        var_shapes = [v.shape.as_list() for v in var_list]
+        self.var_sizes = [int(np.prod(s)) for s in var_shapes]
+        self.flat_var_size = sum(self.var_sizes)
+        self.m = tf.Variable(np.zeros(self.flat_var_size, 'float32'))
+        self.v = tf.Variable(np.zeros(self.flat_var_size, 'float32'))
+
+    def apply_gradients(self, flat_grad, lr):
+        buf = np.zeros(self.flat_var_size, np.float32)
+        self.comm.Allreduce(flat_grad.numpy(), buf, op=MPI.SUM)
+        avg_flat_grad = np.divide(buf, float(self.comm.Get_size()))
+        self._apply_gradients(tf.constant(avg_flat_grad), lr)
+        if self.t.numpy() % 100 == 0:
+            check_synced(tf.reduce_sum(self.var_list[0]).numpy(), self.comm)
+
+    @tf.function
+    def _apply_gradients(self, avg_flat_grad, lr):
+        self.t.assign_add(1)
+        t = tf.cast(self.t, tf.float32)
+        a = lr * tf.math.sqrt(1 - tf.math.pow(self.beta2, t)) / (1 - tf.math.pow(self.beta1, t))
+        self.m.assign(self.beta1 * self.m + (1 - self.beta1) * avg_flat_grad)
+        self.v.assign(self.beta2 * self.v + (1 - self.beta2) * tf.math.square(avg_flat_grad))
+        flat_step = (- a) * self.m / (tf.math.sqrt(self.v) + self.epsilon)
+        var_steps = tf.split(flat_step, self.var_sizes, axis=0)
+        for var_step, var in zip(var_steps, self.var_list):
+            var.assign_add(tf.reshape(var_step, var.shape))
+
+
+def check_synced(localval, comm=None):
+    """
+    It's common to forget to initialize your variables to the same values, or
+    (less commonly) to update them in some way other than Adam, so that they drift out of sync.
+    This function checks that a value is identical on all MPI workers and raises
+    an AssertionError on the root worker otherwise.
+
+    Arguments:
+        localval: local value to compare across workers (e.g. a scalar summary of the local variables)
+        comm: MPI communicator (defaults to MPI.COMM_WORLD)
+    """
+    comm = comm or MPI.COMM_WORLD
+    vals = comm.gather(localval)
+    if comm.rank == 0:
+        assert all(val==vals[0] for val in vals[1:]),\
+            'MpiAdamOptimizer detected that different workers have different weights: {}'.format(vals)
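The arithmetic in `_apply_gradients` can be checked on plain Python floats. This is a single-process sketch under stated assumptions: `adam_step` and `average_gradients` are hypothetical names, and the list-based averaging only stands in for the `Allreduce` with `MPI.SUM` followed by division by the worker count.

```python
import math


def adam_step(grad, m, v, t, lr, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One bias-corrected Adam step on a flat list of floats.

    Returns (step, new_m, new_v, new_t), following the formulas in the diff.
    """
    t = t + 1
    a = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    new_m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    new_v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    step = [-a * mi / (math.sqrt(vi) + epsilon) for mi, vi in zip(new_m, new_v)]
    return step, new_m, new_v, t


def average_gradients(per_worker_grads):
    """Stand-in for Allreduce(SUM) followed by division by the number of workers."""
    n = len(per_worker_grads)
    return [sum(gs) / n for gs in zip(*per_worker_grads)]


# Two simulated workers, each holding a local flat gradient.
worker_grads = [[1.0, -4.0], [3.0, 0.0]]
avg = average_gradients(worker_grads)
step, m, v, t = adam_step(avg, m=[0.0, 0.0], v=[0.0, 0.0], t=0, lr=0.01)
```

On the first step the update magnitude is close to `lr` for every coordinate regardless of the gradient scale, because the bias correction in `a` cancels the `(1 - beta1)` and `(1 - beta2)` factors.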
--- a/baselines/common/mpi_fork.py
+++ b/baselines/common/mpi_fork.py
@@ -4,7 +4,7 @@ def mpi_fork(n, bind_to_core=False):
    """Re-launches the current script with workers
    Returns "parent" for original parent, "child" for MPI children
    """
-    if n<=1: 
+    if n<=1:
        return "child"
    if os.getenv("IN_MPI") is None:
        env = os.environ.copy()
--- a/baselines/common/mpi_moments.py
+++ b/baselines/common/mpi_moments.py
@@ -33,8 +33,8 @@ def mpi_moments(x, axis=0, comm=None, keepdims=False):

 def test_runningmeanstd():
    import subprocess
-    subprocess.check_call(['mpirun', '-np', '3', 
-        'python','-c', 
+    subprocess.check_call(['mpirun', '-np', '3',
+        'python','-c',
        'from baselines.common.mpi_moments import _helper_runningmeanstd; _helper_runningmeanstd()'])

 def _helper_runningmeanstd():
--- a/baselines/common/mpi_running_mean_std.py
+++ b/baselines/common/mpi_running_mean_std.py
@@ -1,107 +1,56 @@
-from mpi4py import MPI
-import tensorflow as tf, baselines.common.tf_util as U, numpy as np
+try:
+    from mpi4py import MPI
+except ImportError:
+    MPI = None

-class RunningMeanStd(object):
+import tensorflow as tf, numpy as np
+
+class RunningMeanStd(tf.Module):
    # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
-    def __init__(self, epsilon=1e-2, shape=()):
+    def __init__(self, epsilon=1e-2, shape=(), default_clip_range=np.inf):

-        self._sum = tf.get_variable(
+        self._sum = tf.Variable(
+            initial_value=np.zeros(shape=shape, dtype=np.float64),
            dtype=tf.float64,
-            shape=shape,
-            initializer=tf.constant_initializer(0.0),
            name="runningsum", trainable=False)
-        self._sumsq = tf.get_variable(
+        self._sumsq = tf.Variable(
+            initial_value=np.full(shape=shape, fill_value=epsilon, dtype=np.float64),
            dtype=tf.float64,
-            shape=shape,
-            initializer=tf.constant_initializer(epsilon),
            name="runningsumsq", trainable=False)
-        self._count = tf.get_variable(
+        self._count = tf.Variable(
+            initial_value=epsilon,
            dtype=tf.float64,
-            shape=(),
-            initializer=tf.constant_initializer(epsilon),
            name="count", trainable=False)
        self.shape = shape
-
-        self.mean = tf.to_float(self._sum / self._count)
-        self.std = tf.sqrt( tf.maximum( tf.to_float(self._sumsq / self._count) - tf.square(self.mean) , 1e-2 ))
-
-        newsum = tf.placeholder(shape=self.shape, dtype=tf.float64, name='sum')
-        newsumsq = tf.placeholder(shape=self.shape, dtype=tf.float64, name='var')
-        newcount = tf.placeholder(shape=[], dtype=tf.float64, name='count')
-        self.incfiltparams = U.function([newsum, newsumsq, newcount], [],
-            updates=[tf.assign_add(self._sum, newsum),
-                     tf.assign_add(self._sumsq, newsumsq),
-                     tf.assign_add(self._count, newcount)])
-
+        self.epsilon = epsilon
+        self.default_clip_range = default_clip_range

    def update(self, x):
        x = x.astype('float64')
        n = int(np.prod(self.shape))
-        totalvec = np.zeros(n*2+1, 'float64')
        addvec = np.concatenate([x.sum(axis=0).ravel(), np.square(x).sum(axis=0).ravel(), np.array([len(x)],dtype='float64')])
-        MPI.COMM_WORLD.Allreduce(addvec, totalvec, op=MPI.SUM)
-        self.incfiltparams(totalvec[0:n].reshape(self.shape), totalvec[n:2*n].reshape(self.shape), totalvec[2*n])
+        if MPI is not None:
+            totalvec = np.zeros(n*2+1, 'float64')
+            MPI.COMM_WORLD.Allreduce(addvec, totalvec, op=MPI.SUM)
+        else:
+            # Without MPI there is nothing to reduce; use the local statistics directly.
+            totalvec = addvec
+        self._sum.assign_add(totalvec[0:n].reshape(self.shape))
+        self._sumsq.assign_add(totalvec[n:2*n].reshape(self.shape))
+        self._count.assign_add(totalvec[2*n])

-@U.in_session
-def test_runningmeanstd():
-    for (x1, x2, x3) in [
-        (np.random.randn(3), np.random.randn(4), np.random.randn(5)),
-        (np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
-        ]:
+    @property
+    def mean(self):
+        return tf.cast(self._sum / self._count, tf.float32)

-        rms = RunningMeanStd(epsilon=0.0, shape=x1.shape[1:])
-        U.initialize()
+    @property
+    def std(self):
+        return tf.sqrt(tf.maximum(tf.cast(self._sumsq / self._count, tf.float32) - tf.square(self.mean), self.epsilon))

-        x = np.concatenate([x1, x2, x3], axis=0)
-        ms1 = [x.mean(axis=0), x.std(axis=0)]
-        rms.update(x1)
-        rms.update(x2)
-        rms.update(x3)
-        ms2 = [rms.mean.eval(), rms.std.eval()]
+    def normalize(self, v, clip_range=None):
+        if clip_range is None:
+            clip_range = self.default_clip_range
+        return tf.clip_by_value((v - self.mean) / self.std, -clip_range, clip_range)

-        assert np.allclose(ms1, ms2)
-
-@U.in_session
-def test_dist():
-    np.random.seed(0)
-    p1,p2,p3=(np.random.randn(3,1), np.random.randn(4,1), np.random.randn(5,1))
-    q1,q2,q3=(np.random.randn(6,1), np.random.randn(7,1), np.random.randn(8,1))
-
-    # p1,p2,p3=(np.random.randn(3), np.random.randn(4), np.random.randn(5))
-    # q1,q2,q3=(np.random.randn(6), np.random.randn(7), np.random.randn(8))
-
-    comm = MPI.COMM_WORLD
-    assert comm.Get_size()==2
-    if comm.Get_rank()==0:
-        x1,x2,x3 = p1,p2,p3
-    elif comm.Get_rank()==1:
-        x1,x2,x3 = q1,q2,q3
-    else:
-        assert False
-
-    rms = RunningMeanStd(epsilon=0.0, shape=(1,))
-    U.initialize()
-
-    rms.update(x1)
-    rms.update(x2)
-    rms.update(x3)
-
-    bigvec = np.concatenate([p1,p2,p3,q1,q2,q3])
-
-    def checkallclose(x,y):
-        print(x,y)
-        return np.allclose(x,y)
-
-    assert checkallclose(
-        bigvec.mean(axis=0),
-        rms.mean.eval(),
-    )
-    assert checkallclose(
-        bigvec.std(axis=0),
-        rms.std.eval(),
-    )
-
-
-if __name__ == "__main__":
-    # Run with mpirun -np 2 python <filename>
-    test_dist()
+    def denormalize(self, v):
+        return self.mean + v * self.std
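The sum / sum-of-squares / count bookkeeping behind `RunningMeanStd` can be sketched for scalars without TensorFlow or MPI. `ScalarRunningMeanStd` is a hypothetical, single-process analogue written for illustration only; it keeps the same initial values (`epsilon` for the squared sum and the count) and the same `mean`, `std`, `normalize`, and `denormalize` formulas as the class in the diff.

```python
import math


class ScalarRunningMeanStd:
    """Single-process, scalar analogue of RunningMeanStd (illustration only)."""

    def __init__(self, epsilon=1e-2):
        self.epsilon = epsilon
        self._sum = 0.0
        self._sumsq = epsilon
        self._count = epsilon

    def update(self, xs):
        # Accumulate the batch statistics; with MPI these three numbers
        # would first be summed across workers.
        self._sum += sum(xs)
        self._sumsq += sum(x * x for x in xs)
        self._count += len(xs)

    @property
    def mean(self):
        return self._sum / self._count

    @property
    def std(self):
        var = self._sumsq / self._count - self.mean ** 2
        return math.sqrt(max(var, self.epsilon))

    def normalize(self, v, clip_range=math.inf):
        z = (v - self.mean) / self.std
        return max(-clip_range, min(clip_range, z))

    def denormalize(self, v):
        return self.mean + v * self.std


rms = ScalarRunningMeanStd(epsilon=1e-8)
rms.update([1.0, 2.0])
rms.update([3.0, 4.0])
```

Because only sums and counts are stored, feeding the data in several batches gives the same statistics as feeding it all at once, which is also what makes the cross-worker reduction a plain sum.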
--- a/baselines/common/mpi_util.py
+++ b/baselines/common/mpi_util.py
@@ -0,0 +1,131 @@
+from collections import defaultdict
+import os, numpy as np
+import platform
+import shutil
+import subprocess
+import warnings
+import sys
+
+try:
+    from mpi4py import MPI
+except ImportError:
+    MPI = None
+
+
+def sync_from_root(variables, comm=None):
+    """
+    Send the root node's parameters to every worker.
+    Arguments:
+      variables: all parameter variables including optimizer's
+    """
+    if comm is None: comm = MPI.COMM_WORLD
+    values = comm.bcast([var.numpy() for var in variables])
+    for (var, val) in zip(variables, values):
+        var.assign(val)
+
+def gpu_count():
+    """
+    Count the GPUs on this machine.
+    """
+    if shutil.which('nvidia-smi') is None:
+        return 0
+    output = subprocess.check_output(['nvidia-smi', '--query-gpu=gpu_name', '--format=csv'])
+    return max(0, len(output.split(b'\n')) - 2)
+
+def setup_mpi_gpus():
+    """
+    Set CUDA_VISIBLE_DEVICES to MPI rank if not already set
+    """
+    if 'CUDA_VISIBLE_DEVICES' not in os.environ:
+        if sys.platform == 'darwin': # This assumes that on OSX you're just
+            ids = []                 # doing a smoke test and don't want GPUs
+        else:
+            lrank, _lsize = get_local_rank_size(MPI.COMM_WORLD)
+            ids = [lrank]
+        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, ids))
+
+def get_local_rank_size(comm):
+    """
+    Returns the local rank of this process on its machine and the number of processes on that machine.
+    The processes on a given machine will be assigned ranks
+        0, 1, 2, ..., N-1,
+    where N is the number of processes on this machine.
+
+    Useful if you want to assign one GPU to each process on a machine
+    """
+    this_node = platform.node()
+    ranks_nodes = comm.allgather((comm.Get_rank(), this_node))
+    node2rankssofar = defaultdict(int)
+    local_rank = None
+    for (rank, node) in ranks_nodes:
+        if rank == comm.Get_rank():
+            local_rank = node2rankssofar[node]
+        node2rankssofar[node] += 1
+    assert local_rank is not None
+    return local_rank, node2rankssofar[this_node]
+
+def share_file(comm, path):
+    """
+    Copies the file from rank 0 to all other ranks
+    Puts it in the same place on all machines
+    """
+    localrank, _ = get_local_rank_size(comm)
+    if comm.Get_rank() == 0:
+        with open(path, 'rb') as fh:
+            data = fh.read()
+        comm.bcast(data)
+    else:
+        data = comm.bcast(None)
+        if localrank == 0:
+            os.makedirs(os.path.dirname(path), exist_ok=True)
+            with open(path, 'wb') as fh:
+                fh.write(data)
+    comm.Barrier()
+
+def dict_gather(comm, d, op='mean', assert_all_have_data=True):
+    """
+    Perform a reduction operation over dicts
+    """
+    if comm is None: return d
+    alldicts = comm.allgather(d)
+    size = comm.size
+    k2li = defaultdict(list)
+    for d in alldicts:
+        for (k,v) in d.items():
+            k2li[k].append(v)
+    result = {}
+    for (k,li) in k2li.items():
+        if assert_all_have_data:
+            assert len(li)==size, "only %i out of %i MPI workers have sent '%s'" % (len(li), size, k)
+        if op=='mean':
+            result[k] = np.mean(li, axis=0)
+        elif op=='sum':
+            result[k] = np.sum(li, axis=0)
+        else:
+            assert 0, op
+    return result
+
+def mpi_weighted_mean(comm, local_name2valcount):
+    """
+    Perform a weighted average over dicts that are each on a different node
+    Input: local_name2valcount: dict mapping key -> (value, count)
+    Returns: key -> mean
+    """
+    all_name2valcount = comm.gather(local_name2valcount)
+    if comm.rank == 0:
+        name2sum = defaultdict(float)
+        name2count = defaultdict(float)
+        for n2vc in all_name2valcount:
+            for (name, (val, count)) in n2vc.items():
+                try:
+                    val = float(val)
+                except ValueError:
+                    if comm.rank == 0:
+                        warnings.warn('WARNING: tried to compute mean on non-float {}={}'.format(name, val))
+                else:
+                    name2sum[name] += val * count
+                    name2count[name] += count
+        return {name : name2sum[name] / name2count[name] for name in name2sum}
+    else:
+        return {}
+
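The counting loop in `get_local_rank_size` can be run on a hand-written list in place of the result of `comm.allgather`. In this sketch `local_rank_and_size` is a hypothetical helper and the cluster layout is made up; the loop body itself follows the diff.

```python
from collections import defaultdict


def local_rank_and_size(ranks_nodes, my_rank, my_node):
    """Same counting loop as get_local_rank_size, on an already gathered list.

    ranks_nodes: list of (global_rank, hostname) pairs ordered by global rank,
                 in the form an allgather of (rank, node) tuples would produce.
    """
    seen_per_node = defaultdict(int)
    local_rank = None
    for rank, node in ranks_nodes:
        if rank == my_rank:
            # Number of processes already seen on this node = our local rank.
            local_rank = seen_per_node[node]
        seen_per_node[node] += 1
    assert local_rank is not None
    return local_rank, seen_per_node[my_node]


# Made-up layout: global ranks 0, 2, 3 run on "nodeA", rank 1 runs on "nodeB".
gathered = [(0, "nodeA"), (1, "nodeB"), (2, "nodeA"), (3, "nodeA")]
results = [local_rank_and_size(gathered, r, n) for r, n in gathered]
print(results)  # [(0, 3), (0, 1), (1, 3), (2, 3)]
```

`setup_mpi_gpus` uses the first element of this pair as the GPU index, so the three processes on "nodeA" in this example would be assigned GPUs 0, 1, and 2.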
--- a/baselines/common/plot_util.py
+++ b/baselines/common/plot_util.py
@@ -0,0 +1,434 @@
+import matplotlib.pyplot as plt
+import os.path as osp
+import json
+import os
+import numpy as np
+import pandas
+from collections import defaultdict, namedtuple
+from baselines.bench import monitor
+from baselines.logger import read_json, read_csv
+
+def smooth(y, radius, mode='two_sided', valid_only=False):
+    '''
+    Smooth signal y, where radius determines the size of the window
+
+    mode='two_sided':
+        average over the window [max(index - radius, 0), min(index + radius, len(y)-1)]
+    mode='causal':
+        average over the window [max(index - radius, 0), index]
+
+    valid_only: put nan in entries where the full-sized window is not available
+
+    '''
+    assert mode in ('two_sided', 'causal')
+    if len(y) < 2*radius+1:
+        return np.ones_like(y) * y.mean()
+    elif mode == 'two_sided':
+        convkernel = np.ones(2 * radius+1)
+        out = np.convolve(y, convkernel,mode='same') / np.convolve(np.ones_like(y), convkernel, mode='same')
+        if valid_only:
+            out[:radius] = out[-radius:] = np.nan
+    elif mode == 'causal':
+        convkernel = np.ones(radius)
+        out = np.convolve(y, convkernel,mode='full') / np.convolve(np.ones_like(y), convkernel, mode='full')
+        out = out[:len(y)]  # trim the 'full' convolution back to len(y); also correct when radius == 1
+        if valid_only:
+            out[:radius] = np.nan
+    return out
+
+def one_sided_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_counts_threshold=1e-8):
+    '''
+    perform one-sided (causal) EMA (exponential moving average)
+    smoothing and resampling to an even grid with n points.
+    Does not do extrapolation, so we assume
+    xolds[0] <= low && high <= xolds[-1]
+
+    Arguments:
+
+    xolds: array or list  - x values of data. Needs to be sorted in ascending order
+    yolds: array or list  - y values of data. Has to have the same length as xolds
+
+    low: float            - min value of the new x grid. By default equals to xolds[0]
+    high: float           - max value of the new x grid. By default equals to xolds[-1]
+
+    n: int                - number of points in new x grid
+
+    decay_steps: float    - EMA decay factor, expressed in new x grid steps.
+
+    low_counts_threshold: float or int
+                          - y values with counts less than this value will be set to NaN
+
+    Returns:
+        tuple xs, ys, count_ys where
+            xs        - array with new x grid
+            ys        - array of EMA of y at each point of the new x grid
+            count_ys  - array of EMA of y counts at each point of the new x grid
+
+    '''
+
+    low = xolds[0] if low is None else low
+    high = xolds[-1] if high is None else high
+
+    assert xolds[0] <= low, 'low = {} < xolds[0] = {} - extrapolation not permitted!'.format(low, xolds[0])
+    assert xolds[-1] >= high, 'high = {} > xolds[-1] = {}  - extrapolation not permitted!'.format(high, xolds[-1])
+    assert len(xolds) == len(yolds), 'length of xolds ({}) and yolds ({}) do not match!'.format(len(xolds), len(yolds))
+
+
+    xolds = np.asarray(xolds, dtype='float64')
+    yolds = np.asarray(yolds, dtype='float64')
+
+    luoi = 0 # last unused old index
+    sum_y = 0.
+    count_y = 0.
+    xnews = np.linspace(low, high, n)
+    decay_period = (high - low) / (n - 1) * decay_steps
+    interstep_decay = np.exp(- 1. / decay_steps)
+    sum_ys = np.zeros_like(xnews)
+    count_ys = np.zeros_like(xnews)
+    for i in range(n):
+        xnew = xnews[i]
+        sum_y *= interstep_decay
+        count_y *= interstep_decay
+        while True:
+            if luoi >= len(xolds):
+                break
+            xold = xolds[luoi]
+            if xold <= xnew:
+                decay = np.exp(- (xnew - xold) / decay_period)
+                sum_y += decay * yolds[luoi]
+                count_y += decay
+                luoi += 1
+            else:
+                break
+        sum_ys[i] = sum_y
+        count_ys[i] = count_y
+
+    ys = sum_ys / count_ys
+    ys[count_ys < low_counts_threshold] = np.nan
+
+    return xnews, ys, count_ys
+
+def symmetric_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_counts_threshold=1e-8):
+    '''
+    perform symmetric EMA (exponential moving average)
+    smoothing and resampling to an even grid with n points.
+    Does not do extrapolation, so we assume
+    xolds[0] <= low && high <= xolds[-1]
+
+    Arguments:
+
+    xolds: array or list  - x values of data. Needs to be sorted in ascending order
+    yolds: array or list  - y values of data. Has to have the same length as xolds
+
+    low: float            - min value of the new x grid. By default equals to xolds[0]
+    high: float           - max value of the new x grid. By default equals to xolds[-1]
+
+    n: int                - number of points in new x grid
+
+    decay_steps: float    - EMA decay factor, expressed in new x grid steps.
+
+    low_counts_threshold: float or int
+                          - y values with counts less than this value will be set to NaN
+
+    Returns:
+        tuple xs, ys, count_ys where
+            xs        - array with new x grid
+            ys        - array of EMA of y at each point of the new x grid
+            count_ys  - array of EMA of y counts at each point of the new x grid
+
+    '''
+    xs, ys1, count_ys1 = one_sided_ema(xolds, yolds, low, high, n, decay_steps, low_counts_threshold=0)
+    _,  ys2, count_ys2 = one_sided_ema(-xolds[::-1], yolds[::-1], -high, -low, n, decay_steps, low_counts_threshold=0)
+    ys2 = ys2[::-1]
+    count_ys2 = count_ys2[::-1]
+    count_ys = count_ys1 + count_ys2
+    ys = (ys1 * count_ys1 + ys2 * count_ys2) / count_ys
+    ys[count_ys < low_counts_threshold] = np.nan
+    return xs, ys, count_ys
+
+Result = namedtuple('Result', 'monitor progress dirname metadata')
+Result.__new__.__defaults__ = (None,) * len(Result._fields)
+
+def load_results(root_dir_or_dirs, enable_progress=True, enable_monitor=True, verbose=False):
+    '''
+    load summaries of runs from a list of directories (including subdirectories)
+    Arguments:
+
+    enable_progress: bool - if True, will attempt to load data from progress.csv files (data saved by logger). Default: True
+
+    enable_monitor: bool - if True, will attempt to load data from monitor.csv files (data saved by Monitor environment wrapper). Default: True
+
+    verbose: bool - if True, will print out list of directories from which the data is loaded. Default: False
+
+
+    Returns:
+    List of Result objects with the following fields:
+         - dirname - path to the directory data was loaded from
+         - metadata - run metadata (such as command-line arguments and anything else in metadata.json file)
+         - monitor - if enable_monitor is True, this field contains pandas dataframe with loaded monitor.csv file (or aggregate of all *.monitor.csv files in the directory)
+         - progress - if enable_progress is True, this field contains pandas dataframe with loaded progress.csv file
+    '''
+    import re
+    if isinstance(root_dir_or_dirs, str):
+        rootdirs = [osp.expanduser(root_dir_or_dirs)]
+    else:
+        rootdirs = [osp.expanduser(d) for d in root_dir_or_dirs]
+    allresults = []
+    for rootdir in rootdirs:
+        assert osp.exists(rootdir), "%s doesn't exist"%rootdir
+        for dirname, dirs, files in os.walk(rootdir):
+            if '-proc' in dirname:
+                files[:] = []
+                continue
+            monitor_re = re.compile(r'(\d+\.)?(\d+\.)?monitor\.csv')
+            if set(['metadata.json', 'monitor.json', 'progress.json', 'progress.csv']).intersection(files) or \
+               any([f for f in files if monitor_re.match(f)]):  # also match monitor files like 0.1.monitor.csv
+                # The line below used to be active; it stopped os.walk from descending below
+                # a directory in which any of the data files were found
+                # dirs[:] = []
+                result = {'dirname' : dirname}
+                if "metadata.json" in files:
+                    with open(osp.join(dirname, "metadata.json"), "r") as fh:
+                        result['metadata'] = json.load(fh)
+                progjson = osp.join(dirname, "progress.json")
+                progcsv = osp.join(dirname, "progress.csv")
+                if enable_progress:
+                    if osp.exists(progjson):
+                        result['progress'] = pandas.DataFrame(read_json(progjson))
+                    elif osp.exists(progcsv):
+                        try:
+                            result['progress'] = read_csv(progcsv)
+                        except pandas.errors.EmptyDataError:
+                            print('skipping progress file in ', dirname, 'empty data')
+                    else:
+                        if verbose: print('skipping %s: no progress file'%dirname)
+
+                if enable_monitor:
+                    try:
+                        result['monitor'] = pandas.DataFrame(monitor.load_results(dirname))
+                    except monitor.LoadMonitorResultsError:
+                        print('skipping %s: no monitor files'%dirname)
+                    except Exception as e:
+                        print('exception loading monitor file in %s: %s'%(dirname, e))
+
+                if result.get('monitor') is not None or result.get('progress') is not None:
+                    allresults.append(Result(**result))
+                    if verbose:
+                        print('successfully loaded %s'%dirname)
+
+    if verbose: print('loaded %i results'%len(allresults))
+    return allresults
+
+COLORS = ['blue', 'green', 'red', 'cyan', 'magenta', 'yellow', 'black', 'purple', 'pink',
+        'brown', 'orange', 'teal',  'lightblue', 'lime', 'lavender', 'turquoise',
+        'darkgreen', 'tan', 'salmon', 'gold',  'darkred', 'darkblue']
+
+
+def default_xy_fn(r):
+    x = np.cumsum(r.monitor.l)
+    y = smooth(r.monitor.r, radius=10)
+    return x,y
+
+def default_split_fn(r):
+    import re
+    # match name between slash and -<digits> at the end of the string
+    # (the slash at the beginning, the -<digits> at the end, or both may be missing)
+    match = re.search(r'[^/-]+(?=(-\d+)?\Z)', r.dirname)
+    if match:
+        return match.group(0)
+
+def plot_results(
+    allresults, *,
+    xy_fn=default_xy_fn,
+    split_fn=default_split_fn,
+    group_fn=default_split_fn,
+    average_group=False,
+    shaded_std=True,
+    shaded_err=True,
+    figsize=None,
+    legend_outside=False,
+    resample=0,
+    smooth_step=1.0,
+    tiling='vertical',
+    xlabel=None,
+    ylabel=None
+):
+    '''
+    Plot multiple Results objects
+
+    xy_fn: function Result -> x,y           - function that converts results objects into tuple of x and y values.
+                                              By default, x is cumsum of episode lengths, and y is episode rewards
+
+    split_fn: function Result -> hashable   - function that converts results objects into keys to split curves into sub-panels by.
+                                              That is, the results r for which split_fn(r) is different will be put on different sub-panels.
+                                              By default, the portion of r.dirname between last / and -<digits> is returned. The sub-panels are
+                                              stacked vertically in the figure.
+
+    group_fn: function Result -> hashable   - function that converts results objects into keys to group curves by.
+                                              That is, the results r for which group_fn(r) is the same will be put into the same group.
+                                              Curves in the same group have the same color (if average_group is False), or are averaged over
+                                              (if average_group is True). The default value is the same as default value for split_fn
+
+    average_group: bool                     - if True, will average the curves in the same group and plot the mean. Enables resampling
+                                              (if resample = 0, will use 512 steps)
+
+    shaded_std: bool                        - if True (default), the shaded region corresponding to standard deviation of the group of curves will be
+                                              shown (only applicable if average_group = True)
+
+    shaded_err: bool                        - if True (default), the shaded region corresponding to error in mean estimate of the group of curves
+                                              (that is, standard deviation divided by square root of number of curves) will be
+                                              shown (only applicable if average_group = True)
+
+    figsize: tuple or None                  - size of the resulting figure (including sub-panels). By default, each sub-panel is 6 by 6, so the
+                                              figure size is (6 * ncols, 6 * nrows).
+
+
+    legend_outside: bool                    - if True, will place the legend outside of the sub-panels.
+
+    resample: int                           - if not zero, size of the uniform grid in x direction to resample onto. Resampling is performed via symmetric
+                                              EMA smoothing (see the docstring for symmetric_ema).
+                                              Default is zero (no resampling). Note that if average_group is True, resampling is necessary; in that case, default
+                                              value is 512.
+
+    smooth_step: float                      - when resampling (i.e. when resample > 0 or average_group is True), use this EMA decay parameter (in units of the new grid step).
+                                              See docstrings for decay_steps in symmetric_ema or one_sided_ema functions.
+
+    '''
+
+    if split_fn is None: split_fn = lambda _ : ''
+    if group_fn is None: group_fn = lambda _ : ''
+    sk2r = defaultdict(list) # splitkey2results
+    for result in allresults:
+        splitkey = split_fn(result)
+        sk2r[splitkey].append(result)
+    assert len(sk2r) > 0
+    assert isinstance(resample, int), "0: don't resample. <integer>: that many samples"
+    if tiling == 'vertical' or tiling is None:
+        nrows = len(sk2r)
+        ncols = 1
+    elif tiling == 'horizontal':
+        ncols = len(sk2r)
+        nrows = 1
+    elif tiling == 'symmetric':
+        import math
+        N = len(sk2r)
+        largest_divisor = 1
+        for i in range(1, int(math.sqrt(N))+1):
+            if N % i == 0:
+                largest_divisor = i
+        ncols = largest_divisor
+        nrows = N // ncols
+    figsize = figsize or (6 * ncols, 6 * nrows)
+
+    f, axarr = plt.subplots(nrows, ncols, sharex=False, squeeze=False, figsize=figsize)
+
+    groups = list(set(group_fn(result) for result in allresults))
+
+    default_samples = 512
+    if average_group:
+        resample = resample or default_samples
+
+    for (isplit, sk) in enumerate(sorted(sk2r.keys())):
+        g2l = {}
+        g2c = defaultdict(int)
+        sresults = sk2r[sk]
+        gresults = defaultdict(list)
+        idx_row = isplit // ncols
+        idx_col = isplit % ncols
+        ax = axarr[idx_row][idx_col]
+        for result in sresults:
+            group = group_fn(result)
+            g2c[group] += 1
+            x, y = xy_fn(result)
+            if x is None: x = np.arange(len(y))
+            x, y = map(np.asarray, (x, y))
+            if average_group:
+                gresults[group].append((x,y))
+            else:
+                if resample:
+                    x, y, counts = symmetric_ema(x, y, x[0], x[-1], resample, decay_steps=smooth_step)
+                l, = ax.plot(x, y, color=COLORS[groups.index(group) % len(COLORS)])
+                g2l[group] = l
+        if average_group:
+            for group in sorted(groups):
+                xys = gresults[group]
+                if not any(xys):
+                    continue
+                color = COLORS[groups.index(group) % len(COLORS)]
+                origxs = [xy[0] for xy in xys]
+                minxlen = min(map(len, origxs))
+                def allequal(qs):
+                    return all((q==qs[0]).all() for q in qs[1:])
+                if resample:
+                    low  = max(x[0] for x in origxs)
+                    high = min(x[-1] for x in origxs)
+                    usex = np.linspace(low, high, resample)
+                    ys = []
+                    for (x, y) in xys:
+                        ys.append(symmetric_ema(x, y, low, high, resample, decay_steps=smooth_step)[1])
+                else:
+                    assert allequal([x[:minxlen] for x in origxs]),\
+                        'If you want to average unevenly sampled data, set resample=<number of samples you want>'
+                    usex = origxs[0]
+                    ys = [xy[1][:minxlen] for xy in xys]
+                ymean = np.mean(ys, axis=0)
+                ystd = np.std(ys, axis=0)
+                ystderr = ystd / np.sqrt(len(ys))
+                l, = axarr[idx_row][idx_col].plot(usex, ymean, color=color)
+                g2l[group] = l
+                if shaded_err:
+                    ax.fill_between(usex, ymean - ystderr, ymean + ystderr, color=color, alpha=.4)
+                if shaded_std:
+                    ax.fill_between(usex, ymean - ystd,    ymean + ystd,    color=color, alpha=.2)
+
+
+        # https://matplotlib.org/users/legend_guide.html
+        plt.tight_layout()
+        if any(g2l.keys()):
+            ax.legend(
+                g2l.values(),
+                ['%s (%i)'%(g, g2c[g]) for g in g2l] if average_group else g2l.keys(),
+                loc=2 if legend_outside else None,
+                bbox_to_anchor=(1,1) if legend_outside else None)
+        ax.set_title(sk)
+        # add xlabels, but only to the bottom row
+        if xlabel is not None:
+            for ax in axarr[-1]:
+                plt.sca(ax)
+                plt.xlabel(xlabel)
+        # add ylabels, but only to left column
+        if ylabel is not None:
+            for ax in axarr[:,0]:
+                plt.sca(ax)
+                plt.ylabel(ylabel)
+
+    return f, axarr
+
+def regression_analysis(df):
+    xcols = list(df.columns.copy())
+    xcols.remove('score')
+    ycols = ['score']
+    import statsmodels.api as sm
+    mod = sm.OLS(df[ycols], sm.add_constant(df[xcols]), hasconst=False)
+    res = mod.fit()
+    print(res.summary())
+
+def test_smooth():
+    norig = 100
+    nup = 300
+    ndown = 30
+    xs = np.cumsum(np.random.rand(norig) * 10 / norig)
+    yclean = np.sin(xs)
+    ys = yclean + .1 * np.random.randn(yclean.size)
+    xup, yup, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), nup, decay_steps=nup/ndown)
+    xdown, ydown, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), ndown, decay_steps=ndown/ndown)
+    xsame, ysame, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), norig, decay_steps=norig/ndown)
+    plt.plot(xs, ys, label='orig', marker='x')
+    plt.plot(xup, yup, label='up', marker='x')
+    plt.plot(xdown, ydown, label='down', marker='x')
+    plt.plot(xsame, ysame, label='same', marker='x')
+    plt.plot(xs, yclean, label='clean', marker='x')
+    plt.legend()
+    plt.show()
+
+
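Note on the 'symmetric' tiling branch of plot_results in the hunk above: it picks, as the number of columns, the largest divisor of the sub-panel count that does not exceed its square root. A minimal standalone sketch of that computation follows; the helper name symmetric_grid is hypothetical and not part of baselines.

```python
import math


def symmetric_grid(n):
    """Return (nrows, ncols) for n panels, mirroring the 'symmetric' tiling branch.

    Hypothetical helper for illustration only; not part of baselines.
    """
    largest_divisor = 1
    # Scan candidate divisors up to floor(sqrt(n)) and keep the largest one found
    for i in range(1, int(math.sqrt(n)) + 1):
        if n % i == 0:
            largest_divisor = i
    ncols = largest_divisor
    nrows = n // ncols
    return nrows, ncols


for n in (1, 7, 9, 12):
    print(n, symmetric_grid(n))
```

One consequence of this rule is that a prime number of sub-panels falls back to a single column, which is the same layout as 'vertical'.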
--- a/baselines/common/policies.py
+++ b/baselines/common/policies.py
@@ -0,0 +1,81 @@
+import tensorflow as tf
+from baselines.a2c.utils import fc
+from baselines.common.distributions import make_pdtype
+
+import gym
+
+
+class PolicyWithValue(tf.Module):
+    """
+    Encapsulates fields and methods for RL policy and value function estimation with shared parameters
+    """
+
+    def __init__(self, ac_space, policy_network, value_network=None, estimate_q=False):
+        """
+        Parameters:
+        ----------
+        ac_space        action space
+
+        policy_network  keras network for policy
+
+        value_network   keras network for value
+
+        estimate_q      if True, estimate Q-values (one per discrete action); otherwise estimate the state value V
+
+        """
+
+        self.policy_network = policy_network
+        self.value_network = value_network or policy_network
+        self.estimate_q = estimate_q
+        self.initial_state = None
+
+        # Based on the action space, will select what probability distribution type
+        self.pdtype = make_pdtype(policy_network.output_shape, ac_space, init_scale=0.01)
+
+        if estimate_q:
+            assert isinstance(ac_space, gym.spaces.Discrete)
+            self.value_fc = fc(self.value_network.output_shape, 'q', ac_space.n)
+        else:
+            self.value_fc = fc(self.value_network.output_shape, 'vf', 1)
+
+    @tf.function
+    def step(self, observation):
+        """
+        Compute next action(s) given the observation(s)
+
+        Parameters:
+        ----------
+
+        observation     batched observation data
+
+        Returns:
+        -------
+        (action, value estimate, next state, negative log likelihood of the action under current policy parameters) tuple
+        """
+
+        latent = self.policy_network(observation)
+        pd, pi = self.pdtype.pdfromlatent(latent)
+        action = pd.sample()
+        neglogp = pd.neglogp(action)
+        value_latent = self.value_network(observation)
+        vf = tf.squeeze(self.value_fc(value_latent), axis=1)
+        return action, vf, None, neglogp
+
+    @tf.function
+    def value(self, observation):
+        """
+        Compute value estimate(s) given the observation(s)
+
+        Parameters:
+        ----------
+
+        observation     observation data (either single or a batch)
+
+        Returns:
+        -------
+        value estimate
+        """
+        value_latent = self.value_network(observation)
+        result = tf.squeeze(self.value_fc(value_latent), axis=1)
+        return result
+
--- a/baselines/common/retro_wrappers.py
+++ b/baselines/common/retro_wrappers.py
@@ -0,0 +1,280 @@
+from collections import deque
+import cv2
+cv2.ocl.setUseOpenCL(False)
+from .atari_wrappers import WarpFrame, ClipRewardEnv, FrameStack, ScaledFloatFrame
+from .wrappers import TimeLimit
+import numpy as np
+import gym
+
+
+class StochasticFrameSkip(gym.Wrapper):
+    def __init__(self, env, n, stickprob):
+        gym.Wrapper.__init__(self, env)
+        self.n = n
+        self.stickprob = stickprob
+        self.curac = None
+        self.rng = np.random.RandomState()
+        self.supports_want_render = hasattr(env, "supports_want_render")
+
+    def reset(self, **kwargs):
+        self.curac = None
+        return self.env.reset(**kwargs)
+
+    def step(self, ac):
+        done = False
+        totrew = 0
+        for i in range(self.n):
+            # First step after reset, use action
+            if self.curac is None:
+                self.curac = ac
+            # First substep, delay with probability=stickprob
+            elif i==0:
+                if self.rng.rand() > self.stickprob:
+                    self.curac = ac
+            # Second substep, new action definitely kicks in
+            elif i==1:
+                self.curac = ac
+            if self.supports_want_render and i<self.n-1:
+                ob, rew, done, info = self.env.step(self.curac, want_render=False)
+            else:
+                ob, rew, done, info = self.env.step(self.curac)
+            totrew += rew
+            if done: break
+        return ob, totrew, done, info
+
+    def seed(self, s):
+        self.rng.seed(s)
+
+class PartialFrameStack(gym.Wrapper):
+    def __init__(self, env, k, channel=1):
+        """
+        Stack one channel (channel keyword) from previous frames
+        """
+        gym.Wrapper.__init__(self, env)
+        shp = env.observation_space.shape
+        self.channel = channel
+        self.observation_space = gym.spaces.Box(low=0, high=255,
+            shape=(shp[0], shp[1], shp[2] + k - 1),
+            dtype=env.observation_space.dtype)
+        self.k = k
+        self.frames = deque([], maxlen=k)
+        shp = env.observation_space.shape
+
+    def reset(self):
+        ob = self.env.reset()
+        assert ob.shape[2] > self.channel
+        for _ in range(self.k):
+            self.frames.append(ob)
+        return self._get_ob()
+
+    def step(self, ac):
+        ob, reward, done, info = self.env.step(ac)
+        self.frames.append(ob)
+        return self._get_ob(), reward, done, info
+
+    def _get_ob(self):
+        assert len(self.frames) == self.k
+        return np.concatenate([frame if i==self.k-1 else frame[:,:,self.channel:self.channel+1]
+            for (i, frame) in enumerate(self.frames)], axis=2)
+
+class Downsample(gym.ObservationWrapper):
+    def __init__(self, env, ratio):
+        """
+        Downsample images by a factor of ratio
+        """
+        gym.ObservationWrapper.__init__(self, env)
+        (oldh, oldw, oldc) = env.observation_space.shape
+        newshape = (oldh//ratio, oldw//ratio, oldc)
+        self.observation_space = gym.spaces.Box(low=0, high=255,
+            shape=newshape, dtype=np.uint8)
+
+    def observation(self, frame):
+        height, width, _ = self.observation_space.shape
+        frame = cv2.resize(frame, (width, height), interpolation=cv2.INTER_AREA)
+        if frame.ndim == 2:
+            frame = frame[:,:,None]
+        return frame
+
+class Rgb2gray(gym.ObservationWrapper):
+    def __init__(self, env):
+        """
+        Convert RGB images to single-channel grayscale
+        """
+        gym.ObservationWrapper.__init__(self, env)
+        (oldh, oldw, _oldc) = env.observation_space.shape
+        self.observation_space = gym.spaces.Box(low=0, high=255,
+            shape=(oldh, oldw, 1), dtype=np.uint8)
+
+    def observation(self, frame):
+        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
+        return frame[:,:,None]
+
+
+class MovieRecord(gym.Wrapper):
+    def __init__(self, env, savedir, k):
+        gym.Wrapper.__init__(self, env)
+        self.savedir = savedir
+        self.k = k
+        self.epcount = 0
+    def reset(self):
+        if self.epcount % self.k == 0:
+            self.env.unwrapped.movie_path = self.savedir
+        else:
+            self.env.unwrapped.movie_path = None
+            self.env.unwrapped.movie = None
+        self.epcount += 1
+        return self.env.reset()
+
+class AppendTimeout(gym.Wrapper):
+    def __init__(self, env):
+        gym.Wrapper.__init__(self, env)
+        self.action_space = env.action_space
+        self.timeout_space = gym.spaces.Box(low=np.array([0.0]), high=np.array([1.0]), dtype=np.float32)
+        self.original_os = env.observation_space
+        if isinstance(self.original_os, gym.spaces.Dict):
+            import copy
+            ordered_dict = copy.deepcopy(self.original_os.spaces)
+            ordered_dict['value_estimation_timeout'] = self.timeout_space
+            self.observation_space = gym.spaces.Dict(ordered_dict)
+            self.dict_mode = True
+        else:
+            self.observation_space = gym.spaces.Dict({
+                'original': self.original_os,
+                'value_estimation_timeout': self.timeout_space
+                })
+            self.dict_mode = False
+        self.ac_count = None
+        while 1:
+            if not hasattr(env, "_max_episode_steps"):  # Looking for TimeLimit wrapper that has this field
+                env = env.env
+                continue
+            break
+        self.timeout = env._max_episode_steps
+
+    def step(self, ac):
+        self.ac_count += 1
+        ob, rew, done, info = self.env.step(ac)
+        return self._process(ob), rew, done, info
+
+    def reset(self):
+        self.ac_count = 0
+        return self._process(self.env.reset())
+
+    def _process(self, ob):
+        fracmissing = 1 - self.ac_count / self.timeout
+        if not self.dict_mode:
+            return {'original': ob, 'value_estimation_timeout': fracmissing}
+        ob['value_estimation_timeout'] = fracmissing
+        return ob
+
+class StartDoingRandomActionsWrapper(gym.Wrapper):
+    """
+    Warning: can eat info dicts, not good if you depend on them
+    """
+    def __init__(self, env, max_random_steps, on_startup=True, every_episode=False):
+        gym.Wrapper.__init__(self, env)
+        self.on_startup = on_startup
+        self.every_episode = every_episode
+        self.random_steps = max_random_steps
+        self.last_obs = None
+        if on_startup:
+            self.some_random_steps()
+
+    def some_random_steps(self):
+        self.last_obs = self.env.reset()
+        n = np.random.randint(self.random_steps)
+        #print("running for random %i frames" % n)
+        for _ in range(n):
+            self.last_obs, _, done, _ = self.env.step(self.env.action_space.sample())
+            if done: self.last_obs = self.env.reset()
+
+    def reset(self):
+        return self.last_obs
+
+    def step(self, a):
+        self.last_obs, rew, done, info = self.env.step(a)
+        if done:
+            self.last_obs = self.env.reset()
+            if self.every_episode:
+                self.some_random_steps()
+        return self.last_obs, rew, done, info
+
+def make_retro(*, game, state=None, max_episode_steps=4500, **kwargs):
+    import retro
+    if state is None:
+        state = retro.State.DEFAULT
+    env = retro.make(game, state, **kwargs)
+    env = StochasticFrameSkip(env, n=4, stickprob=0.25)
+    if max_episode_steps is not None:
+        env = TimeLimit(env, max_episode_steps=max_episode_steps)
+    return env
+
+def wrap_deepmind_retro(env, scale=True, frame_stack=4):
+    """
+    Configure environment for retro games, using config similar to DeepMind-style Atari in wrap_deepmind
+    """
+    env = WarpFrame(env)
+    env = ClipRewardEnv(env)
+    if frame_stack > 1:
+        env = FrameStack(env, frame_stack)
+    if scale:
+        env = ScaledFloatFrame(env)
+    return env
+
+class SonicDiscretizer(gym.ActionWrapper):
+    """
+    Wrap a gym-retro environment and make it use discrete
+    actions for the Sonic game.
+    """
+    def __init__(self, env):
+        super(SonicDiscretizer, self).__init__(env)
+        buttons = ["B", "A", "MODE", "START", "UP", "DOWN", "LEFT", "RIGHT", "C", "Y", "X", "Z"]
+        actions = [['LEFT'], ['RIGHT'], ['LEFT', 'DOWN'], ['RIGHT', 'DOWN'], ['DOWN'],
+                   ['DOWN', 'B'], ['B']]
+        self._actions = []
+        for action in actions:
+            arr = np.array([False] * 12)
+            for button in action:
+                arr[buttons.index(button)] = True
+            self._actions.append(arr)
+        self.action_space = gym.spaces.Discrete(len(self._actions))
+
+    def action(self, a): # pylint: disable=W0221
+        return self._actions[a].copy()
+
+class RewardScaler(gym.RewardWrapper):
+    """
+    Bring rewards to a reasonable scale for PPO.
+    This is incredibly important and affects performance
+    drastically.
+    """
+    def __init__(self, env, scale=0.01):
+        super(RewardScaler, self).__init__(env)
+        self.scale = scale
+
+    def reward(self, reward):
+        return reward * self.scale
+
+class AllowBacktracking(gym.Wrapper):
+    """
+    Use deltas in max(X) as the reward, rather than deltas
+    in X. This way, agents are not discouraged too heavily
+    from exploring backwards if there is no way to advance
+    head-on in the level.
+    """
+    def __init__(self, env):
+        super(AllowBacktracking, self).__init__(env)
+        self._cur_x = 0
+        self._max_x = 0
+
+    def reset(self, **kwargs): # pylint: disable=E0202
+        self._cur_x = 0
+        self._max_x = 0
+        return self.env.reset(**kwargs)
+
+    def step(self, action): # pylint: disable=E0202
+        obs, rew, done, info = self.env.step(action)
+        self._cur_x += rew
+        rew = max(0, self._cur_x - self._max_x)
+        self._max_x = max(self._max_x, self._cur_x)
+        return obs, rew, done, info
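Note on AllowBacktracking at the end of the file above: the wrapper accumulates the raw rewards into a running position and only pays out when that position exceeds its previous maximum. A minimal sketch of the same arithmetic on a plain list of rewards, with no gym dependency; the function name backtracking_rewards is hypothetical.

```python
def backtracking_rewards(rewards):
    """Apply the max(X)-delta reward rule to a sequence of raw rewards.

    Hypothetical standalone helper; mirrors the arithmetic in AllowBacktracking.step.
    """
    cur_x = 0
    max_x = 0
    shaped = []
    for rew in rewards:
        cur_x += rew
        # Pay only for progress beyond the best position reached so far
        shaped.append(max(0, cur_x - max_x))
        max_x = max(max_x, cur_x)
    return shaped


shaped = backtracking_rewards([1, 2, -3, 1, 4])
print(shaped)  # [1, 2, 0, 0, 2]
```

Moving backwards costs nothing, and re-covering old ground earns nothing, so the shaped rewards sum to the furthest position reached.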
--- a/baselines/common/runners.py
+++ b/baselines/common/runners.py
@@ -5,7 +5,7 @@ class AbstractEnvRunner(ABC):
    def __init__(self, *, env, model, nsteps):
        self.env = env
        self.model = model
-        nenv = env.num_envs
+        self.nenv = nenv = env.num_envs if hasattr(env, 'num_envs') else 1
        self.batch_ob_shape = (nenv*nsteps,) + env.observation_space.shape
        self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
        self.obs[:] = env.reset()
@@ -16,3 +16,4 @@ class AbstractEnvRunner(ABC):
    @abstractmethod
    def run(self):
        raise NotImplementedError
+
--- a/baselines/common/running_mean_std.py
+++ b/baselines/common/running_mean_std.py
@@ -1,4 +1,5 @@
 import numpy as np
+
 class RunningMeanStd(object):
    # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
    def __init__(self, epsilon=1e-4, shape=()):
@@ -13,34 +14,18 @@ class RunningMeanStd(object):
        self.update_from_moments(batch_mean, batch_var, batch_count)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
-        delta = batch_mean - self.mean
-        tot_count = self.count + batch_count
+        self.mean, self.var, self.count = update_mean_var_count_from_moments(
+            self.mean, self.var, self.count, batch_mean, batch_var, batch_count)

-        new_mean = self.mean + delta * batch_count / tot_count        
-        m_a = self.var * (self.count)
-        m_b = batch_var * (batch_count)
-        M2 = m_a + m_b + np.square(delta) * self.count * batch_count / (self.count + batch_count)
-        new_var = M2 / (self.count + batch_count)
+def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
+    delta = batch_mean - mean
+    tot_count = count + batch_count

-        new_count = batch_count + self.count
+    new_mean = mean + delta * batch_count / tot_count
+    m_a = var * count
+    m_b = batch_var * batch_count
+    M2 = m_a + m_b + np.square(delta) * count * batch_count / tot_count
+    new_var = M2 / tot_count
+    new_count = tot_count

-        self.mean = new_mean
-        self.var = new_var
-        self.count = new_count    
-
-def test_runningmeanstd():
-    for (x1, x2, x3) in [
-        (np.random.randn(3), np.random.randn(4), np.random.randn(5)),
-        (np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
-        ]:
-
-        rms = RunningMeanStd(epsilon=0.0, shape=x1.shape[1:])
-
-        x = np.concatenate([x1, x2, x3], axis=0)
-        ms1 = [x.mean(axis=0), x.var(axis=0)]
-        rms.update(x1)
-        rms.update(x2)
-        rms.update(x3)
-        ms2 = [rms.mean, rms.var]
-
-        assert np.allclose(ms1, ms2)
+    return new_mean, new_var, new_count
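Note on update_mean_var_count_from_moments above: this hunk also removes test_runningmeanstd, which checked the merged moments against moments computed directly on the concatenated data. A minimal sketch of that check for scalar data, using only the standard library; merge_moments is a hypothetical re-statement of the formula, not the baselines function itself.

```python
import statistics


def merge_moments(mean, var, count, batch_mean, batch_var, batch_count):
    """Combine (mean, population variance, count) of two disjoint batches.

    Hypothetical scalar re-statement of the parallel-variance update.
    """
    delta = batch_mean - mean
    tot_count = count + batch_count
    new_mean = mean + delta * batch_count / tot_count
    # Squared deviations of each batch, plus a correction for the gap between the two means
    m2 = var * count + batch_var * batch_count + delta ** 2 * count * batch_count / tot_count
    return new_mean, m2 / tot_count, tot_count


a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0, 40.0]
merged = merge_moments(
    statistics.fmean(a), statistics.pvariance(a), len(a),
    statistics.fmean(b), statistics.pvariance(b), len(b))
direct = (statistics.fmean(a + b), statistics.pvariance(a + b), len(a + b))
print(merged, direct)
```

The variances here are population variances (no Bessel correction), matching the use of np.var with its default ddof in the removed test.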
--- a/baselines/common/running_stat.py
+++ b/baselines/common/running_stat.py
@@ -1,46 +0,0 @@
-import numpy as np
-
-# http://www.johndcook.com/blog/standard_deviation/
-class RunningStat(object):
-    def __init__(self, shape):
-        self._n = 0
-        self._M = np.zeros(shape)
-        self._S = np.zeros(shape)
-    def push(self, x):
-        x = np.asarray(x)
-        assert x.shape == self._M.shape
-        self._n += 1
-        if self._n == 1:
-            self._M[...] = x
-        else:
-            oldM = self._M.copy()
-            self._M[...] = oldM + (x - oldM)/self._n
-            self._S[...] = self._S + (x - oldM)*(x - self._M)
-    @property
-    def n(self):
-        return self._n
-    @property
-    def mean(self):
-        return self._M
-    @property
-    def var(self):
-        return self._S/(self._n - 1) if self._n > 1 else np.square(self._M)
-    @property
-    def std(self):
-        return np.sqrt(self.var)
-    @property
-    def shape(self):
-        return self._M.shape
-
-def test_running_stat():
-    for shp in ((), (3,), (3,4)):
-        li = []
-        rs = RunningStat(shp)
-        for _ in range(5):
-            val = np.random.randn(*shp)
-            rs.push(val)
-            li.append(val)
-            m = np.mean(li, axis=0)
-            assert np.allclose(rs.mean, m)
-            v = np.square(m) if (len(li) == 1) else np.var(li, ddof=1, axis=0)
-            assert np.allclose(rs.var, v)
--- a/baselines/common/test_identity.py
+++ b/baselines/common/test_identity.py
@@ -1,44 +0,0 @@
-import pytest
-import tensorflow as tf
-import random
-import numpy as np
-from gym.spaces import np_random
-
-from baselines.a2c import a2c
-from baselines.ppo2 import ppo2
-from baselines.common.identity_env import IdentityEnv
-from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
-from baselines.ppo2.policies import MlpPolicy
-
-
-learn_func_list = [
-    lambda e: a2c.learn(policy=MlpPolicy, env=e, seed=0, total_timesteps=50000),
-    lambda e: ppo2.learn(policy=MlpPolicy, env=e, total_timesteps=50000, lr=1e-3, nsteps=128, ent_coef=0.01)
-]
-
-
-@pytest.mark.slow
-@pytest.mark.parametrize("learn_func", learn_func_list)
-def test_identity(learn_func):
-    '''
-    Test if the algorithm (with a given policy) 
-    can learn an identity transformation (i.e. return observation as an action)
-    '''
-    np.random.seed(0)
-    np_random.seed(0)
-    random.seed(0)
-
-    env = DummyVecEnv([lambda: IdentityEnv(10)])
-
-    with tf.Graph().as_default(), tf.Session().as_default():
-        tf.set_random_seed(0)
-        model = learn_func(env)
-
-        N_TRIALS = 1000
-        sum_rew = 0
-        obs = env.reset()
-        for i in range(N_TRIALS):
-            obs, rew, done, _ = env.step(model.step(obs)[0])
-            sum_rew += rew
-
-        assert sum_rew > 0.9 * N_TRIALS
--- a/baselines/common/tests/test_schedules.py
+++ b/baselines/common/tests/test_schedules.py
@@ -1,26 +0,0 @@
-import numpy as np
-
-from baselines.common.schedules import ConstantSchedule, PiecewiseSchedule
-
-
-def test_piecewise_schedule():
-    ps = PiecewiseSchedule([(-5, 100), (5, 200), (10, 50), (100, 50), (200, -50)], outside_value=500)
-
-    assert np.isclose(ps.value(-10), 500)
-    assert np.isclose(ps.value(0), 150)
-    assert np.isclose(ps.value(5), 200)
-    assert np.isclose(ps.value(9), 80)
-    assert np.isclose(ps.value(50), 50)
-    assert np.isclose(ps.value(80), 50)
-    assert np.isclose(ps.value(150), 0)
-    assert np.isclose(ps.value(175), -25)
-    assert np.isclose(ps.value(201), 500)
-    assert np.isclose(ps.value(500), 500)
-
-    assert np.isclose(ps.value(200 - 1e-10), -50)
-
-
-def test_constant_schedule():
-    cs = ConstantSchedule(5)
-    for i in range(-100, 100):
-        assert np.isclose(cs.value(i), 5)
--- a/baselines/common/tests/test_segment_tree.py
+++ b/baselines/common/tests/test_segment_tree.py
@@ -1,103 +0,0 @@
-import numpy as np
-
-from baselines.common.segment_tree import SumSegmentTree, MinSegmentTree
-
-
-def test_tree_set():
-    tree = SumSegmentTree(4)
-
-    tree[2] = 1.0
-    tree[3] = 3.0
-
-    assert np.isclose(tree.sum(), 4.0)
-    assert np.isclose(tree.sum(0, 2), 0.0)
-    assert np.isclose(tree.sum(0, 3), 1.0)
-    assert np.isclose(tree.sum(2, 3), 1.0)
-    assert np.isclose(tree.sum(2, -1), 1.0)
-    assert np.isclose(tree.sum(2, 4), 4.0)
-
-
-def test_tree_set_overlap():
-    tree = SumSegmentTree(4)
-
-    tree[2] = 1.0
-    tree[2] = 3.0
-
-    assert np.isclose(tree.sum(), 3.0)
-    assert np.isclose(tree.sum(2, 3), 3.0)
-    assert np.isclose(tree.sum(2, -1), 3.0)
-    assert np.isclose(tree.sum(2, 4), 3.0)
-    assert np.isclose(tree.sum(1, 2), 0.0)
-
-
-def test_prefixsum_idx():
-    tree = SumSegmentTree(4)
-
-    tree[2] = 1.0
-    tree[3] = 3.0
-
-    assert tree.find_prefixsum_idx(0.0) == 2
-    assert tree.find_prefixsum_idx(0.5) == 2
-    assert tree.find_prefixsum_idx(0.99) == 2
-    assert tree.find_prefixsum_idx(1.01) == 3
-    assert tree.find_prefixsum_idx(3.00) == 3
-    assert tree.find_prefixsum_idx(4.00) == 3
-
-
-def test_prefixsum_idx2():
-    tree = SumSegmentTree(4)
-
-    tree[0] = 0.5
-    tree[1] = 1.0
-    tree[2] = 1.0
-    tree[3] = 3.0
-
-    assert tree.find_prefixsum_idx(0.00) == 0
-    assert tree.find_prefixsum_idx(0.55) == 1
-    assert tree.find_prefixsum_idx(0.99) == 1
-    assert tree.find_prefixsum_idx(1.51) == 2
-    assert tree.find_prefixsum_idx(3.00) == 3
-    assert tree.find_prefixsum_idx(5.50) == 3
-
-
-def test_max_interval_tree():
-    tree = MinSegmentTree(4)
-
-    tree[0] = 1.0
-    tree[2] = 0.5
-    tree[3] = 3.0
-
-    assert np.isclose(tree.min(), 0.5)
-    assert np.isclose(tree.min(0, 2), 1.0)
-    assert np.isclose(tree.min(0, 3), 0.5)
-    assert np.isclose(tree.min(0, -1), 0.5)
-    assert np.isclose(tree.min(2, 4), 0.5)
-    assert np.isclose(tree.min(3, 4), 3.0)
-
-    tree[2] = 0.7
-
-    assert np.isclose(tree.min(), 0.7)
-    assert np.isclose(tree.min(0, 2), 1.0)
-    assert np.isclose(tree.min(0, 3), 0.7)
-    assert np.isclose(tree.min(0, -1), 0.7)
-    assert np.isclose(tree.min(2, 4), 0.7)
-    assert np.isclose(tree.min(3, 4), 3.0)
-
-    tree[2] = 4.0
-
-    assert np.isclose(tree.min(), 1.0)
-    assert np.isclose(tree.min(0, 2), 1.0)
-    assert np.isclose(tree.min(0, 3), 1.0)
-    assert np.isclose(tree.min(0, -1), 1.0)
-    assert np.isclose(tree.min(2, 4), 3.0)
-    assert np.isclose(tree.min(2, 3), 4.0)
-    assert np.isclose(tree.min(2, -1), 4.0)
-    assert np.isclose(tree.min(3, 4), 3.0)
-
-
-if __name__ == '__main__':
-    test_tree_set()
-    test_tree_set_overlap()
-    test_prefixsum_idx()
-    test_prefixsum_idx2()
-    test_max_interval_tree()
--- a/baselines/common/tests/test_tf_util.py
+++ b/baselines/common/tests/test_tf_util.py
@@ -1,40 +0,0 @@
-# tests for tf_util
-import tensorflow as tf
-from baselines.common.tf_util import (
-    function,
-    initialize,
-    single_threaded_session
-)
-
-
-def test_function():
-    with tf.Graph().as_default():
-        x = tf.placeholder(tf.int32, (), name="x")
-        y = tf.placeholder(tf.int32, (), name="y")
-        z = 3 * x + 2 * y
-        lin = function([x, y], z, givens={y: 0})
-
-        with single_threaded_session():
-            initialize()
-
-            assert lin(2) == 6
-            assert lin(2, 2) == 10
-
-
-def test_multikwargs():
-    with tf.Graph().as_default():
-        x = tf.placeholder(tf.int32, (), name="x")
-        with tf.variable_scope("other"):
-            x2 = tf.placeholder(tf.int32, (), name="x")
-        z = 3 * x + 2 * x2
-
-        lin = function([x, x2], z, givens={x2: 0})
-        with single_threaded_session():
-            initialize()
-            assert lin(2) == 6
-            assert lin(2, 2) == 10
-
-
-if __name__ == '__main__':
-    test_function()
-    test_multikwargs()
--- a/baselines/common/tests/test_with_mpi.py
+++ b/baselines/common/tests/test_with_mpi.py
@@ -0,0 +1,38 @@
+import os
+import sys
+import subprocess
+import cloudpickle
+import base64
+import pytest
+from functools import wraps
+
+try:
+    from mpi4py import MPI
+except ImportError:
+    MPI = None
+
+def with_mpi(nproc=2, timeout=30, skip_if_no_mpi=True):
+    def outer_thunk(fn):
+        @wraps(fn)
+        def thunk(*args, **kwargs):
+            serialized_fn = base64.b64encode(cloudpickle.dumps(lambda: fn(*args, **kwargs)))
+            subprocess.check_call([
+                'mpiexec','-n', str(nproc),
+                sys.executable,
+                '-m', 'baselines.common.tests.test_with_mpi',
+                serialized_fn
+            ], env=os.environ, timeout=timeout)
+
+        if skip_if_no_mpi:
+            return pytest.mark.skipif(MPI is None, reason="MPI not present")(thunk)
+        else:
+            return thunk
+
+    return outer_thunk
+
+
+if __name__ == '__main__':
+    if len(sys.argv) > 1:
+        fn = cloudpickle.loads(base64.b64decode(sys.argv[1]))
+        assert callable(fn)
+        fn()
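Note on with_mpi above: the decorator serializes the test callable, base64-encodes it so it survives as a command-line argument, and decodes it again in the child process. A minimal sketch of that round trip without MPI follows. It assumes the standard-library pickle in place of cloudpickle (so it only handles callables that pickle can serialize by reference, such as a functools.partial over a builtin) and launches a plain Python child instead of mpiexec.

```python
import base64
import pickle
import subprocess
import sys
from functools import partial

# Serialize a callable and make it safe to pass through argv
payload = base64.b64encode(pickle.dumps(partial(pow, 2, 10))).decode('ascii')

# Child program: decode argv[1], rebuild the callable, run it and print the result
child_code = (
    "import base64, pickle, sys\n"
    "fn = pickle.loads(base64.b64decode(sys.argv[1]))\n"
    "assert callable(fn)\n"
    "print(fn())\n"
)

out = subprocess.check_output([sys.executable, '-c', child_code, payload], timeout=30)
result = int(out.decode().strip())
print(result)
```

cloudpickle is used in the diff because it can serialize lambdas and closures by value, which the standard pickle module cannot.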
--- a/baselines/common/tf_util.py
+++ b/baselines/common/tf_util.py
@@ -1,10 +1,6 @@
 import numpy as np
 import tensorflow as tf  # pylint: ignore-module
 import copy
-import os
-import functools
-import collections
-import multiprocessing

 def switch(condition, then_expression, else_expression):
    """Switches between two operations depending on a scalar value (int or bool).
@@ -44,48 +40,13 @@ def huber_loss(x, delta=1.0):
        delta * (tf.abs(x) - 0.5 * delta)
    )

-# ================================================================
-# Global session
-# ================================================================
-
-def make_session(num_cpu=None, make_default=False, graph=None):
-    """Returns a session that will use <num_cpu> CPU's only"""
-    if num_cpu is None:
-        num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
-    tf_config = tf.ConfigProto(
-        inter_op_parallelism_threads=num_cpu,
-        intra_op_parallelism_threads=num_cpu)
-    if make_default:
-        return tf.InteractiveSession(config=tf_config, graph=graph)
-    else:
-        return tf.Session(config=tf_config, graph=graph)
-
-def single_threaded_session():
-    """Returns a session which will only use a single CPU"""
-    return make_session(num_cpu=1)
-
-def in_session(f):
-    @functools.wraps(f)
-    def newfunc(*args, **kwargs):
-        with tf.Session():
-            f(*args, **kwargs)
-    return newfunc
-
-ALREADY_INITIALIZED = set()
-
-def initialize():
-    """Initialize all the uninitialized variables in the global scope."""
-    new_variables = set(tf.global_variables()) - ALREADY_INITIALIZED
-    tf.get_default_session().run(tf.variables_initializer(new_variables))
-    ALREADY_INITIALIZED.update(new_variables)
-
 # ================================================================
 # Model components
 # ================================================================

 def normc_initializer(std=1.0, axis=0):
    def _initializer(shape, dtype=None, partition_info=None):  # pylint: disable=W0613
-        out = np.random.randn(*shape).astype(np.float32)
+        out = np.random.randn(*shape).astype(dtype.as_numpy_dtype)
        out *= std / np.sqrt(np.square(out).sum(axis=axis, keepdims=True))
        return tf.constant(out)
    return _initializer
@@ -119,80 +80,6 @@ def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad="SAME",

        return tf.nn.conv2d(x, w, stride_shape, pad) + b

-# ================================================================
-# Theano-like Function
-# ================================================================
-
-def function(inputs, outputs, updates=None, givens=None):
-    """Just like Theano function. Take a bunch of tensorflow placeholders and expressions
-    computed based on those placeholders and produces f(inputs) -> outputs. Function f takes
-    values to be fed to the input's placeholders and produces the values of the expressions
-    in outputs.
-
-    Input values can be passed in the same order as inputs or can be provided as kwargs based
-    on placeholder name (passed to constructor or accessible via placeholder.op.name).
-
-    Example:
-        x = tf.placeholder(tf.int32, (), name="x")
-        y = tf.placeholder(tf.int32, (), name="y")
-        z = 3 * x + 2 * y
-        lin = function([x, y], z, givens={y: 0})
-
-        with single_threaded_session():
-            initialize()
-
-            assert lin(2) == 6
-            assert lin(x=3) == 9
-            assert lin(2, 2) == 10
-            assert lin(x=2, y=3) == 12
-
-    Parameters
-    ----------
-    inputs: [tf.placeholder, tf.constant, or object with make_feed_dict method]
-        list of input arguments
-    outputs: [tf.Variable] or tf.Variable
-        list of outputs or a single output to be returned from function. Returned
-        value will also have the same shape.
-    """
-    if isinstance(outputs, list):
-        return _Function(inputs, outputs, updates, givens=givens)
-    elif isinstance(outputs, (dict, collections.OrderedDict)):
-        f = _Function(inputs, outputs.values(), updates, givens=givens)
-        return lambda *args, **kwargs: type(outputs)(zip(outputs.keys(), f(*args, **kwargs)))
-    else:
-        f = _Function(inputs, [outputs], updates, givens=givens)
-        return lambda *args, **kwargs: f(*args, **kwargs)[0]
-
-
-class _Function(object):
-    def __init__(self, inputs, outputs, updates, givens):
-        for inpt in inputs:
-            if not hasattr(inpt, 'make_feed_dict') and not (type(inpt) is tf.Tensor and len(inpt.op.inputs) == 0):
-                assert False, "inputs should all be placeholders, constants, or have a make_feed_dict method"
-        self.inputs = inputs
-        updates = updates or []
-        self.update_group = tf.group(*updates)
-        self.outputs_update = list(outputs) + [self.update_group]
-        self.givens = {} if givens is None else givens
-
-    def _feed_input(self, feed_dict, inpt, value):
-        if hasattr(inpt, 'make_feed_dict'):
-            feed_dict.update(inpt.make_feed_dict(value))
-        else:
-            feed_dict[inpt] = value
-
-    def __call__(self, *args):
-        assert len(args) <= len(self.inputs), "Too many arguments provided"
-        feed_dict = {}
-        # Update the args
-        for inpt, value in zip(self.inputs, args):
-            self._feed_input(feed_dict, inpt, value)
-        # Update feed dict with givens.
-        for inpt in self.givens:
-            feed_dict[inpt] = feed_dict.get(inpt, self.givens[inpt])
-        results = tf.get_default_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
-        return results
-
 # ================================================================
 # Flat vectors
 # ================================================================
@@ -209,8 +96,7 @@ def numel(x):
 def intprod(x):
    return int(np.prod(x))

-def flatgrad(loss, var_list, clip_norm=None):
-    grads = tf.gradients(loss, var_list)
+def flatgrad(grads, var_list, clip_norm=None):
    if clip_norm is not None:
        grads = [tf.clip_by_norm(grad, clip_norm=clip_norm) for grad in grads]
    return tf.concat(axis=0, values=[
@@ -220,85 +106,94 @@ def flatgrad(loss, var_list, clip_norm=None):

 class SetFromFlat(object):
    def __init__(self, var_list, dtype=tf.float32):
-        assigns = []
-        shapes = list(map(var_shape, var_list))
-        total_size = np.sum([intprod(shape) for shape in shapes])
-
-        self.theta = theta = tf.placeholder(dtype, [total_size])
-        start = 0
-        assigns = []
-        for (shape, v) in zip(shapes, var_list):
-            size = intprod(shape)
-            assigns.append(tf.assign(v, tf.reshape(theta[start:start + size], shape)))
-            start += size
-        self.op = tf.group(*assigns)
+        self.shapes = list(map(var_shape, var_list))
+        self.total_size = np.sum([intprod(shape) for shape in self.shapes])
+        self.var_list = var_list

    def __call__(self, theta):
-        tf.get_default_session().run(self.op, feed_dict={self.theta: theta})
+        start = 0
+        for (shape, v) in zip(self.shapes, self.var_list):
+            size = intprod(shape)
+            v.assign(tf.reshape(theta[start:start + size], shape))
+            start += size

 class GetFlat(object):
    def __init__(self, var_list):
-        self.op = tf.concat(axis=0, values=[tf.reshape(v, [numel(v)]) for v in var_list])
+        self.var_list = var_list

    def __call__(self):
-        return tf.get_default_session().run(self.op)
-
-_PLACEHOLDER_CACHE = {}  # name -> (placeholder, dtype, shape)
-
-def get_placeholder(name, dtype, shape):
-    if name in _PLACEHOLDER_CACHE:
-        out, dtype1, shape1 = _PLACEHOLDER_CACHE[name]
-        assert dtype1 == dtype and shape1 == shape
-        return out
-    else:
-        out = tf.placeholder(dtype=dtype, shape=shape, name=name)
-        _PLACEHOLDER_CACHE[name] = (out, dtype, shape)
-        return out
-
-def get_placeholder_cached(name):
-    return _PLACEHOLDER_CACHE[name][0]
+        return tf.concat(axis=0, values=[tf.reshape(v, [numel(v)]) for v in self.var_list]).numpy()

 def flattenallbut0(x):
    return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])


 # ================================================================
-# Diagnostics 
+# Shape adjustment for feeding into tf tensors
 # ================================================================
+def adjust_shape(input_tensor, data):
+    '''
+    Adjust the shape of the data to match the shape of the tensor, if possible.
+    If the shapes are incompatible, an AssertionError is raised.

-def display_var_info(vars):
-    from baselines import logger
-    count_params = 0
-    for v in vars:
-        name = v.name
-        if "/Adam" in name or "beta1_power" in name or "beta2_power" in name: continue
-        v_params = np.prod(v.shape.as_list())
-        count_params += v_params
-        if "/b:" in name or "/biases" in name: continue    # Wx+b, bias is not interesting to look at => count params, but not print
-        logger.info("   %s%s %i params %s" % (name, " "*(55-len(name)), v_params, str(v.shape)))
+    Parameters:
+        input_tensor    tensorflow input tensor

-    logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))
+        data            input data to be (potentially) reshaped to be fed into input
+
+    Returns:
+        reshaped data
+    '''
+
+    if not isinstance(data, np.ndarray) and not isinstance(data, list):
+        return data
+    if isinstance(data, list):
+        data = np.array(data)
+
+    input_shape = [x or -1 for x in input_tensor.shape.as_list()]
+
+    assert _check_shape(input_shape, data.shape), \
+        'Shape of data {} is not compatible with shape of the input {}'.format(data.shape, input_shape)
+
+    return np.reshape(data, input_shape)


-def get_available_gpus():
-    # recipe from here:
-    # https://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
- 
-    from tensorflow.python.client import device_lib
-    local_device_protos = device_lib.list_local_devices()
-    return [x.name for x in local_device_protos if x.device_type == 'GPU']
+def _check_shape(input_shape, data_shape):
+    ''' check if two shapes are compatible (i.e. differ only by dimensions of size 1, or by the batch dimension)'''
+
+    squeezed_input_shape = _squeeze_shape(input_shape)
+    squeezed_data_shape = _squeeze_shape(data_shape)
+
+    for i, s_data in enumerate(squeezed_data_shape):
+        s_input = squeezed_input_shape[i]
+        if s_input != -1 and s_data != s_input:
+            return False
+
+    return True
+
+
+def _squeeze_shape(shape):
+    return [x for x in shape if x != 1]

 # ================================================================
-# Saving variables
+# Tensorboard interfacing
 # ================================================================

-def load_state(fname):
-    saver = tf.train.Saver()
-    saver.restore(tf.get_default_session(), fname)
-
-def save_state(fname):
-    os.makedirs(os.path.dirname(fname), exist_ok=True)
-    saver = tf.train.Saver()
-    saver.save(tf.get_default_session(), fname)
-
-
+def launch_tensorboard_in_background(log_dir):
+    '''
+    To log the Tensorflow graph when using rl-algs
+    algorithms, you can run the following code
+    in your main script:
+        import threading, time
+        def start_tensorboard(session):
+            time.sleep(10) # Wait until graph is setup
+            tb_path = osp.join(logger.get_dir(), 'tb')
+            summary_writer = tf.summary.FileWriter(tb_path, graph=session.graph)
+            summary_op = tf.summary.merge_all()
+            launch_tensorboard_in_background(tb_path)
+        session = tf.get_default_session()
+        t = threading.Thread(target=start_tensorboard, args=([session]))
+        t.start()
+    '''
+    import subprocess
+    subprocess.Popen(['tensorboard', '--logdir', log_dir])
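The shape check behind adjust_shape in the tf_util.py diff above can be tried without TensorFlow. The sketch below is an approximation, not the library code: static_shape is a plain list standing in for input_tensor.shape.as_list(), the function names drop the leading underscore, and the length guard in check_shape is an addition of this sketch.

```python
import numpy as np


def squeeze_shape(shape):
    # Drop dimensions of size 1; they do not affect compatibility.
    return [d for d in shape if d != 1]


def check_shape(input_shape, data_shape):
    # Compatible if, after squeezing, every known input dimension matches;
    # -1 marks an unknown (batch) dimension and matches anything.
    squeezed_input = squeeze_shape(input_shape)
    squeezed_data = squeeze_shape(data_shape)
    # Length guard is an addition of this sketch, not part of the diff.
    if len(squeezed_data) > len(squeezed_input):
        return False
    return all(
        s_in == -1 or s_data == s_in
        for s_data, s_in in zip(squeezed_data, squeezed_input)
    )


def adjust_shape(static_shape, data):
    # static_shape stands in for input_tensor.shape.as_list(); None means unknown.
    if not isinstance(data, (np.ndarray, list)):
        return data
    data = np.asarray(data)
    input_shape = [d or -1 for d in static_shape]
    assert check_shape(input_shape, data.shape), (
        "Shape of data {} is not compatible with shape of the input {}".format(
            data.shape, input_shape
        )
    )
    return np.reshape(data, input_shape)


if __name__ == "__main__":
    batch = adjust_shape([None, 1, 4], np.zeros((3, 4)))
    print(batch.shape)
```

A (3, 4) array fed to a [None, 1, 4] input is reshaped to (3, 1, 4), while a (3, 5) array trips the assertion because 5 does not match the known dimension 4.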
--- a/baselines/common/vec_env/__init__.py
+++ b/baselines/common/vec_env/__init__.py
@@ -1,126 +1,10 @@
-from abc import ABC, abstractmethod
-from baselines import logger
+from .vec_env import AlreadySteppingError, NotSteppingError, VecEnv, VecEnvWrapper, VecEnvObservationWrapper, CloudpickleWrapper
+from .dummy_vec_env import DummyVecEnv
+from .shmem_vec_env import ShmemVecEnv
+from .subproc_vec_env import SubprocVecEnv
+from .vec_frame_stack import VecFrameStack
+from .vec_monitor import VecMonitor
+from .vec_normalize import VecNormalize
+from .vec_remove_dict_obs import VecExtractDictObs

-class AlreadySteppingError(Exception):
-    """
-    Raised when an asynchronous step is running while
-    step_async() is called again.
-    """
-    def __init__(self):
-        msg = 'already running an async step'
-        Exception.__init__(self, msg)
-
-class NotSteppingError(Exception):
-    """
-    Raised when an asynchronous step is not running but
-    step_wait() is called.
-    """
-    def __init__(self):
-        msg = 'not running an async step'
-        Exception.__init__(self, msg)
-
-class VecEnv(ABC):
-    """
-    An abstract asynchronous, vectorized environment.
-    """
-    def __init__(self, num_envs, observation_space, action_space):
-        self.num_envs = num_envs
-        self.observation_space = observation_space
-        self.action_space = action_space
-
-    @abstractmethod
-    def reset(self):
-        """
-        Reset all the environments and return an array of
-        observations, or a tuple of observation arrays.
-
-        If step_async is still doing work, that work will
-        be cancelled and step_wait() should not be called
-        until step_async() is invoked again.
-        """
-        pass
-
-    @abstractmethod
-    def step_async(self, actions):
-        """
-        Tell all the environments to start taking a step
-        with the given actions.
-        Call step_wait() to get the results of the step.
-
-        You should not call this if a step_async run is
-        already pending.
-        """
-        pass
-
-    @abstractmethod
-    def step_wait(self):
-        """
-        Wait for the step taken with step_async().
-
-        Returns (obs, rews, dones, infos):
-         - obs: an array of observations, or a tuple of
-                arrays of observations.
-         - rews: an array of rewards
-         - dones: an array of "episode done" booleans
-         - infos: a sequence of info objects
-        """
-        pass
-
-    @abstractmethod
-    def close(self):
-        """
-        Clean up the environments' resources.
-        """
-        pass
-
-    def step(self, actions):
-        self.step_async(actions)
-        return self.step_wait()
-
-    def render(self, mode='human'):
-        logger.warn('Render not defined for %s'%self)
-
-    @property
-    def unwrapped(self):
-        if isinstance(self, VecEnvWrapper):
-            return self.venv.unwrapped
-        else:
-            return self
-
-class VecEnvWrapper(VecEnv):
-    def __init__(self, venv, observation_space=None, action_space=None):
-        self.venv = venv
-        VecEnv.__init__(self, 
-            num_envs=venv.num_envs,
-            observation_space=observation_space or venv.observation_space, 
-            action_space=action_space or venv.action_space)
-
-    def step_async(self, actions):
-        self.venv.step_async(actions)
-
-    @abstractmethod
-    def reset(self):
-        pass
-
-    @abstractmethod
-    def step_wait(self):
-        pass
-
-    def close(self):
-        return self.venv.close()
-
-    def render(self):
-        self.venv.render()
-
-class CloudpickleWrapper(object):
-    """
-    Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
-    """
-    def __init__(self, x):
-        self.x = x
-    def __getstate__(self):
-        import cloudpickle
-        return cloudpickle.dumps(self.x)
-    def __setstate__(self, ob):
-        import pickle
-        self.x = pickle.loads(ob)
+__all__ = ['AlreadySteppingError', 'NotSteppingError', 'VecEnv', 'VecEnvWrapper', 'VecEnvObservationWrapper', 'CloudpickleWrapper', 'DummyVecEnv', 'ShmemVecEnv', 'SubprocVecEnv', 'VecFrameStack', 'VecMonitor', 'VecNormalize', 'VecExtractDictObs']
--- a/baselines/common/vec_env/dummy_vec_env.py
+++ b/baselines/common/vec_env/dummy_vec_env.py
@@ -1,40 +1,54 @@
 import numpy as np
-from gym import spaces
-from collections import OrderedDict
-from . import VecEnv
+from .vec_env import VecEnv
+from .util import copy_obs_dict, dict_to_obs, obs_space_info

 class DummyVecEnv(VecEnv):
+    """
+    VecEnv that runs multiple environments sequentially, that is,
+    the step and reset commands are sent to one environment at a time.
+    Useful when debugging and when num_envs == 1 (in the latter case,
+    avoids communication overhead)
+    """
    def __init__(self, env_fns):
+        """
+        Arguments:
+
+        env_fns: iterable of callables      functions that build environments
+        """
        self.envs = [fn() for fn in env_fns]
        env = self.envs[0]
        VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
-        shapes, dtypes = {}, {}
-        self.keys = []
        obs_space = env.observation_space
+        self.keys, shapes, dtypes = obs_space_info(obs_space)

-        if isinstance(obs_space, spaces.Dict):
-            assert isinstance(obs_space.spaces, OrderedDict)
-            subspaces = obs_space.spaces
-        else:
-            subspaces = {None: obs_space}
-
-        for key, box in subspaces.items():
-            shapes[key] = box.shape
-            dtypes[key] = box.dtype
-            self.keys.append(key)
-        
        self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
        self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool)
        self.buf_rews  = np.zeros((self.num_envs,), dtype=np.float32)
        self.buf_infos = [{} for _ in range(self.num_envs)]
        self.actions = None
+        self.spec = self.envs[0].spec

    def step_async(self, actions):
-        self.actions = actions
+        listify = True
+        try:
+            if len(actions) == self.num_envs:
+                listify = False
+        except TypeError:
+            pass
+
+        if not listify:
+            self.actions = actions
+        else:
+            assert self.num_envs == 1, "actions {} is either not a list or has a wrong size - cannot match to {} environments".format(actions, self.num_envs)
+            self.actions = [actions]

    def step_wait(self):
        for e in range(self.num_envs):
-            obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(self.actions[e])
+            action = self.actions[e]
+            # if isinstance(self.envs[e].action_space, spaces.Discrete):
+            #    action = int(action)
+
+            obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(action)
            if self.buf_dones[e]:
                obs = self.envs[e].reset()
            self._save_obs(e, obs)
@@ -47,12 +61,6 @@ class DummyVecEnv(VecEnv):
            self._save_obs(e, obs)
        return self._obs_from_buf()

-    def close(self):
-        return
-
-    def render(self, mode='human'):
-        return [e.render(mode=mode) for e in self.envs]
-
    def _save_obs(self, e, obs):
        for k in self.keys:
            if k is None:
@@ -61,7 +69,13 @@ class DummyVecEnv(VecEnv):
                self.buf_obs[k][e] = obs[k]

    def _obs_from_buf(self):
-        if self.keys==[None]:
-            return self.buf_obs[None]
+        return dict_to_obs(copy_obs_dict(self.buf_obs))
+
+    def get_images(self):
+        return [env.render(mode='rgb_array') for env in self.envs]
+
+    def render(self, mode='human'):
+        if self.num_envs == 1:
+            return self.envs[0].render(mode=mode)
        else:
-            return self.buf_obs
+            return super().render(mode=mode)
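The new step_async in the dummy_vec_env.py diff above accepts either one action per environment or, for a single environment, a bare action. A minimal stand-alone sketch of that rule follows; normalize_actions is a hypothetical name, and the logic is copied from the diff rather than imported from baselines.

```python
def normalize_actions(actions, num_envs):
    """Hypothetical stand-alone version of the check in step_async."""
    try:
        if len(actions) == num_envs:
            return actions          # already one action per environment
    except TypeError:
        pass                        # scalars have no len()
    assert num_envs == 1, (
        "actions {} is either not a list or has a wrong size - "
        "cannot match to {} environments".format(actions, num_envs)
    )
    return [actions]                # wrap a bare action for the single env


if __name__ == "__main__":
    print(normalize_actions(3, 1))
    print(normalize_actions([0.1, 0.2], 1))
    print(normalize_actions([0, 1], 2))
```

Note the second call: with one environment, a two-element action vector has the wrong length to be a per-environment list, so it is treated as a single action and wrapped.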
--- a/baselines/common/vec_env/shmem_vec_env.py
+++ b/baselines/common/vec_env/shmem_vec_env.py
@@ -0,0 +1,139 @@
+"""
+An interface for asynchronous vectorized environments.
+"""
+
+import multiprocessing as mp
+import numpy as np
+from .vec_env import VecEnv, CloudpickleWrapper, clear_mpi_env_vars
+import ctypes
+from baselines import logger
+
+from .util import dict_to_obs, obs_space_info, obs_to_dict
+
+_NP_TO_CT = {np.float32: ctypes.c_float,
+             np.int32: ctypes.c_int32,
+             np.int8: ctypes.c_int8,
+             np.uint8: ctypes.c_char,
+             np.bool: ctypes.c_bool}
+
+
+class ShmemVecEnv(VecEnv):
+    """
+    Optimized version of SubprocVecEnv that uses shared variables to communicate observations.
+    """
+
+    def __init__(self, env_fns, spaces=None, context='spawn'):
+        """
+        If you don't specify spaces (an (observation_space, action_space) pair),
+        we'll have to create a dummy environment to get them.
+        """
+        ctx = mp.get_context(context)
+        if spaces:
+            observation_space, action_space = spaces
+        else:
+            logger.log('Creating dummy env object to get spaces')
+            with logger.scoped_configure(format_strs=[]):
+                dummy = env_fns[0]()
+                observation_space, action_space = dummy.observation_space, dummy.action_space
+                dummy.close()
+                del dummy
+        VecEnv.__init__(self, len(env_fns), observation_space, action_space)
+        self.obs_keys, self.obs_shapes, self.obs_dtypes = obs_space_info(observation_space)
+        self.obs_bufs = [
+            {k: ctx.Array(_NP_TO_CT[self.obs_dtypes[k].type], int(np.prod(self.obs_shapes[k]))) for k in self.obs_keys}
+            for _ in env_fns]
+        self.parent_pipes = []
+        self.procs = []
+        with clear_mpi_env_vars():
+            for env_fn, obs_buf in zip(env_fns, self.obs_bufs):
+                wrapped_fn = CloudpickleWrapper(env_fn)
+                parent_pipe, child_pipe = ctx.Pipe()
+                proc = ctx.Process(target=_subproc_worker,
+                            args=(child_pipe, parent_pipe, wrapped_fn, obs_buf, self.obs_shapes, self.obs_dtypes, self.obs_keys))
+                proc.daemon = True
+                self.procs.append(proc)
+                self.parent_pipes.append(parent_pipe)
+                proc.start()
+                child_pipe.close()
+        self.waiting_step = False
+        self.viewer = None
+
+    def reset(self):
+        if self.waiting_step:
+            logger.warn('Called reset() while waiting for the step to complete')
+            self.step_wait()
+        for pipe in self.parent_pipes:
+            pipe.send(('reset', None))
+        return self._decode_obses([pipe.recv() for pipe in self.parent_pipes])
+
+    def step_async(self, actions):
+        assert len(actions) == len(self.parent_pipes)
+        for pipe, act in zip(self.parent_pipes, actions):
+            pipe.send(('step', act))
+
+    def step_wait(self):
+        outs = [pipe.recv() for pipe in self.parent_pipes]
+        obs, rews, dones, infos = zip(*outs)
+        return self._decode_obses(obs), np.array(rews), np.array(dones), infos
+
+    def close_extras(self):
+        if self.waiting_step:
+            self.step_wait()
+        for pipe in self.parent_pipes:
+            pipe.send(('close', None))
+        for pipe in self.parent_pipes:
+            pipe.recv()
+            pipe.close()
+        for proc in self.procs:
+            proc.join()
+
+    def get_images(self, mode='human'):
+        for pipe in self.parent_pipes:
+            pipe.send(('render', None))
+        return [pipe.recv() for pipe in self.parent_pipes]
+
+    def _decode_obses(self, obs):
+        result = {}
+        for k in self.obs_keys:
+
+            bufs = [b[k] for b in self.obs_bufs]
+            o = [np.frombuffer(b.get_obj(), dtype=self.obs_dtypes[k]).reshape(self.obs_shapes[k]) for b in bufs]
+            result[k] = np.array(o)
+        return dict_to_obs(result)
+
+
+def _subproc_worker(pipe, parent_pipe, env_fn_wrapper, obs_bufs, obs_shapes, obs_dtypes, keys):
+    """
+    Control a single environment instance using IPC and
+    shared memory.
+    """
+    def _write_obs(maybe_dict_obs):
+        flatdict = obs_to_dict(maybe_dict_obs)
+        for k in keys:
+            dst = obs_bufs[k].get_obj()
+            dst_np = np.frombuffer(dst, dtype=obs_dtypes[k]).reshape(obs_shapes[k])  # pylint: disable=W0212
+            np.copyto(dst_np, flatdict[k])
+
+    env = env_fn_wrapper.x()
+    parent_pipe.close()
+    try:
+        while True:
+            cmd, data = pipe.recv()
+            if cmd == 'reset':
+                pipe.send(_write_obs(env.reset()))
+            elif cmd == 'step':
+                obs, reward, done, info = env.step(data)
+                if done:
+                    obs = env.reset()
+                pipe.send((_write_obs(obs), reward, done, info))
+            elif cmd == 'render':
+                pipe.send(env.render(mode='rgb_array'))
+            elif cmd == 'close':
+                pipe.send(None)
+                break
+            else:
+                raise RuntimeError('Got unrecognized cmd %s' % cmd)
+    except KeyboardInterrupt:
+        print('ShmemVecEnv worker: got KeyboardInterrupt')
+    finally:
+        env.close()
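ShmemVecEnv above moves observations through shared buffers rather than through the pipes: a worker writes into a NumPy view of a multiprocessing Array, and the parent reads the same memory through its own view. The sketch below shows only that buffer mechanics in a single process, as a hedged illustration. No worker is spawned, as_ndarray is a hypothetical helper, and the shape and dtype are made up for the example.

```python
import ctypes
import multiprocessing as mp

import numpy as np

shape, dtype = (2, 3), np.float32
ctx = mp.get_context("spawn")
# Flat shared buffer with one c_float per observation element.
buf = ctx.Array(ctypes.c_float, int(np.prod(shape)))


def as_ndarray(shared, shape, dtype):
    # Wrap the shared buffer as an ndarray without copying.
    return np.frombuffer(shared.get_obj(), dtype=dtype).reshape(shape)


writer_view = as_ndarray(buf, shape, dtype)   # what a worker would write through
reader_view = as_ndarray(buf, shape, dtype)   # what the parent would read through
np.copyto(writer_view, np.arange(6, dtype=dtype).reshape(shape))
snapshot = np.array(reader_view)              # copy out, as _decode_obses does
```

Copying in the reader matters: without the np.array call, the returned observation would change underneath the caller the next time a worker writes a step.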
--- a/baselines/common/vec_env/subproc_vec_env.py
+++ b/baselines/common/vec_env/subproc_vec_env.py
@@ -1,97 +1,117 @@
+import multiprocessing as mp
+
 import numpy as np
-from multiprocessing import Process, Pipe
-from baselines.common.vec_env import VecEnv, CloudpickleWrapper
-from baselines.common.tile_images import tile_images
+from .vec_env import VecEnv, CloudpickleWrapper, clear_mpi_env_vars


 def worker(remote, parent_remote, env_fn_wrapper):
    parent_remote.close()
    env = env_fn_wrapper.x()
-    while True:
-        cmd, data = remote.recv()
-        if cmd == 'step':
-            ob, reward, done, info = env.step(data)
-            if done:
+    try:
+        while True:
+            cmd, data = remote.recv()
+            if cmd == 'step':
+                ob, reward, done, info = env.step(data)
+                if done:
+                    ob = env.reset()
+                remote.send((ob, reward, done, info))
+            elif cmd == 'reset':
                ob = env.reset()
-            remote.send((ob, reward, done, info))
-        elif cmd == 'reset':
-            ob = env.reset()
-            remote.send(ob)
-        elif cmd == 'render':
-            remote.send(env.render(mode='rgb_array'))
-        elif cmd == 'close':
-            remote.close()
-            break
-        elif cmd == 'get_spaces':
-            remote.send((env.observation_space, env.action_space))
-        else:
-            raise NotImplementedError
+                remote.send(ob)
+            elif cmd == 'render':
+                remote.send(env.render(mode='rgb_array'))
+            elif cmd == 'close':
+                remote.close()
+                break
+            elif cmd == 'get_spaces_spec':
+                remote.send((env.observation_space, env.action_space, env.spec))
+            else:
+                raise NotImplementedError
+    except KeyboardInterrupt:
+        print('SubprocVecEnv worker: got KeyboardInterrupt')
+    finally:
+        env.close()


 class SubprocVecEnv(VecEnv):
-    def __init__(self, env_fns, spaces=None):
+    """
+    VecEnv that runs multiple environments in parallel in subprocesses and communicates with them via pipes.
+    Recommended to use when num_envs > 1 and step() can be a bottleneck.
+    """
+    def __init__(self, env_fns, spaces=None, context='spawn'):
        """
-        envs: list of gym environments to run in subprocesses
+        Arguments:
+
+        env_fns: iterable of callables -  functions that create environments to run in subprocesses. Need to be cloud-pickleable
        """
        self.waiting = False
        self.closed = False
        nenvs = len(env_fns)
-        self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
-        self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
-            for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
+        ctx = mp.get_context(context)
+        self.remotes, self.work_remotes = zip(*[ctx.Pipe() for _ in range(nenvs)])
+        self.ps = [ctx.Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
+                   for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
        for p in self.ps:
-            p.daemon = True # if the main process crashes, we should not cause things to hang
-            p.start()
+            p.daemon = True  # if the main process crashes, we should not cause things to hang
+            with clear_mpi_env_vars():
+                p.start()
        for remote in self.work_remotes:
            remote.close()

-        self.remotes[0].send(('get_spaces', None))
-        observation_space, action_space = self.remotes[0].recv()
+        self.remotes[0].send(('get_spaces_spec', None))
+        observation_space, action_space, self.spec = self.remotes[0].recv()
+        self.viewer = None
        VecEnv.__init__(self, len(env_fns), observation_space, action_space)

    def step_async(self, actions):
+        self._assert_not_closed()
        for remote, action in zip(self.remotes, actions):
            remote.send(('step', action))
        self.waiting = True

    def step_wait(self):
+        self._assert_not_closed()
        results = [remote.recv() for remote in self.remotes]
        self.waiting = False
        obs, rews, dones, infos = zip(*results)
-        return np.stack(obs), np.stack(rews), np.stack(dones), infos
+        return _flatten_obs(obs), np.stack(rews), np.stack(dones), infos

    def reset(self):
+        self._assert_not_closed()
        for remote in self.remotes:
            remote.send(('reset', None))
-        return np.stack([remote.recv() for remote in self.remotes])
+        return _flatten_obs([remote.recv() for remote in self.remotes])

-    def reset_task(self):
-        for remote in self.remotes:
-            remote.send(('reset_task', None))
-        return np.stack([remote.recv() for remote in self.remotes])
-
-    def close(self):
-        if self.closed:
-            return
+    def close_extras(self):
+        self.closed = True
        if self.waiting:
-            for remote in self.remotes:            
+            for remote in self.remotes:
                remote.recv()
        for remote in self.remotes:
            remote.send(('close', None))
        for p in self.ps:
            p.join()
-        self.closed = True

-    def render(self, mode='human'):
+    def get_images(self):
+        self._assert_not_closed()
        for pipe in self.remotes:
            pipe.send(('render', None))
        imgs = [pipe.recv() for pipe in self.remotes]
-        bigimg = tile_images(imgs)
-        if mode == 'human':
-            import cv2
-            cv2.imshow('vecenv', bigimg[:,:,::-1])
-            cv2.waitKey(1)
-        elif mode == 'rgb_array':
-            return bigimg
-        else:
-            raise NotImplementedError
+        return imgs
+
+    def _assert_not_closed(self):
+        assert not self.closed, "Trying to operate on a SubprocVecEnv after calling close()"
+
+    def __del__(self):
+        if not self.closed:
+            self.close()
+
+def _flatten_obs(obs):
+    assert isinstance(obs, (list, tuple))
+    assert len(obs) > 0
+
+    if isinstance(obs[0], dict):
+        keys = obs[0].keys()
+        return {k: np.stack([o[k] for o in obs]) for k in keys}
+    else:
+        return np.stack(obs)
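The switch from np.stack to _flatten_obs in the subproc_vec_env.py diff above is what lets SubprocVecEnv batch dictionary observations. A small sketch of the effect follows. The function body repeats the logic from the diff so the snippet runs alone, and the "image" and "speed" keys are invented for the example.

```python
import numpy as np


def flatten_obs(obs):
    # Same logic as _flatten_obs in the diff, repeated so the snippet runs alone.
    assert isinstance(obs, (list, tuple))
    assert len(obs) > 0
    if isinstance(obs[0], dict):
        # Stack each key separately: a list of dicts becomes a dict of batches.
        return {k: np.stack([o[k] for o in obs]) for k in obs[0].keys()}
    return np.stack(obs)


# Three hypothetical environments returning dictionary observations.
per_env = [{"image": np.zeros((4, 4)), "speed": np.array([float(i)])} for i in range(3)]
batched = flatten_obs(per_env)
```

Each key gains a leading axis of size num_envs, so batched["image"] has shape (3, 4, 4) and batched["speed"] has shape (3, 1). Plain array observations still go through np.stack as before.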
--- a/baselines/common/vec_env/test_vec_env.py
+++ b/baselines/common/vec_env/test_vec_env.py
@@ -0,0 +1,114 @@
+"""
+Tests for asynchronous vectorized environments.
+"""
+
+import gym
+import numpy as np
+import pytest
+from .dummy_vec_env import DummyVecEnv
+from .shmem_vec_env import ShmemVecEnv
+from .subproc_vec_env import SubprocVecEnv
+from baselines.common.tests.test_with_mpi import with_mpi
+
+
+def assert_venvs_equal(venv1, venv2, num_steps):
+    """
+    Compare two environments over num_steps steps and make sure
+    that the observations produced by each are the same when given
+    the same actions.
+    """
+    assert venv1.num_envs == venv2.num_envs
+    assert venv1.observation_space.shape == venv2.observation_space.shape
+    assert venv1.observation_space.dtype == venv2.observation_space.dtype
+    assert venv1.action_space.shape == venv2.action_space.shape
+    assert venv1.action_space.dtype == venv2.action_space.dtype
+
+    try:
+        obs1, obs2 = venv1.reset(), venv2.reset()
+        assert np.array(obs1).shape == np.array(obs2).shape
+        assert np.array(obs1).shape == (venv1.num_envs,) + venv1.observation_space.shape
+        assert np.allclose(obs1, obs2)
+        venv1.action_space.seed(1337)
+        for _ in range(num_steps):
+            actions = np.array([venv1.action_space.sample() for _ in range(venv1.num_envs)])
+            for venv in [venv1, venv2]:
+                venv.step_async(actions)
+            outs1 = venv1.step_wait()
+            outs2 = venv2.step_wait()
+            for out1, out2 in zip(outs1[:3], outs2[:3]):
+                assert np.array(out1).shape == np.array(out2).shape
+                assert np.allclose(out1, out2)
+            assert list(outs1[3]) == list(outs2[3])
+    finally:
+        venv1.close()
+        venv2.close()
+
+
+@pytest.mark.parametrize('klass', (ShmemVecEnv, SubprocVecEnv))
+@pytest.mark.parametrize('dtype', ('uint8', 'float32'))
+def test_vec_env(klass, dtype):  # pylint: disable=R0914
+    """
+    Test that a vectorized environment is equivalent to
+    DummyVecEnv, since DummyVecEnv is less likely to be
+    error prone.
+    """
+    num_envs = 3
+    num_steps = 100
+    shape = (3, 8)
+
+    def make_fn(seed):
+        """
+        Get an environment constructor with a seed.
+        """
+        return lambda: SimpleEnv(seed, shape, dtype)
+    fns = [make_fn(i) for i in range(num_envs)]
+    env1 = DummyVecEnv(fns)
+    env2 = klass(fns)
+    assert_venvs_equal(env1, env2, num_steps=num_steps)
+
+
+class SimpleEnv(gym.Env):
+    """
+    An environment with a pre-determined observation space
+    and RNG seed.
+    """
+
+    def __init__(self, seed, shape, dtype):
+        np.random.seed(seed)
+        self._dtype = dtype
+        self._start_obs = np.array(np.random.randint(0, 0x100, size=shape),
+                                   dtype=dtype)
+        self._max_steps = seed + 1
+        self._cur_obs = None
+        self._cur_step = 0
+        # this is 0xFF instead of 0x100 because the Box space includes
+        # the high end, while randint does not
+        self.action_space = gym.spaces.Box(low=0, high=0xFF, shape=shape, dtype=dtype)
+        self.observation_space = self.action_space
+
+    def step(self, action):
+        self._cur_obs += np.array(action, dtype=self._dtype)
+        self._cur_step += 1
+        done = self._cur_step >= self._max_steps
+        reward = self._cur_step / self._max_steps
+        return self._cur_obs, reward, done, {'foo': 'bar' + str(reward)}
+
+    def reset(self):
+        self._cur_obs = self._start_obs.copy()
+        self._cur_step = 0
+        return self._cur_obs
+
+    def render(self, mode=None):
+        raise NotImplementedError
+
+
+
+@with_mpi()
+def test_mpi_with_subprocvecenv():
+    shape = (2,3,4)
+    nenv = 1
+    venv = SubprocVecEnv([lambda: SimpleEnv(0, shape, 'float32')] * nenv)
+    ob = venv.reset()
+    venv.close()
+    assert ob.shape == (nenv,) + shape
+
--- a/baselines/common/vec_env/test_video_recorder.py
+++ b/baselines/common/vec_env/test_video_recorder.py
@@ -0,0 +1,49 @@
+"""
+Tests for the VecVideoRecorder wrapper.
+"""
+
+import gym
+import pytest
+import os
+import glob
+import tempfile
+
+from .dummy_vec_env import DummyVecEnv
+from .shmem_vec_env import ShmemVecEnv
+from .subproc_vec_env import SubprocVecEnv
+from .vec_video_recorder import VecVideoRecorder
+
+@pytest.mark.parametrize('klass', (DummyVecEnv, ShmemVecEnv, SubprocVecEnv))
+@pytest.mark.parametrize('num_envs', (1, 4))
+@pytest.mark.parametrize('video_length', (10, 100))
+@pytest.mark.parametrize('video_interval', (1, 50))
+def test_video_recorder(klass, num_envs, video_length, video_interval):
+    """
+    Wrap an existing VecEnv with VecVideoRecorder,
+    take (video_interval + video_length + 1) steps,
+    then check that the video files are present.
+    """
+
+    def make_fn():
+        env = gym.make('PongNoFrameskip-v4')
+        return env
+    fns = [make_fn for _ in range(num_envs)]
+    env = klass(fns)
+
+    with tempfile.TemporaryDirectory() as video_path:
+        env = VecVideoRecorder(env, video_path, record_video_trigger=lambda x: x % video_interval == 0, video_length=video_length)
+
+        env.reset()
+        for _ in range(video_interval + video_length + 1):
+            env.step([0] * num_envs)
+        env.close()
+
+
+        recorded_video = glob.glob(os.path.join(video_path, "*.mp4"))
+
+        # first and second step
+        assert len(recorded_video) == 2
+        # Files are not empty
+        assert all(os.stat(p).st_size != 0 for p in recorded_video)
+
+
--- a/baselines/common/vec_env/util.py
+++ b/baselines/common/vec_env/util.py
@@ -0,0 +1,59 @@
+"""
+Helpers for dealing with vectorized environments.
+"""
+
+from collections import OrderedDict
+
+import gym
+import numpy as np
+
+
+def copy_obs_dict(obs):
+    """
+    Deep-copy an observation dict.
+    """
+    return {k: np.copy(v) for k, v in obs.items()}
+
+
+def dict_to_obs(obs_dict):
+    """
+    Convert an observation dict into a raw array if the
+    original observation space was not a Dict space.
+    """
+    if set(obs_dict.keys()) == {None}:
+        return obs_dict[None]
+    return obs_dict
+
+
+def obs_space_info(obs_space):
+    """
+    Get dict-structured information about a gym.Space.
+
+    Returns:
+      A tuple (keys, shapes, dtypes):
+        keys: a list of dict keys.
+        shapes: a dict mapping keys to shapes.
+        dtypes: a dict mapping keys to dtypes.
+    """
+    if isinstance(obs_space, gym.spaces.Dict):
+        assert isinstance(obs_space.spaces, OrderedDict)
+        subspaces = obs_space.spaces
+    else:
+        subspaces = {None: obs_space}
+    keys = []
+    shapes = {}
+    dtypes = {}
+    for key, box in subspaces.items():
+        keys.append(key)
+        shapes[key] = box.shape
+        dtypes[key] = box.dtype
+    return keys, shapes, dtypes
+
+
+def obs_to_dict(obs):
+    """
+    Convert an observation into a dict.
+    """
+    if isinstance(obs, dict):
+        return obs
+    return {None: obs}
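The two conversion helpers above let a vectorized environment treat every observation as a dict internally: a non-dict observation is stored under the single key `None` and unwrapped again on the way out. A minimal sketch of that round trip follows. It re-declares both functions from this diff so the snippet runs on its own, and the sample observations are made up for illustration.

```python
def obs_to_dict(obs):
    """Wrap a raw observation under the key None; pass dicts through."""
    if isinstance(obs, dict):
        return obs
    return {None: obs}


def dict_to_obs(obs_dict):
    """Undo obs_to_dict: unwrap the {None: obs} form, keep real dicts."""
    if set(obs_dict.keys()) == {None}:
        return obs_dict[None]
    return obs_dict


# Hypothetical sample observations, not taken from the test suite.
raw_obs = [0.1, 0.2, 0.3]
dict_obs = {"image": [1, 2], "goal": [3]}

wrapped = obs_to_dict(raw_obs)
roundtrip_raw = dict_to_obs(wrapped)
roundtrip_dict = dict_to_obs(obs_to_dict(dict_obs))
```

Because neither helper copies anything, the round trip returns the very same objects that went in; `copy_obs_dict` is the helper to reach for when an independent copy is needed.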
--- a/baselines/common/vec_env/vec_env.py
+++ b/baselines/common/vec_env/vec_env.py
@@ -0,0 +1,218 @@
+import contextlib
+import os
+from abc import ABC, abstractmethod
+
+from baselines.common.tile_images import tile_images
+
+class AlreadySteppingError(Exception):
+    """
+    Raised when an asynchronous step is running while
+    step_async() is called again.
+    """
+
+    def __init__(self):
+        msg = 'already running an async step'
+        Exception.__init__(self, msg)
+
+
+class NotSteppingError(Exception):
+    """
+    Raised when an asynchronous step is not running but
+    step_wait() is called.
+    """
+
+    def __init__(self):
+        msg = 'not running an async step'
+        Exception.__init__(self, msg)
+
+
+class VecEnv(ABC):
+    """
+    An abstract asynchronous, vectorized environment.
+    Used to batch data from multiple copies of an environment, so that
+    each observation becomes a batch of observations, and the expected action is a batch of actions to
+    be applied per-environment.
+    """
+    closed = False
+    viewer = None
+
+    metadata = {
+        'render.modes': ['human', 'rgb_array']
+    }
+
+    def __init__(self, num_envs, observation_space, action_space):
+        self.num_envs = num_envs
+        self.observation_space = observation_space
+        self.action_space = action_space
+
+    @abstractmethod
+    def reset(self):
+        """
+        Reset all the environments and return an array of
+        observations, or a dict of observation arrays.
+
+        If step_async is still doing work, that work will
+        be cancelled and step_wait() should not be called
+        until step_async() is invoked again.
+        """
+        pass
+
+    @abstractmethod
+    def step_async(self, actions):
+        """
+        Tell all the environments to start taking a step
+        with the given actions.
+        Call step_wait() to get the results of the step.
+
+        You should not call this if a step_async run is
+        already pending.
+        """
+        pass
+
+    @abstractmethod
+    def step_wait(self):
+        """
+        Wait for the step taken with step_async().
+
+        Returns (obs, rews, dones, infos):
+         - obs: an array of observations, or a dict of
+                arrays of observations.
+         - rews: an array of rewards
+         - dones: an array of "episode done" booleans
+         - infos: a sequence of info objects
+        """
+        pass
+
+    def close_extras(self):
+        """
+        Clean up the extra resources, beyond what's in this base class.
+        Only runs when not self.closed.
+        """
+        pass
+
+    def close(self):
+        if self.closed:
+            return
+        if self.viewer is not None:
+            self.viewer.close()
+        self.close_extras()
+        self.closed = True
+
+    def step(self, actions):
+        """
+        Step the environments synchronously.
+
+        This is available for backwards compatibility.
+        """
+        self.step_async(actions)
+        return self.step_wait()
+
+    def render(self, mode='human'):
+        imgs = self.get_images()
+        bigimg = tile_images(imgs)
+        if mode == 'human':
+            self.get_viewer().imshow(bigimg)
+            return self.get_viewer().isopen
+        elif mode == 'rgb_array':
+            return bigimg
+        else:
+            raise NotImplementedError
+
+    def get_images(self):
+        """
+        Return RGB images from each environment
+        """
+        raise NotImplementedError
+
+    @property
+    def unwrapped(self):
+        if isinstance(self, VecEnvWrapper):
+            return self.venv.unwrapped
+        else:
+            return self
+
+    def get_viewer(self):
+        if self.viewer is None:
+            from gym.envs.classic_control import rendering
+            self.viewer = rendering.SimpleImageViewer()
+        return self.viewer
+
+class VecEnvWrapper(VecEnv):
+    """
+    An environment wrapper that applies to an entire batch
+    of environments at once.
+    """
+
+    def __init__(self, venv, observation_space=None, action_space=None):
+        self.venv = venv
+        super().__init__(num_envs=venv.num_envs,
+                         observation_space=observation_space or venv.observation_space,
+                         action_space=action_space or venv.action_space)
+
+    def step_async(self, actions):
+        self.venv.step_async(actions)
+
+    @abstractmethod
+    def reset(self):
+        pass
+
+    @abstractmethod
+    def step_wait(self):
+        pass
+
+    def close(self):
+        return self.venv.close()
+
+    def render(self, mode='human'):
+        return self.venv.render(mode=mode)
+
+    def get_images(self):
+        return self.venv.get_images()
+
+class VecEnvObservationWrapper(VecEnvWrapper):
+    @abstractmethod
+    def process(self, obs):
+        pass
+
+    def reset(self):
+        obs = self.venv.reset()
+        return self.process(obs)
+
+    def step_wait(self):
+        obs, rews, dones, infos = self.venv.step_wait()
+        return self.process(obs), rews, dones, infos
+
+class CloudpickleWrapper(object):
+    """
+    Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
+    """
+
+    def __init__(self, x):
+        self.x = x
+
+    def __getstate__(self):
+        import cloudpickle
+        return cloudpickle.dumps(self.x)
+
+    def __setstate__(self, ob):
+        import pickle
+        self.x = pickle.loads(ob)
+
+
+@contextlib.contextmanager
+def clear_mpi_env_vars():
+    """
+    `from mpi4py import MPI` calls MPI_Init by default. If the child process inherits MPI environment variables, MPI will treat it as an MPI process just like the parent and may misbehave (for example, hang).
+    This context manager is a hacky way to clear those environment variables temporarily, e.g. while starting multiprocessing
+    Processes.
+    """
+    removed_environment = {}
+    for k, v in list(os.environ.items()):
+        for prefix in ['OMPI_', 'PMI_']:
+            if k.startswith(prefix):
+                removed_environment[k] = v
+                del os.environ[k]
+    try:
+        yield
+    finally:
+        os.environ.update(removed_environment)
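The effect of `clear_mpi_env_vars` can be checked without MPI installed. The sketch below repeats the context manager from this diff so it is self-contained, sets a made-up variable (`OMPI_EXAMPLE_VAR` is an assumption used only for this demonstration), and records whether it is visible inside and after the `with` block.

```python
import contextlib
import os


@contextlib.contextmanager
def clear_mpi_env_vars():
    """Same logic as the context manager in the diff above."""
    removed_environment = {}
    for k, v in list(os.environ.items()):
        for prefix in ['OMPI_', 'PMI_']:
            if k.startswith(prefix):
                removed_environment[k] = v
                del os.environ[k]
    try:
        yield
    finally:
        # Restore whatever was removed, even if the body raised.
        os.environ.update(removed_environment)


# OMPI_EXAMPLE_VAR is a made-up name used only for this demonstration.
os.environ['OMPI_EXAMPLE_VAR'] = '1'

with clear_mpi_env_vars():
    # A multiprocessing.Process started here would not inherit the variable.
    visible_inside = 'OMPI_EXAMPLE_VAR' in os.environ

visible_after = 'OMPI_EXAMPLE_VAR' in os.environ
```

Restoring in a `finally` clause means the parent process gets its MPI variables back even when starting the child processes fails.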
--- a/baselines/common/vec_env/vec_frame_stack.py
+++ b/baselines/common/vec_env/vec_frame_stack.py
@@ -1,18 +1,16 @@
-from baselines.common.vec_env import VecEnvWrapper
+from .vec_env import VecEnvWrapper
 import numpy as np
 from gym import spaces

+
 class VecFrameStack(VecEnvWrapper):
-    """
-    Vectorized environment base class
-    """
    def __init__(self, venv, nstack):
        self.venv = venv
        self.nstack = nstack
-        wos = venv.observation_space # wrapped ob space
+        wos = venv.observation_space  # wrapped ob space
        low = np.repeat(wos.low, self.nstack, axis=-1)
        high = np.repeat(wos.high, self.nstack, axis=-1)
-        self.stackedobs = np.zeros((venv.num_envs,)+low.shape, low.dtype)
+        self.stackedobs = np.zeros((venv.num_envs,) + low.shape, low.dtype)
        observation_space = spaces.Box(low=low, high=high, dtype=venv.observation_space.dtype)
        VecEnvWrapper.__init__(self, venv, observation_space=observation_space)

@@ -26,13 +24,7 @@ class VecFrameStack(VecEnvWrapper):
        return self.stackedobs, rews, news, infos

    def reset(self):
-        """
-        Reset all environments
-        """
        obs = self.venv.reset()
        self.stackedobs[...] = 0
        self.stackedobs[..., -obs.shape[-1]:] = obs
        return self.stackedobs
-
-    def close(self):
-        self.venv.close()
--- a/baselines/common/vec_env/vec_monitor.py
+++ b/baselines/common/vec_env/vec_monitor.py
@@ -0,0 +1,49 @@
+from . import VecEnvWrapper
+from baselines.bench.monitor import ResultsWriter
+import numpy as np
+import time
+from collections import deque
+
+class VecMonitor(VecEnvWrapper):
+    def __init__(self, venv, filename=None, keep_buf=0):
+        VecEnvWrapper.__init__(self, venv)
+        self.eprets = None
+        self.eplens = None
+        self.epcount = 0
+        self.tstart = time.time()
+        if filename:
+            self.results_writer = ResultsWriter(filename, header={'t_start': self.tstart})
+        else:
+            self.results_writer = None
+        self.keep_buf = keep_buf
+        if self.keep_buf:
+            self.epret_buf = deque([], maxlen=keep_buf)
+            self.eplen_buf = deque([], maxlen=keep_buf)
+
+    def reset(self):
+        obs = self.venv.reset()
+        self.eprets = np.zeros(self.num_envs, 'f')
+        self.eplens = np.zeros(self.num_envs, 'i')
+        return obs
+
+    def step_wait(self):
+        obs, rews, dones, infos = self.venv.step_wait()
+        self.eprets += rews
+        self.eplens += 1
+        newinfos = []
+        for (i, (done, ret, eplen, info)) in enumerate(zip(dones, self.eprets, self.eplens, infos)):
+            info = info.copy()
+            if done:
+                epinfo = {'r': ret, 'l': eplen, 't': round(time.time() - self.tstart, 6)}
+                info['episode'] = epinfo
+                if self.keep_buf:
+                    self.epret_buf.append(ret)
+                    self.eplen_buf.append(eplen)
+                self.epcount += 1
+                self.eprets[i] = 0
+                self.eplens[i] = 0
+                if self.results_writer:
+                    self.results_writer.write_row(epinfo)
+            newinfos.append(info)
+
+        return obs, rews, dones, newinfos
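The bookkeeping in `VecMonitor.step_wait` amounts to one running return and one running length per environment, both reset when that environment reports `done`. The sketch below shows the same idea with plain lists instead of NumPy arrays. The function name `track_episodes`, the extra `'env'` key, and the rollout data are hypothetical and only serve the illustration.

```python
def track_episodes(step_rewards, step_dones):
    """
    step_rewards / step_dones hold one list per step, each with one entry
    per environment. Returns the finished episodes as dicts with the same
    'r' (return) and 'l' (length) keys that VecMonitor records, plus an
    illustrative 'env' index.
    """
    num_envs = len(step_rewards[0])
    eprets = [0.0] * num_envs
    eplens = [0] * num_envs
    finished = []
    for rews, dones in zip(step_rewards, step_dones):
        for i in range(num_envs):
            eprets[i] += rews[i]
            eplens[i] += 1
            if dones[i]:
                finished.append({'env': i, 'r': eprets[i], 'l': eplens[i]})
                # Reset only the environment whose episode just ended.
                eprets[i] = 0.0
                eplens[i] = 0
    return finished


# Made-up rollout: 3 steps across 2 environments.
rewards = [[1.0, 0.5], [1.0, 0.5], [1.0, 0.5]]
dones = [[False, True], [True, False], [False, True]]
episodes = track_episodes(rewards, dones)
```

Environment 1 finishes two episodes in this rollout and environment 0 finishes one, so three entries are produced while the accumulators of unfinished episodes keep running.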
--- a/baselines/common/vec_env/vec_normalize.py
+++ b/baselines/common/vec_env/vec_normalize.py
@@ -1,11 +1,14 @@
-from baselines.common.vec_env import VecEnvWrapper
+from . import VecEnvWrapper
 from baselines.common.running_mean_std import RunningMeanStd
 import numpy as np

+
 class VecNormalize(VecEnvWrapper):
    """
-    Vectorized environment base class
+    A vectorized wrapper that normalizes the observations
+    and returns from an environment.
    """
+
    def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
        VecEnvWrapper.__init__(self, venv)
        self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
@@ -17,18 +20,13 @@ class VecNormalize(VecEnvWrapper):
        self.epsilon = epsilon

    def step_wait(self):
-        """
-        Apply sequence of actions to sequence of environments
-        actions -> (observations, rewards, news)
-
-        where 'news' is a boolean vector indicating whether each element is new.
-        """
        obs, rews, news, infos = self.venv.step_wait()
        self.ret = self.ret * self.gamma + rews
        obs = self._obfilt(obs)
        if self.ret_rms:
            self.ret_rms.update(self.ret)
            rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
+        self.ret[news] = 0.
        return obs, rews, news, infos

    def _obfilt(self, obs):
@@ -40,8 +38,6 @@ class VecNormalize(VecEnvWrapper):
            return obs

    def reset(self):
-        """
-        Reset all environments
-        """
+        self.ret = np.zeros(self.num_envs)
        obs = self.venv.reset()
        return self._obfilt(obs)
--- a/baselines/common/vec_env/vec_remove_dict_obs.py
+++ b/baselines/common/vec_env/vec_remove_dict_obs.py
@@ -0,0 +1,11 @@
+from .vec_env import VecEnvObservationWrapper
+
+
+class VecExtractDictObs(VecEnvObservationWrapper):
+    def __init__(self, venv, key):
+        self.key = key
+        super().__init__(venv=venv,
+            observation_space=venv.observation_space.spaces[self.key])
+
+    def process(self, obs):
+        return obs[self.key]
--- a/baselines/common/vec_env/vec_video_recorder.py
+++ b/baselines/common/vec_env/vec_video_recorder.py
@@ -0,0 +1,89 @@
+import os
+from baselines import logger
+from baselines.common.vec_env import VecEnvWrapper
+from gym.wrappers.monitoring import video_recorder
+
+
+class VecVideoRecorder(VecEnvWrapper):
+    """
+    Wrap VecEnv to record rendered image as mp4 video.
+    """
+
+    def __init__(self, venv, directory, record_video_trigger, video_length=200):
+        """
+        # Arguments
+            venv: VecEnv to wrap
+            directory: Where to save videos
+            record_video_trigger:
+                Function that defines when to start recording.
+                The function takes the current step count,
+                and returns whether we should start recording or not.
+            video_length: Length of recorded video
+        """
+
+        VecEnvWrapper.__init__(self, venv)
+        self.record_video_trigger = record_video_trigger
+        self.video_recorder = None
+
+        self.directory = os.path.abspath(directory)
+        os.makedirs(self.directory, exist_ok=True)
+
+        self.file_prefix = "vecenv"
+        self.file_infix = '{}'.format(os.getpid())
+        self.step_id = 0
+        self.video_length = video_length
+
+        self.recording = False
+        self.recorded_frames = 0
+
+    def reset(self):
+        obs = self.venv.reset()
+
+        self.start_video_recorder()
+
+        return obs
+
+    def start_video_recorder(self):
+        self.close_video_recorder()
+
+        base_path = os.path.join(self.directory, '{}.video.{}.video{:06}'.format(self.file_prefix, self.file_infix, self.step_id))
+        self.video_recorder = video_recorder.VideoRecorder(
+                env=self.venv,
+                base_path=base_path,
+                metadata={'step_id': self.step_id}
+                )
+
+        self.video_recorder.capture_frame()
+        self.recorded_frames = 1
+        self.recording = True
+
+    def _video_enabled(self):
+        return self.record_video_trigger(self.step_id)
+
+    def step_wait(self):
+        obs, rews, dones, infos = self.venv.step_wait()
+
+        self.step_id += 1
+        if self.recording:
+            self.video_recorder.capture_frame()
+            self.recorded_frames += 1
+            if self.recorded_frames > self.video_length:
+                logger.info("Saving video to ", self.video_recorder.path)
+                self.close_video_recorder()
+        elif self._video_enabled():
+            self.start_video_recorder()
+
+        return obs, rews, dones, infos
+
+    def close_video_recorder(self):
+        if self.recording:
+            self.video_recorder.close()
+        self.recording = False
+        self.recorded_frames = 0
+
+    def close(self):
+        VecEnvWrapper.close(self)
+        self.close_video_recorder()
+
+    def __del__(self):
+        self.close()
--- a/baselines/common/wrappers.py
+++ b/baselines/common/wrappers.py
@@ -0,0 +1,19 @@
+import gym
+
+class TimeLimit(gym.Wrapper):
+    def __init__(self, env, max_episode_steps=None):
+        super(TimeLimit, self).__init__(env)
+        self._max_episode_steps = max_episode_steps
+        self._elapsed_steps = 0
+
+    def step(self, ac):
+        observation, reward, done, info = self.env.step(ac)
+        self._elapsed_steps += 1
+        if self._max_episode_steps is not None and self._elapsed_steps >= self._max_episode_steps:
+            done = True
+            info['TimeLimit.truncated'] = True
+        return observation, reward, done, info
+
+    def reset(self, **kwargs):
+        self._elapsed_steps = 0
+        return self.env.reset(**kwargs)
--- a/baselines/ddpg/README.md
+++ b/baselines/ddpg/README.md
@@ -2,4 +2,4 @@

 - Original paper: https://arxiv.org/abs/1509.02971
 - Baselines post: https://blog.openai.com/better-exploration-with-parameter-noise/
- `python -m baselines.ddpg.main` runs the algorithm for 1M frames = 10M timesteps on a Mujoco environment. See help (`-h`) for more options.
+- `python -m baselines.run --alg=ddpg --env=HalfCheetah-v2 --num_timesteps=1e6` runs the algorithm for 1M timesteps on a MuJoCo environment. See help (`-h`) for more options.
--- a/baselines/ddpg/__init__.py
+++ b/baselines/ddpg/__init__.py
--- a/baselines/ddpg/ddpg.py
+++ b/baselines/ddpg/ddpg.py
@@ -1,378 +1,285 @@
-from copy import copy
-from functools import reduce
+import os
+import os.path as osp
+import time
+from collections import deque
+import pickle

-import numpy as np
-import tensorflow as tf
-import tensorflow.contrib as tc
+from baselines.ddpg.ddpg_learner import DDPG
+from baselines.ddpg.models import Actor, Critic
+from baselines.ddpg.memory import Memory
+from baselines.ddpg.noise import AdaptiveParamNoiseSpec, NormalActionNoise, OrnsteinUhlenbeckActionNoise
+from baselines.common import set_global_seeds

 from baselines import logger
-from baselines.common.mpi_adam import MpiAdam
-import baselines.common.tf_util as U
-from baselines.common.mpi_running_mean_std import RunningMeanStd
-from mpi4py import MPI
+import tensorflow as tf
+import numpy as np

-def normalize(x, stats):
-    if stats is None:
-        return x
-    return (x - stats.mean) / stats.std
+try:
+    from mpi4py import MPI
+except ImportError:
+    MPI = None
+
+def learn(network, env,
+          seed=None,
+          total_timesteps=None,
+          nb_epochs=None, # with default settings, perform 1M steps total
+          nb_epoch_cycles=20,
+          nb_rollout_steps=100,
+          reward_scale=1.0,
+          render=False,
+          render_eval=False,
+          noise_type='adaptive-param_0.2',
+          normalize_returns=False,
+          normalize_observations=True,
+          critic_l2_reg=1e-2,
+          actor_lr=1e-4,
+          critic_lr=1e-3,
+          popart=False,
+          gamma=0.99,
+          clip_norm=None,
+          nb_train_steps=50, # per epoch cycle and MPI worker,
+          nb_eval_steps=100,
+          batch_size=64, # per MPI worker
+          tau=0.01,
+          eval_env=None,
+          param_noise_adaption_interval=50,
+          load_path=None,
+          **network_kwargs):
+
+    set_global_seeds(seed)
+
+    if total_timesteps is not None:
+        assert nb_epochs is None
+        nb_epochs = int(total_timesteps) // (nb_epoch_cycles * nb_rollout_steps)
+    elif nb_epochs is None:
+        nb_epochs = 500
+
+    if MPI is not None:
+        rank = MPI.COMM_WORLD.Get_rank()
+    else:
+        rank = 0
+
+    nb_actions = env.action_space.shape[-1]
+    assert (np.abs(env.action_space.low) == env.action_space.high).all()  # we assume symmetric actions.
+
+    memory = Memory(limit=int(1e6), action_shape=env.action_space.shape, observation_shape=env.observation_space.shape)
+    critic = Critic(nb_actions, ob_shape=env.observation_space.shape, network=network, **network_kwargs)
+    actor = Actor(nb_actions, ob_shape=env.observation_space.shape, network=network, **network_kwargs)
+
+    action_noise = None
+    param_noise = None
+    if noise_type is not None:
+        for current_noise_type in noise_type.split(','):
+            current_noise_type = current_noise_type.strip()
+            if current_noise_type == 'none':
+                pass
+            elif 'adaptive-param' in current_noise_type:
+                _, stddev = current_noise_type.split('_')
+                param_noise = AdaptiveParamNoiseSpec(initial_stddev=float(stddev), desired_action_stddev=float(stddev))
+            elif 'normal' in current_noise_type:
+                _, stddev = current_noise_type.split('_')
+                action_noise = NormalActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
+            elif 'ou' in current_noise_type:
+                _, stddev = current_noise_type.split('_')
+                action_noise = OrnsteinUhlenbeckActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
+            else:
+                raise RuntimeError('unknown noise type "{}"'.format(current_noise_type))
+
+    max_action = env.action_space.high
+    logger.info('scaling actions by {} before executing in env'.format(max_action))
+
+    agent = DDPG(actor, critic, memory, env.observation_space.shape, env.action_space.shape,
+        gamma=gamma, tau=tau, normalize_returns=normalize_returns, normalize_observations=normalize_observations,
+        batch_size=batch_size, action_noise=action_noise, param_noise=param_noise, critic_l2_reg=critic_l2_reg,
+        actor_lr=actor_lr, critic_lr=critic_lr, enable_popart=popart, clip_norm=clip_norm,
+        reward_scale=reward_scale)
+    logger.info('Using agent with the following configuration:')
+    logger.info(str(agent.__dict__.items()))
+
+    if load_path is not None:
+        load_path = osp.expanduser(load_path)
+        ckpt = tf.train.Checkpoint(model=agent)
+        manager = tf.train.CheckpointManager(ckpt, load_path, max_to_keep=None)
+        ckpt.restore(manager.latest_checkpoint)
+        print("Restoring from {}".format(manager.latest_checkpoint))
+
+    eval_episode_rewards_history = deque(maxlen=100)
+    episode_rewards_history = deque(maxlen=100)
+
+    # Prepare everything.
+    agent.initialize()
+
+    agent.reset()
+
+    obs = env.reset()
+    if eval_env is not None:
+        eval_obs = eval_env.reset()
+    nenvs = obs.shape[0]
+
+    episode_reward = np.zeros(nenvs, dtype=np.float32)  # vector
+    episode_step = np.zeros(nenvs, dtype=int)  # vector
+    episodes = 0  # scalar
+    t = 0  # scalar
+
+    epoch = 0


-def denormalize(x, stats):
-    if stats is None:
-        return x
-    return x * stats.std + stats.mean

-def reduce_std(x, axis=None, keepdims=False):
-    return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
+    start_time = time.time()

-def reduce_var(x, axis=None, keepdims=False):
-    m = tf.reduce_mean(x, axis=axis, keep_dims=True)
-    devs_squared = tf.square(x - m)
-    return tf.reduce_mean(devs_squared, axis=axis, keep_dims=keepdims)
+    epoch_episode_rewards = []
+    epoch_episode_steps = []
+    epoch_actions = []
+    epoch_qs = []
+    epoch_episodes = 0
+    for epoch in range(nb_epochs):
+        for cycle in range(nb_epoch_cycles):
+            # Perform rollouts.
+            if nenvs > 1:
+                # if simulating multiple envs in parallel, impossible to reset agent at the end of the episode in each
+                # of the environments, so resetting here instead
+                agent.reset()
+            for t_rollout in range(nb_rollout_steps):
+                # Predict next action.
+                action, q, _, _ = agent.step(tf.constant(obs), apply_noise=True, compute_Q=True)
+                action, q = action.numpy(), q.numpy()

-def get_target_updates(vars, target_vars, tau):
-    logger.info('setting up target updates ...')
-    soft_updates = []
-    init_updates = []
-    assert len(vars) == len(target_vars)
-    for var, target_var in zip(vars, target_vars):
-        logger.info('  {} <- {}'.format(target_var.name, var.name))
-        init_updates.append(tf.assign(target_var, var))
-        soft_updates.append(tf.assign(target_var, (1. - tau) * target_var + tau * var))
-    assert len(init_updates) == len(vars)
-    assert len(soft_updates) == len(vars)
-    return tf.group(*init_updates), tf.group(*soft_updates)
+                # Execute next action.
+                if rank == 0 and render:
+                    env.render()
+
+                # max_action is of dimension A, whereas action is dimension (nenvs, A) - the multiplication gets broadcasted to the batch
+                new_obs, r, done, info = env.step(max_action * action)  # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
+                # note these outputs are batched from vecenv
+
+                t += 1
+                if rank == 0 and render:
+                    env.render()
+                episode_reward += r
+                episode_step += 1
+
+                # Book-keeping.
+                epoch_actions.append(action)
+                epoch_qs.append(q)
+                agent.store_transition(obs, action, r, new_obs, done)  # the batched data will be unrolled in memory.py's append.
+
+                obs = new_obs
+
+                for d in range(len(done)):
+                    if done[d]:
+                        # Episode done.
+                        epoch_episode_rewards.append(episode_reward[d])
+                        episode_rewards_history.append(episode_reward[d])
+                        epoch_episode_steps.append(episode_step[d])
+                        episode_reward[d] = 0.
+                        episode_step[d] = 0
+                        epoch_episodes += 1
+                        episodes += 1
+                        if nenvs == 1:
+                            agent.reset()


-def get_perturbed_actor_updates(actor, perturbed_actor, param_noise_stddev):
-    assert len(actor.vars) == len(perturbed_actor.vars)
-    assert len(actor.perturbable_vars) == len(perturbed_actor.perturbable_vars)
+            # Train.
+            epoch_actor_losses = []
+            epoch_critic_losses = []
+            epoch_adaptive_distances = []
+            for t_train in range(nb_train_steps):
+                # Adapt param noise, if necessary.
+                if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
+                    batch = agent.memory.sample(batch_size=batch_size)
+                    obs0 = tf.constant(batch['obs0'])
+                    distance = agent.adapt_param_noise(obs0)
+                    epoch_adaptive_distances.append(distance)

-    updates = []
-    for var, perturbed_var in zip(actor.vars, perturbed_actor.vars):
-        if var in actor.perturbable_vars:
-            logger.info('  {} <- {} + noise'.format(perturbed_var.name, var.name))
-            updates.append(tf.assign(perturbed_var, var + tf.random_normal(tf.shape(var), mean=0., stddev=param_noise_stddev)))
+                cl, al = agent.train()
+                epoch_critic_losses.append(cl)
+                epoch_actor_losses.append(al)
+                agent.update_target_net()
+
+            # Evaluate.
+            eval_episode_rewards = []
+            eval_qs = []
+            if eval_env is not None:
+                nenvs_eval = eval_obs.shape[0]
+                eval_episode_reward = np.zeros(nenvs_eval, dtype=np.float32)
+                for t_rollout in range(nb_eval_steps):
+                    eval_action, eval_q, _, _ = agent.step(eval_obs, apply_noise=False, compute_Q=True)
+                    eval_obs, eval_r, eval_done, eval_info = eval_env.step(max_action * eval_action)  # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
+                    if render_eval:
+                        eval_env.render()
+                    eval_episode_reward += eval_r
+
+                    eval_qs.append(eval_q)
+                    for d in range(len(eval_done)):
+                        if eval_done[d]:
+                            eval_episode_rewards.append(eval_episode_reward[d])
+                            eval_episode_rewards_history.append(eval_episode_reward[d])
+                            eval_episode_reward[d] = 0.0
+
+        if MPI is not None:
+            mpi_size = MPI.COMM_WORLD.Get_size()
        else:
-            logger.info('  {} <- {}'.format(perturbed_var.name, var.name))
-            updates.append(tf.assign(perturbed_var, var))
-    assert len(updates) == len(actor.vars)
-    return tf.group(*updates)
+            mpi_size = 1
+
+        # Log stats.
+        # XXX shouldn't call np.mean on variable length lists
+        duration = time.time() - start_time
+        stats = agent.get_stats()
+        combined_stats = stats.copy()
+        combined_stats['rollout/return'] = np.mean(epoch_episode_rewards)
+        combined_stats['rollout/return_std'] = np.std(epoch_episode_rewards)
+        combined_stats['rollout/return_history'] = np.mean(episode_rewards_history)
+        combined_stats['rollout/return_history_std'] = np.std(episode_rewards_history)
+        combined_stats['rollout/episode_steps'] = np.mean(epoch_episode_steps)
+        combined_stats['rollout/actions_mean'] = np.mean(epoch_actions)
+        combined_stats['rollout/Q_mean'] = np.mean(epoch_qs)
+        combined_stats['train/loss_actor'] = np.mean(epoch_actor_losses)
+        combined_stats['train/loss_critic'] = np.mean(epoch_critic_losses)
+        combined_stats['train/param_noise_distance'] = np.mean(epoch_adaptive_distances)
+        combined_stats['total/duration'] = duration
+        combined_stats['total/steps_per_second'] = float(t) / float(duration)
+        combined_stats['total/episodes'] = episodes
+        combined_stats['rollout/episodes'] = epoch_episodes
+        combined_stats['rollout/actions_std'] = np.std(epoch_actions)
+        # Evaluation statistics.
+        if eval_env is not None:
+            combined_stats['eval/return'] = eval_episode_rewards
+            combined_stats['eval/return_history'] = np.mean(eval_episode_rewards_history)
+            combined_stats['eval/Q'] = eval_qs
+            combined_stats['eval/episodes'] = len(eval_episode_rewards)
+        def as_scalar(x):
+            if isinstance(x, np.ndarray):
+                assert x.size == 1
+                return x[0]
+            elif np.isscalar(x):
+                return x
+            else:
+                raise ValueError('expected scalar, got %s'%x)
+
+        combined_stats_sums = np.array([np.array(x).flatten()[0] for x in combined_stats.values()])
+        if MPI is not None:
+            combined_stats_sums = MPI.COMM_WORLD.allreduce(combined_stats_sums)
+
+        combined_stats = {k: v / mpi_size for (k, v) in zip(combined_stats.keys(), combined_stats_sums)}
+
+        # Total statistics.
+        combined_stats['total/epochs'] = epoch + 1
+        combined_stats['total/steps'] = t
+
+        for key in sorted(combined_stats.keys()):
+            logger.record_tabular(key, combined_stats[key])
+
+        if rank == 0:
+            logger.dump_tabular()
+        logger.info('')
+        logdir = logger.get_dir()
+        if rank == 0 and logdir:
+            if hasattr(env, 'get_state'):
+                with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:
+                    pickle.dump(env.get_state(), f)
+            if eval_env and hasattr(eval_env, 'get_state'):
+                with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
+                    pickle.dump(eval_env.get_state(), f)


-class DDPG(object):
-    def __init__(self, actor, critic, memory, observation_shape, action_shape, param_noise=None, action_noise=None,
-        gamma=0.99, tau=0.001, normalize_returns=False, enable_popart=False, normalize_observations=True,
-        batch_size=128, observation_range=(-5., 5.), action_range=(-1., 1.), return_range=(-np.inf, np.inf),
-        adaptive_param_noise=True, adaptive_param_noise_policy_threshold=.1,
-        critic_l2_reg=0., actor_lr=1e-4, critic_lr=1e-3, clip_norm=None, reward_scale=1.):
-        # Inputs.
-        self.obs0 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs0')
-        self.obs1 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs1')
-        self.terminals1 = tf.placeholder(tf.float32, shape=(None, 1), name='terminals1')
-        self.rewards = tf.placeholder(tf.float32, shape=(None, 1), name='rewards')
-        self.actions = tf.placeholder(tf.float32, shape=(None,) + action_shape, name='actions')
-        self.critic_target = tf.placeholder(tf.float32, shape=(None, 1), name='critic_target')
-        self.param_noise_stddev = tf.placeholder(tf.float32, shape=(), name='param_noise_stddev')
-
-        # Parameters.
-        self.gamma = gamma
-        self.tau = tau
-        self.memory = memory
-        self.normalize_observations = normalize_observations
-        self.normalize_returns = normalize_returns
-        self.action_noise = action_noise
-        self.param_noise = param_noise
-        self.action_range = action_range
-        self.return_range = return_range
-        self.observation_range = observation_range
-        self.critic = critic
-        self.actor = actor
-        self.actor_lr = actor_lr
-        self.critic_lr = critic_lr
-        self.clip_norm = clip_norm
-        self.enable_popart = enable_popart
-        self.reward_scale = reward_scale
-        self.batch_size = batch_size
-        self.stats_sample = None
-        self.critic_l2_reg = critic_l2_reg
-
-        # Observation normalization.
-        if self.normalize_observations:
-            with tf.variable_scope('obs_rms'):
-                self.obs_rms = RunningMeanStd(shape=observation_shape)
-        else:
-            self.obs_rms = None
-        normalized_obs0 = tf.clip_by_value(normalize(self.obs0, self.obs_rms),
-            self.observation_range[0], self.observation_range[1])
-        normalized_obs1 = tf.clip_by_value(normalize(self.obs1, self.obs_rms),
-            self.observation_range[0], self.observation_range[1])
-
-        # Return normalization.
-        if self.normalize_returns:
-            with tf.variable_scope('ret_rms'):
-                self.ret_rms = RunningMeanStd()
-        else:
-            self.ret_rms = None
-
-        # Create target networks.
-        target_actor = copy(actor)
-        target_actor.name = 'target_actor'
-        self.target_actor = target_actor
-        target_critic = copy(critic)
-        target_critic.name = 'target_critic'
-        self.target_critic = target_critic
-
-        # Create networks and core TF parts that are shared across setup parts.
-        self.actor_tf = actor(normalized_obs0)
-        self.normalized_critic_tf = critic(normalized_obs0, self.actions)
-        self.critic_tf = denormalize(tf.clip_by_value(self.normalized_critic_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
-        self.normalized_critic_with_actor_tf = critic(normalized_obs0, self.actor_tf, reuse=True)
-        self.critic_with_actor_tf = denormalize(tf.clip_by_value(self.normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
-        Q_obs1 = denormalize(target_critic(normalized_obs1, target_actor(normalized_obs1)), self.ret_rms)
-        self.target_Q = self.rewards + (1. - self.terminals1) * gamma * Q_obs1
-
-        # Set up parts.
-        if self.param_noise is not None:
-            self.setup_param_noise(normalized_obs0)
-        self.setup_actor_optimizer()
-        self.setup_critic_optimizer()
-        if self.normalize_returns and self.enable_popart:
-            self.setup_popart()
-        self.setup_stats()
-        self.setup_target_network_updates()
-
-    def setup_target_network_updates(self):
-        actor_init_updates, actor_soft_updates = get_target_updates(self.actor.vars, self.target_actor.vars, self.tau)
-        critic_init_updates, critic_soft_updates = get_target_updates(self.critic.vars, self.target_critic.vars, self.tau)
-        self.target_init_updates = [actor_init_updates, critic_init_updates]
-        self.target_soft_updates = [actor_soft_updates, critic_soft_updates]
-
-    def setup_param_noise(self, normalized_obs0):
-        assert self.param_noise is not None
-
-        # Configure perturbed actor.
-        param_noise_actor = copy(self.actor)
-        param_noise_actor.name = 'param_noise_actor'
-        self.perturbed_actor_tf = param_noise_actor(normalized_obs0)
-        logger.info('setting up param noise')
-        self.perturb_policy_ops = get_perturbed_actor_updates(self.actor, param_noise_actor, self.param_noise_stddev)
-
-        # Configure separate copy for stddev adoption.
-        adaptive_param_noise_actor = copy(self.actor)
-        adaptive_param_noise_actor.name = 'adaptive_param_noise_actor'
-        adaptive_actor_tf = adaptive_param_noise_actor(normalized_obs0)
-        self.perturb_adaptive_policy_ops = get_perturbed_actor_updates(self.actor, adaptive_param_noise_actor, self.param_noise_stddev)
-        self.adaptive_policy_distance = tf.sqrt(tf.reduce_mean(tf.square(self.actor_tf - adaptive_actor_tf)))
-
-    def setup_actor_optimizer(self):
-        logger.info('setting up actor optimizer')
-        self.actor_loss = -tf.reduce_mean(self.critic_with_actor_tf)
-        actor_shapes = [var.get_shape().as_list() for var in self.actor.trainable_vars]
-        actor_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in actor_shapes])
-        logger.info('  actor shapes: {}'.format(actor_shapes))
-        logger.info('  actor params: {}'.format(actor_nb_params))
-        self.actor_grads = U.flatgrad(self.actor_loss, self.actor.trainable_vars, clip_norm=self.clip_norm)
-        self.actor_optimizer = MpiAdam(var_list=self.actor.trainable_vars,
-            beta1=0.9, beta2=0.999, epsilon=1e-08)
-
-    def setup_critic_optimizer(self):
-        logger.info('setting up critic optimizer')
-        normalized_critic_target_tf = tf.clip_by_value(normalize(self.critic_target, self.ret_rms), self.return_range[0], self.return_range[1])
-        self.critic_loss = tf.reduce_mean(tf.square(self.normalized_critic_tf - normalized_critic_target_tf))
-        if self.critic_l2_reg > 0.:
-            critic_reg_vars = [var for var in self.critic.trainable_vars if 'kernel' in var.name and 'output' not in var.name]
-            for var in critic_reg_vars:
-                logger.info('  regularizing: {}'.format(var.name))
-            logger.info('  applying l2 regularization with {}'.format(self.critic_l2_reg))
-            critic_reg = tc.layers.apply_regularization(
-                tc.layers.l2_regularizer(self.critic_l2_reg),
-                weights_list=critic_reg_vars
-            )
-            self.critic_loss += critic_reg
-        critic_shapes = [var.get_shape().as_list() for var in self.critic.trainable_vars]
-        critic_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in critic_shapes])
-        logger.info('  critic shapes: {}'.format(critic_shapes))
-        logger.info('  critic params: {}'.format(critic_nb_params))
-        self.critic_grads = U.flatgrad(self.critic_loss, self.critic.trainable_vars, clip_norm=self.clip_norm)
-        self.critic_optimizer = MpiAdam(var_list=self.critic.trainable_vars,
-            beta1=0.9, beta2=0.999, epsilon=1e-08)
-
-    def setup_popart(self):
-        # See https://arxiv.org/pdf/1602.07714.pdf for details.
-        self.old_std = tf.placeholder(tf.float32, shape=[1], name='old_std')
-        new_std = self.ret_rms.std
-        self.old_mean = tf.placeholder(tf.float32, shape=[1], name='old_mean')
-        new_mean = self.ret_rms.mean
-
-        self.renormalize_Q_outputs_op = []
-        for vs in [self.critic.output_vars, self.target_critic.output_vars]:
-            assert len(vs) == 2
-            M, b = vs
-            assert 'kernel' in M.name
-            assert 'bias' in b.name
-            assert M.get_shape()[-1] == 1
-            assert b.get_shape()[-1] == 1
-            self.renormalize_Q_outputs_op += [M.assign(M * self.old_std / new_std)]
-            self.renormalize_Q_outputs_op += [b.assign((b * self.old_std + self.old_mean - new_mean) / new_std)]
-
-    def setup_stats(self):
-        ops = []
-        names = []
-
-        if self.normalize_returns:
-            ops += [self.ret_rms.mean, self.ret_rms.std]
-            names += ['ret_rms_mean', 'ret_rms_std']
-
-        if self.normalize_observations:
-            ops += [tf.reduce_mean(self.obs_rms.mean), tf.reduce_mean(self.obs_rms.std)]
-            names += ['obs_rms_mean', 'obs_rms_std']
-
-        ops += [tf.reduce_mean(self.critic_tf)]
-        names += ['reference_Q_mean']
-        ops += [reduce_std(self.critic_tf)]
-        names += ['reference_Q_std']
-
-        ops += [tf.reduce_mean(self.critic_with_actor_tf)]
-        names += ['reference_actor_Q_mean']
-        ops += [reduce_std(self.critic_with_actor_tf)]
-        names += ['reference_actor_Q_std']
-
-        ops += [tf.reduce_mean(self.actor_tf)]
-        names += ['reference_action_mean']
-        ops += [reduce_std(self.actor_tf)]
-        names += ['reference_action_std']
-
-        if self.param_noise:
-            ops += [tf.reduce_mean(self.perturbed_actor_tf)]
-            names += ['reference_perturbed_action_mean']
-            ops += [reduce_std(self.perturbed_actor_tf)]
-            names += ['reference_perturbed_action_std']
-
-        self.stats_ops = ops
-        self.stats_names = names
-
-    def pi(self, obs, apply_noise=True, compute_Q=True):
-        if self.param_noise is not None and apply_noise:
-            actor_tf = self.perturbed_actor_tf
-        else:
-            actor_tf = self.actor_tf
-        feed_dict = {self.obs0: [obs]}
-        if compute_Q:
-            action, q = self.sess.run([actor_tf, self.critic_with_actor_tf], feed_dict=feed_dict)
-        else:
-            action = self.sess.run(actor_tf, feed_dict=feed_dict)
-            q = None
-        action = action.flatten()
-        if self.action_noise is not None and apply_noise:
-            noise = self.action_noise()
-            assert noise.shape == action.shape
-            action += noise
-        action = np.clip(action, self.action_range[0], self.action_range[1])
-        return action, q
-
-    def store_transition(self, obs0, action, reward, obs1, terminal1):
-        reward *= self.reward_scale
-        self.memory.append(obs0, action, reward, obs1, terminal1)
-        if self.normalize_observations:
-            self.obs_rms.update(np.array([obs0]))
-
-    def train(self):
-        # Get a batch.
-        batch = self.memory.sample(batch_size=self.batch_size)
-
-        if self.normalize_returns and self.enable_popart:
-            old_mean, old_std, target_Q = self.sess.run([self.ret_rms.mean, self.ret_rms.std, self.target_Q], feed_dict={
-                self.obs1: batch['obs1'],
-                self.rewards: batch['rewards'],
-                self.terminals1: batch['terminals1'].astype('float32'),
-            })
-            self.ret_rms.update(target_Q.flatten())
-            self.sess.run(self.renormalize_Q_outputs_op, feed_dict={
-                self.old_std : np.array([old_std]),
-                self.old_mean : np.array([old_mean]),
-            })
-
-            # Run sanity check. Disabled by default since it slows down things considerably.
-            # print('running sanity check')
-            # target_Q_new, new_mean, new_std = self.sess.run([self.target_Q, self.ret_rms.mean, self.ret_rms.std], feed_dict={
-            #     self.obs1: batch['obs1'],
-            #     self.rewards: batch['rewards'],
-            #     self.terminals1: batch['terminals1'].astype('float32'),
-            # })
-            # print(target_Q_new, target_Q, new_mean, new_std)
-            # assert (np.abs(target_Q - target_Q_new) < 1e-3).all()
-        else:
-            target_Q = self.sess.run(self.target_Q, feed_dict={
-                self.obs1: batch['obs1'],
-                self.rewards: batch['rewards'],
-                self.terminals1: batch['terminals1'].astype('float32'),
-            })
-
-        # Get all gradients and perform a synced update.
-        ops = [self.actor_grads, self.actor_loss, self.critic_grads, self.critic_loss]
-        actor_grads, actor_loss, critic_grads, critic_loss = self.sess.run(ops, feed_dict={
-            self.obs0: batch['obs0'],
-            self.actions: batch['actions'],
-            self.critic_target: target_Q,
-        })
-        self.actor_optimizer.update(actor_grads, stepsize=self.actor_lr)
-        self.critic_optimizer.update(critic_grads, stepsize=self.critic_lr)
-
-        return critic_loss, actor_loss
-
-    def initialize(self, sess):
-        self.sess = sess
-        self.sess.run(tf.global_variables_initializer())
-        self.actor_optimizer.sync()
-        self.critic_optimizer.sync()
-        self.sess.run(self.target_init_updates)
-
-    def update_target_net(self):
-        self.sess.run(self.target_soft_updates)
-
-    def get_stats(self):
-        if self.stats_sample is None:
-            # Get a sample and keep that fixed for all further computations.
-            # This allows us to estimate the change in value for the same set of inputs.
-            self.stats_sample = self.memory.sample(batch_size=self.batch_size)
-        values = self.sess.run(self.stats_ops, feed_dict={
-            self.obs0: self.stats_sample['obs0'],
-            self.actions: self.stats_sample['actions'],
-        })
-
-        names = self.stats_names[:]
-        assert len(names) == len(values)
-        stats = dict(zip(names, values))
-
-        if self.param_noise is not None:
-            stats = {**stats, **self.param_noise.get_stats()}
-
-        return stats
-
-    def adapt_param_noise(self):
-        if self.param_noise is None:
-            return 0.
-
-        # Perturb a separate copy of the policy to adjust the scale for the next "real" perturbation.
-        batch = self.memory.sample(batch_size=self.batch_size)
-        self.sess.run(self.perturb_adaptive_policy_ops, feed_dict={
-            self.param_noise_stddev: self.param_noise.current_stddev,
-        })
-        distance = self.sess.run(self.adaptive_policy_distance, feed_dict={
-            self.obs0: batch['obs0'],
-            self.param_noise_stddev: self.param_noise.current_stddev,
-        })
-
-        mean_distance = MPI.COMM_WORLD.allreduce(distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
-        self.param_noise.adapt(mean_distance)
-        return mean_distance
-
-    def reset(self):
-        # Reset internal state after an episode is complete.
-        if self.action_noise is not None:
-            self.action_noise.reset()
-        if self.param_noise is not None:
-            self.sess.run(self.perturb_policy_ops, feed_dict={
-                self.param_noise_stddev: self.param_noise.current_stddev,
-            })
+    return agent
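The stats-logging step at the end of the training loop above sums one scalar per statistic across workers and divides by `mpi_size`. A minimal NumPy-only sketch of that averaging follows, assuming a plain list of per-worker dicts in place of `MPI.COMM_WORLD.allreduce`; the helper name `average_stats_across_workers` and the sample values are hypothetical and not part of the diff.

```python
import numpy as np


def average_stats_across_workers(per_worker_stats):
    """Average scalar stats over workers, mimicking allreduce(sum) / mpi_size.

    `per_worker_stats` is a list of dicts with identical keys, one dict per
    simulated worker (an assumption made for this illustration).
    """
    keys = list(per_worker_stats[0].keys())
    # Each worker flattens every value and keeps only its first element,
    # mirroring np.array(x).flatten()[0] in the logging code.
    vectors = [
        np.array([np.array(stats[k]).flatten()[0] for k in keys])
        for stats in per_worker_stats
    ]
    # Element-wise sum over workers stands in for the MPI allreduce.
    summed = np.sum(vectors, axis=0)
    return {k: v / len(per_worker_stats) for k, v in zip(keys, summed)}


workers = [
    {'rollout/return': 10.0, 'train/loss_critic': np.array([0.5])},
    {'rollout/return': 20.0, 'train/loss_critic': np.array([1.5])},
]
averaged = average_stats_across_workers(workers)
```

Because only the first element of each value survives the flatten step, list-valued entries such as `eval/return` contribute just their first item to the average, and an empty list would raise an `IndexError`.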
--- a/baselines/ddpg/ddpg_learner.py
+++ b/baselines/ddpg/ddpg_learner.py
@@ -0,0 +1,356 @@
+from functools import reduce
+
+import numpy as np
+import tensorflow as tf
+
+from baselines import logger
+from baselines.ddpg.models import Actor, Critic
+from baselines.common.mpi_running_mean_std import RunningMeanStd
+try:
+    from mpi4py import MPI
+    from baselines.common.mpi_adam_optimizer import MpiAdamOptimizer
+    from baselines.common.mpi_util import sync_from_root
+except ImportError:
+    MPI = None
+
+def normalize(x, stats):
+    if stats is None:
+        return x
+    return (x - stats.mean) / (stats.std + 1e-8)
+
+
+def denormalize(x, stats):
+    if stats is None:
+        return x
+    return x * stats.std + stats.mean
+
+@tf.function
+def reduce_std(x, axis=None, keepdims=False):
+    return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
+
+def reduce_var(x, axis=None, keepdims=False):
+    m = tf.reduce_mean(x, axis=axis, keepdims=True)
+    devs_squared = tf.square(x - m)
+    return tf.reduce_mean(devs_squared, axis=axis, keepdims=keepdims)
+
+@tf.function
+def update_perturbed_actor(actor, perturbed_actor, param_noise_stddev):
+
+    for var, perturbed_var in zip(actor.variables, perturbed_actor.variables):
+        if var in actor.perturbable_vars:
+            perturbed_var.assign(var + tf.random.normal(shape=tf.shape(var), mean=0., stddev=param_noise_stddev))
+        else:
+            perturbed_var.assign(var)
+
+
+class DDPG(tf.Module):
+    def __init__(self, actor, critic, memory, observation_shape, action_shape, param_noise=None, action_noise=None,
+        gamma=0.99, tau=0.001, normalize_returns=False, enable_popart=False, normalize_observations=True,
+        batch_size=128, observation_range=(-5., 5.), action_range=(-1., 1.), return_range=(-np.inf, np.inf),
+        critic_l2_reg=0., actor_lr=1e-4, critic_lr=1e-3, clip_norm=None, reward_scale=1.):
+
+        # Parameters.
+        self.gamma = gamma
+        self.tau = tau
+        self.memory = memory
+        self.normalize_observations = normalize_observations
+        self.normalize_returns = normalize_returns
+        self.action_noise = action_noise
+        self.param_noise = param_noise
+        self.action_range = action_range
+        self.return_range = return_range
+        self.observation_range = observation_range
+        self.observation_shape = observation_shape
+        self.critic = critic
+        self.actor = actor
+        self.clip_norm = clip_norm
+        self.enable_popart = enable_popart
+        self.reward_scale = reward_scale
+        self.batch_size = batch_size
+        self.stats_sample = None
+        self.critic_l2_reg = critic_l2_reg
+        self.actor_lr = tf.constant(actor_lr)
+        self.critic_lr = tf.constant(critic_lr)
+
+        # Observation normalization.
+        if self.normalize_observations:
+            with tf.name_scope('obs_rms'):
+                self.obs_rms = RunningMeanStd(shape=observation_shape)
+        else:
+            self.obs_rms = None
+
+        # Return normalization.
+        if self.normalize_returns:
+            with tf.name_scope('ret_rms'):
+                self.ret_rms = RunningMeanStd()
+        else:
+            self.ret_rms = None
+
+        # Create target networks.
+        self.target_critic = Critic(actor.nb_actions, observation_shape, name='target_critic', network=critic.network, **critic.network_kwargs)
+        self.target_actor = Actor(actor.nb_actions, observation_shape, name='target_actor', network=actor.network, **actor.network_kwargs)
+
+        # Set up parts.
+        if self.param_noise is not None:
+            self.setup_param_noise()
+
+        if MPI is not None:
+            comm = MPI.COMM_WORLD
+            self.actor_optimizer = MpiAdamOptimizer(comm, self.actor.trainable_variables)
+            self.critic_optimizer = MpiAdamOptimizer(comm, self.critic.trainable_variables)
+        else:
+            self.actor_optimizer = tf.keras.optimizers.Adam(learning_rate=actor_lr)
+            self.critic_optimizer = tf.keras.optimizers.Adam(learning_rate=critic_lr)
+
+        logger.info('setting up actor optimizer')
+        actor_shapes = [var.get_shape().as_list() for var in self.actor.trainable_variables]
+        actor_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in actor_shapes])
+        logger.info('  actor shapes: {}'.format(actor_shapes))
+        logger.info('  actor params: {}'.format(actor_nb_params))
+        logger.info('setting up critic optimizer')
+        critic_shapes = [var.get_shape().as_list() for var in self.critic.trainable_variables]
+        critic_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in critic_shapes])
+        logger.info('  critic shapes: {}'.format(critic_shapes))
+        logger.info('  critic params: {}'.format(critic_nb_params))
+        if self.critic_l2_reg > 0.:
+            critic_reg_vars = []
+            for layer in self.critic.network_builder.layers[1:]:
+                critic_reg_vars.append(layer.kernel)
+            for var in critic_reg_vars:
+                logger.info('  regularizing: {}'.format(var.name))
+            logger.info('  applying l2 regularization with {}'.format(self.critic_l2_reg))
+
+        logger.info('setting up critic target updates ...')
+        for var, target_var in zip(self.critic.variables, self.target_critic.variables):
+            logger.info('  {} <- {}'.format(target_var.name, var.name))
+        logger.info('setting up actor target updates ...')
+        for var, target_var in zip(self.actor.variables, self.target_actor.variables):
+            logger.info('  {} <- {}'.format(target_var.name, var.name))
+
+        if self.param_noise:
+            logger.info('setting up param noise')
+            for var, perturbed_var in zip(self.actor.variables, self.perturbed_actor.variables):
+                if var in actor.perturbable_vars:
+                    logger.info('  {} <- {} + noise'.format(perturbed_var.name, var.name))
+                else:
+                    logger.info('  {} <- {}'.format(perturbed_var.name, var.name))
+            for var, perturbed_var in zip(self.actor.variables, self.perturbed_adaptive_actor.variables):
+                if var in actor.perturbable_vars:
+                    logger.info('  {} <- {} + noise'.format(perturbed_var.name, var.name))
+                else:
+                    logger.info('  {} <- {}'.format(perturbed_var.name, var.name))
+
+        if self.normalize_returns and self.enable_popart:
+            self.setup_popart()
+
+        self.initial_state = None  # recurrent architectures not supported yet
+
+
+    def setup_param_noise(self):
+        assert self.param_noise is not None
+
+        # Configure perturbed actor.
+        self.perturbed_actor = Actor(self.actor.nb_actions, self.observation_shape, name='param_noise_actor', network=self.actor.network, **self.actor.network_kwargs)
+
+        # Configure separate copy for stddev adaptation.
+        self.perturbed_adaptive_actor = Actor(self.actor.nb_actions, self.observation_shape, name='adaptive_param_noise_actor', network=self.actor.network, **self.actor.network_kwargs)
+
+    def setup_popart(self):
+        # See https://arxiv.org/pdf/1602.07714.pdf for details.
+        for vs in [self.critic.output_vars, self.target_critic.output_vars]:
+            assert len(vs) == 2
+            M, b = vs
+            assert 'kernel' in M.name
+            assert 'bias' in b.name
+            assert M.get_shape()[-1] == 1
+            assert b.get_shape()[-1] == 1
+
+    @tf.function
+    def step(self, obs, apply_noise=True, compute_Q=True):
+        normalized_obs = tf.clip_by_value(normalize(obs, self.obs_rms), self.observation_range[0], self.observation_range[1])
+        actor_tf = self.actor(normalized_obs)
+        if self.param_noise is not None and apply_noise:
+            action = self.perturbed_actor(normalized_obs)
+        else:
+            action = actor_tf
+
+        if compute_Q:
+            normalized_critic_with_actor_tf = self.critic(normalized_obs, actor_tf)
+            q = denormalize(tf.clip_by_value(normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
+        else:
+            q = None
+
+        if self.action_noise is not None and apply_noise:
+            noise = self.action_noise()
+            action += noise
+        action = tf.clip_by_value(action, self.action_range[0], self.action_range[1])
+
+        return action, q, None, None
+
+    def store_transition(self, obs0, action, reward, obs1, terminal1):
+        reward *= self.reward_scale
+
+        B = obs0.shape[0]
+        for b in range(B):
+            self.memory.append(obs0[b], action[b], reward[b], obs1[b], terminal1[b])
+            if self.normalize_observations:
+                self.obs_rms.update(np.array([obs0[b]]))
+
+    def train(self):
+        batch = self.memory.sample(batch_size=self.batch_size)
+        obs0, obs1 = tf.constant(batch['obs0']), tf.constant(batch['obs1'])
+        actions, rewards, terminals1 = tf.constant(batch['actions']), tf.constant(batch['rewards']), tf.constant(batch['terminals1'], dtype=tf.float32)
+        normalized_obs0, target_Q = self.compute_normalized_obs0_and_target_Q(obs0, obs1, rewards, terminals1)
+
+        if self.normalize_returns and self.enable_popart:
+            old_mean = self.ret_rms.mean
+            old_std = self.ret_rms.std
+            self.ret_rms.update(target_Q.numpy().flatten())  # target_Q is an eager tf.Tensor; convert to NumPy before flattening
+            # renormalize Q outputs
+            new_mean = self.ret_rms.mean
+            new_std = self.ret_rms.std
+            for vs in [self.critic.output_vars, self.target_critic.output_vars]:
+                kernel, bias = vs
+                kernel.assign(kernel * old_std / new_std)
+                bias.assign((bias * old_std + old_mean - new_mean) / new_std)
+
+
+        actor_grads, actor_loss = self.get_actor_grads(normalized_obs0)
+        critic_grads, critic_loss = self.get_critic_grads(normalized_obs0, actions, target_Q)
+
+        if MPI is not None:
+            self.actor_optimizer.apply_gradients(actor_grads, self.actor_lr)
+            self.critic_optimizer.apply_gradients(critic_grads, self.critic_lr)
+        else:
+            self.actor_optimizer.apply_gradients(zip(actor_grads, self.actor.trainable_variables))
+            self.critic_optimizer.apply_gradients(zip(critic_grads, self.critic.trainable_variables))
+
+        return critic_loss, actor_loss
+
+    @tf.function
+    def compute_normalized_obs0_and_target_Q(self, obs0, obs1, rewards, terminals1):
+        normalized_obs0 = tf.clip_by_value(normalize(obs0, self.obs_rms), self.observation_range[0], self.observation_range[1])
+        normalized_obs1 = tf.clip_by_value(normalize(obs1, self.obs_rms), self.observation_range[0], self.observation_range[1])
+        Q_obs1 = denormalize(self.target_critic(normalized_obs1, self.target_actor(normalized_obs1)), self.ret_rms)
+        target_Q = rewards + (1. - terminals1) * self.gamma * Q_obs1
+        return normalized_obs0, target_Q
+
+    @tf.function
+    def get_actor_grads(self, normalized_obs0):
+        with tf.GradientTape() as tape:
+            actor_tf = self.actor(normalized_obs0)
+            normalized_critic_with_actor_tf = self.critic(normalized_obs0, actor_tf)
+            critic_with_actor_tf = denormalize(tf.clip_by_value(normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
+            actor_loss = -tf.reduce_mean(critic_with_actor_tf)
+        actor_grads = tape.gradient(actor_loss, self.actor.trainable_variables)
+        if self.clip_norm:
+            actor_grads = [tf.clip_by_norm(grad, clip_norm=self.clip_norm) for grad in actor_grads]
+        if MPI is not None:
+            actor_grads = tf.concat([tf.reshape(g, (-1,)) for g in actor_grads], axis=0)
+        return actor_grads, actor_loss
+
+    @tf.function
+    def get_critic_grads(self, normalized_obs0, actions, target_Q):
+        with tf.GradientTape() as tape:
+            normalized_critic_tf = self.critic(normalized_obs0, actions)
+            normalized_critic_target_tf = tf.clip_by_value(normalize(target_Q, self.ret_rms), self.return_range[0], self.return_range[1])
+            critic_loss = tf.reduce_mean(tf.square(normalized_critic_tf - normalized_critic_target_tf))
+            # L2-regularize the critic kernels when critic_l2_reg is positive.
+            if self.critic_l2_reg > 0.:
+                # Skip the first entry, which is the input layer.
+                for layer in self.critic.network_builder.layers[1:]:
+                    # The original l2_regularizer takes half of the sum of squares.
+                    critic_loss += (self.critic_l2_reg / 2.) * tf.reduce_sum(tf.square(layer.kernel))
+        critic_grads = tape.gradient(critic_loss, self.critic.trainable_variables)
+        if self.clip_norm:
+            critic_grads = [tf.clip_by_norm(grad, clip_norm=self.clip_norm) for grad in critic_grads]
+        if MPI is not None:
+            critic_grads = tf.concat([tf.reshape(g, (-1,)) for g in critic_grads], axis=0)
+        return critic_grads, critic_loss
+
+
+    def initialize(self):
+        if MPI is not None:
+            sync_from_root(self.actor.trainable_variables + self.critic.trainable_variables)
+        self.target_actor.set_weights(self.actor.get_weights())
+        self.target_critic.set_weights(self.critic.get_weights())
+
+    @tf.function
+    def update_target_net(self):
+        for var, target_var in zip(self.actor.variables, self.target_actor.variables):
+            target_var.assign((1. - self.tau) * target_var + self.tau * var)
+        for var, target_var in zip(self.critic.variables, self.target_critic.variables):
+            target_var.assign((1. - self.tau) * target_var + self.tau * var)
+
+    def get_stats(self):
+
+        if self.stats_sample is None:
+            # Get a sample and keep that fixed for all further computations.
+            # This allows us to estimate the change in value for the same set of inputs.
+            self.stats_sample = self.memory.sample(batch_size=self.batch_size)
+        obs0 = self.stats_sample['obs0']
+        actions = self.stats_sample['actions']
+        normalized_obs0 = tf.clip_by_value(normalize(obs0, self.obs_rms), self.observation_range[0], self.observation_range[1])
+        normalized_critic_tf = self.critic(normalized_obs0, actions)
+        critic_tf = denormalize(tf.clip_by_value(normalized_critic_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
+        actor_tf = self.actor(normalized_obs0)
+        normalized_critic_with_actor_tf = self.critic(normalized_obs0, actor_tf)
+        critic_with_actor_tf = denormalize(tf.clip_by_value(normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
+
+        stats = {}
+        if self.normalize_returns:
+            stats['ret_rms_mean'] = self.ret_rms.mean
+            stats['ret_rms_std'] = self.ret_rms.std
+        if self.normalize_observations:
+            stats['obs_rms_mean'] = tf.reduce_mean(self.obs_rms.mean)
+            stats['obs_rms_std'] = tf.reduce_mean(self.obs_rms.std)
+        stats['reference_Q_mean'] = tf.reduce_mean(critic_tf)
+        stats['reference_Q_std'] = reduce_std(critic_tf)
+        stats['reference_actor_Q_mean'] = tf.reduce_mean(critic_with_actor_tf)
+        stats['reference_actor_Q_std'] = reduce_std(critic_with_actor_tf)
+        stats['reference_action_mean'] = tf.reduce_mean(actor_tf)
+        stats['reference_action_std'] = reduce_std(actor_tf)
+
+        if self.param_noise:
+            perturbed_actor_tf = self.perturbed_actor(normalized_obs0)
+            stats['reference_perturbed_action_mean'] = tf.reduce_mean(perturbed_actor_tf)
+            stats['reference_perturbed_action_std'] = reduce_std(perturbed_actor_tf)
+            stats.update(self.param_noise.get_stats())
+        return stats
+
+
+
+    def adapt_param_noise(self, obs0):
+        try:
+            from mpi4py import MPI
+        except ImportError:
+            MPI = None
+
+        if self.param_noise is None:
+            return 0.
+
+        mean_distance = self.get_mean_distance(obs0).numpy()
+
+        if MPI is not None:
+            mean_distance = MPI.COMM_WORLD.allreduce(mean_distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
+
+        self.param_noise.adapt(mean_distance)
+        return mean_distance
+
+    @tf.function
+    def get_mean_distance(self, obs0):
+        # Perturb a separate copy of the policy to adjust the scale for the next "real" perturbation.
+        update_perturbed_actor(self.actor, self.perturbed_adaptive_actor, self.param_noise.current_stddev)
+
+        normalized_obs0 = tf.clip_by_value(normalize(obs0, self.obs_rms), self.observation_range[0], self.observation_range[1])
+        actor_tf = self.actor(normalized_obs0)
+        adaptive_actor_tf = self.perturbed_adaptive_actor(normalized_obs0)
+        mean_distance = tf.sqrt(tf.reduce_mean(tf.square(actor_tf - adaptive_actor_tf)))
+        return mean_distance
+
+    def reset(self):
+        # Reset internal state after an episode is complete.
+        if self.action_noise is not None:
+            self.action_noise.reset()
+        if self.param_noise is not None:
+            update_perturbed_actor(self.actor, self.perturbed_actor, self.param_noise.current_stddev)
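Note on the Pop-Art branch in the training step above: the `kernel.assign` / `bias.assign` pair is meant to keep the denormalized Q output unchanged when the return statistics move from (old_mean, old_std) to (new_mean, new_std). The following is a minimal NumPy sketch of that identity, not the code from this change. It assumes `denormalize(x, rms)` computes `x * std + mean`, and the names `popart_rescale` and `features` are hypothetical stand-ins for illustration.

```python
import numpy as np


def popart_rescale(kernel, bias, old_mean, old_std, new_mean, new_std):
    """Rescale a linear output layer so denormalized predictions stay the same."""
    new_kernel = kernel * old_std / new_std
    new_bias = (bias * old_std + old_mean - new_mean) / new_std
    return new_kernel, new_bias


rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))   # hypothetical activations feeding the output layer
kernel = rng.normal(size=(3, 1))
bias = rng.normal(size=(1,))
old_mean, old_std = 2.0, 4.0         # return statistics before the update
new_mean, new_std = -1.0, 0.5        # return statistics after the update

# Denormalized Q values before the statistics change.
q_before = (features @ kernel + bias) * old_std + old_mean

# Rescale the layer, then denormalize with the new statistics.
new_kernel, new_bias = popart_rescale(kernel, bias, old_mean, old_std, new_mean, new_std)
q_after = (features @ new_kernel + new_bias) * new_std + new_mean

outputs_preserved = bool(np.allclose(q_before, q_after))
```

Expanding `q_after` algebraically gives `(features @ kernel + bias) * old_std + old_mean`, which is why the same transform is applied to both the critic and the target critic output layers.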
--- a/baselines/ddpg/main.py
+++ b/baselines/ddpg/main.py
@@ -1,123 +0,0 @@
-import argparse
-import time
-import os
-import logging
-from baselines import logger, bench
-from baselines.common.misc_util import (
-    set_global_seeds,
-    boolean_flag,
-)
-import baselines.ddpg.training as training
-from baselines.ddpg.models import Actor, Critic
-from baselines.ddpg.memory import Memory
-from baselines.ddpg.noise import *
-
-import gym
-import tensorflow as tf
-from mpi4py import MPI
-
-def run(env_id, seed, noise_type, layer_norm, evaluation, **kwargs):
-    # Configure things.
-    rank = MPI.COMM_WORLD.Get_rank()
-    if rank != 0:
-        logger.set_level(logger.DISABLED)
-
-    # Create envs.
-    env = gym.make(env_id)
-    env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
-
-    if evaluation and rank==0:
-        eval_env = gym.make(env_id)
-        eval_env = bench.Monitor(eval_env, os.path.join(logger.get_dir(), 'gym_eval'))
-        env = bench.Monitor(env, None)
-    else:
-        eval_env = None
-
-    # Parse noise_type
-    action_noise = None
-    param_noise = None
-    nb_actions = env.action_space.shape[-1]
-    for current_noise_type in noise_type.split(','):
-        current_noise_type = current_noise_type.strip()
-        if current_noise_type == 'none':
-            pass
-        elif 'adaptive-param' in current_noise_type:
-            _, stddev = current_noise_type.split('_')
-            param_noise = AdaptiveParamNoiseSpec(initial_stddev=float(stddev), desired_action_stddev=float(stddev))
-        elif 'normal' in current_noise_type:
-            _, stddev = current_noise_type.split('_')
-            action_noise = NormalActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
-        elif 'ou' in current_noise_type:
-            _, stddev = current_noise_type.split('_')
-            action_noise = OrnsteinUhlenbeckActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
-        else:
-            raise RuntimeError('unknown noise type "{}"'.format(current_noise_type))
-
-    # Configure components.
-    memory = Memory(limit=int(1e6), action_shape=env.action_space.shape, observation_shape=env.observation_space.shape)
-    critic = Critic(layer_norm=layer_norm)
-    actor = Actor(nb_actions, layer_norm=layer_norm)
-
-    # Seed everything to make things reproducible.
-    seed = seed + 1000000 * rank
-    logger.info('rank {}: seed={}, logdir={}'.format(rank, seed, logger.get_dir()))
-    tf.reset_default_graph()
-    set_global_seeds(seed)
-    env.seed(seed)
-    if eval_env is not None:
-        eval_env.seed(seed)
-
-    # Disable logging for rank != 0 to avoid noise.
-    if rank == 0:
-        start_time = time.time()
-    training.train(env=env, eval_env=eval_env, param_noise=param_noise,
-        action_noise=action_noise, actor=actor, critic=critic, memory=memory, **kwargs)
-    env.close()
-    if eval_env is not None:
-        eval_env.close()
-    if rank == 0:
-        logger.info('total runtime: {}s'.format(time.time() - start_time))
-
-
-def parse_args():
-    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-
-    parser.add_argument('--env-id', type=str, default='HalfCheetah-v1')
-    boolean_flag(parser, 'render-eval', default=False)
-    boolean_flag(parser, 'layer-norm', default=True)
-    boolean_flag(parser, 'render', default=False)
-    boolean_flag(parser, 'normalize-returns', default=False)
-    boolean_flag(parser, 'normalize-observations', default=True)
-    parser.add_argument('--seed', help='RNG seed', type=int, default=0)
-    parser.add_argument('--critic-l2-reg', type=float, default=1e-2)
-    parser.add_argument('--batch-size', type=int, default=64)  # per MPI worker
-    parser.add_argument('--actor-lr', type=float, default=1e-4)
-    parser.add_argument('--critic-lr', type=float, default=1e-3)
-    boolean_flag(parser, 'popart', default=False)
-    parser.add_argument('--gamma', type=float, default=0.99)
-    parser.add_argument('--reward-scale', type=float, default=1.)
-    parser.add_argument('--clip-norm', type=float, default=None)
-    parser.add_argument('--nb-epochs', type=int, default=500)  # with default settings, perform 1M steps total
-    parser.add_argument('--nb-epoch-cycles', type=int, default=20)
-    parser.add_argument('--nb-train-steps', type=int, default=50)  # per epoch cycle and MPI worker
-    parser.add_argument('--nb-eval-steps', type=int, default=100)  # per epoch cycle and MPI worker
-    parser.add_argument('--nb-rollout-steps', type=int, default=100)  # per epoch cycle and MPI worker
-    parser.add_argument('--noise-type', type=str, default='adaptive-param_0.2')  # choices are adaptive-param_xx, ou_xx, normal_xx, none
-    parser.add_argument('--num-timesteps', type=int, default=None)
-    boolean_flag(parser, 'evaluation', default=False)
-    args = parser.parse_args()
-    # we don't directly specify timesteps for this script, so make sure that if we do specify them
-    # they agree with the other parameters
-    if args.num_timesteps is not None:
-        assert(args.num_timesteps == args.nb_epochs * args.nb_epoch_cycles * args.nb_rollout_steps)
-    dict_args = vars(args)
-    del dict_args['num_timesteps']
-    return dict_args
-
-
-if __name__ == '__main__':
-    args = parse_args()
-    if MPI.COMM_WORLD.Get_rank() == 0:
-        logger.configure()
-    # Run actual script.
-    run(**args)
--- a/baselines/ddpg/memory.py
+++ b/baselines/ddpg/memory.py
@@ -51,7 +51,7 @@ class Memory(object):

    def sample(self, batch_size):
        # Draw such that we always have a proceeding element.
-        batch_idxs = np.random.random_integers(self.nb_entries - 2, size=batch_size)
+        batch_idxs = np.random.randint(self.nb_entries - 2, size=batch_size)

        obs0_batch = self.observations0.get_batch(batch_idxs)
        obs1_batch = self.observations1.get_batch(batch_idxs)
@@ -71,7 +71,7 @@ class Memory(object):
    def append(self, obs0, action, reward, obs1, terminal1, training=True):
        if not training:
            return
-        
+
        self.observations0.append(obs0)
        self.actions.append(action)
        self.rewards.append(reward)
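Note on the `sample` hunk above: `np.random.random_integers(high)` is deprecated and draws from the closed range [1, high], whereas `np.random.randint(high)` draws from the half-open range [0, high). The replacement therefore shifts the sampled indices down by one instead of being a pure rename. The sketch below illustrates the two ranges. `nb_entries = 6` is a hypothetical fill level, and the old behaviour is emulated with `randint(1, high + 1)` rather than by calling the deprecated function.

```python
import numpy as np

nb_entries = 6                 # hypothetical number of stored transitions
high = nb_entries - 2

rng_state = np.random.RandomState(0)

# New call from the diff: indices in the half-open range [0, high).
new_idxs = rng_state.randint(high, size=10_000)

# Emulation of the old call: indices in the closed range [1, high].
old_style_idxs = rng_state.randint(1, high + 1, size=10_000)

new_range = (int(new_idxs.min()), int(new_idxs.max()))
old_range = (int(old_style_idxs.min()), int(old_style_idxs.max()))
```

With the new call, index 0 becomes reachable and index `nb_entries - 2` no longer is. Whether that matters depends on how the buffers are indexed elsewhere in `Memory`.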
--- a/baselines/ddpg/models.py
+++ b/baselines/ddpg/models.py
@@ -1,77 +1,49 @@
 import tensorflow as tf
-import tensorflow.contrib as tc
+from baselines.common.models import get_network_builder


-class Model(object):
-    def __init__(self, name):
-        self.name = name
-
-    @property
-    def vars(self):
-        return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
-
-    @property
-    def trainable_vars(self):
-        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.name)
+class Model(tf.keras.Model):
+    def __init__(self, name, network='mlp', **network_kwargs):
+        super(Model, self).__init__(name=name)
+        self.network = network
+        self.network_kwargs = network_kwargs

    @property
    def perturbable_vars(self):
-        return [var for var in self.trainable_vars if 'LayerNorm' not in var.name]
+        return [var for var in self.trainable_variables if 'layer_normalization' not in var.name]


 class Actor(Model):
-    def __init__(self, nb_actions, name='actor', layer_norm=True):
-        super(Actor, self).__init__(name=name)
+    def __init__(self, nb_actions, ob_shape, name='actor', network='mlp', **network_kwargs):
+        super().__init__(name=name, network=network, **network_kwargs)
        self.nb_actions = nb_actions
-        self.layer_norm = layer_norm
+        self.network_builder = get_network_builder(network)(**network_kwargs)(ob_shape)
+        self.output_layer = tf.keras.layers.Dense(units=self.nb_actions,
+                                                  activation=tf.keras.activations.tanh,
+                                                  kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
+        _ = self.output_layer(self.network_builder.outputs[0])

-    def __call__(self, obs, reuse=False):
-        with tf.variable_scope(self.name) as scope:
-            if reuse:
-                scope.reuse_variables()
-
-            x = obs
-            x = tf.layers.dense(x, 64)
-            if self.layer_norm:
-                x = tc.layers.layer_norm(x, center=True, scale=True)
-            x = tf.nn.relu(x)
-            
-            x = tf.layers.dense(x, 64)
-            if self.layer_norm:
-                x = tc.layers.layer_norm(x, center=True, scale=True)
-            x = tf.nn.relu(x)
-            
-            x = tf.layers.dense(x, self.nb_actions, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
-            x = tf.nn.tanh(x)
-        return x
+    @tf.function
+    def call(self, obs):
+        return self.output_layer(self.network_builder(obs))


 class Critic(Model):
-    def __init__(self, name='critic', layer_norm=True):
-        super(Critic, self).__init__(name=name)
-        self.layer_norm = layer_norm
+    def __init__(self, nb_actions, ob_shape, name='critic', network='mlp', **network_kwargs):
+        super().__init__(name=name, network=network, **network_kwargs)
+        self.layer_norm = True
+        self.network_builder = get_network_builder(network)(**network_kwargs)((ob_shape[0] + nb_actions,))
+        self.output_layer = tf.keras.layers.Dense(units=1,
+                                                  kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3),
+                                                  name='output')
+        _ = self.output_layer(self.network_builder.outputs[0])

-    def __call__(self, obs, action, reuse=False):
-        with tf.variable_scope(self.name) as scope:
-            if reuse:
-                scope.reuse_variables()
-
-            x = obs
-            x = tf.layers.dense(x, 64)
-            if self.layer_norm:
-                x = tc.layers.layer_norm(x, center=True, scale=True)
-            x = tf.nn.relu(x)
-
-            x = tf.concat([x, action], axis=-1)
-            x = tf.layers.dense(x, 64)
-            if self.layer_norm:
-                x = tc.layers.layer_norm(x, center=True, scale=True)
-            x = tf.nn.relu(x)
-
-            x = tf.layers.dense(x, 1, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
-        return x
+    @tf.function
+    def call(self, obs, actions):
+        x = tf.concat([obs, actions], axis=-1) # this assumes observation and action can be concatenated
+        x = self.network_builder(x)
+        return self.output_layer(x)

    @property
    def output_vars(self):
-        output_vars = [var for var in self.trainable_vars if 'output' in var.name]
-        return output_vars
+        return self.output_layer.trainable_variables
--- a/baselines/ddpg/noise.py
+++ b/baselines/ddpg/noise.py
--- a/baselines/ddpg/training.py
+++ b/baselines/ddpg/training.py
@@ -1,191 +0,0 @@
-import os
-import time
-from collections import deque
-import pickle
-
-from baselines.ddpg.ddpg import DDPG
-import baselines.common.tf_util as U
-
-from baselines import logger
-import numpy as np
-import tensorflow as tf
-from mpi4py import MPI
-
-
-def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, param_noise, actor, critic,
-    normalize_returns, normalize_observations, critic_l2_reg, actor_lr, critic_lr, action_noise,
-    popart, gamma, clip_norm, nb_train_steps, nb_rollout_steps, nb_eval_steps, batch_size, memory,
-    tau=0.01, eval_env=None, param_noise_adaption_interval=50):
-    rank = MPI.COMM_WORLD.Get_rank()
-
-    assert (np.abs(env.action_space.low) == env.action_space.high).all()  # we assume symmetric actions.
-    max_action = env.action_space.high
-    logger.info('scaling actions by {} before executing in env'.format(max_action))
-    agent = DDPG(actor, critic, memory, env.observation_space.shape, env.action_space.shape,
-        gamma=gamma, tau=tau, normalize_returns=normalize_returns, normalize_observations=normalize_observations,
-        batch_size=batch_size, action_noise=action_noise, param_noise=param_noise, critic_l2_reg=critic_l2_reg,
-        actor_lr=actor_lr, critic_lr=critic_lr, enable_popart=popart, clip_norm=clip_norm,
-        reward_scale=reward_scale)
-    logger.info('Using agent with the following configuration:')
-    logger.info(str(agent.__dict__.items()))
-
-    # Set up logging stuff only for a single worker.
-    if rank == 0:
-        saver = tf.train.Saver()
-    else:
-        saver = None
-
-    step = 0
-    episode = 0
-    eval_episode_rewards_history = deque(maxlen=100)
-    episode_rewards_history = deque(maxlen=100)
-    with U.single_threaded_session() as sess:
-        # Prepare everything.
-        agent.initialize(sess)
-        sess.graph.finalize()
-
-        agent.reset()
-        obs = env.reset()
-        if eval_env is not None:
-            eval_obs = eval_env.reset()
-        done = False
-        episode_reward = 0.
-        episode_step = 0
-        episodes = 0
-        t = 0
-
-        epoch = 0
-        start_time = time.time()
-
-        epoch_episode_rewards = []
-        epoch_episode_steps = []
-        epoch_episode_eval_rewards = []
-        epoch_episode_eval_steps = []
-        epoch_start_time = time.time()
-        epoch_actions = []
-        epoch_qs = []
-        epoch_episodes = 0
-        for epoch in range(nb_epochs):
-            for cycle in range(nb_epoch_cycles):
-                # Perform rollouts.
-                for t_rollout in range(nb_rollout_steps):
-                    # Predict next action.
-                    action, q = agent.pi(obs, apply_noise=True, compute_Q=True)
-                    assert action.shape == env.action_space.shape
-
-                    # Execute next action.
-                    if rank == 0 and render:
-                        env.render()
-                    assert max_action.shape == action.shape
-                    new_obs, r, done, info = env.step(max_action * action)  # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
-                    t += 1
-                    if rank == 0 and render:
-                        env.render()
-                    episode_reward += r
-                    episode_step += 1
-
-                    # Book-keeping.
-                    epoch_actions.append(action)
-                    epoch_qs.append(q)
-                    agent.store_transition(obs, action, r, new_obs, done)
-                    obs = new_obs
-
-                    if done:
-                        # Episode done.
-                        epoch_episode_rewards.append(episode_reward)
-                        episode_rewards_history.append(episode_reward)
-                        epoch_episode_steps.append(episode_step)
-                        episode_reward = 0.
-                        episode_step = 0
-                        epoch_episodes += 1
-                        episodes += 1
-
-                        agent.reset()
-                        obs = env.reset()
-
-                # Train.
-                epoch_actor_losses = []
-                epoch_critic_losses = []
-                epoch_adaptive_distances = []
-                for t_train in range(nb_train_steps):
-                    # Adapt param noise, if necessary.
-                    if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
-                        distance = agent.adapt_param_noise()
-                        epoch_adaptive_distances.append(distance)
-
-                    cl, al = agent.train()
-                    epoch_critic_losses.append(cl)
-                    epoch_actor_losses.append(al)
-                    agent.update_target_net()
-
-                # Evaluate.
-                eval_episode_rewards = []
-                eval_qs = []
-                if eval_env is not None:
-                    eval_episode_reward = 0.
-                    for t_rollout in range(nb_eval_steps):
-                        eval_action, eval_q = agent.pi(eval_obs, apply_noise=False, compute_Q=True)
-                        eval_obs, eval_r, eval_done, eval_info = eval_env.step(max_action * eval_action)  # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
-                        if render_eval:
-                            eval_env.render()
-                        eval_episode_reward += eval_r
-
-                        eval_qs.append(eval_q)
-                        if eval_done:
-                            eval_obs = eval_env.reset()
-                            eval_episode_rewards.append(eval_episode_reward)
-                            eval_episode_rewards_history.append(eval_episode_reward)
-                            eval_episode_reward = 0.
-
-            mpi_size = MPI.COMM_WORLD.Get_size()
-            # Log stats.
-            # XXX shouldn't call np.mean on variable length lists
-            duration = time.time() - start_time
-            stats = agent.get_stats()
-            combined_stats = stats.copy()
-            combined_stats['rollout/return'] = np.mean(epoch_episode_rewards)
-            combined_stats['rollout/return_history'] = np.mean(episode_rewards_history)
-            combined_stats['rollout/episode_steps'] = np.mean(epoch_episode_steps)
-            combined_stats['rollout/actions_mean'] = np.mean(epoch_actions)
-            combined_stats['rollout/Q_mean'] = np.mean(epoch_qs)
-            combined_stats['train/loss_actor'] = np.mean(epoch_actor_losses)
-            combined_stats['train/loss_critic'] = np.mean(epoch_critic_losses)
-            combined_stats['train/param_noise_distance'] = np.mean(epoch_adaptive_distances)
-            combined_stats['total/duration'] = duration
-            combined_stats['total/steps_per_second'] = float(t) / float(duration)
-            combined_stats['total/episodes'] = episodes
-            combined_stats['rollout/episodes'] = epoch_episodes
-            combined_stats['rollout/actions_std'] = np.std(epoch_actions)
-            # Evaluation statistics.
-            if eval_env is not None:
-                combined_stats['eval/return'] = eval_episode_rewards
-                combined_stats['eval/return_history'] = np.mean(eval_episode_rewards_history)
-                combined_stats['eval/Q'] = eval_qs
-                combined_stats['eval/episodes'] = len(eval_episode_rewards)
-            def as_scalar(x):
-                if isinstance(x, np.ndarray):
-                    assert x.size == 1
-                    return x[0]
-                elif np.isscalar(x):
-                    return x
-                else:
-                    raise ValueError('expected scalar, got %s'%x)
-            combined_stats_sums = MPI.COMM_WORLD.allreduce(np.array([as_scalar(x) for x in combined_stats.values()]))
-            combined_stats = {k : v / mpi_size for (k,v) in zip(combined_stats.keys(), combined_stats_sums)}
-
-            # Total statistics.
-            combined_stats['total/epochs'] = epoch + 1
-            combined_stats['total/steps'] = t
-
-            for key in sorted(combined_stats.keys()):
-                logger.record_tabular(key, combined_stats[key])
-            logger.dump_tabular()
-            logger.info('')
-            logdir = logger.get_dir()
-            if rank == 0 and logdir:
-                if hasattr(env, 'get_state'):
-                    with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:
-                        pickle.dump(env.get_state(), f)
-                if eval_env and hasattr(eval_env, 'get_state'):
-                    with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
-                        pickle.dump(eval_env.get_state(), f)
--- a/baselines/deepq/README.md
+++ b/baselines/deepq/README.md
@@ -9,44 +9,29 @@ Here's a list of commands to run to quickly get a working example:

 ```bash
 # Train model and save the results to cartpole_model.pkl
-python -m baselines.deepq.experiments.train_cartpole
+python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5
 # Load the model saved in cartpole_model.pkl and visualize the learned policy
-python -m baselines.deepq.experiments.enjoy_cartpole
+python -m baselines.run --alg=deepq --env=CartPole-v0 --load_path=./cartpole_model.pkl --num_timesteps=0 --play
 ```

-
-Be sure to check out the source code of [both](experiments/train_cartpole.py) [files](experiments/enjoy_cartpole.py)!
-
 ## If you wish to apply DQN to solve a problem.

 Check out our simple agent trained with one stop shop `deepq.learn` function. 

 - [baselines/deepq/experiments/train_cartpole.py](experiments/train_cartpole.py) - train a Cartpole agent.
- [baselines/deepq/experiments/train_pong.py](experiments/train_pong.py) - train a Pong agent using convolutional neural networks.

-In particular notice that once `deepq.learn` finishes training it returns `act` function which can be used to select actions in the environment. Once trained you can easily save it and load at later time. For both of the files listed above there are complimentary files `enjoy_cartpole.py` and `enjoy_pong.py` respectively, that load and visualize the learned policy.
+In particular, notice that once `deepq.learn` finishes training, it returns an `act` function that can be used to select actions in the environment. Once trained, you can easily save it and load it at a later time. The complementary file `enjoy_cartpole.py` loads and visualizes the learned policy.

 ## If you wish to experiment with the algorithm

 ##### Check out the examples

-
 - [baselines/deepq/experiments/custom_cartpole.py](experiments/custom_cartpole.py) - Cartpole training with more fine grained control over the internals of DQN algorithm.
- [baselines/deepq/experiments/atari/train.py](experiments/atari/train.py) - more robust setup for training at scale.
-
-
-##### Download a pretrained Atari agent
-
-For some research projects it is sometimes useful to have an already trained agent handy. There's a variety of models to choose from. You can list them all by running:
+- [baselines/deepq/defaults.py](defaults.py) - settings for training on atari. Run 

 ```bash
-python -m baselines.deepq.experiments.atari.download_model
+python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 
 ```
+to train on Atari Pong (see more in repo-wide [README.md](../../README.md#training-models))

-Once you pick a model, you can download it and visualize the learned policy. Be sure to pass `--dueling` flag to visualization script when using dueling models.

-```bash
-python -m baselines.deepq.experiments.atari.download_model --blob model-atari-duel-pong-1 --model-dir /tmp/models
-python -m baselines.deepq.experiments.atari.enjoy --model-dir /tmp/models/model-atari-duel-pong-1 --env Pong --dueling
-
-```
--- a/baselines/deepq/__init__.py
+++ b/baselines/deepq/__init__.py
@@ -1,8 +1,8 @@
-from baselines.deepq import models  # noqa
-from baselines.deepq.build_graph import build_act, build_train  # noqa
-from baselines.deepq.simple import learn, load  # noqa
-from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer  # noqa
+from baselines.deepq import models  # noqa F401
+from baselines.deepq.deepq_learner import DEEPQ  # noqa F401
+from baselines.deepq.deepq import learn  # noqa F401
+from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer  # noqa F401

 def wrap_atari_dqn(env):
    from baselines.common.atari_wrappers import wrap_deepmind
-    return wrap_deepmind(env, frame_stack=True, scale=True)
+    return wrap_deepmind(env, frame_stack=True, scale=False)
--- a/baselines/deepq/build_graph.py
+++ b/baselines/deepq/build_graph.py
@@ -1,449 +0,0 @@
-"""Deep Q learning graph
-
-The functions in this file can are used to create the following functions:
-
-======= act ========
-
-    Function to chose an action given an observation
-
-    Parameters
-    ----------
-    observation: object
-        Observation that can be feed into the output of make_obs_ph
-    stochastic: bool
-        if set to False all the actions are always deterministic (default False)
-    update_eps_ph: float
-        update epsilon a new value, if negative not update happens
-        (default: no update)
-
-    Returns
-    -------
-    Tensor of dtype tf.int64 and shape (BATCH_SIZE,) with an action to be performed for
-    every element of the batch.
-
-
-======= act (in case of parameter noise) ========
-
-    Function to chose an action given an observation
-
-    Parameters
-    ----------
-    observation: object
-        Observation that can be feed into the output of make_obs_ph
-    stochastic: bool
-        if set to False all the actions are always deterministic (default False)
-    update_eps_ph: float
-        update epsilon a new value, if negative not update happens
-        (default: no update)
-    reset_ph: bool
-        reset the perturbed policy by sampling a new perturbation
-    update_param_noise_threshold_ph: float
-        the desired threshold for the difference between non-perturbed and perturbed policy
-    update_param_noise_scale_ph: bool
-        whether or not to update the scale of the noise for the next time it is re-perturbed
-
-    Returns
-    -------
-    Tensor of dtype tf.int64 and shape (BATCH_SIZE,) with an action to be performed for
-    every element of the batch.
-
-
-======= train =======
-
-    Function that takes a transition (s,a,r,s') and optimizes Bellman equation's error:
-
-        td_error = Q(s,a) - (r + gamma * max_a' Q(s', a'))
-        loss = huber_loss[td_error]
-
-    Parameters
-    ----------
-    obs_t: object
-        a batch of observations
-    action: np.array
-        actions that were selected upon seeing obs_t.
-        dtype must be int32 and shape must be (batch_size,)
-    reward: np.array
-        immediate reward attained after executing those actions
-        dtype must be float32 and shape must be (batch_size,)
-    obs_tp1: object
-        observations that followed obs_t
-    done: np.array
-        1 if obs_t was the last observation in the episode and 0 otherwise
-        obs_tp1 gets ignored, but must be of the valid shape.
-        dtype must be float32 and shape must be (batch_size,)
-    weight: np.array
-        imporance weights for every element of the batch (gradient is multiplied
-        by the importance weight) dtype must be float32 and shape must be (batch_size,)
-
-    Returns
-    -------
-    td_error: np.array
-        a list of differences between Q(s,a) and the target in Bellman's equation.
-        dtype is float32 and shape is (batch_size,)
-
-======= update_target ========
-
-    copy the parameters from optimized Q function to the target Q function.
-    In Q learning we actually optimize the following error:
-
-        Q(s,a) - (r + gamma * max_a' Q'(s', a'))
-
-    Where Q' is lagging behind Q to stablize the learning. For example for Atari
-
-    Q' is set to Q once every 10000 updates training steps.
-
-"""
-import tensorflow as tf
-import baselines.common.tf_util as U
-
-
-def scope_vars(scope, trainable_only=False):
-    """
-    Get variables inside a scope
-    The scope can be specified as a string
-    Parameters
-    ----------
-    scope: str or VariableScope
-        scope in which the variables reside.
-    trainable_only: bool
-        whether or not to return only the variables that were marked as trainable.
-    Returns
-    -------
-    vars: [tf.Variable]
-        list of variables in `scope`.
-    """
-    return tf.get_collection(
-        tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.GLOBAL_VARIABLES,
-        scope=scope if isinstance(scope, str) else scope.name
-    )
-
-
-def scope_name():
-    """Returns the name of current scope as a string, e.g. deepq/q_func"""
-    return tf.get_variable_scope().name
-
-
-def absolute_scope_name(relative_scope_name):
-    """Appends parent scope name to `relative_scope_name`"""
-    return scope_name() + "/" + relative_scope_name
-
-
-def default_param_noise_filter(var):
-    if var not in tf.trainable_variables():
-        # We never perturb non-trainable vars.
-        return False
-    if "fully_connected" in var.name:
-        # We perturb fully-connected layers.
-        return True
-
-    # The remaining layers are likely conv or layer norm layers, which we do not wish to
-    # perturb (in the former case because they only extract features, in the latter case because
-    # we use them for normalization purposes). If you change your network, you will likely want
-    # to re-consider which layers to perturb and which to keep untouched.
-    return False
-
-
-def build_act(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None):
-    """Creates the act function:
-
-    Parameters
-    ----------
-    make_obs_ph: str -> tf.placeholder or TfInput
-        a function that take a name and creates a placeholder of input with that name
-    q_func: (tf.Variable, int, str, bool) -> tf.Variable
-        the model that takes the following inputs:
-            observation_in: object
-                the output of observation placeholder
-            num_actions: int
-                number of actions
-            scope: str
-            reuse: bool
-                should be passed to outer variable scope
-        and returns a tensor of shape (batch_size, num_actions) with values of every action.
-    num_actions: int
-        number of actions.
-    scope: str or VariableScope
-        optional scope for variable_scope.
-    reuse: bool or None
-        whether or not the variables should be reused. To be able to reuse the scope must be given.
-
-    Returns
-    -------
-    act: (tf.Variable, bool, float) -> tf.Variable
-        function to select and action given observation.
-`       See the top of the file for details.
-    """
-    with tf.variable_scope(scope, reuse=reuse):
-        observations_ph = make_obs_ph("observation")
-        stochastic_ph = tf.placeholder(tf.bool, (), name="stochastic")
-        update_eps_ph = tf.placeholder(tf.float32, (), name="update_eps")
-
-        eps = tf.get_variable("eps", (), initializer=tf.constant_initializer(0))
-
-        q_values = q_func(observations_ph.get(), num_actions, scope="q_func")
-        deterministic_actions = tf.argmax(q_values, axis=1)
-
-        batch_size = tf.shape(observations_ph.get())[0]
-        random_actions = tf.random_uniform(tf.stack([batch_size]), minval=0, maxval=num_actions, dtype=tf.int64)
-        chose_random = tf.random_uniform(tf.stack([batch_size]), minval=0, maxval=1, dtype=tf.float32) < eps
-        stochastic_actions = tf.where(chose_random, random_actions, deterministic_actions)
-
-        output_actions = tf.cond(stochastic_ph, lambda: stochastic_actions, lambda: deterministic_actions)
-        update_eps_expr = eps.assign(tf.cond(update_eps_ph >= 0, lambda: update_eps_ph, lambda: eps))
-        _act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph],
-                         outputs=output_actions,
-                         givens={update_eps_ph: -1.0, stochastic_ph: True},
-                         updates=[update_eps_expr])
-        def act(ob, stochastic=True, update_eps=-1):
-            return _act(ob, stochastic, update_eps)
-        return act
-
-
-def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None, param_noise_filter_func=None):
-    """Creates the act function with support for parameter space noise exploration (https://arxiv.org/abs/1706.01905):
-
-    Parameters
-    ----------
-    make_obs_ph: str -> tf.placeholder or TfInput
-        a function that take a name and creates a placeholder of input with that name
-    q_func: (tf.Variable, int, str, bool) -> tf.Variable
-        the model that takes the following inputs:
-            observation_in: object
-                the output of observation placeholder
-            num_actions: int
-                number of actions
-            scope: str
-            reuse: bool
-                should be passed to outer variable scope
-        and returns a tensor of shape (batch_size, num_actions) with values of every action.
-    num_actions: int
-        number of actions.
-    scope: str or VariableScope
-        optional scope for variable_scope.
-    reuse: bool or None
-        whether or not the variables should be reused. To be able to reuse the scope must be given.
-    param_noise_filter_func: tf.Variable -> bool
-        function that decides whether or not a variable should be perturbed. Only applicable
-        if param_noise is True. If set to None, default_param_noise_filter is used by default.
-
-    Returns
-    -------
-    act: (tf.Variable, bool, float, bool, float, bool) -> tf.Variable
-        function to select and action given observation.
-`       See the top of the file for details.
-    """
-    if param_noise_filter_func is None:
-        param_noise_filter_func = default_param_noise_filter
-
-    with tf.variable_scope(scope, reuse=reuse):
-        observations_ph = make_obs_ph("observation")
-        stochastic_ph = tf.placeholder(tf.bool, (), name="stochastic")
-        update_eps_ph = tf.placeholder(tf.float32, (), name="update_eps")
-        update_param_noise_threshold_ph = tf.placeholder(tf.float32, (), name="update_param_noise_threshold")
-        update_param_noise_scale_ph = tf.placeholder(tf.bool, (), name="update_param_noise_scale")
-        reset_ph = tf.placeholder(tf.bool, (), name="reset")
-
-        eps = tf.get_variable("eps", (), initializer=tf.constant_initializer(0))
-        param_noise_scale = tf.get_variable("param_noise_scale", (), initializer=tf.constant_initializer(0.01), trainable=False)
-        param_noise_threshold = tf.get_variable("param_noise_threshold", (), initializer=tf.constant_initializer(0.05), trainable=False)
-
-        # Unmodified Q.
-        q_values = q_func(observations_ph.get(), num_actions, scope="q_func")
-
-        # Perturbable Q used for the actual rollout.
-        q_values_perturbed = q_func(observations_ph.get(), num_actions, scope="perturbed_q_func")
-        # We have to wrap this code into a function due to the way tf.cond() works. See
-        # https://stackoverflow.com/questions/37063952/confused-by-the-behavior-of-tf-cond for
-        # a more detailed discussion.
-        def perturb_vars(original_scope, perturbed_scope):
-            all_vars = scope_vars(absolute_scope_name(original_scope))
-            all_perturbed_vars = scope_vars(absolute_scope_name(perturbed_scope))
-            assert len(all_vars) == len(all_perturbed_vars)
-            perturb_ops = []
-            for var, perturbed_var in zip(all_vars, all_perturbed_vars):
-                if param_noise_filter_func(perturbed_var):
-                    # Perturb this variable.
-                    op = tf.assign(perturbed_var, var + tf.random_normal(shape=tf.shape(var), mean=0., stddev=param_noise_scale))
-                else:
-                    # Do not perturb, just assign.
-                    op = tf.assign(perturbed_var, var)
-                perturb_ops.append(op)
-            assert len(perturb_ops) == len(all_vars)
-            return tf.group(*perturb_ops)
-
-        # Set up functionality to re-compute `param_noise_scale`. This perturbs yet another copy
-        # of the network and measures the effect of that perturbation in action space. If the perturbation
-        # is too big, reduce scale of perturbation, otherwise increase.
-        q_values_adaptive = q_func(observations_ph.get(), num_actions, scope="adaptive_q_func")
-        perturb_for_adaption = perturb_vars(original_scope="q_func", perturbed_scope="adaptive_q_func")
-        kl = tf.reduce_sum(tf.nn.softmax(q_values) * (tf.log(tf.nn.softmax(q_values)) - tf.log(tf.nn.softmax(q_values_adaptive))), axis=-1)
-        mean_kl = tf.reduce_mean(kl)
-        def update_scale():
-            with tf.control_dependencies([perturb_for_adaption]):
-                update_scale_expr = tf.cond(mean_kl < param_noise_threshold,
-                    lambda: param_noise_scale.assign(param_noise_scale * 1.01),
-                    lambda: param_noise_scale.assign(param_noise_scale / 1.01),
-                )
-            return update_scale_expr
-
-        # Functionality to update the threshold for parameter space noise.
-        update_param_noise_threshold_expr = param_noise_threshold.assign(tf.cond(update_param_noise_threshold_ph >= 0,
-            lambda: update_param_noise_threshold_ph, lambda: param_noise_threshold))
-
-        # Put everything together.
-        deterministic_actions = tf.argmax(q_values_perturbed, axis=1)
-        batch_size = tf.shape(observations_ph.get())[0]
-        random_actions = tf.random_uniform(tf.stack([batch_size]), minval=0, maxval=num_actions, dtype=tf.int64)
-        chose_random = tf.random_uniform(tf.stack([batch_size]), minval=0, maxval=1, dtype=tf.float32) < eps
-        stochastic_actions = tf.where(chose_random, random_actions, deterministic_actions)
-
-        output_actions = tf.cond(stochastic_ph, lambda: stochastic_actions, lambda: deterministic_actions)
-        update_eps_expr = eps.assign(tf.cond(update_eps_ph >= 0, lambda: update_eps_ph, lambda: eps))
-        updates = [
-            update_eps_expr,
-            tf.cond(reset_ph, lambda: perturb_vars(original_scope="q_func", perturbed_scope="perturbed_q_func"), lambda: tf.group(*[])),
-            tf.cond(update_param_noise_scale_ph, lambda: update_scale(), lambda: tf.Variable(0., trainable=False)),
-            update_param_noise_threshold_expr,
-        ]
-        _act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph, reset_ph, update_param_noise_threshold_ph, update_param_noise_scale_ph],
-                         outputs=output_actions,
-                         givens={update_eps_ph: -1.0, stochastic_ph: True, reset_ph: False, update_param_noise_threshold_ph: False, update_param_noise_scale_ph: False},
-                         updates=updates)
-        def act(ob, reset, update_param_noise_threshold, update_param_noise_scale, stochastic=True, update_eps=-1):
-            return _act(ob, stochastic, update_eps, reset, update_param_noise_threshold, update_param_noise_scale)
-        return act
-
-
-def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=None, gamma=1.0,
-    double_q=True, scope="deepq", reuse=None, param_noise=False, param_noise_filter_func=None):
-    """Creates the train function:
-
-    Parameters
-    ----------
-    make_obs_ph: str -> tf.placeholder or TfInput
-        a function that takes a name and creates a placeholder of input with that name
-    q_func: (tf.Variable, int, str, bool) -> tf.Variable
-        the model that takes the following inputs:
-            observation_in: object
-                the output of observation placeholder
-            num_actions: int
-                number of actions
-            scope: str
-            reuse: bool
-                should be passed to outer variable scope
-        and returns a tensor of shape (batch_size, num_actions) with values of every action.
-    num_actions: int
-        number of actions
-    reuse: bool
-        whether or not to reuse the graph variables
-    optimizer: tf.train.Optimizer
-        optimizer to use for the Q-learning objective.
-    grad_norm_clipping: float or None
-        clip gradient norms to this value. If None no clipping is performed.
-    gamma: float
-        discount rate.
-    double_q: bool
-        if true will use Double Q Learning (https://arxiv.org/abs/1509.06461).
-        In general it is a good idea to keep it enabled.
-    scope: str or VariableScope
-        optional scope for variable_scope.
-    reuse: bool or None
-        whether or not the variables should be reused. To be able to reuse the scope must be given.
-    param_noise: bool
-        whether or not to use parameter space noise (https://arxiv.org/abs/1706.01905)
-    param_noise_filter_func: tf.Variable -> bool
-        function that decides whether or not a variable should be perturbed. Only applicable
-        if param_noise is True. If set to None, default_param_noise_filter is used by default.
-
-    Returns
-    -------
-    act: (tf.Variable, bool, float) -> tf.Variable
-        function to select and action given observation.
-`       See the top of the file for details.
-    train: (object, np.array, np.array, object, np.array, np.array) -> np.array
-        optimize the error in Bellman's equation.
-`       See the top of the file for details.
-    update_target: () -> ()
-        copy the parameters from optimized Q function to the target Q function.
-`       See the top of the file for details.
-    debug: {str: function}
-        a bunch of functions to print debug data like q_values.
-    """
-    if param_noise:
-        act_f = build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope=scope, reuse=reuse,
-            param_noise_filter_func=param_noise_filter_func)
-    else:
-        act_f = build_act(make_obs_ph, q_func, num_actions, scope=scope, reuse=reuse)
-
-    with tf.variable_scope(scope, reuse=reuse):
-        # set up placeholders
-        obs_t_input = make_obs_ph("obs_t")
-        act_t_ph = tf.placeholder(tf.int32, [None], name="action")
-        rew_t_ph = tf.placeholder(tf.float32, [None], name="reward")
-        obs_tp1_input = make_obs_ph("obs_tp1")
-        done_mask_ph = tf.placeholder(tf.float32, [None], name="done")
-        importance_weights_ph = tf.placeholder(tf.float32, [None], name="weight")
-
-        # q network evaluation
-        q_t = q_func(obs_t_input.get(), num_actions, scope="q_func", reuse=True)  # reuse parameters from act
-        q_func_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=tf.get_variable_scope().name + "/q_func")
-
-        # target q network evalution
-        q_tp1 = q_func(obs_tp1_input.get(), num_actions, scope="target_q_func")
-        target_q_func_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=tf.get_variable_scope().name + "/target_q_func")
-
-        # q scores for actions which we know were selected in the given state.
-        q_t_selected = tf.reduce_sum(q_t * tf.one_hot(act_t_ph, num_actions), 1)
-
-        # compute estimate of best possible value starting from state at t + 1
-        if double_q:
-            q_tp1_using_online_net = q_func(obs_tp1_input.get(), num_actions, scope="q_func", reuse=True)
-            q_tp1_best_using_online_net = tf.argmax(q_tp1_using_online_net, 1)
-            q_tp1_best = tf.reduce_sum(q_tp1 * tf.one_hot(q_tp1_best_using_online_net, num_actions), 1)
-        else:
-            q_tp1_best = tf.reduce_max(q_tp1, 1)
-        q_tp1_best_masked = (1.0 - done_mask_ph) * q_tp1_best
-
-        # compute RHS of bellman equation
-        q_t_selected_target = rew_t_ph + gamma * q_tp1_best_masked
-
-        # compute the error (potentially clipped)
-        td_error = q_t_selected - tf.stop_gradient(q_t_selected_target)
-        errors = U.huber_loss(td_error)
-        weighted_error = tf.reduce_mean(importance_weights_ph * errors)
-
-        # compute optimization op (potentially with gradient clipping)
-        if grad_norm_clipping is not None:
-            gradients = optimizer.compute_gradients(weighted_error, var_list=q_func_vars)
-            for i, (grad, var) in enumerate(gradients):
-                if grad is not None:
-                    gradients[i] = (tf.clip_by_norm(grad, grad_norm_clipping), var)
-            optimize_expr = optimizer.apply_gradients(gradients)
-        else:
-            optimize_expr = optimizer.minimize(weighted_error, var_list=q_func_vars)
-
-        # update_target_fn will be called periodically to copy Q network to target Q network
-        update_target_expr = []
-        for var, var_target in zip(sorted(q_func_vars, key=lambda v: v.name),
-                                   sorted(target_q_func_vars, key=lambda v: v.name)):
-            update_target_expr.append(var_target.assign(var))
-        update_target_expr = tf.group(*update_target_expr)
-
-        # Create callable functions
-        train = U.function(
-            inputs=[
-                obs_t_input,
-                act_t_ph,
-                rew_t_ph,
-                obs_tp1_input,
-                done_mask_ph,
-                importance_weights_ph
-            ],
-            outputs=td_error,
-            updates=[optimize_expr]
-        )
-        update_target = U.function([], [], updates=[update_target_expr])
-
-        q_values = U.function([obs_t_input], q_t)
-
-        return act_f, train, update_target, {'q_values': q_values}
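Review note: the removed `build_train` graph computed a double-Q TD error followed by a Huber loss. The NumPy sketch below restates that arithmetic outside TensorFlow, as an illustration only. The function names `huber_loss_np` and `double_q_td_error` and their argument layout are assumptions made for this note and are not part of baselines.

```python
import numpy as np


def huber_loss_np(x, delta=1.0):
    """Huber loss: quadratic for |x| < delta, linear beyond it."""
    abs_x = np.abs(x)
    return np.where(abs_x < delta, 0.5 * np.square(x), delta * (abs_x - 0.5 * delta))


def double_q_td_error(q_t, q_tp1_online, q_tp1_target, actions, rewards, dones, gamma=1.0):
    """TD error Q(s,a) - (r + gamma * (1 - done) * Q'(s', argmax_a Q(s', a))).

    q_t, q_tp1_online, q_tp1_target: arrays of shape (batch, num_actions)
    actions: int array of shape (batch,); rewards, dones: float arrays of shape (batch,)
    """
    batch = np.arange(len(actions))
    # Q-values of the actions that were actually taken.
    q_t_selected = q_t[batch, actions]
    # Double Q: the online network picks the action, the target network scores it.
    best_actions = np.argmax(q_tp1_online, axis=1)
    q_tp1_best = q_tp1_target[batch, best_actions]
    # Terminal transitions contribute no bootstrap value.
    target = rewards + gamma * (1.0 - dones) * q_tp1_best
    return q_t_selected - target


if __name__ == "__main__":
    td = double_q_td_error(
        q_t=np.array([[1.0, 2.0], [3.0, 4.0]]),
        q_tp1_online=np.array([[0.0, 1.0], [1.0, 0.0]]),
        q_tp1_target=np.array([[5.0, 6.0], [7.0, 8.0]]),
        actions=np.array([0, 1]),
        rewards=np.array([1.0, 1.0]),
        dones=np.array([0.0, 1.0]),
        gamma=0.5,
    )
    print(td, huber_loss_np(td))
```

With `double_q=False` the original code used `max_a' Q'(s', a')` instead, which corresponds to replacing `q_tp1_best` with `q_tp1_target.max(axis=1)` in this sketch.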
--- a/baselines/deepq/deepq.py
+++ b/baselines/deepq/deepq.py
@@ -0,0 +1,234 @@
+import os.path as osp
+
+import tensorflow as tf
+import numpy as np
+
+from baselines import logger
+from baselines.common.schedules import LinearSchedule
+from baselines.common.vec_env.vec_env import VecEnv
+from baselines.common import set_global_seeds
+
+from baselines import deepq
+from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
+
+from baselines.deepq.models import build_q_func
+
+
+
+def learn(env,
+          network,
+          seed=None,
+          lr=5e-4,
+          total_timesteps=100000,
+          buffer_size=50000,
+          exploration_fraction=0.1,
+          exploration_final_eps=0.02,
+          train_freq=1,
+          batch_size=32,
+          print_freq=100,
+          checkpoint_freq=10000,
+          checkpoint_path=None,
+          learning_starts=1000,
+          gamma=1.0,
+          target_network_update_freq=500,
+          prioritized_replay=False,
+          prioritized_replay_alpha=0.6,
+          prioritized_replay_beta0=0.4,
+          prioritized_replay_beta_iters=None,
+          prioritized_replay_eps=1e-6,
+          param_noise=False,
+          callback=None,
+          load_path=None,
+          **network_kwargs
+            ):
+    """Train a deepq model.
+
+    Parameters
+    ----------
+    env: gym.Env
+        environment to train on
+    network: string or a function
+        neural network to use as a q function approximator. If string, has to be one of the names of registered models in baselines.common.models
+        (mlp, cnn, conv_only). If a function, should take an observation tensor and return a latent variable tensor, which
+        will be mapped to the Q function heads (see build_q_func in baselines.deepq.models for details on that)
+    seed: int or None
+        prng seed. The runs with the same seed "should" give the same results. If None, no seeding is used.
+    lr: float
+        learning rate for adam optimizer
+    total_timesteps: int
+        number of env steps to optimize for
+    buffer_size: int
+        size of the replay buffer
+    exploration_fraction: float
+        fraction of entire training period over which the exploration rate is annealed
+    exploration_final_eps: float
+        final value of random action probability
+    train_freq: int
+        update the model every `train_freq` environment steps,
+        once more than `learning_starts` steps have elapsed.
+    batch_size: int
+        size of a batch sampled from the replay buffer for training
+    print_freq: int
+        how often to print out training progress
+        set to None to disable printing
+    checkpoint_freq: int
+        how often to save the model. This is so that the best version is restored
+        at the end of the training. If you do not wish to restore the best version at
+        the end of the training set this variable to None.
+    learning_starts: int
+        how many steps of the model to collect transitions for before learning starts
+    gamma: float
+        discount factor
+    target_network_update_freq: int
+        update the target network every `target_network_update_freq` steps.
+    prioritized_replay: bool
+        if True, a prioritized replay buffer is used.
+    prioritized_replay_alpha: float
+        alpha parameter for prioritized replay buffer
+    prioritized_replay_beta0: float
+        initial value of beta for prioritized replay buffer
+    prioritized_replay_beta_iters: int
+        number of iterations over which beta will be annealed from initial value
+        to 1.0. If None, defaults to total_timesteps.
+    prioritized_replay_eps: float
+        epsilon to add to the TD errors when updating priorities.
+    param_noise: bool
+        whether or not to use parameter space noise (https://arxiv.org/abs/1706.01905)
+    callback: (locals, globals) -> bool or None
+        function called at every step with the state of the algorithm.
+        If the callback returns True, training stops.
+    load_path: str
+        path to load the model from. (default: None)
+    **network_kwargs
+        additional keyword arguments to pass to the network builder.
+
+    Returns
+    -------
+    model: DEEPQ
+        The trained model; `model.step` selects actions.
+        See header of baselines/deepq/deepq_learner.py for details on the step function.
+    """
+    # Create all the functions necessary to train the model
+
+    set_global_seeds(seed)
+
+    q_func = build_q_func(network, **network_kwargs)
+
+    # capture the shape outside the closure so that the env object is not serialized
+    # by cloudpickle when serializing make_obs_ph
+
+    observation_space = env.observation_space
+
+    model = deepq.DEEPQ(
+        q_func=q_func,
+        observation_shape=env.observation_space.shape,
+        num_actions=env.action_space.n,
+        lr=lr,
+        grad_norm_clipping=10,
+        gamma=gamma,
+        param_noise=param_noise
+    )
+
+    if load_path is not None:
+        load_path = osp.expanduser(load_path)
+        ckpt = tf.train.Checkpoint(model=model)
+        manager = tf.train.CheckpointManager(ckpt, load_path, max_to_keep=None)
+        ckpt.restore(manager.latest_checkpoint)
+        print("Restoring from {}".format(manager.latest_checkpoint))
+
+    # Create the replay buffer
+    if prioritized_replay:
+        replay_buffer = PrioritizedReplayBuffer(buffer_size, alpha=prioritized_replay_alpha)
+        if prioritized_replay_beta_iters is None:
+            prioritized_replay_beta_iters = total_timesteps
+        beta_schedule = LinearSchedule(prioritized_replay_beta_iters,
+                                       initial_p=prioritized_replay_beta0,
+                                       final_p=1.0)
+    else:
+        replay_buffer = ReplayBuffer(buffer_size)
+        beta_schedule = None
+    # Create the schedule for exploration starting from 1.
+    exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * total_timesteps),
+                                 initial_p=1.0,
+                                 final_p=exploration_final_eps)
+
+    model.update_target()
+
+    episode_rewards = [0.0]
+    saved_mean_reward = None
+    obs = env.reset()
+    # always mimic the vectorized env
+    if not isinstance(env, VecEnv):
+        obs = np.expand_dims(np.array(obs), axis=0)
+    reset = True
+
+    for t in range(total_timesteps):
+        if callback is not None:
+            if callback(locals(), globals()):
+                break
+        kwargs = {}
+        if not param_noise:
+            update_eps = tf.constant(exploration.value(t))
+            update_param_noise_threshold = 0.
+        else:
+            update_eps = tf.constant(0.)
+            # Compute the threshold such that the KL divergence between perturbed and non-perturbed
+            # policy is comparable to eps-greedy exploration with eps = exploration.value(t).
+            # See Appendix C.1 in Parameter Space Noise for Exploration, Plappert et al., 2017
+            # for detailed explanation.
+            update_param_noise_threshold = -np.log(1. - exploration.value(t) + exploration.value(t) / float(env.action_space.n))
+            kwargs['reset'] = reset
+            kwargs['update_param_noise_threshold'] = update_param_noise_threshold
+            kwargs['update_param_noise_scale'] = True
+        action, _, _, _ = model.step(tf.constant(obs), update_eps=update_eps, **kwargs)
+        action = action[0].numpy()
+        reset = False
+        new_obs, rew, done, _ = env.step(action)
+        # Store transition in the replay buffer.
+        if not isinstance(env, VecEnv):
+            new_obs = np.expand_dims(np.array(new_obs), axis=0)
+            replay_buffer.add(obs[0], action, rew, new_obs[0], float(done))
+        else:
+            replay_buffer.add(obs[0], action, rew[0], new_obs[0], float(done[0]))
+        # # Store transition in the replay buffer.
+        # replay_buffer.add(obs, action, rew, new_obs, float(done))
+        obs = new_obs
+
+        episode_rewards[-1] += rew
+        if done:
+            obs = env.reset()
+            if not isinstance(env, VecEnv):
+                obs = np.expand_dims(np.array(obs), axis=0)
+            episode_rewards.append(0.0)
+            reset = True
+
+        if t > learning_starts and t % train_freq == 0:
+            # Minimize the error in Bellman's equation on a batch sampled from replay buffer.
+            if prioritized_replay:
+                experience = replay_buffer.sample(batch_size, beta=beta_schedule.value(t))
+                (obses_t, actions, rewards, obses_tp1, dones, weights, batch_idxes) = experience
+            else:
+                obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(batch_size)
+                weights, batch_idxes = np.ones_like(rewards), None
+            obses_t, obses_tp1 = tf.constant(obses_t), tf.constant(obses_tp1)
+            actions, rewards, dones = tf.constant(actions), tf.constant(rewards), tf.constant(dones)
+            weights = tf.constant(weights)
+            td_errors = model.train(obses_t, actions, rewards, obses_tp1, dones, weights)
+            if prioritized_replay:
+                new_priorities = np.abs(td_errors) + prioritized_replay_eps
+                replay_buffer.update_priorities(batch_idxes, new_priorities)
+
+        if t > learning_starts and t % target_network_update_freq == 0:
+            # Update target network periodically.
+            model.update_target()
+
+        mean_100ep_reward = round(np.mean(episode_rewards[-101:-1]), 1)
+        num_episodes = len(episode_rewards)
+        if done and print_freq is not None and len(episode_rewards) % print_freq == 0:
+            logger.record_tabular("steps", t)
+            logger.record_tabular("episodes", num_episodes)
+            logger.record_tabular("mean 100 episode reward", mean_100ep_reward)
+            logger.record_tabular("% time spent exploring", int(100 * exploration.value(t)))
+            logger.dump_tabular()
+
+    return model
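Review note: `learn` anneals epsilon linearly and, when `param_noise` is set, converts it into a KL threshold. The sketch below reproduces both computations in plain Python so the numbers can be checked by hand. `linear_eps` and `param_noise_threshold` are hypothetical helpers written for this note; `linear_eps` assumes that `LinearSchedule` interpolates linearly from `initial_p` to `final_p` and then holds `final_p`.

```python
import math


def linear_eps(t, schedule_timesteps, initial_p=1.0, final_p=0.02):
    """Linearly interpolate from initial_p to final_p, then hold final_p."""
    fraction = min(float(t) / schedule_timesteps, 1.0)
    return initial_p + fraction * (final_p - initial_p)


def param_noise_threshold(eps, num_actions):
    """KL threshold comparable to eps-greedy exploration with the given eps.

    Same expression as in learn(): -log(1 - eps + eps / num_actions).
    """
    return -math.log(1.0 - eps + eps / float(num_actions))


if __name__ == "__main__":
    # Defaults from learn(): exploration_fraction=0.1, total_timesteps=100000.
    schedule_timesteps = int(0.1 * 100000)
    for t in (0, 5000, 20000):
        eps = linear_eps(t, schedule_timesteps)
        print(t, eps, param_noise_threshold(eps, num_actions=4))
```

With fully random actions (`eps = 1`) the threshold equals `log(num_actions)`, and with greedy actions (`eps = 0`) it is zero.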
--- a/baselines/deepq/deepq_learner.py
+++ b/baselines/deepq/deepq_learner.py
@@ -0,0 +1,191 @@
+"""Deep Q model
+
+The functions in this model:
+
+======= step ========
+
+    Function to choose an action given an observation
+
+    Parameters
+    ----------
+    observation: tensor
+        a batch of observations to feed into the Q network
+    stochastic: bool
+        if set to False all the actions are always deterministic (default True)
+    update_eps: float
+        update epsilon to a new value, if negative no update happens
+        (default: no update)
+
+    Returns
+    -------
+    Tensor of dtype tf.int64 and shape (BATCH_SIZE,) with an action to be performed for
+    every element of the batch.
+
+
+(NOT IMPLEMENTED YET)
+======= step (in case of parameter noise) ========
+
+    Function to choose an action given an observation
+
+    Parameters
+    ----------
+    observation: object
+        a batch of observations to feed into the Q network
+    stochastic: bool
+        if set to False all the actions are always deterministic (default True)
+    update_eps: float
+        update epsilon to a new value, if negative no update happens
+        (default: no update)
+    reset: bool
+        reset the perturbed policy by sampling a new perturbation
+    update_param_noise_threshold: float
+        the desired threshold for the difference between non-perturbed and perturbed policy
+    update_param_noise_scale: bool
+        whether or not to update the scale of the noise for the next time it is re-perturbed
+
+    Returns
+    -------
+    Tensor of dtype tf.int64 and shape (BATCH_SIZE,) with an action to be performed for
+    every element of the batch.
+
+
+======= train =======
+
+    Function that takes a transition (s,a,r,s',d) and optimizes Bellman equation's error:
+
+        td_error = Q(s,a) - (r + gamma * (1-d) * max_a' Q(s', a'))
+        loss = huber_loss[td_error]
+
+    Parameters
+    ----------
+    obs_t: object
+        a batch of observations
+    action: np.array
+        actions that were selected upon seeing obs_t.
+        dtype must be int32 and shape must be (batch_size,)
+    reward: np.array
+        immediate reward attained after executing those actions
+        dtype must be float32 and shape must be (batch_size,)
+    obs_tp1: object
+        observations that followed obs_t
+    done: np.array
+        1 if obs_t was the last observation in the episode and 0 otherwise.
+        When done is 1, obs_tp1 is ignored but must still have a valid shape.
+        dtype must be float32 and shape must be (batch_size,)
+    weight: np.array
+        importance weights for every element of the batch (gradient is multiplied
+        by the importance weight) dtype must be float32 and shape must be (batch_size,)
+
+    Returns
+    -------
+    td_error: np.array
+        a list of differences between Q(s,a) and the target in Bellman's equation.
+        dtype is float32 and shape is (batch_size,)
+
+======= update_target ========
+
+    copy the parameters from optimized Q function to the target Q function.
+    In Q learning we actually optimize the following error:
+
+        Q(s,a) - (r + gamma * max_a' Q'(s', a'))
+
+    where Q' lags behind Q to stabilize learning. For example, for Atari,
+
+    Q' is set to Q once every 10000 training steps.
+
+"""
+import tensorflow as tf
+
+@tf.function
+def huber_loss(x, delta=1.0):
+    """Reference: https://en.wikipedia.org/wiki/Huber_loss"""
+    return tf.where(
+        tf.abs(x) < delta,
+        tf.square(x) * 0.5,
+        delta * (tf.abs(x) - 0.5 * delta)
+    )
+
+class DEEPQ(tf.Module):
+
+    def __init__(self, q_func, observation_shape, num_actions, lr, grad_norm_clipping=None, gamma=1.0,
+        double_q=True, param_noise=False, param_noise_filter_func=None):
+
+      self.num_actions = num_actions
+      self.gamma = gamma
+      self.double_q = double_q
+      self.param_noise = param_noise
+      self.param_noise_filter_func = param_noise_filter_func
+      self.grad_norm_clipping = grad_norm_clipping
+
+      self.optimizer = tf.keras.optimizers.Adam(lr)
+
+      with tf.name_scope('q_network'):
+        self.q_network = q_func(observation_shape, num_actions)
+      with tf.name_scope('target_q_network'):
+        self.target_q_network = q_func(observation_shape, num_actions)
+      self.eps = tf.Variable(0., name="eps")
+
+    @tf.function
+    def step(self, obs, stochastic=True, update_eps=-1):
+      if self.param_noise:
+        raise ValueError('not supporting noise yet')
+      else:
+        q_values = self.q_network(obs)
+        deterministic_actions = tf.argmax(q_values, axis=1)
+        batch_size = tf.shape(obs)[0]
+        random_actions = tf.random.uniform(tf.stack([batch_size]), minval=0, maxval=self.num_actions, dtype=tf.int64)
+        chose_random = tf.random.uniform(tf.stack([batch_size]), minval=0, maxval=1, dtype=tf.float32) < self.eps
+        stochastic_actions = tf.where(chose_random, random_actions, deterministic_actions)
+
+        if stochastic:
+          output_actions = stochastic_actions
+        else:
+          output_actions = deterministic_actions
+
+        if update_eps >= 0:
+            self.eps.assign(update_eps)
+
+        return output_actions, None, None, None
+
+    @tf.function()
+    def train(self, obs0, actions, rewards, obs1, dones, importance_weights):
+      with tf.GradientTape() as tape:
+        q_t = self.q_network(obs0)
+        q_t_selected = tf.reduce_sum(q_t * tf.one_hot(actions, self.num_actions, dtype=tf.float32), 1)
+
+        q_tp1 = self.target_q_network(obs1)
+
+        if self.double_q:
+            q_tp1_using_online_net = self.q_network(obs1)
+            q_tp1_best_using_online_net = tf.argmax(q_tp1_using_online_net, 1)
+            q_tp1_best = tf.reduce_sum(q_tp1 * tf.one_hot(q_tp1_best_using_online_net, self.num_actions, dtype=tf.float32), 1)
+        else:
+            q_tp1_best = tf.reduce_max(q_tp1, 1)
+
+        dones = tf.cast(dones, q_tp1_best.dtype)
+        q_tp1_best_masked = (1.0 - dones) * q_tp1_best
+
+        q_t_selected_target = rewards + self.gamma * q_tp1_best_masked
+
+        td_error = q_t_selected - tf.stop_gradient(q_t_selected_target)
+        errors = huber_loss(td_error)
+        weighted_error = tf.reduce_mean(importance_weights * errors)
+
+      grads = tape.gradient(weighted_error, self.q_network.trainable_variables)
+      if self.grad_norm_clipping:
+        clipped_grads = []
+        for grad in grads:
+          clipped_grads.append(tf.clip_by_norm(grad, self.grad_norm_clipping))
+        grads = clipped_grads
+      grads_and_vars = zip(grads, self.q_network.trainable_variables)
+      self.optimizer.apply_gradients(grads_and_vars)
+
+      return td_error
+
+    @tf.function(autograph=False)
+    def update_target(self):
+      q_vars = self.q_network.trainable_variables
+      target_q_vars = self.target_q_network.trainable_variables
+      for var, var_target in zip(q_vars, target_q_vars):
+        var_target.assign(var)
+
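The `train` method above combines a (double) Q-learning target with a Huber loss. The following is a minimal pure-Python sketch of that arithmetic for a single transition, not the TensorFlow implementation from the diff. The function `td_error` and its argument names are hypothetical and chosen only for illustration.

```python
def huber_loss(x, delta=1.0):
    """Quadratic for |x| < delta, linear beyond it (same branches as the diff)."""
    if abs(x) < delta:
        return 0.5 * x * x
    return delta * (abs(x) - 0.5 * delta)


def td_error(q_sa, reward, done, q_next_online, q_next_target,
             gamma=0.99, double_q=True):
    """TD error for one transition (hypothetical helper for illustration).

    q_next_online and q_next_target are lists of Q-values for the next
    state, taken from the online and target networks respectively.
    """
    if double_q:
        # Double Q: the online network picks the action,
        # the target network evaluates it.
        best = max(range(len(q_next_online)), key=q_next_online.__getitem__)
        q_next = q_next_target[best]
    else:
        # Plain Q-learning: take the max over the target network's values.
        q_next = max(q_next_target)
    # The (1 - done) factor removes the bootstrap term on terminal steps.
    target = reward + gamma * (1.0 - done) * q_next
    return q_sa - target
```

With `done=1.0` the target reduces to the immediate reward, which is why `obs_tp1` is ignored for terminal transitions.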
--- a/baselines/deepq/defaults.py
+++ b/baselines/deepq/defaults.py
@@ -0,0 +1,21 @@
+def atari():
+    return dict(
+        network='conv_only',
+        lr=1e-4,
+        buffer_size=10000,
+        exploration_fraction=0.1,
+        exploration_final_eps=0.01,
+        train_freq=4,
+        learning_starts=10000,
+        target_network_update_freq=1000,
+        gamma=0.99,
+        prioritized_replay=True,
+        prioritized_replay_alpha=0.6,
+        checkpoint_freq=10000,
+        checkpoint_path=None,
+        dueling=True
+    )
+
+def retro():
+    return atari()
+
--- a/baselines/deepq/experiments/__init__.py
+++ b/baselines/deepq/experiments/__init__.py
--- a/baselines/deepq/experiments/custom_cartpole.py
+++ b/baselines/deepq/experiments/custom_cartpole.py
@@ -1,79 +0,0 @@
-import gym
-import itertools
-import numpy as np
-import tensorflow as tf
-import tensorflow.contrib.layers as layers
-
-import baselines.common.tf_util as U
-
-from baselines import logger
-from baselines import deepq
-from baselines.deepq.replay_buffer import ReplayBuffer
-from baselines.deepq.utils import ObservationInput
-from baselines.common.schedules import LinearSchedule
-
-
-def model(inpt, num_actions, scope, reuse=False):
-    """This model takes as input an observation and returns values of all actions."""
-    with tf.variable_scope(scope, reuse=reuse):
-        out = inpt
-        out = layers.fully_connected(out, num_outputs=64, activation_fn=tf.nn.tanh)
-        out = layers.fully_connected(out, num_outputs=num_actions, activation_fn=None)
-        return out
-
-
-if __name__ == '__main__':
-    with U.make_session(8):
-        # Create the environment
-        env = gym.make("CartPole-v0")
-        # Create all the functions necessary to train the model
-        act, train, update_target, debug = deepq.build_train(
-            make_obs_ph=lambda name: ObservationInput(env.observation_space, name=name),
-            q_func=model,
-            num_actions=env.action_space.n,
-            optimizer=tf.train.AdamOptimizer(learning_rate=5e-4),
-        )
-        # Create the replay buffer
-        replay_buffer = ReplayBuffer(50000)
-        # Create the schedule for exploration starting from 1 (every action is random) down to
-        # 0.02 (98% of actions are selected according to values predicted by the model).
-        exploration = LinearSchedule(schedule_timesteps=10000, initial_p=1.0, final_p=0.02)
-
-        # Initialize the parameters and copy them to the target network.
-        U.initialize()
-        update_target()
-
-        episode_rewards = [0.0]
-        obs = env.reset()
-        for t in itertools.count():
-            # Take action and update exploration to the newest value
-            action = act(obs[None], update_eps=exploration.value(t))[0]
-            new_obs, rew, done, _ = env.step(action)
-            # Store transition in the replay buffer.
-            replay_buffer.add(obs, action, rew, new_obs, float(done))
-            obs = new_obs
-
-            episode_rewards[-1] += rew
-            if done:
-                obs = env.reset()
-                episode_rewards.append(0)
-
-            is_solved = t > 100 and np.mean(episode_rewards[-101:-1]) >= 200
-            if is_solved:
-                # Show off the result
-                env.render()
-            else:
-                # Minimize the error in Bellman's equation on a batch sampled from replay buffer.
-                if t > 1000:
-                    obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(32)
-                    train(obses_t, actions, rewards, obses_tp1, dones, np.ones_like(rewards))
-                # Update target network periodically.
-                if t % 1000 == 0:
-                    update_target()
-
-            if done and len(episode_rewards) % 10 == 0:
-                logger.record_tabular("steps", t)
-                logger.record_tabular("episodes", len(episode_rewards))
-                logger.record_tabular("mean episode reward", round(np.mean(episode_rewards[-101:-1]), 1))
-                logger.record_tabular("% time spent exploring", int(100 * exploration.value(t)))
-                logger.dump_tabular()
--- a/baselines/deepq/experiments/enjoy_cartpole.py
+++ b/baselines/deepq/experiments/enjoy_cartpole.py
@@ -1,21 +0,0 @@
-import gym
-
-from baselines import deepq
-
-
-def main():
-    env = gym.make("CartPole-v0")
-    act = deepq.load("cartpole_model.pkl")
-
-    while True:
-        obs, done = env.reset(), False
-        episode_rew = 0
-        while not done:
-            env.render()
-            obs, rew, done, _ = env.step(act(obs[None])[0])
-            episode_rew += rew
-        print("Episode reward", episode_rew)
-
-
-if __name__ == '__main__':
-    main()
--- a/baselines/deepq/experiments/enjoy_mountaincar.py
+++ b/baselines/deepq/experiments/enjoy_mountaincar.py
@@ -1,21 +0,0 @@
-import gym
-
-from baselines import deepq
-
-
-def main():
-    env = gym.make("MountainCar-v0")
-    act = deepq.load("mountaincar_model.pkl")
-
-    while True:
-        obs, done = env.reset(), False
-        episode_rew = 0
-        while not done:
-            env.render()
-            obs, rew, done, _ = env.step(act(obs[None])[0])
-            episode_rew += rew
-        print("Episode reward", episode_rew)
-
-
-if __name__ == '__main__':
-    main()
--- a/baselines/deepq/experiments/enjoy_pong.py
+++ b/baselines/deepq/experiments/enjoy_pong.py
@@ -1,21 +0,0 @@
-import gym
-from baselines import deepq
-
-
-def main():
-    env = gym.make("PongNoFrameskip-v4")
-    env = deepq.wrap_atari_dqn(env)
-    act = deepq.load("pong_model.pkl")
-
-    while True:
-        obs, done = env.reset(), False
-        episode_rew = 0
-        while not done:
-            env.render()
-            obs, rew, done, _ = env.step(act(obs[None])[0])
-            episode_rew += rew
-        print("Episode reward", episode_rew)
-
-
-if __name__ == '__main__':
-    main()
--- a/baselines/deepq/experiments/run_atari.py
+++ b/baselines/deepq/experiments/run_atari.py
@@ -1,54 +0,0 @@
-from baselines import deepq
-from baselines.common import set_global_seeds
-from baselines import bench
-import argparse
-from baselines import logger
-from baselines.common.atari_wrappers import make_atari
-
-
-def main():
-    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-    parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
-    parser.add_argument('--seed', help='RNG seed', type=int, default=0)
-    parser.add_argument('--prioritized', type=int, default=1)
-    parser.add_argument('--prioritized-replay-alpha', type=float, default=0.6)
-    parser.add_argument('--dueling', type=int, default=1)
-    parser.add_argument('--num-timesteps', type=int, default=int(10e6))
-    parser.add_argument('--checkpoint-freq', type=int, default=10000)
-    parser.add_argument('--checkpoint-path', type=str, default=None)
-
-    args = parser.parse_args()
-    logger.configure()
-    set_global_seeds(args.seed)
-    env = make_atari(args.env)
-    env = bench.Monitor(env, logger.get_dir())
-    env = deepq.wrap_atari_dqn(env)
-    model = deepq.models.cnn_to_mlp(
-        convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
-        hiddens=[256],
-        dueling=bool(args.dueling),
-    )
-
-    deepq.learn(
-        env,
-        q_func=model,
-        lr=1e-4,
-        max_timesteps=args.num_timesteps,
-        buffer_size=10000,
-        exploration_fraction=0.1,
-        exploration_final_eps=0.01,
-        train_freq=4,
-        learning_starts=10000,
-        target_network_update_freq=1000,
-        gamma=0.99,
-        prioritized_replay=bool(args.prioritized),
-        prioritized_replay_alpha=args.prioritized_replay_alpha,
-        checkpoint_freq=args.checkpoint_freq,
-        checkpoint_path=args.checkpoint_path,
-    )
-
-    env.close()
-
-
-if __name__ == '__main__':
-    main()
--- a/baselines/deepq/experiments/train_cartpole.py
+++ b/baselines/deepq/experiments/train_cartpole.py
@@ -1,31 +0,0 @@
-import gym
-
-from baselines import deepq
-
-
-def callback(lcl, _glb):
-    # stop training if reward exceeds 199
-    is_solved = lcl['t'] > 100 and sum(lcl['episode_rewards'][-101:-1]) / 100 >= 199
-    return is_solved
-
-
-def main():
-    env = gym.make("CartPole-v0")
-    model = deepq.models.mlp([64])
-    act = deepq.learn(
-        env,
-        q_func=model,
-        lr=1e-3,
-        max_timesteps=100000,
-        buffer_size=50000,
-        exploration_fraction=0.1,
-        exploration_final_eps=0.02,
-        print_freq=10,
-        callback=callback
-    )
-    print("Saving model to cartpole_model.pkl")
-    act.save("cartpole_model.pkl")
-
-
-if __name__ == '__main__':
-    main()
--- a/baselines/deepq/experiments/train_mountaincar.py
+++ b/baselines/deepq/experiments/train_mountaincar.py
@@ -1,26 +0,0 @@
-import gym
-
-from baselines import deepq
-
-
-def main():
-    env = gym.make("MountainCar-v0")
-    # Enabling layer_norm here is import for parameter space noise!
-    model = deepq.models.mlp([64], layer_norm=True)
-    act = deepq.learn(
-        env,
-        q_func=model,
-        lr=1e-3,
-        max_timesteps=100000,
-        buffer_size=50000,
-        exploration_fraction=0.1,
-        exploration_final_eps=0.1,
-        print_freq=10,
-        param_noise=True
-    )
-    print("Saving model to mountaincar_model.pkl")
-    act.save("mountaincar_model.pkl")
-
-
-if __name__ == '__main__':
-    main()
--- a/baselines/deepq/models.py
+++ b/baselines/deepq/models.py
@@ -1,91 +1,47 @@
 import tensorflow as tf
-import tensorflow.contrib.layers as layers


-def _mlp(hiddens, inpt, num_actions, scope, reuse=False, layer_norm=False):
-    with tf.variable_scope(scope, reuse=reuse):
-        out = inpt
-        for hidden in hiddens:
-            out = layers.fully_connected(out, num_outputs=hidden, activation_fn=None)
-            if layer_norm:
-                out = layers.layer_norm(out, center=True, scale=True)
-            out = tf.nn.relu(out)
-        q_out = layers.fully_connected(out, num_outputs=num_actions, activation_fn=None)
-        return q_out
+def build_q_func(network, hiddens=[256], dueling=True, layer_norm=False, **network_kwargs):
+    if isinstance(network, str):
+        from baselines.common.models import get_network_builder
+        network = get_network_builder(network)(**network_kwargs)

+    def q_func_builder(input_shape, num_actions):
+        # the sub Functional model which does not include the top layer.
+        model = network(input_shape)

-def mlp(hiddens=[], layer_norm=False):
-    """This model takes as input an observation and returns values of all actions.
+        # wrapping the sub Functional model with layers that compute action scores into another Functional model.
+        latent = model.outputs
+        if len(latent) > 1:
+            if latent[1] is not None:
+                raise NotImplementedError("DQN is not compatible with recurrent policies yet")
+        latent = latent[0]

-    Parameters
-    ----------
-    hiddens: [int]
-        list of sizes of hidden layers
+        latent = tf.keras.layers.Flatten()(latent)

-    Returns
-    -------
-    q_func: function
-        q_function for DQN algorithm.
-    """
-    return lambda *args, **kwargs: _mlp(hiddens, layer_norm=layer_norm, *args, **kwargs)
-
-
-def _cnn_to_mlp(convs, hiddens, dueling, inpt, num_actions, scope, reuse=False, layer_norm=False):
-    with tf.variable_scope(scope, reuse=reuse):
-        out = inpt
-        with tf.variable_scope("convnet"):
-            for num_outputs, kernel_size, stride in convs:
-                out = layers.convolution2d(out,
-                                           num_outputs=num_outputs,
-                                           kernel_size=kernel_size,
-                                           stride=stride,
-                                           activation_fn=tf.nn.relu)
-        conv_out = layers.flatten(out)
-        with tf.variable_scope("action_value"):
-            action_out = conv_out
+        with tf.name_scope("action_value"):
+            action_out = latent
            for hidden in hiddens:
-                action_out = layers.fully_connected(action_out, num_outputs=hidden, activation_fn=None)
+                action_out = tf.keras.layers.Dense(units=hidden, activation=None)(action_out)
                if layer_norm:
-                    action_out = layers.layer_norm(action_out, center=True, scale=True)
+                    action_out = tf.keras.layers.LayerNormalization(center=True, scale=True)(action_out)
                action_out = tf.nn.relu(action_out)
-            action_scores = layers.fully_connected(action_out, num_outputs=num_actions, activation_fn=None)
+            action_scores = tf.keras.layers.Dense(units=num_actions, activation=None)(action_out)

        if dueling:
-            with tf.variable_scope("state_value"):
-                state_out = conv_out
+            with tf.name_scope("state_value"):
+                state_out = latent
                for hidden in hiddens:
-                    state_out = layers.fully_connected(state_out, num_outputs=hidden, activation_fn=None)
+                    state_out = tf.keras.layers.Dense(units=hidden, activation=None)(state_out)
                    if layer_norm:
-                        state_out = layers.layer_norm(state_out, center=True, scale=True)
+                        state_out = tf.keras.layers.LayerNormalization(center=True, scale=True)(state_out)
                    state_out = tf.nn.relu(state_out)
-                state_score = layers.fully_connected(state_out, num_outputs=1, activation_fn=None)
+                state_score = tf.keras.layers.Dense(units=1, activation=None)(state_out)
            action_scores_mean = tf.reduce_mean(action_scores, 1)
            action_scores_centered = action_scores - tf.expand_dims(action_scores_mean, 1)
            q_out = state_score + action_scores_centered
        else:
            q_out = action_scores
-        return q_out
-
-
-def cnn_to_mlp(convs, hiddens, dueling=False, layer_norm=False):
-    """This model takes as input an observation and returns values of all actions.
-
-    Parameters
-    ----------
-    convs: [(int, int int)]
-        list of convolutional layers in form of
-        (num_outputs, kernel_size, stride)
-    hiddens: [int]
-        list of sizes of hidden layers
-    dueling: bool
-        if true double the output MLP to compute a baseline
-        for action scores
-
-    Returns
-    -------
-    q_func: function
-        q_function for DQN algorithm.
-    """
-
-    return lambda *args, **kwargs: _cnn_to_mlp(convs, hiddens, dueling, layer_norm=layer_norm, *args, **kwargs)
+        return tf.keras.Model(inputs=model.inputs, outputs=[q_out])

+    return q_func_builder
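The dueling branch of `q_func_builder` combines a scalar state value with mean-centered action scores. A minimal pure-Python sketch of that aggregation for one observation follows. The function name `dueling_q_values` is an assumption made for illustration and does not appear in the diff.

```python
def dueling_q_values(state_value, action_scores):
    """Combine V(s) and per-action scores as Q = V + (A - mean(A)).

    Hypothetical helper: mirrors the `state_score + action_scores_centered`
    expression in the diff, for a single observation.
    """
    mean_score = sum(action_scores) / len(action_scores)
    return [state_value + (a - mean_score) for a in action_scores]
```

Because the action scores are centered, the mean of the resulting Q-values equals the state value, which keeps the two streams identifiable.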
--- a/baselines/deepq/replay_buffer.py
+++ b/baselines/deepq/replay_buffer.py
@@ -32,6 +32,9 @@ class ReplayBuffer(object):

    def _encode_sample(self, idxes):
        obses_t, actions, rewards, obses_tp1, dones = [], [], [], [], []
+        data = self._storage[0]
+        ob_dtype = data[0].dtype
+        ac_dtype = data[1].dtype
        for i in idxes:
            data = self._storage[i]
            obs_t, action, reward, obs_tp1, done = data
@@ -40,7 +43,7 @@ class ReplayBuffer(object):
            rewards.append(reward)
            obses_tp1.append(np.array(obs_tp1, copy=False))
            dones.append(done)
-        return np.array(obses_t), np.array(actions), np.array(rewards), np.array(obses_tp1), np.array(dones)
+        return np.array(obses_t, dtype=ob_dtype), np.array(actions, dtype=ac_dtype), np.array(rewards, dtype=np.float32), np.array(obses_tp1, dtype=ob_dtype), np.array(dones, dtype=np.float32)

    def sample(self, batch_size):
        """Sample a batch of experiences.
@@ -106,9 +109,10 @@ class PrioritizedReplayBuffer(ReplayBuffer):

    def _sample_proportional(self, batch_size):
        res = []
-        for _ in range(batch_size):
-            # TODO(szymon): should we ensure no repeats?
-            mass = random.random() * self._it_sum.sum(0, len(self._storage) - 1)
+        p_total = self._it_sum.sum(0, len(self._storage) - 1)
+        every_range_len = p_total / batch_size
+        for i in range(batch_size):
+            mass = random.random() * every_range_len + i * every_range_len
            idx = self._it_sum.find_prefixsum_idx(mass)
            res.append(idx)
        return res
@@ -161,7 +165,7 @@ class PrioritizedReplayBuffer(ReplayBuffer):
            p_sample = self._it_sum[idx] / self._it_sum.sum()
            weight = (p_sample * len(self._storage)) ** (-beta)
            weights.append(weight / max_weight)
-        weights = np.array(weights)
+        weights = np.array(weights, dtype=np.float32)
        encoded_sample = self._encode_sample(idxes)
        return tuple(list(encoded_sample) + [weights, idxes])
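The change to `_sample_proportional` replaces independent draws with stratified sampling: the total priority mass is split into `batch_size` equal ranges and one point is drawn from each. The sketch below illustrates the idea with a plain prefix-sum list and `bisect` instead of the repository's segment tree. The function `sample_proportional` and its `rng` parameter are assumptions for illustration.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def sample_proportional(priorities, batch_size, rng=random):
    """Stratified proportional sampling over a list of priorities.

    Hypothetical stand-in for the segment-tree version in the diff.
    """
    prefix = list(accumulate(priorities))  # prefix[k] = sum(priorities[:k+1])
    p_total = prefix[-1]
    every_range_len = p_total / batch_size
    idxes = []
    for i in range(batch_size):
        # Draw one point uniformly from the i-th stratum of the total mass.
        mass = rng.random() * every_range_len + i * every_range_len
        # Index of the first prefix sum strictly greater than mass.
        idx = bisect_right(prefix, mass)
        # Guard against floating-point rounding at the upper edge.
        idxes.append(min(idx, len(priorities) - 1))
    return idxes
```

Compared with independent draws, stratification spreads the batch across the priority distribution, which tends to reduce the variance of the sampled batch.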

--- a/baselines/deepq/simple.py
+++ b/baselines/deepq/simple.py
@@ -1,306 +0,0 @@
-import os
-import tempfile
-
-import tensorflow as tf
-import zipfile
-import cloudpickle
-import numpy as np
-
-import baselines.common.tf_util as U
-from baselines.common.tf_util import load_state, save_state
-from baselines import logger
-from baselines.common.schedules import LinearSchedule
-from baselines.common.input import observation_input
-
-from baselines import deepq
-from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
-from baselines.deepq.utils import ObservationInput
-
-
-class ActWrapper(object):
-    def __init__(self, act, act_params):
-        self._act = act
-        self._act_params = act_params
-
-    @staticmethod
-    def load(path):
-        with open(path, "rb") as f:
-            model_data, act_params = cloudpickle.load(f)
-        act = deepq.build_act(**act_params)
-        sess = tf.Session()
-        sess.__enter__()
-        with tempfile.TemporaryDirectory() as td:
-            arc_path = os.path.join(td, "packed.zip")
-            with open(arc_path, "wb") as f:
-                f.write(model_data)
-
-            zipfile.ZipFile(arc_path, 'r', zipfile.ZIP_DEFLATED).extractall(td)
-            load_state(os.path.join(td, "model"))
-
-        return ActWrapper(act, act_params)
-
-    def __call__(self, *args, **kwargs):
-        return self._act(*args, **kwargs)
-
-    def save(self, path=None):
-        """Save model to a pickle located at `path`"""
-        if path is None:
-            path = os.path.join(logger.get_dir(), "model.pkl")
-
-        with tempfile.TemporaryDirectory() as td:
-            save_state(os.path.join(td, "model"))
-            arc_name = os.path.join(td, "packed.zip")
-            with zipfile.ZipFile(arc_name, 'w') as zipf:
-                for root, dirs, files in os.walk(td):
-                    for fname in files:
-                        file_path = os.path.join(root, fname)
-                        if file_path != arc_name:
-                            zipf.write(file_path, os.path.relpath(file_path, td))
-            with open(arc_name, "rb") as f:
-                model_data = f.read()
-        with open(path, "wb") as f:
-            cloudpickle.dump((model_data, self._act_params), f)
-
-
-def load(path):
-    """Load act function that was returned by learn function.
-
-    Parameters
-    ----------
-    path: str
-        path to the act function pickle
-
-    Returns
-    -------
-    act: ActWrapper
-        function that takes a batch of observations
-        and returns actions.
-    """
-    return ActWrapper.load(path)
-
-
-def learn(env,
-          q_func,
-          lr=5e-4,
-          max_timesteps=100000,
-          buffer_size=50000,
-          exploration_fraction=0.1,
-          exploration_final_eps=0.02,
-          train_freq=1,
-          batch_size=32,
-          print_freq=100,
-          checkpoint_freq=10000,
-          checkpoint_path=None,
-          learning_starts=1000,
-          gamma=1.0,
-          target_network_update_freq=500,
-          prioritized_replay=False,
-          prioritized_replay_alpha=0.6,
-          prioritized_replay_beta0=0.4,
-          prioritized_replay_beta_iters=None,
-          prioritized_replay_eps=1e-6,
-          param_noise=False,
-          callback=None):
-    """Train a deepq model.
-
-    Parameters
-    -------
-    env: gym.Env
-        environment to train on
-    q_func: (tf.Variable, int, str, bool) -> tf.Variable
-        the model that takes the following inputs:
-            observation_in: object
-                the output of observation placeholder
-            num_actions: int
-                number of actions
-            scope: str
-            reuse: bool
-                should be passed to outer variable scope
-        and returns a tensor of shape (batch_size, num_actions) with values of every action.
-    lr: float
-        learning rate for adam optimizer
-    max_timesteps: int
-        number of env steps to optimizer for
-    buffer_size: int
-        size of the replay buffer
-    exploration_fraction: float
-        fraction of entire training period over which the exploration rate is annealed
-    exploration_final_eps: float
-        final value of random action probability
-    train_freq: int
-        update the model every `train_freq` steps.
-        set to None to disable printing
-    batch_size: int
-        size of a batched sampled from replay buffer for training
-    print_freq: int
-        how often to print out training progress
-        set to None to disable printing
-    checkpoint_freq: int
-        how often to save the model. This is so that the best version is restored
-        at the end of the training. If you do not wish to restore the best version at
-        the end of the training set this variable to None.
-    learning_starts: int
-        how many steps of the model to collect transitions for before learning starts
-    gamma: float
-        discount factor
-    target_network_update_freq: int
-        update the target network every `target_network_update_freq` steps.
-    prioritized_replay: True
-        if True prioritized replay buffer will be used.
-    prioritized_replay_alpha: float
-        alpha parameter for prioritized replay buffer
-    prioritized_replay_beta0: float
-        initial value of beta for prioritized replay buffer
-    prioritized_replay_beta_iters: int
-        number of iterations over which beta will be annealed from initial value
-        to 1.0. If set to None equals to max_timesteps.
-    prioritized_replay_eps: float
-        epsilon to add to the TD errors when updating priorities.
-    callback: (locals, globals) -> None
-        function called at every steps with state of the algorithm.
-        If callback returns true training stops.
-
-    Returns
-    -------
-    act: ActWrapper
-        Wrapper over act function. Adds ability to save it and load it.
-        See header of baselines/deepq/categorical.py for details on the act function.
-    """
-    # Create all the functions necessary to train the model
-
-    sess = tf.Session()
-    sess.__enter__()
-
-    # capture the shape outside the closure so that the env object is not serialized
-    # by cloudpickle when serializing make_obs_ph
-
-    def make_obs_ph(name):
-        return ObservationInput(env.observation_space, name=name)
-
-    act, train, update_target, debug = deepq.build_train(
-        make_obs_ph=make_obs_ph,
-        q_func=q_func,
-        num_actions=env.action_space.n,
-        optimizer=tf.train.AdamOptimizer(learning_rate=lr),
-        gamma=gamma,
-        grad_norm_clipping=10,
-        param_noise=param_noise
-    )
-
-    act_params = {
-        'make_obs_ph': make_obs_ph,
-        'q_func': q_func,
-        'num_actions': env.action_space.n,
-    }
-
-    act = ActWrapper(act, act_params)
-
-    # Create the replay buffer
-    if prioritized_replay:
-        replay_buffer = PrioritizedReplayBuffer(buffer_size, alpha=prioritized_replay_alpha)
-        if prioritized_replay_beta_iters is None:
-            prioritized_replay_beta_iters = max_timesteps
-        beta_schedule = LinearSchedule(prioritized_replay_beta_iters,
-                                       initial_p=prioritized_replay_beta0,
-                                       final_p=1.0)
-    else:
-        replay_buffer = ReplayBuffer(buffer_size)
-        beta_schedule = None
-    # Create the schedule for exploration starting from 1.
-    exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * max_timesteps),
-                                 initial_p=1.0,
-                                 final_p=exploration_final_eps)
-
-    # Initialize the parameters and copy them to the target network.
-    U.initialize()
-    update_target()
-
-    episode_rewards = [0.0]
-    saved_mean_reward = None
-    obs = env.reset()
-    reset = True
-
-    with tempfile.TemporaryDirectory() as td:
-        td = checkpoint_path or td
-
-        model_file = os.path.join(td, "model")
-        model_saved = False
-        if tf.train.latest_checkpoint(td) is not None:
-            load_state(model_file)
-            logger.log('Loaded model from {}'.format(model_file))
-            model_saved = True
-
-        for t in range(max_timesteps):
-            if callback is not None:
-                if callback(locals(), globals()):
-                    break
-            # Take action and update exploration to the newest value
-            kwargs = {}
-            if not param_noise:
-                update_eps = exploration.value(t)
-                update_param_noise_threshold = 0.
-            else:
-                update_eps = 0.
-                # Compute the threshold such that the KL divergence between perturbed and non-perturbed
-                # policy is comparable to eps-greedy exploration with eps = exploration.value(t).
-                # See Appendix C.1 in Parameter Space Noise for Exploration, Plappert et al., 2017
-                # for detailed explanation.
-                update_param_noise_threshold = -np.log(1. - exploration.value(t) + exploration.value(t) / float(env.action_space.n))
-                kwargs['reset'] = reset
-                kwargs['update_param_noise_threshold'] = update_param_noise_threshold
-                kwargs['update_param_noise_scale'] = True
-            action = act(np.array(obs)[None], update_eps=update_eps, **kwargs)[0]
-            env_action = action
-            reset = False
-            new_obs, rew, done, _ = env.step(env_action)
-            # Store transition in the replay buffer.
-            replay_buffer.add(obs, action, rew, new_obs, float(done))
-            obs = new_obs
-
-            episode_rewards[-1] += rew
-            if done:
-                obs = env.reset()
-                episode_rewards.append(0.0)
-                reset = True
-
-            if t > learning_starts and t % train_freq == 0:
-                # Minimize the error in Bellman's equation on a batch sampled from replay buffer.
-                if prioritized_replay:
-                    experience = replay_buffer.sample(batch_size, beta=beta_schedule.value(t))
-                    (obses_t, actions, rewards, obses_tp1, dones, weights, batch_idxes) = experience
-                else:
-                    obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(batch_size)
-                    weights, batch_idxes = np.ones_like(rewards), None
-                td_errors = train(obses_t, actions, rewards, obses_tp1, dones, weights)
-                if prioritized_replay:
-                    new_priorities = np.abs(td_errors) + prioritized_replay_eps
-                    replay_buffer.update_priorities(batch_idxes, new_priorities)
-
-            if t > learning_starts and t % target_network_update_freq == 0:
-                # Update target network periodically.
-                update_target()
-
-            mean_100ep_reward = round(np.mean(episode_rewards[-101:-1]), 1)
-            num_episodes = len(episode_rewards)
-            if done and print_freq is not None and len(episode_rewards) % print_freq == 0:
-                logger.record_tabular("steps", t)
-                logger.record_tabular("episodes", num_episodes)
-                logger.record_tabular("mean 100 episode reward", mean_100ep_reward)
-                logger.record_tabular("% time spent exploring", int(100 * exploration.value(t)))
-                logger.dump_tabular()
-
-            if (checkpoint_freq is not None and t > learning_starts and
-                    num_episodes > 100 and t % checkpoint_freq == 0):
-                if saved_mean_reward is None or mean_100ep_reward > saved_mean_reward:
-                    if print_freq is not None:
-                        logger.log("Saving model due to mean reward increase: {} -> {}".format(
-                                   saved_mean_reward, mean_100ep_reward))
-                    save_state(model_file)
-                    model_saved = True
-                    saved_mean_reward = mean_100ep_reward
-        if model_saved:
-            if print_freq is not None:
-                logger.log("Restored model with mean reward: {}".format(saved_mean_reward))
-            load_state(model_file)
-
-    return act
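The removed `learn` function anneals epsilon linearly and, when parameter noise is enabled, converts epsilon into a KL threshold. A minimal sketch of both formulas follows, assuming a linear interpolation that is clamped once `schedule_timesteps` is reached. The helper names are hypothetical and are not the repository's `LinearSchedule` class.

```python
import math


def linear_schedule(t, schedule_timesteps, initial_p=1.0, final_p=0.02):
    """Linearly interpolate from initial_p to final_p, then hold final_p."""
    fraction = min(float(t) / schedule_timesteps, 1.0)
    return initial_p + fraction * (final_p - initial_p)


def param_noise_threshold(eps, num_actions):
    """Threshold from the removed code: -log(1 - eps + eps / num_actions)."""
    return -math.log(1.0 - eps + eps / float(num_actions))
```

With `eps = 0` the threshold is zero (no perturbation is tolerated), and with `eps = 1` it equals `log(num_actions)`.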