Compare commits

...

73 Commits

Author SHA1 Message Date
Peter Zhokhov
2c818245d6 dummy commit to RUN BENCHMARKS 2018-07-25 18:09:30 -07:00
Peter Zhokhov
ae8e7fd16b dummy commit to RUN BENCHMARKS 2018-07-25 18:07:56 -07:00
Adam Gleave
f272969325 GAIL: bugfix in dataset loading (#447)
* Fix silly typo

* Replace ad-hoc function with NumPy code
2018-07-06 16:12:14 -07:00
pzhokhov
a6b1bc70f1 re-import internal; fix missing tile_images.py (#427)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity

* import internal

* adding missing tile_images.py
2018-06-08 09:41:45 -07:00
pzhokhov
36ee5d1707 Import internal changes (#422)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity

* import internal
2018-06-06 11:39:13 -07:00
pzhokhov
24fe3d6576 Import internal repo (#409)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity
2018-05-21 15:24:00 -07:00
pzhokhov
9cb7ece338 add opencv-python to the dependencies (#407) 2018-05-14 10:52:19 -07:00
pzhokhov
9cf95a0054 setup travis ci build (#388)
* simple .travis.yml file

* added static syntax checks of common to .travis.yml

* dockerizing the build

* fix Dockerfile, adding build shield

* cleaning up workdir in Dockerfile and .travis.yml

* .travis.yml fixed common -> baselines/common for style check
2018-05-03 09:43:28 -07:00
pzhokhov
8b781038cc put filters and running_stat files in common instead of acktr (#389) 2018-05-02 18:42:48 -07:00
pzhokhov
69f25c6028 import internal repo (#385) 2018-05-01 16:54:04 -07:00
pzhokhov
2b0283b9db Readme.md detailed installation instructions (#377)
* changes to README.md files with more detailed installation instructions

* md-fying the changes better

* link on the word homebrew in readme.md

* typos in README.md

* README.md

* removed extra comma sign

* removed sudo from brew command
2018-04-25 17:40:48 -07:00
Matthias Plappert
1f8a03f3a6 Update README 2018-03-26 16:50:22 +02:00
Matthias Plappert
3cc7df0608 Minor fixes to HER release (#319)
* Fix plotting script

* Add warning if num_cpu = 1
2018-03-05 11:06:17 +01:00
Alex Nichol
8b3a6c2051 fix DummyVecEnv reusing buffers 2018-03-02 17:18:07 -08:00
Alex Nichol
569bd42629 Merge pull request #308 from araffin/master
Bug fix in saving ACER model
2018-03-01 10:45:04 -08:00
Daniel Ziegler
f49a9c3d85 Fix bug in DDPG parameter space noise adaptation (#306)
The training loop used the rollout step variable `t` rather than the
training step variable `t_train` to decide when to adapt the scale of
the parameter space noise.
2018-03-01 18:00:34 +01:00
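
For illustration, a self-contained toy sketch of the counter mix-up described above; the names `nb_rollout_steps`, `nb_train_steps`, and `adaption_interval` are chosen to echo the DDPG training loop, but this is not the baselines code itself.

```python
def count_adaptations(nb_rollout_steps, nb_train_steps, adaption_interval=50):
    t = 0
    for t in range(nb_rollout_steps):        # rollout loop leaves `t` at its last value
        pass                                 # ... collect experience ...
    # buggy: keys the adaptation check off the stale rollout counter `t`
    buggy = sum(1 for t_train in range(nb_train_steps) if t % adaption_interval == 0)
    # fixed: keys the adaptation check off the training-step counter `t_train`
    fixed = sum(1 for t_train in range(nb_train_steps) if t_train % adaption_interval == 0)
    return buggy, fixed

# Depending on where the rollout loop happened to stop, the buggy version adapts the
# noise scale on every training step or never; the fixed version adapts once per interval.
print(count_adaptations(nb_rollout_steps=100, nb_train_steps=100))  # -> (0, 2)
print(count_adaptations(nb_rollout_steps=101, nb_train_steps=100))  # -> (100, 2)
```
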
Antonin RAFFIN
14f2f9328c Bug fix in saving ACER model 2018-03-01 10:24:14 +01:00
Alex Nichol
6bdf2f55a2 Merge pull request #132 from bhatiaabhinav/bug_fixes
Bug fix in saving a2c model.
2018-02-27 19:00:37 -08:00
Alex Nichol
97be70d6c8 fixes for DummyVecEnv
Fixes various problems running MuJoCo tasks.
2018-02-27 18:55:10 -08:00
Matthias Plappert
b71152eea0 Adds support for Hindsight Experience Replay (HER) (#299)
* Add Hindsight Experience Replay (HER)

* Minor improvements
2018-02-26 17:40:16 +01:00
Christopher Hesse
df2e846ab7 export: fix accidental rename 2018-02-14 22:01:16 -08:00
Christopher Hesse
edb52c22a5 export: Fix deepq param noise refactoring, remove atari experiments and azure dependency 2018-02-14 21:42:22 -08:00
Andrei Kashin
98257ef8c9 Flush temporary file before compressing it.
We need to flush the buffer after `pickle.dump`; otherwise the resulting zip archive might be incomplete (reproducible if the state consists of a single integer).
2018-02-06 07:04:44 -08:00
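
The actual change lives in the baselines saving code; the snippet below is only a standalone sketch of the failure mode and of the fix, using temporary files created by the snippet itself.

```python
import pickle
import tempfile
import zipfile

state = 1  # a single integer of state is enough to hit the truncated-archive problem

# Dump into a still-open temporary file, then add that file (by name) to a zip archive.
# Without flush(), the pickled bytes may still sit in the file object's buffer, so the
# archived copy can end up incomplete.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump(state, tmp)
    tmp.flush()  # the fix described above
    with zipfile.ZipFile(tmp.name + ".zip", "w") as archive:
        archive.write(tmp.name)

with zipfile.ZipFile(tmp.name + ".zip") as archive:
    with archive.open(archive.namelist()[0]) as member:
        assert pickle.load(member) == state
```
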
Oleg Klimov
d9b36601d9 comment about loading weights in ppo2 2018-02-05 12:25:05 -08:00
Oleg Klimov
2793971c10 fix gail tf_util usage 2018-02-05 07:51:27 -08:00
John Schulman
16d7d23b7d Merge pull request #271 from simontudo/add-requirement-cloudpickle
added cloudpickle to requirements
2018-02-02 23:04:53 -08:00
John Schulman
9175b770c6 Merge pull request #273 from simontudo/videorecorder-import
updated videorecorder import
2018-02-02 23:03:51 -08:00
simontudo
615870ad6b updated videorecorder import 2018-02-01 12:09:08 +01:00
simontudo
7bd264e0e9 added cloudpickle to requirements 2018-01-31 10:43:17 +01:00
John Schulman
8d03102d4d Merge pull request #265 from 20chase/patch-1
fix logger error for trpo_mpi
2018-01-29 00:54:51 -08:00
20chase
4a77855529 using mujoco_arg_parser as args
remove origin parser
2018-01-29 16:52:01 +08:00
John Schulman
2e29b41592 Merge pull request #268 from ei-grad/master
Fix fc call in AcerLstmPolicy
2018-01-27 18:42:31 -08:00
Andrew Grigorev
634e37c5b8 Fix fc call in AcerLstmPolicy
The `act` keyword was removed from baselines.a2c.utils.fc in commit 9fa8e1b.
2018-01-27 23:18:02 +03:00
20chase
452b548c2a Merge branch 'master' into patch-1 2018-01-26 14:34:01 +08:00
John Schulman
ebb8afff2e fix trpo_mpi bug where logstd wasn’t included 2018-01-25 21:17:40 -08:00
John Schulman
c9613b2293 Merge pull request #259 from andrewliao11/openai_gail
Add gail maintainer list
2018-01-25 20:54:34 -08:00
John Schulman
459f007bcc Merge pull request #260 from uidilr/master
Add GAIL
2018-01-25 20:54:20 -08:00
John Schulman
9fa8e1baf1 Lots of cleanups
Fixes for new gym version
Add @olegklimov and @unixpickle to authors list
2018-01-25 18:54:24 -08:00
20chase
ac2ea4f31f fix logger error for MPI
Can't run logger.configure() if rank != 0
2018-01-25 22:09:00 +08:00
Yusuke Nakata
d8cce2309f Add GAIL 2018-01-23 12:02:03 +09:00
andrew
0c207f0185 fix typo 2018-01-21 22:13:01 -08:00
andrew
41d41fabe3 add gail maintainer list 2018-01-21 22:12:03 -08:00
John Schulman
b5be53dc92 Merge pull request #229 from andrewliao11/gail
GAIL implementation
2018-01-21 20:30:20 -05:00
Matthias Plappert
49c1a8ec26 Fix bug in parameter space noise DQN 2018-01-16 10:24:30 -08:00
andrew
e5a714b070 fix relative import 2018-01-12 15:12:45 -08:00
John Schulman
f9d1d3349a remove mpirun from ppo2 instructions 2018-01-12 11:05:29 -08:00
Alex Nichol
8c90f67560 don't list TensorFlow as a requirement
fixes #146

A better (more involved) solution might be to check for a TensorFlow installation manually in setup.py and deal with that accordingly.
2017-12-15 15:54:43 -08:00
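
A hypothetical sketch of the "more involved" alternative mentioned in the message, deciding the requirement at setup time; the helper name and the requirement list are made up for illustration.

```python
import importlib.util

def tensorflow_requirement():
    # Accept whichever TensorFlow flavour is already importable (tensorflow,
    # tensorflow-gpu, ...); only fall back to requiring the CPU package otherwise.
    if importlib.util.find_spec("tensorflow") is not None:
        return []
    return ["tensorflow"]

install_requires = ["gym", "numpy"] + tensorflow_requirement()
print(install_requires)
```
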
Andrew
f22bee085d Add files via upload 2017-12-12 19:03:42 -08:00
andrew
4acc71fe23 add x, y, axis name 2017-12-12 18:58:57 -08:00
andrew
2f1b629ecc Merge branch 'gail' of https://github.com/andrewliao11/baselines into gail 2017-12-12 18:56:00 -08:00
andrew
00573cf5e9 add x, y axis name 2017-12-12 18:54:03 -08:00
Andrew
cfa1236d78 Update README.md 2017-12-11 21:21:56 -08:00
Andrew
64288f9f84 Update gail-result.md 2017-12-11 21:19:47 -08:00
Andrew
5f647d4d34 Update README.md 2017-12-11 21:18:05 -08:00
Andrew
6723455b75 Update gail-result.md 2017-12-11 21:15:30 -08:00
Andrew
45a93cf2b9 add training curve from tensorboard 2017-12-11 21:06:04 -08:00
andrew
11604f7cc9 add download link to readme and add description to python file 2017-12-07 12:08:20 -08:00
John Schulman
2444034d11 Merge pull request #194 from ryanjulian/env_lines
Force shebang lines to Python 3
2017-12-04 14:07:01 -08:00
John Schulman
041b6b76b7 Merge pull request #215 from chris-chris/feature/typo-2017-11-19
fix misspellings
2017-12-04 14:02:49 -08:00
John Schulman
5d62b5bdaa Merge pull request #221 from jvmancuso/patch-1
Docstring fix
2017-12-04 14:01:38 -08:00
John Schulman
2fcc9b9572 Merge pull request #226 from definitelyuncertain/master
Call ppo2 and not ppo1 in ppo2 README.md
2017-12-04 14:01:12 -08:00
Andrew
000033973b Update gail-result.md 2017-12-03 15:50:24 -08:00
andrew
6090ee8292 add comparison for expert/BC/gail 2017-12-03 15:46:52 -08:00
andrew
7954327c5f add behavior cloning learn/eval code 2017-12-03 13:55:44 -08:00
andrew
8495890534 add gail, file_writer for tf.summary, and allow specifying var_list for tf.train.Saver 2017-12-03 01:49:42 -08:00
definitelyuncertain
643184935e Call ppo2 and not ppo1 2017-12-02 22:00:28 +05:30
jvmancuso
36e074da56 Update replay_buffer.py 2017-11-27 14:45:50 -05:00
Ubuntu
c33640932f fix misspellings 2017-11-19 01:29:30 +00:00
John Schulman
b05be68c55 add missing files, fix Issue #209 2017-11-16 22:14:30 -08:00
John Schulman
2dd7d307d7 Add ACER, PPO2, and results_plotter.py 2017-11-16 10:02:32 -08:00
Ryan Julian
df889caf11 Force shebang lines to Python 3
This is a Python 3-only library. A shebang with `#!/usr/bin/env python`
will launch python2 on many systems which do not have python3
installed. Setting the shebang to `#!/usr/bin/env python3` will show a
useful error on systems without Python 3.
2017-11-05 15:22:16 -08:00
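
A minimal illustration of the point above; the script contents are just an example.

```python
#!/usr/bin/env python3
# Run as ./check_python3.py: on a machine without Python 3 this fails with a clear
# "env: python3: No such file or directory" instead of silently starting under Python 2.
import sys

assert sys.version_info[0] == 3, "baselines is a Python 3-only library"
print(sys.version)
```
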
John Schulman
6a3cbb4bc5 switch append mode to write mode 2017-10-25 22:20:30 -04:00
Abhinav Bhatia
3d1e171b3a Bug fix in saving a2c model. 2017-09-12 02:35:43 +08:00
141 changed files with 5819 additions and 2181 deletions

.gitignore (3 changed lines)

@@ -1,6 +1,8 @@
*.swp
*.pyc
*.pkl
*.py~
.pytest_cache
.DS_Store
.idea
@@ -33,3 +35,4 @@ src
MUJOCO_LOG.TXT

.travis.yml (new file, 14 lines)

@@ -0,0 +1,14 @@
language: python
python:
- "3.6"
services:
- docker
install:
- pip install flake8
- docker build . -t baselines-test
script:
- flake8 --select=F baselines/common
- docker run baselines-test pytest

Dockerfile (new file, 20 lines)

@@ -0,0 +1,20 @@
FROM ubuntu:16.04
RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake
ENV CODE_DIR /root/code
ENV VENV /root/venv
COPY . $CODE_DIR/baselines
RUN \
pip install virtualenv && \
virtualenv $VENV --python=python3 && \
. $VENV/bin/activate && \
cd $CODE_DIR && \
pip install --upgrade pip && \
pip install -e baselines && \
pip install pytest
ENV PATH=$VENV/bin:$PATH
WORKDIR $CODE_DIR/baselines
CMD /bin/bash

README.md

@@ -1,4 +1,4 @@
<img src="data/logo.jpg" width=25% align="right" />
<img src="data/logo.jpg" width=25% align="right" /> [![Build status](https://travis-ci.org/openai/baselines.svg?branch=master)](https://travis-ci.org/openai/baselines)
# Baselines
@@ -6,25 +6,79 @@ OpenAI Baselines is a set of high-quality implementations of reinforcement learn
These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.
You can install it by typing:
## Prerequisites
Baselines requires python3 (>=3.5) with the development headers. You'll also need the system packages CMake, OpenMPI and zlib. They can be installed as follows
### Ubuntu
```bash
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
```
### Mac OS X
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
```bash
brew install cmake openmpi
```
## Virtual environment
From the general python package sanity perspective, it is a good idea to use virtual environments (virtualenvs) to make sure packages from different projects do not interfere with each other. You can install virtualenv (which is itself a pip package) via
```bash
pip install virtualenv
```
Virtualenvs are essentially folders that contain copies of the Python executable and all installed Python packages.
To create a virtualenv called venv with python3, one runs
```bash
virtualenv /path/to/venv --python=python3
```
To activate a virtualenv:
```
. /path/to/venv/bin/activate
```
A more thorough tutorial on virtualenvs and their options can be found [here](https://virtualenv.pypa.io/en/stable/)
## Installation
Clone the repo and cd into it:
```bash
git clone https://github.com/openai/baselines.git
cd baselines
```
If using virtualenv, create a new virtualenv and activate it
```bash
virtualenv env --python=python3
. env/bin/activate
```
Install baselines package
```bash
pip install -e .
```
### MuJoCo
Some of the baselines examples use the [MuJoCo](http://www.mujoco.org) (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (a temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py)
## Testing the installation
All unit tests in baselines can be run with the pytest runner:
```
pip install pytest
pytest
```
## Subpackages
- [A2C](baselines/a2c)
- [ACER](baselines/acer)
- [ACKTR](baselines/acktr)
- [DDPG](baselines/ddpg)
- [DQN](baselines/deepq)
- [PPO](baselines/ppo1)
- [GAIL](baselines/gail)
- [HER](baselines/her)
- [PPO1](baselines/ppo1) (Multi-CPU using MPI)
- [PPO2](baselines/ppo2) (Optimized for GPU)
- [TRPO](baselines/trpo_mpi)
To cite this repository in publications:
@misc{baselines,
author = {Hesse, Christopher and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
title = {OpenAI Baselines},
year = {2017},
publisher = {GitHub},

baselines/a2c/a2c.py

@@ -1,32 +1,25 @@
import os.path as osp
import gym
import time
import joblib
import logging
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.atari_wrappers import wrap_deepmind
from baselines.common.runners import AbstractEnvRunner
from baselines.common import tf_util
from baselines.a2c.utils import discount_with_dones
from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
from baselines.a2c.policies import CnnPolicy
from baselines.a2c.utils import cat_entropy, mse
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, nstack, num_procs,
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps,
ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5, lr=7e-4,
alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=num_procs,
inter_op_parallelism_threads=num_procs)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
nact = ac_space.n
sess = tf_util.make_session()
nbatch = nenvs*nsteps
A = tf.placeholder(tf.int32, [nbatch])
@@ -34,8 +27,8 @@ class Model(object):
R = tf.placeholder(tf.float32, [nbatch])
LR = tf.placeholder(tf.float32, [])
step_model = policy(sess, ob_space, ac_space, nenvs, 1, nstack, reuse=False)
train_model = policy(sess, ob_space, ac_space, nenvs, nsteps, nstack, reuse=True)
step_model = policy(sess, ob_space, ac_space, nenvs, 1, reuse=False)
train_model = policy(sess, ob_space, ac_space, nenvs*nsteps, nsteps, reuse=True)
neglogpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
pg_loss = tf.reduce_mean(ADV * neglogpac)
@@ -58,7 +51,7 @@ class Model(object):
for step in range(len(obs)):
cur_lr = lr.value()
td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, LR:cur_lr}
if states != []:
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks
policy_loss, value_loss, policy_entropy, _ = sess.run(
@@ -69,7 +62,7 @@ class Model(object):
def save(save_path):
ps = sess.run(params)
make_path(save_path)
make_path(osp.dirname(save_path))
joblib.dump(ps, save_path)
def load(load_path):
@@ -77,7 +70,7 @@ class Model(object):
restores = []
for p, loaded_p in zip(params, loaded_params):
restores.append(p.assign(loaded_p))
ps = sess.run(restores)
sess.run(restores)
self.train = train
self.train_model = train_model
@@ -89,34 +82,17 @@ class Model(object):
self.load = load
tf.global_variables_initializer().run(session=sess)
class Runner(object):
class Runner(AbstractEnvRunner):
def __init__(self, env, model, nsteps=5, nstack=4, gamma=0.99):
self.env = env
self.model = model
nh, nw, nc = env.observation_space.shape
nenv = env.num_envs
self.batch_ob_shape = (nenv*nsteps, nh, nw, nc*nstack)
self.obs = np.zeros((nenv, nh, nw, nc*nstack), dtype=np.uint8)
self.nc = nc
obs = env.reset()
self.update_obs(obs)
def __init__(self, env, model, nsteps=5, gamma=0.99):
super().__init__(env=env, model=model, nsteps=nsteps)
self.gamma = gamma
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
def update_obs(self, obs):
# Do frame-stacking here instead of the FrameStack wrapper to reduce
# IPC overhead
self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
self.obs[:, :, :, -self.nc:] = obs
def run(self):
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
mb_states = self.states
for n in range(self.nsteps):
actions, values, states = self.model.step(self.obs, self.states, self.dones)
actions, values, states, _ = self.model.step(self.obs, self.states, self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_values.append(values)
@@ -127,7 +103,7 @@ class Runner(object):
for n, done in enumerate(dones):
if done:
self.obs[n] = self.obs[n]*0
self.update_obs(obs)
self.obs = obs
mb_rewards.append(rewards)
mb_dones.append(self.dones)
#batch of steps to batch of rollouts
@@ -154,17 +130,15 @@ class Runner(object):
mb_masks = mb_masks.flatten()
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
def learn(policy, env, seed, nsteps=5, nstack=4, total_timesteps=int(80e6), vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5, lr=7e-4, lrschedule='linear', epsilon=1e-5, alpha=0.99, gamma=0.99, log_interval=100):
tf.reset_default_graph()
def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5, lr=7e-4, lrschedule='linear', epsilon=1e-5, alpha=0.99, gamma=0.99, log_interval=100):
set_global_seeds(seed)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
num_procs = len(env.remotes) # HACK
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, nstack=nstack, num_procs=num_procs, ent_coef=ent_coef, vf_coef=vf_coef,
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule)
runner = Runner(env, model, nsteps=nsteps, nstack=nstack, gamma=gamma)
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
nbatch = nenvs*nsteps
tstart = time.time()
@@ -183,6 +157,4 @@ def learn(policy, env, seed, nsteps=5, nstack=4, total_timesteps=int(80e6), vf_c
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()
env.close()
if __name__ == '__main__':
main()
return model

baselines/a2c/policies.py

@@ -1,36 +1,45 @@
import numpy as np
import tensorflow as tf
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm, sample
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
from baselines.common.distributions import make_pdtype
from baselines.common.input import observation_input
def nature_cnn(unscaled_images, **conv_kwargs):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
**conv_kwargs))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
class LnLstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, nlstm=256, reuse=False):
nbatch = nenv*nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc*nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
nenv = nbatch // nsteps
X, processed_x = observation_input(ob_space, nbatch)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
self.pdtype = make_pdtype(ac_space)
with tf.variable_scope("model", reuse=reuse):
h = conv(tf.cast(X, tf.float32)/255., 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2))
h2 = conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2))
h3 = conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2))
h3 = conv_to_fc(h3)
h4 = fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2))
xs = batch_to_seq(h4, nenv, nsteps)
h = nature_cnn(processed_x)
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi = fc(h5, 'pi', nact, act=lambda x:x)
vf = fc(h5, 'v', 1, act=lambda x:x)
vf = fc(h5, 'v', 1)
self.pd, self.pi = self.pdtype.pdfromlatent(h5)
v0 = vf[:, 0]
a0 = sample(pi)
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
def step(ob, state, mask):
a, v, s = sess.run([a0, v0, snew], {X:ob, S:state, M:mask})
return a, v, s
return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
def value(ob, state, mask):
return sess.run(v0, {X:ob, S:state, M:mask})
@@ -38,41 +47,35 @@ class LnLstmPolicy(object):
self.X = X
self.M = M
self.S = S
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class LstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, nlstm=256, reuse=False):
nbatch = nenv*nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc*nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
nenv = nbatch // nsteps
self.pdtype = make_pdtype(ac_space)
X, processed_x = observation_input(ob_space, nbatch)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = conv(tf.cast(X, tf.float32)/255., 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2))
h2 = conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2))
h3 = conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2))
h3 = conv_to_fc(h3)
h4 = fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2))
xs = batch_to_seq(h4, nenv, nsteps)
h = nature_cnn(X)
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi = fc(h5, 'pi', nact, act=lambda x:x)
vf = fc(h5, 'v', 1, act=lambda x:x)
vf = fc(h5, 'v', 1)
self.pd, self.pi = self.pdtype.pdfromlatent(h5)
v0 = vf[:, 0]
a0 = sample(pi)
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
def step(ob, state, mask):
a, v, s = sess.run([a0, v0, snew], {X:ob, S:state, M:mask})
return a, v, s
return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
def value(ob, state, mask):
return sess.run(v0, {X:ob, S:state, M:mask})
@@ -80,41 +83,64 @@ class LstmPolicy(object):
self.X = X
self.M = M
self.S = S
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class CnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False):
nbatch = nenv*nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc*nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False, **conv_kwargs): #pylint: disable=W0613
self.pdtype = make_pdtype(ac_space)
X, processed_x = observation_input(ob_space, nbatch)
with tf.variable_scope("model", reuse=reuse):
h = conv(tf.cast(X, tf.float32)/255., 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2))
h2 = conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2))
h3 = conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2))
h3 = conv_to_fc(h3)
h4 = fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2))
pi = fc(h4, 'pi', nact, act=lambda x:x)
vf = fc(h4, 'v', 1, act=lambda x:x)
h = nature_cnn(processed_x, **conv_kwargs)
vf = fc(h, 'v', 1)[:,0]
self.pd, self.pi = self.pdtype.pdfromlatent(h, init_scale=0.01)
v0 = vf[:, 0]
a0 = sample(pi)
self.initial_state = [] #not stateful
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = None
def step(ob, *_args, **_kwargs):
a, v = sess.run([a0, v0], {X:ob})
return a, v, [] #dummy state
a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
return a, v, self.initial_state, neglogp
def value(ob, *_args, **_kwargs):
return sess.run(v0, {X:ob})
return sess.run(vf, {X:ob})
self.X = X
self.vf = vf
self.step = step
self.value = value
class MlpPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
self.pdtype = make_pdtype(ac_space)
with tf.variable_scope("model", reuse=reuse):
X, processed_x = observation_input(ob_space, nbatch)
activ = tf.tanh
processed_x = tf.layers.flatten(processed_x)
pi_h1 = activ(fc(processed_x, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
pi_h2 = activ(fc(pi_h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
vf_h1 = activ(fc(processed_x, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
vf_h2 = activ(fc(vf_h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
vf = fc(vf_h2, 'vf', 1)[:,0]
self.pd, self.pi = self.pdtype.pdfromlatent(pi_h2, init_scale=0.01)
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = None
def step(ob, *_args, **_kwargs):
a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
return a, v, self.initial_state, neglogp
def value(ob, *_args, **_kwargs):
return sess.run(vf, {X:ob})
self.X = X
self.pi = pi
self.vf = vf
self.step = step
self.value = value

baselines/a2c/run_atari.py

@@ -1,45 +1,30 @@
#!/usr/bin/env python
import os, logging, gym
from baselines import logger
from baselines.common import set_global_seeds
from baselines import bench
from baselines.a2c.a2c import learn
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.a2c.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
#!/usr/bin/env python3
def train(env_id, num_timesteps, seed, policy, lrschedule, num_cpu):
def make_env(rank):
def _thunk():
env = make_atari(env_id)
env.seed(seed + rank)
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
gym.logger.setLevel(logging.WARN)
return wrap_deepmind(env)
return _thunk
set_global_seeds(seed)
env = SubprocVecEnv([make_env(i) for i in range(num_cpu)])
from baselines import logger
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.a2c.a2c import learn
from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
def train(env_id, num_timesteps, seed, policy, lrschedule, num_env):
if policy == 'cnn':
policy_fn = CnnPolicy
elif policy == 'lstm':
policy_fn = LstmPolicy
elif policy == 'lnlstm':
policy_fn = LnLstmPolicy
env = VecFrameStack(make_atari_env(env_id, num_env, seed), 4)
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
env.close()
def main():
import argparse
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser = atari_arg_parser()
parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
args = parser.parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
policy=args.policy, lrschedule=args.lrschedule, num_cpu=16)
policy=args.policy, lrschedule=args.lrschedule, num_env=16)
if __name__ == '__main__':
main()

baselines/a2c/utils.py

@@ -39,23 +39,33 @@ def ortho_init(scale=1.0):
return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
return _ortho_init
def conv(x, scope, nf, rf, stride, pad='VALID', act=tf.nn.relu, init_scale=1.0):
def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC', one_dim_bias=False):
if data_format == 'NHWC':
channel_ax = 3
strides = [1, stride, stride, 1]
bshape = [1, 1, 1, nf]
elif data_format == 'NCHW':
channel_ax = 1
strides = [1, 1, stride, stride]
bshape = [1, nf, 1, 1]
else:
raise NotImplementedError
bias_var_shape = [nf] if one_dim_bias else [1, nf, 1, 1]
nin = x.get_shape()[channel_ax].value
wshape = [rf, rf, nin, nf]
with tf.variable_scope(scope):
nin = x.get_shape()[3].value
w = tf.get_variable("w", [rf, rf, nin, nf], initializer=ortho_init(init_scale))
b = tf.get_variable("b", [nf], initializer=tf.constant_initializer(0.0))
z = tf.nn.conv2d(x, w, strides=[1, stride, stride, 1], padding=pad)+b
h = act(z)
return h
w = tf.get_variable("w", wshape, initializer=ortho_init(init_scale))
b = tf.get_variable("b", bias_var_shape, initializer=tf.constant_initializer(0.0))
if not one_dim_bias and data_format == 'NHWC':
b = tf.reshape(b, bshape)
return b + tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format)
def fc(x, scope, nh, act=tf.nn.relu, init_scale=1.0):
def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
with tf.variable_scope(scope):
nin = x.get_shape()[1].value
w = tf.get_variable("w", [nin, nh], initializer=ortho_init(init_scale))
b = tf.get_variable("b", [nh], initializer=tf.constant_initializer(0.0))
z = tf.matmul(x, w)+b
h = act(z)
return h
b = tf.get_variable("b", [nh], initializer=tf.constant_initializer(init_bias))
return tf.matmul(x, w)+b
def batch_to_seq(h, nbatch, nsteps, flat=False):
if flat:
@@ -162,9 +172,34 @@ def constant(p):
def linear(p):
return 1-p
def middle_drop(p):
eps = 0.75
if 1-p<eps:
return eps*0.1
return 1-p
def double_linear_con(p):
p *= 2
eps = 0.125
if 1-p<eps:
return eps
return 1-p
def double_middle_drop(p):
eps1 = 0.75
eps2 = 0.25
if 1-p<eps1:
if 1-p<eps2:
return eps2*0.5
return eps1*0.1
return 1-p
schedules = {
'linear':linear,
'constant':constant
'constant':constant,
'double_linear_con': double_linear_con,
'middle_drop': middle_drop,
'double_middle_drop': double_middle_drop
}
class Scheduler(object):
@@ -238,7 +273,7 @@ def check_shape(ts,shapes):
def avg_norm(t):
return tf.reduce_mean(tf.sqrt(tf.reduce_sum(tf.square(t), axis=-1)))
def myadd(g1, g2, param):
def gradient_add(g1, g2, param):
print([g1, g2, param.name])
assert (not (g1 is None and g2 is None)), param.name
if g1 is None:
@@ -248,7 +283,7 @@ def myadd(g1, g2, param):
else:
return g1 + g2
def my_explained_variance(qpred, q):
def q_explained_variance(qpred, q):
_, vary = tf.nn.moments(q, axes=[0, 1])
_, varpred = tf.nn.moments(q - qpred, axes=[0, 1])
check_shape([vary, varpred], [[]] * 2)
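
For reference, the learning-rate schedules added in the utils diff above behave as follows. This is a standalone copy of the new functions (only condensed), where `p` is the fraction of total training already consumed and the return value is the multiplier that `Scheduler` applies to the base learning rate.

```python
def linear(p):
    return 1 - p

def middle_drop(p):
    eps = 0.75
    return eps * 0.1 if 1 - p < eps else 1 - p

def double_linear_con(p):
    p *= 2
    eps = 0.125
    return eps if 1 - p < eps else 1 - p

def double_middle_drop(p):
    eps1, eps2 = 0.75, 0.25
    if 1 - p < eps1:
        return eps2 * 0.5 if 1 - p < eps2 else eps1 * 0.1
    return 1 - p

for p in (0.0, 0.3, 0.6, 0.9):
    print(p, linear(p), middle_drop(p), double_linear_con(p), double_middle_drop(p))
```
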

baselines/acer/README.md (new file, 4 lines)

@@ -0,0 +1,4 @@
# ACER
- Original paper: https://arxiv.org/abs/1611.01224
- `python -m baselines.acer.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.

baselines/acer/acer_simple.py (new file)

@@ -0,0 +1,348 @@
import time
import joblib
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds
from baselines.common.runners import AbstractEnvRunner
from baselines.a2c.utils import batch_to_seq, seq_to_batch
from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
from baselines.a2c.utils import cat_entropy_softmax
from baselines.a2c.utils import EpisodeStats
from baselines.a2c.utils import get_by_index, check_shape, avg_norm, gradient_add, q_explained_variance
from baselines.acer.buffer import Buffer
import os.path as osp
# remove last step
def strip(var, nenvs, nsteps, flat = False):
vars = batch_to_seq(var, nenvs, nsteps + 1, flat)
return seq_to_batch(vars[:-1], flat)
def q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma):
"""
Calculates q_retrace targets
:param R: Rewards
:param D: Dones
:param q_i: Q values for actions taken
:param v: V values
:param rho_i: Importance weight for each action
:return: Q_retrace values
"""
rho_bar = batch_to_seq(tf.minimum(1.0, rho_i), nenvs, nsteps, True) # list of len steps, shape [nenvs]
rs = batch_to_seq(R, nenvs, nsteps, True) # list of len steps, shape [nenvs]
ds = batch_to_seq(D, nenvs, nsteps, True) # list of len steps, shape [nenvs]
q_is = batch_to_seq(q_i, nenvs, nsteps, True)
vs = batch_to_seq(v, nenvs, nsteps + 1, True)
v_final = vs[-1]
qret = v_final
qrets = []
for i in range(nsteps - 1, -1, -1):
check_shape([qret, ds[i], rs[i], rho_bar[i], q_is[i], vs[i]], [[nenvs]] * 6)
qret = rs[i] + gamma * qret * (1.0 - ds[i])
qrets.append(qret)
qret = (rho_bar[i] * (qret - q_is[i])) + vs[i]
qrets = qrets[::-1]
qret = seq_to_batch(qrets, flat=True)
return qret
# For ACER with PPO clipping instead of trust region
# def clip(ratio, eps_clip):
# # assume 0 <= eps_clip <= 1
# return tf.minimum(1 + eps_clip, tf.maximum(1 - eps_clip, ratio))
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, nstack, num_procs,
ent_coef, q_coef, gamma, max_grad_norm, lr,
rprop_alpha, rprop_epsilon, total_timesteps, lrschedule,
c, trust_region, alpha, delta):
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=num_procs,
inter_op_parallelism_threads=num_procs)
sess = tf.Session(config=config)
nact = ac_space.n
nbatch = nenvs * nsteps
A = tf.placeholder(tf.int32, [nbatch]) # actions
D = tf.placeholder(tf.float32, [nbatch]) # dones
R = tf.placeholder(tf.float32, [nbatch]) # rewards, not returns
MU = tf.placeholder(tf.float32, [nbatch, nact]) # mu's
LR = tf.placeholder(tf.float32, [])
eps = 1e-6
step_model = policy(sess, ob_space, ac_space, nenvs, 1, nstack, reuse=False)
train_model = policy(sess, ob_space, ac_space, nenvs, nsteps + 1, nstack, reuse=True)
params = find_trainable_variables("model")
print("Params {}".format(len(params)))
for var in params:
print(var)
# create polyak averaged model
ema = tf.train.ExponentialMovingAverage(alpha)
ema_apply_op = ema.apply(params)
def custom_getter(getter, *args, **kwargs):
v = ema.average(getter(*args, **kwargs))
print(v.name)
return v
with tf.variable_scope("", custom_getter=custom_getter, reuse=True):
polyak_model = policy(sess, ob_space, ac_space, nenvs, nsteps + 1, nstack, reuse=True)
# Notation: (var) = batch variable, (var)s = sequence variable, (var)_i = variable indexed by action at step i
v = tf.reduce_sum(train_model.pi * train_model.q, axis = -1) # shape is [nenvs * (nsteps + 1)]
# strip off last step
f, f_pol, q = map(lambda var: strip(var, nenvs, nsteps), [train_model.pi, polyak_model.pi, train_model.q])
# Get pi and q values for actions taken
f_i = get_by_index(f, A)
q_i = get_by_index(q, A)
# Compute ratios for importance truncation
rho = f / (MU + eps)
rho_i = get_by_index(rho, A)
# Calculate Q_retrace targets
qret = q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma)
# Calculate losses
# Entropy
entropy = tf.reduce_mean(cat_entropy_softmax(f))
# Policy gradient loss, with truncated importance sampling & bias correction
v = strip(v, nenvs, nsteps, True)
check_shape([qret, v, rho_i, f_i], [[nenvs * nsteps]] * 4)
check_shape([rho, f, q], [[nenvs * nsteps, nact]] * 2)
# Truncated importance sampling
adv = qret - v
logf = tf.log(f_i + eps)
gain_f = logf * tf.stop_gradient(adv * tf.minimum(c, rho_i)) # [nenvs * nsteps]
loss_f = -tf.reduce_mean(gain_f)
# Bias correction for the truncation
adv_bc = (q - tf.reshape(v, [nenvs * nsteps, 1])) # [nenvs * nsteps, nact]
logf_bc = tf.log(f + eps) # / (f_old + eps)
check_shape([adv_bc, logf_bc], [[nenvs * nsteps, nact]]*2)
gain_bc = tf.reduce_sum(logf_bc * tf.stop_gradient(adv_bc * tf.nn.relu(1.0 - (c / (rho + eps))) * f), axis = 1) #IMP: This is sum, as expectation wrt f
loss_bc= -tf.reduce_mean(gain_bc)
loss_policy = loss_f + loss_bc
# Value/Q function loss, and explained variance
check_shape([qret, q_i], [[nenvs * nsteps]]*2)
ev = q_explained_variance(tf.reshape(q_i, [nenvs, nsteps]), tf.reshape(qret, [nenvs, nsteps]))
loss_q = tf.reduce_mean(tf.square(tf.stop_gradient(qret) - q_i)*0.5)
# Net loss
check_shape([loss_policy, loss_q, entropy], [[]] * 3)
loss = loss_policy + q_coef * loss_q - ent_coef * entropy
if trust_region:
g = tf.gradients(- (loss_policy - ent_coef * entropy) * nsteps * nenvs, f) #[nenvs * nsteps, nact]
# k = tf.gradients(KL(f_pol || f), f)
k = - f_pol / (f + eps) #[nenvs * nsteps, nact] # Directly computed gradient of KL divergence wrt f
k_dot_g = tf.reduce_sum(k * g, axis=-1)
adj = tf.maximum(0.0, (tf.reduce_sum(k * g, axis=-1) - delta) / (tf.reduce_sum(tf.square(k), axis=-1) + eps)) #[nenvs * nsteps]
# Calculate stats (before doing adjustment) for logging.
avg_norm_k = avg_norm(k)
avg_norm_g = avg_norm(g)
avg_norm_k_dot_g = tf.reduce_mean(tf.abs(k_dot_g))
avg_norm_adj = tf.reduce_mean(tf.abs(adj))
g = g - tf.reshape(adj, [nenvs * nsteps, 1]) * k
grads_f = -g/(nenvs*nsteps) # These are trust-region adjusted gradients wrt f, i.e. the statistics of policy pi
grads_policy = tf.gradients(f, params, grads_f)
grads_q = tf.gradients(loss_q * q_coef, params)
grads = [gradient_add(g1, g2, param) for (g1, g2, param) in zip(grads_policy, grads_q, params)]
avg_norm_grads_f = avg_norm(grads_f) * (nsteps * nenvs)
norm_grads_q = tf.global_norm(grads_q)
norm_grads_policy = tf.global_norm(grads_policy)
else:
grads = tf.gradients(loss, params)
if max_grad_norm is not None:
grads, norm_grads = tf.clip_by_global_norm(grads, max_grad_norm)
grads = list(zip(grads, params))
trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=rprop_alpha, epsilon=rprop_epsilon)
_opt_op = trainer.apply_gradients(grads)
# so when you call _train, you first do the gradient step, then you apply ema
with tf.control_dependencies([_opt_op]):
_train = tf.group(ema_apply_op)
lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
# Ops/Summaries to run, and their names for logging
run_ops = [_train, loss, loss_q, entropy, loss_policy, loss_f, loss_bc, ev, norm_grads]
names_ops = ['loss', 'loss_q', 'entropy', 'loss_policy', 'loss_f', 'loss_bc', 'explained_variance',
'norm_grads']
if trust_region:
run_ops = run_ops + [norm_grads_q, norm_grads_policy, avg_norm_grads_f, avg_norm_k, avg_norm_g, avg_norm_k_dot_g,
avg_norm_adj]
names_ops = names_ops + ['norm_grads_q', 'norm_grads_policy', 'avg_norm_grads_f', 'avg_norm_k', 'avg_norm_g',
'avg_norm_k_dot_g', 'avg_norm_adj']
def train(obs, actions, rewards, dones, mus, states, masks, steps):
cur_lr = lr.value_steps(steps)
td_map = {train_model.X: obs, polyak_model.X: obs, A: actions, R: rewards, D: dones, MU: mus, LR: cur_lr}
if states != []:
td_map[train_model.S] = states
td_map[train_model.M] = masks
td_map[polyak_model.S] = states
td_map[polyak_model.M] = masks
return names_ops, sess.run(run_ops, td_map)[1:] # strip off _train
def save(save_path):
ps = sess.run(params)
make_path(osp.dirname(save_path))
joblib.dump(ps, save_path)
self.train = train
self.save = save
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
class Runner(AbstractEnvRunner):
def __init__(self, env, model, nsteps, nstack):
super().__init__(env=env, model=model, nsteps=nsteps)
self.nstack = nstack
nh, nw, nc = env.observation_space.shape
self.nc = nc # nc = 1 for atari, but just in case
self.nenv = nenv = env.num_envs
self.nact = env.action_space.n
self.nbatch = nenv * nsteps
self.batch_ob_shape = (nenv*(nsteps+1), nh, nw, nc*nstack)
self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
obs = env.reset()
self.update_obs(obs)
def update_obs(self, obs, dones=None):
if dones is not None:
self.obs *= (1 - dones.astype(np.uint8))[:, None, None, None]
self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
self.obs[:, :, :, -self.nc:] = obs[:, :, :, :]
def run(self):
enc_obs = np.split(self.obs, self.nstack, axis=3) # so now list of obs steps
mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
for _ in range(self.nsteps):
actions, mus, states = self.model.step(self.obs, state=self.states, mask=self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_mus.append(mus)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
# state information for stateful models like LSTM
self.states = states
self.dones = dones
self.update_obs(obs, dones)
mb_rewards.append(rewards)
enc_obs.append(obs)
mb_obs.append(np.copy(self.obs))
mb_dones.append(self.dones)
enc_obs = np.asarray(enc_obs, dtype=np.uint8).swapaxes(1, 0)
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones # Used for stateful models like LSTMs to mask state when done
mb_dones = mb_dones[:, 1:] # Used for calculating returns. The dones array is now aligned with rewards
# shapes are now [nenv, nsteps, []]
# When pulling from buffer, arrays will now be reshaped in place, preventing a deep copy.
return enc_obs, mb_obs, mb_actions, mb_rewards, mb_mus, mb_dones, mb_masks
class Acer():
def __init__(self, runner, model, buffer, log_interval):
self.runner = runner
self.model = model
self.buffer = buffer
self.log_interval = log_interval
self.tstart = None
self.episode_stats = EpisodeStats(runner.nsteps, runner.nenv)
self.steps = None
def call(self, on_policy):
runner, model, buffer, steps = self.runner, self.model, self.buffer, self.steps
if on_policy:
enc_obs, obs, actions, rewards, mus, dones, masks = runner.run()
self.episode_stats.feed(rewards, dones)
if buffer is not None:
buffer.put(enc_obs, actions, rewards, mus, dones, masks)
else:
# get obs, actions, rewards, mus, dones from buffer.
obs, actions, rewards, mus, dones, masks = buffer.get()
# reshape stuff correctly
obs = obs.reshape(runner.batch_ob_shape)
actions = actions.reshape([runner.nbatch])
rewards = rewards.reshape([runner.nbatch])
mus = mus.reshape([runner.nbatch, runner.nact])
dones = dones.reshape([runner.nbatch])
masks = masks.reshape([runner.batch_ob_shape[0]])
names_ops, values_ops = model.train(obs, actions, rewards, dones, mus, model.initial_state, masks, steps)
if on_policy and (int(steps/runner.nbatch) % self.log_interval == 0):
logger.record_tabular("total_timesteps", steps)
logger.record_tabular("fps", int(steps/(time.time() - self.tstart)))
# IMP: In EpisodicLife env, during training, we get done=True at each loss of life, not just at the terminal state.
# Thus, this is mean until end of life, not end of episode.
# For true episode rewards, see the monitor files in the log folder.
logger.record_tabular("mean_episode_length", self.episode_stats.mean_length())
logger.record_tabular("mean_episode_reward", self.episode_stats.mean_reward())
for name, val in zip(names_ops, values_ops):
logger.record_tabular(name, float(val))
logger.dump_tabular()
def learn(policy, env, seed, nsteps=20, nstack=4, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
max_grad_norm=10, lr=7e-4, lrschedule='linear', rprop_epsilon=1e-5, rprop_alpha=0.99, gamma=0.99,
log_interval=100, buffer_size=50000, replay_ratio=4, replay_start=10000, c=10.0,
trust_region=True, alpha=0.99, delta=1):
print("Running Acer Simple")
print(locals())
tf.reset_default_graph()
set_global_seeds(seed)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
num_procs = len(env.remotes) # HACK
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, nstack=nstack,
num_procs=num_procs, ent_coef=ent_coef, q_coef=q_coef, gamma=gamma,
max_grad_norm=max_grad_norm, lr=lr, rprop_alpha=rprop_alpha, rprop_epsilon=rprop_epsilon,
total_timesteps=total_timesteps, lrschedule=lrschedule, c=c,
trust_region=trust_region, alpha=alpha, delta=delta)
runner = Runner(env=env, model=model, nsteps=nsteps, nstack=nstack)
if replay_ratio > 0:
buffer = Buffer(env=env, nsteps=nsteps, nstack=nstack, size=buffer_size)
else:
buffer = None
nbatch = nenvs*nsteps
acer = Acer(runner, model, buffer, log_interval)
acer.tstart = time.time()
for acer.steps in range(0, total_timesteps, nbatch): #nbatch samples, 1 on_policy call and multiple off-policy calls
acer.call(on_policy=True)
if replay_ratio > 0 and buffer.has_atleast(replay_start):
n = np.random.poisson(replay_ratio)
for _ in range(n):
acer.call(on_policy=False) # no simulation steps in this
env.close()
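
In the notation of the file above, `q_retrace` builds the Retrace targets backwards from the bootstrap value, and the `trust_region` branch adjusts the gradient with respect to the policy statistics `f`:

\[
Q^{\mathrm{ret}}_t = r_t + \gamma\,(1 - d_t)\Bigl[\bar\rho_{t+1}\bigl(Q^{\mathrm{ret}}_{t+1} - Q(x_{t+1}, a_{t+1})\bigr) + V(x_{t+1})\Bigr],
\qquad
\bar\rho_t = \min\!\Bigl(1,\ \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\Bigr),
\]

\[
g \;\leftarrow\; g - \max\!\Bigl(0,\ \frac{k^{\top} g - \delta}{\lVert k \rVert_2^2}\Bigr)\,k,
\qquad
k = \nabla_{\!f}\, D_{\mathrm{KL}}\bigl(f_{\mathrm{polyak}} \,\Vert\, f\bigr) = -\,\frac{f_{\mathrm{polyak}}}{f + \epsilon},
\]

where \(f\) are the action probabilities of the train model and \(f_{\mathrm{polyak}}\) those of the Polyak-averaged model.
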

baselines/acer/buffer.py (new file, 103 lines)

@@ -0,0 +1,103 @@
import numpy as np
class Buffer(object):
# gets obs, actions, rewards, mu's, (states, masks), dones
def __init__(self, env, nsteps, nstack, size=50000):
self.nenv = env.num_envs
self.nsteps = nsteps
self.nh, self.nw, self.nc = env.observation_space.shape
self.nstack = nstack
self.nbatch = self.nenv * self.nsteps
self.size = size // (self.nsteps) # Each loc contains nenv * nsteps frames, thus total buffer is nenv * size frames
# Memory
self.enc_obs = None
self.actions = None
self.rewards = None
self.mus = None
self.dones = None
self.masks = None
# Size indexes
self.next_idx = 0
self.num_in_buffer = 0
def has_atleast(self, frames):
# Frames per env, so total (nenv * frames) Frames needed
# Each buffer loc has nenv * nsteps frames
return self.num_in_buffer >= (frames // self.nsteps)
def can_sample(self):
return self.num_in_buffer > 0
# Generate stacked frames
def decode(self, enc_obs, dones):
# enc_obs has shape [nenvs, nsteps + nstack, nh, nw, nc]
# dones has shape [nenvs, nsteps, nh, nw, nc]
# returns stacked obs of shape [nenv, (nsteps + 1), nh, nw, nstack*nc]
nstack, nenv, nsteps, nh, nw, nc = self.nstack, self.nenv, self.nsteps, self.nh, self.nw, self.nc
y = np.empty([nsteps + nstack - 1, nenv, 1, 1, 1], dtype=np.float32)
obs = np.zeros([nstack, nsteps + nstack, nenv, nh, nw, nc], dtype=np.uint8)
x = np.reshape(enc_obs, [nenv, nsteps + nstack, nh, nw, nc]).swapaxes(1,
0) # [nsteps + nstack, nenv, nh, nw, nc]
y[3:] = np.reshape(1.0 - dones, [nenv, nsteps, 1, 1, 1]).swapaxes(1, 0) # keep
y[:3] = 1.0
# y = np.reshape(1 - dones, [nenvs, nsteps, 1, 1, 1])
for i in range(nstack):
obs[-(i + 1), i:] = x
# obs[:,i:,:,:,-(i+1),:] = x
x = x[:-1] * y
y = y[1:]
return np.reshape(obs[:, 3:].transpose((2, 1, 3, 4, 0, 5)), [nenv, (nsteps + 1), nh, nw, nstack * nc])
def put(self, enc_obs, actions, rewards, mus, dones, masks):
# enc_obs [nenv, (nsteps + nstack), nh, nw, nc]
# actions, rewards, dones [nenv, nsteps]
# mus [nenv, nsteps, nact]
if self.enc_obs is None:
self.enc_obs = np.empty([self.size] + list(enc_obs.shape), dtype=np.uint8)
self.actions = np.empty([self.size] + list(actions.shape), dtype=np.int32)
self.rewards = np.empty([self.size] + list(rewards.shape), dtype=np.float32)
self.mus = np.empty([self.size] + list(mus.shape), dtype=np.float32)
self.dones = np.empty([self.size] + list(dones.shape), dtype=np.bool)
self.masks = np.empty([self.size] + list(masks.shape), dtype=np.bool)
self.enc_obs[self.next_idx] = enc_obs
self.actions[self.next_idx] = actions
self.rewards[self.next_idx] = rewards
self.mus[self.next_idx] = mus
self.dones[self.next_idx] = dones
self.masks[self.next_idx] = masks
self.next_idx = (self.next_idx + 1) % self.size
self.num_in_buffer = min(self.size, self.num_in_buffer + 1)
def take(self, x, idx, envx):
nenv = self.nenv
out = np.empty([nenv] + list(x.shape[2:]), dtype=x.dtype)
for i in range(nenv):
out[i] = x[idx[i], envx[i]]
return out
def get(self):
# returns
# obs [nenv, (nsteps + 1), nh, nw, nstack*nc]
# actions, rewards, dones [nenv, nsteps]
# mus [nenv, nsteps, nact]
nenv = self.nenv
assert self.can_sample()
# Sample exactly one id per env. If you sample across envs, then higher correlation in samples from same env.
idx = np.random.randint(0, self.num_in_buffer, nenv)
envx = np.arange(nenv)
take = lambda x: self.take(x, idx, envx) # for i in range(nenv)], axis = 0)
dones = take(self.dones)
enc_obs = take(self.enc_obs)
obs = self.decode(enc_obs, dones)
actions = take(self.actions)
rewards = take(self.rewards)
mus = take(self.mus)
masks = take(self.masks)
return obs, actions, rewards, mus, dones, masks

baselines/acer/policies.py (new file)

@@ -0,0 +1,79 @@
import numpy as np
import tensorflow as tf
from baselines.ppo2.policies import nature_cnn
from baselines.a2c.utils import fc, batch_to_seq, seq_to_batch, lstm, sample
class AcerCnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False):
nbatch = nenv * nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc * nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) # obs
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
pi_logits = fc(h, 'pi', nact, init_scale=0.01)
pi = tf.nn.softmax(pi_logits)
q = fc(h, 'q', nact)
a = sample(pi_logits) # could change this to use self.pi instead
self.initial_state = [] # not stateful
self.X = X
self.pi = pi # actual policy params now
self.q = q
def step(ob, *args, **kwargs):
# returns actions, mus, states
a0, pi0 = sess.run([a, pi], {X: ob})
return a0, pi0, [] # dummy state
def out(ob, *args, **kwargs):
pi0, q0 = sess.run([pi, q], {X: ob})
return pi0, q0
def act(ob, *args, **kwargs):
return sess.run(a, {X: ob})
self.step = step
self.out = out
self.act = act
class AcerLstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False, nlstm=256):
nbatch = nenv * nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc * nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) # obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
# lstm
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi_logits = fc(h5, 'pi', nact, init_scale=0.01)
pi = tf.nn.softmax(pi_logits)
q = fc(h5, 'q', nact)
a = sample(pi_logits) # could change this to use self.pi instead
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
self.X = X
self.M = M
self.S = S
self.pi = pi # actual policy params now
self.q = q
def step(ob, state, mask, *args, **kwargs):
# returns actions, mus, states
a0, pi0, s = sess.run([a, pi, snew], {X: ob, S: state, M: mask})
return a0, pi0, s
self.step = step

baselines/acer/run_atari.py (new file)

@@ -0,0 +1,30 @@
#!/usr/bin/env python3
from baselines import logger
from baselines.acer.acer_simple import learn
from baselines.acer.policies import AcerCnnPolicy, AcerLstmPolicy
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
def train(env_id, num_timesteps, seed, policy, lrschedule, num_cpu):
env = make_atari_env(env_id, num_cpu, seed)
if policy == 'cnn':
policy_fn = AcerCnnPolicy
elif policy == 'lstm':
policy_fn = AcerLstmPolicy
else:
print("Policy {} not implemented".format(policy))
return
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
env.close()
def main():
parser = atari_arg_parser()
parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
parser.add_argument('--logdir', help ='Directory for logging')
args = parser.parse_args()
logger.configure(args.logdir)
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
policy=args.policy, lrschedule=args.lrschedule, num_cpu=16)
if __name__ == '__main__':
main()

baselines/acktr/acktr_cont.py

@@ -1,10 +1,10 @@
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines import common
import baselines.common as common
from baselines.common import tf_util as U
from baselines.acktr import kfac
from baselines.acktr.filters import ZFilter
from baselines.common.filters import ZFilter
def pathlength(path):
return path["reward"].shape[0]# Loss function that we'll differentiate to get the policy gradient
@@ -70,7 +70,7 @@ def learn(env, policy, vf, gamma, lam, timesteps_per_batch, num_timesteps,
coord = tf.train.Coordinator()
for qr in [q_runner, vf.q_runner]:
assert (qr != None)
enqueue_threads.extend(qr.create_threads(U.get_session(), coord=coord, start=True))
enqueue_threads.extend(qr.create_threads(tf.get_default_session(), coord=coord, start=True))
i = 0
timesteps_so_far = 0
@@ -122,10 +122,10 @@ def learn(env, policy, vf, gamma, lam, timesteps_per_batch, num_timesteps,
kl = policy.compute_kl(ob_no, oldac_dist)
if kl > desired_kl * 2:
logger.log("kl too high")
U.eval(tf.assign(stepsize, tf.maximum(min_stepsize, stepsize / 1.5)))
tf.assign(stepsize, tf.maximum(min_stepsize, stepsize / 1.5)).eval()
elif kl < desired_kl / 2:
logger.log("kl too low")
U.eval(tf.assign(stepsize, tf.minimum(max_stepsize, stepsize * 1.5)))
tf.assign(stepsize, tf.minimum(max_stepsize, stepsize * 1.5)).eval()
else:
logger.log("kl just right!")

baselines/acktr/acktr_disc.py

@@ -7,16 +7,17 @@ from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.acktr.utils import discount_with_dones
from baselines.acktr.utils import Scheduler, find_trainable_variables
from baselines.acktr.utils import cat_entropy, mse
from baselines.a2c.a2c import Runner
from baselines.a2c.utils import discount_with_dones
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.utils import cat_entropy, mse
from baselines.acktr import kfac
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, nprocs=32, nsteps=20,
nstack=4, ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, lrschedule='linear'):
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=nprocs,
@@ -31,8 +32,8 @@ class Model(object):
PG_LR = tf.placeholder(tf.float32, [])
VF_LR = tf.placeholder(tf.float32, [])
self.model = step_model = policy(sess, ob_space, ac_space, nenvs, 1, nstack, reuse=False)
self.model2 = train_model = policy(sess, ob_space, ac_space, nenvs, nsteps, nstack, reuse=True)
self.model = step_model = policy(sess, ob_space, ac_space, nenvs, 1, reuse=False)
self.model2 = train_model = policy(sess, ob_space, ac_space, nenvs*nsteps, nsteps, reuse=True)
logpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
self.logits = logits = train_model.pi
@@ -71,7 +72,7 @@ class Model(object):
cur_lr = self.lr.value()
td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, PG_LR:cur_lr}
if states != []:
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks
@@ -104,70 +105,8 @@ class Model(object):
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
class Runner(object):
def __init__(self, env, model, nsteps, nstack, gamma):
self.env = env
self.model = model
nh, nw, nc = env.observation_space.shape
nenv = env.num_envs
self.batch_ob_shape = (nenv*nsteps, nh, nw, nc*nstack)
self.obs = np.zeros((nenv, nh, nw, nc*nstack), dtype=np.uint8)
obs = env.reset()
self.update_obs(obs)
self.gamma = gamma
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
def update_obs(self, obs):
self.obs = np.roll(self.obs, shift=-1, axis=3)
self.obs[:, :, :, -1] = obs[:, :, :, 0]
def run(self):
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
mb_states = self.states
for n in range(self.nsteps):
actions, values, states = self.model.step(self.obs, self.states, self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_values.append(values)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
self.states = states
self.dones = dones
for n, done in enumerate(dones):
if done:
self.obs[n] = self.obs[n]*0
self.update_obs(obs)
mb_rewards.append(rewards)
mb_dones.append(self.dones)
#batch of steps to batch of rollouts
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0).reshape(self.batch_ob_shape)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones[:, :-1]
mb_dones = mb_dones[:, 1:]
last_values = self.model.value(self.obs, self.states, self.dones).tolist()
#discount/bootstrap off value fn
for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
rewards = rewards.tolist()
dones = dones.tolist()
if dones[-1] == 0:
rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
else:
rewards = discount_with_dones(rewards, dones, self.gamma)
mb_rewards[n] = rewards
mb_rewards = mb_rewards.flatten()
mb_actions = mb_actions.flatten()
mb_values = mb_values.flatten()
mb_masks = mb_masks.flatten()
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
nstack=4, ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, save_interval=None, lrschedule='linear'):
tf.reset_default_graph()
set_global_seeds(seed)
@@ -176,7 +115,7 @@ def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval
ob_space = env.observation_space
ac_space = env.action_space
make_model = lambda : Model(policy, ob_space, ac_space, nenvs, total_timesteps, nprocs=nprocs, nsteps
=nsteps, nstack=nstack, ent_coef=ent_coef, vf_coef=vf_coef, vf_fisher_coef=
=nsteps, ent_coef=ent_coef, vf_coef=vf_coef, vf_fisher_coef=
vf_fisher_coef, lr=lr, max_grad_norm=max_grad_norm, kfac_clip=kfac_clip,
lrschedule=lrschedule)
if save_interval and logger.get_dir():
@@ -185,7 +124,7 @@ def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval
fh.write(cloudpickle.dumps(make_model))
model = make_model()
runner = Runner(env, model, nsteps=nsteps, nstack=nstack, gamma=gamma)
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
nbatch = nenvs*nsteps
tstart = time.time()
coord = tf.train.Coordinator()


@@ -228,7 +228,7 @@ class KfacOptimizer():
Ow = bpropFactor.get_shape()[2]
if Oh == 1 and Ow == 1 and self._channel_fac:
# factorization along the channels
# assume independence bewteen input channels and spatial
# assume independence between input channels and spatial
# 2K-1 x 2K-1 covariance matrix and C x C covariance matrix
# factorization along the channels do not
# support homogeneous coordinate, assnBias


@@ -1,93 +1,55 @@
import tensorflow as tf
import numpy as np
def gmatmul(a, b, transpose_a=False, transpose_b=False, reduce_dim=None):
if reduce_dim == None:
# general batch matmul
if len(a.get_shape()) == 3 and len(b.get_shape()) == 3:
return tf.batch_matmul(a, b, adj_x=transpose_a, adj_y=transpose_b)
elif len(a.get_shape()) == 3 and len(b.get_shape()) == 2:
if transpose_b:
N = b.get_shape()[0].value
else:
N = b.get_shape()[1].value
B = a.get_shape()[0].value
if transpose_a:
K = a.get_shape()[1].value
a = tf.reshape(tf.transpose(a, [0, 2, 1]), [-1, K])
else:
K = a.get_shape()[-1].value
a = tf.reshape(a, [-1, K])
result = tf.matmul(a, b, transpose_b=transpose_b)
result = tf.reshape(result, [B, -1, N])
return result
elif len(a.get_shape()) == 2 and len(b.get_shape()) == 3:
if transpose_a:
M = a.get_shape()[1].value
else:
M = a.get_shape()[0].value
B = b.get_shape()[0].value
if transpose_b:
K = b.get_shape()[-1].value
b = tf.transpose(tf.reshape(b, [-1, K]), [1, 0])
else:
K = b.get_shape()[1].value
b = tf.transpose(tf.reshape(
tf.transpose(b, [0, 2, 1]), [-1, K]), [1, 0])
result = tf.matmul(a, b, transpose_a=transpose_a)
result = tf.transpose(tf.reshape(result, [M, B, -1]), [1, 0, 2])
return result
else:
return tf.matmul(a, b, transpose_a=transpose_a, transpose_b=transpose_b)
else:
# weird batch matmul
if len(a.get_shape()) == 2 and len(b.get_shape()) > 2:
# reshape reduce_dim to the left most dim in b
b_shape = b.get_shape()
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(reduce_dim)
b_dims.insert(0, reduce_dim)
b = tf.transpose(b, b_dims)
b_t_shape = b.get_shape()
b = tf.reshape(b, [int(b_shape[reduce_dim]), -1])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, b_t_shape)
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(0)
b_dims.insert(reduce_dim, 0)
result = tf.transpose(result, b_dims)
return result
assert reduce_dim is not None
elif len(a.get_shape()) > 2 and len(b.get_shape()) == 2:
# reshape reduce_dim to the right most dim in a
a_shape = a.get_shape()
outter_dim = len(a_shape) - 1
reduce_dim = len(a_shape) - reduce_dim - 1
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(reduce_dim)
a_dims.insert(outter_dim, reduce_dim)
a = tf.transpose(a, a_dims)
a_t_shape = a.get_shape()
a = tf.reshape(a, [-1, int(a_shape[reduce_dim])])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, a_t_shape)
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(outter_dim)
a_dims.insert(reduce_dim, outter_dim)
result = tf.transpose(result, a_dims)
return result
# weird batch matmul
if len(a.get_shape()) == 2 and len(b.get_shape()) > 2:
# reshape reduce_dim to the left most dim in b
b_shape = b.get_shape()
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(reduce_dim)
b_dims.insert(0, reduce_dim)
b = tf.transpose(b, b_dims)
b_t_shape = b.get_shape()
b = tf.reshape(b, [int(b_shape[reduce_dim]), -1])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, b_t_shape)
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(0)
b_dims.insert(reduce_dim, 0)
result = tf.transpose(result, b_dims)
return result
elif len(a.get_shape()) == 2 and len(b.get_shape()) == 2:
return tf.matmul(a, b, transpose_a=transpose_a, transpose_b=transpose_b)
elif len(a.get_shape()) > 2 and len(b.get_shape()) == 2:
# reshape reduce_dim to the right most dim in a
a_shape = a.get_shape()
outter_dim = len(a_shape) - 1
reduce_dim = len(a_shape) - reduce_dim - 1
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(reduce_dim)
a_dims.insert(outter_dim, reduce_dim)
a = tf.transpose(a, a_dims)
a_t_shape = a.get_shape()
a = tf.reshape(a, [-1, int(a_shape[reduce_dim])])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, a_t_shape)
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(outter_dim)
a_dims.insert(reduce_dim, outter_dim)
result = tf.transpose(result, a_dims)
return result
assert False, 'something went wrong'
elif len(a.get_shape()) == 2 and len(b.get_shape()) == 2:
return tf.matmul(a, b, transpose_a=transpose_a, transpose_b=transpose_b)
assert False, 'something went wrong'
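As a rough reading aid for gmatmul: with both transpose flags off, the reduce_dim branch contracts the columns of a against axis reduce_dim of b and puts the result back on that axis. The NumPy sketch below is an assumption drawn from the reshapes above, not a drop-in replacement:
import numpy as np

def gmatmul_np(a, b, reduce_dim):
    # contract a's columns with axis `reduce_dim` of b, keeping the axis position
    return np.moveaxis(np.tensordot(a, b, axes=(1, reduce_dim)), 0, reduce_dim)

a = np.random.randn(5, 5)          # in K-FAC this factor is square
b = np.random.randn(2, 5, 3)
print(gmatmul_np(a, b, reduce_dim=1).shape)   # (2, 5, 3)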
def clipoutNeg(vec, threshold=1e-6):


@@ -1,43 +1,8 @@
import numpy as np
import tensorflow as tf
from baselines.acktr.utils import conv, fc, dense, conv_to_fc, sample, kl_div
from baselines.acktr.utils import dense, kl_div
import baselines.common.tf_util as U
class CnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False):
nbatch = nenv*nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc*nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
with tf.variable_scope("model", reuse=reuse):
h = conv(tf.cast(X, tf.float32)/255., 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2))
h2 = conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2))
h3 = conv(h2, 'c3', nf=32, rf=3, stride=1, init_scale=np.sqrt(2))
h3 = conv_to_fc(h3)
h4 = fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2))
pi = fc(h4, 'pi', nact, act=lambda x:x)
vf = fc(h4, 'v', 1, act=lambda x:x)
v0 = vf[:, 0]
a0 = sample(pi)
self.initial_state = [] #not stateful
def step(ob, *_args, **_kwargs):
a, v = sess.run([a0, v0], {X:ob})
return a, v, [] #dummy state
def value(ob, *_args, **_kwargs):
return sess.run(v0, {X:ob})
self.X = X
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class GaussianMlpPolicy(object):
def __init__(self, ob_dim, ac_dim):
# Here we'll construct a bunch of expressions, which will be used in two places:
@@ -60,12 +25,12 @@ class GaussianMlpPolicy(object):
std_na = tf.tile(std_1a, [tf.shape(mean_na)[0], 1])
ac_dist = tf.concat([tf.reshape(mean_na, [-1, ac_dim]), tf.reshape(std_na, [-1, ac_dim])], 1)
sampled_ac_na = tf.random_normal(tf.shape(ac_dist[:,ac_dim:])) * ac_dist[:,ac_dim:] + ac_dist[:,:ac_dim] # This is the sampled action we'll perform.
logprobsampled_n = - U.sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * U.sum(tf.square(ac_dist[:,:ac_dim] - sampled_ac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of sampled action
logprob_n = - U.sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * U.sum(tf.square(ac_dist[:,:ac_dim] - oldac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of previous actions under CURRENT policy (whereas oldlogprob_n is under OLD policy)
kl = U.mean(kl_div(oldac_dist, ac_dist, ac_dim))
#kl = .5 * U.mean(tf.square(logprob_n - oldlogprob_n)) # Approximation of KL divergence between old policy used to generate actions, and new policy used to compute logprob_n
surr = - U.mean(adv_n * logprob_n) # Loss function that we'll differentiate to get the policy gradient
surr_sampled = - U.mean(logprob_n) # Sampled loss of the policy
logprobsampled_n = - tf.reduce_sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * tf.reduce_sum(tf.square(ac_dist[:,:ac_dim] - sampled_ac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of sampled action
logprob_n = - tf.reduce_sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * tf.reduce_sum(tf.square(ac_dist[:,:ac_dim] - oldac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of previous actions under CURRENT policy (whereas oldlogprob_n is under OLD policy)
kl = tf.reduce_mean(kl_div(oldac_dist, ac_dist, ac_dim))
#kl = .5 * tf.reduce_mean(tf.square(logprob_n - oldlogprob_n)) # Approximation of KL divergence between old policy used to generate actions, and new policy used to compute logprob_n
surr = - tf.reduce_mean(adv_n * logprob_n) # Loss function that we'll differentiate to get the policy gradient
surr_sampled = - tf.reduce_mean(logprob_n) # Sampled loss of the policy
self._act = U.function([ob_no], [sampled_ac_na, ac_dist, logprobsampled_n]) # Generate a new action and its logprob
#self.compute_kl = U.function([ob_no, oldac_na, oldlogprob_n], kl) # Compute (approximate) KL divergence between old policy and new policy
self.compute_kl = U.function([ob_no, oldac_dist], kl)
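The logprobsampled_n / logprob_n expressions above are the standard diagonal-Gaussian log density; here is a self-contained NumPy check of the same formula (a hypothetical helper, not the policy class itself):
import numpy as np

def diag_gaussian_logp(x, mean, std):
    # -sum(log std) - 0.5*k*log(2*pi) - 0.5*sum(((x - mean)/std)^2), as above
    k = x.shape[-1]
    return (-np.sum(np.log(std), axis=-1)
            - 0.5 * k * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.square((x - mean) / std), axis=-1))

x = np.array([0.3, -0.1]); mean = np.zeros(2); std = np.ones(2)
print(diag_gaussian_logp(x, mean, std))   # ~ -1.888 for a standard normal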


@@ -1,38 +1,23 @@
#!/usr/bin/env python
import os, logging, gym
#!/usr/bin/env python3
from functools import partial
from baselines import logger
from baselines.common import set_global_seeds
from baselines import bench
from baselines.acktr.acktr_disc import learn
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.acktr.policies import CnnPolicy
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.ppo2.policies import CnnPolicy
def train(env_id, num_timesteps, seed, num_cpu):
def make_env(rank):
def _thunk():
env = make_atari(env_id)
env.seed(seed + rank)
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
gym.logger.setLevel(logging.WARN)
return wrap_deepmind(env)
return _thunk
set_global_seeds(seed)
env = SubprocVecEnv([make_env(i) for i in range(num_cpu)])
policy_fn = CnnPolicy
env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
policy_fn = partial(CnnPolicy, one_dim_bias=True)
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
env.close()
def main():
import argparse
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
args = parser.parse_args()
args = atari_arg_parser().parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed, num_cpu=32)
if __name__ == '__main__':
main()


@@ -1,22 +1,14 @@
#!/usr/bin/env python
import argparse
import logging
import os
#!/usr/bin/env python3
import tensorflow as tf
import gym
from baselines import logger
from baselines.common import set_global_seeds
from baselines import bench
from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
from baselines.acktr.acktr_cont import learn
from baselines.acktr.policies import GaussianMlpPolicy
from baselines.acktr.value_functions import NeuralNetValueFunction
def train(env_id, num_timesteps, seed):
env=gym.make(env_id)
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
set_global_seeds(seed)
env.seed(seed)
gym.logger.setLevel(logging.WARN)
env = make_mujoco_env(env_id, seed)
with tf.Session(config=tf.ConfigProto()):
ob_dim = env.observation_space.shape[0]
@@ -33,11 +25,10 @@ def train(env_id, num_timesteps, seed):
env.close()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Run Mujoco benchmark.')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--env', help='environment ID', type=str, default="Reacher-v1")
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
args = parser.parse_args()
def main():
args = mujoco_arg_parser().parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
if __name__ == "__main__":
main()


@@ -1,69 +1,8 @@
import os
import numpy as np
import tensorflow as tf
import baselines.common.tf_util as U
from collections import deque
def sample(logits):
noise = tf.random_uniform(tf.shape(logits))
return tf.argmax(logits - tf.log(-tf.log(noise)), 1)
def std(x):
mean = tf.reduce_mean(x)
var = tf.reduce_mean(tf.square(x-mean))
return tf.sqrt(var)
def cat_entropy(logits):
a0 = logits - tf.reduce_max(logits, 1, keep_dims=True)
ea0 = tf.exp(a0)
z0 = tf.reduce_sum(ea0, 1, keep_dims=True)
p0 = ea0 / z0
return tf.reduce_sum(p0 * (tf.log(z0) - a0), 1)
def cat_entropy_softmax(p0):
return - tf.reduce_sum(p0 * tf.log(p0 + 1e-6), axis = 1)
def mse(pred, target):
return tf.square(pred-target)/2.
def ortho_init(scale=1.0):
def _ortho_init(shape, dtype, partition_info=None):
#lasagne ortho init for tf
shape = tuple(shape)
if len(shape) == 2:
flat_shape = shape
elif len(shape) == 4: # assumes NHWC
flat_shape = (np.prod(shape[:-1]), shape[-1])
else:
raise NotImplementedError
a = np.random.normal(0.0, 1.0, flat_shape)
u, _, v = np.linalg.svd(a, full_matrices=False)
q = u if u.shape == flat_shape else v # pick the one with the correct shape
q = q.reshape(shape)
return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
return _ortho_init
def conv(x, scope, nf, rf, stride, pad='VALID', act=tf.nn.relu, init_scale=1.0):
with tf.variable_scope(scope):
nin = x.get_shape()[3].value
w = tf.get_variable("w", [rf, rf, nin, nf], initializer=ortho_init(init_scale))
b = tf.get_variable("b", [nf], initializer=tf.constant_initializer(0.0))
z = tf.nn.conv2d(x, w, strides=[1, stride, stride, 1], padding=pad)+b
h = act(z)
return h
def fc(x, scope, nh, act=tf.nn.relu, init_scale=1.0):
with tf.variable_scope(scope):
nin = x.get_shape()[1].value
w = tf.get_variable("w", [nin, nh], initializer=ortho_init(init_scale))
b = tf.get_variable("b", [nh], initializer=tf.constant_initializer(0.0))
z = tf.matmul(x, w)+b
h = act(z)
return h
def dense(x, size, name, weight_init=None, bias_init=0, weight_loss_dict=None, reuse=None):
with tf.variable_scope(name, reuse=reuse):
assert (len(U.scope_name().split('/')) == 2)
assert (len(tf.get_variable_scope().name.split('/')) == 2)
w = tf.get_variable("w", [x.get_shape()[1], size], initializer=weight_init)
b = tf.get_variable("b", [size], initializer=tf.constant_initializer(bias_init))
@@ -75,15 +14,10 @@ def dense(x, size, name, weight_init=None, bias_init=0, weight_loss_dict=None, r
weight_loss_dict[w] = weight_decay_fc
weight_loss_dict[b] = 0.0
tf.add_to_collection(U.scope_name().split('/')[0] + '_' + 'losses', weight_decay)
tf.add_to_collection(tf.get_variable_scope().name.split('/')[0] + '_' + 'losses', weight_decay)
return tf.nn.bias_add(tf.matmul(x, w), b)
def conv_to_fc(x):
nh = np.prod([v.value for v in x.get_shape()[1:]])
x = tf.reshape(x, [-1, nh])
return x
def kl_div(action_dist1, action_dist2, action_size):
mean1, std1 = action_dist1[:, :action_size], action_dist1[:, action_size:]
mean2, std2 = action_dist2[:, :action_size], action_dist2[:, action_size:]
@@ -92,109 +26,3 @@ def kl_div(action_dist1, action_dist2, action_size):
denominator = 2 * tf.square(std2) + 1e-8
return tf.reduce_sum(
numerator/denominator + tf.log(std2) - tf.log(std1),reduction_indices=-1)
def discount_with_dones(rewards, dones, gamma):
discounted = []
r = 0
for reward, done in zip(rewards[::-1], dones[::-1]):
r = reward + gamma*r*(1.-done) # fixed off by one bug
discounted.append(r)
return discounted[::-1]
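discount_with_dones walks the rollout backwards and zeroes the bootstrap whenever done is set; a small worked example with made-up rewards (the helper is copied here only so the snippet runs on its own):
def discount_with_dones(rewards, dones, gamma):   # same helper as above
    discounted, r = [], 0
    for reward, done in zip(rewards[::-1], dones[::-1]):
        r = reward + gamma * r * (1. - done)
        discounted.append(r)
    return discounted[::-1]

print(discount_with_dones([1, 1, 1, 1], [0, 0, 1, 0], gamma=0.5))
# -> [1.75, 1.5, 1.0, 1.0]; the return restarts after the third step because done=1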
def find_trainable_variables(key):
with tf.variable_scope(key):
return tf.trainable_variables()
def make_path(f):
return os.makedirs(f, exist_ok=True)
def constant(p):
return 1
def linear(p):
return 1-p
def middle_drop(p):
eps = 0.75
if 1-p<eps:
return eps*0.1
return 1-p
def double_linear_con(p):
p *= 2
eps = 0.125
if 1-p<eps:
return eps
return 1-p
def double_middle_drop(p):
eps1 = 0.75
eps2 = 0.25
if 1-p<eps1:
if 1-p<eps2:
return eps2*0.5
return eps1*0.1
return 1-p
schedules = {
'linear':linear,
'constant':constant,
'double_linear_con':double_linear_con,
'middle_drop':middle_drop,
'double_middle_drop':double_middle_drop
}
class Scheduler(object):
def __init__(self, v, nvalues, schedule):
self.n = 0.
self.v = v
self.nvalues = nvalues
self.schedule = schedules[schedule]
def value(self):
current_value = self.v*self.schedule(self.n/self.nvalues)
self.n += 1.
return current_value
def value_steps(self, steps):
return self.v*self.schedule(steps/self.nvalues)
class EpisodeStats:
def __init__(self, nsteps, nenvs):
self.episode_rewards = []
for i in range(nenvs):
self.episode_rewards.append([])
self.lenbuffer = deque(maxlen=40) # rolling buffer for episode lengths
self.rewbuffer = deque(maxlen=40) # rolling buffer for episode rewards
self.nsteps = nsteps
self.nenvs = nenvs
def feed(self, rewards, masks):
rewards = np.reshape(rewards, [self.nenvs, self.nsteps])
masks = np.reshape(masks, [self.nenvs, self.nsteps])
for i in range(0, self.nenvs):
for j in range(0, self.nsteps):
self.episode_rewards[i].append(rewards[i][j])
if masks[i][j]:
l = len(self.episode_rewards[i])
s = sum(self.episode_rewards[i])
self.lenbuffer.append(l)
self.rewbuffer.append(s)
self.episode_rewards[i] = []
def mean_length(self):
if self.lenbuffer:
return np.mean(self.lenbuffer)
else:
return 0 # on the first params dump, no episodes are finished
def mean_reward(self):
if self.rewbuffer:
return np.mean(self.rewbuffer)
else:
return 0


@@ -1,6 +1,6 @@
from baselines import logger
import numpy as np
from baselines import common
import baselines.common as common
from baselines.common import tf_util as U
import tensorflow as tf
from baselines.acktr import kfac
@@ -16,8 +16,8 @@ class NeuralNetValueFunction(object):
vpred_n = dense(h2, 1, "hfinal", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict)[:,0]
sample_vpred_n = vpred_n + tf.random_normal(tf.shape(vpred_n))
wd_loss = tf.get_collection("vf_losses", None)
loss = U.mean(tf.square(vpred_n - vtarg_n)) + tf.add_n(wd_loss)
loss_sampled = U.mean(tf.square(vpred_n - tf.stop_gradient(sample_vpred_n)))
loss = tf.reduce_mean(tf.square(vpred_n - vtarg_n)) + tf.add_n(wd_loss)
loss_sampled = tf.reduce_mean(tf.square(vpred_n - tf.stop_gradient(sample_vpred_n)))
self._predict = U.function([X], vpred_n)
optim = kfac.KfacOptimizer(learning_rate=0.001, cold_lr=0.001*(1-0.9), momentum=0.9, \
clip_kl=0.3, epsilon=0.1, stats_decay=0.95, \


@@ -1,3 +1,2 @@
from baselines.bench.benchmarks import *
from baselines.bench.monitor import *
from baselines.bench.simple_bench import simple_bench
from baselines.bench.monitor import *


@@ -1,15 +1,26 @@
import re
import os.path as osp
import os
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
_atari7 = ['BeamRider', 'Breakout', 'Enduro', 'Pong', 'Qbert', 'Seaquest', 'SpaceInvaders']
_atariexpl7 = ['Freeway', 'Gravitar', 'MontezumaRevenge', 'Pitfall', 'PrivateEye', 'Solaris', 'Venture']
_BENCHMARKS = []
remove_version_re = re.compile(r'-v\d+$')
def register_benchmark(benchmark):
for b in _BENCHMARKS:
if b['name'] == benchmark['name']:
raise ValueError('Benchmark with name %s already registered!' % b['name'])
# automatically add a description if it is not present
if 'tasks' in benchmark:
for t in benchmark['tasks']:
if 'desc' not in t:
t['desc'] = remove_version_re.sub('', t['env_id'])
_BENCHMARKS.append(benchmark)
@@ -42,36 +53,34 @@ _ATARI_SUFFIX = 'NoFrameskip-v4'
register_benchmark({
'name': 'Atari50M',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 50M timesteps',
'tasks': [{'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(50e6)} for _game in _atari7]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(50e6)} for _game in _atari7]
})
register_benchmark({
'name': 'Atari10M',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
'tasks': [{'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari7]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari7]
})
register_benchmark({
'name': 'Atari1Hr',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 1 hour of walltime',
'tasks': [{'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_seconds': 60 * 60} for _game in _atari7]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_seconds': 60 * 60} for _game in _atari7]
})
register_benchmark({
'name': 'AtariExploration10M',
'description': '7 Atari games emphasizing exploration, with pixel observations, 10M timesteps',
'tasks': [{'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atariexpl7]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atariexpl7]
})
# MuJoCo
_mujocosmall = [
'InvertedDoublePendulum-v1', 'InvertedPendulum-v1',
'HalfCheetah-v1', 'Hopper-v1', 'Walker2d-v1',
'Reacher-v1', 'Swimmer-v1']
'InvertedDoublePendulum-v2', 'InvertedPendulum-v2',
'HalfCheetah-v2', 'Hopper-v2', 'Walker2d-v2',
'Reacher-v2', 'Swimmer-v2']
register_benchmark({
'name': 'Mujoco1M',
'description': 'Some small 2D MuJoCo tasks, run for 1M timesteps',
@@ -128,5 +137,14 @@ _atari50 = [ # actually 47
register_benchmark({
'name': 'Atari50_10M',
'description': '47 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
'tasks': [{'env_id': _game + _ATARI_SUFFIX, 'trials': 3, 'num_timesteps': int(10e6)} for _game in _atari50]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari50]
})
# HER DDPG
register_benchmark({
'name': 'HerDdpg',
'description': 'Smoke-test only benchmark of HER',
'tasks': [{'trials': 1, 'env_id': 'FetchReach-v1'}]
})
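A hedged usage sketch of register_benchmark, following the pattern above with an invented benchmark name and task (illustration only; run it inside this module or after importing the helper from baselines.bench.benchmarks):
register_benchmark({
    'name': 'PongExample1M',   # hypothetical name; a duplicate name raises ValueError
    'description': 'Single Pong task, 1M timesteps (example only)',
    'tasks': [{'env_id': 'PongNoFrameskip-v4', 'trials': 2, 'num_timesteps': int(1e6)}],
})
# with the new auto-description logic, the task gets desc='PongNoFrameskip'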


@@ -7,12 +7,13 @@ from glob import glob
import csv
import os.path as osp
import json
import numpy as np
class Monitor(Wrapper):
EXT = "monitor.csv"
f = None
def __init__(self, env, filename, allow_early_resets=False, reset_keywords=()):
def __init__(self, env, filename, allow_early_resets=False, reset_keywords=(), info_keywords=()):
Wrapper.__init__(self, env=env)
self.tstart = time.time()
if filename is None:
@@ -25,21 +26,23 @@ class Monitor(Wrapper):
else:
filename = filename + "." + Monitor.EXT
self.f = open(filename, "wt")
self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, "gym_version": gym.__version__,
"env_id": env.spec.id if env.spec else 'Unknown'}))
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords)
self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, 'env_id' : env.spec and env.spec.id}))
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords+info_keywords)
self.logger.writeheader()
self.f.flush()
self.reset_keywords = reset_keywords
self.info_keywords = info_keywords
self.allow_early_resets = allow_early_resets
self.rewards = None
self.needs_reset = True
self.episode_rewards = []
self.episode_lengths = []
self.episode_times = []
self.total_steps = 0
self.current_reset_info = {} # extra info about the current episode, that was passed in during reset()
def _reset(self, **kwargs):
def reset(self, **kwargs):
if not self.allow_early_resets and not self.needs_reset:
raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
self.rewards = []
@@ -51,7 +54,7 @@ class Monitor(Wrapper):
self.current_reset_info[k] = v
return self.env.reset(**kwargs)
def _step(self, action):
def step(self, action):
if self.needs_reset:
raise RuntimeError("Tried to step environment that needs reset")
ob, rew, done, info = self.env.step(action)
@@ -61,12 +64,15 @@ class Monitor(Wrapper):
eprew = sum(self.rewards)
eplen = len(self.rewards)
epinfo = {"r": round(eprew, 6), "l": eplen, "t": round(time.time() - self.tstart, 6)}
for k in self.info_keywords:
epinfo[k] = info[k]
self.episode_rewards.append(eprew)
self.episode_lengths.append(eplen)
self.episode_times.append(time.time() - self.tstart)
epinfo.update(self.current_reset_info)
if self.logger:
self.logger.writerow(epinfo)
self.f.flush()
self.episode_rewards.append(eprew)
self.episode_lengths.append(eplen)
info['episode'] = epinfo
self.total_steps += 1
return (ob, rew, done, info)
@@ -84,6 +90,9 @@ class Monitor(Wrapper):
def get_episode_lengths(self):
return self.episode_lengths
def get_episode_times(self):
return self.episode_times
class LoadMonitorResultsError(Exception):
pass
@@ -92,7 +101,9 @@ def get_monitor_files(dir):
def load_results(dir):
import pandas
monitor_files = glob(osp.join(dir, "*monitor.*")) # get both csv and (old) json files
monitor_files = (
glob(osp.join(dir, "*monitor.json")) +
glob(osp.join(dir, "*monitor.csv"))) # get both csv and (old) json files
if not monitor_files:
raise LoadMonitorResultsError("no monitor files of the form *%s found in %s" % (Monitor.EXT, dir))
dfs = []
@@ -114,10 +125,37 @@ def load_results(dir):
episode = json.loads(line)
episodes.append(episode)
df = pandas.DataFrame(episodes)
df['t'] += header['t_start']
else:
assert 0, 'unreachable'
df['t'] += header['t_start']
dfs.append(df)
df = pandas.concat(dfs)
df.sort_values('t', inplace=True)
df.reset_index(inplace=True)
df['t'] -= min(header['t_start'] for header in headers)
df.headers = headers # HACK to preserve backwards compatibility
return df
return df
def test_monitor():
env = gym.make("CartPole-v1")
env.seed(0)
mon_file = "/tmp/baselines-test-%s.monitor.csv" % uuid.uuid4()
menv = Monitor(env, mon_file)
menv.reset()
for _ in range(1000):
_, _, done, _ = menv.step(0)
if done:
menv.reset()
f = open(mon_file, 'rt')
firstline = f.readline()
assert firstline.startswith('#')
metadata = json.loads(firstline[1:])
assert metadata['env_id'] == "CartPole-v1"
assert set(metadata.keys()) == {'env_id', 'gym_version', 't_start'}, "Incorrect keys in monitor metadata"
last_logline = pandas.read_csv(f, index_col=None)
assert set(last_logline.keys()) == {'l', 't', 'r'}, "Incorrect keys in monitor logline"
f.close()
os.remove(mon_file)


@@ -1,3 +1,4 @@
# flake8: noqa F403
from baselines.common.console_util import *
from baselines.common.dataset import Dataset
from baselines.common.math_util import *


@@ -3,6 +3,7 @@ from collections import deque
import gym
from gym import spaces
import cv2
cv2.ocl.setUseOpenCL(False)
class NoopResetEnv(gym.Wrapper):
def __init__(self, env, noop_max=30):
@@ -12,14 +13,10 @@ class NoopResetEnv(gym.Wrapper):
gym.Wrapper.__init__(self, env)
self.noop_max = noop_max
self.override_num_noops = None
if isinstance(env.action_space, gym.spaces.MultiBinary):
self.noop_action = np.zeros(self.env.action_space.n, dtype=np.int64)
else:
# used for atari environments
self.noop_action = 0
assert env.unwrapped.get_action_meanings()[0] == 'NOOP'
self.noop_action = 0
assert env.unwrapped.get_action_meanings()[0] == 'NOOP'
def _reset(self, **kwargs):
def reset(self, **kwargs):
""" Do no-op action for a number of steps in [1, noop_max]."""
self.env.reset(**kwargs)
if self.override_num_noops is not None:
@@ -34,6 +31,9 @@ class NoopResetEnv(gym.Wrapper):
obs = self.env.reset(**kwargs)
return obs
def step(self, ac):
return self.env.step(ac)
class FireResetEnv(gym.Wrapper):
def __init__(self, env):
"""Take action on reset for environments that are fixed until firing."""
@@ -41,7 +41,7 @@ class FireResetEnv(gym.Wrapper):
assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
assert len(env.unwrapped.get_action_meanings()) >= 3
def _reset(self, **kwargs):
def reset(self, **kwargs):
self.env.reset(**kwargs)
obs, _, done, _ = self.env.step(1)
if done:
@@ -51,6 +51,9 @@ class FireResetEnv(gym.Wrapper):
self.env.reset(**kwargs)
return obs
def step(self, ac):
return self.env.step(ac)
class EpisodicLifeEnv(gym.Wrapper):
def __init__(self, env):
"""Make end-of-life == end-of-episode, but only reset on true game over.
@@ -60,21 +63,21 @@ class EpisodicLifeEnv(gym.Wrapper):
self.lives = 0
self.was_real_done = True
def _step(self, action):
def step(self, action):
obs, reward, done, info = self.env.step(action)
self.was_real_done = done
# check current lives, make loss of life terminal,
# then update lives to handle bonus lives
lives = self.env.unwrapped.ale.lives()
if lives < self.lives and lives > 0:
# for Qbert somtimes we stay in lives == 0 condtion for a few frames
# for Qbert sometimes we stay in lives == 0 condition for a few frames
# so its important to keep lives > 0, so that we only reset once
# the environment advertises done.
done = True
self.lives = lives
return obs, reward, done, info
def _reset(self, **kwargs):
def reset(self, **kwargs):
"""Reset only when lives are exhausted.
This way all states are still reachable even though lives are episodic,
and the learner need not know about any of this behind-the-scenes.
@@ -92,10 +95,10 @@ class MaxAndSkipEnv(gym.Wrapper):
"""Return only every `skip`-th frame"""
gym.Wrapper.__init__(self, env)
# most recent raw observations (for max pooling across time steps)
self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype='uint8')
self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
self._skip = skip
def _step(self, action):
def step(self, action):
"""Repeat action, sum reward, and max over last observations."""
total_reward = 0.0
done = None
@@ -112,8 +115,14 @@ class MaxAndSkipEnv(gym.Wrapper):
return max_frame, total_reward, done, info
def reset(self, **kwargs):
return self.env.reset(**kwargs)
class ClipRewardEnv(gym.RewardWrapper):
def _reward(self, reward):
def __init__(self, env):
gym.RewardWrapper.__init__(self, env)
def reward(self, reward):
"""Bin reward to {+1, 0, -1} by its sign."""
return np.sign(reward)
@@ -123,9 +132,10 @@ class WarpFrame(gym.ObservationWrapper):
gym.ObservationWrapper.__init__(self, env)
self.width = 84
self.height = 84
self.observation_space = spaces.Box(low=0, high=255, shape=(self.height, self.width, 1))
self.observation_space = spaces.Box(low=0, high=255,
shape=(self.height, self.width, 1), dtype=np.uint8)
def _observation(self, frame):
def observation(self, frame):
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
return frame[:, :, None]
@@ -144,15 +154,15 @@ class FrameStack(gym.Wrapper):
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k))
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=np.uint8)
def _reset(self):
def reset(self):
ob = self.env.reset()
for _ in range(self.k):
self.frames.append(ob)
return self._get_ob()
def _step(self, action):
def step(self, action):
ob, reward, done, info = self.env.step(action)
self.frames.append(ob)
return self._get_ob(), reward, done, info
@@ -162,7 +172,10 @@ class FrameStack(gym.Wrapper):
return LazyFrames(list(self.frames))
class ScaledFloatFrame(gym.ObservationWrapper):
def _observation(self, observation):
def __init__(self, env):
gym.ObservationWrapper.__init__(self, env)
def observation(self, observation):
# careful! This undoes the memory optimization, use
# with smaller replay buffers only.
return np.array(observation).astype(np.float32) / 255.0
@@ -175,15 +188,28 @@ class LazyFrames(object):
This object should only be converted to numpy array before being passed to the model.
You'd not belive how complex the previous solution was."""
You'd not believe how complex the previous solution was."""
self._frames = frames
self._out = None
def _force(self):
if self._out is None:
self._out = np.concatenate(self._frames, axis=2)
self._frames = None
return self._out
def __array__(self, dtype=None):
out = np.concatenate(self._frames, axis=2)
out = self._force()
if dtype is not None:
out = out.astype(dtype)
return out
def __len__(self):
return len(self._force())
def __getitem__(self, i):
return self._force()[i]
def make_atari(env_id):
env = gym.make(env_id)
assert 'NoFrameskip' in env.spec.id
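A brief, hedged illustration of the LazyFrames optimization above: frames are stored by reference and only concatenated when the object is converted to an array (import path assumed to be baselines.common.atari_wrappers):
import numpy as np
from baselines.common.atari_wrappers import LazyFrames

frames = [np.zeros((84, 84, 1), dtype=np.uint8) for _ in range(4)]
lazy = LazyFrames(frames)      # no copy yet, just references
stacked = np.array(lazy)       # concatenation happens here, via __array__
print(stacked.shape)           # (84, 84, 4)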


@@ -1,154 +0,0 @@
import os
import tempfile
import zipfile
from azure.common import AzureMissingResourceHttpError
try:
from azure.storage.blob import BlobService
except ImportError:
from azure.storage.blob import BlockBlobService as BlobService
from shutil import unpack_archive
from threading import Event
# TODOS: use Azure snapshots instead of hacky backups
def fixed_list_blobs(service, *args, **kwargs):
"""By defualt list_containers only returns a subset of results.
This function attempts to fix this.
"""
res = []
next_marker = None
while next_marker is None or len(next_marker) > 0:
kwargs['marker'] = next_marker
gen = service.list_blobs(*args, **kwargs)
for b in gen:
res.append(b.name)
next_marker = gen.next_marker
return res
def make_archive(source_path, dest_path):
if source_path.endswith(os.path.sep):
source_path = source_path.rstrip(os.path.sep)
prefix_path = os.path.dirname(source_path)
with zipfile.ZipFile(dest_path, "w", compression=zipfile.ZIP_STORED) as zf:
if os.path.isdir(source_path):
for dirname, _subdirs, files in os.walk(source_path):
zf.write(dirname, os.path.relpath(dirname, prefix_path))
for filename in files:
filepath = os.path.join(dirname, filename)
zf.write(filepath, os.path.relpath(filepath, prefix_path))
else:
zf.write(source_path, os.path.relpath(source_path, prefix_path))
class Container(object):
services = {}
def __init__(self, account_name, account_key, container_name, maybe_create=False):
self._account_name = account_name
self._container_name = container_name
if account_name not in Container.services:
Container.services[account_name] = BlobService(account_name, account_key)
self._service = Container.services[account_name]
if maybe_create:
self._service.create_container(self._container_name, fail_on_exist=False)
def put(self, source_path, blob_name, callback=None):
"""Upload a file or directory from `source_path` to azure blob `blob_name`.
Upload progress can be traced by an optional callback.
"""
upload_done = Event()
def progress_callback(current, total):
if callback:
callback(current, total)
if current >= total:
upload_done.set()
# Attempt to make backup if an existing version is already available
try:
x_ms_copy_source = "https://{}.blob.core.windows.net/{}/{}".format(
self._account_name,
self._container_name,
blob_name
)
self._service.copy_blob(
container_name=self._container_name,
blob_name=blob_name + ".backup",
x_ms_copy_source=x_ms_copy_source
)
except AzureMissingResourceHttpError:
pass
with tempfile.TemporaryDirectory() as td:
arcpath = os.path.join(td, "archive.zip")
make_archive(source_path, arcpath)
self._service.put_block_blob_from_path(
container_name=self._container_name,
blob_name=blob_name,
file_path=arcpath,
max_connections=4,
progress_callback=progress_callback,
max_retries=10)
upload_done.wait()
def get(self, dest_path, blob_name, callback=None):
"""Download a file or directory to `dest_path` to azure blob `blob_name`.
Warning! If directory is downloaded the `dest_path` is the parent directory.
Upload progress can be traced by an optional callback.
"""
download_done = Event()
def progress_callback(current, total):
if callback:
callback(current, total)
if current >= total:
download_done.set()
with tempfile.TemporaryDirectory() as td:
arcpath = os.path.join(td, "archive.zip")
for backup_blob_name in [blob_name, blob_name + '.backup']:
try:
properties = self._service.get_blob_properties(
blob_name=backup_blob_name,
container_name=self._container_name
)
if hasattr(properties, 'properties'):
# Annoyingly, Azure has changed the API and this now returns a blob
# instead of its properties with an up-to-date azure package.
blob_size = properties.properties.content_length
else:
blob_size = properties['content-length']
if int(blob_size) > 0:
self._service.get_blob_to_path(
container_name=self._container_name,
blob_name=backup_blob_name,
file_path=arcpath,
max_connections=4,
progress_callback=progress_callback)
unpack_archive(arcpath, dest_path)
download_done.wait()
return True
except AzureMissingResourceHttpError:
pass
return False
def list(self, prefix=None):
"""List all blobs in the container."""
return fixed_list_blobs(self._service, self._container_name, prefix=prefix)
def exists(self, blob_name):
"""Returns true if `blob_name` exists in container."""
try:
self._service.get_blob_properties(
blob_name=blob_name,
container_name=self._container_name
)
return True
except AzureMissingResourceHttpError:
return False


@@ -0,0 +1,90 @@
"""
Helpers for scripts like run_atari.py.
"""
import os
from mpi4py import MPI
import gym
from gym.wrappers import FlattenDictWrapper
from baselines import logger
from baselines.bench import Monitor
from baselines.common import set_global_seeds
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
"""
Create a wrapped, monitored SubprocVecEnv for Atari.
"""
if wrapper_kwargs is None: wrapper_kwargs = {}
def make_env(rank): # pylint: disable=C0111
def _thunk():
env = make_atari(env_id)
env.seed(seed + rank)
env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
return wrap_deepmind(env, **wrapper_kwargs)
return _thunk
set_global_seeds(seed)
return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
def make_mujoco_env(env_id, seed):
"""
Create a wrapped, monitored gym.Env for MuJoCo.
"""
rank = MPI.COMM_WORLD.Get_rank()
set_global_seeds(seed + 10000 * rank)
env = gym.make(env_id)
env = Monitor(env, os.path.join(logger.get_dir(), str(rank)))
env.seed(seed)
return env
def make_robotics_env(env_id, seed, rank=0):
"""
Create a wrapped, monitored gym.Env for MuJoCo.
"""
set_global_seeds(seed)
env = gym.make(env_id)
env = FlattenDictWrapper(env, ['observation', 'desired_goal'])
env = Monitor(
env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)),
info_keywords=('is_success',))
env.seed(seed)
return env
def arg_parser():
"""
Create an empty argparse.ArgumentParser.
"""
import argparse
return argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
def atari_arg_parser():
"""
Create an argparse.ArgumentParser for run_atari.py.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
return parser
def mujoco_arg_parser():
"""
Create an argparse.ArgumentParser for run_mujoco.py.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
parser.add_argument('--play', default=False, action='store_true')
return parser
def robotics_arg_parser():
"""
Create an argparse.ArgumentParser for run_mujoco.py.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', type=str, default='FetchReach-v0')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
return parser
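Hedged usage sketch of the parsers above (the argument values are examples only):
from baselines.common.cmd_util import atari_arg_parser

args = atari_arg_parser().parse_args(['--env', 'PongNoFrameskip-v4', '--seed', '1'])
print(args.env, args.seed, args.num_timesteps)   # PongNoFrameskip-v4 1 10000000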


@@ -16,7 +16,12 @@ def fmt_item(x, l):
if isinstance(x, np.ndarray):
assert x.ndim==0
x = x.item()
if isinstance(x, float): rep = "%g"%x
if isinstance(x, (float, np.float32, np.float64)):
v = abs(x)
if (v < 1e-4 or v > 1e+4) and v > 0:
rep = "%7.2e" % x
else:
rep = "%7.5f" % x
else: rep = str(x)
return " "*(l - len(rep)) + rep


@@ -1,6 +1,7 @@
import tensorflow as tf
import numpy as np
import baselines.common.tf_util as U
from baselines.a2c.utils import fc
from tensorflow.python.ops import math_ops
class Pd(object):
@@ -31,6 +32,8 @@ class PdType(object):
raise NotImplementedError
def pdfromflat(self, flat):
return self.pdclass()(flat)
def pdfromlatent(self, latent_vector):
raise NotImplementedError
def param_shape(self):
raise NotImplementedError
def sample_shape(self):
@@ -48,6 +51,10 @@ class CategoricalPdType(PdType):
self.ncat = ncat
def pdclass(self):
return CategoricalPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
pdparam = fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
return self.pdfromflat(pdparam), pdparam
def param_shape(self):
return [self.ncat]
def sample_shape(self):
@@ -57,14 +64,12 @@ class CategoricalPdType(PdType):
class MultiCategoricalPdType(PdType):
def __init__(self, low, high):
self.low = low
self.high = high
self.ncats = high - low + 1
def __init__(self, nvec):
self.ncats = nvec
def pdclass(self):
return MultiCategoricalPd
def pdfromflat(self, flat):
return MultiCategoricalPd(self.low, self.high, flat)
return MultiCategoricalPd(self.ncats, flat)
def param_shape(self):
return [sum(self.ncats)]
def sample_shape(self):
@@ -77,6 +82,13 @@ class DiagGaussianPdType(PdType):
self.size = size
def pdclass(self):
return DiagGaussianPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
mean = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
logstd = tf.get_variable(name='logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
return self.pdfromflat(pdparam), mean
def param_shape(self):
return [2*self.size]
def sample_shape(self):
@@ -125,7 +137,7 @@ class CategoricalPd(Pd):
def flatparam(self):
return self.logits
def mode(self):
return U.argmax(self.logits, axis=-1)
return tf.argmax(self.logits, axis=-1)
def neglogp(self, x):
# return tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
# Note: we can't use sparse_softmax_cross_entropy_with_logits because
@@ -135,20 +147,20 @@ class CategoricalPd(Pd):
logits=self.logits,
labels=one_hot_actions)
def kl(self, other):
a0 = self.logits - U.max(self.logits, axis=-1, keepdims=True)
a1 = other.logits - U.max(other.logits, axis=-1, keepdims=True)
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keep_dims=True)
a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keep_dims=True)
ea0 = tf.exp(a0)
ea1 = tf.exp(a1)
z0 = U.sum(ea0, axis=-1, keepdims=True)
z1 = U.sum(ea1, axis=-1, keepdims=True)
z0 = tf.reduce_sum(ea0, axis=-1, keep_dims=True)
z1 = tf.reduce_sum(ea1, axis=-1, keep_dims=True)
p0 = ea0 / z0
return U.sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
return tf.reduce_sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
def entropy(self):
a0 = self.logits - U.max(self.logits, axis=-1, keepdims=True)
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keep_dims=True)
ea0 = tf.exp(a0)
z0 = U.sum(ea0, axis=-1, keepdims=True)
z0 = tf.reduce_sum(ea0, axis=-1, keep_dims=True)
p0 = ea0 / z0
return U.sum(p0 * (tf.log(z0) - a0), axis=-1)
return tf.reduce_sum(p0 * (tf.log(z0) - a0), axis=-1)
def sample(self):
u = tf.random_uniform(tf.shape(self.logits))
return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
@@ -157,24 +169,21 @@ class CategoricalPd(Pd):
return cls(flat)
class MultiCategoricalPd(Pd):
def __init__(self, low, high, flat):
def __init__(self, nvec, flat):
self.flat = flat
self.low = tf.constant(low, dtype=tf.int32)
self.categoricals = list(map(CategoricalPd, tf.split(flat, high - low + 1, axis=len(flat.get_shape()) - 1)))
self.categoricals = list(map(CategoricalPd, tf.split(flat, nvec, axis=-1)))
def flatparam(self):
return self.flat
def mode(self):
return self.low + tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
return tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
def neglogp(self, x):
return tf.add_n([p.neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x - self.low, axis=len(x.get_shape()) - 1))])
return tf.add_n([p.neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x, axis=-1))])
def kl(self, other):
return tf.add_n([
p.kl(q) for p, q in zip(self.categoricals, other.categoricals)
])
return tf.add_n([p.kl(q) for p, q in zip(self.categoricals, other.categoricals)])
def entropy(self):
return tf.add_n([p.entropy() for p in self.categoricals])
def sample(self):
return self.low + tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32)
return tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32)
@classmethod
def fromflat(cls, flat):
raise NotImplementedError
@@ -191,14 +200,14 @@ class DiagGaussianPd(Pd):
def mode(self):
return self.mean
def neglogp(self, x):
return 0.5 * U.sum(tf.square((x - self.mean) / self.std), axis=-1) \
return 0.5 * tf.reduce_sum(tf.square((x - self.mean) / self.std), axis=-1) \
+ 0.5 * np.log(2.0 * np.pi) * tf.to_float(tf.shape(x)[-1]) \
+ U.sum(self.logstd, axis=-1)
+ tf.reduce_sum(self.logstd, axis=-1)
def kl(self, other):
assert isinstance(other, DiagGaussianPd)
return U.sum(other.logstd - self.logstd + (tf.square(self.std) + tf.square(self.mean - other.mean)) / (2.0 * tf.square(other.std)) - 0.5, axis=-1)
return tf.reduce_sum(other.logstd - self.logstd + (tf.square(self.std) + tf.square(self.mean - other.mean)) / (2.0 * tf.square(other.std)) - 0.5, axis=-1)
def entropy(self):
return U.sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
return tf.reduce_sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
def sample(self):
return self.mean + self.std * tf.random_normal(tf.shape(self.mean))
@classmethod
@@ -214,11 +223,11 @@ class BernoulliPd(Pd):
def mode(self):
return tf.round(self.ps)
def neglogp(self, x):
return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=-1)
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=-1)
def kl(self, other):
return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=-1) - U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=-1) - tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
def entropy(self):
return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
def sample(self):
u = tf.random_uniform(tf.shape(self.ps))
return tf.to_float(math_ops.less(u, self.ps))
@@ -234,7 +243,7 @@ def make_pdtype(ac_space):
elif isinstance(ac_space, spaces.Discrete):
return CategoricalPdType(ac_space.n)
elif isinstance(ac_space, spaces.MultiDiscrete):
return MultiCategoricalPdType(ac_space.low, ac_space.high)
return MultiCategoricalPdType(ac_space.nvec)
elif isinstance(ac_space, spaces.MultiBinary):
return BernoulliPdType(ac_space.n)
else:
@@ -259,6 +268,11 @@ def test_probtypes():
categorical = CategoricalPdType(pdparam_categorical.size) #pylint: disable=E1101
validate_probtype(categorical, pdparam_categorical)
nvec = [1,2,3]
pdparam_multicategorical = np.array([-.2, .3, .5, .1, 1, -.1])
multicategorical = MultiCategoricalPdType(nvec) #pylint: disable=E1101
validate_probtype(multicategorical, pdparam_multicategorical)
pdparam_bernoulli = np.array([-.2, .3, .5])
bernoulli = BernoulliPdType(pdparam_bernoulli.size) #pylint: disable=E1101
validate_probtype(bernoulli, pdparam_bernoulli)
@@ -270,10 +284,10 @@ def validate_probtype(probtype, pdparam):
Mval = np.repeat(pdparam[None, :], N, axis=0)
M = probtype.param_placeholder([N])
X = probtype.sample_placeholder([N])
pd = probtype.pdclass()(M)
pd = probtype.pdfromflat(M)
calcloglik = U.function([X, M], pd.logp(X))
calcent = U.function([M], pd.entropy())
Xval = U.eval(pd.sample(), feed_dict={M:Mval})
Xval = tf.get_default_session().run(pd.sample(), feed_dict={M:Mval})
logliks = calcloglik(Xval, Mval)
entval_ll = - logliks.mean() #pylint: disable=E1101
entval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
@@ -282,7 +296,7 @@ def validate_probtype(probtype, pdparam):
# Check to see if kldiv[p,q] = - ent[p] - E_p[log q]
M2 = probtype.param_placeholder([N])
pd2 = probtype.pdclass()(M2)
pd2 = probtype.pdfromflat(M2)
q = pdparam + np.random.randn(pdparam.size) * 0.1
Mval2 = np.repeat(q[None, :], N, axis=0)
calckl = U.function([M, M2], pd.kl(pd2))
@@ -291,3 +305,5 @@ def validate_probtype(probtype, pdparam):
klval_ll = - entval - logliks.mean() #pylint: disable=E1101
klval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
assert np.abs(klval - klval_ll) < 3 * klval_ll_stderr # within 3 sigmas
print('ok on', probtype, pdparam)
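The CategoricalPd entropy above uses the usual max-shifted softmax identity; a standalone NumPy check of the same expression (a hypothetical helper, not the baselines API):
import numpy as np

def softmax_entropy(logits):
    # H = sum_i p_i * (log z0 - a_i), with a = logits - max(logits), z0 = sum(exp(a))
    a0 = logits - logits.max(axis=-1, keepdims=True)
    ea0 = np.exp(a0)
    z0 = ea0.sum(axis=-1, keepdims=True)
    p0 = ea0 / z0
    return (p0 * (np.log(z0) - a0)).sum(axis=-1)

print(softmax_entropy(np.array([1.0, 1.0, 1.0])))   # ~1.0986 = log(3), the uniform case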


@@ -1,4 +1,4 @@
from baselines.acktr.running_stat import RunningStat
from .running_stat import RunningStat
from collections import deque
import numpy as np


@@ -0,0 +1,30 @@
from gym import Env
from gym.spaces import Discrete
class IdentityEnv(Env):
def __init__(
self,
dim,
ep_length=100,
):
self.action_space = Discrete(dim)
self.reset()
def reset(self):
self._choose_next_state()
self.observation_space = self.action_space
return self.state
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
return self.state, rew, False, {}
def _choose_next_state(self):
self.state = self.action_space.sample()
def _get_reward(self, actions):
return 1 if self.state == actions else 0
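A quick, hedged usage sketch of IdentityEnv: the reward is 1 only when the action echoes the current observation, so a perfect policy averages 1.0 reward per step (which is what the identity test further below relies on):
from baselines.common.identity_env import IdentityEnv

env = IdentityEnv(10)
obs = env.reset()
_, rew, done, _ = env.step(obs)   # act with the observation itself
print(rew, done)                  # 1 False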

baselines/common/input.py

@@ -0,0 +1,30 @@
import tensorflow as tf
from gym.spaces import Discrete, Box
def observation_input(ob_space, batch_size=None, name='Ob'):
'''
Build observation input with encoding depending on the
observation space type
Params:
ob_space: observation space (should be one of gym.spaces)
batch_size: batch size for input (default is None, so that resulting input placeholder can take tensors with any batch size)
name: tensorflow variable name for input placeholder
returns: tuple (input_placeholder, processed_input_tensor)
'''
if isinstance(ob_space, Discrete):
input_x = tf.placeholder(shape=(batch_size,), dtype=tf.int32, name=name)
processed_x = tf.to_float(tf.one_hot(input_x, ob_space.n))
return input_x, processed_x
elif isinstance(ob_space, Box):
input_shape = (batch_size,) + ob_space.shape
input_x = tf.placeholder(shape=input_shape, dtype=ob_space.dtype, name=name)
processed_x = tf.to_float(input_x)
return input_x, processed_x
else:
raise NotImplementedError
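Hedged usage sketch of observation_input with a Discrete space (TF1-style graph code, matching this repository; the session setup is illustrative):
import tensorflow as tf
from gym.spaces import Discrete
from baselines.common.input import observation_input

X, processed_X = observation_input(Discrete(4), batch_size=None, name='Ob')
with tf.Session() as sess:
    print(sess.run(processed_X, {X: [2]}))   # [[0. 0. 1. 0.]], one-hot cast to float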


@@ -224,6 +224,7 @@ def relatively_safe_pickle_dump(obj, path, compression=False):
# Using gzip here would be simpler, but the size is limited to 2GB
with tempfile.NamedTemporaryFile() as uncompressed_file:
pickle.dump(obj, uncompressed_file)
uncompressed_file.file.flush()
with zipfile.ZipFile(temp_storage, "w", compression=zipfile.ZIP_DEFLATED) as myzip:
myzip.write(uncompressed_file.name, "data")
else:


@@ -53,7 +53,7 @@ class MpiAdam(object):
def test_MpiAdam():
np.random.seed(0)
tf.set_random_seed(0)
a = tf.Variable(np.random.randn(3).astype('float32'))
b = tf.Variable(np.random.randn(2,5).astype('float32'))
loss = tf.reduce_sum(tf.square(a)) + tf.reduce_sum(tf.sin(b))


@@ -2,29 +2,42 @@ from mpi4py import MPI
import numpy as np
from baselines.common import zipsame
def mpi_moments(x, axis=0):
x = np.asarray(x, dtype='float64')
newshape = list(x.shape)
newshape.pop(axis)
n = np.prod(newshape,dtype=int)
totalvec = np.zeros(n*2+1, 'float64')
addvec = np.concatenate([x.sum(axis=axis).ravel(),
np.square(x).sum(axis=axis).ravel(),
np.array([x.shape[axis]],dtype='float64')])
MPI.COMM_WORLD.Allreduce(addvec, totalvec, op=MPI.SUM)
sum = totalvec[:n]
sumsq = totalvec[n:2*n]
count = totalvec[2*n]
if count == 0:
mean = np.empty(newshape); mean[:] = np.nan
std = np.empty(newshape); std[:] = np.nan
else:
mean = sum/count
std = np.sqrt(np.maximum(sumsq/count - np.square(mean),0))
def mpi_mean(x, axis=0, comm=None, keepdims=False):
x = np.asarray(x)
assert x.ndim > 0
if comm is None: comm = MPI.COMM_WORLD
xsum = x.sum(axis=axis, keepdims=keepdims)
n = xsum.size
localsum = np.zeros(n+1, x.dtype)
localsum[:n] = xsum.ravel()
localsum[n] = x.shape[axis]
globalsum = np.zeros_like(localsum)
comm.Allreduce(localsum, globalsum, op=MPI.SUM)
return globalsum[:n].reshape(xsum.shape) / globalsum[n], globalsum[n]
def mpi_moments(x, axis=0, comm=None, keepdims=False):
x = np.asarray(x)
assert x.ndim > 0
mean, count = mpi_mean(x, axis=axis, comm=comm, keepdims=True)
sqdiffs = np.square(x - mean)
meansqdiff, count1 = mpi_mean(sqdiffs, axis=axis, comm=comm, keepdims=True)
assert count1 == count
std = np.sqrt(meansqdiff)
if not keepdims:
newshape = mean.shape[:axis] + mean.shape[axis+1:]
mean = mean.reshape(newshape)
std = std.reshape(newshape)
return mean, std, count
def test_runningmeanstd():
import subprocess
subprocess.check_call(['mpirun', '-np', '3',
'python','-c',
'from baselines.common.mpi_moments import _helper_runningmeanstd; _helper_runningmeanstd()'])
def _helper_runningmeanstd():
comm = MPI.COMM_WORLD
np.random.seed(0)
for (triple,axis) in [
@@ -45,6 +58,3 @@ def test_runningmeanstd():
assert np.allclose(a1, a2)
print("ok!")
if __name__ == "__main__":
#mpirun -np 3 python <script>
test_runningmeanstd()
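mpi_mean and mpi_moments above compute an Allreduced mean and then the mean of squared deviations; the same two-pass computation can be sanity-checked locally without MPI:
import numpy as np

x = np.random.randn(100, 3)
mean = x.mean(axis=0)
std = np.sqrt(np.square(x - mean).mean(axis=0))   # population std, as in mpi_moments
assert np.allclose(std, x.std(axis=0))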


@@ -57,7 +57,7 @@ def test_runningmeanstd():
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = U.eval([rms.mean, rms.std])
ms2 = [rms.mean.eval(), rms.std.eval()]
assert np.allclose(ms1, ms2)
@@ -94,11 +94,11 @@ def test_dist():
assert checkallclose(
bigvec.mean(axis=0),
U.eval(rms.mean)
rms.mean.eval(),
)
assert checkallclose(
bigvec.std(axis=0),
U.eval(rms.std)
rms.std.eval(),
)


@@ -0,0 +1,18 @@
import numpy as np
from abc import ABC, abstractmethod
class AbstractEnvRunner(ABC):
def __init__(self, *, env, model, nsteps):
self.env = env
self.model = model
nenv = env.num_envs
self.batch_ob_shape = (nenv*nsteps,) + env.observation_space.shape
self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
self.obs[:] = env.reset()
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
@abstractmethod
def run(self):
raise NotImplementedError


@@ -0,0 +1,46 @@
import numpy as np
class RunningMeanStd(object):
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
def __init__(self, epsilon=1e-4, shape=()):
self.mean = np.zeros(shape, 'float64')
self.var = np.ones(shape, 'float64')
self.count = epsilon
def update(self, x):
batch_mean = np.mean(x, axis=0)
batch_var = np.var(x, axis=0)
batch_count = x.shape[0]
self.update_from_moments(batch_mean, batch_var, batch_count)
def update_from_moments(self, batch_mean, batch_var, batch_count):
delta = batch_mean - self.mean
tot_count = self.count + batch_count
new_mean = self.mean + delta * batch_count / tot_count
m_a = self.var * (self.count)
m_b = batch_var * (batch_count)
M2 = m_a + m_b + np.square(delta) * self.count * batch_count / (self.count + batch_count)
new_var = M2 / (self.count + batch_count)
new_count = batch_count + self.count
self.mean = new_mean
self.var = new_var
self.count = new_count
def test_runningmeanstd():
for (x1, x2, x3) in [
(np.random.randn(3), np.random.randn(4), np.random.randn(5)),
(np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
]:
rms = RunningMeanStd(epsilon=0.0, shape=x1.shape[1:])
x = np.concatenate([x1, x2, x3], axis=0)
ms1 = [x.mean(axis=0), x.var(axis=0)]
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = [rms.mean, rms.var]
assert np.allclose(ms1, ms2)
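
An illustrative check (not in the diff) that update_from_moments implements the parallel-variance combination correctly: merging two batches' moments matches the moments of the concatenated data.

import numpy as np
from baselines.common.running_mean_std import RunningMeanStd

rms = RunningMeanStd(epsilon=0.0, shape=(2,))
a = np.random.randn(100, 2)
b = np.random.randn(50, 2) + 1.0
# Feed only per-batch moments; the running estimate should match the pooled data.
rms.update_from_moments(a.mean(axis=0), a.var(axis=0), a.shape[0])
rms.update_from_moments(b.mean(axis=0), b.var(axis=0), b.shape[0])
both = np.concatenate([a, b], axis=0)
assert np.allclose(rms.mean, both.mean(axis=0))
assert np.allclose(rms.var, both.var(axis=0))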

View File

@@ -12,10 +12,9 @@ class SegmentTree(object):
a) setting item's value is slightly slower.
It is O(lg capacity) instead of O(1).
b) user has access to an efficient `reduce`
operation which reduces `operation` over
a contiguous subsequence of items in the
array.
b) user has access to an efficient ( O(log segment size) )
`reduce` operation which reduces `operation` over
a contiguous subsequence of items in the array.
Parameters
----------
@@ -23,8 +22,8 @@ class SegmentTree(object):
Total size of the array - must be a power of two.
operation: lambda obj, obj -> obj
an operation for combining elements (e.g. sum, max)
must for a mathematical group together with the set of
possible values for array elements.
must form a mathematical group together with the set of
possible values for array elements (i.e. be associative)
neutral_element: obj
neutral element for the operation above. eg. float('-inf')
for max and 0 for sum.
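
For context, a hedged usage sketch of the reduce operation this docstring describes; the SumSegmentTree subclass and the module path baselines.common.segment_tree are assumed from the repository layout.

from baselines.common.segment_tree import SumSegmentTree  # path assumed

tree = SumSegmentTree(capacity=4)          # capacity must be a power of two
tree[0], tree[1], tree[2] = 1.0, 2.0, 3.0  # setting an item is O(log capacity)
print(tree.sum())                          # 6.0: unset slots hold the neutral element 0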

View File

@@ -0,0 +1,44 @@
import pytest
import tensorflow as tf
import random
import numpy as np
from gym.spaces import np_random
from baselines.a2c import a2c
from baselines.ppo2 import ppo2
from baselines.common.identity_env import IdentityEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2.policies import MlpPolicy
learn_func_list = [
lambda e: a2c.learn(policy=MlpPolicy, env=e, seed=0, total_timesteps=50000),
lambda e: ppo2.learn(policy=MlpPolicy, env=e, total_timesteps=50000, lr=1e-3, nsteps=128, ent_coef=0.01)
]
@pytest.mark.slow
@pytest.mark.parametrize("learn_func", learn_func_list)
def test_identity(learn_func):
'''
Test if the algorithm (with a given policy)
can learn an identity transformation (i.e. return observation as an action)
'''
np.random.seed(0)
np_random.seed(0)
random.seed(0)
env = DummyVecEnv([lambda: IdentityEnv(10)])
with tf.Graph().as_default(), tf.Session().as_default():
tf.set_random_seed(0)
model = learn_func(env)
N_TRIALS = 1000
sum_rew = 0
obs = env.reset()
for i in range(N_TRIALS):
obs, rew, done, _ = env.step(model.step(obs)[0])
sum_rew += rew
assert sum_rew > 0.9 * N_TRIALS

View File

@@ -3,67 +3,38 @@ import tensorflow as tf
from baselines.common.tf_util import (
function,
initialize,
set_value,
single_threaded_session
)
def test_set_value():
a = tf.Variable(42.)
with single_threaded_session():
set_value(a, 5)
assert a.eval() == 5
g = tf.get_default_graph()
g.finalize()
set_value(a, 6)
assert a.eval() == 6
# test the test
try:
assert a.eval() == 7
except AssertionError:
pass
else:
assert False, "assertion should have failed"
def test_function():
tf.reset_default_graph()
x = tf.placeholder(tf.int32, (), name="x")
y = tf.placeholder(tf.int32, (), name="y")
z = 3 * x + 2 * y
lin = function([x, y], z, givens={y: 0})
with tf.Graph().as_default():
x = tf.placeholder(tf.int32, (), name="x")
y = tf.placeholder(tf.int32, (), name="y")
z = 3 * x + 2 * y
lin = function([x, y], z, givens={y: 0})
with single_threaded_session():
initialize()
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(x=3) == 9
assert lin(2, 2) == 10
assert lin(x=2, y=3) == 12
assert lin(2) == 6
assert lin(2, 2) == 10
def test_multikwargs():
tf.reset_default_graph()
x = tf.placeholder(tf.int32, (), name="x")
with tf.variable_scope("other"):
x2 = tf.placeholder(tf.int32, (), name="x")
z = 3 * x + 2 * x2
with tf.Graph().as_default():
x = tf.placeholder(tf.int32, (), name="x")
with tf.variable_scope("other"):
x2 = tf.placeholder(tf.int32, (), name="x")
z = 3 * x + 2 * x2
lin = function([x, x2], z, givens={x2: 0})
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(2, 2) == 10
expt_caught = False
try:
lin(x=2)
except AssertionError:
expt_caught = True
assert expt_caught
lin = function([x, x2], z, givens={x2: 0})
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(2, 2) == 10
if __name__ == '__main__':
test_set_value()
test_function()
test_multikwargs()

View File

@@ -1,45 +1,10 @@
import numpy as np
import tensorflow as tf # pylint: ignore-module
import builtins
import functools
import copy
import os
import functools
import collections
# ================================================================
# Make consistent with numpy
# ================================================================
clip = tf.clip_by_value
def sum(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_sum(x, axis=axis, keep_dims=keepdims)
def mean(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_mean(x, axis=axis, keep_dims=keepdims)
def var(x, axis=None, keepdims=False):
meanx = mean(x, axis=axis, keepdims=keepdims)
return mean(tf.square(x - meanx), axis=axis, keepdims=keepdims)
def std(x, axis=None, keepdims=False):
return tf.sqrt(var(x, axis=axis, keepdims=keepdims))
def max(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_max(x, axis=axis, keep_dims=keepdims)
def min(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_min(x, axis=axis, keep_dims=keepdims)
def concatenate(arrs, axis=0):
return tf.concat(axis=axis, values=arrs)
def argmax(x, axis=None):
return tf.argmax(x, axis=axis)
import multiprocessing
def switch(condition, then_expression, else_expression):
"""Switches between two operations depending on a scalar value (int or bool).
@@ -62,105 +27,11 @@ def switch(condition, then_expression, else_expression):
# Extras
# ================================================================
def l2loss(params):
if len(params) == 0:
return tf.constant(0.0)
else:
return tf.add_n([sum(tf.square(p)) for p in params])
def lrelu(x, leak=0.2):
f1 = 0.5 * (1 + leak)
f2 = 0.5 * (1 - leak)
return f1 * x + f2 * abs(x)
def categorical_sample_logits(X):
# https://github.com/tensorflow/tensorflow/issues/456
U = tf.random_uniform(tf.shape(X))
return argmax(X - tf.log(-tf.log(U)), axis=1)
# ================================================================
# Inputs
# ================================================================
def is_placeholder(x):
return type(x) is tf.Tensor and len(x.op.inputs) == 0
class TfInput(object):
def __init__(self, name="(unnamed)"):
"""Generalized Tensorflow placeholder. The main differences are:
- possibly uses multiple placeholders internally and returns multiple values
- can apply light postprocessing to the value feed to placeholder.
"""
self.name = name
def get(self):
"""Return the tf variable(s) representing the possibly postprocessed value
of placeholder(s).
"""
raise NotImplemented()
def make_feed_dict(data):
"""Given data input it to the placeholder(s)."""
raise NotImplemented()
class PlacholderTfInput(TfInput):
def __init__(self, placeholder):
"""Wrapper for regular tensorflow placeholder."""
super().__init__(placeholder.name)
self._placeholder = placeholder
def get(self):
return self._placeholder
def make_feed_dict(self, data):
return {self._placeholder: data}
class BatchInput(PlacholderTfInput):
def __init__(self, shape, dtype=tf.float32, name=None):
"""Creates a placeholder for a batch of tensors of a given shape and dtype
Parameters
----------
shape: [int]
shape of a single element of the batch
dtype: tf.dtype
number representation used for tensor contents
name: str
name of the underlying placeholder
"""
super().__init__(tf.placeholder(dtype, [None] + list(shape), name=name))
class Uint8Input(PlacholderTfInput):
def __init__(self, shape, name=None):
"""Takes input in uint8 format which is cast to float32 and divided by 255
before passing it to the model.
On GPU this ensures lower data transfer times.
Parameters
----------
shape: [int]
shape of the tensor.
name: str
name of the underlying placeholder
"""
super().__init__(tf.placeholder(tf.uint8, [None] + list(shape), name=name))
self._shape = shape
self._output = tf.cast(super().get(), tf.float32) / 255.0
def get(self):
return self._output
def ensure_tf_input(thing):
"""Takes either tf.placeholder of TfInput and outputs equivalent TfInput"""
if isinstance(thing, TfInput):
return thing
elif is_placeholder(thing):
return PlacholderTfInput(thing)
else:
raise ValueError("Must be a placeholder or TfInput")
# ================================================================
# Mathematical utils
# ================================================================
@@ -173,86 +44,49 @@ def huber_loss(x, delta=1.0):
delta * (tf.abs(x) - 0.5 * delta)
)
# ================================================================
# Optimizer utils
# ================================================================
def minimize_and_clip(optimizer, objective, var_list, clip_val=10):
"""Minimized `objective` using `optimizer` w.r.t. variables in
`var_list` while ensure the norm of the gradients for each
variable is clipped to `clip_val`
"""
gradients = optimizer.compute_gradients(objective, var_list=var_list)
for i, (grad, var) in enumerate(gradients):
if grad is not None:
gradients[i] = (tf.clip_by_norm(grad, clip_val), var)
return optimizer.apply_gradients(gradients)
# ================================================================
# Global session
# ================================================================
def get_session():
"""Returns recently made Tensorflow session"""
return tf.get_default_session()
def make_session(num_cpu):
def make_session(num_cpu=None, make_default=False, graph=None):
"""Returns a session that will use <num_cpu> CPU's only"""
if num_cpu is None:
num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
tf_config = tf.ConfigProto(
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
return tf.Session(config=tf_config)
if make_default:
return tf.InteractiveSession(config=tf_config, graph=graph)
else:
return tf.Session(config=tf_config, graph=graph)
def single_threaded_session():
"""Returns a session which will only use a single CPU"""
return make_session(1)
return make_session(num_cpu=1)
def in_session(f):
@functools.wraps(f)
def newfunc(*args, **kwargs):
with tf.Session():
f(*args, **kwargs)
return newfunc
ALREADY_INITIALIZED = set()
def initialize():
"""Initialize all the uninitialized variables in the global scope."""
new_variables = set(tf.global_variables()) - ALREADY_INITIALIZED
get_session().run(tf.variables_initializer(new_variables))
tf.get_default_session().run(tf.variables_initializer(new_variables))
ALREADY_INITIALIZED.update(new_variables)
def eval(expr, feed_dict=None):
if feed_dict is None:
feed_dict = {}
return get_session().run(expr, feed_dict=feed_dict)
VALUE_SETTERS = collections.OrderedDict()
def set_value(v, val):
global VALUE_SETTERS
if v in VALUE_SETTERS:
set_op, set_endpoint = VALUE_SETTERS[v]
else:
set_endpoint = tf.placeholder(v.dtype)
set_op = v.assign(set_endpoint)
VALUE_SETTERS[v] = (set_op, set_endpoint)
get_session().run(set_op, feed_dict={set_endpoint: val})
# ================================================================
# Saving variables
# ================================================================
def load_state(fname):
saver = tf.train.Saver()
saver.restore(get_session(), fname)
def save_state(fname):
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(get_session(), fname)
# ================================================================
# Model components
# ================================================================
def normc_initializer(std=1.0):
def normc_initializer(std=1.0, axis=0):
def _initializer(shape, dtype=None, partition_info=None): # pylint: disable=W0613
out = np.random.randn(*shape).astype(np.float32)
out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
out *= std / np.sqrt(np.square(out).sum(axis=axis, keepdims=True))
return tf.constant(out)
return _initializer
@@ -285,36 +119,6 @@ def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad="SAME",
return tf.nn.conv2d(x, w, stride_shape, pad) + b
def dense(x, size, name, weight_init=None, bias=True):
w = tf.get_variable(name + "/w", [x.get_shape()[1], size], initializer=weight_init)
ret = tf.matmul(x, w)
if bias:
b = tf.get_variable(name + "/b", [size], initializer=tf.zeros_initializer())
return ret + b
else:
return ret
def wndense(x, size, name, init_scale=1.0):
v = tf.get_variable(name + "/V", [int(x.get_shape()[1]), size],
initializer=tf.random_normal_initializer(0, 0.05))
g = tf.get_variable(name + "/g", [size], initializer=tf.constant_initializer(init_scale))
b = tf.get_variable(name + "/b", [size], initializer=tf.constant_initializer(0.0))
# use weight normalization (Salimans & Kingma, 2016)
x = tf.matmul(x, v)
scaler = g / tf.sqrt(sum(tf.square(v), axis=0, keepdims=True))
return tf.reshape(scaler, [1, size]) * x + tf.reshape(b, [1, size])
def densenobias(x, size, name, weight_init=None):
return dense(x, size, name, weight_init=weight_init, bias=False)
def dropout(x, pkeep, phase=None, mask=None):
mask = tf.floor(pkeep + tf.random_uniform(tf.shape(x))) if mask is None else mask
if phase is None:
return mask * x
else:
return switch(phase, mask * x, pkeep * x)
# ================================================================
# Theano-like Function
# ================================================================
@@ -344,7 +148,7 @@ def function(inputs, outputs, updates=None, givens=None):
Parameters
----------
inputs: [tf.placeholder or TfInput]
inputs: [tf.placeholder, tf.constant, or object with make_feed_dict method]
list of input arguments
outputs: [tf.Variable] or tf.Variable
list of outputs or a single output to be returned from function. Returned
@@ -359,183 +163,36 @@ def function(inputs, outputs, updates=None, givens=None):
f = _Function(inputs, [outputs], updates, givens=givens)
return lambda *args, **kwargs: f(*args, **kwargs)[0]
class _Function(object):
def __init__(self, inputs, outputs, updates, givens, check_nan=False):
def __init__(self, inputs, outputs, updates, givens):
for inpt in inputs:
if not issubclass(type(inpt), TfInput):
assert len(inpt.op.inputs) == 0, "inputs should all be placeholders of baselines.common.TfInput"
if not hasattr(inpt, 'make_feed_dict') and not (type(inpt) is tf.Tensor and len(inpt.op.inputs) == 0):
assert False, "inputs should all be placeholders, constants, or have a make_feed_dict method"
self.inputs = inputs
updates = updates or []
self.update_group = tf.group(*updates)
self.outputs_update = list(outputs) + [self.update_group]
self.givens = {} if givens is None else givens
self.check_nan = check_nan
def _feed_input(self, feed_dict, inpt, value):
if issubclass(type(inpt), TfInput):
if hasattr(inpt, 'make_feed_dict'):
feed_dict.update(inpt.make_feed_dict(value))
elif is_placeholder(inpt):
else:
feed_dict[inpt] = value
def __call__(self, *args, **kwargs):
def __call__(self, *args):
assert len(args) <= len(self.inputs), "Too many arguments provided"
feed_dict = {}
# Update the args
for inpt, value in zip(self.inputs, args):
self._feed_input(feed_dict, inpt, value)
# Update the kwargs
kwargs_passed_inpt_names = set()
for inpt in self.inputs[len(args):]:
inpt_name = inpt.name.split(':')[0]
inpt_name = inpt_name.split('/')[-1]
assert inpt_name not in kwargs_passed_inpt_names, \
"this function has two arguments with the same name \"{}\", so kwargs cannot be used.".format(inpt_name)
if inpt_name in kwargs:
kwargs_passed_inpt_names.add(inpt_name)
self._feed_input(feed_dict, inpt, kwargs.pop(inpt_name))
else:
assert inpt in self.givens, "Missing argument " + inpt_name
assert len(kwargs) == 0, "Function got extra arguments " + str(list(kwargs.keys()))
# Update feed dict with givens.
for inpt in self.givens:
feed_dict[inpt] = feed_dict.get(inpt, self.givens[inpt])
results = get_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
if self.check_nan:
if any(np.isnan(r).any() for r in results):
raise RuntimeError("Nan detected")
results = tf.get_default_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
return results
def mem_friendly_function(nondata_inputs, data_inputs, outputs, batch_size):
if isinstance(outputs, list):
return _MemFriendlyFunction(nondata_inputs, data_inputs, outputs, batch_size)
else:
f = _MemFriendlyFunction(nondata_inputs, data_inputs, [outputs], batch_size)
return lambda *inputs: f(*inputs)[0]
class _MemFriendlyFunction(object):
def __init__(self, nondata_inputs, data_inputs, outputs, batch_size):
self.nondata_inputs = nondata_inputs
self.data_inputs = data_inputs
self.outputs = list(outputs)
self.batch_size = batch_size
def __call__(self, *inputvals):
assert len(inputvals) == len(self.nondata_inputs) + len(self.data_inputs)
nondata_vals = inputvals[0:len(self.nondata_inputs)]
data_vals = inputvals[len(self.nondata_inputs):]
feed_dict = dict(zip(self.nondata_inputs, nondata_vals))
n = data_vals[0].shape[0]
for v in data_vals[1:]:
assert v.shape[0] == n
for i_start in range(0, n, self.batch_size):
slice_vals = [v[i_start:builtins.min(i_start + self.batch_size, n)] for v in data_vals]
for (var, val) in zip(self.data_inputs, slice_vals):
feed_dict[var] = val
results = tf.get_default_session().run(self.outputs, feed_dict=feed_dict)
if i_start == 0:
sum_results = results
else:
for i in range(len(results)):
sum_results[i] = sum_results[i] + results[i]
for i in range(len(results)):
sum_results[i] = sum_results[i] / n
return sum_results
# ================================================================
# Modules
# ================================================================
class Module(object):
def __init__(self, name):
self.name = name
self.first_time = True
self.scope = None
self.cache = {}
def __call__(self, *args):
if args in self.cache:
print("(%s) retrieving value from cache" % (self.name,))
return self.cache[args]
with tf.variable_scope(self.name, reuse=not self.first_time):
scope = tf.get_variable_scope().name
if self.first_time:
self.scope = scope
print("(%s) running function for the first time" % (self.name,))
else:
assert self.scope == scope, "Tried calling function with a different scope"
print("(%s) running function on new inputs" % (self.name,))
self.first_time = False
out = self._call(*args)
self.cache[args] = out
return out
def _call(self, *args):
raise NotImplementedError
@property
def trainable_variables(self):
assert self.scope is not None, "need to call module once before getting variables"
return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)
@property
def variables(self):
assert self.scope is not None, "need to call module once before getting variables"
return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, self.scope)
def module(name):
@functools.wraps
def wrapper(f):
class WrapperModule(Module):
def _call(self, *args):
return f(*args)
return WrapperModule(name)
return wrapper
# ================================================================
# Graph traversal
# ================================================================
VARIABLES = {}
def get_parents(node):
return node.op.inputs
def topsorted(outputs):
"""
Topological sort via non-recursive depth-first search
"""
assert isinstance(outputs, (list, tuple))
marks = {}
out = []
stack = [] # pylint: disable=W0621
# i: node
# jidx = number of children visited so far from that node
# marks: state of each node, which is one of
# 0: haven't visited
# 1: have visited, but not done visiting children
# 2: done visiting children
for x in outputs:
stack.append((x, 0))
while stack:
(i, jidx) = stack.pop()
if jidx == 0:
m = marks.get(i, 0)
if m == 0:
marks[i] = 1
elif m == 1:
raise ValueError("not a dag")
else:
continue
ps = get_parents(i)
if jidx == len(ps):
marks[i] = 2
out.append(i)
else:
stack.append((i, jidx + 1))
j = ps[jidx]
stack.append((j, 0))
return out
# ================================================================
# Flat vectors
# ================================================================
@@ -577,88 +234,14 @@ class SetFromFlat(object):
self.op = tf.group(*assigns)
def __call__(self, theta):
get_session().run(self.op, feed_dict={self.theta: theta})
tf.get_default_session().run(self.op, feed_dict={self.theta: theta})
class GetFlat(object):
def __init__(self, var_list):
self.op = tf.concat(axis=0, values=[tf.reshape(v, [numel(v)]) for v in var_list])
def __call__(self):
return get_session().run(self.op)
# ================================================================
# Misc
# ================================================================
def fancy_slice_2d(X, inds0, inds1):
"""
like numpy X[inds0, inds1]
XXX this implementation is bad
"""
inds0 = tf.cast(inds0, tf.int64)
inds1 = tf.cast(inds1, tf.int64)
shape = tf.cast(tf.shape(X), tf.int64)
ncols = shape[1]
Xflat = tf.reshape(X, [-1])
return tf.gather(Xflat, inds0 * ncols + inds1)
# ================================================================
# Scopes
# ================================================================
def scope_vars(scope, trainable_only=False):
"""
Get variables inside a scope
The scope can be specified as a string
Parameters
----------
scope: str or VariableScope
scope in which the variables reside.
trainable_only: bool
whether or not to return only the variables that were marked as trainable.
Returns
-------
vars: [tf.Variable]
list of variables in `scope`.
"""
return tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.GLOBAL_VARIABLES,
scope=scope if isinstance(scope, str) else scope.name
)
def scope_name():
"""Returns the name of current scope as a string, e.g. deepq/q_func"""
return tf.get_variable_scope().name
def absolute_scope_name(relative_scope_name):
"""Appends parent scope name to `relative_scope_name`"""
return scope_name() + "/" + relative_scope_name
def lengths_to_mask(lengths_b, max_length):
"""
Turns a vector of lengths into a boolean mask
Args:
lengths_b: an integer vector of lengths
max_length: maximum length to fill the mask
Returns:
a boolean array of shape (batch_size, max_length)
row[i] consists of True repeated lengths_b[i] times, followed by False
"""
lengths_b = tf.convert_to_tensor(lengths_b)
assert lengths_b.get_shape().ndims == 1
mask_bt = tf.expand_dims(tf.range(max_length), 0) < tf.expand_dims(lengths_b, 1)
return mask_bt
def in_session(f):
@functools.wraps(f)
def newfunc(*args, **kwargs):
with tf.Session():
f(*args, **kwargs)
return newfunc
return tf.get_default_session().run(self.op)
_PLACEHOLDER_CACHE = {} # name -> (placeholder, dtype, shape)
@@ -678,9 +261,44 @@ def get_placeholder_cached(name):
def flattenallbut0(x):
return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
def reset():
global _PLACEHOLDER_CACHE
global VARIABLES
_PLACEHOLDER_CACHE = {}
VARIABLES = {}
tf.reset_default_graph()
# ================================================================
# Diagnostics
# ================================================================
def display_var_info(vars):
from baselines import logger
count_params = 0
for v in vars:
name = v.name
if "/Adam" in name or "beta1_power" in name or "beta2_power" in name: continue
v_params = np.prod(v.shape.as_list())
count_params += v_params
if "/b:" in name or "/biases" in name: continue # Wx+b, bias is not interesting to look at => count params, but not print
logger.info(" %s%s %i params %s" % (name, " "*(55-len(name)), v_params, str(v.shape)))
logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))
def get_available_gpus():
# recipe from here:
# https://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
return [x.name for x in local_device_protos if x.device_type == 'GPU']
# ================================================================
# Saving variables
# ================================================================
def load_state(fname):
saver = tf.train.Saver()
saver.restore(tf.get_default_session(), fname)
def save_state(fname):
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(tf.get_default_session(), fname)
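
A minimal sketch (not in the diff) of the slimmed-down tf_util API after this change, using only helpers kept above (function, single_threaded_session, initialize); TF1-style graph mode is assumed.

import tensorflow as tf
from baselines.common import tf_util as U

with tf.Graph().as_default():
    x = tf.placeholder(tf.float32, (), name="x")
    y = 2.0 * x + 1.0
    f = U.function([x], y)          # Theano-like callable backed by the default session
    with U.single_threaded_session():
        U.initialize()
        print(f(3.0))               # 7.0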

View File

@@ -0,0 +1,23 @@
import numpy as np
def tile_images(img_nhwc):
"""
Tile N images into one big PxQ image
(P,Q) are chosen to be as close as possible, and if N
is square, then P=Q.
input: img_nhwc, list or array of images, ndim=4 once turned into array
n = batch index, h = height, w = width, c = channel
returns:
bigim_HWc, ndarray with ndim=3
"""
img_nhwc = np.asarray(img_nhwc)
N, h, w, c = img_nhwc.shape
H = int(np.ceil(np.sqrt(N)))
W = int(np.ceil(float(N)/H))
img_nhwc = np.array(list(img_nhwc) + [img_nhwc[0]*0 for _ in range(N, H*W)])
img_HWhwc = img_nhwc.reshape(H, W, h, w, c)
img_HhWwc = img_HWhwc.transpose(0, 2, 1, 3, 4)
img_Hh_Ww_c = img_HhWwc.reshape(H*h, W*w, c)
return img_Hh_Ww_c
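
An illustrative call (not in the diff): seven 64x64 RGB frames are padded with blank images and tiled into a 3x3 grid.

import numpy as np
from baselines.common.tile_images import tile_images

frames = np.random.randint(0, 255, size=(7, 64, 64, 3), dtype=np.uint8)
big = tile_images(frames)
print(big.shape)   # (192, 192, 3): H = ceil(sqrt(7)) = 3 rows, W = ceil(7/3) = 3 columns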

View File

@@ -1,19 +1,126 @@
class VecEnv(object):
"""
Vectorized environment base class
"""
def step(self, vac):
"""
Apply sequence of actions to sequence of environments
actions -> (observations, rewards, news)
from abc import ABC, abstractmethod
from baselines import logger
where 'news' is a boolean vector indicating whether each element is new.
"""
raise NotImplementedError
class AlreadySteppingError(Exception):
"""
Raised when an asynchronous step is running while
step_async() is called again.
"""
def __init__(self):
msg = 'already running an async step'
Exception.__init__(self, msg)
class NotSteppingError(Exception):
"""
Raised when an asynchronous step is not running but
step_wait() is called.
"""
def __init__(self):
msg = 'not running an async step'
Exception.__init__(self, msg)
class VecEnv(ABC):
"""
An abstract asynchronous, vectorized environment.
"""
def __init__(self, num_envs, observation_space, action_space):
self.num_envs = num_envs
self.observation_space = observation_space
self.action_space = action_space
@abstractmethod
def reset(self):
"""
Reset all environments
Reset all the environments and return an array of
observations, or a tuple of observation arrays.
If step_async is still doing work, that work will
be cancelled and step_wait() should not be called
until step_async() is invoked again.
"""
raise NotImplementedError
pass
@abstractmethod
def step_async(self, actions):
"""
Tell all the environments to start taking a step
with the given actions.
Call step_wait() to get the results of the step.
You should not call this if a step_async run is
already pending.
"""
pass
@abstractmethod
def step_wait(self):
"""
Wait for the step taken with step_async().
Returns (obs, rews, dones, infos):
- obs: an array of observations, or a tuple of
arrays of observations.
- rews: an array of rewards
- dones: an array of "episode done" booleans
- infos: a sequence of info objects
"""
pass
@abstractmethod
def close(self):
pass
"""
Clean up the environments' resources.
"""
pass
def step(self, actions):
self.step_async(actions)
return self.step_wait()
def render(self, mode='human'):
logger.warn('Render not defined for %s'%self)
@property
def unwrapped(self):
if isinstance(self, VecEnvWrapper):
return self.venv.unwrapped
else:
return self
class VecEnvWrapper(VecEnv):
def __init__(self, venv, observation_space=None, action_space=None):
self.venv = venv
VecEnv.__init__(self,
num_envs=venv.num_envs,
observation_space=observation_space or venv.observation_space,
action_space=action_space or venv.action_space)
def step_async(self, actions):
self.venv.step_async(actions)
@abstractmethod
def reset(self):
pass
@abstractmethod
def step_wait(self):
pass
def close(self):
return self.venv.close()
def render(self):
self.venv.render()
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)

View File

@@ -0,0 +1,67 @@
import numpy as np
from gym import spaces
from collections import OrderedDict
from . import VecEnv
class DummyVecEnv(VecEnv):
def __init__(self, env_fns):
self.envs = [fn() for fn in env_fns]
env = self.envs[0]
VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
shapes, dtypes = {}, {}
self.keys = []
obs_space = env.observation_space
if isinstance(obs_space, spaces.Dict):
assert isinstance(obs_space.spaces, OrderedDict)
subspaces = obs_space.spaces
else:
subspaces = {None: obs_space}
for key, box in subspaces.items():
shapes[key] = box.shape
dtypes[key] = box.dtype
self.keys.append(key)
self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool)
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
self.buf_infos = [{} for _ in range(self.num_envs)]
self.actions = None
def step_async(self, actions):
self.actions = actions
def step_wait(self):
for e in range(self.num_envs):
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(self.actions[e])
if self.buf_dones[e]:
obs = self.envs[e].reset()
self._save_obs(e, obs)
return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones),
self.buf_infos.copy())
def reset(self):
for e in range(self.num_envs):
obs = self.envs[e].reset()
self._save_obs(e, obs)
return self._obs_from_buf()
def close(self):
return
def render(self, mode='human'):
return [e.render(mode=mode) for e in self.envs]
def _save_obs(self, e, obs):
for k in self.keys:
if k is None:
self.buf_obs[k][e] = obs
else:
self.buf_obs[k][e] = obs[k]
def _obs_from_buf(self):
if self.keys==[None]:
return self.buf_obs[None]
else:
return self.buf_obs
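
A minimal usage sketch for DummyVecEnv itself, assuming a gym installation with CartPole-v0 available; observations come back with a leading batch dimension even for a single environment.

import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = DummyVecEnv([lambda: gym.make('CartPole-v0')])
obs = venv.reset()                          # shape (1, 4): batch dimension first
obs, rews, dones, infos = venv.step([venv.action_space.sample()])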

View File

@@ -1,6 +1,7 @@
import numpy as np
from multiprocessing import Process, Pipe
from baselines.common.vec_env import VecEnv
from baselines.common.vec_env import VecEnv, CloudpickleWrapper
from baselines.common.tile_images import tile_images
def worker(remote, parent_remote, env_fn_wrapper):
@@ -16,37 +17,23 @@ def worker(remote, parent_remote, env_fn_wrapper):
elif cmd == 'reset':
ob = env.reset()
remote.send(ob)
elif cmd == 'reset_task':
ob = env.reset_task()
remote.send(ob)
elif cmd == 'render':
remote.send(env.render(mode='rgb_array'))
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces':
remote.send((env.action_space, env.observation_space))
remote.send((env.observation_space, env.action_space))
else:
raise NotImplementedError
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)
class SubprocVecEnv(VecEnv):
def __init__(self, env_fns):
def __init__(self, env_fns, spaces=None):
"""
envs: list of gym environments to run in subprocesses
"""
self.waiting = False
self.closed = False
nenvs = len(env_fns)
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
@@ -59,13 +46,17 @@ class SubprocVecEnv(VecEnv):
remote.close()
self.remotes[0].send(('get_spaces', None))
self.action_space, self.observation_space = self.remotes[0].recv()
observation_space, action_space = self.remotes[0].recv()
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
def step(self, actions):
def step_async(self, actions):
for remote, action in zip(self.remotes, actions):
remote.send(('step', action))
self.waiting = True
def step_wait(self):
results = [remote.recv() for remote in self.remotes]
self.waiting = False
obs, rews, dones, infos = zip(*results)
return np.stack(obs), np.stack(rews), np.stack(dones), infos
@@ -82,13 +73,25 @@ class SubprocVecEnv(VecEnv):
def close(self):
if self.closed:
return
if self.waiting:
for remote in self.remotes:
remote.recv()
for remote in self.remotes:
remote.send(('close', None))
for p in self.ps:
p.join()
self.closed = True
@property
def num_envs(self):
return len(self.remotes)
def render(self, mode='human'):
for pipe in self.remotes:
pipe.send(('render', None))
imgs = [pipe.recv() for pipe in self.remotes]
bigimg = tile_images(imgs)
if mode == 'human':
import cv2
cv2.imshow('vecenv', bigimg[:,:,::-1])
cv2.waitKey(1)
elif mode == 'rgb_array':
return bigimg
else:
raise NotImplementedError

View File

@@ -0,0 +1,38 @@
from baselines.common.vec_env import VecEnvWrapper
import numpy as np
from gym import spaces
class VecFrameStack(VecEnvWrapper):
"""
Vectorized environment wrapper that stacks the last nstack observations along the last axis
"""
def __init__(self, venv, nstack):
self.venv = venv
self.nstack = nstack
wos = venv.observation_space # wrapped ob space
low = np.repeat(wos.low, self.nstack, axis=-1)
high = np.repeat(wos.high, self.nstack, axis=-1)
self.stackedobs = np.zeros((venv.num_envs,)+low.shape, low.dtype)
observation_space = spaces.Box(low=low, high=high, dtype=venv.observation_space.dtype)
VecEnvWrapper.__init__(self, venv, observation_space=observation_space)
def step_wait(self):
obs, rews, news, infos = self.venv.step_wait()
self.stackedobs = np.roll(self.stackedobs, shift=-1, axis=-1)
for (i, new) in enumerate(news):
if new:
self.stackedobs[i] = 0
self.stackedobs[..., -obs.shape[-1]:] = obs
return self.stackedobs, rews, news, infos
def reset(self):
"""
Reset all environments
"""
obs = self.venv.reset()
self.stackedobs[...] = 0
self.stackedobs[..., -obs.shape[-1]:] = obs
return self.stackedobs
def close(self):
self.venv.close()

View File

@@ -0,0 +1,47 @@
from baselines.common.vec_env import VecEnvWrapper
from baselines.common.running_mean_std import RunningMeanStd
import numpy as np
class VecNormalize(VecEnvWrapper):
"""
Vectorized environment wrapper that normalizes observations and returns using running statistics
"""
def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
VecEnvWrapper.__init__(self, venv)
self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
self.ret_rms = RunningMeanStd(shape=()) if ret else None
self.clipob = clipob
self.cliprew = cliprew
self.ret = np.zeros(self.num_envs)
self.gamma = gamma
self.epsilon = epsilon
def step_wait(self):
"""
Apply sequence of actions to sequence of environments
actions -> (observations, rewards, news)
where 'news' is a boolean vector indicating whether each element is new.
"""
obs, rews, news, infos = self.venv.step_wait()
self.ret = self.ret * self.gamma + rews
obs = self._obfilt(obs)
if self.ret_rms:
self.ret_rms.update(self.ret)
rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
return obs, rews, news, infos
def _obfilt(self, obs):
if self.ob_rms:
self.ob_rms.update(obs)
obs = np.clip((obs - self.ob_rms.mean) / np.sqrt(self.ob_rms.var + self.epsilon), -self.clipob, self.clipob)
return obs
else:
return obs
def reset(self):
"""
Reset all environments
"""
obs = self.venv.reset()
return self._obfilt(obs)
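
A hedged usage sketch of the normalization wrapper (module path assumed): observations and rewards are scaled by running statistics and clipped.

import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.vec_normalize import VecNormalize  # path assumed

venv = VecNormalize(DummyVecEnv([lambda: gym.make('CartPole-v0')]), clipob=10., cliprew=10.)
obs = venv.reset()                          # normalized and clipped to [-10, 10]
obs, rews, dones, infos = venv.step([venv.action_space.sample()])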

View File

@@ -9,8 +9,7 @@ from baselines import logger
from baselines.common.mpi_adam import MpiAdam
import baselines.common.tf_util as U
from baselines.common.mpi_running_mean_std import RunningMeanStd
from baselines.ddpg.util import reduce_std, mpi_mean
from mpi4py import MPI
def normalize(x, stats):
if stats is None:
@@ -23,6 +22,13 @@ def denormalize(x, stats):
return x
return x * stats.std + stats.mean
def reduce_std(x, axis=None, keepdims=False):
return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
def reduce_var(x, axis=None, keepdims=False):
m = tf.reduce_mean(x, axis=axis, keep_dims=True)
devs_squared = tf.square(x - m)
return tf.reduce_mean(devs_squared, axis=axis, keep_dims=keepdims)
def get_target_updates(vars, target_vars, tau):
logger.info('setting up target updates ...')
@@ -198,7 +204,7 @@ class DDPG(object):
new_std = self.ret_rms.std
self.old_mean = tf.placeholder(tf.float32, shape=[1], name='old_mean')
new_mean = self.ret_rms.mean
self.renormalize_Q_outputs_op = []
for vs in [self.critic.output_vars, self.target_critic.output_vars]:
assert len(vs) == 2
@@ -213,15 +219,15 @@ class DDPG(object):
def setup_stats(self):
ops = []
names = []
if self.normalize_returns:
ops += [self.ret_rms.mean, self.ret_rms.std]
names += ['ret_rms_mean', 'ret_rms_std']
if self.normalize_observations:
ops += [tf.reduce_mean(self.obs_rms.mean), tf.reduce_mean(self.obs_rms.std)]
names += ['obs_rms_mean', 'obs_rms_std']
ops += [tf.reduce_mean(self.critic_tf)]
names += ['reference_Q_mean']
ops += [reduce_std(self.critic_tf)]
@@ -231,7 +237,7 @@ class DDPG(object):
names += ['reference_actor_Q_mean']
ops += [reduce_std(self.critic_with_actor_tf)]
names += ['reference_actor_Q_std']
ops += [tf.reduce_mean(self.actor_tf)]
names += ['reference_action_mean']
ops += [reduce_std(self.actor_tf)]
@@ -347,7 +353,7 @@ class DDPG(object):
def adapt_param_noise(self):
if self.param_noise is None:
return 0.
# Perturb a separate copy of the policy to adjust the scale for the next "real" perturbation.
batch = self.memory.sample(batch_size=self.batch_size)
self.sess.run(self.perturb_adaptive_policy_ops, feed_dict={
@@ -358,7 +364,7 @@ class DDPG(object):
self.param_noise_stddev: self.param_noise.current_stddev,
})
mean_distance = mpi_mean(distance)
mean_distance = MPI.COMM_WORLD.allreduce(distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
self.param_noise.adapt(mean_distance)
return mean_distance
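
The replacement line above averages the adaptation distance across MPI ranks directly instead of going through the removed ddpg.util helpers; a standalone sketch of that allreduce-then-divide pattern (run under mpirun):

from mpi4py import MPI

comm = MPI.COMM_WORLD
local_value = float(comm.rank)              # stand-in for the per-rank distance
mean_value = comm.allreduce(local_value, op=MPI.SUM) / comm.Get_size()
print(mean_value)                           # same averaged value on every rank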

View File

@@ -25,7 +25,6 @@ def run(env_id, seed, noise_type, layer_norm, evaluation, **kwargs):
# Create envs.
env = gym.make(env_id)
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
gym.logger.setLevel(logging.WARN)
if evaluation and rank==0:
eval_env = gym.make(env_id)

View File

@@ -4,7 +4,6 @@ from collections import deque
import pickle
from baselines.ddpg.ddpg import DDPG
from baselines.ddpg.util import mpi_mean, mpi_std, mpi_max, mpi_sum
import baselines.common.tf_util as U
from baselines import logger
@@ -35,7 +34,7 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
saver = tf.train.Saver()
else:
saver = None
step = 0
episode = 0
eval_episode_rewards_history = deque(maxlen=100)
@@ -110,7 +109,7 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
epoch_adaptive_distances = []
for t_train in range(nb_train_steps):
# Adapt param noise, if necessary.
if memory.nb_entries >= batch_size and t % param_noise_adaption_interval == 0:
if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
distance = agent.adapt_param_noise()
epoch_adaptive_distances.append(distance)
@@ -138,42 +137,46 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
eval_episode_rewards_history.append(eval_episode_reward)
eval_episode_reward = 0.
mpi_size = MPI.COMM_WORLD.Get_size()
# Log stats.
epoch_train_duration = time.time() - epoch_start_time
# XXX shouldn't call np.mean on variable length lists
duration = time.time() - start_time
stats = agent.get_stats()
combined_stats = {}
for key in sorted(stats.keys()):
combined_stats[key] = mpi_mean(stats[key])
# Rollout statistics.
combined_stats['rollout/return'] = mpi_mean(epoch_episode_rewards)
combined_stats['rollout/return_history'] = mpi_mean(np.mean(episode_rewards_history))
combined_stats['rollout/episode_steps'] = mpi_mean(epoch_episode_steps)
combined_stats['rollout/episodes'] = mpi_sum(epoch_episodes)
combined_stats['rollout/actions_mean'] = mpi_mean(epoch_actions)
combined_stats['rollout/actions_std'] = mpi_std(epoch_actions)
combined_stats['rollout/Q_mean'] = mpi_mean(epoch_qs)
# Train statistics.
combined_stats['train/loss_actor'] = mpi_mean(epoch_actor_losses)
combined_stats['train/loss_critic'] = mpi_mean(epoch_critic_losses)
combined_stats['train/param_noise_distance'] = mpi_mean(epoch_adaptive_distances)
combined_stats = stats.copy()
combined_stats['rollout/return'] = np.mean(epoch_episode_rewards)
combined_stats['rollout/return_history'] = np.mean(episode_rewards_history)
combined_stats['rollout/episode_steps'] = np.mean(epoch_episode_steps)
combined_stats['rollout/actions_mean'] = np.mean(epoch_actions)
combined_stats['rollout/Q_mean'] = np.mean(epoch_qs)
combined_stats['train/loss_actor'] = np.mean(epoch_actor_losses)
combined_stats['train/loss_critic'] = np.mean(epoch_critic_losses)
combined_stats['train/param_noise_distance'] = np.mean(epoch_adaptive_distances)
combined_stats['total/duration'] = duration
combined_stats['total/steps_per_second'] = float(t) / float(duration)
combined_stats['total/episodes'] = episodes
combined_stats['rollout/episodes'] = epoch_episodes
combined_stats['rollout/actions_std'] = np.std(epoch_actions)
# Evaluation statistics.
if eval_env is not None:
combined_stats['eval/return'] = mpi_mean(eval_episode_rewards)
combined_stats['eval/return_history'] = mpi_mean(np.mean(eval_episode_rewards_history))
combined_stats['eval/Q'] = mpi_mean(eval_qs)
combined_stats['eval/episodes'] = mpi_mean(len(eval_episode_rewards))
combined_stats['eval/return'] = eval_episode_rewards
combined_stats['eval/return_history'] = np.mean(eval_episode_rewards_history)
combined_stats['eval/Q'] = eval_qs
combined_stats['eval/episodes'] = len(eval_episode_rewards)
def as_scalar(x):
if isinstance(x, np.ndarray):
assert x.size == 1
return x[0]
elif np.isscalar(x):
return x
else:
raise ValueError('expected scalar, got %s'%x)
combined_stats_sums = MPI.COMM_WORLD.allreduce(np.array([as_scalar(x) for x in combined_stats.values()]))
combined_stats = {k : v / mpi_size for (k,v) in zip(combined_stats.keys(), combined_stats_sums)}
# Total statistics.
combined_stats['total/duration'] = mpi_mean(duration)
combined_stats['total/steps_per_second'] = mpi_mean(float(t) / float(duration))
combined_stats['total/episodes'] = mpi_mean(episodes)
combined_stats['total/epochs'] = epoch + 1
combined_stats['total/steps'] = t
for key in sorted(combined_stats.keys()):
logger.record_tabular(key, combined_stats[key])
logger.dump_tabular()
@@ -186,4 +189,3 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
if eval_env and hasattr(eval_env, 'get_state'):
with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
pickle.dump(eval_env.get_state(), f)

View File

@@ -1,44 +0,0 @@
import numpy as np
import tensorflow as tf
from mpi4py import MPI
from baselines.common.mpi_moments import mpi_moments
def reduce_var(x, axis=None, keepdims=False):
m = tf.reduce_mean(x, axis=axis, keep_dims=True)
devs_squared = tf.square(x - m)
return tf.reduce_mean(devs_squared, axis=axis, keep_dims=keepdims)
def reduce_std(x, axis=None, keepdims=False):
return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
def mpi_mean(value):
if value == []:
value = [0.]
if not isinstance(value, list):
value = [value]
return mpi_moments(np.array(value))[0][0]
def mpi_std(value):
if value == []:
value = [0.]
if not isinstance(value, list):
value = [value]
return mpi_moments(np.array(value))[1][0]
def mpi_max(value):
global_max = np.zeros(1, dtype='float64')
local_max = np.max(value).astype('float64')
MPI.COMM_WORLD.Reduce(local_max, global_max, op=MPI.MAX)
return global_max[0]
def mpi_sum(value):
global_sum = np.zeros(1, dtype='float64')
local_sum = np.sum(np.array(value)).astype('float64')
MPI.COMM_WORLD.Reduce(local_sum, global_sum, op=MPI.SUM)
return global_sum[0]

View File

@@ -97,6 +97,37 @@ import tensorflow as tf
import baselines.common.tf_util as U
def scope_vars(scope, trainable_only=False):
"""
Get variables inside a scope
The scope can be specified as a string
Parameters
----------
scope: str or VariableScope
scope in which the variables reside.
trainable_only: bool
whether or not to return only the variables that were marked as trainable.
Returns
-------
vars: [tf.Variable]
list of variables in `scope`.
"""
return tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.GLOBAL_VARIABLES,
scope=scope if isinstance(scope, str) else scope.name
)
def scope_name():
"""Returns the name of current scope as a string, e.g. deepq/q_func"""
return tf.get_variable_scope().name
def absolute_scope_name(relative_scope_name):
"""Appends parent scope name to `relative_scope_name`"""
return scope_name() + "/" + relative_scope_name
def default_param_noise_filter(var):
if var not in tf.trainable_variables():
# We never perturb non-trainable vars.
@@ -143,7 +174,7 @@ def build_act(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None):
See the top of the file for details.
"""
with tf.variable_scope(scope, reuse=reuse):
observations_ph = U.ensure_tf_input(make_obs_ph("observation"))
observations_ph = make_obs_ph("observation")
stochastic_ph = tf.placeholder(tf.bool, (), name="stochastic")
update_eps_ph = tf.placeholder(tf.float32, (), name="update_eps")
@@ -159,10 +190,12 @@ def build_act(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None):
output_actions = tf.cond(stochastic_ph, lambda: stochastic_actions, lambda: deterministic_actions)
update_eps_expr = eps.assign(tf.cond(update_eps_ph >= 0, lambda: update_eps_ph, lambda: eps))
act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph],
_act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph],
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True},
updates=[update_eps_expr])
def act(ob, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps)
return act
@@ -203,7 +236,7 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
param_noise_filter_func = default_param_noise_filter
with tf.variable_scope(scope, reuse=reuse):
observations_ph = U.ensure_tf_input(make_obs_ph("observation"))
observations_ph = make_obs_ph("observation")
stochastic_ph = tf.placeholder(tf.bool, (), name="stochastic")
update_eps_ph = tf.placeholder(tf.float32, (), name="update_eps")
update_param_noise_threshold_ph = tf.placeholder(tf.float32, (), name="update_param_noise_threshold")
@@ -223,8 +256,8 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
# https://stackoverflow.com/questions/37063952/confused-by-the-behavior-of-tf-cond for
# a more detailed discussion.
def perturb_vars(original_scope, perturbed_scope):
all_vars = U.scope_vars(U.absolute_scope_name("q_func"))
all_perturbed_vars = U.scope_vars(U.absolute_scope_name("perturbed_q_func"))
all_vars = scope_vars(absolute_scope_name(original_scope))
all_perturbed_vars = scope_vars(absolute_scope_name(perturbed_scope))
assert len(all_vars) == len(all_perturbed_vars)
perturb_ops = []
for var, perturbed_var in zip(all_vars, all_perturbed_vars):
@@ -272,10 +305,12 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
tf.cond(update_param_noise_scale_ph, lambda: update_scale(), lambda: tf.Variable(0., trainable=False)),
update_param_noise_threshold_expr,
]
act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph, reset_ph, update_param_noise_threshold_ph, update_param_noise_scale_ph],
_act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph, reset_ph, update_param_noise_threshold_ph, update_param_noise_scale_ph],
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True, reset_ph: False, update_param_noise_threshold_ph: False, update_param_noise_scale_ph: False},
updates=updates)
def act(ob, reset, update_param_noise_threshold, update_param_noise_scale, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps, reset, update_param_noise_threshold, update_param_noise_scale)
return act
@@ -342,20 +377,20 @@ def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=
with tf.variable_scope(scope, reuse=reuse):
# set up placeholders
obs_t_input = U.ensure_tf_input(make_obs_ph("obs_t"))
obs_t_input = make_obs_ph("obs_t")
act_t_ph = tf.placeholder(tf.int32, [None], name="action")
rew_t_ph = tf.placeholder(tf.float32, [None], name="reward")
obs_tp1_input = U.ensure_tf_input(make_obs_ph("obs_tp1"))
obs_tp1_input = make_obs_ph("obs_tp1")
done_mask_ph = tf.placeholder(tf.float32, [None], name="done")
importance_weights_ph = tf.placeholder(tf.float32, [None], name="weight")
# q network evaluation
q_t = q_func(obs_t_input.get(), num_actions, scope="q_func", reuse=True) # reuse parameters from act
q_func_vars = U.scope_vars(U.absolute_scope_name("q_func"))
q_func_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=tf.get_variable_scope().name + "/q_func")
# target q network evaluation
q_tp1 = q_func(obs_tp1_input.get(), num_actions, scope="target_q_func")
target_q_func_vars = U.scope_vars(U.absolute_scope_name("target_q_func"))
target_q_func_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=tf.get_variable_scope().name + "/target_q_func")
# q scores for actions which we know were selected in the given state.
q_t_selected = tf.reduce_sum(q_t * tf.one_hot(act_t_ph, num_actions), 1)
@@ -363,7 +398,7 @@ def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=
# compute estimate of best possible value starting from state at t + 1
if double_q:
q_tp1_using_online_net = q_func(obs_tp1_input.get(), num_actions, scope="q_func", reuse=True)
q_tp1_best_using_online_net = tf.arg_max(q_tp1_using_online_net, 1)
q_tp1_best_using_online_net = tf.argmax(q_tp1_using_online_net, 1)
q_tp1_best = tf.reduce_sum(q_tp1 * tf.one_hot(q_tp1_best_using_online_net, num_actions), 1)
else:
q_tp1_best = tf.reduce_max(q_tp1, 1)
@@ -379,10 +414,11 @@ def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=
# compute optimization op (potentially with gradient clipping)
if grad_norm_clipping is not None:
optimize_expr = U.minimize_and_clip(optimizer,
weighted_error,
var_list=q_func_vars,
clip_val=grad_norm_clipping)
gradients = optimizer.compute_gradients(weighted_error, var_list=q_func_vars)
for i, (grad, var) in enumerate(gradients):
if grad is not None:
gradients[i] = (tf.clip_by_norm(grad, grad_norm_clipping), var)
optimize_expr = optimizer.apply_gradients(gradients)
else:
optimize_expr = optimizer.minimize(weighted_error, var_list=q_func_vars)

View File

@@ -1,51 +0,0 @@
import argparse
import progressbar
from baselines.common.azure_utils import Container
def parse_args():
parser = argparse.ArgumentParser("Download a pretrained model from Azure.")
# Environment
parser.add_argument("--model-dir", type=str, default=None,
help="save model in this directory this directory. ")
parser.add_argument("--account-name", type=str, default="openaisciszymon",
help="account name for Azure Blob Storage")
parser.add_argument("--account-key", type=str, default=None,
help="account key for Azure Blob Storage")
parser.add_argument("--container", type=str, default="dqn-blogpost",
help="container name and blob name separated by colon serparated by colon")
parser.add_argument("--blob", type=str, default=None, help="blob with the model")
return parser.parse_args()
def main():
args = parse_args()
c = Container(account_name=args.account_name,
account_key=args.account_key,
container_name=args.container)
if args.blob is None:
print("Listing available models:")
print()
for blob in sorted(c.list(prefix="model-")):
print(blob)
else:
print("Downloading {} to {}...".format(args.blob, args.model_dir))
bar = None
def callback(current, total):
nonlocal bar
if bar is None:
bar = progressbar.ProgressBar(max_value=total)
bar.update(current)
assert c.exists(args.blob), "model {} does not exist".format(args.blob)
assert args.model_dir is not None
c.get(args.model_dir, args.blob, callback=callback)
if __name__ == '__main__':
main()

View File

@@ -1,70 +0,0 @@
import argparse
import gym
import os
import numpy as np
from gym.monitoring import VideoRecorder
import baselines.common.tf_util as U
from baselines import deepq
from baselines.common.misc_util import (
boolean_flag,
)
from baselines import bench
from baselines.common.atari_wrappers_deprecated import wrap_dqn
from baselines.deepq.experiments.atari.model import model, dueling_model
def parse_args():
parser = argparse.ArgumentParser("Run an already learned DQN model.")
# Environment
parser.add_argument("--env", type=str, required=True, help="name of the game")
parser.add_argument("--model-dir", type=str, default=None, help="load model from this directory. ")
parser.add_argument("--video", type=str, default=None, help="Path to mp4 file where the video of first episode will be recorded.")
boolean_flag(parser, "stochastic", default=True, help="whether or not to use stochastic actions according to models eps value")
boolean_flag(parser, "dueling", default=False, help="whether or not to use dueling model")
return parser.parse_args()
def make_env(game_name):
env = gym.make(game_name + "NoFrameskip-v4")
env = bench.Monitor(env, None)
env = wrap_dqn(env)
return env
def play(env, act, stochastic, video_path):
num_episodes = 0
video_recorder = None
video_recorder = VideoRecorder(
env, video_path, enabled=video_path is not None)
obs = env.reset()
while True:
env.unwrapped.render()
video_recorder.capture_frame()
action = act(np.array(obs)[None], stochastic=stochastic)[0]
obs, rew, done, info = env.step(action)
if done:
obs = env.reset()
if len(info["rewards"]) > num_episodes:
if len(info["rewards"]) == 1 and video_recorder.enabled:
# save video of first episode
print("Saved video.")
video_recorder.close()
video_recorder.enabled = False
print(info["rewards"][-1])
num_episodes = len(info["rewards"])
if __name__ == '__main__':
with U.make_session(4) as sess:
args = parse_args()
env = make_env(args.env)
act = deepq.build_act(
make_obs_ph=lambda name: U.Uint8Input(env.observation_space.shape, name=name),
q_func=dueling_model if args.dueling else model,
num_actions=env.action_space.n)
U.load_state(os.path.join(args.model_dir, "saved"))
play(env, act, args.stochastic, args.video)

View File

@@ -1,60 +0,0 @@
import tensorflow as tf
import tensorflow.contrib.layers as layers
def layer_norm_fn(x, relu=True):
x = layers.layer_norm(x, scale=True, center=True)
if relu:
x = tf.nn.relu(x)
return x
def model(img_in, num_actions, scope, reuse=False, layer_norm=False):
"""As described in https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf"""
with tf.variable_scope(scope, reuse=reuse):
out = img_in
with tf.variable_scope("convnet"):
# original architecture
out = layers.convolution2d(out, num_outputs=32, kernel_size=8, stride=4, activation_fn=tf.nn.relu)
out = layers.convolution2d(out, num_outputs=64, kernel_size=4, stride=2, activation_fn=tf.nn.relu)
out = layers.convolution2d(out, num_outputs=64, kernel_size=3, stride=1, activation_fn=tf.nn.relu)
conv_out = layers.flatten(out)
with tf.variable_scope("action_value"):
value_out = layers.fully_connected(conv_out, num_outputs=512, activation_fn=None)
if layer_norm:
value_out = layer_norm_fn(value_out, relu=True)
else:
value_out = tf.nn.relu(value_out)
value_out = layers.fully_connected(value_out, num_outputs=num_actions, activation_fn=None)
return value_out
def dueling_model(img_in, num_actions, scope, reuse=False, layer_norm=False):
"""As described in https://arxiv.org/abs/1511.06581"""
with tf.variable_scope(scope, reuse=reuse):
out = img_in
with tf.variable_scope("convnet"):
# original architecture
out = layers.convolution2d(out, num_outputs=32, kernel_size=8, stride=4, activation_fn=tf.nn.relu)
out = layers.convolution2d(out, num_outputs=64, kernel_size=4, stride=2, activation_fn=tf.nn.relu)
out = layers.convolution2d(out, num_outputs=64, kernel_size=3, stride=1, activation_fn=tf.nn.relu)
conv_out = layers.flatten(out)
with tf.variable_scope("state_value"):
state_hidden = layers.fully_connected(conv_out, num_outputs=512, activation_fn=None)
if layer_norm:
state_hidden = layer_norm_fn(state_hidden, relu=True)
else:
state_hidden = tf.nn.relu(state_hidden)
state_score = layers.fully_connected(state_hidden, num_outputs=1, activation_fn=None)
with tf.variable_scope("action_value"):
actions_hidden = layers.fully_connected(conv_out, num_outputs=512, activation_fn=None)
if layer_norm:
actions_hidden = layer_norm_fn(actions_hidden, relu=True)
else:
actions_hidden = tf.nn.relu(actions_hidden)
action_scores = layers.fully_connected(actions_hidden, num_outputs=num_actions, activation_fn=None)
action_scores_mean = tf.reduce_mean(action_scores, 1)
action_scores = action_scores - tf.expand_dims(action_scores_mean, 1)
return state_score + action_scores


@@ -1,273 +0,0 @@
import argparse
import gym
import numpy as np
import os
import tensorflow as tf
import tempfile
import time
import json
import baselines.common.tf_util as U
from baselines import logger
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
from baselines.common.misc_util import (
boolean_flag,
pickle_load,
pretty_eta,
relatively_safe_pickle_dump,
set_global_seeds,
RunningAvg,
)
from baselines.common.schedules import LinearSchedule, PiecewiseSchedule
from baselines import bench
from baselines.common.atari_wrappers_deprecated import wrap_dqn
from baselines.common.azure_utils import Container
from .model import model, dueling_model
def parse_args():
parser = argparse.ArgumentParser("DQN experiments for Atari games")
# Environment
parser.add_argument("--env", type=str, default="Pong", help="name of the game")
parser.add_argument("--seed", type=int, default=42, help="which seed to use")
# Core DQN parameters
parser.add_argument("--replay-buffer-size", type=int, default=int(1e6), help="replay buffer size")
parser.add_argument("--lr", type=float, default=1e-4, help="learning rate for Adam optimizer")
parser.add_argument("--num-steps", type=int, default=int(2e8), help="total number of steps to run the environment for")
parser.add_argument("--batch-size", type=int, default=32, help="number of transitions to optimize at the same time")
parser.add_argument("--learning-freq", type=int, default=4, help="number of iterations between every optimization step")
parser.add_argument("--target-update-freq", type=int, default=40000, help="number of iterations between every target network update")
parser.add_argument("--param-noise-update-freq", type=int, default=50, help="number of iterations between every re-scaling of the parameter noise")
parser.add_argument("--param-noise-reset-freq", type=int, default=10000, help="maximum number of steps to take per episode before re-perturbing the exploration policy")
# Bells and whistles
boolean_flag(parser, "double-q", default=True, help="whether or not to use double q learning")
boolean_flag(parser, "dueling", default=False, help="whether or not to use dueling model")
boolean_flag(parser, "prioritized", default=False, help="whether or not to use prioritized replay buffer")
boolean_flag(parser, "param-noise", default=False, help="whether or not to use parameter space noise for exploration")
boolean_flag(parser, "layer-norm", default=False, help="whether or not to use layer norm (should be True if param_noise is used)")
boolean_flag(parser, "gym-monitor", default=False, help="whether or not to use a OpenAI Gym monitor (results in slower training due to video recording)")
parser.add_argument("--prioritized-alpha", type=float, default=0.6, help="alpha parameter for prioritized replay buffer")
parser.add_argument("--prioritized-beta0", type=float, default=0.4, help="initial value of beta parameters for prioritized replay")
parser.add_argument("--prioritized-eps", type=float, default=1e-6, help="eps parameter for prioritized replay buffer")
# Checkpointing
parser.add_argument("--save-dir", type=str, default=None, help="directory in which training state and model should be saved.")
parser.add_argument("--save-azure-container", type=str, default=None,
help="It present data will saved/loaded from Azure. Should be in format ACCOUNT_NAME:ACCOUNT_KEY:CONTAINER")
parser.add_argument("--save-freq", type=int, default=1e6, help="save model once every time this many iterations are completed")
boolean_flag(parser, "load-on-start", default=True, help="if true and model was previously saved then training will be resumed")
return parser.parse_args()
def make_env(game_name):
env = gym.make(game_name + "NoFrameskip-v4")
monitored_env = bench.Monitor(env, logger.get_dir()) # puts rewards and number of steps in info, before environment is wrapped
env = wrap_dqn(monitored_env) # applies a set of modifications to simplify the observation space (downsample, grayscale)
return env, monitored_env
def maybe_save_model(savedir, container, state):
"""This function checkpoints the model and state of the training algorithm."""
if savedir is None:
return
start_time = time.time()
model_dir = "model-{}".format(state["num_iters"])
U.save_state(os.path.join(savedir, model_dir, "saved"))
if container is not None:
container.put(os.path.join(savedir, model_dir), model_dir)
relatively_safe_pickle_dump(state, os.path.join(savedir, 'training_state.pkl.zip'), compression=True)
if container is not None:
container.put(os.path.join(savedir, 'training_state.pkl.zip'), 'training_state.pkl.zip')
relatively_safe_pickle_dump(state["monitor_state"], os.path.join(savedir, 'monitor_state.pkl'))
if container is not None:
container.put(os.path.join(savedir, 'monitor_state.pkl'), 'monitor_state.pkl')
logger.log("Saved model in {} seconds\n".format(time.time() - start_time))
def maybe_load_model(savedir, container):
"""Load model if present at the specified path."""
if savedir is None:
return
state_path = os.path.join(savedir, 'training_state.pkl.zip')
if container is not None:
logger.log("Attempting to download model from Azure")
found_model = container.get(savedir, 'training_state.pkl.zip')
else:
found_model = os.path.exists(state_path)
if found_model:
state = pickle_load(state_path, compression=True)
model_dir = "model-{}".format(state["num_iters"])
if container is not None:
container.get(savedir, model_dir)
U.load_state(os.path.join(savedir, model_dir, "saved"))
logger.log("Loaded models checkpoint at {} iterations".format(state["num_iters"]))
return state
if __name__ == '__main__':
args = parse_args()
# Parse savedir and azure container.
savedir = args.save_dir
if savedir is None:
savedir = os.getenv('OPENAI_LOGDIR', None)
if args.save_azure_container is not None:
account_name, account_key, container_name = args.save_azure_container.split(":")
container = Container(account_name=account_name,
account_key=account_key,
container_name=container_name,
maybe_create=True)
if savedir is None:
# Careful! This will not get cleaned up. Docker spoils the developers.
savedir = tempfile.TemporaryDirectory().name
else:
container = None
# Create and seed the env.
env, monitored_env = make_env(args.env)
if args.seed > 0:
set_global_seeds(args.seed)
env.unwrapped.seed(args.seed)
if args.gym_monitor and savedir:
env = gym.wrappers.Monitor(env, os.path.join(savedir, 'gym_monitor'), force=True)
if savedir:
with open(os.path.join(savedir, 'args.json'), 'w') as f:
json.dump(vars(args), f)
with U.make_session(4) as sess:
# Create training graph and replay buffer
def model_wrapper(img_in, num_actions, scope, **kwargs):
actual_model = dueling_model if args.dueling else model
return actual_model(img_in, num_actions, scope, layer_norm=args.layer_norm, **kwargs)
act, train, update_target, debug = deepq.build_train(
make_obs_ph=lambda name: U.Uint8Input(env.observation_space.shape, name=name),
q_func=model_wrapper,
num_actions=env.action_space.n,
optimizer=tf.train.AdamOptimizer(learning_rate=args.lr, epsilon=1e-4),
gamma=0.99,
grad_norm_clipping=10,
double_q=args.double_q,
param_noise=args.param_noise
)
approximate_num_iters = args.num_steps / 4
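# Epsilon schedule: linearly anneal from 1.0 to 0.1 over the first 2% of iterations, then to 0.01 by 20%, and hold at 0.01 afterwards.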
exploration = PiecewiseSchedule([
(0, 1.0),
(approximate_num_iters / 50, 0.1),
(approximate_num_iters / 5, 0.01)
], outside_value=0.01)
if args.prioritized:
replay_buffer = PrioritizedReplayBuffer(args.replay_buffer_size, args.prioritized_alpha)
beta_schedule = LinearSchedule(approximate_num_iters, initial_p=args.prioritized_beta0, final_p=1.0)
else:
replay_buffer = ReplayBuffer(args.replay_buffer_size)
U.initialize()
update_target()
num_iters = 0
# Load the model
state = maybe_load_model(savedir, container)
if state is not None:
num_iters, replay_buffer = state["num_iters"], state["replay_buffer"]
monitored_env.set_state(state["monitor_state"])
start_time, start_steps = None, None
steps_per_iter = RunningAvg(0.999)
iteration_time_est = RunningAvg(0.999)
obs = env.reset()
num_iters_since_reset = 0
reset = True
# Main training loop
while True:
num_iters += 1
num_iters_since_reset += 1
# Take action and store transition in the replay buffer.
kwargs = {}
if not args.param_noise:
update_eps = exploration.value(num_iters)
update_param_noise_threshold = 0.
else:
if args.param_noise_reset_freq > 0 and num_iters_since_reset > args.param_noise_reset_freq:
# Reset param noise policy since we have exceeded the maximum number of steps without a reset.
reset = True
update_eps = 0.01 # ensures that we cannot get stuck completely
# Compute the threshold such that the KL divergence between perturbed and non-perturbed
# policy is comparable to eps-greedy exploration with eps = exploration.value(t).
# See Appendix C.1 in Parameter Space Noise for Exploration, Plappert et al., 2017
# for detailed explanation.
update_param_noise_threshold = -np.log(1. - exploration.value(num_iters) + exploration.value(num_iters) / float(env.action_space.n))
kwargs['reset'] = reset
kwargs['update_param_noise_threshold'] = update_param_noise_threshold
kwargs['update_param_noise_scale'] = (num_iters % args.param_noise_update_freq == 0)
action = act(np.array(obs)[None], update_eps=update_eps, **kwargs)[0]
reset = False
new_obs, rew, done, info = env.step(action)
replay_buffer.add(obs, action, rew, new_obs, float(done))
obs = new_obs
if done:
num_iters_since_reset = 0
obs = env.reset()
reset = True
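# Start optimizing only after enough transitions have been collected (at least 5 batches' worth and 5% of the buffer capacity), and then only every learning_freq iterations.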
if (num_iters > max(5 * args.batch_size, args.replay_buffer_size // 20) and
num_iters % args.learning_freq == 0):
# Sample a bunch of transitions from replay buffer
if args.prioritized:
experience = replay_buffer.sample(args.batch_size, beta=beta_schedule.value(num_iters))
(obses_t, actions, rewards, obses_tp1, dones, weights, batch_idxes) = experience
else:
obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(args.batch_size)
weights = np.ones_like(rewards)
# Minimize the error in Bellman's equation and compute TD-error
td_errors = train(obses_t, actions, rewards, obses_tp1, dones, weights)
# Update the priorities in the replay buffer
if args.prioritized:
new_priorities = np.abs(td_errors) + args.prioritized_eps
replay_buffer.update_priorities(batch_idxes, new_priorities)
# Update target network.
if num_iters % args.target_update_freq == 0:
update_target()
if start_time is not None:
steps_per_iter.update(info['steps'] - start_steps)
iteration_time_est.update(time.time() - start_time)
start_time, start_steps = time.time(), info["steps"]
# Save the model and training state.
if num_iters > 0 and (num_iters % args.save_freq == 0 or info["steps"] > args.num_steps):
maybe_save_model(savedir, container, {
'replay_buffer': replay_buffer,
'num_iters': num_iters,
'monitor_state': monitored_env.get_state(),
})
if info["steps"] > args.num_steps:
break
if done:
steps_left = args.num_steps - info["steps"]
completion = np.round(info["steps"] / args.num_steps, 1)
logger.record_tabular("% completion", completion)
logger.record_tabular("steps", info["steps"])
logger.record_tabular("iters", num_iters)
logger.record_tabular("episodes", len(info["rewards"]))
logger.record_tabular("reward (100 epi mean)", np.mean(info["rewards"][-100:]))
logger.record_tabular("exploration", exploration.value(num_iters))
if args.prioritized:
logger.record_tabular("max priority", replay_buffer._max_priority)
fps_estimate = (float(steps_per_iter) / (float(iteration_time_est) + 1e-6)
if steps_per_iter._value is not None else "calculating...")
logger.dump_tabular()
logger.log()
logger.log("ETA: " + pretty_eta(int(steps_left / fps_estimate)))
logger.log()


@@ -1,81 +0,0 @@
import argparse
import gym
import numpy as np
import os
import baselines.common.tf_util as U
from baselines import deepq, bench
from baselines.common.misc_util import get_wrapper_by_name, boolean_flag, set_global_seeds
from baselines.common.atari_wrappers_deprecated import wrap_dqn
from baselines.deepq.experiments.atari.model import model, dueling_model
def make_env(game_name):
env = gym.make(game_name + "NoFrameskip-v4")
env_monitored = bench.Monitor(env, None)
env = wrap_dqn(env_monitored)
return env_monitored, env
def parse_args():
parser = argparse.ArgumentParser("Evaluate an already learned DQN model.")
# Environment
parser.add_argument("--env", type=str, required=True, help="name of the game")
parser.add_argument("--model-dir", type=str, default=None, help="load model from this directory. ")
boolean_flag(parser, "stochastic", default=True, help="whether or not to use stochastic actions according to models eps value")
boolean_flag(parser, "dueling", default=False, help="whether or not to use dueling model")
return parser.parse_args()
def wang2015_eval(game_name, act, stochastic):
print("==================== wang2015 evaluation ====================")
episode_rewards = []
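# Wang et al., 2015 evaluation protocol: run one episode for each of 1..30 no-op starts.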
for num_noops in range(1, 31):
env_monitored, eval_env = make_env(game_name)
eval_env.unwrapped.seed(1)
get_wrapper_by_name(eval_env, "NoopResetEnv").override_num_noops = num_noops
eval_episode_steps = 0
done = True
while True:
if done:
obs = eval_env.reset()
eval_episode_steps += 1
action = act(np.array(obs)[None], stochastic=stochastic)[0]
obs, _reward, done, info = eval_env.step(action)
if done:
obs = eval_env.reset()
if len(info["rewards"]) > 0:
episode_rewards.append(info["rewards"][0])
break
if info["steps"] > 108000: # 5 minutes of gameplay
episode_rewards.append(sum(env_monitored.rewards))
break
print("Num steps in episode {} was {} yielding {} reward".format(
num_noops, eval_episode_steps, episode_rewards[-1]), flush=True)
print("Evaluation results: " + str(np.mean(episode_rewards)))
print("=============================================================")
return np.mean(episode_rewards)
def main():
set_global_seeds(1)
args = parse_args()
with U.make_session(4): # noqa
_, env = make_env(args.env)
act = deepq.build_act(
make_obs_ph=lambda name: U.Uint8Input(env.observation_space.shape, name=name),
q_func=dueling_model if args.dueling else model,
num_actions=env.action_space.n)
U.load_state(os.path.join(args.model_dir, "saved"))
wang2015_eval(args.env, act, stochastic=args.stochastic)
if __name__ == '__main__':
main()


@@ -9,6 +9,7 @@ import baselines.common.tf_util as U
from baselines import logger
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer
from baselines.deepq.utils import ObservationInput
from baselines.common.schedules import LinearSchedule
@@ -27,7 +28,7 @@ if __name__ == '__main__':
env = gym.make("CartPole-v0")
# Create all the functions necessary to train the model
act, train, update_target, debug = deepq.build_train(
make_obs_ph=lambda name: U.BatchInput(env.observation_space.shape, name=name),
make_obs_ph=lambda name: ObservationInput(env.observation_space, name=name),
q_func=model,
num_actions=env.action_space.n,
optimizer=tf.train.AdamOptimizer(learning_rate=5e-4),


@@ -1,5 +1,3 @@
import gym
from baselines import deepq
from baselines.common import set_global_seeds
from baselines import bench
@@ -7,13 +5,18 @@ import argparse
from baselines import logger
from baselines.common.atari_wrappers import make_atari
def main():
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--prioritized', type=int, default=1)
parser.add_argument('--prioritized-replay-alpha', type=float, default=0.6)
parser.add_argument('--dueling', type=int, default=1)
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
parser.add_argument('--checkpoint-freq', type=int, default=10000)
parser.add_argument('--checkpoint-path', type=str, default=None)
args = parser.parse_args()
logger.configure()
set_global_seeds(args.seed)
@@ -25,7 +28,8 @@ def main():
hiddens=[256],
dueling=bool(args.dueling),
)
act = deepq.learn(
deepq.learn(
env,
q_func=model,
lr=1e-4,
@@ -37,9 +41,12 @@ def main():
learning_starts=10000,
target_network_update_freq=1000,
gamma=0.99,
prioritized_replay=bool(args.prioritized)
prioritized_replay=bool(args.prioritized),
prioritized_replay_alpha=args.prioritized_replay_alpha,
checkpoint_freq=args.checkpoint_freq,
checkpoint_path=args.checkpoint_path,
)
# act.save("pong_model.pkl") XXX
env.close()


@@ -3,7 +3,7 @@ import gym
from baselines import deepq
def callback(lcl, glb):
def callback(lcl, _glb):
# stop training if reward exceeds 199
is_solved = lcl['t'] > 100 and sum(lcl['episode_rewards'][-101:-1]) / 100 >= 199
return is_solved


@@ -6,7 +6,7 @@ from baselines.common.segment_tree import SumSegmentTree, MinSegmentTree
class ReplayBuffer(object):
def __init__(self, size):
"""Create Prioritized Replay buffer.
"""Create Replay buffer.
Parameters
----------
@@ -86,7 +86,7 @@ class PrioritizedReplayBuffer(ReplayBuffer):
ReplayBuffer.__init__
"""
super(PrioritizedReplayBuffer, self).__init__(size)
assert alpha > 0
assert alpha >= 0
self._alpha = alpha
it_capacity = 1


@@ -6,12 +6,15 @@ import zipfile
import cloudpickle
import numpy as np
import gym
import baselines.common.tf_util as U
from baselines.common.tf_util import load_state, save_state
from baselines import logger
from baselines.common.schedules import LinearSchedule
from baselines.common.input import observation_input
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
from baselines.deepq.utils import ObservationInput
class ActWrapper(object):
@@ -32,7 +35,7 @@ class ActWrapper(object):
f.write(model_data)
zipfile.ZipFile(arc_path, 'r', zipfile.ZIP_DEFLATED).extractall(td)
U.load_state(os.path.join(td, "model"))
load_state(os.path.join(td, "model"))
return ActWrapper(act, act_params)
@@ -45,7 +48,7 @@ class ActWrapper(object):
path = os.path.join(logger.get_dir(), "model.pkl")
with tempfile.TemporaryDirectory() as td:
U.save_state(os.path.join(td, "model"))
save_state(os.path.join(td, "model"))
arc_name = os.path.join(td, "packed.zip")
with zipfile.ZipFile(arc_name, 'w') as zipf:
for root, dirs, files in os.walk(td):
@@ -87,6 +90,7 @@ def learn(env,
batch_size=32,
print_freq=100,
checkpoint_freq=10000,
checkpoint_path=None,
learning_starts=1000,
gamma=1.0,
target_network_update_freq=500,
@@ -169,9 +173,9 @@ def learn(env,
# capture the shape outside the closure so that the env object is not serialized
# by cloudpickle when serializing make_obs_ph
observation_space_shape = env.observation_space.shape
def make_obs_ph(name):
return U.BatchInput(observation_space_shape, name=name)
return ObservationInput(env.observation_space, name=name)
act, train, update_target, debug = deepq.build_train(
make_obs_ph=make_obs_ph,
@@ -215,9 +219,17 @@ def learn(env,
saved_mean_reward = None
obs = env.reset()
reset = True
with tempfile.TemporaryDirectory() as td:
model_saved = False
td = checkpoint_path or td
model_file = os.path.join(td, "model")
model_saved = False
if tf.train.latest_checkpoint(td) is not None:
load_state(model_file)
logger.log('Loaded model from {}'.format(model_file))
model_saved = True
for t in range(max_timesteps):
if callback is not None:
if callback(locals(), globals()):
@@ -238,11 +250,7 @@ def learn(env,
kwargs['update_param_noise_threshold'] = update_param_noise_threshold
kwargs['update_param_noise_scale'] = True
action = act(np.array(obs)[None], update_eps=update_eps, **kwargs)[0]
if isinstance(env.action_space, gym.spaces.MultiBinary):
env_action = np.zeros(env.action_space.n)
env_action[action] = 1
else:
env_action = action
env_action = action
reset = False
new_obs, rew, done, _ = env.step(env_action)
# Store transition in the replay buffer.
@@ -287,12 +295,12 @@ def learn(env,
if print_freq is not None:
logger.log("Saving model due to mean reward increase: {} -> {}".format(
saved_mean_reward, mean_100ep_reward))
U.save_state(model_file)
save_state(model_file)
model_saved = True
saved_mean_reward = mean_100ep_reward
if model_saved:
if print_freq is not None:
logger.log("Restored model with mean reward: {}".format(saved_mean_reward))
U.load_state(model_file)
load_state(model_file)
return act


@@ -0,0 +1,43 @@
import tensorflow as tf
import random
from baselines import deepq
from baselines.common.identity_env import IdentityEnv
def test_identity():
with tf.Graph().as_default():
env = IdentityEnv(10)
random.seed(0)
tf.set_random_seed(0)
param_noise = False
model = deepq.models.mlp([32])
act = deepq.learn(
env,
q_func=model,
lr=1e-3,
max_timesteps=10000,
buffer_size=50000,
exploration_fraction=0.1,
exploration_final_eps=0.02,
print_freq=10,
param_noise=param_noise,
)
tf.set_random_seed(0)
N_TRIALS = 1000
sum_rew = 0
obs = env.reset()
for i in range(N_TRIALS):
obs, rew, done, _ = env.step(act([obs]))
sum_rew += rew
assert sum_rew > 0.9 * N_TRIALS
if __name__ == '__main__':
test_identity()

baselines/deepq/utils.py (new file, 83 lines)

@@ -0,0 +1,83 @@
from baselines.common.input import observation_input
import tensorflow as tf
# ================================================================
# Placeholders
# ================================================================
class TfInput(object):
def __init__(self, name="(unnamed)"):
"""Generalized Tensorflow placeholder. The main differences are:
- possibly uses multiple placeholders internally and returns multiple values
- can apply light postprocessing to the value feed to placeholder.
"""
self.name = name
def get(self):
"""Return the tf variable(s) representing the possibly postprocessed value
of placeholder(s).
"""
raise NotImplementedError()
def make_feed_dict(self, data):
"""Given data, input it to the placeholder(s)."""
raise NotImplementedError()
class PlaceholderTfInput(TfInput):
def __init__(self, placeholder):
"""Wrapper for regular tensorflow placeholder."""
super().__init__(placeholder.name)
self._placeholder = placeholder
def get(self):
return self._placeholder
def make_feed_dict(self, data):
return {self._placeholder: data}
class Uint8Input(PlaceholderTfInput):
def __init__(self, shape, name=None):
"""Takes input in uint8 format which is cast to float32 and divided by 255
before passing it to the model.
On GPU this ensures lower data transfer times.
Parameters
----------
shape: [int]
shape of the tensor.
name: str
name of the underlying placeholder
"""
super().__init__(tf.placeholder(tf.uint8, [None] + list(shape), name=name))
self._shape = shape
self._output = tf.cast(super().get(), tf.float32) / 255.0
def get(self):
return self._output
class ObservationInput(PlaceholderTfInput):
def __init__(self, observation_space, name=None):
"""Creates an input placeholder tailored to a specific observation space
Parameters
----------
observation_space:
observation space of the environment. Should be one of the gym.spaces types
name: str
tensorflow name of the underlying placeholder
"""
inpt, self.processed_inpt = observation_input(observation_space, name=name)
super().__init__(inpt)
def get(self):
return self.processed_inpt
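
A minimal usage sketch of the new `ObservationInput` wrapper, assuming a standard Gym environment and the `deepq.build_act` API shown elsewhere in this diff:

```python
import gym
from baselines import deepq
from baselines.deepq.utils import ObservationInput

env = gym.make("CartPole-v0")
# ObservationInput builds a placeholder matching the observation space;
# .get() returns the (possibly post-processed) input tensor.
act = deepq.build_act(
    make_obs_ph=lambda name: ObservationInput(env.observation_space, name=name),
    q_func=deepq.models.mlp([64]),
    num_actions=env.action_space.n)
# Variables still need to be initialized inside a TF session before calling act.
```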

baselines/gail/README.md (new file, 52 lines)

@@ -0,0 +1,52 @@
# Generative Adversarial Imitation Learning (GAIL)
- Original paper: https://arxiv.org/abs/1606.03476
For benchmark results on MuJoCo, see [result/gail-result.md](result/gail-result.md)
## If you want to train an imitation learning agent
### Step 1: Download expert data
Download the expert data into `./data`, [download link](https://drive.google.com/drive/folders/1h3H4AY_ZBx08hz-Ct0Nxxus-V1melu1U?usp=sharing)
### Step 2: Run GAIL
Run with single thread:
```bash
python -m baselines.gail.run_mujoco
```
Run with multiple threads:
```bash
mpirun -np 16 python -m baselines.gail.run_mujoco
```
See help (`-h`) for more options.
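A typical single-thread invocation with explicit options might look like the sketch below; the flag names are assumptions borrowed from the behavior-cloning parser shown later in this diff (`--env_id`, `--seed`, `--expert_path`, `--traj_limitation`), so check `python -m baselines.gail.run_mujoco -h` for the exact set:
```bash
# Flag names assumed; verify with -h
python -m baselines.gail.run_mujoco \
  --env_id Hopper-v1 \
  --seed 0 \
  --expert_path data/deterministic.trpo.Hopper.0.00.npz \
  --traj_limitation -1
```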
#### In case you want to run Behavior Cloning (BC)
```bash
python -m baselines.gail.behavior_clone
```
See help (`-h`) for more options.
## Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/openai/baselines/pulls.
## Maintainers
- Yuan-Hong Liao, andrewliao11_at_gmail_dot_com
- Ryan Julian, ryanjulian_at_gmail_dot_com
## Others
Thanks to the following open-source projects:
- @openai/imitation
- @carpedm20/deep-rl-tensorflow


@@ -0,0 +1,87 @@
'''
Reference: https://github.com/openai/imitation
This follows the architecture from the official repository
'''
import tensorflow as tf
import numpy as np
from baselines.common.mpi_running_mean_std import RunningMeanStd
from baselines.common import tf_util as U
def logsigmoid(a):
'''Equivalent to tf.log(tf.sigmoid(a))'''
return -tf.nn.softplus(-a)
""" Reference: https://github.com/openai/imitation/blob/99fbccf3e060b6e6c739bdf209758620fcdefd3c/policyopt/thutil.py#L48-L51"""
def logit_bernoulli_entropy(logits):
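# For p = sigmoid(x), the Bernoulli entropy -p*log(p) - (1-p)*log(1-p) simplifies to (1 - sigmoid(x)) * x - logsigmoid(x), which is what is computed below.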
ent = (1.-tf.nn.sigmoid(logits))*logits - logsigmoid(logits)
return ent
class TransitionClassifier(object):
def __init__(self, env, hidden_size, entcoeff=0.001, lr_rate=1e-3, scope="adversary"):
self.scope = scope
self.observation_shape = env.observation_space.shape
self.actions_shape = env.action_space.shape
self.input_shape = tuple([o+a for o, a in zip(self.observation_shape, self.actions_shape)])
self.num_actions = env.action_space.shape[0]
self.hidden_size = hidden_size
self.build_ph()
# Build graph
generator_logits = self.build_graph(self.generator_obs_ph, self.generator_acs_ph, reuse=False)
expert_logits = self.build_graph(self.expert_obs_ph, self.expert_acs_ph, reuse=True)
# Build accuracy
generator_acc = tf.reduce_mean(tf.to_float(tf.nn.sigmoid(generator_logits) < 0.5))
expert_acc = tf.reduce_mean(tf.to_float(tf.nn.sigmoid(expert_logits) > 0.5))
# Build regression loss
# let x = logits, z = targets.
# z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x))
generator_loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=generator_logits, labels=tf.zeros_like(generator_logits))
generator_loss = tf.reduce_mean(generator_loss)
expert_loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=expert_logits, labels=tf.ones_like(expert_logits))
expert_loss = tf.reduce_mean(expert_loss)
# Build entropy loss
logits = tf.concat([generator_logits, expert_logits], 0)
entropy = tf.reduce_mean(logit_bernoulli_entropy(logits))
entropy_loss = -entcoeff*entropy
# Loss + Accuracy terms
self.losses = [generator_loss, expert_loss, entropy, entropy_loss, generator_acc, expert_acc]
self.loss_name = ["generator_loss", "expert_loss", "entropy", "entropy_loss", "generator_acc", "expert_acc"]
self.total_loss = generator_loss + expert_loss + entropy_loss
# Build Reward for policy
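# Surrogate reward r(s, a) = -log(1 - D(s, a)): large when the discriminator classifies the transition as expert-like; the 1e-8 term avoids log(0).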
self.reward_op = -tf.log(1-tf.nn.sigmoid(generator_logits)+1e-8)
var_list = self.get_trainable_variables()
self.lossandgrad = U.function([self.generator_obs_ph, self.generator_acs_ph, self.expert_obs_ph, self.expert_acs_ph],
self.losses + [U.flatgrad(self.total_loss, var_list)])
def build_ph(self):
self.generator_obs_ph = tf.placeholder(tf.float32, (None, ) + self.observation_shape, name="observations_ph")
self.generator_acs_ph = tf.placeholder(tf.float32, (None, ) + self.actions_shape, name="actions_ph")
self.expert_obs_ph = tf.placeholder(tf.float32, (None, ) + self.observation_shape, name="expert_observations_ph")
self.expert_acs_ph = tf.placeholder(tf.float32, (None, ) + self.actions_shape, name="expert_actions_ph")
def build_graph(self, obs_ph, acs_ph, reuse=False):
with tf.variable_scope(self.scope):
if reuse:
tf.get_variable_scope().reuse_variables()
with tf.variable_scope("obfilter"):
self.obs_rms = RunningMeanStd(shape=self.observation_shape)
obs = (obs_ph - self.obs_rms.mean) / self.obs_rms.std
_input = tf.concat([obs, acs_ph], axis=1) # concatenate the two inputs to form a transition
p_h1 = tf.contrib.layers.fully_connected(_input, self.hidden_size, activation_fn=tf.nn.tanh)
p_h2 = tf.contrib.layers.fully_connected(p_h1, self.hidden_size, activation_fn=tf.nn.tanh)
logits = tf.contrib.layers.fully_connected(p_h2, 1, activation_fn=tf.identity)
return logits
def get_trainable_variables(self):
return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)
def get_reward(self, obs, acs):
sess = tf.get_default_session()
if len(obs.shape) == 1:
obs = np.expand_dims(obs, 0)
if len(acs.shape) == 1:
acs = np.expand_dims(acs, 0)
feed_dict = {self.generator_obs_ph: obs, self.generator_acs_ph: acs}
reward = sess.run(self.reward_op, feed_dict)
return reward


@@ -0,0 +1,124 @@
'''
This code is used to train a BC imitator, or to pretrain a GAIL imitator
'''
import argparse
import tempfile
import os.path as osp
import gym
import logging
from tqdm import tqdm
import tensorflow as tf
from baselines.gail import mlp_policy
from baselines import bench
from baselines import logger
from baselines.common import set_global_seeds, tf_util as U
from baselines.common.misc_util import boolean_flag
from baselines.common.mpi_adam import MpiAdam
from baselines.gail.run_mujoco import runner
from baselines.gail.dataset.mujoco_dset import Mujoco_Dset
def argsparser():
parser = argparse.ArgumentParser("Tensorflow Implementation of Behavior Cloning")
parser.add_argument('--env_id', help='environment ID', default='Hopper-v1')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--expert_path', type=str, default='data/deterministic.trpo.Hopper.0.00.npz')
parser.add_argument('--checkpoint_dir', help='the directory to save model', default='checkpoint')
parser.add_argument('--log_dir', help='the directory to save log file', default='log')
# Mujoco Dataset Configuration
parser.add_argument('--traj_limitation', type=int, default=-1)
# Network Configuration (Using MLP Policy)
parser.add_argument('--policy_hidden_size', type=int, default=100)
# for evaluation
boolean_flag(parser, 'stochastic_policy', default=False, help='use stochastic/deterministic policy to evaluate')
boolean_flag(parser, 'save_sample', default=False, help='save the trajectories or not')
parser.add_argument('--BC_max_iter', help='Max iteration for training BC', type=int, default=1e5)
return parser.parse_args()
def learn(env, policy_func, dataset, optim_batch_size=128, max_iters=1e4,
adam_epsilon=1e-5, optim_stepsize=3e-4,
ckpt_dir=None, log_dir=None, task_name=None,
verbose=False):
val_per_iter = int(max_iters/10)
ob_space = env.observation_space
ac_space = env.action_space
pi = policy_func("pi", ob_space, ac_space) # Construct network for new policy
# placeholder
ob = U.get_placeholder_cached(name="ob")
ac = pi.pdtype.sample_placeholder([None])
stochastic = U.get_placeholder_cached(name="stochastic")
loss = tf.reduce_mean(tf.square(ac-pi.ac))
var_list = pi.get_trainable_variables()
adam = MpiAdam(var_list, epsilon=adam_epsilon)
lossandgrad = U.function([ob, ac, stochastic], [loss]+[U.flatgrad(loss, var_list)])
U.initialize()
adam.sync()
logger.log("Pretraining with Behavior Cloning...")
for iter_so_far in tqdm(range(int(max_iters))):
ob_expert, ac_expert = dataset.get_next_batch(optim_batch_size, 'train')
train_loss, g = lossandgrad(ob_expert, ac_expert, True)
adam.update(g, optim_stepsize)
if verbose and iter_so_far % val_per_iter == 0:
ob_expert, ac_expert = dataset.get_next_batch(-1, 'val')
val_loss, _ = lossandgrad(ob_expert, ac_expert, True)
logger.log("Training loss: {}, Validation loss: {}".format(train_loss, val_loss))
if ckpt_dir is None:
savedir_fname = tempfile.TemporaryDirectory().name
else:
savedir_fname = osp.join(ckpt_dir, task_name)
U.save_state(savedir_fname, var_list=pi.get_variables())
return savedir_fname
def get_task_name(args):
task_name = 'BC'
task_name += '.{}'.format(args.env_id.split("-")[0])
task_name += '.traj_limitation_{}'.format(args.traj_limitation)
task_name += ".seed_{}".format(args.seed)
return task_name
def main(args):
U.make_session(num_cpu=1).__enter__()
set_global_seeds(args.seed)
env = gym.make(args.env_id)
def policy_fn(name, ob_space, ac_space, reuse=False):
return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
reuse=reuse, hid_size=args.policy_hidden_size, num_hid_layers=2)
env = bench.Monitor(env, logger.get_dir() and
osp.join(logger.get_dir(), "monitor.json"))
env.seed(args.seed)
gym.logger.setLevel(logging.WARN)
task_name = get_task_name(args)
args.checkpoint_dir = osp.join(args.checkpoint_dir, task_name)
args.log_dir = osp.join(args.log_dir, task_name)
dataset = Mujoco_Dset(expert_path=args.expert_path, traj_limitation=args.traj_limitation)
savedir_fname = learn(env,
policy_fn,
dataset,
max_iters=args.BC_max_iter,
ckpt_dir=args.checkpoint_dir,
log_dir=args.log_dir,
task_name=task_name,
verbose=True)
avg_len, avg_ret = runner(env,
policy_fn,
savedir_fname,
timesteps_per_batch=1024,
number_trajs=10,
stochastic_policy=args.stochastic_policy,
save=args.save_sample,
reuse=True)
if __name__ == '__main__':
args = argsparser()
main(args)


@@ -0,0 +1,110 @@
'''
Data structure of the input .npz:
the data is saved in Python dictionary format with keys: 'acs', 'ep_rets', 'rews', 'obs'
the value of each key is a list storing the expert trajectories sequentially
a transition is (data['obs'][t], data['acs'][t], data['obs'][t+1]) with reward data['rews'][t]
'''
from baselines import logger
import numpy as np
class Dset(object):
def __init__(self, inputs, labels, randomize):
self.inputs = inputs
self.labels = labels
assert len(self.inputs) == len(self.labels)
self.randomize = randomize
self.num_pairs = len(inputs)
self.init_pointer()
def init_pointer(self):
self.pointer = 0
if self.randomize:
idx = np.arange(self.num_pairs)
np.random.shuffle(idx)
self.inputs = self.inputs[idx, :]
self.labels = self.labels[idx, :]
def get_next_batch(self, batch_size):
# if batch_size is negative -> return all
if batch_size < 0:
return self.inputs, self.labels
if self.pointer + batch_size >= self.num_pairs:
self.init_pointer()
end = self.pointer + batch_size
inputs = self.inputs[self.pointer:end, :]
labels = self.labels[self.pointer:end, :]
self.pointer = end
return inputs, labels
class Mujoco_Dset(object):
def __init__(self, expert_path, train_fraction=0.7, traj_limitation=-1, randomize=True):
traj_data = np.load(expert_path)
if traj_limitation < 0:
traj_limitation = len(traj_data['obs'])
obs = traj_data['obs'][:traj_limitation]
acs = traj_data['acs'][:traj_limitation]
# obs, acs: shape (N, L, ) + S where N = # episodes, L = episode length
# and S is the environment observation/action space.
# Flatten to (N * L, prod(S))
self.obs = np.reshape(obs, [-1, np.prod(obs.shape[2:])])
self.acs = np.reshape(acs, [-1, np.prod(acs.shape[2:])])
self.rets = traj_data['ep_rets'][:traj_limitation]
self.avg_ret = sum(self.rets)/len(self.rets)
self.std_ret = np.std(np.array(self.rets))
if len(self.acs) > 2:
self.acs = np.squeeze(self.acs)
assert len(self.obs) == len(self.acs)
self.num_traj = min(traj_limitation, len(traj_data['obs']))
self.num_transition = len(self.obs)
self.randomize = randomize
self.dset = Dset(self.obs, self.acs, self.randomize)
# for behavior cloning
self.train_set = Dset(self.obs[:int(self.num_transition*train_fraction), :],
self.acs[:int(self.num_transition*train_fraction), :],
self.randomize)
self.val_set = Dset(self.obs[int(self.num_transition*train_fraction):, :],
self.acs[int(self.num_transition*train_fraction):, :],
self.randomize)
self.log_info()
def log_info(self):
logger.log("Total trajectorues: %d" % self.num_traj)
logger.log("Total transitions: %d" % self.num_transition)
logger.log("Average returns: %f" % self.avg_ret)
logger.log("Std for returns: %f" % self.std_ret)
def get_next_batch(self, batch_size, split=None):
if split is None:
return self.dset.get_next_batch(batch_size)
elif split == 'train':
return self.train_set.get_next_batch(batch_size)
elif split == 'val':
return self.val_set.get_next_batch(batch_size)
else:
raise NotImplementedError
def plot(self):
import matplotlib.pyplot as plt
plt.hist(self.rets)
plt.savefig("histogram_rets.png")
plt.close()
def test(expert_path, traj_limitation, plot):
dset = Mujoco_Dset(expert_path, traj_limitation=traj_limitation)
if plot:
dset.plot()
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--expert_path", type=str, default="../data/deterministic.trpo.Hopper.0.00.npz")
parser.add_argument("--traj_limitation", type=int, default=None)
parser.add_argument("--plot", type=bool, default=False)
args = parser.parse_args()
test(args.expert_path, args.traj_limitation, args.plot)
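
As a quick illustration of the expert-data layout described in the docstring above (the file path here is just the default used elsewhere in this diff), the .npz archive can be inspected directly with NumPy:

```python
import numpy as np

# Path assumed: any expert file with the keys described above works.
traj_data = np.load("data/deterministic.trpo.Hopper.0.00.npz")
print(traj_data.files)        # e.g. ['obs', 'acs', 'rews', 'ep_rets'] (order may vary)
obs, acs = traj_data['obs'], traj_data['acs']
print(obs.shape, acs.shape)   # (N episodes, L steps, ...) per the comment above
# One transition: (obs[t], acs[t], obs[t + 1]) with reward rews[t]
```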

baselines/gail/gail-eval.py (new file, 147 lines)

@@ -0,0 +1,147 @@
'''
This code is used to evaluate the imitators trained with different numbers of trajectories
and plot the results in the same figure for easy comparison.
'''
import argparse
import os
import glob
import gym
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from baselines.gail import run_mujoco
from baselines.gail import mlp_policy
from baselines.common import set_global_seeds, tf_util as U
from baselines.common.misc_util import boolean_flag
from baselines.gail.dataset.mujoco_dset import Mujoco_Dset
plt.style.use('ggplot')
CONFIG = {
'traj_limitation': [1, 5, 10, 50],
}
def load_dataset(expert_path):
dataset = Mujoco_Dset(expert_path=expert_path)
return dataset
def argsparser():
parser = argparse.ArgumentParser('Do evaluation')
parser.add_argument('--seed', type=int, default=0)
parser.add_argument('--policy_hidden_size', type=int, default=100)
parser.add_argument('--env', type=str, choices=['Hopper', 'Walker2d', 'HalfCheetah',
'Humanoid', 'HumanoidStandup'])
boolean_flag(parser, 'stochastic_policy', default=False, help='use stochastic/deterministic policy to evaluate')
return parser.parse_args()
def evaluate_env(env_name, seed, policy_hidden_size, stochastic, reuse, prefix):
def get_checkpoint_dir(checkpoint_list, limit, prefix):
for checkpoint in checkpoint_list:
if ('limitation_'+str(limit) in checkpoint) and (prefix in checkpoint):
return checkpoint
return None
def policy_fn(name, ob_space, ac_space, reuse=False):
return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
reuse=reuse, hid_size=policy_hidden_size, num_hid_layers=2)
data_path = os.path.join('data', 'deterministic.trpo.' + env_name + '.0.00.npz')
dataset = load_dataset(data_path)
checkpoint_list = glob.glob(os.path.join('checkpoint', '*' + env_name + ".*"))
log = {
'traj_limitation': [],
'upper_bound': [],
'avg_ret': [],
'avg_len': [],
'normalized_ret': []
}
for i, limit in enumerate(CONFIG['traj_limitation']):
# Do one evaluation
upper_bound = sum(dataset.rets[:limit])/limit
checkpoint_dir = get_checkpoint_dir(checkpoint_list, limit, prefix=prefix)
checkpoint_path = tf.train.latest_checkpoint(checkpoint_dir)
env = gym.make(env_name + '-v1')
env.seed(seed)
print('Trajectory limitation: {}, Load checkpoint: {}, '.format(limit, checkpoint_path))
avg_len, avg_ret = run_mujoco.runner(env,
policy_fn,
checkpoint_path,
timesteps_per_batch=1024,
number_trajs=10,
stochastic_policy=stochastic,
reuse=((i != 0) or reuse))
normalized_ret = avg_ret/upper_bound
print('Upper bound: {}, evaluation returns: {}, normalized scores: {}'.format(
upper_bound, avg_ret, normalized_ret))
log['traj_limitation'].append(limit)
log['upper_bound'].append(upper_bound)
log['avg_ret'].append(avg_ret)
log['avg_len'].append(avg_len)
log['normalized_ret'].append(normalized_ret)
env.close()
return log
def plot(env_name, bc_log, gail_log, stochastic):
upper_bound = bc_log['upper_bound']
bc_avg_ret = bc_log['avg_ret']
gail_avg_ret = gail_log['avg_ret']
plt.plot(CONFIG['traj_limitation'], upper_bound)
plt.plot(CONFIG['traj_limitation'], bc_avg_ret)
plt.plot(CONFIG['traj_limitation'], gail_avg_ret)
plt.xlabel('Number of expert trajectories')
plt.ylabel('Accumulated reward')
plt.title('{} unnormalized scores'.format(env_name))
plt.legend(['expert', 'bc-imitator', 'gail-imitator'], loc='lower right')
plt.grid(b=True, which='major', color='gray', linestyle='--')
if stochastic:
title_name = 'result/{}-unnormalized-stochastic-scores.png'.format(env_name)
else:
title_name = 'result/{}-unnormalized-deterministic-scores.png'.format(env_name)
plt.savefig(title_name)
plt.close()
bc_normalized_ret = bc_log['normalized_ret']
gail_normalized_ret = gail_log['normalized_ret']
plt.plot(CONFIG['traj_limitation'], np.ones(len(CONFIG['traj_limitation'])))
plt.plot(CONFIG['traj_limitation'], bc_normalized_ret)
plt.plot(CONFIG['traj_limitation'], gail_normalized_ret)
plt.xlabel('Number of expert trajectories')
plt.ylabel('Normalized performance')
plt.title('{} normalized scores'.format(env_name))
plt.legend(['expert', 'bc-imitator', 'gail-imitator'], loc='lower right')
plt.grid(b=True, which='major', color='gray', linestyle='--')
if stochastic:
title_name = 'result/{}-normalized-stochastic-scores.png'.format(env_name)
else:
title_name = 'result/{}-normalized-deterministic-scores.png'.format(env_name)
plt.ylim(0, 1.6)
plt.savefig(title_name)
plt.close()
def main(args):
U.make_session(num_cpu=1).__enter__()
set_global_seeds(args.seed)
print('Evaluating {}'.format(args.env))
bc_log = evaluate_env(args.env, args.seed, args.policy_hidden_size,
args.stochastic_policy, False, 'BC')
print('Evaluation for {}'.format(args.env))
print(bc_log)
gail_log = evaluate_env(args.env, args.seed, args.policy_hidden_size,
args.stochastic_policy, True, 'gail')
print('Evaluation for {}'.format(args.env))
print(gail_log)
plot(args.env, bc_log, gail_log, args.stochastic_policy)
if __name__ == '__main__':
args = argsparser()
main(args)


@@ -0,0 +1,75 @@
'''
from baselines/ppo1/mlp_policy.py and add simple modification
(1) add reuse argument
(2) cache the `stochastic` placeholder
'''
import tensorflow as tf
import gym
import baselines.common.tf_util as U
from baselines.common.mpi_running_mean_std import RunningMeanStd
from baselines.common.distributions import make_pdtype
from baselines.acktr.utils import dense
class MlpPolicy(object):
recurrent = False
def __init__(self, name, reuse=False, *args, **kwargs):
with tf.variable_scope(name):
if reuse:
tf.get_variable_scope().reuse_variables()
self._init(*args, **kwargs)
self.scope = tf.get_variable_scope().name
def _init(self, ob_space, ac_space, hid_size, num_hid_layers, gaussian_fixed_var=True):
assert isinstance(ob_space, gym.spaces.Box)
self.pdtype = pdtype = make_pdtype(ac_space)
sequence_length = None
ob = U.get_placeholder(name="ob", dtype=tf.float32, shape=[sequence_length] + list(ob_space.shape))
with tf.variable_scope("obfilter"):
self.ob_rms = RunningMeanStd(shape=ob_space.shape)
obz = tf.clip_by_value((ob - self.ob_rms.mean) / self.ob_rms.std, -5.0, 5.0)
last_out = obz
for i in range(num_hid_layers):
last_out = tf.nn.tanh(dense(last_out, hid_size, "vffc%i" % (i+1), weight_init=U.normc_initializer(1.0)))
self.vpred = dense(last_out, 1, "vffinal", weight_init=U.normc_initializer(1.0))[:, 0]
last_out = obz
for i in range(num_hid_layers):
last_out = tf.nn.tanh(dense(last_out, hid_size, "polfc%i" % (i+1), weight_init=U.normc_initializer(1.0)))
if gaussian_fixed_var and isinstance(ac_space, gym.spaces.Box):
mean = dense(last_out, pdtype.param_shape()[0]//2, "polfinal", U.normc_initializer(0.01))
logstd = tf.get_variable(name="logstd", shape=[1, pdtype.param_shape()[0]//2], initializer=tf.zeros_initializer())
pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
else:
pdparam = dense(last_out, pdtype.param_shape()[0], "polfinal", U.normc_initializer(0.01))
self.pd = pdtype.pdfromflat(pdparam)
self.state_in = []
self.state_out = []
# change for BC
stochastic = U.get_placeholder(name="stochastic", dtype=tf.bool, shape=())
ac = U.switch(stochastic, self.pd.sample(), self.pd.mode())
self.ac = ac
self._act = U.function([stochastic, ob], [ac, self.vpred])
def act(self, stochastic, ob):
ac1, vpred1 = self._act(stochastic, ob[None])
return ac1[0], vpred1[0]
def get_variables(self):
return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, self.scope)
def get_trainable_variables(self):
return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)
def get_initial_state(self):
return []

(20 binary image files added: result plots, not shown.)


@@ -0,0 +1,53 @@
# Results of GAIL/BC on MuJoCo
Here are extensive experimental results of applying GAIL/BC to MuJoCo environments, including
Hopper-v1, Walker2d-v1, HalfCheetah-v1, Humanoid-v1, and HumanoidStandup-v1. Every imitator is evaluated with seed 0.
## Results
### Training through iterations
- Hopper-v1
<img src='hopper-training.png'>
- HalfCheetah-v1
<img src='halfcheetah-training.png'>
- Walker2d-v1
<img src='walker2d-training.png'>
- Humanoid-v1
<img src='humanoid-training.png'>
- HumanoidStandup-v1
<img src='humanoidstandup-training.png'>
For details about GAIL training (e.g., adversarial loss, discriminator accuracy), please see [here](https://drive.google.com/drive/folders/1nnU8dqAV9i37-_5_vWIspyFUJFQLCsDD?usp=sharing)
### Deterministic Policy (Set std=0)
| | Un-normalized | Normalized |
|---|---|---|
| Hopper-v1 | <img src='Hopper-unnormalized-deterministic-scores.png'> | <img src='Hopper-normalized-deterministic-scores.png'> |
| HalfCheetah-v1 | <img src='HalfCheetah-unnormalized-deterministic-scores.png'> | <img src='HalfCheetah-normalized-deterministic-scores.png'> |
| Walker2d-v1 | <img src='Walker2d-unnormalized-deterministic-scores.png'> | <img src='Walker2d-normalized-deterministic-scores.png'> |
| Humanoid-v1 | <img src='Humanoid-unnormalized-deterministic-scores.png'> | <img src='Humanoid-normalized-deterministic-scores.png'> |
| HumanoidStandup-v1 | <img src='HumanoidStandup-unnormalized-deterministic-scores.png'> | <img src='HumanoidStandup-normalized-deterministic-scores.png'> |
### Stochastic Policy
| | Un-normalized | Normalized |
|---|---|---|
| Hopper-v1 | <img src='Hopper-unnormalized-stochastic-scores.png'> | <img src='Hopper-normalized-stochastic-scores.png'> |
| HalfCheetah-v1 | <img src='HalfCheetah-unnormalized-stochastic-scores.png'> | <img src='HalfCheetah-normalized-stochastic-scores.png'> |
| Walker2d-v1 | <img src='Walker2d-unnormalized-stochastic-scores.png'> | <img src='Walker2d-normalized-stochastic-scores.png'> |
| Humanoid-v1 | <img src='Humanoid-unnormalized-stochastic-scores.png'> | <img src='Humanoid-normalized-stochastic-scores.png'> |
| HumanoidStandup-v1 | <img src='HumanoidStandup-unnormalized-stochastic-scores.png'> | <img src='HumanoidStandup-normalized-stochastic-scores.png'> |
### Details about the GAIL imitators
For all environments, the imitators are trained with 1, 5, 10, and 50 trajectories (each trajectory contains at most
1024 transitions), using seeds 0, 1, 2, and 3, respectively.
### Details about the BC imitators
All BC imitators are trained with seed 0.

(2 additional binary image files added, not shown. Some files were omitted because too many files changed in this diff.)