Compare commits
21 commits (matthias-h ... peterz_tes)

Commits (SHA1):
2c818245d6, ae8e7fd16b, f272969325, a6b1bc70f1, 36ee5d1707, 24fe3d6576, 9cb7ece338,
9cf95a0054, 8b781038cc, 69f25c6028, 2b0283b9db, 1f8a03f3a6, 3cc7df0608, 8b3a6c2051,
569bd42629, f49a9c3d85, 14f2f9328c, 6bdf2f55a2, 97be70d6c8, b71152eea0, 3d1e171b3a
.travis.yml (new file, 14 lines)
@@ -0,0 +1,14 @@
language: python
python:
  - "3.6"

services:
  - docker

install:
  - pip install flake8
  - docker build . -t baselines-test

script:
  - flake8 --select=F baselines/common
  - docker run baselines-test pytest
Dockerfile (new file, 20 lines)
@@ -0,0 +1,20 @@
FROM ubuntu:16.04

RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake
ENV CODE_DIR /root/code
ENV VENV /root/venv

COPY . $CODE_DIR/baselines
RUN \
    pip install virtualenv && \
    virtualenv $VENV --python=python3 && \
    . $VENV/bin/activate && \
    cd $CODE_DIR && \
    pip install --upgrade pip && \
    pip install -e baselines && \
    pip install pytest

ENV PATH=$VENV/bin:$PATH
WORKDIR $CODE_DIR/baselines

CMD /bin/bash
README.md (55 lines changed)
@@ -1,4 +1,4 @@
<img src="data/logo.jpg" width=25% align="right" />
<img src="data/logo.jpg" width=25% align="right" /> [](https://travis-ci.org/openai/baselines)

# Baselines

@@ -6,13 +6,63 @@ OpenAI Baselines is a set of high-quality implementations of reinforcement learn

These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.

You can install it by typing:
## Prerequisites
Baselines requires Python 3 (>= 3.5) with the development headers. You'll also need the system packages CMake, OpenMPI and zlib. These can be installed as follows.
### Ubuntu

```bash
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
```

### Mac OS X
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
```bash
brew install cmake openmpi
```

## Virtual environment
To keep packages from different projects from interfering with each other, it is a good idea to use virtual environments (virtualenvs). You can install virtualenv (which is itself a pip package) via
```bash
pip install virtualenv
```
Virtualenvs are essentially folders that contain a copy of the Python executable and all installed Python packages.
To create a virtualenv called venv with Python 3, run
```bash
virtualenv /path/to/venv --python=python3
```
To activate a virtualenv:
```bash
. /path/to/venv/bin/activate
```
A more thorough tutorial on virtualenvs and their options can be found [here](https://virtualenv.pypa.io/en/stable/).


## Installation
Clone the repo and cd into it:
```bash
git clone https://github.com/openai/baselines.git
cd baselines
```
If using virtualenv, create a new virtualenv and activate it:
```bash
virtualenv env --python=python3
. env/bin/activate
```
Install the baselines package:
```bash
pip install -e .
```

### MuJoCo
Some of the baselines examples use the [MuJoCo](http://www.mujoco.org) (multi-joint dynamics with contact) physics simulator, which is proprietary and requires binaries and a license (a temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py).

## Testing the installation
All unit tests in baselines can be run using the pytest runner:
```bash
pip install pytest
pytest
```

## Subpackages

- [A2C](baselines/a2c)
- [ACER](baselines/acer)

@@ -20,6 +70,7 @@ pip install -e .
- [DDPG](baselines/ddpg)
- [DQN](baselines/deepq)
- [GAIL](baselines/gail)
- [HER](baselines/her)
- [PPO1](baselines/ppo1) (multi-CPU using MPI)
- [PPO2](baselines/ppo2) (optimized for GPU)
- [TRPO](baselines/trpo_mpi)
@@ -1,16 +1,12 @@
|
||||
import os
|
||||
import os.path as osp
|
||||
import gym
|
||||
import time
|
||||
import joblib
|
||||
import logging
|
||||
import numpy as np
|
||||
import tensorflow as tf
|
||||
from baselines import logger
|
||||
|
||||
from baselines.common import set_global_seeds, explained_variance
|
||||
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
|
||||
from baselines.common.atari_wrappers import wrap_deepmind
|
||||
from baselines.common.runners import AbstractEnvRunner
|
||||
from baselines.common import tf_util
|
||||
|
||||
from baselines.a2c.utils import discount_with_dones
|
||||
@@ -24,7 +20,6 @@ class Model(object):
|
||||
alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):
|
||||
|
||||
sess = tf_util.make_session()
|
||||
nact = ac_space.n
|
||||
nbatch = nenvs*nsteps
|
||||
|
||||
A = tf.placeholder(tf.int32, [nbatch])
|
||||
@@ -67,7 +62,7 @@ class Model(object):
|
||||
|
||||
def save(save_path):
|
||||
ps = sess.run(params)
|
||||
make_path(save_path)
|
||||
make_path(osp.dirname(save_path))
|
||||
joblib.dump(ps, save_path)
|
||||
|
||||
def load(load_path):
|
||||
@@ -75,7 +70,7 @@ class Model(object):
|
||||
restores = []
|
||||
for p, loaded_p in zip(params, loaded_params):
|
||||
restores.append(p.assign(loaded_p))
|
||||
ps = sess.run(restores)
|
||||
sess.run(restores)
|
||||
|
||||
self.train = train
|
||||
self.train_model = train_model
|
||||
@@ -87,21 +82,11 @@ class Model(object):
|
||||
self.load = load
|
||||
tf.global_variables_initializer().run(session=sess)
|
||||
|
||||
class Runner(object):
|
||||
class Runner(AbstractEnvRunner):
|
||||
|
||||
def __init__(self, env, model, nsteps=5, gamma=0.99):
|
||||
self.env = env
|
||||
self.model = model
|
||||
nh, nw, nc = env.observation_space.shape
|
||||
nenv = env.num_envs
|
||||
self.batch_ob_shape = (nenv*nsteps, nh, nw, nc)
|
||||
self.obs = np.zeros((nenv, nh, nw, nc), dtype=np.uint8)
|
||||
self.nc = nc
|
||||
obs = env.reset()
|
||||
super().__init__(env=env, model=model, nsteps=nsteps)
|
||||
self.gamma = gamma
|
||||
self.nsteps = nsteps
|
||||
self.states = model.initial_state
|
||||
self.dones = [False for _ in range(nenv)]
|
||||
|
||||
def run(self):
|
||||
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
|
||||
@@ -146,7 +131,6 @@ class Runner(object):
|
||||
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
|
||||
|
||||
def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5, lr=7e-4, lrschedule='linear', epsilon=1e-5, alpha=0.99, gamma=0.99, log_interval=100):
|
||||
tf.reset_default_graph()
|
||||
set_global_seeds(seed)
|
||||
|
||||
nenvs = env.num_envs
|
||||
@@ -173,3 +157,4 @@ def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, e
|
||||
logger.record_tabular("explained_variance", float(ev))
|
||||
logger.dump_tabular()
|
||||
env.close()
|
||||
return model
|
||||
|
@@ -2,39 +2,36 @@ import numpy as np
|
||||
import tensorflow as tf
|
||||
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
|
||||
from baselines.common.distributions import make_pdtype
|
||||
from baselines.common.input import observation_input
|
||||
|
||||
def nature_cnn(unscaled_images):
|
||||
def nature_cnn(unscaled_images, **conv_kwargs):
|
||||
"""
|
||||
CNN from Nature paper.
|
||||
"""
|
||||
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
|
||||
activ = tf.nn.relu
|
||||
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2)))
|
||||
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2)))
|
||||
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2)))
|
||||
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
|
||||
**conv_kwargs))
|
||||
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
|
||||
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
|
||||
h3 = conv_to_fc(h3)
|
||||
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
|
||||
|
||||
class LnLstmPolicy(object):
|
||||
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
|
||||
nenv = nbatch // nsteps
|
||||
nh, nw, nc = ob_space.shape
|
||||
ob_shape = (nbatch, nh, nw, nc)
|
||||
nact = ac_space.n
|
||||
X = tf.placeholder(tf.uint8, ob_shape) #obs
|
||||
X, processed_x = observation_input(ob_space, nbatch)
|
||||
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
|
||||
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
|
||||
self.pdtype = make_pdtype(ac_space)
|
||||
with tf.variable_scope("model", reuse=reuse):
|
||||
h = nature_cnn(X)
|
||||
h = nature_cnn(processed_x)
|
||||
xs = batch_to_seq(h, nenv, nsteps)
|
||||
ms = batch_to_seq(M, nenv, nsteps)
|
||||
h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
|
||||
h5 = seq_to_batch(h5)
|
||||
pi = fc(h5, 'pi', nact)
|
||||
vf = fc(h5, 'v', 1)
|
||||
|
||||
self.pdtype = make_pdtype(ac_space)
|
||||
self.pd = self.pdtype.pdfromflat(pi)
|
||||
self.pd, self.pi = self.pdtype.pdfromlatent(h5)
|
||||
|
||||
v0 = vf[:, 0]
|
||||
a0 = self.pd.sample()
|
||||
@@ -50,7 +47,6 @@ class LnLstmPolicy(object):
|
||||
self.X = X
|
||||
self.M = M
|
||||
self.S = S
|
||||
self.pi = pi
|
||||
self.vf = vf
|
||||
self.step = step
|
||||
self.value = value
|
||||
@@ -59,11 +55,9 @@ class LstmPolicy(object):
|
||||
|
||||
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
|
||||
nenv = nbatch // nsteps
|
||||
self.pdtype = make_pdtype(ac_space)
|
||||
X, processed_x = observation_input(ob_space, nbatch)
|
||||
|
||||
nh, nw, nc = ob_space.shape
|
||||
ob_shape = (nbatch, nh, nw, nc)
|
||||
nact = ac_space.n
|
||||
X = tf.placeholder(tf.uint8, ob_shape) #obs
|
||||
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
|
||||
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
|
||||
with tf.variable_scope("model", reuse=reuse):
|
||||
@@ -72,11 +66,8 @@ class LstmPolicy(object):
|
||||
ms = batch_to_seq(M, nenv, nsteps)
|
||||
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
|
||||
h5 = seq_to_batch(h5)
|
||||
pi = fc(h5, 'pi', nact)
|
||||
vf = fc(h5, 'v', 1)
|
||||
|
||||
self.pdtype = make_pdtype(ac_space)
|
||||
self.pd = self.pdtype.pdfromflat(pi)
|
||||
self.pd, self.pi = self.pdtype.pdfromlatent(h5)
|
||||
|
||||
v0 = vf[:, 0]
|
||||
a0 = self.pd.sample()
|
||||
@@ -92,25 +83,19 @@ class LstmPolicy(object):
|
||||
self.X = X
|
||||
self.M = M
|
||||
self.S = S
|
||||
self.pi = pi
|
||||
self.vf = vf
|
||||
self.step = step
|
||||
self.value = value
|
||||
|
||||
class CnnPolicy(object):
|
||||
|
||||
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
|
||||
nh, nw, nc = ob_space.shape
|
||||
ob_shape = (nbatch, nh, nw, nc)
|
||||
nact = ac_space.n
|
||||
X = tf.placeholder(tf.uint8, ob_shape) #obs
|
||||
with tf.variable_scope("model", reuse=reuse):
|
||||
h = nature_cnn(X)
|
||||
pi = fc(h, 'pi', nact, init_scale=0.01)
|
||||
vf = fc(h, 'v', 1)[:,0]
|
||||
|
||||
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False, **conv_kwargs): #pylint: disable=W0613
|
||||
self.pdtype = make_pdtype(ac_space)
|
||||
self.pd = self.pdtype.pdfromflat(pi)
|
||||
X, processed_x = observation_input(ob_space, nbatch)
|
||||
with tf.variable_scope("model", reuse=reuse):
|
||||
h = nature_cnn(processed_x, **conv_kwargs)
|
||||
vf = fc(h, 'v', 1)[:,0]
|
||||
self.pd, self.pi = self.pdtype.pdfromlatent(h, init_scale=0.01)
|
||||
|
||||
a0 = self.pd.sample()
|
||||
neglogp0 = self.pd.neglogp(a0)
|
||||
@@ -124,31 +109,25 @@ class CnnPolicy(object):
|
||||
return sess.run(vf, {X:ob})
|
||||
|
||||
self.X = X
|
||||
self.pi = pi
|
||||
self.vf = vf
|
||||
self.step = step
|
||||
self.value = value
|
||||
|
||||
class MlpPolicy(object):
|
||||
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
|
||||
ob_shape = (nbatch,) + ob_space.shape
|
||||
actdim = ac_space.shape[0]
|
||||
X = tf.placeholder(tf.float32, ob_shape, name='Ob') #obs
|
||||
with tf.variable_scope("model", reuse=reuse):
|
||||
activ = tf.tanh
|
||||
h1 = activ(fc(X, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
|
||||
h2 = activ(fc(h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
|
||||
pi = fc(h2, 'pi', actdim, init_scale=0.01)
|
||||
h1 = activ(fc(X, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
|
||||
h2 = activ(fc(h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
|
||||
vf = fc(h2, 'vf', 1)[:,0]
|
||||
logstd = tf.get_variable(name="logstd", shape=[1, actdim],
|
||||
initializer=tf.zeros_initializer())
|
||||
|
||||
pdparam = tf.concat([pi, pi * 0.0 + logstd], axis=1)
|
||||
|
||||
self.pdtype = make_pdtype(ac_space)
|
||||
self.pd = self.pdtype.pdfromflat(pdparam)
|
||||
with tf.variable_scope("model", reuse=reuse):
|
||||
X, processed_x = observation_input(ob_space, nbatch)
|
||||
activ = tf.tanh
|
||||
processed_x = tf.layers.flatten(processed_x)
|
||||
pi_h1 = activ(fc(processed_x, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
|
||||
pi_h2 = activ(fc(pi_h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
|
||||
vf_h1 = activ(fc(processed_x, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
|
||||
vf_h2 = activ(fc(vf_h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
|
||||
vf = fc(vf_h2, 'vf', 1)[:,0]
|
||||
|
||||
self.pd, self.pi = self.pdtype.pdfromlatent(pi_h2, init_scale=0.01)
|
||||
|
||||
|
||||
a0 = self.pd.sample()
|
||||
neglogp0 = self.pd.neglogp(a0)
|
||||
@@ -162,7 +141,6 @@ class MlpPolicy(object):
|
||||
return sess.run(vf, {X:ob})
|
||||
|
||||
self.X = X
|
||||
self.pi = pi
|
||||
self.vf = vf
|
||||
self.step = step
|
||||
self.value = value
|
||||
|
@@ -39,12 +39,26 @@ def ortho_init(scale=1.0):
|
||||
return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
|
||||
return _ortho_init
|
||||
|
||||
def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0):
|
||||
def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC', one_dim_bias=False):
|
||||
if data_format == 'NHWC':
|
||||
channel_ax = 3
|
||||
strides = [1, stride, stride, 1]
|
||||
bshape = [1, 1, 1, nf]
|
||||
elif data_format == 'NCHW':
|
||||
channel_ax = 1
|
||||
strides = [1, 1, stride, stride]
|
||||
bshape = [1, nf, 1, 1]
|
||||
else:
|
||||
raise NotImplementedError
|
||||
bias_var_shape = [nf] if one_dim_bias else [1, nf, 1, 1]
|
||||
nin = x.get_shape()[channel_ax].value
|
||||
wshape = [rf, rf, nin, nf]
|
||||
with tf.variable_scope(scope):
|
||||
nin = x.get_shape()[3].value
|
||||
w = tf.get_variable("w", [rf, rf, nin, nf], initializer=ortho_init(init_scale))
|
||||
b = tf.get_variable("b", [nf], initializer=tf.constant_initializer(0.0))
|
||||
return tf.nn.conv2d(x, w, strides=[1, stride, stride, 1], padding=pad)+b
|
||||
w = tf.get_variable("w", wshape, initializer=ortho_init(init_scale))
|
||||
b = tf.get_variable("b", bias_var_shape, initializer=tf.constant_initializer(0.0))
|
||||
if not one_dim_bias and data_format == 'NHWC':
|
||||
b = tf.reshape(b, bshape)
|
||||
return b + tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format)
|
||||
|
||||
def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
|
||||
with tf.variable_scope(scope):
|
||||
|
@@ -5,6 +5,7 @@ import tensorflow as tf
|
||||
from baselines import logger
|
||||
|
||||
from baselines.common import set_global_seeds
|
||||
from baselines.common.runners import AbstractEnvRunner
|
||||
|
||||
from baselines.a2c.utils import batch_to_seq, seq_to_batch
|
||||
from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
|
||||
@@ -13,6 +14,8 @@ from baselines.a2c.utils import EpisodeStats
|
||||
from baselines.a2c.utils import get_by_index, check_shape, avg_norm, gradient_add, q_explained_variance
|
||||
from baselines.acer.buffer import Buffer
|
||||
|
||||
import os.path as osp
|
||||
|
||||
# remove last step
|
||||
def strip(var, nenvs, nsteps, flat = False):
|
||||
vars = batch_to_seq(var, nenvs, nsteps + 1, flat)
|
||||
@@ -198,7 +201,7 @@ class Model(object):
|
||||
|
||||
def save(save_path):
|
||||
ps = sess.run(params)
|
||||
make_path(save_path)
|
||||
make_path(osp.dirname(save_path))
|
||||
joblib.dump(ps, save_path)
|
||||
|
||||
self.train = train
|
||||
@@ -209,11 +212,10 @@ class Model(object):
|
||||
self.initial_state = step_model.initial_state
|
||||
tf.global_variables_initializer().run(session=sess)
|
||||
|
||||
class Runner(object):
|
||||
class Runner(AbstractEnvRunner):
|
||||
def __init__(self, env, model, nsteps, nstack):
|
||||
self.env = env
|
||||
super().__init__(env=env, model=model, nsteps=nsteps)
|
||||
self.nstack = nstack
|
||||
self.model = model
|
||||
nh, nw, nc = env.observation_space.shape
|
||||
self.nc = nc # nc = 1 for atari, but just in case
|
||||
self.nenv = nenv = env.num_envs
|
||||
@@ -223,9 +225,6 @@ class Runner(object):
|
||||
self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
|
||||
obs = env.reset()
|
||||
self.update_obs(obs)
|
||||
self.nsteps = nsteps
|
||||
self.states = model.initial_state
|
||||
self.dones = [False for _ in range(nenv)]
|
||||
|
||||
def update_obs(self, obs, dones=None):
|
||||
if dones is not None:
|
||||
|
@@ -1,5 +1,7 @@
|
||||
#!/usr/bin/env python3
|
||||
|
||||
from functools import partial
|
||||
|
||||
from baselines import logger
|
||||
from baselines.acktr.acktr_disc import learn
|
||||
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
|
||||
@@ -8,7 +10,7 @@ from baselines.ppo2.policies import CnnPolicy
|
||||
|
||||
def train(env_id, num_timesteps, seed, num_cpu):
|
||||
env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
|
||||
policy_fn = CnnPolicy
|
||||
policy_fn = partial(CnnPolicy, one_dim_bias=True)
|
||||
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
|
||||
env.close()
|
||||
|
||||
|
@@ -9,6 +9,8 @@ _atariexpl7 = ['Freeway', 'Gravitar', 'MontezumaRevenge', 'Pitfall', 'PrivateEye
|
||||
_BENCHMARKS = []
|
||||
|
||||
remove_version_re = re.compile(r'-v\d+$')
|
||||
|
||||
|
||||
def register_benchmark(benchmark):
|
||||
for b in _BENCHMARKS:
|
||||
if b['name'] == benchmark['name']:
|
||||
@@ -138,3 +140,11 @@ register_benchmark({
|
||||
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari50]
|
||||
})
|
||||
|
||||
# HER DDPG
|
||||
|
||||
register_benchmark({
|
||||
'name': 'HerDdpg',
|
||||
'description': 'Smoke-test only benchmark of HER',
|
||||
'tasks': [{'trials': 1, 'env_id': 'FetchReach-v1'}]
|
||||
})
|
||||
|
||||
|
@@ -7,12 +7,13 @@ from glob import glob
|
||||
import csv
|
||||
import os.path as osp
|
||||
import json
|
||||
import numpy as np
|
||||
|
||||
class Monitor(Wrapper):
|
||||
EXT = "monitor.csv"
|
||||
f = None
|
||||
|
||||
def __init__(self, env, filename, allow_early_resets=False, reset_keywords=()):
|
||||
def __init__(self, env, filename, allow_early_resets=False, reset_keywords=(), info_keywords=()):
|
||||
Wrapper.__init__(self, env=env)
|
||||
self.tstart = time.time()
|
||||
if filename is None:
|
||||
@@ -26,10 +27,12 @@ class Monitor(Wrapper):
|
||||
filename = filename + "." + Monitor.EXT
|
||||
self.f = open(filename, "wt")
|
||||
self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, 'env_id' : env.spec and env.spec.id}))
|
||||
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords)
|
||||
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords+info_keywords)
|
||||
self.logger.writeheader()
|
||||
self.f.flush()
|
||||
|
||||
self.reset_keywords = reset_keywords
|
||||
self.info_keywords = info_keywords
|
||||
self.allow_early_resets = allow_early_resets
|
||||
self.rewards = None
|
||||
self.needs_reset = True
|
||||
@@ -61,6 +64,8 @@ class Monitor(Wrapper):
|
||||
eprew = sum(self.rewards)
|
||||
eplen = len(self.rewards)
|
||||
epinfo = {"r": round(eprew, 6), "l": eplen, "t": round(time.time() - self.tstart, 6)}
|
||||
for k in self.info_keywords:
|
||||
epinfo[k] = info[k]
|
||||
self.episode_rewards.append(eprew)
|
||||
self.episode_lengths.append(eplen)
|
||||
self.episode_times.append(time.time() - self.tstart)
|
||||
|
@@ -1,3 +1,4 @@
|
||||
# flake8: noqa F403
|
||||
from baselines.common.console_util import *
|
||||
from baselines.common.dataset import Dataset
|
||||
from baselines.common.math_util import *
|
||||
|
@@ -98,9 +98,6 @@ class MaxAndSkipEnv(gym.Wrapper):
|
||||
self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
|
||||
self._skip = skip
|
||||
|
||||
def reset(self):
|
||||
return self.env.reset()
|
||||
|
||||
def step(self, action):
|
||||
"""Repeat action, sum reward, and max over last observations."""
|
||||
total_reward = 0.0
|
||||
|
@@ -3,13 +3,14 @@ Helpers for scripts like run_atari.py.
|
||||
"""
|
||||
|
||||
import os
|
||||
from mpi4py import MPI
|
||||
import gym
|
||||
from gym.wrappers import FlattenDictWrapper
|
||||
from baselines import logger
|
||||
from baselines.bench import Monitor
|
||||
from baselines.common import set_global_seeds
|
||||
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
|
||||
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
|
||||
from mpi4py import MPI
|
||||
|
||||
def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
|
||||
"""
|
||||
@@ -27,12 +28,26 @@ def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
|
||||
return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
|
||||
|
||||
def make_mujoco_env(env_id, seed):
|
||||
"""
|
||||
Create a wrapped, monitored gym.Env for MuJoCo.
|
||||
"""
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
set_global_seeds(seed + 10000 * rank)
|
||||
env = gym.make(env_id)
|
||||
env = Monitor(env, os.path.join(logger.get_dir(), str(rank)))
|
||||
env.seed(seed)
|
||||
return env
|
||||
|
||||
def make_robotics_env(env_id, seed, rank=0):
|
||||
"""
|
||||
Create a wrapped, monitored gym.Env for MuJoCo.
|
||||
"""
|
||||
set_global_seeds(seed)
|
||||
env = gym.make(env_id)
|
||||
env = Monitor(env, logger.get_dir())
|
||||
env = FlattenDictWrapper(env, ['observation', 'desired_goal'])
|
||||
env = Monitor(
|
||||
env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)),
|
||||
info_keywords=('is_success',))
|
||||
env.seed(seed)
|
||||
return env
|
||||
|
||||
@@ -58,7 +73,18 @@ def mujoco_arg_parser():
|
||||
Create an argparse.ArgumentParser for run_mujoco.py.
|
||||
"""
|
||||
parser = arg_parser()
|
||||
parser.add_argument('--env', help='environment ID', type=str, default="Reacher-v1")
|
||||
parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
|
||||
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
|
||||
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
|
||||
parser.add_argument('--play', default=False, action='store_true')
|
||||
return parser
|
||||
|
||||
def robotics_arg_parser():
|
||||
"""
|
||||
Create an argparse.ArgumentParser for run_mujoco.py.
|
||||
"""
|
||||
parser = arg_parser()
|
||||
parser.add_argument('--env', help='environment ID', type=str, default='FetchReach-v0')
|
||||
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
|
||||
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
|
||||
return parser
|
||||
|
@@ -16,7 +16,12 @@ def fmt_item(x, l):
|
||||
if isinstance(x, np.ndarray):
|
||||
assert x.ndim==0
|
||||
x = x.item()
|
||||
if isinstance(x, float): rep = "%g"%x
|
||||
if isinstance(x, (float, np.float32, np.float64)):
|
||||
v = abs(x)
|
||||
if (v < 1e-4 or v > 1e+4) and v > 0:
|
||||
rep = "%7.2e" % x
|
||||
else:
|
||||
rep = "%7.5f" % x
|
||||
else: rep = str(x)
|
||||
return " "*(l - len(rep)) + rep
|
||||
|
||||
|
@@ -1,6 +1,7 @@
|
||||
import tensorflow as tf
|
||||
import numpy as np
|
||||
import baselines.common.tf_util as U
|
||||
from baselines.a2c.utils import fc
|
||||
from tensorflow.python.ops import math_ops
|
||||
|
||||
class Pd(object):
|
||||
@@ -31,6 +32,8 @@ class PdType(object):
|
||||
raise NotImplementedError
|
||||
def pdfromflat(self, flat):
|
||||
return self.pdclass()(flat)
|
||||
def pdfromlatent(self, latent_vector):
|
||||
raise NotImplementedError
|
||||
def param_shape(self):
|
||||
raise NotImplementedError
|
||||
def sample_shape(self):
|
||||
@@ -48,6 +51,10 @@ class CategoricalPdType(PdType):
|
||||
self.ncat = ncat
|
||||
def pdclass(self):
|
||||
return CategoricalPd
|
||||
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
|
||||
pdparam = fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
|
||||
return self.pdfromflat(pdparam), pdparam
|
||||
|
||||
def param_shape(self):
|
||||
return [self.ncat]
|
||||
def sample_shape(self):
|
||||
@@ -75,6 +82,13 @@ class DiagGaussianPdType(PdType):
|
||||
self.size = size
|
||||
def pdclass(self):
|
||||
return DiagGaussianPd
|
||||
|
||||
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
|
||||
mean = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
|
||||
logstd = tf.get_variable(name='logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
|
||||
pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
|
||||
return self.pdfromflat(pdparam), mean
|
||||
|
||||
def param_shape(self):
|
||||
return [2*self.size]
|
||||
def sample_shape(self):
|
||||
|
baselines/common/identity_env.py (new file, 30 lines)
@@ -0,0 +1,30 @@
from gym import Env
from gym.spaces import Discrete


class IdentityEnv(Env):
    def __init__(
            self,
            dim,
            ep_length=100,
    ):

        self.action_space = Discrete(dim)
        self.reset()

    def reset(self):
        self._choose_next_state()
        self.observation_space = self.action_space

        return self.state

    def step(self, actions):
        rew = self._get_reward(actions)
        self._choose_next_state()
        return self.state, rew, False, {}

    def _choose_next_state(self):
        self.state = self.action_space.sample()

    def _get_reward(self, actions):
        return 1 if self.state == actions else 0
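For reference, a minimal interaction sketch (not in the changeset itself): the reward is 1 exactly when the action equals the observation returned before the step, and `done` is never set.

```python
# Minimal sketch, assuming the IdentityEnv added above is importable from baselines.
from baselines.common.identity_env import IdentityEnv

env = IdentityEnv(10)                        # Discrete(10) action/observation space
obs = env.reset()
next_obs, rew, done, info = env.step(obs)    # echoing the observation back...
assert rew == 1 and done is False            # ...earns reward 1; done is always False
```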
baselines/common/input.py (new file, 30 lines)
@@ -0,0 +1,30 @@
import tensorflow as tf
from gym.spaces import Discrete, Box

def observation_input(ob_space, batch_size=None, name='Ob'):
    '''
    Build observation input with encoding depending on the
    observation space type
    Params:

    ob_space: observation space (should be one of gym.spaces)
    batch_size: batch size for input (default is None, so that resulting input placeholder can take tensors with any batch size)
    name: tensorflow variable name for input placeholder

    returns: tuple (input_placeholder, processed_input_tensor)
    '''
    if isinstance(ob_space, Discrete):
        input_x = tf.placeholder(shape=(batch_size,), dtype=tf.int32, name=name)
        processed_x = tf.to_float(tf.one_hot(input_x, ob_space.n))
        return input_x, processed_x

    elif isinstance(ob_space, Box):
        input_shape = (batch_size,) + ob_space.shape
        input_x = tf.placeholder(shape=input_shape, dtype=ob_space.dtype, name=name)
        processed_x = tf.to_float(input_x)
        return input_x, processed_x

    else:
        raise NotImplementedError
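A small usage sketch for observation_input (not in the changeset itself), covering the two supported space types; it assumes TF1-style graph mode as used throughout baselines.

```python
# Minimal sketch, assuming graph-mode TensorFlow 1.x.
import numpy as np
import tensorflow as tf
from gym.spaces import Box, Discrete

from baselines.common.input import observation_input

# Discrete observations become one-hot float32 tensors.
disc_ph, disc_enc = observation_input(Discrete(4))
# Box observations are passed through and cast to float32.
box_ph, box_enc = observation_input(Box(low=0.0, high=1.0, shape=(84, 84, 3)), name='frames')

with tf.Session() as sess:
    one_hot = sess.run(disc_enc, {disc_ph: np.array([0, 3])})
    print(one_hot.shape)   # (2, 4)
```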
@@ -2,6 +2,7 @@ from mpi4py import MPI
|
||||
import numpy as np
|
||||
from baselines.common import zipsame
|
||||
|
||||
|
||||
def mpi_mean(x, axis=0, comm=None, keepdims=False):
|
||||
x = np.asarray(x)
|
||||
assert x.ndim > 0
|
||||
|
baselines/common/runners.py (new file, 18 lines)
@@ -0,0 +1,18 @@
import numpy as np
from abc import ABC, abstractmethod

class AbstractEnvRunner(ABC):
    def __init__(self, *, env, model, nsteps):
        self.env = env
        self.model = model
        nenv = env.num_envs
        self.batch_ob_shape = (nenv*nsteps,) + env.observation_space.shape
        self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
        self.obs[:] = env.reset()
        self.nsteps = nsteps
        self.states = model.initial_state
        self.dones = [False for _ in range(nenv)]

    @abstractmethod
    def run(self):
        raise NotImplementedError
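A concrete-runner sketch (not in the changeset itself) showing what a subclass of AbstractEnvRunner supplies; it assumes a vectorized env and a model whose step(obs) returns the batch of actions first, as the A2C/PPO2 models in this changeset do.

```python
# Minimal sketch of a concrete runner; `env` is assumed to be a VecEnv and
# `model.step(obs)` is assumed to return a tuple whose first element is the actions.
import numpy as np
from baselines.common.runners import AbstractEnvRunner

class SimpleRunner(AbstractEnvRunner):
    def run(self):
        mb_obs, mb_rewards, mb_dones = [], [], []
        for _ in range(self.nsteps):
            actions = self.model.step(self.obs)[0]
            mb_obs.append(self.obs.copy())
            self.obs[:], rewards, self.dones, _ = self.env.step(actions)
            mb_rewards.append(rewards)
            mb_dones.append(self.dones)
        # (nsteps, nenv, ...) -> (nsteps*nenv, ...) to match batch_ob_shape
        return np.asarray(mb_obs).reshape(self.batch_ob_shape), np.asarray(mb_rewards), np.asarray(mb_dones)
```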
@@ -12,10 +12,9 @@ class SegmentTree(object):

        a) setting item's value is slightly slower.
           It is O(lg capacity) instead of O(1).
        b) user has access to an efficient `reduce`
           operation which reduces `operation` over
           a contiguous subsequence of items in the
           array.
        b) user has access to an efficient ( O(log segment size) )
           `reduce` operation which reduces `operation` over
           a contiguous subsequence of items in the array.

        Paramters
        ---------
@@ -23,8 +22,8 @@
            Total size of the array - must be a power of two.
        operation: lambda obj, obj -> obj
            and operation for combining elements (eg. sum, max)
            must for a mathematical group together with the set of
            possible values for array elements.
            must form a mathematical group together with the set of
            possible values for array elements (i.e. be associative)
        neutral_element: obj
            neutral element for the operation above. eg. float('-inf')
            for max and 0 for sum.
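To illustrate the `reduce` operation described in this docstring, a short sketch (not in the changeset itself) using the SumSegmentTree variant from the same module:

```python
# Minimal sketch; capacity must be a power of two, as the docstring above requires.
from baselines.common.segment_tree import SumSegmentTree

tree = SumSegmentTree(4)
for i, v in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree[i] = v                 # O(log capacity) per assignment
print(tree.sum(0, 3))           # reduce (+) over items 0..2 -> 6.0
```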
baselines/common/test_identity.py (new file, 44 lines)
@@ -0,0 +1,44 @@
import pytest
import tensorflow as tf
import random
import numpy as np
from gym.spaces import np_random

from baselines.a2c import a2c
from baselines.ppo2 import ppo2
from baselines.common.identity_env import IdentityEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2.policies import MlpPolicy


learn_func_list = [
    lambda e: a2c.learn(policy=MlpPolicy, env=e, seed=0, total_timesteps=50000),
    lambda e: ppo2.learn(policy=MlpPolicy, env=e, total_timesteps=50000, lr=1e-3, nsteps=128, ent_coef=0.01)
]


@pytest.mark.slow
@pytest.mark.parametrize("learn_func", learn_func_list)
def test_identity(learn_func):
    '''
    Test if the algorithm (with a given policy)
    can learn an identity transformation (i.e. return observation as an action)
    '''
    np.random.seed(0)
    np_random.seed(0)
    random.seed(0)

    env = DummyVecEnv([lambda: IdentityEnv(10)])

    with tf.Graph().as_default(), tf.Session().as_default():
        tf.set_random_seed(0)
        model = learn_func(env)

        N_TRIALS = 1000
        sum_rew = 0
        obs = env.reset()
        for i in range(N_TRIALS):
            obs, rew, done, _ = env.step(model.step(obs)[0])
            sum_rew += rew

        assert sum_rew > 0.9 * N_TRIALS
@@ -33,7 +33,6 @@ def test_multikwargs():
|
||||
initialize()
|
||||
assert lin(2) == 6
|
||||
assert lin(2, 2) == 10
|
||||
expt_caught = False
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
|
@@ -48,18 +48,17 @@ def huber_loss(x, delta=1.0):
|
||||
# Global session
|
||||
# ================================================================
|
||||
|
||||
def make_session(num_cpu=None, make_default=False):
|
||||
def make_session(num_cpu=None, make_default=False, graph=None):
|
||||
"""Returns a session that will use <num_cpu> CPU's only"""
|
||||
if num_cpu is None:
|
||||
num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
|
||||
tf_config = tf.ConfigProto(
|
||||
inter_op_parallelism_threads=num_cpu,
|
||||
intra_op_parallelism_threads=num_cpu)
|
||||
tf_config.gpu_options.allocator_type = 'BFC'
|
||||
if make_default:
|
||||
return tf.InteractiveSession(config=tf_config)
|
||||
return tf.InteractiveSession(config=tf_config, graph=graph)
|
||||
else:
|
||||
return tf.Session(config=tf_config)
|
||||
return tf.Session(config=tf_config, graph=graph)
|
||||
|
||||
def single_threaded_session():
|
||||
"""Returns a session which will only use a single CPU"""
|
||||
@@ -84,10 +83,10 @@ def initialize():
|
||||
# Model components
|
||||
# ================================================================
|
||||
|
||||
def normc_initializer(std=1.0):
|
||||
def normc_initializer(std=1.0, axis=0):
|
||||
def _initializer(shape, dtype=None, partition_info=None): # pylint: disable=W0613
|
||||
out = np.random.randn(*shape).astype(np.float32)
|
||||
out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
|
||||
out *= std / np.sqrt(np.square(out).sum(axis=axis, keepdims=True))
|
||||
return tf.constant(out)
|
||||
return _initializer
|
||||
|
||||
@@ -261,3 +260,45 @@ def get_placeholder_cached(name):
|
||||
|
||||
def flattenallbut0(x):
|
||||
return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
|
||||
|
||||
|
||||
# ================================================================
|
||||
# Diagnostics
|
||||
# ================================================================
|
||||
|
||||
def display_var_info(vars):
|
||||
from baselines import logger
|
||||
count_params = 0
|
||||
for v in vars:
|
||||
name = v.name
|
||||
if "/Adam" in name or "beta1_power" in name or "beta2_power" in name: continue
|
||||
v_params = np.prod(v.shape.as_list())
|
||||
count_params += v_params
|
||||
if "/b:" in name or "/biases" in name: continue # Wx+b, bias is not interesting to look at => count params, but not print
|
||||
logger.info(" %s%s %i params %s" % (name, " "*(55-len(name)), v_params, str(v.shape)))
|
||||
|
||||
logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))
|
||||
|
||||
|
||||
def get_available_gpus():
|
||||
# recipe from here:
|
||||
# https://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
|
||||
|
||||
from tensorflow.python.client import device_lib
|
||||
local_device_protos = device_lib.list_local_devices()
|
||||
return [x.name for x in local_device_protos if x.device_type == 'GPU']
|
||||
|
||||
# ================================================================
|
||||
# Saving variables
|
||||
# ================================================================
|
||||
|
||||
def load_state(fname):
|
||||
saver = tf.train.Saver()
|
||||
saver.restore(tf.get_default_session(), fname)
|
||||
|
||||
def save_state(fname):
|
||||
os.makedirs(os.path.dirname(fname), exist_ok=True)
|
||||
saver = tf.train.Saver()
|
||||
saver.save(tf.get_default_session(), fname)
|
||||
|
||||
|
||||
|
baselines/common/tile_images.py (new file, 23 lines)
@@ -0,0 +1,23 @@
import numpy as np

def tile_images(img_nhwc):
    """
    Tile N images into one big PxQ image
    (P,Q) are chosen to be as close as possible, and if N
    is square, then P=Q.

    input: img_nhwc, list or array of images, ndim=4 once turned into array
        n = batch index, h = height, w = width, c = channel
    returns:
        bigim_HWc, ndarray with ndim=3
    """
    img_nhwc = np.asarray(img_nhwc)
    N, h, w, c = img_nhwc.shape
    H = int(np.ceil(np.sqrt(N)))
    W = int(np.ceil(float(N)/H))
    img_nhwc = np.array(list(img_nhwc) + [img_nhwc[0]*0 for _ in range(N, H*W)])
    img_HWhwc = img_nhwc.reshape(H, W, h, w, c)
    img_HhWwc = img_HWhwc.transpose(0, 2, 1, 3, 4)
    img_Hh_Ww_c = img_HhWwc.reshape(H*h, W*w, c)
    return img_Hh_Ww_c
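A quick usage sketch (not in the changeset itself) for tile_images, which the SubprocVecEnv render changes below rely on:

```python
# Minimal sketch: tile 7 RGB frames into a single 3x3 grid (unused tiles stay black).
import numpy as np
from baselines.common.tile_images import tile_images

frames = np.random.randint(0, 255, size=(7, 64, 64, 3), dtype=np.uint8)
grid = tile_images(frames)
print(grid.shape)   # (192, 192, 3): H = ceil(sqrt(7)) = 3 rows, W = ceil(7/3) = 3 cols
```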
@@ -20,20 +20,19 @@ class NotSteppingError(Exception):
|
||||
Exception.__init__(self, msg)
|
||||
|
||||
class VecEnv(ABC):
|
||||
|
||||
"""
|
||||
An abstract asynchronous, vectorized environment.
|
||||
"""
|
||||
def __init__(self, num_envs, observation_space, action_space):
|
||||
self.num_envs = num_envs
|
||||
self.observation_space = observation_space
|
||||
self.action_space = action_space
|
||||
|
||||
"""
|
||||
An abstract asynchronous, vectorized environment.
|
||||
"""
|
||||
@abstractmethod
|
||||
def reset(self):
|
||||
"""
|
||||
Reset all the environments and return an array of
|
||||
observations.
|
||||
observations, or a tuple of observation arrays.
|
||||
|
||||
If step_async is still doing work, that work will
|
||||
be cancelled and step_wait() should not be called
|
||||
@@ -59,10 +58,11 @@ class VecEnv(ABC):
|
||||
Wait for the step taken with step_async().
|
||||
|
||||
Returns (obs, rews, dones, infos):
|
||||
- obs: an array of observations
|
||||
- obs: an array of observations, or a tuple of
|
||||
arrays of observations.
|
||||
- rews: an array of rewards
|
||||
- dones: an array of "episode done" booleans
|
||||
- infos: an array of info objects
|
||||
- infos: a sequence of info objects
|
||||
"""
|
||||
pass
|
||||
|
||||
@@ -77,9 +77,16 @@ class VecEnv(ABC):
|
||||
self.step_async(actions)
|
||||
return self.step_wait()
|
||||
|
||||
def render(self):
|
||||
def render(self, mode='human'):
|
||||
logger.warn('Render not defined for %s'%self)
|
||||
|
||||
@property
|
||||
def unwrapped(self):
|
||||
if isinstance(self, VecEnvWrapper):
|
||||
return self.venv.unwrapped
|
||||
else:
|
||||
return self
|
||||
|
||||
class VecEnvWrapper(VecEnv):
|
||||
def __init__(self, venv, observation_space=None, action_space=None):
|
||||
self.venv = venv
|
||||
|
@@ -1,31 +1,67 @@
|
||||
import numpy as np
|
||||
from gym import spaces
|
||||
from collections import OrderedDict
|
||||
from . import VecEnv
|
||||
|
||||
class DummyVecEnv(VecEnv):
|
||||
def __init__(self, env_fns):
|
||||
self.envs = [fn() for fn in env_fns]
|
||||
env = self.envs[0]
|
||||
env = self.envs[0]
|
||||
VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
|
||||
self.ts = np.zeros(len(self.envs), dtype='int')
|
||||
shapes, dtypes = {}, {}
|
||||
self.keys = []
|
||||
obs_space = env.observation_space
|
||||
|
||||
if isinstance(obs_space, spaces.Dict):
|
||||
assert isinstance(obs_space.spaces, OrderedDict)
|
||||
subspaces = obs_space.spaces
|
||||
else:
|
||||
subspaces = {None: obs_space}
|
||||
|
||||
for key, box in subspaces.items():
|
||||
shapes[key] = box.shape
|
||||
dtypes[key] = box.dtype
|
||||
self.keys.append(key)
|
||||
|
||||
self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
|
||||
self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool)
|
||||
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
|
||||
self.buf_infos = [{} for _ in range(self.num_envs)]
|
||||
self.actions = None
|
||||
|
||||
def step_async(self, actions):
|
||||
self.actions = actions
|
||||
|
||||
def step_wait(self):
|
||||
results = [env.step(a) for (a,env) in zip(self.actions, self.envs)]
|
||||
obs, rews, dones, infos = map(np.array, zip(*results))
|
||||
self.ts += 1
|
||||
for (i, done) in enumerate(dones):
|
||||
if done:
|
||||
obs[i] = self.envs[i].reset()
|
||||
self.ts[i] = 0
|
||||
self.actions = None
|
||||
return np.array(obs), np.array(rews), np.array(dones), infos
|
||||
for e in range(self.num_envs):
|
||||
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(self.actions[e])
|
||||
if self.buf_dones[e]:
|
||||
obs = self.envs[e].reset()
|
||||
self._save_obs(e, obs)
|
||||
return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones),
|
||||
self.buf_infos.copy())
|
||||
|
||||
def reset(self):
|
||||
results = [env.reset() for env in self.envs]
|
||||
return np.array(results)
|
||||
def reset(self):
|
||||
for e in range(self.num_envs):
|
||||
obs = self.envs[e].reset()
|
||||
self._save_obs(e, obs)
|
||||
return self._obs_from_buf()
|
||||
|
||||
def close(self):
|
||||
return
|
||||
return
|
||||
|
||||
def render(self, mode='human'):
|
||||
return [e.render(mode=mode) for e in self.envs]
|
||||
|
||||
def _save_obs(self, e, obs):
|
||||
for k in self.keys:
|
||||
if k is None:
|
||||
self.buf_obs[k][e] = obs
|
||||
else:
|
||||
self.buf_obs[k][e] = obs[k]
|
||||
|
||||
def _obs_from_buf(self):
|
||||
if self.keys==[None]:
|
||||
return self.buf_obs[None]
|
||||
else:
|
||||
return self.buf_obs
|
||||
|
@@ -1,6 +1,7 @@
|
||||
import numpy as np
|
||||
from multiprocessing import Process, Pipe
|
||||
from baselines.common.vec_env import VecEnv, CloudpickleWrapper
|
||||
from baselines.common.tile_images import tile_images
|
||||
|
||||
|
||||
def worker(remote, parent_remote, env_fn_wrapper):
|
||||
@@ -16,9 +17,8 @@ def worker(remote, parent_remote, env_fn_wrapper):
|
||||
elif cmd == 'reset':
|
||||
ob = env.reset()
|
||||
remote.send(ob)
|
||||
elif cmd == 'reset_task':
|
||||
ob = env.reset_task()
|
||||
remote.send(ob)
|
||||
elif cmd == 'render':
|
||||
remote.send(env.render(mode='rgb_array'))
|
||||
elif cmd == 'close':
|
||||
remote.close()
|
||||
break
|
||||
@@ -81,3 +81,17 @@ class SubprocVecEnv(VecEnv):
|
||||
for p in self.ps:
|
||||
p.join()
|
||||
self.closed = True
|
||||
|
||||
def render(self, mode='human'):
|
||||
for pipe in self.remotes:
|
||||
pipe.send(('render', None))
|
||||
imgs = [pipe.recv() for pipe in self.remotes]
|
||||
bigimg = tile_images(imgs)
|
||||
if mode == 'human':
|
||||
import cv2
|
||||
cv2.imshow('vecenv', bigimg[:,:,::-1])
|
||||
cv2.waitKey(1)
|
||||
elif mode == 'rgb_array':
|
||||
return bigimg
|
||||
else:
|
||||
raise NotImplementedError
|
@@ -109,7 +109,7 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
|
||||
epoch_adaptive_distances = []
|
||||
for t_train in range(nb_train_steps):
|
||||
# Adapt param noise, if necessary.
|
||||
if memory.nb_entries >= batch_size and t % param_noise_adaption_interval == 0:
|
||||
if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
|
||||
distance = agent.adapt_param_noise()
|
||||
epoch_adaptive_distances.append(distance)
|
||||
|
||||
@@ -189,4 +189,3 @@ def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, pa
|
||||
if eval_env and hasattr(eval_env, 'get_state'):
|
||||
with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
|
||||
pickle.dump(eval_env.get_state(), f)
|
||||
|
||||
|
@@ -9,7 +9,7 @@ import baselines.common.tf_util as U
|
||||
from baselines import logger
|
||||
from baselines import deepq
|
||||
from baselines.deepq.replay_buffer import ReplayBuffer
|
||||
from baselines.deepq.utils import BatchInput
|
||||
from baselines.deepq.utils import ObservationInput
|
||||
from baselines.common.schedules import LinearSchedule
|
||||
|
||||
|
||||
@@ -28,7 +28,7 @@ if __name__ == '__main__':
|
||||
env = gym.make("CartPole-v0")
|
||||
# Create all the functions necessary to train the model
|
||||
act, train, update_target, debug = deepq.build_train(
|
||||
make_obs_ph=lambda name: BatchInput(env.observation_space.shape, name=name),
|
||||
make_obs_ph=lambda name: ObservationInput(env.observation_space, name=name),
|
||||
q_func=model,
|
||||
num_actions=env.action_space.n,
|
||||
optimizer=tf.train.AdamOptimizer(learning_rate=5e-4),
|
||||
|
@@ -5,13 +5,18 @@ import argparse
|
||||
from baselines import logger
|
||||
from baselines.common.atari_wrappers import make_atari
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
||||
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
|
||||
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
|
||||
parser.add_argument('--prioritized', type=int, default=1)
|
||||
parser.add_argument('--prioritized-replay-alpha', type=float, default=0.6)
|
||||
parser.add_argument('--dueling', type=int, default=1)
|
||||
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
|
||||
parser.add_argument('--checkpoint-freq', type=int, default=10000)
|
||||
parser.add_argument('--checkpoint-path', type=str, default=None)
|
||||
|
||||
args = parser.parse_args()
|
||||
logger.configure()
|
||||
set_global_seeds(args.seed)
|
||||
@@ -23,7 +28,8 @@ def main():
|
||||
hiddens=[256],
|
||||
dueling=bool(args.dueling),
|
||||
)
|
||||
act = deepq.learn(
|
||||
|
||||
deepq.learn(
|
||||
env,
|
||||
q_func=model,
|
||||
lr=1e-4,
|
||||
@@ -35,9 +41,12 @@ def main():
|
||||
learning_starts=10000,
|
||||
target_network_update_freq=1000,
|
||||
gamma=0.99,
|
||||
prioritized_replay=bool(args.prioritized)
|
||||
prioritized_replay=bool(args.prioritized),
|
||||
prioritized_replay_alpha=args.prioritized_replay_alpha,
|
||||
checkpoint_freq=args.checkpoint_freq,
|
||||
checkpoint_path=args.checkpoint_path,
|
||||
)
|
||||
# act.save("pong_model.pkl") XXX
|
||||
|
||||
env.close()
|
||||
|
||||
|
||||
|
@@ -86,7 +86,7 @@ class PrioritizedReplayBuffer(ReplayBuffer):
|
||||
ReplayBuffer.__init__
|
||||
"""
|
||||
super(PrioritizedReplayBuffer, self).__init__(size)
|
||||
assert alpha > 0
|
||||
assert alpha >= 0
|
||||
self._alpha = alpha
|
||||
|
||||
it_capacity = 1
|
||||
|
@@ -6,13 +6,15 @@ import zipfile
|
||||
import cloudpickle
|
||||
import numpy as np
|
||||
|
||||
import gym
|
||||
import baselines.common.tf_util as U
|
||||
from baselines.common.tf_util import load_state, save_state
|
||||
from baselines import logger
|
||||
from baselines.common.schedules import LinearSchedule
|
||||
from baselines.common.input import observation_input
|
||||
|
||||
from baselines import deepq
|
||||
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
|
||||
from baselines.deepq.utils import BatchInput, load_state, save_state
|
||||
from baselines.deepq.utils import ObservationInput
|
||||
|
||||
|
||||
class ActWrapper(object):
|
||||
@@ -88,6 +90,7 @@ def learn(env,
|
||||
batch_size=32,
|
||||
print_freq=100,
|
||||
checkpoint_freq=10000,
|
||||
checkpoint_path=None,
|
||||
learning_starts=1000,
|
||||
gamma=1.0,
|
||||
target_network_update_freq=500,
|
||||
@@ -170,9 +173,9 @@ def learn(env,
|
||||
|
||||
# capture the shape outside the closure so that the env object is not serialized
|
||||
# by cloudpickle when serializing make_obs_ph
|
||||
observation_space_shape = env.observation_space.shape
|
||||
|
||||
def make_obs_ph(name):
|
||||
return BatchInput(observation_space_shape, name=name)
|
||||
return ObservationInput(env.observation_space, name=name)
|
||||
|
||||
act, train, update_target, debug = deepq.build_train(
|
||||
make_obs_ph=make_obs_ph,
|
||||
@@ -216,9 +219,17 @@ def learn(env,
|
||||
saved_mean_reward = None
|
||||
obs = env.reset()
|
||||
reset = True
|
||||
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
model_saved = False
|
||||
td = checkpoint_path or td
|
||||
|
||||
model_file = os.path.join(td, "model")
|
||||
model_saved = False
|
||||
if tf.train.latest_checkpoint(td) is not None:
|
||||
load_state(model_file)
|
||||
logger.log('Loaded model from {}'.format(model_file))
|
||||
model_saved = True
|
||||
|
||||
for t in range(max_timesteps):
|
||||
if callback is not None:
|
||||
if callback(locals(), globals()):
|
||||
|
43
baselines/deepq/test_identity.py
Normal file
43
baselines/deepq/test_identity.py
Normal file
@@ -0,0 +1,43 @@
|
||||
import tensorflow as tf
|
||||
import random
|
||||
|
||||
from baselines import deepq
|
||||
from baselines.common.identity_env import IdentityEnv
|
||||
|
||||
|
||||
def test_identity():
|
||||
|
||||
with tf.Graph().as_default():
|
||||
env = IdentityEnv(10)
|
||||
random.seed(0)
|
||||
|
||||
tf.set_random_seed(0)
|
||||
|
||||
param_noise = False
|
||||
model = deepq.models.mlp([32])
|
||||
act = deepq.learn(
|
||||
env,
|
||||
q_func=model,
|
||||
lr=1e-3,
|
||||
max_timesteps=10000,
|
||||
buffer_size=50000,
|
||||
exploration_fraction=0.1,
|
||||
exploration_final_eps=0.02,
|
||||
print_freq=10,
|
||||
param_noise=param_noise,
|
||||
)
|
||||
|
||||
tf.set_random_seed(0)
|
||||
|
||||
N_TRIALS = 1000
|
||||
sum_rew = 0
|
||||
obs = env.reset()
|
||||
for i in range(N_TRIALS):
|
||||
obs, rew, done, _ = env.step(act([obs]))
|
||||
sum_rew += rew
|
||||
|
||||
assert sum_rew > 0.9 * N_TRIALS
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
test_identity()
|
@@ -1,24 +1,12 @@
|
||||
import os
|
||||
from baselines.common.input import observation_input
|
||||
|
||||
import tensorflow as tf
|
||||
|
||||
# ================================================================
|
||||
# Saving variables
|
||||
# ================================================================
|
||||
|
||||
def load_state(fname):
|
||||
saver = tf.train.Saver()
|
||||
saver.restore(tf.get_default_session(), fname)
|
||||
|
||||
def save_state(fname):
|
||||
os.makedirs(os.path.dirname(fname), exist_ok=True)
|
||||
saver = tf.train.Saver()
|
||||
saver.save(tf.get_default_session(), fname)
|
||||
|
||||
# ================================================================
|
||||
# Placeholders
|
||||
# ================================================================
|
||||
|
||||
|
||||
class TfInput(object):
|
||||
def __init__(self, name="(unnamed)"):
|
||||
"""Generalized Tensorflow placeholder. The main differences are:
|
||||
@@ -50,20 +38,6 @@ class PlaceholderTfInput(TfInput):
|
||||
def make_feed_dict(self, data):
|
||||
return {self._placeholder: data}
|
||||
|
||||
class BatchInput(PlaceholderTfInput):
|
||||
def __init__(self, shape, dtype=tf.float32, name=None):
|
||||
"""Creates a placeholder for a batch of tensors of a given shape and dtype
|
||||
|
||||
Parameters
|
||||
----------
|
||||
shape: [int]
|
||||
shape of a single elemenet of the batch
|
||||
dtype: tf.dtype
|
||||
number representation used for tensor contents
|
||||
name: str
|
||||
name of the underlying placeholder
|
||||
"""
|
||||
super().__init__(tf.placeholder(dtype, [None] + list(shape), name=name))
|
||||
|
||||
class Uint8Input(PlaceholderTfInput):
|
||||
def __init__(self, shape, name=None):
|
||||
@@ -85,4 +59,25 @@ class Uint8Input(PlaceholderTfInput):
|
||||
self._output = tf.cast(super().get(), tf.float32) / 255.0
|
||||
|
||||
def get(self):
|
||||
return self._output
|
||||
return self._output
|
||||
|
||||
|
||||
class ObservationInput(PlaceholderTfInput):
|
||||
def __init__(self, observation_space, name=None):
|
||||
"""Creates an input placeholder tailored to a specific observation space
|
||||
|
||||
Parameters
|
||||
----------
|
||||
|
||||
observation_space:
|
||||
observation space of the environment. Should be one of the gym.spaces types
|
||||
name: str
|
||||
tensorflow name of the underlying placeholder
|
||||
"""
|
||||
inpt, self.processed_inpt = observation_input(observation_space, name=name)
|
||||
super().__init__(inpt)
|
||||
|
||||
def get(self):
|
||||
return self.processed_inpt
|
||||
|
||||
|
||||
|
@@ -47,18 +47,12 @@ class Mujoco_Dset(object):
|
||||
obs = traj_data['obs'][:traj_limitation]
|
||||
acs = traj_data['acs'][:traj_limitation]
|
||||
|
||||
def flatten(x):
|
||||
# x.shape = (E,), or (E, L, D)
|
||||
_, size = x[0].shape
|
||||
episode_length = [len(i) for i in x]
|
||||
y = np.zeros((sum(episode_length), size))
|
||||
start_idx = 0
|
||||
for l, x_i in zip(episode_length, x):
|
||||
y[start_idx:(start_idx+l)] = x_i
|
||||
start_idx += l
|
||||
return y
|
||||
self.obs = np.array(flatten(obs))
|
||||
self.acs = np.array(flatten(acs))
|
||||
# obs, acs: shape (N, L, ) + S where N = # episodes, L = episode length
|
||||
# and S is the environment observation/action space.
|
||||
# Flatten to (N * L, prod(S))
|
||||
self.obs = np.reshape(obs, [-1, np.prod(obs.shape[2:])])
|
||||
self.acs = np.reshape(acs, [-1, np.prod(acs.shape[2:])])
|
||||
|
||||
self.rets = traj_data['ep_rets'][:traj_limitation]
|
||||
self.avg_ret = sum(self.rets)/len(self.rets)
|
||||
self.std_ret = np.std(np.array(self.rets))
|
||||
|
baselines/her/README.md (new file, 32 lines)
@@ -0,0 +1,32 @@
# Hindsight Experience Replay
For details on Hindsight Experience Replay (HER), please read the [paper](https://arxiv.org/abs/1707.01495).

## How to use Hindsight Experience Replay

### Getting started
Training an agent is very simple:
```bash
python -m baselines.her.experiment.train
```
This will train a DDPG+HER agent on the `FetchReach` environment.
You should see the success rate go up quickly to `1.0`, which means that the agent achieves the
desired goal in 100% of the cases.
The training script logs other diagnostics as well and pickles the best policy so far (w.r.t. its test success rate),
the latest policy, and, if enabled, a history of policies every K epochs.

To inspect what the agent has learned, use the play script:
```bash
python -m baselines.her.experiment.play /path/to/an/experiment/policy_best.pkl
```
You can try it right now with the results of the training step (the script prints out the path for you).
This should visualize the current policy for 10 episodes and will also print statistics.


### Reproducing results
In order to reproduce the results from [Plappert et al. (2018)](https://arxiv.org/abs/1802.09464), run the following command:
```bash
python -m baselines.her.experiment.train --num_cpu 19
```
This will require a machine with a sufficient number of physical CPU cores. In our experiments,
we used [Azure's D15v2 instances](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes),
which have 20 physical cores. We only scheduled the experiment on 19 of those to leave some head-room on the system.

baselines/her/__init__.py (new empty file)
baselines/her/actor_critic.py (new file, 44 lines)
@@ -0,0 +1,44 @@
import tensorflow as tf
from baselines.her.util import store_args, nn


class ActorCritic:
    @store_args
    def __init__(self, inputs_tf, dimo, dimg, dimu, max_u, o_stats, g_stats, hidden, layers,
                 **kwargs):
        """The actor-critic network and related training code.

        Args:
            inputs_tf (dict of tensors): all necessary inputs for the network: the
                observation (o), the goal (g), and the action (u)
            dimo (int): the dimension of the observations
            dimg (int): the dimension of the goals
            dimu (int): the dimension of the actions
            max_u (float): the maximum magnitude of actions; action outputs will be scaled
                accordingly
            o_stats (baselines.her.Normalizer): normalizer for observations
            g_stats (baselines.her.Normalizer): normalizer for goals
            hidden (int): number of hidden units that should be used in hidden layers
            layers (int): number of hidden layers
        """
        self.o_tf = inputs_tf['o']
        self.g_tf = inputs_tf['g']
        self.u_tf = inputs_tf['u']

        # Prepare inputs for actor and critic.
        o = self.o_stats.normalize(self.o_tf)
        g = self.g_stats.normalize(self.g_tf)
        input_pi = tf.concat(axis=1, values=[o, g])  # for actor

        # Networks.
        with tf.variable_scope('pi'):
            self.pi_tf = self.max_u * tf.tanh(nn(
                input_pi, [self.hidden] * self.layers + [self.dimu]))
        with tf.variable_scope('Q'):
            # for policy training
            input_Q = tf.concat(axis=1, values=[o, g, self.pi_tf / self.max_u])
            self.Q_pi_tf = nn(input_Q, [self.hidden] * self.layers + [1])
            # for critic training
            input_Q = tf.concat(axis=1, values=[o, g, self.u_tf / self.max_u])
            self._input_Q = input_Q  # exposed for tests
            self.Q_tf = nn(input_Q, [self.hidden] * self.layers + [1], reuse=True)
340
baselines/her/ddpg.py
Normal file
@@ -0,0 +1,340 @@
|
||||
from collections import OrderedDict
|
||||
|
||||
import numpy as np
|
||||
import tensorflow as tf
|
||||
from tensorflow.contrib.staging import StagingArea
|
||||
|
||||
from baselines import logger
|
||||
from baselines.her.util import (
|
||||
import_function, store_args, flatten_grads, transitions_in_episode_batch)
|
||||
from baselines.her.normalizer import Normalizer
|
||||
from baselines.her.replay_buffer import ReplayBuffer
|
||||
from baselines.common.mpi_adam import MpiAdam
|
||||
|
||||
|
||||
def dims_to_shapes(input_dims):
|
||||
return {key: tuple([val]) if val > 0 else tuple() for key, val in input_dims.items()}
|
||||
|
||||
|
||||
class DDPG(object):
|
||||
@store_args
|
||||
def __init__(self, input_dims, buffer_size, hidden, layers, network_class, polyak, batch_size,
|
||||
Q_lr, pi_lr, norm_eps, norm_clip, max_u, action_l2, clip_obs, scope, T,
|
||||
rollout_batch_size, subtract_goals, relative_goals, clip_pos_returns, clip_return,
|
||||
sample_transitions, gamma, reuse=False, **kwargs):
|
||||
"""Implementation of DDPG that is used in combination with Hindsight Experience Replay (HER).
|
||||
|
||||
Args:
|
||||
input_dims (dict of ints): dimensions for the observation (o), the goal (g), and the
|
||||
actions (u)
|
||||
buffer_size (int): number of transitions that are stored in the replay buffer
|
||||
hidden (int): number of units in the hidden layers
|
||||
layers (int): number of hidden layers
|
||||
network_class (str): the network class that should be used (e.g. 'baselines.her.ActorCritic')
|
||||
polyak (float): coefficient for Polyak-averaging of the target network
|
||||
batch_size (int): batch size for training
|
||||
Q_lr (float): learning rate for the Q (critic) network
|
||||
pi_lr (float): learning rate for the pi (actor) network
|
||||
norm_eps (float): a small value used in the normalizer to avoid numerical instabilities
|
||||
norm_clip (float): normalized inputs are clipped to be in [-norm_clip, norm_clip]
|
||||
max_u (float): maximum action magnitude, i.e. actions are in [-max_u, max_u]
|
||||
action_l2 (float): coefficient for L2 penalty on the actions
|
||||
clip_obs (float): clip observations before normalization to be in [-clip_obs, clip_obs]
|
||||
scope (str): the scope used for the TensorFlow graph
|
||||
T (int): the time horizon for rollouts
|
||||
rollout_batch_size (int): number of parallel rollouts per DDPG agent
|
||||
subtract_goals (function): function that subtracts goals from each other
|
||||
relative_goals (boolean): whether or not relative goals should be fed into the network
|
||||
clip_pos_returns (boolean): whether or not positive returns should be clipped
|
||||
clip_return (float): clip returns to be in [-clip_return, clip_return]
|
||||
sample_transitions (function) function that samples from the replay buffer
|
||||
gamma (float): gamma used for Q learning updates
|
||||
reuse (boolean): whether or not the networks should be reused
|
||||
"""
|
||||
if self.clip_return is None:
|
||||
self.clip_return = np.inf
|
||||
|
||||
self.create_actor_critic = import_function(self.network_class)
|
||||
|
||||
input_shapes = dims_to_shapes(self.input_dims)
|
||||
self.dimo = self.input_dims['o']
|
||||
self.dimg = self.input_dims['g']
|
||||
self.dimu = self.input_dims['u']
|
||||
|
||||
# Prepare staging area for feeding data to the model.
|
||||
stage_shapes = OrderedDict()
|
||||
for key in sorted(self.input_dims.keys()):
|
||||
if key.startswith('info_'):
|
||||
continue
|
||||
stage_shapes[key] = (None, *input_shapes[key])
|
||||
for key in ['o', 'g']:
|
||||
stage_shapes[key + '_2'] = stage_shapes[key]
|
||||
stage_shapes['r'] = (None,)
|
||||
self.stage_shapes = stage_shapes
|
||||
|
||||
# Create network.
|
||||
with tf.variable_scope(self.scope):
|
||||
self.staging_tf = StagingArea(
|
||||
dtypes=[tf.float32 for _ in self.stage_shapes.keys()],
|
||||
shapes=list(self.stage_shapes.values()))
|
||||
self.buffer_ph_tf = [
|
||||
tf.placeholder(tf.float32, shape=shape) for shape in self.stage_shapes.values()]
|
||||
self.stage_op = self.staging_tf.put(self.buffer_ph_tf)
|
||||
|
||||
self._create_network(reuse=reuse)
|
||||
|
||||
# Configure the replay buffer.
|
||||
buffer_shapes = {key: (self.T if key != 'o' else self.T+1, *input_shapes[key])
|
||||
for key, val in input_shapes.items()}
|
||||
buffer_shapes['g'] = (buffer_shapes['g'][0], self.dimg)
|
||||
buffer_shapes['ag'] = (self.T+1, self.dimg)
|
||||
|
||||
buffer_size = (self.buffer_size // self.rollout_batch_size) * self.rollout_batch_size
|
||||
self.buffer = ReplayBuffer(buffer_shapes, buffer_size, self.T, self.sample_transitions)
|
||||
|
||||
def _random_action(self, n):
|
||||
return np.random.uniform(low=-self.max_u, high=self.max_u, size=(n, self.dimu))
|
||||
|
||||
def _preprocess_og(self, o, ag, g):
|
||||
if self.relative_goals:
|
||||
g_shape = g.shape
|
||||
g = g.reshape(-1, self.dimg)
|
||||
ag = ag.reshape(-1, self.dimg)
|
||||
g = self.subtract_goals(g, ag)
|
||||
g = g.reshape(*g_shape)
|
||||
o = np.clip(o, -self.clip_obs, self.clip_obs)
|
||||
g = np.clip(g, -self.clip_obs, self.clip_obs)
|
||||
return o, g
|
||||
|
||||
def get_actions(self, o, ag, g, noise_eps=0., random_eps=0., use_target_net=False,
|
||||
compute_Q=False):
|
||||
o, g = self._preprocess_og(o, ag, g)
|
||||
policy = self.target if use_target_net else self.main
|
||||
# values to compute
|
||||
vals = [policy.pi_tf]
|
||||
if compute_Q:
|
||||
vals += [policy.Q_pi_tf]
|
||||
# feed
|
||||
feed = {
|
||||
policy.o_tf: o.reshape(-1, self.dimo),
|
||||
policy.g_tf: g.reshape(-1, self.dimg),
|
||||
policy.u_tf: np.zeros((o.size // self.dimo, self.dimu), dtype=np.float32)
|
||||
}
|
||||
|
||||
ret = self.sess.run(vals, feed_dict=feed)
|
||||
# action postprocessing
|
||||
u = ret[0]
|
||||
noise = noise_eps * self.max_u * np.random.randn(*u.shape) # gaussian noise
|
||||
u += noise
|
||||
u = np.clip(u, -self.max_u, self.max_u)
|
||||
u += np.random.binomial(1, random_eps, u.shape[0]).reshape(-1, 1) * (self._random_action(u.shape[0]) - u) # eps-greedy
|
||||
if u.shape[0] == 1:
|
||||
u = u[0]
|
||||
u = u.copy()
|
||||
ret[0] = u
|
||||
|
||||
if len(ret) == 1:
|
||||
return ret[0]
|
||||
else:
|
||||
return ret
|
||||
|
||||
def store_episode(self, episode_batch, update_stats=True):
|
||||
"""
|
||||
episode_batch: array of batch_size x (T or T+1) x dim_key
|
||||
'o' is of size T+1, others are of size T
|
||||
"""
|
||||
|
||||
self.buffer.store_episode(episode_batch)
|
||||
|
||||
if update_stats:
|
||||
# add transitions to normalizer
|
||||
episode_batch['o_2'] = episode_batch['o'][:, 1:, :]
|
||||
episode_batch['ag_2'] = episode_batch['ag'][:, 1:, :]
|
||||
num_normalizing_transitions = transitions_in_episode_batch(episode_batch)
|
||||
transitions = self.sample_transitions(episode_batch, num_normalizing_transitions)
|
||||
|
||||
o, o_2, g, ag = transitions['o'], transitions['o_2'], transitions['g'], transitions['ag']
|
||||
transitions['o'], transitions['g'] = self._preprocess_og(o, ag, g)
|
||||
# No need to preprocess the o_2 and g_2 since this is only used for stats
|
||||
|
||||
self.o_stats.update(transitions['o'])
|
||||
self.g_stats.update(transitions['g'])
|
||||
|
||||
self.o_stats.recompute_stats()
|
||||
self.g_stats.recompute_stats()
|
||||
|
||||
def get_current_buffer_size(self):
|
||||
return self.buffer.get_current_size()
|
||||
|
||||
def _sync_optimizers(self):
|
||||
self.Q_adam.sync()
|
||||
self.pi_adam.sync()
|
||||
|
||||
def _grads(self):
|
||||
# Avoid feed_dict here for performance!
|
||||
critic_loss, actor_loss, Q_grad, pi_grad = self.sess.run([
|
||||
self.Q_loss_tf,
|
||||
self.main.Q_pi_tf,
|
||||
self.Q_grad_tf,
|
||||
self.pi_grad_tf
|
||||
])
|
||||
return critic_loss, actor_loss, Q_grad, pi_grad
|
||||
|
||||
def _update(self, Q_grad, pi_grad):
|
||||
self.Q_adam.update(Q_grad, self.Q_lr)
|
||||
self.pi_adam.update(pi_grad, self.pi_lr)
|
||||
|
||||
def sample_batch(self):
|
||||
transitions = self.buffer.sample(self.batch_size)
|
||||
o, o_2, g = transitions['o'], transitions['o_2'], transitions['g']
|
||||
ag, ag_2 = transitions['ag'], transitions['ag_2']
|
||||
transitions['o'], transitions['g'] = self._preprocess_og(o, ag, g)
|
||||
transitions['o_2'], transitions['g_2'] = self._preprocess_og(o_2, ag_2, g)
|
||||
|
||||
transitions_batch = [transitions[key] for key in self.stage_shapes.keys()]
|
||||
return transitions_batch
|
||||
|
||||
def stage_batch(self, batch=None):
|
||||
if batch is None:
|
||||
batch = self.sample_batch()
|
||||
assert len(self.buffer_ph_tf) == len(batch)
|
||||
self.sess.run(self.stage_op, feed_dict=dict(zip(self.buffer_ph_tf, batch)))
|
||||
|
||||
def train(self, stage=True):
|
||||
if stage:
|
||||
self.stage_batch()
|
||||
critic_loss, actor_loss, Q_grad, pi_grad = self._grads()
|
||||
self._update(Q_grad, pi_grad)
|
||||
return critic_loss, actor_loss
|
||||
|
||||
def _init_target_net(self):
|
||||
self.sess.run(self.init_target_net_op)
|
||||
|
||||
def update_target_net(self):
|
||||
self.sess.run(self.update_target_net_op)
|
||||
|
||||
def clear_buffer(self):
|
||||
self.buffer.clear_buffer()
|
||||
|
||||
def _vars(self, scope):
|
||||
res = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.scope + '/' + scope)
|
||||
assert len(res) > 0
|
||||
return res
|
||||
|
||||
def _global_vars(self, scope):
|
||||
res = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.scope + '/' + scope)
|
||||
return res
|
||||
|
||||
def _create_network(self, reuse=False):
|
||||
logger.info("Creating a DDPG agent with action space %d x %s..." % (self.dimu, self.max_u))
|
||||
|
||||
self.sess = tf.get_default_session()
|
||||
if self.sess is None:
|
||||
self.sess = tf.InteractiveSession()
|
||||
|
||||
# running averages
|
||||
with tf.variable_scope('o_stats') as vs:
|
||||
if reuse:
|
||||
vs.reuse_variables()
|
||||
self.o_stats = Normalizer(self.dimo, self.norm_eps, self.norm_clip, sess=self.sess)
|
||||
with tf.variable_scope('g_stats') as vs:
|
||||
if reuse:
|
||||
vs.reuse_variables()
|
||||
self.g_stats = Normalizer(self.dimg, self.norm_eps, self.norm_clip, sess=self.sess)
|
||||
|
||||
# mini-batch sampling.
|
||||
batch = self.staging_tf.get()
|
||||
batch_tf = OrderedDict([(key, batch[i])
|
||||
for i, key in enumerate(self.stage_shapes.keys())])
|
||||
batch_tf['r'] = tf.reshape(batch_tf['r'], [-1, 1])
|
||||
|
||||
# networks
|
||||
with tf.variable_scope('main') as vs:
|
||||
if reuse:
|
||||
vs.reuse_variables()
|
||||
self.main = self.create_actor_critic(batch_tf, net_type='main', **self.__dict__)
|
||||
vs.reuse_variables()
|
||||
with tf.variable_scope('target') as vs:
|
||||
if reuse:
|
||||
vs.reuse_variables()
|
||||
target_batch_tf = batch_tf.copy()
|
||||
target_batch_tf['o'] = batch_tf['o_2']
|
||||
target_batch_tf['g'] = batch_tf['g_2']
|
||||
self.target = self.create_actor_critic(
|
||||
target_batch_tf, net_type='target', **self.__dict__)
|
||||
vs.reuse_variables()
|
||||
assert len(self._vars("main")) == len(self._vars("target"))
|
||||
|
||||
# loss functions
|
||||
target_Q_pi_tf = self.target.Q_pi_tf
|
||||
clip_range = (-self.clip_return, 0. if self.clip_pos_returns else np.inf)
|
||||
target_tf = tf.clip_by_value(batch_tf['r'] + self.gamma * target_Q_pi_tf, *clip_range)
|
||||
self.Q_loss_tf = tf.reduce_mean(tf.square(tf.stop_gradient(target_tf) - self.main.Q_tf))
|
||||
self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
|
||||
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
|
||||
Q_grads_tf = tf.gradients(self.Q_loss_tf, self._vars('main/Q'))
|
||||
pi_grads_tf = tf.gradients(self.pi_loss_tf, self._vars('main/pi'))
|
||||
assert len(self._vars('main/Q')) == len(Q_grads_tf)
|
||||
assert len(self._vars('main/pi')) == len(pi_grads_tf)
|
||||
self.Q_grads_vars_tf = zip(Q_grads_tf, self._vars('main/Q'))
|
||||
self.pi_grads_vars_tf = zip(pi_grads_tf, self._vars('main/pi'))
|
||||
self.Q_grad_tf = flatten_grads(grads=Q_grads_tf, var_list=self._vars('main/Q'))
|
||||
self.pi_grad_tf = flatten_grads(grads=pi_grads_tf, var_list=self._vars('main/pi'))
|
||||
|
||||
# optimizers
|
||||
self.Q_adam = MpiAdam(self._vars('main/Q'), scale_grad_by_procs=False)
|
||||
self.pi_adam = MpiAdam(self._vars('main/pi'), scale_grad_by_procs=False)
|
||||
|
||||
# polyak averaging
|
||||
self.main_vars = self._vars('main/Q') + self._vars('main/pi')
|
||||
self.target_vars = self._vars('target/Q') + self._vars('target/pi')
|
||||
self.stats_vars = self._global_vars('o_stats') + self._global_vars('g_stats')
|
||||
self.init_target_net_op = list(
|
||||
map(lambda v: v[0].assign(v[1]), zip(self.target_vars, self.main_vars)))
|
||||
self.update_target_net_op = list(
|
||||
map(lambda v: v[0].assign(self.polyak * v[0] + (1. - self.polyak) * v[1]), zip(self.target_vars, self.main_vars)))
|
||||
|
||||
# initialize all variables
|
||||
tf.variables_initializer(self._global_vars('')).run()
|
||||
self._sync_optimizers()
|
||||
self._init_target_net()
|
||||
|
||||
def logs(self, prefix=''):
|
||||
logs = []
|
||||
logs += [('stats_o/mean', np.mean(self.sess.run([self.o_stats.mean])))]
|
||||
logs += [('stats_o/std', np.mean(self.sess.run([self.o_stats.std])))]
|
||||
logs += [('stats_g/mean', np.mean(self.sess.run([self.g_stats.mean])))]
|
||||
logs += [('stats_g/std', np.mean(self.sess.run([self.g_stats.std])))]
|
||||
|
||||
if prefix != '' and not prefix.endswith('/'):
|
||||
return [(prefix + '/' + key, val) for key, val in logs]
|
||||
else:
|
||||
return logs
|
||||
|
||||
def __getstate__(self):
|
||||
"""Our policies can be loaded from pkl, but after unpickling you cannot continue training.
|
||||
"""
|
||||
excluded_subnames = ['_tf', '_op', '_vars', '_adam', 'buffer', 'sess', '_stats',
|
||||
'main', 'target', 'lock', 'env', 'sample_transitions',
|
||||
'stage_shapes', 'create_actor_critic']
|
||||
|
||||
state = {k: v for k, v in self.__dict__.items() if all([not subname in k for subname in excluded_subnames])}
|
||||
state['buffer_size'] = self.buffer_size
|
||||
state['tf'] = self.sess.run([x for x in self._global_vars('') if 'buffer' not in x.name])
|
||||
return state
|
||||
|
||||
def __setstate__(self, state):
|
||||
if 'sample_transitions' not in state:
|
||||
# We don't need this for playing the policy.
|
||||
state['sample_transitions'] = None
|
||||
|
||||
self.__init__(**state)
|
||||
# set up stats (they are overwritten in __init__)
|
||||
for k, v in state.items():
|
||||
if k[-6:] == '_stats':
|
||||
self.__dict__[k] = v
|
||||
# load TF variables
|
||||
vars = [x for x in self._global_vars('') if 'buffer' not in x.name]
|
||||
assert(len(vars) == len(state["tf"]))
|
||||
node = [tf.assign(var, val) for var, val in zip(vars, state["tf"])]
|
||||
self.sess.run(node)
|
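As a quick numeric illustration of the target-network update defined in `_create_network` above (`target <- polyak * target + (1 - polyak) * main`), here is a hedged, numpy-only sketch; the scalar values are made up.
```python
import numpy as np

polyak = 0.95             # same default as in the HER config
target = np.array([0.0])  # toy "target network" parameter
main = np.array([1.0])    # toy "main network" parameter

for _ in range(100):
    # update_target_net_op applies exactly this assignment to every variable pair
    target = polyak * target + (1. - polyak) * main

print(target)  # slowly tracks main; after 100 updates this is ~[0.9941]
```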
0
baselines/her/experiment/__init__.py
Normal file
171
baselines/her/experiment/config.py
Normal file
@@ -0,0 +1,171 @@
|
||||
import numpy as np
|
||||
import gym
|
||||
|
||||
from baselines import logger
|
||||
from baselines.her.ddpg import DDPG
|
||||
from baselines.her.her import make_sample_her_transitions
|
||||
|
||||
|
||||
DEFAULT_ENV_PARAMS = {
|
||||
'FetchReach-v1': {
|
||||
'n_cycles': 10,
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
DEFAULT_PARAMS = {
|
||||
# env
|
||||
'max_u': 1., # max absolute value of actions on different coordinates
|
||||
# ddpg
|
||||
'layers': 3, # number of layers in the critic/actor networks
|
||||
'hidden': 256, # number of neurons in each hidden layers
|
||||
'network_class': 'baselines.her.actor_critic:ActorCritic',
|
||||
'Q_lr': 0.001, # critic learning rate
|
||||
'pi_lr': 0.001, # actor learning rate
|
||||
'buffer_size': int(1E6), # for experience replay
|
||||
'polyak': 0.95, # polyak averaging coefficient
|
||||
'action_l2': 1.0, # quadratic penalty on actions (before rescaling by max_u)
|
||||
'clip_obs': 200.,
|
||||
'scope': 'ddpg', # can be tweaked for testing
|
||||
'relative_goals': False,
|
||||
# training
|
||||
'n_cycles': 50, # per epoch
|
||||
'rollout_batch_size': 2, # per mpi thread
|
||||
'n_batches': 40, # training batches per cycle
|
||||
'batch_size': 256, # per mpi thread, measured in transitions and reduced to even multiple of chunk_length.
|
||||
'n_test_rollouts': 10, # number of test rollouts per epoch, each consists of rollout_batch_size rollouts
|
||||
'test_with_polyak': False, # run test episodes with the target network
|
||||
# exploration
|
||||
'random_eps': 0.3, # percentage of time a random action is taken
|
||||
'noise_eps': 0.2, # std of gaussian noise added to not-completely-random actions as a percentage of max_u
|
||||
# HER
|
||||
'replay_strategy': 'future', # supported modes: future, none
|
||||
'replay_k': 4, # number of additional goals used for replay, only used if off_policy_data=future
|
||||
# normalization
|
||||
'norm_eps': 0.01, # epsilon used for observation normalization
|
||||
'norm_clip': 5,  # normalized observations are clipped to this value
|
||||
}
|
||||
|
||||
|
||||
CACHED_ENVS = {}
|
||||
|
||||
|
||||
def cached_make_env(make_env):
|
||||
"""
|
||||
Only creates a new environment from the provided function if one has not yet already been
|
||||
created. This is useful here because we need to infer certain properties of the env, e.g.
|
||||
its observation and action spaces, without any intent of actually using it.
|
||||
"""
|
||||
if make_env not in CACHED_ENVS:
|
||||
env = make_env()
|
||||
CACHED_ENVS[make_env] = env
|
||||
return CACHED_ENVS[make_env]
|
||||
|
||||
|
||||
def prepare_params(kwargs):
|
||||
# DDPG params
|
||||
ddpg_params = dict()
|
||||
|
||||
env_name = kwargs['env_name']
|
||||
|
||||
def make_env():
|
||||
return gym.make(env_name)
|
||||
kwargs['make_env'] = make_env
|
||||
tmp_env = cached_make_env(kwargs['make_env'])
|
||||
assert hasattr(tmp_env, '_max_episode_steps')
|
||||
kwargs['T'] = tmp_env._max_episode_steps
|
||||
tmp_env.reset()
|
||||
kwargs['max_u'] = np.array(kwargs['max_u']) if isinstance(kwargs['max_u'], list) else kwargs['max_u']
|
||||
kwargs['gamma'] = 1. - 1. / kwargs['T']
|
||||
if 'lr' in kwargs:
|
||||
kwargs['pi_lr'] = kwargs['lr']
|
||||
kwargs['Q_lr'] = kwargs['lr']
|
||||
del kwargs['lr']
|
||||
for name in ['buffer_size', 'hidden', 'layers',
|
||||
'network_class',
|
||||
'polyak',
|
||||
'batch_size', 'Q_lr', 'pi_lr',
|
||||
'norm_eps', 'norm_clip', 'max_u',
|
||||
'action_l2', 'clip_obs', 'scope', 'relative_goals']:
|
||||
ddpg_params[name] = kwargs[name]
|
||||
kwargs['_' + name] = kwargs[name]
|
||||
del kwargs[name]
|
||||
kwargs['ddpg_params'] = ddpg_params
|
||||
|
||||
return kwargs
|
||||
|
||||
|
||||
def log_params(params, logger=logger):
|
||||
for key in sorted(params.keys()):
|
||||
logger.info('{}: {}'.format(key, params[key]))
|
||||
|
||||
|
||||
def configure_her(params):
|
||||
env = cached_make_env(params['make_env'])
|
||||
env.reset()
|
||||
|
||||
def reward_fun(ag_2, g, info): # vectorized
|
||||
return env.compute_reward(achieved_goal=ag_2, desired_goal=g, info=info)
|
||||
|
||||
# Prepare configuration for HER.
|
||||
her_params = {
|
||||
'reward_fun': reward_fun,
|
||||
}
|
||||
for name in ['replay_strategy', 'replay_k']:
|
||||
her_params[name] = params[name]
|
||||
params['_' + name] = her_params[name]
|
||||
del params[name]
|
||||
sample_her_transitions = make_sample_her_transitions(**her_params)
|
||||
|
||||
return sample_her_transitions
|
||||
|
||||
|
||||
def simple_goal_subtract(a, b):
|
||||
assert a.shape == b.shape
|
||||
return a - b
|
||||
|
||||
|
||||
def configure_ddpg(dims, params, reuse=False, use_mpi=True, clip_return=True):
|
||||
sample_her_transitions = configure_her(params)
|
||||
# Extract relevant parameters.
|
||||
gamma = params['gamma']
|
||||
rollout_batch_size = params['rollout_batch_size']
|
||||
ddpg_params = params['ddpg_params']
|
||||
|
||||
input_dims = dims.copy()
|
||||
|
||||
# DDPG agent
|
||||
env = cached_make_env(params['make_env'])
|
||||
env.reset()
|
||||
ddpg_params.update({'input_dims': input_dims, # agent takes an input observations
|
||||
'T': params['T'],
|
||||
'clip_pos_returns': True, # clip positive returns
|
||||
'clip_return': (1. / (1. - gamma)) if clip_return else np.inf, # max abs of return
|
||||
'rollout_batch_size': rollout_batch_size,
|
||||
'subtract_goals': simple_goal_subtract,
|
||||
'sample_transitions': sample_her_transitions,
|
||||
'gamma': gamma,
|
||||
})
|
||||
ddpg_params['info'] = {
|
||||
'env_name': params['env_name'],
|
||||
}
|
||||
policy = DDPG(reuse=reuse, **ddpg_params, use_mpi=use_mpi)
|
||||
return policy
|
||||
|
||||
|
||||
def configure_dims(params):
|
||||
env = cached_make_env(params['make_env'])
|
||||
env.reset()
|
||||
obs, _, _, info = env.step(env.action_space.sample())
|
||||
|
||||
dims = {
|
||||
'o': obs['observation'].shape[0],
|
||||
'u': env.action_space.shape[0],
|
||||
'g': obs['desired_goal'].shape[0],
|
||||
}
|
||||
for key, value in info.items():
|
||||
value = np.array(value)
|
||||
if value.ndim == 0:
|
||||
value = value.reshape(1)
|
||||
dims['info_{}'.format(key)] = value.shape[0]
|
||||
return dims
|
60
baselines/her/experiment/play.py
Normal file
@@ -0,0 +1,60 @@
|
||||
import click
|
||||
import numpy as np
|
||||
import pickle
|
||||
|
||||
from baselines import logger
|
||||
from baselines.common import set_global_seeds
|
||||
import baselines.her.experiment.config as config
|
||||
from baselines.her.rollout import RolloutWorker
|
||||
|
||||
|
||||
@click.command()
|
||||
@click.argument('policy_file', type=str)
|
||||
@click.option('--seed', type=int, default=0)
|
||||
@click.option('--n_test_rollouts', type=int, default=10)
|
||||
@click.option('--render', type=int, default=1)
|
||||
def main(policy_file, seed, n_test_rollouts, render):
|
||||
set_global_seeds(seed)
|
||||
|
||||
# Load policy.
|
||||
with open(policy_file, 'rb') as f:
|
||||
policy = pickle.load(f)
|
||||
env_name = policy.info['env_name']
|
||||
|
||||
# Prepare params.
|
||||
params = config.DEFAULT_PARAMS
|
||||
if env_name in config.DEFAULT_ENV_PARAMS:
|
||||
params.update(config.DEFAULT_ENV_PARAMS[env_name]) # merge env-specific parameters in
|
||||
params['env_name'] = env_name
|
||||
params = config.prepare_params(params)
|
||||
config.log_params(params, logger=logger)
|
||||
|
||||
dims = config.configure_dims(params)
|
||||
|
||||
eval_params = {
|
||||
'exploit': True,
|
||||
'use_target_net': params['test_with_polyak'],
|
||||
'compute_Q': True,
|
||||
'rollout_batch_size': 1,
|
||||
'render': bool(render),
|
||||
}
|
||||
|
||||
for name in ['T', 'gamma', 'noise_eps', 'random_eps']:
|
||||
eval_params[name] = params[name]
|
||||
|
||||
evaluator = RolloutWorker(params['make_env'], policy, dims, logger, **eval_params)
|
||||
evaluator.seed(seed)
|
||||
|
||||
# Run evaluation.
|
||||
evaluator.clear_history()
|
||||
for _ in range(n_test_rollouts):
|
||||
evaluator.generate_rollouts()
|
||||
|
||||
# record logs
|
||||
for key, val in evaluator.logs('test'):
|
||||
logger.record_tabular(key, np.mean(val))
|
||||
logger.dump_tabular()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
118
baselines/her/experiment/plot.py
Normal file
@@ -0,0 +1,118 @@
|
||||
import os
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
import json
|
||||
import seaborn as sns; sns.set()
|
||||
import glob2
|
||||
import argparse
|
||||
|
||||
|
||||
def smooth_reward_curve(x, y):
|
||||
halfwidth = int(np.ceil(len(x) / 60)) # Halfwidth of our smoothing convolution
|
||||
k = halfwidth
|
||||
xsmoo = x
|
||||
ysmoo = np.convolve(y, np.ones(2 * k + 1), mode='same') / np.convolve(np.ones_like(y), np.ones(2 * k + 1),
|
||||
mode='same')
|
||||
return xsmoo, ysmoo
|
||||
|
||||
|
||||
def load_results(file):
|
||||
if not os.path.exists(file):
|
||||
return None
|
||||
with open(file, 'r') as f:
|
||||
lines = [line for line in f]
|
||||
if len(lines) < 2:
|
||||
return None
|
||||
keys = [name.strip() for name in lines[0].split(',')]
|
||||
data = np.genfromtxt(file, delimiter=',', skip_header=1, filling_values=0.)
|
||||
if data.ndim == 1:
|
||||
data = data.reshape(1, -1)
|
||||
assert data.ndim == 2
|
||||
assert data.shape[-1] == len(keys)
|
||||
result = {}
|
||||
for idx, key in enumerate(keys):
|
||||
result[key] = data[:, idx]
|
||||
return result
|
||||
|
||||
|
||||
def pad(xs, value=np.nan):
|
||||
maxlen = np.max([len(x) for x in xs])
|
||||
|
||||
padded_xs = []
|
||||
for x in xs:
|
||||
if x.shape[0] >= maxlen:
|
||||
padded_xs.append(x)
|
||||
|
||||
padding = np.ones((maxlen - x.shape[0],) + x.shape[1:]) * value
|
||||
x_padded = np.concatenate([x, padding], axis=0)
|
||||
assert x_padded.shape[1:] == x.shape[1:]
|
||||
assert x_padded.shape[0] == maxlen
|
||||
padded_xs.append(x_padded)
|
||||
return np.array(padded_xs)
|
||||
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('dir', type=str)
|
||||
parser.add_argument('--smooth', type=int, default=1)
|
||||
args = parser.parse_args()
|
||||
|
||||
# Load all data.
|
||||
data = {}
|
||||
paths = [os.path.abspath(os.path.join(path, '..')) for path in glob2.glob(os.path.join(args.dir, '**', 'progress.csv'))]
|
||||
for curr_path in paths:
|
||||
if not os.path.isdir(curr_path):
|
||||
continue
|
||||
results = load_results(os.path.join(curr_path, 'progress.csv'))
|
||||
if not results:
|
||||
print('skipping {}'.format(curr_path))
|
||||
continue
|
||||
print('loading {} ({})'.format(curr_path, len(results['epoch'])))
|
||||
with open(os.path.join(curr_path, 'params.json'), 'r') as f:
|
||||
params = json.load(f)
|
||||
|
||||
success_rate = np.array(results['test/success_rate'])
|
||||
epoch = np.array(results['epoch']) + 1
|
||||
env_id = params['env_name']
|
||||
replay_strategy = params['replay_strategy']
|
||||
|
||||
if replay_strategy == 'future':
|
||||
config = 'her'
|
||||
else:
|
||||
config = 'ddpg'
|
||||
if 'Dense' in env_id:
|
||||
config += '-dense'
|
||||
else:
|
||||
config += '-sparse'
|
||||
env_id = env_id.replace('Dense', '')
|
||||
|
||||
# Process and smooth data.
|
||||
assert success_rate.shape == epoch.shape
|
||||
x = epoch
|
||||
y = success_rate
|
||||
if args.smooth:
|
||||
x, y = smooth_reward_curve(epoch, success_rate)
|
||||
assert x.shape == y.shape
|
||||
|
||||
if env_id not in data:
|
||||
data[env_id] = {}
|
||||
if config not in data[env_id]:
|
||||
data[env_id][config] = []
|
||||
data[env_id][config].append((x, y))
|
||||
|
||||
# Plot data.
|
||||
for env_id in sorted(data.keys()):
|
||||
print('exporting {}'.format(env_id))
|
||||
plt.clf()
|
||||
|
||||
for config in sorted(data[env_id].keys()):
|
||||
xs, ys = zip(*data[env_id][config])
|
||||
xs, ys = pad(xs), pad(ys)
|
||||
assert xs.shape == ys.shape
|
||||
|
||||
plt.plot(xs[0], np.nanmedian(ys, axis=0), label=config)
|
||||
plt.fill_between(xs[0], np.nanpercentile(ys, 25, axis=0), np.nanpercentile(ys, 75, axis=0), alpha=0.25)
|
||||
plt.title(env_id)
|
||||
plt.xlabel('Epoch')
|
||||
plt.ylabel('Median Success Rate')
|
||||
plt.legend()
|
||||
plt.savefig(os.path.join(args.dir, 'fig_{}.png'.format(env_id)))
|
191
baselines/her/experiment/train.py
Normal file
@@ -0,0 +1,191 @@
|
||||
import os
|
||||
import sys
|
||||
|
||||
import click
|
||||
import numpy as np
|
||||
import json
|
||||
from mpi4py import MPI
|
||||
|
||||
from baselines import logger
|
||||
from baselines.common import set_global_seeds
|
||||
from baselines.common.mpi_moments import mpi_moments
|
||||
import baselines.her.experiment.config as config
|
||||
from baselines.her.rollout import RolloutWorker
|
||||
from baselines.her.util import mpi_fork
|
||||
|
||||
from subprocess import CalledProcessError
|
||||
|
||||
|
||||
def mpi_average(value):
|
||||
if value == []:
|
||||
value = [0.]
|
||||
if not isinstance(value, list):
|
||||
value = [value]
|
||||
return mpi_moments(np.array(value))[0]
|
||||
|
||||
|
||||
def train(policy, rollout_worker, evaluator,
|
||||
n_epochs, n_test_rollouts, n_cycles, n_batches, policy_save_interval,
|
||||
save_policies, **kwargs):
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
|
||||
latest_policy_path = os.path.join(logger.get_dir(), 'policy_latest.pkl')
|
||||
best_policy_path = os.path.join(logger.get_dir(), 'policy_best.pkl')
|
||||
periodic_policy_path = os.path.join(logger.get_dir(), 'policy_{}.pkl')
|
||||
|
||||
logger.info("Training...")
|
||||
best_success_rate = -1
|
||||
for epoch in range(n_epochs):
|
||||
# train
|
||||
rollout_worker.clear_history()
|
||||
for _ in range(n_cycles):
|
||||
episode = rollout_worker.generate_rollouts()
|
||||
policy.store_episode(episode)
|
||||
for _ in range(n_batches):
|
||||
policy.train()
|
||||
policy.update_target_net()
|
||||
|
||||
# test
|
||||
evaluator.clear_history()
|
||||
for _ in range(n_test_rollouts):
|
||||
evaluator.generate_rollouts()
|
||||
|
||||
# record logs
|
||||
logger.record_tabular('epoch', epoch)
|
||||
for key, val in evaluator.logs('test'):
|
||||
logger.record_tabular(key, mpi_average(val))
|
||||
for key, val in rollout_worker.logs('train'):
|
||||
logger.record_tabular(key, mpi_average(val))
|
||||
for key, val in policy.logs():
|
||||
logger.record_tabular(key, mpi_average(val))
|
||||
|
||||
if rank == 0:
|
||||
logger.dump_tabular()
|
||||
|
||||
# save the policy if it's better than the previous ones
|
||||
success_rate = mpi_average(evaluator.current_success_rate())
|
||||
if rank == 0 and success_rate >= best_success_rate and save_policies:
|
||||
best_success_rate = success_rate
|
||||
logger.info('New best success rate: {}. Saving policy to {} ...'.format(best_success_rate, best_policy_path))
|
||||
evaluator.save_policy(best_policy_path)
|
||||
evaluator.save_policy(latest_policy_path)
|
||||
if rank == 0 and policy_save_interval > 0 and epoch % policy_save_interval == 0 and save_policies:
|
||||
policy_path = periodic_policy_path.format(epoch)
|
||||
logger.info('Saving periodic policy to {} ...'.format(policy_path))
|
||||
evaluator.save_policy(policy_path)
|
||||
|
||||
# make sure that different threads have different seeds
|
||||
local_uniform = np.random.uniform(size=(1,))
|
||||
root_uniform = local_uniform.copy()
|
||||
MPI.COMM_WORLD.Bcast(root_uniform, root=0)
|
||||
if rank != 0:
|
||||
assert local_uniform[0] != root_uniform[0]
|
||||
|
||||
|
||||
def launch(
|
||||
env, logdir, n_epochs, num_cpu, seed, replay_strategy, policy_save_interval, clip_return,
|
||||
override_params={}, save_policies=True
|
||||
):
|
||||
# Fork for multi-CPU MPI implementation.
|
||||
if num_cpu > 1:
|
||||
try:
|
||||
whoami = mpi_fork(num_cpu, ['--bind-to', 'core'])
|
||||
except CalledProcessError:
|
||||
# fancy version of mpi call failed, try simple version
|
||||
whoami = mpi_fork(num_cpu)
|
||||
|
||||
if whoami == 'parent':
|
||||
sys.exit(0)
|
||||
import baselines.common.tf_util as U
|
||||
U.single_threaded_session().__enter__()
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
|
||||
# Configure logging
|
||||
if rank == 0:
|
||||
if logdir or logger.get_dir() is None:
|
||||
logger.configure(dir=logdir)
|
||||
else:
|
||||
logger.configure()
|
||||
logdir = logger.get_dir()
|
||||
assert logdir is not None
|
||||
os.makedirs(logdir, exist_ok=True)
|
||||
|
||||
# Seed everything.
|
||||
rank_seed = seed + 1000000 * rank
|
||||
set_global_seeds(rank_seed)
|
||||
|
||||
# Prepare params.
|
||||
params = config.DEFAULT_PARAMS
|
||||
params['env_name'] = env
|
||||
params['replay_strategy'] = replay_strategy
|
||||
if env in config.DEFAULT_ENV_PARAMS:
|
||||
params.update(config.DEFAULT_ENV_PARAMS[env]) # merge env-specific parameters in
|
||||
params.update(**override_params) # makes it possible to override any parameter
|
||||
with open(os.path.join(logger.get_dir(), 'params.json'), 'w') as f:
|
||||
json.dump(params, f)
|
||||
params = config.prepare_params(params)
|
||||
config.log_params(params, logger=logger)
|
||||
|
||||
if num_cpu == 1:
|
||||
logger.warn()
|
||||
logger.warn('*** Warning ***')
|
||||
logger.warn(
|
||||
'You are running HER with just a single MPI worker. This will work, but the ' +
|
||||
'experiments that we report in Plappert et al. (2018, https://arxiv.org/abs/1802.09464) ' +
|
||||
'were obtained with --num_cpu 19. This makes a significant difference and if you ' +
|
||||
'are looking to reproduce those results, be aware of this. Please also refer to ' +
|
||||
'https://github.com/openai/baselines/issues/314 for further details.')
|
||||
logger.warn('****************')
|
||||
logger.warn()
|
||||
|
||||
dims = config.configure_dims(params)
|
||||
policy = config.configure_ddpg(dims=dims, params=params, clip_return=clip_return)
|
||||
|
||||
rollout_params = {
|
||||
'exploit': False,
|
||||
'use_target_net': False,
|
||||
'use_demo_states': True,
|
||||
'compute_Q': False,
|
||||
'T': params['T'],
|
||||
}
|
||||
|
||||
eval_params = {
|
||||
'exploit': True,
|
||||
'use_target_net': params['test_with_polyak'],
|
||||
'use_demo_states': False,
|
||||
'compute_Q': True,
|
||||
'T': params['T'],
|
||||
}
|
||||
|
||||
for name in ['T', 'rollout_batch_size', 'gamma', 'noise_eps', 'random_eps']:
|
||||
rollout_params[name] = params[name]
|
||||
eval_params[name] = params[name]
|
||||
|
||||
rollout_worker = RolloutWorker(params['make_env'], policy, dims, logger, **rollout_params)
|
||||
rollout_worker.seed(rank_seed)
|
||||
|
||||
evaluator = RolloutWorker(params['make_env'], policy, dims, logger, **eval_params)
|
||||
evaluator.seed(rank_seed)
|
||||
|
||||
train(
|
||||
logdir=logdir, policy=policy, rollout_worker=rollout_worker,
|
||||
evaluator=evaluator, n_epochs=n_epochs, n_test_rollouts=params['n_test_rollouts'],
|
||||
n_cycles=params['n_cycles'], n_batches=params['n_batches'],
|
||||
policy_save_interval=policy_save_interval, save_policies=save_policies)
|
||||
|
||||
|
||||
@click.command()
|
||||
@click.option('--env', type=str, default='FetchReach-v1', help='the name of the OpenAI Gym environment that you want to train on')
|
||||
@click.option('--logdir', type=str, default=None, help='the path to where logs and policy pickles should go. If not specified, creates a folder in /tmp/')
|
||||
@click.option('--n_epochs', type=int, default=50, help='the number of training epochs to run')
|
||||
@click.option('--num_cpu', type=int, default=1, help='the number of CPU cores to use (using MPI)')
|
||||
@click.option('--seed', type=int, default=0, help='the random seed used to seed both the environment and the training code')
|
||||
@click.option('--policy_save_interval', type=int, default=5, help='the interval with which policy pickles are saved. If set to 0, only the best and latest policy will be pickled.')
|
||||
@click.option('--replay_strategy', type=click.Choice(['future', 'none']), default='future', help='the HER replay strategy to be used. "future" uses HER, "none" disables HER.')
|
||||
@click.option('--clip_return', type=int, default=1, help='whether or not returns should be clipped')
|
||||
def main(**kwargs):
|
||||
launch(**kwargs)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
63
baselines/her/her.py
Normal file
@@ -0,0 +1,63 @@
|
||||
import numpy as np
|
||||
|
||||
|
||||
def make_sample_her_transitions(replay_strategy, replay_k, reward_fun):
|
||||
"""Creates a sample function that can be used for HER experience replay.
|
||||
|
||||
Args:
|
||||
replay_strategy (in ['future', 'none']): the HER replay strategy; if set to 'none',
|
||||
regular DDPG experience replay is used
|
||||
replay_k (int): the ratio between HER replays and regular replays (e.g. k = 4 -> 4 times
|
||||
as many HER replays as regular replays are used)
|
||||
reward_fun (function): function to re-compute the reward with substituted goals
|
||||
"""
|
||||
if replay_strategy == 'future':
|
||||
future_p = 1 - (1. / (1 + replay_k))
|
||||
else: # 'replay_strategy' == 'none'
|
||||
future_p = 0
|
||||
|
||||
def _sample_her_transitions(episode_batch, batch_size_in_transitions):
|
||||
"""episode_batch is {key: array(buffer_size x T x dim_key)}
|
||||
"""
|
||||
T = episode_batch['u'].shape[1]
|
||||
rollout_batch_size = episode_batch['u'].shape[0]
|
||||
batch_size = batch_size_in_transitions
|
||||
|
||||
# Select which episodes and time steps to use.
|
||||
episode_idxs = np.random.randint(0, rollout_batch_size, batch_size)
|
||||
t_samples = np.random.randint(T, size=batch_size)
|
||||
transitions = {key: episode_batch[key][episode_idxs, t_samples].copy()
|
||||
for key in episode_batch.keys()}
|
||||
|
||||
# Select future time indexes proportional with probability future_p. These
|
||||
# will be used for HER replay by substituting in future goals.
|
||||
her_indexes = np.where(np.random.uniform(size=batch_size) < future_p)
|
||||
future_offset = np.random.uniform(size=batch_size) * (T - t_samples)
|
||||
future_offset = future_offset.astype(int)
|
||||
future_t = (t_samples + 1 + future_offset)[her_indexes]
|
||||
|
||||
# Replace goal with achieved goal but only for the previously-selected
|
||||
# HER transitions (as defined by her_indexes). For the other transitions,
|
||||
# keep the original goal.
|
||||
future_ag = episode_batch['ag'][episode_idxs[her_indexes], future_t]
|
||||
transitions['g'][her_indexes] = future_ag
|
||||
|
||||
# Reconstruct info dictionary for reward computation.
|
||||
info = {}
|
||||
for key, value in transitions.items():
|
||||
if key.startswith('info_'):
|
||||
info[key.replace('info_', '')] = value
|
||||
|
||||
# Re-compute reward since we may have substituted the goal.
|
||||
reward_params = {k: transitions[k] for k in ['ag_2', 'g']}
|
||||
reward_params['info'] = info
|
||||
transitions['r'] = reward_fun(**reward_params)
|
||||
|
||||
transitions = {k: transitions[k].reshape(batch_size, *transitions[k].shape[1:])
|
||||
for k in transitions.keys()}
|
||||
|
||||
assert(transitions['u'].shape[0] == batch_size_in_transitions)
|
||||
|
||||
return transitions
|
||||
|
||||
return _sample_her_transitions
|
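To make the relabeling logic above concrete, here is a small, self-contained sketch that calls `make_sample_her_transitions` on a random toy batch. The dimensions, the sparse `reward_fun`, and the batch contents are all made up for illustration; the `o_2`/`ag_2` views are added manually here because the replay buffer normally adds them before sampling.
```python
import numpy as np

from baselines.her.her import make_sample_her_transitions

# Toy sizes: 2 episodes of length T=5, 3-dim observations/goals, 2-dim actions.
n_episodes, T, dim_o, dim_g, dim_u = 2, 5, 3, 3, 2

def reward_fun(ag_2, g, info):
    # Sparse reward: 0 when the achieved goal matches the (possibly substituted) goal, else -1.
    return -(np.linalg.norm(ag_2 - g, axis=-1) > 0.05).astype(np.float32)

sample = make_sample_her_transitions(replay_strategy='future', replay_k=4,
                                     reward_fun=reward_fun)

episode_batch = {
    'o': np.random.randn(n_episodes, T + 1, dim_o),
    'ag': np.random.randn(n_episodes, T + 1, dim_g),
    'g': np.random.randn(n_episodes, T, dim_g),
    'u': np.random.randn(n_episodes, T, dim_u),
}
# ReplayBuffer.sample adds these shifted views before calling the sampler.
episode_batch['o_2'] = episode_batch['o'][:, 1:, :]
episode_batch['ag_2'] = episode_batch['ag'][:, 1:, :]

transitions = sample(episode_batch, batch_size_in_transitions=8)
# Roughly replay_k / (replay_k + 1) = 80% of the sampled goals have been replaced by
# achieved goals from later in the same episode, and rewards were recomputed accordingly.
print(transitions['g'].shape, transitions['r'].shape)  # (8, 3) (8,)
```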
140
baselines/her/normalizer.py
Normal file
@@ -0,0 +1,140 @@
|
||||
import threading
|
||||
|
||||
import numpy as np
|
||||
from mpi4py import MPI
|
||||
import tensorflow as tf
|
||||
|
||||
from baselines.her.util import reshape_for_broadcasting
|
||||
|
||||
|
||||
class Normalizer:
|
||||
def __init__(self, size, eps=1e-2, default_clip_range=np.inf, sess=None):
|
||||
"""A normalizer that ensures that observations are approximately distributed according to
|
||||
a standard Normal distribution (i.e. have mean zero and variance one).
|
||||
|
||||
Args:
|
||||
size (int): the size of the observation to be normalized
|
||||
eps (float): a small constant that avoids underflows
|
||||
default_clip_range (float): normalized observations are clipped to be in
|
||||
[-default_clip_range, default_clip_range]
|
||||
sess (object): the TensorFlow session to be used
|
||||
"""
|
||||
self.size = size
|
||||
self.eps = eps
|
||||
self.default_clip_range = default_clip_range
|
||||
self.sess = sess if sess is not None else tf.get_default_session()
|
||||
|
||||
self.local_sum = np.zeros(self.size, np.float32)
|
||||
self.local_sumsq = np.zeros(self.size, np.float32)
|
||||
self.local_count = np.zeros(1, np.float32)
|
||||
|
||||
self.sum_tf = tf.get_variable(
|
||||
initializer=tf.zeros_initializer(), shape=self.local_sum.shape, name='sum',
|
||||
trainable=False, dtype=tf.float32)
|
||||
self.sumsq_tf = tf.get_variable(
|
||||
initializer=tf.zeros_initializer(), shape=self.local_sumsq.shape, name='sumsq',
|
||||
trainable=False, dtype=tf.float32)
|
||||
self.count_tf = tf.get_variable(
|
||||
initializer=tf.ones_initializer(), shape=self.local_count.shape, name='count',
|
||||
trainable=False, dtype=tf.float32)
|
||||
self.mean = tf.get_variable(
|
||||
initializer=tf.zeros_initializer(), shape=(self.size,), name='mean',
|
||||
trainable=False, dtype=tf.float32)
|
||||
self.std = tf.get_variable(
|
||||
initializer=tf.ones_initializer(), shape=(self.size,), name='std',
|
||||
trainable=False, dtype=tf.float32)
|
||||
self.count_pl = tf.placeholder(name='count_pl', shape=(1,), dtype=tf.float32)
|
||||
self.sum_pl = tf.placeholder(name='sum_pl', shape=(self.size,), dtype=tf.float32)
|
||||
self.sumsq_pl = tf.placeholder(name='sumsq_pl', shape=(self.size,), dtype=tf.float32)
|
||||
|
||||
self.update_op = tf.group(
|
||||
self.count_tf.assign_add(self.count_pl),
|
||||
self.sum_tf.assign_add(self.sum_pl),
|
||||
self.sumsq_tf.assign_add(self.sumsq_pl)
|
||||
)
|
||||
self.recompute_op = tf.group(
|
||||
tf.assign(self.mean, self.sum_tf / self.count_tf),
|
||||
tf.assign(self.std, tf.sqrt(tf.maximum(
|
||||
tf.square(self.eps),
|
||||
self.sumsq_tf / self.count_tf - tf.square(self.sum_tf / self.count_tf)
|
||||
))),
|
||||
)
|
||||
self.lock = threading.Lock()
|
||||
|
||||
def update(self, v):
|
||||
v = v.reshape(-1, self.size)
|
||||
|
||||
with self.lock:
|
||||
self.local_sum += v.sum(axis=0)
|
||||
self.local_sumsq += (np.square(v)).sum(axis=0)
|
||||
self.local_count[0] += v.shape[0]
|
||||
|
||||
def normalize(self, v, clip_range=None):
|
||||
if clip_range is None:
|
||||
clip_range = self.default_clip_range
|
||||
mean = reshape_for_broadcasting(self.mean, v)
|
||||
std = reshape_for_broadcasting(self.std, v)
|
||||
return tf.clip_by_value((v - mean) / std, -clip_range, clip_range)
|
||||
|
||||
def denormalize(self, v):
|
||||
mean = reshape_for_broadcasting(self.mean, v)
|
||||
std = reshape_for_broadcasting(self.std, v)
|
||||
return mean + v * std
|
||||
|
||||
def _mpi_average(self, x):
|
||||
buf = np.zeros_like(x)
|
||||
MPI.COMM_WORLD.Allreduce(x, buf, op=MPI.SUM)
|
||||
buf /= MPI.COMM_WORLD.Get_size()
|
||||
return buf
|
||||
|
||||
def synchronize(self, local_sum, local_sumsq, local_count, root=None):
|
||||
local_sum[...] = self._mpi_average(local_sum)
|
||||
local_sumsq[...] = self._mpi_average(local_sumsq)
|
||||
local_count[...] = self._mpi_average(local_count)
|
||||
return local_sum, local_sumsq, local_count
|
||||
|
||||
def recompute_stats(self):
|
||||
with self.lock:
|
||||
# Copy over results.
|
||||
local_count = self.local_count.copy()
|
||||
local_sum = self.local_sum.copy()
|
||||
local_sumsq = self.local_sumsq.copy()
|
||||
|
||||
# Reset.
|
||||
self.local_count[...] = 0
|
||||
self.local_sum[...] = 0
|
||||
self.local_sumsq[...] = 0
|
||||
|
||||
# We perform the synchronization outside of the lock to keep the critical section as short
|
||||
# as possible.
|
||||
synced_sum, synced_sumsq, synced_count = self.synchronize(
|
||||
local_sum=local_sum, local_sumsq=local_sumsq, local_count=local_count)
|
||||
|
||||
self.sess.run(self.update_op, feed_dict={
|
||||
self.count_pl: synced_count,
|
||||
self.sum_pl: synced_sum,
|
||||
self.sumsq_pl: synced_sumsq,
|
||||
})
|
||||
self.sess.run(self.recompute_op)
|
||||
|
||||
|
||||
class IdentityNormalizer:
|
||||
def __init__(self, size, std=1.):
|
||||
self.size = size
|
||||
self.mean = tf.zeros(self.size, tf.float32)
|
||||
self.std = std * tf.ones(self.size, tf.float32)
|
||||
|
||||
def update(self, x):
|
||||
pass
|
||||
|
||||
def normalize(self, x, clip_range=None):
|
||||
return x / self.std
|
||||
|
||||
def denormalize(self, x):
|
||||
return self.std * x
|
||||
|
||||
def synchronize(self):
|
||||
pass
|
||||
|
||||
def recompute_stats(self):
|
||||
pass
|
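The TensorFlow bookkeeping above boils down to simple sum/sum-of-squares statistics. The following numpy-only sketch (with made-up data and the same `eps` default) mirrors what `update`, `recompute_stats`, and `normalize` compute, minus the MPI synchronization and locking.
```python
import numpy as np

eps, clip_range = 1e-2, 5.
v = np.random.randn(1000, 4).astype(np.float32)  # made-up batch of observations

# update(): accumulate sums, sums of squares, and a count.
local_sum = v.sum(axis=0)
local_sumsq = np.square(v).sum(axis=0)
local_count = float(v.shape[0])

# recompute_stats(): derive the mean and a floored standard deviation from the sums.
mean = local_sum / local_count
std = np.sqrt(np.maximum(np.square(eps), local_sumsq / local_count - np.square(mean)))

# normalize(): standardize and clip, matching the norm_clip=5 default used by DDPG.
normalized = np.clip((v - mean) / std, -clip_range, clip_range)
print(normalized.mean(axis=0), normalized.std(axis=0))  # approximately 0 and 1
```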
108
baselines/her/replay_buffer.py
Normal file
@@ -0,0 +1,108 @@
|
||||
import threading
|
||||
|
||||
import numpy as np
|
||||
|
||||
|
||||
class ReplayBuffer:
|
||||
def __init__(self, buffer_shapes, size_in_transitions, T, sample_transitions):
|
||||
"""Creates a replay buffer.
|
||||
|
||||
Args:
|
||||
buffer_shapes (dict of ints): the shape for all buffers that are used in the replay
|
||||
buffer
|
||||
size_in_transitions (int): the size of the buffer, measured in transitions
|
||||
T (int): the time horizon for episodes
|
||||
sample_transitions (function): a function that samples from the replay buffer
|
||||
"""
|
||||
self.buffer_shapes = buffer_shapes
|
||||
self.size = size_in_transitions // T
|
||||
self.T = T
|
||||
self.sample_transitions = sample_transitions
|
||||
|
||||
# self.buffers is {key: array(size_in_episodes x T or T+1 x dim_key)}
|
||||
self.buffers = {key: np.empty([self.size, *shape])
|
||||
for key, shape in buffer_shapes.items()}
|
||||
|
||||
# memory management
|
||||
self.current_size = 0
|
||||
self.n_transitions_stored = 0
|
||||
|
||||
self.lock = threading.Lock()
|
||||
|
||||
@property
|
||||
def full(self):
|
||||
with self.lock:
|
||||
return self.current_size == self.size
|
||||
|
||||
def sample(self, batch_size):
|
||||
"""Returns a dict {key: array(batch_size x shapes[key])}
|
||||
"""
|
||||
buffers = {}
|
||||
|
||||
with self.lock:
|
||||
assert self.current_size > 0
|
||||
for key in self.buffers.keys():
|
||||
buffers[key] = self.buffers[key][:self.current_size]
|
||||
|
||||
buffers['o_2'] = buffers['o'][:, 1:, :]
|
||||
buffers['ag_2'] = buffers['ag'][:, 1:, :]
|
||||
|
||||
transitions = self.sample_transitions(buffers, batch_size)
|
||||
|
||||
for key in (['r', 'o_2', 'ag_2'] + list(self.buffers.keys())):
|
||||
assert key in transitions, "key %s missing from transitions" % key
|
||||
|
||||
return transitions
|
||||
|
||||
def store_episode(self, episode_batch):
|
||||
"""episode_batch: array(batch_size x (T or T+1) x dim_key)
|
||||
"""
|
||||
batch_sizes = [len(episode_batch[key]) for key in episode_batch.keys()]
|
||||
assert np.all(np.array(batch_sizes) == batch_sizes[0])
|
||||
batch_size = batch_sizes[0]
|
||||
|
||||
with self.lock:
|
||||
idxs = self._get_storage_idx(batch_size)
|
||||
|
||||
# load inputs into buffers
|
||||
for key in self.buffers.keys():
|
||||
self.buffers[key][idxs] = episode_batch[key]
|
||||
|
||||
self.n_transitions_stored += batch_size * self.T
|
||||
|
||||
def get_current_episode_size(self):
|
||||
with self.lock:
|
||||
return self.current_size
|
||||
|
||||
def get_current_size(self):
|
||||
with self.lock:
|
||||
return self.current_size * self.T
|
||||
|
||||
def get_transitions_stored(self):
|
||||
with self.lock:
|
||||
return self.n_transitions_stored
|
||||
|
||||
def clear_buffer(self):
|
||||
with self.lock:
|
||||
self.current_size = 0
|
||||
|
||||
def _get_storage_idx(self, inc=None):
|
||||
inc = inc or 1 # size increment
|
||||
assert inc <= self.size, "Batch committed to replay is too large!"
|
||||
# go consecutively until you hit the end, and then go randomly.
|
||||
if self.current_size+inc <= self.size:
|
||||
idx = np.arange(self.current_size, self.current_size+inc)
|
||||
elif self.current_size < self.size:
|
||||
overflow = inc - (self.size - self.current_size)
|
||||
idx_a = np.arange(self.current_size, self.size)
|
||||
idx_b = np.random.randint(0, self.current_size, overflow)
|
||||
idx = np.concatenate([idx_a, idx_b])
|
||||
else:
|
||||
idx = np.random.randint(0, self.size, inc)
|
||||
|
||||
# update replay size
|
||||
self.current_size = min(self.size, self.current_size+inc)
|
||||
|
||||
if inc == 1:
|
||||
idx = idx[0]
|
||||
return idx
|
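Putting the buffer and the HER sampler together, here is a hedged sketch of the intended wiring (toy shapes; the all-zero episodes are stand-ins for real rollouts produced by the `RolloutWorker` below).
```python
import numpy as np

from baselines.her.her import make_sample_her_transitions
from baselines.her.replay_buffer import ReplayBuffer

T, dim_o, dim_g, dim_u = 5, 3, 3, 2  # toy episode length and dimensions

sample = make_sample_her_transitions(
    replay_strategy='future', replay_k=4,
    reward_fun=lambda ag_2, g, info: -(np.linalg.norm(ag_2 - g, axis=-1) > 0.05).astype(np.float32))

# 'o' and 'ag' carry T+1 steps, 'g' and 'u' carry T steps, as documented above.
buffer_shapes = {'o': (T + 1, dim_o), 'ag': (T + 1, dim_g), 'g': (T, dim_g), 'u': (T, dim_u)}
buffer = ReplayBuffer(buffer_shapes, size_in_transitions=1000, T=T, sample_transitions=sample)

# Store a batch of two (all-zero) episodes in batch-major layout, then sample a mini-batch.
episode = {key: np.zeros((2,) + shape) for key, shape in buffer_shapes.items()}
buffer.store_episode(episode)
batch = buffer.sample(16)
print(batch['o'].shape, batch['r'].shape)  # (16, 3) (16,)
```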
188
baselines/her/rollout.py
Normal file
@@ -0,0 +1,188 @@
|
||||
from collections import deque
|
||||
|
||||
import numpy as np
|
||||
import pickle
|
||||
from mujoco_py import MujocoException
|
||||
|
||||
from baselines.her.util import convert_episode_to_batch_major, store_args
|
||||
|
||||
|
||||
class RolloutWorker:
|
||||
|
||||
@store_args
|
||||
def __init__(self, make_env, policy, dims, logger, T, rollout_batch_size=1,
|
||||
exploit=False, use_target_net=False, compute_Q=False, noise_eps=0,
|
||||
random_eps=0, history_len=100, render=False, **kwargs):
|
||||
"""Rollout worker generates experience by interacting with one or many environments.
|
||||
|
||||
Args:
|
||||
make_env (function): a factory function that creates a new instance of the environment
|
||||
when called
|
||||
policy (object): the policy that is used to act
|
||||
dims (dict of ints): the dimensions for observations (o), goals (g), and actions (u)
|
||||
logger (object): the logger that is used by the rollout worker
|
||||
rollout_batch_size (int): the number of parallel rollouts that should be used
|
||||
exploit (boolean): whether or not to exploit, i.e. to act optimally according to the
|
||||
current policy without any exploration
|
||||
use_target_net (boolean): whether or not to use the target net for rollouts
|
||||
compute_Q (boolean): whether or not to compute the Q values alongside the actions
|
||||
noise_eps (float): scale of the additive Gaussian noise
|
||||
random_eps (float): probability of selecting a completely random action
|
||||
history_len (int): length of history for statistics smoothing
|
||||
render (boolean): whether or not to render the rollouts
|
||||
"""
|
||||
self.envs = [make_env() for _ in range(rollout_batch_size)]
|
||||
assert self.T > 0
|
||||
|
||||
self.info_keys = [key.replace('info_', '') for key in dims.keys() if key.startswith('info_')]
|
||||
|
||||
self.success_history = deque(maxlen=history_len)
|
||||
self.Q_history = deque(maxlen=history_len)
|
||||
|
||||
self.n_episodes = 0
|
||||
self.g = np.empty((self.rollout_batch_size, self.dims['g']), np.float32) # goals
|
||||
self.initial_o = np.empty((self.rollout_batch_size, self.dims['o']), np.float32) # observations
|
||||
self.initial_ag = np.empty((self.rollout_batch_size, self.dims['g']), np.float32) # achieved goals
|
||||
self.reset_all_rollouts()
|
||||
self.clear_history()
|
||||
|
||||
def reset_rollout(self, i):
|
||||
"""Resets the `i`-th rollout environment, re-samples a new goal, and updates the `initial_o`
|
||||
and `g` arrays accordingly.
|
||||
"""
|
||||
obs = self.envs[i].reset()
|
||||
self.initial_o[i] = obs['observation']
|
||||
self.initial_ag[i] = obs['achieved_goal']
|
||||
self.g[i] = obs['desired_goal']
|
||||
|
||||
def reset_all_rollouts(self):
|
||||
"""Resets all `rollout_batch_size` rollout workers.
|
||||
"""
|
||||
for i in range(self.rollout_batch_size):
|
||||
self.reset_rollout(i)
|
||||
|
||||
def generate_rollouts(self):
|
||||
"""Performs `rollout_batch_size` rollouts in parallel for time horizon `T` with the current
|
||||
policy acting on it accordingly.
|
||||
"""
|
||||
self.reset_all_rollouts()
|
||||
|
||||
# compute observations
|
||||
o = np.empty((self.rollout_batch_size, self.dims['o']), np.float32) # observations
|
||||
ag = np.empty((self.rollout_batch_size, self.dims['g']), np.float32) # achieved goals
|
||||
o[:] = self.initial_o
|
||||
ag[:] = self.initial_ag
|
||||
|
||||
# generate episodes
|
||||
obs, achieved_goals, acts, goals, successes = [], [], [], [], []
|
||||
info_values = [np.empty((self.T, self.rollout_batch_size, self.dims['info_' + key]), np.float32) for key in self.info_keys]
|
||||
Qs = []
|
||||
for t in range(self.T):
|
||||
policy_output = self.policy.get_actions(
|
||||
o, ag, self.g,
|
||||
compute_Q=self.compute_Q,
|
||||
noise_eps=self.noise_eps if not self.exploit else 0.,
|
||||
random_eps=self.random_eps if not self.exploit else 0.,
|
||||
use_target_net=self.use_target_net)
|
||||
|
||||
if self.compute_Q:
|
||||
u, Q = policy_output
|
||||
Qs.append(Q)
|
||||
else:
|
||||
u = policy_output
|
||||
|
||||
if u.ndim == 1:
|
||||
# The non-batched case should still have a reasonable shape.
|
||||
u = u.reshape(1, -1)
|
||||
|
||||
o_new = np.empty((self.rollout_batch_size, self.dims['o']))
|
||||
ag_new = np.empty((self.rollout_batch_size, self.dims['g']))
|
||||
success = np.zeros(self.rollout_batch_size)
|
||||
# compute new states and observations
|
||||
for i in range(self.rollout_batch_size):
|
||||
try:
|
||||
# We fully ignore the reward here because it will have to be re-computed
|
||||
# for HER.
|
||||
curr_o_new, _, _, info = self.envs[i].step(u[i])
|
||||
if 'is_success' in info:
|
||||
success[i] = info['is_success']
|
||||
o_new[i] = curr_o_new['observation']
|
||||
ag_new[i] = curr_o_new['achieved_goal']
|
||||
for idx, key in enumerate(self.info_keys):
|
||||
info_values[idx][t, i] = info[key]
|
||||
if self.render:
|
||||
self.envs[i].render()
|
||||
except MujocoException as e:
|
||||
return self.generate_rollouts()
|
||||
|
||||
if np.isnan(o_new).any():
|
||||
self.logger.warning('NaN caught during rollout generation. Trying again...')
|
||||
self.reset_all_rollouts()
|
||||
return self.generate_rollouts()
|
||||
|
||||
obs.append(o.copy())
|
||||
achieved_goals.append(ag.copy())
|
||||
successes.append(success.copy())
|
||||
acts.append(u.copy())
|
||||
goals.append(self.g.copy())
|
||||
o[...] = o_new
|
||||
ag[...] = ag_new
|
||||
obs.append(o.copy())
|
||||
achieved_goals.append(ag.copy())
|
||||
self.initial_o[:] = o
|
||||
|
||||
episode = dict(o=obs,
|
||||
u=acts,
|
||||
g=goals,
|
||||
ag=achieved_goals)
|
||||
for key, value in zip(self.info_keys, info_values):
|
||||
episode['info_{}'.format(key)] = value
|
||||
|
||||
# stats
|
||||
successful = np.array(successes)[-1, :]
|
||||
assert successful.shape == (self.rollout_batch_size,)
|
||||
success_rate = np.mean(successful)
|
||||
self.success_history.append(success_rate)
|
||||
if self.compute_Q:
|
||||
self.Q_history.append(np.mean(Qs))
|
||||
self.n_episodes += self.rollout_batch_size
|
||||
|
||||
return convert_episode_to_batch_major(episode)
|
||||
|
||||
def clear_history(self):
|
||||
"""Clears all histories that are used for statistics
|
||||
"""
|
||||
self.success_history.clear()
|
||||
self.Q_history.clear()
|
||||
|
||||
def current_success_rate(self):
|
||||
return np.mean(self.success_history)
|
||||
|
||||
def current_mean_Q(self):
|
||||
return np.mean(self.Q_history)
|
||||
|
||||
def save_policy(self, path):
|
||||
"""Pickles the current policy for later inspection.
|
||||
"""
|
||||
with open(path, 'wb') as f:
|
||||
pickle.dump(self.policy, f)
|
||||
|
||||
def logs(self, prefix='worker'):
|
||||
"""Generates a dictionary that contains all collected statistics.
|
||||
"""
|
||||
logs = []
|
||||
logs += [('success_rate', np.mean(self.success_history))]
|
||||
if self.compute_Q:
|
||||
logs += [('mean_Q', np.mean(self.Q_history))]
|
||||
logs += [('episode', self.n_episodes)]
|
||||
|
||||
if prefix != '' and not prefix.endswith('/'):
|
||||
return [(prefix + '/' + key, val) for key, val in logs]
|
||||
else:
|
||||
return logs
|
||||
|
||||
def seed(self, seed):
|
||||
"""Seeds each environment with a distinct seed derived from the passed in global seed.
|
||||
"""
|
||||
for idx, env in enumerate(self.envs):
|
||||
env.seed(seed + 1000 * idx)
|
140
baselines/her/util.py
Normal file
@@ -0,0 +1,140 @@
import os
import subprocess
import sys
import importlib
import inspect
import functools

import tensorflow as tf
import numpy as np

from baselines.common import tf_util as U


def store_args(method):
    """Stores provided method args as instance attributes.
    """
    argspec = inspect.getfullargspec(method)
    defaults = {}
    if argspec.defaults is not None:
        defaults = dict(
            zip(argspec.args[-len(argspec.defaults):], argspec.defaults))
    if argspec.kwonlydefaults is not None:
        defaults.update(argspec.kwonlydefaults)
    arg_names = argspec.args[1:]

    @functools.wraps(method)
    def wrapper(*positional_args, **keyword_args):
        self = positional_args[0]
        # Get default arg values
        args = defaults.copy()
        # Add provided arg values
        for name, value in zip(arg_names, positional_args[1:]):
            args[name] = value
        args.update(keyword_args)
        self.__dict__.update(args)
        return method(*positional_args, **keyword_args)

    return wrapper


def import_function(spec):
    """Import a function identified by a string like "pkg.module:fn_name".
    """
    mod_name, fn_name = spec.split(':')
    module = importlib.import_module(mod_name)
    fn = getattr(module, fn_name)
    return fn

def flatten_grads(var_list, grads):
    """Flattens variables and their gradients.
    """
    return tf.concat([tf.reshape(grad, [U.numel(v)])
                      for (v, grad) in zip(var_list, grads)], 0)

def nn(input, layers_sizes, reuse=None, flatten=False, name=""):
    """Creates a simple neural network
    """
    for i, size in enumerate(layers_sizes):
        activation = tf.nn.relu if i < len(layers_sizes) - 1 else None
        input = tf.layers.dense(inputs=input,
                                units=size,
                                kernel_initializer=tf.contrib.layers.xavier_initializer(),
                                reuse=reuse,
                                name=name + '_' + str(i))
        if activation:
            input = activation(input)
    if flatten:
        assert layers_sizes[-1] == 1
        input = tf.reshape(input, [-1])
    return input


def install_mpi_excepthook():
    import sys
    from mpi4py import MPI
    old_hook = sys.excepthook

    def new_hook(a, b, c):
        old_hook(a, b, c)
        sys.stdout.flush()
        sys.stderr.flush()
        MPI.COMM_WORLD.Abort()
    sys.excepthook = new_hook


def mpi_fork(n, extra_mpi_args=[]):
    """Re-launches the current script with workers
    Returns "parent" for original parent, "child" for MPI children
    """
    if n <= 1:
        return "child"
    if os.getenv("IN_MPI") is None:
        env = os.environ.copy()
        env.update(
            MKL_NUM_THREADS="1",
            OMP_NUM_THREADS="1",
            IN_MPI="1"
        )
        # "-bind-to core" is crucial for good performance
        args = ["mpirun", "-np", str(n)] + \
            extra_mpi_args + \
            [sys.executable]

        args += sys.argv
        subprocess.check_call(args, env=env)
        return "parent"
    else:
        install_mpi_excepthook()
        return "child"


def convert_episode_to_batch_major(episode):
    """Converts an episode to have the batch dimension in the major (first)
    dimension.
    """
    episode_batch = {}
    for key in episode.keys():
        val = np.array(episode[key]).copy()
        # make inputs batch-major instead of time-major
        episode_batch[key] = val.swapaxes(0, 1)

    return episode_batch


def transitions_in_episode_batch(episode_batch):
    """Number of transitions in a given episode batch.
    """
    shape = episode_batch['u'].shape
    return shape[0] * shape[1]


def reshape_for_broadcasting(source, target):
    """Reshapes a tensor (source) to have the correct shape and dtype of the target
    before broadcasting it with MPI.
    """
    dim = len(target.get_shape())
    shape = ([1] * (dim - 1)) + [-1]
    return tf.reshape(tf.cast(source, target.dtype), shape)
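As a quick orientation to the new helpers above, here is a minimal usage sketch; the `Example` class, the `numpy:mean` spec, and the array shapes are invented for illustration and are not part of the diff:

```python
# Illustrative sketch only: exercising the baselines/her/util.py helpers.
import numpy as np

from baselines.her.util import (store_args, import_function,
                                convert_episode_to_batch_major,
                                transitions_in_episode_batch)


class Example:
    @store_args
    def __init__(self, lr=1e-3, hidden=256):
        pass  # store_args copies lr and hidden onto self automatically


ex = Example(hidden=128)
assert ex.lr == 1e-3 and ex.hidden == 128

# import_function resolves "module:attribute" strings.
mean_fn = import_function('numpy:mean')
assert mean_fn([1, 2, 3]) == 2.0

# convert_episode_to_batch_major swaps (time, batch, ...) -> (batch, time, ...).
episode = {'u': np.zeros((50, 2, 4))}  # 50 steps, 2 parallel rollouts, 4-dim actions
batch = convert_episode_to_batch_major(episode)
assert batch['u'].shape == (2, 50, 4)
assert transitions_in_episode_batch(batch) == 100
```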
baselines/logger.py

@@ -6,9 +6,7 @@ import json
import time
import datetime
import tempfile

LOG_OUTPUT_FORMATS = ['stdout', 'log', 'csv']
# Also valid: json, tensorboard
from collections import defaultdict

DEBUG = 10
INFO = 20
@@ -73,8 +71,11 @@ class HumanOutputFormat(KVWriter, SeqWriter):
        return s[:20] + '...' if len(s) > 23 else s

    def writeseq(self, seq):
        for arg in seq:
            self.file.write(arg)
        seq = list(seq)
        for (i, elem) in enumerate(seq):
            self.file.write(elem)
            if i < len(seq) - 1: # add space unless this is the last one
                self.file.write(' ')
        self.file.write('\n')
        self.file.flush()
@@ -124,7 +125,7 @@ class CSVOutputFormat(KVWriter):
            if i > 0:
                self.file.write(',')
            v = kvs.get(k)
            if v:
            if v is not None:
                self.file.write(str(v))
        self.file.write('\n')
        self.file.flush()
@@ -168,24 +169,18 @@ class TensorBoardOutputFormat(KVWriter):
            self.writer.Close()
            self.writer = None

def make_output_format(format, ev_dir):
    from mpi4py import MPI
def make_output_format(format, ev_dir, log_suffix=''):
    os.makedirs(ev_dir, exist_ok=True)
    rank = MPI.COMM_WORLD.Get_rank()
    if format == 'stdout':
        return HumanOutputFormat(sys.stdout)
    elif format == 'log':
        suffix = "" if rank==0 else ("-mpi%03i"%rank)
        return HumanOutputFormat(osp.join(ev_dir, 'log%s.txt' % suffix))
        return HumanOutputFormat(osp.join(ev_dir, 'log%s.txt' % log_suffix))
    elif format == 'json':
        assert rank==0
        return JSONOutputFormat(osp.join(ev_dir, 'progress.json'))
        return JSONOutputFormat(osp.join(ev_dir, 'progress%s.json' % log_suffix))
    elif format == 'csv':
        assert rank==0
        return CSVOutputFormat(osp.join(ev_dir, 'progress.csv'))
        return CSVOutputFormat(osp.join(ev_dir, 'progress%s.csv' % log_suffix))
    elif format == 'tensorboard':
        assert rank==0
        return TensorBoardOutputFormat(osp.join(ev_dir, 'tb'))
        return TensorBoardOutputFormat(osp.join(ev_dir, 'tb%s' % log_suffix))
    else:
        raise ValueError('Unknown format specified: %s' % (format,))
@@ -197,9 +192,16 @@ def logkv(key, val):
    """
    Log a value of some diagnostic
    Call this once for each diagnostic quantity, each iteration
    If called many times, last value will be used.
    """
    Logger.CURRENT.logkv(key, val)

def logkv_mean(key, val):
    """
    The same as logkv(), but if called many times, values averaged.
    """
    Logger.CURRENT.logkv_mean(key, val)

def logkvs(d):
    """
    Log a dictionary of key-value pairs
@@ -255,6 +257,33 @@ def get_dir():
record_tabular = logkv
dump_tabular = dumpkvs

class ProfileKV:
    """
    Usage:
    with logger.ProfileKV("interesting_scope"):
        code
    """
    def __init__(self, n):
        self.n = "wait_" + n
    def __enter__(self):
        self.t1 = time.time()
    def __exit__(self, type, value, traceback):
        Logger.CURRENT.name2val[self.n] += time.time() - self.t1

def profile(n):
    """
    Usage:
    @profile("my_func")
    def my_func(): code
    """
    def decorator_with_name(func):
        def func_wrapper(*args, **kwargs):
            with ProfileKV(n):
                return func(*args, **kwargs)
        return func_wrapper
    return decorator_with_name


# ================================================================
# Backend
# ================================================================
@@ -265,7 +294,8 @@ class Logger(object):
    CURRENT = None # Current logger being used by the free functions above

    def __init__(self, dir, output_formats):
        self.name2val = {} # values this iteration
        self.name2val = defaultdict(float) # values this iteration
        self.name2cnt = defaultdict(int)
        self.level = INFO
        self.dir = dir
        self.output_formats = output_formats
@@ -275,12 +305,21 @@ class Logger(object):
    def logkv(self, key, val):
        self.name2val[key] = val

    def logkv_mean(self, key, val):
        if val is None:
            self.name2val[key] = None
            return
        oldval, cnt = self.name2val[key], self.name2cnt[key]
        self.name2val[key] = oldval*cnt/(cnt+1) + val/(cnt+1)
        self.name2cnt[key] = cnt + 1

    def dumpkvs(self):
        if self.level == DISABLED: return
        for fmt in self.output_formats:
            if isinstance(fmt, KVWriter):
                fmt.writekvs(self.name2val)
        self.name2val.clear()
        self.name2cnt.clear()

    def log(self, *args, level=INFO):
        if self.level <= level:
@@ -316,10 +355,19 @@ def configure(dir=None, format_strs=None):
    assert isinstance(dir, str)
    os.makedirs(dir, exist_ok=True)

    log_suffix = ''
    from mpi4py import MPI
    rank = MPI.COMM_WORLD.Get_rank()
    if rank > 0:
        log_suffix = "-rank%03i" % rank

    if format_strs is None:
        strs = os.getenv('OPENAI_LOG_FORMAT')
        format_strs = strs.split(',') if strs else LOG_OUTPUT_FORMATS
    output_formats = [make_output_format(f, dir) for f in format_strs]
        if rank == 0:
            format_strs = os.getenv('OPENAI_LOG_FORMAT', 'stdout,log,csv').split(',')
        else:
            format_strs = os.getenv('OPENAI_LOG_FORMAT_MPI', 'log').split(',')
    format_strs = filter(None, format_strs)
    output_formats = [make_output_format(f, dir, log_suffix) for f in format_strs]

    Logger.CURRENT = Logger(dir=dir, output_formats=output_formats)
    log('Logging to %s'%dir)
@@ -360,6 +408,11 @@ def _demo():
    logkv("a", 5.5)
    dumpkvs()
    info("^^^ should see a = 5.5")
    logkv_mean("b", -22.5)
    logkv_mean("b", -44.4)
    logkv("a", 5.5)
    dumpkvs()
    info("^^^ should see b = -33.45")

    logkv("b", -2.5)
    dumpkvs()

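A short sketch of how the newly added `logkv_mean` and `profile`/`ProfileKV` helpers are intended to be used; the keys, values, and sleep below are made up for illustration:

```python
# Illustrative only: average a noisy per-batch metric and time a hot code path.
import time
from baselines import logger

logger.configure()

@logger.profile("update")      # accumulates elapsed time under the key 'wait_update'
def update():
    time.sleep(0.01)           # stand-in for an expensive optimization step

for batch_loss in [0.9, 0.7, 0.8]:
    logger.logkv_mean('loss', batch_loss)  # dumped as the running mean, 0.8
    update()

logger.dumpkvs()
```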
baselines/ppo1/README.md

@@ -5,3 +5,5 @@
- `mpirun -np 8 python -m baselines.ppo1.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.ppo1.run_mujoco` runs the algorithm for 1M frames on a Mujoco environment.

- Train mujoco 3d humanoid (with optimal-ish hyperparameters): `mpirun -np 16 python -m baselines.ppo1.run_humanoid --model-path=/path/to/model`
- Render the 3d humanoid: `python -m baselines.ppo1.run_humanoid --play --model-path=/path/to/model`
baselines/ppo1/pposgd_simple.py

@@ -212,5 +212,7 @@ def learn(env, policy_fn, *,
        if MPI.COMM_WORLD.Get_rank()==0:
            logger.dump_tabular()

    return pi

def flatten_lists(listoflists):
    return [el for list_ in listoflists for el in list_]
75  baselines/ppo1/run_humanoid.py  Normal file
@@ -0,0 +1,75 @@
#!/usr/bin/env python3
import os
from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
from baselines.common import tf_util as U
from baselines import logger

import gym

def train(num_timesteps, seed, model_path=None):
    env_id = 'Humanoid-v2'
    from baselines.ppo1 import mlp_policy, pposgd_simple
    U.make_session(num_cpu=1).__enter__()
    def policy_fn(name, ob_space, ac_space):
        return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
            hid_size=64, num_hid_layers=2)
    env = make_mujoco_env(env_id, seed)

    # parameters below were the best found in a simple random search
    # these are good enough to make humanoid walk, but whether those are
    # an absolute best or not is not certain
    env = RewScale(env, 0.1)
    pi = pposgd_simple.learn(env, policy_fn,
            max_timesteps=num_timesteps,
            timesteps_per_actorbatch=2048,
            clip_param=0.2, entcoeff=0.0,
            optim_epochs=10,
            optim_stepsize=3e-4,
            optim_batchsize=64,
            gamma=0.99,
            lam=0.95,
            schedule='linear',
        )
    env.close()
    if model_path:
        U.save_state(model_path)

    return pi

class RewScale(gym.RewardWrapper):
    def __init__(self, env, scale):
        gym.RewardWrapper.__init__(self, env)
        self.scale = scale
    def reward(self, r):
        return r * self.scale

def main():
    logger.configure()
    parser = mujoco_arg_parser()
    parser.add_argument('--model-path', default=os.path.join(logger.get_dir(), 'humanoid_policy'))
    parser.set_defaults(num_timesteps=int(2e7))

    args = parser.parse_args()

    if not args.play:
        # train the model
        train(num_timesteps=args.num_timesteps, seed=args.seed, model_path=args.model_path)
    else:
        # construct the model object, load pre-trained model and render
        pi = train(num_timesteps=1, seed=args.seed)
        U.load_state(args.model_path)
        env = make_mujoco_env('Humanoid-v2', seed=0)

        ob = env.reset()
        while True:
            action = pi.act(stochastic=False, ob=ob)[0]
            ob, _, done, _ = env.step(action)
            env.render()
            if done:
                ob = env.reset()


if __name__ == '__main__':
    main()
40  baselines/ppo1/run_robotics.py  Normal file
@@ -0,0 +1,40 @@
#!/usr/bin/env python3

from mpi4py import MPI
from baselines.common import set_global_seeds
from baselines import logger
from baselines.common.cmd_util import make_robotics_env, robotics_arg_parser
import mujoco_py


def train(env_id, num_timesteps, seed):
    from baselines.ppo1 import mlp_policy, pposgd_simple
    import baselines.common.tf_util as U
    rank = MPI.COMM_WORLD.Get_rank()
    sess = U.single_threaded_session()
    sess.__enter__()
    mujoco_py.ignore_mujoco_warnings().__enter__()
    workerseed = seed + 10000 * rank
    set_global_seeds(workerseed)
    env = make_robotics_env(env_id, workerseed, rank=rank)
    def policy_fn(name, ob_space, ac_space):
        return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
            hid_size=256, num_hid_layers=3)

    pposgd_simple.learn(env, policy_fn,
        max_timesteps=num_timesteps,
        timesteps_per_actorbatch=2048,
        clip_param=0.2, entcoeff=0.0,
        optim_epochs=5, optim_stepsize=3e-4, optim_batchsize=256,
        gamma=0.99, lam=0.95, schedule='linear',
        )
    env.close()


def main():
    args = robotics_arg_parser().parse_args()
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)


if __name__ == '__main__':
    main()
baselines/ppo2/policies.py

@@ -2,39 +2,36 @@ import numpy as np
import tensorflow as tf
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
from baselines.common.distributions import make_pdtype
from baselines.common.input import observation_input

def nature_cnn(unscaled_images):
def nature_cnn(unscaled_images, **conv_kwargs):
    """
    CNN from Nature paper.
    """
    scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
    activ = tf.nn.relu
    h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2)))
    h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2)))
    h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2)))
    h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
                   **conv_kwargs))
    h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
    h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
    h3 = conv_to_fc(h3)
    return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))

class LnLstmPolicy(object):
    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
        nenv = nbatch // nsteps
        nh, nw, nc = ob_space.shape
        ob_shape = (nbatch, nh, nw, nc)
        nact = ac_space.n
        X = tf.placeholder(tf.uint8, ob_shape) #obs
        X, processed_x = observation_input(ob_space, nbatch)
        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
        self.pdtype = make_pdtype(ac_space)
        with tf.variable_scope("model", reuse=reuse):
            h = nature_cnn(X)
            h = nature_cnn(processed_x)
            xs = batch_to_seq(h, nenv, nsteps)
            ms = batch_to_seq(M, nenv, nsteps)
            h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
            h5 = seq_to_batch(h5)
            pi = fc(h5, 'pi', nact)
            vf = fc(h5, 'v', 1)

            self.pdtype = make_pdtype(ac_space)
            self.pd = self.pdtype.pdfromflat(pi)
            self.pd, self.pi = self.pdtype.pdfromlatent(h5)

        v0 = vf[:, 0]
        a0 = self.pd.sample()
@@ -50,7 +47,6 @@ class LnLstmPolicy(object):
        self.X = X
        self.M = M
        self.S = S
        self.pi = pi
        self.vf = vf
        self.step = step
        self.value = value
@@ -59,11 +55,9 @@ class LstmPolicy(object):

    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
        nenv = nbatch // nsteps
        self.pdtype = make_pdtype(ac_space)
        X, processed_x = observation_input(ob_space, nbatch)

        nh, nw, nc = ob_space.shape
        ob_shape = (nbatch, nh, nw, nc)
        nact = ac_space.n
        X = tf.placeholder(tf.uint8, ob_shape) #obs
        M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
        S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
        with tf.variable_scope("model", reuse=reuse):
@@ -72,11 +66,8 @@ class LstmPolicy(object):
            ms = batch_to_seq(M, nenv, nsteps)
            h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
            h5 = seq_to_batch(h5)
            pi = fc(h5, 'pi', nact)
            vf = fc(h5, 'v', 1)

            self.pdtype = make_pdtype(ac_space)
            self.pd = self.pdtype.pdfromflat(pi)
            self.pd, self.pi = self.pdtype.pdfromlatent(h5)

        v0 = vf[:, 0]
        a0 = self.pd.sample()
@@ -92,25 +83,19 @@ class LstmPolicy(object):
        self.X = X
        self.M = M
        self.S = S
        self.pi = pi
        self.vf = vf
        self.step = step
        self.value = value

class CnnPolicy(object):

    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
        nh, nw, nc = ob_space.shape
        ob_shape = (nbatch, nh, nw, nc)
        nact = ac_space.n
        X = tf.placeholder(tf.uint8, ob_shape) #obs
        with tf.variable_scope("model", reuse=reuse):
            h = nature_cnn(X)
            pi = fc(h, 'pi', nact, init_scale=0.01)
            vf = fc(h, 'v', 1)[:,0]

    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False, **conv_kwargs): #pylint: disable=W0613
        self.pdtype = make_pdtype(ac_space)
        self.pd = self.pdtype.pdfromflat(pi)
        X, processed_x = observation_input(ob_space, nbatch)
        with tf.variable_scope("model", reuse=reuse):
            h = nature_cnn(processed_x, **conv_kwargs)
            vf = fc(h, 'v', 1)[:,0]
            self.pd, self.pi = self.pdtype.pdfromlatent(h, init_scale=0.01)

        a0 = self.pd.sample()
        neglogp0 = self.pd.neglogp(a0)
@@ -124,31 +109,25 @@ class CnnPolicy(object):
            return sess.run(vf, {X:ob})

        self.X = X
        self.pi = pi
        self.vf = vf
        self.step = step
        self.value = value

class MlpPolicy(object):
    def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
        ob_shape = (nbatch,) + ob_space.shape
        actdim = ac_space.shape[0]
        X = tf.placeholder(tf.float32, ob_shape, name='Ob') #obs
        with tf.variable_scope("model", reuse=reuse):
            activ = tf.tanh
            h1 = activ(fc(X, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
            h2 = activ(fc(h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
            pi = fc(h2, 'pi', actdim, init_scale=0.01)
            h1 = activ(fc(X, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
            h2 = activ(fc(h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
            vf = fc(h2, 'vf', 1)[:,0]
            logstd = tf.get_variable(name="logstd", shape=[1, actdim],
                                     initializer=tf.zeros_initializer())

        pdparam = tf.concat([pi, pi * 0.0 + logstd], axis=1)

        self.pdtype = make_pdtype(ac_space)
        self.pd = self.pdtype.pdfromflat(pdparam)
        with tf.variable_scope("model", reuse=reuse):
            X, processed_x = observation_input(ob_space, nbatch)
            activ = tf.tanh
            processed_x = tf.layers.flatten(processed_x)
            pi_h1 = activ(fc(processed_x, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
            pi_h2 = activ(fc(pi_h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
            vf_h1 = activ(fc(processed_x, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
            vf_h2 = activ(fc(vf_h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
            vf = fc(vf_h2, 'vf', 1)[:,0]

            self.pd, self.pi = self.pdtype.pdfromlatent(pi_h2, init_scale=0.01)

        a0 = self.pd.sample()
        neglogp0 = self.pd.neglogp(a0)
@@ -162,7 +141,6 @@ class MlpPolicy(object):
            return sess.run(vf, {X:ob})

        self.X = X
        self.pi = pi
        self.vf = vf
        self.step = step
        self.value = value
baselines/ppo2/ppo2.py

@@ -7,6 +7,7 @@ import tensorflow as tf
from baselines import logger
from collections import deque
from baselines.common import explained_variance
from baselines.common.runners import AbstractEnvRunner

class Model(object):
    def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
@@ -84,19 +85,12 @@ class Model(object):
        self.load = load
        tf.global_variables_initializer().run(session=sess) #pylint: disable=E1101

class Runner(object):
class Runner(AbstractEnvRunner):

    def __init__(self, *, env, model, nsteps, gamma, lam):
        self.env = env
        self.model = model
        nenv = env.num_envs
        self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=model.train_model.X.dtype.name)
        self.obs[:] = env.reset()
        self.gamma = gamma
        super().__init__(env=env, model=model, nsteps=nsteps)
        self.lam = lam
        self.nsteps = nsteps
        self.states = model.initial_state
        self.dones = [False for _ in range(nenv)]
        self.gamma = gamma

    def run(self):
        mb_obs, mb_rewards, mb_actions, mb_values, mb_dones, mb_neglogpacs = [],[],[],[],[],[]
@@ -154,7 +148,7 @@ def constfn(val):
def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
            vf_coef=0.5, max_grad_norm=0.5, gamma=0.99, lam=0.95,
            log_interval=10, nminibatches=4, noptepochs=4, cliprange=0.2,
            save_interval=0):
            save_interval=0, load_path=None):

    if isinstance(lr, float): lr = constfn(lr)
    else: assert callable(lr)
@@ -176,6 +170,8 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
        with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
            fh.write(cloudpickle.dumps(make_model))
    model = make_model()
    if load_path is not None:
        model.load(load_path)
    runner = Runner(env=env, model=model, nsteps=nsteps, gamma=gamma, lam=lam)

    epinfobuf = deque(maxlen=100)
@@ -240,6 +236,7 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
            print('Saving to', savepath)
            model.save(savepath)
    env.close()
    return model

def safemean(xs):
    return np.nan if len(xs) == 0 else np.mean(xs)
baselines/ppo2/run_atari.py

@@ -4,7 +4,7 @@ from baselines import logger
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.ppo2 import ppo2
from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy, MlpPolicy
import multiprocessing
import tensorflow as tf

@@ -20,7 +20,7 @@ def train(env_id, num_timesteps, seed, policy):
    tf.Session(config=config).__enter__()

    env = VecFrameStack(make_atari_env(env_id, 8, seed), 4)
    policy = {'cnn' : CnnPolicy, 'lstm' : LstmPolicy, 'lnlstm' : LnLstmPolicy}[policy]
    policy = {'cnn' : CnnPolicy, 'lstm' : LstmPolicy, 'lnlstm' : LnLstmPolicy, 'mlp': MlpPolicy}[policy]
    ppo2.learn(policy=policy, env=env, nsteps=128, nminibatches=4,
        lam=0.95, gamma=0.99, noptepochs=4, log_interval=1,
        ent_coef=.01,

@@ -30,7 +30,7 @@ def train(env_id, num_timesteps, seed, policy):

def main():
    parser = atari_arg_parser()
    parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
    parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm', 'mlp'], default='cnn')
    args = parser.parse_args()
    logger.configure()
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
baselines/ppo2/run_mujoco.py

@@ -1,8 +1,9 @@
#!/usr/bin/env python3
import argparse
import numpy as np
from baselines.common.cmd_util import mujoco_arg_parser
from baselines import bench, logger


def train(env_id, num_timesteps, seed):
    from baselines.common import set_global_seeds
    from baselines.common.vec_env.vec_normalize import VecNormalize
@@ -16,27 +17,40 @@ def train(env_id, num_timesteps, seed):
                            intra_op_parallelism_threads=ncpu,
                            inter_op_parallelism_threads=ncpu)
    tf.Session(config=config).__enter__()

    def make_env():
        env = gym.make(env_id)
        env = bench.Monitor(env, logger.get_dir())
        env = bench.Monitor(env, logger.get_dir(), allow_early_resets=True)
        return env

    env = DummyVecEnv([make_env])
    env = VecNormalize(env)

    set_global_seeds(seed)
    policy = MlpPolicy
    ppo2.learn(policy=policy, env=env, nsteps=2048, nminibatches=32,
        lam=0.95, gamma=0.99, noptepochs=10, log_interval=1,
        ent_coef=0.0,
        lr=3e-4,
        cliprange=0.2,
        total_timesteps=num_timesteps)
    model = ppo2.learn(policy=policy, env=env, nsteps=2048, nminibatches=32,
        lam=0.95, gamma=0.99, noptepochs=10, log_interval=1,
        ent_coef=0.0,
        lr=3e-4,
        cliprange=0.2,
        total_timesteps=num_timesteps)

    return model, env


def main():
    args = mujoco_arg_parser().parse_args()
    logger.configure()
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
    model, env = train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)

    if args.play:
        logger.log("Running trained model")
        obs = np.zeros((env.num_envs,) + env.observation_space.shape)
        obs[:] = env.reset()
        while True:
            actions = model.step(obs)[0]
            obs[:] = env.step(actions)[0]
            env.render()


if __name__ == '__main__':
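Putting the pieces together, the new `load_path` argument to `ppo2.learn` allows resuming from a checkpoint previously written when `save_interval > 0`; a hedged sketch of that workflow follows. The environment setup mirrors the run_mujoco.py changes above, and the environment name, hyperparameters, and checkpoint path are placeholders:

```python
# Illustrative sketch only: resume ppo2 training from a saved checkpoint.
import gym
import tensorflow as tf
from baselines import bench, logger
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.vec_normalize import VecNormalize
from baselines.ppo2 import ppo2
from baselines.ppo2.policies import MlpPolicy

logger.configure()
tf.Session().__enter__()
env = VecNormalize(DummyVecEnv([lambda: bench.Monitor(
    gym.make('Hopper-v2'), logger.get_dir(), allow_early_resets=True)]))

model = ppo2.learn(policy=MlpPolicy, env=env, nsteps=2048, nminibatches=32,
                   lam=0.95, gamma=0.99, noptepochs=10, log_interval=1,
                   ent_coef=0.0, lr=3e-4, cliprange=0.2,
                   total_timesteps=int(1e6),
                   load_path='/path/to/checkpoints/00100')  # placeholder path
```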
7  setup.py
@@ -10,7 +10,7 @@ setup(name='baselines',
      packages=[package for package in find_packages()
                if package.startswith('baselines')],
      install_requires=[
          'gym[mujoco,atari,classic_control]',
          'gym[mujoco,atari,classic_control,robotics]',
          'scipy',
          'tqdm',
          'joblib',
@@ -19,9 +19,12 @@ setup(name='baselines',
          'progressbar2',
          'mpi4py',
          'cloudpickle',
          'tensorflow>=1.4.0',
          'click',
          'opencv-python'
      ],
      description='OpenAI baselines: high quality implementations of reinforcement learning algorithms',
      author='OpenAI',
      url='https://github.com/openai/baselines',
      author_email='gym@openai.com',
      version='0.1.4')
      version='0.1.5')