Compare commits
49 Commits
peterz_viz...peterz_ben

| SHA1 |
| --- |
| ea68f3b7e6 |
| ca721a4be6 |
| 72f3572a10 |
| b9cd941471 |
| 0899b71ede |
| cc8c9541fb |
| cb32522394 |
| 1e40ec22be |
| 701a36cdfa |
| 5a7f9847d8 |
| b63134e5c5 |
| db314cdeda |
| b08c083d91 |
| bfbbe66d9e |
| 1c5c6563b7 |
| 1fa8c58da5 |
| f6d1115ead |
| f6d5a47bed |
| c2df27bee4 |
| 974c15756e |
| ad43fd9a35 |
| 72c357c638 |
| e00e5ca016 |
| 705797f2f0 |
| fcd84aa831 |
| 390b51597a |
| 95104a3592 |
| 3528f7b992 |
| 151e48009e |
| 92f33335e9 |
| af729cff15 |
| 10f815fe1d |
| 8c4adac898 |
| 2a93ea8782 |
| 9c48f9fad5 |
| 348cbb4b71 |
| a1602ab15f |
| e63e69bb14 |
| 385e7e5c0d |
| d112a2e49f |
| e662dd6409 |
| efc6bffce3 |
| 872181d4c3 |
| 628ddecf6a |
| 83a4a4be65 |
| 7edac38c73 |
| a6dca44115 |
| 622915c473 |
| a1d3c18ec0 |
@@ -10,5 +10,5 @@ install:
- docker build . -t baselines-test

script:
- flake8 . --show-source --statistics
- docker run baselines-test pytest -v --forked .
- flake8 --select=F,E999 baselines/common baselines/trpo_mpi baselines/ppo2 baselines/a2c baselines/deepq baselines/acer
- docker run baselines-test pytest --runslow
16 Dockerfile
@@ -1,9 +1,16 @@
FROM python:3.6

RUN apt-get -y update && apt-get -y install ffmpeg
# RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake python-opencv
FROM ubuntu:16.04

RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake python-opencv
ENV CODE_DIR /root/code
ENV VENV /root/venv

RUN \
pip install virtualenv && \
virtualenv $VENV --python=python3 && \
. $VENV/bin/activate && \
pip install --upgrade pip

ENV PATH=$VENV/bin:$PATH

COPY . $CODE_DIR/baselines
WORKDIR $CODE_DIR/baselines
@@ -11,7 +18,6 @@ WORKDIR $CODE_DIR/baselines
# Clean up pycache and pyc files
RUN rm -rf __pycache__ && \
find . -name "*.pyc" -delete && \
pip install tensorflow && \
pip install -e .[test]
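For reference, the CI hunk at the top of this comparison (the `install:`/`script:` steps) exercises this image; outside CI the same two steps can be run by hand. The `baselines-test` tag and the pytest flags below are copied from that hunk, not added here:

```bash
# Build the image from the repository root, then run the test suite inside the container
docker build . -t baselines-test
docker run baselines-test pytest -v --forked .
```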
67 README.md
@@ -15,7 +15,7 @@ sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zli
```

### Mac OS X
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the follwing:
```bash
brew install cmake openmpi
```
@@ -38,27 +38,20 @@ More thorough tutorial on virtualenvs and options can be found [here](https://vi

## Installation
- Clone the repo and cd into it:
Clone the repo and cd into it:
```bash
git clone https://github.com/openai/baselines.git
cd baselines
```
- If you don't have TensorFlow installed already, install your favourite flavor of TensorFlow. In most cases,
If using virtualenv, create a new virtualenv and activate it
```bash
pip install tensorflow-gpu # if you have a CUDA-compatible gpu and proper drivers
virtualenv env --python=python3
. env/bin/activate
```
or
```bash
pip install tensorflow
```
should be sufficient. Refer to [TensorFlow installation guide](https://www.tensorflow.org/install/)
for more details.

- Install baselines package
Install baselines package
```bash
pip install -e .
```

### MuJoCo
Some of the baselines examples use [MuJoCo](http://www.mujoco.org) (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py)

@@ -69,25 +62,34 @@ pip install pytest
pytest
```

## Subpackages

## Testing the installation
All unit tests in baselines can be run using pytest runner:
```
pip install pytest
pytest
```

## Training models
Most of the algorithms in baselines repo are used as follows:
```bash
python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
```
### Example 1. PPO with MuJoCo Humanoid
For instance, to train a fully-connected network controlling MuJoCo humanoid using PPO2 for 20M timesteps
For instance, to train a fully-connected network controlling MuJoCo humanoid using a2c for 20M timesteps
```bash
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
```
Note that for mujoco environments fully-connected network is default, so we can omit `--network=mlp`
The hyperparameters for both network and the learning algorithm can be controlled via the command line, for instance:
```bash
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
```
will set entropy coefficient to 0.1, and construct fully connected network with 3 layers with 32 hidden units in each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same)
will set entropy coeffient to 0.1, and construct fully connected network with 3 layers with 32 hidden units in each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same)

See docstrings in [common/models.py](baselines/common/models.py) for description of network parameters for each type of model, and
docstring for [baselines/ppo2/ppo2.py/learn()](baselines/ppo2/ppo2.py#L152) for the description of the ppo2 hyperparamters.
See docstrings in [common/models.py](common/models.py) for description of network parameters for each type of model, and
docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) fir the description of the ppo2 hyperparamters.

### Example 2. DQN on Atari
DQN with Atari is at this point a classics of benchmarks. To run the baselines implementation of DQN on Atari Pong:
@@ -100,17 +102,19 @@ The algorithms serialization API is not properly unified yet; however, there is
`--save_path` and `--load_path` command-line option loads the tensorflow state from a given path before training, and saves it after the training, respectively.
Let's imagine you'd like to train ppo2 on Atari Pong, save the model and then later visualize what has it learnt.
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num-timesteps=2e7 --save_path=~/models/pong_20M_ppo2
```
This should get to the mean reward per episode about 20. To load and visualize the model, we'll do the following - load the model, train it for 0 steps, and then visualize:
This should get to the mean reward per episode about 5k. To load and visualize the model, we'll do the following - load the model, train it for 0 steps, and then visualize:
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num-timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
```

*NOTE:* At the moment Mujoco training uses VecNormalize wrapper for the environment which is not being saved correctly; so loading the models trained on Mujoco will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd by TfRunningMeanStd in [baselines/common/vec_env/vec_normalize.py](baselines/common/vec_env/vec_normalize.py#L12). This way, mean and std of environment normalizing wrapper will be saved in tensorflow variables and included in the model file; however, training is slower that way - hence not including it by default

## Loading and vizualizing learning curves and other training metrics
See [here](docs/viz/viz.ipynb) for instructions on how to load and display the training data.

## Subpackages

@@ -121,23 +125,14 @@ See [here](docs/viz/viz.ipynb) for instructions on how to load and display the t
- [DQN](baselines/deepq)
- [GAIL](baselines/gail)
- [HER](baselines/her)
- [PPO1](baselines/ppo1) (obsolete version, left here temporarily)
- [PPO2](baselines/ppo2)
- [PPO1](baselines/ppo1) (Multi-CPU using MPI)
- [PPO2](baselines/ppo2) (Optimized for GPU)
- [TRPO](baselines/trpo_mpi)

## Benchmarks
Results of benchmarks on Mujoco (1M timesteps) and Atari (10M timesteps) are available
[here for Mujoco](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm)
and
[here for Atari](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_atari10M.htm)
respectively. Note that these results may be not on the latest version of the code, particular commit hash with which results were obtained is specified on the benchmarks page.

To cite this repository in publications:

@misc{baselines,
author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai and Zhokhov, Peter},
author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
title = {OpenAI Baselines},
year = {2017},
publisher = {GitHub},
baselines/a2c/README.md
@@ -2,12 +2,4 @@

- Original paper: https://arxiv.org/abs/1602.01783
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options
- also refer to the repo-wide [README.md](../../README.md#training-models)

## Files
- `run_atari`: file used to run the algorithm.
- `policies.py`: contains the different versions of the A2C architecture (MlpPolicy, CNNPolicy, LstmPolicy...).
- `a2c.py`: - Model : class used to initialize the step_model (sampling) and train_model (training)
- learn : Main entrypoint for A2C algorithm. Train a policy with given network architecture on a given environment using a2c algorithm.
- `runner.py`: class used to generates a batch of experiences
- `python -m baselines.a2c.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
baselines/a2c/a2c.py
@@ -16,18 +16,6 @@ from tensorflow import losses

class Model(object):

"""
We use this class to :
__init__:
- Creates the step_model
- Creates the train_model

train():
- Make the training part (feedforward and retropropagation of gradients)

save/load():
- Save load the model
"""
def __init__(self, policy, env, nsteps,
ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5, lr=7e-4,
alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):
@@ -38,10 +26,7 @@ class Model(object):

with tf.variable_scope('a2c_model', reuse=tf.AUTO_REUSE):
# step_model is used for sampling
step_model = policy(nenvs, 1, sess)

# train_model is used to train our network
train_model = policy(nbatch, nsteps, sess)

A = tf.placeholder(train_model.action.dtype, train_model.action.shape)
@@ -49,45 +34,25 @@ class Model(object):
R = tf.placeholder(tf.float32, [nbatch])
LR = tf.placeholder(tf.float32, [])

# Calculate the loss
# Total loss = Policy gradient loss - entropy * entropy coefficient + Value coefficient * value loss

# Policy loss
neglogpac = train_model.pd.neglogp(A)
# L = A(s,a) * -logpi(a|s)
pg_loss = tf.reduce_mean(ADV * neglogpac)

# Entropy is used to improve exploration by limiting the premature convergence to suboptimal policy.
entropy = tf.reduce_mean(train_model.pd.entropy())

# Value loss
pg_loss = tf.reduce_mean(ADV * neglogpac)
vf_loss = losses.mean_squared_error(tf.squeeze(train_model.vf), R)

loss = pg_loss - entropy*ent_coef + vf_loss * vf_coef

# Update parameters using loss
# 1. Get the model parameters
params = find_trainable_variables("a2c_model")

# 2. Calculate the gradients
grads = tf.gradients(loss, params)
if max_grad_norm is not None:
# Clip the gradients (normalize)
grads, grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
grads = list(zip(grads, params))
# zip aggregate each gradient with parameters associated
# For instance zip(ABCD, xyza) => Ax, By, Cz, Da

# 3. Make op for one policy and value update step of A2C
trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=alpha, epsilon=epsilon)

_train = trainer.apply_gradients(grads)

lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)

def train(obs, states, rewards, masks, actions, values):
# Here we calculate advantage A(s,a) = R + yV(s') - V(s)
# rewards = R + yV(s')
advs = rewards - values
for step in range(len(obs)):
cur_lr = lr.value()
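As a reading aid for the loss construction in the hunk above (not text from either branch): with `ent_coef` and `vf_coef` written as $c_{ent}$ and $c_{vf}$, the objective assembled there is

```latex
L(\theta) =
\underbrace{\mathbb{E}_t\!\left[-\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]}_{\texttt{pg\_loss}}
- c_{ent}\,\underbrace{\mathbb{E}_t\!\left[\mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big)\right]}_{\texttt{entropy}}
+ c_{vf}\,\underbrace{\mathbb{E}_t\!\left[\big(V_\theta(s_t) - R_t\big)^2\right]}_{\texttt{vf\_loss}},
\qquad
\hat{A}_t = R_t - V_\theta(s_t),
```

where $R_t$ is the bootstrapped n-step return produced by the runner (the `rewards` array fed to `train`), which is exactly what the `advs = rewards - values` line computes.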
@@ -183,37 +148,23 @@ def learn(

set_global_seeds(seed)

# Get the nb of env
nenvs = env.num_envs
policy = build_policy(env, network, **network_kwargs)

# Instantiate the model object (that creates step_model and train_model)
model = Model(policy=policy, env=env, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule)
if load_path is not None:
model.load(load_path)

# Instantiate the runner object
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)

# Calculate the batch_size
nbatch = nenvs*nsteps

# Start total timer
tstart = time.time()

for update in range(1, total_timesteps//nbatch+1):
# Get mini batch of experiences
obs, states, rewards, masks, actions, values = runner.run()

policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
nseconds = time.time()-tstart

# Calculate the fps (frame per second)
fps = int((update*nbatch)/nseconds)
if update % log_interval == 0 or update == 1:
# Calculates if value function is a good predicator of the returns (ev > 1)
# or if it's just worse than predicting nothing (ev =< 0)
ev = explained_variance(values, rewards)
logger.record_tabular("nupdates", update)
logger.record_tabular("total_timesteps", update*nbatch)
@@ -222,5 +173,6 @@ def learn(
logger.record_tabular("value_loss", float(value_loss))
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()
env.close()
return model
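The logging block above relies on `explained_variance(values, rewards)` (imported from `baselines.common`) to report how much of the variance of the returns the value function accounts for. A minimal NumPy sketch of that quantity, written from its usual definition rather than copied from either branch, is:

```python
import numpy as np

def explained_variance(ypred, y):
    """1 - Var[y - ypred] / Var[y]: close to 1 means the value function predicts the
    returns well, 0 means it is no better than predicting the mean, and negative
    values mean it is actively worse than predicting nothing."""
    vary = np.var(y)
    return np.nan if vary == 0 else 1.0 - np.var(y - ypred) / vary
```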
baselines/a2c/runner.py
@@ -3,15 +3,7 @@ from baselines.a2c.utils import discount_with_dones
from baselines.common.runners import AbstractEnvRunner

class Runner(AbstractEnvRunner):
"""
We use this class to generate batches of experiences

__init__:
- Initialize the runner

run():
- Make a mini batch of experiences
"""
def __init__(self, env, model, nsteps=5, gamma=0.99):
super().__init__(env=env, model=model, nsteps=nsteps)
self.gamma = gamma
@@ -19,21 +11,14 @@ class Runner(AbstractEnvRunner):
self.ob_dtype = model.train_model.X.dtype.as_numpy_dtype

def run(self):
# We initialize the lists that will contain the mb of experiences
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
mb_states = self.states
for n in range(self.nsteps):
# Given observations, take action and value (V(s))
# We already have self.obs because Runner superclass run self.obs[:] = env.reset() on init
actions, values, states, _ = self.model.step(self.obs, S=self.states, M=self.dones)

# Append the experiences
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_values.append(values)
mb_dones.append(self.dones)

# Take actions in env and look the results
obs, rewards, dones, _ = self.env.step(actions)
self.states = states
self.dones = dones
@@ -43,8 +28,8 @@ class Runner(AbstractEnvRunner):
self.obs = obs
mb_rewards.append(rewards)
mb_dones.append(self.dones)
#batch of steps to batch of rollouts

# Batch of steps to batch of rollouts
mb_obs = np.asarray(mb_obs, dtype=self.ob_dtype).swapaxes(1, 0).reshape(self.batch_ob_shape)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=self.model.train_model.action.dtype.name).swapaxes(1, 0)
@@ -55,7 +40,7 @@ class Runner(AbstractEnvRunner):

if self.gamma > 0.0:
# Discount/bootstrap off value fn
#discount/bootstrap off value fn
last_values = self.model.value(self.obs, S=self.states, M=self.dones).tolist()
for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
rewards = rewards.tolist()
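The hunk header above shows this runner importing `discount_with_dones` from `baselines.a2c.utils`; that helper is what turns the collected `mb_rewards` (with the bootstrap value appended by the caller) into discounted returns. A self-contained sketch of that kind of backward pass, with the signature assumed from the import, is:

```python
def discount_with_dones(rewards, dones, gamma):
    # Walk the trajectory backwards, resetting the running return whenever an
    # episode ended so that rewards never leak across episode boundaries.
    discounted, running = [], 0.0
    for reward, done in zip(rewards[::-1], dones[::-1]):
        running = reward + gamma * running * (1.0 - done)
        discounted.append(running)
    return discounted[::-1]

# Example: a 3-step rollout where the episode ends after the second step.
# discount_with_dones([1.0, 1.0, 1.0], [0, 1, 0], 0.99) -> [1.99, 1.0, 1.0]
```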
baselines/acer/README.md
@@ -1,6 +1,4 @@
# ACER

- Original paper: https://arxiv.org/abs/1611.01224
- `python -m baselines.run --alg=acer --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)

- `python -m baselines.acer.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
baselines/acer/acer.py
@@ -7,7 +7,6 @@ from baselines import logger
from baselines.common import set_global_seeds
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables
from baselines.common.vec_env.vec_frame_stack import VecFrameStack

from baselines.a2c.utils import batch_to_seq, seq_to_batch
from baselines.a2c.utils import cat_entropy_softmax
@@ -56,7 +55,8 @@ def q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma):
# return tf.minimum(1 + eps_clip, tf.maximum(1 - eps_clip, ratio))

class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, ent_coef, q_coef, gamma, max_grad_norm, lr,
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, nstack, num_procs,
ent_coef, q_coef, gamma, max_grad_norm, lr,
rprop_alpha, rprop_epsilon, total_timesteps, lrschedule,
c, trust_region, alpha, delta):

@@ -71,8 +71,8 @@ class Model(object):
LR = tf.placeholder(tf.float32, [])
eps = 1e-6

step_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs,) + ob_space.shape)
train_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs*(nsteps+1),) + ob_space.shape)
step_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs,) + ob_space.shape[:-1] + (ob_space.shape[-1] * nstack,))
train_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs*(nsteps+1),) + ob_space.shape[:-1] + (ob_space.shape[-1] * nstack,))
with tf.variable_scope('acer_model', reuse=tf.AUTO_REUSE):

step_model = policy(observ_placeholder=step_ob_placeholder, sess=sess)
@@ -247,7 +247,6 @@ class Acer():
# get obs, actions, rewards, mus, dones from buffer.
obs, actions, rewards, mus, dones, masks = buffer.get()

# reshape stuff correctly
obs = obs.reshape(runner.batch_ob_shape)
actions = actions.reshape([runner.nbatch])
@@ -271,7 +270,7 @@ class Acer():
logger.dump_tabular()

def learn(network, env, seed=None, nsteps=20, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
def learn(network, env, seed=None, nsteps=20, nstack=4, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
max_grad_norm=10, lr=7e-4, lrschedule='linear', rprop_epsilon=1e-5, rprop_alpha=0.99, gamma=0.99,
log_interval=100, buffer_size=50000, replay_ratio=4, replay_start=10000, c=10.0,
trust_region=True, alpha=0.99, delta=1, load_path=None, **network_kwargs):
@@ -343,24 +342,21 @@ def learn(network, env, seed=None, nsteps=20, total_timesteps=int(80e6), q_coef=
print("Running Acer Simple")
print(locals())
set_global_seeds(seed)
if not isinstance(env, VecFrameStack):
env = VecFrameStack(env, 1)

policy = build_policy(env, network, estimate_q=True, **network_kwargs)

nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space

nstack = env.nstack
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps,
ent_coef=ent_coef, q_coef=q_coef, gamma=gamma,
num_procs = len(env.remotes) if hasattr(env, 'remotes') else 1# HACK
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, nstack=nstack,
num_procs=num_procs, ent_coef=ent_coef, q_coef=q_coef, gamma=gamma,
max_grad_norm=max_grad_norm, lr=lr, rprop_alpha=rprop_alpha, rprop_epsilon=rprop_epsilon,
total_timesteps=total_timesteps, lrschedule=lrschedule, c=c,
trust_region=trust_region, alpha=alpha, delta=delta)

runner = Runner(env=env, model=model, nsteps=nsteps)
runner = Runner(env=env, model=model, nsteps=nsteps, nstack=nstack)
if replay_ratio > 0:
buffer = Buffer(env=env, nsteps=nsteps, size=buffer_size)
buffer = Buffer(env=env, nsteps=nsteps, nstack=nstack, size=buffer_size)
else:
buffer = None
nbatch = nenvs*nsteps
@@ -374,4 +370,5 @@ def learn(network, env, seed=None, nsteps=20, total_timesteps=int(80e6), q_coef=
for _ in range(n):
acer.call(on_policy=False) # no simulation steps in this

env.close()
return model
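One of the hunk headers above touches `q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma)`, which computes ACER's off-policy return target. For reference only (this is the Retrace target from the ACER paper, https://arxiv.org/abs/1611.01224, not text from either branch), with truncated importance weights $\bar{\rho}_t = \min(c, \rho_t)$ the target is defined recursively as

```latex
Q^{\mathrm{ret}}(s_t, a_t) \;=\; r_t
\;+\; \gamma\,\bar{\rho}_{t+1}\Big[Q^{\mathrm{ret}}(s_{t+1}, a_{t+1}) - Q(s_{t+1}, a_{t+1})\Big]
\;+\; \gamma\, V(s_{t+1}).
```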
baselines/acer/buffer.py
@@ -2,16 +2,11 @@ import numpy as np

class Buffer(object):
# gets obs, actions, rewards, mu's, (states, masks), dones
def __init__(self, env, nsteps, size=50000):
def __init__(self, env, nsteps, nstack, size=50000):
self.nenv = env.num_envs
self.nsteps = nsteps
# self.nh, self.nw, self.nc = env.observation_space.shape
self.obs_shape = env.observation_space.shape
self.obs_dtype = env.observation_space.dtype
self.ac_dtype = env.action_space.dtype
self.nc = self.obs_shape[-1]
self.nstack = env.nstack
self.nc //= self.nstack
self.nh, self.nw, self.nc = env.observation_space.shape
self.nstack = nstack
self.nbatch = self.nenv * self.nsteps
self.size = size // (self.nsteps) # Each loc contains nenv * nsteps frames, thus total buffer is nenv * size frames

@@ -38,11 +33,22 @@ class Buffer(object):
# Generate stacked frames
def decode(self, enc_obs, dones):
# enc_obs has shape [nenvs, nsteps + nstack, nh, nw, nc]
# dones has shape [nenvs, nsteps]
# dones has shape [nenvs, nsteps, nh, nw, nc]
# returns stacked obs of shape [nenv, (nsteps + 1), nh, nw, nstack*nc]

return _stack_obs(enc_obs, dones,
nsteps=self.nsteps)
nstack, nenv, nsteps, nh, nw, nc = self.nstack, self.nenv, self.nsteps, self.nh, self.nw, self.nc
y = np.empty([nsteps + nstack - 1, nenv, 1, 1, 1], dtype=np.float32)
obs = np.zeros([nstack, nsteps + nstack, nenv, nh, nw, nc], dtype=np.uint8)
x = np.reshape(enc_obs, [nenv, nsteps + nstack, nh, nw, nc]).swapaxes(1,
0) # [nsteps + nstack, nenv, nh, nw, nc]
y[3:] = np.reshape(1.0 - dones, [nenv, nsteps, 1, 1, 1]).swapaxes(1, 0) # keep
y[:3] = 1.0
# y = np.reshape(1 - dones, [nenvs, nsteps, 1, 1, 1])
for i in range(nstack):
obs[-(i + 1), i:] = x
# obs[:,i:,:,:,-(i+1),:] = x
x = x[:-1] * y
y = y[1:]
return np.reshape(obs[:, 3:].transpose((2, 1, 3, 4, 0, 5)), [nenv, (nsteps + 1), nh, nw, nstack * nc])

def put(self, enc_obs, actions, rewards, mus, dones, masks):
# enc_obs [nenv, (nsteps + nstack), nh, nw, nc]
@@ -50,8 +56,8 @@ class Buffer(object):
# mus [nenv, nsteps, nact]

if self.enc_obs is None:
self.enc_obs = np.empty([self.size] + list(enc_obs.shape), dtype=self.obs_dtype)
self.actions = np.empty([self.size] + list(actions.shape), dtype=self.ac_dtype)
self.enc_obs = np.empty([self.size] + list(enc_obs.shape), dtype=np.uint8)
self.actions = np.empty([self.size] + list(actions.shape), dtype=np.int32)
self.rewards = np.empty([self.size] + list(rewards.shape), dtype=np.float32)
self.mus = np.empty([self.size] + list(mus.shape), dtype=np.float32)
self.dones = np.empty([self.size] + list(dones.shape), dtype=np.bool)
@@ -95,62 +101,3 @@ class Buffer(object):
mus = take(self.mus)
masks = take(self.masks)
return obs, actions, rewards, mus, dones, masks

def _stack_obs_ref(enc_obs, dones, nsteps):
nenv = enc_obs.shape[0]
nstack = enc_obs.shape[1] - nsteps
nh, nw, nc = enc_obs.shape[2:]
obs_dtype = enc_obs.dtype
obs_shape = (nh, nw, nc*nstack)

mask = np.empty([nsteps + nstack - 1, nenv, 1, 1, 1], dtype=np.float32)
obs = np.zeros([nstack, nsteps + nstack, nenv, nh, nw, nc], dtype=obs_dtype)
x = np.reshape(enc_obs, [nenv, nsteps + nstack, nh, nw, nc]).swapaxes(1, 0) # [nsteps + nstack, nenv, nh, nw, nc]

mask[nstack-1:] = np.reshape(1.0 - dones, [nenv, nsteps, 1, 1, 1]).swapaxes(1, 0) # keep
mask[:nstack-1] = 1.0

# y = np.reshape(1 - dones, [nenvs, nsteps, 1, 1, 1])
for i in range(nstack):
obs[-(i + 1), i:] = x
# obs[:,i:,:,:,-(i+1),:] = x
x = x[:-1] * mask
mask = mask[1:]

return np.reshape(obs[:, (nstack-1):].transpose((2, 1, 3, 4, 0, 5)), (nenv, (nsteps + 1)) + obs_shape)

def _stack_obs(enc_obs, dones, nsteps):
nenv = enc_obs.shape[0]
nstack = enc_obs.shape[1] - nsteps
nc = enc_obs.shape[-1]

obs_ = np.zeros((nenv, nsteps + 1) + enc_obs.shape[2:-1] + (enc_obs.shape[-1] * nstack, ), dtype=enc_obs.dtype)
mask = np.ones((nenv, nsteps+1), dtype=enc_obs.dtype)
mask[:, 1:] = 1.0 - dones
mask = mask.reshape(mask.shape + tuple(np.ones(len(enc_obs.shape)-2, dtype=np.uint8)))

for i in range(nstack-1, -1, -1):
obs_[..., i * nc : (i + 1) * nc] = enc_obs[:, i : i + nsteps + 1, :]
if i < nstack-1:
obs_[..., i * nc : (i + 1) * nc] *= mask
mask[:, 1:, ...] *= mask[:, :-1, ...]

return obs_

def test_stack_obs():
nstack = 7
nenv = 1
nsteps = 5

obs_shape = (2, 3, nstack)

enc_obs_shape = (nenv, nsteps + nstack) + obs_shape[:-1] + (1,)
enc_obs = np.random.random(enc_obs_shape)
dones = np.random.randint(low=0, high=2, size=(nenv, nsteps))

stacked_obs_ref = _stack_obs_ref(enc_obs, dones, nsteps=nsteps)
stacked_obs_test = _stack_obs(enc_obs, dones, nsteps=nsteps)

np.testing.assert_allclose(stacked_obs_ref, stacked_obs_test)
baselines/acer/runner.py
@@ -1,31 +1,30 @@
import numpy as np
from baselines.common.runners import AbstractEnvRunner
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from gym import spaces

class Runner(AbstractEnvRunner):

def __init__(self, env, model, nsteps):
def __init__(self, env, model, nsteps, nstack):
super().__init__(env=env, model=model, nsteps=nsteps)
assert isinstance(env.action_space, spaces.Discrete), 'This ACER implementation works only with discrete action spaces!'
assert isinstance(env, VecFrameStack)

self.nstack = nstack
nh, nw, nc = env.observation_space.shape
self.nc = nc # nc = 1 for atari, but just in case
self.nact = env.action_space.n
nenv = self.nenv
self.nbatch = nenv * nsteps
self.batch_ob_shape = (nenv*(nsteps+1),) + env.observation_space.shape

self.obs = env.reset()
self.obs_dtype = env.observation_space.dtype
self.ac_dtype = env.action_space.dtype
self.nstack = self.env.nstack
self.nc = self.batch_ob_shape[-1] // self.nstack
self.batch_ob_shape = (nenv*(nsteps+1), nh, nw, nc*nstack)
self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
obs = env.reset()
self.update_obs(obs)

def update_obs(self, obs, dones=None):
#self.obs = obs
if dones is not None:
self.obs *= (1 - dones.astype(np.uint8))[:, None, None, None]
self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
self.obs[:, :, :, -self.nc:] = obs[:, :, :, :]

def run(self):
# enc_obs = np.split(self.obs, self.nstack, axis=3) # so now list of obs steps
enc_obs = np.split(self.env.stackedobs, self.env.nstack, axis=-1)
enc_obs = np.split(self.obs, self.nstack, axis=3) # so now list of obs steps
mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
for _ in range(self.nsteps):
actions, mus, states = self.model._step(self.obs, S=self.states, M=self.dones)
@@ -37,15 +36,15 @@ class Runner(AbstractEnvRunner):
# states information for statefull models like LSTM
self.states = states
self.dones = dones
self.obs = obs
self.update_obs(obs, dones)
mb_rewards.append(rewards)
enc_obs.append(obs[..., -self.nc:])
enc_obs.append(obs)
mb_obs.append(np.copy(self.obs))
mb_dones.append(self.dones)

enc_obs = np.asarray(enc_obs, dtype=self.obs_dtype).swapaxes(1, 0)
mb_obs = np.asarray(mb_obs, dtype=self.obs_dtype).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=self.ac_dtype).swapaxes(1, 0)
enc_obs = np.asarray(enc_obs, dtype=np.uint8).swapaxes(1, 0)
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)
baselines/acktr/README.md
@@ -2,8 +2,4 @@

- Original paper: https://arxiv.org/abs/1708.05144
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.run --alg=acktr --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)

## ACKTR with continuous action spaces
The code of ACKTR has been refactored to handle both discrete and continuous action spaces uniformly. In the original version, discrete and continuous action spaces were handled by different code (actkr_disc.py and acktr_cont.py) with little overlap. If interested in the original version of the acktr for continuous action spaces, use `old_acktr_cont` branch. Note that original code performs better on the mujoco tasks than the refactored version; we are still investigating why.
- `python -m baselines.acktr.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
baselines/acktr/acktr.py
@@ -1,152 +1 @@
import os.path as osp
import time
import functools
import tensorflow as tf
from baselines import logger

from baselines.common import set_global_seeds, explained_variance
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables, load_variables

from baselines.a2c.runner import Runner
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.acktr import kfac

class Model(object):

def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, lrschedule='linear', is_async=True):

self.sess = sess = get_session()
nbatch = nenvs * nsteps
with tf.variable_scope('acktr_model', reuse=tf.AUTO_REUSE):
self.model = step_model = policy(nenvs, 1, sess=sess)
self.model2 = train_model = policy(nenvs*nsteps, nsteps, sess=sess)

A = train_model.pdtype.sample_placeholder([None])
ADV = tf.placeholder(tf.float32, [nbatch])
R = tf.placeholder(tf.float32, [nbatch])
PG_LR = tf.placeholder(tf.float32, [])
VF_LR = tf.placeholder(tf.float32, [])

neglogpac = train_model.pd.neglogp(A)
self.logits = train_model.pi

##training loss
pg_loss = tf.reduce_mean(ADV*neglogpac)
entropy = tf.reduce_mean(train_model.pd.entropy())
pg_loss = pg_loss - ent_coef * entropy
vf_loss = tf.losses.mean_squared_error(tf.squeeze(train_model.vf), R)
train_loss = pg_loss + vf_coef * vf_loss

##Fisher loss construction
self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(neglogpac)
sample_net = train_model.vf + tf.random_normal(tf.shape(train_model.vf))
self.vf_fisher = vf_fisher_loss = - vf_fisher_coef*tf.reduce_mean(tf.pow(train_model.vf - tf.stop_gradient(sample_net), 2))
self.joint_fisher = joint_fisher_loss = pg_fisher_loss + vf_fisher_loss

self.params=params = find_trainable_variables("acktr_model")

self.grads_check = grads = tf.gradients(train_loss,params)

with tf.device('/gpu:0'):
self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,\
momentum=0.9, kfac_update=1, epsilon=0.01,\
stats_decay=0.99, is_async=is_async, cold_iter=10, max_grad_norm=max_grad_norm)

# update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
self.q_runner = q_runner
self.lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)

def train(obs, states, rewards, masks, actions, values):
advs = rewards - values
for step in range(len(obs)):
cur_lr = self.lr.value()

td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, PG_LR:cur_lr, VF_LR:cur_lr}
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks

policy_loss, value_loss, policy_entropy, _ = sess.run(
[pg_loss, vf_loss, entropy, train_op],
td_map
)
return policy_loss, value_loss, policy_entropy

self.train = train
self.save = functools.partial(save_variables, sess=sess)
self.load = functools.partial(load_variables, sess=sess)
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
self.value = step_model.value
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)

def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, save_interval=None, lrschedule='linear', load_path=None, is_async=True, **network_kwargs):
set_global_seeds(seed)

if network == 'cnn':
network_kwargs['one_dim_bias'] = True

policy = build_policy(env, network, **network_kwargs)

nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
make_model = lambda : Model(policy, ob_space, ac_space, nenvs, total_timesteps, nprocs=nprocs, nsteps
=nsteps, ent_coef=ent_coef, vf_coef=vf_coef, vf_fisher_coef=
vf_fisher_coef, lr=lr, max_grad_norm=max_grad_norm, kfac_clip=kfac_clip,
lrschedule=lrschedule, is_async=is_async)
if save_interval and logger.get_dir():
import cloudpickle
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
fh.write(cloudpickle.dumps(make_model))
model = make_model()

if load_path is not None:
model.load(load_path)

runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
nbatch = nenvs*nsteps
tstart = time.time()
coord = tf.train.Coordinator()
if is_async:
enqueue_threads = model.q_runner.create_threads(model.sess, coord=coord, start=True)
else:
enqueue_threads = []

for update in range(1, total_timesteps//nbatch+1):
obs, states, rewards, masks, actions, values = runner.run()
policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
model.old_obs = obs
nseconds = time.time()-tstart
fps = int((update*nbatch)/nseconds)
if update % log_interval == 0 or update == 1:
ev = explained_variance(values, rewards)
logger.record_tabular("nupdates", update)
logger.record_tabular("total_timesteps", update*nbatch)
logger.record_tabular("fps", fps)
logger.record_tabular("policy_entropy", float(policy_entropy))
logger.record_tabular("policy_loss", float(policy_loss))
logger.record_tabular("value_loss", float(value_loss))
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()

if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
savepath = osp.join(logger.get_dir(), 'checkpoint%.5i'%update)
print('Saving to', savepath)
model.save(savepath)
coord.request_stop()
coord.join(enqueue_threads)
return model
from baselines.acktr.acktr_disc import *
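A note on the `##Fisher loss construction` block that appears in both versions of this model: K-FAC estimates its Fisher blocks from the gradients of log-likelihoods, so the policy term reuses the negative of `neglogpac`, and the value term treats the value head as the mean of a unit-variance Gaussian predicting the sampled target `sample_net = vf + noise` (this reading is an interpretation of the code, not a comment from either branch). With `sg` denoting `stop_gradient`:

```latex
\mathcal{L}^{\mathrm{pg}}_{F} = \mathbb{E}_t\big[\log \pi_\theta(a_t \mid s_t)\big],
\qquad
\mathcal{L}^{\mathrm{vf}}_{F} = -\,c_{vf\text{-}fisher}\;
\mathbb{E}_t\Big[\big(V_\theta(s_t) - \mathrm{sg}(V_\theta(s_t) + \epsilon_t)\big)^2\Big],
\quad \epsilon_t \sim \mathcal{N}(0, 1),
```

and `joint_fisher_loss` is simply the sum of the two terms.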
142 baselines/acktr/acktr_cont.py Normal file
@@ -0,0 +1,142 @@
import numpy as np
import tensorflow as tf
from baselines import logger
import baselines.common as common
from baselines.common import tf_util as U
from baselines.acktr import kfac
from baselines.common.filters import ZFilter

def pathlength(path):
return path["reward"].shape[0]# Loss function that we'll differentiate to get the policy gradient

def rollout(env, policy, max_pathlength, animate=False, obfilter=None):
"""
Simulate the env and policy for max_pathlength steps
"""
ob = env.reset()
prev_ob = np.float32(np.zeros(ob.shape))
if obfilter: ob = obfilter(ob)
terminated = False

obs = []
acs = []
ac_dists = []
logps = []
rewards = []
for _ in range(max_pathlength):
if animate:
env.render()
state = np.concatenate([ob, prev_ob], -1)
obs.append(state)
ac, ac_dist, logp = policy.act(state)
acs.append(ac)
ac_dists.append(ac_dist)
logps.append(logp)
prev_ob = np.copy(ob)
scaled_ac = env.action_space.low + (ac + 1.) * 0.5 * (env.action_space.high - env.action_space.low)
scaled_ac = np.clip(scaled_ac, env.action_space.low, env.action_space.high)
ob, rew, done, _ = env.step(scaled_ac)
if obfilter: ob = obfilter(ob)
rewards.append(rew)
if done:
terminated = True
break
return {"observation" : np.array(obs), "terminated" : terminated,
"reward" : np.array(rewards), "action" : np.array(acs),
"action_dist": np.array(ac_dists), "logp" : np.array(logps)}

def learn(env, policy, vf, gamma, lam, timesteps_per_batch, num_timesteps,
animate=False, callback=None, desired_kl=0.002):

obfilter = ZFilter(env.observation_space.shape)

max_pathlength = env.spec.timestep_limit
stepsize = tf.Variable(initial_value=np.float32(np.array(0.03)), name='stepsize')
inputs, loss, loss_sampled = policy.update_info
optim = kfac.KfacOptimizer(learning_rate=stepsize, cold_lr=stepsize*(1-0.9), momentum=0.9, kfac_update=2,\
epsilon=1e-2, stats_decay=0.99, async=1, cold_iter=1,
weight_decay_dict=policy.wd_dict, max_grad_norm=None)
pi_var_list = []
for var in tf.trainable_variables():
if "pi" in var.name:
pi_var_list.append(var)

update_op, q_runner = optim.minimize(loss, loss_sampled, var_list=pi_var_list)
do_update = U.function(inputs, update_op)
U.initialize()

# start queue runners
enqueue_threads = []
coord = tf.train.Coordinator()
for qr in [q_runner, vf.q_runner]:
assert (qr != None)
enqueue_threads.extend(qr.create_threads(tf.get_default_session(), coord=coord, start=True))

i = 0
timesteps_so_far = 0
while True:
if timesteps_so_far > num_timesteps:
break
logger.log("********** Iteration %i ************"%i)

# Collect paths until we have enough timesteps
timesteps_this_batch = 0
paths = []
while True:
path = rollout(env, policy, max_pathlength, animate=(len(paths)==0 and (i % 10 == 0) and animate), obfilter=obfilter)
paths.append(path)
n = pathlength(path)
timesteps_this_batch += n
timesteps_so_far += n
if timesteps_this_batch > timesteps_per_batch:
break

# Estimate advantage function
vtargs = []
advs = []
for path in paths:
rew_t = path["reward"]
return_t = common.discount(rew_t, gamma)
vtargs.append(return_t)
vpred_t = vf.predict(path)
vpred_t = np.append(vpred_t, 0.0 if path["terminated"] else vpred_t[-1])
delta_t = rew_t + gamma*vpred_t[1:] - vpred_t[:-1]
adv_t = common.discount(delta_t, gamma * lam)
advs.append(adv_t)
# Update value function
vf.fit(paths, vtargs)

# Build arrays for policy update
ob_no = np.concatenate([path["observation"] for path in paths])
action_na = np.concatenate([path["action"] for path in paths])
oldac_dist = np.concatenate([path["action_dist"] for path in paths])
adv_n = np.concatenate(advs)
standardized_adv_n = (adv_n - adv_n.mean()) / (adv_n.std() + 1e-8)

# Policy update
do_update(ob_no, action_na, standardized_adv_n)

min_stepsize = np.float32(1e-8)
max_stepsize = np.float32(1e0)
# Adjust stepsize
kl = policy.compute_kl(ob_no, oldac_dist)
if kl > desired_kl * 2:
logger.log("kl too high")
tf.assign(stepsize, tf.maximum(min_stepsize, stepsize / 1.5)).eval()
elif kl < desired_kl / 2:
logger.log("kl too low")
tf.assign(stepsize, tf.minimum(max_stepsize, stepsize * 1.5)).eval()
else:
logger.log("kl just right!")

logger.record_tabular("EpRewMean", np.mean([path["reward"].sum() for path in paths]))
logger.record_tabular("EpRewSEM", np.std([path["reward"].sum()/np.sqrt(len(paths)) for path in paths]))
logger.record_tabular("EpLenMean", np.mean([pathlength(path) for path in paths]))
logger.record_tabular("KL", kl)
if callback:
callback()
logger.dump_tabular()
i += 1

coord.request_stop()
coord.join(enqueue_threads)
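The advantage estimation loop in the new `acktr_cont.py` above forms TD residuals and then discounts them with `common.discount(delta_t, gamma * lam)`; written out (a standard identity, included here only as a reading aid), that is the GAE(γ, λ)-style estimator

```latex
\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t),
\qquad
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\, \delta_{t+l},
```

with the terminal value bootstrapped from $V$ (or set to 0 on termination), exactly as the `np.append(vpred_t, ...)` line does.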
151 baselines/acktr/acktr_disc.py Normal file
@@ -0,0 +1,151 @@
import os.path as osp
import time
import functools
import numpy as np
import tensorflow as tf
from baselines import logger

from baselines.common import set_global_seeds, explained_variance
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables, load_variables

from baselines.a2c.runner import Runner
from baselines.a2c.utils import discount_with_dones
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.acktr import kfac

class Model(object):

def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, lrschedule='linear'):

self.sess = sess = get_session()
nact = ac_space.n
nbatch = nenvs * nsteps
A = tf.placeholder(tf.int32, [nbatch])
ADV = tf.placeholder(tf.float32, [nbatch])
R = tf.placeholder(tf.float32, [nbatch])
PG_LR = tf.placeholder(tf.float32, [])
VF_LR = tf.placeholder(tf.float32, [])

with tf.variable_scope('acktr_model', reuse=tf.AUTO_REUSE):
self.model = step_model = policy(nenvs, 1, sess=sess)
self.model2 = train_model = policy(nenvs*nsteps, nsteps, sess=sess)

neglogpac = train_model.pd.neglogp(A)
self.logits = logits = train_model.pi

##training loss
pg_loss = tf.reduce_mean(ADV*neglogpac)
entropy = tf.reduce_mean(train_model.pd.entropy())
pg_loss = pg_loss - ent_coef * entropy
vf_loss = tf.losses.mean_squared_error(tf.squeeze(train_model.vf), R)
train_loss = pg_loss + vf_coef * vf_loss

##Fisher loss construction
self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(neglogpac)
sample_net = train_model.vf + tf.random_normal(tf.shape(train_model.vf))
self.vf_fisher = vf_fisher_loss = - vf_fisher_coef*tf.reduce_mean(tf.pow(train_model.vf - tf.stop_gradient(sample_net), 2))
self.joint_fisher = joint_fisher_loss = pg_fisher_loss + vf_fisher_loss

self.params=params = find_trainable_variables("acktr_model")

self.grads_check = grads = tf.gradients(train_loss,params)

with tf.device('/gpu:0'):
self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,\
momentum=0.9, kfac_update=1, epsilon=0.01,\
stats_decay=0.99, async=1, cold_iter=10, max_grad_norm=max_grad_norm)

update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
self.q_runner = q_runner
self.lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)

def train(obs, states, rewards, masks, actions, values):
advs = rewards - values
for step in range(len(obs)):
cur_lr = self.lr.value()

td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, PG_LR:cur_lr}
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks

policy_loss, value_loss, policy_entropy, _ = sess.run(
[pg_loss, vf_loss, entropy, train_op],
td_map
)
return policy_loss, value_loss, policy_entropy

self.train = train
self.save = functools.partial(save_variables, sess=sess)
self.load = functools.partial(load_variables, sess=sess)
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
self.value = step_model.value
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)

def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, save_interval=None, lrschedule='linear', load_path=None, **network_kwargs):
set_global_seeds(seed)

if network == 'cnn':
network_kwargs['one_dim_bias'] = True

policy = build_policy(env, network, **network_kwargs)

nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
make_model = lambda : Model(policy, ob_space, ac_space, nenvs, total_timesteps, nprocs=nprocs, nsteps
=nsteps, ent_coef=ent_coef, vf_coef=vf_coef, vf_fisher_coef=
vf_fisher_coef, lr=lr, max_grad_norm=max_grad_norm, kfac_clip=kfac_clip,
lrschedule=lrschedule)
if save_interval and logger.get_dir():
import cloudpickle
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
fh.write(cloudpickle.dumps(make_model))
model = make_model()

if load_path is not None:
model.load(load_path)

runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
nbatch = nenvs*nsteps
tstart = time.time()
coord = tf.train.Coordinator()
enqueue_threads = model.q_runner.create_threads(model.sess, coord=coord, start=True)
for update in range(1, total_timesteps//nbatch+1):
obs, states, rewards, masks, actions, values = runner.run()
policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
model.old_obs = obs
nseconds = time.time()-tstart
fps = int((update*nbatch)/nseconds)
if update % log_interval == 0 or update == 1:
ev = explained_variance(values, rewards)
logger.record_tabular("nupdates", update)
logger.record_tabular("total_timesteps", update*nbatch)
logger.record_tabular("fps", fps)
logger.record_tabular("policy_entropy", float(policy_entropy))
logger.record_tabular("policy_loss", float(policy_loss))
logger.record_tabular("value_loss", float(value_loss))
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()

if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
savepath = osp.join(logger.get_dir(), 'checkpoint%.5i'%update)
print('Saving to', savepath)
model.save(savepath)
coord.request_stop()
coord.join(enqueue_threads)
env.close()
return model
baselines/acktr/defaults.py
@@ -1,5 +0,0 @@
def mujoco():
return dict(
nsteps=2500,
value_network='copy'
)
baselines/acktr/kfac.py
@@ -1,8 +1,6 @@
import tensorflow as tf
import numpy as np
import re

# flake8: noqa F403, F405
from baselines.acktr.kfac_utils import *
from functools import reduce

@@ -12,14 +10,14 @@ KFAC_DEBUG = False

class KfacOptimizer():

def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, is_async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
self.max_grad_norm = max_grad_norm
self._lr = learning_rate
self._momentum = momentum
self._clip_kl = clip_kl
self._channel_fac = channel_fac
self._kfac_update = kfac_update
self._async = is_async
self._async = async
self._async_stats = async_stats
self._epsilon = epsilon
self._stats_decay = stats_decay
42  baselines/acktr/policies.py  Normal file
@@ -0,0 +1,42 @@
import numpy as np
import tensorflow as tf
from baselines.acktr.utils import dense, kl_div
import baselines.common.tf_util as U

class GaussianMlpPolicy(object):
    def __init__(self, ob_dim, ac_dim):
        # Here we'll construct a bunch of expressions, which will be used in two places:
        # (1) When sampling actions
        # (2) When computing loss functions, for the policy update
        # Variables specific to (1) have the word "sampled" in them,
        # whereas variables specific to (2) have the word "old" in them
        ob_no = tf.placeholder(tf.float32, shape=[None, ob_dim*2], name="ob") # batch of observations
        oldac_na = tf.placeholder(tf.float32, shape=[None, ac_dim], name="ac") # batch of actions previous actions
        oldac_dist = tf.placeholder(tf.float32, shape=[None, ac_dim*2], name="oldac_dist") # batch of actions previous action distributions
        adv_n = tf.placeholder(tf.float32, shape=[None], name="adv") # advantage function estimate
        wd_dict = {}
        h1 = tf.nn.tanh(dense(ob_no, 64, "h1", weight_init=U.normc_initializer(1.0), bias_init=0.0, weight_loss_dict=wd_dict))
        h2 = tf.nn.tanh(dense(h1, 64, "h2", weight_init=U.normc_initializer(1.0), bias_init=0.0, weight_loss_dict=wd_dict))
        mean_na = dense(h2, ac_dim, "mean", weight_init=U.normc_initializer(0.1), bias_init=0.0, weight_loss_dict=wd_dict) # Mean control output
        self.wd_dict = wd_dict
        self.logstd_1a = logstd_1a = tf.get_variable("logstd", [ac_dim], tf.float32, tf.zeros_initializer()) # Variance on outputs
        logstd_1a = tf.expand_dims(logstd_1a, 0)
        std_1a = tf.exp(logstd_1a)
        std_na = tf.tile(std_1a, [tf.shape(mean_na)[0], 1])
        ac_dist = tf.concat([tf.reshape(mean_na, [-1, ac_dim]), tf.reshape(std_na, [-1, ac_dim])], 1)
        sampled_ac_na = tf.random_normal(tf.shape(ac_dist[:,ac_dim:])) * ac_dist[:,ac_dim:] + ac_dist[:,:ac_dim] # This is the sampled action we'll perform.
        logprobsampled_n = - tf.reduce_sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * tf.reduce_sum(tf.square(ac_dist[:,:ac_dim] - sampled_ac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of sampled action
        logprob_n = - tf.reduce_sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * tf.reduce_sum(tf.square(ac_dist[:,:ac_dim] - oldac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of previous actions under CURRENT policy (whereas oldlogprob_n is under OLD policy)
        kl = tf.reduce_mean(kl_div(oldac_dist, ac_dist, ac_dim))
        #kl = .5 * tf.reduce_mean(tf.square(logprob_n - oldlogprob_n)) # Approximation of KL divergence between old policy used to generate actions, and new policy used to compute logprob_n
        surr = - tf.reduce_mean(adv_n * logprob_n) # Loss function that we'll differentiate to get the policy gradient
        surr_sampled = - tf.reduce_mean(logprob_n) # Sampled loss of the policy
        self._act = U.function([ob_no], [sampled_ac_na, ac_dist, logprobsampled_n]) # Generate a new action and its logprob
        #self.compute_kl = U.function([ob_no, oldac_na, oldlogprob_n], kl) # Compute (approximate) KL divergence between old policy and new policy
        self.compute_kl = U.function([ob_no, oldac_dist], kl)
        self.update_info = ((ob_no, oldac_na, adv_n), surr, surr_sampled) # Input and output variables needed for computing loss
        U.initialize() # Initialize uninitialized TF variables

    def act(self, ob):
        ac, ac_dist, logp = self._act(ob[None])
        return ac[0], ac_dist[0], logp[0]
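The `logprob_n` expression above is the log-density of a diagonal Gaussian whose mean is `ac_dist[:, :ac_dim]` and whose standard deviation is `ac_dist[:, ac_dim:]`. A small NumPy sketch of the same formula (the array values below are made up for illustration):

```python
import numpy as np

def diag_gaussian_logprob(mean, std, x):
    # Mirrors logprob_n above:
    # -sum(log std) - 0.5*k*log(2*pi) - 0.5*sum(((x - mean)/std)**2)
    k = mean.shape[-1]
    return (-np.sum(np.log(std), axis=-1)
            - 0.5 * np.log(2.0 * np.pi) * k
            - 0.5 * np.sum(np.square((x - mean) / std), axis=-1))

mean = np.array([[0.0, 1.0]])
std = np.array([[1.0, 2.0]])
x = np.array([[0.5, 0.0]])
print(diag_gaussian_logprob(mean, std, x))  # one log-probability per batch row
```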
23  baselines/acktr/run_atari.py  Normal file
@@ -0,0 +1,23 @@
#!/usr/bin/env python3

from functools import partial

from baselines import logger
from baselines.acktr.acktr_disc import learn
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.common.policies import cnn

def train(env_id, num_timesteps, seed, num_cpu):
    env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
    policy_fn = cnn(env=env, one_dim_bias=True)
    learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
    env.close()

def main():
    args = atari_arg_parser().parse_args()
    logger.configure()
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed, num_cpu=32)

if __name__ == '__main__':
    main()
34  baselines/acktr/run_mujoco.py  Normal file
@@ -0,0 +1,34 @@
#!/usr/bin/env python3

import tensorflow as tf
from baselines import logger
from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
from baselines.acktr.acktr_cont import learn
from baselines.acktr.policies import GaussianMlpPolicy
from baselines.acktr.value_functions import NeuralNetValueFunction

def train(env_id, num_timesteps, seed):
    env = make_mujoco_env(env_id, seed)

    with tf.Session(config=tf.ConfigProto()):
        ob_dim = env.observation_space.shape[0]
        ac_dim = env.action_space.shape[0]
        with tf.variable_scope("vf"):
            vf = NeuralNetValueFunction(ob_dim, ac_dim)
        with tf.variable_scope("pi"):
            policy = GaussianMlpPolicy(ob_dim, ac_dim)

        learn(env, policy=policy, vf=vf,
            gamma=0.99, lam=0.97, timesteps_per_batch=2500,
            desired_kl=0.002,
            num_timesteps=num_timesteps, animate=False)

        env.close()

def main():
    args = mujoco_arg_parser().parse_args()
    logger.configure()
    train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)

if __name__ == "__main__":
    main()
50  baselines/acktr/value_functions.py  Normal file
@@ -0,0 +1,50 @@
from baselines import logger
import numpy as np
import baselines.common as common
from baselines.common import tf_util as U
import tensorflow as tf
from baselines.acktr import kfac
from baselines.acktr.utils import dense

class NeuralNetValueFunction(object):
    def __init__(self, ob_dim, ac_dim): #pylint: disable=W0613
        X = tf.placeholder(tf.float32, shape=[None, ob_dim*2+ac_dim*2+2]) # batch of observations
        vtarg_n = tf.placeholder(tf.float32, shape=[None], name='vtarg')
        wd_dict = {}
        h1 = tf.nn.elu(dense(X, 64, "h1", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict))
        h2 = tf.nn.elu(dense(h1, 64, "h2", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict))
        vpred_n = dense(h2, 1, "hfinal", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict)[:,0]
        sample_vpred_n = vpred_n + tf.random_normal(tf.shape(vpred_n))
        wd_loss = tf.get_collection("vf_losses", None)
        loss = tf.reduce_mean(tf.square(vpred_n - vtarg_n)) + tf.add_n(wd_loss)
        loss_sampled = tf.reduce_mean(tf.square(vpred_n - tf.stop_gradient(sample_vpred_n)))
        self._predict = U.function([X], vpred_n)
        optim = kfac.KfacOptimizer(learning_rate=0.001, cold_lr=0.001*(1-0.9), momentum=0.9, \
                                clip_kl=0.3, epsilon=0.1, stats_decay=0.95, \
                                async=1, kfac_update=2, cold_iter=50, \
                                weight_decay_dict=wd_dict, max_grad_norm=None)
        vf_var_list = []
        for var in tf.trainable_variables():
            if "vf" in var.name:
                vf_var_list.append(var)

        update_op, self.q_runner = optim.minimize(loss, loss_sampled, var_list=vf_var_list)
        self.do_update = U.function([X, vtarg_n], update_op) #pylint: disable=E1101
        U.initialize() # Initialize uninitialized TF variables
    def _preproc(self, path):
        l = pathlength(path)
        al = np.arange(l).reshape(-1,1)/10.0
        act = path["action_dist"].astype('float32')
        X = np.concatenate([path['observation'], act, al, np.ones((l, 1))], axis=1)
        return X
    def predict(self, path):
        return self._predict(self._preproc(path))
    def fit(self, paths, targvals):
        X = np.concatenate([self._preproc(p) for p in paths])
        y = np.concatenate(targvals)
        logger.record_tabular("EVBefore", common.explained_variance(self._predict(X), y))
        for _ in range(25): self.do_update(X, y)
        logger.record_tabular("EVAfter", common.explained_variance(self._predict(X), y))

def pathlength(path):
    return path["reward"].shape[0]
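`_preproc` determines the feature layout that the `[None, ob_dim*2+ac_dim*2+2]` placeholder above expects: observation features, the action distribution, a crude time feature, and a constant column. A hedged NumPy sketch of that contract with toy numbers (the path dict below is fabricated; only the shapes matter):

```python
import numpy as np

ob_dim, ac_dim, T = 3, 2, 5
path = {
    "observation": np.random.randn(T, ob_dim * 2),
    "action_dist": np.random.randn(T, ac_dim * 2).astype('float32'),
    "reward": np.random.randn(T),
}
l = path["reward"].shape[0]                      # pathlength(path)
al = np.arange(l).reshape(-1, 1) / 10.0          # time feature, as in _preproc
X = np.concatenate([path["observation"], path["action_dist"], al, np.ones((l, 1))], axis=1)
assert X.shape == (T, ob_dim * 2 + ac_dim * 2 + 2)
```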
@@ -97,19 +97,6 @@ register_benchmark({
    ]
})

# Bullet
_bulletsmall = [
    'InvertedDoublePendulum', 'InvertedPendulum', 'HalfCheetah', 'Reacher', 'Walker2D', 'Hopper', 'Ant'
]
_bulletsmall = [e + 'BulletEnv-v0' for e in _bulletsmall]

register_benchmark({
    'name': 'Bullet1M',
    'description': '6 mujoco-like tasks from bullet, 1M steps',
    'tasks': [{'env_id': e, 'trials': 6, 'num_timesteps': int(1e6)} for e in _bulletsmall]
})


# Roboschool

register_benchmark({
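This hunk drops the Bullet1M entry but keeps the registration mechanism itself. A hedged sketch of registering a custom benchmark with the same dict schema; the import path (baselines/bench/benchmarks.py upstream) and the 'MyPendulum100k' name are assumptions for illustration:

```python
# Hedged sketch; the module path and benchmark name are assumed, the schema
# (name / description / tasks) matches the Bullet1M block above.
from baselines.bench.benchmarks import register_benchmark

register_benchmark({
    'name': 'MyPendulum100k',
    'description': 'Pendulum-v0, 3 trials of 1e5 steps',
    'tasks': [{'env_id': 'Pendulum-v0', 'trials': 3, 'num_timesteps': int(1e5)}],
})
```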
@@ -16,11 +16,21 @@ class Monitor(Wrapper):
|
||||
def __init__(self, env, filename, allow_early_resets=False, reset_keywords=(), info_keywords=()):
|
||||
Wrapper.__init__(self, env=env)
|
||||
self.tstart = time.time()
|
||||
self.results_writer = ResultsWriter(
|
||||
filename,
|
||||
header={"t_start": time.time(), 'env_id' : env.spec and env.spec.id},
|
||||
extra_keys=reset_keywords + info_keywords
|
||||
)
|
||||
if filename is None:
|
||||
self.f = None
|
||||
self.logger = None
|
||||
else:
|
||||
if not filename.endswith(Monitor.EXT):
|
||||
if osp.isdir(filename):
|
||||
filename = osp.join(filename, Monitor.EXT)
|
||||
else:
|
||||
filename = filename + "." + Monitor.EXT
|
||||
self.f = open(filename, "wt")
|
||||
self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, 'env_id' : env.spec and env.spec.id}))
|
||||
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords+info_keywords)
|
||||
self.logger.writeheader()
|
||||
self.f.flush()
|
||||
|
||||
self.reset_keywords = reset_keywords
|
||||
self.info_keywords = info_keywords
|
||||
self.allow_early_resets = allow_early_resets
|
||||
@@ -33,7 +43,10 @@ class Monitor(Wrapper):
|
||||
self.current_reset_info = {} # extra info about the current episode, that was passed in during reset()
|
||||
|
||||
def reset(self, **kwargs):
|
||||
self.reset_state()
|
||||
if not self.allow_early_resets and not self.needs_reset:
|
||||
raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
|
||||
self.rewards = []
|
||||
self.needs_reset = False
|
||||
for k in self.reset_keywords:
|
||||
v = kwargs.get(k)
|
||||
if v is None:
|
||||
@@ -41,21 +54,10 @@ class Monitor(Wrapper):
|
||||
self.current_reset_info[k] = v
|
||||
return self.env.reset(**kwargs)
|
||||
|
||||
def reset_state(self):
|
||||
if not self.allow_early_resets and not self.needs_reset:
|
||||
raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
|
||||
self.rewards = []
|
||||
self.needs_reset = False
|
||||
|
||||
|
||||
def step(self, action):
|
||||
if self.needs_reset:
|
||||
raise RuntimeError("Tried to step environment that needs reset")
|
||||
ob, rew, done, info = self.env.step(action)
|
||||
self.update(ob, rew, done, info)
|
||||
return (ob, rew, done, info)
|
||||
|
||||
def update(self, ob, rew, done, info):
|
||||
self.rewards.append(rew)
|
||||
if done:
|
||||
self.needs_reset = True
|
||||
@@ -68,12 +70,12 @@ class Monitor(Wrapper):
|
||||
self.episode_lengths.append(eplen)
|
||||
self.episode_times.append(time.time() - self.tstart)
|
||||
epinfo.update(self.current_reset_info)
|
||||
self.results_writer.write_row(epinfo)
|
||||
|
||||
if isinstance(info, dict):
|
||||
if self.logger:
|
||||
self.logger.writerow(epinfo)
|
||||
self.f.flush()
|
||||
info['episode'] = epinfo
|
||||
|
||||
self.total_steps += 1
|
||||
return (ob, rew, done, info)
|
||||
|
||||
def close(self):
|
||||
if self.f is not None:
|
||||
@@ -94,34 +96,6 @@ class Monitor(Wrapper):
|
||||
class LoadMonitorResultsError(Exception):
|
||||
pass
|
||||
|
||||
|
||||
class ResultsWriter(object):
|
||||
def __init__(self, filename=None, header='', extra_keys=()):
|
||||
self.extra_keys = extra_keys
|
||||
if filename is None:
|
||||
self.f = None
|
||||
self.logger = None
|
||||
else:
|
||||
if not filename.endswith(Monitor.EXT):
|
||||
if osp.isdir(filename):
|
||||
filename = osp.join(filename, Monitor.EXT)
|
||||
else:
|
||||
filename = filename + "." + Monitor.EXT
|
||||
self.f = open(filename, "wt")
|
||||
if isinstance(header, dict):
|
||||
header = '# {} \n'.format(json.dumps(header))
|
||||
self.f.write(header)
|
||||
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+tuple(extra_keys))
|
||||
self.logger.writeheader()
|
||||
self.f.flush()
|
||||
|
||||
def write_row(self, epinfo):
|
||||
if self.logger:
|
||||
self.logger.writerow(epinfo)
|
||||
self.f.flush()
|
||||
|
||||
|
||||
|
||||
def get_monitor_files(dir):
|
||||
return glob(osp.join(dir, "*" + Monitor.EXT))
|
||||
|
||||
|
@@ -213,11 +213,8 @@ class LazyFrames(object):
    def __getitem__(self, i):
        return self._force()[i]

def make_atari(env_id, timelimit=True):
    # XXX(john): remove timelimit argument after gym is upgraded to allow double wrapping
def make_atari(env_id):
    env = gym.make(env_id)
    if not timelimit:
        env = env.env
    assert 'NoFrameskip' in env.spec.id
    env = NoopResetEnv(env, noop_max=30)
    env = MaxAndSkipEnv(env, skip=4)
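Both sides of this hunk compose the Atari wrappers the same way; the removed `timelimit` flag only allowed opting out of gym's TimeLimit wrapper. A hedged usage sketch (assumes gym with the Atari extras installed):

```python
import numpy as np
from baselines.common.atari_wrappers import make_atari, wrap_deepmind

env = make_atari('PongNoFrameskip-v4')   # NoopReset + MaxAndSkip on a NoFrameskip env
env = wrap_deepmind(env)                 # DeepMind-style preprocessing (84x84 grayscale, etc.)
ob = env.reset()
print(np.asarray(ob).shape)              # expected (84, 84, 1)
env.close()
```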
@@ -15,60 +15,22 @@ from baselines.bench import Monitor
|
||||
from baselines.common import set_global_seeds
|
||||
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
|
||||
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
|
||||
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
|
||||
from baselines.common import retro_wrappers
|
||||
|
||||
def make_vec_env(env_id, env_type, num_env, seed, wrapper_kwargs=None, start_index=0, reward_scale=1.0, gamestate=None):
|
||||
def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
|
||||
"""
|
||||
Create a wrapped, monitored SubprocVecEnv for Atari and MuJoCo.
|
||||
Create a wrapped, monitored SubprocVecEnv for Atari.
|
||||
"""
|
||||
if wrapper_kwargs is None: wrapper_kwargs = {}
|
||||
mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
|
||||
seed = seed + 10000 * mpi_rank if seed is not None else None
|
||||
def make_thunk(rank):
|
||||
return lambda: make_env(
|
||||
env_id=env_id,
|
||||
env_type=env_type,
|
||||
subrank = rank,
|
||||
seed=seed,
|
||||
reward_scale=reward_scale,
|
||||
gamestate=gamestate,
|
||||
wrapper_kwargs=wrapper_kwargs
|
||||
)
|
||||
|
||||
set_global_seeds(seed)
|
||||
if num_env > 1:
|
||||
return SubprocVecEnv([make_thunk(i + start_index) for i in range(num_env)])
|
||||
else:
|
||||
return DummyVecEnv([make_thunk(start_index)])
|
||||
|
||||
|
||||
def make_env(env_id, env_type, subrank=0, seed=None, reward_scale=1.0, gamestate=None, wrapper_kwargs={}):
|
||||
mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
|
||||
if env_type == 'atari':
|
||||
def make_env(rank): # pylint: disable=C0111
|
||||
def _thunk():
|
||||
env = make_atari(env_id)
|
||||
elif env_type == 'retro':
|
||||
import retro
|
||||
gamestate = gamestate or retro.State.DEFAULT
|
||||
env = retro_wrappers.make_retro(game=env_id, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE, state=gamestate)
|
||||
else:
|
||||
env = gym.make(env_id)
|
||||
|
||||
env.seed(seed + subrank if seed is not None else None)
|
||||
env = Monitor(env,
|
||||
logger.get_dir() and os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(subrank)),
|
||||
allow_early_resets=True)
|
||||
|
||||
if env_type == 'atari':
|
||||
env = wrap_deepmind(env, **wrapper_kwargs)
|
||||
elif env_type == 'retro':
|
||||
env = retro_wrappers.wrap_deepmind_retro(env, **wrapper_kwargs)
|
||||
|
||||
if reward_scale != 1:
|
||||
env = retro_wrappers.RewardScaler(env, reward_scale)
|
||||
|
||||
return env
|
||||
|
||||
env.seed(seed + 10000*mpi_rank + rank if seed is not None else None)
|
||||
env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(rank)))
|
||||
return wrap_deepmind(env, **wrapper_kwargs)
|
||||
return _thunk
|
||||
set_global_seeds(seed)
|
||||
return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
|
||||
|
||||
def make_mujoco_env(env_id, seed, reward_scale=1.0):
|
||||
"""
|
||||
@@ -78,12 +40,13 @@ def make_mujoco_env(env_id, seed, reward_scale=1.0):
|
||||
myseed = seed + 1000 * rank if seed is not None else None
|
||||
set_global_seeds(myseed)
|
||||
env = gym.make(env_id)
|
||||
logger_path = None if logger.get_dir() is None else os.path.join(logger.get_dir(), str(rank))
|
||||
env = Monitor(env, logger_path, allow_early_resets=True)
|
||||
env = Monitor(env, os.path.join(logger.get_dir(), str(rank)), allow_early_resets=True)
|
||||
env.seed(seed)
|
||||
|
||||
if reward_scale != 1.0:
|
||||
from baselines.common.retro_wrappers import RewardScaler
|
||||
env = RewardScaler(env, reward_scale)
|
||||
|
||||
return env
|
||||
|
||||
def make_robotics_env(env_id, seed, rank=0):
|
||||
@@ -131,8 +94,6 @@ def common_arg_parser():
    parser.add_argument('--num_env', help='Number of environment copies being run in parallel. When not specified, set to number of cpus for Atari, and to 1 for Mujoco', default=None, type=int)
    parser.add_argument('--reward_scale', help='Reward scale factor. Default: 1.0', default=1.0, type=float)
    parser.add_argument('--save_path', help='Path to save trained model to', default=None, type=str)
    parser.add_argument('--save_video_interval', help='Save video every x steps (0 = disabled)', default=0, type=int)
    parser.add_argument('--save_video_length', help='Length of recorded video. Default: 200', default=200, type=int)
    parser.add_argument('--play', default=False, action='store_true')
    return parser

@@ -152,18 +113,14 @@ def parse_unknown_args(args):
    Parse arguments not consumed by arg parser into a dicitonary
    """
    retval = {}
    preceded_by_key = False
    for arg in args:
        if arg.startswith('--'):
            if '=' in arg:
        assert arg.startswith('--')
        assert '=' in arg, 'cannot parse arg {}'.format(arg)
        key = arg.split('=')[0][2:]
        value = arg.split('=')[1]
        retval[key] = value
            else:
                key = arg[2:]
                preceded_by_key = True
        elif preceded_by_key:
            retval[key] = arg
            preceded_by_key = False

    return retval
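A quick sketch of what the extra-argument parser produces; this assumes `parse_unknown_args` stays importable from this module (baselines/common/cmd_util.py). With the stricter variant on one side of the diff, only the `--key=value` form is accepted:

```python
from baselines.common.cmd_util import parse_unknown_args

extra = parse_unknown_args(['--lr=3e-4', '--nsteps=128'])
print(extra)   # {'lr': '3e-4', 'nsteps': '128'} - values stay strings, callers cast them
```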
@@ -2,8 +2,6 @@ from __future__ import print_function
from contextlib import contextmanager
import numpy as np
import time
import shlex
import subprocess

# ================================================================
# Misc
@@ -39,7 +37,7 @@ color2num = dict(
    crimson=38
)

def colorize(string, color='green', bold=False, highlight=False):
def colorize(string, color, bold=False, highlight=False):
    attr = []
    num = color2num[color]
    if highlight: num += 10
@@ -47,25 +45,6 @@ def colorize(string, color='green', bold=False, highlight=False):
    if bold: attr.append('1')
    return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string)

def print_cmd(cmd, dry=False):
    if isinstance(cmd, str): # for shell=True
        pass
    else:
        cmd = ' '.join(shlex.quote(arg) for arg in cmd)
    print(colorize(('CMD: ' if not dry else 'DRY: ') + cmd))


def get_git_commit(cwd=None):
    return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'], cwd=cwd).decode('utf8')

def get_git_commit_message(cwd=None):
    return subprocess.check_output(['git', 'show', '-s', '--format=%B', 'HEAD'], cwd=cwd).decode('utf8')

def ccap(cmd, dry=False, env=None, **kwargs):
    print_cmd(cmd, dry)
    if not dry:
        subprocess.check_call(cmd, env=env, **kwargs)


MESSAGE_DEPTH = 0
@@ -23,13 +23,6 @@ class Pd(object):
|
||||
raise NotImplementedError
|
||||
def logp(self, x):
|
||||
return - self.neglogp(x)
|
||||
def get_shape(self):
|
||||
return self.flatparam().shape
|
||||
@property
|
||||
def shape(self):
|
||||
return self.get_shape()
|
||||
def __getitem__(self, idx):
|
||||
return self.__class__(self.flatparam()[idx])
|
||||
|
||||
class PdType(object):
|
||||
"""
|
||||
@@ -39,7 +32,7 @@ class PdType(object):
|
||||
raise NotImplementedError
|
||||
def pdfromflat(self, flat):
|
||||
return self.pdclass()(flat)
|
||||
def pdfromlatent(self, latent_vector, init_scale, init_bias):
|
||||
def pdfromlatent(self, latent_vector):
|
||||
raise NotImplementedError
|
||||
def param_shape(self):
|
||||
raise NotImplementedError
|
||||
@@ -53,9 +46,6 @@ class PdType(object):
|
||||
def sample_placeholder(self, prepend_shape, name=None):
|
||||
return tf.placeholder(dtype=self.sample_dtype(), shape=prepend_shape+self.sample_shape(), name=name)
|
||||
|
||||
def __eq__(self, other):
|
||||
return (type(self) == type(other)) and (self.__dict__ == other.__dict__)
|
||||
|
||||
class CategoricalPdType(PdType):
|
||||
def __init__(self, ncat):
|
||||
self.ncat = ncat
|
||||
@@ -80,11 +70,6 @@ class MultiCategoricalPdType(PdType):
|
||||
return MultiCategoricalPd
|
||||
def pdfromflat(self, flat):
|
||||
return MultiCategoricalPd(self.ncats, flat)
|
||||
|
||||
def pdfromlatent(self, latent, init_scale=1.0, init_bias=0.0):
|
||||
pdparam = fc(latent, 'pi', self.ncats.sum(), init_scale=init_scale, init_bias=init_bias)
|
||||
return self.pdfromflat(pdparam), pdparam
|
||||
|
||||
def param_shape(self):
|
||||
return [sum(self.ncats)]
|
||||
def sample_shape(self):
|
||||
@@ -122,9 +107,6 @@ class BernoulliPdType(PdType):
|
||||
return [self.size]
|
||||
def sample_dtype(self):
|
||||
return tf.int32
|
||||
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
|
||||
pdparam = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
|
||||
return self.pdfromflat(pdparam), pdparam
|
||||
|
||||
# WRONG SECOND DERIVATIVES
|
||||
# class CategoricalPd(Pd):
|
||||
@@ -156,30 +138,14 @@ class CategoricalPd(Pd):
|
||||
return self.logits
|
||||
def mode(self):
|
||||
return tf.argmax(self.logits, axis=-1)
|
||||
|
||||
@property
|
||||
def mean(self):
|
||||
return tf.nn.softmax(self.logits)
|
||||
def neglogp(self, x):
|
||||
# return tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
|
||||
# Note: we can't use sparse_softmax_cross_entropy_with_logits because
|
||||
# the implementation does not allow second-order derivatives...
|
||||
if x.dtype in {tf.uint8, tf.int32, tf.int64}:
|
||||
# one-hot encoding
|
||||
x_shape_list = x.shape.as_list()
|
||||
logits_shape_list = self.logits.get_shape().as_list()[:-1]
|
||||
for xs, ls in zip(x_shape_list, logits_shape_list):
|
||||
if xs is not None and ls is not None:
|
||||
assert xs == ls, 'shape mismatch: {} in x vs {} in logits'.format(xs, ls)
|
||||
|
||||
x = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
|
||||
else:
|
||||
# already encoded
|
||||
assert x.shape.as_list() == self.logits.shape.as_list()
|
||||
|
||||
one_hot_actions = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
|
||||
return tf.nn.softmax_cross_entropy_with_logits_v2(
|
||||
logits=self.logits,
|
||||
labels=x)
|
||||
labels=one_hot_actions)
|
||||
def kl(self, other):
|
||||
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
|
||||
a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keepdims=True)
|
||||
@@ -248,16 +214,12 @@ class DiagGaussianPd(Pd):
|
||||
def fromflat(cls, flat):
|
||||
return cls(flat)
|
||||
|
||||
|
||||
class BernoulliPd(Pd):
|
||||
def __init__(self, logits):
|
||||
self.logits = logits
|
||||
self.ps = tf.sigmoid(logits)
|
||||
def flatparam(self):
|
||||
return self.logits
|
||||
@property
|
||||
def mean(self):
|
||||
return self.ps
|
||||
def mode(self):
|
||||
return tf.round(self.ps)
|
||||
def neglogp(self, x):
|
||||
|
98  baselines/common/filters.py  Normal file
@@ -0,0 +1,98 @@
from .running_stat import RunningStat
from collections import deque
import numpy as np

class Filter(object):
    def __call__(self, x, update=True):
        raise NotImplementedError
    def reset(self):
        pass

class IdentityFilter(Filter):
    def __call__(self, x, update=True):
        return x

class CompositionFilter(Filter):
    def __init__(self, fs):
        self.fs = fs
    def __call__(self, x, update=True):
        for f in self.fs:
            x = f(x)
        return x
    def output_shape(self, input_space):
        out = input_space.shape
        for f in self.fs:
            out = f.output_shape(out)
        return out

class ZFilter(Filter):
    """
    y = (x-mean)/std
    using running estimates of mean,std
    """

    def __init__(self, shape, demean=True, destd=True, clip=10.0):
        self.demean = demean
        self.destd = destd
        self.clip = clip

        self.rs = RunningStat(shape)

    def __call__(self, x, update=True):
        if update: self.rs.push(x)
        if self.demean:
            x = x - self.rs.mean
        if self.destd:
            x = x / (self.rs.std+1e-8)
        if self.clip:
            x = np.clip(x, -self.clip, self.clip)
        return x
    def output_shape(self, input_space):
        return input_space.shape

class AddClock(Filter):
    def __init__(self):
        self.count = 0
    def reset(self):
        self.count = 0
    def __call__(self, x, update=True):
        return np.append(x, self.count/100.0)
    def output_shape(self, input_space):
        return (input_space.shape[0]+1,)

class FlattenFilter(Filter):
    def __call__(self, x, update=True):
        return x.ravel()
    def output_shape(self, input_space):
        return (int(np.prod(input_space.shape)),)

class Ind2OneHotFilter(Filter):
    def __init__(self, n):
        self.n = n
    def __call__(self, x, update=True):
        out = np.zeros(self.n)
        out[x] = 1
        return out
    def output_shape(self, input_space):
        return (input_space.n,)

class DivFilter(Filter):
    def __init__(self, divisor):
        self.divisor = divisor
    def __call__(self, x, update=True):
        return x / self.divisor
    def output_shape(self, input_space):
        return input_space.shape

class StackFilter(Filter):
    def __init__(self, length):
        self.stack = deque(maxlen=length)
    def reset(self):
        self.stack.clear()
    def __call__(self, x, update=True):
        self.stack.append(x)
        while len(self.stack) < self.stack.maxlen:
            self.stack.append(x)
        return np.concatenate(self.stack, axis=-1)
    def output_shape(self, input_space):
        return input_space.shape[:-1] + (input_space.shape[-1] * self.stack.maxlen,)
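A hedged usage sketch of `ZFilter`: it keeps a running mean/std of everything it is fed and returns the normalized, clipped observation (the synthetic data below is for illustration only):

```python
import numpy as np
from baselines.common.filters import ZFilter

obfilter = ZFilter(shape=(3,), clip=5.0)
for _ in range(1000):
    ob = np.random.randn(3) * 4.0 + 2.0   # pretend observations with mean 2, std 4
    normed = obfilter(ob)                 # normalizes and updates the running stats
print(obfilter.rs.mean, obfilter.rs.std)  # should approach roughly [2 2 2] and [4 4 4]
print(np.abs(normed).max() <= 5.0)        # True: output is clipped to +/- clip
```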
30  baselines/common/identity_env.py  Normal file
@@ -0,0 +1,30 @@
from gym import Env
from gym.spaces import Discrete


class IdentityEnv(Env):
    def __init__(
            self,
            dim,
            ep_length=100,
    ):

        self.action_space = Discrete(dim)
        self.reset()

    def reset(self):
        self._choose_next_state()
        self.observation_space = self.action_space

        return self.state

    def step(self, actions):
        rew = self._get_reward(actions)
        self._choose_next_state()
        return self.state, rew, False, {}

    def _choose_next_state(self):
        self.state = self.action_space.sample()

    def _get_reward(self, actions):
        return 1 if self.state == actions else 0
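A hedged rollout sketch for `IdentityEnv`: the optimal agent simply echoes the state it just observed, earning reward 1 on every step:

```python
from baselines.common.identity_env import IdentityEnv

env = IdentityEnv(dim=4)
ob = env.reset()
total = 0
for _ in range(10):
    ob, rew, done, _ = env.step(ob)   # action == current state => reward 1
    total += rew
print(total)   # 10
```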
@@ -1,6 +1,5 @@
import numpy as np
import tensorflow as tf
from gym.spaces import Discrete, Box, MultiDiscrete
from gym.spaces import Discrete, Box

def observation_placeholder(ob_space, batch_size=None, name='Ob'):
    '''
@@ -21,14 +20,10 @@ def observation_placeholder(ob_space, batch_size=None, name='Ob'):
        tensorflow placeholder tensor
    '''

    assert isinstance(ob_space, Discrete) or isinstance(ob_space, Box) or isinstance(ob_space, MultiDiscrete), \
    assert isinstance(ob_space, Discrete) or isinstance(ob_space, Box), \
        'Can only deal with Discrete and Box observation spaces for now'

    dtype = ob_space.dtype
    if dtype == np.int8:
        dtype = np.uint8

    return tf.placeholder(shape=(batch_size,) + ob_space.shape, dtype=dtype, name=name)
    return tf.placeholder(shape=(batch_size,) + ob_space.shape, dtype=ob_space.dtype, name=name)


def observation_input(ob_space, batch_size=None, name='Ob'):
@@ -53,12 +48,9 @@ def encode_observation(ob_space, placeholder):
    '''
    if isinstance(ob_space, Discrete):
        return tf.to_float(tf.one_hot(placeholder, ob_space.n))

    elif isinstance(ob_space, Box):
        return tf.to_float(placeholder)
    elif isinstance(ob_space, MultiDiscrete):
        placeholder = tf.cast(placeholder, tf.int32)
        one_hots = [tf.to_float(tf.one_hot(placeholder[..., i], ob_space.nvec[i])) for i in range(placeholder.shape[-1])]
        return tf.concat(one_hots, axis=-1)
    else:
        raise NotImplementedError
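The MultiDiscrete branch dropped on one side of this hunk one-hot encodes each component against its own cardinality and concatenates the pieces. A plain NumPy sketch of the resulting encoding (the nvec values and the sample are made up):

```python
import numpy as np

nvec = np.array([3, 5])                    # cardinalities of the sub-spaces
sample = np.array([2, 4])                  # one MultiDiscrete observation
one_hots = [np.eye(n)[v] for n, v in zip(nvec, sample)]
encoded = np.concatenate(one_hots, axis=-1)
print(encoded)        # [0. 0. 1. 0. 0. 0. 0. 1.] - length sum(nvec) == 8
```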
@@ -76,9 +76,10 @@ def set_global_seeds(i):
    myseed = i + 1000 * rank if i is not None else None
    try:
        import tensorflow as tf
        tf.set_random_seed(myseed)
    except ImportError:
        pass
    else:
        tf.set_random_seed(myseed)
    np.random.seed(myseed)
    random.seed(myseed)
@@ -5,13 +5,6 @@ from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch
|
||||
from baselines.common.mpi_running_mean_std import RunningMeanStd
|
||||
import tensorflow.contrib.layers as layers
|
||||
|
||||
mapping = {}
|
||||
|
||||
def register(name):
|
||||
def _thunk(func):
|
||||
mapping[name] = func
|
||||
return func
|
||||
return _thunk
|
||||
|
||||
def nature_cnn(unscaled_images, **conv_kwargs):
|
||||
"""
|
||||
@@ -27,10 +20,10 @@ def nature_cnn(unscaled_images, **conv_kwargs):
|
||||
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
|
||||
|
||||
|
||||
@register("mlp")
|
||||
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh, layer_norm=False):
|
||||
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
|
||||
"""
|
||||
Stack of fully-connected layers to be used in a policy / q-function approximator
|
||||
Simple fully connected layer policy. Separate stacks of fully-connected layers are used for policy and value function estimation.
|
||||
More customized fully-connected policies can be obtained by using PolicyWithV class directly.
|
||||
|
||||
Parameters:
|
||||
----------
|
||||
@@ -44,29 +37,22 @@ def mlp(num_layers=2, num_hidden=64, activation=tf.tanh, layer_norm=False):
|
||||
Returns:
|
||||
-------
|
||||
|
||||
function that builds fully connected network with a given input tensor / placeholder
|
||||
function that builds fully connected network with a given input placeholder
|
||||
"""
|
||||
def network_fn(X):
|
||||
h = tf.layers.flatten(X)
|
||||
for i in range(num_layers):
|
||||
h = fc(h, 'mlp_fc{}'.format(i), nh=num_hidden, init_scale=np.sqrt(2))
|
||||
if layer_norm:
|
||||
h = tf.contrib.layers.layer_norm(h, center=True, scale=True)
|
||||
h = activation(h)
|
||||
|
||||
return h
|
||||
h = activation(fc(h, 'mlp_fc{}'.format(i), nh=num_hidden, init_scale=np.sqrt(2)))
|
||||
return h, None
|
||||
|
||||
return network_fn
|
||||
|
||||
|
||||
@register("cnn")
|
||||
def cnn(**conv_kwargs):
|
||||
def network_fn(X):
|
||||
return nature_cnn(X, **conv_kwargs)
|
||||
return nature_cnn(X, **conv_kwargs), None
|
||||
return network_fn
|
||||
|
||||
|
||||
@register("cnn_small")
|
||||
def cnn_small(**conv_kwargs):
|
||||
def network_fn(X):
|
||||
h = tf.cast(X, tf.float32) / 255.
|
||||
@@ -76,40 +62,12 @@ def cnn_small(**conv_kwargs):
|
||||
h = activ(conv(h, 'c2', nf=16, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
|
||||
h = conv_to_fc(h)
|
||||
h = activ(fc(h, 'fc1', nh=128, init_scale=np.sqrt(2)))
|
||||
return h
|
||||
return h, None
|
||||
return network_fn
|
||||
|
||||
|
||||
@register("lstm")
|
||||
|
||||
def lstm(nlstm=128, layer_norm=False):
|
||||
"""
|
||||
Builds LSTM (Long-Short Term Memory) network to be used in a policy.
|
||||
Note that the resulting function returns not only the output of the LSTM
|
||||
(i.e. hidden state of lstm for each step in the sequence), but also a dictionary
|
||||
with auxiliary tensors to be set as policy attributes.
|
||||
|
||||
Specifically,
|
||||
S is a placeholder to feed current state (LSTM state has to be managed outside policy)
|
||||
M is a placeholder for the mask (used to mask out observations after the end of the episode, but can be used for other purposes too)
|
||||
initial_state is a numpy array containing initial lstm state (usually zeros)
|
||||
state is the output LSTM state (to be fed into S at the next call)
|
||||
|
||||
|
||||
An example of usage of lstm-based policy can be found here: common/tests/test_doc_examples.py/test_lstm_example
|
||||
|
||||
Parameters:
|
||||
----------
|
||||
|
||||
nlstm: int LSTM hidden state size
|
||||
|
||||
layer_norm: bool if True, layer-normalized version of LSTM is used
|
||||
|
||||
Returns:
|
||||
-------
|
||||
|
||||
function that builds LSTM with a given input tensor / placeholder
|
||||
"""
|
||||
|
||||
def network_fn(X, nenv=1):
|
||||
nbatch = X.shape[0]
|
||||
nsteps = nbatch // nenv
|
||||
@@ -135,7 +93,6 @@ def lstm(nlstm=128, layer_norm=False):
|
||||
return network_fn
|
||||
|
||||
|
||||
@register("cnn_lstm")
|
||||
def cnn_lstm(nlstm=128, layer_norm=False, **conv_kwargs):
|
||||
def network_fn(X, nenv=1):
|
||||
nbatch = X.shape[0]
|
||||
@@ -161,13 +118,10 @@ def cnn_lstm(nlstm=128, layer_norm=False, **conv_kwargs):
|
||||
|
||||
return network_fn
|
||||
|
||||
|
||||
@register("cnn_lnlstm")
|
||||
def cnn_lnlstm(nlstm=128, **conv_kwargs):
|
||||
return cnn_lstm(nlstm, layer_norm=True, **conv_kwargs)
|
||||
|
||||
|
||||
@register("conv_only")
|
||||
def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
|
||||
'''
|
||||
convolutions-only net
|
||||
@@ -194,7 +148,7 @@ def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
|
||||
activation_fn=tf.nn.relu,
|
||||
**conv_kwargs)
|
||||
|
||||
return out
|
||||
return out, None
|
||||
return network_fn
|
||||
|
||||
def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
|
||||
@@ -204,21 +158,20 @@ def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
|
||||
|
||||
|
||||
def get_network_builder(name):
|
||||
"""
|
||||
If you want to register your own network outside models.py, you just need:
|
||||
|
||||
Usage Example:
|
||||
-------------
|
||||
from baselines.common.models import register
|
||||
@register("your_network_name")
|
||||
def your_network_define(**net_kwargs):
|
||||
...
|
||||
return network_fn
|
||||
|
||||
"""
|
||||
if callable(name):
|
||||
return name
|
||||
elif name in mapping:
|
||||
return mapping[name]
|
||||
# TODO: replace with reflection?
|
||||
if name == 'cnn':
|
||||
return cnn
|
||||
elif name == 'cnn_small':
|
||||
return cnn_small
|
||||
elif name == 'conv_only':
|
||||
return conv_only
|
||||
elif name == 'mlp':
|
||||
return mlp
|
||||
elif name == 'lstm':
|
||||
return lstm
|
||||
elif name == 'cnn_lstm':
|
||||
return cnn_lstm
|
||||
elif name == 'cnn_lnlstm':
|
||||
return cnn_lnlstm
|
||||
else:
|
||||
raise ValueError('Unknown network type: {}'.format(name))
|
||||
|
@@ -1,11 +1,7 @@
from mpi4py import MPI
import baselines.common.tf_util as U
import tensorflow as tf
import numpy as np
try:
    from mpi4py import MPI
except ImportError:
    MPI = None


class MpiAdam(object):
    def __init__(self, var_list, *, beta1=0.9, beta2=0.999, epsilon=1e-08, scale_grad_by_procs=True, comm=None):
@@ -20,19 +16,16 @@ class MpiAdam(object):
        self.t = 0
        self.setfromflat = U.SetFromFlat(var_list)
        self.getflat = U.GetFlat(var_list)
        self.comm = MPI.COMM_WORLD if comm is None and MPI is not None else comm
        self.comm = MPI.COMM_WORLD if comm is None else comm

    def update(self, localg, stepsize):
        if self.t % 100 == 0:
            self.check_synced()
        localg = localg.astype('float32')
        if self.comm is not None:
            globalg = np.zeros_like(localg)
            self.comm.Allreduce(localg, globalg, op=MPI.SUM)
            if self.scale_grad_by_procs:
                globalg /= self.comm.Get_size()
        else:
            globalg = np.copy(localg)

        self.t += 1
        a = stepsize * np.sqrt(1 - self.beta2**self.t)/(1 - self.beta1**self.t)
@@ -42,15 +35,11 @@ class MpiAdam(object):
        self.setfromflat(self.getflat() + step)

    def sync(self):
        if self.comm is None:
            return
        theta = self.getflat()
        self.comm.Bcast(theta, root=0)
        self.setfromflat(theta)

    def check_synced(self):
        if self.comm is None:
            return
        if self.comm.Get_rank() == 0: # this is root
            theta = self.getflat()
            self.comm.Bcast(theta, root=0)
@@ -74,30 +63,17 @@ def test_MpiAdam():
    do_update = U.function([], loss, updates=[update_op])

    tf.get_default_session().run(tf.global_variables_initializer())
    losslist_ref = []
    for i in range(10):
        l = do_update()
        print(i, l)
        losslist_ref.append(l)


        print(i,do_update())

    tf.set_random_seed(0)
    tf.get_default_session().run(tf.global_variables_initializer())

    var_list = [a,b]
    lossandgrad = U.function([], [loss, U.flatgrad(loss, var_list)])
    lossandgrad = U.function([], [loss, U.flatgrad(loss, var_list)], updates=[update_op])
    adam = MpiAdam(var_list)

    losslist_test = []
    for i in range(10):
        l,g = lossandgrad()
        adam.update(g, stepsize)
        print(i,l)
        losslist_test.append(l)

    np.testing.assert_allclose(np.array(losslist_ref), np.array(losslist_test), atol=1e-4)


if __name__ == '__main__':
    test_MpiAdam()
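A hedged NumPy sketch of the single-process path through `MpiAdam.update` (comm is None, so the local gradient is used as-is), together with the standard Adam moment updates performed by the lines this hunk elides:

```python
import numpy as np

def adam_step(theta, g, m, v, t, stepsize, beta1=0.9, beta2=0.999, epsilon=1e-8):
    t += 1
    a = stepsize * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)  # same scaling as above
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * (g * g)
    step = -a * m / (np.sqrt(v) + epsilon)
    return theta + step, m, v, t

theta, m, v, t = np.zeros(3), np.zeros(3), np.zeros(3), 0
for _ in range(200):
    g = 2 * (theta - 1.0)                  # gradient of ||theta - 1||^2
    theta, m, v, t = adam_step(theta, g, m, v, t, stepsize=0.1)
print(np.round(theta, 2))                  # close to [1. 1. 1.]
```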
@@ -1,8 +1,4 @@
try:
    from mpi4py import MPI
except ImportError:
    MPI = None

import tensorflow as tf, baselines.common.tf_util as U, numpy as np

class RunningMeanStd(object):
@@ -43,7 +39,6 @@ class RunningMeanStd(object):
        n = int(np.prod(self.shape))
        totalvec = np.zeros(n*2+1, 'float64')
        addvec = np.concatenate([x.sum(axis=0).ravel(), np.square(x).sum(axis=0).ravel(), np.array([len(x)],dtype='float64')])
        if MPI is not None:
            MPI.COMM_WORLD.Allreduce(addvec, totalvec, op=MPI.SUM)
        self.incfiltparams(totalvec[0:n].reshape(self.shape), totalvec[n:2*n].reshape(self.shape), totalvec[2*n])
@@ -1,401 +0,0 @@
|
||||
import matplotlib.pyplot as plt
|
||||
import os.path as osp
|
||||
import json
|
||||
import os
|
||||
import numpy as np
|
||||
import pandas
|
||||
from collections import defaultdict, namedtuple
|
||||
from baselines.bench import monitor
|
||||
from baselines.logger import read_json, read_csv
|
||||
|
||||
def smooth(y, radius, mode='two_sided', valid_only=False):
|
||||
'''
|
||||
Smooth signal y, where radius is determines the size of the window
|
||||
|
||||
mode='twosided':
|
||||
average over the window [max(index - radius, 0), min(index + radius, len(y)-1)]
|
||||
mode='causal':
|
||||
average over the window [max(index - radius, 0), index]
|
||||
|
||||
valid_only: put nan in entries where the full-sized window is not available
|
||||
|
||||
'''
|
||||
assert mode in ('two_sided', 'causal')
|
||||
if len(y) < 2*radius+1:
|
||||
return np.ones_like(y) * y.mean()
|
||||
elif mode == 'two_sided':
|
||||
convkernel = np.ones(2 * radius+1)
|
||||
out = np.convolve(y, convkernel,mode='same') / np.convolve(np.ones_like(y), convkernel, mode='same')
|
||||
if valid_only:
|
||||
out[:radius] = out[-radius:] = np.nan
|
||||
elif mode == 'causal':
|
||||
convkernel = np.ones(radius)
|
||||
out = np.convolve(y, convkernel,mode='full') / np.convolve(np.ones_like(y), convkernel, mode='full')
|
||||
out = out[:-radius+1]
|
||||
if valid_only:
|
||||
out[:radius] = np.nan
|
||||
return out
|
||||
|
||||
def one_sided_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_counts_threshold=1e-8):
|
||||
'''
|
||||
perform one-sided (causal) EMA (exponential moving average)
|
||||
smoothing and resampling to an even grid with n points.
|
||||
Does not do extrapolation, so we assume
|
||||
xolds[0] <= low && high <= xolds[-1]
|
||||
|
||||
Arguments:
|
||||
|
||||
xolds: array or list - x values of data. Needs to be sorted in ascending order
|
||||
yolds: array of list - y values of data. Has to have the same length as xolds
|
||||
|
||||
low: float - min value of the new x grid. By default equals to xolds[0]
|
||||
high: float - max value of the new x grid. By default equals to xolds[-1]
|
||||
|
||||
n: int - number of points in new x grid
|
||||
|
||||
decay_steps: float - EMA decay factor, expressed in new x grid steps.
|
||||
|
||||
low_counts_threshold: float or int
|
||||
- y values with counts less than this value will be set to NaN
|
||||
|
||||
Returns:
|
||||
tuple sum_ys, count_ys where
|
||||
xs - array with new x grid
|
||||
ys - array of EMA of y at each point of the new x grid
|
||||
count_ys - array of EMA of y counts at each point of the new x grid
|
||||
|
||||
'''
|
||||
|
||||
low = xolds[0] if low is None else low
|
||||
high = xolds[-1] if high is None else high
|
||||
|
||||
assert xolds[0] <= low, 'low = {} < xolds[0] = {} - extrapolation not permitted!'.format(low, xolds[0])
|
||||
assert xolds[-1] >= high, 'high = {} > xolds[-1] = {} - extrapolation not permitted!'.format(high, xolds[-1])
|
||||
assert len(xolds) == len(yolds), 'length of xolds ({}) and yolds ({}) do not match!'.format(len(xolds), len(yolds))
|
||||
|
||||
|
||||
xolds = xolds.astype('float64')
|
||||
yolds = yolds.astype('float64')
|
||||
|
||||
luoi = 0 # last unused old index
|
||||
sum_y = 0.
|
||||
count_y = 0.
|
||||
xnews = np.linspace(low, high, n)
|
||||
decay_period = (high - low) / (n - 1) * decay_steps
|
||||
interstep_decay = np.exp(- 1. / decay_steps)
|
||||
sum_ys = np.zeros_like(xnews)
|
||||
count_ys = np.zeros_like(xnews)
|
||||
for i in range(n):
|
||||
xnew = xnews[i]
|
||||
sum_y *= interstep_decay
|
||||
count_y *= interstep_decay
|
||||
while True:
|
||||
xold = xolds[luoi]
|
||||
if xold <= xnew:
|
||||
decay = np.exp(- (xnew - xold) / decay_period)
|
||||
sum_y += decay * yolds[luoi]
|
||||
count_y += decay
|
||||
luoi += 1
|
||||
else:
|
||||
break
|
||||
if luoi >= len(xolds):
|
||||
break
|
||||
sum_ys[i] = sum_y
|
||||
count_ys[i] = count_y
|
||||
|
||||
ys = sum_ys / count_ys
|
||||
ys[count_ys < low_counts_threshold] = np.nan
|
||||
|
||||
return xnews, ys, count_ys
|
||||
|
||||
def symmetric_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_counts_threshold=1e-8):
|
||||
'''
|
||||
perform symmetric EMA (exponential moving average)
|
||||
smoothing and resampling to an even grid with n points.
|
||||
Does not do extrapolation, so we assume
|
||||
xolds[0] <= low && high <= xolds[-1]
|
||||
|
||||
Arguments:
|
||||
|
||||
xolds: array or list - x values of data. Needs to be sorted in ascending order
|
||||
yolds: array of list - y values of data. Has to have the same length as xolds
|
||||
|
||||
low: float - min value of the new x grid. By default equals to xolds[0]
|
||||
high: float - max value of the new x grid. By default equals to xolds[-1]
|
||||
|
||||
n: int - number of points in new x grid
|
||||
|
||||
decay_steps: float - EMA decay factor, expressed in new x grid steps.
|
||||
|
||||
low_counts_threshold: float or int
|
||||
- y values with counts less than this value will be set to NaN
|
||||
|
||||
Returns:
|
||||
tuple sum_ys, count_ys where
|
||||
xs - array with new x grid
|
||||
ys - array of EMA of y at each point of the new x grid
|
||||
count_ys - array of EMA of y counts at each point of the new x grid
|
||||
|
||||
'''
|
||||
xs, ys1, count_ys1 = one_sided_ema(xolds, yolds, low, high, n, decay_steps, low_counts_threshold=0)
|
||||
_, ys2, count_ys2 = one_sided_ema(-xolds[::-1], yolds[::-1], -high, -low, n, decay_steps, low_counts_threshold=0)
|
||||
ys2 = ys2[::-1]
|
||||
count_ys2 = count_ys2[::-1]
|
||||
count_ys = count_ys1 + count_ys2
|
||||
ys = (ys1 * count_ys1 + ys2 * count_ys2) / count_ys
|
||||
ys[count_ys < low_counts_threshold] = np.nan
|
||||
return xs, ys, count_ys
|
||||
|
||||
Result = namedtuple('Result', 'monitor progress dirname metadata')
|
||||
Result.__new__.__defaults__ = (None,) * len(Result._fields)
|
||||
|
||||
def load_results(root_dir_or_dirs, enable_progress=True, enable_monitor=True, verbose=False):
|
||||
'''
|
||||
load summaries of runs from a list of directories (including subdirectories)
|
||||
Arguments:
|
||||
|
||||
enable_progress: bool - if True, will attempt to load data from progress.csv files (data saved by logger). Default: True
|
||||
|
||||
enable_monitor: bool - if True, will attempt to load data from monitor.csv files (data saved by Monitor environment wrapper). Default: True
|
||||
|
||||
verbose: bool - if True, will print out list of directories from which the data is loaded. Default: False
|
||||
|
||||
|
||||
Returns:
|
||||
List of Result objects with the following fields:
|
||||
- dirname - path to the directory data was loaded from
|
||||
- metadata - run metadata (such as command-line arguments and anything else in metadata.json file
|
||||
- monitor - if enable_monitor is True, this field contains pandas dataframe with loaded monitor.csv file (or aggregate of all *.monitor.csv files in the directory)
|
||||
- progress - if enable_progress is True, this field contains pandas dataframe with loaded progress.csv file
|
||||
'''
|
||||
if isinstance(root_dir_or_dirs, str):
|
||||
rootdirs = [osp.expanduser(root_dir_or_dirs)]
|
||||
else:
|
||||
rootdirs = [osp.expanduser(d) for d in root_dir_or_dirs]
|
||||
allresults = []
|
||||
for rootdir in rootdirs:
|
||||
assert osp.exists(rootdir), "%s doesn't exist"%rootdir
|
||||
for dirname, dirs, files in os.walk(rootdir):
|
||||
if '-proc' in dirname:
|
||||
files[:] = []
|
||||
continue
|
||||
if set(['metadata.json', 'monitor.json', 'monitor.csv', 'progress.json', 'progress.csv']).intersection(files):
|
||||
# used to be uncommented, which means do not go deeper than current directory if any of the data files
|
||||
# are found
|
||||
# dirs[:] = []
|
||||
result = {'dirname' : dirname}
|
||||
if "metadata.json" in files:
|
||||
with open(osp.join(dirname, "metadata.json"), "r") as fh:
|
||||
result['metadata'] = json.load(fh)
|
||||
progjson = osp.join(dirname, "progress.json")
|
||||
progcsv = osp.join(dirname, "progress.csv")
|
||||
if enable_progress:
|
||||
if osp.exists(progjson):
|
||||
result['progress'] = pandas.DataFrame(read_json(progjson))
|
||||
elif osp.exists(progcsv):
|
||||
try:
|
||||
result['progress'] = read_csv(progcsv)
|
||||
except pandas.errors.EmptyDataError:
|
||||
print('skipping progress file in ', dirname, 'empty data')
|
||||
else:
|
||||
if verbose: print('skipping %s: no progress file'%dirname)
|
||||
|
||||
if enable_monitor:
|
||||
try:
|
||||
result['monitor'] = pandas.DataFrame(monitor.load_results(dirname))
|
||||
except monitor.LoadMonitorResultsError:
|
||||
print('skipping %s: no monitor files'%dirname)
|
||||
except Exception as e:
|
||||
print('exception loading monitor file in %s: %s'%(dirname, e))
|
||||
|
||||
if result.get('monitor') is not None or result.get('progress') is not None:
|
||||
allresults.append(Result(**result))
|
||||
if verbose:
|
||||
print('successfully loaded %s'%dirname)
|
||||
|
||||
if verbose: print('loaded %i results'%len(allresults))
|
||||
return allresults
|
||||
|
||||
COLORS = ['blue', 'green', 'red', 'cyan', 'magenta', 'yellow', 'black', 'purple', 'pink',
|
||||
'brown', 'orange', 'teal', 'lightblue', 'lime', 'lavender', 'turquoise',
|
||||
'darkgreen', 'tan', 'salmon', 'gold', 'darkred', 'darkblue']
|
||||
|
||||
|
||||
def default_xy_fn(r):
|
||||
x = np.cumsum(r.monitor.l)
|
||||
y = smooth(r.monitor.r, radius=10)
|
||||
return x,y
|
||||
|
||||
def default_split_fn(r):
|
||||
import re
|
||||
# match name between slash and -<digits> at the end of the string
|
||||
# (slash in the beginning or -<digits> in the end or either may be missing)
|
||||
match = re.search(r'[^/-]+(?=(-\d+)?\Z)', r.dirname)
|
||||
if match:
|
||||
return match.group(0)
|
||||
|
||||
def plot_results(
|
||||
allresults, *,
|
||||
xy_fn=default_xy_fn,
|
||||
split_fn=default_split_fn,
|
||||
group_fn=default_split_fn,
|
||||
average_group=False,
|
||||
shaded_std=True,
|
||||
shaded_err=True,
|
||||
figsize=None,
|
||||
legend_outside=False,
|
||||
resample=0,
|
||||
smooth_step=1.0,
|
||||
):
|
||||
'''
|
||||
Plot multiple Results objects
|
||||
|
||||
xy_fn: function Result -> x,y - function that converts results objects into tuple of x and y values.
|
||||
By default, x is cumsum of episode lengths, and y is episode rewards
|
||||
|
||||
split_fn: function Result -> hashable - function that converts results objects into keys to split curves into sub-panels by.
|
||||
That is, the results r for which split_fn(r) is different will be put on different sub-panels.
|
||||
By default, the portion of r.dirname between last / and -<digits> is returned. The sub-panels are
|
||||
stacked vertically in the figure.
|
||||
|
||||
group_fn: function Result -> hashable - function that converts results objects into keys to group curves by.
|
||||
That is, the results r for which group_fn(r) is the same will be put into the same group.
|
||||
Curves in the same group have the same color (if average_group is False), or averaged over
|
||||
(if average_group is True). The default value is the same as default value for split_fn
|
||||
|
||||
average_group: bool - if True, will average the curves in the same group and plot the mean. Enables resampling
|
||||
(if resample = 0, will use 512 steps)
|
||||
|
||||
shaded_std: bool - if True (default), the shaded region corresponding to standard deviation of the group of curves will be
|
||||
shown (only applicable if average_group = True)
|
||||
|
||||
shaded_err: bool - if True (default), the shaded region corresponding to error in mean estimate of the group of curves
|
||||
(that is, standard deviation divided by square root of number of curves) will be
|
||||
shown (only applicable if average_group = True)
|
||||
|
||||
figsize: tuple or None - size of the resulting figure (including sub-panels). By default, width is 6 and height is 6 times number of
|
||||
sub-panels.
|
||||
|
||||
|
||||
legend_outside: bool - if True, will place the legend outside of the sub-panels.
|
||||
|
||||
resample: int - if not zero, size of the uniform grid in x direction to resample onto. Resampling is performed via symmetric
|
||||
EMA smoothing (see the docstring for symmetric_ema).
|
||||
Default is zero (no resampling). Note that if average_group is True, resampling is necessary; in that case, default
|
||||
value is 512.
|
||||
|
||||
smooth_step: float - when resampling (i.e. when resample > 0 or average_group is True), use this EMA decay parameter (in units of the new grid step).
|
||||
See docstrings for decay_steps in symmetric_ema or one_sided_ema functions.
|
||||
|
||||
'''
|
||||
|
||||
if split_fn is None: split_fn = lambda _ : ''
|
||||
if group_fn is None: group_fn = lambda _ : ''
|
||||
sk2r = defaultdict(list) # splitkey2results
|
||||
for result in allresults:
|
||||
splitkey = split_fn(result)
|
||||
sk2r[splitkey].append(result)
|
||||
assert len(sk2r) > 0
|
||||
assert isinstance(resample, int), "0: don't resample. <integer>: that many samples"
|
||||
nrows = len(sk2r)
|
||||
ncols = 1
|
||||
figsize = figsize or (6, 6 * nrows)
|
||||
f, axarr = plt.subplots(nrows, ncols, sharex=False, squeeze=False, figsize=figsize)
|
||||
|
||||
groups = list(set(group_fn(result) for result in allresults))
|
||||
|
||||
default_samples = 512
|
||||
if average_group:
|
||||
resample = resample or default_samples
|
||||
|
||||
for (isplit, sk) in enumerate(sorted(sk2r.keys())):
|
||||
g2l = {}
|
||||
g2c = defaultdict(int)
|
||||
sresults = sk2r[sk]
|
||||
gresults = defaultdict(list)
|
||||
ax = axarr[isplit][0]
|
||||
for result in sresults:
|
||||
group = group_fn(result)
|
||||
g2c[group] += 1
|
||||
x, y = xy_fn(result)
|
||||
if x is None: x = np.arange(len(y))
|
||||
x, y = map(np.asarray, (x, y))
|
||||
if average_group:
|
||||
gresults[group].append((x,y))
|
||||
else:
|
||||
if resample:
|
||||
x, y, counts = symmetric_ema(x, y, x[0], x[-1], resample, decay_steps=smooth_step)
|
||||
l, = ax.plot(x, y, color=COLORS[groups.index(group) % len(COLORS)])
|
||||
g2l[group] = l
|
||||
if average_group:
|
||||
for group in sorted(groups):
|
||||
xys = gresults[group]
|
||||
if not any(xys):
|
||||
continue
|
||||
color = COLORS[groups.index(group) % len(COLORS)]
|
||||
origxs = [xy[0] for xy in xys]
|
||||
minxlen = min(map(len, origxs))
|
||||
def allequal(qs):
|
||||
return all((q==qs[0]).all() for q in qs[1:])
|
||||
if resample:
|
||||
low = max(x[0] for x in origxs)
|
||||
high = min(x[-1] for x in origxs)
|
||||
usex = np.linspace(low, high, resample)
|
||||
ys = []
|
||||
for (x, y) in xys:
|
||||
ys.append(symmetric_ema(x, y, low, high, resample, decay_steps=smooth_step)[1])
|
||||
else:
|
||||
assert allequal([x[:minxlen] for x in origxs]),\
|
||||
'If you want to average unevenly sampled data, set resample=<number of samples you want>'
|
||||
usex = origxs[0]
|
||||
ys = [xy[1][:minxlen] for xy in xys]
|
||||
ymean = np.mean(ys, axis=0)
|
||||
ystd = np.std(ys, axis=0)
|
||||
ystderr = ystd / np.sqrt(len(ys))
|
||||
l, = axarr[isplit][0].plot(usex, ymean, color=color)
|
||||
g2l[group] = l
|
||||
if shaded_err:
|
||||
ax.fill_between(usex, ymean - ystderr, ymean + ystderr, color=color, alpha=.4)
|
||||
if shaded_std:
|
||||
ax.fill_between(usex, ymean - ystd, ymean + ystd, color=color, alpha=.2)
|
||||
|
||||
|
||||
# https://matplotlib.org/users/legend_guide.html
|
||||
plt.tight_layout()
|
||||
if any(g2l.keys()):
|
||||
ax.legend(
|
||||
g2l.values(),
|
||||
['%s (%i)'%(g, g2c[g]) for g in g2l] if average_group else g2l.keys(),
|
||||
loc=2 if legend_outside else None,
|
||||
bbox_to_anchor=(1,1) if legend_outside else None)
|
||||
ax.set_title(sk)
|
||||
return f, axarr
|
||||
|
||||
def regression_analysis(df):
|
||||
xcols = list(df.columns.copy())
|
||||
xcols.remove('score')
|
||||
ycols = ['score']
|
||||
import statsmodels.api as sm
|
||||
mod = sm.OLS(df[ycols], sm.add_constant(df[xcols]), hasconst=False)
|
||||
res = mod.fit()
|
||||
print(res.summary())
|
||||
|
||||
def test_smooth():
|
||||
norig = 100
|
||||
nup = 300
|
||||
ndown = 30
|
||||
xs = np.cumsum(np.random.rand(norig) * 10 / norig)
|
||||
yclean = np.sin(xs)
|
||||
ys = yclean + .1 * np.random.randn(yclean.size)
|
||||
xup, yup, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), nup, decay_steps=nup/ndown)
|
||||
xdown, ydown, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), ndown, decay_steps=ndown/ndown)
|
||||
xsame, ysame, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), norig, decay_steps=norig/ndown)
|
||||
plt.plot(xs, ys, label='orig', marker='x')
|
||||
plt.plot(xup, yup, label='up', marker='x')
|
||||
plt.plot(xdown, ydown, label='down', marker='x')
|
||||
plt.plot(xsame, ysame, label='same', marker='x')
|
||||
plt.plot(xs, yclean, label='clean', marker='x')
|
||||
plt.legend()
|
||||
plt.show()
|
||||
|
||||
|
@@ -43,17 +43,13 @@ class PolicyWithValue(object):
        vf_latent = tf.layers.flatten(vf_latent)
        latent = tf.layers.flatten(latent)

        # Based on the action space, will select what probability distribution type
        self.pdtype = make_pdtype(env.action_space)

        self.pd, self.pi = self.pdtype.pdfromlatent(latent, init_scale=0.01)

        # Take an action
        self.action = self.pd.sample()

        # Calculate the neg log of our probability
        self.neglogp = self.pd.neglogp(self.action)
        self.sess = sess or tf.get_default_session()
        self.sess = sess

        if estimate_q:
            assert isinstance(env.action_space, gym.spaces.Discrete)
@@ -64,7 +60,7 @@ class PolicyWithValue(object):
            self.vf = self.vf[:,0]

    def _evaluate(self, variables, observation, **extra_feed):
        sess = self.sess
        sess = self.sess or tf.get_default_session()
        feed_dict = {self.X: adjust_shape(self.X, observation)}
        for inpt_name, data in extra_feed.items():
            if inpt_name in self.__dict__.keys():
@@ -76,7 +72,7 @@ class PolicyWithValue(object):

    def step(self, observation, **extra_feed):
        """
        Compute next action(s) given the observation(s)
        Compute next action(s) given the observaion(s)

        Parameters:
        ----------
@@ -97,7 +93,7 @@ class PolicyWithValue(object):

    def value(self, ob, *args, **kwargs):
        """
        Compute value estimate(s) given the observation(s)
        Compute value estimate(s) given the observaion(s)

        Parameters:
        ----------
@@ -139,9 +135,7 @@ def build_policy(env, policy_network, value_network=None, normalize_observation
    encoded_x = encode_observation(ob_space, encoded_x)

    with tf.variable_scope('pi', reuse=tf.AUTO_REUSE):
        policy_latent = policy_network(encoded_x)
        if isinstance(policy_latent, tuple):
            policy_latent, recurrent_tensors = policy_latent
        policy_latent, recurrent_tensors = policy_network(encoded_x)

        if recurrent_tensors is not None:
            # recurrent architecture, need a few more steps
@@ -162,8 +156,7 @@ def build_policy(env, policy_network, value_network=None, normalize_observation
    assert callable(_v_net)

    with tf.variable_scope('vf', reuse=tf.AUTO_REUSE):
        # TODO recurrent architectures are not supported with value_network=copy yet
        vf_latent = _v_net(encoded_x)
        vf_latent, _ = _v_net(encoded_x)

    policy = PolicyWithValue(
        env=env,
@@ -26,9 +26,9 @@ def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var,
    new_mean = mean + delta * batch_count / tot_count
    m_a = var * count
    m_b = batch_var * batch_count
    M2 = m_a + m_b + np.square(delta) * count * batch_count / tot_count
    new_var = M2 / tot_count
    new_count = tot_count
    M2 = m_a + m_b + np.square(delta) * count * batch_count / (count + batch_count)
    new_var = M2 / (count + batch_count)
    new_count = batch_count + count

    return new_mean, new_var, new_count

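As a quick sanity check on the moment-merging update above: folding one batch's (mean, var, count) into another should reproduce the statistics of the concatenated data. A minimal standalone sketch; `delta = batch_mean - mean` and `tot_count = count + batch_count` are restated inline because they are defined just above this hunk and not shown here:

```python
import numpy as np

def merge_moments(mean, var, count, batch_mean, batch_var, batch_count):
    # same algebra as update_mean_var_count_from_moments in the hunk above
    delta = batch_mean - mean
    tot_count = count + batch_count
    new_mean = mean + delta * batch_count / tot_count
    M2 = var * count + batch_var * batch_count + np.square(delta) * count * batch_count / tot_count
    return new_mean, M2 / tot_count, tot_count

a, b = np.random.randn(1000), np.random.randn(500) + 3.0
mean, var, count = merge_moments(a.mean(), a.var(), a.size, b.mean(), b.var(), b.size)

both = np.concatenate([a, b])
assert np.allclose(mean, both.mean()) and np.allclose(var, both.var()) and count == both.size
```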
46
baselines/common/running_stat.py
Normal file
@@ -0,0 +1,46 @@
import numpy as np

# http://www.johndcook.com/blog/standard_deviation/
class RunningStat(object):
    def __init__(self, shape):
        self._n = 0
        self._M = np.zeros(shape)
        self._S = np.zeros(shape)
    def push(self, x):
        x = np.asarray(x)
        assert x.shape == self._M.shape
        self._n += 1
        if self._n == 1:
            self._M[...] = x
        else:
            oldM = self._M.copy()
            self._M[...] = oldM + (x - oldM)/self._n
            self._S[...] = self._S + (x - oldM)*(x - self._M)
    @property
    def n(self):
        return self._n
    @property
    def mean(self):
        return self._M
    @property
    def var(self):
        return self._S/(self._n - 1) if self._n > 1 else np.square(self._M)
    @property
    def std(self):
        return np.sqrt(self.var)
    @property
    def shape(self):
        return self._M.shape

def test_running_stat():
    for shp in ((), (3,), (3,4)):
        li = []
        rs = RunningStat(shp)
        for _ in range(5):
            val = np.random.randn(*shp)
            rs.push(val)
            li.append(val)
            m = np.mean(li, axis=0)
            assert np.allclose(rs.mean, m)
            v = np.square(m) if (len(li) == 1) else np.var(li, ddof=1, axis=0)
            assert np.allclose(rs.var, v)
@@ -1,7 +1,7 @@
import numpy as np
from abc import abstractmethod
from gym import Env
from gym.spaces import MultiDiscrete, Discrete, Box
from gym.spaces import Discrete, Box


class IdentityEnv(Env):
@@ -53,19 +53,6 @@ class DiscreteIdentityEnv(IdentityEnv):
    def _get_reward(self, actions):
        return 1 if self.state == actions else 0

class MultiDiscreteIdentityEnv(IdentityEnv):
    def __init__(
            self,
            dims,
            episode_len=None,
    ):

        self.action_space = MultiDiscrete(dims)
        super().__init__(episode_len=episode_len)

    def _get_reward(self, actions):
        return 1 if all(self.state == actions) else 0


class BoxIdentityEnv(IdentityEnv):
    def __init__(
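The MultiDiscrete reward rule above (`1 if all(self.state == actions) else 0`) only pays out when every component of the action vector matches the state; a tiny standalone illustration (gym only, the (3, 3) dims mirror the identity test later in this diff):

```python
import numpy as np
from gym.spaces import MultiDiscrete

space = MultiDiscrete((3, 3))       # two sub-actions, each in {0, 1, 2}
state = np.array([2, 1])

print(1 if all(state == np.array([2, 1])) else 0)  # 1: every component matches
print(1 if all(state == np.array([2, 0])) else 0)  # 0: one component differs
print(space.sample())                              # e.g. array([0, 2])
```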
@@ -1,6 +1,7 @@
import os.path as osp
import numpy as np
import tempfile
import filelock
from gym import Env
from gym.spaces import Discrete, Box

@@ -13,7 +14,6 @@ class MnistEnv(Env):
                 episode_len=None,
                 no_images=None
    ):
        import filelock
        from tensorflow.examples.tutorials.mnist import input_data
        # we could use temporary directory for this with a context manager and
        # TemporaryDirectory, but then each test that uses mnist would re-download the data
@@ -13,9 +13,8 @@ common_kwargs = dict(
|
||||
|
||||
learn_kwargs = {
|
||||
'a2c' : dict(nsteps=32, value_network='copy', lr=0.05),
|
||||
'acer': dict(value_network='copy'),
|
||||
'acktr': dict(nsteps=32, value_network='copy', is_async=False),
|
||||
'deepq': dict(total_timesteps=20000),
|
||||
'acktr': dict(nsteps=32, value_network='copy'),
|
||||
'deepq': {},
|
||||
'ppo2': dict(value_network='copy'),
|
||||
'trpo_mpi': {}
|
||||
}
|
||||
@@ -39,6 +38,3 @@ def test_cartpole(alg):
|
||||
return env
|
||||
|
||||
reward_per_episode_test(env_fn, learn_fn, 100)
|
||||
|
||||
if __name__ == '__main__':
|
||||
test_cartpole('acer')
|
||||
|
@@ -1,48 +0,0 @@
|
||||
import pytest
|
||||
try:
|
||||
import mujoco_py
|
||||
_mujoco_present = True
|
||||
except BaseException:
|
||||
mujoco_py = None
|
||||
_mujoco_present = False
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
not _mujoco_present,
|
||||
reason='error loading mujoco - either mujoco / mujoco key not present, or LD_LIBRARY_PATH is not pointing to mujoco library'
|
||||
)
|
||||
def test_lstm_example():
|
||||
import tensorflow as tf
|
||||
from baselines.common import policies, models, cmd_util
|
||||
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
|
||||
|
||||
# create vectorized environment
|
||||
venv = DummyVecEnv([lambda: cmd_util.make_mujoco_env('Reacher-v2', seed=0)])
|
||||
|
||||
with tf.Session() as sess:
|
||||
# build policy based on lstm network with 128 units
|
||||
policy = policies.build_policy(venv, models.lstm(128))(nbatch=1, nsteps=1)
|
||||
|
||||
# initialize tensorflow variables
|
||||
sess.run(tf.global_variables_initializer())
|
||||
|
||||
# prepare environment variables
|
||||
ob = venv.reset()
|
||||
state = policy.initial_state
|
||||
done = [False]
|
||||
step_counter = 0
|
||||
|
||||
# run a single episode until the end (i.e. until done)
|
||||
while True:
|
||||
action, _, state, _ = policy.step(ob, S=state, M=done)
|
||||
ob, reward, done, _ = venv.step(action)
|
||||
step_counter += 1
|
||||
if done:
|
||||
break
|
||||
|
||||
|
||||
assert step_counter > 5
|
||||
|
||||
|
||||
|
||||
|
@@ -1,27 +0,0 @@
|
||||
import pytest
|
||||
import gym
|
||||
import tensorflow as tf
|
||||
|
||||
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
|
||||
from baselines.run import get_learn_function
|
||||
from baselines.common.tf_util import make_session
|
||||
|
||||
algos = ['a2c', 'acer', 'acktr', 'deepq', 'ppo2', 'trpo_mpi']
|
||||
|
||||
@pytest.mark.parametrize('algo', algos)
|
||||
def test_env_after_learn(algo):
|
||||
def make_env():
|
||||
# acktr requires too much RAM, fails on travis
|
||||
env = gym.make('CartPole-v1' if algo == 'acktr' else 'PongNoFrameskip-v4')
|
||||
return env
|
||||
|
||||
make_session(make_default=True, graph=tf.Graph())
|
||||
env = SubprocVecEnv([make_env])
|
||||
|
||||
learn = get_learn_function(algo)
|
||||
|
||||
# Commenting out the following line resolves the issue, though crash happens at env.reset().
|
||||
learn(network='mlp', env=env, total_timesteps=0, load_path=None, seed=None)
|
||||
|
||||
env.reset()
|
||||
env.close()
|
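The test above (removed on one side of this diff) is also the shortest recipe here for driving a learner programmatically. A hedged sketch of the same pattern with a non-trivial budget; the algorithm, environment and timestep count are illustrative, the calls themselves are the ones used in the test:

```python
import gym
import tensorflow as tf

from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.tf_util import make_session
from baselines.run import get_learn_function

make_session(make_default=True, graph=tf.Graph())
env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

learn = get_learn_function('ppo2')
model = learn(network='mlp', env=env, total_timesteps=30000, seed=0)   # budget is illustrative

obs = env.reset()
actions, values, states, neglogps = model.step(obs)  # mirrors PolicyWithValue.step in the policies diff above
env.close()
```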
@@ -1,5 +1,5 @@
|
||||
import pytest
|
||||
from baselines.common.tests.envs.identity_env import DiscreteIdentityEnv, BoxIdentityEnv, MultiDiscreteIdentityEnv
|
||||
from baselines.common.tests.envs.identity_env import DiscreteIdentityEnv, BoxIdentityEnv
|
||||
from baselines.run import get_learn_function
|
||||
from baselines.common.tests.util import simple_test
|
||||
|
||||
@@ -14,18 +14,13 @@ learn_kwargs = {
|
||||
'a2c' : {},
|
||||
'acktr': {},
|
||||
'deepq': {},
|
||||
'ddpg': dict(layer_norm=True),
|
||||
'ppo2': dict(lr=1e-3, nsteps=64, ent_coef=0.0),
|
||||
'trpo_mpi': dict(timesteps_per_batch=100, cg_iters=10, gamma=0.9, lam=1.0, max_kl=0.01)
|
||||
}
|
||||
|
||||
|
||||
algos_disc = ['a2c', 'acktr', 'deepq', 'ppo2', 'trpo_mpi']
|
||||
algos_multidisc = ['a2c', 'acktr', 'ppo2', 'trpo_mpi']
|
||||
algos_cont = ['a2c', 'acktr', 'ddpg', 'ppo2', 'trpo_mpi']
|
||||
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize("alg", algos_disc)
|
||||
@pytest.mark.parametrize("alg", learn_kwargs.keys())
|
||||
def test_discrete_identity(alg):
|
||||
'''
|
||||
Test if the algorithm (with an mlp policy)
|
||||
@@ -40,22 +35,7 @@ def test_discrete_identity(alg):
|
||||
simple_test(env_fn, learn_fn, 0.9)
|
||||
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize("alg", algos_multidisc)
|
||||
def test_multidiscrete_identity(alg):
|
||||
'''
|
||||
Test if the algorithm (with an mlp policy)
|
||||
can learn an identity transformation (i.e. return observation as an action)
|
||||
'''
|
||||
|
||||
kwargs = learn_kwargs[alg]
|
||||
kwargs.update(common_kwargs)
|
||||
|
||||
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
|
||||
env_fn = lambda: MultiDiscreteIdentityEnv((3,3), episode_len=100)
|
||||
simple_test(env_fn, learn_fn, 0.9)
|
||||
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize("alg", algos_cont)
|
||||
@pytest.mark.parametrize("alg", ['a2c', 'ppo2', 'trpo_mpi'])
|
||||
def test_continuous_identity(alg):
|
||||
'''
|
||||
Test if the algorithm (with an mlp policy)
|
||||
@@ -71,5 +51,5 @@ def test_continuous_identity(alg):
|
||||
simple_test(env_fn, learn_fn, -0.1)
|
||||
|
||||
if __name__ == '__main__':
|
||||
test_multidiscrete_identity('acktr')
|
||||
test_continuous_identity('a2c')
|
||||
|
||||
|
@@ -17,7 +17,8 @@ common_kwargs = {
|
||||
|
||||
learn_args = {
|
||||
'a2c': dict(total_timesteps=50000),
|
||||
'acer': dict(total_timesteps=20000),
|
||||
# TODO need to resolve inference (step) API differences for acer; also slow
|
||||
# 'acer': dict(seed=0, total_timesteps=1000),
|
||||
'deepq': dict(total_timesteps=5000),
|
||||
'acktr': dict(total_timesteps=30000),
|
||||
'ppo2': dict(total_timesteps=50000, lr=1e-3, nsteps=128, ent_coef=0.0),
|
||||
@@ -46,4 +47,4 @@ def test_mnist(alg):
|
||||
simple_test(env_fn, learn_fn, 0.6)
|
||||
|
||||
if __name__ == '__main__':
|
||||
test_mnist('acer')
|
||||
test_mnist('deepq')
|
||||
|
@@ -1,5 +1,4 @@
|
||||
import os
|
||||
import gym
|
||||
import tempfile
|
||||
import pytest
|
||||
import tensorflow as tf
|
||||
@@ -17,7 +16,6 @@ learn_kwargs = {
|
||||
'deepq': {},
|
||||
'a2c': {},
|
||||
'acktr': {},
|
||||
'acer': {},
|
||||
'ppo2': {'nminibatches': 1, 'nsteps': 10},
|
||||
'trpo_mpi': {},
|
||||
}
|
||||
@@ -38,10 +36,10 @@ def test_serialization(learn_fn, network_fn):
|
||||
'''
|
||||
|
||||
|
||||
if network_fn.endswith('lstm') and learn_fn in ['acer', 'acktr', 'trpo_mpi', 'deepq']:
|
||||
if network_fn.endswith('lstm') and learn_fn in ['acktr', 'trpo_mpi', 'deepq']:
|
||||
# TODO make acktr work with recurrent policies
|
||||
# and test
|
||||
# github issue: https://github.com/openai/baselines/issues/660
|
||||
# github issue: https://github.com/openai/baselines/issues/194
|
||||
return
|
||||
|
||||
env = DummyVecEnv([lambda: MnistEnv(10, episode_len=100)])
|
||||
@@ -77,41 +75,6 @@ def test_serialization(learn_fn, network_fn):
|
||||
np.testing.assert_allclose(std1, std2, atol=0.5)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("learn_fn", learn_kwargs.keys())
|
||||
@pytest.mark.parametrize("network_fn", ['mlp'])
|
||||
def test_coexistence(learn_fn, network_fn):
|
||||
'''
|
||||
Test if more than one model can exist at a time
|
||||
'''
|
||||
|
||||
if learn_fn == 'deepq':
|
||||
# TODO enable multiple DQN models to be useable at the same time
|
||||
# github issue https://github.com/openai/baselines/issues/656
|
||||
return
|
||||
|
||||
if network_fn.endswith('lstm') and learn_fn in ['acktr', 'trpo_mpi', 'deepq']:
|
||||
# TODO make acktr work with recurrent policies
|
||||
# and test
|
||||
# github issue: https://github.com/openai/baselines/issues/660
|
||||
return
|
||||
|
||||
env = DummyVecEnv([lambda: gym.make('CartPole-v0')])
|
||||
learn = get_learn_function(learn_fn)
|
||||
|
||||
kwargs = {}
|
||||
kwargs.update(network_kwargs[network_fn])
|
||||
kwargs.update(learn_kwargs[learn_fn])
|
||||
|
||||
learn = partial(learn, env=env, network=network_fn, total_timesteps=0, **kwargs)
|
||||
make_session(make_default=True, graph=tf.Graph());
|
||||
model1 = learn(seed=1)
|
||||
make_session(make_default=True, graph=tf.Graph());
|
||||
model2 = learn(seed=2)
|
||||
|
||||
model1.step(env.observation_space.sample())
|
||||
model2.step(env.observation_space.sample())
|
||||
|
||||
|
||||
|
||||
def _serialize_variables():
|
||||
sess = get_session()
|
||||
|
@@ -293,7 +293,7 @@ def display_var_info(vars):
        if "/Adam" in name or "beta1_power" in name or "beta2_power" in name: continue
        v_params = np.prod(v.shape.as_list())
        count_params += v_params
        if "/b:" in name or "/bias" in name: continue # Wx+b, bias is not interesting to look at => count params, but not print
        if "/b:" in name or "/biases" in name: continue # Wx+b, bias is not interesting to look at => count params, but not print
        logger.info(" %s%s %i params %s" % (name, " "*(55-len(name)), v_params, str(v.shape)))

    logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))

@@ -312,19 +312,13 @@ def get_available_gpus():
|
||||
# ================================================================
|
||||
|
||||
def load_state(fname, sess=None):
|
||||
from baselines import logger
|
||||
logger.warn('load_state method is deprecated, please use load_variables instead')
|
||||
sess = sess or get_session()
|
||||
saver = tf.train.Saver()
|
||||
saver.restore(tf.get_default_session(), fname)
|
||||
|
||||
def save_state(fname, sess=None):
|
||||
from baselines import logger
|
||||
logger.warn('save_state method is deprecated, please use save_variables instead')
|
||||
sess = sess or get_session()
|
||||
dirname = os.path.dirname(fname)
|
||||
if any(dirname):
|
||||
os.makedirs(dirname, exist_ok=True)
|
||||
os.makedirs(os.path.dirname(fname), exist_ok=True)
|
||||
saver = tf.train.Saver()
|
||||
saver.save(tf.get_default_session(), fname)
|
||||
|
||||
@@ -337,9 +331,7 @@ def save_variables(save_path, variables=None, sess=None):
|
||||
|
||||
ps = sess.run(variables)
|
||||
save_dict = {v.name: value for v, value in zip(variables, ps)}
|
||||
dirname = os.path.dirname(save_path)
|
||||
if any(dirname):
|
||||
os.makedirs(dirname, exist_ok=True)
|
||||
os.makedirs(os.path.dirname(save_path), exist_ok=True)
|
||||
joblib.dump(save_dict, save_path)
|
||||
|
||||
def load_variables(load_path, variables=None, sess=None):
|
||||
@@ -348,16 +340,11 @@ def load_variables(load_path, variables=None, sess=None):
|
||||
|
||||
loaded_params = joblib.load(os.path.expanduser(load_path))
|
||||
restores = []
|
||||
if isinstance(loaded_params, list):
|
||||
assert len(loaded_params) == len(variables), 'number of variables loaded mismatches len(variables)'
|
||||
for d, v in zip(loaded_params, variables):
|
||||
restores.append(v.assign(d))
|
||||
else:
|
||||
for v in variables:
|
||||
restores.append(v.assign(loaded_params[v.name]))
|
||||
|
||||
sess.run(restores)
|
||||
|
||||
|
||||
# ================================================================
|
||||
# Shape adjustment for feeding into tf placeholders
|
||||
# ================================================================
|
||||
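A minimal round-trip with the `save_variables` / `load_variables` helpers above; this is a sketch, assuming a default session created via `make_session` and an illustrative save path and variable:

```python
import tensorflow as tf
from baselines.common.tf_util import make_session, save_variables, load_variables

sess = make_session(make_default=True)                  # session on the default graph
w = tf.get_variable('w', initializer=tf.constant([1.0, 2.0]))
sess.run(tf.global_variables_initializer())

save_variables('/tmp/model_vars')                       # joblib dump of {var.name: value}; path is illustrative
sess.run(w.assign([0.0, 0.0]))
load_variables('/tmp/model_vars')                       # w is restored to [1., 2.]
print(sess.run(w))
```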
@@ -406,25 +393,13 @@ def _check_shape(placeholder_shape, data_shape):
|
||||
def _squeeze_shape(shape):
|
||||
return [x for x in shape if x != 1]
|
||||
|
||||
# ================================================================
|
||||
# Tensorboard interfacing
|
||||
# ================================================================
|
||||
|
||||
def launch_tensorboard_in_background(log_dir):
|
||||
'''
|
||||
To log the Tensorflow graph when using rl-algs
|
||||
algorithms, you can run the following code
|
||||
in your main script:
|
||||
import threading, time
|
||||
def start_tensorboard(session):
|
||||
time.sleep(10) # Wait until graph is setup
|
||||
tb_path = osp.join(logger.get_dir(), 'tb')
|
||||
summary_writer = tf.summary.FileWriter(tb_path, graph=session.graph)
|
||||
summary_op = tf.summary.merge_all()
|
||||
launch_tensorboard_in_background(tb_path)
|
||||
session = tf.get_default_session()
|
||||
t = threading.Thread(target=start_tensorboard, args=([session]))
|
||||
from tensorboard import main as tb
|
||||
import threading
|
||||
tf.flags.FLAGS.logdir = log_dir
|
||||
t = threading.Thread(target=tb.main, args=([]))
|
||||
t.start()
|
||||
'''
|
||||
import subprocess
|
||||
subprocess.Popen(['tensorboard', '--logdir', log_dir])
|
||||
|
||||
|
@@ -1,42 +1,28 @@
|
||||
from abc import ABC, abstractmethod
|
||||
from baselines.common.tile_images import tile_images
|
||||
from baselines import logger
|
||||
|
||||
class AlreadySteppingError(Exception):
|
||||
"""
|
||||
Raised when an asynchronous step is running while
|
||||
step_async() is called again.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
msg = 'already running an async step'
|
||||
Exception.__init__(self, msg)
|
||||
|
||||
|
||||
class NotSteppingError(Exception):
|
||||
"""
|
||||
Raised when an asynchronous step is not running but
|
||||
step_wait() is called.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
msg = 'not running an async step'
|
||||
Exception.__init__(self, msg)
|
||||
|
||||
|
||||
class VecEnv(ABC):
    """
    An abstract asynchronous, vectorized environment.
    Used to batch data from multiple copies of an environment, so that
    each observation becomes a batch of observations, and the expected action is a batch of actions to
    be applied per-environment.
    """
    closed = False
    viewer = None

    metadata = {
        'render.modes': ['human', 'rgb_array']
    }

    def __init__(self, num_envs, observation_space, action_space):
        self.num_envs = num_envs
        self.observation_space = observation_space
@@ -46,7 +32,7 @@ class VecEnv(ABC):
|
||||
def reset(self):
|
||||
"""
|
||||
Reset all the environments and return an array of
|
||||
observations, or a dict of observation arrays.
|
||||
observations, or a tuple of observation arrays.
|
||||
|
||||
If step_async is still doing work, that work will
|
||||
be cancelled and step_wait() should not be called
|
||||
@@ -72,7 +58,7 @@ class VecEnv(ABC):
|
||||
Wait for the step taken with step_async().
|
||||
|
||||
Returns (obs, rews, dones, infos):
|
||||
- obs: an array of observations, or a dict of
|
||||
- obs: an array of observations, or a tuple of
|
||||
arrays of observations.
|
||||
- rews: an array of rewards
|
||||
- dones: an array of "episode done" booleans
|
||||
@@ -80,46 +66,19 @@ class VecEnv(ABC):
|
||||
"""
|
||||
pass
|
||||
|
||||
def close_extras(self):
|
||||
@abstractmethod
|
||||
def close(self):
|
||||
"""
|
||||
Clean up the extra resources, beyond what's in this base class.
|
||||
Only runs when not self.closed.
|
||||
Clean up the environments' resources.
|
||||
"""
|
||||
pass
|
||||
|
||||
def close(self):
|
||||
if self.closed:
|
||||
return
|
||||
if self.viewer is not None:
|
||||
self.viewer.close()
|
||||
self.close_extras()
|
||||
self.closed = True
|
||||
|
||||
def step(self, actions):
|
||||
"""
|
||||
Step the environments synchronously.
|
||||
|
||||
This is available for backwards compatibility.
|
||||
"""
|
||||
self.step_async(actions)
|
||||
return self.step_wait()
|
||||
|
||||
def render(self, mode='human'):
|
||||
imgs = self.get_images()
|
||||
bigimg = tile_images(imgs)
|
||||
if mode == 'human':
|
||||
self.get_viewer().imshow(bigimg)
|
||||
return self.get_viewer().isopen
|
||||
elif mode == 'rgb_array':
|
||||
return bigimg
|
||||
else:
|
||||
raise NotImplementedError
|
||||
|
||||
def get_images(self):
|
||||
"""
|
||||
Return RGB images from each environment
|
||||
"""
|
||||
raise NotImplementedError
|
||||
logger.warn('Render not defined for %s'%self)
|
||||
|
||||
@property
|
||||
def unwrapped(self):
|
||||
@@ -128,19 +87,7 @@ class VecEnv(ABC):
|
||||
else:
|
||||
return self
|
||||
|
||||
def get_viewer(self):
|
||||
if self.viewer is None:
|
||||
from gym.envs.classic_control import rendering
|
||||
self.viewer = rendering.SimpleImageViewer()
|
||||
return self.viewer
|
||||
|
||||
|
||||
class VecEnvWrapper(VecEnv):
|
||||
"""
|
||||
An environment wrapper that applies to an entire batch
|
||||
of environments at once.
|
||||
"""
|
||||
|
||||
def __init__(self, venv, observation_space=None, action_space=None):
|
||||
self.venv = venv
|
||||
VecEnv.__init__(self,
|
||||
@@ -162,24 +109,18 @@ class VecEnvWrapper(VecEnv):
|
||||
def close(self):
|
||||
return self.venv.close()
|
||||
|
||||
def render(self, mode='human'):
|
||||
return self.venv.render(mode=mode)
|
||||
|
||||
def get_images(self):
|
||||
return self.venv.get_images()
|
||||
def render(self):
|
||||
self.venv.render()
|
||||
|
||||
class CloudpickleWrapper(object):
|
||||
"""
|
||||
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
|
||||
"""
|
||||
|
||||
def __init__(self, x):
|
||||
self.x = x
|
||||
|
||||
def __getstate__(self):
|
||||
import cloudpickle
|
||||
return cloudpickle.dumps(self.x)
|
||||
|
||||
def __setstate__(self, ob):
|
||||
import pickle
|
||||
self.x = pickle.loads(ob)
|
||||
|
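Why the CloudpickleWrapper above exists: the env_fns passed to vectorized envs are usually lambdas or closures, which the standard pickle module refuses to serialize, while cloudpickle handles them. A standalone illustration:

```python
import pickle
import cloudpickle

env_fn = lambda: "pretend this builds a gym env"   # closures/lambdas are the typical env_fns

try:
    pickle.dumps(env_fn)
except Exception as e:                             # plain pickle refuses lambdas
    print('pickle failed:', type(e).__name__)

blob = cloudpickle.dumps(env_fn)                   # cloudpickle handles it
restored = pickle.loads(blob)                      # plain pickle can load what cloudpickle wrote
print(restored())
```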
@@ -1,26 +1,27 @@
|
||||
import numpy as np
|
||||
from gym import spaces
|
||||
from collections import OrderedDict
|
||||
from . import VecEnv
|
||||
from .util import copy_obs_dict, dict_to_obs, obs_space_info
|
||||
|
||||
class DummyVecEnv(VecEnv):
    """
    VecEnv that runs multiple environments sequentially, that is,
    the step and reset commands are sent to one environment at a time.
    Useful when debugging and when num_env == 1 (in the latter case,
    avoids communication overhead)
    """
    def __init__(self, env_fns):
        """
        Arguments:

        env_fns: iterable of callable functions that build environments
        """
self.envs = [fn() for fn in env_fns]
|
||||
env = self.envs[0]
|
||||
VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
|
||||
shapes, dtypes = {}, {}
|
||||
self.keys = []
|
||||
obs_space = env.observation_space
|
||||
self.keys, shapes, dtypes = obs_space_info(obs_space)
|
||||
|
||||
if isinstance(obs_space, spaces.Dict):
|
||||
assert isinstance(obs_space.spaces, OrderedDict)
|
||||
subspaces = obs_space.spaces
|
||||
else:
|
||||
subspaces = {None: obs_space}
|
||||
|
||||
for key, box in subspaces.items():
|
||||
shapes[key] = box.shape
|
||||
dtypes[key] = box.dtype
|
||||
self.keys.append(key)
|
||||
|
||||
self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
|
||||
self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool)
|
||||
@@ -52,7 +53,7 @@ class DummyVecEnv(VecEnv):
|
||||
if self.buf_dones[e]:
|
||||
obs = self.envs[e].reset()
|
||||
self._save_obs(e, obs)
|
||||
return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones),
|
||||
return (np.copy(self._obs_from_buf()), np.copy(self.buf_rews), np.copy(self.buf_dones),
|
||||
self.buf_infos.copy())
|
||||
|
||||
def reset(self):
|
||||
@@ -61,6 +62,12 @@ class DummyVecEnv(VecEnv):
|
||||
self._save_obs(e, obs)
|
||||
return self._obs_from_buf()
|
||||
|
||||
def close(self):
|
||||
return
|
||||
|
||||
def render(self, mode='human'):
|
||||
return [e.render(mode=mode) for e in self.envs]
|
||||
|
||||
def _save_obs(self, e, obs):
|
||||
for k in self.keys:
|
||||
if k is None:
|
||||
@@ -69,13 +76,7 @@ class DummyVecEnv(VecEnv):
|
||||
self.buf_obs[k][e] = obs[k]
|
||||
|
||||
def _obs_from_buf(self):
|
||||
return dict_to_obs(copy_obs_dict(self.buf_obs))
|
||||
|
||||
def get_images(self):
|
||||
return [env.render(mode='rgb_array') for env in self.envs]
|
||||
|
||||
def render(self, mode='human'):
|
||||
if self.num_envs == 1:
|
||||
return self.envs[0].render(mode=mode)
|
||||
if self.keys==[None]:
|
||||
return self.buf_obs[None]
|
||||
else:
|
||||
return super().render(mode=mode)
|
||||
return self.buf_obs
|
||||
|
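A minimal usage sketch of DummyVecEnv as defined above; CartPole and the random actions are illustrative:

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = DummyVecEnv([lambda: gym.make('CartPole-v0') for _ in range(2)])

obs = venv.reset()                                    # shape (2, 4): one row per environment
actions = [venv.action_space.sample() for _ in range(venv.num_envs)]
obs, rews, dones, infos = venv.step(actions)          # step() = step_async() + step_wait()
venv.close()
```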
@@ -1,137 +0,0 @@
|
||||
"""
|
||||
An interface for asynchronous vectorized environments.
|
||||
"""
|
||||
|
||||
from multiprocessing import Pipe, Array, Process
|
||||
import numpy as np
|
||||
from . import VecEnv, CloudpickleWrapper
|
||||
import ctypes
|
||||
from baselines import logger
|
||||
|
||||
from .util import dict_to_obs, obs_space_info, obs_to_dict
|
||||
|
||||
_NP_TO_CT = {np.float32: ctypes.c_float,
|
||||
np.int32: ctypes.c_int32,
|
||||
np.int8: ctypes.c_int8,
|
||||
np.uint8: ctypes.c_char,
|
||||
np.bool: ctypes.c_bool}
|
||||
|
||||
|
||||
class ShmemVecEnv(VecEnv):
|
||||
"""
|
||||
Optimized version of SubprocVecEnv that uses shared variables to communicate observations.
|
||||
"""
|
||||
|
||||
def __init__(self, env_fns, spaces=None):
|
||||
"""
|
||||
If you don't specify observation_space, we'll have to create a dummy
|
||||
environment to get it.
|
||||
"""
|
||||
if spaces:
|
||||
observation_space, action_space = spaces
|
||||
else:
|
||||
logger.log('Creating dummy env object to get spaces')
|
||||
with logger.scoped_configure(format_strs=[]):
|
||||
dummy = env_fns[0]()
|
||||
observation_space, action_space = dummy.observation_space, dummy.action_space
|
||||
dummy.close()
|
||||
del dummy
|
||||
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
|
||||
self.obs_keys, self.obs_shapes, self.obs_dtypes = obs_space_info(observation_space)
|
||||
self.obs_bufs = [
|
||||
{k: Array(_NP_TO_CT[self.obs_dtypes[k].type], int(np.prod(self.obs_shapes[k]))) for k in self.obs_keys}
|
||||
for _ in env_fns]
|
||||
self.parent_pipes = []
|
||||
self.procs = []
|
||||
for env_fn, obs_buf in zip(env_fns, self.obs_bufs):
|
||||
wrapped_fn = CloudpickleWrapper(env_fn)
|
||||
parent_pipe, child_pipe = Pipe()
|
||||
proc = Process(target=_subproc_worker,
|
||||
args=(child_pipe, parent_pipe, wrapped_fn, obs_buf, self.obs_shapes, self.obs_dtypes, self.obs_keys))
|
||||
proc.daemon = True
|
||||
self.procs.append(proc)
|
||||
self.parent_pipes.append(parent_pipe)
|
||||
proc.start()
|
||||
child_pipe.close()
|
||||
self.waiting_step = False
|
||||
self.viewer = None
|
||||
|
||||
def reset(self):
|
||||
if self.waiting_step:
|
||||
logger.warn('Called reset() while waiting for the step to complete')
|
||||
self.step_wait()
|
||||
for pipe in self.parent_pipes:
|
||||
pipe.send(('reset', None))
|
||||
return self._decode_obses([pipe.recv() for pipe in self.parent_pipes])
|
||||
|
||||
def step_async(self, actions):
|
||||
assert len(actions) == len(self.parent_pipes)
|
||||
for pipe, act in zip(self.parent_pipes, actions):
|
||||
pipe.send(('step', act))
|
||||
|
||||
def step_wait(self):
|
||||
outs = [pipe.recv() for pipe in self.parent_pipes]
|
||||
obs, rews, dones, infos = zip(*outs)
|
||||
return self._decode_obses(obs), np.array(rews), np.array(dones), infos
|
||||
|
||||
def close_extras(self):
|
||||
if self.waiting_step:
|
||||
self.step_wait()
|
||||
for pipe in self.parent_pipes:
|
||||
pipe.send(('close', None))
|
||||
for pipe in self.parent_pipes:
|
||||
pipe.recv()
|
||||
pipe.close()
|
||||
for proc in self.procs:
|
||||
proc.join()
|
||||
|
||||
def get_images(self, mode='human'):
|
||||
for pipe in self.parent_pipes:
|
||||
pipe.send(('render', None))
|
||||
return [pipe.recv() for pipe in self.parent_pipes]
|
||||
|
||||
def _decode_obses(self, obs):
|
||||
result = {}
|
||||
for k in self.obs_keys:
|
||||
|
||||
bufs = [b[k] for b in self.obs_bufs]
|
||||
o = [np.frombuffer(b.get_obj(), dtype=self.obs_dtypes[k]).reshape(self.obs_shapes[k]) for b in bufs]
|
||||
result[k] = np.array(o)
|
||||
return dict_to_obs(result)
|
||||
|
||||
|
||||
def _subproc_worker(pipe, parent_pipe, env_fn_wrapper, obs_bufs, obs_shapes, obs_dtypes, keys):
|
||||
"""
|
||||
Control a single environment instance using IPC and
|
||||
shared memory.
|
||||
"""
|
||||
def _write_obs(maybe_dict_obs):
|
||||
flatdict = obs_to_dict(maybe_dict_obs)
|
||||
for k in keys:
|
||||
dst = obs_bufs[k].get_obj()
|
||||
dst_np = np.frombuffer(dst, dtype=obs_dtypes[k]).reshape(obs_shapes[k]) # pylint: disable=W0212
|
||||
np.copyto(dst_np, flatdict[k])
|
||||
|
||||
env = env_fn_wrapper.x()
|
||||
parent_pipe.close()
|
||||
try:
|
||||
while True:
|
||||
cmd, data = pipe.recv()
|
||||
if cmd == 'reset':
|
||||
pipe.send(_write_obs(env.reset()))
|
||||
elif cmd == 'step':
|
||||
obs, reward, done, info = env.step(data)
|
||||
if done:
|
||||
obs = env.reset()
|
||||
pipe.send((_write_obs(obs), reward, done, info))
|
||||
elif cmd == 'render':
|
||||
pipe.send(env.render(mode='rgb_array'))
|
||||
elif cmd == 'close':
|
||||
pipe.send(None)
|
||||
break
|
||||
else:
|
||||
raise RuntimeError('Got unrecognized cmd %s' % cmd)
|
||||
except KeyboardInterrupt:
|
||||
print('ShmemVecEnv worker: got KeyboardInterrupt')
|
||||
finally:
|
||||
env.close()
|
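The key idea in the (removed) ShmemVecEnv above is that observations are exchanged through multiprocessing.Array shared-memory buffers, with the pipes carrying only small control messages. A standalone sketch of that write/read round-trip; the shape and dtype are illustrative, the ctypes mapping for uint8 is the one from _NP_TO_CT above:

```python
import ctypes
import numpy as np
from multiprocessing import Array

shape, dtype = (84, 84, 4), np.uint8
buf = Array(ctypes.c_char, int(np.prod(shape)))        # same ctype the diff maps np.uint8 to

# writer side (in ShmemVecEnv this happens inside _subproc_worker's _write_obs)
obs = np.random.randint(0, 255, size=shape, dtype=dtype)
np.copyto(np.frombuffer(buf.get_obj(), dtype=dtype).reshape(shape), obs)

# reader side (in ShmemVecEnv this happens inside _decode_obses)
out = np.frombuffer(buf.get_obj(), dtype=dtype).reshape(shape)
assert (out == obs).all()
```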
@@ -1,6 +1,8 @@
|
||||
import numpy as np
|
||||
from multiprocessing import Process, Pipe
|
||||
from . import VecEnv, CloudpickleWrapper
|
||||
from baselines.common.vec_env import VecEnv, CloudpickleWrapper
|
||||
from baselines.common.tile_images import tile_images
|
||||
|
||||
|
||||
def worker(remote, parent_remote, env_fn_wrapper):
|
||||
parent_remote.close()
|
||||
@@ -30,17 +32,10 @@ def worker(remote, parent_remote, env_fn_wrapper):
|
||||
finally:
|
||||
env.close()
|
||||
|
||||
|
||||
class SubprocVecEnv(VecEnv):
    """
    VecEnv that runs multiple environments in parallel in subprocesses and communicates with them via pipes.
    Recommended to use when num_envs > 1 and step() can be a bottleneck.
    """
    def __init__(self, env_fns, spaces=None):
        """
        Arguments:

        env_fns: iterable of callables - functions that create environments to run in subprocesses. Need to be cloud-pickleable
        envs: list of gym environments to run in subprocesses
        """
self.waiting = False
|
||||
self.closed = False
|
||||
@@ -56,30 +51,32 @@ class SubprocVecEnv(VecEnv):
|
||||
|
||||
self.remotes[0].send(('get_spaces', None))
|
||||
observation_space, action_space = self.remotes[0].recv()
|
||||
self.viewer = None
|
||||
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
|
||||
|
||||
def step_async(self, actions):
|
||||
self._assert_not_closed()
|
||||
for remote, action in zip(self.remotes, actions):
|
||||
remote.send(('step', action))
|
||||
self.waiting = True
|
||||
|
||||
def step_wait(self):
|
||||
self._assert_not_closed()
|
||||
results = [remote.recv() for remote in self.remotes]
|
||||
self.waiting = False
|
||||
obs, rews, dones, infos = zip(*results)
|
||||
return np.stack(obs), np.stack(rews), np.stack(dones), infos
|
||||
|
||||
def reset(self):
|
||||
self._assert_not_closed()
|
||||
for remote in self.remotes:
|
||||
remote.send(('reset', None))
|
||||
return np.stack([remote.recv() for remote in self.remotes])
|
||||
|
||||
def close_extras(self):
|
||||
self.closed = True
|
||||
def reset_task(self):
|
||||
for remote in self.remotes:
|
||||
remote.send(('reset_task', None))
|
||||
return np.stack([remote.recv() for remote in self.remotes])
|
||||
|
||||
def close(self):
|
||||
if self.closed:
|
||||
return
|
||||
if self.waiting:
|
||||
for remote in self.remotes:
|
||||
remote.recv()
|
||||
@@ -87,13 +84,18 @@ class SubprocVecEnv(VecEnv):
|
||||
remote.send(('close', None))
|
||||
for p in self.ps:
|
||||
p.join()
|
||||
self.closed = True
|
||||
|
||||
def get_images(self):
|
||||
self._assert_not_closed()
|
||||
def render(self, mode='human'):
|
||||
for pipe in self.remotes:
|
||||
pipe.send(('render', None))
|
||||
imgs = [pipe.recv() for pipe in self.remotes]
|
||||
return imgs
|
||||
|
||||
def _assert_not_closed(self):
|
||||
assert not self.closed, "Trying to operate on a SubprocVecEnv after calling close()"
|
||||
bigimg = tile_images(imgs)
|
||||
if mode == 'human':
|
||||
import cv2
|
||||
cv2.imshow('vecenv', bigimg[:,:,::-1])
|
||||
cv2.waitKey(1)
|
||||
elif mode == 'rgb_array':
|
||||
return bigimg
|
||||
else:
|
||||
raise NotImplementedError
|
@@ -1,101 +0,0 @@
|
||||
"""
|
||||
Tests for asynchronous vectorized environments.
|
||||
"""
|
||||
|
||||
import gym
|
||||
import numpy as np
|
||||
import pytest
|
||||
from .dummy_vec_env import DummyVecEnv
|
||||
from .shmem_vec_env import ShmemVecEnv
|
||||
from .subproc_vec_env import SubprocVecEnv
|
||||
|
||||
|
||||
def assert_envs_equal(env1, env2, num_steps):
|
||||
"""
|
||||
Compare two environments over num_steps steps and make sure
|
||||
that the observations produced by each are the same when given
|
||||
the same actions.
|
||||
"""
|
||||
assert env1.num_envs == env2.num_envs
|
||||
assert env1.action_space.shape == env2.action_space.shape
|
||||
assert env1.action_space.dtype == env2.action_space.dtype
|
||||
joint_shape = (env1.num_envs,) + env1.action_space.shape
|
||||
|
||||
try:
|
||||
obs1, obs2 = env1.reset(), env2.reset()
|
||||
assert np.array(obs1).shape == np.array(obs2).shape
|
||||
assert np.array(obs1).shape == joint_shape
|
||||
assert np.allclose(obs1, obs2)
|
||||
np.random.seed(1337)
|
||||
for _ in range(num_steps):
|
||||
actions = np.array(np.random.randint(0, 0x100, size=joint_shape),
|
||||
dtype=env1.action_space.dtype)
|
||||
for env in [env1, env2]:
|
||||
env.step_async(actions)
|
||||
outs1 = env1.step_wait()
|
||||
outs2 = env2.step_wait()
|
||||
for out1, out2 in zip(outs1[:3], outs2[:3]):
|
||||
assert np.array(out1).shape == np.array(out2).shape
|
||||
assert np.allclose(out1, out2)
|
||||
assert list(outs1[3]) == list(outs2[3])
|
||||
finally:
|
||||
env1.close()
|
||||
env2.close()
|
||||
|
||||
|
||||
@pytest.mark.parametrize('klass', (ShmemVecEnv, SubprocVecEnv))
|
||||
@pytest.mark.parametrize('dtype', ('uint8', 'float32'))
|
||||
def test_vec_env(klass, dtype): # pylint: disable=R0914
|
||||
"""
|
||||
Test that a vectorized environment is equivalent to
|
||||
DummyVecEnv, since DummyVecEnv is less likely to be
|
||||
error prone.
|
||||
"""
|
||||
num_envs = 3
|
||||
num_steps = 100
|
||||
shape = (3, 8)
|
||||
|
||||
def make_fn(seed):
|
||||
"""
|
||||
Get an environment constructor with a seed.
|
||||
"""
|
||||
return lambda: SimpleEnv(seed, shape, dtype)
|
||||
fns = [make_fn(i) for i in range(num_envs)]
|
||||
env1 = DummyVecEnv(fns)
|
||||
env2 = klass(fns)
|
||||
assert_envs_equal(env1, env2, num_steps=num_steps)
|
||||
|
||||
|
||||
class SimpleEnv(gym.Env):
|
||||
"""
|
||||
An environment with a pre-determined observation space
|
||||
and RNG seed.
|
||||
"""
|
||||
|
||||
def __init__(self, seed, shape, dtype):
|
||||
np.random.seed(seed)
|
||||
self._dtype = dtype
|
||||
self._start_obs = np.array(np.random.randint(0, 0x100, size=shape),
|
||||
dtype=dtype)
|
||||
self._max_steps = seed + 1
|
||||
self._cur_obs = None
|
||||
self._cur_step = 0
|
||||
# this is 0xFF instead of 0x100 because the Box space includes
|
||||
# the high end, while randint does not
|
||||
self.action_space = gym.spaces.Box(low=0, high=0xFF, shape=shape, dtype=dtype)
|
||||
self.observation_space = self.action_space
|
||||
|
||||
def step(self, action):
|
||||
self._cur_obs += np.array(action, dtype=self._dtype)
|
||||
self._cur_step += 1
|
||||
done = self._cur_step >= self._max_steps
|
||||
reward = self._cur_step / self._max_steps
|
||||
return self._cur_obs, reward, done, {'foo': 'bar' + str(reward)}
|
||||
|
||||
def reset(self):
|
||||
self._cur_obs = self._start_obs
|
||||
self._cur_step = 0
|
||||
return self._cur_obs
|
||||
|
||||
def render(self, mode=None):
|
||||
raise NotImplementedError
|
@@ -1,49 +0,0 @@
|
||||
"""
|
||||
Tests for asynchronous vectorized environments.
|
||||
"""
|
||||
|
||||
import gym
|
||||
import pytest
|
||||
import os
|
||||
import glob
|
||||
import tempfile
|
||||
|
||||
from .dummy_vec_env import DummyVecEnv
|
||||
from .shmem_vec_env import ShmemVecEnv
|
||||
from .subproc_vec_env import SubprocVecEnv
|
||||
from .vec_video_recorder import VecVideoRecorder
|
||||
|
||||
@pytest.mark.parametrize('klass', (DummyVecEnv, ShmemVecEnv, SubprocVecEnv))
|
||||
@pytest.mark.parametrize('num_envs', (1, 4))
|
||||
@pytest.mark.parametrize('video_length', (10, 100))
|
||||
@pytest.mark.parametrize('video_interval', (1, 50))
|
||||
def test_video_recorder(klass, num_envs, video_length, video_interval):
|
||||
"""
|
||||
Wrap an existing VecEnv with VevVideoRecorder,
|
||||
Make (video_interval + video_length + 1) steps,
|
||||
then check that the file is present
|
||||
"""
|
||||
|
||||
def make_fn():
|
||||
env = gym.make('PongNoFrameskip-v4')
|
||||
return env
|
||||
fns = [make_fn for _ in range(num_envs)]
|
||||
env = klass(fns)
|
||||
|
||||
with tempfile.TemporaryDirectory() as video_path:
|
||||
env = VecVideoRecorder(env, video_path, record_video_trigger=lambda x: x % video_interval == 0, video_length=video_length)
|
||||
|
||||
env.reset()
|
||||
for _ in range(video_interval + video_length + 1):
|
||||
env.step([0] * num_envs)
|
||||
env.close()
|
||||
|
||||
|
||||
recorded_video = glob.glob(os.path.join(video_path, "*.mp4"))
|
||||
|
||||
# first and second step
|
||||
assert len(recorded_video) == 2
|
||||
# Files are not empty
|
||||
assert all(os.stat(p).st_size != 0 for p in recorded_video)
|
||||
|
||||
|
@@ -1,59 +0,0 @@
"""
Helpers for dealing with vectorized environments.
"""

from collections import OrderedDict

import gym
import numpy as np


def copy_obs_dict(obs):
    """
    Deep-copy an observation dict.
    """
    return {k: np.copy(v) for k, v in obs.items()}


def dict_to_obs(obs_dict):
    """
    Convert an observation dict into a raw array if the
    original observation space was not a Dict space.
    """
    if set(obs_dict.keys()) == {None}:
        return obs_dict[None]
    return obs_dict


def obs_space_info(obs_space):
    """
    Get dict-structured information about a gym.Space.

    Returns:
      A tuple (keys, shapes, dtypes):
        keys: a list of dict keys.
        shapes: a dict mapping keys to shapes.
        dtypes: a dict mapping keys to dtypes.
    """
    if isinstance(obs_space, gym.spaces.Dict):
        assert isinstance(obs_space.spaces, OrderedDict)
        subspaces = obs_space.spaces
    else:
        subspaces = {None: obs_space}
    keys = []
    shapes = {}
    dtypes = {}
    for key, box in subspaces.items():
        keys.append(key)
        shapes[key] = box.shape
        dtypes[key] = box.dtype
    return keys, shapes, dtypes


def obs_to_dict(obs):
    """
    Convert an observation into a dict.
    """
    if isinstance(obs, dict):
        return obs
    return {None: obs}
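For orientation, roughly what `obs_space_info` above returns in its two branches; the spaces are illustrative and the import path is the one used by dummy_vec_env earlier in this diff (the module is removed on one side of the comparison):

```python
import gym
import numpy as np
from collections import OrderedDict
from baselines.common.vec_env.util import obs_space_info   # path as used in dummy_vec_env above

box = gym.spaces.Box(low=0, high=255, shape=(84, 84, 3), dtype=np.uint8)
print(obs_space_info(box))
# -> ([None], {None: (84, 84, 3)}, {None: dtype('uint8')})

dct = gym.spaces.Dict(OrderedDict(image=box, vector=gym.spaces.Box(-1., 1., (5,), np.float32)))
print(obs_space_info(dct))
# -> (['image', 'vector'], {'image': (84, 84, 3), 'vector': (5,)}, {'image': dtype('uint8'), 'vector': dtype('float32')})
```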
@@ -1,9 +1,11 @@
|
||||
from . import VecEnvWrapper
|
||||
from baselines.common.vec_env import VecEnvWrapper
|
||||
import numpy as np
|
||||
from gym import spaces
|
||||
|
||||
|
||||
class VecFrameStack(VecEnvWrapper):
|
||||
"""
|
||||
Vectorized environment base class
|
||||
"""
|
||||
def __init__(self, venv, nstack):
|
||||
self.venv = venv
|
||||
self.nstack = nstack
|
||||
@@ -24,7 +26,13 @@ class VecFrameStack(VecEnvWrapper):
|
||||
return self.stackedobs, rews, news, infos
|
||||
|
||||
def reset(self):
|
||||
"""
|
||||
Reset all environments
|
||||
"""
|
||||
obs = self.venv.reset()
|
||||
self.stackedobs[...] = 0
|
||||
self.stackedobs[..., -obs.shape[-1]:] = obs
|
||||
return self.stackedobs
|
||||
|
||||
def close(self):
|
||||
self.venv.close()
|
||||
|
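A usage sketch for VecFrameStack above; the Atari env is illustrative and the import path is assumed to follow the other vec_env modules in this diff:

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.vec_frame_stack import VecFrameStack   # module path assumed from this diff

venv = DummyVecEnv([lambda: gym.make('PongNoFrameskip-v4')])
stacked = VecFrameStack(venv, nstack=4)

obs = stacked.reset()
print(obs.shape)   # (1, 210, 160, 12): the last (channel) axis is repeated nstack times
```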
@@ -1,37 +0,0 @@
from . import VecEnvWrapper
from baselines.bench.monitor import ResultsWriter
import numpy as np
import time


class VecMonitor(VecEnvWrapper):
    def __init__(self, venv, filename=None):
        VecEnvWrapper.__init__(self, venv)
        self.eprets = None
        self.eplens = None
        self.tstart = time.time()
        self.results_writer = ResultsWriter(filename, header={'t_start': self.tstart})

    def reset(self):
        obs = self.venv.reset()
        self.eprets = np.zeros(self.num_envs, 'f')
        self.eplens = np.zeros(self.num_envs, 'i')
        return obs

    def step_wait(self):
        obs, rews, dones, infos = self.venv.step_wait()
        self.eprets += rews
        self.eplens += 1
        newinfos = []
        for (i, (done, ret, eplen, info)) in enumerate(zip(dones, self.eprets, self.eplens, infos)):
            info = info.copy()
            if done:
                epinfo = {'r': ret, 'l': eplen, 't': round(time.time() - self.tstart, 6)}
                info['episode'] = epinfo
                self.eprets[i] = 0
                self.eplens[i] = 0
                self.results_writer.write_row(epinfo)

            newinfos.append(info)

        return obs, rews, dones, newinfos
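A usage sketch for VecMonitor above, showing where the per-episode record ends up; the env and file path are illustrative and the import path is assumed:

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.vec_monitor import VecMonitor   # module path assumed from this diff

venv = VecMonitor(DummyVecEnv([lambda: gym.make('CartPole-v0')]), filename='/tmp/vecmonitor')
obs = venv.reset()
while True:
    obs, rews, dones, infos = venv.step([venv.action_space.sample()])
    if dones[0]:
        print(infos[0]['episode'])   # {'r': episode return, 'l': length, 't': wall-clock time}
        break
venv.close()
```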
@@ -1,18 +1,17 @@
from . import VecEnvWrapper
from baselines.common.vec_env import VecEnvWrapper
from baselines.common.running_mean_std import RunningMeanStd
import numpy as np


class VecNormalize(VecEnvWrapper):
    """
    A vectorized wrapper that normalizes the observations
    and returns from an environment.
    Vectorized environment base class
    """

    def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
        VecEnvWrapper.__init__(self, venv)
        self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
        self.ret_rms = RunningMeanStd(shape=()) if ret else None
        #self.ob_rms = TfRunningMeanStd(shape=self.observation_space.shape, scope='observation_running_mean_std') if ob else None
        #self.ret_rms = TfRunningMeanStd(shape=(), scope='return_running_mean_std') if ret else None
        self.clipob = clipob
        self.cliprew = cliprew
        self.ret = np.zeros(self.num_envs)
@@ -20,13 +19,18 @@ class VecNormalize(VecEnvWrapper):
        self.epsilon = epsilon

    def step_wait(self):
        """
        Apply sequence of actions to sequence of environments
        actions -> (observations, rewards, news)

        where 'news' is a boolean vector indicating whether each element is new.
        """
        obs, rews, news, infos = self.venv.step_wait()
        self.ret = self.ret * self.gamma + rews
        obs = self._obfilt(obs)
        if self.ret_rms:
            self.ret_rms.update(self.ret)
            rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
        self.ret[news] = 0.
        return obs, rews, news, infos

    def _obfilt(self, obs):
@@ -38,6 +42,8 @@ class VecNormalize(VecEnvWrapper):
        return obs

    def reset(self):
        self.ret = np.zeros(self.num_envs)
        """
        Reset all environments
        """
        obs = self.venv.reset()
        return self._obfilt(obs)
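A usage sketch for VecNormalize above; Pendulum and the constructor arguments shown are illustrative, the import path is assumed:

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.vec_normalize import VecNormalize   # module path assumed from this diff

venv = VecNormalize(DummyVecEnv([lambda: gym.make('Pendulum-v0')]),
                    ob=True, ret=True, clipob=10., cliprew=10.)

obs = venv.reset()                                   # observations go through the running mean/std filter
obs, rews, dones, infos = venv.step([venv.action_space.sample()])
# rews are divided by the running std of the discounted return and clipped to [-cliprew, cliprew]
venv.close()
```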
@@ -1,89 +0,0 @@
|
||||
import os
|
||||
from baselines import logger
|
||||
from baselines.common.vec_env import VecEnvWrapper
|
||||
from gym.wrappers.monitoring import video_recorder
|
||||
|
||||
|
||||
class VecVideoRecorder(VecEnvWrapper):
|
||||
"""
|
||||
Wrap VecEnv to record rendered image as mp4 video.
|
||||
"""
|
||||
|
||||
def __init__(self, venv, directory, record_video_trigger, video_length=200):
|
||||
"""
|
||||
# Arguments
|
||||
venv: VecEnv to wrap
|
||||
directory: Where to save videos
|
||||
record_video_trigger:
|
||||
Function that defines when to start recording.
|
||||
The function takes the current number of step,
|
||||
and returns whether we should start recording or not.
|
||||
video_length: Length of recorded video
|
||||
"""
|
||||
|
||||
VecEnvWrapper.__init__(self, venv)
|
||||
self.record_video_trigger = record_video_trigger
|
||||
self.video_recorder = None
|
||||
|
||||
self.directory = os.path.abspath(directory)
|
||||
if not os.path.exists(self.directory): os.mkdir(self.directory)
|
||||
|
||||
self.file_prefix = "vecenv"
|
||||
self.file_infix = '{}'.format(os.getpid())
|
||||
self.step_id = 0
|
||||
self.video_length = video_length
|
||||
|
||||
self.recording = False
|
||||
self.recorded_frames = 0
|
||||
|
||||
def reset(self):
|
||||
obs = self.venv.reset()
|
||||
|
||||
self.start_video_recorder()
|
||||
|
||||
return obs
|
||||
|
||||
def start_video_recorder(self):
|
||||
self.close_video_recorder()
|
||||
|
||||
base_path = os.path.join(self.directory, '{}.video.{}.video{:06}'.format(self.file_prefix, self.file_infix, self.step_id))
|
||||
self.video_recorder = video_recorder.VideoRecorder(
|
||||
env=self.venv,
|
||||
base_path=base_path,
|
||||
metadata={'step_id': self.step_id}
|
||||
)
|
||||
|
||||
self.video_recorder.capture_frame()
|
||||
self.recorded_frames = 1
|
||||
self.recording = True
|
||||
|
||||
def _video_enabled(self):
|
||||
return self.record_video_trigger(self.step_id)
|
||||
|
||||
def step_wait(self):
|
||||
obs, rews, dones, infos = self.venv.step_wait()
|
||||
|
||||
self.step_id += 1
|
||||
if self.recording:
|
||||
self.video_recorder.capture_frame()
|
||||
self.recorded_frames += 1
|
||||
if self.recorded_frames > self.video_length:
|
||||
logger.info("Saving video to ", self.video_recorder.path)
|
||||
self.close_video_recorder()
|
||||
elif self._video_enabled():
|
||||
self.start_video_recorder()
|
||||
|
||||
return obs, rews, dones, infos
|
||||
|
||||
def close_video_recorder(self):
|
||||
if self.recording:
|
||||
self.video_recorder.close()
|
||||
self.recording = False
|
||||
self.recorded_frames = 0
|
||||
|
||||
def close(self):
|
||||
VecEnvWrapper.close(self)
|
||||
self.close_video_recorder()
|
||||
|
||||
def __del__(self):
|
||||
self.close()
|
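A usage sketch for VecVideoRecorder above, following the pattern of test_video_recorder earlier in this diff; the env, directory and trigger are illustrative:

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.vec_video_recorder import VecVideoRecorder

venv = DummyVecEnv([lambda: gym.make('PongNoFrameskip-v4')])
venv = VecVideoRecorder(venv, '/tmp/videos',
                        record_video_trigger=lambda step: step % 1000 == 0,  # start a clip every 1000 steps
                        video_length=200)

venv.reset()
for _ in range(1200):
    venv.step([venv.action_space.sample()])
venv.close()   # mp4 files named vecenv.video.<pid>.videoNNNNNN.mp4 appear in /tmp/videos
```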
2
baselines/ddpg/README.md
Executable file → Normal file
@@ -2,4 +2,4 @@

- Original paper: https://arxiv.org/abs/1509.02971
- Baselines post: https://blog.openai.com/better-exploration-with-parameter-noise/
- `python -m baselines.run --alg=ddpg --env=HalfCheetah-v2 --num_timesteps=1e6` runs the algorithm for 1M frames = 10M timesteps on a Mujoco environment. See help (`-h`) for more options.
- `python -m baselines.ddpg.main` runs the algorithm for 1M frames = 10M timesteps on a Mujoco environment. See help (`-h`) for more options.
0
baselines/ddpg/__init__.py
Executable file → Normal file
633
baselines/ddpg/ddpg.py
Executable file → Normal file
@@ -1,273 +1,378 @@
|
||||
import os
|
||||
import time
|
||||
from collections import deque
|
||||
import pickle
|
||||
from copy import copy
|
||||
from functools import reduce
|
||||
|
||||
from baselines.ddpg.ddpg_learner import DDPG
|
||||
from baselines.ddpg.models import Actor, Critic
|
||||
from baselines.ddpg.memory import Memory
|
||||
from baselines.ddpg.noise import AdaptiveParamNoiseSpec, NormalActionNoise, OrnsteinUhlenbeckActionNoise
|
||||
from baselines.common import set_global_seeds
|
||||
import baselines.common.tf_util as U
|
||||
import numpy as np
|
||||
import tensorflow as tf
|
||||
import tensorflow.contrib as tc
|
||||
|
||||
from baselines import logger
|
||||
import numpy as np
|
||||
|
||||
try:
|
||||
from baselines.common.mpi_adam import MpiAdam
|
||||
import baselines.common.tf_util as U
|
||||
from baselines.common.mpi_running_mean_std import RunningMeanStd
|
||||
from mpi4py import MPI
|
||||
except ImportError:
|
||||
MPI = None
|
||||
|
||||
def learn(network, env,
|
||||
seed=None,
|
||||
total_timesteps=None,
|
||||
nb_epochs=None, # with default settings, perform 1M steps total
|
||||
nb_epoch_cycles=20,
|
||||
nb_rollout_steps=100,
|
||||
reward_scale=1.0,
|
||||
render=False,
|
||||
render_eval=False,
|
||||
noise_type='adaptive-param_0.2',
|
||||
normalize_returns=False,
|
||||
normalize_observations=True,
|
||||
critic_l2_reg=1e-2,
|
||||
actor_lr=1e-4,
|
||||
critic_lr=1e-3,
|
||||
popart=False,
|
||||
gamma=0.99,
|
||||
clip_norm=None,
|
||||
nb_train_steps=50, # per epoch cycle and MPI worker,
|
||||
nb_eval_steps=100,
|
||||
batch_size=64, # per MPI worker
|
||||
tau=0.01,
|
||||
eval_env=None,
|
||||
param_noise_adaption_interval=50,
|
||||
**network_kwargs):
|
||||
|
||||
set_global_seeds(seed)
|
||||
|
||||
if total_timesteps is not None:
|
||||
assert nb_epochs is None
|
||||
nb_epochs = int(total_timesteps) // (nb_epoch_cycles * nb_rollout_steps)
|
||||
else:
|
||||
nb_epochs = 500
|
||||
|
||||
if MPI is not None:
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
else:
|
||||
rank = 0
|
||||
|
||||
nb_actions = env.action_space.shape[-1]
|
||||
assert (np.abs(env.action_space.low) == env.action_space.high).all() # we assume symmetric actions.
|
||||
|
||||
memory = Memory(limit=int(1e6), action_shape=env.action_space.shape, observation_shape=env.observation_space.shape)
|
||||
critic = Critic(network=network, **network_kwargs)
|
||||
actor = Actor(nb_actions, network=network, **network_kwargs)
|
||||
|
||||
action_noise = None
|
||||
param_noise = None
|
||||
if noise_type is not None:
|
||||
for current_noise_type in noise_type.split(','):
|
||||
current_noise_type = current_noise_type.strip()
|
||||
if current_noise_type == 'none':
|
||||
pass
|
||||
elif 'adaptive-param' in current_noise_type:
|
||||
_, stddev = current_noise_type.split('_')
|
||||
param_noise = AdaptiveParamNoiseSpec(initial_stddev=float(stddev), desired_action_stddev=float(stddev))
|
||||
elif 'normal' in current_noise_type:
|
||||
_, stddev = current_noise_type.split('_')
|
||||
action_noise = NormalActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
|
||||
elif 'ou' in current_noise_type:
|
||||
_, stddev = current_noise_type.split('_')
|
||||
action_noise = OrnsteinUhlenbeckActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
|
||||
else:
|
||||
raise RuntimeError('unknown noise type "{}"'.format(current_noise_type))
|
||||
|
||||
max_action = env.action_space.high
|
||||
logger.info('scaling actions by {} before executing in env'.format(max_action))
|
||||
|
||||
agent = DDPG(actor, critic, memory, env.observation_space.shape, env.action_space.shape,
|
||||
gamma=gamma, tau=tau, normalize_returns=normalize_returns, normalize_observations=normalize_observations,
|
||||
batch_size=batch_size, action_noise=action_noise, param_noise=param_noise, critic_l2_reg=critic_l2_reg,
|
||||
actor_lr=actor_lr, critic_lr=critic_lr, enable_popart=popart, clip_norm=clip_norm,
|
||||
reward_scale=reward_scale)
|
||||
logger.info('Using agent with the following configuration:')
|
||||
logger.info(str(agent.__dict__.items()))
|
||||
|
||||
eval_episode_rewards_history = deque(maxlen=100)
|
||||
episode_rewards_history = deque(maxlen=100)
|
||||
sess = U.get_session()
|
||||
# Prepare everything.
|
||||
agent.initialize(sess)
|
||||
sess.graph.finalize()
|
||||
|
||||
agent.reset()
|
||||
|
||||
obs = env.reset()
|
||||
if eval_env is not None:
|
||||
eval_obs = eval_env.reset()
|
||||
nenvs = obs.shape[0]
|
||||
|
||||
episode_reward = np.zeros(nenvs, dtype = np.float32) #vector
|
||||
episode_step = np.zeros(nenvs, dtype = int) # vector
|
||||
episodes = 0 #scalar
|
||||
t = 0 # scalar
|
||||
|
||||
epoch = 0
|
||||
|
||||
|
||||
|
||||
start_time = time.time()
|
||||
|
||||
epoch_episode_rewards = []
|
||||
epoch_episode_steps = []
|
||||
epoch_actions = []
|
||||
epoch_qs = []
|
||||
epoch_episodes = 0
|
||||
for epoch in range(nb_epochs):
|
||||
for cycle in range(nb_epoch_cycles):
|
||||
# Perform rollouts.
|
||||
if nenvs > 1:
|
||||
# if simulating multiple envs in parallel, impossible to reset agent at the end of the episode in each
|
||||
# of the environments, so resetting here instead
|
||||
agent.reset()
|
||||
for t_rollout in range(nb_rollout_steps):
|
||||
# Predict next action.
|
||||
action, q, _, _ = agent.step(obs, apply_noise=True, compute_Q=True)
|
||||
|
||||
# Execute next action.
|
||||
if rank == 0 and render:
|
||||
env.render()
|
||||
|
||||
# max_action is of dimension A, whereas action is dimension (nenvs, A) - the multiplication gets broadcasted to the batch
|
||||
new_obs, r, done, info = env.step(max_action * action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
|
||||
# note these outputs are batched from vecenv
|
||||
|
||||
t += 1
|
||||
if rank == 0 and render:
|
||||
env.render()
|
||||
episode_reward += r
|
||||
episode_step += 1
|
||||
|
||||
# Book-keeping.
|
||||
epoch_actions.append(action)
|
||||
epoch_qs.append(q)
|
||||
agent.store_transition(obs, action, r, new_obs, done) #the batched data will be unrolled in memory.py's append.
|
||||
|
||||
obs = new_obs
|
||||
|
||||
for d in range(len(done)):
|
||||
if done[d]:
|
||||
# Episode done.
|
||||
epoch_episode_rewards.append(episode_reward[d])
|
||||
episode_rewards_history.append(episode_reward[d])
|
||||
epoch_episode_steps.append(episode_step[d])
|
||||
episode_reward[d] = 0.
|
||||
episode_step[d] = 0
|
||||
epoch_episodes += 1
|
||||
episodes += 1
|
||||
if nenvs == 1:
|
||||
agent.reset()
|
||||
|
||||
|
||||
|
||||
# Train.
|
||||
epoch_actor_losses = []
|
||||
epoch_critic_losses = []
|
||||
epoch_adaptive_distances = []
|
||||
for t_train in range(nb_train_steps):
|
||||
# Adapt param noise, if necessary.
|
||||
if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
|
||||
distance = agent.adapt_param_noise()
|
||||
epoch_adaptive_distances.append(distance)
|
||||
|
||||
cl, al = agent.train()
|
||||
epoch_critic_losses.append(cl)
|
||||
epoch_actor_losses.append(al)
|
||||
agent.update_target_net()
|
||||
|
||||
# Evaluate.
|
||||
eval_episode_rewards = []
|
||||
eval_qs = []
|
||||
if eval_env is not None:
|
||||
nenvs_eval = eval_obs.shape[0]
|
||||
eval_episode_reward = np.zeros(nenvs_eval, dtype = np.float32)
|
||||
for t_rollout in range(nb_eval_steps):
|
||||
eval_action, eval_q, _, _ = agent.step(eval_obs, apply_noise=False, compute_Q=True)
|
||||
eval_obs, eval_r, eval_done, eval_info = eval_env.step(max_action * eval_action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
|
||||
if render_eval:
|
||||
eval_env.render()
|
||||
eval_episode_reward += eval_r
|
||||
|
||||
eval_qs.append(eval_q)
|
||||
for d in range(len(eval_done)):
|
||||
if eval_done[d]:
|
||||
eval_episode_rewards.append(eval_episode_reward[d])
|
||||
eval_episode_rewards_history.append(eval_episode_reward[d])
|
||||
eval_episode_reward[d] = 0.0
|
||||
|
||||
if MPI is not None:
|
||||
mpi_size = MPI.COMM_WORLD.Get_size()
|
||||
else:
|
||||
mpi_size = 1
|
||||
|
||||
# Log stats.
|
||||
# XXX shouldn't call np.mean on variable length lists
|
||||
duration = time.time() - start_time
|
||||
stats = agent.get_stats()
|
||||
combined_stats = stats.copy()
|
||||
combined_stats['rollout/return'] = np.mean(epoch_episode_rewards)
|
||||
combined_stats['rollout/return_history'] = np.mean(episode_rewards_history)
|
||||
combined_stats['rollout/episode_steps'] = np.mean(epoch_episode_steps)
|
||||
combined_stats['rollout/actions_mean'] = np.mean(epoch_actions)
|
||||
combined_stats['rollout/Q_mean'] = np.mean(epoch_qs)
|
||||
combined_stats['train/loss_actor'] = np.mean(epoch_actor_losses)
|
||||
combined_stats['train/loss_critic'] = np.mean(epoch_critic_losses)
|
||||
combined_stats['train/param_noise_distance'] = np.mean(epoch_adaptive_distances)
|
||||
combined_stats['total/duration'] = duration
|
||||
combined_stats['total/steps_per_second'] = float(t) / float(duration)
|
||||
combined_stats['total/episodes'] = episodes
|
||||
combined_stats['rollout/episodes'] = epoch_episodes
|
||||
combined_stats['rollout/actions_std'] = np.std(epoch_actions)
|
||||
# Evaluation statistics.
|
||||
if eval_env is not None:
|
||||
combined_stats['eval/return'] = eval_episode_rewards
|
||||
combined_stats['eval/return_history'] = np.mean(eval_episode_rewards_history)
|
||||
combined_stats['eval/Q'] = eval_qs
|
||||
combined_stats['eval/episodes'] = len(eval_episode_rewards)
|
||||
def as_scalar(x):
|
||||
if isinstance(x, np.ndarray):
|
||||
assert x.size == 1
|
||||
return x[0]
|
||||
elif np.isscalar(x):
|
||||
def normalize(x, stats):
|
||||
if stats is None:
|
||||
return x
|
||||
return (x - stats.mean) / stats.std
|
||||
|
||||
|
||||
def denormalize(x, stats):
|
||||
if stats is None:
|
||||
return x
|
||||
return x * stats.std + stats.mean
|
||||
|
||||
def reduce_std(x, axis=None, keepdims=False):
|
||||
return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
|
||||
|
||||
def reduce_var(x, axis=None, keepdims=False):
|
||||
m = tf.reduce_mean(x, axis=axis, keepdims=True)
|
||||
devs_squared = tf.square(x - m)
|
||||
return tf.reduce_mean(devs_squared, axis=axis, keepdims=keepdims)
|
||||
|
||||
def get_target_updates(vars, target_vars, tau):
|
||||
logger.info('setting up target updates ...')
|
||||
soft_updates = []
|
||||
init_updates = []
|
||||
assert len(vars) == len(target_vars)
|
||||
for var, target_var in zip(vars, target_vars):
|
||||
logger.info(' {} <- {}'.format(target_var.name, var.name))
|
||||
init_updates.append(tf.assign(target_var, var))
|
||||
soft_updates.append(tf.assign(target_var, (1. - tau) * target_var + tau * var))
|
||||
assert len(init_updates) == len(vars)
|
||||
assert len(soft_updates) == len(vars)
|
||||
return tf.group(*init_updates), tf.group(*soft_updates)
|
||||
|
||||
|
||||
def get_perturbed_actor_updates(actor, perturbed_actor, param_noise_stddev):
|
||||
assert len(actor.vars) == len(perturbed_actor.vars)
|
||||
assert len(actor.perturbable_vars) == len(perturbed_actor.perturbable_vars)
|
||||
|
||||
updates = []
|
||||
for var, perturbed_var in zip(actor.vars, perturbed_actor.vars):
|
||||
if var in actor.perturbable_vars:
|
||||
logger.info(' {} <- {} + noise'.format(perturbed_var.name, var.name))
|
||||
updates.append(tf.assign(perturbed_var, var + tf.random_normal(tf.shape(var), mean=0., stddev=param_noise_stddev)))
|
||||
else:
|
||||
raise ValueError('expected scalar, got %s'%x)
|
||||
|
||||
combined_stats_sums = np.array([ np.array(x).flatten()[0] for x in combined_stats.values()])
|
||||
if MPI is not None:
|
||||
combined_stats_sums = MPI.COMM_WORLD.allreduce(combined_stats_sums)
|
||||
|
||||
combined_stats = {k : v / mpi_size for (k,v) in zip(combined_stats.keys(), combined_stats_sums)}
|
||||
|
||||
# Total statistics.
|
||||
combined_stats['total/epochs'] = epoch + 1
|
||||
combined_stats['total/steps'] = t
|
||||
|
||||
for key in sorted(combined_stats.keys()):
|
||||
logger.record_tabular(key, combined_stats[key])
|
||||
|
||||
if rank == 0:
|
||||
logger.dump_tabular()
|
||||
logger.info('')
|
||||
logdir = logger.get_dir()
|
||||
if rank == 0 and logdir:
|
||||
if hasattr(env, 'get_state'):
|
||||
with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:
|
||||
pickle.dump(env.get_state(), f)
|
||||
if eval_env and hasattr(eval_env, 'get_state'):
|
||||
with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
|
||||
pickle.dump(eval_env.get_state(), f)
|
||||
logger.info(' {} <- {}'.format(perturbed_var.name, var.name))
|
||||
updates.append(tf.assign(perturbed_var, var))
|
||||
assert len(updates) == len(actor.vars)
|
||||
return tf.group(*updates)
|
||||
|
||||
|
||||
return agent
|
||||
class DDPG(object):
|
||||
def __init__(self, actor, critic, memory, observation_shape, action_shape, param_noise=None, action_noise=None,
|
||||
gamma=0.99, tau=0.001, normalize_returns=False, enable_popart=False, normalize_observations=True,
|
||||
batch_size=128, observation_range=(-5., 5.), action_range=(-1., 1.), return_range=(-np.inf, np.inf),
|
||||
adaptive_param_noise=True, adaptive_param_noise_policy_threshold=.1,
|
||||
critic_l2_reg=0., actor_lr=1e-4, critic_lr=1e-3, clip_norm=None, reward_scale=1.):
|
||||
# Inputs.
|
||||
self.obs0 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs0')
|
||||
self.obs1 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs1')
|
||||
self.terminals1 = tf.placeholder(tf.float32, shape=(None, 1), name='terminals1')
|
||||
self.rewards = tf.placeholder(tf.float32, shape=(None, 1), name='rewards')
|
||||
self.actions = tf.placeholder(tf.float32, shape=(None,) + action_shape, name='actions')
|
||||
self.critic_target = tf.placeholder(tf.float32, shape=(None, 1), name='critic_target')
|
||||
self.param_noise_stddev = tf.placeholder(tf.float32, shape=(), name='param_noise_stddev')
|
||||
|
||||
# Parameters.
|
||||
self.gamma = gamma
|
||||
self.tau = tau
|
||||
self.memory = memory
|
||||
self.normalize_observations = normalize_observations
|
||||
self.normalize_returns = normalize_returns
|
||||
self.action_noise = action_noise
|
||||
self.param_noise = param_noise
|
||||
self.action_range = action_range
|
||||
self.return_range = return_range
|
||||
self.observation_range = observation_range
|
||||
self.critic = critic
|
||||
self.actor = actor
|
||||
self.actor_lr = actor_lr
|
||||
self.critic_lr = critic_lr
|
||||
self.clip_norm = clip_norm
|
||||
self.enable_popart = enable_popart
|
||||
self.reward_scale = reward_scale
|
||||
self.batch_size = batch_size
|
||||
self.stats_sample = None
|
||||
self.critic_l2_reg = critic_l2_reg
|
||||
|
||||
# Observation normalization.
|
||||
if self.normalize_observations:
|
||||
with tf.variable_scope('obs_rms'):
|
||||
self.obs_rms = RunningMeanStd(shape=observation_shape)
|
||||
else:
|
||||
self.obs_rms = None
|
||||
normalized_obs0 = tf.clip_by_value(normalize(self.obs0, self.obs_rms),
|
||||
self.observation_range[0], self.observation_range[1])
|
||||
normalized_obs1 = tf.clip_by_value(normalize(self.obs1, self.obs_rms),
|
||||
self.observation_range[0], self.observation_range[1])
|
||||
|
||||
# Return normalization.
|
||||
if self.normalize_returns:
|
||||
with tf.variable_scope('ret_rms'):
|
||||
self.ret_rms = RunningMeanStd()
|
||||
else:
|
||||
self.ret_rms = None
|
||||
|
||||
# Create target networks.
|
||||
target_actor = copy(actor)
|
||||
target_actor.name = 'target_actor'
|
||||
self.target_actor = target_actor
|
||||
target_critic = copy(critic)
|
||||
target_critic.name = 'target_critic'
|
||||
self.target_critic = target_critic
|
||||
|
||||
# Create networks and core TF parts that are shared across setup parts.
|
||||
self.actor_tf = actor(normalized_obs0)
|
||||
self.normalized_critic_tf = critic(normalized_obs0, self.actions)
|
||||
self.critic_tf = denormalize(tf.clip_by_value(self.normalized_critic_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
|
||||
self.normalized_critic_with_actor_tf = critic(normalized_obs0, self.actor_tf, reuse=True)
|
||||
self.critic_with_actor_tf = denormalize(tf.clip_by_value(self.normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
|
||||
Q_obs1 = denormalize(target_critic(normalized_obs1, target_actor(normalized_obs1)), self.ret_rms)
|
||||
self.target_Q = self.rewards + (1. - self.terminals1) * gamma * Q_obs1
|
||||
|
||||
# Set up parts.
|
||||
if self.param_noise is not None:
|
||||
self.setup_param_noise(normalized_obs0)
|
||||
self.setup_actor_optimizer()
|
||||
self.setup_critic_optimizer()
|
||||
if self.normalize_returns and self.enable_popart:
|
||||
self.setup_popart()
|
||||
self.setup_stats()
|
||||
self.setup_target_network_updates()
|
||||
|
||||
def setup_target_network_updates(self):
|
||||
actor_init_updates, actor_soft_updates = get_target_updates(self.actor.vars, self.target_actor.vars, self.tau)
|
||||
critic_init_updates, critic_soft_updates = get_target_updates(self.critic.vars, self.target_critic.vars, self.tau)
|
||||
self.target_init_updates = [actor_init_updates, critic_init_updates]
|
||||
self.target_soft_updates = [actor_soft_updates, critic_soft_updates]
|
||||
|
||||
def setup_param_noise(self, normalized_obs0):
|
||||
assert self.param_noise is not None
|
||||
|
||||
# Configure perturbed actor.
|
||||
param_noise_actor = copy(self.actor)
|
||||
param_noise_actor.name = 'param_noise_actor'
|
||||
self.perturbed_actor_tf = param_noise_actor(normalized_obs0)
|
||||
logger.info('setting up param noise')
|
||||
self.perturb_policy_ops = get_perturbed_actor_updates(self.actor, param_noise_actor, self.param_noise_stddev)
|
||||
|
||||
# Configure separate copy for stddev adoption.
|
||||
adaptive_param_noise_actor = copy(self.actor)
|
||||
adaptive_param_noise_actor.name = 'adaptive_param_noise_actor'
|
||||
adaptive_actor_tf = adaptive_param_noise_actor(normalized_obs0)
|
||||
self.perturb_adaptive_policy_ops = get_perturbed_actor_updates(self.actor, adaptive_param_noise_actor, self.param_noise_stddev)
|
||||
self.adaptive_policy_distance = tf.sqrt(tf.reduce_mean(tf.square(self.actor_tf - adaptive_actor_tf)))
|
||||
|
||||
def setup_actor_optimizer(self):
|
||||
logger.info('setting up actor optimizer')
|
||||
self.actor_loss = -tf.reduce_mean(self.critic_with_actor_tf)
|
||||
actor_shapes = [var.get_shape().as_list() for var in self.actor.trainable_vars]
|
||||
actor_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in actor_shapes])
|
||||
logger.info(' actor shapes: {}'.format(actor_shapes))
|
||||
logger.info(' actor params: {}'.format(actor_nb_params))
|
||||
self.actor_grads = U.flatgrad(self.actor_loss, self.actor.trainable_vars, clip_norm=self.clip_norm)
|
||||
self.actor_optimizer = MpiAdam(var_list=self.actor.trainable_vars,
|
||||
beta1=0.9, beta2=0.999, epsilon=1e-08)
|
||||
|
||||
def setup_critic_optimizer(self):
|
||||
logger.info('setting up critic optimizer')
|
||||
normalized_critic_target_tf = tf.clip_by_value(normalize(self.critic_target, self.ret_rms), self.return_range[0], self.return_range[1])
|
||||
self.critic_loss = tf.reduce_mean(tf.square(self.normalized_critic_tf - normalized_critic_target_tf))
|
||||
if self.critic_l2_reg > 0.:
|
||||
critic_reg_vars = [var for var in self.critic.trainable_vars if 'kernel' in var.name and 'output' not in var.name]
|
||||
for var in critic_reg_vars:
|
||||
logger.info(' regularizing: {}'.format(var.name))
|
||||
logger.info(' applying l2 regularization with {}'.format(self.critic_l2_reg))
|
||||
critic_reg = tc.layers.apply_regularization(
|
||||
tc.layers.l2_regularizer(self.critic_l2_reg),
|
||||
weights_list=critic_reg_vars
|
||||
)
|
||||
self.critic_loss += critic_reg
|
||||
critic_shapes = [var.get_shape().as_list() for var in self.critic.trainable_vars]
|
||||
critic_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in critic_shapes])
|
||||
logger.info(' critic shapes: {}'.format(critic_shapes))
|
||||
logger.info(' critic params: {}'.format(critic_nb_params))
|
||||
self.critic_grads = U.flatgrad(self.critic_loss, self.critic.trainable_vars, clip_norm=self.clip_norm)
|
||||
self.critic_optimizer = MpiAdam(var_list=self.critic.trainable_vars,
|
||||
beta1=0.9, beta2=0.999, epsilon=1e-08)
|
||||
|
||||
def setup_popart(self):
|
||||
# See https://arxiv.org/pdf/1602.07714.pdf for details.
|
||||
self.old_std = tf.placeholder(tf.float32, shape=[1], name='old_std')
|
||||
new_std = self.ret_rms.std
|
||||
self.old_mean = tf.placeholder(tf.float32, shape=[1], name='old_mean')
|
||||
new_mean = self.ret_rms.mean
|
||||
|
||||
self.renormalize_Q_outputs_op = []
|
||||
for vs in [self.critic.output_vars, self.target_critic.output_vars]:
|
||||
assert len(vs) == 2
|
||||
M, b = vs
|
||||
assert 'kernel' in M.name
|
||||
assert 'bias' in b.name
|
||||
assert M.get_shape()[-1] == 1
|
||||
assert b.get_shape()[-1] == 1
|
||||
self.renormalize_Q_outputs_op += [M.assign(M * self.old_std / new_std)]
|
||||
self.renormalize_Q_outputs_op += [b.assign((b * self.old_std + self.old_mean - new_mean) / new_std)]
|
||||
|
||||
def setup_stats(self):
|
||||
ops = []
|
||||
names = []
|
||||
|
||||
if self.normalize_returns:
|
||||
ops += [self.ret_rms.mean, self.ret_rms.std]
|
||||
names += ['ret_rms_mean', 'ret_rms_std']
|
||||
|
||||
if self.normalize_observations:
|
||||
ops += [tf.reduce_mean(self.obs_rms.mean), tf.reduce_mean(self.obs_rms.std)]
|
||||
names += ['obs_rms_mean', 'obs_rms_std']
|
||||
|
||||
ops += [tf.reduce_mean(self.critic_tf)]
|
||||
names += ['reference_Q_mean']
|
||||
ops += [reduce_std(self.critic_tf)]
|
||||
names += ['reference_Q_std']
|
||||
|
||||
ops += [tf.reduce_mean(self.critic_with_actor_tf)]
|
||||
names += ['reference_actor_Q_mean']
|
||||
ops += [reduce_std(self.critic_with_actor_tf)]
|
||||
names += ['reference_actor_Q_std']
|
||||
|
||||
ops += [tf.reduce_mean(self.actor_tf)]
|
||||
names += ['reference_action_mean']
|
||||
ops += [reduce_std(self.actor_tf)]
|
||||
names += ['reference_action_std']
|
||||
|
||||
if self.param_noise:
|
||||
ops += [tf.reduce_mean(self.perturbed_actor_tf)]
|
||||
names += ['reference_perturbed_action_mean']
|
||||
ops += [reduce_std(self.perturbed_actor_tf)]
|
||||
names += ['reference_perturbed_action_std']
|
||||
|
||||
self.stats_ops = ops
|
||||
self.stats_names = names
|
||||
|
||||
def pi(self, obs, apply_noise=True, compute_Q=True):
|
||||
if self.param_noise is not None and apply_noise:
|
||||
actor_tf = self.perturbed_actor_tf
|
||||
else:
|
||||
actor_tf = self.actor_tf
|
||||
feed_dict = {self.obs0: [obs]}
|
||||
if compute_Q:
|
||||
action, q = self.sess.run([actor_tf, self.critic_with_actor_tf], feed_dict=feed_dict)
|
||||
else:
|
||||
action = self.sess.run(actor_tf, feed_dict=feed_dict)
|
||||
q = None
|
||||
action = action.flatten()
|
||||
if self.action_noise is not None and apply_noise:
|
||||
noise = self.action_noise()
|
||||
assert noise.shape == action.shape
|
||||
action += noise
|
||||
action = np.clip(action, self.action_range[0], self.action_range[1])
|
||||
return action, q
|
||||
|
||||
def store_transition(self, obs0, action, reward, obs1, terminal1):
|
||||
reward *= self.reward_scale
|
||||
self.memory.append(obs0, action, reward, obs1, terminal1)
|
||||
if self.normalize_observations:
|
||||
self.obs_rms.update(np.array([obs0]))
|
||||
|
||||
def train(self):
|
||||
# Get a batch.
|
||||
batch = self.memory.sample(batch_size=self.batch_size)
|
||||
|
||||
if self.normalize_returns and self.enable_popart:
|
||||
old_mean, old_std, target_Q = self.sess.run([self.ret_rms.mean, self.ret_rms.std, self.target_Q], feed_dict={
|
||||
self.obs1: batch['obs1'],
|
||||
self.rewards: batch['rewards'],
|
||||
self.terminals1: batch['terminals1'].astype('float32'),
|
||||
})
|
||||
self.ret_rms.update(target_Q.flatten())
|
||||
self.sess.run(self.renormalize_Q_outputs_op, feed_dict={
|
||||
self.old_std : np.array([old_std]),
|
||||
self.old_mean : np.array([old_mean]),
|
||||
})
|
||||
|
||||
# Run sanity check. Disabled by default since it slows down things considerably.
|
||||
# print('running sanity check')
|
||||
# target_Q_new, new_mean, new_std = self.sess.run([self.target_Q, self.ret_rms.mean, self.ret_rms.std], feed_dict={
|
||||
# self.obs1: batch['obs1'],
|
||||
# self.rewards: batch['rewards'],
|
||||
# self.terminals1: batch['terminals1'].astype('float32'),
|
||||
# })
|
||||
# print(target_Q_new, target_Q, new_mean, new_std)
|
||||
# assert (np.abs(target_Q - target_Q_new) < 1e-3).all()
|
||||
else:
|
||||
target_Q = self.sess.run(self.target_Q, feed_dict={
|
||||
self.obs1: batch['obs1'],
|
||||
self.rewards: batch['rewards'],
|
||||
self.terminals1: batch['terminals1'].astype('float32'),
|
||||
})
|
||||
|
||||
# Get all gradients and perform a synced update.
|
||||
ops = [self.actor_grads, self.actor_loss, self.critic_grads, self.critic_loss]
|
||||
actor_grads, actor_loss, critic_grads, critic_loss = self.sess.run(ops, feed_dict={
|
||||
self.obs0: batch['obs0'],
|
||||
self.actions: batch['actions'],
|
||||
self.critic_target: target_Q,
|
||||
})
|
||||
self.actor_optimizer.update(actor_grads, stepsize=self.actor_lr)
|
||||
self.critic_optimizer.update(critic_grads, stepsize=self.critic_lr)
|
||||
|
||||
return critic_loss, actor_loss
|
||||
|
||||
def initialize(self, sess):
|
||||
self.sess = sess
|
||||
self.sess.run(tf.global_variables_initializer())
|
||||
self.actor_optimizer.sync()
|
||||
self.critic_optimizer.sync()
|
||||
self.sess.run(self.target_init_updates)
|
||||
|
||||
def update_target_net(self):
|
||||
self.sess.run(self.target_soft_updates)
|
||||
|
||||
def get_stats(self):
|
||||
if self.stats_sample is None:
|
||||
# Get a sample and keep that fixed for all further computations.
|
||||
# This allows us to estimate the change in value for the same set of inputs.
|
||||
self.stats_sample = self.memory.sample(batch_size=self.batch_size)
|
||||
values = self.sess.run(self.stats_ops, feed_dict={
|
||||
self.obs0: self.stats_sample['obs0'],
|
||||
self.actions: self.stats_sample['actions'],
|
||||
})
|
||||
|
||||
names = self.stats_names[:]
|
||||
assert len(names) == len(values)
|
||||
stats = dict(zip(names, values))
|
||||
|
||||
if self.param_noise is not None:
|
||||
stats = {**stats, **self.param_noise.get_stats()}
|
||||
|
||||
return stats
|
||||
|
||||
def adapt_param_noise(self):
|
||||
if self.param_noise is None:
|
||||
return 0.
|
||||
|
||||
# Perturb a separate copy of the policy to adjust the scale for the next "real" perturbation.
|
||||
batch = self.memory.sample(batch_size=self.batch_size)
|
||||
self.sess.run(self.perturb_adaptive_policy_ops, feed_dict={
|
||||
self.param_noise_stddev: self.param_noise.current_stddev,
|
||||
})
|
||||
distance = self.sess.run(self.adaptive_policy_distance, feed_dict={
|
||||
self.obs0: batch['obs0'],
|
||||
self.param_noise_stddev: self.param_noise.current_stddev,
|
||||
})
|
||||
|
||||
mean_distance = MPI.COMM_WORLD.allreduce(distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
|
||||
self.param_noise.adapt(mean_distance)
|
||||
return mean_distance
|
||||
|
||||
def reset(self):
|
||||
# Reset internal state after an episode is complete.
|
||||
if self.action_noise is not None:
|
||||
self.action_noise.reset()
|
||||
if self.param_noise is not None:
|
||||
self.sess.run(self.perturb_policy_ops, feed_dict={
|
||||
self.param_noise_stddev: self.param_noise.current_stddev,
|
||||
})
|
||||
|
@@ -1,402 +0,0 @@
|
||||
from copy import copy
|
||||
from functools import reduce
|
||||
|
||||
import numpy as np
|
||||
import tensorflow as tf
|
||||
import tensorflow.contrib as tc
|
||||
|
||||
from baselines import logger
|
||||
from baselines.common.mpi_adam import MpiAdam
|
||||
import baselines.common.tf_util as U
|
||||
from baselines.common.mpi_running_mean_std import RunningMeanStd
|
||||
try:
|
||||
from mpi4py import MPI
|
||||
except ImportError:
|
||||
MPI = None
|
||||
|
||||
def normalize(x, stats):
|
||||
if stats is None:
|
||||
return x
|
||||
return (x - stats.mean) / stats.std
|
||||
|
||||
|
||||
def denormalize(x, stats):
|
||||
if stats is None:
|
||||
return x
|
||||
return x * stats.std + stats.mean
|
||||
|
||||
def reduce_std(x, axis=None, keepdims=False):
|
||||
return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
|
||||
|
||||
def reduce_var(x, axis=None, keepdims=False):
|
||||
m = tf.reduce_mean(x, axis=axis, keepdims=True)
|
||||
devs_squared = tf.square(x - m)
|
||||
return tf.reduce_mean(devs_squared, axis=axis, keepdims=keepdims)
|
||||
|
||||
def get_target_updates(vars, target_vars, tau):
|
||||
logger.info('setting up target updates ...')
|
||||
soft_updates = []
|
||||
init_updates = []
|
||||
assert len(vars) == len(target_vars)
|
||||
for var, target_var in zip(vars, target_vars):
|
||||
logger.info(' {} <- {}'.format(target_var.name, var.name))
|
||||
init_updates.append(tf.assign(target_var, var))
|
||||
soft_updates.append(tf.assign(target_var, (1. - tau) * target_var + tau * var))
|
||||
assert len(init_updates) == len(vars)
|
||||
assert len(soft_updates) == len(vars)
|
||||
return tf.group(*init_updates), tf.group(*soft_updates)
|
||||
|
||||
|
||||
def get_perturbed_actor_updates(actor, perturbed_actor, param_noise_stddev):
|
||||
assert len(actor.vars) == len(perturbed_actor.vars)
|
||||
assert len(actor.perturbable_vars) == len(perturbed_actor.perturbable_vars)
|
||||
|
||||
updates = []
|
||||
for var, perturbed_var in zip(actor.vars, perturbed_actor.vars):
|
||||
if var in actor.perturbable_vars:
|
||||
logger.info(' {} <- {} + noise'.format(perturbed_var.name, var.name))
|
||||
updates.append(tf.assign(perturbed_var, var + tf.random_normal(tf.shape(var), mean=0., stddev=param_noise_stddev)))
|
||||
else:
|
||||
logger.info(' {} <- {}'.format(perturbed_var.name, var.name))
|
||||
updates.append(tf.assign(perturbed_var, var))
|
||||
assert len(updates) == len(actor.vars)
|
||||
return tf.group(*updates)
|
||||
|
||||
|
||||
class DDPG(object):
|
||||
def __init__(self, actor, critic, memory, observation_shape, action_shape, param_noise=None, action_noise=None,
|
||||
gamma=0.99, tau=0.001, normalize_returns=False, enable_popart=False, normalize_observations=True,
|
||||
batch_size=128, observation_range=(-5., 5.), action_range=(-1., 1.), return_range=(-np.inf, np.inf),
|
||||
adaptive_param_noise=True, adaptive_param_noise_policy_threshold=.1,
|
||||
critic_l2_reg=0., actor_lr=1e-4, critic_lr=1e-3, clip_norm=None, reward_scale=1.):
|
||||
# Inputs.
|
||||
self.obs0 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs0')
|
||||
self.obs1 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs1')
|
||||
self.terminals1 = tf.placeholder(tf.float32, shape=(None, 1), name='terminals1')
|
||||
self.rewards = tf.placeholder(tf.float32, shape=(None, 1), name='rewards')
|
||||
self.actions = tf.placeholder(tf.float32, shape=(None,) + action_shape, name='actions')
|
||||
self.critic_target = tf.placeholder(tf.float32, shape=(None, 1), name='critic_target')
|
||||
self.param_noise_stddev = tf.placeholder(tf.float32, shape=(), name='param_noise_stddev')
|
||||
|
||||
# Parameters.
|
||||
self.gamma = gamma
|
||||
self.tau = tau
|
||||
self.memory = memory
|
||||
self.normalize_observations = normalize_observations
|
||||
self.normalize_returns = normalize_returns
|
||||
self.action_noise = action_noise
|
||||
self.param_noise = param_noise
|
||||
self.action_range = action_range
|
||||
self.return_range = return_range
|
||||
self.observation_range = observation_range
|
||||
self.critic = critic
|
||||
self.actor = actor
|
||||
self.actor_lr = actor_lr
|
||||
self.critic_lr = critic_lr
|
||||
self.clip_norm = clip_norm
|
||||
self.enable_popart = enable_popart
|
||||
self.reward_scale = reward_scale
|
||||
self.batch_size = batch_size
|
||||
self.stats_sample = None
|
||||
self.critic_l2_reg = critic_l2_reg
|
||||
|
||||
# Observation normalization.
|
||||
if self.normalize_observations:
|
||||
with tf.variable_scope('obs_rms'):
|
||||
self.obs_rms = RunningMeanStd(shape=observation_shape)
|
||||
else:
|
||||
self.obs_rms = None
|
||||
normalized_obs0 = tf.clip_by_value(normalize(self.obs0, self.obs_rms),
|
||||
self.observation_range[0], self.observation_range[1])
|
||||
normalized_obs1 = tf.clip_by_value(normalize(self.obs1, self.obs_rms),
|
||||
self.observation_range[0], self.observation_range[1])
|
||||
|
||||
# Return normalization.
|
||||
if self.normalize_returns:
|
||||
with tf.variable_scope('ret_rms'):
|
||||
self.ret_rms = RunningMeanStd()
|
||||
else:
|
||||
self.ret_rms = None
|
||||
|
||||
# Create target networks.
|
||||
target_actor = copy(actor)
|
||||
target_actor.name = 'target_actor'
|
||||
self.target_actor = target_actor
|
||||
target_critic = copy(critic)
|
||||
target_critic.name = 'target_critic'
|
||||
self.target_critic = target_critic
|
||||
|
||||
# Create networks and core TF parts that are shared across setup parts.
|
||||
self.actor_tf = actor(normalized_obs0)
|
||||
self.normalized_critic_tf = critic(normalized_obs0, self.actions)
|
||||
self.critic_tf = denormalize(tf.clip_by_value(self.normalized_critic_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
|
||||
self.normalized_critic_with_actor_tf = critic(normalized_obs0, self.actor_tf, reuse=True)
|
||||
self.critic_with_actor_tf = denormalize(tf.clip_by_value(self.normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
|
||||
Q_obs1 = denormalize(target_critic(normalized_obs1, target_actor(normalized_obs1)), self.ret_rms)
|
||||
self.target_Q = self.rewards + (1. - self.terminals1) * gamma * Q_obs1
|
||||
|
||||
# Set up parts.
|
||||
if self.param_noise is not None:
|
||||
self.setup_param_noise(normalized_obs0)
|
||||
self.setup_actor_optimizer()
|
||||
self.setup_critic_optimizer()
|
||||
if self.normalize_returns and self.enable_popart:
|
||||
self.setup_popart()
|
||||
self.setup_stats()
|
||||
self.setup_target_network_updates()
|
||||
|
||||
self.initial_state = None # recurrent architectures not supported yet
|
||||
|
||||
def setup_target_network_updates(self):
|
||||
actor_init_updates, actor_soft_updates = get_target_updates(self.actor.vars, self.target_actor.vars, self.tau)
|
||||
critic_init_updates, critic_soft_updates = get_target_updates(self.critic.vars, self.target_critic.vars, self.tau)
|
||||
self.target_init_updates = [actor_init_updates, critic_init_updates]
|
||||
self.target_soft_updates = [actor_soft_updates, critic_soft_updates]
|
||||
|
||||
def setup_param_noise(self, normalized_obs0):
|
||||
assert self.param_noise is not None
|
||||
|
||||
# Configure perturbed actor.
|
||||
param_noise_actor = copy(self.actor)
|
||||
param_noise_actor.name = 'param_noise_actor'
|
||||
self.perturbed_actor_tf = param_noise_actor(normalized_obs0)
|
||||
logger.info('setting up param noise')
|
||||
self.perturb_policy_ops = get_perturbed_actor_updates(self.actor, param_noise_actor, self.param_noise_stddev)
|
||||
|
||||
# Configure separate copy for stddev adoption.
|
||||
adaptive_param_noise_actor = copy(self.actor)
|
||||
adaptive_param_noise_actor.name = 'adaptive_param_noise_actor'
|
||||
adaptive_actor_tf = adaptive_param_noise_actor(normalized_obs0)
|
||||
self.perturb_adaptive_policy_ops = get_perturbed_actor_updates(self.actor, adaptive_param_noise_actor, self.param_noise_stddev)
|
||||
self.adaptive_policy_distance = tf.sqrt(tf.reduce_mean(tf.square(self.actor_tf - adaptive_actor_tf)))
|
||||
|
||||
def setup_actor_optimizer(self):
|
||||
logger.info('setting up actor optimizer')
|
||||
self.actor_loss = -tf.reduce_mean(self.critic_with_actor_tf)
|
||||
actor_shapes = [var.get_shape().as_list() for var in self.actor.trainable_vars]
|
||||
actor_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in actor_shapes])
|
||||
logger.info(' actor shapes: {}'.format(actor_shapes))
|
||||
logger.info(' actor params: {}'.format(actor_nb_params))
|
||||
self.actor_grads = U.flatgrad(self.actor_loss, self.actor.trainable_vars, clip_norm=self.clip_norm)
|
||||
self.actor_optimizer = MpiAdam(var_list=self.actor.trainable_vars,
|
||||
beta1=0.9, beta2=0.999, epsilon=1e-08)
|
||||
|
||||
def setup_critic_optimizer(self):
|
||||
logger.info('setting up critic optimizer')
|
||||
normalized_critic_target_tf = tf.clip_by_value(normalize(self.critic_target, self.ret_rms), self.return_range[0], self.return_range[1])
|
||||
self.critic_loss = tf.reduce_mean(tf.square(self.normalized_critic_tf - normalized_critic_target_tf))
|
||||
if self.critic_l2_reg > 0.:
|
||||
critic_reg_vars = [var for var in self.critic.trainable_vars if 'kernel' in var.name and 'output' not in var.name]
|
||||
for var in critic_reg_vars:
|
||||
logger.info(' regularizing: {}'.format(var.name))
|
||||
logger.info(' applying l2 regularization with {}'.format(self.critic_l2_reg))
|
||||
critic_reg = tc.layers.apply_regularization(
|
||||
tc.layers.l2_regularizer(self.critic_l2_reg),
|
||||
weights_list=critic_reg_vars
|
||||
)
|
||||
self.critic_loss += critic_reg
|
||||
critic_shapes = [var.get_shape().as_list() for var in self.critic.trainable_vars]
|
||||
critic_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in critic_shapes])
|
||||
logger.info(' critic shapes: {}'.format(critic_shapes))
|
||||
logger.info(' critic params: {}'.format(critic_nb_params))
|
||||
self.critic_grads = U.flatgrad(self.critic_loss, self.critic.trainable_vars, clip_norm=self.clip_norm)
|
||||
self.critic_optimizer = MpiAdam(var_list=self.critic.trainable_vars,
|
||||
beta1=0.9, beta2=0.999, epsilon=1e-08)
|
||||
|
||||
def setup_popart(self):
|
||||
# See https://arxiv.org/pdf/1602.07714.pdf for details.
|
||||
self.old_std = tf.placeholder(tf.float32, shape=[1], name='old_std')
|
||||
new_std = self.ret_rms.std
|
||||
self.old_mean = tf.placeholder(tf.float32, shape=[1], name='old_mean')
|
||||
new_mean = self.ret_rms.mean
|
||||
|
||||
self.renormalize_Q_outputs_op = []
|
||||
for vs in [self.critic.output_vars, self.target_critic.output_vars]:
|
||||
assert len(vs) == 2
|
||||
M, b = vs
|
||||
assert 'kernel' in M.name
|
||||
assert 'bias' in b.name
|
||||
assert M.get_shape()[-1] == 1
|
||||
assert b.get_shape()[-1] == 1
|
||||
self.renormalize_Q_outputs_op += [M.assign(M * self.old_std / new_std)]
|
||||
self.renormalize_Q_outputs_op += [b.assign((b * self.old_std + self.old_mean - new_mean) / new_std)]
|
||||
|
||||
def setup_stats(self):
|
||||
ops = []
|
||||
names = []
|
||||
|
||||
if self.normalize_returns:
|
||||
ops += [self.ret_rms.mean, self.ret_rms.std]
|
||||
names += ['ret_rms_mean', 'ret_rms_std']
|
||||
|
||||
if self.normalize_observations:
|
||||
ops += [tf.reduce_mean(self.obs_rms.mean), tf.reduce_mean(self.obs_rms.std)]
|
||||
names += ['obs_rms_mean', 'obs_rms_std']
|
||||
|
||||
ops += [tf.reduce_mean(self.critic_tf)]
|
||||
names += ['reference_Q_mean']
|
||||
ops += [reduce_std(self.critic_tf)]
|
||||
names += ['reference_Q_std']
|
||||
|
||||
ops += [tf.reduce_mean(self.critic_with_actor_tf)]
|
||||
names += ['reference_actor_Q_mean']
|
||||
ops += [reduce_std(self.critic_with_actor_tf)]
|
||||
names += ['reference_actor_Q_std']
|
||||
|
||||
ops += [tf.reduce_mean(self.actor_tf)]
|
||||
names += ['reference_action_mean']
|
||||
ops += [reduce_std(self.actor_tf)]
|
||||
names += ['reference_action_std']
|
||||
|
||||
if self.param_noise:
|
||||
ops += [tf.reduce_mean(self.perturbed_actor_tf)]
|
||||
names += ['reference_perturbed_action_mean']
|
||||
ops += [reduce_std(self.perturbed_actor_tf)]
|
||||
names += ['reference_perturbed_action_std']
|
||||
|
||||
self.stats_ops = ops
|
||||
self.stats_names = names
|
||||
|
||||
def step(self, obs, apply_noise=True, compute_Q=True):
|
||||
if self.param_noise is not None and apply_noise:
|
||||
actor_tf = self.perturbed_actor_tf
|
||||
else:
|
||||
actor_tf = self.actor_tf
|
||||
feed_dict = {self.obs0: U.adjust_shape(self.obs0, [obs])}
|
||||
if compute_Q:
|
||||
action, q = self.sess.run([actor_tf, self.critic_with_actor_tf], feed_dict=feed_dict)
|
||||
else:
|
||||
action = self.sess.run(actor_tf, feed_dict=feed_dict)
|
||||
q = None
|
||||
|
||||
if self.action_noise is not None and apply_noise:
|
||||
noise = self.action_noise()
|
||||
assert noise.shape == action[0].shape
|
||||
action += noise
|
||||
action = np.clip(action, self.action_range[0], self.action_range[1])
|
||||
|
||||
|
||||
return action, q, None, None
|
||||
|
||||
def store_transition(self, obs0, action, reward, obs1, terminal1):
|
||||
reward *= self.reward_scale
|
||||
|
||||
B = obs0.shape[0]
|
||||
for b in range(B):
|
||||
self.memory.append(obs0[b], action[b], reward[b], obs1[b], terminal1[b])
|
||||
if self.normalize_observations:
|
||||
self.obs_rms.update(np.array([obs0[b]]))
|
||||
|
||||
def train(self):
|
||||
# Get a batch.
|
||||
batch = self.memory.sample(batch_size=self.batch_size)
|
||||
|
||||
if self.normalize_returns and self.enable_popart:
|
||||
old_mean, old_std, target_Q = self.sess.run([self.ret_rms.mean, self.ret_rms.std, self.target_Q], feed_dict={
|
||||
self.obs1: batch['obs1'],
|
||||
self.rewards: batch['rewards'],
|
||||
self.terminals1: batch['terminals1'].astype('float32'),
|
||||
})
|
||||
self.ret_rms.update(target_Q.flatten())
|
||||
self.sess.run(self.renormalize_Q_outputs_op, feed_dict={
|
||||
self.old_std : np.array([old_std]),
|
||||
self.old_mean : np.array([old_mean]),
|
||||
})
|
||||
|
||||
# Run sanity check. Disabled by default since it slows down things considerably.
|
||||
# print('running sanity check')
|
||||
# target_Q_new, new_mean, new_std = self.sess.run([self.target_Q, self.ret_rms.mean, self.ret_rms.std], feed_dict={
|
||||
# self.obs1: batch['obs1'],
|
||||
# self.rewards: batch['rewards'],
|
||||
# self.terminals1: batch['terminals1'].astype('float32'),
|
||||
# })
|
||||
# print(target_Q_new, target_Q, new_mean, new_std)
|
||||
# assert (np.abs(target_Q - target_Q_new) < 1e-3).all()
|
||||
else:
|
||||
target_Q = self.sess.run(self.target_Q, feed_dict={
|
||||
self.obs1: batch['obs1'],
|
||||
self.rewards: batch['rewards'],
|
||||
self.terminals1: batch['terminals1'].astype('float32'),
|
||||
})
|
||||
|
||||
# Get all gradients and perform a synced update.
|
||||
ops = [self.actor_grads, self.actor_loss, self.critic_grads, self.critic_loss]
|
||||
actor_grads, actor_loss, critic_grads, critic_loss = self.sess.run(ops, feed_dict={
|
||||
self.obs0: batch['obs0'],
|
||||
self.actions: batch['actions'],
|
||||
self.critic_target: target_Q,
|
||||
})
|
||||
self.actor_optimizer.update(actor_grads, stepsize=self.actor_lr)
|
||||
self.critic_optimizer.update(critic_grads, stepsize=self.critic_lr)
|
||||
|
||||
return critic_loss, actor_loss
|
||||
|
||||
def initialize(self, sess):
|
||||
self.sess = sess
|
||||
self.sess.run(tf.global_variables_initializer())
|
||||
self.actor_optimizer.sync()
|
||||
self.critic_optimizer.sync()
|
||||
self.sess.run(self.target_init_updates)
|
||||
|
||||
def update_target_net(self):
|
||||
self.sess.run(self.target_soft_updates)
|
||||
|
||||
def get_stats(self):
|
||||
if self.stats_sample is None:
|
||||
# Get a sample and keep that fixed for all further computations.
|
||||
# This allows us to estimate the change in value for the same set of inputs.
|
||||
self.stats_sample = self.memory.sample(batch_size=self.batch_size)
|
||||
values = self.sess.run(self.stats_ops, feed_dict={
|
||||
self.obs0: self.stats_sample['obs0'],
|
||||
self.actions: self.stats_sample['actions'],
|
||||
})
|
||||
|
||||
names = self.stats_names[:]
|
||||
assert len(names) == len(values)
|
||||
stats = dict(zip(names, values))
|
||||
|
||||
if self.param_noise is not None:
|
||||
stats = {**stats, **self.param_noise.get_stats()}
|
||||
|
||||
return stats
|
||||
|
||||
def adapt_param_noise(self):
|
||||
try:
|
||||
from mpi4py import MPI
|
||||
except ImportError:
|
||||
MPI = None
|
||||
|
||||
if self.param_noise is None:
|
||||
return 0.
|
||||
|
||||
# Perturb a separate copy of the policy to adjust the scale for the next "real" perturbation.
|
||||
batch = self.memory.sample(batch_size=self.batch_size)
|
||||
self.sess.run(self.perturb_adaptive_policy_ops, feed_dict={
|
||||
self.param_noise_stddev: self.param_noise.current_stddev,
|
||||
})
|
||||
distance = self.sess.run(self.adaptive_policy_distance, feed_dict={
|
||||
self.obs0: batch['obs0'],
|
||||
self.param_noise_stddev: self.param_noise.current_stddev,
|
||||
})
|
||||
|
||||
if MPI is not None:
|
||||
mean_distance = MPI.COMM_WORLD.allreduce(distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
|
||||
else:
|
||||
mean_distance = distance
|
||||
|
||||
self.param_noise.adapt(mean_distance)
|
||||
return mean_distance
|
||||
|
||||
def reset(self):
|
||||
# Reset internal state after an episode is complete.
|
||||
if self.action_noise is not None:
|
||||
self.action_noise.reset()
|
||||
if self.param_noise is not None:
|
||||
self.sess.run(self.perturb_policy_ops, feed_dict={
|
||||
self.param_noise_stddev: self.param_noise.current_stddev,
|
||||
})
|
baselines/ddpg/main.py (new file, 123 lines)
@@ -0,0 +1,123 @@
|
||||
import argparse
|
||||
import time
|
||||
import os
|
||||
import logging
|
||||
from baselines import logger, bench
|
||||
from baselines.common.misc_util import (
|
||||
set_global_seeds,
|
||||
boolean_flag,
|
||||
)
|
||||
import baselines.ddpg.training as training
|
||||
from baselines.ddpg.models import Actor, Critic
|
||||
from baselines.ddpg.memory import Memory
|
||||
from baselines.ddpg.noise import *
|
||||
|
||||
import gym
|
||||
import tensorflow as tf
|
||||
from mpi4py import MPI
|
||||
|
||||
def run(env_id, seed, noise_type, layer_norm, evaluation, **kwargs):
|
||||
# Configure things.
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
if rank != 0:
|
||||
logger.set_level(logger.DISABLED)
|
||||
|
||||
# Create envs.
|
||||
env = gym.make(env_id)
|
||||
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
|
||||
|
||||
if evaluation and rank==0:
|
||||
eval_env = gym.make(env_id)
|
||||
eval_env = bench.Monitor(eval_env, os.path.join(logger.get_dir(), 'gym_eval'))
|
||||
env = bench.Monitor(env, None)
|
||||
else:
|
||||
eval_env = None
|
||||
|
||||
# Parse noise_type
|
||||
action_noise = None
|
||||
param_noise = None
|
||||
nb_actions = env.action_space.shape[-1]
|
||||
for current_noise_type in noise_type.split(','):
|
||||
current_noise_type = current_noise_type.strip()
|
||||
if current_noise_type == 'none':
|
||||
pass
|
||||
elif 'adaptive-param' in current_noise_type:
|
||||
_, stddev = current_noise_type.split('_')
|
||||
param_noise = AdaptiveParamNoiseSpec(initial_stddev=float(stddev), desired_action_stddev=float(stddev))
|
||||
elif 'normal' in current_noise_type:
|
||||
_, stddev = current_noise_type.split('_')
|
||||
action_noise = NormalActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
|
||||
elif 'ou' in current_noise_type:
|
||||
_, stddev = current_noise_type.split('_')
|
||||
action_noise = OrnsteinUhlenbeckActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
|
||||
else:
|
||||
raise RuntimeError('unknown noise type "{}"'.format(current_noise_type))
|
||||
|
||||
# Configure components.
|
||||
memory = Memory(limit=int(1e6), action_shape=env.action_space.shape, observation_shape=env.observation_space.shape)
|
||||
critic = Critic(layer_norm=layer_norm)
|
||||
actor = Actor(nb_actions, layer_norm=layer_norm)
|
||||
|
||||
# Seed everything to make things reproducible.
|
||||
seed = seed + 1000000 * rank
|
||||
logger.info('rank {}: seed={}, logdir={}'.format(rank, seed, logger.get_dir()))
|
||||
tf.reset_default_graph()
|
||||
set_global_seeds(seed)
|
||||
env.seed(seed)
|
||||
if eval_env is not None:
|
||||
eval_env.seed(seed)
|
||||
|
||||
# Disable logging for rank != 0 to avoid noise.
|
||||
if rank == 0:
|
||||
start_time = time.time()
|
||||
training.train(env=env, eval_env=eval_env, param_noise=param_noise,
|
||||
action_noise=action_noise, actor=actor, critic=critic, memory=memory, **kwargs)
|
||||
env.close()
|
||||
if eval_env is not None:
|
||||
eval_env.close()
|
||||
if rank == 0:
|
||||
logger.info('total runtime: {}s'.format(time.time() - start_time))
|
||||
|
||||
|
||||
def parse_args():
|
||||
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
||||
|
||||
parser.add_argument('--env-id', type=str, default='HalfCheetah-v1')
|
||||
boolean_flag(parser, 'render-eval', default=False)
|
||||
boolean_flag(parser, 'layer-norm', default=True)
|
||||
boolean_flag(parser, 'render', default=False)
|
||||
boolean_flag(parser, 'normalize-returns', default=False)
|
||||
boolean_flag(parser, 'normalize-observations', default=True)
|
||||
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
|
||||
parser.add_argument('--critic-l2-reg', type=float, default=1e-2)
|
||||
parser.add_argument('--batch-size', type=int, default=64) # per MPI worker
|
||||
parser.add_argument('--actor-lr', type=float, default=1e-4)
|
||||
parser.add_argument('--critic-lr', type=float, default=1e-3)
|
||||
boolean_flag(parser, 'popart', default=False)
|
||||
parser.add_argument('--gamma', type=float, default=0.99)
|
||||
parser.add_argument('--reward-scale', type=float, default=1.)
|
||||
parser.add_argument('--clip-norm', type=float, default=None)
|
||||
parser.add_argument('--nb-epochs', type=int, default=500) # with default settings, perform 1M steps total
|
||||
parser.add_argument('--nb-epoch-cycles', type=int, default=20)
|
||||
parser.add_argument('--nb-train-steps', type=int, default=50) # per epoch cycle and MPI worker
|
||||
parser.add_argument('--nb-eval-steps', type=int, default=100) # per epoch cycle and MPI worker
|
||||
parser.add_argument('--nb-rollout-steps', type=int, default=100) # per epoch cycle and MPI worker
|
||||
parser.add_argument('--noise-type', type=str, default='adaptive-param_0.2') # choices are adaptive-param_xx, ou_xx, normal_xx, none
|
||||
parser.add_argument('--num-timesteps', type=int, default=None)
|
||||
boolean_flag(parser, 'evaluation', default=False)
|
||||
args = parser.parse_args()
|
||||
# we don't directly specify timesteps for this script, so make sure that if we do specify them
|
||||
# they agree with the other parameters
|
||||
if args.num_timesteps is not None:
|
||||
assert(args.num_timesteps == args.nb_epochs * args.nb_epoch_cycles * args.nb_rollout_steps)
|
||||
dict_args = vars(args)
|
||||
del dict_args['num_timesteps']
|
||||
return dict_args
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
args = parse_args()
|
||||
if MPI.COMM_WORLD.Get_rank() == 0:
|
||||
logger.configure()
|
||||
# Run actual script.
|
||||
run(**args)
|
baselines/ddpg/memory.py (2 lines changed; executable file → normal file)
@@ -51,7 +51,7 @@ class Memory(object):

def sample(self, batch_size):
# Draw such that we always have a proceeding element.
batch_idxs = np.random.randint(self.nb_entries - 2, size=batch_size)
batch_idxs = np.random.random_integers(self.nb_entries - 2, size=batch_size)

obs0_batch = self.observations0.get_batch(batch_idxs)
obs1_batch = self.observations1.get_batch(batch_idxs)

baselines/ddpg/models.py (52 lines changed; executable file → normal file)
@@ -1,11 +1,10 @@
|
||||
import tensorflow as tf
|
||||
from baselines.common.models import get_network_builder
|
||||
import tensorflow.contrib as tc
|
||||
|
||||
|
||||
class Model(object):
|
||||
def __init__(self, name, network='mlp', **network_kwargs):
|
||||
def __init__(self, name):
|
||||
self.name = name
|
||||
self.network_builder = get_network_builder(network)(**network_kwargs)
|
||||
|
||||
@property
|
||||
def vars(self):
|
||||
@@ -21,27 +20,54 @@ class Model(object):
|
||||
|
||||
|
||||
class Actor(Model):
|
||||
def __init__(self, nb_actions, name='actor', network='mlp', **network_kwargs):
|
||||
super().__init__(name=name, network=network, **network_kwargs)
|
||||
def __init__(self, nb_actions, name='actor', layer_norm=True):
|
||||
super(Actor, self).__init__(name=name)
|
||||
self.nb_actions = nb_actions
|
||||
self.layer_norm = layer_norm
|
||||
|
||||
def __call__(self, obs, reuse=False):
|
||||
with tf.variable_scope(self.name, reuse=tf.AUTO_REUSE):
|
||||
x = self.network_builder(obs)
|
||||
with tf.variable_scope(self.name) as scope:
|
||||
if reuse:
|
||||
scope.reuse_variables()
|
||||
|
||||
x = obs
|
||||
x = tf.layers.dense(x, 64)
|
||||
if self.layer_norm:
|
||||
x = tc.layers.layer_norm(x, center=True, scale=True)
|
||||
x = tf.nn.relu(x)
|
||||
|
||||
x = tf.layers.dense(x, 64)
|
||||
if self.layer_norm:
|
||||
x = tc.layers.layer_norm(x, center=True, scale=True)
|
||||
x = tf.nn.relu(x)
|
||||
|
||||
x = tf.layers.dense(x, self.nb_actions, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
|
||||
x = tf.nn.tanh(x)
|
||||
return x
|
||||
|
||||
|
||||
class Critic(Model):
|
||||
def __init__(self, name='critic', network='mlp', **network_kwargs):
|
||||
super().__init__(name=name, network=network, **network_kwargs)
|
||||
self.layer_norm = True
|
||||
def __init__(self, name='critic', layer_norm=True):
|
||||
super(Critic, self).__init__(name=name)
|
||||
self.layer_norm = layer_norm
|
||||
|
||||
def __call__(self, obs, action, reuse=False):
|
||||
with tf.variable_scope(self.name, reuse=tf.AUTO_REUSE):
|
||||
x = tf.concat([obs, action], axis=-1) # this assumes observation and action can be concatenated
|
||||
x = self.network_builder(x)
|
||||
with tf.variable_scope(self.name) as scope:
|
||||
if reuse:
|
||||
scope.reuse_variables()
|
||||
|
||||
x = obs
|
||||
x = tf.layers.dense(x, 64)
|
||||
if self.layer_norm:
|
||||
x = tc.layers.layer_norm(x, center=True, scale=True)
|
||||
x = tf.nn.relu(x)
|
||||
|
||||
x = tf.concat([x, action], axis=-1)
|
||||
x = tf.layers.dense(x, 64)
|
||||
if self.layer_norm:
|
||||
x = tc.layers.layer_norm(x, center=True, scale=True)
|
||||
x = tf.nn.relu(x)
|
||||
|
||||
x = tf.layers.dense(x, 1, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
|
||||
return x
|
||||
|
||||
|
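As a side note on the models.py hunk above: the two `Actor`/`Critic` constructor interfaces shown there belong to different revisions of the file and are not meant to coexist. Below is a hedged sketch of how each variant would be instantiated; the `nb_actions` value is illustrative only.

```python
from baselines.ddpg.models import Actor, Critic

# Variant with a registered network builder passed through to baselines.common.models:
actor_a = Actor(nb_actions=6, network='mlp')
critic_a = Critic(network='mlp')

# Variant with the hard-coded two-layer MLP and an explicit layer-norm switch:
actor_b = Actor(nb_actions=6, layer_norm=True)
critic_b = Critic(layer_norm=True)
```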
baselines/ddpg/noise.py (no content changes; executable file → normal file)
baselines/ddpg/training.py (new file, 191 lines)
@@ -0,0 +1,191 @@
|
||||
import os
|
||||
import time
|
||||
from collections import deque
|
||||
import pickle
|
||||
|
||||
from baselines.ddpg.ddpg import DDPG
|
||||
import baselines.common.tf_util as U
|
||||
|
||||
from baselines import logger
|
||||
import numpy as np
|
||||
import tensorflow as tf
|
||||
from mpi4py import MPI
|
||||
|
||||
|
||||
def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, param_noise, actor, critic,
|
||||
normalize_returns, normalize_observations, critic_l2_reg, actor_lr, critic_lr, action_noise,
|
||||
popart, gamma, clip_norm, nb_train_steps, nb_rollout_steps, nb_eval_steps, batch_size, memory,
|
||||
tau=0.01, eval_env=None, param_noise_adaption_interval=50):
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
|
||||
assert (np.abs(env.action_space.low) == env.action_space.high).all() # we assume symmetric actions.
|
||||
max_action = env.action_space.high
|
||||
logger.info('scaling actions by {} before executing in env'.format(max_action))
|
||||
agent = DDPG(actor, critic, memory, env.observation_space.shape, env.action_space.shape,
|
||||
gamma=gamma, tau=tau, normalize_returns=normalize_returns, normalize_observations=normalize_observations,
|
||||
batch_size=batch_size, action_noise=action_noise, param_noise=param_noise, critic_l2_reg=critic_l2_reg,
|
||||
actor_lr=actor_lr, critic_lr=critic_lr, enable_popart=popart, clip_norm=clip_norm,
|
||||
reward_scale=reward_scale)
|
||||
logger.info('Using agent with the following configuration:')
|
||||
logger.info(str(agent.__dict__.items()))
|
||||
|
||||
# Set up logging stuff only for a single worker.
|
||||
if rank == 0:
|
||||
saver = tf.train.Saver()
|
||||
else:
|
||||
saver = None
|
||||
|
||||
step = 0
|
||||
episode = 0
|
||||
eval_episode_rewards_history = deque(maxlen=100)
|
||||
episode_rewards_history = deque(maxlen=100)
|
||||
with U.single_threaded_session() as sess:
|
||||
# Prepare everything.
|
||||
agent.initialize(sess)
|
||||
sess.graph.finalize()
|
||||
|
||||
agent.reset()
|
||||
obs = env.reset()
|
||||
if eval_env is not None:
|
||||
eval_obs = eval_env.reset()
|
||||
done = False
|
||||
episode_reward = 0.
|
||||
episode_step = 0
|
||||
episodes = 0
|
||||
t = 0
|
||||
|
||||
epoch = 0
|
||||
start_time = time.time()
|
||||
|
||||
epoch_episode_rewards = []
|
||||
epoch_episode_steps = []
|
||||
epoch_episode_eval_rewards = []
|
||||
epoch_episode_eval_steps = []
|
||||
epoch_start_time = time.time()
|
||||
epoch_actions = []
|
||||
epoch_qs = []
|
||||
epoch_episodes = 0
|
||||
for epoch in range(nb_epochs):
|
||||
for cycle in range(nb_epoch_cycles):
|
||||
# Perform rollouts.
|
||||
for t_rollout in range(nb_rollout_steps):
|
||||
# Predict next action.
|
||||
action, q = agent.pi(obs, apply_noise=True, compute_Q=True)
|
||||
assert action.shape == env.action_space.shape
|
||||
|
||||
# Execute next action.
|
||||
if rank == 0 and render:
|
||||
env.render()
|
||||
assert max_action.shape == action.shape
|
||||
new_obs, r, done, info = env.step(max_action * action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
|
||||
t += 1
|
||||
if rank == 0 and render:
|
||||
env.render()
|
||||
episode_reward += r
|
||||
episode_step += 1
|
||||
|
||||
# Book-keeping.
|
||||
epoch_actions.append(action)
|
||||
epoch_qs.append(q)
|
||||
agent.store_transition(obs, action, r, new_obs, done)
|
||||
obs = new_obs
|
||||
|
||||
if done:
|
||||
# Episode done.
|
||||
epoch_episode_rewards.append(episode_reward)
|
||||
episode_rewards_history.append(episode_reward)
|
||||
epoch_episode_steps.append(episode_step)
|
||||
episode_reward = 0.
|
||||
episode_step = 0
|
||||
epoch_episodes += 1
|
||||
episodes += 1
|
||||
|
||||
agent.reset()
|
||||
obs = env.reset()
|
||||
|
||||
# Train.
|
||||
epoch_actor_losses = []
|
||||
epoch_critic_losses = []
|
||||
epoch_adaptive_distances = []
|
||||
for t_train in range(nb_train_steps):
|
||||
# Adapt param noise, if necessary.
|
||||
if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
|
||||
distance = agent.adapt_param_noise()
|
||||
epoch_adaptive_distances.append(distance)
|
||||
|
||||
cl, al = agent.train()
|
||||
epoch_critic_losses.append(cl)
|
||||
epoch_actor_losses.append(al)
|
||||
agent.update_target_net()
|
||||
|
||||
# Evaluate.
|
||||
eval_episode_rewards = []
|
||||
eval_qs = []
|
||||
if eval_env is not None:
|
||||
eval_episode_reward = 0.
|
||||
for t_rollout in range(nb_eval_steps):
|
||||
eval_action, eval_q = agent.pi(eval_obs, apply_noise=False, compute_Q=True)
|
||||
eval_obs, eval_r, eval_done, eval_info = eval_env.step(max_action * eval_action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
|
||||
if render_eval:
|
||||
eval_env.render()
|
||||
eval_episode_reward += eval_r
|
||||
|
||||
eval_qs.append(eval_q)
|
||||
if eval_done:
|
||||
eval_obs = eval_env.reset()
|
||||
eval_episode_rewards.append(eval_episode_reward)
|
||||
eval_episode_rewards_history.append(eval_episode_reward)
|
||||
eval_episode_reward = 0.
|
||||
|
||||
mpi_size = MPI.COMM_WORLD.Get_size()
|
||||
# Log stats.
|
||||
# XXX shouldn't call np.mean on variable length lists
|
||||
duration = time.time() - start_time
|
||||
stats = agent.get_stats()
|
||||
combined_stats = stats.copy()
|
||||
combined_stats['rollout/return'] = np.mean(epoch_episode_rewards)
|
||||
combined_stats['rollout/return_history'] = np.mean(episode_rewards_history)
|
||||
combined_stats['rollout/episode_steps'] = np.mean(epoch_episode_steps)
|
||||
combined_stats['rollout/actions_mean'] = np.mean(epoch_actions)
|
||||
combined_stats['rollout/Q_mean'] = np.mean(epoch_qs)
|
||||
combined_stats['train/loss_actor'] = np.mean(epoch_actor_losses)
|
||||
combined_stats['train/loss_critic'] = np.mean(epoch_critic_losses)
|
||||
combined_stats['train/param_noise_distance'] = np.mean(epoch_adaptive_distances)
|
||||
combined_stats['total/duration'] = duration
|
||||
combined_stats['total/steps_per_second'] = float(t) / float(duration)
|
||||
combined_stats['total/episodes'] = episodes
|
||||
combined_stats['rollout/episodes'] = epoch_episodes
|
||||
combined_stats['rollout/actions_std'] = np.std(epoch_actions)
|
||||
# Evaluation statistics.
|
||||
if eval_env is not None:
|
||||
combined_stats['eval/return'] = eval_episode_rewards
|
||||
combined_stats['eval/return_history'] = np.mean(eval_episode_rewards_history)
|
||||
combined_stats['eval/Q'] = eval_qs
|
||||
combined_stats['eval/episodes'] = len(eval_episode_rewards)
|
||||
def as_scalar(x):
|
||||
if isinstance(x, np.ndarray):
|
||||
assert x.size == 1
|
||||
return x[0]
|
||||
elif np.isscalar(x):
|
||||
return x
|
||||
else:
|
||||
raise ValueError('expected scalar, got %s'%x)
|
||||
combined_stats_sums = MPI.COMM_WORLD.allreduce(np.array([as_scalar(x) for x in combined_stats.values()]))
|
||||
combined_stats = {k : v / mpi_size for (k,v) in zip(combined_stats.keys(), combined_stats_sums)}
|
||||
|
||||
# Total statistics.
|
||||
combined_stats['total/epochs'] = epoch + 1
|
||||
combined_stats['total/steps'] = t
|
||||
|
||||
for key in sorted(combined_stats.keys()):
|
||||
logger.record_tabular(key, combined_stats[key])
|
||||
logger.dump_tabular()
|
||||
logger.info('')
|
||||
logdir = logger.get_dir()
|
||||
if rank == 0 and logdir:
|
||||
if hasattr(env, 'get_state'):
|
||||
with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:
|
||||
pickle.dump(env.get_state(), f)
|
||||
if eval_env and hasattr(eval_env, 'get_state'):
|
||||
with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
|
||||
pickle.dump(eval_env.get_state(), f)
|
@@ -9,29 +9,44 @@ Here's a list of commands to run to quickly get a working example:

```bash
# Train model and save the results to cartpole_model.pkl
python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5
python -m baselines.deepq.experiments.train_cartpole
# Load the model saved in cartpole_model.pkl and visualize the learned policy
python -m baselines.run --alg=deepq --env=CartPole-v0 --load_path=./cartpole_model.pkl --num_timesteps=0 --play
python -m baselines.deepq.experiments.enjoy_cartpole
```

Be sure to check out the source code of [both](experiments/train_cartpole.py) [files](experiments/enjoy_cartpole.py)!

## If you wish to apply DQN to solve a problem.

Check out our simple agent trained with the one-stop-shop `deepq.learn` function.

- [baselines/deepq/experiments/train_cartpole.py](experiments/train_cartpole.py) - train a Cartpole agent.
- [baselines/deepq/experiments/train_pong.py](experiments/train_pong.py) - train a Pong agent using convolutional neural networks.

In particular, notice that once `deepq.learn` finishes training it returns an `act` function which can be used to select actions in the environment. Once trained, you can easily save it and load it at a later time. The complementary file `enjoy_cartpole.py` loads and visualizes the learned policy.
In particular, notice that once `deepq.learn` finishes training it returns an `act` function which can be used to select actions in the environment. Once trained, you can easily save it and load it at a later time. For both of the files listed above there are complementary files `enjoy_cartpole.py` and `enjoy_pong.py`, respectively, that load and visualize the learned policy.
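To make that workflow concrete, here is a minimal sketch of training, saving, and replaying a CartPole policy. It assumes the `deepq.learn(env, network=..., total_timesteps=...)` signature and the `act(obs[None])[0]` calling convention used by the cartpole example scripts; treat the exact keyword names as illustrative rather than authoritative.

```python
# Minimal sketch (assumes deepq.learn accepts network= and total_timesteps=,
# and that the returned act wrapper supports save() and __call__).
import gym
from baselines import deepq

def main():
    env = gym.make("CartPole-v0")
    # Train a small MLP Q-network and get back the act wrapper.
    act = deepq.learn(env, network='mlp', total_timesteps=100000)
    act.save("cartpole_model.pkl")  # serialize the learned policy

    # Roll out the learned (greedy) policy for one episode.
    obs, done = env.reset(), False
    while not done:
        env.render()
        obs, rew, done, _ = env.step(act(obs[None])[0])

if __name__ == '__main__':
    main()
```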
## If you wish to experiment with the algorithm

##### Check out the examples

- [baselines/deepq/experiments/custom_cartpole.py](experiments/custom_cartpole.py) - Cartpole training with more fine-grained control over the internals of the DQN algorithm.
- [baselines/deepq/defaults.py](defaults.py) - settings for training on Atari. Run
- [baselines/deepq/experiments/atari/train.py](experiments/atari/train.py) - more robust setup for training at scale.

##### Download a pretrained Atari agent

For some research projects it is sometimes useful to have an already trained agent handy. There's a variety of models to choose from. You can list them all by running:

```bash
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4
python -m baselines.deepq.experiments.atari.download_model
```
to train on Atari Pong (see more in the repo-wide [README.md](../../README.md#training-models))

Once you pick a model, you can download it and visualize the learned policy. Be sure to pass the `--dueling` flag to the visualization script when using dueling models.

```bash
python -m baselines.deepq.experiments.atari.download_model --blob model-atari-duel-pong-1 --model-dir /tmp/models
python -m baselines.deepq.experiments.atari.enjoy --model-dir /tmp/models/model-atari-duel-pong-1 --env Pong --dueling
```

@@ -309,7 +309,7 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True, reset_ph: False, update_param_noise_threshold_ph: False, update_param_noise_scale_ph: False},
updates=updates)
def act(ob, reset=False, update_param_noise_threshold=False, update_param_noise_scale=False, stochastic=True, update_eps=-1):
def act(ob, reset, update_param_noise_threshold, update_param_noise_scale, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps, reset, update_param_noise_threshold, update_param_noise_scale)
return act

@@ -7,7 +7,7 @@ import cloudpickle
|
||||
import numpy as np
|
||||
|
||||
import baselines.common.tf_util as U
|
||||
from baselines.common.tf_util import load_variables, save_variables
|
||||
from baselines.common.tf_util import load_state, save_state
|
||||
from baselines import logger
|
||||
from baselines.common.schedules import LinearSchedule
|
||||
from baselines.common import set_global_seeds
|
||||
@@ -27,7 +27,7 @@ class ActWrapper(object):
|
||||
self.initial_state = None
|
||||
|
||||
@staticmethod
|
||||
def load_act(path):
|
||||
def load_act(self, path):
|
||||
with open(path, "rb") as f:
|
||||
model_data, act_params = cloudpickle.load(f)
|
||||
act = deepq.build_act(**act_params)
|
||||
@@ -39,7 +39,7 @@ class ActWrapper(object):
|
||||
f.write(model_data)
|
||||
|
||||
zipfile.ZipFile(arc_path, 'r', zipfile.ZIP_DEFLATED).extractall(td)
|
||||
load_variables(os.path.join(td, "model"))
|
||||
load_state(os.path.join(td, "model"))
|
||||
|
||||
return ActWrapper(act, act_params)
|
||||
|
||||
@@ -47,9 +47,6 @@ class ActWrapper(object):
|
||||
return self._act(*args, **kwargs)
|
||||
|
||||
def step(self, observation, **kwargs):
|
||||
# DQN doesn't use RNNs so we ignore states and masks
|
||||
kwargs.pop('S', None)
|
||||
kwargs.pop('M', None)
|
||||
return self._act([observation], **kwargs), None, None, None
|
||||
|
||||
def save_act(self, path=None):
|
||||
@@ -58,7 +55,7 @@ class ActWrapper(object):
|
||||
path = os.path.join(logger.get_dir(), "model.pkl")
|
||||
|
||||
with tempfile.TemporaryDirectory() as td:
|
||||
save_variables(os.path.join(td, "model"))
|
||||
save_state(os.path.join(td, "model"))
|
||||
arc_name = os.path.join(td, "packed.zip")
|
||||
with zipfile.ZipFile(arc_name, 'w') as zipf:
|
||||
for root, dirs, files in os.walk(td):
|
||||
@@ -72,7 +69,7 @@ class ActWrapper(object):
|
||||
cloudpickle.dump((model_data, self._act_params), f)
|
||||
|
||||
def save(self, path):
|
||||
save_variables(path)
|
||||
save_state(path)
|
||||
|
||||
|
||||
def load_act(path):
|
||||
@@ -124,12 +121,16 @@ def learn(env,
|
||||
-------
|
||||
env: gym.Env
|
||||
environment to train on
|
||||
network: string or a function
|
||||
neural network to use as a q function approximator. If string, has to be one of the names of registered models in baselines.common.models
|
||||
(mlp, cnn, conv_only). If a function, should take an observation tensor and return a latent variable tensor, which
|
||||
will be mapped to the Q function heads (see build_q_func in baselines.deepq.models for details on that)
|
||||
seed: int or None
|
||||
prng seed. The runs with the same seed "should" give the same results. If None, no seeding is used.
|
||||
q_func: (tf.Variable, int, str, bool) -> tf.Variable
|
||||
the model that takes the following inputs:
|
||||
observation_in: object
|
||||
the output of observation placeholder
|
||||
num_actions: int
|
||||
number of actions
|
||||
scope: str
|
||||
reuse: bool
|
||||
should be passed to outer variable scope
|
||||
and returns a tensor of shape (batch_size, num_actions) with values of every action.
|
||||
lr: float
|
||||
learning rate for adam optimizer
|
||||
total_timesteps: int
|
||||
@@ -169,8 +170,6 @@ def learn(env,
|
||||
to 1.0. If set to None equals to total_timesteps.
|
||||
prioritized_replay_eps: float
|
||||
epsilon to add to the TD errors when updating priorities.
|
||||
param_noise: bool
|
||||
whether or not to use parameter space noise (https://arxiv.org/abs/1706.01905)
|
||||
callback: (locals, globals) -> None
|
||||
function called at every step with the state of the algorithm.
|
||||
If callback returns true training stops.
|
||||
@@ -195,9 +194,8 @@ def learn(env,
|
||||
# capture the shape outside the closure so that the env object is not serialized
|
||||
# by cloudpickle when serializing make_obs_ph
|
||||
|
||||
observation_space = env.observation_space
|
||||
def make_obs_ph(name):
|
||||
return ObservationInput(observation_space, name=name)
|
||||
return ObservationInput(env.observation_space, name=name)
|
||||
|
||||
act, train, update_target, debug = deepq.build_train(
|
||||
make_obs_ph=make_obs_ph,
|
||||
@@ -249,11 +247,11 @@ def learn(env,
|
||||
model_saved = False
|
||||
|
||||
if tf.train.latest_checkpoint(td) is not None:
|
||||
load_variables(model_file)
|
||||
load_state(model_file)
|
||||
logger.log('Loaded model from {}'.format(model_file))
|
||||
model_saved = True
|
||||
elif load_path is not None:
|
||||
load_variables(load_path)
|
||||
load_state(load_path)
|
||||
logger.log('Loaded model from {}'.format(load_path))
|
||||
|
||||
|
||||
@@ -322,12 +320,12 @@ def learn(env,
|
||||
if print_freq is not None:
|
||||
logger.log("Saving model due to mean reward increase: {} -> {}".format(
|
||||
saved_mean_reward, mean_100ep_reward))
|
||||
save_variables(model_file)
|
||||
save_state(model_file)
|
||||
model_saved = True
|
||||
saved_mean_reward = mean_100ep_reward
|
||||
if model_saved:
|
||||
if print_freq is not None:
|
||||
logger.log("Restored model with mean reward: {}".format(saved_mean_reward))
|
||||
load_variables(model_file)
|
||||
load_state(model_file)
|
||||
|
||||
return act
|
||||
|
@@ -5,7 +5,7 @@ from baselines import deepq
|
||||
|
||||
def main():
|
||||
env = gym.make("CartPole-v0")
|
||||
act = deepq.learn(env, network='mlp', total_timesteps=0, load_path="cartpole_model.pkl")
|
||||
act = deepq.load("cartpole_model.pkl")
|
||||
|
||||
while True:
|
||||
obs, done = env.reset(), False
|
||||
|
@@ -1,17 +1,11 @@
|
||||
import gym
|
||||
|
||||
from baselines import deepq
|
||||
from baselines.common import models
|
||||
|
||||
|
||||
def main():
|
||||
env = gym.make("MountainCar-v0")
|
||||
act = deepq.learn(
|
||||
env,
|
||||
network=models.mlp(num_layers=1, num_hidden=64),
|
||||
total_timesteps=0,
|
||||
load_path='mountaincar_model.pkl'
|
||||
)
|
||||
act = deepq.load("mountaincar_model.pkl")
|
||||
|
||||
while True:
|
||||
obs, done = env.reset(), False
|
||||
|
@@ -5,21 +5,14 @@ from baselines import deepq
|
||||
def main():
|
||||
env = gym.make("PongNoFrameskip-v4")
|
||||
env = deepq.wrap_atari_dqn(env)
|
||||
model = deepq.learn(
|
||||
env,
|
||||
"conv_only",
|
||||
convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
|
||||
hiddens=[256],
|
||||
dueling=True,
|
||||
total_timesteps=0
|
||||
)
|
||||
act = deepq.load("pong_model.pkl")
|
||||
|
||||
while True:
|
||||
obs, done = env.reset(), False
|
||||
episode_rew = 0
|
||||
while not done:
|
||||
env.render()
|
||||
obs, rew, done, _ = env.step(model(obs[None])[0])
|
||||
obs, rew, done, _ = env.step(act(obs[None])[0])
|
||||
episode_rew += rew
|
||||
print("Episode reward", episode_rew)
|
||||
|
||||
|
baselines/deepq/experiments/enjoy_retro.py (new file, 34 lines)
@@ -0,0 +1,34 @@
|
||||
import argparse
|
||||
|
||||
import numpy as np
|
||||
|
||||
from baselines import deepq
|
||||
from baselines.common import retro_wrappers
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--env', help='environment ID', default='SuperMarioBros-Nes')
|
||||
parser.add_argument('--gamestate', help='game state to load', default='Level1-1')
|
||||
parser.add_argument('--model', help='model pickle file from ActWrapper.save', default='model.pkl')
|
||||
args = parser.parse_args()
|
||||
|
||||
env = retro_wrappers.make_retro(game=args.env, state=args.gamestate, max_episode_steps=None)
|
||||
env = retro_wrappers.wrap_deepmind_retro(env)
|
||||
act = deepq.load(args.model)
|
||||
|
||||
while True:
|
||||
obs, done = env.reset(), False
|
||||
episode_rew = 0
|
||||
while not done:
|
||||
env.render()
|
||||
action = act(obs[None])[0]
|
||||
env_action = np.zeros(env.action_space.n)
|
||||
env_action[action] = 1
|
||||
obs, rew, done, _ = env.step(env_action)
|
||||
episode_rew += rew
|
||||
print('Episode reward', episode_rew)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
baselines/deepq/experiments/run_atari.py (new file, 54 lines)
@@ -0,0 +1,54 @@
|
||||
from baselines import deepq
|
||||
from baselines.common import set_global_seeds
|
||||
from baselines import bench
|
||||
import argparse
|
||||
from baselines import logger
|
||||
from baselines.common.atari_wrappers import make_atari
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
||||
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
|
||||
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
|
||||
parser.add_argument('--prioritized', type=int, default=1)
|
||||
parser.add_argument('--prioritized-replay-alpha', type=float, default=0.6)
|
||||
parser.add_argument('--dueling', type=int, default=1)
|
||||
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
|
||||
parser.add_argument('--checkpoint-freq', type=int, default=10000)
|
||||
parser.add_argument('--checkpoint-path', type=str, default=None)
|
||||
|
||||
args = parser.parse_args()
|
||||
logger.configure()
|
||||
set_global_seeds(args.seed)
|
||||
env = make_atari(args.env)
|
||||
env = bench.Monitor(env, logger.get_dir())
|
||||
env = deepq.wrap_atari_dqn(env)
|
||||
model = deepq.models.cnn_to_mlp(
|
||||
convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
|
||||
hiddens=[256],
|
||||
dueling=bool(args.dueling),
|
||||
)
|
||||
|
||||
deepq.learn(
|
||||
env,
|
||||
q_func=model,
|
||||
lr=1e-4,
|
||||
max_timesteps=args.num_timesteps,
|
||||
buffer_size=10000,
|
||||
exploration_fraction=0.1,
|
||||
exploration_final_eps=0.01,
|
||||
train_freq=4,
|
||||
learning_starts=10000,
|
||||
target_network_update_freq=1000,
|
||||
gamma=0.99,
|
||||
prioritized_replay=bool(args.prioritized),
|
||||
prioritized_replay_alpha=args.prioritized_replay_alpha,
|
||||
checkpoint_freq=args.checkpoint_freq,
|
||||
checkpoint_path=args.checkpoint_path,
|
||||
)
|
||||
|
||||
env.close()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
baselines/deepq/experiments/run_retro.py (new file, 49 lines)
@@ -0,0 +1,49 @@
|
||||
import argparse
|
||||
|
||||
from baselines import deepq
|
||||
from baselines.common import set_global_seeds
|
||||
from baselines import bench
|
||||
from baselines import logger
|
||||
from baselines.common import retro_wrappers
|
||||
import retro
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
||||
parser.add_argument('--env', help='environment ID', default='SuperMarioBros-Nes')
|
||||
parser.add_argument('--gamestate', help='game state to load', default='Level1-1')
|
||||
parser.add_argument('--seed', help='seed', type=int, default=0)
|
||||
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
|
||||
args = parser.parse_args()
|
||||
logger.configure()
|
||||
set_global_seeds(args.seed)
|
||||
env = retro_wrappers.make_retro(game=args.env, state=args.gamestate, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE)
|
||||
env.seed(args.seed)
|
||||
env = bench.Monitor(env, logger.get_dir())
|
||||
env = retro_wrappers.wrap_deepmind_retro(env)
|
||||
|
||||
model = deepq.models.cnn_to_mlp(
|
||||
convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
|
||||
hiddens=[256],
|
||||
dueling=True
|
||||
)
|
||||
act = deepq.learn(
|
||||
env,
|
||||
q_func=model,
|
||||
lr=1e-4,
|
||||
max_timesteps=args.num_timesteps,
|
||||
buffer_size=10000,
|
||||
exploration_fraction=0.1,
|
||||
exploration_final_eps=0.01,
|
||||
train_freq=4,
|
||||
learning_starts=10000,
|
||||
target_network_update_freq=1000,
|
||||
gamma=0.99,
|
||||
prioritized_replay=True
|
||||
)
|
||||
act.save()
|
||||
env.close()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
@@ -11,11 +11,12 @@ def callback(lcl, _glb):
|
||||
|
||||
def main():
|
||||
env = gym.make("CartPole-v0")
|
||||
model = deepq.models.mlp([64])
|
||||
act = deepq.learn(
|
||||
env,
|
||||
network='mlp',
|
||||
q_func=model,
|
||||
lr=1e-3,
|
||||
total_timesteps=100000,
|
||||
max_timesteps=100000,
|
||||
buffer_size=50000,
|
||||
exploration_fraction=0.1,
|
||||
exploration_final_eps=0.02,
|
||||
|
@@ -1,17 +1,17 @@
|
||||
import gym
|
||||
|
||||
from baselines import deepq
|
||||
from baselines.common import models
|
||||
|
||||
|
||||
def main():
|
||||
env = gym.make("MountainCar-v0")
|
||||
# Enabling layer_norm here is important for parameter space noise!
|
||||
model = deepq.models.mlp([64], layer_norm=True)
|
||||
act = deepq.learn(
|
||||
env,
|
||||
network=models.mlp(num_hidden=64, num_layers=1),
|
||||
q_func=model,
|
||||
lr=1e-3,
|
||||
total_timesteps=100000,
|
||||
max_timesteps=100000,
|
||||
buffer_size=50000,
|
||||
exploration_fraction=0.1,
|
||||
exploration_final_eps=0.1,
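The comment above notes that layer normalization is important for parameter space noise. With the new `network=` API this would presumably be requested through the MLP builder; a hypothetical sketch, assuming `models.mlp` exposes a `layer_norm` flag:

```python
from baselines.common import models

# Hypothetical: request a layer-normalized MLP for parameter-space-noise runs
# (assumes models.mlp accepts layer_norm; verify against baselines.common.models).
network = models.mlp(num_layers=1, num_hidden=64, layer_norm=True)
```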
|
||||
|
@@ -1,34 +0,0 @@
|
||||
from baselines import deepq
|
||||
from baselines import bench
|
||||
from baselines import logger
|
||||
from baselines.common.atari_wrappers import make_atari
|
||||
|
||||
|
||||
def main():
|
||||
logger.configure()
|
||||
env = make_atari('PongNoFrameskip-v4')
|
||||
env = bench.Monitor(env, logger.get_dir())
|
||||
env = deepq.wrap_atari_dqn(env)
|
||||
|
||||
model = deepq.learn(
|
||||
env,
|
||||
"conv_only",
|
||||
convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
|
||||
hiddens=[256],
|
||||
dueling=True,
|
||||
lr=1e-4,
|
||||
total_timesteps=int(1e7),
|
||||
buffer_size=10000,
|
||||
exploration_fraction=0.1,
|
||||
exploration_final_eps=0.01,
|
||||
train_freq=4,
|
||||
learning_starts=10000,
|
||||
target_network_update_freq=1000,
|
||||
gamma=0.99,
|
||||
)
|
||||
|
||||
model.save('pong_model.pkl')
|
||||
env.close()
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
@@ -98,12 +98,7 @@ def build_q_func(network, hiddens=[256], dueling=True, layer_norm=False, **netwo
|
||||
|
||||
def q_func_builder(input_placeholder, num_actions, scope, reuse=False):
|
||||
with tf.variable_scope(scope, reuse=reuse):
|
||||
latent = network(input_placeholder)
|
||||
if isinstance(latent, tuple):
|
||||
if latent[1] is not None:
|
||||
raise NotImplementedError("DQN is not compatible with recurrent policies yet")
|
||||
latent = latent[0]
|
||||
|
||||
latent, _ = network(input_placeholder)
|
||||
latent = layers.flatten(latent)
|
||||
|
||||
with tf.variable_scope("action_value"):
|
||||
|
@@ -106,10 +106,9 @@ class PrioritizedReplayBuffer(ReplayBuffer):
|
||||
|
||||
def _sample_proportional(self, batch_size):
|
||||
res = []
|
||||
p_total = self._it_sum.sum(0, len(self._storage) - 1)
|
||||
every_range_len = p_total / batch_size
|
||||
for i in range(batch_size):
|
||||
mass = random.random() * every_range_len + i * every_range_len
|
||||
for _ in range(batch_size):
|
||||
# TODO(szymon): should we ensure no repeats?
|
||||
mass = random.random() * self._it_sum.sum(0, len(self._storage) - 1)
|
||||
idx = self._it_sum.find_prefixsum_idx(mass)
|
||||
res.append(idx)
|
||||
return res
|
||||
|
@@ -1,6 +1,8 @@
|
||||
from baselines.common.input import observation_input
|
||||
from baselines.common.tf_util import adjust_shape
|
||||
|
||||
import tensorflow as tf
|
||||
|
||||
# ================================================================
|
||||
# Placeholders
|
||||
# ================================================================
|
||||
@@ -18,11 +20,11 @@ class TfInput(object):
|
||||
"""Return the tf variable(s) representing the possibly postprocessed value
|
||||
of placeholder(s).
|
||||
"""
|
||||
raise NotImplementedError
|
||||
raise NotImplemented()
|
||||
|
||||
def make_feed_dict(data):
|
||||
"""Given data input it to the placeholder(s)."""
|
||||
raise NotImplementedError
|
||||
raise NotImplemented()
|
||||
|
||||
|
||||
class PlaceholderTfInput(TfInput):
|
||||
@@ -38,6 +40,29 @@ class PlaceholderTfInput(TfInput):
|
||||
return {self._placeholder: adjust_shape(self._placeholder, data)}
|
||||
|
||||
|
||||
class Uint8Input(PlaceholderTfInput):
|
||||
def __init__(self, shape, name=None):
|
||||
"""Takes input in uint8 format which is cast to float32 and divided by 255
|
||||
before passing it to the model.
|
||||
|
||||
On GPU this ensures lower data transfer times.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
shape: [int]
|
||||
shape of the tensor.
|
||||
name: str
|
||||
name of the underlying placeholder
|
||||
"""
|
||||
|
||||
super().__init__(tf.placeholder(tf.uint8, [None] + list(shape), name=name))
|
||||
self._shape = shape
|
||||
self._output = tf.cast(super().get(), tf.float32) / 255.0
|
||||
|
||||
def get(self):
|
||||
return self._output
|
||||
|
||||
|
||||
class ObservationInput(PlaceholderTfInput):
|
||||
def __init__(self, observation_space, name=None):
|
||||
"""Creates an input placeholder tailored to a specific observation space
|
||||
|
@@ -30,51 +30,3 @@ python -m baselines.her.experiment.train --num_cpu 19
|
||||
This will require a machine with sufficient amount of physical CPU cores. In our experiments,
|
||||
we used [Azure's D15v2 instances](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes),
|
||||
which have 20 physical cores. We only scheduled the experiment on 19 of those to leave some head-room on the system.
|
||||
|
||||
|
||||
## Hindsight Experience Replay with Demonstrations
Using pre-recorded demonstrations to overcome the exploration problem in HER-based reinforcement learning.
For details, please read the [paper](https://arxiv.org/pdf/1709.10089.pdf).
|
||||
|
||||
### Getting started
The first step is to generate the demonstration dataset. This can be done in two ways: either by using a VR system to manipulate the arm with physical VR trackers, or, more simply, by writing a script that carries out the task. Some tasks are complex enough that a hardcoded script is difficult to write (e.g. Fetch Push), but our focus here is on providing an algorithm that helps the agent learn from demonstrations, not on the demonstration-generation paradigm itself, so the data collection part is left to the reader's choice.

We provide a script for the Fetch Pick and Place task. To generate demonstrations for the Pick and Place task, execute:
```bash
python experiment/data_generation/fetch_data_generation.py
```
This outputs the ```data_fetch_random_100.npz``` file, which is our data file.
|
||||
|
||||
#### Configuration
The provided configuration is for training an agent with HER without demonstrations. To make the HER algorithm learn through demonstrations we need to change a few parameters; set the following (a code sketch of these overrides follows the list):

* bc_loss: 1 - whether or not to use the behavior cloning loss as an auxiliary loss
* q_filter: 1 - whether or not a Q value filter should be used on the Actor outputs
* num_demo: 100 - number of expert demo episodes
* demo_batch_size: 128 - number of samples to be used from the demonstrations buffer, per MPI thread
* prm_loss_weight: 0.001 - weight corresponding to the primary loss
* aux_loss_weight: 0.0078 - weight corresponding to the auxiliary loss, also called the cloning loss
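A minimal sketch of these overrides applied programmatically (an illustration only; the keys match the `DEFAULT_PARAMS` entries in `baselines/her/experiment/config.py` shown later in this diff, and editing that file directly works just as well):

```python
from baselines.her.experiment import config

# Demonstration-related settings described above.
config.DEFAULT_PARAMS.update({
    'bc_loss': 1,              # use the behavior cloning loss as an auxiliary loss
    'q_filter': 1,             # only clone where the critic prefers the demo action
    'num_demo': 100,           # number of expert demo episodes
    'demo_batch_size': 128,    # demo samples per MPI thread in each training batch
    'prm_loss_weight': 0.001,  # weight of the primary (DDPG) loss
    'aux_loss_weight': 0.0078, # weight of the auxiliary (cloning) loss
})
```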
|
||||
|
||||
Apart from these changes, the reported results also use the following configuration changes:
|
||||
|
||||
* n_cycles: 20 - per epoch
|
||||
* batch_size: 1024 - per mpi thread, total batch size
|
||||
* random_eps: 0.1 - percentage of time a random action is taken
|
||||
* noise_eps: 0.1 - std of gaussian noise added to not-completely-random actions
|
||||
|
||||
Now training an agent with pre-recorded demonstrations:
|
||||
```bash
|
||||
python -m baselines.her.experiment.train --env=FetchPickAndPlace-v0 --n_epochs=1000 --demo_file=/Path/to/demo_file.npz --num_cpu=1
|
||||
```
|
||||
|
||||
This will train a DDPG+HER agent on the `FetchPickAndPlace` environment by using previously generated demonstration data.
|
||||
To inspect what the agent has learned, use the play script as described above.
|
||||
|
||||
### Results
|
||||
Training with demonstrations helps overcome the exploration problem and achieves faster and better convergence. The following graphs contrast training with and without demonstration data; we report the mean Q values vs. epoch and the success rate vs. epoch:
|
||||
|
||||
|
||||
<div class="imgcap" align="middle">
|
||||
<center><img src="../../data/fetchPickAndPlaceContrast.png"></center>
|
||||
<div class="thecap" align="middle"><b>Training results for Fetch Pick and Place task constrasting between training with and without demonstration data.</b></div>
|
||||
</div>
|
||||
|
@@ -6,7 +6,7 @@ from tensorflow.contrib.staging import StagingArea
|
||||
|
||||
from baselines import logger
|
||||
from baselines.her.util import (
|
||||
import_function, store_args, flatten_grads, transitions_in_episode_batch, convert_episode_to_batch_major)
|
||||
import_function, store_args, flatten_grads, transitions_in_episode_batch)
|
||||
from baselines.her.normalizer import Normalizer
|
||||
from baselines.her.replay_buffer import ReplayBuffer
|
||||
from baselines.common.mpi_adam import MpiAdam
|
||||
@@ -16,17 +16,13 @@ def dims_to_shapes(input_dims):
|
||||
return {key: tuple([val]) if val > 0 else tuple() for key, val in input_dims.items()}
|
||||
|
||||
|
||||
global demoBuffer #buffer for demonstrations
|
||||
|
||||
class DDPG(object):
|
||||
@store_args
|
||||
def __init__(self, input_dims, buffer_size, hidden, layers, network_class, polyak, batch_size,
|
||||
Q_lr, pi_lr, norm_eps, norm_clip, max_u, action_l2, clip_obs, scope, T,
|
||||
rollout_batch_size, subtract_goals, relative_goals, clip_pos_returns, clip_return,
|
||||
bc_loss, q_filter, num_demo, demo_batch_size, prm_loss_weight, aux_loss_weight,
|
||||
sample_transitions, gamma, reuse=False, **kwargs):
|
||||
"""Implementation of DDPG that is used in combination with Hindsight Experience Replay (HER).
|
||||
Added functionality to use demonstrations for training to overcome the exploration problem.
|
||||
|
||||
Args:
|
||||
input_dims (dict of ints): dimensions for the observation (o), the goal (g), and the
|
||||
@@ -54,12 +50,6 @@ class DDPG(object):
|
||||
sample_transitions (function) function that samples from the replay buffer
|
||||
gamma (float): gamma used for Q learning updates
|
||||
reuse (boolean): whether or not the networks should be reused
|
||||
bc_loss: whether or not the behavior cloning loss should be used as an auxiliary loss
|
||||
q_filter: whether or not a filter on the Q value update should be used when training with demonstrations
|
||||
num_demo: number of episodes to be used in the demonstration buffer
|
||||
demo_batch_size: number of samples to be used from the demonstrations buffer, per mpi thread
|
||||
prm_loss_weight: Weight corresponding to the primary loss
|
||||
aux_loss_weight: weight corresponding to the auxiliary loss, also called the cloning loss
|
||||
"""
|
||||
if self.clip_return is None:
|
||||
self.clip_return = np.inf
|
||||
@@ -102,9 +92,6 @@ class DDPG(object):
|
||||
buffer_size = (self.buffer_size // self.rollout_batch_size) * self.rollout_batch_size
|
||||
self.buffer = ReplayBuffer(buffer_shapes, buffer_size, self.T, self.sample_transitions)
|
||||
|
||||
global demoBuffer
|
||||
demoBuffer = ReplayBuffer(buffer_shapes, buffer_size, self.T, self.sample_transitions) #initialize the demo buffer; in the same way as the primary data buffer
|
||||
|
||||
def _random_action(self, n):
|
||||
return np.random.uniform(low=-self.max_u, high=self.max_u, size=(n, self.dimu))
|
||||
|
||||
@@ -151,57 +138,6 @@ class DDPG(object):
|
||||
else:
|
||||
return ret
|
||||
|
||||
def initDemoBuffer(self, demoDataFile, update_stats=True): #function that initializes the demo buffer
|
||||
|
||||
demoData = np.load(demoDataFile) #load the demonstration data from data file
|
||||
info_keys = [key.replace('info_', '') for key in self.input_dims.keys() if key.startswith('info_')]
|
||||
info_values = [np.empty((self.T, 1, self.input_dims['info_' + key]), np.float32) for key in info_keys]
|
||||
|
||||
for epsd in range(self.num_demo): # we initialize the whole demo buffer at the start of the training
|
||||
obs, acts, goals, achieved_goals = [], [] ,[] ,[]
|
||||
i = 0
|
||||
for transition in range(self.T):
|
||||
obs.append([demoData['obs'][epsd ][transition].get('observation')])
|
||||
acts.append([demoData['acs'][epsd][transition]])
|
||||
goals.append([demoData['obs'][epsd][transition].get('desired_goal')])
|
||||
achieved_goals.append([demoData['obs'][epsd][transition].get('achieved_goal')])
|
||||
for idx, key in enumerate(info_keys):
|
||||
info_values[idx][transition, i] = demoData['info'][epsd][transition][key]
|
||||
|
||||
obs.append([demoData['obs'][epsd][self.T].get('observation')])
|
||||
achieved_goals.append([demoData['obs'][epsd][self.T].get('achieved_goal')])
|
||||
|
||||
episode = dict(o=obs,
|
||||
u=acts,
|
||||
g=goals,
|
||||
ag=achieved_goals)
|
||||
for key, value in zip(info_keys, info_values):
|
||||
episode['info_{}'.format(key)] = value
|
||||
|
||||
episode = convert_episode_to_batch_major(episode)
|
||||
global demoBuffer
|
||||
demoBuffer.store_episode(episode) # create the observation dict and append them into the demonstration buffer
|
||||
|
||||
print("Demo buffer size currently ", demoBuffer.get_current_size()) #print out the demonstration buffer size
|
||||
|
||||
if update_stats:
|
||||
# add transitions to normalizer to normalize the demo data as well
|
||||
episode['o_2'] = episode['o'][:, 1:, :]
|
||||
episode['ag_2'] = episode['ag'][:, 1:, :]
|
||||
num_normalizing_transitions = transitions_in_episode_batch(episode)
|
||||
transitions = self.sample_transitions(episode, num_normalizing_transitions)
|
||||
|
||||
o, o_2, g, ag = transitions['o'], transitions['o_2'], transitions['g'], transitions['ag']
|
||||
transitions['o'], transitions['g'] = self._preprocess_og(o, ag, g)
|
||||
# No need to preprocess the o_2 and g_2 since this is only used for stats
|
||||
|
||||
self.o_stats.update(transitions['o'])
|
||||
self.g_stats.update(transitions['g'])
|
||||
|
||||
self.o_stats.recompute_stats()
|
||||
self.g_stats.recompute_stats()
|
||||
episode.clear()
|
||||
|
||||
def store_episode(self, episode_batch, update_stats=True):
|
||||
"""
|
||||
episode_batch: array of batch_size x (T or T+1) x dim_key
|
||||
@@ -249,18 +185,7 @@ class DDPG(object):
|
||||
self.pi_adam.update(pi_grad, self.pi_lr)
|
||||
|
||||
def sample_batch(self):
|
||||
if self.bc_loss: #use demonstration buffer to sample as well if bc_loss flag is set TRUE
|
||||
transitions = self.buffer.sample(self.batch_size - self.demo_batch_size)
|
||||
global demoBuffer
|
||||
transitionsDemo = demoBuffer.sample(self.demo_batch_size) #sample from the demo buffer
|
||||
for k, values in transitionsDemo.items():
|
||||
rolloutV = transitions[k].tolist()
|
||||
for v in values:
|
||||
rolloutV.append(v.tolist())
|
||||
transitions[k] = np.array(rolloutV)
|
||||
else:
|
||||
transitions = self.buffer.sample(self.batch_size) #otherwise only sample from primary buffer
|
||||
|
||||
transitions = self.buffer.sample(self.batch_size)
|
||||
o, o_2, g = transitions['o'], transitions['o_2'], transitions['g']
|
||||
ag, ag_2 = transitions['ag'], transitions['ag_2']
|
||||
transitions['o'], transitions['g'] = self._preprocess_og(o, ag, g)
|
||||
@@ -323,9 +248,6 @@ class DDPG(object):
|
||||
for i, key in enumerate(self.stage_shapes.keys())])
|
||||
batch_tf['r'] = tf.reshape(batch_tf['r'], [-1, 1])
|
||||
|
||||
#choose only the demo buffer samples
|
||||
mask = np.concatenate((np.zeros(self.batch_size - self.demo_batch_size), np.ones(self.demo_batch_size)), axis = 0)
|
||||
|
||||
# networks
|
||||
with tf.variable_scope('main') as vs:
|
||||
if reuse:
|
||||
@@ -348,25 +270,6 @@ class DDPG(object):
|
||||
clip_range = (-self.clip_return, 0. if self.clip_pos_returns else np.inf)
|
||||
target_tf = tf.clip_by_value(batch_tf['r'] + self.gamma * target_Q_pi_tf, *clip_range)
|
||||
self.Q_loss_tf = tf.reduce_mean(tf.square(tf.stop_gradient(target_tf) - self.main.Q_tf))
|
||||
|
||||
if self.bc_loss ==1 and self.q_filter == 1 : # train with demonstrations and use bc_loss and q_filter both
|
||||
maskMain = tf.reshape(tf.boolean_mask(self.main.Q_tf > self.main.Q_pi_tf, mask), [-1]) #where is the demonstrator action better than actor action according to the critic? choose those samples only
|
||||
#define the cloning loss on the actor's actions only on the samples which adhere to the above masks
|
||||
self.cloning_loss_tf = tf.reduce_sum(tf.square(tf.boolean_mask(tf.boolean_mask((self.main.pi_tf), mask), maskMain, axis=0) - tf.boolean_mask(tf.boolean_mask((batch_tf['u']), mask), maskMain, axis=0)))
|
||||
self.pi_loss_tf = -self.prm_loss_weight * tf.reduce_mean(self.main.Q_pi_tf) #primary loss scaled by it's respective weight prm_loss_weight
|
||||
self.pi_loss_tf += self.prm_loss_weight * self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u)) #L2 loss on action values scaled by the same weight prm_loss_weight
|
||||
self.pi_loss_tf += self.aux_loss_weight * self.cloning_loss_tf #adding the cloning loss to the actor loss as an auxilliary loss scaled by its weight aux_loss_weight
|
||||
|
||||
elif self.bc_loss == 1 and self.q_filter == 0: # train with demonstrations without q_filter
|
||||
self.cloning_loss_tf = tf.reduce_sum(tf.square(tf.boolean_mask((self.main.pi_tf), mask) - tf.boolean_mask((batch_tf['u']), mask)))
|
||||
self.pi_loss_tf = -self.prm_loss_weight * tf.reduce_mean(self.main.Q_pi_tf)
|
||||
self.pi_loss_tf += self.prm_loss_weight * self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
|
||||
self.pi_loss_tf += self.aux_loss_weight * self.cloning_loss_tf
|
||||
|
||||
else: #If not training with demonstrations
|
||||
self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
|
||||
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
|
||||
|
||||
self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
|
||||
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
|
||||
Q_grads_tf = tf.gradients(self.Q_loss_tf, self._vars('main/Q'))
|
||||
|
@@ -44,13 +44,6 @@ DEFAULT_PARAMS = {
|
||||
# normalization
|
||||
'norm_eps': 0.01, # epsilon used for observation normalization
|
||||
'norm_clip': 5,  # normalized observations are clipped to this value
|
||||
|
||||
'bc_loss': 0,  # whether or not to use the behavior cloning loss as an auxiliary loss
|
||||
'q_filter': 0, # whether or not a Q value filter should be used on the Actor outputs
|
||||
'num_demo': 100, # number of expert demo episodes
|
||||
'demo_batch_size': 128, #number of samples to be used from the demonstrations buffer, per mpi thread 128/1024 or 32/256
|
||||
'prm_loss_weight': 0.001, #Weight corresponding to the primary loss
|
||||
'aux_loss_weight': 0.0078,  # weight corresponding to the auxiliary loss, also called the cloning loss
|
||||
}
|
||||
|
||||
|
||||
@@ -152,12 +145,6 @@ def configure_ddpg(dims, params, reuse=False, use_mpi=True, clip_return=True):
|
||||
'subtract_goals': simple_goal_subtract,
|
||||
'sample_transitions': sample_her_transitions,
|
||||
'gamma': gamma,
|
||||
'bc_loss': params['bc_loss'],
|
||||
'q_filter': params['q_filter'],
|
||||
'num_demo': params['num_demo'],
|
||||
'demo_batch_size': params['demo_batch_size'],
|
||||
'prm_loss_weight': params['prm_loss_weight'],
|
||||
'aux_loss_weight': params['aux_loss_weight'],
|
||||
})
|
||||
ddpg_params['info'] = {
|
||||
'env_name': params['env_name'],
|
||||
|
@@ -1,149 +0,0 @@
|
||||
import gym
|
||||
import time
|
||||
import random
|
||||
import numpy as np
|
||||
import rospy
|
||||
import roslaunch
|
||||
|
||||
from random import randint
|
||||
from std_srvs.srv import Empty
|
||||
from sensor_msgs.msg import JointState
|
||||
from geometry_msgs.msg import PoseStamped
|
||||
from geometry_msgs.msg import Pose
|
||||
from std_msgs.msg import Float64
|
||||
from controller_manager_msgs.srv import SwitchController
|
||||
from gym.utils import seeding
|
||||
|
||||
|
||||
"""Data generation for the case of a single block pick and place in Fetch Env"""
|
||||
|
||||
actions = []
|
||||
observations = []
|
||||
infos = []
|
||||
|
||||
def main():
|
||||
env = gym.make('FetchPickAndPlace-v0')
|
||||
numItr = 100
|
||||
initStateSpace = "random"
|
||||
env.reset()
|
||||
print("Reset!")
|
||||
while len(actions) < numItr:
|
||||
obs = env.reset()
|
||||
print("ITERATION NUMBER ", len(actions))
|
||||
goToGoal(env, obs)
|
||||
|
||||
|
||||
fileName = "data_fetch"
|
||||
fileName += "_" + initStateSpace
|
||||
fileName += "_" + str(numItr)
|
||||
fileName += ".npz"
|
||||
|
||||
np.savez_compressed(fileName, acs=actions, obs=observations, info=infos) # save the file
|
||||
|
||||
def goToGoal(env, lastObs):
|
||||
|
||||
goal = lastObs['desired_goal']
|
||||
objectPos = lastObs['observation'][3:6]
|
||||
gripperPos = lastObs['observation'][:3]
|
||||
gripperState = lastObs['observation'][9:11]
|
||||
object_rel_pos = lastObs['observation'][6:9]
|
||||
episodeAcs = []
|
||||
episodeObs = []
|
||||
episodeInfo = []
|
||||
|
||||
object_oriented_goal = object_rel_pos.copy()
|
||||
object_oriented_goal[2] += 0.03 # first make the gripper go slightly above the object
|
||||
|
||||
timeStep = 0 #count the total number of timesteps
|
||||
episodeObs.append(lastObs)
|
||||
|
||||
while np.linalg.norm(object_oriented_goal) >= 0.005 and timeStep <= env._max_episode_steps:
|
||||
env.render()
|
||||
action = [0, 0, 0, 0]
|
||||
object_oriented_goal = object_rel_pos.copy()
|
||||
object_oriented_goal[2] += 0.03
|
||||
|
||||
for i in range(len(object_oriented_goal)):
|
||||
action[i] = object_oriented_goal[i]*6
|
||||
|
||||
action[len(action)-1] = 0.05 #open
|
||||
|
||||
obsDataNew, reward, done, info = env.step(action)
|
||||
timeStep += 1
|
||||
|
||||
episodeAcs.append(action)
|
||||
episodeInfo.append(info)
|
||||
episodeObs.append(obsDataNew)
|
||||
|
||||
objectPos = obsDataNew['observation'][3:6]
|
||||
gripperPos = obsDataNew['observation'][:3]
|
||||
gripperState = obsDataNew['observation'][9:11]
|
||||
object_rel_pos = obsDataNew['observation'][6:9]
|
||||
|
||||
while np.linalg.norm(object_rel_pos) >= 0.005 and timeStep <= env._max_episode_steps :
|
||||
env.render()
|
||||
action = [0, 0, 0, 0]
|
||||
for i in range(len(object_rel_pos)):
|
||||
action[i] = object_rel_pos[i]*6
|
||||
|
||||
action[len(action)-1] = -0.005
|
||||
|
||||
obsDataNew, reward, done, info = env.step(action)
|
||||
timeStep += 1
|
||||
|
||||
episodeAcs.append(action)
|
||||
episodeInfo.append(info)
|
||||
episodeObs.append(obsDataNew)
|
||||
|
||||
objectPos = obsDataNew['observation'][3:6]
|
||||
gripperPos = obsDataNew['observation'][:3]
|
||||
gripperState = obsDataNew['observation'][9:11]
|
||||
object_rel_pos = obsDataNew['observation'][6:9]
|
||||
|
||||
|
||||
while np.linalg.norm(goal - objectPos) >= 0.01 and timeStep <= env._max_episode_steps :
|
||||
env.render()
|
||||
action = [0, 0, 0, 0]
|
||||
for i in range(len(goal - objectPos)):
|
||||
action[i] = (goal - objectPos)[i]*6
|
||||
|
||||
action[len(action)-1] = -0.005
|
||||
|
||||
obsDataNew, reward, done, info = env.step(action)
|
||||
timeStep += 1
|
||||
|
||||
episodeAcs.append(action)
|
||||
episodeInfo.append(info)
|
||||
episodeObs.append(obsDataNew)
|
||||
|
||||
objectPos = obsDataNew['observation'][3:6]
|
||||
gripperPos = obsDataNew['observation'][:3]
|
||||
gripperState = obsDataNew['observation'][9:11]
|
||||
object_rel_pos = obsDataNew['observation'][6:9]
|
||||
|
||||
while True: #limit the number of timesteps in the episode to a fixed duration
|
||||
env.render()
|
||||
action = [0, 0, 0, 0]
|
||||
action[len(action)-1] = -0.005 # keep the gripper closed
|
||||
|
||||
obsDataNew, reward, done, info = env.step(action)
|
||||
timeStep += 1
|
||||
|
||||
episodeAcs.append(action)
|
||||
episodeInfo.append(info)
|
||||
episodeObs.append(obsDataNew)
|
||||
|
||||
objectPos = obsDataNew['observation'][3:6]
|
||||
gripperPos = obsDataNew['observation'][:3]
|
||||
gripperState = obsDataNew['observation'][9:11]
|
||||
object_rel_pos = obsDataNew['observation'][6:9]
|
||||
|
||||
if timeStep >= env._max_episode_steps: break
|
||||
|
||||
actions.append(episodeAcs)
|
||||
observations.append(episodeObs)
|
||||
infos.append(episodeInfo)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
@@ -26,7 +26,7 @@ def mpi_average(value):
|
||||
|
||||
def train(policy, rollout_worker, evaluator,
|
||||
n_epochs, n_test_rollouts, n_cycles, n_batches, policy_save_interval,
|
||||
save_policies, demo_file, **kwargs):
|
||||
save_policies, **kwargs):
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
|
||||
latest_policy_path = os.path.join(logger.get_dir(), 'policy_latest.pkl')
|
||||
@@ -35,8 +35,6 @@ def train(policy, rollout_worker, evaluator,
|
||||
|
||||
logger.info("Training...")
|
||||
best_success_rate = -1
|
||||
|
||||
if policy.bc_loss == 1: policy.initDemoBuffer(demo_file) #initialize demo buffer if training with demonstrations
|
||||
for epoch in range(n_epochs):
|
||||
# train
|
||||
rollout_worker.clear_history()
|
||||
@@ -86,7 +84,7 @@ def train(policy, rollout_worker, evaluator,
|
||||
|
||||
def launch(
|
||||
env, logdir, n_epochs, num_cpu, seed, replay_strategy, policy_save_interval, clip_return,
|
||||
demo_file, override_params={}, save_policies=True
|
||||
override_params={}, save_policies=True
|
||||
):
|
||||
# Fork for multi-CPU MPI implementation.
|
||||
if num_cpu > 1:
|
||||
@@ -173,7 +171,7 @@ def launch(
|
||||
logdir=logdir, policy=policy, rollout_worker=rollout_worker,
|
||||
evaluator=evaluator, n_epochs=n_epochs, n_test_rollouts=params['n_test_rollouts'],
|
||||
n_cycles=params['n_cycles'], n_batches=params['n_batches'],
|
||||
policy_save_interval=policy_save_interval, save_policies=save_policies, demo_file=demo_file)
|
||||
policy_save_interval=policy_save_interval, save_policies=save_policies)
|
||||
|
||||
|
||||
@click.command()
|
||||
@@ -185,7 +183,6 @@ def launch(
|
||||
@click.option('--policy_save_interval', type=int, default=5, help='the interval with which policy pickles are saved. If set to 0, only the best and latest policy will be pickled.')
|
||||
@click.option('--replay_strategy', type=click.Choice(['future', 'none']), default='future', help='the HER replay strategy to be used. "future" uses HER, "none" disables HER.')
|
||||
@click.option('--clip_return', type=int, default=1, help='whether or not returns should be clipped')
|
||||
@click.option('--demo_file', type=str, default = 'PATH/TO/DEMO/DATA/FILE.npz', help='demo data file path')
|
||||
def main(**kwargs):
|
||||
launch(**kwargs)
|
||||
|
||||
|
@@ -116,7 +116,7 @@ class RolloutWorker:
|
||||
return self.generate_rollouts()
|
||||
|
||||
if np.isnan(o_new).any():
|
||||
self.logger.warn('NaN caught during rollout generation. Trying again...')
|
||||
self.logger.warning('NaN caught during rollout generation. Trying again...')
|
||||
self.reset_all_rollouts()
|
||||
return self.generate_rollouts()
|
||||
|
||||
|
@@ -106,8 +106,7 @@ class CSVOutputFormat(KVWriter):
|
||||
|
||||
def writekvs(self, kvs):
|
||||
# Add our current row to the history
|
||||
extra_keys = list(kvs.keys() - self.keys)
|
||||
extra_keys.sort()
|
||||
extra_keys = kvs.keys() - self.keys
|
||||
if extra_keys:
|
||||
self.keys.extend(extra_keys)
|
||||
self.file.seek(0)
|
||||
@@ -345,6 +344,8 @@ class Logger(object):
|
||||
if isinstance(fmt, SeqWriter):
|
||||
fmt.writeseq(map(str, args))
|
||||
|
||||
Logger.DEFAULT = Logger.CURRENT = Logger(dir=None, output_formats=[HumanOutputFormat(sys.stdout)])
|
||||
|
||||
def configure(dir=None, format_strs=None):
|
||||
if dir is None:
|
||||
dir = os.getenv('OPENAI_LOGDIR')
|
||||
@@ -355,12 +356,8 @@ def configure(dir=None, format_strs=None):
|
||||
os.makedirs(dir, exist_ok=True)
|
||||
|
||||
log_suffix = ''
|
||||
rank = 0
|
||||
# check environment variables here instead of importing mpi4py
|
||||
# to avoid calling MPI_Init() when this module is imported
|
||||
for varname in ['PMI_RANK', 'OMPI_COMM_WORLD_RANK']:
|
||||
if varname in os.environ:
|
||||
rank = int(os.environ[varname])
|
||||
from mpi4py import MPI
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
if rank > 0:
|
||||
log_suffix = "-rank%03i" % rank
|
||||
|
||||
@@ -375,14 +372,6 @@ def configure(dir=None, format_strs=None):
|
||||
Logger.CURRENT = Logger(dir=dir, output_formats=output_formats)
|
||||
log('Logging to %s'%dir)
|
||||
|
||||
def _configure_default_logger():
|
||||
format_strs = None
|
||||
# keep the old default of only writing to stdout
|
||||
if 'OPENAI_LOG_FORMAT' not in os.environ:
|
||||
format_strs = ['stdout']
|
||||
configure(format_strs=format_strs)
|
||||
Logger.DEFAULT = Logger.CURRENT
|
||||
|
||||
def reset():
|
||||
if Logger.CURRENT is not Logger.DEFAULT:
|
||||
Logger.CURRENT.close()
|
||||
@@ -482,8 +471,5 @@ def read_tb(path):
|
||||
data[step-1, colidx] = value
|
||||
return pandas.DataFrame(data, columns=tags)
|
||||
|
||||
# configure the default logger on import
|
||||
_configure_default_logger()
|
||||
|
||||
if __name__ == "__main__":
|
||||
_demo()
|
||||
|
@@ -2,7 +2,5 @@
|
||||
|
||||
- Original paper: https://arxiv.org/abs/1707.06347
- Baselines blog post: https://blog.openai.com/openai-baselines-ppo/

- `python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- `python -m baselines.run --alg=ppo2 --env=Ant-v2 --num_timesteps=1e6` runs the algorithm for 1M frames on a Mujoco Ant environment.
- also refer to the repo-wide [README.md](../../README.md#training-models)
- `python -m baselines.ppo2.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.ppo2.run_mujoco` runs the algorithm for 1M frames on a Mujoco environment.
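For programmatic use, a minimal sketch of calling `ppo2.learn` directly (the keyword names follow the `learn()` signature shown further down in this diff; the CartPole environment and the specific hyperparameter values are illustrative assumptions, not repository defaults):

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

# ppo2 expects a vectorized environment, so wrap a single gym env in DummyVecEnv.
env = DummyVecEnv([lambda: gym.make("CartPole-v0")])

model = ppo2.learn(
    network='mlp',            # registered network name from baselines.common.models
    env=env,
    total_timesteps=100000,
    nsteps=128,               # steps collected per environment per update
    nminibatches=4,
    lam=0.95,
    gamma=0.99,
    noptepochs=4,
    ent_coef=0.01,
    lr=lambda f: f * 2.5e-4,  # annealed by the remaining-progress fraction f
    cliprange=0.2,
)
```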
|
||||
|
@@ -20,6 +20,3 @@ def atari():
|
||||
lr=lambda f : f * 2.5e-4,
|
||||
cliprange=lambda f : f * 0.1,
|
||||
)
|
||||
|
||||
def retro():
|
||||
return atari()
|
||||
|
@@ -10,116 +10,57 @@ from baselines.common import explained_variance, set_global_seeds
|
||||
from baselines.common.policies import build_policy
|
||||
from baselines.common.runners import AbstractEnvRunner
|
||||
from baselines.common.tf_util import get_session, save_variables, load_variables
|
||||
|
||||
try:
|
||||
from baselines.common.mpi_adam_optimizer import MpiAdamOptimizer
|
||||
from mpi4py import MPI
|
||||
from baselines.common.mpi_util import sync_from_root
|
||||
except ImportError:
|
||||
MPI = None
|
||||
|
||||
from mpi4py import MPI
|
||||
from baselines.common.tf_util import initialize
|
||||
from baselines.common.mpi_util import sync_from_root
|
||||
|
||||
class Model(object):
|
||||
"""
|
||||
We use this object to :
|
||||
__init__:
|
||||
- Creates the step_model
|
||||
- Creates the train_model
|
||||
|
||||
train():
|
||||
- Make the training part (feedforward and backpropagation of gradients)
|
||||
|
||||
save/load():
|
||||
- Save load the model
|
||||
"""
|
||||
def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
|
||||
nsteps, ent_coef, vf_coef, max_grad_norm):
|
||||
sess = get_session()
|
||||
|
||||
with tf.variable_scope('ppo2_model', reuse=tf.AUTO_REUSE):
|
||||
# CREATE OUR TWO MODELS
|
||||
# act_model that is used for sampling
|
||||
act_model = policy(nbatch_act, 1, sess)
|
||||
|
||||
# Train model for training
|
||||
train_model = policy(nbatch_train, nsteps, sess)
|
||||
|
||||
# CREATE THE PLACEHOLDERS
|
||||
A = train_model.pdtype.sample_placeholder([None])
|
||||
ADV = tf.placeholder(tf.float32, [None])
|
||||
R = tf.placeholder(tf.float32, [None])
|
||||
# Keep track of old actor
|
||||
OLDNEGLOGPAC = tf.placeholder(tf.float32, [None])
|
||||
# Keep track of old critic
|
||||
OLDVPRED = tf.placeholder(tf.float32, [None])
|
||||
LR = tf.placeholder(tf.float32, [])
|
||||
# Cliprange
|
||||
CLIPRANGE = tf.placeholder(tf.float32, [])
|
||||
|
||||
neglogpac = train_model.pd.neglogp(A)
|
||||
|
||||
# Calculate the entropy
|
||||
# Entropy is used to improve exploration by limiting the premature convergence to suboptimal policy.
|
||||
entropy = tf.reduce_mean(train_model.pd.entropy())
|
||||
|
||||
# CALCULATE THE LOSS
|
||||
# Total loss = Policy gradient loss - entropy * entropy coefficient + Value coefficient * value loss
|
||||
|
||||
# Clip the value to reduce variability during Critic training
|
||||
# Get the predicted value
|
||||
vpred = train_model.vf
|
||||
vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)
|
||||
# Unclipped value
|
||||
vf_losses1 = tf.square(vpred - R)
|
||||
# Clipped value
|
||||
vf_losses2 = tf.square(vpredclipped - R)
|
||||
|
||||
vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))
|
||||
|
||||
# Calculate ratio (pi current policy / pi old policy)
|
||||
ratio = tf.exp(OLDNEGLOGPAC - neglogpac)
|
||||
|
||||
# Defining Loss = - J is equivalent to max J
|
||||
pg_losses = -ADV * ratio
|
||||
|
||||
pg_losses2 = -ADV * tf.clip_by_value(ratio, 1.0 - CLIPRANGE, 1.0 + CLIPRANGE)
|
||||
|
||||
# Final PG loss
|
||||
pg_loss = tf.reduce_mean(tf.maximum(pg_losses, pg_losses2))
|
||||
approxkl = .5 * tf.reduce_mean(tf.square(neglogpac - OLDNEGLOGPAC))
|
||||
clipfrac = tf.reduce_mean(tf.to_float(tf.greater(tf.abs(ratio - 1.0), CLIPRANGE)))
|
||||
|
||||
# Total loss
|
||||
loss = pg_loss - entropy * ent_coef + vf_loss * vf_coef
|
||||
|
||||
# UPDATE THE PARAMETERS USING LOSS
|
||||
# 1. Get the model parameters
|
||||
params = tf.trainable_variables('ppo2_model')
|
||||
# 2. Build our trainer
|
||||
if MPI is not None:
|
||||
trainer = MpiAdamOptimizer(MPI.COMM_WORLD, learning_rate=LR, epsilon=1e-5)
|
||||
else:
|
||||
trainer = tf.train.AdamOptimizer(learning_rate=LR, epsilon=1e-5)
|
||||
# 3. Calculate the gradients
|
||||
grads_and_var = trainer.compute_gradients(loss, params)
|
||||
grads, var = zip(*grads_and_var)
|
||||
|
||||
if max_grad_norm is not None:
|
||||
# Clip the gradients (normalize)
|
||||
grads, _grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
|
||||
grads_and_var = list(zip(grads, var))
|
||||
# zip aggregate each gradient with parameters associated
|
||||
# For instance zip(ABCD, xyza) => Ax, By, Cz, Da
|
||||
|
||||
_train = trainer.apply_gradients(grads_and_var)
|
||||
|
||||
def train(lr, cliprange, obs, returns, masks, actions, values, neglogpacs, states=None):
|
||||
# Here we calculate advantage A(s,a) = R + yV(s') - V(s)
|
||||
# Returns = R + yV(s')
|
||||
advs = returns - values
|
||||
|
||||
# Normalize the advantages
|
||||
advs = (advs - advs.mean()) / (advs.std() + 1e-8)
|
||||
td_map = {train_model.X:obs, A:actions, ADV:advs, R:returns, LR:lr,
|
||||
CLIPRANGE:cliprange, OLDNEGLOGPAC:neglogpacs, OLDVPRED:values}
|
||||
@@ -143,47 +84,29 @@ class Model(object):
|
||||
self.save = functools.partial(save_variables, sess=sess)
|
||||
self.load = functools.partial(load_variables, sess=sess)
|
||||
|
||||
if MPI is None or MPI.COMM_WORLD.Get_rank() == 0:
|
||||
if MPI.COMM_WORLD.Get_rank() == 0:
|
||||
initialize()
|
||||
global_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="")
|
||||
|
||||
if MPI is not None:
|
||||
sync_from_root(sess, global_variables) #pylint: disable=E1101
|
||||
|
||||
class Runner(AbstractEnvRunner):
|
||||
"""
|
||||
We use this object to make a mini batch of experiences
|
||||
__init__:
|
||||
- Initialize the runner
|
||||
|
||||
run():
|
||||
- Make a mini batch
|
||||
"""
|
||||
def __init__(self, *, env, model, nsteps, gamma, lam):
|
||||
super().__init__(env=env, model=model, nsteps=nsteps)
|
||||
# Lambda used in GAE (General Advantage Estimation)
|
||||
self.lam = lam
|
||||
# Discount rate
|
||||
self.gamma = gamma
|
||||
|
||||
def run(self):
|
||||
# Here, we init the lists that will contain the mb of experiences
|
||||
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones, mb_neglogpacs = [],[],[],[],[],[]
|
||||
mb_states = self.states
|
||||
epinfos = []
|
||||
# For n in range number of steps
|
||||
for _ in range(self.nsteps):
|
||||
# Given observations, get action value and neglopacs
|
||||
# We already have self.obs because Runner superclass run self.obs[:] = env.reset() on init
|
||||
actions, values, self.states, neglogpacs = self.model.step(self.obs, S=self.states, M=self.dones)
|
||||
mb_obs.append(self.obs.copy())
|
||||
mb_actions.append(actions)
|
||||
mb_values.append(values)
|
||||
mb_neglogpacs.append(neglogpacs)
|
||||
mb_dones.append(self.dones)
|
||||
|
||||
# Take actions in env and look at the results
|
||||
# infos contains a ton of useful information
|
||||
self.obs[:], rewards, self.dones, infos = self.env.step(actions)
|
||||
for info in infos:
|
||||
maybeepinfo = info.get('episode')
|
||||
@@ -197,7 +120,6 @@ class Runner(AbstractEnvRunner):
|
||||
mb_neglogpacs = np.asarray(mb_neglogpacs, dtype=np.float32)
|
||||
mb_dones = np.asarray(mb_dones, dtype=np.bool)
|
||||
last_values = self.model.value(self.obs, S=self.states, M=self.dones)
|
||||
|
||||
#discount/bootstrap off value fn
|
||||
mb_returns = np.zeros_like(mb_rewards)
|
||||
mb_advs = np.zeros_like(mb_rewards)
|
||||
@@ -227,7 +149,7 @@ def constfn(val):
|
||||
return val
|
||||
return f
|
||||
|
||||
def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2048, ent_coef=0.0, lr=3e-4,
|
||||
def learn(*, network, env, total_timesteps, seed=None, nsteps=2048, ent_coef=0.0, lr=3e-4,
|
||||
vf_coef=0.5, max_grad_norm=0.5, gamma=0.99, lam=0.95,
|
||||
log_interval=10, nminibatches=4, noptepochs=4, cliprange=0.2,
|
||||
save_interval=0, load_path=None, **network_kwargs):
|
||||
@@ -241,7 +163,7 @@ def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2
|
||||
specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
|
||||
tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
|
||||
neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
|
||||
See common/models.py/lstm for more details on using recurrent nets in policies
|
||||
See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
|
||||
|
||||
env: baselines.common.vec_env.VecEnv environment. Needs to be vectorized for parallel environment simulation.
|
||||
The environments produced by gym.make can be wrapped using baselines.common.vec_env.DummyVecEnv class.
|
||||
@@ -267,8 +189,7 @@ def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2
|
||||
|
||||
log_interval: int number of timesteps between logging events
|
||||
|
||||
nminibatches: int number of training minibatches per update. For recurrent policies,
|
||||
should be smaller or equal than number of environments run in parallel.
|
||||
nminibatches: int number of training minibatches per update
|
||||
|
||||
noptepochs: int number of training epochs per update
|
||||
|
||||
@@ -296,67 +217,41 @@ def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2
|
||||
|
||||
policy = build_policy(env, network, **network_kwargs)
|
||||
|
||||
# Get the nb of env
|
||||
nenvs = env.num_envs
|
||||
|
||||
# Get state_space and action_space
|
||||
ob_space = env.observation_space
|
||||
ac_space = env.action_space
|
||||
|
||||
# Calculate the batch_size
|
||||
nbatch = nenvs * nsteps
|
||||
nbatch_train = nbatch // nminibatches
|
||||
|
||||
# Instantiate the model object (that creates act_model and train_model)
|
||||
make_model = lambda : Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nbatch_act=nenvs, nbatch_train=nbatch_train,
|
||||
nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
|
||||
max_grad_norm=max_grad_norm)
|
||||
if save_interval and logger.get_dir():
|
||||
import cloudpickle
|
||||
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
|
||||
fh.write(cloudpickle.dumps(make_model))
|
||||
model = make_model()
|
||||
if load_path is not None:
|
||||
model.load(load_path)
|
||||
# Instantiate the runner object
|
||||
runner = Runner(env=env, model=model, nsteps=nsteps, gamma=gamma, lam=lam)
|
||||
if eval_env is not None:
|
||||
eval_runner = Runner(env = eval_env, model = model, nsteps = nsteps, gamma = gamma, lam= lam)
|
||||
|
||||
|
||||
|
||||
epinfobuf = deque(maxlen=100)
|
||||
if eval_env is not None:
|
||||
eval_epinfobuf = deque(maxlen=100)
|
||||
|
||||
# Start total timer
|
||||
tfirststart = time.time()
|
||||
|
||||
nupdates = total_timesteps//nbatch
|
||||
for update in range(1, nupdates+1):
|
||||
assert nbatch % nminibatches == 0
|
||||
# Start timer
|
||||
tstart = time.time()
|
||||
frac = 1.0 - (update - 1.0) / nupdates
|
||||
# Calculate the learning rate
|
||||
lrnow = lr(frac)
|
||||
# Calculate the cliprange
|
||||
cliprangenow = cliprange(frac)
|
||||
# Get minibatch
|
||||
obs, returns, masks, actions, values, neglogpacs, states, epinfos = runner.run() #pylint: disable=E0632
|
||||
if eval_env is not None:
|
||||
eval_obs, eval_returns, eval_masks, eval_actions, eval_values, eval_neglogpacs, eval_states, eval_epinfos = eval_runner.run() #pylint: disable=E0632
|
||||
|
||||
epinfobuf.extend(epinfos)
|
||||
if eval_env is not None:
|
||||
eval_epinfobuf.extend(eval_epinfos)
|
||||
|
||||
# Here what we're going to do is for each minibatch calculate the loss and append it.
|
||||
mblossvals = []
|
||||
if states is None: # nonrecurrent version
|
||||
# Index of each element of batch_size
|
||||
# Create the indices array
|
||||
inds = np.arange(nbatch)
|
||||
for _ in range(noptepochs):
|
||||
# Randomize the indexes
|
||||
np.random.shuffle(inds)
|
||||
# 0 to batch_size with batch_train_size step
|
||||
for start in range(0, nbatch, nbatch_train):
|
||||
end = start + nbatch_train
|
||||
mbinds = inds[start:end]
|
||||
@@ -378,15 +273,10 @@ def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2
|
||||
mbstates = states[mbenvinds]
|
||||
mblossvals.append(model.train(lrnow, cliprangenow, *slices, mbstates))
|
||||
|
||||
# Feedforward --> get losses --> update
|
||||
lossvals = np.mean(mblossvals, axis=0)
|
||||
# End timer
|
||||
tnow = time.time()
|
||||
# Calculate the fps (frame per second)
|
||||
fps = int(nbatch / (tnow - tstart))
|
||||
if update % log_interval == 0 or update == 1:
|
||||
# Calculates whether the value function is a good predictor of the returns (ev > 1)
|
||||
# or if it's just worse than predicting nothing (ev =< 0)
|
||||
ev = explained_variance(values, returns)
|
||||
logger.logkv("serial_timesteps", update*nsteps)
|
||||
logger.logkv("nupdates", update)
|
||||
@@ -395,22 +285,20 @@ def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2
|
||||
logger.logkv("explained_variance", float(ev))
|
||||
logger.logkv('eprewmean', safemean([epinfo['r'] for epinfo in epinfobuf]))
|
||||
logger.logkv('eplenmean', safemean([epinfo['l'] for epinfo in epinfobuf]))
|
||||
if eval_env is not None:
|
||||
logger.logkv('eval_eprewmean', safemean([epinfo['r'] for epinfo in eval_epinfobuf]) )
|
||||
logger.logkv('eval_eplenmean', safemean([epinfo['l'] for epinfo in eval_epinfobuf]) )
|
||||
logger.logkv('time_elapsed', tnow - tfirststart)
|
||||
for (lossval, lossname) in zip(lossvals, model.loss_names):
|
||||
logger.logkv(lossname, lossval)
|
||||
if MPI is None or MPI.COMM_WORLD.Get_rank() == 0:
|
||||
if MPI.COMM_WORLD.Get_rank() == 0:
|
||||
logger.dumpkvs()
|
||||
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir() and (MPI is None or MPI.COMM_WORLD.Get_rank() == 0):
|
||||
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir() and MPI.COMM_WORLD.Get_rank() == 0:
|
||||
checkdir = osp.join(logger.get_dir(), 'checkpoints')
|
||||
os.makedirs(checkdir, exist_ok=True)
|
||||
savepath = osp.join(checkdir, '%.5i'%update)
|
||||
print('Saving to', savepath)
|
||||
model.save(savepath)
|
||||
env.close()
|
||||
return model
|
||||
# Avoid division error when calculating the mean (if epinfos is empty we return np.nan instead of raising an error)
|
||||
|
||||
def safemean(xs):
|
||||
return np.nan if len(xs) == 0 else np.mean(xs)
|
||||
|
||||
|
@@ -10,8 +10,6 @@ from baselines.bench.monitor import load_results
X_TIMESTEPS = 'timesteps'
X_EPISODES = 'episodes'
X_WALLTIME = 'walltime_hrs'
Y_REWARD = 'reward'
Y_TIMESTEPS = 'timesteps'
POSSIBLE_X_AXES = [X_TIMESTEPS, X_EPISODES, X_WALLTIME]
EPISODES_WINDOW = 100
COLORS = ['blue', 'green', 'red', 'cyan', 'magenta', 'yellow', 'black', 'purple', 'pink',
@@ -28,25 +26,22 @@ def window_func(x, y, window, func):
yw_func = func(yw, axis=-1)
return x[window-1:], yw_func

def ts2xy(ts, xaxis, yaxis):
def ts2xy(ts, xaxis):
if xaxis == X_TIMESTEPS:
x = np.cumsum(ts.l.values)
y = ts.r.values
elif xaxis == X_EPISODES:
x = np.arange(len(ts))
y = ts.r.values
elif xaxis == X_WALLTIME:
x = ts.t.values / 3600.
else:
raise NotImplementedError
if yaxis == Y_REWARD:
y = ts.r.values
elif yaxis == Y_TIMESTEPS:
y = ts.l.values
else:
raise NotImplementedError
return x, y

def plot_curves(xy_list, xaxis, yaxis, title):
fig = plt.figure(figsize=(8,2))
def plot_curves(xy_list, xaxis, title):
plt.figure(figsize=(8,2))
maxx = max(xy[0][-1] for xy in xy_list)
minx = 0
for (i, (x, y)) in enumerate(xy_list):
@@ -57,19 +52,17 @@ def plot_curves(xy_list, xaxis, yaxis, title):
plt.xlim(minx, maxx)
plt.title(title)
plt.xlabel(xaxis)
plt.ylabel(yaxis)
plt.ylabel("Episode Rewards")
plt.tight_layout()
fig.canvas.mpl_connect('resize_event', lambda event: plt.tight_layout())
plt.grid(True)

def plot_results(dirs, num_timesteps, xaxis, yaxis, task_name):
def plot_results(dirs, num_timesteps, xaxis, task_name):
tslist = []
for dir in dirs:
ts = load_results(dir)
ts = ts[ts.l.cumsum() <= num_timesteps]
tslist.append(ts)
xy_list = [ts2xy(ts, xaxis, yaxis) for ts in tslist]
plot_curves(xy_list, xaxis, yaxis, task_name)
xy_list = [ts2xy(ts, xaxis) for ts in tslist]
plot_curves(xy_list, xaxis, task_name)

# Example usage in jupyter-notebook
# from baselines import log_viewer
@@ -84,11 +77,10 @@ def main():
parser.add_argument('--dirs', help='List of log directories', nargs = '*', default=['./log'])
parser.add_argument('--num_timesteps', type=int, default=int(10e6))
parser.add_argument('--xaxis', help = 'Variable on X-axis', default = X_TIMESTEPS)
parser.add_argument('--yaxis', help = 'Variable on Y-axis', default = Y_REWARD)
parser.add_argument('--task_name', help = 'Title of plot', default = 'Breakout')
args = parser.parse_args()
args.dirs = [os.path.abspath(dir) for dir in args.dirs]
plot_results(args.dirs, args.num_timesteps, args.xaxis, args.yaxis, args.task_name)
plot_results(args.dirs, args.num_timesteps, args.xaxis, args.task_name)
plt.show()

if __name__ == '__main__':
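A minimal usage sketch of the yaxis-aware variant of `plot_results` in this hunk. The module name is taken from the usage comment above (`log_viewer`) and may need adjusting, and it assumes Monitor logs already exist under `./log`:

```python
import matplotlib.pyplot as plt
from baselines import log_viewer  # module name assumed from the usage comment above

# plot episode reward against cumulative timesteps for one directory of Monitor logs
log_viewer.plot_results(['./log'], num_timesteps=int(10e6),
                        xaxis=log_viewer.X_TIMESTEPS, yaxis=log_viewer.Y_REWARD,
                        task_name='Breakout')
plt.show()
```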
132 baselines/run.py
@@ -1,45 +1,37 @@
import sys
import multiprocessing
import os
import os.path as osp
import gym
from collections import defaultdict
import tensorflow as tf
import numpy as np

from baselines.common.vec_env.vec_video_recorder import VecVideoRecorder
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.common.cmd_util import common_arg_parser, parse_unknown_args, make_vec_env, make_env
from baselines.common.tf_util import get_session
from baselines import logger
from baselines.common.cmd_util import common_arg_parser, parse_unknown_args, make_mujoco_env, make_atari_env
from baselines.common.tf_util import save_state, load_state, get_session
from baselines import bench, logger
from importlib import import_module

from baselines.common.vec_env.vec_normalize import VecNormalize
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common import atari_wrappers, retro_wrappers

try:
from mpi4py import MPI
except ImportError:
MPI = None

try:
import pybullet_envs
except ImportError:
pybullet_envs = None

try:
import roboschool
except ImportError:
roboschool = None

_game_envs = defaultdict(set)
for env in gym.envs.registry.all():
# TODO: solve this with regexes
# solve this with regexes
env_type = env._entry_point.split(':')[0].split('.')[-1]
_game_envs[env_type].add(env.id)

# reading benchmark names directly from retro requires
# importing retro here, and for some reason that crashes tensorflow
# in ubuntu
_game_envs['retro'] = {
_game_envs['retro'] = set([
'BubbleBobble-Nes',
'SuperMarioBros-Nes',
'TwinBee3PokoPokoDaimaou-Nes',
@@ -48,12 +40,11 @@ _game_envs['retro'] = {
'Vectorman-Genesis',
'FinalFight-Snes',
'SpaceInvaders-Snes',
}
])
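The three try/except blocks above make mpi4py, pybullet and roboschool optional dependencies. A minimal sketch of the pattern, and of how later code branches on the `None` sentinel (taken from the hunks in this diff):

```python
try:
    from mpi4py import MPI
except ImportError:
    # fall back to single-process behaviour when mpi4py is not installed
    MPI = None

rank = MPI.COMM_WORLD.Get_rank() if MPI is not None else 0
nworkers = MPI.COMM_WORLD.Get_size() if MPI is not None else 1
```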
def train(args, extra_args):
env_type, env_id = get_env_type(args.env)
print('env_type: {}'.format(env_type))

total_timesteps = int(args.num_timesteps)
seed = args.seed
@@ -63,8 +54,6 @@ def train(args, extra_args):
alg_kwargs.update(extra_args)

env = build_env(args)
if args.save_video_interval != 0:
env = VecVideoRecorder(env, osp.join(logger.Logger.CURRENT.dir, "videos"), record_video_trigger=lambda x: x % args.save_video_interval == 0, video_length=args.save_video_length)

if args.network:
alg_kwargs['network'] = args.network
@@ -72,6 +61,8 @@ def train(args, extra_args):
if alg_kwargs.get('network') is None:
alg_kwargs['network'] = get_default_network(env_type)

print('Training {} on {}:{} with arguments \n{}'.format(args.alg, env_type, env_id, alg_kwargs))

model = learn(
@@ -84,36 +75,61 @@ def train(args, extra_args):
return model, env
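Schematically, `train` assembles the learner's keyword arguments in three steps: start from the algorithm's per-env-type defaults, let the `key=value` command-line extras override them, then fall back to a default network. The concrete values below are stand-ins for illustration only:

```python
alg_kwargs = dict(nsteps=128, network=None)  # stand-in for get_learn_function_defaults(alg, env_type)
extra_args = dict(lr=3e-4, nsteps=256)       # stand-in for the parsed command-line extras
alg_kwargs.update(extra_args)                # command-line extras win over the shipped defaults
if alg_kwargs.get('network') is None:
    alg_kwargs['network'] = 'cnn'            # stand-in for get_default_network(env_type)
```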
def build_env(args):
def build_env(args, render=False):
ncpu = multiprocessing.cpu_count()
if sys.platform == 'darwin': ncpu //= 2
nenv = args.num_env or ncpu
nenv = args.num_env or ncpu if not render else 1
alg = args.alg
rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
seed = args.seed

env_type, env_id = get_env_type(args.env)
if env_type == 'mujoco':
get_session(tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=1,
inter_op_parallelism_threads=1))

if env_type in {'atari', 'retro'}:
if alg == 'deepq':
env = make_env(env_id, env_type, seed=seed, wrapper_kwargs={'frame_stack': True})
if args.num_env:
env = SubprocVecEnv([lambda: make_mujoco_env(env_id, seed + i if seed is not None else None, args.reward_scale) for i in range(args.num_env)])
else:
env = DummyVecEnv([lambda: make_mujoco_env(env_id, seed, args.reward_scale)])

env = VecNormalize(env)

elif env_type == 'atari':
if alg == 'acer':
env = make_atari_env(env_id, nenv, seed)
elif alg == 'deepq':
env = atari_wrappers.make_atari(env_id)
env.seed(seed)
env = bench.Monitor(env, logger.get_dir())
env = atari_wrappers.wrap_deepmind(env, frame_stack=True, scale=True)
elif alg == 'trpo_mpi':
env = make_env(env_id, env_type, seed=seed)
env = atari_wrappers.make_atari(env_id)
env.seed(seed)
env = bench.Monitor(env, logger.get_dir() and osp.join(logger.get_dir(), str(rank)))
env = atari_wrappers.wrap_deepmind(env)
# TODO check if the second seeding is necessary, and eventually remove
env.seed(seed)
else:
frame_stack_size = 4
env = make_vec_env(env_id, env_type, nenv, seed, gamestate=args.gamestate, reward_scale=args.reward_scale)
env = VecFrameStack(env, frame_stack_size)
env = VecFrameStack(make_atari_env(env_id, nenv, seed), frame_stack_size)

else:
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=1,
inter_op_parallelism_threads=1)
config.gpu_options.allow_growth = True
get_session(config=config)
elif env_type == 'retro':
import retro
gamestate = args.gamestate or 'Level1-1'
env = retro_wrappers.make_retro(game=args.env, state=gamestate, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE)
env.seed(args.seed)
env = bench.Monitor(env, logger.get_dir())
env = retro_wrappers.wrap_deepmind_retro(env)

env = make_vec_env(env_id, env_type, args.num_env or 1, seed, reward_scale=args.reward_scale)
elif env_type == 'classic':
def make_env():
e = gym.make(env_id)
e.seed(seed)
return e

if env_type == 'mujoco':
env = VecNormalize(env)
env = DummyVecEnv([make_env])

return env
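Stripped of the per-algorithm branches, both sides of this hunk follow the same pattern: wrap an environment thunk in a vectorized env, then (for Atari) stack recent frames. A minimal sketch using the classes already imported above; the env id and stack size are illustrative, and the real code also applies the DeepMind Atari wrappers before stacking:

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.vec_frame_stack import VecFrameStack

def make_thunk(env_id, seed):
    def _thunk():
        e = gym.make(env_id)
        e.seed(seed)
        return e
    return _thunk

venv = DummyVecEnv([make_thunk('PongNoFrameskip-v4', seed=0)])  # single-process vectorized env
venv = VecFrameStack(venv, 4)  # stack the 4 most recent observations, as in the Atari branch
```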
@@ -132,12 +148,13 @@ def get_env_type(env_id):

return env_type, env_id

def get_default_network(env_type):
if env_type in {'atari', 'retro'}:
return 'cnn'
else:
if env_type == 'mujoco' or env_type=='classic':
return 'mlp'
if env_type == 'atari':
return 'cnn'

raise ValueError('Unknown env_type {}'.format(env_type))

def get_alg_module(alg, submodule=None):
submodule = submodule or alg
@@ -154,7 +171,6 @@ def get_alg_module(alg, submodule=None):
def get_learn_function(alg):
return get_alg_module(alg).learn

def get_learn_function_defaults(alg, env_type):
try:
alg_defaults = get_alg_module(alg, 'defaults')
@@ -163,13 +179,10 @@ def get_learn_function_defaults(alg, env_type):
kwargs = {}
return kwargs

def parse_cmdline_kwargs(args):
'''
convert a list of '='-separated command-line arguments to a dictionary, evaluating python objects when possible
'''
def parse(v):
'''
convert the value of a command-line arg to a python object if possible, otherwise keep it as a string
'''
assert isinstance(v, str)
try:
@@ -177,16 +190,14 @@ def parse_cmdline_kwargs(args):
except (NameError, SyntaxError):
return v

return {k: parse(v) for k,v in parse_unknown_args(args).items()}
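The intent of `parse_cmdline_kwargs` above is to turn extra `key=value` arguments into Python objects when possible and leave them as strings otherwise. A standalone approximation of that behaviour (it uses `ast.literal_eval` rather than whatever evaluation the real code performs, so treat it as a sketch):

```python
import ast

def parse_value(v):
    # best effort: '3e-4' -> 0.0003, '[32, 32]' -> a list, 'mlp' stays a string
    assert isinstance(v, str)
    try:
        return ast.literal_eval(v)
    except (ValueError, SyntaxError):
        return v

extras = dict(kv.split('=', 1) for kv in ['lr=3e-4', 'network=mlp', 'num_layers=2'])
print({k: parse_value(v) for k, v in extras.items()})
# -> {'lr': 0.0003, 'network': 'mlp', 'num_layers': 2}
```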
def main():
# configure logger, disable logging in child MPI processes (with rank > 0)

arg_parser = common_arg_parser()
args, unknown_args = arg_parser.parse_known_args()
extra_args = parse_cmdline_kwargs(unknown_args)
extra_args = {k: parse(v) for k,v in parse_unknown_args(unknown_args).items()}

if MPI is None or MPI.COMM_WORLD.Get_rank() == 0:
rank = 0
@@ -195,30 +206,25 @@ def main():
logger.configure(format_strs = [])
rank = MPI.COMM_WORLD.Get_rank()

model, env = train(args, extra_args)
env.close()
model, _ = train(args, extra_args)

if args.save_path is not None and rank == 0:
save_path = osp.expanduser(args.save_path)
model.save(save_path)

if args.play:
logger.log("Running trained model")
env = build_env(args)
env = build_env(args, render=True)
obs = env.reset()
def initialize_placeholders(nlstm=128,**kwargs):
return np.zeros((args.num_env or 1, 2*nlstm)), np.zeros((1))
state, dones = initialize_placeholders(**extra_args)
while True:
actions, _, state, _ = model.step(obs,S=state, M=dones)
actions = model.step(obs)[0]
obs, _, done, _ = env.step(actions)
env.render()
done = done.any() if isinstance(done, np.ndarray) else done

if done:
obs = env.reset()

env.close()

if __name__ == '__main__':
main()
@@ -2,6 +2,5 @@

- Original paper: https://arxiv.org/abs/1502.05477
- Baselines blog post: https://blog.openai.com/openai-baselines-ppo/
- `mpirun -np 16 python -m baselines.run --alg=trpo_mpi --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- `python -m baselines.run --alg=trpo_mpi --env=Ant-v2 --num_timesteps=1e6` runs the algorithm for 1M timesteps on a Mujoco Ant environment.
- also refer to the repo-wide [README.md](../../README.md#training-models)
- `mpirun -np 16 python -m baselines.trpo_mpi.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.trpo_mpi.run_mujoco` runs the algorithm for 1M timesteps on a Mujoco environment.
@@ -1,4 +1,4 @@
from baselines.common.models import mlp, cnn_small
from rl_common.models import mlp, cnn_small

def atari():
@@ -4,6 +4,7 @@ import baselines.common.tf_util as U
import tensorflow as tf, numpy as np
import time
from baselines.common import colorize
from mpi4py import MPI
from collections import deque
from baselines.common import set_global_seeds
from baselines.common.mpi_adam import MpiAdam
@@ -12,11 +13,6 @@ from baselines.common.input import observation_placeholder
from baselines.common.policies import build_policy
from contextlib import contextmanager

try:
from mpi4py import MPI
except ImportError:
MPI = None

def traj_segment_generator(pi, env, horizon, stochastic):
# Initialize state variables
t = 0
@@ -96,7 +92,7 @@ def learn(*,
gamma=0.99,
lam=1.0, # advantage estimation
seed=None,
ent_coef=0.0,
entcoeff=0.0,
cg_damping=1e-2,
vf_stepsize=3e-4,
vf_iters=3,
@@ -121,7 +117,7 @@ def learn(*,

max_kl      max KL divergence between old policy and new policy ( KL(pi_old || pi) )

ent_coef    coefficient of policy entropy term in the optimization objective
entcoeff    coefficient of policy entropy term in the optimization objective

cg_iters    number of iterations of conjugate gradient algorithm

@@ -150,12 +146,9 @@ def learn(*,

'''

if MPI is not None:
nworkers = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
else:
nworkers = 1
rank = 0

cpus_per_worker = 1
U.get_session(config=tf.ConfigProto(
@@ -189,7 +182,7 @@ def learn(*,
ent = pi.pd.entropy()
meankl = tf.reduce_mean(kloldnew)
meanent = tf.reduce_mean(ent)
entbonus = ent_coef * meanent
entbonus = entcoeff * meanent

vferr = tf.reduce_mean(tf.square(pi.vf - ret))

@@ -244,13 +237,9 @@ def learn(*,

def allmean(x):
assert isinstance(x, np.ndarray)
if MPI is not None:
out = np.empty_like(x)
MPI.COMM_WORLD.Allreduce(x, out, op=MPI.SUM)
out /= nworkers
else:
out = np.copy(x)
return out

U.initialize()
@@ -258,9 +247,7 @@ def learn(*,
pi.load(load_path)

th_init = get_flat()
if MPI is not None:
MPI.COMM_WORLD.Bcast(th_init, root=0)
set_from_flat(th_init)
vfadam.sync()
print("Init param sum", th_init.sum(), flush=True)
@@ -366,11 +353,7 @@ def learn(*,
logger.record_tabular("ev_tdlam_before", explained_variance(vpredbefore, tdlamret))

lrlocal = (seg["ep_lens"], seg["ep_rets"]) # local values
if MPI is not None:
listoflrpairs = MPI.COMM_WORLD.allgather(lrlocal) # list of tuples
else:
listoflrpairs = [lrlocal]

lens, rews = map(flatten_lists, zip(*listoflrpairs))
lenbuffer.extend(lens)
rewbuffer.extend(rews)
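The MPI-optional pieces of this hunk, assembled into a self-contained sketch: average an array across workers when mpi4py is available, and fall back to a plain copy when it is not.

```python
import numpy as np

try:
    from mpi4py import MPI
except ImportError:
    MPI = None

nworkers = MPI.COMM_WORLD.Get_size() if MPI is not None else 1

def allmean(x):
    # average a statistics/gradient vector across MPI workers,
    # or just copy it when running single-process
    assert isinstance(x, np.ndarray)
    if MPI is not None:
        out = np.empty_like(x)
        MPI.COMM_WORLD.Allreduce(x, out, op=MPI.SUM)
        out /= nworkers
    else:
        out = np.copy(x)
    return out
```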
16789 benchmarks_atari10M.htm
File diff suppressed because it is too large
19 conftest.py (new file)
@@ -0,0 +1,19 @@
import pytest

def pytest_addoption(parser):
parser.addoption('--runslow', action='store_true', default=False, help='run slow tests')

def pytest_collection_modifyitems(config, items):
if config.getoption('--runslow'):
# --runslow given in cli: do not skip slow tests
return
skip_slow = pytest.mark.skip(reason='need --runslow option to run')
slow_tests = []
for item in items:
if 'slow' in item.keywords:
slow_tests.append(item.name)
item.add_marker(skip_slow)

print('skipping slow tests', ' '.join(slow_tests), 'use --runslow to run this')
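With the conftest.py above, any test carrying the `slow` marker is collected but skipped unless `--runslow` is passed (this is what the Travis/Docker `pytest --runslow` invocation relies on). A usage sketch; the test name and body are hypothetical:

```python
import pytest

@pytest.mark.slow  # conftest.py detects 'slow' in item.keywords and skips it by default
def test_ppo2_learns_cartpole():  # hypothetical test, for illustration only
    assert True  # placeholder body

# by default:       pytest            -> this test is reported as skipped
# to actually run:  pytest --runslow
```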
Binary file not shown (before: 68 KiB image).
Some files were not shown because too many files have changed in this diff.