Compare commits

...

96 Commits

Author SHA1 Message Date
Peter Zhokhov
b650cd862e lstm network builders using tf lstm 2018-08-10 14:24:55 -07:00
Peter Zhokhov
217b111c88 merged refactor 2018-08-10 14:14:46 -07:00
Adam Gleave
f272969325 GAIL: bugfix in dataset loading (#447)
* Fix silly typo

* Replace ad-hoc function with NumPy code
2018-07-06 16:12:14 -07:00
pzhokhov
a6b1bc70f1 re-import internal; fix missing tile_images.py (#427)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity

* import internal

* adding missing tile_images.py
2018-06-08 09:41:45 -07:00
pzhokhov
36ee5d1707 Import internal changes (#422)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity

* import internal
2018-06-06 11:39:13 -07:00
pzhokhov
24fe3d6576 Import internal repo (#409)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity
2018-05-21 15:24:00 -07:00
pzhokhov
9cb7ece338 add opencv-python to the dependencies (#407) 2018-05-14 10:52:19 -07:00
pzhokhov
9cf95a0054 setup travis ci build (#388)
* simple .travis.yml file

* added static syntax checks of common to .travis.yml

* dockerizing the build

* fix Dockerfile, adding build shield

* cleaning up workdir in Dockerfile and .travis.yml

* .travis.yml fixed common -> baselines/common for style check
2018-05-03 09:43:28 -07:00
pzhokhov
8b781038cc put filters and running_stat files in common instead of acktr (#389) 2018-05-02 18:42:48 -07:00
pzhokhov
69f25c6028 import internal repo (#385) 2018-05-01 16:54:04 -07:00
pzhokhov
2b0283b9db Readme.md detailed installation instructions (#377)
* changes to README.md files with more detailed installation instructions

* md-fying the changes better

* link on the word homebrew in readme.md

* typos in README.md

* README.md

* removed extra comma sign

* removed sudo from brew command
2018-04-25 17:40:48 -07:00
Matthias Plappert
1f8a03f3a6 Update README 2018-03-26 16:50:22 +02:00
Matthias Plappert
3cc7df0608 Minor fixes to HER release (#319)
* Fix plotting script

* Add warning if num_cpu = 1
2018-03-05 11:06:17 +01:00
Alex Nichol
8b3a6c2051 fix DummyVecEnv reusing buffers 2018-03-02 17:18:07 -08:00
Alex Nichol
569bd42629 Merge pull request #308 from araffin/master
Bug fix in saving ACER model
2018-03-01 10:45:04 -08:00
Daniel Ziegler
f49a9c3d85 Fix bug in DDPG parameter space noise adaptation (#306)
The training loop used the rollout step variable `t` rather than the
training step variable `t_train` to decide when to adapt the scale of
the parameter space noise.
2018-03-01 18:00:34 +01:00
Antonin RAFFIN
14f2f9328c Bug fix in saving ACER model 2018-03-01 10:24:14 +01:00
Alex Nichol
6bdf2f55a2 Merge pull request #132 from bhatiaabhinav/bug_fixes
Bug fix in saving a2c model.
2018-02-27 19:00:37 -08:00
Alex Nichol
97be70d6c8 fixes for DummyVecEnv
Fixes various problems running MuJoCo tasks.
2018-02-27 18:55:10 -08:00
Matthias Plappert
b71152eea0 Adds support for Hindsight Experience Replay (HER) (#299)
* Add Hindsight Experience Replay (HER)

* Minor improvements
2018-02-26 17:40:16 +01:00
Christopher Hesse
df2e846ab7 export: fix accidental rename 2018-02-14 22:01:16 -08:00
Christopher Hesse
edb52c22a5 export: Fix deepq param noise refactoring, remove atari experiments and azure dependency 2018-02-14 21:42:22 -08:00
Andrei Kashin
98257ef8c9 Flush temporary file before compressing it.
We need to flush the buffer after `pickle.dump`, otherwise the resulting zip archive might be incomplete (reproducible, if the state consists of a single integer).
2018-02-06 07:04:44 -08:00
Oleg Klimov
d9b36601d9 comment about loading weights in ppo2 2018-02-05 12:25:05 -08:00
Oleg Klimov
2793971c10 fix gail tf_util usage 2018-02-05 07:51:27 -08:00
John Schulman
16d7d23b7d Merge pull request #271 from simontudo/add-requirement-cloudpickle
added cloudpickle to requirements
2018-02-02 23:04:53 -08:00
John Schulman
9175b770c6 Merge pull request #273 from simontudo/videorecorder-import
updated videorecorder import
2018-02-02 23:03:51 -08:00
simontudo
615870ad6b updated videorecorder import 2018-02-01 12:09:08 +01:00
simontudo
7bd264e0e9 added cloudpickle to requirements 2018-01-31 10:43:17 +01:00
John Schulman
8d03102d4d Merge pull request #265 from 20chase/patch-1
fix logger error for trpo_mpi
2018-01-29 00:54:51 -08:00
20chase
4a77855529 using mujoco_arg_parser as args
remove origin parser
2018-01-29 16:52:01 +08:00
John Schulman
2e29b41592 Merge pull request #268 from ei-grad/master
Fix fc call in AcerLstmPolicy
2018-01-27 18:42:31 -08:00
Andrew Grigorev
634e37c5b8 Fix fc call in AcerLstmPolicy
The `act` keyword was removed from baselines.a2c.utils.fc in commit 9fa8e1b.
2018-01-27 23:18:02 +03:00
20chase
452b548c2a Merge branch 'master' into patch-1 2018-01-26 14:34:01 +08:00
John Schulman
ebb8afff2e fix trpo_mpi bug where logstd wasn’t included 2018-01-25 21:17:40 -08:00
John Schulman
c9613b2293 Merge pull request #259 from andrewliao11/openai_gail
Add gail maintainer list
2018-01-25 20:54:34 -08:00
John Schulman
459f007bcc Merge pull request #260 from uidilr/master
Add GAIL
2018-01-25 20:54:20 -08:00
John Schulman
9fa8e1baf1 Lots of cleanups
Fixes for new gym version
Add @olegklimov and @unixpickle to authors list
2018-01-25 18:54:24 -08:00
20chase
ac2ea4f31f fix logger error for MPI
Can't run logger.configure() if rank != 0
2018-01-25 22:09:00 +08:00
Yusuke Nakata
d8cce2309f Add GAIL 2018-01-23 12:02:03 +09:00
andrew
0c207f0185 fix typo 2018-01-21 22:13:01 -08:00
andrew
41d41fabe3 add gail maintainer list 2018-01-21 22:12:03 -08:00
John Schulman
b5be53dc92 Merge pull request #229 from andrewliao11/gail
GAIL implementation
2018-01-21 20:30:20 -05:00
Matthias Plappert
49c1a8ec26 Fix bug in parameter space noise DQN 2018-01-16 10:24:30 -08:00
andrew
e5a714b070 fix relative import 2018-01-12 15:12:45 -08:00
John Schulman
f9d1d3349a remove mpirun from ppo2 instructions 2018-01-12 11:05:29 -08:00
Alex Nichol
8c90f67560 don't list TensorFlow as a requirement
fixes #146

A better (more involved) solution might be to check for a TensorFlow installation manually in setup.py and deal with that accordingly.
2017-12-15 15:54:43 -08:00
Andrew
f22bee085d Add files via upload 2017-12-12 19:03:42 -08:00
andrew
4acc71fe23 add x, y, axis name 2017-12-12 18:58:57 -08:00
andrew
2f1b629ecc Merge branch 'gail' of https://github.com/andrewliao11/baselines into gail 2017-12-12 18:56:00 -08:00
andrew
00573cf5e9 add x, y axis name 2017-12-12 18:54:03 -08:00
Andrew
cfa1236d78 Update README.md 2017-12-11 21:21:56 -08:00
Andrew
64288f9f84 Update gail-result.md 2017-12-11 21:19:47 -08:00
Andrew
5f647d4d34 Update README.md 2017-12-11 21:18:05 -08:00
Andrew
6723455b75 Update gail-result.md 2017-12-11 21:15:30 -08:00
Andrew
45a93cf2b9 add training curve from tensorboard 2017-12-11 21:06:04 -08:00
andrew
11604f7cc9 add download link to readme and add description to python file 2017-12-07 12:08:20 -08:00
John Schulman
2444034d11 Merge pull request #194 from ryanjulian/env_lines
Force shebang lines to Python 3
2017-12-04 14:07:01 -08:00
John Schulman
041b6b76b7 Merge pull request #215 from chris-chris/feature/typo-2017-11-19
fix misspellings
2017-12-04 14:02:49 -08:00
John Schulman
5d62b5bdaa Merge pull request #221 from jvmancuso/patch-1
Docstring fix
2017-12-04 14:01:38 -08:00
John Schulman
2fcc9b9572 Merge pull request #226 from definitelyuncertain/master
Call ppo2 and not ppo1 in ppo2 README.md
2017-12-04 14:01:12 -08:00
Andrew
000033973b Update gail-result.md 2017-12-03 15:50:24 -08:00
andrew
6090ee8292 add comparison for expert/BC/gail 2017-12-03 15:46:52 -08:00
andrew
7954327c5f add behavior cloning learn/eval code 2017-12-03 13:55:44 -08:00
andrew
8495890534 add gail, file_writer for tf.summary, and allow specifying var_list for tf.train.Saver 2017-12-03 01:49:42 -08:00
definitelyuncertain
643184935e Call ppo2 and not ppo1 2017-12-02 22:00:28 +05:30
jvmancuso
36e074da56 Update replay_buffer.py 2017-11-27 14:45:50 -05:00
Ubuntu
c33640932f fix misspellings 2017-11-19 01:29:30 +00:00
John Schulman
b05be68c55 add missing files, fix Issue #209 2017-11-16 22:14:30 -08:00
John Schulman
2dd7d307d7 Add ACER, PPO2, and results_plotter.py 2017-11-16 10:02:32 -08:00
Ryan Julian
df889caf11 Force shebang lines to Python 3
This is a Python 3-only library. A shebang with `#!/usr/bin/env python`
will launch python2 on many systems which do not have python3
installed. Setting the shebang to `#!/usr/bin/env python3` will show a
useful error on systems without Python 3.
2017-11-05 15:22:16 -08:00
John Schulman
6a3cbb4bc5 switch append mode to write mode 2017-10-25 22:20:30 -04:00
John Schulman
bb40378118 change atari preprocessing to use faster opencv
some logger changes
2017-10-25 09:21:29 -04:00
John Schulman
4993286230 Merge pull request #160 from mkarutz/fixFrameStackingA2C
Fixes frame stacking in A2C and ACKTR for multi-channel observations
2017-10-09 14:12:28 -07:00
Malcolm Karutz
cc8818f49e Fixes frame stacking in A2C and ACKTR for multi-channel observation spaces. 2017-10-09 13:08:41 +11:00
John Schulman
3eb71a0ece Merge pull request #151 from emansim/master
Fixes the NaN issues in ACKTR + bug in run_mujoco.py
2017-09-30 14:51:56 -07:00
Elman Mansimov
f8663eaf11 fixes acktr_cont issues 2017-09-30 17:21:04 -04:00
Abhinav Bhatia
3d1e171b3a Bug fix in saving a2c model. 2017-09-12 02:35:43 +08:00
John Schulman
699919f1cf Merge pull request #64 from jhumplik/master
Use standardized advantages in trpo.
2017-09-07 01:57:04 -07:00
John Schulman
498b4cfead Merge pull request #128 from louiehelm/louiehelm-patch-1
Fix command lines
2017-09-06 01:04:47 -07:00
Louie Helm
589387403b fix ppo command in readme 2017-09-05 06:06:19 -07:00
Louie Helm
3d3ea6cb16 fix trpo command in readme 2017-09-05 06:04:37 -07:00
John Schulman
902ffcb767 Merge pull request #120 from hamzamerzic/tensorflow_global_variable
Deprecated VARIABLES -> GLOBAL_VARIABLES.
2017-08-28 21:27:23 -07:00
Hamza Merzic
a7320b80c0 Deprecated VARIABLES -> GLOBAL_VARIABLES. 2017-08-28 16:51:48 +02:00
John Schulman
4e2a570eb4 Merge pull request #104 from stevenschmatz/patch-1
Fix relative links in README.md
2017-08-27 22:54:52 -07:00
John Schulman
6f39148452 fix gym req 2017-08-27 22:49:50 -07:00
John Schulman
2f30833043 Merge branch 'master' of github.com:openai/baselines 2017-08-27 22:36:44 -07:00
John Schulman
00cdeff35e add __init__.py 2017-08-27 22:36:24 -07:00
John Schulman
410ef38898 Merge pull request #103 from learnercys/master
Adding links to source files
2017-08-27 22:31:46 -07:00
John Schulman
aa6e58bdf1 fix readmes 2017-08-27 22:22:14 -07:00
John Schulman
d9f194f797 Fix atari wrapper (affecting a2c perf) and pposgd mujoco performance
- removed vf clipping in pposgd - that was severely degrading performance on mujoco because it didn’t account for scale of returns
- switched adam epsilon in pposgd_simple
- brought back no-ops in atari wrapper (oops)
- added readmes
- revamped run_X_benchmark scripts to have standard form
- cleaned up DDPG a little, removed deprecated SimpleMonitor and non-idiomatic usage of logger
2017-08-27 22:14:59 -07:00
Steven Schmatz
06b071c105 Fix relative links in README.md 2017-08-18 13:35:22 -04:00
John Schulman
3f676f7d1e ACKTR + A2C 2017-08-18 09:25:39 -07:00
Carlos Hernandez
b7966b31a5 Adding links to source files 2017-08-18 01:16:00 -06:00
Matthias Plappert
882251878f Parameter space noise for DQN and DDPG (#75)
* Export param noise

* Update documentation

* Final finishing touches
2017-07-27 08:10:59 -07:00
Jan Humplik
4862140cea Use standardized advantages in trpo. 2017-07-23 22:42:55 +02:00
183 changed files with 11656 additions and 2317 deletions

1
.benchmark_pattern Normal file

@@ -0,0 +1 @@
ppo2

4
.gitignore vendored

@@ -1,6 +1,8 @@
*.swp
*.pyc
*.pkl
*.py~
.pytest_cache
.DS_Store
.idea
@@ -30,3 +32,5 @@ src
*.egg-info
.cache
MUJOCO_LOG.TXT

14
.travis.yml Normal file

@@ -0,0 +1,14 @@
language: python
python:
- "3.6"
services:
- docker
install:
- pip install flake8
- docker build . -t baselines-test
script:
- flake8 --select=F,E999 baselines/common baselines/trpo_mpi baselines/ppo2 baselines/a2c baselines/deepq baselines/acer
- docker run baselines-test pytest --runslow

24
Dockerfile Normal file

@@ -0,0 +1,24 @@
FROM ubuntu:16.04
RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake python-opencv
ENV CODE_DIR /root/code
ENV VENV /root/venv
RUN \
pip install virtualenv && \
virtualenv $VENV --python=python3 && \
. $VENV/bin/activate && \
pip install --upgrade pip
ENV PATH=$VENV/bin:$PATH
COPY . $CODE_DIR/baselines
WORKDIR $CODE_DIR/baselines
# Clean up pycache and pyc files
RUN rm -rf __pycache__ && \
find . -name "*.pyc" -delete && \
pip install -e .[test]
CMD /bin/bash

134
README.md

@@ -1,4 +1,4 @@
<img src="data/logo.jpg" width=25% align="right" />
<img src="data/logo.jpg" width=25% align="right" /> [![Build status](https://travis-ci.org/openai/baselines.svg?branch=master)](https://travis-ci.org/openai/baselines)
# Baselines
@@ -6,12 +6,136 @@ OpenAI Baselines is a set of high-quality implementations of reinforcement learn
These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.
You can install it by typing:
## Prerequisites
Baselines requires python3 (>=3.5) with the development headers. You'll also need system packages CMake, OpenMPI and zlib. Those can be installed as follows
### Ubuntu
```bash
pip install baselines
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
```
### Mac OS X
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
```bash
brew install cmake openmpi
```
## Virtual environment
From the general python package sanity perspective, it is a good idea to use virtual environments (virtualenvs) to make sure packages from different projects do not interfere with each other. You can install virtualenv (which is itself a pip package) via
```bash
pip install virtualenv
```
Virtualenvs are essentially folders that have copies of python executable and all python packages.
To create a virtualenv called venv with python3, one runs
```bash
virtualenv /path/to/venv --python=python3
```
To activate a virtualenv:
```
. /path/to/venv/bin/activate
```
A more thorough tutorial on virtualenvs and their options can be found [here](https://virtualenv.pypa.io/en/stable/).
## Installation
Clone the repo and cd into it:
```bash
git clone https://github.com/openai/baselines.git
cd baselines
```
If using virtualenv, create a new virtualenv and activate it
```bash
virtualenv env --python=python3
. env/bin/activate
```
Install baselines package
```bash
pip install -e .
```
### MuJoCo
Some of the baselines examples use [MuJoCo](http://www.mujoco.org) (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py)
## Testing the installation
All unit tests in baselines can be run using pytest runner:
```
pip install pytest
pytest
```
## Subpackages
## Training models
Most of the algorithms in baselines repo are used as follows:
```bash
python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
```
### Example 1. PPO with MuJoCo Humanoid
For instance, to train a fully-connected network controlling MuJoCo humanoid using a2c for 20M timesteps
```bash
python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
```
Note that for mujoco environments fully-connected network is default, so we can omit `--network=mlp`
The hyperparameters for both network and the learning algorithm can be controlled via the command line, for instance:
```bash
python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
```
will set the entropy coefficient to 0.1, construct a fully connected network with 3 layers of 32 hidden units each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same).
See the docstrings in [common/models.py](common/models.py) for a description of the network parameters for each type of model, and
the docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) for a description of the ppo2 hyperparameters.
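
To make the mapping from command-line flags to keyword arguments concrete, here is a minimal sketch of how `--key=value` pairs such as `--num_hidden=32` could be turned into a dictionary and forwarded to a learner. The helper names `parse_extra_args` and `_coerce` are hypothetical and are not taken from the baselines code base; the actual parser in `baselines.run` may behave differently.

```python
def _coerce(raw):
    """Try int, then float; fall back to the raw string (hypothetical helper)."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_extra_args(argv):
    """Turn ['--num_hidden=32', '--ent_coef=0.1'] into {'num_hidden': 32, 'ent_coef': 0.1}.

    Assumption: every extra argument has the form --key=value.
    """
    kwargs = {}
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError("expected --key=value, got {!r}".format(arg))
        key, raw = arg[2:].split("=", 1)
        kwargs[key] = _coerce(raw)
    return kwargs


extra = parse_extra_args(
    ["--ent_coef=0.1", "--num_hidden=32", "--num_layers=3", "--value_network=copy"]
)
print(extra)
```

A dictionary like this could then be passed on as `**network_kwargs`, which is how `learn()` functions in this repository accept network options.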
### Example 2. DQN on Atari
DQN with Atari is by now a classic benchmark. To run the baselines implementation of DQN on Atari Pong:
```
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
```
## Saving, loading and visualizing models
The algorithms serialization API is not properly unified yet; however, there is a simple method to save / restore trained models.
The `--save_path` and `--load_path` command-line options save the tensorflow state to a given path after training and load it from a given path before training, respectively.
Let's imagine you'd like to train ppo2 on Atari Pong, save the model, and then later visualize what it has learnt.
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num-timesteps=2e7 --save_path=~/models/pong_20M_ppo2
```
This should reach a mean reward per episode of about 20 (a Pong game ends at 21 points, so 21 is the maximum). To load and visualize the model, we'll do the following: load the model, train it for 0 steps, and then visualize:
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num-timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
```
*NOTE:* At the moment Mujoco training uses VecNormalize wrapper for the environment which is not being saved correctly; so loading the models trained on Mujoco will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd by TfRunningMeanStd in [baselines/common/vec_env/vec_normalize.py](baselines/common/vec_env/vec_normalize.py#L12). This way, mean and std of environment normalizing wrapper will be saved in tensorflow variables and included in the model file; however, training is slower that way - hence not including it by default
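
To illustrate why the normalization statistics matter when a model is reloaded, here is a minimal NumPy sketch of a running mean/variance tracker of the kind an observation-normalizing wrapper needs. It is an illustrative stand-in, not the `RunningMeanStd` or `TfRunningMeanStd` class from this repository; the class name `RunningMeanStdSketch`, the `epsilon` pseudo-count, and the clipping range are assumptions.

```python
import numpy as np


class RunningMeanStdSketch:
    """Tracks mean and variance over batches (hypothetical illustration)."""

    def __init__(self, shape=(), epsilon=1e-4):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = epsilon  # tiny pseudo-count avoids division by zero

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]

        # Combine the stored statistics with the batch statistics
        # (parallel mean/variance formula).
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = (self.var * self.count
              + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, x, clip=10.0):
        # The clip range is an assumption made for this sketch.
        return np.clip((x - self.mean) / np.sqrt(self.var + 1e-8), -clip, clip)


data = np.arange(100, dtype=np.float64).reshape(50, 2)
rms = RunningMeanStdSketch(shape=(2,))
rms.update(data[:25])
rms.update(data[25:])
print(rms.mean, rms.var)
```

If `mean` and `var` live only in Python attributes like this, they are lost unless saved separately. A policy restored into a freshly created environment would then see differently scaled observations, which is the problem the note above describes.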
## Subpackages
- [A2C](baselines/a2c)
- [ACER](baselines/acer)
- [ACKTR](baselines/acktr)
- [DDPG](baselines/ddpg)
- [DQN](baselines/deepq)
- [PPO](baselines/pposgd)
- [GAIL](baselines/gail)
- [HER](baselines/her)
- [PPO1](baselines/ppo1) (Multi-CPU using MPI)
- [PPO2](baselines/ppo2) (Optimized for GPU)
- [TRPO](baselines/trpo_mpi)
To cite this repository in publications:
@misc{baselines,
author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
title = {OpenAI Baselines},
year = {2017},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/openai/baselines}},
}

5
baselines/a2c/README.md Normal file

@@ -0,0 +1,5 @@
# A2C
- Original paper: https://arxiv.org/abs/1602.01783
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.a2c.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.

178
baselines/a2c/a2c.py Normal file

@@ -0,0 +1,178 @@
import time
import functools
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.common import tf_util
from baselines.common.policies import build_policy
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.runner import Runner
from tensorflow import losses
class Model(object):

    def __init__(self, policy, env, nsteps,
            ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5, lr=7e-4,
            alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):

        sess = tf_util.get_session()
        nenvs = env.num_envs
        nbatch = nenvs*nsteps

        with tf.variable_scope('a2c_model', reuse=tf.AUTO_REUSE):
            step_model = policy(nenvs, 1, sess)
            train_model = policy(nbatch, nsteps, sess)

        A = tf.placeholder(train_model.action.dtype, train_model.action.shape)
        ADV = tf.placeholder(tf.float32, [nbatch])
        R = tf.placeholder(tf.float32, [nbatch])
        LR = tf.placeholder(tf.float32, [])

        neglogpac = train_model.pd.neglogp(A)
        entropy = tf.reduce_mean(train_model.pd.entropy())

        pg_loss = tf.reduce_mean(ADV * neglogpac)
        vf_loss = losses.mean_squared_error(tf.squeeze(train_model.vf), R)

        loss = pg_loss - entropy*ent_coef + vf_loss * vf_coef

        params = find_trainable_variables("a2c_model")
        grads = tf.gradients(loss, params)
        if max_grad_norm is not None:
            grads, grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
        grads = list(zip(grads, params))
        trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=alpha, epsilon=epsilon)
        _train = trainer.apply_gradients(grads)

        lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)

        def train(obs, states, rewards, masks, actions, values):
            advs = rewards - values
            for step in range(len(obs)):
                cur_lr = lr.value()

            td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, LR:cur_lr}
            if states is not None:
                td_map[train_model.S] = states
                td_map[train_model.M] = masks
            policy_loss, value_loss, policy_entropy, _ = sess.run(
                [pg_loss, vf_loss, entropy, _train],
                td_map
            )
            return policy_loss, value_loss, policy_entropy

        self.train = train
        self.train_model = train_model
        self.step_model = step_model
        self.step = step_model.step
        self.value = step_model.value
        self.initial_state = step_model.initial_state
        self.save = functools.partial(tf_util.save_variables, sess=sess)
        self.load = functools.partial(tf_util.load_variables, sess=sess)
        tf.global_variables_initializer().run(session=sess)

def learn(
    network,
    env,
    seed=None,
    nsteps=5,
    total_timesteps=int(80e6),
    vf_coef=0.5,
    ent_coef=0.01,
    max_grad_norm=0.5,
    lr=7e-4,
    lrschedule='linear',
    epsilon=1e-5,
    alpha=0.99,
    gamma=0.99,
    log_interval=100,
    load_path=None,
    **network_kwargs):

    '''
    Main entrypoint for A2C algorithm. Train a policy with given network architecture on a given environment using a2c algorithm.

    Parameters:
    -----------

    network:            policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
                        specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
                        tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
                        neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
                        See baselines.common/policies.py/lstm for more details on using recurrent nets in policies

    env:                RL environment. Should implement interface similar to VecEnv (baselines.common/vec_env) or be wrapped with DummyVecEnv (baselines.common/vec_env/dummy_vec_env.py)

    seed:               seed to make random number sequence in the algorithm reproducible. By default is None which means seed from system noise generator (not reproducible)

    nsteps:             int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
                        nenv is number of environment copies simulated in parallel)

    total_timesteps:    int, total number of timesteps to train on (default: 80M)

    vf_coef:            float, coefficient in front of value function loss in the total loss function (default: 0.5)

    ent_coef:           float, coefficient in front of the policy entropy in the total loss function (default: 0.01)

    max_grad_norm:      float, gradient is clipped to have global L2 norm no more than this value (default: 0.5)

    lr:                 float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)

    lrschedule:         schedule of learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes fraction of the training progress as input and
                        returns fraction of the learning rate (specified as lr) as output

    epsilon:            float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)

    alpha:              float, RMSProp decay parameter (default: 0.99)

    gamma:              float, reward discounting parameter (default: 0.99)

    log_interval:       int, specifies how frequently the logs are printed out (default: 100)

    **network_kwargs:   keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
                        For instance, 'mlp' network architecture has arguments num_hidden and num_layers.

    '''

    set_global_seeds(seed)

    nenvs = env.num_envs
    policy = build_policy(env, network, **network_kwargs)

    model = Model(policy=policy, env=env, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
        max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule)
    if load_path is not None:
        model.load(load_path)
    runner = Runner(env, model, nsteps=nsteps, gamma=gamma)

    nbatch = nenvs*nsteps
    tstart = time.time()
    for update in range(1, total_timesteps//nbatch+1):
        obs, states, rewards, masks, actions, values = runner.run()
        policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
        nseconds = time.time()-tstart
        fps = int((update*nbatch)/nseconds)
        if update % log_interval == 0 or update == 1:
            ev = explained_variance(values, rewards)
            logger.record_tabular("nupdates", update)
            logger.record_tabular("total_timesteps", update*nbatch)
            logger.record_tabular("fps", fps)
            logger.record_tabular("policy_entropy", float(policy_entropy))
            logger.record_tabular("value_loss", float(value_loss))
            logger.record_tabular("explained_variance", float(ev))
            logger.dump_tabular()
    env.close()
    return model
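
The loss assembled in `Model.__init__` above (`pg_loss - entropy*ent_coef + vf_loss * vf_coef`) can be checked by hand on a tiny batch. The following is a NumPy sketch of the same arithmetic for a categorical policy, not the TensorFlow graph itself. The function name `a2c_loss_sketch` and the toy inputs are assumptions made for illustration.

```python
import numpy as np


def a2c_loss_sketch(logits, actions, returns, values, ent_coef=0.01, vf_coef=0.5):
    """Return (total, pg_loss, entropy, vf_loss) for a categorical policy."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)

    # Negative log-probability of the actions that were actually taken.
    neglogp = -log_probs[np.arange(len(actions)), actions]

    # Advantages are plain numbers here, just as ADV is a placeholder above.
    advantages = returns - values

    pg_loss = np.mean(advantages * neglogp)
    entropy = np.mean(-(probs * log_probs).sum(axis=1))
    vf_loss = np.mean((values - returns) ** 2)
    total = pg_loss - ent_coef * entropy + vf_coef * vf_loss
    return total, pg_loss, entropy, vf_loss


# Two samples, two actions, uniform policy (all logits equal).
total, pg_loss, entropy, vf_loss = a2c_loss_sketch(
    logits=np.zeros((2, 2)),
    actions=np.array([0, 1]),
    returns=np.array([1.0, 0.0]),
    values=np.array([0.0, 0.0]),
)
print(total, pg_loss, entropy, vf_loss)
```

With a uniform two-action policy each negative log-probability is ln 2 and the entropy is ln 2. The policy-gradient term is therefore ln 2 / 2, the value loss is 0.5, and the entropy bonus lowers the total by 0.01 · ln 2.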

60
baselines/a2c/runner.py Normal file

@@ -0,0 +1,60 @@
import numpy as np
from baselines.a2c.utils import discount_with_dones
from baselines.common.runners import AbstractEnvRunner
class Runner(AbstractEnvRunner):

    def __init__(self, env, model, nsteps=5, gamma=0.99):
        super().__init__(env=env, model=model, nsteps=nsteps)
        self.gamma = gamma
        self.batch_action_shape = [x if x is not None else -1 for x in model.train_model.action.shape.as_list()]
        self.ob_dtype = model.train_model.X.dtype.as_numpy_dtype

    def run(self):
        mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
        mb_states = self.states
        for n in range(self.nsteps):
            actions, values, states, _ = self.model.step(self.obs, S=self.states, M=self.dones)
            mb_obs.append(np.copy(self.obs))
            mb_actions.append(actions)
            mb_values.append(values)
            mb_dones.append(self.dones)
            obs, rewards, dones, _ = self.env.step(actions)
            self.states = states
            self.dones = dones
            for n, done in enumerate(dones):
                if done:
                    self.obs[n] = self.obs[n]*0
            self.obs = obs
            mb_rewards.append(rewards)
        mb_dones.append(self.dones)
        #batch of steps to batch of rollouts
        mb_obs = np.asarray(mb_obs, dtype=self.ob_dtype).swapaxes(1, 0).reshape(self.batch_ob_shape)
        mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
        mb_actions = np.asarray(mb_actions, dtype=self.model.train_model.action.dtype.name).swapaxes(1, 0)
        mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
        mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
        mb_masks = mb_dones[:, :-1]
        mb_dones = mb_dones[:, 1:]

        if self.gamma > 0.0:
            #discount/bootstrap off value fn
            last_values = self.model.value(self.obs, S=self.states, M=self.dones).tolist()
            for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
                rewards = rewards.tolist()
                dones = dones.tolist()
                if dones[-1] == 0:
                    rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
                else:
                    rewards = discount_with_dones(rewards, dones, self.gamma)

                mb_rewards[n] = rewards

        mb_actions = mb_actions.reshape(self.batch_action_shape)

        mb_rewards = mb_rewards.flatten()
        mb_values = mb_values.flatten()
        mb_masks = mb_masks.flatten()
        return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
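
The bootstrapping branch in `Runner.run` above is easiest to follow on a small worked example. The sketch below repeats `discount_with_dones` as it appears in `baselines/a2c/utils.py` in this diff and applies it the way the runner does when a rollout ends mid-episode. The rewards, done flags, `gamma`, and bootstrap value are made-up numbers chosen for easy arithmetic.

```python
def discount_with_dones(rewards, dones, gamma):
    # Same logic as baselines/a2c/utils.py: accumulate returns backwards,
    # resetting the running return whenever an episode ended at that step.
    discounted = []
    r = 0
    for reward, done in zip(rewards[::-1], dones[::-1]):
        r = reward + gamma * r * (1. - done)
        discounted.append(r)
    return discounted[::-1]


gamma = 0.5
last_value = 4.0  # assumed value estimate for the state after the rollout

# Case 1: no episode boundary inside the rollout, so bootstrap from last_value.
returns_a = discount_with_dones([1, 1, 1] + [last_value], [0, 0, 0] + [0], gamma)[:-1]

# Case 2: the episode ended at the second step, so no reward flows across it.
returns_b = discount_with_dones([1, 1, 1] + [last_value], [0, 1, 0] + [0], gamma)[:-1]

print(returns_a)  # [2.25, 2.5, 3.0]
print(returns_b)  # [1.5, 1.0, 3.0]
```

Appending the value estimate as an extra "reward" and then dropping the last element is what lets a truncated rollout still account for rewards beyond its final step. A `done` flag cuts that flow, so the second return in case 2 is just the immediate reward.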

282
baselines/a2c/utils.py Normal file

@@ -0,0 +1,282 @@
import os
import numpy as np
import tensorflow as tf
from collections import deque
def sample(logits):
    noise = tf.random_uniform(tf.shape(logits))
    return tf.argmax(logits - tf.log(-tf.log(noise)), 1)

def cat_entropy(logits):
    a0 = logits - tf.reduce_max(logits, 1, keepdims=True)
    ea0 = tf.exp(a0)
    z0 = tf.reduce_sum(ea0, 1, keepdims=True)
    p0 = ea0 / z0
    return tf.reduce_sum(p0 * (tf.log(z0) - a0), 1)

def cat_entropy_softmax(p0):
    return - tf.reduce_sum(p0 * tf.log(p0 + 1e-6), axis = 1)

def ortho_init(scale=1.0):
    def _ortho_init(shape, dtype, partition_info=None):
        #lasagne ortho init for tf
        shape = tuple(shape)
        if len(shape) == 2:
            flat_shape = shape
        elif len(shape) == 4: # assumes NHWC
            flat_shape = (np.prod(shape[:-1]), shape[-1])
        else:
            raise NotImplementedError
        a = np.random.normal(0.0, 1.0, flat_shape)
        u, _, v = np.linalg.svd(a, full_matrices=False)
        q = u if u.shape == flat_shape else v # pick the one with the correct shape
        q = q.reshape(shape)
        return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
    return _ortho_init

def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC', one_dim_bias=False):
    if data_format == 'NHWC':
        channel_ax = 3
        strides = [1, stride, stride, 1]
        bshape = [1, 1, 1, nf]
    elif data_format == 'NCHW':
        channel_ax = 1
        strides = [1, 1, stride, stride]
        bshape = [1, nf, 1, 1]
    else:
        raise NotImplementedError
    bias_var_shape = [nf] if one_dim_bias else [1, nf, 1, 1]
    nin = x.get_shape()[channel_ax].value
    wshape = [rf, rf, nin, nf]
    with tf.variable_scope(scope):
        w = tf.get_variable("w", wshape, initializer=ortho_init(init_scale))
        b = tf.get_variable("b", bias_var_shape, initializer=tf.constant_initializer(0.0))
        if not one_dim_bias and data_format == 'NHWC':
            b = tf.reshape(b, bshape)
        return tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format) + b

def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
    with tf.variable_scope(scope):
        nin = x.get_shape()[1].value
        w = tf.get_variable("w", [nin, nh], initializer=ortho_init(init_scale))
        b = tf.get_variable("b", [nh], initializer=tf.constant_initializer(init_bias))
        return tf.matmul(x, w)+b

def batch_to_seq(h, nbatch, nsteps, flat=False):
    if flat:
        h = tf.reshape(h, [nbatch, nsteps])
    else:
        h = tf.reshape(h, [nbatch, nsteps, -1])
    return [tf.squeeze(v, [1]) for v in tf.split(axis=1, num_or_size_splits=nsteps, value=h)]

def seq_to_batch(h, flat = False):
    shape = h[0].get_shape().as_list()
    if not flat:
        assert(len(shape) > 1)
        nh = h[0].get_shape()[-1].value
        return tf.reshape(tf.concat(axis=1, values=h), [-1, nh])
    else:
        return tf.reshape(tf.stack(values=h, axis=1), [-1])

def lstm(xs, ms, s, scope, nh, init_scale=1.0):
    nbatch, nin = [v.value for v in xs[0].get_shape()]
    with tf.variable_scope(scope):
        wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
        wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
        b = tf.get_variable("b", [nh*4], initializer=tf.constant_initializer(0.0))

    c, h = tf.split(axis=1, num_or_size_splits=2, value=s)
    for idx, (x, m) in enumerate(zip(xs, ms)):
        c = c*(1-m)
        h = h*(1-m)
        z = tf.matmul(x, wx) + tf.matmul(h, wh) + b
        i, f, o, u = tf.split(axis=1, num_or_size_splits=4, value=z)
        i = tf.nn.sigmoid(i)
        f = tf.nn.sigmoid(f)
        o = tf.nn.sigmoid(o)
        u = tf.tanh(u)
        c = f*c + i*u
        h = o*tf.tanh(c)
        xs[idx] = h
    s = tf.concat(axis=1, values=[c, h])
    return xs, s

def _ln(x, g, b, e=1e-5, axes=[1]):
    u, s = tf.nn.moments(x, axes=axes, keep_dims=True)
    x = (x-u)/tf.sqrt(s+e)
    x = x*g+b
    return x

def lnlstm(xs, ms, s, scope, nh, init_scale=1.0):
    nbatch, nin = [v.value for v in xs[0].get_shape()]
    with tf.variable_scope(scope):
        wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
        gx = tf.get_variable("gx", [nh*4], initializer=tf.constant_initializer(1.0))
        bx = tf.get_variable("bx", [nh*4], initializer=tf.constant_initializer(0.0))

        wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
        gh = tf.get_variable("gh", [nh*4], initializer=tf.constant_initializer(1.0))
        bh = tf.get_variable("bh", [nh*4], initializer=tf.constant_initializer(0.0))

        b = tf.get_variable("b", [nh*4], initializer=tf.constant_initializer(0.0))

        gc = tf.get_variable("gc", [nh], initializer=tf.constant_initializer(1.0))
        bc = tf.get_variable("bc", [nh], initializer=tf.constant_initializer(0.0))

    c, h = tf.split(axis=1, num_or_size_splits=2, value=s)
    for idx, (x, m) in enumerate(zip(xs, ms)):
        c = c*(1-m)
        h = h*(1-m)
        z = _ln(tf.matmul(x, wx), gx, bx) + _ln(tf.matmul(h, wh), gh, bh) + b
        i, f, o, u = tf.split(axis=1, num_or_size_splits=4, value=z)
        i = tf.nn.sigmoid(i)
        f = tf.nn.sigmoid(f)
        o = tf.nn.sigmoid(o)
        u = tf.tanh(u)
        c = f*c + i*u
        h = o*tf.tanh(_ln(c, gc, bc))
        xs[idx] = h
    s = tf.concat(axis=1, values=[c, h])
    return xs, s

def conv_to_fc(x):
nh = np.prod([v.value for v in x.get_shape()[1:]])
x = tf.reshape(x, [-1, nh])
return x
def discount_with_dones(rewards, dones, gamma):
discounted = []
r = 0
for reward, done in zip(rewards[::-1], dones[::-1]):
r = reward + gamma*r*(1.-done) # fixed off by one bug
discounted.append(r)
return discounted[::-1]
def find_trainable_variables(key):
return tf.trainable_variables(key)
def make_path(f):
return os.makedirs(f, exist_ok=True)
def constant(p):
return 1
def linear(p):
return 1-p
def middle_drop(p):
eps = 0.75
if 1-p<eps:
return eps*0.1
return 1-p
def double_linear_con(p):
p *= 2
eps = 0.125
if 1-p<eps:
return eps
return 1-p
def double_middle_drop(p):
eps1 = 0.75
eps2 = 0.25
if 1-p<eps1:
if 1-p<eps2:
return eps2*0.5
return eps1*0.1
return 1-p
schedules = {
'linear':linear,
'constant':constant,
'double_linear_con': double_linear_con,
'middle_drop': middle_drop,
'double_middle_drop': double_middle_drop
}
class Scheduler(object):
def __init__(self, v, nvalues, schedule):
self.n = 0.
self.v = v
self.nvalues = nvalues
self.schedule = schedules[schedule]
def value(self):
current_value = self.v*self.schedule(self.n/self.nvalues)
self.n += 1.
return current_value
def value_steps(self, steps):
return self.v*self.schedule(steps/self.nvalues)
class EpisodeStats:
def __init__(self, nsteps, nenvs):
self.episode_rewards = []
for i in range(nenvs):
self.episode_rewards.append([])
self.lenbuffer = deque(maxlen=40) # rolling buffer for episode lengths
self.rewbuffer = deque(maxlen=40) # rolling buffer for episode rewards
self.nsteps = nsteps
self.nenvs = nenvs
def feed(self, rewards, masks):
rewards = np.reshape(rewards, [self.nenvs, self.nsteps])
masks = np.reshape(masks, [self.nenvs, self.nsteps])
for i in range(0, self.nenvs):
for j in range(0, self.nsteps):
self.episode_rewards[i].append(rewards[i][j])
if masks[i][j]:
l = len(self.episode_rewards[i])
s = sum(self.episode_rewards[i])
self.lenbuffer.append(l)
self.rewbuffer.append(s)
self.episode_rewards[i] = []
def mean_length(self):
if self.lenbuffer:
return np.mean(self.lenbuffer)
else:
return 0 # on the first params dump, no episodes are finished
def mean_reward(self):
if self.rewbuffer:
return np.mean(self.rewbuffer)
else:
return 0
# For ACER
def get_by_index(x, idx):
assert(len(x.get_shape()) == 2)
assert(len(idx.get_shape()) == 1)
idx_flattened = tf.range(0, x.shape[0]) * x.shape[1] + idx
y = tf.gather(tf.reshape(x, [-1]), # flatten input
idx_flattened) # use flattened indices
return y
def check_shape(ts,shapes):
i = 0
for (t,shape) in zip(ts,shapes):
assert t.get_shape().as_list()==shape, "id " + str(i) + " shape " + str(t.get_shape()) + str(shape)
i += 1
def avg_norm(t):
return tf.reduce_mean(tf.sqrt(tf.reduce_sum(tf.square(t), axis=-1)))
def gradient_add(g1, g2, param):
print([g1, g2, param.name])
assert (not (g1 is None and g2 is None)), param.name
if g1 is None:
return g2
elif g2 is None:
return g1
else:
return g1 + g2
def q_explained_variance(qpred, q):
_, vary = tf.nn.moments(q, axes=[0, 1])
_, varpred = tf.nn.moments(q - qpred, axes=[0, 1])
check_shape([vary, varpred], [[]] * 2)
return 1.0 - (varpred / vary)
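
As a quick sanity check of the `discount_with_dones` helper defined earlier in this file, the sketch below re-implements the same backward recursion in plain Python and runs it on toy inputs. The reward and done values are made up for illustration; only the recursion itself mirrors the source.

```python
def discount_with_dones(rewards, dones, gamma):
    """Discounted returns that are cut off wherever an episode ended."""
    discounted = []
    r = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for reward, done in zip(rewards[::-1], dones[::-1]):
        # A done flag zeroes the carried return, so nothing leaks across episodes.
        r = reward + gamma * r * (1.0 - done)
        discounted.append(r)
    return discounted[::-1]


# Episode ends on the last step.
returns_terminal = discount_with_dones([1.0, 1.0, 1.0], [0, 0, 1], gamma=0.5)
# Episode ends on the middle step, so the first two steps never see the last reward.
returns_midway = discount_with_dones([1.0, 1.0, 1.0], [0, 1, 0], gamma=0.5)
print(returns_terminal)
print(returns_midway)
```

In the second call the return of the first step stops accumulating at the done flag, which is the behaviour the `(1.-done)` factor is there to provide.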

4
baselines/acer/README.md Normal file

@@ -0,0 +1,4 @@
# ACER
- Original paper: https://arxiv.org/abs/1611.01224
- `python -m baselines.acer.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.


374
baselines/acer/acer.py Normal file

@@ -0,0 +1,374 @@
import time
import functools
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables
from baselines.a2c.utils import batch_to_seq, seq_to_batch
from baselines.a2c.utils import cat_entropy_softmax
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.utils import EpisodeStats
from baselines.a2c.utils import get_by_index, check_shape, avg_norm, gradient_add, q_explained_variance
from baselines.acer.buffer import Buffer
from baselines.acer.runner import Runner
# remove last step
def strip(var, nenvs, nsteps, flat = False):
vars = batch_to_seq(var, nenvs, nsteps + 1, flat)
return seq_to_batch(vars[:-1], flat)
def q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma):
"""
Calculates q_retrace targets
:param R: Rewards
:param D: Dones
:param q_i: Q values for actions taken
:param v: V values
:param rho_i: Importance weight for each action
:return: Q_retrace values
"""
rho_bar = batch_to_seq(tf.minimum(1.0, rho_i), nenvs, nsteps, True) # list of len steps, shape [nenvs]
rs = batch_to_seq(R, nenvs, nsteps, True) # list of len steps, shape [nenvs]
ds = batch_to_seq(D, nenvs, nsteps, True) # list of len steps, shape [nenvs]
q_is = batch_to_seq(q_i, nenvs, nsteps, True)
vs = batch_to_seq(v, nenvs, nsteps + 1, True)
v_final = vs[-1]
qret = v_final
qrets = []
for i in range(nsteps - 1, -1, -1):
check_shape([qret, ds[i], rs[i], rho_bar[i], q_is[i], vs[i]], [[nenvs]] * 6)
qret = rs[i] + gamma * qret * (1.0 - ds[i])
qrets.append(qret)
qret = (rho_bar[i] * (qret - q_is[i])) + vs[i]
qrets = qrets[::-1]
qret = seq_to_batch(qrets, flat=True)
return qret
# For ACER with PPO clipping instead of trust region
# def clip(ratio, eps_clip):
# # assume 0 <= eps_clip <= 1
# return tf.minimum(1 + eps_clip, tf.maximum(1 - eps_clip, ratio))
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, nstack, num_procs,
ent_coef, q_coef, gamma, max_grad_norm, lr,
rprop_alpha, rprop_epsilon, total_timesteps, lrschedule,
c, trust_region, alpha, delta):
sess = get_session()
nact = ac_space.n
nbatch = nenvs * nsteps
A = tf.placeholder(tf.int32, [nbatch]) # actions
D = tf.placeholder(tf.float32, [nbatch]) # dones
R = tf.placeholder(tf.float32, [nbatch]) # rewards, not returns
MU = tf.placeholder(tf.float32, [nbatch, nact]) # mu's
LR = tf.placeholder(tf.float32, [])
eps = 1e-6
step_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs,) + ob_space.shape[:-1] + (ob_space.shape[-1] * nstack,))
train_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs*(nsteps+1),) + ob_space.shape[:-1] + (ob_space.shape[-1] * nstack,))
with tf.variable_scope('acer_model', reuse=tf.AUTO_REUSE):
step_model = policy(observ_placeholder=step_ob_placeholder, sess=sess)
train_model = policy(observ_placeholder=train_ob_placeholder, sess=sess)
params = find_trainable_variables("acer_model")
print("Params {}".format(len(params)))
for var in params:
print(var)
# create polyak averaged model
ema = tf.train.ExponentialMovingAverage(alpha)
ema_apply_op = ema.apply(params)
def custom_getter(getter, *args, **kwargs):
v = ema.average(getter(*args, **kwargs))
print(v.name)
return v
with tf.variable_scope("acer_model", custom_getter=custom_getter, reuse=True):
polyak_model = policy(observ_placeholder=train_ob_placeholder, sess=sess)
# Notation: (var) = batch variable, (var)s = sequence variable, (var)_i = variable indexed by the action taken at step i
# action probability distributions according to train_model, polyak_model and step_model
# policy.pi holds the distribution parameters (logits); take a softmax to obtain a distribution that sums to 1
train_model_p = tf.nn.softmax(train_model.pi)
polyak_model_p = tf.nn.softmax(polyak_model.pi)
step_model_p = tf.nn.softmax(step_model.pi)
v = tf.reduce_sum(train_model_p * train_model.q, axis = -1) # shape is [nenvs * (nsteps + 1)]
# strip off last step
f, f_pol, q = map(lambda var: strip(var, nenvs, nsteps), [train_model_p, polyak_model_p, train_model.q])
# Get pi and q values for actions taken
f_i = get_by_index(f, A)
q_i = get_by_index(q, A)
# Compute ratios for importance truncation
rho = f / (MU + eps)
rho_i = get_by_index(rho, A)
# Calculate Q_retrace targets
qret = q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma)
# Calculate losses
# Entropy
# entropy = tf.reduce_mean(strip(train_model.pd.entropy(), nenvs, nsteps))
entropy = tf.reduce_mean(cat_entropy_softmax(f))
# Policy gradient loss, with truncated importance sampling & bias correction
v = strip(v, nenvs, nsteps, True)
check_shape([qret, v, rho_i, f_i], [[nenvs * nsteps]] * 4)
check_shape([rho, f, q], [[nenvs * nsteps, nact]] * 3) # three tensors need three shapes, otherwise zip silently skips q
# Truncated importance sampling
adv = qret - v
logf = tf.log(f_i + eps)
gain_f = logf * tf.stop_gradient(adv * tf.minimum(c, rho_i)) # [nenvs * nsteps]
loss_f = -tf.reduce_mean(gain_f)
# Bias correction for the truncation
adv_bc = (q - tf.reshape(v, [nenvs * nsteps, 1])) # [nenvs * nsteps, nact]
logf_bc = tf.log(f + eps) # / (f_old + eps)
check_shape([adv_bc, logf_bc], [[nenvs * nsteps, nact]]*2)
gain_bc = tf.reduce_sum(logf_bc * tf.stop_gradient(adv_bc * tf.nn.relu(1.0 - (c / (rho + eps))) * f), axis = 1) #IMP: This is sum, as expectation wrt f
loss_bc= -tf.reduce_mean(gain_bc)
loss_policy = loss_f + loss_bc
# Value/Q function loss, and explained variance
check_shape([qret, q_i], [[nenvs * nsteps]]*2)
ev = q_explained_variance(tf.reshape(q_i, [nenvs, nsteps]), tf.reshape(qret, [nenvs, nsteps]))
loss_q = tf.reduce_mean(tf.square(tf.stop_gradient(qret) - q_i)*0.5)
# Net loss
check_shape([loss_policy, loss_q, entropy], [[]] * 3)
loss = loss_policy + q_coef * loss_q - ent_coef * entropy
if trust_region:
g = tf.gradients(- (loss_policy - ent_coef * entropy) * nsteps * nenvs, f) #[nenvs * nsteps, nact]
# k = tf.gradients(KL(f_pol || f), f)
k = - f_pol / (f + eps) #[nenvs * nsteps, nact] # Directly computed gradient of KL divergence wrt f
k_dot_g = tf.reduce_sum(k * g, axis=-1)
adj = tf.maximum(0.0, (tf.reduce_sum(k * g, axis=-1) - delta) / (tf.reduce_sum(tf.square(k), axis=-1) + eps)) #[nenvs * nsteps]
# Calculate stats (before doing adjustment) for logging.
avg_norm_k = avg_norm(k)
avg_norm_g = avg_norm(g)
avg_norm_k_dot_g = tf.reduce_mean(tf.abs(k_dot_g))
avg_norm_adj = tf.reduce_mean(tf.abs(adj))
g = g - tf.reshape(adj, [nenvs * nsteps, 1]) * k
grads_f = -g/(nenvs*nsteps) # These are trust-region-adjusted gradients wrt f, i.e. the statistics of policy pi
grads_policy = tf.gradients(f, params, grads_f)
grads_q = tf.gradients(loss_q * q_coef, params)
grads = [gradient_add(g1, g2, param) for (g1, g2, param) in zip(grads_policy, grads_q, params)]
avg_norm_grads_f = avg_norm(grads_f) * (nsteps * nenvs)
norm_grads_q = tf.global_norm(grads_q)
norm_grads_policy = tf.global_norm(grads_policy)
else:
grads = tf.gradients(loss, params)
if max_grad_norm is not None:
grads, norm_grads = tf.clip_by_global_norm(grads, max_grad_norm)
grads = list(zip(grads, params))
trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=rprop_alpha, epsilon=rprop_epsilon)
_opt_op = trainer.apply_gradients(grads)
# so when you call _train, you first do the gradient step, then you apply ema
with tf.control_dependencies([_opt_op]):
_train = tf.group(ema_apply_op)
lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
# Ops/Summaries to run, and their names for logging
run_ops = [_train, loss, loss_q, entropy, loss_policy, loss_f, loss_bc, ev, norm_grads]
names_ops = ['loss', 'loss_q', 'entropy', 'loss_policy', 'loss_f', 'loss_bc', 'explained_variance',
'norm_grads']
if trust_region:
run_ops = run_ops + [norm_grads_q, norm_grads_policy, avg_norm_grads_f, avg_norm_k, avg_norm_g, avg_norm_k_dot_g,
avg_norm_adj]
names_ops = names_ops + ['norm_grads_q', 'norm_grads_policy', 'avg_norm_grads_f', 'avg_norm_k', 'avg_norm_g',
'avg_norm_k_dot_g', 'avg_norm_adj']
def train(obs, actions, rewards, dones, mus, states, masks, steps):
cur_lr = lr.value_steps(steps)
td_map = {train_model.X: obs, polyak_model.X: obs, A: actions, R: rewards, D: dones, MU: mus, LR: cur_lr}
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks
td_map[polyak_model.S] = states
td_map[polyak_model.M] = masks
return names_ops, sess.run(run_ops, td_map)[1:] # strip off _train
def _step(observation, **kwargs):
return step_model._evaluate([step_model.action, step_model_p, step_model.state], observation, **kwargs)
self.train = train
self.save = functools.partial(save_variables, sess=sess, variables=params)
self.train_model = train_model
self.step_model = step_model
self._step = _step
self.step = self.step_model.step
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
class Acer():
def __init__(self, runner, model, buffer, log_interval):
self.runner = runner
self.model = model
self.buffer = buffer
self.log_interval = log_interval
self.tstart = None
self.episode_stats = EpisodeStats(runner.nsteps, runner.nenv)
self.steps = None
def call(self, on_policy):
runner, model, buffer, steps = self.runner, self.model, self.buffer, self.steps
if on_policy:
enc_obs, obs, actions, rewards, mus, dones, masks = runner.run()
self.episode_stats.feed(rewards, dones)
if buffer is not None:
buffer.put(enc_obs, actions, rewards, mus, dones, masks)
else:
# get obs, actions, rewards, mus, dones from buffer.
obs, actions, rewards, mus, dones, masks = buffer.get()
# reshape stuff correctly
obs = obs.reshape(runner.batch_ob_shape)
actions = actions.reshape([runner.nbatch])
rewards = rewards.reshape([runner.nbatch])
mus = mus.reshape([runner.nbatch, runner.nact])
dones = dones.reshape([runner.nbatch])
masks = masks.reshape([runner.batch_ob_shape[0]])
names_ops, values_ops = model.train(obs, actions, rewards, dones, mus, model.initial_state, masks, steps)
if on_policy and (int(steps/runner.nbatch) % self.log_interval == 0):
logger.record_tabular("total_timesteps", steps)
logger.record_tabular("fps", int(steps/(time.time() - self.tstart)))
# IMP: In EpisodicLife env, during training, we get done=True at each loss of life, not just at the terminal state.
# Thus, this is mean until end of life, not end of episode.
# For true episode rewards, see the monitor files in the log folder.
logger.record_tabular("mean_episode_length", self.episode_stats.mean_length())
logger.record_tabular("mean_episode_reward", self.episode_stats.mean_reward())
for name, val in zip(names_ops, values_ops):
logger.record_tabular(name, float(val))
logger.dump_tabular()
def learn(network, env, seed=None, nsteps=20, nstack=4, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
max_grad_norm=10, lr=7e-4, lrschedule='linear', rprop_epsilon=1e-5, rprop_alpha=0.99, gamma=0.99,
log_interval=100, buffer_size=50000, replay_ratio=4, replay_start=10000, c=10.0,
trust_region=True, alpha=0.99, delta=1, load_path=None, **network_kwargs):
'''
Main entrypoint for ACER (Actor-Critic with Experience Replay) algorithm (https://arxiv.org/pdf/1611.01224.pdf)
Train an agent with given network architecture on a given environment using ACER.
Parameters:
----------
network: policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
env: environment. Needs to be vectorized for parallel environment simulation.
The environments produced by gym.make can be wrapped using baselines.common.vec_env.DummyVecEnv class.
nsteps: int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
nenv is number of environment copies simulated in parallel) (default: 20)
nstack: int, size of the frame stack, i.e. number of the frames passed to the step model. Frames are stacked along channel dimension
(last image dimension) (default: 4)
total_timesteps: int, number of timesteps (i.e. number of actions taken in the environment) (default: 80M)
q_coef: float, value function loss coefficient in the optimization objective (analog of vf_coef for other actor-critic methods) (default: 0.5)
ent_coef: float, policy entropy coefficient in the optimization objective (default: 0.01)
max_grad_norm: float, gradient norm clipping coefficient. If set to None, no clipping. (default: 10)
lr: float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)
lrschedule: str, schedule of learning rate, mapping the fraction of training progress in [0..1] to a fraction of lr. Must be a key of
baselines.a2c.utils.schedules, e.g. 'linear' or 'constant' (default: 'linear')
rprop_epsilon: float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
rprop_alpha: float, RMSProp decay parameter (default: 0.99)
gamma: float, reward discounting factor (default: 0.99)
log_interval: int, number of updates between logging events (default: 100)
buffer_size: int, size of the replay buffer (default: 50k)
replay_ratio: int, how many batches (on average) to sample from the replay buffer after each batch from the environment (default: 4)
replay_start: int, the sampling from the replay buffer does not start until replay buffer has at least that many samples (default: 10k)
c: float, importance weight clipping factor (default: 10)
trust_region: bool, whether or not the algorithm estimates the gradient of the KL divergence between the old and updated policy and uses it to determine the step size (default: True)
delta: float, max KL divergence between the old policy and updated policy (default: 1)
alpha: float, momentum factor in the Polyak (exponential moving average) averaging of the model parameters (default: 0.99)
load_path: str, path to load the model from (default: None)
**network_kwargs: keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
For instance, 'mlp' network architecture has arguments num_hidden and num_layers.
'''
print("Running Acer Simple")
print(locals())
set_global_seeds(seed)
policy = build_policy(env, network, estimate_q=True, **network_kwargs)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
num_procs = len(env.remotes) if hasattr(env, 'remotes') else 1# HACK
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, nstack=nstack,
num_procs=num_procs, ent_coef=ent_coef, q_coef=q_coef, gamma=gamma,
max_grad_norm=max_grad_norm, lr=lr, rprop_alpha=rprop_alpha, rprop_epsilon=rprop_epsilon,
total_timesteps=total_timesteps, lrschedule=lrschedule, c=c,
trust_region=trust_region, alpha=alpha, delta=delta)
runner = Runner(env=env, model=model, nsteps=nsteps, nstack=nstack)
if replay_ratio > 0:
buffer = Buffer(env=env, nsteps=nsteps, nstack=nstack, size=buffer_size)
else:
buffer = None
nbatch = nenvs*nsteps
acer = Acer(runner, model, buffer, log_interval)
acer.tstart = time.time()
for acer.steps in range(0, total_timesteps, nbatch): #nbatch samples, 1 on_policy call and multiple off-policy calls
acer.call(on_policy=True)
if replay_ratio > 0 and buffer.has_atleast(replay_start):
n = np.random.poisson(replay_ratio)
for _ in range(n):
acer.call(on_policy=False) # no simulation steps in this
env.close()
return model
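
To make the backward recursion in `q_retrace` easier to follow, here is a minimal NumPy sketch of the same loop for a single environment. It is not the TensorFlow code above: the function name `q_retrace_np` and the toy inputs are assumptions made for illustration, and batching across environments is left out.

```python
import numpy as np


def q_retrace_np(rewards, dones, q_taken, values, rho_taken, gamma):
    """Retrace targets for one environment.

    rewards, dones, q_taken, rho_taken have length nsteps;
    values has length nsteps + 1 (the last entry bootstraps the recursion).
    """
    nsteps = len(rewards)
    rho_bar = np.minimum(1.0, rho_taken)  # truncated importance weights
    qret = values[-1]                      # start from the bootstrap value
    targets = np.empty(nsteps, dtype=np.float64)
    for i in range(nsteps - 1, -1, -1):
        # One-step backup, cut off at episode boundaries.
        qret = rewards[i] + gamma * qret * (1.0 - dones[i])
        targets[i] = qret
        # Correction that is carried to the previous step.
        qret = rho_bar[i] * (qret - q_taken[i]) + values[i]
    return targets


rewards = np.array([1.0, 1.0])
dones = np.array([0.0, 0.0])
q_taken = np.array([0.0, 0.0])
values = np.array([0.0, 0.0, 2.0])

# With weights of 1 and q == v the targets reduce to bootstrapped n-step returns.
on_policy = q_retrace_np(rewards, dones, q_taken, values, np.array([1.0, 1.0]), gamma=0.5)
# With weights of 0 every step falls back to a one-step backup from V.
fully_truncated = q_retrace_np(rewards, dones, q_taken, values, np.array([0.0, 0.0]), gamma=0.5)
print(on_policy, fully_truncated)
```

The two calls bracket the behaviour of the estimator: large importance weights keep the multi-step return, small ones shrink it towards the value estimate.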

103
baselines/acer/buffer.py Normal file

@@ -0,0 +1,103 @@
import numpy as np
class Buffer(object):
# gets obs, actions, rewards, mu's, (states, masks), dones
def __init__(self, env, nsteps, nstack, size=50000):
self.nenv = env.num_envs
self.nsteps = nsteps
self.nh, self.nw, self.nc = env.observation_space.shape
self.nstack = nstack
self.nbatch = self.nenv * self.nsteps
self.size = size // (self.nsteps) # Each loc contains nenv * nsteps frames, thus total buffer is nenv * size frames
# Memory
self.enc_obs = None
self.actions = None
self.rewards = None
self.mus = None
self.dones = None
self.masks = None
# Size indexes
self.next_idx = 0
self.num_in_buffer = 0
def has_atleast(self, frames):
# Frames per env, so total (nenv * frames) Frames needed
# Each buffer loc has nenv * nsteps frames
return self.num_in_buffer >= (frames // self.nsteps)
def can_sample(self):
return self.num_in_buffer > 0
# Generate stacked frames
def decode(self, enc_obs, dones):
# enc_obs has shape [nenvs, nsteps + nstack, nh, nw, nc]
# dones has shape [nenvs, nsteps]
# returns stacked obs of shape [nenv, (nsteps + 1), nh, nw, nstack*nc]
nstack, nenv, nsteps, nh, nw, nc = self.nstack, self.nenv, self.nsteps, self.nh, self.nw, self.nc
y = np.empty([nsteps + nstack - 1, nenv, 1, 1, 1], dtype=np.float32)
obs = np.zeros([nstack, nsteps + nstack, nenv, nh, nw, nc], dtype=np.uint8)
x = np.reshape(enc_obs, [nenv, nsteps + nstack, nh, nw, nc]).swapaxes(1,
0) # [nsteps + nstack, nenv, nh, nw, nc]
y[nstack - 1:] = np.reshape(1.0 - dones, [nenv, nsteps, 1, 1, 1]).swapaxes(1, 0) # keep
y[:nstack - 1] = 1.0 # no done flags are available for the leading frames (was hardcoded to 3, i.e. nstack == 4)
# y = np.reshape(1 - dones, [nenvs, nsteps, 1, 1, 1])
for i in range(nstack):
obs[-(i + 1), i:] = x
# obs[:,i:,:,:,-(i+1),:] = x
x = x[:-1] * y
y = y[1:]
return np.reshape(obs[:, nstack - 1:].transpose((2, 1, 3, 4, 0, 5)), [nenv, (nsteps + 1), nh, nw, nstack * nc])
def put(self, enc_obs, actions, rewards, mus, dones, masks):
# enc_obs [nenv, (nsteps + nstack), nh, nw, nc]
# actions, rewards, dones [nenv, nsteps]
# mus [nenv, nsteps, nact]
if self.enc_obs is None:
self.enc_obs = np.empty([self.size] + list(enc_obs.shape), dtype=np.uint8)
self.actions = np.empty([self.size] + list(actions.shape), dtype=np.int32)
self.rewards = np.empty([self.size] + list(rewards.shape), dtype=np.float32)
self.mus = np.empty([self.size] + list(mus.shape), dtype=np.float32)
self.dones = np.empty([self.size] + list(dones.shape), dtype=bool) # builtin bool; the np.bool alias is gone in newer NumPy
self.masks = np.empty([self.size] + list(masks.shape), dtype=bool)
self.enc_obs[self.next_idx] = enc_obs
self.actions[self.next_idx] = actions
self.rewards[self.next_idx] = rewards
self.mus[self.next_idx] = mus
self.dones[self.next_idx] = dones
self.masks[self.next_idx] = masks
self.next_idx = (self.next_idx + 1) % self.size
self.num_in_buffer = min(self.size, self.num_in_buffer + 1)
def take(self, x, idx, envx):
nenv = self.nenv
out = np.empty([nenv] + list(x.shape[2:]), dtype=x.dtype)
for i in range(nenv):
out[i] = x[idx[i], envx[i]]
return out
def get(self):
# returns
# obs [nenv, (nsteps + 1), nh, nw, nstack*nc]
# actions, rewards, dones [nenv, nsteps]
# mus [nenv, nsteps, nact]
nenv = self.nenv
assert self.can_sample()
# Sample exactly one id per env. If you sample across envs, then higher correlation in samples from same env.
idx = np.random.randint(0, self.num_in_buffer, nenv)
envx = np.arange(nenv)
take = lambda x: self.take(x, idx, envx) # for i in range(nenv)], axis = 0)
dones = take(self.dones)
enc_obs = take(self.enc_obs)
obs = self.decode(enc_obs, dones)
actions = take(self.actions)
rewards = take(self.rewards)
mus = take(self.mus)
masks = take(self.masks)
return obs, actions, rewards, mus, dones, masks
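
The `take` method above picks, for each environment `i`, the slice stored at buffer location `idx[i]`. The sketch below, with made-up shapes, shows that the loop is equivalent to a single NumPy advanced-indexing expression. The names `take_loop` and `take_vectorized` are hypothetical and exist only for this comparison.

```python
import numpy as np


def take_loop(x, idx, envx):
    """Loop version, mirroring Buffer.take."""
    nenv = len(envx)
    out = np.empty([nenv] + list(x.shape[2:]), dtype=x.dtype)
    for i in range(nenv):
        out[i] = x[idx[i], envx[i]]
    return out


def take_vectorized(x, idx, envx):
    """Same selection with paired advanced indexing on the first two axes."""
    return x[idx, envx]


size, nenv, nsteps = 3, 2, 4
# Stand-in for a stored array such as rewards: [size, nenv, nsteps].
x = np.arange(size * nenv * nsteps).reshape(size, nenv, nsteps)
idx = np.array([2, 0])   # one buffer location per environment
envx = np.arange(nenv)   # environment i reads its own column

loop_result = take_loop(x, idx, envx)
vec_result = take_vectorized(x, idx, envx)
print(loop_result)
```

Because each environment reads only its own column, a sampled batch never mixes trajectories from different environments within one row.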


@@ -0,0 +1,4 @@
def atari():
return dict(
lrschedule='constant'
)


@@ -0,0 +1,81 @@
import numpy as np
import tensorflow as tf
from baselines.common.policies import nature_cnn
from baselines.a2c.utils import fc, batch_to_seq, seq_to_batch, lstm, sample
class AcerCnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False):
nbatch = nenv * nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc * nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) # obs
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
pi_logits = fc(h, 'pi', nact, init_scale=0.01)
pi = tf.nn.softmax(pi_logits)
q = fc(h, 'q', nact)
a = sample(tf.nn.softmax(pi_logits)) # could change this to use self.pi instead
self.initial_state = [] # not stateful
self.X = X
self.pi = pi # actual policy params now
self.pi_logits = pi_logits
self.q = q
self.vf = q
def step(ob, *args, **kwargs):
# returns actions, mus, states
a0, pi0 = sess.run([a, pi], {X: ob})
return a0, pi0, [] # dummy state
def out(ob, *args, **kwargs):
pi0, q0 = sess.run([pi, q], {X: ob})
return pi0, q0
def act(ob, *args, **kwargs):
return sess.run(a, {X: ob})
self.step = step
self.out = out
self.act = act
class AcerLstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False, nlstm=256):
nbatch = nenv * nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc * nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) # obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
# lstm
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi_logits = fc(h5, 'pi', nact, init_scale=0.01)
pi = tf.nn.softmax(pi_logits)
q = fc(h5, 'q', nact)
a = sample(pi_logits) # could change this to use self.pi instead
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
self.X = X
self.M = M
self.S = S
self.pi = pi # actual policy params now
self.q = q
def step(ob, state, mask, *args, **kwargs):
# returns actions, mus, states
a0, pi0, s = sess.run([a, pi, snew], {X: ob, S: state, M: mask})
return a0, pi0, s
self.step = step

60
baselines/acer/runner.py Normal file

@@ -0,0 +1,60 @@
import numpy as np
from baselines.common.runners import AbstractEnvRunner
class Runner(AbstractEnvRunner):
def __init__(self, env, model, nsteps, nstack):
super().__init__(env=env, model=model, nsteps=nsteps)
self.nstack = nstack
nh, nw, nc = env.observation_space.shape
self.nc = nc # nc = 1 for atari, but just in case
self.nact = env.action_space.n
nenv = self.nenv
self.nbatch = nenv * nsteps
self.batch_ob_shape = (nenv*(nsteps+1), nh, nw, nc*nstack)
self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
obs = env.reset()
self.update_obs(obs)
def update_obs(self, obs, dones=None):
#self.obs = obs
if dones is not None:
self.obs *= (1 - dones.astype(np.uint8))[:, None, None, None]
self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
self.obs[:, :, :, -self.nc:] = obs[:, :, :, :]
def run(self):
enc_obs = np.split(self.obs, self.nstack, axis=3) # so now list of obs steps
mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
for _ in range(self.nsteps):
actions, mus, states = self.model._step(self.obs, S=self.states, M=self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_mus.append(mus)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
# state information for stateful models like LSTM
self.states = states
self.dones = dones
self.update_obs(obs, dones)
mb_rewards.append(rewards)
enc_obs.append(obs)
mb_obs.append(np.copy(self.obs))
mb_dones.append(self.dones)
enc_obs = np.asarray(enc_obs, dtype=np.uint8).swapaxes(1, 0)
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=bool).swapaxes(1, 0)
mb_masks = mb_dones # Used for stateful models like LSTMs to mask state when done
mb_dones = mb_dones[:, 1:] # Used for calculating returns. The dones array is now aligned with rewards
# shapes are now [nenv, nsteps, []]
# When pulling from buffer, arrays will now be reshaped in place, preventing a deep copy.
return enc_obs, mb_obs, mb_actions, mb_rewards, mb_mus, mb_dones, mb_masks
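
The frame stacking done by `Runner.update_obs` can be tried in isolation. The following sketch is a standalone function version under the assumption of a single 1x1 grayscale "image" per step, so the stacked channels are easy to read. The functional signature is hypothetical; the method above mutates `self.obs` instead of returning a new array.

```python
import numpy as np


def update_obs(stacked, new_obs, nc, dones=None):
    """Shift the frame stack by one frame (nc channels) and append new_obs."""
    if dones is not None:
        # Clear the whole stack for environments whose episode just ended.
        stacked = stacked * (1 - dones.astype(np.uint8))[:, None, None, None]
    stacked = np.roll(stacked, shift=-nc, axis=3)
    stacked[:, :, :, -nc:] = new_obs
    return stacked


nenv, nh, nw, nc, nstack = 1, 1, 1, 1, 3
stack = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
for value in [1, 2, 3, 4]:
    frame = np.full((nenv, nh, nw, nc), value, dtype=np.uint8)
    stack = update_obs(stack, frame, nc)
after_four_frames = stack[0, 0, 0].tolist()

# A done flag wipes the history before the first frame of the next episode is written.
frame = np.full((nenv, nh, nw, nc), 5, dtype=np.uint8)
stack = update_obs(stack, frame, nc, dones=np.array([True]))
after_reset = stack[0, 0, 0].tolist()
print(after_four_frames, after_reset)
```

The newest frame always sits in the last `nc` channels, which matches the `nstack` description in the `learn` docstring (frames stacked along the channel dimension).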


@@ -0,0 +1,5 @@
# ACKTR
- Original paper: https://arxiv.org/abs/1708.05144
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.acktr.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.


1
baselines/acktr/acktr.py Normal file

@@ -0,0 +1 @@
from baselines.acktr.acktr_disc import *


@@ -0,0 +1,142 @@
import numpy as np
import tensorflow as tf
from baselines import logger
import baselines.common as common
from baselines.common import tf_util as U
from baselines.acktr import kfac
from baselines.common.filters import ZFilter
def pathlength(path):
return path["reward"].shape[0] # number of timesteps in the path
def rollout(env, policy, max_pathlength, animate=False, obfilter=None):
"""
Simulate the env and policy for max_pathlength steps
"""
ob = env.reset()
prev_ob = np.float32(np.zeros(ob.shape))
if obfilter: ob = obfilter(ob)
terminated = False
obs = []
acs = []
ac_dists = []
logps = []
rewards = []
for _ in range(max_pathlength):
if animate:
env.render()
state = np.concatenate([ob, prev_ob], -1)
obs.append(state)
ac, ac_dist, logp = policy.act(state)
acs.append(ac)
ac_dists.append(ac_dist)
logps.append(logp)
prev_ob = np.copy(ob)
scaled_ac = env.action_space.low + (ac + 1.) * 0.5 * (env.action_space.high - env.action_space.low)
scaled_ac = np.clip(scaled_ac, env.action_space.low, env.action_space.high)
ob, rew, done, _ = env.step(scaled_ac)
if obfilter: ob = obfilter(ob)
rewards.append(rew)
if done:
terminated = True
break
return {"observation" : np.array(obs), "terminated" : terminated,
"reward" : np.array(rewards), "action" : np.array(acs),
"action_dist": np.array(ac_dists), "logp" : np.array(logps)}
def learn(env, policy, vf, gamma, lam, timesteps_per_batch, num_timesteps,
animate=False, callback=None, desired_kl=0.002):
obfilter = ZFilter(env.observation_space.shape)
max_pathlength = env.spec.timestep_limit
stepsize = tf.Variable(initial_value=np.float32(np.array(0.03)), name='stepsize')
inputs, loss, loss_sampled = policy.update_info
optim = kfac.KfacOptimizer(learning_rate=stepsize, cold_lr=stepsize*(1-0.9), momentum=0.9, kfac_update=2,\
epsilon=1e-2, stats_decay=0.99, async=1, cold_iter=1,
weight_decay_dict=policy.wd_dict, max_grad_norm=None)
pi_var_list = []
for var in tf.trainable_variables():
if "pi" in var.name:
pi_var_list.append(var)
update_op, q_runner = optim.minimize(loss, loss_sampled, var_list=pi_var_list)
do_update = U.function(inputs, update_op)
U.initialize()
# start queue runners
enqueue_threads = []
coord = tf.train.Coordinator()
for qr in [q_runner, vf.q_runner]:
assert qr is not None
enqueue_threads.extend(qr.create_threads(tf.get_default_session(), coord=coord, start=True))
i = 0
timesteps_so_far = 0
while True:
if timesteps_so_far > num_timesteps:
break
logger.log("********** Iteration %i ************"%i)
# Collect paths until we have enough timesteps
timesteps_this_batch = 0
paths = []
while True:
path = rollout(env, policy, max_pathlength, animate=(len(paths)==0 and (i % 10 == 0) and animate), obfilter=obfilter)
paths.append(path)
n = pathlength(path)
timesteps_this_batch += n
timesteps_so_far += n
if timesteps_this_batch > timesteps_per_batch:
break
# Estimate advantage function
vtargs = []
advs = []
for path in paths:
rew_t = path["reward"]
return_t = common.discount(rew_t, gamma)
vtargs.append(return_t)
vpred_t = vf.predict(path)
vpred_t = np.append(vpred_t, 0.0 if path["terminated"] else vpred_t[-1])
delta_t = rew_t + gamma*vpred_t[1:] - vpred_t[:-1]
adv_t = common.discount(delta_t, gamma * lam)
advs.append(adv_t)
# Update value function
vf.fit(paths, vtargs)
# Build arrays for policy update
ob_no = np.concatenate([path["observation"] for path in paths])
action_na = np.concatenate([path["action"] for path in paths])
oldac_dist = np.concatenate([path["action_dist"] for path in paths])
adv_n = np.concatenate(advs)
standardized_adv_n = (adv_n - adv_n.mean()) / (adv_n.std() + 1e-8)
# Policy update
do_update(ob_no, action_na, standardized_adv_n)
min_stepsize = np.float32(1e-8)
max_stepsize = np.float32(1e0)
# Adjust stepsize
kl = policy.compute_kl(ob_no, oldac_dist)
if kl > desired_kl * 2:
logger.log("kl too high")
tf.assign(stepsize, tf.maximum(min_stepsize, stepsize / 1.5)).eval()
elif kl < desired_kl / 2:
logger.log("kl too low")
tf.assign(stepsize, tf.minimum(max_stepsize, stepsize * 1.5)).eval()
else:
logger.log("kl just right!")
logger.record_tabular("EpRewMean", np.mean([path["reward"].sum() for path in paths]))
logger.record_tabular("EpRewSEM", np.std([path["reward"].sum()/np.sqrt(len(paths)) for path in paths]))
logger.record_tabular("EpLenMean", np.mean([pathlength(path) for path in paths]))
logger.record_tabular("KL", kl)
if callback:
callback()
logger.dump_tabular()
i += 1
coord.request_stop()
coord.join(enqueue_threads)
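The numeric steps in `rollout` and `learn` above (action rescaling, the advantage estimate, and the KL-driven step-size rule) can be sketched in plain NumPy. This is a minimal illustration, not the repository's code: `scale_action`, `discount`, `gae_advantages` and `adapt_stepsize` are hypothetical helper names, and `discount` is assumed to behave like `common.discount` (a discounted cumulative sum).

```python
import numpy as np

def scale_action(ac, low, high):
    """Map a policy output in [-1, 1] onto the box [low, high], then clip."""
    ac = np.asarray(ac, dtype=np.float64)
    scaled = low + (ac + 1.0) * 0.5 * (high - low)
    return np.clip(scaled, low, high)

def discount(x, gamma):
    """Discounted cumulative sum: y[t] = sum_k gamma**k * x[t + k]."""
    y = np.zeros(len(x), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + gamma * running
        y[t] = running
    return y

def gae_advantages(rewards, vpred, terminated, gamma, lam):
    """Generalized advantage estimate for one path, mirroring the loop above."""
    rewards = np.asarray(rewards, dtype=np.float64)
    vpred = np.asarray(vpred, dtype=np.float64)
    # Bootstrap with 0 if the episode ended, else with the last prediction.
    bootstrap = 0.0 if terminated else vpred[-1]
    v = np.append(vpred, bootstrap)
    delta = rewards + gamma * v[1:] - v[:-1]   # one-step TD residuals
    return discount(delta, gamma * lam)

def adapt_stepsize(stepsize, kl, desired_kl, min_stepsize=1e-8, max_stepsize=1.0):
    """Shrink the step when KL is more than 2x the target, grow it when below half."""
    if kl > desired_kl * 2:
        return max(min_stepsize, stepsize / 1.5)
    if kl < desired_kl / 2:
        return min(max_stepsize, stepsize * 1.5)
    return stepsize
```

With `lam=1` and a terminated path, the estimate reduces to the discounted return minus the value prediction, which is a quick sanity check on the indexing.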


@@ -0,0 +1,151 @@
import os.path as osp
import time
import functools
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables, load_variables
from baselines.a2c.runner import Runner
from baselines.a2c.utils import discount_with_dones
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.acktr import kfac
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, lrschedule='linear'):
self.sess = sess = get_session()
nact = ac_space.n
nbatch = nenvs * nsteps
A = tf.placeholder(tf.int32, [nbatch])
ADV = tf.placeholder(tf.float32, [nbatch])
R = tf.placeholder(tf.float32, [nbatch])
PG_LR = tf.placeholder(tf.float32, [])
VF_LR = tf.placeholder(tf.float32, [])
with tf.variable_scope('acktr_model', reuse=tf.AUTO_REUSE):
self.model = step_model = policy(nenvs, 1, sess=sess)
self.model2 = train_model = policy(nenvs*nsteps, nsteps, sess=sess)
neglogpac = train_model.pd.neglogp(A)
self.logits = logits = train_model.pi
##training loss
pg_loss = tf.reduce_mean(ADV*neglogpac)
entropy = tf.reduce_mean(train_model.pd.entropy())
pg_loss = pg_loss - ent_coef * entropy
vf_loss = tf.losses.mean_squared_error(tf.squeeze(train_model.vf), R)
train_loss = pg_loss + vf_coef * vf_loss
##Fisher loss construction
self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(neglogpac)
sample_net = train_model.vf + tf.random_normal(tf.shape(train_model.vf))
self.vf_fisher = vf_fisher_loss = - vf_fisher_coef*tf.reduce_mean(tf.pow(train_model.vf - tf.stop_gradient(sample_net), 2))
self.joint_fisher = joint_fisher_loss = pg_fisher_loss + vf_fisher_loss
self.params=params = find_trainable_variables("acktr_model")
self.grads_check = grads = tf.gradients(train_loss,params)
with tf.device('/gpu:0'):
self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,\
momentum=0.9, kfac_update=1, epsilon=0.01,\
stats_decay=0.99, async=1, cold_iter=10, max_grad_norm=max_grad_norm)
update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
self.q_runner = q_runner
self.lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
def train(obs, states, rewards, masks, actions, values):
advs = rewards - values
for step in range(len(obs)):
cur_lr = self.lr.value()
td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, PG_LR:cur_lr}
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks
policy_loss, value_loss, policy_entropy, _ = sess.run(
[pg_loss, vf_loss, entropy, train_op],
td_map
)
return policy_loss, value_loss, policy_entropy
self.train = train
self.save = functools.partial(save_variables, sess=sess)
self.load = functools.partial(load_variables, sess=sess)
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
self.value = step_model.value
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, save_interval=None, lrschedule='linear', load_path=None, **network_kwargs):
set_global_seeds(seed)
if network == 'cnn':
network_kwargs['one_dim_bias'] = True
policy = build_policy(env, network, **network_kwargs)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
make_model = lambda : Model(policy, ob_space, ac_space, nenvs, total_timesteps, nprocs=nprocs, nsteps
=nsteps, ent_coef=ent_coef, vf_coef=vf_coef, vf_fisher_coef=
vf_fisher_coef, lr=lr, max_grad_norm=max_grad_norm, kfac_clip=kfac_clip,
lrschedule=lrschedule)
if save_interval and logger.get_dir():
import cloudpickle
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
fh.write(cloudpickle.dumps(make_model))
model = make_model()
if load_path is not None:
model.load(load_path)
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
nbatch = nenvs*nsteps
tstart = time.time()
coord = tf.train.Coordinator()
enqueue_threads = model.q_runner.create_threads(model.sess, coord=coord, start=True)
for update in range(1, total_timesteps//nbatch+1):
obs, states, rewards, masks, actions, values = runner.run()
policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
model.old_obs = obs
nseconds = time.time()-tstart
fps = int((update*nbatch)/nseconds)
if update % log_interval == 0 or update == 1:
ev = explained_variance(values, rewards)
logger.record_tabular("nupdates", update)
logger.record_tabular("total_timesteps", update*nbatch)
logger.record_tabular("fps", fps)
logger.record_tabular("policy_entropy", float(policy_entropy))
logger.record_tabular("policy_loss", float(policy_loss))
logger.record_tabular("value_loss", float(value_loss))
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
savepath = osp.join(logger.get_dir(), 'checkpoint%.5i'%update)
print('Saving to', savepath)
model.save(savepath)
coord.request_stop()
coord.join(enqueue_threads)
env.close()
return model
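The `explained_variance` value logged in the loop above is imported from `baselines.common`, so its body is not shown in this diff. A minimal NumPy sketch of the usual definition, 1 - Var[y - ypred] / Var[y], is given below as an assumption about what that helper computes, not as its actual source.

```python
import numpy as np

def explained_variance(ypred, y):
    """Return 1 - Var[y - ypred] / Var[y]; NaN when y has zero variance.

    1.0 means perfect prediction, 0.0 is no better than predicting a constant,
    and negative values are worse than a constant prediction.
    """
    y = np.asarray(y, dtype=np.float64)
    ypred = np.asarray(ypred, dtype=np.float64)
    var_y = np.var(y)
    if var_y == 0:
        return np.nan
    return 1.0 - np.var(y - ypred) / var_y
```

In the training loop the first argument is the value prediction (`values`) and the second is the target (`rewards`).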

baselines/acktr/kfac.py Normal file

@@ -0,0 +1,926 @@
import tensorflow as tf
import numpy as np
import re
from baselines.acktr.kfac_utils import *
from functools import reduce
KFAC_OPS = ['MatMul', 'Conv2D', 'BiasAdd']
KFAC_DEBUG = False
class KfacOptimizer():
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
self.max_grad_norm = max_grad_norm
self._lr = learning_rate
self._momentum = momentum
self._clip_kl = clip_kl
self._channel_fac = channel_fac
self._kfac_update = kfac_update
self._async = async
self._async_stats = async_stats
self._epsilon = epsilon
self._stats_decay = stats_decay
self._blockdiag_bias = blockdiag_bias
self._approxT2 = approxT2
self._use_float64 = use_float64
self._factored_damping = factored_damping
self._cold_iter = cold_iter
if cold_lr is None:
# good heuristics
self._cold_lr = self._lr# * 3.
else:
self._cold_lr = cold_lr
self._stats_accum_iter = stats_accum_iter
self._weight_decay_dict = weight_decay_dict
self._diag_init_coeff = 0.
self._full_stats_init = full_stats_init
if not self._full_stats_init:
self._stats_accum_iter = self._cold_iter
self.sgd_step = tf.Variable(0, name='KFAC/sgd_step', trainable=False)
self.global_step = tf.Variable(
0, name='KFAC/global_step', trainable=False)
self.cold_step = tf.Variable(0, name='KFAC/cold_step', trainable=False)
self.factor_step = tf.Variable(
0, name='KFAC/factor_step', trainable=False)
self.stats_step = tf.Variable(
0, name='KFAC/stats_step', trainable=False)
self.vFv = tf.Variable(0., name='KFAC/vFv', trainable=False)
self.factors = {}
self.param_vars = []
self.stats = {}
self.stats_eigen = {}
def getFactors(self, g, varlist):
graph = tf.get_default_graph()
factorTensors = {}
fpropTensors = []
bpropTensors = []
opTypes = []
fops = []
def searchFactors(gradient, graph):
# hard-coded search strategy
bpropOp = gradient.op
bpropOp_name = bpropOp.name
bTensors = []
fTensors = []
# combining additive gradients; assume they come from the same op type
# and are independent
if 'AddN' in bpropOp_name:
factors = []
for g in gradient.op.inputs:
factors.append(searchFactors(g, graph))
op_names = [item['opName'] for item in factors]
# TO-DO: need to check all the attribute of the ops as well
print (gradient.name)
print (op_names)
print (len(np.unique(op_names)))
assert len(np.unique(op_names)) == 1, gradient.name + \
' is shared among different computation OPs'
bTensors = reduce(lambda x, y: x + y,
[item['bpropFactors'] for item in factors])
if len(factors[0]['fpropFactors']) > 0:
fTensors = reduce(
lambda x, y: x + y, [item['fpropFactors'] for item in factors])
fpropOp_name = op_names[0]
fpropOp = factors[0]['op']
else:
fpropOp_name = re.search(
'gradientsSampled(_[0-9]+|)/(.+?)_grad', bpropOp_name).group(2)
fpropOp = graph.get_operation_by_name(fpropOp_name)
if fpropOp.op_def.name in KFAC_OPS:
# Known OPs
###
bTensor = [
i for i in bpropOp.inputs if 'gradientsSampled' in i.name][-1]
bTensorShape = fpropOp.outputs[0].get_shape()
if bTensor.get_shape()[0].value is None:
bTensor.set_shape(bTensorShape)
bTensors.append(bTensor)
###
if fpropOp.op_def.name == 'BiasAdd':
fTensors = []
else:
fTensors.append(
[i for i in fpropOp.inputs if param.op.name not in i.name][0])
fpropOp_name = fpropOp.op_def.name
else:
# unknown OPs, block approximation used
bInputsList = [i for i in bpropOp.inputs[
0].op.inputs if 'gradientsSampled' in i.name if 'Shape' not in i.name]
if len(bInputsList) > 0:
bTensor = bInputsList[0]
bTensorShape = fpropOp.outputs[0].get_shape()
if len(bTensor.get_shape()) > 0 and bTensor.get_shape()[0].value is None:
bTensor.set_shape(bTensorShape)
bTensors.append(bTensor)
# list.append() returns None, so build the name first and then record it
fpropOp_name = 'UNK-' + fpropOp.op_def.name
opTypes.append(fpropOp_name)
return {'opName': fpropOp_name, 'op': fpropOp, 'fpropFactors': fTensors, 'bpropFactors': bTensors}
for t, param in zip(g, varlist):
if KFAC_DEBUG:
print(('get factor for '+param.name))
factors = searchFactors(t, graph)
factorTensors[param] = factors
########
# check associated weights and bias for homogeneous coordinate representation
# and check for redundant factors
# TO-DO: there may be a bug to detect associate bias and weights for
# forking layer, e.g. in inception models.
for param in varlist:
factorTensors[param]['assnWeights'] = None
factorTensors[param]['assnBias'] = None
for param in varlist:
if factorTensors[param]['opName'] == 'BiasAdd':
factorTensors[param]['assnWeights'] = None
for item in varlist:
if len(factorTensors[item]['bpropFactors']) > 0:
if (set(factorTensors[item]['bpropFactors']) == set(factorTensors[param]['bpropFactors'])) and (len(factorTensors[item]['fpropFactors']) > 0):
factorTensors[param]['assnWeights'] = item
factorTensors[item]['assnBias'] = param
factorTensors[param]['bpropFactors'] = factorTensors[
item]['bpropFactors']
########
########
# concatenate the additive gradients along the batch dimension, i.e.
# assuming independence structure
for key in ['fpropFactors', 'bpropFactors']:
for i, param in enumerate(varlist):
if len(factorTensors[param][key]) > 0:
if (key + '_concat') not in factorTensors[param]:
name_scope = factorTensors[param][key][0].name.split(':')[
0]
with tf.name_scope(name_scope):
factorTensors[param][
key + '_concat'] = tf.concat(factorTensors[param][key], 0)
else:
factorTensors[param][key + '_concat'] = None
for j, param2 in enumerate(varlist[(i + 1):]):
if (len(factorTensors[param][key]) > 0) and (set(factorTensors[param2][key]) == set(factorTensors[param][key])):
factorTensors[param2][key] = factorTensors[param][key]
factorTensors[param2][
key + '_concat'] = factorTensors[param][key + '_concat']
########
if KFAC_DEBUG:
# fpropTensors/bpropTensors are never filled, so iterate over varlist directly
for param in varlist:
print((param.name, factorTensors[param]))
self.factors = factorTensors
return factorTensors
def getStats(self, factors, varlist):
if len(self.stats) == 0:
# initialize stats variables on CPU because eigen decomp is
# computed on CPU
with tf.device('/cpu'):
tmpStatsCache = {}
# search for tensor factors and
# use block diag approx for the bias units
for var in varlist:
fpropFactor = factors[var]['fpropFactors_concat']
bpropFactor = factors[var]['bpropFactors_concat']
opType = factors[var]['opName']
if opType == 'Conv2D':
Kh = var.get_shape()[0]
Kw = var.get_shape()[1]
C = fpropFactor.get_shape()[-1]
Oh = bpropFactor.get_shape()[1]
Ow = bpropFactor.get_shape()[2]
if Oh == 1 and Ow == 1 and self._channel_fac:
# factorization along the channels does not support
# homogeneous coordinates
var_assnBias = factors[var]['assnBias']
if var_assnBias:
factors[var]['assnBias'] = None
factors[var_assnBias]['assnWeights'] = None
##
for var in varlist:
fpropFactor = factors[var]['fpropFactors_concat']
bpropFactor = factors[var]['bpropFactors_concat']
opType = factors[var]['opName']
self.stats[var] = {'opName': opType,
'fprop_concat_stats': [],
'bprop_concat_stats': [],
'assnWeights': factors[var]['assnWeights'],
'assnBias': factors[var]['assnBias'],
}
if fpropFactor is not None:
if fpropFactor not in tmpStatsCache:
if opType == 'Conv2D':
Kh = var.get_shape()[0]
Kw = var.get_shape()[1]
C = fpropFactor.get_shape()[-1]
Oh = bpropFactor.get_shape()[1]
Ow = bpropFactor.get_shape()[2]
if Oh == 1 and Ow == 1 and self._channel_fac:
# factorization along the channels
# assume independence between input channels and spatial
# 2K-1 x 2K-1 covariance matrix and C x C covariance matrix
# factorization along the channels does not
# support homogeneous coordinates, so assnBias
# is always None
fpropFactor2_size = Kh * Kw
slot_fpropFactor_stats2 = tf.Variable(tf.diag(tf.ones(
[fpropFactor2_size])) * self._diag_init_coeff, name='KFAC_STATS/' + fpropFactor.op.name, trainable=False)
self.stats[var]['fprop_concat_stats'].append(
slot_fpropFactor_stats2)
fpropFactor_size = C
else:
# 2K-1 x 2K-1 x C x C covariance matrix
# assume BHWC
fpropFactor_size = Kh * Kw * C
else:
# D x D covariance matrix
fpropFactor_size = fpropFactor.get_shape()[-1]
# use homogeneous coordinate
if not self._blockdiag_bias and self.stats[var]['assnBias']:
fpropFactor_size += 1
slot_fpropFactor_stats = tf.Variable(tf.diag(tf.ones(
[fpropFactor_size])) * self._diag_init_coeff, name='KFAC_STATS/' + fpropFactor.op.name, trainable=False)
self.stats[var]['fprop_concat_stats'].append(
slot_fpropFactor_stats)
if opType != 'Conv2D':
tmpStatsCache[fpropFactor] = self.stats[
var]['fprop_concat_stats']
else:
self.stats[var][
'fprop_concat_stats'] = tmpStatsCache[fpropFactor]
if bpropFactor is not None:
# no need to collect backward stats for bias vectors if
# using homogeneous coordinates
if not((not self._blockdiag_bias) and self.stats[var]['assnWeights']):
if bpropFactor not in tmpStatsCache:
slot_bpropFactor_stats = tf.Variable(tf.diag(tf.ones([bpropFactor.get_shape(
)[-1]])) * self._diag_init_coeff, name='KFAC_STATS/' + bpropFactor.op.name, trainable=False)
self.stats[var]['bprop_concat_stats'].append(
slot_bpropFactor_stats)
tmpStatsCache[bpropFactor] = self.stats[
var]['bprop_concat_stats']
else:
self.stats[var][
'bprop_concat_stats'] = tmpStatsCache[bpropFactor]
return self.stats
def compute_and_apply_stats(self, loss_sampled, var_list=None):
varlist = var_list
if varlist is None:
varlist = tf.trainable_variables()
stats = self.compute_stats(loss_sampled, var_list=varlist)
return self.apply_stats(stats)
def compute_stats(self, loss_sampled, var_list=None):
varlist = var_list
if varlist is None:
varlist = tf.trainable_variables()
gs = tf.gradients(loss_sampled, varlist, name='gradientsSampled')
self.gs = gs
factors = self.getFactors(gs, varlist)
stats = self.getStats(factors, varlist)
updateOps = []
statsUpdates = {}
statsUpdates_cache = {}
for var in varlist:
opType = factors[var]['opName']
fops = factors[var]['op']
fpropFactor = factors[var]['fpropFactors_concat']
fpropStats_vars = stats[var]['fprop_concat_stats']
bpropFactor = factors[var]['bpropFactors_concat']
bpropStats_vars = stats[var]['bprop_concat_stats']
SVD_factors = {}
for stats_var in fpropStats_vars:
stats_var_dim = int(stats_var.get_shape()[0])
if stats_var not in statsUpdates_cache:
old_fpropFactor = fpropFactor
B = (tf.shape(fpropFactor)[0]) # batch size
if opType == 'Conv2D':
strides = fops.get_attr("strides")
padding = fops.get_attr("padding")
convkernel_size = var.get_shape()[0:3]
KH = int(convkernel_size[0])
KW = int(convkernel_size[1])
C = int(convkernel_size[2])
flatten_size = int(KH * KW * C)
Oh = int(bpropFactor.get_shape()[1])
Ow = int(bpropFactor.get_shape()[2])
if Oh == 1 and Ow == 1 and self._channel_fac:
# factorization along the channels
# assume independence among input channels
# factor = B x 1 x 1 x (KH xKW x C)
# patches = B x Oh x Ow x (KH xKW x C)
if len(SVD_factors) == 0:
if KFAC_DEBUG:
print(('approx %s act factor with rank-1 SVD factors' % (var.name)))
# find closest rank-1 approx to the feature map
S, U, V = tf.batch_svd(tf.reshape(
fpropFactor, [-1, KH * KW, C]))
# get rank-1 approx slides
sqrtS1 = tf.expand_dims(tf.sqrt(S[:, 0, 0]), 1)
patches_k = U[:, :, 0] * sqrtS1 # B x KH*KW
full_factor_shape = fpropFactor.get_shape()
patches_k.set_shape(
[full_factor_shape[0], KH * KW])
patches_c = V[:, :, 0] * sqrtS1 # B x C
patches_c.set_shape([full_factor_shape[0], C])
SVD_factors[C] = patches_c
SVD_factors[KH * KW] = patches_k
fpropFactor = SVD_factors[stats_var_dim]
else:
# poor mem usage implementation
patches = tf.extract_image_patches(fpropFactor, ksizes=[1, convkernel_size[
0], convkernel_size[1], 1], strides=strides, rates=[1, 1, 1, 1], padding=padding)
if self._approxT2:
if KFAC_DEBUG:
print(('approxT2 act fisher for %s' % (var.name)))
# T^2 terms * 1/T^2, size: B x C
fpropFactor = tf.reduce_mean(patches, [1, 2])
else:
# size: (B x Oh x Ow) x C
fpropFactor = tf.reshape(
patches, [-1, flatten_size]) / Oh / Ow
fpropFactor_size = int(fpropFactor.get_shape()[-1])
if stats_var_dim == (fpropFactor_size + 1) and not self._blockdiag_bias:
if opType == 'Conv2D' and not self._approxT2:
# correct padding for numerical stability (we
# divided out OhxOw from activations for T1 approx)
fpropFactor = tf.concat([fpropFactor, tf.ones(
[tf.shape(fpropFactor)[0], 1]) / Oh / Ow], 1)
else:
# use homogeneous coordinates
fpropFactor = tf.concat(
[fpropFactor, tf.ones([tf.shape(fpropFactor)[0], 1])], 1)
# average over the number of data points in a batch
# divided by B
cov = tf.matmul(fpropFactor, fpropFactor,
transpose_a=True) / tf.cast(B, tf.float32)
updateOps.append(cov)
statsUpdates[stats_var] = cov
if opType != 'Conv2D':
# HACK: for convolution we recompute fprop stats for
# every layer including forking layers
statsUpdates_cache[stats_var] = cov
for stats_var in bpropStats_vars:
stats_var_dim = int(stats_var.get_shape()[0])
if stats_var not in statsUpdates_cache:
old_bpropFactor = bpropFactor
bpropFactor_shape = bpropFactor.get_shape()
B = tf.shape(bpropFactor)[0] # batch size
C = int(bpropFactor_shape[-1]) # num channels
if opType == 'Conv2D' or len(bpropFactor_shape) == 4:
if fpropFactor is not None:
if self._approxT2:
if KFAC_DEBUG:
print(('approxT2 grad fisher for %s' % (var.name)))
bpropFactor = tf.reduce_sum(
bpropFactor, [1, 2]) # T^2 terms * 1/T^2
else:
bpropFactor = tf.reshape(
bpropFactor, [-1, C]) * Oh * Ow # T * 1/T terms
else:
# just doing block diag approx. spatial independent
# structure does not apply here. summing over
# spatial locations
if KFAC_DEBUG:
print(('block diag approx fisher for %s' % (var.name)))
bpropFactor = tf.reduce_sum(bpropFactor, [1, 2])
# assume sampled loss is averaged. TO-DO:figure out better
# way to handle this
bpropFactor *= tf.to_float(B)
##
cov_b = tf.matmul(
bpropFactor, bpropFactor, transpose_a=True) / tf.to_float(tf.shape(bpropFactor)[0])
updateOps.append(cov_b)
statsUpdates[stats_var] = cov_b
statsUpdates_cache[stats_var] = cov_b
if KFAC_DEBUG:
aKey = list(statsUpdates.keys())[0]
statsUpdates[aKey] = tf.Print(statsUpdates[aKey],
[tf.convert_to_tensor('step:'),
self.global_step,
tf.convert_to_tensor(
'computing stats'),
])
self.statsUpdates = statsUpdates
return statsUpdates
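For a fully connected layer, the forward statistic built in `compute_stats` above is the batch-averaged second moment of the layer inputs, optionally with a column of ones appended (the "homogeneous coordinate") so the bias shares the weight's factor. A minimal NumPy sketch, with `activation_cov` as a hypothetical helper name:

```python
import numpy as np

def activation_cov(acts, homogeneous=True):
    """Batch-averaged second moment a^T a / B of layer inputs.

    acts: array of shape (B, D). With homogeneous=True a column of ones is
    appended, so the result is (D + 1) x (D + 1) and also covers the bias.
    """
    acts = np.asarray(acts, dtype=np.float64)
    batch = acts.shape[0]
    if homogeneous:
        acts = np.concatenate([acts, np.ones((batch, 1))], axis=1)
    return acts.T @ acts / batch

acts = np.array([[1.0, 2.0], [3.0, 4.0]])
cov = activation_cov(acts)
print(cov.shape)  # (3, 3)
```

The last row and column of the result hold the mean activation, and the bottom-right entry is exactly 1.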
def apply_stats(self, statsUpdates):
""" compute stats and update/apply the new stats to the running average
"""
def updateAccumStats():
if self._full_stats_init:
return tf.cond(tf.greater(self.sgd_step, self._cold_iter), lambda: tf.group(*self._apply_stats(statsUpdates, accumulate=True, accumulateCoeff=1. / self._stats_accum_iter)), tf.no_op)
else:
return tf.group(*self._apply_stats(statsUpdates, accumulate=True, accumulateCoeff=1. / self._stats_accum_iter))
def updateRunningAvgStats(statsUpdates, fac_iter=1):
# return tf.cond(tf.greater_equal(self.factor_step,
# tf.convert_to_tensor(fac_iter)), lambda:
# tf.group(*self._apply_stats(stats_list, varlist)), tf.no_op)
return tf.group(*self._apply_stats(statsUpdates))
if self._async_stats:
# asynchronous stats update
update_stats = self._apply_stats(statsUpdates)
queue = tf.FIFOQueue(1, [item.dtype for item in update_stats], shapes=[
item.get_shape() for item in update_stats])
enqueue_op = queue.enqueue(update_stats)
def dequeue_stats_op():
return queue.dequeue()
self.qr_stats = tf.train.QueueRunner(queue, [enqueue_op])
update_stats_op = tf.cond(tf.equal(queue.size(), tf.convert_to_tensor(
0)), tf.no_op, lambda: tf.group(*[dequeue_stats_op(), ]))
else:
# synchronous stats update
update_stats_op = tf.cond(tf.greater_equal(
self.stats_step, self._stats_accum_iter), lambda: updateRunningAvgStats(statsUpdates), updateAccumStats)
self._update_stats_op = update_stats_op
return update_stats_op
def _apply_stats(self, statsUpdates, accumulate=False, accumulateCoeff=0.):
updateOps = []
# obtain the stats var list
for stats_var in statsUpdates:
stats_new = statsUpdates[stats_var]
if accumulate:
# simple superbatch averaging
update_op = tf.assign_add(
stats_var, accumulateCoeff * stats_new, use_locking=True)
else:
# exponential running averaging
update_op = tf.assign(
stats_var, stats_var * self._stats_decay, use_locking=True)
update_op = tf.assign_add(
update_op, (1. - self._stats_decay) * stats_new, use_locking=True)
updateOps.append(update_op)
with tf.control_dependencies(updateOps):
stats_step_op = tf.assign_add(self.stats_step, 1)
if KFAC_DEBUG:
stats_step_op = (tf.Print(stats_step_op,
[tf.convert_to_tensor('step:'),
self.global_step,
tf.convert_to_tensor('fac step:'),
self.factor_step,
tf.convert_to_tensor('sgd step:'),
self.sgd_step,
tf.convert_to_tensor('Accum:'),
tf.convert_to_tensor(accumulate),
tf.convert_to_tensor('Accum coeff:'),
tf.convert_to_tensor(accumulateCoeff),
tf.convert_to_tensor('stat step:'),
self.stats_step, updateOps[0], updateOps[1]]))
return [stats_step_op, ]
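The two update rules in `_apply_stats` above can be written without TensorFlow. This is a sketch under the assumption that the statistics are plain arrays; `accumulate_stats` and `running_average_stats` are hypothetical names.

```python
import numpy as np

def accumulate_stats(stats, new, coeff):
    """Warm-up phase: add a fixed fraction of each new estimate."""
    return stats + coeff * new

def running_average_stats(stats, new, decay):
    """Steady state: exponential moving average with the given decay."""
    return decay * stats + (1.0 - decay) * new

# Accumulating n identical estimates with coeff = 1/n recovers the estimate.
n = 4
new = np.eye(2) * 3.0
stats = np.zeros((2, 2))
for _ in range(n):
    stats = accumulate_stats(stats, new, 1.0 / n)
```

Once the accumulated value equals the estimate, the running average leaves it unchanged when fed the same estimate again, so the two phases join without a jump.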
def getStatsEigen(self, stats=None):
if len(self.stats_eigen) == 0:
stats_eigen = {}
if stats is None:
stats = self.stats
tmpEigenCache = {}
with tf.device('/cpu:0'):
for var in stats:
for key in ['fprop_concat_stats', 'bprop_concat_stats']:
for stats_var in stats[var][key]:
if stats_var not in tmpEigenCache:
stats_dim = stats_var.get_shape()[1].value
e = tf.Variable(tf.ones(
[stats_dim]), name='KFAC_FAC/' + stats_var.name.split(':')[0] + '/e', trainable=False)
Q = tf.Variable(tf.diag(tf.ones(
[stats_dim])), name='KFAC_FAC/' + stats_var.name.split(':')[0] + '/Q', trainable=False)
stats_eigen[stats_var] = {'e': e, 'Q': Q}
tmpEigenCache[
stats_var] = stats_eigen[stats_var]
else:
stats_eigen[stats_var] = tmpEigenCache[
stats_var]
self.stats_eigen = stats_eigen
return self.stats_eigen
def computeStatsEigen(self):
""" compute the eigen decomp using copied var stats to avoid concurrent read/write from other queue """
# TO-DO: figure out why this op has delays (possibly moving
# eigenvectors around?)
with tf.device('/cpu:0'):
def removeNone(tensor_list):
local_list = []
for item in tensor_list:
if item is not None:
local_list.append(item)
return local_list
def copyStats(var_list):
print("copying stats to buffer tensors before eigen decomp")
redundant_stats = {}
copied_list = []
for item in var_list:
if item is not None:
if item not in redundant_stats:
if self._use_float64:
redundant_stats[item] = tf.cast(
tf.identity(item), tf.float64)
else:
redundant_stats[item] = tf.identity(item)
copied_list.append(redundant_stats[item])
else:
copied_list.append(None)
return copied_list
#stats = [copyStats(self.fStats), copyStats(self.bStats)]
#stats = [self.fStats, self.bStats]
stats_eigen = self.stats_eigen
computedEigen = {}
eigen_reverse_lookup = {}
updateOps = []
# sync copied stats
# with tf.control_dependencies(removeNone(stats[0]) +
# removeNone(stats[1])):
with tf.control_dependencies([]):
for stats_var in stats_eigen:
if stats_var not in computedEigen:
eigens = tf.self_adjoint_eig(stats_var)
e = eigens[0]
Q = eigens[1]
if self._use_float64:
e = tf.cast(e, tf.float32)
Q = tf.cast(Q, tf.float32)
updateOps.append(e)
updateOps.append(Q)
computedEigen[stats_var] = {'e': e, 'Q': Q}
eigen_reverse_lookup[e] = stats_eigen[stats_var]['e']
eigen_reverse_lookup[Q] = stats_eigen[stats_var]['Q']
self.eigen_reverse_lookup = eigen_reverse_lookup
self.eigen_update_list = updateOps
if KFAC_DEBUG:
self.eigen_update_list = [item for item in updateOps]
with tf.control_dependencies(updateOps):
updateOps.append(tf.Print(tf.constant(
0.), [tf.convert_to_tensor('computed factor eigen')]))
return updateOps
def applyStatsEigen(self, eigen_list):
updateOps = []
print(('updating %d eigenvalue/vectors' % len(eigen_list)))
for i, (tensor, mark) in enumerate(zip(eigen_list, self.eigen_update_list)):
stats_eigen_var = self.eigen_reverse_lookup[mark]
updateOps.append(
tf.assign(stats_eigen_var, tensor, use_locking=True))
with tf.control_dependencies(updateOps):
factor_step_op = tf.assign_add(self.factor_step, 1)
updateOps.append(factor_step_op)
if KFAC_DEBUG:
updateOps.append(tf.Print(tf.constant(
0.), [tf.convert_to_tensor('updated kfac factors')]))
return updateOps
def getKfacPrecondUpdates(self, gradlist, varlist):
updatelist = []
vg = 0.
assert len(self.stats) > 0
assert len(self.stats_eigen) > 0
assert len(self.factors) > 0
counter = 0
grad_dict = {var: grad for grad, var in zip(gradlist, varlist)}
for grad, var in zip(gradlist, varlist):
GRAD_RESHAPE = False
GRAD_TRANSPOSE = False
fpropFactoredFishers = self.stats[var]['fprop_concat_stats']
bpropFactoredFishers = self.stats[var]['bprop_concat_stats']
if (len(fpropFactoredFishers) + len(bpropFactoredFishers)) > 0:
counter += 1
GRAD_SHAPE = grad.get_shape()
if len(grad.get_shape()) > 2:
# reshape conv kernel parameters
KW = int(grad.get_shape()[0])
KH = int(grad.get_shape()[1])
C = int(grad.get_shape()[2])
D = int(grad.get_shape()[3])
if len(fpropFactoredFishers) > 1 and self._channel_fac:
# reshape conv kernel parameters into tensor
grad = tf.reshape(grad, [KW * KH, C, D])
else:
# reshape conv kernel parameters into 2D grad
grad = tf.reshape(grad, [-1, D])
GRAD_RESHAPE = True
elif len(grad.get_shape()) == 1:
# reshape bias or 1D parameters
D = int(grad.get_shape()[0])
grad = tf.expand_dims(grad, 0)
GRAD_RESHAPE = True
else:
# 2D parameters
C = int(grad.get_shape()[0])
D = int(grad.get_shape()[1])
if (self.stats[var]['assnBias'] is not None) and not self._blockdiag_bias:
# using homogeneous coordinates only works for 2D grads.
# TO-DO: figure out how to factorize bias grad
# stack bias grad
var_assnBias = self.stats[var]['assnBias']
grad = tf.concat(
[grad, tf.expand_dims(grad_dict[var_assnBias], 0)], 0)
# project gradient to eigen space and reshape the eigenvalues
# for broadcasting
eigVals = []
for idx, stats in enumerate(self.stats[var]['fprop_concat_stats']):
Q = self.stats_eigen[stats]['Q']
e = detectMinVal(self.stats_eigen[stats][
'e'], var, name='act', debug=KFAC_DEBUG)
Q, e = factorReshape(Q, e, grad, facIndx=idx, ftype='act')
eigVals.append(e)
grad = gmatmul(Q, grad, transpose_a=True, reduce_dim=idx)
for idx, stats in enumerate(self.stats[var]['bprop_concat_stats']):
Q = self.stats_eigen[stats]['Q']
e = detectMinVal(self.stats_eigen[stats][
'e'], var, name='grad', debug=KFAC_DEBUG)
Q, e = factorReshape(Q, e, grad, facIndx=idx, ftype='grad')
eigVals.append(e)
grad = gmatmul(grad, Q, transpose_b=False, reduce_dim=idx)
##
#####
# whiten using eigenvalues
weightDecayCoeff = 0.
if var in self._weight_decay_dict:
weightDecayCoeff = self._weight_decay_dict[var]
if KFAC_DEBUG:
print(('weight decay coeff for %s is %f' % (var.name, weightDecayCoeff)))
if self._factored_damping:
if KFAC_DEBUG:
print(('use factored damping for %s' % (var.name)))
coeffs = 1.
num_factors = len(eigVals)
# compute the ratio of the trace norms of the left and right
# KFAC matrices, and their generalization
if len(eigVals) == 1:
damping = self._epsilon + weightDecayCoeff
else:
damping = tf.pow(
self._epsilon + weightDecayCoeff, 1. / num_factors)
eigVals_tnorm_avg = [tf.reduce_mean(
tf.abs(e)) for e in eigVals]
for e, e_tnorm in zip(eigVals, eigVals_tnorm_avg):
eig_tnorm_negList = [
item for item in eigVals_tnorm_avg if item != e_tnorm]
if len(eigVals) == 1:
adjustment = 1.
elif len(eigVals) == 2:
adjustment = tf.sqrt(
e_tnorm / eig_tnorm_negList[0])
else:
eig_tnorm_negList_prod = reduce(
lambda x, y: x * y, eig_tnorm_negList)
adjustment = tf.pow(
tf.pow(e_tnorm, num_factors - 1.) / eig_tnorm_negList_prod, 1. / num_factors)
coeffs *= (e + adjustment * damping)
else:
coeffs = 1.
damping = (self._epsilon + weightDecayCoeff)
for e in eigVals:
coeffs *= e
coeffs += damping
#grad = tf.Print(grad, [tf.convert_to_tensor('1'), tf.convert_to_tensor(var.name), grad.get_shape()])
grad /= coeffs
#grad = tf.Print(grad, [tf.convert_to_tensor('2'), tf.convert_to_tensor(var.name), grad.get_shape()])
#####
# project gradient back to euclidean space
for idx, stats in enumerate(self.stats[var]['fprop_concat_stats']):
Q = self.stats_eigen[stats]['Q']
grad = gmatmul(Q, grad, transpose_a=False, reduce_dim=idx)
for idx, stats in enumerate(self.stats[var]['bprop_concat_stats']):
Q = self.stats_eigen[stats]['Q']
grad = gmatmul(grad, Q, transpose_b=True, reduce_dim=idx)
##
#grad = tf.Print(grad, [tf.convert_to_tensor('3'), tf.convert_to_tensor(var.name), grad.get_shape()])
if (self.stats[var]['assnBias'] is not None) and not self._blockdiag_bias:
# using homogeneous coordinates only works for 2D grads.
# TO-DO: figure out how to factorize bias grad
# un-stack bias grad
var_assnBias = self.stats[var]['assnBias']
C_plus_one = int(grad.get_shape()[0])
grad_assnBias = tf.reshape(tf.slice(grad,
begin=[
C_plus_one - 1, 0],
size=[1, -1]), var_assnBias.get_shape())
grad_assnWeights = tf.slice(grad,
begin=[0, 0],
size=[C_plus_one - 1, -1])
grad_dict[var_assnBias] = grad_assnBias
grad = grad_assnWeights
#grad = tf.Print(grad, [tf.convert_to_tensor('4'), tf.convert_to_tensor(var.name), grad.get_shape()])
if GRAD_RESHAPE:
grad = tf.reshape(grad, GRAD_SHAPE)
grad_dict[var] = grad
print(('projecting %d gradient matrices' % counter))
for g, var in zip(gradlist, varlist):
grad = grad_dict[var]
### clipping ###
if KFAC_DEBUG:
print(('apply clipping to %s' % (var.name)))
# keep the returned tensor, otherwise the print op never runs
grad = tf.Print(grad, [tf.sqrt(tf.reduce_sum(tf.pow(grad, 2)))], "Euclidean norm of new grad")
local_vg = tf.reduce_sum(grad * g * (self._lr * self._lr))
vg += local_vg
# rescale everything
if KFAC_DEBUG:
print('apply vFv clipping')
scaling = tf.minimum(1., tf.sqrt(self._clip_kl / vg))
if KFAC_DEBUG:
scaling = tf.Print(scaling, [tf.convert_to_tensor(
'clip: '), scaling, tf.convert_to_tensor(' vFv: '), vg])
with tf.control_dependencies([tf.assign(self.vFv, vg)]):
updatelist = [grad_dict[var] for var in varlist]
for i, item in enumerate(updatelist):
updatelist[i] = scaling * item
return updatelist
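
The rescaling at the end of `getKfacPrecondUpdates` can be sketched without TensorFlow. This is a minimal illustration, assuming each gradient is a flat list of floats; the helper name `kl_clip_scale` is hypothetical and not part of the source.

```python
import math


def kl_clip_scale(grads, precond_grads, lr, clip_kl):
    """Return (scale, vFv) where scale = min(1, sqrt(clip_kl / vFv)).

    vFv accumulates lr^2 * sum(u * g) over all variables, mirroring
    local_vg in the source. Hypothetical helper for illustration only.
    """
    vfv = 0.0
    for g, u in zip(grads, precond_grads):
        vfv += sum(ui * gi for ui, gi in zip(u, g)) * (lr * lr)
    scale = min(1.0, math.sqrt(clip_kl / vfv))
    return scale, vfv


# Two variables: raw gradients and their preconditioned counterparts.
grads = [[1.0, 2.0], [3.0]]
precond = [[0.5, 1.0], [1.5]]

scale, vfv = kl_clip_scale(grads, precond, lr=1.0, clip_kl=0.01)
updates = [[scale * ui for ui in u] for u in precond]
```

When the step is already small (for example `lr=0.01` with the same inputs), the ratio exceeds one and the updates pass through unscaled.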
def compute_gradients(self, loss, var_list=None):
varlist = var_list
if varlist is None:
varlist = tf.trainable_variables()
g = tf.gradients(loss, varlist)
return [(a, b) for a, b in zip(g, varlist)]
def apply_gradients_kfac(self, grads):
g, varlist = list(zip(*grads))
if len(self.stats_eigen) == 0:
self.getStatsEigen()
qr = None
# launch eigen-decomp on a queue thread
if self._async:
print('Use async eigen decomp')
# get a list of factor loading tensors
factorOps_dummy = self.computeStatsEigen()
# define a queue for the list of factor loading tensors
queue = tf.FIFOQueue(1, [item.dtype for item in factorOps_dummy], shapes=[
item.get_shape() for item in factorOps_dummy])
enqueue_op = tf.cond(tf.logical_and(tf.equal(tf.mod(self.stats_step, self._kfac_update), tf.convert_to_tensor(
0)), tf.greater_equal(self.stats_step, self._stats_accum_iter)), lambda: queue.enqueue(self.computeStatsEigen()), tf.no_op)
def dequeue_op():
return queue.dequeue()
qr = tf.train.QueueRunner(queue, [enqueue_op])
updateOps = []
global_step_op = tf.assign_add(self.global_step, 1)
updateOps.append(global_step_op)
with tf.control_dependencies([global_step_op]):
# compute updates
assert self._update_stats_op is not None
updateOps.append(self._update_stats_op)
dependency_list = []
if not self._async:
dependency_list.append(self._update_stats_op)
with tf.control_dependencies(dependency_list):
def no_op_wrapper():
return tf.group(*[tf.assign_add(self.cold_step, 1)])
if not self._async:
# synchronous eigen-decomp updates
updateFactorOps = tf.cond(tf.logical_and(tf.equal(tf.mod(self.stats_step, self._kfac_update),
tf.convert_to_tensor(0)),
tf.greater_equal(self.stats_step, self._stats_accum_iter)), lambda: tf.group(*self.applyStatsEigen(self.computeStatsEigen())), no_op_wrapper)
else:
# asynchronous eigen-decomp updates using queue
updateFactorOps = tf.cond(tf.greater_equal(self.stats_step, self._stats_accum_iter),
lambda: tf.cond(tf.equal(queue.size(), tf.convert_to_tensor(0)),
tf.no_op,
lambda: tf.group(
*self.applyStatsEigen(dequeue_op())),
),
no_op_wrapper)
updateOps.append(updateFactorOps)
with tf.control_dependencies([updateFactorOps]):
def gradOp():
return list(g)
def getKfacGradOp():
return self.getKfacPrecondUpdates(g, varlist)
u = tf.cond(tf.greater(self.factor_step,
tf.convert_to_tensor(0)), getKfacGradOp, gradOp)
optim = tf.train.MomentumOptimizer(
self._lr * (1. - self._momentum), self._momentum)
#optim = tf.train.AdamOptimizer(self._lr, epsilon=0.01)
def optimOp():
def updateOptimOp():
if self._full_stats_init:
return tf.cond(tf.greater(self.factor_step, tf.convert_to_tensor(0)), lambda: optim.apply_gradients(list(zip(u, varlist))), tf.no_op)
else:
return optim.apply_gradients(list(zip(u, varlist)))
if self._full_stats_init:
return tf.cond(tf.greater_equal(self.stats_step, self._stats_accum_iter), updateOptimOp, tf.no_op)
else:
return tf.cond(tf.greater_equal(self.sgd_step, self._cold_iter), updateOptimOp, tf.no_op)
updateOps.append(optimOp())
return tf.group(*updateOps), qr
def apply_gradients(self, grads):
coldOptim = tf.train.MomentumOptimizer(
self._cold_lr, self._momentum)
def coldSGDstart():
sgd_grads, sgd_var = zip(*grads)
if self.max_grad_norm is not None:
sgd_grads, sgd_grad_norm = tf.clip_by_global_norm(sgd_grads,self.max_grad_norm)
sgd_grads = list(zip(sgd_grads,sgd_var))
sgd_step_op = tf.assign_add(self.sgd_step, 1)
coldOptim_op = coldOptim.apply_gradients(sgd_grads)
if KFAC_DEBUG:
with tf.control_dependencies([sgd_step_op, coldOptim_op]):
sgd_step_op = tf.Print(
sgd_step_op, [self.sgd_step, tf.convert_to_tensor('doing cold sgd step')])
return tf.group(*[sgd_step_op, coldOptim_op])
kfacOptim_op, qr = self.apply_gradients_kfac(grads)
def warmKFACstart():
return kfacOptim_op
return tf.cond(tf.greater(self.sgd_step, self._cold_iter), warmKFACstart, coldSGDstart), qr
def minimize(self, loss, loss_sampled, var_list=None):
grads = self.compute_gradients(loss, var_list=var_list)
update_stats_op = self.compute_and_apply_stats(
loss_sampled, var_list=var_list)
return self.apply_gradients(grads)
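
The factored-damping branch of `getKfacPrecondUpdates` earlier in this file can be illustrated for the two-factor case. The sketch below assumes plain Python lists of eigenvalues, one per Kronecker factor; `factored_damping_coeffs` is a hypothetical name, and the nested-list layout stands in for the broadcasting that the TensorFlow code gets from reshaped eigenvalue vectors.

```python
import math


def factored_damping_coeffs(eig_a, eig_g, epsilon, weight_decay=0.0):
    """Per-entry divisor for the gradient in the eigenbasis (two factors).

    Each factor receives damping^(1/2), rescaled by the square root of the
    ratio of the factors' mean absolute eigenvalues.
    """
    num_factors = 2
    damping = (epsilon + weight_decay) ** (1.0 / num_factors)
    tnorm_a = sum(abs(e) for e in eig_a) / len(eig_a)
    tnorm_g = sum(abs(e) for e in eig_g) / len(eig_g)
    adj_a = math.sqrt(tnorm_a / tnorm_g)
    adj_g = math.sqrt(tnorm_g / tnorm_a)
    return [[(ea + adj_a * damping) * (eg + adj_g * damping) for eg in eig_g]
            for ea in eig_a]


coeffs = factored_damping_coeffs([4.0, 4.0], [1.0, 1.0], epsilon=0.01)
```

Because the two adjustments are reciprocal, the product of the two damping terms is still `epsilon + weight_decay`; only its split between the factors changes.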

@@ -0,0 +1,86 @@
import tensorflow as tf
def gmatmul(a, b, transpose_a=False, transpose_b=False, reduce_dim=None):
assert reduce_dim is not None
# weird batch matmul
if len(a.get_shape()) == 2 and len(b.get_shape()) > 2:
# reshape reduce_dim to the left most dim in b
b_shape = b.get_shape()
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(reduce_dim)
b_dims.insert(0, reduce_dim)
b = tf.transpose(b, b_dims)
b_t_shape = b.get_shape()
b = tf.reshape(b, [int(b_shape[reduce_dim]), -1])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, b_t_shape)
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(0)
b_dims.insert(reduce_dim, 0)
result = tf.transpose(result, b_dims)
return result
elif len(a.get_shape()) > 2 and len(b.get_shape()) == 2:
# reshape reduce_dim to the right most dim in a
a_shape = a.get_shape()
outter_dim = len(a_shape) - 1
reduce_dim = len(a_shape) - reduce_dim - 1
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(reduce_dim)
a_dims.insert(outter_dim, reduce_dim)
a = tf.transpose(a, a_dims)
a_t_shape = a.get_shape()
a = tf.reshape(a, [-1, int(a_shape[reduce_dim])])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, a_t_shape)
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(outter_dim)
a_dims.insert(reduce_dim, outter_dim)
result = tf.transpose(result, a_dims)
return result
elif len(a.get_shape()) == 2 and len(b.get_shape()) == 2:
return tf.matmul(a, b, transpose_a=transpose_a, transpose_b=transpose_b)
assert False, 'something went wrong'
def clipoutNeg(vec, threshold=1e-6):
mask = tf.cast(vec > threshold, tf.float32)
return mask * vec
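
As a rough illustration of what `clipoutNeg` does to a vector of eigenvalues, here is a list-based sketch. The name `clipout_neg` is hypothetical, and it returns `0.0` directly instead of multiplying by a mask.

```python
def clipout_neg(values, threshold=1e-6):
    """Zero out every entry that is not strictly above the threshold."""
    return [v if v > threshold else 0.0 for v in values]


clipped = clipout_neg([-1.0, 5e-7, 2.0])
```

Negative and near-zero eigenvalues are dropped, so only directions with meaningful curvature estimates survive.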
def detectMinVal(input_mat, var, threshold=1e-6, name='', debug=False):
eigen_min = tf.reduce_min(input_mat)
eigen_max = tf.reduce_max(input_mat)
eigen_ratio = eigen_max / eigen_min
input_mat_clipped = clipoutNeg(input_mat, threshold)
if debug:
input_mat_clipped = tf.cond(tf.logical_or(tf.greater(eigen_ratio, 0.), tf.less(eigen_ratio, -500)), lambda: input_mat_clipped, lambda: tf.Print(
input_mat_clipped, [tf.convert_to_tensor('screwed ratio ' + name + ' eigen values!!!'), tf.convert_to_tensor(var.name), eigen_min, eigen_max, eigen_ratio]))
return input_mat_clipped
def factorReshape(Q, e, grad, facIndx=0, ftype='act'):
grad_shape = grad.get_shape()
if ftype == 'act':
assert e.get_shape()[0] == grad_shape[facIndx]
expanded_shape = [1, ] * len(grad_shape)
expanded_shape[facIndx] = -1
e = tf.reshape(e, expanded_shape)
if ftype == 'grad':
assert e.get_shape()[0] == grad_shape[len(grad_shape) - facIndx - 1]
expanded_shape = [1, ] * len(grad_shape)
expanded_shape[len(grad_shape) - facIndx - 1] = -1
e = tf.reshape(e, expanded_shape)
return Q, e

@@ -0,0 +1,42 @@
import numpy as np
import tensorflow as tf
from baselines.acktr.utils import dense, kl_div
import baselines.common.tf_util as U
class GaussianMlpPolicy(object):
def __init__(self, ob_dim, ac_dim):
# Here we'll construct a bunch of expressions, which will be used in two places:
# (1) When sampling actions
# (2) When computing loss functions, for the policy update
# Variables specific to (1) have the word "sampled" in them,
# whereas variables specific to (2) have the word "old" in them
ob_no = tf.placeholder(tf.float32, shape=[None, ob_dim*2], name="ob") # batch of observations
oldac_na = tf.placeholder(tf.float32, shape=[None, ac_dim], name="ac") # batch of previous actions
oldac_dist = tf.placeholder(tf.float32, shape=[None, ac_dim*2], name="oldac_dist") # batch of previous action distributions
adv_n = tf.placeholder(tf.float32, shape=[None], name="adv") # advantage function estimate
wd_dict = {}
h1 = tf.nn.tanh(dense(ob_no, 64, "h1", weight_init=U.normc_initializer(1.0), bias_init=0.0, weight_loss_dict=wd_dict))
h2 = tf.nn.tanh(dense(h1, 64, "h2", weight_init=U.normc_initializer(1.0), bias_init=0.0, weight_loss_dict=wd_dict))
mean_na = dense(h2, ac_dim, "mean", weight_init=U.normc_initializer(0.1), bias_init=0.0, weight_loss_dict=wd_dict) # Mean control output
self.wd_dict = wd_dict
self.logstd_1a = logstd_1a = tf.get_variable("logstd", [ac_dim], tf.float32, tf.zeros_initializer()) # Log standard deviation of outputs
logstd_1a = tf.expand_dims(logstd_1a, 0)
std_1a = tf.exp(logstd_1a)
std_na = tf.tile(std_1a, [tf.shape(mean_na)[0], 1])
ac_dist = tf.concat([tf.reshape(mean_na, [-1, ac_dim]), tf.reshape(std_na, [-1, ac_dim])], 1)
sampled_ac_na = tf.random_normal(tf.shape(ac_dist[:,ac_dim:])) * ac_dist[:,ac_dim:] + ac_dist[:,:ac_dim] # This is the sampled action we'll perform.
logprobsampled_n = - tf.reduce_sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * tf.reduce_sum(tf.square(ac_dist[:,:ac_dim] - sampled_ac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of sampled action
logprob_n = - tf.reduce_sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * tf.reduce_sum(tf.square(ac_dist[:,:ac_dim] - oldac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of previous actions under CURRENT policy (whereas oldlogprob_n is under OLD policy)
kl = tf.reduce_mean(kl_div(oldac_dist, ac_dist, ac_dim))
#kl = .5 * tf.reduce_mean(tf.square(logprob_n - oldlogprob_n)) # Approximation of KL divergence between old policy used to generate actions, and new policy used to compute logprob_n
surr = - tf.reduce_mean(adv_n * logprob_n) # Loss function that we'll differentiate to get the policy gradient
surr_sampled = - tf.reduce_mean(logprob_n) # Sampled loss of the policy
self._act = U.function([ob_no], [sampled_ac_na, ac_dist, logprobsampled_n]) # Generate a new action and its logprob
#self.compute_kl = U.function([ob_no, oldac_na, oldlogprob_n], kl) # Compute (approximate) KL divergence between old policy and new policy
self.compute_kl = U.function([ob_no, oldac_dist], kl)
self.update_info = ((ob_no, oldac_na, adv_n), surr, surr_sampled) # Input and output variables needed for computing loss
U.initialize() # Initialize uninitialized TF variables
def act(self, ob):
ac, ac_dist, logp = self._act(ob[None])
return ac[0], ac_dist[0], logp[0]
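
The `logprob_n` and `logprobsampled_n` expressions above compute the log-density of a diagonal Gaussian. A small pure-Python sketch of the same formula for a single action follows, assuming mean and standard deviation are given as lists; `diag_gaussian_logprob` is a hypothetical name.

```python
import math


def diag_gaussian_logprob(action, mean, std):
    """Log-density of `action` under N(mean, diag(std^2))."""
    d = len(action)
    log_std_sum = sum(math.log(s) for s in std)
    quad = sum((m - a) ** 2 / s ** 2 for a, m, s in zip(action, mean, std))
    return -log_std_sum - 0.5 * math.log(2.0 * math.pi) * d - 0.5 * quad


logp = diag_gaussian_logprob([0.0], [0.0], [1.0])
```

The three terms map one-to-one onto the three terms of the TensorFlow expression: the sum of log standard deviations, the normalizing constant, and the scaled squared error.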

@@ -0,0 +1,23 @@
#!/usr/bin/env python3
from functools import partial
from baselines import logger
from baselines.acktr.acktr_disc import learn
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.common.policies import cnn
def train(env_id, num_timesteps, seed, num_cpu):
env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
policy_fn = cnn(env=env, one_dim_bias=True)
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
env.close()
def main():
args = atari_arg_parser().parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed, num_cpu=32)
if __name__ == '__main__':
main()

@@ -0,0 +1,34 @@
#!/usr/bin/env python3
import tensorflow as tf
from baselines import logger
from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
from baselines.acktr.acktr_cont import learn
from baselines.acktr.policies import GaussianMlpPolicy
from baselines.acktr.value_functions import NeuralNetValueFunction
def train(env_id, num_timesteps, seed):
env = make_mujoco_env(env_id, seed)
with tf.Session(config=tf.ConfigProto()):
ob_dim = env.observation_space.shape[0]
ac_dim = env.action_space.shape[0]
with tf.variable_scope("vf"):
vf = NeuralNetValueFunction(ob_dim, ac_dim)
with tf.variable_scope("pi"):
policy = GaussianMlpPolicy(ob_dim, ac_dim)
learn(env, policy=policy, vf=vf,
gamma=0.99, lam=0.97, timesteps_per_batch=2500,
desired_kl=0.002,
num_timesteps=num_timesteps, animate=False)
env.close()
def main():
args = mujoco_arg_parser().parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
if __name__ == "__main__":
main()

28
baselines/acktr/utils.py Normal file
@@ -0,0 +1,28 @@
import tensorflow as tf
def dense(x, size, name, weight_init=None, bias_init=0, weight_loss_dict=None, reuse=None):
with tf.variable_scope(name, reuse=reuse):
assert (len(tf.get_variable_scope().name.split('/')) == 2)
w = tf.get_variable("w", [x.get_shape()[1], size], initializer=weight_init)
b = tf.get_variable("b", [size], initializer=tf.constant_initializer(bias_init))
weight_decay_fc = 3e-4
if weight_loss_dict is not None:
weight_decay = tf.multiply(tf.nn.l2_loss(w), weight_decay_fc, name='weight_decay_loss')
if weight_loss_dict is not None:
weight_loss_dict[w] = weight_decay_fc
weight_loss_dict[b] = 0.0
tf.add_to_collection(tf.get_variable_scope().name.split('/')[0] + '_' + 'losses', weight_decay)
return tf.nn.bias_add(tf.matmul(x, w), b)
def kl_div(action_dist1, action_dist2, action_size):
mean1, std1 = action_dist1[:, :action_size], action_dist1[:, action_size:]
mean2, std2 = action_dist2[:, :action_size], action_dist2[:, action_size:]
numerator = tf.square(mean1 - mean2) + tf.square(std1) - tf.square(std2)
denominator = 2 * tf.square(std2) + 1e-8
return tf.reduce_sum(
numerator/denominator + tf.log(std2) - tf.log(std1),reduction_indices=-1)
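
The `kl_div` formula can be checked by hand with a list-based version. This sketch assumes means and standard deviations are passed as separate lists instead of one concatenated tensor; `gaussian_kl` is a hypothetical name.

```python
import math


def gaussian_kl(mean1, std1, mean2, std2):
    """KL(N(mean1, std1^2) || N(mean2, std2^2)), summed over dimensions."""
    total = 0.0
    for m1, s1, m2, s2 in zip(mean1, std1, mean2, std2):
        numerator = (m1 - m2) ** 2 + s1 ** 2 - s2 ** 2
        denominator = 2 * s2 ** 2 + 1e-8  # same stabilizer as the source
        total += numerator / denominator + math.log(s2) - math.log(s1)
    return total


same = gaussian_kl([0.0], [1.0], [0.0], [1.0])
shifted = gaussian_kl([0.0], [1.0], [1.0], [1.0])
```

Identical distributions give zero, and shifting a unit-variance Gaussian by one gives approximately 0.5, which matches the closed form.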

@@ -0,0 +1,50 @@
from baselines import logger
import numpy as np
import baselines.common as common
from baselines.common import tf_util as U
import tensorflow as tf
from baselines.acktr import kfac
from baselines.acktr.utils import dense
class NeuralNetValueFunction(object):
def __init__(self, ob_dim, ac_dim): #pylint: disable=W0613
X = tf.placeholder(tf.float32, shape=[None, ob_dim*2+ac_dim*2+2]) # batch of observations
vtarg_n = tf.placeholder(tf.float32, shape=[None], name='vtarg')
wd_dict = {}
h1 = tf.nn.elu(dense(X, 64, "h1", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict))
h2 = tf.nn.elu(dense(h1, 64, "h2", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict))
vpred_n = dense(h2, 1, "hfinal", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict)[:,0]
sample_vpred_n = vpred_n + tf.random_normal(tf.shape(vpred_n))
wd_loss = tf.get_collection("vf_losses", None)
loss = tf.reduce_mean(tf.square(vpred_n - vtarg_n)) + tf.add_n(wd_loss)
loss_sampled = tf.reduce_mean(tf.square(vpred_n - tf.stop_gradient(sample_vpred_n)))
self._predict = U.function([X], vpred_n)
optim = kfac.KfacOptimizer(learning_rate=0.001, cold_lr=0.001*(1-0.9), momentum=0.9, \
clip_kl=0.3, epsilon=0.1, stats_decay=0.95, \
async=1, kfac_update=2, cold_iter=50, \
weight_decay_dict=wd_dict, max_grad_norm=None)
vf_var_list = []
for var in tf.trainable_variables():
if "vf" in var.name:
vf_var_list.append(var)
update_op, self.q_runner = optim.minimize(loss, loss_sampled, var_list=vf_var_list)
self.do_update = U.function([X, vtarg_n], update_op) #pylint: disable=E1101
U.initialize() # Initialize uninitialized TF variables
def _preproc(self, path):
l = pathlength(path)
al = np.arange(l).reshape(-1,1)/10.0
act = path["action_dist"].astype('float32')
X = np.concatenate([path['observation'], act, al, np.ones((l, 1))], axis=1)
return X
def predict(self, path):
return self._predict(self._preproc(path))
def fit(self, paths, targvals):
X = np.concatenate([self._preproc(p) for p in paths])
y = np.concatenate(targvals)
logger.record_tabular("EVBefore", common.explained_variance(self._predict(X), y))
for _ in range(25): self.do_update(X, y)
logger.record_tabular("EVAfter", common.explained_variance(self._predict(X), y))
def pathlength(path):
return path["reward"].shape[0]
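
The feature layout built by `_preproc` can be sketched with plain lists. This assumes a path is a dict of per-timestep lists; the function name `preproc_rows` and the sample values are hypothetical.

```python
def preproc_rows(path):
    """Concatenate observation, action distribution, time index / 10, and a constant 1."""
    n_steps = len(path["reward"])
    rows = []
    for t in range(n_steps):
        rows.append(list(path["observation"][t])
                    + list(path["action_dist"][t])
                    + [t / 10.0, 1.0])
    return rows


path = {
    "reward": [0.0, 1.0],
    "observation": [[0.1, 0.2], [0.3, 0.4]],
    "action_dist": [[0.0, 1.0], [0.5, 1.0]],
}
features = preproc_rows(path)
```

The two trailing columns account for the `+2` in the placeholder width `ob_dim*2+ac_dim*2+2`, presumably with the stored observation already `ob_dim*2` wide.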

@@ -1,3 +1,2 @@
from baselines.bench.benchmarks import *
from baselines.bench.monitor import *
from baselines.bench.monitor import *

@@ -1,93 +1,151 @@
import re
import os.path as osp
import os
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
_atari7 = ['BeamRider', 'Breakout', 'Enduro', 'Pong', 'Qbert', 'Seaquest', 'SpaceInvaders']
_atariexpl7 = ['Freeway', 'Gravitar', 'MontezumaRevenge', 'Pitfall', 'PrivateEye', 'Solaris', 'Venture']
_BENCHMARKS = []
remove_version_re = re.compile(r'-v\d+$')
def register_benchmark(benchmark):
for b in _BENCHMARKS:
if b['name'] == benchmark['name']:
raise ValueError('Benchmark with name %s already registered!'%b['name'])
raise ValueError('Benchmark with name %s already registered!' % b['name'])
# automatically add a description if it is not present
if 'tasks' in benchmark:
for t in benchmark['tasks']:
if 'desc' not in t:
t['desc'] = remove_version_re.sub('', t['env_id'])
_BENCHMARKS.append(benchmark)
def list_benchmarks():
return [b['name'] for b in _BENCHMARKS]
def get_benchmark(benchmark_name):
for b in _BENCHMARKS:
if b['name'] == benchmark_name:
return b
raise ValueError('%s not found! Known benchmarks: %s' % (benchmark_name, list_benchmarks()))
def get_task(benchmark, env_id):
"""Get a task by env_id. Return None if the benchmark doesn't have the env"""
return next(filter(lambda task: task['env_id'] == env_id, benchmark['tasks']), None)
def find_task_for_env_id_in_any_benchmark(env_id):
for bm in _BENCHMARKS:
for task in bm["tasks"]:
if task["env_id"] == env_id:
return bm, task
return None, None
_ATARI_SUFFIX = 'NoFrameskip-v4'
register_benchmark({
'name' : 'Atari200M',
'description' :'7 Atari games from Mnih et al. (2013), with pixel observations, 200M frames',
'tasks' : [{'env_id' : _game + _ATARI_SUFFIX, 'trials' : 2, 'num_timesteps' : int(200e6)} for _game in _atari7]
'name': 'Atari50M',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 50M timesteps',
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(50e6)} for _game in _atari7]
})
register_benchmark({
'name' : 'Atari40M',
'description' :'7 Atari games from Mnih et al. (2013), with pixel observations, 40M frames',
'tasks' : [{'env_id' : _game + _ATARI_SUFFIX, 'trials' : 2, 'num_timesteps' : int(40e6)} for _game in _atari7]
'name': 'Atari10M',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 6, 'num_timesteps': int(10e6)} for _game in _atari7]
})
register_benchmark({
'name' : 'Atari1Hr',
'description' :'7 Atari games from Mnih et al. (2013), with pixel observations, 1 hour of walltime',
'tasks' : [{'env_id' : _game + _ATARI_SUFFIX, 'trials' : 2, 'num_seconds' : 60*60} for _game in _atari7]
'name': 'Atari1Hr',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 1 hour of walltime',
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_seconds': 60 * 60} for _game in _atari7]
})
register_benchmark({
'name' : 'AtariExploration40M',
'description' :'7 Atari games emphasizing exploration, with pixel observations, 40M frames',
'tasks' : [{'env_id' : _game + _ATARI_SUFFIX, 'trials' : 2, 'num_timesteps' : int(40e6)} for _game in _atariexpl7]
'name': 'AtariExploration10M',
'description': '7 Atari games emphasizing exploration, with pixel observations, 10M timesteps',
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atariexpl7]
})
# MuJoCo
_mujocosmall = [
'InvertedDoublePendulum-v1', 'InvertedPendulum-v1',
'HalfCheetah-v1', 'Hopper-v1', 'Walker2d-v1',
'Reacher-v1', 'Swimmer-v1']
'InvertedDoublePendulum-v2', 'InvertedPendulum-v2',
'HalfCheetah-v2', 'Hopper-v2', 'Walker2d-v2',
'Reacher-v2', 'Swimmer-v2']
register_benchmark({
'name' : 'Mujoco1M',
'description' : 'Some small 2D MuJoCo tasks, run for 1M timesteps',
'tasks' : [{'env_id' : _envid, 'trials' : 3, 'num_timesteps' : int(1e6)} for _envid in _mujocosmall]
'name': 'Mujoco1M',
'description': 'Some small 2D MuJoCo tasks, run for 1M timesteps',
'tasks': [{'env_id': _envid, 'trials': 6, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
})
_roboschool_mujoco = [
'RoboschoolInvertedDoublePendulum-v0', 'RoboschoolInvertedPendulum-v0', # cartpole
'RoboschoolHalfCheetah-v0', 'RoboschoolHopper-v0', 'RoboschoolWalker2d-v0', # forward walkers
'RoboschoolReacher-v0'
register_benchmark({
'name': 'MujocoWalkers',
'description': 'MuJoCo forward walkers, run for 8M, humanoid 100M',
'tasks': [
{'env_id': "Hopper-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
{'env_id': "Walker2d-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
{'env_id': "Humanoid-v1", 'trials': 4, 'num_timesteps': 100 * 1000000},
]
register_benchmark({
'name' : 'RoboschoolMujoco2M',
'description' : 'Same small 2D tasks, still improving up to 2M',
'tasks' : [{'env_id' : _envid, 'trials' : 3, 'num_timesteps' : int(2e6)} for _envid in _roboschool_mujoco]
})
# Roboschool
_atari50 = [ # actually 49
'Alien', 'Amidar', 'Assault', 'Asterix', 'Asteroids',
'Atlantis', 'BankHeist', 'BattleZone', 'BeamRider', 'Bowling',
'Boxing', 'Breakout', 'Centipede', 'ChopperCommand', 'CrazyClimber',
'DemonAttack', 'DoubleDunk', 'Enduro', 'FishingDerby', 'Freeway',
'Frostbite', 'Gopher', 'Gravitar', 'IceHockey', 'Jamesbond',
'Kangaroo', 'Krull', 'KungFuMaster', 'MontezumaRevenge', 'MsPacman',
'NameThisGame', 'Pitfall', 'Pong', 'PrivateEye', 'Qbert',
'Riverraid', 'RoadRunner', 'Robotank', 'Seaquest', 'SpaceInvaders',
'StarGunner', 'Tennis', 'TimePilot', 'Tutankham', 'UpNDown',
'Venture', 'VideoPinball', 'WizardOfWor', 'Zaxxon',
register_benchmark({
'name': 'Roboschool8M',
'description': 'Small 2D tasks, up to 30 minutes to complete on 8 cores',
'tasks': [
{'env_id': "RoboschoolReacher-v1", 'trials': 4, 'num_timesteps': 2 * 1000000},
{'env_id': "RoboschoolAnt-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
{'env_id': "RoboschoolHalfCheetah-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
{'env_id': "RoboschoolHopper-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
{'env_id': "RoboschoolWalker2d-v1", 'trials': 4, 'num_timesteps': 8 * 1000000},
]
})
register_benchmark({
'name': 'RoboschoolHarder',
'description': 'Test your might!!! Up to 12 hours on 32 cores',
'tasks': [
{'env_id': "RoboschoolHumanoid-v1", 'trials': 4, 'num_timesteps': 100 * 1000000},
{'env_id': "RoboschoolHumanoidFlagrun-v1", 'trials': 4, 'num_timesteps': 200 * 1000000},
{'env_id': "RoboschoolHumanoidFlagrunHarder-v1", 'trials': 4, 'num_timesteps': 400 * 1000000},
]
})
# Other
_atari50 = [ # actually 47
'Alien', 'Amidar', 'Assault', 'Asterix', 'Asteroids',
'Atlantis', 'BankHeist', 'BattleZone', 'BeamRider', 'Bowling',
'Breakout', 'Centipede', 'ChopperCommand', 'CrazyClimber',
'DemonAttack', 'DoubleDunk', 'Enduro', 'FishingDerby', 'Freeway',
'Frostbite', 'Gopher', 'Gravitar', 'IceHockey', 'Jamesbond',
'Kangaroo', 'Krull', 'KungFuMaster', 'MontezumaRevenge', 'MsPacman',
'NameThisGame', 'Pitfall', 'Pong', 'PrivateEye', 'Qbert',
'RoadRunner', 'Robotank', 'Seaquest', 'SpaceInvaders', 'StarGunner',
'Tennis', 'TimePilot', 'Tutankham', 'UpNDown', 'Venture',
'VideoPinball', 'WizardOfWor', 'Zaxxon',
]
register_benchmark({
'name' : 'Atari50_40M',
'description' :'7 Atari games from Mnih et al. (2013), with pixel observations, 40M frames',
'tasks' : [{'env_id' : _game + _ATARI_SUFFIX, 'trials' : 3, 'num_timesteps' : int(40e6)} for _game in _atari50]
'name': 'Atari50_10M',
'description': '47 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari50]
})
# HER DDPG
register_benchmark({
'name': 'HerDdpg',
'description': 'Smoke-test only benchmark of HER',
'tasks': [{'trials': 1, 'env_id': 'FetchReach-v1'}]
})
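
A condensed, standalone copy of the registry logic shows how the automatic `desc` field and the duplicate-name check behave. The benchmark name `Demo` and its single task are made up for illustration.

```python
import re

_BENCHMARKS = []
remove_version_re = re.compile(r'-v\d+$')


def register_benchmark(benchmark):
    """Register a benchmark dict, filling in a task description if absent."""
    for b in _BENCHMARKS:
        if b['name'] == benchmark['name']:
            raise ValueError('Benchmark with name %s already registered!' % b['name'])
    for t in benchmark.get('tasks', []):
        if 'desc' not in t:
            # strip a trailing "-v<N>" version suffix from the env id
            t['desc'] = remove_version_re.sub('', t['env_id'])
    _BENCHMARKS.append(benchmark)


def get_task(benchmark, env_id):
    """Return the task with this env_id, or None if it is not present."""
    return next((t for t in benchmark['tasks'] if t['env_id'] == env_id), None)


demo = {'name': 'Demo', 'tasks': [{'env_id': 'PongNoFrameskip-v4', 'trials': 1}]}
register_benchmark(demo)
task = get_task(demo, 'PongNoFrameskip-v4')
```

Registering `demo` a second time raises `ValueError`, and looking up an unknown `env_id` returns `None`.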

@@ -2,20 +2,18 @@ __all__ = ['Monitor', 'get_monitor_files', 'load_results']
import gym
from gym.core import Wrapper
from os import path
import time
from glob import glob
try:
import ujson as json # Not necessary for monitor writing, but very useful for monitor loading
except ImportError:
import json
import csv
import os.path as osp
import json
import numpy as np
class Monitor(Wrapper):
EXT = "monitor.json"
EXT = "monitor.csv"
f = None
def __init__(self, env, filename, allow_early_resets=False):
def __init__(self, env, filename, allow_early_resets=False, reset_keywords=(), info_keywords=()):
Wrapper.__init__(self, env=env)
self.tstart = time.time()
if filename is None:
@@ -23,48 +21,38 @@ class Monitor(Wrapper):
self.logger = None
else:
if not filename.endswith(Monitor.EXT):
filename = filename + "." + Monitor.EXT
if osp.isdir(filename):
filename = osp.join(filename, Monitor.EXT)
else:
filename = filename + "." + Monitor.EXT
self.f = open(filename, "wt")
self.logger = JSONLogger(self.f)
self.logger.writekvs({"t_start": self.tstart, "gym_version": gym.__version__,
"env_id": env.spec.id if env.spec else 'Unknown'})
self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, 'env_id' : env.spec and env.spec.id}))
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords+info_keywords)
self.logger.writeheader()
self.f.flush()
self.reset_keywords = reset_keywords
self.info_keywords = info_keywords
self.allow_early_resets = allow_early_resets
self.rewards = None
self.needs_reset = True
self.episode_rewards = []
self.episode_lengths = []
self.episode_times = []
self.total_steps = 0
self.current_metadata = {} # extra info that gets injected into each log entry
# Useful for metalearning where we're modifying the environment externally
# But want our logs to know about these modifications
self.current_reset_info = {} # extra info about the current episode, that was passed in during reset()
def __getstate__(self): # XXX
d = self.__dict__.copy()
if self.f:
del d['f'], d['logger']
d['_filename'] = self.f.name
d['_num_episodes'] = len(self.episode_rewards)
else:
d['_filename'] = None
return d
def __setstate__(self, d):
filename = d.pop('_filename')
self.__dict__ = d
if filename is not None:
nlines = d.pop('_num_episodes') + 1
self.f = open(filename, "r+t")
for _ in range(nlines):
self.f.readline()
self.f.truncate()
self.logger = JSONLogger(self.f)
def reset(self):
def reset(self, **kwargs):
if not self.allow_early_resets and not self.needs_reset:
raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
self.rewards = []
self.needs_reset = False
return self.env.reset()
for k in self.reset_keywords:
v = kwargs.get(k)
if v is None:
raise ValueError('Expected you to pass kwarg %s into reset'%k)
self.current_reset_info[k] = v
return self.env.reset(**kwargs)
def step(self, action):
if self.needs_reset:
@@ -75,12 +63,16 @@ class Monitor(Wrapper):
self.needs_reset = True
eprew = sum(self.rewards)
eplen = len(self.rewards)
epinfo = {"r": eprew, "l": eplen, "t": round(time.time() - self.tstart, 6)}
epinfo.update(self.current_metadata)
if self.logger:
self.logger.writekvs(epinfo)
epinfo = {"r": round(eprew, 6), "l": eplen, "t": round(time.time() - self.tstart, 6)}
for k in self.info_keywords:
epinfo[k] = info[k]
self.episode_rewards.append(eprew)
self.episode_lengths.append(eplen)
self.episode_times.append(time.time() - self.tstart)
epinfo.update(self.current_reset_info)
if self.logger:
self.logger.writerow(epinfo)
self.f.flush()
info['episode'] = epinfo
self.total_steps += 1
return (ob, rew, done, info)
@@ -98,49 +90,74 @@ class Monitor(Wrapper):
def get_episode_lengths(self):
return self.episode_lengths
class JSONLogger(object):
def __init__(self, file):
self.file = file
def writekvs(self, kvs):
for k,v in kvs.items():
if hasattr(v, 'dtype'):
v = v.tolist()
kvs[k] = float(v)
self.file.write(json.dumps(kvs) + '\n')
self.file.flush()
def get_episode_times(self):
return self.episode_times
class LoadMonitorResultsError(Exception):
pass
def get_monitor_files(dir):
return glob(path.join(dir, "*" + Monitor.EXT))
return glob(osp.join(dir, "*" + Monitor.EXT))
def load_results(dir):
fnames = get_monitor_files(dir)
if not fnames:
import pandas
monitor_files = (
glob(osp.join(dir, "*monitor.json")) +
glob(osp.join(dir, "*monitor.csv"))) # get both csv and (old) json files
if not monitor_files:
raise LoadMonitorResultsError("no monitor files of the form *%s found in %s" % (Monitor.EXT, dir))
episodes = []
dfs = []
headers = []
for fname in fnames:
for fname in monitor_files:
with open(fname, 'rt') as fh:
lines = fh.readlines()
header = json.loads(lines[0])
headers.append(header)
for line in lines[1:]:
episode = json.loads(line)
episode['abstime'] = header['t_start'] + episode['t']
del episode['t']
episodes.append(episode)
header0 = headers[0]
for header in headers[1:]:
assert header['env_id'] == header0['env_id'], "mixing data from two envs"
episodes = sorted(episodes, key=lambda e: e['abstime'])
return {
'env_info': {'env_id': header0['env_id'], 'gym_version': header0['gym_version']},
'episode_end_times': [e['abstime'] for e in episodes],
'episode_lengths': [e['l'] for e in episodes],
'episode_rewards': [e['r'] for e in episodes],
'initial_reset_time': min([min(header['t_start'] for header in headers)])
}
if fname.endswith('csv'):
firstline = fh.readline()
if not firstline:
continue
assert firstline[0] == '#'
header = json.loads(firstline[1:])
df = pandas.read_csv(fh, index_col=None)
headers.append(header)
elif fname.endswith('json'): # Deprecated json format
episodes = []
lines = fh.readlines()
header = json.loads(lines[0])
headers.append(header)
for line in lines[1:]:
episode = json.loads(line)
episodes.append(episode)
df = pandas.DataFrame(episodes)
else:
assert 0, 'unreachable'
df['t'] += header['t_start']
dfs.append(df)
df = pandas.concat(dfs)
df.sort_values('t', inplace=True)
df.reset_index(inplace=True)
df['t'] -= min(header['t_start'] for header in headers)
df.headers = headers # HACK to preserve backwards compatibility
return df
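
The CSV monitor format used above (a `#`-prefixed JSON header line followed by ordinary CSV rows) can be exercised with the standard library alone. This is a sketch under simplifying assumptions: it uses `csv.DictReader` instead of pandas, and `write_monitor` / `read_monitor` are hypothetical helper names.

```python
import csv
import io
import json


def write_monitor(fh, t_start, env_id, episodes):
    """Write a '#'-prefixed JSON header, then one CSV row per episode."""
    fh.write('#%s\n' % json.dumps({"t_start": t_start, "env_id": env_id}))
    writer = csv.DictWriter(fh, fieldnames=('r', 'l', 't'), lineterminator='\n')
    writer.writeheader()
    for ep in episodes:
        writer.writerow(ep)


def read_monitor(fh):
    """Parse the header, then shift each episode time by t_start."""
    firstline = fh.readline()
    assert firstline[0] == '#'
    header = json.loads(firstline[1:])
    rows = [{'r': float(row['r']),
             'l': int(row['l']),
             't': float(row['t']) + header['t_start']}
            for row in csv.DictReader(fh)]
    return header, rows


buf = io.StringIO()
write_monitor(buf, 100.0, 'CartPole-v1', [{'r': 1.5, 'l': 10, 't': 0.25}])
buf.seek(0)
header, rows = read_monitor(buf)
```

Reading the header line first leaves the file positioned at the CSV column names, which is why the same handle can be passed straight to the CSV parser.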
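def test_monitor():
# imports needed by this test that the module does not provide at top level
import os
import uuid
import pandas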
env = gym.make("CartPole-v1")
env.seed(0)
mon_file = "/tmp/baselines-test-%s.monitor.csv" % uuid.uuid4()
menv = Monitor(env, mon_file)
menv.reset()
for _ in range(1000):
_, _, done, _ = menv.step(0)
if done:
menv.reset()
f = open(mon_file, 'rt')
firstline = f.readline()
assert firstline.startswith('#')
metadata = json.loads(firstline[1:])
assert metadata['env_id'] == "CartPole-v1"
# the header written by Monitor contains only t_start and env_id
assert set(metadata.keys()) == {'env_id', 't_start'}, "Incorrect keys in monitor metadata"
last_logline = pandas.read_csv(f, index_col=None)
assert set(last_logline.keys()) == {'l', 't', 'r'}, "Incorrect keys in monitor logline"
f.close()
os.remove(mon_file)

@@ -1,3 +1,4 @@
# flake8: noqa F403
from baselines.common.console_util import *
from baselines.common.dataset import Dataset
from baselines.common.math_util import *

@@ -1,9 +1,11 @@
import numpy as np
import os
os.environ.setdefault('PATH', '')
from collections import deque
from PIL import Image
import gym
from gym import spaces
import cv2
cv2.ocl.setUseOpenCL(False)
class NoopResetEnv(gym.Wrapper):
def __init__(self, env, noop_max=30):
@@ -13,11 +15,12 @@ class NoopResetEnv(gym.Wrapper):
gym.Wrapper.__init__(self, env)
self.noop_max = noop_max
self.override_num_noops = None
self.noop_action = 0
assert env.unwrapped.get_action_meanings()[0] == 'NOOP'
def _reset(self):
def reset(self, **kwargs):
""" Do no-op action for a number of steps in [1, noop_max]."""
self.env.reset()
self.env.reset(**kwargs)
if self.override_num_noops is not None:
noops = self.override_num_noops
else:
@@ -25,11 +28,14 @@ class NoopResetEnv(gym.Wrapper):
assert noops > 0
obs = None
for _ in range(noops):
obs, _, done, _ = self.env.step(0)
obs, _, done, _ = self.env.step(self.noop_action)
if done:
obs = self.env.reset()
obs = self.env.reset(**kwargs)
return obs
def step(self, ac):
return self.env.step(ac)
class FireResetEnv(gym.Wrapper):
def __init__(self, env):
"""Take action on reset for environments that are fixed until firing."""
@@ -37,16 +43,19 @@ class FireResetEnv(gym.Wrapper):
assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
assert len(env.unwrapped.get_action_meanings()) >= 3
def _reset(self):
self.env.reset()
def reset(self, **kwargs):
self.env.reset(**kwargs)
obs, _, done, _ = self.env.step(1)
if done:
self.env.reset()
self.env.reset(**kwargs)
obs, _, done, _ = self.env.step(2)
if done:
self.env.reset()
self.env.reset(**kwargs)
return obs
def step(self, ac):
return self.env.step(ac)
class EpisodicLifeEnv(gym.Wrapper):
def __init__(self, env):
"""Make end-of-life == end-of-episode, but only reset on true game over.
@@ -56,27 +65,27 @@ class EpisodicLifeEnv(gym.Wrapper):
self.lives = 0
self.was_real_done = True
def _step(self, action):
def step(self, action):
obs, reward, done, info = self.env.step(action)
self.was_real_done = done
# check current lives, make loss of life terminal,
# then update lives to handle bonus lives
lives = self.env.unwrapped.ale.lives()
if lives < self.lives and lives > 0:
# for Qbert somtimes we stay in lives == 0 condtion for a few frames
# for Qbert sometimes we stay in lives == 0 condition for a few frames
# so it's important to keep lives > 0, so that we only reset once
# the environment advertises done.
done = True
self.lives = lives
return obs, reward, done, info
def _reset(self):
def reset(self, **kwargs):
"""Reset only when lives are exhausted.
This way all states are still reachable even though lives are episodic,
and the learner need not know about any of this behind-the-scenes.
"""
if self.was_real_done:
obs = self.env.reset()
obs = self.env.reset(**kwargs)
else:
# no-op step to advance from terminal/lost life state
obs, _, _, _ = self.env.step(0)
@@ -88,32 +97,34 @@ class MaxAndSkipEnv(gym.Wrapper):
"""Return only every `skip`-th frame"""
gym.Wrapper.__init__(self, env)
# most recent raw observations (for max pooling across time steps)
self._obs_buffer = deque(maxlen=2)
self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
self._skip = skip
def _step(self, action):
def step(self, action):
"""Repeat action, sum reward, and max over last observations."""
total_reward = 0.0
done = None
for _ in range(self._skip):
for i in range(self._skip):
obs, reward, done, info = self.env.step(action)
self._obs_buffer.append(obs)
if i == self._skip - 2: self._obs_buffer[0] = obs
if i == self._skip - 1: self._obs_buffer[1] = obs
total_reward += reward
if done:
break
max_frame = np.max(np.stack(self._obs_buffer), axis=0)
# Note that the observation on the done=True frame
# doesn't matter
max_frame = self._obs_buffer.max(axis=0)
return max_frame, total_reward, done, info
def _reset(self):
"""Clear past frame buffer and init. to first obs. from inner env."""
self._obs_buffer.clear()
obs = self.env.reset()
self._obs_buffer.append(obs)
return obs
def reset(self, **kwargs):
return self.env.reset(**kwargs)
class ClipRewardEnv(gym.RewardWrapper):
def _reward(self, reward):
def __init__(self, env):
gym.RewardWrapper.__init__(self, env)
def reward(self, reward):
"""Bin reward to {+1, 0, -1} by its sign."""
return np.sign(reward)
@@ -121,52 +132,106 @@ class WarpFrame(gym.ObservationWrapper):
def __init__(self, env):
"""Warp frames to 84x84 as done in the Nature paper and later work."""
gym.ObservationWrapper.__init__(self, env)
self.res = 84
self.observation_space = spaces.Box(low=0, high=255, shape=(self.res, self.res, 1))
self.width = 84
self.height = 84
self.observation_space = spaces.Box(low=0, high=255,
shape=(self.height, self.width, 1), dtype=np.uint8)
def _observation(self, obs):
frame = np.dot(obs.astype('float32'), np.array([0.299, 0.587, 0.114], 'float32'))
frame = np.array(Image.fromarray(frame).resize((self.res, self.res),
resample=Image.BILINEAR), dtype=np.uint8)
return frame.reshape((self.res, self.res, 1))
def observation(self, frame):
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
return frame[:, :, None]
class FrameStack(gym.Wrapper):
def __init__(self, env, k):
"""Buffer observations and stack across channels (last axis)."""
"""Stack k last frames.
Returns lazy array, which is much more memory efficient.
See Also
--------
baselines.common.atari_wrappers.LazyFrames
"""
gym.Wrapper.__init__(self, env)
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
assert shp[2] == 1 # can only stack 1-channel frames
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], k))
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=np.uint8)
def _reset(self):
"""Clear buffer and re-fill by duplicating the first observation."""
def reset(self):
ob = self.env.reset()
for _ in range(self.k): self.frames.append(ob)
return self._observation()
for _ in range(self.k):
self.frames.append(ob)
return self._get_ob()
def _step(self, action):
def step(self, action):
ob, reward, done, info = self.env.step(action)
self.frames.append(ob)
return self._observation(), reward, done, info
return self._get_ob(), reward, done, info
def _observation(self):
def _get_ob(self):
assert len(self.frames) == self.k
return np.concatenate(self.frames, axis=2)
return LazyFrames(list(self.frames))
def wrap_deepmind(env, episode_life=True, clip_rewards=True):
class ScaledFloatFrame(gym.ObservationWrapper):
def __init__(self, env):
gym.ObservationWrapper.__init__(self, env)
def observation(self, observation):
# careful! This undoes the memory optimization, use
# with smaller replay buffers only.
return np.array(observation).astype(np.float32) / 255.0
class LazyFrames(object):
def __init__(self, frames):
"""This object ensures that common frames between the observations are only stored once.
It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
buffers.
This object should only be converted to numpy array before being passed to the model.
You'd not believe how complex the previous solution was."""
self._frames = frames
self._out = None
def _force(self):
if self._out is None:
self._out = np.concatenate(self._frames, axis=2)
self._frames = None
return self._out
def __array__(self, dtype=None):
out = self._force()
if dtype is not None:
out = out.astype(dtype)
return out
def __len__(self):
return len(self._force())
def __getitem__(self, i):
return self._force()[i]
def make_atari(env_id):
env = gym.make(env_id)
assert 'NoFrameskip' in env.spec.id
env = NoopResetEnv(env, noop_max=30)
env = MaxAndSkipEnv(env, skip=4)
return env
def wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack=False, scale=False):
"""Configure environment for DeepMind-style Atari.
Note: this does not include frame stacking!"""
assert 'NoFrameskip' in env.spec.id # required for DeepMind-style skip
"""
if episode_life:
env = EpisodicLifeEnv(env)
# env = NoopResetEnv(env, noop_max=30)
env = MaxAndSkipEnv(env, skip=4)
if 'FIRE' in env.unwrapped.get_action_meanings():
env = FireResetEnv(env)
env = WarpFrame(env)
if scale:
env = ScaledFloatFrame(env)
if clip_rewards:
env = ClipRewardEnv(env)
if frame_stack:
env = FrameStack(env, 4)
return env
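
Note on the reworked `MaxAndSkipEnv.step` above: the action is repeated `skip` times, rewards are summed, and the returned observation is the elementwise maximum of the last two raw frames. The following is a dependency-free sketch of that idea. `CountingEnv` and `skip_and_max` are hypothetical names, frames are plain lists instead of uint8 arrays, and when the episode ends before the last two sub-steps this sketch simply returns the latest observation, whereas the diff keeps whatever is in its preallocated buffer.

```python
class CountingEnv:
    """Toy stand-in for a gym env: the observation is [t, episode_len - t]."""

    def __init__(self, episode_len=10):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return [0, self.episode_len]

    def step(self, action):
        self.t += 1
        obs = [self.t, self.episode_len - self.t]
        done = self.t >= self.episode_len
        return obs, 1.0, done, {}


def skip_and_max(env, action, skip=4):
    """Repeat `action` up to `skip` times, sum rewards, max over the last two frames."""
    last_two = [None, None]
    total_reward = 0.0
    obs, done, info = None, False, {}
    for i in range(skip):
        obs, reward, done, info = env.step(action)
        if i == skip - 2:
            last_two[0] = obs
        if i == skip - 1:
            last_two[1] = obs
        total_reward += reward
        if done:
            break
    frames = [f for f in last_two if f is not None]
    if frames:
        # elementwise maximum across the stored frames
        max_frame = [max(px) for px in zip(*frames)]
    else:
        # assumption of this sketch: fall back to the latest observation
        max_frame = obs
    return max_frame, total_reward, done, info


env = CountingEnv(episode_len=10)
env.reset()
first = skip_and_max(env, 0)
second = skip_and_max(env, 0)
third = skip_and_max(env, 0)  # episode ends after two sub-steps
```

Taking the maximum over two consecutive frames is the usual way to cope with Atari sprites that are drawn only on alternating frames.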

@@ -1,239 +0,0 @@
import cv2
import gym
import numpy as np
from collections import deque
from gym import spaces
class NoopResetEnv(gym.Wrapper):
def __init__(self, env=None, noop_max=30):
"""Sample initial states by taking random number of no-ops on reset.
No-op is assumed to be action 0.
"""
super(NoopResetEnv, self).__init__(env)
self.noop_max = noop_max
self.override_num_noops = None
assert env.unwrapped.get_action_meanings()[0] == 'NOOP'
def _reset(self):
""" Do no-op action for a number of steps in [1, noop_max]."""
self.env.reset()
if self.override_num_noops is not None:
noops = self.override_num_noops
else:
noops = np.random.randint(1, self.noop_max + 1)
assert noops > 0
obs = None
for _ in range(noops):
obs, _, done, _ = self.env.step(0)
if done:
obs = self.env.reset()
return obs
class FireResetEnv(gym.Wrapper):
def __init__(self, env=None):
"""For environments where the user needs to press FIRE for the game to start."""
super(FireResetEnv, self).__init__(env)
assert env.unwrapped.get_action_meanings()[1] == 'FIRE'
assert len(env.unwrapped.get_action_meanings()) >= 3
def _reset(self):
self.env.reset()
obs, _, done, _ = self.env.step(1)
if done:
self.env.reset()
obs, _, done, _ = self.env.step(2)
if done:
self.env.reset()
return obs
class EpisodicLifeEnv(gym.Wrapper):
def __init__(self, env=None):
"""Make end-of-life == end-of-episode, but only reset on true game over.
Done by DeepMind for the DQN and co. since it helps value estimation.
"""
super(EpisodicLifeEnv, self).__init__(env)
self.lives = 0
self.was_real_done = True
self.was_real_reset = False
def _step(self, action):
obs, reward, done, info = self.env.step(action)
self.was_real_done = done
# check current lives, make loss of life terminal,
# then update lives to handle bonus lives
lives = self.env.unwrapped.ale.lives()
if lives < self.lives and lives > 0:
# for Qbert sometimes we stay in lives == 0 condition for a few frames
# so it's important to keep lives > 0, so that we only reset once
# the environment advertises done.
done = True
self.lives = lives
return obs, reward, done, info
def _reset(self):
"""Reset only when lives are exhausted.
This way all states are still reachable even though lives are episodic,
and the learner need not know about any of this behind-the-scenes.
"""
if self.was_real_done:
obs = self.env.reset()
self.was_real_reset = True
else:
# no-op step to advance from terminal/lost life state
obs, _, _, _ = self.env.step(0)
self.was_real_reset = False
self.lives = self.env.unwrapped.ale.lives()
return obs
class MaxAndSkipEnv(gym.Wrapper):
def __init__(self, env=None, skip=4):
"""Return only every `skip`-th frame"""
super(MaxAndSkipEnv, self).__init__(env)
# most recent raw observations (for max pooling across time steps)
self._obs_buffer = deque(maxlen=2)
self._skip = skip
def _step(self, action):
total_reward = 0.0
done = None
for _ in range(self._skip):
obs, reward, done, info = self.env.step(action)
self._obs_buffer.append(obs)
total_reward += reward
if done:
break
max_frame = np.max(np.stack(self._obs_buffer), axis=0)
return max_frame, total_reward, done, info
def _reset(self):
"""Clear past frame buffer and init. to first obs. from inner env."""
self._obs_buffer.clear()
obs = self.env.reset()
self._obs_buffer.append(obs)
return obs
class ProcessFrame84(gym.ObservationWrapper):
def __init__(self, env=None):
super(ProcessFrame84, self).__init__(env)
self.observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 1))
def _observation(self, obs):
return ProcessFrame84.process(obs)
@staticmethod
def process(frame):
if frame.size == 210 * 160 * 3:
img = np.reshape(frame, [210, 160, 3]).astype(np.float32)
elif frame.size == 250 * 160 * 3:
img = np.reshape(frame, [250, 160, 3]).astype(np.float32)
else:
assert False, "Unknown resolution."
img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
resized_screen = cv2.resize(img, (84, 110), interpolation=cv2.INTER_AREA)
x_t = resized_screen[18:102, :]
x_t = np.reshape(x_t, [84, 84, 1])
return x_t.astype(np.uint8)
class ClippedRewardsWrapper(gym.RewardWrapper):
def _reward(self, reward):
"""Change all the positive rewards to 1, negative to -1 and keep zero."""
return np.sign(reward)
class LazyFrames(object):
def __init__(self, frames):
"""This object ensures that common frames between the observations are only stored once.
It exists purely to optimize memory usage which can be huge for DQN's 1M frames replay
buffers.
This object should only be converted to numpy array before being passed to the model.
You'd not believe how complex the previous solution was."""
self._frames = frames
def __array__(self, dtype=None):
out = np.concatenate(self._frames, axis=2)
if dtype is not None:
out = out.astype(dtype)
return out
class FrameStack(gym.Wrapper):
def __init__(self, env, k):
"""Stack k last frames.
Returns lazy array, which is much more memory efficient.
See Also
--------
baselines.common.atari_wrappers.LazyFrames
"""
gym.Wrapper.__init__(self, env)
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k))
def _reset(self):
ob = self.env.reset()
for _ in range(self.k):
self.frames.append(ob)
return self._get_ob()
def _step(self, action):
ob, reward, done, info = self.env.step(action)
self.frames.append(ob)
return self._get_ob(), reward, done, info
def _get_ob(self):
assert len(self.frames) == self.k
return LazyFrames(list(self.frames))
class ScaledFloatFrame(gym.ObservationWrapper):
def _observation(self, obs):
# careful! This undoes the memory optimization, use
# with smaller replay buffers only.
return np.array(obs).astype(np.float32) / 255.0
def wrap_dqn(env):
"""Apply a common set of wrappers for Atari games."""
assert 'NoFrameskip' in env.spec.id
env = EpisodicLifeEnv(env)
env = NoopResetEnv(env, noop_max=30)
env = MaxAndSkipEnv(env, skip=4)
if 'FIRE' in env.unwrapped.get_action_meanings():
env = FireResetEnv(env)
env = ProcessFrame84(env)
env = FrameStack(env, 4)
env = ClippedRewardsWrapper(env)
return env
class A2cProcessFrame(gym.Wrapper):
def __init__(self, env):
gym.Wrapper.__init__(self, env)
self.observation_space = spaces.Box(low=0, high=255, shape=(84, 84, 1))
def _step(self, action):
ob, reward, done, info = self.env.step(action)
return A2cProcessFrame.process(ob), reward, done, info
def _reset(self):
return A2cProcessFrame.process(self.env.reset())
@staticmethod
def process(frame):
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
frame = cv2.resize(frame, (84, 84), interpolation=cv2.INTER_AREA)
return frame.reshape(84, 84, 1)
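
Note on `FrameStack` and `LazyFrames` above: each stacked observation holds references to its k most recent frames and concatenates them only when asked, so consecutive observations share the frame objects instead of copying them. The following is a rough, numpy-free sketch of that sharing. `LazyStack` and `materialize` are hypothetical names, and frames are flat lists rather than image arrays.

```python
from collections import deque


class LazyStack:
    """Hold references to k frames; build the concatenated result only on demand."""

    def __init__(self, frames):
        self._frames = frames  # frame objects, shared with neighbouring stacks
        self._out = None

    def materialize(self):
        if self._out is None:
            # stand-in for np.concatenate(..., axis=2): flat list concatenation
            self._out = [px for frame in self._frames for px in frame]
        return self._out


k = 3
window = deque(maxlen=k)

first_frame = [0, 0]
for _ in range(k):
    window.append(first_frame)  # on reset, the window is filled with the first frame
obs_a = LazyStack(list(window))

window.append([1, 1])  # one environment step later
obs_b = LazyStack(list(window))
```

In a replay buffer that stores `obs_a` and `obs_b` side by side, each distinct frame is kept in memory once until one of the stacks is materialized.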

@@ -1,149 +0,0 @@
import os
import tempfile
import zipfile
from azure.common import AzureMissingResourceHttpError
from azure.storage.blob import BlobService
from shutil import unpack_archive
from threading import Event
"""TODOS:
- use Azure snapshots instead of hacky backups
"""
def fixed_list_blobs(service, *args, **kwargs):
"""By default list_blobs only returns a subset of results.
This function attempts to fix this.
"""
res = []
next_marker = None
while next_marker is None or len(next_marker) > 0:
kwargs['marker'] = next_marker
gen = service.list_blobs(*args, **kwargs)
for b in gen:
res.append(b.name)
next_marker = gen.next_marker
return res
def make_archive(source_path, dest_path):
if source_path.endswith(os.path.sep):
source_path = source_path.rstrip(os.path.sep)
prefix_path = os.path.dirname(source_path)
with zipfile.ZipFile(dest_path, "w", compression=zipfile.ZIP_STORED) as zf:
if os.path.isdir(source_path):
for dirname, subdirs, files in os.walk(source_path):
zf.write(dirname, os.path.relpath(dirname, prefix_path))
for filename in files:
filepath = os.path.join(dirname, filename)
zf.write(filepath, os.path.relpath(filepath, prefix_path))
else:
zf.write(source_path, os.path.relpath(source_path, prefix_path))
class Container(object):
services = {}
def __init__(self, account_name, account_key, container_name, maybe_create=False):
self._account_name = account_name
self._container_name = container_name
if account_name not in Container.services:
Container.services[account_name] = BlobService(account_name, account_key)
self._service = Container.services[account_name]
if maybe_create:
self._service.create_container(self._container_name, fail_on_exist=False)
def put(self, source_path, blob_name, callback=None):
"""Upload a file or directory from `source_path` to azure blob `blob_name`.
Upload progress can be traced by an optional callback.
"""
upload_done = Event()
def progress_callback(current, total):
if callback:
callback(current, total)
if current >= total:
upload_done.set()
# Attempt to make backup if an existing version is already available
try:
x_ms_copy_source = "https://{}.blob.core.windows.net/{}/{}".format(
self._account_name,
self._container_name,
blob_name
)
self._service.copy_blob(
container_name=self._container_name,
blob_name=blob_name + ".backup",
x_ms_copy_source=x_ms_copy_source
)
except AzureMissingResourceHttpError:
pass
with tempfile.TemporaryDirectory() as td:
arcpath = os.path.join(td, "archive.zip")
make_archive(source_path, arcpath)
self._service.put_block_blob_from_path(
container_name=self._container_name,
blob_name=blob_name,
file_path=arcpath,
max_connections=4,
progress_callback=progress_callback,
max_retries=10)
upload_done.wait()
def get(self, dest_path, blob_name, callback=None):
"""Download a file or directory to `dest_path` from azure blob `blob_name`.
Warning! If a directory is downloaded, `dest_path` is the parent directory.
Download progress can be traced by an optional callback.
"""
download_done = Event()
def progress_callback(current, total):
if callback:
callback(current, total)
if current >= total:
download_done.set()
with tempfile.TemporaryDirectory() as td:
arcpath = os.path.join(td, "archive.zip")
for backup_blob_name in [blob_name, blob_name + '.backup']:
try:
blob_size = self._service.get_blob_properties(
blob_name=backup_blob_name,
container_name=self._container_name
)['content-length']
if int(blob_size) > 0:
self._service.get_blob_to_path(
container_name=self._container_name,
blob_name=backup_blob_name,
file_path=arcpath,
max_connections=4,
progress_callback=progress_callback,
max_retries=10)
unpack_archive(arcpath, dest_path)
download_done.wait()
return True
except AzureMissingResourceHttpError:
pass
return False
def list(self, prefix=None):
"""List all blobs in the container."""
return fixed_list_blobs(self._service, self._container_name, prefix=prefix)
def exists(self, blob_name):
"""Returns true if `blob_name` exists in container."""
try:
self._service.get_blob_properties(
blob_name=blob_name,
container_name=self._container_name
)
return True
except AzureMissingResourceHttpError:
return False

@@ -0,0 +1,126 @@
"""
Helpers for scripts like run_atari.py.
"""
import os
try:
from mpi4py import MPI
except ImportError:
MPI = None
import gym
from gym.wrappers import FlattenDictWrapper
from baselines import logger
from baselines.bench import Monitor
from baselines.common import set_global_seeds
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
"""
Create a wrapped, monitored SubprocVecEnv for Atari.
"""
if wrapper_kwargs is None: wrapper_kwargs = {}
mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
def make_env(rank): # pylint: disable=C0111
def _thunk():
env = make_atari(env_id)
env.seed(seed + 10000*mpi_rank + rank if seed is not None else None)
env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(rank)))
return wrap_deepmind(env, **wrapper_kwargs)
return _thunk
set_global_seeds(seed)
return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
def make_mujoco_env(env_id, seed, reward_scale=1.0):
"""
Create a wrapped, monitored gym.Env for MuJoCo.
"""
rank = MPI.COMM_WORLD.Get_rank() if MPI else 0  # MPI is None when mpi4py is missing
myseed = seed + 1000 * rank if seed is not None else None
set_global_seeds(myseed)
env = gym.make(env_id)
env = Monitor(env, os.path.join(logger.get_dir(), str(rank)), allow_early_resets=True)
env.seed(seed)
if reward_scale != 1.0:
from baselines.common.retro_wrappers import RewardScaler
env = RewardScaler(env, reward_scale)
return env
def make_robotics_env(env_id, seed, rank=0):
"""
Create a wrapped, monitored gym.Env for robotics environments.
"""
set_global_seeds(seed)
env = gym.make(env_id)
env = FlattenDictWrapper(env, ['observation', 'desired_goal'])
env = Monitor(
env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)),
info_keywords=('is_success',))
env.seed(seed)
return env
def arg_parser():
"""
Create an empty argparse.ArgumentParser.
"""
import argparse
return argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
def atari_arg_parser():
"""
Create an argparse.ArgumentParser for run_atari.py.
"""
print('Obsolete - use common_arg_parser instead')
return common_arg_parser()
def mujoco_arg_parser():
print('Obsolete - use common_arg_parser instead')
return common_arg_parser()
def common_arg_parser():
"""
Create an argparse.ArgumentParser with the arguments shared by the run scripts.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
parser.add_argument('--alg', help='Algorithm', type=str, default='ppo2')
parser.add_argument('--num_timesteps', type=float, default=1e6)
parser.add_argument('--network', help='network type (mlp, cnn, lstm, cnn_lstm, conv_only)', default=None)
parser.add_argument('--gamestate', help='game state to load (so far only used in retro games)', default=None)
parser.add_argument('--num_env', help='Number of environment copies being run in parallel. When not specified, set to number of cpus for Atari, and to 1 for Mujoco', default=None, type=int)
parser.add_argument('--reward_scale', help='Reward scale factor. Default: 1.0', default=1.0, type=float)
parser.add_argument('--save_path', help='Path to save trained model to', default=None, type=str)
parser.add_argument('--play', default=False, action='store_true')
return parser
def robotics_arg_parser():
"""
Create an argparse.ArgumentParser for robotics environments.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', type=str, default='FetchReach-v0')
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
return parser
def parse_unknown_args(args):
"""
Parse arguments not consumed by arg parser into a dictionary
"""
retval = {}
for arg in args:
assert arg.startswith('--')
assert '=' in arg, 'cannot parse arg {}'.format(arg)
key = arg.split('=')[0][2:]
value = arg.split('=')[1]
retval[key] = value
return retval
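
Note on `parse_unknown_args` above: it expects every leftover argument in `--key=value` form and takes `arg.split('=')[1]` as the value, which drops anything after a second `=`. The following is a small sketch of a variant that splits on the first `=` only. `parse_key_value_args` is a hypothetical name, and obtaining the leftovers through `parse_known_args` is an assumption about the caller, which this diff does not show.

```python
import argparse


def parse_key_value_args(args):
    """Turn ['--key=value', ...] into {'key': 'value', ...}.

    Hypothetical variant for illustration: splits on the first '=' only,
    so values that themselves contain '=' are kept whole.
    """
    retval = {}
    for arg in args:
        if not arg.startswith('--') or '=' not in arg:
            raise ValueError('cannot parse arg {}'.format(arg))
        key, value = arg[2:].split('=', 1)
        retval[key] = value
    return retval


parser = argparse.ArgumentParser()
parser.add_argument('--env', default='Reacher-v2')

# parse_known_args returns the parsed namespace plus the arguments it did not recognise.
known, unknown = parser.parse_known_args(
    ['--env=Hopper-v2', '--lr=3e-4', '--schedule=a=b'])
extra = parse_key_value_args(unknown)
```

All values stay strings, so the caller still has to convert them (for example `float(extra['lr'])`) before passing them on.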

@@ -16,7 +16,12 @@ def fmt_item(x, l):
if isinstance(x, np.ndarray):
assert x.ndim==0
x = x.item()
if isinstance(x, float): rep = "%g"%x
if isinstance(x, (float, np.float32, np.float64)):
v = abs(x)
if (v < 1e-4 or v > 1e+4) and v > 0:
rep = "%7.2e" % x
else:
rep = "%7.5f" % x
else: rep = str(x)
return " "*(l - len(rep)) + rep
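
Note on the new float branch in `fmt_item` above: nonzero magnitudes below 1e-4 or above 1e+4 are printed in scientific notation, and everything else as fixed-point with five decimals. The following standalone sketch covers just that branch. `fmt_float` is a hypothetical name, and the `width` argument plays the role of `l` in the diff.

```python
def fmt_float(x, width):
    """Format a float the way the new fmt_item branch does, right-aligned to `width`."""
    v = abs(x)
    if (v < 1e-4 or v > 1e+4) and v > 0:
        rep = "%7.2e" % x   # very small or very large: scientific notation
    else:
        rep = "%7.5f" % x   # otherwise (including exactly zero): fixed-point
    return " " * (width - len(rep)) + rep


cells = [fmt_float(value, 10) for value in (0.5, 1e-5, 0.0, 123456.0)]
```

The `v > 0` guard keeps an exact zero in fixed-point form instead of printing it as `0.00e+00`.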

@@ -1,8 +1,8 @@
import tensorflow as tf
import numpy as np
import baselines.common.tf_util as U
from baselines.a2c.utils import fc
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import nn
class Pd(object):
"""
@@ -32,6 +32,8 @@ class PdType(object):
raise NotImplementedError
def pdfromflat(self, flat):
return self.pdclass()(flat)
def pdfromlatent(self, latent_vector):
raise NotImplementedError
def param_shape(self):
raise NotImplementedError
def sample_shape(self):
@@ -49,6 +51,10 @@ class CategoricalPdType(PdType):
self.ncat = ncat
def pdclass(self):
return CategoricalPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
pdparam = fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
return self.pdfromflat(pdparam), pdparam
def param_shape(self):
return [self.ncat]
def sample_shape(self):
@@ -58,14 +64,12 @@ class CategoricalPdType(PdType):
class MultiCategoricalPdType(PdType):
def __init__(self, low, high):
self.low = low
self.high = high
self.ncats = high - low + 1
def __init__(self, nvec):
self.ncats = nvec
def pdclass(self):
return MultiCategoricalPd
def pdfromflat(self, flat):
return MultiCategoricalPd(self.low, self.high, flat)
return MultiCategoricalPd(self.ncats, flat)
def param_shape(self):
return [sum(self.ncats)]
def sample_shape(self):
@@ -78,6 +82,13 @@ class DiagGaussianPdType(PdType):
self.size = size
def pdclass(self):
return DiagGaussianPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
mean = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
logstd = tf.get_variable(name='pi/logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
return self.pdfromflat(pdparam), mean
def param_shape(self):
return [2*self.size]
def sample_shape(self):
@@ -108,7 +119,7 @@ class BernoulliPdType(PdType):
# def flatparam(self):
# return self.logits
# def mode(self):
# return U.argmax(self.logits, axis=1)
# return U.argmax(self.logits, axis=-1)
# def logp(self, x):
# return -tf.nn.sparse_softmax_cross_entropy_with_logits(self.logits, x)
# def kl(self, other):
@@ -118,7 +129,7 @@ class BernoulliPdType(PdType):
# return tf.nn.softmax_cross_entropy_with_logits(self.logits, self.ps)
# def sample(self):
# u = tf.random_uniform(tf.shape(self.logits))
# return U.argmax(self.logits - tf.log(-tf.log(u)), axis=1)
# return U.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
class CategoricalPd(Pd):
def __init__(self, logits):
@@ -126,50 +137,53 @@ class CategoricalPd(Pd):
def flatparam(self):
return self.logits
def mode(self):
return U.argmax(self.logits, axis=1)
return tf.argmax(self.logits, axis=-1)
def neglogp(self, x):
return tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
# return tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
# Note: we can't use sparse_softmax_cross_entropy_with_logits because
# the implementation does not allow second-order derivatives...
one_hot_actions = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
return tf.nn.softmax_cross_entropy_with_logits_v2(
logits=self.logits,
labels=one_hot_actions)
def kl(self, other):
a0 = self.logits - U.max(self.logits, axis=1, keepdims=True)
a1 = other.logits - U.max(other.logits, axis=1, keepdims=True)
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keepdims=True)
ea0 = tf.exp(a0)
ea1 = tf.exp(a1)
z0 = U.sum(ea0, axis=1, keepdims=True)
z1 = U.sum(ea1, axis=1, keepdims=True)
z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
z1 = tf.reduce_sum(ea1, axis=-1, keepdims=True)
p0 = ea0 / z0
return U.sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=1)
return tf.reduce_sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
def entropy(self):
a0 = self.logits - U.max(self.logits, axis=1, keepdims=True)
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
ea0 = tf.exp(a0)
z0 = U.sum(ea0, axis=1, keepdims=True)
z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
p0 = ea0 / z0
return U.sum(p0 * (tf.log(z0) - a0), axis=1)
return tf.reduce_sum(p0 * (tf.log(z0) - a0), axis=-1)
def sample(self):
u = tf.random_uniform(tf.shape(self.logits))
return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=1)
u = tf.random_uniform(tf.shape(self.logits), dtype=self.logits.dtype)
return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
@classmethod
def fromflat(cls, flat):
return cls(flat)
class MultiCategoricalPd(Pd):
def __init__(self, low, high, flat):
def __init__(self, nvec, flat):
self.flat = flat
self.low = tf.constant(low, dtype=tf.int32)
self.categoricals = list(map(CategoricalPd, tf.split(flat, high - low + 1, axis=len(flat.get_shape()) - 1)))
self.categoricals = list(map(CategoricalPd, tf.split(flat, nvec, axis=-1)))
def flatparam(self):
return self.flat
def mode(self):
return self.low + tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
return tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
def neglogp(self, x):
return tf.add_n([p.neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x - self.low, axis=len(x.get_shape()) - 1))])
return tf.add_n([p.neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x, axis=-1))])
def kl(self, other):
return tf.add_n([
p.kl(q) for p, q in zip(self.categoricals, other.categoricals)
])
return tf.add_n([p.kl(q) for p, q in zip(self.categoricals, other.categoricals)])
def entropy(self):
return tf.add_n([p.entropy() for p in self.categoricals])
def sample(self):
return self.low + tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32)
return tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32)
@classmethod
def fromflat(cls, flat):
raise NotImplementedError
@@ -177,7 +191,7 @@ class MultiCategoricalPd(Pd):
class DiagGaussianPd(Pd):
def __init__(self, flat):
self.flat = flat
mean, logstd = tf.split(axis=len(flat.get_shape()) - 1, num_or_size_splits=2, value=flat)
mean, logstd = tf.split(axis=len(flat.shape)-1, num_or_size_splits=2, value=flat)
self.mean = mean
self.logstd = logstd
self.std = tf.exp(logstd)
@@ -186,14 +200,14 @@ class DiagGaussianPd(Pd):
def mode(self):
return self.mean
def neglogp(self, x):
return 0.5 * U.sum(tf.square((x - self.mean) / self.std), axis=len(x.get_shape()) - 1) \
return 0.5 * tf.reduce_sum(tf.square((x - self.mean) / self.std), axis=-1) \
+ 0.5 * np.log(2.0 * np.pi) * tf.to_float(tf.shape(x)[-1]) \
+ U.sum(self.logstd, axis=len(x.get_shape()) - 1)
+ tf.reduce_sum(self.logstd, axis=-1)
def kl(self, other):
assert isinstance(other, DiagGaussianPd)
return U.sum(other.logstd - self.logstd + (tf.square(self.std) + tf.square(self.mean - other.mean)) / (2.0 * tf.square(other.std)) - 0.5, axis=-1)
return tf.reduce_sum(other.logstd - self.logstd + (tf.square(self.std) + tf.square(self.mean - other.mean)) / (2.0 * tf.square(other.std)) - 0.5, axis=-1)
def entropy(self):
return U.sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), -1)
return tf.reduce_sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
def sample(self):
return self.mean + self.std * tf.random_normal(tf.shape(self.mean))
@classmethod
@@ -209,11 +223,11 @@ class BernoulliPd(Pd):
def mode(self):
return tf.round(self.ps)
def neglogp(self, x):
return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=1)
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=-1)
def kl(self, other):
return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=1) - U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=1)
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=-1) - tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
def entropy(self):
return U.sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=1)
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
def sample(self):
u = tf.random_uniform(tf.shape(self.ps))
return tf.to_float(math_ops.less(u, self.ps))
@@ -229,7 +243,7 @@ def make_pdtype(ac_space):
elif isinstance(ac_space, spaces.Discrete):
return CategoricalPdType(ac_space.n)
elif isinstance(ac_space, spaces.MultiDiscrete):
return MultiCategoricalPdType(ac_space.low, ac_space.high)
return MultiCategoricalPdType(ac_space.nvec)
elif isinstance(ac_space, spaces.MultiBinary):
return BernoulliPdType(ac_space.n)
else:
@@ -254,6 +268,11 @@ def test_probtypes():
categorical = CategoricalPdType(pdparam_categorical.size) #pylint: disable=E1101
validate_probtype(categorical, pdparam_categorical)
nvec = [1,2,3]
pdparam_multicategorical = np.array([-.2, .3, .5, .1, 1, -.1])
multicategorical = MultiCategoricalPdType(nvec) #pylint: disable=E1101
validate_probtype(multicategorical, pdparam_multicategorical)
pdparam_bernoulli = np.array([-.2, .3, .5])
bernoulli = BernoulliPdType(pdparam_bernoulli.size) #pylint: disable=E1101
validate_probtype(bernoulli, pdparam_bernoulli)
@@ -265,10 +284,10 @@ def validate_probtype(probtype, pdparam):
Mval = np.repeat(pdparam[None, :], N, axis=0)
M = probtype.param_placeholder([N])
X = probtype.sample_placeholder([N])
pd = probtype.pdclass()(M)
pd = probtype.pdfromflat(M)
calcloglik = U.function([X, M], pd.logp(X))
calcent = U.function([M], pd.entropy())
Xval = U.eval(pd.sample(), feed_dict={M:Mval})
Xval = tf.get_default_session().run(pd.sample(), feed_dict={M:Mval})
logliks = calcloglik(Xval, Mval)
entval_ll = - logliks.mean() #pylint: disable=E1101
entval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
@@ -277,7 +296,7 @@ def validate_probtype(probtype, pdparam):
# Check to see if kldiv[p,q] = - ent[p] - E_p[log q]
M2 = probtype.param_placeholder([N])
pd2 = probtype.pdclass()(M2)
pd2 = probtype.pdfromflat(M2)
q = pdparam + np.random.randn(pdparam.size) * 0.1
Mval2 = np.repeat(q[None, :], N, axis=0)
calckl = U.function([M, M2], pd.kl(pd2))
@@ -286,4 +305,5 @@ def validate_probtype(probtype, pdparam):
klval_ll = - entval - logliks.mean() #pylint: disable=E1101
klval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
assert np.abs(klval - klval_ll) < 3 * klval_ll_stderr # within 3 sigmas
print('ok on', probtype, pdparam)
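
The `BernoulliPd.kl` hunk above computes the KL divergence as a difference of two sigmoid cross-entropy terms, which is the same identity the test checks (`kldiv[p,q] = - ent[p] - E_p[log q]`). The sketch below reproduces that identity in plain Python, assuming the standard numerically stable form of sigmoid cross-entropy with logits; the helper names are illustrative and are not part of the diff.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_cross_entropy(logit, label):
    # Stable form of -z*log(sigmoid(x)) - (1-z)*log(1-sigmoid(x)):
    # max(x, 0) - x*z + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def bernoulli_kl(p_logits, q_logits):
    # KL[p || q] = CE(p, q) - H(p), summed over independent components,
    # mirroring the structure of BernoulliPd.kl in the diff.
    total = 0.0
    for lp, lq in zip(p_logits, q_logits):
        ps = sigmoid(lp)
        total += sigmoid_cross_entropy(lq, ps) - sigmoid_cross_entropy(lp, ps)
    return total

def bernoulli_kl_direct(p_logits, q_logits):
    # Textbook definition, used here only as a cross-check.
    total = 0.0
    for lp, lq in zip(p_logits, q_logits):
        p, q = sigmoid(lp), sigmoid(lq)
        total += p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    return total

p_logits = [-0.2, 0.3, 0.5]    # same values as pdparam_bernoulli in the test
q_logits = [-0.1, 0.25, 0.7]   # arbitrary perturbation, chosen for illustration
kl_ce = bernoulli_kl(p_logits, q_logits)
kl_direct = bernoulli_kl_direct(p_logits, q_logits)
```

Both routes should agree to floating-point precision, and the divergence of a distribution from itself should be zero.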


@@ -0,0 +1,98 @@
from .running_stat import RunningStat
from collections import deque
import numpy as np
class Filter(object):
def __call__(self, x, update=True):
raise NotImplementedError
def reset(self):
pass
class IdentityFilter(Filter):
def __call__(self, x, update=True):
return x
class CompositionFilter(Filter):
def __init__(self, fs):
self.fs = fs
def __call__(self, x, update=True):
for f in self.fs:
x = f(x, update=update)
return x
def output_shape(self, input_space):
out = input_space.shape
for f in self.fs:
out = f.output_shape(out)
return out
class ZFilter(Filter):
"""
y = (x-mean)/std
using running estimates of mean,std
"""
def __init__(self, shape, demean=True, destd=True, clip=10.0):
self.demean = demean
self.destd = destd
self.clip = clip
self.rs = RunningStat(shape)
def __call__(self, x, update=True):
if update: self.rs.push(x)
if self.demean:
x = x - self.rs.mean
if self.destd:
x = x / (self.rs.std+1e-8)
if self.clip:
x = np.clip(x, -self.clip, self.clip)
return x
def output_shape(self, input_space):
return input_space.shape
class AddClock(Filter):
def __init__(self):
self.count = 0
def reset(self):
self.count = 0
def __call__(self, x, update=True):
return np.append(x, self.count/100.0)
def output_shape(self, input_space):
return (input_space.shape[0]+1,)
class FlattenFilter(Filter):
def __call__(self, x, update=True):
return x.ravel()
def output_shape(self, input_space):
return (int(np.prod(input_space.shape)),)
class Ind2OneHotFilter(Filter):
def __init__(self, n):
self.n = n
def __call__(self, x, update=True):
out = np.zeros(self.n)
out[x] = 1
return out
def output_shape(self, input_space):
return (input_space.n,)
class DivFilter(Filter):
def __init__(self, divisor):
self.divisor = divisor
def __call__(self, x, update=True):
return x / self.divisor
def output_shape(self, input_space):
return input_space.shape
class StackFilter(Filter):
def __init__(self, length):
self.stack = deque(maxlen=length)
def reset(self):
self.stack.clear()
def __call__(self, x, update=True):
self.stack.append(x)
while len(self.stack) < self.stack.maxlen:
self.stack.append(x)
return np.concatenate(self.stack, axis=-1)
def output_shape(self, input_space):
return input_space.shape[:-1] + (input_space.shape[-1] * self.stack.maxlen,)
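
As a rough illustration of what `ZFilter` does, the sketch below applies `y = (x - mean) / std` with running estimates to scalar inputs. `RunningStat` is not shown in this diff, so the Welford-style `RunningScalarStat` class here is an assumed stand-in, not the real implementation.

```python
import math

class RunningScalarStat:
    """Welford-style running mean/std for scalars (hypothetical stand-in for RunningStat)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; 0.0 until two samples are seen (an assumption).
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def zfilter(x, rs, update=True, clip=10.0):
    # Same order of operations as ZFilter.__call__: update, demean, destd, clip.
    if update:
        rs.push(x)
    y = (x - rs.mean) / (rs.std + 1e-8)
    return max(-clip, min(clip, y))

rs = RunningScalarStat()
for value in [1.0, 2.0, 3.0, 4.0]:
    zfilter(value, rs)

centered = zfilter(2.5, rs, update=False)   # equals the running mean
clipped = zfilter(1e9, rs, update=False)    # far outlier is clipped
```

Passing `update=False` leaves the running statistics untouched, which is the behaviour one would typically want at evaluation time.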


@@ -0,0 +1,30 @@
from gym import Env
from gym.spaces import Discrete
class IdentityEnv(Env):
def __init__(
self,
dim,
ep_length=100,
):
self.action_space = Discrete(dim)
self.reset()
def reset(self):
self._choose_next_state()
self.observation_space = self.action_space
return self.state
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
return self.state, rew, False, {}
def _choose_next_state(self):
self.state = self.action_space.sample()
def _get_reward(self, actions):
return 1 if self.state == actions else 0
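
In `IdentityEnv`, the reward compares the action with the state that was returned as the previous observation, so a policy that echoes its last observation should earn reward 1 on every step. The gym-free sketch below mirrors that logic; `TinyIdentityEnv` and `run_echo_policy` are hypothetical names used only for illustration.

```python
import random

class TinyIdentityEnv:
    """Gym-free stand-in that mirrors the reward logic of IdentityEnv."""
    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        self.state = self.rng.randrange(self.dim)
        return self.state

    def step(self, action):
        # Reward is computed against the current state, then a new state is drawn.
        reward = 1 if self.state == action else 0
        self.state = self.rng.randrange(self.dim)
        return self.state, reward, False, {}

def run_echo_policy(env, steps):
    obs = env.reset()
    total = 0
    for _ in range(steps):
        obs, rew, _done, _info = env.step(obs)  # act with the last observation
        total += rew
    return total

total_reward = run_echo_policy(TinyIdentityEnv(dim=5), steps=100)
```

This is why the environment is useful as a sanity test: an agent that learns the identity mapping reaches the maximum return.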

56
baselines/common/input.py Normal file

@@ -0,0 +1,56 @@
import tensorflow as tf
from gym.spaces import Discrete, Box
def observation_placeholder(ob_space, batch_size=None, name='Ob'):
'''
Create a placeholder for feeding observations, with a shape appropriate to the observation space
Parameters:
----------
ob_space: gym.Space observation space
batch_size: int size of the batch to be fed into input. Can be left None in most cases.
name: str name of the placeholder
Returns:
-------
tensorflow placeholder tensor
'''
assert isinstance(ob_space, Discrete) or isinstance(ob_space, Box), \
'Can only deal with Discrete and Box observation spaces for now'
return tf.placeholder(shape=(batch_size,) + ob_space.shape, dtype=ob_space.dtype, name=name)
def observation_input(ob_space, batch_size=None, name='Ob'):
'''
Create a placeholder for feeding observations, with a shape appropriate to the observation space, and add an
input encoder of the appropriate type.
'''
placeholder = observation_placeholder(ob_space, batch_size, name)
return placeholder, encode_observation(ob_space, placeholder)
def encode_observation(ob_space, placeholder):
'''
Encode input in the way that is appropriate to the observation space
Parameters:
----------
ob_space: gym.Space observation space
placeholder: tf.placeholder observation input placeholder
'''
if isinstance(ob_space, Discrete):
return tf.to_float(tf.one_hot(placeholder, ob_space.n))
elif isinstance(ob_space, Box):
return tf.to_float(placeholder)
else:
raise NotImplementedError
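
To make the encoding rules in `encode_observation` concrete, here is a plain-Python sketch of the two cases and of the placeholder shape. It assumes list inputs instead of tensors, and the function names are illustrative rather than part of `baselines/common/input.py`.

```python
def encode_discrete(index, n):
    """One-hot encode a Discrete(n) observation as floats."""
    return [1.0 if i == index else 0.0 for i in range(n)]

def encode_box(values):
    """Box observations are only cast to float."""
    return [float(v) for v in values]

def placeholder_shape(batch_size, ob_shape):
    """Shape used for the placeholder: batch dimension first, then the space's shape."""
    return (batch_size,) + tuple(ob_shape)

one_hot = encode_discrete(2, 4)
box = encode_box([0, 255, 17])
image_shape = placeholder_shape(None, (84, 84, 4))
discrete_shape = placeholder_shape(None, ())  # a Discrete space has an empty shape
```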


@@ -4,7 +4,6 @@ import os
import pickle
import random
import tempfile
import time
import zipfile
@@ -68,14 +67,21 @@ class EzPickle(object):
def set_global_seeds(i):
try:
from mpi4py import MPI
rank = MPI.COMM_WORLD.Get_rank()
except ImportError:
rank = 0
myseed = i + 1000 * rank if i is not None else None
try:
import tensorflow as tf
except ImportError:
pass
else:
tf.set_random_seed(i)
np.random.seed(i)
random.seed(i)
tf.set_random_seed(myseed)
np.random.seed(myseed)
random.seed(myseed)
def pretty_eta(seconds_left):
@@ -153,76 +159,6 @@ class RunningAvg(object):
"""Get the current estimate"""
return self._value
class SimpleMonitor(gym.Wrapper):
def __init__(self, env):
"""Adds two quantities to info returned by every step:
num_steps: int
Number of steps taken so far
rewards: [float]
All the cumulative rewards for the episodes completed so far.
"""
super().__init__(env)
# current episode state
self._current_reward = None
self._num_steps = None
# temporary monitor state that we do not save
self._time_offset = None
self._total_steps = None
# monitor state
self._episode_rewards = []
self._episode_lengths = []
self._episode_end_times = []
def _reset(self):
obs = self.env.reset()
# recompute temporary state if needed
if self._time_offset is None:
self._time_offset = time.time()
if len(self._episode_end_times) > 0:
self._time_offset -= self._episode_end_times[-1]
if self._total_steps is None:
self._total_steps = sum(self._episode_lengths)
# update monitor state
if self._current_reward is not None:
self._episode_rewards.append(self._current_reward)
self._episode_lengths.append(self._num_steps)
self._episode_end_times.append(time.time() - self._time_offset)
# reset episode state
self._current_reward = 0
self._num_steps = 0
return obs
def _step(self, action):
obs, rew, done, info = self.env.step(action)
self._current_reward += rew
self._num_steps += 1
self._total_steps += 1
info['steps'] = self._total_steps
info['rewards'] = self._episode_rewards
return (obs, rew, done, info)
def get_state(self):
return {
'env_id': self.env.unwrapped.spec.id,
'episode_data': {
'episode_rewards': self._episode_rewards,
'episode_lengths': self._episode_lengths,
'episode_end_times': self._episode_end_times,
'initial_reset_time': 0,
}
}
def set_state(self, state):
assert state['env_id'] == self.env.unwrapped.spec.id
ed = state['episode_data']
self._episode_rewards = ed['episode_rewards']
self._episode_lengths = ed['episode_lengths']
self._episode_end_times = ed['episode_end_times']
def boolean_flag(parser, name, default=False, help=None):
"""Add a boolean flag to argparse parser.
@@ -237,8 +173,9 @@ def boolean_flag(parser, name, default=False, help=None):
help: str
help string for the flag
"""
parser.add_argument("--" + name, action="store_true", default=default, help=help)
parser.add_argument("--no-" + name, action="store_false", dest=name)
dest = name.replace('-', '_')
parser.add_argument("--" + name, action="store_true", default=default, dest=dest, help=help)
parser.add_argument("--no-" + name, action="store_false", dest=dest)
def get_wrapper_by_name(env, classname):
@@ -294,6 +231,7 @@ def relatively_safe_pickle_dump(obj, path, compression=False):
# Using gzip here would be simpler, but the size is limited to 2GB
with tempfile.NamedTemporaryFile() as uncompressed_file:
pickle.dump(obj, uncompressed_file)
uncompressed_file.file.flush()
with zipfile.ZipFile(temp_storage, "w", compression=zipfile.ZIP_DEFLATED) as myzip:
myzip.write(uncompressed_file.name, "data")
else:
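
The `boolean_flag` change in this file matters for flag names that contain hyphens: `argparse` stores an explicitly given `dest` verbatim, so without the `replace('-', '_')` step the `--no-` variant would write to a different attribute than the positive flag. The snippet below exercises the updated function as it appears in the diff; the flag name `dueling-net` is only an example.

```python
import argparse

def boolean_flag(parser, name, default=False, help=None):
    # Both options share one underscore-based destination.
    dest = name.replace('-', '_')
    parser.add_argument("--" + name, action="store_true", default=default, dest=dest, help=help)
    parser.add_argument("--no-" + name, action="store_false", dest=dest)

parser = argparse.ArgumentParser()
boolean_flag(parser, "dueling-net", default=True)

default_args = parser.parse_args([])
off_args = parser.parse_args(["--no-dueling-net"])
on_args = parser.parse_args(["--dueling-net"])
```

After parsing, the value is available as `args.dueling_net` in all three cases.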

223
baselines/common/models.py Normal file

@@ -0,0 +1,223 @@
import numpy as np
import tensorflow as tf
from baselines.a2c import utils
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch
from baselines.common.mpi_running_mean_std import RunningMeanStd
import tensorflow.contrib.layers as layers
def nature_cnn(unscaled_images, **conv_kwargs):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
**conv_kwargs))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
"""
Simple fully connected layer policy. Separate stacks of fully-connected layers are used for policy and value function estimation.
More customized fully-connected policies can be obtained by using the PolicyWithValue class directly.
Parameters:
----------
num_layers: int number of fully-connected layers (default: 2)
num_hidden: int size of fully-connected layers (default: 64)
activation: activation function (default: tf.tanh)
Returns:
-------
function that builds fully connected network with a given input placeholder
"""
def network_fn(X):
h = tf.layers.flatten(X)
for i in range(num_layers):
h = activation(fc(h, 'mlp_fc{}'.format(i), nh=num_hidden, init_scale=np.sqrt(2)))
return h, None
return network_fn
def cnn(**conv_kwargs):
def network_fn(X):
return nature_cnn(X, **conv_kwargs), None
return network_fn
def cnn_small(**conv_kwargs):
def network_fn(X):
h = tf.cast(X, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(h, 'c1', nf=8, rf=8, stride=4, init_scale=np.sqrt(2), **conv_kwargs))
h = activ(conv(h, 'c2', nf=16, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h = conv_to_fc(h)
h = activ(fc(h, 'fc1', nh=128, init_scale=np.sqrt(2)))
return h, None
return network_fn
def lstm(nlstm=128, layer_norm=False):
def network_fn(X, nenv=1):
nbatch = X.shape[0]
nsteps = nbatch // nenv
h = tf.layers.flatten(X)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
if layer_norm:
h5, snew = utils.lnlstm(xs, ms, S, scope='lnlstm', nh=nlstm)
else:
h5, snew = utils.lstm(xs, ms, S, scope='lstm', nh=nlstm)
h = seq_to_batch(h5)
initial_state = np.zeros(S.shape.as_list(), dtype=float)
return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}
return network_fn
def tflstm_static(nlstm=128, layer_norm=False):
def network_fn(X, nenv=1):
nbatch = X.shape[0]
nsteps = nbatch // nenv
h = tf.layers.flatten(X)
rnn_cell = tf.nn.rnn_cell.BasicLSTMCell(nlstm, state_is_tuple=False, forget_bias=0.0)
S = tf.placeholder(tf.float32, rnn_cell.zero_state(nenv, dtype=tf.float32).shape) #states
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
xs = batch_to_seq(h, nenv, nsteps)
h5, snew = tf.nn.static_rnn(rnn_cell, xs, initial_state=S)
h = seq_to_batch(h5)
initial_state = np.zeros(S.shape.as_list(), dtype=float)
return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}
return network_fn
def tflstm(nlstm=128):
def network_fn(X, nenv=1):
nbatch = X.shape[0]
nsteps = nbatch // nenv
h = tf.layers.flatten(X)
rnn_cell = tf.nn.rnn_cell.BasicLSTMCell(nlstm, state_is_tuple=False, forget_bias=0.0)
S = tf.placeholder(tf.float32, rnn_cell.zero_state(nenv, dtype=tf.float32).shape) #states
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
initial_state = np.zeros(S.shape)
h = tf.reshape(h, (-1, nsteps, h.shape[-1]))
h, snew = tf.nn.dynamic_rnn(rnn_cell, h, initial_state=S)
h = tf.reshape(h, (-1, h.shape[-1]))
return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}
return network_fn
def cnn_lstm(nlstm=128, layer_norm=False, **conv_kwargs):
def network_fn(X, nenv=1):
nbatch = X.shape[0]
nsteps = nbatch // nenv
h = nature_cnn(X, **conv_kwargs)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
if layer_norm:
h5, snew = utils.lnlstm(xs, ms, S, scope='lnlstm', nh=nlstm)
else:
h5, snew = utils.lstm(xs, ms, S, scope='lstm', nh=nlstm)
h = seq_to_batch(h5)
initial_state = np.zeros(S.shape.as_list(), dtype=float)
return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}
return network_fn
def cnn_lnlstm(nlstm=128, **conv_kwargs):
return cnn_lstm(nlstm, layer_norm=True, **conv_kwargs)
def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
'''
convolutions-only net
Parameters:
----------
convs: list of triples (filter_number, filter_size, stride) specifying parameters for each layer.
Returns:
function that takes tensorflow tensor as input and returns the output of the last convolutional layer
'''
def network_fn(X):
out = X
with tf.variable_scope("convnet"):
for num_outputs, kernel_size, stride in convs:
out = layers.convolution2d(out,
num_outputs=num_outputs,
kernel_size=kernel_size,
stride=stride,
activation_fn=tf.nn.relu,
**conv_kwargs)
return out, None
return network_fn
def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
rms = RunningMeanStd(shape=x.shape[1:])
norm_x = tf.clip_by_value((x - rms.mean) / rms.std, min(clip_range), max(clip_range))
return norm_x, rms
def get_network_builder(name):
# TODO: replace with reflection?
if name == 'cnn':
return cnn
elif name == 'cnn_small':
return cnn_small
elif name == 'conv_only':
return conv_only
elif name == 'mlp':
return mlp
elif name == 'lstm':
return lstm
elif name == 'tflstm_static':
return tflstm_static
elif name == 'tflstm':
return tflstm
elif name == 'cnn_lstm':
return cnn_lstm
elif name == 'cnn_lnlstm':
return cnn_lnlstm
else:
raise ValueError('Unknown network type: {}'.format(name))
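
The `TODO: replace with reflection?` note in `get_network_builder` suggests the if/elif chain is provisional. One possible alternative, shown here purely as a sketch and not as what the authors intend, is a decorator-based registry. The `register` helper and the toy `mlp` builder below are hypothetical and do not build any TensorFlow graph.

```python
_NETWORK_BUILDERS = {}

def register(name):
    """Decorator that records a builder under the given name (hypothetical helper)."""
    def decorator(builder):
        _NETWORK_BUILDERS[name] = builder
        return builder
    return decorator

@register('mlp')
def mlp(num_layers=2, num_hidden=64):
    # Toy builder: returns a network_fn with the same (output, recurrent_tensors) contract.
    def network_fn(x):
        return ('mlp', num_layers, num_hidden, x), None
    return network_fn

def get_network_builder(name):
    try:
        return _NETWORK_BUILDERS[name]
    except KeyError:
        raise ValueError('Unknown network type: {}'.format(name))

builder = get_network_builder('mlp')
latent, recurrent_tensors = builder(num_hidden=32)('obs')
```

With this layout, adding a network type only requires decorating its builder, and the lookup keeps raising `ValueError` for unknown names.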


@@ -53,7 +53,7 @@ class MpiAdam(object):
def test_MpiAdam():
np.random.seed(0)
tf.set_random_seed(0)
a = tf.Variable(np.random.randn(3).astype('float32'))
b = tf.Variable(np.random.randn(2,5).astype('float32'))
loss = tf.reduce_sum(tf.square(a)) + tf.reduce_sum(tf.sin(b))


@@ -0,0 +1,31 @@
import numpy as np
import tensorflow as tf
from mpi4py import MPI
class MpiAdamOptimizer(tf.train.AdamOptimizer):
"""Adam optimizer that averages gradients across mpi processes."""
def __init__(self, comm, **kwargs):
self.comm = comm
tf.train.AdamOptimizer.__init__(self, **kwargs)
def compute_gradients(self, loss, var_list, **kwargs):
grads_and_vars = tf.train.AdamOptimizer.compute_gradients(self, loss, var_list, **kwargs)
grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None]
flat_grad = tf.concat([tf.reshape(g, (-1,)) for g, v in grads_and_vars], axis=0)
shapes = [v.shape.as_list() for g, v in grads_and_vars]
sizes = [int(np.prod(s)) for s in shapes]
num_tasks = self.comm.Get_size()
buf = np.zeros(sum(sizes), np.float32)
def _collect_grads(flat_grad):
self.comm.Allreduce(flat_grad, buf, op=MPI.SUM)
np.divide(buf, float(num_tasks), out=buf)
return buf
avg_flat_grad = tf.py_func(_collect_grads, [flat_grad], tf.float32)
avg_flat_grad.set_shape(flat_grad.shape)
avg_grads = tf.split(avg_flat_grad, sizes, axis=0)
avg_grads_and_vars = [(tf.reshape(g, v.shape), v)
for g, (_, v) in zip(avg_grads, grads_and_vars)]
return avg_grads_and_vars
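
`MpiAdamOptimizer.compute_gradients` follows a flatten, all-reduce, divide, split pattern. The sketch below simulates that data flow with plain lists standing in for tensors and an in-process average standing in for `comm.Allreduce`; all helper names are illustrative.

```python
def flatten(grads):
    """Concatenate per-variable gradients into one flat vector."""
    return [x for g in grads for x in g]

def average_across_workers(flat_grads_per_worker):
    """Stand-in for Allreduce(SUM) followed by division by the number of workers."""
    n = len(flat_grads_per_worker)
    return [sum(vals) / n for vals in zip(*flat_grads_per_worker)]

def split(flat, sizes):
    """Cut the flat vector back into per-variable pieces."""
    out, start = [], 0
    for size in sizes:
        out.append(flat[start:start + size])
        start += size
    return out

# Two simulated workers, each holding gradients for two variables.
worker_grads = [
    [[1.0, 2.0], [3.0]],
    [[3.0, 4.0], [5.0]],
]
sizes = [len(g) for g in worker_grads[0]]
avg_flat = average_across_workers([flatten(g) for g in worker_grads])
avg_grads = split(avg_flat, sizes)
```

Flattening first means a single collective call handles every variable, which is presumably why the optimizer concatenates before reducing.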


@@ -1,6 +1,6 @@
import os, subprocess, sys
def mpi_fork(n):
def mpi_fork(n, bind_to_core=False):
"""Re-launches the current script with workers
Returns "parent" for original parent, "child" for MPI children
"""
@@ -13,7 +13,11 @@ def mpi_fork(n):
OMP_NUM_THREADS="1",
IN_MPI="1"
)
subprocess.check_call(["mpirun", "-np", str(n), sys.executable] + sys.argv, env=env)
args = ["mpirun", "-np", str(n)]
if bind_to_core:
args += ["-bind-to", "core"]
args += [sys.executable] + sys.argv
subprocess.check_call(args, env=env)
return "parent"
else:
return "child"
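
The new `bind_to_core` option only changes how the `mpirun` command line is assembled. A small sketch of that assembly, factored into a hypothetical `build_mpirun_args` helper so it can be inspected without launching any process:

```python
import sys

def build_mpirun_args(n, bind_to_core=False, executable=None, argv=None):
    # Defaults mirror mpi_fork: re-run the current interpreter with the current argv.
    executable = executable or sys.executable
    argv = list(sys.argv) if argv is None else list(argv)
    args = ["mpirun", "-np", str(n)]
    if bind_to_core:
        args += ["-bind-to", "core"]
    return args + [executable] + argv

plain = build_mpirun_args(4, executable="python", argv=["train.py"])
bound = build_mpirun_args(4, bind_to_core=True, executable="python", argv=["train.py"])
```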


@@ -2,29 +2,42 @@ from mpi4py import MPI
import numpy as np
from baselines.common import zipsame
def mpi_moments(x, axis=0):
x = np.asarray(x, dtype='float64')
newshape = list(x.shape)
newshape.pop(axis)
n = np.prod(newshape,dtype=int)
totalvec = np.zeros(n*2+1, 'float64')
addvec = np.concatenate([x.sum(axis=axis).ravel(),
np.square(x).sum(axis=axis).ravel(),
np.array([x.shape[axis]],dtype='float64')])
MPI.COMM_WORLD.Allreduce(addvec, totalvec, op=MPI.SUM)
sum = totalvec[:n]
sumsq = totalvec[n:2*n]
count = totalvec[2*n]
if count == 0:
mean = np.empty(newshape); mean[:] = np.nan
std = np.empty(newshape); std[:] = np.nan
else:
mean = sum/count
std = np.sqrt(np.maximum(sumsq/count - np.square(mean),0))
def mpi_mean(x, axis=0, comm=None, keepdims=False):
x = np.asarray(x)
assert x.ndim > 0
if comm is None: comm = MPI.COMM_WORLD
xsum = x.sum(axis=axis, keepdims=keepdims)
n = xsum.size
localsum = np.zeros(n+1, x.dtype)
localsum[:n] = xsum.ravel()
localsum[n] = x.shape[axis]
globalsum = np.zeros_like(localsum)
comm.Allreduce(localsum, globalsum, op=MPI.SUM)
return globalsum[:n].reshape(xsum.shape) / globalsum[n], globalsum[n]
def mpi_moments(x, axis=0, comm=None, keepdims=False):
x = np.asarray(x)
assert x.ndim > 0
mean, count = mpi_mean(x, axis=axis, comm=comm, keepdims=True)
sqdiffs = np.square(x - mean)
meansqdiff, count1 = mpi_mean(sqdiffs, axis=axis, comm=comm, keepdims=True)
assert count1 == count
std = np.sqrt(meansqdiff)
if not keepdims:
newshape = mean.shape[:axis] + mean.shape[axis+1:]
mean = mean.reshape(newshape)
std = std.reshape(newshape)
return mean, std, count
def test_runningmeanstd():
import subprocess
subprocess.check_call(['mpirun', '-np', '3',
'python','-c',
'from baselines.common.mpi_moments import _helper_runningmeanstd; _helper_runningmeanstd()'])
def _helper_runningmeanstd():
comm = MPI.COMM_WORLD
np.random.seed(0)
for (triple,axis) in [
@@ -45,6 +58,3 @@ def test_runningmeanstd():
assert np.allclose(a1, a2)
print("ok!")
if __name__ == "__main__":
#mpirun -np 3 python <script>
test_runningmeanstd()
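
The rewritten `mpi_moments` computes the standard deviation in two passes: first a global mean, then the global mean of squared differences from it. The sketch below simulates the reduction over per-worker chunks of scalars; the in-process sums stand in for `comm.Allreduce`, and the function names are illustrative.

```python
import math

def global_mean(chunks):
    """Sum and count are reduced together, as in mpi_mean."""
    total = sum(sum(c) for c in chunks)
    count = sum(len(c) for c in chunks)
    return total / count, count

def global_moments(chunks):
    mean, count = global_mean(chunks)
    # Second pass: each worker squares its differences from the global mean.
    sqdiffs = [[(x - mean) ** 2 for x in c] for c in chunks]
    meansqdiff, count1 = global_mean(sqdiffs)
    assert count1 == count
    return mean, math.sqrt(meansqdiff), count

# Three simulated workers with unequal amounts of data.
chunks = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, std, count = global_moments(chunks)
```

Compared with the removed sum and sum-of-squares version, the two-pass form avoids subtracting two large, nearly equal numbers, at the cost of a second reduction.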


@@ -57,7 +57,7 @@ def test_runningmeanstd():
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = U.eval([rms.mean, rms.std])
ms2 = [rms.mean.eval(), rms.std.eval()]
assert np.allclose(ms1, ms2)
@@ -94,11 +94,11 @@ def test_dist():
assert checkallclose(
bigvec.mean(axis=0),
U.eval(rms.mean)
rms.mean.eval(),
)
assert checkallclose(
bigvec.std(axis=0),
U.eval(rms.std)
rms.std.eval(),
)


@@ -0,0 +1,101 @@
from collections import defaultdict
from mpi4py import MPI
import os, numpy as np
import platform
import shutil
import subprocess
def sync_from_root(sess, variables, comm=None):
"""
Send the root node's parameters to every worker.
Arguments:
sess: the TensorFlow session.
variables: all parameter variables including optimizer's
"""
if comm is None: comm = MPI.COMM_WORLD
rank = comm.Get_rank()
for var in variables:
if rank == 0:
comm.Bcast(sess.run(var))
else:
import tensorflow as tf
returned_var = np.empty(var.shape, dtype='float32')
comm.Bcast(returned_var)
sess.run(tf.assign(var, returned_var))
def gpu_count():
"""
Count the GPUs on this machine.
"""
if shutil.which('nvidia-smi') is None:
return 0
output = subprocess.check_output(['nvidia-smi', '--query-gpu=gpu_name', '--format=csv'])
return max(0, len(output.split(b'\n')) - 2)
def setup_mpi_gpus():
"""
Set CUDA_VISIBLE_DEVICES using MPI.
"""
num_gpus = gpu_count()
if num_gpus == 0:
return
local_rank, _ = get_local_rank_size(MPI.COMM_WORLD)
os.environ['CUDA_VISIBLE_DEVICES'] = str(local_rank % num_gpus)
def get_local_rank_size(comm):
"""
Returns the rank of each process on its machine
The processes on a given machine will be assigned ranks
0, 1, 2, ..., N-1,
where N is the number of processes on this machine.
Useful if you want to assign one gpu per machine
"""
this_node = platform.node()
ranks_nodes = comm.allgather((comm.Get_rank(), this_node))
node2rankssofar = defaultdict(int)
local_rank = None
for (rank, node) in ranks_nodes:
if rank == comm.Get_rank():
local_rank = node2rankssofar[node]
node2rankssofar[node] += 1
assert local_rank is not None
return local_rank, node2rankssofar[this_node]
def share_file(comm, path):
"""
Copies the file from rank 0 to all other ranks
Puts it in the same place on all machines
"""
localrank, _ = get_local_rank_size(comm)
if comm.Get_rank() == 0:
with open(path, 'rb') as fh:
data = fh.read()
comm.bcast(data)
else:
data = comm.bcast(None)
if localrank == 0:
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, 'wb') as fh:
fh.write(data)
comm.Barrier()
def dict_gather(comm, d, op='mean', assert_all_have_data=True):
if comm is None: return d
alldicts = comm.allgather(d)
size = comm.size
k2li = defaultdict(list)
for d in alldicts:
for (k,v) in d.items():
k2li[k].append(v)
result = {}
for (k,li) in k2li.items():
if assert_all_have_data:
assert len(li)==size, "only %i out of %i MPI workers have sent '%s'" % (len(li), size, k)
if op=='mean':
result[k] = np.mean(li, axis=0)
elif op=='sum':
result[k] = np.sum(li, axis=0)
else:
assert 0, op
return result
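
`get_local_rank_size` and `setup_mpi_gpus` together map each process to a GPU by counting how many earlier ranks share its host. The sketch below reproduces that counting on a hard-coded list of `(rank, node)` pairs instead of the result of `comm.allgather`; the node names and GPU count are made up for illustration.

```python
from collections import defaultdict

def local_rank_and_size(my_rank, ranks_nodes):
    seen = defaultdict(int)   # node -> number of ranks seen on it so far
    local_rank = None
    my_node = None
    for rank, node in ranks_nodes:
        if rank == my_rank:
            local_rank = seen[node]
            my_node = node
        seen[node] += 1
    assert local_rank is not None
    return local_rank, seen[my_node]

# Five simulated processes spread over two hosts.
ranks_nodes = [(0, "a"), (1, "a"), (2, "b"), (3, "a"), (4, "b")]
num_gpus = 2

# Same rule as setup_mpi_gpus: CUDA_VISIBLE_DEVICES = local_rank % num_gpus.
assignments = {
    rank: local_rank_and_size(rank, ranks_nodes)[0] % num_gpus
    for rank, _node in ranks_nodes
}
```

When a host runs more processes than it has GPUs, the modulo wraps around, so several processes share a device.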


@@ -0,0 +1,179 @@
import tensorflow as tf
from baselines.common import tf_util
from baselines.a2c.utils import fc
from baselines.common.distributions import make_pdtype
from baselines.common.input import observation_placeholder, encode_observation
from baselines.common.tf_util import adjust_shape
from baselines.common.mpi_running_mean_std import RunningMeanStd
from baselines.common.models import get_network_builder
import gym
class PolicyWithValue(object):
"""
Encapsulates fields and methods for RL policy and value function estimation with shared parameters
"""
def __init__(self, env, observations, latent, estimate_q=False, vf_latent=None, sess=None, **tensors):
"""
Parameters:
----------
env RL environment
observations tensorflow placeholder in which the observations will be fed
latent latent state from which policy distribution parameters should be inferred
vf_latent latent state from which value function should be inferred (if None, then latent is used)
sess tensorflow session to run calculations in (if None, default session is used)
**tensors tensorflow tensors for additional attributes such as state or mask
"""
self.X = observations
self.state = tf.constant([])
self.initial_state = None
self.__dict__.update(tensors)
vf_latent = vf_latent if vf_latent is not None else latent
vf_latent = tf.layers.flatten(vf_latent)
latent = tf.layers.flatten(latent)
self.pdtype = make_pdtype(env.action_space)
self.pd, self.pi = self.pdtype.pdfromlatent(latent, init_scale=0.01)
self.action = self.pd.sample()
self.neglogp = self.pd.neglogp(self.action)
self.sess = sess
if estimate_q:
assert isinstance(env.action_space, gym.spaces.Discrete)
self.q = fc(vf_latent, 'q', env.action_space.n)
self.vf = self.q
else:
self.vf = fc(vf_latent, 'vf', 1)
self.vf = self.vf[:,0]
def _evaluate(self, variables, observation, **extra_feed):
sess = self.sess or tf.get_default_session()
feed_dict = {self.X: adjust_shape(self.X, observation)}
for inpt_name, data in extra_feed.items():
if inpt_name in self.__dict__.keys():
inpt = self.__dict__[inpt_name]
if isinstance(inpt, tf.Tensor) and inpt._op.type == 'Placeholder':
feed_dict[inpt] = adjust_shape(inpt, data)
return sess.run(variables, feed_dict)
def step(self, observation, **extra_feed):
"""
Compute next action(s) given the observation(s)
Parameters:
----------
observation observation data (either single or a batch)
**extra_feed additional data such as state or mask (names of the arguments should match the ones in constructor, see __init__)
Returns:
-------
(action, value estimate, next state, negative log likelihood of the action under current policy parameters) tuple
"""
a, v, state, neglogp = self._evaluate([self.action, self.vf, self.state, self.neglogp], observation, **extra_feed)
if state.size == 0:
state = None
return a, v, state, neglogp
def value(self, ob, *args, **kwargs):
"""
Compute value estimate(s) given the observation(s)
Parameters:
----------
observation observation data (either single or a batch)
**extra_feed additional data such as state or mask (names of the arguments should match the ones in constructor, see __init__)
Returns:
-------
value estimate
"""
return self._evaluate(self.vf, ob, *args, **kwargs)
def save(self, save_path):
tf_util.save_state(save_path, sess=self.sess)
def load(self, load_path):
tf_util.load_state(load_path, sess=self.sess)
def build_policy(env, policy_network, value_network=None, normalize_observations=False, estimate_q=False, **policy_kwargs):
if isinstance(policy_network, str):
network_type = policy_network
policy_network = get_network_builder(network_type)(**policy_kwargs)
def policy_fn(nbatch=None, nsteps=None, sess=None, observ_placeholder=None):
ob_space = env.observation_space
X = observ_placeholder if observ_placeholder is not None else observation_placeholder(ob_space, batch_size=nbatch)
extra_tensors = {}
if normalize_observations and X.dtype == tf.float32:
encoded_x, rms = _normalize_clip_observation(X)
extra_tensors['rms'] = rms
else:
encoded_x = X
encoded_x = encode_observation(ob_space, encoded_x)
with tf.variable_scope('pi', reuse=tf.AUTO_REUSE):
policy_latent, recurrent_tensors = policy_network(encoded_x)
if recurrent_tensors is not None:
# recurrent architecture, need a few more steps
nenv = nbatch // nsteps
assert nenv > 0, 'Bad input for recurrent policy: batch size {} smaller than nsteps {}'.format(nbatch, nsteps)
policy_latent, recurrent_tensors = policy_network(encoded_x, nenv)
extra_tensors.update(recurrent_tensors)
_v_net = value_network
if _v_net is None or _v_net == 'shared':
vf_latent = policy_latent
else:
if _v_net == 'copy':
_v_net = policy_network
else:
assert callable(_v_net)
with tf.variable_scope('vf', reuse=tf.AUTO_REUSE):
vf_latent, _ = _v_net(encoded_x)
policy = PolicyWithValue(
env=env,
observations=X,
latent=policy_latent,
vf_latent=vf_latent,
sess=sess,
estimate_q=estimate_q,
**extra_tensors
)
return policy
return policy_fn
def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
rms = RunningMeanStd(shape=x.shape[1:])
norm_x = tf.clip_by_value((x - rms.mean) / rms.std, min(clip_range), max(clip_range))
return norm_x, rms
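
`build_policy` accepts several forms for `value_network`: `None` or `'shared'` reuse the policy latent, `'copy'` builds a second network with the policy architecture under the `vf` scope, and any other value must be callable. The sketch below isolates that decision in a hypothetical `resolve_value_network` helper. It raises `ValueError` where the diff uses an `assert`, which is a deliberate simplification.

```python
def resolve_value_network(value_network, policy_network):
    """Return the network function for the value head, or None if the latent is shared."""
    if value_network is None or value_network == 'shared':
        return None
    if value_network == 'copy':
        return policy_network
    if not callable(value_network):
        raise ValueError('value_network must be None, "shared", "copy" or a callable')
    return value_network

def policy_net(x):
    # Toy network function following the (latent, recurrent_tensors) contract.
    return ('pi_latent', x), None

def custom_vf_net(x):
    return ('vf_latent', x), None

shared = resolve_value_network(None, policy_net)
copied = resolve_value_network('copy', policy_net)
custom = resolve_value_network(custom_vf_net, policy_net)
```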


@@ -0,0 +1,293 @@
# flake8: noqa F403, F405
from .atari_wrappers import *
import numpy as np
import gym
class TimeLimit(gym.Wrapper):
def __init__(self, env, max_episode_steps=None):
super(TimeLimit, self).__init__(env)
self._max_episode_steps = max_episode_steps
self._elapsed_steps = 0
def step(self, ac):
observation, reward, done, info = self.env.step(ac)
self._elapsed_steps += 1
if self._elapsed_steps >= self._max_episode_steps:
done = True
info['TimeLimit.truncated'] = True
return observation, reward, done, info
def reset(self, **kwargs):
self._elapsed_steps = 0
return self.env.reset(**kwargs)
class StochasticFrameSkip(gym.Wrapper):
def __init__(self, env, n, stickprob):
gym.Wrapper.__init__(self, env)
self.n = n
self.stickprob = stickprob
self.curac = None
self.rng = np.random.RandomState()
self.supports_want_render = hasattr(env, "supports_want_render")
def reset(self, **kwargs):
self.curac = None
return self.env.reset(**kwargs)
def step(self, ac):
done = False
totrew = 0
for i in range(self.n):
# First step after reset, use action
if self.curac is None:
self.curac = ac
# First substep, delay with probability=stickprob
elif i==0:
if self.rng.rand() > self.stickprob:
self.curac = ac
# Second substep, new action definitely kicks in
elif i==1:
self.curac = ac
if self.supports_want_render and i<self.n-1:
ob, rew, done, info = self.env.step(self.curac, want_render=False)
else:
ob, rew, done, info = self.env.step(self.curac)
totrew += rew
if done: break
return ob, totrew, done, info
def seed(self, s):
self.rng.seed(s)
class PartialFrameStack(gym.Wrapper):
def __init__(self, env, k, channel=1):
"""
Stack one channel (channel keyword) from previous frames
"""
gym.Wrapper.__init__(self, env)
shp = env.observation_space.shape
self.channel = channel
self.observation_space = gym.spaces.Box(low=0, high=255,
shape=(shp[0], shp[1], shp[2] + k - 1),
dtype=env.observation_space.dtype)
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
def reset(self):
ob = self.env.reset()
assert ob.shape[2] > self.channel
for _ in range(self.k):
self.frames.append(ob)
return self._get_ob()
def step(self, ac):
ob, reward, done, info = self.env.step(ac)
self.frames.append(ob)
return self._get_ob(), reward, done, info
def _get_ob(self):
assert len(self.frames) == self.k
return np.concatenate([frame if i==self.k-1 else frame[:,:,self.channel:self.channel+1]
for (i, frame) in enumerate(self.frames)], axis=2)
class Downsample(gym.ObservationWrapper):
def __init__(self, env, ratio):
"""
Downsample images by a factor of ratio
"""
gym.ObservationWrapper.__init__(self, env)
(oldh, oldw, oldc) = env.observation_space.shape
newshape = (oldh//ratio, oldw//ratio, oldc)
self.observation_space = spaces.Box(low=0, high=255,
shape=newshape, dtype=np.uint8)
def observation(self, frame):
height, width, _ = self.observation_space.shape
frame = cv2.resize(frame, (width, height), interpolation=cv2.INTER_AREA)
if frame.ndim == 2:
frame = frame[:,:,None]
return frame
class Rgb2gray(gym.ObservationWrapper):
def __init__(self, env):
"""
Convert RGB images to grayscale
"""
gym.ObservationWrapper.__init__(self, env)
(oldh, oldw, _oldc) = env.observation_space.shape
self.observation_space = spaces.Box(low=0, high=255,
shape=(oldh, oldw, 1), dtype=np.uint8)
def observation(self, frame):
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
return frame[:,:,None]
class MovieRecord(gym.Wrapper):
def __init__(self, env, savedir, k):
gym.Wrapper.__init__(self, env)
self.savedir = savedir
self.k = k
self.epcount = 0
def reset(self):
if self.epcount % self.k == 0:
print('saving movie this episode', self.savedir)
self.env.unwrapped.movie_path = self.savedir
else:
print('not saving this episode')
self.env.unwrapped.movie_path = None
self.env.unwrapped.movie = None
self.epcount += 1
return self.env.reset()
class AppendTimeout(gym.Wrapper):
def __init__(self, env):
gym.Wrapper.__init__(self, env)
self.action_space = env.action_space
self.timeout_space = gym.spaces.Box(low=np.array([0.0]), high=np.array([1.0]), dtype=np.float32)
self.original_os = env.observation_space
if isinstance(self.original_os, gym.spaces.Dict):
import copy
ordered_dict = copy.deepcopy(self.original_os.spaces)
ordered_dict['value_estimation_timeout'] = self.timeout_space
self.observation_space = gym.spaces.Dict(ordered_dict)
self.dict_mode = True
else:
self.observation_space = gym.spaces.Dict({
'original': self.original_os,
'value_estimation_timeout': self.timeout_space
})
self.dict_mode = False
self.ac_count = None
while 1:
if not hasattr(env, "_max_episode_steps"): # Looking for TimeLimit wrapper that has this field
env = env.env
continue
break
self.timeout = env._max_episode_steps
def step(self, ac):
self.ac_count += 1
ob, rew, done, info = self.env.step(ac)
return self._process(ob), rew, done, info
def reset(self):
self.ac_count = 0
return self._process(self.env.reset())
def _process(self, ob):
fracmissing = 1 - self.ac_count / self.timeout
if self.dict_mode:
ob['value_estimation_timeout'] = fracmissing
return ob  # dict observations are updated in place and must still be returned
else:
return { 'original': ob, 'value_estimation_timeout': fracmissing }
class StartDoingRandomActionsWrapper(gym.Wrapper):
"""
Warning: can eat info dicts, not good if you depend on them
"""
def __init__(self, env, max_random_steps, on_startup=True, every_episode=False):
gym.Wrapper.__init__(self, env)
self.on_startup = on_startup
self.every_episode = every_episode
self.random_steps = max_random_steps
self.last_obs = None
if on_startup:
self.some_random_steps()
def some_random_steps(self):
self.last_obs = self.env.reset()
n = np.random.randint(self.random_steps)
#print("running for random %i frames" % n)
for _ in range(n):
self.last_obs, _, done, _ = self.env.step(self.env.action_space.sample())
if done: self.last_obs = self.env.reset()
def reset(self):
return self.last_obs
def step(self, a):
self.last_obs, rew, done, info = self.env.step(a)
if done:
self.last_obs = self.env.reset()
if self.every_episode:
self.some_random_steps()
return self.last_obs, rew, done, info
def make_retro(*, game, state, max_episode_steps, **kwargs):
import retro
env = retro.make(game, state, **kwargs)
env = StochasticFrameSkip(env, n=4, stickprob=0.25)
if max_episode_steps is not None:
env = TimeLimit(env, max_episode_steps=max_episode_steps)
return env
def wrap_deepmind_retro(env, scale=True, frame_stack=4):
"""
Configure environment for retro games, using config similar to DeepMind-style Atari in wrap_deepmind
"""
env = WarpFrame(env)
env = ClipRewardEnv(env)
env = FrameStack(env, frame_stack)
if scale:
env = ScaledFloatFrame(env)
return env
class SonicDiscretizer(gym.ActionWrapper):
"""
Wrap a gym-retro environment and make it use discrete
actions for the Sonic game.
"""
def __init__(self, env):
super(SonicDiscretizer, self).__init__(env)
buttons = ["B", "A", "MODE", "START", "UP", "DOWN", "LEFT", "RIGHT", "C", "Y", "X", "Z"]
actions = [['LEFT'], ['RIGHT'], ['LEFT', 'DOWN'], ['RIGHT', 'DOWN'], ['DOWN'],
['DOWN', 'B'], ['B']]
self._actions = []
for action in actions:
arr = np.array([False] * 12)
for button in action:
arr[buttons.index(button)] = True
self._actions.append(arr)
self.action_space = gym.spaces.Discrete(len(self._actions))
def action(self, a): # pylint: disable=W0221
return self._actions[a].copy()
class RewardScaler(gym.RewardWrapper):
"""
Bring rewards to a reasonable scale for PPO.
This is incredibly important and affects performance
drastically.
"""
def __init__(self, env, scale=0.01):
super(RewardScaler, self).__init__(env)
self.scale = scale
def reward(self, reward):
return reward * self.scale
class AllowBacktracking(gym.Wrapper):
"""
Use deltas in max(X) as the reward, rather than deltas
in X. This way, agents are not discouraged too heavily
from exploring backwards if there is no way to advance
head-on in the level.
"""
def __init__(self, env):
super(AllowBacktracking, self).__init__(env)
self._cur_x = 0
self._max_x = 0
def reset(self, **kwargs): # pylint: disable=E0202
self._cur_x = 0
self._max_x = 0
return self.env.reset(**kwargs)
def step(self, action): # pylint: disable=E0202
obs, rew, done, info = self.env.step(action)
self._cur_x += rew
rew = max(0, self._cur_x - self._max_x)
self._max_x = max(self._max_x, self._cur_x)
return obs, rew, done, info
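
The reward shaping in AllowBacktracking can be illustrated without gym. The following is a minimal sketch, assuming the wrapped environment's reward is the per-step change in X; `backtracking_rewards` is a hypothetical helper written for this illustration and is not part of the repository.

```python
def backtracking_rewards(deltas):
    """Turn per-step X deltas into rewards that only pay for new maximum progress."""
    cur_x = 0   # current position, accumulated from the raw rewards
    max_x = 0   # furthest position reached so far in the episode
    rewards = []
    for delta in deltas:
        cur_x += delta
        # Reward only the part of the move that goes beyond the previous maximum.
        rewards.append(max(0, cur_x - max_x))
        max_x = max(max_x, cur_x)
    return rewards


# Moving back by 3 costs nothing; regaining lost ground earns nothing either.
print(backtracking_rewards([2, -3, 1, 4]))  # [2, 0, 0, 2]
```

The rewards sum to the furthest X reached, so an agent that retreats to find another route is not penalised for the retreat itself.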

View File

@@ -0,0 +1,19 @@
import numpy as np
from abc import ABC, abstractmethod
class AbstractEnvRunner(ABC):
def __init__(self, *, env, model, nsteps):
self.env = env
self.model = model
self.nenv = nenv = env.num_envs if hasattr(env, 'num_envs') else 1
self.batch_ob_shape = (nenv*nsteps,) + env.observation_space.shape
self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
self.obs[:] = env.reset()
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
@abstractmethod
def run(self):
raise NotImplementedError

View File

@@ -0,0 +1,187 @@
import tensorflow as tf
import numpy as np
from baselines.common.tf_util import get_session
class RunningMeanStd(object):
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
def __init__(self, epsilon=1e-4, shape=()):
self.mean = np.zeros(shape, 'float64')
self.var = np.ones(shape, 'float64')
self.count = epsilon
def update(self, x):
batch_mean = np.mean(x, axis=0)
batch_var = np.var(x, axis=0)
batch_count = x.shape[0]
self.update_from_moments(batch_mean, batch_var, batch_count)
def update_from_moments(self, batch_mean, batch_var, batch_count):
self.mean, self.var, self.count = update_mean_var_count_from_moments(
self.mean, self.var, self.count, batch_mean, batch_var, batch_count)
def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
delta = batch_mean - mean
tot_count = count + batch_count
new_mean = mean + delta * batch_count / tot_count
m_a = var * count
m_b = batch_var * batch_count
M2 = m_a + m_b + np.square(delta) * count * batch_count / tot_count
new_var = M2 / tot_count
new_count = tot_count
return new_mean, new_var, new_count
class TfRunningMeanStd(object):
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
'''
TensorFlow variables-based implementation of computing running mean and std.
The benefit of this implementation is that it can be saved / loaded together with the TensorFlow model
'''
def __init__(self, epsilon=1e-4, shape=(), scope=''):
sess = get_session()
self._new_mean = tf.placeholder(shape=shape, dtype=tf.float64)
self._new_var = tf.placeholder(shape=shape, dtype=tf.float64)
self._new_count = tf.placeholder(shape=(), dtype=tf.float64)
with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
self._mean = tf.get_variable('mean', initializer=np.zeros(shape, 'float64'), dtype=tf.float64)
self._var = tf.get_variable('std', initializer=np.ones(shape, 'float64'), dtype=tf.float64)
self._count = tf.get_variable('count', initializer=np.full((), epsilon, 'float64'), dtype=tf.float64)
self.update_ops = tf.group([
self._var.assign(self._new_var),
self._mean.assign(self._new_mean),
self._count.assign(self._new_count)
])
sess.run(tf.variables_initializer([self._mean, self._var, self._count]))
self.sess = sess
self._set_mean_var_count()
def _set_mean_var_count(self):
self.mean, self.var, self.count = self.sess.run([self._mean, self._var, self._count])
def update(self, x):
batch_mean = np.mean(x, axis=0)
batch_var = np.var(x, axis=0)
batch_count = x.shape[0]
new_mean, new_var, new_count = update_mean_var_count_from_moments(self.mean, self.var, self.count, batch_mean, batch_var, batch_count)
self.sess.run(self.update_ops, feed_dict={
self._new_mean: new_mean,
self._new_var: new_var,
self._new_count: new_count
})
self._set_mean_var_count()
def test_runningmeanstd():
for (x1, x2, x3) in [
(np.random.randn(3), np.random.randn(4), np.random.randn(5)),
(np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
]:
rms = RunningMeanStd(epsilon=0.0, shape=x1.shape[1:])
x = np.concatenate([x1, x2, x3], axis=0)
ms1 = [x.mean(axis=0), x.var(axis=0)]
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = [rms.mean, rms.var]
np.testing.assert_allclose(ms1, ms2)
def test_tf_runningmeanstd():
for (x1, x2, x3) in [
(np.random.randn(3), np.random.randn(4), np.random.randn(5)),
(np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
]:
rms = TfRunningMeanStd(epsilon=0.0, shape=x1.shape[1:], scope='running_mean_std' + str(np.random.randint(0, 128)))
x = np.concatenate([x1, x2, x3], axis=0)
ms1 = [x.mean(axis=0), x.var(axis=0)]
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = [rms.mean, rms.var]
np.testing.assert_allclose(ms1, ms2)
def profile_tf_runningmeanstd():
import time
from baselines.common import tf_util
tf_util.get_session( config=tf.ConfigProto(
inter_op_parallelism_threads=1,
intra_op_parallelism_threads=1,
allow_soft_placement=True
))
x = np.random.random((376,))
n_trials = 10000
rms = RunningMeanStd()
tfrms = TfRunningMeanStd()
tic1 = time.time()
for _ in range(n_trials):
rms.update(x)
tic2 = time.time()
for _ in range(n_trials):
tfrms.update(x)
tic3 = time.time()
print('rms update time ({} trials): {} s'.format(n_trials, tic2 - tic1))
print('tfrms update time ({} trials): {} s'.format(n_trials, tic3 - tic2))
tic1 = time.time()
for _ in range(n_trials):
z1 = rms.mean
tic2 = time.time()
for _ in range(n_trials):
z2 = tfrms.mean
assert z1 == z2
tic3 = time.time()
print('rms get mean time ({} trials): {} s'.format(n_trials, tic2 - tic1))
print('tfrms get mean time ({} trials): {} s'.format(n_trials, tic3 - tic2))
'''
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) #pylint: disable=E1101
run_metadata = tf.RunMetadata()
profile_opts = dict(options=options, run_metadata=run_metadata)
from tensorflow.python.client import timeline
fetched_timeline = timeline.Timeline(run_metadata.step_stats) #pylint: disable=E1101
chrome_trace = fetched_timeline.generate_chrome_trace_format()
outfile = '/tmp/timeline.json'
with open(outfile, 'wt') as f:
f.write(chrome_trace)
print(f'Successfully saved profile to {outfile}. Exiting.')
exit(0)
'''
if __name__ == '__main__':
profile_tf_runningmeanstd()
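
The moment-merging rule used by update_mean_var_count_from_moments can be checked on scalars with the standard library alone. This is a sketch under stated assumptions: `merge_moments` is a hypothetical scalar re-statement of the same parallel-variance formula, and it uses population variance, which matches the default of np.var.

```python
from statistics import mean, pvariance


def merge_moments(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Combine mean / population variance / count of two batches into one."""
    delta = mean_b - mean_a
    n = n_a + n_b
    new_mean = mean_a + delta * n_b / n
    # Sum of squared deviations of both batches plus a between-batch correction.
    m2 = var_a * n_a + var_b * n_b + delta ** 2 * n_a * n_b / n
    return new_mean, m2 / n, n


a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0, 40.0]
merged_mean, merged_var, merged_n = merge_moments(
    mean(a), pvariance(a), len(a), mean(b), pvariance(b), len(b))

# The merged statistics agree with statistics computed on the concatenated data.
assert abs(merged_mean - mean(a + b)) < 1e-9
assert abs(merged_var - pvariance(a + b)) < 1e-9
```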

View File

@@ -0,0 +1,46 @@
import numpy as np
# http://www.johndcook.com/blog/standard_deviation/
class RunningStat(object):
def __init__(self, shape):
self._n = 0
self._M = np.zeros(shape)
self._S = np.zeros(shape)
def push(self, x):
x = np.asarray(x)
assert x.shape == self._M.shape
self._n += 1
if self._n == 1:
self._M[...] = x
else:
oldM = self._M.copy()
self._M[...] = oldM + (x - oldM)/self._n
self._S[...] = self._S + (x - oldM)*(x - self._M)
@property
def n(self):
return self._n
@property
def mean(self):
return self._M
@property
def var(self):
return self._S/(self._n - 1) if self._n > 1 else np.square(self._M)
@property
def std(self):
return np.sqrt(self.var)
@property
def shape(self):
return self._M.shape
def test_running_stat():
for shp in ((), (3,), (3,4)):
li = []
rs = RunningStat(shp)
for _ in range(5):
val = np.random.randn(*shp)
rs.push(val)
li.append(val)
m = np.mean(li, axis=0)
assert np.allclose(rs.mean, m)
v = np.square(m) if (len(li) == 1) else np.var(li, ddof=1, axis=0)
assert np.allclose(rs.var, v)
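
The per-sample update in RunningStat.push is Welford's method. A scalar-only sketch, assuming plain floats instead of NumPy arrays, may make the recurrence easier to follow; `ScalarRunningStat` is a hypothetical name used only for this illustration.

```python
from statistics import mean, variance


class ScalarRunningStat:
    """Welford-style running mean and sample variance for scalar inputs."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.s = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        old_mean = self.mean
        self.mean = old_mean + (x - old_mean) / self.n
        # Uses both the old and the new mean, which keeps the update numerically stable.
        self.s += (x - old_mean) * (x - self.mean)

    @property
    def var(self):
        # Same fallback as RunningStat: with a single sample, return mean squared.
        return self.s / (self.n - 1) if self.n > 1 else self.mean ** 2


samples = [2.0, 4.0, 4.0, 5.0]
stat = ScalarRunningStat()
for value in samples:
    stat.push(value)

assert abs(stat.mean - mean(samples)) < 1e-12
assert abs(stat.var - variance(samples)) < 1e-12
```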

View File

@@ -12,10 +12,9 @@ class SegmentTree(object):
a) setting item's value is slightly slower.
It is O(lg capacity) instead of O(1).
b) user has access to an efficient `reduce`
operation which reduces `operation` over
a contiguous subsequence of items in the
array.
b) user has access to an efficient ( O(log segment size) )
`reduce` operation which reduces `operation` over
a contiguous subsequence of items in the array.
Parameters
---------
@@ -23,8 +22,8 @@ class SegmentTree(object):
Total size of the array - must be a power of two.
operation: lambda obj, obj -> obj
and operation for combining elements (eg. sum, max)
must for a mathematical group together with the set of
possible values for array elements.
must form a mathematical group together with the set of
possible values for array elements (i.e. be associative)
neutral_element: obj
neutral element for the operation above. eg. float('-inf')
for max and 0 for sum.
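
The trade-off described in the docstring (logarithmic-time updates in exchange for logarithmic-time range reductions) can be sketched as follows. This is an illustrative, iterative sum-only variant, not the repository's SegmentTree implementation; `SumSegmentTreeSketch` and its layout are assumptions made for the example.

```python
class SumSegmentTreeSketch:
    """Array-backed segment tree specialised to addition (neutral element 0)."""

    def __init__(self, capacity):
        # Capacity must be a power of two, as the docstring above requires.
        assert capacity > 0 and capacity & (capacity - 1) == 0
        self.capacity = capacity
        # Leaves live at indices [capacity, 2 * capacity); index 1 is the root.
        self.values = [0.0] * (2 * capacity)

    def __setitem__(self, idx, val):
        idx += self.capacity
        self.values[idx] = val
        idx //= 2
        while idx >= 1:  # refresh every ancestor: O(log capacity)
            self.values[idx] = self.values[2 * idx] + self.values[2 * idx + 1]
            idx //= 2

    def reduce(self, start, end):
        """Return the sum of items in the half-open range [start, end)."""
        result = 0.0
        lo, hi = start + self.capacity, end + self.capacity
        while lo < hi:
            if lo & 1:          # lo is a right child: take it and step right
                result += self.values[lo]
                lo += 1
            if hi & 1:          # hi is a right boundary: step left and take it
                hi -= 1
                result += self.values[hi]
            lo //= 2
            hi //= 2
        return result


tree = SumSegmentTreeSketch(8)
for i in range(8):
    tree[i] = float(i + 1)
print(tree.reduce(2, 6))  # 18.0, i.e. 3 + 4 + 5 + 6
```

Addition is commutative, so the order in which partial results are folded does not matter here; a non-commutative operation would need separate left and right accumulators.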

View File

@@ -0,0 +1,44 @@
import numpy as np
from gym import Env
from gym.spaces import Discrete
class FixedSequenceEnv(Env):
def __init__(
self,
n_actions=10,
seed=0,
episode_len=100
):
self.np_random = np.random.RandomState()
self.np_random.seed(seed)
self.sequence = [self.np_random.randint(0, n_actions) for _ in range(episode_len)]  # randint's upper bound is exclusive
self.action_space = Discrete(n_actions)
self.observation_space = Discrete(1)
self.episode_len = episode_len
self.time = 0
self.reset()
def reset(self):
self.time = 0
return 0
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
done = False
if self.episode_len and self.time >= self.episode_len:
rew = 0
done = True
return 0, rew, done, {}
def _choose_next_state(self):
self.time += 1
def _get_reward(self, actions):
return 1 if actions == self.sequence[self.time] else 0

View File

@@ -0,0 +1,70 @@
import numpy as np
from abc import abstractmethod
from gym import Env
from gym.spaces import Discrete, Box
class IdentityEnv(Env):
def __init__(
self,
episode_len=None
):
self.episode_len = episode_len
self.time = 0
self.reset()
def reset(self):
self._choose_next_state()
self.time = 0
self.observation_space = self.action_space
return self.state
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
done = False
if self.episode_len and self.time >= self.episode_len:
rew = 0
done = True
return self.state, rew, done, {}
def _choose_next_state(self):
self.state = self.action_space.sample()
self.time += 1
@abstractmethod
def _get_reward(self, actions):
raise NotImplementedError
class DiscreteIdentityEnv(IdentityEnv):
def __init__(
self,
dim,
episode_len=None,
):
self.action_space = Discrete(dim)
super().__init__(episode_len=episode_len)
def _get_reward(self, actions):
return 1 if self.state == actions else 0
class BoxIdentityEnv(IdentityEnv):
def __init__(
self,
shape,
episode_len=None,
):
self.action_space = Box(low=-1.0, high=1.0, shape=shape)
super().__init__(episode_len=episode_len)
def _get_reward(self, actions):
diff = actions - self.state
diff = diff[:]
return -0.5 * np.dot(diff, diff)

View File

@@ -0,0 +1,70 @@
import os.path as osp
import numpy as np
import tempfile
import filelock
from gym import Env
from gym.spaces import Discrete, Box
class MnistEnv(Env):
def __init__(
self,
seed=0,
episode_len=None,
no_images=None
):
from tensorflow.examples.tutorials.mnist import input_data
# we could use temporary directory for this with a context manager and
# TemporaryDirectory, but then each test that uses mnist would re-download the data
# this way the data is not cleaned up, but we only download it once per machine
mnist_path = osp.join(tempfile.gettempdir(), 'MNIST_data')
with filelock.FileLock(mnist_path + '.lock'):
self.mnist = input_data.read_data_sets(mnist_path)
self.np_random = np.random.RandomState()
self.np_random.seed(seed)
self.observation_space = Box(low=0.0, high=1.0, shape=(28,28,1))
self.action_space = Discrete(10)
self.episode_len = episode_len
self.time = 0
self.no_images = no_images
self.train_mode()
self.reset()
def reset(self):
self._choose_next_state()
self.time = 0
return self.state[0]
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
done = False
if self.episode_len and self.time >= self.episode_len:
rew = 0
done = True
return self.state[0], rew, done, {}
def train_mode(self):
self.dataset = self.mnist.train
def test_mode(self):
self.dataset = self.mnist.test
def _choose_next_state(self):
max_index = (self.no_images if self.no_images is not None else self.dataset.num_examples) - 1
index = self.np_random.randint(0, max_index)
image = self.dataset.images[index].reshape(28,28,1)*255
label = self.dataset.labels[index]
self.state = (image, label)
self.time += 1
def _get_reward(self, actions):
return 1 if self.state[1] == actions else 0

View File

@@ -0,0 +1,40 @@
import pytest
import gym
from baselines.run import get_learn_function
from baselines.common.tests.util import reward_per_episode_test
common_kwargs = dict(
total_timesteps=30000,
network='mlp',
gamma=1.0,
seed=0,
)
learn_kwargs = {
'a2c' : dict(nsteps=32, value_network='copy', lr=0.05),
'acktr': dict(nsteps=32, value_network='copy'),
'deepq': {},
'ppo2': dict(value_network='copy'),
'trpo_mpi': {}
}
@pytest.mark.slow
@pytest.mark.parametrize("alg", learn_kwargs.keys())
def test_cartpole(alg):
'''
Test if the algorithm (with an mlp policy)
can learn to balance the cartpole
'''
kwargs = common_kwargs.copy()
kwargs.update(learn_kwargs[alg])
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
def env_fn():
env = gym.make('CartPole-v0')
env.seed(0)
return env
reward_per_episode_test(env_fn, learn_fn, 100)

View File

@@ -0,0 +1,52 @@
import pytest
from baselines.common.tests.envs.fixed_sequence_env import FixedSequenceEnv
from baselines.common.tests.util import simple_test
from baselines.run import get_learn_function
common_kwargs = dict(
seed=0,
total_timesteps=20000,
nlstm=64
)
learn_kwargs = {
'a2c': {},
'ppo2': dict(nsteps=10, ent_coef=0.0, nminibatches=1),
# TODO enable sequential models for trpo_mpi (proper handling of nbatch and nsteps)
# github issue: https://github.com/openai/baselines/issues/188
# 'trpo_mpi': lambda e, p: trpo_mpi.learn(policy_fn=p(env=e), env=e, max_timesteps=30000, timesteps_per_batch=100, cg_iters=10, gamma=0.9, lam=1.0, max_kl=0.001)
}
alg_list = learn_kwargs.keys()
rnn_list = ['lstm', 'tflstm', 'tflstm_static']
@pytest.mark.slow
@pytest.mark.parametrize("alg", alg_list)
@pytest.mark.parametrize("rnn", rnn_list)
def test_fixed_sequence(alg, rnn):
'''
Test if the algorithm (with a given recurrent policy)
can learn to reproduce a fixed sequence of actions
'''
kwargs = learn_kwargs[alg].copy()  # copy so the shared module-level dict is not mutated
kwargs.update(common_kwargs)
episode_len = 5
env_fn = lambda: FixedSequenceEnv(10, episode_len=episode_len)
learn = lambda e: get_learn_function(alg)(
env=e,
network=rnn,
**kwargs
)
simple_test(env_fn, learn, 0.3)
if __name__ == '__main__':
test_fixed_sequence('ppo2', 'tflstm')

View File

@@ -0,0 +1,55 @@
import pytest
from baselines.common.tests.envs.identity_env import DiscreteIdentityEnv, BoxIdentityEnv
from baselines.run import get_learn_function
from baselines.common.tests.util import simple_test
common_kwargs = dict(
total_timesteps=30000,
network='mlp',
gamma=0.9,
seed=0,
)
learn_kwargs = {
'a2c' : {},
'acktr': {},
'deepq': {},
'ppo2': dict(lr=1e-3, nsteps=64, ent_coef=0.0),
'trpo_mpi': dict(timesteps_per_batch=100, cg_iters=10, gamma=0.9, lam=1.0, max_kl=0.01)
}
@pytest.mark.slow
@pytest.mark.parametrize("alg", learn_kwargs.keys())
def test_discrete_identity(alg):
'''
Test if the algorithm (with an mlp policy)
can learn an identity transformation (i.e. return observation as an action)
'''
kwargs = learn_kwargs[alg].copy()  # copy so the shared module-level dict is not mutated
kwargs.update(common_kwargs)
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
env_fn = lambda: DiscreteIdentityEnv(10, episode_len=100)
simple_test(env_fn, learn_fn, 0.9)
@pytest.mark.slow
@pytest.mark.parametrize("alg", ['a2c', 'ppo2', 'trpo_mpi'])
def test_continuous_identity(alg):
'''
Test if the algorithm (with an mlp policy)
can learn an identity transformation (i.e. return observation as an action)
to a required precision
'''
kwargs = learn_kwargs[alg].copy()  # copy so the shared module-level dict is not mutated
kwargs.update(common_kwargs)
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
env_fn = lambda: BoxIdentityEnv((1,), episode_len=100)
simple_test(env_fn, learn_fn, -0.1)
if __name__ == '__main__':
test_continuous_identity('a2c')

View File

@@ -0,0 +1,50 @@
import pytest
# from baselines.acer import acer_simple as acer
from baselines.common.tests.envs.mnist_env import MnistEnv
from baselines.common.tests.util import simple_test
from baselines.run import get_learn_function
# TODO investigate a2c and ppo2 failures - is it due to bad hyperparameters for this problem?
# GitHub issue https://github.com/openai/baselines/issues/189
common_kwargs = {
'seed': 0,
'network':'cnn',
'gamma':0.9,
'pad':'SAME'
}
learn_args = {
'a2c': dict(total_timesteps=50000),
# TODO need to resolve inference (step) API differences for acer; also slow
# 'acer': dict(seed=0, total_timesteps=1000),
'deepq': dict(total_timesteps=5000),
'acktr': dict(total_timesteps=30000),
'ppo2': dict(total_timesteps=50000, lr=1e-3, nsteps=128, ent_coef=0.0),
'trpo_mpi': dict(total_timesteps=80000, timesteps_per_batch=100, cg_iters=10, lam=1.0, max_kl=0.001)
}
#tests pass, but are too slow on travis. Same algorithms are covered
# by other tests with less compute-hungry nn's and by benchmarks
@pytest.mark.skip
@pytest.mark.slow
@pytest.mark.parametrize("alg", learn_args.keys())
def test_mnist(alg):
'''
Test if the algorithm can learn to classify MNIST digits.
Uses CNN policy.
'''
learn_kwargs = learn_args[alg].copy()  # copy so the shared module-level dict is not mutated
learn_kwargs.update(common_kwargs)
learn = get_learn_function(alg)
learn_fn = lambda e: learn(env=e, **learn_kwargs)
env_fn = lambda: MnistEnv(seed=0, episode_len=100)
simple_test(env_fn, learn_fn, 0.6)
if __name__ == '__main__':
test_mnist('deepq')

View File

@@ -0,0 +1,97 @@
import os
import tempfile
import pytest
import tensorflow as tf
import numpy as np
from baselines.common.tests.envs.mnist_env import MnistEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.run import get_learn_function
from baselines.common.tf_util import make_session, get_session
from functools import partial
learn_kwargs = {
'deepq': {},
'a2c': {},
'acktr': {},
'ppo2': {'nminibatches': 1, 'nsteps': 10},
'trpo_mpi': {},
}
network_kwargs = {
'mlp': {},
'cnn': {'pad': 'SAME'},
'lstm': {},
'cnn_lnlstm': {'pad': 'SAME'}
}
@pytest.mark.parametrize("learn_fn", learn_kwargs.keys())
@pytest.mark.parametrize("network_fn", network_kwargs.keys())
def test_serialization(learn_fn, network_fn):
'''
Test if the trained model can be serialized
'''
if network_fn.endswith('lstm') and learn_fn in ['acktr', 'trpo_mpi', 'deepq']:
# TODO make acktr work with recurrent policies
# and test
# github issue: https://github.com/openai/baselines/issues/194
return
env = DummyVecEnv([lambda: MnistEnv(10, episode_len=100)])
ob = env.reset().copy()
learn = get_learn_function(learn_fn)
kwargs = {}
kwargs.update(network_kwargs[network_fn])
kwargs.update(learn_kwargs[learn_fn])
learn = partial(learn, env=env, network=network_fn, seed=0, **kwargs)
with tempfile.TemporaryDirectory() as td:
model_path = os.path.join(td, 'serialization_test_model')
with tf.Graph().as_default(), make_session().as_default():
model = learn(total_timesteps=100)
model.save(model_path)
mean1, std1 = _get_action_stats(model, ob)
variables_dict1 = _serialize_variables()
with tf.Graph().as_default(), make_session().as_default():
model = learn(total_timesteps=0, load_path=model_path)
mean2, std2 = _get_action_stats(model, ob)
variables_dict2 = _serialize_variables()
for k, v in variables_dict1.items():
np.testing.assert_allclose(v, variables_dict2[k], atol=0.01,
err_msg='saved and loaded variable {} value mismatch'.format(k))
np.testing.assert_allclose(mean1, mean2, atol=0.5)
np.testing.assert_allclose(std1, std2, atol=0.5)
def _serialize_variables():
sess = get_session()
variables = tf.trainable_variables()
values = sess.run(variables)
return {var.name: value for var, value in zip(variables, values)}
def _get_action_stats(model, ob):
ntrials = 1000
if model.initial_state is None or model.initial_state == []:
actions = np.array([model.step(ob)[0] for _ in range(ntrials)])
else:
actions = np.array([model.step(ob, S=model.initial_state, M=[False])[0] for _ in range(ntrials)])
mean = np.mean(actions, axis=0)
std = np.std(actions, axis=0)
return mean, std

View File

@@ -3,67 +3,38 @@ import tensorflow as tf
from baselines.common.tf_util import (
function,
initialize,
set_value,
single_threaded_session
)
def test_set_value():
a = tf.Variable(42.)
with single_threaded_session():
set_value(a, 5)
assert a.eval() == 5
g = tf.get_default_graph()
g.finalize()
set_value(a, 6)
assert a.eval() == 6
# test the test
try:
assert a.eval() == 7
except AssertionError:
pass
else:
assert False, "assertion should have failed"
def test_function():
tf.reset_default_graph()
x = tf.placeholder(tf.int32, (), name="x")
y = tf.placeholder(tf.int32, (), name="y")
z = 3 * x + 2 * y
lin = function([x, y], z, givens={y: 0})
with tf.Graph().as_default():
x = tf.placeholder(tf.int32, (), name="x")
y = tf.placeholder(tf.int32, (), name="y")
z = 3 * x + 2 * y
lin = function([x, y], z, givens={y: 0})
with single_threaded_session():
initialize()
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(x=3) == 9
assert lin(2, 2) == 10
assert lin(x=2, y=3) == 12
assert lin(2) == 6
assert lin(2, 2) == 10
def test_multikwargs():
tf.reset_default_graph()
x = tf.placeholder(tf.int32, (), name="x")
with tf.variable_scope("other"):
x2 = tf.placeholder(tf.int32, (), name="x")
z = 3 * x + 2 * x2
with tf.Graph().as_default():
x = tf.placeholder(tf.int32, (), name="x")
with tf.variable_scope("other"):
x2 = tf.placeholder(tf.int32, (), name="x")
z = 3 * x + 2 * x2
lin = function([x, x2], z, givens={x2: 0})
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(2, 2) == 10
expt_caught = False
try:
lin(x=2)
except AssertionError:
expt_caught = True
assert expt_caught
lin = function([x, x2], z, givens={x2: 0})
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(2, 2) == 10
if __name__ == '__main__':
test_set_value()
test_function()
test_multikwargs()

View File

@@ -0,0 +1,92 @@
import tensorflow as tf
import numpy as np
from gym.spaces import np_random
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.bench.monitor import Monitor
N_TRIALS = 10000
N_EPISODES = 100
def simple_test(env_fn, learn_fn, min_reward_fraction, n_trials=N_TRIALS):
np.random.seed(0)
np_random.seed(0)
env = DummyVecEnv([lambda: Monitor(env_fn(), None, allow_early_resets=True)])
with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():
tf.set_random_seed(0)
model = learn_fn(env)
sum_rew = 0
done = True
for i in range(n_trials):
if done:
obs = env.reset()
state = model.initial_state
if state is not None:
a, v, state, _ = model.step(obs, S=state, M=[False])
else:
a, v, _, _ = model.step(obs)
obs, rew, done, _ = env.step(a)
sum_rew += float(rew)
print("Reward in {} trials is {}".format(n_trials, sum_rew))
assert sum_rew > min_reward_fraction * n_trials, \
'sum of rewards {} is less than {} of the total number of trials {}'.format(sum_rew, min_reward_fraction, n_trials)
def reward_per_episode_test(env_fn, learn_fn, min_avg_reward, n_trials=N_EPISODES):
env = DummyVecEnv([env_fn])
with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():
model = learn_fn(env)
observations, actions, rewards = rollout(env, model, n_trials)
rewards = [sum(r) for r in rewards]
avg_rew = sum(rewards) / n_trials
print("Average reward in {} episodes is {}".format(n_trials, avg_rew))
assert avg_rew > min_avg_reward, \
'average reward in {} episodes ({}) is less than {}'.format(n_trials, avg_rew, min_avg_reward)
def rollout(env, model, n_trials):
rewards = []
actions = []
observations = []
for i in range(n_trials):
obs = env.reset()
state = model.initial_state
episode_rew = []
episode_actions = []
episode_obs = []
while True:
if state is not None:
a, v, state, _ = model.step(obs, S=state, M=[False])
else:
a,v, _, _ = model.step(obs)
obs, rew, done, _ = env.step(a)
episode_rew.append(rew)
episode_actions.append(a)
episode_obs.append(obs)
if done:
break
rewards.append(episode_rew)
actions.append(episode_actions)
observations.append(episode_obs)
return observations, actions, rewards

View File

@@ -1,55 +1,11 @@
import joblib
import numpy as np
import tensorflow as tf # pylint: ignore-module
import builtins
import functools
import copy
import os
import functools
import collections
# ================================================================
# Make consistent with numpy
# ================================================================
clip = tf.clip_by_value
def sum(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_sum(x, axis=axis, keep_dims=keepdims)
def mean(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_mean(x, axis=axis, keep_dims=keepdims)
def var(x, axis=None, keepdims=False):
meanx = mean(x, axis=axis, keepdims=keepdims)
return mean(tf.square(x - meanx), axis=axis, keepdims=keepdims)
def std(x, axis=None, keepdims=False):
return tf.sqrt(var(x, axis=axis, keepdims=keepdims))
def max(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_max(x, axis=axis, keep_dims=keepdims)
def min(x, axis=None, keepdims=False):
axis = None if axis is None else [axis]
return tf.reduce_min(x, axis=axis, keep_dims=keepdims)
def concatenate(arrs, axis=0):
return tf.concat(axis=axis, values=arrs)
def argmax(x, axis=None):
return tf.argmax(x, axis=axis)
import multiprocessing
def switch(condition, then_expression, else_expression):
"""Switches between two operations depending on a scalar value (int or bool).
@@ -72,120 +28,15 @@ def switch(condition, then_expression, else_expression):
# Extras
# ================================================================
def l2loss(params):
if len(params) == 0:
return tf.constant(0.0)
else:
return tf.add_n([sum(tf.square(p)) for p in params])
def lrelu(x, leak=0.2):
f1 = 0.5 * (1 + leak)
f2 = 0.5 * (1 - leak)
return f1 * x + f2 * abs(x)
def categorical_sample_logits(X):
# https://github.com/tensorflow/tensorflow/issues/456
U = tf.random_uniform(tf.shape(X))
return argmax(X - tf.log(-tf.log(U)), axis=1)
# ================================================================
# Inputs
# ================================================================
def is_placeholder(x):
return type(x) is tf.Tensor and len(x.op.inputs) == 0
class TfInput(object):
def __init__(self, name="(unnamed)"):
"""Generalized Tensorflow placeholder. The main differences are:
- possibly uses multiple placeholders internally and returns multiple values
- can apply light postprocessing to the value feed to placeholder.
"""
self.name = name
def get(self):
"""Return the tf variable(s) representing the possibly postprocessed value
of placeholder(s).
"""
raise NotImplemented()
def make_feed_dict(data):
"""Given data input it to the placeholder(s)."""
raise NotImplemented()
class PlacholderTfInput(TfInput):
def __init__(self, placeholder):
"""Wrapper for regular tensorflow placeholder."""
super().__init__(placeholder.name)
self._placeholder = placeholder
def get(self):
return self._placeholder
def make_feed_dict(self, data):
return {self._placeholder: data}
class BatchInput(PlacholderTfInput):
def __init__(self, shape, dtype=tf.float32, name=None):
"""Creates a placeholder for a batch of tensors of a given shape and dtype
Parameters
----------
shape: [int]
shape of a single elemenet of the batch
dtype: tf.dtype
number representation used for tensor contents
name: str
name of the underlying placeholder
"""
super().__init__(tf.placeholder(dtype, [None] + list(shape), name=name))
class Uint8Input(PlacholderTfInput):
def __init__(self, shape, name=None):
"""Takes input in uint8 format which is cast to float32 and divided by 255
before passing it to the model.
On GPU this ensures lower data transfer times.
Parameters
----------
shape: [int]
shape of the tensor.
name: str
name of the underlying placeholder
"""
super().__init__(tf.placeholder(tf.uint8, [None] + list(shape), name=name))
self._shape = shape
self._output = tf.cast(super().get(), tf.float32) / 255.0
def get(self):
return self._output
def ensure_tf_input(thing):
"""Takes either tf.placeholder of TfInput and outputs equivalent TfInput"""
if isinstance(thing, TfInput):
return thing
elif is_placeholder(thing):
return PlacholderTfInput(thing)
else:
raise ValueError("Must be a placeholder or TfInput")
# ================================================================
# Mathematical utils
# ================================================================
def huber_loss(x, delta=1.0):
"""Reference: https://en.wikipedia.org/wiki/Huber_loss"""
return tf.where(
@@ -194,103 +45,63 @@ def huber_loss(x, delta=1.0):
delta * (tf.abs(x) - 0.5 * delta)
)
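The hunk above elides the first branch of `huber_loss`. As a rough scalar illustration, assuming the standard definition from the linked Wikipedia article (quadratic for |x| < delta, linear otherwise), the function behaves like this pure-Python sketch. It is not the TensorFlow implementation, only a reference for the arithmetic.

```python
def huber_loss(x, delta=1.0):
    """Scalar Huber loss: quadratic near zero, linear in the tails."""
    if abs(x) < delta:
        # Quadratic branch (assumed from the standard definition; elided in the diff).
        return 0.5 * x * x
    # Linear branch, matching the expression visible in the diff.
    return delta * (abs(x) - 0.5 * delta)


print(huber_loss(0.5))  # 0.125
print(huber_loss(3.0))  # 2.5
```

The two branches agree at |x| = delta, so the loss is continuous there.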
# ================================================================
# Optimizer utils
# ================================================================
def minimize_and_clip(optimizer, objective, var_list, clip_val=10):
"""Minimizes `objective` using `optimizer` w.r.t. the variables in
`var_list`, while ensuring that the norm of the gradient for each
variable is clipped to `clip_val`.
"""
gradients = optimizer.compute_gradients(objective, var_list=var_list)
for i, (grad, var) in enumerate(gradients):
if grad is not None:
gradients[i] = (tf.clip_by_norm(grad, clip_val), var)
return optimizer.apply_gradients(gradients)
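To make the per-variable clipping concrete, here is a small pure-Python sketch of what clipping a gradient by its L2 norm does. The helper name `clip_by_norm` mirrors `tf.clip_by_norm`, but this list-based version is only an illustration written for this note, not TensorFlow code.

```python
import math


def clip_by_norm(grad, clip_val):
    """Rescale grad so that its L2 norm does not exceed clip_val."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_val:
        # Already within the limit: leave the gradient unchanged.
        return list(grad)
    scale = clip_val / norm
    return [g * scale for g in grad]


small = clip_by_norm([3.0, 4.0], 10)    # norm 5, unchanged
large = clip_by_norm([30.0, 40.0], 10)  # norm 50, rescaled to norm 10
print(small, large)
```

Note that each variable's gradient is clipped independently, so the direction of every individual gradient is preserved but the direction of the overall update may change.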
# ================================================================
# Global session
# ================================================================
def get_session():
"""Returns recently made Tensorflow session"""
return tf.get_default_session()
def get_session(config=None):
"""Get default session or create one with a given config"""
sess = tf.get_default_session()
if sess is None:
sess = make_session(config=config, make_default=True)
return sess
def make_session(num_cpu):
def make_session(config=None, num_cpu=None, make_default=False, graph=None):
"""Returns a session that will use <num_cpu> CPUs only"""
tf_config = tf.ConfigProto(
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
return tf.Session(config=tf_config)
if num_cpu is None:
num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
if config is None:
config = tf.ConfigProto(
allow_soft_placement=True,
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
config.gpu_options.allow_growth = True
if make_default:
return tf.InteractiveSession(config=config, graph=graph)
else:
return tf.Session(config=config, graph=graph)
def single_threaded_session():
"""Returns a session which will only use a single CPU"""
return make_session(1)
return make_session(num_cpu=1)
def in_session(f):
@functools.wraps(f)
def newfunc(*args, **kwargs):
with tf.Session():
f(*args, **kwargs)
return newfunc
ALREADY_INITIALIZED = set()
def initialize():
"""Initialize all the uninitialized variables in the global scope."""
new_variables = set(tf.global_variables()) - ALREADY_INITIALIZED
get_session().run(tf.variables_initializer(new_variables))
ALREADY_INITIALIZED.update(new_variables)
def eval(expr, feed_dict=None):
if feed_dict is None:
feed_dict = {}
return get_session().run(expr, feed_dict=feed_dict)
VALUE_SETTERS = collections.OrderedDict()
def set_value(v, val):
global VALUE_SETTERS
if v in VALUE_SETTERS:
set_op, set_endpoint = VALUE_SETTERS[v]
else:
set_endpoint = tf.placeholder(v.dtype)
set_op = v.assign(set_endpoint)
VALUE_SETTERS[v] = (set_op, set_endpoint)
get_session().run(set_op, feed_dict={set_endpoint: val})
# ================================================================
# Saving variables
# ================================================================
def load_state(fname):
saver = tf.train.Saver()
saver.restore(get_session(), fname)
def save_state(fname):
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(get_session(), fname)
# ================================================================
# Model components
# ================================================================
def normc_initializer(std=1.0):
def normc_initializer(std=1.0, axis=0):
def _initializer(shape, dtype=None, partition_info=None): # pylint: disable=W0613
out = np.random.randn(*shape).astype(np.float32)
out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
out = np.random.randn(*shape).astype(dtype.as_numpy_dtype)
out *= std / np.sqrt(np.square(out).sum(axis=axis, keepdims=True))
return tf.constant(out)
return _initializer
def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad="SAME", dtype=tf.float32, collections=None,
summary_tag=None):
with tf.variable_scope(name):
@@ -320,47 +131,10 @@ def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad="SAME",
return tf.nn.conv2d(x, w, stride_shape, pad) + b
def dense(x, size, name, weight_init=None, bias=True):
w = tf.get_variable(name + "/w", [x.get_shape()[1], size], initializer=weight_init)
ret = tf.matmul(x, w)
if bias:
b = tf.get_variable(name + "/b", [size], initializer=tf.zeros_initializer())
return ret + b
else:
return ret
def wndense(x, size, name, init_scale=1.0):
v = tf.get_variable(name + "/V", [int(x.get_shape()[1]), size],
initializer=tf.random_normal_initializer(0, 0.05))
g = tf.get_variable(name + "/g", [size], initializer=tf.constant_initializer(init_scale))
b = tf.get_variable(name + "/b", [size], initializer=tf.constant_initializer(0.0))
# use weight normalization (Salimans & Kingma, 2016)
x = tf.matmul(x, v)
scaler = g / tf.sqrt(sum(tf.square(v), axis=0, keepdims=True))
return tf.reshape(scaler, [1, size]) * x + tf.reshape(b, [1, size])
def densenobias(x, size, name, weight_init=None):
return dense(x, size, name, weight_init=weight_init, bias=False)
def dropout(x, pkeep, phase=None, mask=None):
mask = tf.floor(pkeep + tf.random_uniform(tf.shape(x))) if mask is None else mask
if phase is None:
return mask * x
else:
return switch(phase, mask * x, pkeep * x)
# ================================================================
# Theano-like Function
# ================================================================
def function(inputs, outputs, updates=None, givens=None):
"""Just like Theano function. Take a bunch of tensorflow placeholders and expressions
computed based on those placeholders and produces f(inputs) -> outputs. Function f takes
@@ -386,7 +160,7 @@ def function(inputs, outputs, updates=None, givens=None):
Parameters
----------
inputs: [tf.placeholder or TfInput]
inputs: [tf.placeholder, tf.constant, or object with make_feed_dict method]
list of input arguments
outputs: [tf.Variable] or tf.Variable
list of outputs or a single output to be returned from function. Returned
@@ -403,190 +177,34 @@ def function(inputs, outputs, updates=None, givens=None):
class _Function(object):
def __init__(self, inputs, outputs, updates, givens, check_nan=False):
def __init__(self, inputs, outputs, updates, givens):
for inpt in inputs:
if not issubclass(type(inpt), TfInput):
assert len(inpt.op.inputs) == 0, "inputs should all be placeholders of baselines.common.TfInput"
if not hasattr(inpt, 'make_feed_dict') and not (type(inpt) is tf.Tensor and len(inpt.op.inputs) == 0):
assert False, "inputs should all be placeholders, constants, or have a make_feed_dict method"
self.inputs = inputs
updates = updates or []
self.update_group = tf.group(*updates)
self.outputs_update = list(outputs) + [self.update_group]
self.givens = {} if givens is None else givens
self.check_nan = check_nan
def _feed_input(self, feed_dict, inpt, value):
if issubclass(type(inpt), TfInput):
if hasattr(inpt, 'make_feed_dict'):
feed_dict.update(inpt.make_feed_dict(value))
elif is_placeholder(inpt):
feed_dict[inpt] = value
else:
feed_dict[inpt] = adjust_shape(inpt, value)
def __call__(self, *args, **kwargs):
def __call__(self, *args):
assert len(args) <= len(self.inputs), "Too many arguments provided"
feed_dict = {}
# Update the args
for inpt, value in zip(self.inputs, args):
self._feed_input(feed_dict, inpt, value)
# Update the kwargs
kwargs_passed_inpt_names = set()
for inpt in self.inputs[len(args):]:
inpt_name = inpt.name.split(':')[0]
inpt_name = inpt_name.split('/')[-1]
assert inpt_name not in kwargs_passed_inpt_names, \
"this function has two arguments with the same name \"{}\", so kwargs cannot be used.".format(inpt_name)
if inpt_name in kwargs:
kwargs_passed_inpt_names.add(inpt_name)
self._feed_input(feed_dict, inpt, kwargs.pop(inpt_name))
else:
assert inpt in self.givens, "Missing argument " + inpt_name
assert len(kwargs) == 0, "Function got extra arguments " + str(list(kwargs.keys()))
# Update feed dict with givens.
for inpt in self.givens:
feed_dict[inpt] = feed_dict.get(inpt, self.givens[inpt])
feed_dict[inpt] = adjust_shape(inpt, feed_dict.get(inpt, self.givens[inpt]))
results = get_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
if self.check_nan:
if any(np.isnan(r).any() for r in results):
raise RuntimeError("Nan detected")
return results
def mem_friendly_function(nondata_inputs, data_inputs, outputs, batch_size):
if isinstance(outputs, list):
return _MemFriendlyFunction(nondata_inputs, data_inputs, outputs, batch_size)
else:
f = _MemFriendlyFunction(nondata_inputs, data_inputs, [outputs], batch_size)
return lambda *inputs: f(*inputs)[0]
class _MemFriendlyFunction(object):
def __init__(self, nondata_inputs, data_inputs, outputs, batch_size):
self.nondata_inputs = nondata_inputs
self.data_inputs = data_inputs
self.outputs = list(outputs)
self.batch_size = batch_size
def __call__(self, *inputvals):
assert len(inputvals) == len(self.nondata_inputs) + len(self.data_inputs)
nondata_vals = inputvals[0:len(self.nondata_inputs)]
data_vals = inputvals[len(self.nondata_inputs):]
feed_dict = dict(zip(self.nondata_inputs, nondata_vals))
n = data_vals[0].shape[0]
for v in data_vals[1:]:
assert v.shape[0] == n
for i_start in range(0, n, self.batch_size):
slice_vals = [v[i_start:builtins.min(i_start + self.batch_size, n)] for v in data_vals]
for (var, val) in zip(self.data_inputs, slice_vals):
feed_dict[var] = val
results = tf.get_default_session().run(self.outputs, feed_dict=feed_dict)
if i_start == 0:
sum_results = results
else:
for i in range(len(results)):
sum_results[i] = sum_results[i] + results[i]
for i in range(len(results)):
sum_results[i] = sum_results[i] / n
return sum_results
# ================================================================
# Modules
# ================================================================
class Module(object):
def __init__(self, name):
self.name = name
self.first_time = True
self.scope = None
self.cache = {}
def __call__(self, *args):
if args in self.cache:
print("(%s) retrieving value from cache" % (self.name,))
return self.cache[args]
with tf.variable_scope(self.name, reuse=not self.first_time):
scope = tf.get_variable_scope().name
if self.first_time:
self.scope = scope
print("(%s) running function for the first time" % (self.name,))
else:
assert self.scope == scope, "Tried calling function with a different scope"
print("(%s) running function on new inputs" % (self.name,))
self.first_time = False
out = self._call(*args)
self.cache[args] = out
return out
def _call(self, *args):
raise NotImplementedError
@property
def trainable_variables(self):
assert self.scope is not None, "need to call module once before getting variables"
return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)
@property
def variables(self):
assert self.scope is not None, "need to call module once before getting variables"
return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, self.scope)
def module(name):
@functools.wraps
def wrapper(f):
class WrapperModule(Module):
def _call(self, *args):
return f(*args)
return WrapperModule(name)
return wrapper
# ================================================================
# Graph traversal
# ================================================================
VARIABLES = {}
def get_parents(node):
return node.op.inputs
def topsorted(outputs):
"""
Topological sort via non-recursive depth-first search
"""
assert isinstance(outputs, (list, tuple))
marks = {}
out = []
stack = [] # pylint: disable=W0621
# i: node
# jidx = number of children visited so far from that node
# marks: state of each node, which is one of
# 0: haven't visited
# 1: have visited, but not done visiting children
# 2: done visiting children
for x in outputs:
stack.append((x, 0))
while stack:
(i, jidx) = stack.pop()
if jidx == 0:
m = marks.get(i, 0)
if m == 0:
marks[i] = 1
elif m == 1:
raise ValueError("not a dag")
else:
continue
ps = get_parents(i)
if jidx == len(ps):
marks[i] = 2
out.append(i)
else:
stack.append((i, jidx + 1))
j = ps[jidx]
stack.append((j, 0))
return out
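The same stack-based traversal can be tried outside TensorFlow. The sketch below assumes the graph is given as a plain dict mapping each node to its list of parents (a hypothetical stand-in for `node.op.inputs`) and otherwise follows the marking scheme described in the comments above.

```python
def topsorted(outputs, parents):
    """Topological sort via non-recursive depth-first search.

    parents: dict mapping a node to the list of nodes it depends on
             (hypothetical stand-in for node.op.inputs).
    """
    marks = {}  # 0/absent: unvisited, 1: in progress, 2: finished
    out = []
    stack = [(x, 0) for x in outputs]
    while stack:
        node, jidx = stack.pop()
        if jidx == 0:
            m = marks.get(node, 0)
            if m == 0:
                marks[node] = 1
            elif m == 1:
                raise ValueError("not a dag")
            else:
                continue
        ps = parents.get(node, [])
        if jidx == len(ps):
            # All parents emitted, so this node can be emitted too.
            marks[node] = 2
            out.append(node)
        else:
            stack.append((node, jidx + 1))
            stack.append((ps[jidx], 0))
    return out


graph = {"loss": ["pred", "target"], "pred": ["x", "w"]}
print(topsorted(["loss"], graph))  # ['x', 'w', 'pred', 'target', 'loss']
```

Every node appears after all of its parents, and a node that is reached again while still marked "in progress" indicates a cycle.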
# ================================================================
# Flat vectors
# ================================================================
@@ -597,15 +215,12 @@ def var_shape(x):
"shape function assumes that shape is fully known"
return out
def numel(x):
return intprod(var_shape(x))
def intprod(x):
return int(np.prod(x))
def flatgrad(loss, var_list, clip_norm=None):
grads = tf.gradients(loss, var_list)
if clip_norm is not None:
@@ -615,7 +230,6 @@ def flatgrad(loss, var_list, clip_norm=None):
for (v, grad) in zip(var_list, grads)
])
class SetFromFlat(object):
def __init__(self, var_list, dtype=tf.float32):
assigns = []
@@ -632,123 +246,160 @@ class SetFromFlat(object):
self.op = tf.group(*assigns)
def __call__(self, theta):
get_session().run(self.op, feed_dict={self.theta: theta})
tf.get_default_session().run(self.op, feed_dict={self.theta: theta})
class GetFlat(object):
def __init__(self, var_list):
self.op = tf.concat(axis=0, values=[tf.reshape(v, [numel(v)]) for v in var_list])
def __call__(self):
return get_session().run(self.op)
return tf.get_default_session().run(self.op)
# ================================================================
# Misc
# ================================================================
def fancy_slice_2d(X, inds0, inds1):
"""
like numpy X[inds0, inds1]
XXX this implementation is bad
"""
inds0 = tf.cast(inds0, tf.int64)
inds1 = tf.cast(inds1, tf.int64)
shape = tf.cast(tf.shape(X), tf.int64)
ncols = shape[1]
Xflat = tf.reshape(X, [-1])
return tf.gather(Xflat, inds0 * ncols + inds1)
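The index arithmetic in `fancy_slice_2d` is easier to see on nested lists. This is a minimal pure-Python sketch of the same idea (flatten the matrix, then gather at `row * ncols + col`); it is an illustration only and does not touch TensorFlow.

```python
def fancy_slice_2d(X, inds0, inds1):
    """Return [X[i][j] for paired indices] using a flattened view of X."""
    ncols = len(X[0])
    flat = [v for row in X for v in row]
    # Element (i, j) of a row-major matrix lives at offset i * ncols + j.
    return [flat[i * ncols + j] for i, j in zip(inds0, inds1)]


X = [[10, 11, 12],
     [20, 21, 22]]
print(fancy_slice_2d(X, [0, 1, 1], [2, 0, 1]))  # [12, 20, 21]
```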
# ================================================================
# Scopes
# ================================================================
def scope_vars(scope, trainable_only=False):
"""
Get variables inside a scope
The scope can be specified as a string
Parameters
----------
scope: str or VariableScope
scope in which the variables reside.
trainable_only: bool
whether or not to return only the variables that were marked as trainable.
Returns
-------
vars: [tf.Variable]
list of variables in `scope`.
"""
return tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.GLOBAL_VARIABLES,
scope=scope if isinstance(scope, str) else scope.name
)
def scope_name():
"""Returns the name of current scope as a string, e.g. deepq/q_func"""
return tf.get_variable_scope().name
def absolute_scope_name(relative_scope_name):
"""Appends parent scope name to `relative_scope_name`"""
return scope_name() + "/" + relative_scope_name
def lengths_to_mask(lengths_b, max_length):
"""
Turns a vector of lengths into a boolean mask
Args:
lengths_b: an integer vector of lengths
max_length: maximum length to fill the mask
Returns:
a boolean array of shape (batch_size, max_length)
row[i] consists of True repeated lengths_b[i] times, followed by False
"""
lengths_b = tf.convert_to_tensor(lengths_b)
assert lengths_b.get_shape().ndims == 1
mask_bt = tf.expand_dims(tf.range(max_length), 0) < tf.expand_dims(lengths_b, 1)
return mask_bt
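As a quick check of the broadcasting comparison used in `lengths_to_mask`, the following pure-Python sketch builds the same mask with list comprehensions. It is only meant to show the expected output layout.

```python
def lengths_to_mask(lengths, max_length):
    """Row i holds True for the first lengths[i] positions, then False."""
    return [[t < n for t in range(max_length)] for n in lengths]


mask = lengths_to_mask([1, 3], 4)
print(mask)  # [[True, False, False, False], [True, True, True, False]]
```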
def in_session(f):
@functools.wraps(f)
def newfunc(*args, **kwargs):
with tf.Session():
f(*args, **kwargs)
return newfunc
def flattenallbut0(x):
return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
# =============================================================
# TF placeholders management
# ============================================================
_PLACEHOLDER_CACHE = {} # name -> (placeholder, dtype, shape)
def get_placeholder(name, dtype, shape):
if name in _PLACEHOLDER_CACHE:
out, dtype1, shape1 = _PLACEHOLDER_CACHE[name]
assert dtype1 == dtype and shape1 == shape
return out
else:
out = tf.placeholder(dtype=dtype, shape=shape, name=name)
_PLACEHOLDER_CACHE[name] = (out, dtype, shape)
return out
if out.graph == tf.get_default_graph():
assert dtype1 == dtype and shape1 == shape, \
'Placeholder with name {} has already been registered and has shape {}, different from requested {}'.format(name, shape1, shape)
return out
out = tf.placeholder(dtype=dtype, shape=shape, name=name)
_PLACEHOLDER_CACHE[name] = (out, dtype, shape)
return out
def get_placeholder_cached(name):
return _PLACEHOLDER_CACHE[name][0]
def flattenallbut0(x):
return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
# ================================================================
# Diagnostics
# ================================================================
def display_var_info(vars):
from baselines import logger
count_params = 0
for v in vars:
name = v.name
if "/Adam" in name or "beta1_power" in name or "beta2_power" in name: continue
v_params = np.prod(v.shape.as_list())
count_params += v_params
if "/b:" in name or "/biases" in name: continue # Wx+b, bias is not interesting to look at => count params, but not print
logger.info(" %s%s %i params %s" % (name, " "*(55-len(name)), v_params, str(v.shape)))
logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))
def reset():
global _PLACEHOLDER_CACHE
global VARIABLES
_PLACEHOLDER_CACHE = {}
VARIABLES = {}
tf.reset_default_graph()
def get_available_gpus():
# recipe from here:
# https://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
return [x.name for x in local_device_protos if x.device_type == 'GPU']
# ================================================================
# Saving variables
# ================================================================
def load_state(fname, sess=None):
sess = sess or get_session()
saver = tf.train.Saver()
saver.restore(sess, fname)
def save_state(fname, sess=None):
sess = sess or get_session()
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(sess, fname)
# The methods above and below clearly do the same thing, and in a rather similar way
# TODO: ensure there are no subtle differences and remove one
def save_variables(save_path, variables=None, sess=None):
sess = sess or get_session()
variables = variables or tf.trainable_variables()
ps = sess.run(variables)
save_dict = {v.name: value for v, value in zip(variables, ps)}
os.makedirs(os.path.dirname(save_path), exist_ok=True)
joblib.dump(save_dict, save_path)
def load_variables(load_path, variables=None, sess=None):
sess = sess or get_session()
variables = variables or tf.trainable_variables()
loaded_params = joblib.load(os.path.expanduser(load_path))
restores = []
for v in variables:
restores.append(v.assign(loaded_params[v.name]))
sess.run(restores)
# ================================================================
# Shape adjustment for feeding into tf placeholders
# ================================================================
def adjust_shape(placeholder, data):
'''
adjust shape of the data to the shape of the placeholder if possible.
If shape is incompatible, AssertionError is thrown
Parameters:
placeholder tensorflow input placeholder
data input data to be (potentially) reshaped to be fed into placeholder
Returns:
reshaped data
'''
if not isinstance(data, np.ndarray) and not isinstance(data, list):
return data
if isinstance(data, list):
data = np.array(data)
placeholder_shape = [x or -1 for x in placeholder.shape.as_list()]
assert _check_shape(placeholder_shape, data.shape), \
'Shape of data {} is not compatible with shape of the placeholder {}'.format(data.shape, placeholder_shape)
return np.reshape(data, placeholder_shape)
def _check_shape(placeholder_shape, data_shape):
''' check if two shapes are compatible (i.e. differ only by dimensions of size 1, or by the batch dimension)'''
return True
squeezed_placeholder_shape = _squeeze_shape(placeholder_shape)
squeezed_data_shape = _squeeze_shape(data_shape)
for i, s_data in enumerate(squeezed_data_shape):
s_placeholder = squeezed_placeholder_shape[i]
if s_placeholder != -1 and s_data != s_placeholder:
return False
return True
def _squeeze_shape(shape):
return [x for x in shape if x != 1]
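In the listing, `_check_shape` returns `True` before it reaches the comparison loop. The sketch below shows what that loop would compute if it ran, using plain lists and tuples. The function names are renamed for illustration, and pairing dimensions with `zip` (which stops at the shorter shape rather than indexing past the end) is an assumption of this sketch.

```python
def squeeze_shape(shape):
    """Drop all dimensions of size 1."""
    return [d for d in shape if d != 1]


def shapes_compatible(placeholder_shape, data_shape):
    """Compare squeezed shapes; -1 in the placeholder matches any size."""
    p = squeeze_shape(placeholder_shape)
    d = squeeze_shape(data_shape)
    for s_data, s_placeholder in zip(d, p):
        if s_placeholder != -1 and s_data != s_placeholder:
            return False
    return True


print(shapes_compatible([-1, 84, 84, 1], (32, 84, 84)))  # True
print(shapes_compatible([-1, 4], (5, 3)))                # False
```

A passing check does not guarantee that the subsequent `np.reshape` succeeds, since only the leading squeezed dimensions are compared.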
# Tensorboard interfacing
# ================================================================
def launch_tensorboard_in_background(log_dir):
from tensorboard import main as tb
import threading
tf.flags.FLAGS.logdir = log_dir
t = threading.Thread(target=tb.main, args=([]))
t.start()


@@ -0,0 +1,23 @@
import numpy as np
def tile_images(img_nhwc):
"""
Tile N images into one big PxQ image
(P,Q) are chosen to be as close as possible, and if N
is square, then P=Q.
input: img_nhwc, list or array of images, ndim=4 once turned into array
n = batch index, h = height, w = width, c = channel
returns:
bigim_HWc, ndarray with ndim=3
"""
img_nhwc = np.asarray(img_nhwc)
N, h, w, c = img_nhwc.shape
H = int(np.ceil(np.sqrt(N)))
W = int(np.ceil(float(N)/H))
img_nhwc = np.array(list(img_nhwc) + [img_nhwc[0]*0 for _ in range(N, H*W)])
img_HWhwc = img_nhwc.reshape(H, W, h, w, c)
img_HhWwc = img_HWhwc.transpose(0, 2, 1, 3, 4)
img_Hh_Ww_c = img_HhWwc.reshape(H*h, W*w, c)
return img_Hh_Ww_c
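The grid dimensions chosen by `tile_images` can be checked without any images. This sketch reproduces only the H/W computation and the number of blank padding tiles; `grid_shape` is a hypothetical helper name introduced for illustration.

```python
import math


def grid_shape(n):
    """Return (rows, cols, blank_tiles) for tiling n images."""
    rows = int(math.ceil(math.sqrt(n)))
    cols = int(math.ceil(float(n) / rows))
    return rows, cols, rows * cols - n


for n in (1, 4, 5, 7):
    print(n, grid_shape(n))
# 1 (1, 1, 0)
# 4 (2, 2, 0)
# 5 (3, 2, 1)
# 7 (3, 3, 2)
```

Blank tiles correspond to the zeroed copies of the first image that `tile_images` appends before reshaping.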


@@ -0,0 +1,126 @@
from abc import ABC, abstractmethod
from baselines import logger
class AlreadySteppingError(Exception):
"""
Raised when an asynchronous step is running while
step_async() is called again.
"""
def __init__(self):
msg = 'already running an async step'
Exception.__init__(self, msg)
class NotSteppingError(Exception):
"""
Raised when an asynchronous step is not running but
step_wait() is called.
"""
def __init__(self):
msg = 'not running an async step'
Exception.__init__(self, msg)
class VecEnv(ABC):
"""
An abstract asynchronous, vectorized environment.
"""
def __init__(self, num_envs, observation_space, action_space):
self.num_envs = num_envs
self.observation_space = observation_space
self.action_space = action_space
@abstractmethod
def reset(self):
"""
Reset all the environments and return an array of
observations, or a tuple of observation arrays.
If step_async is still doing work, that work will
be cancelled and step_wait() should not be called
until step_async() is invoked again.
"""
pass
@abstractmethod
def step_async(self, actions):
"""
Tell all the environments to start taking a step
with the given actions.
Call step_wait() to get the results of the step.
You should not call this if a step_async run is
already pending.
"""
pass
@abstractmethod
def step_wait(self):
"""
Wait for the step taken with step_async().
Returns (obs, rews, dones, infos):
- obs: an array of observations, or a tuple of
arrays of observations.
- rews: an array of rewards
- dones: an array of "episode done" booleans
- infos: a sequence of info objects
"""
pass
@abstractmethod
def close(self):
"""
Clean up the environments' resources.
"""
pass
def step(self, actions):
self.step_async(actions)
return self.step_wait()
def render(self, mode='human'):
logger.warn('Render not defined for %s'%self)
@property
def unwrapped(self):
if isinstance(self, VecEnvWrapper):
return self.venv.unwrapped
else:
return self
class VecEnvWrapper(VecEnv):
def __init__(self, venv, observation_space=None, action_space=None):
self.venv = venv
VecEnv.__init__(self,
num_envs=venv.num_envs,
observation_space=observation_space or venv.observation_space,
action_space=action_space or venv.action_space)
def step_async(self, actions):
self.venv.step_async(actions)
@abstractmethod
def reset(self):
pass
@abstractmethod
def step_wait(self):
pass
def close(self):
return self.venv.close()
def render(self):
self.venv.render()
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)
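The `step_async()` / `step_wait()` contract can be illustrated with a toy class that does not depend on gym. `CountingVecEnv` is hypothetical, and raising the two exceptions from these methods is an assumption based on their docstrings; the file above defines the exceptions but does not show where they are raised.

```python
class AlreadySteppingError(Exception):
    pass


class NotSteppingError(Exception):
    pass


class CountingVecEnv:
    """Hypothetical toy: each sub-environment accumulates its actions."""

    def __init__(self, num_envs):
        self.num_envs = num_envs
        self.counters = [0] * num_envs
        self._pending = None  # actions submitted but not yet applied

    def step_async(self, actions):
        if self._pending is not None:
            raise AlreadySteppingError()
        self._pending = list(actions)

    def step_wait(self):
        if self._pending is None:
            raise NotSteppingError()
        actions, self._pending = self._pending, None
        self.counters = [c + a for c, a in zip(self.counters, actions)]
        obs = list(self.counters)
        rews = [float(a) for a in actions]
        dones = [False] * self.num_envs
        infos = [{} for _ in range(self.num_envs)]
        return obs, rews, dones, infos

    def step(self, actions):
        # Same composition as VecEnv.step in the listing.
        self.step_async(actions)
        return self.step_wait()


env = CountingVecEnv(2)
print(env.step([1, 2]))  # ([1, 2], [1.0, 2.0], [False, False], [{}, {}])
```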


@@ -0,0 +1,82 @@
import numpy as np
from gym import spaces
from collections import OrderedDict
from . import VecEnv
class DummyVecEnv(VecEnv):
def __init__(self, env_fns):
self.envs = [fn() for fn in env_fns]
env = self.envs[0]
VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
shapes, dtypes = {}, {}
self.keys = []
obs_space = env.observation_space
if isinstance(obs_space, spaces.Dict):
assert isinstance(obs_space.spaces, OrderedDict)
subspaces = obs_space.spaces
else:
subspaces = {None: obs_space}
for key, box in subspaces.items():
shapes[key] = box.shape
dtypes[key] = box.dtype
self.keys.append(key)
self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
self.buf_dones = np.zeros((self.num_envs,), dtype=bool)
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
self.buf_infos = [{} for _ in range(self.num_envs)]
self.actions = None
def step_async(self, actions):
listify = True
try:
if len(actions) == self.num_envs:
listify = False
except TypeError:
pass
if not listify:
self.actions = actions
else:
assert self.num_envs == 1, "actions {} is either not a list or has a wrong size - cannot match to {} environments".format(actions, self.num_envs)
self.actions = [actions]
def step_wait(self):
for e in range(self.num_envs):
action = self.actions[e]
if isinstance(self.envs[e].action_space, spaces.Discrete):
action = int(action)
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(action)
if self.buf_dones[e]:
obs = self.envs[e].reset()
self._save_obs(e, obs)
return (np.copy(self._obs_from_buf()), np.copy(self.buf_rews), np.copy(self.buf_dones),
self.buf_infos.copy())
def reset(self):
for e in range(self.num_envs):
obs = self.envs[e].reset()
self._save_obs(e, obs)
return self._obs_from_buf()
def close(self):
return
def render(self, mode='human'):
return [e.render(mode=mode) for e in self.envs]
def _save_obs(self, e, obs):
for k in self.keys:
if k is None:
self.buf_obs[k][e] = obs
else:
self.buf_obs[k][e] = obs[k]
def _obs_from_buf(self):
if self.keys==[None]:
return self.buf_obs[None]
else:
return self.buf_obs
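One detail of `step_wait` that is easy to miss is the automatic reset: when an episode ends, the observation returned for that step is the first observation of the next episode, while `done` is still reported as `True`. The sketch below shows this with a hypothetical three-step environment that follows the gym `reset()` / `step()` convention.

```python
class ShortEpisodeEnv:
    """Hypothetical environment whose episodes last `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


def step_with_auto_reset(env, action):
    """Step once; on episode end, replace obs with the reset observation."""
    obs, rew, done, info = env.step(action)
    if done:
        obs = env.reset()
    return obs, rew, done, info


env = ShortEpisodeEnv()
env.reset()
trace = [step_with_auto_reset(env, 0)[::2] for _ in range(4)]
print(trace)  # [(1, False), (2, False), (0, True), (1, False)]
```

The terminal observation itself (3 in this toy) is never returned to the caller, which matters for algorithms that need it.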


@@ -0,0 +1,101 @@
import numpy as np
from multiprocessing import Process, Pipe
from baselines.common.vec_env import VecEnv, CloudpickleWrapper
from baselines.common.tile_images import tile_images
def worker(remote, parent_remote, env_fn_wrapper):
parent_remote.close()
env = env_fn_wrapper.x()
try:
while True:
cmd, data = remote.recv()
if cmd == 'step':
ob, reward, done, info = env.step(data)
if done:
ob = env.reset()
remote.send((ob, reward, done, info))
elif cmd == 'reset':
ob = env.reset()
remote.send(ob)
elif cmd == 'render':
remote.send(env.render(mode='rgb_array'))
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces':
remote.send((env.observation_space, env.action_space))
else:
raise NotImplementedError
except KeyboardInterrupt:
print('SubprocVecEnv worker: got KeyboardInterrupt')
finally:
env.close()
class SubprocVecEnv(VecEnv):
def __init__(self, env_fns, spaces=None):
"""
env_fns: list of functions, each of which creates a gym environment to run in a subprocess
"""
self.waiting = False
self.closed = False
nenvs = len(env_fns)
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for p in self.ps:
p.daemon = True # if the main process crashes, we should not cause things to hang
p.start()
for remote in self.work_remotes:
remote.close()
self.remotes[0].send(('get_spaces', None))
observation_space, action_space = self.remotes[0].recv()
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
def step_async(self, actions):
for remote, action in zip(self.remotes, actions):
remote.send(('step', action))
self.waiting = True
def step_wait(self):
results = [remote.recv() for remote in self.remotes]
self.waiting = False
obs, rews, dones, infos = zip(*results)
return np.stack(obs), np.stack(rews), np.stack(dones), infos
def reset(self):
for remote in self.remotes:
remote.send(('reset', None))
return np.stack([remote.recv() for remote in self.remotes])
def reset_task(self):
for remote in self.remotes:
remote.send(('reset_task', None))
return np.stack([remote.recv() for remote in self.remotes])
def close(self):
if self.closed:
return
if self.waiting:
for remote in self.remotes:
remote.recv()
for remote in self.remotes:
remote.send(('close', None))
for p in self.ps:
p.join()
self.closed = True
def render(self, mode='human'):
for pipe in self.remotes:
pipe.send(('render', None))
imgs = [pipe.recv() for pipe in self.remotes]
bigimg = tile_images(imgs)
if mode == 'human':
import cv2
cv2.imshow('vecenv', bigimg[:,:,::-1])
cv2.waitKey(1)
elif mode == 'rgb_array':
return bigimg
else:
raise NotImplementedError
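The worker loop above is a small command protocol over a `Pipe`: the parent sends `(cmd, data)` tuples and the worker answers each one. The sketch below reproduces that protocol with a trivial counter instead of an environment. It runs the worker in a thread rather than a `Process` purely to keep the example self-contained; that substitution is an assumption of this sketch, not how `SubprocVecEnv` works.

```python
import threading
from multiprocessing import Pipe


def worker(remote):
    """Answer 'step' and 'reset' commands until 'close' arrives."""
    total = 0
    while True:
        cmd, data = remote.recv()
        if cmd == 'step':
            total += data
            remote.send(total)
        elif cmd == 'reset':
            total = 0
            remote.send(total)
        elif cmd == 'close':
            remote.close()
            break
        else:
            raise NotImplementedError(cmd)


def run_demo():
    parent, child = Pipe()
    t = threading.Thread(target=worker, args=(child,), daemon=True)
    t.start()
    replies = []
    for msg in [('reset', None), ('step', 2), ('step', 3)]:
        parent.send(msg)               # like step_async
        replies.append(parent.recv())  # like step_wait
    parent.send(('close', None))
    t.join()
    parent.close()
    return replies


print(run_demo())  # [0, 2, 5]
```

Any command the worker does not recognise raises `NotImplementedError` in the worker, so every command the parent sends needs a matching branch.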


@@ -0,0 +1,38 @@
from baselines.common.vec_env import VecEnvWrapper
import numpy as np
from gym import spaces
class VecFrameStack(VecEnvWrapper):
"""
Vectorized environment wrapper that stacks the last `nstack` observations along the last axis
"""
def __init__(self, venv, nstack):
self.venv = venv
self.nstack = nstack
wos = venv.observation_space # wrapped ob space
low = np.repeat(wos.low, self.nstack, axis=-1)
high = np.repeat(wos.high, self.nstack, axis=-1)
self.stackedobs = np.zeros((venv.num_envs,)+low.shape, low.dtype)
observation_space = spaces.Box(low=low, high=high, dtype=venv.observation_space.dtype)
VecEnvWrapper.__init__(self, venv, observation_space=observation_space)
def step_wait(self):
obs, rews, news, infos = self.venv.step_wait()
self.stackedobs = np.roll(self.stackedobs, shift=-1, axis=-1)
for (i, new) in enumerate(news):
if new:
self.stackedobs[i] = 0
self.stackedobs[..., -obs.shape[-1]:] = obs
return self.stackedobs, rews, news, infos
def reset(self):
"""
Reset all environments
"""
obs = self.venv.reset()
self.stackedobs[...] = 0
self.stackedobs[..., -obs.shape[-1]:] = obs
return self.stackedobs
def close(self):
self.venv.close()
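The roll-and-overwrite update in `step_wait` can be tried on a tiny array. This sketch assumes single-channel observations (last axis of size 1), which is the case where `shift=-1` moves the buffer by exactly one frame; `push_frame` is a hypothetical helper, not part of the wrapper.

```python
import numpy as np


def push_frame(stacked, obs, dones):
    """Shift the frame buffer left by one slot and append the newest obs."""
    stacked = np.roll(stacked, shift=-1, axis=-1)
    for i, done in enumerate(dones):
        if done:
            # A finished episode must not leak old frames into the new one.
            stacked[i] = 0
    stacked[..., -obs.shape[-1]:] = obs
    return stacked


buf = np.zeros((2, 3))  # 2 environments, stack of 3 single-channel frames
buf = push_frame(buf, np.array([[1.0], [10.0]]), [False, False])
buf = push_frame(buf, np.array([[2.0], [20.0]]), [False, True])
print(buf)
# [[ 0.  1.  2.]
#  [ 0.  0. 20.]]
```

The second environment reported `done`, so its history is cleared and only the newest frame remains.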


@@ -0,0 +1,49 @@
from baselines.common.vec_env import VecEnvWrapper
from baselines.common.running_mean_std import RunningMeanStd
import numpy as np
class VecNormalize(VecEnvWrapper):
"""
Vectorized environment wrapper that normalizes observations and rewards
"""
def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
VecEnvWrapper.__init__(self, venv)
self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
self.ret_rms = RunningMeanStd(shape=()) if ret else None
#self.ob_rms = TfRunningMeanStd(shape=self.observation_space.shape, scope='observation_running_mean_std') if ob else None
#self.ret_rms = TfRunningMeanStd(shape=(), scope='return_running_mean_std') if ret else None
self.clipob = clipob
self.cliprew = cliprew
self.ret = np.zeros(self.num_envs)
self.gamma = gamma
self.epsilon = epsilon
def step_wait(self):
"""
Apply sequence of actions to sequence of environments
actions -> (observations, rewards, news)
where 'news' is a boolean vector indicating whether each element is new.
"""
obs, rews, news, infos = self.venv.step_wait()
self.ret = self.ret * self.gamma + rews
obs = self._obfilt(obs)
if self.ret_rms:
self.ret_rms.update(self.ret)
rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
return obs, rews, news, infos
def _obfilt(self, obs):
if self.ob_rms:
self.ob_rms.update(obs)
obs = np.clip((obs - self.ob_rms.mean) / np.sqrt(self.ob_rms.var + self.epsilon), -self.clipob, self.clipob)
return obs
else:
return obs
def reset(self):
"""
Reset all environments
"""
obs = self.venv.reset()
return self._obfilt(obs)
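The reward scaling in `step_wait` divides each reward by the running standard deviation of the discounted return, without subtracting a mean. The scalar sketch below illustrates that for a single environment. `RunningVar` is a hypothetical stand-in for `RunningMeanStd` (whose implementation is not shown here) and uses Welford's online algorithm, so the exact numbers will differ from the real class.

```python
import math


class RunningVar:
    """Hypothetical scalar stand-in for RunningMeanStd (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        return self.m2 / self.count if self.count else 1.0


def normalize_rewards(rewards, gamma=0.99, cliprew=10.0, epsilon=1e-8):
    """Scale rewards by the running std of the discounted return, then clip."""
    ret = 0.0
    stats = RunningVar()
    out = []
    for r in rewards:
        ret = ret * gamma + r
        stats.update(ret)
        scaled = r / math.sqrt(stats.var + epsilon)
        out.append(max(-cliprew, min(cliprew, scaled)))
    return out


print(normalize_rewards([1.0, 2.0, 3.0]))
```

With this stand-in the variance is zero after the first sample, so the first reward is divided by `sqrt(epsilon)` and hits the clip limit. The sign of each reward is preserved because no mean is subtracted.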

5
baselines/ddpg/README.md Normal file

@@ -0,0 +1,5 @@
# DDPG
- Original paper: https://arxiv.org/abs/1509.02971
- Baselines post: https://blog.openai.com/better-exploration-with-parameter-noise/
- `python -m baselines.ddpg.main` runs the algorithm for 1M frames = 10M timesteps on a Mujoco environment. See help (`-h`) for more options.


378
baselines/ddpg/ddpg.py Normal file

@@ -0,0 +1,378 @@
from copy import copy
from functools import reduce
import numpy as np
import tensorflow as tf
import tensorflow.contrib as tc
from baselines import logger
from baselines.common.mpi_adam import MpiAdam
import baselines.common.tf_util as U
from baselines.common.mpi_running_mean_std import RunningMeanStd
from mpi4py import MPI
def normalize(x, stats):
if stats is None:
return x
return (x - stats.mean) / stats.std
def denormalize(x, stats):
if stats is None:
return x
return x * stats.std + stats.mean
def reduce_std(x, axis=None, keepdims=False):
return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
def reduce_var(x, axis=None, keepdims=False):
m = tf.reduce_mean(x, axis=axis, keepdims=True)
devs_squared = tf.square(x - m)
return tf.reduce_mean(devs_squared, axis=axis, keepdims=keepdims)
def get_target_updates(vars, target_vars, tau):
logger.info('setting up target updates ...')
soft_updates = []
init_updates = []
assert len(vars) == len(target_vars)
for var, target_var in zip(vars, target_vars):
logger.info(' {} <- {}'.format(target_var.name, var.name))
init_updates.append(tf.assign(target_var, var))
soft_updates.append(tf.assign(target_var, (1. - tau) * target_var + tau * var))
assert len(init_updates) == len(vars)
assert len(soft_updates) == len(vars)
return tf.group(*init_updates), tf.group(*soft_updates)
def get_perturbed_actor_updates(actor, perturbed_actor, param_noise_stddev):
assert len(actor.vars) == len(perturbed_actor.vars)
assert len(actor.perturbable_vars) == len(perturbed_actor.perturbable_vars)
updates = []
for var, perturbed_var in zip(actor.vars, perturbed_actor.vars):
if var in actor.perturbable_vars:
logger.info(' {} <- {} + noise'.format(perturbed_var.name, var.name))
updates.append(tf.assign(perturbed_var, var + tf.random_normal(tf.shape(var), mean=0., stddev=param_noise_stddev)))
else:
logger.info(' {} <- {}'.format(perturbed_var.name, var.name))
updates.append(tf.assign(perturbed_var, var))
assert len(updates) == len(actor.vars)
return tf.group(*updates)
class DDPG(object):
def __init__(self, actor, critic, memory, observation_shape, action_shape, param_noise=None, action_noise=None,
gamma=0.99, tau=0.001, normalize_returns=False, enable_popart=False, normalize_observations=True,
batch_size=128, observation_range=(-5., 5.), action_range=(-1., 1.), return_range=(-np.inf, np.inf),
adaptive_param_noise=True, adaptive_param_noise_policy_threshold=.1,
critic_l2_reg=0., actor_lr=1e-4, critic_lr=1e-3, clip_norm=None, reward_scale=1.):
# Inputs.
self.obs0 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs0')
self.obs1 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs1')
self.terminals1 = tf.placeholder(tf.float32, shape=(None, 1), name='terminals1')
self.rewards = tf.placeholder(tf.float32, shape=(None, 1), name='rewards')
self.actions = tf.placeholder(tf.float32, shape=(None,) + action_shape, name='actions')
self.critic_target = tf.placeholder(tf.float32, shape=(None, 1), name='critic_target')
self.param_noise_stddev = tf.placeholder(tf.float32, shape=(), name='param_noise_stddev')
# Parameters.
self.gamma = gamma
self.tau = tau
self.memory = memory
self.normalize_observations = normalize_observations
self.normalize_returns = normalize_returns
self.action_noise = action_noise
self.param_noise = param_noise
self.action_range = action_range
self.return_range = return_range
self.observation_range = observation_range
self.critic = critic
self.actor = actor
self.actor_lr = actor_lr
self.critic_lr = critic_lr
self.clip_norm = clip_norm
self.enable_popart = enable_popart
self.reward_scale = reward_scale
self.batch_size = batch_size
self.stats_sample = None
self.critic_l2_reg = critic_l2_reg
# Observation normalization.
if self.normalize_observations:
with tf.variable_scope('obs_rms'):
self.obs_rms = RunningMeanStd(shape=observation_shape)
else:
self.obs_rms = None
normalized_obs0 = tf.clip_by_value(normalize(self.obs0, self.obs_rms),
self.observation_range[0], self.observation_range[1])
normalized_obs1 = tf.clip_by_value(normalize(self.obs1, self.obs_rms),
self.observation_range[0], self.observation_range[1])
# Return normalization.
if self.normalize_returns:
with tf.variable_scope('ret_rms'):
self.ret_rms = RunningMeanStd()
else:
self.ret_rms = None
# Create target networks.
target_actor = copy(actor)
target_actor.name = 'target_actor'
self.target_actor = target_actor
target_critic = copy(critic)
target_critic.name = 'target_critic'
self.target_critic = target_critic
# Create networks and core TF parts that are shared across setup parts.
self.actor_tf = actor(normalized_obs0)
self.normalized_critic_tf = critic(normalized_obs0, self.actions)
self.critic_tf = denormalize(tf.clip_by_value(self.normalized_critic_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
self.normalized_critic_with_actor_tf = critic(normalized_obs0, self.actor_tf, reuse=True)
self.critic_with_actor_tf = denormalize(tf.clip_by_value(self.normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
Q_obs1 = denormalize(target_critic(normalized_obs1, target_actor(normalized_obs1)), self.ret_rms)
self.target_Q = self.rewards + (1. - self.terminals1) * gamma * Q_obs1
# Set up parts.
if self.param_noise is not None:
self.setup_param_noise(normalized_obs0)
self.setup_actor_optimizer()
self.setup_critic_optimizer()
if self.normalize_returns and self.enable_popart:
self.setup_popart()
self.setup_stats()
self.setup_target_network_updates()
def setup_target_network_updates(self):
actor_init_updates, actor_soft_updates = get_target_updates(self.actor.vars, self.target_actor.vars, self.tau)
critic_init_updates, critic_soft_updates = get_target_updates(self.critic.vars, self.target_critic.vars, self.tau)
self.target_init_updates = [actor_init_updates, critic_init_updates]
self.target_soft_updates = [actor_soft_updates, critic_soft_updates]
def setup_param_noise(self, normalized_obs0):
assert self.param_noise is not None
# Configure perturbed actor.
param_noise_actor = copy(self.actor)
param_noise_actor.name = 'param_noise_actor'
self.perturbed_actor_tf = param_noise_actor(normalized_obs0)
logger.info('setting up param noise')
self.perturb_policy_ops = get_perturbed_actor_updates(self.actor, param_noise_actor, self.param_noise_stddev)
# Configure separate copy for stddev adaptation.
adaptive_param_noise_actor = copy(self.actor)
adaptive_param_noise_actor.name = 'adaptive_param_noise_actor'
adaptive_actor_tf = adaptive_param_noise_actor(normalized_obs0)
self.perturb_adaptive_policy_ops = get_perturbed_actor_updates(self.actor, adaptive_param_noise_actor, self.param_noise_stddev)
self.adaptive_policy_distance = tf.sqrt(tf.reduce_mean(tf.square(self.actor_tf - adaptive_actor_tf)))
def setup_actor_optimizer(self):
logger.info('setting up actor optimizer')
self.actor_loss = -tf.reduce_mean(self.critic_with_actor_tf)
actor_shapes = [var.get_shape().as_list() for var in self.actor.trainable_vars]
actor_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in actor_shapes])
logger.info(' actor shapes: {}'.format(actor_shapes))
logger.info(' actor params: {}'.format(actor_nb_params))
self.actor_grads = U.flatgrad(self.actor_loss, self.actor.trainable_vars, clip_norm=self.clip_norm)
self.actor_optimizer = MpiAdam(var_list=self.actor.trainable_vars,
beta1=0.9, beta2=0.999, epsilon=1e-08)
def setup_critic_optimizer(self):
logger.info('setting up critic optimizer')
normalized_critic_target_tf = tf.clip_by_value(normalize(self.critic_target, self.ret_rms), self.return_range[0], self.return_range[1])
self.critic_loss = tf.reduce_mean(tf.square(self.normalized_critic_tf - normalized_critic_target_tf))
if self.critic_l2_reg > 0.:
critic_reg_vars = [var for var in self.critic.trainable_vars if 'kernel' in var.name and 'output' not in var.name]
for var in critic_reg_vars:
logger.info(' regularizing: {}'.format(var.name))
logger.info(' applying l2 regularization with {}'.format(self.critic_l2_reg))
critic_reg = tc.layers.apply_regularization(
tc.layers.l2_regularizer(self.critic_l2_reg),
weights_list=critic_reg_vars
)
self.critic_loss += critic_reg
critic_shapes = [var.get_shape().as_list() for var in self.critic.trainable_vars]
critic_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in critic_shapes])
logger.info(' critic shapes: {}'.format(critic_shapes))
logger.info(' critic params: {}'.format(critic_nb_params))
self.critic_grads = U.flatgrad(self.critic_loss, self.critic.trainable_vars, clip_norm=self.clip_norm)
self.critic_optimizer = MpiAdam(var_list=self.critic.trainable_vars,
beta1=0.9, beta2=0.999, epsilon=1e-08)
def setup_popart(self):
# See https://arxiv.org/pdf/1602.07714.pdf for details.
self.old_std = tf.placeholder(tf.float32, shape=[1], name='old_std')
new_std = self.ret_rms.std
self.old_mean = tf.placeholder(tf.float32, shape=[1], name='old_mean')
new_mean = self.ret_rms.mean
self.renormalize_Q_outputs_op = []
for vs in [self.critic.output_vars, self.target_critic.output_vars]:
assert len(vs) == 2
M, b = vs
assert 'kernel' in M.name
assert 'bias' in b.name
assert M.get_shape()[-1] == 1
assert b.get_shape()[-1] == 1
self.renormalize_Q_outputs_op += [M.assign(M * self.old_std / new_std)]
self.renormalize_Q_outputs_op += [b.assign((b * self.old_std + self.old_mean - new_mean) / new_std)]
def setup_stats(self):
ops = []
names = []
if self.normalize_returns:
ops += [self.ret_rms.mean, self.ret_rms.std]
names += ['ret_rms_mean', 'ret_rms_std']
if self.normalize_observations:
ops += [tf.reduce_mean(self.obs_rms.mean), tf.reduce_mean(self.obs_rms.std)]
names += ['obs_rms_mean', 'obs_rms_std']
ops += [tf.reduce_mean(self.critic_tf)]
names += ['reference_Q_mean']
ops += [reduce_std(self.critic_tf)]
names += ['reference_Q_std']
ops += [tf.reduce_mean(self.critic_with_actor_tf)]
names += ['reference_actor_Q_mean']
ops += [reduce_std(self.critic_with_actor_tf)]
names += ['reference_actor_Q_std']
ops += [tf.reduce_mean(self.actor_tf)]
names += ['reference_action_mean']
ops += [reduce_std(self.actor_tf)]
names += ['reference_action_std']
if self.param_noise:
ops += [tf.reduce_mean(self.perturbed_actor_tf)]
names += ['reference_perturbed_action_mean']
ops += [reduce_std(self.perturbed_actor_tf)]
names += ['reference_perturbed_action_std']
self.stats_ops = ops
self.stats_names = names
def pi(self, obs, apply_noise=True, compute_Q=True):
if self.param_noise is not None and apply_noise:
actor_tf = self.perturbed_actor_tf
else:
actor_tf = self.actor_tf
feed_dict = {self.obs0: [obs]}
if compute_Q:
action, q = self.sess.run([actor_tf, self.critic_with_actor_tf], feed_dict=feed_dict)
else:
action = self.sess.run(actor_tf, feed_dict=feed_dict)
q = None
action = action.flatten()
if self.action_noise is not None and apply_noise:
noise = self.action_noise()
assert noise.shape == action.shape
action += noise
action = np.clip(action, self.action_range[0], self.action_range[1])
return action, q
def store_transition(self, obs0, action, reward, obs1, terminal1):
reward *= self.reward_scale
self.memory.append(obs0, action, reward, obs1, terminal1)
if self.normalize_observations:
self.obs_rms.update(np.array([obs0]))
def train(self):
# Get a batch.
batch = self.memory.sample(batch_size=self.batch_size)
if self.normalize_returns and self.enable_popart:
old_mean, old_std, target_Q = self.sess.run([self.ret_rms.mean, self.ret_rms.std, self.target_Q], feed_dict={
self.obs1: batch['obs1'],
self.rewards: batch['rewards'],
self.terminals1: batch['terminals1'].astype('float32'),
})
self.ret_rms.update(target_Q.flatten())
self.sess.run(self.renormalize_Q_outputs_op, feed_dict={
self.old_std : np.array([old_std]),
self.old_mean : np.array([old_mean]),
})
# Run sanity check. Disabled by default since it slows down things considerably.
# print('running sanity check')
# target_Q_new, new_mean, new_std = self.sess.run([self.target_Q, self.ret_rms.mean, self.ret_rms.std], feed_dict={
# self.obs1: batch['obs1'],
# self.rewards: batch['rewards'],
# self.terminals1: batch['terminals1'].astype('float32'),
# })
# print(target_Q_new, target_Q, new_mean, new_std)
# assert (np.abs(target_Q - target_Q_new) < 1e-3).all()
else:
target_Q = self.sess.run(self.target_Q, feed_dict={
self.obs1: batch['obs1'],
self.rewards: batch['rewards'],
self.terminals1: batch['terminals1'].astype('float32'),
})
# Get all gradients and perform a synced update.
ops = [self.actor_grads, self.actor_loss, self.critic_grads, self.critic_loss]
actor_grads, actor_loss, critic_grads, critic_loss = self.sess.run(ops, feed_dict={
self.obs0: batch['obs0'],
self.actions: batch['actions'],
self.critic_target: target_Q,
})
self.actor_optimizer.update(actor_grads, stepsize=self.actor_lr)
self.critic_optimizer.update(critic_grads, stepsize=self.critic_lr)
return critic_loss, actor_loss
def initialize(self, sess):
self.sess = sess
self.sess.run(tf.global_variables_initializer())
self.actor_optimizer.sync()
self.critic_optimizer.sync()
self.sess.run(self.target_init_updates)
def update_target_net(self):
self.sess.run(self.target_soft_updates)
def get_stats(self):
if self.stats_sample is None:
# Get a sample and keep that fixed for all further computations.
# This allows us to estimate the change in value for the same set of inputs.
self.stats_sample = self.memory.sample(batch_size=self.batch_size)
values = self.sess.run(self.stats_ops, feed_dict={
self.obs0: self.stats_sample['obs0'],
self.actions: self.stats_sample['actions'],
})
names = self.stats_names[:]
assert len(names) == len(values)
stats = dict(zip(names, values))
if self.param_noise is not None:
stats = {**stats, **self.param_noise.get_stats()}
return stats
def adapt_param_noise(self):
if self.param_noise is None:
return 0.
# Perturb a separate copy of the policy to adjust the scale for the next "real" perturbation.
batch = self.memory.sample(batch_size=self.batch_size)
self.sess.run(self.perturb_adaptive_policy_ops, feed_dict={
self.param_noise_stddev: self.param_noise.current_stddev,
})
distance = self.sess.run(self.adaptive_policy_distance, feed_dict={
self.obs0: batch['obs0'],
self.param_noise_stddev: self.param_noise.current_stddev,
})
mean_distance = MPI.COMM_WORLD.allreduce(distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
self.param_noise.adapt(mean_distance)
return mean_distance
def reset(self):
# Reset internal state after an episode is complete.
if self.action_noise is not None:
self.action_noise.reset()
if self.param_noise is not None:
self.sess.run(self.perturb_policy_ops, feed_dict={
self.param_noise_stddev: self.param_noise.current_stddev,
})
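
The Pop-Art step in `setup_popart` rescales the critic's output layer so that the denormalized Q-values stay the same when the return statistics change. A minimal NumPy sketch of that invariance, assuming made-up activations `h` and arbitrary old and new statistics (all values here are illustrative, not taken from a real run):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 64))           # hypothetical activations feeding the output layer
M = rng.normal(size=(64, 1)) * 3e-3    # output kernel
b = np.zeros(1)                        # output bias

old_mean, old_std = 2.0, 4.0           # return statistics before the update
new_mean, new_std = 3.5, 6.0           # return statistics after the update

# Denormalized Q before the statistics change.
q_before = (h @ M + b) * old_std + old_mean

# Same assignments as renormalize_Q_outputs_op.
M_new = M * old_std / new_std
b_new = (b * old_std + old_mean - new_mean) / new_std

# Denormalized Q after the change, using the rescaled layer and the new statistics.
q_after = (h @ M_new + b_new) * new_std + new_mean

print(np.allclose(q_before, q_after))  # True
```

Expanding `q_after` algebraically gives `(h @ M + b) * old_std + old_mean`, which is why the two agree up to floating-point error.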

123
baselines/ddpg/main.py Normal file

@@ -0,0 +1,123 @@
import argparse
import time
import os
import logging
from baselines import logger, bench
from baselines.common.misc_util import (
set_global_seeds,
boolean_flag,
)
import baselines.ddpg.training as training
from baselines.ddpg.models import Actor, Critic
from baselines.ddpg.memory import Memory
from baselines.ddpg.noise import AdaptiveParamNoiseSpec, NormalActionNoise, OrnsteinUhlenbeckActionNoise
import numpy as np
import gym
import tensorflow as tf
from mpi4py import MPI
def run(env_id, seed, noise_type, layer_norm, evaluation, **kwargs):
# Configure things.
rank = MPI.COMM_WORLD.Get_rank()
if rank != 0:
logger.set_level(logger.DISABLED)
# Create envs.
env = gym.make(env_id)
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
if evaluation and rank==0:
eval_env = gym.make(env_id)
eval_env = bench.Monitor(eval_env, os.path.join(logger.get_dir(), 'gym_eval'))
env = bench.Monitor(env, None)
else:
eval_env = None
# Parse noise_type
action_noise = None
param_noise = None
nb_actions = env.action_space.shape[-1]
for current_noise_type in noise_type.split(','):
current_noise_type = current_noise_type.strip()
if current_noise_type == 'none':
pass
elif 'adaptive-param' in current_noise_type:
_, stddev = current_noise_type.split('_')
param_noise = AdaptiveParamNoiseSpec(initial_stddev=float(stddev), desired_action_stddev=float(stddev))
elif 'normal' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = NormalActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
elif 'ou' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = OrnsteinUhlenbeckActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
else:
raise RuntimeError('unknown noise type "{}"'.format(current_noise_type))
# Configure components.
memory = Memory(limit=int(1e6), action_shape=env.action_space.shape, observation_shape=env.observation_space.shape)
critic = Critic(layer_norm=layer_norm)
actor = Actor(nb_actions, layer_norm=layer_norm)
# Seed everything to make things reproducible.
seed = seed + 1000000 * rank
logger.info('rank {}: seed={}, logdir={}'.format(rank, seed, logger.get_dir()))
tf.reset_default_graph()
set_global_seeds(seed)
env.seed(seed)
if eval_env is not None:
eval_env.seed(seed)
# Record the start time on rank 0 so the total runtime can be logged.
if rank == 0:
start_time = time.time()
training.train(env=env, eval_env=eval_env, param_noise=param_noise,
action_noise=action_noise, actor=actor, critic=critic, memory=memory, **kwargs)
env.close()
if eval_env is not None:
eval_env.close()
if rank == 0:
logger.info('total runtime: {}s'.format(time.time() - start_time))
def parse_args():
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env-id', type=str, default='HalfCheetah-v1')
boolean_flag(parser, 'render-eval', default=False)
boolean_flag(parser, 'layer-norm', default=True)
boolean_flag(parser, 'render', default=False)
boolean_flag(parser, 'normalize-returns', default=False)
boolean_flag(parser, 'normalize-observations', default=True)
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--critic-l2-reg', type=float, default=1e-2)
parser.add_argument('--batch-size', type=int, default=64) # per MPI worker
parser.add_argument('--actor-lr', type=float, default=1e-4)
parser.add_argument('--critic-lr', type=float, default=1e-3)
boolean_flag(parser, 'popart', default=False)
parser.add_argument('--gamma', type=float, default=0.99)
parser.add_argument('--reward-scale', type=float, default=1.)
parser.add_argument('--clip-norm', type=float, default=None)
parser.add_argument('--nb-epochs', type=int, default=500) # with default settings, perform 1M steps total
parser.add_argument('--nb-epoch-cycles', type=int, default=20)
parser.add_argument('--nb-train-steps', type=int, default=50) # per epoch cycle and MPI worker
parser.add_argument('--nb-eval-steps', type=int, default=100) # per epoch cycle and MPI worker
parser.add_argument('--nb-rollout-steps', type=int, default=100) # per epoch cycle and MPI worker
parser.add_argument('--noise-type', type=str, default='adaptive-param_0.2') # choices are adaptive-param_xx, ou_xx, normal_xx, none
parser.add_argument('--num-timesteps', type=int, default=None)
boolean_flag(parser, 'evaluation', default=False)
args = parser.parse_args()
# we don't directly specify timesteps for this script, so make sure that if we do specify them
# they agree with the other parameters
if args.num_timesteps is not None:
assert(args.num_timesteps == args.nb_epochs * args.nb_epoch_cycles * args.nb_rollout_steps)
dict_args = vars(args)
del dict_args['num_timesteps']
return dict_args
if __name__ == '__main__':
args = parse_args()
if MPI.COMM_WORLD.Get_rank() == 0:
logger.configure()
# Run actual script.
run(**args)
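
The `--noise-type` flag accepts a comma-separated list of `kind_stddev` items. A standalone sketch of that parsing follows, assuming a hypothetical helper `parse_noise_spec` that is not part of the script. It matches the kind exactly, which is slightly stricter than the substring checks in `run`.

```python
def parse_noise_spec(noise_type):
    """Split a spec such as 'adaptive-param_0.2,ou_0.1' into (kind, stddev) pairs."""
    parsed = []
    for item in noise_type.split(','):
        item = item.strip()
        if item == 'none':
            continue  # 'none' contributes no noise source
        kind, stddev = item.split('_')
        if kind not in ('adaptive-param', 'normal', 'ou'):
            raise ValueError('unknown noise type "{}"'.format(item))
        parsed.append((kind, float(stddev)))
    return parsed


specs = parse_noise_spec('adaptive-param_0.2, ou_0.1')
print(specs)  # [('adaptive-param', 0.2), ('ou', 0.1)]
```

In `run`, an `adaptive-param` item configures parameter-space noise, while `normal` and `ou` items configure action-space noise.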

83
baselines/ddpg/memory.py Normal file

@@ -0,0 +1,83 @@
import numpy as np
class RingBuffer(object):
def __init__(self, maxlen, shape, dtype='float32'):
self.maxlen = maxlen
self.start = 0
self.length = 0
self.data = np.zeros((maxlen,) + shape).astype(dtype)
def __len__(self):
return self.length
def __getitem__(self, idx):
if idx < 0 or idx >= self.length:
raise KeyError()
return self.data[(self.start + idx) % self.maxlen]
def get_batch(self, idxs):
return self.data[(self.start + idxs) % self.maxlen]
def append(self, v):
if self.length < self.maxlen:
# We have space, simply increase the length.
self.length += 1
elif self.length == self.maxlen:
# No space, "remove" the first item.
self.start = (self.start + 1) % self.maxlen
else:
# This should never happen.
raise RuntimeError()
self.data[(self.start + self.length - 1) % self.maxlen] = v
def array_min2d(x):
x = np.array(x)
if x.ndim >= 2:
return x
return x.reshape(-1, 1)
class Memory(object):
def __init__(self, limit, action_shape, observation_shape):
self.limit = limit
self.observations0 = RingBuffer(limit, shape=observation_shape)
self.actions = RingBuffer(limit, shape=action_shape)
self.rewards = RingBuffer(limit, shape=(1,))
self.terminals1 = RingBuffer(limit, shape=(1,))
self.observations1 = RingBuffer(limit, shape=observation_shape)
def sample(self, batch_size):
# Draw indices from [1, nb_entries - 2] so that we always have a preceding element (randint excludes its upper bound).
batch_idxs = np.random.randint(1, self.nb_entries - 1, size=batch_size)
obs0_batch = self.observations0.get_batch(batch_idxs)
obs1_batch = self.observations1.get_batch(batch_idxs)
action_batch = self.actions.get_batch(batch_idxs)
reward_batch = self.rewards.get_batch(batch_idxs)
terminal1_batch = self.terminals1.get_batch(batch_idxs)
result = {
'obs0': array_min2d(obs0_batch),
'obs1': array_min2d(obs1_batch),
'rewards': array_min2d(reward_batch),
'actions': array_min2d(action_batch),
'terminals1': array_min2d(terminal1_batch),
}
return result
def append(self, obs0, action, reward, obs1, terminal1, training=True):
if not training:
return
self.observations0.append(obs0)
self.actions.append(action)
self.rewards.append(reward)
self.observations1.append(obs1)
self.terminals1.append(terminal1)
@property
def nb_entries(self):
return len(self.observations0)
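
The wrap-around indexing in `RingBuffer` is easiest to see on a tiny buffer. This is a reduced sketch, assuming a hypothetical `TinyRing` class that stores scalars only, with the same `start`/`length` bookkeeping as above.

```python
import numpy as np


class TinyRing:
    """Scalar-only ring buffer using the same index arithmetic as RingBuffer."""

    def __init__(self, maxlen):
        self.maxlen, self.start, self.length = maxlen, 0, 0
        self.data = np.zeros(maxlen)

    def append(self, v):
        if self.length < self.maxlen:
            self.length += 1                              # still filling up
        else:
            self.start = (self.start + 1) % self.maxlen   # overwrite the oldest slot
        self.data[(self.start + self.length - 1) % self.maxlen] = v

    def get_batch(self, idxs):
        # Logical index 0 is the oldest element still stored.
        return self.data[(self.start + np.asarray(idxs)) % self.maxlen]


ring = TinyRing(maxlen=3)
for v in [10.0, 11.0, 12.0, 13.0, 14.0]:
    ring.append(v)

oldest_to_newest = ring.get_batch([0, 1, 2])
print(oldest_to_newest)  # [12. 13. 14.]
```

After five appends into three slots, the two oldest values have been overwritten and `start` points at the slot holding `12.0`.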

77
baselines/ddpg/models.py Normal file

@@ -0,0 +1,77 @@
import tensorflow as tf
import tensorflow.contrib as tc
class Model(object):
def __init__(self, name):
self.name = name
@property
def vars(self):
return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
@property
def trainable_vars(self):
return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.name)
@property
def perturbable_vars(self):
return [var for var in self.trainable_vars if 'LayerNorm' not in var.name]
class Actor(Model):
def __init__(self, nb_actions, name='actor', layer_norm=True):
super(Actor, self).__init__(name=name)
self.nb_actions = nb_actions
self.layer_norm = layer_norm
def __call__(self, obs, reuse=False):
with tf.variable_scope(self.name) as scope:
if reuse:
scope.reuse_variables()
x = obs
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
x = tf.layers.dense(x, self.nb_actions, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
x = tf.nn.tanh(x)
return x
class Critic(Model):
def __init__(self, name='critic', layer_norm=True):
super(Critic, self).__init__(name=name)
self.layer_norm = layer_norm
def __call__(self, obs, action, reuse=False):
with tf.variable_scope(self.name) as scope:
if reuse:
scope.reuse_variables()
x = obs
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
x = tf.concat([x, action], axis=-1)
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
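# Name the final layer 'output' so that output_vars and the L2-regularization filter can find it.
x = tf.layers.dense(x, 1, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3), name='output')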
return x
@property
def output_vars(self):
output_vars = [var for var in self.trainable_vars if 'output' in var.name]
return output_vars

67
baselines/ddpg/noise.py Normal file

@@ -0,0 +1,67 @@
import numpy as np
class AdaptiveParamNoiseSpec(object):
def __init__(self, initial_stddev=0.1, desired_action_stddev=0.1, adoption_coefficient=1.01):
self.initial_stddev = initial_stddev
self.desired_action_stddev = desired_action_stddev
self.adoption_coefficient = adoption_coefficient
self.current_stddev = initial_stddev
def adapt(self, distance):
if distance > self.desired_action_stddev:
# Decrease stddev.
self.current_stddev /= self.adoption_coefficient
else:
# Increase stddev.
self.current_stddev *= self.adoption_coefficient
def get_stats(self):
stats = {
'param_noise_stddev': self.current_stddev,
}
return stats
def __repr__(self):
fmt = 'AdaptiveParamNoiseSpec(initial_stddev={}, desired_action_stddev={}, adoption_coefficient={})'
return fmt.format(self.initial_stddev, self.desired_action_stddev, self.adoption_coefficient)
class ActionNoise(object):
def reset(self):
pass
class NormalActionNoise(ActionNoise):
def __init__(self, mu, sigma):
self.mu = mu
self.sigma = sigma
def __call__(self):
return np.random.normal(self.mu, self.sigma)
def __repr__(self):
return 'NormalActionNoise(mu={}, sigma={})'.format(self.mu, self.sigma)
# Based on http://math.stackexchange.com/questions/1287634/implementing-ornstein-uhlenbeck-in-matlab
class OrnsteinUhlenbeckActionNoise(ActionNoise):
def __init__(self, mu, sigma, theta=.15, dt=1e-2, x0=None):
self.theta = theta
self.mu = mu
self.sigma = sigma
self.dt = dt
self.x0 = x0
self.reset()
def __call__(self):
x = self.x_prev + self.theta * (self.mu - self.x_prev) * self.dt + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape)
self.x_prev = x
return x
def reset(self):
self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)
def __repr__(self):
return 'OrnsteinUhlenbeckActionNoise(mu={}, sigma={})'.format(self.mu, self.sigma)
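
Two rules in this file can be written as pure functions: the Ornstein-Uhlenbeck step and the multiplicative adaptation of the parameter-noise scale. A short sketch follows; the function names `ou_step` and `adapt_stddev` are hypothetical and exist only for this illustration.

```python
import numpy as np


def ou_step(x_prev, mu, sigma, theta=0.15, dt=1e-2, rng=None):
    """One Euler step of the OU process, mirroring OrnsteinUhlenbeckActionNoise.__call__."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(size=np.shape(mu))
    return x_prev + theta * (mu - x_prev) * dt + sigma * np.sqrt(dt) * noise


def adapt_stddev(current, distance, desired, coefficient=1.01):
    """Shrink the stddev when the perturbed policy drifts too far, otherwise grow it."""
    if distance > desired:
        return current / coefficient
    return current * coefficient


# With sigma = 0 the step is deterministic: x moves a fraction theta * dt toward mu.
mu = np.ones(2)
x = ou_step(np.zeros(2), mu, sigma=np.zeros(2))
print(x)  # [0.0015 0.0015]

shrunk = adapt_stddev(0.2, distance=0.3, desired=0.2)  # distance too large, so divide
grown = adapt_stddev(0.2, distance=0.1, desired=0.2)   # distance small enough, so multiply
```

The mean-reverting term `theta * (mu - x_prev) * dt` is what makes consecutive OU samples correlated, unlike the independent draws of `NormalActionNoise`.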

191
baselines/ddpg/training.py Normal file

@@ -0,0 +1,191 @@
import os
import time
from collections import deque
import pickle
from baselines.ddpg.ddpg import DDPG
import baselines.common.tf_util as U
from baselines import logger
import numpy as np
import tensorflow as tf
from mpi4py import MPI
def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, param_noise, actor, critic,
normalize_returns, normalize_observations, critic_l2_reg, actor_lr, critic_lr, action_noise,
popart, gamma, clip_norm, nb_train_steps, nb_rollout_steps, nb_eval_steps, batch_size, memory,
tau=0.01, eval_env=None, param_noise_adaption_interval=50):
rank = MPI.COMM_WORLD.Get_rank()
assert (np.abs(env.action_space.low) == env.action_space.high).all() # we assume symmetric actions.
max_action = env.action_space.high
logger.info('scaling actions by {} before executing in env'.format(max_action))
agent = DDPG(actor, critic, memory, env.observation_space.shape, env.action_space.shape,
gamma=gamma, tau=tau, normalize_returns=normalize_returns, normalize_observations=normalize_observations,
batch_size=batch_size, action_noise=action_noise, param_noise=param_noise, critic_l2_reg=critic_l2_reg,
actor_lr=actor_lr, critic_lr=critic_lr, enable_popart=popart, clip_norm=clip_norm,
reward_scale=reward_scale)
logger.info('Using agent with the following configuration:')
logger.info(str(agent.__dict__.items()))
# Set up logging stuff only for a single worker.
if rank == 0:
saver = tf.train.Saver()
else:
saver = None
step = 0
episode = 0
eval_episode_rewards_history = deque(maxlen=100)
episode_rewards_history = deque(maxlen=100)
with U.single_threaded_session() as sess:
# Prepare everything.
agent.initialize(sess)
sess.graph.finalize()
agent.reset()
obs = env.reset()
if eval_env is not None:
eval_obs = eval_env.reset()
done = False
episode_reward = 0.
episode_step = 0
episodes = 0
t = 0
epoch = 0
start_time = time.time()
epoch_episode_rewards = []
epoch_episode_steps = []
epoch_episode_eval_rewards = []
epoch_episode_eval_steps = []
epoch_start_time = time.time()
epoch_actions = []
epoch_qs = []
epoch_episodes = 0
for epoch in range(nb_epochs):
for cycle in range(nb_epoch_cycles):
# Perform rollouts.
for t_rollout in range(nb_rollout_steps):
# Predict next action.
action, q = agent.pi(obs, apply_noise=True, compute_Q=True)
assert action.shape == env.action_space.shape
# Execute next action.
if rank == 0 and render:
env.render()
assert max_action.shape == action.shape
new_obs, r, done, info = env.step(max_action * action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
t += 1
if rank == 0 and render:
env.render()
episode_reward += r
episode_step += 1
# Book-keeping.
epoch_actions.append(action)
epoch_qs.append(q)
agent.store_transition(obs, action, r, new_obs, done)
obs = new_obs
if done:
# Episode done.
epoch_episode_rewards.append(episode_reward)
episode_rewards_history.append(episode_reward)
epoch_episode_steps.append(episode_step)
episode_reward = 0.
episode_step = 0
epoch_episodes += 1
episodes += 1
agent.reset()
obs = env.reset()
# Train.
epoch_actor_losses = []
epoch_critic_losses = []
epoch_adaptive_distances = []
for t_train in range(nb_train_steps):
# Adapt param noise, if necessary.
if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
distance = agent.adapt_param_noise()
epoch_adaptive_distances.append(distance)
cl, al = agent.train()
epoch_critic_losses.append(cl)
epoch_actor_losses.append(al)
agent.update_target_net()
# Evaluate.
eval_episode_rewards = []
eval_qs = []
if eval_env is not None:
eval_episode_reward = 0.
for t_rollout in range(nb_eval_steps):
eval_action, eval_q = agent.pi(eval_obs, apply_noise=False, compute_Q=True)
eval_obs, eval_r, eval_done, eval_info = eval_env.step(max_action * eval_action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
if render_eval:
eval_env.render()
eval_episode_reward += eval_r
eval_qs.append(eval_q)
if eval_done:
eval_obs = eval_env.reset()
eval_episode_rewards.append(eval_episode_reward)
eval_episode_rewards_history.append(eval_episode_reward)
eval_episode_reward = 0.
mpi_size = MPI.COMM_WORLD.Get_size()
# Log stats.
# XXX shouldn't call np.mean on variable length lists
duration = time.time() - start_time
stats = agent.get_stats()
combined_stats = stats.copy()
combined_stats['rollout/return'] = np.mean(epoch_episode_rewards)
combined_stats['rollout/return_history'] = np.mean(episode_rewards_history)
combined_stats['rollout/episode_steps'] = np.mean(epoch_episode_steps)
combined_stats['rollout/actions_mean'] = np.mean(epoch_actions)
combined_stats['rollout/Q_mean'] = np.mean(epoch_qs)
combined_stats['train/loss_actor'] = np.mean(epoch_actor_losses)
combined_stats['train/loss_critic'] = np.mean(epoch_critic_losses)
combined_stats['train/param_noise_distance'] = np.mean(epoch_adaptive_distances)
combined_stats['total/duration'] = duration
combined_stats['total/steps_per_second'] = float(t) / float(duration)
combined_stats['total/episodes'] = episodes
combined_stats['rollout/episodes'] = epoch_episodes
combined_stats['rollout/actions_std'] = np.std(epoch_actions)
# Evaluation statistics.
if eval_env is not None:
combined_stats['eval/return'] = np.mean(eval_episode_rewards)
combined_stats['eval/return_history'] = np.mean(eval_episode_rewards_history)
combined_stats['eval/Q'] = np.mean(eval_qs)
combined_stats['eval/episodes'] = len(eval_episode_rewards)
def as_scalar(x):
if isinstance(x, np.ndarray):
assert x.size == 1
return x[0]
elif np.isscalar(x):
return x
else:
raise ValueError('expected scalar, got %s'%x)
combined_stats_sums = MPI.COMM_WORLD.allreduce(np.array([as_scalar(x) for x in combined_stats.values()]))
combined_stats = {k : v / mpi_size for (k,v) in zip(combined_stats.keys(), combined_stats_sums)}
# Total statistics.
combined_stats['total/epochs'] = epoch + 1
combined_stats['total/steps'] = t
for key in sorted(combined_stats.keys()):
logger.record_tabular(key, combined_stats[key])
logger.dump_tabular()
logger.info('')
logdir = logger.get_dir()
if rank == 0 and logdir:
if hasattr(env, 'get_state'):
with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:
pickle.dump(env.get_state(), f)
if eval_env and hasattr(eval_env, 'get_state'):
with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
pickle.dump(eval_env.get_state(), f)
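
Each training step in the loop above ends with `agent.update_target_net()`, which applies the soft update `target <- (1 - tau) * target + tau * var`. A NumPy sketch of how quickly the target tracks a fixed source with this function's default `tau=0.01` (the vectors are illustrative, not real network weights):

```python
import numpy as np


def soft_update(target, source, tau):
    # Same rule as the soft_updates ops built by get_target_updates.
    return (1.0 - tau) * target + tau * source


source = np.array([1.0, -2.0])   # stand-in for the online network's weights
target = np.zeros(2)             # stand-in for the target network's weights
tau = 0.01
n_steps = 100

for _ in range(n_steps):
    target = soft_update(target, source, tau)

# Closed form for a fixed source and a zero start: source * (1 - (1 - tau) ** n).
expected = source * (1.0 - (1.0 - tau) ** n_steps)
print(np.allclose(target, expected))  # True
```

After 100 updates the target has covered roughly 63% of the gap, so a smaller `tau` gives a slower, more stable target.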


@@ -21,8 +21,8 @@ Be sure to check out the source code of [both](experiments/train_cartpole.py) [f
Check out our simple agent trained with the one-stop-shop `deepq.learn` function.
- `baselines/deepq/experiments/train_cartpole.py` - train a Cartpole agent.
- `baselines/deepq/experiments/train_pong.py` - train a Pong agent using convolutional neural networks.
- [baselines/deepq/experiments/train_cartpole.py](experiments/train_cartpole.py) - train a Cartpole agent.
- [baselines/deepq/experiments/train_pong.py](experiments/train_pong.py) - train a Pong agent using convolutional neural networks.
In particular, notice that once `deepq.learn` finishes training, it returns an `act` function that can be used to select actions in the environment. Once trained, you can easily save it and load it at a later time. For both of the files listed above there are complementary files, `enjoy_cartpole.py` and `enjoy_pong.py` respectively, that load and visualize the learned policy.
@@ -31,8 +31,8 @@ In particular notice that once `deepq.learn` finishes training it returns `act`
##### Check out the examples
- `baselines/deepq/experiments/custom_cartpole.py` - Cartpole training with more fine grained control over the internals of DQN algorithm.
- `baselines/deepq/experiments/atari/train.py` - more robust setup for training at scale.
- [baselines/deepq/experiments/custom_cartpole.py](experiments/custom_cartpole.py) - Cartpole training with more fine grained control over the internals of DQN algorithm.
- [baselines/deepq/experiments/atari/train.py](experiments/atari/train.py) - more robust setup for training at scale.
##### Download a pretrained Atari agent

View File

@@ -1,5 +1,8 @@
from baselines.deepq import models # noqa
from baselines.deepq.build_graph import build_act, build_train # noqa
from baselines.deepq.simple import learn, load # noqa
from baselines.deepq.deepq import learn, load_act # noqa
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer # noqa
def wrap_atari_dqn(env):
from baselines.common.atari_wrappers import wrap_deepmind
return wrap_deepmind(env, frame_stack=True, scale=True)

View File

@@ -22,6 +22,32 @@ The functions in this file can are used to create the following functions:
every element of the batch.
======= act (in case of parameter noise) ========
Function to choose an action given an observation
Parameters
----------
observation: object
Observation that can be fed into the output of make_obs_ph
stochastic: bool
if set to False all the actions are always deterministic (default False)
update_eps_ph: float
update epsilon to a new value; if negative, no update happens
(default: no update)
reset_ph: bool
reset the perturbed policy by sampling a new perturbation
update_param_noise_threshold_ph: float
the desired threshold for the difference between non-perturbed and perturbed policy
update_param_noise_scale_ph: bool
whether or not to update the scale of the noise for the next time it is re-perturbed
Returns
-------
Tensor of dtype tf.int64 and shape (BATCH_SIZE,) with an action to be performed for
every element of the batch.
======= train =======
Function that takes a transition (s,a,r,s') and optimizes Bellman equation's error:
@@ -71,6 +97,52 @@ import tensorflow as tf
import baselines.common.tf_util as U
def scope_vars(scope, trainable_only=False):
"""
Get variables inside a scope
The scope can be specified as a string
Parameters
----------
scope: str or VariableScope
scope in which the variables reside.
trainable_only: bool
whether or not to return only the variables that were marked as trainable.
Returns
-------
vars: [tf.Variable]
list of variables in `scope`.
"""
return tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.GLOBAL_VARIABLES,
scope=scope if isinstance(scope, str) else scope.name
)
def scope_name():
"""Returns the name of current scope as a string, e.g. deepq/q_func"""
return tf.get_variable_scope().name
def absolute_scope_name(relative_scope_name):
"""Appends parent scope name to `relative_scope_name`"""
return scope_name() + "/" + relative_scope_name
def default_param_noise_filter(var):
if var not in tf.trainable_variables():
# We never perturb non-trainable vars.
return False
if "fully_connected" in var.name:
# We perturb fully-connected layers.
return True
# The remaining layers are likely conv or layer norm layers, which we do not wish to
# perturb (in the former case because they only extract features, in the latter case because
# we use them for normalization purposes). If you change your network, you will likely want
# to re-consider which layers to perturb and which to keep untouched.
return False
def build_act(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None):
"""Creates the act function:
@@ -102,7 +174,7 @@ def build_act(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None):
See the top of the file for details.
"""
with tf.variable_scope(scope, reuse=reuse):
observations_ph = U.ensure_tf_input(make_obs_ph("observation"))
observations_ph = make_obs_ph("observation")
stochastic_ph = tf.placeholder(tf.bool, (), name="stochastic")
update_eps_ph = tf.placeholder(tf.float32, (), name="update_eps")
@@ -118,15 +190,132 @@ def build_act(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None):
output_actions = tf.cond(stochastic_ph, lambda: stochastic_actions, lambda: deterministic_actions)
update_eps_expr = eps.assign(tf.cond(update_eps_ph >= 0, lambda: update_eps_ph, lambda: eps))
act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph],
_act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph],
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True},
updates=[update_eps_expr])
def act(ob, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps)
return act
def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=None, gamma=1.0, double_q=True, scope="deepq", reuse=None):
def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None, param_noise_filter_func=None):
"""Creates the act function with support for parameter space noise exploration (https://arxiv.org/abs/1706.01905):
Parameters
----------
make_obs_ph: str -> tf.placeholder or TfInput
a function that takes a name and creates a placeholder of input with that name
q_func: (tf.Variable, int, str, bool) -> tf.Variable
the model that takes the following inputs:
observation_in: object
the output of observation placeholder
num_actions: int
number of actions
scope: str
reuse: bool
should be passed to outer variable scope
and returns a tensor of shape (batch_size, num_actions) with values of every action.
num_actions: int
number of actions.
scope: str or VariableScope
optional scope for variable_scope.
reuse: bool or None
whether or not the variables should be reused. To be able to reuse the scope must be given.
param_noise_filter_func: tf.Variable -> bool
function that decides whether or not a variable should be perturbed. Only applicable
if param_noise is True. If set to None, default_param_noise_filter is used by default.
Returns
-------
act: (tf.Variable, bool, float, bool, float, bool) -> tf.Variable
function to select an action given an observation.
See the top of the file for details.
"""
if param_noise_filter_func is None:
param_noise_filter_func = default_param_noise_filter
with tf.variable_scope(scope, reuse=reuse):
observations_ph = make_obs_ph("observation")
stochastic_ph = tf.placeholder(tf.bool, (), name="stochastic")
update_eps_ph = tf.placeholder(tf.float32, (), name="update_eps")
update_param_noise_threshold_ph = tf.placeholder(tf.float32, (), name="update_param_noise_threshold")
update_param_noise_scale_ph = tf.placeholder(tf.bool, (), name="update_param_noise_scale")
reset_ph = tf.placeholder(tf.bool, (), name="reset")
eps = tf.get_variable("eps", (), initializer=tf.constant_initializer(0))
param_noise_scale = tf.get_variable("param_noise_scale", (), initializer=tf.constant_initializer(0.01), trainable=False)
param_noise_threshold = tf.get_variable("param_noise_threshold", (), initializer=tf.constant_initializer(0.05), trainable=False)
# Unmodified Q.
q_values = q_func(observations_ph.get(), num_actions, scope="q_func")
# Perturbable Q used for the actual rollout.
q_values_perturbed = q_func(observations_ph.get(), num_actions, scope="perturbed_q_func")
# We have to wrap this code into a function due to the way tf.cond() works. See
# https://stackoverflow.com/questions/37063952/confused-by-the-behavior-of-tf-cond for
# a more detailed discussion.
def perturb_vars(original_scope, perturbed_scope):
all_vars = scope_vars(absolute_scope_name(original_scope))
all_perturbed_vars = scope_vars(absolute_scope_name(perturbed_scope))
assert len(all_vars) == len(all_perturbed_vars)
perturb_ops = []
for var, perturbed_var in zip(all_vars, all_perturbed_vars):
if param_noise_filter_func(perturbed_var):
# Perturb this variable.
op = tf.assign(perturbed_var, var + tf.random_normal(shape=tf.shape(var), mean=0., stddev=param_noise_scale))
else:
# Do not perturb, just assign.
op = tf.assign(perturbed_var, var)
perturb_ops.append(op)
assert len(perturb_ops) == len(all_vars)
return tf.group(*perturb_ops)
# Set up functionality to re-compute `param_noise_scale`. This perturbs yet another copy
# of the network and measures the effect of that perturbation in action space. If the perturbation
# is too big, reduce scale of perturbation, otherwise increase.
q_values_adaptive = q_func(observations_ph.get(), num_actions, scope="adaptive_q_func")
perturb_for_adaption = perturb_vars(original_scope="q_func", perturbed_scope="adaptive_q_func")
kl = tf.reduce_sum(tf.nn.softmax(q_values) * (tf.log(tf.nn.softmax(q_values)) - tf.log(tf.nn.softmax(q_values_adaptive))), axis=-1)
mean_kl = tf.reduce_mean(kl)
def update_scale():
with tf.control_dependencies([perturb_for_adaption]):
update_scale_expr = tf.cond(mean_kl < param_noise_threshold,
lambda: param_noise_scale.assign(param_noise_scale * 1.01),
lambda: param_noise_scale.assign(param_noise_scale / 1.01),
)
return update_scale_expr
# Functionality to update the threshold for parameter space noise.
update_param_noise_threshold_expr = param_noise_threshold.assign(tf.cond(update_param_noise_threshold_ph >= 0,
lambda: update_param_noise_threshold_ph, lambda: param_noise_threshold))
# Put everything together.
deterministic_actions = tf.argmax(q_values_perturbed, axis=1)
batch_size = tf.shape(observations_ph.get())[0]
random_actions = tf.random_uniform(tf.stack([batch_size]), minval=0, maxval=num_actions, dtype=tf.int64)
chose_random = tf.random_uniform(tf.stack([batch_size]), minval=0, maxval=1, dtype=tf.float32) < eps
stochastic_actions = tf.where(chose_random, random_actions, deterministic_actions)
output_actions = tf.cond(stochastic_ph, lambda: stochastic_actions, lambda: deterministic_actions)
update_eps_expr = eps.assign(tf.cond(update_eps_ph >= 0, lambda: update_eps_ph, lambda: eps))
updates = [
update_eps_expr,
tf.cond(reset_ph, lambda: perturb_vars(original_scope="q_func", perturbed_scope="perturbed_q_func"), lambda: tf.group(*[])),
tf.cond(update_param_noise_scale_ph, lambda: update_scale(), lambda: tf.Variable(0., trainable=False)),
update_param_noise_threshold_expr,
]
_act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph, reset_ph, update_param_noise_threshold_ph, update_param_noise_scale_ph],
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True, reset_ph: False, update_param_noise_threshold_ph: False, update_param_noise_scale_ph: False},
updates=updates)
def act(ob, reset, update_param_noise_threshold, update_param_noise_scale, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps, reset, update_param_noise_threshold, update_param_noise_scale)
return act
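The adaptive scaling in `build_act_with_param_noise` compares the mean KL divergence between the softmax of the unperturbed and perturbed Q-values against a threshold, then multiplies or divides the noise scale by 1.01. The NumPy sketch below illustrates that rule outside the TensorFlow graph. The helper names `softmax`, `mean_kl`, and `update_scale`, and the sample Q-values, are assumptions made for illustration.

```python
import numpy as np


def softmax(q):
    """Numerically stable softmax over the last axis."""
    z = q - np.max(q, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def mean_kl(q_values, q_values_perturbed):
    """Mean over the batch of KL(softmax(q) || softmax(q_perturbed))."""
    p = softmax(q_values)
    p_pert = softmax(q_values_perturbed)
    kl = np.sum(p * (np.log(p) - np.log(p_pert)), axis=-1)
    return float(np.mean(kl))


def update_scale(scale, kl, threshold, factor=1.01):
    """Grow the noise scale while the policy barely changes, shrink it otherwise."""
    return scale * factor if kl < threshold else scale / factor


q = np.array([[1.0, 2.0, 3.0]])
same = mean_kl(q, q)          # identical networks: no divergence
far = mean_kl(q, q[:, ::-1])  # action preferences reversed: large divergence
grown = update_scale(0.01, same, threshold=0.05)
shrunk = update_scale(0.01, far, threshold=0.05)
```

Because the factor is only 1.01, the scale drifts slowly, so a single noisy batch cannot swing the amount of exploration much.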
def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=None, gamma=1.0,
double_q=True, scope="deepq", reuse=None, param_noise=False, param_noise_filter_func=None):
"""Creates the train function:
Parameters
@@ -160,6 +349,11 @@ def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=
optional scope for variable_scope.
reuse: bool or None
whether or not the variables should be reused. To be able to reuse the scope must be given.
param_noise: bool
whether or not to use parameter space noise (https://arxiv.org/abs/1706.01905)
param_noise_filter_func: tf.Variable -> bool
function that decides whether or not a variable should be perturbed. Only applicable
if param_noise is True. If set to None, default_param_noise_filter is used by default.
Returns
-------
@@ -175,24 +369,28 @@ def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=
debug: {str: function}
a bunch of functions to print debug data like q_values.
"""
act_f = build_act(make_obs_ph, q_func, num_actions, scope=scope, reuse=reuse)
if param_noise:
act_f = build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope=scope, reuse=reuse,
param_noise_filter_func=param_noise_filter_func)
else:
act_f = build_act(make_obs_ph, q_func, num_actions, scope=scope, reuse=reuse)
with tf.variable_scope(scope, reuse=reuse):
# set up placeholders
obs_t_input = U.ensure_tf_input(make_obs_ph("obs_t"))
obs_t_input = make_obs_ph("obs_t")
act_t_ph = tf.placeholder(tf.int32, [None], name="action")
rew_t_ph = tf.placeholder(tf.float32, [None], name="reward")
obs_tp1_input = U.ensure_tf_input(make_obs_ph("obs_tp1"))
obs_tp1_input = make_obs_ph("obs_tp1")
done_mask_ph = tf.placeholder(tf.float32, [None], name="done")
importance_weights_ph = tf.placeholder(tf.float32, [None], name="weight")
# q network evaluation
q_t = q_func(obs_t_input.get(), num_actions, scope="q_func", reuse=True) # reuse parameters from act
q_func_vars = U.scope_vars(U.absolute_scope_name("q_func"))
q_func_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=tf.get_variable_scope().name + "/q_func")
# target q network evaluation
q_tp1 = q_func(obs_tp1_input.get(), num_actions, scope="target_q_func")
target_q_func_vars = U.scope_vars(U.absolute_scope_name("target_q_func"))
target_q_func_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=tf.get_variable_scope().name + "/target_q_func")
# q scores for actions which we know were selected in the given state.
q_t_selected = tf.reduce_sum(q_t * tf.one_hot(act_t_ph, num_actions), 1)
@@ -200,7 +398,7 @@ def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=
# compute estimate of best possible value starting from state at t + 1
if double_q:
q_tp1_using_online_net = q_func(obs_tp1_input.get(), num_actions, scope="q_func", reuse=True)
q_tp1_best_using_online_net = tf.arg_max(q_tp1_using_online_net, 1)
q_tp1_best_using_online_net = tf.argmax(q_tp1_using_online_net, 1)
q_tp1_best = tf.reduce_sum(q_tp1 * tf.one_hot(q_tp1_best_using_online_net, num_actions), 1)
else:
q_tp1_best = tf.reduce_max(q_tp1, 1)
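With `double_q` enabled, the online network picks the best next action and the target network evaluates it; otherwise the target network does both. A small NumPy sketch of the two estimates follows. It uses integer indexing where the diff uses `tf.one_hot` with `tf.reduce_sum`, which selects the same entries. The function name and the sample arrays are illustrative only.

```python
import numpy as np


def q_tp1_best(q_tp1_online, q_tp1_target, double_q=True):
    """Value estimate for the next state, with or without double Q-learning."""
    if double_q:
        # The online network selects the action, the target network scores it.
        best_actions = np.argmax(q_tp1_online, axis=1)
        return q_tp1_target[np.arange(len(best_actions)), best_actions]
    # Plain DQN: the target network both selects and scores.
    return np.max(q_tp1_target, axis=1)


online = np.array([[1.0, 5.0], [2.0, 0.0]])
target = np.array([[4.0, 3.0], [1.0, 9.0]])
double_estimate = q_tp1_best(online, target, double_q=True)
plain_estimate = q_tp1_best(online, target, double_q=False)
```

In this made-up batch the double-Q estimate is lower in both rows, which shows how decoupling selection from evaluation can reduce overestimation.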
@@ -213,12 +411,14 @@ def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=
td_error = q_t_selected - tf.stop_gradient(q_t_selected_target)
errors = U.huber_loss(td_error)
weighted_error = tf.reduce_mean(importance_weights_ph * errors)
# compute optimization op (potentially with gradient clipping)
if grad_norm_clipping is not None:
optimize_expr = U.minimize_and_clip(optimizer,
weighted_error,
var_list=q_func_vars,
clip_val=grad_norm_clipping)
gradients = optimizer.compute_gradients(weighted_error, var_list=q_func_vars)
for i, (grad, var) in enumerate(gradients):
if grad is not None:
gradients[i] = (tf.clip_by_norm(grad, grad_norm_clipping), var)
optimize_expr = optimizer.apply_gradients(gradients)
else:
optimize_expr = optimizer.minimize(weighted_error, var_list=q_func_vars)
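The inlined clipping loop applies `tf.clip_by_norm` to each gradient tensor separately and skips variables whose gradient is `None`. Below is a NumPy sketch of that per-tensor behavior, assuming the usual definition `t * clip_norm / max(l2norm(t), clip_norm)`. The `(gradient, name)` pairs are made up for the example.

```python
import numpy as np


def clip_by_norm(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = np.sqrt(np.sum(np.square(grad)))
    return grad * clip_norm / max(norm, clip_norm)


def clip_gradients(grads_and_vars, clip_norm):
    """Clip each gradient independently and pass None gradients through untouched."""
    return [(g if g is None else clip_by_norm(g, clip_norm), v)
            for g, v in grads_and_vars]


grads_and_vars = [
    (np.array([3.0, 4.0]), 'w'),   # norm 5.0, gets scaled down
    (np.array([0.3, 0.4]), 'b'),   # norm 0.5, left unchanged
    (None, 'unused'),              # no gradient for this variable
]
clipped = clip_gradients(grads_and_vars, clip_norm=1.0)
```

Note that this differs from global-norm clipping: each tensor is limited on its own, so the direction of the combined update can change.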

View File

@@ -1,29 +1,37 @@
import numpy as np
import os
import dill
import tempfile
import tensorflow as tf
import zipfile
import cloudpickle
import numpy as np
import baselines.common.tf_util as U
from baselines.common.tf_util import load_state, save_state
from baselines import logger
from baselines.common.schedules import LinearSchedule
from baselines.common import set_global_seeds
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
from baselines.deepq.utils import ObservationInput
from baselines.common.tf_util import get_session
from baselines.deepq.models import build_q_func
class ActWrapper(object):
def __init__(self, act, act_params):
self._act = act
self._act_params = act_params
self.initial_state = None
@staticmethod
def load(path, num_cpu=16):
def load_act(path):
with open(path, "rb") as f:
model_data, act_params = dill.load(f)
model_data, act_params = cloudpickle.load(f)
act = deepq.build_act(**act_params)
sess = U.make_session(num_cpu=num_cpu)
sess = tf.Session()
sess.__enter__()
with tempfile.TemporaryDirectory() as td:
arc_path = os.path.join(td, "packed.zip")
@@ -31,17 +39,23 @@ class ActWrapper(object):
f.write(model_data)
zipfile.ZipFile(arc_path, 'r', zipfile.ZIP_DEFLATED).extractall(td)
U.load_state(os.path.join(td, "model"))
load_state(os.path.join(td, "model"))
return ActWrapper(act, act_params)
def __call__(self, *args, **kwargs):
return self._act(*args, **kwargs)
def save(self, path):
def step(self, observation, **kwargs):
return self._act([observation], **kwargs), None, None, None
def save_act(self, path=None):
"""Save model to a pickle located at `path`"""
if path is None:
path = os.path.join(logger.get_dir(), "model.pkl")
with tempfile.TemporaryDirectory() as td:
U.save_state(os.path.join(td, "model"))
save_state(os.path.join(td, "model"))
arc_name = os.path.join(td, "packed.zip")
with zipfile.ZipFile(arc_name, 'w') as zipf:
for root, dirs, files in os.walk(td):
@@ -52,18 +66,19 @@ class ActWrapper(object):
with open(arc_name, "rb") as f:
model_data = f.read()
with open(path, "wb") as f:
dill.dump((model_data, self._act_params), f)
cloudpickle.dump((model_data, self._act_params), f)
def save(self, path):
save_state(path)
def load(path, num_cpu=16):
def load_act(path):
"""Load act function that was returned by learn function.
Parameters
----------
path: str
path to the act function pickle
num_cpu: int
number of cpus to use for executing the policy
Returns
-------
@@ -71,20 +86,22 @@ def load(path, num_cpu=16):
function that takes a batch of observations
and returns actions.
"""
return ActWrapper.load(path, num_cpu=num_cpu)
return ActWrapper.load_act(path)
def learn(env,
q_func,
network,
seed=None,
lr=5e-4,
max_timesteps=100000,
total_timesteps=100000,
buffer_size=50000,
exploration_fraction=0.1,
exploration_final_eps=0.02,
train_freq=1,
batch_size=32,
print_freq=1,
print_freq=100,
checkpoint_freq=10000,
checkpoint_path=None,
learning_starts=1000,
gamma=1.0,
target_network_update_freq=500,
@@ -93,8 +110,11 @@ def learn(env,
prioritized_replay_beta0=0.4,
prioritized_replay_beta_iters=None,
prioritized_replay_eps=1e-6,
num_cpu=16,
callback=None):
param_noise=False,
callback=None,
load_path=None,
**network_kwargs
):
"""Train a deepq model.
Parameters
@@ -113,7 +133,7 @@ def learn(env,
and returns a tensor of shape (batch_size, num_actions) with values of every action.
lr: float
learning rate for adam optimizer
max_timesteps: int
total_timesteps: int
number of env steps to optimize for
buffer_size: int
size of the replay buffer
@@ -147,14 +167,16 @@ def learn(env,
initial value of beta for prioritized replay buffer
prioritized_replay_beta_iters: int
number of iterations over which beta will be annealed from initial value
to 1.0. If set to None, equals max_timesteps.
to 1.0. If set to None, equals total_timesteps.
prioritized_replay_eps: float
epsilon to add to the TD errors when updating priorities.
num_cpu: int
number of cpus to use for training
callback: (locals, globals) -> None
function called at every step with the state of the algorithm.
If callback returns true training stops.
load_path: str
path to load the model from. (default: None)
**network_kwargs
additional keyword arguments to pass to the network builder.
Returns
-------
@@ -164,11 +186,16 @@ def learn(env,
"""
# Create all the functions necessary to train the model
sess = U.make_session(num_cpu=num_cpu)
sess.__enter__()
sess = get_session()
set_global_seeds(seed)
q_func = build_q_func(network, **network_kwargs)
# capture the shape outside the closure so that the env object is not serialized
# by cloudpickle when serializing make_obs_ph
def make_obs_ph(name):
return U.BatchInput(env.observation_space.shape, name=name)
return ObservationInput(env.observation_space, name=name)
act, train, update_target, debug = deepq.build_train(
make_obs_ph=make_obs_ph,
@@ -176,18 +203,23 @@ def learn(env,
num_actions=env.action_space.n,
optimizer=tf.train.AdamOptimizer(learning_rate=lr),
gamma=gamma,
grad_norm_clipping=10
grad_norm_clipping=10,
param_noise=param_noise
)
act_params = {
'make_obs_ph': make_obs_ph,
'q_func': q_func,
'num_actions': env.action_space.n,
}
act = ActWrapper(act, act_params)
# Create the replay buffer
if prioritized_replay:
replay_buffer = PrioritizedReplayBuffer(buffer_size, alpha=prioritized_replay_alpha)
if prioritized_replay_beta_iters is None:
prioritized_replay_beta_iters = max_timesteps
prioritized_replay_beta_iters = total_timesteps
beta_schedule = LinearSchedule(prioritized_replay_beta_iters,
initial_p=prioritized_replay_beta0,
final_p=1.0)
@@ -195,7 +227,7 @@ def learn(env,
replay_buffer = ReplayBuffer(buffer_size)
beta_schedule = None
# Create the schedule for exploration starting from 1.
exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * max_timesteps),
exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * total_timesteps),
initial_p=1.0,
final_p=exploration_final_eps)
@@ -206,16 +238,46 @@ def learn(env,
episode_rewards = [0.0]
saved_mean_reward = None
obs = env.reset()
reset = True
with tempfile.TemporaryDirectory() as td:
model_saved = False
td = checkpoint_path or td
model_file = os.path.join(td, "model")
for t in range(max_timesteps):
model_saved = False
if tf.train.latest_checkpoint(td) is not None:
load_state(model_file)
logger.log('Loaded model from {}'.format(model_file))
model_saved = True
elif load_path is not None:
load_state(load_path)
logger.log('Loaded model from {}'.format(load_path))
for t in range(total_timesteps):
if callback is not None:
if callback(locals(), globals()):
break
# Take action and update exploration to the newest value
action = act(np.array(obs)[None], update_eps=exploration.value(t))[0]
new_obs, rew, done, _ = env.step(action)
kwargs = {}
if not param_noise:
update_eps = exploration.value(t)
update_param_noise_threshold = 0.
else:
update_eps = 0.
# Compute the threshold such that the KL divergence between perturbed and non-perturbed
# policy is comparable to eps-greedy exploration with eps = exploration.value(t).
# See Appendix C.1 in Parameter Space Noise for Exploration, Plappert et al., 2017
# for detailed explanation.
update_param_noise_threshold = -np.log(1. - exploration.value(t) + exploration.value(t) / float(env.action_space.n))
kwargs['reset'] = reset
kwargs['update_param_noise_threshold'] = update_param_noise_threshold
kwargs['update_param_noise_scale'] = True
action = act(np.array(obs)[None], update_eps=update_eps, **kwargs)[0]
env_action = action
reset = False
new_obs, rew, done, _ = env.step(env_action)
# Store transition in the replay buffer.
replay_buffer.add(obs, action, rew, new_obs, float(done))
obs = new_obs
@@ -224,6 +286,7 @@ def learn(env,
if done:
obs = env.reset()
episode_rewards.append(0.0)
reset = True
if t > learning_starts and t % train_freq == 0:
# Minimize the error in Bellman's equation on a batch sampled from replay buffer.
@@ -257,12 +320,12 @@ def learn(env,
if print_freq is not None:
logger.log("Saving model due to mean reward increase: {} -> {}".format(
saved_mean_reward, mean_100ep_reward))
U.save_state(model_file)
save_state(model_file)
model_saved = True
saved_mean_reward = mean_100ep_reward
if model_saved:
if print_freq is not None:
logger.log("Restored model with mean reward: {}".format(saved_mean_reward))
U.load_state(model_file)
load_state(model_file)
return ActWrapper(act, act_params)
return act
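When `param_noise` is on, the training loop converts the current epsilon into a KL threshold via `-log(1 - eps + eps / n_actions)`. The sketch below combines that formula with a linear epsilon schedule. `linear_eps` is a hypothetical re-implementation of the interpolation that `LinearSchedule` is assumed to perform, and the timestep counts and action count are arbitrary.

```python
import numpy as np


def linear_eps(t, schedule_timesteps, initial_p=1.0, final_p=0.02):
    """Linearly anneal from initial_p to final_p, then hold final_p."""
    fraction = min(float(t) / schedule_timesteps, 1.0)
    return initial_p + fraction * (final_p - initial_p)


def param_noise_threshold(eps, num_actions):
    """KL threshold meant to be comparable to eps-greedy exploration with this eps."""
    return -np.log(1.0 - eps + eps / float(num_actions))


eps_start = linear_eps(0, 10000)       # fully random at the start
eps_end = linear_eps(20000, 10000)     # past the schedule, clamped at final_p
threshold_start = param_noise_threshold(eps_start, num_actions=4)
threshold_end = param_noise_threshold(eps_end, num_actions=4)
```

At `eps = 1` the threshold equals `log(num_actions)`, and it shrinks toward zero as epsilon decays, so the permitted divergence between the perturbed and unperturbed policies tightens over training.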

View File

@@ -0,0 +1,21 @@
def atari():
return dict(
network='conv_only',
lr=1e-4,
buffer_size=10000,
exploration_fraction=0.1,
exploration_final_eps=0.01,
train_freq=4,
learning_starts=10000,
target_network_update_freq=1000,
gamma=0.99,
prioritized_replay=True,
prioritized_replay_alpha=0.6,
checkpoint_freq=10000,
checkpoint_path=None,
dueling=True
)
def retro():
return atari()
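The new defaults file returns a plain dict per environment type. One way such defaults could be combined with caller overrides is sketched below. `merge_kwargs` is a hypothetical helper (the diff does not show how these dicts are consumed), and `atari_defaults` copies only a subset of the values above.

```python
def atari_defaults():
    """Subset of the Atari defaults shown in the diff."""
    return dict(
        network='conv_only',
        lr=1e-4,
        buffer_size=10000,
        train_freq=4,
        gamma=0.99,
        dueling=True,
    )


def merge_kwargs(defaults, **overrides):
    """Return a new dict in which explicit overrides win over defaults."""
    merged = dict(defaults)  # copy so the defaults dict is not mutated
    merged.update(overrides)
    return merged


learn_kwargs = merge_kwargs(atari_defaults(), lr=5e-4, seed=0)
```

Returning a fresh dict from a function, as the diff does, avoids sharing mutable state between runs, and `retro()` can reuse the Atari values simply by calling `atari()`.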

View File

@@ -1,51 +0,0 @@
import argparse
import progressbar
from baselines.common.azure_utils import Container
def parse_args():
parser = argparse.ArgumentParser("Download a pretrained model from Azure.")
# Environment
parser.add_argument("--model-dir", type=str, default=None,
help="save model in this directory.")
parser.add_argument("--account-name", type=str, default="openaisciszymon",
help="account name for Azure Blob Storage")
parser.add_argument("--account-key", type=str, default=None,
help="account key for Azure Blob Storage")
parser.add_argument("--container", type=str, default="dqn-blogpost",
help="container name and blob name separated by colon")
parser.add_argument("--blob", type=str, default=None, help="blob with the model")
return parser.parse_args()
def main():
args = parse_args()
c = Container(account_name=args.account_name,
account_key=args.account_key,
container_name=args.container)
if args.blob is None:
print("Listing available models:")
print()
for blob in sorted(c.list(prefix="model-")):
print(blob)
else:
print("Downloading {} to {}...".format(args.blob, args.model_dir))
bar = None
def callback(current, total):
nonlocal bar
if bar is None:
bar = progressbar.ProgressBar(max_value=total)
bar.update(current)
assert c.exists(args.blob), "model {} does not exist".format(args.blob)
assert args.model_dir is not None
c.get(args.model_dir, args.blob, callback=callback)
if __name__ == '__main__':
main()

View File

@@ -1,70 +0,0 @@
import argparse
import gym
import os
import numpy as np
from gym.monitoring import VideoRecorder
import baselines.common.tf_util as U
from baselines import deepq
from baselines.common.misc_util import (
boolean_flag,
SimpleMonitor,
)
from baselines.common.atari_wrappers_deprecated import wrap_dqn
from baselines.deepq.experiments.atari.model import model, dueling_model
def parse_args():
parser = argparse.ArgumentParser("Run an already learned DQN model.")
# Environment
parser.add_argument("--env", type=str, required=True, help="name of the game")
parser.add_argument("--model-dir", type=str, default=None, help="load model from this directory. ")
parser.add_argument("--video", type=str, default=None, help="Path to mp4 file where the video of first episode will be recorded.")
boolean_flag(parser, "stochastic", default=True, help="whether or not to use stochastic actions according to models eps value")
boolean_flag(parser, "dueling", default=False, help="whether or not to use dueling model")
return parser.parse_args()
def make_env(game_name):
env = gym.make(game_name + "NoFrameskip-v4")
env = SimpleMonitor(env)
env = wrap_dqn(env)
return env
def play(env, act, stochastic, video_path):
num_episodes = 0
video_recorder = None
video_recorder = VideoRecorder(
env, video_path, enabled=video_path is not None)
obs = env.reset()
while True:
env.unwrapped.render()
video_recorder.capture_frame()
action = act(np.array(obs)[None], stochastic=stochastic)[0]
obs, rew, done, info = env.step(action)
if done:
obs = env.reset()
if len(info["rewards"]) > num_episodes:
if len(info["rewards"]) == 1 and video_recorder.enabled:
# save video of first episode
print("Saved video.")
video_recorder.close()
video_recorder.enabled = False
print(info["rewards"][-1])
num_episodes = len(info["rewards"])
if __name__ == '__main__':
with U.make_session(4) as sess:
args = parse_args()
env = make_env(args.env)
act = deepq.build_act(
make_obs_ph=lambda name: U.Uint8Input(env.observation_space.shape, name=name),
q_func=dueling_model if args.dueling else model,
num_actions=env.action_space.n)
U.load_state(os.path.join(args.model_dir, "saved"))
play(env, act, args.stochastic, args.video)

View File

@@ -1,43 +0,0 @@
import tensorflow as tf
import tensorflow.contrib.layers as layers
def model(img_in, num_actions, scope, reuse=False):
"""As described in https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf"""
with tf.variable_scope(scope, reuse=reuse):
out = img_in
with tf.variable_scope("convnet"):
# original architecture
out = layers.convolution2d(out, num_outputs=32, kernel_size=8, stride=4, activation_fn=tf.nn.relu)
out = layers.convolution2d(out, num_outputs=64, kernel_size=4, stride=2, activation_fn=tf.nn.relu)
out = layers.convolution2d(out, num_outputs=64, kernel_size=3, stride=1, activation_fn=tf.nn.relu)
out = layers.flatten(out)
with tf.variable_scope("action_value"):
out = layers.fully_connected(out, num_outputs=512, activation_fn=tf.nn.relu)
out = layers.fully_connected(out, num_outputs=num_actions, activation_fn=None)
return out
def dueling_model(img_in, num_actions, scope, reuse=False):
"""As described in https://arxiv.org/abs/1511.06581"""
with tf.variable_scope(scope, reuse=reuse):
out = img_in
with tf.variable_scope("convnet"):
# original architecture
out = layers.convolution2d(out, num_outputs=32, kernel_size=8, stride=4, activation_fn=tf.nn.relu)
out = layers.convolution2d(out, num_outputs=64, kernel_size=4, stride=2, activation_fn=tf.nn.relu)
out = layers.convolution2d(out, num_outputs=64, kernel_size=3, stride=1, activation_fn=tf.nn.relu)
out = layers.flatten(out)
with tf.variable_scope("state_value"):
state_hidden = layers.fully_connected(out, num_outputs=512, activation_fn=tf.nn.relu)
state_score = layers.fully_connected(state_hidden, num_outputs=1, activation_fn=None)
with tf.variable_scope("action_value"):
actions_hidden = layers.fully_connected(out, num_outputs=512, activation_fn=tf.nn.relu)
action_scores = layers.fully_connected(actions_hidden, num_outputs=num_actions, activation_fn=None)
action_scores_mean = tf.reduce_mean(action_scores, 1)
action_scores = action_scores - tf.expand_dims(action_scores_mean, 1)
return state_score + action_scores
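The removed `dueling_model` combines a scalar state value with mean-centered action advantages. A NumPy sketch of just that aggregation step follows; the function name and sample numbers are illustrative, and the convolutional layers are omitted.

```python
import numpy as np


def dueling_q_values(state_score, action_scores):
    """Combine V(s) of shape (batch, 1) with A(s, a) of shape (batch, num_actions)."""
    # Centering the advantages makes the split into V and A identifiable.
    centered = action_scores - np.mean(action_scores, axis=1, keepdims=True)
    return state_score + centered


state_score = np.array([[2.0]])
action_scores = np.array([[1.0, 2.0, 3.0]])
q_values = dueling_q_values(state_score, action_scores)
```

After centering, the mean of the Q-values over actions equals the state score, so the two heads cannot trade a constant offset against each other.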

View File

@@ -1,229 +0,0 @@
import argparse
import gym
import numpy as np
import os
import tensorflow as tf
import tempfile
import time
import baselines.common.tf_util as U
from baselines import logger
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
from baselines.common.misc_util import (
boolean_flag,
pickle_load,
pretty_eta,
relatively_safe_pickle_dump,
set_global_seeds,
RunningAvg,
SimpleMonitor
)
from baselines.common.schedules import LinearSchedule, PiecewiseSchedule
# when updating this to non-deprecated ones, it is important to
# copy over LazyFrames
from baselines.common.atari_wrappers_deprecated import wrap_dqn
from baselines.common.azure_utils import Container
from .model import model, dueling_model


def parse_args():
    parser = argparse.ArgumentParser("DQN experiments for Atari games")
    # Environment
    parser.add_argument("--env", type=str, default="Pong", help="name of the game")
    parser.add_argument("--seed", type=int, default=42, help="which seed to use")
    # Core DQN parameters
    parser.add_argument("--replay-buffer-size", type=int, default=int(1e6), help="replay buffer size")
    parser.add_argument("--lr", type=float, default=1e-4, help="learning rate for Adam optimizer")
    parser.add_argument("--num-steps", type=int, default=int(2e8), help="total number of steps to run the environment for")
    parser.add_argument("--batch-size", type=int, default=32, help="number of transitions to optimize at the same time")
    parser.add_argument("--learning-freq", type=int, default=4, help="number of iterations between every optimization step")
    parser.add_argument("--target-update-freq", type=int, default=40000, help="number of iterations between every target network update")
    # Bells and whistles
    boolean_flag(parser, "double-q", default=True, help="whether or not to use double q learning")
    boolean_flag(parser, "dueling", default=False, help="whether or not to use dueling model")
    boolean_flag(parser, "prioritized", default=False, help="whether or not to use prioritized replay buffer")
    parser.add_argument("--prioritized-alpha", type=float, default=0.6, help="alpha parameter for prioritized replay buffer")
    parser.add_argument("--prioritized-beta0", type=float, default=0.4, help="initial value of beta parameters for prioritized replay")
    parser.add_argument("--prioritized-eps", type=float, default=1e-6, help="eps parameter for prioritized replay buffer")
    # Checkpointing
    parser.add_argument("--save-dir", type=str, default=None, help="directory in which training state and model should be saved.")
    parser.add_argument("--save-azure-container", type=str, default=None,
                        help="If present, data will be saved/loaded from Azure. Should be in format ACCOUNT_NAME:ACCOUNT_KEY:CONTAINER")
    parser.add_argument("--save-freq", type=int, default=1e6, help="save model once every time this many iterations are completed")
    boolean_flag(parser, "load-on-start", default=True, help="if true and model was previously saved then training will be resumed")
    return parser.parse_args()


def make_env(game_name):
    env = gym.make(game_name + "NoFrameskip-v4")
    monitored_env = SimpleMonitor(env)  # puts rewards and number of steps in info, before environment is wrapped
    env = wrap_dqn(monitored_env)  # applies a bunch of modification to simplify the observation space (downsample, make b/w)
    return env, monitored_env


def maybe_save_model(savedir, container, state):
    """This function checkpoints the model and state of the training algorithm."""
    if savedir is None:
        return
    start_time = time.time()
    model_dir = "model-{}".format(state["num_iters"])
    U.save_state(os.path.join(savedir, model_dir, "saved"))
    if container is not None:
        container.put(os.path.join(savedir, model_dir), model_dir)
    relatively_safe_pickle_dump(state, os.path.join(savedir, 'training_state.pkl.zip'), compression=True)
    if container is not None:
        container.put(os.path.join(savedir, 'training_state.pkl.zip'), 'training_state.pkl.zip')
    relatively_safe_pickle_dump(state["monitor_state"], os.path.join(savedir, 'monitor_state.pkl'))
    if container is not None:
        container.put(os.path.join(savedir, 'monitor_state.pkl'), 'monitor_state.pkl')
    logger.log("Saved model in {} seconds\n".format(time.time() - start_time))


def maybe_load_model(savedir, container):
    """Load model if present at the specified path."""
    if savedir is None:
        return
    state_path = os.path.join(os.path.join(savedir, 'training_state.pkl.zip'))
    if container is not None:
        logger.log("Attempting to download model from Azure")
        found_model = container.get(savedir, 'training_state.pkl.zip')
    else:
        found_model = os.path.exists(state_path)
    if found_model:
        state = pickle_load(state_path, compression=True)
        model_dir = "model-{}".format(state["num_iters"])
        if container is not None:
            container.get(savedir, model_dir)
        U.load_state(os.path.join(savedir, model_dir, "saved"))
        logger.log("Loaded models checkpoint at {} iterations".format(state["num_iters"]))
        return state


if __name__ == '__main__':
    args = parse_args()
    # Parse savedir and azure container.
    savedir = args.save_dir
    if args.save_azure_container is not None:
        account_name, account_key, container_name = args.save_azure_container.split(":")
        container = Container(account_name=account_name,
                              account_key=account_key,
                              container_name=container_name,
                              maybe_create=True)
        if savedir is None:
            # Careful! This will not get cleaned up. Docker spoils the developers.
            savedir = tempfile.TemporaryDirectory().name
    else:
        container = None
    # Create and seed the env.
    env, monitored_env = make_env(args.env)
    if args.seed > 0:
        set_global_seeds(args.seed)
        env.unwrapped.seed(args.seed)
    with U.make_session(4) as sess:
        # Create training graph and replay buffer
        act, train, update_target, debug = deepq.build_train(
            make_obs_ph=lambda name: U.Uint8Input(env.observation_space.shape, name=name),
            q_func=dueling_model if args.dueling else model,
            num_actions=env.action_space.n,
            optimizer=tf.train.AdamOptimizer(learning_rate=args.lr, epsilon=1e-4),
            gamma=0.99,
            grad_norm_clipping=10,
            double_q=args.double_q
        )
        approximate_num_iters = args.num_steps / 4
        exploration = PiecewiseSchedule([
            (0, 1.0),
            (approximate_num_iters / 50, 0.1),
            (approximate_num_iters / 5, 0.01)
        ], outside_value=0.01)
        if args.prioritized:
            replay_buffer = PrioritizedReplayBuffer(args.replay_buffer_size, args.prioritized_alpha)
            beta_schedule = LinearSchedule(approximate_num_iters, initial_p=args.prioritized_beta0, final_p=1.0)
        else:
            replay_buffer = ReplayBuffer(args.replay_buffer_size)
        U.initialize()
        update_target()
        num_iters = 0
        # Load the model
        state = maybe_load_model(savedir, container)
        if state is not None:
            num_iters, replay_buffer = state["num_iters"], state["replay_buffer"],
            monitored_env.set_state(state["monitor_state"])
        start_time, start_steps = None, None
        steps_per_iter = RunningAvg(0.999)
        iteration_time_est = RunningAvg(0.999)
        obs = env.reset()
        # Main training loop
        while True:
            num_iters += 1
            # Take action and store transition in the replay buffer.
            action = act(np.array(obs)[None], update_eps=exploration.value(num_iters))[0]
            new_obs, rew, done, info = env.step(action)
            replay_buffer.add(obs, action, rew, new_obs, float(done))
            obs = new_obs
            if done:
                obs = env.reset()
            if (num_iters > max(5 * args.batch_size, args.replay_buffer_size // 20) and
                    num_iters % args.learning_freq == 0):
                # Sample a bunch of transitions from replay buffer
                if args.prioritized:
                    experience = replay_buffer.sample(args.batch_size, beta=beta_schedule.value(num_iters))
                    (obses_t, actions, rewards, obses_tp1, dones, weights, batch_idxes) = experience
                else:
                    obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(args.batch_size)
                    weights = np.ones_like(rewards)
                # Minimize the error in Bellman's equation and compute TD-error
                td_errors = train(obses_t, actions, rewards, obses_tp1, dones, weights)
                # Update the priorities in the replay buffer
                if args.prioritized:
                    new_priorities = np.abs(td_errors) + args.prioritized_eps
                    replay_buffer.update_priorities(batch_idxes, new_priorities)
            # Update target network.
            if num_iters % args.target_update_freq == 0:
                update_target()
            if start_time is not None:
                steps_per_iter.update(info['steps'] - start_steps)
                iteration_time_est.update(time.time() - start_time)
            start_time, start_steps = time.time(), info["steps"]
            # Save the model and training state.
            if num_iters > 0 and (num_iters % args.save_freq == 0 or info["steps"] > args.num_steps):
                maybe_save_model(savedir, container, {
                    'replay_buffer': replay_buffer,
                    'num_iters': num_iters,
                    'monitor_state': monitored_env.get_state()
                })
            if info["steps"] > args.num_steps:
                break
            if done:
                steps_left = args.num_steps - info["steps"]
                completion = np.round(info["steps"] / args.num_steps, 1)
                logger.record_tabular("% completion", completion)
                logger.record_tabular("steps", info["steps"])
                logger.record_tabular("iters", num_iters)
                logger.record_tabular("episodes", len(info["rewards"]))
                logger.record_tabular("reward (100 epi mean)", np.mean(info["rewards"][-100:]))
                logger.record_tabular("exploration", exploration.value(num_iters))
                if args.prioritized:
                    logger.record_tabular("max priority", replay_buffer._max_priority)
                # fps_estimate stays None until the running averages have been updated once;
                # dividing by a placeholder string here would raise a TypeError.
                fps_estimate = (float(steps_per_iter) / (float(iteration_time_est) + 1e-6)
                                if steps_per_iter._value is not None else None)
                logger.dump_tabular()
                logger.log()
                logger.log("ETA: " + (pretty_eta(int(steps_left / fps_estimate))
                                      if fps_estimate is not None else "calculating..."))
                logger.log()
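
The training script above anneals the exploration rate with a piecewise-linear schedule over three endpoints. The following is a rough, self-contained sketch of how such a schedule can be evaluated. `piecewise_linear` is a hypothetical helper written for illustration, not the `PiecewiseSchedule` class from `baselines.common.schedules`, and the endpoint values simply reuse the script's defaults.

```python
def piecewise_linear(t, endpoints, outside_value):
    """Linearly interpolate between sorted (time, value) endpoints.

    Returns outside_value when t does not fall inside any segment.
    """
    for (left_t, left_v), (right_t, right_v) in zip(endpoints[:-1], endpoints[1:]):
        if left_t <= t < right_t:
            alpha = (t - left_t) / (right_t - left_t)
            return left_v + alpha * (right_v - left_v)
    return outside_value


# Same numbers as the script's defaults: 2e8 environment steps, 4 steps per iteration.
num_steps = int(2e8)
approximate_num_iters = num_steps / 4
endpoints = [
    (0, 1.0),
    (approximate_num_iters / 50, 0.1),
    (approximate_num_iters / 5, 0.01),
]

eps_start = piecewise_linear(0, endpoints, outside_value=0.01)
eps_halfway_first_segment = piecewise_linear(approximate_num_iters / 100, endpoints, outside_value=0.01)
eps_after_schedule = piecewise_linear(approximate_num_iters, endpoints, outside_value=0.01)
```

With these endpoints, epsilon falls from 1.0 to 0.1 over the first 2% of iterations, then to 0.01 by 20%, and stays at 0.01 afterwards.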


@@ -1,81 +0,0 @@
import argparse
import gym
import numpy as np
import os
import baselines.common.tf_util as U
from baselines import deepq
from baselines.common.misc_util import get_wrapper_by_name, SimpleMonitor, boolean_flag, set_global_seeds
from baselines.common.atari_wrappers_deprecated import wrap_dqn
from baselines.deepq.experiments.atari.model import model, dueling_model


def make_env(game_name):
    env = gym.make(game_name + "NoFrameskip-v4")
    env_monitored = SimpleMonitor(env)
    env = wrap_dqn(env_monitored)
    return env_monitored, env


def parse_args():
    parser = argparse.ArgumentParser("Evaluate an already learned DQN model.")
    # Environment
    parser.add_argument("--env", type=str, required=True, help="name of the game")
    parser.add_argument("--model-dir", type=str, default=None, help="load model from this directory. ")
    boolean_flag(parser, "stochastic", default=True, help="whether or not to use stochastic actions according to models eps value")
    boolean_flag(parser, "dueling", default=False, help="whether or not to use dueling model")
    return parser.parse_args()


def wang2015_eval(game_name, act, stochastic):
    print("==================== wang2015 evaluation ====================")
    episode_rewards = []
    for num_noops in range(1, 31):
        env_monitored, eval_env = make_env(game_name)
        eval_env.unwrapped.seed(1)
        get_wrapper_by_name(eval_env, "NoopResetEnv").override_num_noops = num_noops
        eval_episode_steps = 0
        done = True
        while True:
            if done:
                obs = eval_env.reset()
            eval_episode_steps += 1
            action = act(np.array(obs)[None], stochastic=stochastic)[0]
            obs, reward, done, info = eval_env.step(action)
            if done:
                obs = eval_env.reset()
            if len(info["rewards"]) > 0:
                episode_rewards.append(info["rewards"][0])
                break
            if info["steps"] > 108000:  # 5 minutes of gameplay
                episode_rewards.append(env_monitored._current_reward)
                break
        print("Num steps in episode {} was {} yielding {} reward".format(
            num_noops, eval_episode_steps, episode_rewards[-1]), flush=True)
    print("Evaluation results: " + str(np.mean(episode_rewards)))
    print("=============================================================")
    return np.mean(episode_rewards)


def main():
    set_global_seeds(1)
    args = parse_args()
    with U.make_session(4) as sess:  # noqa
        _, env = make_env(args.env)
        act = deepq.build_act(
            make_obs_ph=lambda name: U.Uint8Input(env.observation_space.shape, name=name),
            q_func=dueling_model if args.dueling else model,
            num_actions=env.action_space.n)
        U.load_state(os.path.join(args.model_dir, "saved"))
        wang2015_eval(args.env, act, stochastic=args.stochastic)


if __name__ == '__main__':
    main()


@@ -9,6 +9,7 @@ import baselines.common.tf_util as U
 from baselines import logger
 from baselines import deepq
 from baselines.deepq.replay_buffer import ReplayBuffer
+from baselines.deepq.utils import ObservationInput
 from baselines.common.schedules import LinearSchedule
@@ -27,7 +28,7 @@ if __name__ == '__main__':
         env = gym.make("CartPole-v0")
         # Create all the functions necessary to train the model
         act, train, update_target, debug = deepq.build_train(
-            make_obs_ph=lambda name: U.BatchInput(env.observation_space.shape, name=name),
+            make_obs_ph=lambda name: ObservationInput(env.observation_space, name=name),
             q_func=model,
             num_actions=env.action_space.n,
             optimizer=tf.train.AdamOptimizer(learning_rate=5e-4),


@@ -0,0 +1,21 @@
import gym
from baselines import deepq


def main():
    env = gym.make("MountainCar-v0")
    act = deepq.load("mountaincar_model.pkl")
    while True:
        obs, done = env.reset(), False
        episode_rew = 0
        while not done:
            env.render()
            obs, rew, done, _ = env.step(act(obs[None])[0])
            episode_rew += rew
        print("Episode reward", episode_rew)


if __name__ == '__main__':
    main()


@@ -1,12 +1,10 @@
 import gym
 from baselines import deepq
-from baselines.common.atari_wrappers_deprecated import wrap_dqn, ScaledFloatFrame
 def main():
     env = gym.make("PongNoFrameskip-v4")
-    env = ScaledFloatFrame(wrap_dqn(env))
+    env = deepq.wrap_atari_dqn(env)
     act = deepq.load("pong_model.pkl")
     while True:


@@ -0,0 +1,34 @@
import argparse
import numpy as np
from baselines import deepq
from baselines.common import retro_wrappers


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--env', help='environment ID', default='SuperMarioBros-Nes')
    parser.add_argument('--gamestate', help='game state to load', default='Level1-1')
    parser.add_argument('--model', help='model pickle file from ActWrapper.save', default='model.pkl')
    args = parser.parse_args()
    env = retro_wrappers.make_retro(game=args.env, state=args.gamestate, max_episode_steps=None)
    env = retro_wrappers.wrap_deepmind_retro(env)
    act = deepq.load(args.model)
    while True:
        obs, done = env.reset(), False
        episode_rew = 0
        while not done:
            env.render()
            action = act(obs[None])[0]
            env_action = np.zeros(env.action_space.n)
            env_action[action] = 1
            obs, rew, done, _ = env.step(env_action)
            episode_rew += rew
        print('Episode reward', episode_rew)


if __name__ == '__main__':
    main()


@@ -0,0 +1,54 @@
from baselines import deepq
from baselines.common import set_global_seeds
from baselines import bench
import argparse
from baselines import logger
from baselines.common.atari_wrappers import make_atari


def main():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
    parser.add_argument('--seed', help='RNG seed', type=int, default=0)
    parser.add_argument('--prioritized', type=int, default=1)
    parser.add_argument('--prioritized-replay-alpha', type=float, default=0.6)
    parser.add_argument('--dueling', type=int, default=1)
    parser.add_argument('--num-timesteps', type=int, default=int(10e6))
    parser.add_argument('--checkpoint-freq', type=int, default=10000)
    parser.add_argument('--checkpoint-path', type=str, default=None)
    args = parser.parse_args()
    logger.configure()
    set_global_seeds(args.seed)
    env = make_atari(args.env)
    env = bench.Monitor(env, logger.get_dir())
    env = deepq.wrap_atari_dqn(env)
    model = deepq.models.cnn_to_mlp(
        convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
        hiddens=[256],
        dueling=bool(args.dueling),
    )
    deepq.learn(
        env,
        q_func=model,
        lr=1e-4,
        max_timesteps=args.num_timesteps,
        buffer_size=10000,
        exploration_fraction=0.1,
        exploration_final_eps=0.01,
        train_freq=4,
        learning_starts=10000,
        target_network_update_freq=1000,
        gamma=0.99,
        prioritized_replay=bool(args.prioritized),
        prioritized_replay_alpha=args.prioritized_replay_alpha,
        checkpoint_freq=args.checkpoint_freq,
        checkpoint_path=args.checkpoint_path,
    )
    env.close()


if __name__ == '__main__':
    main()


@@ -0,0 +1,49 @@
import argparse
from baselines import deepq
from baselines.common import set_global_seeds
from baselines import bench
from baselines import logger
from baselines.common import retro_wrappers
import retro


def main():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--env', help='environment ID', default='SuperMarioBros-Nes')
    parser.add_argument('--gamestate', help='game state to load', default='Level1-1')
    parser.add_argument('--seed', help='seed', type=int, default=0)
    parser.add_argument('--num-timesteps', type=int, default=int(10e6))
    args = parser.parse_args()
    logger.configure()
    set_global_seeds(args.seed)
    env = retro_wrappers.make_retro(game=args.env, state=args.gamestate, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE)
    env.seed(args.seed)
    env = bench.Monitor(env, logger.get_dir())
    env = retro_wrappers.wrap_deepmind_retro(env)
    model = deepq.models.cnn_to_mlp(
        convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
        hiddens=[256],
        dueling=True
    )
    act = deepq.learn(
        env,
        q_func=model,
        lr=1e-4,
        max_timesteps=args.num_timesteps,
        buffer_size=10000,
        exploration_fraction=0.1,
        exploration_final_eps=0.01,
        train_freq=4,
        learning_starts=10000,
        target_network_update_freq=1000,
        gamma=0.99,
        prioritized_replay=True
    )
    act.save()
    env.close()


if __name__ == '__main__':
    main()


@@ -3,7 +3,7 @@ import gym
 from baselines import deepq
-def callback(lcl, glb):
+def callback(lcl, _glb):
     # stop training if reward exceeds 199
     is_solved = lcl['t'] > 100 and sum(lcl['episode_rewards'][-101:-1]) / 100 >= 199
     return is_solved
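
The stopping rule in the callback above averages the 100 entries before the last one in `episode_rewards`. The sketch below isolates that check so the slice arithmetic is easier to follow. The standalone function `is_solved` and its `threshold` parameter are hypothetical, and it assumes the last list entry holds the still-running episode, which is why that entry is excluded.

```python
def is_solved(t, episode_rewards, threshold=199):
    """Return True once the mean of the last 100 finished episodes reaches threshold.

    Assumption: episode_rewards[-1] is the episode still in progress, so the
    slice [-101:-1] selects the 100 completed episodes just before it.
    """
    recent = episode_rewards[-101:-1]
    return t > 100 and sum(recent) / 100 >= threshold


# 100 finished episodes with reward 200 each, plus one episode in progress.
solved = is_solved(t=101, episode_rewards=[200.0] * 100 + [0.0])
too_early = is_solved(t=50, episode_rewards=[200.0] * 100 + [0.0])
too_weak = is_solved(t=101, episode_rewards=[100.0] * 100 + [0.0])
```

Note that the sum is always divided by 100, so with fewer than 100 finished episodes the average is understated and the check cannot trigger early.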

Some files were not shown because too many files have changed in this diff.