Compare commits

...

49 Commits

Author SHA1 Message Date
uronce-cc
0a40206c6c ncpu needs to be an integer. (#558) 2018-08-31 09:02:18 -07:00
Alfredo Canziani
1937826784 Fix alien syntax and apply PEP 8 style (#554) 2018-08-30 17:21:25 -07:00
pzhokhov
b29c8020d7 remove saving model as a pickle file in ppo2 (tries to pull environment in; bad idea - may need to use constructor argument pickling or somesuch if at all necessary) (#69) 2018-08-30 13:41:38 -07:00
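The "constructor argument pickling" idea floated in this message is a generic Python pattern (store the recipe needed to rebuild the object, not the live object); a minimal standalone sketch, with illustrative names that are not the ppo2 code:

```python
import pickle

class Model:
    """Illustrative stand-in for a ppo2-style model object (not the baselines class)."""
    def __init__(self, network='mlp', nsteps=128, env=None):
        self.network = network
        self.nsteps = nsteps
        self.env = env  # live environment; generally not picklable

    # "Constructor argument pickling": serialize the recipe, not the live object.
    def __getstate__(self):
        return {'network': self.network, 'nsteps': self.nsteps}  # drop the env

    def __setstate__(self, state):
        self.__init__(**state)  # rebuild without an env; the caller re-attaches one

m = Model(network='cnn', nsteps=256, env=object())
m2 = pickle.loads(pickle.dumps(m))
assert m2.network == 'cnn' and m2.env is None
```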
Peter Zhokhov
4ec308aaa4 fixed syntax 2018-08-30 13:41:38 -07:00
Peter Zhokhov
3bbf3f3511 allow_early_resets=True in create_vec_env 2018-08-30 13:41:38 -07:00
Joshua Meier
e5de29a954 instructions for tensorboard (#61) 2018-08-30 13:41:37 -07:00
Joshua Meier
2507d335f9 Tensorboard util (#60)
* separate_validation_set was not imported

* launching tensorboard automatically
2018-08-30 13:41:37 -07:00
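The tensorboard instructions referenced in #61 and #60 amount to pointing the baselines logger at a tensorboard writer; a short sketch, assuming the `OPENAI_LOG_FORMAT` and `OPENAI_LOGDIR` environment variables are read by `logger.configure()` as in current baselines:

```python
import os

# Assumption: logger.configure() picks these up, as it does in current baselines.
os.environ['OPENAI_LOG_FORMAT'] = 'stdout,tensorboard'
os.environ['OPENAI_LOGDIR'] = '/tmp/baselines-tb'

from baselines import logger
logger.configure()              # writes to stdout and to tensorboard event files
logger.logkv('eprewmean', 1.0)
logger.dumpkvs()
# inspect with: tensorboard --logdir /tmp/baselines-tb
```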
Damien Lancry
bdd4d385a6 Fix result_plotters in vectorized mujoco environments (#533)
* I investigated running training in a vectorized, monitored mujoco env and found that the 0.monitor.csv file could not be plotted using the baselines.results_plotter.py functions. Moreover, the seed is the same in every parallel environment due to the particular behaviour of lambda. This fixes both issues without breaking the function in other files (baselines.acktr.run_mujoco still works)

* unifies make_atari_env and make_mujoco_env

* redefine make_mujoco_env because of run_mujoco in acktr not compatible with DummyVecEnv and SubprocVecEnv

* fix if else

* Update run.py
2018-08-28 17:48:56 -07:00
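The "particular behaviour of lambda" mentioned in #533 is Python's late-binding closures; a standalone illustration of the pitfall and the usual default-argument fix (this is not the actual cmd_util code):

```python
# Late binding: every thunk sees the final value of `rank`,
# so every parallel env would end up with the same seed offset.
broken = [lambda: rank for rank in range(4)]
print([f() for f in broken])          # [3, 3, 3, 3]

# Binding `rank` as a default argument captures its value at definition time.
fixed = [lambda rank=rank: rank for rank in range(4)]
print([f() for f in fixed])           # [0, 1, 2, 3]
```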
Peter Zhokhov
0961f5dd94 git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "95a81e86"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "c6c0f45c"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-08-27 16:40:14 -07:00
Christopher Hesse
337d913a8f remove reset_task from subproc vec env (#45) 2018-08-27 16:40:14 -07:00
Karl Cobbe
34af61a132 baselines: fix dummy vec env render mode (#42) 2018-08-27 16:40:14 -07:00
Christopher Hesse
1ea5ec647c export SimpleEnv and assert_envs_equal, fix minor bug in action space (#46) 2018-08-27 16:40:14 -07:00
pzhokhov
2fc7a1cbee Trigger benchmarks from buildkite (#40)
* rig buildkite pipeline to run benchmarks when commit ends with RUN BENCHMARKS

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file - merge test and benchmark steps

* fix the buildkite pipeline file - merge test and benchmark steps

* fix buildkite pipeline file

* fix buildkite pipeline file

* dry RUN BENCHMARKS

* dry RUN BENCHMARKS

* dry not run BENCHMARKS

* not run benchmarks

* not running benchmarks

* no running benchmarks

* no running benchmarks

* still not running benchmarks

* dummy commit to RUN BENCHMARKS

* trigger benchmarks from buildkite RUN BENCHMARKS

* specifying RCALL_KUBE_CLUSTER RUN BENCHMARKS

* remove rl-algs/run-benchmarks-new.py (moved to ci), merged baselines/common/console_util and baselines/common/util.py

* added missing imports in console_util

* clone subrepo over https
2018-08-27 16:40:14 -07:00
John Schulman
14c1d69ef4 Reduce duplication in VecEnv subclasses. (#38)
* Reduce duplication in VecEnv subclasses.
Now VecEnv base class handles rendering and closing; subclasses should provide get_images and (optionally) close_extras.

* fix tests

* minor docstring change

* raise NotImplementedError
2018-08-27 16:40:13 -07:00
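A minimal sketch of a subclass under that contract, assuming the base constructor stores `(num_envs, observation_space, action_space)` and that the base class's `render()`/`close()` call `get_images()`/`close_extras()`; the shapes and spaces below are placeholders, not the real API usage:

```python
import numpy as np
from baselines.common.vec_env import VecEnv

class ToyVecEnv(VecEnv):
    """Illustrative only: rendering and closing live in the VecEnv base class;
    the subclass supplies get_images() and (optionally) close_extras()."""

    def __init__(self, num_envs=2):
        # Assumption: the base constructor stores num_envs and the two spaces.
        super().__init__(num_envs, observation_space=None, action_space=None)
        self.frames = np.zeros((num_envs, 84, 84, 3), dtype=np.uint8)

    def reset(self):
        return self.frames.copy()

    def step_async(self, actions):
        self.actions = actions

    def step_wait(self):
        obs = self.frames.copy()
        rews = np.zeros(self.num_envs, dtype=np.float32)
        dones = np.zeros(self.num_envs, dtype=bool)
        return obs, rews, dones, [{} for _ in range(self.num_envs)]

    def get_images(self):
        return list(self.frames)   # one RGB array per env; the base class tiles them for render()

    def close_extras(self):
        pass                       # nothing external (processes, viewers) to clean up here
```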
pzhokhov
c8f6d8bac7 address rl-algs issue #169 (missing util functions from rcall) (#30)
* copied parts of util.py to baselines.common from rcall

* merged fix for baselines.logger, resolved conflicts

* copied ccap to baselines/baselines/common/util.py
2018-08-27 16:40:13 -07:00
pzhokhov
3a006ba50e flake8 fixes (#35)
* flake8 fixes

* added baselines/setup.cfg

* style checks using setup.cfg in baselines
2018-08-27 16:40:13 -07:00
Tom
c6c0f45cb1 fix 'async' is a reserved word in Python >= 3.7 (#495) (#542) 2018-08-27 12:36:43 -07:00
wangjksjtu
e92a6ad8f4 Update README.md (#537)
1. Delete repetitive section
2. Align the commands
2018-08-27 12:35:48 -07:00
HelgeS
92b9a37257 Updated example commands to run ppo2 (#534)
The headline mentions PPO, but the command was for A2C
2018-08-23 15:58:27 -07:00
Armin Primadi
cb14da96ca Fix typo on policies documentation (#535) 2018-08-23 15:56:13 -07:00
pzhokhov
3900f2a447 baselines issue 146 (remove tensorflow from setup.py) (#34)
* baselines does not reinstall tensorflow

* fix the version check in baselines/setup.py

* replace print and assert with assert, str (thanks @csh)
2018-08-21 16:59:05 -07:00
pzhokhov
20d22a5d79 Fix baselines build (fails due to lack of mujoco in public baselines container) (#29)
* make nminibatches = min(nminibatches, nenv)

* clarify the usage of lstm policy, add an example and a test

* cleaned up example, added assert to the test

* remove nminibatches -> min(nminibatches, num_env)

* removed code snippet from the docstring, pointing to the file

* add _mujoco_present flag to skip the tests that require mujoco if mujoco is not present

* re-format skip message in test_doc_examples

* flake8 complaints
2018-08-21 10:08:24 -07:00
pzhokhov
caf7b08b4d Baselines issue #525 (lack of docs for recurrent policies) (#27)
* make nminibatches = min(nminibatches, nenv)

* clarify the usage of lstm policy, add an example and a test

* cleaned up example, added assert to the test

* remove nminibatches -> min(nminibatches, num_env)

* removed code snippet from the docstring, pointing to the file
2018-08-20 13:55:35 -07:00
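The recurrent-policy usage being documented here looks roughly as follows under the refactored API (a hedged sketch; argument names assumed from the post-refactor `ppo2.learn`, with the key constraint that `nminibatches` must not exceed, and should divide, the number of environments):

```python
from baselines.common.cmd_util import make_atari_env
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.ppo2 import ppo2

nenvs = 8
env = VecFrameStack(make_atari_env('PongNoFrameskip-v4', nenvs, seed=0), 4)
# Recurrent policies build minibatches out of whole environments, so the number
# of envs must be divisible by (and at least) nminibatches.
model = ppo2.learn(network='lstm', env=env, total_timesteps=100000,
                   nsteps=128, nminibatches=nenvs)
```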
Peter Zhokhov
ca0165cdf5 flake8 complaints 2018-08-17 18:11:00 -07:00
pzhokhov
eb5b605f86 restore subrepo conftest.py files (#22)
* restore conftest.py in subrepos

* remove conftest files from subrepos in the docker image

* remove runslow flag from baselines .travis.yml and rl-algs ci/runtests.sh

* move import of rendering module into the code to fix tests that don't require a display

* restore the dockerfile
2018-08-17 17:02:39 -07:00
Peter Zhokhov
a89bee3c8d Merge commit 'refs/subrepo/baselines/fetch' into subrepo/baselines 2018-08-17 13:55:27 -07:00
pzhokhov
353bb15e90 deduplicate algorithms in rl-algs and baselines (#18)
* move vec_env

* cleaning up rl_common

* tests are passing (but most tests are deleted as moved to baselines)

* add benchmark runner for smoke tests

* removed duplicated algos

* route references to rl_algs.a2c to baselines.a2c

* route references to rl_algs.a2c to baselines.a2c

* unify conftest.py

* removing references to duplicated algs from codegen

* removing references to duplicated algs from codegen

* alex's changes to dummy_vec_env

* fixed test_carpole[deepq] testcase by decreasing number of training steps... alex's changes seemed to have fixed the bug and make it train better, but at seed=0 there is a dip in the training curve at 30k steps that fails the test

* codegen tests with atol=1e-6 seem to be unstable

* rl_common.vec_env -> baselines.common.vec_env mass replace

* fixed reference in trpo_mpi

* a2c.util references

* restored rl_algs.bench in sonic_prob

* fix reference in ci/runtests.sh

* simplified expression in baselines/common/cmd_util

* further increased rtol to 1e-3 in codegen tests

* switched vecenvs to use SimpleImageViewer from gym instead of cv2

* make run.py --play option work with num_envs > 1

* make rosenbrock test reproducible

* git subrepo pull (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "e23524a5"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "bcde04e7"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* updated baselines README (num-timesteps --> num_timesteps)

* typo in deepq/README.md
2018-08-17 13:54:11 -07:00
pzhokhov
64c0c0a043 Setup travis (#12)
* re-setting up travis

* re-setting up travis

* resolved merge conflicts, added missing dependency for codegen

* removed parallel tests (workers are failing for some reason)

* try test baselines only

* added language options - some weirdness in rcall image that requires them?

* added verbosity to tests

* try tests in baselines only

* ci/runtests.sh tests codegen (some failure on baselines specifically on travis, trying to narrow down the problem)

* removed render from codegen test - maybe that's the problem?

* trying even simpler command within the image to figure out the problem

* print out system info in ci/runtests.sh

* print system info outside of docker as well

* trying single test file in codegen

* install graphviz in the docker image

* git subrepo pull baselines

subrepo:
  subdir:   "baselines"
  merged:   "8c2aea2"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "8c2aea2"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* added graphviz to the dockerfile (need both graphviz-dev and graphviz)

* only tests in codegen/algo/test_algo_builder.py

* run baselines tests only. still no clue why collection of codegen tests fails

* update baselines setup to install filelock for tests

* run slow tests

* skip slow tests in baselines

* single test file in baselines

* try reinstalling tensorflow

* running slow tests

* try full baselines and codegen test suite

* in the test Dockerfile, reinstall tensorflow

* using fake display for codegen render tests

* fixed display-related failures by adding a custom entrypoint to the docker image

* set LC_ALL and LANG env variables in docker image

* try sequential tests

* include psutil in requirements; increase relative tolerance in test_low_level_algo_distr

* trying to fix codegen failures on travis

* git subrepo commit (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "9ce84da"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "b222dd0"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* syntax in install.py

* changing the order of package installation

* removed supervised-reptile from installation list

* cron uses the full games repo in rcall

* flake8 complaints

* rewrite all extras logic in baselines, install.py always uses [all]
2018-08-17 13:54:10 -07:00
pzhokhov
5fee99e771 Setup travis (#12)
* re-setting up travis

* re-setting up travis

* resolved merge conflicts, added missing dependency for codegen

* removed parallel tests (workers are failing for some reason)

* try test baselines only

* added language options - some weirdness in rcall image that requires them?

* added verbosity to tests

* try tests in baselines only

* ci/runtests.sh tests codegen (some failure on baselines specifically on travis, trying to narrow down the problem)

* removed render from codegen test - maybe that's the problem?

* trying even simpler command within the image to figure out the problem

* print out system info in ci/runtests.sh

* print system info outside of docker as well

* trying single test file in codegen

* install graphviz in the docker image

* git subrepo pull baselines

subrepo:
  subdir:   "baselines"
  merged:   "8c2aea2"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "8c2aea2"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* added graphviz to the dockerfile (need both graphviz-dev and graphviz)

* only tests in codegen/algo/test_algo_builder.py

* run baselines tests only. still no clue why collection of codegen tests fails

* update baselines setup to install filelock for tests

* run slow tests

* skip slow tests in baselines

* single test file in baselines

* try reinstalling tensorflow

* running slow tests

* try full baselines and codegen test suite

* in the test Dockerfile, reinstall tensorflow

* using fake display for codegen render tests

* fixed display-related failures by adding a custom entrypoint to the docker image

* set LC_ALL and LANG env variables in docker image

* try sequential tests

* include psutil in requirements; increase relative tolerance in test_low_level_algo_distr

* trying to fix codegen failures on travis

* git subrepo commit (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "9ce84da"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "b222dd0"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* syntax in install.py

* changing the order of package installation

* removed supervised-reptile from installation list

* cron uses the full games repo in rcall

* flake8 complaints

* rewrite all extras logic in baselines, install.py always uses [all]
2018-08-17 13:40:02 -07:00
Youngjin Kim
5edcd6886e Fix argument error in deepq (#508)
* Fix argument error in deepq

* Fix argument error in deepq
2018-08-16 14:55:57 -07:00
Youngjin Kim
bcde04e710 Fix argument error in deepq (#508)
* Fix argument error in deepq

* Fix argument error in deepq
2018-08-16 14:55:57 -07:00
pzhokhov
cd375ab209 update readmes (#514)
* update per-algorithm READMEs to reflect new way of running algorithms

* adding a link to repo-wide README

* updated README files and deepq.train_cartpole example
2018-08-16 14:53:49 -07:00
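The `deepq.train_cartpole` example mentioned above reduces, under the refactored API, to roughly the following (a hedged sketch; the hyperparameter values are illustrative):

```python
import gym
from baselines import deepq

env = gym.make('CartPole-v0')
# Trains a small fully-connected Q-network; returns an act/model object
# that can be saved and used for greedy rollouts afterwards.
act = deepq.learn(env, network='mlp', total_timesteps=100000, print_freq=10)
env.close()
```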
pzhokhov
5622a09fa4 update readmes (#514)
* update per-algorithm READMEs to reflect new way of running algorithms

* adding a link to repo-wide README

* updated README files and deepq.train_cartpole example
2018-08-16 14:53:49 -07:00
Pim de Haan
e2da7cd42f Several bugfixes for #504, #505, #506 related to Classic Control and deepq (#507)
* Several bugfixes

* Fixed ActWrapper.step bug
2018-08-16 12:08:53 -07:00
Peter Zhokhov
b222dd0610 updated links in README to point to master 2018-08-13 16:01:24 -07:00
pzhokhov
1870685071 Publish benchmark results (#502)
* updated benchmark pages with final rewards

* use htmlpreview to render pages

* use htmlpreview to render pages

* use htmlpreview to render pages

* updated README to reflect ppo1 being obsolete

* removed navbars from published benchmark pages

* fixed link in README
2018-08-13 15:59:43 -07:00
pzhokhov
8c2aea2add refactor a2c, acer, acktr, ppo2, deepq, and trpo_mpi (#490)
* exported rl-algs

* more stuff from rl-algs

* run slow tests

* re-exported rl_algs

* re-exported rl_algs - fixed problems with serialization test and test_cartpole

* replaced atari_arg_parser with common_arg_parser

* run.py can run algos from both baselines and rl_algs

* added approximate humanoid reward with ppo2 into the README for reference

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* very dummy commit to RUN BENCHMARKS

* serialize variables as a dict, not as a list

* running_mean_std uses tensorflow variables

* fixed import in vec_normalize

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* flake8 complaints

* save all variables to make sure we save the vec_normalize normalization

* benchmarks on ppo2 only RUN BENCHMARKS

* make_atari_env compatible with mpi

* run ppo_mpi benchmarks only RUN BENCHMARKS

* hardcode names of retro environments

* add defaults

* changed default ppo2 lr schedule to linear RUN BENCHMARKS

* non-tf normalization benchmark RUN BENCHMARKS

* use ncpu=1 for mujoco sessions - gives a bit of a performance speedup

* reverted running_mean_std to use property decorators for mean, var, count

* reverted VecNormalize to use RunningMeanStd (no tf)

* reverted VecNormalize to use RunningMeanStd (no tf)

* profiling wip

* use VecNormalize with regular RunningMeanStd

* added acer runner (missing import)

* flake8 complaints

* added a note in README about TfRunningMeanStd and serialization of VecNormalize

* dummy commit to RUN BENCHMARKS

* merged benchmarks branch
2018-08-13 09:56:44 -07:00
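The "serialize variables as a dict, not as a list" bullet corresponds to name-keyed save/load helpers; a hedged sketch of the idea (close to, but not necessarily identical to, the `tf_util` helpers):

```python
import joblib
import tensorflow as tf

def save_variables(save_path, variables=None, sess=None):
    # Save {variable_name: value} so restoring does not depend on list order.
    sess = sess or tf.get_default_session()
    variables = variables or tf.trainable_variables()
    values = sess.run(variables)
    joblib.dump({v.name: value for v, value in zip(variables, values)}, save_path)

def load_variables(load_path, variables=None, sess=None):
    # Assign each stored value back to the variable with the matching name.
    sess = sess or tf.get_default_session()
    variables = variables or tf.trainable_variables()
    loaded = joblib.load(load_path)
    sess.run([v.assign(loaded[v.name]) for v in variables])
```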
Tony Yu Cao
366f486e34 Update README.md (#416)
Update Atari example
2018-08-08 10:42:10 -07:00
Adam Gleave
f272969325 GAIL: bugfix in dataset loading (#447)
* Fix silly typo

* Replace ad-hoc function with NumPy code
2018-07-06 16:12:14 -07:00
pzhokhov
a6b1bc70f1 re-import internal; fix missing tile_images.py (#427)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity

* import internal

* adding missing tile_images.py
2018-06-08 09:41:45 -07:00
pzhokhov
36ee5d1707 Import internal changes (#422)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity

* import internal
2018-06-06 11:39:13 -07:00
pzhokhov
24fe3d6576 Import internal repo (#409)
* import rl-algs from 2e3a166 commit

* extra import of the baselines badge

* exported commit with identity test

* proper rng seeding in the test_identity
2018-05-21 15:24:00 -07:00
pzhokhov
9cb7ece338 add opencv-python to the dependencies (#407) 2018-05-14 10:52:19 -07:00
pzhokhov
9cf95a0054 setup travis ci build (#388)
* simple .travis.yml file

* added static syntax checks of common to .travis.yml

* dockerizing the build

* fix Dockerfile, adding build shield

* cleaning up workdir in Dockerfile and .travis.yml

* .travis.yml fixed common -> baselines/common for style check
2018-05-03 09:43:28 -07:00
pzhokhov
8b781038cc put filters and running_stat files in common instead of acktr (#389) 2018-05-02 18:42:48 -07:00
pzhokhov
69f25c6028 import internal repo (#385) 2018-05-01 16:54:04 -07:00
pzhokhov
2b0283b9db Readme.md detailed installation instructions (#377)
* changes to README.md files with more detailed installation instructions

* md-fying the changes better

* link on the word homebrew in readme.md

* typos in README.md

* README.md

* removed extra comma sign

* removed sudo from brew command
2018-04-25 17:40:48 -07:00
Matthias Plappert
1f8a03f3a6 Update README 2018-03-26 16:50:22 +02:00
Matthias Plappert
3cc7df0608 Minor fixes to HER release (#319)
* Fix plotting script

* Add warning if num_cpu = 1
2018-03-05 11:06:17 +01:00
112 changed files with 22027 additions and 1273 deletions

1
.benchmark_pattern Normal file

@@ -0,0 +1 @@

2
.gitignore vendored

@@ -34,5 +34,3 @@ src
.cache
MUJOCO_LOG.TXT

14
.travis.yml Normal file

@@ -0,0 +1,14 @@
language: python
python:
- "3.6"
services:
- docker
install:
- pip install flake8
- docker build . -t baselines-test
script:
- flake8 .
- docker run baselines-test pytest -v .

25
Dockerfile Normal file

@@ -0,0 +1,25 @@
FROM ubuntu:16.04
RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake python-opencv
ENV CODE_DIR /root/code
ENV VENV /root/venv
RUN \
pip install virtualenv && \
virtualenv $VENV --python=python3 && \
. $VENV/bin/activate && \
pip install --upgrade pip
ENV PATH=$VENV/bin:$PATH
COPY . $CODE_DIR/baselines
WORKDIR $CODE_DIR/baselines
# Clean up pycache and pyc files
RUN rm -rf __pycache__ && \
find . -name "*.pyc" -delete && \
pip install tensorflow && \
pip install -e .[test]
CMD /bin/bash

109
README.md

@@ -1,4 +1,4 @@
<img src="data/logo.jpg" width=25% align="right" />
<img src="data/logo.jpg" width=25% align="right" /> [![Build status](https://travis-ci.org/openai/baselines.svg?branch=master)](https://travis-ci.org/openai/baselines)
# Baselines
@@ -6,13 +6,104 @@ OpenAI Baselines is a set of high-quality implementations of reinforcement learn
These algorithms will make it easier for the research community to replicate, refine, and identify new ideas, and will create good baselines to build research on top of. Our DQN implementation and its variants are roughly on par with the scores in published papers. We expect they will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones.
You can install it by typing:
## Prerequisites
Baselines requires python3 (>=3.5) with the development headers. You'll also need system packages CMake, OpenMPI and zlib. Those can be installed as follows
### Ubuntu
```bash
sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev
```
### Mac OS X
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
```bash
brew install cmake openmpi
```
## Virtual environment
From the general python package sanity perspective, it is a good idea to use virtual environments (virtualenvs) to make sure packages from different projects do not interfere with each other. You can install virtualenv (which is itself a pip package) via
```bash
pip install virtualenv
```
Virtualenvs are essentially folders that contain copies of the python executable and all python packages.
To create a virtualenv called venv with python3, run:
```bash
virtualenv /path/to/venv --python=python3
```
To activate a virtualenv:
```
. /path/to/venv/bin/activate
```
A more thorough tutorial on virtualenvs and their options can be found [here](https://virtualenv.pypa.io/en/stable/)
## Installation
Clone the repo and cd into it:
```bash
git clone https://github.com/openai/baselines.git
cd baselines
```
If using virtualenv, create a new virtualenv and activate it
```bash
virtualenv env --python=python3
. env/bin/activate
```
Install baselines package
```bash
pip install -e .
```
### MuJoCo
Some of the baselines examples use the [MuJoCo](http://www.mujoco.org) (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (a temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py)
## Testing the installation
All unit tests in baselines can be run using pytest runner:
```
pip install pytest
pytest
```
## Training models
Most of the algorithms in baselines repo are used as follows:
```bash
python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
```
### Example 1. PPO with MuJoCo Humanoid
For instance, to train a fully-connected network controlling the MuJoCo humanoid using PPO2 for 20M timesteps:
```bash
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
```
Note that for MuJoCo environments the fully-connected network is the default, so we can omit `--network=mlp`
The hyperparameters of both the network and the learning algorithm can be controlled via the command line, for instance:
```bash
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
```
will set the entropy coefficient to 0.1, construct a fully connected network with 3 layers of 32 hidden units each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same)
See the docstrings in [common/models.py](common/models.py) for a description of the network parameters for each type of model, and
the docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) for a description of the ppo2 hyperparameters.
### Example 2. DQN on Atari
DQN with Atari is at this point a classic benchmark. To run the baselines implementation of DQN on Atari Pong:
```
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
```
## Saving, loading and visualizing models
The algorithms' serialization API is not properly unified yet; however, there is a simple method to save / restore trained models.
The `--load_path` and `--save_path` command-line options load the tensorflow state from a given path before training and save it after training, respectively.
Let's imagine you'd like to train ppo2 on Atari Pong, save the model and then later visualize what it has learned.
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
```
This should reach a mean reward per episode of about 5k. To load and visualize the model, we'll do the following: load the model, train it for 0 steps, and then visualize:
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
```
*NOTE:* At the moment MuJoCo training uses the VecNormalize wrapper for the environment, which is not being saved correctly; so loading models trained on MuJoCo will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd with TfRunningMeanStd in [baselines/common/vec_env/vec_normalize.py](baselines/common/vec_env/vec_normalize.py#L12). This way, the mean and std of the environment-normalizing wrapper will be saved in tensorflow variables and included in the model file; however, training is slower that way, hence it is not enabled by default
## Subpackages
- [A2C](baselines/a2c)
- [ACER](baselines/acer)
@@ -21,10 +112,19 @@ pip install -e .
- [DQN](baselines/deepq)
- [GAIL](baselines/gail)
- [HER](baselines/her)
- [PPO1](baselines/ppo1) (Multi-CPU using MPI)
- [PPO2](baselines/ppo2) (Optimized for GPU)
- [PPO1](baselines/ppo1) (obsolete version, left here temporarily)
- [PPO2](baselines/ppo2)
- [TRPO](baselines/trpo_mpi)
## Benchmarks
Results of benchmarks on Mujoco (1M timesteps) and Atari (10M timesteps) are available
[here for Mujoco](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm)
and
[here for Atari](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_atari10M.htm)
respectively. Note that these results may not be based on the latest version of the code; the particular commit hash with which the results were obtained is specified on the benchmarks page.
To cite this repository in publications:
@misc{baselines,
@@ -35,3 +135,4 @@ To cite this repository in publications:
journal = {GitHub repository},
howpublished = {\url{https://github.com/openai/baselines}},
}

baselines/a2c/README.md

@@ -2,4 +2,5 @@
- Original paper: https://arxiv.org/abs/1602.01783
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.a2c.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on Atari Pong. See help (`-h`) for more options
- also refer to the repo-wide [README.md](../../README.md#training-models)

baselines/a2c/a2c.py

@@ -1,47 +1,48 @@
import os
import os.path as osp
import gym
import time
import joblib
import logging
import numpy as np
import functools
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.atari_wrappers import wrap_deepmind
from baselines.common import tf_util
from baselines.common.policies import build_policy
from baselines.a2c.utils import discount_with_dones
from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
from baselines.a2c.utils import cat_entropy, mse
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.runner import Runner
from tensorflow import losses
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps,
def __init__(self, policy, env, nsteps,
ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5, lr=7e-4,
alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):
sess = tf_util.make_session()
nact = ac_space.n
sess = tf_util.get_session()
nenvs = env.num_envs
nbatch = nenvs*nsteps
A = tf.placeholder(tf.int32, [nbatch])
with tf.variable_scope('a2c_model', reuse=tf.AUTO_REUSE):
step_model = policy(nenvs, 1, sess)
train_model = policy(nbatch, nsteps, sess)
A = tf.placeholder(train_model.action.dtype, train_model.action.shape)
ADV = tf.placeholder(tf.float32, [nbatch])
R = tf.placeholder(tf.float32, [nbatch])
LR = tf.placeholder(tf.float32, [])
step_model = policy(sess, ob_space, ac_space, nenvs, 1, reuse=False)
train_model = policy(sess, ob_space, ac_space, nenvs*nsteps, nsteps, reuse=True)
neglogpac = train_model.pd.neglogp(A)
entropy = tf.reduce_mean(train_model.pd.entropy())
neglogpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
pg_loss = tf.reduce_mean(ADV * neglogpac)
vf_loss = tf.reduce_mean(mse(tf.squeeze(train_model.vf), R))
entropy = tf.reduce_mean(cat_entropy(train_model.pi))
vf_loss = losses.mean_squared_error(tf.squeeze(train_model.vf), R)
loss = pg_loss - entropy*ent_coef + vf_loss * vf_coef
params = find_trainable_variables("model")
params = find_trainable_variables("a2c_model")
grads = tf.gradients(loss, params)
if max_grad_norm is not None:
grads, grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
@@ -55,6 +56,7 @@ class Model(object):
advs = rewards - values
for step in range(len(obs)):
cur_lr = lr.value()
td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, LR:cur_lr}
if states is not None:
td_map[train_model.S] = states
@@ -65,17 +67,6 @@ class Model(object):
)
return policy_loss, value_loss, policy_entropy
def save(save_path):
ps = sess.run(params)
make_path(osp.dirname(save_path))
joblib.dump(ps, save_path)
def load(load_path):
loaded_params = joblib.load(load_path)
restores = []
for p, loaded_p in zip(params, loaded_params):
restores.append(p.assign(loaded_p))
ps = sess.run(restores)
self.train = train
self.train_model = train_model
@@ -83,77 +74,87 @@ class Model(object):
self.step = step_model.step
self.value = step_model.value
self.initial_state = step_model.initial_state
self.save = save
self.load = load
self.save = functools.partial(tf_util.save_variables, sess=sess)
self.load = functools.partial(tf_util.load_variables, sess=sess)
tf.global_variables_initializer().run(session=sess)
class Runner(object):
def __init__(self, env, model, nsteps=5, gamma=0.99):
self.env = env
self.model = model
nh, nw, nc = env.observation_space.shape
nenv = env.num_envs
self.batch_ob_shape = (nenv*nsteps, nh, nw, nc)
self.obs = np.zeros((nenv, nh, nw, nc), dtype=np.uint8)
self.nc = nc
obs = env.reset()
self.gamma = gamma
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
def learn(
network,
env,
seed=None,
nsteps=5,
total_timesteps=int(80e6),
vf_coef=0.5,
ent_coef=0.01,
max_grad_norm=0.5,
lr=7e-4,
lrschedule='linear',
epsilon=1e-5,
alpha=0.99,
gamma=0.99,
log_interval=100,
load_path=None,
**network_kwargs):
'''
Main entrypoint for A2C algorithm. Train a policy with given network architecture on a given environment using a2c algorithm.
Parameters:
-----------
network: policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
env: RL environment. Should implement interface similar to VecEnv (baselines.common/vec_env) or be wrapped with DummyVecEnv (baselines.common/vec_env/dummy_vec_env.py)
seed: seed to make the random number sequence in the algorithm reproducible. Defaults to None, which means seed from the system noise generator (not reproducible)
nsteps: int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
nenv is number of environment copies simulated in parallel)
total_timesteps: int, total number of timesteps to train on (default: 80M)
vf_coef: float, coefficient in front of value function loss in the total loss function (default: 0.5)
ent_coef: float, coefficient in front of the policy entropy in the total loss function (default: 0.01)
max_grad_norm: float, gradient is clipped to have global L2 norm no more than this value (default: 0.5)
lr: float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)
lrschedule: schedule of learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes fraction of the training progress as input and
returns fraction of the learning rate (specified as lr) as output
epsilon: float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
alpha: float, RMSProp decay parameter (default: 0.99)
gamma: float, reward discounting parameter (default: 0.99)
log_interval: int, specifies how frequently the logs are printed out (default: 100)
**network_kwargs: keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
For instance, 'mlp' network architecture has arguments num_hidden and num_layers.
'''
def run(self):
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
mb_states = self.states
for n in range(self.nsteps):
actions, values, states, _ = self.model.step(self.obs, self.states, self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_values.append(values)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
self.states = states
self.dones = dones
for n, done in enumerate(dones):
if done:
self.obs[n] = self.obs[n]*0
self.obs = obs
mb_rewards.append(rewards)
mb_dones.append(self.dones)
#batch of steps to batch of rollouts
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0).reshape(self.batch_ob_shape)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones[:, :-1]
mb_dones = mb_dones[:, 1:]
last_values = self.model.value(self.obs, self.states, self.dones).tolist()
#discount/bootstrap off value fn
for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
rewards = rewards.tolist()
dones = dones.tolist()
if dones[-1] == 0:
rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
else:
rewards = discount_with_dones(rewards, dones, self.gamma)
mb_rewards[n] = rewards
mb_rewards = mb_rewards.flatten()
mb_actions = mb_actions.flatten()
mb_values = mb_values.flatten()
mb_masks = mb_masks.flatten()
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5, lr=7e-4, lrschedule='linear', epsilon=1e-5, alpha=0.99, gamma=0.99, log_interval=100):
tf.reset_default_graph()
set_global_seeds(seed)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
policy = build_policy(env, network, **network_kwargs)
model = Model(policy=policy, env=env, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule)
if load_path is not None:
model.load(load_path)
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
nbatch = nenvs*nsteps
@@ -173,3 +174,5 @@ def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, e
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()
env.close()
return model
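For orientation, the `learn()` entrypoint documented in the docstring above would be called roughly like this (a hedged sketch built only from the signature shown in this diff):

```python
from baselines.common.cmd_util import make_atari_env
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.a2c import a2c

# Four parallel Pong environments, frame-stacked as the old run_atari.py did.
env = VecFrameStack(make_atari_env('PongNoFrameskip-v4', num_env=4, seed=0), 4)
model = a2c.learn(network='cnn', env=env, seed=0, nsteps=5,
                  total_timesteps=int(1e6), lrschedule='linear')
env.close()
```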

baselines/a2c/policies.py

@@ -1,168 +0,0 @@
import numpy as np
import tensorflow as tf
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
from baselines.common.distributions import make_pdtype
def nature_cnn(unscaled_images):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2)))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2)))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2)))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
class LnLstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
nenv = nbatch // nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi = fc(h5, 'pi', nact)
vf = fc(h5, 'v', 1)
self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pi)
v0 = vf[:, 0]
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
def step(ob, state, mask):
return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
def value(ob, state, mask):
return sess.run(v0, {X:ob, S:state, M:mask})
self.X = X
self.M = M
self.S = S
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class LstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
nenv = nbatch // nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi = fc(h5, 'pi', nact)
vf = fc(h5, 'v', 1)
self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pi)
v0 = vf[:, 0]
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
def step(ob, state, mask):
return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
def value(ob, state, mask):
return sess.run(v0, {X:ob, S:state, M:mask})
self.X = X
self.M = M
self.S = S
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class CnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
pi = fc(h, 'pi', nact, init_scale=0.01)
vf = fc(h, 'v', 1)[:,0]
self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pi)
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = None
def step(ob, *_args, **_kwargs):
a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
return a, v, self.initial_state, neglogp
def value(ob, *_args, **_kwargs):
return sess.run(vf, {X:ob})
self.X = X
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class MlpPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
ob_shape = (nbatch,) + ob_space.shape
actdim = ac_space.shape[0]
X = tf.placeholder(tf.float32, ob_shape, name='Ob') #obs
with tf.variable_scope("model", reuse=reuse):
activ = tf.tanh
h1 = activ(fc(X, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
h2 = activ(fc(h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
pi = fc(h2, 'pi', actdim, init_scale=0.01)
h1 = activ(fc(X, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
h2 = activ(fc(h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
vf = fc(h2, 'vf', 1)[:,0]
logstd = tf.get_variable(name="logstd", shape=[1, actdim],
initializer=tf.zeros_initializer())
pdparam = tf.concat([pi, pi * 0.0 + logstd], axis=1)
self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pdparam)
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = None
def step(ob, *_args, **_kwargs):
a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
return a, v, self.initial_state, neglogp
def value(ob, *_args, **_kwargs):
return sess.run(vf, {X:ob})
self.X = X
self.pi = pi
self.vf = vf
self.step = step
self.value = value

baselines/a2c/run_atari.py

@@ -1,30 +0,0 @@
#!/usr/bin/env python3
from baselines import logger
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.a2c.a2c import learn
from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
def train(env_id, num_timesteps, seed, policy, lrschedule, num_env):
if policy == 'cnn':
policy_fn = CnnPolicy
elif policy == 'lstm':
policy_fn = LstmPolicy
elif policy == 'lnlstm':
policy_fn = LnLstmPolicy
env = VecFrameStack(make_atari_env(env_id, num_env, seed), 4)
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
env.close()
def main():
parser = atari_arg_parser()
parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
args = parser.parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
policy=args.policy, lrschedule=args.lrschedule, num_env=16)
if __name__ == '__main__':
main()

60
baselines/a2c/runner.py Normal file

@@ -0,0 +1,60 @@
import numpy as np
from baselines.a2c.utils import discount_with_dones
from baselines.common.runners import AbstractEnvRunner
class Runner(AbstractEnvRunner):
def __init__(self, env, model, nsteps=5, gamma=0.99):
super().__init__(env=env, model=model, nsteps=nsteps)
self.gamma = gamma
self.batch_action_shape = [x if x is not None else -1 for x in model.train_model.action.shape.as_list()]
self.ob_dtype = model.train_model.X.dtype.as_numpy_dtype
def run(self):
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
mb_states = self.states
for n in range(self.nsteps):
actions, values, states, _ = self.model.step(self.obs, S=self.states, M=self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_values.append(values)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
self.states = states
self.dones = dones
for n, done in enumerate(dones):
if done:
self.obs[n] = self.obs[n]*0
self.obs = obs
mb_rewards.append(rewards)
mb_dones.append(self.dones)
#batch of steps to batch of rollouts
mb_obs = np.asarray(mb_obs, dtype=self.ob_dtype).swapaxes(1, 0).reshape(self.batch_ob_shape)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=self.model.train_model.action.dtype.name).swapaxes(1, 0)
mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones[:, :-1]
mb_dones = mb_dones[:, 1:]
if self.gamma > 0.0:
#discount/bootstrap off value fn
last_values = self.model.value(self.obs, S=self.states, M=self.dones).tolist()
for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
rewards = rewards.tolist()
dones = dones.tolist()
if dones[-1] == 0:
rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
else:
rewards = discount_with_dones(rewards, dones, self.gamma)
mb_rewards[n] = rewards
mb_actions = mb_actions.reshape(self.batch_action_shape)
mb_rewards = mb_rewards.flatten()
mb_values = mb_values.flatten()
mb_masks = mb_masks.flatten()
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values

baselines/a2c/utils.py

@@ -1,8 +1,6 @@
import os
import gym
import numpy as np
import tensorflow as tf
from gym import spaces
from collections import deque
def sample(logits):
@@ -10,18 +8,15 @@ def sample(logits):
return tf.argmax(logits - tf.log(-tf.log(noise)), 1)
def cat_entropy(logits):
a0 = logits - tf.reduce_max(logits, 1, keep_dims=True)
a0 = logits - tf.reduce_max(logits, 1, keepdims=True)
ea0 = tf.exp(a0)
z0 = tf.reduce_sum(ea0, 1, keep_dims=True)
z0 = tf.reduce_sum(ea0, 1, keepdims=True)
p0 = ea0 / z0
return tf.reduce_sum(p0 * (tf.log(z0) - a0), 1)
def cat_entropy_softmax(p0):
return - tf.reduce_sum(p0 * tf.log(p0 + 1e-6), axis = 1)
def mse(pred, target):
return tf.square(pred-target)/2.
def ortho_init(scale=1.0):
def _ortho_init(shape, dtype, partition_info=None):
#lasagne ortho init for tf
@@ -39,7 +34,7 @@ def ortho_init(scale=1.0):
return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
return _ortho_init
def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC'):
def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC', one_dim_bias=False):
if data_format == 'NHWC':
channel_ax = 3
strides = [1, stride, stride, 1]
@@ -50,13 +45,15 @@ def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='
bshape = [1, nf, 1, 1]
else:
raise NotImplementedError
bias_var_shape = [nf] if one_dim_bias else [1, nf, 1, 1]
nin = x.get_shape()[channel_ax].value
wshape = [rf, rf, nin, nf]
with tf.variable_scope(scope):
w = tf.get_variable("w", wshape, initializer=ortho_init(init_scale))
b = tf.get_variable("b", [1, nf, 1, 1], initializer=tf.constant_initializer(0.0))
if data_format == 'NHWC': b = tf.reshape(b, bshape)
return b + tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format)
b = tf.get_variable("b", bias_var_shape, initializer=tf.constant_initializer(0.0))
if not one_dim_bias and data_format == 'NHWC':
b = tf.reshape(b, bshape)
return tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format) + b
def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
with tf.variable_scope(scope):
@@ -83,7 +80,6 @@ def seq_to_batch(h, flat = False):
def lstm(xs, ms, s, scope, nh, init_scale=1.0):
nbatch, nin = [v.value for v in xs[0].get_shape()]
nsteps = len(xs)
with tf.variable_scope(scope):
wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
@@ -113,7 +109,6 @@ def _ln(x, g, b, e=1e-5, axes=[1]):
def lnlstm(xs, ms, s, scope, nh, init_scale=1.0):
nbatch, nin = [v.value for v in xs[0].get_shape()]
nsteps = len(xs)
with tf.variable_scope(scope):
wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
gx = tf.get_variable("gx", [nh*4], initializer=tf.constant_initializer(1.0))
@@ -158,8 +153,7 @@ def discount_with_dones(rewards, dones, gamma):
return discounted[::-1]
def find_trainable_variables(key):
with tf.variable_scope(key):
return tf.trainable_variables()
return tf.trainable_variables(key)
def make_path(f):
return os.makedirs(f, exist_ok=True)

baselines/acer/README.md

@@ -1,4 +1,6 @@
# ACER
- Original paper: https://arxiv.org/abs/1611.01224
- `python -m baselines.acer.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.run --alg=acer --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)

View File

@@ -1,17 +1,20 @@
import time
import joblib
import functools
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables
from baselines.a2c.utils import batch_to_seq, seq_to_batch
from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
from baselines.a2c.utils import cat_entropy_softmax
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.utils import EpisodeStats
from baselines.a2c.utils import get_by_index, check_shape, avg_norm, gradient_add, q_explained_variance
from baselines.acer.buffer import Buffer
from baselines.acer.runner import Runner
# remove last step
def strip(var, nenvs, nsteps, flat = False):
@@ -56,10 +59,8 @@ class Model(object):
ent_coef, q_coef, gamma, max_grad_norm, lr,
rprop_alpha, rprop_epsilon, total_timesteps, lrschedule,
c, trust_region, alpha, delta):
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=num_procs,
inter_op_parallelism_threads=num_procs)
sess = tf.Session(config=config)
sess = get_session()
nact = ac_space.n
nbatch = nenvs * nsteps
@@ -69,11 +70,16 @@ class Model(object):
MU = tf.placeholder(tf.float32, [nbatch, nact]) # mu's
LR = tf.placeholder(tf.float32, [])
eps = 1e-6
step_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs,) + ob_space.shape[:-1] + (ob_space.shape[-1] * nstack,))
train_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs*(nsteps+1),) + ob_space.shape[:-1] + (ob_space.shape[-1] * nstack,))
with tf.variable_scope('acer_model', reuse=tf.AUTO_REUSE):
step_model = policy(sess, ob_space, ac_space, nenvs, 1, nstack, reuse=False)
train_model = policy(sess, ob_space, ac_space, nenvs, nsteps + 1, nstack, reuse=True)
step_model = policy(observ_placeholder=step_ob_placeholder, sess=sess)
train_model = policy(observ_placeholder=train_ob_placeholder, sess=sess)
params = find_trainable_variables("model")
params = find_trainable_variables("acer_model")
print("Params {}".format(len(params)))
for var in params:
print(var)
@@ -87,14 +93,20 @@ class Model(object):
print(v.name)
return v
with tf.variable_scope("", custom_getter=custom_getter, reuse=True):
polyak_model = policy(sess, ob_space, ac_space, nenvs, nsteps + 1, nstack, reuse=True)
with tf.variable_scope("acer_model", custom_getter=custom_getter, reuse=True):
polyak_model = policy(observ_placeholder=train_ob_placeholder, sess=sess)
# Notation: (var) = batch variable, (var)s = sequence variable, (var)_i = variable indexed by action at step i
v = tf.reduce_sum(train_model.pi * train_model.q, axis = -1) # shape is [nenvs * (nsteps + 1)]
# action probability distributions according to train_model, polyak_model and step_model
# policy.pi holds the probability distribution parameters; to obtain a distribution that sums to 1, take the softmax
train_model_p = tf.nn.softmax(train_model.pi)
polyak_model_p = tf.nn.softmax(polyak_model.pi)
step_model_p = tf.nn.softmax(step_model.pi)
v = tf.reduce_sum(train_model_p * train_model.q, axis = -1) # shape is [nenvs * (nsteps + 1)]
# strip off last step
f, f_pol, q = map(lambda var: strip(var, nenvs, nsteps), [train_model.pi, polyak_model.pi, train_model.q])
f, f_pol, q = map(lambda var: strip(var, nenvs, nsteps), [train_model_p, polyak_model_p, train_model.q])
# Get pi and q values for actions taken
f_i = get_by_index(f, A)
q_i = get_by_index(q, A)
@@ -107,7 +119,8 @@ class Model(object):
qret = q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma)
# Calculate losses
# Entropy
# Entropy
# entropy = tf.reduce_mean(strip(train_model.pd.entropy(), nenvs, nsteps))
entropy = tf.reduce_mean(cat_entropy_softmax(f))
# Policy Gradient loss, with truncated importance sampling & bias correction
@@ -189,84 +202,29 @@ class Model(object):
def train(obs, actions, rewards, dones, mus, states, masks, steps):
cur_lr = lr.value_steps(steps)
td_map = {train_model.X: obs, polyak_model.X: obs, A: actions, R: rewards, D: dones, MU: mus, LR: cur_lr}
if states != []:
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks
td_map[polyak_model.S] = states
td_map[polyak_model.M] = masks
return names_ops, sess.run(run_ops, td_map)[1:] # strip off _train
def save(save_path):
ps = sess.run(params)
make_path(osp.dirname(save_path))
joblib.dump(ps, save_path)
def _step(observation, **kwargs):
return step_model._evaluate([step_model.action, step_model_p, step_model.state], observation, **kwargs)
self.train = train
self.save = save
self.save = functools.partial(save_variables, sess=sess, variables=params)
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
self._step = _step
self.step = self.step_model.step
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
class Runner(object):
def __init__(self, env, model, nsteps, nstack):
self.env = env
self.nstack = nstack
self.model = model
nh, nw, nc = env.observation_space.shape
self.nc = nc # nc = 1 for atari, but just in case
self.nenv = nenv = env.num_envs
self.nact = env.action_space.n
self.nbatch = nenv * nsteps
self.batch_ob_shape = (nenv*(nsteps+1), nh, nw, nc*nstack)
self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
obs = env.reset()
self.update_obs(obs)
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
def update_obs(self, obs, dones=None):
if dones is not None:
self.obs *= (1 - dones.astype(np.uint8))[:, None, None, None]
self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
self.obs[:, :, :, -self.nc:] = obs[:, :, :, :]
def run(self):
enc_obs = np.split(self.obs, self.nstack, axis=3) # so now list of obs steps
mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
for _ in range(self.nsteps):
actions, mus, states = self.model.step(self.obs, state=self.states, mask=self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_mus.append(mus)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
# states information for stateful models like LSTM
self.states = states
self.dones = dones
self.update_obs(obs, dones)
mb_rewards.append(rewards)
enc_obs.append(obs)
mb_obs.append(np.copy(self.obs))
mb_dones.append(self.dones)
enc_obs = np.asarray(enc_obs, dtype=np.uint8).swapaxes(1, 0)
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones # Used for stateful models like LSTMs to mask state when done
mb_dones = mb_dones[:, 1:] # Used for calculating returns. The dones array is now aligned with rewards
# shapes are now [nenv, nsteps, []]
# When pulling from buffer, arrays will now be reshaped in place, preventing a deep copy.
return enc_obs, mb_obs, mb_actions, mb_rewards, mb_mus, mb_dones, mb_masks
class Acer():
def __init__(self, runner, model, buffer, log_interval):
@@ -312,19 +270,84 @@ class Acer():
logger.dump_tabular()
def learn(policy, env, seed, nsteps=20, nstack=4, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
def learn(network, env, seed=None, nsteps=20, nstack=4, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
max_grad_norm=10, lr=7e-4, lrschedule='linear', rprop_epsilon=1e-5, rprop_alpha=0.99, gamma=0.99,
log_interval=100, buffer_size=50000, replay_ratio=4, replay_start=10000, c=10.0,
trust_region=True, alpha=0.99, delta=1):
trust_region=True, alpha=0.99, delta=1, load_path=None, **network_kwargs):
'''
Main entrypoint for ACER (Actor-Critic with Experience Replay) algorithm (https://arxiv.org/pdf/1611.01224.pdf)
Train an agent with given network architecture on a given environment using ACER.
Parameters:
----------
network: policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
env: environment. Needs to be vectorized for parallel environment simulation.
The environments produced by gym.make can be wrapped using baselines.common.vec_env.DummyVecEnv class.
nsteps: int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
nenv is number of environment copies simulated in parallel) (default: 20)
nstack: int, size of the frame stack, i.e. number of the frames passed to the step model. Frames are stacked along channel dimension
(last image dimension) (default: 4)
total_timesteps: int, number of timesteps (i.e. number of actions taken in the environment) (default: 80M)
q_coef: float, value function loss coefficient in the optimization objective (analog of vf_coef for other actor-critic methods)
ent_coef: float, policy entropy coefficient in the optimization objective (default: 0.01)
max_grad_norm: float, gradient norm clipping coefficient. If set to None, no clipping. (default: 10),
lr: float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)
lrschedule: schedule of learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes fraction of the training progress as input and
returns fraction of the learning rate (specified as lr) as output
rprop_epsilon: float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
rprop_alpha: float, RMSProp decay parameter (default: 0.99)
gamma: float, reward discounting factor (default: 0.99)
log_interval: int, number of updates between logging events (default: 100)
buffer_size: int, size of the replay buffer (default: 50k)
replay_ratio: int, how many (on average) batches of data to sample from the replay buffer for each batch collected from the environment (default: 4)
replay_start: int, the sampling from the replay buffer does not start until replay buffer has at least that many samples (default: 10k)
c: float, importance weight clipping factor (default: 10)
trust_region: bool, whether or not the algorithm estimates the KL divergence between the old and updated policy and uses it to determine the step size (default: True)
delta: float, max KL divergence between the old policy and updated policy (default: 1)
alpha: float, momentum factor in the Polyak (exponential moving average) averaging of the model parameters (default: 0.99)
load_path: str, path to load the model from (default: None)
**network_kwargs: keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
For instance, 'mlp' network architecture has arguments num_hidden and num_layers.
'''
print("Running Acer Simple")
print(locals())
tf.reset_default_graph()
set_global_seeds(seed)
policy = build_policy(env, network, estimate_q=True, **network_kwargs)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
num_procs = len(env.remotes) # HACK
num_procs = len(env.remotes) if hasattr(env, 'remotes') else 1  # HACK
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, nstack=nstack,
num_procs=num_procs, ent_coef=ent_coef, q_coef=q_coef, gamma=gamma,
max_grad_norm=max_grad_norm, lr=lr, rprop_alpha=rprop_alpha, rprop_epsilon=rprop_epsilon,
@@ -339,6 +362,7 @@ def learn(policy, env, seed, nsteps=20, nstack=4, total_timesteps=int(80e6), q_c
nbatch = nenvs*nsteps
acer = Acer(runner, model, buffer, log_interval)
acer.tstart = time.time()
for acer.steps in range(0, total_timesteps, nbatch): #nbatch samples, 1 on_policy call and multiple off-policy calls
acer.call(on_policy=True)
if replay_ratio > 0 and buffer.has_atleast(replay_start):
@@ -347,3 +371,4 @@ def learn(policy, env, seed, nsteps=20, nstack=4, total_timesteps=int(80e6), q_c
acer.call(on_policy=False) # no simulation steps in this
env.close()
return model
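A minimal usage sketch of the refactored learn signature above. The environment construction and hyperparameters here are illustrative only, not part of the diff:

from baselines.common.cmd_util import make_vec_env
from baselines.acer.acer_simple import learn

# ACER does its own frame stacking via nstack, so the env is not wrapped in VecFrameStack here
env = make_vec_env('PongNoFrameskip-v4', 'atari', num_env=4, seed=0)
model = learn(network='cnn', env=env, seed=0, nsteps=20, nstack=4, total_timesteps=int(1e6))
env.close()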


@@ -0,0 +1,4 @@
def atari():
return dict(
lrschedule='constant'
)


@@ -1,6 +1,6 @@
import numpy as np
import tensorflow as tf
from baselines.ppo2.policies import nature_cnn
from baselines.common.policies import nature_cnn
from baselines.a2c.utils import fc, batch_to_seq, seq_to_batch, lstm, sample
@@ -18,11 +18,13 @@ class AcerCnnPolicy(object):
pi = tf.nn.softmax(pi_logits)
q = fc(h, 'q', nact)
a = sample(pi_logits) # could change this to use self.pi instead
a = sample(tf.nn.softmax(pi_logits)) # could change this to use self.pi instead
self.initial_state = [] # not stateful
self.X = X
self.pi = pi # actual policy params now
self.pi_logits = pi_logits
self.q = q
self.vf = q
def step(ob, *args, **kwargs):
# returns actions, mus, states


@@ -1,30 +0,0 @@
#!/usr/bin/env python3
from baselines import logger
from baselines.acer.acer_simple import learn
from baselines.acer.policies import AcerCnnPolicy, AcerLstmPolicy
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
def train(env_id, num_timesteps, seed, policy, lrschedule, num_cpu):
env = make_atari_env(env_id, num_cpu, seed)
if policy == 'cnn':
policy_fn = AcerCnnPolicy
elif policy == 'lstm':
policy_fn = AcerLstmPolicy
else:
print("Policy {} not implemented".format(policy))
return
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
env.close()
def main():
parser = atari_arg_parser()
parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
parser.add_argument('--logdir', help ='Directory for logging')
args = parser.parse_args()
logger.configure(args.logdir)
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
policy=args.policy, lrschedule=args.lrschedule, num_cpu=16)
if __name__ == '__main__':
main()

baselines/acer/runner.py (new file, 60 lines added)

@@ -0,0 +1,60 @@
import numpy as np
from baselines.common.runners import AbstractEnvRunner
class Runner(AbstractEnvRunner):
def __init__(self, env, model, nsteps, nstack):
super().__init__(env=env, model=model, nsteps=nsteps)
self.nstack = nstack
nh, nw, nc = env.observation_space.shape
self.nc = nc # nc = 1 for atari, but just in case
self.nact = env.action_space.n
nenv = self.nenv
self.nbatch = nenv * nsteps
self.batch_ob_shape = (nenv*(nsteps+1), nh, nw, nc*nstack)
self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
obs = env.reset()
self.update_obs(obs)
def update_obs(self, obs, dones=None):
#self.obs = obs
if dones is not None:
self.obs *= (1 - dones.astype(np.uint8))[:, None, None, None]
self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
self.obs[:, :, :, -self.nc:] = obs[:, :, :, :]
def run(self):
enc_obs = np.split(self.obs, self.nstack, axis=3) # so now list of obs steps
mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
for _ in range(self.nsteps):
actions, mus, states = self.model._step(self.obs, S=self.states, M=self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_mus.append(mus)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
# states information for stateful models like LSTM
self.states = states
self.dones = dones
self.update_obs(obs, dones)
mb_rewards.append(rewards)
enc_obs.append(obs)
mb_obs.append(np.copy(self.obs))
mb_dones.append(self.dones)
enc_obs = np.asarray(enc_obs, dtype=np.uint8).swapaxes(1, 0)
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones # Used for stateful models like LSTMs to mask state when done
mb_dones = mb_dones[:, 1:] # Used for calculating returns. The dones array is now aligned with rewards
# shapes are now [nenv, nsteps, []]
# When pulling from buffer, arrays will now be reshaped in place, preventing a deep copy.
return enc_obs, mb_obs, mb_actions, mb_rewards, mb_mus, mb_dones, mb_masks
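To see what update_obs above is doing, here is a small standalone numpy sketch of the same roll-and-overwrite frame stacking (the shapes are arbitrary illustration values):

import numpy as np

nenv, nh, nw, nc, nstack = 2, 84, 84, 1, 4
stacked = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
new_frame = np.random.randint(0, 255, size=(nenv, nh, nw, nc), dtype=np.uint8)
dones = np.array([False, True])

# zero the whole stack for environments whose episode just ended
stacked *= (1 - dones.astype(np.uint8))[:, None, None, None]
# shift older frames towards the front of the channel axis, then write the newest frame at the end
stacked = np.roll(stacked, shift=-nc, axis=3)
stacked[:, :, :, -nc:] = new_frame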


@@ -2,4 +2,7 @@
- Original paper: https://arxiv.org/abs/1708.05144
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.acktr.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.run --alg=acktr --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)

baselines/acktr/acktr.py (new file, 1 line added)

@@ -0,0 +1 @@
from baselines.acktr.acktr_disc import *


@@ -54,7 +54,7 @@ def learn(env, policy, vf, gamma, lam, timesteps_per_batch, num_timesteps,
stepsize = tf.Variable(initial_value=np.float32(np.array(0.03)), name='stepsize')
inputs, loss, loss_sampled = policy.update_info
optim = kfac.KfacOptimizer(learning_rate=stepsize, cold_lr=stepsize*(1-0.9), momentum=0.9, kfac_update=2,\
epsilon=1e-2, stats_decay=0.99, async=1, cold_iter=1,
epsilon=1e-2, stats_decay=0.99, async_=1, cold_iter=1,
weight_decay_dict=policy.wd_dict, max_grad_norm=None)
pi_var_list = []
for var in tf.trainable_variables():


@@ -1,16 +1,17 @@
import os.path as osp
import time
import joblib
import functools
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session, save_variables, load_variables
from baselines.a2c.a2c import Runner
from baselines.a2c.runner import Runner
from baselines.a2c.utils import discount_with_dones
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.utils import cat_entropy, mse
from baselines.acktr import kfac
@@ -19,11 +20,8 @@ class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, lrschedule='linear'):
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=nprocs,
inter_op_parallelism_threads=nprocs)
config.gpu_options.allow_growth = True
self.sess = sess = tf.Session(config=config)
self.sess = sess = get_session()
nact = ac_space.n
nbatch = nenvs * nsteps
A = tf.placeholder(tf.int32, [nbatch])
@@ -32,34 +30,35 @@ class Model(object):
PG_LR = tf.placeholder(tf.float32, [])
VF_LR = tf.placeholder(tf.float32, [])
self.model = step_model = policy(sess, ob_space, ac_space, nenvs, 1, reuse=False)
self.model2 = train_model = policy(sess, ob_space, ac_space, nenvs*nsteps, nsteps, reuse=True)
with tf.variable_scope('acktr_model', reuse=tf.AUTO_REUSE):
self.model = step_model = policy(nenvs, 1, sess=sess)
self.model2 = train_model = policy(nenvs*nsteps, nsteps, sess=sess)
logpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
neglogpac = train_model.pd.neglogp(A)
self.logits = logits = train_model.pi
##training loss
pg_loss = tf.reduce_mean(ADV*logpac)
entropy = tf.reduce_mean(cat_entropy(train_model.pi))
pg_loss = tf.reduce_mean(ADV*neglogpac)
entropy = tf.reduce_mean(train_model.pd.entropy())
pg_loss = pg_loss - ent_coef * entropy
vf_loss = tf.reduce_mean(mse(tf.squeeze(train_model.vf), R))
vf_loss = tf.losses.mean_squared_error(tf.squeeze(train_model.vf), R)
train_loss = pg_loss + vf_coef * vf_loss
##Fisher loss construction
self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(logpac)
self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(neglogpac)
sample_net = train_model.vf + tf.random_normal(tf.shape(train_model.vf))
self.vf_fisher = vf_fisher_loss = - vf_fisher_coef*tf.reduce_mean(tf.pow(train_model.vf - tf.stop_gradient(sample_net), 2))
self.joint_fisher = joint_fisher_loss = pg_fisher_loss + vf_fisher_loss
self.params=params = find_trainable_variables("model")
self.params=params = find_trainable_variables("acktr_model")
self.grads_check = grads = tf.gradients(train_loss,params)
with tf.device('/gpu:0'):
self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,\
momentum=0.9, kfac_update=1, epsilon=0.01,\
stats_decay=0.99, async=1, cold_iter=10, max_grad_norm=max_grad_norm)
stats_decay=0.99, async_=1, cold_iter=10, max_grad_norm=max_grad_norm)
update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
@@ -82,22 +81,10 @@ class Model(object):
)
return policy_loss, value_loss, policy_entropy
def save(save_path):
ps = sess.run(params)
joblib.dump(ps, save_path)
def load(load_path):
loaded_params = joblib.load(load_path)
restores = []
for p, loaded_p in zip(params, loaded_params):
restores.append(p.assign(loaded_p))
sess.run(restores)
self.train = train
self.save = save
self.load = load
self.save = functools.partial(save_variables, sess=sess)
self.load = functools.partial(load_variables, sess=sess)
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
@@ -105,12 +92,17 @@ class Model(object):
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, save_interval=None, lrschedule='linear'):
tf.reset_default_graph()
kfac_clip=0.001, save_interval=None, lrschedule='linear', load_path=None, **network_kwargs):
set_global_seeds(seed)
if network == 'cnn':
network_kwargs['one_dim_bias'] = True
policy = build_policy(env, network, **network_kwargs)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
@@ -124,6 +116,9 @@ def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval
fh.write(cloudpickle.dumps(make_model))
model = make_model()
if load_path is not None:
model.load(load_path)
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
nbatch = nenvs*nsteps
tstart = time.time()
@@ -153,3 +148,4 @@ def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval
coord.request_stop()
coord.join(enqueue_threads)
env.close()
return model


@@ -10,14 +10,14 @@ KFAC_DEBUG = False
class KfacOptimizer():
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async_=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
self.max_grad_norm = max_grad_norm
self._lr = learning_rate
self._momentum = momentum
self._clip_kl = clip_kl
self._channel_fac = channel_fac
self._kfac_update = kfac_update
self._async = async
self._async = async_
self._async_stats = async_stats
self._epsilon = epsilon
self._stats_decay = stats_decay


@@ -1,21 +0,0 @@
#!/usr/bin/env python3
from baselines import logger
from baselines.acktr.acktr_disc import learn
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.ppo2.policies import CnnPolicy
def train(env_id, num_timesteps, seed, num_cpu):
env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
policy_fn = CnnPolicy
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
env.close()
def main():
args = atari_arg_parser().parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed, num_cpu=32)
if __name__ == '__main__':
main()


@@ -21,7 +21,7 @@ class NeuralNetValueFunction(object):
self._predict = U.function([X], vpred_n)
optim = kfac.KfacOptimizer(learning_rate=0.001, cold_lr=0.001*(1-0.9), momentum=0.9, \
clip_kl=0.3, epsilon=0.1, stats_decay=0.95, \
async=1, kfac_update=2, cold_iter=50, \
async_=1, kfac_update=2, cold_iter=50, \
weight_decay_dict=wd_dict, max_grad_norm=None)
vf_var_list = []
for var in tf.trainable_variables():


@@ -9,6 +9,8 @@ _atariexpl7 = ['Freeway', 'Gravitar', 'MontezumaRevenge', 'Pitfall', 'PrivateEye
_BENCHMARKS = []
remove_version_re = re.compile(r'-v\d+$')
def register_benchmark(benchmark):
for b in _BENCHMARKS:
if b['name'] == benchmark['name']:
@@ -57,7 +59,7 @@ register_benchmark({
register_benchmark({
'name': 'Atari10M',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari7]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 6, 'num_timesteps': int(10e6)} for _game in _atari7]
})
register_benchmark({
@@ -82,8 +84,9 @@ _mujocosmall = [
register_benchmark({
'name': 'Mujoco1M',
'description': 'Some small 2D MuJoCo tasks, run for 1M timesteps',
'tasks': [{'env_id': _envid, 'trials': 3, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
'tasks': [{'env_id': _envid, 'trials': 6, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
})
register_benchmark({
'name': 'MujocoWalkers',
'description': 'MuJoCo forward walkers, run for 8M, humanoid 100M',
@@ -138,3 +141,11 @@ register_benchmark({
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari50]
})
# HER DDPG
register_benchmark({
'name': 'HerDdpg',
'description': 'Smoke-test only benchmark of HER',
'tasks': [{'trials': 1, 'env_id': 'FetchReach-v1'}]
})


@@ -112,6 +112,8 @@ def load_results(dir):
with open(fname, 'rt') as fh:
if fname.endswith('csv'):
firstline = fh.readline()
if not firstline:
continue
assert firstline[0] == '#'
header = json.loads(firstline[1:])
df = pandas.read_csv(fh, index_col=None)
@@ -158,4 +160,4 @@ def test_monitor():
last_logline = pandas.read_csv(f, index_col=None)
assert set(last_logline.keys()) == {'l', 't', 'r'}, "Incorrect keys in monitor logline"
f.close()
os.remove(mon_file)
os.remove(mon_file)


@@ -1,3 +1,4 @@
# flake8: noqa F403
from baselines.common.console_util import *
from baselines.common.dataset import Dataset
from baselines.common.math_util import *


@@ -1,4 +1,6 @@
import numpy as np
import os
os.environ.setdefault('PATH', '')
from collections import deque
import gym
from gym import spaces
@@ -98,9 +100,6 @@ class MaxAndSkipEnv(gym.Wrapper):
self._obs_buffer = np.zeros((2,)+env.observation_space.shape, dtype=np.uint8)
self._skip = skip
def reset(self):
return self.env.reset()
def step(self, action):
"""Repeat action, sum reward, and max over last observations."""
total_reward = 0.0
@@ -157,7 +156,7 @@ class FrameStack(gym.Wrapper):
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=np.uint8)
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=env.observation_space.dtype)
def reset(self):
ob = self.env.reset()
@@ -177,6 +176,7 @@ class FrameStack(gym.Wrapper):
class ScaledFloatFrame(gym.ObservationWrapper):
def __init__(self, env):
gym.ObservationWrapper.__init__(self, env)
self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)
def observation(self, observation):
# careful! This undoes the memory optimization, use


@@ -3,6 +3,11 @@ Helpers for scripts like run_atari.py.
"""
import os
try:
from mpi4py import MPI
except ImportError:
MPI = None
import gym
from gym.wrappers import FlattenDictWrapper
from baselines import logger
@@ -10,31 +15,46 @@ from baselines.bench import Monitor
from baselines.common import set_global_seeds
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from mpi4py import MPI
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.retro_wrappers import RewardScaler
def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
def make_vec_env(env_id, env_type, num_env, seed, wrapper_kwargs=None, start_index=0, reward_scale=1.0):
"""
Create a wrapped, monitored SubprocVecEnv for Atari.
Create a wrapped, monitored SubprocVecEnv for Atari and MuJoCo.
"""
if wrapper_kwargs is None: wrapper_kwargs = {}
mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
def make_env(rank): # pylint: disable=C0111
def _thunk():
env = make_atari(env_id)
env.seed(seed + rank)
env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
return wrap_deepmind(env, **wrapper_kwargs)
env = make_atari(env_id) if env_type == 'atari' else gym.make(env_id)
env.seed(seed + 10000*mpi_rank + rank if seed is not None else None)
env = Monitor(env,
logger.get_dir() and os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(rank)),
allow_early_resets=True)
if env_type == 'atari': return wrap_deepmind(env, **wrapper_kwargs)
elif reward_scale != 1: return RewardScaler(env, reward_scale)
else: return env
return _thunk
set_global_seeds(seed)
return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
if num_env > 1: return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
else: return DummyVecEnv([make_env(start_index)])
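A hedged example of the unified helper above; the environment ids are illustrative:

from baselines.common.cmd_util import make_vec_env

# several Atari workers -> SubprocVecEnv with deepmind-style wrappers
atari_env = make_vec_env('BreakoutNoFrameskip-v4', 'atari', num_env=8, seed=0)
# a single MuJoCo copy -> DummyVecEnv, optionally with reward scaling
mujoco_env = make_vec_env('HalfCheetah-v2', 'mujoco', num_env=1, seed=0, reward_scale=0.1)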
def make_mujoco_env(env_id, seed):
def make_mujoco_env(env_id, seed, reward_scale=1.0):
"""
Create a wrapped, monitored gym.Env for MuJoCo.
"""
set_global_seeds(seed)
rank = MPI.COMM_WORLD.Get_rank()
myseed = seed + 1000 * rank if seed is not None else None
set_global_seeds(myseed)
env = gym.make(env_id)
env = Monitor(env, logger.get_dir())
logger_path = None if logger.get_dir() is None else os.path.join(logger.get_dir(), str(rank))
env = Monitor(env, logger_path, allow_early_resets=True)
env.seed(seed)
if reward_scale != 1.0:
from baselines.common.retro_wrappers import RewardScaler
env = RewardScaler(env, reward_scale)
return env
def make_robotics_env(env_id, seed, rank=0):
@@ -61,20 +81,28 @@ def atari_arg_parser():
"""
Create an argparse.ArgumentParser for run_atari.py.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
return parser
print('Obsolete - use common_arg_parser instead')
return common_arg_parser()
def mujoco_arg_parser():
print('Obsolete - use common_arg_parser instead')
return common_arg_parser()
def common_arg_parser():
"""
Create an argparse.ArgumentParser for run_mujoco.py.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
parser.add_argument('--alg', help='Algorithm', type=str, default='ppo2')
parser.add_argument('--num_timesteps', type=float, default=1e6),
parser.add_argument('--network', help='network type (mlp, cnn, lstm, cnn_lstm, conv_only)', default=None)
parser.add_argument('--gamestate', help='game state to load (so far only used in retro games)', default=None)
parser.add_argument('--num_env', help='Number of environment copies being run in parallel. When not specified, set to number of cpus for Atari, and to 1 for Mujoco', default=None, type=int)
parser.add_argument('--reward_scale', help='Reward scale factor. Default: 1.0', default=1.0, type=float)
parser.add_argument('--save_path', help='Path to save trained model to', default=None, type=str)
parser.add_argument('--play', default=False, action='store_true')
return parser
def robotics_arg_parser():
@@ -83,6 +111,21 @@ def robotics_arg_parser():
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', type=str, default='FetchReach-v0')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
return parser
def parse_unknown_args(args):
"""
Parse arguments not consumed by arg parser into a dictionary
"""
retval = {}
for arg in args:
assert arg.startswith('--')
assert '=' in arg, 'cannot parse arg {}'.format(arg)
key = arg.split('=')[0][2:]
value = arg.split('=')[1]
retval[key] = value
return retval
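For example, assuming the helper above, leftover command-line flags become a plain string-valued dict:

extra_args = parse_unknown_args(['--nsteps=128', '--lr=3e-4'])
# extra_args == {'nsteps': '128', 'lr': '3e-4'}  (values stay as strings; the caller converts types)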


@@ -2,6 +2,8 @@ from __future__ import print_function
from contextlib import contextmanager
import numpy as np
import time
import shlex
import subprocess
# ================================================================
# Misc
@@ -37,7 +39,7 @@ color2num = dict(
crimson=38
)
def colorize(string, color, bold=False, highlight=False):
def colorize(string, color='green', bold=False, highlight=False):
attr = []
num = color2num[color]
if highlight: num += 10
@@ -45,6 +47,22 @@ def colorize(string, color, bold=False, highlight=False):
if bold: attr.append('1')
return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string)
def print_cmd(cmd, dry=False):
if isinstance(cmd, str): # for shell=True
pass
else:
cmd = ' '.join(shlex.quote(arg) for arg in cmd)
print(colorize(('CMD: ' if not dry else 'DRY: ') + cmd))
def get_git_commit(cwd=None):
return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'], cwd=cwd).decode('utf8')
def ccap(cmd, dry=False, env=None, **kwargs):
print_cmd(cmd, dry)
if not dry:
subprocess.check_call(cmd, env=env, **kwargs)
MESSAGE_DEPTH = 0


@@ -1,6 +1,7 @@
import tensorflow as tf
import numpy as np
import baselines.common.tf_util as U
from baselines.a2c.utils import fc
from tensorflow.python.ops import math_ops
class Pd(object):
@@ -31,6 +32,8 @@ class PdType(object):
raise NotImplementedError
def pdfromflat(self, flat):
return self.pdclass()(flat)
def pdfromlatent(self, latent_vector):
raise NotImplementedError
def param_shape(self):
raise NotImplementedError
def sample_shape(self):
@@ -48,6 +51,10 @@ class CategoricalPdType(PdType):
self.ncat = ncat
def pdclass(self):
return CategoricalPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
pdparam = fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
return self.pdfromflat(pdparam), pdparam
def param_shape(self):
return [self.ncat]
def sample_shape(self):
@@ -75,6 +82,13 @@ class DiagGaussianPdType(PdType):
self.size = size
def pdclass(self):
return DiagGaussianPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
mean = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
logstd = tf.get_variable(name='pi/logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
return self.pdfromflat(pdparam), mean
def param_shape(self):
return [2*self.size]
def sample_shape(self):
@@ -129,26 +143,26 @@ class CategoricalPd(Pd):
# Note: we can't use sparse_softmax_cross_entropy_with_logits because
# the implementation does not allow second-order derivatives...
one_hot_actions = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
return tf.nn.softmax_cross_entropy_with_logits(
return tf.nn.softmax_cross_entropy_with_logits_v2(
logits=self.logits,
labels=one_hot_actions)
def kl(self, other):
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keep_dims=True)
a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keep_dims=True)
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keepdims=True)
ea0 = tf.exp(a0)
ea1 = tf.exp(a1)
z0 = tf.reduce_sum(ea0, axis=-1, keep_dims=True)
z1 = tf.reduce_sum(ea1, axis=-1, keep_dims=True)
z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
z1 = tf.reduce_sum(ea1, axis=-1, keepdims=True)
p0 = ea0 / z0
return tf.reduce_sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
def entropy(self):
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keep_dims=True)
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
ea0 = tf.exp(a0)
z0 = tf.reduce_sum(ea0, axis=-1, keep_dims=True)
z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
p0 = ea0 / z0
return tf.reduce_sum(p0 * (tf.log(z0) - a0), axis=-1)
def sample(self):
u = tf.random_uniform(tf.shape(self.logits))
u = tf.random_uniform(tf.shape(self.logits), dtype=self.logits.dtype)
return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
@classmethod
def fromflat(cls, flat):


@@ -0,0 +1,30 @@
from gym import Env
from gym.spaces import Discrete
class IdentityEnv(Env):
def __init__(
self,
dim,
ep_length=100,
):
self.action_space = Discrete(dim)
self.reset()
def reset(self):
self._choose_next_state()
self.observation_space = self.action_space
return self.state
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
return self.state, rew, False, {}
def _choose_next_state(self):
self.state = self.action_space.sample()
def _get_reward(self, actions):
return 1 if self.state == actions else 0

baselines/common/input.py (new file, 56 lines added)

@@ -0,0 +1,56 @@
import tensorflow as tf
from gym.spaces import Discrete, Box
def observation_placeholder(ob_space, batch_size=None, name='Ob'):
'''
Create a placeholder of the size appropriate to the observation space to feed observations into
Parameters:
----------
ob_space: gym.Space observation space
batch_size: int size of the batch to be fed into input. Can be left None in most cases.
name: str name of the placeholder
Returns:
-------
tensorflow placeholder tensor
'''
assert isinstance(ob_space, Discrete) or isinstance(ob_space, Box), \
'Can only deal with Discrete and Box observation spaces for now'
return tf.placeholder(shape=(batch_size,) + ob_space.shape, dtype=ob_space.dtype, name=name)
def observation_input(ob_space, batch_size=None, name='Ob'):
'''
Create a placeholder of the size appropriate to the observation space to feed observations into, and add an input
encoder of the appropriate type.
'''
placeholder = observation_placeholder(ob_space, batch_size, name)
return placeholder, encode_observation(ob_space, placeholder)
def encode_observation(ob_space, placeholder):
'''
Encode input in the way that is appropriate to the observation space
Parameters:
----------
ob_space: gym.Space observation space
placeholder: tf.placeholder observation input placeholder
'''
if isinstance(ob_space, Discrete):
return tf.to_float(tf.one_hot(placeholder, ob_space.n))
elif isinstance(ob_space, Box):
return tf.to_float(placeholder)
else:
raise NotImplementedError
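A short sketch of how these helpers compose; the observation space below is made up for illustration:

import gym
from baselines.common.input import observation_input

ob_space = gym.spaces.Discrete(5)
placeholder, encoded = observation_input(ob_space, batch_size=None)
# placeholder has shape (None,); encoded is the one-hot float32 tensor of shape (None, 5).
# For a Box space the placeholder keeps its shape and is simply cast to float32.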


@@ -67,14 +67,21 @@ class EzPickle(object):
def set_global_seeds(i):
try:
import MPI
rank = MPI.COMM_WORLD.Get_rank()
except ImportError:
rank = 0
myseed = i + 1000 * rank if i is not None else None
try:
import tensorflow as tf
except ImportError:
pass
else:
tf.set_random_seed(i)
np.random.seed(i)
random.seed(i)
tf.set_random_seed(myseed)
np.random.seed(myseed)
random.seed(myseed)
def pretty_eta(seconds_left):

baselines/common/models.py (new file, 204 lines added)

@@ -0,0 +1,204 @@
import numpy as np
import tensorflow as tf
from baselines.a2c import utils
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch
from baselines.common.mpi_running_mean_std import RunningMeanStd
import tensorflow.contrib.layers as layers
def nature_cnn(unscaled_images, **conv_kwargs):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
**conv_kwargs))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
"""
Stack of fully-connected layers to be used in a policy / q-function approximator
Parameters:
----------
num_layers: int number of fully-connected layers (default: 2)
num_hidden: int size of fully-connected layers (default: 64)
activation: activation function (default: tf.tanh)
Returns:
-------
function that builds fully connected network with a given input tensor / placeholder
"""
def network_fn(X):
h = tf.layers.flatten(X)
for i in range(num_layers):
h = activation(fc(h, 'mlp_fc{}'.format(i), nh=num_hidden, init_scale=np.sqrt(2)))
return h, None
return network_fn
def cnn(**conv_kwargs):
def network_fn(X):
return nature_cnn(X, **conv_kwargs), None
return network_fn
def cnn_small(**conv_kwargs):
def network_fn(X):
h = tf.cast(X, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(h, 'c1', nf=8, rf=8, stride=4, init_scale=np.sqrt(2), **conv_kwargs))
h = activ(conv(h, 'c2', nf=16, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h = conv_to_fc(h)
h = activ(fc(h, 'fc1', nh=128, init_scale=np.sqrt(2)))
return h, None
return network_fn
def lstm(nlstm=128, layer_norm=False):
"""
Builds LSTM (Long-Short Term Memory) network to be used in a policy.
Note that the resulting function returns not only the output of the LSTM
(i.e. hidden state of lstm for each step in the sequence), but also a dictionary
with auxiliary tensors to be set as policy attributes.
Specifically,
S is a placeholder to feed current state (LSTM state has to be managed outside policy)
M is a placeholder for the mask (used to mask out observations after the end of the episode, but can be used for other purposes too)
initial_state is a numpy array containing initial lstm state (usually zeros)
state is the output LSTM state (to be fed into S at the next call)
An example of usage of lstm-based policy can be found here: common/tests/test_doc_examples.py/test_lstm_example
Parameters:
----------
nlstm: int LSTM hidden state size
layer_norm: bool if True, layer-normalized version of LSTM is used
Returns:
-------
function that builds LSTM with a given input tensor / placeholder
"""
def network_fn(X, nenv=1):
nbatch = X.shape[0]
nsteps = nbatch // nenv
h = tf.layers.flatten(X)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
if layer_norm:
h5, snew = utils.lnlstm(xs, ms, S, scope='lnlstm', nh=nlstm)
else:
h5, snew = utils.lstm(xs, ms, S, scope='lstm', nh=nlstm)
h = seq_to_batch(h5)
initial_state = np.zeros(S.shape.as_list(), dtype=float)
return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}
return network_fn
def cnn_lstm(nlstm=128, layer_norm=False, **conv_kwargs):
def network_fn(X, nenv=1):
nbatch = X.shape[0]
nsteps = nbatch // nenv
h = nature_cnn(X, **conv_kwargs)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
if layer_norm:
h5, snew = utils.lnlstm(xs, ms, S, scope='lnlstm', nh=nlstm)
else:
h5, snew = utils.lstm(xs, ms, S, scope='lstm', nh=nlstm)
h = seq_to_batch(h5)
initial_state = np.zeros(S.shape.as_list(), dtype=float)
return h, {'S':S, 'M':M, 'state':snew, 'initial_state':initial_state}
return network_fn
def cnn_lnlstm(nlstm=128, **conv_kwargs):
return cnn_lstm(nlstm, layer_norm=True, **conv_kwargs)
def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
'''
convolutions-only net
Parameters:
----------
convs: list of triples (filter_number, filter_size, stride) specifying parameters for each layer.
Returns:
function that takes tensorflow tensor as input and returns the output of the last convolutional layer
'''
def network_fn(X):
out = tf.cast(X, tf.float32) / 255.
with tf.variable_scope("convnet"):
for num_outputs, kernel_size, stride in convs:
out = layers.convolution2d(out,
num_outputs=num_outputs,
kernel_size=kernel_size,
stride=stride,
activation_fn=tf.nn.relu,
**conv_kwargs)
return out, None
return network_fn
def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
rms = RunningMeanStd(shape=x.shape[1:])
norm_x = tf.clip_by_value((x - rms.mean) / rms.std, min(clip_range), max(clip_range))
return norm_x, rms
def get_network_builder(name):
# TODO: replace with reflection?
if name == 'cnn':
return cnn
elif name == 'cnn_small':
return cnn_small
elif name == 'conv_only':
return conv_only
elif name == 'mlp':
return mlp
elif name == 'lstm':
return lstm
elif name == 'cnn_lstm':
return cnn_lstm
elif name == 'cnn_lnlstm':
return cnn_lnlstm
else:
raise ValueError('Unknown network type: {}'.format(name))
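A hedged usage sketch of the network builders above (the placeholder shape is illustrative):

import tensorflow as tf
from baselines.common.models import get_network_builder

X = tf.placeholder(tf.float32, shape=(None, 11))  # a flat observation batch
network_fn = get_network_builder('mlp')(num_layers=2, num_hidden=64)
latent, recurrent_tensors = network_fn(X)  # recurrent_tensors is None for feed-forward nets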


@@ -0,0 +1,31 @@
import numpy as np
import tensorflow as tf
from mpi4py import MPI
class MpiAdamOptimizer(tf.train.AdamOptimizer):
"""Adam optimizer that averages gradients across mpi processes."""
def __init__(self, comm, **kwargs):
self.comm = comm
tf.train.AdamOptimizer.__init__(self, **kwargs)
def compute_gradients(self, loss, var_list, **kwargs):
grads_and_vars = tf.train.AdamOptimizer.compute_gradients(self, loss, var_list, **kwargs)
grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None]
flat_grad = tf.concat([tf.reshape(g, (-1,)) for g, v in grads_and_vars], axis=0)
shapes = [v.shape.as_list() for g, v in grads_and_vars]
sizes = [int(np.prod(s)) for s in shapes]
num_tasks = self.comm.Get_size()
buf = np.zeros(sum(sizes), np.float32)
def _collect_grads(flat_grad):
self.comm.Allreduce(flat_grad, buf, op=MPI.SUM)
np.divide(buf, float(num_tasks), out=buf)
return buf
avg_flat_grad = tf.py_func(_collect_grads, [flat_grad], tf.float32)
avg_flat_grad.set_shape(flat_grad.shape)
avg_grads = tf.split(avg_flat_grad, sizes, axis=0)
avg_grads_and_vars = [(tf.reshape(g, v.shape), v)
for g, (_, v) in zip(avg_grads, grads_and_vars)]
return avg_grads_and_vars
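A minimal sketch of dropping this optimizer in where a plain AdamOptimizer would be used, assuming the module lives at baselines.common.mpi_adam_optimizer and the script is launched under MPI (both assumptions, since this hunk does not show the file path):

import tensorflow as tf
from mpi4py import MPI
from baselines.common.mpi_adam_optimizer import MpiAdamOptimizer  # assumed module path

w = tf.get_variable('w', shape=(3,), initializer=tf.zeros_initializer())
loss = tf.reduce_sum(tf.square(w - 1.0))
# gradients are averaged across all MPI workers before the Adam update is applied
train_op = MpiAdamOptimizer(MPI.COMM_WORLD, learning_rate=1e-3).minimize(loss)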


@@ -2,6 +2,7 @@ from mpi4py import MPI
import numpy as np
from baselines.common import zipsame
def mpi_mean(x, axis=0, comm=None, keepdims=False):
x = np.asarray(x)
assert x.ndim > 0


@@ -0,0 +1,101 @@
from collections import defaultdict
from mpi4py import MPI
import os, numpy as np
import platform
import shutil
import subprocess
def sync_from_root(sess, variables, comm=None):
"""
Send the root node's parameters to every worker.
Arguments:
sess: the TensorFlow session.
variables: all parameter variables including optimizer's
"""
if comm is None: comm = MPI.COMM_WORLD
rank = comm.Get_rank()
for var in variables:
if rank == 0:
comm.Bcast(sess.run(var))
else:
import tensorflow as tf
returned_var = np.empty(var.shape, dtype='float32')
comm.Bcast(returned_var)
sess.run(tf.assign(var, returned_var))
def gpu_count():
"""
Count the GPUs on this machine.
"""
if shutil.which('nvidia-smi') is None:
return 0
output = subprocess.check_output(['nvidia-smi', '--query-gpu=gpu_name', '--format=csv'])
return max(0, len(output.split(b'\n')) - 2)
def setup_mpi_gpus():
"""
Set CUDA_VISIBLE_DEVICES using MPI.
"""
num_gpus = gpu_count()
if num_gpus == 0:
return
local_rank, _ = get_local_rank_size(MPI.COMM_WORLD)
os.environ['CUDA_VISIBLE_DEVICES'] = str(local_rank % num_gpus)
def get_local_rank_size(comm):
"""
Returns the rank of each process on its machine
The processes on a given machine will be assigned ranks
0, 1, 2, ..., N-1,
where N is the number of processes on this machine.
Useful if you want to assign one gpu per machine
"""
this_node = platform.node()
ranks_nodes = comm.allgather((comm.Get_rank(), this_node))
node2rankssofar = defaultdict(int)
local_rank = None
for (rank, node) in ranks_nodes:
if rank == comm.Get_rank():
local_rank = node2rankssofar[node]
node2rankssofar[node] += 1
assert local_rank is not None
return local_rank, node2rankssofar[this_node]
def share_file(comm, path):
"""
Copies the file from rank 0 to all other ranks
Puts it in the same place on all machines
"""
localrank, _ = get_local_rank_size(comm)
if comm.Get_rank() == 0:
with open(path, 'rb') as fh:
data = fh.read()
comm.bcast(data)
else:
data = comm.bcast(None)
if localrank == 0:
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, 'wb') as fh:
fh.write(data)
comm.Barrier()
def dict_gather(comm, d, op='mean', assert_all_have_data=True):
if comm is None: return d
alldicts = comm.allgather(d)
size = comm.size
k2li = defaultdict(list)
for d in alldicts:
for (k,v) in d.items():
k2li[k].append(v)
result = {}
for (k,li) in k2li.items():
if assert_all_have_data:
assert len(li)==size, "only %i out of %i MPI workers have sent '%s'" % (len(li), size, k)
if op=='mean':
result[k] = np.mean(li, axis=0)
elif op=='sum':
result[k] = np.sum(li, axis=0)
else:
assert 0, op
return result
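For instance, averaging per-worker statistics (the values are illustrative):

from mpi4py import MPI

comm = MPI.COMM_WORLD
# every rank contributes a dict; the result holds the element-wise mean across ranks
stats = dict_gather(comm, {'eprewmean': 10.0 * comm.Get_rank()}, op='mean')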


@@ -0,0 +1,179 @@
import tensorflow as tf
from baselines.common import tf_util
from baselines.a2c.utils import fc
from baselines.common.distributions import make_pdtype
from baselines.common.input import observation_placeholder, encode_observation
from baselines.common.tf_util import adjust_shape
from baselines.common.mpi_running_mean_std import RunningMeanStd
from baselines.common.models import get_network_builder
import gym
class PolicyWithValue(object):
"""
Encapsulates fields and methods for RL policy and value function estimation with shared parameters
"""
def __init__(self, env, observations, latent, estimate_q=False, vf_latent=None, sess=None, **tensors):
"""
Parameters:
----------
env RL environment
observations tensorflow placeholder in which the observations will be fed
latent latent state from which policy distribution parameters should be inferred
vf_latent latent state from which value function should be inferred (if None, then latent is used)
sess tensorflow session to run calculations in (if None, default session is used)
**tensors tensorflow tensors for additional attributes such as state or mask
"""
self.X = observations
self.state = tf.constant([])
self.initial_state = None
self.__dict__.update(tensors)
vf_latent = vf_latent if vf_latent is not None else latent
vf_latent = tf.layers.flatten(vf_latent)
latent = tf.layers.flatten(latent)
self.pdtype = make_pdtype(env.action_space)
self.pd, self.pi = self.pdtype.pdfromlatent(latent, init_scale=0.01)
self.action = self.pd.sample()
self.neglogp = self.pd.neglogp(self.action)
self.sess = sess
if estimate_q:
assert isinstance(env.action_space, gym.spaces.Discrete)
self.q = fc(vf_latent, 'q', env.action_space.n)
self.vf = self.q
else:
self.vf = fc(vf_latent, 'vf', 1)
self.vf = self.vf[:,0]
def _evaluate(self, variables, observation, **extra_feed):
sess = self.sess or tf.get_default_session()
feed_dict = {self.X: adjust_shape(self.X, observation)}
for inpt_name, data in extra_feed.items():
if inpt_name in self.__dict__.keys():
inpt = self.__dict__[inpt_name]
if isinstance(inpt, tf.Tensor) and inpt._op.type == 'Placeholder':
feed_dict[inpt] = adjust_shape(inpt, data)
return sess.run(variables, feed_dict)
def step(self, observation, **extra_feed):
"""
Compute next action(s) given the observation(s)
Parameters:
----------
observation observation data (either single or a batch)
**extra_feed additional data such as state or mask (names of the arguments should match the ones in constructor, see __init__)
Returns:
-------
(action, value estimate, next state, negative log likelihood of the action under current policy parameters) tuple
"""
a, v, state, neglogp = self._evaluate([self.action, self.vf, self.state, self.neglogp], observation, **extra_feed)
if state.size == 0:
state = None
return a, v, state, neglogp
def value(self, ob, *args, **kwargs):
"""
Compute value estimate(s) given the observation(s)
Parameters:
----------
observation observation data (either single or a batch)
**extra_feed additional data such as state or mask (names of the arguments should match the ones in constructor, see __init__)
Returns:
-------
value estimate
"""
return self._evaluate(self.vf, ob, *args, **kwargs)
def save(self, save_path):
tf_util.save_state(save_path, sess=self.sess)
def load(self, load_path):
tf_util.load_state(load_path, sess=self.sess)
def build_policy(env, policy_network, value_network=None, normalize_observations=False, estimate_q=False, **policy_kwargs):
if isinstance(policy_network, str):
network_type = policy_network
policy_network = get_network_builder(network_type)(**policy_kwargs)
def policy_fn(nbatch=None, nsteps=None, sess=None, observ_placeholder=None):
ob_space = env.observation_space
X = observ_placeholder if observ_placeholder is not None else observation_placeholder(ob_space, batch_size=nbatch)
extra_tensors = {}
if normalize_observations and X.dtype == tf.float32:
encoded_x, rms = _normalize_clip_observation(X)
extra_tensors['rms'] = rms
else:
encoded_x = X
encoded_x = encode_observation(ob_space, encoded_x)
with tf.variable_scope('pi', reuse=tf.AUTO_REUSE):
policy_latent, recurrent_tensors = policy_network(encoded_x)
if recurrent_tensors is not None:
# recurrent architecture, need a few more steps
nenv = nbatch // nsteps
assert nenv > 0, 'Bad input for recurrent policy: batch size {} smaller than nsteps {}'.format(nbatch, nsteps)
policy_latent, recurrent_tensors = policy_network(encoded_x, nenv)
extra_tensors.update(recurrent_tensors)
_v_net = value_network
if _v_net is None or _v_net == 'shared':
vf_latent = policy_latent
else:
if _v_net == 'copy':
_v_net = policy_network
else:
assert callable(_v_net)
with tf.variable_scope('vf', reuse=tf.AUTO_REUSE):
vf_latent, _ = _v_net(encoded_x)
policy = PolicyWithValue(
env=env,
observations=X,
latent=policy_latent,
vf_latent=vf_latent,
sess=sess,
estimate_q=estimate_q,
**extra_tensors
)
return policy
return policy_fn
def _normalize_clip_observation(x, clip_range=[-5.0, 5.0]):
rms = RunningMeanStd(shape=x.shape[1:])
norm_x = tf.clip_by_value((x - rms.mean) / rms.std, min(clip_range), max(clip_range))
return norm_x, rms
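Putting the pieces together, a hedged sketch of building and stepping a policy; the environment choice and sizes are illustrative:

import tensorflow as tf
from baselines.common.cmd_util import make_vec_env
from baselines.common.policies import build_policy
from baselines.common.tf_util import get_session

env = make_vec_env('CartPole-v1', 'classic_control', num_env=1, seed=0)
policy_fn = build_policy(env, 'mlp', num_hidden=32)
sess = get_session()
policy = policy_fn(nbatch=1, nsteps=1, sess=sess)
sess.run(tf.global_variables_initializer())
actions, values, states, neglogps = policy.step(env.reset())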


@@ -0,0 +1,293 @@
# flake8: noqa F403, F405
from .atari_wrappers import *
import numpy as np
import gym
class TimeLimit(gym.Wrapper):
def __init__(self, env, max_episode_steps=None):
super(TimeLimit, self).__init__(env)
self._max_episode_steps = max_episode_steps
self._elapsed_steps = 0
def step(self, ac):
observation, reward, done, info = self.env.step(ac)
self._elapsed_steps += 1
if self._elapsed_steps >= self._max_episode_steps:
done = True
info['TimeLimit.truncated'] = True
return observation, reward, done, info
def reset(self, **kwargs):
self._elapsed_steps = 0
return self.env.reset(**kwargs)
class StochasticFrameSkip(gym.Wrapper):
def __init__(self, env, n, stickprob):
gym.Wrapper.__init__(self, env)
self.n = n
self.stickprob = stickprob
self.curac = None
self.rng = np.random.RandomState()
self.supports_want_render = hasattr(env, "supports_want_render")
def reset(self, **kwargs):
self.curac = None
return self.env.reset(**kwargs)
def step(self, ac):
done = False
totrew = 0
for i in range(self.n):
# First step after reset, use action
if self.curac is None:
self.curac = ac
# First substep, delay with probability=stickprob
elif i==0:
if self.rng.rand() > self.stickprob:
self.curac = ac
# Second substep, new action definitely kicks in
elif i==1:
self.curac = ac
if self.supports_want_render and i<self.n-1:
ob, rew, done, info = self.env.step(self.curac, want_render=False)
else:
ob, rew, done, info = self.env.step(self.curac)
totrew += rew
if done: break
return ob, totrew, done, info
def seed(self, s):
self.rng.seed(s)
class PartialFrameStack(gym.Wrapper):
def __init__(self, env, k, channel=1):
"""
Stack one channel (channel keyword) from previous frames
"""
gym.Wrapper.__init__(self, env)
shp = env.observation_space.shape
self.channel = channel
self.observation_space = gym.spaces.Box(low=0, high=255,
shape=(shp[0], shp[1], shp[2] + k - 1),
dtype=env.observation_space.dtype)
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
def reset(self):
ob = self.env.reset()
assert ob.shape[2] > self.channel
for _ in range(self.k):
self.frames.append(ob)
return self._get_ob()
def step(self, ac):
ob, reward, done, info = self.env.step(ac)
self.frames.append(ob)
return self._get_ob(), reward, done, info
def _get_ob(self):
assert len(self.frames) == self.k
return np.concatenate([frame if i==self.k-1 else frame[:,:,self.channel:self.channel+1]
for (i, frame) in enumerate(self.frames)], axis=2)
class Downsample(gym.ObservationWrapper):
def __init__(self, env, ratio):
"""
Downsample images by a factor of ratio
"""
gym.ObservationWrapper.__init__(self, env)
(oldh, oldw, oldc) = env.observation_space.shape
newshape = (oldh//ratio, oldw//ratio, oldc)
self.observation_space = spaces.Box(low=0, high=255,
shape=newshape, dtype=np.uint8)
def observation(self, frame):
height, width, _ = self.observation_space.shape
frame = cv2.resize(frame, (width, height), interpolation=cv2.INTER_AREA)
if frame.ndim == 2:
frame = frame[:,:,None]
return frame
class Rgb2gray(gym.ObservationWrapper):
def __init__(self, env):
"""
Convert RGB frames to grayscale
"""
gym.ObservationWrapper.__init__(self, env)
(oldh, oldw, _oldc) = env.observation_space.shape
self.observation_space = spaces.Box(low=0, high=255,
shape=(oldh, oldw, 1), dtype=np.uint8)
def observation(self, frame):
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
return frame[:,:,None]
class MovieRecord(gym.Wrapper):
def __init__(self, env, savedir, k):
gym.Wrapper.__init__(self, env)
self.savedir = savedir
self.k = k
self.epcount = 0
def reset(self):
if self.epcount % self.k == 0:
print('saving movie this episode', self.savedir)
self.env.unwrapped.movie_path = self.savedir
else:
print('not saving this episode')
self.env.unwrapped.movie_path = None
self.env.unwrapped.movie = None
self.epcount += 1
return self.env.reset()
class AppendTimeout(gym.Wrapper):
def __init__(self, env):
gym.Wrapper.__init__(self, env)
self.action_space = env.action_space
self.timeout_space = gym.spaces.Box(low=np.array([0.0]), high=np.array([1.0]), dtype=np.float32)
self.original_os = env.observation_space
if isinstance(self.original_os, gym.spaces.Dict):
import copy
ordered_dict = copy.deepcopy(self.original_os.spaces)
ordered_dict['value_estimation_timeout'] = self.timeout_space
self.observation_space = gym.spaces.Dict(ordered_dict)
self.dict_mode = True
else:
self.observation_space = gym.spaces.Dict({
'original': self.original_os,
'value_estimation_timeout': self.timeout_space
})
self.dict_mode = False
self.ac_count = None
while 1:
if not hasattr(env, "_max_episode_steps"): # Looking for TimeLimit wrapper that has this field
env = env.env
continue
break
self.timeout = env._max_episode_steps
def step(self, ac):
self.ac_count += 1
ob, rew, done, info = self.env.step(ac)
return self._process(ob), rew, done, info
def reset(self):
self.ac_count = 0
return self._process(self.env.reset())
def _process(self, ob):
fracmissing = 1 - self.ac_count / self.timeout
if self.dict_mode:
ob['value_estimation_timeout'] = fracmissing
return ob
else:
return { 'original': ob, 'value_estimation_timeout': fracmissing }
class StartDoingRandomActionsWrapper(gym.Wrapper):
"""
Warning: can eat info dicts, not good if you depend on them
"""
def __init__(self, env, max_random_steps, on_startup=True, every_episode=False):
gym.Wrapper.__init__(self, env)
self.on_startup = on_startup
self.every_episode = every_episode
self.random_steps = max_random_steps
self.last_obs = None
if on_startup:
self.some_random_steps()
def some_random_steps(self):
self.last_obs = self.env.reset()
n = np.random.randint(self.random_steps)
#print("running for random %i frames" % n)
for _ in range(n):
self.last_obs, _, done, _ = self.env.step(self.env.action_space.sample())
if done: self.last_obs = self.env.reset()
def reset(self):
return self.last_obs
def step(self, a):
self.last_obs, rew, done, info = self.env.step(a)
if done:
self.last_obs = self.env.reset()
if self.every_episode:
self.some_random_steps()
return self.last_obs, rew, done, info
def make_retro(*, game, state, max_episode_steps, **kwargs):
import retro
env = retro.make(game, state, **kwargs)
env = StochasticFrameSkip(env, n=4, stickprob=0.25)
if max_episode_steps is not None:
env = TimeLimit(env, max_episode_steps=max_episode_steps)
return env
def wrap_deepmind_retro(env, scale=True, frame_stack=4):
"""
Configure environment for retro games, using config similar to DeepMind-style Atari in wrap_deepmind
"""
env = WarpFrame(env)
env = ClipRewardEnv(env)
env = FrameStack(env, frame_stack)
if scale:
env = ScaledFloatFrame(env)
return env
class SonicDiscretizer(gym.ActionWrapper):
"""
Wrap a gym-retro environment and make it use discrete
actions for the Sonic game.
"""
def __init__(self, env):
super(SonicDiscretizer, self).__init__(env)
buttons = ["B", "A", "MODE", "START", "UP", "DOWN", "LEFT", "RIGHT", "C", "Y", "X", "Z"]
actions = [['LEFT'], ['RIGHT'], ['LEFT', 'DOWN'], ['RIGHT', 'DOWN'], ['DOWN'],
['DOWN', 'B'], ['B']]
self._actions = []
for action in actions:
arr = np.array([False] * 12)
for button in action:
arr[buttons.index(button)] = True
self._actions.append(arr)
self.action_space = gym.spaces.Discrete(len(self._actions))
def action(self, a): # pylint: disable=W0221
return self._actions[a].copy()
class RewardScaler(gym.RewardWrapper):
"""
Bring rewards to a reasonable scale for PPO.
This is incredibly important and affects performance
drastically.
"""
def __init__(self, env, scale=0.01):
super(RewardScaler, self).__init__(env)
self.scale = scale
def reward(self, reward):
return reward * self.scale
class AllowBacktracking(gym.Wrapper):
"""
Use deltas in max(X) as the reward, rather than deltas
in X. This way, agents are not discouraged too heavily
from exploring backwards if there is no way to advance
head-on in the level.
"""
def __init__(self, env):
super(AllowBacktracking, self).__init__(env)
self._cur_x = 0
self._max_x = 0
def reset(self, **kwargs): # pylint: disable=E0202
self._cur_x = 0
self._max_x = 0
return self.env.reset(**kwargs)
def step(self, action): # pylint: disable=E0202
obs, rew, done, info = self.env.step(action)
self._cur_x += rew
rew = max(0, self._cur_x - self._max_x)
self._max_x = max(self._max_x, self._cur_x)
return obs, rew, done, info


@@ -0,0 +1,19 @@
import numpy as np
from abc import ABC, abstractmethod
class AbstractEnvRunner(ABC):
def __init__(self, *, env, model, nsteps):
self.env = env
self.model = model
self.nenv = nenv = env.num_envs if hasattr(env, 'num_envs') else 1
self.batch_ob_shape = (nenv*nsteps,) + env.observation_space.shape
self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
self.obs[:] = env.reset()
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
@abstractmethod
def run(self):
raise NotImplementedError


@@ -1,4 +1,7 @@
import tensorflow as tf
import numpy as np
from baselines.common.tf_util import get_session
class RunningMeanStd(object):
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
def __init__(self, epsilon=1e-4, shape=()):
@@ -13,20 +16,71 @@ class RunningMeanStd(object):
self.update_from_moments(batch_mean, batch_var, batch_count)
def update_from_moments(self, batch_mean, batch_var, batch_count):
delta = batch_mean - self.mean
tot_count = self.count + batch_count
self.mean, self.var, self.count = update_mean_var_count_from_moments(
self.mean, self.var, self.count, batch_mean, batch_var, batch_count)
new_mean = self.mean + delta * batch_count / tot_count
m_a = self.var * (self.count)
m_b = batch_var * (batch_count)
M2 = m_a + m_b + np.square(delta) * self.count * batch_count / (self.count + batch_count)
new_var = M2 / (self.count + batch_count)
def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
delta = batch_mean - mean
tot_count = count + batch_count
new_count = batch_count + self.count
new_mean = mean + delta * batch_count / tot_count
m_a = var * count
m_b = batch_var * batch_count
M2 = m_a + m_b + np.square(delta) * count * batch_count / (count + batch_count)
new_var = M2 / (count + batch_count)
new_count = batch_count + count
return new_mean, new_var, new_count
self.mean = new_mean
self.var = new_var
self.count = new_count
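The extracted helper above implements the exact parallel-moments combine referenced in the Wikipedia link above; a quick numeric check with illustrative data:

import numpy as np

x1, x2 = np.random.randn(100), np.random.randn(50)
mean, var, count = x1.mean(), x1.var(), x1.size
mean, var, count = update_mean_var_count_from_moments(
    mean, var, count, x2.mean(), x2.var(), x2.size)
x = np.concatenate([x1, x2])
# the combined moments match the moments of the concatenated data
assert np.allclose([mean, var], [x.mean(), x.var()])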
class TfRunningMeanStd(object):
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
'''
TensorFlow variables-based implementation of computing running mean and std
Benefit of this implementation is that it can be saved / loaded together with the tensorflow model
'''
def __init__(self, epsilon=1e-4, shape=(), scope=''):
sess = get_session()
self._new_mean = tf.placeholder(shape=shape, dtype=tf.float64)
self._new_var = tf.placeholder(shape=shape, dtype=tf.float64)
self._new_count = tf.placeholder(shape=(), dtype=tf.float64)
with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
self._mean = tf.get_variable('mean', initializer=np.zeros(shape, 'float64'), dtype=tf.float64)
self._var = tf.get_variable('std', initializer=np.ones(shape, 'float64'), dtype=tf.float64)
self._count = tf.get_variable('count', initializer=np.full((), epsilon, 'float64'), dtype=tf.float64)
self.update_ops = tf.group([
self._var.assign(self._new_var),
self._mean.assign(self._new_mean),
self._count.assign(self._new_count)
])
sess.run(tf.variables_initializer([self._mean, self._var, self._count]))
self.sess = sess
self._set_mean_var_count()
def _set_mean_var_count(self):
self.mean, self.var, self.count = self.sess.run([self._mean, self._var, self._count])
def update(self, x):
batch_mean = np.mean(x, axis=0)
batch_var = np.var(x, axis=0)
batch_count = x.shape[0]
new_mean, new_var, new_count = update_mean_var_count_from_moments(self.mean, self.var, self.count, batch_mean, batch_var, batch_count)
self.sess.run(self.update_ops, feed_dict={
self._new_mean: new_mean,
self._new_var: new_var,
self._new_count: new_count
})
self._set_mean_var_count()
def test_runningmeanstd():
for (x1, x2, x3) in [
@@ -43,4 +97,91 @@ def test_runningmeanstd():
rms.update(x3)
ms2 = [rms.mean, rms.var]
assert np.allclose(ms1, ms2)
np.testing.assert_allclose(ms1, ms2)
def test_tf_runningmeanstd():
for (x1, x2, x3) in [
(np.random.randn(3), np.random.randn(4), np.random.randn(5)),
(np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
]:
rms = TfRunningMeanStd(epsilon=0.0, shape=x1.shape[1:], scope='running_mean_std' + str(np.random.randint(0, 128)))
x = np.concatenate([x1, x2, x3], axis=0)
ms1 = [x.mean(axis=0), x.var(axis=0)]
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = [rms.mean, rms.var]
np.testing.assert_allclose(ms1, ms2)
def profile_tf_runningmeanstd():
import time
from baselines.common import tf_util
tf_util.get_session( config=tf.ConfigProto(
inter_op_parallelism_threads=1,
intra_op_parallelism_threads=1,
allow_soft_placement=True
))
x = np.random.random((376,))
n_trials = 10000
rms = RunningMeanStd()
tfrms = TfRunningMeanStd()
tic1 = time.time()
for _ in range(n_trials):
rms.update(x)
tic2 = time.time()
for _ in range(n_trials):
tfrms.update(x)
tic3 = time.time()
print('rms update time ({} trials): {} s'.format(n_trials, tic2 - tic1))
print('tfrms update time ({} trials): {} s'.format(n_trials, tic3 - tic2))
tic1 = time.time()
for _ in range(n_trials):
z1 = rms.mean
tic2 = time.time()
for _ in range(n_trials):
z2 = tfrms.mean
assert z1 == z2
tic3 = time.time()
print('rms get mean time ({} trials): {} s'.format(n_trials, tic2 - tic1))
print('tfrms get mean time ({} trials): {} s'.format(n_trials, tic3 - tic2))
'''
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) #pylint: disable=E1101
run_metadata = tf.RunMetadata()
profile_opts = dict(options=options, run_metadata=run_metadata)
from tensorflow.python.client import timeline
fetched_timeline = timeline.Timeline(run_metadata.step_stats) #pylint: disable=E1101
chrome_trace = fetched_timeline.generate_chrome_trace_format()
outfile = '/tmp/timeline.json'
with open(outfile, 'wt') as f:
f.write(chrome_trace)
print(f'Successfully saved profile to {outfile}. Exiting.')
exit(0)
'''
if __name__ == '__main__':
profile_tf_runningmeanstd()

View File

@@ -12,10 +12,9 @@ class SegmentTree(object):
a) setting item's value is slightly slower.
It is O(lg capacity) instead of O(1).
b) user has access to an efficient `reduce`
operation which reduces `operation` over
a contiguous subsequence of items in the
array.
b) user has access to an efficient ( O(log segment size) )
`reduce` operation which reduces `operation` over
a contiguous subsequence of items in the array.
    Parameters
    ----------
@@ -23,8 +22,8 @@ class SegmentTree(object):
Total size of the array - must be a power of two.
operation: lambda obj, obj -> obj
        an operation for combining elements (e.g. sum, max)
must for a mathematical group together with the set of
possible values for array elements.
must form a mathematical group together with the set of
possible values for array elements (i.e. be associative)
neutral_element: obj
neutral element for the operation above. eg. float('-inf')
for max and 0 for sum.
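As a concrete illustration of the reduce operation described above (a sketch using the `SumSegmentTree` subclass from this module):

```python
from baselines.common.segment_tree import SumSegmentTree

tree = SumSegmentTree(4)       # capacity must be a power of two
tree[0] = 1.0
tree[2] = 3.0
assert tree.sum() == 4.0       # reduce(+) over the whole array
assert tree.sum(0, 2) == 1.0   # reduce(+) over the range [0, 2)
```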

View File

View File

View File

@@ -0,0 +1,44 @@
import numpy as np
from gym import Env
from gym.spaces import Discrete
class FixedSequenceEnv(Env):
def __init__(
self,
n_actions=10,
seed=0,
episode_len=100
):
self.np_random = np.random.RandomState()
self.np_random.seed(seed)
self.sequence = [self.np_random.randint(0, n_actions-1) for _ in range(episode_len)]
self.action_space = Discrete(n_actions)
self.observation_space = Discrete(1)
self.episode_len = episode_len
self.time = 0
self.reset()
def reset(self):
self.time = 0
return 0
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
done = False
if self.episode_len and self.time >= self.episode_len:
rew = 0
done = True
return 0, rew, done, {}
def _choose_next_state(self):
self.time += 1
def _get_reward(self, actions):
return 1 if actions == self.sequence[self.time] else 0
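A quick sanity sketch of the reward logic (illustration only):

```python
from baselines.common.tests.envs.fixed_sequence_env import FixedSequenceEnv

# the env rewards reproducing its hidden action sequence step by step
env = FixedSequenceEnv(n_actions=10, seed=0, episode_len=100)
env.reset()
_, rew, done, _ = env.step(env.sequence[0])   # correct first action
assert rew == 1 and not done
```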

View File

@@ -0,0 +1,70 @@
import numpy as np
from abc import abstractmethod
from gym import Env
from gym.spaces import Discrete, Box
class IdentityEnv(Env):
def __init__(
self,
episode_len=None
):
self.episode_len = episode_len
self.time = 0
self.reset()
def reset(self):
self._choose_next_state()
self.time = 0
self.observation_space = self.action_space
return self.state
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
done = False
if self.episode_len and self.time >= self.episode_len:
rew = 0
done = True
return self.state, rew, done, {}
def _choose_next_state(self):
self.state = self.action_space.sample()
self.time += 1
@abstractmethod
def _get_reward(self, actions):
raise NotImplementedError
class DiscreteIdentityEnv(IdentityEnv):
def __init__(
self,
dim,
episode_len=None,
):
self.action_space = Discrete(dim)
super().__init__(episode_len=episode_len)
def _get_reward(self, actions):
return 1 if self.state == actions else 0
class BoxIdentityEnv(IdentityEnv):
def __init__(
self,
shape,
episode_len=None,
):
self.action_space = Box(low=-1.0, high=1.0, shape=shape)
super().__init__(episode_len=episode_len)
def _get_reward(self, actions):
diff = actions - self.state
diff = diff[:]
return -0.5 * np.dot(diff, diff)
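Similarly, echoing the observed state back as the action is exactly what the identity envs reward (a quick sketch):

```python
from baselines.common.tests.envs.identity_env import DiscreteIdentityEnv

env = DiscreteIdentityEnv(10, episode_len=100)
state = env.reset()
_, rew, _, _ = env.step(state)   # action == current state -> reward 1
assert rew == 1
```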

View File

@@ -0,0 +1,70 @@
import os.path as osp
import numpy as np
import tempfile
import filelock
from gym import Env
from gym.spaces import Discrete, Box
class MnistEnv(Env):
def __init__(
self,
seed=0,
episode_len=None,
no_images=None
):
from tensorflow.examples.tutorials.mnist import input_data
        # we could use a temporary directory for this with a context manager and
        # TemporaryDirectory, but then each test that uses MNIST would re-download the data;
        # this way the data is not cleaned up, but we only download it once per machine
mnist_path = osp.join(tempfile.gettempdir(), 'MNIST_data')
with filelock.FileLock(mnist_path + '.lock'):
self.mnist = input_data.read_data_sets(mnist_path)
self.np_random = np.random.RandomState()
self.np_random.seed(seed)
self.observation_space = Box(low=0.0, high=1.0, shape=(28,28,1))
self.action_space = Discrete(10)
self.episode_len = episode_len
self.time = 0
self.no_images = no_images
self.train_mode()
self.reset()
def reset(self):
self._choose_next_state()
self.time = 0
return self.state[0]
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
done = False
if self.episode_len and self.time >= self.episode_len:
rew = 0
done = True
return self.state[0], rew, done, {}
def train_mode(self):
self.dataset = self.mnist.train
def test_mode(self):
self.dataset = self.mnist.test
def _choose_next_state(self):
max_index = (self.no_images if self.no_images is not None else self.dataset.num_examples) - 1
index = self.np_random.randint(0, max_index)
image = self.dataset.images[index].reshape(28,28,1)*255
label = self.dataset.labels[index]
self.state = (image, label)
self.time += 1
def _get_reward(self, actions):
return 1 if self.state[1] == actions else 0

View File

@@ -0,0 +1,43 @@
import pytest
import gym
from baselines.run import get_learn_function
from baselines.common.tests.util import reward_per_episode_test
common_kwargs = dict(
total_timesteps=30000,
network='mlp',
gamma=1.0,
seed=0,
)
learn_kwargs = {
'a2c' : dict(nsteps=32, value_network='copy', lr=0.05),
'acktr': dict(nsteps=32, value_network='copy'),
'deepq': dict(total_timesteps=20000),
'ppo2': dict(value_network='copy'),
'trpo_mpi': {}
}
@pytest.mark.slow
@pytest.mark.parametrize("alg", learn_kwargs.keys())
def test_cartpole(alg):
'''
Test if the algorithm (with an mlp policy)
can learn to balance the cartpole
'''
kwargs = common_kwargs.copy()
kwargs.update(learn_kwargs[alg])
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
def env_fn():
env = gym.make('CartPole-v0')
env.seed(0)
return env
reward_per_episode_test(env_fn, learn_fn, 100)
if __name__ == '__main__':
test_cartpole('deepq')

View File

@@ -0,0 +1,48 @@
import pytest
try:
import mujoco_py
_mujoco_present = True
except BaseException:
mujoco_py = None
_mujoco_present = False
@pytest.mark.skipif(
not _mujoco_present,
reason='error loading mujoco - either mujoco / mujoco key not present, or LD_LIBRARY_PATH is not pointing to mujoco library'
)
def test_lstm_example():
import tensorflow as tf
from baselines.common import policies, models, cmd_util
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
# create vectorized environment
venv = DummyVecEnv([lambda: cmd_util.make_mujoco_env('Reacher-v2', seed=0)])
with tf.Session() as sess:
# build policy based on lstm network with 128 units
policy = policies.build_policy(venv, models.lstm(128))(nbatch=1, nsteps=1)
# initialize tensorflow variables
sess.run(tf.global_variables_initializer())
# prepare environment variables
ob = venv.reset()
state = policy.initial_state
done = [False]
step_counter = 0
# run a single episode until the end (i.e. until done)
while True:
action, _, state, _ = policy.step(ob, S=state, M=done)
ob, reward, done, _ = venv.step(action)
step_counter += 1
if done:
break
assert step_counter > 5

View File

@@ -0,0 +1,51 @@
import pytest
from baselines.common.tests.envs.fixed_sequence_env import FixedSequenceEnv
from baselines.common.tests.util import simple_test
from baselines.run import get_learn_function
common_kwargs = dict(
seed=0,
total_timesteps=50000,
)
learn_kwargs = {
'a2c': {},
'ppo2': dict(nsteps=10, ent_coef=0.0, nminibatches=1),
# TODO enable sequential models for trpo_mpi (proper handling of nbatch and nsteps)
# github issue: https://github.com/openai/baselines/issues/188
# 'trpo_mpi': lambda e, p: trpo_mpi.learn(policy_fn=p(env=e), env=e, max_timesteps=30000, timesteps_per_batch=100, cg_iters=10, gamma=0.9, lam=1.0, max_kl=0.001)
}
alg_list = learn_kwargs.keys()
rnn_list = ['lstm']
@pytest.mark.slow
@pytest.mark.parametrize("alg", alg_list)
@pytest.mark.parametrize("rnn", rnn_list)
def test_fixed_sequence(alg, rnn):
'''
    Test if the algorithm (with a given policy)
    can learn a fixed sequence of actions
'''
kwargs = learn_kwargs[alg]
kwargs.update(common_kwargs)
episode_len = 5
env_fn = lambda: FixedSequenceEnv(10, episode_len=episode_len)
learn = lambda e: get_learn_function(alg)(
env=e,
network=rnn,
**kwargs
)
simple_test(env_fn, learn, 0.7)
if __name__ == '__main__':
test_fixed_sequence('ppo2', 'lstm')

View File

@@ -0,0 +1,55 @@
import pytest
from baselines.common.tests.envs.identity_env import DiscreteIdentityEnv, BoxIdentityEnv
from baselines.run import get_learn_function
from baselines.common.tests.util import simple_test
common_kwargs = dict(
total_timesteps=30000,
network='mlp',
gamma=0.9,
seed=0,
)
learn_kwargs = {
'a2c' : {},
'acktr': {},
'deepq': {},
'ppo2': dict(lr=1e-3, nsteps=64, ent_coef=0.0),
'trpo_mpi': dict(timesteps_per_batch=100, cg_iters=10, gamma=0.9, lam=1.0, max_kl=0.01)
}
@pytest.mark.slow
@pytest.mark.parametrize("alg", learn_kwargs.keys())
def test_discrete_identity(alg):
'''
Test if the algorithm (with an mlp policy)
can learn an identity transformation (i.e. return observation as an action)
'''
kwargs = learn_kwargs[alg]
kwargs.update(common_kwargs)
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
env_fn = lambda: DiscreteIdentityEnv(10, episode_len=100)
simple_test(env_fn, learn_fn, 0.9)
@pytest.mark.slow
@pytest.mark.parametrize("alg", ['a2c', 'ppo2', 'trpo_mpi'])
def test_continuous_identity(alg):
'''
Test if the algorithm (with an mlp policy)
can learn an identity transformation (i.e. return observation as an action)
to a required precision
'''
kwargs = learn_kwargs[alg]
kwargs.update(common_kwargs)
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
env_fn = lambda: BoxIdentityEnv((1,), episode_len=100)
simple_test(env_fn, learn_fn, -0.1)
if __name__ == '__main__':
test_continuous_identity('a2c')

View File

@@ -0,0 +1,50 @@
import pytest
# from baselines.acer import acer_simple as acer
from baselines.common.tests.envs.mnist_env import MnistEnv
from baselines.common.tests.util import simple_test
from baselines.run import get_learn_function
# TODO investigate a2c and ppo2 failures - is it due to bad hyperparameters for this problem?
# GitHub issue https://github.com/openai/baselines/issues/189
common_kwargs = {
'seed': 0,
'network':'cnn',
'gamma':0.9,
'pad':'SAME'
}
learn_args = {
'a2c': dict(total_timesteps=50000),
# TODO need to resolve inference (step) API differences for acer; also slow
# 'acer': dict(seed=0, total_timesteps=1000),
'deepq': dict(total_timesteps=5000),
'acktr': dict(total_timesteps=30000),
'ppo2': dict(total_timesteps=50000, lr=1e-3, nsteps=128, ent_coef=0.0),
'trpo_mpi': dict(total_timesteps=80000, timesteps_per_batch=100, cg_iters=10, lam=1.0, max_kl=0.001)
}
# tests pass, but are too slow on travis. Same algorithms are covered
# by other tests with less compute-hungry NNs and by benchmarks
@pytest.mark.skip
@pytest.mark.slow
@pytest.mark.parametrize("alg", learn_args.keys())
def test_mnist(alg):
'''
Test if the algorithm can learn to classify MNIST digits.
Uses CNN policy.
'''
learn_kwargs = learn_args[alg]
learn_kwargs.update(common_kwargs)
learn = get_learn_function(alg)
learn_fn = lambda e: learn(env=e, **learn_kwargs)
env_fn = lambda: MnistEnv(seed=0, episode_len=100)
simple_test(env_fn, learn_fn, 0.6)
if __name__ == '__main__':
test_mnist('deepq')

View File

@@ -0,0 +1,97 @@
import os
import tempfile
import pytest
import tensorflow as tf
import numpy as np
from baselines.common.tests.envs.mnist_env import MnistEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.run import get_learn_function
from baselines.common.tf_util import make_session, get_session
from functools import partial
learn_kwargs = {
'deepq': {},
'a2c': {},
'acktr': {},
'ppo2': {'nminibatches': 1, 'nsteps': 10},
'trpo_mpi': {},
}
network_kwargs = {
'mlp': {},
'cnn': {'pad': 'SAME'},
'lstm': {},
'cnn_lnlstm': {'pad': 'SAME'}
}
@pytest.mark.parametrize("learn_fn", learn_kwargs.keys())
@pytest.mark.parametrize("network_fn", network_kwargs.keys())
def test_serialization(learn_fn, network_fn):
'''
Test if the trained model can be serialized
'''
if network_fn.endswith('lstm') and learn_fn in ['acktr', 'trpo_mpi', 'deepq']:
# TODO make acktr work with recurrent policies
# and test
# github issue: https://github.com/openai/baselines/issues/194
return
env = DummyVecEnv([lambda: MnistEnv(10, episode_len=100)])
ob = env.reset().copy()
learn = get_learn_function(learn_fn)
kwargs = {}
kwargs.update(network_kwargs[network_fn])
kwargs.update(learn_kwargs[learn_fn])
learn = partial(learn, env=env, network=network_fn, seed=0, **kwargs)
with tempfile.TemporaryDirectory() as td:
model_path = os.path.join(td, 'serialization_test_model')
with tf.Graph().as_default(), make_session().as_default():
model = learn(total_timesteps=100)
model.save(model_path)
mean1, std1 = _get_action_stats(model, ob)
variables_dict1 = _serialize_variables()
with tf.Graph().as_default(), make_session().as_default():
model = learn(total_timesteps=0, load_path=model_path)
mean2, std2 = _get_action_stats(model, ob)
variables_dict2 = _serialize_variables()
for k, v in variables_dict1.items():
np.testing.assert_allclose(v, variables_dict2[k], atol=0.01,
err_msg='saved and loaded variable {} value mismatch'.format(k))
np.testing.assert_allclose(mean1, mean2, atol=0.5)
np.testing.assert_allclose(std1, std2, atol=0.5)
def _serialize_variables():
sess = get_session()
variables = tf.trainable_variables()
values = sess.run(variables)
return {var.name: value for var, value in zip(variables, values)}
def _get_action_stats(model, ob):
ntrials = 1000
if model.initial_state is None or model.initial_state == []:
actions = np.array([model.step(ob)[0] for _ in range(ntrials)])
else:
actions = np.array([model.step(ob, S=model.initial_state, M=[False])[0] for _ in range(ntrials)])
mean = np.mean(actions, axis=0)
std = np.std(actions, axis=0)
return mean, std

View File

@@ -33,7 +33,6 @@ def test_multikwargs():
initialize()
assert lin(2) == 6
assert lin(2, 2) == 10
expt_caught = False
if __name__ == '__main__':

View File

@@ -0,0 +1,91 @@
import tensorflow as tf
import numpy as np
from gym.spaces import np_random
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
N_TRIALS = 10000
N_EPISODES = 100
def simple_test(env_fn, learn_fn, min_reward_fraction, n_trials=N_TRIALS):
np.random.seed(0)
np_random.seed(0)
env = DummyVecEnv([env_fn])
with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():
tf.set_random_seed(0)
model = learn_fn(env)
sum_rew = 0
done = True
for i in range(n_trials):
if done:
obs = env.reset()
state = model.initial_state
if state is not None:
a, v, state, _ = model.step(obs, S=state, M=[False])
else:
a, v, _, _ = model.step(obs)
obs, rew, done, _ = env.step(a)
sum_rew += float(rew)
print("Reward in {} trials is {}".format(n_trials, sum_rew))
assert sum_rew > min_reward_fraction * n_trials, \
'sum of rewards {} is less than {} of the total number of trials {}'.format(sum_rew, min_reward_fraction, n_trials)
def reward_per_episode_test(env_fn, learn_fn, min_avg_reward, n_trials=N_EPISODES):
env = DummyVecEnv([env_fn])
with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():
model = learn_fn(env)
N_TRIALS = 100
observations, actions, rewards = rollout(env, model, N_TRIALS)
rewards = [sum(r) for r in rewards]
avg_rew = sum(rewards) / N_TRIALS
print("Average reward in {} episodes is {}".format(n_trials, avg_rew))
assert avg_rew > min_avg_reward, \
'average reward in {} episodes ({}) is less than {}'.format(n_trials, avg_rew, min_avg_reward)
def rollout(env, model, n_trials):
rewards = []
actions = []
observations = []
for i in range(n_trials):
obs = env.reset()
state = model.initial_state
episode_rew = []
episode_actions = []
episode_obs = []
while True:
if state is not None:
a, v, state, _ = model.step(obs, S=state, M=[False])
else:
a,v, _, _ = model.step(obs)
obs, rew, done, _ = env.step(a)
episode_rew.append(rew)
episode_actions.append(a)
episode_obs.append(obs)
if done:
break
rewards.append(episode_rew)
actions.append(episode_actions)
observations.append(episode_obs)
return observations, actions, rewards

View File

@@ -1,3 +1,4 @@
import joblib
import numpy as np
import tensorflow as tf # pylint: ignore-module
import copy
@@ -48,18 +49,28 @@ def huber_loss(x, delta=1.0):
# Global session
# ================================================================
def make_session(num_cpu=None, make_default=False):
def get_session(config=None):
"""Get default session or create one with a given config"""
sess = tf.get_default_session()
if sess is None:
sess = make_session(config=config, make_default=True)
return sess
def make_session(config=None, num_cpu=None, make_default=False, graph=None):
"""Returns a session that will use <num_cpu> CPU's only"""
if num_cpu is None:
num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
tf_config = tf.ConfigProto(
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
tf_config.gpu_options.allocator_type = 'BFC'
if config is None:
config = tf.ConfigProto(
allow_soft_placement=True,
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
config.gpu_options.allow_growth = True
if make_default:
return tf.InteractiveSession(config=tf_config)
return tf.InteractiveSession(config=config, graph=graph)
else:
return tf.Session(config=tf_config)
return tf.Session(config=config, graph=graph)
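The intended contract is that helpers share one process-wide default session rather than passing a session argument around; a quick sketch:

```python
import tensorflow as tf
from baselines.common.tf_util import get_session

sess = get_session()                    # creates a default session on first use
assert get_session() is sess            # subsequent calls return the same session
assert tf.get_default_session() is sess
```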
def single_threaded_session():
"""Returns a session which will only use a single CPU"""
@@ -77,17 +88,17 @@ ALREADY_INITIALIZED = set()
def initialize():
"""Initialize all the uninitialized variables in the global scope."""
new_variables = set(tf.global_variables()) - ALREADY_INITIALIZED
tf.get_default_session().run(tf.variables_initializer(new_variables))
get_session().run(tf.variables_initializer(new_variables))
ALREADY_INITIALIZED.update(new_variables)
# ================================================================
# Model components
# ================================================================
def normc_initializer(std=1.0):
def normc_initializer(std=1.0, axis=0):
def _initializer(shape, dtype=None, partition_info=None): # pylint: disable=W0613
out = np.random.randn(*shape).astype(np.float32)
out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
out = np.random.randn(*shape).astype(dtype.as_numpy_dtype)
out *= std / np.sqrt(np.square(out).sum(axis=axis, keepdims=True))
return tf.constant(out)
return _initializer
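For reference, a numpy-only illustration of the column-wise normalization that `normc_initializer` performs for the default `axis=0`:

```python
import numpy as np

out = np.random.randn(64, 32).astype(np.float32)
out *= 1.0 / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
# with std=1.0, every column of the initial weight matrix has unit L2 norm
assert np.allclose(np.square(out).sum(axis=0), 1.0, atol=1e-5)
```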
@@ -180,7 +191,7 @@ class _Function(object):
if hasattr(inpt, 'make_feed_dict'):
feed_dict.update(inpt.make_feed_dict(value))
else:
feed_dict[inpt] = value
feed_dict[inpt] = adjust_shape(inpt, value)
def __call__(self, *args):
assert len(args) <= len(self.inputs), "Too many arguments provided"
@@ -190,8 +201,8 @@ class _Function(object):
self._feed_input(feed_dict, inpt, value)
# Update feed dict with givens.
for inpt in self.givens:
feed_dict[inpt] = feed_dict.get(inpt, self.givens[inpt])
results = tf.get_default_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
feed_dict[inpt] = adjust_shape(inpt, feed_dict.get(inpt, self.givens[inpt]))
results = get_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
return results
# ================================================================
@@ -244,27 +255,34 @@ class GetFlat(object):
def __call__(self):
return tf.get_default_session().run(self.op)
def flattenallbut0(x):
return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
# =============================================================
# TF placeholders management
# ============================================================
_PLACEHOLDER_CACHE = {} # name -> (placeholder, dtype, shape)
def get_placeholder(name, dtype, shape):
if name in _PLACEHOLDER_CACHE:
out, dtype1, shape1 = _PLACEHOLDER_CACHE[name]
assert dtype1 == dtype and shape1 == shape
return out
else:
out = tf.placeholder(dtype=dtype, shape=shape, name=name)
_PLACEHOLDER_CACHE[name] = (out, dtype, shape)
return out
if out.graph == tf.get_default_graph():
assert dtype1 == dtype and shape1 == shape, \
'Placeholder with name {} has already been registered and has shape {}, different from requested {}'.format(name, shape1, shape)
return out
out = tf.placeholder(dtype=dtype, shape=shape, name=name)
_PLACEHOLDER_CACHE[name] = (out, dtype, shape)
return out
def get_placeholder_cached(name):
return _PLACEHOLDER_CACHE[name][0]
def flattenallbut0(x):
return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
# ================================================================
# Diagnostics
# ================================================================
def display_var_info(vars):
@@ -273,8 +291,127 @@ def display_var_info(vars):
for v in vars:
name = v.name
if "/Adam" in name or "beta1_power" in name or "beta2_power" in name: continue
count_params += np.prod(v.shape.as_list())
if "/b:" in name: continue # Wx+b, bias is not interesting to look at => count params, but not print
logger.info(" %s%s%s" % (name, " "*(55-len(name)), str(v.shape)))
logger.info("Total model parameters: %0.1f million" % (count_params*1e-6))
v_params = np.prod(v.shape.as_list())
count_params += v_params
if "/b:" in name or "/biases" in name: continue # Wx+b, bias is not interesting to look at => count params, but not print
logger.info(" %s%s %i params %s" % (name, " "*(55-len(name)), v_params, str(v.shape)))
logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))
def get_available_gpus():
# recipe from here:
# https://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
return [x.name for x in local_device_protos if x.device_type == 'GPU']
# ================================================================
# Saving variables
# ================================================================
def load_state(fname, sess=None):
sess = sess or get_session()
saver = tf.train.Saver()
saver.restore(tf.get_default_session(), fname)
def save_state(fname, sess=None):
sess = sess or get_session()
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(tf.get_default_session(), fname)
# The methods above and below are clearly doing the same thing, and in a rather similar way
# TODO: ensure there are no subtle differences and remove one
def save_variables(save_path, variables=None, sess=None):
sess = sess or get_session()
variables = variables or tf.trainable_variables()
ps = sess.run(variables)
save_dict = {v.name: value for v, value in zip(variables, ps)}
os.makedirs(os.path.dirname(save_path), exist_ok=True)
joblib.dump(save_dict, save_path)
def load_variables(load_path, variables=None, sess=None):
sess = sess or get_session()
variables = variables or tf.trainable_variables()
loaded_params = joblib.load(os.path.expanduser(load_path))
restores = []
for v in variables:
restores.append(v.assign(loaded_params[v.name]))
sess.run(restores)
# ================================================================
# Shape adjustment for feeding into tf placeholders
# ================================================================
def adjust_shape(placeholder, data):
'''
    Adjust the shape of the data to the shape of the placeholder, if possible.
    If the shapes are incompatible, an AssertionError is thrown
Parameters:
placeholder tensorflow input placeholder
data input data to be (potentially) reshaped to be fed into placeholder
Returns:
reshaped data
'''
if not isinstance(data, np.ndarray) and not isinstance(data, list):
return data
if isinstance(data, list):
data = np.array(data)
placeholder_shape = [x or -1 for x in placeholder.shape.as_list()]
assert _check_shape(placeholder_shape, data.shape), \
'Shape of data {} is not compatible with shape of the placeholder {}'.format(data.shape, placeholder_shape)
return np.reshape(data, placeholder_shape)
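For example, feeding a single observation into a placeholder that expects a batch dimension (a sketch; the placeholder here is illustrative):

```python
import numpy as np
import tensorflow as tf

ph = tf.placeholder(tf.float32, shape=(None, 4))
single_ob = np.ones(4, dtype=np.float32)
batched = adjust_shape(ph, single_ob)   # reshaped from (4,) to (1, 4) to match (-1, 4)
assert batched.shape == (1, 4)
```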
def _check_shape(placeholder_shape, data_shape):
''' check if two shapes are compatible (i.e. differ only by dimensions of size 1, or by the batch dimension)'''
    return True  # NOTE: the detailed check below is currently disabled; all shapes are treated as compatible
squeezed_placeholder_shape = _squeeze_shape(placeholder_shape)
squeezed_data_shape = _squeeze_shape(data_shape)
for i, s_data in enumerate(squeezed_data_shape):
s_placeholder = squeezed_placeholder_shape[i]
if s_placeholder != -1 and s_data != s_placeholder:
return False
return True
def _squeeze_shape(shape):
return [x for x in shape if x != 1]
# ================================================================
# Tensorboard interfacing
# ================================================================
def launch_tensorboard_in_background(log_dir):
'''
To log the Tensorflow graph when using rl-algs
algorithms, you can run the following code
in your main script:
import threading, time
def start_tensorboard(session):
time.sleep(10) # Wait until graph is setup
tb_path = osp.join(logger.get_dir(), 'tb')
summary_writer = tf.summary.FileWriter(tb_path, graph=session.graph)
summary_op = tf.summary.merge_all()
launch_tensorboard_in_background(tb_path)
session = tf.get_default_session()
t = threading.Thread(target=start_tensorboard, args=([session]))
t.start()
'''
import subprocess
subprocess.Popen(['tensorboard', '--logdir', log_dir])

View File

@@ -0,0 +1,23 @@
import numpy as np
def tile_images(img_nhwc):
"""
    Tile N images into one big PxQ image.
    (P, Q) are chosen to be as close as possible, and if N
    is a perfect square, then P=Q.
input: img_nhwc, list or array of images, ndim=4 once turned into array
n = batch index, h = height, w = width, c = channel
returns:
bigim_HWc, ndarray with ndim=3
"""
img_nhwc = np.asarray(img_nhwc)
N, h, w, c = img_nhwc.shape
H = int(np.ceil(np.sqrt(N)))
W = int(np.ceil(float(N)/H))
img_nhwc = np.array(list(img_nhwc) + [img_nhwc[0]*0 for _ in range(N, H*W)])
img_HWhwc = img_nhwc.reshape(H, W, h, w, c)
img_HhWwc = img_HWhwc.transpose(0, 2, 1, 3, 4)
img_Hh_Ww_c = img_HhWwc.reshape(H*h, W*w, c)
return img_Hh_Ww_c
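A quick shape check of the tiling (illustration only):

```python
import numpy as np

frames = np.zeros((4, 8, 8, 3), dtype=np.uint8)   # N=4 images, each 8x8x3
big = tile_images(frames)
assert big.shape == (16, 16, 3)                   # tiled into a 2x2 grid
```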

View File

@@ -1,38 +1,45 @@
from abc import ABC, abstractmethod
from baselines import logger
from baselines.common.tile_images import tile_images
class AlreadySteppingError(Exception):
"""
Raised when an asynchronous step is running while
step_async() is called again.
"""
def __init__(self):
msg = 'already running an async step'
Exception.__init__(self, msg)
class NotSteppingError(Exception):
"""
Raised when an asynchronous step is not running but
step_wait() is called.
"""
def __init__(self):
msg = 'not running an async step'
Exception.__init__(self, msg)
class VecEnv(ABC):
"""
An abstract asynchronous, vectorized environment.
"""
def __init__(self, num_envs, observation_space, action_space):
self.num_envs = num_envs
self.observation_space = observation_space
self.action_space = action_space
self.closed = False
self.viewer = None # For rendering
@abstractmethod
def reset(self):
"""
Reset all the environments and return an array of
observations, or a tuple of observation arrays.
observations, or a dict of observation arrays.
If step_async is still doing work, that work will
be cancelled and step_wait() should not be called
@@ -58,7 +65,7 @@ class VecEnv(ABC):
Wait for the step taken with step_async().
Returns (obs, rews, dones, infos):
- obs: an array of observations, or a tuple of
- obs: an array of observations, or a dict of
arrays of observations.
- rews: an array of rewards
- dones: an array of "episode done" booleans
@@ -66,27 +73,72 @@ class VecEnv(ABC):
"""
pass
@abstractmethod
def close(self):
def close_extras(self):
"""
Clean up the environments' resources.
Clean up the extra resources, beyond what's in this base class.
Only runs when not self.closed.
"""
pass
def close(self):
if self.closed:
return
if self.viewer is not None:
self.viewer.close()
self.close_extras()
self.closed = True
def step(self, actions):
"""
Step the environments synchronously.
This is available for backwards compatibility.
"""
self.step_async(actions)
return self.step_wait()
def render(self):
logger.warn('Render not defined for %s'%self)
def render(self, mode='human'):
imgs = self.get_images()
bigimg = tile_images(imgs)
if mode == 'human':
self.get_viewer().imshow(bigimg)
elif mode == 'rgb_array':
return bigimg
else:
raise NotImplementedError
def get_images(self):
"""
Return RGB images from each environment
"""
raise NotImplementedError
@property
def unwrapped(self):
if isinstance(self, VecEnvWrapper):
return self.venv.unwrapped
else:
return self
def get_viewer(self):
if self.viewer is None:
from gym.envs.classic_control import rendering
self.viewer = rendering.SimpleImageViewer()
return self.viewer
class VecEnvWrapper(VecEnv):
"""
An environment wrapper that applies to an entire batch
of environments at once.
"""
def __init__(self, venv, observation_space=None, action_space=None):
self.venv = venv
VecEnv.__init__(self,
num_envs=venv.num_envs,
observation_space=observation_space or venv.observation_space,
action_space=action_space or venv.action_space)
VecEnv.__init__(self,
num_envs=venv.num_envs,
observation_space=observation_space or venv.observation_space,
action_space=action_space or venv.action_space)
def step_async(self, actions):
self.venv.step_async(actions)
@@ -102,18 +154,24 @@ class VecEnvWrapper(VecEnv):
def close(self):
return self.venv.close()
def render(self):
self.venv.render()
def render(self, mode='human'):
return self.venv.render(mode=mode)
def get_images(self):
return self.venv.get_images()
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)
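The point of the wrapper is that environment constructors (often lambdas or closures) survive multiprocessing's pickling; a minimal round-trip sketch:

```python
import pickle
from baselines.common.vec_env import CloudpickleWrapper

thunk = CloudpickleWrapper(lambda: 'hello from a lambda')
restored = pickle.loads(pickle.dumps(thunk))   # plain pickle of a bare lambda would fail
assert restored.x() == 'hello from a lambda'
```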

View File

@@ -1,51 +1,68 @@
import numpy as np
import gym
from gym import spaces
from . import VecEnv
from .util import copy_obs_dict, dict_to_obs, obs_space_info
class DummyVecEnv(VecEnv):
def __init__(self, env_fns):
self.envs = [fn() for fn in env_fns]
env = self.envs[0]
VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
obs_spaces = self.observation_space.spaces if isinstance(self.observation_space, gym.spaces.Tuple) else (self.observation_space,)
self.buf_obs = [np.zeros((self.num_envs,) + tuple(s.shape), s.dtype) for s in obs_spaces]
obs_space = env.observation_space
self.keys, shapes, dtypes = obs_space_info(obs_space)
self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool)
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
self.buf_infos = [{} for _ in range(self.num_envs)]
self.actions = None
def step_async(self, actions):
self.actions = actions
listify = True
try:
if len(actions) == self.num_envs:
listify = False
except TypeError:
pass
if not listify:
self.actions = actions
else:
assert self.num_envs == 1, "actions {} is either not a list or has a wrong size - cannot match to {} environments".format(actions, self.num_envs)
self.actions = [actions]
def step_wait(self):
for i in range(self.num_envs):
obs_tuple, self.buf_rews[i], self.buf_dones[i], self.buf_infos[i] = self.envs[i].step(self.actions[i])
if self.buf_dones[i]:
obs_tuple = self.envs[i].reset()
if isinstance(obs_tuple, (tuple, list)):
for t,x in enumerate(obs_tuple):
self.buf_obs[t][i] = x
else:
self.buf_obs[0][i] = obs_tuple
for e in range(self.num_envs):
action = self.actions[e]
if isinstance(self.envs[e].action_space, spaces.Discrete):
action = int(action)
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(action)
if self.buf_dones[e]:
obs = self.envs[e].reset()
self._save_obs(e, obs)
return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones),
self.buf_infos.copy())
def reset(self):
for i in range(self.num_envs):
obs_tuple = self.envs[i].reset()
if isinstance(obs_tuple, (tuple, list)):
for t,x in enumerate(obs_tuple):
self.buf_obs[t][i] = x
else:
self.buf_obs[0][i] = obs_tuple
for e in range(self.num_envs):
obs = self.envs[e].reset()
self._save_obs(e, obs)
return self._obs_from_buf()
def close(self):
return
def _save_obs(self, e, obs):
for k in self.keys:
if k is None:
self.buf_obs[k][e] = obs
else:
self.buf_obs[k][e] = obs[k]
def _obs_from_buf(self):
if len(self.buf_obs) == 1:
return np.copy(self.buf_obs[0])
else:
return tuple(np.copy(x) for x in self.buf_obs)
return dict_to_obs(copy_obs_dict(self.buf_obs))
def get_images(self):
return [env.render(mode='rgb_array') for env in self.envs]
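A minimal usage sketch (the CartPole factory is just for illustration):

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = DummyVecEnv([lambda: gym.make('CartPole-v0') for _ in range(2)])
obs = venv.reset()                                     # shape: (2,) + observation_space.shape
actions = [venv.action_space.sample() for _ in range(2)]
obs, rews, dones, infos = venv.step(actions)
```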

View File

@@ -0,0 +1,138 @@
"""
An interface for asynchronous vectorized environments.
"""
from multiprocessing import Pipe, Array, Process
import numpy as np
from . import VecEnv, CloudpickleWrapper
import ctypes
from baselines import logger
from .util import dict_to_obs, obs_space_info, obs_to_dict
_NP_TO_CT = {np.float32: ctypes.c_float,
np.int32: ctypes.c_int32,
np.int8: ctypes.c_int8,
np.uint8: ctypes.c_char,
np.bool: ctypes.c_bool}
class ShmemVecEnv(VecEnv):
"""
An AsyncEnv that uses multiprocessing to run multiple
environments in parallel.
"""
def __init__(self, env_fns, spaces=None):
"""
If you don't specify observation_space, we'll have to create a dummy
environment to get it.
"""
if spaces:
observation_space, action_space = spaces
else:
logger.log('Creating dummy env object to get spaces')
with logger.scoped_configure(format_strs=[]):
dummy = env_fns[0]()
observation_space, action_space = dummy.observation_space, dummy.action_space
dummy.close()
del dummy
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
self.obs_keys, self.obs_shapes, self.obs_dtypes = obs_space_info(observation_space)
self.obs_bufs = [
{k: Array(_NP_TO_CT[self.obs_dtypes[k].type], int(np.prod(self.obs_shapes[k]))) for k in self.obs_keys}
for _ in env_fns]
self.parent_pipes = []
self.procs = []
for env_fn, obs_buf in zip(env_fns, self.obs_bufs):
wrapped_fn = CloudpickleWrapper(env_fn)
parent_pipe, child_pipe = Pipe()
proc = Process(target=_subproc_worker,
args=(child_pipe, parent_pipe, wrapped_fn, obs_buf, self.obs_shapes, self.obs_dtypes, self.obs_keys))
proc.daemon = True
self.procs.append(proc)
self.parent_pipes.append(parent_pipe)
proc.start()
child_pipe.close()
self.waiting_step = False
self.viewer = None
def reset(self):
if self.waiting_step:
logger.warn('Called reset() while waiting for the step to complete')
self.step_wait()
for pipe in self.parent_pipes:
pipe.send(('reset', None))
return self._decode_obses([pipe.recv() for pipe in self.parent_pipes])
def step_async(self, actions):
assert len(actions) == len(self.parent_pipes)
for pipe, act in zip(self.parent_pipes, actions):
pipe.send(('step', act))
def step_wait(self):
outs = [pipe.recv() for pipe in self.parent_pipes]
obs, rews, dones, infos = zip(*outs)
return self._decode_obses(obs), np.array(rews), np.array(dones), infos
def close_extras(self):
if self.waiting_step:
self.step_wait()
for pipe in self.parent_pipes:
pipe.send(('close', None))
for pipe in self.parent_pipes:
pipe.recv()
pipe.close()
for proc in self.procs:
proc.join()
def get_images(self, mode='human'):
for pipe in self.parent_pipes:
pipe.send(('render', None))
return [pipe.recv() for pipe in self.parent_pipes]
def _decode_obses(self, obs):
result = {}
for k in self.obs_keys:
bufs = [b[k] for b in self.obs_bufs]
o = [np.frombuffer(b.get_obj(), dtype=self.obs_dtypes[k]).reshape(self.obs_shapes[k]) for b in bufs]
result[k] = np.array(o)
return dict_to_obs(result)
def _subproc_worker(pipe, parent_pipe, env_fn_wrapper, obs_bufs, obs_shapes, obs_dtypes, keys):
"""
Control a single environment instance using IPC and
shared memory.
"""
def _write_obs(maybe_dict_obs):
flatdict = obs_to_dict(maybe_dict_obs)
for k in keys:
dst = obs_bufs[k].get_obj()
dst_np = np.frombuffer(dst, dtype=obs_dtypes[k]).reshape(obs_shapes[k]) # pylint: disable=W0212
np.copyto(dst_np, flatdict[k])
env = env_fn_wrapper.x()
parent_pipe.close()
try:
while True:
cmd, data = pipe.recv()
if cmd == 'reset':
pipe.send(_write_obs(env.reset()))
elif cmd == 'step':
obs, reward, done, info = env.step(data)
if done:
obs = env.reset()
pipe.send((_write_obs(obs), reward, done, info))
elif cmd == 'render':
pipe.send(env.render(mode='rgb_array'))
elif cmd == 'close':
pipe.send(None)
break
else:
raise RuntimeError('Got unrecognized cmd %s' % cmd)
except KeyboardInterrupt:
print('ShmemVecEnv worker: got KeyboardInterrupt')
finally:
env.close()

View File

@@ -1,31 +1,34 @@
import numpy as np
from multiprocessing import Process, Pipe
from baselines.common.vec_env import VecEnv, CloudpickleWrapper
from . import VecEnv, CloudpickleWrapper
def worker(remote, parent_remote, env_fn_wrapper):
parent_remote.close()
env = env_fn_wrapper.x()
while True:
cmd, data = remote.recv()
if cmd == 'step':
ob, reward, done, info = env.step(data)
if done:
try:
while True:
cmd, data = remote.recv()
if cmd == 'step':
ob, reward, done, info = env.step(data)
if done:
ob = env.reset()
remote.send((ob, reward, done, info))
elif cmd == 'reset':
ob = env.reset()
remote.send((ob, reward, done, info))
elif cmd == 'reset':
ob = env.reset()
remote.send(ob)
elif cmd == 'reset_task':
ob = env.reset_task()
remote.send(ob)
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces':
remote.send((env.observation_space, env.action_space))
else:
raise NotImplementedError
remote.send(ob)
elif cmd == 'render':
remote.send(env.render(mode='rgb_array'))
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces':
remote.send((env.observation_space, env.action_space))
else:
raise NotImplementedError
except KeyboardInterrupt:
print('SubprocVecEnv worker: got KeyboardInterrupt')
finally:
env.close()
class SubprocVecEnv(VecEnv):
@@ -34,19 +37,19 @@ class SubprocVecEnv(VecEnv):
envs: list of gym environments to run in subprocesses
"""
self.waiting = False
self.closed = False
nenvs = len(env_fns)
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for p in self.ps:
p.daemon = True # if the main process crashes, we should not cause things to hang
p.daemon = True # if the main process crashes, we should not cause things to hang
p.start()
for remote in self.work_remotes:
remote.close()
self.remotes[0].send(('get_spaces', None))
observation_space, action_space = self.remotes[0].recv()
self.viewer = None
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
def step_async(self, actions):
@@ -65,19 +68,17 @@ class SubprocVecEnv(VecEnv):
remote.send(('reset', None))
return np.stack([remote.recv() for remote in self.remotes])
def reset_task(self):
for remote in self.remotes:
remote.send(('reset_task', None))
return np.stack([remote.recv() for remote in self.remotes])
def close(self):
if self.closed:
return
def close_extras(self):
if self.waiting:
for remote in self.remotes:
for remote in self.remotes:
remote.recv()
for remote in self.remotes:
remote.send(('close', None))
for p in self.ps:
p.join()
self.closed = True
def get_images(self):
for pipe in self.remotes:
pipe.send(('render', None))
imgs = [pipe.recv() for pipe in self.remotes]
return imgs

View File

@@ -0,0 +1,101 @@
"""
Tests for asynchronous vectorized environments.
"""
import gym
import numpy as np
import pytest
from .dummy_vec_env import DummyVecEnv
from .shmem_vec_env import ShmemVecEnv
from .subproc_vec_env import SubprocVecEnv
def assert_envs_equal(env1, env2, num_steps):
"""
Compare two environments over num_steps steps and make sure
that the observations produced by each are the same when given
the same actions.
"""
assert env1.num_envs == env2.num_envs
assert env1.action_space.shape == env2.action_space.shape
assert env1.action_space.dtype == env2.action_space.dtype
joint_shape = (env1.num_envs,) + env1.action_space.shape
try:
obs1, obs2 = env1.reset(), env2.reset()
assert np.array(obs1).shape == np.array(obs2).shape
assert np.array(obs1).shape == joint_shape
assert np.allclose(obs1, obs2)
np.random.seed(1337)
for _ in range(num_steps):
actions = np.array(np.random.randint(0, 0x100, size=joint_shape),
dtype=env1.action_space.dtype)
for env in [env1, env2]:
env.step_async(actions)
outs1 = env1.step_wait()
outs2 = env2.step_wait()
for out1, out2 in zip(outs1[:3], outs2[:3]):
assert np.array(out1).shape == np.array(out2).shape
assert np.allclose(out1, out2)
assert list(outs1[3]) == list(outs2[3])
finally:
env1.close()
env2.close()
@pytest.mark.parametrize('klass', (ShmemVecEnv, SubprocVecEnv))
@pytest.mark.parametrize('dtype', ('uint8', 'float32'))
def test_vec_env(klass, dtype): # pylint: disable=R0914
"""
Test that a vectorized environment is equivalent to
DummyVecEnv, since DummyVecEnv is less likely to be
error prone.
"""
num_envs = 3
num_steps = 100
shape = (3, 8)
def make_fn(seed):
"""
Get an environment constructor with a seed.
"""
return lambda: SimpleEnv(seed, shape, dtype)
fns = [make_fn(i) for i in range(num_envs)]
env1 = DummyVecEnv(fns)
env2 = klass(fns)
assert_envs_equal(env1, env2, num_steps=num_steps)
class SimpleEnv(gym.Env):
"""
An environment with a pre-determined observation space
and RNG seed.
"""
def __init__(self, seed, shape, dtype):
np.random.seed(seed)
self._dtype = dtype
self._start_obs = np.array(np.random.randint(0, 0x100, size=shape),
dtype=dtype)
self._max_steps = seed + 1
self._cur_obs = None
self._cur_step = 0
# this is 0xFF instead of 0x100 because the Box space includes
# the high end, while randint does not
self.action_space = gym.spaces.Box(low=0, high=0xFF, shape=shape, dtype=dtype)
self.observation_space = self.action_space
def step(self, action):
self._cur_obs += np.array(action, dtype=self._dtype)
self._cur_step += 1
done = self._cur_step >= self._max_steps
reward = self._cur_step / self._max_steps
return self._cur_obs, reward, done, {'foo': 'bar' + str(reward)}
def reset(self):
self._cur_obs = self._start_obs
self._cur_step = 0
return self._cur_obs
def render(self, mode=None):
raise NotImplementedError

View File

@@ -0,0 +1,59 @@
"""
Helpers for dealing with vectorized environments.
"""
from collections import OrderedDict
import gym
import numpy as np
def copy_obs_dict(obs):
"""
Deep-copy an observation dict.
"""
return {k: np.copy(v) for k, v in obs.items()}
def dict_to_obs(obs_dict):
"""
Convert an observation dict into a raw array if the
original observation space was not a Dict space.
"""
if set(obs_dict.keys()) == {None}:
return obs_dict[None]
return obs_dict
def obs_space_info(obs_space):
"""
Get dict-structured information about a gym.Space.
Returns:
A tuple (keys, shapes, dtypes):
keys: a list of dict keys.
shapes: a dict mapping keys to shapes.
dtypes: a dict mapping keys to dtypes.
"""
if isinstance(obs_space, gym.spaces.Dict):
assert isinstance(obs_space.spaces, OrderedDict)
subspaces = obs_space.spaces
else:
subspaces = {None: obs_space}
keys = []
shapes = {}
dtypes = {}
for key, box in subspaces.items():
keys.append(key)
shapes[key] = box.shape
dtypes[key] = box.dtype
return keys, shapes, dtypes
def obs_to_dict(obs):
"""
Convert an observation into a dict.
"""
if isinstance(obs, dict):
return obs
return {None: obs}
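For a non-Dict space everything is keyed under `None`, which lets the rest of the code treat both cases uniformly; a quick sketch:

```python
from gym.spaces import Box
from baselines.common.vec_env.util import obs_space_info, obs_to_dict, dict_to_obs

keys, shapes, dtypes = obs_space_info(Box(low=0.0, high=1.0, shape=(84, 84, 3)))
assert keys == [None] and shapes[None] == (84, 84, 3)
assert obs_to_dict(42) == {None: 42} and dict_to_obs({None: 42}) == 42
```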

View File

@@ -1,18 +1,16 @@
from baselines.common.vec_env import VecEnvWrapper
from . import VecEnvWrapper
import numpy as np
from gym import spaces
class VecFrameStack(VecEnvWrapper):
"""
    Frame-stacking wrapper for vectorized environments
"""
def __init__(self, venv, nstack):
self.venv = venv
self.nstack = nstack
wos = venv.observation_space # wrapped ob space
wos = venv.observation_space # wrapped ob space
low = np.repeat(wos.low, self.nstack, axis=-1)
high = np.repeat(wos.high, self.nstack, axis=-1)
self.stackedobs = np.zeros((venv.num_envs,)+low.shape, low.dtype)
self.stackedobs = np.zeros((venv.num_envs,) + low.shape, low.dtype)
observation_space = spaces.Box(low=low, high=high, dtype=venv.observation_space.dtype)
VecEnvWrapper.__init__(self, venv, observation_space=observation_space)
@@ -26,9 +24,6 @@ class VecFrameStack(VecEnvWrapper):
return self.stackedobs, rews, news, infos
def reset(self):
"""
Reset all environments
"""
obs = self.venv.reset()
self.stackedobs[...] = 0
self.stackedobs[..., -obs.shape[-1]:] = obs
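The stacking simply repeats the per-frame bounds along the last (channel) axis, so a single-channel 84x84 observation space becomes 84x84x`nstack`; a numpy illustration:

```python
import numpy as np

low = np.zeros((84, 84, 1), dtype=np.uint8)
assert np.repeat(low, 4, axis=-1).shape == (84, 84, 4)   # nstack=4
```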

View File

@@ -0,0 +1,29 @@
from . import VecEnvWrapper
import numpy as np
class VecMonitor(VecEnvWrapper):
def __init__(self, venv):
VecEnvWrapper.__init__(self, venv)
self.eprets = None
self.eplens = None
def reset(self):
obs = self.venv.reset()
self.eprets = np.zeros(self.num_envs, 'f')
self.eplens = np.zeros(self.num_envs, 'i')
return obs
def step_wait(self):
obs, rews, dones, infos = self.venv.step_wait()
self.eprets += rews
self.eplens += 1
newinfos = []
for (i, (done, ret, eplen, info)) in enumerate(zip(dones, self.eprets, self.eplens, infos)):
info = info.copy()
if done:
info['episode'] = {'r': ret, 'l': eplen}
self.eprets[i] = 0
self.eplens[i] = 0
newinfos.append(info)
return obs, rews, dones, newinfos
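A usage sketch (the environment name is illustrative): wrapping any VecEnv makes per-episode returns and lengths show up under `info['episode']` when an episode ends:

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = VecMonitor(DummyVecEnv([lambda: gym.make('CartPole-v0')]))
venv.reset()
obs, rews, dones, infos = venv.step([venv.action_space.sample()])
# when dones[i] is True, infos[i]['episode'] holds {'r': episode_return, 'l': episode_length}
```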

View File

@@ -1,11 +1,14 @@
from baselines.common.vec_env import VecEnvWrapper
from . import VecEnvWrapper
from baselines.common.running_mean_std import RunningMeanStd
import numpy as np
class VecNormalize(VecEnvWrapper):
"""
Vectorized environment base class
A vectorized wrapper that normalizes the observations
and returns from an environment.
"""
def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
VecEnvWrapper.__init__(self, venv)
self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
@@ -17,12 +20,6 @@ class VecNormalize(VecEnvWrapper):
self.epsilon = epsilon
def step_wait(self):
"""
Apply sequence of actions to sequence of environments
actions -> (observations, rewards, news)
where 'news' is a boolean vector indicating whether each element is new.
"""
obs, rews, news, infos = self.venv.step_wait()
self.ret = self.ret * self.gamma + rews
obs = self._obfilt(obs)
@@ -40,8 +37,5 @@ class VecNormalize(VecEnvWrapper):
return obs
def reset(self):
"""
Reset all environments
"""
obs = self.venv.reset()
return self._obfilt(obs)

View File

@@ -26,9 +26,9 @@ def reduce_std(x, axis=None, keepdims=False):
return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
def reduce_var(x, axis=None, keepdims=False):
m = tf.reduce_mean(x, axis=axis, keep_dims=True)
m = tf.reduce_mean(x, axis=axis, keepdims=True)
devs_squared = tf.square(x - m)
return tf.reduce_mean(devs_squared, axis=axis, keep_dims=keepdims)
return tf.reduce_mean(devs_squared, axis=axis, keepdims=keepdims)
def get_target_updates(vars, target_vars, tau):
logger.info('setting up target updates ...')

View File

@@ -9,44 +9,29 @@ Here's a list of commands to run to quickly get a working example:
```bash
# Train model and save the results to cartpole_model.pkl
python -m baselines.deepq.experiments.train_cartpole
python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5
# Load the model saved in cartpole_model.pkl and visualize the learned policy
python -m baselines.deepq.experiments.enjoy_cartpole
python -m baselines.run --alg=deepq --env=CartPole-v0 --load_path=./cartpole_model.pkl --num_timesteps=0 --play
```
Be sure to check out the source code of [both](experiments/train_cartpole.py) [files](experiments/enjoy_cartpole.py)!
## If you wish to apply DQN to solve a problem.
Check out our simple agent trained with the one-stop-shop `deepq.learn` function.
- [baselines/deepq/experiments/train_cartpole.py](experiments/train_cartpole.py) - train a Cartpole agent.
- [baselines/deepq/experiments/train_pong.py](experiments/train_pong.py) - train a Pong agent using convolutional neural networks.
In particular notice that once `deepq.learn` finishes training it returns an `act` function which can be used to select actions in the environment. Once trained you can easily save it and load it at a later time. For both of the files listed above there are complementary files `enjoy_cartpole.py` and `enjoy_pong.py` respectively, that load and visualize the learned policy.
In particular notice that once `deepq.learn` finishes training it returns an `act` function which can be used to select actions in the environment. Once trained you can easily save it and load it at a later time. The complementary file `enjoy_cartpole.py` loads and visualizes the learned policy.
## If you wish to experiment with the algorithm
##### Check out the examples
- [baselines/deepq/experiments/custom_cartpole.py](experiments/custom_cartpole.py) - Cartpole training with more fine grained control over the internals of DQN algorithm.
- [baselines/deepq/experiments/atari/train.py](experiments/atari/train.py) - more robust setup for training at scale.
##### Download a pretrained Atari agent
For some research projects it is sometimes useful to have an already trained agent handy. There's a variety of models to choose from. You can list them all by running:
- [baselines/deepq/defaults.py](defaults.py) - settings for training on atari. Run
```bash
python -m baselines.deepq.experiments.atari.download_model
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4
```
to train on Atari Pong (see more in repo-wide [README.md](../../README.md#training-models))
Once you pick a model, you can download it and visualize the learned policy. Be sure to pass `--dueling` flag to visualization script when using dueling models.
```bash
python -m baselines.deepq.experiments.atari.download_model --blob model-atari-duel-pong-1 --model-dir /tmp/models
python -m baselines.deepq.experiments.atari.enjoy --model-dir /tmp/models/model-atari-duel-pong-1 --env Pong --dueling
```

View File

@@ -1,8 +1,8 @@
from baselines.deepq import models # noqa
from baselines.deepq.build_graph import build_act, build_train # noqa
from baselines.deepq.simple import learn, load # noqa
from baselines.deepq.deepq import learn, load_act # noqa
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer # noqa
def wrap_atari_dqn(env):
from baselines.common.atari_wrappers import wrap_deepmind
return wrap_deepmind(env, frame_stack=True, scale=True)
return wrap_deepmind(env, frame_stack=True, scale=True)

View File

@@ -309,7 +309,7 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True, reset_ph: False, update_param_noise_threshold_ph: False, update_param_noise_scale_ph: False},
updates=updates)
def act(ob, reset, update_param_noise_threshold, update_param_noise_scale, stochastic=True, update_eps=-1):
def act(ob, reset=False, update_param_noise_threshold=False, update_param_noise_scale=False, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps, reset, update_param_noise_threshold, update_param_noise_scale)
return act

View File

@@ -6,22 +6,28 @@ import zipfile
import cloudpickle
import numpy as np
import gym
import baselines.common.tf_util as U
from baselines.common.tf_util import load_state, save_state
from baselines import logger
from baselines.common.schedules import LinearSchedule
from baselines.common import set_global_seeds
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
from baselines.deepq.utils import BatchInput, load_state, save_state
from baselines.deepq.utils import ObservationInput
from baselines.common.tf_util import get_session
from baselines.deepq.models import build_q_func
class ActWrapper(object):
def __init__(self, act, act_params):
self._act = act
self._act_params = act_params
self.initial_state = None
@staticmethod
def load(path):
def load_act(path):
with open(path, "rb") as f:
model_data, act_params = cloudpickle.load(f)
act = deepq.build_act(**act_params)
@@ -40,7 +46,10 @@ class ActWrapper(object):
def __call__(self, *args, **kwargs):
return self._act(*args, **kwargs)
def save(self, path=None):
def step(self, observation, **kwargs):
return self._act([observation], **kwargs), None, None, None
def save_act(self, path=None):
"""Save model to a pickle located at `path`"""
if path is None:
path = os.path.join(logger.get_dir(), "model.pkl")
@@ -59,8 +68,12 @@ class ActWrapper(object):
with open(path, "wb") as f:
cloudpickle.dump((model_data, self._act_params), f)
def save(self, path):
save_state(path)
self.save_act(path+".pickle")
def load(path):
def load_act(path):
"""Load act function that was returned by learn function.
Parameters
@@ -74,13 +87,14 @@ def load(path):
function that takes a batch of observations
and returns actions.
"""
return ActWrapper.load(path)
return ActWrapper.load_act(path)
def learn(env,
q_func,
network,
seed=None,
lr=5e-4,
max_timesteps=100000,
total_timesteps=100000,
buffer_size=50000,
exploration_fraction=0.1,
exploration_final_eps=0.02,
@@ -88,6 +102,7 @@ def learn(env,
batch_size=32,
print_freq=100,
checkpoint_freq=10000,
checkpoint_path=None,
learning_starts=1000,
gamma=1.0,
target_network_update_freq=500,
@@ -97,7 +112,10 @@ def learn(env,
prioritized_replay_beta_iters=None,
prioritized_replay_eps=1e-6,
param_noise=False,
callback=None):
callback=None,
load_path=None,
**network_kwargs
):
"""Train a deepq model.
Parameters
@@ -116,7 +134,7 @@ def learn(env,
and returns a tensor of shape (batch_size, num_actions) with values of every action.
lr: float
learning rate for adam optimizer
max_timesteps: int
total_timesteps: int
        number of env steps to optimize for
buffer_size: int
size of the replay buffer
@@ -150,12 +168,16 @@ def learn(env,
initial value of beta for prioritized replay buffer
prioritized_replay_beta_iters: int
number of iterations over which beta will be annealed from initial value
to 1.0. If set to None, it defaults to max_timesteps.
to 1.0. If set to None, it defaults to total_timesteps.
prioritized_replay_eps: float
epsilon to add to the TD errors when updating priorities.
callback: (locals, globals) -> None
function called at every step with the state of the algorithm.
If callback returns true, training stops.
load_path: str
path to load the model from. (default: None)
**network_kwargs
additional keyword arguments to pass to the network builder.
Returns
-------
@@ -165,14 +187,17 @@ def learn(env,
"""
# Create all the functions necessary to train the model
sess = tf.Session()
sess.__enter__()
sess = get_session()
set_global_seeds(seed)
q_func = build_q_func(network, **network_kwargs)
# capture the shape outside the closure so that the env object is not serialized
# by cloudpickle when serializing make_obs_ph
observation_space_shape = env.observation_space.shape
observation_space = env.observation_space
def make_obs_ph(name):
return BatchInput(observation_space_shape, name=name)
return ObservationInput(observation_space, name=name)
act, train, update_target, debug = deepq.build_train(
make_obs_ph=make_obs_ph,
@@ -191,12 +216,12 @@ def learn(env,
}
act = ActWrapper(act, act_params)
# Create the replay buffer
if prioritized_replay:
replay_buffer = PrioritizedReplayBuffer(buffer_size, alpha=prioritized_replay_alpha)
if prioritized_replay_beta_iters is None:
prioritized_replay_beta_iters = max_timesteps
prioritized_replay_beta_iters = total_timesteps
beta_schedule = LinearSchedule(prioritized_replay_beta_iters,
initial_p=prioritized_replay_beta0,
final_p=1.0)
@@ -204,7 +229,7 @@ def learn(env,
replay_buffer = ReplayBuffer(buffer_size)
beta_schedule = None
# Create the schedule for exploration starting from 1.
exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * max_timesteps),
exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * total_timesteps),
initial_p=1.0,
final_p=exploration_final_eps)
@@ -216,10 +241,23 @@ def learn(env,
saved_mean_reward = None
obs = env.reset()
reset = True
with tempfile.TemporaryDirectory() as td:
model_saved = False
td = checkpoint_path or td
model_file = os.path.join(td, "model")
for t in range(max_timesteps):
model_saved = False
if tf.train.latest_checkpoint(td) is not None:
load_state(model_file)
logger.log('Loaded model from {}'.format(model_file))
model_saved = True
elif load_path is not None:
load_state(load_path)
logger.log('Loaded model from {}'.format(load_path))
for t in range(total_timesteps):
if callback is not None:
if callback(locals(), globals()):
break
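
The hunks above rename `max_timesteps` to `total_timesteps`, accept a network name instead of a prebuilt `q_func`, and add `checkpoint_path`/`load_path` handling. A minimal usage sketch of the refactored call follows; the environment and hyperparameter values are illustrative, not taken from this diff.

```python
import gym
from baselines import deepq

env = gym.make('CartPole-v0')                 # illustrative environment
act = deepq.learn(
    env,
    network='mlp',                            # network name resolved by build_q_func
    lr=1e-3,
    total_timesteps=100000,                   # renamed from max_timesteps
    buffer_size=50000,
    checkpoint_path=None,                     # new: directory to checkpoint into / resume from
    load_path=None,                           # new: initialize weights from a saved model
)
act.save('/tmp/cartpole_model')               # saves TF variables plus a '.pickle' of the act wrapper
```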

View File

@@ -0,0 +1,21 @@
def atari():
return dict(
network='conv_only',
lr=1e-4,
buffer_size=10000,
exploration_fraction=0.1,
exploration_final_eps=0.01,
train_freq=4,
learning_starts=10000,
target_network_update_freq=1000,
gamma=0.99,
prioritized_replay=True,
prioritized_replay_alpha=0.6,
checkpoint_freq=10000,
checkpoint_path=None,
dueling=True
)
def retro():
return atari()
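
These per-game defaults are plain dicts of `deepq.learn` keyword arguments. A hedged sketch of merging them with caller overrides is below; the `baselines.deepq.defaults` module path is an assumption, since the file name is not shown in this view.

```python
from baselines.deepq import defaults               # assumed module path for the file above

kwargs = defaults.atari()
kwargs.update(total_timesteps=int(1e6), seed=0)    # caller overrides win over the defaults
# act = deepq.learn(env, **kwargs)                 # env: an Atari env wrapped with wrap_atari_dqn
```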

View File

@@ -9,7 +9,7 @@ import baselines.common.tf_util as U
from baselines import logger
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer
from baselines.deepq.utils import BatchInput
from baselines.deepq.utils import ObservationInput
from baselines.common.schedules import LinearSchedule
@@ -28,7 +28,7 @@ if __name__ == '__main__':
env = gym.make("CartPole-v0")
# Create all the functions necessary to train the model
act, train, update_target, debug = deepq.build_train(
make_obs_ph=lambda name: BatchInput(env.observation_space.shape, name=name),
make_obs_ph=lambda name: ObservationInput(env.observation_space, name=name),
q_func=model,
num_actions=env.action_space.n,
optimizer=tf.train.AdamOptimizer(learning_rate=5e-4),

View File

@@ -0,0 +1,34 @@
import argparse
import numpy as np
from baselines import deepq
from baselines.common import retro_wrappers
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--env', help='environment ID', default='SuperMarioBros-Nes')
parser.add_argument('--gamestate', help='game state to load', default='Level1-1')
parser.add_argument('--model', help='model pickle file from ActWrapper.save', default='model.pkl')
args = parser.parse_args()
env = retro_wrappers.make_retro(game=args.env, state=args.gamestate, max_episode_steps=None)
env = retro_wrappers.wrap_deepmind_retro(env)
act = deepq.load(args.model)
while True:
obs, done = env.reset(), False
episode_rew = 0
while not done:
env.render()
action = act(obs[None])[0]
env_action = np.zeros(env.action_space.n)
env_action[action] = 1
obs, rew, done, _ = env.step(env_action)
episode_rew += rew
print('Episode reward', episode_rew)
if __name__ == '__main__':
main()

View File

@@ -5,29 +5,33 @@ import argparse
from baselines import logger
from baselines.common.atari_wrappers import make_atari
def main():
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--prioritized', type=int, default=1)
parser.add_argument('--prioritized-replay-alpha', type=float, default=0.6)
parser.add_argument('--dueling', type=int, default=1)
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
parser.add_argument('--checkpoint-freq', type=int, default=10000)
parser.add_argument('--checkpoint-path', type=str, default=None)
args = parser.parse_args()
logger.configure()
set_global_seeds(args.seed)
env = make_atari(args.env)
env = bench.Monitor(env, logger.get_dir())
env = deepq.wrap_atari_dqn(env)
model = deepq.models.cnn_to_mlp(
deepq.learn(
env,
"conv_only",
convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
hiddens=[256],
dueling=bool(args.dueling),
)
act = deepq.learn(
env,
q_func=model,
lr=1e-4,
max_timesteps=args.num_timesteps,
total_timesteps=args.num_timesteps,
buffer_size=10000,
exploration_fraction=0.1,
exploration_final_eps=0.01,
@@ -35,9 +39,12 @@ def main():
learning_starts=10000,
target_network_update_freq=1000,
gamma=0.99,
prioritized_replay=bool(args.prioritized)
prioritized_replay=bool(args.prioritized),
prioritized_replay_alpha=args.prioritized_replay_alpha,
checkpoint_freq=args.checkpoint_freq,
checkpoint_path=args.checkpoint_path,
)
# act.save("pong_model.pkl") XXX
env.close()

View File

@@ -0,0 +1,49 @@
import argparse
from baselines import deepq
from baselines.common import set_global_seeds
from baselines import bench
from baselines import logger
from baselines.common import retro_wrappers
import retro
def main():
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env', help='environment ID', default='SuperMarioBros-Nes')
parser.add_argument('--gamestate', help='game state to load', default='Level1-1')
parser.add_argument('--seed', help='seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
args = parser.parse_args()
logger.configure()
set_global_seeds(args.seed)
env = retro_wrappers.make_retro(game=args.env, state=args.gamestate, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE)
env.seed(args.seed)
env = bench.Monitor(env, logger.get_dir())
env = retro_wrappers.wrap_deepmind_retro(env)
model = deepq.models.cnn_to_mlp(
convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
hiddens=[256],
dueling=True
)
act = deepq.learn(
env,
q_func=model,
lr=1e-4,
max_timesteps=args.num_timesteps,
buffer_size=10000,
exploration_fraction=0.1,
exploration_final_eps=0.01,
train_freq=4,
learning_starts=10000,
target_network_update_freq=1000,
gamma=0.99,
prioritized_replay=True
)
act.save()
env.close()
if __name__ == '__main__':
main()

View File

@@ -11,12 +11,11 @@ def callback(lcl, _glb):
def main():
env = gym.make("CartPole-v0")
model = deepq.models.mlp([64])
act = deepq.learn(
env,
q_func=model,
network='mlp',
lr=1e-3,
max_timesteps=100000,
total_timesteps=100000,
buffer_size=50000,
exploration_fraction=0.1,
exploration_final_eps=0.02,

View File

@@ -89,3 +89,41 @@ def cnn_to_mlp(convs, hiddens, dueling=False, layer_norm=False):
return lambda *args, **kwargs: _cnn_to_mlp(convs, hiddens, dueling, layer_norm=layer_norm, *args, **kwargs)
def build_q_func(network, hiddens=[256], dueling=True, layer_norm=False, **network_kwargs):
if isinstance(network, str):
from baselines.common.models import get_network_builder
network = get_network_builder(network)(**network_kwargs)
def q_func_builder(input_placeholder, num_actions, scope, reuse=False):
with tf.variable_scope(scope, reuse=reuse):
latent, _ = network(input_placeholder)
latent = layers.flatten(latent)
with tf.variable_scope("action_value"):
action_out = latent
for hidden in hiddens:
action_out = layers.fully_connected(action_out, num_outputs=hidden, activation_fn=None)
if layer_norm:
action_out = layers.layer_norm(action_out, center=True, scale=True)
action_out = tf.nn.relu(action_out)
action_scores = layers.fully_connected(action_out, num_outputs=num_actions, activation_fn=None)
if dueling:
with tf.variable_scope("state_value"):
state_out = latent
for hidden in hiddens:
state_out = layers.fully_connected(state_out, num_outputs=hidden, activation_fn=None)
if layer_norm:
state_out = layers.layer_norm(state_out, center=True, scale=True)
state_out = tf.nn.relu(state_out)
state_score = layers.fully_connected(state_out, num_outputs=1, activation_fn=None)
action_scores_mean = tf.reduce_mean(action_scores, 1)
action_scores_centered = action_scores - tf.expand_dims(action_scores_mean, 1)
q_out = state_score + action_scores_centered
else:
q_out = action_scores
return q_out
return q_func_builder
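
The dueling branch recombines the two heads as Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)), which is what the `reduce_mean`/`expand_dims` lines compute. A toy numpy illustration with made-up numbers:

```python
import numpy as np

action_scores = np.array([[1.0, 2.0, 3.0]])   # advantage head output, shape (batch, num_actions)
state_score = np.array([[10.0]])              # value head output, shape (batch, 1)

centered = action_scores - action_scores.mean(axis=1, keepdims=True)
q_out = state_score + centered                # [[ 9., 10., 11.]]
```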

View File

@@ -86,7 +86,7 @@ class PrioritizedReplayBuffer(ReplayBuffer):
ReplayBuffer.__init__
"""
super(PrioritizedReplayBuffer, self).__init__(size)
assert alpha > 0
assert alpha >= 0
self._alpha = alpha
it_capacity = 1
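
Relaxing the assertion to `alpha >= 0` lets `alpha = 0` switch prioritization off entirely: assuming priorities are raised to the power `alpha` as in the prioritized-replay paper, sampling then reduces to uniform. A toy illustration with hypothetical priorities:

```python
import numpy as np

priorities = np.array([1.0, 5.0, 0.5])        # hypothetical |TD error|-based priorities
for alpha in (0.0, 0.6):
    p = priorities ** alpha
    print(alpha, p / p.sum())                 # alpha=0.0 gives uniform [1/3, 1/3, 1/3]
```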

View File

@@ -1,24 +1,13 @@
import os
from baselines.common.input import observation_input
from baselines.common.tf_util import adjust_shape
import tensorflow as tf
# ================================================================
# Saving variables
# ================================================================
def load_state(fname):
saver = tf.train.Saver()
saver.restore(tf.get_default_session(), fname)
def save_state(fname):
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(tf.get_default_session(), fname)
# ================================================================
# Placeholders
# ================================================================
class TfInput(object):
def __init__(self, name="(unnamed)"):
"""Generalized Tensorflow placeholder. The main differences are:
@@ -48,22 +37,8 @@ class PlaceholderTfInput(TfInput):
return self._placeholder
def make_feed_dict(self, data):
return {self._placeholder: data}
return {self._placeholder: adjust_shape(self._placeholder, data)}
class BatchInput(PlaceholderTfInput):
def __init__(self, shape, dtype=tf.float32, name=None):
"""Creates a placeholder for a batch of tensors of a given shape and dtype
Parameters
----------
shape: [int]
shape of a single element of the batch
dtype: tf.dtype
number representation used for tensor contents
name: str
name of the underlying placeholder
"""
super().__init__(tf.placeholder(dtype, [None] + list(shape), name=name))
class Uint8Input(PlaceholderTfInput):
def __init__(self, shape, name=None):
@@ -85,4 +60,25 @@ class Uint8Input(PlaceholderTfInput):
self._output = tf.cast(super().get(), tf.float32) / 255.0
def get(self):
return self._output
class ObservationInput(PlaceholderTfInput):
def __init__(self, observation_space, name=None):
"""Creates an input placeholder tailored to a specific observation space
Parameters
----------
observation_space:
observation space of the environment. Should be one of the gym.spaces types
name: str
tensorflow name of the underlying placeholder
"""
inpt, self.processed_inpt = observation_input(observation_space, name=name)
super().__init__(inpt)
def get(self):
return self.processed_inpt

View File

@@ -47,18 +47,12 @@ class Mujoco_Dset(object):
obs = traj_data['obs'][:traj_limitation]
acs = traj_data['acs'][:traj_limitation]
def flatten(x):
# x.shape = (E,), or (E, L, D)
_, size = x[0].shape
episode_length = [len(i) for i in x]
y = np.zeros((sum(episode_length), size))
start_idx = 0
for l, x_i in zip(episode_length, x):
y[start_idx:(start_idx+l)] = x_i
start_idx += l
return y
self.obs = np.array(flatten(obs))
self.acs = np.array(flatten(acs))
# obs, acs: shape (N, L, ) + S where N = # episodes, L = episode length
# and S is the environment observation/action space.
# Flatten to (N * L, prod(S))
self.obs = np.reshape(obs, [-1, np.prod(obs.shape[2:])])
self.acs = np.reshape(acs, [-1, np.prod(acs.shape[2:])])
self.rets = traj_data['ep_rets'][:traj_limitation]
self.avg_ret = sum(self.rets)/len(self.rets)
self.std_ret = np.std(np.array(self.rets))
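
The replacement flattening is a single reshape over the episode axis; unlike the removed per-episode loop, it assumes every stored trajectory has the same length. A small numpy sketch with hypothetical shapes:

```python
import numpy as np

obs = np.zeros((4, 100, 11))                  # 4 episodes x 100 steps x 11-dim observations (hypothetical)
flat = np.reshape(obs, [-1, np.prod(obs.shape[2:])])
assert flat.shape == (400, 11)
```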

View File

@@ -1,5 +1,5 @@
# Hindsight Experience Replay
For details on Hindsight Experience Replay (HER), please read the [paper](https://arxiv.org/pdf/1707.01495.pdf).
For details on Hindsight Experience Replay (HER), please read the [paper](https://arxiv.org/abs/1707.01495).
## How to use Hindsight Experience Replay
@@ -22,14 +22,11 @@ You can try it right now with the results of the training step (the script print
This should visualize the current policy for 10 episodes and will also print statistics.
### Advanced usage
The train script comes with advanced features like MPI support, which allows scaling across all cores of a single machine.
To see all available options, simply run this command:
### Reproducing results
In order to reproduce the results from [Plappert et al. (2018)](https://arxiv.org/abs/1802.09464), run the following command:
```bash
python -m baselines.her.experiment.train --help
python -m baselines.her.experiment.train --num_cpu 19
```
To run on, say, 20 CPU cores, you can use the following command:
```bash
python -m baselines.her.experiment.train --num_cpu 20
```
That's it, you are now running rollouts using 20 MPI workers and averaging gradients for network updates across all 20 cores.
This will require a machine with a sufficient number of physical CPU cores. In our experiments,
we used [Azure's D15v2 instances](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes),
which have 20 physical cores. We only scheduled the experiment on 19 of those to leave some head-room on the system.

View File

@@ -1,7 +1,4 @@
from copy import deepcopy
import numpy as np
import json
import os
import gym
from baselines import logger
@@ -10,7 +7,7 @@ from baselines.her.her import make_sample_her_transitions
DEFAULT_ENV_PARAMS = {
'FetchReach-v0': {
'FetchReach-v1': {
'n_cycles': 10,
},
}
@@ -51,6 +48,8 @@ DEFAULT_PARAMS = {
CACHED_ENVS = {}
def cached_make_env(make_env):
"""
Only creates a new environment from the provided function if one has not yet already been
@@ -68,6 +67,7 @@ def prepare_params(kwargs):
ddpg_params = dict()
env_name = kwargs['env_name']
def make_env():
return gym.make(env_name)
kwargs['make_env'] = make_env
@@ -75,7 +75,7 @@ def prepare_params(kwargs):
assert hasattr(tmp_env, '_max_episode_steps')
kwargs['T'] = tmp_env._max_episode_steps
tmp_env.reset()
kwargs['max_u'] = np.array(kwargs['max_u']) if type(kwargs['max_u']) == list else kwargs['max_u']
kwargs['max_u'] = np.array(kwargs['max_u']) if isinstance(kwargs['max_u'], list) else kwargs['max_u']
kwargs['gamma'] = 1. - 1. / kwargs['T']
if 'lr' in kwargs:
kwargs['pi_lr'] = kwargs['lr']
@@ -83,7 +83,7 @@ def prepare_params(kwargs):
del kwargs['lr']
for name in ['buffer_size', 'hidden', 'layers',
'network_class',
'polyak',
'batch_size', 'Q_lr', 'pi_lr',
'norm_eps', 'norm_clip', 'max_u',
'action_l2', 'clip_obs', 'scope', 'relative_goals']:
@@ -103,6 +103,7 @@ def log_params(params, logger=logger):
def configure_her(params):
env = cached_make_env(params['make_env'])
env.reset()
def reward_fun(ag_2, g, info): # vectorized
return env.compute_reward(achieved_goal=ag_2, desired_goal=g, info=info)

View File

@@ -67,13 +67,13 @@ for curr_path in paths:
print('skipping {}'.format(curr_path))
continue
print('loading {} ({})'.format(curr_path, len(results['epoch'])))
with open(os.path.join(curr_path, 'metadata.json'), 'r') as f:
metadata = json.load(f)
with open(os.path.join(curr_path, 'params.json'), 'r') as f:
params = json.load(f)
success_rate = np.array(results['test/success_rate'])
epoch = np.array(results['epoch']) + 1
env_id = metadata['kwargs']['env_name']
replay_strategy = metadata['kwargs']['replay_strategy']
env_id = params['env_name']
replay_strategy = params['replay_strategy']
if replay_strategy == 'future':
config = 'her'

View File

@@ -13,6 +13,8 @@ import baselines.her.experiment.config as config
from baselines.her.rollout import RolloutWorker
from baselines.her.util import mpi_fork
from subprocess import CalledProcessError
def mpi_average(value):
if value == []:
@@ -81,12 +83,17 @@ def train(policy, rollout_worker, evaluator,
def launch(
env_name, logdir, n_epochs, num_cpu, seed, replay_strategy, policy_save_interval, clip_return,
env, logdir, n_epochs, num_cpu, seed, replay_strategy, policy_save_interval, clip_return,
override_params={}, save_policies=True
):
# Fork for multi-CPU MPI implementation.
if num_cpu > 1:
whoami = mpi_fork(num_cpu)
try:
whoami = mpi_fork(num_cpu, ['--bind-to', 'core'])
except CalledProcessError:
# fancy version of mpi call failed, try simple version
whoami = mpi_fork(num_cpu)
if whoami == 'parent':
sys.exit(0)
import baselines.common.tf_util as U
@@ -109,16 +116,28 @@ def launch(
# Prepare params.
params = config.DEFAULT_PARAMS
params['env_name'] = env_name
params['env_name'] = env
params['replay_strategy'] = replay_strategy
if env_name in config.DEFAULT_ENV_PARAMS:
params.update(config.DEFAULT_ENV_PARAMS[env_name]) # merge env-specific parameters in
if env in config.DEFAULT_ENV_PARAMS:
params.update(config.DEFAULT_ENV_PARAMS[env]) # merge env-specific parameters in
params.update(**override_params) # makes it possible to override any parameter
with open(os.path.join(logger.get_dir(), 'params.json'), 'w') as f:
json.dump(params, f)
params = config.prepare_params(params)
config.log_params(params, logger=logger)
if num_cpu == 1:
logger.warn()
logger.warn('*** Warning ***')
logger.warn(
'You are running HER with just a single MPI worker. This will work, but the ' +
'experiments that we report in Plappert et al. (2018, https://arxiv.org/abs/1802.09464) ' +
'were obtained with --num_cpu 19. This makes a significant difference and if you ' +
'are looking to reproduce those results, be aware of this. Please also refer to ' +
'https://github.com/openai/baselines/issues/314 for further details.')
logger.warn('****************')
logger.warn()
dims = config.configure_dims(params)
policy = config.configure_ddpg(dims=dims, params=params, clip_return=clip_return)
@@ -156,7 +175,7 @@ def launch(
@click.command()
@click.option('--env_name', type=str, default='FetchReach-v0', help='the name of the OpenAI Gym environment that you want to train on')
@click.option('--env', type=str, default='FetchReach-v1', help='the name of the OpenAI Gym environment that you want to train on')
@click.option('--logdir', type=str, default=None, help='the path to where logs and policy pickles should go. If not specified, creates a folder in /tmp/')
@click.option('--n_epochs', type=int, default=50, help='the number of training epochs to run')
@click.option('--num_cpu', type=int, default=1, help='the number of CPU cores to use (using MPI)')

View File

@@ -58,12 +58,12 @@ def nn(input, layers_sizes, reuse=None, flatten=False, name=""):
"""Creates a simple neural network
"""
for i, size in enumerate(layers_sizes):
activation = tf.nn.relu if i < len(layers_sizes)-1 else None
activation = tf.nn.relu if i < len(layers_sizes) - 1 else None
input = tf.layers.dense(inputs=input,
units=size,
kernel_initializer=tf.contrib.layers.xavier_initializer(),
reuse=reuse,
name=name+'_'+str(i))
name=name + '_' + str(i))
if activation:
input = activation(input)
if flatten:
@@ -85,7 +85,7 @@ def install_mpi_excepthook():
sys.excepthook = new_hook
def mpi_fork(n):
def mpi_fork(n, extra_mpi_args=[]):
"""Re-launches the current script with workers
Returns "parent" for original parent, "child" for MPI children
"""
@@ -99,14 +99,10 @@ def mpi_fork(n):
IN_MPI="1"
)
# "-bind-to core" is crucial for good performance
args = [
"mpirun",
"-np",
str(n),
"-bind-to",
"core",
sys.executable
]
args = ["mpirun", "-np", str(n)] + \
extra_mpi_args + \
[sys.executable]
args += sys.argv
subprocess.check_call(args, env=env)
return "parent"
@@ -140,5 +136,5 @@ def reshape_for_broadcasting(source, target):
before broadcasting it with MPI.
"""
dim = len(target.get_shape())
shape = ([1] * (dim-1)) + [-1]
shape = ([1] * (dim - 1)) + [-1]
return tf.reshape(tf.cast(source, target.dtype), shape)

View File

@@ -8,9 +8,6 @@ import datetime
import tempfile
from collections import defaultdict
LOG_OUTPUT_FORMATS = ['stdout', 'log', 'csv']
# Also valid: json, tensorboard
DEBUG = 10
INFO = 20
WARN = 30
@@ -74,8 +71,11 @@ class HumanOutputFormat(KVWriter, SeqWriter):
return s[:20] + '...' if len(s) > 23 else s
def writeseq(self, seq):
for arg in seq:
self.file.write(arg)
seq = list(seq)
for (i, elem) in enumerate(seq):
self.file.write(elem)
if i < len(seq) - 1: # add space unless this is the last one
self.file.write(' ')
self.file.write('\n')
self.file.flush()
@@ -355,10 +355,19 @@ def configure(dir=None, format_strs=None):
assert isinstance(dir, str)
os.makedirs(dir, exist_ok=True)
log_suffix = ''
from mpi4py import MPI
rank = MPI.COMM_WORLD.Get_rank()
if rank > 0:
log_suffix = "-rank%03i" % rank
if format_strs is None:
strs = os.getenv('OPENAI_LOG_FORMAT')
format_strs = strs.split(',') if strs else LOG_OUTPUT_FORMATS
output_formats = [make_output_format(f, dir) for f in format_strs]
if rank == 0:
format_strs = os.getenv('OPENAI_LOG_FORMAT', 'stdout,log,csv').split(',')
else:
format_strs = os.getenv('OPENAI_LOG_FORMAT_MPI', 'log').split(',')
format_strs = filter(None, format_strs)
output_formats = [make_output_format(f, dir, log_suffix) for f in format_strs]
Logger.CURRENT = Logger(dir=dir, output_formats=output_formats)
log('Logging to %s'%dir)
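
After this change the output formats are chosen per MPI rank: rank 0 reads `OPENAI_LOG_FORMAT` (default `stdout,log,csv`), while other ranks read `OPENAI_LOG_FORMAT_MPI` (default `log`) and get a `-rank%03i` suffix on their files. A hedged usage sketch; the directory is illustrative:

```python
import os

os.environ['OPENAI_LOG_FORMAT'] = 'stdout,csv'     # formats used by rank 0
os.environ['OPENAI_LOG_FORMAT_MPI'] = 'log'        # formats used by ranks > 0

from baselines import logger

logger.configure(dir='/tmp/experiment')            # hypothetical log directory
logger.logkv('episode_reward', 123.4)
logger.dumpkvs()
```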

View File

@@ -5,3 +5,5 @@
- `mpirun -np 8 python -m baselines.ppo1.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.ppo1.run_mujoco` runs the algorithm for 1M frames on a Mujoco environment.
- Train mujoco 3d humanoid (with optimal-ish hyperparameters): `mpirun -np 16 python -m baselines.ppo1.run_humanoid --model-path=/path/to/model`
- Render the 3d humanoid: `python -m baselines.ppo1.run_humanoid --play --model-path=/path/to/model`

View File

@@ -212,5 +212,7 @@ def learn(env, policy_fn, *,
if MPI.COMM_WORLD.Get_rank()==0:
logger.dump_tabular()
return pi
def flatten_lists(listoflists):
return [el for list_ in listoflists for el in list_]

View File

@@ -18,7 +18,7 @@ def train(env_id, num_timesteps, seed):
logger.configure()
else:
logger.configure(format_strs=[])
workerseed = seed + 10000 * MPI.COMM_WORLD.Get_rank()
workerseed = seed + 10000 * MPI.COMM_WORLD.Get_rank() if seed is not None else None
set_global_seeds(workerseed)
env = make_atari(env_id)
def policy_fn(name, ob_space, ac_space): #pylint: disable=W0613

View File

@@ -0,0 +1,75 @@
#!/usr/bin/env python3
import os
from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
from baselines.common import tf_util as U
from baselines import logger
import gym
def train(num_timesteps, seed, model_path=None):
env_id = 'Humanoid-v2'
from baselines.ppo1 import mlp_policy, pposgd_simple
U.make_session(num_cpu=1).__enter__()
def policy_fn(name, ob_space, ac_space):
return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
hid_size=64, num_hid_layers=2)
env = make_mujoco_env(env_id, seed)
# parameters below were the best found in a simple random search
# these are good enough to make the humanoid walk, but whether they are
# the absolute best is not certain
env = RewScale(env, 0.1)
pi = pposgd_simple.learn(env, policy_fn,
max_timesteps=num_timesteps,
timesteps_per_actorbatch=2048,
clip_param=0.2, entcoeff=0.0,
optim_epochs=10,
optim_stepsize=3e-4,
optim_batchsize=64,
gamma=0.99,
lam=0.95,
schedule='linear',
)
env.close()
if model_path:
U.save_state(model_path)
return pi
class RewScale(gym.RewardWrapper):
def __init__(self, env, scale):
gym.RewardWrapper.__init__(self, env)
self.scale = scale
def reward(self, r):
return r * self.scale
def main():
logger.configure()
parser = mujoco_arg_parser()
parser.add_argument('--model-path', default=os.path.join(logger.get_dir(), 'humanoid_policy'))
parser.set_defaults(num_timesteps=int(2e7))
args = parser.parse_args()
if not args.play:
# train the model
train(num_timesteps=args.num_timesteps, seed=args.seed, model_path=args.model_path)
else:
# construct the model object, load pre-trained model and render
pi = train(num_timesteps=1, seed=args.seed)
U.load_state(args.model_path)
env = make_mujoco_env('Humanoid-v2', seed=0)
ob = env.reset()
while True:
action = pi.act(stochastic=False, ob=ob)[0]
ob, _, done, _ = env.step(action)
env.render()
if done:
ob = env.reset()
if __name__ == '__main__':
main()

View File

@@ -0,0 +1,40 @@
#!/usr/bin/env python3
from mpi4py import MPI
from baselines.common import set_global_seeds
from baselines import logger
from baselines.common.cmd_util import make_robotics_env, robotics_arg_parser
import mujoco_py
def train(env_id, num_timesteps, seed):
from baselines.ppo1 import mlp_policy, pposgd_simple
import baselines.common.tf_util as U
rank = MPI.COMM_WORLD.Get_rank()
sess = U.single_threaded_session()
sess.__enter__()
mujoco_py.ignore_mujoco_warnings().__enter__()
workerseed = seed + 10000 * rank
set_global_seeds(workerseed)
env = make_robotics_env(env_id, workerseed, rank=rank)
def policy_fn(name, ob_space, ac_space):
return mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
hid_size=256, num_hid_layers=3)
pposgd_simple.learn(env, policy_fn,
max_timesteps=num_timesteps,
timesteps_per_actorbatch=2048,
clip_param=0.2, entcoeff=0.0,
optim_epochs=5, optim_stepsize=3e-4, optim_batchsize=256,
gamma=0.99, lam=0.95, schedule='linear',
)
env.close()
def main():
args = robotics_arg_parser().parse_args()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
if __name__ == '__main__':
main()

View File

@@ -2,5 +2,7 @@
- Original paper: https://arxiv.org/abs/1707.06347
- Baselines blog post: https://blog.openai.com/openai-baselines-ppo/
- `python -m baselines.ppo2.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.ppo2.run_mujoco` runs the algorithm for 1M frames on a Mujoco environment.
- `python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- `python -m baselines.run --alg=ppo2 --env=Ant-v2 --num_timesteps=1e6` runs the algorithm for 1M frames on a Mujoco Ant environment.
- also refer to the repo-wide [README.md](../../README.md#training-models)

View File

@@ -0,0 +1,22 @@
def mujoco():
return dict(
nsteps=2048,
nminibatches=32,
lam=0.95,
gamma=0.99,
noptepochs=10,
log_interval=1,
ent_coef=0.0,
lr=lambda f: 3e-4 * f,
cliprange=0.2,
value_network='copy'
)
def atari():
return dict(
nsteps=128, nminibatches=4,
lam=0.95, gamma=0.99, noptepochs=4, log_interval=1,
ent_coef=.01,
lr=lambda f : f * 2.5e-4,
cliprange=lambda f : f * 0.1,
)
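
The `lr` and `cliprange` entries are schedules over the fraction of training remaining; `ppo2.learn` (diffed below) evaluates them once per update. A short sketch of how such a lambda decays:

```python
lr = lambda f: 3e-4 * f                       # schedule from the mujoco() defaults above
nupdates = 100                                # illustrative number of updates
for update in range(1, nupdates + 1):
    frac = 1.0 - (update - 1.0) / nupdates    # 1.0 at the start of training, approaching 0 at the end
    lrnow = lr(frac)                          # 3e-4 on the first update, near 0 on the last
```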

View File

@@ -1,168 +0,0 @@
import numpy as np
import tensorflow as tf
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
from baselines.common.distributions import make_pdtype
def nature_cnn(unscaled_images):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2)))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2)))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2)))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
class LnLstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
nenv = nbatch // nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi = fc(h5, 'pi', nact)
vf = fc(h5, 'v', 1)
self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pi)
v0 = vf[:, 0]
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
def step(ob, state, mask):
return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
def value(ob, state, mask):
return sess.run(v0, {X:ob, S:state, M:mask})
self.X = X
self.M = M
self.S = S
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class LstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
nenv = nbatch // nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi = fc(h5, 'pi', nact)
vf = fc(h5, 'v', 1)
self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pi)
v0 = vf[:, 0]
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
def step(ob, state, mask):
return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
def value(ob, state, mask):
return sess.run(v0, {X:ob, S:state, M:mask})
self.X = X
self.M = M
self.S = S
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class CnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) #obs
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
pi = fc(h, 'pi', nact, init_scale=0.01)
vf = fc(h, 'v', 1)[:,0]
self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pi)
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = None
def step(ob, *_args, **_kwargs):
a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
return a, v, self.initial_state, neglogp
def value(ob, *_args, **_kwargs):
return sess.run(vf, {X:ob})
self.X = X
self.pi = pi
self.vf = vf
self.step = step
self.value = value
class MlpPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
ob_shape = (nbatch,) + ob_space.shape
actdim = ac_space.shape[0]
X = tf.placeholder(tf.float32, ob_shape, name='Ob') #obs
with tf.variable_scope("model", reuse=reuse):
activ = tf.tanh
h1 = activ(fc(X, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
h2 = activ(fc(h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
pi = fc(h2, 'pi', actdim, init_scale=0.01)
h1 = activ(fc(X, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
h2 = activ(fc(h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
vf = fc(h2, 'vf', 1)[:,0]
logstd = tf.get_variable(name="logstd", shape=[1, actdim],
initializer=tf.zeros_initializer())
pdparam = tf.concat([pi, pi * 0.0 + logstd], axis=1)
self.pdtype = make_pdtype(ac_space)
self.pd = self.pdtype.pdfromflat(pdparam)
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = None
def step(ob, *_args, **_kwargs):
a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
return a, v, self.initial_state, neglogp
def value(ob, *_args, **_kwargs):
return sess.run(vf, {X:ob})
self.X = X
self.pi = pi
self.vf = vf
self.step = step
self.value = value
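
These hand-written per-architecture policy classes are deleted because ppo2 now constructs policies through `baselines.common.policies.build_policy` (imported in the next file's diff). A minimal, hedged sketch of the replacement path, using CartPole purely as an example:

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.policies import build_policy

env = DummyVecEnv([lambda: gym.make('CartPole-v0')])
policy = build_policy(env, 'mlp')             # returns a constructor; ppo2's Model calls policy(nbatch, nsteps, sess)
```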

View File

@@ -1,20 +1,29 @@
import os
import time
import joblib
import functools
import numpy as np
import os.path as osp
import tensorflow as tf
from baselines import logger
from collections import deque
from baselines.common import explained_variance
from baselines.common import explained_variance, set_global_seeds
from baselines.common.policies import build_policy
from baselines.common.runners import AbstractEnvRunner
from baselines.common.tf_util import get_session, save_variables, load_variables
from baselines.common.mpi_adam_optimizer import MpiAdamOptimizer
from mpi4py import MPI
from baselines.common.tf_util import initialize
from baselines.common.mpi_util import sync_from_root
class Model(object):
def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
nsteps, ent_coef, vf_coef, max_grad_norm):
sess = tf.get_default_session()
sess = get_session()
act_model = policy(sess, ob_space, ac_space, nbatch_act, 1, reuse=False)
train_model = policy(sess, ob_space, ac_space, nbatch_train, nsteps, reuse=True)
with tf.variable_scope('ppo2_model', reuse=tf.AUTO_REUSE):
act_model = policy(nbatch_act, 1, sess)
train_model = policy(nbatch_train, nsteps, sess)
A = train_model.pdtype.sample_placeholder([None])
ADV = tf.placeholder(tf.float32, [None])
@@ -39,14 +48,16 @@ class Model(object):
approxkl = .5 * tf.reduce_mean(tf.square(neglogpac - OLDNEGLOGPAC))
clipfrac = tf.reduce_mean(tf.to_float(tf.greater(tf.abs(ratio - 1.0), CLIPRANGE)))
loss = pg_loss - entropy * ent_coef + vf_loss * vf_coef
with tf.variable_scope('model'):
params = tf.trainable_variables()
grads = tf.gradients(loss, params)
params = tf.trainable_variables('ppo2_model')
trainer = MpiAdamOptimizer(MPI.COMM_WORLD, learning_rate=LR, epsilon=1e-5)
grads_and_var = trainer.compute_gradients(loss, params)
grads, var = zip(*grads_and_var)
if max_grad_norm is not None:
grads, _grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
grads = list(zip(grads, params))
trainer = tf.train.AdamOptimizer(learning_rate=LR, epsilon=1e-5)
_train = trainer.apply_gradients(grads)
grads_and_var = list(zip(grads, var))
_train = trainer.apply_gradients(grads_and_var)
def train(lr, cliprange, obs, returns, masks, actions, values, neglogpacs, states=None):
advs = returns - values
@@ -62,17 +73,6 @@ class Model(object):
)[:-1]
self.loss_names = ['policy_loss', 'value_loss', 'policy_entropy', 'approxkl', 'clipfrac']
def save(save_path):
ps = sess.run(params)
joblib.dump(ps, save_path)
def load(load_path):
loaded_params = joblib.load(load_path)
restores = []
for p, loaded_p in zip(params, loaded_params):
restores.append(p.assign(loaded_p))
sess.run(restores)
# If you want to load weights, also save/load observation scaling inside VecNormalize
self.train = train
self.train_model = train_model
@@ -80,30 +80,28 @@ class Model(object):
self.step = act_model.step
self.value = act_model.value
self.initial_state = act_model.initial_state
self.save = save
self.load = load
tf.global_variables_initializer().run(session=sess) #pylint: disable=E1101
class Runner(object):
self.save = functools.partial(save_variables, sess=sess)
self.load = functools.partial(load_variables, sess=sess)
if MPI.COMM_WORLD.Get_rank() == 0:
initialize()
global_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="")
sync_from_root(sess, global_variables) #pylint: disable=E1101
class Runner(AbstractEnvRunner):
def __init__(self, *, env, model, nsteps, gamma, lam):
self.env = env
self.model = model
nenv = env.num_envs
self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=model.train_model.X.dtype.name)
self.obs[:] = env.reset()
self.gamma = gamma
super().__init__(env=env, model=model, nsteps=nsteps)
self.lam = lam
self.nsteps = nsteps
self.states = model.initial_state
self.dones = [False for _ in range(nenv)]
self.gamma = gamma
def run(self):
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones, mb_neglogpacs = [],[],[],[],[],[]
mb_states = self.states
epinfos = []
for _ in range(self.nsteps):
actions, values, self.states, neglogpacs = self.model.step(self.obs, self.states, self.dones)
actions, values, self.states, neglogpacs = self.model.step(self.obs, S=self.states, M=self.dones)
mb_obs.append(self.obs.copy())
mb_actions.append(actions)
mb_values.append(values)
@@ -121,7 +119,7 @@ class Runner(object):
mb_values = np.asarray(mb_values, dtype=np.float32)
mb_neglogpacs = np.asarray(mb_neglogpacs, dtype=np.float32)
mb_dones = np.asarray(mb_dones, dtype=np.bool)
last_values = self.model.value(self.obs, self.states, self.dones)
last_values = self.model.value(self.obs, S=self.states, M=self.dones)
#discount/bootstrap off value fn
mb_returns = np.zeros_like(mb_rewards)
mb_advs = np.zeros_like(mb_rewards)
@@ -151,10 +149,66 @@ def constfn(val):
return val
return f
def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
def learn(*, network, env, total_timesteps, seed=None, nsteps=2048, ent_coef=0.0, lr=3e-4,
vf_coef=0.5, max_grad_norm=0.5, gamma=0.99, lam=0.95,
log_interval=10, nminibatches=4, noptepochs=4, cliprange=0.2,
save_interval=0):
save_interval=0, load_path=None, **network_kwargs):
'''
Learn policy using PPO algorithm (https://arxiv.org/abs/1707.06347)
Parameters:
----------
network: policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
See common/models.py/lstm for more details on using recurrent nets in policies
env: baselines.common.vec_env.VecEnv environment. Needs to be vectorized for parallel environment simulation.
The environments produced by gym.make can be wrapped using baselines.common.vec_env.DummyVecEnv class.
nsteps: int number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
nenv is number of environment copies simulated in parallel)
total_timesteps: int number of timesteps (i.e. number of actions taken in the environment)
ent_coef: float policy entropy coefficient in the optimization objective
lr: float or function learning rate, constant or a schedule function [0,1] -> R+ where 1 is beginning of the
training and 0 is the end of the training.
vf_coef: float value function loss coefficient in the optimization objective
max_grad_norm: float or None gradient norm clipping coefficient
gamma: float discounting factor
lam: float advantage estimation discounting factor (lambda in the paper)
log_interval: int number of updates between logging events
nminibatches: int number of training minibatches per update. For recurrent policies,
should be smaller than or equal to the number of environments run in parallel.
noptepochs: int number of training epochs per update
cliprange: float or function clipping range, constant or schedule function [0,1] -> R+ where 1 is beginning of the training
and 0 is the end of the training
save_interval: int number of updates between saving events
load_path: str path to load the model from
**network_kwargs: keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
For instance, 'mlp' network architecture has arguments num_hidden and num_layers.
'''
set_global_seeds(seed)
if isinstance(lr, float): lr = constfn(lr)
else: assert callable(lr)
@@ -162,6 +216,8 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
else: assert callable(cliprange)
total_timesteps = int(total_timesteps)
policy = build_policy(env, network, **network_kwargs)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
@@ -171,11 +227,9 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
make_model = lambda : Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nbatch_act=nenvs, nbatch_train=nbatch_train,
nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
max_grad_norm=max_grad_norm)
if save_interval and logger.get_dir():
import cloudpickle
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
fh.write(cloudpickle.dumps(make_model))
model = make_model()
if load_path is not None:
model.load(load_path)
runner = Runner(env=env, model=model, nsteps=nsteps, gamma=gamma, lam=lam)
epinfobuf = deque(maxlen=100)
@@ -184,7 +238,6 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
nupdates = total_timesteps//nbatch
for update in range(1, nupdates+1):
assert nbatch % nminibatches == 0
nbatch_train = nbatch // nminibatches
tstart = time.time()
frac = 1.0 - (update - 1.0) / nupdates
lrnow = lr(frac)
@@ -232,14 +285,19 @@ def learn(*, policy, env, nsteps, total_timesteps, ent_coef, lr,
logger.logkv('time_elapsed', tnow - tfirststart)
for (lossval, lossname) in zip(lossvals, model.loss_names):
logger.logkv(lossname, lossval)
logger.dumpkvs()
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
if MPI.COMM_WORLD.Get_rank() == 0:
logger.dumpkvs()
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir() and MPI.COMM_WORLD.Get_rank() == 0:
checkdir = osp.join(logger.get_dir(), 'checkpoints')
os.makedirs(checkdir, exist_ok=True)
savepath = osp.join(checkdir, '%.5i'%update)
print('Saving to', savepath)
model.save(savepath)
env.close()
return model
def safemean(xs):
return np.nan if len(xs) == 0 else np.mean(xs)
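
Putting the refactor together, a hedged end-to-end sketch of the new keyword-only entry point; the environment and hyperparameters are illustrative, not the tuned defaults:

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

env = DummyVecEnv([lambda: gym.make('CartPole-v0')])
model = ppo2.learn(network='mlp', env=env, total_timesteps=10000,
                   nsteps=128, nminibatches=4, lr=3e-4, cliprange=0.2, seed=0)
model.save('/tmp/ppo2_cartpole')              # checkpoint written via save_variables (path illustrative)
```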

View File

@@ -1,40 +0,0 @@
#!/usr/bin/env python3
import sys
from baselines import logger
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.ppo2 import ppo2
from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
import multiprocessing
import tensorflow as tf
def train(env_id, num_timesteps, seed, policy):
ncpu = multiprocessing.cpu_count()
if sys.platform == 'darwin': ncpu //= 2
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=ncpu,
inter_op_parallelism_threads=ncpu)
config.gpu_options.allow_growth = True #pylint: disable=E1101
tf.Session(config=config).__enter__()
env = VecFrameStack(make_atari_env(env_id, 8, seed), 4)
policy = {'cnn' : CnnPolicy, 'lstm' : LstmPolicy, 'lnlstm' : LnLstmPolicy}[policy]
ppo2.learn(policy=policy, env=env, nsteps=128, nminibatches=4,
lam=0.95, gamma=0.99, noptepochs=4, log_interval=1,
ent_coef=.01,
lr=lambda f : f * 2.5e-4,
cliprange=lambda f : f * 0.1,
total_timesteps=int(num_timesteps * 1.1))
def main():
parser = atari_arg_parser()
parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
args = parser.parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
policy=args.policy)
if __name__ == '__main__':
main()

Some files were not shown because too many files have changed in this diff.