Compare commits

..

38 Commits

Author SHA1 Message Date
uronce-cc
0a40206c6c ncpu needs to be an integer. (#558) 2018-08-31 09:02:18 -07:00
Alfredo Canziani
1937826784 Fix alien syntax and apply PEP 8 style (#554) 2018-08-30 17:21:25 -07:00
pzhokhov
b29c8020d7 remove saving model as a pickle file in ppo2 (tries to pull environment in; bad idea - may need to use constructor argument pickling or somesuch if at all necessary) (#69) 2018-08-30 13:41:38 -07:00
Peter Zhokhov
4ec308aaa4 fixed syntax 2018-08-30 13:41:38 -07:00
Peter Zhokhov
3bbf3f3511 allow_early_resets=True in create_vec_env 2018-08-30 13:41:38 -07:00
Joshua Meier
e5de29a954 instructions for tensorboard (#61) 2018-08-30 13:41:37 -07:00
Joshua Meier
2507d335f9 Tensorboard util (#60)
* separate_validation_set was not imported

* launching tensorboard automatically
2018-08-30 13:41:37 -07:00
Damien Lancry
bdd4d385a6 Fix result_plotters in vectorized mujoco environments (#533)
* I investigated a bit about running a training in a vectorized monitored mujoco env and found out that the 0.monitor.csv file could not be plotted using baselines.results_plotter.py functions. Moreover the seed is the same in every parallel environments due to the particular behaviour of lambda. this fixes both issues without breaking the function in other files (baselines.acktr.run_mujoco still works)

* unifies make_atari_env and make_mujoco_env

* redefine make_mujoco_env because of run_mujoco in acktr not compatible with DummyVecEnv and SubprocVecEnv

* fix if else

* Update run.py
2018-08-28 17:48:56 -07:00
Peter Zhokhov
0961f5dd94 git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "95a81e86"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "c6c0f45c"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-08-27 16:40:14 -07:00
Christopher Hesse
337d913a8f remove reset_task from subproc vec env (#45) 2018-08-27 16:40:14 -07:00
Karl Cobbe
34af61a132 baselines: fix dummy vec env render mode (#42) 2018-08-27 16:40:14 -07:00
Christopher Hesse
1ea5ec647c export SimpleEnv and assert_envs_equal, fix minor bug in action space (#46) 2018-08-27 16:40:14 -07:00
pzhokhov
2fc7a1cbee Trigger benchmarks from buildkite (#40)
* rig buildkite pipeline to run benchmarks when commit ends with RUN BENCHMARKS

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file - merge test and benchmark steps

* fix the buildkite pipeline file - merge test and benchmark steps

* fix buildkite pipeline file

* fix buildkite pipeline file

* dry RUN BENCHMARKS

* dry RUN BENCHMARKS

* dry not run BENCHMARKS

* not run benchmarks

* not running benchmarks

* no running benchmarks

* no running benchmarks

* still not running benchmarks

* dummy commit to RUN BENCHMARKS

* trigger benchmarks from buildkite RUN BENCHMARKS

* specifying RCALL_KUBE_CLUSTER RUN BENCHMARKS

* remove rl-algs/run-benchmarks-new.py (moved to ci), merged baselines/common/console_util and baselines/common/util.py

* added missing imports in console_util

* clone subrepo over https
2018-08-27 16:40:14 -07:00
John Schulman
14c1d69ef4 Reduce duplication in VecEnv subclasses. (#38)
* Reduce duplication in VecEnv subclasses.
Now VecEnv base class handles rendering and closing; subclasses should provide get_images and (optionally) close_extras.

* fix tests

* minor docstring change

* raise NotImplementedError
2018-08-27 16:40:13 -07:00
pzhokhov
c8f6d8bac7 address rl-algs issue #169 (missing util functions from rcall) (#30)
* copied parts of util.py to baselines.common from rcall

* merged fix for baselines.logger, resolved conflicts

* copied ccap to baselines/baselines/common/util.py
2018-08-27 16:40:13 -07:00
pzhokhov
3a006ba50e flake8 fixes (#35)
* flake8 fixes

* added baselines/setup.cfg

* style checks using setup.cfg in baselines
2018-08-27 16:40:13 -07:00
Tom
c6c0f45cb1 fix 'async' is a reserved word in Python >= 3.7 (#495) (#542) 2018-08-27 12:36:43 -07:00
wangjksjtu
e92a6ad8f4 Update README.md (#537)
1. Delete repetitive section
2. Align the commands
2018-08-27 12:35:48 -07:00
HelgeS
92b9a37257 Updated example commands to run ppo2 (#534)
The headline mentions PPO, but the command was for A2C
2018-08-23 15:58:27 -07:00
Armin Primadi
cb14da96ca Fix typo on policies documentation (#535) 2018-08-23 15:56:13 -07:00
pzhokhov
3900f2a447 baselines issue 146 (remove tensorflow from setup.py) (#34)
* baselines does not reinstall tensorflow

* fix the version check in baselines/setup.py

* replace print and assert with assert, str (thanks @csh)
2018-08-21 16:59:05 -07:00
pzhokhov
20d22a5d79 Fix baselines build (fails due to lack of mujoco in public baselines container) (#29)
* make nminibatces = min(nminibatches, nenv)

* clarify the usage of lstm policy, add an example and a test

* cleaned up example, added assert to the test

* remove nminibatches -> min(nminibatches, num_env)

* removed code snippet from the docstring, pointing to the file

* add _mujoco_present flag to skip the tests that require mujoco if mujoco is not present

* re-format skip message in test_doc_examples

* flake8 complaints
2018-08-21 10:08:24 -07:00
pzhokhov
caf7b08b4d Baselines issue #525 (lack of docs for recurrent policies) (#27)
* make nminibatces = min(nminibatches, nenv)

* clarify the usage of lstm policy, add an example and a test

* cleaned up example, added assert to the test

* remove nminibatches -> min(nminibatches, num_env)

* removed code snippet from the docstring, pointing to the file
2018-08-20 13:55:35 -07:00
Peter Zhokhov
ca0165cdf5 flake8 complaints 2018-08-17 18:11:00 -07:00
pzhokhov
eb5b605f86 restore subrepo conftest.py files (#22)
* restore conftest.py in subrepos

* remove conftest files from subrepos in the docker image

* remove runslow flag from baselines .travis.yml and rl-algs ci/runtests.sh

* move import of rendering module into the code to fix tests that don't require a display

* restore the dockerfile
2018-08-17 17:02:39 -07:00
Peter Zhokhov
a89bee3c8d Merge commit 'refs/subrepo/baselines/fetch' into subrepo/baselines 2018-08-17 13:55:27 -07:00
pzhokhov
353bb15e90 deduplicate algorithms in rl-algs and baselines (#18)
* move vec_env

* cleaning up rl_common

* tests are passing (but mosts tests are deleted as moved to baselines)

* add benchmark runner for smoke tests

* removed duplicated algos

* route references to rl_algs.a2c to baselines.a2c

* route references to rl_algs.a2c to baselines.a2c

* unify conftest.py

* removing references to duplicated algs from codegen

* removing references to duplicated algs from codegen

* alex's changes to dummy_vec_env

* fixed test_carpole[deepq] testcase by decreasing number of training steps... alex's changes seemed to have fixed the bug and make it train better, but at seed=0 there is a dip in the training curve at 30k steps that fails the test

* codegen tests with atol=1e-6 seem to be unstable

* rl_common.vec_env -> baselines.common.vec_env mass replace

* fixed reference in trpo_mpi

* a2c.util references

* restored rl_algs.bench in sonic_prob

* fix reference in ci/runtests.sh

* simplifed expression in baselines/common/cmd_util

* further increased rtol to 1e-3 in codegen tests

* switched vecenvs to use SimpleImageViewer from gym instead of cv2

* make run.py --play option work with num_envs > 1

* make rosenbrock test reproducible

* git subrepo pull (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "e23524a5"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "bcde04e7"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* updated baselines README (num-timesteps --> num_timesteps)

* typo in deepq/README.md
2018-08-17 13:54:11 -07:00
pzhokhov
64c0c0a043 Setup travis (#12)
* re-setting up travis

* re-setting up travis

* resolved merge conflicts, added missing dependency for codegen

* removed parallel tests (workers are failing for some reason)

* try test baselines only

* added language options - some weirdness in rcall image that requires them?

* added verbosity to tests

* try tests in baselines only

* ci/runtests.sh tests codegen (some failure on baselines specifically on travis, trying to narrow down the problem)

* removed render from codegen test - maybe that's the problem?

* trying even simpler command within the image to figure out the problem

* print out system info in ci/runtests.sh

* print system info outside of docker as well

* trying single test file in codegen

* install graphviz in the docker image

* git subrepo pull baselines

subrepo:
  subdir:   "baselines"
  merged:   "8c2aea2"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "8c2aea2"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* added graphviz to the dockerfile (need both graphviz-dev and graphviz)

* only tests in codegen/algo/test_algo_builder.py

* run baselines tests only. still no clue why collection of codegen tests fails

* update baselines setup to install filelock for tests

* run slow tests

* skip slow tests in baselines

* single test file in baselines

* try reinstalling tensorflow

* running slow tests

* try full baselines and codegen test suite

* in the test Dockerfile, reinstall tensorflow

* using fake display for codegen render tests

* fixed display-related failures by adding a custom entrpoint to the docker image

* set LC_ALL and LANG env variables in docker image

* try sequential tests

* include psutil in requirements; increase relative tolerance in test_low_level_algo_distr

* trying to fix codegen failures on travis

* git subrepo commit (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "9ce84da"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "b222dd0"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* syntax in install.py

* changing the order of package installation

* removed supervised-reptile from installation list

* cron uses the full games repo in rcall

* flake8 complaints

* rewrite all extras logic in baselines, install.py always uses [all]
2018-08-17 13:54:10 -07:00
pzhokhov
5fee99e771 Setup travis (#12)
* re-setting up travis

* re-setting up travis

* resolved merge conflicts, added missing dependency for codegen

* removed parallel tests (workers are failing for some reason)

* try test baselines only

* added language options - some weirdness in rcall image that requires them?

* added verbosity to tests

* try tests in baselines only

* ci/runtests.sh tests codegen (some failure on baselines specifically on travis, trying to narrow down the problem)

* removed render from codegen test - maybe that's the problem?

* trying even simpler command within the image to figure out the problem

* print out system info in ci/runtests.sh

* print system info outside of docker as well

* trying single test file in codegen

* install graphviz in the docker image

* git subrepo pull baselines

subrepo:
  subdir:   "baselines"
  merged:   "8c2aea2"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "8c2aea2"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* added graphviz to the dockerfile (need both graphviz-dev and graphviz)

* only tests in codegen/algo/test_algo_builder.py

* run baselines tests only. still no clue why collection of codegen tests fails

* update baselines setup to install filelock for tests

* run slow tests

* skip slow tests in baselines

* single test file in baselines

* try reinstalling tensorflow

* running slow tests

* try full baselines and codegen test suite

* in the test Dockerfile, reinstall tensorflow

* using fake display for codegen render tests

* fixed display-related failures by adding a custom entrpoint to the docker image

* set LC_ALL and LANG env variables in docker image

* try sequential tests

* include psutil in requirements; increase relative tolerance in test_low_level_algo_distr

* trying to fix codegen failures on travis

* git subrepo commit (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "9ce84da"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "b222dd0"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* syntax in install.py

* changing the order of package installation

* removed supervised-reptile from installation list

* cron uses the full games repo in rcall

* flake8 complaints

* rewrite all extras logic in baselines, install.py always uses [all]
2018-08-17 13:40:02 -07:00
Youngjin Kim
5edcd6886e Fix argument error in deepq (#508)
* Fix argment error in deepq

* Fix argment error in deepq
2018-08-16 14:55:57 -07:00
Youngjin Kim
bcde04e710 Fix argument error in deepq (#508)
* Fix argment error in deepq

* Fix argment error in deepq
2018-08-16 14:55:57 -07:00
pzhokhov
cd375ab209 update readmes (#514)
* update per-algorithm READMEs to reflect new way of running algorithms

* adding a link to repo-wide README

* updated README files and deepq.train_cartpole example
2018-08-16 14:53:49 -07:00
pzhokhov
5622a09fa4 update readmes (#514)
* update per-algorithm READMEs to reflect new way of running algorithms

* adding a link to repo-wide README

* updated README files and deepq.train_cartpole example
2018-08-16 14:53:49 -07:00
Pim de Haan
e2da7cd42f Several bugfixes for #504, #505, #506 related to Classic Control and deepq (#507)
* Several bugfixes

* Fixed ActWrapper.step bug
2018-08-16 12:08:53 -07:00
Peter Zhokhov
b222dd0610 updated links in README to point to master 2018-08-13 16:01:24 -07:00
pzhokhov
1870685071 Publish benchmark results (#502)
* updated benchmark pages with final rewards

* use htmlpreview to render pages

* use htmlpreview to render pages

* use htmlpreview to render pages

* updated README to reflect ppo1 being obsolete

* removed navbars from published benchmark pages

* fixed link in README
2018-08-13 15:59:43 -07:00
pzhokhov
8c2aea2add refactor a2c, acer, acktr, ppo2, deepq, and trpo_mpi (#490)
* exported rl-algs

* more stuff from rl-algs

* run slow tests

* re-exported rl_algs

* re-exported rl_algs - fixed problems with serialization test and test_cartpole

* replaced atari_arg_parser with common_arg_parser

* run.py can run algos from both baselines and rl_algs

* added approximate humanoid reward with ppo2 into the README for reference

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* very dummy commit to RUN BENCHMARKS

* serialize variables as a dict, not as a list

* running_mean_std uses tensorflow variables

* fixed import in vec_normalize

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* flake8 complaints

* save all variables to make sure we save the vec_normalize normalization

* benchmarks on ppo2 only RUN BENCHMARKS

* make_atari_env compatible with mpi

* run ppo_mpi benchmarks only RUN BENCHMARKS

* hardcode names of retro environments

* add defaults

* changed default ppo2 lr schedule to linear RUN BENCHMARKS

* non-tf normalization benchmark RUN BENCHMARKS

* use ncpu=1 for mujoco sessions - gives a bit of a performance speedup

* reverted running_mean_std to user property decorators for mean, var, count

* reverted VecNormalize to use RunningMeanStd (no tf)

* reverted VecNormalize to use RunningMeanStd (no tf)

* profiling wip

* use VecNormalize with regular RunningMeanStd

* added acer runner (missing import)

* flake8 complaints

* added a note in README about TfRunningMeanStd and serialization of VecNormalize

* dummy commit to RUN BENCHMARKS

* merged benchmarks branch
2018-08-13 09:56:44 -07:00
Tony Yu Cao
366f486e34 Update README.md (#416)
Update Atari example
2018-08-08 10:42:10 -07:00
46 changed files with 18729 additions and 309 deletions

View File

@@ -1 +1 @@
ppo2

View File

@@ -10,5 +10,5 @@ install:
- docker build . -t baselines-test
script:
- flake8 --select=F,E999 baselines/common baselines/trpo_mpi baselines/ppo2 baselines/a2c baselines/deepq baselines/acer
- docker run baselines-test pytest --runslow
- flake8 .
- docker run baselines-test pytest -v .

View File

@@ -18,6 +18,7 @@ WORKDIR $CODE_DIR/baselines
# Clean up pycache and pyc files
RUN rm -rf __pycache__ && \
find . -name "*.pyc" -delete && \
pip install tensorflow && \
pip install -e .[test]

View File

@@ -45,8 +45,8 @@ cd baselines
```
If using virtualenv, create a new virtualenv and activate it
```bash
virtualenv env --python=python3
. env/bin/activate
virtualenv env --python=python3
. env/bin/activate
```
Install baselines package
```bash
@@ -62,29 +62,20 @@ pip install pytest
pytest
```
## Subpackages
## Testing the installation
All unit tests in baselines can be run using pytest runner:
```
pip install pytest
pytest
```
## Training models
Most of the algorithms in baselines repo are used as follows:
```bash
python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
```
### Example 1. PPO with MuJoCo Humanoid
For instance, to train a fully-connected network controlling MuJoCo humanoid using a2c for 20M timesteps
For instance, to train a fully-connected network controlling MuJoCo humanoid using PPO2 for 20M timesteps
```bash
python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
```
Note that for mujoco environments fully-connected network is default, so we can omit `--network=mlp`
The hyperparameters for both network and the learning algorithm can be controlled via the command line, for instance:
```bash
python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
```
will set entropy coeffient to 0.1, and construct fully connected network with 3 layers with 32 hidden units in each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same)
@@ -94,7 +85,7 @@ docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) fir the description
### Example 2. DQN on Atari
DQN with Atari is at this point a classics of benchmarks. To run the baselines implementation of DQN on Atari Pong:
```
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
```
## Saving, loading and visualizing models
@@ -102,20 +93,16 @@ The algorithms serialization API is not properly unified yet; however, there is
`--save_path` and `--load_path` command-line option loads the tensorflow state from a given path before training, and saves it after the training, respectively.
Let's imagine you'd like to train ppo2 on Atari Pong, save the model and then later visualize what has it learnt.
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num-timesteps=2e7 --save_path=~/models/pong_20M_ppo2
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
```
This should get to the mean reward per episode about 5k. To load and visualize the model, we'll do the following - load the model, train it for 0 steps, and then visualize:
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num-timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
```
*NOTE:* At the moment Mujoco training uses VecNormalize wrapper for the environment which is not being saved correctly; so loading the models trained on Mujoco will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd by TfRunningMeanStd in [baselines/common/vec_env/vec_normalize.py](baselines/common/vec_env/vec_normalize.py#L12). This way, mean and std of environment normalizing wrapper will be saved in tensorflow variables and included in the model file; however, training is slower that way - hence not including it by default
## Subpackages
- [A2C](baselines/a2c)
@@ -125,10 +112,19 @@ This should get to the mean reward per episode about 5k. To load and visualize t
- [DQN](baselines/deepq)
- [GAIL](baselines/gail)
- [HER](baselines/her)
- [PPO1](baselines/ppo1) (Multi-CPU using MPI)
- [PPO2](baselines/ppo2) (Optimized for GPU)
- [PPO1](baselines/ppo1) (obsolete version, left here temporarily)
- [PPO2](baselines/ppo2)
- [TRPO](baselines/trpo_mpi)
## Benchmarks
Results of benchmarks on Mujoco (1M timesteps) and Atari (10M timesteps) are available
[here for Mujoco](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm)
and
[here for Atari](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_atari10M.htm)
respectively. Note that these results may be not on the latest version of the code, particular commit hash with which results were obtained is specified on the benchmarks page.
To cite this repository in publications:
@misc{baselines,
@@ -139,3 +135,4 @@ To cite this repository in publications:
journal = {GitHub repository},
howpublished = {\url{https://github.com/openai/baselines}},
}

View File

@@ -2,4 +2,5 @@
- Original paper: https://arxiv.org/abs/1602.01783
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.a2c.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options
- also refer to the repo-wide [README.md](../../README.md#training-models)

View File

@@ -1,4 +1,6 @@
# ACER
- Original paper: https://arxiv.org/abs/1611.01224
- `python -m baselines.acer.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.run --alg=acer --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)

View File

@@ -2,4 +2,7 @@
- Original paper: https://arxiv.org/abs/1708.05144
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.acktr.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.run --alg=acktr --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)

View File

@@ -54,7 +54,7 @@ def learn(env, policy, vf, gamma, lam, timesteps_per_batch, num_timesteps,
stepsize = tf.Variable(initial_value=np.float32(np.array(0.03)), name='stepsize')
inputs, loss, loss_sampled = policy.update_info
optim = kfac.KfacOptimizer(learning_rate=stepsize, cold_lr=stepsize*(1-0.9), momentum=0.9, kfac_update=2,\
epsilon=1e-2, stats_decay=0.99, async=1, cold_iter=1,
epsilon=1e-2, stats_decay=0.99, async_=1, cold_iter=1,
weight_decay_dict=policy.wd_dict, max_grad_norm=None)
pi_var_list = []
for var in tf.trainable_variables():

View File

@@ -58,7 +58,7 @@ class Model(object):
with tf.device('/gpu:0'):
self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,\
momentum=0.9, kfac_update=1, epsilon=0.01,\
stats_decay=0.99, async=1, cold_iter=10, max_grad_norm=max_grad_norm)
stats_decay=0.99, async_=1, cold_iter=10, max_grad_norm=max_grad_norm)
update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
@@ -97,7 +97,7 @@ def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interva
kfac_clip=0.001, save_interval=None, lrschedule='linear', load_path=None, **network_kwargs):
set_global_seeds(seed)
if network == 'cnn':
network_kwargs['one_dim_bias'] = True
@@ -115,7 +115,7 @@ def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interva
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
fh.write(cloudpickle.dumps(make_model))
model = make_model()
if load_path is not None:
model.load(load_path)

View File

@@ -10,14 +10,14 @@ KFAC_DEBUG = False
class KfacOptimizer():
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async_=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
self.max_grad_norm = max_grad_norm
self._lr = learning_rate
self._momentum = momentum
self._clip_kl = clip_kl
self._channel_fac = channel_fac
self._kfac_update = kfac_update
self._async = async
self._async = async_
self._async_stats = async_stats
self._epsilon = epsilon
self._stats_decay = stats_decay

View File

@@ -1,23 +0,0 @@
#!/usr/bin/env python3
from functools import partial
from baselines import logger
from baselines.acktr.acktr_disc import learn
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.common.policies import cnn
def train(env_id, num_timesteps, seed, num_cpu):
env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
policy_fn = cnn(env=env, one_dim_bias=True)
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
env.close()
def main():
args = atari_arg_parser().parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed, num_cpu=32)
if __name__ == '__main__':
main()

View File

@@ -21,7 +21,7 @@ class NeuralNetValueFunction(object):
self._predict = U.function([X], vpred_n)
optim = kfac.KfacOptimizer(learning_rate=0.001, cold_lr=0.001*(1-0.9), momentum=0.9, \
clip_kl=0.3, epsilon=0.1, stats_decay=0.95, \
async=1, kfac_update=2, cold_iter=50, \
async_=1, kfac_update=2, cold_iter=50, \
weight_decay_dict=wd_dict, max_grad_norm=None)
vf_var_list = []
for var in tf.trainable_variables():

View File

@@ -156,7 +156,7 @@ class FrameStack(gym.Wrapper):
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=np.uint8)
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=env.observation_space.dtype)
def reset(self):
ob = self.env.reset()
@@ -176,6 +176,7 @@ class FrameStack(gym.Wrapper):
class ScaledFloatFrame(gym.ObservationWrapper):
def __init__(self, env):
gym.ObservationWrapper.__init__(self, env)
self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)
def observation(self, observation):
# careful! This undoes the memory optimization, use

View File

@@ -15,22 +15,31 @@ from baselines.bench import Monitor
from baselines.common import set_global_seeds
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.retro_wrappers import RewardScaler
def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
def make_vec_env(env_id, env_type, num_env, seed, wrapper_kwargs=None, start_index=0, reward_scale=1.0):
"""
Create a wrapped, monitored SubprocVecEnv for Atari.
Create a wrapped, monitored SubprocVecEnv for Atari and MuJoCo.
"""
if wrapper_kwargs is None: wrapper_kwargs = {}
mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
def make_env(rank): # pylint: disable=C0111
def _thunk():
env = make_atari(env_id)
env = make_atari(env_id) if env_type == 'atari' else gym.make(env_id)
env.seed(seed + 10000*mpi_rank + rank if seed is not None else None)
env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(rank)))
return wrap_deepmind(env, **wrapper_kwargs)
env = Monitor(env,
logger.get_dir() and os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(rank)),
allow_early_resets=True)
if env_type == 'atari': return wrap_deepmind(env, **wrapper_kwargs)
elif reward_scale != 1: return RewardScaler(env, reward_scale)
else: return env
return _thunk
set_global_seeds(seed)
return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
if num_env > 1: return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
else: return DummyVecEnv([make_env(start_index)])
def make_mujoco_env(env_id, seed, reward_scale=1.0):
"""
@@ -40,13 +49,12 @@ def make_mujoco_env(env_id, seed, reward_scale=1.0):
myseed = seed + 1000 * rank if seed is not None else None
set_global_seeds(myseed)
env = gym.make(env_id)
env = Monitor(env, os.path.join(logger.get_dir(), str(rank)), allow_early_resets=True)
logger_path = None if logger.get_dir() is None else os.path.join(logger.get_dir(), str(rank))
env = Monitor(env, logger_path, allow_early_resets=True)
env.seed(seed)
if reward_scale != 1.0:
from baselines.common.retro_wrappers import RewardScaler
env = RewardScaler(env, reward_scale)
return env
def make_robotics_env(env_id, seed, rank=0):
@@ -88,7 +96,7 @@ def common_arg_parser():
parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
parser.add_argument('--alg', help='Algorithm', type=str, default='ppo2')
parser.add_argument('--num_timesteps', type=float, default=1e6),
parser.add_argument('--num_timesteps', type=float, default=1e6),
parser.add_argument('--network', help='network type (mlp, cnn, lstm, cnn_lstm, conv_only)', default=None)
parser.add_argument('--gamestate', help='game state to load (so far only used in retro games)', default=None)
parser.add_argument('--num_env', help='Number of environment copies being run in parallel. When not specified, set to number of cpus for Atari, and to 1 for Mujoco', default=None, type=int)
@@ -121,6 +129,3 @@ def parse_unknown_args(args):
retval[key] = value
return retval

View File

@@ -2,6 +2,8 @@ from __future__ import print_function
from contextlib import contextmanager
import numpy as np
import time
import shlex
import subprocess
# ================================================================
# Misc
@@ -37,7 +39,7 @@ color2num = dict(
crimson=38
)
def colorize(string, color, bold=False, highlight=False):
def colorize(string, color='green', bold=False, highlight=False):
attr = []
num = color2num[color]
if highlight: num += 10
@@ -45,6 +47,22 @@ def colorize(string, color, bold=False, highlight=False):
if bold: attr.append('1')
return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string)
def print_cmd(cmd, dry=False):
if isinstance(cmd, str): # for shell=True
pass
else:
cmd = ' '.join(shlex.quote(arg) for arg in cmd)
print(colorize(('CMD: ' if not dry else 'DRY: ') + cmd))
def get_git_commit(cwd=None):
return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'], cwd=cwd).decode('utf8')
def ccap(cmd, dry=False, env=None, **kwargs):
print_cmd(cmd, dry)
if not dry:
subprocess.check_call(cmd, env=env, **kwargs)
MESSAGE_DEPTH = 0

View File

@@ -22,8 +22,7 @@ def nature_cnn(unscaled_images, **conv_kwargs):
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
"""
Simple fully connected layer policy. Separate stacks of fully-connected layers are used for policy and value function estimation.
More customized fully-connected policies can be obtained by using PolicyWithV class directly.
Stack of fully-connected layers to be used in a policy / q-function approximator
Parameters:
----------
@@ -37,7 +36,7 @@ def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
Returns:
-------
function that builds fully connected network with a given input placeholder
function that builds fully connected network with a given input tensor / placeholder
"""
def network_fn(X):
h = tf.layers.flatten(X)
@@ -68,6 +67,34 @@ def cnn_small(**conv_kwargs):
def lstm(nlstm=128, layer_norm=False):
"""
Builds LSTM (Long-Short Term Memory) network to be used in a policy.
Note that the resulting function returns not only the output of the LSTM
(i.e. hidden state of lstm for each step in the sequence), but also a dictionary
with auxiliary tensors to be set as policy attributes.
Specifically,
S is a placeholder to feed current state (LSTM state has to be managed outside policy)
M is a placeholder for the mask (used to mask out observations after the end of the episode, but can be used for other purposes too)
initial_state is a numpy array containing initial lstm state (usually zeros)
state is the output LSTM state (to be fed into S at the next call)
An example of usage of lstm-based policy can be found here: common/tests/test_doc_examples.py/test_lstm_example
Parameters:
----------
nlstm: int LSTM hidden state size
layer_norm: bool if True, layer-normalized version of LSTM is used
Returns:
-------
function that builds LSTM with a given input tensor / placeholder
"""
def network_fn(X, nenv=1):
nbatch = X.shape[0]
nsteps = nbatch // nenv
@@ -138,7 +165,7 @@ def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
'''
def network_fn(X):
out = X
out = tf.cast(X, tf.float32) / 255.
with tf.variable_scope("convnet"):
for num_outputs, kernel_size, stride in convs:
out = layers.convolution2d(out,

View File

@@ -72,7 +72,7 @@ class PolicyWithValue(object):
def step(self, observation, **extra_feed):
"""
Compute next action(s) given the observaion(s)
Compute next action(s) given the observation(s)
Parameters:
----------
@@ -93,7 +93,7 @@ class PolicyWithValue(object):
def value(self, ob, *args, **kwargs):
"""
Compute value estimate(s) given the observaion(s)
Compute value estimate(s) given the observation(s)
Parameters:
----------

View File

@@ -14,7 +14,7 @@ common_kwargs = dict(
learn_kwargs = {
'a2c' : dict(nsteps=32, value_network='copy', lr=0.05),
'acktr': dict(nsteps=32, value_network='copy'),
'deepq': {},
'deepq': dict(total_timesteps=20000),
'ppo2': dict(value_network='copy'),
'trpo_mpi': {}
}
@@ -38,3 +38,6 @@ def test_cartpole(alg):
return env
reward_per_episode_test(env_fn, learn_fn, 100)
if __name__ == '__main__':
test_cartpole('deepq')

View File

@@ -0,0 +1,48 @@
import pytest
try:
import mujoco_py
_mujoco_present = True
except BaseException:
mujoco_py = None
_mujoco_present = False
@pytest.mark.skipif(
not _mujoco_present,
reason='error loading mujoco - either mujoco / mujoco key not present, or LD_LIBRARY_PATH is not pointing to mujoco library'
)
def test_lstm_example():
import tensorflow as tf
from baselines.common import policies, models, cmd_util
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
# create vectorized environment
venv = DummyVecEnv([lambda: cmd_util.make_mujoco_env('Reacher-v2', seed=0)])
with tf.Session() as sess:
# build policy based on lstm network with 128 units
policy = policies.build_policy(venv, models.lstm(128))(nbatch=1, nsteps=1)
# initialize tensorflow variables
sess.run(tf.global_variables_initializer())
# prepare environment variables
ob = venv.reset()
state = policy.initial_state
done = [False]
step_counter = 0
# run a single episode until the end (i.e. until done)
while True:
action, _, state, _ = policy.step(ob, S=state, M=done)
ob, reward, done, _ = venv.step(action)
step_counter += 1
if done:
break
assert step_counter > 5

View File

@@ -6,8 +6,7 @@ from baselines.run import get_learn_function
common_kwargs = dict(
seed=0,
total_timesteps=20000,
nlstm=64
total_timesteps=50000,
)
learn_kwargs = {
@@ -20,7 +19,7 @@ learn_kwargs = {
alg_list = learn_kwargs.keys()
rnn_list = ['lstm', 'tflstm', 'tflstm_static']
rnn_list = ['lstm']
@pytest.mark.slow
@pytest.mark.parametrize("alg", alg_list)
@@ -42,11 +41,11 @@ def test_fixed_sequence(alg, rnn):
**kwargs
)
simple_test(env_fn, learn, 0.3)
simple_test(env_fn, learn, 0.7)
if __name__ == '__main__':
test_fixed_sequence('ppo2', 'tflstm')
test_fixed_sequence('ppo2', 'lstm')

View File

@@ -2,7 +2,6 @@ import tensorflow as tf
import numpy as np
from gym.spaces import np_random
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.bench.monitor import Monitor
N_TRIALS = 10000
N_EPISODES = 100
@@ -11,7 +10,7 @@ def simple_test(env_fn, learn_fn, min_reward_fraction, n_trials=N_TRIALS):
np.random.seed(0)
np_random.seed(0)
env = DummyVecEnv([lambda: Monitor(env_fn(), None, allow_early_resets=True)])
env = DummyVecEnv([env_fn])
with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():

View File

@@ -62,7 +62,7 @@ def make_session(config=None, num_cpu=None, make_default=False, graph=None):
num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
if config is None:
config = tf.ConfigProto(
allow_soft_placement=True,
allow_soft_placement=True,
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
config.gpu_options.allow_growth = True
@@ -328,7 +328,7 @@ def save_state(fname, sess=None):
def save_variables(save_path, variables=None, sess=None):
sess = sess or get_session()
variables = variables or tf.trainable_variables()
ps = sess.run(variables)
save_dict = {v.name: value for v, value in zip(variables, ps)}
os.makedirs(os.path.dirname(save_path), exist_ok=True)
@@ -354,10 +354,10 @@ def adjust_shape(placeholder, data):
If shape is incompatible, AssertionError is thrown
Parameters:
placeholder tensorflow input placeholder
placeholder tensorflow input placeholder
data input data to be (potentially) reshaped to be fed into placeholder
Returns:
reshaped data
'''
@@ -366,14 +366,14 @@ def adjust_shape(placeholder, data):
return data
if isinstance(data, list):
data = np.array(data)
placeholder_shape = [x or -1 for x in placeholder.shape.as_list()]
assert _check_shape(placeholder_shape, data.shape), \
'Shape of data {} is not compatible with shape of the placeholder {}'.format(data.shape, placeholder_shape)
return np.reshape(data, placeholder_shape)
return np.reshape(data, placeholder_shape)
def _check_shape(placeholder_shape, data_shape):
''' check if two shapes are compatible (i.e. differ only by dimensions of size 1, or by the batch dimension)'''
@@ -381,7 +381,7 @@ def _check_shape(placeholder_shape, data_shape):
return True
squeezed_placeholder_shape = _squeeze_shape(placeholder_shape)
squeezed_data_shape = _squeeze_shape(data_shape)
for i, s_data in enumerate(squeezed_data_shape):
s_placeholder = squeezed_placeholder_shape[i]
if s_placeholder != -1 and s_data != s_placeholder:
@@ -392,14 +392,26 @@ def _check_shape(placeholder_shape, data_shape):
def _squeeze_shape(shape):
return [x for x in shape if x != 1]
# ================================================================
# Tensorboard interfacing
# ================================================================
def launch_tensorboard_in_background(log_dir):
from tensorboard import main as tb
import threading
tf.flags.FLAGS.logdir = log_dir
t = threading.Thread(target=tb.main, args=([]))
t.start()
'''
To log the Tensorflow graph when using rl-algs
algorithms, you can run the following code
in your main script:
import threading, time
def start_tensorboard(session):
time.sleep(10) # Wait until graph is setup
tb_path = osp.join(logger.get_dir(), 'tb')
summary_writer = tf.summary.FileWriter(tb_path, graph=session.graph)
summary_op = tf.summary.merge_all()
launch_tensorboard_in_background(tb_path)
session = tf.get_default_session()
t = threading.Thread(target=start_tensorboard, args=([session]))
t.start()
'''
import subprocess
subprocess.Popen(['tensorboard', '--logdir', log_dir])

View File

@@ -1,38 +1,45 @@
from abc import ABC, abstractmethod
from baselines import logger
from baselines.common.tile_images import tile_images
class AlreadySteppingError(Exception):
"""
Raised when an asynchronous step is running while
step_async() is called again.
"""
def __init__(self):
msg = 'already running an async step'
Exception.__init__(self, msg)
class NotSteppingError(Exception):
"""
Raised when an asynchronous step is not running but
step_wait() is called.
"""
def __init__(self):
msg = 'not running an async step'
Exception.__init__(self, msg)
class VecEnv(ABC):
"""
An abstract asynchronous, vectorized environment.
"""
def __init__(self, num_envs, observation_space, action_space):
self.num_envs = num_envs
self.observation_space = observation_space
self.action_space = action_space
self.closed = False
self.viewer = None # For rendering
@abstractmethod
def reset(self):
"""
Reset all the environments and return an array of
observations, or a tuple of observation arrays.
observations, or a dict of observation arrays.
If step_async is still doing work, that work will
be cancelled and step_wait() should not be called
@@ -58,7 +65,7 @@ class VecEnv(ABC):
Wait for the step taken with step_async().
Returns (obs, rews, dones, infos):
- obs: an array of observations, or a tuple of
- obs: an array of observations, or a dict of
arrays of observations.
- rews: an array of rewards
- dones: an array of "episode done" booleans
@@ -66,19 +73,45 @@ class VecEnv(ABC):
"""
pass
@abstractmethod
def close(self):
def close_extras(self):
"""
Clean up the environments' resources.
Clean up the extra resources, beyond what's in this base class.
Only runs when not self.closed.
"""
pass
def close(self):
if self.closed:
return
if self.viewer is not None:
self.viewer.close()
self.close_extras()
self.closed = True
def step(self, actions):
"""
Step the environments synchronously.
This is available for backwards compatibility.
"""
self.step_async(actions)
return self.step_wait()
def render(self, mode='human'):
logger.warn('Render not defined for %s'%self)
imgs = self.get_images()
bigimg = tile_images(imgs)
if mode == 'human':
self.get_viewer().imshow(bigimg)
elif mode == 'rgb_array':
return bigimg
else:
raise NotImplementedError
def get_images(self):
"""
Return RGB images from each environment
"""
raise NotImplementedError
@property
def unwrapped(self):
@@ -87,13 +120,25 @@ class VecEnv(ABC):
else:
return self
def get_viewer(self):
if self.viewer is None:
from gym.envs.classic_control import rendering
self.viewer = rendering.SimpleImageViewer()
return self.viewer
class VecEnvWrapper(VecEnv):
"""
An environment wrapper that applies to an entire batch
of environments at once.
"""
def __init__(self, venv, observation_space=None, action_space=None):
self.venv = venv
VecEnv.__init__(self,
num_envs=venv.num_envs,
observation_space=observation_space or venv.observation_space,
action_space=action_space or venv.action_space)
VecEnv.__init__(self,
num_envs=venv.num_envs,
observation_space=observation_space or venv.observation_space,
action_space=action_space or venv.action_space)
def step_async(self, actions):
self.venv.step_async(actions)
@@ -109,18 +154,24 @@ class VecEnvWrapper(VecEnv):
def close(self):
return self.venv.close()
def render(self):
self.venv.render()
def render(self, mode='human'):
return self.venv.render(mode=mode)
def get_images(self):
return self.venv.get_images()
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)

View File

@@ -1,28 +1,16 @@
import numpy as np
from gym import spaces
from collections import OrderedDict
from . import VecEnv
from .util import copy_obs_dict, dict_to_obs, obs_space_info
class DummyVecEnv(VecEnv):
def __init__(self, env_fns):
self.envs = [fn() for fn in env_fns]
env = self.envs[0]
VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
shapes, dtypes = {}, {}
self.keys = []
obs_space = env.observation_space
if isinstance(obs_space, spaces.Dict):
assert isinstance(obs_space.spaces, OrderedDict)
subspaces = obs_space.spaces
else:
subspaces = {None: obs_space}
for key, box in subspaces.items():
shapes[key] = box.shape
dtypes[key] = box.dtype
self.keys.append(key)
self.keys, shapes, dtypes = obs_space_info(obs_space)
self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool)
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
@@ -53,7 +41,7 @@ class DummyVecEnv(VecEnv):
if self.buf_dones[e]:
obs = self.envs[e].reset()
self._save_obs(e, obs)
return (np.copy(self._obs_from_buf()), np.copy(self.buf_rews), np.copy(self.buf_dones),
return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones),
self.buf_infos.copy())
def reset(self):
@@ -65,9 +53,6 @@ class DummyVecEnv(VecEnv):
def close(self):
return
def render(self, mode='human'):
return [e.render(mode=mode) for e in self.envs]
def _save_obs(self, e, obs):
for k in self.keys:
if k is None:
@@ -76,7 +61,8 @@ class DummyVecEnv(VecEnv):
self.buf_obs[k][e] = obs[k]
def _obs_from_buf(self):
if self.keys==[None]:
return self.buf_obs[None]
else:
return self.buf_obs
return dict_to_obs(copy_obs_dict(self.buf_obs))
def get_images(self):
return [env.render(mode='rgb_array') for env in self.envs]

View File

@@ -0,0 +1,138 @@
"""
An interface for asynchronous vectorized environments.
"""
from multiprocessing import Pipe, Array, Process
import numpy as np
from . import VecEnv, CloudpickleWrapper
import ctypes
from baselines import logger
from .util import dict_to_obs, obs_space_info, obs_to_dict
_NP_TO_CT = {np.float32: ctypes.c_float,
np.int32: ctypes.c_int32,
np.int8: ctypes.c_int8,
np.uint8: ctypes.c_char,
np.bool: ctypes.c_bool}
class ShmemVecEnv(VecEnv):
"""
An AsyncEnv that uses multiprocessing to run multiple
environments in parallel.
"""
def __init__(self, env_fns, spaces=None):
"""
If you don't specify observation_space, we'll have to create a dummy
environment to get it.
"""
if spaces:
observation_space, action_space = spaces
else:
logger.log('Creating dummy env object to get spaces')
with logger.scoped_configure(format_strs=[]):
dummy = env_fns[0]()
observation_space, action_space = dummy.observation_space, dummy.action_space
dummy.close()
del dummy
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
self.obs_keys, self.obs_shapes, self.obs_dtypes = obs_space_info(observation_space)
self.obs_bufs = [
{k: Array(_NP_TO_CT[self.obs_dtypes[k].type], int(np.prod(self.obs_shapes[k]))) for k in self.obs_keys}
for _ in env_fns]
self.parent_pipes = []
self.procs = []
for env_fn, obs_buf in zip(env_fns, self.obs_bufs):
wrapped_fn = CloudpickleWrapper(env_fn)
parent_pipe, child_pipe = Pipe()
proc = Process(target=_subproc_worker,
args=(child_pipe, parent_pipe, wrapped_fn, obs_buf, self.obs_shapes, self.obs_dtypes, self.obs_keys))
proc.daemon = True
self.procs.append(proc)
self.parent_pipes.append(parent_pipe)
proc.start()
child_pipe.close()
self.waiting_step = False
self.viewer = None
def reset(self):
if self.waiting_step:
logger.warn('Called reset() while waiting for the step to complete')
self.step_wait()
for pipe in self.parent_pipes:
pipe.send(('reset', None))
return self._decode_obses([pipe.recv() for pipe in self.parent_pipes])
def step_async(self, actions):
assert len(actions) == len(self.parent_pipes)
for pipe, act in zip(self.parent_pipes, actions):
pipe.send(('step', act))
def step_wait(self):
outs = [pipe.recv() for pipe in self.parent_pipes]
obs, rews, dones, infos = zip(*outs)
return self._decode_obses(obs), np.array(rews), np.array(dones), infos
def close_extras(self):
if self.waiting_step:
self.step_wait()
for pipe in self.parent_pipes:
pipe.send(('close', None))
for pipe in self.parent_pipes:
pipe.recv()
pipe.close()
for proc in self.procs:
proc.join()
def get_images(self, mode='human'):
for pipe in self.parent_pipes:
pipe.send(('render', None))
return [pipe.recv() for pipe in self.parent_pipes]
def _decode_obses(self, obs):
result = {}
for k in self.obs_keys:
bufs = [b[k] for b in self.obs_bufs]
o = [np.frombuffer(b.get_obj(), dtype=self.obs_dtypes[k]).reshape(self.obs_shapes[k]) for b in bufs]
result[k] = np.array(o)
return dict_to_obs(result)
def _subproc_worker(pipe, parent_pipe, env_fn_wrapper, obs_bufs, obs_shapes, obs_dtypes, keys):
"""
Control a single environment instance using IPC and
shared memory.
"""
def _write_obs(maybe_dict_obs):
flatdict = obs_to_dict(maybe_dict_obs)
for k in keys:
dst = obs_bufs[k].get_obj()
dst_np = np.frombuffer(dst, dtype=obs_dtypes[k]).reshape(obs_shapes[k]) # pylint: disable=W0212
np.copyto(dst_np, flatdict[k])
env = env_fn_wrapper.x()
parent_pipe.close()
try:
while True:
cmd, data = pipe.recv()
if cmd == 'reset':
pipe.send(_write_obs(env.reset()))
elif cmd == 'step':
obs, reward, done, info = env.step(data)
if done:
obs = env.reset()
pipe.send((_write_obs(obs), reward, done, info))
elif cmd == 'render':
pipe.send(env.render(mode='rgb_array'))
elif cmd == 'close':
pipe.send(None)
break
else:
raise RuntimeError('Got unrecognized cmd %s' % cmd)
except KeyboardInterrupt:
print('ShmemVecEnv worker: got KeyboardInterrupt')
finally:
env.close()

View File

@@ -1,8 +1,6 @@
import numpy as np
from multiprocessing import Process, Pipe
from baselines.common.vec_env import VecEnv, CloudpickleWrapper
from baselines.common.tile_images import tile_images
from . import VecEnv, CloudpickleWrapper
def worker(remote, parent_remote, env_fn_wrapper):
parent_remote.close()
@@ -32,25 +30,26 @@ def worker(remote, parent_remote, env_fn_wrapper):
finally:
env.close()
class SubprocVecEnv(VecEnv):
def __init__(self, env_fns, spaces=None):
"""
envs: list of gym environments to run in subprocesses
"""
self.waiting = False
self.closed = False
nenvs = len(env_fns)
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for p in self.ps:
p.daemon = True # if the main process crashes, we should not cause things to hang
p.daemon = True # if the main process crashes, we should not cause things to hang
p.start()
for remote in self.work_remotes:
remote.close()
self.remotes[0].send(('get_spaces', None))
observation_space, action_space = self.remotes[0].recv()
self.viewer = None
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
def step_async(self, actions):
@@ -69,33 +68,17 @@ class SubprocVecEnv(VecEnv):
remote.send(('reset', None))
return np.stack([remote.recv() for remote in self.remotes])
def reset_task(self):
for remote in self.remotes:
remote.send(('reset_task', None))
return np.stack([remote.recv() for remote in self.remotes])
def close(self):
if self.closed:
return
def close_extras(self):
if self.waiting:
for remote in self.remotes:
for remote in self.remotes:
remote.recv()
for remote in self.remotes:
remote.send(('close', None))
for p in self.ps:
p.join()
self.closed = True
def render(self, mode='human'):
def get_images(self):
for pipe in self.remotes:
pipe.send(('render', None))
imgs = [pipe.recv() for pipe in self.remotes]
bigimg = tile_images(imgs)
if mode == 'human':
import cv2
cv2.imshow('vecenv', bigimg[:,:,::-1])
cv2.waitKey(1)
elif mode == 'rgb_array':
return bigimg
else:
raise NotImplementedError
return imgs

View File

@@ -0,0 +1,101 @@
"""
Tests for asynchronous vectorized environments.
"""
import gym
import numpy as np
import pytest
from .dummy_vec_env import DummyVecEnv
from .shmem_vec_env import ShmemVecEnv
from .subproc_vec_env import SubprocVecEnv
def assert_envs_equal(env1, env2, num_steps):
"""
Compare two environments over num_steps steps and make sure
that the observations produced by each are the same when given
the same actions.
"""
assert env1.num_envs == env2.num_envs
assert env1.action_space.shape == env2.action_space.shape
assert env1.action_space.dtype == env2.action_space.dtype
joint_shape = (env1.num_envs,) + env1.action_space.shape
try:
obs1, obs2 = env1.reset(), env2.reset()
assert np.array(obs1).shape == np.array(obs2).shape
assert np.array(obs1).shape == joint_shape
assert np.allclose(obs1, obs2)
np.random.seed(1337)
for _ in range(num_steps):
actions = np.array(np.random.randint(0, 0x100, size=joint_shape),
dtype=env1.action_space.dtype)
for env in [env1, env2]:
env.step_async(actions)
outs1 = env1.step_wait()
outs2 = env2.step_wait()
for out1, out2 in zip(outs1[:3], outs2[:3]):
assert np.array(out1).shape == np.array(out2).shape
assert np.allclose(out1, out2)
assert list(outs1[3]) == list(outs2[3])
finally:
env1.close()
env2.close()
@pytest.mark.parametrize('klass', (ShmemVecEnv, SubprocVecEnv))
@pytest.mark.parametrize('dtype', ('uint8', 'float32'))
def test_vec_env(klass, dtype): # pylint: disable=R0914
"""
Test that a vectorized environment is equivalent to
DummyVecEnv, since DummyVecEnv is less likely to be
error prone.
"""
num_envs = 3
num_steps = 100
shape = (3, 8)
def make_fn(seed):
"""
Get an environment constructor with a seed.
"""
return lambda: SimpleEnv(seed, shape, dtype)
fns = [make_fn(i) for i in range(num_envs)]
env1 = DummyVecEnv(fns)
env2 = klass(fns)
assert_envs_equal(env1, env2, num_steps=num_steps)
class SimpleEnv(gym.Env):
"""
An environment with a pre-determined observation space
and RNG seed.
"""
def __init__(self, seed, shape, dtype):
np.random.seed(seed)
self._dtype = dtype
self._start_obs = np.array(np.random.randint(0, 0x100, size=shape),
dtype=dtype)
self._max_steps = seed + 1
self._cur_obs = None
self._cur_step = 0
# this is 0xFF instead of 0x100 because the Box space includes
# the high end, while randint does not
self.action_space = gym.spaces.Box(low=0, high=0xFF, shape=shape, dtype=dtype)
self.observation_space = self.action_space
def step(self, action):
self._cur_obs += np.array(action, dtype=self._dtype)
self._cur_step += 1
done = self._cur_step >= self._max_steps
reward = self._cur_step / self._max_steps
return self._cur_obs, reward, done, {'foo': 'bar' + str(reward)}
def reset(self):
self._cur_obs = self._start_obs
self._cur_step = 0
return self._cur_obs
def render(self, mode=None):
raise NotImplementedError

View File

@@ -0,0 +1,59 @@
"""
Helpers for dealing with vectorized environments.
"""
from collections import OrderedDict
import gym
import numpy as np
def copy_obs_dict(obs):
"""
Deep-copy an observation dict.
"""
return {k: np.copy(v) for k, v in obs.items()}
def dict_to_obs(obs_dict):
"""
Convert an observation dict into a raw array if the
original observation space was not a Dict space.
"""
if set(obs_dict.keys()) == {None}:
return obs_dict[None]
return obs_dict
def obs_space_info(obs_space):
"""
Get dict-structured information about a gym.Space.
Returns:
A tuple (keys, shapes, dtypes):
keys: a list of dict keys.
shapes: a dict mapping keys to shapes.
dtypes: a dict mapping keys to dtypes.
"""
if isinstance(obs_space, gym.spaces.Dict):
assert isinstance(obs_space.spaces, OrderedDict)
subspaces = obs_space.spaces
else:
subspaces = {None: obs_space}
keys = []
shapes = {}
dtypes = {}
for key, box in subspaces.items():
keys.append(key)
shapes[key] = box.shape
dtypes[key] = box.dtype
return keys, shapes, dtypes
def obs_to_dict(obs):
"""
Convert an observation into a dict.
"""
if isinstance(obs, dict):
return obs
return {None: obs}

View File

@@ -1,18 +1,16 @@
from baselines.common.vec_env import VecEnvWrapper
from . import VecEnvWrapper
import numpy as np
from gym import spaces
class VecFrameStack(VecEnvWrapper):
"""
Vectorized environment base class
"""
def __init__(self, venv, nstack):
self.venv = venv
self.nstack = nstack
wos = venv.observation_space # wrapped ob space
wos = venv.observation_space # wrapped ob space
low = np.repeat(wos.low, self.nstack, axis=-1)
high = np.repeat(wos.high, self.nstack, axis=-1)
self.stackedobs = np.zeros((venv.num_envs,)+low.shape, low.dtype)
self.stackedobs = np.zeros((venv.num_envs,) + low.shape, low.dtype)
observation_space = spaces.Box(low=low, high=high, dtype=venv.observation_space.dtype)
VecEnvWrapper.__init__(self, venv, observation_space=observation_space)
@@ -26,9 +24,6 @@ class VecFrameStack(VecEnvWrapper):
return self.stackedobs, rews, news, infos
def reset(self):
"""
Reset all environments
"""
obs = self.venv.reset()
self.stackedobs[...] = 0
self.stackedobs[..., -obs.shape[-1]:] = obs

View File

@@ -0,0 +1,29 @@
from . import VecEnvWrapper
import numpy as np
class VecMonitor(VecEnvWrapper):
def __init__(self, venv):
VecEnvWrapper.__init__(self, venv)
self.eprets = None
self.eplens = None
def reset(self):
obs = self.venv.reset()
self.eprets = np.zeros(self.num_envs, 'f')
self.eplens = np.zeros(self.num_envs, 'i')
return obs
def step_wait(self):
obs, rews, dones, infos = self.venv.step_wait()
self.eprets += rews
self.eplens += 1
newinfos = []
for (i, (done, ret, eplen, info)) in enumerate(zip(dones, self.eprets, self.eplens, infos)):
info = info.copy()
if done:
info['episode'] = {'r': ret, 'l': eplen}
self.eprets[i] = 0
self.eplens[i] = 0
newinfos.append(info)
return obs, rews, dones, newinfos

View File

@@ -1,17 +1,18 @@
from baselines.common.vec_env import VecEnvWrapper
from . import VecEnvWrapper
from baselines.common.running_mean_std import RunningMeanStd
import numpy as np
class VecNormalize(VecEnvWrapper):
"""
Vectorized environment base class
A vectorized wrapper that normalizes the observations
and returns from an environment.
"""
def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
VecEnvWrapper.__init__(self, venv)
self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
self.ret_rms = RunningMeanStd(shape=()) if ret else None
#self.ob_rms = TfRunningMeanStd(shape=self.observation_space.shape, scope='observation_running_mean_std') if ob else None
#self.ret_rms = TfRunningMeanStd(shape=(), scope='return_running_mean_std') if ret else None
self.clipob = clipob
self.cliprew = cliprew
self.ret = np.zeros(self.num_envs)
@@ -19,12 +20,6 @@ class VecNormalize(VecEnvWrapper):
self.epsilon = epsilon
def step_wait(self):
"""
Apply sequence of actions to sequence of environments
actions -> (observations, rewards, news)
where 'news' is a boolean vector indicating whether each element is new.
"""
obs, rews, news, infos = self.venv.step_wait()
self.ret = self.ret * self.gamma + rews
obs = self._obfilt(obs)
@@ -42,8 +37,5 @@ class VecNormalize(VecEnvWrapper):
return obs
def reset(self):
"""
Reset all environments
"""
obs = self.venv.reset()
return self._obfilt(obs)

View File

@@ -9,44 +9,29 @@ Here's a list of commands to run to quickly get a working example:
```bash
# Train model and save the results to cartpole_model.pkl
python -m baselines.deepq.experiments.train_cartpole
python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5
# Load the model saved in cartpole_model.pkl and visualize the learned policy
python -m baselines.deepq.experiments.enjoy_cartpole
python -m baselines.run --alg=deepq --env=CartPole-v0 --load_path=./cartpole_model.pkl --num_timesteps=0 --play
```
Be sure to check out the source code of [both](experiments/train_cartpole.py) [files](experiments/enjoy_cartpole.py)!
## If you wish to apply DQN to solve a problem.
Check out our simple agent trained with one stop shop `deepq.learn` function.
- [baselines/deepq/experiments/train_cartpole.py](experiments/train_cartpole.py) - train a Cartpole agent.
- [baselines/deepq/experiments/train_pong.py](experiments/train_pong.py) - train a Pong agent using convolutional neural networks.
In particular notice that once `deepq.learn` finishes training it returns `act` function which can be used to select actions in the environment. Once trained you can easily save it and load at later time. For both of the files listed above there are complimentary files `enjoy_cartpole.py` and `enjoy_pong.py` respectively, that load and visualize the learned policy.
In particular notice that once `deepq.learn` finishes training it returns `act` function which can be used to select actions in the environment. Once trained you can easily save it and load at later time. Complimentary file `enjoy_cartpole.py` loads and visualizes the learned policy.
## If you wish to experiment with the algorithm
##### Check out the examples
- [baselines/deepq/experiments/custom_cartpole.py](experiments/custom_cartpole.py) - Cartpole training with more fine grained control over the internals of DQN algorithm.
- [baselines/deepq/experiments/atari/train.py](experiments/atari/train.py) - more robust setup for training at scale.
##### Download a pretrained Atari agent
For some research projects it is sometimes useful to have an already trained agent handy. There's a variety of models to choose from. You can list them all by running:
- [baselines/deepq/defaults.py](defaults.py) - settings for training on atari. Run
```bash
python -m baselines.deepq.experiments.atari.download_model
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4
```
to train on Atari Pong (see more in repo-wide [README.md](../../README.md#training-models))
Once you pick a model, you can download it and visualize the learned policy. Be sure to pass `--dueling` flag to visualization script when using dueling models.
```bash
python -m baselines.deepq.experiments.atari.download_model --blob model-atari-duel-pong-1 --model-dir /tmp/models
python -m baselines.deepq.experiments.atari.enjoy --model-dir /tmp/models/model-atari-duel-pong-1 --env Pong --dueling
```

View File

@@ -309,7 +309,7 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True, reset_ph: False, update_param_noise_threshold_ph: False, update_param_noise_scale_ph: False},
updates=updates)
def act(ob, reset, update_param_noise_threshold, update_param_noise_scale, stochastic=True, update_eps=-1):
def act(ob, reset=False, update_param_noise_threshold=False, update_param_noise_scale=False, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps, reset, update_param_noise_threshold, update_param_noise_scale)
return act

View File

@@ -27,7 +27,7 @@ class ActWrapper(object):
self.initial_state = None
@staticmethod
def load_act(self, path):
def load_act(path):
with open(path, "rb") as f:
model_data, act_params = cloudpickle.load(f)
act = deepq.build_act(**act_params)
@@ -70,6 +70,7 @@ class ActWrapper(object):
def save(self, path):
save_state(path)
self.save_act(path+".pickle")
def load_act(path):
@@ -194,8 +195,9 @@ def learn(env,
# capture the shape outside the closure so that the env object is not serialized
# by cloudpickle when serializing make_obs_ph
observation_space = env.observation_space
def make_obs_ph(name):
return ObservationInput(env.observation_space, name=name)
return ObservationInput(observation_space, name=name)
act, train, update_target, debug = deepq.build_train(
make_obs_ph=make_obs_ph,

View File

@@ -23,17 +23,15 @@ def main():
env = make_atari(args.env)
env = bench.Monitor(env, logger.get_dir())
env = deepq.wrap_atari_dqn(env)
model = deepq.models.cnn_to_mlp(
convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
hiddens=[256],
dueling=bool(args.dueling),
)
deepq.learn(
env,
q_func=model,
"conv_only",
convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
hiddens=[256],
dueling=bool(args.dueling),
lr=1e-4,
max_timesteps=args.num_timesteps,
total_timesteps=args.num_timesteps,
buffer_size=10000,
exploration_fraction=0.1,
exploration_final_eps=0.01,

View File

@@ -11,12 +11,11 @@ def callback(lcl, _glb):
def main():
env = gym.make("CartPole-v0")
model = deepq.models.mlp([64])
act = deepq.learn(
env,
q_func=model,
network='mlp',
lr=1e-3,
max_timesteps=100000,
total_timesteps=100000,
buffer_size=50000,
exploration_fraction=0.1,
exploration_final_eps=0.02,

View File

@@ -2,5 +2,7 @@
- Original paper: https://arxiv.org/abs/1707.06347
- Baselines blog post: https://blog.openai.com/openai-baselines-ppo/
- `python -m baselines.ppo2.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.ppo2.run_mujoco` runs the algorithm for 1M frames on a Mujoco environment.
- `python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- `python -m baselines.run --alg=ppo2 --env=Ant-v2 --num_timesteps=1e6` runs the algorithm for 1M frames on a Mujoco Ant environment.
- also refer to the repo-wide [README.md](../../README.md#training-models)

View File

@@ -163,7 +163,7 @@ def learn(*, network, env, total_timesteps, seed=None, nsteps=2048, ent_coef=0.0
specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
See common/models.py/lstm for more details on using recurrent nets in policies
env: baselines.common.vec_env.VecEnv environment. Needs to be vectorized for parallel environment simulation.
The environments produced by gym.make can be wrapped using baselines.common.vec_env.DummyVecEnv class.
@@ -189,7 +189,8 @@ def learn(*, network, env, total_timesteps, seed=None, nsteps=2048, ent_coef=0.0
log_interval: int number of timesteps between logging events
nminibatches: int number of training minibatches per update
nminibatches: int number of training minibatches per update. For recurrent policies,
should be smaller or equal than number of environments run in parallel.
noptepochs: int number of training epochs per update
@@ -226,10 +227,6 @@ def learn(*, network, env, total_timesteps, seed=None, nsteps=2048, ent_coef=0.0
make_model = lambda : Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nbatch_act=nenvs, nbatch_train=nbatch_train,
nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
max_grad_norm=max_grad_norm)
if save_interval and logger.get_dir():
import cloudpickle
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
fh.write(cloudpickle.dumps(make_model))
model = make_model()
if load_path is not None:
model.load(load_path)

View File

@@ -1,20 +1,19 @@
import sys
import multiprocessing
import os
import multiprocessing
import os.path as osp
import gym
from collections import defaultdict
import tensorflow as tf
import numpy as np
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.common.cmd_util import common_arg_parser, parse_unknown_args, make_mujoco_env, make_atari_env
from baselines.common.tf_util import save_state, load_state, get_session
from baselines.common.cmd_util import common_arg_parser, parse_unknown_args, make_vec_env
from baselines.common.tf_util import get_session
from baselines import bench, logger
from importlib import import_module
from baselines.common.vec_env.vec_normalize import VecNormalize
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common import atari_wrappers, retro_wrappers
try:
@@ -28,10 +27,10 @@ for env in gym.envs.registry.all():
env_type = env._entry_point.split(':')[0].split('.')[-1]
_game_envs[env_type].add(env.id)
# reading benchmark names directly from retro requires
# importing retro here, and for some reason that crashes tensorflow
# in ubuntu
_game_envs['retro'] = set([
# reading benchmark names directly from retro requires
# importing retro here, and for some reason that crashes tensorflow
# in ubuntu
_game_envs['retro'] = {
'BubbleBobble-Nes',
'SuperMarioBros-Nes',
'TwinBee3PokoPokoDaimaou-Nes',
@@ -40,12 +39,12 @@ _game_envs['retro'] = set([
'Vectorman-Genesis',
'FinalFight-Snes',
'SpaceInvaders-Snes',
])
}
def train(args, extra_args):
env_type, env_id = get_env_type(args.env)
total_timesteps = int(args.num_timesteps)
seed = args.seed
@@ -60,13 +59,11 @@ def train(args, extra_args):
else:
if alg_kwargs.get('network') is None:
alg_kwargs['network'] = get_default_network(env_type)
print('Training {} on {}:{} with arguments \n{}'.format(args.alg, env_type, env_id, alg_kwargs))
model = learn(
env=env,
env=env,
seed=seed,
total_timesteps=total_timesteps,
**alg_kwargs
@@ -75,30 +72,30 @@ def train(args, extra_args):
return model, env
def build_env(args, render=False):
def build_env(args):
ncpu = multiprocessing.cpu_count()
if sys.platform == 'darwin': ncpu //= 2
nenv = args.num_env or ncpu if not render else 1
nenv = args.num_env or ncpu
alg = args.alg
rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
seed = args.seed
seed = args.seed
env_type, env_id = get_env_type(args.env)
if env_type == 'mujoco':
get_session(tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=1,
intra_op_parallelism_threads=1,
inter_op_parallelism_threads=1))
if args.num_env:
env = SubprocVecEnv([lambda: make_mujoco_env(env_id, seed + i if seed is not None else None, args.reward_scale) for i in range(args.num_env)])
env = make_vec_env(env_id, env_type, nenv, seed, reward_scale=args.reward_scale)
else:
env = DummyVecEnv([lambda: make_mujoco_env(env_id, seed, args.reward_scale)])
env = make_vec_env(env_id, env_type, 1, seed, reward_scale=args.reward_scale)
env = VecNormalize(env)
elif env_type == 'atari':
if alg == 'acer':
env = make_atari_env(env_id, nenv, seed)
env = make_vec_env(env_id, env_type, nenv, seed)
elif alg == 'deepq':
env = atari_wrappers.make_atari(env_id)
env.seed(seed)
@@ -113,49 +110,56 @@ def build_env(args, render=False):
env.seed(seed)
else:
frame_stack_size = 4
env = VecFrameStack(make_atari_env(env_id, nenv, seed), frame_stack_size)
env = VecFrameStack(make_vec_env(env_id, env_type, nenv, seed), frame_stack_size)
elif env_type == 'retro':
import retro
gamestate = args.gamestate or 'Level1-1'
env = retro_wrappers.make_retro(game=args.env, state=gamestate, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE)
env = retro_wrappers.make_retro(game=args.env, state=gamestate, max_episode_steps=10000,
use_restricted_actions=retro.Actions.DISCRETE)
env.seed(args.seed)
env = bench.Monitor(env, logger.get_dir())
env = retro_wrappers.wrap_deepmind_retro(env)
elif env_type == 'classic':
elif env_type == 'classic_control':
def make_env():
e = gym.make(env_id)
e = bench.Monitor(e, logger.get_dir(), allow_early_resets=True)
e.seed(seed)
return e
env = DummyVecEnv([make_env])
else:
raise ValueError('Unknown env_type {}'.format(env_type))
return env
def get_env_type(env_id):
if env_id in _game_envs.keys():
env_type = env_id
env_id = [g for g in _game_envs[env_type]][0]
env_id = [g for g in _game_envs[env_type]][0]
else:
env_type = None
for g, e in _game_envs.items():
if env_id in e:
env_type = g
break
break
assert env_type is not None, 'env_id {} is not recognized in env types'.format(env_id, _game_envs.keys())
return env_type, env_id
def get_default_network(env_type):
if env_type == 'mujoco' or env_type=='classic':
if env_type == 'mujoco' or env_type == 'classic_control':
return 'mlp'
if env_type == 'atari':
return 'cnn'
raise ValueError('Unknown env_type {}'.format(env_type))
def get_alg_module(alg, submodule=None):
submodule = submodule or alg
try:
@@ -164,46 +168,47 @@ def get_alg_module(alg, submodule=None):
except ImportError:
# then from rl_algs
alg_module = import_module('.'.join(['rl_' + 'algs', alg, submodule]))
return alg_module
def get_learn_function(alg):
return get_alg_module(alg).learn
def get_learn_function_defaults(alg, env_type):
try:
alg_defaults = get_alg_module(alg, 'defaults')
kwargs = getattr(alg_defaults, env_type)()
except (ImportError, AttributeError):
kwargs = {}
kwargs = {}
return kwargs
def parse(v):
def parse(v):
'''
convert value of a command-line arg to a python object if possible, othewise, keep as string
'''
assert isinstance(v, str)
try:
return eval(v)
except (NameError, SyntaxError):
return eval(v)
except (NameError, SyntaxError):
return v
def main():
# configure logger, disable logging in child MPI processes (with rank > 0)
# configure logger, disable logging in child MPI processes (with rank > 0)
arg_parser = common_arg_parser()
args, unknown_args = arg_parser.parse_known_args()
extra_args = {k: parse(v) for k,v in parse_unknown_args(unknown_args).items()}
extra_args = {k: parse(v) for k, v in parse_unknown_args(unknown_args).items()}
if MPI is None or MPI.COMM_WORLD.Get_rank() == 0:
rank = 0
logger.configure()
else:
logger.configure(format_strs = [])
logger.configure(format_strs=[])
rank = MPI.COMM_WORLD.Get_rank()
model, _ = train(args, extra_args)
@@ -211,19 +216,19 @@ def main():
if args.save_path is not None and rank == 0:
save_path = osp.expanduser(args.save_path)
model.save(save_path)
if args.play:
logger.log("Running trained model")
env = build_env(args, render=True)
env = build_env(args)
obs = env.reset()
while True:
actions = model.step(obs)[0]
obs, _, done, _ = env.step(actions)
obs, _, done, _ = env.step(actions)
env.render()
done = done.any() if isinstance(done, np.ndarray) else done
if done:
obs = env.reset()
if __name__ == '__main__':

View File

@@ -2,5 +2,6 @@
- Original paper: https://arxiv.org/abs/1502.05477
- Baselines blog post https://blog.openai.com/openai-baselines-ppo/
- `mpirun -np 16 python -m baselines.trpo_mpi.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.trpo_mpi.run_mujoco` runs the algorithm for 1M timesteps on a Mujoco environment.
- `mpirun -np 16 python -m baselines.run --alg=trpo_mpi --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
- `python -m baselines.run --alg=trpo_mpi --env=Ant-v2 --num_timesteps=1e6` runs the algorithm for 1M timesteps on a Mujoco Ant environment.
- also refer to the repo-wide [README.md](../../README.md#training-models)

View File

@@ -1,4 +1,4 @@
from rl_common.models import mlp, cnn_small
from baselines.common.models import mlp, cnn_small
def atari():

12351
benchmarks_atari10M.htm Normal file

File diff suppressed because it is too large Load Diff

5640
benchmarks_mujoco1M.htm Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -1,19 +0,0 @@
import pytest
def pytest_addoption(parser):
parser.addoption('--runslow', action='store_true', default=False, help='run slow tests')
def pytest_collection_modifyitems(config, items):
if config.getoption('--runslow'):
# --runslow given in cli: do not skip slow tests
return
skip_slow = pytest.mark.skip(reason='need --runslow option to run')
slow_tests = []
for item in items:
if 'slow' in item.keywords:
slow_tests.append(item.name)
item.add_marker(skip_slow)
print('skipping slow tests', ' '.join(slow_tests), 'use --runslow to run this')

15
setup.cfg Normal file
View File

@@ -0,0 +1,15 @@
[flake8]
select = F,E999
exclude =
.git,
__pycache__,
baselines/her,
baselines/ddpg,
baselines/ppo1,
baselines/bench,
baselines/acktr,

View File

@@ -6,6 +6,20 @@ if sys.version_info.major != 3:
'Python {}. The installation will likely fail.'.format(sys.version_info.major))
extras = {
'test': [
'filelock',
'pytest'
]
}
all_deps = []
for group_name in extras:
all_deps += extras[group_name]
extras['all'] = all_deps
setup(name='baselines',
packages=[package for package in find_packages()
if package.startswith('baselines')],
@@ -18,18 +32,21 @@ setup(name='baselines',
'progressbar2',
'mpi4py',
'cloudpickle',
'tensorflow>=1.4.0',
'click',
'opencv-python'
],
extras_require={
'test': [
'filelock',
'pytest'
]
},
extras_require=extras,
description='OpenAI baselines: high quality implementations of reinforcement learning algorithms',
author='OpenAI',
url='https://github.com/openai/baselines',
author_email='gym@openai.com',
version='0.1.5')
# ensure there is some tensorflow build with version above 1.4
try:
from distutils.version import StrictVersion
import tensorflow
assert StrictVersion(tensorflow.__version__) >= StrictVersion('1.4.0')
except ImportError:
assert False, "TensorFlow needed, of version above 1.4"