Compare commits

209 Commits

Author SHA1 Message Date
SOLARIS
b99a73afe3 entrypoint variable made public (#970) and Fix RuntimeError (#910) (#1015) (#1032) 2019-11-08 15:20:54 -08:00
Peter Zhokhov
517433f22a flake8 complaints 2019-11-08 15:15:38 -08:00
Isaac Lascasas
713f1a0aec tf2: Updated setup.py dependencies. (#1002)
* Updated setup.py dependencies.

* Allow tf2 baselines pip package setup to work with all tf2 cpu/gpu versions.
2019-10-25 15:50:04 -07:00
tanzhenyu
d1a05a0dd2 Baselines for Tensorflow 2.0. (#978)
* Baselines for Tensorflow 2.0.

Please do note that:
1. ACER, ACKTR, and GAIL are still under development by external
contributors.
2. HER is still under development by tanzheny@google.com.

* Some cleanup.

* Addressing some comments.
2019-08-08 11:03:17 -07:00
tanzhenyu
c57528573e Remove model def from deepq. (#946) 2019-06-27 10:12:38 -07:00
Marcin Michalski
2bca7901f5 Updating the version to 0.1.6 (#933)
Updating the version in setup.py to avoid conflict with the old (>1 year old) version in pypi.
2019-06-24 10:19:01 -07:00
albert
ba2b017820 add log_path flag to command line utility (#917)
* add log_path flag to command line utility

* Update README with log_path flag

* clarify logging and viz docs
2019-06-07 15:05:52 -07:00
Anton Grigoryev
7c520852d9 Fix converting list of LazyFrames to ndarray (#907) 2019-05-31 16:49:46 -07:00
pzhokhov
1c872ca8fd run test_monitor through pytest; fix the test, add flake8 to bench directory - like PR 891 (#921) 2019-05-31 15:36:20 -07:00
Jinho Lee
ff8d36a7a7 Starting to reassign waiting_step in shmem_vecenv (#915)
"self.waiting_step" is initialized in the __init__ function but is never reassigned afterwards.
Because the reset and close_extras functions rely on it, it needs to be kept up to date.
This change reassigns it in the same way subproc_vec_env does.
2019-05-31 14:31:35 -07:00
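The flag handling described in the commit above can be illustrated with a minimal sketch. `ToyAsyncVecEnv` is a hypothetical stand-in, not the baselines shmem or subproc vec env, and the doubling "environment" is invented for illustration. It only assumes the general pattern: the flag is set when a step is dispatched, cleared when results are collected, and consulted before a reset.

```python
class ToyAsyncVecEnv:
    """Hypothetical vectorized env with asynchronous stepping (illustration only)."""

    def __init__(self):
        self.waiting_step = False
        self._pending = None
        self.log = []

    def step_async(self, actions):
        # Dispatch work and record that results are outstanding.
        self._pending = list(actions)
        self.waiting_step = True

    def step_wait(self):
        # Collect the outstanding results and clear the flag.
        results = [a * 2 for a in self._pending]
        self._pending = None
        self.waiting_step = False
        return results

    def reset(self):
        # Drain any outstanding step first, so stale results are not mixed in.
        if self.waiting_step:
            self.log.append("drained pending step before reset")
            self.step_wait()
        return "reset"
```

If the flag were only ever assigned in `__init__`, the check in `reset` would never fire, which is the kind of problem the commit message describes.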
Sridhar Thiagarajan
7614b02f7a remove f strings for python back compatibility (#906) 2019-05-31 14:27:11 -07:00
Andy Twigg
f7d5a265e1 suppress excessive messages from unused loggers (#920)
Only print the "logging to [dir]" message when the logger has something to output. Running with the new spawn change and both mpi and subprocvecenv, there are many "logging to [dir]" messages but most are not logging anything.
2019-05-31 14:26:45 -07:00
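A rough sketch of the behaviour described in the commit above: the "logging to [dir]" line is deferred until the logger actually has something to write. `LazyAnnouncingLogger` and its `sink` parameter are hypothetical names chosen for illustration; this is not the baselines logger implementation.

```python
class LazyAnnouncingLogger:
    """Hypothetical logger that announces its directory only on first real output."""

    def __init__(self, directory, sink=print):
        self.directory = directory
        self.sink = sink          # callable that receives each output line
        self._announced = False
        self._pending = {}

    def logkv(self, key, value):
        # Buffer a key/value pair until the next dump.
        self._pending[key] = value

    def dumpkvs(self):
        if not self._pending:
            return  # nothing to write, so stay silent
        if not self._announced:
            self.sink("Logging to %s" % self.directory)
            self._announced = True
        for key in sorted(self._pending):
            self.sink("%s: %s" % (key, self._pending[key]))
        self._pending.clear()
```

With many worker processes each holding a logger, only those that actually dump values would print the announcement.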
Joshua Meier
21776e8f57 Support Tuple observation spaces (#911) 2019-05-31 14:06:20 -07:00
pzhokhov
9b68103b73 release Internal changes (#895)
* joshim5 changes (width and height to WarpFrame wrapper)

* match network output with action distribution via a linear layer only if necessary (#167)

* support color vs. grayscale option in WarpFrame wrapper (#166)

* support color vs. grayscale option in WarpFrame wrapper

* Support color in other wrappers

* Updated per Peter's suggestions

* fixing test failures

* ppo2 with microbatches (#168)

* pass microbatch_size to the model during construction

* microbatch fixes and test (#169)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* Peterz joshim5 subclass ppo2 model (#170)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* subclassing the model to make microbatched version of model WIP

* made microbatched model a subclass of ppo2 Model

* flake8 complaint

* mpi-less ppo2 (resolving merge conflict)

* flake8 and mpi4py imports in ppo2/model.py

* more un-mpying

* merge master

* updates to the benchmark viewer code + autopep8 (#184)

* viz docs and syntactic sugar wip

* update viewer yaml to use persistent volume claims

* move plot_util to baselines.common, update links

* use 1Tb hard drive for results viewer

* small updates to benchmark visualizer code

* autopep8

* autopep8

* any folder can be a benchmark

* massage games image a little bit

* fixed --preload option in app.py

* remove preload from run_viewer.sh

* remove pdb breakpoints

* update bench-viewer.yaml

* fixed bug (#185)

* fixed bug 

it's wrong to do the else statement, because no other nodes would start.

* changed the fix slightly

* Refactor her phase 1 (#194)

* add monitor to the rollout envs in her RUN BENCHMARKS her

* Slice -> Slide in her benchmarks RUN BENCHMARKS her

* run her benchmark for 200 epochs

* dummy commit to RUN BENCHMARKS her

* her benchmark for 500 epochs RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* disable saving of policies in her benchmark RUN BENCHMARKS her

* run fetch benchmarks with ppo2 and ddpg RUN BENCHMARKS Fetch

* run fetch benchmarks with ppo2 and ddpg RUN BENCHMARKS Fetch

* launcher refactor wip

* wip

* her works on FetchReach

* her runner refactor RUN BENCHMARKS Fetch1M

* unit test for her

* fixing warnings in mpi_average in her, skip test_fetchreach if mujoco is not present

* pickle-based serialization in her

* remove extra import from subproc_vec_env.py

* investigating differences in rollout.py

* try with old rollout code RUN BENCHMARKS her

* temporarily use DummyVecEnv in cmd_util.py RUN BENCHMARKS her

* dummy commit to RUN BENCHMARKS her

* set info_values in rollout worker in her RUN BENCHMARKS her

* bug in rollout_new.py RUN BENCHMARKS her

* fixed bug in rollout_new.py RUN BENCHMARKS her

* do not use last step because vecenv calls reset and returns obs after reset RUN BENCHMARKS her

* updated buffer sizes RUN BENCHMARKS her

* fixed loading/saving via joblib

* dust off learning from demonstrations in HER, docs, refactor

* add deprecation notice on her play and plot files

* address comments by Matthias

* 1.5 months of codegen changes (#196)

* play with resnet

* feed_dict version

* coinrun prob and more stats

* fixes to get_choices_specs & hp search

* minor prob fixes

* minor fixes

* minor

* alternative version of rl_algo stuff

* pylint fixes

* fix bugs, move node_filters to soup

* changed how get_algo works

* change how get_algo works, probably broke all tests

* continue previous refactor

* get eval_agent running again

* fixing tests

* fix tests

* fix more tests

* clean up cma stuff

* fix experiment

* minor changes to eval_agent to make ppo_metal use gpu

* make dict space work

* modify mac makefile to use conda

* recurrent layers

* play with bn and resnets

* minor hp changes

* minor

* got rid of use_fb argument and jtft (joint-train-fine-tune) functionality
built test phase directly into AlgoProb

* make new rl algos generateable

* pylint; start fixing tests

* fixing tests

* more test fixes

* pylint

* fix search

* work on search

* hack around infinite loop caused by scan

* algo search fixes

* misc changes for search expt

* enable annealing, overriding options of Op

* pylint fixes

* identity op

* achieve use_last_output through masking so it automatically works in other distributions

* fix tests

* minor

* discrete

* use_last_output to be just a preference, not a hard constraint

* pred delay, pruning

* require nontrivial inputs

* aliases for get_sm

* add probname to probs

* fixes

* small fixes

* fix tests

* fix tests

* fix tests

* minor

* test scripts

* dualgru network improvements

* minor

* work on mysterious bugs

* rcall gpu-usage command for kube

* use cache dir that’s not in code folder, so that it doesn’t get removed by rcall code rsync

* add power mode to gpu usage

* make sure train/test actually different

* remove VR for now

* minor fixes

* simplify soln_db

* minor

* big refactor of mpi eda

* improve mpieda for multitask

* - get rid of timelimit hack
- add __del__ to cleanup SubprocVecEnv

* get multitask working better

* fixes

* working on atari, various

* annotate ops with whether they’re parametrized

* minor

* gym version

* rand atari prob

* minor

* SolnDb bugfix and name change

* pyspy script

* switch conv layers

* fix roboschool/bullet3

* nenvs assertion

* fix rand atari

* get rid of blanket exception catching
fix soln_db bug

* fix rand_atari

* dynamic routing as cmdline arg

* slight modifications to test_mpi_map and pyspy-all

* max_tries argument for run_until_successs

* dedup option in train_mle

* simplify soln_db

* increase atari horizon for 1 experiment

* start implementing reward increment

* ent multiplier

* create cc dsl
other misc fixes

* cc ops

* q_func -> qs in rl_algos_cc.py

* fix PredictDistr

* rl_ops_cc fixes, MakeAction op

* augment algo agent to support cc stuff

* work on ddpg experiments

* fix blocking
temporarily change logger

* allow layer scaling

* pylint fixes

* spawn_method

* isolate ddpg hacks

* improve pruning

* use spawn for subproc

* remove use of python -c in rcall

* fix pylint warning

* fix static

* maybe fix local backend

* switch to DummyVecEnv

* making some fixes via pylint

* pylint fixes

* fixing tests

* fix tests

* fix tests

* write scaffolding for SSL in Codegen

* logger fix

* fix error

* add EMA op to sl_ops

* save many changes

* save

* add upsampler

* add sl ops, enhance state machine

* get ssl search working — some gross hacking

* fix session/graph issue

* fix importing

* work on mle

* - scale embeddings in gru model
- better exception handling in sl_prob
- use emas for test/val
- use non-contrib batch_norm layer

* improve logging

* option to average before dumping in logger

* default arguments, etc

* new ddpg and identity test

* concat fix

* minor

* move realistic ssl stuff to third-party (underscore to dash)

* fixes

* remove realistic_ssl_evaluation

* pylint fixes

* use gym master

* try again

* pass around args without gin

* fix tests

* separate line to install gym

* rename failing tests that should be ignored

* add data aug

* ssl improvements

* use fixed time limit

* try to fix baselines tests

* add score_floor, max_walltime, fiddle with lr decay

* realistic_ssl

* autopep8

* various ssl
- enable blocking grad for simplification
- kl
- multiple final prediction

* fix pruning

* misc ssl stuff

* bring back linear schedule, don’t use allgather for collecting stats
(i’ve been getting nondeterministic errors from the old code)

* save/load weights in SSL, big stepsize

* cleanup SslProb

* fix

* get rid of kl coef

* fix simplification, lower lr

* search over hps

* minor fixes

* minor

* static analysis

* move files and rename things for improved consistency.
still broken, and just saving before making nontrivial changes

* various

* make tests pass

* move coinrun_train to codegen since it depends on codegen

* fixes

* pylint fixes

* improve tests
fix some things

* improve tests

* lint

* fix up db_info.py, tests

* mostly restore master version of envs directory, except for makefile changes

* fix tests

* improve printing

* minor fixes

* fix fixmes

* pruning test

* fixes

* lint

* write new test that makes tf graphs of random algos; fix some bugs it caught

* add --delete flag to rcall upload-code command

* lint

* get cifar10 lazily for testing purposes

* disable codegen ci tests for now

* clean up rl_ops

* rename spec classes

* td3 with identity test

* identity tests without gin files

* remove gin.configurable from AlgoAgent

* comments about reduction in rl_ops_cc

* address @pzhokhov comments

* fix tests

* more linting

* better tests

* clean up filtering a bit

* fix concat

* delayed logger configuration (#208)

* delayed logger configuration

* fix typo

* setters and getters for Logger.DEFAULT as well

* do away with fancy property stuff - unable to get it to work with class level methods

* grammar and spaces

* spaces

* use get_current function instead of reading Logger.CURRENT

* autopep8

* disable mpi in subprocesses (#213)

* lazy_mpi load

* cleanups

* more lazy mpi

* don't pretend that class is a module, just use it as a class

* mass-replace mpi4py imports

* flake8

* fix previous lazy_mpi imports

* silly recursion

* try os.environ hack

* better prefix test, work with mpich

* restored MPI imports

* removed commented import in test_with_mpi

* restored codegen from master

* remove lazy mpi

* restored changes from rl-algs

* remove extra files

* address Chris' comments

* use spawn for shmem vec env as well (#2) (#219)

* lazy_mpi load

* cleanups

* more lazy mpi

* don't pretend that class is a module, just use it as a class

* mass-replace mpi4py imports

* flake8

* fix previous lazy_mpi imports

* silly recursion

* try os.environ hack

* better prefix test, work with mpich

* restored MPI imports

* removed commented import in test_with_mpi

* restored codegen from master

* remove lazy mpi

* restored changes from rl-algs

* remove extra files

* port mpi fix to shmem vec env

* increase the mpi test default timeout

* change humanoid hyperparameters, get rid of clip_frac annealing, as it's apparently dangerous

* remove clip_frac schedule from ppo2

* more timesteps in humanoid run

* whitespace + RUN BENCHMARKS

* baselines: export vecenvs from folder (#221)

* baselines: export vecenvs from folder

* put missing function back in

* add missing imports

* more imports

* longer mpi timeout?

* make default logger configuration the same as call to logger.configure() (#222)

* Vecenv refactor (#223)

* update karl util

* restore pvi flag

* change rcall auto cpu behavior, move gin.configurable, add os.makedirs

* vecenv refactor

* aux buf index fix

* add num aux obs

* reset level with enter

* restore high difficulty flag

* bugfix

* restore train_coinrun.py

* tweaks

* renaming

* renaming

* better arguments handling

* more options

* options cleanup

* game data refactor

* more options

* args for train_procgen

* add close handler to interactive base class

* use debug build if debug=True, fix range on aux_obs

* add ProcGenEnv to __init__.py, add missing imports to procgen.py

* export RemoveDictWrapper and build, update train_procgen.py, move assets download into env creation and replace init_assets_and_build with just build

* fix formatting issues

* only call global init once

* fix path in setup.py

* revert part of makefile

* ignore IDE files and folders

* vec remove dict

* export VecRemoveDictObs

* remove RemoveDictWrapper

* remove IDE files

* move shared .h and .cpp files to common folder, update build to use those, dedupe env.cpp

* fix missing header

* try unified build function

* remove old scripts dir

* add comment on build

* upload libenv with render fixes

* tell qthreads to die when we unload the library

* pyglet.app.run is garbage

* static fixes

* whoops

* actually vsync is on

* cleanup

* cleanup

* extern C for libenv interface

* parse util rcall arg

* high difficulty fix

* game type enums

* ProcGenEnv subclasses

* game type cleanup

* unrecognized key

* unrecognized game type

* parse util reorg

* args management

* typo fix

* GinParser

* arg tweaks

* tweak

* restore start_level/num_levels setting

* fix create_procgen_env interface

* build fix

* procgen args in init signature

* fix

* build fix

* fix logger usage in ppo_metal/run_retro

* removed unnecessary OrderedDict requirement in subproc_vec_env

* flake8 fix

* allow for non-mpi tests

* mpi test fixes

* flake8; removed special logic for discrete spaces in dummy_vec_env

* remove forked argument in front of tests - does not play nicely with subprocvecenv in spawned processes; analog of forked in ddpg/test_smoke

* Everyrl initial commit & a few minor baselines changes (#226)

* everyrl initial commit

* add keep_buf argument to VecMonitor

* logger changes: set_comm and fix to mpi_mean functionality

* if filename not provided, don't create ResultsWriter

* change variable syncing function to simplify its usage. now you should initialize from all mpi processes

* everyrl coinrun changes

* tf_distr changes, bugfix

* get_one

* bring back get_next to temporarily restore code

* lint fixes

* fix test

* rename profile function

* rename gaussian

* fix coinrun training script

* change random seeding to work with new gym version (#231)

* change random seeding to work with new gym version

* move seeding to seed() method

* fix mnistenv

* actually try some of the tests before pushing

* more deterministic fixed seq

* misc changes to vecenvs and run.py for benchmarks (#236)

* misc changes to vecenvs and run.py for benchmarks

* dont seed global gen

* update more references to assert_venvs_equal

* Rl19 (#232)

* everyrl initial commit

* add keep_buf argument to VecMonitor

* logger changes: set_comm and fix to mpi_mean functionality

* if filename not provided, don't create ResultsWriter

* change variable syncing function to simplify its usage. now you should initialize from all mpi processes

* everyrl coinrun changes

* tf_distr changes, bugfix

* get_one

* bring back get_next to temporarily restore code

* lint fixes

* fix test

* rename profile function

* rename gaussian

* fix coinrun training script

* rl19

* remove everyrl dir which appeared in the merge for some reason

* readme

* fiddle with ddpg

* make ddpg work

* steps_total argument

* gpu count

* clean up hyperparams and shape math

* logging + saving

* configuration stuff

* fixes, smoke tests

* fix stats

* make load_results return dicts -- easier to create the same kind of objects with some other mechanism for passing to downstream functions

* benchmarks

* fix tests

* add dqn to tests, fix it

* minor

* turned annotated transformer (pytorch) into a script

* more refactoring

* jax stuff

* cluster

* minor

* copy & paste alec code

* sign error

* add huber, rename some parameters, snapshotting off by default

* remove jax stuff

* minor

* move maze env

* minor

* remove trailing spaces

* remove trailing space

* lint

* fix test breakage due to gym update

* rename function

* move maze back to codegen

* get recurrent ppo working

* enable both lstm and gru

* script to print table of benchmark results

* various

* fix dqn

* add fixup initializer, remove lastrew

* organize logging stats

* fix silly bug

* refactor models

* fix mpi usage

* check sync

* minor

* change vf coef, hps

* clean up slicing in ppo

* minor fixes

* caching transformer

* docstrings

* xf fixes

* get rid of 'B' and 'BT' arguments

* minor

* transformer example

* remove output_kind from base class until we have a better idea how to use it

* add comments, revert maze stuff

* flake8

* codegen lint

* fix codegen tests

* responded to peter's comments

* lint fixes

* minor changes to baselines (#243)

* minor changes to baselines

* fix spaces reference

* remove flake8 disable comments and fix import

* okay maybe don't add spec to vec_env

* Merge branch 'master' of github.com:openai/games

* flake8 complaints in baselines/her

* update dmlab30 env (#258)

* codegen continuous control experiment pr (#256)

* finish cherry-pick td3 test commit

* removed graph simplification error ignore

* merge delayed logger config

* merge updated baselines logger

* lazy_mpi load

* cleanups

* use lazy mpi imports in codegen

* more lazy mpi

* don't pretend that class is a module, just use it as a class

* mass-replace mpi4py imports

* flake8

* fix previous lazy_mpi imports

* removed extra printouts from TdLayer op

* silly recursion

* running codegen cc experiment

* wip

* more wip

* use actor as input for critic targets, instead of the action taken

* batch size 100

* tweak update parameters

* tweaking td3 runs

* wip

* use nenvs=2 for contcontrol (to be comparable with ppo_metal)

* wip. Doubts about usefulness of actor in critic target

* delayed actor in ActorLoss

* score is average of last 100

* skip lack of losses or too many action distributions

* 16 envs for contcontrol, replay buffer size equal to horizon (no point in making it longer)

* syntax

* microfixes

* minifixes

* run in process logic to bypass tensorflow freezes/failures (per Oleg's suggestion)

* squash-merge master, resolve conflicts

* remove erroneous file

* restore normal MPI imports

* move wrappers around a little bit

* autopep8

* cleanups

* cleanup mpi_eda, autopep8

* make activation function of action distribution customizable

* cleanups; preparation for a pr

* syntax

* merge latest master, resolve conflicts

* wrap MPI import with try/except

* allow import of modules through env id in baselines cmd_util

* flake8 complaints

* only wrap box action spaces with ClipActionsWrapper

* flake8

* fixes to algo_prob according to Oleg's suggestions

* use apply_without_scope flag in ActorLoss

* remove extra line in algo/core.py

* Rl19 metalearning (#261)

* rl19 metalearning and dict obs

* master merge arch fix

* lint fixes

* view fixes

* load vars tweaks

* user config cleanup

* documentation and revisions

* pass train comm to rl19

* cleanup

* Symshapes - gives codegen ability to evaluate same algo on envs with different ob/ac shapes (#262)

* finish cherry-pick td3 test commit

* removed graph simplification error ignore

* merge delayed logger config

* merge updated baselines logger

* lazy_mpi load

* cleanups

* use lazy mpi imports in codegen

* more lazy mpi

* don't pretend that class is a module, just use it as a class

* mass-replace mpi4py imports

* flake8

* fix previous lazy_mpi imports

* removed extra printouts from TdLayer op

* silly recursion

* running codegen cc experiment

* wip

* more wip

* use actor as input for critic targets, instead of the action taken

* batch size 100

* tweak update parameters

* tweaking td3 runs

* wip

* use nenvs=2 for contcontrol (to be comparable with ppo_metal)

* wip. Doubts about usefulness of actor in critic target

* delayed actor in ActorLoss

* score is average of last 100

* skip lack of losses or too many action distributions

* 16 envs for contcontrol, replay buffer size equal to horizon (no point in making it longer)

* syntax

* microfixes

* minifixes

* run in process logic to bypass tensorflow freezes/failures (per Oleg's suggestion)

* random physics for mujoco

* random parts sizes with range 0.4

* add notebook with results into x/peterz

* variations of ant

* roboschool use gym.make kwargs

* use float as lowest score after rank transform

* rcall from master

* wip

* re-enable dynamic routing

* wip

* squash-merge master, resolve conflicts

* remove erroneous file

* restore normal MPI imports

* move wrappers around a little bit

* autopep8

* cleanups

* cleanup mpi_eda, autopep8

* make activation function of action distribution customizable

* cleanups; preparation for a pr

* syntax

* merge latest master, resolve conflicts

* wrap MPI import with try/except

* allow import of modules through env id in baselines cmd_util

* flake8 complaints

* only wrap box action spaces with ClipActionsWrapper

* flake8

* fixes to algo_prob according to Oleg's suggestions

* use apply_without_scope flag in ActorLoss

* remove extra line in algo/core.py

* multi-task support

* autopep8

* symbolic suffix-shapes (not B,T yet)

* test_with_mpi -> with_mpi rename

* remove extra blank lines in algo/core

* remove extra blank lines in algo/core

* remove more blank lines

* symbolify shapes in existing algorithms

* minor output changes

* cleaning up merge conflicts

* cleaning up merge conflicts

* cleaning up more merge conflicts

* restore mpi_map.py from master

* remove tensorflow dependency from VecEnv

* make tests use single-threaded session for determinism of KfacOptimizer (#298)

* make tests use single-threaded session for determinism of KfacOptimizer

* updated comment in kfac.py

* remove unused sess_config

* add score calculator wrapper, forward property lookups on vecenv wrap… (#300)

* add score calculator wrapper, forward property lookups on vecenv wrapper, misc cleanup

* tests

* pylint

* fix vec monitor infos

* Workbench (#303)

* begin workbench

* cleanup

* begin procgen config integration

* arg tweaks

* more args

* parameter saving

* begin procgen enjoy

* tweaks

* more workbench

* more args sync/restore

* cleanup

* merge in master

* rework args priority

* more workbench

* more logging

* impala cnn

* impala lstm

* tweak

* tweaks

* rl19 time logging

* misc fixes

* faster pipeline

* update local.py

* sess and log config tweaks

* num processes

* logging tweaks

* difficulty reward wrapper

* logging fixes

* gin tweaks

* tweak

* fix

* task id

* param loading

* more variable loading

* entrypoint

* tweak

* ksync

* restore lstm

* begin rl19 support

* tweak

* rl19 rnn

* more rl19 integration

* fix

* cleanup

* restore rl19 rnn

* cleanup

* cleanup

* wrappers.get_log_info

* cleanup

* cleanup

* directory cleanup

* logging, num_experiments

* fixes

* cleanup

* gin fixes

* fix local max gpu

* resid nx

* num machines and download params

* rename

* cleanup

* create workbench

* more reorg

* fix

* more logging wrappers

* lint fix

* restore train procgen

* restore train procgen

* pylint fix

* better wrapping

* config sweep

* args sweep

* test workers

* mpi_weight

* train test comm and high difficulty fix

* enjoy show returns

* removing gin, procgen_parser

* removing gin

* procgen args

* config fixes

* cleanup

* cleanup

* procgen args fix

* fix

* rcall syncing

* lint

* rename mpi_weight

* use username for sync

* fixes

* microbatch fix

* Grad clipping in MpiAdamOptimizer, transformer changes (#304)

* transformer mnist experiments

* version that only builds one model

* work on inverted mnist

* Add grad clipping to MpiAdamOptimizer

* various

* transformer changes, loading

* get rid of soft labels

* transformer baseline

* minor

* experiments involving all possible training sets

* vary training

* minor

* get ready for fine-tuning expers

* lint

* minor

* Add jrl19 as backend for workbench (#324)

enable jrl in workbench
minor logger changes

* extra functionality in baselines.common.plot_util (#310)

* get plot_util from mt_experiments branch

* add labels

* unit tests for plot_util

* Fixed sequence env minor (#333)

minor changes to FixedSequenceEnv to allow full score

* fix tests (#335)

* Procgen Benchmark Updates (#328)

* directory cleanup

* logging, num_experiments

* fixes

* cleanup

* gin fixes

* fix local max gpu

* resid nx

* tweak

* num machines and download params

* rename

* cleanup

* create workbench

* more reorg

* fix

* more logging wrappers

* lint fix

* restore train procgen

* restore train procgen

* pylint fix

* better wrapping

* whackamole walls

* config sweep

* tweak

* args sweep

* tweak

* test workers

* mpi_weight

* train test comm and high difficulty fix

* enjoy show returns

* better joint training

* tweak

* Add --update to args and add gin-config to requirements.txt

* add username to download_file

* removing gin, procgen_parser

* removing gin

* procgen args

* config fixes

* cleanup

* cleanup

* procgen args fix

* fix

* rcall syncing

* lint

* rename mpi_weight

* begin composable game

* more composable game

* tweak

* background alpha

* use username for sync

* fixes

* microbatch fix

* lure composable game

* merge

* proc trans update

* proc trans update (#307)

* finetuning experiment

* Change is_local to use `use_rcall` and fix error of `enjoy.py` with multiple ends

* graphing help

* add --local

* change args_dict['env_name'] to ENV_NAME

* finetune experiments

* tweak

* tweak

* reorg wrappers, remove is_local

* workdir/local fixes

* move finetune experiments

* default dir and graphing

* more graphing

* fix

* pooled syncing

* tweaks

* dir fix

* tweak

* wrapper mpi fix

* wind and turrets

* composability cleanup

* radius cleanup

* composable reorg

* laser gates

* composable tweaks

* soft walls

* tweak

* begin swamp

* more swamp

* more swamp

* fix

* hidden mines

* use maze layout

* tweak

* laser gate tweaks

* tweaks

* tweaks

* lure/propel updates

* composable midnight

* composable coinmaze

* composability difficulty

* tweak

* add step to save_params

* composable offsets

* composable boxpush

* composable combiner

* tweak

* tweak

* always choose correct number of mechanics

* fix

* rcall local fix

* add steps when dump and save params

* loading rank 1,2,3.. error fix

* add experiments.py

* fix loading latest weight with no -rest

* support more complex run_id and add more examples

* fix typo

* move post_run_id into experiments.py

* add hp_search example

* error fix

* joint experiments in progress

* joint hp finished

* typo

* error fix

* edit experiments

* Save experiments set up in code and save weights per step (#319)

* add step to save_params

* add steps when dump and save params

* loading rank 1,2,3.. error fix

* add experiments.py

* fix loading latest weight with no -rest

* support more complex run_id and add more examples

* fix typo

* move post_run_id into experiments.py

* add hp_search example

* error fix

* joint experiments in progress

* joint hp finished

* typo

* error fix

* edit experiments

* tweaks

* graph exp WIP

* depth tweaks

* move save_all

* fix

* restore_dir name

* restore depth

* choose max mechanics

* use override mode

* tweak frogger

* lstm default

* fix

* patience is composable

* hunter is composable

* fixed asset seed cleanup

* minesweeper is composable

* eggcatch is composable

* tweak

* applesort is composable

* chaser game

* begin lighter

* lighter game

* tractor game

* boxgather game

* plumber game

* hitcher game

* doorbell game

* lawnmower game

* connecter game

* cannonaim

* outrun game

* encircle game

* spinner game

* tweak

* tweak

* detonator game

* driller

* driller

* mixer

* conveyor

* conveyor game

* joint pcg experiments

* fixes

* pcg sweep experiment

* cannonaim fix

* combiner fix

* store save time

* laseraim fix

* lightup fix

* detonator tweaks

* detonator fixes

* driller fix

* lawnmower calibration

* spinner calibration

* propel fix

* train experiment

* print load time

* system independent hashing

* remove gin configurable

* task ids fix

* test_pcg experiment

* connecter dense reward

* hard_pcg

* num train comms

* mpi splits envs

* tweaks

* tweaks

* graph tweaks

* graph tweaks

* lint fix

* fix tests

* load bugfix

* difficulty timeout tweak

* tweaks

* more graphing

* graph tweaks

* tweak

* download file fix

* pcg train envs list

* cleanup

* tweak

* manually name impala layers

* tweak

* expect fps

* backend arg

* args tweak

* workbench cleanup

* move graph files

* workbench cleanup

* split env name by comma

* workbench cleanup

* ema graph

* remove Dict

* use tf.io.gfile

* comments for auto-killing jobs

* lint fix

* write latest file when not saving all and load it when step=None

* ci/runtests.sh - pass all folders to pytest (#342)

* ci/runtests.sh - pass all folders to pytest

* mpi_optimizer_test precision 1e-4

* fixes to tests

* search for tests in the entire jax folder, also remove unnecessary humor

* delete unnecessary stuff (#338)

* Add initializer for process-level setup in SubprocVecEnv (#276)

* Add initializer for process-level setup in SubprocVecEnv

Use case: run logger.configure() in each subprocess

* Add option to force dummy vec env

* Procgen fixes (#352)

* tweak

* documentation

* rely on log_comm, remove mpi averaging from wrappers

* pass comm for ppo2 initialization

* ppo2 logging

* experiment tweaks

* auto launch tensorboard when using local backend

* graph tweaks

* pass caller to config

* configure logger and tensorboard

* make parent dir if necessary

* parentdir tweak

* JRL PPO test with delayed identity env (#355)

* add a custom delay to identity_env

* min reward 0.8 in delayed identity test

* seed the tests, perfect score on delayed_identity_test

* delay=1 in delayed_identity_test

* flake8 complaints

* increased number of steps in fixed_seq_test

* seed identity tests to ensure reproducibility

* docstrings

* (onp, np) -> (np, jp), switch jax code to use mark_slow decorator (#363)

switch to mark_slow decorator

* fix tests - add matplotlib to setup_requires, put mpi4py import in try-except

* test fixes
2019-05-08 11:36:10 -07:00
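The "put mpi4py import in try-except" item in the commit above describes an optional-dependency import. A minimal sketch of that pattern, assuming a fallback to a single process when `mpi4py` is missing (the helper name `num_ranks` is hypothetical and not taken from the repository):

```python
# Optional-dependency import: fall back to None when mpi4py is not installed.
try:
    from mpi4py import MPI
except ImportError:
    MPI = None


def num_ranks():
    """Return the number of MPI ranks, or 1 when MPI is unavailable (hypothetical helper)."""
    if MPI is None:
        return 1
    return MPI.COMM_WORLD.Get_size()
```

Code that needs MPI can then check `MPI is None` instead of failing at import time.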
pzhokhov
3301089b48 remove bullet extra, constrain gym version to be >= 0.10.0 (#885)
* remove bullet extra, constrain gym version to be >= 0.10.0

* constrain gym version from above
2019-04-26 16:14:49 -07:00
pzhokhov
a07fad9066 change rms 2 tfrms switch in vec_normalize to be more explicit (#886)
* change rms 2 tfrms switch in vec_normalize to be more explicit

* modify the vec_normalize / use_tf logic a little bit

* typo

* use_tf = False by default
2019-04-26 16:14:21 -07:00
Taeyeong Jeong
5d8041d18e Fix indexing LazyFrames (#875)
Indexing LazyFrames with index i should return the single channel frame
2019-04-19 15:00:09 -07:00
Peter Zhokhov
fa37beb52e fix commit on atari bms page to point to a public commit 2019-04-06 20:03:32 -07:00
Peter Zhokhov
8a97e0df10 fix shuffling bug in ppo1 2019-04-05 15:23:46 -07:00
pzhokhov
fabbf2c611 short-circuit framestack wrapper with size 1 (#871) 2019-04-05 15:18:15 -07:00
Xingdong Zuo
5d285b318f [Update misc_util.py]: clean up unused helper functions (#751)
* Update misc_util.py

* Update misc_util.py
2019-04-05 15:16:26 -07:00
Tim Zaman
49a99c7d23 Add eps to normalization (#797) 2019-04-05 14:46:01 -07:00
Peter Zhokhov
c79b3373bf parse colon-separated env_id's 2019-04-05 14:43:09 -07:00
Sridhar Thiagarajan
6d1c6c78d3 Interface for U.make_session changed (#865) 2019-04-01 16:24:02 -07:00
JongGyun Kim
62a9c76f18 fix the definition of TfInput.make_feed_dict. (#812) 2019-04-01 15:49:25 -07:00
Hao-Chih, Lin
282c9cc91f fix small bug in plot_results() (#864)
Remove the comma behind the last input argument
2019-04-01 15:48:35 -07:00
Peter Zhokhov
096f4d9cf0 neaten up stacking logic in mujoco_dset in gail 2019-04-01 15:47:13 -07:00
Mingfei
16136ddca7 fix bugs: obs_ph normalization in adversary.py (#823)
* fix bugs: obs_ph normalization in adversary.py

* fix bug in reshaping obs and acs in Mujoco_Dset
2019-04-01 15:44:31 -07:00
Darío Hereñú
b1644157d6 Fixed typo on #092 (#824) 2019-04-01 15:41:52 -07:00
Yu Feng
58541db226 MPI refer to workers as ranks, not threads. (#833) 2019-04-01 15:38:45 -07:00
zlsh80826
c02b575f01 ppo2: use time.perf_counter() instead of time.time() for time measurement (#847) 2019-04-01 15:37:32 -07:00
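The change named in the commit above swaps `time.time()` for `time.perf_counter()` when measuring elapsed time. A minimal sketch of interval timing with the monotonic, high-resolution clock (the `timed` helper is an illustrative assumption, not code from ppo2):

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds); hypothetical helper."""
    # perf_counter() is monotonic, so the difference is never negative,
    # unlike time.time(), which can jump when the system clock is adjusted.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

The absolute value returned by `perf_counter()` has no meaning; only differences between two calls do.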
Pastafarianist
897fa31548 Avoid using default config while requesting available GPUs (#863) 2019-03-29 13:25:56 -07:00
Brett Daley
d51f8be8f9 Report episode rewards/length in A2C and ACKTR (#856) 2019-03-28 09:21:48 -07:00
Jacob Hilton
3f2f45acef Merge pull request #860 from openai/build-retro-env-framestack-fix
run.py framestack bug fix
2019-03-25 14:33:15 -07:00
Jacob Hilton
b64974eb90 build_env now doesn't apply frame stack to retro games twice 2019-03-24 12:27:14 -07:00
pzhokhov
1b092434fc remove f-strings for python 3.5 compatibility (#854) 2019-03-16 11:54:47 -07:00
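F-strings require Python 3.6, so the commit above replaces them to keep Python 3.5 working. A minimal sketch of the equivalent `str.format` form (the function name and message text are hypothetical):

```python
def describe_run(env_id, num_timesteps):
    """Build a status message without f-strings (hypothetical example)."""
    # Python 3.6+ only: f"Training on {env_id} for {num_timesteps} steps"
    return "Training on {} for {} steps".format(env_id, num_timesteps)
```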
Peter Zhokhov
1259f6ab25 check for environment being vectorized in the play logic in run.py 2019-03-11 17:44:03 -07:00
pzhokhov
74101a9f24 fix freeze of ppo2 (#849)
* fix freeze of ppo2

* unit test for freeze, updated docstring

* more docstring update

* set number of threads to 1 in the test
2019-03-11 17:28:51 -07:00
JongGyun Kim
90d66776a4 remove one of duplicated lines. (#813) 2019-03-06 15:13:01 -08:00
pzhokhov
b875fb7b5e release Internal changes (#800)
* joshim5 changes (width and height to WarpFrame wrapper)

* match network output with action distribution via a linear layer only if necessary (#167)

* support color vs. grayscale option in WarpFrame wrapper (#166)

* support color vs. grayscale option in WarpFrame wrapper

* Support color in other wrappers

* Updated per Peters suggestions

* fixing test failures

* ppo2 with microbatches (#168)

* pass microbatch_size to the model during construction

* microbatch fixes and test (#169)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* Peterz joshim5 subclass ppo2 model (#170)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* subclassing the model to make microbatched version of model WIP

* made microbatched model a subclass of ppo2 Model

* flake8 complaint

* mpi-less ppo2 (resolving merge conflict)

* flake8 and mpi4py imports in ppo2/model.py

* more un-mpying

* merge master

* updates to the benchmark viewer code + autopep8 (#184)

* viz docs and syntactic sugar wip

* update viewer yaml to use persistent volume claims

* move plot_util to baselines.common, update links

* use 1Tb hard drive for results viewer

* small updates to benchmark vizualizer code

* autopep8

* autopep8

* any folder can be a benchmark

* massage games image a little bit

* fixed --preload option in app.py

* remove preload from run_viewer.sh

* remove pdb breakpoints

* update bench-viewer.yaml

* fixed bug (#185)

* fixed bug 

it's wrong to do the else statement, because no other nodes would start.

* changed the fix slightly

* Refactor her phase 1 (#194)

* add monitor to the rollout envs in her RUN BENCHMARKS her

* Slice -> Slide in her benchmarks RUN BENCHMARKS her

* run her benchmark for 200 epochs

* dummy commit to RUN BENCHMARKS her

* her benchmark for 500 epochs RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* disable saving of policies in her benchmark RUN BENCHMARKS her

* run fetch benchmarks with ppo2 and ddpg RUN BENCHMARKS Fetch

* run fetch benchmarks with ppo2 and ddpg RUN BENCHMARKS Fetch

* launcher refactor wip

* wip

* her works on FetchReach

* her runner refactor RUN BENCHMARKS Fetch1M

* unit test for her

* fixing warnings in mpi_average in her, skip test_fetchreach if mujoco is not present

* pickle-based serialization in her

* remove extra import from subproc_vec_env.py

* investigating differences in rollout.py

* try with old rollout code RUN BENCHMARKS her

* temporarily use DummyVecEnv in cmd_util.py RUN BENCHMARKS her

* dummy commit to RUN BENCHMARKS her

* set info_values in rollout worker in her RUN BENCHMARKS her

* bug in rollout_new.py RUN BENCHMARKS her

* fixed bug in rollout_new.py RUN BENCHMARKS her

* do not use last step because vecenv calls reset and returns obs after reset RUN BENCHMARKS her

* updated buffer sizes RUN BENCHMARKS her

* fixed loading/saving via joblib

* dust off learning from demonstrations in HER, docs, refactor

* add deprecation notice on her play and plot files

* address comments by Matthias

* 1.5 months of codegen changes (#196)

* play with resnet

* feed_dict version

* coinrun prob and more stats

* fixes to get_choices_specs & hp search

* minor prob fixes

* minor fixes

* minor

* alternative version of rl_algo stuff

* pylint fixes

* fix bugs, move node_filters to soup

* changed how get_algo works

* change how get_algo works, probably broke all tests

* continue previous refactor

* get eval_agent running again

* fixing tests

* fix tests

* fix more tests

* clean up cma stuff

* fix experiment

* minor changes to eval_agent to make ppo_metal use gpu

* make dict space work

* modify mac makefile to use conda

* recurrent layers

* play with bn and resnets

* minor hp changes

* minor

* got rid of use_fb argument and jtft (joint-train-fine-tune) functionality
built test phase directly into AlgoProb

* make new rl algos generateable

* pylint; start fixing tests

* fixing tests

* more test fixes

* pylint

* fix search

* work on search

* hack around infinite loop caused by scan

* algo search fixes

* misc changes for search expt

* enable annealing, overriding options of Op

* pylint fixes

* identity op

* achieve use_last_output through masking so it automatically works in other distributions

* fix tests

* minor

* discrete

* use_last_output to be just a preference, not a hard constraint

* pred delay, pruning

* require nontrivial inputs

* aliases for get_sm

* add probname to probs

* fixes

* small fixes

* fix tests

* fix tests

* fix tests

* minor

* test scripts

* dualgru network improvements

* minor

* work on mysterious bugs

* rcall gpu-usage command for kube

* use cache dir that’s not in code folder, so that it doesn’t get removed by rcall code rsync

* add power mode to gpu usage

* make sure train/test actually different

* remove VR for now

* minor fixes

* simplify soln_db

* minor

* big refactor of mpi eda

* improve mpieda for multitask

* - get rid of timelimit hack
- add __del__ to cleanup SubprocVecEnv

* get multitask working better

* fixes

* working on atari, various

* annotate ops with whether they’re parametrized

* minor

* gym version

* rand atari prob

* minor

* SolnDb bugfix and name change

* pyspy script

* switch conv layers

* fix roboschool/bullet3

* nenvs assertion

* fix rand atari

* get rid of blanket exception catching
fix soln_db bug

* fix rand_atari

* dynamic routing as cmdline arg

* slight modifications to test_mpi_map and pyspy-all

* max_tries argument for run_until_successs

* dedup option in train_mle

* simplify soln_db

* increase atari horizon for 1 experiment

* start implementing reward increment

* ent multiplier

* create cc dsl
other misc fixes

* cc ops

* q_func -> qs in rl_algos_cc.py

* fix PredictDistr

* rl_ops_cc fixes, MakeAction op

* augment algo agent to support cc stuff

* work on ddpg experiments

* fix blocking
temporarily change logger

* allow layer scaling

* pylint fixes

* spawn_method

* isolate ddpg hacks

* improve pruning

* use spawn for subproc

* remove use of python -c in rcall

* fix pylint warning

* fix static

* maybe fix local backend

* switch to DummyVecEnv

* making some fixes via pylint

* pylint fixes

* fixing tests

* fix tests

* fix tests

* write scaffolding for SSL in Codegen

* logger fix

* fix error

* add EMA op to sl_ops

* save many changes

* save

* add upsampler

* add sl ops, enhance state machine

* get ssl search working — some gross hacking

* fix session/graph issue

* fix importing

* work on mle

* - scale embeddings in gru model
- better exception handling in sl_prob
- use emas for test/val
- use non-contrib batch_norm layer

* improve logging

* option to average before dumping in logger

* default arguments, etc

* new ddpg and identity test

* concat fix

* minor

* move realistic ssl stuff to third-party (underscore to dash)

* fixes

* remove realistic_ssl_evaluation

* pylint fixes

* use gym master

* try again

* pass around args without gin

* fix tests

* separate line to install gym

* rename failing tests that should be ignored

* add data aug

* ssl improvements

* use fixed time limit

* try to fix baselines tests

* add score_floor, max_walltime, fiddle with lr decay

* realistic_ssl

* autopep8

* various ssl
- enable blocking grad for simplification
- kl
- multiple final prediction

* fix pruning

* misc ssl stuff

* bring back linear schedule, don’t use allgather for collecting stats
(i’ve been getting nondeterministic errors from the old code)

* save/load weights in SSL, big stepsize

* cleanup SslProb

* fix

* get rid of kl coef

* fix simplification, lower lr

* search over hps

* minor fixes

* minor

* static analysis

* move files and rename things for improved consistency.
still broken, and just saving before making nontrivial changes

* various

* make tests pass

* move coinrun_train to codegen since it depends on codegen

* fixes

* pylint fixes

* improve tests
fix some things

* improve tests

* lint

* fix up db_info.py, tests

* mostly restore master version of envs directory, except for makefile changes

* fix tests

* improve printing

* minor fixes

* fix fixmes

* pruning test

* fixes

* lint

* write new test that makes tf graphs of random algos; fix some bugs it caught

* add --delete flag to rcall upload-code command

* lint

* get cifar10 lazily for testing purposes

* disable codegen ci tests for now

* clean up rl_ops

* rename spec classes

* td3 with identity test

* identity tests without gin files

* remove gin.configurable from AlgoAgent

* comments about reduction in rl_ops_cc

* address @pzhokhov comments

* fix tests

* more linting

* better tests

* clean up filtering a bit

* fix concat

* delayed logger configuration (#208)

* delayed logger configuration

* fix typo

* setters and getters for Logger.DEFAULT as well

* do away with fancy property stuff - unable to get it to work with class level methods

* grammar and spaces

* spaces

* use get_current function instead of reading Logger.CURRENT

* autopep8

* disable mpi in subprocesses (#213)

* lazy_mpi load

* cleanups

* more lazy mpi

* don't pretend that class is a module, just use it as a class

* mass-replace mpi4py imports

* flake8

* fix previous lazy_mpi imports

* silly recursion

* try os.environ hack

* better prefix test, work with mpich

* restored MPI imports

* removed commented import in test_with_mpi

* restored codegen from master

* remove lazy mpi

* restored changes from rl-algs

* remove extra files

* address Chris' comments

* use spawn for shmem vec env as well (#2) (#219)

* lazy_mpi load

* cleanups

* more lazy mpi

* don't pretend that class is a module, just use it as a class

* mass-replace mpi4py imports

* flake8

* fix previous lazy_mpi imports

* silly recursion

* try os.environ hack

* better prefix test, work with mpich

* restored MPI imports

* removed commented import in test_with_mpi

* restored codegen from master

* remove lazy mpi

* restored changes from rl-algs

* remove extra files

* port mpi fix to shmem vec env

* increase the mpi test default timeout

* change humanoid hyperparameters, get rid of clip_Frac annealing, as it's apparently dangerous

* remove clip_frac schedule from ppo2

* more timesteps in humanoid run

* whitespace + RUN BENCHMARKS

* baselines: export vecenvs from folder (#221)

* baselines: export vecenvs from folder

* put missing function back in

* add missing imports

* more imports

* longer mpi timeout?

* make default logger configuration the same as call to logger.configure() (#222)

* Vecenv refactor (#223)

* update karl util

* restore pvi flag

* change rcall auto cpu behavior, move gin.configurable, add os.makedirs

* vecenv refactor

* aux buf index fix

* add num aux obs

* reset level with enter

* restore high difficulty flag

* bugfix

* restore train_coinrun.py

* tweaks

* renaming

* renaming

* better arguments handling

* more options

* options cleanup

* game data refactor

* more options

* args for train_procgen

* add close handler to interactive base class

* use debug build if debug=True, fix range on aux_obs

* add ProcGenEnv to __init__.py, add missing imports to procgen.py

* export RemoveDictWrapper and build, update train_procgen.py, move assets download into env creation and replace init_assets_and_build with just build

* fix formatting issues

* only call global init once

* fix path in setup.py

* revert part of makefile

* ignore IDE files and folders

* vec remove dict

* export VecRemoveDictObs

* remove RemoveDictWrapper

* remove IDE files

* move shared .h and .cpp files to common folder, update build to use those, dedupe env.cpp

* fix missing header

* try unified build function

* remove old scripts dir

* add comment on build

* upload libenv with render fixes

* tell qthreads to die when we unload the library

* pyglet.app.run is garbage

* static fixes

* whoops

* actually vsync is on

* cleanup

* cleanup

* extern C for libenv interface

* parse util rcall arg

* high difficulty fix

* game type enums

* ProcGenEnv subclasses

* game type cleanup

* unrecognized key

* unrecognized game type

* parse util reorg

* args management

* typo fix

* GinParser

* arg tweaks

* tweak

* restore start_level/num_levels setting

* fix create_procgen_env interface

* build fix

* procgen args in init signature

* fix

* build fix

* fix logger usage in ppo_metal/run_retro

* removed unnecessary OrderedDict requirement in subproc_vec_env

* flake8 fix

* allow for non-mpi tests

* mpi test fixes

* flake8; removed special logic for discrete spaces in dummy_vec_env

* remove forked argument in front of tests - does not play nicely with subprocvecenv in spawned processes; analog of forked in ddpg/test_smoke

* Everyrl initial commit & a few minor baselines changes (#226)

* everyrl initial commit

* add keep_buf argument to VecMonitor

* logger changes: set_comm and fix to mpi_mean functionality

* if filename not provided, don't create ResultsWriter

* change variable syncing function to simplify its usage. now you should initialize from all mpi processes

* everyrl coinrun changes

* tf_distr changes, bugfix

* get_one

* bring back get_next to temporarily restore code

* lint fixes

* fix test

* rename profile function

* rename gaussian

* fix coinrun training script

* change random seeding to work with new gym version (#231)

* change random seeding to work with new gym version

* move seeding to seed() method

* fix mnistenv

* actually try some of the tests before pushing

* more deterministic fixed seq

* misc changes to vecenvs and run.py for benchmarks (#236)

* misc changes to vecenvs and run.py for benchmarks

* dont seed global gen

* update more references to assert_venvs_equal

* Rl19 (#232)

* everyrl initial commit

* add keep_buf argument to VecMonitor

* logger changes: set_comm and fix to mpi_mean functionality

* if filename not provided, don't create ResultsWriter

* change variable syncing function to simplify its usage. now you should initialize from all mpi processes

* everyrl coinrun changes

* tf_distr changes, bugfix

* get_one

* bring back get_next to temporarily restore code

* lint fixes

* fix test

* rename profile function

* rename gaussian

* fix coinrun training script

* rl19

* remove everyrl dir which appeared in the merge for some reason

* readme

* fiddle with ddpg

* make ddpg work

* steps_total argument

* gpu count

* clean up hyperparams and shape math

* logging + saving

* configuration stuff

* fixes, smoke tests

* fix stats

* make load_results return dicts -- easier to create the same kind of objects with some other mechanism for passing to downstream functions

* benchmarks

* fix tests

* add dqn to tests, fix it

* minor

* turned annotated transformer (pytorch) into a script

* more refactoring

* jax stuff

* cluster

* minor

* copy & paste alec code

* sign error

* add huber, rename some parameters, snapshotting off by default

* remove jax stuff

* minor

* move maze env

* minor

* remove trailing spaces

* remove trailing space

* lint

* fix test breakage due to gym update

* rename function

* move maze back to codegen

* get recurrent ppo working

* enable both lstm and gru

* script to print table of benchmark results

* various

* fix dqn

* add fixup initializer, remove lastrew

* organize logging stats

* fix silly bug

* refactor models

* fix mpi usage

* check sync

* minor

* change vf coef, hps

* clean up slicing in ppo

* minor fixes

* caching transformer

* docstrings

* xf fixes

* get rid of 'B' and 'BT' arguments

* minor

* transformer example

* remove output_kind from base class until we have a better idea how to use it

* add comments, revert maze stuff

* flake8

* codegen lint

* fix codegen tests

* responded to peter's comments

* lint fixes

* minor changes to baselines (#243)

* minor changes to baselines

* fix spaces reference

* remove flake8 disable comments and fix import

* okay maybe don't add spec to vec_env

* Merge branch 'master' of github.com:openai/games

* flake8 complaints in baselines/her
2019-02-27 15:35:31 -08:00
Peter Zhokhov
675b100190 raised the tolerance on the test_microbatches test 2019-02-27 14:22:24 -08:00
Peter Zhokhov
adc4388f6b fixes to catch changes in gym 2019-02-27 12:49:40 -08:00
Rishav1
5b41c926c7 fix #795: Making tf_util._Function consistent (#796)
* fix #795: Making tf_util._Function consistent

The fix involves using the placeholder name to crossreference passed
kwargs values, just like the tf_util.function expects. Also, the givens
are updated before the parameters to make it behave like it's supposed
to.

* test: Adding test for issue #795
2019-01-31 10:23:38 -08:00
Peter Zhokhov
ab02fae71d fixes related to new gym and new flake8 2019-01-30 16:21:57 -08:00
ethanwaldie
b55eda1dde Added required arguments to the policy builder in the ACER model to (#784)
* Added required arguments to the policy builder in the ACER model to
fix the issue #783

* Changed the step model from nbatch to nenvs

* Updated nsteps to be 1.
2019-01-22 19:22:28 -08:00
pzhokhov
57e05eb420 remove noop code (#781) 2019-01-09 22:30:52 -08:00
Nikhil Barhate
01ab1d8ef7 fixed typo (#779) 2019-01-09 11:21:53 -08:00
Alex Ray
73683435ff Merge pull request #777 from openai/aray-extra-imports
add an argument for importing extra modules from run
2019-01-04 15:49:51 -08:00
Alex Ray
4d0746b957 add an argument for importing extra modules from run 2019-01-03 11:33:31 -08:00
Ankesh Anand
5115707ce9 Recognize nightly tf builds (#763)
* Recognize nightly tf builds

* Use LooseVersion instead of StrictVersion to recognize nightly build numbers

Nightly version numbers have the form `1.3.0.dev20181215`, which is not a valid version number for `StrictVersion`, while `LooseVersion` still recognizes it.
2018-12-21 12:47:48 -08:00
pzhokhov
6c44fb28fe refactor HER - phase 1 (#767)
* joshim5 changes (width and height to WarpFrame wrapper)

* match network output with action distribution via a linear layer only if necessary (#167)

* support color vs. grayscale option in WarpFrame wrapper (#166)

* support color vs. grayscale option in WarpFrame wrapper

* Support color in other wrappers

* Updated per Peters suggestions

* fixing test failures

* ppo2 with microbatches (#168)

* pass microbatch_size to the model during construction

* microbatch fixes and test (#169)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* Peterz joshim5 subclass ppo2 model (#170)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* subclassing the model to make microbatched version of model WIP

* made microbatched model a subclass of ppo2 Model

* flake8 complaint

* mpi-less ppo2 (resolving merge conflict)

* flake8 and mpi4py imports in ppo2/model.py

* more un-mpying

* merge master

* updates to the benchmark viewer code + autopep8 (#184)

* viz docs and syntactic sugar wip

* update viewer yaml to use persistent volume claims

* move plot_util to baselines.common, update links

* use 1Tb hard drive for results viewer

* small updates to benchmark vizualizer code

* autopep8

* autopep8

* any folder can be a benchmark

* massage games image a little bit

* fixed --preload option in app.py

* remove preload from run_viewer.sh

* remove pdb breakpoints

* update bench-viewer.yaml

* fixed bug (#185)

* fixed bug 

it's wrong to do the else statement, because no other nodes would start.

* changed the fix slightly

* Refactor her phase 1 (#194)

* add monitor to the rollout envs in her RUN BENCHMARKS her

* Slice -> Slide in her benchmarks RUN BENCHMARKS her

* run her benchmark for 200 epochs

* dummy commit to RUN BENCHMARKS her

* her benchmark for 500 epochs RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* add num_timesteps to her benchmark to be compatible with viewer RUN BENCHMARKS her

* disable saving of policies in her benchmark RUN BENCHMARKS her

* run fetch benchmarks with ppo2 and ddpg RUN BENCHMARKS Fetch

* run fetch benchmarks with ppo2 and ddpg RUN BENCHMARKS Fetch

* launcher refactor wip

* wip

* her works on FetchReach

* her runner refactor RUN BENCHMARKS Fetch1M

* unit test for her

* fixing warnings in mpi_average in her, skip test_fetchreach if mujoco is not present

* pickle-based serialization in her

* remove extra import from subproc_vec_env.py

* investigating differences in rollout.py

* try with old rollout code RUN BENCHMARKS her

* temporarily use DummyVecEnv in cmd_util.py RUN BENCHMARKS her

* dummy commit to RUN BENCHMARKS her

* set info_values in rollout worker in her RUN BENCHMARKS her

* bug in rollout_new.py RUN BENCHMARKS her

* fixed bug in rollout_new.py RUN BENCHMARKS her

* do not use last step because vecenv calls reset and returns obs after reset RUN BENCHMARKS her

* updated buffer sizes RUN BENCHMARKS her

* fixed loading/saving via joblib

* dust off learning from demonstrations in HER, docs, refactor

* add deprecation notice on her play and plot files

* address comments by Matthias
2018-12-19 14:44:08 -08:00
Timothy Lee
146bbf886b Removed code that prevented changes to actor loss when training with demos (#740) 2018-11-29 17:28:08 -08:00
pzhokhov
f3a5abaeeb added smoke tests of ddpg (#734) 2018-11-26 17:57:25 -08:00
pzhokhov
97e039127f Fix ppo2 with MPI bug, other minor fixes (#735)
* joshim5 changes (width and height to WarpFrame wrapper)

* match network output with action distribution via a linear layer only if necessary (#167)

* support color vs. grayscale option in WarpFrame wrapper (#166)

* support color vs. grayscale option in WarpFrame wrapper

* Support color in other wrappers

* Updated per Peters suggestions

* fixing test failures

* ppo2 with microbatches (#168)

* pass microbatch_size to the model during construction

* microbatch fixes and test (#169)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* Peterz joshim5 subclass ppo2 model (#170)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* subclassing the model to make microbatched version of model WIP

* made microbatched model a subclass of ppo2 Model

* flake8 complaint

* mpi-less ppo2 (resolving merge conflict)

* flake8 and mpi4py imports in ppo2/model.py

* more un-mpying

* merge master

* updates to the benchmark viewer code + autopep8 (#184)

* viz docs and syntactic sugar wip

* update viewer yaml to use persistent volume claims

* move plot_util to baselines.common, update links

* use 1Tb hard drive for results viewer

* small updates to benchmark vizualizer code

* autopep8

* autopep8

* any folder can be a benchmark

* massage games image a little bit

* fixed --preload option in app.py

* remove preload from run_viewer.sh

* remove pdb breakpoints

* update bench-viewer.yaml

* fixed bug (#185)

* fixed bug 

it's wrong to do the else statement, because no other nodes would start.

* changed the fix slightly
2018-11-26 17:56:41 -08:00
pzhokhov
25ecb64821 fixed issue with wrong output layer variable names in ddpg (#733) 2018-11-26 16:30:37 -08:00
Prabhat Nagarajan
7dc6bc7c70 fixes typo (#732)
* fixes typo

* adds apostrophe
2018-11-26 16:19:09 -08:00
Christopher Hesse
7139a66d33 Merge pull request #728 from openai/christopherhesse-patch-1
Update README.md
2018-11-21 15:00:51 -08:00
Christopher Hesse
8607dca99e Update README.md 2018-11-21 14:57:10 -08:00
pzhokhov
9f9835fe38 Update __init__.py 2018-11-21 12:51:15 -08:00
sedand
d3fed181b5 Fixed comment on example usage in jupyter-notebook (#396)
Cause of error: Import name must be results_plotter, not log_viewer.
2018-11-14 14:50:59 -08:00
Roman Ring
339d5640b9 add docs for layer_norm param in DQN baseline (#107) 2018-11-14 12:22:42 -08:00
Buck Shlegeris
a75bc37a40 fix typo in a comment (#161) 2018-11-14 12:20:55 -08:00
Peter Zhokhov
87b3a04a38 autopep8 2018-11-14 12:16:53 -08:00
Brent Komer
c5b1a1b643 typo fix (#230) 2018-11-13 13:08:32 -08:00
JohannesAck
c59a10947d Parameter documentation for tf_util.function (#349)
* Added parameter documentation

This parameter was previously undocumented and is non-intuitive to readers unfamiliar with tf.

* Added parameter documentation
2018-11-13 13:03:48 -08:00
James Alan Preiss
5cd66010dc case-insensitive sort for human-readable logger (#289) 2018-11-13 11:09:11 -08:00
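A case-insensitive sort, as named in the commit above, orders keys by their lowercased form so that capitalized names are not grouped ahead of lowercase ones. A minimal sketch, assuming the logged values are held in a plain dict (the function name and sample keys are hypothetical):

```python
def sorted_case_insensitive(kvs):
    """Return (key, value) pairs ordered by key, ignoring case (hypothetical helper)."""
    return sorted(kvs.items(), key=lambda kv: kv[0].lower())


stats = {"loss": 0.5, "EpRewMean": 10.0, "approxkl": 0.01}
ordered_keys = [key for key, _ in sorted_case_insensitive(stats)]
# ordered_keys == ["approxkl", "EpRewMean", "loss"]
```

A plain `sorted(stats)` would instead place `EpRewMean` first, because uppercase letters sort before lowercase ones.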
Xiaoquan Kong
0a13da8dfe Change variable name from inpt to input_ (#297) 2018-11-13 11:08:21 -08:00
Vladislav Zavadskyy
18b6390be6 Typo fix (#287) 2018-11-13 11:03:55 -08:00
pzhokhov
52255beda5 microbatches in ppo2, custom frame size in WarpFrame, matching fc layer only when needed (#707)
* joshim5 changes (width and height to WarpFrame wrapper)

* match network output with action distribution via a linear layer only if necessary (#167)

* support color vs. grayscale option in WarpFrame wrapper (#166)

* support color vs. grayscale option in WarpFrame wrapper

* Support color in other wrappers

* Updated per Peters suggestions

* fixing test failures

* ppo2 with microbatches (#168)

* pass microbatch_size to the model during construction

* microbatch fixes and test (#169)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* Peterz joshim5 subclass ppo2 model (#170)

* microbatch fixes and test

* tiny cleanup

* added assertions to the test

* vpg-related fix

* subclassing the model to make microbatched version of model WIP

* made microbatched model a subclass of ppo2 Model

* flake8 complaint

* mpi-less ppo2 (resolving merge conflict)

* flake8 and mpi4py imports in ppo2/model.py

* more un-mpying
2018-11-09 11:18:05 -08:00
AurelianTactics
d80acbb4d1 Removing print spam from Wrapper (#705)
* DDPG has unused 'seed' argument

DeepQ, PPO2, ACER, trpo_mpi, A2C, and ACKTR have the code for:

```
from baselines.common import set_global_seeds
...
def learn(...):
...
   set_global_seeds(seed)
```

DDPG has the argument 'seed=None' but doesn't have the two lines of code needed to set the global seeds.

* DDPG: duplicate variable assignment

variable nb_actions assigned same value twice in space of 10 lines
nb_actions = env.action_space.shape[-1]

* DDPG: noise_type 'normal_x' and 'ou_x' cause assert

noise_type default 'adaptive-param_0.2' works but the arguments that change from parameter noise to actor noise (like 'normal_0.2' and 'ou_0.2') cause an assert message and DDPG not to run. Issue is the following block:
'''
        if self.action_noise is not None and apply_noise:
            noise = self.action_noise()
            assert noise.shape == action.shape
            action += noise
'''

noise is not nested: [number_of_actions]
actions is nested: [[number_of_actions]]
Can either nest noise or unnest actions

* Revert "DDPG: noise_type 'normal_x' and 'ou_x' cause assert"

* DDPG: noise_type 'normal_x' and 'ou_x' cause AssertionError

noise_type default 'adaptive-param_0.2' works but the arguments that change from parameter noise to actor noise (like 'normal_0.2' and 'ou_0.2') cause an assert message and DDPG not to run. Issue is the following block:
'''
        if self.action_noise is not None and apply_noise:
            noise = self.action_noise()
            assert noise.shape == action.shape
            action += noise
'''

noise is not nested: [number_of_actions]
action is nested: [[number_of_actions]]
Hence the shapes do not pass the assert line even though the action += noise line is correct

* Removing Print Spam from Wrapper

Prints a line every time a video is saved or not saved. Seems unnecessary.
2018-11-08 10:13:07 -08:00
pzhokhov
556b198454 Internal minifixes (#694)
* joshim5 changes (width and height to WarpFrame wrapper)

* match network output with action distribution via a linear layer only if necessary (#167)

* support color vs. grayscale option in WarpFrame wrapper (#166)

* support color vs. grayscale option in WarpFrame wrapper

* Support color in other wrappers

* Updated per Peter's suggestions

* fixing test failures
2018-11-08 10:11:45 -08:00
pzhokhov
cc88804042 Update viz.ipynb 2018-11-07 17:20:52 -08:00
pzhokhov
c14d307834 move viz docs to a notebook entirely (#704)
* viz docs

* writing visualization docs

* documenting plot_util

* docstrings in plot_util

* autopep8 and flake8

* spelling (using default vim spellchecker and ignoring things like dataframe, docstring, etc.)

* rephrased viz.md a little bit

* more examples of viz code usage in the docs

* replaced visualization doc with notebook
2018-11-07 17:19:42 -08:00
pzhokhov
0b71d4c6c4 remove unused args of DDPG class (#702) 2018-11-07 17:19:25 -08:00
pzhokhov
7bb405c7a7 Update viz.md 2018-11-07 14:25:35 -08:00
pzhokhov
8b95576a92 more viz + build fixes (#703)
* viz docs

* writing visualization docs

* documenting plot_util

* docstrings in plot_util

* autopep8 and flake8

* spelling (using default vim spellchecker and ignoring things like dataframe, docstring, etc.)

* rephrased viz.md a little bit

* more examples of viz code usage in the docs
2018-11-06 17:02:20 -08:00
Peter Zhokhov
9d4fb76ef0 making num_envs and video length smaller in test_video_recorder to prevent hanging on travis 2018-11-06 09:58:43 -08:00
Peter Zhokhov
664ec6faf0 catch bugfixes in gym 2018-11-05 19:19:39 -08:00
Peter Zhokhov
3917321fbe revert over-spellchecking 2018-11-05 17:00:40 -08:00
coord.e
6e607efa90 Add video recorder (#666)
* Fix: Return the result of rendering from dummyvecenv

* Add: Add a video recorder wrapper for vecenv

* Change: Use VecVideoRecorder with --video_monitor flag

* Change: Overwrite the metadata only when it isn't defined

* Add: Define __del__ to make the file correctly closed in exit

* Fix: Bump epidode_id in reset()

* Fix: Use hasattr to check the existence of .metadata

* Fix: Make directory when it doesn't exist

* Change: Keep recording for `video_length` steps, then close

Because reset() is not what it is in normal gym.Env

* Add: Enable to specify video_length from command line argument

* Delete: Delete default value, None, of video_callable

* Change: Use self.recorded_frames and self.recording to manage intervals

* Add: Log the status of video recording

* Fix: Fix saving path

* Change: Place metadata in the base VecEnv

* Delete: Delete unused imports

* Fix: epidode_id => step_id

* Fix: Refine the flag name

* Change: Unify the flag name following the previous change

* [WIP] Add: Add a test of VecVideoRecorder

* Fix: Use PongNoFrameskip-v0 because SimpleEnv doesn't have render()

* Change: Use TemporaryDirectory

* Fix: minimal successful test

* Add: Test against parallel environments

* Add: Test against different type of VecEnvs

* Change: Test against different length and interval of video capture

* Delete: Reduce the number of tests

* Change: Test if the output video is not empty

* Add: Add some comments

* Fix: Fix the flag name

* Add: Add docstrings

* Fix: Install ffmpeg in testing container for VecVideoRecorder's test

* Fix: Delete unused things

* Fix: Replace `video_callable` with `record_video_trigger`

* Fix: Improve the explanation of `record_video_trigger` argument

* Fix: Close owning vecenv in VecVideoRecorder.close to resolve memory
leak
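
The recording schedule described in this commit (a `record_video_trigger` callable decides when to start, and recording continues for `video_length` steps before closing) can be sketched roughly as below. This is a standalone illustration, not the actual `VecVideoRecorder` code: the class `RecorderSketch`, its `videos` list and the example trigger are hypothetical, and no frames are really captured.

```python
class RecorderSketch:
    """Hypothetical stand-in that only tracks when recording starts and stops."""

    def __init__(self, record_video_trigger, video_length):
        self.record_video_trigger = record_video_trigger  # step_id -> bool
        self.video_length = video_length
        self.step_id = 0
        self.recording = False
        self.recorded_frames = 0
        self.videos = []  # (first_step, last_step) of each finished video

    def step(self):
        # Start a new video only when idle and the trigger fires.
        if not self.recording and self.record_video_trigger(self.step_id):
            self.recording = True
            self.recorded_frames = 0
            self._first_step = self.step_id
        if self.recording:
            self.recorded_frames += 1  # one frame per step while recording
            if self.recorded_frames >= self.video_length:
                self.videos.append((self._first_step, self.step_id))
                self.recording = False
        self.step_id += 1


# Assumed example: start a 3-step video every 10 steps.
rec = RecorderSketch(lambda step_id: step_id % 10 == 0, video_length=3)
for _ in range(25):
    rec.step()
```

Counting recorded frames rather than relying on `reset()` matches the note above that `reset()` in a vectorized environment does not mark episode boundaries the way it does in a plain `gym.Env`.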
2018-11-05 14:32:17 -08:00
pzhokhov
c74ce02b9d visualization code docs / bugfixes (#701)
* viz docs

* writing visualization docs

* documenting plot_util

* docstrings in plot_util

* autopep8 and flake8

* spelling (using default vim spellchecker and ignoring things like dataframe, docstring, etc.)

* rephrased viz.md a little bit
2018-11-05 14:31:15 -08:00
pzhokhov
ab59de6922 mpi-less baselines (#689)
* make baselines run without mpi wip

* squash-merged latest master

* further removing MPI references where unnecessary

* more MPI removal

* syntax and flake8

* MpiAdam becomes regular Adam if Mpi not present

* autopep8

* add assertion to test in mpi_adam; fix trpo_mpi failure without MPI on cartpole

* mpiless ddpg
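
A common way to make an MPI dependency optional, in the spirit of "MpiAdam becomes regular Adam if Mpi not present", is a guarded import with a single-process fallback. The sketch below is only an illustration under that assumption; the helper `average_across_workers` is hypothetical and is not part of the changes listed above.

```python
try:
    from mpi4py import MPI  # optional dependency
except ImportError:
    MPI = None


def average_across_workers(value):
    """Hypothetical helper: average a float over MPI workers, or return it unchanged."""
    if MPI is None:
        return value  # single-process fallback: nothing to average
    comm = MPI.COMM_WORLD
    return comm.allreduce(value, op=MPI.SUM) / comm.Get_size()


result = average_across_workers(3.0)
```

With one process (or without `mpi4py` installed) the value passes through unchanged, so calling code does not need a separate non-MPI branch.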
2018-10-31 11:15:41 -07:00
Mathieu Poliquin
a071fa7630 Add retro to ppo2 defaults (#682)
* Adds retro to ppo2 defaults

Created defaults for retro, copied from Atari defaults for now. Tested with SuperMarioBros-Nes

* ppo2 retro defaults to atari
2018-10-30 10:17:46 -07:00
Mathieu Poliquin
637bf55da7 Use deepmind wrapper for retro (#685)
* Use deepmind wrapper for retro

* moved wrap_deepmind_retro after Monitor wrapper
2018-10-30 10:16:15 -07:00
AurelianTactics
165c622572 DDPG: noise_type 'normal_x' and 'ou_x' cause AssertionError (#680)
* DDPG has unused 'seed' argument

DeepQ, PPO2, ACER, trpo_mpi, A2C, and ACKTR have the code for:

```
from baselines.common import set_global_seeds
...
def learn(...):
...
   set_global_seeds(seed)
```

DDPG has the argument 'seed=None' but doesn't have the two lines of code needed to set the global seeds.

* DDPG: duplicate variable assignment

variable nb_actions assigned same value twice in space of 10 lines
nb_actions = env.action_space.shape[-1]

* DDPG: noise_type 'normal_x' and 'ou_x' cause assert

noise_type default 'adaptive-param_0.2' works but the arguments that change from parameter noise to actor noise (like 'normal_0.2' and 'ou_0.2') cause an assert message and DDPG not to run. Issue is the following block:
'''
        if self.action_noise is not None and apply_noise:
            noise = self.action_noise()
            assert noise.shape == action.shape
            action += noise
'''

noise is not nested: [number_of_actions]
actions is nested: [[number_of_actions]]
Can either nest noise or unnest actions

* Revert "DDPG: noise_type 'normal_x' and 'ou_x' cause assert"

* DDPG: noise_type 'normal_x' and 'ou_x' cause AssertionError

noise_type default 'adaptive-param_0.2' works but the arguments that change from parameter noise to actor noise (like 'normal_0.2' and 'ou_0.2') cause an assert message and DDPG not to run. Issue is the following block:
'''
        if self.action_noise is not None and apply_noise:
            noise = self.action_noise()
            assert noise.shape == action.shape
            action += noise
'''

noise is not nested: [number_of_actions]
action is nested: [[number_of_actions]]
Hence the shapes do not pass the assert line even though the action += noise line is correct
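
The shape mismatch described above can be reproduced in isolation. The sketch below uses plain nested lists and a hypothetical `shape` helper instead of the arrays used in DDPG, so it runs without dependencies; the values are made up. It shows the two options mentioned in the message (nest the noise, or unnest the action) and does not claim which one the commit applied.

```python
def shape(x):
    """Return the shape of a rectangular nested list as a tuple."""
    if isinstance(x, list):
        return (len(x),) + (shape(x[0]) if x else ())
    return ()


noise = [0.1, -0.2]      # not nested: [number_of_actions]
action = [[0.5, 0.5]]    # nested: [[number_of_actions]]

# The original assert compares (2,) with (1, 2) and therefore fails.
shapes_match_before = shape(noise) == shape(action)

# Option 1: nest the noise so both are (1, 2).
shapes_match_nested = shape([noise]) == shape(action)

# Option 2: unnest the action so both are (2,).
shapes_match_unnested = shape(noise) == shape(action[0])
```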
2018-10-30 10:13:39 -07:00
Peter Zhokhov
93c7cc202c Merge branch 'master' of github.com:openai/baselines 2018-10-29 15:25:38 -07:00
Peter Zhokhov
de36116e3b update tensorflow version check regex to parse version like 1.2.3rc4 (previously only 1.2.3-rc4) 2018-10-29 15:25:31 -07:00
Mathieu Poliquin
e2b41828af Set 'cnn' as default network for retro (#683) 2018-10-29 13:30:41 -07:00
pzhokhov
8e56ddeac2 Multidiscrete action space compatibility for policy gradient-based methods (#677)
* multidiscrete space compatibility

* flake8 and syntax
2018-10-24 11:01:59 -07:00
Juliano Laganá
c3bd8cea66 Adds description of param_noise parameter in deepq.learn method (#675) 2018-10-24 10:00:31 -07:00
AurelianTactics
84ea7aa1fd DDPG has unused 'seed' argument (#676)
DeepQ, PPO2, ACER, trpo_mpi, A2C, and ACKTR have the code for:

```
from baselines.common import set_global_seeds
...
def learn(...):
...
   set_global_seeds(seed)
```

DDPG has the argument 'seed=None' but doesn't have the two lines of code needed to set the global seeds.
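
Why an unused `seed` argument matters can be shown with a small standalone sketch. The names `set_seeds_sketch` and `learn_sketch` are hypothetical stand-ins, and only Python's own `random` module is seeded here; the library helper `set_global_seeds` is not reproduced.

```python
import random


def set_seeds_sketch(seed=None):
    """Hypothetical stand-in for a global seeding helper; seeds only Python's RNG."""
    if seed is not None:
        random.seed(seed)


def learn_sketch(seed=None):
    set_seeds_sketch(seed)  # without this call the `seed` argument would be ignored
    return [random.random() for _ in range(3)]


run_a = learn_sketch(seed=0)
run_b = learn_sketch(seed=0)
```

With the seeding call in place, two runs with the same seed draw the same numbers; without it, passing `seed` would have no effect on reproducibility.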
2018-10-24 09:59:46 -07:00
Peter Zhokhov
88300ed54c fix raise NotImplemented() complaints of latest flake8 2018-10-24 09:57:57 -07:00
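
For context on the flake8 complaint: `NotImplemented` is a constant used by binary-operator methods, not an exception class, so calling and raising it fails with a different error than intended. The sketch below is a generic illustration with hypothetical class and function names, not code from this commit.

```python
class Base:
    def old_style(self):
        raise NotImplemented()  # bug: NotImplemented is a constant, not an exception class

    def new_style(self):
        raise NotImplementedError()  # the intended exception


def exception_name(method):
    """Call `method` and return the name of the exception it raises."""
    try:
        method()
    except Exception as exc:
        return type(exc).__name__


old_error = exception_name(Base().old_style)
new_error = exception_name(Base().new_style)
```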
pzhokhov
583ba082a2 Update cmd_util.py 2018-10-23 11:22:27 -07:00
pzhokhov
014a5597b1 refactor ACER (#664)
* make acer use vecframestack

* acer passes mnist test with 20k steps

* acer with non-image observations and tests

* flake8

* test acer serialization with non-recurrent policies
2018-10-23 10:01:25 -07:00
Isaac Poulton
4ed1350326 Fixed TypeError on creating atari vec envs (#671) 2018-10-23 10:00:09 -07:00
Rishabh Jangir
8513d73355 HER : new functionality, enables demo based training (#474)
* Add, initialize, normalize and sample from a demo buffer

* Modify losses and add cloning loss

* Add demo file parameter to train.py

* Introduce new params in config.py for demo based training

* Change logger.warning to logger.warn in rollout.py;bug

* Add data generation file for Fetch environments

* Update README file
2018-10-22 19:04:40 -07:00
Xingdong Zuo
c28acb2203 [Clean-up]: delete running_stat and filters as they are replaced by running_mean_std and not used anymore (#614)
* Delete filters.py

* Delete running_stat.py
2018-10-22 19:01:26 -07:00
pzhokhov
c5d9c4a1b2 wrap retro envs correctly for other (non-deepq) algorithms (#669)
* wrap retro envs correctly for other (non-deepq) algorithms

* flake and csh comments

* flake and csh comments
2018-10-22 18:36:39 -07:00
pzhokhov
c0fa11a3a7 minor fixes from internal (#665)
* sync internal changes. Make ddpg work with vecenvs

* B -> nenvs for consistency with other algos, small cleanups

* eval_done[d]==True -> eval_done[d]

* flake8 and numpy.random.random_integers deprecation warning

* Merge branch 'master' of github.com:openai/games into peterz_track_baselines_branch
2018-10-22 09:15:04 -07:00
Peter Zhokhov
bd390c2ade updated docstring for deepq 2018-10-19 17:50:54 -07:00
pzhokhov
d0cc325e14 store session at policy creation time (#655)
* sync internal changes. Make ddpg work with vecenvs

* B -> nenvs for consistency with other algos, small cleanups

* eval_done[d]==True -> eval_done[d]

* flake8 and numpy.random.random_integers deprecation warning

* store session at policy creation time

* coexistence tests

* fix a typo

* autopep8

* ... and flake8

* updated todo links in test_serialization
2018-10-19 08:54:21 -07:00
pzhokhov
fc7f9cec49 disable gym subpackages in setup.py (#661)
* disable gym subpackages in setup.py

* include gym[atari] in test requirements

* gym[atari] -> atari-py in test requirements
2018-10-18 16:07:14 -07:00
Matthew Rahtz
3677dc1b23 Set allow_growth=True for MuJoCo session (#643) 2018-10-18 13:54:39 -07:00
Matthew Rahtz
ef96f3835b Drop S and M args so that --play works (#636) 2018-10-16 16:28:23 -07:00
pzhokhov
a03dacd68d sync internal changes. Make ddpg work with vecenvs (#654)
* sync internal changes. Make ddpg work with vecenvs

* B -> nenvs for consistency with other algos, small cleanups

* eval_done[d]==True -> eval_done[d]

* flake8 and numpy.random.random_integers deprecation warning
2018-10-16 16:26:46 -07:00
Tianhong Dai
e57f81becc revise the readme of ddpg (#653) 2018-10-16 16:22:06 -07:00
Peter Zhokhov
28aca637d0 update benchmark results 2018-10-09 09:48:31 -07:00
Erik Doffagne
7bfbcf177e Fixed typos in README (#635) 2018-10-04 10:31:22 -07:00
pzhokhov
394339deb5 Update README.md 2018-10-03 20:53:58 -07:00
pzhokhov
10c205c159 Debug codegen ppo (#123)
* disabled tests, running benchmarks only

* dummy commit to RUN BENCHMARKS

* benchmark ppo_metal; disable all but Bullet benchmarks

* ppo2, codegen ppo and ppo_metal on Bullet RUN BENCHMARKS

* run benchmarks on Roboschool instead RUN BENCHMARKS

* run ppo_metal on Roboschool as well RUN BENCHMARKS

* install roboschool in cron rcall user_config

* dummy commit to RUN BENCHMARKS

* import roboschool in codegen/contcontrol_prob.py RUN BENCHMARKS

* re-enable tests, flake8

* get entropy from a distribution in Pred RUN BENCHMARKS

* gin for hyperparameter injection; try codegen ppo close to baselines ppo RUN BENCHMARKS

* provide default value for cg2/bmv_net_ops.py

* dummy commit to RUN BENCHMARKS

* make tests and benchmarks parallel; use relative path to gin file for rcall compatibility RUN BENCHMARKS

* syntax error in run-benchmarks-new.py RUN BENCHMARKS

* syntax error in run-benchmarks-new.py RUN BENCHMARKS

* path relative to codegen/training for gin files RUN BENCHMARKS

* another reconciliation attempt between codegen ppo and baselines ppo RUN BENCHMARKS

* value_network=copy for ppo2 on roboschool RUN BENCHMARKS

* make None seed work with torch seeding RUN BENCHMARKS

* try sequential batches with ppo2 RUN BENCHMARKS

* try ppo without advantage normalization RUN BENCHMARKS

* use Distribution to compute ema NLL RUN BENCHMARKS

* autopep8

* clip gradient norm in algo_agent RUN BENCHMARKS

* try ppo2 without vfloss clipping RUN BENCHMARKS

* trying with gamma=0.0 - assumption is, both algos should be equally bad RUN BENCHMARKS

* set gamma=0 in ppo2 RUN BENCHMARKS

* try with ppo2 with single minibatch RUN BENCHMARKS

* try with nminibatches=4, value_network=copy RUN BENCHMARKS

* try with nminibatches=1 take two RUN BENCHMARKS

* try initialization for vf=0.01 RUN BENCHMARKS

* fix the problem with min_istart >= max_istart

* i have no idea RUN BENCHMARKS

* fix non-shared variance between old and new RUN BENCHMARKS

* restored baselines.common.policies

* 16 minibatches in ppo_roboschool.gin

* fixing results of merge

* cleanups

* cleanups

* fix run-benchmarks-new RUN BENCHMARKS Roboschool8M

* fix syntax in run-benchmarks-new RUN BENCHMARKS Roboschool8M

* fix test failures

* moved gin requirement to codegen/setup.py

* remove duplicated build_softq in get_algo.py

* linting

* run softq on continuous action spaces RUN BENCHMARKS Roboschool8M
2018-10-03 14:38:32 -07:00
pzhokhov
62fe7c4717 disable async acktr (#129)
* disable async acktr

* linting

* linting

* linting
2018-10-03 14:38:32 -07:00
Xingyou Song
fbdf55ffee Xsong lqr ddpg (#125)
* allows vec_envs to work

* allows vec_envs to work

* fixed branch with correct ddpg

* running experiments jointly now

* changed to subproc

* changed to subproc

* changed to subproc

* small fix md

* removed placeholder

* removed placeholder

* added ppotest

* probably fixed ddpg hyperparam issues

* checkpoint

* edited readme

* added orthogonal

* added orthogonal

* added ddpg-vecenv

* reverted ddpg to old baselines
2018-10-03 14:38:32 -07:00
Christopher Hesse
9ee804c384 minor change to install.py and baselines run.py (#121) 2018-10-03 14:38:32 -07:00
John Schulman
4cf7dc9644 Big refactor (#124)
* massive revision inspired by soup: algo folder works

* porting rl commands, WIP

* various

* git subrepo push --remote=git@github.com:openai/codegen.git --branch=refactor codegen

subrepo:
  subdir:   "codegen"
  merged:   "aa27e069"
upstream:
  origin:   "git@github.com:openai/codegen.git"
  branch:   "refactor"
  commit:   "aa27e069"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* various

* rewrite RL stuff in new framework

* fix almost everything

* woohoo tests pass

* more tests

* reformatting

* fixes

* write tests for embeddings

* re-remove cg2

* pylint

* minor

* move smooth_helpers import; seems to cause nondeterministic failure in parallel pytest
2018-10-03 14:38:32 -07:00
Xingyou Song
e820b86fdc ppo2 now has eval stats (#120)
* ppo2 now has eval stats

* fixed spaces

* fixed kwargs ordering

* whitespace fix
2018-10-03 14:38:32 -07:00
pzhokhov
858afa8d7e Refactor DDPG (#111)
* run ddpg on Mujoco benchmark RUN BENCHMARKS

* autopep8

* fixed all syntax in refactored ddpg

* a little bit more refactoring

* autopep8

* identity test with ddpg WIP

* enable test_identity with ddpg

* refactored ddpg RUN BENCHMARKS

* autopep8

* include ddpg into style check

* fixing tests RUN BENCHMARKS

* set default seed to None RUN BENCHMARKS

* run tests and benchmarks in separate buildkite steps RUN BENCHMARKS

* cleanup pdb usage

* flake8 and cleanups

* re-enabled all benchmarks in run-benchmarks-new.py

* flake8 complaints

* deepq model builder compatible with network functions returning single tensor

* remove ddpg test with test_discrete_identity

* make ppo_metal use make_vec_env instead of make_atari_env

* make ppo_metal use make_vec_env instead of make_atari_env

* fixed syntax in ppo_metal.run_atari
2018-10-03 14:38:32 -07:00
pzhokhov
4121d9c1a8 fix DQN learning bug (#632)
* Update run.py

* Update utils.py

* Update utils.py
2018-10-03 14:37:40 -07:00
Peter Zhokhov
34ae3194b4 add a note about DQN algorithms not performing well 2018-09-27 12:51:43 -07:00
Thomas Simonini
4402b8eba6 Updated A2C and PPO2 comments (#612)
* Updated A2C and PPO2 comments

* Fixed format errors to respect PEP 8 style guide
2018-09-24 09:54:41 -07:00
ahuhn
555a5cbbb2 Adding num_env to readme example (#609)
* Adding num_env to readme example

* Updated readme example fix
2018-09-21 17:22:56 -07:00
Thomas Simonini
8158f35611 Wrote some comments to explain the A2C and PPO2 implementation (#607)
* added comments in A2C and PPO2

* Fixed format errors to respect PEP 8 style guide
2018-09-21 13:12:31 -07:00
cclauss
a7fd8a4477 Run flake8 to find syntax errors and undefined names (#439)
__E901,E999,F821,F822,F823__ are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. The other flake8 issues are merely "style violations" -- useful for readability but they do not affect runtime safety. This PR therefore recommends a flake8 run of those tests on the entire codebase.
* F821: undefined name `name`
* F822: undefined name `name` in `__all__`
* F823: local variable `name` referenced before assignment
* E901: SyntaxError or IndentationError
* E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree
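
The difference between these error classes can be illustrated without flake8 itself. In the sketch below (the helper `first_error` and the sample sources are made up for illustration), a syntax error is rejected as soon as the source is compiled, while an undefined name only surfaces when the offending line actually runs, which is why a static check for it is useful.

```python
def first_error(source):
    """Compile then run `source`; report which stage failed, if any."""
    try:
        code = compile(source, "<example>", "exec")
    except SyntaxError:
        return "compile-time SyntaxError"   # the kind of problem E901/E999 report
    try:
        exec(code, {})
    except NameError:
        return "run-time NameError"         # the kind of problem F821 reports statically
    return None


syntax_case = first_error("def f(:\n    pass\n")
name_case = first_error("print(undefined_name)\n")
clean_case = first_error("x = 1\n")
```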
2018-09-20 16:40:03 -07:00
John Schulman
e791565a60 Codegen more abstract abstract classes 3a (#106)
* Soup code, arch search on CIFAR-10

* Oh I understood how choice_sequence() worked

* Undo some pointless changes

* Some beautification 1

* Some beautification 2

* An attempt to debug test_get_algo_outputs() number 70, unsuccessful.

* Code style warning

* Code style warnings, more

* wip

* wip

* wip

* fix almost everything; soup machine still broken

* revert mpi_eda changes

* minor fixes
2018-09-20 16:19:07 -07:00
XFFXFF
7859f603cd prioritized experience replay bug (#527) 2018-09-20 16:16:44 -07:00
pzhokhov
0f4ae2fb2a refactor acktr (#560)
* refactor acktr

* setup.cfg now tests style/syntax in acktr as well

* flake8 complaints

* added note about continuous action spaces for acktr into the README.md
2018-09-20 16:05:26 -07:00
pzhokhov
0e7048b89f Update README.md 2018-09-19 15:04:54 -07:00
pzhokhov
75983bab64 Update README.md 2018-09-19 15:04:01 -07:00
Alfredo Canziani
85be74500d Add possibility of plotting timesteps vs episodes (#578)
* Add possibility of plotting timesteps vs episodes

* Remove leftover from personal project patch

* Auto plt.tight_layout() on resize window event

Calls `plt.tight_layout()` if a `resize_event` is issued.
This means that the plot will look good even after the user has resized the plotting window.
2018-09-19 09:43:45 -07:00
Geoffrey Irving
115b59d28b Merge pull request #598 from openai/irving-rc
Fix setup.py for tensorflow -rc versions
2018-09-18 15:52:57 -07:00
Xingdong Zuo
d34049cab4 Update running_mean_std.py (#585) 2018-09-18 14:14:38 -07:00
pzhokhov
59662fff78 rename entcoeff to ent_coef in trpo_mpi for compatibility with other algos (#581) 2018-09-18 14:13:05 -07:00
Geoffrey Irving
a42c4eb2bb Fix setup.py for tensorflow -rc versions 2018-09-18 11:35:43 -07:00
R1ckF
68a29d0ab3 --play now works with LSTM (#595) 2018-09-17 14:33:39 -07:00
Xingdong Zuo
0c6f357936 Delete identity_env.py (#588) 2018-09-17 09:53:34 -07:00
pzhokhov
4dc697e670 codegen test fixes (#95)
* fix discovered test failures

* autopep8

* test indices up to 123

* testing from index 124 on

* add scope to logstd

* fix flakiness in test_train_mle

* autopep8
2018-09-14 15:43:50 -07:00
Peter Zhokhov
e790f5214b define mean for CategoricalPd (as softmax of logits) 2018-09-14 15:43:50 -07:00
pzhokhov
fe06c6b4db continuous action spaces for codegen + some benchmarking (#82)
* add some docstrings

* start making big changes

* state machine redesign

* sampling seems to work

* some reorg

* fixed sampling of real vals

* json conversion

* made it possible to register new commands
got nontrivial version of Pred working

* consolidate command definitions

* add more macro blocks

* revived visualization

* rename Userdata -> CmdInterpreter
make AlgoSmInstance subclass of SmInstance that uses appropriate userdata argument

* replace userdata by ci when appropriate

* minor test fixes

* revamped handmade dir, can run ppo_metal

* seed to avoid random test failure

* implement AlgoAgent

* Autogenerated object that performs all ops and macros

* more CmdRecorder changes

* move files around

* move MatchProb and JtftProb

* remove obsolete

* fix tests involving AlgoAgent (pending the next commit on ppo_metal code)

* ppo_metal: reduce duplication in policy_gen, make sess an attribute of PpoAgent and StochasticPolicy instead of using get_default_session everywhere.

* maze_env reformatting, move algo_search script (but still broken)

* move agent.py

* fix test on handcrafted agents

* tuning/fixing ppo_metal baseline

* minor

* Fix ppo_metal baseline

* Don’t set epcount, tcount unless they’re being used

* get rid of old ppo_metal baseline

* fixes for handmade/run.py tuning

* fix codegen ppo

* fix handmade ppo hps

* fix test, go back to safe_div

* switch to more complex filtering

* make sure all handcrafted algos have finite probability

* train to maximize logprob of provided samples
Trex changes to avoid segfault

* AlgoSm also includes global hyperparams

* don’t duplicate global hyperparam defaults

* create generic_ob_ac_space function

* use sorted list of outkeys

* revive tsne

* todo changes

* determinism test

* todo + test fix

* remove a few deprecated files, rename other tests so they don’t run automatically, fix real test failure

* continuous control with codegen

* continuous control with codegen

* implement continuous action space algodistr

* ppo with trex RUN BENCHMARKS

* wrap trex in a monitor

* dummy commit to RUN BENCHMARKS

* adding monitor to trex env RUN BENCHMARKS

* adding monitor to trex RUN BENCHMARKS

* include monitor into trex env RUN BENCHMARKS

* generate nll and predmean using Distribution node

* dummy commit to RUN BENCHMARKS

* include pybullet into baselines optional dependencies

* dummy commit to RUN BENCHMARKS

* install games for cron rcall user RUN BENCHMARKS

* add --yes flag to install.py in rcall config for cron user RUN BENCHMARKS

* both continuous and discrete versions seem to run

* fixes to monitor to work with vecenv-like info and rewards RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* removed shape check from one-hot encoding logic in distributions.CategoricalPd

* reset logger configuration in codegen/handmade/run.py to be in-line with baselines RUN BENCHMARKS

* merged peterz_codegen_benchmarks RUN BENCHMARKS

* skip tests RUN BENCHMARKS

* working on test failures

* save benchmark dicts RUN BENCHMARK

* merged peterz_codegen_benchmark RUN BENCHMARKS

* add get_git_commit_message to the baselines.common.console_util

* dummy commit to RUN BENCHMARKS

* merged fixes from peterz_codegen_benchmark RUN BENCHMARKS

* fixing failure in test_algo_nll WIP

* test_algo_nll passes with both ppo and softq

* re-enabled tests

* run trex on gpus for 100k total (horizon=100k / 16) RUN BENCHMARKS

* merged latest peterz_codegen_benchmarks RUN BENCHMARKS

* fixing codegen test failures (logging-related)

* fixed name collision in run-benchmarks-new.py RUN BENCHMARKS

* fixed name collision in run-benchmarks-new.py RUN BENCHMARKS

* fixed import in node_filters.py

* test_algo_search passes

* some cleanup

* dummy commit to RUN BENCHMARKS

* merge fast fail for subprocvecenv RUN BENCHMARKS

* use SubprocVecEnv in sonic_prob

* added deprecation note to shmem_vec_env

* allow indexing of distributions

* add timeout to pipeline.yaml

* typo in pipeline.yml

* run tests with --forked option

* resolved merge conflict in rl_algs.bench.benchmarks

* re-enable parallel tests

* fix remaining merge conflicts and syntax

* Update trex_prob.py

* fixes to ResultsWriter

* take baselines/run.py from peterz_codegen branch

* actually save stuff to file in VecMonitor RUN BENCHMARKS

* enable parallel tests

* merge stricter flake8

* merge peterz_codegen_benchmark, resolve conflicts

* autopep8

* remove traces of Monitor from trex env, check shapes before encoding in CategoricalPd

* asserts and warnings to make q -> distribution change more explicit

* fixed assert in CategoricalPd

* add header to vec_monitor output file RUN BENCHMARKS

* make VecMonitor write header to the output file

* remove deprecation message from shmem_vec_env RUN BENCHMARKS

* autopep8

* proper shape test in distributions.py

* ResultsWriter can take dict headers

* dummy commit to RUN BENCHMARKS

* replace assert len(qs)==1 with warning RUN BENCHMARKS

* removed pdb from ppo2 RUN BENCHMARKS
2018-09-14 15:43:49 -07:00
Peter Zhokhov
1f99a562e3 autopep8 2018-09-11 13:21:52 -07:00
Peter Zhokhov
4e2a888273 Merge commit 'refs/subrepo/baselines/fetch' into subrepo/baselines 2018-09-11 13:19:39 -07:00
Peter Zhokhov
c5b2918607 git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "2742f819"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "5c5a9f4b"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-11 13:18:43 -07:00
Peter Zhokhov
3bf31a4330 git subrepo commit (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "0846932a"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "c5d6f299"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-11 13:18:43 -07:00
pzhokhov
9070ee7ef3 tighten flake8, autopep8 to fix trailing whitespaces and blank lines with whitespaces (#87) 2018-09-11 13:18:43 -07:00
Peter Zhokhov
e56803491f git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "5c6a1fd9"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "23b23332"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-11 13:18:42 -07:00
pzhokhov
b3bc25d99a add fast failure when calling methods on a closed subprocvecenv (#84) 2018-09-11 13:18:42 -07:00
Peter Zhokhov
5c5a9f4b31 autopep8 on deepq/experiments 2018-09-11 12:47:50 -07:00
Peter Zhokhov
5183fa9f29 autopep8 on deepq/experiments 2018-09-11 12:47:50 -07:00
Peter Zhokhov
3bf35cb468 added peterz to baselines authorlist 2018-09-11 12:44:51 -07:00
Peter Zhokhov
5c62f5c7dd added peterz to baselines authorlist 2018-09-11 12:44:51 -07:00
Peter Zhokhov
29bf587d15 Merge branch 'master' of github.com:openai/baselines 2018-09-11 12:40:29 -07:00
Peter Zhokhov
c5d6f2996c Merge branch 'master' of github.com:openai/baselines 2018-09-11 12:40:29 -07:00
Peter Zhokhov
06bdc2860c docstrings about vecenvs 2018-09-11 12:40:23 -07:00
pzhokhov
adaa8aefa8 baselines issue #564 (#574)
* fixes to enjoy_cartpole, enjoy_mountaincar.py

* fixed {train,enjoy}_pong, removed enjoy_retro

* set number of timesteps to 1e7 in train_pong

* flake8 complaints

* use synchronous version of acktr in test_env_after_learn

* flake8
2018-09-10 11:50:59 -07:00
pzhokhov
23b2333238 baselines issue #564 (#574)
* fixes to enjoy_cartpole, enjoy_mountaincar.py

* fixed {train,enjoy}_pong, removed enjoy_retro

* set number of timesteps to 1e7 in train_pong

* flake8 complaints

* use synchronous version of acktr in test_env_after_learn

* flake8
2018-09-10 11:50:59 -07:00
Peter Zhokhov
8614c4ddbf flake8 2018-09-10 10:41:29 -07:00
Peter Zhokhov
59a7ffb84d fix tests of test_env_after_learn 2018-09-10 10:32:42 -07:00
Daniel Angelov
58b1021b28 Add tensorboard start command for convenience (#569) 2018-09-07 17:04:02 -07:00
Peter Zhokhov
a60e88bff9 git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "8785db28"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "35e95ee8"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-07 16:35:00 -07:00
pzhokhov
75b93b890e implement pdfromlatent in BernoulliPdType (#81)
* implement pdfromlatent in BernoulliPdType

* remove env.close() at the end of algorithms

* test case for environment after learn

* closing env in run.py

* fixes for acktr and trpo_mpi

* add make_session with new graph for every call in test_env_after_learn

* remove extra prints from test_env_after_learn
2018-09-07 16:35:00 -07:00
John Schulman
565b2153d7 Add lots of docstrings (#76)
* Add lots of docstrings
Change hyperparameter transformations for slightly better efficiency and to avoid circular dependency.
Now all parameters are stored in a “human-readable” form.

* improve pretty-print of nodes and trees

* newlines at end-of-file, return graph in render(), assert_valid() fix

* split run_algo_search.py into several simpler scripts

* add joint_train option to get_prob

* minor changes to soln_db and embedding script

* Arguments: -> Args:

* fix replay, part 1

* fix behavior when using unpickled algos

* re-add retrieve_weights

* make training scripts more consistent

* lint

* lint

* lint + remove some rendering functionality from trex env as it’s also elsewhere

* get rid of warnings

* refactor functionality for getting final q-function and losses. revive code for removing useless terms & tests for simplification.

* fix vecenv closing

* finish removing algo folder (most useful functionality has been moved out of it)

* control verbosity of trex

* fix tests

* rename spec => choice_spec, some comments, asserts, debug prints

* fix some tests
2018-09-07 16:34:59 -07:00
Peter Zhokhov
35e95ee85a fix python 3.5 string format compatibility 2018-09-06 12:00:19 -07:00
Isaac Lascasas
ad219e205d VecNormalize: set env. returns to zero on resets. (#556)
* VecNormalize: set env. returns to zero on resets.

* VecNormalize: returns reset in step_wait after ret_rms.update.
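
The ordering in the second bullet (update the return statistics first, then zero the returns of finished episodes) can be sketched as follows. This is a simplified illustration with plain lists; the function `step_returns` and its arguments are hypothetical and the running-statistics object is replaced by a list of the values it would receive.

```python
def step_returns(returns, rewards, dones, gamma=0.99):
    """Update per-env discounted returns, then zero those whose episode ended."""
    returns = [ret * gamma + rew for ret, rew in zip(returns, rewards)]
    stats_input = list(returns)  # values the running statistics would see
    # Reset only after the statistics update, so the final return is still counted.
    returns = [0.0 if done else ret for ret, done in zip(returns, dones)]
    return returns, stats_input


rets, seen = step_returns([1.0, 2.0], [1.0, 1.0], [False, True], gamma=0.5)
```

Resetting before the update would feed a zero into the statistics for every finished episode; resetting afterwards keeps the last return of the episode and still prevents it from leaking into the next one.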
2018-09-06 10:21:50 -07:00
Peter Zhokhov
be9118bcd8 git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "f2a9b8f2"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "cc4215ef"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-06 10:18:13 -07:00
pzhokhov
02a5e7aed5 fixes to readme and baselines/run.py (#80)
* fixes to readme and baselines/run.py

* polish installation section of baselines README

* polish installation section of baselines README
2018-09-06 10:18:13 -07:00
pzhokhov
87ac8bc317 install roboschool in install.py (#55)
* putting instructions from README.md into a script

* install roboschool as a part of setup.py

* install roboschool from install.py

* export pkg_config_path

* remove compilation step from roboschool/setup.py

* removed roboschool install from games install due to extra compilation step

* removed unused import from roboschool/setup.py
2018-09-06 10:18:13 -07:00
Tom
cc4215ef4b refactor common.models via registering reflection (#565) 2018-09-06 10:16:06 -07:00
Clayton Thorrez
1e9051e87e fixed warning (#464) 2018-09-05 15:12:01 -07:00
uronce-cc
43ed76944b Fix mean reward per episode after training Pong. (#562)
* Fix mean reward per episode after training Pong.

* Fix typo.
2018-09-05 15:06:29 -07:00
Peter Zhokhov
7f08c675bb git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "39f8be8f"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "0a40206c"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-09-04 10:23:40 -07:00
pzhokhov
b3f966aa02 use env.render in dummy_vec_env.render when num_envs == 1 (#74)
* use env.render in dummy_vec_env.render when num_envs == 1

* use shorter super() syntax per Alex's suggestion
2018-09-04 10:23:40 -07:00
pzhokhov
51cefc933b make load_variables compatible with old list format (#71)
* make load_variables compatible with old list format

* cosmetic fixes
2018-09-04 10:23:39 -07:00
Christopher Hesse
7bccb2969f baselines: default logger similar to configure() logger, rcall: don't call logger.configure() for new rl_algs
* error if logger looks wrong

* check version of logger, call logger.configure() on import

* remove changes entry

* add version to rl-algs

* fix typo

* add comment

* switch version to string

* set logger env variable
2018-09-04 10:23:39 -07:00
uronce-cc
0a40206c6c ncpu needs to be an integer. (#558) 2018-08-31 09:02:18 -07:00
Alfredo Canziani
1937826784 Fix alien syntax and apply PEP 8 style (#554) 2018-08-30 17:21:25 -07:00
pzhokhov
b29c8020d7 remove saving model as a pickle file in ppo2 (tries to pull environment in; bad idea - may need to use constructor argument pickling or somesuch if at all necessary) (#69) 2018-08-30 13:41:38 -07:00
Peter Zhokhov
4ec308aaa4 fixed syntax 2018-08-30 13:41:38 -07:00
Peter Zhokhov
3bbf3f3511 allow_early_resets=True in create_vec_env 2018-08-30 13:41:38 -07:00
Joshua Meier
e5de29a954 instructions for tensorboard (#61) 2018-08-30 13:41:37 -07:00
Joshua Meier
2507d335f9 Tensorboard util (#60)
* separate_validation_set was not imported

* launching tensorboard automatically
2018-08-30 13:41:37 -07:00
Damien Lancry
bdd4d385a6 Fix result_plotters in vectorized mujoco environments (#533)
* I investigated a bit about running a training in a vectorized monitored mujoco env and found out that the 0.monitor.csv file could not be plotted using baselines.results_plotter.py functions. Moreover the seed is the same in every parallel environments due to the particular behaviour of lambda. this fixes both issues without breaking the function in other files (baselines.acktr.run_mujoco still works)

* unifies make_atari_env and make_mujoco_env

* redefine make_mujoco_env because of run_mujoco in acktr not compatible with DummyVecEnv and SubprocVecEnv

* fix if else

* Update run.py
2018-08-28 17:48:56 -07:00
Peter Zhokhov
0961f5dd94 git subrepo pull (merge) baselines
subrepo:
  subdir:   "baselines"
  merged:   "95a81e86"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "c6c0f45c"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"
2018-08-27 16:40:14 -07:00
Christopher Hesse
337d913a8f remove reset_task from subproc vec env (#45) 2018-08-27 16:40:14 -07:00
Karl Cobbe
34af61a132 baselines: fix dummy vec env render mode (#42) 2018-08-27 16:40:14 -07:00
Christopher Hesse
1ea5ec647c export SimpleEnv and assert_envs_equal, fix minor bug in action space (#46) 2018-08-27 16:40:14 -07:00
pzhokhov
2fc7a1cbee Trigger benchmarks from buildkite (#40)
* rig buildkite pipeline to run benchmarks when commit ends with RUN BENCHMARKS

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file

* fix the buildkite pipeline file - merge test and benchmark steps

* fix the buildkite pipeline file - merge test and benchmark steps

* fix buildkite pipeline file

* fix buildkite pipeline file

* dry RUN BENCHMARKS

* dry RUN BENCHMARKS

* dry not run BENCHMARKS

* not run benchmarks

* not running benchmarks

* no running benchmarks

* no running benchmarks

* still not running benchmarks

* dummy commit to RUN BENCHMARKS

* trigger benchmarks from buildkite RUN BENCHMARKS

* specifying RCALL_KUBE_CLUSTER RUN BENCHMARKS

* remove rl-algs/run-benchmarks-new.py (moved to ci), merged baselines/common/console_util and baselines/common/util.py

* added missing imports in console_util

* clone subrepo over https
2018-08-27 16:40:14 -07:00
John Schulman
14c1d69ef4 Reduce duplication in VecEnv subclasses. (#38)
* Reduce duplication in VecEnv subclasses.
Now VecEnv base class handles rendering and closing; subclasses should provide get_images and (optionally) close_extras.

* fix tests

* minor docstring change

* raise NotImplementedError
2018-08-27 16:40:13 -07:00
pzhokhov
c8f6d8bac7 address rl-algs issue #169 (missing util functions from rcall) (#30)
* copied parts of util.py to baselines.common from rcall

* merged fix for baselines.logger, resolved conflicts

* copied ccap to baselines/baselines/common/util.py
2018-08-27 16:40:13 -07:00
pzhokhov
3a006ba50e flake8 fixes (#35)
* flake8 fixes

* added baselines/setup.cfg

* style checks using setup.cfg in baselines
2018-08-27 16:40:13 -07:00
Tom
c6c0f45cb1 fix 'async' is a reserved word in Python >= 3.7 (#495) (#542) 2018-08-27 12:36:43 -07:00
wangjksjtu
e92a6ad8f4 Update README.md (#537)
1. Delete repetitive section
2. Align the commands
2018-08-27 12:35:48 -07:00
HelgeS
92b9a37257 Updated example commands to run ppo2 (#534)
The headline mentions PPO, but the command was for A2C
2018-08-23 15:58:27 -07:00
Armin Primadi
cb14da96ca Fix typo on policies documentation (#535) 2018-08-23 15:56:13 -07:00
pzhokhov
3900f2a447 baselines issue 146 (remove tensorflow from setup.py) (#34)
* baselines does not reinstall tensorflow

* fix the version check in baselines/setup.py

* replace print and assert with assert, str (thanks @csh)
2018-08-21 16:59:05 -07:00
pzhokhov
20d22a5d79 Fix baselines build (fails due to lack of mujoco in public baselines container) (#29)
* make nminibatces = min(nminibatches, nenv)

* clarify the usage of lstm policy, add an example and a test

* cleaned up example, added assert to the test

* remove nminibatches -> min(nminibatches, num_env)

* removed code snippet from the docstring, pointing to the file

* add _mujoco_present flag to skip the tests that require mujoco if mujoco is not present

* re-format skip message in test_doc_examples

* flake8 complaints
2018-08-21 10:08:24 -07:00
pzhokhov
caf7b08b4d Baselines issue #525 (lack of docs for recurrent policies) (#27)
* make nminibatces = min(nminibatches, nenv)

* clarify the usage of lstm policy, add an example and a test

* cleaned up example, added assert to the test

* remove nminibatches -> min(nminibatches, num_env)

* removed code snippet from the docstring, pointing to the file
2018-08-20 13:55:35 -07:00
Peter Zhokhov
ca0165cdf5 flake8 complaints 2018-08-17 18:11:00 -07:00
pzhokhov
eb5b605f86 restore subrepo conftest.py files (#22)
* restore conftest.py in subrepos

* remove conftest files from subrepos in the docker image

* remove runslow flag from baselines .travis.yml and rl-algs ci/runtests.sh

* move import of rendering module into the code to fix tests that don't require a display

* restore the dockerfile
2018-08-17 17:02:39 -07:00
Peter Zhokhov
a89bee3c8d Merge commit 'refs/subrepo/baselines/fetch' into subrepo/baselines 2018-08-17 13:55:27 -07:00
pzhokhov
353bb15e90 deduplicate algorithms in rl-algs and baselines (#18)
* move vec_env

* cleaning up rl_common

* tests are passing (but mosts tests are deleted as moved to baselines)

* add benchmark runner for smoke tests

* removed duplicated algos

* route references to rl_algs.a2c to baselines.a2c

* route references to rl_algs.a2c to baselines.a2c

* unify conftest.py

* removing references to duplicated algs from codegen

* removing references to duplicated algs from codegen

* alex's changes to dummy_vec_env

* fixed test_carpole[deepq] testcase by decreasing number of training steps... alex's changes seemed to have fixed the bug and make it train better, but at seed=0 there is a dip in the training curve at 30k steps that fails the test

* codegen tests with atol=1e-6 seem to be unstable

* rl_common.vec_env -> baselines.common.vec_env mass replace

* fixed reference in trpo_mpi

* a2c.util references

* restored rl_algs.bench in sonic_prob

* fix reference in ci/runtests.sh

* simplifed expression in baselines/common/cmd_util

* further increased rtol to 1e-3 in codegen tests

* switched vecenvs to use SimpleImageViewer from gym instead of cv2

* make run.py --play option work with num_envs > 1

* make rosenbrock test reproducible

* git subrepo pull (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "e23524a5"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "bcde04e7"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* updated baselines README (num-timesteps --> num_timesteps)

* typo in deepq/README.md
2018-08-17 13:54:11 -07:00
pzhokhov
64c0c0a043 Setup travis (#12)
* re-setting up travis

* re-setting up travis

* resolved merge conflicts, added missing dependency for codegen

* removed parallel tests (workers are failing for some reason)

* try test baselines only

* added language options - some weirdness in rcall image that requires them?

* added verbosity to tests

* try tests in baselines only

* ci/runtests.sh tests codegen (some failure on baselines specifically on travis, trying to narrow down the problem)

* removed render from codegen test - maybe that's the problem?

* trying even simpler command within the image to figure out the problem

* print out system info in ci/runtests.sh

* print system info outside of docker as well

* trying single test file in codegen

* install graphviz in the docker image

* git subrepo pull baselines

subrepo:
  subdir:   "baselines"
  merged:   "8c2aea2"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "8c2aea2"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* added graphviz to the dockerfile (need both graphviz-dev and graphviz)

* only tests in codegen/algo/test_algo_builder.py

* run baselines tests only. still no clue why collection of codegen tests fails

* update baselines setup to install filelock for tests

* run slow tests

* skip slow tests in baselines

* single test file in baselines

* try reinstalling tensorflow

* running slow tests

* try full baselines and codegen test suite

* in the test Dockerfile, reinstall tensorflow

* using fake display for codegen render tests

* fixed display-related failures by adding a custom entrpoint to the docker image

* set LC_ALL and LANG env variables in docker image

* try sequential tests

* include psutil in requirements; increase relative tolerance in test_low_level_algo_distr

* trying to fix codegen failures on travis

* git subrepo commit (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "9ce84da"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "b222dd0"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* syntax in install.py

* changing the order of package installation

* removed supervised-reptile from installation list

* cron uses the full games repo in rcall

* flake8 complaints

* rewrite all extras logic in baselines, install.py always uses [all]
2018-08-17 13:54:10 -07:00
pzhokhov
5fee99e771 Setup travis (#12)
* re-setting up travis

* re-setting up travis

* resolved merge conflicts, added missing dependency for codegen

* removed parallel tests (workers are failing for some reason)

* try test baselines only

* added language options - some weirdness in rcall image that requires them?

* added verbosity to tests

* try tests in baselines only

* ci/runtests.sh tests codegen (some failure on baselines specifically on travis, trying to narrow down the problem)

* removed render from codegen test - maybe that's the problem?

* trying even simpler command within the image to figure out the problem

* print out system info in ci/runtests.sh

* print system info outside of docker as well

* trying single test file in codegen

* install graphviz in the docker image

* git subrepo pull baselines

subrepo:
  subdir:   "baselines"
  merged:   "8c2aea2"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "8c2aea2"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* added graphviz to the dockerfile (need both graphviz-dev and graphviz)

* only tests in codegen/algo/test_algo_builder.py

* run baselines tests only. still no clue why collection of codegen tests fails

* update baselines setup to install filelock for tests

* run slow tests

* skip slow tests in baselines

* single test file in baselines

* try reinstalling tensorflow

* running slow tests

* try full baselines and codegen test suite

* in the test Dockerfile, reinstall tensorflow

* using fake display for codegen render tests

* fixed display-related failures by adding a custom entrpoint to the docker image

* set LC_ALL and LANG env variables in docker image

* try sequential tests

* include psutil in requirements; increase relative tolerance in test_low_level_algo_distr

* trying to fix codegen failures on travis

* git subrepo commit (merge) baselines

subrepo:
  subdir:   "baselines"
  merged:   "9ce84da"
upstream:
  origin:   "git@github.com:openai/baselines.git"
  branch:   "master"
  commit:   "b222dd0"
git-subrepo:
  version:  "0.4.0"
  origin:   "git@github.com:ingydotnet/git-subrepo.git"
  commit:   "74339e8"

* syntax in install.py

* changing the order of package installation

* removed supervised-reptile from installation list

* cron uses the full games repo in rcall

* flake8 complaints

* rewrite all extras logic in baselines, install.py always uses [all]
2018-08-17 13:40:02 -07:00
Youngjin Kim
5edcd6886e Fix argument error in deepq (#508)
* Fix argment error in deepq

* Fix argment error in deepq
2018-08-16 14:55:57 -07:00
Youngjin Kim
bcde04e710 Fix argument error in deepq (#508)
* Fix argment error in deepq

* Fix argment error in deepq
2018-08-16 14:55:57 -07:00
pzhokhov
cd375ab209 update readmes (#514)
* update per-algorithm READMEs to reflect new way of running algorithms

* adding a link to repo-wide README

* updated README files and deepq.train_cartpole example
2018-08-16 14:53:49 -07:00
pzhokhov
5622a09fa4 update readmes (#514)
* update per-algorithm READMEs to reflect new way of running algorithms

* adding a link to repo-wide README

* updated README files and deepq.train_cartpole example
2018-08-16 14:53:49 -07:00
Pim de Haan
e2da7cd42f Several bugfixes for #504, #505, #506 related to Classic Control and deepq (#507)
* Several bugfixes

* Fixed ActWrapper.step bug
2018-08-16 12:08:53 -07:00
Peter Zhokhov
b222dd0610 updated links in README to point to master 2018-08-13 16:01:24 -07:00
pzhokhov
1870685071 Publish benchmark results (#502)
* updated benchmark pages with final rewards

* use htmlpreview to render pages

* use htmlpreview to render pages

* use htmlpreview to render pages

* updated README to reflect ppo1 being obsolete

* removed navbars from published benchmark pages

* fixed link in README
2018-08-13 15:59:43 -07:00
pzhokhov
8c2aea2add refactor a2c, acer, acktr, ppo2, deepq, and trpo_mpi (#490)
* exported rl-algs

* more stuff from rl-algs

* run slow tests

* re-exported rl_algs

* re-exported rl_algs - fixed problems with serialization test and test_cartpole

* replaced atari_arg_parser with common_arg_parser

* run.py can run algos from both baselines and rl_algs

* added approximate humanoid reward with ppo2 into the README for reference

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* very dummy commit to RUN BENCHMARKS

* serialize variables as a dict, not as a list

* running_mean_std uses tensorflow variables

* fixed import in vec_normalize

* dummy commit to RUN BENCHMARKS

* dummy commit to RUN BENCHMARKS

* flake8 complaints

* save all variables to make sure we save the vec_normalize normalization

* benchmarks on ppo2 only RUN BENCHMARKS

* make_atari_env compatible with mpi

* run ppo_mpi benchmarks only RUN BENCHMARKS

* hardcode names of retro environments

* add defaults

* changed default ppo2 lr schedule to linear RUN BENCHMARKS

* non-tf normalization benchmark RUN BENCHMARKS

* use ncpu=1 for mujoco sessions - gives a bit of a performance speedup

* reverted running_mean_std to user property decorators for mean, var, count

* reverted VecNormalize to use RunningMeanStd (no tf)

* reverted VecNormalize to use RunningMeanStd (no tf)

* profiling wip

* use VecNormalize with regular RunningMeanStd

* added acer runner (missing import)

* flake8 complaints

* added a note in README about TfRunningMeanStd and serialization of VecNormalize

* dummy commit to RUN BENCHMARKS

* merged benchmarks branch
2018-08-13 09:56:44 -07:00
Tony Yu Cao
366f486e34 Update README.md (#416)
Update Atari example
2018-08-08 10:42:10 -07:00
184 changed files with 28991 additions and 8847 deletions

1
.benchmark_pattern Normal file

@@ -0,0 +1 @@

2
.gitignore vendored

@@ -34,5 +34,3 @@ src
.cache
MUJOCO_LOG.TXT

View File

@@ -10,5 +10,5 @@ install:
- docker build . -t baselines-test
script:
- flake8 --select=F baselines/common
- docker run baselines-test pytest
- flake8 . --show-source --statistics --exclude=baselines/her
- docker run -e RUNSLOW=1 baselines-test pytest -v .

View File

@@ -1,20 +1,18 @@
FROM ubuntu:16.04
FROM python:3.6
RUN apt-get -y update && apt-get -y install ffmpeg
# RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake python-opencv
RUN apt-get -y update && apt-get -y install git wget python-dev python3-dev libopenmpi-dev python-pip zlib1g-dev cmake
ENV CODE_DIR /root/code
ENV VENV /root/venv
COPY . $CODE_DIR/baselines
RUN \
pip install virtualenv && \
virtualenv $VENV --python=python3 && \
. $VENV/bin/activate && \
cd $CODE_DIR && \
pip install --upgrade pip && \
pip install -e baselines && \
pip install pytest
ENV PATH=$VENV/bin:$PATH
WORKDIR $CODE_DIR/baselines
# Clean up pycache and pyc files
RUN rm -rf __pycache__ && \
find . -name "*.pyc" -delete && \
pip install tensorflow && \
pip install -e .[test]
CMD /bin/bash

109
README.md

@@ -1,3 +1,5 @@
**Status:** Active (under active development, breaking changes may occur)
<img src="data/logo.jpg" width=25% align="right" /> [![Build status](https://travis-ci.org/openai/baselines.svg?branch=master)](https://travis-ci.org/openai/baselines)
# Baselines
@@ -15,7 +17,7 @@ sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zli
```
### Mac OS X
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the follwing:
Installation of system packages on Mac requires [Homebrew](https://brew.sh). With Homebrew installed, run the following:
```bash
brew install cmake openmpi
```
@@ -38,20 +40,27 @@ More thorough tutorial on virtualenvs and options can be found [here](https://vi
## Installation
Clone the repo and cd into it:
```bash
git clone https://github.com/openai/baselines.git
cd baselines
```
If using virtualenv, create a new virtualenv and activate it
```bash
virtualenv env --python=python3
. env/bin/activate
```
Install baselines package
```bash
pip install -e .
```
- Clone the repo and cd into it:
```bash
git clone https://github.com/openai/baselines.git
cd baselines
```
- If you don't have TensorFlow installed already, install your favourite flavor of TensorFlow. In most cases,
```bash
pip install tensorflow-gpu # if you have a CUDA-compatible gpu and proper drivers
```
or
```bash
pip install tensorflow
```
should be sufficient. Refer to [TensorFlow installation guide](https://www.tensorflow.org/install/)
for more details.
- Install baselines package
```bash
pip install -e .
```
### MuJoCo
Some of the baselines examples use [MuJoCo](http://www.mujoco.org) (multi-joint dynamics in contact) physics simulator, which is proprietary and requires binaries and a license (temporary 30-day license can be obtained from [www.mujoco.org](http://www.mujoco.org)). Instructions on setting up MuJoCo can be found [here](https://github.com/openai/mujoco-py)
@@ -62,6 +71,60 @@ pip install pytest
pytest
```
## Training models
Most of the algorithms in baselines repo are used as follows:
```bash
python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
```
### Example 1. PPO with MuJoCo Humanoid
For instance, to train a fully-connected network controlling the MuJoCo humanoid using PPO2 for 20M timesteps:
```bash
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
```
Note that for MuJoCo environments a fully-connected network is the default, so we can omit `--network=mlp`.
The hyperparameters for both network and the learning algorithm can be controlled via the command line, for instance:
```bash
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
```
will set the entropy coefficient to 0.1, construct a fully connected network with 3 layers of 32 hidden units each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same).
See docstrings in [common/models.py](baselines/common/models.py) for description of network parameters for each type of model, and
docstring for [baselines/ppo2/ppo2.py/learn()](baselines/ppo2/ppo2.py#L152) for the description of the ppo2 hyperparameters.
### Example 2. DQN on Atari
DQN on Atari is by now a classic benchmark. To run the baselines implementation of DQN on Atari Pong:
```
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
```
## Saving, loading and visualizing models
### Saving and loading the model
The algorithms serialization API is not properly unified yet; however, there is a simple method to save / restore trained models.
The `--save_path` and `--load_path` command-line options save the TensorFlow state to a given path after training and load it from a given path before training, respectively.
Let's imagine you'd like to train ppo2 on Atari Pong, save the model, and then later visualize what it has learnt.
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
```
This should reach a mean reward per episode of about 20. To load and visualize the model, we'll do the following: load the model, train it for 0 steps, and then visualize:
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
```
*NOTE:* MuJoCo environments require normalization to work properly, so we wrap them with the VecNormalize wrapper. Currently, to ensure the models are saved with normalization (so that trained models can be restored and run without further training), the normalization coefficients are saved as TensorFlow variables. This can decrease performance somewhat, so if you require high-throughput steps with MuJoCo and do not need to save or restore the models, it may make sense to use numpy normalization instead. To do that, set `use_tf=False` in [baselines/run.py](baselines/run.py#L116).
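To make the numpy-style normalization mentioned above concrete, here is a minimal sketch of observation normalization with a running mean and variance. It is not the baselines `VecNormalize` code: the class `RunningStats`, the function `normalize_obs`, and the `clip` and `eps` defaults are assumptions chosen for illustration.

```python
import numpy as np


class RunningStats:
    """Hypothetical helper: running per-feature mean and variance."""

    def __init__(self, shape=(), epsilon=1e-4):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        # A tiny pseudo-count avoids division by zero before the first update.
        self.count = epsilon

    def update(self, batch):
        """Merge the statistics of a batch (first axis = samples)."""
        batch = np.asarray(batch, dtype=np.float64)
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]

        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Parallel-variance merge of two groups of samples.
        m2 = (self.var * self.count
              + batch_var * batch_count
              + np.square(delta) * self.count * batch_count / total)
        self.mean = self.mean + delta * batch_count / total
        self.var = m2 / total
        self.count = total


def normalize_obs(obs, stats, clip=10.0, eps=1e-8):
    """Standardize observations and clip outliers (clip and eps are assumed defaults)."""
    return np.clip((obs - stats.mean) / np.sqrt(stats.var + eps), -clip, clip)
```

Because the statistics live in plain numpy arrays here, they are not part of any TensorFlow checkpoint, which is the trade-off the note describes.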
### Logging and visualizing learning curves and other training metrics
By default, all summary data, including progress and standard output, is saved to a unique directory in a temp folder, specified by a call to Python's [tempfile.gettempdir()](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir).
The directory can be changed with the `--log_path` command-line option.
```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2 --log_path=~/logs/Pong/
```
*NOTE:* Please be aware that the logger will overwrite files of the same name in an existing directory, thus it's recommended that folder names be given a unique timestamp to prevent overwritten logs.
Another way the temp directory can be changed is through the use of the `$OPENAI_LOGDIR` environment variable.
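One way to follow the timestamp recommendation is to build the directory name in Python and export it through `$OPENAI_LOGDIR` before training starts. This is a minimal sketch: the helper `timestamped_log_dir` and the `~/logs` root are hypothetical, and it assumes the variable is read when the logger is configured.

```python
import os
import time


def timestamped_log_dir(root, run_name):
    """Hypothetical helper: return '<root>/<run_name>-<timestamp>'."""
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S")
    return os.path.join(os.path.expanduser(root), "{}-{}".format(run_name, stamp))


# Each run gets its own directory, so earlier logs are not overwritten.
log_dir = timestamped_log_dir("~/logs", "Pong")
os.environ["OPENAI_LOGDIR"] = log_dir
```

The same path could instead be passed on the command line via `--log_path`.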
For examples on how to load and display the training data, see [here](docs/viz/viz.ipynb).
## Subpackages
- [A2C](baselines/a2c)
@@ -71,17 +134,27 @@ pytest
- [DQN](baselines/deepq)
- [GAIL](baselines/gail)
- [HER](baselines/her)
- [PPO1](baselines/ppo1) (Multi-CPU using MPI)
- [PPO2](baselines/ppo2) (Optimized for GPU)
- [PPO1](baselines/ppo1) (obsolete version, left here temporarily)
- [PPO2](baselines/ppo2)
- [TRPO](baselines/trpo_mpi)
## Benchmarks
Results of benchmarks on Mujoco (1M timesteps) and Atari (10M timesteps) are available
[here for Mujoco](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_mujoco1M.htm)
and
[here for Atari](https://htmlpreview.github.com/?https://github.com/openai/baselines/blob/master/benchmarks_atari10M.htm)
respectively. Note that these results may not correspond to the latest version of the code; the commit hash with which the results were obtained is specified on the benchmarks page.
To cite this repository in publications:
@misc{baselines,
author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
author = {Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Tan, Zhenyu and Wu, Yuhuai and Zhokhov, Peter},
title = {OpenAI Baselines},
year = {2017},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/openai/baselines}},
}

View File

@@ -2,4 +2,12 @@
- Original paper: https://arxiv.org/abs/1602.01783
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.a2c.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on Atari Pong. See help (`-h`) for more options.
- also refer to the repo-wide [README.md](../../README.md#training-models)
## Files
- `run_atari`: file used to run the algorithm.
- `policies.py`: contains the different versions of the A2C architecture (MlpPolicy, CNNPolicy, LstmPolicy...).
- `a2c.py`: - Model : class used to initialize the step_model (sampling) and train_model (training)
- learn : Main entrypoint for A2C algorithm. Train a policy with given network architecture on a given environment using a2c algorithm.
- `runner.py`: class used to generate a batch of experiences
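
The loss that `Model.train` in `a2c.py` assembles (policy-gradient term, entropy bonus, value loss) can be illustrated outside TensorFlow. The following is a minimal NumPy sketch assuming a categorical policy; the function `a2c_loss` and its argument names are hypothetical and only mirror the structure of the training step.

```python
import numpy as np


def a2c_loss(logits, actions, returns, values, vpred, ent_coef=0.01, vf_coef=0.5):
    """Hypothetical illustration of the A2C loss for a categorical policy.

    logits:  (batch, n_actions) unnormalized action scores
    actions: (batch,) integer actions taken during the rollout
    returns: (batch,) discounted returns
    values:  (batch,) value estimates recorded during the rollout
    vpred:   (batch,) value estimates recomputed for the update
    """
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)

    # Negative log-probability of the actions that were taken.
    neglogpac = -log_probs[np.arange(len(actions)), actions]
    entropy = np.mean(-(probs * log_probs).sum(axis=1))

    advs = returns - values                    # advantage estimate
    pg_loss = np.mean(advs * neglogpac)        # policy-gradient term
    vf_loss = np.mean(np.square(vpred - returns))
    loss = pg_loss - entropy * ent_coef + vf_loss * vf_coef
    return loss, pg_loss, vf_loss, entropy
```

The entropy term enters with a negative sign, so a larger `ent_coef` rewards more exploratory policies, while `vf_coef` weights the value-function regression.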

View File

@@ -1,153 +1,194 @@
import os.path as osp
import time
import joblib
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.common.runners import AbstractEnvRunner
from baselines.common import tf_util
from baselines.common.models import get_network_builder
from baselines.common.policies import PolicyWithValue
from baselines.a2c.utils import discount_with_dones
from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
from baselines.a2c.utils import cat_entropy, mse
from baselines.a2c.utils import InverseLinearTimeDecay
from baselines.a2c.runner import Runner
from baselines.ppo2.ppo2 import safemean
import os.path as osp
from collections import deque
class Model(object):
class Model(tf.keras.Model):
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps,
"""
We use this class to :
__init__:
- Creates the step_model
- Creates the train_model
train():
- Run the training step (forward pass and backpropagation of gradients)
save/load():
- Save and load the model
"""
def __init__(self, *, ac_space, policy_network, nupdates,
ent_coef=0.01, vf_coef=0.5, max_grad_norm=0.5, lr=7e-4,
alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6), lrschedule='linear'):
alpha=0.99, epsilon=1e-5, total_timesteps=int(80e6)):
sess = tf_util.make_session()
nbatch = nenvs*nsteps
super(Model, self).__init__(name='A2CModel')
self.train_model = PolicyWithValue(ac_space, policy_network, value_network=None, estimate_q=False)
lr_schedule = InverseLinearTimeDecay(initial_learning_rate=lr, nupdates=nupdates)
self.optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule, rho=alpha, epsilon=epsilon)
A = tf.placeholder(tf.int32, [nbatch])
ADV = tf.placeholder(tf.float32, [nbatch])
R = tf.placeholder(tf.float32, [nbatch])
LR = tf.placeholder(tf.float32, [])
self.ent_coef = ent_coef
self.vf_coef = vf_coef
self.max_grad_norm = max_grad_norm
self.step = self.train_model.step
self.value = self.train_model.value
self.initial_state = self.train_model.initial_state
step_model = policy(sess, ob_space, ac_space, nenvs, 1, reuse=False)
train_model = policy(sess, ob_space, ac_space, nenvs*nsteps, nsteps, reuse=True)
@tf.function
def train(self, obs, states, rewards, masks, actions, values):
advs = rewards - values
with tf.GradientTape() as tape:
policy_latent = self.train_model.policy_network(obs)
pd, _ = self.train_model.pdtype.pdfromlatent(policy_latent)
neglogpac = pd.neglogp(actions)
entropy = tf.reduce_mean(pd.entropy())
vpred = self.train_model.value(obs)
vf_loss = tf.reduce_mean(tf.square(vpred - rewards))
pg_loss = tf.reduce_mean(advs * neglogpac)
loss = pg_loss - entropy * self.ent_coef + vf_loss * self.vf_coef
neglogpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
pg_loss = tf.reduce_mean(ADV * neglogpac)
vf_loss = tf.reduce_mean(mse(tf.squeeze(train_model.vf), R))
entropy = tf.reduce_mean(cat_entropy(train_model.pi))
loss = pg_loss - entropy*ent_coef + vf_loss * vf_coef
var_list = tape.watched_variables()
grads = tape.gradient(loss, var_list)
grads, _ = tf.clip_by_global_norm(grads, self.max_grad_norm)
grads_and_vars = list(zip(grads, var_list))
self.optimizer.apply_gradients(grads_and_vars)
params = find_trainable_variables("model")
grads = tf.gradients(loss, params)
if max_grad_norm is not None:
grads, grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
grads = list(zip(grads, params))
trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=alpha, epsilon=epsilon)
_train = trainer.apply_gradients(grads)
return pg_loss, vf_loss, entropy
lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
def train(obs, states, rewards, masks, actions, values):
advs = rewards - values
for step in range(len(obs)):
cur_lr = lr.value()
td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, LR:cur_lr}
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks
policy_loss, value_loss, policy_entropy, _ = sess.run(
[pg_loss, vf_loss, entropy, _train],
td_map
)
return policy_loss, value_loss, policy_entropy
def learn(
network,
env,
seed=None,
nsteps=5,
total_timesteps=int(80e6),
vf_coef=0.5,
ent_coef=0.01,
max_grad_norm=0.5,
lr=7e-4,
lrschedule='linear',
epsilon=1e-5,
alpha=0.99,
gamma=0.99,
log_interval=100,
load_path=None,
**network_kwargs):
def save(save_path):
ps = sess.run(params)
make_path(osp.dirname(save_path))
joblib.dump(ps, save_path)
'''
Main entrypoint for A2C algorithm. Train a policy with given network architecture on a given environment using a2c algorithm.
def load(load_path):
loaded_params = joblib.load(load_path)
restores = []
for p, loaded_p in zip(params, loaded_params):
restores.append(p.assign(loaded_p))
sess.run(restores)
Parameters:
-----------
self.train = train
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
self.value = step_model.value
self.initial_state = step_model.initial_state
self.save = save
self.load = load
tf.global_variables_initializer().run(session=sess)
network: policy network architecture. Either string (mlp, lstm, lnlstm, cnn_lstm, cnn, cnn_small, conv_only - see baselines.common/models.py for full list)
specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
class Runner(AbstractEnvRunner):
def __init__(self, env, model, nsteps=5, gamma=0.99):
super().__init__(env=env, model=model, nsteps=nsteps)
self.gamma = gamma
env: RL environment. Should implement interface similar to VecEnv (baselines.common/vec_env) or be wrapped with DummyVecEnv (baselines.common/vec_env/dummy_vec_env.py)
seed: seed to make random number sequence in the algorithm reproducible. By default is None which means seed from system noise generator (not reproducible)
nsteps: int, number of steps of the vectorized environment per update (i.e. batch size is nsteps * nenv where
nenv is number of environment copies simulated in parallel)
total_timesteps: int, total number of timesteps to train on (default: 80M)
vf_coef: float, coefficient in front of value function loss in the total loss function (default: 0.5)
ent_coef: float, coefficient in front of the policy entropy in the total loss function (default: 0.01)
max_grad_norm: float, gradient is clipped to have global L2 norm no more than this value (default: 0.5)
lr: float, learning rate for RMSProp (current implementation has RMSProp hardcoded in) (default: 7e-4)
lrschedule: schedule of learning rate. Can be 'linear', 'constant', or a function [0..1] -> [0..1] that takes fraction of the training progress as input and
returns fraction of the learning rate (specified as lr) as output
epsilon: float, RMSProp epsilon (stabilizes square root computation in denominator of RMSProp update) (default: 1e-5)
alpha: float, RMSProp decay parameter (default: 0.99)
gamma: float, reward discounting parameter (default: 0.99)
log_interval: int, specifies how frequently the logs are printed out (default: 100)
**network_kwargs: keyword arguments to the policy / network builder. See baselines.common/policies.py/build_policy and arguments to a particular type of network
For instance, 'mlp' network architecture has arguments num_hidden and num_layers.
'''
def run(self):
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
mb_states = self.states
for n in range(self.nsteps):
actions, values, states, _ = self.model.step(self.obs, self.states, self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_values.append(values)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
self.states = states
self.dones = dones
for n, done in enumerate(dones):
if done:
self.obs[n] = self.obs[n]*0
self.obs = obs
mb_rewards.append(rewards)
mb_dones.append(self.dones)
#batch of steps to batch of rollouts
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0).reshape(self.batch_ob_shape)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones[:, :-1]
mb_dones = mb_dones[:, 1:]
last_values = self.model.value(self.obs, self.states, self.dones).tolist()
#discount/bootstrap off value fn
for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
rewards = rewards.tolist()
dones = dones.tolist()
if dones[-1] == 0:
rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
else:
rewards = discount_with_dones(rewards, dones, self.gamma)
mb_rewards[n] = rewards
mb_rewards = mb_rewards.flatten()
mb_actions = mb_actions.flatten()
mb_values = mb_values.flatten()
mb_masks = mb_masks.flatten()
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5, lr=7e-4, lrschedule='linear', epsilon=1e-5, alpha=0.99, gamma=0.99, log_interval=100):
set_global_seeds(seed)
total_timesteps = int(total_timesteps)
# Get the nb of env
nenvs = env.num_envs
# Get state_space and action_space
ob_space = env.observation_space
ac_space = env.action_space
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule)
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
nbatch = nenvs*nsteps
if isinstance(network, str):
network_type = network
policy_network_fn = get_network_builder(network_type)(**network_kwargs)
policy_network = policy_network_fn(ob_space.shape)
# Calculate the batch_size
nbatch = nenvs * nsteps
nupdates = total_timesteps // nbatch
# Instantiate the model object (that creates step_model and train_model)
model = Model(ac_space=ac_space, policy_network=policy_network, nupdates=nupdates, ent_coef=ent_coef, vf_coef=vf_coef,
max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps)
if load_path is not None:
load_path = osp.expanduser(load_path)
ckpt = tf.train.Checkpoint(model=model)
manager = tf.train.CheckpointManager(ckpt, load_path, max_to_keep=None)
ckpt.restore(manager.latest_checkpoint)
# Instantiate the runner object
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
epinfobuf = deque(maxlen=100)
# Start total timer
tstart = time.time()
for update in range(1, total_timesteps//nbatch+1):
obs, states, rewards, masks, actions, values = runner.run()
for update in range(1, nupdates+1):
# Get mini batch of experiences
obs, states, rewards, masks, actions, values, epinfos = runner.run()
epinfobuf.extend(epinfos)
obs = tf.constant(obs)
if states is not None:
states = tf.constant(states)
rewards = tf.constant(rewards)
masks = tf.constant(masks)
actions = tf.constant(actions)
values = tf.constant(values)
policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
nseconds = time.time()-tstart
# Calculate the fps (frame per second)
fps = int((update*nbatch)/nseconds)
if update % log_interval == 0 or update == 1:
# Calculates if value function is a good predictor of the returns (ev close to 1)
# or if it's just worse than predicting nothing (ev <= 0)
ev = explained_variance(values, rewards)
logger.record_tabular("nupdates", update)
logger.record_tabular("total_timesteps", update*nbatch)
@@ -155,6 +196,8 @@ def learn(policy, env, seed, nsteps=5, total_timesteps=int(80e6), vf_coef=0.5, e
logger.record_tabular("policy_entropy", float(policy_entropy))
logger.record_tabular("value_loss", float(value_loss))
logger.record_tabular("explained_variance", float(ev))
logger.record_tabular("eprewmean", safemean([epinfo['r'] for epinfo in epinfobuf]))
logger.record_tabular("eplenmean", safemean([epinfo['l'] for epinfo in epinfobuf]))
logger.dump_tabular()
env.close()
return model
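The logging code above records `explained_variance(values, rewards)`, but the helper's definition is not part of this diff. As a minimal sketch, assuming the usual definition 1 - Var[y - ypred] / Var[y], the function below is a NumPy stand-in written for illustration and is not the repository's own code. It shows why the score cannot exceed 1 and why 0 or less means the value function is no better than a constant.

```python
import numpy as np


def explained_variance(ypred, y):
    """Return 1 - Var[y - ypred] / Var[y], or NaN when y is constant.

    Assumed definition for illustration; the diff only shows the call site.
    """
    ypred = np.asarray(ypred, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    vary = np.var(y)
    if vary == 0:
        return float("nan")
    return float(1.0 - np.var(y - ypred) / vary)


returns = np.array([1.0, 2.0, 3.0, 4.0])
print(explained_variance(returns, returns))          # perfect predictor
print(explained_variance(np.full(4, 2.5), returns))  # constant predictor
print(explained_variance(-returns, returns))         # anti-correlated predictor
```

A perfect predictor scores 1, a constant predictor scores 0, and a predictor that is worse than a constant goes negative.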


@@ -1,146 +0,0 @@
import numpy as np
import tensorflow as tf
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch, lstm, lnlstm
from baselines.common.distributions import make_pdtype
from baselines.common.input import observation_input
def nature_cnn(unscaled_images, **conv_kwargs):
"""
CNN from Nature paper.
"""
scaled_images = tf.cast(unscaled_images, tf.float32) / 255.
activ = tf.nn.relu
h = activ(conv(scaled_images, 'c1', nf=32, rf=8, stride=4, init_scale=np.sqrt(2),
**conv_kwargs))
h2 = activ(conv(h, 'c2', nf=64, rf=4, stride=2, init_scale=np.sqrt(2), **conv_kwargs))
h3 = activ(conv(h2, 'c3', nf=64, rf=3, stride=1, init_scale=np.sqrt(2), **conv_kwargs))
h3 = conv_to_fc(h3)
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
class LnLstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
nenv = nbatch // nsteps
X, processed_x = observation_input(ob_space, nbatch)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
self.pdtype = make_pdtype(ac_space)
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(processed_x)
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lnlstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
vf = fc(h5, 'v', 1)
self.pd, self.pi = self.pdtype.pdfromlatent(h5)
v0 = vf[:, 0]
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
def step(ob, state, mask):
return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
def value(ob, state, mask):
return sess.run(v0, {X:ob, S:state, M:mask})
self.X = X
self.M = M
self.S = S
self.vf = vf
self.step = step
self.value = value
class LstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, nlstm=256, reuse=False):
nenv = nbatch // nsteps
self.pdtype = make_pdtype(ac_space)
X, processed_x = observation_input(ob_space, nbatch)
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
vf = fc(h5, 'v', 1)
self.pd, self.pi = self.pdtype.pdfromlatent(h5)
v0 = vf[:, 0]
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
def step(ob, state, mask):
return sess.run([a0, v0, snew, neglogp0], {X:ob, S:state, M:mask})
def value(ob, state, mask):
return sess.run(v0, {X:ob, S:state, M:mask})
self.X = X
self.M = M
self.S = S
self.vf = vf
self.step = step
self.value = value
class CnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False, **conv_kwargs): #pylint: disable=W0613
self.pdtype = make_pdtype(ac_space)
X, processed_x = observation_input(ob_space, nbatch)
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(processed_x, **conv_kwargs)
vf = fc(h, 'v', 1)[:,0]
self.pd, self.pi = self.pdtype.pdfromlatent(h, init_scale=0.01)
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = None
def step(ob, *_args, **_kwargs):
a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
return a, v, self.initial_state, neglogp
def value(ob, *_args, **_kwargs):
return sess.run(vf, {X:ob})
self.X = X
self.vf = vf
self.step = step
self.value = value
class MlpPolicy(object):
def __init__(self, sess, ob_space, ac_space, nbatch, nsteps, reuse=False): #pylint: disable=W0613
self.pdtype = make_pdtype(ac_space)
with tf.variable_scope("model", reuse=reuse):
X, processed_x = observation_input(ob_space, nbatch)
activ = tf.tanh
processed_x = tf.layers.flatten(processed_x)
pi_h1 = activ(fc(processed_x, 'pi_fc1', nh=64, init_scale=np.sqrt(2)))
pi_h2 = activ(fc(pi_h1, 'pi_fc2', nh=64, init_scale=np.sqrt(2)))
vf_h1 = activ(fc(processed_x, 'vf_fc1', nh=64, init_scale=np.sqrt(2)))
vf_h2 = activ(fc(vf_h1, 'vf_fc2', nh=64, init_scale=np.sqrt(2)))
vf = fc(vf_h2, 'vf', 1)[:,0]
self.pd, self.pi = self.pdtype.pdfromlatent(pi_h2, init_scale=0.01)
a0 = self.pd.sample()
neglogp0 = self.pd.neglogp(a0)
self.initial_state = None
def step(ob, *_args, **_kwargs):
a, v, neglogp = sess.run([a0, vf, neglogp0], {X:ob})
return a, v, self.initial_state, neglogp
def value(ob, *_args, **_kwargs):
return sess.run(vf, {X:ob})
self.X = X
self.vf = vf
self.step = step
self.value = value


@@ -1,30 +0,0 @@
#!/usr/bin/env python3
from baselines import logger
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.a2c.a2c import learn
from baselines.ppo2.policies import CnnPolicy, LstmPolicy, LnLstmPolicy
def train(env_id, num_timesteps, seed, policy, lrschedule, num_env):
if policy == 'cnn':
policy_fn = CnnPolicy
elif policy == 'lstm':
policy_fn = LstmPolicy
elif policy == 'lnlstm':
policy_fn = LnLstmPolicy
env = VecFrameStack(make_atari_env(env_id, num_env, seed), 4)
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
env.close()
def main():
parser = atari_arg_parser()
parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
args = parser.parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
policy=args.policy, lrschedule=args.lrschedule, num_env=16)
if __name__ == '__main__':
main()

80
baselines/a2c/runner.py Normal file

@@ -0,0 +1,80 @@
import tensorflow as tf
import numpy as np
from baselines.a2c.utils import discount_with_dones
from baselines.common.runners import AbstractEnvRunner
class Runner(AbstractEnvRunner):
"""
We use this class to generate batches of experiences
__init__:
- Initialize the runner
run():
- Make a mini batch of experiences
"""
def __init__(self, env, model, nsteps=5, gamma=0.99):
super().__init__(env=env, model=model, nsteps=nsteps)
self.gamma = gamma
def run(self):
# We initialize the lists that will contain the mb of experiences
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
mb_states = self.states
epinfos = []
for _ in range(self.nsteps):
# Given observations, take action and value (V(s))
# We already have self.obs because the Runner superclass runs self.obs[:] = env.reset() on init
obs = tf.constant(self.obs)
actions, values, self.states, _ = self.model.step(obs)
actions = actions._numpy()
# Append the experiences
mb_obs.append(self.obs.copy())
mb_actions.append(actions)
mb_values.append(values._numpy())
mb_dones.append(self.dones)
# Take actions in env and observe the results
self.obs[:], rewards, self.dones, infos = self.env.step(actions)
for info in infos:
maybeepinfo = info.get('episode')
if maybeepinfo: epinfos.append(maybeepinfo)
mb_rewards.append(rewards)
mb_dones.append(self.dones)
# Batch of steps to batch of rollouts
mb_obs = sf01(np.asarray(mb_obs, dtype=self.obs.dtype))
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_actions = sf01(np.asarray(mb_actions, dtype=actions.dtype))
mb_values = np.asarray(mb_values, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=np.bool).swapaxes(1, 0)
mb_masks = mb_dones[:, :-1]
mb_dones = mb_dones[:, 1:]
if self.gamma > 0.0:
# Discount/bootstrap off value fn
last_values = self.model.value(tf.constant(self.obs))._numpy().tolist()
for n, (rewards, dones, value) in enumerate(zip(mb_rewards, mb_dones, last_values)):
rewards = rewards.tolist()
dones = dones.tolist()
if dones[-1] == 0:
rewards = discount_with_dones(rewards+[value], dones+[0], self.gamma)[:-1]
else:
rewards = discount_with_dones(rewards, dones, self.gamma)
mb_rewards[n] = rewards
mb_rewards = mb_rewards.flatten()
mb_values = mb_values.flatten()
mb_masks = mb_masks.flatten()
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values, epinfos
def sf01(arr):
"""
swap and then flatten axes 0 and 1
"""
s = arr.shape
return arr.swapaxes(0, 1).reshape(s[0] * s[1], *s[2:])
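The runner's post-processing can be tried in isolation. The sketch below copies `sf01` from the file above and pairs it with a `discount_with_dones` whose loop body is an assumption, because the diff shows only that function's signature and last lines. `nstep_returns` is a hypothetical wrapper that mirrors the `if dones[-1] == 0` branch in `run()`.

```python
import numpy as np


def sf01(arr):
    """Swap axes 0 and 1, then merge them into a single batch axis."""
    s = arr.shape
    return arr.swapaxes(0, 1).reshape(s[0] * s[1], *s[2:])


def discount_with_dones(rewards, dones, gamma):
    """Discounted returns; the running sum restarts at episode ends.

    Assumed body: the diff elides everything between the signature and
    the final append/return lines.
    """
    discounted = []
    running = 0.0
    for reward, done in zip(rewards[::-1], dones[::-1]):
        running = reward + gamma * running * (1.0 - done)
        discounted.append(running)
    return discounted[::-1]


def nstep_returns(rewards, dones, last_value, gamma):
    """Hypothetical helper: bootstrap from last_value unless the rollout
    ended on a terminal step."""
    if dones[-1] == 0:
        return discount_with_dones(rewards + [last_value], dones + [0], gamma)[:-1]
    return discount_with_dones(rewards, dones, gamma)


# Shape (nsteps=2, nenvs=3) becomes a flat batch ordered environment by environment.
print(sf01(np.arange(6).reshape(2, 3)))  # [0 3 1 4 2 5]

# No episode boundary: every step sees the bootstrapped value.
print(nstep_returns([1.0, 1.0, 1.0], [0, 0, 0], last_value=10.0, gamma=0.5))  # [3.0, 4.0, 6.0]
# A done at step 1 cuts the return of the steps before it.
print(nstep_returns([1.0, 1.0, 1.0], [0, 1, 0], last_value=10.0, gamma=0.5))  # [1.5, 1.0, 6.0]
```

In the second rollout the bootstrapped value still reaches the last step, but it does not leak across the episode boundary into the first two steps.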


@@ -1,26 +1,5 @@
import os
import gym
import numpy as np
import tensorflow as tf
from gym import spaces
from collections import deque
def sample(logits):
noise = tf.random_uniform(tf.shape(logits))
return tf.argmax(logits - tf.log(-tf.log(noise)), 1)
def cat_entropy(logits):
a0 = logits - tf.reduce_max(logits, 1, keep_dims=True)
ea0 = tf.exp(a0)
z0 = tf.reduce_sum(ea0, 1, keep_dims=True)
p0 = ea0 / z0
return tf.reduce_sum(p0 * (tf.log(z0) - a0), 1)
def cat_entropy_softmax(p0):
return - tf.reduce_sum(p0 * tf.log(p0 + 1e-6), axis = 1)
def mse(pred, target):
return tf.square(pred-target)/2.
def ortho_init(scale=1.0):
def _ortho_init(shape, dtype, partition_info=None):
@@ -39,117 +18,18 @@ def ortho_init(scale=1.0):
return (scale * q[:shape[0], :shape[1]]).astype(np.float32)
return _ortho_init
def conv(x, scope, *, nf, rf, stride, pad='VALID', init_scale=1.0, data_format='NHWC', one_dim_bias=False):
if data_format == 'NHWC':
channel_ax = 3
strides = [1, stride, stride, 1]
bshape = [1, 1, 1, nf]
elif data_format == 'NCHW':
channel_ax = 1
strides = [1, 1, stride, stride]
bshape = [1, nf, 1, 1]
else:
raise NotImplementedError
bias_var_shape = [nf] if one_dim_bias else [1, nf, 1, 1]
nin = x.get_shape()[channel_ax].value
wshape = [rf, rf, nin, nf]
with tf.variable_scope(scope):
w = tf.get_variable("w", wshape, initializer=ortho_init(init_scale))
b = tf.get_variable("b", bias_var_shape, initializer=tf.constant_initializer(0.0))
if not one_dim_bias and data_format == 'NHWC':
b = tf.reshape(b, bshape)
return b + tf.nn.conv2d(x, w, strides=strides, padding=pad, data_format=data_format)
def conv(scope, *, nf, rf, stride, activation, pad='valid', init_scale=1.0, data_format='channels_last'):
with tf.name_scope(scope):
layer = tf.keras.layers.Conv2D(filters=nf, kernel_size=rf, strides=stride, padding=pad,
data_format=data_format, kernel_initializer=ortho_init(init_scale))
return layer
def fc(x, scope, nh, *, init_scale=1.0, init_bias=0.0):
with tf.variable_scope(scope):
nin = x.get_shape()[1].value
w = tf.get_variable("w", [nin, nh], initializer=ortho_init(init_scale))
b = tf.get_variable("b", [nh], initializer=tf.constant_initializer(init_bias))
return tf.matmul(x, w)+b
def batch_to_seq(h, nbatch, nsteps, flat=False):
if flat:
h = tf.reshape(h, [nbatch, nsteps])
else:
h = tf.reshape(h, [nbatch, nsteps, -1])
return [tf.squeeze(v, [1]) for v in tf.split(axis=1, num_or_size_splits=nsteps, value=h)]
def seq_to_batch(h, flat = False):
shape = h[0].get_shape().as_list()
if not flat:
assert(len(shape) > 1)
nh = h[0].get_shape()[-1].value
return tf.reshape(tf.concat(axis=1, values=h), [-1, nh])
else:
return tf.reshape(tf.stack(values=h, axis=1), [-1])
def lstm(xs, ms, s, scope, nh, init_scale=1.0):
nbatch, nin = [v.value for v in xs[0].get_shape()]
nsteps = len(xs)
with tf.variable_scope(scope):
wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
b = tf.get_variable("b", [nh*4], initializer=tf.constant_initializer(0.0))
c, h = tf.split(axis=1, num_or_size_splits=2, value=s)
for idx, (x, m) in enumerate(zip(xs, ms)):
c = c*(1-m)
h = h*(1-m)
z = tf.matmul(x, wx) + tf.matmul(h, wh) + b
i, f, o, u = tf.split(axis=1, num_or_size_splits=4, value=z)
i = tf.nn.sigmoid(i)
f = tf.nn.sigmoid(f)
o = tf.nn.sigmoid(o)
u = tf.tanh(u)
c = f*c + i*u
h = o*tf.tanh(c)
xs[idx] = h
s = tf.concat(axis=1, values=[c, h])
return xs, s
def _ln(x, g, b, e=1e-5, axes=[1]):
u, s = tf.nn.moments(x, axes=axes, keep_dims=True)
x = (x-u)/tf.sqrt(s+e)
x = x*g+b
return x
def lnlstm(xs, ms, s, scope, nh, init_scale=1.0):
nbatch, nin = [v.value for v in xs[0].get_shape()]
nsteps = len(xs)
with tf.variable_scope(scope):
wx = tf.get_variable("wx", [nin, nh*4], initializer=ortho_init(init_scale))
gx = tf.get_variable("gx", [nh*4], initializer=tf.constant_initializer(1.0))
bx = tf.get_variable("bx", [nh*4], initializer=tf.constant_initializer(0.0))
wh = tf.get_variable("wh", [nh, nh*4], initializer=ortho_init(init_scale))
gh = tf.get_variable("gh", [nh*4], initializer=tf.constant_initializer(1.0))
bh = tf.get_variable("bh", [nh*4], initializer=tf.constant_initializer(0.0))
b = tf.get_variable("b", [nh*4], initializer=tf.constant_initializer(0.0))
gc = tf.get_variable("gc", [nh], initializer=tf.constant_initializer(1.0))
bc = tf.get_variable("bc", [nh], initializer=tf.constant_initializer(0.0))
c, h = tf.split(axis=1, num_or_size_splits=2, value=s)
for idx, (x, m) in enumerate(zip(xs, ms)):
c = c*(1-m)
h = h*(1-m)
z = _ln(tf.matmul(x, wx), gx, bx) + _ln(tf.matmul(h, wh), gh, bh) + b
i, f, o, u = tf.split(axis=1, num_or_size_splits=4, value=z)
i = tf.nn.sigmoid(i)
f = tf.nn.sigmoid(f)
o = tf.nn.sigmoid(o)
u = tf.tanh(u)
c = f*c + i*u
h = o*tf.tanh(_ln(c, gc, bc))
xs[idx] = h
s = tf.concat(axis=1, values=[c, h])
return xs, s
def conv_to_fc(x):
nh = np.prod([v.value for v in x.get_shape()[1:]])
x = tf.reshape(x, [-1, nh])
return x
def fc(input_shape, scope, nh, *, init_scale=1.0, init_bias=0.0):
with tf.name_scope(scope):
layer = tf.keras.layers.Dense(units=nh, kernel_initializer=ortho_init(init_scale),
bias_initializer=tf.keras.initializers.Constant(init_bias))
layer.build(input_shape)
return layer
def discount_with_dones(rewards, dones, gamma):
discounted = []
@@ -159,132 +39,25 @@ def discount_with_dones(rewards, dones, gamma):
discounted.append(r)
return discounted[::-1]
def find_trainable_variables(key):
with tf.variable_scope(key):
return tf.trainable_variables()
class InverseLinearTimeDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
def __init__(self, initial_learning_rate, nupdates, name="InverseLinearTimeDecay"):
super(InverseLinearTimeDecay, self).__init__()
self.initial_learning_rate = initial_learning_rate
self.nupdates = nupdates
self.name = name
def make_path(f):
return os.makedirs(f, exist_ok=True)
def __call__(self, step):
with tf.name_scope(self.name):
initial_learning_rate = tf.convert_to_tensor(self.initial_learning_rate, name="initial_learning_rate")
dtype = initial_learning_rate.dtype
step_t = tf.cast(step, dtype)
nupdates_t = tf.convert_to_tensor(self.nupdates, dtype=dtype)
tf.assert_less(step_t, nupdates_t)
return initial_learning_rate * (1. - step_t / nupdates_t)
def constant(p):
return 1
def linear(p):
return 1-p
def middle_drop(p):
eps = 0.75
if 1-p<eps:
return eps*0.1
return 1-p
def double_linear_con(p):
p *= 2
eps = 0.125
if 1-p<eps:
return eps
return 1-p
def double_middle_drop(p):
eps1 = 0.75
eps2 = 0.25
if 1-p<eps1:
if 1-p<eps2:
return eps2*0.5
return eps1*0.1
return 1-p
schedules = {
'linear':linear,
'constant':constant,
'double_linear_con': double_linear_con,
'middle_drop': middle_drop,
'double_middle_drop': double_middle_drop
}
class Scheduler(object):
def __init__(self, v, nvalues, schedule):
self.n = 0.
self.v = v
self.nvalues = nvalues
self.schedule = schedules[schedule]
def value(self):
current_value = self.v*self.schedule(self.n/self.nvalues)
self.n += 1.
return current_value
def value_steps(self, steps):
return self.v*self.schedule(steps/self.nvalues)
class EpisodeStats:
def __init__(self, nsteps, nenvs):
self.episode_rewards = []
for i in range(nenvs):
self.episode_rewards.append([])
self.lenbuffer = deque(maxlen=40) # rolling buffer for episode lengths
self.rewbuffer = deque(maxlen=40) # rolling buffer for episode rewards
self.nsteps = nsteps
self.nenvs = nenvs
def feed(self, rewards, masks):
rewards = np.reshape(rewards, [self.nenvs, self.nsteps])
masks = np.reshape(masks, [self.nenvs, self.nsteps])
for i in range(0, self.nenvs):
for j in range(0, self.nsteps):
self.episode_rewards[i].append(rewards[i][j])
if masks[i][j]:
l = len(self.episode_rewards[i])
s = sum(self.episode_rewards[i])
self.lenbuffer.append(l)
self.rewbuffer.append(s)
self.episode_rewards[i] = []
def mean_length(self):
if self.lenbuffer:
return np.mean(self.lenbuffer)
else:
return 0 # on the first params dump, no episodes are finished
def mean_reward(self):
if self.rewbuffer:
return np.mean(self.rewbuffer)
else:
return 0
# For ACER
def get_by_index(x, idx):
assert(len(x.get_shape()) == 2)
assert(len(idx.get_shape()) == 1)
idx_flattened = tf.range(0, x.shape[0]) * x.shape[1] + idx
y = tf.gather(tf.reshape(x, [-1]), # flatten input
idx_flattened) # use flattened indices
return y
def check_shape(ts,shapes):
i = 0
for (t,shape) in zip(ts,shapes):
assert t.get_shape().as_list()==shape, "id " + str(i) + " shape " + str(t.get_shape()) + str(shape)
i += 1
def avg_norm(t):
return tf.reduce_mean(tf.sqrt(tf.reduce_sum(tf.square(t), axis=-1)))
def gradient_add(g1, g2, param):
print([g1, g2, param.name])
assert (not (g1 is None and g2 is None)), param.name
if g1 is None:
return g2
elif g2 is None:
return g1
else:
return g1 + g2
def q_explained_variance(qpred, q):
_, vary = tf.nn.moments(q, axes=[0, 1])
_, varpred = tf.nn.moments(q - qpred, axes=[0, 1])
check_shape([vary, varpred], [[]] * 2)
return 1.0 - (varpred / vary)
def get_config(self):
return {
"initial_learning_rate": self.initial_learning_rate,
"nupdates": self.nupdates,
"name": self.name
}
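The `InverseLinearTimeDecay` schedule added in this file computes `initial_learning_rate * (1 - step / nupdates)`, the same shape as the removed `linear` schedule but indexed by update count. The plain-Python function below is a hypothetical stand-in written for illustration, without TensorFlow, so the arithmetic can be checked by hand. The `ValueError` is an assumed analogue of the `tf.assert_less(step_t, nupdates_t)` check.

```python
def linear_decay(initial_learning_rate, step, nupdates):
    """Learning rate after `step` updates, decaying linearly toward 0 at `nupdates`.

    Hypothetical helper; mirrors the formula in InverseLinearTimeDecay.__call__.
    """
    if not 0 <= step < nupdates:
        raise ValueError("step must satisfy 0 <= step < nupdates")
    return initial_learning_rate * (1.0 - step / nupdates)


nupdates = 4
for step in range(nupdates):
    # The rate falls by initial_learning_rate / nupdates each update.
    print(step, linear_decay(7e-4, step, nupdates))
```

With `nupdates = total_timesteps // nbatch`, as computed in `learn`, the last update still runs at a small positive rate of `initial_learning_rate / nupdates` and never reaches exactly zero.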


@@ -1,4 +0,0 @@
# ACER
- Original paper: https://arxiv.org/abs/1611.01224
- `python -m baselines.acer.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.


@@ -1,348 +0,0 @@
import time
import joblib
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds
from baselines.common.runners import AbstractEnvRunner
from baselines.a2c.utils import batch_to_seq, seq_to_batch
from baselines.a2c.utils import Scheduler, make_path, find_trainable_variables
from baselines.a2c.utils import cat_entropy_softmax
from baselines.a2c.utils import EpisodeStats
from baselines.a2c.utils import get_by_index, check_shape, avg_norm, gradient_add, q_explained_variance
from baselines.acer.buffer import Buffer
import os.path as osp
# remove last step
def strip(var, nenvs, nsteps, flat = False):
vars = batch_to_seq(var, nenvs, nsteps + 1, flat)
return seq_to_batch(vars[:-1], flat)
def q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma):
"""
Calculates q_retrace targets
:param R: Rewards
:param D: Dones
:param q_i: Q values for actions taken
:param v: V values
:param rho_i: Importance weight for each action
:return: Q_retrace values
"""
rho_bar = batch_to_seq(tf.minimum(1.0, rho_i), nenvs, nsteps, True) # list of len steps, shape [nenvs]
rs = batch_to_seq(R, nenvs, nsteps, True) # list of len steps, shape [nenvs]
ds = batch_to_seq(D, nenvs, nsteps, True) # list of len steps, shape [nenvs]
q_is = batch_to_seq(q_i, nenvs, nsteps, True)
vs = batch_to_seq(v, nenvs, nsteps + 1, True)
v_final = vs[-1]
qret = v_final
qrets = []
for i in range(nsteps - 1, -1, -1):
check_shape([qret, ds[i], rs[i], rho_bar[i], q_is[i], vs[i]], [[nenvs]] * 6)
qret = rs[i] + gamma * qret * (1.0 - ds[i])
qrets.append(qret)
qret = (rho_bar[i] * (qret - q_is[i])) + vs[i]
qrets = qrets[::-1]
qret = seq_to_batch(qrets, flat=True)
return qret
# For ACER with PPO clipping instead of trust region
# def clip(ratio, eps_clip):
# # assume 0 <= eps_clip <= 1
# return tf.minimum(1 + eps_clip, tf.maximum(1 - eps_clip, ratio))
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs, nsteps, nstack, num_procs,
ent_coef, q_coef, gamma, max_grad_norm, lr,
rprop_alpha, rprop_epsilon, total_timesteps, lrschedule,
c, trust_region, alpha, delta):
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=num_procs,
inter_op_parallelism_threads=num_procs)
sess = tf.Session(config=config)
nact = ac_space.n
nbatch = nenvs * nsteps
A = tf.placeholder(tf.int32, [nbatch]) # actions
D = tf.placeholder(tf.float32, [nbatch]) # dones
R = tf.placeholder(tf.float32, [nbatch]) # rewards, not returns
MU = tf.placeholder(tf.float32, [nbatch, nact]) # mu's
LR = tf.placeholder(tf.float32, [])
eps = 1e-6
step_model = policy(sess, ob_space, ac_space, nenvs, 1, nstack, reuse=False)
train_model = policy(sess, ob_space, ac_space, nenvs, nsteps + 1, nstack, reuse=True)
params = find_trainable_variables("model")
print("Params {}".format(len(params)))
for var in params:
print(var)
# create polyak averaged model
ema = tf.train.ExponentialMovingAverage(alpha)
ema_apply_op = ema.apply(params)
def custom_getter(getter, *args, **kwargs):
v = ema.average(getter(*args, **kwargs))
print(v.name)
return v
with tf.variable_scope("", custom_getter=custom_getter, reuse=True):
polyak_model = policy(sess, ob_space, ac_space, nenvs, nsteps + 1, nstack, reuse=True)
# Notation: (var) = batch variable, (var)s = sequence variable, (var)_i = variable index by action at step i
v = tf.reduce_sum(train_model.pi * train_model.q, axis = -1) # shape is [nenvs * (nsteps + 1)]
# strip off last step
f, f_pol, q = map(lambda var: strip(var, nenvs, nsteps), [train_model.pi, polyak_model.pi, train_model.q])
# Get pi and q values for actions taken
f_i = get_by_index(f, A)
q_i = get_by_index(q, A)
# Compute ratios for importance truncation
rho = f / (MU + eps)
rho_i = get_by_index(rho, A)
# Calculate Q_retrace targets
qret = q_retrace(R, D, q_i, v, rho_i, nenvs, nsteps, gamma)
# Calculate losses
# Entropy
entropy = tf.reduce_mean(cat_entropy_softmax(f))
# Policy Gradient loss, with truncated importance sampling & bias correction
v = strip(v, nenvs, nsteps, True)
check_shape([qret, v, rho_i, f_i], [[nenvs * nsteps]] * 4)
check_shape([rho, f, q], [[nenvs * nsteps, nact]] * 2)
# Truncated importance sampling
adv = qret - v
logf = tf.log(f_i + eps)
gain_f = logf * tf.stop_gradient(adv * tf.minimum(c, rho_i)) # [nenvs * nsteps]
loss_f = -tf.reduce_mean(gain_f)
# Bias correction for the truncation
adv_bc = (q - tf.reshape(v, [nenvs * nsteps, 1])) # [nenvs * nsteps, nact]
logf_bc = tf.log(f + eps) # / (f_old + eps)
check_shape([adv_bc, logf_bc], [[nenvs * nsteps, nact]]*2)
gain_bc = tf.reduce_sum(logf_bc * tf.stop_gradient(adv_bc * tf.nn.relu(1.0 - (c / (rho + eps))) * f), axis = 1) #IMP: This is sum, as expectation wrt f
loss_bc= -tf.reduce_mean(gain_bc)
loss_policy = loss_f + loss_bc
# Value/Q function loss, and explained variance
check_shape([qret, q_i], [[nenvs * nsteps]]*2)
ev = q_explained_variance(tf.reshape(q_i, [nenvs, nsteps]), tf.reshape(qret, [nenvs, nsteps]))
loss_q = tf.reduce_mean(tf.square(tf.stop_gradient(qret) - q_i)*0.5)
# Net loss
check_shape([loss_policy, loss_q, entropy], [[]] * 3)
loss = loss_policy + q_coef * loss_q - ent_coef * entropy
if trust_region:
g = tf.gradients(- (loss_policy - ent_coef * entropy) * nsteps * nenvs, f) #[nenvs * nsteps, nact]
# k = tf.gradients(KL(f_pol || f), f)
k = - f_pol / (f + eps) #[nenvs * nsteps, nact] # Directly computed gradient of KL divergence wrt f
k_dot_g = tf.reduce_sum(k * g, axis=-1)
adj = tf.maximum(0.0, (tf.reduce_sum(k * g, axis=-1) - delta) / (tf.reduce_sum(tf.square(k), axis=-1) + eps)) #[nenvs * nsteps]
# Calculate stats (before doing adjustment) for logging.
avg_norm_k = avg_norm(k)
avg_norm_g = avg_norm(g)
avg_norm_k_dot_g = tf.reduce_mean(tf.abs(k_dot_g))
avg_norm_adj = tf.reduce_mean(tf.abs(adj))
g = g - tf.reshape(adj, [nenvs * nsteps, 1]) * k
grads_f = -g/(nenvs*nsteps) # These are trust region adjusted gradients wrt f ie statistics of policy pi
grads_policy = tf.gradients(f, params, grads_f)
grads_q = tf.gradients(loss_q * q_coef, params)
grads = [gradient_add(g1, g2, param) for (g1, g2, param) in zip(grads_policy, grads_q, params)]
avg_norm_grads_f = avg_norm(grads_f) * (nsteps * nenvs)
norm_grads_q = tf.global_norm(grads_q)
norm_grads_policy = tf.global_norm(grads_policy)
else:
grads = tf.gradients(loss, params)
if max_grad_norm is not None:
grads, norm_grads = tf.clip_by_global_norm(grads, max_grad_norm)
grads = list(zip(grads, params))
trainer = tf.train.RMSPropOptimizer(learning_rate=LR, decay=rprop_alpha, epsilon=rprop_epsilon)
_opt_op = trainer.apply_gradients(grads)
# so when you call _train, you first do the gradient step, then you apply ema
with tf.control_dependencies([_opt_op]):
_train = tf.group(ema_apply_op)
lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
# Ops/Summaries to run, and their names for logging
run_ops = [_train, loss, loss_q, entropy, loss_policy, loss_f, loss_bc, ev, norm_grads]
names_ops = ['loss', 'loss_q', 'entropy', 'loss_policy', 'loss_f', 'loss_bc', 'explained_variance',
'norm_grads']
if trust_region:
run_ops = run_ops + [norm_grads_q, norm_grads_policy, avg_norm_grads_f, avg_norm_k, avg_norm_g, avg_norm_k_dot_g,
avg_norm_adj]
names_ops = names_ops + ['norm_grads_q', 'norm_grads_policy', 'avg_norm_grads_f', 'avg_norm_k', 'avg_norm_g',
'avg_norm_k_dot_g', 'avg_norm_adj']
def train(obs, actions, rewards, dones, mus, states, masks, steps):
cur_lr = lr.value_steps(steps)
td_map = {train_model.X: obs, polyak_model.X: obs, A: actions, R: rewards, D: dones, MU: mus, LR: cur_lr}
if states != []:
td_map[train_model.S] = states
td_map[train_model.M] = masks
td_map[polyak_model.S] = states
td_map[polyak_model.M] = masks
return names_ops, sess.run(run_ops, td_map)[1:] # strip off _train
def save(save_path):
ps = sess.run(params)
make_path(osp.dirname(save_path))
joblib.dump(ps, save_path)
self.train = train
self.save = save
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
class Runner(AbstractEnvRunner):
def __init__(self, env, model, nsteps, nstack):
super().__init__(env=env, model=model, nsteps=nsteps)
self.nstack = nstack
nh, nw, nc = env.observation_space.shape
self.nc = nc # nc = 1 for atari, but just in case
self.nenv = nenv = env.num_envs
self.nact = env.action_space.n
self.nbatch = nenv * nsteps
self.batch_ob_shape = (nenv*(nsteps+1), nh, nw, nc*nstack)
self.obs = np.zeros((nenv, nh, nw, nc * nstack), dtype=np.uint8)
obs = env.reset()
self.update_obs(obs)
def update_obs(self, obs, dones=None):
if dones is not None:
self.obs *= (1 - dones.astype(np.uint8))[:, None, None, None]
self.obs = np.roll(self.obs, shift=-self.nc, axis=3)
self.obs[:, :, :, -self.nc:] = obs[:, :, :, :]
def run(self):
enc_obs = np.split(self.obs, self.nstack, axis=3) # so now list of obs steps
mb_obs, mb_actions, mb_mus, mb_dones, mb_rewards = [], [], [], [], []
for _ in range(self.nsteps):
actions, mus, states = self.model.step(self.obs, state=self.states, mask=self.dones)
mb_obs.append(np.copy(self.obs))
mb_actions.append(actions)
mb_mus.append(mus)
mb_dones.append(self.dones)
obs, rewards, dones, _ = self.env.step(actions)
# state information for stateful models like LSTM
self.states = states
self.dones = dones
self.update_obs(obs, dones)
mb_rewards.append(rewards)
enc_obs.append(obs)
mb_obs.append(np.copy(self.obs))
mb_dones.append(self.dones)
enc_obs = np.asarray(enc_obs, dtype=np.uint8).swapaxes(1, 0)
mb_obs = np.asarray(mb_obs, dtype=np.uint8).swapaxes(1, 0)
mb_actions = np.asarray(mb_actions, dtype=np.int32).swapaxes(1, 0)
mb_rewards = np.asarray(mb_rewards, dtype=np.float32).swapaxes(1, 0)
mb_mus = np.asarray(mb_mus, dtype=np.float32).swapaxes(1, 0)
mb_dones = np.asarray(mb_dones, dtype=bool).swapaxes(1, 0)
mb_masks = mb_dones # Used by stateful models like LSTMs to mask state when done
mb_dones = mb_dones[:, 1:] # Used for calculating returns. The dones array is now aligned with rewards
# shapes are now [nenv, nsteps, []]
# When pulling from buffer, arrays will now be reshaped in place, preventing a deep copy.
return enc_obs, mb_obs, mb_actions, mb_rewards, mb_mus, mb_dones, mb_masks
class Acer():
def __init__(self, runner, model, buffer, log_interval):
self.runner = runner
self.model = model
self.buffer = buffer
self.log_interval = log_interval
self.tstart = None
self.episode_stats = EpisodeStats(runner.nsteps, runner.nenv)
self.steps = None
def call(self, on_policy):
runner, model, buffer, steps = self.runner, self.model, self.buffer, self.steps
if on_policy:
enc_obs, obs, actions, rewards, mus, dones, masks = runner.run()
self.episode_stats.feed(rewards, dones)
if buffer is not None:
buffer.put(enc_obs, actions, rewards, mus, dones, masks)
else:
# get obs, actions, rewards, mus, dones, masks from buffer.
obs, actions, rewards, mus, dones, masks = buffer.get()
# reshape stuff correctly
obs = obs.reshape(runner.batch_ob_shape)
actions = actions.reshape([runner.nbatch])
rewards = rewards.reshape([runner.nbatch])
mus = mus.reshape([runner.nbatch, runner.nact])
dones = dones.reshape([runner.nbatch])
masks = masks.reshape([runner.batch_ob_shape[0]])
names_ops, values_ops = model.train(obs, actions, rewards, dones, mus, model.initial_state, masks, steps)
if on_policy and (int(steps/runner.nbatch) % self.log_interval == 0):
logger.record_tabular("total_timesteps", steps)
logger.record_tabular("fps", int(steps/(time.time() - self.tstart)))
# Important: in an EpisodicLife env, training sees done=True at each loss of life, not just at the terminal state.
# Thus, these means are taken per life, not per episode.
# For true episode rewards, see the monitor files in the log folder.
logger.record_tabular("mean_episode_length", self.episode_stats.mean_length())
logger.record_tabular("mean_episode_reward", self.episode_stats.mean_reward())
for name, val in zip(names_ops, values_ops):
logger.record_tabular(name, float(val))
logger.dump_tabular()
def learn(policy, env, seed, nsteps=20, nstack=4, total_timesteps=int(80e6), q_coef=0.5, ent_coef=0.01,
max_grad_norm=10, lr=7e-4, lrschedule='linear', rprop_epsilon=1e-5, rprop_alpha=0.99, gamma=0.99,
log_interval=100, buffer_size=50000, replay_ratio=4, replay_start=10000, c=10.0,
trust_region=True, alpha=0.99, delta=1):
print("Running Acer Simple")
print(locals())
tf.reset_default_graph()
set_global_seeds(seed)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
num_procs = len(env.remotes) # HACK
model = Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nenvs=nenvs, nsteps=nsteps, nstack=nstack,
num_procs=num_procs, ent_coef=ent_coef, q_coef=q_coef, gamma=gamma,
max_grad_norm=max_grad_norm, lr=lr, rprop_alpha=rprop_alpha, rprop_epsilon=rprop_epsilon,
total_timesteps=total_timesteps, lrschedule=lrschedule, c=c,
trust_region=trust_region, alpha=alpha, delta=delta)
runner = Runner(env=env, model=model, nsteps=nsteps, nstack=nstack)
if replay_ratio > 0:
buffer = Buffer(env=env, nsteps=nsteps, nstack=nstack, size=buffer_size)
else:
buffer = None
nbatch = nenvs*nsteps
acer = Acer(runner, model, buffer, log_interval)
acer.tstart = time.time()
for acer.steps in range(0, total_timesteps, nbatch): #nbatch samples, 1 on_policy call and multiple off-policy calls
acer.call(on_policy=True)
if replay_ratio > 0 and buffer.has_atleast(replay_start):
n = np.random.poisson(replay_ratio)
for _ in range(n):
acer.call(on_policy=False) # no simulation steps in this
env.close()
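The trust-region branch in `Model` shrinks the gradient `g` along `k` only when their inner product exceeds `delta`. A minimal NumPy sketch of that projection, assuming per-sample arrays of shape `[batch, nact]`; `trust_region_adjust` is a hypothetical helper name, not part of baselines, and `eps` is passed explicitly here rather than taken from the surrounding scope:

```python
import numpy as np

def trust_region_adjust(g, k, delta=1.0, eps=1e-6):
    """Return g with its component along k reduced so that k . g <= delta.

    g, k: arrays of shape [batch, nact] (hypothetical stand-ins for the
    tensors of the same names in the TensorFlow graph).
    """
    k_dot_g = np.sum(k * g, axis=-1)                      # [batch]
    k_sq = np.sum(np.square(k), axis=-1) + eps            # [batch]
    adj = np.maximum(0.0, (k_dot_g - delta) / k_sq)       # [batch]
    return g - adj[:, None] * k

# Row 0 violates the constraint (k . g = 2 > delta); row 1 does not.
g = np.array([[2.0, 0.0], [0.1, 0.0]])
k = np.array([[1.0, 0.0], [1.0, 0.0]])
adjusted = trust_region_adjust(g, k, delta=1.0, eps=0.0)
```

With `eps=0.0`, the first row is pulled back until `k . g` equals `delta`, and the second row is left unchanged because the adjustment is clamped at zero.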


@@ -1,103 +0,0 @@
import numpy as np
class Buffer(object):
# gets obs, actions, rewards, mu's, (states, masks), dones
def __init__(self, env, nsteps, nstack, size=50000):
self.nenv = env.num_envs
self.nsteps = nsteps
self.nh, self.nw, self.nc = env.observation_space.shape
self.nstack = nstack
self.nbatch = self.nenv * self.nsteps
self.size = size // (self.nsteps) # Each loc contains nenv * nsteps frames, thus total buffer is nenv * size frames
# Memory
self.enc_obs = None
self.actions = None
self.rewards = None
self.mus = None
self.dones = None
self.masks = None
# Size indexes
self.next_idx = 0
self.num_in_buffer = 0
def has_atleast(self, frames):
# Frames per env, so total (nenv * frames) Frames needed
# Each buffer loc has nenv * nsteps frames
return self.num_in_buffer >= (frames // self.nsteps)
def can_sample(self):
return self.num_in_buffer > 0
# Generate stacked frames
def decode(self, enc_obs, dones):
# enc_obs has shape [nenv, nsteps + nstack, nh, nw, nc]
# dones has shape [nenv, nsteps]
# returns stacked obs of shape [nenv, (nsteps + 1), nh, nw, nstack*nc]
nstack, nenv, nsteps, nh, nw, nc = self.nstack, self.nenv, self.nsteps, self.nh, self.nw, self.nc
y = np.empty([nsteps + nstack - 1, nenv, 1, 1, 1], dtype=np.float32)
obs = np.zeros([nstack, nsteps + nstack, nenv, nh, nw, nc], dtype=np.uint8)
x = np.reshape(enc_obs, [nenv, nsteps + nstack, nh, nw, nc]).swapaxes(1,
0) # [nsteps + nstack, nenv, nh, nw, nc]
y[nstack - 1:] = np.reshape(1.0 - dones, [nenv, nsteps, 1, 1, 1]).swapaxes(1, 0) # keep
y[:nstack - 1] = 1.0
# y = np.reshape(1 - dones, [nenvs, nsteps, 1, 1, 1])
for i in range(nstack):
obs[-(i + 1), i:] = x
# obs[:,i:,:,:,-(i+1),:] = x
x = x[:-1] * y
y = y[1:]
return np.reshape(obs[:, nstack - 1:].transpose((2, 1, 3, 4, 0, 5)), [nenv, (nsteps + 1), nh, nw, nstack * nc])
def put(self, enc_obs, actions, rewards, mus, dones, masks):
# enc_obs [nenv, (nsteps + nstack), nh, nw, nc]
# actions, rewards, dones [nenv, nsteps]
# mus [nenv, nsteps, nact]
if self.enc_obs is None:
self.enc_obs = np.empty([self.size] + list(enc_obs.shape), dtype=np.uint8)
self.actions = np.empty([self.size] + list(actions.shape), dtype=np.int32)
self.rewards = np.empty([self.size] + list(rewards.shape), dtype=np.float32)
self.mus = np.empty([self.size] + list(mus.shape), dtype=np.float32)
self.dones = np.empty([self.size] + list(dones.shape), dtype=bool)
self.masks = np.empty([self.size] + list(masks.shape), dtype=bool)
self.enc_obs[self.next_idx] = enc_obs
self.actions[self.next_idx] = actions
self.rewards[self.next_idx] = rewards
self.mus[self.next_idx] = mus
self.dones[self.next_idx] = dones
self.masks[self.next_idx] = masks
self.next_idx = (self.next_idx + 1) % self.size
self.num_in_buffer = min(self.size, self.num_in_buffer + 1)
def take(self, x, idx, envx):
nenv = self.nenv
out = np.empty([nenv] + list(x.shape[2:]), dtype=x.dtype)
for i in range(nenv):
out[i] = x[idx[i], envx[i]]
return out
def get(self):
# returns
# obs [nenv, (nsteps + 1), nh, nw, nstack*nc]
# actions, rewards, dones [nenv, nsteps]
# mus [nenv, nsteps, nact]
nenv = self.nenv
assert self.can_sample()
# Sample exactly one id per env. If you sample across envs, then higher correlation in samples from same env.
idx = np.random.randint(0, self.num_in_buffer, nenv)
envx = np.arange(nenv)
take = lambda x: self.take(x, idx, envx)
dones = take(self.dones)
enc_obs = take(self.enc_obs)
obs = self.decode(enc_obs, dones)
actions = take(self.actions)
rewards = take(self.rewards)
mus = take(self.mus)
masks = take(self.masks)
return obs, actions, rewards, mus, dones, masks
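`put` writes into a fixed-size ring, and `take` picks one stored slot per environment. A small sketch of both ideas, assuming a toy integer array in place of the real observation buffers; `RingIndex`, `take_loop` and `take_vectorized` are illustrative names that do not exist in baselines:

```python
import numpy as np

class RingIndex:
    """Tracks the write position and fill level of a fixed-size ring buffer."""

    def __init__(self, size):
        self.size = size
        self.next_idx = 0
        self.num_in_buffer = 0

    def advance(self):
        # Return the slot to write to, then move on, wrapping at `size`.
        slot = self.next_idx
        self.next_idx = (self.next_idx + 1) % self.size
        self.num_in_buffer = min(self.size, self.num_in_buffer + 1)
        return slot

def take_loop(x, idx, envx):
    # Per-environment loop, mirroring Buffer.take.
    out = np.empty([len(envx)] + list(x.shape[2:]), dtype=x.dtype)
    for i in range(len(envx)):
        out[i] = x[idx[i], envx[i]]
    return out

def take_vectorized(x, idx, envx):
    # NumPy advanced indexing selects x[idx[i], envx[i]] for every i at once.
    return x[idx, envx]

ring = RingIndex(size=3)
slots = [ring.advance() for _ in range(5)]          # writes wrap around

rng = np.random.default_rng(0)
x = rng.integers(0, 255, size=(3, 4, 2))            # [size, nenv, feature]
idx = rng.integers(0, ring.num_in_buffer, size=4)   # one slot per env
envx = np.arange(4)
same = np.array_equal(take_loop(x, idx, envx), take_vectorized(x, idx, envx))
```

Sampling one slot index per environment, as `get` does, keeps each environment's column in the batch while drawing from different points in time.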


@@ -1,79 +0,0 @@
import numpy as np
import tensorflow as tf
from baselines.ppo2.policies import nature_cnn
from baselines.a2c.utils import fc, batch_to_seq, seq_to_batch, lstm, sample
class AcerCnnPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False):
nbatch = nenv * nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc * nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) # obs
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
pi_logits = fc(h, 'pi', nact, init_scale=0.01)
pi = tf.nn.softmax(pi_logits)
q = fc(h, 'q', nact)
a = sample(pi_logits) # could change this to use self.pi instead
self.initial_state = [] # not stateful
self.X = X
self.pi = pi # actual policy params now
self.q = q
def step(ob, *args, **kwargs):
# returns actions, mus, states
a0, pi0 = sess.run([a, pi], {X: ob})
return a0, pi0, [] # dummy state
def out(ob, *args, **kwargs):
pi0, q0 = sess.run([pi, q], {X: ob})
return pi0, q0
def act(ob, *args, **kwargs):
return sess.run(a, {X: ob})
self.step = step
self.out = out
self.act = act
class AcerLstmPolicy(object):
def __init__(self, sess, ob_space, ac_space, nenv, nsteps, nstack, reuse=False, nlstm=256):
nbatch = nenv * nsteps
nh, nw, nc = ob_space.shape
ob_shape = (nbatch, nh, nw, nc * nstack)
nact = ac_space.n
X = tf.placeholder(tf.uint8, ob_shape) # obs
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
S = tf.placeholder(tf.float32, [nenv, nlstm*2]) #states
with tf.variable_scope("model", reuse=reuse):
h = nature_cnn(X)
# lstm
xs = batch_to_seq(h, nenv, nsteps)
ms = batch_to_seq(M, nenv, nsteps)
h5, snew = lstm(xs, ms, S, 'lstm1', nh=nlstm)
h5 = seq_to_batch(h5)
pi_logits = fc(h5, 'pi', nact, init_scale=0.01)
pi = tf.nn.softmax(pi_logits)
q = fc(h5, 'q', nact)
a = sample(pi_logits) # could change this to use self.pi instead
self.initial_state = np.zeros((nenv, nlstm*2), dtype=np.float32)
self.X = X
self.M = M
self.S = S
self.pi = pi # actual policy params now
self.q = q
def step(ob, state, mask, *args, **kwargs):
# returns actions, mus, states
a0, pi0, s = sess.run([a, pi, snew], {X: ob, S: state, M: mask})
return a0, pi0, s
self.step = step


@@ -1,30 +0,0 @@
#!/usr/bin/env python3
from baselines import logger
from baselines.acer.acer_simple import learn
from baselines.acer.policies import AcerCnnPolicy, AcerLstmPolicy
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
def train(env_id, num_timesteps, seed, policy, lrschedule, num_cpu):
env = make_atari_env(env_id, num_cpu, seed)
if policy == 'cnn':
policy_fn = AcerCnnPolicy
elif policy == 'lstm':
policy_fn = AcerLstmPolicy
else:
print("Policy {} not implemented".format(policy))
return
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), lrschedule=lrschedule)
env.close()
def main():
parser = atari_arg_parser()
parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='cnn')
parser.add_argument('--lrschedule', help='Learning rate schedule', choices=['constant', 'linear'], default='constant')
parser.add_argument('--logdir', help='Directory for logging')
args = parser.parse_args()
logger.configure(args.logdir)
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed,
policy=args.policy, lrschedule=args.lrschedule, num_cpu=16)
if __name__ == '__main__':
main()


@@ -1,5 +0,0 @@
# ACKTR
- Original paper: https://arxiv.org/abs/1708.05144
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.acktr.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
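
The "40M frames = 10M timesteps" figure follows from frame skipping: each agent step repeats its action over several emulator frames. A minimal sketch of the conversion, assuming a frame skip of 4 (the usual Atari preprocessing setting; the constant and helper name are illustrative, not taken from this repository):

```python
FRAME_SKIP = 4  # assumption: each agent timestep spans 4 emulator frames

def timesteps_to_frames(timesteps, frame_skip=FRAME_SKIP):
    """Convert agent timesteps to emulator frames."""
    return timesteps * frame_skip

frames = timesteps_to_frames(10_000_000)
print(frames)  # 40000000
```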


@@ -1,142 +0,0 @@
import numpy as np
import tensorflow as tf
from baselines import logger
import baselines.common as common
from baselines.common import tf_util as U
from baselines.acktr import kfac
from baselines.common.filters import ZFilter
def pathlength(path):
return path["reward"].shape[0]  # number of timesteps in the path
def rollout(env, policy, max_pathlength, animate=False, obfilter=None):
"""
Simulate the env and policy for max_pathlength steps
"""
ob = env.reset()
prev_ob = np.float32(np.zeros(ob.shape))
if obfilter: ob = obfilter(ob)
terminated = False
obs = []
acs = []
ac_dists = []
logps = []
rewards = []
for _ in range(max_pathlength):
if animate:
env.render()
state = np.concatenate([ob, prev_ob], -1)
obs.append(state)
ac, ac_dist, logp = policy.act(state)
acs.append(ac)
ac_dists.append(ac_dist)
logps.append(logp)
prev_ob = np.copy(ob)
scaled_ac = env.action_space.low + (ac + 1.) * 0.5 * (env.action_space.high - env.action_space.low)
scaled_ac = np.clip(scaled_ac, env.action_space.low, env.action_space.high)
ob, rew, done, _ = env.step(scaled_ac)
if obfilter: ob = obfilter(ob)
rewards.append(rew)
if done:
terminated = True
break
return {"observation" : np.array(obs), "terminated" : terminated,
"reward" : np.array(rewards), "action" : np.array(acs),
"action_dist": np.array(ac_dists), "logp" : np.array(logps)}
def learn(env, policy, vf, gamma, lam, timesteps_per_batch, num_timesteps,
animate=False, callback=None, desired_kl=0.002):
obfilter = ZFilter(env.observation_space.shape)
max_pathlength = env.spec.timestep_limit
stepsize = tf.Variable(initial_value=np.float32(np.array(0.03)), name='stepsize')
inputs, loss, loss_sampled = policy.update_info
optim = kfac.KfacOptimizer(learning_rate=stepsize, cold_lr=stepsize*(1-0.9), momentum=0.9, kfac_update=2,\
epsilon=1e-2, stats_decay=0.99, async=1, cold_iter=1,
weight_decay_dict=policy.wd_dict, max_grad_norm=None)
pi_var_list = []
for var in tf.trainable_variables():
if "pi" in var.name:
pi_var_list.append(var)
update_op, q_runner = optim.minimize(loss, loss_sampled, var_list=pi_var_list)
do_update = U.function(inputs, update_op)
U.initialize()
# start queue runners
enqueue_threads = []
coord = tf.train.Coordinator()
for qr in [q_runner, vf.q_runner]:
assert qr is not None
enqueue_threads.extend(qr.create_threads(tf.get_default_session(), coord=coord, start=True))
i = 0
timesteps_so_far = 0
while True:
if timesteps_so_far > num_timesteps:
break
logger.log("********** Iteration %i ************"%i)
# Collect paths until we have enough timesteps
timesteps_this_batch = 0
paths = []
while True:
path = rollout(env, policy, max_pathlength, animate=(len(paths)==0 and (i % 10 == 0) and animate), obfilter=obfilter)
paths.append(path)
n = pathlength(path)
timesteps_this_batch += n
timesteps_so_far += n
if timesteps_this_batch > timesteps_per_batch:
break
# Estimate advantage function
vtargs = []
advs = []
for path in paths:
rew_t = path["reward"]
return_t = common.discount(rew_t, gamma)
vtargs.append(return_t)
vpred_t = vf.predict(path)
vpred_t = np.append(vpred_t, 0.0 if path["terminated"] else vpred_t[-1])
delta_t = rew_t + gamma*vpred_t[1:] - vpred_t[:-1]
adv_t = common.discount(delta_t, gamma * lam)
advs.append(adv_t)
# Update value function
vf.fit(paths, vtargs)
# Build arrays for policy update
ob_no = np.concatenate([path["observation"] for path in paths])
action_na = np.concatenate([path["action"] for path in paths])
oldac_dist = np.concatenate([path["action_dist"] for path in paths])
adv_n = np.concatenate(advs)
standardized_adv_n = (adv_n - adv_n.mean()) / (adv_n.std() + 1e-8)
# Policy update
do_update(ob_no, action_na, standardized_adv_n)
min_stepsize = np.float32(1e-8)
max_stepsize = np.float32(1e0)
# Adjust stepsize
kl = policy.compute_kl(ob_no, oldac_dist)
if kl > desired_kl * 2:
logger.log("kl too high")
tf.assign(stepsize, tf.maximum(min_stepsize, stepsize / 1.5)).eval()
elif kl < desired_kl / 2:
logger.log("kl too low")
tf.assign(stepsize, tf.minimum(max_stepsize, stepsize * 1.5)).eval()
else:
logger.log("kl just right!")
logger.record_tabular("EpRewMean", np.mean([path["reward"].sum() for path in paths]))
logger.record_tabular("EpRewSEM", np.std([path["reward"].sum() for path in paths]) / np.sqrt(len(paths)))
logger.record_tabular("EpLenMean", np.mean([pathlength(path) for path in paths]))
logger.record_tabular("KL", kl)
if callback:
callback()
logger.dump_tabular()
i += 1
coord.request_stop()
coord.join(enqueue_threads)
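The advantage estimate in `learn` is generalized advantage estimation: TD residuals are discounted by `gamma * lam`, and the value after the last step is bootstrapped unless the path terminated. A self-contained sketch, assuming a plain-Python `discount` as a stand-in for `common.discount` (whose implementation is not shown here); `gae_advantages` is a hypothetical helper name:

```python
import numpy as np

def discount(x, gamma):
    """Discounted cumulative sum: y[t] = sum_k gamma**k * x[t + k]."""
    out = np.zeros(len(x), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + gamma * running
        out[t] = running
    return out

def gae_advantages(rewards, vpred, terminated, gamma, lam):
    """Advantages for one path, following the loop body in learn()."""
    # Append the bootstrap value: 0 after a terminal state, else the last prediction.
    vpred = np.append(vpred, 0.0 if terminated else vpred[-1])
    delta = rewards + gamma * vpred[1:] - vpred[:-1]   # TD residuals
    return discount(delta, gamma * lam)

rewards = np.array([1.0, 1.0, 1.0])
vpred = np.zeros(3)
adv = gae_advantages(rewards, vpred, terminated=True, gamma=0.5, lam=1.0)
```

With zero value predictions and `lam=1.0`, the advantages reduce to the discounted returns, here `[1.75, 1.5, 1.0]`.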


@@ -1,155 +0,0 @@
import os.path as osp
import time
import joblib
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.common import set_global_seeds, explained_variance
from baselines.a2c.a2c import Runner
from baselines.a2c.utils import discount_with_dones
from baselines.a2c.utils import Scheduler, find_trainable_variables
from baselines.a2c.utils import cat_entropy, mse
from baselines.acktr import kfac
class Model(object):
def __init__(self, policy, ob_space, ac_space, nenvs,total_timesteps, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, lrschedule='linear'):
config = tf.ConfigProto(allow_soft_placement=True,
intra_op_parallelism_threads=nprocs,
inter_op_parallelism_threads=nprocs)
config.gpu_options.allow_growth = True
self.sess = sess = tf.Session(config=config)
nact = ac_space.n
nbatch = nenvs * nsteps
A = tf.placeholder(tf.int32, [nbatch])
ADV = tf.placeholder(tf.float32, [nbatch])
R = tf.placeholder(tf.float32, [nbatch])
PG_LR = tf.placeholder(tf.float32, [])
VF_LR = tf.placeholder(tf.float32, [])
self.model = step_model = policy(sess, ob_space, ac_space, nenvs, 1, reuse=False)
self.model2 = train_model = policy(sess, ob_space, ac_space, nenvs*nsteps, nsteps, reuse=True)
logpac = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_model.pi, labels=A)
self.logits = logits = train_model.pi
##training loss
pg_loss = tf.reduce_mean(ADV*logpac)
entropy = tf.reduce_mean(cat_entropy(train_model.pi))
pg_loss = pg_loss - ent_coef * entropy
vf_loss = tf.reduce_mean(mse(tf.squeeze(train_model.vf), R))
train_loss = pg_loss + vf_coef * vf_loss
##Fisher loss construction
self.pg_fisher = pg_fisher_loss = -tf.reduce_mean(logpac)
sample_net = train_model.vf + tf.random_normal(tf.shape(train_model.vf))
self.vf_fisher = vf_fisher_loss = - vf_fisher_coef*tf.reduce_mean(tf.pow(train_model.vf - tf.stop_gradient(sample_net), 2))
self.joint_fisher = joint_fisher_loss = pg_fisher_loss + vf_fisher_loss
self.params=params = find_trainable_variables("model")
self.grads_check = grads = tf.gradients(train_loss,params)
with tf.device('/gpu:0'):
self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,\
momentum=0.9, kfac_update=1, epsilon=0.01,\
stats_decay=0.99, async=1, cold_iter=10, max_grad_norm=max_grad_norm)
update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
self.q_runner = q_runner
self.lr = Scheduler(v=lr, nvalues=total_timesteps, schedule=lrschedule)
def train(obs, states, rewards, masks, actions, values):
advs = rewards - values
for step in range(len(obs)):
cur_lr = self.lr.value()
td_map = {train_model.X:obs, A:actions, ADV:advs, R:rewards, PG_LR:cur_lr}
if states is not None:
td_map[train_model.S] = states
td_map[train_model.M] = masks
policy_loss, value_loss, policy_entropy, _ = sess.run(
[pg_loss, vf_loss, entropy, train_op],
td_map
)
return policy_loss, value_loss, policy_entropy
def save(save_path):
ps = sess.run(params)
joblib.dump(ps, save_path)
def load(load_path):
loaded_params = joblib.load(load_path)
restores = []
for p, loaded_p in zip(params, loaded_params):
restores.append(p.assign(loaded_p))
sess.run(restores)
self.train = train
self.save = save
self.load = load
self.train_model = train_model
self.step_model = step_model
self.step = step_model.step
self.value = step_model.value
self.initial_state = step_model.initial_state
tf.global_variables_initializer().run(session=sess)
def learn(policy, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
kfac_clip=0.001, save_interval=None, lrschedule='linear'):
tf.reset_default_graph()
set_global_seeds(seed)
nenvs = env.num_envs
ob_space = env.observation_space
ac_space = env.action_space
make_model = lambda: Model(policy, ob_space, ac_space, nenvs, total_timesteps, nprocs=nprocs,
nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef, vf_fisher_coef=vf_fisher_coef,
lr=lr, max_grad_norm=max_grad_norm, kfac_clip=kfac_clip, lrschedule=lrschedule)
if save_interval and logger.get_dir():
import cloudpickle
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
fh.write(cloudpickle.dumps(make_model))
model = make_model()
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
nbatch = nenvs*nsteps
tstart = time.time()
coord = tf.train.Coordinator()
enqueue_threads = model.q_runner.create_threads(model.sess, coord=coord, start=True)
for update in range(1, total_timesteps//nbatch+1):
obs, states, rewards, masks, actions, values = runner.run()
policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
model.old_obs = obs
nseconds = time.time()-tstart
fps = int((update*nbatch)/nseconds)
if update % log_interval == 0 or update == 1:
ev = explained_variance(values, rewards)
logger.record_tabular("nupdates", update)
logger.record_tabular("total_timesteps", update*nbatch)
logger.record_tabular("fps", fps)
logger.record_tabular("policy_entropy", float(policy_entropy))
logger.record_tabular("policy_loss", float(policy_loss))
logger.record_tabular("value_loss", float(value_loss))
logger.record_tabular("explained_variance", float(ev))
logger.dump_tabular()
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
savepath = osp.join(logger.get_dir(), 'checkpoint%.5i'%update)
print('Saving to', savepath)
model.save(savepath)
coord.request_stop()
coord.join(enqueue_threads)
env.close()


@@ -1,926 +0,0 @@
import tensorflow as tf
import numpy as np
import re
from baselines.acktr.kfac_utils import *
from functools import reduce
KFAC_OPS = ['MatMul', 'Conv2D', 'BiasAdd']
KFAC_DEBUG = False
class KfacOptimizer():
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
self.max_grad_norm = max_grad_norm
self._lr = learning_rate
self._momentum = momentum
self._clip_kl = clip_kl
self._channel_fac = channel_fac
self._kfac_update = kfac_update
self._async = async
self._async_stats = async_stats
self._epsilon = epsilon
self._stats_decay = stats_decay
self._blockdiag_bias = blockdiag_bias
self._approxT2 = approxT2
self._use_float64 = use_float64
self._factored_damping = factored_damping
self._cold_iter = cold_iter
if cold_lr is None:
# good heuristics
self._cold_lr = self._lr# * 3.
else:
self._cold_lr = cold_lr
self._stats_accum_iter = stats_accum_iter
self._weight_decay_dict = weight_decay_dict
self._diag_init_coeff = 0.
self._full_stats_init = full_stats_init
if not self._full_stats_init:
self._stats_accum_iter = self._cold_iter
self.sgd_step = tf.Variable(0, name='KFAC/sgd_step', trainable=False)
self.global_step = tf.Variable(
0, name='KFAC/global_step', trainable=False)
self.cold_step = tf.Variable(0, name='KFAC/cold_step', trainable=False)
self.factor_step = tf.Variable(
0, name='KFAC/factor_step', trainable=False)
self.stats_step = tf.Variable(
0, name='KFAC/stats_step', trainable=False)
self.vFv = tf.Variable(0., name='KFAC/vFv', trainable=False)
self.factors = {}
self.param_vars = []
self.stats = {}
self.stats_eigen = {}
def getFactors(self, g, varlist):
graph = tf.get_default_graph()
factorTensors = {}
fpropTensors = []
bpropTensors = []
opTypes = []
fops = []
def searchFactors(gradient, graph):
# hard-coded search strategy
bpropOp = gradient.op
bpropOp_name = bpropOp.name
bTensors = []
fTensors = []
# combining additive gradients; assume they are the same op type and
# independent
if 'AddN' in bpropOp_name:
factors = []
for g in gradient.op.inputs:
factors.append(searchFactors(g, graph))
op_names = [item['opName'] for item in factors]
# TODO: need to check all the attributes of the ops as well
print (gradient.name)
print (op_names)
print (len(np.unique(op_names)))
assert len(np.unique(op_names)) == 1, gradient.name + \
' is shared among different computation OPs'
bTensors = reduce(lambda x, y: x + y,
[item['bpropFactors'] for item in factors])
if len(factors[0]['fpropFactors']) > 0:
fTensors = reduce(
lambda x, y: x + y, [item['fpropFactors'] for item in factors])
fpropOp_name = op_names[0]
fpropOp = factors[0]['op']
else:
fpropOp_name = re.search(
'gradientsSampled(_[0-9]+|)/(.+?)_grad', bpropOp_name).group(2)
fpropOp = graph.get_operation_by_name(fpropOp_name)
if fpropOp.op_def.name in KFAC_OPS:
# Known OPs
###
bTensor = [
i for i in bpropOp.inputs if 'gradientsSampled' in i.name][-1]
bTensorShape = fpropOp.outputs[0].get_shape()
if bTensor.get_shape()[0].value is None:
bTensor.set_shape(bTensorShape)
bTensors.append(bTensor)
###
if fpropOp.op_def.name == 'BiasAdd':
fTensors = []
else:
fTensors.append(
[i for i in fpropOp.inputs if param.op.name not in i.name][0])
fpropOp_name = fpropOp.op_def.name
else:
# unknown OPs, block approximation used
bInputsList = [i for i in bpropOp.inputs[
0].op.inputs if 'gradientsSampled' in i.name if 'Shape' not in i.name]
if len(bInputsList) > 0:
bTensor = bInputsList[0]
bTensorShape = fpropOp.outputs[0].get_shape()
if len(bTensor.get_shape()) > 0 and bTensor.get_shape()[0].value is None:
bTensor.set_shape(bTensorShape)
bTensors.append(bTensor)
fpropOp_name = 'UNK-' + fpropOp.op_def.name  # list.append returns None, so build the name first
opTypes.append(fpropOp_name)
return {'opName': fpropOp_name, 'op': fpropOp, 'fpropFactors': fTensors, 'bpropFactors': bTensors}
for t, param in zip(g, varlist):
if KFAC_DEBUG:
print(('get factor for '+param.name))
factors = searchFactors(t, graph)
factorTensors[param] = factors
########
# check associated weights and bias for homogeneous coordinate representation
# and check redundant factors
# TODO: there may be a bug in detecting associated bias and weights for
# forking layers, e.g. in inception models.
for param in varlist:
factorTensors[param]['assnWeights'] = None
factorTensors[param]['assnBias'] = None
for param in varlist:
if factorTensors[param]['opName'] == 'BiasAdd':
factorTensors[param]['assnWeights'] = None
for item in varlist:
if len(factorTensors[item]['bpropFactors']) > 0:
if (set(factorTensors[item]['bpropFactors']) == set(factorTensors[param]['bpropFactors'])) and (len(factorTensors[item]['fpropFactors']) > 0):
factorTensors[param]['assnWeights'] = item
factorTensors[item]['assnBias'] = param
factorTensors[param]['bpropFactors'] = factorTensors[
item]['bpropFactors']
########
########
# concatenate the additive gradients along the batch dimension, i.e.
# assuming independence structure
for key in ['fpropFactors', 'bpropFactors']:
for i, param in enumerate(varlist):
if len(factorTensors[param][key]) > 0:
if (key + '_concat') not in factorTensors[param]:
name_scope = factorTensors[param][key][0].name.split(':')[
0]
with tf.name_scope(name_scope):
factorTensors[param][
key + '_concat'] = tf.concat(factorTensors[param][key], 0)
else:
factorTensors[param][key + '_concat'] = None
for j, param2 in enumerate(varlist[(i + 1):]):
if (len(factorTensors[param][key]) > 0) and (set(factorTensors[param2][key]) == set(factorTensors[param][key])):
factorTensors[param2][key] = factorTensors[param][key]
factorTensors[param2][
key + '_concat'] = factorTensors[param][key + '_concat']
########
if KFAC_DEBUG:
for items in zip(varlist, fpropTensors, bpropTensors, opTypes):
print((items[0].name, factorTensors[items[0]]))
self.factors = factorTensors
return factorTensors
def getStats(self, factors, varlist):
if len(self.stats) == 0:
# initialize stats variables on CPU because eigen decomp is
# computed on CPU
with tf.device('/cpu'):
tmpStatsCache = {}
# search for tensor factors and
# use block diag approx for the bias units
for var in varlist:
fpropFactor = factors[var]['fpropFactors_concat']
bpropFactor = factors[var]['bpropFactors_concat']
opType = factors[var]['opName']
if opType == 'Conv2D':
Kh = var.get_shape()[0]
Kw = var.get_shape()[1]
C = fpropFactor.get_shape()[-1]
Oh = bpropFactor.get_shape()[1]
Ow = bpropFactor.get_shape()[2]
if Oh == 1 and Ow == 1 and self._channel_fac:
# factorization along the channels does not support
# homogeneous coordinates
var_assnBias = factors[var]['assnBias']
if var_assnBias:
factors[var]['assnBias'] = None
factors[var_assnBias]['assnWeights'] = None
##
for var in varlist:
fpropFactor = factors[var]['fpropFactors_concat']
bpropFactor = factors[var]['bpropFactors_concat']
opType = factors[var]['opName']
self.stats[var] = {'opName': opType,
'fprop_concat_stats': [],
'bprop_concat_stats': [],
'assnWeights': factors[var]['assnWeights'],
'assnBias': factors[var]['assnBias'],
}
if fpropFactor is not None:
if fpropFactor not in tmpStatsCache:
if opType == 'Conv2D':
Kh = var.get_shape()[0]
Kw = var.get_shape()[1]
C = fpropFactor.get_shape()[-1]
Oh = bpropFactor.get_shape()[1]
Ow = bpropFactor.get_shape()[2]
if Oh == 1 and Ow == 1 and self._channel_fac:
# factorization along the channels:
# assume independence between input channels and spatial dimensions,
# giving a 2K-1 x 2K-1 covariance matrix and a C x C covariance matrix.
# Factorization along the channels does not
# support homogeneous coordinates, so assnBias
# is always None
fpropFactor2_size = Kh * Kw
slot_fpropFactor_stats2 = tf.Variable(tf.diag(tf.ones(
[fpropFactor2_size])) * self._diag_init_coeff, name='KFAC_STATS/' + fpropFactor.op.name, trainable=False)
self.stats[var]['fprop_concat_stats'].append(
slot_fpropFactor_stats2)
fpropFactor_size = C
else:
# 2K-1 x 2K-1 x C x C covariance matrix
# assume BHWC
fpropFactor_size = Kh * Kw * C
else:
# D x D covariance matrix
fpropFactor_size = fpropFactor.get_shape()[-1]
# use homogeneous coordinate
if not self._blockdiag_bias and self.stats[var]['assnBias']:
fpropFactor_size += 1
slot_fpropFactor_stats = tf.Variable(tf.diag(tf.ones(
[fpropFactor_size])) * self._diag_init_coeff, name='KFAC_STATS/' + fpropFactor.op.name, trainable=False)
self.stats[var]['fprop_concat_stats'].append(
slot_fpropFactor_stats)
if opType != 'Conv2D':
tmpStatsCache[fpropFactor] = self.stats[
var]['fprop_concat_stats']
else:
self.stats[var][
'fprop_concat_stats'] = tmpStatsCache[fpropFactor]
if bpropFactor is not None:
# no need to collect backward stats for bias vectors if
# using homogeneous coordinates
if not((not self._blockdiag_bias) and self.stats[var]['assnWeights']):
if bpropFactor not in tmpStatsCache:
slot_bpropFactor_stats = tf.Variable(tf.diag(tf.ones([bpropFactor.get_shape(
)[-1]])) * self._diag_init_coeff, name='KFAC_STATS/' + bpropFactor.op.name, trainable=False)
self.stats[var]['bprop_concat_stats'].append(
slot_bpropFactor_stats)
tmpStatsCache[bpropFactor] = self.stats[
var]['bprop_concat_stats']
else:
self.stats[var][
'bprop_concat_stats'] = tmpStatsCache[bpropFactor]
return self.stats
def compute_and_apply_stats(self, loss_sampled, var_list=None):
varlist = var_list
if varlist is None:
varlist = tf.trainable_variables()
stats = self.compute_stats(loss_sampled, var_list=varlist)
return self.apply_stats(stats)
def compute_stats(self, loss_sampled, var_list=None):
varlist = var_list
if varlist is None:
varlist = tf.trainable_variables()
gs = tf.gradients(loss_sampled, varlist, name='gradientsSampled')
self.gs = gs
factors = self.getFactors(gs, varlist)
stats = self.getStats(factors, varlist)
updateOps = []
statsUpdates = {}
statsUpdates_cache = {}
for var in varlist:
opType = factors[var]['opName']
fops = factors[var]['op']
fpropFactor = factors[var]['fpropFactors_concat']
fpropStats_vars = stats[var]['fprop_concat_stats']
bpropFactor = factors[var]['bpropFactors_concat']
bpropStats_vars = stats[var]['bprop_concat_stats']
SVD_factors = {}
for stats_var in fpropStats_vars:
stats_var_dim = int(stats_var.get_shape()[0])
if stats_var not in statsUpdates_cache:
old_fpropFactor = fpropFactor
B = (tf.shape(fpropFactor)[0]) # batch size
if opType == 'Conv2D':
strides = fops.get_attr("strides")
padding = fops.get_attr("padding")
convkernel_size = var.get_shape()[0:3]
KH = int(convkernel_size[0])
KW = int(convkernel_size[1])
C = int(convkernel_size[2])
flatten_size = int(KH * KW * C)
Oh = int(bpropFactor.get_shape()[1])
Ow = int(bpropFactor.get_shape()[2])
if Oh == 1 and Ow == 1 and self._channel_fac:
# factorization along the channels
# assume independence among input channels
# factor = B x 1 x 1 x (KH xKW x C)
# patches = B x Oh x Ow x (KH xKW x C)
if len(SVD_factors) == 0:
if KFAC_DEBUG:
print(('approx %s act factor with rank-1 SVD factors' % (var.name)))
# find closest rank-1 approx to the feature map
S, U, V = tf.batch_svd(tf.reshape(
fpropFactor, [-1, KH * KW, C]))
# get rank-1 approx slices
sqrtS1 = tf.expand_dims(tf.sqrt(S[:, 0, 0]), 1)
patches_k = U[:, :, 0] * sqrtS1 # B x KH*KW
full_factor_shape = fpropFactor.get_shape()
patches_k.set_shape(
[full_factor_shape[0], KH * KW])
patches_c = V[:, :, 0] * sqrtS1 # B x C
patches_c.set_shape([full_factor_shape[0], C])
SVD_factors[C] = patches_c
SVD_factors[KH * KW] = patches_k
fpropFactor = SVD_factors[stats_var_dim]
else:
# poor mem usage implementation
patches = tf.extract_image_patches(fpropFactor, ksizes=[1, convkernel_size[
0], convkernel_size[1], 1], strides=strides, rates=[1, 1, 1, 1], padding=padding)
if self._approxT2:
if KFAC_DEBUG:
print(('approxT2 act fisher for %s' % (var.name)))
# T^2 terms * 1/T^2, size: B x C
fpropFactor = tf.reduce_mean(patches, [1, 2])
else:
# size: (B x Oh x Ow) x C
fpropFactor = tf.reshape(
patches, [-1, flatten_size]) / Oh / Ow
fpropFactor_size = int(fpropFactor.get_shape()[-1])
if stats_var_dim == (fpropFactor_size + 1) and not self._blockdiag_bias:
if opType == 'Conv2D' and not self._approxT2:
# correct padding for numerical stability (we
# divided out OhxOw from activations for T1 approx)
fpropFactor = tf.concat([fpropFactor, tf.ones(
[tf.shape(fpropFactor)[0], 1]) / Oh / Ow], 1)
else:
# use homogeneous coordinates
fpropFactor = tf.concat(
[fpropFactor, tf.ones([tf.shape(fpropFactor)[0], 1])], 1)
# average over the number of data points in a batch
# divided by B
cov = tf.matmul(fpropFactor, fpropFactor,
transpose_a=True) / tf.cast(B, tf.float32)
updateOps.append(cov)
statsUpdates[stats_var] = cov
if opType != 'Conv2D':
# HACK: for convolution we recompute fprop stats for
# every layer including forking layers
statsUpdates_cache[stats_var] = cov
for stats_var in bpropStats_vars:
stats_var_dim = int(stats_var.get_shape()[0])
if stats_var not in statsUpdates_cache:
old_bpropFactor = bpropFactor
bpropFactor_shape = bpropFactor.get_shape()
B = tf.shape(bpropFactor)[0] # batch size
C = int(bpropFactor_shape[-1]) # num channels
if opType == 'Conv2D' or len(bpropFactor_shape) == 4:
if fpropFactor is not None:
if self._approxT2:
if KFAC_DEBUG:
print(('approxT2 grad fisher for %s' % (var.name)))
bpropFactor = tf.reduce_sum(
bpropFactor, [1, 2]) # T^2 terms * 1/T^2
else:
bpropFactor = tf.reshape(
bpropFactor, [-1, C]) * Oh * Ow # T * 1/T terms
else:
# just doing block diag approx. spatial independent
# structure does not apply here. summing over
# spatial locations
if KFAC_DEBUG:
print(('block diag approx fisher for %s' % (var.name)))
bpropFactor = tf.reduce_sum(bpropFactor, [1, 2])
# assume sampled loss is averaged. TO-DO: figure out a better
# way to handle this
bpropFactor *= tf.to_float(B)
##
cov_b = tf.matmul(
bpropFactor, bpropFactor, transpose_a=True) / tf.to_float(tf.shape(bpropFactor)[0])
updateOps.append(cov_b)
statsUpdates[stats_var] = cov_b
statsUpdates_cache[stats_var] = cov_b
if KFAC_DEBUG:
aKey = list(statsUpdates.keys())[0]
statsUpdates[aKey] = tf.Print(statsUpdates[aKey],
[tf.convert_to_tensor('step:'),
self.global_step,
tf.convert_to_tensor(
'computing stats'),
])
self.statsUpdates = statsUpdates
return statsUpdates
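The dense-layer branch of `compute_stats` can be illustrated outside TensorFlow. The following is a minimal NumPy sketch, assuming a batch of activations shaped `B x D`; `activation_factor` is a hypothetical helper name for illustration and is not part of the removed module. It appends the column of ones (the homogeneous coordinate) and averages the outer products over the batch, as the `tf.concat` and `tf.matmul` calls above do.

```python
import numpy as np


def activation_factor(acts, homogeneous=True):
    """Return a^T a / B, optionally after appending a column of ones."""
    acts = np.asarray(acts, dtype=np.float64)
    batch = acts.shape[0]
    if homogeneous:
        # The extra column lets a single factor cover weights and bias jointly.
        acts = np.concatenate([acts, np.ones((batch, 1))], axis=1)
    # Average of the per-sample outer products over the batch.
    return acts.T @ acts / batch


acts = np.array([[1.0, 2.0], [3.0, 4.0]])
factor = activation_factor(acts)
```

With homogeneous coordinates the factor is `(D + 1) x (D + 1)`, which matches the `fpropFactor_size += 1` adjustment in `getStats`.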
def apply_stats(self, statsUpdates):
""" compute stats and update/apply the new stats to the running average
"""
def updateAccumStats():
if self._full_stats_init:
return tf.cond(tf.greater(self.sgd_step, self._cold_iter), lambda: tf.group(*self._apply_stats(statsUpdates, accumulate=True, accumulateCoeff=1. / self._stats_accum_iter)), tf.no_op)
else:
return tf.group(*self._apply_stats(statsUpdates, accumulate=True, accumulateCoeff=1. / self._stats_accum_iter))
def updateRunningAvgStats(statsUpdates, fac_iter=1):
# return tf.cond(tf.greater_equal(self.factor_step,
# tf.convert_to_tensor(fac_iter)), lambda:
# tf.group(*self._apply_stats(stats_list, varlist)), tf.no_op)
return tf.group(*self._apply_stats(statsUpdates))
if self._async_stats:
# asynchronous stats update
update_stats = self._apply_stats(statsUpdates)
queue = tf.FIFOQueue(1, [item.dtype for item in update_stats], shapes=[
item.get_shape() for item in update_stats])
enqueue_op = queue.enqueue(update_stats)
def dequeue_stats_op():
return queue.dequeue()
self.qr_stats = tf.train.QueueRunner(queue, [enqueue_op])
update_stats_op = tf.cond(tf.equal(queue.size(), tf.convert_to_tensor(
0)), tf.no_op, lambda: tf.group(*[dequeue_stats_op(), ]))
else:
# synchronous stats update
update_stats_op = tf.cond(tf.greater_equal(
self.stats_step, self._stats_accum_iter), lambda: updateRunningAvgStats(statsUpdates), updateAccumStats)
self._update_stats_op = update_stats_op
return update_stats_op
def _apply_stats(self, statsUpdates, accumulate=False, accumulateCoeff=0.):
updateOps = []
# obtain the stats var list
for stats_var in statsUpdates:
stats_new = statsUpdates[stats_var]
if accumulate:
# simple superbatch averaging
update_op = tf.assign_add(
stats_var, accumulateCoeff * stats_new, use_locking=True)
else:
# exponential running averaging
update_op = tf.assign(
stats_var, stats_var * self._stats_decay, use_locking=True)
update_op = tf.assign_add(
update_op, (1. - self._stats_decay) * stats_new, use_locking=True)
updateOps.append(update_op)
with tf.control_dependencies(updateOps):
stats_step_op = tf.assign_add(self.stats_step, 1)
if KFAC_DEBUG:
stats_step_op = (tf.Print(stats_step_op,
[tf.convert_to_tensor('step:'),
self.global_step,
tf.convert_to_tensor('fac step:'),
self.factor_step,
tf.convert_to_tensor('sgd step:'),
self.sgd_step,
tf.convert_to_tensor('Accum:'),
tf.convert_to_tensor(accumulate),
tf.convert_to_tensor('Accum coeff:'),
tf.convert_to_tensor(accumulateCoeff),
tf.convert_to_tensor('stat step:'),
self.stats_step, updateOps[0], updateOps[1]]))
return [stats_step_op, ]
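The two update rules in `_apply_stats` reduce to simple scalar arithmetic. Below is a small pure-Python sketch with hypothetical function names; it starts the accumulator at zero for clarity, whereas the removed code initialises each stats variable to a scaled identity matrix.

```python
def accumulate(stat, new, coeff):
    """Superbatch averaging: add a fixed fraction of the new estimate."""
    return stat + coeff * new


def running_average(stat, new, decay):
    """Exponential moving average: decay the old value, blend in the new one."""
    return stat * decay + (1.0 - decay) * new


stat = 0.0
for sample in [2.0, 4.0]:
    # With coeff = 1 / (number of samples) this builds the plain mean.
    stat = accumulate(stat, sample, coeff=1.0 / 2)

ema = running_average(stat, 5.0, decay=0.95)
```

The `tf.assign` followed by `tf.assign_add` in the non-accumulating branch computes the same quantity as `running_average`, split into two in-place steps.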
def getStatsEigen(self, stats=None):
if len(self.stats_eigen) == 0:
stats_eigen = {}
if stats is None:
stats = self.stats
tmpEigenCache = {}
with tf.device('/cpu:0'):
for var in stats:
for key in ['fprop_concat_stats', 'bprop_concat_stats']:
for stats_var in stats[var][key]:
if stats_var not in tmpEigenCache:
stats_dim = stats_var.get_shape()[1].value
e = tf.Variable(tf.ones(
[stats_dim]), name='KFAC_FAC/' + stats_var.name.split(':')[0] + '/e', trainable=False)
Q = tf.Variable(tf.diag(tf.ones(
[stats_dim])), name='KFAC_FAC/' + stats_var.name.split(':')[0] + '/Q', trainable=False)
stats_eigen[stats_var] = {'e': e, 'Q': Q}
tmpEigenCache[
stats_var] = stats_eigen[stats_var]
else:
stats_eigen[stats_var] = tmpEigenCache[
stats_var]
self.stats_eigen = stats_eigen
return self.stats_eigen
def computeStatsEigen(self):
""" compute the eigen decomp using copied var stats to avoid concurrent read/write from other queue """
# TO-DO: figure out why this op has delays (possibly moving
# eigenvectors around?)
with tf.device('/cpu:0'):
def removeNone(tensor_list):
local_list = []
for item in tensor_list:
if item is not None:
local_list.append(item)
return local_list
def copyStats(var_list):
print("copying stats to buffer tensors before eigen decomp")
redundant_stats = {}
copied_list = []
for item in var_list:
if item is not None:
if item not in redundant_stats:
if self._use_float64:
redundant_stats[item] = tf.cast(
tf.identity(item), tf.float64)
else:
redundant_stats[item] = tf.identity(item)
copied_list.append(redundant_stats[item])
else:
copied_list.append(None)
return copied_list
#stats = [copyStats(self.fStats), copyStats(self.bStats)]
#stats = [self.fStats, self.bStats]
stats_eigen = self.stats_eigen
computedEigen = {}
eigen_reverse_lookup = {}
updateOps = []
# sync copied stats
# with tf.control_dependencies(removeNone(stats[0]) +
# removeNone(stats[1])):
with tf.control_dependencies([]):
for stats_var in stats_eigen:
if stats_var not in computedEigen:
eigens = tf.self_adjoint_eig(stats_var)
e = eigens[0]
Q = eigens[1]
if self._use_float64:
e = tf.cast(e, tf.float32)
Q = tf.cast(Q, tf.float32)
updateOps.append(e)
updateOps.append(Q)
computedEigen[stats_var] = {'e': e, 'Q': Q}
eigen_reverse_lookup[e] = stats_eigen[stats_var]['e']
eigen_reverse_lookup[Q] = stats_eigen[stats_var]['Q']
self.eigen_reverse_lookup = eigen_reverse_lookup
self.eigen_update_list = updateOps
if KFAC_DEBUG:
self.eigen_update_list = [item for item in updateOps]
with tf.control_dependencies(updateOps):
updateOps.append(tf.Print(tf.constant(
0.), [tf.convert_to_tensor('computed factor eigen')]))
return updateOps
def applyStatsEigen(self, eigen_list):
updateOps = []
print(('updating %d eigenvalue/vectors' % len(eigen_list)))
for i, (tensor, mark) in enumerate(zip(eigen_list, self.eigen_update_list)):
stats_eigen_var = self.eigen_reverse_lookup[mark]
updateOps.append(
tf.assign(stats_eigen_var, tensor, use_locking=True))
with tf.control_dependencies(updateOps):
factor_step_op = tf.assign_add(self.factor_step, 1)
updateOps.append(factor_step_op)
if KFAC_DEBUG:
updateOps.append(tf.Print(tf.constant(
0.), [tf.convert_to_tensor('updated kfac factors')]))
return updateOps
def getKfacPrecondUpdates(self, gradlist, varlist):
updatelist = []
vg = 0.
assert len(self.stats) > 0
assert len(self.stats_eigen) > 0
assert len(self.factors) > 0
counter = 0
grad_dict = {var: grad for grad, var in zip(gradlist, varlist)}
for grad, var in zip(gradlist, varlist):
GRAD_RESHAPE = False
GRAD_TRANSPOSE = False
fpropFactoredFishers = self.stats[var]['fprop_concat_stats']
bpropFactoredFishers = self.stats[var]['bprop_concat_stats']
if (len(fpropFactoredFishers) + len(bpropFactoredFishers)) > 0:
counter += 1
GRAD_SHAPE = grad.get_shape()
if len(grad.get_shape()) > 2:
# reshape conv kernel parameters
KW = int(grad.get_shape()[0])
KH = int(grad.get_shape()[1])
C = int(grad.get_shape()[2])
D = int(grad.get_shape()[3])
if len(fpropFactoredFishers) > 1 and self._channel_fac:
# reshape conv kernel parameters into tensor
grad = tf.reshape(grad, [KW * KH, C, D])
else:
# reshape conv kernel parameters into 2D grad
grad = tf.reshape(grad, [-1, D])
GRAD_RESHAPE = True
elif len(grad.get_shape()) == 1:
# reshape bias or 1D parameters
D = int(grad.get_shape()[0])
grad = tf.expand_dims(grad, 0)
GRAD_RESHAPE = True
else:
# 2D parameters
C = int(grad.get_shape()[0])
D = int(grad.get_shape()[1])
if (self.stats[var]['assnBias'] is not None) and not self._blockdiag_bias:
# homogeneous coordinates only work for 2D grads.
# TO-DO: figure out how to factorize bias grad
# stack bias grad
var_assnBias = self.stats[var]['assnBias']
grad = tf.concat(
[grad, tf.expand_dims(grad_dict[var_assnBias], 0)], 0)
# project gradient to eigen space and reshape the eigenvalues
# for broadcasting
eigVals = []
for idx, stats in enumerate(self.stats[var]['fprop_concat_stats']):
Q = self.stats_eigen[stats]['Q']
e = detectMinVal(self.stats_eigen[stats][
'e'], var, name='act', debug=KFAC_DEBUG)
Q, e = factorReshape(Q, e, grad, facIndx=idx, ftype='act')
eigVals.append(e)
grad = gmatmul(Q, grad, transpose_a=True, reduce_dim=idx)
for idx, stats in enumerate(self.stats[var]['bprop_concat_stats']):
Q = self.stats_eigen[stats]['Q']
e = detectMinVal(self.stats_eigen[stats][
'e'], var, name='grad', debug=KFAC_DEBUG)
Q, e = factorReshape(Q, e, grad, facIndx=idx, ftype='grad')
eigVals.append(e)
grad = gmatmul(grad, Q, transpose_b=False, reduce_dim=idx)
##
#####
# whiten using eigenvalues
weightDecayCoeff = 0.
if var in self._weight_decay_dict:
weightDecayCoeff = self._weight_decay_dict[var]
if KFAC_DEBUG:
print(('weight decay coeff for %s is %f' % (var.name, weightDecayCoeff)))
if self._factored_damping:
if KFAC_DEBUG:
print(('use factored damping for %s' % (var.name)))
coeffs = 1.
num_factors = len(eigVals)
# compute the ratio of the trace norms of the left and right
# KFac matrices, and their generalization
if len(eigVals) == 1:
damping = self._epsilon + weightDecayCoeff
else:
damping = tf.pow(
self._epsilon + weightDecayCoeff, 1. / num_factors)
eigVals_tnorm_avg = [tf.reduce_mean(
tf.abs(e)) for e in eigVals]
for e, e_tnorm in zip(eigVals, eigVals_tnorm_avg):
eig_tnorm_negList = [
item for item in eigVals_tnorm_avg if item != e_tnorm]
if len(eigVals) == 1:
adjustment = 1.
elif len(eigVals) == 2:
adjustment = tf.sqrt(
e_tnorm / eig_tnorm_negList[0])
else:
eig_tnorm_negList_prod = reduce(
lambda x, y: x * y, eig_tnorm_negList)
adjustment = tf.pow(
tf.pow(e_tnorm, num_factors - 1.) / eig_tnorm_negList_prod, 1. / num_factors)
coeffs *= (e + adjustment * damping)
else:
coeffs = 1.
damping = (self._epsilon + weightDecayCoeff)
for e in eigVals:
coeffs *= e
coeffs += damping
#grad = tf.Print(grad, [tf.convert_to_tensor('1'), tf.convert_to_tensor(var.name), grad.get_shape()])
grad /= coeffs
#grad = tf.Print(grad, [tf.convert_to_tensor('2'), tf.convert_to_tensor(var.name), grad.get_shape()])
#####
# project gradient back to euclidean space
for idx, stats in enumerate(self.stats[var]['fprop_concat_stats']):
Q = self.stats_eigen[stats]['Q']
grad = gmatmul(Q, grad, transpose_a=False, reduce_dim=idx)
for idx, stats in enumerate(self.stats[var]['bprop_concat_stats']):
Q = self.stats_eigen[stats]['Q']
grad = gmatmul(grad, Q, transpose_b=True, reduce_dim=idx)
##
#grad = tf.Print(grad, [tf.convert_to_tensor('3'), tf.convert_to_tensor(var.name), grad.get_shape()])
if (self.stats[var]['assnBias'] is not None) and not self._blockdiag_bias:
# homogeneous coordinates only work for 2D grads.
# TO-DO: figure out how to factorize bias grad
# un-stack bias grad
var_assnBias = self.stats[var]['assnBias']
C_plus_one = int(grad.get_shape()[0])
grad_assnBias = tf.reshape(tf.slice(grad,
begin=[
C_plus_one - 1, 0],
size=[1, -1]), var_assnBias.get_shape())
grad_assnWeights = tf.slice(grad,
begin=[0, 0],
size=[C_plus_one - 1, -1])
grad_dict[var_assnBias] = grad_assnBias
grad = grad_assnWeights
#grad = tf.Print(grad, [tf.convert_to_tensor('4'), tf.convert_to_tensor(var.name), grad.get_shape()])
if GRAD_RESHAPE:
grad = tf.reshape(grad, GRAD_SHAPE)
grad_dict[var] = grad
print(('projecting %d gradient matrices' % counter))
for g, var in zip(gradlist, varlist):
grad = grad_dict[var]
### clipping ###
if KFAC_DEBUG:
print(('apply clipping to %s' % (var.name)))
tf.Print(grad, [tf.sqrt(tf.reduce_sum(tf.pow(grad, 2)))], "Euclidean norm of new grad")
local_vg = tf.reduce_sum(grad * g * (self._lr * self._lr))
vg += local_vg
# rescale everything
if KFAC_DEBUG:
print('apply vFv clipping')
scaling = tf.minimum(1., tf.sqrt(self._clip_kl / vg))
if KFAC_DEBUG:
scaling = tf.Print(scaling, [tf.convert_to_tensor(
'clip: '), scaling, tf.convert_to_tensor(' vFv: '), vg])
with tf.control_dependencies([tf.assign(self.vFv, vg)]):
updatelist = [grad_dict[var] for var in varlist]
for i, item in enumerate(updatelist):
updatelist[i] = scaling * item
return updatelist
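For a single 2D weight gradient, the project / whiten / project-back sequence in `getKfacPrecondUpdates` followed by the vFv rescaling can be sketched in NumPy. This is an illustration under stated assumptions, not the removed implementation: it uses the unfactored damping branch only, recomputes the eigendecompositions inline rather than reading cached `e` and `Q` variables, skips the small-eigenvalue clipping done by `detectMinVal`, and the names `kfac_precondition` and `kl_clip_scale` are hypothetical.

```python
import math

import numpy as np


def kfac_precondition(grad, act_factor, grad_factor, damping):
    """Solve act_factor @ X @ grad_factor + damping * X = grad for X."""
    e_a, q_a = np.linalg.eigh(act_factor)
    e_g, q_g = np.linalg.eigh(grad_factor)
    # Project the gradient into the joint eigenbasis of both factors.
    projected = q_a.T @ grad @ q_g
    # Divide by the product of eigenvalues plus the (unfactored) damping.
    projected = projected / (np.outer(e_a, e_g) + damping)
    # Rotate back to the original parameter coordinates.
    return q_a @ projected @ q_g.T


def kl_clip_scale(update, grad, lr, clip_kl):
    """Return min(1, sqrt(clip_kl / vFv)) with vFv = lr^2 * <update, grad>."""
    vfv = float(np.sum(update * grad)) * lr * lr
    return min(1.0, math.sqrt(clip_kl / vfv))


rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 3))   # stand-in activations, B x C
back = rng.normal(size=(8, 2))   # stand-in back-propagated gradients, B x D
act_factor = acts.T @ acts / 8
grad_factor = back.T @ back / 8
grad = rng.normal(size=(3, 2))

update = kfac_precondition(grad, act_factor, grad_factor, damping=0.1)
# If the solve is right, applying the damped Kronecker operator recovers grad.
residual = act_factor @ update @ grad_factor + 0.1 * update - grad
scale = kl_clip_scale(update, grad, lr=0.25, clip_kl=0.01)
```

The residual check shows why working in the eigenbasis helps: the damped Kronecker-factored system becomes an elementwise division, so no large matrix is ever inverted explicitly.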
def compute_gradients(self, loss, var_list=None):
varlist = var_list
if varlist is None:
varlist = tf.trainable_variables()
g = tf.gradients(loss, varlist)
return [(a, b) for a, b in zip(g, varlist)]
def apply_gradients_kfac(self, grads):
g, varlist = list(zip(*grads))
if len(self.stats_eigen) == 0:
self.getStatsEigen()
qr = None
# launch eigen-decomp on a queue thread
if self._async:
print('Use async eigen decomp')
# get a list of factor loading tensors
factorOps_dummy = self.computeStatsEigen()
# define a queue for the list of factor loading tensors
queue = tf.FIFOQueue(1, [item.dtype for item in factorOps_dummy], shapes=[
item.get_shape() for item in factorOps_dummy])
enqueue_op = tf.cond(tf.logical_and(tf.equal(tf.mod(self.stats_step, self._kfac_update), tf.convert_to_tensor(
0)), tf.greater_equal(self.stats_step, self._stats_accum_iter)), lambda: queue.enqueue(self.computeStatsEigen()), tf.no_op)
def dequeue_op():
return queue.dequeue()
qr = tf.train.QueueRunner(queue, [enqueue_op])
updateOps = []
global_step_op = tf.assign_add(self.global_step, 1)
updateOps.append(global_step_op)
with tf.control_dependencies([global_step_op]):
# compute updates
assert self._update_stats_op is not None
updateOps.append(self._update_stats_op)
dependency_list = []
if not self._async:
dependency_list.append(self._update_stats_op)
with tf.control_dependencies(dependency_list):
def no_op_wrapper():
return tf.group(*[tf.assign_add(self.cold_step, 1)])
if not self._async:
# synchronous eigen-decomp updates
updateFactorOps = tf.cond(tf.logical_and(tf.equal(tf.mod(self.stats_step, self._kfac_update),
tf.convert_to_tensor(0)),
tf.greater_equal(self.stats_step, self._stats_accum_iter)), lambda: tf.group(*self.applyStatsEigen(self.computeStatsEigen())), no_op_wrapper)
else:
# asynchronous eigen-decomp updates using queue
updateFactorOps = tf.cond(tf.greater_equal(self.stats_step, self._stats_accum_iter),
lambda: tf.cond(tf.equal(queue.size(), tf.convert_to_tensor(0)),
tf.no_op,
lambda: tf.group(
*self.applyStatsEigen(dequeue_op())),
),
no_op_wrapper)
updateOps.append(updateFactorOps)
with tf.control_dependencies([updateFactorOps]):
def gradOp():
return list(g)
def getKfacGradOp():
return self.getKfacPrecondUpdates(g, varlist)
u = tf.cond(tf.greater(self.factor_step,
tf.convert_to_tensor(0)), getKfacGradOp, gradOp)
optim = tf.train.MomentumOptimizer(
self._lr * (1. - self._momentum), self._momentum)
#optim = tf.train.AdamOptimizer(self._lr, epsilon=0.01)
def optimOp():
def updateOptimOp():
if self._full_stats_init:
return tf.cond(tf.greater(self.factor_step, tf.convert_to_tensor(0)), lambda: optim.apply_gradients(list(zip(u, varlist))), tf.no_op)
else:
return optim.apply_gradients(list(zip(u, varlist)))
if self._full_stats_init:
return tf.cond(tf.greater_equal(self.stats_step, self._stats_accum_iter), updateOptimOp, tf.no_op)
else:
return tf.cond(tf.greater_equal(self.sgd_step, self._cold_iter), updateOptimOp, tf.no_op)
updateOps.append(optimOp())
return tf.group(*updateOps), qr
def apply_gradients(self, grads):
coldOptim = tf.train.MomentumOptimizer(
self._cold_lr, self._momentum)
def coldSGDstart():
sgd_grads, sgd_var = zip(*grads)
if self.max_grad_norm is not None:
sgd_grads, sgd_grad_norm = tf.clip_by_global_norm(sgd_grads, self.max_grad_norm)
sgd_grads = list(zip(sgd_grads, sgd_var))
sgd_step_op = tf.assign_add(self.sgd_step, 1)
coldOptim_op = coldOptim.apply_gradients(sgd_grads)
if KFAC_DEBUG:
with tf.control_dependencies([sgd_step_op, coldOptim_op]):
sgd_step_op = tf.Print(
sgd_step_op, [self.sgd_step, tf.convert_to_tensor('doing cold sgd step')])
return tf.group(*[sgd_step_op, coldOptim_op])
kfacOptim_op, qr = self.apply_gradients_kfac(grads)
def warmKFACstart():
return kfacOptim_op
return tf.cond(tf.greater(self.sgd_step, self._cold_iter), warmKFACstart, coldSGDstart), qr
def minimize(self, loss, loss_sampled, var_list=None):
grads = self.compute_gradients(loss, var_list=var_list)
update_stats_op = self.compute_and_apply_stats(
loss_sampled, var_list=var_list)
return self.apply_gradients(grads)

View File

@@ -1,86 +0,0 @@
import tensorflow as tf
def gmatmul(a, b, transpose_a=False, transpose_b=False, reduce_dim=None):
assert reduce_dim is not None
# weird batch matmul
if len(a.get_shape()) == 2 and len(b.get_shape()) > 2:
# reshape reduce_dim to the left most dim in b
b_shape = b.get_shape()
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(reduce_dim)
b_dims.insert(0, reduce_dim)
b = tf.transpose(b, b_dims)
b_t_shape = b.get_shape()
b = tf.reshape(b, [int(b_shape[reduce_dim]), -1])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, b_t_shape)
if reduce_dim != 0:
b_dims = list(range(len(b_shape)))
b_dims.remove(0)
b_dims.insert(reduce_dim, 0)
result = tf.transpose(result, b_dims)
return result
elif len(a.get_shape()) > 2 and len(b.get_shape()) == 2:
# reshape reduce_dim to the right most dim in a
a_shape = a.get_shape()
outter_dim = len(a_shape) - 1
reduce_dim = len(a_shape) - reduce_dim - 1
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(reduce_dim)
a_dims.insert(outter_dim, reduce_dim)
a = tf.transpose(a, a_dims)
a_t_shape = a.get_shape()
a = tf.reshape(a, [-1, int(a_shape[reduce_dim])])
result = tf.matmul(a, b, transpose_a=transpose_a,
transpose_b=transpose_b)
result = tf.reshape(result, a_t_shape)
if reduce_dim != outter_dim:
a_dims = list(range(len(a_shape)))
a_dims.remove(outter_dim)
a_dims.insert(reduce_dim, outter_dim)
result = tf.transpose(result, a_dims)
return result
elif len(a.get_shape()) == 2 and len(b.get_shape()) == 2:
return tf.matmul(a, b, transpose_a=transpose_a, transpose_b=transpose_b)
assert False, 'something went wrong'
def clipoutNeg(vec, threshold=1e-6):
mask = tf.cast(vec > threshold, tf.float32)
return mask * vec
def detectMinVal(input_mat, var, threshold=1e-6, name='', debug=False):
eigen_min = tf.reduce_min(input_mat)
eigen_max = tf.reduce_max(input_mat)
eigen_ratio = eigen_max / eigen_min
input_mat_clipped = clipoutNeg(input_mat, threshold)
if debug:
input_mat_clipped = tf.cond(tf.logical_or(tf.greater(eigen_ratio, 0.), tf.less(eigen_ratio, -500)), lambda: input_mat_clipped, lambda: tf.Print(
input_mat_clipped, [tf.convert_to_tensor('screwed ratio ' + name + ' eigen values!!!'), tf.convert_to_tensor(var.name), eigen_min, eigen_max, eigen_ratio]))
return input_mat_clipped
def factorReshape(Q, e, grad, facIndx=0, ftype='act'):
grad_shape = grad.get_shape()
if ftype == 'act':
assert e.get_shape()[0] == grad_shape[facIndx]
expanded_shape = [1, ] * len(grad_shape)
expanded_shape[facIndx] = -1
e = tf.reshape(e, expanded_shape)
if ftype == 'grad':
assert e.get_shape()[0] == grad_shape[len(grad_shape) - facIndx - 1]
expanded_shape = [1, ] * len(grad_shape)
expanded_shape[len(grad_shape) - facIndx - 1] = -1
e = tf.reshape(e, expanded_shape)
return Q, e
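The first branch of `gmatmul` (2D matrix times a higher-rank tensor along `reduce_dim`) can be mirrored in NumPy. The sketch below assumes a square matrix, which the reshape back to the transposed shape appears to require, and uses a hypothetical name, `left_multiply_along_axis`; the `einsum` line is only there as an independent cross-check.

```python
import numpy as np


def left_multiply_along_axis(mat, tensor, axis):
    """Apply a square matrix to one axis of a tensor, keeping its layout."""
    # Move the axis to be reduced to the front, as gmatmul does with tf.transpose.
    moved = np.moveaxis(tensor, axis, 0)
    # Flatten the remaining axes so an ordinary matrix product applies.
    flat = moved.reshape(moved.shape[0], -1)
    out = (mat @ flat).reshape(moved.shape)
    # Restore the original axis order.
    return np.moveaxis(out, 0, axis)


mat = np.arange(9.0).reshape(3, 3)
tensor = np.arange(24.0).reshape(2, 3, 4)
result = left_multiply_along_axis(mat, tensor, axis=1)
expected = np.einsum('ij,ajb->aib', mat, tensor)
```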

View File

@@ -1,42 +0,0 @@
import numpy as np
import tensorflow as tf
from baselines.acktr.utils import dense, kl_div
import baselines.common.tf_util as U
class GaussianMlpPolicy(object):
def __init__(self, ob_dim, ac_dim):
# Here we'll construct a bunch of expressions, which will be used in two places:
# (1) When sampling actions
# (2) When computing loss functions, for the policy update
# Variables specific to (1) have the word "sampled" in them,
# whereas variables specific to (2) have the word "old" in them
ob_no = tf.placeholder(tf.float32, shape=[None, ob_dim*2], name="ob") # batch of observations
oldac_na = tf.placeholder(tf.float32, shape=[None, ac_dim], name="ac") # batch of previous actions
oldac_dist = tf.placeholder(tf.float32, shape=[None, ac_dim*2], name="oldac_dist") # batch of previous action distributions
adv_n = tf.placeholder(tf.float32, shape=[None], name="adv") # advantage function estimate
wd_dict = {}
h1 = tf.nn.tanh(dense(ob_no, 64, "h1", weight_init=U.normc_initializer(1.0), bias_init=0.0, weight_loss_dict=wd_dict))
h2 = tf.nn.tanh(dense(h1, 64, "h2", weight_init=U.normc_initializer(1.0), bias_init=0.0, weight_loss_dict=wd_dict))
mean_na = dense(h2, ac_dim, "mean", weight_init=U.normc_initializer(0.1), bias_init=0.0, weight_loss_dict=wd_dict) # Mean control output
self.wd_dict = wd_dict
self.logstd_1a = logstd_1a = tf.get_variable("logstd", [ac_dim], tf.float32, tf.zeros_initializer()) # Log standard deviation of outputs
logstd_1a = tf.expand_dims(logstd_1a, 0)
std_1a = tf.exp(logstd_1a)
std_na = tf.tile(std_1a, [tf.shape(mean_na)[0], 1])
ac_dist = tf.concat([tf.reshape(mean_na, [-1, ac_dim]), tf.reshape(std_na, [-1, ac_dim])], 1)
sampled_ac_na = tf.random_normal(tf.shape(ac_dist[:,ac_dim:])) * ac_dist[:,ac_dim:] + ac_dist[:,:ac_dim] # This is the sampled action we'll perform.
logprobsampled_n = - tf.reduce_sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * tf.reduce_sum(tf.square(ac_dist[:,:ac_dim] - sampled_ac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of sampled action
logprob_n = - tf.reduce_sum(tf.log(ac_dist[:,ac_dim:]), axis=1) - 0.5 * tf.log(2.0*np.pi)*ac_dim - 0.5 * tf.reduce_sum(tf.square(ac_dist[:,:ac_dim] - oldac_na) / (tf.square(ac_dist[:,ac_dim:])), axis=1) # Logprob of previous actions under CURRENT policy (whereas oldlogprob_n is under OLD policy)
kl = tf.reduce_mean(kl_div(oldac_dist, ac_dist, ac_dim))
#kl = .5 * tf.reduce_mean(tf.square(logprob_n - oldlogprob_n)) # Approximation of KL divergence between old policy used to generate actions, and new policy used to compute logprob_n
surr = - tf.reduce_mean(adv_n * logprob_n) # Loss function that we'll differentiate to get the policy gradient
surr_sampled = - tf.reduce_mean(logprob_n) # Sampled loss of the policy
self._act = U.function([ob_no], [sampled_ac_na, ac_dist, logprobsampled_n]) # Generate a new action and its logprob
#self.compute_kl = U.function([ob_no, oldac_na, oldlogprob_n], kl) # Compute (approximate) KL divergence between old policy and new policy
self.compute_kl = U.function([ob_no, oldac_dist], kl)
self.update_info = ((ob_no, oldac_na, adv_n), surr, surr_sampled) # Input and output variables needed for computing loss
U.initialize() # Initialize uninitialized TF variables
def act(self, ob):
ac, ac_dist, logp = self._act(ob[None])
return ac[0], ac_dist[0], logp[0]
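The `logprob_n` and `logprobsampled_n` expressions in `GaussianMlpPolicy` are the log-density of a diagonal Gaussian. A pure-Python sketch for a single action follows; `diag_gaussian_logprob` is a hypothetical name, and the inputs are plain lists rather than tensors.

```python
import math


def diag_gaussian_logprob(action, mean, std):
    """Log-density of a diagonal Gaussian evaluated at one action vector."""
    dim = len(action)
    # Sum of log standard deviations (log-determinant term).
    log_det = sum(math.log(s) for s in std)
    # Squared, standardised distance from the mean.
    quad = sum(((m - a) / s) ** 2 for a, m, s in zip(action, mean, std))
    return -log_det - 0.5 * math.log(2.0 * math.pi) * dim - 0.5 * quad


logp = diag_gaussian_logprob([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
```

At the mean of a unit-variance two-dimensional Gaussian the value is `-log(2 * pi)`, which makes a convenient sanity check.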

View File

@@ -1,23 +0,0 @@
#!/usr/bin/env python3
from functools import partial
from baselines import logger
from baselines.acktr.acktr_disc import learn
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
from baselines.ppo2.policies import CnnPolicy
def train(env_id, num_timesteps, seed, num_cpu):
env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
policy_fn = partial(CnnPolicy, one_dim_bias=True)
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
env.close()
def main():
args = atari_arg_parser().parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed, num_cpu=32)
if __name__ == '__main__':
main()

View File

@@ -1,34 +0,0 @@
#!/usr/bin/env python3
import tensorflow as tf
from baselines import logger
from baselines.common.cmd_util import make_mujoco_env, mujoco_arg_parser
from baselines.acktr.acktr_cont import learn
from baselines.acktr.policies import GaussianMlpPolicy
from baselines.acktr.value_functions import NeuralNetValueFunction
def train(env_id, num_timesteps, seed):
env = make_mujoco_env(env_id, seed)
with tf.Session(config=tf.ConfigProto()):
ob_dim = env.observation_space.shape[0]
ac_dim = env.action_space.shape[0]
with tf.variable_scope("vf"):
vf = NeuralNetValueFunction(ob_dim, ac_dim)
with tf.variable_scope("pi"):
policy = GaussianMlpPolicy(ob_dim, ac_dim)
learn(env, policy=policy, vf=vf,
gamma=0.99, lam=0.97, timesteps_per_batch=2500,
desired_kl=0.002,
num_timesteps=num_timesteps, animate=False)
env.close()
def main():
args = mujoco_arg_parser().parse_args()
logger.configure()
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed)
if __name__ == "__main__":
main()

View File

@@ -1,28 +0,0 @@
import tensorflow as tf
def dense(x, size, name, weight_init=None, bias_init=0, weight_loss_dict=None, reuse=None):
with tf.variable_scope(name, reuse=reuse):
assert (len(tf.get_variable_scope().name.split('/')) == 2)
w = tf.get_variable("w", [x.get_shape()[1], size], initializer=weight_init)
b = tf.get_variable("b", [size], initializer=tf.constant_initializer(bias_init))
weight_decay_fc = 3e-4
if weight_loss_dict is not None:
weight_decay = tf.multiply(tf.nn.l2_loss(w), weight_decay_fc, name='weight_decay_loss')
if weight_loss_dict is not None:
weight_loss_dict[w] = weight_decay_fc
weight_loss_dict[b] = 0.0
tf.add_to_collection(tf.get_variable_scope().name.split('/')[0] + '_' + 'losses', weight_decay)
return tf.nn.bias_add(tf.matmul(x, w), b)
def kl_div(action_dist1, action_dist2, action_size):
mean1, std1 = action_dist1[:, :action_size], action_dist1[:, action_size:]
mean2, std2 = action_dist2[:, :action_size], action_dist2[:, action_size:]
numerator = tf.square(mean1 - mean2) + tf.square(std1) - tf.square(std2)
denominator = 2 * tf.square(std2) + 1e-8
return tf.reduce_sum(
numerator / denominator + tf.log(std2) - tf.log(std1), axis=-1)
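`kl_div` computes the KL divergence between two diagonal Gaussians packed as `[mean, std]` halves of one vector. The sketch below restates the same formula in pure Python for a single pair of distributions; it takes means and standard deviations as separate lists and omits the `1e-8` stabiliser in the denominator, both of which are simplifications for illustration.

```python
import math


def diag_gaussian_kl(mean1, std1, mean2, std2):
    """KL(N(mean1, std1^2) || N(mean2, std2^2)) summed over dimensions."""
    total = 0.0
    for m1, s1, m2, s2 in zip(mean1, std1, mean2, std2):
        numerator = (m1 - m2) ** 2 + s1 ** 2 - s2 ** 2
        denominator = 2.0 * s2 ** 2
        total += numerator / denominator + math.log(s2) - math.log(s1)
    return total


same = diag_gaussian_kl([0.0], [1.0], [0.0], [1.0])
shifted = diag_gaussian_kl([0.0], [1.0], [1.0], [1.0])
```

Folding `- square(std2)` into the numerator is what supplies the usual `- 1/2` constant of the closed-form Gaussian KL.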

View File

@@ -1,50 +0,0 @@
from baselines import logger
import numpy as np
import baselines.common as common
from baselines.common import tf_util as U
import tensorflow as tf
from baselines.acktr import kfac
from baselines.acktr.utils import dense
class NeuralNetValueFunction(object):
def __init__(self, ob_dim, ac_dim): #pylint: disable=W0613
X = tf.placeholder(tf.float32, shape=[None, ob_dim*2+ac_dim*2+2]) # batch of observations
vtarg_n = tf.placeholder(tf.float32, shape=[None], name='vtarg')
wd_dict = {}
h1 = tf.nn.elu(dense(X, 64, "h1", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict))
h2 = tf.nn.elu(dense(h1, 64, "h2", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict))
vpred_n = dense(h2, 1, "hfinal", weight_init=U.normc_initializer(1.0), bias_init=0, weight_loss_dict=wd_dict)[:,0]
sample_vpred_n = vpred_n + tf.random_normal(tf.shape(vpred_n))
wd_loss = tf.get_collection("vf_losses", None)
loss = tf.reduce_mean(tf.square(vpred_n - vtarg_n)) + tf.add_n(wd_loss)
loss_sampled = tf.reduce_mean(tf.square(vpred_n - tf.stop_gradient(sample_vpred_n)))
self._predict = U.function([X], vpred_n)
optim = kfac.KfacOptimizer(learning_rate=0.001, cold_lr=0.001*(1-0.9), momentum=0.9, \
clip_kl=0.3, epsilon=0.1, stats_decay=0.95, \
async=1, kfac_update=2, cold_iter=50, \
weight_decay_dict=wd_dict, max_grad_norm=None)
vf_var_list = []
for var in tf.trainable_variables():
if "vf" in var.name:
vf_var_list.append(var)
update_op, self.q_runner = optim.minimize(loss, loss_sampled, var_list=vf_var_list)
self.do_update = U.function([X, vtarg_n], update_op) #pylint: disable=E1101
U.initialize() # Initialize uninitialized TF variables
def _preproc(self, path):
l = pathlength(path)
al = np.arange(l).reshape(-1,1)/10.0
act = path["action_dist"].astype('float32')
X = np.concatenate([path['observation'], act, al, np.ones((l, 1))], axis=1)
return X
def predict(self, path):
return self._predict(self._preproc(path))
def fit(self, paths, targvals):
X = np.concatenate([self._preproc(p) for p in paths])
y = np.concatenate(targvals)
logger.record_tabular("EVBefore", common.explained_variance(self._predict(X), y))
for _ in range(25): self.do_update(X, y)
logger.record_tabular("EVAfter", common.explained_variance(self._predict(X), y))
def pathlength(path):
return path["reward"].shape[0]

@@ -1,2 +1,2 @@
from baselines.bench.benchmarks import *
from baselines.bench.monitor import *
from baselines.bench.benchmarks import * # noqa: F403 F401
from baselines.bench.monitor import * # noqa: F403 F401

@@ -1,5 +1,4 @@
import re
import os.path as osp
import os
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
@@ -20,7 +19,7 @@ def register_benchmark(benchmark):
if 'tasks' in benchmark:
for t in benchmark['tasks']:
if 'desc' not in t:
t['desc'] = remove_version_re.sub('', t['env_id'])
t['desc'] = remove_version_re.sub('', t.get('env_id', t.get('id')))
_BENCHMARKS.append(benchmark)
@@ -59,7 +58,7 @@ register_benchmark({
register_benchmark({
'name': 'Atari10M',
'description': '7 Atari games from Mnih et al. (2013), with pixel observations, 10M timesteps',
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 2, 'num_timesteps': int(10e6)} for _game in _atari7]
'tasks': [{'desc': _game, 'env_id': _game + _ATARI_SUFFIX, 'trials': 6, 'num_timesteps': int(10e6)} for _game in _atari7]
})
register_benchmark({
@@ -84,8 +83,9 @@ _mujocosmall = [
register_benchmark({
'name': 'Mujoco1M',
'description': 'Some small 2D MuJoCo tasks, run for 1M timesteps',
'tasks': [{'env_id': _envid, 'trials': 3, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
'tasks': [{'env_id': _envid, 'trials': 6, 'num_timesteps': int(1e6)} for _envid in _mujocosmall]
})
register_benchmark({
'name': 'MujocoWalkers',
'description': 'MuJoCo forward walkers, run for 8M, humanoid 100M',
@@ -96,6 +96,19 @@ register_benchmark({
]
})
# Bullet
_bulletsmall = [
'InvertedDoublePendulum', 'InvertedPendulum', 'HalfCheetah', 'Reacher', 'Walker2D', 'Hopper', 'Ant'
]
_bulletsmall = [e + 'BulletEnv-v0' for e in _bulletsmall]
register_benchmark({
'name': 'Bullet1M',
'description': '7 mujoco-like tasks from bullet, 1M steps',
'tasks': [{'env_id': e, 'trials': 6, 'num_timesteps': int(1e6)} for e in _bulletsmall]
})
# Roboschool
register_benchmark({
@@ -142,9 +155,10 @@ register_benchmark({
# HER DDPG
_fetch_tasks = ['FetchReach-v1', 'FetchPush-v1', 'FetchSlide-v1']
register_benchmark({
'name': 'HerDdpg',
'description': 'Smoke-test only benchmark of HER',
'tasks': [{'trials': 1, 'env_id': 'FetchReach-v1'}]
'name': 'Fetch1M',
'description': 'Fetch* benchmarks for 1M timesteps',
'tasks': [{'trials': 6, 'env_id': env_id, 'num_timesteps': int(1e6)} for env_id in _fetch_tasks]
})
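
The `desc` fallback added to `register_benchmark` above can be tried in isolation. This is a minimal sketch, not the module itself: the pattern assigned to `remove_version_re` is an assumption (the diff does not show its definition), and `derive_desc` is a hypothetical helper name introduced only for illustration.

```python
import re

# Assumption: the diff does not show how remove_version_re is defined;
# a trailing "-v<digits>" pattern is used here purely for illustration.
remove_version_re = re.compile(r'-v\d+$')


def derive_desc(task):
    """Hypothetical helper mirroring the fallback in register_benchmark."""
    if 'desc' not in task:
        # Prefer 'env_id'; fall back to 'id' for task dicts that use that key instead.
        task['desc'] = remove_version_re.sub('', task.get('env_id', task.get('id')))
    return task['desc']


print(derive_desc({'env_id': 'HopperBulletEnv-v0', 'trials': 6}))
print(derive_desc({'id': 'FetchReach-v1', 'trials': 6}))
```

With the earlier `t['env_id']` lookup, the second call would raise a `KeyError`, since that task dict only carries an `id` key.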

@@ -1,13 +1,11 @@
__all__ = ['Monitor', 'get_monitor_files', 'load_results']
import gym
from gym.core import Wrapper
import time
from glob import glob
import csv
import os.path as osp
import json
import numpy as np
class Monitor(Wrapper):
EXT = "monitor.csv"
@@ -16,21 +14,13 @@ class Monitor(Wrapper):
def __init__(self, env, filename, allow_early_resets=False, reset_keywords=(), info_keywords=()):
Wrapper.__init__(self, env=env)
self.tstart = time.time()
if filename is None:
self.f = None
self.logger = None
if filename:
self.results_writer = ResultsWriter(filename,
header={"t_start": time.time(), 'env_id' : env.spec and env.spec.id},
extra_keys=reset_keywords + info_keywords
)
else:
if not filename.endswith(Monitor.EXT):
if osp.isdir(filename):
filename = osp.join(filename, Monitor.EXT)
else:
filename = filename + "." + Monitor.EXT
self.f = open(filename, "wt")
self.f.write('#%s\n'%json.dumps({"t_start": self.tstart, 'env_id' : env.spec and env.spec.id}))
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+reset_keywords+info_keywords)
self.logger.writeheader()
self.f.flush()
self.results_writer = None
self.reset_keywords = reset_keywords
self.info_keywords = info_keywords
self.allow_early_resets = allow_early_resets
@@ -43,10 +33,7 @@ class Monitor(Wrapper):
self.current_reset_info = {} # extra info about the current episode, that was passed in during reset()
def reset(self, **kwargs):
if not self.allow_early_resets and not self.needs_reset:
raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
self.rewards = []
self.needs_reset = False
self.reset_state()
for k in self.reset_keywords:
v = kwargs.get(k)
if v is None:
@@ -54,10 +41,21 @@ class Monitor(Wrapper):
self.current_reset_info[k] = v
return self.env.reset(**kwargs)
def reset_state(self):
if not self.allow_early_resets and not self.needs_reset:
raise RuntimeError("Tried to reset an environment before done. If you want to allow early resets, wrap your env with Monitor(env, path, allow_early_resets=True)")
self.rewards = []
self.needs_reset = False
def step(self, action):
if self.needs_reset:
raise RuntimeError("Tried to step environment that needs reset")
ob, rew, done, info = self.env.step(action)
self.update(ob, rew, done, info)
return (ob, rew, done, info)
def update(self, ob, rew, done, info):
self.rewards.append(rew)
if done:
self.needs_reset = True
@@ -70,12 +68,13 @@ class Monitor(Wrapper):
self.episode_lengths.append(eplen)
self.episode_times.append(time.time() - self.tstart)
epinfo.update(self.current_reset_info)
if self.logger:
self.logger.writerow(epinfo)
self.f.flush()
info['episode'] = epinfo
if self.results_writer:
self.results_writer.write_row(epinfo)
assert isinstance(info, dict)
if isinstance(info, dict):
info['episode'] = epinfo
self.total_steps += 1
return (ob, rew, done, info)
def close(self):
if self.f is not None:
@@ -96,13 +95,37 @@ class Monitor(Wrapper):
class LoadMonitorResultsError(Exception):
pass
class ResultsWriter(object):
def __init__(self, filename, header='', extra_keys=()):
self.extra_keys = extra_keys
assert filename is not None
if not filename.endswith(Monitor.EXT):
if osp.isdir(filename):
filename = osp.join(filename, Monitor.EXT)
else:
filename = filename + "." + Monitor.EXT
self.f = open(filename, "wt")
if isinstance(header, dict):
header = '# {} \n'.format(json.dumps(header))
self.f.write(header)
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+tuple(extra_keys))
self.logger.writeheader()
self.f.flush()
def write_row(self, epinfo):
if self.logger:
self.logger.writerow(epinfo)
self.f.flush()
def get_monitor_files(dir):
return glob(osp.join(dir, "*" + Monitor.EXT))
def load_results(dir):
import pandas
monitor_files = (
glob(osp.join(dir, "*monitor.json")) +
glob(osp.join(dir, "*monitor.json")) +
glob(osp.join(dir, "*monitor.csv"))) # get both csv and (old) json files
if not monitor_files:
raise LoadMonitorResultsError("no monitor files of the form *%s found in %s" % (Monitor.EXT, dir))
@@ -112,6 +135,8 @@ def load_results(dir):
with open(fname, 'rt') as fh:
if fname.endswith('csv'):
firstline = fh.readline()
if not firstline:
continue
assert firstline[0] == '#'
header = json.loads(firstline[1:])
df = pandas.read_csv(fh, index_col=None)
@@ -135,27 +160,3 @@ def load_results(dir):
df['t'] -= min(header['t_start'] for header in headers)
df.headers = headers # HACK to preserve backwards compatibility
return df
def test_monitor():
env = gym.make("CartPole-v1")
env.seed(0)
mon_file = "/tmp/baselines-test-%s.monitor.csv" % uuid.uuid4()
menv = Monitor(env, mon_file)
menv.reset()
for _ in range(1000):
_, _, done, _ = menv.step(0)
if done:
menv.reset()
f = open(mon_file, 'rt')
firstline = f.readline()
assert firstline.startswith('#')
metadata = json.loads(firstline[1:])
assert metadata['env_id'] == "CartPole-v1"
assert set(metadata.keys()) == {'env_id', 'gym_version', 't_start'}, "Incorrect keys in monitor metadata"
last_logline = pandas.read_csv(f, index_col=None)
assert set(last_logline.keys()) == {'l', 't', 'r'}, "Incorrect keys in monitor logline"
f.close()
os.remove(mon_file)
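
The file layout that `ResultsWriter` writes and `load_results` reads back (one `#`-prefixed JSON header line followed by CSV rows) can be sketched with the standard library alone. The names `write_monitor_file` and `read_monitor_file` are hypothetical, and `csv.DictReader` is swapped in for `pandas.read_csv`, so this illustrates the format rather than the module's actual code path.

```python
import csv
import json
import os
import tempfile


def write_monitor_file(path, header, rows, extra_keys=()):
    """Hypothetical stand-in for ResultsWriter: JSON header line, then CSV rows."""
    with open(path, 'wt', newline='') as f:
        f.write('# {} \n'.format(json.dumps(header)))
        writer = csv.DictWriter(f, fieldnames=('r', 'l', 't') + tuple(extra_keys))
        writer.writeheader()
        for row in rows:
            writer.writerow(row)


def read_monitor_file(path):
    """Hypothetical reader; csv.DictReader replaces pandas.read_csv here."""
    with open(path, 'rt', newline='') as f:
        firstline = f.readline()
        if not firstline:
            # Empty file: skip it, as the new check in load_results does.
            return None, []
        assert firstline[0] == '#'
        header = json.loads(firstline[1:])
        return header, list(csv.DictReader(f))


path = os.path.join(tempfile.mkdtemp(), '0.monitor.csv')
write_monitor_file(path, {'t_start': 0.0, 'env_id': 'CartPole-v1'},
                   [{'r': 21.0, 'l': 21, 't': 0.5}])
header, rows = read_monitor_file(path)
print(header['env_id'], rows)
```

Note that `csv.DictReader` returns every field as a string, whereas `pandas.read_csv` infers numeric types.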

@@ -1,9 +1,13 @@
import numpy as np
import os
os.environ.setdefault('PATH', '')
from collections import deque
import gym
from gym import spaces
import cv2
cv2.ocl.setUseOpenCL(False)
from .wrappers import TimeLimit
class NoopResetEnv(gym.Wrapper):
def __init__(self, env, noop_max=30):
@@ -70,8 +74,8 @@ class EpisodicLifeEnv(gym.Wrapper):
# then update lives to handle bonus lives
lives = self.env.unwrapped.ale.lives()
if lives < self.lives and lives > 0:
# for Qbert sometimes we stay in lives == 0 condtion for a few frames
# so its important to keep lives > 0, so that we only reset once
# for Qbert sometimes we stay in lives == 0 condition for a few frames
# so it's important to keep lives > 0, so that we only reset once
# the environment advertises done.
done = True
self.lives = lives
@@ -126,19 +130,60 @@ class ClipRewardEnv(gym.RewardWrapper):
"""Bin reward to {+1, 0, -1} by its sign."""
return np.sign(reward)
class WarpFrame(gym.ObservationWrapper):
def __init__(self, env):
"""Warp frames to 84x84 as done in the Nature paper and later work."""
gym.ObservationWrapper.__init__(self, env)
self.width = 84
self.height = 84
self.observation_space = spaces.Box(low=0, high=255,
shape=(self.height, self.width, 1), dtype=np.uint8)
def observation(self, frame):
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
return frame[:, :, None]
class WarpFrame(gym.ObservationWrapper):
def __init__(self, env, width=84, height=84, grayscale=True, dict_space_key=None):
"""
Warp frames to 84x84 as done in the Nature paper and later work.
If the environment uses dictionary observations, `dict_space_key` can be specified which indicates which
observation should be warped.
"""
super().__init__(env)
self._width = width
self._height = height
self._grayscale = grayscale
self._key = dict_space_key
if self._grayscale:
num_colors = 1
else:
num_colors = 3
new_space = gym.spaces.Box(
low=0,
high=255,
shape=(self._height, self._width, num_colors),
dtype=np.uint8,
)
if self._key is None:
original_space = self.observation_space
self.observation_space = new_space
else:
original_space = self.observation_space.spaces[self._key]
self.observation_space.spaces[self._key] = new_space
assert original_space.dtype == np.uint8 and len(original_space.shape) == 3
def observation(self, obs):
if self._key is None:
frame = obs
else:
frame = obs[self._key]
if self._grayscale:
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
frame = cv2.resize(
frame, (self._width, self._height), interpolation=cv2.INTER_AREA
)
if self._grayscale:
frame = np.expand_dims(frame, -1)
if self._key is None:
obs = frame
else:
obs = obs.copy()
obs[self._key] = frame
return obs
class FrameStack(gym.Wrapper):
def __init__(self, env, k):
@@ -154,7 +199,7 @@ class FrameStack(gym.Wrapper):
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=np.uint8)
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[:-1] + (shp[-1] * k,)), dtype=env.observation_space.dtype)
def reset(self):
ob = self.env.reset()
@@ -174,6 +219,7 @@ class FrameStack(gym.Wrapper):
class ScaledFloatFrame(gym.ObservationWrapper):
def __init__(self, env):
gym.ObservationWrapper.__init__(self, env)
self.observation_space = gym.spaces.Box(low=0, high=1, shape=env.observation_space.shape, dtype=np.float32)
def observation(self, observation):
# careful! This undoes the memory optimization, use
@@ -194,7 +240,7 @@ class LazyFrames(object):
def _force(self):
if self._out is None:
self._out = np.concatenate(self._frames, axis=2)
self._out = np.concatenate(self._frames, axis=-1)
self._frames = None
return self._out
@@ -210,11 +256,20 @@ class LazyFrames(object):
def __getitem__(self, i):
return self._force()[i]
def make_atari(env_id):
def count(self):
frames = self._force()
return frames.shape[frames.ndim - 1]
def frame(self, i):
return self._force()[..., i]
def make_atari(env_id, max_episode_steps=None):
env = gym.make(env_id)
assert 'NoFrameskip' in env.spec.id
env = NoopResetEnv(env, noop_max=30)
env = MaxAndSkipEnv(env, skip=4)
if max_episode_steps is not None:
env = TimeLimit(env, max_episode_steps=max_episode_steps)
return env
def wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack=False, scale=False):
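
The `FrameStack` change from `(shp[0], shp[1], shp[2] * k)` to `shp[:-1] + (shp[-1] * k,)`, together with `axis=-1` in `LazyFrames`, stacks frames along the last axis whatever the rank of the observation. A minimal sketch using plain tuples; `stacked_shape` is a hypothetical helper, not part of the module.

```python
def stacked_shape(shp, k):
    """Hypothetical helper: shape after stacking k frames along the last axis."""
    return tuple(shp[:-1]) + (shp[-1] * k,)


print(stacked_shape((84, 84, 1), 4))   # grayscale frames
print(stacked_shape((84, 84, 3), 4))   # colour frames (grayscale=False)
print(stacked_shape((8,), 4))          # 1-D observations
```

The earlier indexing assumed exactly three dimensions, so a 1-D shape such as `(8,)` would have raised an `IndexError`.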

@@ -31,4 +31,4 @@ def cg(f_Ax, b, cg_iters=10, callback=None, verbose=False, residual_tol=1e-10):
if callback is not None:
callback(x)
if verbose: print(fmtstr % (i+1, rdotr, np.linalg.norm(x))) # pylint: disable=W0631
return x
return x

@@ -3,7 +3,11 @@ Helpers for scripts like run_atari.py.
"""
import os
from mpi4py import MPI
try:
from mpi4py import MPI
except ImportError:
MPI = None
import gym
from gym.wrappers import FlattenDictWrapper
from baselines import logger
@@ -11,31 +15,90 @@ from baselines.bench import Monitor
from baselines.common import set_global_seeds
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common import retro_wrappers
def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
def make_vec_env(env_id, env_type, num_env, seed,
wrapper_kwargs=None,
start_index=0,
reward_scale=1.0,
flatten_dict_observations=True,
gamestate=None):
"""
Create a wrapped, monitored SubprocVecEnv for Atari.
Create a wrapped, monitored SubprocVecEnv for Atari and MuJoCo.
"""
if wrapper_kwargs is None: wrapper_kwargs = {}
def make_env(rank): # pylint: disable=C0111
def _thunk():
env = make_atari(env_id)
env.seed(seed + rank)
env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
return wrap_deepmind(env, **wrapper_kwargs)
return _thunk
wrapper_kwargs = wrapper_kwargs or {}
mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
seed = seed + 10000 * mpi_rank if seed is not None else None
logger_dir = logger.get_dir()
def make_thunk(rank):
return lambda: make_env(
env_id=env_id,
env_type=env_type,
mpi_rank=mpi_rank,
subrank=rank,
seed=seed,
reward_scale=reward_scale,
gamestate=gamestate,
flatten_dict_observations=flatten_dict_observations,
wrapper_kwargs=wrapper_kwargs,
logger_dir=logger_dir
)
set_global_seeds(seed)
return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
if num_env > 1:
return SubprocVecEnv([make_thunk(i + start_index) for i in range(num_env)])
else:
return DummyVecEnv([make_thunk(start_index)])
def make_mujoco_env(env_id, seed):
def make_env(env_id, env_type, mpi_rank=0, subrank=0, seed=None, reward_scale=1.0, gamestate=None, flatten_dict_observations=True, wrapper_kwargs=None, logger_dir=None):
wrapper_kwargs = wrapper_kwargs or {}
if env_type == 'atari':
env = make_atari(env_id)
elif env_type == 'retro':
import retro
gamestate = gamestate or retro.State.DEFAULT
env = retro_wrappers.make_retro(game=env_id, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE, state=gamestate)
else:
env = gym.make(env_id)
if flatten_dict_observations and isinstance(env.observation_space, gym.spaces.Dict):
keys = env.observation_space.spaces.keys()
env = gym.wrappers.FlattenDictWrapper(env, dict_keys=list(keys))
env.seed(seed + subrank if seed is not None else None)
env = Monitor(env,
logger_dir and os.path.join(logger_dir, str(mpi_rank) + '.' + str(subrank)),
allow_early_resets=True)
if env_type == 'atari':
env = wrap_deepmind(env, **wrapper_kwargs)
elif env_type == 'retro':
if 'frame_stack' not in wrapper_kwargs:
wrapper_kwargs['frame_stack'] = 1
env = retro_wrappers.wrap_deepmind_retro(env, **wrapper_kwargs)
if reward_scale != 1:
env = retro_wrappers.RewardScaler(env, reward_scale)
return env
def make_mujoco_env(env_id, seed, reward_scale=1.0):
"""
Create a wrapped, monitored gym.Env for MuJoCo.
"""
rank = MPI.COMM_WORLD.Get_rank()
set_global_seeds(seed + 10000 * rank)
myseed = seed + 1000 * rank if seed is not None else None
set_global_seeds(myseed)
env = gym.make(env_id)
env = Monitor(env, os.path.join(logger.get_dir(), str(rank)))
logger_path = None if logger.get_dir() is None else os.path.join(logger.get_dir(), str(rank))
env = Monitor(env, logger_path, allow_early_resets=True)
env.seed(seed)
if reward_scale != 1.0:
from baselines.common.retro_wrappers import RewardScaler
env = RewardScaler(env, reward_scale)
return env
def make_robotics_env(env_id, seed, rank=0):
@@ -62,20 +125,31 @@ def atari_arg_parser():
"""
Create an argparse.ArgumentParser for run_atari.py.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(10e6))
return parser
print('Obsolete - use common_arg_parser instead')
return common_arg_parser()
def mujoco_arg_parser():
print('Obsolete - use common_arg_parser instead')
return common_arg_parser()
def common_arg_parser():
"""
Create an argparse.ArgumentParser for run_mujoco.py.
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
parser.add_argument('--env_type', help='type of environment, used when the environment type cannot be automatically determined', type=str)
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
parser.add_argument('--alg', help='Algorithm', type=str, default='ppo2')
parser.add_argument('--num_timesteps', type=float, default=1e6)
parser.add_argument('--network', help='network type (mlp, cnn, lstm, cnn_lstm, conv_only)', default=None)
parser.add_argument('--gamestate', help='game state to load (so far only used in retro games)', default=None)
parser.add_argument('--num_env', help='Number of environment copies being run in parallel. When not specified, set to number of cpus for Atari, and to 1 for Mujoco', default=None, type=int)
parser.add_argument('--reward_scale', help='Reward scale factor. Default: 1.0', default=1.0, type=float)
parser.add_argument('--save_path', help='Path to save trained model to', default=None, type=str)
parser.add_argument('--save_video_interval', help='Save video every x steps (0 = disabled)', default=0, type=int)
parser.add_argument('--save_video_length', help='Length of recorded video. Default: 200', default=200, type=int)
parser.add_argument('--log_path', help='Directory to save learning curve data.', default=None, type=str)
parser.add_argument('--play', default=False, action='store_true')
return parser
@@ -85,6 +159,28 @@ def robotics_arg_parser():
"""
parser = arg_parser()
parser.add_argument('--env', help='environment ID', type=str, default='FetchReach-v0')
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
parser.add_argument('--num-timesteps', type=int, default=int(1e6))
return parser
def parse_unknown_args(args):
"""
Parse arguments not consumed by arg parser into a dictionary
"""
retval = {}
preceded_by_key = False
for arg in args:
if arg.startswith('--'):
if '=' in arg:
key = arg.split('=')[0][2:]
value = arg.split('=')[1]
retval[key] = value
else:
key = arg[2:]
preceded_by_key = True
elif preceded_by_key:
retval[key] = arg
preceded_by_key = False
return retval
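
A usage sketch for `parse_unknown_args`. The function body is restated with its indentation restored (the indentation is an assumption, since the diff view dropped it), and the flag names `--lr` and `--nsteps` are hypothetical examples.

```python
def parse_unknown_args(args):
    """Parse arguments not consumed by the arg parser into a dictionary."""
    retval = {}
    preceded_by_key = False
    for arg in args:
        if arg.startswith('--'):
            if '=' in arg:
                # "--key=value" form
                key = arg.split('=')[0][2:]
                value = arg.split('=')[1]
                retval[key] = value
            else:
                # "--key value" form: remember the key, wait for the value
                key = arg[2:]
                preceded_by_key = True
        elif preceded_by_key:
            retval[key] = arg
            preceded_by_key = False
    return retval


extra = parse_unknown_args(['--lr=3e-4', '--nsteps', '128', 'stray'])
print(extra)
```

Values come back as strings, so the caller has to convert them. A bare token that does not follow a `--key` is dropped, and a value that itself contains `=` is cut at the second `=`.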

@@ -2,6 +2,8 @@ from __future__ import print_function
from contextlib import contextmanager
import numpy as np
import time
import shlex
import subprocess
# ================================================================
# Misc
@@ -37,7 +39,7 @@ color2num = dict(
crimson=38
)
def colorize(string, color, bold=False, highlight=False):
def colorize(string, color='green', bold=False, highlight=False):
attr = []
num = color2num[color]
if highlight: num += 10
@@ -45,6 +47,25 @@ def colorize(string, color, bold=False, highlight=False):
if bold: attr.append('1')
return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string)
def print_cmd(cmd, dry=False):
if isinstance(cmd, str): # for shell=True
pass
else:
cmd = ' '.join(shlex.quote(arg) for arg in cmd)
print(colorize(('CMD: ' if not dry else 'DRY: ') + cmd))
def get_git_commit(cwd=None):
return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'], cwd=cwd).decode('utf8')
def get_git_commit_message(cwd=None):
return subprocess.check_output(['git', 'show', '-s', '--format=%B', 'HEAD'], cwd=cwd).decode('utf8')
def ccap(cmd, dry=False, env=None, **kwargs):
print_cmd(cmd, dry)
if not dry:
subprocess.check_call(cmd, env=env, **kwargs)
MESSAGE_DEPTH = 0
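
The quoting done by `print_cmd` can be checked without running anything. In this sketch `format_cmd` is a hypothetical variant that returns the text instead of printing it, and the `colorize` call is left out.

```python
import shlex


def format_cmd(cmd, dry=False):
    """Hypothetical variant of print_cmd that returns the text (no colour codes)."""
    if not isinstance(cmd, str):
        # List form: quote each argument so the line can be pasted into a shell.
        cmd = ' '.join(shlex.quote(arg) for arg in cmd)
    return ('CMD: ' if not dry else 'DRY: ') + cmd


print(format_cmd(['git', 'commit', '-m', 'fix monitor test']))
print(format_cmd('ls -la | wc -l', dry=True))
```

A string command is passed through untouched because it is meant for `shell=True`; quoting it would change its meaning.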

@@ -1,8 +1,6 @@
import tensorflow as tf
import numpy as np
import baselines.common.tf_util as U
from baselines.a2c.utils import fc
from tensorflow.python.ops import math_ops
class Pd(object):
"""
@@ -23,8 +21,15 @@ class Pd(object):
raise NotImplementedError
def logp(self, x):
return - self.neglogp(x)
def get_shape(self):
return self.flatparam().shape
@property
def shape(self):
return self.get_shape()
def __getitem__(self, idx):
return self.__class__(self.flatparam()[idx])
class PdType(object):
class PdType(tf.Module):
"""
Parametrized family of probability distributions
"""
@@ -41,18 +46,18 @@ class PdType(object):
def sample_dtype(self):
raise NotImplementedError
def param_placeholder(self, prepend_shape, name=None):
return tf.placeholder(dtype=tf.float32, shape=prepend_shape+self.param_shape(), name=name)
def sample_placeholder(self, prepend_shape, name=None):
return tf.placeholder(dtype=self.sample_dtype(), shape=prepend_shape+self.sample_shape(), name=name)
def __eq__(self, other):
return (type(self) == type(other)) and (self.__dict__ == other.__dict__)
class CategoricalPdType(PdType):
def __init__(self, ncat):
def __init__(self, latent_shape, ncat, init_scale=1.0, init_bias=0.0):
self.ncat = ncat
self.matching_fc = _matching_fc(latent_shape, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
def pdclass(self):
return CategoricalPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
pdparam = fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
def pdfromlatent(self, latent_vector):
pdparam = self.matching_fc(latent_vector)
return self.pdfromflat(pdparam), pdparam
def param_shape(self):
@@ -62,31 +67,18 @@ class CategoricalPdType(PdType):
def sample_dtype(self):
return tf.int32
class MultiCategoricalPdType(PdType):
def __init__(self, nvec):
self.ncats = nvec
def pdclass(self):
return MultiCategoricalPd
def pdfromflat(self, flat):
return MultiCategoricalPd(self.ncats, flat)
def param_shape(self):
return [sum(self.ncats)]
def sample_shape(self):
return [len(self.ncats)]
def sample_dtype(self):
return tf.int32
class DiagGaussianPdType(PdType):
def __init__(self, size):
def __init__(self, latent_shape, size, init_scale=1.0, init_bias=0.0):
self.size = size
self.matching_fc = _matching_fc(latent_shape, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
self.logstd = tf.Variable(np.zeros((1, self.size)), name='pi/logstd', dtype=tf.float32)
def pdclass(self):
return DiagGaussianPd
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
mean = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
logstd = tf.get_variable(name='logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
def pdfromlatent(self, latent_vector):
mean = self.matching_fc(latent_vector)
pdparam = tf.concat([mean, mean * 0.0 + self.logstd], axis=1)
return self.pdfromflat(pdparam), mean
def param_shape(self):
@@ -96,40 +88,6 @@ class DiagGaussianPdType(PdType):
def sample_dtype(self):
return tf.float32
class BernoulliPdType(PdType):
def __init__(self, size):
self.size = size
def pdclass(self):
return BernoulliPd
def param_shape(self):
return [self.size]
def sample_shape(self):
return [self.size]
def sample_dtype(self):
return tf.int32
# WRONG SECOND DERIVATIVES
# class CategoricalPd(Pd):
# def __init__(self, logits):
# self.logits = logits
# self.ps = tf.nn.softmax(logits)
# @classmethod
# def fromflat(cls, flat):
# return cls(flat)
# def flatparam(self):
# return self.logits
# def mode(self):
# return U.argmax(self.logits, axis=-1)
# def logp(self, x):
# return -tf.nn.sparse_softmax_cross_entropy_with_logits(self.logits, x)
# def kl(self, other):
# return tf.nn.softmax_cross_entropy_with_logits(other.logits, self.ps) \
# - tf.nn.softmax_cross_entropy_with_logits(self.logits, self.ps)
# def entropy(self):
# return tf.nn.softmax_cross_entropy_with_logits(self.logits, self.ps)
# def sample(self):
# u = tf.random_uniform(tf.shape(self.logits))
# return U.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
class CategoricalPd(Pd):
def __init__(self, logits):
@@ -138,56 +96,53 @@ class CategoricalPd(Pd):
return self.logits
def mode(self):
return tf.argmax(self.logits, axis=-1)
@property
def mean(self):
return tf.nn.softmax(self.logits)
def neglogp(self, x):
# return tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
# Note: we can't use sparse_softmax_cross_entropy_with_logits because
# the implementation does not allow second-order derivatives...
one_hot_actions = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
return tf.nn.softmax_cross_entropy_with_logits(
logits=self.logits,
labels=one_hot_actions)
if x.dtype in {tf.uint8, tf.int32, tf.int64}:
# one-hot encoding
x_shape_list = x.shape.as_list()
logits_shape_list = self.logits.get_shape().as_list()[:-1]
for xs, ls in zip(x_shape_list, logits_shape_list):
if xs is not None and ls is not None:
assert xs == ls, 'shape mismatch: {} in x vs {} in logits'.format(xs, ls)
x = tf.one_hot(x, self.logits.get_shape().as_list()[-1])
else:
# already encoded
print('logits is {}'.format(self.logits))
assert x.shape.as_list() == self.logits.shape.as_list()
return tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=x)
def kl(self, other):
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keep_dims=True)
a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keep_dims=True)
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
a1 = other.logits - tf.reduce_max(other.logits, axis=-1, keepdims=True)
ea0 = tf.exp(a0)
ea1 = tf.exp(a1)
z0 = tf.reduce_sum(ea0, axis=-1, keep_dims=True)
z1 = tf.reduce_sum(ea1, axis=-1, keep_dims=True)
z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
z1 = tf.reduce_sum(ea1, axis=-1, keepdims=True)
p0 = ea0 / z0
return tf.reduce_sum(p0 * (a0 - tf.log(z0) - a1 + tf.log(z1)), axis=-1)
return tf.reduce_sum(p0 * (a0 - tf.math.log(z0) - a1 + tf.math.log(z1)), axis=-1)
def entropy(self):
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keep_dims=True)
a0 = self.logits - tf.reduce_max(self.logits, axis=-1, keepdims=True)
ea0 = tf.exp(a0)
z0 = tf.reduce_sum(ea0, axis=-1, keep_dims=True)
z0 = tf.reduce_sum(ea0, axis=-1, keepdims=True)
p0 = ea0 / z0
return tf.reduce_sum(p0 * (tf.log(z0) - a0), axis=-1)
return tf.reduce_sum(p0 * (tf.math.log(z0) - a0), axis=-1)
def sample(self):
u = tf.random_uniform(tf.shape(self.logits))
return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
u = tf.random.uniform(tf.shape(self.logits), dtype=self.logits.dtype, seed=0)
return tf.argmax(self.logits - tf.math.log(-tf.math.log(u)), axis=-1)
@classmethod
def fromflat(cls, flat):
return cls(flat)
class MultiCategoricalPd(Pd):
def __init__(self, nvec, flat):
self.flat = flat
self.categoricals = list(map(CategoricalPd, tf.split(flat, nvec, axis=-1)))
def flatparam(self):
return self.flat
def mode(self):
return tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
def neglogp(self, x):
return tf.add_n([p.neglogp(px) for p, px in zip(self.categoricals, tf.unstack(x, axis=-1))])
def kl(self, other):
return tf.add_n([p.kl(q) for p, q in zip(self.categoricals, other.categoricals)])
def entropy(self):
return tf.add_n([p.entropy() for p in self.categoricals])
def sample(self):
return tf.cast(tf.stack([p.sample() for p in self.categoricals], axis=-1), tf.int32)
@classmethod
def fromflat(cls, flat):
raise NotImplementedError
class DiagGaussianPd(Pd):
def __init__(self, flat):
self.flat = flat
@@ -201,7 +156,7 @@ class DiagGaussianPd(Pd):
return self.mean
def neglogp(self, x):
return 0.5 * tf.reduce_sum(tf.square((x - self.mean) / self.std), axis=-1) \
+ 0.5 * np.log(2.0 * np.pi) * tf.to_float(tf.shape(x)[-1]) \
+ 0.5 * np.log(2.0 * np.pi) * tf.cast(tf.shape(x)[-1], dtype=tf.float32) \
+ tf.reduce_sum(self.logstd, axis=-1)
def kl(self, other):
assert isinstance(other, DiagGaussianPd)
@@ -209,101 +164,23 @@ class DiagGaussianPd(Pd):
def entropy(self):
return tf.reduce_sum(self.logstd + .5 * np.log(2.0 * np.pi * np.e), axis=-1)
def sample(self):
return self.mean + self.std * tf.random_normal(tf.shape(self.mean))
return self.mean + self.std * tf.random.normal(tf.shape(self.mean))
@classmethod
def fromflat(cls, flat):
return cls(flat)
class BernoulliPd(Pd):
def __init__(self, logits):
self.logits = logits
self.ps = tf.sigmoid(logits)
def flatparam(self):
return self.logits
def mode(self):
return tf.round(self.ps)
def neglogp(self, x):
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=tf.to_float(x)), axis=-1)
def kl(self, other):
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=other.logits, labels=self.ps), axis=-1) - tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
def entropy(self):
return tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits=self.logits, labels=self.ps), axis=-1)
def sample(self):
u = tf.random_uniform(tf.shape(self.ps))
return tf.to_float(math_ops.less(u, self.ps))
@classmethod
def fromflat(cls, flat):
return cls(flat)
def make_pdtype(ac_space):
def make_pdtype(latent_shape, ac_space, init_scale=1.0):
from gym import spaces
if isinstance(ac_space, spaces.Box):
assert len(ac_space.shape) == 1
return DiagGaussianPdType(ac_space.shape[0])
return DiagGaussianPdType(latent_shape, ac_space.shape[0], init_scale)
elif isinstance(ac_space, spaces.Discrete):
return CategoricalPdType(ac_space.n)
elif isinstance(ac_space, spaces.MultiDiscrete):
return MultiCategoricalPdType(ac_space.nvec)
elif isinstance(ac_space, spaces.MultiBinary):
return BernoulliPdType(ac_space.n)
return CategoricalPdType(latent_shape, ac_space.n, init_scale)
else:
raise NotImplementedError
raise ValueError('No implementation for {}'.format(ac_space))
def shape_el(v, i):
maybe = v.get_shape()[i]
if maybe is not None:
return maybe
def _matching_fc(tensor_shape, name, size, init_scale, init_bias):
if tensor_shape[-1] == size:
return lambda x: x
else:
return tf.shape(v)[i]
@U.in_session
def test_probtypes():
np.random.seed(0)
pdparam_diag_gauss = np.array([-.2, .3, .4, -.5, .1, -.5, .1, 0.8])
diag_gauss = DiagGaussianPdType(pdparam_diag_gauss.size // 2) #pylint: disable=E1101
validate_probtype(diag_gauss, pdparam_diag_gauss)
pdparam_categorical = np.array([-.2, .3, .5])
categorical = CategoricalPdType(pdparam_categorical.size) #pylint: disable=E1101
validate_probtype(categorical, pdparam_categorical)
nvec = [1,2,3]
pdparam_multicategorical = np.array([-.2, .3, .5, .1, 1, -.1])
multicategorical = MultiCategoricalPdType(nvec) #pylint: disable=E1101
validate_probtype(multicategorical, pdparam_multicategorical)
pdparam_bernoulli = np.array([-.2, .3, .5])
bernoulli = BernoulliPdType(pdparam_bernoulli.size) #pylint: disable=E1101
validate_probtype(bernoulli, pdparam_bernoulli)
def validate_probtype(probtype, pdparam):
N = 100000
# Check to see if mean negative log likelihood == differential entropy
Mval = np.repeat(pdparam[None, :], N, axis=0)
M = probtype.param_placeholder([N])
X = probtype.sample_placeholder([N])
pd = probtype.pdfromflat(M)
calcloglik = U.function([X, M], pd.logp(X))
calcent = U.function([M], pd.entropy())
Xval = tf.get_default_session().run(pd.sample(), feed_dict={M:Mval})
logliks = calcloglik(Xval, Mval)
entval_ll = - logliks.mean() #pylint: disable=E1101
entval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
entval = calcent(Mval).mean() #pylint: disable=E1101
assert np.abs(entval - entval_ll) < 3 * entval_ll_stderr # within 3 sigmas
# Check to see if kldiv[p,q] = - ent[p] - E_p[log q]
M2 = probtype.param_placeholder([N])
pd2 = probtype.pdfromflat(M2)
q = pdparam + np.random.randn(pdparam.size) * 0.1
Mval2 = np.repeat(q[None, :], N, axis=0)
calckl = U.function([M, M2], pd.kl(pd2))
klval = calckl(Mval, Mval2).mean() #pylint: disable=E1101
logliks = calcloglik(Xval, Mval2)
klval_ll = - entval - logliks.mean() #pylint: disable=E1101
klval_ll_stderr = logliks.std() / np.sqrt(N) #pylint: disable=E1101
assert np.abs(klval - klval_ll) < 3 * klval_ll_stderr # within 3 sigmas
print('ok on', probtype, pdparam)
return fc(tensor_shape, name, size, init_scale=init_scale, init_bias=init_bias)

View File

@@ -1,98 +0,0 @@
from .running_stat import RunningStat
from collections import deque
import numpy as np
class Filter(object):
def __call__(self, x, update=True):
raise NotImplementedError
def reset(self):
pass
class IdentityFilter(Filter):
def __call__(self, x, update=True):
return x
class CompositionFilter(Filter):
def __init__(self, fs):
self.fs = fs
def __call__(self, x, update=True):
for f in self.fs:
x = f(x)
return x
def output_shape(self, input_space):
out = input_space.shape
for f in self.fs:
out = f.output_shape(out)
return out
class ZFilter(Filter):
"""
y = (x-mean)/std
using running estimates of mean,std
"""
def __init__(self, shape, demean=True, destd=True, clip=10.0):
self.demean = demean
self.destd = destd
self.clip = clip
self.rs = RunningStat(shape)
def __call__(self, x, update=True):
if update: self.rs.push(x)
if self.demean:
x = x - self.rs.mean
if self.destd:
x = x / (self.rs.std+1e-8)
if self.clip:
x = np.clip(x, -self.clip, self.clip)
return x
def output_shape(self, input_space):
return input_space.shape
class AddClock(Filter):
def __init__(self):
self.count = 0
def reset(self):
self.count = 0
def __call__(self, x, update=True):
return np.append(x, self.count/100.0)
def output_shape(self, input_space):
return (input_space.shape[0]+1,)
class FlattenFilter(Filter):
def __call__(self, x, update=True):
return x.ravel()
def output_shape(self, input_space):
return (int(np.prod(input_space.shape)),)
class Ind2OneHotFilter(Filter):
def __init__(self, n):
self.n = n
def __call__(self, x, update=True):
out = np.zeros(self.n)
out[x] = 1
return out
def output_shape(self, input_space):
return (input_space.n,)
class DivFilter(Filter):
def __init__(self, divisor):
self.divisor = divisor
def __call__(self, x, update=True):
return x / self.divisor
def output_shape(self, input_space):
return input_space.shape
class StackFilter(Filter):
def __init__(self, length):
self.stack = deque(maxlen=length)
def reset(self):
self.stack.clear()
def __call__(self, x, update=True):
self.stack.append(x)
while len(self.stack) < self.stack.maxlen:
self.stack.append(x)
return np.concatenate(self.stack, axis=-1)
def output_shape(self, input_space):
return input_space.shape[:-1] + (input_space.shape[-1] * self.stack.maxlen,)

View File

@@ -1,30 +0,0 @@
from gym import Env
from gym.spaces import Discrete
class IdentityEnv(Env):
def __init__(
self,
dim,
ep_length=100,
):
self.action_space = Discrete(dim)
self.reset()
def reset(self):
self._choose_next_state()
self.observation_space = self.action_space
return self.state
def step(self, actions):
rew = self._get_reward(actions)
self._choose_next_state()
return self.state, rew, False, {}
def _choose_next_state(self):
self.state = self.action_space.sample()
def _get_reward(self, actions):
return 1 if self.state == actions else 0

View File

@@ -1,30 +0,0 @@
import tensorflow as tf
from gym.spaces import Discrete, Box
def observation_input(ob_space, batch_size=None, name='Ob'):
'''
Build observation input with encoding depending on the
observation space type
Params:
ob_space: observation space (should be one of gym.spaces)
batch_size: batch size for input (default is None, so that resulting input placeholder can take tensors with any batch size)
name: tensorflow variable name for input placeholder
returns: tuple (input_placeholder, processed_input_tensor)
'''
if isinstance(ob_space, Discrete):
input_x = tf.placeholder(shape=(batch_size,), dtype=tf.int32, name=name)
processed_x = tf.to_float(tf.one_hot(input_x, ob_space.n))
return input_x, processed_x
elif isinstance(ob_space, Box):
input_shape = (batch_size,) + ob_space.shape
input_x = tf.placeholder(shape=input_shape, dtype=ob_space.dtype, name=name)
processed_x = tf.to_float(input_x)
return input_x, processed_x
else:
raise NotImplementedError

View File

@@ -82,4 +82,4 @@ def test_discount_with_boundaries():
2 + gamma * 3,
3,
4
])
])

View File

@@ -13,27 +13,6 @@ def zipsame(*seqs):
return zip(*seqs)
def unpack(seq, sizes):
"""
Unpack 'seq' into a sequence of lists, with lengths specified by 'sizes'.
None = just one bare element, not a list
Example:
unpack([1,2,3,4,5,6], [3,None,2]) -> ([1,2,3], 4, [5,6])
"""
seq = list(seq)
it = iter(seq)
assert sum(1 if s is None else s for s in sizes) == len(seq), "Trying to unpack %s into %s" % (seq, sizes)
for size in sizes:
if size is None:
yield it.__next__()
else:
li = []
for _ in range(size):
li.append(it.__next__())
yield li
class EzPickle(object):
"""Objects that are pickled and unpickled via their constructor
arguments.
@@ -67,14 +46,20 @@ class EzPickle(object):
def set_global_seeds(i):
try:
import MPI
rank = MPI.COMM_WORLD.Get_rank()
except ImportError:
rank = 0
myseed = i + 1000 * rank if i is not None else None
try:
import tensorflow as tf
tf.random.set_seed(myseed)
except ImportError:
pass
else:
tf.set_random_seed(i)
np.random.seed(i)
random.seed(i)
np.random.seed(myseed)
random.seed(myseed)
def pretty_eta(seconds_left):

117
baselines/common/models.py Normal file
View File

@@ -0,0 +1,117 @@
import numpy as np
import tensorflow as tf
from baselines.a2c.utils import ortho_init, conv
mapping = {}
def register(name):
def _thunk(func):
mapping[name] = func
return func
return _thunk
def nature_cnn(input_shape, **conv_kwargs):
"""
CNN from Nature paper.
"""
print('input shape is {}'.format(input_shape))
x_input = tf.keras.Input(shape=input_shape, dtype=tf.uint8)
h = x_input
h = tf.cast(h, tf.float32) / 255.
h = conv('c1', nf=32, rf=8, stride=4, activation='relu', init_scale=np.sqrt(2))(h)
h2 = conv('c2', nf=64, rf=4, stride=2, activation='relu', init_scale=np.sqrt(2))(h)
h3 = conv('c3', nf=64, rf=3, stride=1, activation='relu', init_scale=np.sqrt(2))(h2)
h3 = tf.keras.layers.Flatten()(h3)
h3 = tf.keras.layers.Dense(units=512, kernel_initializer=ortho_init(np.sqrt(2)),
name='fc1', activation='relu')(h3)
network = tf.keras.Model(inputs=[x_input], outputs=[h3])
return network
@register("mlp")
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
"""
Stack of fully-connected layers to be used in a policy / q-function approximator
Parameters:
----------
num_layers: int number of fully-connected layers (default: 2)
num_hidden: int size of fully-connected layers (default: 64)
activation: activation function (default: tf.tanh)
Returns:
-------
function that builds a fully connected tf.keras.Model for a given input shape
"""
def network_fn(input_shape):
print('input shape is {}'.format(input_shape))
x_input = tf.keras.Input(shape=input_shape)
# h = tf.keras.layers.Flatten(x_input)
h = x_input
for i in range(num_layers):
h = tf.keras.layers.Dense(units=num_hidden, kernel_initializer=ortho_init(np.sqrt(2)),
name='mlp_fc{}'.format(i), activation=activation)(h)
network = tf.keras.Model(inputs=[x_input], outputs=[h])
return network
return network_fn
@register("cnn")
def cnn(**conv_kwargs):
def network_fn(input_shape):
return nature_cnn(input_shape, **conv_kwargs)
return network_fn
@register("conv_only")
def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
'''
convolutions-only net
Parameters:
----------
convs: list of triples (filter_number, filter_size, stride) specifying parameters for each layer.
Returns:
function that takes an input shape and returns a tf.keras.Model whose output is the output of the last convolutional layer
'''
def network_fn(input_shape):
print('input shape is {}'.format(input_shape))
x_input = tf.keras.Input(shape=input_shape, dtype=tf.uint8)
h = x_input
h = tf.cast(h, tf.float32) / 255.
with tf.name_scope("convnet"):
for num_outputs, kernel_size, stride in convs:
h = tf.keras.layers.Conv2D(
filters=num_outputs, kernel_size=kernel_size, strides=stride,
activation='relu', **conv_kwargs)(h)
network = tf.keras.Model(inputs=[x_input], outputs=[h])
return network
return network_fn
def get_network_builder(name):
"""
If you want to register your own network outside models.py, you just need:
Usage Example:
-------------
from baselines.common.models import register
@register("your_network_name")
def your_network_define(**net_kwargs):
...
return network_fn
"""
if callable(name):
return name
elif name in mapping:
return mapping[name]
else:
raise ValueError('Unknown network type: {}'.format(name))
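
Note on the registry above: a minimal, framework-free sketch of how register and get_network_builder fit together. The builder name "tiny_net" and the dict it returns are hypothetical stand-ins for illustration only; a real builder in models.py returns a tf.keras.Model.

```python
# Minimal sketch of the name -> builder registry pattern used in models.py.
# "tiny_net" and the dict it returns are hypothetical; they are not part of baselines.
mapping = {}

def register(name):
    # Decorator factory: stores the decorated builder under `name`.
    def _thunk(func):
        mapping[name] = func
        return func
    return _thunk

def get_network_builder(name):
    # Accept either a callable (returned as-is) or a registered name.
    if callable(name):
        return name
    elif name in mapping:
        return mapping[name]
    else:
        raise ValueError('Unknown network type: {}'.format(name))

@register("tiny_net")
def tiny_net(num_hidden=8):
    def network_fn(input_shape):
        # A real builder would construct and return a tf.keras.Model here.
        return {"input_shape": input_shape, "num_hidden": num_hidden}
    return network_fn

builder = get_network_builder("tiny_net")   # look up by name
network = builder(num_hidden=16)((4,))      # configure, then build for an input shape
print(network)
```

The two-stage call (builder kwargs first, input shape second) mirrors how mlp, cnn and conv_only return a network_fn closure.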

View File

@@ -1,7 +1,10 @@
from mpi4py import MPI
import baselines.common.tf_util as U
import tensorflow as tf
import numpy as np
try:
from mpi4py import MPI
except ImportError:
MPI = None
class MpiAdam(object):
def __init__(self, var_list, *, beta1=0.9, beta2=0.999, epsilon=1e-08, scale_grad_by_procs=True, comm=None):
@@ -16,16 +19,19 @@ class MpiAdam(object):
self.t = 0
self.setfromflat = U.SetFromFlat(var_list)
self.getflat = U.GetFlat(var_list)
self.comm = MPI.COMM_WORLD if comm is None else comm
self.comm = MPI.COMM_WORLD if comm is None and MPI is not None else comm
def update(self, localg, stepsize):
if self.t % 100 == 0:
self.check_synced()
localg = localg.astype('float32')
globalg = np.zeros_like(localg)
self.comm.Allreduce(localg, globalg, op=MPI.SUM)
if self.scale_grad_by_procs:
globalg /= self.comm.Get_size()
if self.comm is not None:
globalg = np.zeros_like(localg)
self.comm.Allreduce(localg, globalg, op=MPI.SUM)
if self.scale_grad_by_procs:
globalg /= self.comm.Get_size()
else:
globalg = np.copy(localg)
self.t += 1
a = stepsize * np.sqrt(1 - self.beta2**self.t)/(1 - self.beta1**self.t)
@@ -35,11 +41,15 @@ class MpiAdam(object):
self.setfromflat(self.getflat() + step)
def sync(self):
if self.comm is None:
return
theta = self.getflat()
self.comm.Bcast(theta, root=0)
self.setfromflat(theta)
def check_synced(self):
if self.comm is None:
return
if self.comm.Get_rank() == 0: # this is root
theta = self.getflat()
self.comm.Bcast(theta, root=0)
@@ -48,32 +58,3 @@ class MpiAdam(object):
thetaroot = np.empty_like(thetalocal)
self.comm.Bcast(thetaroot, root=0)
assert (thetaroot == thetalocal).all(), (thetaroot, thetalocal)
@U.in_session
def test_MpiAdam():
np.random.seed(0)
tf.set_random_seed(0)
a = tf.Variable(np.random.randn(3).astype('float32'))
b = tf.Variable(np.random.randn(2,5).astype('float32'))
loss = tf.reduce_sum(tf.square(a)) + tf.reduce_sum(tf.sin(b))
stepsize = 1e-2
update_op = tf.train.AdamOptimizer(stepsize).minimize(loss)
do_update = U.function([], loss, updates=[update_op])
tf.get_default_session().run(tf.global_variables_initializer())
for i in range(10):
print(i,do_update())
tf.set_random_seed(0)
tf.get_default_session().run(tf.global_variables_initializer())
var_list = [a,b]
lossandgrad = U.function([], [loss, U.flatgrad(loss, var_list)], updates=[update_op])
adam = MpiAdam(var_list)
for i in range(10):
l,g = lossandgrad()
adam.update(g, stepsize)
print(i,l)

View File

@@ -0,0 +1,59 @@
import numpy as np
import tensorflow as tf
try:
from mpi4py import MPI
except ImportError:
MPI = None
class MpiAdamOptimizer(tf.Module):
"""Adam optimizer that averages gradients across mpi processes."""
def __init__(self, comm, var_list):
self.var_list = var_list
self.comm = comm
self.beta1 = 0.9
self.beta2 = 0.999
self.epsilon = 1e-08
self.t = tf.Variable(0, name='step', dtype=tf.int32)
var_shapes = [v.shape.as_list() for v in var_list]
self.var_sizes = [int(np.prod(s)) for s in var_shapes]
self.flat_var_size = sum(self.var_sizes)
self.m = tf.Variable(np.zeros(self.flat_var_size, 'float32'))
self.v = tf.Variable(np.zeros(self.flat_var_size, 'float32'))
def apply_gradients(self, flat_grad, lr):
buf = np.zeros(self.flat_var_size, np.float32)
self.comm.Allreduce(flat_grad.numpy(), buf, op=MPI.SUM)
avg_flat_grad = np.divide(buf, float(self.comm.Get_size()))
self._apply_gradients(tf.constant(avg_flat_grad), lr)
if self.t.numpy() % 100 == 0:
check_synced(tf.reduce_sum(self.var_list[0]).numpy())
@tf.function
def _apply_gradients(self, avg_flat_grad, lr):
self.t.assign_add(1)
t = tf.cast(self.t, tf.float32)
a = lr * tf.math.sqrt(1 - tf.math.pow(self.beta2, t)) / (1 - tf.math.pow(self.beta1, t))
self.m.assign(self.beta1 * self.m + (1 - self.beta1) * avg_flat_grad)
self.v.assign(self.beta2 * self.v + (1 - self.beta2) * tf.math.square(avg_flat_grad))
flat_step = (- a) * self.m / (tf.math.sqrt(self.v) + self.epsilon)
var_steps = tf.split(flat_step, self.var_sizes, axis=0)
for var_step, var in zip(var_steps, self.var_list):
var.assign_add(tf.reshape(var_step, var.shape))
def check_synced(localval, comm=None):
"""
It's common to forget to initialize your variables to the same values on every worker, or
(less commonly) to let them drift out of sync by updating them in some way other than Adam.
This function checks that variables on all MPI workers are the same, and raises
an AssertionError otherwise
Arguments:
comm: MPI communicator
localval: list of local variables (list of variables on current worker to be compared with the other workers)
"""
comm = comm or MPI.COMM_WORLD
vals = comm.gather(localval)
if comm.rank == 0:
assert all(val==vals[0] for val in vals[1:]),\
'MpiAdamOptimizer detected that different workers have different weights: {}'.format(vals)
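
Note on the optimizer above: a NumPy sketch of the same bias-corrected Adam step that _apply_gradients performs on the flat parameter vector. The MPI Allreduce is replaced by a plain average over a hypothetical list of per-worker gradients, so this is an illustration under that assumption, not the distributed implementation.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update on a flat parameter vector, following _apply_gradients."""
    t = t + 1
    # Step size with the bias correction folded in.
    a = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    m = beta1 * m + (1 - beta1) * grad              # first-moment estimate
    v = beta2 * v + (1 - beta2) * np.square(grad)   # second-moment estimate
    theta = theta - a * m / (np.sqrt(v) + epsilon)
    return theta, m, v, t

# Hypothetical gradients from two workers; averaging stands in for
# comm.Allreduce(..., op=MPI.SUM) followed by division by comm.Get_size().
worker_grads = [np.array([1.0, -2.0]), np.array([3.0, 0.0])]
avg_grad = np.sum(worker_grads, axis=0) / len(worker_grads)

theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
theta, m, v, t = adam_step(theta, avg_grad, m, v, t=0, lr=0.1)
print(theta)
```

On the first step the update is approximately lr times the sign of the gradient, which is a quick sanity check for the bias correction.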

View File

@@ -4,7 +4,7 @@ def mpi_fork(n, bind_to_core=False):
"""Re-launches the current script with workers
Returns "parent" for original parent, "child" for MPI children
"""
if n<=1:
if n<=1:
return "child"
if os.getenv("IN_MPI") is None:
env = os.environ.copy()

View File

@@ -33,8 +33,8 @@ def mpi_moments(x, axis=0, comm=None, keepdims=False):
def test_runningmeanstd():
import subprocess
subprocess.check_call(['mpirun', '-np', '3',
'python','-c',
subprocess.check_call(['mpirun', '-np', '3',
'python','-c',
'from baselines.common.mpi_moments import _helper_runningmeanstd; _helper_runningmeanstd()'])
def _helper_runningmeanstd():

View File

@@ -1,107 +1,56 @@
from mpi4py import MPI
import tensorflow as tf, baselines.common.tf_util as U, numpy as np
try:
from mpi4py import MPI
except ImportError:
MPI = None
class RunningMeanStd(object):
import tensorflow as tf, numpy as np
class RunningMeanStd(tf.Module):
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
def __init__(self, epsilon=1e-2, shape=()):
def __init__(self, epsilon=1e-2, shape=(), default_clip_range=np.inf):
self._sum = tf.get_variable(
self._sum = tf.Variable(
initial_value=np.zeros(shape=shape, dtype=np.float64),
dtype=tf.float64,
shape=shape,
initializer=tf.constant_initializer(0.0),
name="runningsum", trainable=False)
self._sumsq = tf.get_variable(
self._sumsq = tf.Variable(
initial_value=np.full(shape=shape, fill_value=epsilon, dtype=np.float64),
dtype=tf.float64,
shape=shape,
initializer=tf.constant_initializer(epsilon),
name="runningsumsq", trainable=False)
self._count = tf.get_variable(
self._count = tf.Variable(
initial_value=epsilon,
dtype=tf.float64,
shape=(),
initializer=tf.constant_initializer(epsilon),
name="count", trainable=False)
self.shape = shape
self.mean = tf.to_float(self._sum / self._count)
self.std = tf.sqrt( tf.maximum( tf.to_float(self._sumsq / self._count) - tf.square(self.mean) , 1e-2 ))
newsum = tf.placeholder(shape=self.shape, dtype=tf.float64, name='sum')
newsumsq = tf.placeholder(shape=self.shape, dtype=tf.float64, name='var')
newcount = tf.placeholder(shape=[], dtype=tf.float64, name='count')
self.incfiltparams = U.function([newsum, newsumsq, newcount], [],
updates=[tf.assign_add(self._sum, newsum),
tf.assign_add(self._sumsq, newsumsq),
tf.assign_add(self._count, newcount)])
self.epsilon = epsilon
self.default_clip_range = default_clip_range
def update(self, x):
x = x.astype('float64')
n = int(np.prod(self.shape))
totalvec = np.zeros(n*2+1, 'float64')
addvec = np.concatenate([x.sum(axis=0).ravel(), np.square(x).sum(axis=0).ravel(), np.array([len(x)],dtype='float64')])
MPI.COMM_WORLD.Allreduce(addvec, totalvec, op=MPI.SUM)
self.incfiltparams(totalvec[0:n].reshape(self.shape), totalvec[n:2*n].reshape(self.shape), totalvec[2*n])
totalvec = np.zeros(n*2+1, 'float64')
if MPI is not None:
# totalvec = np.zeros(n*2+1, 'float64')
MPI.COMM_WORLD.Allreduce(addvec, totalvec, op=MPI.SUM)
# else:
# totalvec = addvec
self._sum.assign_add(totalvec[0:n].reshape(self.shape))
self._sumsq.assign_add(totalvec[n:2*n].reshape(self.shape))
self._count.assign_add(totalvec[2*n])
@U.in_session
def test_runningmeanstd():
for (x1, x2, x3) in [
(np.random.randn(3), np.random.randn(4), np.random.randn(5)),
(np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
]:
@property
def mean(self):
return tf.cast(self._sum / self._count, tf.float32)
rms = RunningMeanStd(epsilon=0.0, shape=x1.shape[1:])
U.initialize()
@property
def std(self):
return tf.sqrt(tf.maximum(tf.cast(self._sumsq / self._count, tf.float32) - tf.square(self.mean), self.epsilon))
x = np.concatenate([x1, x2, x3], axis=0)
ms1 = [x.mean(axis=0), x.std(axis=0)]
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = [rms.mean.eval(), rms.std.eval()]
def normalize(self, v, clip_range=None):
if clip_range is None:
clip_range = self.default_clip_range
return tf.clip_by_value((v - self.mean) / self.std, -clip_range, clip_range)
assert np.allclose(ms1, ms2)
@U.in_session
def test_dist():
np.random.seed(0)
p1,p2,p3=(np.random.randn(3,1), np.random.randn(4,1), np.random.randn(5,1))
q1,q2,q3=(np.random.randn(6,1), np.random.randn(7,1), np.random.randn(8,1))
# p1,p2,p3=(np.random.randn(3), np.random.randn(4), np.random.randn(5))
# q1,q2,q3=(np.random.randn(6), np.random.randn(7), np.random.randn(8))
comm = MPI.COMM_WORLD
assert comm.Get_size()==2
if comm.Get_rank()==0:
x1,x2,x3 = p1,p2,p3
elif comm.Get_rank()==1:
x1,x2,x3 = q1,q2,q3
else:
assert False
rms = RunningMeanStd(epsilon=0.0, shape=(1,))
U.initialize()
rms.update(x1)
rms.update(x2)
rms.update(x3)
bigvec = np.concatenate([p1,p2,p3,q1,q2,q3])
def checkallclose(x,y):
print(x,y)
return np.allclose(x,y)
assert checkallclose(
bigvec.mean(axis=0),
rms.mean.eval(),
)
assert checkallclose(
bigvec.std(axis=0),
rms.std.eval(),
)
if __name__ == "__main__":
# Run with mpirun -np 2 python <filename>
test_dist()
def denormalize(self, v):
return self.mean + v * self.std
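
Note on RunningMeanStd above: a NumPy sketch of the sum / sum-of-squares / count bookkeeping. The class name RunningMoments is hypothetical, and the sketch assumes a single process, so the local batch totals are added directly where the MPI version would Allreduce them first.

```python
import numpy as np

class RunningMoments:
    """Single-process sketch: track sum, sum of squares and count."""
    def __init__(self, epsilon=1e-2, shape=()):
        self._sum = np.zeros(shape, dtype=np.float64)
        self._sumsq = np.full(shape, epsilon, dtype=np.float64)
        self._count = epsilon
        self.epsilon = epsilon

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        # Assumption: no MPI, so the local totals are the global totals.
        self._sum += x.sum(axis=0)
        self._sumsq += np.square(x).sum(axis=0)
        self._count += len(x)

    @property
    def mean(self):
        return self._sum / self._count

    @property
    def std(self):
        # E[x^2] - E[x]^2, floored at epsilon before the square root.
        var = self._sumsq / self._count - np.square(self.mean)
        return np.sqrt(np.maximum(var, self.epsilon))

rng = np.random.default_rng(0)
data = rng.normal(size=(12, 2))
rms = RunningMoments(epsilon=1e-8, shape=(2,))
for batch in (data[:3], data[3:7], data[7:]):
    rms.update(batch)
print(rms.mean, rms.std)
```

With a small epsilon the incremental estimates agree with the mean and standard deviation of the concatenated batches, which is what the removed test_runningmeanstd checked.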

View File

@@ -0,0 +1,131 @@
from collections import defaultdict
import os, numpy as np
import platform
import shutil
import subprocess
import warnings
import sys
try:
from mpi4py import MPI
except ImportError:
MPI = None
def sync_from_root(variables, comm=None):
"""
Send the root node's parameters to every worker.
Arguments:
variables: all parameter variables including optimizer's
"""
if comm is None: comm = MPI.COMM_WORLD
values = comm.bcast([var.numpy() for var in variables])
for (var, val) in zip(variables, values):
var.assign(val)
def gpu_count():
"""
Count the GPUs on this machine.
"""
if shutil.which('nvidia-smi') is None:
return 0
output = subprocess.check_output(['nvidia-smi', '--query-gpu=gpu_name', '--format=csv'])
return max(0, len(output.split(b'\n')) - 2)
def setup_mpi_gpus():
"""
Set CUDA_VISIBLE_DEVICES to MPI rank if not already set
"""
if 'CUDA_VISIBLE_DEVICES' not in os.environ:
if sys.platform == 'darwin': # This assumes that if you're on OSX you're just
ids = [] # doing a smoke test and don't want GPUs
else:
lrank, _lsize = get_local_rank_size(MPI.COMM_WORLD)
ids = [lrank]
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, ids))
def get_local_rank_size(comm):
"""
Returns the rank of each process on its machine
The processes on a given machine will be assigned ranks
0, 1, 2, ..., N-1,
where N is the number of processes on this machine.
Useful if you want to assign one GPU to each process on a machine
"""
this_node = platform.node()
ranks_nodes = comm.allgather((comm.Get_rank(), this_node))
node2rankssofar = defaultdict(int)
local_rank = None
for (rank, node) in ranks_nodes:
if rank == comm.Get_rank():
local_rank = node2rankssofar[node]
node2rankssofar[node] += 1
assert local_rank is not None
return local_rank, node2rankssofar[this_node]
def share_file(comm, path):
"""
Copies the file from rank 0 to all other ranks
Puts it in the same place on all machines
"""
localrank, _ = get_local_rank_size(comm)
if comm.Get_rank() == 0:
with open(path, 'rb') as fh:
data = fh.read()
comm.bcast(data)
else:
data = comm.bcast(None)
if localrank == 0:
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, 'wb') as fh:
fh.write(data)
comm.Barrier()
def dict_gather(comm, d, op='mean', assert_all_have_data=True):
"""
Perform a reduction operation over dicts
"""
if comm is None: return d
alldicts = comm.allgather(d)
size = comm.size
k2li = defaultdict(list)
for d in alldicts:
for (k,v) in d.items():
k2li[k].append(v)
result = {}
for (k,li) in k2li.items():
if assert_all_have_data:
assert len(li)==size, "only %i out of %i MPI workers have sent '%s'" % (len(li), size, k)
if op=='mean':
result[k] = np.mean(li, axis=0)
elif op=='sum':
result[k] = np.sum(li, axis=0)
else:
assert 0, op
return result
def mpi_weighted_mean(comm, local_name2valcount):
"""
Perform a weighted average over dicts that are each on a different node
Input: local_name2valcount: dict mapping key -> (value, count)
Returns: key -> mean
"""
all_name2valcount = comm.gather(local_name2valcount)
if comm.rank == 0:
name2sum = defaultdict(float)
name2count = defaultdict(float)
for n2vc in all_name2valcount:
for (name, (val, count)) in n2vc.items():
try:
val = float(val)
except ValueError:
if comm.rank == 0:
warnings.warn('WARNING: tried to compute mean on non-float {}={}'.format(name, val))
else:
name2sum[name] += val * count
name2count[name] += count
return {name : name2sum[name] / name2count[name] for name in name2sum}
else:
return {}
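
Note on mpi_weighted_mean above: a sketch of the reduction that rank 0 performs, with comm.gather replaced by a hypothetical, hand-written list of per-worker dicts. The function name weighted_mean and the sample keys are assumptions for illustration.

```python
from collections import defaultdict

def weighted_mean(all_name2valcount):
    """Combine per-worker {name: (value, count)} dicts into {name: mean}."""
    name2sum = defaultdict(float)
    name2count = defaultdict(float)
    for n2vc in all_name2valcount:
        for name, (val, count) in n2vc.items():
            try:
                val = float(val)
            except ValueError:
                continue  # skip entries that cannot be averaged
            name2sum[name] += val * count
            name2count[name] += count
    return {name: name2sum[name] / name2count[name] for name in name2sum}

# Stand-in for comm.gather(local_name2valcount) on rank 0.
gathered = [
    {"eprew": (1.0, 10), "eplen": (100.0, 10)},
    {"eprew": (4.0, 30)},
]
means = weighted_mean(gathered)
print(means)  # {'eprew': 3.25, 'eplen': 100.0}
```

A key that only some workers report is averaged over the workers that did report it, since counts are accumulated per key.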

View File

@@ -0,0 +1,434 @@
import matplotlib.pyplot as plt
import os.path as osp
import json
import os
import numpy as np
import pandas
from collections import defaultdict, namedtuple
from baselines.bench import monitor
from baselines.logger import read_json, read_csv
def smooth(y, radius, mode='two_sided', valid_only=False):
'''
Smooth signal y, where radius determines the size of the window
mode='two_sided':
average over the window [max(index - radius, 0), min(index + radius, len(y)-1)]
mode='causal':
average over the window [max(index - radius, 0), index]
valid_only: put nan in entries where the full-sized window is not available
'''
assert mode in ('two_sided', 'causal')
if len(y) < 2*radius+1:
return np.ones_like(y) * y.mean()
elif mode == 'two_sided':
convkernel = np.ones(2 * radius+1)
out = np.convolve(y, convkernel,mode='same') / np.convolve(np.ones_like(y), convkernel, mode='same')
if valid_only:
out[:radius] = out[-radius:] = np.nan
elif mode == 'causal':
convkernel = np.ones(radius)
out = np.convolve(y, convkernel,mode='full') / np.convolve(np.ones_like(y), convkernel, mode='full')
out = out[:-radius+1]
if valid_only:
out[:radius] = np.nan
return out
def one_sided_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_counts_threshold=1e-8):
'''
perform one-sided (causal) EMA (exponential moving average)
smoothing and resampling to an even grid with n points.
Does not do extrapolation, so we assume
xolds[0] <= low && high <= xolds[-1]
Arguments:
xolds: array or list - x values of data. Needs to be sorted in ascending order
yolds: array or list - y values of data. Must have the same length as xolds
low: float - min value of the new x grid. By default equals to xolds[0]
high: float - max value of the new x grid. By default equals to xolds[-1]
n: int - number of points in new x grid
decay_steps: float - EMA decay factor, expressed in new x grid steps.
low_counts_threshold: float or int
- y values with counts less than this value will be set to NaN
Returns:
tuple xs, ys, count_ys where
xs - array with new x grid
ys - array of EMA of y at each point of the new x grid
count_ys - array of EMA of y counts at each point of the new x grid
'''
low = xolds[0] if low is None else low
high = xolds[-1] if high is None else high
assert xolds[0] <= low, 'low = {} < xolds[0] = {} - extrapolation not permitted!'.format(low, xolds[0])
assert xolds[-1] >= high, 'high = {} > xolds[-1] = {} - extrapolation not permitted!'.format(high, xolds[-1])
assert len(xolds) == len(yolds), 'length of xolds ({}) and yolds ({}) do not match!'.format(len(xolds), len(yolds))
xolds = xolds.astype('float64')
yolds = yolds.astype('float64')
luoi = 0 # last unused old index
sum_y = 0.
count_y = 0.
xnews = np.linspace(low, high, n)
decay_period = (high - low) / (n - 1) * decay_steps
interstep_decay = np.exp(- 1. / decay_steps)
sum_ys = np.zeros_like(xnews)
count_ys = np.zeros_like(xnews)
for i in range(n):
xnew = xnews[i]
sum_y *= interstep_decay
count_y *= interstep_decay
while True:
if luoi >= len(xolds):
break
xold = xolds[luoi]
if xold <= xnew:
decay = np.exp(- (xnew - xold) / decay_period)
sum_y += decay * yolds[luoi]
count_y += decay
luoi += 1
else:
break
sum_ys[i] = sum_y
count_ys[i] = count_y
ys = sum_ys / count_ys
ys[count_ys < low_counts_threshold] = np.nan
return xnews, ys, count_ys
def symmetric_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_counts_threshold=1e-8):
'''
perform symmetric EMA (exponential moving average)
smoothing and resampling to an even grid with n points.
Does not do extrapolation, so we assume
xolds[0] <= low && high <= xolds[-1]
Arguments:
xolds: array or list - x values of data. Needs to be sorted in ascending order
yolds: array or list - y values of data. Must have the same length as xolds
low: float - min value of the new x grid. By default equals to xolds[0]
high: float - max value of the new x grid. By default equals to xolds[-1]
n: int - number of points in new x grid
decay_steps: float - EMA decay factor, expressed in new x grid steps.
low_counts_threshold: float or int
- y values with counts less than this value will be set to NaN
Returns:
tuple xs, ys, count_ys where
xs - array with new x grid
ys - array of EMA of y at each point of the new x grid
count_ys - array of EMA of y counts at each point of the new x grid
'''
xs, ys1, count_ys1 = one_sided_ema(xolds, yolds, low, high, n, decay_steps, low_counts_threshold=0)
_, ys2, count_ys2 = one_sided_ema(-xolds[::-1], yolds[::-1], -high, -low, n, decay_steps, low_counts_threshold=0)
ys2 = ys2[::-1]
count_ys2 = count_ys2[::-1]
count_ys = count_ys1 + count_ys2
ys = (ys1 * count_ys1 + ys2 * count_ys2) / count_ys
ys[count_ys < low_counts_threshold] = np.nan
return xs, ys, count_ys
Result = namedtuple('Result', 'monitor progress dirname metadata')
Result.__new__.__defaults__ = (None,) * len(Result._fields)
def load_results(root_dir_or_dirs, enable_progress=True, enable_monitor=True, verbose=False):
'''
load summaries of runs from a list of directories (including subdirectories)
Arguments:
enable_progress: bool - if True, will attempt to load data from progress.csv files (data saved by logger). Default: True
enable_monitor: bool - if True, will attempt to load data from monitor.csv files (data saved by Monitor environment wrapper). Default: True
verbose: bool - if True, will print out list of directories from which the data is loaded. Default: False
Returns:
List of Result objects with the following fields:
- dirname - path to the directory data was loaded from
- metadata - run metadata (such as command-line arguments and anything else in metadata.json file)
- monitor - if enable_monitor is True, this field contains pandas dataframe with loaded monitor.csv file (or aggregate of all *.monitor.csv files in the directory)
- progress - if enable_progress is True, this field contains pandas dataframe with loaded progress.csv file
'''
import re
if isinstance(root_dir_or_dirs, str):
rootdirs = [osp.expanduser(root_dir_or_dirs)]
else:
rootdirs = [osp.expanduser(d) for d in root_dir_or_dirs]
allresults = []
for rootdir in rootdirs:
assert osp.exists(rootdir), "%s doesn't exist"%rootdir
for dirname, dirs, files in os.walk(rootdir):
if '-proc' in dirname:
files[:] = []
continue
monitor_re = re.compile(r'(\d+\.)?(\d+\.)?monitor\.csv')
if set(['metadata.json', 'monitor.json', 'progress.json', 'progress.csv']).intersection(files) or \
any([f for f in files if monitor_re.match(f)]): # also match monitor files like 0.1.monitor.csv
# used to be uncommented, which means do not go deeper than current directory if any of the data files
# are found
# dirs[:] = []
result = {'dirname' : dirname}
if "metadata.json" in files:
with open(osp.join(dirname, "metadata.json"), "r") as fh:
result['metadata'] = json.load(fh)
progjson = osp.join(dirname, "progress.json")
progcsv = osp.join(dirname, "progress.csv")
if enable_progress:
if osp.exists(progjson):
result['progress'] = pandas.DataFrame(read_json(progjson))
elif osp.exists(progcsv):
try:
result['progress'] = read_csv(progcsv)
except pandas.errors.EmptyDataError:
print('skipping progress file in ', dirname, 'empty data')
else:
if verbose: print('skipping %s: no progress file'%dirname)
if enable_monitor:
try:
result['monitor'] = pandas.DataFrame(monitor.load_results(dirname))
except monitor.LoadMonitorResultsError:
print('skipping %s: no monitor files'%dirname)
except Exception as e:
print('exception loading monitor file in %s: %s'%(dirname, e))
if result.get('monitor') is not None or result.get('progress') is not None:
allresults.append(Result(**result))
if verbose:
print('successfully loaded %s'%dirname)
if verbose: print('loaded %i results'%len(allresults))
return allresults
COLORS = ['blue', 'green', 'red', 'cyan', 'magenta', 'yellow', 'black', 'purple', 'pink',
'brown', 'orange', 'teal', 'lightblue', 'lime', 'lavender', 'turquoise',
'darkgreen', 'tan', 'salmon', 'gold', 'darkred', 'darkblue']
def default_xy_fn(r):
x = np.cumsum(r.monitor.l)
y = smooth(r.monitor.r, radius=10)
return x,y
def default_split_fn(r):
import re
# match name between slash and -<digits> at the end of the string
# (the leading slash, the trailing -<digits>, or both may be missing)
match = re.search(r'[^/-]+(?=(-\d+)?\Z)', r.dirname)
if match:
return match.group(0)
def plot_results(
allresults, *,
xy_fn=default_xy_fn,
split_fn=default_split_fn,
group_fn=default_split_fn,
average_group=False,
shaded_std=True,
shaded_err=True,
figsize=None,
legend_outside=False,
resample=0,
smooth_step=1.0,
tiling='vertical',
xlabel=None,
ylabel=None
):
'''
Plot multiple Results objects
xy_fn: function Result -> x,y - function that converts results objects into tuple of x and y values.
By default, x is cumsum of episode lengths, and y is episode rewards
split_fn: function Result -> hashable - function that converts results objects into keys to split curves into sub-panels by.
That is, the results r for which split_fn(r) is different will be put on different sub-panels.
By default, the portion of r.dirname between last / and -<digits> is returned. The sub-panels are
stacked vertically in the figure.
group_fn: function Result -> hashable - function that converts results objects into keys to group curves by.
That is, the results r for which group_fn(r) is the same will be put into the same group.
Curves in the same group share a color (if average_group is False), or are averaged into a single curve
(if average_group is True). The default is the same as the default for split_fn.
average_group: bool - if True, will average the curves in the same group and plot the mean. Enables resampling
(if resample = 0, will use 512 steps)
shaded_std: bool - if True (default), the shaded region corresponding to standard deviation of the group of curves will be
shown (only applicable if average_group = True)
shaded_err: bool - if True (default), the shaded region corresponding to error in mean estimate of the group of curves
(that is, standard deviation divided by square root of number of curves) will be
shown (only applicable if average_group = True)
figsize: tuple or None - size of the resulting figure (including sub-panels). By default, width is 6 times the number of
sub-panel columns and height is 6 times the number of sub-panel rows.
legend_outside: bool - if True, will place the legend outside of the sub-panels.
resample: int - if not zero, size of the uniform grid in x direction to resample onto. Resampling is performed via symmetric
EMA smoothing (see the docstring for symmetric_ema).
Default is zero (no resampling). Note that if average_group is True, resampling is necessary; in that case, default
value is 512.
smooth_step: float - when resampling (i.e. when resample > 0 or average_group is True), use this EMA decay parameter (in units of the new grid step).
See docstrings for decay_steps in symmetric_ema or one_sided_ema functions.
tiling: str or None - layout of the sub-panels: 'vertical' (default, also used for None) stacks them in one column, 'horizontal' puts them in one row,
and 'symmetric' uses a grid whose column count is the largest divisor of the number of sub-panels not exceeding its square root.
xlabel: str or None - if not None, x-axis label added to the bottom row of sub-panels.
ylabel: str or None - if not None, y-axis label added to the left column of sub-panels.
'''
if split_fn is None: split_fn = lambda _ : ''
if group_fn is None: group_fn = lambda _ : ''
sk2r = defaultdict(list) # splitkey2results
for result in allresults:
splitkey = split_fn(result)
sk2r[splitkey].append(result)
assert len(sk2r) > 0
assert isinstance(resample, int), "0: don't resample. <integer>: that many samples"
if tiling == 'vertical' or tiling is None:
nrows = len(sk2r)
ncols = 1
elif tiling == 'horizontal':
ncols = len(sk2r)
nrows = 1
elif tiling == 'symmetric':
import math
N = len(sk2r)
largest_divisor = 1
for i in range(1, int(math.sqrt(N))+1):
if N % i == 0:
largest_divisor = i
ncols = largest_divisor
nrows = N // ncols
figsize = figsize or (6 * ncols, 6 * nrows)
f, axarr = plt.subplots(nrows, ncols, sharex=False, squeeze=False, figsize=figsize)
groups = list(set(group_fn(result) for result in allresults))
default_samples = 512
if average_group:
resample = resample or default_samples
for (isplit, sk) in enumerate(sorted(sk2r.keys())):
g2l = {}
g2c = defaultdict(int)
sresults = sk2r[sk]
gresults = defaultdict(list)
idx_row = isplit // ncols
idx_col = isplit % ncols
ax = axarr[idx_row][idx_col]
for result in sresults:
group = group_fn(result)
g2c[group] += 1
x, y = xy_fn(result)
if x is None: x = np.arange(len(y))
x, y = map(np.asarray, (x, y))
if average_group:
gresults[group].append((x,y))
else:
if resample:
x, y, counts = symmetric_ema(x, y, x[0], x[-1], resample, decay_steps=smooth_step)
l, = ax.plot(x, y, color=COLORS[groups.index(group) % len(COLORS)])
g2l[group] = l
if average_group:
for group in sorted(groups):
xys = gresults[group]
if not any(xys):
continue
color = COLORS[groups.index(group) % len(COLORS)]
origxs = [xy[0] for xy in xys]
minxlen = min(map(len, origxs))
def allequal(qs):
return all((q==qs[0]).all() for q in qs[1:])
if resample:
low = max(x[0] for x in origxs)
high = min(x[-1] for x in origxs)
usex = np.linspace(low, high, resample)
ys = []
for (x, y) in xys:
ys.append(symmetric_ema(x, y, low, high, resample, decay_steps=smooth_step)[1])
else:
assert allequal([x[:minxlen] for x in origxs]),\
'If you want to average unevenly sampled data, set resample=<number of samples you want>'
usex = origxs[0]
ys = [xy[1][:minxlen] for xy in xys]
ymean = np.mean(ys, axis=0)
ystd = np.std(ys, axis=0)
ystderr = ystd / np.sqrt(len(ys))
l, = axarr[idx_row][idx_col].plot(usex, ymean, color=color)
g2l[group] = l
if shaded_err:
ax.fill_between(usex, ymean - ystderr, ymean + ystderr, color=color, alpha=.4)
if shaded_std:
ax.fill_between(usex, ymean - ystd, ymean + ystd, color=color, alpha=.2)
# https://matplotlib.org/users/legend_guide.html
plt.tight_layout()
if any(g2l.keys()):
ax.legend(
g2l.values(),
['%s (%i)'%(g, g2c[g]) for g in g2l] if average_group else g2l.keys(),
loc=2 if legend_outside else None,
bbox_to_anchor=(1,1) if legend_outside else None)
ax.set_title(sk)
# add xlabels, but only to the bottom row
if xlabel is not None:
for ax in axarr[-1]:
plt.sca(ax)
plt.xlabel(xlabel)
# add ylabels, but only to left column
if ylabel is not None:
for ax in axarr[:,0]:
plt.sca(ax)
plt.ylabel(ylabel)
return f, axarr
def regression_analysis(df):
xcols = list(df.columns.copy())
xcols.remove('score')
ycols = ['score']
import statsmodels.api as sm
mod = sm.OLS(df[ycols], sm.add_constant(df[xcols]), hasconst=False)
res = mod.fit()
print(res.summary())
def test_smooth():
norig = 100
nup = 300
ndown = 30
xs = np.cumsum(np.random.rand(norig) * 10 / norig)
yclean = np.sin(xs)
ys = yclean + .1 * np.random.randn(yclean.size)
xup, yup, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), nup, decay_steps=nup/ndown)
xdown, ydown, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), ndown, decay_steps=ndown/ndown)
xsame, ysame, _ = symmetric_ema(xs, ys, xs.min(), xs.max(), norig, decay_steps=norig/ndown)
plt.plot(xs, ys, label='orig', marker='x')
plt.plot(xup, yup, label='up', marker='x')
plt.plot(xdown, ydown, label='down', marker='x')
plt.plot(xsame, ysame, label='same', marker='x')
plt.plot(xs, yclean, label='clean', marker='x')
plt.legend()
plt.show()
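
The default split key computed by default_split_fn above is easiest to see on concrete directory names. The sketch below reuses the regular expression from the source unchanged; the helper name `split_key` and the example paths are made up for illustration, and the helper takes a plain string instead of a Result object.

```python
import re

# Same pattern as in default_split_fn: a run of characters that are neither
# '/' nor '-', followed by an optional -<digits> suffix and the end of string.
PATTERN = re.compile(r'[^/-]+(?=(-\d+)?\Z)')


def split_key(dirname):
    """Hypothetical stand-in for default_split_fn that takes a plain string."""
    match = PATTERN.search(dirname)
    return match.group(0) if match else None


examples = ['logs/ppo2-1', 'logs/ppo2-2', 'results/a2c', 'dqn-0', 'my-env-3']
for dirname in examples:
    print(dirname, '->', split_key(dirname))
```

Runs that differ only in the numeric suffix get the same key, so they end up on the same sub-panel. Two edge cases are worth knowing: a name with inner hyphens keeps only its last segment (`my-env-3` gives `env`), and a trailing slash defeats the pattern, so the key is None.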

View File

@@ -0,0 +1,81 @@
import tensorflow as tf
from baselines.a2c.utils import fc
from baselines.common.distributions import make_pdtype
import gym
class PolicyWithValue(tf.Module):
"""
Encapsulates fields and methods for RL policy and value function estimation with shared parameters
"""
def __init__(self, ac_space, policy_network, value_network=None, estimate_q=False):
"""
Parameters:
----------
ac_space action space
policy_network keras network for policy
value_network keras network for value
estimate_q if True, estimate Q-values (one per discrete action); otherwise estimate a scalar state value
"""
self.policy_network = policy_network
self.value_network = value_network or policy_network
self.estimate_q = estimate_q
self.initial_state = None
# Select the probability distribution type based on the action space
self.pdtype = make_pdtype(policy_network.output_shape, ac_space, init_scale=0.01)
if estimate_q:
assert isinstance(ac_space, gym.spaces.Discrete)
self.value_fc = fc(self.value_network.output_shape, 'q', ac_space.n)
else:
self.value_fc = fc(self.value_network.output_shape, 'vf', 1)
@tf.function
def step(self, observation):
"""
Compute next action(s) given the observation(s)
Parameters:
----------
observation batched observation data
Returns:
-------
(action, value estimate, next state, negative log likelihood of the action under current policy parameters) tuple
"""
latent = self.policy_network(observation)
pd, pi = self.pdtype.pdfromlatent(latent)
action = pd.sample()
neglogp = pd.neglogp(action)
value_latent = self.value_network(observation)
vf = tf.squeeze(self.value_fc(value_latent), axis=1)
return action, vf, None, neglogp
@tf.function
def value(self, observation):
"""
Compute value estimate(s) given the observation(s)
Parameters:
----------
observation observation data (either single or a batch)
Returns:
-------
value estimate
"""
value_latent = self.value_network(observation)
result = tf.squeeze(self.value_fc(value_latent), axis=1)
return result

View File

@@ -0,0 +1,280 @@
from collections import deque
import cv2
cv2.ocl.setUseOpenCL(False)
from .atari_wrappers import WarpFrame, ClipRewardEnv, FrameStack, ScaledFloatFrame
from .wrappers import TimeLimit
import numpy as np
import gym
class StochasticFrameSkip(gym.Wrapper):
def __init__(self, env, n, stickprob):
gym.Wrapper.__init__(self, env)
self.n = n
self.stickprob = stickprob
self.curac = None
self.rng = np.random.RandomState()
self.supports_want_render = hasattr(env, "supports_want_render")
def reset(self, **kwargs):
self.curac = None
return self.env.reset(**kwargs)
def step(self, ac):
done = False
totrew = 0
for i in range(self.n):
# First step after reset, use action
if self.curac is None:
self.curac = ac
# First substep, delay with probability=stickprob
elif i==0:
if self.rng.rand() > self.stickprob:
self.curac = ac
# Second substep, new action definitely kicks in
elif i==1:
self.curac = ac
if self.supports_want_render and i<self.n-1:
ob, rew, done, info = self.env.step(self.curac, want_render=False)
else:
ob, rew, done, info = self.env.step(self.curac)
totrew += rew
if done: break
return ob, totrew, done, info
def seed(self, s):
self.rng.seed(s)
class PartialFrameStack(gym.Wrapper):
def __init__(self, env, k, channel=1):
"""
Stack one channel (channel keyword) from previous frames
"""
gym.Wrapper.__init__(self, env)
shp = env.observation_space.shape
self.channel = channel
self.observation_space = gym.spaces.Box(low=0, high=255,
shape=(shp[0], shp[1], shp[2] + k - 1),
dtype=env.observation_space.dtype)
self.k = k
self.frames = deque([], maxlen=k)
shp = env.observation_space.shape
def reset(self):
ob = self.env.reset()
assert ob.shape[2] > self.channel
for _ in range(self.k):
self.frames.append(ob)
return self._get_ob()
def step(self, ac):
ob, reward, done, info = self.env.step(ac)
self.frames.append(ob)
return self._get_ob(), reward, done, info
def _get_ob(self):
assert len(self.frames) == self.k
return np.concatenate([frame if i==self.k-1 else frame[:,:,self.channel:self.channel+1]
for (i, frame) in enumerate(self.frames)], axis=2)
class Downsample(gym.ObservationWrapper):
def __init__(self, env, ratio):
"""
Downsample images by a factor of ratio
"""
gym.ObservationWrapper.__init__(self, env)
(oldh, oldw, oldc) = env.observation_space.shape
newshape = (oldh//ratio, oldw//ratio, oldc)
self.observation_space = gym.spaces.Box(low=0, high=255,
shape=newshape, dtype=np.uint8)
def observation(self, frame):
height, width, _ = self.observation_space.shape
frame = cv2.resize(frame, (width, height), interpolation=cv2.INTER_AREA)
if frame.ndim == 2:
frame = frame[:,:,None]
return frame
class Rgb2gray(gym.ObservationWrapper):
def __init__(self, env):
"""
Convert RGB images to single-channel grayscale
"""
gym.ObservationWrapper.__init__(self, env)
(oldh, oldw, _oldc) = env.observation_space.shape
self.observation_space = gym.spaces.Box(low=0, high=255,
shape=(oldh, oldw, 1), dtype=np.uint8)
def observation(self, frame):
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
return frame[:,:,None]
class MovieRecord(gym.Wrapper):
def __init__(self, env, savedir, k):
gym.Wrapper.__init__(self, env)
self.savedir = savedir
self.k = k
self.epcount = 0
def reset(self):
if self.epcount % self.k == 0:
self.env.unwrapped.movie_path = self.savedir
else:
self.env.unwrapped.movie_path = None
self.env.unwrapped.movie = None
self.epcount += 1
return self.env.reset()
class AppendTimeout(gym.Wrapper):
def __init__(self, env):
gym.Wrapper.__init__(self, env)
self.action_space = env.action_space
self.timeout_space = gym.spaces.Box(low=np.array([0.0]), high=np.array([1.0]), dtype=np.float32)
self.original_os = env.observation_space
if isinstance(self.original_os, gym.spaces.Dict):
import copy
ordered_dict = copy.deepcopy(self.original_os.spaces)
ordered_dict['value_estimation_timeout'] = self.timeout_space
self.observation_space = gym.spaces.Dict(ordered_dict)
self.dict_mode = True
else:
self.observation_space = gym.spaces.Dict({
'original': self.original_os,
'value_estimation_timeout': self.timeout_space
})
self.dict_mode = False
self.ac_count = None
while 1:
if not hasattr(env, "_max_episode_steps"): # Looking for TimeLimit wrapper that has this field
env = env.env
continue
break
self.timeout = env._max_episode_steps
def step(self, ac):
self.ac_count += 1
ob, rew, done, info = self.env.step(ac)
return self._process(ob), rew, done, info
def reset(self):
self.ac_count = 0
return self._process(self.env.reset())
def _process(self, ob):
fracmissing = 1 - self.ac_count / self.timeout
if self.dict_mode:
ob['value_estimation_timeout'] = fracmissing
return ob
else:
return { 'original': ob, 'value_estimation_timeout': fracmissing }
class StartDoingRandomActionsWrapper(gym.Wrapper):
"""
Warning: can eat info dicts, not good if you depend on them
"""
def __init__(self, env, max_random_steps, on_startup=True, every_episode=False):
gym.Wrapper.__init__(self, env)
self.on_startup = on_startup
self.every_episode = every_episode
self.random_steps = max_random_steps
self.last_obs = None
if on_startup:
self.some_random_steps()
def some_random_steps(self):
self.last_obs = self.env.reset()
n = np.random.randint(self.random_steps)
#print("running for random %i frames" % n)
for _ in range(n):
self.last_obs, _, done, _ = self.env.step(self.env.action_space.sample())
if done: self.last_obs = self.env.reset()
def reset(self):
return self.last_obs
def step(self, a):
self.last_obs, rew, done, info = self.env.step(a)
if done:
self.last_obs = self.env.reset()
if self.every_episode:
self.some_random_steps()
return self.last_obs, rew, done, info
def make_retro(*, game, state=None, max_episode_steps=4500, **kwargs):
import retro
if state is None:
state = retro.State.DEFAULT
env = retro.make(game, state, **kwargs)
env = StochasticFrameSkip(env, n=4, stickprob=0.25)
if max_episode_steps is not None:
env = TimeLimit(env, max_episode_steps=max_episode_steps)
return env
def wrap_deepmind_retro(env, scale=True, frame_stack=4):
"""
Configure environment for retro games, using config similar to DeepMind-style Atari in wrap_deepmind
"""
env = WarpFrame(env)
env = ClipRewardEnv(env)
if frame_stack > 1:
env = FrameStack(env, frame_stack)
if scale:
env = ScaledFloatFrame(env)
return env
class SonicDiscretizer(gym.ActionWrapper):
"""
Wrap a gym-retro environment and make it use discrete
actions for the Sonic game.
"""
def __init__(self, env):
super(SonicDiscretizer, self).__init__(env)
buttons = ["B", "A", "MODE", "START", "UP", "DOWN", "LEFT", "RIGHT", "C", "Y", "X", "Z"]
actions = [['LEFT'], ['RIGHT'], ['LEFT', 'DOWN'], ['RIGHT', 'DOWN'], ['DOWN'],
['DOWN', 'B'], ['B']]
self._actions = []
for action in actions:
arr = np.array([False] * 12)
for button in action:
arr[buttons.index(button)] = True
self._actions.append(arr)
self.action_space = gym.spaces.Discrete(len(self._actions))
def action(self, a): # pylint: disable=W0221
return self._actions[a].copy()
class RewardScaler(gym.RewardWrapper):
"""
Bring rewards to a reasonable scale for PPO.
This is incredibly important and affects performance
drastically.
"""
def __init__(self, env, scale=0.01):
super(RewardScaler, self).__init__(env)
self.scale = scale
def reward(self, reward):
return reward * self.scale
class AllowBacktracking(gym.Wrapper):
"""
Use deltas in max(X) as the reward, rather than deltas
in X. This way, agents are not discouraged too heavily
from exploring backwards if there is no way to advance
head-on in the level.
"""
def __init__(self, env):
super(AllowBacktracking, self).__init__(env)
self._cur_x = 0
self._max_x = 0
def reset(self, **kwargs): # pylint: disable=E0202
self._cur_x = 0
self._max_x = 0
return self.env.reset(**kwargs)
def step(self, action): # pylint: disable=E0202
obs, rew, done, info = self.env.step(action)
self._cur_x += rew
rew = max(0, self._cur_x - self._max_x)
self._max_x = max(self._max_x, self._cur_x)
return obs, rew, done, info
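
The reward shaping in AllowBacktracking.step can be traced by hand on a short episode. The sketch below replays the same arithmetic over a list of raw rewards, assuming (as the docstring suggests) that each raw reward is the change in horizontal position; the function name `backtracking_rewards` and the sample numbers are hypothetical.

```python
def backtracking_rewards(raw_rewards):
    """Replay the AllowBacktracking.step arithmetic over a list of raw rewards."""
    cur_x = 0
    max_x = 0
    shaped = []
    for rew in raw_rewards:
        cur_x += rew                            # running position estimate
        shaped.append(max(0, cur_x - max_x))    # reward only newly gained ground
        max_x = max(max_x, cur_x)               # remember the furthest point reached
    return shaped


raw = [1, 2, -3, 1, 4]          # advance, advance, retreat, advance, advance
shaped = backtracking_rewards(raw)
print(shaped)
```

Moving backwards costs nothing, and re-covering old ground earns nothing, so the shaped rewards sum to the furthest position reached in the episode.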

View File

@@ -5,7 +5,7 @@ class AbstractEnvRunner(ABC):
def __init__(self, *, env, model, nsteps):
self.env = env
self.model = model
nenv = env.num_envs
self.nenv = nenv = env.num_envs if hasattr(env, 'num_envs') else 1
self.batch_ob_shape = (nenv*nsteps,) + env.observation_space.shape
self.obs = np.zeros((nenv,) + env.observation_space.shape, dtype=env.observation_space.dtype.name)
self.obs[:] = env.reset()
@@ -16,3 +16,4 @@ class AbstractEnvRunner(ABC):
@abstractmethod
def run(self):
raise NotImplementedError

View File

@@ -1,4 +1,5 @@
import numpy as np
class RunningMeanStd(object):
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
def __init__(self, epsilon=1e-4, shape=()):
@@ -13,34 +14,18 @@ class RunningMeanStd(object):
self.update_from_moments(batch_mean, batch_var, batch_count)
def update_from_moments(self, batch_mean, batch_var, batch_count):
delta = batch_mean - self.mean
tot_count = self.count + batch_count
self.mean, self.var, self.count = update_mean_var_count_from_moments(
self.mean, self.var, self.count, batch_mean, batch_var, batch_count)
new_mean = self.mean + delta * batch_count / tot_count
m_a = self.var * (self.count)
m_b = batch_var * (batch_count)
M2 = m_a + m_b + np.square(delta) * self.count * batch_count / (self.count + batch_count)
new_var = M2 / (self.count + batch_count)
def update_mean_var_count_from_moments(mean, var, count, batch_mean, batch_var, batch_count):
delta = batch_mean - mean
tot_count = count + batch_count
new_count = batch_count + self.count
new_mean = mean + delta * batch_count / tot_count
m_a = var * count
m_b = batch_var * batch_count
M2 = m_a + m_b + np.square(delta) * count * batch_count / tot_count
new_var = M2 / tot_count
new_count = tot_count
self.mean = new_mean
self.var = new_var
self.count = new_count
def test_runningmeanstd():
for (x1, x2, x3) in [
(np.random.randn(3), np.random.randn(4), np.random.randn(5)),
(np.random.randn(3,2), np.random.randn(4,2), np.random.randn(5,2)),
]:
rms = RunningMeanStd(epsilon=0.0, shape=x1.shape[1:])
x = np.concatenate([x1, x2, x3], axis=0)
ms1 = [x.mean(axis=0), x.var(axis=0)]
rms.update(x1)
rms.update(x2)
rms.update(x3)
ms2 = [rms.mean, rms.var]
assert np.allclose(ms1, ms2)
return new_mean, new_var, new_count
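
The merge step in update_mean_var_count_from_moments is the parallel variance formula referenced in the Wikipedia link above. A minimal scalar sketch, using only the standard library, checks it against a direct computation; the name `merge_moments` and the sample data are illustrative, and population variance is assumed (matching the `x.var(axis=0)` check in the removed test).

```python
import math
import statistics


def merge_moments(mean, var, count, batch_mean, batch_var, batch_count):
    """Scalar version of the moment merge: combine two (mean, var, count) summaries."""
    delta = batch_mean - mean
    tot_count = count + batch_count
    new_mean = mean + delta * batch_count / tot_count
    m_a = var * count                # sum of squared deviations, first chunk
    m_b = batch_var * batch_count    # sum of squared deviations, second chunk
    m2 = m_a + m_b + delta ** 2 * count * batch_count / tot_count
    return new_mean, m2 / tot_count, tot_count


first = [1.0, 2.0, 4.0]
second = [3.0, 5.0, 6.0, 10.0]

mean, var, count = merge_moments(
    statistics.fmean(first), statistics.pvariance(first), len(first),
    statistics.fmean(second), statistics.pvariance(second), len(second))

# Compare with the statistics of the concatenated data.
print(math.isclose(mean, statistics.fmean(first + second)))
print(math.isclose(var, statistics.pvariance(first + second)))
```

Only the three summary numbers per chunk are needed, which is why the running statistics can be updated batch by batch without storing past observations.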

View File

@@ -1,46 +0,0 @@
import numpy as np
# http://www.johndcook.com/blog/standard_deviation/
class RunningStat(object):
def __init__(self, shape):
self._n = 0
self._M = np.zeros(shape)
self._S = np.zeros(shape)
def push(self, x):
x = np.asarray(x)
assert x.shape == self._M.shape
self._n += 1
if self._n == 1:
self._M[...] = x
else:
oldM = self._M.copy()
self._M[...] = oldM + (x - oldM)/self._n
self._S[...] = self._S + (x - oldM)*(x - self._M)
@property
def n(self):
return self._n
@property
def mean(self):
return self._M
@property
def var(self):
return self._S/(self._n - 1) if self._n > 1 else np.square(self._M)
@property
def std(self):
return np.sqrt(self.var)
@property
def shape(self):
return self._M.shape
def test_running_stat():
for shp in ((), (3,), (3,4)):
li = []
rs = RunningStat(shp)
for _ in range(5):
val = np.random.randn(*shp)
rs.push(val)
li.append(val)
m = np.mean(li, axis=0)
assert np.allclose(rs.mean, m)
v = np.square(m) if (len(li) == 1) else np.var(li, ddof=1, axis=0)
assert np.allclose(rs.var, v)

View File

@@ -1,44 +0,0 @@
import pytest
import tensorflow as tf
import random
import numpy as np
from gym.spaces import np_random
from baselines.a2c import a2c
from baselines.ppo2 import ppo2
from baselines.common.identity_env import IdentityEnv
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2.policies import MlpPolicy
learn_func_list = [
lambda e: a2c.learn(policy=MlpPolicy, env=e, seed=0, total_timesteps=50000),
lambda e: ppo2.learn(policy=MlpPolicy, env=e, total_timesteps=50000, lr=1e-3, nsteps=128, ent_coef=0.01)
]
@pytest.mark.slow
@pytest.mark.parametrize("learn_func", learn_func_list)
def test_identity(learn_func):
'''
Test if the algorithm (with a given policy)
can learn an identity transformation (i.e. return observation as an action)
'''
np.random.seed(0)
np_random.seed(0)
random.seed(0)
env = DummyVecEnv([lambda: IdentityEnv(10)])
with tf.Graph().as_default(), tf.Session().as_default():
tf.set_random_seed(0)
model = learn_func(env)
N_TRIALS = 1000
sum_rew = 0
obs = env.reset()
for i in range(N_TRIALS):
obs, rew, done, _ = env.step(model.step(obs)[0])
sum_rew += rew
assert sum_rew > 0.9 * N_TRIALS

View File

@@ -1,26 +0,0 @@
import numpy as np
from baselines.common.schedules import ConstantSchedule, PiecewiseSchedule
def test_piecewise_schedule():
ps = PiecewiseSchedule([(-5, 100), (5, 200), (10, 50), (100, 50), (200, -50)], outside_value=500)
assert np.isclose(ps.value(-10), 500)
assert np.isclose(ps.value(0), 150)
assert np.isclose(ps.value(5), 200)
assert np.isclose(ps.value(9), 80)
assert np.isclose(ps.value(50), 50)
assert np.isclose(ps.value(80), 50)
assert np.isclose(ps.value(150), 0)
assert np.isclose(ps.value(175), -25)
assert np.isclose(ps.value(201), 500)
assert np.isclose(ps.value(500), 500)
assert np.isclose(ps.value(200 - 1e-10), -50)
def test_constant_schedule():
cs = ConstantSchedule(5)
for i in range(-100, 100):
assert np.isclose(cs.value(i), 5)
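
The values asserted in test_piecewise_schedule are consistent with linear interpolation between neighbouring endpoints. The sketch below is a stand-alone helper written under that assumption, with half-open segments [left, right); it is not the library's PiecewiseSchedule, and the name `piecewise_value` is hypothetical.

```python
def piecewise_value(endpoints, t, outside_value):
    """Linearly interpolate between sorted (time, value) pairs.

    Assumption: each segment covers left_t <= t < right_t, and any t outside
    all segments maps to outside_value.
    """
    for (left_t, left_v), (right_t, right_v) in zip(endpoints[:-1], endpoints[1:]):
        if left_t <= t < right_t:
            alpha = (t - left_t) / (right_t - left_t)
            return left_v + alpha * (right_v - left_v)
    return outside_value


endpoints = [(-5, 100), (5, 200), (10, 50), (100, 50), (200, -50)]
for t in (-10, 0, 9, 150, 175, 201):
    print(t, piecewise_value(endpoints, t, outside_value=500))
```

For example, t=9 lies four fifths of the way from (5, 200) to (10, 50), giving 200 + 0.8 * (50 - 200) = 80, which is the value the test expects.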

View File

@@ -1,103 +0,0 @@
import numpy as np
from baselines.common.segment_tree import SumSegmentTree, MinSegmentTree
def test_tree_set():
tree = SumSegmentTree(4)
tree[2] = 1.0
tree[3] = 3.0
assert np.isclose(tree.sum(), 4.0)
assert np.isclose(tree.sum(0, 2), 0.0)
assert np.isclose(tree.sum(0, 3), 1.0)
assert np.isclose(tree.sum(2, 3), 1.0)
assert np.isclose(tree.sum(2, -1), 1.0)
assert np.isclose(tree.sum(2, 4), 4.0)
def test_tree_set_overlap():
tree = SumSegmentTree(4)
tree[2] = 1.0
tree[2] = 3.0
assert np.isclose(tree.sum(), 3.0)
assert np.isclose(tree.sum(2, 3), 3.0)
assert np.isclose(tree.sum(2, -1), 3.0)
assert np.isclose(tree.sum(2, 4), 3.0)
assert np.isclose(tree.sum(1, 2), 0.0)
def test_prefixsum_idx():
tree = SumSegmentTree(4)
tree[2] = 1.0
tree[3] = 3.0
assert tree.find_prefixsum_idx(0.0) == 2
assert tree.find_prefixsum_idx(0.5) == 2
assert tree.find_prefixsum_idx(0.99) == 2
assert tree.find_prefixsum_idx(1.01) == 3
assert tree.find_prefixsum_idx(3.00) == 3
assert tree.find_prefixsum_idx(4.00) == 3
def test_prefixsum_idx2():
tree = SumSegmentTree(4)
tree[0] = 0.5
tree[1] = 1.0
tree[2] = 1.0
tree[3] = 3.0
assert tree.find_prefixsum_idx(0.00) == 0
assert tree.find_prefixsum_idx(0.55) == 1
assert tree.find_prefixsum_idx(0.99) == 1
assert tree.find_prefixsum_idx(1.51) == 2
assert tree.find_prefixsum_idx(3.00) == 3
assert tree.find_prefixsum_idx(5.50) == 3
def test_max_interval_tree():
tree = MinSegmentTree(4)
tree[0] = 1.0
tree[2] = 0.5
tree[3] = 3.0
assert np.isclose(tree.min(), 0.5)
assert np.isclose(tree.min(0, 2), 1.0)
assert np.isclose(tree.min(0, 3), 0.5)
assert np.isclose(tree.min(0, -1), 0.5)
assert np.isclose(tree.min(2, 4), 0.5)
assert np.isclose(tree.min(3, 4), 3.0)
tree[2] = 0.7
assert np.isclose(tree.min(), 0.7)
assert np.isclose(tree.min(0, 2), 1.0)
assert np.isclose(tree.min(0, 3), 0.7)
assert np.isclose(tree.min(0, -1), 0.7)
assert np.isclose(tree.min(2, 4), 0.7)
assert np.isclose(tree.min(3, 4), 3.0)
tree[2] = 4.0
assert np.isclose(tree.min(), 1.0)
assert np.isclose(tree.min(0, 2), 1.0)
assert np.isclose(tree.min(0, 3), 1.0)
assert np.isclose(tree.min(0, -1), 1.0)
assert np.isclose(tree.min(2, 4), 3.0)
assert np.isclose(tree.min(2, 3), 4.0)
assert np.isclose(tree.min(2, -1), 4.0)
assert np.isclose(tree.min(3, 4), 3.0)
if __name__ == '__main__':
test_tree_set()
test_tree_set_overlap()
test_prefixsum_idx()
test_prefixsum_idx2()
test_max_interval_tree()
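
The prefix-sum lookups exercised in test_prefixsum_idx and test_prefixsum_idx2 can be reproduced with a very small array-backed sum tree. This is a sketch under stated assumptions (power-of-two capacity, root at index 1, leaves in the second half of the array); the class `TinySumTree` is hypothetical and is not the SumSegmentTree from baselines.

```python
class TinySumTree:
    """Minimal sum tree: nodes[1] is the root, leaves start at index `capacity`."""

    def __init__(self, capacity):
        assert capacity > 0 and capacity & (capacity - 1) == 0, "capacity must be a power of 2"
        self.capacity = capacity
        self.nodes = [0.0] * (2 * capacity)

    def __setitem__(self, idx, val):
        pos = idx + self.capacity
        self.nodes[pos] = val
        pos //= 2
        while pos >= 1:                      # refresh the sums on the path to the root
            self.nodes[pos] = self.nodes[2 * pos] + self.nodes[2 * pos + 1]
            pos //= 2

    def total(self):
        return self.nodes[1]

    def find_prefixsum_idx(self, prefixsum):
        """Descend from the root: go left if the left subtree's sum exceeds the target."""
        idx = 1
        while idx < self.capacity:
            if self.nodes[2 * idx] > prefixsum:
                idx = 2 * idx
            else:
                prefixsum -= self.nodes[2 * idx]
                idx = 2 * idx + 1
        return idx - self.capacity


tree = TinySumTree(4)
for i, v in enumerate([0.5, 1.0, 1.0, 3.0]):
    tree[i] = v
print(tree.total(), [tree.find_prefixsum_idx(p) for p in (0.0, 0.55, 1.51, 3.0)])
```

Each lookup touches one node per tree level, so sampling an index in proportion to its weight costs O(log n) instead of a linear scan over the weights.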

View File

@@ -1,40 +0,0 @@
# tests for tf_util
import tensorflow as tf
from baselines.common.tf_util import (
function,
initialize,
single_threaded_session
)
def test_function():
with tf.Graph().as_default():
x = tf.placeholder(tf.int32, (), name="x")
y = tf.placeholder(tf.int32, (), name="y")
z = 3 * x + 2 * y
lin = function([x, y], z, givens={y: 0})
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(2, 2) == 10
def test_multikwargs():
with tf.Graph().as_default():
x = tf.placeholder(tf.int32, (), name="x")
with tf.variable_scope("other"):
x2 = tf.placeholder(tf.int32, (), name="x")
z = 3 * x + 2 * x2
lin = function([x, x2], z, givens={x2: 0})
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(2, 2) == 10
if __name__ == '__main__':
test_function()
test_multikwargs()

View File

@@ -0,0 +1,38 @@
import os
import sys
import subprocess
import cloudpickle
import base64
import pytest
from functools import wraps
try:
from mpi4py import MPI
except ImportError:
MPI = None
def with_mpi(nproc=2, timeout=30, skip_if_no_mpi=True):
def outer_thunk(fn):
@wraps(fn)
def thunk(*args, **kwargs):
serialized_fn = base64.b64encode(cloudpickle.dumps(lambda: fn(*args, **kwargs)))
subprocess.check_call([
'mpiexec','-n', str(nproc),
sys.executable,
'-m', 'baselines.common.tests.test_with_mpi',
serialized_fn
], env=os.environ, timeout=timeout)
if skip_if_no_mpi:
return pytest.mark.skipif(MPI is None, reason="MPI not present")(thunk)
else:
return thunk
return outer_thunk
if __name__ == '__main__':
if len(sys.argv) > 1:
fn = cloudpickle.loads(base64.b64decode(sys.argv[1]))
assert callable(fn)
fn()

View File

@@ -1,10 +1,6 @@
import numpy as np
import tensorflow as tf # pylint: ignore-module
import copy
import os
import functools
import collections
import multiprocessing
def switch(condition, then_expression, else_expression):
"""Switches between two operations depending on a scalar value (int or bool).
@@ -44,48 +40,13 @@ def huber_loss(x, delta=1.0):
delta * (tf.abs(x) - 0.5 * delta)
)
# ================================================================
# Global session
# ================================================================
def make_session(num_cpu=None, make_default=False, graph=None):
"""Returns a session that will use <num_cpu> CPU's only"""
if num_cpu is None:
num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
tf_config = tf.ConfigProto(
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
if make_default:
return tf.InteractiveSession(config=tf_config, graph=graph)
else:
return tf.Session(config=tf_config, graph=graph)
def single_threaded_session():
"""Returns a session which will only use a single CPU"""
return make_session(num_cpu=1)
def in_session(f):
@functools.wraps(f)
def newfunc(*args, **kwargs):
with tf.Session():
f(*args, **kwargs)
return newfunc
ALREADY_INITIALIZED = set()
def initialize():
"""Initialize all the uninitialized variables in the global scope."""
new_variables = set(tf.global_variables()) - ALREADY_INITIALIZED
tf.get_default_session().run(tf.variables_initializer(new_variables))
ALREADY_INITIALIZED.update(new_variables)
# ================================================================
# Model components
# ================================================================
def normc_initializer(std=1.0, axis=0):
def _initializer(shape, dtype=None, partition_info=None): # pylint: disable=W0613
out = np.random.randn(*shape).astype(np.float32)
out = np.random.randn(*shape).astype(dtype.as_numpy_dtype)
out *= std / np.sqrt(np.square(out).sum(axis=axis, keepdims=True))
return tf.constant(out)
return _initializer
@@ -119,80 +80,6 @@ def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad="SAME",
return tf.nn.conv2d(x, w, stride_shape, pad) + b
# ================================================================
# Theano-like Function
# ================================================================
def function(inputs, outputs, updates=None, givens=None):
"""Just like Theano function. Take a bunch of tensorflow placeholders and expressions
computed based on those placeholders and produces f(inputs) -> outputs. Function f takes
values to be fed to the input's placeholders and produces the values of the expressions
in outputs.
Input values can be passed in the same order as inputs or can be provided as kwargs based
on placeholder name (passed to constructor or accessible via placeholder.op.name).
Example:
x = tf.placeholder(tf.int32, (), name="x")
y = tf.placeholder(tf.int32, (), name="y")
z = 3 * x + 2 * y
lin = function([x, y], z, givens={y: 0})
with single_threaded_session():
initialize()
assert lin(2) == 6
assert lin(x=3) == 9
assert lin(2, 2) == 10
assert lin(x=2, y=3) == 12
Parameters
----------
inputs: [tf.placeholder, tf.constant, or object with make_feed_dict method]
list of input arguments
outputs: [tf.Variable] or tf.Variable
list of outputs or a single output to be returned from function. Returned
value will also have the same shape.
"""
if isinstance(outputs, list):
return _Function(inputs, outputs, updates, givens=givens)
elif isinstance(outputs, (dict, collections.OrderedDict)):
f = _Function(inputs, outputs.values(), updates, givens=givens)
return lambda *args, **kwargs: type(outputs)(zip(outputs.keys(), f(*args, **kwargs)))
else:
f = _Function(inputs, [outputs], updates, givens=givens)
return lambda *args, **kwargs: f(*args, **kwargs)[0]
class _Function(object):
def __init__(self, inputs, outputs, updates, givens):
for inpt in inputs:
if not hasattr(inpt, 'make_feed_dict') and not (type(inpt) is tf.Tensor and len(inpt.op.inputs) == 0):
assert False, "inputs should all be placeholders, constants, or have a make_feed_dict method"
self.inputs = inputs
updates = updates or []
self.update_group = tf.group(*updates)
self.outputs_update = list(outputs) + [self.update_group]
self.givens = {} if givens is None else givens
def _feed_input(self, feed_dict, inpt, value):
if hasattr(inpt, 'make_feed_dict'):
feed_dict.update(inpt.make_feed_dict(value))
else:
feed_dict[inpt] = value
def __call__(self, *args):
assert len(args) <= len(self.inputs), "Too many arguments provided"
feed_dict = {}
# Update the args
for inpt, value in zip(self.inputs, args):
self._feed_input(feed_dict, inpt, value)
# Update feed dict with givens.
for inpt in self.givens:
feed_dict[inpt] = feed_dict.get(inpt, self.givens[inpt])
results = tf.get_default_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
return results
# ================================================================
# Flat vectors
# ================================================================
@@ -209,8 +96,7 @@ def numel(x):
def intprod(x):
return int(np.prod(x))
def flatgrad(loss, var_list, clip_norm=None):
grads = tf.gradients(loss, var_list)
def flatgrad(grads, var_list, clip_norm=None):
if clip_norm is not None:
grads = [tf.clip_by_norm(grad, clip_norm=clip_norm) for grad in grads]
return tf.concat(axis=0, values=[
@@ -220,85 +106,94 @@ def flatgrad(loss, var_list, clip_norm=None):
class SetFromFlat(object):
def __init__(self, var_list, dtype=tf.float32):
assigns = []
shapes = list(map(var_shape, var_list))
total_size = np.sum([intprod(shape) for shape in shapes])
self.theta = theta = tf.placeholder(dtype, [total_size])
start = 0
assigns = []
for (shape, v) in zip(shapes, var_list):
size = intprod(shape)
assigns.append(tf.assign(v, tf.reshape(theta[start:start + size], shape)))
start += size
self.op = tf.group(*assigns)
self.shapes = list(map(var_shape, var_list))
self.total_size = np.sum([intprod(shape) for shape in self.shapes])
self.var_list = var_list
def __call__(self, theta):
tf.get_default_session().run(self.op, feed_dict={self.theta: theta})
start = 0
for (shape, v) in zip(self.shapes, self.var_list):
size = intprod(shape)
v.assign(tf.reshape(theta[start:start + size], shape))
start += size
class GetFlat(object):
def __init__(self, var_list):
self.op = tf.concat(axis=0, values=[tf.reshape(v, [numel(v)]) for v in var_list])
self.var_list = var_list
def __call__(self):
return tf.get_default_session().run(self.op)
_PLACEHOLDER_CACHE = {} # name -> (placeholder, dtype, shape)
def get_placeholder(name, dtype, shape):
if name in _PLACEHOLDER_CACHE:
out, dtype1, shape1 = _PLACEHOLDER_CACHE[name]
assert dtype1 == dtype and shape1 == shape
return out
else:
out = tf.placeholder(dtype=dtype, shape=shape, name=name)
_PLACEHOLDER_CACHE[name] = (out, dtype, shape)
return out
def get_placeholder_cached(name):
return _PLACEHOLDER_CACHE[name][0]
return tf.concat(axis=0, values=[tf.reshape(v, [numel(v)]) for v in self.var_list]).numpy()
def flattenallbut0(x):
return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
# ================================================================
# Diagnostics
# Shape adjustment for feeding into tf tensors
# ================================================================
def adjust_shape(input_tensor, data):
'''
Adjust the shape of the data to the shape of the tensor if possible.
If the shapes are incompatible, an AssertionError is raised.
def display_var_info(vars):
from baselines import logger
count_params = 0
for v in vars:
name = v.name
if "/Adam" in name or "beta1_power" in name or "beta2_power" in name: continue
v_params = np.prod(v.shape.as_list())
count_params += v_params
if "/b:" in name or "/biases" in name: continue # Wx+b, bias is not interesting to look at => count params, but not print
logger.info(" %s%s %i params %s" % (name, " "*(55-len(name)), v_params, str(v.shape)))
Parameters:
input_tensor tensorflow input tensor
logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))
data input data to be (potentially) reshaped to be fed into input
Returns:
reshaped data
'''
if not isinstance(data, np.ndarray) and not isinstance(data, list):
return data
if isinstance(data, list):
data = np.array(data)
input_shape = [x or -1 for x in input_tensor.shape.as_list()]
assert _check_shape(input_shape, data.shape), \
'Shape of data {} is not compatible with shape of the input {}'.format(data.shape, input_shape)
return np.reshape(data, input_shape)
def get_available_gpus():
# recipe from here:
# https://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
from tensorflow.python.client import device_lib
local_device_protos = device_lib.list_local_devices()
return [x.name for x in local_device_protos if x.device_type == 'GPU']
def _check_shape(input_shape, data_shape):
''' check if two shapes are compatible (i.e. differ only by dimensions of size 1, or by the batch dimension)'''
squeezed_input_shape = _squeeze_shape(input_shape)
squeezed_data_shape = _squeeze_shape(data_shape)
for i, s_data in enumerate(squeezed_data_shape):
s_input = squeezed_input_shape[i]
if s_input != -1 and s_data != s_input:
return False
return True
def _squeeze_shape(shape):
return [x for x in shape if x != 1]
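The compatibility rule used by _check_shape and _squeeze_shape can be tried out without TensorFlow. The sketch below mirrors the logic shown above but, as an assumption made for illustration, takes the input shape as a plain list (with None for the batch dimension) instead of a tensor; the function names without the leading underscore are hypothetical.

```python
import numpy as np

def squeeze_shape(shape):
    """Drop all dimensions of size 1."""
    return [d for d in shape if d != 1]

def check_shape(input_shape, data_shape):
    """Shapes are compatible if they differ only by size-1 or batch (-1) dims."""
    squeezed_input = squeeze_shape(input_shape)
    squeezed_data = squeeze_shape(data_shape)
    for i, s_data in enumerate(squeezed_data):
        s_input = squeezed_input[i]
        if s_input != -1 and s_data != s_input:
            return False
    return True

def adjust_shape(input_shape, data):
    """Reshape data to input_shape; input_shape is a list, not a tensor (assumption)."""
    if not isinstance(data, (np.ndarray, list)):
        return data
    data = np.asarray(data)
    target = [d or -1 for d in input_shape]  # None becomes -1 (free batch dim)
    assert check_shape(target, data.shape), \
        'Shape of data {} is not compatible with shape of the input {}'.format(data.shape, target)
    return np.reshape(data, target)

single = adjust_shape([None, 3], [1.0, 2.0, 3.0])      # one sample, no batch dim
batch = adjust_shape([None, 3], np.zeros((2, 3, 1)))   # trailing size-1 dim
```

A single unbatched sample gains a leading batch dimension, and the trailing size-1 axis is dropped, which is the convenience the helper is meant to provide when feeding data.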
# ================================================================
# Saving variables
# Tensorboard interfacing
# ================================================================
def load_state(fname):
saver = tf.train.Saver()
saver.restore(tf.get_default_session(), fname)
def save_state(fname):
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(tf.get_default_session(), fname)
def launch_tensorboard_in_background(log_dir):
'''
To log the Tensorflow graph when using rl-algs
algorithms, you can run the following code
in your main script:
import threading, time
def start_tensorboard(session):
time.sleep(10) # Wait until graph is setup
tb_path = osp.join(logger.get_dir(), 'tb')
summary_writer = tf.summary.FileWriter(tb_path, graph=session.graph)
summary_op = tf.summary.merge_all()
launch_tensorboard_in_background(tb_path)
session = tf.get_default_session()
t = threading.Thread(target=start_tensorboard, args=([session]))
t.start()
'''
import subprocess
subprocess.Popen(['tensorboard', '--logdir', log_dir])


@@ -1,126 +1,10 @@
from abc import ABC, abstractmethod
from baselines import logger
from .vec_env import AlreadySteppingError, NotSteppingError, VecEnv, VecEnvWrapper, VecEnvObservationWrapper, CloudpickleWrapper
from .dummy_vec_env import DummyVecEnv
from .shmem_vec_env import ShmemVecEnv
from .subproc_vec_env import SubprocVecEnv
from .vec_frame_stack import VecFrameStack
from .vec_monitor import VecMonitor
from .vec_normalize import VecNormalize
from .vec_remove_dict_obs import VecExtractDictObs
class AlreadySteppingError(Exception):
"""
Raised when an asynchronous step is running while
step_async() is called again.
"""
def __init__(self):
msg = 'already running an async step'
Exception.__init__(self, msg)
class NotSteppingError(Exception):
"""
Raised when an asynchronous step is not running but
step_wait() is called.
"""
def __init__(self):
msg = 'not running an async step'
Exception.__init__(self, msg)
class VecEnv(ABC):
"""
An abstract asynchronous, vectorized environment.
"""
def __init__(self, num_envs, observation_space, action_space):
self.num_envs = num_envs
self.observation_space = observation_space
self.action_space = action_space
@abstractmethod
def reset(self):
"""
Reset all the environments and return an array of
observations, or a tuple of observation arrays.
If step_async is still doing work, that work will
be cancelled and step_wait() should not be called
until step_async() is invoked again.
"""
pass
@abstractmethod
def step_async(self, actions):
"""
Tell all the environments to start taking a step
with the given actions.
Call step_wait() to get the results of the step.
You should not call this if a step_async run is
already pending.
"""
pass
@abstractmethod
def step_wait(self):
"""
Wait for the step taken with step_async().
Returns (obs, rews, dones, infos):
- obs: an array of observations, or a tuple of
arrays of observations.
- rews: an array of rewards
- dones: an array of "episode done" booleans
- infos: a sequence of info objects
"""
pass
@abstractmethod
def close(self):
"""
Clean up the environments' resources.
"""
pass
def step(self, actions):
self.step_async(actions)
return self.step_wait()
def render(self, mode='human'):
logger.warn('Render not defined for %s'%self)
@property
def unwrapped(self):
if isinstance(self, VecEnvWrapper):
return self.venv.unwrapped
else:
return self
class VecEnvWrapper(VecEnv):
def __init__(self, venv, observation_space=None, action_space=None):
self.venv = venv
VecEnv.__init__(self,
num_envs=venv.num_envs,
observation_space=observation_space or venv.observation_space,
action_space=action_space or venv.action_space)
def step_async(self, actions):
self.venv.step_async(actions)
@abstractmethod
def reset(self):
pass
@abstractmethod
def step_wait(self):
pass
def close(self):
return self.venv.close()
def render(self):
self.venv.render()
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)
__all__ = ['AlreadySteppingError', 'NotSteppingError', 'VecEnv', 'VecEnvWrapper', 'VecEnvObservationWrapper', 'CloudpickleWrapper', 'DummyVecEnv', 'ShmemVecEnv', 'SubprocVecEnv', 'VecFrameStack', 'VecMonitor', 'VecNormalize', 'VecExtractDictObs']


@@ -1,40 +1,54 @@
import numpy as np
from gym import spaces
from collections import OrderedDict
from . import VecEnv
from .vec_env import VecEnv
from .util import copy_obs_dict, dict_to_obs, obs_space_info
class DummyVecEnv(VecEnv):
"""
VecEnv that runs multiple environments sequentially, that is,
the step and reset commands are sent to one environment at a time.
Useful when debugging and when num_env == 1 (in the latter case,
avoids communication overhead)
"""
def __init__(self, env_fns):
"""
Arguments:
env_fns: iterable of callables, i.e. functions that build environments
"""
self.envs = [fn() for fn in env_fns]
env = self.envs[0]
VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
shapes, dtypes = {}, {}
self.keys = []
obs_space = env.observation_space
self.keys, shapes, dtypes = obs_space_info(obs_space)
if isinstance(obs_space, spaces.Dict):
assert isinstance(obs_space.spaces, OrderedDict)
subspaces = obs_space.spaces
else:
subspaces = {None: obs_space}
for key, box in subspaces.items():
shapes[key] = box.shape
dtypes[key] = box.dtype
self.keys.append(key)
self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool)
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
self.buf_infos = [{} for _ in range(self.num_envs)]
self.actions = None
self.spec = self.envs[0].spec
def step_async(self, actions):
self.actions = actions
listify = True
try:
if len(actions) == self.num_envs:
listify = False
except TypeError:
pass
if not listify:
self.actions = actions
else:
assert self.num_envs == 1, "actions {} is either not a list or has a wrong size - cannot match to {} environments".format(actions, self.num_envs)
self.actions = [actions]
def step_wait(self):
for e in range(self.num_envs):
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(self.actions[e])
action = self.actions[e]
# if isinstance(self.envs[e].action_space, spaces.Discrete):
# action = int(action)
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(action)
if self.buf_dones[e]:
obs = self.envs[e].reset()
self._save_obs(e, obs)
@@ -47,12 +61,6 @@ class DummyVecEnv(VecEnv):
self._save_obs(e, obs)
return self._obs_from_buf()
def close(self):
return
def render(self, mode='human'):
return [e.render(mode=mode) for e in self.envs]
def _save_obs(self, e, obs):
for k in self.keys:
if k is None:
@@ -61,7 +69,13 @@ class DummyVecEnv(VecEnv):
self.buf_obs[k][e] = obs[k]
def _obs_from_buf(self):
if self.keys==[None]:
return self.buf_obs[None]
return dict_to_obs(copy_obs_dict(self.buf_obs))
def get_images(self):
return [env.render(mode='rgb_array') for env in self.envs]
def render(self, mode='human'):
if self.num_envs == 1:
return self.envs[0].render(mode=mode)
else:
return self.buf_obs
return super().render(mode=mode)
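The new step_async in DummyVecEnv accepts either one action per environment or, for a single environment, a bare action that it wraps in a list. A minimal standalone sketch of that normalization follows; normalize_actions is a hypothetical helper name, not a function in the diff.

```python
def normalize_actions(actions, num_envs):
    """Return a per-environment list of actions, wrapping a bare action if needed."""
    try:
        if len(actions) == num_envs:
            return actions
    except TypeError:
        pass  # scalars have no len(); fall through to the single-env case
    assert num_envs == 1, \
        "actions {} is either not a list or has a wrong size - cannot match to {} environments".format(actions, num_envs)
    return [actions]

scalar_case = normalize_actions(3, num_envs=1)            # bare discrete action
batch_case = normalize_actions([0, 1], num_envs=2)        # already one per env
vector_case = normalize_actions([0.1, 0.2], num_envs=1)   # one continuous action
```

Note the ambiguity this leaves: with one environment, a length-1 sequence is always treated as an already-batched list of actions.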


@@ -0,0 +1,139 @@
"""
An interface for asynchronous vectorized environments.
"""
import multiprocessing as mp
import numpy as np
from .vec_env import VecEnv, CloudpickleWrapper, clear_mpi_env_vars
import ctypes
from baselines import logger
from .util import dict_to_obs, obs_space_info, obs_to_dict
_NP_TO_CT = {np.float32: ctypes.c_float,
np.int32: ctypes.c_int32,
np.int8: ctypes.c_int8,
np.uint8: ctypes.c_char,
np.bool: ctypes.c_bool}
class ShmemVecEnv(VecEnv):
"""
Optimized version of SubprocVecEnv that uses shared variables to communicate observations.
"""
def __init__(self, env_fns, spaces=None, context='spawn'):
"""
If you don't specify spaces (an (observation_space, action_space) pair), we'll have to create a dummy
environment to read them from.
"""
ctx = mp.get_context(context)
if spaces:
observation_space, action_space = spaces
else:
logger.log('Creating dummy env object to get spaces')
with logger.scoped_configure(format_strs=[]):
dummy = env_fns[0]()
observation_space, action_space = dummy.observation_space, dummy.action_space
dummy.close()
del dummy
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
self.obs_keys, self.obs_shapes, self.obs_dtypes = obs_space_info(observation_space)
self.obs_bufs = [
{k: ctx.Array(_NP_TO_CT[self.obs_dtypes[k].type], int(np.prod(self.obs_shapes[k]))) for k in self.obs_keys}
for _ in env_fns]
self.parent_pipes = []
self.procs = []
with clear_mpi_env_vars():
for env_fn, obs_buf in zip(env_fns, self.obs_bufs):
wrapped_fn = CloudpickleWrapper(env_fn)
parent_pipe, child_pipe = ctx.Pipe()
proc = ctx.Process(target=_subproc_worker,
args=(child_pipe, parent_pipe, wrapped_fn, obs_buf, self.obs_shapes, self.obs_dtypes, self.obs_keys))
proc.daemon = True
self.procs.append(proc)
self.parent_pipes.append(parent_pipe)
proc.start()
child_pipe.close()
self.waiting_step = False
self.viewer = None
def reset(self):
if self.waiting_step:
logger.warn('Called reset() while waiting for the step to complete')
self.step_wait()
for pipe in self.parent_pipes:
pipe.send(('reset', None))
return self._decode_obses([pipe.recv() for pipe in self.parent_pipes])
def step_async(self, actions):
assert len(actions) == len(self.parent_pipes)
for pipe, act in zip(self.parent_pipes, actions):
pipe.send(('step', act))
def step_wait(self):
outs = [pipe.recv() for pipe in self.parent_pipes]
obs, rews, dones, infos = zip(*outs)
return self._decode_obses(obs), np.array(rews), np.array(dones), infos
def close_extras(self):
if self.waiting_step:
self.step_wait()
for pipe in self.parent_pipes:
pipe.send(('close', None))
for pipe in self.parent_pipes:
pipe.recv()
pipe.close()
for proc in self.procs:
proc.join()
def get_images(self, mode='human'):
for pipe in self.parent_pipes:
pipe.send(('render', None))
return [pipe.recv() for pipe in self.parent_pipes]
def _decode_obses(self, obs):
result = {}
for k in self.obs_keys:
bufs = [b[k] for b in self.obs_bufs]
o = [np.frombuffer(b.get_obj(), dtype=self.obs_dtypes[k]).reshape(self.obs_shapes[k]) for b in bufs]
result[k] = np.array(o)
return dict_to_obs(result)
def _subproc_worker(pipe, parent_pipe, env_fn_wrapper, obs_bufs, obs_shapes, obs_dtypes, keys):
"""
Control a single environment instance using IPC and
shared memory.
"""
def _write_obs(maybe_dict_obs):
flatdict = obs_to_dict(maybe_dict_obs)
for k in keys:
dst = obs_bufs[k].get_obj()
dst_np = np.frombuffer(dst, dtype=obs_dtypes[k]).reshape(obs_shapes[k]) # pylint: disable=W0212
np.copyto(dst_np, flatdict[k])
env = env_fn_wrapper.x()
parent_pipe.close()
try:
while True:
cmd, data = pipe.recv()
if cmd == 'reset':
pipe.send(_write_obs(env.reset()))
elif cmd == 'step':
obs, reward, done, info = env.step(data)
if done:
obs = env.reset()
pipe.send((_write_obs(obs), reward, done, info))
elif cmd == 'render':
pipe.send(env.render(mode='rgb_array'))
elif cmd == 'close':
pipe.send(None)
break
else:
raise RuntimeError('Got unrecognized cmd %s' % cmd)
except KeyboardInterrupt:
print('ShmemVecEnv worker: got KeyboardInterrupt')
finally:
env.close()
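ShmemVecEnv avoids pickling observations by having each worker write into a shared ctypes array that the parent reads through np.frombuffer. The sketch below shows that buffer mechanism in a single process, with no workers spawned; the shape and dtype are arbitrary values chosen for the example.

```python
import ctypes
import multiprocessing as mp

import numpy as np

shape, dtype = (2, 3), np.float32

# Lock-wrapped shared array of 6 C floats, zero-initialized.
buf = mp.Array(ctypes.c_float, int(np.prod(shape)))

# "Worker" side: view the shared memory as an ndarray and copy an observation in.
writer_view = np.frombuffer(buf.get_obj(), dtype=dtype).reshape(shape)
np.copyto(writer_view, np.arange(6, dtype=dtype).reshape(shape))

# "Parent" side: a second view over the same memory, then a private copy.
reader_view = np.frombuffer(buf.get_obj(), dtype=dtype).reshape(shape)
obs = np.array(reader_view)
```

Both views alias the same memory, so the copy made with np.array is what keeps a returned observation from changing when the worker writes its next step.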


@@ -1,97 +1,117 @@
import multiprocessing as mp
import numpy as np
from multiprocessing import Process, Pipe
from baselines.common.vec_env import VecEnv, CloudpickleWrapper
from baselines.common.tile_images import tile_images
from .vec_env import VecEnv, CloudpickleWrapper, clear_mpi_env_vars
def worker(remote, parent_remote, env_fn_wrapper):
parent_remote.close()
env = env_fn_wrapper.x()
while True:
cmd, data = remote.recv()
if cmd == 'step':
ob, reward, done, info = env.step(data)
if done:
try:
while True:
cmd, data = remote.recv()
if cmd == 'step':
ob, reward, done, info = env.step(data)
if done:
ob = env.reset()
remote.send((ob, reward, done, info))
elif cmd == 'reset':
ob = env.reset()
remote.send((ob, reward, done, info))
elif cmd == 'reset':
ob = env.reset()
remote.send(ob)
elif cmd == 'render':
remote.send(env.render(mode='rgb_array'))
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces':
remote.send((env.observation_space, env.action_space))
else:
raise NotImplementedError
remote.send(ob)
elif cmd == 'render':
remote.send(env.render(mode='rgb_array'))
elif cmd == 'close':
remote.close()
break
elif cmd == 'get_spaces_spec':
remote.send((env.observation_space, env.action_space, env.spec))
else:
raise NotImplementedError
except KeyboardInterrupt:
print('SubprocVecEnv worker: got KeyboardInterrupt')
finally:
env.close()
class SubprocVecEnv(VecEnv):
def __init__(self, env_fns, spaces=None):
"""
VecEnv that runs multiple environments in parallel in subprocesses and communicates with them via pipes.
Recommended to use when num_envs > 1 and step() can be a bottleneck.
"""
def __init__(self, env_fns, spaces=None, context='spawn'):
"""
envs: list of gym environments to run in subprocesses
Arguments:
env_fns: iterable of callables - functions that create environments to run in subprocesses. Need to be cloud-pickleable
"""
self.waiting = False
self.closed = False
nenvs = len(env_fns)
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
ctx = mp.get_context(context)
self.remotes, self.work_remotes = zip(*[ctx.Pipe() for _ in range(nenvs)])
self.ps = [ctx.Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
for p in self.ps:
p.daemon = True # if the main process crashes, we should not cause things to hang
p.start()
p.daemon = True # if the main process crashes, we should not cause things to hang
with clear_mpi_env_vars():
p.start()
for remote in self.work_remotes:
remote.close()
self.remotes[0].send(('get_spaces', None))
observation_space, action_space = self.remotes[0].recv()
self.remotes[0].send(('get_spaces_spec', None))
observation_space, action_space, self.spec = self.remotes[0].recv()
self.viewer = None
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
def step_async(self, actions):
self._assert_not_closed()
for remote, action in zip(self.remotes, actions):
remote.send(('step', action))
self.waiting = True
def step_wait(self):
self._assert_not_closed()
results = [remote.recv() for remote in self.remotes]
self.waiting = False
obs, rews, dones, infos = zip(*results)
return np.stack(obs), np.stack(rews), np.stack(dones), infos
return _flatten_obs(obs), np.stack(rews), np.stack(dones), infos
def reset(self):
self._assert_not_closed()
for remote in self.remotes:
remote.send(('reset', None))
return np.stack([remote.recv() for remote in self.remotes])
return _flatten_obs([remote.recv() for remote in self.remotes])
def reset_task(self):
for remote in self.remotes:
remote.send(('reset_task', None))
return np.stack([remote.recv() for remote in self.remotes])
def close(self):
if self.closed:
return
def close_extras(self):
self.closed = True
if self.waiting:
for remote in self.remotes:
for remote in self.remotes:
remote.recv()
for remote in self.remotes:
remote.send(('close', None))
for p in self.ps:
p.join()
self.closed = True
def render(self, mode='human'):
def get_images(self):
self._assert_not_closed()
for pipe in self.remotes:
pipe.send(('render', None))
imgs = [pipe.recv() for pipe in self.remotes]
bigimg = tile_images(imgs)
if mode == 'human':
import cv2
cv2.imshow('vecenv', bigimg[:,:,::-1])
cv2.waitKey(1)
elif mode == 'rgb_array':
return bigimg
else:
raise NotImplementedError
return imgs
def _assert_not_closed(self):
assert not self.closed, "Trying to operate on a SubprocVecEnv after calling close()"
def __del__(self):
if not self.closed:
self.close()
def _flatten_obs(obs):
assert isinstance(obs, (list, tuple))
assert len(obs) > 0
if isinstance(obs[0], dict):
keys = obs[0].keys()
return {k: np.stack([o[k] for o in obs]) for k in keys}
else:
return np.stack(obs)
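The _flatten_obs helper turns a list of per-environment observations into one batch, stacking key by key when the observations are dicts. A small sketch of its behavior on made-up data follows; the keys 'image' and 'speed' are hypothetical.

```python
import numpy as np

def flatten_obs(obs):
    """Stack per-environment observations into a batch (dict-aware)."""
    assert isinstance(obs, (list, tuple))
    assert len(obs) > 0
    if isinstance(obs[0], dict):
        keys = obs[0].keys()
        return {k: np.stack([o[k] for o in obs]) for k in keys}
    return np.stack(obs)

# Three environments, each returning a dict observation.
per_env = [{'image': np.zeros((4, 4)), 'speed': np.array([i])} for i in range(3)]
batched = flatten_obs(per_env)

# Plain array observations are stacked directly.
plain = flatten_obs([np.ones(2), np.zeros(2)])
```

In both cases the leading axis of every returned array is the environment index.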


@@ -0,0 +1,114 @@
"""
Tests for asynchronous vectorized environments.
"""
import gym
import numpy as np
import pytest
from .dummy_vec_env import DummyVecEnv
from .shmem_vec_env import ShmemVecEnv
from .subproc_vec_env import SubprocVecEnv
from baselines.common.tests.test_with_mpi import with_mpi
def assert_venvs_equal(venv1, venv2, num_steps):
"""
Compare two environments over num_steps steps and make sure
that the observations produced by each are the same when given
the same actions.
"""
assert venv1.num_envs == venv2.num_envs
assert venv1.observation_space.shape == venv2.observation_space.shape
assert venv1.observation_space.dtype == venv2.observation_space.dtype
assert venv1.action_space.shape == venv2.action_space.shape
assert venv1.action_space.dtype == venv2.action_space.dtype
try:
obs1, obs2 = venv1.reset(), venv2.reset()
assert np.array(obs1).shape == np.array(obs2).shape
assert np.array(obs1).shape == (venv1.num_envs,) + venv1.observation_space.shape
assert np.allclose(obs1, obs2)
venv1.action_space.seed(1337)
for _ in range(num_steps):
actions = np.array([venv1.action_space.sample() for _ in range(venv1.num_envs)])
for venv in [venv1, venv2]:
venv.step_async(actions)
outs1 = venv1.step_wait()
outs2 = venv2.step_wait()
for out1, out2 in zip(outs1[:3], outs2[:3]):
assert np.array(out1).shape == np.array(out2).shape
assert np.allclose(out1, out2)
assert list(outs1[3]) == list(outs2[3])
finally:
venv1.close()
venv2.close()
@pytest.mark.parametrize('klass', (ShmemVecEnv, SubprocVecEnv))
@pytest.mark.parametrize('dtype', ('uint8', 'float32'))
def test_vec_env(klass, dtype): # pylint: disable=R0914
"""
Test that a vectorized environment is equivalent to
DummyVecEnv, since DummyVecEnv is less likely to be
error prone.
"""
num_envs = 3
num_steps = 100
shape = (3, 8)
def make_fn(seed):
"""
Get an environment constructor with a seed.
"""
return lambda: SimpleEnv(seed, shape, dtype)
fns = [make_fn(i) for i in range(num_envs)]
env1 = DummyVecEnv(fns)
env2 = klass(fns)
assert_venvs_equal(env1, env2, num_steps=num_steps)
class SimpleEnv(gym.Env):
"""
An environment with a pre-determined observation space
and RNG seed.
"""
def __init__(self, seed, shape, dtype):
np.random.seed(seed)
self._dtype = dtype
self._start_obs = np.array(np.random.randint(0, 0x100, size=shape),
dtype=dtype)
self._max_steps = seed + 1
self._cur_obs = None
self._cur_step = 0
# this is 0xFF instead of 0x100 because the Box space includes
# the high end, while randint does not
self.action_space = gym.spaces.Box(low=0, high=0xFF, shape=shape, dtype=dtype)
self.observation_space = self.action_space
def step(self, action):
self._cur_obs += np.array(action, dtype=self._dtype)
self._cur_step += 1
done = self._cur_step >= self._max_steps
reward = self._cur_step / self._max_steps
return self._cur_obs, reward, done, {'foo': 'bar' + str(reward)}
def reset(self):
self._cur_obs = self._start_obs
self._cur_step = 0
return self._cur_obs
def render(self, mode=None):
raise NotImplementedError
@with_mpi()
def test_mpi_with_subprocvecenv():
shape = (2,3,4)
nenv = 1
venv = SubprocVecEnv([lambda: SimpleEnv(0, shape, 'float32')] * nenv)
ob = venv.reset()
venv.close()
assert ob.shape == (nenv,) + shape


@@ -0,0 +1,49 @@
"""
Tests for asynchronous vectorized environments.
"""
import gym
import pytest
import os
import glob
import tempfile
from .dummy_vec_env import DummyVecEnv
from .shmem_vec_env import ShmemVecEnv
from .subproc_vec_env import SubprocVecEnv
from .vec_video_recorder import VecVideoRecorder
@pytest.mark.parametrize('klass', (DummyVecEnv, ShmemVecEnv, SubprocVecEnv))
@pytest.mark.parametrize('num_envs', (1, 4))
@pytest.mark.parametrize('video_length', (10, 100))
@pytest.mark.parametrize('video_interval', (1, 50))
def test_video_recorder(klass, num_envs, video_length, video_interval):
"""
Wrap an existing VecEnv with VecVideoRecorder,
Make (video_interval + video_length + 1) steps,
then check that the file is present
"""
def make_fn():
env = gym.make('PongNoFrameskip-v4')
return env
fns = [make_fn for _ in range(num_envs)]
env = klass(fns)
with tempfile.TemporaryDirectory() as video_path:
env = VecVideoRecorder(env, video_path, record_video_trigger=lambda x: x % video_interval == 0, video_length=video_length)
env.reset()
for _ in range(video_interval + video_length + 1):
env.step([0] * num_envs)
env.close()
recorded_video = glob.glob(os.path.join(video_path, "*.mp4"))
# first and second step
assert len(recorded_video) == 2
# Files are not empty
assert all(os.stat(p).st_size != 0 for p in recorded_video)


@@ -0,0 +1,59 @@
"""
Helpers for dealing with vectorized environments.
"""
from collections import OrderedDict
import gym
import numpy as np
def copy_obs_dict(obs):
"""
Deep-copy an observation dict.
"""
return {k: np.copy(v) for k, v in obs.items()}
def dict_to_obs(obs_dict):
"""
Convert an observation dict into a raw array if the
original observation space was not a Dict space.
"""
if set(obs_dict.keys()) == {None}:
return obs_dict[None]
return obs_dict
def obs_space_info(obs_space):
"""
Get dict-structured information about a gym.Space.
Returns:
A tuple (keys, shapes, dtypes):
keys: a list of dict keys.
shapes: a dict mapping keys to shapes.
dtypes: a dict mapping keys to dtypes.
"""
if isinstance(obs_space, gym.spaces.Dict):
assert isinstance(obs_space.spaces, OrderedDict)
subspaces = obs_space.spaces
else:
subspaces = {None: obs_space}
keys = []
shapes = {}
dtypes = {}
for key, box in subspaces.items():
keys.append(key)
shapes[key] = box.shape
dtypes[key] = box.dtype
return keys, shapes, dtypes
def obs_to_dict(obs):
"""
Convert an observation into a dict.
"""
if isinstance(obs, dict):
return obs
return {None: obs}
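The helpers in this file use the key None to represent a non-Dict observation space, so the rest of the code can treat every observation as a dict. A short sketch of the round trip, with the two functions copied from above and made-up observations:

```python
import numpy as np

def obs_to_dict(obs):
    """Wrap a raw observation under the key None; pass dicts through."""
    if isinstance(obs, dict):
        return obs
    return {None: obs}

def dict_to_obs(obs_dict):
    """Unwrap a {None: obs} dict; leave real dict observations alone."""
    if set(obs_dict.keys()) == {None}:
        return obs_dict[None]
    return obs_dict

raw = np.arange(4)
wrapped = obs_to_dict(raw)          # {None: array([0, 1, 2, 3])}
unwrapped = dict_to_obs(wrapped)    # the original array again

real_dict = {'pos': np.zeros(2), 'vel': np.ones(2)}
untouched = dict_to_obs(obs_to_dict(real_dict))
```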


@@ -0,0 +1,218 @@
import contextlib
import os
from abc import ABC, abstractmethod
from baselines.common.tile_images import tile_images
class AlreadySteppingError(Exception):
"""
Raised when an asynchronous step is running while
step_async() is called again.
"""
def __init__(self):
msg = 'already running an async step'
Exception.__init__(self, msg)
class NotSteppingError(Exception):
"""
Raised when an asynchronous step is not running but
step_wait() is called.
"""
def __init__(self):
msg = 'not running an async step'
Exception.__init__(self, msg)
class VecEnv(ABC):
"""
An abstract asynchronous, vectorized environment.
Used to batch data from multiple copies of an environment, so that
each observation becomes a batch of observations, and the expected action is a batch of actions to
be applied per-environment.
"""
closed = False
viewer = None
metadata = {
'render.modes': ['human', 'rgb_array']
}
def __init__(self, num_envs, observation_space, action_space):
self.num_envs = num_envs
self.observation_space = observation_space
self.action_space = action_space
@abstractmethod
def reset(self):
"""
Reset all the environments and return an array of
observations, or a dict of observation arrays.
If step_async is still doing work, that work will
be cancelled and step_wait() should not be called
until step_async() is invoked again.
"""
pass
@abstractmethod
def step_async(self, actions):
"""
Tell all the environments to start taking a step
with the given actions.
Call step_wait() to get the results of the step.
You should not call this if a step_async run is
already pending.
"""
pass
@abstractmethod
def step_wait(self):
"""
Wait for the step taken with step_async().
Returns (obs, rews, dones, infos):
- obs: an array of observations, or a dict of
arrays of observations.
- rews: an array of rewards
- dones: an array of "episode done" booleans
- infos: a sequence of info objects
"""
pass
def close_extras(self):
"""
Clean up the extra resources, beyond what's in this base class.
Only runs when not self.closed.
"""
pass
def close(self):
if self.closed:
return
if self.viewer is not None:
self.viewer.close()
self.close_extras()
self.closed = True
def step(self, actions):
"""
Step the environments synchronously.
This is available for backwards compatibility.
"""
self.step_async(actions)
return self.step_wait()
def render(self, mode='human'):
imgs = self.get_images()
bigimg = tile_images(imgs)
if mode == 'human':
self.get_viewer().imshow(bigimg)
return self.get_viewer().isopen
elif mode == 'rgb_array':
return bigimg
else:
raise NotImplementedError
def get_images(self):
"""
Return RGB images from each environment
"""
raise NotImplementedError
@property
def unwrapped(self):
if isinstance(self, VecEnvWrapper):
return self.venv.unwrapped
else:
return self
def get_viewer(self):
if self.viewer is None:
from gym.envs.classic_control import rendering
self.viewer = rendering.SimpleImageViewer()
return self.viewer
class VecEnvWrapper(VecEnv):
"""
An environment wrapper that applies to an entire batch
of environments at once.
"""
def __init__(self, venv, observation_space=None, action_space=None):
self.venv = venv
super().__init__(num_envs=venv.num_envs,
observation_space=observation_space or venv.observation_space,
action_space=action_space or venv.action_space)
def step_async(self, actions):
self.venv.step_async(actions)
@abstractmethod
def reset(self):
pass
@abstractmethod
def step_wait(self):
pass
def close(self):
return self.venv.close()
def render(self, mode='human'):
return self.venv.render(mode=mode)
def get_images(self):
return self.venv.get_images()
class VecEnvObservationWrapper(VecEnvWrapper):
@abstractmethod
def process(self, obs):
pass
def reset(self):
obs = self.venv.reset()
return self.process(obs)
def step_wait(self):
obs, rews, dones, infos = self.venv.step_wait()
return self.process(obs), rews, dones, infos
class CloudpickleWrapper(object):
"""
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
"""
def __init__(self, x):
self.x = x
def __getstate__(self):
import cloudpickle
return cloudpickle.dumps(self.x)
def __setstate__(self, ob):
import pickle
self.x = pickle.loads(ob)
@contextlib.contextmanager
def clear_mpi_env_vars():
"""
`from mpi4py import MPI` calls MPI_Init by default. If the child process inherits MPI environment variables, MPI will treat it as an MPI process just like the parent and may misbehave, for example by hanging.
This context manager is a hacky way to temporarily clear those environment variables, for example while starting multiprocessing
Processes.
"""
removed_environment = {}
for k, v in list(os.environ.items()):
for prefix in ['OMPI_', 'PMI_']:
if k.startswith(prefix):
removed_environment[k] = v
del os.environ[k]
try:
yield
finally:
os.environ.update(removed_environment)
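The effect of clear_mpi_env_vars can be checked with a throwaway variable. The sketch below re-implements the same remove-then-restore pattern under the hypothetical name clear_env_vars and uses a made-up variable, OMPI_DEMO_VAR, that no real MPI runtime is assumed to set.

```python
import contextlib
import os

@contextlib.contextmanager
def clear_env_vars(prefixes=('OMPI_', 'PMI_')):
    """Temporarily remove environment variables starting with any prefix."""
    removed = {}
    for k, v in list(os.environ.items()):
        if k.startswith(tuple(prefixes)):
            removed[k] = v
            del os.environ[k]
    try:
        yield
    finally:
        # Restore everything that was removed, even if the body raised.
        os.environ.update(removed)

os.environ['OMPI_DEMO_VAR'] = '1'  # hypothetical variable, for the demo only
with clear_env_vars():
    inside = 'OMPI_DEMO_VAR' in os.environ   # hidden while the block runs
after = 'OMPI_DEMO_VAR' in os.environ        # restored afterwards
```

Child processes started inside the with block inherit the cleaned environment, which is why the vec env constructors wrap Process.start() in it.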


@@ -1,18 +1,16 @@
from baselines.common.vec_env import VecEnvWrapper
from .vec_env import VecEnvWrapper
import numpy as np
from gym import spaces
class VecFrameStack(VecEnvWrapper):
"""
Vectorized environment base class
"""
def __init__(self, venv, nstack):
self.venv = venv
self.nstack = nstack
wos = venv.observation_space # wrapped ob space
wos = venv.observation_space # wrapped ob space
low = np.repeat(wos.low, self.nstack, axis=-1)
high = np.repeat(wos.high, self.nstack, axis=-1)
self.stackedobs = np.zeros((venv.num_envs,)+low.shape, low.dtype)
self.stackedobs = np.zeros((venv.num_envs,) + low.shape, low.dtype)
observation_space = spaces.Box(low=low, high=high, dtype=venv.observation_space.dtype)
VecEnvWrapper.__init__(self, venv, observation_space=observation_space)
@@ -26,13 +24,7 @@ class VecFrameStack(VecEnvWrapper):
return self.stackedobs, rews, news, infos
def reset(self):
"""
Reset all environments
"""
obs = self.venv.reset()
self.stackedobs[...] = 0
self.stackedobs[..., -obs.shape[-1]:] = obs
return self.stackedobs
def close(self):
self.venv.close()
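The parts of VecFrameStack visible in this diff widen the observation space along the last axis and, on reset, place the first observation in the newest slot. A NumPy-only sketch of those two steps follows; the 84x84x1 frame size and the value 7 are arbitrary assumptions, and the per-step shifting of older frames is not shown in the hunk, so it is not reproduced here.

```python
import numpy as np

nstack, num_envs = 4, 2

# Bounds of a single-frame observation space, repeated nstack times on the last axis.
low = np.zeros((84, 84, 1), dtype=np.uint8)
stacked_low = np.repeat(low, nstack, axis=-1)            # shape (84, 84, 4)
stackedobs = np.zeros((num_envs,) + stacked_low.shape, stacked_low.dtype)

# Reset: clear the buffer, then write the fresh observation into the last channels.
obs = np.full((num_envs, 84, 84, 1), 7, dtype=np.uint8)
stackedobs[...] = 0
stackedobs[..., -obs.shape[-1]:] = obs
```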


@@ -0,0 +1,49 @@
from . import VecEnvWrapper
from baselines.bench.monitor import ResultsWriter
import numpy as np
import time
from collections import deque
class VecMonitor(VecEnvWrapper):
def __init__(self, venv, filename=None, keep_buf=0):
VecEnvWrapper.__init__(self, venv)
self.eprets = None
self.eplens = None
self.epcount = 0
self.tstart = time.time()
if filename:
self.results_writer = ResultsWriter(filename, header={'t_start': self.tstart})
else:
self.results_writer = None
self.keep_buf = keep_buf
if self.keep_buf:
self.epret_buf = deque([], maxlen=keep_buf)
self.eplen_buf = deque([], maxlen=keep_buf)
def reset(self):
obs = self.venv.reset()
self.eprets = np.zeros(self.num_envs, 'f')
self.eplens = np.zeros(self.num_envs, 'i')
return obs
def step_wait(self):
obs, rews, dones, infos = self.venv.step_wait()
self.eprets += rews
self.eplens += 1
newinfos = []
for (i, (done, ret, eplen, info)) in enumerate(zip(dones, self.eprets, self.eplens, infos)):
info = info.copy()
if done:
epinfo = {'r': ret, 'l': eplen, 't': round(time.time() - self.tstart, 6)}
info['episode'] = epinfo
if self.keep_buf:
self.epret_buf.append(ret)
self.eplen_buf.append(eplen)
self.epcount += 1
self.eprets[i] = 0
self.eplens[i] = 0
if self.results_writer:
self.results_writer.write_row(epinfo)
newinfos.append(info)
return obs, rews, dones, newinfos
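VecMonitor keeps one running return and one running length per environment and emits an episode record whenever that environment reports done. The sketch below replays that bookkeeping over a hand-written reward and done sequence; track_episodes and the 'env' field are hypothetical additions for the example, and no file writing is involved.

```python
import numpy as np

def track_episodes(rews_per_step, dones_per_step):
    """Accumulate per-env returns and lengths; record them when an env is done."""
    num_envs = len(rews_per_step[0])
    eprets = np.zeros(num_envs, 'f')
    eplens = np.zeros(num_envs, 'i')
    finished = []
    for rews, dones in zip(rews_per_step, dones_per_step):
        eprets += np.asarray(rews, dtype='f')
        eplens += 1
        for i, done in enumerate(dones):
            if done:
                finished.append({'env': i, 'r': float(eprets[i]), 'l': int(eplens[i])})
                # Start a fresh episode for this environment only.
                eprets[i] = 0
                eplens[i] = 0
    return finished

rews = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
dones = [[False, False], [False, True], [True, False]]
episodes = track_episodes(rews, dones)
```

Environment 1 finishes after two steps and environment 0 after three, so two records are produced, each with a return of 3.0.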


@@ -1,11 +1,14 @@
from baselines.common.vec_env import VecEnvWrapper
from . import VecEnvWrapper
from baselines.common.running_mean_std import RunningMeanStd
import numpy as np
class VecNormalize(VecEnvWrapper):
"""
Vectorized environment base class
A vectorized wrapper that normalizes the observations
and returns from an environment.
"""
def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
VecEnvWrapper.__init__(self, venv)
self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
@@ -17,18 +20,13 @@ class VecNormalize(VecEnvWrapper):
self.epsilon = epsilon
def step_wait(self):
"""
Apply sequence of actions to sequence of environments
actions -> (observations, rewards, news)
where 'news' is a boolean vector indicating whether each element is new.
"""
obs, rews, news, infos = self.venv.step_wait()
self.ret = self.ret * self.gamma + rews
obs = self._obfilt(obs)
if self.ret_rms:
self.ret_rms.update(self.ret)
rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)
self.ret[news] = 0.
return obs, rews, news, infos
def _obfilt(self, obs):
@@ -40,8 +38,6 @@ class VecNormalize(VecEnvWrapper):
return obs
def reset(self):
"""
Reset all environments
"""
self.ret = np.zeros(self.num_envs)
obs = self.venv.reset()
return self._obfilt(obs)
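The reward scaling in `step_wait` divides each reward by the running standard deviation of the discounted return, then clips it. A minimal sketch of that step follows, assuming a simplified stand-in for `RunningMeanStd`; the `RunningVar` class and the `normalize_rewards` helper are hypothetical names introduced here, not part of baselines.

```python
import numpy as np


class RunningVar:
    """Hypothetical stand-in for RunningMeanStd: batched mean/variance tracking."""

    def __init__(self, epsilon=1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, epsilon

    def update(self, x):
        batch_mean, batch_var, batch_count = np.mean(x), np.var(x), len(x)
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        # Combine the two sums of squared deviations (parallel variance formula).
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total


def normalize_rewards(rews, news, ret, ret_rms, gamma=0.99, cliprew=10.0, epsilon=1e-8):
    """Return (scaled rewards, updated discounted returns)."""
    ret = ret * gamma + rews            # discounted return per env
    ret_rms.update(ret)
    scaled = np.clip(rews / np.sqrt(ret_rms.var + epsilon), -cliprew, cliprew)
    ret[news] = 0.0                     # restart the return where an episode ended
    return scaled, ret


rms = RunningVar()
ret = np.zeros(2)
rews = np.array([1.0, 1000.0])
news = np.array([False, True])
scaled, ret = normalize_rewards(rews, news, ret, rms)
print(scaled, ret)
```

Note that the rewards are scaled but not mean-centered, which matches the expression in `step_wait` above.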


@@ -0,0 +1,11 @@
from .vec_env import VecEnvObservationWrapper

class VecExtractDictObs(VecEnvObservationWrapper):
    def __init__(self, venv, key):
        self.key = key
        super().__init__(venv=venv,
            observation_space=venv.observation_space.spaces[self.key])

    def process(self, obs):
        return obs[self.key]


@@ -0,0 +1,89 @@
import os
from baselines import logger
from baselines.common.vec_env import VecEnvWrapper
from gym.wrappers.monitoring import video_recorder

class VecVideoRecorder(VecEnvWrapper):
    """
    Wrap VecEnv to record rendered image as mp4 video.
    """

    def __init__(self, venv, directory, record_video_trigger, video_length=200):
        """
        # Arguments
            venv: VecEnv to wrap
            directory: Where to save videos
            record_video_trigger:
                Function that defines when to start recording.
                The function takes the current number of step,
                and returns whether we should start recording or not.
            video_length: Length of recorded video
        """
        VecEnvWrapper.__init__(self, venv)
        self.record_video_trigger = record_video_trigger
        self.video_recorder = None
        self.directory = os.path.abspath(directory)
        if not os.path.exists(self.directory): os.mkdir(self.directory)
        self.file_prefix = "vecenv"
        self.file_infix = '{}'.format(os.getpid())
        self.step_id = 0
        self.video_length = video_length
        self.recording = False
        self.recorded_frames = 0

    def reset(self):
        obs = self.venv.reset()
        self.start_video_recorder()
        return obs

    def start_video_recorder(self):
        self.close_video_recorder()
        base_path = os.path.join(self.directory, '{}.video.{}.video{:06}'.format(self.file_prefix, self.file_infix, self.step_id))
        self.video_recorder = video_recorder.VideoRecorder(
            env=self.venv,
            base_path=base_path,
            metadata={'step_id': self.step_id}
        )
        self.video_recorder.capture_frame()
        self.recorded_frames = 1
        self.recording = True

    def _video_enabled(self):
        return self.record_video_trigger(self.step_id)

    def step_wait(self):
        obs, rews, dones, infos = self.venv.step_wait()
        self.step_id += 1
        if self.recording:
            self.video_recorder.capture_frame()
            self.recorded_frames += 1
            if self.recorded_frames > self.video_length:
                logger.info("Saving video to ", self.video_recorder.path)
                self.close_video_recorder()
        elif self._video_enabled():
            self.start_video_recorder()
        return obs, rews, dones, infos

    def close_video_recorder(self):
        if self.recording:
            self.video_recorder.close()
        self.recording = False
        self.recorded_frames = 0

    def close(self):
        VecEnvWrapper.close(self)
        self.close_video_recorder()

    def __del__(self):
        self.close()
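To see how `record_video_trigger` and `video_length` interact, the recording state machine can be simulated without rendering anything. This is a sketch only: `recording_start_steps` and `every_1000` are hypothetical helpers that mirror the branch structure of `reset` and `step_wait` above.

```python
def recording_start_steps(trigger, total_steps, video_length):
    """Return the step ids at which a new recording would start.

    The trigger is consulted only while no recording is active, and a
    recording stops once it holds more than video_length frames.
    """
    starts = [0]                 # reset() always starts a recording at step_id 0
    recording, frames = True, 1  # start_video_recorder() captures one frame
    for step_id in range(1, total_steps + 1):
        if recording:
            frames += 1
            if frames > video_length:
                recording, frames = False, 0
        elif trigger(step_id):
            starts.append(step_id)
            recording, frames = True, 1
    return starts


def every_1000(step_id):
    """Example trigger: ask for a new video every 1000 steps."""
    return step_id % 1000 == 0


starts = recording_start_steps(every_1000, total_steps=3000, video_length=200)
print(starts)
```

One consequence of the `if`/`elif` structure is that a trigger firing while a video is still being recorded, including on the step where that recording stops, is ignored.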


@@ -0,0 +1,19 @@
import gym

class TimeLimit(gym.Wrapper):
    def __init__(self, env, max_episode_steps=None):
        super(TimeLimit, self).__init__(env)
        self._max_episode_steps = max_episode_steps
        self._elapsed_steps = 0

    def step(self, ac):
        observation, reward, done, info = self.env.step(ac)
        self._elapsed_steps += 1
        if self._elapsed_steps >= self._max_episode_steps:
            done = True
            info['TimeLimit.truncated'] = True
        return observation, reward, done, info

    def reset(self, **kwargs):
        self._elapsed_steps = 0
        return self.env.reset(**kwargs)
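As written, the wrapper compares `_elapsed_steps` against `_max_episode_steps` even when the latter keeps its default of `None`, which raises a `TypeError` in Python 3. The sketch below shows one possible guard. It is not the committed code: `SafeTimeLimit` and `CountingEnv` are hypothetical names, and the `gym` dependency is dropped so the example runs on its own.

```python
class CountingEnv:
    """Hypothetical stand-in environment that never terminates on its own."""

    def reset(self):
        return 0

    def step(self, action):
        return 0, 1.0, False, {}


class SafeTimeLimit:
    """Same truncation logic as TimeLimit, but tolerant of max_episode_steps=None."""

    def __init__(self, env, max_episode_steps=None):
        self.env = env
        self._max_episode_steps = max_episode_steps
        self._elapsed_steps = 0

    def step(self, action):
        observation, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        # Skip the comparison entirely when no limit was configured.
        if self._max_episode_steps is not None and self._elapsed_steps >= self._max_episode_steps:
            done = True
            info['TimeLimit.truncated'] = True
        return observation, reward, done, info

    def reset(self, **kwargs):
        self._elapsed_steps = 0
        return self.env.reset(**kwargs)


env = SafeTimeLimit(CountingEnv(), max_episode_steps=3)
env.reset()
steps, done, info = 0, False, {}
while not done:
    _, _, done, info = env.step(None)
    steps += 1
print(steps, info)
```

The `TimeLimit.truncated` flag lets downstream code distinguish a time-out from a true terminal state.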

2
baselines/ddpg/README.md Normal file → Executable file

@@ -2,4 +2,4 @@
- Original paper: https://arxiv.org/abs/1509.02971
- Baselines post: https://blog.openai.com/better-exploration-with-parameter-noise/
- `python -m baselines.ddpg.main` runs the algorithm for 1M frames = 10M timesteps on a Mujoco environment. See help (`-h`) for more options.
- `python -m baselines.run --alg=ddpg --env=HalfCheetah-v2 --num_timesteps=1e6` runs the algorithm for 1M timesteps on a MuJoCo environment. See help (`-h`) for more options.
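For orientation, the `learn` function in `baselines/ddpg/ddpg.py` (shown later in this diff) converts the timestep budget into a number of epochs using its `nb_epoch_cycles` and `nb_rollout_steps` arguments. The sketch below reproduces that arithmetic with the default values from the diff; `epochs_for` is a hypothetical helper name.

```python
def epochs_for(total_timesteps, nb_epoch_cycles=20, nb_rollout_steps=100):
    """Number of epochs learn() derives from a timestep budget."""
    return int(total_timesteps) // (nb_epoch_cycles * nb_rollout_steps)


nb_epochs = epochs_for(1e6)
print(nb_epochs)  # 500 epochs of 20 cycles x 100 rollout steps
```

The integer division means a budget that is not a multiple of 2000 steps is rounded down to whole epochs.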

0
baselines/ddpg/__init__.py Normal file → Executable file

633
baselines/ddpg/ddpg.py Normal file → Executable file

@@ -1,378 +1,285 @@
from copy import copy
from functools import reduce
import os
import os.path as osp
import time
from collections import deque
import pickle
import numpy as np
import tensorflow as tf
import tensorflow.contrib as tc
from baselines.ddpg.ddpg_learner import DDPG
from baselines.ddpg.models import Actor, Critic
from baselines.ddpg.memory import Memory
from baselines.ddpg.noise import AdaptiveParamNoiseSpec, NormalActionNoise, OrnsteinUhlenbeckActionNoise
from baselines.common import set_global_seeds
from baselines import logger
from baselines.common.mpi_adam import MpiAdam
import baselines.common.tf_util as U
from baselines.common.mpi_running_mean_std import RunningMeanStd
from mpi4py import MPI
import tensorflow as tf
import numpy as np
def normalize(x, stats):
if stats is None:
return x
return (x - stats.mean) / stats.std
try:
from mpi4py import MPI
except ImportError:
MPI = None
def learn(network, env,
seed=None,
total_timesteps=None,
nb_epochs=None, # with default settings, perform 1M steps total
nb_epoch_cycles=20,
nb_rollout_steps=100,
reward_scale=1.0,
render=False,
render_eval=False,
noise_type='adaptive-param_0.2',
normalize_returns=False,
normalize_observations=True,
critic_l2_reg=1e-2,
actor_lr=1e-4,
critic_lr=1e-3,
popart=False,
gamma=0.99,
clip_norm=None,
nb_train_steps=50, # per epoch cycle and MPI worker,
nb_eval_steps=100,
batch_size=64, # per MPI worker
tau=0.01,
eval_env=None,
param_noise_adaption_interval=50,
load_path=None,
**network_kwargs):
set_global_seeds(seed)
if total_timesteps is not None:
assert nb_epochs is None
nb_epochs = int(total_timesteps) // (nb_epoch_cycles * nb_rollout_steps)
else:
nb_epochs = 500
if MPI is not None:
rank = MPI.COMM_WORLD.Get_rank()
else:
rank = 0
nb_actions = env.action_space.shape[-1]
assert (np.abs(env.action_space.low) == env.action_space.high).all() # we assume symmetric actions.
memory = Memory(limit=int(1e6), action_shape=env.action_space.shape, observation_shape=env.observation_space.shape)
critic = Critic(nb_actions, ob_shape=env.observation_space.shape, network=network, **network_kwargs)
actor = Actor(nb_actions, ob_shape=env.observation_space.shape, network=network, **network_kwargs)
action_noise = None
param_noise = None
if noise_type is not None:
for current_noise_type in noise_type.split(','):
current_noise_type = current_noise_type.strip()
if current_noise_type == 'none':
pass
elif 'adaptive-param' in current_noise_type:
_, stddev = current_noise_type.split('_')
param_noise = AdaptiveParamNoiseSpec(initial_stddev=float(stddev), desired_action_stddev=float(stddev))
elif 'normal' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = NormalActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
elif 'ou' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = OrnsteinUhlenbeckActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
else:
raise RuntimeError('unknown noise type "{}"'.format(current_noise_type))
max_action = env.action_space.high
logger.info('scaling actions by {} before executing in env'.format(max_action))
agent = DDPG(actor, critic, memory, env.observation_space.shape, env.action_space.shape,
gamma=gamma, tau=tau, normalize_returns=normalize_returns, normalize_observations=normalize_observations,
batch_size=batch_size, action_noise=action_noise, param_noise=param_noise, critic_l2_reg=critic_l2_reg,
actor_lr=actor_lr, critic_lr=critic_lr, enable_popart=popart, clip_norm=clip_norm,
reward_scale=reward_scale)
logger.info('Using agent with the following configuration:')
logger.info(str(agent.__dict__.items()))
if load_path is not None:
load_path = osp.expanduser(load_path)
ckpt = tf.train.Checkpoint(model=agent)
manager = tf.train.CheckpointManager(ckpt, load_path, max_to_keep=None)
ckpt.restore(manager.latest_checkpoint)
print("Restoring from {}".format(manager.latest_checkpoint))
eval_episode_rewards_history = deque(maxlen=100)
episode_rewards_history = deque(maxlen=100)
# Prepare everything.
agent.initialize()
agent.reset()
obs = env.reset()
if eval_env is not None:
eval_obs = eval_env.reset()
nenvs = obs.shape[0]
episode_reward = np.zeros(nenvs, dtype = np.float32) #vector
episode_step = np.zeros(nenvs, dtype = int) # vector
episodes = 0 #scalar
t = 0 # scalar
epoch = 0
def denormalize(x, stats):
if stats is None:
return x
return x * stats.std + stats.mean
def reduce_std(x, axis=None, keepdims=False):
return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
start_time = time.time()
def reduce_var(x, axis=None, keepdims=False):
m = tf.reduce_mean(x, axis=axis, keep_dims=True)
devs_squared = tf.square(x - m)
return tf.reduce_mean(devs_squared, axis=axis, keep_dims=keepdims)
epoch_episode_rewards = []
epoch_episode_steps = []
epoch_actions = []
epoch_qs = []
epoch_episodes = 0
for epoch in range(nb_epochs):
for cycle in range(nb_epoch_cycles):
# Perform rollouts.
if nenvs > 1:
# if simulating multiple envs in parallel, impossible to reset agent at the end of the episode in each
# of the environments, so resetting here instead
agent.reset()
for t_rollout in range(nb_rollout_steps):
# Predict next action.
action, q, _, _ = agent.step(tf.constant(obs), apply_noise=True, compute_Q=True)
action, q = action.numpy(), q.numpy()
def get_target_updates(vars, target_vars, tau):
logger.info('setting up target updates ...')
soft_updates = []
init_updates = []
assert len(vars) == len(target_vars)
for var, target_var in zip(vars, target_vars):
logger.info(' {} <- {}'.format(target_var.name, var.name))
init_updates.append(tf.assign(target_var, var))
soft_updates.append(tf.assign(target_var, (1. - tau) * target_var + tau * var))
assert len(init_updates) == len(vars)
assert len(soft_updates) == len(vars)
return tf.group(*init_updates), tf.group(*soft_updates)
# Execute next action.
if rank == 0 and render:
env.render()
# max_action is of dimension A, whereas action is dimension (nenvs, A) - the multiplication gets broadcasted to the batch
new_obs, r, done, info = env.step(max_action * action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
# note these outputs are batched from vecenv
t += 1
if rank == 0 and render:
env.render()
episode_reward += r
episode_step += 1
# Book-keeping.
epoch_actions.append(action)
epoch_qs.append(q)
agent.store_transition(obs, action, r, new_obs, done) #the batched data will be unrolled in memory.py's append.
obs = new_obs
for d in range(len(done)):
if done[d]:
# Episode done.
epoch_episode_rewards.append(episode_reward[d])
episode_rewards_history.append(episode_reward[d])
epoch_episode_steps.append(episode_step[d])
episode_reward[d] = 0.
episode_step[d] = 0
epoch_episodes += 1
episodes += 1
if nenvs == 1:
agent.reset()
def get_perturbed_actor_updates(actor, perturbed_actor, param_noise_stddev):
assert len(actor.vars) == len(perturbed_actor.vars)
assert len(actor.perturbable_vars) == len(perturbed_actor.perturbable_vars)
# Train.
epoch_actor_losses = []
epoch_critic_losses = []
epoch_adaptive_distances = []
for t_train in range(nb_train_steps):
# Adapt param noise, if necessary.
if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
batch = agent.memory.sample(batch_size=batch_size)
obs0 = tf.constant(batch['obs0'])
distance = agent.adapt_param_noise(obs0)
epoch_adaptive_distances.append(distance)
updates = []
for var, perturbed_var in zip(actor.vars, perturbed_actor.vars):
if var in actor.perturbable_vars:
logger.info(' {} <- {} + noise'.format(perturbed_var.name, var.name))
updates.append(tf.assign(perturbed_var, var + tf.random_normal(tf.shape(var), mean=0., stddev=param_noise_stddev)))
cl, al = agent.train()
epoch_critic_losses.append(cl)
epoch_actor_losses.append(al)
agent.update_target_net()
# Evaluate.
eval_episode_rewards = []
eval_qs = []
if eval_env is not None:
nenvs_eval = eval_obs.shape[0]
eval_episode_reward = np.zeros(nenvs_eval, dtype = np.float32)
for t_rollout in range(nb_eval_steps):
eval_action, eval_q, _, _ = agent.step(eval_obs, apply_noise=False, compute_Q=True)
eval_obs, eval_r, eval_done, eval_info = eval_env.step(max_action * eval_action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
if render_eval:
eval_env.render()
eval_episode_reward += eval_r
eval_qs.append(eval_q)
for d in range(len(eval_done)):
if eval_done[d]:
eval_episode_rewards.append(eval_episode_reward[d])
eval_episode_rewards_history.append(eval_episode_reward[d])
eval_episode_reward[d] = 0.0
if MPI is not None:
mpi_size = MPI.COMM_WORLD.Get_size()
else:
logger.info(' {} <- {}'.format(perturbed_var.name, var.name))
updates.append(tf.assign(perturbed_var, var))
assert len(updates) == len(actor.vars)
return tf.group(*updates)
mpi_size = 1
# Log stats.
# XXX shouldn't call np.mean on variable length lists
duration = time.time() - start_time
stats = agent.get_stats()
combined_stats = stats.copy()
combined_stats['rollout/return'] = np.mean(epoch_episode_rewards)
combined_stats['rollout/return_std'] = np.std(epoch_episode_rewards)
combined_stats['rollout/return_history'] = np.mean(episode_rewards_history)
combined_stats['rollout/return_history_std'] = np.std(episode_rewards_history)
combined_stats['rollout/episode_steps'] = np.mean(epoch_episode_steps)
combined_stats['rollout/actions_mean'] = np.mean(epoch_actions)
combined_stats['rollout/Q_mean'] = np.mean(epoch_qs)
combined_stats['train/loss_actor'] = np.mean(epoch_actor_losses)
combined_stats['train/loss_critic'] = np.mean(epoch_critic_losses)
combined_stats['train/param_noise_distance'] = np.mean(epoch_adaptive_distances)
combined_stats['total/duration'] = duration
combined_stats['total/steps_per_second'] = float(t) / float(duration)
combined_stats['total/episodes'] = episodes
combined_stats['rollout/episodes'] = epoch_episodes
combined_stats['rollout/actions_std'] = np.std(epoch_actions)
# Evaluation statistics.
if eval_env is not None:
combined_stats['eval/return'] = eval_episode_rewards
combined_stats['eval/return_history'] = np.mean(eval_episode_rewards_history)
combined_stats['eval/Q'] = eval_qs
combined_stats['eval/episodes'] = len(eval_episode_rewards)
def as_scalar(x):
if isinstance(x, np.ndarray):
assert x.size == 1
return x[0]
elif np.isscalar(x):
return x
else:
raise ValueError('expected scalar, got %s'%x)
combined_stats_sums = np.array([ np.array(x).flatten()[0] for x in combined_stats.values()])
if MPI is not None:
combined_stats_sums = MPI.COMM_WORLD.allreduce(combined_stats_sums)
combined_stats = {k : v / mpi_size for (k,v) in zip(combined_stats.keys(), combined_stats_sums)}
# Total statistics.
combined_stats['total/epochs'] = epoch + 1
combined_stats['total/steps'] = t
for key in sorted(combined_stats.keys()):
logger.record_tabular(key, combined_stats[key])
if rank == 0:
logger.dump_tabular()
logger.info('')
logdir = logger.get_dir()
if rank == 0 and logdir:
if hasattr(env, 'get_state'):
with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:
pickle.dump(env.get_state(), f)
if eval_env and hasattr(eval_env, 'get_state'):
with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
pickle.dump(eval_env.get_state(), f)
class DDPG(object):
def __init__(self, actor, critic, memory, observation_shape, action_shape, param_noise=None, action_noise=None,
gamma=0.99, tau=0.001, normalize_returns=False, enable_popart=False, normalize_observations=True,
batch_size=128, observation_range=(-5., 5.), action_range=(-1., 1.), return_range=(-np.inf, np.inf),
adaptive_param_noise=True, adaptive_param_noise_policy_threshold=.1,
critic_l2_reg=0., actor_lr=1e-4, critic_lr=1e-3, clip_norm=None, reward_scale=1.):
# Inputs.
self.obs0 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs0')
self.obs1 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs1')
self.terminals1 = tf.placeholder(tf.float32, shape=(None, 1), name='terminals1')
self.rewards = tf.placeholder(tf.float32, shape=(None, 1), name='rewards')
self.actions = tf.placeholder(tf.float32, shape=(None,) + action_shape, name='actions')
self.critic_target = tf.placeholder(tf.float32, shape=(None, 1), name='critic_target')
self.param_noise_stddev = tf.placeholder(tf.float32, shape=(), name='param_noise_stddev')
# Parameters.
self.gamma = gamma
self.tau = tau
self.memory = memory
self.normalize_observations = normalize_observations
self.normalize_returns = normalize_returns
self.action_noise = action_noise
self.param_noise = param_noise
self.action_range = action_range
self.return_range = return_range
self.observation_range = observation_range
self.critic = critic
self.actor = actor
self.actor_lr = actor_lr
self.critic_lr = critic_lr
self.clip_norm = clip_norm
self.enable_popart = enable_popart
self.reward_scale = reward_scale
self.batch_size = batch_size
self.stats_sample = None
self.critic_l2_reg = critic_l2_reg
# Observation normalization.
if self.normalize_observations:
with tf.variable_scope('obs_rms'):
self.obs_rms = RunningMeanStd(shape=observation_shape)
else:
self.obs_rms = None
normalized_obs0 = tf.clip_by_value(normalize(self.obs0, self.obs_rms),
self.observation_range[0], self.observation_range[1])
normalized_obs1 = tf.clip_by_value(normalize(self.obs1, self.obs_rms),
self.observation_range[0], self.observation_range[1])
# Return normalization.
if self.normalize_returns:
with tf.variable_scope('ret_rms'):
self.ret_rms = RunningMeanStd()
else:
self.ret_rms = None
# Create target networks.
target_actor = copy(actor)
target_actor.name = 'target_actor'
self.target_actor = target_actor
target_critic = copy(critic)
target_critic.name = 'target_critic'
self.target_critic = target_critic
# Create networks and core TF parts that are shared across setup parts.
self.actor_tf = actor(normalized_obs0)
self.normalized_critic_tf = critic(normalized_obs0, self.actions)
self.critic_tf = denormalize(tf.clip_by_value(self.normalized_critic_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
self.normalized_critic_with_actor_tf = critic(normalized_obs0, self.actor_tf, reuse=True)
self.critic_with_actor_tf = denormalize(tf.clip_by_value(self.normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
Q_obs1 = denormalize(target_critic(normalized_obs1, target_actor(normalized_obs1)), self.ret_rms)
self.target_Q = self.rewards + (1. - self.terminals1) * gamma * Q_obs1
# Set up parts.
if self.param_noise is not None:
self.setup_param_noise(normalized_obs0)
self.setup_actor_optimizer()
self.setup_critic_optimizer()
if self.normalize_returns and self.enable_popart:
self.setup_popart()
self.setup_stats()
self.setup_target_network_updates()
def setup_target_network_updates(self):
actor_init_updates, actor_soft_updates = get_target_updates(self.actor.vars, self.target_actor.vars, self.tau)
critic_init_updates, critic_soft_updates = get_target_updates(self.critic.vars, self.target_critic.vars, self.tau)
self.target_init_updates = [actor_init_updates, critic_init_updates]
self.target_soft_updates = [actor_soft_updates, critic_soft_updates]
def setup_param_noise(self, normalized_obs0):
assert self.param_noise is not None
# Configure perturbed actor.
param_noise_actor = copy(self.actor)
param_noise_actor.name = 'param_noise_actor'
self.perturbed_actor_tf = param_noise_actor(normalized_obs0)
logger.info('setting up param noise')
self.perturb_policy_ops = get_perturbed_actor_updates(self.actor, param_noise_actor, self.param_noise_stddev)
# Configure separate copy for stddev adaptation.
adaptive_param_noise_actor = copy(self.actor)
adaptive_param_noise_actor.name = 'adaptive_param_noise_actor'
adaptive_actor_tf = adaptive_param_noise_actor(normalized_obs0)
self.perturb_adaptive_policy_ops = get_perturbed_actor_updates(self.actor, adaptive_param_noise_actor, self.param_noise_stddev)
self.adaptive_policy_distance = tf.sqrt(tf.reduce_mean(tf.square(self.actor_tf - adaptive_actor_tf)))
def setup_actor_optimizer(self):
logger.info('setting up actor optimizer')
self.actor_loss = -tf.reduce_mean(self.critic_with_actor_tf)
actor_shapes = [var.get_shape().as_list() for var in self.actor.trainable_vars]
actor_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in actor_shapes])
logger.info(' actor shapes: {}'.format(actor_shapes))
logger.info(' actor params: {}'.format(actor_nb_params))
self.actor_grads = U.flatgrad(self.actor_loss, self.actor.trainable_vars, clip_norm=self.clip_norm)
self.actor_optimizer = MpiAdam(var_list=self.actor.trainable_vars,
beta1=0.9, beta2=0.999, epsilon=1e-08)
def setup_critic_optimizer(self):
logger.info('setting up critic optimizer')
normalized_critic_target_tf = tf.clip_by_value(normalize(self.critic_target, self.ret_rms), self.return_range[0], self.return_range[1])
self.critic_loss = tf.reduce_mean(tf.square(self.normalized_critic_tf - normalized_critic_target_tf))
if self.critic_l2_reg > 0.:
critic_reg_vars = [var for var in self.critic.trainable_vars if 'kernel' in var.name and 'output' not in var.name]
for var in critic_reg_vars:
logger.info(' regularizing: {}'.format(var.name))
logger.info(' applying l2 regularization with {}'.format(self.critic_l2_reg))
critic_reg = tc.layers.apply_regularization(
tc.layers.l2_regularizer(self.critic_l2_reg),
weights_list=critic_reg_vars
)
self.critic_loss += critic_reg
critic_shapes = [var.get_shape().as_list() for var in self.critic.trainable_vars]
critic_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in critic_shapes])
logger.info(' critic shapes: {}'.format(critic_shapes))
logger.info(' critic params: {}'.format(critic_nb_params))
self.critic_grads = U.flatgrad(self.critic_loss, self.critic.trainable_vars, clip_norm=self.clip_norm)
self.critic_optimizer = MpiAdam(var_list=self.critic.trainable_vars,
beta1=0.9, beta2=0.999, epsilon=1e-08)
def setup_popart(self):
# See https://arxiv.org/pdf/1602.07714.pdf for details.
self.old_std = tf.placeholder(tf.float32, shape=[1], name='old_std')
new_std = self.ret_rms.std
self.old_mean = tf.placeholder(tf.float32, shape=[1], name='old_mean')
new_mean = self.ret_rms.mean
self.renormalize_Q_outputs_op = []
for vs in [self.critic.output_vars, self.target_critic.output_vars]:
assert len(vs) == 2
M, b = vs
assert 'kernel' in M.name
assert 'bias' in b.name
assert M.get_shape()[-1] == 1
assert b.get_shape()[-1] == 1
self.renormalize_Q_outputs_op += [M.assign(M * self.old_std / new_std)]
self.renormalize_Q_outputs_op += [b.assign((b * self.old_std + self.old_mean - new_mean) / new_std)]
def setup_stats(self):
ops = []
names = []
if self.normalize_returns:
ops += [self.ret_rms.mean, self.ret_rms.std]
names += ['ret_rms_mean', 'ret_rms_std']
if self.normalize_observations:
ops += [tf.reduce_mean(self.obs_rms.mean), tf.reduce_mean(self.obs_rms.std)]
names += ['obs_rms_mean', 'obs_rms_std']
ops += [tf.reduce_mean(self.critic_tf)]
names += ['reference_Q_mean']
ops += [reduce_std(self.critic_tf)]
names += ['reference_Q_std']
ops += [tf.reduce_mean(self.critic_with_actor_tf)]
names += ['reference_actor_Q_mean']
ops += [reduce_std(self.critic_with_actor_tf)]
names += ['reference_actor_Q_std']
ops += [tf.reduce_mean(self.actor_tf)]
names += ['reference_action_mean']
ops += [reduce_std(self.actor_tf)]
names += ['reference_action_std']
if self.param_noise:
ops += [tf.reduce_mean(self.perturbed_actor_tf)]
names += ['reference_perturbed_action_mean']
ops += [reduce_std(self.perturbed_actor_tf)]
names += ['reference_perturbed_action_std']
self.stats_ops = ops
self.stats_names = names
def pi(self, obs, apply_noise=True, compute_Q=True):
if self.param_noise is not None and apply_noise:
actor_tf = self.perturbed_actor_tf
else:
actor_tf = self.actor_tf
feed_dict = {self.obs0: [obs]}
if compute_Q:
action, q = self.sess.run([actor_tf, self.critic_with_actor_tf], feed_dict=feed_dict)
else:
action = self.sess.run(actor_tf, feed_dict=feed_dict)
q = None
action = action.flatten()
if self.action_noise is not None and apply_noise:
noise = self.action_noise()
assert noise.shape == action.shape
action += noise
action = np.clip(action, self.action_range[0], self.action_range[1])
return action, q
def store_transition(self, obs0, action, reward, obs1, terminal1):
reward *= self.reward_scale
self.memory.append(obs0, action, reward, obs1, terminal1)
if self.normalize_observations:
self.obs_rms.update(np.array([obs0]))
def train(self):
# Get a batch.
batch = self.memory.sample(batch_size=self.batch_size)
if self.normalize_returns and self.enable_popart:
old_mean, old_std, target_Q = self.sess.run([self.ret_rms.mean, self.ret_rms.std, self.target_Q], feed_dict={
self.obs1: batch['obs1'],
self.rewards: batch['rewards'],
self.terminals1: batch['terminals1'].astype('float32'),
})
self.ret_rms.update(target_Q.flatten())
self.sess.run(self.renormalize_Q_outputs_op, feed_dict={
self.old_std : np.array([old_std]),
self.old_mean : np.array([old_mean]),
})
# Run sanity check. Disabled by default since it slows down things considerably.
# print('running sanity check')
# target_Q_new, new_mean, new_std = self.sess.run([self.target_Q, self.ret_rms.mean, self.ret_rms.std], feed_dict={
# self.obs1: batch['obs1'],
# self.rewards: batch['rewards'],
# self.terminals1: batch['terminals1'].astype('float32'),
# })
# print(target_Q_new, target_Q, new_mean, new_std)
# assert (np.abs(target_Q - target_Q_new) < 1e-3).all()
else:
target_Q = self.sess.run(self.target_Q, feed_dict={
self.obs1: batch['obs1'],
self.rewards: batch['rewards'],
self.terminals1: batch['terminals1'].astype('float32'),
})
# Get all gradients and perform a synced update.
ops = [self.actor_grads, self.actor_loss, self.critic_grads, self.critic_loss]
actor_grads, actor_loss, critic_grads, critic_loss = self.sess.run(ops, feed_dict={
self.obs0: batch['obs0'],
self.actions: batch['actions'],
self.critic_target: target_Q,
})
self.actor_optimizer.update(actor_grads, stepsize=self.actor_lr)
self.critic_optimizer.update(critic_grads, stepsize=self.critic_lr)
return critic_loss, actor_loss
def initialize(self, sess):
self.sess = sess
self.sess.run(tf.global_variables_initializer())
self.actor_optimizer.sync()
self.critic_optimizer.sync()
self.sess.run(self.target_init_updates)
def update_target_net(self):
self.sess.run(self.target_soft_updates)
def get_stats(self):
if self.stats_sample is None:
# Get a sample and keep that fixed for all further computations.
# This allows us to estimate the change in value for the same set of inputs.
self.stats_sample = self.memory.sample(batch_size=self.batch_size)
values = self.sess.run(self.stats_ops, feed_dict={
self.obs0: self.stats_sample['obs0'],
self.actions: self.stats_sample['actions'],
})
names = self.stats_names[:]
assert len(names) == len(values)
stats = dict(zip(names, values))
if self.param_noise is not None:
stats = {**stats, **self.param_noise.get_stats()}
return stats
def adapt_param_noise(self):
if self.param_noise is None:
return 0.
# Perturb a separate copy of the policy to adjust the scale for the next "real" perturbation.
batch = self.memory.sample(batch_size=self.batch_size)
self.sess.run(self.perturb_adaptive_policy_ops, feed_dict={
self.param_noise_stddev: self.param_noise.current_stddev,
})
distance = self.sess.run(self.adaptive_policy_distance, feed_dict={
self.obs0: batch['obs0'],
self.param_noise_stddev: self.param_noise.current_stddev,
})
mean_distance = MPI.COMM_WORLD.allreduce(distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
self.param_noise.adapt(mean_distance)
return mean_distance
def reset(self):
# Reset internal state after an episode is complete.
if self.action_noise is not None:
self.action_noise.reset()
if self.param_noise is not None:
self.sess.run(self.perturb_policy_ops, feed_dict={
self.param_noise_stddev: self.param_noise.current_stddev,
})
return agent
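The `noise_type` argument of `learn` is a comma-separated list of `name_stddev` entries. The following sketch isolates that parsing so it can be tried without TensorFlow or the baselines noise classes. `parse_noise_type` is a hypothetical helper, and it returns plain values where `learn` builds `AdaptiveParamNoiseSpec`, `NormalActionNoise` or `OrnsteinUhlenbeckActionNoise` objects.

```python
def parse_noise_type(noise_type):
    """Parse a spec such as 'adaptive-param_0.2,ou_0.1'.

    Returns (param_noise_stddev, action_noise), where action_noise is a
    (kind, stddev) tuple or None. Later entries override earlier ones.
    """
    param_stddev, action_noise = None, None
    if noise_type is None:
        return param_stddev, action_noise
    for spec in noise_type.split(','):
        spec = spec.strip()
        if spec == 'none':
            continue
        elif 'adaptive-param' in spec:
            _, stddev = spec.split('_')
            param_stddev = float(stddev)
        elif 'normal' in spec:
            _, stddev = spec.split('_')
            action_noise = ('normal', float(stddev))
        elif 'ou' in spec:
            _, stddev = spec.split('_')
            action_noise = ('ou', float(stddev))
        else:
            raise RuntimeError('unknown noise type "{}"'.format(spec))
    return param_stddev, action_noise


print(parse_noise_type('adaptive-param_0.2'))
print(parse_noise_type('adaptive-param_0.2, ou_0.1'))
```

Because matching uses substring tests in a fixed order, parameter noise and action noise can be combined in a single spec string.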

356
baselines/ddpg/ddpg_learner.py Executable file

@@ -0,0 +1,356 @@
from functools import reduce
import numpy as np
import tensorflow as tf
from baselines import logger
from baselines.ddpg.models import Actor, Critic
from baselines.common.mpi_running_mean_std import RunningMeanStd
try:
from mpi4py import MPI
from baselines.common.mpi_adam_optimizer import MpiAdamOptimizer
from baselines.common.mpi_util import sync_from_root
except ImportError:
MPI = None
def normalize(x, stats):
if stats is None:
return x
return (x - stats.mean) / (stats.std + 1e-8)
def denormalize(x, stats):
if stats is None:
return x
return x * stats.std + stats.mean
@tf.function
def reduce_std(x, axis=None, keepdims=False):
return tf.sqrt(reduce_var(x, axis=axis, keepdims=keepdims))
def reduce_var(x, axis=None, keepdims=False):
m = tf.reduce_mean(x, axis=axis, keepdims=True)
devs_squared = tf.square(x - m)
return tf.reduce_mean(devs_squared, axis=axis, keepdims=keepdims)
@tf.function
def update_perturbed_actor(actor, perturbed_actor, param_noise_stddev):
for var, perturbed_var in zip(actor.variables, perturbed_actor.variables):
if var in actor.perturbable_vars:
perturbed_var.assign(var + tf.random.normal(shape=tf.shape(var), mean=0., stddev=param_noise_stddev))
else:
perturbed_var.assign(var)
class DDPG(tf.Module):
def __init__(self, actor, critic, memory, observation_shape, action_shape, param_noise=None, action_noise=None,
gamma=0.99, tau=0.001, normalize_returns=False, enable_popart=False, normalize_observations=True,
batch_size=128, observation_range=(-5., 5.), action_range=(-1., 1.), return_range=(-np.inf, np.inf),
critic_l2_reg=0., actor_lr=1e-4, critic_lr=1e-3, clip_norm=None, reward_scale=1.):
# Parameters.
self.gamma = gamma
self.tau = tau
self.memory = memory
self.normalize_observations = normalize_observations
self.normalize_returns = normalize_returns
self.action_noise = action_noise
self.param_noise = param_noise
self.action_range = action_range
self.return_range = return_range
self.observation_range = observation_range
self.observation_shape = observation_shape
self.critic = critic
self.actor = actor
self.clip_norm = clip_norm
self.enable_popart = enable_popart
self.reward_scale = reward_scale
self.batch_size = batch_size
self.stats_sample = None
self.critic_l2_reg = critic_l2_reg
self.actor_lr = tf.constant(actor_lr)
self.critic_lr = tf.constant(critic_lr)
# Observation normalization.
if self.normalize_observations:
with tf.name_scope('obs_rms'):
self.obs_rms = RunningMeanStd(shape=observation_shape)
else:
self.obs_rms = None
# Return normalization.
if self.normalize_returns:
with tf.name_scope('ret_rms'):
self.ret_rms = RunningMeanStd()
else:
self.ret_rms = None
# Create target networks.
self.target_critic = Critic(actor.nb_actions, observation_shape, name='target_critic', network=critic.network, **critic.network_kwargs)
self.target_actor = Actor(actor.nb_actions, observation_shape, name='target_actor', network=actor.network, **actor.network_kwargs)
# Set up parts.
if self.param_noise is not None:
self.setup_param_noise()
if MPI is not None:
comm = MPI.COMM_WORLD
self.actor_optimizer = MpiAdamOptimizer(comm, self.actor.trainable_variables)
self.critic_optimizer = MpiAdamOptimizer(comm, self.critic.trainable_variables)
else:
self.actor_optimizer = tf.keras.optimizers.Adam(learning_rate=actor_lr)
self.critic_optimizer = tf.keras.optimizers.Adam(learning_rate=critic_lr)
logger.info('setting up actor optimizer')
actor_shapes = [var.get_shape().as_list() for var in self.actor.trainable_variables]
actor_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in actor_shapes])
logger.info(' actor shapes: {}'.format(actor_shapes))
logger.info(' actor params: {}'.format(actor_nb_params))
logger.info('setting up critic optimizer')
critic_shapes = [var.get_shape().as_list() for var in self.critic.trainable_variables]
critic_nb_params = sum([reduce(lambda x, y: x * y, shape) for shape in critic_shapes])
logger.info(' critic shapes: {}'.format(critic_shapes))
logger.info(' critic params: {}'.format(critic_nb_params))
if self.critic_l2_reg > 0.:
critic_reg_vars = []
for layer in self.critic.network_builder.layers[1:]:
critic_reg_vars.append(layer.kernel)
for var in critic_reg_vars:
logger.info(' regularizing: {}'.format(var.name))
logger.info(' applying l2 regularization with {}'.format(self.critic_l2_reg))
logger.info('setting up critic target updates ...')
for var, target_var in zip(self.critic.variables, self.target_critic.variables):
logger.info(' {} <- {}'.format(target_var.name, var.name))
logger.info('setting up actor target updates ...')
for var, target_var in zip(self.actor.variables, self.target_actor.variables):
logger.info(' {} <- {}'.format(target_var.name, var.name))
if self.param_noise:
logger.info('setting up param noise')
for var, perturbed_var in zip(self.actor.variables, self.perturbed_actor.variables):
if var in actor.perturbable_vars:
logger.info(' {} <- {} + noise'.format(perturbed_var.name, var.name))
else:
logger.info(' {} <- {}'.format(perturbed_var.name, var.name))
for var, perturbed_var in zip(self.actor.variables, self.perturbed_adaptive_actor.variables):
if var in actor.perturbable_vars:
logger.info(' {} <- {} + noise'.format(perturbed_var.name, var.name))
else:
logger.info(' {} <- {}'.format(perturbed_var.name, var.name))
if self.normalize_returns and self.enable_popart:
self.setup_popart()
self.initial_state = None # recurrent architectures not supported yet
def setup_param_noise(self):
assert self.param_noise is not None
# Configure perturbed actor.
self.perturbed_actor = Actor(self.actor.nb_actions, self.observation_shape, name='param_noise_actor', network=self.actor.network, **self.actor.network_kwargs)
# Configure separate copy for stddev adaptation.
self.perturbed_adaptive_actor = Actor(self.actor.nb_actions, self.observation_shape, name='adaptive_param_noise_actor', network=self.actor.network, **self.actor.network_kwargs)
def setup_popart(self):
# See https://arxiv.org/pdf/1602.07714.pdf for details.
for vs in [self.critic.output_vars, self.target_critic.output_vars]:
assert len(vs) == 2
M, b = vs
assert 'kernel' in M.name
assert 'bias' in b.name
assert M.get_shape()[-1] == 1
assert b.get_shape()[-1] == 1
@tf.function
def step(self, obs, apply_noise=True, compute_Q=True):
normalized_obs = tf.clip_by_value(normalize(obs, self.obs_rms), self.observation_range[0], self.observation_range[1])
actor_tf = self.actor(normalized_obs)
if self.param_noise is not None and apply_noise:
action = self.perturbed_actor(normalized_obs)
else:
action = actor_tf
if compute_Q:
normalized_critic_with_actor_tf = self.critic(normalized_obs, actor_tf)
q = denormalize(tf.clip_by_value(normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
else:
q = None
if self.action_noise is not None and apply_noise:
noise = self.action_noise()
action += noise
action = tf.clip_by_value(action, self.action_range[0], self.action_range[1])
return action, q, None, None
def store_transition(self, obs0, action, reward, obs1, terminal1):
reward *= self.reward_scale
B = obs0.shape[0]
for b in range(B):
self.memory.append(obs0[b], action[b], reward[b], obs1[b], terminal1[b])
if self.normalize_observations:
self.obs_rms.update(np.array([obs0[b]]))
def train(self):
batch = self.memory.sample(batch_size=self.batch_size)
obs0, obs1 = tf.constant(batch['obs0']), tf.constant(batch['obs1'])
actions, rewards, terminals1 = tf.constant(batch['actions']), tf.constant(batch['rewards']), tf.constant(batch['terminals1'], dtype=tf.float32)
normalized_obs0, target_Q = self.compute_normalized_obs0_and_target_Q(obs0, obs1, rewards, terminals1)
if self.normalize_returns and self.enable_popart:
old_mean = self.ret_rms.mean
old_std = self.ret_rms.std
self.ret_rms.update(target_Q.flatten())
# renormalize Q outputs
new_mean = self.ret_rms.mean
new_std = self.ret_rms.std
for vs in [self.critic.output_vars, self.target_critic.output_vars]:
kernel, bias = vs
kernel.assign(kernel * old_std / new_std)
bias.assign((bias * old_std + old_mean - new_mean) / new_std)
actor_grads, actor_loss = self.get_actor_grads(normalized_obs0)
critic_grads, critic_loss = self.get_critic_grads(normalized_obs0, actions, target_Q)
if MPI is not None:
self.actor_optimizer.apply_gradients(actor_grads, self.actor_lr)
self.critic_optimizer.apply_gradients(critic_grads, self.critic_lr)
else:
self.actor_optimizer.apply_gradients(zip(actor_grads, self.actor.trainable_variables))
self.critic_optimizer.apply_gradients(zip(critic_grads, self.critic.trainable_variables))
return critic_loss, actor_loss
@tf.function
def compute_normalized_obs0_and_target_Q(self, obs0, obs1, rewards, terminals1):
normalized_obs0 = tf.clip_by_value(normalize(obs0, self.obs_rms), self.observation_range[0], self.observation_range[1])
normalized_obs1 = tf.clip_by_value(normalize(obs1, self.obs_rms), self.observation_range[0], self.observation_range[1])
Q_obs1 = denormalize(self.target_critic(normalized_obs1, self.target_actor(normalized_obs1)), self.ret_rms)
target_Q = rewards + (1. - terminals1) * self.gamma * Q_obs1
return normalized_obs0, target_Q
@tf.function
def get_actor_grads(self, normalized_obs0):
with tf.GradientTape() as tape:
actor_tf = self.actor(normalized_obs0)
normalized_critic_with_actor_tf = self.critic(normalized_obs0, actor_tf)
critic_with_actor_tf = denormalize(tf.clip_by_value(normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
actor_loss = -tf.reduce_mean(critic_with_actor_tf)
actor_grads = tape.gradient(actor_loss, self.actor.trainable_variables)
if self.clip_norm:
actor_grads = [tf.clip_by_norm(grad, clip_norm=self.clip_norm) for grad in actor_grads]
if MPI is not None:
actor_grads = tf.concat([tf.reshape(g, (-1,)) for g in actor_grads], axis=0)
return actor_grads, actor_loss
@tf.function
def get_critic_grads(self, normalized_obs0, actions, target_Q):
with tf.GradientTape() as tape:
normalized_critic_tf = self.critic(normalized_obs0, actions)
normalized_critic_target_tf = tf.clip_by_value(normalize(target_Q, self.ret_rms), self.return_range[0], self.return_range[1])
critic_loss = tf.reduce_mean(tf.square(normalized_critic_tf - normalized_critic_target_tf))
# The first is input layer, which is ignored here.
if self.critic_l2_reg > 0.:
# Ignore the first input layer.
for layer in self.critic.network_builder.layers[1:]:
# The original l2_regularizer takes half of sum square.
critic_loss += (self.critic_l2_reg / 2.)* tf.reduce_sum(tf.square(layer.kernel))
critic_grads = tape.gradient(critic_loss, self.critic.trainable_variables)
if self.clip_norm:
critic_grads = [tf.clip_by_norm(grad, clip_norm=self.clip_norm) for grad in critic_grads]
if MPI is not None:
critic_grads = tf.concat([tf.reshape(g, (-1,)) for g in critic_grads], axis=0)
return critic_grads, critic_loss
def initialize(self):
if MPI is not None:
sync_from_root(self.actor.trainable_variables + self.critic.trainable_variables)
self.target_actor.set_weights(self.actor.get_weights())
self.target_critic.set_weights(self.critic.get_weights())
@tf.function
def update_target_net(self):
for var, target_var in zip(self.actor.variables, self.target_actor.variables):
target_var.assign((1. - self.tau) * target_var + self.tau * var)
for var, target_var in zip(self.critic.variables, self.target_critic.variables):
target_var.assign((1. - self.tau) * target_var + self.tau * var)
def get_stats(self):
if self.stats_sample is None:
# Get a sample and keep that fixed for all further computations.
# This allows us to estimate the change in value for the same set of inputs.
self.stats_sample = self.memory.sample(batch_size=self.batch_size)
obs0 = self.stats_sample['obs0']
actions = self.stats_sample['actions']
normalized_obs0 = tf.clip_by_value(normalize(obs0, self.obs_rms), self.observation_range[0], self.observation_range[1])
normalized_critic_tf = self.critic(normalized_obs0, actions)
critic_tf = denormalize(tf.clip_by_value(normalized_critic_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
actor_tf = self.actor(normalized_obs0)
normalized_critic_with_actor_tf = self.critic(normalized_obs0, actor_tf)
critic_with_actor_tf = denormalize(tf.clip_by_value(normalized_critic_with_actor_tf, self.return_range[0], self.return_range[1]), self.ret_rms)
stats = {}
if self.normalize_returns:
stats['ret_rms_mean'] = self.ret_rms.mean
stats['ret_rms_std'] = self.ret_rms.std
if self.normalize_observations:
stats['obs_rms_mean'] = tf.reduce_mean(self.obs_rms.mean)
stats['obs_rms_std'] = tf.reduce_mean(self.obs_rms.std)
stats['reference_Q_mean'] = tf.reduce_mean(critic_tf)
stats['reference_Q_std'] = reduce_std(critic_tf)
stats['reference_actor_Q_mean'] = tf.reduce_mean(critic_with_actor_tf)
stats['reference_actor_Q_std'] = reduce_std(critic_with_actor_tf)
stats['reference_action_mean'] = tf.reduce_mean(actor_tf)
stats['reference_action_std'] = reduce_std(actor_tf)
if self.param_noise:
perturbed_actor_tf = self.perturbed_actor(normalized_obs0)
stats['reference_perturbed_action_mean'] = tf.reduce_mean(perturbed_actor_tf)
stats['reference_perturbed_action_std'] = reduce_std(perturbed_actor_tf)
stats.update(self.param_noise.get_stats())
return stats
def adapt_param_noise(self, obs0):
try:
from mpi4py import MPI
except ImportError:
MPI = None
if self.param_noise is None:
return 0.
mean_distance = self.get_mean_distance(obs0).numpy()
if MPI is not None:
mean_distance = MPI.COMM_WORLD.allreduce(mean_distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
self.param_noise.adapt(mean_distance)
return mean_distance
@tf.function
def get_mean_distance(self, obs0):
# Perturb a separate copy of the policy to adjust the scale for the next "real" perturbation.
update_perturbed_actor(self.actor, self.perturbed_adaptive_actor, self.param_noise.current_stddev)
normalized_obs0 = tf.clip_by_value(normalize(obs0, self.obs_rms), self.observation_range[0], self.observation_range[1])
actor_tf = self.actor(normalized_obs0)
adaptive_actor_tf = self.perturbed_adaptive_actor(normalized_obs0)
mean_distance = tf.sqrt(tf.reduce_mean(tf.square(actor_tf - adaptive_actor_tf)))
return mean_distance
def reset(self):
# Reset internal state after an episode is complete.
if self.action_noise is not None:
self.action_noise.reset()
if self.param_noise is not None:
update_perturbed_actor(self.actor, self.perturbed_actor, self.param_noise.current_stddev)
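
The Pop-Art branch of `train()` above rescales the critic's output layer whenever the return statistics change, so that the denormalized Q-value stays the same. Below is a minimal scalar sketch of that identity. It assumes `denormalize(x, stats)` computes `x * std + mean`, which is not shown in this diff; the helper names and the numbers are illustrative only.

```python
def denormalize(x, mean, std):
    # Assumed form of the denormalize helper: undo (x - mean) / std.
    return x * std + mean


def popart_rescale(kernel, bias, old_mean, old_std, new_mean, new_std):
    # Same arithmetic as the kernel.assign / bias.assign lines in train().
    new_kernel = kernel * old_std / new_std
    new_bias = (bias * old_std + old_mean - new_mean) / new_std
    return new_kernel, new_bias


# Illustrative values for a one-unit linear output layer.
kernel, bias = 0.5, -0.2
old_mean, old_std = 1.0, 2.0
new_mean, new_std = 3.0, 4.0
feature = 1.5

q_before = denormalize(kernel * feature + bias, old_mean, old_std)
new_kernel, new_bias = popart_rescale(kernel, bias, old_mean, old_std, new_mean, new_std)
q_after = denormalize(new_kernel * feature + new_bias, new_mean, new_std)
```

In this sketch `q_before` and `q_after` agree up to floating-point error, which is the property the renormalization is meant to preserve.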

View File

@@ -1,123 +0,0 @@
import argparse
import time
import os
import logging
from baselines import logger, bench
from baselines.common.misc_util import (
set_global_seeds,
boolean_flag,
)
import baselines.ddpg.training as training
from baselines.ddpg.models import Actor, Critic
from baselines.ddpg.memory import Memory
from baselines.ddpg.noise import *
import gym
import tensorflow as tf
from mpi4py import MPI
def run(env_id, seed, noise_type, layer_norm, evaluation, **kwargs):
# Configure things.
rank = MPI.COMM_WORLD.Get_rank()
if rank != 0:
logger.set_level(logger.DISABLED)
# Create envs.
env = gym.make(env_id)
env = bench.Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(rank)))
if evaluation and rank==0:
eval_env = gym.make(env_id)
eval_env = bench.Monitor(eval_env, os.path.join(logger.get_dir(), 'gym_eval'))
env = bench.Monitor(env, None)
else:
eval_env = None
# Parse noise_type
action_noise = None
param_noise = None
nb_actions = env.action_space.shape[-1]
for current_noise_type in noise_type.split(','):
current_noise_type = current_noise_type.strip()
if current_noise_type == 'none':
pass
elif 'adaptive-param' in current_noise_type:
_, stddev = current_noise_type.split('_')
param_noise = AdaptiveParamNoiseSpec(initial_stddev=float(stddev), desired_action_stddev=float(stddev))
elif 'normal' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = NormalActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
elif 'ou' in current_noise_type:
_, stddev = current_noise_type.split('_')
action_noise = OrnsteinUhlenbeckActionNoise(mu=np.zeros(nb_actions), sigma=float(stddev) * np.ones(nb_actions))
else:
raise RuntimeError('unknown noise type "{}"'.format(current_noise_type))
# Configure components.
memory = Memory(limit=int(1e6), action_shape=env.action_space.shape, observation_shape=env.observation_space.shape)
critic = Critic(layer_norm=layer_norm)
actor = Actor(nb_actions, layer_norm=layer_norm)
# Seed everything to make things reproducible.
seed = seed + 1000000 * rank
logger.info('rank {}: seed={}, logdir={}'.format(rank, seed, logger.get_dir()))
tf.reset_default_graph()
set_global_seeds(seed)
env.seed(seed)
if eval_env is not None:
eval_env.seed(seed)
# Disable logging for rank != 0 to avoid noise.
if rank == 0:
start_time = time.time()
training.train(env=env, eval_env=eval_env, param_noise=param_noise,
action_noise=action_noise, actor=actor, critic=critic, memory=memory, **kwargs)
env.close()
if eval_env is not None:
eval_env.close()
if rank == 0:
logger.info('total runtime: {}s'.format(time.time() - start_time))
def parse_args():
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--env-id', type=str, default='HalfCheetah-v1')
boolean_flag(parser, 'render-eval', default=False)
boolean_flag(parser, 'layer-norm', default=True)
boolean_flag(parser, 'render', default=False)
boolean_flag(parser, 'normalize-returns', default=False)
boolean_flag(parser, 'normalize-observations', default=True)
parser.add_argument('--seed', help='RNG seed', type=int, default=0)
parser.add_argument('--critic-l2-reg', type=float, default=1e-2)
parser.add_argument('--batch-size', type=int, default=64) # per MPI worker
parser.add_argument('--actor-lr', type=float, default=1e-4)
parser.add_argument('--critic-lr', type=float, default=1e-3)
boolean_flag(parser, 'popart', default=False)
parser.add_argument('--gamma', type=float, default=0.99)
parser.add_argument('--reward-scale', type=float, default=1.)
parser.add_argument('--clip-norm', type=float, default=None)
parser.add_argument('--nb-epochs', type=int, default=500) # with default settings, perform 1M steps total
parser.add_argument('--nb-epoch-cycles', type=int, default=20)
parser.add_argument('--nb-train-steps', type=int, default=50) # per epoch cycle and MPI worker
parser.add_argument('--nb-eval-steps', type=int, default=100) # per epoch cycle and MPI worker
parser.add_argument('--nb-rollout-steps', type=int, default=100) # per epoch cycle and MPI worker
parser.add_argument('--noise-type', type=str, default='adaptive-param_0.2') # choices are adaptive-param_xx, ou_xx, normal_xx, none
parser.add_argument('--num-timesteps', type=int, default=None)
boolean_flag(parser, 'evaluation', default=False)
args = parser.parse_args()
# we don't directly specify timesteps for this script, so make sure that if we do specify them
# they agree with the other parameters
if args.num_timesteps is not None:
assert(args.num_timesteps == args.nb_epochs * args.nb_epoch_cycles * args.nb_rollout_steps)
dict_args = vars(args)
del dict_args['num_timesteps']
return dict_args
if __name__ == '__main__':
args = parse_args()
if MPI.COMM_WORLD.Get_rank() == 0:
logger.configure()
# Run actual script.
run(**args)
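
The removed script accepted a comma-separated `--noise-type` string such as `adaptive-param_0.2,ou_0.3` and required `--num-timesteps`, when given, to equal the product of the epoch settings. The following is a simplified sketch of that parsing and check using only the standard library. `parse_noise_type` is a hypothetical helper that returns plain tuples instead of the noise objects, and it matches the prefix exactly, whereas the original uses substring tests.

```python
def parse_noise_type(noise_type):
    """Split a spec like 'adaptive-param_0.2,ou_0.3' into (kind, stddev) pairs."""
    specs = []
    for item in noise_type.split(','):
        item = item.strip()
        if item == 'none':
            continue
        kind, _, stddev = item.partition('_')
        if kind not in ('adaptive-param', 'normal', 'ou'):
            raise RuntimeError('unknown noise type "{}"'.format(item))
        specs.append((kind, float(stddev)))
    return specs


specs = parse_noise_type('adaptive-param_0.2, ou_0.3')

# Default settings from parse_args(): 500 epochs x 20 cycles x 100 rollout steps.
total_steps = 500 * 20 * 100
```

With the defaults this yields one million environment steps, matching the comment on `--nb-epochs`.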

4
baselines/ddpg/memory.py Normal file → Executable file

@@ -51,7 +51,7 @@ class Memory(object):
def sample(self, batch_size):
# Draw such that we always have a proceeding element.
batch_idxs = np.random.random_integers(self.nb_entries - 2, size=batch_size)
batch_idxs = np.random.randint(self.nb_entries - 2, size=batch_size)
obs0_batch = self.observations0.get_batch(batch_idxs)
obs1_batch = self.observations1.get_batch(batch_idxs)
@@ -71,7 +71,7 @@ class Memory(object):
def append(self, obs0, action, reward, obs1, terminal1, training=True):
if not training:
return
self.observations0.append(obs0)
self.actions.append(action)
self.rewards.append(reward)
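
The first hunk above swaps `np.random.random_integers` for `np.random.randint`. The two calls do not sample the same range: `random_integers(high)` draws from the closed interval `[1, high]`, while `randint(high)` draws from the half-open interval `[0, high)`. The sketch below, with an illustrative buffer size, shows the range produced by the new call and a `randint` call that would reproduce the old range. Whether the shift is intended is not stated in this diff.

```python
import numpy as np

nb_entries = 10      # illustrative buffer size
batch_size = 1000

np.random.seed(0)

# New behaviour in the diff: indices in [0, nb_entries - 3].
new_idxs = np.random.randint(nb_entries - 2, size=batch_size)

# Equivalent of the old random_integers(nb_entries - 2): indices in [1, nb_entries - 2].
old_style_idxs = np.random.randint(1, nb_entries - 1, size=batch_size)
```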

90
baselines/ddpg/models.py Normal file → Executable file

@@ -1,77 +1,49 @@
import tensorflow as tf
import tensorflow.contrib as tc
from baselines.common.models import get_network_builder
class Model(object):
def __init__(self, name):
self.name = name
@property
def vars(self):
return tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
@property
def trainable_vars(self):
return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.name)
class Model(tf.keras.Model):
def __init__(self, name, network='mlp', **network_kwargs):
super(Model, self).__init__(name=name)
self.network = network
self.network_kwargs = network_kwargs
@property
def perturbable_vars(self):
return [var for var in self.trainable_vars if 'LayerNorm' not in var.name]
return [var for var in self.trainable_variables if 'layer_normalization' not in var.name]
class Actor(Model):
def __init__(self, nb_actions, name='actor', layer_norm=True):
super(Actor, self).__init__(name=name)
def __init__(self, nb_actions, ob_shape, name='actor', network='mlp', **network_kwargs):
super().__init__(name=name, network=network, **network_kwargs)
self.nb_actions = nb_actions
self.layer_norm = layer_norm
self.network_builder = get_network_builder(network)(**network_kwargs)(ob_shape)
self.output_layer = tf.keras.layers.Dense(units=self.nb_actions,
activation=tf.keras.activations.tanh,
kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
_ = self.output_layer(self.network_builder.outputs[0])
def __call__(self, obs, reuse=False):
with tf.variable_scope(self.name) as scope:
if reuse:
scope.reuse_variables()
x = obs
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
x = tf.layers.dense(x, self.nb_actions, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
x = tf.nn.tanh(x)
return x
@tf.function
def call(self, obs):
return self.output_layer(self.network_builder(obs))
class Critic(Model):
def __init__(self, name='critic', layer_norm=True):
super(Critic, self).__init__(name=name)
self.layer_norm = layer_norm
def __init__(self, nb_actions, ob_shape, name='critic', network='mlp', **network_kwargs):
super().__init__(name=name, network=network, **network_kwargs)
self.layer_norm = True
self.network_builder = get_network_builder(network)(**network_kwargs)((ob_shape[0] + nb_actions,))
self.output_layer = tf.keras.layers.Dense(units=1,
kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3),
name='output')
_ = self.output_layer(self.network_builder.outputs[0])
def __call__(self, obs, action, reuse=False):
with tf.variable_scope(self.name) as scope:
if reuse:
scope.reuse_variables()
x = obs
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
x = tf.concat([x, action], axis=-1)
x = tf.layers.dense(x, 64)
if self.layer_norm:
x = tc.layers.layer_norm(x, center=True, scale=True)
x = tf.nn.relu(x)
x = tf.layers.dense(x, 1, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
return x
@tf.function
def call(self, obs, actions):
x = tf.concat([obs, actions], axis=-1) # this assumes observation and action can be concatenated
x = self.network_builder(x)
return self.output_layer(x)
@property
def output_vars(self):
output_vars = [var for var in self.trainable_vars if 'output' in var.name]
return output_vars
return self.output_layer.trainable_variables
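
The `perturbable_vars` property now filters on the substring `layer_normalization` rather than `LayerNorm`, which follows the default Keras layer naming. The sketch below illustrates the filter on stand-in objects. The variable names are hypothetical examples, not names taken from a real model.

```python
from types import SimpleNamespace


def perturbable_vars(trainable_variables):
    # Same name-based filter as Model.perturbable_vars in the diff.
    return [v for v in trainable_variables if 'layer_normalization' not in v.name]


# Hypothetical variable names, for illustration only.
variables = [SimpleNamespace(name=n) for n in (
    'actor/dense/kernel:0',
    'actor/dense/bias:0',
    'actor/layer_normalization/gamma:0',
    'actor/layer_normalization/beta:0',
    'actor/dense_1/kernel:0',
)]

kept = [v.name for v in perturbable_vars(variables)]
```

Because the filter is purely name-based, a layer-norm layer given a custom name would no longer be excluded from parameter noise.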

0
baselines/ddpg/noise.py Normal file → Executable file

View File

@@ -1,191 +0,0 @@
import os
import time
from collections import deque
import pickle
from baselines.ddpg.ddpg import DDPG
import baselines.common.tf_util as U
from baselines import logger
import numpy as np
import tensorflow as tf
from mpi4py import MPI
def train(env, nb_epochs, nb_epoch_cycles, render_eval, reward_scale, render, param_noise, actor, critic,
normalize_returns, normalize_observations, critic_l2_reg, actor_lr, critic_lr, action_noise,
popart, gamma, clip_norm, nb_train_steps, nb_rollout_steps, nb_eval_steps, batch_size, memory,
tau=0.01, eval_env=None, param_noise_adaption_interval=50):
rank = MPI.COMM_WORLD.Get_rank()
assert (np.abs(env.action_space.low) == env.action_space.high).all() # we assume symmetric actions.
max_action = env.action_space.high
logger.info('scaling actions by {} before executing in env'.format(max_action))
agent = DDPG(actor, critic, memory, env.observation_space.shape, env.action_space.shape,
gamma=gamma, tau=tau, normalize_returns=normalize_returns, normalize_observations=normalize_observations,
batch_size=batch_size, action_noise=action_noise, param_noise=param_noise, critic_l2_reg=critic_l2_reg,
actor_lr=actor_lr, critic_lr=critic_lr, enable_popart=popart, clip_norm=clip_norm,
reward_scale=reward_scale)
logger.info('Using agent with the following configuration:')
logger.info(str(agent.__dict__.items()))
# Set up logging stuff only for a single worker.
if rank == 0:
saver = tf.train.Saver()
else:
saver = None
step = 0
episode = 0
eval_episode_rewards_history = deque(maxlen=100)
episode_rewards_history = deque(maxlen=100)
with U.single_threaded_session() as sess:
# Prepare everything.
agent.initialize(sess)
sess.graph.finalize()
agent.reset()
obs = env.reset()
if eval_env is not None:
eval_obs = eval_env.reset()
done = False
episode_reward = 0.
episode_step = 0
episodes = 0
t = 0
epoch = 0
start_time = time.time()
epoch_episode_rewards = []
epoch_episode_steps = []
epoch_episode_eval_rewards = []
epoch_episode_eval_steps = []
epoch_start_time = time.time()
epoch_actions = []
epoch_qs = []
epoch_episodes = 0
for epoch in range(nb_epochs):
for cycle in range(nb_epoch_cycles):
# Perform rollouts.
for t_rollout in range(nb_rollout_steps):
# Predict next action.
action, q = agent.pi(obs, apply_noise=True, compute_Q=True)
assert action.shape == env.action_space.shape
# Execute next action.
if rank == 0 and render:
env.render()
assert max_action.shape == action.shape
new_obs, r, done, info = env.step(max_action * action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
t += 1
if rank == 0 and render:
env.render()
episode_reward += r
episode_step += 1
# Book-keeping.
epoch_actions.append(action)
epoch_qs.append(q)
agent.store_transition(obs, action, r, new_obs, done)
obs = new_obs
if done:
# Episode done.
epoch_episode_rewards.append(episode_reward)
episode_rewards_history.append(episode_reward)
epoch_episode_steps.append(episode_step)
episode_reward = 0.
episode_step = 0
epoch_episodes += 1
episodes += 1
agent.reset()
obs = env.reset()
# Train.
epoch_actor_losses = []
epoch_critic_losses = []
epoch_adaptive_distances = []
for t_train in range(nb_train_steps):
# Adapt param noise, if necessary.
if memory.nb_entries >= batch_size and t_train % param_noise_adaption_interval == 0:
distance = agent.adapt_param_noise()
epoch_adaptive_distances.append(distance)
cl, al = agent.train()
epoch_critic_losses.append(cl)
epoch_actor_losses.append(al)
agent.update_target_net()
# Evaluate.
eval_episode_rewards = []
eval_qs = []
if eval_env is not None:
eval_episode_reward = 0.
for t_rollout in range(nb_eval_steps):
eval_action, eval_q = agent.pi(eval_obs, apply_noise=False, compute_Q=True)
eval_obs, eval_r, eval_done, eval_info = eval_env.step(max_action * eval_action) # scale for execution in env (as far as DDPG is concerned, every action is in [-1, 1])
if render_eval:
eval_env.render()
eval_episode_reward += eval_r
eval_qs.append(eval_q)
if eval_done:
eval_obs = eval_env.reset()
eval_episode_rewards.append(eval_episode_reward)
eval_episode_rewards_history.append(eval_episode_reward)
eval_episode_reward = 0.
mpi_size = MPI.COMM_WORLD.Get_size()
# Log stats.
# XXX shouldn't call np.mean on variable length lists
duration = time.time() - start_time
stats = agent.get_stats()
combined_stats = stats.copy()
combined_stats['rollout/return'] = np.mean(epoch_episode_rewards)
combined_stats['rollout/return_history'] = np.mean(episode_rewards_history)
combined_stats['rollout/episode_steps'] = np.mean(epoch_episode_steps)
combined_stats['rollout/actions_mean'] = np.mean(epoch_actions)
combined_stats['rollout/Q_mean'] = np.mean(epoch_qs)
combined_stats['train/loss_actor'] = np.mean(epoch_actor_losses)
combined_stats['train/loss_critic'] = np.mean(epoch_critic_losses)
combined_stats['train/param_noise_distance'] = np.mean(epoch_adaptive_distances)
combined_stats['total/duration'] = duration
combined_stats['total/steps_per_second'] = float(t) / float(duration)
combined_stats['total/episodes'] = episodes
combined_stats['rollout/episodes'] = epoch_episodes
combined_stats['rollout/actions_std'] = np.std(epoch_actions)
# Evaluation statistics.
if eval_env is not None:
combined_stats['eval/return'] = eval_episode_rewards
combined_stats['eval/return_history'] = np.mean(eval_episode_rewards_history)
combined_stats['eval/Q'] = eval_qs
combined_stats['eval/episodes'] = len(eval_episode_rewards)
def as_scalar(x):
if isinstance(x, np.ndarray):
assert x.size == 1
return x[0]
elif np.isscalar(x):
return x
else:
raise ValueError('expected scalar, got %s'%x)
combined_stats_sums = MPI.COMM_WORLD.allreduce(np.array([as_scalar(x) for x in combined_stats.values()]))
combined_stats = {k : v / mpi_size for (k,v) in zip(combined_stats.keys(), combined_stats_sums)}
# Total statistics.
combined_stats['total/epochs'] = epoch + 1
combined_stats['total/steps'] = t
for key in sorted(combined_stats.keys()):
logger.record_tabular(key, combined_stats[key])
logger.dump_tabular()
logger.info('')
logdir = logger.get_dir()
if rank == 0 and logdir:
if hasattr(env, 'get_state'):
with open(os.path.join(logdir, 'env_state.pkl'), 'wb') as f:
pickle.dump(env.get_state(), f)
if eval_env and hasattr(eval_env, 'get_state'):
with open(os.path.join(logdir, 'eval_env_state.pkl'), 'wb') as f:
pickle.dump(eval_env.get_state(), f)

View File

@@ -9,44 +9,29 @@ Here's a list of commands to run to quickly get a working example:
```bash
# Train model and save the results to cartpole_model.pkl
python -m baselines.deepq.experiments.train_cartpole
python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5
# Load the model saved in cartpole_model.pkl and visualize the learned policy
python -m baselines.deepq.experiments.enjoy_cartpole
python -m baselines.run --alg=deepq --env=CartPole-v0 --load_path=./cartpole_model.pkl --num_timesteps=0 --play
```
Be sure to check out the source code of [both](experiments/train_cartpole.py) [files](experiments/enjoy_cartpole.py)!
## If you wish to apply DQN to solve a problem.
Check out our simple agent trained with one stop shop `deepq.learn` function.
- [baselines/deepq/experiments/train_cartpole.py](experiments/train_cartpole.py) - train a Cartpole agent.
- [baselines/deepq/experiments/train_pong.py](experiments/train_pong.py) - train a Pong agent using convolutional neural networks.
In particular, notice that once `deepq.learn` finishes training, it returns an `act` function that can be used to select actions in the environment. Once trained, you can easily save it and load it at a later time. For both of the files listed above there are complementary files, `enjoy_cartpole.py` and `enjoy_pong.py` respectively, that load and visualize the learned policy.
In particular, notice that once `deepq.learn` finishes training, it returns an `act` function that can be used to select actions in the environment. Once trained, you can easily save it and load it at a later time. The complementary file `enjoy_cartpole.py` loads and visualizes the learned policy.
## If you wish to experiment with the algorithm
##### Check out the examples
- [baselines/deepq/experiments/custom_cartpole.py](experiments/custom_cartpole.py) - Cartpole training with more fine grained control over the internals of DQN algorithm.
- [baselines/deepq/experiments/atari/train.py](experiments/atari/train.py) - more robust setup for training at scale.
##### Download a pretrained Atari agent
For some research projects it is sometimes useful to have an already trained agent handy. There's a variety of models to choose from. You can list them all by running:
- [baselines/deepq/defaults.py](defaults.py) - settings for training on atari. Run
```bash
python -m baselines.deepq.experiments.atari.download_model
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4
```
to train on Atari Pong (see more in repo-wide [README.md](../../README.md#training-models))
Once you pick a model, you can download it and visualize the learned policy. Be sure to pass `--dueling` flag to visualization script when using dueling models.
```bash
python -m baselines.deepq.experiments.atari.download_model --blob model-atari-duel-pong-1 --model-dir /tmp/models
python -m baselines.deepq.experiments.atari.enjoy --model-dir /tmp/models/model-atari-duel-pong-1 --env Pong --dueling
```

View File

@@ -1,8 +1,8 @@
from baselines.deepq import models # noqa
from baselines.deepq.build_graph import build_act, build_train # noqa
from baselines.deepq.simple import learn, load # noqa
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer # noqa
from baselines.deepq import models # noqa F401
from baselines.deepq.deepq_learner import DEEPQ # noqa F401
from baselines.deepq.deepq import learn # noqa F401
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer # noqa F401
def wrap_atari_dqn(env):
from baselines.common.atari_wrappers import wrap_deepmind
return wrap_deepmind(env, frame_stack=True, scale=True)
return wrap_deepmind(env, frame_stack=True, scale=False)

View File

@@ -1,449 +0,0 @@
"""Deep Q learning graph
The functions in this file are used to create the following functions:
======= act ========
Function to choose an action given an observation
Parameters
----------
observation: object
Observation that can be fed into the output of make_obs_ph
stochastic: bool
if set to False all the actions are always deterministic (default False)
update_eps_ph: float
update epsilon to a new value; if negative, no update happens
(default: no update)
Returns
-------
Tensor of dtype tf.int64 and shape (BATCH_SIZE,) with an action to be performed for
every element of the batch.
======= act (in case of parameter noise) ========
Function to choose an action given an observation
Parameters
----------
observation: object
Observation that can be fed into the output of make_obs_ph
stochastic: bool
if set to False all the actions are always deterministic (default False)
update_eps_ph: float
update epsilon to a new value; if negative, no update happens
(default: no update)
reset_ph: bool
reset the perturbed policy by sampling a new perturbation
update_param_noise_threshold_ph: float
the desired threshold for the difference between non-perturbed and perturbed policy
update_param_noise_scale_ph: bool
whether or not to update the scale of the noise for the next time it is re-perturbed
Returns
-------
Tensor of dtype tf.int64 and shape (BATCH_SIZE,) with an action to be performed for
every element of the batch.
======= train =======
Function that takes a transition (s,a,r,s') and optimizes Bellman equation's error:
td_error = Q(s,a) - (r + gamma * max_a' Q(s', a'))
loss = huber_loss[td_error]
Parameters
----------
obs_t: object
a batch of observations
action: np.array
actions that were selected upon seeing obs_t.
dtype must be int32 and shape must be (batch_size,)
reward: np.array
immediate reward attained after executing those actions
dtype must be float32 and shape must be (batch_size,)
obs_tp1: object
observations that followed obs_t
done: np.array
1 if obs_t was the last observation in the episode and 0 otherwise
obs_tp1 gets ignored, but must be of the valid shape.
dtype must be float32 and shape must be (batch_size,)
weight: np.array
importance weights for every element of the batch (gradient is multiplied
by the importance weight) dtype must be float32 and shape must be (batch_size,)
Returns
-------
td_error: np.array
a list of differences between Q(s,a) and the target in Bellman's equation.
dtype is float32 and shape is (batch_size,)
======= update_target ========
copy the parameters from optimized Q function to the target Q function.
In Q learning we actually optimize the following error:
Q(s,a) - (r + gamma * max_a' Q'(s', a'))
Where Q' is lagging behind Q to stabilize the learning. For example, for Atari
Q' is set to Q once every 10000 training steps.
"""
import tensorflow as tf
import baselines.common.tf_util as U
def scope_vars(scope, trainable_only=False):
"""
Get variables inside a scope
The scope can be specified as a string
Parameters
----------
scope: str or VariableScope
scope in which the variables reside.
trainable_only: bool
whether or not to return only the variables that were marked as trainable.
Returns
-------
vars: [tf.Variable]
list of variables in `scope`.
"""
return tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.GLOBAL_VARIABLES,
scope=scope if isinstance(scope, str) else scope.name
)
def scope_name():
"""Returns the name of current scope as a string, e.g. deepq/q_func"""
return tf.get_variable_scope().name
def absolute_scope_name(relative_scope_name):
"""Appends parent scope name to `relative_scope_name`"""
return scope_name() + "/" + relative_scope_name
def default_param_noise_filter(var):
if var not in tf.trainable_variables():
# We never perturb non-trainable vars.
return False
if "fully_connected" in var.name:
# We perturb fully-connected layers.
return True
# The remaining layers are likely conv or layer norm layers, which we do not wish to
# perturb (in the former case because they only extract features, in the latter case because
# we use them for normalization purposes). If you change your network, you will likely want
# to re-consider which layers to perturb and which to keep untouched.
return False
def build_act(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None):
"""Creates the act function:
Parameters
----------
make_obs_ph: str -> tf.placeholder or TfInput
a function that takes a name and creates a placeholder of input with that name
q_func: (tf.Variable, int, str, bool) -> tf.Variable
the model that takes the following inputs:
observation_in: object
the output of observation placeholder
num_actions: int
number of actions
scope: str
reuse: bool
should be passed to outer variable scope
and returns a tensor of shape (batch_size, num_actions) with values of every action.
num_actions: int
number of actions.
scope: str or VariableScope
optional scope for variable_scope.
reuse: bool or None
whether or not the variables should be reused. To be able to reuse the scope must be given.
Returns
-------
act: (tf.Variable, bool, float) -> tf.Variable
function to select an action given an observation.
See the top of the file for details.
"""
with tf.variable_scope(scope, reuse=reuse):
observations_ph = make_obs_ph("observation")
stochastic_ph = tf.placeholder(tf.bool, (), name="stochastic")
update_eps_ph = tf.placeholder(tf.float32, (), name="update_eps")
eps = tf.get_variable("eps", (), initializer=tf.constant_initializer(0))
q_values = q_func(observations_ph.get(), num_actions, scope="q_func")
deterministic_actions = tf.argmax(q_values, axis=1)
batch_size = tf.shape(observations_ph.get())[0]
random_actions = tf.random_uniform(tf.stack([batch_size]), minval=0, maxval=num_actions, dtype=tf.int64)
chose_random = tf.random_uniform(tf.stack([batch_size]), minval=0, maxval=1, dtype=tf.float32) < eps
stochastic_actions = tf.where(chose_random, random_actions, deterministic_actions)
output_actions = tf.cond(stochastic_ph, lambda: stochastic_actions, lambda: deterministic_actions)
update_eps_expr = eps.assign(tf.cond(update_eps_ph >= 0, lambda: update_eps_ph, lambda: eps))
_act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph],
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True},
updates=[update_eps_expr])
def act(ob, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps)
return act
def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq", reuse=None, param_noise_filter_func=None):
"""Creates the act function with support for parameter space noise exploration (https://arxiv.org/abs/1706.01905):
Parameters
----------
make_obs_ph: str -> tf.placeholder or TfInput
a function that takes a name and creates a placeholder of input with that name
q_func: (tf.Variable, int, str, bool) -> tf.Variable
the model that takes the following inputs:
observation_in: object
the output of observation placeholder
num_actions: int
number of actions
scope: str
reuse: bool
should be passed to outer variable scope
and returns a tensor of shape (batch_size, num_actions) with values of every action.
num_actions: int
number of actions.
scope: str or VariableScope
optional scope for variable_scope.
reuse: bool or None
whether or not the variables should be reused. To be able to reuse the scope must be given.
param_noise_filter_func: tf.Variable -> bool
function that decides whether or not a variable should be perturbed. Only applicable
if param_noise is True. If set to None, default_param_noise_filter is used by default.
Returns
-------
act: (tf.Variable, bool, float, bool, float, bool) -> tf.Variable
function to select an action given an observation.
See the top of the file for details.
"""
if param_noise_filter_func is None:
param_noise_filter_func = default_param_noise_filter
with tf.variable_scope(scope, reuse=reuse):
observations_ph = make_obs_ph("observation")
stochastic_ph = tf.placeholder(tf.bool, (), name="stochastic")
update_eps_ph = tf.placeholder(tf.float32, (), name="update_eps")
update_param_noise_threshold_ph = tf.placeholder(tf.float32, (), name="update_param_noise_threshold")
update_param_noise_scale_ph = tf.placeholder(tf.bool, (), name="update_param_noise_scale")
reset_ph = tf.placeholder(tf.bool, (), name="reset")
eps = tf.get_variable("eps", (), initializer=tf.constant_initializer(0))
param_noise_scale = tf.get_variable("param_noise_scale", (), initializer=tf.constant_initializer(0.01), trainable=False)
param_noise_threshold = tf.get_variable("param_noise_threshold", (), initializer=tf.constant_initializer(0.05), trainable=False)
# Unmodified Q.
q_values = q_func(observations_ph.get(), num_actions, scope="q_func")
# Perturbable Q used for the actual rollout.
q_values_perturbed = q_func(observations_ph.get(), num_actions, scope="perturbed_q_func")
# We have to wrap this code into a function due to the way tf.cond() works. See
# https://stackoverflow.com/questions/37063952/confused-by-the-behavior-of-tf-cond for
# a more detailed discussion.
def perturb_vars(original_scope, perturbed_scope):
all_vars = scope_vars(absolute_scope_name(original_scope))
all_perturbed_vars = scope_vars(absolute_scope_name(perturbed_scope))
assert len(all_vars) == len(all_perturbed_vars)
perturb_ops = []
for var, perturbed_var in zip(all_vars, all_perturbed_vars):
if param_noise_filter_func(perturbed_var):
# Perturb this variable.
op = tf.assign(perturbed_var, var + tf.random_normal(shape=tf.shape(var), mean=0., stddev=param_noise_scale))
else:
# Do not perturb, just assign.
op = tf.assign(perturbed_var, var)
perturb_ops.append(op)
assert len(perturb_ops) == len(all_vars)
return tf.group(*perturb_ops)
# Set up functionality to re-compute `param_noise_scale`. This perturbs yet another copy
# of the network and measures the effect of that perturbation in action space. If the perturbation
# is too big, reduce scale of perturbation, otherwise increase.
q_values_adaptive = q_func(observations_ph.get(), num_actions, scope="adaptive_q_func")
perturb_for_adaption = perturb_vars(original_scope="q_func", perturbed_scope="adaptive_q_func")
kl = tf.reduce_sum(tf.nn.softmax(q_values) * (tf.log(tf.nn.softmax(q_values)) - tf.log(tf.nn.softmax(q_values_adaptive))), axis=-1)
mean_kl = tf.reduce_mean(kl)
def update_scale():
with tf.control_dependencies([perturb_for_adaption]):
update_scale_expr = tf.cond(mean_kl < param_noise_threshold,
lambda: param_noise_scale.assign(param_noise_scale * 1.01),
lambda: param_noise_scale.assign(param_noise_scale / 1.01),
)
return update_scale_expr
# Functionality to update the threshold for parameter space noise.
update_param_noise_threshold_expr = param_noise_threshold.assign(tf.cond(update_param_noise_threshold_ph >= 0,
lambda: update_param_noise_threshold_ph, lambda: param_noise_threshold))
# Put everything together.
deterministic_actions = tf.argmax(q_values_perturbed, axis=1)
batch_size = tf.shape(observations_ph.get())[0]
random_actions = tf.random_uniform(tf.stack([batch_size]), minval=0, maxval=num_actions, dtype=tf.int64)
chose_random = tf.random_uniform(tf.stack([batch_size]), minval=0, maxval=1, dtype=tf.float32) < eps
stochastic_actions = tf.where(chose_random, random_actions, deterministic_actions)
output_actions = tf.cond(stochastic_ph, lambda: stochastic_actions, lambda: deterministic_actions)
update_eps_expr = eps.assign(tf.cond(update_eps_ph >= 0, lambda: update_eps_ph, lambda: eps))
updates = [
update_eps_expr,
tf.cond(reset_ph, lambda: perturb_vars(original_scope="q_func", perturbed_scope="perturbed_q_func"), lambda: tf.group(*[])),
tf.cond(update_param_noise_scale_ph, lambda: update_scale(), lambda: tf.Variable(0., trainable=False)),
update_param_noise_threshold_expr,
]
_act = U.function(inputs=[observations_ph, stochastic_ph, update_eps_ph, reset_ph, update_param_noise_threshold_ph, update_param_noise_scale_ph],
outputs=output_actions,
givens={update_eps_ph: -1.0, stochastic_ph: True, reset_ph: False, update_param_noise_threshold_ph: False, update_param_noise_scale_ph: False},
updates=updates)
def act(ob, reset, update_param_noise_threshold, update_param_noise_scale, stochastic=True, update_eps=-1):
return _act(ob, stochastic, update_eps, reset, update_param_noise_threshold, update_param_noise_scale)
return act
def build_train(make_obs_ph, q_func, num_actions, optimizer, grad_norm_clipping=None, gamma=1.0,
double_q=True, scope="deepq", reuse=None, param_noise=False, param_noise_filter_func=None):
"""Creates the train function:
Parameters
----------
make_obs_ph: str -> tf.placeholder or TfInput
a function that takes a name and creates a placeholder of input with that name
q_func: (tf.Variable, int, str, bool) -> tf.Variable
the model that takes the following inputs:
observation_in: object
the output of observation placeholder
num_actions: int
number of actions
scope: str
reuse: bool
should be passed to outer variable scope
and returns a tensor of shape (batch_size, num_actions) with values of every action.
num_actions: int
number of actions
reuse: bool
whether or not to reuse the graph variables
optimizer: tf.train.Optimizer
optimizer to use for the Q-learning objective.
grad_norm_clipping: float or None
clip gradient norms to this value. If None no clipping is performed.
gamma: float
discount rate.
double_q: bool
if true will use Double Q Learning (https://arxiv.org/abs/1509.06461).
In general it is a good idea to keep it enabled.
scope: str or VariableScope
optional scope for variable_scope.
reuse: bool or None
whether or not the variables should be reused. To be able to reuse the scope must be given.
param_noise: bool
whether or not to use parameter space noise (https://arxiv.org/abs/1706.01905)
param_noise_filter_func: tf.Variable -> bool
function that decides whether or not a variable should be perturbed. Only applicable
if param_noise is True. If set to None, default_param_noise_filter is used by default.
Returns
-------
act: (tf.Variable, bool, float) -> tf.Variable
function to select an action given an observation.
See the top of the file for details.
train: (object, np.array, np.array, object, np.array, np.array) -> np.array
optimize the error in Bellman's equation.
See the top of the file for details.
update_target: () -> ()
copy the parameters from optimized Q function to the target Q function.
See the top of the file for details.
debug: {str: function}
a bunch of functions to print debug data like q_values.
"""
if param_noise:
act_f = build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope=scope, reuse=reuse,
param_noise_filter_func=param_noise_filter_func)
else:
act_f = build_act(make_obs_ph, q_func, num_actions, scope=scope, reuse=reuse)
with tf.variable_scope(scope, reuse=reuse):
# set up placeholders
obs_t_input = make_obs_ph("obs_t")
act_t_ph = tf.placeholder(tf.int32, [None], name="action")
rew_t_ph = tf.placeholder(tf.float32, [None], name="reward")
obs_tp1_input = make_obs_ph("obs_tp1")
done_mask_ph = tf.placeholder(tf.float32, [None], name="done")
importance_weights_ph = tf.placeholder(tf.float32, [None], name="weight")
# q network evaluation
q_t = q_func(obs_t_input.get(), num_actions, scope="q_func", reuse=True) # reuse parameters from act
q_func_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=tf.get_variable_scope().name + "/q_func")
# target q network evalution
q_tp1 = q_func(obs_tp1_input.get(), num_actions, scope="target_q_func")
target_q_func_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=tf.get_variable_scope().name + "/target_q_func")
# q scores for actions which we know were selected in the given state.
q_t_selected = tf.reduce_sum(q_t * tf.one_hot(act_t_ph, num_actions), 1)
# compute estimate of best possible value starting from state at t + 1
if double_q:
q_tp1_using_online_net = q_func(obs_tp1_input.get(), num_actions, scope="q_func", reuse=True)
q_tp1_best_using_online_net = tf.argmax(q_tp1_using_online_net, 1)
q_tp1_best = tf.reduce_sum(q_tp1 * tf.one_hot(q_tp1_best_using_online_net, num_actions), 1)
else:
q_tp1_best = tf.reduce_max(q_tp1, 1)
q_tp1_best_masked = (1.0 - done_mask_ph) * q_tp1_best
# compute RHS of bellman equation
q_t_selected_target = rew_t_ph + gamma * q_tp1_best_masked
# compute the error (potentially clipped)
td_error = q_t_selected - tf.stop_gradient(q_t_selected_target)
errors = U.huber_loss(td_error)
weighted_error = tf.reduce_mean(importance_weights_ph * errors)
# compute optimization op (potentially with gradient clipping)
if grad_norm_clipping is not None:
gradients = optimizer.compute_gradients(weighted_error, var_list=q_func_vars)
for i, (grad, var) in enumerate(gradients):
if grad is not None:
gradients[i] = (tf.clip_by_norm(grad, grad_norm_clipping), var)
optimize_expr = optimizer.apply_gradients(gradients)
else:
optimize_expr = optimizer.minimize(weighted_error, var_list=q_func_vars)
# update_target_fn will be called periodically to copy Q network to target Q network
update_target_expr = []
for var, var_target in zip(sorted(q_func_vars, key=lambda v: v.name),
sorted(target_q_func_vars, key=lambda v: v.name)):
update_target_expr.append(var_target.assign(var))
update_target_expr = tf.group(*update_target_expr)
# Create callable functions
train = U.function(
inputs=[
obs_t_input,
act_t_ph,
rew_t_ph,
obs_tp1_input,
done_mask_ph,
importance_weights_ph
],
outputs=td_error,
updates=[optimize_expr]
)
update_target = U.function([], [], updates=[update_target_expr])
q_values = U.function([obs_t_input], q_t)
return act_f, train, update_target, {'q_values': q_values}
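
Reviewer note: the target computation in `build_train` is easier to check with concrete numbers. The following is a minimal pure-Python sketch, not the library's code; `td_targets` and its arguments are hypothetical names chosen for illustration. It assumes the Double Q-learning rule described above: the online network picks the next action and the target network supplies its value, with the `done` mask zeroing out the bootstrap term.

```python
def td_targets(rewards, dones, q_tp1_online, q_tp1_target, gamma=0.99, double_q=True):
    """Return r + gamma * (1 - done) * Q_target(s', a*) for each batch element.

    Hypothetical helper for illustration only; inputs are plain Python lists.
    """
    targets = []
    for r, d, q_online, q_target in zip(rewards, dones, q_tp1_online, q_tp1_target):
        if double_q:
            # Online network selects the action, target network evaluates it.
            best_action = max(range(len(q_online)), key=q_online.__getitem__)
            q_best = q_target[best_action]
        else:
            # Vanilla DQN: the target network both selects and evaluates.
            q_best = max(q_target)
        # A terminal transition (d == 1.0) contributes only the immediate reward.
        targets.append(r + gamma * (1.0 - d) * q_best)
    return targets


rewards = [1.0, 0.0]
dones = [0.0, 1.0]
q_online = [[0.2, 0.9], [0.5, 0.1]]
q_target = [[0.7, 0.3], [0.4, 0.8]]

double = td_targets(rewards, dones, q_online, q_target, gamma=0.5)
vanilla = td_targets(rewards, dones, q_online, q_target, gamma=0.5, double_q=False)
print(double, vanilla)
```

In the first transition the two variants disagree because the online network prefers action 1 while the target network assigns its largest value to action 0; the second transition is terminal, so both targets reduce to the reward.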

baselines/deepq/deepq.py Normal file

@@ -0,0 +1,234 @@
import os.path as osp
import tensorflow as tf
import numpy as np
from baselines import logger
from baselines.common.schedules import LinearSchedule
from baselines.common.vec_env.vec_env import VecEnv
from baselines.common import set_global_seeds
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
from baselines.deepq.models import build_q_func
def learn(env,
network,
seed=None,
lr=5e-4,
total_timesteps=100000,
buffer_size=50000,
exploration_fraction=0.1,
exploration_final_eps=0.02,
train_freq=1,
batch_size=32,
print_freq=100,
checkpoint_freq=10000,
checkpoint_path=None,
learning_starts=1000,
gamma=1.0,
target_network_update_freq=500,
prioritized_replay=False,
prioritized_replay_alpha=0.6,
prioritized_replay_beta0=0.4,
prioritized_replay_beta_iters=None,
prioritized_replay_eps=1e-6,
param_noise=False,
callback=None,
load_path=None,
**network_kwargs
):
"""Train a deepq model.
Parameters
-------
env: gym.Env
environment to train on
network: string or a function
neural network to use as a q function approximator. If string, has to be one of the names of registered models in baselines.common.models
(mlp, cnn, conv_only). If a function, should take an observation tensor and return a latent variable tensor, which
will be mapped to the Q function heads (see build_q_func in baselines.deepq.models for details on that)
seed: int or None
prng seed. The runs with the same seed "should" give the same results. If None, no seeding is used.
lr: float
learning rate for adam optimizer
total_timesteps: int
number of env steps to optimize for
buffer_size: int
size of the replay buffer
exploration_fraction: float
fraction of entire training period over which the exploration rate is annealed
exploration_final_eps: float
final value of random action probability
train_freq: int
update the model every `train_freq` steps.
batch_size: int
size of a batch sampled from replay buffer for training
print_freq: int
how often to print out training progress
set to None to disable printing
checkpoint_freq: int
how often to save the model. This is so that the best version is restored
at the end of the training. If you do not wish to restore the best version at
the end of the training set this variable to None.
learning_starts: int
how many steps of the model to collect transitions for before learning starts
gamma: float
discount factor
target_network_update_freq: int
update the target network every `target_network_update_freq` steps.
prioritized_replay: bool
if True prioritized replay buffer will be used.
prioritized_replay_alpha: float
alpha parameter for prioritized replay buffer
prioritized_replay_beta0: float
initial value of beta for prioritized replay buffer
prioritized_replay_beta_iters: int
number of iterations over which beta will be annealed from initial value
to 1.0. If set to None, defaults to total_timesteps.
prioritized_replay_eps: float
epsilon to add to the TD errors when updating priorities.
param_noise: bool
whether or not to use parameter space noise (https://arxiv.org/abs/1706.01905)
callback: (locals, globals) -> bool or None
function called at every step with the state of the algorithm.
If callback returns True, training stops.
load_path: str
path to load the model from. (default: None)
**network_kwargs
additional keyword arguments to pass to the network builder.
Returns
-------
act: ActWrapper
Wrapper over act function. Adds ability to save it and load it.
See header of baselines/deepq/categorical.py for details on the act function.
"""
# Create all the functions necessary to train the model
set_global_seeds(seed)
q_func = build_q_func(network, **network_kwargs)
# capture the shape outside the closure so that the env object is not serialized
# by cloudpickle when serializing make_obs_ph
observation_space = env.observation_space
model = deepq.DEEPQ(
q_func=q_func,
observation_shape=env.observation_space.shape,
num_actions=env.action_space.n,
lr=lr,
grad_norm_clipping=10,
gamma=gamma,
param_noise=param_noise
)
if load_path is not None:
load_path = osp.expanduser(load_path)
ckpt = tf.train.Checkpoint(model=model)
manager = tf.train.CheckpointManager(ckpt, load_path, max_to_keep=None)
ckpt.restore(manager.latest_checkpoint)
print("Restoring from {}".format(manager.latest_checkpoint))
# Create the replay buffer
if prioritized_replay:
replay_buffer = PrioritizedReplayBuffer(buffer_size, alpha=prioritized_replay_alpha)
if prioritized_replay_beta_iters is None:
prioritized_replay_beta_iters = total_timesteps
beta_schedule = LinearSchedule(prioritized_replay_beta_iters,
initial_p=prioritized_replay_beta0,
final_p=1.0)
else:
replay_buffer = ReplayBuffer(buffer_size)
beta_schedule = None
# Create the schedule for exploration starting from 1.
exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * total_timesteps),
initial_p=1.0,
final_p=exploration_final_eps)
model.update_target()
episode_rewards = [0.0]
saved_mean_reward = None
obs = env.reset()
# always mimic the vectorized env
if not isinstance(env, VecEnv):
obs = np.expand_dims(np.array(obs), axis=0)
reset = True
for t in range(total_timesteps):
if callback is not None:
if callback(locals(), globals()):
break
kwargs = {}
if not param_noise:
update_eps = tf.constant(exploration.value(t))
update_param_noise_threshold = 0.
else:
update_eps = tf.constant(0.)
# Compute the threshold such that the KL divergence between perturbed and non-perturbed
# policy is comparable to eps-greedy exploration with eps = exploration.value(t).
# See Appendix C.1 in Parameter Space Noise for Exploration, Plappert et al., 2017
# for detailed explanation.
update_param_noise_threshold = -np.log(1. - exploration.value(t) + exploration.value(t) / float(env.action_space.n))
kwargs['reset'] = reset
kwargs['update_param_noise_threshold'] = update_param_noise_threshold
kwargs['update_param_noise_scale'] = True
action, _, _, _ = model.step(tf.constant(obs), update_eps=update_eps, **kwargs)
action = action[0].numpy()
reset = False
new_obs, rew, done, _ = env.step(action)
# Store transition in the replay buffer.
if not isinstance(env, VecEnv):
new_obs = np.expand_dims(np.array(new_obs), axis=0)
replay_buffer.add(obs[0], action, rew, new_obs[0], float(done))
else:
replay_buffer.add(obs[0], action, rew[0], new_obs[0], float(done[0]))
# # Store transition in the replay buffer.
# replay_buffer.add(obs, action, rew, new_obs, float(done))
obs = new_obs
episode_rewards[-1] += rew
if done:
obs = env.reset()
if not isinstance(env, VecEnv):
obs = np.expand_dims(np.array(obs), axis=0)
episode_rewards.append(0.0)
reset = True
if t > learning_starts and t % train_freq == 0:
# Minimize the error in Bellman's equation on a batch sampled from replay buffer.
if prioritized_replay:
experience = replay_buffer.sample(batch_size, beta=beta_schedule.value(t))
(obses_t, actions, rewards, obses_tp1, dones, weights, batch_idxes) = experience
else:
obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(batch_size)
weights, batch_idxes = np.ones_like(rewards), None
obses_t, obses_tp1 = tf.constant(obses_t), tf.constant(obses_tp1)
actions, rewards, dones = tf.constant(actions), tf.constant(rewards), tf.constant(dones)
weights = tf.constant(weights)
td_errors = model.train(obses_t, actions, rewards, obses_tp1, dones, weights)
if prioritized_replay:
new_priorities = np.abs(td_errors) + prioritized_replay_eps
replay_buffer.update_priorities(batch_idxes, new_priorities)
if t > learning_starts and t % target_network_update_freq == 0:
# Update target network periodically.
model.update_target()
mean_100ep_reward = round(np.mean(episode_rewards[-101:-1]), 1)
num_episodes = len(episode_rewards)
if done and print_freq is not None and len(episode_rewards) % print_freq == 0:
logger.record_tabular("steps", t)
logger.record_tabular("episodes", num_episodes)
logger.record_tabular("mean 100 episode reward", mean_100ep_reward)
logger.record_tabular("% time spent exploring", int(100 * exploration.value(t)))
logger.dump_tabular()
return model
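
Reviewer note: the exploration schedule and the parameter-noise threshold used in the training loop can be checked by hand. The sketch below is a stand-in written for this note, assuming `LinearSchedule` interpolates linearly from `initial_p` to `final_p` and then holds the final value; `linear_schedule` and `param_noise_threshold` are hypothetical function names, and only the threshold formula is taken directly from the loop above.

```python
import math


def linear_schedule(t, schedule_timesteps, initial_p=1.0, final_p=0.02):
    """Linearly anneal from initial_p to final_p, then stay at final_p.

    Assumed behaviour of the schedule used by learn(); not the library class.
    """
    fraction = min(float(t) / schedule_timesteps, 1.0)
    return initial_p + fraction * (final_p - initial_p)


def param_noise_threshold(eps, num_actions):
    """KL threshold matched to eps-greedy exploration, as computed in the loop above."""
    return -math.log(1.0 - eps + eps / float(num_actions))


# With the defaults total_timesteps=100000 and exploration_fraction=0.1,
# epsilon is annealed over the first 10000 steps.
for t in (0, 5000, 10000, 50000):
    eps = linear_schedule(t, schedule_timesteps=10000)
    print(t, round(eps, 3), round(param_noise_threshold(eps, num_actions=2), 4))
```

The threshold shrinks together with epsilon: when every action is random the threshold equals log(num_actions), and it approaches zero as epsilon approaches zero.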


@@ -0,0 +1,191 @@
"""Deep Q model
The functions in this model:
======= step ========
Function to choose an action given an observation
Parameters
----------
observation: tensor
Observation that can be fed into the output of make_obs_ph
stochastic: bool
if set to False, all actions are deterministic (default: True)
update_eps: float
update epsilon to a new value; if negative, no update happens
(default: no update)
Returns
-------
Tensor of dtype tf.int64 and shape (BATCH_SIZE,) with an action to be performed for
every element of the batch.
(NOT IMPLEMENTED YET)
======= step (in case of parameter noise) ========
Function to choose an action given an observation
Parameters
----------
observation: object
Observation that can be fed into the output of make_obs_ph
stochastic: bool
if set to False, all actions are deterministic (default: True)
update_eps: float
update epsilon to a new value, if negative no update happens
(default: no update)
reset: bool
reset the perturbed policy by sampling a new perturbation
update_param_noise_threshold: float
the desired threshold for the difference between non-perturbed and perturbed policy
update_param_noise_scale: bool
whether or not to update the scale of the noise for the next time it is re-perturbed
Returns
-------
Tensor of dtype tf.int64 and shape (BATCH_SIZE,) with an action to be performed for
every element of the batch.
======= train =======
Function that takes a transition (s,a,r,s',d) and optimizes Bellman equation's error:
td_error = Q(s,a) - (r + gamma * (1-d) * max_a' Q(s', a'))
loss = huber_loss[td_error]
Parameters
----------
obs_t: object
a batch of observations
action: np.array
actions that were selected upon seeing obs_t.
dtype must be int32 and shape must be (batch_size,)
reward: np.array
immediate reward attained after executing those actions
dtype must be float32 and shape must be (batch_size,)
obs_tp1: object
observations that followed obs_t
done: np.array
1 if obs_t was the last observation in the episode and 0 otherwise
obs_tp1 gets ignored, but must be of the valid shape.
dtype must be float32 and shape must be (batch_size,)
weight: np.array
importance weights for every element of the batch (gradient is multiplied
by the importance weight). dtype must be float32 and shape must be (batch_size,)
Returns
-------
td_error: np.array
a list of differences between Q(s,a) and the target in Bellman's equation.
dtype is float32 and shape is (batch_size,)
======= update_target ========
copy the parameters from optimized Q function to the target Q function.
In Q learning we actually optimize the following error:
Q(s,a) - (r + gamma * max_a' Q'(s', a'))
Where Q' is lagging behind Q to stabilize the learning. For example, for Atari
Q' is set to Q once every 10000 training steps.
"""
import tensorflow as tf
@tf.function
def huber_loss(x, delta=1.0):
"""Reference: https://en.wikipedia.org/wiki/Huber_loss"""
return tf.where(
tf.abs(x) < delta,
tf.square(x) * 0.5,
delta * (tf.abs(x) - 0.5 * delta)
)
class DEEPQ(tf.Module):
def __init__(self, q_func, observation_shape, num_actions, lr, grad_norm_clipping=None, gamma=1.0,
double_q=True, param_noise=False, param_noise_filter_func=None):
self.num_actions = num_actions
self.gamma = gamma
self.double_q = double_q
self.param_noise = param_noise
self.param_noise_filter_func = param_noise_filter_func
self.grad_norm_clipping = grad_norm_clipping
self.optimizer = tf.keras.optimizers.Adam(lr)
with tf.name_scope('q_network'):
self.q_network = q_func(observation_shape, num_actions)
with tf.name_scope('target_q_network'):
self.target_q_network = q_func(observation_shape, num_actions)
self.eps = tf.Variable(0., name="eps")
@tf.function
def step(self, obs, stochastic=True, update_eps=-1):
if self.param_noise:
raise ValueError('not supporting noise yet')
else:
q_values = self.q_network(obs)
deterministic_actions = tf.argmax(q_values, axis=1)
batch_size = tf.shape(obs)[0]
random_actions = tf.random.uniform(tf.stack([batch_size]), minval=0, maxval=self.num_actions, dtype=tf.int64)
chose_random = tf.random.uniform(tf.stack([batch_size]), minval=0, maxval=1, dtype=tf.float32) < self.eps
stochastic_actions = tf.where(chose_random, random_actions, deterministic_actions)
if stochastic:
output_actions = stochastic_actions
else:
output_actions = deterministic_actions
if update_eps >= 0:
self.eps.assign(update_eps)
return output_actions, None, None, None
@tf.function()
def train(self, obs0, actions, rewards, obs1, dones, importance_weights):
with tf.GradientTape() as tape:
q_t = self.q_network(obs0)
q_t_selected = tf.reduce_sum(q_t * tf.one_hot(actions, self.num_actions, dtype=tf.float32), 1)
q_tp1 = self.target_q_network(obs1)
if self.double_q:
q_tp1_using_online_net = self.q_network(obs1)
q_tp1_best_using_online_net = tf.argmax(q_tp1_using_online_net, 1)
q_tp1_best = tf.reduce_sum(q_tp1 * tf.one_hot(q_tp1_best_using_online_net, self.num_actions, dtype=tf.float32), 1)
else:
q_tp1_best = tf.reduce_max(q_tp1, 1)
dones = tf.cast(dones, q_tp1_best.dtype)
q_tp1_best_masked = (1.0 - dones) * q_tp1_best
q_t_selected_target = rewards + self.gamma * q_tp1_best_masked
td_error = q_t_selected - tf.stop_gradient(q_t_selected_target)
errors = huber_loss(td_error)
weighted_error = tf.reduce_mean(importance_weights * errors)
grads = tape.gradient(weighted_error, self.q_network.trainable_variables)
if self.grad_norm_clipping:
clipped_grads = []
for grad in grads:
clipped_grads.append(tf.clip_by_norm(grad, self.grad_norm_clipping))
grads = clipped_grads
grads_and_vars = zip(grads, self.q_network.trainable_variables)
self.optimizer.apply_gradients(grads_and_vars)
return td_error
@tf.function(autograph=False)
def update_target(self):
q_vars = self.q_network.trainable_variables
target_q_vars = self.target_q_network.trainable_variables
for var, var_target in zip(q_vars, target_q_vars):
var_target.assign(var)
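
Reviewer note: two numeric pieces of `train` are worth sanity-checking in isolation, the Huber loss and per-tensor norm clipping. The sketch below is a scalar, pure-Python stand-in (the function names mirror the TensorFlow ones, but these are illustrative reimplementations, not the TensorFlow API). It assumes clipping rescales a gradient only when its L2 norm exceeds the limit, and that the clipped values are the ones that should reach the optimizer.

```python
import math


def huber_loss(x, delta=1.0):
    """Quadratic near zero, linear in the tails (scalar version for illustration)."""
    if abs(x) < delta:
        return 0.5 * x * x
    return delta * (abs(x) - 0.5 * delta)


def clip_by_norm(grad, clip_norm):
    """Rescale one gradient (a flat list) so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]


# Each gradient is clipped on its own, as in the loop over `grads` above.
grads = [[3.0, 4.0], [0.3, 0.4]]
clipped = [clip_by_norm(g, 1.0) for g in grads]

print(huber_loss(0.5), huber_loss(3.0))
print(clipped)
```

The first gradient has norm 5 and is scaled down to unit norm; the second already satisfies the limit and passes through unchanged.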


@@ -0,0 +1,21 @@
def atari():
return dict(
network='conv_only',
lr=1e-4,
buffer_size=10000,
exploration_fraction=0.1,
exploration_final_eps=0.01,
train_freq=4,
learning_starts=10000,
target_network_update_freq=1000,
gamma=0.99,
prioritized_replay=True,
prioritized_replay_alpha=0.6,
checkpoint_freq=10000,
checkpoint_path=None,
dueling=True
)
def retro():
return atari()


@@ -1,79 +0,0 @@
import gym
import itertools
import numpy as np
import tensorflow as tf
import tensorflow.contrib.layers as layers
import baselines.common.tf_util as U
from baselines import logger
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer
from baselines.deepq.utils import ObservationInput
from baselines.common.schedules import LinearSchedule
def model(inpt, num_actions, scope, reuse=False):
    """This model takes as input an observation and returns values of all actions."""
    with tf.variable_scope(scope, reuse=reuse):
        out = inpt
        out = layers.fully_connected(out, num_outputs=64, activation_fn=tf.nn.tanh)
        out = layers.fully_connected(out, num_outputs=num_actions, activation_fn=None)
        return out


if __name__ == '__main__':
    with U.make_session(8):
        # Create the environment
        env = gym.make("CartPole-v0")
        # Create all the functions necessary to train the model
        act, train, update_target, debug = deepq.build_train(
            make_obs_ph=lambda name: ObservationInput(env.observation_space, name=name),
            q_func=model,
            num_actions=env.action_space.n,
            optimizer=tf.train.AdamOptimizer(learning_rate=5e-4),
        )
        # Create the replay buffer
        replay_buffer = ReplayBuffer(50000)
        # Create the schedule for exploration starting from 1 (every action is random) down to
        # 0.02 (98% of actions are selected according to values predicted by the model).
        exploration = LinearSchedule(schedule_timesteps=10000, initial_p=1.0, final_p=0.02)

        # Initialize the parameters and copy them to the target network.
        U.initialize()
        update_target()

        episode_rewards = [0.0]
        obs = env.reset()
        for t in itertools.count():
            # Take action and update exploration to the newest value
            action = act(obs[None], update_eps=exploration.value(t))[0]
            new_obs, rew, done, _ = env.step(action)
            # Store transition in the replay buffer.
            replay_buffer.add(obs, action, rew, new_obs, float(done))
            obs = new_obs

            episode_rewards[-1] += rew
            if done:
                obs = env.reset()
                episode_rewards.append(0)

            is_solved = t > 100 and np.mean(episode_rewards[-101:-1]) >= 200
            if is_solved:
                # Show off the result
                env.render()
            else:
                # Minimize the error in Bellman's equation on a batch sampled from replay buffer.
                if t > 1000:
                    obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(32)
                    train(obses_t, actions, rewards, obses_tp1, dones, np.ones_like(rewards))
                # Update target network periodically.
                if t % 1000 == 0:
                    update_target()

            if done and len(episode_rewards) % 10 == 0:
                logger.record_tabular("steps", t)
                logger.record_tabular("episodes", len(episode_rewards))
                logger.record_tabular("mean episode reward", round(np.mean(episode_rewards[-101:-1]), 1))
                logger.record_tabular("% time spent exploring", int(100 * exploration.value(t)))
                logger.dump_tabular()
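
The exploration schedule in the script above anneals epsilon linearly from 1.0 to 0.02 over the first 10000 steps. A minimal standalone sketch of that interpolation follows. `linear_schedule_value` is a hypothetical helper written for illustration, not the `LinearSchedule` class from `baselines.common.schedules`, and it assumes the value is held at `final_p` once the schedule has run out.

```python
def linear_schedule_value(t, schedule_timesteps=10000, initial_p=1.0, final_p=0.02):
    """Linearly interpolate from initial_p to final_p, then hold final_p.

    Hypothetical helper for illustration; parameter names mirror the
    keyword arguments passed to LinearSchedule in the script above.
    """
    # Fraction of the schedule completed, clipped at 1.0 once t passes the end.
    fraction = min(float(t) / schedule_timesteps, 1.0)
    return initial_p + fraction * (final_p - initial_p)


if __name__ == "__main__":
    for step in (0, 2500, 5000, 10000, 50000):
        print(step, round(linear_schedule_value(step), 3))
```

Under these assumptions, half of the annealing (epsilon near 0.51) is reached at step 5000, and every step past 10000 keeps 2% random actions.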


@@ -1,21 +0,0 @@
import gym

from baselines import deepq


def main():
    env = gym.make("CartPole-v0")
    act = deepq.load("cartpole_model.pkl")

    while True:
        obs, done = env.reset(), False
        episode_rew = 0
        while not done:
            env.render()
            obs, rew, done, _ = env.step(act(obs[None])[0])
            episode_rew += rew
        print("Episode reward", episode_rew)


if __name__ == '__main__':
    main()


@@ -1,21 +0,0 @@
import gym

from baselines import deepq


def main():
    env = gym.make("MountainCar-v0")
    act = deepq.load("mountaincar_model.pkl")

    while True:
        obs, done = env.reset(), False
        episode_rew = 0
        while not done:
            env.render()
            obs, rew, done, _ = env.step(act(obs[None])[0])
            episode_rew += rew
        print("Episode reward", episode_rew)


if __name__ == '__main__':
    main()


@@ -1,21 +0,0 @@
import gym

from baselines import deepq


def main():
    env = gym.make("PongNoFrameskip-v4")
    env = deepq.wrap_atari_dqn(env)
    act = deepq.load("pong_model.pkl")

    while True:
        obs, done = env.reset(), False
        episode_rew = 0
        while not done:
            env.render()
            obs, rew, done, _ = env.step(act(obs[None])[0])
            episode_rew += rew
        print("Episode reward", episode_rew)


if __name__ == '__main__':
    main()


@@ -1,54 +0,0 @@
from baselines import deepq
from baselines.common import set_global_seeds
from baselines import bench
import argparse
from baselines import logger
from baselines.common.atari_wrappers import make_atari


def main():
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--env', help='environment ID', default='BreakoutNoFrameskip-v4')
    parser.add_argument('--seed', help='RNG seed', type=int, default=0)
    parser.add_argument('--prioritized', type=int, default=1)
    parser.add_argument('--prioritized-replay-alpha', type=float, default=0.6)
    parser.add_argument('--dueling', type=int, default=1)
    parser.add_argument('--num-timesteps', type=int, default=int(10e6))
    parser.add_argument('--checkpoint-freq', type=int, default=10000)
    parser.add_argument('--checkpoint-path', type=str, default=None)

    args = parser.parse_args()
    logger.configure()
    set_global_seeds(args.seed)
    env = make_atari(args.env)
    env = bench.Monitor(env, logger.get_dir())
    env = deepq.wrap_atari_dqn(env)
    model = deepq.models.cnn_to_mlp(
        convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
        hiddens=[256],
        dueling=bool(args.dueling),
    )

    deepq.learn(
        env,
        q_func=model,
        lr=1e-4,
        max_timesteps=args.num_timesteps,
        buffer_size=10000,
        exploration_fraction=0.1,
        exploration_final_eps=0.01,
        train_freq=4,
        learning_starts=10000,
        target_network_update_freq=1000,
        gamma=0.99,
        prioritized_replay=bool(args.prioritized),
        prioritized_replay_alpha=args.prioritized_replay_alpha,
        checkpoint_freq=args.checkpoint_freq,
        checkpoint_path=args.checkpoint_path,
    )

    env.close()


if __name__ == '__main__':
    main()


@@ -1,31 +0,0 @@
import gym

from baselines import deepq


def callback(lcl, _glb):
    # stop training if reward exceeds 199
    is_solved = lcl['t'] > 100 and sum(lcl['episode_rewards'][-101:-1]) / 100 >= 199
    return is_solved


def main():
    env = gym.make("CartPole-v0")
    model = deepq.models.mlp([64])
    act = deepq.learn(
        env,
        q_func=model,
        lr=1e-3,
        max_timesteps=100000,
        buffer_size=50000,
        exploration_fraction=0.1,
        exploration_final_eps=0.02,
        print_freq=10,
        callback=callback
    )
    print("Saving model to cartpole_model.pkl")
    act.save("cartpole_model.pkl")


if __name__ == '__main__':
    main()
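
The `callback` above stops training once the average reward over the last 100 finished episodes reaches 199. The slice `[-101:-1]` leaves out the final list entry, which is the episode still in progress. A minimal sketch of the same check as a plain function, with hypothetical `threshold` and `window` parameters added for illustration:

```python
def is_solved(t, episode_rewards, threshold=199.0, window=100):
    """Return True once the mean reward of the last `window` finished episodes
    reaches `threshold`.

    Hypothetical helper: it mirrors the callback above, including dividing by
    the full window size even when fewer episodes have finished.
    """
    # The last entry is the running total of the episode in progress, so skip it.
    finished = episode_rewards[-(window + 1):-1]
    return t > 100 and sum(finished) / window >= threshold


if __name__ == "__main__":
    print(is_solved(5000, [200.0] * 100 + [37.0]))  # 100 finished episodes at 200
    print(is_solved(5000, [200.0] * 50 + [0.0]))    # only 50 finished episodes
```

Because the divisor is fixed at the window size, a run with fewer than 100 finished episodes is averaged as if the missing episodes scored zero.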


@@ -1,26 +0,0 @@
import gym

from baselines import deepq


def main():
    env = gym.make("MountainCar-v0")
    # Enabling layer_norm here is important for parameter space noise!
    model = deepq.models.mlp([64], layer_norm=True)
    act = deepq.learn(
        env,
        q_func=model,
        lr=1e-3,
        max_timesteps=100000,
        buffer_size=50000,
        exploration_fraction=0.1,
        exploration_final_eps=0.1,
        print_freq=10,
        param_noise=True
    )
    print("Saving model to mountaincar_model.pkl")
    act.save("mountaincar_model.pkl")


if __name__ == '__main__':
    main()


@@ -1,91 +1,47 @@
import tensorflow as tf
import tensorflow.contrib.layers as layers
def _mlp(hiddens, inpt, num_actions, scope, reuse=False, layer_norm=False):
with tf.variable_scope(scope, reuse=reuse):
out = inpt
for hidden in hiddens:
out = layers.fully_connected(out, num_outputs=hidden, activation_fn=None)
if layer_norm:
out = layers.layer_norm(out, center=True, scale=True)
out = tf.nn.relu(out)
q_out = layers.fully_connected(out, num_outputs=num_actions, activation_fn=None)
return q_out
def build_q_func(network, hiddens=[256], dueling=True, layer_norm=False, **network_kwargs):
if isinstance(network, str):
from baselines.common.models import get_network_builder
network = get_network_builder(network)(**network_kwargs)
def q_func_builder(input_shape, num_actions):
# the sub Functional model which does not include the top layer.
model = network(input_shape)
def mlp(hiddens=[], layer_norm=False):
"""This model takes as input an observation and returns values of all actions.
# wrapping the sub Functional model with layers that compute action scores into another Functional model.
latent = model.outputs
if len(latent) > 1:
if latent[1] is not None:
raise NotImplementedError("DQN is not compatible with recurrent policies yet")
latent = latent[0]
Parameters
----------
hiddens: [int]
list of sizes of hidden layers
latent = tf.keras.layers.Flatten()(latent)
Returns
-------
q_func: function
q_function for DQN algorithm.
"""
return lambda *args, **kwargs: _mlp(hiddens, layer_norm=layer_norm, *args, **kwargs)
def _cnn_to_mlp(convs, hiddens, dueling, inpt, num_actions, scope, reuse=False, layer_norm=False):
with tf.variable_scope(scope, reuse=reuse):
out = inpt
with tf.variable_scope("convnet"):
for num_outputs, kernel_size, stride in convs:
out = layers.convolution2d(out,
num_outputs=num_outputs,
kernel_size=kernel_size,
stride=stride,
activation_fn=tf.nn.relu)
conv_out = layers.flatten(out)
with tf.variable_scope("action_value"):
action_out = conv_out
with tf.name_scope("action_value"):
action_out = latent
for hidden in hiddens:
action_out = layers.fully_connected(action_out, num_outputs=hidden, activation_fn=None)
action_out = tf.keras.layers.Dense(units=hidden, activation=None)(action_out)
if layer_norm:
action_out = layers.layer_norm(action_out, center=True, scale=True)
action_out = tf.keras.layers.LayerNormalization(center=True, scale=True)(action_out)
action_out = tf.nn.relu(action_out)
action_scores = layers.fully_connected(action_out, num_outputs=num_actions, activation_fn=None)
action_scores = tf.keras.layers.Dense(units=num_actions, activation=None)(action_out)
if dueling:
with tf.variable_scope("state_value"):
state_out = conv_out
with tf.name_scope("state_value"):
state_out = latent
for hidden in hiddens:
state_out = layers.fully_connected(state_out, num_outputs=hidden, activation_fn=None)
state_out = tf.keras.layers.Dense(units=hidden, activation=None)(state_out)
if layer_norm:
state_out = layers.layer_norm(state_out, center=True, scale=True)
state_out = tf.keras.layers.LayerNormalization(center=True, scale=True)(state_out)
state_out = tf.nn.relu(state_out)
state_score = layers.fully_connected(state_out, num_outputs=1, activation_fn=None)
state_score = tf.keras.layers.Dense(units=1, activation=None)(state_out)
action_scores_mean = tf.reduce_mean(action_scores, 1)
action_scores_centered = action_scores - tf.expand_dims(action_scores_mean, 1)
q_out = state_score + action_scores_centered
else:
q_out = action_scores
return q_out
def cnn_to_mlp(convs, hiddens, dueling=False, layer_norm=False):
"""This model takes as input an observation and returns values of all actions.
Parameters
----------
convs: [(int, int, int)]
list of convolutional layers in form of
(num_outputs, kernel_size, stride)
hiddens: [int]
list of sizes of hidden layers
dueling: bool
if true double the output MLP to compute a baseline
for action scores
Returns
-------
q_func: function
q_function for DQN algorithm.
"""
return lambda *args, **kwargs: _cnn_to_mlp(convs, hiddens, dueling, layer_norm=layer_norm, *args, **kwargs)
return tf.keras.Model(inputs=model.inputs, outputs=[q_out])
return q_func_builder
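
Both the removed `_cnn_to_mlp` and the new `build_q_func` combine the two dueling heads in the same way: the action scores are centred by subtracting their per-row mean, then the state score is added. A minimal NumPy sketch of that aggregation step only (not of the Keras model); `dueling_q_values` is a hypothetical name used for illustration:

```python
import numpy as np


def dueling_q_values(state_score, action_scores):
    """Combine a state-value head and an action-score head into Q-values.

    state_score:   array of shape (batch, 1)
    action_scores: array of shape (batch, num_actions)
    """
    state_score = np.asarray(state_score, dtype=np.float32)
    action_scores = np.asarray(action_scores, dtype=np.float32)
    # Centre the action scores so that each row has zero mean.
    centered = action_scores - action_scores.mean(axis=1, keepdims=True)
    # Broadcasting adds the single state score to every action in the row.
    return state_score + centered


if __name__ == "__main__":
    print(dueling_q_values([[2.0]], [[1.0, 2.0, 3.0]]))
```

With this centring, the mean Q-value of each row equals that row's state score, which keeps the two heads from trading a constant offset between them.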


@@ -32,6 +32,9 @@ class ReplayBuffer(object):
def _encode_sample(self, idxes):
obses_t, actions, rewards, obses_tp1, dones = [], [], [], [], []
data = self._storage[0]
ob_dtype = data[0].dtype
ac_dtype = data[1].dtype
for i in idxes:
data = self._storage[i]
obs_t, action, reward, obs_tp1, done = data
@@ -40,7 +43,7 @@ class ReplayBuffer(object):
rewards.append(reward)
obses_tp1.append(np.array(obs_tp1, copy=False))
dones.append(done)
return np.array(obses_t), np.array(actions), np.array(rewards), np.array(obses_tp1), np.array(dones)
return np.array(obses_t, dtype=ob_dtype), np.array(actions, dtype=ac_dtype), np.array(rewards, dtype=np.float32), np.array(obses_tp1, dtype=ob_dtype), np.array(dones, dtype=np.float32)
def sample(self, batch_size):
"""Sample a batch of experiences.
@@ -106,9 +109,10 @@ class PrioritizedReplayBuffer(ReplayBuffer):
def _sample_proportional(self, batch_size):
res = []
for _ in range(batch_size):
# TODO(szymon): should we ensure no repeats?
mass = random.random() * self._it_sum.sum(0, len(self._storage) - 1)
p_total = self._it_sum.sum(0, len(self._storage) - 1)
every_range_len = p_total / batch_size
for i in range(batch_size):
mass = random.random() * every_range_len + i * every_range_len
idx = self._it_sum.find_prefixsum_idx(mass)
res.append(idx)
return res
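
The hunk above replaces independent draws over the whole priority mass with stratified draws: the total mass is split into `batch_size` equal ranges and one point is drawn from each. The sketch below illustrates that idea with a cumulative sum and `np.searchsorted` standing in for the sum tree and its `find_prefixsum_idx`; this is a swapped-in technique for readability, and `sample_proportional_stratified` is a hypothetical name.

```python
import random

import numpy as np


def sample_proportional_stratified(priorities, batch_size, rng=random):
    """Draw one index per equal-mass stratum, proportionally to priority.

    Illustration only: a cumulative sum replaces the sum tree used by the
    real buffer, so each call costs O(n) instead of O(log n) per draw.
    """
    cumulative = np.cumsum(np.asarray(priorities, dtype=np.float64))
    p_total = cumulative[-1]
    every_range_len = p_total / batch_size
    idxes = []
    for i in range(batch_size):
        # Uniform point inside the i-th stratum of the total priority mass.
        mass = rng.random() * every_range_len + i * every_range_len
        # First index whose cumulative priority exceeds `mass`.
        idx = int(np.searchsorted(cumulative, mass, side="right"))
        # Guard against floating-point rounding at the very end of the range.
        idxes.append(min(idx, len(cumulative) - 1))
    return idxes


if __name__ == "__main__":
    print(sample_proportional_stratified([1.0, 0.0, 3.0], 4, rng=random.Random(0)))
```

With priorities `[1, 0, 3]` and a batch of four, each stratum has mass 1, so the first draw lands on index 0 and the other three on index 2; the zero-priority item is never selected.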
@@ -161,7 +165,7 @@ class PrioritizedReplayBuffer(ReplayBuffer):
p_sample = self._it_sum[idx] / self._it_sum.sum()
weight = (p_sample * len(self._storage)) ** (-beta)
weights.append(weight / max_weight)
weights = np.array(weights)
weights = np.array(weights, dtype=np.float32)
encoded_sample = self._encode_sample(idxes)
return tuple(list(encoded_sample) + [weights, idxes])
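
The weight computation shown in the last hunk is the usual importance-sampling correction for prioritized replay: each sampled item gets `(p_sample * N) ** (-beta)`, divided by `max_weight`. The hunk does not show how `max_weight` is obtained, so the sketch below assumes it is the weight of the lowest-priority item. It also assumes all priorities are strictly positive and uses a plain array in place of the sum tree; `importance_weights` is a hypothetical name.

```python
import numpy as np


def importance_weights(priorities, idxes, beta):
    """Normalised importance-sampling weights for the sampled indices.

    Assumptions (not taken from the diff): priorities are strictly positive,
    and max_weight is the weight of the item with the smallest priority.
    """
    priorities = np.asarray(priorities, dtype=np.float64)
    n = len(priorities)
    probs = priorities / priorities.sum()
    # Largest possible weight, used to scale all weights into (0, 1].
    max_weight = (probs.min() * n) ** (-beta)
    weights = (probs[np.asarray(idxes)] * n) ** (-beta) / max_weight
    # Cast explicitly, matching the float32 dtype introduced in the hunk above.
    return weights.astype(np.float32)


if __name__ == "__main__":
    print(importance_weights([1.0, 3.0], [0, 1], beta=1.0))
```

With uniform priorities every weight is 1, so the update reduces to ordinary uniform replay.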


@@ -1,306 +0,0 @@
import os
import tempfile

import tensorflow as tf
import zipfile
import cloudpickle
import numpy as np

import baselines.common.tf_util as U
from baselines.common.tf_util import load_state, save_state
from baselines import logger
from baselines.common.schedules import LinearSchedule
from baselines.common.input import observation_input

from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
from baselines.deepq.utils import ObservationInput


class ActWrapper(object):
    def __init__(self, act, act_params):
        self._act = act
        self._act_params = act_params

    @staticmethod
    def load(path):
        with open(path, "rb") as f:
            model_data, act_params = cloudpickle.load(f)
        act = deepq.build_act(**act_params)
        sess = tf.Session()
        sess.__enter__()
        with tempfile.TemporaryDirectory() as td:
            arc_path = os.path.join(td, "packed.zip")
            with open(arc_path, "wb") as f:
                f.write(model_data)

            zipfile.ZipFile(arc_path, 'r', zipfile.ZIP_DEFLATED).extractall(td)
            load_state(os.path.join(td, "model"))

        return ActWrapper(act, act_params)

    def __call__(self, *args, **kwargs):
        return self._act(*args, **kwargs)

    def save(self, path=None):
        """Save model to a pickle located at `path`"""
        if path is None:
            path = os.path.join(logger.get_dir(), "model.pkl")

        with tempfile.TemporaryDirectory() as td:
            save_state(os.path.join(td, "model"))
            arc_name = os.path.join(td, "packed.zip")
            with zipfile.ZipFile(arc_name, 'w') as zipf:
                for root, dirs, files in os.walk(td):
                    for fname in files:
                        file_path = os.path.join(root, fname)
                        if file_path != arc_name:
                            zipf.write(file_path, os.path.relpath(file_path, td))
            with open(arc_name, "rb") as f:
                model_data = f.read()
        with open(path, "wb") as f:
            cloudpickle.dump((model_data, self._act_params), f)


def load(path):
    """Load act function that was returned by learn function.

    Parameters
    ----------
    path: str
        path to the act function pickle

    Returns
    -------
    act: ActWrapper
        function that takes a batch of observations
        and returns actions.
    """
    return ActWrapper.load(path)
def learn(env,
          q_func,
          lr=5e-4,
          max_timesteps=100000,
          buffer_size=50000,
          exploration_fraction=0.1,
          exploration_final_eps=0.02,
          train_freq=1,
          batch_size=32,
          print_freq=100,
          checkpoint_freq=10000,
          checkpoint_path=None,
          learning_starts=1000,
          gamma=1.0,
          target_network_update_freq=500,
          prioritized_replay=False,
          prioritized_replay_alpha=0.6,
          prioritized_replay_beta0=0.4,
          prioritized_replay_beta_iters=None,
          prioritized_replay_eps=1e-6,
          param_noise=False,
          callback=None):
    """Train a deepq model.

    Parameters
    ----------
    env: gym.Env
        environment to train on
    q_func: (tf.Variable, int, str, bool) -> tf.Variable
        the model that takes the following inputs:
            observation_in: object
                the output of observation placeholder
            num_actions: int
                number of actions
            scope: str
            reuse: bool
                should be passed to outer variable scope
        and returns a tensor of shape (batch_size, num_actions) with values of every action.
    lr: float
        learning rate for adam optimizer
    max_timesteps: int
        number of env steps to optimize for
    buffer_size: int
        size of the replay buffer
    exploration_fraction: float
        fraction of entire training period over which the exploration rate is annealed
    exploration_final_eps: float
        final value of random action probability
    train_freq: int
        update the model every `train_freq` steps.
    batch_size: int
        size of a batch sampled from replay buffer for training
    print_freq: int
        how often to print out training progress
        set to None to disable printing
    checkpoint_freq: int
        how often to save the model. This is so that the best version is restored
        at the end of the training. If you do not wish to restore the best version at
        the end of the training set this variable to None.
    learning_starts: int
        how many steps of the model to collect transitions for before learning starts
    gamma: float
        discount factor
    target_network_update_freq: int
        update the target network every `target_network_update_freq` steps.
    prioritized_replay: bool
        if True prioritized replay buffer will be used.
    prioritized_replay_alpha: float
        alpha parameter for prioritized replay buffer
    prioritized_replay_beta0: float
        initial value of beta for prioritized replay buffer
    prioritized_replay_beta_iters: int
        number of iterations over which beta will be annealed from initial value
        to 1.0. If set to None equals to max_timesteps.
    prioritized_replay_eps: float
        epsilon to add to the TD errors when updating priorities.
    param_noise: bool
        if True, explore with parameter space noise (Plappert et al., 2017)
        instead of epsilon-greedy action selection.
    callback: (locals, globals) -> bool
        function called at every step with state of the algorithm.
        If callback returns true training stops.

    Returns
    -------
    act: ActWrapper
        Wrapper over act function. Adds ability to save it and load it.
        See header of baselines/deepq/categorical.py for details on the act function.
    """
    # Create all the functions necessary to train the model

    sess = tf.Session()
    sess.__enter__()

    # capture the shape outside the closure so that the env object is not serialized
    # by cloudpickle when serializing make_obs_ph

    def make_obs_ph(name):
        return ObservationInput(env.observation_space, name=name)

    act, train, update_target, debug = deepq.build_train(
        make_obs_ph=make_obs_ph,
        q_func=q_func,
        num_actions=env.action_space.n,
        optimizer=tf.train.AdamOptimizer(learning_rate=lr),
        gamma=gamma,
        grad_norm_clipping=10,
        param_noise=param_noise
    )

    act_params = {
        'make_obs_ph': make_obs_ph,
        'q_func': q_func,
        'num_actions': env.action_space.n,
    }

    act = ActWrapper(act, act_params)

    # Create the replay buffer
    if prioritized_replay:
        replay_buffer = PrioritizedReplayBuffer(buffer_size, alpha=prioritized_replay_alpha)
        if prioritized_replay_beta_iters is None:
            prioritized_replay_beta_iters = max_timesteps
        beta_schedule = LinearSchedule(prioritized_replay_beta_iters,
                                       initial_p=prioritized_replay_beta0,
                                       final_p=1.0)
    else:
        replay_buffer = ReplayBuffer(buffer_size)
        beta_schedule = None
    # Create the schedule for exploration starting from 1.
    exploration = LinearSchedule(schedule_timesteps=int(exploration_fraction * max_timesteps),
                                 initial_p=1.0,
                                 final_p=exploration_final_eps)

    # Initialize the parameters and copy them to the target network.
    U.initialize()
    update_target()

    episode_rewards = [0.0]
    saved_mean_reward = None
    obs = env.reset()
    reset = True
    with tempfile.TemporaryDirectory() as td:
        td = checkpoint_path or td

        model_file = os.path.join(td, "model")
        model_saved = False
        if tf.train.latest_checkpoint(td) is not None:
            load_state(model_file)
            logger.log('Loaded model from {}'.format(model_file))
            model_saved = True

        for t in range(max_timesteps):
            if callback is not None:
                if callback(locals(), globals()):
                    break
            # Take action and update exploration to the newest value
            kwargs = {}
            if not param_noise:
                update_eps = exploration.value(t)
                update_param_noise_threshold = 0.
            else:
                update_eps = 0.
                # Compute the threshold such that the KL divergence between perturbed and non-perturbed
                # policy is comparable to eps-greedy exploration with eps = exploration.value(t).
                # See Appendix C.1 in Parameter Space Noise for Exploration, Plappert et al., 2017
                # for detailed explanation.
                update_param_noise_threshold = -np.log(1. - exploration.value(t) + exploration.value(t) / float(env.action_space.n))
                kwargs['reset'] = reset
                kwargs['update_param_noise_threshold'] = update_param_noise_threshold
                kwargs['update_param_noise_scale'] = True
            action = act(np.array(obs)[None], update_eps=update_eps, **kwargs)[0]
            env_action = action
            reset = False
            new_obs, rew, done, _ = env.step(env_action)
            # Store transition in the replay buffer.
            replay_buffer.add(obs, action, rew, new_obs, float(done))
            obs = new_obs

            episode_rewards[-1] += rew
            if done:
                obs = env.reset()
                episode_rewards.append(0.0)
                reset = True

            if t > learning_starts and t % train_freq == 0:
                # Minimize the error in Bellman's equation on a batch sampled from replay buffer.
                if prioritized_replay:
                    experience = replay_buffer.sample(batch_size, beta=beta_schedule.value(t))
                    (obses_t, actions, rewards, obses_tp1, dones, weights, batch_idxes) = experience
                else:
                    obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(batch_size)
                    weights, batch_idxes = np.ones_like(rewards), None
                td_errors = train(obses_t, actions, rewards, obses_tp1, dones, weights)
                if prioritized_replay:
                    new_priorities = np.abs(td_errors) + prioritized_replay_eps
                    replay_buffer.update_priorities(batch_idxes, new_priorities)

            if t > learning_starts and t % target_network_update_freq == 0:
                # Update target network periodically.
                update_target()

            mean_100ep_reward = round(np.mean(episode_rewards[-101:-1]), 1)
            num_episodes = len(episode_rewards)
            if done and print_freq is not None and len(episode_rewards) % print_freq == 0:
                logger.record_tabular("steps", t)
                logger.record_tabular("episodes", num_episodes)
                logger.record_tabular("mean 100 episode reward", mean_100ep_reward)
                logger.record_tabular("% time spent exploring", int(100 * exploration.value(t)))
                logger.dump_tabular()

            if (checkpoint_freq is not None and t > learning_starts and
                    num_episodes > 100 and t % checkpoint_freq == 0):
                if saved_mean_reward is None or mean_100ep_reward > saved_mean_reward:
                    if print_freq is not None:
                        logger.log("Saving model due to mean reward increase: {} -> {}".format(
                                   saved_mean_reward, mean_100ep_reward))
                    save_state(model_file)
                    model_saved = True
                    saved_mean_reward = mean_100ep_reward
        if model_saved:
            if print_freq is not None:
                logger.log("Restored model with mean reward: {}".format(saved_mean_reward))
            load_state(model_file)

    return act
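
When `param_noise` is enabled, the training loop above turns the current epsilon into a distance threshold, `-log(1 - eps + eps / n)`, where `n` is the number of actions. The sketch below isolates that formula; `param_noise_threshold` is a hypothetical name, and the code only restates the expression in the loop rather than the full parameter-noise mechanism.

```python
import math


def param_noise_threshold(eps, num_actions):
    """Threshold used to scale parameter noise for a given epsilon.

    Restates the expression from the training loop above:
    -log(1 - eps + eps / num_actions).
    """
    return -math.log(1.0 - eps + eps / float(num_actions))


if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.5, 1.0):
        print(eps, round(param_noise_threshold(eps, num_actions=4), 4))
```

The threshold is zero when `eps` is zero and grows to `log(num_actions)` when `eps` is one, so a larger epsilon allows a larger perturbation of the policy.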

Some files were not shown because too many files have changed in this diff.