Compare commits
34 Commits
peterz_ale
...
old_acktr_
Author | SHA1 | Date | |
---|---|---|---|
|
0a40206c6c | ||
|
1937826784 | ||
|
b29c8020d7 | ||
|
4ec308aaa4 | ||
|
3bbf3f3511 | ||
|
e5de29a954 | ||
|
2507d335f9 | ||
|
bdd4d385a6 | ||
|
0961f5dd94 | ||
|
337d913a8f | ||
|
34af61a132 | ||
|
1ea5ec647c | ||
|
2fc7a1cbee | ||
|
14c1d69ef4 | ||
|
c8f6d8bac7 | ||
|
3a006ba50e | ||
|
c6c0f45cb1 | ||
|
e92a6ad8f4 | ||
|
92b9a37257 | ||
|
cb14da96ca | ||
|
3900f2a447 | ||
|
20d22a5d79 | ||
|
caf7b08b4d | ||
|
ca0165cdf5 | ||
|
eb5b605f86 | ||
|
a89bee3c8d | ||
|
353bb15e90 | ||
|
64c0c0a043 | ||
|
5fee99e771 | ||
|
5edcd6886e | ||
|
bcde04e710 | ||
|
cd375ab209 | ||
|
5622a09fa4 | ||
|
e2da7cd42f |
@@ -10,5 +10,5 @@ install:
|
||||
- docker build . -t baselines-test
|
||||
|
||||
script:
|
||||
- flake8 --select=F,E999 baselines/common baselines/trpo_mpi baselines/ppo2 baselines/a2c baselines/deepq baselines/acer
|
||||
- docker run baselines-test pytest --runslow
|
||||
- flake8 .
|
||||
- docker run baselines-test pytest -v .
|
||||
|
@@ -18,6 +18,7 @@ WORKDIR $CODE_DIR/baselines
|
||||
# Clean up pycache and pyc files
|
||||
RUN rm -rf __pycache__ && \
|
||||
find . -name "*.pyc" -delete && \
|
||||
pip install tensorflow && \
|
||||
pip install -e .[test]
|
||||
|
||||
|
||||
|
27
README.md
27
README.md
@@ -45,8 +45,8 @@ cd baselines
|
||||
```
|
||||
If using virtualenv, create a new virtualenv and activate it
|
||||
```bash
|
||||
virtualenv env --python=python3
|
||||
. env/bin/activate
|
||||
virtualenv env --python=python3
|
||||
. env/bin/activate
|
||||
```
|
||||
Install baselines package
|
||||
```bash
|
||||
@@ -62,29 +62,20 @@ pip install pytest
|
||||
pytest
|
||||
```
|
||||
|
||||
## Subpackages
|
||||
|
||||
## Testing the installation
|
||||
All unit tests in baselines can be run using pytest runner:
|
||||
```
|
||||
pip install pytest
|
||||
pytest
|
||||
```
|
||||
|
||||
## Training models
|
||||
Most of the algorithms in baselines repo are used as follows:
|
||||
```bash
|
||||
python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
|
||||
python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
|
||||
```
|
||||
### Example 1. PPO with MuJoCo Humanoid
|
||||
For instance, to train a fully-connected network controlling MuJoCo humanoid using a2c for 20M timesteps
|
||||
For instance, to train a fully-connected network controlling MuJoCo humanoid using PPO2 for 20M timesteps
|
||||
```bash
|
||||
python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
|
||||
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
|
||||
```
|
||||
Note that for mujoco environments fully-connected network is default, so we can omit `--network=mlp`
|
||||
The hyperparameters for both network and the learning algorithm can be controlled via the command line, for instance:
|
||||
```bash
|
||||
python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
|
||||
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
|
||||
```
|
||||
will set entropy coeffient to 0.1, and construct fully connected network with 3 layers with 32 hidden units in each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same)
|
||||
|
||||
@@ -94,7 +85,7 @@ docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) fir the description
|
||||
### Example 2. DQN on Atari
|
||||
DQN with Atari is at this point a classics of benchmarks. To run the baselines implementation of DQN on Atari Pong:
|
||||
```
|
||||
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
|
||||
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
|
||||
```
|
||||
|
||||
## Saving, loading and visualizing models
|
||||
@@ -102,11 +93,11 @@ The algorithms serialization API is not properly unified yet; however, there is
|
||||
`--save_path` and `--load_path` command-line option loads the tensorflow state from a given path before training, and saves it after the training, respectively.
|
||||
Let's imagine you'd like to train ppo2 on Atari Pong, save the model and then later visualize what has it learnt.
|
||||
```bash
|
||||
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num-timesteps=2e7 --save_path=~/models/pong_20M_ppo2
|
||||
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
|
||||
```
|
||||
This should get to the mean reward per episode about 5k. To load and visualize the model, we'll do the following - load the model, train it for 0 steps, and then visualize:
|
||||
```bash
|
||||
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num-timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
|
||||
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
|
||||
```
|
||||
|
||||
*NOTE:* At the moment Mujoco training uses VecNormalize wrapper for the environment which is not being saved correctly; so loading the models trained on Mujoco will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd by TfRunningMeanStd in [baselines/common/vec_env/vec_normalize.py](baselines/common/vec_env/vec_normalize.py#L12). This way, mean and std of environment normalizing wrapper will be saved in tensorflow variables and included in the model file; however, training is slower that way - hence not including it by default
|
||||
|
@@ -2,4 +2,5 @@
|
||||
|
||||
- Original paper: https://arxiv.org/abs/1602.01783
|
||||
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
|
||||
- `python -m baselines.a2c.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
|
||||
- `python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options
|
||||
- also refer to the repo-wide [README.md](../../README.md#training-models)
|
||||
|
@@ -1,4 +1,6 @@
|
||||
# ACER
|
||||
|
||||
- Original paper: https://arxiv.org/abs/1611.01224
|
||||
- `python -m baselines.acer.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
|
||||
- `python -m baselines.run --alg=acer --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
|
||||
- also refer to the repo-wide [README.md](../../README.md#training-models)
|
||||
|
||||
|
@@ -2,4 +2,7 @@
|
||||
|
||||
- Original paper: https://arxiv.org/abs/1708.05144
|
||||
- Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
|
||||
- `python -m baselines.acktr.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
|
||||
- `python -m baselines.run --alg=acktr --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
|
||||
- also refer to the repo-wide [README.md](../../README.md#training-models)
|
||||
|
||||
|
||||
|
@@ -54,7 +54,7 @@ def learn(env, policy, vf, gamma, lam, timesteps_per_batch, num_timesteps,
|
||||
stepsize = tf.Variable(initial_value=np.float32(np.array(0.03)), name='stepsize')
|
||||
inputs, loss, loss_sampled = policy.update_info
|
||||
optim = kfac.KfacOptimizer(learning_rate=stepsize, cold_lr=stepsize*(1-0.9), momentum=0.9, kfac_update=2,\
|
||||
epsilon=1e-2, stats_decay=0.99, async=1, cold_iter=1,
|
||||
epsilon=1e-2, stats_decay=0.99, async_=1, cold_iter=1,
|
||||
weight_decay_dict=policy.wd_dict, max_grad_norm=None)
|
||||
pi_var_list = []
|
||||
for var in tf.trainable_variables():
|
||||
|
@@ -58,7 +58,7 @@ class Model(object):
|
||||
with tf.device('/gpu:0'):
|
||||
self.optim = optim = kfac.KfacOptimizer(learning_rate=PG_LR, clip_kl=kfac_clip,\
|
||||
momentum=0.9, kfac_update=1, epsilon=0.01,\
|
||||
stats_decay=0.99, async=1, cold_iter=10, max_grad_norm=max_grad_norm)
|
||||
stats_decay=0.99, async_=1, cold_iter=10, max_grad_norm=max_grad_norm)
|
||||
|
||||
update_stats_op = optim.compute_and_apply_stats(joint_fisher_loss, var_list=params)
|
||||
train_op, q_runner = optim.apply_gradients(list(zip(grads,params)))
|
||||
@@ -97,7 +97,7 @@ def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interva
|
||||
kfac_clip=0.001, save_interval=None, lrschedule='linear', load_path=None, **network_kwargs):
|
||||
set_global_seeds(seed)
|
||||
|
||||
|
||||
|
||||
if network == 'cnn':
|
||||
network_kwargs['one_dim_bias'] = True
|
||||
|
||||
@@ -115,7 +115,7 @@ def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interva
|
||||
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
|
||||
fh.write(cloudpickle.dumps(make_model))
|
||||
model = make_model()
|
||||
|
||||
|
||||
if load_path is not None:
|
||||
model.load(load_path)
|
||||
|
||||
|
@@ -10,14 +10,14 @@ KFAC_DEBUG = False
|
||||
|
||||
class KfacOptimizer():
|
||||
|
||||
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
|
||||
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, async_=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
|
||||
self.max_grad_norm = max_grad_norm
|
||||
self._lr = learning_rate
|
||||
self._momentum = momentum
|
||||
self._clip_kl = clip_kl
|
||||
self._channel_fac = channel_fac
|
||||
self._kfac_update = kfac_update
|
||||
self._async = async
|
||||
self._async = async_
|
||||
self._async_stats = async_stats
|
||||
self._epsilon = epsilon
|
||||
self._stats_decay = stats_decay
|
||||
|
@@ -1,23 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
|
||||
from functools import partial
|
||||
|
||||
from baselines import logger
|
||||
from baselines.acktr.acktr_disc import learn
|
||||
from baselines.common.cmd_util import make_atari_env, atari_arg_parser
|
||||
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
|
||||
from baselines.common.policies import cnn
|
||||
|
||||
def train(env_id, num_timesteps, seed, num_cpu):
|
||||
env = VecFrameStack(make_atari_env(env_id, num_cpu, seed), 4)
|
||||
policy_fn = cnn(env=env, one_dim_bias=True)
|
||||
learn(policy_fn, env, seed, total_timesteps=int(num_timesteps * 1.1), nprocs=num_cpu)
|
||||
env.close()
|
||||
|
||||
def main():
|
||||
args = atari_arg_parser().parse_args()
|
||||
logger.configure()
|
||||
train(args.env, num_timesteps=args.num_timesteps, seed=args.seed, num_cpu=32)
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
@@ -21,7 +21,7 @@ class NeuralNetValueFunction(object):
|
||||
self._predict = U.function([X], vpred_n)
|
||||
optim = kfac.KfacOptimizer(learning_rate=0.001, cold_lr=0.001*(1-0.9), momentum=0.9, \
|
||||
clip_kl=0.3, epsilon=0.1, stats_decay=0.95, \
|
||||
async=1, kfac_update=2, cold_iter=50, \
|
||||
async_=1, kfac_update=2, cold_iter=50, \
|
||||
weight_decay_dict=wd_dict, max_grad_norm=None)
|
||||
vf_var_list = []
|
||||
for var in tf.trainable_variables():
|
||||
|
@@ -15,22 +15,31 @@ from baselines.bench import Monitor
|
||||
from baselines.common import set_global_seeds
|
||||
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
|
||||
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
|
||||
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
|
||||
from baselines.common.retro_wrappers import RewardScaler
|
||||
|
||||
def make_atari_env(env_id, num_env, seed, wrapper_kwargs=None, start_index=0):
|
||||
|
||||
def make_vec_env(env_id, env_type, num_env, seed, wrapper_kwargs=None, start_index=0, reward_scale=1.0):
|
||||
"""
|
||||
Create a wrapped, monitored SubprocVecEnv for Atari.
|
||||
Create a wrapped, monitored SubprocVecEnv for Atari and MuJoCo.
|
||||
"""
|
||||
if wrapper_kwargs is None: wrapper_kwargs = {}
|
||||
mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
|
||||
def make_env(rank): # pylint: disable=C0111
|
||||
def _thunk():
|
||||
env = make_atari(env_id)
|
||||
env = make_atari(env_id) if env_type == 'atari' else gym.make(env_id)
|
||||
env.seed(seed + 10000*mpi_rank + rank if seed is not None else None)
|
||||
env = Monitor(env, logger.get_dir() and os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(rank)))
|
||||
return wrap_deepmind(env, **wrapper_kwargs)
|
||||
env = Monitor(env,
|
||||
logger.get_dir() and os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(rank)),
|
||||
allow_early_resets=True)
|
||||
|
||||
if env_type == 'atari': return wrap_deepmind(env, **wrapper_kwargs)
|
||||
elif reward_scale != 1: return RewardScaler(env, reward_scale)
|
||||
else: return env
|
||||
return _thunk
|
||||
set_global_seeds(seed)
|
||||
return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
|
||||
if num_env > 1: return SubprocVecEnv([make_env(i + start_index) for i in range(num_env)])
|
||||
else: return DummyVecEnv([make_env(start_index)])
|
||||
|
||||
def make_mujoco_env(env_id, seed, reward_scale=1.0):
|
||||
"""
|
||||
@@ -40,13 +49,12 @@ def make_mujoco_env(env_id, seed, reward_scale=1.0):
|
||||
myseed = seed + 1000 * rank if seed is not None else None
|
||||
set_global_seeds(myseed)
|
||||
env = gym.make(env_id)
|
||||
env = Monitor(env, os.path.join(logger.get_dir(), str(rank)), allow_early_resets=True)
|
||||
logger_path = None if logger.get_dir() is None else os.path.join(logger.get_dir(), str(rank))
|
||||
env = Monitor(env, logger_path, allow_early_resets=True)
|
||||
env.seed(seed)
|
||||
|
||||
if reward_scale != 1.0:
|
||||
from baselines.common.retro_wrappers import RewardScaler
|
||||
env = RewardScaler(env, reward_scale)
|
||||
|
||||
return env
|
||||
|
||||
def make_robotics_env(env_id, seed, rank=0):
|
||||
@@ -88,7 +96,7 @@ def common_arg_parser():
|
||||
parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
|
||||
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
|
||||
parser.add_argument('--alg', help='Algorithm', type=str, default='ppo2')
|
||||
parser.add_argument('--num_timesteps', type=float, default=1e6),
|
||||
parser.add_argument('--num_timesteps', type=float, default=1e6),
|
||||
parser.add_argument('--network', help='network type (mlp, cnn, lstm, cnn_lstm, conv_only)', default=None)
|
||||
parser.add_argument('--gamestate', help='game state to load (so far only used in retro games)', default=None)
|
||||
parser.add_argument('--num_env', help='Number of environment copies being run in parallel. When not specified, set to number of cpus for Atari, and to 1 for Mujoco', default=None, type=int)
|
||||
@@ -121,6 +129,3 @@ def parse_unknown_args(args):
|
||||
retval[key] = value
|
||||
|
||||
return retval
|
||||
|
||||
|
||||
|
||||
|
@@ -2,6 +2,8 @@ from __future__ import print_function
|
||||
from contextlib import contextmanager
|
||||
import numpy as np
|
||||
import time
|
||||
import shlex
|
||||
import subprocess
|
||||
|
||||
# ================================================================
|
||||
# Misc
|
||||
@@ -37,7 +39,7 @@ color2num = dict(
|
||||
crimson=38
|
||||
)
|
||||
|
||||
def colorize(string, color, bold=False, highlight=False):
|
||||
def colorize(string, color='green', bold=False, highlight=False):
|
||||
attr = []
|
||||
num = color2num[color]
|
||||
if highlight: num += 10
|
||||
@@ -45,6 +47,22 @@ def colorize(string, color, bold=False, highlight=False):
|
||||
if bold: attr.append('1')
|
||||
return '\x1b[%sm%s\x1b[0m' % (';'.join(attr), string)
|
||||
|
||||
def print_cmd(cmd, dry=False):
|
||||
if isinstance(cmd, str): # for shell=True
|
||||
pass
|
||||
else:
|
||||
cmd = ' '.join(shlex.quote(arg) for arg in cmd)
|
||||
print(colorize(('CMD: ' if not dry else 'DRY: ') + cmd))
|
||||
|
||||
|
||||
def get_git_commit(cwd=None):
|
||||
return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'], cwd=cwd).decode('utf8')
|
||||
|
||||
def ccap(cmd, dry=False, env=None, **kwargs):
|
||||
print_cmd(cmd, dry)
|
||||
if not dry:
|
||||
subprocess.check_call(cmd, env=env, **kwargs)
|
||||
|
||||
|
||||
MESSAGE_DEPTH = 0
|
||||
|
||||
|
@@ -22,8 +22,7 @@ def nature_cnn(unscaled_images, **conv_kwargs):
|
||||
|
||||
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
|
||||
"""
|
||||
Simple fully connected layer policy. Separate stacks of fully-connected layers are used for policy and value function estimation.
|
||||
More customized fully-connected policies can be obtained by using PolicyWithV class directly.
|
||||
Stack of fully-connected layers to be used in a policy / q-function approximator
|
||||
|
||||
Parameters:
|
||||
----------
|
||||
@@ -37,7 +36,7 @@ def mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
|
||||
Returns:
|
||||
-------
|
||||
|
||||
function that builds fully connected network with a given input placeholder
|
||||
function that builds fully connected network with a given input tensor / placeholder
|
||||
"""
|
||||
def network_fn(X):
|
||||
h = tf.layers.flatten(X)
|
||||
@@ -68,6 +67,34 @@ def cnn_small(**conv_kwargs):
|
||||
|
||||
|
||||
def lstm(nlstm=128, layer_norm=False):
|
||||
"""
|
||||
Builds LSTM (Long-Short Term Memory) network to be used in a policy.
|
||||
Note that the resulting function returns not only the output of the LSTM
|
||||
(i.e. hidden state of lstm for each step in the sequence), but also a dictionary
|
||||
with auxiliary tensors to be set as policy attributes.
|
||||
|
||||
Specifically,
|
||||
S is a placeholder to feed current state (LSTM state has to be managed outside policy)
|
||||
M is a placeholder for the mask (used to mask out observations after the end of the episode, but can be used for other purposes too)
|
||||
initial_state is a numpy array containing initial lstm state (usually zeros)
|
||||
state is the output LSTM state (to be fed into S at the next call)
|
||||
|
||||
|
||||
An example of usage of lstm-based policy can be found here: common/tests/test_doc_examples.py/test_lstm_example
|
||||
|
||||
Parameters:
|
||||
----------
|
||||
|
||||
nlstm: int LSTM hidden state size
|
||||
|
||||
layer_norm: bool if True, layer-normalized version of LSTM is used
|
||||
|
||||
Returns:
|
||||
-------
|
||||
|
||||
function that builds LSTM with a given input tensor / placeholder
|
||||
"""
|
||||
|
||||
def network_fn(X, nenv=1):
|
||||
nbatch = X.shape[0]
|
||||
nsteps = nbatch // nenv
|
||||
|
@@ -72,7 +72,7 @@ class PolicyWithValue(object):
|
||||
|
||||
def step(self, observation, **extra_feed):
|
||||
"""
|
||||
Compute next action(s) given the observaion(s)
|
||||
Compute next action(s) given the observation(s)
|
||||
|
||||
Parameters:
|
||||
----------
|
||||
@@ -93,7 +93,7 @@ class PolicyWithValue(object):
|
||||
|
||||
def value(self, ob, *args, **kwargs):
|
||||
"""
|
||||
Compute value estimate(s) given the observaion(s)
|
||||
Compute value estimate(s) given the observation(s)
|
||||
|
||||
Parameters:
|
||||
----------
|
||||
|
@@ -14,7 +14,7 @@ common_kwargs = dict(
|
||||
learn_kwargs = {
|
||||
'a2c' : dict(nsteps=32, value_network='copy', lr=0.05),
|
||||
'acktr': dict(nsteps=32, value_network='copy'),
|
||||
'deepq': {},
|
||||
'deepq': dict(total_timesteps=20000),
|
||||
'ppo2': dict(value_network='copy'),
|
||||
'trpo_mpi': {}
|
||||
}
|
||||
@@ -38,3 +38,6 @@ def test_cartpole(alg):
|
||||
return env
|
||||
|
||||
reward_per_episode_test(env_fn, learn_fn, 100)
|
||||
|
||||
if __name__ == '__main__':
|
||||
test_cartpole('deepq')
|
||||
|
48
baselines/common/tests/test_doc_examples.py
Normal file
48
baselines/common/tests/test_doc_examples.py
Normal file
@@ -0,0 +1,48 @@
|
||||
import pytest
|
||||
try:
|
||||
import mujoco_py
|
||||
_mujoco_present = True
|
||||
except BaseException:
|
||||
mujoco_py = None
|
||||
_mujoco_present = False
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
not _mujoco_present,
|
||||
reason='error loading mujoco - either mujoco / mujoco key not present, or LD_LIBRARY_PATH is not pointing to mujoco library'
|
||||
)
|
||||
def test_lstm_example():
|
||||
import tensorflow as tf
|
||||
from baselines.common import policies, models, cmd_util
|
||||
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
|
||||
|
||||
# create vectorized environment
|
||||
venv = DummyVecEnv([lambda: cmd_util.make_mujoco_env('Reacher-v2', seed=0)])
|
||||
|
||||
with tf.Session() as sess:
|
||||
# build policy based on lstm network with 128 units
|
||||
policy = policies.build_policy(venv, models.lstm(128))(nbatch=1, nsteps=1)
|
||||
|
||||
# initialize tensorflow variables
|
||||
sess.run(tf.global_variables_initializer())
|
||||
|
||||
# prepare environment variables
|
||||
ob = venv.reset()
|
||||
state = policy.initial_state
|
||||
done = [False]
|
||||
step_counter = 0
|
||||
|
||||
# run a single episode until the end (i.e. until done)
|
||||
while True:
|
||||
action, _, state, _ = policy.step(ob, S=state, M=done)
|
||||
ob, reward, done, _ = venv.step(action)
|
||||
step_counter += 1
|
||||
if done:
|
||||
break
|
||||
|
||||
|
||||
assert step_counter > 5
|
||||
|
||||
|
||||
|
||||
|
@@ -62,7 +62,7 @@ def make_session(config=None, num_cpu=None, make_default=False, graph=None):
|
||||
num_cpu = int(os.getenv('RCALL_NUM_CPU', multiprocessing.cpu_count()))
|
||||
if config is None:
|
||||
config = tf.ConfigProto(
|
||||
allow_soft_placement=True,
|
||||
allow_soft_placement=True,
|
||||
inter_op_parallelism_threads=num_cpu,
|
||||
intra_op_parallelism_threads=num_cpu)
|
||||
config.gpu_options.allow_growth = True
|
||||
@@ -328,7 +328,7 @@ def save_state(fname, sess=None):
|
||||
def save_variables(save_path, variables=None, sess=None):
|
||||
sess = sess or get_session()
|
||||
variables = variables or tf.trainable_variables()
|
||||
|
||||
|
||||
ps = sess.run(variables)
|
||||
save_dict = {v.name: value for v, value in zip(variables, ps)}
|
||||
os.makedirs(os.path.dirname(save_path), exist_ok=True)
|
||||
@@ -354,10 +354,10 @@ def adjust_shape(placeholder, data):
|
||||
If shape is incompatible, AssertionError is thrown
|
||||
|
||||
Parameters:
|
||||
placeholder tensorflow input placeholder
|
||||
|
||||
placeholder tensorflow input placeholder
|
||||
|
||||
data input data to be (potentially) reshaped to be fed into placeholder
|
||||
|
||||
|
||||
Returns:
|
||||
reshaped data
|
||||
'''
|
||||
@@ -366,14 +366,14 @@ def adjust_shape(placeholder, data):
|
||||
return data
|
||||
if isinstance(data, list):
|
||||
data = np.array(data)
|
||||
|
||||
|
||||
placeholder_shape = [x or -1 for x in placeholder.shape.as_list()]
|
||||
|
||||
|
||||
assert _check_shape(placeholder_shape, data.shape), \
|
||||
'Shape of data {} is not compatible with shape of the placeholder {}'.format(data.shape, placeholder_shape)
|
||||
|
||||
return np.reshape(data, placeholder_shape)
|
||||
|
||||
return np.reshape(data, placeholder_shape)
|
||||
|
||||
|
||||
def _check_shape(placeholder_shape, data_shape):
|
||||
''' check if two shapes are compatible (i.e. differ only by dimensions of size 1, or by the batch dimension)'''
|
||||
@@ -381,7 +381,7 @@ def _check_shape(placeholder_shape, data_shape):
|
||||
return True
|
||||
squeezed_placeholder_shape = _squeeze_shape(placeholder_shape)
|
||||
squeezed_data_shape = _squeeze_shape(data_shape)
|
||||
|
||||
|
||||
for i, s_data in enumerate(squeezed_data_shape):
|
||||
s_placeholder = squeezed_placeholder_shape[i]
|
||||
if s_placeholder != -1 and s_data != s_placeholder:
|
||||
@@ -392,14 +392,26 @@ def _check_shape(placeholder_shape, data_shape):
|
||||
|
||||
def _squeeze_shape(shape):
|
||||
return [x for x in shape if x != 1]
|
||||
|
||||
|
||||
# ================================================================
|
||||
# Tensorboard interfacing
|
||||
# ================================================================
|
||||
|
||||
def launch_tensorboard_in_background(log_dir):
|
||||
from tensorboard import main as tb
|
||||
import threading
|
||||
tf.flags.FLAGS.logdir = log_dir
|
||||
t = threading.Thread(target=tb.main, args=([]))
|
||||
t.start()
|
||||
|
||||
'''
|
||||
To log the Tensorflow graph when using rl-algs
|
||||
algorithms, you can run the following code
|
||||
in your main script:
|
||||
import threading, time
|
||||
def start_tensorboard(session):
|
||||
time.sleep(10) # Wait until graph is setup
|
||||
tb_path = osp.join(logger.get_dir(), 'tb')
|
||||
summary_writer = tf.summary.FileWriter(tb_path, graph=session.graph)
|
||||
summary_op = tf.summary.merge_all()
|
||||
launch_tensorboard_in_background(tb_path)
|
||||
session = tf.get_default_session()
|
||||
t = threading.Thread(target=start_tensorboard, args=([session]))
|
||||
t.start()
|
||||
'''
|
||||
import subprocess
|
||||
subprocess.Popen(['tensorboard', '--logdir', log_dir])
|
||||
|
@@ -1,6 +1,5 @@
|
||||
from abc import ABC, abstractmethod
|
||||
from baselines import logger
|
||||
|
||||
from baselines.common.tile_images import tile_images
|
||||
|
||||
class AlreadySteppingError(Exception):
|
||||
"""
|
||||
@@ -33,6 +32,8 @@ class VecEnv(ABC):
|
||||
self.num_envs = num_envs
|
||||
self.observation_space = observation_space
|
||||
self.action_space = action_space
|
||||
self.closed = False
|
||||
self.viewer = None # For rendering
|
||||
|
||||
@abstractmethod
|
||||
def reset(self):
|
||||
@@ -72,13 +73,21 @@ class VecEnv(ABC):
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def close(self):
|
||||
def close_extras(self):
|
||||
"""
|
||||
Clean up the environments' resources.
|
||||
Clean up the extra resources, beyond what's in this base class.
|
||||
Only runs when not self.closed.
|
||||
"""
|
||||
pass
|
||||
|
||||
def close(self):
|
||||
if self.closed:
|
||||
return
|
||||
if self.viewer is not None:
|
||||
self.viewer.close()
|
||||
self.close_extras()
|
||||
self.closed = True
|
||||
|
||||
def step(self, actions):
|
||||
"""
|
||||
Step the environments synchronously.
|
||||
@@ -89,7 +98,20 @@ class VecEnv(ABC):
|
||||
return self.step_wait()
|
||||
|
||||
def render(self, mode='human'):
|
||||
logger.warn('Render not defined for %s' % self)
|
||||
imgs = self.get_images()
|
||||
bigimg = tile_images(imgs)
|
||||
if mode == 'human':
|
||||
self.get_viewer().imshow(bigimg)
|
||||
elif mode == 'rgb_array':
|
||||
return bigimg
|
||||
else:
|
||||
raise NotImplementedError
|
||||
|
||||
def get_images(self):
|
||||
"""
|
||||
Return RGB images from each environment
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
@property
|
||||
def unwrapped(self):
|
||||
@@ -98,6 +120,12 @@ class VecEnv(ABC):
|
||||
else:
|
||||
return self
|
||||
|
||||
def get_viewer(self):
|
||||
if self.viewer is None:
|
||||
from gym.envs.classic_control import rendering
|
||||
self.viewer = rendering.SimpleImageViewer()
|
||||
return self.viewer
|
||||
|
||||
|
||||
class VecEnvWrapper(VecEnv):
|
||||
"""
|
||||
@@ -126,9 +154,11 @@ class VecEnvWrapper(VecEnv):
|
||||
def close(self):
|
||||
return self.venv.close()
|
||||
|
||||
def render(self):
|
||||
self.venv.render()
|
||||
def render(self, mode='human'):
|
||||
return self.venv.render(mode=mode)
|
||||
|
||||
def get_images(self):
|
||||
return self.venv.get_images()
|
||||
|
||||
class CloudpickleWrapper(object):
|
||||
"""
|
||||
|
@@ -1,36 +1,43 @@
|
||||
import numpy as np
|
||||
from gym import spaces
|
||||
from . import VecEnv
|
||||
from .util import copy_obs_dict, dict_to_obs, obs_space_info
|
||||
|
||||
|
||||
class DummyVecEnv(VecEnv):
|
||||
"""
|
||||
A VecEnv that wraps raw gym.Envs.
|
||||
|
||||
This can be used when an algorithm requires a VecEnv
|
||||
but you want to use a vanilla gym.Env instance.
|
||||
It is also useful for avoiding IPC overhead when you
|
||||
don't need to run environments in parallel.
|
||||
"""
|
||||
|
||||
def __init__(self, env_fns):
|
||||
self.envs = [fn() for fn in env_fns]
|
||||
env = self.envs[0]
|
||||
VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
|
||||
obs_space = env.observation_space
|
||||
self.keys, shapes, dtypes = obs_space_info(obs_space)
|
||||
self.buf_obs = {k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys}
|
||||
|
||||
self.keys, shapes, dtypes = obs_space_info(obs_space)
|
||||
self.buf_obs = { k: np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k]) for k in self.keys }
|
||||
self.buf_dones = np.zeros((self.num_envs,), dtype=np.bool)
|
||||
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
|
||||
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
|
||||
self.buf_infos = [{} for _ in range(self.num_envs)]
|
||||
self.actions = None
|
||||
|
||||
def step_async(self, actions):
|
||||
self.actions = actions
|
||||
listify = True
|
||||
try:
|
||||
if len(actions) == self.num_envs:
|
||||
listify = False
|
||||
except TypeError:
|
||||
pass
|
||||
|
||||
if not listify:
|
||||
self.actions = actions
|
||||
else:
|
||||
assert self.num_envs == 1, "actions {} is either not a list or has a wrong size - cannot match to {} environments".format(actions, self.num_envs)
|
||||
self.actions = [actions]
|
||||
|
||||
def step_wait(self):
|
||||
for e in range(self.num_envs):
|
||||
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(self.actions[e])
|
||||
action = self.actions[e]
|
||||
if isinstance(self.envs[e].action_space, spaces.Discrete):
|
||||
action = int(action)
|
||||
|
||||
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(action)
|
||||
if self.buf_dones[e]:
|
||||
obs = self.envs[e].reset()
|
||||
self._save_obs(e, obs)
|
||||
@@ -44,11 +51,7 @@ class DummyVecEnv(VecEnv):
|
||||
return self._obs_from_buf()
|
||||
|
||||
def close(self):
|
||||
for e in self.envs:
|
||||
e.close()
|
||||
|
||||
def render(self, mode='human'):
|
||||
return [e.render(mode=mode) for e in self.envs]
|
||||
return
|
||||
|
||||
def _save_obs(self, e, obs):
|
||||
for k in self.keys:
|
||||
@@ -59,3 +62,7 @@ class DummyVecEnv(VecEnv):
|
||||
|
||||
def _obs_from_buf(self):
|
||||
return dict_to_obs(copy_obs_dict(self.buf_obs))
|
||||
|
||||
def get_images(self):
|
||||
return [env.render(mode='rgb_array') for env in self.envs]
|
||||
|
||||
|
@@ -7,7 +7,6 @@ import numpy as np
|
||||
from . import VecEnv, CloudpickleWrapper
|
||||
import ctypes
|
||||
from baselines import logger
|
||||
from baselines.common.tile_images import tile_images
|
||||
|
||||
from .util import dict_to_obs, obs_space_info, obs_to_dict
|
||||
|
||||
@@ -56,6 +55,7 @@ class ShmemVecEnv(VecEnv):
|
||||
proc.start()
|
||||
child_pipe.close()
|
||||
self.waiting_step = False
|
||||
self.viewer = None
|
||||
|
||||
def reset(self):
|
||||
if self.waiting_step:
|
||||
@@ -75,7 +75,7 @@ class ShmemVecEnv(VecEnv):
|
||||
obs, rews, dones, infos = zip(*outs)
|
||||
return self._decode_obses(obs), np.array(rews), np.array(dones), infos
|
||||
|
||||
def close(self):
|
||||
def close_extras(self):
|
||||
if self.waiting_step:
|
||||
self.step_wait()
|
||||
for pipe in self.parent_pipes:
|
||||
@@ -86,23 +86,15 @@ class ShmemVecEnv(VecEnv):
|
||||
for proc in self.procs:
|
||||
proc.join()
|
||||
|
||||
def render(self, mode='human'):
|
||||
def get_images(self, mode='human'):
|
||||
for pipe in self.parent_pipes:
|
||||
pipe.send(('render', None))
|
||||
imgs = [pipe.recv() for pipe in self.parent_pipes]
|
||||
bigimg = tile_images(imgs)
|
||||
if mode == 'human':
|
||||
import cv2
|
||||
cv2.imshow('vecenv', bigimg[:, :, ::-1])
|
||||
cv2.waitKey(1)
|
||||
elif mode == 'rgb_array':
|
||||
return bigimg
|
||||
else:
|
||||
raise NotImplementedError
|
||||
return [pipe.recv() for pipe in self.parent_pipes]
|
||||
|
||||
def _decode_obses(self, obs):
|
||||
result = {}
|
||||
for k in self.obs_keys:
|
||||
|
||||
bufs = [b[k] for b in self.obs_bufs]
|
||||
o = [np.frombuffer(b.get_obj(), dtype=self.obs_dtypes[k]).reshape(self.obs_shapes[k]) for b in bufs]
|
||||
result[k] = np.array(o)
|
||||
|
@@ -1,8 +1,6 @@
|
||||
import numpy as np
|
||||
from multiprocessing import Process, Pipe
|
||||
from . import VecEnv, CloudpickleWrapper
|
||||
from baselines.common.tile_images import tile_images
|
||||
|
||||
|
||||
def worker(remote, parent_remote, env_fn_wrapper):
|
||||
parent_remote.close()
|
||||
@@ -39,7 +37,6 @@ class SubprocVecEnv(VecEnv):
|
||||
envs: list of gym environments to run in subprocesses
|
||||
"""
|
||||
self.waiting = False
|
||||
self.closed = False
|
||||
nenvs = len(env_fns)
|
||||
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
|
||||
self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
|
||||
@@ -52,6 +49,7 @@ class SubprocVecEnv(VecEnv):
|
||||
|
||||
self.remotes[0].send(('get_spaces', None))
|
||||
observation_space, action_space = self.remotes[0].recv()
|
||||
self.viewer = None
|
||||
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
|
||||
|
||||
def step_async(self, actions):
|
||||
@@ -70,14 +68,7 @@ class SubprocVecEnv(VecEnv):
|
||||
remote.send(('reset', None))
|
||||
return np.stack([remote.recv() for remote in self.remotes])
|
||||
|
||||
def reset_task(self):
|
||||
for remote in self.remotes:
|
||||
remote.send(('reset_task', None))
|
||||
return np.stack([remote.recv() for remote in self.remotes])
|
||||
|
||||
def close(self):
|
||||
if self.closed:
|
||||
return
|
||||
def close_extras(self):
|
||||
if self.waiting:
|
||||
for remote in self.remotes:
|
||||
remote.recv()
|
||||
@@ -85,18 +76,9 @@ class SubprocVecEnv(VecEnv):
|
||||
remote.send(('close', None))
|
||||
for p in self.ps:
|
||||
p.join()
|
||||
self.closed = True
|
||||
|
||||
def render(self, mode='human'):
|
||||
def get_images(self):
|
||||
for pipe in self.remotes:
|
||||
pipe.send(('render', None))
|
||||
imgs = [pipe.recv() for pipe in self.remotes]
|
||||
bigimg = tile_images(imgs)
|
||||
if mode == 'human':
|
||||
import cv2
|
||||
cv2.imshow('vecenv', bigimg[:, :, ::-1])
|
||||
cv2.waitKey(1)
|
||||
elif mode == 'rgb_array':
|
||||
return bigimg
|
||||
else:
|
||||
raise NotImplementedError
|
||||
return imgs
|
||||
|
@@ -10,6 +10,39 @@ from .shmem_vec_env import ShmemVecEnv
|
||||
from .subproc_vec_env import SubprocVecEnv
|
||||
|
||||
|
||||
def assert_envs_equal(env1, env2, num_steps):
|
||||
"""
|
||||
Compare two environments over num_steps steps and make sure
|
||||
that the observations produced by each are the same when given
|
||||
the same actions.
|
||||
"""
|
||||
assert env1.num_envs == env2.num_envs
|
||||
assert env1.action_space.shape == env2.action_space.shape
|
||||
assert env1.action_space.dtype == env2.action_space.dtype
|
||||
joint_shape = (env1.num_envs,) + env1.action_space.shape
|
||||
|
||||
try:
|
||||
obs1, obs2 = env1.reset(), env2.reset()
|
||||
assert np.array(obs1).shape == np.array(obs2).shape
|
||||
assert np.array(obs1).shape == joint_shape
|
||||
assert np.allclose(obs1, obs2)
|
||||
np.random.seed(1337)
|
||||
for _ in range(num_steps):
|
||||
actions = np.array(np.random.randint(0, 0x100, size=joint_shape),
|
||||
dtype=env1.action_space.dtype)
|
||||
for env in [env1, env2]:
|
||||
env.step_async(actions)
|
||||
outs1 = env1.step_wait()
|
||||
outs2 = env2.step_wait()
|
||||
for out1, out2 in zip(outs1[:3], outs2[:3]):
|
||||
assert np.array(out1).shape == np.array(out2).shape
|
||||
assert np.allclose(out1, out2)
|
||||
assert list(outs1[3]) == list(outs2[3])
|
||||
finally:
|
||||
env1.close()
|
||||
env2.close()
|
||||
|
||||
|
||||
@pytest.mark.parametrize('klass', (ShmemVecEnv, SubprocVecEnv))
|
||||
@pytest.mark.parametrize('dtype', ('uint8', 'float32'))
|
||||
def test_vec_env(klass, dtype): # pylint: disable=R0914
|
||||
@@ -26,33 +59,14 @@ def test_vec_env(klass, dtype): # pylint: disable=R0914
|
||||
"""
|
||||
Get an environment constructor with a seed.
|
||||
"""
|
||||
return lambda: _SimpleEnv(seed, shape, dtype)
|
||||
return lambda: SimpleEnv(seed, shape, dtype)
|
||||
fns = [make_fn(i) for i in range(num_envs)]
|
||||
env1 = DummyVecEnv(fns)
|
||||
env2 = klass(fns)
|
||||
try:
|
||||
obs1, obs2 = env1.reset(), env2.reset()
|
||||
assert np.array(obs1).shape == np.array(obs2).shape
|
||||
assert np.allclose(obs1, obs2)
|
||||
np.random.seed(1337)
|
||||
for _ in range(num_steps):
|
||||
joint_shape = (len(fns),) + shape
|
||||
actions = np.array(np.random.randint(0, 0x100, size=joint_shape),
|
||||
dtype=dtype)
|
||||
for env in [env1, env2]:
|
||||
env.step_async(actions)
|
||||
outs1 = env1.step_wait()
|
||||
outs2 = env2.step_wait()
|
||||
for out1, out2 in zip(outs1[:3], outs2[:3]):
|
||||
assert np.array(out1).shape == np.array(out2).shape
|
||||
assert np.allclose(out1, out2)
|
||||
assert list(outs1[3]) == list(outs2[3])
|
||||
finally:
|
||||
env1.close()
|
||||
env2.close()
|
||||
assert_envs_equal(env1, env2, num_steps=num_steps)
|
||||
|
||||
|
||||
class _SimpleEnv(gym.Env):
|
||||
class SimpleEnv(gym.Env):
|
||||
"""
|
||||
An environment with a pre-determined observation space
|
||||
and RNG seed.
|
||||
@@ -66,7 +80,9 @@ class _SimpleEnv(gym.Env):
|
||||
self._max_steps = seed + 1
|
||||
self._cur_obs = None
|
||||
self._cur_step = 0
|
||||
self.action_space = gym.spaces.Box(low=0, high=100, shape=shape, dtype=dtype)
|
||||
# this is 0xFF instead of 0x100 because the Box space includes
|
||||
# the high end, while randint does not
|
||||
self.action_space = gym.spaces.Box(low=0, high=0xFF, shape=shape, dtype=dtype)
|
||||
self.observation_space = self.action_space
|
||||
|
||||
def step(self, action):
|
||||
|
@@ -9,44 +9,29 @@ Here's a list of commands to run to quickly get a working example:
|
||||
|
||||
```bash
|
||||
# Train model and save the results to cartpole_model.pkl
|
||||
python -m baselines.deepq.experiments.train_cartpole
|
||||
python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5
|
||||
# Load the model saved in cartpole_model.pkl and visualize the learned policy
|
||||
python -m baselines.deepq.experiments.enjoy_cartpole
|
||||
python -m baselines.run --alg=deepq --env=CartPole-v0 --load_path=./cartpole_model.pkl --num_timesteps=0 --play
|
||||
```
|
||||
|
||||
|
||||
Be sure to check out the source code of [both](experiments/train_cartpole.py) [files](experiments/enjoy_cartpole.py)!
|
||||
|
||||
## If you wish to apply DQN to solve a problem.
|
||||
|
||||
Check out our simple agent trained with one stop shop `deepq.learn` function.
|
||||
|
||||
- [baselines/deepq/experiments/train_cartpole.py](experiments/train_cartpole.py) - train a Cartpole agent.
|
||||
- [baselines/deepq/experiments/train_pong.py](experiments/train_pong.py) - train a Pong agent using convolutional neural networks.
|
||||
|
||||
In particular notice that once `deepq.learn` finishes training it returns `act` function which can be used to select actions in the environment. Once trained you can easily save it and load at later time. For both of the files listed above there are complimentary files `enjoy_cartpole.py` and `enjoy_pong.py` respectively, that load and visualize the learned policy.
|
||||
In particular notice that once `deepq.learn` finishes training it returns `act` function which can be used to select actions in the environment. Once trained you can easily save it and load at later time. Complimentary file `enjoy_cartpole.py` loads and visualizes the learned policy.
|
||||
|
||||
## If you wish to experiment with the algorithm
|
||||
|
||||
##### Check out the examples
|
||||
|
||||
|
||||
- [baselines/deepq/experiments/custom_cartpole.py](experiments/custom_cartpole.py) - Cartpole training with more fine grained control over the internals of DQN algorithm.
|
||||
- [baselines/deepq/experiments/run_atari.py](experiments/run_atari.py) - more robust setup for training at scale.
|
||||
|
||||
|
||||
##### Download a pretrained Atari agent
|
||||
|
||||
For some research projects it is sometimes useful to have an already trained agent handy. There's a variety of models to choose from. You can list them all by running:
|
||||
- [baselines/deepq/defaults.py](defaults.py) - settings for training on atari. Run
|
||||
|
||||
```bash
|
||||
python -m baselines.deepq.experiments.atari.download_model
|
||||
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4
|
||||
```
|
||||
to train on Atari Pong (see more in repo-wide [README.md](../../README.md#training-models))
|
||||
|
||||
Once you pick a model, you can download it and visualize the learned policy. Be sure to pass `--dueling` flag to visualization script when using dueling models.
|
||||
|
||||
```bash
|
||||
python -m baselines.deepq.experiments.atari.download_model --blob model-atari-duel-pong-1 --model-dir /tmp/models
|
||||
python -m baselines.deepq.experiments.atari.enjoy --model-dir /tmp/models/model-atari-duel-pong-1 --env Pong --dueling
|
||||
|
||||
```
|
||||
|
@@ -309,7 +309,7 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
|
||||
outputs=output_actions,
|
||||
givens={update_eps_ph: -1.0, stochastic_ph: True, reset_ph: False, update_param_noise_threshold_ph: False, update_param_noise_scale_ph: False},
|
||||
updates=updates)
|
||||
def act(ob, reset, update_param_noise_threshold, update_param_noise_scale, stochastic=True, update_eps=-1):
|
||||
def act(ob, reset=False, update_param_noise_threshold=False, update_param_noise_scale=False, stochastic=True, update_eps=-1):
|
||||
return _act(ob, stochastic, update_eps, reset, update_param_noise_threshold, update_param_noise_scale)
|
||||
return act
|
||||
|
||||
|
@@ -27,7 +27,7 @@ class ActWrapper(object):
|
||||
self.initial_state = None
|
||||
|
||||
@staticmethod
|
||||
def load_act(self, path):
|
||||
def load_act(path):
|
||||
with open(path, "rb") as f:
|
||||
model_data, act_params = cloudpickle.load(f)
|
||||
act = deepq.build_act(**act_params)
|
||||
@@ -70,6 +70,7 @@ class ActWrapper(object):
|
||||
|
||||
def save(self, path):
|
||||
save_state(path)
|
||||
self.save_act(path+".pickle")
|
||||
|
||||
|
||||
def load_act(path):
|
||||
@@ -194,8 +195,9 @@ def learn(env,
|
||||
# capture the shape outside the closure so that the env object is not serialized
|
||||
# by cloudpickle when serializing make_obs_ph
|
||||
|
||||
observation_space = env.observation_space
|
||||
def make_obs_ph(name):
|
||||
return ObservationInput(env.observation_space, name=name)
|
||||
return ObservationInput(observation_space, name=name)
|
||||
|
||||
act, train, update_target, debug = deepq.build_train(
|
||||
make_obs_ph=make_obs_ph,
|
||||
|
@@ -23,17 +23,15 @@ def main():
|
||||
env = make_atari(args.env)
|
||||
env = bench.Monitor(env, logger.get_dir())
|
||||
env = deepq.wrap_atari_dqn(env)
|
||||
model = deepq.models.cnn_to_mlp(
|
||||
convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
|
||||
hiddens=[256],
|
||||
dueling=bool(args.dueling),
|
||||
)
|
||||
|
||||
deepq.learn(
|
||||
env,
|
||||
q_func=model,
|
||||
"conv_only",
|
||||
convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],
|
||||
hiddens=[256],
|
||||
dueling=bool(args.dueling),
|
||||
lr=1e-4,
|
||||
max_timesteps=args.num_timesteps,
|
||||
total_timesteps=args.num_timesteps,
|
||||
buffer_size=10000,
|
||||
exploration_fraction=0.1,
|
||||
exploration_final_eps=0.01,
|
||||
|
@@ -11,12 +11,11 @@ def callback(lcl, _glb):
|
||||
|
||||
def main():
|
||||
env = gym.make("CartPole-v0")
|
||||
model = deepq.models.mlp([64])
|
||||
act = deepq.learn(
|
||||
env,
|
||||
q_func=model,
|
||||
network='mlp',
|
||||
lr=1e-3,
|
||||
max_timesteps=100000,
|
||||
total_timesteps=100000,
|
||||
buffer_size=50000,
|
||||
exploration_fraction=0.1,
|
||||
exploration_final_eps=0.02,
|
||||
|
@@ -2,5 +2,7 @@
|
||||
|
||||
- Original paper: https://arxiv.org/abs/1707.06347
|
||||
- Baselines blog post: https://blog.openai.com/openai-baselines-ppo/
|
||||
- `python -m baselines.ppo2.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
|
||||
- `python -m baselines.ppo2.run_mujoco` runs the algorithm for 1M frames on a Mujoco environment.
|
||||
|
||||
- `python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
|
||||
- `python -m baselines.run --alg=ppo2 --env=Ant-v2 --num_timesteps=1e6` runs the algorithm for 1M frames on a Mujoco Ant environment.
|
||||
- also refer to the repo-wide [README.md](../../README.md#training-models)
|
||||
|
@@ -163,7 +163,7 @@ def learn(*, network, env, total_timesteps, seed=None, nsteps=2048, ent_coef=0.0
|
||||
specifying the standard network architecture, or a function that takes tensorflow tensor as input and returns
|
||||
tuple (output_tensor, extra_feed) where output tensor is the last network layer output, extra_feed is None for feed-forward
|
||||
neural nets, and extra_feed is a dictionary describing how to feed state into the network for recurrent neural nets.
|
||||
See baselines.common/policies.py/lstm for more details on using recurrent nets in policies
|
||||
See common/models.py/lstm for more details on using recurrent nets in policies
|
||||
|
||||
env: baselines.common.vec_env.VecEnv environment. Needs to be vectorized for parallel environment simulation.
|
||||
The environments produced by gym.make can be wrapped using baselines.common.vec_env.DummyVecEnv class.
|
||||
@@ -189,7 +189,8 @@ def learn(*, network, env, total_timesteps, seed=None, nsteps=2048, ent_coef=0.0
|
||||
|
||||
log_interval: int number of timesteps between logging events
|
||||
|
||||
nminibatches: int number of training minibatches per update
|
||||
nminibatches: int number of training minibatches per update. For recurrent policies,
|
||||
should be smaller or equal than number of environments run in parallel.
|
||||
|
||||
noptepochs: int number of training epochs per update
|
||||
|
||||
@@ -226,10 +227,6 @@ def learn(*, network, env, total_timesteps, seed=None, nsteps=2048, ent_coef=0.0
|
||||
make_model = lambda : Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nbatch_act=nenvs, nbatch_train=nbatch_train,
|
||||
nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
|
||||
max_grad_norm=max_grad_norm)
|
||||
if save_interval and logger.get_dir():
|
||||
import cloudpickle
|
||||
with open(osp.join(logger.get_dir(), 'make_model.pkl'), 'wb') as fh:
|
||||
fh.write(cloudpickle.dumps(make_model))
|
||||
model = make_model()
|
||||
if load_path is not None:
|
||||
model.load(load_path)
|
||||
|
101
baselines/run.py
101
baselines/run.py
@@ -1,20 +1,19 @@
|
||||
import sys
|
||||
import multiprocessing
|
||||
import os
|
||||
import multiprocessing
|
||||
import os.path as osp
|
||||
import gym
|
||||
from collections import defaultdict
|
||||
import tensorflow as tf
|
||||
import numpy as np
|
||||
|
||||
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
|
||||
from baselines.common.cmd_util import common_arg_parser, parse_unknown_args, make_mujoco_env, make_atari_env
|
||||
from baselines.common.tf_util import save_state, load_state, get_session
|
||||
from baselines.common.cmd_util import common_arg_parser, parse_unknown_args, make_vec_env
|
||||
from baselines.common.tf_util import get_session
|
||||
from baselines import bench, logger
|
||||
from importlib import import_module
|
||||
|
||||
from baselines.common.vec_env.vec_normalize import VecNormalize
|
||||
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
|
||||
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
|
||||
from baselines.common import atari_wrappers, retro_wrappers
|
||||
|
||||
try:
|
||||
@@ -28,10 +27,10 @@ for env in gym.envs.registry.all():
|
||||
env_type = env._entry_point.split(':')[0].split('.')[-1]
|
||||
_game_envs[env_type].add(env.id)
|
||||
|
||||
# reading benchmark names directly from retro requires
|
||||
# importing retro here, and for some reason that crashes tensorflow
|
||||
# in ubuntu
|
||||
_game_envs['retro'] = set([
|
||||
# reading benchmark names directly from retro requires
|
||||
# importing retro here, and for some reason that crashes tensorflow
|
||||
# in ubuntu
|
||||
_game_envs['retro'] = {
|
||||
'BubbleBobble-Nes',
|
||||
'SuperMarioBros-Nes',
|
||||
'TwinBee3PokoPokoDaimaou-Nes',
|
||||
@@ -40,12 +39,12 @@ _game_envs['retro'] = set([
|
||||
'Vectorman-Genesis',
|
||||
'FinalFight-Snes',
|
||||
'SpaceInvaders-Snes',
|
||||
])
|
||||
}
|
||||
|
||||
|
||||
def train(args, extra_args):
|
||||
env_type, env_id = get_env_type(args.env)
|
||||
|
||||
|
||||
total_timesteps = int(args.num_timesteps)
|
||||
seed = args.seed
|
||||
|
||||
@@ -60,13 +59,11 @@ def train(args, extra_args):
|
||||
else:
|
||||
if alg_kwargs.get('network') is None:
|
||||
alg_kwargs['network'] = get_default_network(env_type)
|
||||
|
||||
|
||||
|
||||
|
||||
print('Training {} on {}:{} with arguments \n{}'.format(args.alg, env_type, env_id, alg_kwargs))
|
||||
|
||||
model = learn(
|
||||
env=env,
|
||||
env=env,
|
||||
seed=seed,
|
||||
total_timesteps=total_timesteps,
|
||||
**alg_kwargs
|
||||
@@ -75,30 +72,30 @@ def train(args, extra_args):
|
||||
return model, env
|
||||
|
||||
|
||||
def build_env(args, render=False):
|
||||
def build_env(args):
|
||||
ncpu = multiprocessing.cpu_count()
|
||||
if sys.platform == 'darwin': ncpu //= 2
|
||||
nenv = args.num_env or ncpu if not render else 1
|
||||
nenv = args.num_env or ncpu
|
||||
alg = args.alg
|
||||
rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
|
||||
seed = args.seed
|
||||
seed = args.seed
|
||||
|
||||
env_type, env_id = get_env_type(args.env)
|
||||
if env_type == 'mujoco':
|
||||
get_session(tf.ConfigProto(allow_soft_placement=True,
|
||||
intra_op_parallelism_threads=1,
|
||||
intra_op_parallelism_threads=1,
|
||||
inter_op_parallelism_threads=1))
|
||||
|
||||
if args.num_env:
|
||||
env = SubprocVecEnv([lambda: make_mujoco_env(env_id, seed + i if seed is not None else None, args.reward_scale) for i in range(args.num_env)])
|
||||
env = make_vec_env(env_id, env_type, nenv, seed, reward_scale=args.reward_scale)
|
||||
else:
|
||||
env = DummyVecEnv([lambda: make_mujoco_env(env_id, seed, args.reward_scale)])
|
||||
env = make_vec_env(env_id, env_type, 1, seed, reward_scale=args.reward_scale)
|
||||
|
||||
env = VecNormalize(env)
|
||||
|
||||
elif env_type == 'atari':
|
||||
if alg == 'acer':
|
||||
env = make_atari_env(env_id, nenv, seed)
|
||||
env = make_vec_env(env_id, env_type, nenv, seed)
|
||||
elif alg == 'deepq':
|
||||
env = atari_wrappers.make_atari(env_id)
|
||||
env.seed(seed)
|
||||
@@ -113,49 +110,56 @@ def build_env(args, render=False):
|
||||
env.seed(seed)
|
||||
else:
|
||||
frame_stack_size = 4
|
||||
env = VecFrameStack(make_atari_env(env_id, nenv, seed), frame_stack_size)
|
||||
env = VecFrameStack(make_vec_env(env_id, env_type, nenv, seed), frame_stack_size)
|
||||
|
||||
elif env_type == 'retro':
|
||||
import retro
|
||||
gamestate = args.gamestate or 'Level1-1'
|
||||
env = retro_wrappers.make_retro(game=args.env, state=gamestate, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE)
|
||||
env = retro_wrappers.make_retro(game=args.env, state=gamestate, max_episode_steps=10000,
|
||||
use_restricted_actions=retro.Actions.DISCRETE)
|
||||
env.seed(args.seed)
|
||||
env = bench.Monitor(env, logger.get_dir())
|
||||
env = retro_wrappers.wrap_deepmind_retro(env)
|
||||
|
||||
elif env_type == 'classic':
|
||||
|
||||
elif env_type == 'classic_control':
|
||||
def make_env():
|
||||
e = gym.make(env_id)
|
||||
e = bench.Monitor(e, logger.get_dir(), allow_early_resets=True)
|
||||
e.seed(seed)
|
||||
return e
|
||||
|
||||
|
||||
env = DummyVecEnv([make_env])
|
||||
|
||||
|
||||
else:
|
||||
raise ValueError('Unknown env_type {}'.format(env_type))
|
||||
|
||||
return env
|
||||
|
||||
|
||||
def get_env_type(env_id):
|
||||
if env_id in _game_envs.keys():
|
||||
env_type = env_id
|
||||
env_id = [g for g in _game_envs[env_type]][0]
|
||||
env_id = [g for g in _game_envs[env_type]][0]
|
||||
else:
|
||||
env_type = None
|
||||
for g, e in _game_envs.items():
|
||||
if env_id in e:
|
||||
env_type = g
|
||||
break
|
||||
break
|
||||
assert env_type is not None, 'env_id {} is not recognized in env types'.format(env_id, _game_envs.keys())
|
||||
|
||||
return env_type, env_id
|
||||
|
||||
|
||||
def get_default_network(env_type):
|
||||
if env_type == 'mujoco' or env_type=='classic':
|
||||
if env_type == 'mujoco' or env_type == 'classic_control':
|
||||
return 'mlp'
|
||||
if env_type == 'atari':
|
||||
return 'cnn'
|
||||
|
||||
raise ValueError('Unknown env_type {}'.format(env_type))
|
||||
|
||||
|
||||
|
||||
def get_alg_module(alg, submodule=None):
|
||||
submodule = submodule or alg
|
||||
try:
|
||||
@@ -164,46 +168,47 @@ def get_alg_module(alg, submodule=None):
|
||||
except ImportError:
|
||||
# then from rl_algs
|
||||
alg_module = import_module('.'.join(['rl_' + 'algs', alg, submodule]))
|
||||
|
||||
|
||||
return alg_module
|
||||
|
||||
|
||||
|
||||
def get_learn_function(alg):
|
||||
return get_alg_module(alg).learn
|
||||
|
||||
|
||||
def get_learn_function_defaults(alg, env_type):
|
||||
try:
|
||||
alg_defaults = get_alg_module(alg, 'defaults')
|
||||
kwargs = getattr(alg_defaults, env_type)()
|
||||
except (ImportError, AttributeError):
|
||||
kwargs = {}
|
||||
kwargs = {}
|
||||
return kwargs
|
||||
|
||||
def parse(v):
|
||||
|
||||
|
||||
def parse(v):
|
||||
'''
|
||||
convert value of a command-line arg to a python object if possible, othewise, keep as string
|
||||
'''
|
||||
|
||||
assert isinstance(v, str)
|
||||
try:
|
||||
return eval(v)
|
||||
except (NameError, SyntaxError):
|
||||
return eval(v)
|
||||
except (NameError, SyntaxError):
|
||||
return v
|
||||
|
||||
|
||||
def main():
|
||||
# configure logger, disable logging in child MPI processes (with rank > 0)
|
||||
|
||||
# configure logger, disable logging in child MPI processes (with rank > 0)
|
||||
|
||||
arg_parser = common_arg_parser()
|
||||
args, unknown_args = arg_parser.parse_known_args()
|
||||
extra_args = {k: parse(v) for k,v in parse_unknown_args(unknown_args).items()}
|
||||
extra_args = {k: parse(v) for k, v in parse_unknown_args(unknown_args).items()}
|
||||
|
||||
|
||||
if MPI is None or MPI.COMM_WORLD.Get_rank() == 0:
|
||||
rank = 0
|
||||
logger.configure()
|
||||
else:
|
||||
logger.configure(format_strs = [])
|
||||
logger.configure(format_strs=[])
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
|
||||
model, _ = train(args, extra_args)
|
||||
@@ -211,19 +216,19 @@ def main():
|
||||
if args.save_path is not None and rank == 0:
|
||||
save_path = osp.expanduser(args.save_path)
|
||||
model.save(save_path)
|
||||
|
||||
|
||||
if args.play:
|
||||
logger.log("Running trained model")
|
||||
env = build_env(args, render=True)
|
||||
env = build_env(args)
|
||||
obs = env.reset()
|
||||
while True:
|
||||
actions = model.step(obs)[0]
|
||||
obs, _, done, _ = env.step(actions)
|
||||
obs, _, done, _ = env.step(actions)
|
||||
env.render()
|
||||
done = done.any() if isinstance(done, np.ndarray) else done
|
||||
|
||||
if done:
|
||||
obs = env.reset()
|
||||
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
|
@@ -2,5 +2,6 @@
|
||||
|
||||
- Original paper: https://arxiv.org/abs/1502.05477
|
||||
- Baselines blog post https://blog.openai.com/openai-baselines-ppo/
|
||||
- `mpirun -np 16 python -m baselines.trpo_mpi.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
|
||||
- `python -m baselines.trpo_mpi.run_mujoco` runs the algorithm for 1M timesteps on a Mujoco environment.
|
||||
- `mpirun -np 16 python -m baselines.run --alg=trpo_mpi --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
|
||||
- `python -m baselines.run --alg=trpo_mpi --env=Ant-v2 --num_timesteps=1e6` runs the algorithm for 1M timesteps on a Mujoco Ant environment.
|
||||
- also refer to the repo-wide [README.md](../../README.md#training-models)
|
||||
|
@@ -1,4 +1,4 @@
|
||||
from rl_common.models import mlp, cnn_small
|
||||
from baselines.common.models import mlp, cnn_small
|
||||
|
||||
|
||||
def atari():
|
||||
|
19
conftest.py
19
conftest.py
@@ -1,19 +0,0 @@
|
||||
import pytest
|
||||
|
||||
|
||||
def pytest_addoption(parser):
|
||||
parser.addoption('--runslow', action='store_true', default=False, help='run slow tests')
|
||||
|
||||
|
||||
def pytest_collection_modifyitems(config, items):
|
||||
if config.getoption('--runslow'):
|
||||
# --runslow given in cli: do not skip slow tests
|
||||
return
|
||||
skip_slow = pytest.mark.skip(reason='need --runslow option to run')
|
||||
slow_tests = []
|
||||
for item in items:
|
||||
if 'slow' in item.keywords:
|
||||
slow_tests.append(item.name)
|
||||
item.add_marker(skip_slow)
|
||||
|
||||
print('skipping slow tests', ' '.join(slow_tests), 'use --runslow to run this')
|
15
setup.cfg
Normal file
15
setup.cfg
Normal file
@@ -0,0 +1,15 @@
|
||||
[flake8]
|
||||
select = F,E999
|
||||
exclude =
|
||||
.git,
|
||||
__pycache__,
|
||||
baselines/her,
|
||||
baselines/ddpg,
|
||||
baselines/ppo1,
|
||||
baselines/bench,
|
||||
baselines/acktr,
|
||||
|
||||
|
||||
|
||||
|
||||
|
31
setup.py
31
setup.py
@@ -6,6 +6,20 @@ if sys.version_info.major != 3:
|
||||
'Python {}. The installation will likely fail.'.format(sys.version_info.major))
|
||||
|
||||
|
||||
extras = {
|
||||
'test': [
|
||||
'filelock',
|
||||
'pytest'
|
||||
]
|
||||
}
|
||||
|
||||
|
||||
all_deps = []
|
||||
for group_name in extras:
|
||||
all_deps += extras[group_name]
|
||||
|
||||
extras['all'] = all_deps
|
||||
|
||||
setup(name='baselines',
|
||||
packages=[package for package in find_packages()
|
||||
if package.startswith('baselines')],
|
||||
@@ -18,18 +32,21 @@ setup(name='baselines',
|
||||
'progressbar2',
|
||||
'mpi4py',
|
||||
'cloudpickle',
|
||||
'tensorflow>=1.4.0',
|
||||
'click',
|
||||
'opencv-python'
|
||||
],
|
||||
extras_require={
|
||||
'test': [
|
||||
'filelock',
|
||||
'pytest'
|
||||
]
|
||||
},
|
||||
extras_require=extras,
|
||||
description='OpenAI baselines: high quality implementations of reinforcement learning algorithms',
|
||||
author='OpenAI',
|
||||
url='https://github.com/openai/baselines',
|
||||
author_email='gym@openai.com',
|
||||
version='0.1.5')
|
||||
|
||||
|
||||
# ensure there is some tensorflow build with version above 1.4
|
||||
try:
|
||||
from distutils.version import StrictVersion
|
||||
import tensorflow
|
||||
assert StrictVersion(tensorflow.__version__) >= StrictVersion('1.4.0')
|
||||
except ImportError:
|
||||
assert False, "TensorFlow needed, of version above 1.4"
|
||||
|
Reference in New Issue
Block a user