Compare commits
77 Commits
peterz_viz
...
fix998
Author | SHA1 | Date | |
---|---|---|---|
|
0e423a0108 | ||
|
229a772b81 | ||
|
d80b075904 | ||
|
0182fe1877 | ||
|
1fb4dfb780 | ||
|
7cadef715f | ||
|
fce4370ba2 | ||
|
c57528573e | ||
|
2bca7901f5 | ||
|
ba2b017820 | ||
|
7c520852d9 | ||
|
1c872ca8fd | ||
|
ff8d36a7a7 | ||
|
7614b02f7a | ||
|
f7d5a265e1 | ||
|
21776e8f57 | ||
|
9b68103b73 | ||
|
3301089b48 | ||
|
a07fad9066 | ||
|
5d8041d18e | ||
|
fa37beb52e | ||
|
8a97e0df10 | ||
|
fabbf2c611 | ||
|
5d285b318f | ||
|
49a99c7d23 | ||
|
c79b3373bf | ||
|
6d1c6c78d3 | ||
|
62a9c76f18 | ||
|
282c9cc91f | ||
|
096f4d9cf0 | ||
|
16136ddca7 | ||
|
b1644157d6 | ||
|
58541db226 | ||
|
c02b575f01 | ||
|
897fa31548 | ||
|
d51f8be8f9 | ||
|
3f2f45acef | ||
|
b64974eb90 | ||
|
1b092434fc | ||
|
1259f6ab25 | ||
|
74101a9f24 | ||
|
90d66776a4 | ||
|
b875fb7b5e | ||
|
675b100190 | ||
|
adc4388f6b | ||
|
5b41c926c7 | ||
|
ab02fae71d | ||
|
b55eda1dde | ||
|
57e05eb420 | ||
|
01ab1d8ef7 | ||
|
73683435ff | ||
|
4d0746b957 | ||
|
5115707ce9 | ||
|
6c44fb28fe | ||
|
146bbf886b | ||
|
f3a5abaeeb | ||
|
97e039127f | ||
|
25ecb64821 | ||
|
7dc6bc7c70 | ||
|
7139a66d33 | ||
|
8607dca99e | ||
|
9f9835fe38 | ||
|
d3fed181b5 | ||
|
339d5640b9 | ||
|
a75bc37a40 | ||
|
87b3a04a38 | ||
|
c5b1a1b643 | ||
|
c59a10947d | ||
|
5cd66010dc | ||
|
0a13da8dfe | ||
|
18b6390be6 | ||
|
52255beda5 | ||
|
d80acbb4d1 | ||
|
556b198454 | ||
|
cc88804042 | ||
|
c14d307834 | ||
|
0b71d4c6c4 |
@@ -11,4 +11,4 @@ install:
|
||||
|
||||
script:
|
||||
- flake8 . --show-source --statistics
|
||||
- docker run baselines-test pytest -v --forked .
|
||||
- docker run -e RUNSLOW=1 baselines-test pytest -v .
|
||||
|
24
README.md
24
README.md
@@ -1,3 +1,5 @@
|
||||
**Status:** Active (under active development, breaking changes may occur)
|
||||
|
||||
<img src="data/logo.jpg" width=25% align="right" /> [](https://travis-ci.org/openai/baselines)
|
||||
|
||||
# Baselines
|
||||
@@ -37,6 +39,9 @@ To activate a virtualenv:
|
||||
More thorough tutorial on virtualenvs and options can be found [here](https://virtualenv.pypa.io/en/stable/)
|
||||
|
||||
|
||||
## Tensorflow versions
|
||||
The master branch supports Tensorflow from version 1.4 to 1.14. For Tensorflow 2.0 support, please use tf-2 branch.
|
||||
|
||||
## Installation
|
||||
- Clone the repo and cd into it:
|
||||
```bash
|
||||
@@ -87,7 +92,7 @@ python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timeste
|
||||
will set entropy coefficient to 0.1, and construct fully connected network with 3 layers with 32 hidden units in each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same)
|
||||
|
||||
See docstrings in [common/models.py](baselines/common/models.py) for description of network parameters for each type of model, and
|
||||
docstring for [baselines/ppo2/ppo2.py/learn()](baselines/ppo2/ppo2.py#L152) for the description of the ppo2 hyperparamters.
|
||||
docstring for [baselines/ppo2/ppo2.py/learn()](baselines/ppo2/ppo2.py#L152) for the description of the ppo2 hyperparameters.
|
||||
|
||||
### Example 2. DQN on Atari
|
||||
DQN with Atari is at this point a classics of benchmarks. To run the baselines implementation of DQN on Atari Pong:
|
||||
@@ -96,6 +101,8 @@ python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
|
||||
```
|
||||
|
||||
## Saving, loading and visualizing models
|
||||
|
||||
### Saving and loading the model
|
||||
The algorithms serialization API is not properly unified yet; however, there is a simple method to save / restore trained models.
|
||||
`--save_path` and `--load_path` command-line option loads the tensorflow state from a given path before training, and saves it after the training, respectively.
|
||||
Let's imagine you'd like to train ppo2 on Atari Pong, save the model and then later visualize what has it learnt.
|
||||
@@ -107,10 +114,19 @@ This should get to the mean reward per episode about 20. To load and visualize t
|
||||
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
|
||||
```
|
||||
|
||||
*NOTE:* At the moment Mujoco training uses VecNormalize wrapper for the environment which is not being saved correctly; so loading the models trained on Mujoco will not work well if the environment is recreated. If necessary, you can work around that by replacing RunningMeanStd by TfRunningMeanStd in [baselines/common/vec_env/vec_normalize.py](baselines/common/vec_env/vec_normalize.py#L12). This way, mean and std of environment normalizing wrapper will be saved in tensorflow variables and included in the model file; however, training is slower that way - hence not including it by default
|
||||
*NOTE:* Mujoco environments require normalization to work properly, so we wrap them with VecNormalize wrapper. Currently, to ensure the models are saved with normalization (so that trained models can be restored and run without further training) the normalization coefficients are saved as tensorflow variables. This can decrease the performance somewhat, so if you require high-throughput steps with Mujoco and do not need saving/restoring the models, it may make sense to use numpy normalization instead. To do that, set 'use_tf=False` in [baselines/run.py](baselines/run.py#L116).
|
||||
|
||||
## Loading and vizualizing learning curves and other training metrics
|
||||
See [here](docs/viz/viz.ipynb) for instructions on how to load and display the training data.
|
||||
### Logging and vizualizing learning curves and other training metrics
|
||||
By default, all summary data, including progress, standard output, is saved to a unique directory in a temp folder, specified by a call to Python's [tempfile.gettempdir()](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir).
|
||||
The directory can be changed with the `--log_path` command-line option.
|
||||
```bash
|
||||
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2 --log_path=~/logs/Pong/
|
||||
```
|
||||
*NOTE:* Please be aware that the logger will overwrite files of the same name in an existing directory, thus it's recommended that folder names be given a unique timestamp to prevent overwritten logs.
|
||||
|
||||
Another way the temp directory can be changed is through the use of the `$OPENAI_LOGDIR` environment variable.
|
||||
|
||||
For examples on how to load and display the training data, see [here](docs/viz/viz.ipynb).
|
||||
|
||||
## Subpackages
|
||||
|
||||
|
@@ -11,6 +11,8 @@ from baselines.common.policies import build_policy
|
||||
|
||||
from baselines.a2c.utils import Scheduler, find_trainable_variables
|
||||
from baselines.a2c.runner import Runner
|
||||
from baselines.ppo2.ppo2 import safemean
|
||||
from collections import deque
|
||||
|
||||
from tensorflow import losses
|
||||
|
||||
@@ -195,6 +197,7 @@ def learn(
|
||||
|
||||
# Instantiate the runner object
|
||||
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
|
||||
epinfobuf = deque(maxlen=100)
|
||||
|
||||
# Calculate the batch_size
|
||||
nbatch = nenvs*nsteps
|
||||
@@ -204,7 +207,8 @@ def learn(
|
||||
|
||||
for update in range(1, total_timesteps//nbatch+1):
|
||||
# Get mini batch of experiences
|
||||
obs, states, rewards, masks, actions, values = runner.run()
|
||||
obs, states, rewards, masks, actions, values, epinfos = runner.run()
|
||||
epinfobuf.extend(epinfos)
|
||||
|
||||
policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
|
||||
nseconds = time.time()-tstart
|
||||
@@ -221,6 +225,8 @@ def learn(
|
||||
logger.record_tabular("policy_entropy", float(policy_entropy))
|
||||
logger.record_tabular("value_loss", float(value_loss))
|
||||
logger.record_tabular("explained_variance", float(ev))
|
||||
logger.record_tabular("eprewmean", safemean([epinfo['r'] for epinfo in epinfobuf]))
|
||||
logger.record_tabular("eplenmean", safemean([epinfo['l'] for epinfo in epinfobuf]))
|
||||
logger.dump_tabular()
|
||||
return model
|
||||
|
||||
|
@@ -22,6 +22,7 @@ class Runner(AbstractEnvRunner):
|
||||
# We initialize the lists that will contain the mb of experiences
|
||||
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [],[],[],[],[]
|
||||
mb_states = self.states
|
||||
epinfos = []
|
||||
for n in range(self.nsteps):
|
||||
# Given observations, take action and value (V(s))
|
||||
# We already have self.obs because Runner superclass run self.obs[:] = env.reset() on init
|
||||
@@ -34,12 +35,12 @@ class Runner(AbstractEnvRunner):
|
||||
mb_dones.append(self.dones)
|
||||
|
||||
# Take actions in env and look the results
|
||||
obs, rewards, dones, _ = self.env.step(actions)
|
||||
obs, rewards, dones, infos = self.env.step(actions)
|
||||
for info in infos:
|
||||
maybeepinfo = info.get('episode')
|
||||
if maybeepinfo: epinfos.append(maybeepinfo)
|
||||
self.states = states
|
||||
self.dones = dones
|
||||
for n, done in enumerate(dones):
|
||||
if done:
|
||||
self.obs[n] = self.obs[n]*0
|
||||
self.obs = obs
|
||||
mb_rewards.append(rewards)
|
||||
mb_dones.append(self.dones)
|
||||
@@ -72,4 +73,4 @@ class Runner(AbstractEnvRunner):
|
||||
mb_rewards = mb_rewards.flatten()
|
||||
mb_values = mb_values.flatten()
|
||||
mb_masks = mb_masks.flatten()
|
||||
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values
|
||||
return mb_obs, mb_states, mb_rewards, mb_masks, mb_actions, mb_values, epinfos
|
||||
|
@@ -75,8 +75,8 @@ class Model(object):
|
||||
train_ob_placeholder = tf.placeholder(dtype=ob_space.dtype, shape=(nenvs*(nsteps+1),) + ob_space.shape)
|
||||
with tf.variable_scope('acer_model', reuse=tf.AUTO_REUSE):
|
||||
|
||||
step_model = policy(observ_placeholder=step_ob_placeholder, sess=sess)
|
||||
train_model = policy(observ_placeholder=train_ob_placeholder, sess=sess)
|
||||
step_model = policy(nbatch=nenvs, nsteps=1, observ_placeholder=step_ob_placeholder, sess=sess)
|
||||
train_model = policy(nbatch=nbatch, nsteps=nsteps, observ_placeholder=train_ob_placeholder, sess=sess)
|
||||
|
||||
|
||||
params = find_trainable_variables("acer_model")
|
||||
@@ -94,7 +94,7 @@ class Model(object):
|
||||
return v
|
||||
|
||||
with tf.variable_scope("acer_model", custom_getter=custom_getter, reuse=True):
|
||||
polyak_model = policy(observ_placeholder=train_ob_placeholder, sess=sess)
|
||||
polyak_model = policy(nbatch=nbatch, nsteps=nsteps, observ_placeholder=train_ob_placeholder, sess=sess)
|
||||
|
||||
# Notation: (var) = batch variable, (var)s = seqeuence variable, (var)_i = variable index by action at step i
|
||||
|
||||
|
@@ -11,6 +11,8 @@ from baselines.common.tf_util import get_session, save_variables, load_variables
|
||||
from baselines.a2c.runner import Runner
|
||||
from baselines.a2c.utils import Scheduler, find_trainable_variables
|
||||
from baselines.acktr import kfac
|
||||
from baselines.ppo2.ppo2 import safemean
|
||||
from collections import deque
|
||||
|
||||
|
||||
class Model(object):
|
||||
@@ -90,7 +92,7 @@ class Model(object):
|
||||
self.initial_state = step_model.initial_state
|
||||
tf.global_variables_initializer().run(session=sess)
|
||||
|
||||
def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=1, nprocs=32, nsteps=20,
|
||||
def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interval=100, nprocs=32, nsteps=20,
|
||||
ent_coef=0.01, vf_coef=0.5, vf_fisher_coef=1.0, lr=0.25, max_grad_norm=0.5,
|
||||
kfac_clip=0.001, save_interval=None, lrschedule='linear', load_path=None, is_async=True, **network_kwargs):
|
||||
set_global_seeds(seed)
|
||||
@@ -118,6 +120,7 @@ def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interva
|
||||
model.load(load_path)
|
||||
|
||||
runner = Runner(env, model, nsteps=nsteps, gamma=gamma)
|
||||
epinfobuf = deque(maxlen=100)
|
||||
nbatch = nenvs*nsteps
|
||||
tstart = time.time()
|
||||
coord = tf.train.Coordinator()
|
||||
@@ -127,7 +130,8 @@ def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interva
|
||||
enqueue_threads = []
|
||||
|
||||
for update in range(1, total_timesteps//nbatch+1):
|
||||
obs, states, rewards, masks, actions, values = runner.run()
|
||||
obs, states, rewards, masks, actions, values, epinfos = runner.run()
|
||||
epinfobuf.extend(epinfos)
|
||||
policy_loss, value_loss, policy_entropy = model.train(obs, states, rewards, masks, actions, values)
|
||||
model.old_obs = obs
|
||||
nseconds = time.time()-tstart
|
||||
@@ -141,6 +145,8 @@ def learn(network, env, seed, total_timesteps=int(40e6), gamma=0.99, log_interva
|
||||
logger.record_tabular("policy_loss", float(policy_loss))
|
||||
logger.record_tabular("value_loss", float(value_loss))
|
||||
logger.record_tabular("explained_variance", float(ev))
|
||||
logger.record_tabular("eprewmean", safemean([epinfo['r'] for epinfo in epinfobuf]))
|
||||
logger.record_tabular("eplenmean", safemean([epinfo['l'] for epinfo in epinfobuf]))
|
||||
logger.dump_tabular()
|
||||
|
||||
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir():
|
||||
|
@@ -11,7 +11,7 @@ KFAC_DEBUG = False
|
||||
|
||||
|
||||
class KfacOptimizer():
|
||||
|
||||
# note that KfacOptimizer will be truly synchronous (and thus deterministic) only if a single-threaded session is used
|
||||
def __init__(self, learning_rate=0.01, momentum=0.9, clip_kl=0.01, kfac_update=2, stats_accum_iter=60, full_stats_init=False, cold_iter=100, cold_lr=None, is_async=False, async_stats=False, epsilon=1e-2, stats_decay=0.95, blockdiag_bias=False, channel_fac=False, factored_damping=False, approxT2=False, use_float64=False, weight_decay_dict={},max_grad_norm=0.5):
|
||||
self.max_grad_norm = max_grad_norm
|
||||
self._lr = learning_rate
|
||||
|
@@ -1,2 +1,3 @@
|
||||
# flake8: noqa F403
|
||||
from baselines.bench.benchmarks import *
|
||||
from baselines.bench.monitor import *
|
||||
|
@@ -1,5 +1,4 @@
|
||||
import re
|
||||
import os.path as osp
|
||||
import os
|
||||
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
|
||||
|
||||
@@ -20,7 +19,7 @@ def register_benchmark(benchmark):
|
||||
if 'tasks' in benchmark:
|
||||
for t in benchmark['tasks']:
|
||||
if 'desc' not in t:
|
||||
t['desc'] = remove_version_re.sub('', t['env_id'])
|
||||
t['desc'] = remove_version_re.sub('', t.get('env_id', t.get('id')))
|
||||
_BENCHMARKS.append(benchmark)
|
||||
|
||||
|
||||
@@ -156,9 +155,10 @@ register_benchmark({
|
||||
|
||||
# HER DDPG
|
||||
|
||||
_fetch_tasks = ['FetchReach-v1', 'FetchPush-v1', 'FetchSlide-v1']
|
||||
register_benchmark({
|
||||
'name': 'HerDdpg',
|
||||
'description': 'Smoke-test only benchmark of HER',
|
||||
'tasks': [{'trials': 1, 'env_id': 'FetchReach-v1'}]
|
||||
'name': 'Fetch1M',
|
||||
'description': 'Fetch* benchmarks for 1M timesteps',
|
||||
'tasks': [{'trials': 6, 'env_id': env_id, 'num_timesteps': int(1e6)} for env_id in _fetch_tasks]
|
||||
})
|
||||
|
||||
|
@@ -1,13 +1,11 @@
|
||||
__all__ = ['Monitor', 'get_monitor_files', 'load_results']
|
||||
|
||||
import gym
|
||||
from gym.core import Wrapper
|
||||
import time
|
||||
from glob import glob
|
||||
import csv
|
||||
import os.path as osp
|
||||
import json
|
||||
import numpy as np
|
||||
|
||||
class Monitor(Wrapper):
|
||||
EXT = "monitor.csv"
|
||||
@@ -16,11 +14,13 @@ class Monitor(Wrapper):
|
||||
def __init__(self, env, filename, allow_early_resets=False, reset_keywords=(), info_keywords=()):
|
||||
Wrapper.__init__(self, env=env)
|
||||
self.tstart = time.time()
|
||||
self.results_writer = ResultsWriter(
|
||||
filename,
|
||||
header={"t_start": time.time(), 'env_id' : env.spec and env.spec.id},
|
||||
extra_keys=reset_keywords + info_keywords
|
||||
)
|
||||
if filename:
|
||||
self.results_writer = ResultsWriter(filename,
|
||||
header={"t_start": time.time(), 'env_id' : env.spec and env.spec.id},
|
||||
extra_keys=reset_keywords + info_keywords
|
||||
)
|
||||
else:
|
||||
self.results_writer = None
|
||||
self.reset_keywords = reset_keywords
|
||||
self.info_keywords = info_keywords
|
||||
self.allow_early_resets = allow_early_resets
|
||||
@@ -68,8 +68,9 @@ class Monitor(Wrapper):
|
||||
self.episode_lengths.append(eplen)
|
||||
self.episode_times.append(time.time() - self.tstart)
|
||||
epinfo.update(self.current_reset_info)
|
||||
self.results_writer.write_row(epinfo)
|
||||
|
||||
if self.results_writer:
|
||||
self.results_writer.write_row(epinfo)
|
||||
assert isinstance(info, dict)
|
||||
if isinstance(info, dict):
|
||||
info['episode'] = epinfo
|
||||
|
||||
@@ -96,24 +97,21 @@ class LoadMonitorResultsError(Exception):
|
||||
|
||||
|
||||
class ResultsWriter(object):
|
||||
def __init__(self, filename=None, header='', extra_keys=()):
|
||||
def __init__(self, filename, header='', extra_keys=()):
|
||||
self.extra_keys = extra_keys
|
||||
if filename is None:
|
||||
self.f = None
|
||||
self.logger = None
|
||||
else:
|
||||
if not filename.endswith(Monitor.EXT):
|
||||
if osp.isdir(filename):
|
||||
filename = osp.join(filename, Monitor.EXT)
|
||||
else:
|
||||
filename = filename + "." + Monitor.EXT
|
||||
self.f = open(filename, "wt")
|
||||
if isinstance(header, dict):
|
||||
header = '# {} \n'.format(json.dumps(header))
|
||||
self.f.write(header)
|
||||
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+tuple(extra_keys))
|
||||
self.logger.writeheader()
|
||||
self.f.flush()
|
||||
assert filename is not None
|
||||
if not filename.endswith(Monitor.EXT):
|
||||
if osp.isdir(filename):
|
||||
filename = osp.join(filename, Monitor.EXT)
|
||||
else:
|
||||
filename = filename + "." + Monitor.EXT
|
||||
self.f = open(filename, "wt")
|
||||
if isinstance(header, dict):
|
||||
header = '# {} \n'.format(json.dumps(header))
|
||||
self.f.write(header)
|
||||
self.logger = csv.DictWriter(self.f, fieldnames=('r', 'l', 't')+tuple(extra_keys))
|
||||
self.logger.writeheader()
|
||||
self.f.flush()
|
||||
|
||||
def write_row(self, epinfo):
|
||||
if self.logger:
|
||||
@@ -121,7 +119,6 @@ class ResultsWriter(object):
|
||||
self.f.flush()
|
||||
|
||||
|
||||
|
||||
def get_monitor_files(dir):
|
||||
return glob(osp.join(dir, "*" + Monitor.EXT))
|
||||
|
||||
@@ -163,27 +160,3 @@ def load_results(dir):
|
||||
df['t'] -= min(header['t_start'] for header in headers)
|
||||
df.headers = headers # HACK to preserve backwards compatibility
|
||||
return df
|
||||
|
||||
def test_monitor():
|
||||
env = gym.make("CartPole-v1")
|
||||
env.seed(0)
|
||||
mon_file = "/tmp/baselines-test-%s.monitor.csv" % uuid.uuid4()
|
||||
menv = Monitor(env, mon_file)
|
||||
menv.reset()
|
||||
for _ in range(1000):
|
||||
_, _, done, _ = menv.step(0)
|
||||
if done:
|
||||
menv.reset()
|
||||
|
||||
f = open(mon_file, 'rt')
|
||||
|
||||
firstline = f.readline()
|
||||
assert firstline.startswith('#')
|
||||
metadata = json.loads(firstline[1:])
|
||||
assert metadata['env_id'] == "CartPole-v1"
|
||||
assert set(metadata.keys()) == {'env_id', 'gym_version', 't_start'}, "Incorrect keys in monitor metadata"
|
||||
|
||||
last_logline = pandas.read_csv(f, index_col=None)
|
||||
assert set(last_logline.keys()) == {'l', 't', 'r'}, "Incorrect keys in monitor logline"
|
||||
f.close()
|
||||
os.remove(mon_file)
|
||||
|
31
baselines/bench/test_monitor.py
Normal file
31
baselines/bench/test_monitor.py
Normal file
@@ -0,0 +1,31 @@
|
||||
from .monitor import Monitor
|
||||
import gym
|
||||
import json
|
||||
|
||||
def test_monitor():
|
||||
import pandas
|
||||
import os
|
||||
import uuid
|
||||
|
||||
env = gym.make("CartPole-v1")
|
||||
env.seed(0)
|
||||
mon_file = "/tmp/baselines-test-%s.monitor.csv" % uuid.uuid4()
|
||||
menv = Monitor(env, mon_file)
|
||||
menv.reset()
|
||||
for _ in range(1000):
|
||||
_, _, done, _ = menv.step(0)
|
||||
if done:
|
||||
menv.reset()
|
||||
|
||||
f = open(mon_file, 'rt')
|
||||
|
||||
firstline = f.readline()
|
||||
assert firstline.startswith('#')
|
||||
metadata = json.loads(firstline[1:])
|
||||
assert metadata['env_id'] == "CartPole-v1"
|
||||
assert set(metadata.keys()) == {'env_id', 't_start'}, "Incorrect keys in monitor metadata"
|
||||
|
||||
last_logline = pandas.read_csv(f, index_col=None)
|
||||
assert set(last_logline.keys()) == {'l', 't', 'r'}, "Incorrect keys in monitor logline"
|
||||
f.close()
|
||||
os.remove(mon_file)
|
@@ -6,6 +6,8 @@ import gym
|
||||
from gym import spaces
|
||||
import cv2
|
||||
cv2.ocl.setUseOpenCL(False)
|
||||
from .wrappers import TimeLimit
|
||||
|
||||
|
||||
class NoopResetEnv(gym.Wrapper):
|
||||
def __init__(self, env, noop_max=30):
|
||||
@@ -72,8 +74,8 @@ class EpisodicLifeEnv(gym.Wrapper):
|
||||
# then update lives to handle bonus lives
|
||||
lives = self.env.unwrapped.ale.lives()
|
||||
if lives < self.lives and lives > 0:
|
||||
# for Qbert sometimes we stay in lives == 0 condtion for a few frames
|
||||
# so its important to keep lives > 0, so that we only reset once
|
||||
# for Qbert sometimes we stay in lives == 0 condition for a few frames
|
||||
# so it's important to keep lives > 0, so that we only reset once
|
||||
# the environment advertises done.
|
||||
done = True
|
||||
self.lives = lives
|
||||
@@ -128,19 +130,60 @@ class ClipRewardEnv(gym.RewardWrapper):
|
||||
"""Bin reward to {+1, 0, -1} by its sign."""
|
||||
return np.sign(reward)
|
||||
|
||||
class WarpFrame(gym.ObservationWrapper):
|
||||
def __init__(self, env):
|
||||
"""Warp frames to 84x84 as done in the Nature paper and later work."""
|
||||
gym.ObservationWrapper.__init__(self, env)
|
||||
self.width = 84
|
||||
self.height = 84
|
||||
self.observation_space = spaces.Box(low=0, high=255,
|
||||
shape=(self.height, self.width, 1), dtype=np.uint8)
|
||||
|
||||
def observation(self, frame):
|
||||
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
|
||||
frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
|
||||
return frame[:, :, None]
|
||||
class WarpFrame(gym.ObservationWrapper):
|
||||
def __init__(self, env, width=84, height=84, grayscale=True, dict_space_key=None):
|
||||
"""
|
||||
Warp frames to 84x84 as done in the Nature paper and later work.
|
||||
|
||||
If the environment uses dictionary observations, `dict_space_key` can be specified which indicates which
|
||||
observation should be warped.
|
||||
"""
|
||||
super().__init__(env)
|
||||
self._width = width
|
||||
self._height = height
|
||||
self._grayscale = grayscale
|
||||
self._key = dict_space_key
|
||||
if self._grayscale:
|
||||
num_colors = 1
|
||||
else:
|
||||
num_colors = 3
|
||||
|
||||
new_space = gym.spaces.Box(
|
||||
low=0,
|
||||
high=255,
|
||||
shape=(self._height, self._width, num_colors),
|
||||
dtype=np.uint8,
|
||||
)
|
||||
if self._key is None:
|
||||
original_space = self.observation_space
|
||||
self.observation_space = new_space
|
||||
else:
|
||||
original_space = self.observation_space.spaces[self._key]
|
||||
self.observation_space.spaces[self._key] = new_space
|
||||
assert original_space.dtype == np.uint8 and len(original_space.shape) == 3
|
||||
|
||||
def observation(self, obs):
|
||||
if self._key is None:
|
||||
frame = obs
|
||||
else:
|
||||
frame = obs[self._key]
|
||||
|
||||
if self._grayscale:
|
||||
frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
|
||||
frame = cv2.resize(
|
||||
frame, (self._width, self._height), interpolation=cv2.INTER_AREA
|
||||
)
|
||||
if self._grayscale:
|
||||
frame = np.expand_dims(frame, -1)
|
||||
|
||||
if self._key is None:
|
||||
obs = frame
|
||||
else:
|
||||
obs = obs.copy()
|
||||
obs[self._key] = frame
|
||||
return obs
|
||||
|
||||
|
||||
class FrameStack(gym.Wrapper):
|
||||
def __init__(self, env, k):
|
||||
@@ -156,7 +199,7 @@ class FrameStack(gym.Wrapper):
|
||||
self.k = k
|
||||
self.frames = deque([], maxlen=k)
|
||||
shp = env.observation_space.shape
|
||||
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[0], shp[1], shp[2] * k), dtype=env.observation_space.dtype)
|
||||
self.observation_space = spaces.Box(low=0, high=255, shape=(shp[:-1] + (shp[-1] * k,)), dtype=env.observation_space.dtype)
|
||||
|
||||
def reset(self):
|
||||
ob = self.env.reset()
|
||||
@@ -197,7 +240,7 @@ class LazyFrames(object):
|
||||
|
||||
def _force(self):
|
||||
if self._out is None:
|
||||
self._out = np.concatenate(self._frames, axis=2)
|
||||
self._out = np.concatenate(self._frames, axis=-1)
|
||||
self._frames = None
|
||||
return self._out
|
||||
|
||||
@@ -213,14 +256,20 @@ class LazyFrames(object):
|
||||
def __getitem__(self, i):
|
||||
return self._force()[i]
|
||||
|
||||
def make_atari(env_id, timelimit=True):
|
||||
# XXX(john): remove timelimit argument after gym is upgraded to allow double wrapping
|
||||
def count(self):
|
||||
frames = self._force()
|
||||
return frames.shape[frames.ndim - 1]
|
||||
|
||||
def frame(self, i):
|
||||
return self._force()[..., i]
|
||||
|
||||
def make_atari(env_id, max_episode_steps=None):
|
||||
env = gym.make(env_id)
|
||||
if not timelimit:
|
||||
env = env.env
|
||||
assert 'NoFrameskip' in env.spec.id
|
||||
env = NoopResetEnv(env, noop_max=30)
|
||||
env = MaxAndSkipEnv(env, skip=4)
|
||||
if max_episode_steps is not None:
|
||||
env = TimeLimit(env, max_episode_steps=max_episode_steps)
|
||||
return env
|
||||
|
||||
def wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack=False, scale=False):
|
||||
|
@@ -17,34 +17,60 @@ from baselines.common.atari_wrappers import make_atari, wrap_deepmind
|
||||
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
|
||||
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
|
||||
from baselines.common import retro_wrappers
|
||||
from baselines.common.wrappers import ClipActionsWrapper
|
||||
|
||||
def make_vec_env(env_id, env_type, num_env, seed, wrapper_kwargs=None, start_index=0, reward_scale=1.0, gamestate=None):
|
||||
def make_vec_env(env_id, env_type, num_env, seed,
|
||||
wrapper_kwargs=None,
|
||||
env_kwargs=None,
|
||||
start_index=0,
|
||||
reward_scale=1.0,
|
||||
flatten_dict_observations=True,
|
||||
gamestate=None,
|
||||
initializer=None,
|
||||
force_dummy=False):
|
||||
"""
|
||||
Create a wrapped, monitored SubprocVecEnv for Atari and MuJoCo.
|
||||
"""
|
||||
if wrapper_kwargs is None: wrapper_kwargs = {}
|
||||
wrapper_kwargs = wrapper_kwargs or {}
|
||||
env_kwargs = env_kwargs or {}
|
||||
mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
|
||||
seed = seed + 10000 * mpi_rank if seed is not None else None
|
||||
def make_thunk(rank):
|
||||
logger_dir = logger.get_dir()
|
||||
def make_thunk(rank, initializer=None):
|
||||
return lambda: make_env(
|
||||
env_id=env_id,
|
||||
env_type=env_type,
|
||||
subrank = rank,
|
||||
mpi_rank=mpi_rank,
|
||||
subrank=rank,
|
||||
seed=seed,
|
||||
reward_scale=reward_scale,
|
||||
gamestate=gamestate,
|
||||
wrapper_kwargs=wrapper_kwargs
|
||||
flatten_dict_observations=flatten_dict_observations,
|
||||
wrapper_kwargs=wrapper_kwargs,
|
||||
env_kwargs=env_kwargs,
|
||||
logger_dir=logger_dir,
|
||||
initializer=initializer
|
||||
)
|
||||
|
||||
set_global_seeds(seed)
|
||||
if num_env > 1:
|
||||
return SubprocVecEnv([make_thunk(i + start_index) for i in range(num_env)])
|
||||
if not force_dummy and num_env > 1:
|
||||
return SubprocVecEnv([make_thunk(i + start_index, initializer=initializer) for i in range(num_env)])
|
||||
else:
|
||||
return DummyVecEnv([make_thunk(start_index)])
|
||||
return DummyVecEnv([make_thunk(i + start_index, initializer=None) for i in range(num_env)])
|
||||
|
||||
|
||||
def make_env(env_id, env_type, subrank=0, seed=None, reward_scale=1.0, gamestate=None, wrapper_kwargs={}):
|
||||
mpi_rank = MPI.COMM_WORLD.Get_rank() if MPI else 0
|
||||
def make_env(env_id, env_type, mpi_rank=0, subrank=0, seed=None, reward_scale=1.0, gamestate=None, flatten_dict_observations=True, wrapper_kwargs=None, env_kwargs=None, logger_dir=None, initializer=None):
|
||||
if initializer is not None:
|
||||
initializer(mpi_rank=mpi_rank, subrank=subrank)
|
||||
|
||||
wrapper_kwargs = wrapper_kwargs or {}
|
||||
env_kwargs = env_kwargs or {}
|
||||
if ':' in env_id:
|
||||
import re
|
||||
import importlib
|
||||
module_name = re.sub(':.*','',env_id)
|
||||
env_id = re.sub('.*:', '', env_id)
|
||||
importlib.import_module(module_name)
|
||||
if env_type == 'atari':
|
||||
env = make_atari(env_id)
|
||||
elif env_type == 'retro':
|
||||
@@ -52,18 +78,28 @@ def make_env(env_id, env_type, subrank=0, seed=None, reward_scale=1.0, gamestate
|
||||
gamestate = gamestate or retro.State.DEFAULT
|
||||
env = retro_wrappers.make_retro(game=env_id, max_episode_steps=10000, use_restricted_actions=retro.Actions.DISCRETE, state=gamestate)
|
||||
else:
|
||||
env = gym.make(env_id)
|
||||
env = gym.make(env_id, **env_kwargs)
|
||||
|
||||
if flatten_dict_observations and isinstance(env.observation_space, gym.spaces.Dict):
|
||||
keys = env.observation_space.spaces.keys()
|
||||
env = gym.wrappers.FlattenDictWrapper(env, dict_keys=list(keys))
|
||||
|
||||
env.seed(seed + subrank if seed is not None else None)
|
||||
env = Monitor(env,
|
||||
logger.get_dir() and os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(subrank)),
|
||||
logger_dir and os.path.join(logger_dir, str(mpi_rank) + '.' + str(subrank)),
|
||||
allow_early_resets=True)
|
||||
|
||||
|
||||
if env_type == 'atari':
|
||||
env = wrap_deepmind(env, **wrapper_kwargs)
|
||||
elif env_type == 'retro':
|
||||
if 'frame_stack' not in wrapper_kwargs:
|
||||
wrapper_kwargs['frame_stack'] = 1
|
||||
env = retro_wrappers.wrap_deepmind_retro(env, **wrapper_kwargs)
|
||||
|
||||
if isinstance(env.action_space, gym.spaces.Box):
|
||||
env = ClipActionsWrapper(env)
|
||||
|
||||
if reward_scale != 1:
|
||||
env = retro_wrappers.RewardScaler(env, reward_scale)
|
||||
|
||||
@@ -123,6 +159,7 @@ def common_arg_parser():
|
||||
"""
|
||||
parser = arg_parser()
|
||||
parser.add_argument('--env', help='environment ID', type=str, default='Reacher-v2')
|
||||
parser.add_argument('--env_type', help='type of environment, used when the environment type cannot be automatically determined', type=str)
|
||||
parser.add_argument('--seed', help='RNG seed', type=int, default=None)
|
||||
parser.add_argument('--alg', help='Algorithm', type=str, default='ppo2')
|
||||
parser.add_argument('--num_timesteps', type=float, default=1e6),
|
||||
@@ -133,6 +170,7 @@ def common_arg_parser():
|
||||
parser.add_argument('--save_path', help='Path to save trained model to', default=None, type=str)
|
||||
parser.add_argument('--save_video_interval', help='Save video every x steps (0 = disabled)', default=0, type=int)
|
||||
parser.add_argument('--save_video_length', help='Length of recorded video. Default: 200', default=200, type=int)
|
||||
parser.add_argument('--log_path', help='Directory to save learning curve data.', default=None, type=str)
|
||||
parser.add_argument('--play', default=False, action='store_true')
|
||||
return parser
|
||||
|
||||
@@ -149,7 +187,7 @@ def robotics_arg_parser():
|
||||
|
||||
def parse_unknown_args(args):
|
||||
"""
|
||||
Parse arguments not consumed by arg parser into a dicitonary
|
||||
Parse arguments not consumed by arg parser into a dictionary
|
||||
"""
|
||||
retval = {}
|
||||
preceded_by_key = False
|
||||
|
@@ -62,7 +62,7 @@ class CategoricalPdType(PdType):
|
||||
def pdclass(self):
|
||||
return CategoricalPd
|
||||
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
|
||||
pdparam = fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
|
||||
pdparam = _matching_fc(latent_vector, 'pi', self.ncat, init_scale=init_scale, init_bias=init_bias)
|
||||
return self.pdfromflat(pdparam), pdparam
|
||||
|
||||
def param_shape(self):
|
||||
@@ -75,14 +75,15 @@ class CategoricalPdType(PdType):
|
||||
|
||||
class MultiCategoricalPdType(PdType):
|
||||
def __init__(self, nvec):
|
||||
self.ncats = nvec
|
||||
self.ncats = nvec.astype('int32')
|
||||
assert (self.ncats > 0).all()
|
||||
def pdclass(self):
|
||||
return MultiCategoricalPd
|
||||
def pdfromflat(self, flat):
|
||||
return MultiCategoricalPd(self.ncats, flat)
|
||||
|
||||
def pdfromlatent(self, latent, init_scale=1.0, init_bias=0.0):
|
||||
pdparam = fc(latent, 'pi', self.ncats.sum(), init_scale=init_scale, init_bias=init_bias)
|
||||
pdparam = _matching_fc(latent, 'pi', self.ncats.sum(), init_scale=init_scale, init_bias=init_bias)
|
||||
return self.pdfromflat(pdparam), pdparam
|
||||
|
||||
def param_shape(self):
|
||||
@@ -99,7 +100,7 @@ class DiagGaussianPdType(PdType):
|
||||
return DiagGaussianPd
|
||||
|
||||
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
|
||||
mean = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
|
||||
mean = _matching_fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
|
||||
logstd = tf.get_variable(name='pi/logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
|
||||
pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
|
||||
return self.pdfromflat(pdparam), mean
|
||||
@@ -123,7 +124,7 @@ class BernoulliPdType(PdType):
|
||||
def sample_dtype(self):
|
||||
return tf.int32
|
||||
def pdfromlatent(self, latent_vector, init_scale=1.0, init_bias=0.0):
|
||||
pdparam = fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
|
||||
pdparam = _matching_fc(latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
|
||||
return self.pdfromflat(pdparam), pdparam
|
||||
|
||||
# WRONG SECOND DERIVATIVES
|
||||
@@ -205,7 +206,8 @@ class CategoricalPd(Pd):
|
||||
class MultiCategoricalPd(Pd):
|
||||
def __init__(self, nvec, flat):
|
||||
self.flat = flat
|
||||
self.categoricals = list(map(CategoricalPd, tf.split(flat, nvec, axis=-1)))
|
||||
self.categoricals = list(map(CategoricalPd,
|
||||
tf.split(flat, np.array(nvec, dtype=np.int32), axis=-1)))
|
||||
def flatparam(self):
|
||||
return self.flat
|
||||
def mode(self):
|
||||
@@ -345,3 +347,9 @@ def validate_probtype(probtype, pdparam):
|
||||
assert np.abs(klval - klval_ll) < 3 * klval_ll_stderr # within 3 sigmas
|
||||
print('ok on', probtype, pdparam)
|
||||
|
||||
|
||||
def _matching_fc(tensor, name, size, init_scale, init_bias):
|
||||
if tensor.shape[-1] == size:
|
||||
return tensor
|
||||
else:
|
||||
return fc(tensor, name, size, init_scale=init_scale, init_bias=init_bias)
|
||||
|
@@ -13,27 +13,6 @@ def zipsame(*seqs):
|
||||
return zip(*seqs)
|
||||
|
||||
|
||||
def unpack(seq, sizes):
|
||||
"""
|
||||
Unpack 'seq' into a sequence of lists, with lengths specified by 'sizes'.
|
||||
None = just one bare element, not a list
|
||||
|
||||
Example:
|
||||
unpack([1,2,3,4,5,6], [3,None,2]) -> ([1,2,3], 4, [5,6])
|
||||
"""
|
||||
seq = list(seq)
|
||||
it = iter(seq)
|
||||
assert sum(1 if s is None else s for s in sizes) == len(seq), "Trying to unpack %s into %s" % (seq, sizes)
|
||||
for size in sizes:
|
||||
if size is None:
|
||||
yield it.__next__()
|
||||
else:
|
||||
li = []
|
||||
for _ in range(size):
|
||||
li.append(it.__next__())
|
||||
yield li
|
||||
|
||||
|
||||
class EzPickle(object):
|
||||
"""Objects that are pickled and unpickled via their constructor
|
||||
arguments.
|
||||
|
@@ -3,7 +3,6 @@ import tensorflow as tf
|
||||
from baselines.a2c import utils
|
||||
from baselines.a2c.utils import conv, fc, conv_to_fc, batch_to_seq, seq_to_batch
|
||||
from baselines.common.mpi_running_mean_std import RunningMeanStd
|
||||
import tensorflow.contrib.layers as layers
|
||||
|
||||
mapping = {}
|
||||
|
||||
@@ -26,6 +25,51 @@ def nature_cnn(unscaled_images, **conv_kwargs):
|
||||
h3 = conv_to_fc(h3)
|
||||
return activ(fc(h3, 'fc1', nh=512, init_scale=np.sqrt(2)))
|
||||
|
||||
def build_impala_cnn(unscaled_images, depths=[16,32,32], **conv_kwargs):
|
||||
"""
|
||||
Model used in the paper "IMPALA: Scalable Distributed Deep-RL with
|
||||
Importance Weighted Actor-Learner Architectures" https://arxiv.org/abs/1802.01561
|
||||
"""
|
||||
|
||||
layer_num = 0
|
||||
|
||||
def get_layer_num_str():
|
||||
nonlocal layer_num
|
||||
num_str = str(layer_num)
|
||||
layer_num += 1
|
||||
return num_str
|
||||
|
||||
def conv_layer(out, depth):
|
||||
return tf.layers.conv2d(out, depth, 3, padding='same', name='layer_' + get_layer_num_str())
|
||||
|
||||
def residual_block(inputs):
|
||||
depth = inputs.get_shape()[-1].value
|
||||
|
||||
out = tf.nn.relu(inputs)
|
||||
|
||||
out = conv_layer(out, depth)
|
||||
out = tf.nn.relu(out)
|
||||
out = conv_layer(out, depth)
|
||||
return out + inputs
|
||||
|
||||
def conv_sequence(inputs, depth):
|
||||
out = conv_layer(inputs, depth)
|
||||
out = tf.layers.max_pooling2d(out, pool_size=3, strides=2, padding='same')
|
||||
out = residual_block(out)
|
||||
out = residual_block(out)
|
||||
return out
|
||||
|
||||
out = tf.cast(unscaled_images, tf.float32) / 255.
|
||||
|
||||
for depth in depths:
|
||||
out = conv_sequence(out, depth)
|
||||
|
||||
out = tf.layers.flatten(out)
|
||||
out = tf.nn.relu(out)
|
||||
out = tf.layers.dense(out, 256, activation=tf.nn.relu, name='layer_' + get_layer_num_str())
|
||||
|
||||
return out
|
||||
|
||||
|
||||
@register("mlp")
|
||||
def mlp(num_layers=2, num_hidden=64, activation=tf.tanh, layer_norm=False):
|
||||
@@ -65,6 +109,11 @@ def cnn(**conv_kwargs):
|
||||
return nature_cnn(X, **conv_kwargs)
|
||||
return network_fn
|
||||
|
||||
@register("impala_cnn")
|
||||
def impala_cnn(**conv_kwargs):
|
||||
def network_fn(X):
|
||||
return build_impala_cnn(X)
|
||||
return network_fn
|
||||
|
||||
@register("cnn_small")
|
||||
def cnn_small(**conv_kwargs):
|
||||
@@ -79,7 +128,6 @@ def cnn_small(**conv_kwargs):
|
||||
return h
|
||||
return network_fn
|
||||
|
||||
|
||||
@register("lstm")
|
||||
def lstm(nlstm=128, layer_norm=False):
|
||||
"""
|
||||
@@ -136,12 +184,12 @@ def lstm(nlstm=128, layer_norm=False):
|
||||
|
||||
|
||||
@register("cnn_lstm")
|
||||
def cnn_lstm(nlstm=128, layer_norm=False, **conv_kwargs):
|
||||
def cnn_lstm(nlstm=128, layer_norm=False, conv_fn=nature_cnn, **conv_kwargs):
|
||||
def network_fn(X, nenv=1):
|
||||
nbatch = X.shape[0]
|
||||
nsteps = nbatch // nenv
|
||||
|
||||
h = nature_cnn(X, **conv_kwargs)
|
||||
h = conv_fn(X, **conv_kwargs)
|
||||
|
||||
M = tf.placeholder(tf.float32, [nbatch]) #mask (done t-1)
|
||||
S = tf.placeholder(tf.float32, [nenv, 2*nlstm]) #states
|
||||
@@ -161,6 +209,9 @@ def cnn_lstm(nlstm=128, layer_norm=False, **conv_kwargs):
|
||||
|
||||
return network_fn
|
||||
|
||||
@register("impala_cnn_lstm")
|
||||
def impala_cnn_lstm():
|
||||
return cnn_lstm(nlstm=256, conv_fn=build_impala_cnn)
|
||||
|
||||
@register("cnn_lnlstm")
|
||||
def cnn_lnlstm(nlstm=128, **conv_kwargs):
|
||||
@@ -187,7 +238,7 @@ def conv_only(convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)], **conv_kwargs):
|
||||
out = tf.cast(X, tf.float32) / 255.
|
||||
with tf.variable_scope("convnet"):
|
||||
for num_outputs, kernel_size, stride in convs:
|
||||
out = layers.convolution2d(out,
|
||||
out = tf.contrib.layers.convolution2d(out,
|
||||
num_outputs=num_outputs,
|
||||
kernel_size=kernel_size,
|
||||
stride=stride,
|
||||
|
@@ -1,31 +1,90 @@
|
||||
import numpy as np
|
||||
import tensorflow as tf
|
||||
from mpi4py import MPI
|
||||
from baselines.common import tf_util as U
|
||||
from baselines.common.tests.test_with_mpi import with_mpi
|
||||
from baselines import logger
|
||||
try:
|
||||
from mpi4py import MPI
|
||||
except ImportError:
|
||||
MPI = None
|
||||
|
||||
class MpiAdamOptimizer(tf.train.AdamOptimizer):
|
||||
"""Adam optimizer that averages gradients across mpi processes."""
|
||||
def __init__(self, comm, **kwargs):
|
||||
def __init__(self, comm, grad_clip=None, mpi_rank_weight=1, **kwargs):
|
||||
self.comm = comm
|
||||
self.grad_clip = grad_clip
|
||||
self.mpi_rank_weight = mpi_rank_weight
|
||||
tf.train.AdamOptimizer.__init__(self, **kwargs)
|
||||
def compute_gradients(self, loss, var_list, **kwargs):
|
||||
grads_and_vars = tf.train.AdamOptimizer.compute_gradients(self, loss, var_list, **kwargs)
|
||||
grads_and_vars = [(g, v) for g, v in grads_and_vars if g is not None]
|
||||
flat_grad = tf.concat([tf.reshape(g, (-1,)) for g, v in grads_and_vars], axis=0)
|
||||
flat_grad = tf.concat([tf.reshape(g, (-1,)) for g, v in grads_and_vars], axis=0) * self.mpi_rank_weight
|
||||
shapes = [v.shape.as_list() for g, v in grads_and_vars]
|
||||
sizes = [int(np.prod(s)) for s in shapes]
|
||||
|
||||
num_tasks = self.comm.Get_size()
|
||||
buf = np.zeros(sum(sizes), np.float32)
|
||||
total_weight = np.zeros(1, np.float32)
|
||||
self.comm.Allreduce(np.array([self.mpi_rank_weight], dtype=np.float32), total_weight, op=MPI.SUM)
|
||||
total_weight = total_weight[0]
|
||||
|
||||
def _collect_grads(flat_grad):
|
||||
buf = np.zeros(sum(sizes), np.float32)
|
||||
countholder = [0] # Counts how many times _collect_grads has been called
|
||||
stat = tf.reduce_sum(grads_and_vars[0][1]) # sum of first variable
|
||||
def _collect_grads(flat_grad, np_stat):
|
||||
if self.grad_clip is not None:
|
||||
gradnorm = np.linalg.norm(flat_grad)
|
||||
if gradnorm > 1:
|
||||
flat_grad /= gradnorm
|
||||
logger.logkv_mean('gradnorm', gradnorm)
|
||||
logger.logkv_mean('gradclipfrac', float(gradnorm > 1))
|
||||
self.comm.Allreduce(flat_grad, buf, op=MPI.SUM)
|
||||
np.divide(buf, float(num_tasks), out=buf)
|
||||
np.divide(buf, float(total_weight), out=buf)
|
||||
if countholder[0] % 100 == 0:
|
||||
check_synced(np_stat, self.comm)
|
||||
countholder[0] += 1
|
||||
return buf
|
||||
|
||||
avg_flat_grad = tf.py_func(_collect_grads, [flat_grad], tf.float32)
|
||||
avg_flat_grad = tf.py_func(_collect_grads, [flat_grad, stat], tf.float32)
|
||||
avg_flat_grad.set_shape(flat_grad.shape)
|
||||
avg_grads = tf.split(avg_flat_grad, sizes, axis=0)
|
||||
avg_grads_and_vars = [(tf.reshape(g, v.shape), v)
|
||||
for g, (_, v) in zip(avg_grads, grads_and_vars)]
|
||||
|
||||
return avg_grads_and_vars
|
||||
|
||||
def check_synced(localval, comm=None):
|
||||
"""
|
||||
It's common to forget to initialize your variables to the same values, or
|
||||
(less commonly) if you update them in some other way than adam, to get them out of sync.
|
||||
This function checks that variables on all MPI workers are the same, and raises
|
||||
an AssertionError otherwise
|
||||
|
||||
Arguments:
|
||||
comm: MPI communicator
|
||||
localval: list of local variables (list of variables on current worker to be compared with the other workers)
|
||||
"""
|
||||
comm = comm or MPI.COMM_WORLD
|
||||
vals = comm.gather(localval)
|
||||
if comm.rank == 0:
|
||||
assert all(val==vals[0] for val in vals[1:]),\
|
||||
'MpiAdamOptimizer detected that different workers have different weights: {}'.format(vals)
|
||||
|
||||
@with_mpi(timeout=5)
|
||||
def test_nonfreeze():
|
||||
np.random.seed(0)
|
||||
tf.set_random_seed(0)
|
||||
|
||||
a = tf.Variable(np.random.randn(3).astype('float32'))
|
||||
b = tf.Variable(np.random.randn(2,5).astype('float32'))
|
||||
loss = tf.reduce_sum(tf.square(a)) + tf.reduce_sum(tf.sin(b))
|
||||
|
||||
stepsize = 1e-2
|
||||
# for some reason the session config with inter_op_parallelism_threads was causing
|
||||
# nested sess.run calls to freeze
|
||||
config = tf.ConfigProto(inter_op_parallelism_threads=1)
|
||||
sess = U.get_session(config=config)
|
||||
update_op = MpiAdamOptimizer(comm=MPI.COMM_WORLD, learning_rate=stepsize).minimize(loss)
|
||||
sess.run(tf.global_variables_initializer())
|
||||
losslist_ref = []
|
||||
for i in range(100):
|
||||
l,_ = sess.run([loss, update_op])
|
||||
print(i, l)
|
||||
losslist_ref.append(l)
|
||||
|
@@ -12,8 +12,9 @@ def mpi_mean(x, axis=0, comm=None, keepdims=False):
|
||||
localsum = np.zeros(n+1, x.dtype)
|
||||
localsum[:n] = xsum.ravel()
|
||||
localsum[n] = x.shape[axis]
|
||||
globalsum = np.zeros_like(localsum)
|
||||
comm.Allreduce(localsum, globalsum, op=MPI.SUM)
|
||||
# globalsum = np.zeros_like(localsum)
|
||||
# comm.Allreduce(localsum, globalsum, op=MPI.SUM)
|
||||
globalsum = comm.allreduce(localsum, op=MPI.SUM)
|
||||
return globalsum[:n].reshape(xsum.shape) / globalsum[n], globalsum[n]
|
||||
|
||||
def mpi_moments(x, axis=0, comm=None, keepdims=False):
|
||||
|
@@ -1,9 +1,16 @@
|
||||
from collections import defaultdict
|
||||
from mpi4py import MPI
|
||||
import os, numpy as np
|
||||
import platform
|
||||
import shutil
|
||||
import subprocess
|
||||
import warnings
|
||||
import sys
|
||||
|
||||
try:
|
||||
from mpi4py import MPI
|
||||
except ImportError:
|
||||
MPI = None
|
||||
|
||||
|
||||
def sync_from_root(sess, variables, comm=None):
|
||||
"""
|
||||
@@ -13,15 +20,10 @@ def sync_from_root(sess, variables, comm=None):
|
||||
variables: all parameter variables including optimizer's
|
||||
"""
|
||||
if comm is None: comm = MPI.COMM_WORLD
|
||||
rank = comm.Get_rank()
|
||||
for var in variables:
|
||||
if rank == 0:
|
||||
comm.Bcast(sess.run(var))
|
||||
else:
|
||||
import tensorflow as tf
|
||||
returned_var = np.empty(var.shape, dtype='float32')
|
||||
comm.Bcast(returned_var)
|
||||
sess.run(tf.assign(var, returned_var))
|
||||
import tensorflow as tf
|
||||
values = comm.bcast(sess.run(variables))
|
||||
sess.run([tf.assign(var, val)
|
||||
for (var, val) in zip(variables, values)])
|
||||
|
||||
def gpu_count():
|
||||
"""
|
||||
@@ -34,13 +36,15 @@ def gpu_count():
|
||||
|
||||
def setup_mpi_gpus():
|
||||
"""
|
||||
Set CUDA_VISIBLE_DEVICES using MPI.
|
||||
Set CUDA_VISIBLE_DEVICES to MPI rank if not already set
|
||||
"""
|
||||
num_gpus = gpu_count()
|
||||
if num_gpus == 0:
|
||||
return
|
||||
local_rank, _ = get_local_rank_size(MPI.COMM_WORLD)
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(local_rank % num_gpus)
|
||||
if 'CUDA_VISIBLE_DEVICES' not in os.environ:
|
||||
if sys.platform == 'darwin': # This Assumes if you're on OSX you're just
|
||||
ids = [] # doing a smoke test and don't want GPUs
|
||||
else:
|
||||
lrank, _lsize = get_local_rank_size(MPI.COMM_WORLD)
|
||||
ids = [lrank]
|
||||
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, ids))
|
||||
|
||||
def get_local_rank_size(comm):
|
||||
"""
|
||||
@@ -81,6 +85,9 @@ def share_file(comm, path):
|
||||
comm.Barrier()
|
||||
|
||||
def dict_gather(comm, d, op='mean', assert_all_have_data=True):
|
||||
"""
|
||||
Perform a reduction operation over dicts
|
||||
"""
|
||||
if comm is None: return d
|
||||
alldicts = comm.allgather(d)
|
||||
size = comm.size
|
||||
@@ -99,3 +106,28 @@ def dict_gather(comm, d, op='mean', assert_all_have_data=True):
|
||||
else:
|
||||
assert 0, op
|
||||
return result
|
||||
|
||||
def mpi_weighted_mean(comm, local_name2valcount):
|
||||
"""
|
||||
Perform a weighted average over dicts that are each on a different node
|
||||
Input: local_name2valcount: dict mapping key -> (value, count)
|
||||
Returns: key -> mean
|
||||
"""
|
||||
all_name2valcount = comm.gather(local_name2valcount)
|
||||
if comm.rank == 0:
|
||||
name2sum = defaultdict(float)
|
||||
name2count = defaultdict(float)
|
||||
for n2vc in all_name2valcount:
|
||||
for (name, (val, count)) in n2vc.items():
|
||||
try:
|
||||
val = float(val)
|
||||
except ValueError:
|
||||
if comm.rank == 0:
|
||||
warnings.warn('WARNING: tried to compute mean on non-float {}={}'.format(name, val))
|
||||
else:
|
||||
name2sum[name] += val * count
|
||||
name2count[name] += count
|
||||
return {name : name2sum[name] / name2count[name] for name in name2sum}
|
||||
else:
|
||||
return {}
|
||||
|
||||
|
@@ -90,6 +90,8 @@ def one_sided_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_
|
||||
sum_y *= interstep_decay
|
||||
count_y *= interstep_decay
|
||||
while True:
|
||||
if luoi >= len(xolds):
|
||||
break
|
||||
xold = xolds[luoi]
|
||||
if xold <= xnew:
|
||||
decay = np.exp(- (xnew - xold) / decay_period)
|
||||
@@ -98,8 +100,6 @@ def one_sided_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_
|
||||
luoi += 1
|
||||
else:
|
||||
break
|
||||
if luoi >= len(xolds):
|
||||
break
|
||||
sum_ys[i] = sum_y
|
||||
count_ys[i] = count_y
|
||||
|
||||
@@ -168,6 +168,7 @@ def load_results(root_dir_or_dirs, enable_progress=True, enable_monitor=True, ve
|
||||
- monitor - if enable_monitor is True, this field contains pandas dataframe with loaded monitor.csv file (or aggregate of all *.monitor.csv files in the directory)
|
||||
- progress - if enable_progress is True, this field contains pandas dataframe with loaded progress.csv file
|
||||
'''
|
||||
import re
|
||||
if isinstance(root_dir_or_dirs, str):
|
||||
rootdirs = [osp.expanduser(root_dir_or_dirs)]
|
||||
else:
|
||||
@@ -179,7 +180,9 @@ def load_results(root_dir_or_dirs, enable_progress=True, enable_monitor=True, ve
|
||||
if '-proc' in dirname:
|
||||
files[:] = []
|
||||
continue
|
||||
if set(['metadata.json', 'monitor.json', 'monitor.csv', 'progress.json', 'progress.csv']).intersection(files):
|
||||
monitor_re = re.compile(r'(\d+\.)?(\d+\.)?monitor\.csv')
|
||||
if set(['metadata.json', 'monitor.json', 'progress.json', 'progress.csv']).intersection(files) or \
|
||||
any([f for f in files if monitor_re.match(f)]): # also match monitor files like 0.1.monitor.csv
|
||||
# used to be uncommented, which means do not go deeper than current directory if any of the data files
|
||||
# are found
|
||||
# dirs[:] = []
|
||||
@@ -246,6 +249,9 @@ def plot_results(
|
||||
legend_outside=False,
|
||||
resample=0,
|
||||
smooth_step=1.0,
|
||||
tiling='vertical',
|
||||
xlabel=None,
|
||||
ylabel=None
|
||||
):
|
||||
'''
|
||||
Plot multiple Results objects
|
||||
@@ -297,9 +303,23 @@ def plot_results(
|
||||
sk2r[splitkey].append(result)
|
||||
assert len(sk2r) > 0
|
||||
assert isinstance(resample, int), "0: don't resample. <integer>: that many samples"
|
||||
nrows = len(sk2r)
|
||||
ncols = 1
|
||||
figsize = figsize or (6, 6 * nrows)
|
||||
if tiling == 'vertical' or tiling is None:
|
||||
nrows = len(sk2r)
|
||||
ncols = 1
|
||||
elif tiling == 'horizontal':
|
||||
ncols = len(sk2r)
|
||||
nrows = 1
|
||||
elif tiling == 'symmetric':
|
||||
import math
|
||||
N = len(sk2r)
|
||||
largest_divisor = 1
|
||||
for i in range(1, int(math.sqrt(N))+1):
|
||||
if N % i == 0:
|
||||
largest_divisor = i
|
||||
ncols = largest_divisor
|
||||
nrows = N // ncols
|
||||
figsize = figsize or (6 * ncols, 6 * nrows)
|
||||
|
||||
f, axarr = plt.subplots(nrows, ncols, sharex=False, squeeze=False, figsize=figsize)
|
||||
|
||||
groups = list(set(group_fn(result) for result in allresults))
|
||||
@@ -313,7 +333,9 @@ def plot_results(
|
||||
g2c = defaultdict(int)
|
||||
sresults = sk2r[sk]
|
||||
gresults = defaultdict(list)
|
||||
ax = axarr[isplit][0]
|
||||
idx_row = isplit // ncols
|
||||
idx_col = isplit % ncols
|
||||
ax = axarr[idx_row][idx_col]
|
||||
for result in sresults:
|
||||
group = group_fn(result)
|
||||
g2c[group] += 1
|
||||
@@ -352,7 +374,7 @@ def plot_results(
|
||||
ymean = np.mean(ys, axis=0)
|
||||
ystd = np.std(ys, axis=0)
|
||||
ystderr = ystd / np.sqrt(len(ys))
|
||||
l, = axarr[isplit][0].plot(usex, ymean, color=color)
|
||||
l, = axarr[idx_row][idx_col].plot(usex, ymean, color=color)
|
||||
g2l[group] = l
|
||||
if shaded_err:
|
||||
ax.fill_between(usex, ymean - ystderr, ymean + ystderr, color=color, alpha=.4)
|
||||
@@ -369,6 +391,17 @@ def plot_results(
|
||||
loc=2 if legend_outside else None,
|
||||
bbox_to_anchor=(1,1) if legend_outside else None)
|
||||
ax.set_title(sk)
|
||||
# add xlabels, but only to the bottom row
|
||||
if xlabel is not None:
|
||||
for ax in axarr[-1]:
|
||||
plt.sca(ax)
|
||||
plt.xlabel(xlabel)
|
||||
# add ylabels, but only to left column
|
||||
if ylabel is not None:
|
||||
for ax in axarr[:,0]:
|
||||
plt.sca(ax)
|
||||
plt.ylabel(ylabel)
|
||||
|
||||
return f, axarr
|
||||
|
||||
def regression_analysis(df):
|
||||
|
@@ -1,25 +1,11 @@
|
||||
# flake8: noqa F403, F405
|
||||
from .atari_wrappers import *
|
||||
from collections import deque
|
||||
import cv2
|
||||
cv2.ocl.setUseOpenCL(False)
|
||||
from .atari_wrappers import WarpFrame, ClipRewardEnv, FrameStack, ScaledFloatFrame
|
||||
from .wrappers import TimeLimit
|
||||
import numpy as np
|
||||
import gym
|
||||
|
||||
class TimeLimit(gym.Wrapper):
|
||||
def __init__(self, env, max_episode_steps=None):
|
||||
super(TimeLimit, self).__init__(env)
|
||||
self._max_episode_steps = max_episode_steps
|
||||
self._elapsed_steps = 0
|
||||
|
||||
def step(self, ac):
|
||||
observation, reward, done, info = self.env.step(ac)
|
||||
self._elapsed_steps += 1
|
||||
if self._elapsed_steps >= self._max_episode_steps:
|
||||
done = True
|
||||
info['TimeLimit.truncated'] = True
|
||||
return observation, reward, done, info
|
||||
|
||||
def reset(self, **kwargs):
|
||||
self._elapsed_steps = 0
|
||||
return self.env.reset(**kwargs)
|
||||
|
||||
class StochasticFrameSkip(gym.Wrapper):
|
||||
def __init__(self, env, n, stickprob):
|
||||
@@ -99,7 +85,7 @@ class Downsample(gym.ObservationWrapper):
|
||||
gym.ObservationWrapper.__init__(self, env)
|
||||
(oldh, oldw, oldc) = env.observation_space.shape
|
||||
newshape = (oldh//ratio, oldw//ratio, oldc)
|
||||
self.observation_space = spaces.Box(low=0, high=255,
|
||||
self.observation_space = gym.spaces.Box(low=0, high=255,
|
||||
shape=newshape, dtype=np.uint8)
|
||||
|
||||
def observation(self, frame):
|
||||
@@ -116,7 +102,7 @@ class Rgb2gray(gym.ObservationWrapper):
|
||||
"""
|
||||
gym.ObservationWrapper.__init__(self, env)
|
||||
(oldh, oldw, _oldc) = env.observation_space.shape
|
||||
self.observation_space = spaces.Box(low=0, high=255,
|
||||
self.observation_space = gym.spaces.Box(low=0, high=255,
|
||||
shape=(oldh, oldw, 1), dtype=np.uint8)
|
||||
|
||||
def observation(self, frame):
|
||||
@@ -132,10 +118,8 @@ class MovieRecord(gym.Wrapper):
|
||||
self.epcount = 0
|
||||
def reset(self):
|
||||
if self.epcount % self.k == 0:
|
||||
print('saving movie this episode', self.savedir)
|
||||
self.env.unwrapped.movie_path = self.savedir
|
||||
else:
|
||||
print('not saving this episode')
|
||||
self.env.unwrapped.movie_path = None
|
||||
self.env.unwrapped.movie = None
|
||||
self.epcount += 1
|
||||
@@ -215,8 +199,10 @@ class StartDoingRandomActionsWrapper(gym.Wrapper):
|
||||
self.some_random_steps()
|
||||
return self.last_obs, rew, done, info
|
||||
|
||||
def make_retro(*, game, state, max_episode_steps, **kwargs):
|
||||
def make_retro(*, game, state=None, max_episode_steps=4500, **kwargs):
|
||||
import retro
|
||||
if state is None:
|
||||
state = retro.State.DEFAULT
|
||||
env = retro.make(game, state, **kwargs)
|
||||
env = StochasticFrameSkip(env, n=4, stickprob=0.25)
|
||||
if max_episode_steps is not None:
|
||||
@@ -229,7 +215,8 @@ def wrap_deepmind_retro(env, scale=True, frame_stack=4):
|
||||
"""
|
||||
env = WarpFrame(env)
|
||||
env = ClipRewardEnv(env)
|
||||
env = FrameStack(env, frame_stack)
|
||||
if frame_stack > 1:
|
||||
env = FrameStack(env, frame_stack)
|
||||
if scale:
|
||||
env = ScaledFloatFrame(env)
|
||||
return env
|
||||
|
@@ -177,7 +177,7 @@ def profile_tf_runningmeanstd():
|
||||
outfile = '/tmp/timeline.json'
|
||||
with open(outfile, 'wt') as f:
|
||||
f.write(chrome_trace)
|
||||
print(f'Successfully saved profile to {outfile}. Exiting.')
|
||||
print('Successfully saved profile to {}. Exiting.'.format(outfile))
|
||||
exit(0)
|
||||
'''
|
||||
|
||||
|
29
baselines/common/test_mpi_util.py
Normal file
29
baselines/common/test_mpi_util.py
Normal file
@@ -0,0 +1,29 @@
|
||||
from baselines.common import mpi_util
|
||||
from baselines import logger
|
||||
from baselines.common.tests.test_with_mpi import with_mpi
|
||||
try:
|
||||
from mpi4py import MPI
|
||||
except ImportError:
|
||||
MPI = None
|
||||
|
||||
@with_mpi()
|
||||
def test_mpi_weighted_mean():
|
||||
comm = MPI.COMM_WORLD
|
||||
with logger.scoped_configure(comm=comm):
|
||||
if comm.rank == 0:
|
||||
name2valcount = {'a' : (10, 2), 'b' : (20,3)}
|
||||
elif comm.rank == 1:
|
||||
name2valcount = {'a' : (19, 1), 'c' : (42,3)}
|
||||
else:
|
||||
raise NotImplementedError
|
||||
d = mpi_util.mpi_weighted_mean(comm, name2valcount)
|
||||
correctval = {'a' : (10 * 2 + 19) / 3.0, 'b' : 20, 'c' : 42}
|
||||
if comm.rank == 0:
|
||||
assert d == correctval, '{} != {}'.format(d, correctval)
|
||||
|
||||
for name, (val, count) in name2valcount.items():
|
||||
for _ in range(count):
|
||||
logger.logkv_mean(name, val)
|
||||
d2 = logger.dumpkvs()
|
||||
if comm.rank == 0:
|
||||
assert d2 == correctval
|
@@ -0,0 +1,2 @@
|
||||
import os, pytest
|
||||
mark_slow = pytest.mark.skipif(not os.getenv('RUNSLOW'), reason='slow')
|
@@ -7,19 +7,16 @@ class FixedSequenceEnv(Env):
|
||||
def __init__(
|
||||
self,
|
||||
n_actions=10,
|
||||
seed=0,
|
||||
episode_len=100
|
||||
):
|
||||
self.np_random = np.random.RandomState()
|
||||
self.np_random.seed(seed)
|
||||
self.sequence = [self.np_random.randint(0, n_actions-1) for _ in range(episode_len)]
|
||||
|
||||
self.action_space = Discrete(n_actions)
|
||||
self.observation_space = Discrete(1)
|
||||
|
||||
self.np_random = np.random.RandomState(0)
|
||||
self.episode_len = episode_len
|
||||
self.sequence = [self.np_random.randint(0, self.action_space.n)
|
||||
for _ in range(self.episode_len)]
|
||||
self.time = 0
|
||||
self.reset()
|
||||
|
||||
|
||||
def reset(self):
|
||||
self.time = 0
|
||||
@@ -30,11 +27,13 @@ class FixedSequenceEnv(Env):
|
||||
self._choose_next_state()
|
||||
done = False
|
||||
if self.episode_len and self.time >= self.episode_len:
|
||||
rew = 0
|
||||
done = True
|
||||
|
||||
return 0, rew, done, {}
|
||||
|
||||
def seed(self, seed=None):
|
||||
self.np_random.seed(seed)
|
||||
|
||||
def _choose_next_state(self):
|
||||
self.time += 1
|
||||
|
||||
|
@@ -2,41 +2,45 @@ import numpy as np
|
||||
from abc import abstractmethod
|
||||
from gym import Env
|
||||
from gym.spaces import MultiDiscrete, Discrete, Box
|
||||
|
||||
from collections import deque
|
||||
|
||||
class IdentityEnv(Env):
|
||||
def __init__(
|
||||
self,
|
||||
episode_len=None
|
||||
episode_len=None,
|
||||
delay=0,
|
||||
zero_first_rewards=True
|
||||
):
|
||||
|
||||
self.observation_space = self.action_space
|
||||
self.episode_len = episode_len
|
||||
self.time = 0
|
||||
self.reset()
|
||||
self.delay = delay
|
||||
self.zero_first_rewards = zero_first_rewards
|
||||
self.q = deque(maxlen=delay+1)
|
||||
|
||||
def reset(self):
|
||||
self._choose_next_state()
|
||||
self.q.clear()
|
||||
for _ in range(self.delay + 1):
|
||||
self.q.append(self.action_space.sample())
|
||||
self.time = 0
|
||||
self.observation_space = self.action_space
|
||||
|
||||
return self.state
|
||||
return self.q[-1]
|
||||
|
||||
def step(self, actions):
|
||||
rew = self._get_reward(actions)
|
||||
self._choose_next_state()
|
||||
done = False
|
||||
if self.episode_len and self.time >= self.episode_len:
|
||||
rew = self._get_reward(self.q.popleft(), actions)
|
||||
if self.zero_first_rewards and self.time < self.delay:
|
||||
rew = 0
|
||||
done = True
|
||||
|
||||
return self.state, rew, done, {}
|
||||
|
||||
def _choose_next_state(self):
|
||||
self.state = self.action_space.sample()
|
||||
self.q.append(self.action_space.sample())
|
||||
self.time += 1
|
||||
done = self.episode_len is not None and self.time >= self.episode_len
|
||||
return self.q[-1], rew, done, {}
|
||||
|
||||
def seed(self, seed=None):
|
||||
self.action_space.seed(seed)
|
||||
|
||||
@abstractmethod
|
||||
def _get_reward(self, actions):
|
||||
def _get_reward(self, state, actions):
|
||||
raise NotImplementedError
|
||||
|
||||
|
||||
@@ -45,26 +49,29 @@ class DiscreteIdentityEnv(IdentityEnv):
|
||||
self,
|
||||
dim,
|
||||
episode_len=None,
|
||||
delay=0,
|
||||
zero_first_rewards=True
|
||||
):
|
||||
|
||||
self.action_space = Discrete(dim)
|
||||
super().__init__(episode_len=episode_len)
|
||||
super().__init__(episode_len=episode_len, delay=delay, zero_first_rewards=zero_first_rewards)
|
||||
|
||||
def _get_reward(self, actions):
|
||||
return 1 if self.state == actions else 0
|
||||
def _get_reward(self, state, actions):
|
||||
return 1 if state == actions else 0
|
||||
|
||||
class MultiDiscreteIdentityEnv(IdentityEnv):
|
||||
def __init__(
|
||||
self,
|
||||
dims,
|
||||
episode_len=None,
|
||||
delay=0,
|
||||
):
|
||||
|
||||
self.action_space = MultiDiscrete(dims)
|
||||
super().__init__(episode_len=episode_len)
|
||||
super().__init__(episode_len=episode_len, delay=delay)
|
||||
|
||||
def _get_reward(self, actions):
|
||||
return 1 if all(self.state == actions) else 0
|
||||
def _get_reward(self, state, actions):
|
||||
return 1 if all(state == actions) else 0
|
||||
|
||||
|
||||
class BoxIdentityEnv(IdentityEnv):
|
||||
@@ -74,10 +81,10 @@ class BoxIdentityEnv(IdentityEnv):
|
||||
episode_len=None,
|
||||
):
|
||||
|
||||
self.action_space = Box(low=-1.0, high=1.0, shape=shape)
|
||||
self.action_space = Box(low=-1.0, high=1.0, shape=shape, dtype=np.float32)
|
||||
super().__init__(episode_len=episode_len)
|
||||
|
||||
def _get_reward(self, actions):
|
||||
diff = actions - self.state
|
||||
def _get_reward(self, state, actions):
|
||||
diff = actions - state
|
||||
diff = diff[:]
|
||||
return -0.5 * np.dot(diff, diff)
|
||||
|
36
baselines/common/tests/envs/identity_env_test.py
Normal file
36
baselines/common/tests/envs/identity_env_test.py
Normal file
@@ -0,0 +1,36 @@
|
||||
from baselines.common.tests.envs.identity_env import DiscreteIdentityEnv
|
||||
|
||||
|
||||
def test_discrete_nodelay():
|
||||
nsteps = 100
|
||||
eplen = 50
|
||||
env = DiscreteIdentityEnv(10, episode_len=eplen)
|
||||
ob = env.reset()
|
||||
for t in range(nsteps):
|
||||
action = env.action_space.sample()
|
||||
next_ob, rew, done, info = env.step(action)
|
||||
assert rew == (1 if action == ob else 0)
|
||||
if (t + 1) % eplen == 0:
|
||||
assert done
|
||||
next_ob = env.reset()
|
||||
else:
|
||||
assert not done
|
||||
ob = next_ob
|
||||
|
||||
def test_discrete_delay1():
|
||||
eplen = 50
|
||||
env = DiscreteIdentityEnv(10, episode_len=eplen, delay=1)
|
||||
ob = env.reset()
|
||||
prev_ob = None
|
||||
for t in range(eplen):
|
||||
action = env.action_space.sample()
|
||||
next_ob, rew, done, info = env.step(action)
|
||||
if t > 0:
|
||||
assert rew == (1 if action == prev_ob else 0)
|
||||
else:
|
||||
assert rew == 0
|
||||
prev_ob = ob
|
||||
ob = next_ob
|
||||
if t < eplen - 1:
|
||||
assert not done
|
||||
assert done
|
@@ -9,7 +9,6 @@ from gym.spaces import Discrete, Box
|
||||
class MnistEnv(Env):
|
||||
def __init__(
|
||||
self,
|
||||
seed=0,
|
||||
episode_len=None,
|
||||
no_images=None
|
||||
):
|
||||
@@ -23,7 +22,6 @@ class MnistEnv(Env):
|
||||
self.mnist = input_data.read_data_sets(mnist_path)
|
||||
|
||||
self.np_random = np.random.RandomState()
|
||||
self.np_random.seed(seed)
|
||||
|
||||
self.observation_space = Box(low=0.0, high=1.0, shape=(28,28,1))
|
||||
self.action_space = Discrete(10)
|
||||
@@ -50,6 +48,9 @@ class MnistEnv(Env):
|
||||
|
||||
return self.state[0], rew, done, {}
|
||||
|
||||
def seed(self, seed=None):
|
||||
self.np_random.seed(seed)
|
||||
|
||||
def train_mode(self):
|
||||
self.dataset = self.mnist.train
|
||||
|
||||
|
@@ -3,6 +3,7 @@ import gym
|
||||
|
||||
from baselines.run import get_learn_function
|
||||
from baselines.common.tests.util import reward_per_episode_test
|
||||
from baselines.common.tests import mark_slow
|
||||
|
||||
common_kwargs = dict(
|
||||
total_timesteps=30000,
|
||||
@@ -20,7 +21,7 @@ learn_kwargs = {
|
||||
'trpo_mpi': {}
|
||||
}
|
||||
|
||||
@pytest.mark.slow
|
||||
@mark_slow
|
||||
@pytest.mark.parametrize("alg", learn_kwargs.keys())
|
||||
def test_cartpole(alg):
|
||||
'''
|
||||
|
40
baselines/common/tests/test_fetchreach.py
Normal file
40
baselines/common/tests/test_fetchreach.py
Normal file
@@ -0,0 +1,40 @@
|
||||
import pytest
|
||||
import gym
|
||||
|
||||
from baselines.run import get_learn_function
|
||||
from baselines.common.tests.util import reward_per_episode_test
|
||||
from baselines.common.tests import mark_slow
|
||||
|
||||
pytest.importorskip('mujoco_py')
|
||||
|
||||
common_kwargs = dict(
|
||||
network='mlp',
|
||||
seed=0,
|
||||
)
|
||||
|
||||
learn_kwargs = {
|
||||
'her': dict(total_timesteps=2000)
|
||||
}
|
||||
|
||||
@mark_slow
|
||||
@pytest.mark.parametrize("alg", learn_kwargs.keys())
|
||||
def test_fetchreach(alg):
|
||||
'''
|
||||
Test if the algorithm (with an mlp policy)
|
||||
can learn the FetchReach task
|
||||
'''
|
||||
|
||||
kwargs = common_kwargs.copy()
|
||||
kwargs.update(learn_kwargs[alg])
|
||||
|
||||
learn_fn = lambda e: get_learn_function(alg)(env=e, **kwargs)
|
||||
def env_fn():
|
||||
|
||||
env = gym.make('FetchReach-v1')
|
||||
env.seed(0)
|
||||
return env
|
||||
|
||||
reward_per_episode_test(env_fn, learn_fn, -15)
|
||||
|
||||
if __name__ == '__main__':
|
||||
test_fetchreach('her')
|
@@ -3,6 +3,8 @@ from baselines.common.tests.envs.fixed_sequence_env import FixedSequenceEnv
|
||||
|
||||
from baselines.common.tests.util import simple_test
|
||||
from baselines.run import get_learn_function
|
||||
from baselines.common.tests import mark_slow
|
||||
|
||||
|
||||
common_kwargs = dict(
|
||||
seed=0,
|
||||
@@ -21,7 +23,7 @@ learn_kwargs = {
|
||||
alg_list = learn_kwargs.keys()
|
||||
rnn_list = ['lstm']
|
||||
|
||||
@pytest.mark.slow
|
||||
@mark_slow
|
||||
@pytest.mark.parametrize("alg", alg_list)
|
||||
@pytest.mark.parametrize("rnn", rnn_list)
|
||||
def test_fixed_sequence(alg, rnn):
|
||||
@@ -33,8 +35,7 @@ def test_fixed_sequence(alg, rnn):
|
||||
kwargs = learn_kwargs[alg]
|
||||
kwargs.update(common_kwargs)
|
||||
|
||||
episode_len = 5
|
||||
env_fn = lambda: FixedSequenceEnv(10, episode_len=episode_len)
|
||||
env_fn = lambda: FixedSequenceEnv(n_actions=10, episode_len=5)
|
||||
learn = lambda e: get_learn_function(alg)(
|
||||
env=e,
|
||||
network=rnn,
|
||||
|
@@ -2,6 +2,7 @@ import pytest
|
||||
from baselines.common.tests.envs.identity_env import DiscreteIdentityEnv, BoxIdentityEnv, MultiDiscreteIdentityEnv
|
||||
from baselines.run import get_learn_function
|
||||
from baselines.common.tests.util import simple_test
|
||||
from baselines.common.tests import mark_slow
|
||||
|
||||
common_kwargs = dict(
|
||||
total_timesteps=30000,
|
||||
@@ -24,7 +25,7 @@ algos_disc = ['a2c', 'acktr', 'deepq', 'ppo2', 'trpo_mpi']
|
||||
algos_multidisc = ['a2c', 'acktr', 'ppo2', 'trpo_mpi']
|
||||
algos_cont = ['a2c', 'acktr', 'ddpg', 'ppo2', 'trpo_mpi']
|
||||
|
||||
@pytest.mark.slow
|
||||
@mark_slow
|
||||
@pytest.mark.parametrize("alg", algos_disc)
|
||||
def test_discrete_identity(alg):
|
||||
'''
|
||||
@@ -39,7 +40,7 @@ def test_discrete_identity(alg):
|
||||
env_fn = lambda: DiscreteIdentityEnv(10, episode_len=100)
|
||||
simple_test(env_fn, learn_fn, 0.9)
|
||||
|
||||
@pytest.mark.slow
|
||||
@mark_slow
|
||||
@pytest.mark.parametrize("alg", algos_multidisc)
|
||||
def test_multidiscrete_identity(alg):
|
||||
'''
|
||||
@@ -54,7 +55,7 @@ def test_multidiscrete_identity(alg):
|
||||
env_fn = lambda: MultiDiscreteIdentityEnv((3,3), episode_len=100)
|
||||
simple_test(env_fn, learn_fn, 0.9)
|
||||
|
||||
@pytest.mark.slow
|
||||
@mark_slow
|
||||
@pytest.mark.parametrize("alg", algos_cont)
|
||||
def test_continuous_identity(alg):
|
||||
'''
|
||||
|
@@ -4,7 +4,7 @@ import pytest
|
||||
from baselines.common.tests.envs.mnist_env import MnistEnv
|
||||
from baselines.common.tests.util import simple_test
|
||||
from baselines.run import get_learn_function
|
||||
|
||||
from baselines.common.tests import mark_slow
|
||||
|
||||
# TODO investigate a2c and ppo2 failures - is it due to bad hyperparameters for this problem?
|
||||
# GitHub issue https://github.com/openai/baselines/issues/189
|
||||
@@ -28,7 +28,7 @@ learn_args = {
|
||||
#tests pass, but are too slow on travis. Same algorithms are covered
|
||||
# by other tests with less compute-hungry nn's and by benchmarks
|
||||
@pytest.mark.skip
|
||||
@pytest.mark.slow
|
||||
@mark_slow
|
||||
@pytest.mark.parametrize("alg", learn_args.keys())
|
||||
def test_mnist(alg):
|
||||
'''
|
||||
@@ -41,7 +41,7 @@ def test_mnist(alg):
|
||||
|
||||
learn = get_learn_function(alg)
|
||||
learn_fn = lambda e: learn(env=e, **learn_kwargs)
|
||||
env_fn = lambda: MnistEnv(seed=0, episode_len=100)
|
||||
env_fn = lambda: MnistEnv(episode_len=100)
|
||||
|
||||
simple_test(env_fn, learn_fn, 0.6)
|
||||
|
||||
|
17
baselines/common/tests/test_plot_util.py
Normal file
17
baselines/common/tests/test_plot_util.py
Normal file
@@ -0,0 +1,17 @@
|
||||
# smoke tests of plot_util
|
||||
from baselines.common import plot_util as pu
|
||||
from baselines.common.tests.util import smoketest
|
||||
|
||||
|
||||
def test_plot_util():
|
||||
nruns = 4
|
||||
logdirs = [smoketest('--alg=ppo2 --env=CartPole-v0 --num_timesteps=10000') for _ in range(nruns)]
|
||||
data = pu.load_results(logdirs)
|
||||
assert len(data) == 4
|
||||
|
||||
_, axes = pu.plot_results(data[:1]); assert len(axes) == 1
|
||||
_, axes = pu.plot_results(data, tiling='vertical'); assert axes.shape==(4,1)
|
||||
_, axes = pu.plot_results(data, tiling='horizontal'); assert axes.shape==(1,4)
|
||||
_, axes = pu.plot_results(data, tiling='symmetric'); assert axes.shape==(2,2)
|
||||
_, axes = pu.plot_results(data, split_fn=lambda _: ''); assert len(axes) == 1
|
||||
|
@@ -44,7 +44,12 @@ def test_serialization(learn_fn, network_fn):
|
||||
# github issue: https://github.com/openai/baselines/issues/660
|
||||
return
|
||||
|
||||
env = DummyVecEnv([lambda: MnistEnv(10, episode_len=100)])
|
||||
def make_env():
|
||||
env = MnistEnv(episode_len=100)
|
||||
env.seed(10)
|
||||
return env
|
||||
|
||||
env = DummyVecEnv([make_env])
|
||||
ob = env.reset().copy()
|
||||
learn = get_learn_function(learn_fn)
|
||||
|
||||
@@ -103,9 +108,9 @@ def test_coexistence(learn_fn, network_fn):
|
||||
kwargs.update(learn_kwargs[learn_fn])
|
||||
|
||||
learn = partial(learn, env=env, network=network_fn, total_timesteps=0, **kwargs)
|
||||
make_session(make_default=True, graph=tf.Graph());
|
||||
make_session(make_default=True, graph=tf.Graph())
|
||||
model1 = learn(seed=1)
|
||||
make_session(make_default=True, graph=tf.Graph());
|
||||
make_session(make_default=True, graph=tf.Graph())
|
||||
model2 = learn(seed=2)
|
||||
|
||||
model1.step(env.observation_space.sample())
|
||||
|
@@ -18,7 +18,9 @@ def test_function():
|
||||
initialize()
|
||||
|
||||
assert lin(2) == 6
|
||||
assert lin(x=3) == 9
|
||||
assert lin(2, 2) == 10
|
||||
assert lin(x=2, y=3) == 12
|
||||
|
||||
|
||||
def test_multikwargs():
|
||||
|
38
baselines/common/tests/test_with_mpi.py
Normal file
38
baselines/common/tests/test_with_mpi.py
Normal file
@@ -0,0 +1,38 @@
|
||||
import os
|
||||
import sys
|
||||
import subprocess
|
||||
import cloudpickle
|
||||
import base64
|
||||
import pytest
|
||||
from functools import wraps
|
||||
|
||||
try:
|
||||
from mpi4py import MPI
|
||||
except ImportError:
|
||||
MPI = None
|
||||
|
||||
def with_mpi(nproc=2, timeout=30, skip_if_no_mpi=True):
|
||||
def outer_thunk(fn):
|
||||
@wraps(fn)
|
||||
def thunk(*args, **kwargs):
|
||||
serialized_fn = base64.b64encode(cloudpickle.dumps(lambda: fn(*args, **kwargs)))
|
||||
subprocess.check_call([
|
||||
'mpiexec','-n', str(nproc),
|
||||
sys.executable,
|
||||
'-m', 'baselines.common.tests.test_with_mpi',
|
||||
serialized_fn
|
||||
], env=os.environ, timeout=timeout)
|
||||
|
||||
if skip_if_no_mpi:
|
||||
return pytest.mark.skipif(MPI is None, reason="MPI not present")(thunk)
|
||||
else:
|
||||
return thunk
|
||||
|
||||
return outer_thunk
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
if len(sys.argv) > 1:
|
||||
fn = cloudpickle.loads(base64.b64decode(sys.argv[1]))
|
||||
assert callable(fn)
|
||||
fn()
|
@@ -1,56 +1,50 @@
|
||||
import tensorflow as tf
|
||||
import numpy as np
|
||||
from gym.spaces import np_random
|
||||
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
|
||||
|
||||
N_TRIALS = 10000
|
||||
N_EPISODES = 100
|
||||
|
||||
_sess_config = tf.ConfigProto(
|
||||
allow_soft_placement=True,
|
||||
intra_op_parallelism_threads=1,
|
||||
inter_op_parallelism_threads=1
|
||||
)
|
||||
|
||||
def simple_test(env_fn, learn_fn, min_reward_fraction, n_trials=N_TRIALS):
|
||||
def seeded_env_fn():
|
||||
env = env_fn()
|
||||
env.seed(0)
|
||||
return env
|
||||
|
||||
np.random.seed(0)
|
||||
np_random.seed(0)
|
||||
|
||||
env = DummyVecEnv([env_fn])
|
||||
|
||||
|
||||
with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():
|
||||
env = DummyVecEnv([seeded_env_fn])
|
||||
with tf.Graph().as_default(), tf.Session(config=_sess_config).as_default():
|
||||
tf.set_random_seed(0)
|
||||
|
||||
model = learn_fn(env)
|
||||
|
||||
sum_rew = 0
|
||||
done = True
|
||||
|
||||
for i in range(n_trials):
|
||||
if done:
|
||||
obs = env.reset()
|
||||
state = model.initial_state
|
||||
|
||||
if state is not None:
|
||||
a, v, state, _ = model.step(obs, S=state, M=[False])
|
||||
else:
|
||||
a, v, _, _ = model.step(obs)
|
||||
|
||||
obs, rew, done, _ = env.step(a)
|
||||
sum_rew += float(rew)
|
||||
|
||||
print("Reward in {} trials is {}".format(n_trials, sum_rew))
|
||||
assert sum_rew > min_reward_fraction * n_trials, \
|
||||
'sum of rewards {} is less than {} of the total number of trials {}'.format(sum_rew, min_reward_fraction, n_trials)
|
||||
|
||||
|
||||
|
||||
def reward_per_episode_test(env_fn, learn_fn, min_avg_reward, n_trials=N_EPISODES):
|
||||
env = DummyVecEnv([env_fn])
|
||||
|
||||
with tf.Graph().as_default(), tf.Session(config=tf.ConfigProto(allow_soft_placement=True)).as_default():
|
||||
with tf.Graph().as_default(), tf.Session(config=_sess_config).as_default():
|
||||
model = learn_fn(env)
|
||||
|
||||
N_TRIALS = 100
|
||||
|
||||
observations, actions, rewards = rollout(env, model, N_TRIALS)
|
||||
rewards = [sum(r) for r in rewards]
|
||||
|
||||
avg_rew = sum(rewards) / N_TRIALS
|
||||
print("Average reward in {} episodes is {}".format(n_trials, avg_rew))
|
||||
assert avg_rew > min_avg_reward, \
|
||||
@@ -60,14 +54,12 @@ def rollout(env, model, n_trials):
|
||||
rewards = []
|
||||
actions = []
|
||||
observations = []
|
||||
|
||||
for i in range(n_trials):
|
||||
obs = env.reset()
|
||||
state = model.initial_state
|
||||
state = model.initial_state if hasattr(model, 'initial_state') else None
|
||||
episode_rew = []
|
||||
episode_actions = []
|
||||
episode_obs = []
|
||||
|
||||
while True:
|
||||
if state is not None:
|
||||
a, v, state, _ = model.step(obs, S=state, M=[False])
|
||||
@@ -75,17 +67,26 @@ def rollout(env, model, n_trials):
|
||||
a,v, _, _ = model.step(obs)
|
||||
|
||||
obs, rew, done, _ = env.step(a)
|
||||
|
||||
episode_rew.append(rew)
|
||||
episode_actions.append(a)
|
||||
episode_obs.append(obs)
|
||||
|
||||
if done:
|
||||
break
|
||||
|
||||
rewards.append(episode_rew)
|
||||
actions.append(episode_actions)
|
||||
observations.append(episode_obs)
|
||||
|
||||
return observations, actions, rewards
|
||||
|
||||
|
||||
def smoketest(argstr, **kwargs):
|
||||
import tempfile
|
||||
import subprocess
|
||||
import os
|
||||
argstr = 'python -m baselines.run ' + argstr
|
||||
for key, value in kwargs:
|
||||
argstr += ' --{}={}'.format(key, value)
|
||||
tempdir = tempfile.mkdtemp()
|
||||
env = os.environ.copy()
|
||||
env['OPENAI_LOGDIR'] = tempdir
|
||||
subprocess.run(argstr.split(' '), env=env)
|
||||
return tempdir
|
||||
|
@@ -1,4 +1,3 @@
|
||||
import joblib
|
||||
import numpy as np
|
||||
import tensorflow as tf # pylint: ignore-module
|
||||
import copy
|
||||
@@ -165,6 +164,10 @@ def function(inputs, outputs, updates=None, givens=None):
|
||||
outputs: [tf.Variable] or tf.Variable
|
||||
list of outputs or a single output to be returned from function. Returned
|
||||
value will also have the same shape.
|
||||
updates: [tf.Operation] or tf.Operation
|
||||
list of update functions or single update function that will be run whenever
|
||||
the function is called. The return is ignored.
|
||||
|
||||
"""
|
||||
if isinstance(outputs, list):
|
||||
return _Function(inputs, outputs, updates, givens=givens)
|
||||
@@ -182,6 +185,7 @@ class _Function(object):
|
||||
if not hasattr(inpt, 'make_feed_dict') and not (type(inpt) is tf.Tensor and len(inpt.op.inputs) == 0):
|
||||
assert False, "inputs should all be placeholders, constants, or have a make_feed_dict method"
|
||||
self.inputs = inputs
|
||||
self.input_names = {inp.name.split("/")[-1].split(":")[0]: inp for inp in inputs}
|
||||
updates = updates or []
|
||||
self.update_group = tf.group(*updates)
|
||||
self.outputs_update = list(outputs) + [self.update_group]
|
||||
@@ -193,15 +197,17 @@ class _Function(object):
|
||||
else:
|
||||
feed_dict[inpt] = adjust_shape(inpt, value)
|
||||
|
||||
def __call__(self, *args):
|
||||
assert len(args) <= len(self.inputs), "Too many arguments provided"
|
||||
def __call__(self, *args, **kwargs):
|
||||
assert len(args) + len(kwargs) <= len(self.inputs), "Too many arguments provided"
|
||||
feed_dict = {}
|
||||
# Update the args
|
||||
for inpt, value in zip(self.inputs, args):
|
||||
self._feed_input(feed_dict, inpt, value)
|
||||
# Update feed dict with givens.
|
||||
for inpt in self.givens:
|
||||
feed_dict[inpt] = adjust_shape(inpt, feed_dict.get(inpt, self.givens[inpt]))
|
||||
# Update the args
|
||||
for inpt, value in zip(self.inputs, args):
|
||||
self._feed_input(feed_dict, inpt, value)
|
||||
for inpt_name, value in kwargs.items():
|
||||
self._feed_input(feed_dict, self.input_names[inpt_name], value)
|
||||
results = get_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
|
||||
return results
|
||||
|
||||
@@ -299,12 +305,17 @@ def display_var_info(vars):
|
||||
logger.info("Total model parameters: %0.2f million" % (count_params*1e-6))
|
||||
|
||||
|
||||
def get_available_gpus():
|
||||
# recipe from here:
|
||||
# https://stackoverflow.com/questions/38559755/how-to-get-current-available-gpus-in-tensorflow?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
|
||||
def get_available_gpus(session_config=None):
|
||||
# based on recipe from https://stackoverflow.com/a/38580201
|
||||
|
||||
# Unless we allocate a session here, subsequent attempts to create one
|
||||
# will ignore our custom config (in particular, allow_growth=True will have
|
||||
# no effect).
|
||||
if session_config is None:
|
||||
session_config = get_session()._config
|
||||
|
||||
from tensorflow.python.client import device_lib
|
||||
local_device_protos = device_lib.list_local_devices()
|
||||
local_device_protos = device_lib.list_local_devices(session_config)
|
||||
return [x.name for x in local_device_protos if x.device_type == 'GPU']
|
||||
|
||||
# ================================================================
|
||||
@@ -332,8 +343,9 @@ def save_state(fname, sess=None):
|
||||
# TODO: ensure there is no subtle differences and remove one
|
||||
|
||||
def save_variables(save_path, variables=None, sess=None):
|
||||
import joblib
|
||||
sess = sess or get_session()
|
||||
variables = variables or tf.trainable_variables()
|
||||
variables = variables or tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
|
||||
|
||||
ps = sess.run(variables)
|
||||
save_dict = {v.name: value for v, value in zip(variables, ps)}
|
||||
@@ -343,8 +355,9 @@ def save_variables(save_path, variables=None, sess=None):
|
||||
joblib.dump(save_dict, save_path)
|
||||
|
||||
def load_variables(load_path, variables=None, sess=None):
|
||||
import joblib
|
||||
sess = sess or get_session()
|
||||
variables = variables or tf.trainable_variables()
|
||||
variables = variables or tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
|
||||
|
||||
loaded_params = joblib.load(os.path.expanduser(load_path))
|
||||
restores = []
|
||||
|
@@ -1,185 +1,10 @@
|
||||
from abc import ABC, abstractmethod
|
||||
from baselines.common.tile_images import tile_images
|
||||
from .vec_env import AlreadySteppingError, NotSteppingError, VecEnv, VecEnvWrapper, VecEnvObservationWrapper, CloudpickleWrapper
|
||||
from .dummy_vec_env import DummyVecEnv
|
||||
from .shmem_vec_env import ShmemVecEnv
|
||||
from .subproc_vec_env import SubprocVecEnv
|
||||
from .vec_frame_stack import VecFrameStack
|
||||
from .vec_monitor import VecMonitor
|
||||
from .vec_normalize import VecNormalize
|
||||
from .vec_remove_dict_obs import VecExtractDictObs
|
||||
|
||||
class AlreadySteppingError(Exception):
|
||||
"""
|
||||
Raised when an asynchronous step is running while
|
||||
step_async() is called again.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
msg = 'already running an async step'
|
||||
Exception.__init__(self, msg)
|
||||
|
||||
|
||||
class NotSteppingError(Exception):
|
||||
"""
|
||||
Raised when an asynchronous step is not running but
|
||||
step_wait() is called.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
msg = 'not running an async step'
|
||||
Exception.__init__(self, msg)
|
||||
|
||||
|
||||
class VecEnv(ABC):
|
||||
"""
|
||||
An abstract asynchronous, vectorized environment.
|
||||
Used to batch data from multiple copies of an environment, so that
|
||||
each observation becomes an batch of observations, and expected action is a batch of actions to
|
||||
be applied per-environment.
|
||||
"""
|
||||
closed = False
|
||||
viewer = None
|
||||
|
||||
metadata = {
|
||||
'render.modes': ['human', 'rgb_array']
|
||||
}
|
||||
|
||||
def __init__(self, num_envs, observation_space, action_space):
|
||||
self.num_envs = num_envs
|
||||
self.observation_space = observation_space
|
||||
self.action_space = action_space
|
||||
|
||||
@abstractmethod
|
||||
def reset(self):
|
||||
"""
|
||||
Reset all the environments and return an array of
|
||||
observations, or a dict of observation arrays.
|
||||
|
||||
If step_async is still doing work, that work will
|
||||
be cancelled and step_wait() should not be called
|
||||
until step_async() is invoked again.
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def step_async(self, actions):
|
||||
"""
|
||||
Tell all the environments to start taking a step
|
||||
with the given actions.
|
||||
Call step_wait() to get the results of the step.
|
||||
|
||||
You should not call this if a step_async run is
|
||||
already pending.
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def step_wait(self):
|
||||
"""
|
||||
Wait for the step taken with step_async().
|
||||
|
||||
Returns (obs, rews, dones, infos):
|
||||
- obs: an array of observations, or a dict of
|
||||
arrays of observations.
|
||||
- rews: an array of rewards
|
||||
- dones: an array of "episode done" booleans
|
||||
- infos: a sequence of info objects
|
||||
"""
|
||||
pass
|
||||
|
||||
def close_extras(self):
|
||||
"""
|
||||
Clean up the extra resources, beyond what's in this base class.
|
||||
Only runs when not self.closed.
|
||||
"""
|
||||
pass
|
||||
|
||||
def close(self):
|
||||
if self.closed:
|
||||
return
|
||||
if self.viewer is not None:
|
||||
self.viewer.close()
|
||||
self.close_extras()
|
||||
self.closed = True
|
||||
|
||||
def step(self, actions):
|
||||
"""
|
||||
Step the environments synchronously.
|
||||
|
||||
This is available for backwards compatibility.
|
||||
"""
|
||||
self.step_async(actions)
|
||||
return self.step_wait()
|
||||
|
||||
def render(self, mode='human'):
|
||||
imgs = self.get_images()
|
||||
bigimg = tile_images(imgs)
|
||||
if mode == 'human':
|
||||
self.get_viewer().imshow(bigimg)
|
||||
return self.get_viewer().isopen
|
||||
elif mode == 'rgb_array':
|
||||
return bigimg
|
||||
else:
|
||||
raise NotImplementedError
|
||||
|
||||
def get_images(self):
|
||||
"""
|
||||
Return RGB images from each environment
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
@property
|
||||
def unwrapped(self):
|
||||
if isinstance(self, VecEnvWrapper):
|
||||
return self.venv.unwrapped
|
||||
else:
|
||||
return self
|
||||
|
||||
def get_viewer(self):
|
||||
if self.viewer is None:
|
||||
from gym.envs.classic_control import rendering
|
||||
self.viewer = rendering.SimpleImageViewer()
|
||||
return self.viewer
|
||||
|
||||
|
||||
class VecEnvWrapper(VecEnv):
|
||||
"""
|
||||
An environment wrapper that applies to an entire batch
|
||||
of environments at once.
|
||||
"""
|
||||
|
||||
def __init__(self, venv, observation_space=None, action_space=None):
|
||||
self.venv = venv
|
||||
VecEnv.__init__(self,
|
||||
num_envs=venv.num_envs,
|
||||
observation_space=observation_space or venv.observation_space,
|
||||
action_space=action_space or venv.action_space)
|
||||
|
||||
def step_async(self, actions):
|
||||
self.venv.step_async(actions)
|
||||
|
||||
@abstractmethod
|
||||
def reset(self):
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def step_wait(self):
|
||||
pass
|
||||
|
||||
def close(self):
|
||||
return self.venv.close()
|
||||
|
||||
def render(self, mode='human'):
|
||||
return self.venv.render(mode=mode)
|
||||
|
||||
def get_images(self):
|
||||
return self.venv.get_images()
|
||||
|
||||
class CloudpickleWrapper(object):
|
||||
"""
|
||||
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
|
||||
"""
|
||||
|
||||
def __init__(self, x):
|
||||
self.x = x
|
||||
|
||||
def __getstate__(self):
|
||||
import cloudpickle
|
||||
return cloudpickle.dumps(self.x)
|
||||
|
||||
def __setstate__(self, ob):
|
||||
import pickle
|
||||
self.x = pickle.loads(ob)
|
||||
__all__ = ['AlreadySteppingError', 'NotSteppingError', 'VecEnv', 'VecEnvWrapper', 'VecEnvObservationWrapper', 'CloudpickleWrapper', 'DummyVecEnv', 'ShmemVecEnv', 'SubprocVecEnv', 'VecFrameStack', 'VecMonitor', 'VecNormalize', 'VecExtractDictObs']
|
||||
|
@@ -1,6 +1,5 @@
|
||||
import numpy as np
|
||||
from gym import spaces
|
||||
from . import VecEnv
|
||||
from .vec_env import VecEnv
|
||||
from .util import copy_obs_dict, dict_to_obs, obs_space_info
|
||||
|
||||
class DummyVecEnv(VecEnv):
|
||||
@@ -27,6 +26,7 @@ class DummyVecEnv(VecEnv):
|
||||
self.buf_rews = np.zeros((self.num_envs,), dtype=np.float32)
|
||||
self.buf_infos = [{} for _ in range(self.num_envs)]
|
||||
self.actions = None
|
||||
self.spec = self.envs[0].spec
|
||||
|
||||
def step_async(self, actions):
|
||||
listify = True
|
||||
@@ -45,8 +45,8 @@ class DummyVecEnv(VecEnv):
|
||||
def step_wait(self):
|
||||
for e in range(self.num_envs):
|
||||
action = self.actions[e]
|
||||
if isinstance(self.envs[e].action_space, spaces.Discrete):
|
||||
action = int(action)
|
||||
# if isinstance(self.envs[e].action_space, spaces.Discrete):
|
||||
# action = int(action)
|
||||
|
||||
obs, self.buf_rews[e], self.buf_dones[e], self.buf_infos[e] = self.envs[e].step(action)
|
||||
if self.buf_dones[e]:
|
||||
|
@@ -2,9 +2,9 @@
|
||||
An interface for asynchronous vectorized environments.
|
||||
"""
|
||||
|
||||
from multiprocessing import Pipe, Array, Process
|
||||
import multiprocessing as mp
|
||||
import numpy as np
|
||||
from . import VecEnv, CloudpickleWrapper
|
||||
from .vec_env import VecEnv, CloudpickleWrapper, clear_mpi_env_vars
|
||||
import ctypes
|
||||
from baselines import logger
|
||||
|
||||
@@ -22,11 +22,12 @@ class ShmemVecEnv(VecEnv):
|
||||
Optimized version of SubprocVecEnv that uses shared variables to communicate observations.
|
||||
"""
|
||||
|
||||
def __init__(self, env_fns, spaces=None):
|
||||
def __init__(self, env_fns, spaces=None, context='spawn'):
|
||||
"""
|
||||
If you don't specify observation_space, we'll have to create a dummy
|
||||
environment to get it.
|
||||
"""
|
||||
ctx = mp.get_context(context)
|
||||
if spaces:
|
||||
observation_space, action_space = spaces
|
||||
else:
|
||||
@@ -39,20 +40,21 @@ class ShmemVecEnv(VecEnv):
|
||||
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
|
||||
self.obs_keys, self.obs_shapes, self.obs_dtypes = obs_space_info(observation_space)
|
||||
self.obs_bufs = [
|
||||
{k: Array(_NP_TO_CT[self.obs_dtypes[k].type], int(np.prod(self.obs_shapes[k]))) for k in self.obs_keys}
|
||||
{k: ctx.Array(_NP_TO_CT[self.obs_dtypes[k].type], int(np.prod(self.obs_shapes[k]))) for k in self.obs_keys}
|
||||
for _ in env_fns]
|
||||
self.parent_pipes = []
|
||||
self.procs = []
|
||||
for env_fn, obs_buf in zip(env_fns, self.obs_bufs):
|
||||
wrapped_fn = CloudpickleWrapper(env_fn)
|
||||
parent_pipe, child_pipe = Pipe()
|
||||
proc = Process(target=_subproc_worker,
|
||||
args=(child_pipe, parent_pipe, wrapped_fn, obs_buf, self.obs_shapes, self.obs_dtypes, self.obs_keys))
|
||||
proc.daemon = True
|
||||
self.procs.append(proc)
|
||||
self.parent_pipes.append(parent_pipe)
|
||||
proc.start()
|
||||
child_pipe.close()
|
||||
with clear_mpi_env_vars():
|
||||
for env_fn, obs_buf in zip(env_fns, self.obs_bufs):
|
||||
wrapped_fn = CloudpickleWrapper(env_fn)
|
||||
parent_pipe, child_pipe = ctx.Pipe()
|
||||
proc = ctx.Process(target=_subproc_worker,
|
||||
args=(child_pipe, parent_pipe, wrapped_fn, obs_buf, self.obs_shapes, self.obs_dtypes, self.obs_keys))
|
||||
proc.daemon = True
|
||||
self.procs.append(proc)
|
||||
self.parent_pipes.append(parent_pipe)
|
||||
proc.start()
|
||||
child_pipe.close()
|
||||
self.waiting_step = False
|
||||
self.viewer = None
|
||||
|
||||
@@ -68,9 +70,11 @@ class ShmemVecEnv(VecEnv):
|
||||
assert len(actions) == len(self.parent_pipes)
|
||||
for pipe, act in zip(self.parent_pipes, actions):
|
||||
pipe.send(('step', act))
|
||||
self.waiting_step = True
|
||||
|
||||
def step_wait(self):
|
||||
outs = [pipe.recv() for pipe in self.parent_pipes]
|
||||
self.waiting_step = False
|
||||
obs, rews, dones, infos = zip(*outs)
|
||||
return self._decode_obses(obs), np.array(rews), np.array(dones), infos
|
||||
|
||||
|
@@ -1,34 +1,39 @@
|
||||
import numpy as np
|
||||
from multiprocessing import Process, Pipe
|
||||
from . import VecEnv, CloudpickleWrapper
|
||||
import multiprocessing as mp
|
||||
|
||||
import numpy as np
|
||||
from .vec_env import VecEnv, CloudpickleWrapper, clear_mpi_env_vars
|
||||
|
||||
|
||||
def worker(remote, parent_remote, env_fn_wrappers):
|
||||
def step_env(env, action):
|
||||
ob, reward, done, info = env.step(action)
|
||||
if done:
|
||||
ob = env.reset()
|
||||
return ob, reward, done, info
|
||||
|
||||
def worker(remote, parent_remote, env_fn_wrapper):
|
||||
parent_remote.close()
|
||||
env = env_fn_wrapper.x()
|
||||
envs = [env_fn_wrapper() for env_fn_wrapper in env_fn_wrappers.x]
|
||||
try:
|
||||
while True:
|
||||
cmd, data = remote.recv()
|
||||
if cmd == 'step':
|
||||
ob, reward, done, info = env.step(data)
|
||||
if done:
|
||||
ob = env.reset()
|
||||
remote.send((ob, reward, done, info))
|
||||
remote.send([step_env(env, action) for env, action in zip(envs, data)])
|
||||
elif cmd == 'reset':
|
||||
ob = env.reset()
|
||||
remote.send(ob)
|
||||
remote.send([env.reset() for env in envs])
|
||||
elif cmd == 'render':
|
||||
remote.send(env.render(mode='rgb_array'))
|
||||
remote.send([env.render(mode='rgb_array') for env in envs])
|
||||
elif cmd == 'close':
|
||||
remote.close()
|
||||
break
|
||||
elif cmd == 'get_spaces':
|
||||
remote.send((env.observation_space, env.action_space))
|
||||
elif cmd == 'get_spaces_spec':
|
||||
remote.send((envs[0].observation_space, envs[0].action_space, envs[0].spec))
|
||||
else:
|
||||
raise NotImplementedError
|
||||
except KeyboardInterrupt:
|
||||
print('SubprocVecEnv worker: got KeyboardInterrupt')
|
||||
finally:
|
||||
env.close()
|
||||
for env in envs:
|
||||
env.close()
|
||||
|
||||
|
||||
class SubprocVecEnv(VecEnv):
|
||||
@@ -36,31 +41,40 @@ class SubprocVecEnv(VecEnv):
|
||||
VecEnv that runs multiple environments in parallel in subproceses and communicates with them via pipes.
|
||||
Recommended to use when num_envs > 1 and step() can be a bottleneck.
|
||||
"""
|
||||
def __init__(self, env_fns, spaces=None):
|
||||
def __init__(self, env_fns, spaces=None, context='spawn', in_series=1):
|
||||
"""
|
||||
Arguments:
|
||||
|
||||
env_fns: iterable of callables - functions that create environments to run in subprocesses. Need to be cloud-pickleable
|
||||
in_series: number of environments to run in series in a single process
|
||||
(e.g. when len(env_fns) == 12 and in_series == 3, it will run 4 processes, each running 3 envs in series)
|
||||
"""
|
||||
self.waiting = False
|
||||
self.closed = False
|
||||
self.in_series = in_series
|
||||
nenvs = len(env_fns)
|
||||
self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
|
||||
self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
|
||||
assert nenvs % in_series == 0, "Number of envs must be divisible by number of envs to run in series"
|
||||
self.nremotes = nenvs // in_series
|
||||
env_fns = np.array_split(env_fns, self.nremotes)
|
||||
ctx = mp.get_context(context)
|
||||
self.remotes, self.work_remotes = zip(*[ctx.Pipe() for _ in range(self.nremotes)])
|
||||
self.ps = [ctx.Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
|
||||
for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
|
||||
for p in self.ps:
|
||||
p.daemon = True # if the main process crashes, we should not cause things to hang
|
||||
p.start()
|
||||
with clear_mpi_env_vars():
|
||||
p.start()
|
||||
for remote in self.work_remotes:
|
||||
remote.close()
|
||||
|
||||
self.remotes[0].send(('get_spaces', None))
|
||||
observation_space, action_space = self.remotes[0].recv()
|
||||
self.remotes[0].send(('get_spaces_spec', None))
|
||||
observation_space, action_space, self.spec = self.remotes[0].recv()
|
||||
self.viewer = None
|
||||
VecEnv.__init__(self, len(env_fns), observation_space, action_space)
|
||||
VecEnv.__init__(self, nenvs, observation_space, action_space)
|
||||
|
||||
def step_async(self, actions):
|
||||
self._assert_not_closed()
|
||||
actions = np.array_split(actions, self.nremotes)
|
||||
for remote, action in zip(self.remotes, actions):
|
||||
remote.send(('step', action))
|
||||
self.waiting = True
|
||||
@@ -68,15 +82,18 @@ class SubprocVecEnv(VecEnv):
|
||||
def step_wait(self):
|
||||
self._assert_not_closed()
|
||||
results = [remote.recv() for remote in self.remotes]
|
||||
results = _flatten_list(results)
|
||||
self.waiting = False
|
||||
obs, rews, dones, infos = zip(*results)
|
||||
return np.stack(obs), np.stack(rews), np.stack(dones), infos
|
||||
return _flatten_obs(obs), np.stack(rews), np.stack(dones), infos
|
||||
|
||||
def reset(self):
|
||||
self._assert_not_closed()
|
||||
for remote in self.remotes:
|
||||
remote.send(('reset', None))
|
||||
return np.stack([remote.recv() for remote in self.remotes])
|
||||
obs = [remote.recv() for remote in self.remotes]
|
||||
obs = _flatten_list(obs)
|
||||
return _flatten_obs(obs)
|
||||
|
||||
def close_extras(self):
|
||||
self.closed = True
|
||||
@@ -93,7 +110,29 @@ class SubprocVecEnv(VecEnv):
|
||||
for pipe in self.remotes:
|
||||
pipe.send(('render', None))
|
||||
imgs = [pipe.recv() for pipe in self.remotes]
|
||||
imgs = _flatten_list(imgs)
|
||||
return imgs
|
||||
|
||||
def _assert_not_closed(self):
|
||||
assert not self.closed, "Trying to operate on a SubprocVecEnv after calling close()"
|
||||
|
||||
def __del__(self):
|
||||
if not self.closed:
|
||||
self.close()
|
||||
|
||||
def _flatten_obs(obs):
|
||||
assert isinstance(obs, (list, tuple))
|
||||
assert len(obs) > 0
|
||||
|
||||
if isinstance(obs[0], dict):
|
||||
keys = obs[0].keys()
|
||||
return {k: np.stack([o[k] for o in obs]) for k in keys}
|
||||
else:
|
||||
return np.stack(obs)
|
||||
|
||||
def _flatten_list(l):
|
||||
assert isinstance(l, (list, tuple))
|
||||
assert len(l) > 0
|
||||
assert all([len(l_) > 0 for l_ in l])
|
||||
|
||||
return [l__ for l_ in l for l__ in l_]
|
||||
|
@@ -8,39 +8,40 @@ import pytest
|
||||
from .dummy_vec_env import DummyVecEnv
|
||||
from .shmem_vec_env import ShmemVecEnv
|
||||
from .subproc_vec_env import SubprocVecEnv
|
||||
from baselines.common.tests.test_with_mpi import with_mpi
|
||||
|
||||
|
||||
def assert_envs_equal(env1, env2, num_steps):
|
||||
def assert_venvs_equal(venv1, venv2, num_steps):
|
||||
"""
|
||||
Compare two environments over num_steps steps and make sure
|
||||
that the observations produced by each are the same when given
|
||||
the same actions.
|
||||
"""
|
||||
assert env1.num_envs == env2.num_envs
|
||||
assert env1.action_space.shape == env2.action_space.shape
|
||||
assert env1.action_space.dtype == env2.action_space.dtype
|
||||
joint_shape = (env1.num_envs,) + env1.action_space.shape
|
||||
assert venv1.num_envs == venv2.num_envs
|
||||
assert venv1.observation_space.shape == venv2.observation_space.shape
|
||||
assert venv1.observation_space.dtype == venv2.observation_space.dtype
|
||||
assert venv1.action_space.shape == venv2.action_space.shape
|
||||
assert venv1.action_space.dtype == venv2.action_space.dtype
|
||||
|
||||
try:
|
||||
obs1, obs2 = env1.reset(), env2.reset()
|
||||
obs1, obs2 = venv1.reset(), venv2.reset()
|
||||
assert np.array(obs1).shape == np.array(obs2).shape
|
||||
assert np.array(obs1).shape == joint_shape
|
||||
assert np.array(obs1).shape == (venv1.num_envs,) + venv1.observation_space.shape
|
||||
assert np.allclose(obs1, obs2)
|
||||
np.random.seed(1337)
|
||||
venv1.action_space.seed(1337)
|
||||
for _ in range(num_steps):
|
||||
actions = np.array(np.random.randint(0, 0x100, size=joint_shape),
|
||||
dtype=env1.action_space.dtype)
|
||||
for env in [env1, env2]:
|
||||
env.step_async(actions)
|
||||
outs1 = env1.step_wait()
|
||||
outs2 = env2.step_wait()
|
||||
actions = np.array([venv1.action_space.sample() for _ in range(venv1.num_envs)])
|
||||
for venv in [venv1, venv2]:
|
||||
venv.step_async(actions)
|
||||
outs1 = venv1.step_wait()
|
||||
outs2 = venv2.step_wait()
|
||||
for out1, out2 in zip(outs1[:3], outs2[:3]):
|
||||
assert np.array(out1).shape == np.array(out2).shape
|
||||
assert np.allclose(out1, out2)
|
||||
assert list(outs1[3]) == list(outs2[3])
|
||||
finally:
|
||||
env1.close()
|
||||
env2.close()
|
||||
venv1.close()
|
||||
venv2.close()
|
||||
|
||||
|
||||
@pytest.mark.parametrize('klass', (ShmemVecEnv, SubprocVecEnv))
|
||||
@@ -63,7 +64,51 @@ def test_vec_env(klass, dtype): # pylint: disable=R0914
|
||||
fns = [make_fn(i) for i in range(num_envs)]
|
||||
env1 = DummyVecEnv(fns)
|
||||
env2 = klass(fns)
|
||||
assert_envs_equal(env1, env2, num_steps=num_steps)
|
||||
assert_venvs_equal(env1, env2, num_steps=num_steps)
|
||||
|
||||
|
||||
@pytest.mark.parametrize('dtype', ('uint8', 'float32'))
|
||||
@pytest.mark.parametrize('num_envs_in_series', (3, 4, 6))
|
||||
def test_sync_sampling(dtype, num_envs_in_series):
|
||||
"""
|
||||
Test that a SubprocVecEnv running with envs in series
|
||||
outputs the same as DummyVecEnv.
|
||||
"""
|
||||
num_envs = 12
|
||||
num_steps = 100
|
||||
shape = (3, 8)
|
||||
|
||||
def make_fn(seed):
|
||||
"""
|
||||
Get an environment constructor with a seed.
|
||||
"""
|
||||
return lambda: SimpleEnv(seed, shape, dtype)
|
||||
fns = [make_fn(i) for i in range(num_envs)]
|
||||
env1 = DummyVecEnv(fns)
|
||||
env2 = SubprocVecEnv(fns, in_series=num_envs_in_series)
|
||||
assert_venvs_equal(env1, env2, num_steps=num_steps)
|
||||
|
||||
|
||||
@pytest.mark.parametrize('dtype', ('uint8', 'float32'))
|
||||
@pytest.mark.parametrize('num_envs_in_series', (3, 4, 6))
|
||||
def test_sync_sampling_sanity(dtype, num_envs_in_series):
|
||||
"""
|
||||
Test that a SubprocVecEnv running with envs in series
|
||||
outputs the same as SubprocVecEnv without running in series.
|
||||
"""
|
||||
num_envs = 12
|
||||
num_steps = 100
|
||||
shape = (3, 8)
|
||||
|
||||
def make_fn(seed):
|
||||
"""
|
||||
Get an environment constructor with a seed.
|
||||
"""
|
||||
return lambda: SimpleEnv(seed, shape, dtype)
|
||||
fns = [make_fn(i) for i in range(num_envs)]
|
||||
env1 = SubprocVecEnv(fns)
|
||||
env2 = SubprocVecEnv(fns, in_series=num_envs_in_series)
|
||||
assert_venvs_equal(env1, env2, num_steps=num_steps)
|
||||
|
||||
|
||||
class SimpleEnv(gym.Env):
|
||||
@@ -99,3 +144,15 @@ class SimpleEnv(gym.Env):
|
||||
|
||||
def render(self, mode=None):
|
||||
raise NotImplementedError
|
||||
|
||||
|
||||
|
||||
@with_mpi()
|
||||
def test_mpi_with_subprocvecenv():
|
||||
shape = (2,3,4)
|
||||
nenv = 1
|
||||
venv = SubprocVecEnv([lambda: SimpleEnv(0, shape, 'float32')] * nenv)
|
||||
ob = venv.reset()
|
||||
venv.close()
|
||||
assert ob.shape == (nenv,) + shape
|
||||
|
||||
|
@@ -38,6 +38,9 @@ def obs_space_info(obs_space):
|
||||
if isinstance(obs_space, gym.spaces.Dict):
|
||||
assert isinstance(obs_space.spaces, OrderedDict)
|
||||
subspaces = obs_space.spaces
|
||||
elif isinstance(obs_space, gym.spaces.Tuple):
|
||||
assert isinstance(obs_space.spaces, tuple)
|
||||
subspaces = {i: obs_space.spaces[i] for i in range(len(obs_space.spaces))}
|
||||
else:
|
||||
subspaces = {None: obs_space}
|
||||
keys = []
|
||||
|
223
baselines/common/vec_env/vec_env.py
Normal file
223
baselines/common/vec_env/vec_env.py
Normal file
@@ -0,0 +1,223 @@
|
||||
import contextlib
|
||||
import os
|
||||
from abc import ABC, abstractmethod
|
||||
|
||||
from baselines.common.tile_images import tile_images
|
||||
|
||||
class AlreadySteppingError(Exception):
|
||||
"""
|
||||
Raised when an asynchronous step is running while
|
||||
step_async() is called again.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
msg = 'already running an async step'
|
||||
Exception.__init__(self, msg)
|
||||
|
||||
|
||||
class NotSteppingError(Exception):
|
||||
"""
|
||||
Raised when an asynchronous step is not running but
|
||||
step_wait() is called.
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
msg = 'not running an async step'
|
||||
Exception.__init__(self, msg)
|
||||
|
||||
|
||||
class VecEnv(ABC):
|
||||
"""
|
||||
An abstract asynchronous, vectorized environment.
|
||||
Used to batch data from multiple copies of an environment, so that
|
||||
each observation becomes an batch of observations, and expected action is a batch of actions to
|
||||
be applied per-environment.
|
||||
"""
|
||||
closed = False
|
||||
viewer = None
|
||||
|
||||
metadata = {
|
||||
'render.modes': ['human', 'rgb_array']
|
||||
}
|
||||
|
||||
def __init__(self, num_envs, observation_space, action_space):
|
||||
self.num_envs = num_envs
|
||||
self.observation_space = observation_space
|
||||
self.action_space = action_space
|
||||
|
||||
@abstractmethod
|
||||
def reset(self):
|
||||
"""
|
||||
Reset all the environments and return an array of
|
||||
observations, or a dict of observation arrays.
|
||||
|
||||
If step_async is still doing work, that work will
|
||||
be cancelled and step_wait() should not be called
|
||||
until step_async() is invoked again.
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def step_async(self, actions):
|
||||
"""
|
||||
Tell all the environments to start taking a step
|
||||
with the given actions.
|
||||
Call step_wait() to get the results of the step.
|
||||
|
||||
You should not call this if a step_async run is
|
||||
already pending.
|
||||
"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def step_wait(self):
|
||||
"""
|
||||
Wait for the step taken with step_async().
|
||||
|
||||
Returns (obs, rews, dones, infos):
|
||||
- obs: an array of observations, or a dict of
|
||||
arrays of observations.
|
||||
- rews: an array of rewards
|
||||
- dones: an array of "episode done" booleans
|
||||
- infos: a sequence of info objects
|
||||
"""
|
||||
pass
|
||||
|
||||
def close_extras(self):
|
||||
"""
|
||||
Clean up the extra resources, beyond what's in this base class.
|
||||
Only runs when not self.closed.
|
||||
"""
|
||||
pass
|
||||
|
||||
def close(self):
|
||||
if self.closed:
|
||||
return
|
||||
if self.viewer is not None:
|
||||
self.viewer.close()
|
||||
self.close_extras()
|
||||
self.closed = True
|
||||
|
||||
def step(self, actions):
|
||||
"""
|
||||
Step the environments synchronously.
|
||||
|
||||
This is available for backwards compatibility.
|
||||
"""
|
||||
self.step_async(actions)
|
||||
return self.step_wait()
|
||||
|
||||
def render(self, mode='human'):
|
||||
imgs = self.get_images()
|
||||
bigimg = tile_images(imgs)
|
||||
if mode == 'human':
|
||||
self.get_viewer().imshow(bigimg)
|
||||
return self.get_viewer().isopen
|
||||
elif mode == 'rgb_array':
|
||||
return bigimg
|
||||
else:
|
||||
raise NotImplementedError
|
||||
|
||||
def get_images(self):
|
||||
"""
|
||||
Return RGB images from each environment
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
@property
|
||||
def unwrapped(self):
|
||||
if isinstance(self, VecEnvWrapper):
|
||||
return self.venv.unwrapped
|
||||
else:
|
||||
return self
|
||||
|
||||
def get_viewer(self):
|
||||
if self.viewer is None:
|
||||
from gym.envs.classic_control import rendering
|
||||
self.viewer = rendering.SimpleImageViewer()
|
||||
return self.viewer
|
||||
|
||||
class VecEnvWrapper(VecEnv):
|
||||
"""
|
||||
An environment wrapper that applies to an entire batch
|
||||
of environments at once.
|
||||
"""
|
||||
|
||||
def __init__(self, venv, observation_space=None, action_space=None):
|
||||
self.venv = venv
|
||||
super().__init__(num_envs=venv.num_envs,
|
||||
observation_space=observation_space or venv.observation_space,
|
||||
action_space=action_space or venv.action_space)
|
||||
|
||||
def step_async(self, actions):
|
||||
self.venv.step_async(actions)
|
||||
|
||||
@abstractmethod
|
||||
def reset(self):
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def step_wait(self):
|
||||
pass
|
||||
|
||||
def close(self):
|
||||
return self.venv.close()
|
||||
|
||||
def render(self, mode='human'):
|
||||
return self.venv.render(mode=mode)
|
||||
|
||||
def get_images(self):
|
||||
return self.venv.get_images()
|
||||
|
||||
def __getattr__(self, name):
|
||||
if name.startswith('_'):
|
||||
raise AttributeError("attempted to get missing private attribute '{}'".format(name))
|
||||
return getattr(self.venv, name)
|
||||
|
||||
class VecEnvObservationWrapper(VecEnvWrapper):
|
||||
@abstractmethod
|
||||
def process(self, obs):
|
||||
pass
|
||||
|
||||
def reset(self):
|
||||
obs = self.venv.reset()
|
||||
return self.process(obs)
|
||||
|
||||
def step_wait(self):
|
||||
obs, rews, dones, infos = self.venv.step_wait()
|
||||
return self.process(obs), rews, dones, infos
|
||||
|
||||
class CloudpickleWrapper(object):
|
||||
"""
|
||||
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
|
||||
"""
|
||||
|
||||
def __init__(self, x):
|
||||
self.x = x
|
||||
|
||||
def __getstate__(self):
|
||||
import cloudpickle
|
||||
return cloudpickle.dumps(self.x)
|
||||
|
||||
def __setstate__(self, ob):
|
||||
import pickle
|
||||
self.x = pickle.loads(ob)
|
||||
|
||||
|
||||
@contextlib.contextmanager
|
||||
def clear_mpi_env_vars():
|
||||
"""
|
||||
from mpi4py import MPI will call MPI_Init by default. If the child process has MPI environment variables, MPI will think that the child process is an MPI process just like the parent and do bad things such as hang.
|
||||
This context manager is a hacky way to clear those environment variables temporarily such as when we are starting multiprocessing
|
||||
Processes.
|
||||
"""
|
||||
removed_environment = {}
|
||||
for k, v in list(os.environ.items()):
|
||||
for prefix in ['OMPI_', 'PMI_']:
|
||||
if k.startswith(prefix):
|
||||
removed_environment[k] = v
|
||||
del os.environ[k]
|
||||
try:
|
||||
yield
|
||||
finally:
|
||||
os.environ.update(removed_environment)
|
@@ -1,4 +1,4 @@
|
||||
from . import VecEnvWrapper
|
||||
from .vec_env import VecEnvWrapper
|
||||
import numpy as np
|
||||
from gym import spaces
|
||||
|
||||
|
@@ -2,15 +2,25 @@ from . import VecEnvWrapper
|
||||
from baselines.bench.monitor import ResultsWriter
|
||||
import numpy as np
|
||||
import time
|
||||
|
||||
from collections import deque
|
||||
|
||||
class VecMonitor(VecEnvWrapper):
|
||||
def __init__(self, venv, filename=None):
|
||||
def __init__(self, venv, filename=None, keep_buf=0, info_keywords=()):
|
||||
VecEnvWrapper.__init__(self, venv)
|
||||
self.eprets = None
|
||||
self.eplens = None
|
||||
self.epcount = 0
|
||||
self.tstart = time.time()
|
||||
self.results_writer = ResultsWriter(filename, header={'t_start': self.tstart})
|
||||
if filename:
|
||||
self.results_writer = ResultsWriter(filename, header={'t_start': self.tstart},
|
||||
extra_keys=info_keywords)
|
||||
else:
|
||||
self.results_writer = None
|
||||
self.info_keywords = info_keywords
|
||||
self.keep_buf = keep_buf
|
||||
if self.keep_buf:
|
||||
self.epret_buf = deque([], maxlen=keep_buf)
|
||||
self.eplen_buf = deque([], maxlen=keep_buf)
|
||||
|
||||
def reset(self):
|
||||
obs = self.venv.reset()
|
||||
@@ -22,16 +32,24 @@ class VecMonitor(VecEnvWrapper):
|
||||
obs, rews, dones, infos = self.venv.step_wait()
|
||||
self.eprets += rews
|
||||
self.eplens += 1
|
||||
newinfos = []
|
||||
for (i, (done, ret, eplen, info)) in enumerate(zip(dones, self.eprets, self.eplens, infos)):
|
||||
info = info.copy()
|
||||
if done:
|
||||
|
||||
newinfos = list(infos[:])
|
||||
for i in range(len(dones)):
|
||||
if dones[i]:
|
||||
info = infos[i].copy()
|
||||
ret = self.eprets[i]
|
||||
eplen = self.eplens[i]
|
||||
epinfo = {'r': ret, 'l': eplen, 't': round(time.time() - self.tstart, 6)}
|
||||
for k in self.info_keywords:
|
||||
epinfo[k] = info[k]
|
||||
info['episode'] = epinfo
|
||||
if self.keep_buf:
|
||||
self.epret_buf.append(ret)
|
||||
self.eplen_buf.append(eplen)
|
||||
self.epcount += 1
|
||||
self.eprets[i] = 0
|
||||
self.eplens[i] = 0
|
||||
self.results_writer.write_row(epinfo)
|
||||
|
||||
newinfos.append(info)
|
||||
|
||||
if self.results_writer:
|
||||
self.results_writer.write_row(epinfo)
|
||||
newinfos[i] = info
|
||||
return obs, rews, dones, newinfos
|
||||
|
@@ -1,18 +1,22 @@
|
||||
from . import VecEnvWrapper
|
||||
from baselines.common.running_mean_std import RunningMeanStd
|
||||
import numpy as np
|
||||
|
||||
|
||||
class VecNormalize(VecEnvWrapper):
|
||||
"""
|
||||
A vectorized wrapper that normalizes the observations
|
||||
and returns from an environment.
|
||||
"""
|
||||
|
||||
def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8):
|
||||
def __init__(self, venv, ob=True, ret=True, clipob=10., cliprew=10., gamma=0.99, epsilon=1e-8, use_tf=False):
|
||||
VecEnvWrapper.__init__(self, venv)
|
||||
self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
|
||||
self.ret_rms = RunningMeanStd(shape=()) if ret else None
|
||||
if use_tf:
|
||||
from baselines.common.running_mean_std import TfRunningMeanStd
|
||||
self.ob_rms = TfRunningMeanStd(shape=self.observation_space.shape, scope='ob_rms') if ob else None
|
||||
self.ret_rms = TfRunningMeanStd(shape=(), scope='ret_rms') if ret else None
|
||||
else:
|
||||
from baselines.common.running_mean_std import RunningMeanStd
|
||||
self.ob_rms = RunningMeanStd(shape=self.observation_space.shape) if ob else None
|
||||
self.ret_rms = RunningMeanStd(shape=()) if ret else None
|
||||
self.clipob = clipob
|
||||
self.cliprew = cliprew
|
||||
self.ret = np.zeros(self.num_envs)
|
||||
|
10
baselines/common/vec_env/vec_remove_dict_obs.py
Normal file
10
baselines/common/vec_env/vec_remove_dict_obs.py
Normal file
@@ -0,0 +1,10 @@
|
||||
from .vec_env import VecEnvObservationWrapper
|
||||
|
||||
class VecExtractDictObs(VecEnvObservationWrapper):
|
||||
def __init__(self, venv, key):
|
||||
self.key = key
|
||||
super().__init__(venv=venv,
|
||||
observation_space=venv.observation_space.spaces[self.key])
|
||||
|
||||
def process(self, obs):
|
||||
return obs[self.key]
|
29
baselines/common/wrappers.py
Normal file
29
baselines/common/wrappers.py
Normal file
@@ -0,0 +1,29 @@
|
||||
import gym
|
||||
|
||||
class TimeLimit(gym.Wrapper):
|
||||
def __init__(self, env, max_episode_steps=None):
|
||||
super(TimeLimit, self).__init__(env)
|
||||
self._max_episode_steps = max_episode_steps
|
||||
self._elapsed_steps = 0
|
||||
|
||||
def step(self, ac):
|
||||
observation, reward, done, info = self.env.step(ac)
|
||||
self._elapsed_steps += 1
|
||||
if self._elapsed_steps >= self._max_episode_steps:
|
||||
done = True
|
||||
info['TimeLimit.truncated'] = True
|
||||
return observation, reward, done, info
|
||||
|
||||
def reset(self, **kwargs):
|
||||
self._elapsed_steps = 0
|
||||
return self.env.reset(**kwargs)
|
||||
|
||||
class ClipActionsWrapper(gym.Wrapper):
|
||||
def step(self, action):
|
||||
import numpy as np
|
||||
action = np.nan_to_num(action)
|
||||
action = np.clip(action, self.action_space.low, self.action_space.high)
|
||||
return self.env.step(action)
|
||||
|
||||
def reset(self, **kwargs):
|
||||
return self.env.reset(**kwargs)
|
@@ -217,7 +217,9 @@ def learn(network, env,
|
||||
stats = agent.get_stats()
|
||||
combined_stats = stats.copy()
|
||||
combined_stats['rollout/return'] = np.mean(epoch_episode_rewards)
|
||||
combined_stats['rollout/return_std'] = np.std(epoch_episode_rewards)
|
||||
combined_stats['rollout/return_history'] = np.mean(episode_rewards_history)
|
||||
combined_stats['rollout/return_history_std'] = np.std(episode_rewards_history)
|
||||
combined_stats['rollout/episode_steps'] = np.mean(epoch_episode_steps)
|
||||
combined_stats['rollout/actions_mean'] = np.mean(epoch_actions)
|
||||
combined_stats['rollout/Q_mean'] = np.mean(epoch_qs)
|
||||
|
@@ -17,7 +17,7 @@ except ImportError:
|
||||
def normalize(x, stats):
|
||||
if stats is None:
|
||||
return x
|
||||
return (x - stats.mean) / stats.std
|
||||
return (x - stats.mean) / (stats.std + 1e-8)
|
||||
|
||||
|
||||
def denormalize(x, stats):
|
||||
@@ -67,7 +67,6 @@ class DDPG(object):
|
||||
def __init__(self, actor, critic, memory, observation_shape, action_shape, param_noise=None, action_noise=None,
|
||||
gamma=0.99, tau=0.001, normalize_returns=False, enable_popart=False, normalize_observations=True,
|
||||
batch_size=128, observation_range=(-5., 5.), action_range=(-1., 1.), return_range=(-np.inf, np.inf),
|
||||
adaptive_param_noise=True, adaptive_param_noise_policy_threshold=.1,
|
||||
critic_l2_reg=0., actor_lr=1e-4, critic_lr=1e-3, clip_norm=None, reward_scale=1.):
|
||||
# Inputs.
|
||||
self.obs0 = tf.placeholder(tf.float32, shape=(None,) + observation_shape, name='obs0')
|
||||
@@ -186,7 +185,7 @@ class DDPG(object):
|
||||
normalized_critic_target_tf = tf.clip_by_value(normalize(self.critic_target, self.ret_rms), self.return_range[0], self.return_range[1])
|
||||
self.critic_loss = tf.reduce_mean(tf.square(self.normalized_critic_tf - normalized_critic_target_tf))
|
||||
if self.critic_l2_reg > 0.:
|
||||
critic_reg_vars = [var for var in self.critic.trainable_vars if 'kernel' in var.name and 'output' not in var.name]
|
||||
critic_reg_vars = [var for var in self.critic.trainable_vars if var.name.endswith('/w:0') and 'output' not in var.name]
|
||||
for var in critic_reg_vars:
|
||||
logger.info(' regularizing: {}'.format(var.name))
|
||||
logger.info(' applying l2 regularization with {}'.format(self.critic_l2_reg))
|
||||
@@ -379,11 +378,6 @@ class DDPG(object):
|
||||
self.param_noise_stddev: self.param_noise.current_stddev,
|
||||
})
|
||||
|
||||
if MPI is not None:
|
||||
mean_distance = MPI.COMM_WORLD.allreduce(distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
|
||||
else:
|
||||
mean_distance = distance
|
||||
|
||||
if MPI is not None:
|
||||
mean_distance = MPI.COMM_WORLD.allreduce(distance, op=MPI.SUM) / MPI.COMM_WORLD.Get_size()
|
||||
else:
|
||||
|
@@ -42,7 +42,7 @@ class Critic(Model):
|
||||
with tf.variable_scope(self.name, reuse=tf.AUTO_REUSE):
|
||||
x = tf.concat([obs, action], axis=-1) # this assumes observation and action can be concatenated
|
||||
x = self.network_builder(x)
|
||||
x = tf.layers.dense(x, 1, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3))
|
||||
x = tf.layers.dense(x, 1, kernel_initializer=tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3), name='output')
|
||||
return x
|
||||
|
||||
@property
|
||||
|
16
baselines/ddpg/test_smoke.py
Normal file
16
baselines/ddpg/test_smoke.py
Normal file
@@ -0,0 +1,16 @@
|
||||
from baselines.common.tests.util import smoketest
|
||||
def _run(argstr):
|
||||
smoketest('--alg=ddpg --env=Pendulum-v0 --num_timesteps=0 ' + argstr)
|
||||
|
||||
def test_popart():
|
||||
_run('--normalize_returns=True --popart=True')
|
||||
|
||||
def test_noise_normal():
|
||||
_run('--noise_type=normal_0.1')
|
||||
|
||||
def test_noise_ou():
|
||||
_run('--noise_type=ou_0.1')
|
||||
|
||||
def test_noise_adaptive():
|
||||
_run('--noise_type=adaptive-param_0.2,normal_0.1')
|
||||
|
@@ -5,4 +5,4 @@ from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
|
||||
|
||||
def wrap_atari_dqn(env):
|
||||
from baselines.common.atari_wrappers import wrap_deepmind
|
||||
return wrap_deepmind(env, frame_stack=True, scale=True)
|
||||
return wrap_deepmind(env, frame_stack=True, scale=False)
|
||||
|
@@ -13,7 +13,7 @@ The functions in this file can are used to create the following functions:
|
||||
stochastic: bool
|
||||
if set to False all the actions are always deterministic (default False)
|
||||
update_eps_ph: float
|
||||
update epsilon a new value, if negative not update happens
|
||||
update epsilon a new value, if negative no update happens
|
||||
(default: no update)
|
||||
|
||||
Returns
|
||||
@@ -33,7 +33,7 @@ The functions in this file can are used to create the following functions:
|
||||
stochastic: bool
|
||||
if set to False all the actions are always deterministic (default False)
|
||||
update_eps_ph: float
|
||||
update epsilon a new value, if negative not update happens
|
||||
update epsilon to a new value, if negative no update happens
|
||||
(default: no update)
|
||||
reset_ph: bool
|
||||
reset the perturbed policy by sampling a new perturbation
|
||||
|
@@ -142,9 +142,8 @@ def learn(env,
|
||||
final value of random action probability
|
||||
train_freq: int
|
||||
update the model every `train_freq` steps.
|
||||
set to None to disable printing
|
||||
batch_size: int
|
||||
size of a batched sampled from replay buffer for training
|
||||
size of a batch sampled from replay buffer for training
|
||||
print_freq: int
|
||||
how often to print out training progress
|
||||
set to None to disable printing
|
||||
|
@@ -23,7 +23,7 @@ def model(inpt, num_actions, scope, reuse=False):
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
with U.make_session(8):
|
||||
with U.make_session(num_cpu=8):
|
||||
# Create the environment
|
||||
env = gym.make("CartPole-v0")
|
||||
# Create all the functions necessary to train the model
|
||||
|
@@ -2,95 +2,6 @@ import tensorflow as tf
|
||||
import tensorflow.contrib.layers as layers
|
||||
|
||||
|
||||
def _mlp(hiddens, inpt, num_actions, scope, reuse=False, layer_norm=False):
|
||||
with tf.variable_scope(scope, reuse=reuse):
|
||||
out = inpt
|
||||
for hidden in hiddens:
|
||||
out = layers.fully_connected(out, num_outputs=hidden, activation_fn=None)
|
||||
if layer_norm:
|
||||
out = layers.layer_norm(out, center=True, scale=True)
|
||||
out = tf.nn.relu(out)
|
||||
q_out = layers.fully_connected(out, num_outputs=num_actions, activation_fn=None)
|
||||
return q_out
|
||||
|
||||
|
||||
def mlp(hiddens=[], layer_norm=False):
|
||||
"""This model takes as input an observation and returns values of all actions.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
hiddens: [int]
|
||||
list of sizes of hidden layers
|
||||
|
||||
Returns
|
||||
-------
|
||||
q_func: function
|
||||
q_function for DQN algorithm.
|
||||
"""
|
||||
return lambda *args, **kwargs: _mlp(hiddens, layer_norm=layer_norm, *args, **kwargs)
|
||||
|
||||
|
||||
def _cnn_to_mlp(convs, hiddens, dueling, inpt, num_actions, scope, reuse=False, layer_norm=False):
|
||||
with tf.variable_scope(scope, reuse=reuse):
|
||||
out = inpt
|
||||
with tf.variable_scope("convnet"):
|
||||
for num_outputs, kernel_size, stride in convs:
|
||||
out = layers.convolution2d(out,
|
||||
num_outputs=num_outputs,
|
||||
kernel_size=kernel_size,
|
||||
stride=stride,
|
||||
activation_fn=tf.nn.relu)
|
||||
conv_out = layers.flatten(out)
|
||||
with tf.variable_scope("action_value"):
|
||||
action_out = conv_out
|
||||
for hidden in hiddens:
|
||||
action_out = layers.fully_connected(action_out, num_outputs=hidden, activation_fn=None)
|
||||
if layer_norm:
|
||||
action_out = layers.layer_norm(action_out, center=True, scale=True)
|
||||
action_out = tf.nn.relu(action_out)
|
||||
action_scores = layers.fully_connected(action_out, num_outputs=num_actions, activation_fn=None)
|
||||
|
||||
if dueling:
|
||||
with tf.variable_scope("state_value"):
|
||||
state_out = conv_out
|
||||
for hidden in hiddens:
|
||||
state_out = layers.fully_connected(state_out, num_outputs=hidden, activation_fn=None)
|
||||
if layer_norm:
|
||||
state_out = layers.layer_norm(state_out, center=True, scale=True)
|
||||
state_out = tf.nn.relu(state_out)
|
||||
state_score = layers.fully_connected(state_out, num_outputs=1, activation_fn=None)
|
||||
action_scores_mean = tf.reduce_mean(action_scores, 1)
|
||||
action_scores_centered = action_scores - tf.expand_dims(action_scores_mean, 1)
|
||||
q_out = state_score + action_scores_centered
|
||||
else:
|
||||
q_out = action_scores
|
||||
return q_out
|
||||
|
||||
|
||||
def cnn_to_mlp(convs, hiddens, dueling=False, layer_norm=False):
|
||||
"""This model takes as input an observation and returns values of all actions.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
convs: [(int, int int)]
|
||||
list of convolutional layers in form of
|
||||
(num_outputs, kernel_size, stride)
|
||||
hiddens: [int]
|
||||
list of sizes of hidden layers
|
||||
dueling: bool
|
||||
if true double the output MLP to compute a baseline
|
||||
for action scores
|
||||
|
||||
Returns
|
||||
-------
|
||||
q_func: function
|
||||
q_function for DQN algorithm.
|
||||
"""
|
||||
|
||||
return lambda *args, **kwargs: _cnn_to_mlp(convs, hiddens, dueling, layer_norm=layer_norm, *args, **kwargs)
|
||||
|
||||
|
||||
|
||||
def build_q_func(network, hiddens=[256], dueling=True, layer_norm=False, **network_kwargs):
|
||||
if isinstance(network, str):
|
||||
from baselines.common.models import get_network_builder
|
||||
|
@@ -20,7 +20,7 @@ class TfInput(object):
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
def make_feed_dict(data):
|
||||
def make_feed_dict(self, data):
|
||||
"""Given data input it to the placeholder(s)."""
|
||||
raise NotImplementedError
|
||||
|
||||
|
@@ -12,13 +12,13 @@ Download the expert data into `./data`, [download link](https://drive.google.com
|
||||
|
||||
### Step 2: Run GAIL
|
||||
|
||||
Run with single thread:
|
||||
Run with single rank:
|
||||
|
||||
```bash
|
||||
python -m baselines.gail.run_mujoco
|
||||
```
|
||||
|
||||
Run with multiple threads:
|
||||
Run with multiple ranks:
|
||||
|
||||
```bash
|
||||
mpirun -np 16 python -m baselines.gail.run_mujoco
|
||||
|
@@ -66,7 +66,7 @@ class TransitionClassifier(object):
|
||||
|
||||
with tf.variable_scope("obfilter"):
|
||||
self.obs_rms = RunningMeanStd(shape=self.observation_shape)
|
||||
obs = (obs_ph - self.obs_rms.mean / self.obs_rms.std)
|
||||
obs = (obs_ph - self.obs_rms.mean) / self.obs_rms.std
|
||||
_input = tf.concat([obs, acs_ph], axis=1) # concatenate the two input -> form a transition
|
||||
p_h1 = tf.contrib.layers.fully_connected(_input, self.hidden_size, activation_fn=tf.nn.tanh)
|
||||
p_h2 = tf.contrib.layers.fully_connected(p_h1, self.hidden_size, activation_fn=tf.nn.tanh)
|
||||
|
@@ -50,8 +50,12 @@ class Mujoco_Dset(object):
|
||||
# obs, acs: shape (N, L, ) + S where N = # episodes, L = episode length
|
||||
# and S is the environment observation/action space.
|
||||
# Flatten to (N * L, prod(S))
|
||||
self.obs = np.reshape(obs, [-1, np.prod(obs.shape[2:])])
|
||||
self.acs = np.reshape(acs, [-1, np.prod(acs.shape[2:])])
|
||||
if len(obs.shape) > 2:
|
||||
self.obs = np.reshape(obs, [-1, np.prod(obs.shape[2:])])
|
||||
self.acs = np.reshape(acs, [-1, np.prod(acs.shape[2:])])
|
||||
else:
|
||||
self.obs = np.vstack(obs)
|
||||
self.acs = np.vstack(acs)
|
||||
|
||||
self.rets = traj_data['ep_rets'][:traj_limitation]
|
||||
self.avg_ret = sum(self.rets)/len(self.rets)
|
||||
@@ -73,7 +77,7 @@ class Mujoco_Dset(object):
|
||||
self.log_info()
|
||||
|
||||
def log_info(self):
|
||||
logger.log("Total trajectorues: %d" % self.num_traj)
|
||||
logger.log("Total trajectories: %d" % self.num_traj)
|
||||
logger.log("Total transitions: %d" % self.num_transition)
|
||||
logger.log("Average returns: %f" % self.avg_ret)
|
||||
logger.log("Std for returns: %f" % self.std_ret)
|
||||
|
@@ -24,7 +24,7 @@ Hopper-v1, Walker2d-v1, HalfCheetah-v1, Humanoid-v1, HumanoidStandup-v1. Every i
|
||||
|
||||
For details (e.g., adversarial loss, discriminator accuracy, etc.) about GAIL training, please see [here](https://drive.google.com/drive/folders/1nnU8dqAV9i37-_5_vWIspyFUJFQLCsDD?usp=sharing)
|
||||
|
||||
### Determinstic Polciy (Set std=0)
|
||||
### Determinstic Policy (Set std=0)
|
||||
| | Un-normalized | Normalized |
|
||||
|---|---|---|
|
||||
| Hopper-v1 | <img src='Hopper-unnormalized-deterministic-scores.png'> | <img src='Hopper-normalized-deterministic-scores.png'> |
|
||||
|
@@ -6,26 +6,29 @@ For details on Hindsight Experience Replay (HER), please read the [paper](https:
|
||||
### Getting started
|
||||
Training an agent is very simple:
|
||||
```bash
|
||||
python -m baselines.her.experiment.train
|
||||
python -m baselines.run --alg=her --env=FetchReach-v1 --num_timesteps=5000
|
||||
```
|
||||
This will train a DDPG+HER agent on the `FetchReach` environment.
|
||||
You should see the success rate go up quickly to `1.0`, which means that the agent achieves the
|
||||
desired goal in 100% of the cases.
|
||||
The training script logs other diagnostics as well and pickles the best policy so far (w.r.t. to its test success rate),
|
||||
the latest policy, and, if enabled, a history of policies every K epochs.
|
||||
|
||||
To inspect what the agent has learned, use the play script:
|
||||
desired goal in 100% of the cases (note how HER can solve it in <5k steps - try doing that with PPO by replacing her with ppo2 :))
|
||||
The training script logs other diagnostics as well. Policy at the end of the training can be saved using `--save_path` flag, for instance:
|
||||
```bash
|
||||
python -m baselines.her.experiment.play /path/to/an/experiment/policy_best.pkl
|
||||
python -m baselines.run --alg=her --env=FetchReach-v1 --num_timesteps=5000 --save_path=~/policies/her/fetchreach5k
|
||||
```
|
||||
You can try it right now with the results of the training step (the script prints out the path for you).
|
||||
This should visualize the current policy for 10 episodes and will also print statistics.
|
||||
|
||||
To inspect what the agent has learned, use the `--play` flag:
|
||||
```bash
|
||||
python -m baselines.run --alg=her --env=FetchReach-v1 --num_timesteps=5000 --play
|
||||
```
|
||||
(note `--play` can be combined with `--load_path`, which lets one load trained policies, for more results see [README.md](../../README.md))
|
||||
|
||||
|
||||
### Reproducing results
|
||||
In order to reproduce the results from [Plappert et al. (2018)](https://arxiv.org/abs/1802.09464), run the following command:
|
||||
In [Plappert et al. (2018)](https://arxiv.org/abs/1802.09464), 38 trajectories were generated in parallel
|
||||
(19 MPI processes, each generating computing gradients from 2 trajectories and aggregating).
|
||||
To reproduce that behaviour, use
|
||||
```bash
|
||||
python -m baselines.her.experiment.train --num_cpu 19
|
||||
mpirun -np 19 python -m baselines.run --num_env=2 --alg=her ...
|
||||
```
|
||||
This will require a machine with sufficient amount of physical CPU cores. In our experiments,
|
||||
we used [Azure's D15v2 instances](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes),
|
||||
@@ -45,6 +48,13 @@ python experiment/data_generation/fetch_data_generation.py
|
||||
```
|
||||
This outputs ```data_fetch_random_100.npz``` file which is our data file.
|
||||
|
||||
To launch training with demonstrations (more technically, with behaviour cloning loss as an auxilliary loss), run the following
|
||||
```bash
|
||||
python -m baselines.run --alg=her --env=FetchPickAndPlace-v1 --num_timesteps=2.5e6 --demo_file=/Path/to/demo_file.npz
|
||||
```
|
||||
This will train a DDPG+HER agent on the `FetchPickAndPlace` environment by using previously generated demonstration data.
|
||||
To inspect what the agent has learned, use the `--play` flag as described above.
|
||||
|
||||
#### Configuration
|
||||
The provided configuration is for training an agent with HER without demonstrations, we need to change a few paramters for the HER algorithm to learn through demonstrations, to do that, set:
|
||||
|
||||
@@ -62,13 +72,7 @@ Apart from these changes the reported results also have the following configurat
|
||||
* random_eps: 0.1 - percentage of time a random action is taken
|
||||
* noise_eps: 0.1 - std of gaussian noise added to not-completely-random actions
|
||||
|
||||
Now training an agent with pre-recorded demonstrations:
|
||||
```bash
|
||||
python -m baselines.her.experiment.train --env=FetchPickAndPlace-v0 --n_epochs=1000 --demo_file=/Path/to/demo_file.npz --num_cpu=1
|
||||
```
|
||||
|
||||
This will train a DDPG+HER agent on the `FetchPickAndPlace` environment by using previously generated demonstration data.
|
||||
To inspect what the agent has learned, use the play script as described above.
|
||||
These parameters can be changed either in [experiment/config.py](experiment/config.py) or passed to the command line as `--param=value`)
|
||||
|
||||
### Results
|
||||
Training with demonstrations helps overcome the exploration problem and achieves a faster and better convergence. The following graphs contrast the difference between training with and without demonstration data, We report the mean Q values vs Epoch and the Success Rate vs Epoch:
|
||||
@@ -78,3 +82,4 @@ Training with demonstrations helps overcome the exploration problem and achieves
|
||||
<center><img src="../../data/fetchPickAndPlaceContrast.png"></center>
|
||||
<div class="thecap" align="middle"><b>Training results for Fetch Pick and Place task constrasting between training with and without demonstration data.</b></div>
|
||||
</div>
|
||||
|
||||
|
@@ -10,13 +10,14 @@ from baselines.her.util import (
|
||||
from baselines.her.normalizer import Normalizer
|
||||
from baselines.her.replay_buffer import ReplayBuffer
|
||||
from baselines.common.mpi_adam import MpiAdam
|
||||
from baselines.common import tf_util
|
||||
|
||||
|
||||
def dims_to_shapes(input_dims):
|
||||
return {key: tuple([val]) if val > 0 else tuple() for key, val in input_dims.items()}
|
||||
|
||||
|
||||
global demoBuffer #buffer for demonstrations
|
||||
global DEMO_BUFFER #buffer for demonstrations
|
||||
|
||||
class DDPG(object):
|
||||
@store_args
|
||||
@@ -94,16 +95,16 @@ class DDPG(object):
|
||||
self._create_network(reuse=reuse)
|
||||
|
||||
# Configure the replay buffer.
|
||||
buffer_shapes = {key: (self.T if key != 'o' else self.T+1, *input_shapes[key])
|
||||
buffer_shapes = {key: (self.T-1 if key != 'o' else self.T, *input_shapes[key])
|
||||
for key, val in input_shapes.items()}
|
||||
buffer_shapes['g'] = (buffer_shapes['g'][0], self.dimg)
|
||||
buffer_shapes['ag'] = (self.T+1, self.dimg)
|
||||
buffer_shapes['ag'] = (self.T, self.dimg)
|
||||
|
||||
buffer_size = (self.buffer_size // self.rollout_batch_size) * self.rollout_batch_size
|
||||
self.buffer = ReplayBuffer(buffer_shapes, buffer_size, self.T, self.sample_transitions)
|
||||
|
||||
global demoBuffer
|
||||
demoBuffer = ReplayBuffer(buffer_shapes, buffer_size, self.T, self.sample_transitions) #initialize the demo buffer; in the same way as the primary data buffer
|
||||
global DEMO_BUFFER
|
||||
DEMO_BUFFER = ReplayBuffer(buffer_shapes, buffer_size, self.T, self.sample_transitions) #initialize the demo buffer; in the same way as the primary data buffer
|
||||
|
||||
def _random_action(self, n):
|
||||
return np.random.uniform(low=-self.max_u, high=self.max_u, size=(n, self.dimu))
|
||||
@@ -119,6 +120,11 @@ class DDPG(object):
|
||||
g = np.clip(g, -self.clip_obs, self.clip_obs)
|
||||
return o, g
|
||||
|
||||
def step(self, obs):
|
||||
actions = self.get_actions(obs['observation'], obs['achieved_goal'], obs['desired_goal'])
|
||||
return actions, None, None, None
|
||||
|
||||
|
||||
def get_actions(self, o, ag, g, noise_eps=0., random_eps=0., use_target_net=False,
|
||||
compute_Q=False):
|
||||
o, g = self._preprocess_og(o, ag, g)
|
||||
@@ -151,25 +157,30 @@ class DDPG(object):
|
||||
else:
|
||||
return ret
|
||||
|
||||
def initDemoBuffer(self, demoDataFile, update_stats=True): #function that initializes the demo buffer
|
||||
def init_demo_buffer(self, demoDataFile, update_stats=True): #function that initializes the demo buffer
|
||||
|
||||
demoData = np.load(demoDataFile) #load the demonstration data from data file
|
||||
info_keys = [key.replace('info_', '') for key in self.input_dims.keys() if key.startswith('info_')]
|
||||
info_values = [np.empty((self.T, 1, self.input_dims['info_' + key]), np.float32) for key in info_keys]
|
||||
info_values = [np.empty((self.T - 1, 1, self.input_dims['info_' + key]), np.float32) for key in info_keys]
|
||||
|
||||
demo_data_obs = demoData['obs']
|
||||
demo_data_acs = demoData['acs']
|
||||
demo_data_info = demoData['info']
|
||||
|
||||
for epsd in range(self.num_demo): # we initialize the whole demo buffer at the start of the training
|
||||
obs, acts, goals, achieved_goals = [], [] ,[] ,[]
|
||||
i = 0
|
||||
for transition in range(self.T):
|
||||
obs.append([demoData['obs'][epsd ][transition].get('observation')])
|
||||
acts.append([demoData['acs'][epsd][transition]])
|
||||
goals.append([demoData['obs'][epsd][transition].get('desired_goal')])
|
||||
achieved_goals.append([demoData['obs'][epsd][transition].get('achieved_goal')])
|
||||
for transition in range(self.T - 1):
|
||||
obs.append([demo_data_obs[epsd][transition].get('observation')])
|
||||
acts.append([demo_data_acs[epsd][transition]])
|
||||
goals.append([demo_data_obs[epsd][transition].get('desired_goal')])
|
||||
achieved_goals.append([demo_data_obs[epsd][transition].get('achieved_goal')])
|
||||
for idx, key in enumerate(info_keys):
|
||||
info_values[idx][transition, i] = demoData['info'][epsd][transition][key]
|
||||
info_values[idx][transition, i] = demo_data_info[epsd][transition][key]
|
||||
|
||||
obs.append([demoData['obs'][epsd][self.T].get('observation')])
|
||||
achieved_goals.append([demoData['obs'][epsd][self.T].get('achieved_goal')])
|
||||
|
||||
obs.append([demo_data_obs[epsd][self.T - 1].get('observation')])
|
||||
achieved_goals.append([demo_data_obs[epsd][self.T - 1].get('achieved_goal')])
|
||||
|
||||
episode = dict(o=obs,
|
||||
u=acts,
|
||||
@@ -179,10 +190,9 @@ class DDPG(object):
|
||||
episode['info_{}'.format(key)] = value
|
||||
|
||||
episode = convert_episode_to_batch_major(episode)
|
||||
global demoBuffer
|
||||
demoBuffer.store_episode(episode) # create the observation dict and append them into the demonstration buffer
|
||||
|
||||
print("Demo buffer size currently ", demoBuffer.get_current_size()) #print out the demonstration buffer size
|
||||
global DEMO_BUFFER
|
||||
DEMO_BUFFER.store_episode(episode) # create the observation dict and append them into the demonstration buffer
|
||||
logger.debug("Demo buffer size currently ", DEMO_BUFFER.get_current_size()) #print out the demonstration buffer size
|
||||
|
||||
if update_stats:
|
||||
# add transitions to normalizer to normalize the demo data as well
|
||||
@@ -191,7 +201,7 @@ class DDPG(object):
|
||||
num_normalizing_transitions = transitions_in_episode_batch(episode)
|
||||
transitions = self.sample_transitions(episode, num_normalizing_transitions)
|
||||
|
||||
o, o_2, g, ag = transitions['o'], transitions['o_2'], transitions['g'], transitions['ag']
|
||||
o, g, ag = transitions['o'], transitions['g'], transitions['ag']
|
||||
transitions['o'], transitions['g'] = self._preprocess_og(o, ag, g)
|
||||
# No need to preprocess the o_2 and g_2 since this is only used for stats
|
||||
|
||||
@@ -202,6 +212,8 @@ class DDPG(object):
|
||||
self.g_stats.recompute_stats()
|
||||
episode.clear()
|
||||
|
||||
logger.info("Demo buffer size: ", DEMO_BUFFER.get_current_size()) #print out the demonstration buffer size
|
||||
|
||||
def store_episode(self, episode_batch, update_stats=True):
|
||||
"""
|
||||
episode_batch: array of batch_size x (T or T+1) x dim_key
|
||||
@@ -217,7 +229,7 @@ class DDPG(object):
|
||||
num_normalizing_transitions = transitions_in_episode_batch(episode_batch)
|
||||
transitions = self.sample_transitions(episode_batch, num_normalizing_transitions)
|
||||
|
||||
o, o_2, g, ag = transitions['o'], transitions['o_2'], transitions['g'], transitions['ag']
|
||||
o, g, ag = transitions['o'], transitions['g'], transitions['ag']
|
||||
transitions['o'], transitions['g'] = self._preprocess_og(o, ag, g)
|
||||
# No need to preprocess the o_2 and g_2 since this is only used for stats
|
||||
|
||||
@@ -251,9 +263,9 @@ class DDPG(object):
|
||||
def sample_batch(self):
|
||||
if self.bc_loss: #use demonstration buffer to sample as well if bc_loss flag is set TRUE
|
||||
transitions = self.buffer.sample(self.batch_size - self.demo_batch_size)
|
||||
global demoBuffer
|
||||
transitionsDemo = demoBuffer.sample(self.demo_batch_size) #sample from the demo buffer
|
||||
for k, values in transitionsDemo.items():
|
||||
global DEMO_BUFFER
|
||||
transitions_demo = DEMO_BUFFER.sample(self.demo_batch_size) #sample from the demo buffer
|
||||
for k, values in transitions_demo.items():
|
||||
rolloutV = transitions[k].tolist()
|
||||
for v in values:
|
||||
rolloutV.append(v.tolist())
|
||||
@@ -302,10 +314,7 @@ class DDPG(object):
|
||||
|
||||
def _create_network(self, reuse=False):
|
||||
logger.info("Creating a DDPG agent with action space %d x %s..." % (self.dimu, self.max_u))
|
||||
|
||||
self.sess = tf.get_default_session()
|
||||
if self.sess is None:
|
||||
self.sess = tf.InteractiveSession()
|
||||
self.sess = tf_util.get_session()
|
||||
|
||||
# running averages
|
||||
with tf.variable_scope('o_stats') as vs:
|
||||
@@ -367,8 +376,6 @@ class DDPG(object):
|
||||
self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
|
||||
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
|
||||
|
||||
self.pi_loss_tf = -tf.reduce_mean(self.main.Q_pi_tf)
|
||||
self.pi_loss_tf += self.action_l2 * tf.reduce_mean(tf.square(self.main.pi_tf / self.max_u))
|
||||
Q_grads_tf = tf.gradients(self.Q_loss_tf, self._vars('main/Q'))
|
||||
pi_grads_tf = tf.gradients(self.pi_loss_tf, self._vars('main/pi'))
|
||||
assert len(self._vars('main/Q')) == len(Q_grads_tf)
|
||||
@@ -403,7 +410,7 @@ class DDPG(object):
|
||||
logs += [('stats_g/mean', np.mean(self.sess.run([self.g_stats.mean])))]
|
||||
logs += [('stats_g/std', np.mean(self.sess.run([self.g_stats.std])))]
|
||||
|
||||
if prefix is not '' and not prefix.endswith('/'):
|
||||
if prefix != '' and not prefix.endswith('/'):
|
||||
return [(prefix + '/' + key, val) for key, val in logs]
|
||||
else:
|
||||
return logs
|
||||
@@ -435,3 +442,7 @@ class DDPG(object):
|
||||
assert(len(vars) == len(state["tf"]))
|
||||
node = [tf.assign(var, val) for var, val in zip(vars, state["tf"])]
|
||||
self.sess.run(node)
|
||||
|
||||
def save(self, save_path):
|
||||
tf_util.save_variables(save_path)
|
||||
|
||||
|
@@ -1,10 +1,11 @@
|
||||
import os
|
||||
import numpy as np
|
||||
import gym
|
||||
|
||||
from baselines import logger
|
||||
from baselines.her.ddpg import DDPG
|
||||
from baselines.her.her import make_sample_her_transitions
|
||||
|
||||
from baselines.her.her_sampler import make_sample_her_transitions
|
||||
from baselines.bench.monitor import Monitor
|
||||
|
||||
DEFAULT_ENV_PARAMS = {
|
||||
'FetchReach-v1': {
|
||||
@@ -72,16 +73,32 @@ def cached_make_env(make_env):
|
||||
def prepare_params(kwargs):
|
||||
# DDPG params
|
||||
ddpg_params = dict()
|
||||
|
||||
env_name = kwargs['env_name']
|
||||
|
||||
def make_env():
|
||||
return gym.make(env_name)
|
||||
def make_env(subrank=None):
|
||||
env = gym.make(env_name)
|
||||
if subrank is not None and logger.get_dir() is not None:
|
||||
try:
|
||||
from mpi4py import MPI
|
||||
mpi_rank = MPI.COMM_WORLD.Get_rank()
|
||||
except ImportError:
|
||||
MPI = None
|
||||
mpi_rank = 0
|
||||
logger.warn('Running with a single MPI process. This should work, but the results may differ from the ones publshed in Plappert et al.')
|
||||
|
||||
max_episode_steps = env._max_episode_steps
|
||||
env = Monitor(env,
|
||||
os.path.join(logger.get_dir(), str(mpi_rank) + '.' + str(subrank)),
|
||||
allow_early_resets=True)
|
||||
# hack to re-expose _max_episode_steps (ideally should replace reliance on it downstream)
|
||||
env = gym.wrappers.TimeLimit(env, max_episode_steps=max_episode_steps)
|
||||
return env
|
||||
|
||||
kwargs['make_env'] = make_env
|
||||
tmp_env = cached_make_env(kwargs['make_env'])
|
||||
assert hasattr(tmp_env, '_max_episode_steps')
|
||||
kwargs['T'] = tmp_env._max_episode_steps
|
||||
tmp_env.reset()
|
||||
|
||||
kwargs['max_u'] = np.array(kwargs['max_u']) if isinstance(kwargs['max_u'], list) else kwargs['max_u']
|
||||
kwargs['gamma'] = 1. - 1. / kwargs['T']
|
||||
if 'lr' in kwargs:
|
||||
|
@@ -1,18 +1,5 @@
|
||||
import gym
|
||||
import time
|
||||
import random
|
||||
import numpy as np
|
||||
import rospy
|
||||
import roslaunch
|
||||
|
||||
from random import randint
|
||||
from std_srvs.srv import Empty
|
||||
from sensor_msgs.msg import JointState
|
||||
from geometry_msgs.msg import PoseStamped
|
||||
from geometry_msgs.msg import Pose
|
||||
from std_msgs.msg import Float64
|
||||
from controller_manager_msgs.srv import SwitchController
|
||||
from gym.utils import seeding
|
||||
|
||||
|
||||
"""Data generation for the case of a single block pick and place in Fetch Env"""
|
||||
@@ -22,7 +9,7 @@ observations = []
|
||||
infos = []
|
||||
|
||||
def main():
|
||||
env = gym.make('FetchPickAndPlace-v0')
|
||||
env = gym.make('FetchPickAndPlace-v1')
|
||||
numItr = 100
|
||||
initStateSpace = "random"
|
||||
env.reset()
|
||||
@@ -31,21 +18,19 @@ def main():
|
||||
obs = env.reset()
|
||||
print("ITERATION NUMBER ", len(actions))
|
||||
goToGoal(env, obs)
|
||||
|
||||
|
||||
|
||||
fileName = "data_fetch"
|
||||
fileName += "_" + initStateSpace
|
||||
fileName += "_" + str(numItr)
|
||||
fileName += ".npz"
|
||||
|
||||
|
||||
np.savez_compressed(fileName, acs=actions, obs=observations, info=infos) # save the file
|
||||
|
||||
def goToGoal(env, lastObs):
|
||||
|
||||
goal = lastObs['desired_goal']
|
||||
objectPos = lastObs['observation'][3:6]
|
||||
gripperPos = lastObs['observation'][:3]
|
||||
gripperState = lastObs['observation'][9:11]
|
||||
object_rel_pos = lastObs['observation'][6:9]
|
||||
episodeAcs = []
|
||||
episodeObs = []
|
||||
@@ -53,7 +38,7 @@ def goToGoal(env, lastObs):
|
||||
|
||||
object_oriented_goal = object_rel_pos.copy()
|
||||
object_oriented_goal[2] += 0.03 # first make the gripper go slightly above the object
|
||||
|
||||
|
||||
timeStep = 0 #count the total number of timesteps
|
||||
episodeObs.append(lastObs)
|
||||
|
||||
@@ -76,8 +61,6 @@ def goToGoal(env, lastObs):
|
||||
episodeObs.append(obsDataNew)
|
||||
|
||||
objectPos = obsDataNew['observation'][3:6]
|
||||
gripperPos = obsDataNew['observation'][:3]
|
||||
gripperState = obsDataNew['observation'][9:11]
|
||||
object_rel_pos = obsDataNew['observation'][6:9]
|
||||
|
||||
while np.linalg.norm(object_rel_pos) >= 0.005 and timeStep <= env._max_episode_steps :
|
||||
@@ -96,8 +79,6 @@ def goToGoal(env, lastObs):
|
||||
episodeObs.append(obsDataNew)
|
||||
|
||||
objectPos = obsDataNew['observation'][3:6]
|
||||
gripperPos = obsDataNew['observation'][:3]
|
||||
gripperState = obsDataNew['observation'][9:11]
|
||||
object_rel_pos = obsDataNew['observation'][6:9]
|
||||
|
||||
|
||||
@@ -117,8 +98,6 @@ def goToGoal(env, lastObs):
|
||||
episodeObs.append(obsDataNew)
|
||||
|
||||
objectPos = obsDataNew['observation'][3:6]
|
||||
gripperPos = obsDataNew['observation'][:3]
|
||||
gripperState = obsDataNew['observation'][9:11]
|
||||
object_rel_pos = obsDataNew['observation'][6:9]
|
||||
|
||||
while True: #limit the number of timesteps in the episode to a fixed duration
|
||||
@@ -134,8 +113,6 @@ def goToGoal(env, lastObs):
|
||||
episodeObs.append(obsDataNew)
|
||||
|
||||
objectPos = obsDataNew['observation'][3:6]
|
||||
gripperPos = obsDataNew['observation'][:3]
|
||||
gripperState = obsDataNew['observation'][9:11]
|
||||
object_rel_pos = obsDataNew['observation'][6:9]
|
||||
|
||||
if timeStep >= env._max_episode_steps: break
|
||||
|
@@ -1,3 +1,4 @@
|
||||
# DEPRECATED, use --play flag to baselines.run instead
|
||||
import click
|
||||
import numpy as np
|
||||
import pickle
|
||||
|
@@ -1,3 +1,5 @@
|
||||
# DEPRECATED, use baselines.common.plot_util instead
|
||||
|
||||
import os
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
@@ -1,194 +0,0 @@
|
||||
import os
|
||||
import sys
|
||||
|
||||
import click
|
||||
import numpy as np
|
||||
import json
|
||||
from mpi4py import MPI
|
||||
|
||||
from baselines import logger
|
||||
from baselines.common import set_global_seeds
|
||||
from baselines.common.mpi_moments import mpi_moments
|
||||
import baselines.her.experiment.config as config
|
||||
from baselines.her.rollout import RolloutWorker
|
||||
from baselines.her.util import mpi_fork
|
||||
|
||||
from subprocess import CalledProcessError
|
||||
|
||||
|
||||
def mpi_average(value):
|
||||
if value == []:
|
||||
value = [0.]
|
||||
if not isinstance(value, list):
|
||||
value = [value]
|
||||
return mpi_moments(np.array(value))[0]
|
||||
|
||||
|
||||
def train(policy, rollout_worker, evaluator,
|
||||
n_epochs, n_test_rollouts, n_cycles, n_batches, policy_save_interval,
|
||||
save_policies, demo_file, **kwargs):
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
|
||||
latest_policy_path = os.path.join(logger.get_dir(), 'policy_latest.pkl')
|
||||
best_policy_path = os.path.join(logger.get_dir(), 'policy_best.pkl')
|
||||
periodic_policy_path = os.path.join(logger.get_dir(), 'policy_{}.pkl')
|
||||
|
||||
logger.info("Training...")
|
||||
best_success_rate = -1
|
||||
|
||||
if policy.bc_loss == 1: policy.initDemoBuffer(demo_file) #initialize demo buffer if training with demonstrations
|
||||
for epoch in range(n_epochs):
|
||||
# train
|
||||
rollout_worker.clear_history()
|
||||
for _ in range(n_cycles):
|
||||
episode = rollout_worker.generate_rollouts()
|
||||
policy.store_episode(episode)
|
||||
for _ in range(n_batches):
|
||||
policy.train()
|
||||
policy.update_target_net()
|
||||
|
||||
# test
|
||||
evaluator.clear_history()
|
||||
for _ in range(n_test_rollouts):
|
||||
evaluator.generate_rollouts()
|
||||
|
||||
# record logs
|
||||
logger.record_tabular('epoch', epoch)
|
||||
for key, val in evaluator.logs('test'):
|
||||
logger.record_tabular(key, mpi_average(val))
|
||||
for key, val in rollout_worker.logs('train'):
|
||||
logger.record_tabular(key, mpi_average(val))
|
||||
for key, val in policy.logs():
|
||||
logger.record_tabular(key, mpi_average(val))
|
||||
|
||||
if rank == 0:
|
||||
logger.dump_tabular()
|
||||
|
||||
# save the policy if it's better than the previous ones
|
||||
success_rate = mpi_average(evaluator.current_success_rate())
|
||||
if rank == 0 and success_rate >= best_success_rate and save_policies:
|
||||
best_success_rate = success_rate
|
||||
logger.info('New best success rate: {}. Saving policy to {} ...'.format(best_success_rate, best_policy_path))
|
||||
evaluator.save_policy(best_policy_path)
|
||||
evaluator.save_policy(latest_policy_path)
|
||||
if rank == 0 and policy_save_interval > 0 and epoch % policy_save_interval == 0 and save_policies:
|
||||
policy_path = periodic_policy_path.format(epoch)
|
||||
logger.info('Saving periodic policy to {} ...'.format(policy_path))
|
||||
evaluator.save_policy(policy_path)
|
||||
|
||||
# make sure that different threads have different seeds
|
||||
local_uniform = np.random.uniform(size=(1,))
|
||||
root_uniform = local_uniform.copy()
|
||||
MPI.COMM_WORLD.Bcast(root_uniform, root=0)
|
||||
if rank != 0:
|
||||
assert local_uniform[0] != root_uniform[0]
|
||||
|
||||
|
||||
def launch(
|
||||
env, logdir, n_epochs, num_cpu, seed, replay_strategy, policy_save_interval, clip_return,
|
||||
demo_file, override_params={}, save_policies=True
|
||||
):
|
||||
# Fork for multi-CPU MPI implementation.
|
||||
if num_cpu > 1:
|
||||
try:
|
||||
whoami = mpi_fork(num_cpu, ['--bind-to', 'core'])
|
||||
except CalledProcessError:
|
||||
# fancy version of mpi call failed, try simple version
|
||||
whoami = mpi_fork(num_cpu)
|
||||
|
||||
if whoami == 'parent':
|
||||
sys.exit(0)
|
||||
import baselines.common.tf_util as U
|
||||
U.single_threaded_session().__enter__()
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
|
||||
# Configure logging
|
||||
if rank == 0:
|
||||
if logdir or logger.get_dir() is None:
|
||||
logger.configure(dir=logdir)
|
||||
else:
|
||||
logger.configure()
|
||||
logdir = logger.get_dir()
|
||||
assert logdir is not None
|
||||
os.makedirs(logdir, exist_ok=True)
|
||||
|
||||
# Seed everything.
|
||||
rank_seed = seed + 1000000 * rank
|
||||
set_global_seeds(rank_seed)
|
||||
|
||||
# Prepare params.
|
||||
params = config.DEFAULT_PARAMS
|
||||
params['env_name'] = env
|
||||
params['replay_strategy'] = replay_strategy
|
||||
if env in config.DEFAULT_ENV_PARAMS:
|
||||
params.update(config.DEFAULT_ENV_PARAMS[env]) # merge env-specific parameters in
|
||||
params.update(**override_params) # makes it possible to override any parameter
|
||||
with open(os.path.join(logger.get_dir(), 'params.json'), 'w') as f:
|
||||
json.dump(params, f)
|
||||
params = config.prepare_params(params)
|
||||
config.log_params(params, logger=logger)
|
||||
|
||||
if num_cpu == 1:
|
||||
logger.warn()
|
||||
logger.warn('*** Warning ***')
|
||||
logger.warn(
|
||||
'You are running HER with just a single MPI worker. This will work, but the ' +
|
||||
'experiments that we report in Plappert et al. (2018, https://arxiv.org/abs/1802.09464) ' +
|
||||
'were obtained with --num_cpu 19. This makes a significant difference and if you ' +
|
||||
'are looking to reproduce those results, be aware of this. Please also refer to ' +
|
||||
'https://github.com/openai/baselines/issues/314 for further details.')
|
||||
logger.warn('****************')
|
||||
logger.warn()
|
||||
|
||||
dims = config.configure_dims(params)
|
||||
policy = config.configure_ddpg(dims=dims, params=params, clip_return=clip_return)
|
||||
|
||||
rollout_params = {
|
||||
'exploit': False,
|
||||
'use_target_net': False,
|
||||
'use_demo_states': True,
|
||||
'compute_Q': False,
|
||||
'T': params['T'],
|
||||
}
|
||||
|
||||
eval_params = {
|
||||
'exploit': True,
|
||||
'use_target_net': params['test_with_polyak'],
|
||||
'use_demo_states': False,
|
||||
'compute_Q': True,
|
||||
'T': params['T'],
|
||||
}
|
||||
|
||||
for name in ['T', 'rollout_batch_size', 'gamma', 'noise_eps', 'random_eps']:
|
||||
rollout_params[name] = params[name]
|
||||
eval_params[name] = params[name]
|
||||
|
||||
rollout_worker = RolloutWorker(params['make_env'], policy, dims, logger, **rollout_params)
|
||||
rollout_worker.seed(rank_seed)
|
||||
|
||||
evaluator = RolloutWorker(params['make_env'], policy, dims, logger, **eval_params)
|
||||
evaluator.seed(rank_seed)
|
||||
|
||||
train(
|
||||
logdir=logdir, policy=policy, rollout_worker=rollout_worker,
|
||||
evaluator=evaluator, n_epochs=n_epochs, n_test_rollouts=params['n_test_rollouts'],
|
||||
n_cycles=params['n_cycles'], n_batches=params['n_batches'],
|
||||
policy_save_interval=policy_save_interval, save_policies=save_policies, demo_file=demo_file)
|
||||
|
||||
|
||||
@click.command()
|
||||
@click.option('--env', type=str, default='FetchReach-v1', help='the name of the OpenAI Gym environment that you want to train on')
|
||||
@click.option('--logdir', type=str, default=None, help='the path to where logs and policy pickles should go. If not specified, creates a folder in /tmp/')
|
||||
@click.option('--n_epochs', type=int, default=50, help='the number of training epochs to run')
|
||||
@click.option('--num_cpu', type=int, default=1, help='the number of CPU cores to use (using MPI)')
|
||||
@click.option('--seed', type=int, default=0, help='the random seed used to seed both the environment and the training code')
|
||||
@click.option('--policy_save_interval', type=int, default=5, help='the interval with which policy pickles are saved. If set to 0, only the best and latest policy will be pickled.')
|
||||
@click.option('--replay_strategy', type=click.Choice(['future', 'none']), default='future', help='the HER replay strategy to be used. "future" uses HER, "none" disables HER.')
|
||||
@click.option('--clip_return', type=int, default=1, help='whether or not returns should be clipped')
|
||||
@click.option('--demo_file', type=str, default = 'PATH/TO/DEMO/DATA/FILE.npz', help='demo data file path')
|
||||
def main(**kwargs):
|
||||
launch(**kwargs)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
@@ -1,63 +1,193 @@
|
||||
import os
|
||||
|
||||
import click
|
||||
import numpy as np
|
||||
import json
|
||||
from mpi4py import MPI
|
||||
|
||||
from baselines import logger
|
||||
from baselines.common import set_global_seeds, tf_util
|
||||
from baselines.common.mpi_moments import mpi_moments
|
||||
import baselines.her.experiment.config as config
|
||||
from baselines.her.rollout import RolloutWorker
|
||||
|
||||
def mpi_average(value):
|
||||
if not isinstance(value, list):
|
||||
value = [value]
|
||||
if not any(value):
|
||||
value = [0.]
|
||||
return mpi_moments(np.array(value))[0]
|
||||
|
||||
|
||||
def make_sample_her_transitions(replay_strategy, replay_k, reward_fun):
|
||||
"""Creates a sample function that can be used for HER experience replay.
|
||||
def train(*, policy, rollout_worker, evaluator,
|
||||
n_epochs, n_test_rollouts, n_cycles, n_batches, policy_save_interval,
|
||||
save_path, demo_file, **kwargs):
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
|
||||
Args:
|
||||
replay_strategy (in ['future', 'none']): the HER replay strategy; if set to 'none',
|
||||
regular DDPG experience replay is used
|
||||
replay_k (int): the ratio between HER replays and regular replays (e.g. k = 4 -> 4 times
|
||||
as many HER replays as regular replays are used)
|
||||
reward_fun (function): function to re-compute the reward with substituted goals
|
||||
"""
|
||||
if replay_strategy == 'future':
|
||||
future_p = 1 - (1. / (1 + replay_k))
|
||||
else: # 'replay_strategy' == 'none'
|
||||
future_p = 0
|
||||
if save_path:
|
||||
latest_policy_path = os.path.join(save_path, 'policy_latest.pkl')
|
||||
best_policy_path = os.path.join(save_path, 'policy_best.pkl')
|
||||
periodic_policy_path = os.path.join(save_path, 'policy_{}.pkl')
|
||||
|
||||
def _sample_her_transitions(episode_batch, batch_size_in_transitions):
|
||||
"""episode_batch is {key: array(buffer_size x T x dim_key)}
|
||||
"""
|
||||
T = episode_batch['u'].shape[1]
|
||||
rollout_batch_size = episode_batch['u'].shape[0]
|
||||
batch_size = batch_size_in_transitions
|
||||
logger.info("Training...")
|
||||
best_success_rate = -1
|
||||
|
||||
# Select which episodes and time steps to use.
|
||||
episode_idxs = np.random.randint(0, rollout_batch_size, batch_size)
|
||||
t_samples = np.random.randint(T, size=batch_size)
|
||||
transitions = {key: episode_batch[key][episode_idxs, t_samples].copy()
|
||||
for key in episode_batch.keys()}
|
||||
if policy.bc_loss == 1: policy.init_demo_buffer(demo_file) #initialize demo buffer if training with demonstrations
|
||||
|
||||
# Select future time indexes proportional with probability future_p. These
|
||||
# will be used for HER replay by substituting in future goals.
|
||||
her_indexes = np.where(np.random.uniform(size=batch_size) < future_p)
|
||||
future_offset = np.random.uniform(size=batch_size) * (T - t_samples)
|
||||
future_offset = future_offset.astype(int)
|
||||
future_t = (t_samples + 1 + future_offset)[her_indexes]
|
||||
# num_timesteps = n_epochs * n_cycles * rollout_length * number of rollout workers
|
||||
for epoch in range(n_epochs):
|
||||
# train
|
||||
rollout_worker.clear_history()
|
||||
for _ in range(n_cycles):
|
||||
episode = rollout_worker.generate_rollouts()
|
||||
policy.store_episode(episode)
|
||||
for _ in range(n_batches):
|
||||
policy.train()
|
||||
policy.update_target_net()
|
||||
|
||||
# Replace goal with achieved goal but only for the previously-selected
|
||||
# HER transitions (as defined by her_indexes). For the other transitions,
|
||||
# keep the original goal.
|
||||
future_ag = episode_batch['ag'][episode_idxs[her_indexes], future_t]
|
||||
transitions['g'][her_indexes] = future_ag
|
||||
# test
|
||||
evaluator.clear_history()
|
||||
for _ in range(n_test_rollouts):
|
||||
evaluator.generate_rollouts()
|
||||
|
||||
# Reconstruct info dictionary for reward computation.
|
||||
info = {}
|
||||
for key, value in transitions.items():
|
||||
if key.startswith('info_'):
|
||||
info[key.replace('info_', '')] = value
|
||||
# record logs
|
||||
logger.record_tabular('epoch', epoch)
|
||||
for key, val in evaluator.logs('test'):
|
||||
logger.record_tabular(key, mpi_average(val))
|
||||
for key, val in rollout_worker.logs('train'):
|
||||
logger.record_tabular(key, mpi_average(val))
|
||||
for key, val in policy.logs():
|
||||
logger.record_tabular(key, mpi_average(val))
|
||||
|
||||
# Re-compute reward since we may have substituted the goal.
|
||||
reward_params = {k: transitions[k] for k in ['ag_2', 'g']}
|
||||
reward_params['info'] = info
|
||||
transitions['r'] = reward_fun(**reward_params)
|
||||
if rank == 0:
|
||||
logger.dump_tabular()
|
||||
|
||||
transitions = {k: transitions[k].reshape(batch_size, *transitions[k].shape[1:])
|
||||
for k in transitions.keys()}
|
||||
# save the policy if it's better than the previous ones
|
||||
success_rate = mpi_average(evaluator.current_success_rate())
|
||||
if rank == 0 and success_rate >= best_success_rate and save_path:
|
||||
best_success_rate = success_rate
|
||||
logger.info('New best success rate: {}. Saving policy to {} ...'.format(best_success_rate, best_policy_path))
|
||||
evaluator.save_policy(best_policy_path)
|
||||
evaluator.save_policy(latest_policy_path)
|
||||
if rank == 0 and policy_save_interval > 0 and epoch % policy_save_interval == 0 and save_path:
|
||||
policy_path = periodic_policy_path.format(epoch)
|
||||
logger.info('Saving periodic policy to {} ...'.format(policy_path))
|
||||
evaluator.save_policy(policy_path)
|
||||
|
||||
assert(transitions['u'].shape[0] == batch_size_in_transitions)
|
||||
# make sure that different threads have different seeds
|
||||
local_uniform = np.random.uniform(size=(1,))
|
||||
root_uniform = local_uniform.copy()
|
||||
MPI.COMM_WORLD.Bcast(root_uniform, root=0)
|
||||
if rank != 0:
|
||||
assert local_uniform[0] != root_uniform[0]
|
||||
|
||||
return transitions
|
||||
return policy
|
||||
|
||||
return _sample_her_transitions
|
||||
|
||||
def learn(*, network, env, total_timesteps,
|
||||
seed=None,
|
||||
eval_env=None,
|
||||
replay_strategy='future',
|
||||
policy_save_interval=5,
|
||||
clip_return=True,
|
||||
demo_file=None,
|
||||
override_params=None,
|
||||
load_path=None,
|
||||
save_path=None,
|
||||
**kwargs
|
||||
):
|
||||
|
||||
override_params = override_params or {}
|
||||
if MPI is not None:
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
num_cpu = MPI.COMM_WORLD.Get_size()
|
||||
|
||||
# Seed everything.
|
||||
rank_seed = seed + 1000000 * rank if seed is not None else None
|
||||
set_global_seeds(rank_seed)
|
||||
|
||||
# Prepare params.
|
||||
params = config.DEFAULT_PARAMS
|
||||
env_name = env.spec.id
|
||||
params['env_name'] = env_name
|
||||
params['replay_strategy'] = replay_strategy
|
||||
if env_name in config.DEFAULT_ENV_PARAMS:
|
||||
params.update(config.DEFAULT_ENV_PARAMS[env_name]) # merge env-specific parameters in
|
||||
params.update(**override_params) # makes it possible to override any parameter
|
||||
with open(os.path.join(logger.get_dir(), 'params.json'), 'w') as f:
|
||||
json.dump(params, f)
|
||||
params = config.prepare_params(params)
|
||||
params['rollout_batch_size'] = env.num_envs
|
||||
|
||||
if demo_file is not None:
|
||||
params['bc_loss'] = 1
|
||||
params.update(kwargs)
|
||||
|
||||
config.log_params(params, logger=logger)
|
||||
|
||||
if num_cpu == 1:
|
||||
logger.warn()
|
||||
logger.warn('*** Warning ***')
|
||||
logger.warn(
|
||||
'You are running HER with just a single MPI worker. This will work, but the ' +
|
||||
'experiments that we report in Plappert et al. (2018, https://arxiv.org/abs/1802.09464) ' +
|
||||
'were obtained with --num_cpu 19. This makes a significant difference and if you ' +
|
||||
'are looking to reproduce those results, be aware of this. Please also refer to ' +
|
||||
'https://github.com/openai/baselines/issues/314 for further details.')
|
||||
logger.warn('****************')
|
||||
logger.warn()
|
||||
|
||||
dims = config.configure_dims(params)
|
||||
policy = config.configure_ddpg(dims=dims, params=params, clip_return=clip_return)
|
||||
if load_path is not None:
|
||||
tf_util.load_variables(load_path)
|
||||
|
||||
rollout_params = {
|
||||
'exploit': False,
|
||||
'use_target_net': False,
|
||||
'use_demo_states': True,
|
||||
'compute_Q': False,
|
||||
'T': params['T'],
|
||||
}
|
||||
|
||||
eval_params = {
|
||||
'exploit': True,
|
||||
'use_target_net': params['test_with_polyak'],
|
||||
'use_demo_states': False,
|
||||
'compute_Q': True,
|
||||
'T': params['T'],
|
||||
}
|
||||
|
||||
for name in ['T', 'rollout_batch_size', 'gamma', 'noise_eps', 'random_eps']:
|
||||
rollout_params[name] = params[name]
|
||||
eval_params[name] = params[name]
|
||||
|
||||
eval_env = eval_env or env
|
||||
|
||||
rollout_worker = RolloutWorker(env, policy, dims, logger, monitor=True, **rollout_params)
|
||||
evaluator = RolloutWorker(eval_env, policy, dims, logger, **eval_params)
|
||||
|
||||
n_cycles = params['n_cycles']
|
||||
n_epochs = total_timesteps // n_cycles // rollout_worker.T // rollout_worker.rollout_batch_size
|
||||
|
||||
return train(
|
||||
save_path=save_path, policy=policy, rollout_worker=rollout_worker,
|
||||
evaluator=evaluator, n_epochs=n_epochs, n_test_rollouts=params['n_test_rollouts'],
|
||||
n_cycles=params['n_cycles'], n_batches=params['n_batches'],
|
||||
policy_save_interval=policy_save_interval, demo_file=demo_file)
|
||||
|
||||
|
||||
@click.command()
|
||||
@click.option('--env', type=str, default='FetchReach-v1', help='the name of the OpenAI Gym environment that you want to train on')
|
||||
@click.option('--total_timesteps', type=int, default=int(5e5), help='the number of timesteps to run')
|
||||
@click.option('--seed', type=int, default=0, help='the random seed used to seed both the environment and the training code')
|
||||
@click.option('--policy_save_interval', type=int, default=5, help='the interval with which policy pickles are saved. If set to 0, only the best and latest policy will be pickled.')
|
||||
@click.option('--replay_strategy', type=click.Choice(['future', 'none']), default='future', help='the HER replay strategy to be used. "future" uses HER, "none" disables HER.')
|
||||
@click.option('--clip_return', type=int, default=1, help='whether or not returns should be clipped')
|
||||
@click.option('--demo_file', type=str, default = 'PATH/TO/DEMO/DATA/FILE.npz', help='demo data file path')
|
||||
def main(**kwargs):
|
||||
learn(**kwargs)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
|
63
baselines/her/her_sampler.py
Normal file
63
baselines/her/her_sampler.py
Normal file
@@ -0,0 +1,63 @@
|
||||
import numpy as np
|
||||
|
||||
|
||||
def make_sample_her_transitions(replay_strategy, replay_k, reward_fun):
|
||||
"""Creates a sample function that can be used for HER experience replay.
|
||||
|
||||
Args:
|
||||
replay_strategy (in ['future', 'none']): the HER replay strategy; if set to 'none',
|
||||
regular DDPG experience replay is used
|
||||
replay_k (int): the ratio between HER replays and regular replays (e.g. k = 4 -> 4 times
|
||||
as many HER replays as regular replays are used)
|
||||
reward_fun (function): function to re-compute the reward with substituted goals
|
||||
"""
|
||||
if replay_strategy == 'future':
|
||||
future_p = 1 - (1. / (1 + replay_k))
|
||||
else: # 'replay_strategy' == 'none'
|
||||
future_p = 0
|
||||
|
||||
def _sample_her_transitions(episode_batch, batch_size_in_transitions):
|
||||
"""episode_batch is {key: array(buffer_size x T x dim_key)}
|
||||
"""
|
||||
T = episode_batch['u'].shape[1]
|
||||
rollout_batch_size = episode_batch['u'].shape[0]
|
||||
batch_size = batch_size_in_transitions
|
||||
|
||||
# Select which episodes and time steps to use.
|
||||
episode_idxs = np.random.randint(0, rollout_batch_size, batch_size)
|
||||
t_samples = np.random.randint(T, size=batch_size)
|
||||
transitions = {key: episode_batch[key][episode_idxs, t_samples].copy()
|
||||
for key in episode_batch.keys()}
|
||||
|
||||
# Select future time indexes proportional with probability future_p. These
|
||||
# will be used for HER replay by substituting in future goals.
|
||||
her_indexes = np.where(np.random.uniform(size=batch_size) < future_p)
|
||||
future_offset = np.random.uniform(size=batch_size) * (T - t_samples)
|
||||
future_offset = future_offset.astype(int)
|
||||
future_t = (t_samples + 1 + future_offset)[her_indexes]
|
||||
|
||||
# Replace goal with achieved goal but only for the previously-selected
|
||||
# HER transitions (as defined by her_indexes). For the other transitions,
|
||||
# keep the original goal.
|
||||
future_ag = episode_batch['ag'][episode_idxs[her_indexes], future_t]
|
||||
transitions['g'][her_indexes] = future_ag
|
||||
|
||||
# Reconstruct info dictionary for reward computation.
|
||||
info = {}
|
||||
for key, value in transitions.items():
|
||||
if key.startswith('info_'):
|
||||
info[key.replace('info_', '')] = value
|
||||
|
||||
# Re-compute reward since we may have substituted the goal.
|
||||
reward_params = {k: transitions[k] for k in ['ag_2', 'g']}
|
||||
reward_params['info'] = info
|
||||
transitions['r'] = reward_fun(**reward_params)
|
||||
|
||||
transitions = {k: transitions[k].reshape(batch_size, *transitions[k].shape[1:])
|
||||
for k in transitions.keys()}
|
||||
|
||||
assert(transitions['u'].shape[0] == batch_size_in_transitions)
|
||||
|
||||
return transitions
|
||||
|
||||
return _sample_her_transitions
|
@@ -2,7 +2,6 @@ from collections import deque
|
||||
|
||||
import numpy as np
|
||||
import pickle
|
||||
from mujoco_py import MujocoException
|
||||
|
||||
from baselines.her.util import convert_episode_to_batch_major, store_args
|
||||
|
||||
@@ -10,14 +9,13 @@ from baselines.her.util import convert_episode_to_batch_major, store_args
|
||||
class RolloutWorker:
|
||||
|
||||
@store_args
|
||||
def __init__(self, make_env, policy, dims, logger, T, rollout_batch_size=1,
|
||||
def __init__(self, venv, policy, dims, logger, T, rollout_batch_size=1,
|
||||
exploit=False, use_target_net=False, compute_Q=False, noise_eps=0,
|
||||
random_eps=0, history_len=100, render=False, **kwargs):
|
||||
random_eps=0, history_len=100, render=False, monitor=False, **kwargs):
|
||||
"""Rollout worker generates experience by interacting with one or many environments.
|
||||
|
||||
Args:
|
||||
make_env (function): a factory function that creates a new instance of the environment
|
||||
when called
|
||||
venv: vectorized gym environments.
|
||||
policy (object): the policy that is used to act
|
||||
dims (dict of ints): the dimensions for observations (o), goals (g), and actions (u)
|
||||
logger (object): the logger that is used by the rollout worker
|
||||
@@ -31,7 +29,7 @@ class RolloutWorker:
|
||||
history_len (int): length of history for statistics smoothing
|
||||
render (boolean): whether or not to render the rollouts
|
||||
"""
|
||||
self.envs = [make_env() for _ in range(rollout_batch_size)]
|
||||
|
||||
assert self.T > 0
|
||||
|
||||
self.info_keys = [key.replace('info_', '') for key in dims.keys() if key.startswith('info_')]
|
||||
@@ -40,26 +38,14 @@ class RolloutWorker:
|
||||
self.Q_history = deque(maxlen=history_len)
|
||||
|
||||
self.n_episodes = 0
|
||||
self.g = np.empty((self.rollout_batch_size, self.dims['g']), np.float32) # goals
|
||||
self.initial_o = np.empty((self.rollout_batch_size, self.dims['o']), np.float32) # observations
|
||||
self.initial_ag = np.empty((self.rollout_batch_size, self.dims['g']), np.float32) # achieved goals
|
||||
self.reset_all_rollouts()
|
||||
self.clear_history()
|
||||
|
||||
def reset_rollout(self, i):
|
||||
"""Resets the `i`-th rollout environment, re-samples a new goal, and updates the `initial_o`
|
||||
and `g` arrays accordingly.
|
||||
"""
|
||||
obs = self.envs[i].reset()
|
||||
self.initial_o[i] = obs['observation']
|
||||
self.initial_ag[i] = obs['achieved_goal']
|
||||
self.g[i] = obs['desired_goal']
|
||||
|
||||
def reset_all_rollouts(self):
|
||||
"""Resets all `rollout_batch_size` rollout workers.
|
||||
"""
|
||||
for i in range(self.rollout_batch_size):
|
||||
self.reset_rollout(i)
|
||||
self.obs_dict = self.venv.reset()
|
||||
self.initial_o = self.obs_dict['observation']
|
||||
self.initial_ag = self.obs_dict['achieved_goal']
|
||||
self.g = self.obs_dict['desired_goal']
|
||||
|
||||
def generate_rollouts(self):
|
||||
"""Performs `rollout_batch_size` rollouts in parallel for time horizon `T` with the current
|
||||
@@ -75,7 +61,8 @@ class RolloutWorker:
|
||||
|
||||
# generate episodes
|
||||
obs, achieved_goals, acts, goals, successes = [], [], [], [], []
|
||||
info_values = [np.empty((self.T, self.rollout_batch_size, self.dims['info_' + key]), np.float32) for key in self.info_keys]
|
||||
dones = []
|
||||
info_values = [np.empty((self.T - 1, self.rollout_batch_size, self.dims['info_' + key]), np.float32) for key in self.info_keys]
|
||||
Qs = []
|
||||
for t in range(self.T):
|
||||
policy_output = self.policy.get_actions(
|
||||
@@ -99,27 +86,27 @@ class RolloutWorker:
|
||||
ag_new = np.empty((self.rollout_batch_size, self.dims['g']))
|
||||
success = np.zeros(self.rollout_batch_size)
|
||||
# compute new states and observations
|
||||
for i in range(self.rollout_batch_size):
|
||||
try:
|
||||
# We fully ignore the reward here because it will have to be re-computed
|
||||
# for HER.
|
||||
curr_o_new, _, _, info = self.envs[i].step(u[i])
|
||||
if 'is_success' in info:
|
||||
success[i] = info['is_success']
|
||||
o_new[i] = curr_o_new['observation']
|
||||
ag_new[i] = curr_o_new['achieved_goal']
|
||||
for idx, key in enumerate(self.info_keys):
|
||||
info_values[idx][t, i] = info[key]
|
||||
if self.render:
|
||||
self.envs[i].render()
|
||||
except MujocoException as e:
|
||||
return self.generate_rollouts()
|
||||
obs_dict_new, _, done, info = self.venv.step(u)
|
||||
o_new = obs_dict_new['observation']
|
||||
ag_new = obs_dict_new['achieved_goal']
|
||||
success = np.array([i.get('is_success', 0.0) for i in info])
|
||||
|
||||
if any(done):
|
||||
# here we assume all environments are done is ~same number of steps, so we terminate rollouts whenever any of the envs returns done
|
||||
# trick with using vecenvs is not to add the obs from the environments that are "done", because those are already observations
|
||||
# after a reset
|
||||
break
|
||||
|
||||
for i, info_dict in enumerate(info):
|
||||
for idx, key in enumerate(self.info_keys):
|
||||
info_values[idx][t, i] = info[i][key]
|
||||
|
||||
if np.isnan(o_new).any():
|
||||
self.logger.warn('NaN caught during rollout generation. Trying again...')
|
||||
self.reset_all_rollouts()
|
||||
return self.generate_rollouts()
|
||||
|
||||
dones.append(done)
|
||||
obs.append(o.copy())
|
||||
achieved_goals.append(ag.copy())
|
||||
successes.append(success.copy())
|
||||
@@ -129,7 +116,6 @@ class RolloutWorker:
|
||||
ag[...] = ag_new
|
||||
obs.append(o.copy())
|
||||
achieved_goals.append(ag.copy())
|
||||
self.initial_o[:] = o
|
||||
|
||||
episode = dict(o=obs,
|
||||
u=acts,
|
||||
@@ -176,13 +162,8 @@ class RolloutWorker:
|
||||
logs += [('mean_Q', np.mean(self.Q_history))]
|
||||
logs += [('episode', self.n_episodes)]
|
||||
|
||||
if prefix is not '' and not prefix.endswith('/'):
|
||||
if prefix != '' and not prefix.endswith('/'):
|
||||
return [(prefix + '/' + key, val) for key, val in logs]
|
||||
else:
|
||||
return logs
|
||||
|
||||
def seed(self, seed):
|
||||
"""Seeds each environment with a distinct seed derived from the passed in global seed.
|
||||
"""
|
||||
for idx, env in enumerate(self.envs):
|
||||
env.seed(seed + 1000 * idx)
|
||||
|
@@ -7,6 +7,7 @@ import time
|
||||
import datetime
|
||||
import tempfile
|
||||
from collections import defaultdict
|
||||
from contextlib import contextmanager
|
||||
|
||||
DEBUG = 10
|
||||
INFO = 20
|
||||
@@ -37,8 +38,8 @@ class HumanOutputFormat(KVWriter, SeqWriter):
|
||||
# Create strings for printing
|
||||
key2str = {}
|
||||
for (key, val) in sorted(kvs.items()):
|
||||
if isinstance(val, float):
|
||||
valstr = '%-8.3g' % (val,)
|
||||
if hasattr(val, '__float__'):
|
||||
valstr = '%-8.3g' % val
|
||||
else:
|
||||
valstr = str(val)
|
||||
key2str[self._truncate(key)] = self._truncate(valstr)
|
||||
@@ -54,7 +55,7 @@ class HumanOutputFormat(KVWriter, SeqWriter):
|
||||
# Write out the data
|
||||
dashes = '-' * (keywidth + valwidth + 7)
|
||||
lines = [dashes]
|
||||
for (key, val) in sorted(key2str.items()):
|
||||
for (key, val) in sorted(key2str.items(), key=lambda kv: kv[0].lower()):
|
||||
lines.append('| %s%s | %s%s |' % (
|
||||
key,
|
||||
' ' * (keywidth - len(key)),
|
||||
@@ -68,7 +69,8 @@ class HumanOutputFormat(KVWriter, SeqWriter):
|
||||
self.file.flush()
|
||||
|
||||
def _truncate(self, s):
|
||||
return s[:20] + '...' if len(s) > 23 else s
|
||||
maxlen = 30
|
||||
return s[:maxlen-3] + '...' if len(s) > maxlen else s
|
||||
|
||||
def writeseq(self, seq):
|
||||
seq = list(seq)
|
||||
@@ -90,7 +92,6 @@ class JSONOutputFormat(KVWriter):
|
||||
def writekvs(self, kvs):
|
||||
for k, v in sorted(kvs.items()):
|
||||
if hasattr(v, 'dtype'):
|
||||
v = v.tolist()
|
||||
kvs[k] = float(v)
|
||||
self.file.write(json.dumps(kvs) + '\n')
|
||||
self.file.flush()
|
||||
@@ -195,13 +196,13 @@ def logkv(key, val):
|
||||
Call this once for each diagnostic quantity, each iteration
|
||||
If called many times, last value will be used.
|
||||
"""
|
||||
Logger.CURRENT.logkv(key, val)
|
||||
get_current().logkv(key, val)
|
||||
|
||||
def logkv_mean(key, val):
|
||||
"""
|
||||
The same as logkv(), but if called many times, values averaged.
|
||||
"""
|
||||
Logger.CURRENT.logkv_mean(key, val)
|
||||
get_current().logkv_mean(key, val)
|
||||
|
||||
def logkvs(d):
|
||||
"""
|
||||
@@ -213,21 +214,18 @@ def logkvs(d):
|
||||
def dumpkvs():
|
||||
"""
|
||||
Write all of the diagnostics from the current iteration
|
||||
|
||||
level: int. (see logger.py docs) If the global logger level is higher than
|
||||
the level argument here, don't print to stdout.
|
||||
"""
|
||||
Logger.CURRENT.dumpkvs()
|
||||
return get_current().dumpkvs()
|
||||
|
||||
def getkvs():
|
||||
return Logger.CURRENT.name2val
|
||||
return get_current().name2val
|
||||
|
||||
|
||||
def log(*args, level=INFO):
|
||||
"""
|
||||
Write the sequence of args, with no separators, to the console and output files (if you've configured an output file).
|
||||
"""
|
||||
Logger.CURRENT.log(*args, level=level)
|
||||
get_current().log(*args, level=level)
|
||||
|
||||
def debug(*args):
|
||||
log(*args, level=DEBUG)
|
||||
@@ -246,30 +244,29 @@ def set_level(level):
|
||||
"""
|
||||
Set logging threshold on current logger.
|
||||
"""
|
||||
Logger.CURRENT.set_level(level)
|
||||
get_current().set_level(level)
|
||||
|
||||
def set_comm(comm):
|
||||
get_current().set_comm(comm)
|
||||
|
||||
def get_dir():
|
||||
"""
|
||||
Get directory that log files are being written to.
|
||||
will be None if there is no output directory (i.e., if you didn't call start)
|
||||
"""
|
||||
return Logger.CURRENT.get_dir()
|
||||
return get_current().get_dir()
|
||||
|
||||
record_tabular = logkv
|
||||
dump_tabular = dumpkvs
|
||||
|
||||
class ProfileKV:
|
||||
"""
|
||||
Usage:
|
||||
with logger.ProfileKV("interesting_scope"):
|
||||
code
|
||||
"""
|
||||
def __init__(self, n):
|
||||
self.n = "wait_" + n
|
||||
def __enter__(self):
|
||||
self.t1 = time.time()
|
||||
def __exit__(self ,type, value, traceback):
|
||||
Logger.CURRENT.name2val[self.n] += time.time() - self.t1
|
||||
@contextmanager
|
||||
def profile_kv(scopename):
|
||||
logkey = 'wait_' + scopename
|
||||
tstart = time.time()
|
||||
try:
|
||||
yield
|
||||
finally:
|
||||
get_current().name2val[logkey] += time.time() - tstart
|
||||
|
||||
def profile(n):
|
||||
"""
|
||||
@@ -279,7 +276,7 @@ def profile(n):
|
||||
"""
|
||||
def decorator_with_name(func):
|
||||
def func_wrapper(*args, **kwargs):
|
||||
with ProfileKV(n):
|
||||
with profile_kv(n):
|
||||
return func(*args, **kwargs)
|
||||
return func_wrapper
|
||||
return decorator_with_name
|
||||
@@ -289,17 +286,25 @@ def profile(n):
|
||||
# Backend
|
||||
# ================================================================
|
||||
|
||||
def get_current():
|
||||
if Logger.CURRENT is None:
|
||||
_configure_default_logger()
|
||||
|
||||
return Logger.CURRENT
|
||||
|
||||
|
||||
class Logger(object):
|
||||
DEFAULT = None # A logger with no output files. (See right below class definition)
|
||||
# So that you can still log to the terminal without setting up any output files
|
||||
CURRENT = None # Current logger being used by the free functions above
|
||||
|
||||
def __init__(self, dir, output_formats):
|
||||
def __init__(self, dir, output_formats, comm=None):
|
||||
self.name2val = defaultdict(float) # values this iteration
|
||||
self.name2cnt = defaultdict(int)
|
||||
self.level = INFO
|
||||
self.dir = dir
|
||||
self.output_formats = output_formats
|
||||
self.comm = comm
|
||||
|
||||
# Logging API, forwarded
|
||||
# ----------------------------------------
|
||||
@@ -307,20 +312,27 @@ class Logger(object):
|
||||
self.name2val[key] = val
|
||||
|
||||
def logkv_mean(self, key, val):
|
||||
if val is None:
|
||||
self.name2val[key] = None
|
||||
return
|
||||
oldval, cnt = self.name2val[key], self.name2cnt[key]
|
||||
self.name2val[key] = oldval*cnt/(cnt+1) + val/(cnt+1)
|
||||
self.name2cnt[key] = cnt + 1
|
||||
|
||||
def dumpkvs(self):
|
||||
if self.level == DISABLED: return
|
||||
if self.comm is None:
|
||||
d = self.name2val
|
||||
else:
|
||||
from baselines.common import mpi_util
|
||||
d = mpi_util.mpi_weighted_mean(self.comm,
|
||||
{name : (val, self.name2cnt.get(name, 1))
|
||||
for (name, val) in self.name2val.items()})
|
||||
if self.comm.rank != 0:
|
||||
d['dummy'] = 1 # so we don't get a warning about empty dict
|
||||
out = d.copy() # Return the dict for unit testing purposes
|
||||
for fmt in self.output_formats:
|
||||
if isinstance(fmt, KVWriter):
|
||||
fmt.writekvs(self.name2val)
|
||||
fmt.writekvs(d)
|
||||
self.name2val.clear()
|
||||
self.name2cnt.clear()
|
||||
return out
|
||||
|
||||
def log(self, *args, level=INFO):
|
||||
if self.level <= level:
|
||||
@@ -331,6 +343,9 @@ class Logger(object):
|
||||
def set_level(self, level):
|
||||
self.level = level
|
||||
|
||||
def set_comm(self, comm):
|
||||
self.comm = comm
|
||||
|
||||
def get_dir(self):
|
||||
return self.dir
|
||||
|
||||
@@ -345,24 +360,31 @@ class Logger(object):
|
||||
if isinstance(fmt, SeqWriter):
|
||||
fmt.writeseq(map(str, args))
|
||||
|
||||
def configure(dir=None, format_strs=None):
|
||||
def get_rank_without_mpi_import():
|
||||
# check environment variables here instead of importing mpi4py
|
||||
# to avoid calling MPI_Init() when this module is imported
|
||||
for varname in ['PMI_RANK', 'OMPI_COMM_WORLD_RANK']:
|
||||
if varname in os.environ:
|
||||
return int(os.environ[varname])
|
||||
return 0
|
||||
|
||||
|
||||
def configure(dir=None, format_strs=None, comm=None, log_suffix=''):
|
||||
"""
|
||||
If comm is provided, average all numerical stats across that comm
|
||||
"""
|
||||
if dir is None:
|
||||
dir = os.getenv('OPENAI_LOGDIR')
|
||||
if dir is None:
|
||||
dir = osp.join(tempfile.gettempdir(),
|
||||
datetime.datetime.now().strftime("openai-%Y-%m-%d-%H-%M-%S-%f"))
|
||||
assert isinstance(dir, str)
|
||||
os.makedirs(dir, exist_ok=True)
|
||||
dir = os.path.expanduser(dir)
|
||||
os.makedirs(os.path.expanduser(dir), exist_ok=True)
|
||||
|
||||
log_suffix = ''
|
||||
rank = 0
|
||||
# check environment variables here instead of importing mpi4py
|
||||
# to avoid calling MPI_Init() when this module is imported
|
||||
for varname in ['PMI_RANK', 'OMPI_COMM_WORLD_RANK']:
|
||||
if varname in os.environ:
|
||||
rank = int(os.environ[varname])
|
||||
rank = get_rank_without_mpi_import()
|
||||
if rank > 0:
|
||||
log_suffix = "-rank%03i" % rank
|
||||
log_suffix = log_suffix + "-rank%03i" % rank
|
||||
|
||||
if format_strs is None:
|
||||
if rank == 0:
|
||||
@@ -372,15 +394,12 @@ def configure(dir=None, format_strs=None):
|
||||
format_strs = filter(None, format_strs)
|
||||
output_formats = [make_output_format(f, dir, log_suffix) for f in format_strs]
|
||||
|
||||
Logger.CURRENT = Logger(dir=dir, output_formats=output_formats)
|
||||
log('Logging to %s'%dir)
|
||||
Logger.CURRENT = Logger(dir=dir, output_formats=output_formats, comm=comm)
|
||||
if output_formats:
|
||||
log('Logging to %s'%dir)
|
||||
|
||||
def _configure_default_logger():
|
||||
format_strs = None
|
||||
# keep the old default of only writing to stdout
|
||||
if 'OPENAI_LOG_FORMAT' not in os.environ:
|
||||
format_strs = ['stdout']
|
||||
configure(format_strs=format_strs)
|
||||
configure()
|
||||
Logger.DEFAULT = Logger.CURRENT
|
||||
|
||||
def reset():
|
||||
@@ -389,17 +408,15 @@ def reset():
|
||||
Logger.CURRENT = Logger.DEFAULT
|
||||
log('Reset logger')
|
||||
|
||||
class scoped_configure(object):
|
||||
def __init__(self, dir=None, format_strs=None):
|
||||
self.dir = dir
|
||||
self.format_strs = format_strs
|
||||
self.prevlogger = None
|
||||
def __enter__(self):
|
||||
self.prevlogger = Logger.CURRENT
|
||||
configure(dir=self.dir, format_strs=self.format_strs)
|
||||
def __exit__(self, *args):
|
||||
@contextmanager
|
||||
def scoped_configure(dir=None, format_strs=None, comm=None):
|
||||
prevlogger = Logger.CURRENT
|
||||
configure(dir=dir, format_strs=format_strs, comm=comm)
|
||||
try:
|
||||
yield
|
||||
finally:
|
||||
Logger.CURRENT.close()
|
||||
Logger.CURRENT = self.prevlogger
|
||||
Logger.CURRENT = prevlogger
|
||||
|
||||
# ================================================================
|
||||
|
||||
@@ -423,7 +440,7 @@ def _demo():
|
||||
logkv_mean("b", -44.4)
|
||||
logkv("a", 5.5)
|
||||
dumpkvs()
|
||||
info("^^^ should see b = 33.3")
|
||||
info("^^^ should see b = -33.3")
|
||||
|
||||
logkv("b", -2.5)
|
||||
dumpkvs()
|
||||
@@ -456,7 +473,6 @@ def read_tb(path):
|
||||
import pandas
|
||||
import numpy as np
|
||||
from glob import glob
|
||||
from collections import defaultdict
|
||||
import tensorflow as tf
|
||||
if osp.isdir(path):
|
||||
fnames = glob(osp.join(path, "events.*"))
|
||||
@@ -482,8 +498,5 @@ def read_tb(path):
|
||||
data[step-1, colidx] = value
|
||||
return pandas.DataFrame(data, columns=tags)
|
||||
|
||||
# configure the default logger on import
|
||||
_configure_default_logger()
|
||||
|
||||
if __name__ == "__main__":
|
||||
_demo()
|
||||
|
@@ -97,7 +97,6 @@ def learn(env, policy_fn, *,
|
||||
ret = tf.placeholder(dtype=tf.float32, shape=[None]) # Empirical return
|
||||
|
||||
lrmult = tf.placeholder(name='lrmult', dtype=tf.float32, shape=[]) # learning rate multiplier, updated with schedule
|
||||
clip_param = clip_param * lrmult # Annealed cliping parameter epislon
|
||||
|
||||
ob = U.get_placeholder_cached(name="ob")
|
||||
ac = pi.pdtype.sample_placeholder([None])
|
||||
@@ -168,7 +167,7 @@ def learn(env, policy_fn, *,
|
||||
ob, ac, atarg, tdlamret = seg["ob"], seg["ac"], seg["adv"], seg["tdlamret"]
|
||||
vpredbefore = seg["vpred"] # predicted value function before udpate
|
||||
atarg = (atarg - atarg.mean()) / atarg.std() # standardized advantage function estimate
|
||||
d = Dataset(dict(ob=ob, ac=ac, atarg=atarg, vtarg=tdlamret), shuffle=not pi.recurrent)
|
||||
d = Dataset(dict(ob=ob, ac=ac, atarg=atarg, vtarg=tdlamret), deterministic=pi.recurrent)
|
||||
optim_batchsize = optim_batchsize or ob.shape[0]
|
||||
|
||||
if hasattr(pi, "ob_rms"): pi.ob_rms.update(ob) # update running mean/std for policy
|
||||
|
@@ -19,16 +19,17 @@ def train(num_timesteps, seed, model_path=None):
|
||||
# these are good enough to make humanoid walk, but whether those are
|
||||
# an absolute best or not is not certain
|
||||
env = RewScale(env, 0.1)
|
||||
logger.log("NOTE: reward will be scaled by a factor of 10 in logged stats. Check the monitor for unscaled reward.")
|
||||
pi = pposgd_simple.learn(env, policy_fn,
|
||||
max_timesteps=num_timesteps,
|
||||
timesteps_per_actorbatch=2048,
|
||||
clip_param=0.2, entcoeff=0.0,
|
||||
clip_param=0.1, entcoeff=0.0,
|
||||
optim_epochs=10,
|
||||
optim_stepsize=3e-4,
|
||||
optim_stepsize=1e-4,
|
||||
optim_batchsize=64,
|
||||
gamma=0.99,
|
||||
lam=0.95,
|
||||
schedule='linear',
|
||||
schedule='constant',
|
||||
)
|
||||
env.close()
|
||||
if model_path:
|
||||
@@ -47,7 +48,7 @@ def main():
|
||||
logger.configure()
|
||||
parser = mujoco_arg_parser()
|
||||
parser.add_argument('--model-path', default=os.path.join(logger.get_dir(), 'humanoid_policy'))
|
||||
parser.set_defaults(num_timesteps=int(2e7))
|
||||
parser.set_defaults(num_timesteps=int(5e7))
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
@@ -68,8 +69,5 @@ def main():
|
||||
if done:
|
||||
ob = env.reset()
|
||||
|
||||
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
|
@@ -18,7 +18,7 @@ def atari():
|
||||
lam=0.95, gamma=0.99, noptepochs=4, log_interval=1,
|
||||
ent_coef=.01,
|
||||
lr=lambda f : f * 2.5e-4,
|
||||
cliprange=lambda f : f * 0.1,
|
||||
cliprange=0.1,
|
||||
)
|
||||
|
||||
def retro():
|
||||
|
78
baselines/ppo2/microbatched_model.py
Normal file
78
baselines/ppo2/microbatched_model.py
Normal file
@@ -0,0 +1,78 @@
|
||||
import tensorflow as tf
|
||||
import numpy as np
|
||||
from baselines.ppo2.model import Model
|
||||
|
||||
class MicrobatchedModel(Model):
|
||||
"""
|
||||
Model that does training one microbatch at a time - when gradient computation
|
||||
on the entire minibatch causes some overflow
|
||||
"""
|
||||
def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
|
||||
nsteps, ent_coef, vf_coef, max_grad_norm, mpi_rank_weight, comm, microbatch_size):
|
||||
|
||||
self.nmicrobatches = nbatch_train // microbatch_size
|
||||
self.microbatch_size = microbatch_size
|
||||
assert nbatch_train % microbatch_size == 0, 'microbatch_size ({}) should divide nbatch_train ({}) evenly'.format(microbatch_size, nbatch_train)
|
||||
|
||||
super().__init__(
|
||||
policy=policy,
|
||||
ob_space=ob_space,
|
||||
ac_space=ac_space,
|
||||
nbatch_act=nbatch_act,
|
||||
nbatch_train=microbatch_size,
|
||||
nsteps=nsteps,
|
||||
ent_coef=ent_coef,
|
||||
vf_coef=vf_coef,
|
||||
max_grad_norm=max_grad_norm,
|
||||
mpi_rank_weight=mpi_rank_weight,
|
||||
comm=comm)
|
||||
|
||||
self.grads_ph = [tf.placeholder(dtype=g.dtype, shape=g.shape) for g in self.grads]
|
||||
grads_ph_and_vars = list(zip(self.grads_ph, self.var))
|
||||
self._apply_gradients_op = self.trainer.apply_gradients(grads_ph_and_vars)
|
||||
|
||||
|
||||
def train(self, lr, cliprange, obs, returns, masks, actions, values, neglogpacs, states=None):
|
||||
assert states is None, "microbatches with recurrent models are not supported yet"
|
||||
|
||||
# Here we calculate advantage A(s,a) = R + yV(s') - V(s)
|
||||
# Returns = R + yV(s')
|
||||
advs = returns - values
|
||||
|
||||
# Normalize the advantages
|
||||
advs = (advs - advs.mean()) / (advs.std() + 1e-8)
|
||||
|
||||
# Initialize empty list for per-microbatch stats like pg_loss, vf_loss, entropy, approxkl (whatever is in self.stats_list)
|
||||
stats_vs = []
|
||||
|
||||
for microbatch_idx in range(self.nmicrobatches):
|
||||
_sli = range(microbatch_idx * self.microbatch_size, (microbatch_idx+1) * self.microbatch_size)
|
||||
td_map = {
|
||||
self.train_model.X: obs[_sli],
|
||||
self.A:actions[_sli],
|
||||
self.ADV:advs[_sli],
|
||||
self.R:returns[_sli],
|
||||
self.CLIPRANGE:cliprange,
|
||||
self.OLDNEGLOGPAC:neglogpacs[_sli],
|
||||
self.OLDVPRED:values[_sli]
|
||||
}
|
||||
|
||||
# Compute gradient on a microbatch (note that variables do not change here) ...
|
||||
grad_v, stats_v = self.sess.run([self.grads, self.stats_list], td_map)
|
||||
if microbatch_idx == 0:
|
||||
sum_grad_v = grad_v
|
||||
else:
|
||||
# .. and add to the total of the gradients
|
||||
for i, g in enumerate(grad_v):
|
||||
sum_grad_v[i] += g
|
||||
stats_vs.append(stats_v)
|
||||
|
||||
feed_dict = {ph: sum_g / self.nmicrobatches for ph, sum_g in zip(self.grads_ph, sum_grad_v)}
|
||||
feed_dict[self.LR] = lr
|
||||
# Update variables using average of the gradients
|
||||
self.sess.run(self._apply_gradients_op, feed_dict)
|
||||
# Return average of the stats
|
||||
return np.mean(np.array(stats_vs), axis=0).tolist()
|
||||
|
||||
|
||||
|
159
baselines/ppo2/model.py
Normal file
159
baselines/ppo2/model.py
Normal file
@@ -0,0 +1,159 @@
|
||||
import tensorflow as tf
|
||||
import functools
|
||||
|
||||
from baselines.common.tf_util import get_session, save_variables, load_variables
|
||||
from baselines.common.tf_util import initialize
|
||||
|
||||
try:
|
||||
from baselines.common.mpi_adam_optimizer import MpiAdamOptimizer
|
||||
from mpi4py import MPI
|
||||
from baselines.common.mpi_util import sync_from_root
|
||||
except ImportError:
|
||||
MPI = None
|
||||
|
||||
class Model(object):
|
||||
"""
|
||||
We use this object to :
|
||||
__init__:
|
||||
- Creates the step_model
|
||||
- Creates the train_model
|
||||
|
||||
train():
|
||||
- Make the training part (feedforward and retropropagation of gradients)
|
||||
|
||||
save/load():
|
||||
- Save load the model
|
||||
"""
|
||||
def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
|
||||
nsteps, ent_coef, vf_coef, max_grad_norm, mpi_rank_weight=1, comm=None, microbatch_size=None):
|
||||
self.sess = sess = get_session()
|
||||
|
||||
if MPI is not None and comm is None:
|
||||
comm = MPI.COMM_WORLD
|
||||
|
||||
with tf.variable_scope('ppo2_model', reuse=tf.AUTO_REUSE):
|
||||
# CREATE OUR TWO MODELS
|
||||
# act_model that is used for sampling
|
||||
act_model = policy(nbatch_act, 1, sess)
|
||||
|
||||
# Train model for training
|
||||
if microbatch_size is None:
|
||||
train_model = policy(nbatch_train, nsteps, sess)
|
||||
else:
|
||||
train_model = policy(microbatch_size, nsteps, sess)
|
||||
|
||||
# CREATE THE PLACEHOLDERS
|
||||
self.A = A = train_model.pdtype.sample_placeholder([None])
|
||||
self.ADV = ADV = tf.placeholder(tf.float32, [None])
|
||||
self.R = R = tf.placeholder(tf.float32, [None])
|
||||
# Keep track of old actor
|
||||
self.OLDNEGLOGPAC = OLDNEGLOGPAC = tf.placeholder(tf.float32, [None])
|
||||
# Keep track of old critic
|
||||
self.OLDVPRED = OLDVPRED = tf.placeholder(tf.float32, [None])
|
||||
self.LR = LR = tf.placeholder(tf.float32, [])
|
||||
# Cliprange
|
||||
self.CLIPRANGE = CLIPRANGE = tf.placeholder(tf.float32, [])
|
||||
|
||||
neglogpac = train_model.pd.neglogp(A)
|
||||
|
||||
# Calculate the entropy
|
||||
# Entropy is used to improve exploration by limiting the premature convergence to suboptimal policy.
|
||||
entropy = tf.reduce_mean(train_model.pd.entropy())
|
||||
|
||||
# CALCULATE THE LOSS
|
||||
# Total loss = Policy gradient loss - entropy * entropy coefficient + Value coefficient * value loss
|
||||
|
||||
# Clip the value to reduce variability during Critic training
|
||||
# Get the predicted value
|
||||
vpred = train_model.vf
|
||||
vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)
|
||||
# Unclipped value
|
||||
vf_losses1 = tf.square(vpred - R)
|
||||
# Clipped value
|
||||
vf_losses2 = tf.square(vpredclipped - R)
|
||||
|
||||
vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))
|
||||
|
||||
# Calculate ratio (pi current policy / pi old policy)
|
||||
ratio = tf.exp(OLDNEGLOGPAC - neglogpac)
|
||||
|
||||
# Defining Loss = - J is equivalent to max J
|
||||
pg_losses = -ADV * ratio
|
||||
|
||||
pg_losses2 = -ADV * tf.clip_by_value(ratio, 1.0 - CLIPRANGE, 1.0 + CLIPRANGE)
|
||||
|
||||
# Final PG loss
|
||||
pg_loss = tf.reduce_mean(tf.maximum(pg_losses, pg_losses2))
|
||||
approxkl = .5 * tf.reduce_mean(tf.square(neglogpac - OLDNEGLOGPAC))
|
||||
clipfrac = tf.reduce_mean(tf.to_float(tf.greater(tf.abs(ratio - 1.0), CLIPRANGE)))
|
||||
|
||||
# Total loss
|
||||
loss = pg_loss - entropy * ent_coef + vf_loss * vf_coef
|
||||
|
||||
# UPDATE THE PARAMETERS USING LOSS
|
||||
# 1. Get the model parameters
|
||||
params = tf.trainable_variables('ppo2_model')
|
||||
# 2. Build our trainer
|
||||
if comm is not None and comm.Get_size() > 1:
|
||||
self.trainer = MpiAdamOptimizer(comm, learning_rate=LR, mpi_rank_weight=mpi_rank_weight, epsilon=1e-5)
|
||||
else:
|
||||
self.trainer = tf.train.AdamOptimizer(learning_rate=LR, epsilon=1e-5)
|
||||
# 3. Calculate the gradients
|
||||
grads_and_var = self.trainer.compute_gradients(loss, params)
|
||||
grads, var = zip(*grads_and_var)
|
||||
|
||||
if max_grad_norm is not None:
|
||||
# Clip the gradients (normalize)
|
||||
grads, _grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
|
||||
grads_and_var = list(zip(grads, var))
|
||||
# zip aggregate each gradient with parameters associated
|
||||
# For instance zip(ABCD, xyza) => Ax, By, Cz, Da
|
||||
|
||||
self.grads = grads
|
||||
self.var = var
|
||||
self._train_op = self.trainer.apply_gradients(grads_and_var)
|
||||
self.loss_names = ['policy_loss', 'value_loss', 'policy_entropy', 'approxkl', 'clipfrac']
|
||||
self.stats_list = [pg_loss, vf_loss, entropy, approxkl, clipfrac]
|
||||
|
||||
|
||||
self.train_model = train_model
|
||||
self.act_model = act_model
|
||||
self.step = act_model.step
|
||||
self.value = act_model.value
|
||||
self.initial_state = act_model.initial_state
|
||||
|
||||
self.save = functools.partial(save_variables, sess=sess)
|
||||
self.load = functools.partial(load_variables, sess=sess)
|
||||
|
||||
initialize()
|
||||
global_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="")
|
||||
if MPI is not None:
|
||||
sync_from_root(sess, global_variables, comm=comm) #pylint: disable=E1101
|
||||
|
||||
def train(self, lr, cliprange, obs, returns, masks, actions, values, neglogpacs, states=None):
|
||||
# Here we calculate advantage A(s,a) = R + yV(s') - V(s)
|
||||
# Returns = R + yV(s')
|
||||
advs = returns - values
|
||||
|
||||
# Normalize the advantages
|
||||
advs = (advs - advs.mean()) / (advs.std() + 1e-8)
|
||||
|
||||
td_map = {
|
||||
self.train_model.X : obs,
|
||||
self.A : actions,
|
||||
self.ADV : advs,
|
||||
self.R : returns,
|
||||
self.LR : lr,
|
||||
self.CLIPRANGE : cliprange,
|
||||
self.OLDNEGLOGPAC : neglogpacs,
|
||||
self.OLDVPRED : values
|
||||
}
|
||||
if states is not None:
|
||||
td_map[self.train_model.S] = states
|
||||
td_map[self.train_model.M] = masks
|
||||
|
||||
return self.sess.run(
|
||||
self.stats_list + [self._train_op],
|
||||
td_map
|
||||
)[:-1]
|
||||
|
@@ -1,226 +1,17 @@
|
||||
import os
|
||||
import time
|
||||
import functools
|
||||
import numpy as np
|
||||
import os.path as osp
|
||||
import tensorflow as tf
|
||||
from baselines import logger
|
||||
from collections import deque
|
||||
from baselines.common import explained_variance, set_global_seeds
|
||||
from baselines.common.policies import build_policy
|
||||
from baselines.common.runners import AbstractEnvRunner
|
||||
from baselines.common.tf_util import get_session, save_variables, load_variables
|
||||
|
||||
try:
|
||||
from baselines.common.mpi_adam_optimizer import MpiAdamOptimizer
|
||||
from mpi4py import MPI
|
||||
from baselines.common.mpi_util import sync_from_root
|
||||
except ImportError:
|
||||
MPI = None
|
||||
from baselines.ppo2.runner import Runner
|
||||
|
||||
from baselines.common.tf_util import initialize
|
||||
|
||||
class Model(object):
|
||||
"""
|
||||
We use this object to :
|
||||
__init__:
|
||||
- Creates the step_model
|
||||
- Creates the train_model
|
||||
|
||||
train():
|
||||
- Make the training part (feedforward and retropropagation of gradients)
|
||||
|
||||
save/load():
|
||||
- Save load the model
|
||||
"""
|
||||
def __init__(self, *, policy, ob_space, ac_space, nbatch_act, nbatch_train,
|
||||
nsteps, ent_coef, vf_coef, max_grad_norm):
|
||||
sess = get_session()
|
||||
|
||||
with tf.variable_scope('ppo2_model', reuse=tf.AUTO_REUSE):
|
||||
# CREATE OUR TWO MODELS
|
||||
# act_model that is used for sampling
|
||||
act_model = policy(nbatch_act, 1, sess)
|
||||
|
||||
# Train model for training
|
||||
train_model = policy(nbatch_train, nsteps, sess)
|
||||
|
||||
# CREATE THE PLACEHOLDERS
|
||||
A = train_model.pdtype.sample_placeholder([None])
|
||||
ADV = tf.placeholder(tf.float32, [None])
|
||||
R = tf.placeholder(tf.float32, [None])
|
||||
# Keep track of old actor
|
||||
OLDNEGLOGPAC = tf.placeholder(tf.float32, [None])
|
||||
# Keep track of old critic
|
||||
OLDVPRED = tf.placeholder(tf.float32, [None])
|
||||
LR = tf.placeholder(tf.float32, [])
|
||||
# Cliprange
|
||||
CLIPRANGE = tf.placeholder(tf.float32, [])
|
||||
|
||||
neglogpac = train_model.pd.neglogp(A)
|
||||
|
||||
# Calculate the entropy
|
||||
# Entropy is used to improve exploration by limiting the premature convergence to suboptimal policy.
|
||||
entropy = tf.reduce_mean(train_model.pd.entropy())
|
||||
|
||||
# CALCULATE THE LOSS
|
||||
# Total loss = Policy gradient loss - entropy * entropy coefficient + Value coefficient * value loss
|
||||
|
||||
# Clip the value to reduce variability during Critic training
|
||||
# Get the predicted value
|
||||
vpred = train_model.vf
|
||||
vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)
|
||||
# Unclipped value
|
||||
vf_losses1 = tf.square(vpred - R)
|
||||
# Clipped value
|
||||
vf_losses2 = tf.square(vpredclipped - R)
|
||||
|
||||
vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))
|
||||
|
||||
# Calculate ratio (pi current policy / pi old policy)
|
||||
ratio = tf.exp(OLDNEGLOGPAC - neglogpac)
|
||||
|
||||
# Defining Loss = - J is equivalent to max J
|
||||
pg_losses = -ADV * ratio
|
||||
|
||||
pg_losses2 = -ADV * tf.clip_by_value(ratio, 1.0 - CLIPRANGE, 1.0 + CLIPRANGE)
|
||||
|
||||
# Final PG loss
|
||||
pg_loss = tf.reduce_mean(tf.maximum(pg_losses, pg_losses2))
|
||||
approxkl = .5 * tf.reduce_mean(tf.square(neglogpac - OLDNEGLOGPAC))
|
||||
clipfrac = tf.reduce_mean(tf.to_float(tf.greater(tf.abs(ratio - 1.0), CLIPRANGE)))
|
||||
|
||||
# Total loss
|
||||
loss = pg_loss - entropy * ent_coef + vf_loss * vf_coef
|
||||
|
||||
# UPDATE THE PARAMETERS USING LOSS
|
||||
# 1. Get the model parameters
|
||||
params = tf.trainable_variables('ppo2_model')
|
||||
# 2. Build our trainer
|
||||
if MPI is not None:
|
||||
trainer = MpiAdamOptimizer(MPI.COMM_WORLD, learning_rate=LR, epsilon=1e-5)
|
||||
else:
|
||||
trainer = tf.train.AdamOptimizer(learning_rate=LR, epsilon=1e-5)
|
||||
# 3. Calculate the gradients
|
||||
grads_and_var = trainer.compute_gradients(loss, params)
|
||||
grads, var = zip(*grads_and_var)
|
||||
|
||||
if max_grad_norm is not None:
|
||||
# Clip the gradients (normalize)
|
||||
grads, _grad_norm = tf.clip_by_global_norm(grads, max_grad_norm)
|
||||
grads_and_var = list(zip(grads, var))
|
||||
# zip aggregate each gradient with parameters associated
|
||||
# For instance zip(ABCD, xyza) => Ax, By, Cz, Da
|
||||
|
||||
_train = trainer.apply_gradients(grads_and_var)
|
||||
|
||||
def train(lr, cliprange, obs, returns, masks, actions, values, neglogpacs, states=None):
|
||||
# Here we calculate advantage A(s,a) = R + yV(s') - V(s)
|
||||
# Returns = R + yV(s')
|
||||
advs = returns - values
|
||||
|
||||
# Normalize the advantages
|
||||
advs = (advs - advs.mean()) / (advs.std() + 1e-8)
|
||||
td_map = {train_model.X:obs, A:actions, ADV:advs, R:returns, LR:lr,
|
||||
CLIPRANGE:cliprange, OLDNEGLOGPAC:neglogpacs, OLDVPRED:values}
|
||||
if states is not None:
|
||||
td_map[train_model.S] = states
|
||||
td_map[train_model.M] = masks
|
||||
return sess.run(
|
||||
[pg_loss, vf_loss, entropy, approxkl, clipfrac, _train],
|
||||
td_map
|
||||
)[:-1]
|
||||
self.loss_names = ['policy_loss', 'value_loss', 'policy_entropy', 'approxkl', 'clipfrac']
|
||||
|
||||
|
||||
self.train = train
|
||||
self.train_model = train_model
|
||||
self.act_model = act_model
|
||||
self.step = act_model.step
|
||||
self.value = act_model.value
|
||||
self.initial_state = act_model.initial_state
|
||||
|
||||
self.save = functools.partial(save_variables, sess=sess)
|
||||
self.load = functools.partial(load_variables, sess=sess)
|
||||
|
||||
if MPI is None or MPI.COMM_WORLD.Get_rank() == 0:
|
||||
initialize()
|
||||
global_variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="")
|
||||
|
||||
if MPI is not None:
|
||||
sync_from_root(sess, global_variables) #pylint: disable=E1101
|
||||
|
||||
class Runner(AbstractEnvRunner):
|
||||
"""
|
||||
We use this object to make a mini batch of experiences
|
||||
__init__:
|
||||
- Initialize the runner
|
||||
|
||||
run():
|
||||
- Make a mini batch
|
||||
"""
|
||||
def __init__(self, *, env, model, nsteps, gamma, lam):
|
||||
super().__init__(env=env, model=model, nsteps=nsteps)
|
||||
# Lambda used in GAE (General Advantage Estimation)
|
||||
self.lam = lam
|
||||
# Discount rate
|
||||
self.gamma = gamma
|
||||
|
||||
def run(self):
|
||||
# Here, we init the lists that will contain the mb of experiences
|
||||
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones, mb_neglogpacs = [],[],[],[],[],[]
|
||||
mb_states = self.states
|
||||
epinfos = []
|
||||
# For n in range number of steps
|
||||
for _ in range(self.nsteps):
|
||||
# Given observations, get action value and neglopacs
|
||||
# We already have self.obs because Runner superclass run self.obs[:] = env.reset() on init
|
||||
actions, values, self.states, neglogpacs = self.model.step(self.obs, S=self.states, M=self.dones)
|
||||
mb_obs.append(self.obs.copy())
|
||||
mb_actions.append(actions)
|
||||
mb_values.append(values)
|
||||
mb_neglogpacs.append(neglogpacs)
|
||||
mb_dones.append(self.dones)
|
||||
|
||||
# Take actions in env and look the results
|
||||
# Infos contains a ton of useful informations
|
||||
self.obs[:], rewards, self.dones, infos = self.env.step(actions)
|
||||
for info in infos:
|
||||
maybeepinfo = info.get('episode')
|
||||
if maybeepinfo: epinfos.append(maybeepinfo)
|
||||
mb_rewards.append(rewards)
|
||||
#batch of steps to batch of rollouts
|
||||
mb_obs = np.asarray(mb_obs, dtype=self.obs.dtype)
|
||||
mb_rewards = np.asarray(mb_rewards, dtype=np.float32)
|
||||
mb_actions = np.asarray(mb_actions)
|
||||
mb_values = np.asarray(mb_values, dtype=np.float32)
|
||||
mb_neglogpacs = np.asarray(mb_neglogpacs, dtype=np.float32)
|
||||
mb_dones = np.asarray(mb_dones, dtype=np.bool)
|
||||
last_values = self.model.value(self.obs, S=self.states, M=self.dones)
|
||||
|
||||
# discount/bootstrap off value fn
|
||||
mb_returns = np.zeros_like(mb_rewards)
|
||||
mb_advs = np.zeros_like(mb_rewards)
|
||||
lastgaelam = 0
|
||||
for t in reversed(range(self.nsteps)):
|
||||
if t == self.nsteps - 1:
|
||||
nextnonterminal = 1.0 - self.dones
|
||||
nextvalues = last_values
|
||||
else:
|
||||
nextnonterminal = 1.0 - mb_dones[t+1]
|
||||
nextvalues = mb_values[t+1]
|
||||
delta = mb_rewards[t] + self.gamma * nextvalues * nextnonterminal - mb_values[t]
|
||||
mb_advs[t] = lastgaelam = delta + self.gamma * self.lam * nextnonterminal * lastgaelam
|
||||
mb_returns = mb_advs + mb_values
|
||||
return (*map(sf01, (mb_obs, mb_returns, mb_dones, mb_actions, mb_values, mb_neglogpacs)),
|
||||
mb_states, epinfos)
|
||||
# obs, returns, masks, actions, values, neglogpacs, states = runner.run()
|
||||
def sf01(arr):
|
||||
"""
|
||||
swap and then flatten axes 0 and 1
|
||||
"""
|
||||
s = arr.shape
|
||||
return arr.swapaxes(0, 1).reshape(s[0] * s[1], *s[2:])
|
||||
|
||||
def constfn(val):
|
||||
def f(_):
|
||||
@@ -230,7 +21,7 @@ def constfn(val):
|
||||
def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2048, ent_coef=0.0, lr=3e-4,
|
||||
vf_coef=0.5, max_grad_norm=0.5, gamma=0.99, lam=0.95,
|
||||
log_interval=10, nminibatches=4, noptepochs=4, cliprange=0.2,
|
||||
save_interval=0, load_path=None, **network_kwargs):
|
||||
save_interval=0, load_path=None, model_fn=None, update_fn=None, init_fn=None, mpi_rank_weight=1, comm=None, **network_kwargs):
|
||||
'''
|
||||
Learn policy using PPO algorithm (https://arxiv.org/abs/1707.06347)
|
||||
|
||||
@@ -306,12 +97,17 @@ def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2
|
||||
# Calculate the batch_size
|
||||
nbatch = nenvs * nsteps
|
||||
nbatch_train = nbatch // nminibatches
|
||||
is_mpi_root = (MPI is None or MPI.COMM_WORLD.Get_rank() == 0)
|
||||
|
||||
# Instantiate the model object (that creates act_model and train_model)
|
||||
make_model = lambda : Model(policy=policy, ob_space=ob_space, ac_space=ac_space, nbatch_act=nenvs, nbatch_train=nbatch_train,
|
||||
if model_fn is None:
|
||||
from baselines.ppo2.model import Model
|
||||
model_fn = Model
|
||||
|
||||
model = model_fn(policy=policy, ob_space=ob_space, ac_space=ac_space, nbatch_act=nenvs, nbatch_train=nbatch_train,
|
||||
nsteps=nsteps, ent_coef=ent_coef, vf_coef=vf_coef,
|
||||
max_grad_norm=max_grad_norm)
|
||||
model = make_model()
|
||||
max_grad_norm=max_grad_norm, comm=comm, mpi_rank_weight=mpi_rank_weight)
|
||||
|
||||
if load_path is not None:
|
||||
model.load(load_path)
|
||||
# Instantiate the runner object
|
||||
@@ -319,30 +115,36 @@ def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2
|
||||
if eval_env is not None:
|
||||
eval_runner = Runner(env = eval_env, model = model, nsteps = nsteps, gamma = gamma, lam= lam)
|
||||
|
||||
|
||||
|
||||
epinfobuf = deque(maxlen=100)
|
||||
if eval_env is not None:
|
||||
eval_epinfobuf = deque(maxlen=100)
|
||||
|
||||
if init_fn is not None:
|
||||
init_fn()
|
||||
|
||||
# Start total timer
|
||||
tfirststart = time.time()
|
||||
tfirststart = time.perf_counter()
|
||||
|
||||
nupdates = total_timesteps//nbatch
|
||||
for update in range(1, nupdates+1):
|
||||
assert nbatch % nminibatches == 0
|
||||
# Start timer
|
||||
tstart = time.time()
|
||||
tstart = time.perf_counter()
|
||||
frac = 1.0 - (update - 1.0) / nupdates
|
||||
# Calculate the learning rate
|
||||
lrnow = lr(frac)
|
||||
# Calculate the cliprange
|
||||
cliprangenow = cliprange(frac)
|
||||
|
||||
if update % log_interval == 0 and is_mpi_root: logger.info('Stepping environment...')
|
||||
|
||||
# Get minibatch
|
||||
obs, returns, masks, actions, values, neglogpacs, states, epinfos = runner.run() #pylint: disable=E0632
|
||||
if eval_env is not None:
|
||||
eval_obs, eval_returns, eval_masks, eval_actions, eval_values, eval_neglogpacs, eval_states, eval_epinfos = eval_runner.run() #pylint: disable=E0632
|
||||
|
||||
if update % log_interval == 0 and is_mpi_root: logger.info('Done.')
|
||||
|
||||
epinfobuf.extend(epinfos)
|
||||
if eval_env is not None:
|
||||
eval_epinfobuf.extend(eval_epinfos)
|
||||
@@ -367,7 +169,6 @@ def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2
|
||||
envsperbatch = nenvs // nminibatches
|
||||
envinds = np.arange(nenvs)
|
||||
flatinds = np.arange(nenvs * nsteps).reshape(nenvs, nsteps)
|
||||
envsperbatch = nbatch_train // nsteps
|
||||
for _ in range(noptepochs):
|
||||
np.random.shuffle(envinds)
|
||||
for start in range(0, nenvs, envsperbatch):
|
||||
@@ -381,34 +182,39 @@ def learn(*, network, env, total_timesteps, eval_env = None, seed=None, nsteps=2
|
||||
# Feedforward --> get losses --> update
|
||||
lossvals = np.mean(mblossvals, axis=0)
|
||||
# End timer
|
||||
tnow = time.time()
|
||||
tnow = time.perf_counter()
|
||||
# Calculate the fps (frame per second)
|
||||
fps = int(nbatch / (tnow - tstart))
|
||||
|
||||
if update_fn is not None:
|
||||
update_fn(update)
|
||||
|
||||
if update % log_interval == 0 or update == 1:
|
||||
# Calculates if value function is a good predicator of the returns (ev > 1)
|
||||
# or if it's just worse than predicting nothing (ev =< 0)
|
||||
ev = explained_variance(values, returns)
|
||||
logger.logkv("serial_timesteps", update*nsteps)
|
||||
logger.logkv("nupdates", update)
|
||||
logger.logkv("total_timesteps", update*nbatch)
|
||||
logger.logkv("misc/serial_timesteps", update*nsteps)
|
||||
logger.logkv("misc/nupdates", update)
|
||||
logger.logkv("misc/total_timesteps", update*nbatch)
|
||||
logger.logkv("fps", fps)
|
||||
logger.logkv("explained_variance", float(ev))
|
||||
logger.logkv("misc/explained_variance", float(ev))
|
||||
logger.logkv('eprewmean', safemean([epinfo['r'] for epinfo in epinfobuf]))
|
||||
logger.logkv('eplenmean', safemean([epinfo['l'] for epinfo in epinfobuf]))
|
||||
if eval_env is not None:
|
||||
logger.logkv('eval_eprewmean', safemean([epinfo['r'] for epinfo in eval_epinfobuf]) )
|
||||
logger.logkv('eval_eplenmean', safemean([epinfo['l'] for epinfo in eval_epinfobuf]) )
|
||||
logger.logkv('time_elapsed', tnow - tfirststart)
|
||||
logger.logkv('misc/time_elapsed', tnow - tfirststart)
|
||||
for (lossval, lossname) in zip(lossvals, model.loss_names):
|
||||
logger.logkv(lossname, lossval)
|
||||
if MPI is None or MPI.COMM_WORLD.Get_rank() == 0:
|
||||
logger.dumpkvs()
|
||||
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir() and (MPI is None or MPI.COMM_WORLD.Get_rank() == 0):
|
||||
logger.logkv('loss/' + lossname, lossval)
|
||||
|
||||
logger.dumpkvs()
|
||||
if save_interval and (update % save_interval == 0 or update == 1) and logger.get_dir() and is_mpi_root:
|
||||
checkdir = osp.join(logger.get_dir(), 'checkpoints')
|
||||
os.makedirs(checkdir, exist_ok=True)
|
||||
savepath = osp.join(checkdir, '%.5i'%update)
|
||||
print('Saving to', savepath)
|
||||
model.save(savepath)
|
||||
|
||||
return model
|
||||
# Avoid division error when calculate the mean (in our case if epinfo is empty returns np.nan, not return an error)
|
||||
def safemean(xs):
|
||||
|
76
baselines/ppo2/runner.py
Normal file
76
baselines/ppo2/runner.py
Normal file
@@ -0,0 +1,76 @@
|
||||
import numpy as np
|
||||
from baselines.common.runners import AbstractEnvRunner
|
||||
|
||||
class Runner(AbstractEnvRunner):
|
||||
"""
|
||||
We use this object to make a mini batch of experiences
|
||||
__init__:
|
||||
- Initialize the runner
|
||||
|
||||
run():
|
||||
- Make a mini batch
|
||||
"""
|
||||
def __init__(self, *, env, model, nsteps, gamma, lam):
|
||||
super().__init__(env=env, model=model, nsteps=nsteps)
|
||||
# Lambda used in GAE (General Advantage Estimation)
|
||||
self.lam = lam
|
||||
# Discount rate
|
||||
self.gamma = gamma
|
||||
|
||||
def run(self):
|
||||
# Here, we init the lists that will contain the mb of experiences
|
||||
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones, mb_neglogpacs = [],[],[],[],[],[]
|
||||
mb_states = self.states
|
||||
epinfos = []
|
||||
# For n in range number of steps
|
||||
for _ in range(self.nsteps):
|
||||
# Given observations, get action value and neglopacs
|
||||
# We already have self.obs because Runner superclass run self.obs[:] = env.reset() on init
|
||||
actions, values, self.states, neglogpacs = self.model.step(self.obs, S=self.states, M=self.dones)
|
||||
mb_obs.append(self.obs.copy())
|
||||
mb_actions.append(actions)
|
||||
mb_values.append(values)
|
||||
mb_neglogpacs.append(neglogpacs)
|
||||
mb_dones.append(self.dones)
|
||||
|
||||
# Take actions in env and look the results
|
||||
# Infos contains a ton of useful informations
|
||||
self.obs[:], rewards, self.dones, infos = self.env.step(actions)
|
||||
for info in infos:
|
||||
maybeepinfo = info.get('episode')
|
||||
if maybeepinfo: epinfos.append(maybeepinfo)
|
||||
mb_rewards.append(rewards)
|
||||
#batch of steps to batch of rollouts
|
||||
mb_obs = np.asarray(mb_obs, dtype=self.obs.dtype)
|
||||
mb_rewards = np.asarray(mb_rewards, dtype=np.float32)
|
||||
mb_actions = np.asarray(mb_actions)
|
||||
mb_values = np.asarray(mb_values, dtype=np.float32)
|
||||
mb_neglogpacs = np.asarray(mb_neglogpacs, dtype=np.float32)
|
||||
mb_dones = np.asarray(mb_dones, dtype=np.bool)
|
||||
last_values = self.model.value(self.obs, S=self.states, M=self.dones)
|
||||
|
||||
# discount/bootstrap off value fn
|
||||
mb_returns = np.zeros_like(mb_rewards)
|
||||
mb_advs = np.zeros_like(mb_rewards)
|
||||
lastgaelam = 0
|
||||
for t in reversed(range(self.nsteps)):
|
||||
if t == self.nsteps - 1:
|
||||
nextnonterminal = 1.0 - self.dones
|
||||
nextvalues = last_values
|
||||
else:
|
||||
nextnonterminal = 1.0 - mb_dones[t+1]
|
||||
nextvalues = mb_values[t+1]
|
||||
delta = mb_rewards[t] + self.gamma * nextvalues * nextnonterminal - mb_values[t]
|
||||
mb_advs[t] = lastgaelam = delta + self.gamma * self.lam * nextnonterminal * lastgaelam
|
||||
mb_returns = mb_advs + mb_values
|
||||
return (*map(sf01, (mb_obs, mb_returns, mb_dones, mb_actions, mb_values, mb_neglogpacs)),
|
||||
mb_states, epinfos)
|
||||
# obs, returns, masks, actions, values, neglogpacs, states = runner.run()
|
||||
def sf01(arr):
|
||||
"""
|
||||
swap and then flatten axes 0 and 1
|
||||
"""
|
||||
s = arr.shape
|
||||
return arr.swapaxes(0, 1).reshape(s[0] * s[1], *s[2:])
|
||||
|
||||
|
35
baselines/ppo2/test_microbatches.py
Normal file
35
baselines/ppo2/test_microbatches.py
Normal file
@@ -0,0 +1,35 @@
|
||||
import gym
|
||||
import tensorflow as tf
|
||||
import numpy as np
|
||||
from functools import partial
|
||||
|
||||
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
|
||||
from baselines.common.tf_util import make_session
|
||||
from baselines.ppo2.ppo2 import learn
|
||||
|
||||
from baselines.ppo2.microbatched_model import MicrobatchedModel
|
||||
|
||||
def test_microbatches():
|
||||
def env_fn():
|
||||
env = gym.make('CartPole-v0')
|
||||
env.seed(0)
|
||||
return env
|
||||
|
||||
learn_fn = partial(learn, network='mlp', nsteps=32, total_timesteps=32, seed=0)
|
||||
|
||||
env_ref = DummyVecEnv([env_fn])
|
||||
sess_ref = make_session(make_default=True, graph=tf.Graph())
|
||||
learn_fn(env=env_ref)
|
||||
vars_ref = {v.name: sess_ref.run(v) for v in tf.trainable_variables()}
|
||||
|
||||
env_test = DummyVecEnv([env_fn])
|
||||
sess_test = make_session(make_default=True, graph=tf.Graph())
|
||||
learn_fn(env=env_test, model_fn=partial(MicrobatchedModel, microbatch_size=2))
|
||||
# learn_fn(env=env_test)
|
||||
vars_test = {v.name: sess_test.run(v) for v in tf.trainable_variables()}
|
||||
|
||||
for v in vars_ref:
|
||||
np.testing.assert_allclose(vars_ref[v], vars_test[v], atol=3e-3)
|
||||
|
||||
if __name__ == '__main__':
|
||||
test_microbatches()
|
@@ -5,7 +5,7 @@ matplotlib.use('TkAgg') # Can change to 'Agg' for non-interactive mode
|
||||
import matplotlib.pyplot as plt
|
||||
plt.rcParams['svg.fonttype'] = 'none'
|
||||
|
||||
from baselines.bench.monitor import load_results
|
||||
from baselines.common import plot_util
|
||||
|
||||
X_TIMESTEPS = 'timesteps'
|
||||
X_EPISODES = 'episodes'
|
||||
@@ -16,7 +16,7 @@ POSSIBLE_X_AXES = [X_TIMESTEPS, X_EPISODES, X_WALLTIME]
|
||||
EPISODES_WINDOW = 100
|
||||
COLORS = ['blue', 'green', 'red', 'cyan', 'magenta', 'yellow', 'black', 'purple', 'pink',
|
||||
'brown', 'orange', 'teal', 'coral', 'lightblue', 'lime', 'lavender', 'turquoise',
|
||||
'darkgreen', 'tan', 'salmon', 'gold', 'lightpurple', 'darkred', 'darkblue']
|
||||
'darkgreen', 'tan', 'salmon', 'gold', 'darkred', 'darkblue']
|
||||
|
||||
def rolling_window(a, window):
|
||||
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
|
||||
@@ -50,7 +50,7 @@ def plot_curves(xy_list, xaxis, yaxis, title):
|
||||
maxx = max(xy[0][-1] for xy in xy_list)
|
||||
minx = 0
|
||||
for (i, (x, y)) in enumerate(xy_list):
|
||||
color = COLORS[i]
|
||||
color = COLORS[i % len(COLORS)]
|
||||
plt.scatter(x, y, s=2)
|
||||
x, y_mean = window_func(x, y, EPISODES_WINDOW, np.mean) #So returns average of last EPISODE_WINDOW episodes
|
||||
plt.plot(x, y_mean, color=color)
|
||||
@@ -62,19 +62,18 @@ def plot_curves(xy_list, xaxis, yaxis, title):
|
||||
fig.canvas.mpl_connect('resize_event', lambda event: plt.tight_layout())
|
||||
plt.grid(True)
|
||||
|
||||
def plot_results(dirs, num_timesteps, xaxis, yaxis, task_name):
|
||||
tslist = []
|
||||
for dir in dirs:
|
||||
ts = load_results(dir)
|
||||
ts = ts[ts.l.cumsum() <= num_timesteps]
|
||||
tslist.append(ts)
|
||||
xy_list = [ts2xy(ts, xaxis, yaxis) for ts in tslist]
|
||||
plot_curves(xy_list, xaxis, yaxis, task_name)
|
||||
|
||||
def split_by_task(taskpath):
|
||||
return taskpath['dirname'].split('/')[-1].split('-')[0]
|
||||
|
||||
def plot_results(dirs, num_timesteps=10e6, xaxis=X_TIMESTEPS, yaxis=Y_REWARD, title='', split_fn=split_by_task):
|
||||
results = plot_util.load_results(dirs)
|
||||
plot_util.plot_results(results, xy_fn=lambda r: ts2xy(r['monitor'], xaxis, yaxis), split_fn=split_fn, average_group=True, resample=int(1e6))
|
||||
|
||||
# Example usage in jupyter-notebook
|
||||
# from baselines import log_viewer
|
||||
# from baselines.results_plotter import plot_results
|
||||
# %matplotlib inline
|
||||
# log_viewer.plot_results(["./log"], 10e6, log_viewer.X_TIMESTEPS, "Breakout")
|
||||
# plot_results("./log")
|
||||
# Here ./log is a directory containing the monitor.csv files
|
||||
|
||||
def main():
|
||||
|
@@ -1,4 +1,5 @@
|
||||
import sys
|
||||
import re
|
||||
import multiprocessing
|
||||
import os.path as osp
|
||||
import gym
|
||||
@@ -6,15 +7,13 @@ from collections import defaultdict
|
||||
import tensorflow as tf
|
||||
import numpy as np
|
||||
|
||||
from baselines.common.vec_env import VecFrameStack, VecNormalize, VecEnv
|
||||
from baselines.common.vec_env.vec_video_recorder import VecVideoRecorder
|
||||
from baselines.common.vec_env.vec_frame_stack import VecFrameStack
|
||||
from baselines.common.cmd_util import common_arg_parser, parse_unknown_args, make_vec_env, make_env
|
||||
from baselines.common.tf_util import get_session
|
||||
from baselines import logger
|
||||
from importlib import import_module
|
||||
|
||||
from baselines.common.vec_env.vec_normalize import VecNormalize
|
||||
|
||||
try:
|
||||
from mpi4py import MPI
|
||||
except ImportError:
|
||||
@@ -33,7 +32,7 @@ except ImportError:
|
||||
_game_envs = defaultdict(set)
|
||||
for env in gym.envs.registry.all():
|
||||
# TODO: solve this with regexes
|
||||
env_type = env._entry_point.split(':')[0].split('.')[-1]
|
||||
env_type = env.entry_point.split(':')[0].split('.')[-1]
|
||||
_game_envs[env_type].add(env.id)
|
||||
|
||||
# reading benchmark names directly from retro requires
|
||||
@@ -52,7 +51,7 @@ _game_envs['retro'] = {
|
||||
|
||||
|
||||
def train(args, extra_args):
|
||||
env_type, env_id = get_env_type(args.env)
|
||||
env_type, env_id = get_env_type(args)
|
||||
print('env_type: {}'.format(env_type))
|
||||
|
||||
total_timesteps = int(args.num_timesteps)
|
||||
@@ -64,7 +63,7 @@ def train(args, extra_args):
|
||||
|
||||
env = build_env(args)
|
||||
if args.save_video_interval != 0:
|
||||
env = VecVideoRecorder(env, osp.join(logger.Logger.CURRENT.dir, "videos"), record_video_trigger=lambda x: x % args.save_video_interval == 0, video_length=args.save_video_length)
|
||||
env = VecVideoRecorder(env, osp.join(logger.get_dir(), "videos"), record_video_trigger=lambda x: x % args.save_video_interval == 0, video_length=args.save_video_length)
|
||||
|
||||
if args.network:
|
||||
alg_kwargs['network'] = args.network
|
||||
@@ -91,7 +90,7 @@ def build_env(args):
|
||||
alg = args.alg
|
||||
seed = args.seed
|
||||
|
||||
env_type, env_id = get_env_type(args.env)
|
||||
env_type, env_id = get_env_type(args)
|
||||
|
||||
if env_type in {'atari', 'retro'}:
|
||||
if alg == 'deepq':
|
||||
@@ -104,21 +103,32 @@ def build_env(args):
|
||||
env = VecFrameStack(env, frame_stack_size)
|
||||
|
||||
else:
|
||||
config = tf.ConfigProto(allow_soft_placement=True,
|
||||
config = tf.ConfigProto(allow_soft_placement=True,
|
||||
intra_op_parallelism_threads=1,
|
||||
inter_op_parallelism_threads=1)
|
||||
config.gpu_options.allow_growth = True
|
||||
get_session(config=config)
|
||||
config.gpu_options.allow_growth = True
|
||||
get_session(config=config)
|
||||
|
||||
env = make_vec_env(env_id, env_type, args.num_env or 1, seed, reward_scale=args.reward_scale)
|
||||
flatten_dict_observations = alg not in {'her'}
|
||||
env = make_vec_env(env_id, env_type, args.num_env or 1, seed, reward_scale=args.reward_scale, flatten_dict_observations=flatten_dict_observations)
|
||||
|
||||
if env_type == 'mujoco':
|
||||
env = VecNormalize(env)
|
||||
if env_type == 'mujoco':
|
||||
env = VecNormalize(env, use_tf=True)
|
||||
|
||||
return env
|
||||
|
||||
|
||||
def get_env_type(env_id):
|
||||
def get_env_type(args):
|
||||
env_id = args.env
|
||||
|
||||
if args.env_type is not None:
|
||||
return args.env_type, env_id
|
||||
|
||||
# Re-parse the gym registry, since we could have new envs since last time.
|
||||
for env in gym.envs.registry.all():
|
||||
env_type = env.entry_point.split(':')[0].split('.')[-1]
|
||||
_game_envs[env_type].add(env.id) # This is a set so add is idempotent
|
||||
|
||||
if env_id in _game_envs.keys():
|
||||
env_type = env_id
|
||||
env_id = [g for g in _game_envs[env_type]][0]
|
||||
@@ -128,6 +138,8 @@ def get_env_type(env_id):
|
||||
if env_id in e:
|
||||
env_type = g
|
||||
break
|
||||
if ':' in env_id:
|
||||
env_type = re.sub(r':.*', '', env_id)
|
||||
assert env_type is not None, 'env_id {} is not recognized in env types'.format(env_id, _game_envs.keys())
|
||||
|
||||
return env_type, env_id
|
||||
@@ -180,23 +192,28 @@ def parse_cmdline_kwargs(args):
|
||||
return {k: parse(v) for k,v in parse_unknown_args(args).items()}
|
||||
|
||||
|
||||
def configure_logger(log_path, **kwargs):
|
||||
if log_path is not None:
|
||||
logger.configure(log_path)
|
||||
else:
|
||||
logger.configure(**kwargs)
|
||||
|
||||
def main():
|
||||
|
||||
def main(args):
|
||||
# configure logger, disable logging in child MPI processes (with rank > 0)
|
||||
|
||||
arg_parser = common_arg_parser()
|
||||
args, unknown_args = arg_parser.parse_known_args()
|
||||
args, unknown_args = arg_parser.parse_known_args(args)
|
||||
extra_args = parse_cmdline_kwargs(unknown_args)
|
||||
|
||||
if MPI is None or MPI.COMM_WORLD.Get_rank() == 0:
|
||||
rank = 0
|
||||
logger.configure()
|
||||
configure_logger(args.log_path)
|
||||
else:
|
||||
logger.configure(format_strs=[])
|
||||
rank = MPI.COMM_WORLD.Get_rank()
|
||||
configure_logger(args.log_path, format_strs=[])
|
||||
|
||||
model, env = train(args, extra_args)
|
||||
env.close()
|
||||
|
||||
if args.save_path is not None and rank == 0:
|
||||
save_path = osp.expanduser(args.save_path)
|
||||
@@ -204,21 +221,30 @@ def main():
|
||||
|
||||
if args.play:
|
||||
logger.log("Running trained model")
|
||||
env = build_env(args)
|
||||
obs = env.reset()
|
||||
def initialize_placeholders(nlstm=128,**kwargs):
|
||||
return np.zeros((args.num_env or 1, 2*nlstm)), np.zeros((1))
|
||||
state, dones = initialize_placeholders(**extra_args)
|
||||
|
||||
state = model.initial_state if hasattr(model, 'initial_state') else None
|
||||
dones = np.zeros((1,))
|
||||
|
||||
episode_rew = 0
|
||||
while True:
|
||||
actions, _, state, _ = model.step(obs,S=state, M=dones)
|
||||
obs, _, done, _ = env.step(actions)
|
||||
if state is not None:
|
||||
actions, _, state, _ = model.step(obs,S=state, M=dones)
|
||||
else:
|
||||
actions, _, _, _ = model.step(obs)
|
||||
|
||||
obs, rew, done, _ = env.step(actions)
|
||||
episode_rew += rew[0] if isinstance(env, VecEnv) else rew
|
||||
env.render()
|
||||
done = done.any() if isinstance(done, np.ndarray) else done
|
||||
|
||||
if done:
|
||||
print('episode_rew={}'.format(episode_rew))
|
||||
episode_rew = 0
|
||||
obs = env.reset()
|
||||
|
||||
env.close()
|
||||
env.close()
|
||||
|
||||
return model
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
main(sys.argv)
|
||||
|
@@ -120,7 +120,7 @@
|
||||
|
||||
<td>114.26</td>
|
||||
|
||||
<td>cbd21ef</td>
|
||||
<td><a href=https://github.com/openai/baselines/commit/7bfbcf177eca8f46c0c0bfbb378e044539f5e061>7bfbcf1</a></td>
|
||||
|
||||
</tr>
|
||||
|
||||
@@ -152,7 +152,7 @@
|
||||
|
||||
<td>131.46</td>
|
||||
|
||||
<td>cbd21ef</td>
|
||||
<td><a href=https://github.com/openai/baselines/commit/7bfbcf177eca8f46c0c0bfbb378e044539f5e061>7bfbcf1</a></td>
|
||||
|
||||
</tr>
|
||||
|
||||
@@ -184,7 +184,7 @@
|
||||
|
||||
<td>113.58</td>
|
||||
|
||||
<td>cbd21ef</td>
|
||||
<td><a href=https://github.com/openai/baselines/commit/7bfbcf177eca8f46c0c0bfbb378e044539f5e061>7bfbcf1</a></td>
|
||||
|
||||
</tr>
|
||||
|
||||
@@ -216,7 +216,7 @@
|
||||
|
||||
<td>82.94</td>
|
||||
|
||||
<td>cbd21ef</td>
|
||||
<td><a href=https://github.com/openai/baselines/commit/7bfbcf177eca8f46c0c0bfbb378e044539f5e061>7bfbcf1</a></td>
|
||||
|
||||
</tr>
|
||||
|
||||
@@ -248,7 +248,7 @@
|
||||
|
||||
<td>81.61</td>
|
||||
|
||||
<td>cbd21ef</td>
|
||||
<td><a href=https://github.com/openai/baselines/commit/7bfbcf177eca8f46c0c0bfbb378e044539f5e061>7bfbcf1</a></td>
|
||||
|
||||
</tr>
|
||||
|
||||
@@ -280,7 +280,7 @@
|
||||
|
||||
<td>59.72</td>
|
||||
|
||||
<td>cbd21ef</td>
|
||||
<td><a href=https://github.com/openai/baselines/commit/7bfbcf177eca8f46c0c0bfbb378e044539f5e061>7bfbcf1</a></td>
|
||||
|
||||
</tr>
|
||||
|
||||
@@ -312,7 +312,7 @@
|
||||
|
||||
<td>14.98</td>
|
||||
|
||||
<td>cbd21ef</td>
|
||||
<td><a href=https://github.com/openai/baselines/commit/7bfbcf177eca8f46c0c0bfbb378e044539f5e061>7bfbcf1</a></td>
|
||||
|
||||
</tr>
|
||||
|
||||
|
@@ -7,7 +7,7 @@
|
||||
"id": "Ynb-laSwmpac"
|
||||
},
|
||||
"source": [
|
||||
"# Loading and visualizing results ([open in colab](https://colab.research.google.com/github/openai/baselines/blob/master/docs/viz.ipynb))\n",
|
||||
"# Loading and visualizing results ([open in colab](https://colab.research.google.com/github/openai/baselines/blob/master/docs/viz/viz.ipynb))\n",
|
||||
"In order to compare performance of algorithms, we often would like to visualize learning curves (reward as a function of time steps), or some other auxiliary information about learning aggregated into a plot. Baselines repo provides tools for doing so in several different ways, depending on the goal."
|
||||
]
|
||||
},
|
||||
|
@@ -3,6 +3,4 @@ select = F,E999,W291,W293
|
||||
exclude =
|
||||
.git,
|
||||
__pycache__,
|
||||
baselines/her,
|
||||
baselines/ppo1,
|
||||
baselines/bench,
|
||||
|
19
setup.py
19
setup.py
@@ -12,10 +12,9 @@ extras = {
|
||||
'filelock',
|
||||
'pytest',
|
||||
'pytest-forked',
|
||||
'atari-py'
|
||||
],
|
||||
'bullet': [
|
||||
'pybullet',
|
||||
'atari-py',
|
||||
'matplotlib',
|
||||
'pandas'
|
||||
],
|
||||
'mpi': [
|
||||
'mpi4py'
|
||||
@@ -32,12 +31,10 @@ setup(name='baselines',
|
||||
packages=[package for package in find_packages()
|
||||
if package.startswith('baselines')],
|
||||
install_requires=[
|
||||
'gym',
|
||||
'gym>=0.10.0, <1.0.0',
|
||||
'scipy',
|
||||
'tqdm',
|
||||
'joblib',
|
||||
'dill',
|
||||
'progressbar2',
|
||||
'cloudpickle',
|
||||
'click',
|
||||
'opencv-python'
|
||||
@@ -47,17 +44,17 @@ setup(name='baselines',
|
||||
author='OpenAI',
|
||||
url='https://github.com/openai/baselines',
|
||||
author_email='gym@openai.com',
|
||||
version='0.1.5')
|
||||
version='0.1.6')
|
||||
|
||||
|
||||
# ensure there is some tensorflow build with version above 1.4
|
||||
import pkg_resources
|
||||
tf_pkg = None
|
||||
for tf_pkg_name in ['tensorflow', 'tensorflow-gpu']:
|
||||
for tf_pkg_name in ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-gpu']:
|
||||
try:
|
||||
tf_pkg = pkg_resources.get_distribution(tf_pkg_name)
|
||||
except pkg_resources.DistributionNotFound:
|
||||
pass
|
||||
assert tf_pkg is not None, 'TensorFlow needed, of version above 1.4'
|
||||
from distutils.version import StrictVersion
|
||||
assert StrictVersion(re.sub(r'-?rc\d+$', '', tf_pkg.version)) >= StrictVersion('1.4.0')
|
||||
from distutils.version import LooseVersion
|
||||
assert LooseVersion(re.sub(r'-?rc\d+$', '', tf_pkg.version)) >= LooseVersion('1.4.0')
|
||||
|
Reference in New Issue
Block a user