updated README files and deepq.train_cartpole example

Merge branch 'master' of github.com:openai/baselines into peterz_update_READMEs
adding a link to repo-wide README
2018-08-16 13:15:51 -07:00 · 2018-08-16 12:26:51 -07:00 · 2018-08-16 12:23:06 -07:00 · 2018-08-16 12:18:06 -07:00 · 2018-08-16 12:08:53 -07:00
10 changed files with 36 additions and 37 deletions
--- a/baselines/a2c/README.md
+++ b/baselines/a2c/README.md
@@ -2,4 +2,5 @@

 - Original paper: https://arxiv.org/abs/1602.01783
 - Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.a2c.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
+- `python -m baselines.run --alg=a2c --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options
+- also refer to the repo-wide [README.md](../../README.md#training-models)
--- a/baselines/acer/README.md
+++ b/baselines/acer/README.md
@@ -1,4 +1,6 @@
 # ACER

 - Original paper: https://arxiv.org/abs/1611.01224
- `python -m baselines.acer.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
+- `python -m baselines.run --alg=acer --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
+- also refer to the repo-wide [README.md](../../README.md#training-models)
+
--- a/baselines/acktr/README.md
+++ b/baselines/acktr/README.md
@@ -2,4 +2,7 @@

 - Original paper: https://arxiv.org/abs/1708.05144
 - Baselines blog post: https://blog.openai.com/baselines-acktr-a2c/
- `python -m baselines.acktr.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
+- `python -m baselines.run --alg=acktr --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
+- also refer to the repo-wide [README.md](../../README.md#training-models)
+
+
--- a/baselines/deepq/README.md
+++ b/baselines/deepq/README.md
@@ -9,44 +9,29 @@ Here's a list of commands to run to quickly get a working example:

 ```bash
 # Train model and save the results to cartpole_model.pkl
-python -m baselines.deepq.experiments.train_cartpole
+python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5
 # Load the model saved in cartpole_model.pkl and visualize the learned policy
-python -m baselines.deepq.experiments.enjoy_cartpole
+python -m baselines.run --alg=deepq --env=CartPole-v0 --load_apth=./cartpole_model.pkl --num_timesteps=0 --play
 ```

-
-Be sure to check out the source code of [both](experiments/train_cartpole.py) [files](experiments/enjoy_cartpole.py)!
-
 ## If you wish to apply DQN to solve a problem.

 Check out our simple agent trained with one stop shop `deepq.learn` function. 

 - [baselines/deepq/experiments/train_cartpole.py](experiments/train_cartpole.py) - train a Cartpole agent.
- [baselines/deepq/experiments/train_pong.py](experiments/train_pong.py) - train a Pong agent using convolutional neural networks.

-In particular notice that once `deepq.learn` finishes training it returns `act` function which can be used to select actions in the environment. Once trained you can easily save it and load at later time. For both of the files listed above there are complimentary files `enjoy_cartpole.py` and `enjoy_pong.py` respectively, that load and visualize the learned policy.
+In particular notice that once `deepq.learn` finishes training it returns `act` function which can be used to select actions in the environment. Once trained you can easily save it and load at later time. Complimentary file `enjoy_cartpole.py` loads and visualizes the learned policy.

 ## If you wish to experiment with the algorithm

 ##### Check out the examples

-
 - [baselines/deepq/experiments/custom_cartpole.py](experiments/custom_cartpole.py) - Cartpole training with more fine grained control over the internals of DQN algorithm.
- [baselines/deepq/experiments/run_atari.py](experiments/run_atari.py) - more robust setup for training at scale.
-
-
-##### Download a pretrained Atari agent
-
-For some research projects it is sometimes useful to have an already trained agent handy. There's a variety of models to choose from. You can list them all by running:
+- [baselines/deepq/defaults.py](defaults.py) - settings for training on atari. Run 

 ```bash
-python -m baselines.deepq.experiments.atari.download_model
+python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 
 ```
+to train on Atari Pong (see more in repo-wide [README.md](../../README.md#training-models))

-Once you pick a model, you can download it and visualize the learned policy. Be sure to pass `--dueling` flag to visualization script when using dueling models.

-```bash
-python -m baselines.deepq.experiments.atari.download_model --blob model-atari-duel-pong-1 --model-dir /tmp/models
-python -m baselines.deepq.experiments.atari.enjoy --model-dir /tmp/models/model-atari-duel-pong-1 --env Pong --dueling
-
-```
--- a/baselines/deepq/build_graph.py
+++ b/baselines/deepq/build_graph.py
@@ -309,7 +309,7 @@ def build_act_with_param_noise(make_obs_ph, q_func, num_actions, scope="deepq",
                         outputs=output_actions,
                         givens={update_eps_ph: -1.0, stochastic_ph: True, reset_ph: False, update_param_noise_threshold_ph: False, update_param_noise_scale_ph: False},
                         updates=updates)
-        def act(ob, reset, update_param_noise_threshold, update_param_noise_scale, stochastic=True, update_eps=-1):
+        def act(ob, reset=False, update_param_noise_threshold=False, update_param_noise_scale=False, stochastic=True, update_eps=-1):
            return _act(ob, stochastic, update_eps, reset, update_param_noise_threshold, update_param_noise_scale)
        return act

--- a/baselines/deepq/deepq.py
+++ b/baselines/deepq/deepq.py
@@ -27,7 +27,7 @@ class ActWrapper(object):
        self.initial_state = None

    @staticmethod
-    def load_act(self, path):
+    def load_act(path):
        with open(path, "rb") as f:
            model_data, act_params = cloudpickle.load(f)
        act = deepq.build_act(**act_params)
@@ -70,6 +70,7 @@ class ActWrapper(object):

    def save(self, path):
        save_state(path)
+        self.save_act(path+".pickle")


 def load_act(path):
@@ -194,8 +195,9 @@ def learn(env,
    # capture the shape outside the closure so that the env object is not serialized
    # by cloudpickle when serializing make_obs_ph

+    observation_space = env.observation_space
    def make_obs_ph(name):
-        return ObservationInput(env.observation_space, name=name)
+        return ObservationInput(observation_space, name=name)

    act, train, update_target, debug = deepq.build_train(
        make_obs_ph=make_obs_ph,
--- a/baselines/deepq/experiments/train_cartpole.py
+++ b/baselines/deepq/experiments/train_cartpole.py
@@ -11,12 +11,11 @@ def callback(lcl, _glb):

 def main():
    env = gym.make("CartPole-v0")
-    model = deepq.models.mlp([64])
    act = deepq.learn(
        env,
-        q_func=model,
+        network='mlp',
        lr=1e-3,
-        max_timesteps=100000,
+        total_timesteps=100000,
        buffer_size=50000,
        exploration_fraction=0.1,
        exploration_final_eps=0.02,
--- a/baselines/ppo2/README.md
+++ b/baselines/ppo2/README.md
@@ -2,5 +2,7 @@

 - Original paper: https://arxiv.org/abs/1707.06347
 - Baselines blog post: https://blog.openai.com/openai-baselines-ppo/
- `python -m baselines.ppo2.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.ppo2.run_mujoco` runs the algorithm for 1M frames on a Mujoco environment.
+
+- `python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
+- `python -m baselines.run --alg=ppo2 --env=Ant-v2 --num_timesteps=1e6` runs the algorithm for 1M frames on a Mujoco Ant environment.
+- also refer to the repo-wide [README.md](../../README.md#training-models)
--- a/baselines/run.py
+++ b/baselines/run.py
@@ -123,14 +123,18 @@ def build_env(args, render=False):
        env = bench.Monitor(env, logger.get_dir())
        env = retro_wrappers.wrap_deepmind_retro(env)
        
-    elif env_type == 'classic':
+    elif env_type == 'classic_control':
        def make_env():
            e = gym.make(env_id)
+            e = bench.Monitor(e, logger.get_dir(), allow_early_resets=True)
            e.seed(seed)
            return e
            
        env = DummyVecEnv([make_env])
- 
+
+    else:
+        raise ValueError('Unknown env_type {}'.format(env_type))
+
    return env


@@ -149,7 +153,7 @@ def get_env_type(env_id):
    return env_type, env_id

 def get_default_network(env_type):
-    if env_type == 'mujoco' or env_type=='classic':
+    if env_type == 'mujoco' or env_type == 'classic_control':
        return 'mlp'
    if env_type == 'atari':
        return 'cnn'
--- a/baselines/trpo_mpi/README.md
+++ b/baselines/trpo_mpi/README.md
@@ -2,5 +2,6 @@

 - Original paper: https://arxiv.org/abs/1502.05477
 - Baselines blog post https://blog.openai.com/openai-baselines-ppo/
- `mpirun -np 16 python -m baselines.trpo_mpi.run_atari` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (`-h`) for more options.
- `python -m baselines.trpo_mpi.run_mujoco` runs the algorithm for 1M timesteps on a Mujoco environment.
+- `mpirun -np 16 python -m baselines.run --alg=trpo_mpi --env=PongNoFrameskip-v4` runs the algorithm for 40M frames = 10M timesteps on an Atari Pong. See help (`-h`) for more options.
+- `python -m baselines.run --alg=trpo_mpi --env=Ant-v2 --num_timesteps=1e6` runs the algorithm for 1M timesteps on a Mujoco Ant environment. 
+- also refer to the repo-wide [README.md](../../README.md#training-models)
Author	SHA1	Message	Date
Peter Zhokhov	0f8d640554	updated README files and deepq.train_cartpole example	2018-08-16 13:15:51 -07:00
Peter Zhokhov	44b91f3454	Merge branch 'master' of github.com:openai/baselines into peterz_update_READMEs	2018-08-16 12:26:51 -07:00
Peter Zhokhov	0c2a6936c4	adding a link to repo-wide README	2018-08-16 12:23:06 -07:00
Peter Zhokhov	2614f0f65a	update per-algorithm READMEs to reflect new way of running algorithms	2018-08-16 12:18:06 -07:00
Pim de Haan	e2da7cd42f	Several bugfixes for #504 , #505 , #506 related to Classic Control and deepq (#507 ) * Several bugfixes * Fixed ActWrapper.step bug	2018-08-16 12:08:53 -07:00