Fix mean reward per episode after training Pong. (#562)

* Fix mean reward per episode after training Pong.

* Fix typo.
Author: uronce-cc
Date: 2018-09-06 06:06:29 +08:00
Committed by: pzhokhov
Parent: 7f08c675bb
Commit: 43ed76944b


@@ -80,7 +80,7 @@ python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timeste
 will set entropy coeffient to 0.1, and construct fully connected network with 3 layers with 32 hidden units in each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same)
 See docstrings in [common/models.py](common/models.py) for description of network parameters for each type of model, and
-docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) fir the description of the ppo2 hyperparamters.
+docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) for the description of the ppo2 hyperparamters.
 ### Example 2. DQN on Atari
 DQN with Atari is at this point a classics of benchmarks. To run the baselines implementation of DQN on Atari Pong:
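For reference, the network configuration described in the hunk above corresponds to a run command along these lines. The flag values (`--ent_coef`, `--num_layers`, `--num_hidden`, `--value_network`) are an illustration inferred from that description, not part of this commit's diff:

```bash
# Sketch of a command matching the description above (entropy coefficient 0.1,
# 3-layer MLP with 32 hidden units each, separate value-function network).
# Flag values are assumptions inferred from the prose, not taken from this diff.
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 \
    --ent_coef=0.1 --num_layers=3 --num_hidden=32 --value_network=copy
```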
@@ -95,7 +95,7 @@ Let's imagine you'd like to train ppo2 on Atari Pong, save the model and then l
 ```bash
 python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
 ```
-This should get to the mean reward per episode about 5k. To load and visualize the model, we'll do the following - load the model, train it for 0 steps, and then visualize:
+This should get to the mean reward per episode about 20. To load and visualize the model, we'll do the following - load the model, train it for 0 steps, and then visualize:
 ```bash
 python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
 ```