Fix mean reward per episode after training Pong. (#562)
* Fix mean reward per episode after training Pong.
* Fix typo.

@@ -80,7 +80,7 @@ python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timeste

will set the entropy coefficient to 0.1, construct a fully connected network with 3 layers of 32 hidden units each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same)

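The hunk header above truncates the full command this paragraph describes. As a point of reference only, the invocation would presumably look like the sketch below; the flag names (`--ent_coef`, `--num_hidden`, `--num_layers`, `--value_network`) and the timestep count are inferred from the surrounding description rather than taken from this diff, so check them against `baselines/run.py` and `common/models.py`:

```bash
# Sketch only: flags inferred from the description above, not part of this commit.
python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 \
    --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
```
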
See docstrings in [common/models.py](common/models.py) for a description of network parameters for each type of model, and
-docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) fir the description of the ppo2 hyperparamters.
+docstring for [baselines/ppo2/ppo2.py/learn()](ppo2/ppo2.py) for the description of the ppo2 hyperparamters.

### Example 2. DQN on Atari

DQN with Atari is at this point a classic benchmark. To run the baselines implementation of DQN on Atari Pong:
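
The hunk ends at the colon above, so the command that follows in the README is not part of this diff. Assuming the same `baselines.run` entry point, it would be along the lines of:

```bash
# Assumed invocation (not shown in this diff): DQN on Pong via the common run script.
python -m baselines.run --alg=deepq --env=PongNoFrameskip-v4 --num_timesteps=1e6
```
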
@@ -95,7 +95,7 @@ Let's imagine you'd like to train ppo2 on Atari Pong, save the model and then l

```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 --save_path=~/models/pong_20M_ppo2
```

-This should get to the mean reward per episode about 5k. To load and visualize the model, we'll do the following - load the model, train it for 0 steps, and then visualize:
+This should get to the mean reward per episode about 20. To load and visualize the model, we'll do the following - load the model, train it for 0 steps, and then visualize:

```bash
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=0 --load_path=~/models/pong_20M_ppo2 --play
```
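
The corrected number is the plausible one: a Pong episode ends when either side reaches 21 points and the per-episode reward is the score difference, so it is bounded by ±21; a mean around 20 means the trained agent wins nearly every point, while 5k is impossible on this scale. To inspect that statistic from your own run, a sketch along these lines should work, assuming your checkout's logger still honors the `OPENAI_LOGDIR` and `OPENAI_LOG_FORMAT` environment variables:

```bash
# Sketch, not part of this commit: write training logs (including the reported
# mean episode reward) to a fixed directory as CSV for later inspection.
OPENAI_LOGDIR=~/logs/pong_20M_ppo2 OPENAI_LOG_FORMAT=csv \
python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=2e7 \
    --save_path=~/models/pong_20M_ppo2
```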