Updated example commands to run ppo2 (#534)

The headline mentions PPO, but the command was for A2C
Author: HelgeS
Date: 2018-08-24 00:58:27 +02:00
Committed by: pzhokhov
Parent: cb14da96ca
Commit: 92b9a37257


@@ -77,14 +77,14 @@ Most of the algorithms in baselines repo are used as follows:
 python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]
 ```
 ### Example 1. PPO with MuJoCo Humanoid
-For instance, to train a fully-connected network controlling MuJoCo humanoid using a2c for 20M timesteps
+For instance, to train a fully-connected network controlling MuJoCo humanoid using PPO2 for 20M timesteps
 ```bash
-python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
+python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7
 ```
 Note that for mujoco environments fully-connected network is default, so we can omit `--network=mlp`
 The hyperparameters for both network and the learning algorithm can be controlled via the command line, for instance:
 ```bash
-python -m baselines.run --alg=a2c --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
+python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7 --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
 ```
 will set entropy coefficient to 0.1, and construct fully connected network with 3 layers with 32 hidden units in each, and create a separate network for value function estimation (so that its parameters are not shared with the policy network, but the structure is the same)
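
For context, the extra CLI flags in the second command are not consumed by `baselines.run` itself; they are forwarded as keyword arguments to the algorithm's `learn` function (and network kwargs such as `num_hidden` on to the policy builder). Below is a minimal sketch, not part of this commit, of the programmatic equivalent of the updated command, assuming `gym` with MuJoCo support and `baselines` as of this commit are installed:

```python
# A minimal sketch (not part of this commit): programmatic equivalent of
#   python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --num_timesteps=2e7 \
#       --ent_coef=0.1 --num_hidden=32 --num_layers=3 --value_network=copy
# Assumes gym (with MuJoCo) and baselines are installed.
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

# ppo2.learn expects a vectorized environment; wrap a single Humanoid-v2.
env = DummyVecEnv([lambda: gym.make('Humanoid-v2')])

model = ppo2.learn(
    network='mlp',             # fully-connected network, the MuJoCo default
    env=env,
    total_timesteps=int(2e7),
    ent_coef=0.1,              # entropy coefficient, as in the CLI example
    # The remaining kwargs are forwarded to the policy/network builder:
    num_hidden=32,             # 32 hidden units per layer
    num_layers=3,              # 3 fully-connected layers
    value_network='copy',      # separate value network with the same structure
)
```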