mirror of
https://github.com/Farama-Foundation/Gymnasium.git
synced 2025-07-31 13:54:31 +00:00
Docs automation (#16)
This commit is contained in:
7
.github/workflows/build-docs.yml
vendored
@@ -20,8 +20,11 @@ jobs:
       run: pip install -r docs/requirements.txt

     - name: Install Gymnasium
-      run: pip install .
+      run: pip install mujoco && pip install .[atari,accept-rom-license,box2d]

     - name: Build Envs Docs
       run: python docs/scripts/gen_mds.py

     - name: Build
       run: sphinx-build -b dirhtml -v docs _build

@@ -29,7 +32,7 @@ jobs:
       run: mv _build/404/index.html _build/404.html

     - name: Update 404 links
-      run: python ./docs/scripts/move_404.py _build/404.html
+      run: python docs/scripts/move_404.py _build/404.html

     - name: Remove .doctrees
       run: rm -r _build/.doctrees
@@ -1,83 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Bipedal Walker
firstpage:
---

# Bipedal Walker

```{figure} ../../_static/videos/box2d/bipedal_walker.gif
:width: 200px
:name: bipedal_walker
```

This environment is part of the <a href='..'>Box2D environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-1.0, 1.0, (4,), float32) |
| Observation Shape | (24,) |
| Observation High | [3.14 5. 5. 5. 3.14 5. 3.14 5. 5. 3.14 5. 3.14 5. 5. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. ] |
| Observation Low | [-3.14 -5. -5. -5. -3.14 -5. -3.14 -5. -0. -3.14 -5. -3.14 -5. -0. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. ] |
| Import | `gymnasium.make("BipedalWalker-v3")` |

### Description
This is a simple 4-joint walker robot environment.
There are two versions:
- Normal, with slightly uneven terrain.
- Hardcore, with ladders, stumps, and pitfalls.

To solve the normal version, you need to get 300 points within 1600 time steps.
To solve the hardcore version, you need 300 points within 2000 time steps.

A heuristic is provided for testing; it is also useful for generating
demonstrations to learn from. To run the heuristic:
```
python gymnasium/envs/box2d/bipedal_walker.py
```

### Action Space
Actions are motor speed values in the [-1, 1] range for each of the
4 joints at both hips and knees.

### Observation Space
State consists of hull angle speed, angular velocity, horizontal speed,
vertical speed, position of joints and joint angular speeds, legs contact
with ground, and 10 lidar rangefinder measurements. There are no coordinates
in the state vector.

### Rewards
Reward is given for moving forward, totaling 300+ points up to the far end.
If the robot falls, it gets -100. Applying motor torque costs a small
number of points. A more optimal agent will get a better score.

### Starting State
The walker starts standing at the left end of the terrain with the hull
horizontal, and both legs in the same position with a slight knee angle.

### Episode Termination
The episode will terminate if the hull gets in contact with the ground or
if the walker exceeds the right end of the terrain length.

### Arguments
To use the _hardcore_ environment, you need to specify the
`hardcore=True` argument like below:
```python
import gymnasium
env = gymnasium.make("BipedalWalker-v3", hardcore=True)
```

### Version History
- v3: Returns closest lidar trace instead of furthest;
    faster video recording
- v2: Count energy spent
- v1: Legs now report contact with ground; motors have higher torque and
    speed; ground has higher friction; lidar rendered less nervously.
- v0: Initial version

<!-- ### References -->

### Credits
Created by Oleg Klimov
@@ -1,97 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Car Racing
---

# Car Racing

```{figure} ../../_static/videos/box2d/car_racing.gif
:width: 200px
:name: car_racing
```

This environment is part of the <a href='..'>Box2D environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box([-1. 0. 0.], 1.0, (3,), float32) |
| Observation Shape | (96, 96, 3) |
| Observation High | 255 |
| Observation Low | 0 |
| Import | `gymnasium.make("CarRacing-v2")` |

### Description
The easiest control task to learn from pixels - a top-down
racing environment. The generated track is random every episode.

Some indicators are shown at the bottom of the window along with the
state RGB buffer. From left to right: true speed, four ABS sensors,
steering wheel position, and gyroscope.
To play yourself (it's rather fast for humans), type:
```
python gymnasium/envs/box2d/car_racing.py
```
Remember: it's a powerful rear-wheel drive car - don't press the accelerator
and turn at the same time.

### Action Space
If continuous, there are 3 actions: steering (-1 is full left, +1 is full right), gas, and braking.
If discrete, there are 5 actions: do nothing, steer left, steer right, gas, brake.

### Observation Space
The state consists of 96x96 RGB pixels.

### Rewards
The reward is -0.1 every frame and +1000/N for every track tile visited,
where N is the total number of tiles in the track. For example,
if you have finished in 732 frames, your reward is
1000 - 0.1*732 = 926.8 points.
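The cumulative reward above is simple enough to sanity-check in a few lines (an illustrative sketch, not part of the environment; the function name is made up here):

```python
def final_score(frames, tiles_visited, track_tiles):
    """Cumulative CarRacing reward: -0.1 per frame, +1000/N per visited tile."""
    return -0.1 * frames + (1000.0 / track_tiles) * tiles_visited
```

Finishing every tile of a 500-tile track in 732 frames gives 1000 - 73.2 = 926.8, matching the example above.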
### Starting State
The car starts at rest in the center of the road.

### Episode Termination
The episode finishes when all of the tiles are visited. The car can also go
outside of the playfield - that is, far off the track - in which case it will
receive -100 reward and die.

### Arguments
`lap_complete_percent` dictates the percentage of tiles that must be visited by
the agent before a lap is considered complete.

Passing `domain_randomize=True` enables the domain-randomized variant of the environment.
In this scenario, the background and track colours are different on every reset.

Passing `continuous=False` converts the environment to use a discrete action space.
The discrete action space has 5 actions: [do nothing, left, right, gas, brake].

### Reset Arguments
Passing the option `options["randomize"] = True` will change the current colour of the environment on demand.
Correspondingly, passing the option `options["randomize"] = False` will not change the current colour of the environment.
`domain_randomize` must be `True` on init for this argument to work.
Example usage:
```py
env = gymnasium.make("CarRacing-v2", domain_randomize=True)

# normal reset, this changes the colour scheme by default
env.reset()

# reset with colour scheme change
env.reset(options={"randomize": True})

# reset with no colour scheme change
env.reset(options={"randomize": False})
```

### Version History
- v1: Change track completion logic and add domain randomization (0.24.0)
- v0: Original version

### References
- Chris Campbell (2014), http://www.iforce2d.net/b2dtut/top-down-car.

### Credits
Created by Oleg Klimov
@@ -1,125 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Lunar Lander
lastpage:
---

# Lunar Lander

```{figure} ../../_static/videos/box2d/lunar_lander.gif
:width: 200px
:name: lunar_lander
```

This environment is part of the <a href='..'>Box2D environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Discrete(4) |
| Observation Shape | (8,) |
| Observation High | [1.5 1.5 5. 5. 3.14 5. 1. 1. ] |
| Observation Low | [-1.5 -1.5 -5. -5. -3.14 -5. -0. -0. ] |
| Import | `gymnasium.make("LunarLander-v2")` |

### Description
This environment is a classic rocket trajectory optimization problem.
According to Pontryagin's maximum principle, it is optimal to fire the
engine at full throttle or turn it off. This is the reason why this
environment has discrete actions: engine on or off.

There are two environment versions: discrete or continuous.
The landing pad is always at coordinates (0,0). The coordinates are the
first two numbers in the state vector.
Landing outside of the landing pad is possible. Fuel is infinite, so an agent
can learn to fly and then land on its first attempt.

To see a heuristic landing, run:
```
python gymnasium/envs/box2d/lunar_lander.py
```
<!-- To play yourself, run: -->
<!-- python examples/agents/keyboard_agent.py LunarLander-v2 -->

### Action Space
There are four discrete actions available: do nothing, fire left
orientation engine, fire main engine, fire right orientation engine.

### Observation Space
The state is an 8-dimensional vector: the coordinates of the lander in `x` & `y`, its linear
velocities in `x` & `y`, its angle, its angular velocity, and two booleans
that represent whether each leg is in contact with the ground or not.

### Rewards
Reward for moving from the top of the screen to the landing pad and coming
to rest is about 100-140 points.
If the lander moves away from the landing pad, it loses reward.
If the lander crashes, it receives an additional -100 points. If it comes
to rest, it receives an additional +100 points. Each leg with ground
contact is +10 points.
Firing the main engine is -0.3 points each frame. Firing the side engine
is -0.03 points each frame. Solved is 200 points.

### Starting State
The lander starts at the top center of the viewport with a random initial
force applied to its center of mass.

### Episode Termination
The episode finishes if:
1) the lander crashes (the lander body gets in contact with the moon);
2) the lander gets outside of the viewport (`x` coordinate is greater than 1);
3) the lander is not awake. From the [Box2D docs](https://box2d.org/documentation/md__d_1__git_hub_box2d_docs_dynamics.html#autotoc_md61),
   a body which is not awake is a body which doesn't move and doesn't
   collide with any other body:
> When Box2D determines that a body (or group of bodies) has come to rest,
> the body enters a sleep state which has very little CPU overhead. If a
> body is awake and collides with a sleeping body, then the sleeping body
> wakes up. Bodies will also wake up if a joint or contact attached to
> them is destroyed.

### Arguments
To use the _continuous_ environment, you need to specify the
`continuous=True` argument like below (the values shown are the defaults):
```python
import gymnasium
env = gymnasium.make(
    "LunarLander-v2",
    continuous=False,
    gravity=-10.0,
    enable_wind=False,
    wind_power=15.0,
    turbulence_power=1.5,
)
```
If `continuous=True` is passed, continuous actions (corresponding to the throttle of the engines) will be used and the
action space will be `Box(-1, +1, (2,), dtype=np.float32)`.
The first coordinate of an action determines the throttle of the main engine, while the second
coordinate specifies the throttle of the lateral boosters.
Given an action `np.array([main, lateral])`, the main engine will be turned off completely if
`main < 0` and the throttle scales affinely from 50% to 100% for `0 <= main <= 1` (in particular, the
main engine doesn't work with less than 50% power).
Similarly, if `-0.5 < lateral < 0.5`, the lateral boosters will not fire at all. If `lateral < -0.5`, the left
booster will fire, and if `lateral > 0.5`, the right booster will fire. Again, the throttle scales affinely
from 50% to 100% between -1 and -0.5 (and 0.5 and 1, respectively).
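This throttle scaling can be sketched as a small helper (illustrative only; the environment applies the same mapping internally during `step()`, and the function name is made up here):

```python
def engine_throttle(main, lateral):
    """Map a continuous LunarLander action to engine power fractions.

    Returns (main_power, side_booster, side_power) per the scaling
    described above; a sketch, not the environment's own code.
    """
    main_power = 0.0
    if main >= 0.0:
        # Affine scaling: 50% power at main=0, 100% power at main=1.
        main_power = 0.5 + 0.5 * min(max(main, 0.0), 1.0)
    side, side_power = "off", 0.0
    if abs(lateral) > 0.5:
        side = "left" if lateral < -0.5 else "right"
        side_power = min(abs(lateral), 1.0)  # 50% at |lateral|=0.5, 100% at 1
    return main_power, side, side_power
```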
`gravity` dictates the gravitational constant; it is bounded to be within 0 and -12.

If `enable_wind=True` is passed, there will be wind effects applied to the lander.
The wind is generated using the function `tanh(sin(2 k (t+C)) + sin(pi k (t+C)))`,
where `k` is set to 0.01 and `C` is sampled randomly between -9999 and 9999.

`wind_power` dictates the maximum magnitude of linear wind applied to the craft. The recommended value for `wind_power` is between 0.0 and 20.0.
`turbulence_power` dictates the maximum magnitude of rotational wind applied to the craft. The recommended value for `turbulence_power` is between 0.0 and 2.0.
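As a sketch of the wind formula (the fixed `C` below is an illustrative assumption; the environment samples `C` at reset), note that `tanh` keeps the force strictly below `wind_power`:

```python
import math

def wind_force(t, wind_power=15.0, k=0.01, C=123.0):
    """Linear wind magnitude at step t, per the formula above (sketch)."""
    return wind_power * math.tanh(
        math.sin(2 * k * (t + C)) + math.sin(math.pi * k * (t + C))
    )
```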
### Version History
- v2: Count energy spent and, in v0.24, added turbulence with `wind_power` and `turbulence_power` parameters
- v1: Legs contact with ground added in state vector; contact with ground
    gives +10 reward points, and -10 if contact is then lost; reward
    renormalized to 200; harder initial random push.
- v0: Initial version

<!-- ### References -->

### Credits
Created by Oleg Klimov
@@ -1,132 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Acrobot
firstpage:
---

# Acrobot

```{figure} ../../_static/videos/classic_control/acrobot.gif
:width: 200px
:name: acrobot
```

This environment is part of the <a href='..'>Classic Control environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Discrete(3) |
| Observation Shape | (6,) |
| Observation High | [ 1. 1. 1. 1. 12.57 28.27] |
| Observation Low | [ -1. -1. -1. -1. -12.57 -28.27] |
| Import | `gymnasium.make("Acrobot-v1")` |

### Description

The Acrobot environment is based on Sutton's work in
["Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding"](https://papers.nips.cc/paper/1995/hash/8f1d43620bc6bb580df6e80b0dc05c48-Abstract.html)
and [Sutton and Barto's book](http://www.incompleteideas.net/book/the-book-2nd.html).
The system consists of two links connected linearly to form a chain, with one end of
the chain fixed. The joint between the two links is actuated. The goal is to apply
torques on the actuated joint to swing the free end of the linear chain above a
given height while starting from the initial state of hanging downwards.

As seen in the **GIF**: two blue links connected by two green joints. The joint
between the two links is actuated. The goal is to swing the free end of the outer link
to reach the target height (the black horizontal line above the system) by applying torque on
the actuator.

### Action Space

The action is discrete, deterministic, and represents the torque applied on the actuated
joint between the two links.

| Num | Action | Unit |
|-----|---------------------------------------|--------------|
| 0 | apply -1 torque to the actuated joint | torque (N m) |
| 1 | apply 0 torque to the actuated joint | torque (N m) |
| 2 | apply 1 torque to the actuated joint | torque (N m) |

### Observation Space

The observation is a `ndarray` with shape `(6,)` that provides information about the
two rotational joint angles as well as their angular velocities:

| Num | Observation | Min | Max |
|-----|------------------------------|---------------------|-------------------|
| 0 | Cosine of `theta1` | -1 | 1 |
| 1 | Sine of `theta1` | -1 | 1 |
| 2 | Cosine of `theta2` | -1 | 1 |
| 3 | Sine of `theta2` | -1 | 1 |
| 4 | Angular velocity of `theta1` | ~ -12.567 (-4 * pi) | ~ 12.567 (4 * pi) |
| 5 | Angular velocity of `theta2` | ~ -28.274 (-9 * pi) | ~ 28.274 (9 * pi) |

where
- `theta1` is the angle of the first joint, where an angle of 0 indicates the first link is pointing directly
  downwards.
- `theta2` is ***relative to the angle of the first link.*** An angle of 0 corresponds to
  the two links being aligned.

The angular velocities of `theta1` and `theta2` are bounded at ±4π and ±9π rad/s respectively.
A state of `[1, 0, 1, 0, ..., ...]` indicates that both links are pointing downwards.
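The cos/sin encoding can be sketched as follows (an illustrative helper, not part of the environment API):

```python
import math

def acrobot_observation(theta1, theta2, omega1, omega2):
    """Build the 6-dimensional observation from the underlying angles."""
    return [
        math.cos(theta1), math.sin(theta1),
        math.cos(theta2), math.sin(theta2),
        omega1, omega2,
    ]
```

Both links hanging straight down with no motion gives `[1, 0, 1, 0, 0, 0]`.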
### Rewards

The goal is to have the free end reach a designated target height in as few steps as possible,
and as such all steps that do not reach the goal incur a reward of -1.
Achieving the target height results in termination with a reward of 0. The reward threshold is -100.

### Starting State

Each parameter in the underlying state (`theta1`, `theta2`, and the two angular velocities) is initialized
uniformly between -0.1 and 0.1. This means both links are pointing downwards with some initial stochasticity.

### Episode End

The episode ends if one of the following occurs:
1. Termination: The free end reaches the target height, which is constructed as:
   `-cos(theta1) - cos(theta2 + theta1) > 1.0`
2. Truncation: Episode length is greater than 500 (200 for v0)
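The termination condition can be checked directly from the two angles (a sketch; the environment evaluates the same inequality internally):

```python
import math

def reached_target_height(theta1, theta2):
    """True once the free end swings above the target line."""
    return -math.cos(theta1) - math.cos(theta2 + theta1) > 1.0
```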
### Arguments

No additional arguments are currently supported.

```
env = gymnasium.make('Acrobot-v1')
```

By default, the dynamics of the acrobot follow those described in Sutton and Barto's book
[Reinforcement Learning: An Introduction](http://incompleteideas.net/book/11/node4.html).
However, a `book_or_nips` parameter can be modified to change the pendulum dynamics to those described
in the original [NeurIPS paper](https://papers.nips.cc/paper/1995/hash/8f1d43620bc6bb580df6e80b0dc05c48-Abstract.html).

```
# To change the dynamics as described above
env.env.book_or_nips = 'nips'
```

See the following note and
the [implementation](https://github.com/Farama-Foundation/gymnasium/blob/master/gymnasium/envs/classic_control/acrobot.py) for details:

> The dynamics equations were missing some terms in the NIPS paper which
> are present in the book. R. Sutton confirmed in personal correspondence
> that the experimental results shown in the paper and the book were
> generated with the equations shown in the book.

However, there is the option to run the domain with the paper equations
by setting `book_or_nips = 'nips'`.

### Version History

- v1: Maximum number of steps increased from 200 to 500. The observation space for v0 provided direct readings of
  `theta1` and `theta2` in radians, having a range of `[-pi, pi]`. The v1 observation space as described here provides the
  sine and cosine of each angle instead.
- v0: Initial version released (1.0.0) (removed from gymnasium for v1)

### References
- Sutton, R. S. (1996). Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding.
  In D. Touretzky, M. C. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8).
  MIT Press. https://proceedings.neurips.cc/paper/1995/file/8f1d43620bc6bb580df6e80b0dc05c48-Paper.pdf
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press.
@@ -1,86 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Cart Pole
---

# Cart Pole

```{figure} ../../_static/videos/classic_control/cart_pole.gif
:width: 200px
:name: cart_pole
```

This environment is part of the <a href='..'>Classic Control environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Discrete(2) |
| Observation Shape | (4,) |
| Observation High | [4.8 inf 0.42 inf] |
| Observation Low | [-4.8 -inf -0.42 -inf] |
| Import | `gymnasium.make("CartPole-v1")` |

### Description

This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in
["Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems"](https://ieeexplore.ieee.org/document/6313077).
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces
in the left and right direction on the cart.

### Action Space

The action is a `ndarray` with shape `(1,)` which can take values `{0, 1}` indicating the direction
of the fixed force the cart is pushed with.

| Num | Action |
|-----|------------------------|
| 0 | Push cart to the left |
| 1 | Push cart to the right |

**Note**: The velocity that is reduced or increased by the applied force is not fixed; it depends on the angle
the pole is pointing at, because the pole's center of gravity varies the amount of energy needed to move the cart underneath it.

### Observation Space

The observation is a `ndarray` with shape `(4,)` with the values corresponding to the following positions and velocities:

| Num | Observation | Min | Max |
|-----|-----------------------|---------------------|-------------------|
| 0 | Cart Position | -4.8 | 4.8 |
| 1 | Cart Velocity | -Inf | Inf |
| 2 | Pole Angle | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3 | Pole Angular Velocity | -Inf | Inf |

**Note:** While the ranges above denote the possible values for each element of the observation space,
they are not reflective of the allowed values of the state space in an unterminated episode. In particular:
- The cart x-position (index 0) can take values between `(-4.8, 4.8)`, but the episode terminates
  if the cart leaves the `(-2.4, 2.4)` range.
- The pole angle can be observed between `(-.418, .418)` radians (or **±24°**), but the episode terminates
  if the pole angle is not in the range `(-.2095, .2095)` (or **±12°**).

### Rewards

Since the goal is to keep the pole upright for as long as possible, a reward of `+1` for every step taken,
including the termination step, is allotted. The reward threshold is 475 for v1.

### Starting State

All observations are assigned a uniformly random value in `(-0.05, 0.05)`.

### Episode End

The episode ends if any one of the following occurs:

1. Termination: Pole Angle is greater than ±12°
2. Termination: Cart Position is greater than ±2.4 (the center of the cart reaches the edge of the display)
3. Truncation: Episode length is greater than 500 (200 for v0)
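The two termination conditions can be sketched as a small predicate (illustrative; the thresholds of 2.4 and ~0.2095 rad come from the limits described above):

```python
import math

X_THRESHOLD = 2.4                      # cart position limit
THETA_THRESHOLD = 12 * math.pi / 180   # pole angle limit, ~0.2095 rad

def is_terminated(x, theta):
    """True when the cart leaves the track bounds or the pole tips too far."""
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD
```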
### Arguments

```
gymnasium.make('CartPole-v1')
```

No additional arguments are currently supported.
@@ -1,101 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Mountain Car
---

# Mountain Car

```{figure} ../../_static/videos/classic_control/mountain_car.gif
:width: 200px
:name: mountain_car
```

This environment is part of the <a href='..'>Classic Control environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Discrete(3) |
| Observation Shape | (2,) |
| Observation High | [0.6 0.07] |
| Observation Low | [-1.2 -0.07] |
| Import | `gymnasium.make("MountainCar-v0")` |

### Description

The Mountain Car MDP is a deterministic MDP that consists of a car placed stochastically
at the bottom of a sinusoidal valley, with the only possible actions being the accelerations
that can be applied to the car in either direction. The goal of the MDP is to strategically
accelerate the car to reach the goal state on top of the right hill. There are two versions
of the mountain car domain in gymnasium: one with discrete actions and one with continuous.
This version is the one with discrete actions.

This MDP first appeared in [Andrew Moore's PhD Thesis (1990)](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-209.pdf)

```
@TECHREPORT{Moore90efficientmemory-based,
    author = {Andrew William Moore},
    title = {Efficient Memory-based Learning for Robot Control},
    institution = {University of Cambridge},
    year = {1990}
}
```

### Observation Space

The observation is a `ndarray` with shape `(2,)` where the elements correspond to the following:

| Num | Observation | Min | Max | Unit |
|-----|--------------------------------------|------|-----|----------------|
| 0 | position of the car along the x-axis | -Inf | Inf | position (m) |
| 1 | velocity of the car | -Inf | Inf | velocity (m/s) |

### Action Space

There are 3 discrete deterministic actions:

| Num | Action |
|-----|-------------------------|
| 0 | Accelerate to the left |
| 1 | Don't accelerate |
| 2 | Accelerate to the right |

### Transition Dynamics

Given an action, the mountain car follows the following transition dynamics:

*velocity<sub>t+1</sub> = velocity<sub>t</sub> + (action - 1) * force - cos(3 * position<sub>t</sub>) * gravity*

*position<sub>t+1</sub> = position<sub>t</sub> + velocity<sub>t+1</sub>*

where force = 0.001 and gravity = 0.0025. The collisions at either end are inelastic, with the velocity set to 0
upon collision with the wall. The position is clipped to the range `[-1.2, 0.6]` and
velocity is clipped to the range `[-0.07, 0.07]`.
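The transition above is compact enough to sketch directly (a standalone re-statement of the equations, not the environment's own code; the wall-collision branch is an assumption following the inelastic-collision rule):

```python
import math

FORCE = 0.001
GRAVITY = 0.0025

def step(position, velocity, action):
    """One discrete Mountain Car transition; action is 0 (left), 1 (no-op), or 2 (right)."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-0.07, min(0.07, velocity))   # clip velocity
    position += velocity
    position = max(-1.2, min(0.6, position))     # clip position
    if position == -1.2 and velocity < 0:        # inelastic collision with left wall
        velocity = 0.0
    return position, velocity
```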
### Reward

The goal is to reach the flag placed on top of the right hill as quickly as possible; as such, the agent is
penalised with a reward of -1 for each timestep.

### Starting State

The position of the car is assigned a uniform random value in *[-0.6, -0.4]*.
The starting velocity of the car is always assigned to 0.

### Episode End

The episode ends if either of the following happens:
1. Termination: The position of the car is greater than or equal to 0.5 (the goal position on top of the right hill)
2. Truncation: The length of the episode is 200.

### Arguments

```
gymnasium.make('MountainCar-v0')
```

### Version History

* v0: Initial version released (1.0.0)
@@ -1,95 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Mountain Car Continuous
---

# Mountain Car Continuous

```{figure} ../../_static/videos/classic_control/mountain_car_continuous.gif
:width: 200px
:name: mountain_car_continuous
```

This environment is part of the <a href='..'>Classic Control environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-1.0, 1.0, (1,), float32) |
| Observation Shape | (2,) |
| Observation High | [0.6 0.07] |
| Observation Low | [-1.2 -0.07] |
| Import | `gymnasium.make("MountainCarContinuous-v0")` |

### Description

The Mountain Car MDP is a deterministic MDP that consists of a car placed stochastically
at the bottom of a sinusoidal valley, with the only possible actions being the accelerations
that can be applied to the car in either direction. The goal of the MDP is to strategically
accelerate the car to reach the goal state on top of the right hill. There are two versions
of the mountain car domain in gymnasium: one with discrete actions and one with continuous.
This version is the one with continuous actions.

This MDP first appeared in [Andrew Moore's PhD Thesis (1990)](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-209.pdf)

```
@TECHREPORT{Moore90efficientmemory-based,
    author = {Andrew William Moore},
    title = {Efficient Memory-based Learning for Robot Control},
    institution = {University of Cambridge},
    year = {1990}
}
```

### Observation Space

The observation is a `ndarray` with shape `(2,)` where the elements correspond to the following:

| Num | Observation | Min | Max | Unit |
|-----|--------------------------------------|------|-----|----------------|
| 0 | position of the car along the x-axis | -Inf | Inf | position (m) |
| 1 | velocity of the car | -Inf | Inf | velocity (m/s) |

### Action Space

The action is a `ndarray` with shape `(1,)`, representing the directional force applied on the car.
The action is clipped to the range `[-1, 1]` and multiplied by a power of 0.0015.

### Transition Dynamics

Given an action, the mountain car follows the following transition dynamics:

*velocity<sub>t+1</sub> = velocity<sub>t</sub> + force * power - 0.0025 * cos(3 * position<sub>t</sub>)*

*position<sub>t+1</sub> = position<sub>t</sub> + velocity<sub>t+1</sub>*

where force is the action clipped to the range `[-1, 1]` and power is a constant 0.0015.
The collisions at either end are inelastic, with the velocity set to 0 upon collision with the wall.
The position is clipped to the range `[-1.2, 0.6]` and velocity is clipped to the range `[-0.07, 0.07]`.
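As with the discrete variant, the dynamics can be re-stated in a few lines (a sketch of the equations above, not the environment's own code; the wall-collision branch is an assumption following the inelastic-collision rule):

```python
import math

POWER = 0.0015

def step(position, velocity, action):
    """One continuous Mountain Car transition; action is a scalar in [-1, 1]."""
    force = max(-1.0, min(1.0, action))          # clip the action
    velocity += force * POWER - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    position = max(-1.2, min(0.6, position))
    if position == -1.2 and velocity < 0:        # inelastic collision with left wall
        velocity = 0.0
    return position, velocity
```

Note that at the bottom of the valley full throttle (0.0015) is weaker than gravity's pull (0.0025), which is why the car must build momentum by rocking back and forth.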
### Reward

A negative reward of *-0.1 * action<sup>2</sup>* is received at each timestep to penalise the agent for
taking actions of large magnitude. If the mountain car reaches the goal, then a positive reward of +100
is added to the negative reward for that timestep.

### Starting State

The position of the car is assigned a uniform random value in `[-0.6, -0.4]`.
The starting velocity of the car is always assigned to 0.

### Episode End

The episode ends if either of the following happens:
1. Termination: The position of the car is greater than or equal to 0.45 (the goal position on top of the right hill)
2. Truncation: The length of the episode is 999.

### Arguments

```
gymnasium.make('MountainCarContinuous-v0')
```

### Version History

* v0: Initial version released (1.0.0)
@@ -1,92 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Pendulum
lastpage:
---

# Pendulum

```{figure} ../../_static/videos/classic_control/pendulum.gif
:width: 200px
:name: pendulum
```

This environment is part of the <a href='..'>Classic Control environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-2.0, 2.0, (1,), float32) |
| Observation Shape | (3,) |
| Observation High | [1. 1. 8.] |
| Observation Low | [-1. -1. -8.] |
| Import | `gymnasium.make("Pendulum-v1")` |

### Description

The inverted pendulum swingup problem is based on the classic problem in control theory.
The system consists of a pendulum attached at one end to a fixed point, with the other end being free.
The pendulum starts in a random position, and the goal is to apply torque on the free end to swing it
into an upright position, with its center of gravity right above the fixed point.

The diagram below specifies the coordinate system used for the implementation of the pendulum's
dynamic equations.

![Pendulum Coordinate System](./diagrams/pendulum.png)

- `x-y`: cartesian coordinates of the pendulum's end in meters.
- `theta`: angle in radians.
- `tau`: torque in `N m`. Defined as positive _counter-clockwise_.

### Action Space

The action is a `ndarray` with shape `(1,)` representing the torque applied to the free end of the pendulum.

| Num | Action | Min  | Max |
|-----|--------|------|-----|
| 0   | Torque | -2.0 | 2.0 |

### Observation Space

The observation is a `ndarray` with shape `(3,)` representing the x-y coordinates of the pendulum's free
end and its angular velocity.

| Num | Observation      | Min  | Max |
|-----|------------------|------|-----|
| 0   | x = cos(theta)   | -1.0 | 1.0 |
| 1   | y = sin(theta)   | -1.0 | 1.0 |
| 2   | Angular Velocity | -8.0 | 8.0 |

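Since the angle itself is not part of the observation, it can be recovered from the first two components; a minimal sketch (`obs` is assumed to be a 3-element observation laid out as in the table above):

```python
import math

def angle_from_obs(obs):
    """Recover theta in [-pi, pi] from the (cos(theta), sin(theta)) components."""
    cos_theta, sin_theta, _angular_velocity = obs
    return math.atan2(sin_theta, cos_theta)
```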
### Rewards

The reward function is defined as:

*r = -(theta<sup>2</sup> + 0.1 * theta_dt<sup>2</sup> + 0.001 * torque<sup>2</sup>)*

where `theta` is the pendulum's angle normalized between *[-pi, pi]* (with 0 being in the upright position).
Based on the above equation, the minimum reward that can be obtained is
*-(pi<sup>2</sup> + 0.1 * 8<sup>2</sup> + 0.001 * 2<sup>2</sup>) = -16.2736044*,
while the maximum reward is zero (pendulum is upright with zero velocity and no torque applied).

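The reward above can be computed directly (a minimal sketch; `angle_normalize` is an illustrative helper mirroring the normalization described above, not necessarily the environment's own function):

```python
import math

def angle_normalize(theta):
    """Wrap an angle to the range [-pi, pi]."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi

def pendulum_reward(theta, theta_dt, torque):
    """r = -(theta^2 + 0.1 * theta_dt^2 + 0.001 * torque^2), theta normalized."""
    th = angle_normalize(theta)
    return -(th ** 2 + 0.1 * theta_dt ** 2 + 0.001 * torque ** 2)
```

At the worst state (theta = pi, theta_dt = 8, torque = 2) this evaluates to roughly -16.2736, matching the minimum stated above.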
### Starting State

The starting state is a random angle in *[-pi, pi]* and a random angular velocity in *[-1, 1]*.

### Episode Truncation

The episode truncates at 200 time steps.

### Arguments

- `g`: acceleration of gravity measured in *(m s<sup>-2</sup>)* used to calculate the pendulum dynamics.
  The default value is g = 10.0.

```
gymnasium.make('Pendulum-v1', g=9.81)
```

### Version History

* v1: Simplify the math equations, no difference in behavior.
* v0: Initial versions release (1.0.0)
@@ -1,183 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Ant
firstpage:
---

# Ant

```{figure} ../../_static/videos/mujoco/ant.gif
:width: 200px
:name: ant
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-1.0, 1.0, (8,), float32) |
| Observation Shape | (27,) |
| Observation High | [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | `gymnasium.make("Ant-v4")` |

### Description

This environment is based on the environment introduced by Schulman,
Moritz, Levine, Jordan and Abbeel in ["High-Dimensional Continuous Control
Using Generalized Advantage Estimation"](https://arxiv.org/abs/1506.02438).
The ant is a 3D robot consisting of one torso (a free rotational body) with
four legs attached to it, each leg having two links. The goal is to
coordinate the four legs to move in the forward (right) direction by applying
torques on the eight hinges connecting the two links of each leg and the torso
(nine parts and eight hinges).

### Action Space
The action space is a `Box(-1, 1, (8,), float32)`. An action represents the torques applied at the hinge joints.

| Num | Action                                                            | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit         |
| --- | ----------------------------------------------------------------- | ----------- | ----------- | -------------------------------- | ----- | ------------ |
| 0   | Torque applied on the rotor between the torso and front left hip  | -1          | 1           | hip_1 (front_left_leg)           | hinge | torque (N m) |
| 1   | Torque applied on the rotor between the front left two links      | -1          | 1           | ankle_1 (front_left_leg)         | hinge | torque (N m) |
| 2   | Torque applied on the rotor between the torso and front right hip | -1          | 1           | hip_2 (front_right_leg)          | hinge | torque (N m) |
| 3   | Torque applied on the rotor between the front right two links     | -1          | 1           | ankle_2 (front_right_leg)        | hinge | torque (N m) |
| 4   | Torque applied on the rotor between the torso and back left hip   | -1          | 1           | hip_3 (back_leg)                 | hinge | torque (N m) |
| 5   | Torque applied on the rotor between the back left two links       | -1          | 1           | ankle_3 (back_leg)               | hinge | torque (N m) |
| 6   | Torque applied on the rotor between the torso and back right hip  | -1          | 1           | hip_4 (right_back_leg)           | hinge | torque (N m) |
| 7   | Torque applied on the rotor between the back right two links      | -1          | 1           | ankle_4 (right_back_leg)         | hinge | torque (N m) |

### Observation Space

Observations consist of positional values of different body parts of the ant,
followed by the velocities of those individual parts (their derivatives), with all
the positions ordered before all the velocities.

By default, observations do not include the x- and y-coordinates of the ant's torso. These may
be included by passing `exclude_current_positions_from_observation=False` during construction.
In that case, the observation space will have 113 dimensions, where the first two dimensions
represent the x- and y-coordinates of the ant's torso.
Regardless of whether `exclude_current_positions_from_observation` was set to true or false, the x- and y-coordinates
of the torso will be returned in `info` with keys `"x_position"` and `"y_position"`, respectively.

However, by default, an observation is a `ndarray` with shape `(111,)`
where the elements correspond to the following:

| Num | Observation                                                  | Min    | Max    | Name (in corresponding XML file)       | Joint | Unit                     |
|-----|--------------------------------------------------------------|--------|--------|----------------------------------------|-------|--------------------------|
| 0   | z-coordinate of the torso (centre)                           | -Inf   | Inf    | torso                                  | free  | position (m)             |
| 1   | x-orientation of the torso (centre)                          | -Inf   | Inf    | torso                                  | free  | angle (rad)              |
| 2   | y-orientation of the torso (centre)                          | -Inf   | Inf    | torso                                  | free  | angle (rad)              |
| 3   | z-orientation of the torso (centre)                          | -Inf   | Inf    | torso                                  | free  | angle (rad)              |
| 4   | w-orientation of the torso (centre)                          | -Inf   | Inf    | torso                                  | free  | angle (rad)              |
| 5   | angle between torso and first link on front left             | -Inf   | Inf    | hip_1 (front_left_leg)                 | hinge | angle (rad)              |
| 6   | angle between the two links on the front left                | -Inf   | Inf    | ankle_1 (front_left_leg)               | hinge | angle (rad)              |
| 7   | angle between torso and first link on front right            | -Inf   | Inf    | hip_2 (front_right_leg)                | hinge | angle (rad)              |
| 8   | angle between the two links on the front right               | -Inf   | Inf    | ankle_2 (front_right_leg)              | hinge | angle (rad)              |
| 9   | angle between torso and first link on back left              | -Inf   | Inf    | hip_3 (back_leg)                       | hinge | angle (rad)              |
| 10  | angle between the two links on the back left                 | -Inf   | Inf    | ankle_3 (back_leg)                     | hinge | angle (rad)              |
| 11  | angle between torso and first link on back right             | -Inf   | Inf    | hip_4 (right_back_leg)                 | hinge | angle (rad)              |
| 12  | angle between the two links on the back right                | -Inf   | Inf    | ankle_4 (right_back_leg)               | hinge | angle (rad)              |
| 13  | x-coordinate velocity of the torso                           | -Inf   | Inf    | torso                                  | free  | velocity (m/s)           |
| 14  | y-coordinate velocity of the torso                           | -Inf   | Inf    | torso                                  | free  | velocity (m/s)           |
| 15  | z-coordinate velocity of the torso                           | -Inf   | Inf    | torso                                  | free  | velocity (m/s)           |
| 16  | x-coordinate angular velocity of the torso                   | -Inf   | Inf    | torso                                  | free  | angular velocity (rad/s) |
| 17  | y-coordinate angular velocity of the torso                   | -Inf   | Inf    | torso                                  | free  | angular velocity (rad/s) |
| 18  | z-coordinate angular velocity of the torso                   | -Inf   | Inf    | torso                                  | free  | angular velocity (rad/s) |
| 19  | angular velocity of angle between torso and front left link  | -Inf   | Inf    | hip_1 (front_left_leg)                 | hinge | angular velocity (rad/s) |
| 20  | angular velocity of the angle between front left links       | -Inf   | Inf    | ankle_1 (front_left_leg)               | hinge | angular velocity (rad/s) |
| 21  | angular velocity of angle between torso and front right link | -Inf   | Inf    | hip_2 (front_right_leg)                | hinge | angular velocity (rad/s) |
| 22  | angular velocity of the angle between front right links      | -Inf   | Inf    | ankle_2 (front_right_leg)              | hinge | angular velocity (rad/s) |
| 23  | angular velocity of angle between torso and back left link   | -Inf   | Inf    | hip_3 (back_leg)                       | hinge | angular velocity (rad/s) |
| 24  | angular velocity of the angle between back left links        | -Inf   | Inf    | ankle_3 (back_leg)                     | hinge | angular velocity (rad/s) |
| 25  | angular velocity of angle between torso and back right link  | -Inf   | Inf    | hip_4 (right_back_leg)                 | hinge | angular velocity (rad/s) |
| 26  | angular velocity of the angle between back right links       | -Inf   | Inf    | ankle_4 (right_back_leg)               | hinge | angular velocity (rad/s) |

The remaining 14 * 6 = 84 elements of the observation are contact forces
(external forces - force x, y, z and torque x, y, z) applied to the
center of mass of each of the links. The 14 links are: the ground link,
the torso link, and 3 links for each of the 4 legs (1 + 1 + 12 = 14), each with 6 external force/torque values.

The (x,y,z) coordinates are translational DOFs, while the orientations are rotational
DOFs expressed as quaternions. One can read more about free joints on the [Mujoco Documentation](https://mujoco.readthedocs.io/en/latest/XMLreference.html).

**Note:** The Ant-v4 environment no longer has the following contact forces issue.
If using Ant versions prior to v4, there have been reported issues that using a Mujoco-Py version > 2.0 results
in the contact forces always being 0. As such, we recommend using a Mujoco-Py version < 2.0
when using the Ant environment if you would like to report results with contact forces (if
contact forces are not used in your experiments, you can use version > 2.0).

### Rewards
The reward consists of four parts:
- *healthy_reward*: Every timestep that the ant is healthy (see definition in section "Episode End"), it gets a reward of fixed value `healthy_reward`.
- *forward_reward*: A reward for moving forward, measured as
*(x-coordinate after action - x-coordinate before action)/dt*. *dt* is the time
between actions and is dependent on the `frame_skip` parameter (default is 5),
where the frametime is 0.01 - making the default *dt = 5 * 0.01 = 0.05*.
This reward is positive if the ant moves forward (in the positive x direction).
- *ctrl_cost*: A negative reward penalising the ant if it takes actions
that are too large. It is measured as *`ctrl_cost_weight` * sum(action<sup>2</sup>)*,
where *`ctrl_cost_weight`* is a parameter set for the control and has a default value of 0.5.
- *contact_cost*: A negative reward penalising the ant if the external contact
forces are too large. It is calculated as *`contact_cost_weight` * sum(clip(external contact
force to `contact_force_range`)<sup>2</sup>)*.

The total reward returned is ***reward*** *=* *healthy_reward + forward_reward - ctrl_cost - contact_cost*, and `info` will also contain the individual reward terms.

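Putting the four terms together, the total reward can be sketched as follows (a minimal sketch with the default weights; `x_velocity` stands for *(x-coordinate after action - x-coordinate before action)/dt*, and the function name and parameters are illustrative):

```python
def ant_reward(x_velocity, action, contact_forces, is_healthy,
               healthy_reward=1.0, ctrl_cost_weight=0.5,
               contact_cost_weight=5e-4, contact_force_range=(-1.0, 1.0)):
    """Total reward = healthy_reward + forward_reward - ctrl_cost - contact_cost."""
    lo, hi = contact_force_range
    forward_reward = x_velocity
    ctrl_cost = ctrl_cost_weight * sum(a ** 2 for a in action)
    contact_cost = contact_cost_weight * sum(
        min(max(f, lo), hi) ** 2 for f in contact_forces)  # clip, then square
    return ((healthy_reward if is_healthy else 0.0)
            + forward_reward - ctrl_cost - contact_cost)
```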
### Starting State
All observations start in state
(0.0, 0.0, 0.75, 1.0, 0.0 ... 0.0) with a uniform noise in the range
of [-`reset_noise_scale`, `reset_noise_scale`] added to the positional values and standard normal noise
with mean 0 and standard deviation `reset_noise_scale` added to the velocity values for
stochasticity. Note that the initial z-coordinate is intentionally selected
to be slightly high, thereby indicating a standing-up ant. The initial orientation
is designed to make it face forward as well.

### Episode End
The ant is said to be unhealthy if any of the following happens:

1. Any of the state space values is no longer finite
2. The z-coordinate of the torso is **not** in the closed interval given by `healthy_z_range` (defaults to [0.2, 1.0])

If `terminate_when_unhealthy=True` is passed during construction (which is the default),
the episode ends when any of the following happens:

1. Truncation: The episode duration reaches 1000 timesteps
2. Termination: The ant is unhealthy

If `terminate_when_unhealthy=False` is passed, the episode is ended only when 1000 timesteps are exceeded.

### Arguments

No additional arguments are currently supported in v2 and lower.

```
env = gymnasium.make('Ant-v2')
```

v3 and v4 take `gymnasium.make` kwargs such as `xml_file`, `ctrl_cost_weight`, `reset_noise_scale`, etc.

```
env = gymnasium.make('Ant-v4', ctrl_cost_weight=0.1, ...)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `xml_file` | **str** | `"ant.xml"` | Path to a MuJoCo model |
| `ctrl_cost_weight` | **float** | `0.5` | Weight for *ctrl_cost* term (see section on reward) |
| `contact_cost_weight` | **float** | `5e-4` | Weight for *contact_cost* term (see section on reward) |
| `healthy_reward` | **float** | `1` | Constant reward given if the ant is "healthy" after timestep |
| `terminate_when_unhealthy` | **bool** | `True` | If true, issue a done signal if the z-coordinate of the torso is no longer in the `healthy_z_range` |
| `healthy_z_range` | **tuple** | `(0.2, 1)` | The ant is considered healthy if the z-coordinate of the torso is in this range |
| `contact_force_range` | **tuple** | `(-1, 1)` | Contact forces are clipped to this range in the computation of *contact_cost* |
| `reset_noise_scale` | **float** | `0.1` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
| `exclude_current_positions_from_observation` | **bool** | `True` | Whether or not to omit the x- and y-coordinates from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |

### Version History
* v4: all mujoco environments now use the mujoco bindings in mujoco >= 2.1.3
* v3: support for `gymnasium.make` kwargs such as `xml_file`, `ctrl_cost_weight`, `reset_noise_scale`, etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
* v0: Initial versions release (1.0.0)
@@ -1,140 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Half Cheetah
---

# Half Cheetah

```{figure} ../../_static/videos/mujoco/half_cheetah.gif
:width: 200px
:name: half_cheetah
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-1.0, 1.0, (6,), float32) |
| Observation Shape | (17,) |
| Observation High | [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | `gymnasium.make("HalfCheetah-v4")` |

### Description

This environment is based on the work by P. Wawrzyński in
["A Cat-Like Robot Real-Time Learning to Run"](http://staff.elka.pw.edu.pl/~pwawrzyn/pub-s/0812_LSCLRR.pdf).
The HalfCheetah is a 2-dimensional robot consisting of 9 links and 8
joints connecting them (including two paws). The goal is to apply a torque
on the joints to make the cheetah run forward (right) as fast as possible,
with a positive reward allocated based on the distance moved forward and a
negative reward allocated for moving backward. The torso and head of the
cheetah are fixed, and the torque can only be applied on the other 6 joints
over the front and back thighs (connecting to the torso), shins
(connecting to the thighs) and feet (connecting to the shins).

### Action Space
The action space is a `Box(-1, 1, (6,), float32)`. An action represents the torques applied between *links*.

| Num | Action                                  | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit         |
| --- | --------------------------------------- | ----------- | ----------- | -------------------------------- | ----- | ------------ |
| 0   | Torque applied on the back thigh rotor  | -1          | 1           | bthigh                           | hinge | torque (N m) |
| 1   | Torque applied on the back shin rotor   | -1          | 1           | bshin                            | hinge | torque (N m) |
| 2   | Torque applied on the back foot rotor   | -1          | 1           | bfoot                            | hinge | torque (N m) |
| 3   | Torque applied on the front thigh rotor | -1          | 1           | fthigh                           | hinge | torque (N m) |
| 4   | Torque applied on the front shin rotor  | -1          | 1           | fshin                            | hinge | torque (N m) |
| 5   | Torque applied on the front foot rotor  | -1          | 1           | ffoot                            | hinge | torque (N m) |

### Observation Space

Observations consist of positional values of different body parts of the
cheetah, followed by the velocities of those individual parts (their derivatives), with all the positions ordered before all the velocities.

By default, observations do not include the x-coordinate of the cheetah's center of mass. It may
be included by passing `exclude_current_positions_from_observation=False` during construction.
In that case, the observation space will have 18 dimensions, where the first dimension
represents the x-coordinate of the cheetah's center of mass.
Regardless of whether `exclude_current_positions_from_observation` was set to true or false, the x-coordinate
will be returned in `info` with key `"x_position"`.

However, by default, the observation is a `ndarray` with shape `(17,)` where the elements correspond to the following:

| Num | Observation                                | Min  | Max | Name (in corresponding XML file) | Joint | Unit                     |
| --- | ------------------------------------------ | ---- | --- | -------------------------------- | ----- | ------------------------ |
| 0   | z-coordinate of the front tip              | -Inf | Inf | rootz                            | slide | position (m)             |
| 1   | angle of the front tip                     | -Inf | Inf | rooty                            | hinge | angle (rad)              |
| 2   | angle of the back thigh                    | -Inf | Inf | bthigh                           | hinge | angle (rad)              |
| 3   | angle of the back shin                     | -Inf | Inf | bshin                            | hinge | angle (rad)              |
| 4   | angle of the back foot                     | -Inf | Inf | bfoot                            | hinge | angle (rad)              |
| 5   | angle of the front thigh                   | -Inf | Inf | fthigh                           | hinge | angle (rad)              |
| 6   | angle of the front shin                    | -Inf | Inf | fshin                            | hinge | angle (rad)              |
| 7   | angle of the front foot                    | -Inf | Inf | ffoot                            | hinge | angle (rad)              |
| 8   | velocity of the front tip along the x-axis | -Inf | Inf | rootx                            | slide | velocity (m/s)           |
| 9   | velocity of the front tip along the z-axis | -Inf | Inf | rootz                            | slide | velocity (m/s)           |
| 10  | angular velocity of the front tip          | -Inf | Inf | rooty                            | hinge | angular velocity (rad/s) |
| 11  | angular velocity of the back thigh         | -Inf | Inf | bthigh                           | hinge | angular velocity (rad/s) |
| 12  | angular velocity of the back shin          | -Inf | Inf | bshin                            | hinge | angular velocity (rad/s) |
| 13  | angular velocity of the back foot          | -Inf | Inf | bfoot                            | hinge | angular velocity (rad/s) |
| 14  | angular velocity of the front thigh        | -Inf | Inf | fthigh                           | hinge | angular velocity (rad/s) |
| 15  | angular velocity of the front shin         | -Inf | Inf | fshin                            | hinge | angular velocity (rad/s) |
| 16  | angular velocity of the front foot         | -Inf | Inf | ffoot                            | hinge | angular velocity (rad/s) |

### Rewards
The reward consists of two parts:
- *forward_reward*: A reward for moving forward, measured
as *`forward_reward_weight` * (x-coordinate after action - x-coordinate before action)/dt*. *dt* is
the time between actions and is dependent on the `frame_skip` parameter
(fixed to 5), where the frametime is 0.01 - making the
default *dt = 5 * 0.01 = 0.05*. This reward is positive if the cheetah
runs forward (right).
- *ctrl_cost*: A cost penalising the cheetah if it takes
actions that are too large. It is measured as *`ctrl_cost_weight` *
sum(action<sup>2</sup>)*, where *`ctrl_cost_weight`* is a parameter set for the
control and has a default value of 0.1.

The total reward returned is ***reward*** *=* *forward_reward - ctrl_cost*, and `info` will also contain the individual reward terms.

### Starting State
All observations start in state (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) with a noise added to the
initial state for stochasticity. As seen before, the first 8 values in the
state are positional and the last 9 values are velocity. A uniform noise in
the range of [-`reset_noise_scale`, `reset_noise_scale`] is added to the positional values, while a standard
normal noise with a mean of 0 and standard deviation of `reset_noise_scale` is added to the
initial velocity values of all zeros.

### Episode End
The episode truncates when the episode length is greater than 1000.

### Arguments

No additional arguments are currently supported in v2 and lower.

```
env = gymnasium.make('HalfCheetah-v2')
```

v3 and v4 take `gymnasium.make` kwargs such as `xml_file`, `ctrl_cost_weight`, `reset_noise_scale`, etc.

```
env = gymnasium.make('HalfCheetah-v4', ctrl_cost_weight=0.1, ...)
```

| Parameter | Type | Default | Description |
| -------------------------------------------- | --------- | -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `xml_file` | **str** | `"half_cheetah.xml"` | Path to a MuJoCo model |
| `forward_reward_weight` | **float** | `1.0` | Weight for _forward_reward_ term (see section on reward) |
| `ctrl_cost_weight` | **float** | `0.1` | Weight for _ctrl_cost_ term (see section on reward) |
| `reset_noise_scale` | **float** | `0.1` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
| `exclude_current_positions_from_observation` | **bool** | `True` | Whether or not to omit the x-coordinate from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |

### Version History

* v4: all mujoco environments now use the mujoco bindings in mujoco >= 2.1.3
* v3: support for `gymnasium.make` kwargs such as `xml_file`, `ctrl_cost_weight`, `reset_noise_scale`, etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
* v0: Initial versions release (1.0.0)
@@ -1,145 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Hopper
---

# Hopper

```{figure} ../../_static/videos/mujoco/hopper.gif
:width: 200px
:name: hopper
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-1.0, 1.0, (3,), float32) |
| Observation Shape | (11,) |
| Observation High | [inf inf inf inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | `gymnasium.make("Hopper-v4")` |

### Description

This environment is based on the work done by Erez, Tassa, and Todorov in
["Infinite Horizon Model Predictive Control for Nonlinear Periodic Tasks"](http://www.roboticsproceedings.org/rss07/p10.pdf). The environment aims to
increase the number of independent state and control variables as compared to
the classic control environments. The hopper is a two-dimensional
one-legged figure that consists of four main body parts - the torso at the
top, the thigh in the middle, the leg at the bottom, and a single foot on
which the entire body rests. The goal is to make hops that move in the
forward (right) direction by applying torques on the three hinges
connecting the four body parts.

### Action Space
The action space is a `Box(-1, 1, (3,), float32)`. An action represents the torques applied between *links*.

| Num | Action                            | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit         |
|-----|-----------------------------------|-------------|-------------|----------------------------------|-------|--------------|
| 0   | Torque applied on the thigh rotor | -1          | 1           | thigh_joint                      | hinge | torque (N m) |
| 1   | Torque applied on the leg rotor   | -1          | 1           | leg_joint                        | hinge | torque (N m) |
| 2   | Torque applied on the foot rotor  | -1          | 1           | foot_joint                       | hinge | torque (N m) |

### Observation Space

Observations consist of positional values of different body parts of the
hopper, followed by the velocities of those individual parts
(their derivatives), with all the positions ordered before all the velocities.

By default, observations do not include the x-coordinate of the hopper. It may
be included by passing `exclude_current_positions_from_observation=False` during construction.
In that case, the observation space will have 12 dimensions, where the first dimension
represents the x-coordinate of the hopper.
Regardless of whether `exclude_current_positions_from_observation` was set to true or false, the x-coordinate
will be returned in `info` with key `"x_position"`.

However, by default, the observation is a `ndarray` with shape `(11,)` where the elements
correspond to the following:

| Num | Observation                                      | Min  | Max | Name (in corresponding XML file) | Joint | Unit                     |
| --- | ------------------------------------------------ | ---- | --- | -------------------------------- | ----- | ------------------------ |
| 0   | z-coordinate of the top (height of hopper)       | -Inf | Inf | rootz                            | slide | position (m)             |
| 1   | angle of the top                                 | -Inf | Inf | rooty                            | hinge | angle (rad)              |
| 2   | angle of the thigh joint                         | -Inf | Inf | thigh_joint                      | hinge | angle (rad)              |
| 3   | angle of the leg joint                           | -Inf | Inf | leg_joint                        | hinge | angle (rad)              |
| 4   | angle of the foot joint                          | -Inf | Inf | foot_joint                       | hinge | angle (rad)              |
| 5   | velocity of the x-coordinate of the top          | -Inf | Inf | rootx                            | slide | velocity (m/s)           |
| 6   | velocity of the z-coordinate (height) of the top | -Inf | Inf | rootz                            | slide | velocity (m/s)           |
| 7   | angular velocity of the angle of the top         | -Inf | Inf | rooty                            | hinge | angular velocity (rad/s) |
| 8   | angular velocity of the thigh hinge              | -Inf | Inf | thigh_joint                      | hinge | angular velocity (rad/s) |
| 9   | angular velocity of the leg hinge                | -Inf | Inf | leg_joint                        | hinge | angular velocity (rad/s) |
| 10  | angular velocity of the foot hinge               | -Inf | Inf | foot_joint                       | hinge | angular velocity (rad/s) |

### Rewards
The reward consists of three parts:
- *healthy_reward*: Every timestep that the hopper is healthy (see definition in section "Episode End"), it gets a reward of fixed value `healthy_reward`.
- *forward_reward*: A reward for hopping forward, measured
as *`forward_reward_weight` * (x-coordinate after action - x-coordinate before action)/dt*. *dt* is
the time between actions and is dependent on the `frame_skip` parameter
(fixed to 4), where the frametime is 0.002 - making the
default *dt = 4 * 0.002 = 0.008*. This reward is positive if the hopper
hops forward (positive x direction).
- *ctrl_cost*: A cost penalising the hopper if it takes
actions that are too large. It is measured as *`ctrl_cost_weight` *
sum(action<sup>2</sup>)*, where *`ctrl_cost_weight`* is a parameter set for the
control and has a default value of 0.001.

The total reward returned is ***reward*** *=* *healthy_reward + forward_reward - ctrl_cost*, and `info` will also contain the individual reward terms.

### Starting State
All observations start in state
(0.0, 1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) with a uniform noise
in the range of [-`reset_noise_scale`, `reset_noise_scale`] added to the values for stochasticity.

### Episode End
|
||||
The hopper is said to be unhealthy if any of the following happens:
|
||||
|
||||
1. An element of `observation[1:]` (if `exclude_current_positions_from_observation=True`, else `observation[2:]`) is no longer contained in the closed interval specified by the argument `healthy_state_range`
|
||||
2. The height of the hopper (`observation[0]` if `exclude_current_positions_from_observation=True`, else `observation[1]`) is no longer contained in the closed interval specified by the argument `healthy_z_range` (usually meaning that it has fallen)
|
||||
3. The angle (`observation[1]` if `exclude_current_positions_from_observation=True`, else `observation[2]`) is no longer contained in the closed interval specified by the argument `healthy_angle_range`
|
||||
|
||||
If `terminate_when_unhealthy=True` is passed during construction (which is the default),
|
||||
the episode ends when any of the following happens:
|
||||
|
||||
1. Truncation: The episode duration reaches a 1000 timesteps
|
||||
2. Termination: The hopper is unhealthy
|
||||
|
||||
If `terminate_when_unhealthy=False` is passed, the episode is ended only when 1000 timesteps are exceeded.
|
||||
|
||||
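The three health conditions can be sketched as a standalone check. This is a hypothetical helper, not the Gymnasium implementation; it follows the closed-interval wording above and assumes `exclude_current_positions_from_observation=True`.

```python
import numpy as np

# Hypothetical helper mirroring the health conditions above. Assumes
# exclude_current_positions_from_observation=True, so observation[0]
# is the height (z) and observation[1] is the angle.
def is_healthy(observation,
               healthy_state_range=(-100.0, 100.0),
               healthy_z_range=(0.7, float("inf")),
               healthy_angle_range=(-0.2, 0.2)):
    z, angle, state = observation[0], observation[1], observation[1:]
    healthy_state = np.all((healthy_state_range[0] <= state)
                           & (state <= healthy_state_range[1]))
    healthy_z = healthy_z_range[0] <= z <= healthy_z_range[1]
    healthy_angle = healthy_angle_range[0] <= angle <= healthy_angle_range[1]
    return bool(healthy_state and healthy_z and healthy_angle)
```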
### Arguments

No additional arguments are currently supported in v2 and lower.

```
env = gymnasium.make('Hopper-v2')
```

v3 and v4 take gymnasium.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.

```
env = gymnasium.make('Hopper-v4', ctrl_cost_weight=0.1, ....)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `xml_file` | **str** | `"hopper.xml"` | Path to a MuJoCo model |
| `forward_reward_weight` | **float** | `1.0` | Weight for _forward_reward_ term (see section on reward) |
| `ctrl_cost_weight` | **float** | `0.001` | Weight for _ctrl_cost_ reward (see section on reward) |
| `healthy_reward` | **float** | `1` | Constant reward given if the hopper is "healthy" after timestep |
| `terminate_when_unhealthy` | **bool** | `True` | If true, issue a done signal if the hopper is no longer healthy |
| `healthy_state_range` | **tuple** | `(-100, 100)` | The elements of `observation[1:]` (if `exclude_current_positions_from_observation=True`, else `observation[2:]`) must be in this range for the hopper to be considered healthy |
| `healthy_z_range` | **tuple** | `(0.7, float("inf"))` | The z-coordinate must be in this range for the hopper to be considered healthy |
| `healthy_angle_range` | **tuple** | `(-0.2, 0.2)` | The angle given by `observation[1]` (if `exclude_current_positions_from_observation=True`, else `observation[2]`) must be in this range for the hopper to be considered healthy |
| `reset_noise_scale` | **float** | `5e-3` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
| `exclude_current_positions_from_observation` | **bool** | `True` | Whether or not to omit the x-coordinate from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |

### Version History

* v4: all mujoco environments now use the mujoco bindings in mujoco>=2.1.3
* v3: support for gymnasium.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
* v0: Initial versions release (1.0.0)
@@ -1,213 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Humanoid
---

# Humanoid

```{figure} ../../_static/videos/mujoco/humanoid.gif
:width: 200px
:name: humanoid
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-0.4, 0.4, (17,), float32) |
| Observation Shape | (376,) |
| Observation High | [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | `gymnasium.make("Humanoid-v4")` |

### Description

This environment is based on the environment introduced by Tassa, Erez and Todorov
in ["Synthesis and stabilization of complex behaviors through online trajectory optimization"](https://ieeexplore.ieee.org/document/6386025).
The 3D bipedal robot is designed to simulate a human. It has a torso (abdomen) with a pair of
legs and arms. The legs each consist of two links, and so do the arms (representing the knees and
elbows respectively). The goal of the environment is to walk forward as fast as possible without falling over.

### Action Space
The action space is a `Box(-0.4, 0.4, (17,), float32)`. An action represents the torques applied at the hinge joints.

| Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
|-----|--------|-------------|-------------|----------------------------------|-------|------|
| 0 | Torque applied on the hinge in the y-coordinate of the abdomen | -0.4 | 0.4 | hip_1 (front_left_leg) | hinge | torque (N m) |
| 1 | Torque applied on the hinge in the z-coordinate of the abdomen | -0.4 | 0.4 | angle_1 (front_left_leg) | hinge | torque (N m) |
| 2 | Torque applied on the hinge in the x-coordinate of the abdomen | -0.4 | 0.4 | hip_2 (front_right_leg) | hinge | torque (N m) |
| 3 | Torque applied on the rotor between torso/abdomen and the right hip (x-coordinate) | -0.4 | 0.4 | right_hip_x (right_thigh) | hinge | torque (N m) |
| 4 | Torque applied on the rotor between torso/abdomen and the right hip (z-coordinate) | -0.4 | 0.4 | right_hip_z (right_thigh) | hinge | torque (N m) |
| 5 | Torque applied on the rotor between torso/abdomen and the right hip (y-coordinate) | -0.4 | 0.4 | right_hip_y (right_thigh) | hinge | torque (N m) |
| 6 | Torque applied on the rotor between the right hip/thigh and the right shin | -0.4 | 0.4 | right_knee | hinge | torque (N m) |
| 7 | Torque applied on the rotor between torso/abdomen and the left hip (x-coordinate) | -0.4 | 0.4 | left_hip_x (left_thigh) | hinge | torque (N m) |
| 8 | Torque applied on the rotor between torso/abdomen and the left hip (z-coordinate) | -0.4 | 0.4 | left_hip_z (left_thigh) | hinge | torque (N m) |
| 9 | Torque applied on the rotor between torso/abdomen and the left hip (y-coordinate) | -0.4 | 0.4 | left_hip_y (left_thigh) | hinge | torque (N m) |
| 10 | Torque applied on the rotor between the left hip/thigh and the left shin | -0.4 | 0.4 | left_knee | hinge | torque (N m) |
| 11 | Torque applied on the rotor between the torso and right upper arm (coordinate -1) | -0.4 | 0.4 | right_shoulder1 | hinge | torque (N m) |
| 12 | Torque applied on the rotor between the torso and right upper arm (coordinate -2) | -0.4 | 0.4 | right_shoulder2 | hinge | torque (N m) |
| 13 | Torque applied on the rotor between the right upper arm and right lower arm | -0.4 | 0.4 | right_elbow | hinge | torque (N m) |
| 14 | Torque applied on the rotor between the torso and left upper arm (coordinate -1) | -0.4 | 0.4 | left_shoulder1 | hinge | torque (N m) |
| 15 | Torque applied on the rotor between the torso and left upper arm (coordinate -2) | -0.4 | 0.4 | left_shoulder2 | hinge | torque (N m) |
| 16 | Torque applied on the rotor between the left upper arm and left lower arm | -0.4 | 0.4 | left_elbow | hinge | torque (N m) |

### Observation Space

Observations consist of positional values of different body parts of the Humanoid,
followed by the velocities of those individual parts (their derivatives), with all the
positions ordered before all the velocities.

By default, observations do not include the x- and y-coordinates of the torso. These may
be included by passing `exclude_current_positions_from_observation=False` during construction.
In that case, the observation space will have 378 dimensions, where the first two dimensions
represent the x- and y-coordinates of the torso.
Regardless of whether `exclude_current_positions_from_observation` is set to true or false, the x- and y-coordinates
will be returned in `info` with keys `"x_position"` and `"y_position"`, respectively.

By default, however, the observation is a `ndarray` with shape `(376,)` where the elements correspond to the following:

| Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
| --- | ----------- | --- | --- | -------------------------------- | ----- | ---- |
| 0 | z-coordinate of the torso (centre) | -Inf | Inf | root | free | position (m) |
| 1 | x-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
| 2 | y-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
| 3 | z-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
| 4 | w-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
| 5 | z-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | angle (rad) |
| 6 | y-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | angle (rad) |
| 7 | x-angle of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | angle (rad) |
| 8 | x-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | angle (rad) |
| 9 | z-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | angle (rad) |
| 10 | y-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | angle (rad) |
| 11 | angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | angle (rad) |
| 12 | x-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | angle (rad) |
| 13 | z-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | angle (rad) |
| 14 | y-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | angle (rad) |
| 15 | angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | angle (rad) |
| 16 | coordinate-1 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | angle (rad) |
| 17 | coordinate-2 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | angle (rad) |
| 18 | angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | angle (rad) |
| 19 | coordinate-1 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | angle (rad) |
| 20 | coordinate-2 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | angle (rad) |
| 21 | angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | angle (rad) |
| 22 | x-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
| 23 | y-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
| 24 | z-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
| 25 | x-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
| 26 | y-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
| 27 | z-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
| 28 | z-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | angular velocity (rad/s) |
| 29 | y-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | angular velocity (rad/s) |
| 30 | x-coordinate of angular velocity of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | angular velocity (rad/s) |
| 31 | x-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | angular velocity (rad/s) |
| 32 | z-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | angular velocity (rad/s) |
| 33 | y-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | angular velocity (rad/s) |
| 34 | angular velocity of the angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | angular velocity (rad/s) |
| 35 | x-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | angular velocity (rad/s) |
| 36 | z-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | angular velocity (rad/s) |
| 37 | y-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | angular velocity (rad/s) |
| 38 | angular velocity of the angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | angular velocity (rad/s) |
| 39 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | angular velocity (rad/s) |
| 40 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | angular velocity (rad/s) |
| 41 | angular velocity of the angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | angular velocity (rad/s) |
| 42 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | angular velocity (rad/s) |
| 43 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | angular velocity (rad/s) |
| 44 | angular velocity of the angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | angular velocity (rad/s) |

Additionally, after all the positional and velocity based values in the table,
the observation contains (in order):
- *cinert:* Mass and inertia of a single rigid body relative to the center of mass
(this is an intermediate result of the transition). It has shape 14*10 (*nbody * 10*)
and hence adds another 140 elements to the state space.
- *cvel:* Center of mass based velocity. It has shape 14 * 6 (*nbody * 6*) and hence
adds another 84 elements to the state space.
- *qfrc_actuator:* Constraint force generated as the actuator force. This has shape
`(23,)` *(nv * 1)* and hence adds another 23 elements to the state space.
- *cfrc_ext:* This is the center of mass based external force on the body. It has shape
14 * 6 (*nbody * 6*) and hence adds another 84 elements to the state space,
where *nbody* stands for the number of bodies in the robot and *nv* stands for the
number of degrees of freedom (*= dim(qvel)*).

The (x,y,z) coordinates are translational DOFs, while the orientations are rotational
DOFs expressed as quaternions. One can read more about free joints in the
[Mujoco Documentation](https://mujoco.readthedocs.io/en/latest/XMLreference.html).

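As a quick sanity check, the pieces enumerated above account for the full 376-dimensional observation:

```python
# The 376-dimensional Humanoid observation decomposes as described above.
nbody, nv = 14, 23           # bodies in the model; degrees of freedom (= dim(qvel))
table_entries = 22 + 23      # 22 positional + 23 velocity values (rows 0-44 above)
cinert = nbody * 10          # 140 mass/inertia elements
cvel = nbody * 6             # 84 center-of-mass velocity elements
qfrc_actuator = nv * 1       # 23 actuator forces
cfrc_ext = nbody * 6         # 84 external contact forces
total = table_entries + cinert + cvel + qfrc_actuator + cfrc_ext  # 376
```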

**Note:** The contact forces issue described below no longer applies to the Humanoid-v4 environment.
For Humanoid versions prior to v4, there have been reported issues where using a Mujoco-Py version > 2.0
results in the contact forces always being 0. As such, we recommend using a Mujoco-Py
version < 2.0 with the Humanoid environment if you would like to report results
with contact forces (if contact forces are not used in your experiments, you can use
version > 2.0).

### Rewards
The reward consists of four parts:
- *healthy_reward*: Every timestep that the humanoid is alive (see section Episode Termination for definition), it gets a reward of fixed value `healthy_reward`.
- *forward_reward*: A reward for walking forward, measured as *`forward_reward_weight` *
(average center of mass after action - average center of mass before action)/dt*.
*dt* is the time between actions and depends on the frame_skip parameter
(default is 5), where the frametime is 0.003 - making the default *dt = 5 * 0.003 = 0.015*.
This reward is positive if the humanoid walks forward (in the positive x-direction). The calculation
for the center of mass is defined in the `.py` file for the Humanoid.
- *ctrl_cost*: A negative reward for penalising the humanoid if it has too
large of a control force. If there are *nu* actuators/controls, then the control has
shape `nu x 1`. It is measured as *`ctrl_cost_weight` * sum(control<sup>2</sup>)*.
- *contact_cost*: A negative reward for penalising the humanoid if the external
contact force is too large. It is calculated by clipping
*`contact_cost_weight` * sum(external contact force<sup>2</sup>)* to the interval specified by `contact_cost_range`.

The total reward returned is ***reward*** *=* *healthy_reward + forward_reward - ctrl_cost - contact_cost*, and `info` will also contain the individual reward terms.

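A sketch of how the four terms combine. The helper name and signature are illustrative, not Gymnasium API; the weight defaults follow the parameter table in the Arguments section, and the `contact_cost_range` default is an assumption, not taken from this page.

```python
import numpy as np

# Sketch of the four-term Humanoid reward described above (illustrative
# helper). The contact_cost_range default is an assumption.
def humanoid_reward(com_before, com_after, control, ext_contact_force,
                    forward_reward_weight=1.25,
                    ctrl_cost_weight=0.1,
                    contact_cost_weight=5e-7,
                    contact_cost_range=(-np.inf, 10.0),
                    healthy_reward=5.0,
                    dt=5 * 0.003):
    # Positive when the center of mass moves in the positive x-direction.
    forward_reward = forward_reward_weight * (com_after - com_before) / dt
    ctrl_cost = ctrl_cost_weight * np.sum(np.square(control))
    # Clip the contact penalty to the configured interval.
    contact_cost = float(np.clip(
        contact_cost_weight * np.sum(np.square(ext_contact_force)),
        *contact_cost_range))
    return healthy_reward + forward_reward - ctrl_cost - contact_cost
```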
### Starting State
All observations start in state
(0.0, 0.0, 1.4, 1.0, 0.0 ... 0.0) with a uniform noise in the range
of [-`reset_noise_scale`, `reset_noise_scale`] added to the positional and velocity values (values in the table)
for stochasticity. Note that the initial z coordinate is intentionally
selected to be high, thereby indicating a standing up humanoid. The initial
orientation is designed to make it face forward as well.

### Episode End
The humanoid is said to be unhealthy if the z-position of the torso is no longer contained in the
closed interval specified by the argument `healthy_z_range`.

If `terminate_when_unhealthy=True` is passed during construction (which is the default),
the episode ends when any of the following happens:

1. Truncation: The episode duration reaches 1000 timesteps
2. Termination: The humanoid is unhealthy

If `terminate_when_unhealthy=False` is passed, the episode is ended only when 1000 timesteps are exceeded.

### Arguments

No additional arguments are currently supported in v2 and lower.

```
env = gymnasium.make('Humanoid-v2')
```

v3 and v4 take gymnasium.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.

```
env = gymnasium.make('Humanoid-v4', ctrl_cost_weight=0.1, ....)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `xml_file` | **str** | `"humanoid.xml"` | Path to a MuJoCo model |
| `forward_reward_weight` | **float** | `1.25` | Weight for _forward_reward_ term (see section on reward) |
| `ctrl_cost_weight` | **float** | `0.1` | Weight for _ctrl_cost_ term (see section on reward) |
| `contact_cost_weight` | **float** | `5e-7` | Weight for _contact_cost_ term (see section on reward) |
| `healthy_reward` | **float** | `5.0` | Constant reward given if the humanoid is "healthy" after timestep |
| `terminate_when_unhealthy` | **bool** | `True` | If true, issue a done signal if the z-coordinate of the torso is no longer in the `healthy_z_range` |
| `healthy_z_range` | **tuple** | `(1.0, 2.0)` | The humanoid is considered healthy if the z-coordinate of the torso is in this range |
| `reset_noise_scale` | **float** | `1e-2` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
| `exclude_current_positions_from_observation` | **bool** | `True` | Whether or not to omit the x- and y-coordinates from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |

### Version History

* v4: all mujoco environments now use the mujoco bindings in mujoco>=2.1.3
* v3: support for gymnasium.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
* v0: Initial versions release (1.0.0)
@@ -1,192 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Humanoid Standup
---

# Humanoid Standup

```{figure} ../../_static/videos/mujoco/humanoid_standup.gif
:width: 200px
:name: humanoid_standup
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-0.4, 0.4, (17,), float32) |
| Observation Shape | (376,) |
| Observation High | [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | `gymnasium.make("HumanoidStandup-v4")` |

### Description

This environment is based on the environment introduced by Tassa, Erez and Todorov
in ["Synthesis and stabilization of complex behaviors through online trajectory optimization"](https://ieeexplore.ieee.org/document/6386025).
The 3D bipedal robot is designed to simulate a human. It has a torso (abdomen) with a
pair of legs and arms. The legs each consist of two links, and so do the arms (representing the
knees and elbows respectively). The environment starts with the humanoid lying on the ground,
and the goal of the environment is to make the humanoid stand up and then keep it standing
by applying torques to the various hinges.

### Action Space
|
||||
The agent take a 17-element vector for actions.
|
||||
|
||||
The action space is a continuous `(action, ...)` all in `[-1, 1]`, where `action`
|
||||
represents the numerical torques applied at the hinge joints.

| Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
| --- | ---------------------------------------------------------------------------------- | ----------- | ----------- | -------------------------------- | ----- | ------------ |
| 0 | Torque applied on the hinge in the y-coordinate of the abdomen | -0.4 | 0.4 | hip_1 (front_left_leg) | hinge | torque (N m) |
| 1 | Torque applied on the hinge in the z-coordinate of the abdomen | -0.4 | 0.4 | angle_1 (front_left_leg) | hinge | torque (N m) |
| 2 | Torque applied on the hinge in the x-coordinate of the abdomen | -0.4 | 0.4 | hip_2 (front_right_leg) | hinge | torque (N m) |
| 3 | Torque applied on the rotor between torso/abdomen and the right hip (x-coordinate) | -0.4 | 0.4 | right_hip_x (right_thigh) | hinge | torque (N m) |
| 4 | Torque applied on the rotor between torso/abdomen and the right hip (z-coordinate) | -0.4 | 0.4 | right_hip_z (right_thigh) | hinge | torque (N m) |
| 5 | Torque applied on the rotor between torso/abdomen and the right hip (y-coordinate) | -0.4 | 0.4 | right_hip_y (right_thigh) | hinge | torque (N m) |
| 6 | Torque applied on the rotor between the right hip/thigh and the right shin | -0.4 | 0.4 | right_knee | hinge | torque (N m) |
| 7 | Torque applied on the rotor between torso/abdomen and the left hip (x-coordinate) | -0.4 | 0.4 | left_hip_x (left_thigh) | hinge | torque (N m) |
| 8 | Torque applied on the rotor between torso/abdomen and the left hip (z-coordinate) | -0.4 | 0.4 | left_hip_z (left_thigh) | hinge | torque (N m) |
| 9 | Torque applied on the rotor between torso/abdomen and the left hip (y-coordinate) | -0.4 | 0.4 | left_hip_y (left_thigh) | hinge | torque (N m) |
| 10 | Torque applied on the rotor between the left hip/thigh and the left shin | -0.4 | 0.4 | left_knee | hinge | torque (N m) |
| 11 | Torque applied on the rotor between the torso and right upper arm (coordinate -1) | -0.4 | 0.4 | right_shoulder1 | hinge | torque (N m) |
| 12 | Torque applied on the rotor between the torso and right upper arm (coordinate -2) | -0.4 | 0.4 | right_shoulder2 | hinge | torque (N m) |
| 13 | Torque applied on the rotor between the right upper arm and right lower arm | -0.4 | 0.4 | right_elbow | hinge | torque (N m) |
| 14 | Torque applied on the rotor between the torso and left upper arm (coordinate -1) | -0.4 | 0.4 | left_shoulder1 | hinge | torque (N m) |
| 15 | Torque applied on the rotor between the torso and left upper arm (coordinate -2) | -0.4 | 0.4 | left_shoulder2 | hinge | torque (N m) |
| 16 | Torque applied on the rotor between the left upper arm and left lower arm | -0.4 | 0.4 | left_elbow | hinge | torque (N m) |

### Observation Space

The state space consists of positional values of different body parts of the Humanoid,
followed by the velocities of those individual parts (their derivatives), with all the positions ordered before all the velocities.

**Note:** The x- and y-coordinates of the torso are omitted to produce position-agnostic behavior in policies.

The observation is a `ndarray` with shape `(376,)` where the elements correspond to the following:

| Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
| --- | --------------------------------------------------------------------------------------------------------------- | ---- | --- | -------------------------------- | ----- | ------------------------ |
| 0 | z-coordinate of the torso (centre) | -Inf | Inf | root | free | position (m) |
| 1 | x-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
| 2 | y-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
| 3 | z-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
| 4 | w-orientation of the torso (centre) | -Inf | Inf | root | free | angle (rad) |
| 5 | z-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | angle (rad) |
| 6 | y-angle of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | angle (rad) |
| 7 | x-angle of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | angle (rad) |
| 8 | x-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | angle (rad) |
| 9 | z-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | angle (rad) |
| 10 | y-coordinate of angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | angle (rad) |
| 11 | angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | angle (rad) |
| 12 | x-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | angle (rad) |
| 13 | z-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | angle (rad) |
| 14 | y-coordinate of angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | angle (rad) |
| 15 | angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | angle (rad) |
| 16 | coordinate-1 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | angle (rad) |
| 17 | coordinate-2 (multi-axis) angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | angle (rad) |
| 18 | angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | angle (rad) |
| 19 | coordinate-1 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | angle (rad) |
| 20 | coordinate-2 (multi-axis) angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | angle (rad) |
| 21 | angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | angle (rad) |
| 22 | x-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
| 23 | y-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
| 24 | z-coordinate velocity of the torso (centre) | -Inf | Inf | root | free | velocity (m/s) |
| 25 | x-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
| 26 | y-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
| 27 | z-coordinate angular velocity of the torso (centre) | -Inf | Inf | root | free | angular velocity (rad/s) |
| 28 | z-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_z | hinge | angular velocity (rad/s) |
| 29 | y-coordinate of angular velocity of the abdomen (in lower_waist) | -Inf | Inf | abdomen_y | hinge | angular velocity (rad/s) |
| 30 | x-coordinate of angular velocity of the abdomen (in pelvis) | -Inf | Inf | abdomen_x | hinge | angular velocity (rad/s) |
| 31 | x-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_x | hinge | angular velocity (rad/s) |
| 32 | z-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_z | hinge | angular velocity (rad/s) |
| 33 | y-coordinate of the angular velocity of the angle between pelvis and right hip (in right_thigh) | -Inf | Inf | right_hip_y | hinge | angular velocity (rad/s) |
| 34 | angular velocity of the angle between right hip and the right shin (in right_knee) | -Inf | Inf | right_knee | hinge | angular velocity (rad/s) |
| 35 | x-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_x | hinge | angular velocity (rad/s) |
| 36 | z-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_z | hinge | angular velocity (rad/s) |
| 37 | y-coordinate of the angular velocity of the angle between pelvis and left hip (in left_thigh) | -Inf | Inf | left_hip_y | hinge | angular velocity (rad/s) |
| 38 | angular velocity of the angle between left hip and the left shin (in left_knee) | -Inf | Inf | left_knee | hinge | angular velocity (rad/s) |
| 39 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder1 | hinge | angular velocity (rad/s) |
| 40 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and right arm (in right_upper_arm) | -Inf | Inf | right_shoulder2 | hinge | angular velocity (rad/s) |
| 41 | angular velocity of the angle between right upper arm and right_lower_arm | -Inf | Inf | right_elbow | hinge | angular velocity (rad/s) |
| 42 | coordinate-1 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder1 | hinge | angular velocity (rad/s) |
| 43 | coordinate-2 (multi-axis) of the angular velocity of the angle between torso and left arm (in left_upper_arm) | -Inf | Inf | left_shoulder2 | hinge | angular velocity (rad/s) |
| 44 | angular velocity of the angle between left upper arm and left_lower_arm | -Inf | Inf | left_elbow | hinge | angular velocity (rad/s) |


Additionally, after all the positional and velocity based values in the table,
the state space consists of (in order):
- *cinert:* Mass and inertia of a single rigid body relative to the center of mass
  (this is an intermediate result of the transition). It has shape 14\*10 (*nbody \* 10*)
  and hence adds another 140 elements to the state space.
- *cvel:* Center of mass based velocity. It has shape 14\*6 (*nbody \* 6*) and hence
  adds another 84 elements to the state space.
- *qfrc_actuator:* Constraint force generated as the actuator force. This has shape
  `(23,)` (*nv \* 1*) and hence adds another 23 elements to the state space.
- *cfrc_ext:* This is the center of mass based external force on the body. It has shape
  14\*6 (*nbody \* 6*) and hence adds another 84 elements to the state space,

where *nbody* stands for the number of bodies in the robot and *nv* stands for the number
of degrees of freedom (*= dim(qvel)*).
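
The sizes above can be checked against the `(376,)` observation shape. The sketch below splits a flat observation into the named segments in the documented order; the segment names and the helper function are illustrative assumptions, not part of the Gymnasium API.

```python
import numpy as np

# Illustrative segment sizes for the 376-element HumanoidStandup observation:
# 45 table values (22 positions + 23 velocities), then the extra buffers.
SEGMENTS = {
    "qpos_qvel": 45,      # positional and velocity values from the table
    "cinert": 14 * 10,    # nbody * 10 = 140
    "cvel": 14 * 6,       # nbody * 6 = 84
    "qfrc_actuator": 23,  # nv * 1
    "cfrc_ext": 14 * 6,   # nbody * 6 = 84
}

def split_observation(obs: np.ndarray) -> dict:
    """Split a flat observation vector into named segments."""
    out, start = {}, 0
    for name, size in SEGMENTS.items():
        out[name] = obs[start:start + size]
        start += size
    assert start == obs.shape[0], "unexpected observation length"
    return out

parts = split_observation(np.zeros(376))
```

The segment sizes sum to 376, matching the observation shape in the table at the top of this page.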

The (x,y,z) coordinates are translational DOFs, while the orientations are rotational
DOFs expressed as quaternions. One can read more about free joints in the
[Mujoco Documentation](https://mujoco.readthedocs.io/en/latest/XMLreference.html).

**Note:** The HumanoidStandup-v4 environment no longer has the following contact-forces issue.
If using previous HumanoidStandup versions (before v4), there have been reported issues that using a Mujoco-Py version > 2.0 results
in the contact forces always being 0. As such we recommend using a Mujoco-Py version < 2.0
when using the Humanoid environment if you would like to report results with contact forces
(if contact forces are not used in your experiments, you can use version > 2.0).

### Rewards
The reward consists of three parts:
- *uph_cost*: A reward for moving upward (in an attempt to stand up). This is not a relative
  reward measuring how far upward the humanoid has moved since the last timestep, but an
  absolute reward measuring how far upward the humanoid has moved overall. It is
  measured as *(z coordinate after action - 0)/(atomic timestep)*, where *z coordinate after
  action* is index 0 in the state/index 2 in the table, and *atomic timestep* is the time for
  one frame of movement, even though the simulation has a frame skip of 5 (done in order to inflate
  rewards a little for faster learning).
- *quad_ctrl_cost*: A negative reward for penalising the humanoid if it has too large of
  a control force. If there are *nu* actuators/controls, then the control has shape `nu x 1`.
  It is measured as *0.1 **x** sum(control<sup>2</sup>)*.
- *quad_impact_cost*: A negative reward for penalising the humanoid if the external
  contact force is too large. It is calculated as *min(0.5 * 0.000001 * sum(external
  contact force<sup>2</sup>), 10)*.

The total reward returned is ***reward*** *=* *uph_cost + 1 - quad_ctrl_cost - quad_impact_cost*
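
As a rough sketch of how the three terms combine, assuming illustrative inputs (`pos_after`, `dt`, `action`, `cfrc_ext` are placeholder names, not the environment's exact internals):

```python
import numpy as np

# Hedged sketch of the HumanoidStandup reward described above.
def standup_reward(pos_after: float, dt: float,
                   action: np.ndarray, cfrc_ext: np.ndarray) -> float:
    uph_cost = (pos_after - 0.0) / dt                  # absolute upward progress
    quad_ctrl_cost = 0.1 * np.square(action).sum()     # control-force penalty
    quad_impact_cost = min(0.5e-6 * np.square(cfrc_ext).sum(), 10.0)
    return uph_cost + 1.0 - quad_ctrl_cost - quad_impact_cost

# With zero action and zero contact force, only uph_cost + 1 remains.
r = standup_reward(pos_after=0.2, dt=0.05, action=np.zeros(17), cfrc_ext=np.zeros(84))
```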

### Starting State
All observations start in state
(0.0, 0.0, 0.105, 1.0, 0.0 ... 0.0) with a uniform noise in the range of
[-0.01, 0.01] added to the positional and velocity values (values in the table)
for stochasticity. Note that the initial z coordinate is intentionally selected
to be low, thereby indicating a humanoid lying down. The initial orientation is
designed to make it face forward as well.

### Episode End
The episode ends when any of the following happens:

1. Truncation: The episode duration reaches 1000 timesteps.
2. Termination: Any of the state space values is no longer finite.

### Arguments

No additional arguments are currently supported.

```
env = gymnasium.make('HumanoidStandup-v4')
```

There is no v3 for HumanoidStandup, unlike the robot environments where v3 and
beyond take gymnasium.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale, etc.


### Version History

* v4: all mujoco environments now use the mujoco bindings in mujoco>=2.1.3
* v3: support for gymnasium.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
* v0: Initial versions release (1.0.0)
@@ -1,126 +0,0 @@

---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Inverted Double Pendulum
---

# Inverted Double Pendulum

```{figure} ../../_static/videos/mujoco/inverted_double_pendulum.gif
:width: 200px
:name: inverted_double_pendulum
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-1.0, 1.0, (1,), float32) |
| Observation Shape | (11,) |
| Observation High | [inf inf inf inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | `gymnasium.make("InvertedDoublePendulum-v4")` |

### Description

This environment originates from control theory and builds on the cartpole
environment, based on the work done by Barto, Sutton, and Anderson in
["Neuronlike adaptive elements that can solve difficult learning control problems"](https://ieeexplore.ieee.org/document/6313077),
powered by the Mujoco physics simulator - allowing for more complex experiments
(such as varying the effects of gravity or constraints). This environment involves a cart that can be
moved linearly, with a pole fixed on it and a second pole fixed on the other end of the first one
(leaving the second pole as the only one with a free end). The cart can be pushed left or right,
and the goal is to balance the second pole on top of the first pole, which is in turn on top of the
cart, by applying continuous forces to the cart.

### Action Space
The agent takes a 1-element vector for actions.
The action space is a continuous `(action)` in `[-1, 1]`, where `action` represents the
numerical force applied to the cart (with magnitude representing the amount of force and
sign representing the direction).

| Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
|-----|---------------------------|-------------|-------------|----------------------------------|-------|-----------|
| 0 | Force applied on the cart | -1 | 1 | slider | slide | Force (N) |

### Observation Space

The state space consists of positional values of different body parts of the pendulum system,
followed by the velocities of those individual parts (their derivatives), with all the
positions ordered before all the velocities.

The observation is a `ndarray` with shape `(11,)` where the elements correspond to the following:

| Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
| --- | ----------------------------------------------------------------- | ---- | --- | -------------------------------- | ----- | ------------------------ |
| 0 | position of the cart along the linear surface | -Inf | Inf | slider | slide | position (m) |
| 1 | sine of the angle between the cart and the first pole | -Inf | Inf | sin(hinge) | hinge | unitless |
| 2 | sine of the angle between the two poles | -Inf | Inf | sin(hinge2) | hinge | unitless |
| 3 | cosine of the angle between the cart and the first pole | -Inf | Inf | cos(hinge) | hinge | unitless |
| 4 | cosine of the angle between the two poles | -Inf | Inf | cos(hinge2) | hinge | unitless |
| 5 | velocity of the cart | -Inf | Inf | slider | slide | velocity (m/s) |
| 6 | angular velocity of the angle between the cart and the first pole | -Inf | Inf | hinge | hinge | angular velocity (rad/s) |
| 7 | angular velocity of the angle between the two poles | -Inf | Inf | hinge2 | hinge | angular velocity (rad/s) |
| 8 | constraint force - 1 | -Inf | Inf | | | Force (N) |
| 9 | constraint force - 2 | -Inf | Inf | | | Force (N) |
| 10 | constraint force - 3 | -Inf | Inf | | | Force (N) |

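Because the observation encodes the angles as sine/cosine pairs (indices 1-4 above), the underlying angles can be recovered with `atan2`. The helper below is an illustrative sketch, not part of the Gymnasium API:

```python
import math

# Recover the two pole angles from the sin/cos entries of the
# InvertedDoublePendulum observation (indices per the table above).
def pole_angles(obs) -> tuple:
    theta1 = math.atan2(obs[1], obs[3])  # angle between cart and first pole
    theta2 = math.atan2(obs[2], obs[4])  # angle between the two poles
    return theta1, theta2
```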

There is physical contact between the robots and their environment - and Mujoco
attempts to get realistic physics simulations for the possible physical contact
dynamics by aiming for physical accuracy and computational efficiency.

There is one constraint force for contacts for each degree of freedom (3).
The approach to and handling of constraints by Mujoco is unique to the simulator
and is based on their research. One can find more information in their
[*documentation*](https://mujoco.readthedocs.io/en/latest/computation.html)
or in their paper
["Analytically-invertible dynamics with contacts and constraints: Theory and implementation in MuJoCo"](https://homes.cs.washington.edu/~todorov/papers/TodorovICRA14.pdf).

### Rewards

The reward consists of three parts:
- *alive_bonus*: The goal is to make the second inverted pendulum stand upright
  (within a certain angle limit) as long as possible - as such, a reward of +10 is awarded
  for each timestep that the second pole is upright.
- *distance_penalty*: This reward is a measure of how far the *tip* of the second pendulum
  (the only free end) moves, and it is calculated as
  *0.01 * x<sup>2</sup> + (y - 2)<sup>2</sup>*, where *x* is the x-coordinate of the tip
  and *y* is the y-coordinate of the tip of the second pole.
- *velocity_penalty*: A negative reward for penalising the agent if it moves too
  fast: *0.001 * v<sub>1</sub><sup>2</sup> + 0.005 * v<sub>2</sub><sup>2</sup>*

The total reward returned is ***reward*** *=* *alive_bonus - distance_penalty - velocity_penalty*
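
The three terms can be combined as in the sketch below; the argument names (`tip_x`, `tip_y`, `v1`, `v2`) are illustrative placeholders, not the environment's internals:

```python
# Hedged sketch of the InvertedDoublePendulum reward described above.
def double_pendulum_reward(tip_x: float, tip_y: float,
                           v1: float, v2: float) -> float:
    alive_bonus = 10.0
    distance_penalty = 0.01 * tip_x ** 2 + (tip_y - 2.0) ** 2
    velocity_penalty = 1e-3 * v1 ** 2 + 5e-3 * v2 ** 2
    return alive_bonus - distance_penalty - velocity_penalty
```

At the perfectly balanced, motionless state (tip at x = 0, y = 2, zero velocities) both penalties vanish and the reward equals the alive bonus.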

### Starting State
All observations start in state
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) with a uniform noise in the range
of [-0.1, 0.1] added to the positional values (cart position and pole angles) and standard
normal noise with a standard deviation of 0.1 added to the velocity values for stochasticity.

### Episode End
The episode ends when any of the following happens:

1. Truncation: The episode duration reaches 1000 timesteps.
2. Termination: Any of the state space values is no longer finite.
3. Termination: The y-coordinate of the tip of the second pole *is less than or equal* to 1. (The maximum standing height of the system is 1.196 m when all the parts are perpendicularly vertical on top of each other.)

### Arguments

No additional arguments are currently supported.

```
env = gymnasium.make('InvertedDoublePendulum-v4')
```

There is no v3 for InvertedDoublePendulum, unlike the robot environments where v3 and
beyond take gymnasium.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale, etc.


### Version History

* v4: all mujoco environments now use the mujoco bindings in mujoco>=2.1.3
* v3: support for gymnasium.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks (including inverted pendulum)
* v0: Initial versions release (1.0.0)
@@ -1,97 +0,0 @@

---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Inverted Pendulum
---

# Inverted Pendulum

```{figure} ../../_static/videos/mujoco/inverted_pendulum.gif
:width: 200px
:name: inverted_pendulum
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-3.0, 3.0, (1,), float32) |
| Observation Shape | (4,) |
| Observation High | [inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf] |
| Import | `gymnasium.make("InvertedPendulum-v4")` |

### Description

This environment is the cartpole environment, based on the work done by
Barto, Sutton, and Anderson in ["Neuronlike adaptive elements that can
solve difficult learning control problems"](https://ieeexplore.ieee.org/document/6313077),
just like in the classic environments, but now powered by the Mujoco physics simulator -
allowing for more complex experiments (such as varying the effects of gravity).
This environment involves a cart that can be moved linearly, with a pole fixed on it
at one end and the other end free. The cart can be pushed left or right, and the
goal is to balance the pole on top of the cart by applying forces to the cart.

### Action Space
The agent takes a 1-element vector for actions.

The action space is a continuous `(action)` in `[-3, 3]`, where `action` represents
the numerical force applied to the cart (with magnitude representing the amount of
force and sign representing the direction).

| Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
|-----|---------------------------|-------------|-------------|----------------------------------|-------|-----------|
| 0 | Force applied on the cart | -3 | 3 | slider | slide | Force (N) |

### Observation Space

The state space consists of positional values of different body parts of
the pendulum system, followed by the velocities of those individual parts (their derivatives),
with all the positions ordered before all the velocities.

The observation is a `ndarray` with shape `(4,)` where the elements correspond to the following:

| Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
| --- | --------------------------------------------- | ---- | --- | -------------------------------- | ----- | ------------------------ |
| 0 | position of the cart along the linear surface | -Inf | Inf | slider | slide | position (m) |
| 1 | vertical angle of the pole on the cart | -Inf | Inf | hinge | hinge | angle (rad) |
| 2 | linear velocity of the cart | -Inf | Inf | slider | slide | velocity (m/s) |
| 3 | angular velocity of the pole on the cart | -Inf | Inf | hinge | hinge | angular velocity (rad/s) |

### Rewards

The goal is to make the inverted pendulum stand upright (within a certain angle limit)
as long as possible - as such, a reward of +1 is awarded for each timestep that
the pole is upright.

### Starting State
All observations start in state
(0.0, 0.0, 0.0, 0.0) with a uniform noise in the range
of [-0.01, 0.01] added to the values for stochasticity.

### Episode End
The episode ends when any of the following happens:

1. Truncation: The episode duration reaches 1000 timesteps.
2. Termination: Any of the state space values is no longer finite.
3. Termination: The absolute value of the vertical angle between the pole and the cart is greater than 0.2 radians.

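The episode-end conditions above can be sketched as a small predicate; the function and its arguments are illustrative, not part of the Gymnasium API (the observation layout follows the table above, with the pole angle at index 1):

```python
import math

# Hedged sketch of the InvertedPendulum episode-end conditions.
def episode_end(obs, step_count: int) -> tuple:
    non_finite = not all(math.isfinite(x) for x in obs)
    terminated = non_finite or abs(obs[1]) > 0.2  # angle limit in radians
    truncated = step_count >= 1000                # time limit
    return terminated, truncated
```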

### Arguments

No additional arguments are currently supported.

```
env = gymnasium.make('InvertedPendulum-v4')
```

There is no v3 for InvertedPendulum, unlike the robot environments where
v3 and beyond take gymnasium.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale, etc.


### Version History

* v4: all mujoco environments now use the mujoco bindings in mujoco>=2.1.3
* v3: support for gymnasium.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks (including inverted pendulum)
* v0: Initial versions release (1.0.0)
@@ -1,142 +0,0 @@

---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Pusher
---

# Pusher

```{figure} ../../_static/videos/mujoco/pusher.gif
:width: 200px
:name: pusher
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-2.0, 2.0, (7,), float32) |
| Observation Shape | (23,) |
| Observation High | [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | `gymnasium.make("Pusher-v4")` |


### Description
"Pusher" is a multi-jointed robot arm which is very similar to that of a human.
The goal is to move a target cylinder (called *object*) to a goal position using the robot's end effector (called *fingertip*).
The robot consists of shoulder, elbow, forearm, and wrist joints.

### Action Space
The action space is a `Box(-2, 2, (7,), float32)`. An action represents the torques applied at the seven hinge joints.

| Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
|-----|--------------------------------------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
| 0 | Rotation of the shoulder panning joint | -2 | 2 | r_shoulder_pan_joint | hinge | torque (N m) |
| 1 | Rotation of the shoulder lifting joint | -2 | 2 | r_shoulder_lift_joint | hinge | torque (N m) |
| 2 | Rotation of the shoulder rolling joint | -2 | 2 | r_upper_arm_roll_joint | hinge | torque (N m) |
| 3 | Rotation of the hinge joint that flexes the elbow | -2 | 2 | r_elbow_flex_joint | hinge | torque (N m) |
| 4 | Rotation of the hinge that rolls the forearm | -2 | 2 | r_forearm_roll_joint | hinge | torque (N m) |
| 5 | Rotation of the joint that flexes the wrist | -2 | 2 | r_wrist_flex_joint | hinge | torque (N m) |
| 6 | Rotation of the joint that rolls the wrist | -2 | 2 | r_wrist_roll_joint | hinge | torque (N m) |

### Observation Space

Observations consist of

- Angles of the rotational joints on the pusher
- Angular velocities of the rotational joints on the pusher
- The coordinates of the fingertip of the pusher
- The coordinates of the object to be moved
- The coordinates of the goal position

The observation is a `ndarray` with shape `(23,)` where the elements correspond to the table below.
An analogy can be drawn to a human arm in order to help understand the state space, with the words flex and roll meaning the
same as for human joints.

| Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
| --- | -------------------------------------------------------- | ---- | --- | -------------------------------- | -------- | ------------------------ |
| 0 | Rotation of the shoulder panning joint | -Inf | Inf | r_shoulder_pan_joint | hinge | angle (rad) |
| 1 | Rotation of the shoulder lifting joint | -Inf | Inf | r_shoulder_lift_joint | hinge | angle (rad) |
| 2 | Rotation of the shoulder rolling joint | -Inf | Inf | r_upper_arm_roll_joint | hinge | angle (rad) |
| 3 | Rotation of the hinge joint that flexes the elbow | -Inf | Inf | r_elbow_flex_joint | hinge | angle (rad) |
| 4 | Rotation of the hinge that rolls the forearm | -Inf | Inf | r_forearm_roll_joint | hinge | angle (rad) |
| 5 | Rotation of the joint that flexes the wrist | -Inf | Inf | r_wrist_flex_joint | hinge | angle (rad) |
| 6 | Rotation of the joint that rolls the wrist | -Inf | Inf | r_wrist_roll_joint | hinge | angle (rad) |
| 7 | Rotational velocity of the shoulder panning joint | -Inf | Inf | r_shoulder_pan_joint | hinge | angular velocity (rad/s) |
| 8 | Rotational velocity of the shoulder lifting joint | -Inf | Inf | r_shoulder_lift_joint | hinge | angular velocity (rad/s) |
| 9 | Rotational velocity of the shoulder rolling joint | -Inf | Inf | r_upper_arm_roll_joint | hinge | angular velocity (rad/s) |
| 10 | Rotational velocity of the hinge joint that flexes the elbow | -Inf | Inf | r_elbow_flex_joint | hinge | angular velocity (rad/s) |
| 11 | Rotational velocity of the hinge that rolls the forearm | -Inf | Inf | r_forearm_roll_joint | hinge | angular velocity (rad/s) |
| 12 | Rotational velocity of the joint that flexes the wrist | -Inf | Inf | r_wrist_flex_joint | hinge | angular velocity (rad/s) |
| 13 | Rotational velocity of the joint that rolls the wrist | -Inf | Inf | r_wrist_roll_joint | hinge | angular velocity (rad/s) |
| 14 | x-coordinate of the fingertip of the pusher | -Inf | Inf | tips_arm | slide | position (m) |
| 15 | y-coordinate of the fingertip of the pusher | -Inf | Inf | tips_arm | slide | position (m) |
| 16 | z-coordinate of the fingertip of the pusher | -Inf | Inf | tips_arm | slide | position (m) |
| 17 | x-coordinate of the object to be moved | -Inf | Inf | object (obj_slidex) | slide | position (m) |
| 18 | y-coordinate of the object to be moved | -Inf | Inf | object (obj_slidey) | slide | position (m) |
| 19 | z-coordinate of the object to be moved | -Inf | Inf | object | cylinder | position (m) |
| 20 | x-coordinate of the goal position of the object | -Inf | Inf | goal (goal_slidex) | slide | position (m) |
| 21 | y-coordinate of the goal position of the object | -Inf | Inf | goal (goal_slidey) | slide | position (m) |
| 22 | z-coordinate of the goal position of the object | -Inf | Inf | goal | sphere | position (m) |

|
||||
|
||||
### Rewards
The reward consists of three parts:
- *reward_near*: A measure of how far the *fingertip* of the pusher (the unattached end)
is from the object, with a more negative value assigned when the pusher's *fingertip*
is further away from the object. It is calculated as the negative vector norm of
(position of the fingertip - position of the object), or *-norm("fingertip" - "object")*.
- *reward_dist*: A measure of how far the object is from the target goal position,
with a more negative value assigned when the object is further away from the target.
It is calculated as the negative vector norm of (position of the object - position of
the goal), or *-norm("object" - "goal")*.
- *reward_control*: A negative reward penalising the pusher for taking actions
that are too large. It is measured as the negative squared Euclidean norm of the
action, i.e. as *- sum(action<sup>2</sup>)*.

The total reward returned is ***reward*** *=* *reward_dist + 0.1 * reward_ctrl + 0.5 * reward_near*

Unlike other environments, Pusher does not allow you to specify weights for the individual reward terms.
However, `info` does contain the keys *reward_dist* and *reward_ctrl*. Thus, if you'd like to weight the terms,
you should create a wrapper that computes the weighted reward from `info`.
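As a concrete illustration, the reweighting could be done with a small helper that backs out the default dist/ctrl contributions and re-adds them with new weights. This is a sketch, not Gymnasium API: the function name is made up, and it assumes the total follows the formula above (weights 1.0 for *reward_dist*, 0.1 for *reward_ctrl*). The *reward_near* term is left untouched because it is not reported in `info`.

```python
def reweighted_pusher_reward(reward, info, dist_weight=1.0, ctrl_weight=0.1):
    """Recombine Pusher's reward with custom weights for the terms in `info`.

    Assumes the default total is reward_dist + 0.1 * reward_ctrl + 0.5 * reward_near,
    so subtracting the default dist/ctrl contributions leaves the near term.
    """
    near_part = reward - info["reward_dist"] - 0.1 * info["reward_ctrl"]
    return near_part + dist_weight * info["reward_dist"] + ctrl_weight * info["reward_ctrl"]
```

In practice this logic would live in a `gymnasium.Wrapper` whose `step` method rewrites the reward before returning it.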
### Starting State
All pusher (not including object and goal) states start in
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0). A uniform noise in the range
[-0.005, 0.005] is added to the velocity attributes only. The velocities of
the object and goal are permanently set to 0. The object's x-position is selected uniformly
between [-0.3, 0] while the y-position is selected uniformly between [-0.2, 0.2], and this
process is repeated until the vector norm between the object's (x, y) position and the origin is not greater
than 0.17. The goal always has the same position of (0.45, -0.05, -0.323).

The default frame rate is 5, with each frame lasting 0.01 seconds, giving rise to *dt = 5 * 0.01 = 0.05*.

### Episode End

The episode ends when any of the following happens:

1. Truncation: The episode duration reaches 100 timesteps.
2. Termination: Any of the state space values is no longer finite.

### Arguments

No additional arguments are currently supported (in v2 and lower),
but modifications can be made to the XML file in the assets folder
(or by changing the path to a modified XML file in another folder).

```
env = gymnasium.make('Pusher-v4')
```

There is no v3 for Pusher, unlike the robot environments where v3 and
beyond take `gymnasium.make` kwargs such as `xml_file`, `ctrl_cost_weight`, `reset_noise_scale`, etc.

### Version History

* v4: All MuJoCo environments now use the MuJoCo bindings in mujoco>=2.1.3
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks (not including reacher, which has a max_time_steps of 50). Added reward_threshold to environments.
* v0: Initial versions release (1.0.0)
@@ -1,132 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Reacher
---

# Reacher

```{figure} ../../_static/videos/mujoco/reacher.gif
:width: 200px
:name: reacher
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-1.0, 1.0, (2,), float32) |
| Observation Shape | (11,) |
| Observation High | [inf inf inf inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | `gymnasium.make("Reacher-v4")` |

### Description
"Reacher" is a two-jointed robot arm. The goal is to move the robot's end effector (called *fingertip*) close to a
target that is spawned at a random position.

### Action Space
The action space is a `Box(-1, 1, (2,), float32)`. An action `(a, b)` represents the torques applied at the hinge joints.

| Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
|-----|---------------------------------------------------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
| 0 | Torque applied at the first hinge (connecting the link to the point of fixture) | -1 | 1 | joint0 | hinge | torque (N m) |
| 1 | Torque applied at the second hinge (connecting the two links) | -1 | 1 | joint1 | hinge | torque (N m) |

### Observation Space

Observations consist of

- The cosine of the angles of the two arms
- The sine of the angles of the two arms
- The coordinates of the target
- The angular velocities of the arms
- The vector between the target and the reacher's fingertip (3 dimensional with the last element being 0)

The observation is a `ndarray` with shape `(11,)` where the elements correspond to the following:

| Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
| --- | ---------------------------------------------------------------------------------------------- | ---- | --- | -------------------------------- | ----- | ------------------------ |
| 0 | cosine of the angle of the first arm | -Inf | Inf | cos(joint0) | hinge | unitless |
| 1 | cosine of the angle of the second arm | -Inf | Inf | cos(joint1) | hinge | unitless |
| 2 | sine of the angle of the first arm | -Inf | Inf | sin(joint0) | hinge | unitless |
| 3 | sine of the angle of the second arm | -Inf | Inf | sin(joint1) | hinge | unitless |
| 4 | x-coordinate of the target | -Inf | Inf | target_x | slide | position (m) |
| 5 | y-coordinate of the target | -Inf | Inf | target_y | slide | position (m) |
| 6 | angular velocity of the first arm | -Inf | Inf | joint0 | hinge | angular velocity (rad/s) |
| 7 | angular velocity of the second arm | -Inf | Inf | joint1 | hinge | angular velocity (rad/s) |
| 8 | x-value of position_fingertip - position_target | -Inf | Inf | NA | slide | position (m) |
| 9 | y-value of position_fingertip - position_target | -Inf | Inf | NA | slide | position (m) |
| 10 | z-value of position_fingertip - position_target (0 since reacher is 2d and z is same for both) | -Inf | Inf | NA | slide | position (m) |

Most Gymnasium environments just return the positions and velocities of the
joints in the `.xml` file as the state of the environment. In
Reacher, however, the state is created by combining only certain elements of the
position and velocity and performing some function transformations on them.
If one reads the `.xml` for Reacher, they will find 4 joints:

| Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
|-----|-----------------------------|----------|----------|----------------------------------|-------|--------------------|
| 0 | angle of the first arm | -Inf | Inf | joint0 | hinge | angle (rad) |
| 1 | angle of the second arm | -Inf | Inf | joint1 | hinge | angle (rad) |
| 2 | x-coordinate of the target | -Inf | Inf | target_x | slide | position (m) |
| 3 | y-coordinate of the target | -Inf | Inf | target_y | slide | position (m) |

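To illustrate the layout in the tables above, the 11-dimensional observation can be thought of as a concatenation of transformed joint state. The following sketch uses illustrative variable names and is not the environment's internal code:

```python
import numpy as np

def assemble_reacher_obs(theta, target_xy, theta_dot, fingertip_xyz, target_xyz):
    """Build an 11-dim Reacher-style observation from raw joint state.

    theta: (2,) joint angles; target_xy: (2,) target position;
    theta_dot: (2,) angular velocities; fingertip_xyz/target_xyz: (3,) positions.
    """
    return np.concatenate([
        np.cos(theta),               # obs[0:2]  cosines of the arm angles
        np.sin(theta),               # obs[2:4]  sines of the arm angles
        target_xy,                   # obs[4:6]  target coordinates
        theta_dot,                   # obs[6:8]  angular velocities
        fingertip_xyz - target_xyz,  # obs[8:11] fingertip-to-target vector (last element 0 in 2D)
    ])
```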
### Rewards
The reward consists of two parts:
- *reward_distance*: A measure of how far the *fingertip* of the reacher
(the unattached end) is from the target, with a more negative value assigned
when the reacher's *fingertip* is further away from the target. It is
calculated as the negative vector norm of (position of the fingertip -
position of the target), or *-norm("fingertip" - "target")*.
- *reward_control*: A negative reward penalising the reacher for taking
actions that are too large. It is measured as the negative squared
Euclidean norm of the action, i.e. as *- sum(action<sup>2</sup>)*.

The total reward returned is ***reward*** *=* *reward_distance + reward_control*

Unlike other environments, Reacher does not allow you to specify weights for the individual reward terms.
However, `info` does contain the keys *reward_dist* and *reward_ctrl*. Thus, if you'd like to weight the terms,
you should create a wrapper that computes the weighted reward from `info`.

### Starting State
All observations start in state
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
with noise added for stochasticity. A uniform noise in the range
[-0.1, 0.1] is added to the positional attributes, while the target position
is selected uniformly at random in a disk of radius 0.2 around the origin.
Independent, uniform noise in the range [-0.005, 0.005] is added to the
velocities, and the last element ("fingertip" - "target") is calculated at
the end once everything is set. The default setting has a frame rate of 2
and *dt = 2 * 0.01 = 0.02*.

### Episode End

The episode ends when any of the following happens:

1. Truncation: The episode duration reaches 50 timesteps (with a new random target popping up if the reacher's fingertip reaches it before 50 timesteps)
2. Termination: Any of the state space values is no longer finite.

### Arguments

No additional arguments are currently supported (in v2 and lower),
but modifications can be made to the XML file in the assets folder
(or by changing the path to a modified XML file in another folder).

```
env = gymnasium.make('Reacher-v4')
```

There is no v3 for Reacher, unlike the robot environments where v3 and
beyond take `gymnasium.make` kwargs such as `xml_file`, `ctrl_cost_weight`, `reset_noise_scale`, etc.

### Version History

* v4: All MuJoCo environments now use the MuJoCo bindings in mujoco>=2.1.3
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks (not including reacher, which has a max_time_steps of 50). Added reward_threshold to environments.
* v0: Initial versions release (1.0.0)
@@ -1,134 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Swimmer
---

# Swimmer

```{figure} ../../_static/videos/mujoco/swimmer.gif
:width: 200px
:name: swimmer
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-1.0, 1.0, (2,), float32) |
| Observation Shape | (8,) |
| Observation High | [inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | `gymnasium.make("Swimmer-v4")` |

### Description

This environment corresponds to the Swimmer environment described in Rémi Coulom's PhD thesis
["Reinforcement Learning Using Neural Networks, with Applications to Motor Control"](https://tel.archives-ouvertes.fr/tel-00003985/document).
The environment aims to increase the number of independent state and control
variables as compared to the classic control environments. The swimmers
consist of three or more segments ('***links***') and one fewer articulation
joints ('***rotors***'), with each rotor joint connecting exactly two links to
form a linear chain. The swimmer is suspended in a two-dimensional pool and
always starts in the same position (subject to some deviation drawn from a
uniform distribution), and the goal is to move as fast as possible towards
the right by applying torque on the rotors and using the fluid's friction.

### Notes

The problem parameters are:
* *n*: number of body parts
* *m<sub>i</sub>*: mass of part *i* (*i* ∈ {1...n})
* *l<sub>i</sub>*: length of part *i* (*i* ∈ {1...n})
* *k*: viscous-friction coefficient

The default environment has *n* = 3, *l<sub>i</sub>* = 0.1,
and *k* = 0.1. It is possible to pass a custom MuJoCo XML file during construction to increase the
number of links, or to tweak any of the parameters.

### Action Space
The action space is a `Box(-1, 1, (2,), float32)`. An action represents the torques applied between *links*.

| Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
|-----|------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
| 0 | Torque applied on the first rotor | -1 | 1 | rot2 | hinge | torque (N m) |
| 1 | Torque applied on the second rotor | -1 | 1 | rot3 | hinge | torque (N m) |

### Observation Space

By default, observations consist of:
* θ<sub>i</sub>: angle of part *i* with respect to the *x* axis
* θ<sub>i</sub>': its derivative with respect to time (angular velocity)

In the default case, observations do not include the x- and y-coordinates of the front tip. These may
be included by passing `exclude_current_positions_from_observation=False` during construction.
In that case, the observation space will have 10 dimensions, where the first two dimensions
represent the x- and y-coordinates of the front tip.
Regardless of whether `exclude_current_positions_from_observation` is set to true or false, the x- and y-coordinates
will be returned in `info` with keys `"x_position"` and `"y_position"`, respectively.

By default, the observation is a `ndarray` with shape `(8,)` where the elements correspond to the following:

| Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
| --- | ------------------------------------ | ---- | --- | -------------------------------- | ----- | ------------------------ |
| 0 | angle of the front tip | -Inf | Inf | rot | hinge | angle (rad) |
| 1 | angle of the first rotor | -Inf | Inf | rot2 | hinge | angle (rad) |
| 2 | angle of the second rotor | -Inf | Inf | rot3 | hinge | angle (rad) |
| 3 | velocity of the tip along the x-axis | -Inf | Inf | slider1 | slide | velocity (m/s) |
| 4 | velocity of the tip along the y-axis | -Inf | Inf | slider2 | slide | velocity (m/s) |
| 5 | angular velocity of front tip | -Inf | Inf | rot | hinge | angular velocity (rad/s) |
| 6 | angular velocity of first rotor | -Inf | Inf | rot2 | hinge | angular velocity (rad/s) |
| 7 | angular velocity of second rotor | -Inf | Inf | rot3 | hinge | angular velocity (rad/s) |

### Rewards
The reward consists of two parts:
- *forward_reward*: A reward for moving forward, measured
as *`forward_reward_weight` * (x-coordinate after action - x-coordinate before action)/dt*. *dt* is
the time between actions and depends on the `frame_skip` parameter
(default is 4), where the frametime is 0.01, making the
default *dt = 4 * 0.01 = 0.04*. This reward is positive if the swimmer
swims right as desired.
- *ctrl_cost*: A cost penalising the swimmer for taking
actions that are too large. It is measured as *`ctrl_cost_weight` *
sum(action<sup>2</sup>)*, where *`ctrl_cost_weight`* is a parameter set for the
control and has a default value of 1e-4.

The total reward returned is ***reward*** *=* *forward_reward - ctrl_cost*, and `info` will also contain the individual reward terms.

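The two reward terms above can be sketched directly. This is an illustration with made-up variable names, not the environment's internal code; it assumes the default *dt*, `forward_reward_weight`, and `ctrl_cost_weight` values described above:

```python
import numpy as np

def swimmer_reward(x_before, x_after, action, dt=0.04,
                   forward_reward_weight=1.0, ctrl_cost_weight=1e-4):
    """Sketch of the Swimmer reward: forward progress minus control cost."""
    forward_reward = forward_reward_weight * (x_after - x_before) / dt
    ctrl_cost = ctrl_cost_weight * np.sum(np.square(action))
    return forward_reward - ctrl_cost
```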
### Starting State
All observations start in state (0, 0, 0, 0, 0, 0, 0, 0), with uniform noise in the range of [-`reset_noise_scale`, `reset_noise_scale`] added to the initial state for stochasticity.

### Episode End
The episode truncates when the episode length is greater than 1000.

### Arguments

No additional arguments are currently supported in v2 and lower.

```
gymnasium.make('Swimmer-v4')
```

v3 and v4 take `gymnasium.make` kwargs such as `xml_file`, `ctrl_cost_weight`, `reset_noise_scale`, etc.

```
env = gymnasium.make('Swimmer-v4', ctrl_cost_weight=0.1, ....)
```

| Parameter | Type | Default | Description |
| -------------------------------------------- | --------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `xml_file` | **str** | `"swimmer.xml"` | Path to a MuJoCo model |
| `forward_reward_weight` | **float** | `1.0` | Weight for _forward_reward_ term (see section on reward) |
| `ctrl_cost_weight` | **float** | `1e-4` | Weight for _ctrl_cost_ term (see section on reward) |
| `reset_noise_scale` | **float** | `0.1` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
| `exclude_current_positions_from_observation` | **bool** | `True` | Whether or not to omit the x- and y-coordinates from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |

### Version History

* v4: All MuJoCo environments now use the MuJoCo bindings in mujoco>=2.1.3
* v3: Support for `gymnasium.make` kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale, etc. RGB rendering comes from the tracking camera (so the agent does not run away from the screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
* v0: Initial versions release (1.0.0)
@@ -1,151 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Walker2D
lastpage:
---

# Walker2D

```{figure} ../../_static/videos/mujoco/walker2d.gif
:width: 200px
:name: walker2d
```

This environment is part of the <a href='..'>Mujoco environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Box(-1.0, 1.0, (6,), float32) |
| Observation Shape | (17,) |
| Observation High | [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf] |
| Observation Low | [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf] |
| Import | `gymnasium.make("Walker2d-v4")` |

### Description

This environment builds on the hopper environment based on the work done by Erez, Tassa, and Todorov
in ["Infinite Horizon Model Predictive Control for Nonlinear Periodic Tasks"](http://www.roboticsproceedings.org/rss07/p10.pdf)
by adding another set of legs, making it possible for the robot to walk forward instead of
hop. Like other MuJoCo environments, this environment aims to increase the number of independent state
and control variables as compared to the classic control environments. The walker is a
two-dimensional two-legged figure that consists of four main body parts - a single torso at the top
(with the two legs splitting after the torso), two thighs in the middle below the torso, two legs
at the bottom below the thighs, and two feet attached to the legs on which the entire body rests.
The goal is to coordinate both sets of feet, legs, and thighs to move in the forward (right)
direction by applying torques on the six hinges connecting the six body parts.

### Action Space
The action space is a `Box(-1, 1, (6,), float32)`. An action represents the torques applied at the hinge joints.

| Num | Action | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit |
|-----|----------------------------------------|-------------|-------------|----------------------------------|-------|--------------|
| 0 | Torque applied on the thigh rotor | -1 | 1 | thigh_joint | hinge | torque (N m) |
| 1 | Torque applied on the leg rotor | -1 | 1 | leg_joint | hinge | torque (N m) |
| 2 | Torque applied on the foot rotor | -1 | 1 | foot_joint | hinge | torque (N m) |
| 3 | Torque applied on the left thigh rotor | -1 | 1 | thigh_left_joint | hinge | torque (N m) |
| 4 | Torque applied on the left leg rotor | -1 | 1 | leg_left_joint | hinge | torque (N m) |
| 5 | Torque applied on the left foot rotor | -1 | 1 | foot_left_joint | hinge | torque (N m) |

### Observation Space

Observations consist of positional values of different body parts of the walker,
followed by the velocities of those individual parts (their derivatives), with all the positions ordered before all the velocities.

By default, observations do not include the x-coordinate of the top. It may
be included by passing `exclude_current_positions_from_observation=False` during construction.
In that case, the observation space will have 18 dimensions, where the first dimension
represents the x-coordinate of the top of the walker.
Regardless of whether `exclude_current_positions_from_observation` is set to true or false, the x-coordinate
of the top will be returned in `info` with key `"x_position"`.

By default, the observation is a `ndarray` with shape `(17,)` where the elements correspond to the following:

| Num | Observation | Min | Max | Name (in corresponding XML file) | Joint | Unit |
| --- | ------------------------------------------------ | ---- | --- | -------------------------------- | ----- | ------------------------ |
| 0 | z-coordinate of the top (height of the walker) | -Inf | Inf | rootz (torso) | slide | position (m) |
| 1 | angle of the top | -Inf | Inf | rooty (torso) | hinge | angle (rad) |
| 2 | angle of the thigh joint | -Inf | Inf | thigh_joint | hinge | angle (rad) |
| 3 | angle of the leg joint | -Inf | Inf | leg_joint | hinge | angle (rad) |
| 4 | angle of the foot joint | -Inf | Inf | foot_joint | hinge | angle (rad) |
| 5 | angle of the left thigh joint | -Inf | Inf | thigh_left_joint | hinge | angle (rad) |
| 6 | angle of the left leg joint | -Inf | Inf | leg_left_joint | hinge | angle (rad) |
| 7 | angle of the left foot joint | -Inf | Inf | foot_left_joint | hinge | angle (rad) |
| 8 | velocity of the x-coordinate of the top | -Inf | Inf | rootx | slide | velocity (m/s) |
| 9 | velocity of the z-coordinate (height) of the top | -Inf | Inf | rootz | slide | velocity (m/s) |
| 10 | angular velocity of the angle of the top | -Inf | Inf | rooty | hinge | angular velocity (rad/s) |
| 11 | angular velocity of the thigh hinge | -Inf | Inf | thigh_joint | hinge | angular velocity (rad/s) |
| 12 | angular velocity of the leg hinge | -Inf | Inf | leg_joint | hinge | angular velocity (rad/s) |
| 13 | angular velocity of the foot hinge | -Inf | Inf | foot_joint | hinge | angular velocity (rad/s) |
| 14 | angular velocity of the left thigh hinge | -Inf | Inf | thigh_left_joint | hinge | angular velocity (rad/s) |
| 15 | angular velocity of the left leg hinge | -Inf | Inf | leg_left_joint | hinge | angular velocity (rad/s) |
| 16 | angular velocity of the left foot hinge | -Inf | Inf | foot_left_joint | hinge | angular velocity (rad/s) |

### Rewards
The reward consists of three parts:
- *healthy_reward*: Every timestep that the walker is alive, it receives a fixed reward of value `healthy_reward`.
- *forward_reward*: A reward for walking forward, measured as
*`forward_reward_weight` * (x-coordinate after action - x-coordinate before action)/dt*.
*dt* is the time between actions and depends on the `frame_skip` parameter
(default is 4), where the frametime is 0.002, making the default
*dt = 4 * 0.002 = 0.008*. This reward is positive if the walker walks forward (right) as desired.
- *ctrl_cost*: A cost penalising the walker for taking
actions that are too large. It is measured as
*`ctrl_cost_weight` * sum(action<sup>2</sup>)*, where *`ctrl_cost_weight`* is
a parameter set for the control and has a default value of 0.001.

The total reward returned is ***reward*** *=* *healthy_reward + forward_reward - ctrl_cost*, and `info` will also contain the individual reward terms.

### Starting State
All observations start in state
(0.0, 1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
with a uniform noise in the range of [-`reset_noise_scale`, `reset_noise_scale`] added to the values for stochasticity.

### Episode End
The walker is said to be unhealthy if any of the following happens:

1. Any of the state space values is no longer finite
2. The height of the walker is ***not*** in the closed interval specified by `healthy_z_range`
3. The absolute value of the angle (`observation[1]` if `exclude_current_positions_from_observation=True`, else `observation[2]`) is ***not*** in the closed interval specified by `healthy_angle_range`

If `terminate_when_unhealthy=True` is passed during construction (which is the default),
the episode ends when any of the following happens:

1. Truncation: The episode duration reaches 1000 timesteps
2. Termination: The walker is unhealthy

If `terminate_when_unhealthy=False` is passed, the episode is ended only when 1000 timesteps are exceeded.

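The health conditions above can be sketched as a small predicate over the default 17-dimensional observation (this is an illustration with a made-up function name, using the default ranges, not the environment's internal code):

```python
import math

def walker_is_healthy(state, healthy_z_range=(0.8, 2.0),
                      healthy_angle_range=(-1.0, 1.0)):
    """Sketch of the Walker2d health check with the default observation layout.

    `state` is the 17-dim observation: state[0] is the height (z), state[1] the angle.
    """
    if not all(math.isfinite(v) for v in state):  # condition 1: all values finite
        return False
    z, angle = state[0], state[1]
    healthy_z = healthy_z_range[0] <= z <= healthy_z_range[1]          # condition 2
    healthy_angle = healthy_angle_range[0] <= angle <= healthy_angle_range[1]  # condition 3
    return healthy_z and healthy_angle
```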
### Arguments

No additional arguments are currently supported in v2 and lower.

```
env = gymnasium.make('Walker2d-v4')
```

v3 and beyond take `gymnasium.make` kwargs such as `xml_file`, `ctrl_cost_weight`, `reset_noise_scale`, etc.

```
env = gymnasium.make('Walker2d-v4', ctrl_cost_weight=0.1, ....)
```

| Parameter | Type | Default | Description |
| -------------------------------------------- | --------- | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `xml_file` | **str** | `"walker2d.xml"` | Path to a MuJoCo model |
| `forward_reward_weight` | **float** | `1.0` | Weight for _forward_reward_ term (see section on reward) |
| `ctrl_cost_weight` | **float** | `1e-3` | Weight for _ctrl_cost_ term (see section on reward) |
| `healthy_reward` | **float** | `1.0` | Constant reward given if the walker is "healthy" after each timestep |
| `terminate_when_unhealthy` | **bool** | `True` | If true, issue a done signal if the walker is no longer healthy |
| `healthy_z_range` | **tuple** | `(0.8, 2)` | The z-coordinate of the top of the walker must be in this range for it to be considered healthy |
| `healthy_angle_range` | **tuple** | `(-1, 1)` | The angle must be in this range for the walker to be considered healthy |
| `reset_noise_scale` | **float** | `5e-3` | Scale of random perturbations of initial position and velocity (see section on Starting State) |
| `exclude_current_positions_from_observation` | **bool** | `True` | Whether or not to omit the x-coordinate from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behavior in policies |

### Version History

* v4: All MuJoCo environments now use the MuJoCo bindings in mujoco>=2.1.3
* v3: Support for `gymnasium.make` kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale, etc. RGB rendering comes from the tracking camera (so the agent does not run away from the screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks. Added reward_threshold to environments.
* v0: Initial versions release (1.0.0)
@@ -1,82 +0,0 @@
---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Blackjack
firstpage:
---

# Blackjack

```{figure} ../../_static/videos/toy_text/blackjack.gif
:width: 200px
:name: blackjack
```

This environment is part of the <a href='..'>Toy Text environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Discrete(2) |
| Observation Space | Tuple(Discrete(32), Discrete(11), Discrete(2)) |
| Import | `gymnasium.make("Blackjack-v1")` |

Blackjack is a card game where the goal is to beat the dealer by obtaining cards
that sum to closer to 21 (without going over 21) than the dealer's cards.

### Description
Card Values:

- Face cards (Jack, Queen, King) have a point value of 10.
- Aces can either count as 11 (called a 'usable ace') or 1.
- Numerical cards (2-10) have a value equal to their number.

This game is played with an infinite deck (i.e. cards are drawn with replacement).
The game starts with the dealer having one face-up and one face-down card,
while the player has two face-up cards.

The player can request additional cards (hit, action=1) until they decide to stop (stick, action=0)
or exceed 21 (bust, an immediate loss).
After the player sticks, the dealer reveals their face-down card and draws
until their sum is 17 or greater. If the dealer goes bust, the player wins.
If neither the player nor the dealer busts, the outcome (win, lose, draw) is
decided by whose sum is closer to 21.

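The card values above determine a hand's sum, including the 'usable ace' bookkeeping. A minimal sketch (the helper name is made up, not the environment's API), assuming aces are passed in with value 1 and face cards with value 10:

```python
def blackjack_hand_value(cards):
    """Return (sum, usable_ace) for a hand of per-card point values."""
    total = sum(cards)
    # One ace may count as 11 instead of 1 when that does not bust the hand.
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False
```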
### Action Space
There are two actions: stick (0), and hit (1).

### Observation Space
The observation consists of a 3-tuple containing: the player's current sum,
the value of the dealer's one showing card (1-10 where 1 is ace),
and whether the player holds a usable ace (0 or 1).

This environment corresponds to the version of the blackjack problem
described in Example 5.1 in Reinforcement Learning: An Introduction
by Sutton and Barto (http://incompleteideas.net/book/the-book-2nd.html).

### Rewards
- win game: +1
- lose game: -1
- draw game: 0
- win game with natural blackjack:

  +1.5 (if <a href="#nat">natural</a> is True)

  +1 (if <a href="#nat">natural</a> is False)

### Arguments
|
||||
|
||||
```
|
||||
gymnasium.make('Blackjack-v1', natural=False, sab=False)
|
||||
```
|
||||
|
||||
<a id="nat">`natural=False`</a>: Whether to give an additional reward for
|
||||
starting with a natural blackjack, i.e. starting with an ace and ten (sum is 21).
|
||||
|
||||
<a id="sab">`sab=False`</a>: Whether to follow the exact rules outlined in the book by
|
||||
Sutton and Barto. If `sab` is `True`, the keyword argument `natural` will be ignored.
|
||||
If the player achieves a natural blackjack and the dealer does not, the player
|
||||
will win (i.e. get a reward of +1). The reverse rule does not apply.
|
||||
If both the player and the dealer get a natural, it will be a draw (i.e. reward 0).
|
||||
|
||||
### Version History
|
||||
* v0: Initial versions release (1.0.0)
|
@@ -1,64 +0,0 @@

---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Cliff Walking
---

# Cliff Walking

```{figure} ../../_static/videos/toy_text/cliff_walking.gif
:width: 200px
:name: cliff_walking
```

This environment is part of the <a href='..'>Toy Text environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Discrete(4) |
| Observation Space | Discrete(48) |
| Import | `gymnasium.make("CliffWalking-v0")` |


This is a simple implementation of the Gridworld Cliff
reinforcement learning task.

Adapted from Example 6.6 (page 106) of [Reinforcement Learning: An Introduction
by Sutton and Barto](http://incompleteideas.net/book/bookdraft2018jan1.pdf).

With inspiration from:
[https://github.com/dennybritz/reinforcement-learning/blob/master/lib/envs/cliff_walking.py](https://github.com/dennybritz/reinforcement-learning/blob/master/lib/envs/cliff_walking.py)

### Description
The board is a 4x12 matrix, with (using NumPy matrix indexing):
- [3, 0] as the start at bottom-left
- [3, 11] as the goal at bottom-right
- [3, 1..10] as the cliff at bottom-center

If the agent steps on the cliff, it returns to the start.
An episode terminates when the agent reaches the goal.

### Actions
There are 4 discrete deterministic actions:
- 0: move up
- 1: move right
- 2: move down
- 3: move left

### Observations
There are 3x12 + 1 possible states, since the agent can never stand on the cliff or at the goal
(reaching either ends the episode). What remains are all the positions of the first 3 rows plus the bottom-left cell.
The observation is simply the current position encoded as [flattened index](https://numpy.org/doc/stable/reference/generated/numpy.unravel_index.html).
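
As a quick sketch of that encoding (assuming the 4x12 layout above; the helper name is ours, not the environment's API):

```python
NCOLS = 12  # width of the 4x12 board

def to_state(row, col):
    # Flatten (row, col) into the Discrete(48) observation index.
    return row * NCOLS + col

print(to_state(3, 0))   # start, bottom-left
print(to_state(3, 11))  # goal, bottom-right
```

The start therefore maps to index 36 and the goal to index 47, the last state in `Discrete(48)`.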

### Reward
Each time step incurs -1 reward, and stepping into the cliff incurs -100 reward.

### Arguments

```
gymnasium.make('CliffWalking-v0')
```

### Version History
- v0: Initial version release

@@ -1,99 +0,0 @@

---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Frozen Lake
---

# Frozen Lake

```{figure} ../../_static/videos/toy_text/frozen_lake.gif
:width: 200px
:name: frozen_lake
```

This environment is part of the <a href='..'>Toy Text environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Discrete(4) |
| Observation Space | Discrete(16) |
| Import | `gymnasium.make("FrozenLake-v1")` |


Frozen lake involves crossing a frozen lake from Start(S) to Goal(G) without falling into any Holes(H)
by walking over the Frozen(F) lake.
The agent may not always move in the intended direction due to the slippery nature of the frozen lake.


### Action Space
The agent takes a 1-element vector for actions.
The action space is `(dir)`, where `dir` decides the direction to move in, which can be:

- 0: LEFT
- 1: DOWN
- 2: RIGHT
- 3: UP

### Observation Space
The observation is a value representing the agent's current position as
current_row * ncols + current_col (where both the row and col start at 0).
For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15.
The number of possible observations is dependent on the size of the map.
For example, the 4x4 map has 16 possible observations.
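
The position encoding can be checked directly (a small sketch; `to_obs` is an illustrative helper, not part of the environment's API):

```python
def to_obs(row, col, ncols):
    # current_row * ncols + current_col, with rows and cols 0-indexed
    return row * ncols + col

print(to_obs(3, 3, 4))  # goal of the 4x4 map
print(to_obs(7, 7, 8))  # goal of the 8x8 map
```

This gives 15 for the 4x4 map, matching the worked example above, and 63 for the 8x8 map.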

### Rewards

Reward schedule:
- Reach goal(G): +1
- Reach hole(H): 0
- Reach frozen(F): 0

### Arguments

```
gymnasium.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)
```

`desc`: Used to specify a custom map for frozen lake. For example,

    desc=["SFFF", "FHFH", "FFFH", "HFFG"].

A randomly generated map can be specified by calling the function `generate_random_map`. For example,

```
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

gymnasium.make('FrozenLake-v1', desc=generate_random_map(size=8))
```

`map_name`: ID to use any of the preloaded maps.

    "4x4": [
        "SFFF",
        "FHFH",
        "FFFH",
        "HFFG"
    ]

    "8x8": [
        "SFFFFFFF",
        "FFFFFFFF",
        "FFFHFFFF",
        "FFFFFHFF",
        "FFFHFFFF",
        "FHHFFFHF",
        "FHFFHFHF",
        "FFFHFFFG",
    ]

`is_slippery`: True/False. If True, the agent will move in the intended direction with
probability 1/3; otherwise it will move in either perpendicular direction with
probability 1/3 each.

For example, if action is left and is_slippery is True, then:
- P(move left)=1/3
- P(move up)=1/3
- P(move down)=1/3
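
The slippery transition rule above can be sketched as follows (a minimal illustration under the 1/3 rule, not the environment's internal implementation; action numbering follows the Action Space section):

```python
import random

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

# For each intended action, the two perpendicular directions the agent may slip into.
PERPENDICULAR = {
    LEFT: (DOWN, UP), RIGHT: (DOWN, UP),
    DOWN: (LEFT, RIGHT), UP: (LEFT, RIGHT),
}

def slippery_step(action):
    # Intended direction and both perpendicular directions each occur with probability 1/3.
    return random.choice([action, *PERPENDICULAR[action]])

print(slippery_step(LEFT))  # one of LEFT, DOWN, UP
```

Note that the agent never slips directly backwards: for action left, only left, up, and down are possible outcomes.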

### Version History
* v1: Bug fixes to rewards
* v0: Initial versions release (1.0.0)

@@ -1,116 +0,0 @@

---
AUTOGENERATED: DO NOT EDIT FILE DIRECTLY
title: Taxi
lastpage:
---

# Taxi

```{figure} ../../_static/videos/toy_text/taxi.gif
:width: 200px
:name: taxi
```

This environment is part of the <a href='..'>Toy Text environments</a>. Please read that page first for general information.

| | |
|---|---|
| Action Space | Discrete(6) |
| Observation Space | Discrete(500) |
| Import | `gymnasium.make("Taxi-v3")` |


The Taxi Problem
from "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition"
by Tom Dietterich

### Description
There are four designated locations in the grid world indicated by R(ed),
G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off
at a random square and the passenger is at a random location. The taxi
drives to the passenger's location, picks up the passenger, drives to the
passenger's destination (another one of the four specified locations), and
then drops off the passenger. Once the passenger is dropped off, the episode ends.

Map:

    +---------+
    |R: | : :G|
    | : | : : |
    | : : : : |
    | | : | : |
    |Y| : |B: |
    +---------+

### Actions
There are 6 discrete deterministic actions:
- 0: move south
- 1: move north
- 2: move east
- 3: move west
- 4: pickup passenger
- 5: drop off passenger

### Observations
There are 500 discrete states since there are 25 taxi positions, 5 possible
locations of the passenger (including the case when the passenger is in the
taxi), and 4 destination locations.

Note that there are 400 states that can actually be reached during an
episode. The missing states correspond to situations in which the passenger
is at the same location as their destination, as this typically signals the
end of an episode. Four additional states can be observed right after
successful episodes, when both the passenger and the taxi are at the destination.
This gives a total of 404 reachable discrete states.

Each state is represented by the tuple:
(taxi_row, taxi_col, passenger_location, destination)

An observation is an integer that encodes the corresponding state.
The state tuple can then be decoded with the "decode" method.
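
The mixed-radix encoding described above (25 taxi positions x 5 passenger locations x 4 destinations = 500) can be sketched as follows (illustrative helpers written here for clarity; the environment exposes equivalent `encode`/`decode` logic):

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    # 25 taxi positions x 5 passenger locations x 4 destinations = 500 states
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination

def decode(state):
    # Invert the encoding, peeling off each factor from least to most significant.
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination

print(encode(0, 0, 0, 0))  # smallest state index
print(encode(4, 4, 4, 3))  # largest state index
```

Round-tripping any tuple through `encode` and `decode` returns the same tuple, and the largest tuple maps to index 499, matching `Discrete(500)`.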

Passenger locations:
- 0: R(ed)
- 1: G(reen)
- 2: Y(ellow)
- 3: B(lue)
- 4: in taxi

Destinations:
- 0: R(ed)
- 1: G(reen)
- 2: Y(ellow)
- 3: B(lue)

### Info

``step`` and ``reset(return_info=True)`` will return an info dictionary containing "p", the
probability that the state is taken, and "action_mask", a mask of which actions will result
in a change of state, which can be used to speed up training.

As Taxi's initial state is stochastic, the "p" key represents the probability of the
transition; however, this value is currently bugged to always be 1.0, which will be fixed soon.
As the steps are deterministic, "p" represents the probability of the transition, which is always 1.0.

For some cases, taking an action will have no effect on the state of the agent.
In v0.25.0, ``info["action_mask"]`` contains an np.ndarray for each action, specifying
whether the action will change the state.

To sample a modifying action, use ``action = env.action_space.sample(info["action_mask"])``
Or with a Q-value based algorithm ``action = np.argmax(q_values[obs, np.where(info["action_mask"] == 1)[0]])``.
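
As a sketch of that masked greedy selection (the Q-table and mask values here are made up for illustration; note the final indexing maps the argmax over the valid subset back to an original action index):

```python
import numpy as np

q_values = np.zeros((500, 6))                 # illustrative Q-table: 500 states x 6 actions
q_values[0] = [1.0, 5.0, 2.0, 0.0, 3.0, 4.0]  # fake values for state 0
action_mask = np.array([1, 0, 1, 1, 0, 1])    # pretend actions 1 and 4 are no-ops here

obs = 0
valid = np.where(action_mask == 1)[0]         # indices of state-changing actions
# Greedy choice among valid actions, mapped back to the original action index.
action = valid[np.argmax(q_values[obs, valid])]
print(action)
```

Without the final `valid[...]` indexing, `np.argmax` would return a position within the masked subset rather than an action number.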

### Rewards
- -1 per step unless other reward is triggered.
- +20 delivering passenger.
- -10 executing "pickup" and "drop-off" actions illegally.

### Arguments

```
gymnasium.make('Taxi-v3')
```

### Version History
* v3: Map Correction + Cleaner Domain Description, v0.25.0 action masking added to the reset and step information
* v2: Disallow Taxi start location = goal location, Update Taxi observations in the rollout, Update Taxi reward threshold.
* v1: Remove (3,2) from locs, add passidx<4 check
* v0: Initial versions release

@@ -9,45 +9,60 @@ import numpy as np
from tqdm import tqdm
from utils import kill_strs, trim

import gymnasium
import gymnasium as gym

LAYOUT = "env"

pattern = re.compile(r"(?<!^)(?=[A-Z])")

gymnasium.logger.set_level(gymnasium.logger.DISABLED)
gym.logger.set_level(gym.logger.DISABLED)

all_envs = list(gymnasium.envs.registry.values())
all_envs = list(gym.envs.registry.values())
filtered_envs_by_type = {}

# Obtain filtered list
for env_spec in tqdm(all_envs):

    if any(x in str(env_spec.id) for x in kill_strs):
        continue

    try:
        env = gymnasium.make(env_spec.id)
        split = str(type(env.unwrapped)).split(".")
        env_type = split[2]
    # gymnasium.envs.env_type.env.EnvClass
    # ale_py.env.gym:AtariEnv
    split = env_spec.entry_point.split(".")
    # ignore gymnasium.envs.env_type:Env
    env_module = split[0]
    if len(split) < 4 and env_module != "ale_py":
        continue
    env_type = split[2] if env_module != "ale_py" else "atari"
    env_version = env_spec.version

    if env_type == "atari" or env_type == "unittest":
        continue
    # ignore unit test envs and old versions of atari envs
    if env_module == "ale_py" or env_type == "unittest":
        continue

    try:
        env = gym.make(env_spec.id)
        split = str(type(env.unwrapped)).split(".")
        env_name = split[3]

        if env_type not in filtered_envs_by_type.keys():
            filtered_envs_by_type[env_type] = []
        filtered_envs_by_type[env_type].append((env_spec, env_type))
            filtered_envs_by_type[env_type] = {}
        # only store new entries and higher versions
        if env_name not in filtered_envs_by_type[env_type] or (
            env_name in filtered_envs_by_type[env_type]
            and env_version > filtered_envs_by_type[env_type][env_name].version
        ):
            filtered_envs_by_type[env_type][env_name] = env_spec

    except Exception as e:
        print(e)


# Sort
filtered_envs = list(
    reduce(
        lambda s, x: s + x,
        map(
            lambda arr: sorted(arr, key=lambda x: x[0].name),
            list(filtered_envs_by_type.values()),
            lambda arr: sorted(arr, key=lambda x: x.name),
            map(lambda dic: list(dic.values()), list(filtered_envs_by_type.values())),
        ),
        [],
    )
@@ -55,10 +70,11 @@ filtered_envs = list(


# Update Docs
for i, (env_spec, env_type) in tqdm(enumerate(filtered_envs)):
for i, env_spec in tqdm(enumerate(filtered_envs)):
    print("ID:", env_spec.id)
    env_type = env_spec.entry_point.split(".")[2]
    try:
        env = gymnasium.make(env_spec.id)
        env = gym.make(env_spec.id)

        # variants dont get their own pages
        e_n = str(env_spec).lower()
@@ -74,9 +90,12 @@ for i, (env_spec, env_type) in tqdm(enumerate(filtered_envs)):
    title_env_name = snake_env_name.replace("_", " ").title()
    env_type_title = env_type.replace("_", " ").title()
    related_pages_meta = ""
    if i == 0 or not env_type == filtered_envs[i - 1][1]:
    if i == 0 or not env_type == filtered_envs[i - 1].entry_point.split(".")[2]:
        related_pages_meta = "firstpage:\n"
    elif i == len(filtered_envs) - 1 or not env_type == filtered_envs[i + 1][1]:
    elif (
        i == len(filtered_envs) - 1
        or not env_type == filtered_envs[i + 1].entry_point.split(".")[2]
    ):
        related_pages_meta = "lastpage:\n"

    # path for saving video