Mirror of https://github.com/Farama-Foundation/Gymnasium.git (synced 2025-07-31 13:54:31 +00:00)

Commit: Updating tutorials (#63)
@@ -15,7 +15,7 @@ repos:
      hooks:
        - id: flake8
          args:
-           - '--per-file-ignores=*/__init__.py:F401 gymnasium/envs/registration.py:E704'
+           - '--per-file-ignores=*/__init__.py:F401 gymnasium/envs/registration.py:E704 docs/tutorials/*.py:E402'
            - --ignore=E203,W503,E741
            - --max-complexity=30
            - --max-line-length=456
@@ -8,7 +8,7 @@ If you are modifying a non-environment page or an atari environment page, please

### Editing an environment page

If you are editing an Atari environment, directly edit the Markdown file in this repository.

Otherwise, fork Gymnasium and edit the docstring in the environment's Python file. Then, pip install your Gymnasium fork and run `docs/scripts/gen_mds.py` in this repo. This will automatically generate a Markdown documentation file for the environment.

@@ -49,3 +49,11 @@ To rebuild the documentation automatically every time a change is made:
cd docs
sphinx-autobuild -b dirhtml . _build
```

+## Writing Tutorials
+
+We use Sphinx-Gallery to build the tutorials inside the `docs/tutorials` directory. Check `docs/tutorials/demo.py` for an example of a tutorial and the [Sphinx-Gallery documentation](https://sphinx-gallery.github.io/stable/syntax.html) for more information.
+
+To convert Jupyter notebooks to Python tutorials you can use [this script](https://gist.github.com/mgoulao/f07f5f79f6cd9a721db8a34bba0a19a7).
+
+If you want Sphinx-Gallery to execute the tutorial (which adds outputs and plots), the file name should start with `run_`. Note that this adds to the build time, so make sure the script doesn't take more than a few seconds to execute.
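For orientation, here is a minimal sketch of what a Sphinx-Gallery tutorial file can look like (the file name `run_example_tutorial.py` is hypothetical; the `run_` prefix only matters if you want the script executed during the build):

```python
# docs/tutorials/run_example_tutorial.py (hypothetical file name)
"""
Example tutorial title
======================

Text in this module docstring becomes the tutorial's introduction.
"""
# %%
# Each ``# %%`` comment block is rendered as a text cell; the code below it is
# shown on the built page (and executed when the file name starts with ``run_``).
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.action_space)
```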
@@ -41,6 +41,7 @@ extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.githubpages",
    "myst_parser",
+   "furo.gen_tutorials",
]

# Add any paths that contain templates here, relative to this directory.
@@ -91,5 +92,6 @@ html_css_files = []
# -- Generate Tutorials -------------------------------------------------

gen_tutorials.generate(
+    os.path.dirname(__file__),
    os.path.join(os.path.dirname(__file__), "tutorials"),
)
@@ -8,7 +8,7 @@ firstpage:

## Initializing Environments

Initializing environments is very easy in Gymnasium and can be done via:

```python
import gymnasium as gym
@@ -32,11 +32,11 @@ Gymnasium implements the classic "agent-environment loop":
```

The agent performs some actions in the environment (usually by passing some control inputs to the environment, e.g. torque inputs of motors) and observes how the environment's state changes. One such action-observation exchange is referred to as a *timestep*.

The goal in RL is to manipulate the environment in some specific way. For instance, we want the agent to navigate a robot to a specific point in space. If it succeeds in doing this (or makes some progress towards that goal), it will receive a positive reward alongside the observation for this timestep. The reward may also be negative or 0 if the agent did not yet succeed (or did not make any progress). The agent will then be trained to maximize the reward it accumulates over many timesteps.

After some timesteps, the environment may enter a terminal state. For instance, the robot may have crashed, or the agent may have succeeded in completing a task. In that case, we want to reset the environment to a new initial state. The environment issues a terminated signal to the agent if it enters such a terminal state. Sometimes we also want to end the episode after a fixed number of timesteps; in this case, the environment issues a truncated signal.
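A minimal sketch of this loop (assuming a registered environment such as CartPole-v1 and a purely random policy) looks like:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)

for _ in range(1000):
    action = env.action_space.sample()  # random policy; replace with your agent
    observation, reward, terminated, truncated, info = env.step(action)

    # Start a new episode when a terminal state is reached or the episode is truncated
    if terminated or truncated:
        observation, info = env.reset()

env.close()
```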
@@ -71,41 +71,41 @@ The output should look something like this:

Every environment specifies the format of valid actions by providing an `env.action_space` attribute. Similarly, the format of valid observations is specified by `env.observation_space`. In the example above we sampled random actions via `env.action_space.sample()`. Note that we need to seed the action space separately from the environment to ensure reproducible samples.
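As a brief illustration (CartPole-v1 is used here only as an example of a registered environment), the action space is seeded on its own:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)  # seeds the environment's RNG
env.action_space.seed(42)               # seeds the space's sampler separately
action = env.action_space.sample()      # reproducible random action
```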
### Change in env.step API

Previously, the step method returned only one boolean - `done`. This is being deprecated in favour of returning two booleans, `terminated` and `truncated` (v0.26 onwards).

The `terminated` signal is set to `True` when the core environment terminates inherently because of task completion, failure, etc., i.e. a condition defined in the MDP. The `truncated` signal is set to `True` when the episode ends specifically because of a time limit or a condition not inherent to the environment (not defined in the MDP). It is possible for `terminated=True` and `truncated=True` to occur at the same time when termination and truncation occur at the same step.

This is explained in detail in the `Handling Time Limits` section.
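A short sketch of consuming the two booleans (again, CartPole-v1 is only an example of a registered environment); collapsing them back into a single flag is an option when older code still expects `done`:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)
observation, reward, terminated, truncated, info = env.step(env.action_space.sample())

done = terminated or truncated  # only if a single old-style flag is still needed
```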
#### Backward compatibility

Gym will retain support for the old API through compatibility wrappers.

Users can toggle the old API through `make` by setting `apply_api_compatibility=True`.

```python
env = gym.make("CartPole-v1", apply_api_compatibility=True)
```

This can also be done explicitly through a wrapper:

```python
from gymnasium.wrappers import StepAPICompatibility
env = StepAPICompatibility(CustomEnv(), output_truncation_bool=False)
```

For more details see the wrappers section.
## Checking API-Conformity

If you have implemented a custom environment and would like to perform a sanity check to make sure that it conforms to the API, you can run:

```python
>>> from gymnasium.utils.env_checker import check_env
@@ -113,8 +113,8 @@ the API, you can run:
```

This function will throw an exception if it seems like your environment does not follow the Gymnasium API. It will also produce warnings if it looks like you made a mistake or do not follow a best practice (e.g. if `observation_space` looks like an image but does not have the right dtype). Warnings can be turned off by passing `warn=False`. By default, `check_env` will not check the `render` method. To change this behavior, you can pass `skip_render_check=False`.
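For illustration, a sketch of invoking the checker with these flags; `CustomEnv` is a hypothetical stand-in for your own environment class:

```python
from gymnasium.utils.env_checker import check_env

env = CustomEnv()  # hypothetical custom environment class
# Keep warnings enabled and also exercise the render method
check_env(env, warn=True, skip_render_check=False)
# Afterwards, create a fresh instance for training rather than reusing `env`
```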

> After running `check_env` on an environment, you should not reuse the instance that was checked, as it may have already
@@ -136,7 +136,7 @@ There are multiple `Space` types available in Gymnasium:

```python
>>> from gymnasium.spaces import Box, Discrete, Dict, Tuple, MultiBinary, MultiDiscrete
>>> import numpy as np
>>>
>>> observation_space = Box(low=-1.0, high=2.0, shape=(3,), dtype=np.float32)
>>> observation_space.sample()
@@ -145,11 +145,11 @@ There are multiple `Space` types available in Gymnasium:
>>> observation_space = Discrete(4)
>>> observation_space.sample()
1
>>>
>>> observation_space = Discrete(5, start=-2)
>>> observation_space.sample()
-2
>>>
>>> observation_space = Dict({"position": Discrete(2), "velocity": Discrete(3)})
>>> observation_space.sample()
OrderedDict([('position', 0), ('velocity', 1)])
@@ -170,7 +170,7 @@ OrderedDict([('position', 0), ('velocity', 1)])
## Wrappers

Wrappers are a convenient way to modify an existing environment without having to alter the underlying code directly. Using wrappers will allow you to avoid a lot of boilerplate code and make your environment more modular. Wrappers can also be chained to combine their effects. Most environments that are generated via `gymnasium.make` will already be wrapped by default.
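A small sketch of chaining wrappers around a base environment (the specific wrappers used here, `TimeLimit` and `RecordEpisodeStatistics` from `gymnasium.wrappers`, are just illustrative choices):

```python
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, TimeLimit

base_env = gym.make("CartPole-v1")
# Chain wrappers: enforce a step limit, then record per-episode statistics
wrapped_env = RecordEpisodeStatistics(TimeLimit(base_env, max_episode_steps=100))

observation, info = wrapped_env.reset(seed=0)
observation, reward, terminated, truncated, info = wrapped_env.step(
    wrapped_env.action_space.sample()
)
```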

In order to wrap an environment, you must first initialize a base environment. Then you can pass this environment along
@@ -217,7 +217,7 @@ If you have a wrapped environment, and you want to get the unwrapped environment

## Playing within an environment

You can also play the environment using your keyboard, with the `play` function in `gymnasium.utils.play`.

```python
from gymnasium.utils.play import play
play(gymnasium.make('Pong-v0'))
@@ -66,9 +66,6 @@ environments/third_party_environments
:glob:
:caption: Tutorials

-content/environment_creation
-content/vectorising
-content/handling_timelimits
tutorials/*
```
docs/tutorials/environment_creation.py (new file, 509 lines)
@@ -0,0 +1,509 @@
"""
Make your own custom environment
================================

This documentation overviews creating new environments and relevant
useful wrappers, utilities and tests included in Gymnasium designed for
the creation of new environments. You can clone gym-examples to play
with the code that is presented here. We recommend that you use a virtual environment:

.. code:: console

   git clone https://github.com/Farama-Foundation/gym-examples
   cd gym-examples
   python -m venv .env
   source .env/bin/activate
   pip install -e .

Subclassing gymnasium.Env
-------------------------

Before learning how to create your own environment you should check out
`the documentation of Gymnasium’s API </api/core>`__.

We will be concerned with a subset of gym-examples that looks like this:

.. code:: sh

   gym-examples/
     README.md
     setup.py
     gym_examples/
       __init__.py
       envs/
         __init__.py
         grid_world.py
       wrappers/
         __init__.py
         relative_position.py
         reacher_weighted_reward.py
         discrete_action.py
         clip_reward.py

To illustrate the process of subclassing ``gymnasium.Env``, we will
implement a very simplistic game, called ``GridWorldEnv``. We will write
the code for our custom environment in
``gym-examples/gym_examples/envs/grid_world.py``. The environment
consists of a 2-dimensional square grid of fixed size (specified via the
``size`` parameter during construction). The agent can move vertically
or horizontally between grid cells in each timestep. The goal of the
agent is to navigate to a target on the grid that has been placed
randomly at the beginning of the episode.

- Observations provide the location of the target and agent.
- There are 4 actions in our environment, corresponding to the
  movements “right”, “up”, “left”, and “down”.
- A done signal is issued as soon as the agent has navigated to the
  grid cell where the target is located.
- Rewards are binary and sparse, meaning that the immediate reward is
  always zero, unless the agent has reached the target, then it is 1.

An episode in this environment (with ``size=5``) might look like this:

where the blue dot is the agent and the red square represents the
target.

Let us look at the source code of ``GridWorldEnv`` piece by piece:
"""

# %%
# Declaration and Initialization
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Our custom environment will inherit from the abstract class
# ``gymnasium.Env``. You shouldn’t forget to add the ``metadata``
# attribute to your class. There, you should specify the render-modes that
# are supported by your environment (e.g. ``"human"``, ``"rgb_array"``,
# ``"ansi"``) and the framerate at which your environment should be
# rendered. Every environment should support ``None`` as render-mode; you
# don’t need to add it in the metadata. In ``GridWorldEnv``, we will
# support the modes “rgb_array” and “human” and render at 4 FPS.
#
# The ``__init__`` method of our environment will accept the integer
# ``size``, that determines the size of the square grid. We will set up
# some variables for rendering and define ``self.observation_space`` and
# ``self.action_space``. In our case, observations should provide
# information about the location of the agent and target on the
# 2-dimensional grid. We will choose to represent observations in the form
# of dictionaries with keys ``"agent"`` and ``"target"``. An observation
# may look like ``{"agent": array([1, 0]), "target": array([0, 3])}``.
# Since we have 4 actions in our environment (“right”, “up”, “left”,
# “down”), we will use ``Discrete(4)`` as an action space. Here is the
# declaration of ``GridWorldEnv`` and the implementation of ``__init__``:

import numpy as np
import pygame

import gymnasium as gym
from gymnasium import spaces


class GridWorldEnv(gym.Env):
    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 4}

    def __init__(self, render_mode=None, size=5):
        self.size = size  # The size of the square grid
        self.window_size = 512  # The size of the PyGame window

        # Observations are dictionaries with the agent's and the target's location.
        # Each location is encoded as an element of {0, ..., `size`}^2, i.e. MultiDiscrete([size, size]).
        self.observation_space = spaces.Dict(
            {
                "agent": spaces.Box(0, size - 1, shape=(2,), dtype=int),
                "target": spaces.Box(0, size - 1, shape=(2,), dtype=int),
            }
        )

        # We have 4 actions, corresponding to "right", "up", "left", "down"
        self.action_space = spaces.Discrete(4)

        """
        The following dictionary maps abstract actions from `self.action_space` to
        the direction we will walk in if that action is taken.
        I.e. 0 corresponds to "right", 1 to "up" etc.
        """
        self._action_to_direction = {
            0: np.array([1, 0]),
            1: np.array([0, 1]),
            2: np.array([-1, 0]),
            3: np.array([0, -1]),
        }

        assert render_mode is None or render_mode in self.metadata["render_modes"]
        self.render_mode = render_mode

        """
        If human-rendering is used, `self.window` will be a reference
        to the window that we draw to. `self.clock` will be a clock that is used
        to ensure that the environment is rendered at the correct framerate in
        human-mode. They will remain `None` until human-mode is used for the
        first time.
        """
        self.window = None
        self.clock = None

# %%
# Constructing Observations From Environment States
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Since we will need to compute observations both in ``reset`` and
# ``step``, it is often convenient to have a (private) method ``_get_obs``
# that translates the environment’s state into an observation. However,
# this is not mandatory and you may as well compute observations in
# ``reset`` and ``step`` separately:

    def _get_obs(self):
        return {"agent": self._agent_location, "target": self._target_location}

# %%
# We can also implement a similar method for the auxiliary information
# that is returned by ``step`` and ``reset``. In our case, we would like
# to provide the manhattan distance between the agent and the target:

    def _get_info(self):
        return {
            "distance": np.linalg.norm(
                self._agent_location - self._target_location, ord=1
            )
        }

# %%
# Oftentimes, info will also contain some data that is only available
# inside the ``step`` method (e.g. individual reward terms). In that case,
# we would have to update the dictionary that is returned by ``_get_info``
# in ``step``.
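
# %%
# As a purely illustrative sketch (the extra key below is hypothetical and not part of
# ``GridWorldEnv``), such per-step data could be merged into the info dictionary inside ``step``:
#
# .. code:: python
#
#    info = self._get_info()
#    info["step_penalty"] = -0.01  # hypothetical per-step reward term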

# %%
# Reset
# ~~~~~
#
# The ``reset`` method will be called to initiate a new episode. You may
# assume that the ``step`` method will not be called before ``reset`` has
# been called. Moreover, ``reset`` should be called whenever a done signal
# has been issued. Users may pass the ``seed`` keyword to ``reset`` to
# initialize any random number generator that is used by the environment
# to a deterministic state. It is recommended to use the random number
# generator ``self.np_random`` that is provided by the environment’s base
# class, ``gymnasium.Env``. If you only use this RNG, you do not need to
# worry much about seeding, *but you need to remember to call
# ``super().reset(seed=seed)``* to make sure that ``gymnasium.Env``
# correctly seeds the RNG. Once this is done, we can randomly set the
# state of our environment. In our case, we randomly choose the agent’s
# location and randomly sample target positions until the target does not
# coincide with the agent’s position.
#
# The ``reset`` method should return a tuple of the initial observation
# and some auxiliary information. We can use the methods ``_get_obs`` and
# ``_get_info`` that we implemented earlier for that:

    def reset(self, seed=None, options=None):
        # We need the following line to seed self.np_random
        super().reset(seed=seed)

        # Choose the agent's location uniformly at random
        self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)

        # We will sample the target's location randomly until it does not coincide with the agent's location
        self._target_location = self._agent_location
        while np.array_equal(self._target_location, self._agent_location):
            self._target_location = self.np_random.integers(
                0, self.size, size=2, dtype=int
            )

        observation = self._get_obs()
        info = self._get_info()

        if self.render_mode == "human":
            self._render_frame()

        return observation, info

# %%
# Step
# ~~~~
#
# The ``step`` method usually contains most of the logic of your
# environment. It accepts an ``action``, computes the state of the
# environment after applying that action and returns the 5-tuple
# ``(observation, reward, terminated, truncated, info)``. Once the new
# state of the environment has been computed, we can check whether it is a
# terminal state and we set ``terminated`` accordingly. Since we are using
# sparse binary rewards in ``GridWorldEnv``, computing ``reward`` is
# trivial once we know ``terminated``. To gather ``observation`` and
# ``info``, we can again make use of ``_get_obs`` and ``_get_info``:

    def step(self, action):
        # Map the action (element of {0,1,2,3}) to the direction we walk in
        direction = self._action_to_direction[action]
        # We use `np.clip` to make sure we don't leave the grid
        self._agent_location = np.clip(
            self._agent_location + direction, 0, self.size - 1
        )
        # An episode is done iff the agent has reached the target
        terminated = np.array_equal(self._agent_location, self._target_location)
        reward = 1 if terminated else 0  # Binary sparse rewards
        observation = self._get_obs()
        info = self._get_info()

        if self.render_mode == "human":
            self._render_frame()

        return observation, reward, terminated, False, info

# %%
# Rendering
# ~~~~~~~~~
#
# Here, we are using PyGame for rendering. A similar approach to rendering
# is used in many environments that are included with Gymnasium and you
# can use it as a skeleton for your own environments:

    def render(self):
        if self.render_mode == "rgb_array":
            return self._render_frame()

    def _render_frame(self):
        if self.window is None and self.render_mode == "human":
            pygame.init()
            pygame.display.init()
            self.window = pygame.display.set_mode(
                (self.window_size, self.window_size)
            )
        if self.clock is None and self.render_mode == "human":
            self.clock = pygame.time.Clock()

        canvas = pygame.Surface((self.window_size, self.window_size))
        canvas.fill((255, 255, 255))
        pix_square_size = (
            self.window_size / self.size
        )  # The size of a single grid square in pixels

        # First we draw the target
        pygame.draw.rect(
            canvas,
            (255, 0, 0),
            pygame.Rect(
                pix_square_size * self._target_location,
                (pix_square_size, pix_square_size),
            ),
        )
        # Now we draw the agent
        pygame.draw.circle(
            canvas,
            (0, 0, 255),
            (self._agent_location + 0.5) * pix_square_size,
            pix_square_size / 3,
        )

        # Finally, add some gridlines
        for x in range(self.size + 1):
            pygame.draw.line(
                canvas,
                0,
                (0, pix_square_size * x),
                (self.window_size, pix_square_size * x),
                width=3,
            )
            pygame.draw.line(
                canvas,
                0,
                (pix_square_size * x, 0),
                (pix_square_size * x, self.window_size),
                width=3,
            )

        if self.render_mode == "human":
            # The following line copies our drawings from `canvas` to the visible window
            self.window.blit(canvas, canvas.get_rect())
            pygame.event.pump()
            pygame.display.update()

            # We need to ensure that human-rendering occurs at the predefined framerate.
            # The following line will automatically add a delay to keep the framerate stable.
            self.clock.tick(self.metadata["render_fps"])
        else:  # rgb_array
            return np.transpose(
                np.array(pygame.surfarray.pixels3d(canvas)), axes=(1, 0, 2)
            )

# %%
# Close
# ~~~~~
#
# The ``close`` method should close any open resources that were used by
# the environment. In many cases, you don’t actually have to bother to
# implement this method. However, in our example ``render_mode`` may be
# ``"human"`` and we might need to close the window that has been opened:

    def close(self):
        if self.window is not None:
            pygame.display.quit()
            pygame.quit()


# %%
# In other environments ``close`` might also close files that were opened
# or release other resources. You shouldn’t interact with the environment
# after having called ``close``.
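
# %%
# As a quick, purely illustrative sanity check (hypothetical usage, not part of gym-examples),
# the class defined above can be driven directly without registration:
#
# .. code:: python
#
#    env = GridWorldEnv(size=5)
#    observation, info = env.reset(seed=42)
#    observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
#    env.close()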

# %%
# Registering Envs
# ----------------
#
# In order for the custom environments to be detected by Gymnasium, they
# must be registered as follows. We will choose to put this code in
# ``gym-examples/gym_examples/__init__.py``.
#
# .. code:: python
#
#    from gymnasium.envs.registration import register
#
#    register(
#        id="gym_examples/GridWorld-v0",
#        entry_point="gym_examples.envs:GridWorldEnv",
#        max_episode_steps=300,
#    )

# %%
# The environment ID consists of three components, two of which are
# optional: an optional namespace (here: ``gym_examples``), a mandatory
# name (here: ``GridWorld``) and an optional but recommended version
# (here: v0). It might have also been registered as ``GridWorld-v0`` (the
# recommended approach), ``GridWorld`` or ``gym_examples/GridWorld``, and
# the appropriate ID should then be used during environment creation.
#
# The keyword argument ``max_episode_steps=300`` will ensure that
# GridWorld environments that are instantiated via ``gymnasium.make`` will
# be wrapped in a ``TimeLimit`` wrapper (see `the wrapper
# documentation </api/wrappers>`__ for more information). A done signal
# will then be produced if the agent has reached the target *or* 300 steps
# have been executed in the current episode. To distinguish truncation and
# termination, you can check ``info["TimeLimit.truncated"]``.
#
# Apart from ``id`` and ``entry_point``, you may pass the following
# additional keyword arguments to ``register``:
#
# .. list-table::
#    :header-rows: 1
#
#    * - Name
#      - Type
#      - Default
#      - Description
#    * - ``reward_threshold``
#      - ``float``
#      - ``None``
#      - The reward threshold before the task is considered solved
#    * - ``nondeterministic``
#      - ``bool``
#      - ``False``
#      - Whether this environment is non-deterministic even after seeding
#    * - ``max_episode_steps``
#      - ``int``
#      - ``None``
#      - The maximum number of steps that an episode can consist of. If not ``None``, a ``TimeLimit`` wrapper is added
#    * - ``order_enforce``
#      - ``bool``
#      - ``True``
#      - Whether to wrap the environment in an ``OrderEnforcing`` wrapper
#    * - ``autoreset``
#      - ``bool``
#      - ``False``
#      - Whether to wrap the environment in an ``AutoResetWrapper``
#    * - ``kwargs``
#      - ``dict``
#      - ``{}``
#      - The default kwargs to pass to the environment class
#
# Most of these keywords (except for ``max_episode_steps``,
# ``order_enforce`` and ``kwargs``) do not alter the behavior of
# environment instances but merely provide some extra information about
# your environment. After registration, our custom ``GridWorldEnv``
# environment can be created with
# ``env = gymnasium.make('gym_examples/GridWorld-v0')``.
#
# ``gym-examples/gym_examples/envs/__init__.py`` should have:
#
# .. code:: python
#
#    from gym_examples.envs.grid_world import GridWorldEnv
#
# If your environment is not registered, you may optionally pass a module
# to import, that would register your environment before creating it like
# this - ``env = gymnasium.make('module:Env-v0')``, where ``module``
# contains the registration code. For the GridWorld env, the registration
# code is run by importing ``gym_examples``, so if it were not possible to
# import gym_examples explicitly, you could register while making by
# ``env = gymnasium.make('gym_examples:gym_examples/GridWorld-v0')``. This
# is especially useful when you’re allowed to pass only the environment ID
# into a third-party codebase (e.g. a learning library). This lets you
# register your environment without needing to edit the library’s source
# code.

# %%
# Creating a Package
# ------------------
#
# The last step is to structure our code as a Python package. This
# involves configuring ``gym-examples/setup.py``. A minimal example of how
# to do so is as follows:
#
# .. code:: python
#
#    from setuptools import setup
#
#    setup(
#        name="gym_examples",
#        version="0.0.1",
#        install_requires=["gymnasium==0.26.0", "pygame==2.1.0"],
#    )
#
# Creating Environment Instances
# ------------------------------
#
# After you have installed your package locally with
# ``pip install -e gym-examples``, you can create an instance of the
# environment via:
#
# .. code:: python
#
#    import gym_examples
#    env = gymnasium.make('gym_examples/GridWorld-v0')
#
# You can also pass keyword arguments of your environment’s constructor to
# ``gymnasium.make`` to customize the environment. In our case, we could
# do:
#
# .. code:: python
#
#    env = gymnasium.make('gym_examples/GridWorld-v0', size=10)
#
# Sometimes, you may find it more convenient to skip registration and call
# the environment’s constructor yourself. Some may find this approach more
# pythonic, and environments that are instantiated like this are also
# perfectly fine (but remember to add wrappers as well!).
#
# Using Wrappers
# --------------
#
# Oftentimes, we want to use different variants of a custom environment,
# or we want to modify the behavior of an environment that is provided by
# Gymnasium or some other party. Wrappers allow us to do this without
# changing the environment implementation or adding any boilerplate code.
# Check out the `wrapper documentation </api/wrappers/>`__ for details on
# how to use wrappers and instructions for implementing your own. In our
# example, observations cannot be used directly in learning code because
# they are dictionaries. However, we don’t actually need to touch our
# environment implementation to fix this! We can simply add a wrapper on
# top of environment instances to flatten observations into a single
# array:
#
# .. code:: python
#
#    import gym_examples
#    from gymnasium.wrappers import FlattenObservation
#
#    env = gymnasium.make('gym_examples/GridWorld-v0')
#    wrapped_env = FlattenObservation(env)
#    print(wrapped_env.reset())     # E.g.  [3 0 3 3], {}
#
# Wrappers have the big advantage that they make environments highly
# modular. For instance, instead of flattening the observations from
# GridWorld, you might only want to look at the relative position of the
# target and the agent. In the section on
# `ObservationWrappers </api/wrappers/#observationwrapper>`__ we have
# implemented a wrapper that does this job. This wrapper is also available
# in gym-examples:
#
# .. code:: python
#
#    import gym_examples
#    from gym_examples.wrappers import RelativePosition
#
#    env = gymnasium.make('gym_examples/GridWorld-v0')
#    wrapped_env = RelativePosition(env)
#    print(wrapped_env.reset())     # E.g.  [-3  3], {}
docs/tutorials/handling_time_limits.py (new file, 80 lines)
@@ -0,0 +1,80 @@
"""
Handling Time Limits
====================

In using Gymnasium environments with reinforcement learning code, a common problem observed is how time limits are incorrectly handled. The ``done`` signal received (in previous versions of OpenAI Gym < 0.26) from ``env.step`` indicated whether an episode has ended. However, this signal did not distinguish whether the episode ended due to ``termination`` or ``truncation``.

Termination
-----------

Termination refers to the episode ending after reaching a terminal state that is defined as part of the environment
definition. Examples are - task success, task failure, robot falling down etc. Notably, this also includes episodes
ending in finite-horizon environments due to a time-limit inherent to the environment. Note that to preserve the Markov
property, a representation of the remaining time must be present in the agent's observation in finite-horizon environments.
`(Reference) <https://arxiv.org/abs/1712.00378>`_

Truncation
----------

Truncation refers to the episode ending after an externally defined condition (that is outside the scope of the Markov
Decision Process). This could be a time-limit, a robot going out of bounds etc.

An infinite-horizon environment is an obvious example of where this is needed. We cannot wait forever for the episode
to complete, so we set a practical time-limit after which we forcibly halt the episode. The last state in this case is
not a terminal state, since it has a non-zero transition probability of moving to another state as per the Markov
Decision Process that defines the RL problem. This is also different from time-limits in finite-horizon environments,
as the agent in this case has no idea about this time-limit.
"""

# %%
# Importance in learning code
# ---------------------------
# Bootstrapping (using one or more estimated values of a variable to update estimates of the same variable) is a key
# aspect of Reinforcement Learning. A value function will tell you how much discounted reward you will get from a
# particular state if you follow a given policy. When an episode stops at any given point, by looking at the value of
# the final state, the agent is able to estimate how much discounted reward could have been obtained if the episode had
# continued. This is an example of handling truncation.
#
# More formally, a common example of bootstrapping in RL is updating the estimate of the Q-value function,
#
# .. math::
#     Q_{target}(o_t, a_t) = r_t + \gamma \cdot \max_{a'} Q(o_{t+1}, a')
#
# In classical RL, the new ``Q`` estimate is a weighted average of the previous ``Q`` estimate and ``Q_target``, while in Deep
# Q-Learning, the error between ``Q_target`` and the previous ``Q`` estimate is minimized.
#
# However, at the terminal state, bootstrapping is not done,
#
# .. math::
#     Q_{target}(o_t, a_t) = r_t
#
# This is where the distinction between termination and truncation becomes important. When an episode ends due to
# termination, we don't bootstrap; when it ends due to truncation, we bootstrap.
#
# While using gymnasium environments, the ``done`` signal (default for < v0.26) is frequently used to determine whether to
# bootstrap or not. However, this is incorrect since it does not differentiate between termination and truncation.
#
# A simple example of value functions is shown below. This is an illustrative example and not part of any specific algorithm.
#
# .. code:: python
#
#     # INCORRECT
#     vf_target = rew + gamma * (1 - done) * vf_next_state
#
# This is incorrect in the case of an episode ending due to truncation, where bootstrapping needs to happen but it doesn't.

# %%
# Solution
# --------
#
# From v0.26 onwards, Gymnasium's ``env.step`` API returns both termination and truncation information explicitly.
# In the previous version, truncation information was supplied through the info key ``TimeLimit.truncated``.
# The correct way to handle terminations and truncations now is:
#
# .. code:: python
#
#     # terminated = done and 'TimeLimit.truncated' not in info
#     # This was needed in previous versions.
#
#     vf_target = rew + gamma * (1 - terminated) * vf_next_state
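
# %%
# An illustrative sketch of the same rule inside an environment loop; ``policy`` and ``vf`` (a value-function
# approximator) below are hypothetical stand-ins, not part of this tutorial file:
#
# .. code:: python
#
#     obs, info = env.reset(seed=0)
#     while True:
#         action = policy(obs)  # hypothetical policy
#         next_obs, rew, terminated, truncated, info = env.step(action)
#         # Bootstrap from the next observation unless the episode truly terminated
#         vf_target = rew + gamma * (1 - terminated) * vf(next_obs)
#         if terminated or truncated:
#             break
#         obs = next_obs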