Add more introductory pages (#791)

BIN docs/_static/videos/tutorials/environment-creation-example-episode.gif (new file, 40 KiB)
Binary file not shown.
@@ -10,25 +10,26 @@ title: Functional

.. automethod:: gymnasium.functional.FuncEnv.transform

.. automethod:: gymnasium.functional.FuncEnv.initial
.. automethod:: gymnasium.functional.FuncEnv.initial_info

.. automethod:: gymnasium.functional.FuncEnv.transition
.. automethod:: gymnasium.functional.FuncEnv.observation
.. automethod:: gymnasium.functional.FuncEnv.reward
.. automethod:: gymnasium.functional.FuncEnv.terminal

.. automethod:: gymnasium.functional.FuncEnv.state_info
.. automethod:: gymnasium.functional.FuncEnv.transition_info

.. automethod:: gymnasium.functional.FuncEnv.render_init
.. automethod:: gymnasium.functional.FuncEnv.render_image
.. automethod:: gymnasium.functional.FuncEnv.render_initialise
.. automethod:: gymnasium.functional.FuncEnv.render_close
```

## Converting Jax-based Functional environments to standard Env

```{eval-rst}
.. autoclass:: gymnasium.utils.functional_jax_env.FunctionalJaxEnv
.. autoclass:: gymnasium.envs.functional_jax_env.FunctionalJaxEnv

.. automethod:: gymnasium.utils.functional_jax_env.FunctionalJaxEnv.reset
.. automethod:: gymnasium.utils.functional_jax_env.FunctionalJaxEnv.step
.. automethod:: gymnasium.utils.functional_jax_env.FunctionalJaxEnv.render
.. automethod:: gymnasium.envs.functional_jax_env.FunctionalJaxEnv.reset
.. automethod:: gymnasium.envs.functional_jax_env.FunctionalJaxEnv.step
.. automethod:: gymnasium.envs.functional_jax_env.FunctionalJaxEnv.render
```
@@ -13,6 +13,8 @@ multi-objective RL ([MO-Gymnasium](https://mo-gymnasium.farama.org/))
many-agent RL ([MAgent2](https://magent2.farama.org/)),
3D navigation ([Miniworld](https://miniworld.farama.org/)), and many more.

## Third-party environments with Gymnasium

*This page contains environments which are not maintained by Farama Foundation and, as such, cannot be guaranteed to function as intended.*

*If you'd like to contribute an environment, please reach out on [Discord](https://discord.gg/bnJ6kubTg6).*

@@ -47,8 +47,12 @@ env.close()
:caption: Introduction

introduction/basic_usage
introduction/train_agent
introduction/create_custom_env
introduction/record_agent
introduction/speed_up_env
introduction/gym_compatibility
introduction/migration-guide
introduction/migration_guide
```

```{toctree}
@@ -9,9 +9,9 @@ firstpage:
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium
|
||||
|
||||
Gymnasium is a project that provides an API for all single agent reinforcement learning environments, and includes implementations of common environments: cartpole, pendulum, mountain-car, mujoco, atari, and more.
|
||||
Gymnasium is a project that provides an API (application programming interface) for all single agent reinforcement learning environments with implementations of common environments: cartpole, pendulum, mountain-car, mujoco, atari, and more. This page will outline the basics of how to use Gymnasium including its four key functions: :meth:`make`, :meth:`Env.reset`, :meth:`Env.step` and :meth:`Env.render`.
|
||||
|
||||
The API contains four key functions: :meth:`make`, :meth:`Env.reset`, :meth:`Env.step` and :meth:`Env.render`, that this basic usage will introduce you to. At the core of Gymnasium is :class:`Env`, a high-level python class representing a markov decision process (MDP) from reinforcement learning theory (this is not a perfect reconstruction, and is missing several components of MDPs). Within gymnasium, environments (MDPs) are implemented as :class:`Env` classes, along with :class:`Wrapper`, provide helpful utilities to change actions passed to the environment and modified the observations, rewards, termination or truncations conditions passed back to the user.
|
||||
At the core of Gymnasium is :class:`Env`, a high-level python class representing a markov decision process (MDP) from reinforcement learning theory (note: this is not a perfect reconstruction, and is missing several components of MDPs). The class gives users the ability to generate an initial state, to transition / move to new states given an action, and to visualise the environment. Alongside :class:`Env`, :class:`Wrapper` classes are provided to help augment / modify the environment, in particular, the agent's observations, rewards and actions.
|
||||
```
|
||||
|
||||
## Initializing Environments
|
||||
@@ -30,12 +30,12 @@ env = gym.make('CartPole-v1')
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium
|
||||
|
||||
This will return an :class:`Env` for users to interact with. To see all environments you can create, use :meth:`pprint_registry`. Furthermore, :meth:`make` provides a number of additional arguments for specifying keywords to the environment, adding more or less wrappers, etc.
|
||||
This function will return an :class:`Env` for users to interact with. To see all environments you can create, use :meth:`pprint_registry`. Furthermore, :meth:`make` provides a number of additional arguments for specifying keywords to the environment, adding more or less wrappers, etc. See :meth:`make` for more information.
|
||||
```
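
As a quick illustrative sketch (using the ``CartPole-v1`` environment that ships with Gymnasium), listing the registry and creating an environment might look like the following:

```python
import gymnasium as gym

# print every environment id currently registered (long output)
gym.pprint_registry()

# create an environment, passing an environment-specific keyword argument
env = gym.make("CartPole-v1", render_mode="rgb_array")
print(env.spec.id)  # "CartPole-v1"
env.close()
```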
|
||||
|
||||
## Interacting with the Environment
|
||||
|
||||
The classic "agent-environment loop" pictured below is simplified representation of reinforcement learning that Gymnasium implements.
|
||||
Within reinforcement learning, the classic "agent-environment loop" pictured below is a simplified representation of how an agent and environment interact with each other. The agent receives an observation about the environment, then selects an action that the environment uses to determine the reward and the next observation. The cycle then repeats until the environment ends (terminates).
|
||||
|
||||
```{image} /_static/diagrams/AE_loop.png
|
||||
:width: 50%
|
||||
@@ -49,19 +49,20 @@ The classic "agent-environment loop" pictured below is simplified representation
|
||||
:class: only-dark
|
||||
```
|
||||
|
||||
This loop is implemented using the following gymnasium code
|
||||
For gymnasium, the "agent-environment-loop" is implemented below for a single episode (until the environment ends). See the next section for a line-by-line explanation. Note that running this code requires installing swig (`pip install swig` or [download](https://www.swig.org/download.html)) along with `pip install gymnasium[box2d]`.
|
||||
|
||||
```python
|
||||
import gymnasium as gym
|
||||
|
||||
env = gym.make("LunarLander-v2", render_mode="human")
|
||||
observation, info = env.reset()
|
||||
|
||||
for _ in range(1000):
|
||||
episode_over = False
|
||||
while not episode_over:
|
||||
action = env.action_space.sample() # agent policy that uses the observation and info
|
||||
observation, reward, terminated, truncated, info = env.step(action)
|
||||
|
||||
if terminated or truncated:
|
||||
observation, info = env.reset()
|
||||
episode_over = terminated or truncated
|
||||
|
||||
env.close()
|
||||
```
|
||||
@@ -78,17 +79,15 @@ The output should look something like this:
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium
|
||||
|
||||
First, an environment is created using :meth:`make` with an additional keyword ``"render_mode"`` that specifies how the environment should be visualised.
|
||||
First, an environment is created using :meth:`make` with an additional keyword ``"render_mode"`` that specifies how the environment should be visualised. See :meth:`Env.render` for details on the default meaning of different render modes. In this example, we use the ``"LunarLander"`` environment where the agent controls a spaceship that needs to land safely.
|
||||
|
||||
.. py:currentmodule:: gymnasium.Env
|
||||
After initializing the environment, we :meth:`Env.reset` the environment to get the first observation of the environment along with an additional information. For initializing the environment with a particular random seed or options (see the environment documentation for possible values) use the ``seed`` or ``options`` parameters with :meth:`reset`.
|
||||
|
||||
See :meth:`render` for details on the default meaning of different render modes. In this example, we use the ``"LunarLander"`` environment where the agent controls a spaceship that needs to land safely.
|
||||
As we wish to continue the agent-environment loop until the environment ends, after an unknown number of timesteps, we define ``episode_over`` as a variable to know when to stop interacting with the environment, along with a while loop that uses it.
|
||||
|
||||
After initializing the environment, we :meth:`reset` the environment to get the first observation of the environment. For initializing the environment with a particular random seed or options (see environment documentation for possible values) use the ``seed`` or ``options`` parameters with :meth:`reset`.
|
||||
Next, the agent performs an action in the environment; :meth:`Env.step` executes the selected action (in this case random with ``env.action_space.sample()``) to update the environment. This action can be imagined as moving a robot or pressing a button on a game's controller that causes a change within the environment. As a result, the agent receives a new observation from the updated environment along with a reward for taking the action. This reward could be, for instance, positive for destroying an enemy or negative for moving into lava. One such action-observation exchange is referred to as a **timestep**.
|
||||
|
||||
Next, the agent performs an action in the environment, :meth:`step`, this can be imagined as moving a robot or pressing a button on a games' controller that causes a change within the environment. As a result, the agent receives a new observation from the updated environment along with a reward for taking the action. This reward could be for instance positive for destroying an enemy or a negative reward for moving into lava. One such action-observation exchange is referred to as a **timestep**.
|
||||
|
||||
However, after some timesteps, the environment may end, this is called the terminal state. For instance, the robot may have crashed, or the agent have succeeded in completing a task, the environment will need to stop as the agent cannot continue. In gymnasium, if the environment has terminated, this is returned by :meth:`step`. Similarly, we may also want the environment to end after a fixed number of timesteps, in this case, the environment issues a truncated signal. If either of ``terminated`` or ``truncated`` are ``True`` then :meth:`reset` should be called next to restart the environment.
|
||||
However, after some timesteps, the environment may end; this is called the terminal state. For instance, the robot may have crashed, or may have succeeded in completing a task; the environment then needs to stop as the agent cannot continue. In gymnasium, if the environment has terminated, this is returned by :meth:`step` as the third variable, ``terminated``. Similarly, we may also want the environment to end after a fixed number of timesteps; in this case, the environment issues a truncated signal. If either of ``terminated`` or ``truncated`` are ``True``, we end the episode, but in most cases users might wish to restart the environment; this can be done with ``env.reset()``.
|
||||
```
|
||||
|
||||
## Action and observation spaces
|
||||
@@ -96,26 +95,23 @@ However, after some timesteps, the environment may end, this is called the termi
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium.Env
|
||||
|
||||
Every environment specifies the format of valid actions and observations with the :attr:`action_space` and :attr:`observation_space` attributes. This is helpful for both knowing the expected input and output of the environment as all valid actions and observation should be contained with the respective space.
|
||||
Every environment specifies the format of valid actions and observations with the :attr:`action_space` and :attr:`observation_space` attributes. This is helpful for knowing both the expected input and output of the environment, as all valid actions and observations should be contained within their respective space. In the example above, we sampled random actions via ``env.action_space.sample()`` instead of using an agent policy that maps observations to actions, which is what users will want to do.
|
||||
|
||||
In the example, we sampled random actions via ``env.action_space.sample()`` instead of using an agent policy, mapping observations to actions which users will want to make. See one of the agent tutorials for an example of creating and training an agent policy.
|
||||
|
||||
.. py:currentmodule:: gymnasium
|
||||
|
||||
Every environment should have the attributes :attr:`Env.action_space` and :attr:`Env.observation_space`, both of which should be instances of classes that inherit from :class:`spaces.Space`. Gymnasium has support for a majority of possible spaces users might need:
|
||||
Importantly, :attr:`Env.action_space` and :attr:`Env.observation_space` are instances of :class:`Space`, a high-level python class that provides the key functions: :meth:`Space.contains` and :meth:`Space.sample`. Gymnasium has support for a wide range of spaces that users might need:
|
||||
|
||||
.. py:currentmodule:: gymnasium.spaces
|
||||
|
||||
- :class:`Box`: describes an n-dimensional continuous space. It's a bounded space where we can define the upper and lower
|
||||
limits which describe the valid values our observations can take.
|
||||
- :class:`Box`: describes a bounded space with upper and lower limits of any n-dimensional shape.
|
||||
- :class:`Discrete`: describes a discrete space where ``{0, 1, ..., n-1}`` are the possible values our observation or action can take.
|
||||
Values can be shifted to ``{a, a+1, ..., a+n-1}`` using an optional argument.
|
||||
- :class:`Dict`: represents a dictionary of simple spaces.
|
||||
- :class:`Tuple`: represents a tuple of simple spaces.
|
||||
- :class:`MultiBinary`: creates an n-shape binary space. Argument n can be a number or a list of numbers.
|
||||
- :class:`MultiBinary`: describes a binary space of any n-dimensional shape.
|
||||
- :class:`MultiDiscrete`: consists of a series of :class:`Discrete` action spaces with a different number of actions in each element.
|
||||
- :class:`Text`: describes a string space with a minimum and maximum length
|
||||
- :class:`Dict`: describes a dictionary of simpler spaces.
|
||||
- :class:`Tuple`: describes a tuple of simple spaces.
|
||||
- :class:`Graph`: describes a mathematical graph (network) with interlinking nodes and edges
|
||||
- :class:`Sequence`: describes a variable length of simpler space elements.
|
||||
|
||||
For example usage of spaces, see their `documentation </api/spaces>`_ along with `utility functions </api/spaces/utils>`_. There are a couple of more niche spaces :class:`Graph`, :class:`Sequence` and :class:`Text`.
|
||||
For example usage of spaces, see their `documentation <../api/spaces>`_ along with `utility functions <../api/spaces/utils>`_. There are a couple of more niche spaces :class:`Graph`, :class:`Sequence` and :class:`Text`.
|
||||
```
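
As a brief sketch of how these spaces are used (the shapes and bounds below are arbitrary examples, not taken from any particular environment):

```python
import numpy as np
import gymnasium as gym

# a 3-dimensional continuous space with values between -1 and 1
box = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
# a discrete space with the values {0, 1, 2, 3}
discrete = gym.spaces.Discrete(4)
# a dictionary combining the two simpler spaces
combined = gym.spaces.Dict({"position": box, "gear": discrete})

sample = combined.sample()        # e.g. {"position": array([...]), "gear": 2}
assert combined.contains(sample)  # every sample lies inside the space
```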
|
||||
|
||||
## Modifying the environment
|
||||
@@ -123,7 +119,7 @@ For example usage of spaces, see their `documentation </api/spaces>`_ along with
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium.wrappers
|
||||
|
||||
Wrappers are a convenient way to modify an existing environment without having to alter the underlying code directly. Using wrappers will allow you to avoid a lot of boilerplate code and make your environment more modular. Wrappers can also be chained to combine their effects. Most environments that are generated via ``gymnasium.make`` will already be wrapped by default using the :class:`TimeLimitV0`, :class:`OrderEnforcingV0` and :class:`PassiveEnvCheckerV0`.
|
||||
Wrappers are a convenient way to modify an existing environment without having to alter the underlying code directly. Using wrappers will allow you to avoid a lot of boilerplate code and make your environment more modular. Wrappers can also be chained to combine their effects. Most environments that are generated via :meth:`gymnasium.make` will already be wrapped by default using the :class:`TimeLimit`, :class:`OrderEnforcing` and :class:`PassiveEnvChecker`.
|
||||
|
||||
In order to wrap an environment, you must first initialize a base environment. Then you can pass this environment along with (possibly optional) parameters to the wrapper's constructor:
|
||||
```
|
||||
@@ -144,10 +140,10 @@ In order to wrap an environment, you must first initialize a base environment. T
|
||||
|
||||
Gymnasium already provides many commonly used wrappers for you. Some examples:
|
||||
|
||||
- :class:`TimeLimitV0`: Issue a truncated signal if a maximum number of timesteps has been exceeded (or the base environment has issued a truncated signal).
|
||||
- :class:`ClipActionV0`: Clip the action such that it lies in the action space (of type `Box`).
|
||||
- :class:`RescaleActionV0`: Rescale actions to lie in a specified interval
|
||||
- :class:`TimeAwareObservationV0`: Add information about the index of timestep to observation. In some cases helpful to ensure that transitions are Markov.
|
||||
- :class:`TimeLimit`: Issues a truncated signal if a maximum number of timesteps has been exceeded (or the base environment has issued a truncated signal).
|
||||
- :class:`ClipAction`: Clips any action passed to ``step`` such that it lies in the base environment's action space.
|
||||
- :class:`RescaleAction`: Applies an affine transformation to the action to linearly scale for a new low and high bound on the environment.
|
||||
- :class:`TimeAwareObservation`: Add information about the index of timestep to observation. In some cases helpful to ensure that transitions are Markov.
|
||||
```
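
As a small sketch of chaining wrappers (the environment id and bounds here are arbitrary choices for illustration; ``Pendulum-v1`` is used because it has a continuous action space):

```python
import gymnasium as gym
from gymnasium.wrappers import ClipAction, RescaleAction

env = gym.make("Pendulum-v1")
env = RescaleAction(env, min_action=-1.0, max_action=1.0)  # rescale actions to [-1, 1]
env = ClipAction(env)                                      # clip anything outside the action space

observation, info = env.reset(seed=42)
observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```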
|
||||
|
||||
For a full list of implemented wrappers in gymnasium, see [wrappers](/api/wrappers).
|
||||
@@ -167,6 +163,9 @@ If you have a wrapped environment, and you want to get the unwrapped environment
|
||||
|
||||
## More information
|
||||
|
||||
* [Making a Custom environment using the Gymnasium API](/tutorials/gymnasium_basics/environment_creation/)
|
||||
* [Training an agent to play blackjack](/tutorials/training_agents/blackjack_tutorial)
|
||||
* [Compatibility with OpenAI Gym](/introduction/gym_compatibility)
|
||||
* [Training an agent](train_agent)
|
||||
* [Making a Custom Environment](create_custom_env)
|
||||
* [Recording an agent's behaviour](record_agent)
|
||||
* [Speeding up an Environment](speed_up_env)
|
||||
* [Compatibility with OpenAI Gym](gym_compatibility)
|
||||
* [Migration Guide for Gym v0.21 to v0.26 and for v1.0.0](migration_guide)
|
||||
|
227 docs/introduction/create_custom_env.md (new file)
@@ -0,0 +1,227 @@
|
||||
---
|
||||
layout: "contents"
|
||||
title: Create custom env
|
||||
---
|
||||
|
||||
# Create a Custom Environment
|
||||
|
||||
This page provides a short outline of how to create custom environments with Gymnasium; for a more complete tutorial with rendering, see the [environment creation tutorial](../tutorials/gymnasium_basics/environment_creation). Please read [basic usage](basic_usage) before reading this page.
|
||||
|
||||
We will implement a very simplistic game, called ``GridWorldEnv``, consisting of a 2-dimensional square grid of fixed size. The agent can move vertically or horizontally between grid cells in each timestep and the goal of the agent is to navigate to a target on the grid that has been placed randomly at the beginning of the episode.
|
||||
|
||||
Basic information about the game
|
||||
- Observations provide the location of the target and agent.
|
||||
- There are 4 discrete actions in our environment, corresponding to the movements "right", "up", "left", and "down".
|
||||
- The environment ends (terminates) when the agent has navigated to the grid cell where the target is located.
|
||||
- The agent is only rewarded when it reaches the target, i.e., the reward is one when the agent reaches the target and zero otherwise.
|
||||
|
||||
## Environment `__init__`
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium
|
||||
|
||||
Like all environments, our custom environment will inherit from :class:`gymnasium.Env` that defines the structure of environment. One of the requirements for an environment is defining the observation and action space, which declare the general set of possible inputs (actions) and outputs (observations) of the environment. As outlined in our basic information about the game, our agent has four discrete actions, therefore we will use the ``Discrete(4)`` space with four options.
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium.spaces
|
||||
|
||||
For our observation, there are a couple of options; for this tutorial we will imagine our observation looks like ``{"agent": array([1, 0]), "target": array([0, 3])}`` where the array elements represent the x and y positions of the agent or target. Alternative options for representing the observation are a 2d grid with values representing the agent and target on the grid, or a 3d grid with each "layer" containing only the agent or target information. Therefore, we will declare the observation space as a :class:`Dict` with the agent and target spaces each being a :class:`Box` allowing an array output of an int type.
|
||||
```
|
||||
|
||||
For a full list of possible spaces to use with an environment, see [spaces](../api/spaces)
|
||||
|
||||
```python
|
||||
from typing import Optional
|
||||
import numpy as np
|
||||
import gymnasium as gym
|
||||
|
||||
|
||||
class GridWorldEnv(gym.Env):
|
||||
|
||||
def __init__(self, size: int = 5):
|
||||
# The size of the square grid
|
||||
self.size = size
|
||||
|
||||
# Define the agent and target location; randomly chosen in `reset` and updated in `step`
|
||||
self._agent_location = np.array([-1, -1], dtype=np.int32)
|
||||
self._target_location = np.array([-1, -1], dtype=np.int32)
|
||||
|
||||
# Observations are dictionaries with the agent's and the target's location.
|
||||
# Each location is encoded as an element of {0, ..., `size`-1}^2
|
||||
self.observation_space = gym.spaces.Dict(
|
||||
{
|
||||
"agent": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
|
||||
"target": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
|
||||
}
|
||||
)
|
||||
|
||||
# We have 4 actions, corresponding to "right", "up", "left", "down"
|
||||
self.action_space = gym.spaces.Discrete(4)
|
||||
# Dictionary maps the abstract actions to the directions on the grid
|
||||
self._action_to_direction = {
|
||||
0: np.array([1, 0]), # right
|
||||
1: np.array([0, 1]), # up
|
||||
2: np.array([-1, 0]), # left
|
||||
3: np.array([0, -1]), # down
|
||||
}
|
||||
```
|
||||
|
||||
## Constructing Observations
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium
|
||||
|
||||
Since we will need to compute observations both in :meth:`Env.reset` and :meth:`Env.step`, it is often convenient to have a method ``_get_obs`` that translates the environment's state into an observation. However, this is not mandatory and you can compute the observations in :meth:`Env.reset` and :meth:`Env.step` separately.
|
||||
```
|
||||
|
||||
```python
|
||||
def _get_obs(self):
|
||||
return {"agent": self._agent_location, "target": self._target_location}
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium
|
||||
|
||||
We can also implement a similar method for the auxiliary information that is returned by :meth:`Env.reset` and :meth:`Env.step`. In our case, we would like to provide the Manhattan distance between the agent and the target:
|
||||
```
|
||||
|
||||
```python
|
||||
def _get_info(self):
|
||||
return {
|
||||
"distance": np.linalg.norm(
|
||||
self._agent_location - self._target_location, ord=1
|
||||
)
|
||||
}
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium
|
||||
|
||||
Oftentimes, info will also contain some data that is only available inside the :meth:`Env.step` method (e.g., individual reward terms). In that case, we would have to update the dictionary that is returned by ``_get_info`` in :meth:`Env.step`.
|
||||
```
|
||||
|
||||
## Reset function
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium.Env
|
||||
|
||||
The purpose of :meth:`reset` is to initiate a new episode for an environment; it has two parameters: ``seed`` and ``options``. The seed can be used to initialize the random number generator to a deterministic state and options can be used to specify values used within reset. On the first line of the reset, you need to call ``super().reset(seed=seed)``, which will initialize the random number generator (:attr:`np_random`) to use through the rest of :meth:`reset`.
|
||||
|
||||
Within our custom environment, the :meth:`reset` needs to randomly choose the agent and target's positions (we repeat this if they have the same position). The return type of :meth:`reset` is a tuple of the initial observation and any auxiliary information. Therefore, we can use the methods ``_get_obs`` and ``_get_info`` that we implemented earlier for that:
|
||||
```
|
||||
|
||||
```python
|
||||
def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
|
||||
# We need the following line to seed self.np_random
|
||||
super().reset(seed=seed)
|
||||
|
||||
# Choose the agent's location uniformly at random
|
||||
self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)
|
||||
|
||||
# We will sample the target's location randomly until it does not coincide with the agent's location
|
||||
self._target_location = self._agent_location
|
||||
while np.array_equal(self._target_location, self._agent_location):
|
||||
self._target_location = self.np_random.integers(
|
||||
0, self.size, size=2, dtype=int
|
||||
)
|
||||
|
||||
observation = self._get_obs()
|
||||
info = self._get_info()
|
||||
|
||||
return observation, info
|
||||
```
|
||||
|
||||
## Step function
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium.Env
|
||||
|
||||
The :meth:`step` method usually contains most of the logic for your environment. It accepts an ``action`` and computes the state of the environment after applying the action, returning a tuple of the next observation, the resulting reward, whether the environment has terminated, whether the environment has truncated and auxiliary information.
|
||||
```
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium
|
||||
|
||||
For our environment, several things need to happen during the step function:
|
||||
|
||||
- We use ``self._action_to_direction`` to convert the discrete action (e.g., 2) to a grid direction to move our agent's location. To prevent the agent from going out of bounds of the grid, we clip the agent's location to stay within bounds.
|
||||
- We compute the agent's reward by checking if the agent's current position is equal to the target's location.
|
||||
- Since the environment doesn't truncate internally (we can apply a time limit wrapper to the environment during :meth:`make`), we permanently set ``truncated`` to ``False``.
|
||||
- We once again use ``_get_obs`` and ``_get_info`` to obtain the agent's observation and auxiliary information.
|
||||
```
|
||||
|
||||
```python
|
||||
def step(self, action):
|
||||
# Map the action (element of {0,1,2,3}) to the direction we walk in
|
||||
direction = self._action_to_direction[action]
|
||||
# We use `np.clip` to make sure we don't leave the grid bounds
|
||||
self._agent_location = np.clip(
|
||||
self._agent_location + direction, 0, self.size - 1
|
||||
)
|
||||
|
||||
# An environment is completed if and only if the agent has reached the target
|
||||
terminated = np.array_equal(self._agent_location, self._target_location)
|
||||
truncated = False
|
||||
reward = 1 if terminated else 0  # the agent is only rewarded when it reaches the target
|
||||
observation = self._get_obs()
|
||||
info = self._get_info()
|
||||
|
||||
return observation, reward, terminated, truncated, info
|
||||
```
|
||||
|
||||
## Registering and making the environment
|
||||
|
||||
```{eval-rst}
|
||||
While it is possible to use your new custom environment immediately, it is more common for environments to be initialized using :meth:`gymnasium.make`. In this section, we explain how to register a custom environment and then initialize it.
|
||||
|
||||
The environment ID consists of three components, two of which are optional: an optional namespace (here: ``gymnasium_env``), a mandatory name (here: ``GridWorld``) and an optional but recommended version (here: v0). It could also have been registered as ``GridWorld-v0`` (the recommended approach), ``GridWorld`` or ``gymnasium_env/GridWorld``, and the appropriate ID should then be used during environment creation.
|
||||
|
||||
The entry point can be a string or a function; as this tutorial isn't part of a python project, we cannot use a string here, but for most environments this is the normal way of specifying the entry point.
|
||||
|
||||
``register`` has additional parameters that can be used to specify keyword arguments to the environment, e.g., whether to apply a time limit wrapper, etc. See :meth:`gymnasium.register` for more information.
|
||||
```
|
||||
|
||||
```python
|
||||
gym.register(
|
||||
id="gymnasium_env/GridWorld-v0",
|
||||
entry_point=GridWorldEnv,
|
||||
)
|
||||
```
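
For reference, if the environment lived in an installable package, the entry point would more commonly be given as a string; the module path below (``gymnasium_env.envs:GridWorldEnv``) is a hypothetical example rather than a real package:

```python
# hypothetical: assumes a package `gymnasium_env` whose `envs` module exposes GridWorldEnv
gym.register(
    id="gymnasium_env/GridWorld-v0",
    entry_point="gymnasium_env.envs:GridWorldEnv",
)
```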
|
||||
|
||||
For a more complete guide on registering a custom environment (including with a string entry point), please read the full [create environment](../tutorials/gymnasium_basics/environment_creation) tutorial.
|
||||
|
||||
```{eval-rst}
|
||||
Once the environment is registered, you can list all registered environments via :meth:`gymnasium.pprint_registry`, and the environment can then be initialized using :meth:`gymnasium.make`. A vectorized version of the environment, with multiple instances of the same environment running in parallel, can be instantiated with :meth:`gymnasium.make_vec`.
|
||||
```
|
||||
|
||||
```python
|
||||
import gymnasium as gym
|
||||
>>> gym.make("gymnasium_env/GridWorld-v0")
|
||||
<OrderEnforcing<PassiveEnvChecker<GridWorld<gymnasium_env/GridWorld-v0>>>>
|
||||
>>> gym.make("gymnasium_env/GridWorld-v0", max_episode_steps=100)
|
||||
<TimeLimit<OrderEnforcing<PassiveEnvChecker<GridWorld<gymnasium_env/GridWorld-v0>>>>>
|
||||
>>> env = gym.make("gymnasium_env/GridWorld-v0", size=10)
|
||||
>>> env.unwrapped.size
|
||||
10
|
||||
>>> gym.make_vec("gymnasium_env/GridWorld-v0", num_envs=3)
|
||||
SyncVectorEnv(gymnasium_env/GridWorld-v0, num_envs=3)
|
||||
```
|
||||
|
||||
## Using Wrappers
|
||||
|
||||
Oftentimes, we want to use different variants of a custom environment, or we want to modify the behavior of an environment that is provided by Gymnasium or some other party. Wrappers allow us to do this without changing the environment implementation or adding any boilerplate code. Check out the [wrapper documentation](../api/wrappers) for details on how to use wrappers and instructions for implementing your own. In our example, observations cannot be used directly in learning code because they are dictionaries. However, we don't actually need to touch our environment implementation to fix this! We can simply add a wrapper on top of environment instances to flatten observations into a single array:
|
||||
|
||||
```python
|
||||
>>> from gymnasium.wrappers import FlattenObservation
|
||||
|
||||
>>> env = gym.make('gymnasium_env/GridWorld-v0')
|
||||
>>> env.observation_space
|
||||
Dict('agent': Box(0, 4, (2,), int64), 'target': Box(0, 4, (2,), int64))
|
||||
>>> env.reset()
|
||||
({'agent': array([4, 1]), 'target': array([2, 4])}, {'distance': 5.0})
|
||||
>>> wrapped_env = FlattenObservation(env)
|
||||
>>> wrapped_env.observation_space
|
||||
Box(0, 4, (4,), int64)
|
||||
>>> wrapped_env.reset()
|
||||
(array([3, 0, 2, 1]), {'distance': 2.0})
|
||||
```
|
@@ -3,13 +3,12 @@ layout: "contents"
|
||||
title: Migration Guide
|
||||
---
|
||||
|
||||
# v0.21 to v0.26 Migration Guide
|
||||
# Migration Guide - v0.21 to v1.0.0
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium.wrappers
|
||||
|
||||
Gymnasium is a fork of `OpenAI Gym v0.26 <https://github.com/openai/gym/releases/tag/0.26.2>`_, which introduced a large breaking change from `Gym v0.21 <https://github.com/openai/gym/releases/tag/v0.21.0>`_. In this guide, we briefly outline the API changes from Gym v0.21 - which a number of tutorials have been written for - to Gym v0.26. For environments still stuck in the v0.21 API, users can use the :class:`EnvCompatibility` wrapper to convert them to v0.26 compliant.
|
||||
For more information, see the `guide </content/gym_compatibility>`_
|
||||
Gymnasium is a fork of `OpenAI Gym v0.26 <https://github.com/openai/gym/releases/tag/0.26.2>`_, which introduced a large breaking change from `Gym v0.21 <https://github.com/openai/gym/releases/tag/v0.21.0>`_. In this guide, we briefly outline the API changes from Gym v0.21 - which a number of tutorials have been written for - to Gym v0.26. For environments still stuck in the v0.21 API, see the `guide </content/gym_compatibility>`_
|
||||
```
|
||||
|
||||
## Example code for v0.21
|
96 docs/introduction/record_agent.md (new file)
@@ -0,0 +1,96 @@
|
||||
---
|
||||
layout: "contents"
|
||||
title: Recording Agents
|
||||
---
|
||||
|
||||
# Recording Agents
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule: gymnasium.wrappers
|
||||
|
||||
During training or when evaluating an agent, it may be interesting to record agent behaviour over an episode and log the total reward accumulated. This can be achieved through two wrappers: :class:`RecordEpisodeStatistics` and :class:`RecordVideo`, the first tracks episode data such as the total rewards, episode length and time taken and the second generates mp4 videos of the agents using the environment renderings.
|
||||
|
||||
We show how to apply these wrappers for two types of problems; the first for recording data for every episode (normally evaluation) and the second for recording data periodically (during normal training).
|
||||
```
|
||||
|
||||
## Recording Every Episode
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule: gymnasium.wrappers
|
||||
|
||||
Given a trained agent, you may wish to record several episodes during evaluation to see how the agent acts. Below we provide an example script to do this with the :class:`RecordEpisodeStatistics` and :class:`RecordVideo`.
|
||||
```
|
||||
|
||||
```python
|
||||
import gymnasium as gym
|
||||
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
|
||||
|
||||
num_eval_episodes = 4
|
||||
|
||||
env = gym.make("CartPole-v1", render_mode="rgb_array") # replace with your environment
|
||||
env = RecordVideo(env, video_folder="cartpole-agent", name_prefix="eval",
|
||||
episode_trigger=lambda x: True)
|
||||
env = RecordEpisodeStatistics(env, buffer_length=num_eval_episodes)
|
||||
|
||||
for episode_num in range(num_eval_episodes):
|
||||
obs, info = env.reset()
|
||||
|
||||
episode_over = False
|
||||
while not episode_over:
|
||||
action = env.action_space.sample() # replace with actual agent
|
||||
obs, reward, terminated, truncated, info = env.step(action)
|
||||
|
||||
episode_over = terminated or truncated
|
||||
env.close()
|
||||
|
||||
print(f'Episode time taken: {env.time_queue}')
|
||||
print(f'Episode total rewards: {env.return_queue}')
|
||||
print(f'Episode lengths: {env.length_queue}')
|
||||
```
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule: gymnasium.wrappers
|
||||
|
||||
In the script above, for the :class:`RecordVideo` wrapper, we specify three different variables: ``video_folder`` to specify the folder that the videos are saved to (change for your problem), ``name_prefix`` for the prefix of the videos themselves and finally an ``episode_trigger`` such that every episode is recorded. This means that for every episode of the environment, a video will be recorded and saved in the style "cartpole-agent/eval-episode-x.mp4".
|
||||
|
||||
For :class:`RecordEpisodeStatistics`, we only need to specify the buffer length; this is the max length of the internal ``time_queue``, ``return_queue`` and ``length_queue``. Rather than collect the data for each episode individually, we can use the data queues to print the information at the end of the evaluation.
|
||||
|
||||
To speed up evaluation, it is possible to implement this with vector environments in order to evaluate ``N`` episodes at the same time in parallel rather than in series, as sketched below.
|
||||
```
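
A minimal sketch of such a parallel evaluation using `gymnasium.make_vec`; here the episodic returns are accumulated by hand, and the random policy stands in for a trained agent (a real agent would act on the batched observations):

```python
import numpy as np
import gymnasium as gym

num_envs = 4
envs = gym.make_vec("CartPole-v1", num_envs=num_envs)  # sub-environments auto-reset when they end

obs, infos = envs.reset(seed=42)
returns = np.zeros(num_envs)   # running return of the current episode in each sub-environment
finished_returns = []          # returns of completed episodes

while len(finished_returns) < 20:                    # evaluate roughly 20 episodes in total
    actions = envs.action_space.sample()             # replace with a (batched) agent policy
    obs, rewards, terminations, truncations, infos = envs.step(actions)
    returns += rewards
    for i, done in enumerate(terminations | truncations):
        if done:
            finished_returns.append(returns[i])
            returns[i] = 0.0

envs.close()
print(f"mean return over {len(finished_returns)} episodes: {np.mean(finished_returns):.2f}")
```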
|
||||
|
||||
## Recording the Agent during Training
|
||||
|
||||
During training, an agent will act in hundreds or thousands of episodes, so you can't record a video for each episode; however, developers might still want to know how the agent acts at different points in the training, so we record episodes periodically during training. For the episode statistics, on the other hand, it is more helpful to know this data for every episode. The following script provides an example of how to periodically record episodes of an agent while recording every episode's statistics (we use Python's logger, but [tensorboard](https://www.tensorflow.org/tensorboard), [wandb](https://docs.wandb.ai/guides/track) and other modules are available).
|
||||
|
||||
```python
|
||||
import logging
|
||||
|
||||
import gymnasium as gym
|
||||
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
|
||||
|
||||
training_period = 250 # record the agent's episode every 250
|
||||
num_training_episodes = 10_000 # total number of training episodes
|
||||
|
||||
env = gym.make("CartPole-v1", render_mode="rgb_array") # replace with your environment
|
||||
env = RecordVideo(env, video_folder="cartpole-agent", name_prefix="training",
|
||||
episode_trigger=lambda x: x % training_period == 0)
|
||||
env = RecordEpisodeStatistics(env)
|
||||
|
||||
for episode_num in range(num_training_episodes):
|
||||
obs, info = env.reset()
|
||||
|
||||
episode_over = False
|
||||
while not episode_over:
|
||||
action = env.action_space.sample() # replace with actual agent
|
||||
obs, reward, terminated, truncated, info = env.step(action)
|
||||
|
||||
episode_over = terminated or truncated
|
||||
|
||||
logging.info(f"episode-{episode_num}", info["episode"])
|
||||
env.close()
|
||||
```
|
||||
|
||||
## More information
|
||||
|
||||
* [Training an agent](train_agent.md)
|
||||
* [More training tutorials](../tutorials/training_agents)
|
32 docs/introduction/speed_up_env.md (new file)
@@ -0,0 +1,32 @@
|
||||
---
|
||||
layout: "contents"
|
||||
title: Speeding Up Training
|
||||
firstpage:
|
||||
---
|
||||
|
||||
# Speeding Up Training
|
||||
|
||||
Reinforcement Learning can be a computationally difficult problem that is both sample inefficient and difficult to scale to more complex environments.
|
||||
In this page, we are going to talk about general strategies for speeding up training: vectorizing environments, optimizing training and algorithmic heuristics.
|
||||
|
||||
## Vectorized environments
|
||||
|
||||
```{eval-rst}
|
||||
.. py:currentmodule:: gymnasium
|
||||
|
||||
Normally in training, agents will sample from a single environment, limiting the number of steps (samples) per second to the speed of the environment. Training can be substantially sped up by acting in multiple environments at the same time, referred to as vectorized environments, where multiple instances of the same environment run in parallel (on multiple CPUs). Gymnasium provides two built-in classes to vectorize most generic environments: :class:`gymnasium.vector.SyncVectorEnv` and :class:`gymnasium.vector.AsyncVectorEnv`, which can be easily created with :meth:`gymnasium.make_vec`.
|
||||
|
||||
It should be noted that vectorizing environments might require changes to your training algorithm and can cause instability in training for very large numbers of sub-environments.
|
||||
```
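
A minimal sketch of the difference, stepping eight copies of an (illustrative) environment with a single batched call:

```python
import gymnasium as gym

# a single environment: one observation / reward per step
env = gym.make("CartPole-v1")
obs, info = env.reset()

# eight environments stepped together: batched observations, rewards, terminations, ...
envs = gym.make_vec("CartPole-v1", num_envs=8)
observations, infos = envs.reset(seed=123)
observations, rewards, terminations, truncations, infos = envs.step(envs.action_space.sample())
print(observations.shape)  # (8, 4) for CartPole's 4-dimensional observation

env.close()
envs.close()
```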
|
||||
|
||||
## Optimizing training
|
||||
|
||||
Speeding up training can generally be achieved by optimizing your code, in particular for deep reinforcement learning that uses GPUs in training, by reducing the need to transfer data between RAM and GPU memory.
|
||||
|
||||
For code written in PyTorch and Jax, both provide the ability to `jit` (just-in-time compile) the code for CPU, GPU and TPU (for Jax) to decrease training time, as in the sketch below.
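
As an illustrative, Gymnasium-agnostic sketch, just-in-time compiling a numerical function with Jax looks like this (the function itself is an arbitrary example):

```python
import jax
import jax.numpy as jnp

def squared_error(params, observations, targets):
    # arbitrary example computation: a linear model's mean squared error
    predictions = observations @ params
    return jnp.mean((predictions - targets) ** 2)

# compile once; subsequent calls with same-shaped inputs reuse the compiled version
fast_squared_error = jax.jit(squared_error)

params = jnp.ones((4,))
observations = jnp.zeros((32, 4))
targets = jnp.zeros((32,))
print(fast_squared_error(params, observations, targets))  # 0.0
```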
|
||||
|
||||
## Algorithmic heuristics
|
||||
|
||||
Academic researchers are consistently exploring new optimizations to improve agent performance and reduce the number of samples required to train an agent.
|
||||
In particular, sample-efficient reinforcement learning is a specialist sub-field of reinforcement learning that explores optimizations for training algorithms and environment heuristics that reduce the number of agent observations needed for an agent to maximise performance.
|
||||
As the field is consistently improving, we refer readers to survey papers and the latest research to learn about the most efficient algorithmic improvements that currently exist.
|
165 docs/introduction/train_agent.md (new file)
@@ -0,0 +1,165 @@
|
||||
---
|
||||
layout: "contents"
|
||||
title: Train an Agent
|
||||
---
|
||||
|
||||
# Training an Agent
|
||||
|
||||
This page provides a short outline of how to train an agent for a Gymnasium environment; in particular, we will use tabular Q-learning to solve the Blackjack v1 environment. For a complete version of this tutorial and more training tutorials for other environments and algorithms, see [this](../tutorials/training_agents). Please read [basic usage](basic_usage) before reading this page. Before we implement any code, here is an overview of Blackjack and Q-learning.
|
||||
|
||||
Blackjack is one of the most popular casino card games and is also infamous for being beatable under certain conditions. This version of the game uses an infinite deck (we draw the cards with replacement), so counting cards won't be a viable strategy in our simulated game. The observation is a tuple of the player's current sum, the value of the dealer's face-up card and a boolean value for whether the player holds a usable ace. The agent can pick between two actions: stand (0), such that the player takes no more cards, and hit (1), such that the player takes another card. To win, your card sum should be greater than the dealer's without exceeding 21. The game ends if the player selects stand or if the card sum is greater than 21. Full documentation can be found at [https://gymnasium.farama.org/environments/toy_text/blackjack](https://gymnasium.farama.org/environments/toy_text/blackjack).
|
||||
|
||||
Q-learning is a model-free off-policy learning algorithm by Watkins, 1989 for environments with discrete action spaces and was famous for being the first reinforcement learning algorithm to prove convergence to an optimal policy under certain conditions.
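
For reference, the ``update`` method of the agent implemented below follows the standard Q-learning rule, where ``lr`` corresponds to the learning rate ``α`` and ``discount_factor`` to the discount ``γ``:

```{eval-rst}
.. math::

    Q(s, a) \leftarrow Q(s, a) + \alpha \Big( r + \gamma \, (1 - \text{terminated}) \max_{a'} Q(s', a') - Q(s, a) \Big)
```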
|
||||
|
||||
## Executing an action
|
||||
|
||||
After receiving our first observation, we are only going to use the ``env.step(action)`` function to interact with the environment. This function takes an action as input and executes it in the environment. Because that action changes the state of the environment, it returns five useful variables to us. These are:
|
||||
|
||||
- ``next observation``: This is the observation that the agent will receive after taking the action.
|
||||
- ``reward``: This is the reward that the agent will receive after taking the action.
|
||||
- ``terminated``: This is a boolean variable that indicates whether or not the environment has terminated, i.e., ended due to an internal condition.
|
||||
- ``truncated``: This is a boolean variable that also indicates whether the episode ended by early truncation, i.e., a time limit is reached.
|
||||
- ``info``: This is a dictionary that might contain additional information about the environment.
|
||||
|
||||
The ``next observation``, ``reward``, ``terminated`` and ``truncated`` variables are self-explanatory, but the ``info`` variable requires some additional explanation. This variable contains a dictionary that might have some extra information about the environment, but in the Blackjack-v1 environment you can ignore it. For example in Atari environments the info dictionary has a ``ale.lives`` key that tells us how many lives the agent has left. If the agent has 0 lives, then the episode is over.
|
||||
|
||||
Note that it is not a good idea to call ``env.render()`` in your training loop because rendering slows down training by a lot. Rather try to build an extra loop to evaluate and showcase the agent after training.
|
||||
|
||||
## Building an agent
|
||||
|
||||
Let's build a Q-learning agent to solve Blackjack! We'll need some functions for picking an action and updating the agent's action values. To ensure that the agent explores the environment, one possible solution is the epsilon-greedy strategy, where we pick a random action with probability ``epsilon`` and otherwise the greedy action (currently valued as the best) with probability ``1 - epsilon``.
|
||||
|
||||
```python
|
||||
from collections import defaultdict
|
||||
import gymnasium as gym
|
||||
import numpy as np
|
||||
|
||||
|
||||
class BlackjackAgent:
|
||||
def __init__(
|
||||
self,
|
||||
env: gym.Env,
|
||||
learning_rate: float,
|
||||
initial_epsilon: float,
|
||||
epsilon_decay: float,
|
||||
final_epsilon: float,
|
||||
discount_factor: float = 0.95,
|
||||
):
|
||||
"""Initialize a Reinforcement Learning agent with an empty dictionary
|
||||
of state-action values (q_values), a learning rate and an epsilon.
|
||||
|
||||
Args:
|
||||
env: The training environment
|
||||
learning_rate: The learning rate
|
||||
initial_epsilon: The initial epsilon value
|
||||
epsilon_decay: The decay for epsilon
|
||||
final_epsilon: The final epsilon value
|
||||
discount_factor: The discount factor for computing the Q-value
|
||||
"""
|
||||
self.env = env
|
||||
self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))
|
||||
|
||||
self.lr = learning_rate
|
||||
self.discount_factor = discount_factor
|
||||
|
||||
self.epsilon = initial_epsilon
|
||||
self.epsilon_decay = epsilon_decay
|
||||
self.final_epsilon = final_epsilon
|
||||
|
||||
self.training_error = []
|
||||
|
||||
def get_action(self, obs: tuple[int, int, bool]) -> int:
|
||||
"""
|
||||
Returns the best action with probability (1 - epsilon)
|
||||
otherwise a random action with probability epsilon to ensure exploration.
|
||||
"""
|
||||
# with probability epsilon return a random action to explore the environment
|
||||
if np.random.random() < self.epsilon:
|
||||
return self.env.action_space.sample()
|
||||
# with probability (1 - epsilon) act greedily (exploit)
|
||||
else:
|
||||
return int(np.argmax(self.q_values[obs]))
|
||||
|
||||
def update(
|
||||
self,
|
||||
obs: tuple[int, int, bool],
|
||||
action: int,
|
||||
reward: float,
|
||||
terminated: bool,
|
||||
next_obs: tuple[int, int, bool],
|
||||
):
|
||||
"""Updates the Q-value of an action."""
|
||||
future_q_value = (not terminated) * np.max(self.q_values[next_obs])
|
||||
temporal_difference = (
|
||||
reward + self.discount_factor * future_q_value - self.q_values[obs][action]
|
||||
)
|
||||
|
||||
self.q_values[obs][action] = (
|
||||
self.q_values[obs][action] + self.lr * temporal_difference
|
||||
)
|
||||
self.training_error.append(temporal_difference)
|
||||
|
||||
def decay_epsilon(self):
|
||||
self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
|
||||
```
|
||||
|
||||
## Training the agent
|
||||
|
||||
To train the agent, we will let the agent play one episode (one complete game is called an episode) at a time and then update its Q-values after each episode. The agent will have to experience a lot of episodes to explore the environment sufficiently.
|
||||
|
||||
```python
|
||||
# hyperparameters
|
||||
learning_rate = 0.01
|
||||
n_episodes = 100_000
|
||||
start_epsilon = 1.0
|
||||
epsilon_decay = start_epsilon / (n_episodes / 2) # reduce the exploration over time
|
||||
final_epsilon = 0.1

# the agent needs the environment (for the size of the action space),
# so we create and wrap the environment before constructing the agent
env = gym.make("Blackjack-v1", sab=False)
env = gym.wrappers.RecordEpisodeStatistics(env, deque_size=n_episodes)

agent = BlackjackAgent(
    env=env,
    learning_rate=learning_rate,
    initial_epsilon=start_epsilon,
    epsilon_decay=epsilon_decay,
    final_epsilon=final_epsilon,
)
|
||||
```
|
||||
|
||||
Info: The current hyperparameters are set to quickly train a decent agent. If you want to converge to the optimal policy, try increasing the ``n_episodes`` by 10x and lower the learning_rate (e.g. to 0.001).
|
||||
|
||||
```python
|
||||
from tqdm import tqdm
|
||||
|
||||
env = gym.make("Blackjack-v1", sab=False)
|
||||
env = gym.wrappers.RecordEpisodeStatistics(env, deque_size=n_episodes)
|
||||
|
||||
for episode in tqdm(range(n_episodes)):
|
||||
obs, info = env.reset()
|
||||
done = False
|
||||
|
||||
# play one episode
|
||||
while not done:
|
||||
action = agent.get_action(obs)
|
||||
next_obs, reward, terminated, truncated, info = env.step(action)
|
||||
|
||||
# update the agent
|
||||
agent.update(obs, action, reward, terminated, next_obs)
|
||||
|
||||
# update if the environment is done and the current obs
|
||||
done = terminated or truncated
|
||||
obs = next_obs
|
||||
|
||||
agent.decay_epsilon()
|
||||
```
|
||||
|
||||

|
||||
|
||||
## Visualising the policy
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
Hopefully this tutorial helped you get a grip on how to interact with Gymnasium environments and set you on a journey to solve many more RL challenges.
|
||||
|
||||
It is recommended that you solve this environment by yourself (project based learning is really effective!). You can apply your favorite discrete RL algorithm or give Monte Carlo ES a try (covered in [Sutton & Barto](http://incompleteideas.net/book/the-book-2nd.html), section 5.3) - this way you can compare your results directly to the book.
|
||||
|
||||
Best of luck!
|
@@ -22,7 +22,7 @@ Recommended solution
|
||||
pipx install copier
|
||||
|
||||
Alternative solutions
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Install Copier with Pip or Conda:
|
||||
|
||||
@@ -98,6 +98,10 @@ randomly at the beginning of the episode.
|
||||
|
||||
An episode in this environment (with ``size=5``) might look like this:
|
||||
|
||||
.. image:: /_static/videos/tutorials/environment-creation-example-episode.gif
|
||||
:width: 400
|
||||
:alt: Example episode of the custom environment
|
||||
|
||||
where the blue dot is the agent and the red square represents the
|
||||
target.
|
||||
|
||||
@@ -111,7 +115,7 @@ Let us look at the source code of ``GridWorldEnv`` piece by piece:
|
||||
# Our custom environment will inherit from the abstract class
|
||||
# ``gymnasium.Env``. You shouldn’t forget to add the ``metadata``
|
||||
# attribute to your class. There, you should specify the render-modes that
|
||||
# are supported by your environment (e.g. ``"human"``, ``"rgb_array"``,
|
||||
# are supported by your environment (e.g., ``"human"``, ``"rgb_array"``,
|
||||
# ``"ansi"``) and the framerate at which your environment should be
|
||||
# rendered. Every environment should support ``None`` as render-mode; you
|
||||
# don’t need to add it in the metadata. In ``GridWorldEnv``, we will
|
||||
@@ -141,10 +145,10 @@ from gymnasium import spaces
|
||||
|
||||
|
||||
class Actions(Enum):
|
||||
right = 0
|
||||
up = 1
|
||||
left = 2
|
||||
down = 3
|
||||
RIGHT = 0
|
||||
UP = 1
|
||||
LEFT = 2
|
||||
DOWN = 3
|
||||
|
||||
|
||||
class GridWorldEnv(gym.Env):
|
||||
@@ -162,6 +166,8 @@ class GridWorldEnv(gym.Env):
|
||||
"target": spaces.Box(0, size - 1, shape=(2,), dtype=int),
|
||||
}
|
||||
)
|
||||
self._agent_location = np.array([-1, -1], dtype=int)
|
||||
self._target_location = np.array([-1, -1], dtype=int)
|
||||
|
||||
# We have 4 actions, corresponding to "right", "up", "left", "down"
|
||||
self.action_space = spaces.Discrete(4)
|
||||
@@ -172,10 +178,10 @@ class GridWorldEnv(gym.Env):
|
||||
i.e. 0 corresponds to "right", 1 to "up" etc.
|
||||
"""
|
||||
self._action_to_direction = {
|
||||
Actions.right: np.array([1, 0]),
|
||||
Actions.up: np.array([0, 1]),
|
||||
Actions.left: np.array([-1, 0]),
|
||||
Actions.down: np.array([0, -1]),
|
||||
Actions.RIGHT.value: np.array([1, 0]),
|
||||
Actions.UP.value: np.array([0, 1]),
|
||||
Actions.LEFT.value: np.array([-1, 0]),
|
||||
Actions.DOWN.value: np.array([0, -1]),
|
||||
}
|
||||
|
||||
assert render_mode is None or render_mode in self.metadata["render_modes"]
|
||||
@@ -218,7 +224,7 @@ class GridWorldEnv(gym.Env):
|
||||
|
||||
# %%
|
||||
# Oftentimes, info will also contain some data that is only available
|
||||
# inside the ``step`` method (e.g. individual reward terms). In that case,
|
||||
# inside the ``step`` method (e.g., individual reward terms). In that case,
|
||||
# we would have to update the dictionary that is returned by ``_get_info``
|
||||
# in ``step``.
|
||||
|
||||
@@ -443,8 +449,6 @@ class GridWorldEnv(gym.Env):
|
||||
# +----------------------+-----------+-----------+---------------------------------------------------------------------------------------------------------------+
|
||||
# | ``order_enforce`` | ``bool`` | ``True`` | Whether to wrap the environment in an ``OrderEnforcing`` wrapper |
|
||||
# +----------------------+-----------+-----------+---------------------------------------------------------------------------------------------------------------+
|
||||
# | ``autoreset`` | ``bool`` | ``False`` | Whether to wrap the environment in an ``AutoResetWrapper`` |
|
||||
# +----------------------+-----------+-----------+---------------------------------------------------------------------------------------------------------------+
|
||||
# | ``kwargs`` | ``dict`` | ``{}`` | The default kwargs to pass to the environment class |
|
||||
# +----------------------+-----------+-----------+---------------------------------------------------------------------------------------------------------------+
|
||||
#
|
||||
|
@@ -112,6 +112,7 @@ class ClipReward(RewardWrapper):
|
||||
# - You can set a new action or observation space by defining ``self.action_space`` or ``self.observation_space`` in ``__init__``, respectively
|
||||
# - You can set new metadata and reward range by defining ``self.metadata`` and ``self.reward_range`` in ``__init__``, respectively
|
||||
# - You can override :meth:`gymnasium.Wrapper.step`, :meth:`gymnasium.Wrapper.render`, :meth:`gymnasium.Wrapper.close` etc.
|
||||
#
|
||||
# If you do this, you can access the environment that was passed
|
||||
# to your wrapper (which *still* might be wrapped in some other wrapper) by accessing the attribute :attr:`env`.
|
||||
#
|
||||
|
@@ -412,7 +412,7 @@ class NormalizeObservation(
|
||||
):
|
||||
"""Normalizes observations to be centered at the mean with unit variance.
|
||||
|
||||
The property :prop:`_update_running_mean` allows to freeze/continue the running mean calculation of the observation
|
||||
The property :attr:`update_running_mean` allows to freeze/continue the running mean calculation of the observation
|
||||
statistics. If ``True`` (default), the ``RunningMeanStd`` will get updated every time ``step`` or ``reset`` is called.
|
||||
If ``False``, the calculated statistics are used but not updated anymore; this may be used during evaluation.
|
||||
|
||||
|
@@ -65,14 +65,14 @@ class RecordEpisodeStatistics(VectorWrapper):
|
||||
def __init__(
|
||||
self,
|
||||
env: VectorEnv,
|
||||
deque_size: int = 100,
|
||||
buffer_length: int = 100,
|
||||
stats_key: str = "episode",
|
||||
):
|
||||
"""This wrapper will keep track of cumulative rewards and episode lengths.
|
||||
|
||||
Args:
|
||||
env (Env): The environment to apply the wrapper
|
||||
deque_size: The size of the buffers :attr:`return_queue` and :attr:`length_queue`
|
||||
buffer_length: The size of the buffers :attr:`return_queue`, :attr:`length_queue` and :attr:`time_queue`
|
||||
stats_key: The info key to save the data
|
||||
"""
|
||||
super().__init__(env)
|
||||
@@ -84,9 +84,9 @@ class RecordEpisodeStatistics(VectorWrapper):
|
||||
self.episode_returns: np.ndarray = np.zeros(())
|
||||
self.episode_lengths: np.ndarray = np.zeros(())
|
||||
|
||||
self.time_queue = deque(maxlen=deque_size)
|
||||
self.return_queue = deque(maxlen=deque_size)
|
||||
self.length_queue = deque(maxlen=deque_size)
|
||||
self.time_queue = deque(maxlen=buffer_length)
|
||||
self.return_queue = deque(maxlen=buffer_length)
|
||||
self.length_queue = deque(maxlen=buffer_length)
|
||||
|
||||
def reset(
|
||||
self,
|
||||
|