Improve the introduction page (#1406)

This commit is contained in:
Mark Towers
2025-06-14 14:48:29 +01:00
committed by GitHub
parent 7d50c88073
commit 26d64bda73
6 changed files with 1373 additions and 231 deletions

View File

@@ -6,12 +6,21 @@ firstpage:
# Basic Usage
## What is Reinforcement Learning?
Before diving into Gymnasium, let's understand what we're trying to achieve. Reinforcement learning is like teaching through trial and error - an agent learns by trying actions, receiving feedback (rewards), and gradually improving its behavior. Think of training a pet with treats, learning to ride a bike through practice, or mastering a video game by playing it repeatedly.
The key insight is that we don't tell the agent exactly what to do. Instead, we create an environment where it can experiment safely and learn from the consequences of its actions.
## Why Gymnasium?
```{eval-rst}
.. py:currentmodule:: gymnasium
Gymnasium is a project that provides an API (application programming interface) for all single agent reinforcement learning environments, with implementations of common environments: cartpole, pendulum, mountain-car, mujoco, atari, and more. This page will outline the basics of how to use Gymnasium including its four key functions: :meth:`make`, :meth:`Env.reset`, :meth:`Env.step` and :meth:`Env.render`.
Whether you want to train an agent to play games, control robots, or optimize trading strategies, Gymnasium gives you the tools to build and test your ideas.
At its heart, Gymnasium provides an API (application programming interface) for all single agent reinforcement learning environments, with implementations of common environments: cartpole, pendulum, mountain-car, mujoco, atari, and more. This page will outline the basics of how to use Gymnasium including its four key functions: :meth:`make`, :meth:`Env.reset`, :meth:`Env.step` and :meth:`Env.render`.
At the core of Gymnasium is :class:`Env`, a high-level Python class representing a Markov decision process (MDP) from reinforcement learning theory (note: this is not a perfect reconstruction, as it is missing several components of MDPs). The class provides users the ability to generate an initial state, transition / move to new states given an action, and visualize the environment. Alongside :class:`Env`, :class:`Wrapper` is provided to help augment or modify the environment, in particular, the agent's observations, rewards, and actions.
At the core of Gymnasium is :class:`Env`, a high-level Python class representing a Markov decision process (MDP) from reinforcement learning theory (note: this is not a perfect reconstruction, as it is missing several components of MDPs). The class provides users the ability to start new episodes, take actions, and visualize the agent's current state. Alongside :class:`Env`, :class:`Wrapper` is provided to help augment or modify the environment, in particular, the agent's observations, rewards, and actions.
```
## Initializing Environments
@@ -24,7 +33,14 @@ Initializing environments is very easy in Gymnasium and can be done via the :met
```python
import gymnasium as gym
# Create a simple environment perfect for beginners
env = gym.make('CartPole-v1')
# The CartPole environment: balance a pole on a moving cart
# - Simple but not trivial
# - Fast training
# - Clear success/failure criteria
```
```{eval-rst}
@@ -33,9 +49,16 @@ env = gym.make('CartPole-v1')
This function will return an :class:`Env` for users to interact with. To see all environments you can create, use :meth:`pprint_registry`. Furthermore, :meth:`make` provides a number of additional arguments for specifying keywords to the environment, adding more or less wrappers, etc. See :meth:`make` for more information.
```
## Interacting with the Environment
## Understanding the Agent-Environment Loop
In reinforcement learning, the classic "agent-environment loop" pictured below is a simplified representation of how an agent and environment interact with each other. The agent receives an observation about the environment, the agent then selects an action, which the environment uses to determine the reward and the next observation. The cycle then repeats itself until the environment ends (terminates).
In reinforcement learning, the classic "agent-environment loop" pictured below represents how learning happens in RL. It's simpler than it might first appear:
1. **Agent observes** the current situation (like looking at a game screen)
2. **Agent chooses an action** based on what it sees (like pressing a button)
3. **Environment responds** with a new situation and a reward (game state changes, score updates)
4. **Repeat** until the episode ends
This might seem simple, but it's how agents learn everything from playing chess to controlling robots to optimizing business processes.
```{image} /_static/diagrams/AE_loop.png
:width: 50%
@@ -49,45 +72,63 @@ In reinforcement learning, the classic "agent-environment loop" pictured below i
:class: only-dark
```
For Gymnasium, the "agent-environment-loop" is implemented below for a single episode (until the environment ends). See the next section for a line-by-line explanation. Note that running this code requires installing swig (`pip install swig`, [download](https://www.swig.org/download.html), or via [Homebrew](https://formulae.brew.sh/formula/swig) if you are using macOS) along with `pip install "gymnasium[box2d]"`.
## Your First RL Program
Let's start with a simple example using CartPole - perfect for understanding the basics:
```python
# Run `pip install "gymnasium[classic-control]"` for this example.
import gymnasium as gym
env = gym.make("LunarLander-v3", render_mode="human")
# Create our training environment - a cart with a pole that needs balancing
env = gym.make("CartPole-v1", render_mode="human")
# Reset environment to start a new episode
observation, info = env.reset()
# observation: what the agent can "see" - cart position, velocity, pole angle, etc.
# info: extra debugging information (usually not needed for basic learning)
print(f"Starting observation: {observation}")
# Example output: [ 0.01234567 -0.00987654 0.02345678 0.01456789]
# [cart_position, cart_velocity, pole_angle, pole_angular_velocity]
episode_over = False
total_reward = 0
while not episode_over:
action = env.action_space.sample() # agent policy that uses the observation and info
# Choose an action: 0 = push cart left, 1 = push cart right
action = env.action_space.sample() # Random action for now - real agents will be smarter!
# Take the action and see what happens
observation, reward, terminated, truncated, info = env.step(action)
# reward: +1 for each step the pole stays upright
# terminated: True if pole falls too far (agent failed)
# truncated: True if we hit the time limit (500 steps)
total_reward += reward
episode_over = terminated or truncated
print(f"Episode finished! Total reward: {total_reward}")
env.close()
```
The output should look something like this:
**What you should see**: A window opens showing a cart with a pole. The cart moves randomly left and right, and the pole eventually falls over. This is expected - the agent is acting randomly!
```{figure} https://user-images.githubusercontent.com/15806078/153222406-af5ce6f0-4696-4a24-a683-46ad4939170c.gif
:width: 50%
:align: center
```
### Explaining the code
### Explaining the Code Step by Step
```{eval-rst}
.. py:currentmodule:: gymnasium
First, an environment is created using :meth:`make` with an additional keyword ``"render_mode"`` that specifies how the environment should be visualized. See :meth:`Env.render` for details on the default meaning of different render modes. In this example, we use the ``"LunarLander"`` environment where the agent controls a spaceship that needs to land safely.
First, an environment is created using :meth:`make` with an optional ``"render_mode"`` parameter that specifies how the environment should be visualized. See :meth:`Env.render` for details on different render modes. The render mode determines whether you see a visual window ("human"), get image arrays ("rgb_array"), or run without visuals (None - fastest for training).
After initializing the environment, we :meth:`Env.reset` the environment to get the first observation of the environment along with an additional information. For initializing the environment with a particular random seed or options (see the environment documentation for possible values) use the ``seed`` or ``options`` parameters with :meth:`reset`.
After initializing the environment, we :meth:`Env.reset` the environment to get the first observation along with additional information. This is like starting a new game or episode. For initializing the environment with a particular random seed or options (see the environment documentation for possible values) use the ``seed`` or ``options`` parameters with :meth:`reset`.
As we wish to continue the agent-environment loop until the environment ends, which is in an unknown number of timesteps, we define ``episode_over`` as a variable to know when to stop interacting with the environment along with a while loop that uses it.
As we want to continue the agent-environment loop until the environment ends (which happens in an unknown number of timesteps), we define ``episode_over`` as a variable to control our while loop.
Next, the agent performs an action in the environment, :meth:`Env.step` executes the selected action (in this case random with ``env.action_space.sample()``) to update the environment. This action can be imagined as moving a robot or pressing a button on a games' controller that causes a change within the environment. As a result, the agent receives a new observation from the updated environment along with a reward for taking the action. This reward could be for instance positive for destroying an enemy or a negative reward for moving into lava. One such action-observation exchange is referred to as a **timestep**.
Next, the agent performs an action in the environment. :meth:`Env.step` executes the selected action (in our example, random with ``env.action_space.sample()``) to update the environment. This action can be imagined as moving a robot, pressing a button on a game controller, or making a trading decision. As a result, the agent receives a new observation from the updated environment along with a reward for taking the action. This reward could be positive for good actions (like successfully balancing the pole) or negative for bad actions (like letting the pole fall). One such action-observation exchange is called a **timestep**.
However, after some timesteps, the environment may end, this is called the terminal state. For instance, the robot may have crashed, or may have succeeded in completing a task, the environment will need to stop as the agent cannot continue. In Gymnasium, if the environment has terminated, this is returned by :meth:`step` as the third variable, ``terminated``. Similarly, we may also want the environment to end after a fixed number of timesteps, in this case, the environment issues a truncated signal. If either of ``terminated`` or ``truncated`` are ``True`` then we end the episode but in most cases users might wish to restart the environment, this can be done with ``env.reset()``.
However, after some timesteps, the environment may end - this is called the terminal state. For instance, the robot may have crashed, or succeeded in completing a task, or we may want to stop after a fixed number of timesteps. In Gymnasium, if the environment has terminated due to the task being completed or failed, this is returned by :meth:`step` as ``terminated=True``. If we want the environment to end after a fixed number of timesteps (like a time limit), the environment issues a ``truncated=True`` signal. If either ``terminated`` or ``truncated`` are ``True``, we end the episode. In most cases, you'll want to restart the environment with ``env.reset()`` to begin a new episode.
```
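Putting the reset-between-episodes advice into practice, here is a minimal sketch of running several episodes back to back; the episode count and the use of CartPole are illustrative, not part of the example above:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

for episode in range(5):
    observation, info = env.reset()  # start each episode from a fresh state
    episode_over = False
    total_reward = 0

    while not episode_over:
        action = env.action_space.sample()  # random policy, as in the example above
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        episode_over = terminated or truncated

    print(f"Episode {episode + 1} finished with reward {total_reward}")

env.close()
```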
## Action and observation spaces
@@ -95,23 +136,43 @@ However, after some timesteps, the environment may end, this is called the termi
```{eval-rst}
.. py:currentmodule:: gymnasium.Env
Every environment specifies the format of valid actions and observations with the :attr:`action_space` and :attr:`observation_space` attributes. This is helpful for knowing both the expected input and output of the environment, as all valid actions and observations should be contained with their respective spaces. In the example above, we sampled random actions via ``env.action_space.sample()`` instead of using an agent policy, mapping observations to actions which users will want to make.
Every environment specifies the format of valid actions and observations with the :attr:`action_space` and :attr:`observation_space` attributes. This is helpful for knowing both the expected input and output of the environment, as all valid actions and observations should be contained within their respective spaces. In the example above, we sampled random actions via ``env.action_space.sample()`` instead of using an intelligent agent policy that maps observations to actions (which is what you'll learn to build).
Importantly, :attr:`Env.action_space` and :attr:`Env.observation_space` are instances of :class:`Space`, a high-level python class that provides the key functions: :meth:`Space.contains` and :meth:`Space.sample`. Gymnasium has support for a wide range of spaces that users might need:
Understanding these spaces is crucial for building agents:
- **Action Space**: What can your agent do? (discrete choices, continuous values, etc.)
- **Observation Space**: What can your agent see? (images, numbers, structured data, etc.)
Importantly, :attr:`Env.action_space` and :attr:`Env.observation_space` are instances of :class:`Space`, a high-level python class that provides key functions: :meth:`Space.contains` and :meth:`Space.sample`. Gymnasium supports a wide range of spaces:
.. py:currentmodule:: gymnasium.spaces
- :class:`Box`: describes bounded space with upper and lower limits of any n-dimensional shape.
- :class:`Discrete`: describes a discrete space where ``{0, 1, ..., n-1}`` are the possible values our observation or action can take.
- :class:`MultiBinary`: describes a binary space of any n-dimensional shape.
- :class:`MultiDiscrete`: consists of a series of :class:`Discrete` action spaces with a different number of actions in each element.
- :class:`Text`: describes a string space with a minimum and maximum length.
- :class:`Dict`: describes a dictionary of simpler spaces.
- :class:`Box`: describes a bounded space with upper and lower limits of any n-dimensional shape (like continuous control or image pixels).
- :class:`Discrete`: describes a discrete space where ``{0, 1, ..., n-1}`` are the possible values (like button presses or menu choices).
- :class:`MultiBinary`: describes a binary space of any n-dimensional shape (like multiple on/off switches).
- :class:`MultiDiscrete`: consists of a series of :class:`Discrete` action spaces with different numbers of actions in each element.
- :class:`Text`: describes a string space with minimum and maximum length.
- :class:`Dict`: describes a dictionary of simpler spaces (like our GridWorld example you'll see later).
- :class:`Tuple`: describes a tuple of simple spaces.
- :class:`Graph`: describes a mathematical graph (network) with interlinking nodes and edges.
- :class:`Sequence`: describes a variable-length sequence of simpler spaces.
For example usage of spaces, see their `documentation </api/spaces>`_ along with `utility functions </api/spaces/utils>`_. There are a couple of more niche spaces :class:`Graph`, :class:`Sequence` and :class:`Text`.
For example usage of spaces, see their `documentation </api/spaces>`_ along with `utility functions </api/spaces/utils>`_.
```
Let's look at some examples:
```python
import gymnasium as gym
# Discrete action space (button presses)
env = gym.make("CartPole-v1")
print(f"Action space: {env.action_space}") # Discrete(2) - left or right
print(f"Sample action: {env.action_space.sample()}") # 0 or 1
# Box observation space (continuous values)
print(f"Observation space: {env.observation_space}") # Box with 4 values
# Box([-4.8, -inf, -0.418, -inf], [4.8, inf, 0.418, inf])
print(f"Sample observation: {env.observation_space.sample()}") # Random valid observation
```
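The example above only exercises `sample()`; `Space.contains()`, mentioned earlier, checks whether a given value is a valid member of a space. A small sketch (the printed results are indicative):

```python
import numpy as np
import gymnasium as gym

env = gym.make("CartPole-v1")

# contains() checks membership of a value in a space
print(env.action_space.contains(1))   # True  - 1 is a valid action
print(env.action_space.contains(5))   # False - only 0 and 1 are valid

# Works for observations too
zero_obs = np.zeros(4, dtype=np.float32)
print(env.observation_space.contains(zero_obs))  # True - within the Box bounds
```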
## Modifying the environment
@@ -119,31 +180,39 @@ For example usage of spaces, see their `documentation </api/spaces>`_ along with
```{eval-rst}
.. py:currentmodule:: gymnasium.wrappers
Wrappers are a convenient way to modify an existing environment without having to alter the underlying code directly. Using wrappers will allow you to avoid a lot of boilerplate code and make your environment more modular. Wrappers can also be chained to combine their effects. Most environments that are generated via :meth:`gymnasium.make` will already be wrapped by default using the :class:`TimeLimit`, :class:`OrderEnforcing` and :class:`PassiveEnvChecker`.
Wrappers are a convenient way to modify an existing environment without having to alter the underlying code directly. Think of wrappers like filters or modifiers that change how you interact with an environment. Using wrappers allows you to avoid boilerplate code and make your environment more modular. Wrappers can also be chained to combine their effects.
In order to wrap an environment, you must first initialize a base environment. Then you can pass this environment along with (possibly optional) parameters to the wrapper's constructor:
Most environments created via :meth:`gymnasium.make` will already be wrapped by default using :class:`TimeLimit` (stops episodes after a maximum number of steps), :class:`OrderEnforcing` (ensures proper reset/step order), and :class:`PassiveEnvChecker` (validates your environment usage).
To wrap an environment, you first initialize a base environment, then pass it along with optional parameters to the wrapper's constructor:
```
```python
>>> import gymnasium as gym
>>> from gymnasium.wrappers import FlattenObservation
>>> # Start with a complex observation space
>>> env = gym.make("CarRacing-v3")
>>> env.observation_space.shape
(96, 96, 3)
(96, 96, 3) # 96x96 RGB image
>>> # Wrap it to flatten the observation into a 1D array
>>> wrapped_env = FlattenObservation(env)
>>> wrapped_env.observation_space.shape
(27648,)
(27648,) # All pixels in a single array
>>> # This makes it easier to use with some algorithms that expect 1D input
```
```{eval-rst}
.. py:currentmodule:: gymnasium.wrappers
Gymnasium already provides many commonly used wrappers for you. Some examples:
Common wrappers that beginners find useful:
- :class:`TimeLimit`: Issues a truncated signal if a maximum number of timesteps has been exceeded (or the base environment has issued a truncated signal).
- :class:`ClipAction`: Clips any action passed to ``step`` such that it lies in the base environment's action space.
- :class:`RescaleAction`: Applies an affine transformation to the action to linearly scale for a new low and high bound on the environment.
- :class:`TimeAwareObservation`: Add information about the index of timestep to observation. In some cases helpful to ensure that transitions are Markov.
- :class:`TimeLimit`: Issues a truncated signal if a maximum number of timesteps has been exceeded (preventing infinite episodes).
- :class:`ClipAction`: Clips any action passed to ``step`` to ensure it's within the valid action space.
- :class:`RescaleAction`: Rescales actions to a different range (useful for algorithms that output actions in [-1, 1] but environment expects [0, 10]).
- :class:`TimeAwareObservation`: Adds information about the current timestep to the observation (sometimes helps with learning).
```
For a full list of implemented wrappers in Gymnasium, see [wrappers](/api/wrappers).
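As a sketch of the chaining mentioned above, two of these wrappers could be combined on a continuous-control environment; the choice of `Pendulum-v1` and the printed spaces are illustrative:

```python
import gymnasium as gym
from gymnasium.wrappers import RescaleAction, TimeAwareObservation

# Pendulum has a continuous action space bounded by [-2, 2]
env = gym.make("Pendulum-v1")

# Chain wrappers: rescale actions to [-1, 1], then add timestep info to observations
env = RescaleAction(env, min_action=-1.0, max_action=1.0)
env = TimeAwareObservation(env)

print(env.action_space)       # Box(-1.0, 1.0, (1,), float32)
print(env.observation_space)  # observation space extended with time information
```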
@@ -151,7 +220,7 @@ For a full list of implemented wrappers in Gymnasium, see [wrappers](/api/wrappe
```{eval-rst}
.. py:currentmodule:: gymnasium.Env
If you have a wrapped environment, and you want to get the unwrapped environment underneath all the layers of wrappers (so that you can manually call a function or change some underlying aspect of the environment), you can use the :attr:`unwrapped` attribute. If the environment is already a base environment, the :attr:`unwrapped` attribute will just return itself.
If you have a wrapped environment and want to access the original environment underneath all the layers of wrappers (to manually call a function or change some underlying aspect), you can use the :attr:`unwrapped` attribute. If the environment is already a base environment, :attr:`unwrapped` just returns itself.
```
```python
@@ -161,11 +230,29 @@ If you have a wrapped environment, and you want to get the unwrapped environment
<gymnasium.envs.box2d.car_racing.CarRacing object at 0x7f04efcb8850>
```
## More information
## Common Issues for Beginners
* [Training an agent](train_agent)
* [Making a Custom Environment](create_custom_env)
* [Recording an agent's behaviour](record_agent)
* [Speeding up an Environment](speed_up_env)
* [Compatibility with OpenAI Gym](gym_compatibility)
* [Migration Guide for Gym v0.21 to v0.26 and for v1.0.0](migration_guide)
**Agent Behavior:**
- Agent performs randomly: That's expected when using `env.action_space.sample()`! Real learning happens when you replace this with an intelligent policy
- Episodes end immediately: Check if you're properly handling the reset between episodes
**Common Code Mistakes:**
```python
# ❌ Wrong - forgetting to reset
env = gym.make("CartPole-v1")
obs, reward, terminated, truncated, info = env.step(action) # Error!
# ✅ Correct - always reset first
env = gym.make("CartPole-v1")
obs, info = env.reset() # Start properly
obs, reward, terminated, truncated, info = env.step(action) # Now this works
```
## Next Steps
Now that you understand the basics, you're ready to:
1. **[Train an actual agent](train_agent)** - Replace random actions with intelligence
2. **[Create custom environments](create_custom_env)** - Build your own RL problems
3. **[Record agent behavior](record_agent)** - Save videos and data from training
4. **[Speed up training](speed_up_env)** - Use vectorized environments and other optimizations

View File

@@ -5,28 +5,78 @@ title: Create custom env
# Create a Custom Environment
This page provides a short outline of how to create custom environments with Gymnasium, for a more [complete tutorial](../tutorials/gymnasium_basics/environment_creation) with rendering, please read [basic usage](basic_usage) before reading this page.
## Before You Code: Environment Design
We will implement a very simplistic game, called ``GridWorldEnv``, consisting of a 2-dimensional square grid of fixed size. The agent can move vertically or horizontally between grid cells in each timestep and the goal of the agent is to navigate to a target on the grid that has been placed randomly at the beginning of the episode.
Creating an RL environment is like designing a video game or simulation. Before writing any code, you need to think through the learning problem you want to solve. This design phase is crucial - a poorly designed environment will make learning difficult or impossible, no matter how good your algorithm is.
Basic information about the game
- Observations provide the location of the target and agent.
- There are 4 discrete actions in our environment, corresponding to the movements "right", "up", "left", and "down".
- The environment ends (terminates) when the agent has navigated to the grid cell where the target is located.
- The agent is only rewarded when it reaches the target, i.e., the reward is one when the agent reaches the target and zero otherwise.
### Key Design Questions
Ask yourself these fundamental questions:
**🎯 What skill should the agent learn?**
- Navigate through a maze?
- Balance and control a system?
- Optimize resource allocation?
- Play a strategic game?
**👀 What information does the agent need?**
- Position and velocity?
- Current state of the system?
- Historical data?
- Partial or full observability?
**🎮 What actions can the agent take?**
- Discrete choices (move up/down/left/right)?
- Continuous control (steering angle, throttle)?
- Multiple simultaneous actions?
**🏆 How do we measure success?**
- Reaching a specific goal?
- Minimizing time or energy?
- Maximizing a score?
- Avoiding failures?
**⏰ When should episodes end?**
- Task completion (success/failure)?
- Time limits?
- Safety constraints?
### GridWorld Example Design
For our tutorial example, we'll create a simple GridWorld environment:
- **🎯 Skill**: Navigate efficiently to a target location
- **👀 Information**: Agent position and target position on a grid
- **🎮 Actions**: Move up, down, left, or right
- **🏆 Success**: Reach the target in minimum steps
- **⏰ End**: When agent reaches target (or optional time limit)
This provides a clear learning problem that's simple enough to understand but non-trivial to solve optimally.
---
This page provides a complete implementation for creating custom environments with Gymnasium. For a longer [tutorial](../tutorials/gymnasium_basics/environment_creation) that also covers rendering, see the linked page.
We recommend that you familiarise yourself with [basic usage](basic_usage) before reading this page!
We will implement our GridWorld game as a 2-dimensional square grid of fixed size. The agent can move vertically or horizontally between grid cells in each timestep, and the goal is to navigate to a target that has been placed randomly at the beginning of the episode.
## Environment `__init__`
```{eval-rst}
.. py:currentmodule:: gymnasium
Like all environments, our custom environment will inherit from :class:`gymnasium.Env` that defines the structure of environment. One of the requirements for an environment is defining the observation and action space, which declare the general set of possible inputs (actions) and outputs (observations) of the environment. As outlined in our basic information about the game, our agent has four discrete actions, therefore we will use the ``Discrete(4)`` space with four options.
Like all environments, our custom environment will inherit from :class:`gymnasium.Env` that defines the structure all environments must follow. One of the requirements is defining the observation and action spaces, which declare what inputs (actions) and outputs (observations) are valid for this environment.
As outlined in our design, our agent has four discrete actions (move in cardinal directions), so we'll use ``Discrete(4)`` space.
```
```{eval-rst}
.. py:currentmodule:: gymnasium.spaces
For our observation, there are a couple options, for this tutorial we will imagine our observation looks like ``{"agent": array([1, 0]), "target": array([0, 3])}`` where the array elements represent the x and y positions of the agent or target. Alternative options for representing the observation is as a 2d grid with values representing the agent and target on the grid or a 3d grid with each "layer" containing only the agent or target information. Therefore, we will declare the observation space as :class:`Dict` with the agent and target spaces being a :class:`Box` allowing an array output of an int type.
For our observation, we have several options. We could represent the full grid as a 2D array, or use coordinate positions, or even a 3D array with separate "layers" for agent and target. For this tutorial, we'll use a simple dictionary format like ``{"agent": array([1, 0]), "target": array([0, 3])}`` where the arrays represent x,y coordinates.
This choice makes the observation human-readable and easy to debug. We'll declare this as a :class:`Dict` space with the agent and target spaces being :class:`Box` spaces that contain integer coordinates.
```
For a full list of possible spaces to use with an environment, see [spaces](../api/spaces)
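Before wiring this into the class, a standalone sketch of the observation space can help confirm what samples look like; the printed dictionary is only an example:

```python
import gymnasium as gym

size = 5  # same grid size as the environment below

obs_space = gym.spaces.Dict(
    {
        "agent": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
        "target": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
    }
)

print(obs_space.sample())
# e.g. {'agent': array([3, 1]), 'target': array([0, 4])}
```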
@@ -40,30 +90,33 @@ import gymnasium as gym
class GridWorldEnv(gym.Env):
def __init__(self, size: int = 5):
# The size of the square grid
# The size of the square grid (5x5 by default)
self.size = size
# Define the agent and target location; randomly chosen in `reset` and updated in `step`
# Initialize positions - will be set randomly in reset()
# Using -1,-1 as "uninitialized" state
self._agent_location = np.array([-1, -1], dtype=np.int32)
self._target_location = np.array([-1, -1], dtype=np.int32)
# Observations are dictionaries with the agent's and the target's location.
# Each location is encoded as an element of {0, ..., `size`-1}^2
# Define what the agent can observe
# Dict space gives us structured, human-readable observations
self.observation_space = gym.spaces.Dict(
{
"agent": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
"target": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int),
"agent": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int), # [x, y] coordinates
"target": gym.spaces.Box(0, size - 1, shape=(2,), dtype=int), # [x, y] coordinates
}
)
# We have 4 actions, corresponding to "right", "up", "left", "down"
# Define what actions are available (4 directions)
self.action_space = gym.spaces.Discrete(4)
# Dictionary maps the abstract actions to the directions on the grid
# Map action numbers to actual movements on the grid
# This makes the code more readable than using raw numbers
self._action_to_direction = {
0: np.array([1, 0]), # right
1: np.array([0, 1]), # up
2: np.array([-1, 0]), # left
3: np.array([0, -1]), # down
0: np.array([1, 0]), # Move right (positive x)
1: np.array([0, 1]), # Move up (positive y)
2: np.array([-1, 0]), # Move left (negative x)
3: np.array([0, -1]), # Move down (negative y)
}
```
@@ -72,22 +125,32 @@ class GridWorldEnv(gym.Env):
```{eval-rst}
.. py:currentmodule:: gymnasium
Since we will need to compute observations both in :meth:`Env.reset` and :meth:`Env.step`, it is often convenient to have a method ``_get_obs`` that translates the environment's state into an observation. However, this is not mandatory and you can compute the observations in :meth:`Env.reset` and :meth:`Env.step` separately.
Since we need to compute observations in both :meth:`Env.reset` and :meth:`Env.step`, it's convenient to have a helper method ``_get_obs`` that translates the environment's internal state into the observation format. This keeps our code DRY (Don't Repeat Yourself) and makes it easier to modify the observation format later.
```
```python
def _get_obs(self):
"""Convert internal state to observation format.
Returns:
dict: Observation with agent and target positions
"""
return {"agent": self._agent_location, "target": self._target_location}
```
```{eval-rst}
.. py:currentmodule:: gymnasium
We can also implement a similar method for the auxiliary information that is returned by :meth:`Env.reset` and :meth:`Env.step`. In our case, we would like to provide the manhattan distance between the agent and the target:
We can also implement a similar method for auxiliary information returned by :meth:`Env.reset` and :meth:`Env.step`. In our case, we'll provide the Manhattan distance between agent and target - this can be useful for debugging and understanding agent progress, but shouldn't be used by the learning algorithm itself.
```
```python
def _get_info(self):
"""Compute auxiliary information for debugging.
Returns:
dict: Info with distance between agent and target
"""
return {
"distance": np.linalg.norm(
self._agent_location - self._target_location, ord=1
@@ -98,7 +161,7 @@ We can also implement a similar method for the auxiliary information that is ret
```{eval-rst}
.. py:currentmodule:: gymnasium
Oftentimes, info will also contain some data that is only available inside the :meth:`Env.step` method (e.g., individual reward terms). In that case, we would have to update the dictionary that is returned by ``_get_info`` in :meth:`Env.step`.
Sometimes info will contain data that's only available inside :meth:`Env.step` (like individual reward components, action success/failure, etc.). In those cases, we'd update the dictionary returned by ``_get_info`` directly in the step method.
```
## Reset function
@@ -106,20 +169,29 @@ Oftentimes, info will also contain some data that is only available inside the :
```{eval-rst}
.. py:currentmodule:: gymnasium.Env
The purpose of :meth:`reset` is to initiate a new episode for an environment and has two parameters: ``seed`` and ``options``. The seed can be used to initialize the random number generator to a deterministic state and options can be used to specify values used within reset. On the first line of the reset, you need to call ``super().reset(seed=seed)`` which will initialize the random number generate (:attr:`np_random`) to use through the rest of the :meth:`reset`.
The :meth:`reset` method starts a new episode. It takes two optional parameters: ``seed`` for reproducible random generation and ``options`` for additional configuration. On the first line, you must call ``super().reset(seed=seed)`` to properly initialize the random number generator.
Within our custom environment, the :meth:`reset` needs to randomly choose the agent and target's positions (we repeat this if they have the same position). The return type of :meth:`reset` is a tuple of the initial observation and any auxiliary information. Therefore, we can use the methods ``_get_obs`` and ``_get_info`` that we implemented earlier for that:
In our GridWorld environment, :meth:`reset` randomly places the agent and target on the grid, ensuring they don't start in the same location. We return both the initial observation and info as a tuple.
```
```python
def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
# We need the following line to seed self.np_random
"""Start a new episode.
Args:
seed: Random seed for reproducible episodes
options: Additional configuration (unused in this example)
Returns:
tuple: (observation, info) for the initial state
"""
# IMPORTANT: Must call this first to seed the random number generator
super().reset(seed=seed)
# Choose the agent's location uniformly at random
# Randomly place the agent anywhere on the grid
self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)
# We will sample the target's location randomly until it does not coincide with the agent's location
# Randomly place target, ensuring it's different from agent position
self._target_location = self._agent_location
while np.array_equal(self._target_location, self._agent_location):
self._target_location = self.np_random.integers(
@@ -137,91 +209,314 @@ Within our custom environment, the :meth:`reset` needs to randomly choose the ag
```{eval-rst}
.. py:currentmodule:: gymnasium.Env
The :meth:`step` method usually contains most of the logic for your environment, it accepts an ``action`` and computes the state of the environment after the applying the action, returning a tuple of the next observation, the resulting reward, if the environment has terminated, if the environment has truncated and auxiliary information.
```
```{eval-rst}
.. py:currentmodule:: gymnasium
The :meth:`step` method contains the core environment logic. It takes an action, updates the environment state, and returns the results. This is where the physics, game rules, and reward logic live.
For our environment, several things need to happen during the step function:
- We use the self._action_to_direction to convert the discrete action (e.g., 2) to a grid direction with our agent location. To prevent the agent from going out of bounds of the grid, we clip the agent's location to stay within bounds.
- We compute the agent's reward by checking if the agent's current position is equal to the target's location.
- Since the environment doesn't truncate internally (we can apply a time limit wrapper to the environment during :meth:`make`), we permanently set truncated to False.
- We once again use _get_obs and _get_info to obtain the agent's observation and auxiliary information.
For GridWorld, we need to:
1. Convert the discrete action to a movement direction
2. Update the agent's position (with boundary checking)
3. Calculate the reward based on whether the target was reached
4. Determine if the episode should end
5. Return all the required information
```
```python
def step(self, action):
# Map the action (element of {0,1,2,3}) to the direction we walk in
"""Execute one timestep within the environment.
Args:
action: The action to take (0-3 for directions)
Returns:
tuple: (observation, reward, terminated, truncated, info)
"""
# Map the discrete action (0-3) to a movement direction
direction = self._action_to_direction[action]
# We use `np.clip` to make sure we don't leave the grid bounds
# Update agent position, ensuring it stays within grid bounds
# np.clip prevents the agent from walking off the edge
self._agent_location = np.clip(
self._agent_location + direction, 0, self.size - 1
)
# An environment is completed if and only if the agent has reached the target
# Check if agent reached the target
terminated = np.array_equal(self._agent_location, self._target_location)
# We don't use truncation in this simple environment
# (could add a step limit here if desired)
truncated = False
reward = 1 if terminated else 0 # the agent is only reached at the end of the episode
# Simple reward structure: +1 for reaching target, 0 otherwise
# Alternative: could give small negative rewards for each step to encourage efficiency
reward = 1 if terminated else 0
observation = self._get_obs()
info = self._get_info()
return observation, reward, terminated, truncated, info
```
## Registering and making the environment
## Common Environment Design Pitfalls
```{eval-rst}
While it is possible to use your new custom environment now immediately, it is more common for environments to be initialized using :meth:`gymnasium.make`. In this section, we explain how to register a custom environment then initialize it.
Now that you've seen the basic structure, let's discuss common mistakes beginners make:
The environment ID consists of three components, two of which are optional: an optional namespace (here: ``gymnasium_env``), a mandatory name (here: ``GridWorld``) and an optional but recommended version (here: v0). It may have also be registered as ``GridWorld-v0`` (the recommended approach), ``GridWorld`` or ``gymnasium_env/GridWorld``, and the appropriate ID should then be used during environment creation.
### Reward Design Issues
The entry point can be a string or function, as this tutorial isn't part of a python project, we cannot use a string but for most environments, this is the normal way of specifying the entry point.
Register has additionally parameters that can be used to specify keyword arguments to the environment, e.g., if to apply a time limit wrapper, etc. See :meth:`gymnasium.register` for more information.
**Problem**: Only rewarding at the very end (sparse rewards)
```python
# This makes learning very difficult!
reward = 1 if terminated else 0
```
**Better**: Provide intermediate feedback
```python
gym.register(
id="gymnasium_env/GridWorld-v0",
entry_point=GridWorldEnv,
# Option 1: Small step penalty to encourage efficiency
reward = 1 if terminated else -0.01
# Option 2: Distance-based reward shaping
distance = np.linalg.norm(self._agent_location - self._target_location)
reward = 1 if terminated else -0.1 * distance
```
### State Representation Problems
**Problem**: Including irrelevant information or missing crucial details
```python
# Too much info - agent doesn't need grid size in every observation
obs = {"agent": self._agent_location, "target": self._target_location, "size": self.size}
# Too little info - agent can't distinguish different positions
obs = {"distance": distance} # Missing actual positions!
```
**Better**: Include exactly what's needed for optimal decisions
```python
# Just right - positions are sufficient for navigation
obs = {"agent": self._agent_location, "target": self._target_location}
```
### Action Space Issues
**Problem**: Actions that don't make sense or are impossible to execute
```python
# Bad: Agent can move diagonally but environment doesn't support it
self.action_space = gym.spaces.Discrete(8) # 8 directions including diagonals
# Bad: Continuous actions for discrete movement
self.action_space = gym.spaces.Box(-1, 1, shape=(2,)) # Continuous x,y movement
```
### Boundary Handling Errors
**Problem**: Allowing invalid states or unclear boundary behavior
```python
# Bad: Agent can go outside the grid
self._agent_location = self._agent_location + direction # No bounds checking!
# Unclear: What happens when agent hits wall?
if np.any(self._agent_location < 0) or np.any(self._agent_location >= self.size):
# Do nothing? Reset episode? Give penalty? Unclear!
```
**Better**: Clear, consistent boundary handling
```python
# Clear: Agent stays in place when hitting boundaries
self._agent_location = np.clip(
self._agent_location + direction, 0, self.size - 1
)
```
For a more complete guide on registering a custom environment (including with a string entry point), please read the full [create environment](../tutorials/gymnasium_basics/environment_creation) tutorial.
## Registering and making the environment
```{eval-rst}
Once the environment is registered, you can check via :meth:`gymnasium.pprint_registry` which will output all registered environment, and the environment can then be initialized using :meth:`gymnasium.make`. A vectorized version of the environment with multiple instances of the same environment running in parallel can be instantiated with :meth:`gymnasium.make_vec`.
While you can use your custom environment immediately, it's more convenient to register it with Gymnasium so you can create it with :meth:`gymnasium.make` just like built-in environments.
The environment ID has three components: an optional namespace (here: ``gymnasium_env``), a mandatory name (here: ``GridWorld``), and an optional but recommended version (here: v0). You could register it as ``GridWorld-v0``, ``GridWorld``, or ``gymnasium_env/GridWorld``, but the full ``gymnasium_env/GridWorld-v0`` format is recommended for clarity.
Since this tutorial isn't part of a Python package, we pass the class directly as the entry point. In real projects, you'd typically use a string like ``"my_package.envs:GridWorldEnv"``.
```
```python
# Register the environment so we can create it with gym.make()
gym.register(
id="gymnasium_env/GridWorld-v0",
entry_point=GridWorldEnv,
max_episode_steps=300, # Prevent infinite episodes
)
```
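If the environment class lived in an installed package, the same registration could instead use the string entry point format mentioned above; the package path here is hypothetical:

```python
import gymnasium as gym

# Hypothetical: the class lives in my_package/envs.py inside an installed package
# (in a real project this would replace the class-based registration above)
gym.register(
    id="gymnasium_env/GridWorld-v0",
    entry_point="my_package.envs:GridWorldEnv",  # "module.path:ClassName" format
    max_episode_steps=300,
)
```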
For a more complete guide on registering custom environments (including with string entry points), please read the full [create environment](../tutorials/gymnasium_basics/environment_creation) tutorial.
```{eval-rst}
Once registered, you can check all available environments with :meth:`gymnasium.pprint_registry` and create instances with :meth:`gymnasium.make`. You can also create vectorized versions with :meth:`gymnasium.make_vec`.
```
```python
import gymnasium as gym
>>> gym.make("gymnasium_env/GridWorld-v0")
# Create the environment like any built-in environment
>>> env = gym.make("gymnasium_env/GridWorld-v0")
<OrderEnforcing<PassiveEnvChecker<GridWorld<gymnasium_env/GridWorld-v0>>>>
>>> gym.make("gymnasium_env/GridWorld-v0", max_episode_steps=100)
<TimeLimit<OrderEnforcing<PassiveEnvChecker<GridWorld<gymnasium_env/GridWorld-v0>>>>>
# Customize environment parameters
>>> env = gym.make("gymnasium_env/GridWorld-v0", size=10)
>>> env.unwrapped.size
10
>>> gym.make_vec("gymnasium_env/GridWorld-v0", num_envs=3)
# Create multiple environments for parallel training
>>> vec_env = gym.make_vec("gymnasium_env/GridWorld-v0", num_envs=3)
SyncVectorEnv(gymnasium_env/GridWorld-v0, num_envs=3)
```
## Debugging Your Environment
When your environment doesn't work as expected, here are common debugging strategies:
### Check Environment Validity
```python
from gymnasium.utils.env_checker import check_env
# This will catch many common issues
try:
check_env(env)
print("Environment passes all checks!")
except Exception as e:
print(f"Environment has issues: {e}")
```
### Manual Testing with Known Actions
```python
# Test specific action sequences to verify behavior
env = gym.make("gymnasium_env/GridWorld-v0")
obs, info = env.reset(seed=42) # Use seed for reproducible testing
print(f"Starting position - Agent: {obs['agent']}, Target: {obs['target']}")
# Test each action type
actions = [0, 1, 2, 3] # right, up, left, down
for action in actions:
old_pos = obs['agent'].copy()
obs, reward, terminated, truncated, info = env.step(action)
new_pos = obs['agent']
print(f"Action {action}: {old_pos} -> {new_pos}, reward={reward}")
```
### Common Debug Issues
```python
# Issue 1: Forgot to call super().reset()
def reset(self, seed=None, options=None):
# super().reset(seed=seed) # ❌ Missing this line
# Results in: possibly incorrect seeding
# Issue 2: Wrong action mapping
self._action_to_direction = {
0: np.array([1, 0]), # right
1: np.array([0, 1]), # up - but is this really "up" in your coordinate system?
2: np.array([-1, 0]), # left
3: np.array([0, -1]), # down
}
# Issue 3: Not handling boundaries properly
# This allows agent to go outside the grid!
self._agent_location = self._agent_location + direction # ❌ No bounds checking
```
## Using Wrappers
Oftentimes, we want to use different variants of a custom environment, or we want to modify the behavior of an environment that is provided by Gymnasium or some other party. Wrappers allow us to do this without changing the environment implementation or adding any boilerplate code. Check out the [wrapper documentation](../api/wrappers) for details on how to use wrappers and instructions for implementing your own. In our example, observations cannot be used directly in learning code because they are dictionaries. However, we don't actually need to touch our environment implementation to fix this! We can simply add a wrapper on top of environment instances to flatten observations into a single array:
Sometimes you want to modify your environment's behavior without changing the core implementation. Wrappers are perfect for this - they let you add functionality like changing observation formats, adding time limits, or modifying rewards without touching your original environment code.
```python
>>> from gymnasium.wrappers import FlattenObservation
>>> # Original observation is a dictionary
>>> env = gym.make('gymnasium_env/GridWorld-v0')
>>> env.observation_space
Dict('agent': Box(0, 4, (2,), int64), 'target': Box(0, 4, (2,), int64))
>>> env.reset()
({'agent': array([4, 1]), 'target': array([2, 4])}, {'distance': 5.0})
>>> obs, info = env.reset()
>>> obs
{'agent': array([4, 1]), 'target': array([2, 4])}
>>> # Wrap it to flatten observations into a single array
>>> wrapped_env = FlattenObservation(env)
>>> wrapped_env.observation_space
Box(0, 4, (4,), int64)
>>> wrapped_env.reset()
(array([3, 0, 2, 1]), {'distance': 2.0})
>>> obs, info = wrapped_env.reset()
>>> obs
array([3, 0, 2, 1]) # [agent_x, agent_y, target_x, target_y]
```
This is particularly useful when working with algorithms that expect specific input formats (like neural networks that need 1D arrays instead of dictionaries).
## Advanced Environment Features
Once you have the basics working, you might want to add more sophisticated features:
### Adding Rendering
```python
def render(self):
"""Render the environment for human viewing."""
if self.render_mode == "human":
# Print a simple ASCII representation
for y in range(self.size - 1, -1, -1): # Top to bottom
row = ""
for x in range(self.size):
if np.array_equal([x, y], self._agent_location):
row += "A " # Agent
elif np.array_equal([x, y], self._target_location):
row += "T " # Target
else:
row += ". " # Empty
print(row)
print()
```
### Parameterized Environments
```python
def __init__(self, size: int = 5, reward_scale: float = 1.0, step_penalty: float = 0.0):
self.size = size
self.reward_scale = reward_scale
self.step_penalty = step_penalty
# ... rest of init ...
def step(self, action):
# ... movement logic ...
# Flexible reward calculation
if terminated:
reward = self.reward_scale # Success reward
else:
reward = -self.step_penalty # Step penalty (0 by default)
```
## Real-World Environment Design Tips
### Start Simple, Add Complexity Gradually
1. **First**: Get basic movement and goal-reaching working
2. **Then**: Add obstacles, multiple goals, or time pressure
3. **Finally**: Add complex dynamics, partial observability, or multi-agent interactions
### Design for Learning
- **Clear Success Criteria**: Agent should know when it's doing well
- **Reasonable Difficulty**: Not too easy (trivial) or too hard (impossible)
- **Consistent Rules**: Same action in same state should have same effect
- **Informative Observations**: Include everything needed for optimal decisions
### Think About Your Research Question
- **Navigation**: Focus on spatial reasoning and path planning
- **Control**: Emphasize dynamics, stability, and continuous actions
- **Strategy**: Include partial information, opponent modeling, or long-term planning
- **Optimization**: Design clear trade-offs and resource constraints
## Next Steps
Congratulations! You now know how to create custom RL environments. Here's what to explore next:
1. **Add rendering** to visualize your environment ([complete tutorial](../tutorials/gymnasium_basics/environment_creation))
2. **Train an agent** on your custom environment ([training guide](train_agent))
3. **Experiment with different reward functions** to see how they affect learning
4. **Try wrapper combinations** to modify your environment's behavior
5. **Create more complex environments** with obstacles, multiple agents, or continuous actions
The key to good environment design is iteration - start simple, test thoroughly, and gradually add complexity as needed for your research or application goals.

View File

@@ -5,7 +5,7 @@ title: Compatibility With Gym
# Compatibility with Gym
Gymnasium provides a number of compatibility methods for a range of Environment implementations.
Gymnasium provides a number of compatibility methods for using older Gym Environment implementations.
## Loading OpenAI Gym environments

View File

@@ -5,98 +5,408 @@ title: Migration Guide
# Migration Guide - v0.21 to v1.0.0
## Who Should Read This Guide?
**If you're new to Gymnasium**: You can probably skip this page! This guide is for users migrating from older versions of OpenAI Gym. If you're just starting with RL, head to [Basic Usage](basic_usage) instead.
**If you're migrating from OpenAI Gym**: This guide will help you update your code to work with Gymnasium. The changes are significant but straightforward once you understand the reasoning behind them.
**If you're updating old tutorials**: Many online RL tutorials use the old v0.21 API. This guide shows you how to modernize that code.
## Why Did the API Change?
```{eval-rst}
.. py:currentmodule:: gymnasium.wrappers
Gymnasium is a fork of `OpenAI Gym v0.26 <https://github.com/openai/gym/releases/tag/0.26.2>`_, which introduced a large breaking change from `Gym v0.21 <https://github.com/openai/gym/releases/tag/v0.21.0>`_.In this guide, we briefly outline the API changes from Gym v0.21 - which a number of tutorials have been written for - to Gym v0.26 (and later, including 1.0.0). For environments still stuck in the v0.21 API, see the `guide </content/gym_compatibility>`_
Gymnasium is a fork of `OpenAI Gym v0.26 <https://github.com/openai/gym/releases/tag/0.26.2>`_, which introduced breaking changes from `Gym v0.21 <https://github.com/openai/gym/releases/tag/v0.21.0>`_. These changes weren't made lightly - they solved important problems that made RL research and development more difficult.
The main issues with the old API were:
- **Ambiguous episode endings**: The single ``done`` flag couldn't distinguish between "task completed" and "time limit reached"
- **Inconsistent seeding**: Random number generation was unreliable and hard to reproduce
- **Rendering complexity**: Switching between visual modes was unnecessarily complicated
- **Reproducibility problems**: Subtle bugs made it difficult to reproduce research results
For environments still using the v0.21 API, see the `compatibility guide <gym_compatibility>`_.
```
## Example code for v0.21
## Quick Reference: Complete Changes Table
| **Component** | **v0.21 (Old)** | **v0.26+ (New)** | **Impact** |
|--------------------------|---------------------------------------------------|---------------------------------------------------------------|-----------------|
| **Package Import** | `import gym` | `import gymnasium as gym` | All code |
| **Environment Reset** | `obs = env.reset()` | `obs, info = env.reset()` | Training loops |
| **Random Seeding** | `env.seed(42)` | `env.reset(seed=42)` | Reproducibility |
| **Step Function** | `obs, reward, done, info = env.step(action)` | `obs, reward, terminated, truncated, info = env.step(action)` | RL algorithms |
| **Episode Ending** | `while not done:` | `while not (terminated or truncated):` | Training loops |
| **Render Mode** | `env.render(mode="human")` | `gym.make(env_id, render_mode="human")` | Visualization |
| **Time Limit Detection** | `info.get('TimeLimit.truncated')` | `truncated` return value | RL algorithms |
| **Value Bootstrapping** | `target = reward + (1-done) * gamma * next_value` | `target = reward + (1-terminated) * gamma * next_value` | RL correctness |
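The value-bootstrapping row is the change that most often breaks training code silently; the following sketch with illustrative numbers shows why using `done` instead of `terminated` gives the wrong TD target when an episode ends at a time limit:

```python
gamma = 0.99

# One illustrative transition where the episode hit a time limit
reward, next_value = 1.0, 0.5
terminated, truncated = False, True

# OLD (v0.21): a single `done` flag conflated termination and truncation,
# so bootstrapping was incorrectly cut off at time limits
done = terminated or truncated
old_target = reward + (1 - done) * gamma * next_value        # 1.0

# NEW (v0.26+): only true termination stops bootstrapping
new_target = reward + (1 - terminated) * gamma * next_value  # 1.495

print(old_target, new_target)
```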
## Side-by-Side Code Comparison
### Old v0.21 Code
```python
import gym
# Environment creation and seeding
env = gym.make("LunarLander-v3", options={})
env.seed(123)
observation = env.reset()
# Training loop
done = False
while not done:
action = env.action_space.sample() # agent policy that uses the observation and info
action = env.action_space.sample()
observation, reward, done, info = env.step(action)
env.render(mode="human")
env.close()
```
## Example code for v0.26 and later, including v1.0.0
### New v0.26+ Code (Including v1.0.0)
```python
import gym
import gymnasium as gym # Note: 'gymnasium' not 'gym'
# Environment creation with render mode specified upfront
env = gym.make("LunarLander-v3", render_mode="human")
# Reset with seed parameter
observation, info = env.reset(seed=123, options={})
# Training loop with terminated/truncated distinction
done = False
while not done:
action = env.action_space.sample() # agent policy that uses the observation and info
action = env.action_space.sample()
observation, reward, terminated, truncated, info = env.step(action)
# Episode ends if either terminated OR truncated
done = terminated or truncated
env.close()
```
## Key Changes Breakdown

### 1. Package Name Change

**Old**: `import gym`
**New**: `import gymnasium as gym`

Why: Gymnasium is a separate project that maintains and improves upon the original Gym codebase.

```python
# Update your imports
# OLD
import gym

# NEW
import gymnasium as gym
```

### 2. Seeding and Random Number Generation

The biggest conceptual change is how randomness is handled.

**Old v0.21**: Separate `seed()` method

```python
env = gym.make("CartPole-v1")
env.seed(42)       # Set random seed
obs = env.reset()  # Reset environment
```

**New v0.26+**: Seed passed to `reset()`

```python
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)  # Seed and reset together
```

**Why this changed**: Some environments (especially emulated games) can only set their random state at the beginning of an episode, not mid-episode. The old approach could lead to inconsistent behavior.

```{eval-rst}
.. py:currentmodule:: gymnasium.Env

``Env.seed()`` has been removed from the Gym v0.26 environments in favour of ``Env.reset(seed=seed)``, so the seed can only be changed on environment reset. ``seed`` was removed because some environments use emulators that cannot change their random number generator within an episode; it must be done at the beginning of a new episode. We are aware of cases where controlling the random number generator is important; in these cases, if the environment uses the built-in random number generator, users can set the seed manually through the attribute :attr:`np_random`.

Gymnasium v0.26 also changed to using ``numpy.random.Generator`` instead of a custom random number generator, meaning that several functions such as ``randint`` were removed in favour of ``integers``. While some environments might use an external random number generator, we recommend using the attribute :attr:`np_random` that wrappers and external users can access and utilise.
```

**Practical impact**:

```python
# OLD: Seeding applied to all future episodes
env.seed(42)
for episode in range(10):
    obs = env.reset()

# NEW: Each episode can have its own seed
for episode in range(10):
    obs, info = env.reset(seed=42 + episode)  # Each episode gets a unique seed
```
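If your own environment code used the old custom RNG, the practical change is mostly a rename: `randint` becomes `integers` on the `numpy.random.Generator` that now backs `np_random`. A minimal sketch of the new calls (using a standalone generator of the same type):

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # same Generator type as Env.np_random in v0.26+

# v0.21-era custom RNG:  self.np_random.randint(0, 10)
# v0.26+ Generator API:  self.np_random.integers(0, 10)
print(rng.integers(0, 10))             # random int in [0, 10)
print(rng.uniform(-1.0, 1.0, size=3))  # other Generator methods work as usual
```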
### 3. Environment Reset Changes

**Old v0.21**: Returns only observation

```python
observation = env.reset()
```

**New v0.26+**: Returns observation AND info

```python
observation, info = env.reset()
```

**Why this changed**:
- `info` provides consistent access to debugging information
- `seed` parameter enables reproducible episodes
- `options` parameter allows episode-specific configuration

```{eval-rst}
.. py:currentmodule:: gymnasium.Env

In v0.26+, :meth:`reset` takes two optional parameters and returns one additional value. This contrasts with v0.21, which takes no parameters and returns ``None``. The two parameters are ``seed``, for setting the random number generator, and ``options``, which allows additional data to be passed to the environment on reset. For example, in classic control, the ``options`` parameter now allows users to modify the range of the state bounds. See the original `PR <https://github.com/openai/gym/pull/2921>`_ for more details.

:meth:`reset` also returns ``info``, similar to the ``info`` returned by :meth:`step`. This is important because ``info`` can include metrics or a valid action mask that is used or saved for the next step.

To update older environments, we highly recommend calling ``super().reset(seed=seed)`` on the first line of :meth:`reset`. This will automatically update :attr:`np_random` with the seed value.
```

**Common migration pattern**:

```python
# If you don't need the new features, just unpack the tuple
obs, _ = env.reset()  # Ignore info with underscore

# If you want to maintain the same random behavior as v0.21
env.reset(seed=42)       # Set seed once
# Then for subsequent resets:
obs, info = env.reset()  # Uses internal random state
```
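For environment authors, a minimal sketch of what the updated `reset` signature looks like (the class, spaces, and observation here are placeholders, not a complete environment):

```python
import gymnasium as gym
import numpy as np


class MyEnv(gym.Env):
    """Illustrative skeleton showing only the updated reset signature."""

    observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
    action_space = gym.spaces.Discrete(2)

    def reset(self, seed=None, options=None):
        # Seeds self.np_random so the rest of reset can use it
        super().reset(seed=seed)

        observation = self.observation_space.sample()  # placeholder initial state
        info = {}
        return observation, info
```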
### 4. Step Function: The `done` → `terminated`/`truncated` Split
This is the most important change for training algorithms.
**Old v0.21**: Single `done` flag
```python
obs, reward, done, info = env.step(action)
```
**New v0.26+**: Separate `terminated` and `truncated` flags
```python
obs, reward, terminated, truncated, info = env.step(action)
```
**Why this matters**:
- **`terminated`**: Episode ended because the task was completed or failed (agent reached goal, died, etc.)
- **`truncated`**: Episode ended due to external constraints (time limit, step limit, etc.)
This distinction is crucial for value function bootstrapping in RL algorithms:
```python
# OLD (ambiguous)
if done:
    # Should we bootstrap? We don't know if this was natural termination or time limit!
    next_value = 0  # Assumption that may be wrong

# NEW (clear)
if terminated:
    next_value = 0  # Natural ending - no future value
elif truncated:
    next_value = value_function(next_obs)  # Time limit - estimate future value
```
**Migration strategy**:
```python
# Simple migration (works for many cases)
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated

# Better migration (preserves RL algorithm correctness)
obs, reward, terminated, truncated, info = env.step(action)
if terminated:
    # Episode naturally ended - use reward as-is
    target = reward
elif truncated:
    # Episode cut short - may need to estimate remaining value
    target = reward + discount * estimate_value(obs)
```
These changes were introduced in Gym [v0.26](https://github.com/openai/gym/releases/tag/0.26.0) (available but turned off by default in [v0.25](https://github.com/openai/gym/releases/tag/0.25.0)). For more information, see our [blog post](https://farama.org/Gymnasium-Terminated-Truncated-Step-API) about the change and the original [PR](https://github.com/openai/gym/pull/2752), which discusses environments that truncate for reasons other than a time limit.
### 5. Render Mode Changes
**Old v0.21**: Render mode specified each time
```python
env = gym.make("CartPole-v1")
env.render(mode="human") # Visual window
env.render(mode="rgb_array") # Get pixel array
```
**New v0.26+**: Render mode fixed at creation
```python
env = gym.make("CartPole-v1", render_mode="human") # For visual display
env = gym.make("CartPole-v1", render_mode="rgb_array") # For recording
env.render() # Uses the mode specified at creation
```
**Why this changed**: Some environments can't switch render modes on-the-fly. Fixing the mode at creation enables better optimization and prevents bugs. For a more complete explanation of the render API change, see this [summary](https://younis.dev/blog/render-api/).
**Practical implications**:
```python
# OLD: Could switch modes dynamically
env = gym.make("CartPole-v1")
for episode in range(10):
    # ... episode code ...
    if episode % 10 == 0:
        env.render(mode="human")  # Show every 10th episode

# NEW: Create separate environments for different purposes
training_env = gym.make("CartPole-v1")                   # No rendering for speed
eval_env = gym.make("CartPole-v1", render_mode="human")  # Visual for evaluation

# Or use None for no rendering, then create a visual env when needed
env = gym.make("CartPole-v1", render_mode=None)  # Fast training
if need_visualization:
    visual_env = gym.make("CartPole-v1", render_mode="human")
```
## TimeLimit Wrapper Changes

The `TimeLimit` wrapper behavior also changed to align with the new termination model.

**Old v0.21**: Added `TimeLimit.truncated` to the info dict

```python
obs, reward, done, info = env.step(action)
if done and info.get('TimeLimit.truncated', False):
    # Episode ended due to time limit
    pass
```

**New v0.26+**: Uses the `truncated` return value

```python
obs, reward, terminated, truncated, info = env.step(action)
if truncated:
    # Episode ended due to time limit (or other truncation)
    pass
if terminated:
    # Episode ended naturally (success/failure)
    pass
```

This makes time limit detection much cleaner and more explicit. The old `TimeLimit.truncated` info entry is equivalent to `truncated and not terminated` in the new API.
## Updating Your Training Code
### Basic Training Loop Migration
**Old v0.21 pattern**:
```python
for episode in range(num_episodes):
    obs = env.reset()
    done = False

    while not done:
        action = agent.get_action(obs)
        next_obs, reward, done, info = env.step(action)

        # Train agent (this may have bugs due to ambiguous 'done')
        agent.learn(obs, action, reward, next_obs, done)
        obs = next_obs
```
**New v0.26+ pattern**:
```python
for episode in range(num_episodes):
    obs, info = env.reset(seed=episode)  # Optional: unique seed per episode
    terminated, truncated = False, False

    while not (terminated or truncated):
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # Train agent with proper termination handling
        agent.learn(obs, action, reward, next_obs, terminated)
        obs = next_obs
```
### Q-Learning Update Migration
**Old v0.21 (potentially incorrect)**:
```python
def update_q_value(obs, action, reward, next_obs, done):
    if done:
        target = reward  # Assumes all episode endings are natural terminations
    else:
        target = reward + gamma * max(q_table[next_obs])

    q_table[obs][action] += lr * (target - q_table[obs][action])
```
**New v0.26+ (correct)**:
```python
def update_q_value(obs, action, reward, next_obs, terminated):
    if terminated:
        # Natural termination - no future value
        target = reward
    else:
        # Episode continues - truncation has no impact on the possible future value
        target = reward + gamma * max(q_table[next_obs])

    q_table[obs][action] += lr * (target - q_table[obs][action])
```
### Deep RL Framework Migration
Most deep RL libraries have already been updated to the Gymnasium API; see their documentation for framework-specific migration details.
## Environment-Specific Changes

### Removed Environments and Modules

Several environments and utilities were moved out of the core package or removed entirely:

* `GoalEnv` - removed; users who need it should reimplement it or use [Gymnasium Robotics](https://robotics.farama.org/), which contains an implementation.
* `from gym.envs.classic_control import rendering` - removed in favour of users implementing their own rendering systems; Gymnasium environments are coded using pygame.
* Robotics environments - moved to the [Gymnasium Robotics](https://robotics.farama.org/) project.
* `Monitor` wrapper - replaced by two separate wrappers, `RecordVideo` and `RecordEpisodeStatistics`.

For example, the robotics and Atari environments now live in separate packages:
```python
# OLD: Robotics environments in main gym
import gym
env = gym.make("FetchReach-v1")  # No longer available

# NEW: Moved to separate packages
import gymnasium
import gymnasium_robotics
import ale_py

gymnasium.register_envs(gymnasium_robotics)
gymnasium.register_envs(ale_py)

env = gymnasium.make("FetchReach-v1")
env = gymnasium.make("ALE/Pong-v5")
```
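Similarly, code that used the removed `Monitor` wrapper can switch to the two replacement wrappers; a minimal sketch (the video folder name is just an example):

```python
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo

# OLD: env = gym.wrappers.Monitor(env, "videos/")
# NEW: one wrapper for videos, one for episode statistics
env = gym.make("CartPole-v1", render_mode="rgb_array")
env = RecordVideo(env, video_folder="videos")
env = RecordEpisodeStatistics(env)
```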
## Compatibility Helpers

### Using Old Environments

If you need to use an environment that hasn't been updated to the new API, Gymnasium provides a compatibility environment:

```python
# For environments still using the old gym API
env = gym.make("GymV21Environment-v0", env_id="OldEnv-v0")
# This wrapper converts the old API to the new API automatically
```

For more details, see the [compatibility guide](gym_compatibility).
## Testing Your Migration
After migrating, verify that:
- [ ] **Import statements** use `gymnasium` instead of `gym`
- [ ] **Reset calls** handle the `(obs, info)` return format
- [ ] **Step calls** handle `terminated` and `truncated` separately
- [ ] **Render mode** is specified during environment creation
- [ ] **Random seeding** uses the `seed` parameter in `reset()`
- [ ] **Training algorithms** properly distinguish termination types
Use `check_env` from `gymnasium.utils.env_checker` to verify that a custom environment implementation still follows the API.
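A quick check might look like this (shown on a registered environment for illustration; in practice, pass your own `Env` instance):

```python
import gymnasium as gym
from gymnasium.utils.env_checker import check_env

env = gym.make("CartPole-v1")

# Raises a descriptive error (or warning) if the environment violates the Gymnasium API
check_env(env.unwrapped)
```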
## Getting Help
**If you encounter issues during migration**:
1. **Check the compatibility guide**: Some old environments can be used with compatibility wrappers
2. **Look at the environment documentation**: Each environment may have specific migration notes
3. **Test with simple environments first**: Start with CartPole before moving to complex environments
4. **Compare old vs new behavior**: Run the same code with both APIs to understand differences
**Common resources**:
- [Gymnasium documentation](https://gymnasium.farama.org/)
- [GitHub issues](https://github.com/Farama-Foundation/Gymnasium/issues) for bug reports
- [Discord community](https://discord.gg/bnJ6kubTg6) for questions


@@ -5,92 +5,301 @@ title: Recording Agents
# Recording Agents
## Why Record Your Agent?
Recording agent behavior serves several important purposes in RL development:
**🎥 Visual Understanding**: See exactly what your agent is doing - sometimes a 10-second video reveals issues that hours of staring at reward plots miss.
**📊 Performance Tracking**: Collect systematic data about episode rewards, lengths, and timing to understand training progress.
**🐛 Debugging**: Identify specific failure modes, unusual behaviors, or environments where your agent struggles.
**📈 Evaluation**: Compare different training runs, algorithms, or hyperparameters objectively.
**🎓 Communication**: Share results with collaborators, include in papers, or create educational content.
## When to Record
**During Evaluation** (Record Every Episode):
- Testing a trained agent's final performance
- Creating demonstration videos
- Detailed analysis of specific behaviors
**During Training** (Record Periodically):
- Monitor learning progress over time
- Catch training issues early
- Create timelapse videos of learning
```{eval-rst}
.. py:currentmodule:: gymnasium.wrappers

Gymnasium provides two essential wrappers for recording: :class:`RecordEpisodeStatistics` for numerical data and :class:`RecordVideo` for visual recordings. The first tracks episode metrics like total rewards, episode length, and time taken. The second generates MP4 videos of agent behavior using environment renderings.

We'll show how to use these wrappers for two common scenarios: recording data for every episode (typically during evaluation) and recording data periodically (during training).
```
## Recording Every Episode (Evaluation)
```{eval-rst}
.. py:currentmodule:: gymnasium.wrappers

When evaluating a trained agent, you typically want to record several episodes to understand average performance and consistency. Here's how to set this up with :class:`RecordEpisodeStatistics` and :class:`RecordVideo`.
```
```python
import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo
import numpy as np

# Configuration
num_eval_episodes = 4
env_name = "CartPole-v1"  # Replace with your environment

# Create environment with recording capabilities
env = gym.make(env_name, render_mode="rgb_array")  # rgb_array needed for video recording

# Add video recording for every episode
env = RecordVideo(
    env,
    video_folder="cartpole-agent",    # Folder to save videos
    name_prefix="eval",               # Prefix for video filenames
    episode_trigger=lambda x: True    # Record every episode
)

# Add episode statistics tracking
env = RecordEpisodeStatistics(env, buffer_length=num_eval_episodes)

print(f"Starting evaluation for {num_eval_episodes} episodes...")
print("Videos will be saved to: cartpole-agent/")

for episode_num in range(num_eval_episodes):
    obs, info = env.reset()
    episode_reward = 0
    step_count = 0
    episode_over = False

    while not episode_over:
        # Replace this with your trained agent's policy
        action = env.action_space.sample()  # Random policy for demonstration

        obs, reward, terminated, truncated, info = env.step(action)
        episode_reward += reward
        step_count += 1
        episode_over = terminated or truncated

    print(f"Episode {episode_num + 1}: {step_count} steps, reward = {episode_reward}")

env.close()

# Print summary statistics
print(f'\nEvaluation Summary:')
print(f'Episode durations: {list(env.time_queue)}')
print(f'Episode rewards: {list(env.return_queue)}')
print(f'Episode lengths: {list(env.length_queue)}')

# Calculate some useful metrics
avg_reward = np.mean(env.return_queue)
avg_length = np.mean(env.length_queue)
std_reward = np.std(env.return_queue)

print(f'\nAverage reward: {avg_reward:.2f} ± {std_reward:.2f}')
print(f'Average episode length: {avg_length:.1f} steps')
print(f'Success rate: {sum(1 for r in env.return_queue if r > 0) / len(env.return_queue):.1%}')
```
### Understanding the Output
After running this code, you'll find:
**Video Files**: `cartpole-agent/eval-episode-0.mp4`, `eval-episode-1.mp4`, etc.
- Each file shows one complete episode from start to finish
- Useful for seeing exactly how your agent behaves
- Can be shared, embedded in presentations, or analyzed frame-by-frame
**Console Output**: Episode-by-episode performance plus summary statistics
```
Episode 1: 23 steps, reward = 23.0
Episode 2: 15 steps, reward = 15.0
Episode 3: 200 steps, reward = 200.0
Episode 4: 67 steps, reward = 67.0
Average reward: 76.25 ± 78.29
Average episode length: 76.2 steps
Success rate: 100.0%
```
**Statistics Queues**: Time, reward, and length data for each episode
- `env.time_queue`: How long each episode took (wall-clock time)
- `env.return_queue`: Total reward for each episode
- `env.length_queue`: Number of steps in each episode
```{eval-rst}
.. py:currentmodule:: gymnasium.wrappers

In the script above, the :class:`RecordVideo` wrapper saves videos with filenames like "eval-episode-0.mp4" in the specified folder. The ``episode_trigger=lambda x: True`` ensures every episode is recorded.

The :class:`RecordEpisodeStatistics` wrapper tracks performance metrics in internal queues (capped at ``buffer_length`` entries) that we access after evaluation to compute averages and other statistics.

For computational efficiency during evaluation, it's possible to implement this with vector environments to evaluate N episodes in parallel rather than sequentially.
```
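A rough sketch of that idea, assuming Gymnasium v1.0+'s `make_vec` and the vector version of `RecordEpisodeStatistics` (the number of environments and steps are placeholders):

```python
import gymnasium as gym
import numpy as np
from gymnasium.wrappers.vector import RecordEpisodeStatistics

# Run several CartPole instances in parallel and track their episode statistics
envs = gym.make_vec("CartPole-v1", num_envs=4)
envs = RecordEpisodeStatistics(envs, buffer_length=100)

obs, info = envs.reset(seed=42)
for _ in range(1_000):
    actions = envs.action_space.sample()  # batched random actions
    obs, rewards, terminations, truncations, infos = envs.step(actions)
    # Finished sub-environments reset automatically

envs.close()
print(f"Recorded {len(envs.return_queue)} episodes, "
      f"mean return: {np.mean(envs.return_queue):.1f}")
```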
## Recording During Training (Periodic)

During training, you'll run hundreds or thousands of episodes, so recording every one isn't practical. Instead, record videos periodically while logging statistics for every episode. Below we use Python's built-in logger, but [tensorboard](https://www.tensorflow.org/tensorboard), [wandb](https://docs.wandb.ai/guides/track), and other tracking modules work just as well:
```python
import logging

import gymnasium as gym
from gymnasium.wrappers import RecordEpisodeStatistics, RecordVideo

# Training configuration
training_period = 250           # Record video every 250 episodes
num_training_episodes = 10_000  # Total training episodes
env_name = "CartPole-v1"

# Set up logging for episode statistics
logging.basicConfig(level=logging.INFO, format='%(message)s')

# Create environment with periodic video recording
env = gym.make(env_name, render_mode="rgb_array")

# Record videos periodically (every 250 episodes)
env = RecordVideo(
    env,
    video_folder="cartpole-training",
    name_prefix="training",
    episode_trigger=lambda x: x % training_period == 0  # Only record every 250th episode
)

# Track statistics for every episode (lightweight)
env = RecordEpisodeStatistics(env)

print(f"Starting training for {num_training_episodes} episodes")
print(f"Videos will be recorded every {training_period} episodes")
print("Videos saved to: cartpole-training/")

for episode_num in range(num_training_episodes):
    obs, info = env.reset()
    episode_over = False

    while not episode_over:
        # Replace with your actual training agent
        action = env.action_space.sample()  # Random policy for demonstration
        obs, reward, terminated, truncated, info = env.step(action)
        episode_over = terminated or truncated

    # Log episode statistics (available in info after the episode ends)
    if "episode" in info:
        episode_data = info["episode"]
        logging.info(f"Episode {episode_num}: "
                     f"reward={episode_data['r']:.1f}, "
                     f"length={episode_data['l']}, "
                     f"time={episode_data['t']:.2f}s")

    # Additional analysis for milestone episodes
    if episode_num % 1000 == 0:
        # Look at recent performance (last 100 episodes)
        recent_rewards = list(env.return_queue)[-100:]
        if recent_rewards:
            avg_recent = sum(recent_rewards) / len(recent_rewards)
            print(f"  -> Average reward over last 100 episodes: {avg_recent:.1f}")

env.close()
```
### Training Recording Benefits
**Progress Videos**: Watch your agent improve over time
- `training-episode-0.mp4`: Random initial behavior
- `training-episode-250.mp4`: Some patterns emerging
- `training-episode-500.mp4`: Clear improvement
- `training-episode-1000.mp4`: Competent performance
**Learning Curves**: Plot episode statistics over time
```python
import matplotlib.pyplot as plt

# Plot learning progress
episodes = range(len(env.return_queue))
rewards = list(env.return_queue)

plt.figure(figsize=(10, 6))
plt.plot(episodes, rewards, alpha=0.3, label='Episode Rewards')

# Add moving average for clearer trend
window = 100
if len(rewards) > window:
    moving_avg = [sum(rewards[i:i+window]) / window
                  for i in range(len(rewards) - window + 1)]
    plt.plot(range(window - 1, len(rewards)), moving_avg,
             label=f'{window}-Episode Moving Average', linewidth=2)

plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Learning Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```
## Integration with Experiment Tracking
For more sophisticated projects, integrate with experiment tracking tools:
```python
# Example with Weights & Biases (wandb)
import os

import wandb

# Initialize experiment tracking
wandb.init(project="cartpole-training", name="q-learning-run-1")

# Log episode statistics
for episode_num in range(num_training_episodes):
    # ... training code ...

    if "episode" in info:
        episode_data = info["episode"]
        wandb.log({
            "episode": episode_num,
            "reward": episode_data['r'],
            "length": episode_data['l'],
            "episode_time": episode_data['t']
        })

    # Upload videos periodically
    if episode_num % training_period == 0:
        video_path = f"cartpole-training/training-episode-{episode_num}.mp4"
        if os.path.exists(video_path):
            wandb.log({"training_video": wandb.Video(video_path)})
```
## Best Practices Summary
**For Evaluation**:
- Record every episode to get complete performance picture
- Use multiple seeds for statistical significance
- Save both videos and numerical data
- Calculate confidence intervals for metrics
**For Training**:
- Record periodically (every 100-1000 episodes)
- Focus on episode statistics over videos during training
- Use adaptive recording triggers for interesting episodes
- Monitor memory usage for long training runs
**For Analysis**:
- Create moving averages to smooth noisy learning curves
- Look for patterns in both success and failure episodes
- Compare agent behavior at different stages of training
- Save raw data for later analysis and comparison
## More Information
* [Training an agent](train_agent) - Learn how to build the agents you're recording
* [Basic usage](basic_usage) - Understand Gymnasium fundamentals
* [More training tutorials](../tutorials/training_agents) - Advanced training techniques
* [Custom environments](create_custom_env) - Create your own environments to record
Recording agent behavior is an essential skill for RL practitioners. It helps you understand what your agent is actually learning, debug training issues, and communicate results effectively. Start with simple recording setups and gradually add more sophisticated analysis as your projects grow in complexity!


@@ -5,29 +5,91 @@ title: Train an Agent
# Training an Agent
When we talk about training an RL agent, we're teaching it to make good decisions through experience. Unlike supervised learning, where we show examples of correct answers, RL agents learn by trying different actions and observing the results. It's like learning to ride a bike - you try different movements, fall down a few times, and gradually learn what works.

The goal is to develop a **policy** - a strategy that tells the agent what action to take in each situation to maximize long-term rewards.
## Understanding Q-Learning Intuitively
For this tutorial, we'll use Q-learning to solve the Blackjack environment. Q-learning (Watkins, 1989) is a model-free, off-policy algorithm for environments with discrete action spaces, and it was famously the first reinforcement learning algorithm proven to converge to an optimal policy under certain conditions. But before the code, let's understand how it works conceptually.
Q-learning builds a giant "cheat sheet" called a Q-table that tells the agent how good each action is in each situation:
- **Rows** = different situations (states) the agent can encounter
- **Columns** = different actions the agent can take
- **Values** = how good that action is in that situation (expected future reward)
For Blackjack:
- **States**: Your hand value, dealer's showing card, whether you have a usable ace
- **Actions**: Hit (take another card) or Stand (keep current hand)
- **Q-values**: Expected reward for each action in each state
### The Learning Process
1. **Try an action** and see what happens (reward + new state)
2. **Update your cheat sheet**: "That action was better/worse than I thought"
3. **Gradually improve** by trying actions and updating estimates
4. **Balance exploration vs exploitation**: Try new things vs use what you know works
**Why it works**: Over time, good actions get higher Q-values, bad actions get lower Q-values. The agent learns to pick actions with the highest expected rewards.
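To make the "cheat sheet" analogy concrete, here is a toy sketch (the state and values are made up purely for illustration):

```python
from collections import defaultdict

import numpy as np

# One row per state, one column per action (0 = stand, 1 = hit)
q_table = defaultdict(lambda: np.zeros(2))

state = (14, 10, False)    # (player sum, dealer showing card, usable ace)
q_table[state][0] = -0.35  # current estimate: standing here tends to lose...
q_table[state][1] = 0.12   # ...while hitting looks slightly better

best_action = int(np.argmax(q_table[state]))
print(best_action)  # 1 -> hit
```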
---
This page provides a short outline of how to train an agent for a Gymnasium environment. We'll use tabular Q-learning to solve Blackjack-v1. For complete tutorials with other environments and algorithms, see [training tutorials](../tutorials/training_agents). Please read [basic usage](basic_usage) before this page.
## About the Environment: Blackjack
Blackjack is one of the most popular casino card games and is perfect for learning RL because it has:
- **Clear rules**: Get closer to 21 than the dealer without going over
- **Simple observations**: Your hand value, dealer's showing card, usable ace
- **Discrete actions**: Hit (take card) or Stand (keep current hand)
- **Immediate feedback**: Win, lose, or draw after each hand
This version uses an infinite deck (cards drawn with replacement), so card counting won't work - the agent must learn optimal basic strategy through trial and error. Full environment documentation is available on the [Blackjack environment page](https://gymnasium.farama.org/environments/toy_text/blackjack/).
**Environment Details**:
- **Observation**: (player_sum, dealer_card, usable_ace)
- `player_sum`: Current hand value (4-21)
- `dealer_card`: Dealer's face-up card (1-10)
- `usable_ace`: Whether player has usable ace (True/False)
- **Actions**: 0 = Stand, 1 = Hit
- **Rewards**: +1 for win, -1 for loss, 0 for draw
- **Episode ends**: When player stands or busts (goes over 21)
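To see these pieces concretely, you can create the environment and inspect its spaces (a quick sanity check, separate from the training code below):

```python
import gymnasium as gym

env = gym.make("Blackjack-v1", sab=False)
obs, info = env.reset(seed=42)

print(env.observation_space)  # Tuple(Discrete(32), Discrete(11), Discrete(2))
print(env.action_space)       # Discrete(2) -> 0 = stand, 1 = hit
print(obs)                    # e.g. (15, 10, 0): player sum 15, dealer shows 10, no usable ace
```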
## Executing an action
After receiving our first observation from `env.reset()`, we use `env.step(action)` to interact with the environment. This function takes an action and returns five important values:

```python
observation, reward, terminated, truncated, info = env.step(action)
```

- **`observation`**: What the agent sees after taking the action (new game state)
- **`reward`**: Immediate feedback for that action (+1, -1, or 0 in Blackjack)
- **`terminated`**: Whether the episode ended naturally (hand finished)
- **`truncated`**: Whether the episode was cut short (time limits - not used in Blackjack)
- **`info`**: Additional debugging information (can usually be ignored in Blackjack; in Atari environments, for example, it contains an `ale.lives` key with the number of lives remaining)

The key insight is that `reward` tells us how good our *immediate* action was, but the agent needs to learn about *long-term* consequences. Q-learning handles this by estimating the total future reward, not just the immediate reward.

Note that it is not a good idea to call `env.render()` inside your training loop because rendering slows training down considerably. Instead, build a separate loop to evaluate and showcase the agent after training.
## Building a Q-Learning Agent
Let's build our agent step by step. We need functions for:
1. **Choosing actions** (with exploration vs exploitation)
2. **Learning from experience** (updating Q-values)
3. **Managing exploration** (reducing randomness over time)
### Exploration vs Exploitation
This is a fundamental challenge in RL:
- **Exploration**: Try new actions to learn about the environment
- **Exploitation**: Use current knowledge to get the best rewards
We use **epsilon-greedy** strategy:
- With probability `epsilon`: choose a random action (explore)
- With probability `1-epsilon`: choose the best known action (exploit)
Starting with high epsilon (lots of exploration) and gradually reducing it (more exploitation as we learn) works well in practice.
```python
from collections import defaultdict

import gymnasium as gym
import numpy as np


class BlackjackAgent:
    def __init__(
        self,
        env: gym.Env,
        learning_rate: float,
        initial_epsilon: float,
        epsilon_decay: float,
        final_epsilon: float,
        discount_factor: float = 0.95,
    ):
        """Initialize a Q-Learning agent.

        Args:
            env: The training environment
            learning_rate: How quickly to update Q-values (0-1)
            initial_epsilon: Starting exploration rate (usually 1.0)
            epsilon_decay: How much to reduce epsilon each episode
            final_epsilon: Minimum exploration rate (usually 0.1)
            discount_factor: How much to value future rewards (0-1)
        """
        self.env = env

        # Q-table: maps (state, action) to expected reward
        # defaultdict automatically creates entries with zeros for new states
        self.q_values = defaultdict(lambda: np.zeros(env.action_space.n))

        self.lr = learning_rate
        self.discount_factor = discount_factor  # How much we care about future rewards

        # Exploration parameters
        self.epsilon = initial_epsilon
        self.epsilon_decay = epsilon_decay
        self.final_epsilon = final_epsilon

        # Track learning progress
        self.training_error = []

    def get_action(self, obs: tuple[int, int, bool]) -> int:
        """Choose an action using the epsilon-greedy strategy.

        Returns:
            action: 0 (stand) or 1 (hit)
        """
        # With probability epsilon: explore (random action)
        if np.random.random() < self.epsilon:
            return self.env.action_space.sample()
        # With probability (1 - epsilon): exploit (best known action)
        else:
            return int(np.argmax(self.q_values[obs]))

    def update(
        self,
        obs: tuple[int, int, bool],
        action: int,
        reward: float,
        terminated: bool,
        next_obs: tuple[int, int, bool],
    ):
        """Update the Q-value based on experience.

        This is the heart of Q-learning: learn from (state, action, reward, next_state).
        """
        # What's the best we could do from the next state?
        # (Zero if the episode terminated - no future rewards are possible)
        future_q_value = (not terminated) * np.max(self.q_values[next_obs])

        # What should the Q-value be? (Bellman equation)
        target = reward + self.discount_factor * future_q_value

        # How wrong was our current estimate?
        temporal_difference = target - self.q_values[obs][action]

        # Update our estimate in the direction of the error
        # The learning rate controls how big a step we take
        self.q_values[obs][action] = (
            self.q_values[obs][action] + self.lr * temporal_difference
        )

        # Track learning progress (useful for debugging)
        self.training_error.append(temporal_difference)

    def decay_epsilon(self):
        """Reduce the exploration rate after each episode."""
        self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
```
### Understanding the Q-Learning Update
The core learning happens in the `update` method. Let's break down the math:
```python
# Current estimate: Q(state, action)
current_q = self.q_values[obs][action]

# What we actually experienced: reward + discounted future value
target = reward + self.discount_factor * max(self.q_values[next_obs])

# How wrong were we?
error = target - current_q

# Update the estimate: move toward the target
new_q = current_q + self.lr * error
```
This is the famous **Bellman equation** in action - it says the value of a state-action pair should equal the immediate reward plus the discounted value of the best next action.
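Written as the standard Q-learning update (with learning rate $\alpha$ and discount factor $\gamma$), the same idea reads:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]
$$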
## Training the Agent
Now let's train our agent. The process is:
1. **Reset environment** to start a new episode
2. **Play one complete hand** (episode), choosing actions and learning from each step
3. **Update exploration rate** (reduce epsilon)
4. **Repeat** for many episodes until the agent learns good strategy
```python
# Training hyperparameters
learning_rate = 0.01   # How fast to learn (higher = faster but less stable)
n_episodes = 100_000   # Number of hands to practice
start_epsilon = 1.0    # Start with 100% random actions
epsilon_decay = start_epsilon / (n_episodes / 2)  # Reduce exploration over time
final_epsilon = 0.1    # Always keep some exploration

# Create environment and agent
env = gym.make("Blackjack-v1", sab=False)
env = gym.wrappers.RecordEpisodeStatistics(env, buffer_length=n_episodes)

agent = BlackjackAgent(
    env=env,
    learning_rate=learning_rate,
    initial_epsilon=start_epsilon,
    epsilon_decay=epsilon_decay,
    final_epsilon=final_epsilon,
)
```
Info: The current hyperparameters are set to quickly train a decent agent. If you want to converge to the optimal policy, try increasing the ``n_episodes`` by 10x and lower the learning_rate (e.g. to 0.001).
### The Training Loop
```python
from tqdm import tqdm  # Progress bar

for episode in tqdm(range(n_episodes)):
    # Start a new hand
    obs, info = env.reset()
    done = False

    # Play one complete hand
    while not done:
        # Agent chooses an action (initially random, gradually more intelligent)
        action = agent.get_action(obs)

        # Take the action and observe the result
        next_obs, reward, terminated, truncated, info = env.step(action)

        # Learn from this experience
        agent.update(obs, action, reward, terminated, next_obs)

        # Move to the next state
        done = terminated or truncated
        obs = next_obs

    # Reduce the exploration rate (the agent becomes less random over time)
    agent.decay_epsilon()
```
### What to Expect During Training
**Early episodes (0-10,000)**:
- Agent acts mostly randomly (high epsilon)
- Win rate is low while the strategy is still essentially random
- Large learning errors as Q-values are very inaccurate

**Middle episodes (10,000-50,000)**:
- Agent starts finding good strategies
- Win rate improves steadily
- Learning errors decrease as estimates get better

**Later episodes (50,000+)**:
- Agent converges to a near-optimal strategy
- Win rate plateaus around 42-45% (close to the best achievable, since the house keeps an edge)
- Small learning errors as Q-values stabilize
## Analyzing Training Results
Let's visualize the training progress:
```python
from matplotlib import pyplot as plt

def get_moving_avgs(arr, window, convolution_mode):
    """Compute moving average to smooth noisy data."""
    return np.convolve(
        np.array(arr).flatten(),
        np.ones(window),
        mode=convolution_mode
    ) / window

# Smooth over a 500-episode window
rolling_length = 500
fig, axs = plt.subplots(ncols=3, figsize=(12, 5))

# Episode rewards (win/loss performance)
axs[0].set_title("Episode rewards")
reward_moving_average = get_moving_avgs(
    env.return_queue,
    rolling_length,
    "valid"
)
axs[0].plot(range(len(reward_moving_average)), reward_moving_average)
axs[0].set_ylabel("Average Reward")
axs[0].set_xlabel("Episode")

# Episode lengths (how many actions per hand)
axs[1].set_title("Episode lengths")
length_moving_average = get_moving_avgs(
    env.length_queue,
    rolling_length,
    "valid"
)
axs[1].plot(range(len(length_moving_average)), length_moving_average)
axs[1].set_ylabel("Average Episode Length")
axs[1].set_xlabel("Episode")

# Training error (how much we're still learning)
axs[2].set_title("Training Error")
training_error_moving_average = get_moving_avgs(
    agent.training_error,
    rolling_length,
    "same"
)
axs[2].plot(range(len(training_error_moving_average)), training_error_moving_average)
axs[2].set_ylabel("Temporal Difference Error")
axs[2].set_xlabel("Step")

plt.tight_layout()
plt.show()
```
![](../_static/img/tutorials/blackjack_training_plots.png "Training Plot")
### Interpreting the Results

**Reward Plot**: Should show gradual improvement from ~-0.05 (slightly negative) to ~-0.01 (near optimal). Blackjack is a difficult game - even perfect play loses slightly due to the house edge.

**Episode Length**: Should stabilize around 2-3 actions per episode. Very short episodes suggest the agent is standing too early; very long episodes suggest hitting too often.

**Training Error**: Should decrease over time, indicating the agent's predictions are getting more accurate. Large spikes early in training are normal as the agent encounters new situations.
## Common Training Issues and Solutions
### 🚨 **Agent Never Improves**
**Symptoms**: Reward stays constant, large training errors
**Causes**: Learning rate too high/low, poor reward design, bugs in update logic
**Solutions**:
- Try learning rates between 0.001 and 0.1
- Check that rewards are meaningful (-1, 0, +1 for Blackjack)
- Verify Q-table is actually being updated
### 🚨 **Unstable Training**
**Symptoms**: Rewards fluctuate wildly, never converge
**Causes**: Learning rate too high, insufficient exploration
**Solutions**:
- Reduce learning rate (try 0.01 instead of 0.1)
- Ensure minimum exploration (final_epsilon ≥ 0.05)
- Train for more episodes
### 🚨 **Agent Gets Stuck in Poor Strategy**
**Symptoms**: Improvement stops early, suboptimal final performance
**Causes**: Too little exploration, learning rate too low
**Solutions**:
- Increase exploration time (slower epsilon decay)
- Try higher learning rate initially
- Use different exploration strategies, e.g. optimistic initialization (see the sketch below)
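Optimistic initialization can be as simple as starting every Q-value above zero so unvisited actions look attractive until proven otherwise; a sketch, reusing the `BlackjackAgent` above:

```python
from collections import defaultdict

import numpy as np

# Instead of zeros, start each state's action values at an optimistic +1.0
# (agent and env are the objects created earlier in this tutorial)
agent.q_values = defaultdict(lambda: np.ones(env.action_space.n))
```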
### 🚨 **Learning Too Slow**
**Symptoms**: Agent improves but very gradually
**Causes**: Learning rate too low, too much exploration
**Solutions**:
- Increase learning rate (but watch for instability)
- Faster epsilon decay (less random exploration)
- More focused training on difficult states
## Testing Your Trained Agent
Once training is complete, test your agent's performance:
```python
# Test the trained agent
def test_agent(agent, env, num_episodes=1000):
    """Test agent performance without learning or exploration."""
    total_rewards = []

    # Temporarily disable exploration for testing
    old_epsilon = agent.epsilon
    agent.epsilon = 0.0  # Pure exploitation

    for _ in range(num_episodes):
        obs, info = env.reset()
        episode_reward = 0
        done = False

        while not done:
            action = agent.get_action(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            done = terminated or truncated

        total_rewards.append(episode_reward)

    # Restore original epsilon
    agent.epsilon = old_epsilon

    win_rate = np.mean(np.array(total_rewards) > 0)
    average_reward = np.mean(total_rewards)

    print(f"Test Results over {num_episodes} episodes:")
    print(f"Win Rate: {win_rate:.1%}")
    print(f"Average Reward: {average_reward:.3f}")
    print(f"Standard Deviation: {np.std(total_rewards):.3f}")

# Test your agent
test_agent(agent, env)
```
Good Blackjack performance:
- **Win rate**: 42-45% (house edge makes >50% impossible)
- **Average reward**: -0.02 to +0.01
- **Consistency**: Low standard deviation indicates reliable strategy
## Next Steps
Congratulations! You've successfully trained your first RL agent. Here's what to explore next:
1. **Try other environments**: CartPole, MountainCar, LunarLander
2. **Experiment with hyperparameters**: Learning rates, exploration strategies
3. **Implement other algorithms**: SARSA, Expected SARSA, or Monte Carlo ES (covered in [Sutton & Barto](http://incompleteideas.net/book/the-book-2nd.html), section 5.3, so you can compare your results directly with the book)
4. **Add function approximation**: Neural networks for larger state spaces
5. **Create custom environments**: Design your own RL problems
For more information, see:
* [Basic Usage](basic_usage) - Understanding Gymnasium fundamentals
* [Custom Environments](create_custom_env) - Building your own RL problems
* [Complete Training Tutorials](../tutorials/training_agents) - More algorithms and environments
* [Recording Agent Behavior](record_agent) - Saving videos and performance data
The key insight from this tutorial is that RL agents learn through trial and error, gradually building up knowledge about what actions work best in different situations. Q-learning provides a systematic way to learn this knowledge, balancing exploration of new possibilities with exploitation of current knowledge.
Keep experimenting, and remember that RL is as much art as science - finding the right hyperparameters and environment design often requires patience and intuition!