When we talk about training an RL agent, we're teaching it to make good decisions through experience. Unlike supervised learning where we show examples of correct answers, RL agents learn by trying different actions and observing the results. It's like learning to ride a bike - you try different movements, fall down a few times, and gradually learn what works.
For this tutorial, we'll use Q-learning to solve the Blackjack environment. But first, let's understand how Q-learning works conceptually.
Q-learning builds a giant "cheat sheet" called a Q-table that tells the agent how good each action is in each situation:
- **Rows** = different situations (states) the agent can encounter
- **Columns** = different actions the agent can take
- **Values** = how good that action is in that situation (expected future reward)
For Blackjack:
- **States**: Your hand value, dealer's showing card, whether you have a usable ace
- **Actions**: Hit (take another card) or Stand (keep current hand)
- **Q-values**: Expected reward for each action in each state
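In code, this Q-table can be represented as a dictionary keyed by observation tuples. The sketch below is illustrative (the variable names are not fixed by the environment); it assumes two actions, matching Blackjack-v1's action space:

```python
from collections import defaultdict

import numpy as np

# Observations in Blackjack-v1 are tuples: (player_sum, dealer_showing_card, usable_ace).
# Mapping each observation to a 2-element array (index 0 = Stand, 1 = Hit) gives us the
# "cheat sheet" without enumerating every state up front: unseen states start at zero.
q_values = defaultdict(lambda: np.zeros(2))

obs = (14, 10, 0)                            # player total 14, dealer shows a 10, no usable ace
best_action = int(np.argmax(q_values[obs]))  # currently 0 (Stand), since both estimates are 0.0
```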
### The Learning Process
1. **Try an action** and see what happens (reward + new state)
2. **Update your cheat sheet**: "That action was better/worse than I thought"
3. **Gradually improve** by trying actions and updating estimates
4. **Balance exploration vs. exploitation**: try new things vs. use what you already know works
**Why it works**: Over time, good actions get higher Q-values, bad actions get lower Q-values. The agent learns to pick actions with the highest expected rewards.
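A common way to handle that exploration-exploitation balance is an epsilon-greedy policy: with probability `epsilon` take a random action, otherwise take the action with the highest Q-value. A minimal sketch, reusing the `q_values` dictionary from above:

```python
import numpy as np

def choose_action(q_values, obs, epsilon, n_actions=2):
    """Epsilon-greedy action selection: explore with probability epsilon, exploit otherwise."""
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)  # explore: random Stand/Hit
    return int(np.argmax(q_values[obs]))     # exploit: current best estimate

In practice, `epsilon` usually starts near 1.0 and is decayed toward a small value over training, so the agent explores heavily at first and relies more on its learned estimates later.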
---
This page provides a short outline of how to train an agent for a Gymnasium environment. We'll use tabular Q-learning to solve Blackjack-v1. For complete tutorials with other environments and algorithms, see [training tutorials](../tutorials/training_agents). Please read [basic usage](basic_usage) before this page.
## About the Environment: Blackjack
Blackjack is one of the most popular casino card games and is perfect for learning RL because it has:
- **Clear rules**: Get closer to 21 than the dealer without going over
- **Simple observations**: Your hand value, dealer's showing card, usable ace
- **Discrete actions**: Hit (take card) or Stand (keep current hand)
- **Immediate feedback**: Win, lose, or draw after each hand
This version uses an infinite deck (cards are drawn with replacement), so card counting won't work - the agent must learn the optimal basic strategy through trial and error.
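For reference, creating the environment and looking at a starting hand takes only a few lines:

```python
import gymnasium as gym

# Cards are drawn with replacement, so every draw is independent of the previous ones.
env = gym.make("Blackjack-v1")

observation, info = env.reset()
# observation is a 3-tuple: (player_sum, dealer_showing_card, usable_ace)
print(observation)
```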
After receiving our first observation from `env.reset()`, we use `env.step(action)` to interact with the environment. This function takes an action and returns five important values:
```python
observation, reward, terminated, truncated, info = env.step(action)
```
- **`observation`**: What the agent sees after taking the action (new game state)
- **`reward`**: Immediate feedback for that action (+1, -1, or 0 in Blackjack)
- **`terminated`**: Whether the episode ended naturally (hand finished)
- **`truncated`**: Whether episode was cut short (time limits - not used in Blackjack)
- **`info`**: Additional debugging information (can usually be ignored)
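Putting those five values together, one complete hand played with random actions (just to exercise the loop; a trained policy would replace `env.action_space.sample()`) looks like this:

```python
import gymnasium as gym

env = gym.make("Blackjack-v1")
observation, info = env.reset()

episode_over = False
while not episode_over:
    action = env.action_space.sample()  # random Hit/Stand, for illustration only
    observation, reward, terminated, truncated, info = env.step(action)
    episode_over = terminated or truncated

print(f"Final reward: {reward}")  # +1 = win, -1 = loss, 0 = draw
env.close()
```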
The key insight is that `reward` tells us how good our *immediate* action was, but the agent needs to learn about *long-term* consequences. Q-learning handles this by estimating the total future reward, not just the immediate reward.
This is the famous **Bellman equation** in action - it says the value of a state-action pair should equal the immediate reward plus the discounted value of the best next action.
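As a concrete sketch of that update (with `learning_rate` and `discount_factor` standing in for the usual alpha and gamma hyperparameters, and `q_values` as the dictionary from earlier):

```python
import numpy as np

def update_q_value(q_values, obs, action, reward, next_obs, terminated,
                   learning_rate=0.01, discount_factor=0.95):
    """One Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    # Value of the best next action; zero if the episode ended on this step.
    future_value = (not terminated) * np.max(q_values[next_obs])
    target = reward + discount_factor * future_value

    # Temporal-difference error: how far off the current estimate was.
    td_error = target - q_values[obs][action]
    q_values[obs][action] += learning_rate * td_error
    return td_error
```

Returning the temporal-difference error also makes it easy to log the training error discussed below.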
**Reward Plot**: Should show gradual improvement from ~-0.05 (slightly negative) to ~-0.01 (near optimal). Blackjack is a difficult game - even perfect play loses slightly due to the house edge.
**Episode Length**: Should stabilize around 2-3 actions per episode. Very short episodes suggest the agent is standing too early; very long episodes suggest hitting too often.
**Training Error**: Should decrease over time, indicating the agent's predictions are getting more accurate. Large spikes early in training are normal as the agent encounters new situations.
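One way to produce these plots is to append each episode's reward, length, and TD errors to plain Python lists during training and smooth them with a rolling average. A sketch (the list names are assumptions, not part of the Gymnasium API):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_training_curves(episode_rewards, episode_lengths, training_errors, window=500):
    """Plot rolling averages of per-episode metrics collected during training."""
    def rolling_mean(values):
        return np.convolve(values, np.ones(window) / window, mode="valid")

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    panels = [("Episode rewards", episode_rewards),
              ("Episode lengths", episode_lengths),
              ("Training error", training_errors)]
    for ax, (title, values) in zip(axes, panels):
        ax.plot(rolling_mean(values))
        ax.set_title(title)
        ax.set_xlabel("Episode")
    fig.tight_layout()
    plt.show()
```

Gymnasium's `RecordEpisodeStatistics` wrapper can also collect episode rewards and lengths for you via the `info` dict, so you only need to track the TD errors yourself.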
## Common Training Issues and Solutions
### 🚨 **Agent Never Improves**
**Symptoms**: Reward stays constant, large training errors
**Causes**: Learning rate too high/low, poor reward design, bugs in update logic
**Solutions**:
- Try learning rates between 0.001 and 0.1
- Check that rewards are meaningful (-1, 0, +1 for Blackjack)
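One way to act on the first suggestion is a coarse sweep over a few learning rates, comparing late-training performance. In the sketch below, `run_training` is a hypothetical stand-in for your own training loop and is assumed to return the list of per-episode rewards it collected:

```python
# `run_training` is hypothetical: plug in your own training loop here.
for learning_rate in (0.001, 0.01, 0.05, 0.1):
    episode_rewards = run_training(learning_rate=learning_rate)
    tail = episode_rewards[-1000:]  # judge by the last 1000 episodes, not the noisy start
    print(f"learning_rate={learning_rate}: mean reward = {sum(tail) / len(tail):.3f}")
```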
The key insight from this tutorial is that RL agents learn through trial and error, gradually building up knowledge about what actions work best in different situations. Q-learning provides a systematic way to learn this knowledge, balancing exploration of new possibilities with exploitation of current knowledge.
Keep experimenting, and remember that RL is as much art as science - finding the right hyperparameters and environment design often requires patience and intuition!