"""
|
|||
|
Solving Blackjack with Q-Learning
|
|||
|
=================================
|
|||
|
|
|||
|
"""
|
|||
|
|
|||
|
|
|||
|
# %%
|
|||
|
# .. image:: /_static/img/tutorials/blackjack_AE_loop.jpg
|
|||
|
# :width: 650
|
|||
|
# :alt: agent-environment-diagram
|
|||
|
#
|
|||
|
# In this tutorial, we’ll explore and solve the *Blackjack-v1*
|
|||
|
# environment.
|
|||
|
#
|
|||
|
# **Blackjack** is one of the most popular casino card games that is also
|
|||
|
# infamous for being beatable under certain conditions. This version of
|
|||
|
# the game uses an infinite deck (we draw the cards with replacement), so
|
|||
|
# counting cards won’t be a viable strategy in our simulated game.
|
|||
|
#
|
|||
|
# **Objective**: To win, your card sum should be greater than than the
|
|||
|
# dealers without exceeding 21.
|
|||
|
#
|
|||
|
# **Approach**: To solve this environment by yourself, you can pick your
|
|||
|
# favorite discrete RL algorithm. The presented solution uses *Q-learning*
|
|||
|
# (a model-free RL algorithm).
|
|||
|
#
|
|||
|
|
|||
|
|
|||
|
# %%
# Imports and Environment Setup
# ------------------------------
#

# Author: Till Zemann
# License: MIT License

from collections import defaultdict

import gym
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.patches import Patch

# Let's start by creating the blackjack environment.
# Note: We are going to follow the rules from Sutton & Barto.
# Other versions of the game can be found below for you to experiment with.

env = gym.make("Blackjack-v1", sab=True)


# %%
# .. code:: py
#
#   # Other possible environment configurations:
#
#   env = gym.make('Blackjack-v1', natural=True, sab=False)
#
#   env = gym.make('Blackjack-v1', natural=False, sab=False)
#


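# %%
# Before interacting with the environment, it can help to inspect its
# observation and action spaces. This quick check is an optional addition
# (it is not needed for the rest of the tutorial):

# the observation space describes the 3-tuple observations,
# the action space holds the two possible actions (stick / hit)
print(env.observation_space)
print(env.action_space)

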
# %%
# Observing the environment
# ------------------------------
#
# First of all, we call ``env.reset()`` to start an episode. This function
# resets the environment to a starting position and returns an initial
# ``observation``. We usually also set ``done = False``. This variable
# will be useful later to check if a game is terminated. In this tutorial
# we will use the terms observation and state synonymously, but in more
# complex problems a state might differ from the observation it is based
# on.
#

# reset the environment to get the first observation
done = False
observation, info = env.reset()

print(observation)


# %%
# Note that our observation is a 3-tuple consisting of 3 discrete values:
#
# - The player's current sum
# - The value of the dealer's face-up card
# - Boolean whether the player holds a usable ace (an ace is usable if it
#   counts as 11 without busting)
#


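# %%
# To make the tuple's structure concrete, it can be unpacked into its three
# components. This unpacking is only for illustration and is not used
# anywhere else in the tutorial:

player_sum, dealer_card, usable_ace = observation
print("player sum:", player_sum, "| dealer card:", dealer_card, "| usable ace:", usable_ace)

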
# %%
# Executing an action
# ------------------------------
#
# After receiving our first observation, we are only going to use the
# ``env.step(action)`` function to interact with the environment. This
# function takes an action as input and executes it in the environment.
# Because that action changes the state of the environment, it returns
# five useful variables to us. These are:
#
# - ``next_state``: This is the observation that the agent will receive
#   after taking the action.
# - ``reward``: This is the reward that the agent will receive after
#   taking the action.
# - ``terminated``: This is a boolean variable that indicates whether or
#   not the episode is over.
# - ``truncated``: This is a boolean variable that also indicates whether
#   the episode ended by early truncation.
# - ``info``: This is a dictionary that might contain additional
#   information about the environment.
#
# The ``next_state``, ``reward``, ``terminated`` and ``truncated`` variables
# are self-explanatory, but the ``info`` variable requires some additional
# explanation. This variable contains a dictionary that might have some
# extra information about the environment, but in the Blackjack-v1
# environment you can ignore it. For example, in Atari environments the
# info dictionary has an ``ale.lives`` key that tells us how many lives the
# agent has left. If the agent has 0 lives, then the episode is over.
#
# Blackjack-v1 doesn’t have an ``env.render()`` function to render the
# environment, but in other environments you can use this function to
# watch the agent play. It is important to note that using ``env.render()``
# is optional - the environment is going to work even if you don’t render
# it, but it can be helpful to see an episode rendered out to get an idea
# of how the current policy behaves. Note that it is not a good idea to
# call this function in your training loop because rendering slows down
# training by a lot. Rather, try to build an extra loop to evaluate and
# showcase the agent after training (a sketch of such a loop follows the
# training step below).
#

# sample a random action from all valid actions
action = env.action_space.sample()

# execute the action in our environment and receive infos from the environment
observation, reward, terminated, truncated, info = env.step(action)

print("observation:", observation)
print("reward:", reward)
print("terminated:", terminated)
print("truncated:", truncated)
print("info:", info)


# %%
# Once ``terminated = True`` or ``truncated = True``, we should stop the
# current episode and begin a new one with ``env.reset()``. If you
# continue executing actions without resetting the environment, it still
# responds, but the output won’t be useful for training (it might even be
# harmful if the agent learns on invalid data).
#

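# %%
# A minimal sketch of that reset pattern (the training loop further down
# handles this for us, so the snippet is only illustrative):

if terminated or truncated:
    observation, info = env.reset()

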
# %%
# Building an agent
# ------------------------------
#
# Let’s build a ``Q-learning agent`` to solve *Blackjack-v1*! We’ll need
# some functions for picking an action and updating the agent’s action
# values. To ensure that the agent explores the environment, one possible
# solution is the ``epsilon-greedy`` strategy, where we pick a random
# action with probability ``epsilon`` and the greedy action (the one
# currently valued best) with probability ``1 - epsilon``.
#

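# %%
# Written out as an equation, the tabular Q-learning update that the
# ``update`` method below performs looks as follows (with learning rate
# :math:`\alpha`, discount factor :math:`\gamma`, and ``done`` switching off
# the bootstrap term at the end of an episode):
#
# .. math::
#
#    Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \bigl( r + \gamma \, (1 - \text{done}) \max_{a'} Q(s', a') \bigr)
#
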
class BlackjackAgent:
    def __init__(
        self, lr=1e-3, epsilon=0.1, epsilon_decay=1e-4, discount_factor=0.95
    ):
        """
        Initialize a Reinforcement Learning agent with an empty dictionary
        of state-action values (q_values), a learning rate, an epsilon and
        a discount factor.
        """
        self.q_values = defaultdict(
            lambda: np.zeros(env.action_space.n)
        )  # maps a state to action values
        self.lr = lr
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.discount_factor = discount_factor  # gamma: how much we value future rewards

    def get_action(self, state):
        """
        Returns the best action with probability (1 - epsilon)
        and a random action with probability epsilon to ensure exploration.
        """
        # with probability epsilon return a random action to explore the environment
        if np.random.random() < self.epsilon:
            action = env.action_space.sample()

        # with probability (1 - epsilon) act greedily (exploit)
        else:
            action = np.argmax(self.q_values[state])
        return action

    def update(self, state, action, reward, next_state, done):
        """
        Updates the Q-value of an action.
        """
        old_q_value = self.q_values[state][action]
        # the bootstrap term is dropped once the episode is done
        max_future_q = np.max(self.q_values[next_state])
        target = reward + self.discount_factor * max_future_q * (1 - done)
        self.q_values[state][action] = (1 - self.lr) * old_q_value + self.lr * target

    def decay_epsilon(self):
        # reduce exploration over time, but never below zero
        self.epsilon = max(self.epsilon - self.epsilon_decay, 0)


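# %%
# A quick sanity check (purely illustrative, ``demo_agent`` is not used
# anywhere else): because ``q_values`` is a ``defaultdict``, querying a state
# the agent has never seen returns an array of zeros instead of raising a
# ``KeyError``.

demo_agent = BlackjackAgent()
print(demo_agent.q_values[(15, 10, False)])  # zeros for an unseen state
print(demo_agent.get_action((15, 10, False)))  # greedy or random action for that state

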
# %%
# To train the agent, we will let the agent play one episode (one complete
# game is called an episode) at a time and update its Q-values after each
# step. The agent will have to experience a lot of episodes to
# explore the environment sufficiently.
#
# Now we should be ready to build the training loop.
#

# hyperparameters
learning_rate = 1e-3
start_epsilon = 0.8
n_episodes = 200_000
epsilon_decay = start_epsilon / n_episodes  # less exploration over time

agent = BlackjackAgent(
    lr=learning_rate, epsilon=start_epsilon, epsilon_decay=epsilon_decay
)


def train(agent, n_episodes):
    for episode in range(n_episodes):

        # reset the environment
        state, info = env.reset()
        done = False

        # play one episode
        while not done:
            action = agent.get_action(state)
            next_state, reward, terminated, truncated, info = env.step(action)
            done = (
                terminated or truncated
            )  # if the episode terminated or was truncated early, set done to True
            agent.update(state, action, reward, next_state, done)
            state = next_state

        # reduce the exploration rate after every episode
        agent.decay_epsilon()


# %%
# Great, let’s train!
#

train(agent, n_episodes)


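# %%
# As mentioned earlier, it is better to evaluate the agent in a separate loop
# after training. The sketch below is an illustrative addition (the number of
# evaluation episodes and the win/draw/loss bookkeeping are arbitrary choices,
# not part of the original tutorial): we switch off exploration and simply
# count the outcomes.

n_eval_episodes = 1_000  # illustrative choice
agent.epsilon = 0  # act greedily during evaluation
outcomes = {"win": 0, "draw": 0, "loss": 0}

for _ in range(n_eval_episodes):
    state, info = env.reset()
    done = False
    while not done:
        action = agent.get_action(state)
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    # the final reward tells us how the game ended (+1 win, 0 draw, -1 loss)
    if reward > 0:
        outcomes["win"] += 1
    elif reward == 0:
        outcomes["draw"] += 1
    else:
        outcomes["loss"] += 1

print(outcomes)

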
# %%
# Visualizing the results
# ------------------------------
#


def create_grids(agent, usable_ace=False):

    # convert our state-action values to state values
    # and build a policy dictionary that maps observations to actions
    V = defaultdict(float)
    policy = defaultdict(int)
    for obs, action_values in agent.q_values.items():
        V[obs] = np.max(action_values)
        policy[obs] = np.argmax(action_values)

    X, Y = np.meshgrid(
        np.arange(12, 22),  # player's count
        np.arange(1, 11),  # dealer's face-up card
    )

    # create the value grid for plotting
    Z = np.apply_along_axis(
        lambda obs: V[(obs[0], obs[1], usable_ace)], axis=2, arr=np.dstack([X, Y])
    )
    value_grid = X, Y, Z

    # create the policy grid for plotting
    policy_grid = np.apply_along_axis(
        lambda obs: policy[(obs[0], obs[1], usable_ace)], axis=2, arr=np.dstack([X, Y])
    )
    return value_grid, policy_grid


def create_plots(value_grid, policy_grid, title="N/A"):

    # create a new figure with 2 subplots (left: state values, right: policy)
    X, Y, Z = value_grid
    fig = plt.figure(figsize=plt.figaspect(0.4))
    fig.suptitle(title, fontsize=16)

    # plot the state values
    ax1 = fig.add_subplot(1, 2, 1, projection="3d")
    ax1.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap="viridis", edgecolor="none")
    plt.xticks(range(12, 22), range(12, 22))
    plt.yticks(range(1, 11), ["A"] + list(range(2, 11)))
    ax1.set_title("State values: " + title)
    ax1.set_xlabel("Player sum")
    ax1.set_ylabel("Dealer showing")
    ax1.zaxis.set_rotate_label(False)
    ax1.set_zlabel("Value", fontsize=14, rotation=90)
    ax1.view_init(20, 220)

    # plot the policy
    fig.add_subplot(1, 2, 2)
    ax2 = sns.heatmap(policy_grid, linewidth=0, annot=True, cmap="Accent_r", cbar=False)
    ax2.set_title("Policy: " + title)
    ax2.set_xlabel("Player sum")
    ax2.set_ylabel("Dealer showing")
    ax2.set_xticklabels(range(12, 22))
    ax2.set_yticklabels(["A"] + list(range(2, 11)), fontsize=12)

    # add a legend
    legend_elements = [
        Patch(facecolor="lightgreen", edgecolor="black", label="Hit"),
        Patch(facecolor="grey", edgecolor="black", label="Stick"),
    ]
    ax2.legend(handles=legend_elements, bbox_to_anchor=(1.3, 1))
    return fig


# state values & policy with usable ace (ace counts as 11)
value_grid, policy_grid = create_grids(agent, usable_ace=True)
fig1 = create_plots(value_grid, policy_grid, title="With usable ace")
plt.show()


# %%
# .. image:: /_static/img/tutorials/blackjack_with_usable_ace.png
#

# state values & policy without usable ace (ace counts as 1)
value_grid, policy_grid = create_grids(agent, usable_ace=False)
fig2 = create_plots(value_grid, policy_grid, title="Without usable ace")
plt.show()


# %%
# .. image:: /_static/img/tutorials/blackjack_without_usable_ace.png
#
# It's good practice to call ``env.close()`` at the end of your script,
# so that any resources used by the environment are released.
#

env.close()


# %%
# Hopefully this tutorial helped you get a grasp of how to interact with
# OpenAI-Gym environments and set you on a journey to solve many more RL
# challenges.
#
# It is recommended that you solve this environment by yourself (project-based
# learning is really effective!). You can apply your favorite
# discrete RL algorithm or give Monte Carlo ES a try (covered in `Sutton &
# Barto <http://incompleteideas.net/book/the-book-2nd.html>`__, section
# 5.3) - this way you can compare your results directly to the book.
#
# Have fun!
#