{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Handling Time Limits\n\nIn using Gymnasium environments with reinforcement learning code, a common problem observed is how time limits are incorrectly handled. The ``done`` signal received (in previous versions of OpenAI Gym < 0.26) from ``env.step`` indicated whether an episode has ended. However, this signal did not distinguish whether the episode ended due to ``termination`` or ``truncation``.\n\n## Termination\n\nTermination refers to the episode ending after reaching a terminal state that is defined as part of the environment\ndefinition. Examples are - task success, task failure, robot falling down etc. Notably, this also includes episodes\nending in finite-horizon environments due to a time-limit inherent to the environment. Note that to preserve Markov\nproperty, a representation of the remaining time must be present in the agent's observation in finite-horizon environments.\n[(Reference)](https://arxiv.org/abs/1712.00378)\n\n## Truncation\n\nTruncation refers to the episode ending after an externally defined condition (that is outside the scope of the Markov\nDecision Process). This could be a time-limit, a robot going out of bounds etc.\n\nAn infinite-horizon environment is an obvious example of where this is needed. We cannot wait forever for the episode\nto complete, so we set a practical time-limit after which we forcibly halt the episode. The last state in this case is\nnot a terminal state since it has a non-zero transition probability of moving to another state as per the Markov\nDecision Process that defines the RL problem. This is also different from time-limits in finite horizon environments\nas the agent in this case has no idea about this time-limit.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importance in learning code\nBootstrapping (using one or more estimated values of a variable to update estimates of the same variable) is a key\naspect of Reinforcement Learning. A value function will tell you how much discounted reward you will get from a\nparticular state if you follow a given policy. When an episode stops at any given point, by looking at the value of\nthe final state, the agent is able to estimate how much discounted reward could have been obtained if the episode has\ncontinued. This is an example of handling truncation.\n\nMore formally, a common example of bootstrapping in RL is updating the estimate of the Q-value function,\n\n\\begin{align}Q_{target}(o_t, a_t) = r_t + \\gamma . \\max_a(Q(o_{t+1}, a_{t+1}))\\end{align}\n\n\nIn classical RL, the new ``Q`` estimate is a weighted average of the previous ``Q`` estimate and ``Q_target`` while in Deep\nQ-Learning, the error between ``Q_target`` and the previous ``Q`` estimate is minimized.\n\nHowever, at the terminal state, bootstrapping is not done,\n\n\\begin{align}Q_{target}(o_t, a_t) = r_t\\end{align}\n\nThis is where the distinction between termination and truncation becomes important. When an episode ends due to\ntermination we don't bootstrap, when it ends due to truncation, we bootstrap.\n\nWhile using gymnasium environments, the ``done`` signal (default for < v0.26) is frequently used to determine whether to\nbootstrap or not. However, this is incorrect since it does not differentiate between termination and truncation.\n\nA simple example of value functions is shown below. This is an illustrative example and not part of any specific algorithm.\n\n.. code:: python\n\n # INCORRECT\n vf_target = rew + gamma * (1 - done) * vf_next_state\n\nThis is incorrect in the case of episode ending due to a truncation, where bootstrapping needs to happen but it doesn't.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n\nFrom v0.26 onwards, Gymnasium's ``env.step`` API returns both termination and truncation information explicitly.\nIn the previous version truncation information was supplied through the info key ``TimeLimit.truncated``.\nThe correct way to handle terminations and truncations now is,\n\n.. code:: python\n\n # terminated = done and 'TimeLimit.truncated' not in info\n # This was needed in previous versions.\n\n vf_target = rew + gamma * (1 - terminated) * vf_next_state\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" } }, "nbformat": 4, "nbformat_minor": 0 }