import numpy as np
from gym import utils
from gym.envs.mujoco import mujoco_env


class InvertedDoublePendulumEnv(mujoco_env.MujocoEnv, utils.EzPickle):
"""
### Description
This environment originates from control theory and builds on the cartpole
environment based on the work done by Barto, Sutton, and Anderson in
["Neuronlike adaptive elements that can solve difficult learning control problems"](https://ieeexplore.ieee.org/document/6313077),
powered by the Mujoco physics simulator - allowing for more complex experiments
(such as varying the effects of gravity or constraints). This environment involves a cart that can be
moved linearly, with a pole fixed on it and a second pole fixed on the other end of the first one
(leaving the second pole as the only one with one free end). The cart can be pushed left or right,
and the goal is to balance the second pole on top of the first pole, which is in turn on top of the
cart, by applying continuous forces on the cart.
### Action Space
The agent takes a 1-element vector for actions.
The action space is a continuous `(action)` in `[-1, 1]`, where `action` represents the
numerical force applied to the cart (with magnitude representing the amount of force and
sign representing the direction).

| Num | Action                    | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit      |
|-----|---------------------------|-------------|-------------|----------------------------------|-------|-----------|
| 0   | Force applied on the cart | -1          | 1           | slider                           | slide | Force (N) |
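
For example, a minimal sketch of applying such an action (assuming an `env` created as shown in the Arguments section below):

```
action = np.array([0.5])  # half of the maximum force along the slide joint
observation, reward, done, info = env.step(action)
```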
### Observation Space
The state space consists of positional values of different body parts of the pendulum system,
followed by the velocities of those individual parts (their derivatives) with all the
positions ordered before all the velocities.
The observation is a `ndarray` with shape `(11,)` where the elements correspond to the following:

| Num | Observation                                                        | Min  | Max | Name (in corresponding XML file) | Joint | Unit                     |
|-----|--------------------------------------------------------------------|------|-----|----------------------------------|-------|--------------------------|
| 0   | position of the cart along the linear surface                      | -Inf | Inf | slider                           | slide | position (m)             |
| 1   | sine of the angle between the cart and the first pole              | -Inf | Inf | sin(hinge)                       | hinge | unitless                 |
| 2   | sine of the angle between the two poles                            | -Inf | Inf | sin(hinge2)                      | hinge | unitless                 |
| 3   | cosine of the angle between the cart and the first pole            | -Inf | Inf | cos(hinge)                       | hinge | unitless                 |
| 4   | cosine of the angle between the two poles                          | -Inf | Inf | cos(hinge2)                      | hinge | unitless                 |
| 5   | velocity of the cart                                               | -Inf | Inf | slider                           | slide | velocity (m/s)           |
| 6   | angular velocity of the angle between the cart and the first pole  | -Inf | Inf | hinge                            | hinge | angular velocity (rad/s) |
| 7   | angular velocity of the angle between the two poles                | -Inf | Inf | hinge2                           | hinge | angular velocity (rad/s) |
| 8   | constraint force - 1                                               | -Inf | Inf |                                  |       | Force (N)                |
| 9   | constraint force - 2                                               | -Inf | Inf |                                  |       | Force (N)                |
| 10  | constraint force - 3                                               | -Inf | Inf |                                  |       | Force (N)                |
There is physical contact between the robots and their environment - and Mujoco
attempts to get realistic physics simulations for the possible physical contact
dynamics by aiming for physical accuracy and computational efficiency.
There is one constraint force for contacts for each degree of freedom (3).
The approach and handling of constraints by Mujoco is unique to the simulator
and is based on their research. One can find more information in their
[*documentation*](https://mujoco.readthedocs.io/en/latest/computation.html)
or in their paper
["Analytically-invertible dynamics with contacts and constraints: Theory and implementation in MuJoCo"](https://homes.cs.washington.edu/~todorov/papers/TodorovICRA14.pdf).
### Rewards
The reward consists of two parts:
- *alive_bonus*: The goal is to make the second inverted pendulum stand upright
(within a certain angle limit) as long as possible - as such a reward of +10 is awarded
for each timestep that the second pole is upright.
- *distance_penalty*: This reward is a measure of how far the *tip* of the second pendulum
(the only free end) moves, and it is calculated as
*0.01 * x<sup>2</sup> + (y - 2)<sup>2</sup>*, where *x* is the x-coordinate of the tip
and *y* is the y-coordinate of the tip of the second pole.
- *velocity_penalty*: A negative reward to penalise the agent for moving too
fast, calculated as *0.001 * v<sub>1</sub><sup>2</sup> + 0.005 * v<sub>2</sub><sup>2</sup>*.

The total reward returned is ***reward*** *=* *alive_bonus - distance_penalty - velocity_penalty*, as also shown in the sketch below.
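
A minimal sketch of how this reward is put together (mirroring the `step` method below; `x` and `y` are the coordinates of the tip of the second pole, `v1` and `v2` the hinge velocities):

```
x, _, y = sim.data.site_xpos[0]                # tip of the second pole
dist_penalty = 0.01 * x ** 2 + (y - 2) ** 2
v1, v2 = sim.data.qvel[1:3]                    # angular velocities of the two hinges
vel_penalty = 1e-3 * v1 ** 2 + 5e-3 * v2 ** 2
alive_bonus = 10
reward = alive_bonus - dist_penalty - vel_penalty
```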
### Starting State
All observations start in state
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) with a uniform noise in the range
of [-0.1, 0.1] added to the positional values (cart position and pole angles) and standard
normal noise with a standard deviation of 0.1 added to the velocity values for stochasticity.
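
A minimal sketch of how the initial state is sampled (mirroring the `reset_model` method below, with `init_qpos`, `init_qvel`, `np_random`, and `model` standing for the environment's attributes):

```
qpos = init_qpos + np_random.uniform(low=-0.1, high=0.1, size=model.nq)  # positional noise
qvel = init_qvel + np_random.standard_normal(model.nv) * 0.1             # velocity noise
```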
### Episode Termination
The episode terminates when any of the following happens:

1. The episode duration reaches 1000 timesteps.
2. Any of the state space values is no longer finite.
3. The y-coordinate of the tip of the second pole *is less than or equal* to 1 (the maximum standing height of the system is 1.196 m when all the parts are perpendicularly vertical on top of each other); see the sketch below.
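
The height check in item 3 corresponds to the following test (mirroring the `step` method below):

```
done = bool(y <= 1)  # y: height of the tip of the second pole
```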
### Arguments
No additional arguments are currently supported (in v2 and lower), but modifications can
be made to the XML file in the assets folder (or by changing the path to a modified XML
file in another folder).
```
env = gym.make('InvertedDoublePendulum-v2')
```
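
A minimal interaction loop with random actions might look like the following sketch (assuming the classic 4-tuple `step` API used by this version of the environment):

```
import gym

env = gym.make('InvertedDoublePendulum-v2')
observation = env.reset()
for _ in range(100):
    action = env.action_space.sample()                   # random force in [-1, 1]
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()
```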
There is no v3 for InvertedDoublePendulum, unlike the robot environments where a v3 and
beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
### Version History
* v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks (including inverted pendulum)
* v0: Initial versions release (1.0.0)
    """

    def __init__(self):
        mujoco_env.MujocoEnv.__init__(self, "inverted_double_pendulum.xml", 5)
        utils.EzPickle.__init__(self)

    def step(self, action):
        self.do_simulation(action, self.frame_skip)
        ob = self._get_obs()
        # x and y coordinates of the tip of the second pole (site 0 in the XML)
        x, _, y = self.sim.data.site_xpos[0]
        dist_penalty = 0.01 * x ** 2 + (y - 2) ** 2
        # angular velocities of the two hinge joints
        v1, v2 = self.sim.data.qvel[1:3]
        vel_penalty = 1e-3 * v1 ** 2 + 5e-3 * v2 ** 2
        alive_bonus = 10
        r = alive_bonus - dist_penalty - vel_penalty
        # terminate once the tip of the second pole drops to or below y = 1
        done = bool(y <= 1)
        return ob, r, done, {}

    def _get_obs(self):
        return np.concatenate(
            [
                self.sim.data.qpos[:1],  # cart x position
                np.sin(self.sim.data.qpos[1:]),  # sine of the link angles
                np.cos(self.sim.data.qpos[1:]),  # cosine of the link angles
                np.clip(self.sim.data.qvel, -10, 10),  # joint velocities, clipped
                np.clip(self.sim.data.qfrc_constraint, -10, 10),  # constraint forces, clipped
            ]
        ).ravel()

    def reset_model(self):
        self.set_state(
            self.init_qpos
            + self.np_random.uniform(low=-0.1, high=0.1, size=self.model.nq),
            self.init_qvel + self.np_random.standard_normal(self.model.nv) * 0.1,
        )
        return self._get_obs()

    def viewer_setup(self):
        v = self.viewer
        v.cam.trackbodyid = 0
        v.cam.distance = self.model.stat.extent * 0.5
        v.cam.lookat[2] = 0.12250000000000005  # v.model.stat.center[2]