import numpy as np
from gym import utils
from gym.envs.mujoco import mujoco_env


class InvertedDoublePendulumEnv(mujoco_env.MujocoEnv, utils.EzPickle):
"""
### Description
This environment originates from control theory and builds on the cartpole
environment based on the work done by Barto, Sutton, and Anderson in
["Neuronlike adaptive elements that can solve difficult learning control problems"](https://ieeexplore.ieee.org/document/6313077),
powered by the Mujoco physics simulator - allowing for more complex experiments
(such as varying the effects of gravity or constraints). This environment involves a cart that can be
moved linearly, with a pole fixed on it and a second pole fixed on the other end of the first one
(leaving the second pole as the only one with one free end). The cart can be pushed left or right,
and the goal is to balance the second pole on top of the first pole, which is in turn on top of the
cart, by applying continuous forces on the cart.
### Action Space
The agent takes a 1-element vector for actions.
The action space is a continuous `(action)` in `[-1, 1]`, where `action` represents the
numerical force applied to the cart (with magnitude representing the amount of force and
sign representing the direction).

| Num | Action                    | Control Min | Control Max | Name (in corresponding XML file) | Joint | Unit      |
|-----|---------------------------|-------------|-------------|----------------------------------|-------|-----------|
| 0   | Force applied on the cart | -1          | 1           | slider                           | slide | Force (N) |
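
For example, a minimal sketch of applying such an action (assuming an `env` created as shown in the Arguments section below):

```
action = np.array([0.5])  # half of the maximum force along the slide joint
observation, reward, done, info = env.step(action)
```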
### Observation Space
The state space consists of positional values of different body parts of the pendulum system,
followed by the velocities of those individual parts (their derivatives) with all the
positions ordered before all the velocities.
The observation is a `ndarray` with shape `(11,)` where the elements correspond to the following:

| Num | Observation                                                        | Min  | Max | Name (in corresponding XML file) | Joint | Unit                     |
|-----|--------------------------------------------------------------------|------|-----|----------------------------------|-------|--------------------------|
| 0   | position of the cart along the linear surface                      | -Inf | Inf | slider                           | slide | position (m)             |
| 1   | sine of the angle between the cart and the first pole              | -Inf | Inf | sin(hinge)                       | hinge | unitless                 |
| 2   | sine of the angle between the two poles                            | -Inf | Inf | sin(hinge2)                      | hinge | unitless                 |
| 3   | cosine of the angle between the cart and the first pole            | -Inf | Inf | cos(hinge)                       | hinge | unitless                 |
| 4   | cosine of the angle between the two poles                          | -Inf | Inf | cos(hinge2)                      | hinge | unitless                 |
| 5   | velocity of the cart                                               | -Inf | Inf | slider                           | slide | velocity (m/s)           |
| 6   | angular velocity of the angle between the cart and the first pole  | -Inf | Inf | hinge                            | hinge | angular velocity (rad/s) |
| 7   | angular velocity of the angle between the two poles                | -Inf | Inf | hinge2                           | hinge | angular velocity (rad/s) |
| 8   | constraint force - 1                                               | -Inf | Inf |                                  |       | Force (N)                |
| 9   | constraint force - 2                                               | -Inf | Inf |                                  |       | Force (N)                |
| 10  | constraint force - 3                                               | -Inf | Inf |                                  |       | Force (N)                |
There is physical contact between the robots and their environment - and Mujoco
attempts to get realistic physics simulations for the possible physical contact
dynamics by aiming for physical accuracy and computational efficiency.
There is one constraint force for contacts for each degree of freedom (3).
The approach and handling of constraints by Mujoco is unique to the simulator
and is based on their research. One can find more information in their
[*documentation*](https://mujoco.readthedocs.io/en/latest/computation.html)
or in their paper
["Analytically-invertible dynamics with contacts and constraints: Theory and implementation in MuJoCo"](https://homes.cs.washington.edu/~todorov/papers/TodorovICRA14.pdf).
### Rewards
The reward consists of two parts:
- *alive_bonus*: The goal is to make the second inverted pendulum stand upright
(within a certain angle limit) as long as possible - as such a reward of +10 is awarded
for each timestep that the second pole is upright.
- *distance_penalty*: This reward is a measure of how far the *tip* of the second pendulum
(the only free end) moves, and it is calculated as
*0.01 * x<sup>2</sup> + (y - 2)<sup>2</sup>*, where *x* is the x-coordinate of the tip
and *y* is the y-coordinate of the tip of the second pole.
- *velocity_penalty*: A negative reward to penalise the agent for moving too
fast, calculated as *0.001 * v<sub>1</sub><sup>2</sup> + 0.005 * v<sub>2</sub><sup>2</sup>*.

The total reward returned is ***reward*** *=* *alive_bonus - distance_penalty - velocity_penalty*, as also shown in the sketch below.
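
A minimal sketch of how this reward is put together (mirroring the `step` method below; `x` and `y` are the coordinates of the tip of the second pole, `v1` and `v2` the hinge velocities):

```
x, _, y = sim.data.site_xpos[0]                # tip of the second pole
dist_penalty = 0.01 * x ** 2 + (y - 2) ** 2
v1, v2 = sim.data.qvel[1:3]                    # angular velocities of the two hinges
vel_penalty = 1e-3 * v1 ** 2 + 5e-3 * v2 ** 2
alive_bonus = 10
reward = alive_bonus - dist_penalty - vel_penalty
```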
### Starting State
All observations start in state
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) with a uniform noise in the range
of [-0.1, 0.1] added to the positional values (cart position and pole angles) and standard
normal noise with a standard deviation of 0.1 added to the velocity values for stochasticity.
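
A minimal sketch of how the initial state is sampled (mirroring the `reset_model` method below, with `init_qpos`, `init_qvel`, `np_random`, and `model` standing for the environment's attributes):

```
qpos = init_qpos + np_random.uniform(low=-0.1, high=0.1, size=model.nq)  # positional noise
qvel = init_qvel + np_random.standard_normal(model.nv) * 0.1             # velocity noise
```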
### Episode Termination
The episode terminates when any of the following happens:

1. The episode duration reaches 1000 timesteps.
2. Any of the state space values is no longer finite.
3. The y-coordinate of the tip of the second pole *is less than or equal* to 1 (the maximum standing height of the system is 1.196 m when all the parts are perpendicularly vertical on top of each other); see the sketch below.
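
The height check in item 3 corresponds to the following test (mirroring the `step` method below):

```
done = bool(y <= 1)  # y: height of the tip of the second pole
```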
### Arguments
No additional arguments are currently supported (in v2 and lower), but modifications can
be made to the XML file in the assets folder (or by changing the path to a modified XML
file in another folder).
```
env = gym.make('InvertedDoublePendulum-v2')
```
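
A minimal interaction loop with random actions might look like the following sketch (assuming the classic 4-tuple `step` API used by this version of the environment):

```
import gym

env = gym.make('InvertedDoublePendulum-v2')
observation = env.reset()
for _ in range(100):
    action = env.action_space.sample()                   # random force in [-1, 1]
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()
```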
There is no v3 for InvertedDoublePendulum, unlike the robot environments where a v3 and
beyond take gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc.
### Version History
* v3: support for gym.make kwargs such as xml_file, ctrl_cost_weight, reset_noise_scale etc. rgb rendering comes from tracking camera (so agent does not run away from screen)
* v2: All continuous control environments now use mujoco_py >= 1.50
* v1: max_time_steps raised to 1000 for robot based tasks (including inverted pendulum)
* v0: Initial versions release (1.0.0)
    """

    def __init__(self):
        mujoco_env.MujocoEnv.__init__(self, "inverted_double_pendulum.xml", 5)
        utils.EzPickle.__init__(self)

    def step(self, action):
        self.do_simulation(action, self.frame_skip)
        ob = self._get_obs()
        # x and y coordinates of the tip of the second pole (site 0 in the XML)
        x, _, y = self.sim.data.site_xpos[0]
        dist_penalty = 0.01 * x ** 2 + (y - 2) ** 2
        # angular velocities of the two hinge joints
        v1, v2 = self.sim.data.qvel[1:3]
        vel_penalty = 1e-3 * v1 ** 2 + 5e-3 * v2 ** 2
        alive_bonus = 10
        r = alive_bonus - dist_penalty - vel_penalty
        # terminate once the tip of the second pole drops to or below y = 1
        done = bool(y <= 1)
        return ob, r, done, {}

    def _get_obs(self):
        return np.concatenate(
            [
                self.sim.data.qpos[:1],  # cart x position
                np.sin(self.sim.data.qpos[1:]),  # sine of the link angles
                np.cos(self.sim.data.qpos[1:]),  # cosine of the link angles
                np.clip(self.sim.data.qvel, -10, 10),  # joint velocities, clipped
                np.clip(self.sim.data.qfrc_constraint, -10, 10),  # constraint forces, clipped
            ]
        ).ravel()

    def reset_model(self):
        self.set_state(
            self.init_qpos
            + self.np_random.uniform(low=-0.1, high=0.1, size=self.model.nq),
            self.init_qvel + self.np_random.standard_normal(self.model.nv) * 0.1,
        )
        return self._get_obs()

    def viewer_setup(self):
        v = self.viewer
        v.cam.trackbodyid = 0
        v.cam.distance = self.model.stat.extent * 0.5
        v.cam.lookat[2] = 0.12250000000000005  # v.model.stat.center[2]