# We start by building a policy that the agent will learn using REINFORCE.
# A policy is a mapping from the current environment observation to a probability distribution over the actions to be taken.
# The policy used in this tutorial is parameterized by a neural network. It consists of 2 linear layers that are shared between the predicted mean and standard deviation.
# Two further, separate linear layers are then used to estimate the mean and the standard deviation. ``nn.Tanh`` is used as the non-linearity between the hidden layers.
# The forward pass of the network estimates the mean and standard deviation of a normal distribution from which an action is sampled. Hence the policy is expected to learn
# appropriate weights to output a suitable mean and standard deviation for the current observation.
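#
# A minimal sketch of such a policy network is given below. The hidden-layer sizes
# (16 and 32) and the use of a softplus to keep the standard deviation strictly
# positive are illustrative assumptions, not values prescribed by the description above.


import torch
from torch import nn


class Policy_Network(nn.Module):
    """Sketch of a parameterized policy: shared trunk plus mean and stddev heads."""

    def __init__(self, obs_space_dims: int, action_space_dims: int):
        super().__init__()
        hidden1, hidden2 = 16, 32  # assumed hidden sizes (illustrative)

        # Two shared linear layers with Tanh non-linearities
        self.shared_net = nn.Sequential(
            nn.Linear(obs_space_dims, hidden1),
            nn.Tanh(),
            nn.Linear(hidden1, hidden2),
            nn.Tanh(),
        )

        # Separate linear heads for the predicted mean and standard deviation
        self.policy_mean_net = nn.Linear(hidden2, action_space_dims)
        self.policy_stddev_net = nn.Linear(hidden2, action_space_dims)

    def forward(self, x: torch.Tensor):
        """Returns the mean and standard deviation of a normal distribution."""
        shared_features = self.shared_net(x.float())
        action_means = self.policy_mean_net(shared_features)
        # Softplus keeps the predicted standard deviation strictly positive
        action_stddevs = torch.log(1 + torch.exp(self.policy_stddev_net(shared_features)))
        return action_means, action_stddevs


# %%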
# Now that we are done building the policy, let us develop **REINFORCE**, which gives life to the policy network.
# The algorithm of REINFORCE can be found above. As mentioned earlier, REINFORCE aims to maximize the Monte-Carlo returns.
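# Concretely, the return from time step t is
# ``G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...``, computed backwards over the episode,
# and REINFORCE updates the policy parameters in the direction of ``grad log pi(a_t | s_t) * G_t``
# (the update code below equivalently minimizes ``-log pi(a_t | s_t) * G_t``).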
#
# Fun Fact: REINFORCE is an acronym for "'RE'ward 'I'ncrement 'N'on-negative 'F'actor times 'O'ffset 'R'einforcement times 'C'haracteristic 'E'ligibility".
#
# Note: The choice of hyperparameters is to train a decently performing agent. No extensive hyperparameter tuning was done.
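#
# Before the update step below, the agent needs to sample actions from the predicted normal
# distribution and remember their log-probabilities (stored in ``self.probs``) along with the
# observed rewards (``self.rewards``). The helper below is a minimal sketch of that sampling
# step; the name ``sample_action`` and its exact signature are assumptions, not part of the
# description above.


from torch.distributions.normal import Normal


def sample_action(net: Policy_Network, state):
    """Sketch: sample an action and return it together with its log-probability."""
    state_t = torch.tensor(np.array([state]), dtype=torch.float32)
    action_means, action_stddevs = net(state_t)

    # Build a normal distribution from the predicted mean and standard deviation;
    # a small epsilon keeps it well-defined if the stddev collapses towards zero.
    distrib = Normal(action_means[0] + 1e-6, action_stddevs[0] + 1e-6)
    action = distrib.sample()
    log_prob = distrib.log_prob(action)

    return action.numpy(), log_prob
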
# Discounted return (backwards) - [::-1] will return an array in reverse
for R in self.rewards[::-1]:
    running_g = R + self.gamma * running_g
    gs.insert(0, running_g)

deltas = torch.tensor(gs)

loss = 0
# minimize -1 * prob * reward obtained
for log_prob, delta in zip(self.probs, deltas):
    loss += log_prob.mean() * delta * (-1)

# Update the policy network
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()

# Empty / zero out all episode-centric/related variables
self.probs = []
self.rewards = []
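# Note: ``update()`` is intended to be called once per episode, after the episode has
# terminated, so that ``self.rewards`` holds the full trajectory needed for the
# Monte-Carlo returns computed above.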
# %%
# Now let's train the policy using REINFORCE to master the Inverted Pendulum task.
#
# The following is an overview of the training procedure:
#
#    for seed in random seeds
#        reinitialize agent
#
#        for episode in range of max number of episodes
#            until episode is done
#                sample action based on current observation
#
#                take action and receive reward and next observation
#
#                store action taken, its probability, and the observed reward
#            update the policy
#
# Note: Deep RL is fairly brittle with respect to the random seed in many common use cases (https://spinningup.openai.com/en/latest/spinningup/spinningup.html).
# Hence it is important to test out various seeds, which we will be doing here.
# Create and wrap the environment
env = gym.make("InvertedPendulum-v4")
wrapped_env = gym.wrappers.RecordEpisodeStatistics(env, 50)  # Records episode-reward

total_num_episodes = int(5e3)  # Total number of episodes
# Observation-space of InvertedPendulum-v4 (4)
obs_space_dims = env.observation_space.shape[0]
# Action-space of InvertedPendulum-v4 (1)
action_space_dims = env.action_space.shape[0]

rewards_over_seeds = []
for seed in [1, 2, 3, 5, 8]:  # Fibonacci seeds
    # set seed
    torch.manual_seed(seed)
    random.seed(seed)
    np.random.seed(seed)

    # Reinitialize agent every seed
    agent = REINFORCE(obs_space_dims, action_space_dims)
    reward_over_episodes = []

    for episode in range(total_num_episodes):
        # gymnasium v26 requires users to set seed while resetting the environment