Harvard University
Spring 2019
Instructors: Pavlos Protopapas, Mark Glickman
Group Number: 51
Group Members: John Daciuk, Dean Hathout
import sys
import gym
import pylab
import random
import numpy as np
from collections import deque
from keras.layers import Dense
from keras.optimizers import Adam
from keras.models import Sequential
from tqdm import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
import matplotlib.animation as animation
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
#to make my plotting style
plt.rcParams["axes.facecolor"] = (1,1,.99)
plt.rcParams["font.size"] = 15
plt.rcParams["figure.figsize"] = (12,6)
plt.rcParams["ytick.labelsize"] = 13
plt.rcParams["xtick.labelsize"] = 13
plt.rcParams["lines.linewidth"] = 2
plt.rcParams["axes.titlesize"] = 20
In this project, we aim to explore in greater depth some of the reinforcement learning concepts introduced in class, through both a literature review and an implementation of concepts discussed. Referring to Sutton and Barto’s Reinforcement Learning and Sutton’s “Policy Gradient Methods for Reinforcement Learning with Function Approximation,” we will review the mathematical theory behind topics touched on in class, such as policy iteration, value iteration, and Q-learning, and will also introduce policy gradient methods. We will then implement and compare the discussed methods for use in continuous control problem settings with neural networks, making use of environments such as Mountain Car, CartPole, and Acrobot in OpenAI Gym.
For a handful of reinforcement learning problems, particularly those which can be formulated as a finite Markov Decision Process, the state and action spaces are small enough such that tabular methods -- that is, methods which represent approximate value functions using arrays or tables -- can find exact solutions (exact optimal value functions and policies).
First, we present expressions for the optimal value functions $v_*$ (a function of state and called the state-value function) and $q_*$ (a function of state and action and called the action-value function), defined by their satisfaction of the Bellman optimality equations:
$$v_*(s) = \max_a \mathbb{E}[R_{t + 1} + \gamma v_*(S_{t + 1}) | S_t = s, A_t = a] = \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_*(s')]$$$$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a] = \sum_{s', r} p(s', r|s, a)[r + \gamma \max{a'} q_*(s', a')]$$If these functions are known, then the optimal policy is self-evident -- in a given state, we take the action that produces the highest expected value. Thus, if we had a way of solving for or, more often, sufficiently approximating these value functions, we can easily solve for the optimal policy. In this section, we briefly review three fundamental classes of tabular solution methods to do this: dynamic programming, Monte Carlo methods, and temporal difference learning (focusing mostly on Q-learning).
In order to find an approximation to the optimal state-value function $v_*$, we need a way of computing the state-value function $v_\pi$ for any policy $\pi$, where we define policy $\pi$ by saying that $\pi(a | s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$. Thus,
$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] = \sum_a \pi(a | s) \sum_{s', r} p(s',r | s,a)[r + \gamma v_\pi(s')].$$If the dynamics of the environment are completely known, then the above actually defines a system of linear equations with number of equations and number of unknowns both equal to the size of the state space. However, solving this system is often a tedious computation, and more saliently, the dynamics of the environment are rarely known, so we instead consider iterative solution methods where we generate a sequence of approximate value functions $v_0, v_1, ...$ until a suitable approximation is obtained. The update rule is as follows:
$$v_{k+1}(s) = \mathbb{E}[R_{t + 1} + \gamma v_k(S_{t+1}) | S_t = s] = \sum_a \pi(a | s) \sum_{s', r}p(s', r | s, a)[r + \gamma v_k(s')]$$In words, this update does the following for each state $s$: replace the old value of $s$ with a new value computed via the expected immediate rewards and the (discounted) expected value of all the one-step transitions possible under the policy $\pi$. We follow this process until the maximum state-wise change between updates, $\max_s |v_{k+1}(s) - v_k(s)|$ is smaller than some decided threshold. Although the formal proof of this is beyond the scope of this review, this update scheme is guaranteed to converge to the optimal policy as $k$ goes to $\infty$.
In pseudocode, an algorithm to perform these successive updates is shown below.
Input $\pi$, the policy to be evaluated
Algorithm parameter: a small threshold $\theta > 0$ determining accuracy of estimation
Initialize $V(s)$ for each s arbitrarily with $V(terminal) = 0$
$\Delta$ = Inf
while $\Delta > \theta$:
$\Delta \leftarrow 0$
for s in S:
v $\leftarrow$ V(s)
V(s) = $\sum_a \pi(a|s) \sum_{s', r} p(s',r | s, a)[r + \gamma V(s')]$
$\Delta \leftarrow \max(\Delta, |v - V(s)|)$
rof
Now that we have an iterative method for policy evaluation, how can we use this to implement policy improvement? That is, given that $v_\pi(s)$ tells us how "good" it is to follow policy $\pi$ from $s$, should we change to some new policy $\pi'$? To answer this, we consider the action-value function $q_\pi(s, a)$ for policy $\pi$, which gives the expected value of taking action $a$ in state $s$ and following policy $\pi$ after that:
$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a] = \sum_{s', r} p(s', r|s, a)[r + \gamma v_\pi(s')]$$The policy improvement theorem, of which a simple proof can be found in Chapter 4 of Sutton, intuitively tells us that, given two deterministic policies, one original policy $\pi$ and a modified $\pi'$ which is identical to $\pi$ except that $\pi'(s) = a \neq \pi(s)$, if $q_\pi(s, a) > v_\pi(s)$, then the policy $\pi'$ is indeed an improvement over $\pi$.
Knowing how $q_\pi(s, a)$ can help us evaluate a change in the current policy $\pi$ at state $s$ for action $a$, we can use it define a policy improvement to $\pi'$. Specifically, for each state $s$, we consider changes to the policy for all possible actions $a$ in $s$, and devise a greedy policy which selects the action $a$ that maximizes $q_\pi(s, a)$:
$$\pi'(s) = arg \max_a q_\pi(s, a) = arg \max_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a] = arg \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_\pi(s')]$$This way of improving policy $\pi$ to $\pi'$, using policy evaluation for $v_\pi$, is known as policy iteration. Thus, one algorithm for approximating an optimal policy $\pi_*$ is to follow the above process to produce monotonically improving policies. We present pseudocode for such an algorithm below:
Initialize V(s) and $\pi(s)$ for each $s$ arbitrarily
Run policy evaluation defined above: using V and $\pi$
Policy Improvement:
policy_stable $\leftarrow$ True
For s in S:
old_action $\leftarrow \pi(s)$
$\pi(s) \leftarrow arg \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma V(s')]$
If old_action $\neq \pi(s)$, then policy_stable $\leftarrow$ False
If policystable, then stop and return $V \approx v$ and $\pi \approx \pi_$; else go back to 2.
One drawback to the above policy iteration algorithm is that for each improvement made to the policy, we must run an entire policy evaluation step which only becomes exact in the limit to $\infty$. Thus, a simpler update approach, value iteration, is often used instead. Value iteration is similar to the policy evaluation update scheme, but instead requires that we take the maximum over all actions of the expected value. That is:
$$v_{k+1}(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) | S_t = s, A_t = a] =\max_a \sum_{s', r}p(s', r | s, a)[r + \gamma v_k(s')]$$The key idea and intuition behind value iteration is that we use the Bellman optimality equation as an update rule for these successive function approximations. The usefulness of this method is that it is able to combine, in each iteration, a policy evaluation step and a policy improvement step. This is perhaps best illustrated by looking at the pseudocode for value iteration:
Initialize $V(s)$ for each s arbitrarily with $V(terminal) = 0$
$\Delta$ = Inf
while $\Delta > \theta$:
$\Delta \leftarrow 0$
for s in S:
v $\leftarrow$ V(s)
V(s) = $\max_a \sum_{s', r} p(s',r | s, a)[r + \gamma V(s')]$
$\Delta \leftarrow \max(\Delta, |v - V(s)|)$
rof
Output a deterministic policy $\pi \approx \pi_*$, such that: $$\pi(s) = arg \max_a \sum_{s', r} p(s',r | s, a)[r + \gamma V(s')]$$
Thus, value iteration is essentially multiple passes through a special case of policy iteration, where the policy evaluation step is terminated after one sweep through all the states. It turns out -- once again this is beyond the scope of this review -- that this approach is also guaranteed to converge to $v_*$.
While the DP methods discussed above are conceptually simple and quite easy to implement, one disadvantage to them is that they typically rely on the assumption that the dynamics of the environment are fully known. However, in most RL problems, we do not have this luxury. In fact, in a number of problems, we may know extremely little about the environment. In these cases, we rely on Monte Carlo methods (methods which sample sequences of states, actions and rewards to estimate expected values of state-action pairs) to learn about the environment experientially, and therefore to learn optimal value functions and policies.
These methods are relatively straight-forward and easy to follow. For the problem of estimating the state-value function $v_\pi(s)$, we generate a large number of episodes following policy $\pi$ (sequences of states, actions, and rewards), and through each episode we keep track of the returns following visits to $s$ for all states $s$ (first-visit MC tracks returns following only the first visit to $s$, while every-visit MC tracks returns following all visits to $s$), then assign the average of these returns for state $s$ to $v_\pi(s)$. Below, we present the pseudocode for this method:
Input: a policy $\pi$ to be evaluated, $n$: number of episodes to generate used for estimation
Loop for $n$ times:
Generate an episode following $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T - 1}, A_{T-1}, R_T$
$G \leftarrow 0$
for each step of the episode, $t = T - 1, T - 2, ..., 0$:
If $S_t$ does not appear in $S_0, S_1, ..., S_{t-1}$:
Append $G$ to $Returns(s)$
$V(S_t) \leftarrow \textrm{average}(Returns(S_t))$
rof
Note that in order to implement every-visit MC, we would simply drop the line where we check if $S_t$ appears earlier in the episode. Both methods converge to $v_\pi(s)$ as the number of visits to $s$ becomes infinite, by the law of large numbers. LLN also tells us that the variance of the mean estimates for each state that we obtain through this method is inversely proportional to the number of returns averaged.
It is then also quite straightforward to devise a similar Monte-Carlo policy iteration algorithm, similar to that shown in the dynamic programmic section above. We shown pseudocode for such a method below:
Loop for $n$ times:
Choose $S_0, A_0$ randomly such that all pairs have nonzero probability
Generate an episode following $\pi$ from $S_0, A_0$: $S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T - 1}, A_{T-1}, R_T$
$G \leftarrow 0$
for each step of the episode, $t = T - 1, T - 2, ..., 0$:
If $S_t$ does not appear in $S_0, S_1, ..., S_{t-1}$:
Append $G$ to $Returns(s)$
$Q(S_t, A_t) \leftarrow \textrm{average}(Returns(S_t, A_t))$
$\pi(S_t) \leftarrow arg \max_a(S_t, a)$
rof
The reason we start off with choosing $S_0, A_0$ randomly for each episode is that, in order to ensure theoretical convergence for this algorithm, we must ensure that all state-action pairs get visited with positive probability. Starting each episode with a random state-action pair satisfies this. This is known as exploring-starts.
Thus, with Monte-Carlo algorithms, even if we know nothing about the environment, we can generate reliable estimates for value functions and optimal policies. While there is of course much more than can be said about Monte-Carlo algorithms for reinforcement learning, these other topics are outside the scope of our project. We explore below our particular application of interest for MC ideas, in Temporal Difference Learning.
Sutton and Barto describe temporal-difference learning as likely the most "central and novel" idea in reinforcement learning. Its effectiveness relies on its synergistic combination of ideas from dynamic programming and Monte-Carlo. Most fundamentally, temporal-difference methods sample from the environment as in Monte-Carlo, and iteratively perform updates based on current MC-based estimates, like in DP. We recall here the general form of this update for TD:
$$V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$where the expression in brackets is known as the TD-error, and is the difference between the current estimate for the value of $S_t$ and an updated estimate based on the immediate rewards at the next time step and the estimated value of the next state. This scheme can be modifed in many different ways for different purposes in RL. In this project, we specifically review Q-learning, an off-policy temporal-difference method for approximating the optimal action-value function $q_*$. The update scheme behind Q-learning is given below, where $Q$ is the MC-learned action-value function:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$We note that the above rule makes no mention of a specific policy, revealing some of the magic behind Q-learning: through successive updates following the above rule, the function $Q$ directly approximates $q_*$ independent of the policy being followed, so long as the policy ensures that all state-action pairs continue to be visited, and therefore updated. Given this minimal requirement, correct convergence is guaranteed (proof beyond our current scope). We give the Q-learning procedure in pseudocode below:
Input parameters: step size $\alpha$ between 0 and 1, right-inclusive, and small $\epsilon > 0$.
Initialize $Q(s, a)$ for all $s, a$ arbitrarily except that $Q(terminal, \cdot) = 0$
For each episode:
Initialize $S$
For each step of episode until S is terminal:
Choose $A$ from $S$ using policy derived from $Q$
Take action $A$ and observe $R, S'$
$Q(S, A) \leftarrow Q(S, A) + \alpha[R + \gamma \max_a Q(S', a) - Q(S, A)]$
$S \leftarrow S'$
We're now ready to explore some implementations! We implement a function below which performs the Q-learning algorithm described above by representing the Q-function as a table (python dictionary in this implementation), apply this function to a number of environments in Gym, and discuss our results. Notes: the parameters we choose for each environment are already tuned, and the function contains a number of modifications for the specifics of each environment:
def QLearn(environment = 'MountainCar-v0',
alpha = 0.5,
gamma = 1,
epsilon = 1,
min_eps = 0,
eps_mult = 3,
episodes = 15000,
bin_breaks = np.array([0.1, 0.01])):
if environment.startswith('Taxi'): # Avoid discretizing for this already discrete environment
bin_breaks = np.array([1])
env = gym.make(environment)
env.reset()
Q = {} # Initialize dictionary
rewards = [] # Initialize rewards
ave_rewards = []
reduction = eps_mult * (epsilon - min_eps)/episodes # Epsilon-annealing scheme
def get_Qval(state, action=None): # get current value for state-action pair or all possible action-values from a state
# If the state does not yet exist, initialize randomly.
try:
res = Q[state]
except:
Q[state] = np.random.uniform(low = -1, high = 1, size = (env.action_space.n))
res = Q[state]
if action is not None:
return(res[action])
return(res)
for episode in range(episodes):
done = False
total_reward = 0
reward = 0
state = env.reset()
if environment.startswith('CartPole'): # specific handling for infinite values in some CartPole state dimensions
# Discretize the continous space
discrete_state = tuple(np.round((np.array([(state - env.observation_space.low)[i] for i in [0, 2]] + [state[1] + 2, state[3] + 3.5])) / bin_breaks, 0).astype(int))
elif environment.startswith('Taxi'): # No need to discretize an already discrete problem.
discrete_state = state
else:
# Discretize
discrete_state = tuple(np.round((state - env.observation_space.low) / bin_breaks, 0).astype(int))
while not done:
# Epsilon-greedy
if np.random.random() < 1 - epsilon:
action = np.argmax(get_Qval(discrete_state))
else:
action = np.random.randint(0, env.action_space.n)
# Get next time step information
next_state, reward, done, info = env.step(action)
# Discretize next step if necessary
if environment.startswith('CartPole'):
next_discrete_state = tuple(np.round((np.array([(next_state - env.observation_space.low)[i] for i in [0, 2]] + [state[1] + 2, state[3] + 3.5])) / bin_breaks, 0).astype(int))
elif environment.startswith('Taxi'):
next_discrete_state = next_state
else:
next_discrete_state = tuple(np.round((next_state - env.observation_space.low) / bin_breaks, 0).astype(int))
# Check if episode is finished
if environment.startswith('MountainCar'):
if done and next_state[0] >= 0.5:
Q[discrete_state][action] = reward
# If not finished, perform TD update
else:
next_state_Q_val = get_Qval(next_discrete_state)
TD = reward + gamma * np.max(get_Qval(next_discrete_state)) - get_Qval(discrete_state, action)
Q[discrete_state][action] += alpha * TD
else:
if done:
Q[discrete_state][action] = reward
else:
next_state_Q_val = get_Qval(next_discrete_state)
TD = reward + gamma * np.max(get_Qval(next_discrete_state)) - get_Qval(discrete_state, action)
Q[discrete_state][action] += alpha * TD
total_reward += reward
discrete_state = next_discrete_state
if epsilon > min_eps:
epsilon -= reduction
rewards.append(total_reward)
if episode % 100 == 0:
ave_reward = np.mean(rewards)
ave_rewards.append(ave_reward)
rewards = []
if episode % 1000 == 0:
print('Episode {} Average Reward: {}'.format(episode, ave_reward))
print("MAX AVERAGE REWARD: {}".format(np.max(ave_rewards)))
plt.plot(100 * (np.arange(len(ave_rewards)) + 1), ave_rewards)
plt.xlabel('Episodes')
plt.ylabel('Average Reward')
plt.show()
We start off with a rather simple environment, quite similar to the Frozen-Lake environment we explored in class. More information on this environment and its state, action, and reward structure can be found at the link above.
QLearn(environment='Taxi-v2', eps_mult=5, episodes=20000)
As we can see, Q-learning works quite well on this Taxi problem. We achieve a maximum average reward (taken over 100 episodes) of above 9, which is good for this problem. The fact that the Taxi environment is low-dimensional and discrete makes it an excellent fit for this tabular Q-learning approach -- indeed, we see that the algorithm converges quite rapidly. Below, we explore environments that are either higher-dimensional, continuous, or both, and see how Q-learning performs on these problems in comparison.
The MountainCar environment is also quite simple and low-dimensional, but is not discrete. Thus, in order to implement a tabular Q-learning approach for such a problem, our Q-learner discretizes the state space into bins (this is where the bin_breaks argument is used).
QLearn(environment='MountainCar-v0')
We see that this approach does decently well on MountainCar. It is certainly able to find its way to the top of the mountain, which is good, but not in as few steps as we might like -- the standard for this problem is typically around -120, and we hover at around -138. We hypothesize that this is due, in large part, to the fact that we're discretizing a continuous space. Below, we will explore methods that use functional approximations to the $Q$ function rather than representing it as a discrete state-action pair table, where we see better results.
The CartPole environment is not as simple as the previous two, with higher state dimensionality and it is not discrete. Additionally, a couple of these dimensions have infinite range, so we have to impose our own limits in discretization. We explore tabular-Q performance here:
QLearn('CartPole-v0', alpha=0.8, bin_breaks = np.array([0.0001] * 4), episodes=15000)
Here we see that the tabular Q-learning approach does poorly on this problem. Once again, we believe this is a consequence of greater problem complexity (higher state dimensionality) and the problem's continuity. Once again, we will see better results with our functional approximation methods below.
Acrobot is perhaps the most complex of the environments we test here -- it has very high state dimensionality and needs to be discreteized. We once again see that this poses problems for the tabular Q-learner.
QLearn('Acrobot-v1', alpha=0.1, episodes=5000, bin_breaks=np.array([0.0001] * 6))
Again as expected for this combination of Q-learning strategy and high-dimensional problem, we see poor learning results and very slow training time. It is now time to dive into functional approximations and deep Q learning to see if we can improve.
We saw above cases in which it is difficult for tabular Q-learning to scale well with increasing state and action dimensionality, as well as with continuity. For such problems, it is often much more useful to replace the Q-table with a learnable function representation of the action-value function, while continuing to update based on temporal difference as above. This function would take in some basis functions of the environment's state and try to estimate the value of each action from that state. Many learnable functional forms can be used; we could represent $Q$ as just a linear function of some state-based features: $Q = \sum_i \theta_i x_i(s)$, or we could use a neural network to approximate the $Q$ function. The latter approach, combining the ideas of TD-learning with deep neural networks, is often referred to as Deep Q-Learning, and we explore its results on the three continuous environments here.
We define the DQN agent below and a function to train the agent in various environments below. This code was inspired by the HW6 209 code.
class DQNAgent:
def __init__(self, state_size, action_size, hidden):
# if you want to see Cartpole learning, then change to True
self.render = False
# get size of state and action
self.state_size = state_size
self.action_size = action_size
self.hidden = hidden
# These are hyper parameters for the DQN
self.discount_factor = 0.99
self.learning_rate = 0.001
self.epsilon = 1.0
self.epsilon_decay = 0.999
self.epsilon_min = 0.01
self.batch_size = 64
self.train_start = 1000
# create replay memory using deque
self.memory = deque(maxlen=2000)
# create main model and target model
self.model = self.build_model()
self.target_model = self.build_model()
# initialize target model
self.update_target_model()
def build_model(self):
model = Sequential()
model.add(Dense(self.hidden, input_dim=self.state_size, activation='relu',
kernel_initializer='he_uniform'))
model.add(Dense(self.hidden, activation='relu',
kernel_initializer='he_uniform'))
model.add(Dense(self.action_size, activation='linear',
kernel_initializer='he_uniform'))
model.summary()
model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
return model
# after some time interval update the target model to be same with model
def update_target_model(self):
self.target_model.set_weights(self.model.get_weights())
# get action from model using epsilon-greedy policy
def get_action(self, state):
if np.random.rand() <= self.epsilon:
return random.randrange(self.action_size)
else:
q_value = self.model.predict(state)
return np.argmax(q_value[0])
# save sample <s,a,r,s'> to the replay memory
def append_sample(self, state, action, reward, next_state, done):
self.memory.append((state, action, reward, next_state, done))
if self.epsilon > self.epsilon_min:
self.epsilon *= self.epsilon_decay
# pick samples randomly from replay memory (with batch_size)
def train_model(self):
if len(self.memory) < self.train_start:
return
batch_size = min(self.batch_size, len(self.memory))
mini_batch = random.sample(self.memory, batch_size)
update_input = np.zeros((batch_size, self.state_size))
update_target = np.zeros((batch_size, self.state_size))
action, reward, done = [], [], []
for i in range(self.batch_size):
update_input[i] = mini_batch[i][0]
action.append(mini_batch[i][1])
reward.append(mini_batch[i][2])
update_target[i] = mini_batch[i][3]
done.append(mini_batch[i][4])
target = self.model.predict(update_input)
target_val = self.target_model.predict(update_target)
for i in range(self.batch_size):
# Q Learning: get maximum Q value at s' from target model
if done[i]:
target[i][action[i]] = reward[i]
else:
target[i][action[i]] = reward[i] + self.discount_factor * (
np.amax(target_val[i]))
self.model.fit(update_input, target, batch_size=self.batch_size,
epochs=1, verbose=0)
def DQNLearn(environment, n_episodes, hidden):
env = gym.make(environment)
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size, hidden)
scores, episodes = [], []
learned = False
for e in range(n_episodes):
done = False
score = 0
state = env.reset()
state = np.reshape(state, [1, state_size])
while not done:
if agent.render:
env.render()
# get action for the current state and go one step in environment
action = agent.get_action(state)
next_state, reward, done, info = env.step(action)
next_state = np.reshape(next_state, [1, state_size])
reward = reward
# save the sample <s, a, r, s'> to the replay memory
agent.append_sample(state, action, reward, next_state, done)
# every time step do the training
agent.train_model()
score += reward
state = next_state
if done:
# every episode update the target model to be same with model
agent.update_target_model()
score = score #if score == 500 else score + 100
scores.append(score)
episodes.append(e)
if e%50 == 0:
print('Episode ',e, ' over.')
if np.mean(scores[-min(10, len(scores)):]) > 490:
break
learned = True
if learned:
break
pylab.plot(episodes, scores, 'b')
pylab.xlabel('Episode Number')
pylab.ylabel('Collected Rewards')
plt.title('Collected Rewards vs Episode Number for a DQN agent in ' + environment + ' task')
pylab.show()
DQNLearn('MountainCar-v0', 500, 128)
As we can see, this deep network performs quite well, consistently reaching the goal in under 120 steps, considered a high standard for this problem, and often beats the 100-step benchmark, which is quite impressive for this problem. Thus, the deep network does outperform the discretized, tabular Q-learning approach, even though the latter performed adequately. We suspect that this improvement is due to the network's ability to take the continuity of the space into account, rather than having to discretize.
DQNLearn('CartPole-v1', 300, 24)
This deep Q network performs much better than the discretized, Q-table for the CartPole problem. We achieve average results in the mid-hundreds, more than surpassing the threshold set by Gym for "solving" this problem. We believe that this network's success compared to the Q-table approach is due to its ability to capture a complex and continuous functional approximation for this high-dimensional problem, instead of being forced to discretize a high-dimensional state space.
DQNLearn('Acrobot-v1', 500, 128)
Once again, we see much better performance by this network over the Q-table for the same problem. We consistently achieve the goal in under 100 steps, and often around 75 steps, which is quite impressive for this problem (can be seen on the acrobot-v1 leaderboard in the OpenAI Gym wiki on github. Once again, we believe that this improvement in performance is due to the fact that we are no longer discretizing a high-dimensional and continuous space, and can therefore learn a complex functional representation of $Q$.
We are now ready to depart from the methods we have discussed above, and explore an entirely new method based on policy gradients. We describe the theory behind these methods, and show a number of implementations. Finally, we conclude this project by comparing what we see from the methods discussed above and policy gradient methods.
With $Q$ learning, we've attempted to create a model that estimates the values of action and state pairs. $Q$ function in hand, the policy is then trivial: take the action with the highest value, most of the time. By contrast policy gradient methods offer a more direct approach: if what we really want to know is which actions to take given states, why not just find an appoximater for a policy function and forget about $Q$? After all, if we know which actions to take, we wouldn't care so much about their associated values.
More formally, we will refer to this policy function as $\pi(a|s,\theta)$, which represents the probability of taking action $a$ given state $s$ and is parameterized by a vector $\theta$. Now, if only we knew how to make this policy function incrementally better, we could iterate the process. We introduce $J(\theta)$ as the performance measure of a policy to define 'better'. There are different ideas for how to define $J$, but for now we think of it as the total sum of discounted rewards we get by following $\pi(a|s,\theta)$ at some starting state.
Let's consider how we might mathematically construct $\pi(a|s,\theta)$. First, since we need $J(\theta)$ to be differentiable, and the derivative of $J$ will depend on the derivative of $\pi$, we want $\pi$ to be differentiable. Second, we need $\sum_{a} \pi(a|s) = 1$, because we know that the probability of taking some action is exactly 1. To force this second condition while not breaking the differentiablity condition, we can construct $\pi(a|s)$ in the following way:
Choose some $h(a|s, \theta) \in R$, which represents some preference or excitement we have for picking action $a$, then use the softmax function to force condition two. So we have $\pi(a|s,\theta) = \frac{e^{h(a|s, \theta)}}{\sum_{b} e^{h(b|s, \theta)}}$, where $b$ is used to represent all possible actions. Our choice for $h(a|s, \theta)$ would naturally depend on the complexity of the problem. For simple problems, we might design $h$ to depend linearly on $\theta$; for Alpha Zero, Google used a neural network. We'll return to $h$ after we get a basic strategy for performing gradient ascent on $J$.
Regardless of the specifics of how we design our policy function, our entire game plan will boil down to finding a $\theta$ to crank up the value of $J(\theta)$. Naturally, we're interested in the gradient of $J(\theta)$. If we could only write this gradient in terms of things we can calculate, we could perform a gradient ascent algorithm on $J(\theta)$. Initially the situation looks grim: the performance of our policy depends not only on $\theta$, but also on the states our policy will throw us into. However, consider the logic behind the following key proportion:
$$ \triangledown J(\theta) \propto \sum_{s} \big( \mu(s) \sum_{a} q_{\pi}(s,a) \ \triangledown_{\theta} \pi(a | s, \theta) \big)$$Now, let's understand this proportion, keeping in mind that we don't really care about the constant of proportionality for gradient ascent. The $\mu(s)$ is the percent of time steps we spend in state s. For a moment let's pretend we live in an environment with only one state. Then we can ignore the sum over states and look at the inner sum: $J(\theta) \propto \sum_{a} q_{\pi}(s,a) \ \triangledown_{\theta} \pi(a | s, \theta)$.
Imagine slightly turning the knobs on theta, and as a result we nudge $\pi(a | s, \theta)$ for each action. Taking the sum over actions in the above, we are getting the net result of all those action probability deltas weighted by their action values, and that, of course, is what we expect to have an affect on $J(\theta)$! It is important to note here that it is only $\triangledown_{\theta} \pi(a | s, \theta)$ that plays a role in this formulation, not $\pi(a | s, \theta)$ itself. It doesn't matter how likely or unlikely actions are to calculate the affect on $J(\theta)$, only the changes in those probabilities. Increasing a good action from happening 1% of the time to 2% of the time should have the same affect on $J(\theta)$ as increasing an equally good action from happening 90% of the time to 91% of the time. This will be important to recall later.
Reverting back to the multistate environment, all this action that 1 state has on $J(\theta)$ simply happens in proportion to how frequently we're in that state.
So we've got the gradient of $J$ in terms of the gradient of $\pi$, and we have the gradient of $\pi$ because we promise to make $\pi$ a function we can differentiate. What we don't have is $\mu(s)$. In fact, we are blind with respect to knowledge about all the states and the $q_{\pi}$ values for those states and possible actions. However, we do know how to collect samples of states, actions and corresponding $q_{\pi}(s,a)$ values: just let our agent run around and memorize the rewards achieved after visiting certain states and taking certain actions. So we desire to write $\triangledown J(\theta)$ as proportional to some expectation over samples. For starters, we have:
$$ \triangledown J(\theta) \propto \sum_{s} \big( \mu(s) \sum_{a} q_{\pi}(s,a) \ \triangledown_{\theta} \pi(a | s, \theta) \big) = E_{\pi} \bigg[ \sum_{a} q_{\pi}(S_{t},a) \ \triangledown_{\theta} \pi(a | S_{t}, \theta) \bigg]$$We can turn the $\mu(s)$ into an expectation over some sampled state $S_{t}$ simply because we will sample states in proportion to $\mu(s)$, the percent of time we're in them.
Now let's multiply and divide by $\pi(a | S_{t}, \theta)$ inside that expectation:
$$ \triangledown J(\theta) \propto E_{\pi} \bigg[ \sum_{a} q_{\pi}(S_{t},a) \ \triangledown_{\theta} \pi(a | S_{t}, \theta) \bigg] = E_{\pi} \bigg[ \sum_{a} \pi(a | S_{t}, \theta)\ q_{\pi} (S_{t},a) \ \frac{\triangledown_{\theta} \pi(a | S_{t}, \theta)}{\pi(a | S_{t}, \theta)} \bigg]$$Now, recalling the definition of an expectation, $E f(x) = \sum_{x} p(x) f(x)$, inside those last big brackets we have precisely this form. So we nix the $\sum_{a} q_{\pi}(S_{t}, a)$ for one final layer of expectation and wrap everything into a single expectation:
$$ = E_{\pi} \bigg[ q_{\pi} (S_{t},A_{t}) \ \frac{\triangledown_{\theta} \pi(A_{t} | S_{t}, \theta)}{\pi(A_{t} | S_{t}, \theta)} \bigg]$$Finally, on expectation $q_{\pi} (S_{t},A_{t})$ will just be the discounted sum of rewards we sample, $G_{t}$, and so we've come to our prize:
$$ \triangledown J(\theta) \propto E_{\pi} \bigg[ G_{t} \ \frac{\triangledown_{\theta} \pi(A_{t} | S_{t}, \theta)}{\pi(A_{t} | S_{t}, \theta)} \bigg]$$We will be able to calculate or sample everything in this final expression for $\triangledown J(\theta)$, and so we've got the essential ingredients for a gradient ascent algorithm! What's more this final expression makes simple intuitive sense. Imagine having many sample tuples, $(S_{t}, A_{t}, G_{t})$ and calculating this expected value by averaging. If $G_{t}$ is big for a particular sample, this action worked out well for us, so we expect this sample, in the average, to play a role in pumping up $J(\theta)$ if $\pi(A_{t} | S_{t}, \theta) > 0$, precisely as our proportion demands. To understand the role of $\pi$ in the denominator, recall from earlier that $\pi$ itself should not play into $\triangledown J(\theta)$ as it was formulated as a sum over all actions rather than an expectation. Now that we do have an expectation, we must account for the fact that we may only encounter some actions very rarely. We compensate these rare actions by dividing by $\pi(A_{t} | S_{t}, \theta)$. This way, in calculating $J(\theta)$ we still take account of all action's $\triangledown_{\theta} \pi(A_{t} | S_{t}, \theta)$ equally.
Now we know how to incrementally adjust theta to increase $J(\theta)$:
$$ \theta_{t+1} = \theta_{t} + \alpha' \triangledown J(\theta) = \theta_{t} + \alpha \ G_{t} \ \frac{\triangledown_{\theta} \pi(A_{t} | S_{t}, \theta)}{\pi(A_{t} | S_{t}, \theta)}$$Where in the above $\alpha'$ and $\alpha$ just differ by a constant. Noting that $\triangledown ln(x) = \frac{\triangledown x}{x}$, we will work with:
$$ \theta_{t+1} = \theta_{t} + \alpha \ G_{t} \ \triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta)$$Now we're getting very close to being able to write an algorithm to optimize $\theta$ for some policy and some environment. Looking at the above ascent formula, we know that we can pick a learning rate $\alpha$ and explore the environment to sample $G_{t}$, $A_{t}$ and $S_{t}$. Thus far we've promised to be able to differentiate $\pi(A_{t} | S_{t}, \theta)$, so it's time we untangle that last wrinkle.
Recall that we will set $\pi(a|s,\theta) = \frac{e^{h(a|s, \theta)}}{\sum_{b} e^{h(b|s, \theta)}}$ for any policy function we wish to write. What will differ by situation is our choice for $h(a|s, \theta)$, so we'll wrap up our core gradient ascent equation, without losing generality, by writing $\triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta)$ in terms of $h(a|s, \theta)$.
Now we can't help but recognize that $\pi$ is still lurking here. Finally we have:
$$\triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta) = \triangledown_{\theta} \ h(A_{t}|S_{t}, \theta) \ -\ \sum_{b} \triangledown_{\theta} h(b|S_{t},\theta) \ \pi(b|S_{t}, \theta)$$What have we learned? $\triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta)$ is the fundamental step we take to improve our performance $J(\theta)$, and the step is positive for actions that we want to encourage (at least if we standardize $G$). So what is this step telling us to do for this action and state pair $A_{t}, S_{t}$? For a good action, we can see why we want to increase $ln\ \pi(A_{t} | S_{t}, \theta)$. The above equation tells us how to do this: increase $h(A_{t} | S_{t})$, the preferance for this action while not also significantly increasing $h$ for all the other possible actions! Because $\pi$ is fundamentally a softmax the $h$ preferences for all the actions will affect the output for any one action, and so we can understand why we need the sum over b, all the actions in the final accounting.
If we'd really like to probe, we can get even more intuition for why $\pi$ appears in the sum over b. Suppose for some particular action $b'$, $\pi(b' | S_t, \theta) = 0$. In that case $h(b' | S_t, \theta) \approx -\infty$ and $\frac{d \ e^{h(b' | S_t)}}{d h} \approx 0$. Thus, we don't care about this action's effect in the softmax $\pi(A_t| S_t, \theta)$ and $\triangledown_\theta h(b' | S_t, \theta)$ vanishes in the sum over b.
Now we're excited to get back to our Mountain Car problem and see if these ideas all make concrete sense! One of the reasons to use policy gradient methods instead of Q learning is that the optimal policy may be fundamentally simpler than Q. For Mountain Car, what is the optimal policy? Well, we want to build momentum so that we can eventually use this gained momentum to beat gravity! We build momentum by increasing our velocity, so perhaps an optimal policy would be to just always push in the direction of current velocity. Let's see if we can't derive this policy from our theory.
Our goal initially is to play with policy gradent theory by keeping things as simple as possible and see if we can understand this concrete example. Once we have a feel for things we can unleash some neural networks. For now it's going to be quite pleasant to see the crudest of tools demolish the Mountain Car problem. We suspect a simple policy, so let's make $h$ a linear function in $\theta$:
$$ h(a,s, \theta) = \theta^{T} \ x(a,s)$$Here we've thrown all of $h$'s dependency on $a$ and $s$ into a feature vector, $x$. Then we simply dot product $x$ with $\theta$. Note that now we have the simple $\triangledown_{\theta} h(a,s, \theta) = x(a,s)$. Let's create a natural encoding for the actions: let power right = +1, do nothing = 0 and power left = -1. Since we don't believe the position of the car should influence our policy, we'll ignore it for now and write:
$$ x(a, s) = a \frac{s_v}{|s_v|} $$Where $s_v$ is the state's velocity. Each component of $x$ should be some function of $a$ and $s$. Since we only care about 1 component of $s$, we'll only give $x$ 1 component and make it a product of the action and the sign of the velocity. What could be simpler?
Now we've nailed down our policy function:
$$ h(1, s, \theta) = \theta \frac{s_v}{|s_v|} $$$$ h(0, s, \theta) = 0 $$$$ h(-1, s, \theta) = -\theta \frac{s_v}{|s_v|} $$So we have constructed $h$ to be as simple as the problem itself, using our knowledge of the problem to make $h$ laser focused on the relvant features. This is an essential step, especially for more difficult problems: the more we can suit the form of $h$ to our domain knowledge, the easier it will be to learn $\theta$. In fact, this basic principle of modeling carries over to all of machine learning.
Now we've stopped at specifying $\theta$ because that is exactly what we're supposed to learn! However, it will be instructive to stop and consider what the optimal $\theta$ might be, this way we can see if our learning process is behaving itself. For Mountain Car, we said we want to always move in the direction of velocity; consequently, if, for example, $\frac{s_v}{|s_v|}$ is positive, we want $h(1,s, \theta) \approx \infty$ because our preference for powering right should be overwhelming. We can already see that $\theta$ had better learn to be a big number if we expect our agent to make it out of the valley. Let's confirm that this is precisely the solution policy gradient theorem will provide.
Recall:
$$ \theta_{t+1} = \theta_{t} + \alpha \ G_{t} \ \triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta)$$And:
$$\triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta) = \triangledown_{\theta} \ h(A_{t}|S_{t}, \theta) \ -\ \sum_{b\ \in\ -1, 0, 1} \triangledown_{\theta} h(b|S_{t},\theta) \ \pi(b|S_{t}, \theta)$$Also recall the reward system for Mountain Car, -1 for each time step. To simplify the intuition here, let's suppose we standardize all of our $G_t$ samples so that good actions tend to see positive rewards and bad actions tend to see negative rewards. It will make everything more human readable, and, as it turns out, we can reference a quick proof of "REINFORCE with baseline" in Sutton's text to show that all the above still holds. We'll omit the proof for now and take reward standardization to be fairly innocent and reasonable.
Plug in our $h$ and it's gradient, with actions encoded as above. Expand the sum (we only have 3 actions):
$$ = A_t \frac{S_v}{|S_v|} - \big( \frac{S_v}{|S_v|} \pi(1 | S_t) - \frac{S_v}{|S_v|} \pi(-1 | S_t) \big)$$$$ = A_t \frac{S_v}{|S_v|} \ -\ \frac{S_v}{|S_v|} \pi(1 | S_t)\ +\ \frac{S_v}{|S_v|} \pi(-1 | S_t) $$$$ = \frac{S_v}{|S_v|} \ \bigg[ A_t \ + \pi(-1 | S_t) \ - \ \pi(1 | S_t) \bigg]$$Where of course $\pi$ is still a function of $\theta$, but we're not writing it explicitly, and $S_v$ refers to the velocity of the state at time step $t$.
Recall that we've worked out everything in expectation, so we intend to update $\theta$ with many samples of $(A_t, S_t, G_t)$ tuples.
First, if the velocity of a sampled state is 0, then $\triangledown_{\theta} ln \ \pi$ goes to 0 and that sample won't contribute to any update in $\theta$.
To really understand why this entire formulation has to work to pump $\theta$ up to infinity, let's consider a few interesting sample scenarios:
Suppose we sample $A_t = 1$ and $S_v > 0$. Our agent has powered right when it was moving to the right. Good agent! It's building momentum; this is the behavior we want to REINFORCE! We should expect the $G_t$ for this sample to be relatively large and positive, on average, since this was good behavior. Let's see our this sample may impact $\theta_{t+1}$:
$$ \theta_{t+1} = \theta \ + \ \alpha G_t \big( 1 \ +\ \pi(-1| S_t)\ -\ \pi(1|S_t) \big) $$Since $ \pi(-1|S_t) - \pi(1|S_t) $ has to be more than -1, this is a positive update to $\theta$ proportional to $G_t$. This sample will push $\theta$ up, as we'd like to see! Furthermore, if we currently have a $\pi$ that would have made this good behavior unlikely ( $\pi(1|S_t) << .5$ ), we'd be adding further fuel to our $\theta$ update, and rightly so: our agent has acted to do the right thing against the odds, its policy deserves to a big update.
Now suppose we sample $A_t = -1$ and $S_v > 0$. Our agent has powered left when it was moving to the right. Bad agent! We should expect the $G_t$ for this sample to be relatively large and negative, on average. Let's see how this sample may impact $\theta_{t+1}$:
$$ \theta_{t+1} = \theta \ + \ \alpha G_t \big( -1 \ +\ \pi(-1| S_t)\ -\ \pi(1|S_t) \big) $$Again, this sample will influence $\theta$ to increase since $G_t < 0$ and $\pi(-1| S_t)\ -\ \pi(1|S_t) < 1$
Let's review our algorithm for policy optimization as it's found in Sutton's text for reference(pg 271). We will then go on to implement it and consider the results.
Essentially the REINFORCE algorithm tells us to initialize our $\theta$ and collect samples following our current policy. Then we perform a gradient ascent update on $\theta$ in order to climb the $J(\theta)$ function. We can choose to further discount $G_t$ by $\gamma^t$ if we'd like to really maximize discounted rewards from the starting state.
As mentioned earlier we will also be experimenting with baselines as discussed in the text. Ultimately, simply standardizing the rewards gave us reasonable performance.
We begin our exploration of policy gradient in practice with MountainCar-v0 from openai gym. Again, the goal here is to learn about the foundational ideas with the simplest example. We can with a 1 dimensional theta so that there is only 1 weight that we need to learn, theta. Because of the way we've designed our state/action feature vector, $x$, to just depend on the sign of velocity we can easily understand the model and the weight we need to find if it is to see reasonable success at MountainCar. As discussed above, we should see our REINFORCE algorithm increase the value of theta over training cycles. We expect the net effect will be that our car learns to gain momentum and push out of the valley.
This class will be specific to MountainCar for a concrete initial example. Note that we've programmed in the ability to learn a baseline for each state, which should converge to the average G at that state. Especially for MountainCar with the default reward system, we expect that the baseline could be very helpful if it's also a function of time step. This is because in MountainCar the earlier states will always see very heavy negative rewards even if action in them is optimal. In theory the baseline should be able to correct for this imbalance, but in practice it was not effective.
Everything in the below cell simply implements what we've discussed.
def calculate_G(rewards, gamma):
discounted_rewards = []
for i in range(len(rewards)):
to_sum = np.array(rewards[i: ])
decay_factors = np.array([gamma ** j for j in range(len(to_sum))])
discounted_rewards.append(sum(to_sum * decay_factors))
return list(discounted_rewards)
def reward_modification(next_state):
distance_from_bottom = abs(next_state[0] + .5) #-.5 is the location of the bottom of the well
return distance_from_bottom
class PolicyGradAgent:
def __init__(self, state_size, action_size, lr, gamma, baseline, baseline_lr):
# get size of state and action
self.state_size = state_size
self.action_size = action_size
self.gamma = gamma
self.lr = lr
self.theta = 0
self.theta_2d = np.zeros(self.state_size)
self.w = np.zeros(self.state_size)
self.possible_actions = [-1, 0, 1]
self.uncode_action = {-1: 0, 0: 1, 1:2}
self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
self.baseline = baseline
self.baseline_lr = baseline_lr
def reset_theta(self):
print("Goal yet unreached, resetting theta param near zero.")
self.theta = np.random.normal(loc=0, scale=.1)
def h(self, action, state):
v = state[1]
return self.theta * action * np.sign(v)
def pi(self, action, state):
numerator = np.exp(self.h(action, state))
denom = np.exp(self.h(-1, state)) + np.exp(self.h(0, state)) + np.exp(self.h(1, state))
return numerator / denom
def V(self, state):
return np.dot(state, self.w)
def get_action(self, state):
probs = [self.pi(action, state) for action in self.possible_actions]
return np.random.choice(self.possible_actions, p=probs)
def erase_batch_memory(self):
self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
def prepare_rewards(self):
self.batch_memory["G"] = np.array(self.batch_memory["G"])
# Subtract off mean as the baseline
self.batch_memory["G"] -= np.mean(self.batch_memory["G"])
if np.std(self.batch_memory["G"]) > 0:
self.batch_memory["G"] /= np.std(self.batch_memory["G"])
def learn(self):
theta_delta = 0
w_delta = np.zeros(self.state_size)
for t in range(len(self.batch_memory["S"])):
G = self.batch_memory["G"][t]
S = self.batch_memory["S"][t]
A = self.batch_memory["A"][t]
step = self.batch_memory["T"][t]
p = S[0]
v = S[1]
if self.baseline == True:
adv = G - self.V(S)
w_delta += self.baseline_lr * self.gamma**step * adv * np.array([p, v])
self.w += w_delta
theta_delta += self.lr * adv * self.gamma**step * np.sign(v) * (A + self.pi(-1, S) - self.pi(1, S))
else:
theta_delta += self.lr * G * self.gamma**step * np.sign(v) * (A + self.pi(-1, S) - self.pi(1, S))
explore = (np.random.rand() -.5) * 2
self.theta += theta_delta + explore
def report(self):
current_steps = len(self.episode_memory["S"])
print("Current Theta: {:.3}, {} steps to done".format(self.theta, current_steps))
Below we have the standard loop that runs the agent and trains after so many episodes. This is specifically rigged for mountain car so that we can try different reward systems and implement a few other details relevant to our experiment here that won't carry over to other environments.
This loop is prepared to run either the agent above or a more complex agent that we'll discuss shortly. We now introduce the various reward systems we'll try:
Reward Systems: default: as discussed, and as programmed by openai. distance from bottom: rewards proportional to the agent's distance from the bottom of the valley. 1 for finish: a reward of +1 for reaching the goal, 0 otherwise. speed for finish: steps to reach goal = steps, reward = log(201-steps)
The "speed for finished" reward is designed to mimick the behavior we'd like to see with the default reward system, but if gamma=1 then early states won't be at a disadvantage to later states, as in the default system. We hypothesis that this could help learning. We take the log so that finishing in 199 steps won't be twice as good as finishing in 200 steps. The minimum possible reward is log(1)=0.
def run_agent(lr, gamma, num_batches, reward_system="default", baseline=False, baseline_lr = 0, two_d=False):
env = gym.make('MountainCar-v0')
# create agent
if not two_d:
agent = PolicyGradAgent(state_size=env.observation_space.shape[0],
action_size=env.action_space.n,
lr=lr,
gamma=gamma,
baseline=baseline,
baseline_lr=baseline_lr)
else:
agent = PolicyGradAgent2d(state_size=env.observation_space.shape[0],
action_size=env.action_space.n,
lr=lr,
gamma=gamma,
baseline=baseline,
baseline_lr=baseline_lr)
batch_size = 100
theta_history = []
step_history = []
goal_reached = False
for batch in range(num_batches):
agent.erase_batch_memory()
theta_history.append(agent.theta)
for episode in range(batch_size):
rewards = []
states = []
actions = []
state = env.reset()
done = False
while not done:
states.append(state)
# get action for the current state and go one step in environment
action = agent.get_action(state)
# save action
actions.append(action)
# take step in environment, write up codes actions differently from gym so we uncode to give to env.step
next_state, reward, done, info = env.step(agent.uncode_action[action])
# calculate reward and append to rewards
if reward_system == "default":
rewards.append(reward)
elif reward_system == "1 for finish" or reward_system == "speed for finish":
rewards.append(0.0)
elif reward_system == "distance from bottom":
rewards.append(reward_modification(next_state))
else:
print("Error: Invalid reward system specified.")
state = next_state
# if not done, loop again
if done:
# record length of episode
steps = len(states)
step_history.append(steps)
reached = steps < 200
if reward_system == "1 for finish":
rewards[-1] = float(reached)
if reward_system == "speed for finish":
rewards[-1] = np.log(201 - steps)
if not goal_reached and reached > 0:
goal_reached = True
# give episode data to the agent
agent.batch_memory["S"] += states
agent.batch_memory["A"] += actions
agent.batch_memory["G"] += calculate_G(rewards, gamma)
agent.batch_memory["T"] += list(range(steps))
# learn from the data (update theta)
agent.prepare_rewards()
agent.learn()
if (batch + 1) % int(num_batches/3) == 0 and not reached:
agent.reset_theta()
return theta_history, step_history, agent.w
For MountainCar, we've devised such a simple model such that if we randomly throw a dart at the model weight, theta, we will succeed to escape the valley %50 of the time! Theta need only be positive. Thus, to see if the model is truely working we train 5 agents and check for consistent performance.
def train_agents(lr, gamma, num_batches, reward_system, baseline, two_d=False):
theta_histories = []
step_histories = []
means = []
last_theta = []
last_weights = []
for i in range(5):
print("Training Agent # {} of 5...".format(i+1))
theta_history, step_history, last_weight = run_agent(lr=lr,
gamma=gamma,
num_batches=num_batches,
reward_system=reward_system,
baseline=baseline,
two_d=two_d)
means.append(np.mean(theta_history))
last_theta.append(theta_history[-1])
theta_histories.append(theta_history)
step_histories.append(step_history)
last_weights.append(last_weight)
if not two_d:
print("Mean training theta: {:.3}, Final theta: {:.3}, Last 100 Episodes Mean Steps: {}\n".format(means[-1], last_theta[-1], int(np.mean(step_history[-100:]))))
else:
print("Final theta: {}, Last 100 Episodes Mean Steps: {}\n".format(last_theta[-1], int(np.mean(step_history[-100:]))))
return theta_histories, step_histories
Simple result plotting code
def moving_ave(history):
return [np.mean(history[ :i + 1]) for i in range(len(history))]
def plot_histories(histories, ax, col, title, xlabel, moving=False):
rows = len(histories)
for i in range(rows):
if moving:
ax[i, col].plot(range(len(histories[i])), moving_ave(histories[i]), c="red", lw=4)
ax[i, col].set_ylabel("Step")
else:
ax[i, col].plot(range(len(histories[i])), histories[i], c="purple", lw=4)
ax[i, col].set_ylabel("$\Theta$")
ax[i, col].text(-10,0, "Round {}".format(i + 1))
ax[0, col].set_title(title)
ax[rows-1, col].set_xlabel(xlabel)
def plot_training_results(theta_histories, step_histories):
rows = len(theta_histories)
fig, ax = plt.subplots(rows, 2, figsize=(15,10))
plot_histories(theta_histories, ax, 0, title="Theta Param", xlabel="Batch")
plot_histories(step_histories, ax, 1, moving=True, title="Steps Until Done Moving Ave", xlabel="Episode")
fig.suptitle(str(rows) + " Rounds of Policy Grad Agent Training History", y=1.03, fontsize=22)
fig.tight_layout()
theta_histories, step_histories = train_agents(lr=1e-2,
gamma=.99,
num_batches=30,
reward_system="default",
baseline=False)
Comments
We observe that in most cases, the training algorithm needed to reset $\Theta$ before it could succeed at reaching escape velocity. If $\Theta$ starts going negative while training, the mountain car has learned to fight momentum with drastic consequences. Under the default reward system, it will never reach the goal and get stuck never learning anything.
plot_training_results(theta_histories, step_histories)
Comments
Looking at a side by side comparison of $\Theta$ and episode step count, we observe that once the agent reaches the goal, $\Theta$ begins to grow and the agent never look back. With a steady flow of positive reinforcement, the agent can learn to build momentum consistently and reach the goal in about 120 steps.
theta_histories, step_histories = train_agents(lr=1e-2,
gamma=.99,
num_batches=10,
reward_system="distance from bottom",
baseline=False)
Comments
Now our results are consistent without any need for ever resetting $\Theta$ while learning. This reward system gives the agent consistent and smooth feedback for building momentum, and the agent learns a healthy positive $\Theta$ on every run.
plot_training_results(theta_histories, step_histories)
Comments
Because of the smooth reward system, the agent quickly reaches the goal, as can be seen in the moving average plots of step count.
Since the defaut system was too challenging for the agent and this system too easy, let's progress to systems we believe to be a compromise between the two.
theta_histories, step_histories = train_agents(lr=1e-2,
gamma=1,
num_batches=20,
reward_system="1 for finish",
baseline=False)
Comments
The agent has done better here than in the default case, with fewer needs to reset training. We believe that the key reason which makes this system easier than the default is the following: when gamma is 1, the agent sees 0 immediate reward in all states except for the goal reaching state. Thus, the early states are not disadvantaged with such heavy negative rewards as in the default system. This system thereby sends a consistent message to the agent: "if you escape, consider all of your actions to have been good, otherwise they are all bad". Under the default system the problem is that actions near "Done" are always relatively good vs actions in early states, whether the agent succeeds or not! In that case it's only by comparing a massive batch size of sampled histories that we can learn that on average all states in sucessful episodes are less negative than in failure episodes.
plot_training_results(theta_histories, step_histories)
Comments
The agent learns more quickly and consistently than in the default reward case.
Much like the +1 approach, we only reward the agent when it finishes, but now faster finishes are better. Let's see if this extra information helps the average step counts.
theta_histories, step_histories = train_agents(lr=1e-2,
gamma=1,
num_batches=20,
reward_system="speed for finish",
baseline=False)
plot_training_results(theta_histories, step_histories)
Comments
We don't have statistically significant improvement over the +1 for finish case. We could have expected this: with a one dimensional $\Theta$ all the agent can learn is to gain momentum. It's position does not influence its behavior and so it can never really learn to turn around once it's gone far enough to the left to gain the potential energy to make it over to the goal on the right.
How can we further improve performance?
Let's continue by programming this ability into the agent's behavior functions. Specifcially we give the agent an additional term in it's state feature vector: now the agent's preference function $h$ will be able to consider position in addition to the sign of velocity. Much like we havn't been using raw velocity in the $x$ feature vector, we will also do some engineering as to not simply consider raw position.
Ultimately, we'd like the agent to turn around once it's far enough to the left, but not before. It's a waste of time to build more potential energy than it needed. We introduce our 2d MountainCar agent that using the additional feature $left \ measure(p)$ or $l(p)$ for short, a function of position. $l(p) = 0$ if $p > -.5$ and $(p + .5)^2$ otherwise. The key idea is that $l(p)$ is only significantly non-zero when the cart is far to the left. We believe this might be just the information that the cart needs so that it can learn to turn around once it's reached sufficient potential energy on the left.
Let's see the agent code modification and then experiment with it:
class PolicyGradAgent2d:
def __init__(self, state_size, action_size, lr, gamma, baseline, baseline_lr):
# get size of state and action
self.state_size = state_size
self.action_size = action_size
self.gamma = gamma
self.lr = lr
self.theta = np.zeros(self.state_size)
self.w = np.zeros(self.state_size)
self.possible_actions = [-1, 0, 1]
self.uncode_action = {-1: 0, 0: 1, 1:2}
self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
self.baseline = baseline
self.baseline_lr = baseline_lr
def reset_theta(self):
print("Goal yet unreached, resetting theta param near zero.")
self.theta = np.random.normal(loc=0, scale=.1, size=2)
def left_measure(self, p):
output = 0
if p < -.5:
output = (p + .5)**2
return output
def h(self, action, state):
p = state[0]
v = state[1]
left_measure = self.left_measure(p)
X = np.array([action * np.sign(v), action * left_measure])
return np.dot(self.theta, X)
def pi(self, action, state):
numerator = np.exp(self.h(action, state))
denom = np.exp(self.h(-1, state)) + np.exp(self.h(0, state)) + np.exp(self.h(1, state))
return numerator / denom
def V(self, state):
return np.dot(state, self.w)
def get_action(self, state):
probs = [self.pi(action, state) for action in self.possible_actions]
return np.random.choice(self.possible_actions, p=probs)
def erase_batch_memory(self):
self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
def prepare_rewards(self):
self.batch_memory["G"] = np.array(self.batch_memory["G"])
# Subtract off mean as the baseline
self.batch_memory["G"] -= np.mean(self.batch_memory["G"])
if np.std(self.batch_memory["G"]) > 0:
self.batch_memory["G"] /= np.std(self.batch_memory["G"])
def learn(self):
theta_delta = np.zeros(self.state_size)
w_delta = np.zeros(self.state_size)
for t in range(len(self.batch_memory["S"])):
G = self.batch_memory["G"][t]
S = self.batch_memory["S"][t]
A = self.batch_memory["A"][t]
step = self.batch_memory["T"][t]
p = S[0]
v = S[1]
if self.baseline == True:
adv = G - self.V(S)
w_delta += self.baseline_lr * self.gamma**step * adv * np.array([p, v])
self.w += w_delta
theta_delta += self.lr * adv * self.gamma**step * np.array([np.sign(v), self.left_measure(p)]) * (A + self.pi(-1, S) - self.pi(1, S))
else:
theta_delta += self.lr * G * self.gamma**step * np.array([np.sign(v), self.left_measure(p)]) * (A + self.pi(-1, S) - self.pi(1, S))
explore = (np.random.rand(self.state_size) -.5) * 2
self.theta += theta_delta + explore
theta_histories, step_histories = train_agents(lr=1e-2,
gamma=1,
num_batches=20,
reward_system="speed for finish",
baseline=False,
two_d=True)
We can compare last 100 episode mean steps to previously and see no improvement. Looking at the agent's learned $\Theta$ vector we can start to imagine what the issue is. Now that $\Theta$ is two dimensional, we really need the right combination for good performance, which likely takes more training time to learn. We believe that if both $\Theta$ components learned to be positive, with the first component larger than the second, we'd see the best performance yet.
We leave further tuning and more training time for a future idea to explore and now move onto a linear policy gradient agent that is prepared to act in many other environments with discrete action spaces.
We simply generalize what we've done so far. Since we don't know the environment in advance, we generalize $x(a,s)$ to $a \begin{bmatrix}s_0 \\ s_1 \\ s_2 \\ ... \\ 1 \end{bmatrix}$ in what we call the "non quadratic" or "simple" case. We also provide an input option for this agent to also add quadratic terms in state to $x(a,s)$ so that we can compare.
class GeneralPolicyGradAgent:
def __init__(self, state_size, action_size, lr, gamma, quadratic_terms=False):
# get size of state and action
self.state_size = state_size
self.action_size = action_size
self.gamma = gamma
self.lr = lr
self.quadratic_terms = quadratic_terms
if not self.quadratic_terms:
self.theta = np.zeros(self.state_size + 1)
else:
self.theta = np.zeros(self.state_size*2 + 1)
self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
def h(self, action, state):
X = self.X(action, state)
return np.dot(self.theta, X)
def pi(self, action, state):
numerator = np.exp(self.h(action, state))
denom = 0
for act in range(self.action_size):
denom += np.exp(self.h(act, state))
return numerator / denom
def get_action(self, state):
probs = [self.pi(action, state) for action in range(self.action_size)]
return np.random.choice(range(self.action_size), p=probs)
def erase_batch_memory(self):
self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
def prepare_rewards(self):
self.batch_memory["G"] = np.array(self.batch_memory["G"])
# Subtract off mean as the baseline
self.batch_memory["G"] -= np.mean(self.batch_memory["G"])
if np.std(self.batch_memory["G"]) > 0:
self.batch_memory["G"] /= np.std(self.batch_memory["G"])
def X(self, action, state):
m=1
if self.quadratic_terms:
m=2
output = np.zeros(self.state_size*m + 1)
for i in range(self.state_size):
output[i] = action * state[i]
if self.quadratic_terms:
for i in range(self.state_size):
output[self.state_size + i] = action * state[i]**2
output[-1] = 1
return output
def learn(self):
if not self.quadratic_terms:
theta_delta = np.zeros(self.state_size + 1)
else:
theta_delta = np.zeros(self.state_size*2 + 1)
for t in range(len(self.batch_memory["S"])):
G = self.batch_memory["G"][t]
S = self.batch_memory["S"][t]
A = self.batch_memory["A"][t]
step = self.batch_memory["T"][t]
action_sigma = 0
for b in range(self.action_size):
action_sigma += self.pi(b, S) * self.X(b, S)
theta_delta += self.lr * G * self.gamma**step * (self.X(A, S) - action_sigma)
self.theta += theta_delta
Slight modifications to run agent code to make it work for the more general case:
Now we use the boiler plate run agent code, where we can use any environment and turn quadratic terms on or off. The other difference here is that now we use total scores as our metric; we also always use the default env reward system.
def general_run_agent(lr, gamma, num_batches, env, verbose=False, quadratic_terms=False):
agent = GeneralPolicyGradAgent(state_size=env.observation_space.shape[0],
action_size=env.action_space.n,
lr=lr,
gamma=gamma,
quadratic_terms=quadratic_terms)
batch_size = 100
theta_history = []
scores = []
for batch in range(num_batches):
agent.erase_batch_memory()
theta_history.append(agent.theta)
for episode in range(batch_size):
rewards = []
states = []
actions = []
score = 0
state = env.reset()
done = False
while not done:
states.append(state)
# get action for the current state and go one step in environment
action = agent.get_action(state)
# save action
actions.append(action)
# take step in environment, write up codes actions differently from gym so we uncode to give to env.step
next_state, reward, done, info = env.step(action)
rewards.append(reward)
score += reward
state = next_state
# if not done, loop again
if done:
# record score of episode
scores.append(score)
steps = len(states)
# give episode data to the agent
agent.batch_memory["S"] += states
agent.batch_memory["A"] += actions
agent.batch_memory["G"] += calculate_G(rewards, gamma)
agent.batch_memory["T"] += list(range(steps))
# learn from the data (update theta)
agent.prepare_rewards()
agent.learn()
if batch % 2 == 0 and verbose:
print(scores[-1])
return theta_history, scores
results = {}
theta_history, scores = general_run_agent(lr=.5,
gamma=.99,
num_batches=150,
env=gym.make('CartPole-v1'),
verbose=False,
quadratic_terms=False)
results["Linear_CartPole_simple"] = [scores, theta_history]
get_streak(scores)
theta_history, scores = general_run_agent(lr=.5,
gamma=.99,
num_batches=150,
env=gym.make('CartPole-v1'),
verbose=False,
quadratic_terms=True)
results["Linear_CartPole_quadratic"] = [scores, theta_history]
get_streak(scores)
fig, ax = plt.subplots(1,1, figsize=(15,8))
scores = results["Linear_CartPole_quadratic"][0]
simple_scores = results["Linear_CartPole_simple"][0]
ax.plot(range(len(scores)), moving_ave(scores), lw=4, c="orange", label="Quadratic Model")
ax.plot(range(len(simple_scores)), moving_ave(simple_scores), lw=4, c="crimson", label="Simple Model")
ax.set_title("Linear Model Comparison For CartPole")
ax.set_xlabel("Episode")
ax.set_ylabel("Moving Average Score")
ax.legend();
theta_history = results["Linear_CartPole_simple"][1]
print("The theta components that mastered CartPole are:\n")
print("Cart position component: {:.3}".format(theta_history[-1][0]))
print("Cart velocity component: {:.3}".format(theta_history[-1][1]))
print("Pole angle component: {:.3}".format(theta_history[-1][2]))
print("Pole velocity at tip component: {:.3}".format(theta_history[-1][3]))
print("Bias: {:.3}".format(theta_history[-1][4]))
Comments
And so our generalized linear policy gradient method is a success! We've learned cartpole without assuming anything about the environment.
theta_history, scores = general_run_agent(lr=.1,
gamma=.99,
num_batches=50,
env=gym.make('Acrobot-v1'),
quadratic_terms=False)
results["Linear_Acrobot_simple"] = [scores, theta_history]
theta_history, scores = general_run_agent(lr=.1,
gamma=.99,
num_batches=50,
env=gym.make('Acrobot-v1'),
quadratic_terms=True)
results["Linear_Acrobot_quadratic"] = [scores, theta_history]
fig, ax = plt.subplots(1,1, figsize=(15,8))
scores = results["Linear_Acrobot_quadratic"][0]
simple_scores = results["Linear_Acrobot_simple"][0]
ax.plot(range(len(scores)), moving_ave(scores), lw=4, c="orange", label="Quadratic Model")
ax.plot(range(len(simple_scores)), moving_ave(simple_scores), lw=4, c="crimson", label="Simple Model")
ax.set_title("Linear Model Comparison For Acrobot")
ax.set_xlabel("Episode")
ax.set_ylabel("Moving Average Score")
ax.legend();
We now attempt to improve on our $h(s,a)$ approximation by employing a neural network to learn it. While in the case of MountainCar our 2 dimensional $\Theta$ and linear $h$ seemed like it would be pretty capable of optimal behavior, perhaps the network will have some insight that we didn't. Moreover, we would expect vastly better results from the neural network on environments where the optimal policy is non-linear. The real beauty of the network is that now we don't have to consider feature engineering; instead we can learn it!
The goal here is to make this process much like what we've done so far with linear policy gradient methods. Recall that we've been employing the vanilla REINFORCE algorithm from Sutton's text. In essence, we've learned that this algorithm seeks to maximize the expected $ln\ \pi(A_{t} | S_{t}, \theta) \ G_t$ where $\theta$ is the parameters of some function approximation model. So far we've designed our own gradient ascent algorithm, but keras as auto differentiation ability which should make our job easier! We want our loss function to be the negative of what we want to maximize, so our proposed custom loss becomes $-ln\ \pi(A_{t} | S_{t}, \theta) \ G_t$.
Now, how does this custom loss function relate to a custom loss we might program into keras with inputs y_true and y_pred? Well, y_true will be a one-hot encoded vector representing $A_t$, the action we actually took in that time step. ypred will be our model's predicted probability for that action, or $\pi(A{t} | S_{t}, \theta)$. $G_t$ is the same as what we've been working with all along, the sampled $q(a,s)$ value, where we also standardize to give some rough baseline approximation. This standardization greatly increases the efficiency of the learning by reducing variance and giving a clearer reward signal.
So how custom loss may look like the following:
def custom_loss(y_true, y_pred):
# this step simply deals with the one hot encoded dimensions of these vectors
# model_prob is simply the model's predicted probability of selecting the action that was actually sampled
model_prob = K.batch_dot(y_true, y_pred)
log_prob = K.log(model_prob)
return -log_prob # instead of multiplying by G, we weight the samples with G
Where now we have not included $G_t$ because keras expects losses to be a function of only y_true and y_pred. Instead we can simply weight the samples by $G_t$.
However, we note that this custom loss really just amounts to categorical crossentropy, weighted by G.
Recall that categorical crossentropy is defined by:
$$-\sum_c^M y_c \ p(c)$$Where M is the total number of classes, or in our case actions. y is the observed, which is either 0 or 1 and p is the model's predicted probability. When we realize that y will be 0 for all actions except $A_t$, the action we sampled, we realize that this expression reduces to our custom loss after we weight by G.
We go on to use cross entropy loss instead of our custom function because we had better experience with it. Maybe our dot product operation is not behaving correctly in the custom loss? Or perhaps we just had bad luck with the custom loss as we initially tested on MountainCar, which proved to be difficult in any case.
import keras.layers as layers
from keras.initializers import glorot_uniform
from keras.models import Model, Sequential
import keras.backend as K
from keras.optimizers import RMSprop
We provide the ability to train with two different network complexities. "Simple" just gives us a model with 1 hidden Dense layer with 3 nodes. "Complex" gives us two hidden layers of 5 nodes each. Output is simply softmax in both cases. We experimented with a few different optimizers with little effect; ultimately we choose to stick with RMSprop for consistency across experiments.
Much of the next two code blocks are identical to what we've used for the linear approximation. We now simply plug in the network.
K.clear_session()
class NNPolicyGradAgent:
def __init__(self, state_size, action_size, lr, gamma, network_complexity):
# get size of state and action
self.state_size = state_size
self.action_size = action_size
self.gamma = gamma
self.lr = lr
self.batch_memory = {"A": [], "S": [], "R": [], "G": []}
self.losses = []
self.network_complexity = network_complexity
self.model = self.build_model()
self.max_distances = []
def build_model(self):
# Define model
model=Sequential()
if self.network_complexity == "simple":
model.add(Dense(3, activation="relu", input_dim=self.state_size, use_bias=True))
else:
model.add(Dense(5, activation="relu", input_dim=self.state_size))
model.add(Dense(5, activation="relu"))
model.add(Dense(self.action_size, activation="softmax"))
# Compile model
model.compile(optimizer=RMSprop(lr=self.lr), loss="categorical_crossentropy")
return model
def get_action(self, state):
state = state.reshape(1,self.state_size)
action_probs = self.model.predict(state)
return np.random.choice(range(self.action_size), p=action_probs[0])
def erase_batch_memory(self):
self.batch_memory = {"A": [], "S": [], "R": [], "G": []}
def prepare_rewards(self):
self.batch_memory["G"] = np.array(self.batch_memory["G"])
# Subtract off mean as the baseline
self.batch_memory["G"] -= np.mean(self.batch_memory["G"])
if np.std(self.batch_memory["G"]) > 0:
self.batch_memory["G"] /= np.std(self.batch_memory["G"])
def one_hot_encode(self, acts):
one_hot_actions = np.zeros((len(acts), self.action_size))
one_hot_actions[np.arange(len(acts)), acts] = 1
return one_hot_actions
def learn(self):
S = np.array(self.batch_memory["S"])
G = np.array(self.batch_memory["G"])
A = np.array(self.batch_memory["A"])
A_one_hot = self.one_hot_encode(A)
loss = self.model.train_on_batch(S, A_one_hot, sample_weight=G)
self.losses.append(loss)
# Evaluation metrics for mountain car only:
# ave_pos = np.mean(S[: ,0])
# max_dist = np.max(S[: ,0])
# print("Max Distance: {:.3}".format(max_dist))
# self.max_distances.append(max_dist)
Same code, just plug in the networt agent now.
def run_agentNN(lr, gamma, num_batches, reward_system="mountain", env=gym.make('MountainCar-v0'), network_complexity="simple", verbose=False):
# create agent
agent = NNPolicyGradAgent(state_size=env.observation_space.shape[0],
action_size=env.action_space.n,
lr=lr,
gamma=gamma,
network_complexity=network_complexity)
batch_size = 200
step_history = []
goal_reached = False
scores = []
for batch in range(num_batches):
agent.erase_batch_memory()
for episode in range(batch_size):
rewards = []
states = []
actions = []
state = env.reset()
done = False
score = 0
while not done:
states.append(state)
# get action for the current state and go one step in environment
action = agent.get_action(state)
# save action
actions.append(action)
# take step in environment
next_state, reward, done, info = env.step(action)
# calculate reward and append to rewards
if reward_system == "mountain":
rewards.append(abs(state[1]))
else:
rewards.append(reward)
score += rewards[-1]
state = next_state
# if not done, loop again
if done:
# record length of episode
steps = len(states)
step_history.append(steps)
scores.append(score)
# give episode data to the agent
agent.batch_memory["S"] += states
agent.batch_memory["A"] += actions
agent.batch_memory["G"] += calculate_G(rewards, gamma)
agent.prepare_rewards()
agent.learn()
if verbose and batch % 20 and reward_system != "mountain":
print(scores[-1])
return step_history, agent.losses, agent.max_distances, scores
step_history, losses, max_distances, scores = run_agentNN(lr=.1, gamma=.9, num_batches=50, reward_system="mountain")
Comments
After great trials and tribulations, we've managed to get MountainCar to work by creating a new reward system that simply encourages the cart to pick up velocity. The reward at each step is literally just equal to current velocity. We suspect that other reward systems did not work because the model is trying to learn something that's too complex starting from random actions. By acting randomly, the agent never manages to reach the goal and therby never is able to learn. Previously, we had a linear model that was so simple that randomly adjusting the weight was enough to truely explore the possibility space and succeed. Now, randomly jiggling the knobs of the neural network is unlikely to lead to meaningful behavior.
So here we really see one major theme of machine learning: the bias vs variance trade off. By introducing the neural network, we now have a more powerful model that should be able to do everything the linear model did and more; however, this extra power and potential for less bias has come at the cost of high variance in a situation where we really need to cut variance at every cost because the sampling is so noisy. The high variance nn has much more trouble generalizing than the linear model did.
Nonetheless, with the much simplified reward system, we can see that the network is learning to get the car closer and closer to the goal, albeit very slowly,
fig, ax = plt.subplots(1,2)
ax[0].plot(range(len(max_distances)), moving_ave(max_distances), c='purple')
ax[0].set_xlabel("Epoch with 200 episodes")
ax[0].set_ylabel("Moving average of max position")
ax[0].set_title("Cart Learns to Reach Goal")
ax[1].plot(range(len(losses)), losses, c='darkblue')
ax[1].set_xlabel("Epoch with 200 episodes")
ax[1].set_ylabel("Weighted Cross Entropy Loss")
ax[1].set_title("Loss Slowly Diminishes")
fig.suptitle("Policy Gradient Neural Network Learning", fontsize=23, y=1.07);
fig.tight_layout()
Comments
Encouraged by some meaningfull behavior, we move on to apply the network to other environments with their default reward systems. Hopefully we will see faster results.
step_history, losses, max_distances, scores = run_agentNN(lr=.1,
gamma=.9,
num_batches=20,
reward_system="default",
env=gym.make('CartPole-v1'),
network_complexity="complex")
def get_streak(scores):
streak = 0
for i in range(len(scores)):
if scores[i] >= 475:
streak +=1
else:
streak = 0
if streak >= 100:
return "CartPole-v1 Solved: Streak starting at episode {}".format(i + 1 - 99)
return "CartPole-v1 Not Solved"
get_streak(scores)
step_history, simple_losses, max_distances, simple_scores = run_agentNN(lr=.1,
gamma=.9,
num_batches=20,
reward_system="default",
env=gym.make('CartPole-v1'),
network_complexity="simple")
get_streak(simple_scores)
fig, ax = plt.subplots(1,1, figsize=(15,8))
ax.plot(range(len(scores)), moving_ave(scores), lw=4, c="orange", label="Complex NN")
ax.plot(range(len(simple_scores)), moving_ave(simple_scores), lw=4, c="crimson", label="Simple NN")
ax.set_title("Network Comparison For CartPole")
ax.set_xlabel("Episode")
ax.set_ylabel("Moving Average Score")
ax.legend();
Comments
We've solved CartPole at episode 1601. Our best linear model didn't solve it until episode 2398. Training results have very high variance, but perhaps the nn is giving us some performance boost, without much tuning. Judging from the graph above where we see the deeper network outperform the shallow one, we can surmise that the optimal CartPole model is non-linear and hence the advantage for the nn model.
In Acrobot, we are essentially rewarded for how quickly we can swing a mechanical system above some threshold. See openAI gym online for a visualization. Because of the delayed reward system, we might encounter some of the trouble we saw in MountainCar.
Let's give Acrobot a shot with the nn:
step_history, losses, max_distances, complex_scores = run_agentNN(lr=.1,
gamma=.99,
num_batches=20,
reward_system="default",
env=gym.make('Acrobot-v1'),
network_complexity="complex",
verbose=False)
step_history, losses, max_distances, simple_scores = run_agentNN(lr=.1,
gamma=.99,
num_batches=20,
reward_system="default",
env=gym.make('Acrobot-v1'),
network_complexity="simple",
verbose=False)
fig, ax = plt.subplots(1,1, figsize=(15,8))
ax.plot(range(len(complex_scores)), moving_ave(complex_scores), lw=4, c="orange", label="Complex NN")
ax.plot(range(len(simple_scores)), moving_ave(simple_scores), lw=4, c="darkolivegreen", label="Simple NN")
ax.set_title("Network Comparison For Acrobot-v1")
ax.set_xlabel("Episode")
ax.set_ylabel("Moving Average Score")
ax.legend();
Comments
The deeper network was able to learn Acrobot while the simpler one was not. It would be interesting to try more than 5 nodes on the hidden layers of the deeper network and more hyperparamter tuning, but this is beyond the scope of our goals for this project.
In this project, we explored a number of techniques to solve reinforcement learning problems in various environments from OpenAI Gym. We reviewed the theory behind tabular methods (dynamic programming, Monte Carlo, Q-learning), and implemented both traditional and deep Q-learning, finding that deep Q-learning definitively outperformed traditional, tabular Q-learning on high-dimensional and continuous problems. We then explored a completely different class of solutions -- policy gradient methods -- and compared a simple linear policy approach to a deep neural network approach, finding once again that the deep approach was often much more scalable (see Acrobot results).
In general, we found that while both methods (deep Q and policy gradient) were able to achieve good results for the environments we tested, we found that deep Q learning performed slightly better in these environments. We believe one reason for this is the fact that in many of these problems, especially MountainCar, achieving the goal with random movements is very unlikely, and thus only occurs once in a while, if ever. The Q-learning approach is able to self-update much more effectively when it comes across successful episodes, whereas the policy gradient methods typically need more data (successful episodes in this case) to converge well.
For future work, it would be interesting to explore the performance of these classes of solution methods on more complex reinforcement learning problems, such as strategy games like Connect-Four, Chess, or Go.