CS109B Data Science 2: Advanced Topics in Data Science

Final Project: Reinforcement Learning

Harvard University
Spring 2019
Instructors: Pavlos Protopapas, Mark Glickman

Group Number: 51
Group Members: John Daciuk, Dean Hathout


In [1]:
import sys
import gym
import pylab
import random
import numpy as np
from collections import deque
from keras.layers import Dense
from keras.optimizers import Adam
from keras.models import Sequential
from tqdm import tqdm

import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
import matplotlib.animation as animation

from keras import backend as K
K.tensorflow_backend._get_available_gpus()
Using TensorFlow backend.
Out[1]:
['/job:localhost/replica:0/task:0/device:GPU:0']
In [4]:
# Set custom plotting style
plt.rcParams["axes.facecolor"] = (1,1,.99) 
plt.rcParams["font.size"] = 15 
plt.rcParams["figure.figsize"] = (12,6) 
plt.rcParams["ytick.labelsize"] = 13 
plt.rcParams["xtick.labelsize"] = 13 
plt.rcParams["lines.linewidth"] = 2 
plt.rcParams["axes.titlesize"] = 20 

Introduction

In this project, we aim to explore in greater depth some of the reinforcement learning concepts introduced in class, through both a literature review and an implementation of concepts discussed. Referring to Sutton and Barto’s Reinforcement Learning and Sutton’s “Policy Gradient Methods for Reinforcement Learning with Function Approximation,” we will review the mathematical theory behind topics touched on in class, such as policy iteration, value iteration, and Q-learning, and will also introduce policy gradient methods. We will then implement and compare the discussed methods for use in continuous control problem settings with neural networks, making use of environments such as Mountain Car, CartPole, and Acrobot in OpenAI Gym.

Review of Tabular Solution Methods

For a handful of reinforcement learning problems, particularly those which can be formulated as a finite Markov Decision Process, the state and action spaces are small enough such that tabular methods -- that is, methods which represent approximate value functions using arrays or tables -- can find exact solutions (exact optimal value functions and policies).

First, we present expressions for the optimal value functions $v_*$ (a function of state and called the state-value function) and $q_*$ (a function of state and action and called the action-value function), defined by their satisfaction of the Bellman optimality equations:

$$v_*(s) = \max_a \mathbb{E}[R_{t + 1} + \gamma v_*(S_{t + 1}) | S_t = s, A_t = a] = \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_*(s')]$$

$$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a] = \sum_{s', r} p(s', r|s, a)[r + \gamma \max_{a'} q_*(s', a')]$$

If these functions are known, then the optimal policy is self-evident -- in a given state, we take the action that produces the highest expected value. Thus, if we had a way of solving for or, more often, sufficiently approximating these value functions, we can easily solve for the optimal policy. In this section, we briefly review three fundamental classes of tabular solution methods to do this: dynamic programming, Monte Carlo methods, and temporal difference learning (focusing mostly on Q-learning).

Dynamic Programming

In order to find an approximation to the optimal state-value function $v_*$, we need a way of computing the state-value function $v_\pi$ for any policy $\pi$, where we define policy $\pi$ by saying that $\pi(a | s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$. Thus,

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] = \sum_a \pi(a | s) \sum_{s', r} p(s',r | s,a)[r + \gamma v_\pi(s')].$$

If the dynamics of the environment are completely known, then the above actually defines a system of linear equations with number of equations and number of unknowns both equal to the size of the state space. However, solving this system is often a tedious computation, and more saliently, the dynamics of the environment are rarely known, so we instead consider iterative solution methods where we generate a sequence of approximate value functions $v_0, v_1, ...$ until a suitable approximation is obtained. The update rule is as follows:

$$v_{k+1}(s) = \mathbb{E}[R_{t + 1} + \gamma v_k(S_{t+1}) | S_t = s] = \sum_a \pi(a | s) \sum_{s', r}p(s', r | s, a)[r + \gamma v_k(s')]$$

In words, this update does the following for each state $s$: replace the old value of $s$ with a new value computed via the expected immediate rewards and the (discounted) expected value of all the one-step transitions possible under the policy $\pi$. We follow this process until the maximum state-wise change between updates, $\max_s |v_{k+1}(s) - v_k(s)|$, is smaller than some chosen threshold. Although the formal proof is beyond the scope of this review, this update scheme is guaranteed to converge to $v_\pi$ as $k$ goes to $\infty$.

In pseudocode, an algorithm to perform these successive updates is shown below.

Iterative Policy Evaluation for Estimating $V \approx v_\pi$

Input $\pi$, the policy to be evaluated

Algorithm parameter: a small threshold $\theta > 0$ determining accuracy of estimation

  1. Initialize $V(s)$ arbitrarily for each $s$, with $V(terminal) = 0$

  2. $\Delta \leftarrow \infty$

  3. While $\Delta > \theta$:

    $\Delta \leftarrow 0$

    For each $s \in S$:

      $v \leftarrow V(s)$

      $V(s) \leftarrow \sum_a \pi(a|s) \sum_{s', r} p(s',r | s, a)[r + \gamma V(s')]$

      $\Delta \leftarrow \max(\Delta, |v - V(s)|)$

  4. Return $V \approx v_\pi$
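To make the loop above concrete, here is a minimal, hypothetical sketch of iterative policy evaluation. It assumes the dynamics are handed to us as a dictionary P[s][a] of (prob, next_state, reward, done) tuples (the convention used by Gym's toy-text environments, e.g. env.unwrapped.P for FrozenLake) and that policy[s][a] gives the action probabilities; in our later experiments the dynamics are unknown, so this sketch is purely illustrative.

In [ ]:
def policy_evaluation(P, policy, n_states, gamma=0.99, theta=1e-8):
    """Iteratively estimate v_pi for a fixed stochastic policy.
    P[s][a] is a list of (prob, next_state, reward, done) tuples."""
    V = np.zeros(n_states)  # V(terminal) effectively stays 0 via the (not done) factor below
    delta = np.inf
    while delta > theta:
        delta = 0
        for s in range(n_states):
            v_old = V[s]
            # Expected one-step return under the policy, bootstrapping from the current V
            V[s] = sum(policy[s][a] * prob * (r + gamma * V[s2] * (not done))
                       for a in range(len(P[s]))
                       for prob, s2, r, done in P[s][a])
            delta = max(delta, abs(v_old - V[s]))
    return V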

Now that we have an iterative method for policy evaluation, how can we use this to implement policy improvement? That is, given that $v_\pi(s)$ tells us how "good" it is to follow policy $\pi$ from $s$, should we change to some new policy $\pi'$? To answer this, we consider the action-value function $q_\pi(s, a)$ for policy $\pi$, which gives the expected value of taking action $a$ in state $s$ and following policy $\pi$ after that:

$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a] = \sum_{s', r} p(s', r|s, a)[r + \gamma v_\pi(s')]$$

The policy improvement theorem, of which a simple proof can be found in Chapter 4 of Sutton, intuitively tells us that, given two deterministic policies, one original policy $\pi$ and a modified $\pi'$ which is identical to $\pi$ except that $\pi'(s) = a \neq \pi(s)$, if $q_\pi(s, a) > v_\pi(s)$, then the policy $\pi'$ is indeed an improvement over $\pi$.

Knowing how $q_\pi(s, a)$ can help us evaluate a change in the current policy $\pi$ at state $s$ for action $a$, we can use it to define an improved policy $\pi'$. Specifically, for each state $s$, we consider changes to the policy for all possible actions $a$ in $s$, and devise a greedy policy which selects the action $a$ that maximizes $q_\pi(s, a)$:

$$\pi'(s) = arg \max_a q_\pi(s, a) = arg \max_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a] = arg \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_\pi(s')]$$

This way of improving policy $\pi$ to $\pi'$, using policy evaluation for $v_\pi$, is known as policy iteration. Thus, one algorithm for approximating an optimal policy $\pi_*$ is to follow the above process to produce monotonically improving policies. We present pseudocode for such an algorithm below:

Policy Iteration Using Iterative Policy Evaluation for Estimating $\pi \approx \pi_*$
  1. Initialize V(s) and $\pi(s)$ for each $s$ arbitrarily

  2. Run policy evaluation defined above: using V and $\pi$

  3. Policy Improvement:

policy_stable $\leftarrow$ True

For s in S:

old_action $\leftarrow \pi(s)$

$\pi(s) \leftarrow arg \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma V(s')]$

If old_action $\neq \pi(s)$, then policy_stable $\leftarrow$ False

If policy_stable, then stop and return $V \approx v_*$ and $\pi \approx \pi_*$; else go back to 2.
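A compact sketch of the full policy-iteration loop, under the same hypothetical P[s][a] dynamics convention and reusing the policy_evaluation sketch above:

In [ ]:
def q_from_v(P, V, s, n_actions, gamma):
    """One-step lookahead: q_pi(s, a) for every action a, given the current V."""
    return np.array([sum(prob * (r + gamma * V[s2] * (not done))
                         for prob, s2, r, done in P[s][a])
                     for a in range(n_actions)])

def policy_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    policy = np.ones((n_states, n_actions)) / n_actions     # start from the uniform random policy
    while True:
        V = policy_evaluation(P, policy, n_states, gamma, theta)   # evaluation sketch from above
        policy_stable = True
        for s in range(n_states):
            old_action = np.argmax(policy[s])
            best_action = np.argmax(q_from_v(P, V, s, n_actions, gamma))
            if old_action != best_action:
                policy_stable = False
            policy[s] = np.eye(n_actions)[best_action]      # greedy (deterministic) improvement
        if policy_stable:
            return V, policy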

One drawback to the above policy iteration algorithm is that for each improvement made to the policy, we must run an entire policy evaluation step which only becomes exact in the limit to $\infty$. Thus, a simpler update approach, value iteration, is often used instead. Value iteration is similar to the policy evaluation update scheme, but instead requires that we take the maximum over all actions of the expected value. That is:

$$v_{k+1}(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) | S_t = s, A_t = a] =\max_a \sum_{s', r}p(s', r | s, a)[r + \gamma v_k(s')]$$

The key idea and intuition behind value iteration is that we use the Bellman optimality equation as an update rule for these successive function approximations. The usefulness of this method is that it is able to combine, in each iteration, a policy evaluation step and a policy improvement step. This is perhaps best illustrated by looking at the pseudocode for value iteration:

Value Iteration for Estimating $\pi \approx \pi_*$
  1. Initialize $V(s)$ arbitrarily for each $s$, with $V(terminal) = 0$

  2. $\Delta \leftarrow \infty$

  3. While $\Delta > \theta$:

    $\Delta \leftarrow 0$

    For each $s \in S$:

      $v \leftarrow V(s)$

      $V(s) \leftarrow \max_a \sum_{s', r} p(s',r | s, a)[r + \gamma V(s')]$

      $\Delta \leftarrow \max(\Delta, |v - V(s)|)$

  4. Output a deterministic policy $\pi \approx \pi_*$, such that: $$\pi(s) = arg \max_a \sum_{s', r} p(s',r | s, a)[r + \gamma V(s')]$$

Thus, value iteration is essentially multiple passes through a special case of policy iteration, where the policy evaluation step is terminated after one sweep through all the states. It turns out -- once again this is beyond the scope of this review -- that this approach is also guaranteed to converge to $v_*$.
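A corresponding value-iteration sketch, again under the assumed P[s][a] dynamics dictionary, which folds the one-sweep evaluation and the greedy improvement into a single update:

In [ ]:
def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    def lookahead(V, s):
        # Expected one-step return of each action from state s
        return [sum(prob * (r + gamma * V[s2] * (not done))
                    for prob, s2, r, done in P[s][a])
                for a in range(n_actions)]

    V = np.zeros(n_states)
    delta = np.inf
    while delta > theta:
        delta = 0
        for s in range(n_states):
            v_old = V[s]
            V[s] = max(lookahead(V, s))            # Bellman optimality update
            delta = max(delta, abs(v_old - V[s]))
    # Extract the deterministic greedy policy from the converged V
    policy = np.array([np.argmax(lookahead(V, s)) for s in range(n_states)])
    return V, policy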

Monte-Carlo

While the DP methods discussed above are conceptually simple and quite easy to implement, one disadvantage to them is that they typically rely on the assumption that the dynamics of the environment are fully known. However, in most RL problems, we do not have this luxury. In fact, in a number of problems, we may know extremely little about the environment. In these cases, we rely on Monte Carlo methods (methods which sample sequences of states, actions and rewards to estimate expected values of state-action pairs) to learn about the environment experientially, and therefore to learn optimal value functions and policies.

These methods are relatively straightforward and easy to follow. For the problem of estimating the state-value function $v_\pi(s)$, we generate a large number of episodes following policy $\pi$ (sequences of states, actions, and rewards), and through each episode we keep track of the returns following visits to $s$ for all states $s$ (first-visit MC tracks returns following only the first visit to $s$, while every-visit MC tracks returns following all visits to $s$), then assign the average of these returns for state $s$ to $v_\pi(s)$. Below, we present the pseudocode for this method:

First-Visit Monte-Carlo Estimation for $V \approx v_\pi(s)$

Input: a policy $\pi$ to be evaluated, $n$: number of episodes to generate used for estimation

  1. Initialize: $V(s)$ arbitrarily for all $s$; $Returns(s)$ an empty list for all $s$

  2. Loop for $n$ times:

    Generate an episode following $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T - 1}, A_{T-1}, R_T$

    $G \leftarrow 0$

    For each step of the episode, $t = T - 1, T - 2, ..., 0$:

      $G \leftarrow \gamma G + R_{t+1}$

      If $S_t$ does not appear in $S_0, S_1, ..., S_{t-1}$:

        Append $G$ to $Returns(S_t)$

        $V(S_t) \leftarrow \textrm{average}(Returns(S_t))$

Note that in order to implement every-visit MC, we would simply drop the line where we check if $S_t$ appears earlier in the episode. Both methods converge to $v_\pi(s)$ as the number of visits to $s$ becomes infinite, by the law of large numbers. LLN also tells us that the variance of the mean estimates for each state that we obtain through this method is inversely proportional to the number of returns averaged.
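A minimal first-visit Monte Carlo prediction sketch, which needs only the ability to sample episodes from a Gym-style environment with hashable (discrete) states, such as Taxi-v2; the policy argument is an assumed function mapping a state to an action.

In [ ]:
from collections import defaultdict

def mc_first_visit_prediction(env, policy, n_episodes=5000, gamma=1.0):
    """Estimate v_pi(s) by averaging first-visit returns sampled under `policy`."""
    returns = defaultdict(list)
    V = defaultdict(float)
    for _ in range(n_episodes):
        # Generate one episode following the policy
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, reward))   # reward here is R_{t+1}
            state = next_state
        # Record the index of the first visit to each state in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk the episode backwards, accumulating the discounted return G
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:
                returns[s].append(G)
                V[s] = np.mean(returns[s])
    return V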

It is then also quite straightforward to devise a similar Monte-Carlo policy iteration algorithm, analogous to the one shown in the dynamic programming section above. We show pseudocode for such a method below:

Monte Carlo Algorithm for Estimating $Q \approx q_*$ and $\pi \approx \pi_*$
  1. Initialize: $\pi(s)$ arbitrarily for all $s$; $Q(s, a)$ arbitrarily for all $s$, $a$; $Returns(s, a)$ an empty list for all $s$, $a$

  2. Loop for $n$ times:

    Choose $S_0, A_0$ randomly such that all pairs have nonzero probability

    Generate an episode following $\pi$ from $S_0, A_0$: $S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T - 1}, A_{T-1}, R_T$

    $G \leftarrow 0$

    For each step of the episode, $t = T - 1, T - 2, ..., 0$:

      $G \leftarrow \gamma G + R_{t+1}$

      If the pair $S_t, A_t$ does not appear in $S_0, A_0, S_1, A_1, ..., S_{t-1}, A_{t-1}$:

        Append $G$ to $Returns(S_t, A_t)$

        $Q(S_t, A_t) \leftarrow \textrm{average}(Returns(S_t, A_t))$

        $\pi(S_t) \leftarrow arg \max_a Q(S_t, a)$

The reason we start off with choosing $S_0, A_0$ randomly for each episode is that, in order to ensure theoretical convergence for this algorithm, we must ensure that all state-action pairs get visited with positive probability. Starting each episode with a random state-action pair satisfies this. This is known as exploring-starts.

Thus, with Monte-Carlo algorithms, even if we know nothing about the environment, we can generate reliable estimates for value functions and optimal policies. While there is of course much more that can be said about Monte-Carlo algorithms for reinforcement learning, these other topics are outside the scope of our project. We explore below our particular application of interest for MC ideas, in Temporal Difference Learning.

Temporal Difference (TD) & Q-Learning

Sutton and Barto describe temporal-difference learning as likely the most "central and novel" idea in reinforcement learning. Its effectiveness relies on its synergistic combination of ideas from dynamic programming and Monte-Carlo. Most fundamentally, temporal-difference methods sample from the environment as in Monte-Carlo, and iteratively perform updates based on current MC-based estimates, like in DP. We recall here the general form of this update for TD:

$$V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$$

where the expression in brackets is known as the TD-error, and is the difference between the current estimate for the value of $S_t$ and an updated estimate based on the immediate reward at the next time step and the estimated value of the next state. This scheme can be modified in many different ways for different purposes in RL. In this project, we specifically review Q-learning, an off-policy temporal-difference method for approximating the optimal action-value function $q_*$. The update scheme behind Q-learning is given below, where $Q$ is the learned approximation to the action-value function:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$

We note that the above rule makes no mention of a specific policy, revealing some of the magic behind Q-learning: through successive updates following the above rule, the function $Q$ directly approximates $q_*$ independent of the policy being followed, so long as the policy ensures that all state-action pairs continue to be visited, and therefore updated. Given this minimal requirement, correct convergence is guaranteed (proof beyond our current scope). We give the Q-learning procedure in pseudocode below:

Q-learning for estimating $\pi \approx \pi_*$
  1. Input parameters: step size $\alpha \in (0, 1]$ and small $\epsilon > 0$.

  2. Initialize $Q(s, a)$ for all $s, a$ arbitrarily except that $Q(terminal, \cdot) = 0$

  3. For each episode:

  4. Initialize $S$

  5. For each step of episode until S is terminal:

  6. Choose $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)

  7. Take action $A$ and observe $R, S'$

  8. $Q(S, A) \leftarrow Q(S, A) + \alpha[R + \gamma \max_a Q(S', a) - Q(S, A)]$

  9. $S \leftarrow S'$

Q-Learning Implementation

We're now ready to explore some implementations! We implement a function below which performs the Q-learning algorithm described above by representing the Q-function as a table (a Python dictionary in this implementation), apply this function to a number of environments in Gym, and discuss our results. Note: the parameters we choose for each environment are already tuned, and the function contains a number of modifications for the specifics of each environment:

In [39]:
def QLearn(environment = 'MountainCar-v0', 
           alpha = 0.5, 
           gamma = 1, 
           epsilon = 1, 
           min_eps = 0, 
           eps_mult = 3, 
           episodes = 15000, 
           bin_breaks = np.array([0.1, 0.01])):
    
    if environment.startswith('Taxi'): # Avoid discretizing for this already discrete environment
        bin_breaks = np.array([1])
        
    env = gym.make(environment)
    env.reset()
    
    Q = {} # Initialize dictionary
    
    rewards = [] # Initialize rewards
    ave_rewards = []
    
    reduction = eps_mult * (epsilon - min_eps)/episodes # Epsilon-annealing scheme
    
    def get_Qval(state, action=None): # get current value for state-action pair or all possible action-values from a state
                                      # If the state does not yet exist, initialize randomly.
        try:
            res = Q[state]
        except KeyError:
            Q[state] = np.random.uniform(low = -1, high = 1, size = (env.action_space.n))
            res = Q[state]
            
        if action is not None:
            return(res[action])
        return(res)
    
    
    for episode in range(episodes):
        done = False
        total_reward = 0
        reward = 0
        state = env.reset()
        
        if environment.startswith('CartPole'): # specific handling for infinite values in some CartPole state dimensions
            # Discretize the continous space
            discrete_state = tuple(np.round((np.array([(state - env.observation_space.low)[i] for i in [0, 2]] + [state[1] + 2, state[3] + 3.5])) / bin_breaks, 0).astype(int))
          
        elif environment.startswith('Taxi'): # No need to discretize an already discrete problem.
            discrete_state = state
            
        else:
            # Discretize
            discrete_state = tuple(np.round((state - env.observation_space.low) / bin_breaks, 0).astype(int))
        
        while not done:
            
            # Epsilon-greedy
            if np.random.random() < 1 - epsilon:
                action = np.argmax(get_Qval(discrete_state))
            else:
                action = np.random.randint(0, env.action_space.n)
                
            # Get next time step information
            next_state, reward, done, info = env.step(action)
            
            # Discretize next step if necessary
            if environment.startswith('CartPole'):
                next_discrete_state = tuple(np.round((np.array([(next_state - env.observation_space.low)[i] for i in [0, 2]] + [next_state[1] + 2, next_state[3] + 3.5])) / bin_breaks, 0).astype(int))
            
            elif environment.startswith('Taxi'):
                next_discrete_state = next_state
            else:
                next_discrete_state = tuple(np.round((next_state - env.observation_space.low) / bin_breaks, 0).astype(int))
            
            # Check if episode is finished
            if environment.startswith('MountainCar'):
                if done and next_state[0] >= 0.5:
                    Q[discrete_state][action] = reward
                    
                # If not finished, perform TD update
                else:
                    next_state_Q_val = get_Qval(next_discrete_state)
                    TD = reward + gamma * np.max(next_state_Q_val) - get_Qval(discrete_state, action)
                    Q[discrete_state][action] += alpha * TD
                    
                        
            else:                
                if done:
                    Q[discrete_state][action] = reward
               
                else:
                    next_state_Q_val = get_Qval(next_discrete_state)
                    TD = reward + gamma * np.max(next_state_Q_val) - get_Qval(discrete_state, action)
                    Q[discrete_state][action] += alpha * TD
                                     
            total_reward += reward
            discrete_state = next_discrete_state
        
        if epsilon > min_eps:
            epsilon -= reduction
        
        rewards.append(total_reward)
        
        if episode % 100 == 0:
            ave_reward = np.mean(rewards)
            ave_rewards.append(ave_reward)
            rewards = []
            
        if episode % 1000 == 0:    
            print('Episode {} Average Reward: {}'.format(episode, ave_reward))
        
    print("MAX AVERAGE REWARD: {}".format(np.max(ave_rewards)))
    
    plt.plot(100 * (np.arange(len(ave_rewards)) + 1), ave_rewards)
    plt.xlabel('Episodes')
    plt.ylabel('Average Reward')
    plt.show()

Taxi-v2: https://gym.openai.com/envs/Taxi-v2/

We start off with a rather simple environment, quite similar to the Frozen-Lake environment we explored in class. More information on this environment and its state, action, and reward structure can be found at the link above.

In [25]:
QLearn(environment='Taxi-v2', eps_mult=5, episodes=20000)
Episode 0 Average Reward: -839.0
Episode 1000 Average Reward: -198.21
Episode 2000 Average Reward: -47.04
Episode 3000 Average Reward: -8.77
Episode 4000 Average Reward: 7.82
Episode 5000 Average Reward: 8.15
Episode 6000 Average Reward: 8.73
Episode 7000 Average Reward: 8.15
Episode 8000 Average Reward: 8.86
Episode 9000 Average Reward: 8.32
Episode 10000 Average Reward: 8.66
Episode 11000 Average Reward: 8.77
Episode 12000 Average Reward: 8.34
Episode 13000 Average Reward: 8.35
Episode 14000 Average Reward: 8.27
Episode 15000 Average Reward: 8.52
Episode 16000 Average Reward: 8.46
Episode 17000 Average Reward: 8.88
Episode 18000 Average Reward: 8.19
Episode 19000 Average Reward: 8.58
MAX AVERAGE REWARD: 9.19

As we can see, Q-learning works quite well on this Taxi problem. We achieve a maximum average reward (taken over 100 episodes) of above 9, which is good for this problem. The fact that the Taxi environment is low-dimensional and discrete makes it an excellent fit for this tabular Q-learning approach -- indeed, we see that the algorithm converges quite rapidly. Below, we explore environments that are either higher-dimensional, continuous, or both, and see how Q-learning performs on these problems in comparison.

MountainCar-v0: https://gym.openai.com/envs/MountainCar-v0/

The MountainCar environment is also quite simple and low-dimensional, but is not discrete. Thus, in order to implement a tabular Q-learning approach for such a problem, our Q-learner discretizes the state space into bins (this is where the bin_breaks argument is used).
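For illustration, here is how a single continuous MountainCar observation is mapped to a discrete table key under the default bin_breaks used above (position bins of width 0.1, velocity bins of width 0.01); the printed key is only an example, since env.reset() is random.

In [ ]:
env = gym.make('MountainCar-v0')
state = env.reset()                    # e.g. array([-0.52,  0.  ])
bin_breaks = np.array([0.1, 0.01])

# Same transformation as in QLearn: shift by the observation-space lower bound,
# divide by the bin widths, and round to integer bin indices
discrete_state = tuple(np.round((state - env.observation_space.low) / bin_breaks, 0).astype(int))
print(discrete_state)                  # e.g. (7, 7)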

In [40]:
QLearn(environment='MountainCar-v0')
Episode 0 Average Reward: -200.0
Episode 1000 Average Reward: -200.0
Episode 2000 Average Reward: -200.0
Episode 3000 Average Reward: -200.0
Episode 4000 Average Reward: -200.0
Episode 5000 Average Reward: -190.95
Episode 6000 Average Reward: -163.73
Episode 7000 Average Reward: -138.77
Episode 8000 Average Reward: -139.17
Episode 9000 Average Reward: -137.96
Episode 10000 Average Reward: -139.42
Episode 11000 Average Reward: -139.65
Episode 12000 Average Reward: -140.1
Episode 13000 Average Reward: -139.74
Episode 14000 Average Reward: -139.85
MAX AVERAGE REWARD: -137.61

We see that this approach does decently well on MountainCar. It is certainly able to find its way to the top of the mountain, which is good, but not in as few steps as we might like -- the standard for this problem is typically around -120, and we hover at around -138. We hypothesize that this is due, in large part, to the fact that we're discretizing a continuous space. Below, we will explore methods that use functional approximations to the $Q$ function rather than representing it as a discrete state-action pair table, where we see better results.

CartPole-v0: https://gym.openai.com/envs/CartPole-v0/

The CartPole environment is not as simple as the previous two: it has higher state dimensionality and is not discrete. Additionally, a couple of these dimensions have infinite range, so we have to impose our own limits in discretization. We explore tabular-Q performance here:

In [41]:
QLearn('CartPole-v0', alpha=0.8, bin_breaks = np.array([0.0001] * 4), episodes=15000)
Episode 0 Average Reward: 14.0
Episode 1000 Average Reward: 21.48
Episode 2000 Average Reward: 20.31
Episode 3000 Average Reward: 20.47
Episode 4000 Average Reward: 24.57
Episode 5000 Average Reward: 20.59
Episode 6000 Average Reward: 20.84
Episode 7000 Average Reward: 20.46
Episode 8000 Average Reward: 23.72
Episode 9000 Average Reward: 21.84
Episode 10000 Average Reward: 22.59
Episode 11000 Average Reward: 21.44
Episode 12000 Average Reward: 21.19
Episode 13000 Average Reward: 22.01
Episode 14000 Average Reward: 23.25
MAX AVERAGE REWARD: 25.17

Here we see that the tabular Q-learning approach does poorly on this problem. Once again, we believe this is a consequence of greater problem complexity (higher state dimensionality) and the problem's continuity. As before, we will see better results with our functional approximation methods below.

Acrobot-v1: https://gym.openai.com/envs/Acrobot-v1/

Acrobot is perhaps the most complex of the environments we test here -- it has the highest state dimensionality of the four and needs to be discretized. We once again see that this poses problems for the tabular Q-learner.

In [42]:
QLearn('Acrobot-v1', alpha=0.1, episodes=5000, bin_breaks=np.array([0.0001] * 6))
Episode 0 Average Reward: -500.0
Episode 1000 Average Reward: -498.55
Episode 2000 Average Reward: -498.19
Episode 3000 Average Reward: -497.92
Episode 4000 Average Reward: -499.68
MAX AVERAGE REWARD: -496.15

Again as expected for this combination of Q-learning strategy and high-dimensional problem, we see poor learning results and very slow training time. It is now time to dive into functional approximations and deep Q learning to see if we can improve.

Deep Q-Learning

We saw above cases in which it is difficult for tabular Q-learning to scale well with increasing state and action dimensionality, as well as with continuity. For such problems, it is often much more useful to replace the Q-table with a learnable function representation of the action-value function, while continuing to update based on temporal difference as above. This function would take in some basis functions of the environment's state and try to estimate the value of each action from that state. Many learnable functional forms can be used; we could represent $Q$ as just a linear function of some state-based features: $Q = \sum_i \theta_i x_i(s)$, or we could use a neural network to approximate the $Q$ function. The latter approach, combining the ideas of TD-learning with deep neural networks, is often referred to as Deep Q-Learning, and we explore its results on the three continuous environments here.

Below, we define the DQN agent and a function to train it in various environments. This code was inspired by the HW6 209 code.

In [2]:
class DQNAgent:
    def __init__(self, state_size, action_size, hidden):
        # if you want to see Cartpole learning, then change to True
        self.render = False
        # get size of state and action
        self.state_size = state_size
        self.action_size = action_size
        
        self.hidden = hidden

        # These are hyper parameters for the DQN
        self.discount_factor = 0.99
        self.learning_rate = 0.001
        self.epsilon = 1.0
        self.epsilon_decay = 0.999
        self.epsilon_min = 0.01
        self.batch_size = 64
        self.train_start = 1000
        # create replay memory using deque
        self.memory = deque(maxlen=2000)

        # create main model and target model
        self.model = self.build_model()
        self.target_model = self.build_model()

        # initialize target model
        self.update_target_model()

    def build_model(self):
        model = Sequential()
        model.add(Dense(self.hidden, input_dim=self.state_size, activation='relu',
                        kernel_initializer='he_uniform'))
        model.add(Dense(self.hidden, activation='relu',
                        kernel_initializer='he_uniform'))
        model.add(Dense(self.action_size, activation='linear',
                        kernel_initializer='he_uniform'))
        model.summary()
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        return model

    # after some time interval update the target model to be same with model
    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())

    # get action from model using epsilon-greedy policy
    def get_action(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        else:
            q_value = self.model.predict(state)
            return np.argmax(q_value[0])

    # save sample <s,a,r,s'> to the replay memory
    def append_sample(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    # pick samples randomly from replay memory (with batch_size)
    def train_model(self):
        if len(self.memory) < self.train_start:
            return
        batch_size = min(self.batch_size, len(self.memory))
        mini_batch = random.sample(self.memory, batch_size)

        update_input = np.zeros((batch_size, self.state_size))
        update_target = np.zeros((batch_size, self.state_size))
        action, reward, done = [], [], []

        for i in range(batch_size):
            update_input[i] = mini_batch[i][0]
            action.append(mini_batch[i][1])
            reward.append(mini_batch[i][2])
            update_target[i] = mini_batch[i][3]
            done.append(mini_batch[i][4])

        target = self.model.predict(update_input)
        target_val = self.target_model.predict(update_target)

        for i in range(batch_size):
            
            # Q Learning: get maximum Q value at s' from target model
            if done[i]:
                target[i][action[i]] = reward[i]
            else:
                target[i][action[i]] = reward[i] + self.discount_factor * (
                    np.amax(target_val[i]))

        
        self.model.fit(update_input, target, batch_size=self.batch_size,
                       epochs=1, verbose=0)
        
        
def DQNLearn(environment, n_episodes, hidden):
    env = gym.make(environment)
    
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n

    agent = DQNAgent(state_size, action_size, hidden)

    scores, episodes = [], []
    learned = False

    for e in range(n_episodes):
        done = False
        score = 0
        state = env.reset()
        state = np.reshape(state, [1, state_size])

        while not done:
            if agent.render:
                env.render()

            # get action for the current state and go one step in environment
            action = agent.get_action(state)
            next_state, reward, done, info = env.step(action)
            next_state = np.reshape(next_state, [1, state_size])

            # save the sample <s, a, r, s'> to the replay memory
            agent.append_sample(state, action, reward, next_state, done)
            
            # every time step do the training
            agent.train_model()
            score += reward
            state = next_state

            if done:
                # every episode update the target model to be same with model
                agent.update_target_model()

                scores.append(score)
                episodes.append(e)
                if e%50 == 0:
                    print('Episode ',e, ' over.')

                if np.mean(scores[-min(10, len(scores)):]) > 490:
                    learned = True
                    break
        if learned:
            break

            
    pylab.plot(episodes, scores, 'b')
    pylab.xlabel('Episode Number')
    pylab.ylabel('Collected Rewards')
    plt.title('Collected Rewards vs Episode Number for a DQN agent in ' + environment + ' task')
    pylab.show()

MountainCar

In [5]:
DQNLearn('MountainCar-v0', 500, 128)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_13 (Dense)             (None, 128)               384       
_________________________________________________________________
dense_14 (Dense)             (None, 128)               16512     
_________________________________________________________________
dense_15 (Dense)             (None, 3)                 387       
=================================================================
Total params: 17,283
Trainable params: 17,283
Non-trainable params: 0
_________________________________________________________________
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_16 (Dense)             (None, 128)               384       
_________________________________________________________________
dense_17 (Dense)             (None, 128)               16512     
_________________________________________________________________
dense_18 (Dense)             (None, 3)                 387       
=================================================================
Total params: 17,283
Trainable params: 17,283
Non-trainable params: 0
_________________________________________________________________
Episode  0  over.
Episode  50  over.
Episode  100  over.
Episode  150  over.
Episode  200  over.
Episode  250  over.
Episode  300  over.
Episode  350  over.
Episode  400  over.
Episode  450  over.

As we can see, this deep network performs quite well, consistently reaching the goal in under 120 steps (considered a high standard for this problem) and often beating the 100-step benchmark, which is quite impressive. Thus, the deep network does outperform the discretized, tabular Q-learning approach, even though the latter performed adequately. We suspect that this improvement is due to the network's ability to take the continuity of the space into account, rather than having to discretize.

CartPole

In [4]:
DQNLearn('CartPole-v1', 300, 24)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_7 (Dense)              (None, 24)                120       
_________________________________________________________________
dense_8 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_9 (Dense)              (None, 2)                 50        
=================================================================
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_10 (Dense)             (None, 24)                120       
_________________________________________________________________
dense_11 (Dense)             (None, 24)                600       
_________________________________________________________________
dense_12 (Dense)             (None, 2)                 50        
=================================================================
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
Episode  0  over.
Episode  50  over.
Episode  100  over.
Episode  150  over.
Episode  200  over.
Episode  250  over.

This deep Q-network performs much better than the discretized Q-table approach on the CartPole problem. We achieve average results in the mid-hundreds, more than surpassing the threshold set by Gym for "solving" this problem. We believe that this network's success compared to the Q-table approach is due to its ability to capture a complex and continuous functional approximation for this high-dimensional problem, instead of being forced to discretize a high-dimensional state space.

Acrobot-v1

In [6]:
DQNLearn('Acrobot-v1', 500, 128)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_19 (Dense)             (None, 128)               896       
_________________________________________________________________
dense_20 (Dense)             (None, 128)               16512     
_________________________________________________________________
dense_21 (Dense)             (None, 3)                 387       
=================================================================
Total params: 17,795
Trainable params: 17,795
Non-trainable params: 0
_________________________________________________________________
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_22 (Dense)             (None, 128)               896       
_________________________________________________________________
dense_23 (Dense)             (None, 128)               16512     
_________________________________________________________________
dense_24 (Dense)             (None, 3)                 387       
=================================================================
Total params: 17,795
Trainable params: 17,795
Non-trainable params: 0
_________________________________________________________________
Episode  0  over.
Episode  50  over.
Episode  100  over.
Episode  150  over.
Episode  200  over.
Episode  250  over.
Episode  300  over.
Episode  350  over.
Episode  400  over.
Episode  450  over.

Once again, we see much better performance by this network over the Q-table for the same problem. We consistently achieve the goal in under 100 steps, and often around 75 steps, which is quite impressive for this problem (as can be seen on the Acrobot-v1 leaderboard in the OpenAI Gym wiki on GitHub). As before, we believe that this improvement in performance is due to the fact that we are no longer discretizing a high-dimensional and continuous space, and can therefore learn a complex functional representation of $Q$.

We are now ready to depart from the methods we have discussed above, and explore an entirely new method based on policy gradients. We describe the theory behind these methods, and show a number of implementations. Finally, we conclude this project by comparing what we see from the methods discussed above and policy gradient methods.

Policy Gradient Methods

With $Q$-learning, we've attempted to create a model that estimates the values of state-action pairs. $Q$ function in hand, the policy is then trivial: take the action with the highest value, most of the time. By contrast, policy gradient methods offer a more direct approach: if what we really want to know is which actions to take given states, why not just find an approximator for a policy function and forget about $Q$? After all, if we knew which actions to take, we wouldn't care so much about their associated values.

The Policy Function and Performance Measure

More formally, we will refer to this policy function as $\pi(a|s,\theta)$, which represents the probability of taking action $a$ given state $s$ and is parameterized by a vector $\theta$. Now, if only we knew how to make this policy function incrementally better, we could iterate the process. We introduce $J(\theta)$ as the performance measure of a policy to define 'better'. There are different ideas for how to define $J$, but for now we think of it as the total sum of discounted rewards we get by following $\pi(a|s,\theta)$ at some starting state.

Let's consider how we might mathematically construct $\pi(a|s,\theta)$. First, since we need $J(\theta)$ to be differentiable, and the derivative of $J$ will depend on the derivative of $\pi$, we want $\pi$ to be differentiable. Second, we need $\sum_{a} \pi(a|s) = 1$, because we know that the probability of taking some action is exactly 1. To force this second condition while not breaking the differentiability condition, we can construct $\pi(a|s)$ in the following way:

Choose some $h(a|s, \theta) \in \mathbb{R}$, which represents some preference or excitement we have for picking action $a$, then use the softmax function to force condition two. So we have $\pi(a|s,\theta) = \frac{e^{h(a|s, \theta)}}{\sum_{b} e^{h(b|s, \theta)}}$, where $b$ is used to represent all possible actions. Our choice for $h(a|s, \theta)$ would naturally depend on the complexity of the problem. For simple problems, we might design $h$ to depend linearly on $\theta$; for AlphaZero, DeepMind used a neural network. We'll return to $h$ after we get a basic strategy for performing gradient ascent on $J$.
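As a tiny, hypothetical illustration of this construction, the snippet below builds a softmax policy from a linear preference function $h(a|s,\theta) = \theta^{T} x(s,a)$, where the features function is an assumed feature map; the resulting probabilities sum to 1 and are differentiable in $\theta$.

In [ ]:
def softmax_policy(theta, state, actions, features):
    """pi(a|s, theta) built from preferences h(a|s, theta) = theta . x(s, a)."""
    h = np.array([theta @ features(state, a) for a in actions])
    h -= h.max()                              # stabilize the exponentials
    probs = np.exp(h) / np.exp(h).sum()
    return probs

# Hypothetical example: 3 actions, 2-dimensional features
features = lambda s, a: np.array([a * s[0], a * s[1]])
theta = np.array([0.5, -1.0])
print(softmax_policy(theta, np.array([1.0, 0.2]), actions=[-1, 0, 1], features=features))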

The Gradient of $J$ | The Policy Gradient Theorem

Regardless of the specifics of how we design our policy function, our entire game plan will boil down to finding a $\theta$ to crank up the value of $J(\theta)$. Naturally, we're interested in the gradient of $J(\theta)$. If we could only write this gradient in terms of things we can calculate, we could perform a gradient ascent algorithm on $J(\theta)$. Initially the situation looks grim: the performance of our policy depends not only on $\theta$, but also on the states our policy will throw us into. However, consider the logic behind the following key proportion:

$$ \triangledown J(\theta) \propto \sum_{s} \big( \mu(s) \sum_{a} q_{\pi}(s,a) \ \triangledown_{\theta} \pi(a | s, \theta) \big)$$

Now, let's understand this proportion, keeping in mind that we don't really care about the constant of proportionality for gradient ascent. Here $\mu(s)$ is the fraction of time steps we spend in state $s$. For a moment let's pretend we live in an environment with only one state. Then we can ignore the sum over states and look at the inner sum: $\triangledown J(\theta) \propto \sum_{a} q_{\pi}(s,a) \ \triangledown_{\theta} \pi(a | s, \theta)$.

Imagine slightly turning the knobs on $\theta$, and as a result we nudge $\pi(a | s, \theta)$ for each action. Taking the sum over actions in the above, we are getting the net result of all those action probability deltas weighted by their action values, and that, of course, is what we expect to have an effect on $J(\theta)$! It is important to note here that it is only $\triangledown_{\theta} \pi(a | s, \theta)$ that plays a role in this formulation, not $\pi(a | s, \theta)$ itself. It doesn't matter how likely or unlikely the actions are when calculating the effect on $J(\theta)$; only the changes in those probabilities matter. Increasing a good action from happening 1% of the time to 2% of the time should have the same effect on $J(\theta)$ as increasing an equally good action from happening 90% of the time to 91% of the time. This will be important to recall later.

Reverting back to the multi-state environment, the effect that each state has on $J(\theta)$ simply enters in proportion to how frequently we're in that state.

The Gradient of $J$ as an Expectation | Monte Carlo REINFORCE

So we've got the gradient of $J$ in terms of the gradient of $\pi$, and we have the gradient of $\pi$ because we promise to make $\pi$ a function we can differentiate. What we don't have is $\mu(s)$. In fact, we are blind with respect to knowledge about all the states and the $q_{\pi}$ values for those states and possible actions. However, we do know how to collect samples of states, actions and corresponding $q_{\pi}(s,a)$ values: just let our agent run around and memorize the rewards achieved after visiting certain states and taking certain actions. So we desire to write $\triangledown J(\theta)$ as proportional to some expectation over samples. For starters, we have:

$$ \triangledown J(\theta) \propto \sum_{s} \big( \mu(s) \sum_{a} q_{\pi}(s,a) \ \triangledown_{\theta} \pi(a | s, \theta) \big) = E_{\pi} \bigg[ \sum_{a} q_{\pi}(S_{t},a) \ \triangledown_{\theta} \pi(a | S_{t}, \theta) \bigg]$$

We can turn the $\mu(s)$ into an expectation over some sampled state $S_{t}$ simply because we will sample states in proportion to $\mu(s)$, the percent of time we're in them.

Now let's multiply and divide by $\pi(a | S_{t}, \theta)$ inside that expectation:

$$ \triangledown J(\theta) \propto E_{\pi} \bigg[ \sum_{a} q_{\pi}(S_{t},a) \ \triangledown_{\theta} \pi(a | S_{t}, \theta) \bigg] = E_{\pi} \bigg[ \sum_{a} \pi(a | S_{t}, \theta)\ q_{\pi} (S_{t},a) \ \frac{\triangledown_{\theta} \pi(a | S_{t}, \theta)}{\pi(a | S_{t}, \theta)} \bigg]$$

Now, recalling the definition of an expectation, $E f(x) = \sum_{x} p(x) f(x)$, inside those last big brackets we have precisely this form. So we can replace the sum over actions, weighted by $\pi(a | S_{t}, \theta)$, with one final layer of expectation over the sampled action $A_{t}$, and wrap everything into a single expectation:

$$ = E_{\pi} \bigg[ q_{\pi} (S_{t},A_{t}) \ \frac{\triangledown_{\theta} \pi(A_{t} | S_{t}, \theta)}{\pi(A_{t} | S_{t}, \theta)} \bigg]$$

Finally, since the sampled discounted return $G_{t}$ equals $q_{\pi}(S_{t}, A_{t})$ in expectation, we can replace $q_{\pi}(S_{t}, A_{t})$ with $G_{t}$, and so we've come to our prize:

$$ \triangledown J(\theta) \propto E_{\pi} \bigg[ G_{t} \ \frac{\triangledown_{\theta} \pi(A_{t} | S_{t}, \theta)}{\pi(A_{t} | S_{t}, \theta)} \bigg]$$

We will be able to calculate or sample everything in this final expression for $\triangledown J(\theta)$, and so we've got the essential ingredients for a gradient ascent algorithm! What's more, this final expression makes simple intuitive sense. Imagine having many sample tuples $(S_{t}, A_{t}, G_{t})$ and calculating this expected value by averaging. If $G_{t}$ is big for a particular sample, this action worked out well for us, so we expect this sample, in the average, to push $\theta$ in a direction that increases $\pi(A_{t} | S_{t}, \theta)$, precisely as our proportion demands. To understand the role of $\pi$ in the denominator, recall from earlier that $\pi$ itself should not play into $\triangledown J(\theta)$, as it was formulated as a sum over all actions rather than an expectation. Now that we do have an expectation, we must account for the fact that we may only encounter some actions very rarely. We compensate for these rare actions by dividing by $\pi(A_{t} | S_{t}, \theta)$. This way, in calculating $\triangledown J(\theta)$ we still account for every action's $\triangledown_{\theta} \pi(a | S_{t}, \theta)$ equally.

Gradient Ascent for $J(\theta)$

Now we know how to incrementally adjust theta to increase $J(\theta)$:

$$ \theta_{t+1} = \theta_{t} + \alpha' \triangledown J(\theta) = \theta_{t} + \alpha \ G_{t} \ \frac{\triangledown_{\theta} \pi(A_{t} | S_{t}, \theta)}{\pi(A_{t} | S_{t}, \theta)}$$

Where in the above $\alpha'$ and $\alpha$ just differ by a constant. Noting that $\triangledown ln(x) = \frac{\triangledown x}{x}$, we will work with:

$$ \theta_{t+1} = \theta_{t} + \alpha \ G_{t} \ \triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta)$$

Now we're getting very close to being able to write an algorithm to optimize $\theta$ for some policy and some environment. Looking at the above ascent formula, we know that we can pick a learning rate $\alpha$ and explore the environment to sample $G_{t}$, $A_{t}$ and $S_{t}$. Thus far we've promised to be able to differentiate $\pi(A_{t} | S_{t}, \theta)$, so it's time we untangle that last wrinkle.

Recall that we will set $\pi(a|s,\theta) = \frac{e^{h(a|s, \theta)}}{\sum_{b} e^{h(b|s, \theta)}}$ for any policy function we wish to write. What will differ by situation is our choice for $h(a|s, \theta)$, so we'll wrap up our core gradient ascent equation, without losing generality, by writing $\triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta)$ in terms of $h(a|s, \theta)$.

$$\triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta) = \triangledown_{\theta} ln\ \bigg[ \frac{e^{h(A_{t}|S_{t}, \theta)}}{\sum_{b} e^{h(b|S_{t}, \theta)}} \bigg] $$$$ = \triangledown_{\theta} \ h(A_{t}|S_{t}, \theta) - \triangledown_{\theta} \ ln \sum_{b} e^{h(b|S_{t}, \theta)} $$
$$ = \triangledown_{\theta} \ h(A_{t}|S_{t}, \theta) - \frac{ \triangledown_{\theta} \sum_{b} e^{h(b|S_{t}, \theta)}}{ \sum_{b} e^{h(b|S_{t}, \theta)} } $$
$$ = \triangledown_{\theta} \ h(A_{t}|S_{t}, \theta) - \frac{ \sum_{b} \triangledown_{\theta} h(b|S_{t},\theta) \ e^{h(b|S_{t},\theta)}}{ \sum_{b} e^{h(b|S_{t},\theta)} } $$

Now we can't help but recognize that $\pi$ is still lurking here. Finally we have:

$$\triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta) = \triangledown_{\theta} \ h(A_{t}|S_{t}, \theta) \ -\ \sum_{b} \triangledown_{\theta} h(b|S_{t},\theta) \ \pi(b|S_{t}, \theta)$$

Takeaway and intuition

What have we learned? $\triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta)$ is the fundamental step we take to improve our performance $J(\theta)$, and the step is positive for actions that we want to encourage (at least if we standardize $G$). So what is this step telling us to do for this action and state pair $A_{t}, S_{t}$? For a good action, we can see why we want to increase $ln\ \pi(A_{t} | S_{t}, \theta)$. The above equation tells us how to do this: increase $h(A_{t} | S_{t})$, the preference for this action, while not also significantly increasing $h$ for all the other possible actions! Because $\pi$ is fundamentally a softmax, the $h$ preferences for all the actions affect the output for any one action, and so we can understand why the sum over $b$, all the actions, appears in the final accounting.

If we'd really like to probe, we can get even more intuition for why $\pi$ appears in the sum over b. Suppose for some particular action $b'$, $\pi(b' | S_t, \theta) = 0$. In that case $h(b' | S_t, \theta) \approx -\infty$ and $\frac{d \ e^{h(b' | S_t)}}{d h} \approx 0$. Thus, we don't care about this action's effect in the softmax $\pi(A_t| S_t, \theta)$ and $\triangledown_\theta h(b' | S_t, \theta)$ vanishes in the sum over b.
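As a quick sanity check on this identity, the hypothetical snippet below compares the analytic gradient for a linear preference $h = \theta^{T}x$, namely $x(A_t, S_t) - \sum_b \pi(b|S_t)\, x(b, S_t)$, against a finite-difference estimate of $\triangledown_{\theta}\, ln\ \pi(A_t|S_t,\theta)$ for a randomly chosen feature set.

In [ ]:
rng = np.random.RandomState(0)
actions = [0, 1, 2]
x = {a: rng.randn(4) for a in actions}    # assumed feature vectors x(a, s) for one fixed state
theta = rng.randn(4)

def pi(theta):
    h = np.array([theta @ x[a] for a in actions])
    e = np.exp(h - h.max())
    return e / e.sum()

a_t = 1
# Analytic gradient of ln pi(a_t | s, theta) for linear preferences
analytic = x[a_t] - sum(pi(theta)[b] * x[b] for b in actions)

# Finite-difference check, one theta coordinate at a time
eps = 1e-6
numeric = np.array([(np.log(pi(theta + eps * np.eye(4)[i])[a_t]) -
                     np.log(pi(theta - eps * np.eye(4)[i])[a_t])) / (2 * eps)
                    for i in range(4)])
print(np.allclose(analytic, numeric, atol=1e-5))   # should print True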

Application to Mountain Car

Now we're excited to get back to our Mountain Car problem and see if these ideas all make concrete sense! One of the reasons to use policy gradient methods instead of Q learning is that the optimal policy may be fundamentally simpler than Q. For Mountain Car, what is the optimal policy? Well, we want to build momentum so that we can eventually use this gained momentum to beat gravity! We build momentum by increasing our velocity, so perhaps an optimal policy would be to just always push in the direction of current velocity. Let's see if we can't derive this policy from our theory.

Our goal initially is to play with policy gradient theory by keeping things as simple as possible and see if we can understand this concrete example. Once we have a feel for things we can unleash some neural networks. For now it's going to be quite pleasant to see the crudest of tools demolish the Mountain Car problem. We suspect a simple policy, so let's make $h$ a linear function in $\theta$:

$$ h(a,s, \theta) = \theta^{T} \ x(a,s)$$

Here we've thrown all of $h$'s dependency on $a$ and $s$ into a feature vector, $x$. Then we simply take the dot product of $x$ with $\theta$. Note that now we have the simple $\triangledown_{\theta} h(a,s, \theta) = x(a,s)$. Let's create a natural encoding for the actions: let power right = +1, do nothing = 0, and power left = -1. Since we don't believe the position of the car should influence our policy, we'll ignore it for now and write:

$$ x(a, s) = a \frac{s_v}{|s_v|} $$

Where $s_v$ is the state's velocity. Each component of $x$ should be some function of $a$ and $s$. Since we only care about 1 component of $s$, we'll only give $x$ 1 component and make it a product of the action and the sign of the velocity. What could be simpler?

Now we've nailed down our policy function:

$$ h(1, s, \theta) = \theta \frac{s_v}{|s_v|} $$$$ h(0, s, \theta) = 0 $$$$ h(-1, s, \theta) = -\theta \frac{s_v}{|s_v|} $$

So we have constructed $h$ to be as simple as the problem itself, using our knowledge of the problem to make $h$ laser focused on the relevant features. This is an essential step, especially for more difficult problems: the more we can suit the form of $h$ to our domain knowledge, the easier it will be to learn $\theta$. In fact, this basic principle of modeling carries over to all of machine learning.

Now, we've stopped short of specifying $\theta$ because that is exactly what we're supposed to learn! However, it will be instructive to stop and consider what the optimal $\theta$ might be; this way we can see if our learning process is behaving itself. For Mountain Car, we said we want to always move in the direction of velocity; consequently, if, for example, $\frac{s_v}{|s_v|}$ is positive, we want $h(1,s, \theta) \approx \infty$ because our preference for powering right should be overwhelming. We can already see that $\theta$ had better learn to be a big number if we expect our agent to make it out of the valley. Let's confirm that this is precisely the solution the policy gradient theorem will provide.
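Purely for intuition, the hypothetical snippet below evaluates this policy for a positive velocity sign at a few values of $\theta$: as $\theta$ grows, the softmax concentrates its mass on the "power right" action, which is exactly the deterministic momentum-building behavior we hope REINFORCE will learn.

In [ ]:
def mountain_car_policy(theta, sign_v, actions=(-1, 0, 1)):
    """pi(a | s, theta) for h(a, s, theta) = theta * a * sign(velocity)."""
    h = np.array([theta * a * sign_v for a in actions])
    e = np.exp(h - h.max())
    return dict(zip(actions, e / e.sum()))

for theta in [0.0, 1.0, 5.0]:
    print(theta, mountain_car_policy(theta, sign_v=1.0))
# theta = 0 gives the uniform policy; larger theta pushes nearly all mass onto action +1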

Plug our Mountain Car Formulation into REINFORCE

Recall:

$$ \theta_{t+1} = \theta_{t} + \alpha \ G_{t} \ \triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta)$$

And:

$$\triangledown_{\theta} ln\ \pi(A_{t} | S_{t}, \theta) = \triangledown_{\theta} \ h(A_{t}|S_{t}, \theta) \ -\ \sum_{b\ \in\ -1, 0, 1} \triangledown_{\theta} h(b|S_{t},\theta) \ \pi(b|S_{t}, \theta)$$

Also recall the reward system for Mountain Car, -1 for each time step. To simplify the intuition here, let's suppose we standardize all of our $G_t$ samples so that good actions tend to see positive rewards and bad actions tend to see negative rewards. It will make everything more human readable, and, as it turns out, we can reference a quick proof of "REINFORCE with baseline" in Sutton's text to show that all the above still holds. We'll omit the proof for now and take reward standardization to be fairly innocent and reasonable.

Plug in our $h$ and its gradient, with actions encoded as above. Expand the sum (we only have 3 actions):

$$ = A_t \frac{S_v}{|S_v|} - \big( \frac{S_v}{|S_v|} \pi(1 | S_t) - \frac{S_v}{|S_v|} \pi(-1 | S_t) \big)$$$$ = A_t \frac{S_v}{|S_v|} \ -\ \frac{S_v}{|S_v|} \pi(1 | S_t)\ +\ \frac{S_v}{|S_v|} \pi(-1 | S_t) $$$$ = \frac{S_v}{|S_v|} \ \bigg[ A_t \ + \pi(-1 | S_t) \ - \ \pi(1 | S_t) \bigg]$$

Where of course $\pi$ is still a function of $\theta$, but we're not writing it explicitly, and $S_v$ refers to the velocity of the state at time step $t$.

Now let's consider what this means for our $\theta$ updates:

Recall that we've worked out everything in expectation, so we intend to update $\theta$ with many samples of $(A_t, S_t, G_t)$ tuples.

First, if the velocity of a sampled state is 0, then $\triangledown_{\theta} ln \ \pi$ goes to 0 and that sample won't contribute to any update in $\theta$.

To really understand why this entire formulation has to work to pump $\theta$ up to infinity, let's consider a few interesting sample scenarios:

Suppose we sample $A_t = 1$ and $S_v > 0$. Our agent has powered right when it was moving to the right. Good agent! It's building momentum; this is the behavior we want to REINFORCE! We should expect the $G_t$ for this sample to be relatively large and positive, on average, since this was good behavior. Let's see how this sample may impact $\theta_{t+1}$:

$$ \theta_{t+1} = \theta \ + \ \alpha G_t \big( 1 \ +\ \pi(-1| S_t)\ -\ \pi(1|S_t) \big) $$

Since $ \pi(-1|S_t) - \pi(1|S_t) $ has to be more than -1, this is a positive update to $\theta$ proportional to $G_t$. This sample will push $\theta$ up, as we'd like to see! Furthermore, if we currently have a $\pi$ that would have made this good behavior unlikely ($\pi(1|S_t) \ll .5$), we'd be adding further fuel to our $\theta$ update, and rightly so: our agent has acted to do the right thing against the odds, and its policy deserves a big update.

Now suppose we sample $A_t = -1$ and $S_v > 0$. Our agent has powered left when it was moving to the right. Bad agent! We should expect the $G_t$ for this sample to be relatively large and negative, on average. Let's see how this sample may impact $\theta_{t+1}$:

$$ \theta_{t+1} = \theta \ + \ \alpha G_t \big( -1 \ +\ \pi(-1| S_t)\ -\ \pi(1|S_t) \big) $$

Again, this sample will influence $\theta$ to increase: since $\pi(-1| S_t) - \pi(1|S_t) < 1$, the term in parentheses is negative, and multiplying by $G_t < 0$ makes the overall update positive.
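
Plugging hypothetical numbers into these two scenarios (say $\alpha = 0.01$, an undecided policy with $\pi(1|S_t) = \pi(-1|S_t) = 1/3$, and standardized returns of $+1$ for the good sample and $-1$ for the bad one) shows that both samples nudge $\theta$ upward:

In [ ]:
# Hypothetical numbers for the two scenarios above.
alpha = 0.01
pi_left, pi_right = 1/3, 1/3  # an undecided policy

good_update = alpha * (+1) * (+1 + pi_left - pi_right)  # A_t = 1, S_v > 0, G_t = +1
bad_update  = alpha * (-1) * (-1 + pi_left - pi_right)  # A_t = -1, S_v > 0, G_t = -1
print(good_update, bad_update)  # both are +0.01: theta is pushed up either way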

REINFORCE, A Monte-Carlo Policy-Gradient Method (episodic)

Let's review our algorithm for policy optimization as it's found in Sutton's text (pg. 271) for reference. We will then go on to implement it and consider the results.

Essentially the REINFORCE algorithm tells us to initialize our $\theta$ and collect samples following our current policy. Then we perform a gradient ascent update on $\theta$ in order to climb the $J(\theta)$ function. We can choose to further discount $G_t$ by $\gamma^t$ if we'd like to really maximize discounted rewards from the starting state.
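
In pseudocode, the episodic REINFORCE loop we implement below looks roughly like the following (a sketch of the boxed algorithm in Sutton's text, written as comments rather than a definitive implementation):

In [ ]:
# Rough pseudocode sketch of episodic REINFORCE:
#
# initialize theta
# repeat:
#     generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T  following pi(.|., theta)
#     for each step t = 0, ..., T-1:
#         G_t = sum of gamma^(k - t - 1) * R_k  for k = t+1, ..., T
#         theta = theta + alpha * gamma^t * G_t * grad_theta ln pi(A_t | S_t, theta)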

As mentioned earlier, we will also experiment with baselines as discussed in the text. Ultimately, simply standardizing the rewards gave us reasonable performance.

Simple Policy Gradient with Linear Preference Function $H(s,a)$

For Mountain Car

We begin our exploration of policy gradient in practice with MountainCar-v0 from OpenAI Gym. Again, the goal here is to learn about the foundational ideas with the simplest example. We start with a one-dimensional theta, so there is only one weight to learn. Because we've designed our state/action feature vector, $x$, to depend only on the sign of velocity, we can easily understand the model and the weight it needs to find in order to see reasonable success at MountainCar. As discussed above, we should see our REINFORCE algorithm increase the value of theta over training cycles. We expect the net effect will be that our car learns to gain momentum and push out of the valley.

Agent object discussion:

This class will be specific to MountainCar for a concrete initial example. Note that we've programmed in the ability to learn a baseline for each state, which should converge to the average G at that state. Especially for MountainCar with the default reward system, we expect that the baseline could be very helpful if it's also a function of the time step. This is because in MountainCar the earlier states always see very heavy negative rewards, even if the actions taken in them are optimal. In theory the baseline should be able to correct for this imbalance, but in practice it was not effective.

Everything in the below cell simply implements what we've discussed.

In [ ]:
def calculate_G(rewards, gamma):
    # Compute the discounted return G_t for every step of an episode:
    # G_t = sum over j of gamma^j * rewards[t + j]
    discounted_rewards = []
    for i in range(len(rewards)):
        to_sum = np.array(rewards[i: ])
        decay_factors = np.array([gamma ** j for j in range(len(to_sum))])
        discounted_rewards.append(sum(to_sum * decay_factors))
    return list(discounted_rewards)

def reward_modification(next_state):
    distance_from_bottom = abs(next_state[0] + .5)  #-.5 is the location of the bottom of the well
    return distance_from_bottom
In [336]:
class PolicyGradAgent:
    def __init__(self, state_size, action_size, lr, gamma, baseline, baseline_lr):
        
        # get size of state and action
        self.state_size = state_size
        self.action_size = action_size

        self.gamma = gamma
        self.lr = lr
        self.theta = 0
        self.theta_2d = np.zeros(self.state_size)
        self.w = np.zeros(self.state_size)
        self.possible_actions = [-1, 0, 1]
        self.uncode_action = {-1: 0, 0: 1, 1:2}
        self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
        self.baseline = baseline
        self.baseline_lr = baseline_lr
        
    def reset_theta(self):
        print("Goal yet unreached, resetting theta param near zero.")
        self.theta = np.random.normal(loc=0, scale=.1)

    def h(self, action, state):
        v = state[1]
        return self.theta * action * np.sign(v)

    def pi(self, action, state):
        numerator = np.exp(self.h(action, state))
        denom = np.exp(self.h(-1, state)) + np.exp(self.h(0, state)) + np.exp(self.h(1, state))
        return numerator / denom
    
    def V(self, state):
        return np.dot(state, self.w)
    
    def get_action(self, state):
        probs = [self.pi(action, state) for action in self.possible_actions]
        return np.random.choice(self.possible_actions, p=probs)
    
    def erase_batch_memory(self):
        self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
        
    def prepare_rewards(self):
        self.batch_memory["G"] = np.array(self.batch_memory["G"])
        # Subtract off mean as the baseline
        self.batch_memory["G"] -= np.mean(self.batch_memory["G"])
        if np.std(self.batch_memory["G"]) > 0:
            self.batch_memory["G"] /= np.std(self.batch_memory["G"])
        
    def learn(self):
        theta_delta = 0
        w_delta = np.zeros(self.state_size)
        
        for t in range(len(self.batch_memory["S"])):
            G = self.batch_memory["G"][t]
            S = self.batch_memory["S"][t]
            A = self.batch_memory["A"][t]
            step = self.batch_memory["T"][t]
            p = S[0]
            v = S[1]
            
            if self.baseline == True:
                adv = G - self.V(S)
                w_delta += self.baseline_lr * self.gamma**step * adv * np.array([p, v]) 
                self.w += w_delta
                theta_delta += self.lr * adv * self.gamma**step * np.sign(v) * (A + self.pi(-1, S) - self.pi(1, S))
            else:
                theta_delta += self.lr * G * self.gamma**step * np.sign(v) * (A + self.pi(-1, S) - self.pi(1, S))
        
        explore = (np.random.rand() -.5) * 2
        self.theta += theta_delta + explore
        
    def report(self):
        current_steps = len(self.batch_memory["S"])
        print("Current Theta: {:.3}, {} steps to done".format(self.theta, current_steps))

Agent explore / train loop

Below we have the standard loop that runs the agent and trains after so many episodes. This is specifically rigged for mountain car so that we can try different reward systems and implement a few other details relevant to our experiment here that won't carry over to other environments.

This loop is prepared to run either the agent above or a more complex agent that we'll discuss shortly. We now introduce the various reward systems we'll try:

Reward Systems:

- default: as discussed, and as programmed by OpenAI.
- distance from bottom: reward proportional to the agent's distance from the bottom of the valley.
- 1 for finish: a reward of +1 for reaching the goal, 0 otherwise.
- speed for finish: with steps = the number of steps needed to reach the goal, reward = log(201 - steps).

The "speed for finished" reward is designed to mimick the behavior we'd like to see with the default reward system, but if gamma=1 then early states won't be at a disadvantage to later states, as in the default system. We hypothesis that this could help learning. We take the log so that finishing in 199 steps won't be twice as good as finishing in 200 steps. The minimum possible reward is log(1)=0.

In [388]:
def run_agent(lr, gamma, num_batches, reward_system="default", baseline=False, baseline_lr = 0, two_d=False):
    env = gym.make('MountainCar-v0')

    # create agent
    if not two_d:
        agent = PolicyGradAgent(state_size=env.observation_space.shape[0],
                                action_size=env.action_space.n,
                                lr=lr,
                                gamma=gamma,
                                baseline=baseline,
                                baseline_lr=baseline_lr)
    else:
        agent = PolicyGradAgent2d(state_size=env.observation_space.shape[0],
                                action_size=env.action_space.n,
                                lr=lr,
                                gamma=gamma,
                                baseline=baseline,
                                baseline_lr=baseline_lr)
        
    batch_size = 100
    theta_history = []
    step_history = []
    goal_reached = False

    for batch in range(num_batches):
        agent.erase_batch_memory()
        theta_history.append(agent.theta)
        for episode in range(batch_size):
            rewards = []
            states = []
            actions = []
            state = env.reset()
            done = False
            
            while not done:
                states.append(state)

                # get action for the current state and go one step in environment
                action = agent.get_action(state)

                # save action
                actions.append(action)

                # take a step in the environment; our write-up encodes actions as -1/0/1, so we convert to gym's 0/1/2 before calling env.step
                next_state, reward, done, info = env.step(agent.uncode_action[action])

                # calculate reward and append to rewards
                if reward_system == "default":
                    rewards.append(reward)
                elif reward_system == "1 for finish" or reward_system == "speed for finish":
                    rewards.append(0.0)
                elif reward_system == "distance from bottom":
                    rewards.append(reward_modification(next_state))
                else:
                    print("Error: Invalid reward system specified.")

                state = next_state

                # when the episode ends, record results and hand the episode data to the agent
                if done:
                    # record length of episode
                    steps = len(states)
                    step_history.append(steps)
                    reached = steps < 200
                    if reward_system == "1 for finish":
                        rewards[-1] = float(reached)
                    if reward_system == "speed for finish":
                        rewards[-1] = np.log(201 - steps)
                    if not goal_reached and reached > 0:
                        goal_reached = True
                        

                    # give episode data to the agent
                    agent.batch_memory["S"] += states
                    agent.batch_memory["A"] += actions
                    agent.batch_memory["G"] += calculate_G(rewards, gamma)
                    agent.batch_memory["T"] += list(range(steps))
                    

        # learn from the data (update theta)
        agent.prepare_rewards()
        agent.learn()
        if (batch + 1) % int(num_batches/3) == 0 and not reached:
            agent.reset_theta()

    return theta_history, step_history, agent.w   

Why do we train multiple agents?

For MountainCar, we've devised such a simple model that if we randomly throw a dart at the model weight, theta, we will escape the valley 50% of the time! Theta need only be positive. Thus, to see whether the model is truly working, we train 5 agents and check for consistent performance.

In [405]:
def train_agents(lr, gamma, num_batches, reward_system, baseline, two_d=False):
    theta_histories = []
    step_histories = []
    means = []
    last_theta = []
    last_weights = []
    for i in range(5):
        print("Training Agent # {} of 5...".format(i+1))
        theta_history, step_history, last_weight = run_agent(lr=lr, 
                                                             gamma=gamma, 
                                                             num_batches=num_batches,
                                                             reward_system=reward_system,
                                                             baseline=baseline,
                                                             two_d=two_d)
        means.append(np.mean(theta_history))
        last_theta.append(theta_history[-1])
        theta_histories.append(theta_history)
        step_histories.append(step_history)
        last_weights.append(last_weight)
        if not two_d:
            print("Mean training theta: {:.3}, Final theta: {:.3}, Last 100 Episodes Mean Steps: {}\n".format(means[-1], last_theta[-1], int(np.mean(step_history[-100:]))))
        else:
            print("Final theta: {}, Last 100 Episodes Mean Steps: {}\n".format(last_theta[-1], int(np.mean(step_history[-100:]))))
    
    return theta_histories, step_histories

Simple result plotting code

In [391]:
def moving_ave(history):
    return [np.mean(history[ :i + 1]) for i in range(len(history))]

def plot_histories(histories, ax, col, title, xlabel, moving=False):
    rows = len(histories)
    for i in range(rows):
        if moving:
            ax[i, col].plot(range(len(histories[i])), moving_ave(histories[i]), c="red", lw=4)
            ax[i, col].set_ylabel("Step")
            
        else:
            ax[i, col].plot(range(len(histories[i])), histories[i], c="purple", lw=4)
            ax[i, col].set_ylabel("$\Theta$")
            ax[i, col].text(-10,0, "Round {}".format(i + 1))
        ax[0, col].set_title(title)
        ax[rows-1, col].set_xlabel(xlabel)
        
def plot_training_results(theta_histories, step_histories):
    rows = len(theta_histories)
    fig, ax = plt.subplots(rows, 2, figsize=(15,10))

    plot_histories(theta_histories, ax, 0, title="Theta Param", xlabel="Batch")
    plot_histories(step_histories, ax, 1, moving=True, title="Steps Until Done Moving Ave", xlabel="Episode")

    fig.suptitle(str(rows) + " Rounds of Policy Grad Agent Training History", y=1.03, fontsize=22)
    fig.tight_layout()

Begin experiments

Default reward system

We start by training 5 agents under the default reward system, checking to see if they can learn a positive $\Theta$.

In [330]:
theta_histories, step_histories = train_agents(lr=1e-2, 
                                               gamma=.99, 
                                               num_batches=30, 
                                               reward_system="default", 
                                               baseline=False)
Training Agent # 1 of 5...
Mean training theta: 3.93, Final theta: 4.62

Training Agent # 2 of 5...
Goal yet unreached, resetting theta param near zero.
Mean training theta: 2.54, Final theta: 7.91

Training Agent # 3 of 5...
Goal yet unreached, resetting theta param near zero.
Mean training theta: 0.758, Final theta: 4.09

Training Agent # 4 of 5...
Goal yet unreached, resetting theta param near zero.
Mean training theta: 1.71, Final theta: 4.49

Training Agent # 5 of 5...
Goal yet unreached, resetting theta param near zero.
Goal yet unreached, resetting theta param near zero.
Mean training theta: -0.515, Final theta: 2.11

Comments

We observe that in most cases, the training algorithm needed to reset $\Theta$ before it could succeed at reaching escape velocity. If $\Theta$ starts going negative during training, the mountain car has learned to fight momentum, with drastic consequences: under the default reward system, it will never reach the goal and gets stuck never learning anything.

In [331]:
plot_training_results(theta_histories, step_histories)

Comments

Looking at a side-by-side comparison of $\Theta$ and episode step count, we observe that once the agent reaches the goal, $\Theta$ begins to grow and the agent never looks back. With a steady flow of positive reinforcement, the agent learns to build momentum consistently and reach the goal in about 120 steps.

Distance from bottom reward system

In [332]:
theta_histories, step_histories = train_agents(lr=1e-2, 
                                               gamma=.99, 
                                               num_batches=10, 
                                               reward_system="distance from bottom", 
                                               baseline=False)
Training Agent # 1 of 5...
Mean training theta: 3.8, Final theta: 3.67

Training Agent # 2 of 5...
Mean training theta: 5.24, Final theta: 7.22

Training Agent # 3 of 5...
Mean training theta: 4.33, Final theta: 4.48

Training Agent # 4 of 5...
Mean training theta: 5.4, Final theta: 6.47

Training Agent # 5 of 5...
Mean training theta: 6.31, Final theta: 8.29

Comments

Now our results are consistent without any need for ever resetting $\Theta$ while learning. This reward system gives the agent consistent and smooth feedback for building momentum, and the agent learns a healthy positive $\Theta$ on every run.

In [333]:
plot_training_results(theta_histories, step_histories)

Comments

Because of the smooth reward system, the agent quickly reaches the goal, as can be seen in the moving average plots of step count.

Since the default system was too challenging for the agent and this system too easy, let's progress to systems we believe to be a compromise between the two.

+1 for finish reward system

In [340]:
theta_histories, step_histories = train_agents(lr=1e-2, 
                                               gamma=1, 
                                               num_batches=20, 
                                               reward_system="1 for finish", 
                                               baseline=False)
Training Agent # 1 of 5...
Goal yet unreached, resetting theta param near zero.
Mean training theta: 1.74, Final theta: 3.43

Training Agent # 2 of 5...
Goal yet unreached, resetting theta param near zero.
Mean training theta: 2.53, Final theta: 6.9

Training Agent # 3 of 5...
Goal yet unreached, resetting theta param near zero.
Goal yet unreached, resetting theta param near zero.
Mean training theta: 1.16, Final theta: 5.02

Training Agent # 4 of 5...
Mean training theta: 2.62, Final theta: 3.69

Training Agent # 5 of 5...
Mean training theta: 7.34, Final theta: 9.17

Comments

The agent has done better here than in the default case, with fewer resets needed during training. We believe the key reason this system is easier than the default is the following: when gamma is 1, the agent sees 0 immediate reward in all states except the goal-reaching state. Thus, the early states are not disadvantaged with the heavy negative returns of the default system. This system thereby sends a consistent message to the agent: "if you escape, consider all of your actions to have been good; otherwise they were all bad." Under the default system, the problem is that actions near "Done" always look relatively good compared to actions in early states, whether the agent succeeds or not! In that case it's only by comparing a massive batch of sampled histories that we can learn that, on average, all states in successful episodes are less negative than in failure episodes.
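
A small illustration of this point, using calculate_G from above with hypothetical episode lengths (a successful 150-step episode and a failed 200-step episode, $\gamma = 1$):

In [ ]:
# Hypothetical episode lengths; calculate_G is defined above.
success_default = calculate_G([-1.0] * 150, gamma=1)
failure_default = calculate_G([-1.0] * 200, gamma=1)
print(success_default[0], failure_default[0])   # -150 vs -200: early steps punished either way
print(success_default[0], failure_default[-5])  # -150 vs -5: a late action in a *failed*
                                                # episode still looks far better

success_plus1 = calculate_G([0.0] * 149 + [1.0], gamma=1)
failure_plus1 = calculate_G([0.0] * 200, gamma=1)
print(success_plus1[0], failure_plus1[0])       # 1 vs 0: a clean success/failure signal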

In [341]:
plot_training_results(theta_histories, step_histories)

Comments

The agent learns more quickly and consistently than in the default reward case.

Speed for finish reward system

Much like the +1 approach, we only reward the agent when it finishes, but now faster finishes are better. Let's see if this extra information helps the average step counts.

In [400]:
theta_histories, step_histories = train_agents(lr=1e-2, 
                                               gamma=1, 
                                               num_batches=20, 
                                               reward_system="speed for finish", 
                                               baseline=False)
Training Agent # 1 of 5...
Goal yet unreached, resetting theta param near zero.
Goal yet unreached, resetting theta param near zero.
Mean training theta: 1.15, Final theta: 5.32, Late Mean of Steps Needed: 118

Training Agent # 2 of 5...
Mean training theta: 3.53, Final theta: 6.47, Late Mean of Steps Needed: 122

Training Agent # 3 of 5...
Goal yet unreached, resetting theta param near zero.
Goal yet unreached, resetting theta param near zero.
Goal yet unreached, resetting theta param near zero.
Mean training theta: -0.538, Final theta: -0.418, Late Mean of Steps Needed: 200

Training Agent # 4 of 5...
Mean training theta: 4.16, Final theta: 4.31, Late Mean of Steps Needed: 124

Training Agent # 5 of 5...
Goal yet unreached, resetting theta param near zero.
Mean training theta: 2.01, Final theta: 3.19, Late Mean of Steps Needed: 130

In [344]:
plot_training_results(theta_histories, step_histories)

Comments

We see no statistically significant improvement over the +1 for finish case. We could have expected this: with a one-dimensional $\Theta$, all the agent can learn is to gain momentum. Its position does not influence its behavior, so it can never really learn to turn around once it has gone far enough to the left to gain the potential energy needed to make it over to the goal on the right.

How can we further improve performance?

Let's continue by programming this ability into the agent's behavior functions. Specifically, we give the agent an additional term in its state/action feature vector: now the agent's preference function $h$ will be able to consider position in addition to the sign of velocity. Much as we haven't been using raw velocity in the $x$ feature vector, we will also do some engineering so as not to simply use raw position.

Ultimately, we'd like the agent to turn around once it's far enough to the left, but not before; it's a waste of time to build more potential energy than it needs. We introduce our 2d MountainCar agent, which uses the additional feature $left \ measure(p)$, or $l(p)$ for short, a function of position: $l(p) = 0$ if $p > -.5$ and $(p + .5)^2$ otherwise. The key idea is that $l(p)$ is significantly non-zero only when the cart is far to the left. We believe this might be just the information the cart needs so that it can learn to turn around once it has built up sufficient potential energy on the left.
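
A few hypothetical positions illustrate the shape of $l(p)$ (recall MountainCar positions range from roughly -1.2 to 0.6):

In [ ]:
# l(p) = 0 for p > -0.5 and (p + 0.5)^2 otherwise: it only "turns on" far to the left.
def left_measure(p):
    return (p + 0.5) ** 2 if p < -0.5 else 0.0

for p in [0.2, -0.5, -0.8, -1.2]:
    print(p, round(left_measure(p), 2))
# 0.2 -> 0.0, -0.5 -> 0.0, -0.8 -> 0.09, -1.2 -> 0.49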

Let's see the agent code modification and then experiment with it:

Now create agent that has two dimensional $\Theta$ and compare results

In [361]:
class PolicyGradAgent2d:
    def __init__(self, state_size, action_size, lr, gamma, baseline, baseline_lr):
        
        # get size of state and action
        self.state_size = state_size
        self.action_size = action_size

        self.gamma = gamma
        self.lr = lr
        self.theta = np.zeros(self.state_size)
        self.w = np.zeros(self.state_size)
        self.possible_actions = [-1, 0, 1]
        self.uncode_action = {-1: 0, 0: 1, 1:2}
        self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
        self.baseline = baseline
        self.baseline_lr = baseline_lr
        
    def reset_theta(self):
        print("Goal yet unreached, resetting theta param near zero.")
        self.theta = np.random.normal(loc=0, scale=.1, size=2)
    
    def left_measure(self, p):
        output = 0
        if p < -.5:
            output = (p + .5)**2
        return output
    
    def h(self, action, state):
        p = state[0]
        v = state[1]
        left_measure = self.left_measure(p)
        X = np.array([action * np.sign(v), action * left_measure])
        return np.dot(self.theta, X)
    
    def pi(self, action, state):
        numerator = np.exp(self.h(action, state))
        denom = np.exp(self.h(-1, state)) + np.exp(self.h(0, state)) + np.exp(self.h(1, state))
        return numerator / denom
    
    def V(self, state):
        return np.dot(state, self.w)
    
    def get_action(self, state):
        probs = [self.pi(action, state) for action in self.possible_actions]
        return np.random.choice(self.possible_actions, p=probs)
    
    def erase_batch_memory(self):
        self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
        
    def prepare_rewards(self):
        self.batch_memory["G"] = np.array(self.batch_memory["G"])
        # Subtract off mean as the baseline
        self.batch_memory["G"] -= np.mean(self.batch_memory["G"])
        if np.std(self.batch_memory["G"]) > 0:
            self.batch_memory["G"] /= np.std(self.batch_memory["G"])
        
    def learn(self):
        theta_delta = np.zeros(self.state_size)
        w_delta = np.zeros(self.state_size)
        
        for t in range(len(self.batch_memory["S"])):
            G = self.batch_memory["G"][t]
            S = self.batch_memory["S"][t]
            A = self.batch_memory["A"][t]
            step = self.batch_memory["T"][t]
            p = S[0]
            v = S[1]
            
            if self.baseline == True:
                adv = G - self.V(S)
                w_delta += self.baseline_lr * self.gamma**step * adv * np.array([p, v]) 
                self.w += w_delta
                theta_delta += self.lr * adv * self.gamma**step * np.array([np.sign(v), self.left_measure(p)]) * (A + self.pi(-1, S) - self.pi(1, S))
            
            else:
                theta_delta += self.lr * G * self.gamma**step * np.array([np.sign(v), self.left_measure(p)]) * (A + self.pi(-1, S) - self.pi(1, S))
        
        explore = (np.random.rand(self.state_size) -.5) * 2
        self.theta += theta_delta + explore
In [406]:
theta_histories, step_histories = train_agents(lr=1e-2, 
                                               gamma=1, 
                                               num_batches=20, 
                                               reward_system="speed for finish", 
                                               baseline=False,
                                               two_d=True)
Training Agent # 1 of 5...
Final theta: [4.95343518 5.35267038], Last 100 Episodes Mean Steps: 130

Training Agent # 2 of 5...
Final theta: [ 5.04273684 -1.0152475 ], Last 100 Episodes Mean Steps: 126

Training Agent # 3 of 5...
Goal yet unreached, resetting theta param near zero.
Goal yet unreached, resetting theta param near zero.
Final theta: [ 3.80159333 -2.04940521], Last 100 Episodes Mean Steps: 124

Training Agent # 4 of 5...
Final theta: [6.12775302 0.43999044], Last 100 Episodes Mean Steps: 120

Training Agent # 5 of 5...
Goal yet unreached, resetting theta param near zero.
Goal yet unreached, resetting theta param near zero.
Final theta: [ 3.57263758 -1.07587062], Last 100 Episodes Mean Steps: 131

Comparing the last-100-episode mean steps to the previous results, we see no improvement. Looking at the agent's learned $\Theta$ vector, we can start to imagine what the issue is. Now that $\Theta$ is two-dimensional, we need the right combination of components for good performance, which likely takes more training time to learn. We believe that if both $\Theta$ components learned to be positive, with the first component larger than the second, we'd see the best performance yet.

We leave further tuning and longer training as future work, and now move on to a linear policy gradient agent prepared to act in many other environments with discrete action spaces.

Prepare policy gradient agent to learn in any gym environment with discrete action space

We simply generalize what we've done so far. Since we don't know the environment in advance, we generalize $x(a,s)$ to $\begin{bmatrix} a\,s_0 \\ a\,s_1 \\ a\,s_2 \\ \vdots \\ 1 \end{bmatrix}$ (each state component scaled by the action code, with a bias term appended) in what we call the "non quadratic" or "simple" case. We also provide an option for this agent to add quadratic terms in the state to $x(a,s)$ so that we can compare.
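
For instance (using a hypothetical CartPole state with its four components), the "simple" feature vector is just the state scaled by the action code with a bias term appended, and the quadratic option appends the squared state terms as well:

In [ ]:
import numpy as np

# Hypothetical CartPole state: [cart position, cart velocity, pole angle, pole tip velocity]
s = np.array([0.02, -0.3, 0.05, 0.4])
a = 1  # action code

x_simple = np.append(a * s, 1.0)                     # "non quadratic" / "simple" case
x_quad   = np.concatenate([a * s, a * s**2, [1.0]])  # with quadratic terms added
print(x_simple.shape, x_quad.shape)  # (5,) and (9,), matching the theta sizes below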

In [1028]:
class GeneralPolicyGradAgent:
    def __init__(self, state_size, action_size, lr, gamma, quadratic_terms=False):
        
        # get size of state and action
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        self.lr = lr
        self.quadratic_terms = quadratic_terms
        if not self.quadratic_terms:
            self.theta = np.zeros(self.state_size + 1)
        else:
            self.theta = np.zeros(self.state_size*2 + 1)
        self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
    
    def h(self, action, state):
        X = self.X(action, state)
        return np.dot(self.theta, X)
    
    def pi(self, action, state):
        numerator = np.exp(self.h(action, state))
        denom = 0
        for act in range(self.action_size):
            denom += np.exp(self.h(act, state))
        return numerator / denom
    
    def get_action(self, state):
        probs = [self.pi(action, state) for action in range(self.action_size)]
        return np.random.choice(range(self.action_size), p=probs)
    
    def erase_batch_memory(self):
        self.batch_memory = {"A": [], "S": [], "R": [], "G": [], "T":[]}
        
    def prepare_rewards(self):
        self.batch_memory["G"] = np.array(self.batch_memory["G"])
        # Subtract off mean as the baseline
        self.batch_memory["G"] -= np.mean(self.batch_memory["G"])
        if np.std(self.batch_memory["G"]) > 0:
            self.batch_memory["G"] /= np.std(self.batch_memory["G"])
    
    
    def X(self, action, state):
        m=1
        if self.quadratic_terms:
            m=2
        output = np.zeros(self.state_size*m + 1)
        for i in range(self.state_size):
            output[i] = action * state[i]
        if self.quadratic_terms:
            for i in range(self.state_size):
                output[self.state_size + i] = action * state[i]**2
        output[-1] = 1
        return output
        
    def learn(self):
        if not self.quadratic_terms:
            theta_delta = np.zeros(self.state_size + 1)
        else:
            theta_delta = np.zeros(self.state_size*2 + 1)
        
        for t in range(len(self.batch_memory["S"])):
            G = self.batch_memory["G"][t]
            S = self.batch_memory["S"][t]
            A = self.batch_memory["A"][t]
            step = self.batch_memory["T"][t]
            action_sigma = 0
            for b in range(self.action_size):
                action_sigma += self.pi(b, S) * self.X(b, S)
            theta_delta += self.lr * G * self.gamma**step * (self.X(A, S) - action_sigma)

        self.theta += theta_delta

Slight modifications to run agent code to make it work for the more general case:

Now we use the boilerplate run-agent code, where we can use any environment and turn quadratic terms on or off. The other difference is that now we use total scores as our metric; we also always use the default env reward system.

In [1004]:
def general_run_agent(lr, gamma, num_batches, env, verbose=False, quadratic_terms=False):

    agent = GeneralPolicyGradAgent(state_size=env.observation_space.shape[0],
                                   action_size=env.action_space.n,
                                   lr=lr,
                                   gamma=gamma,
                                   quadratic_terms=quadratic_terms)
        
    batch_size = 100
    theta_history = []
    scores = []

    for batch in range(num_batches):
        agent.erase_batch_memory()
        theta_history.append(agent.theta)
        for episode in range(batch_size):
            rewards = []
            states = []
            actions = []
            score = 0
            state = env.reset()
            done = False
            
            while not done:
                states.append(state)

                # get action for the current state and go one step in environment
                action = agent.get_action(state)

                # save action
                actions.append(action)

                # take a step in the environment (actions here already use gym's own encoding)
                next_state, reward, done, info = env.step(action)

                rewards.append(reward)
                score += reward
                state = next_state

                # when the episode ends, record the score and hand the episode data to the agent
                if done:
                    # record score of episode
                    scores.append(score)
                    
                    steps = len(states)
                    # give episode data to the agent
                    agent.batch_memory["S"] += states
                    agent.batch_memory["A"] += actions
                    agent.batch_memory["G"] += calculate_G(rewards, gamma)
                    agent.batch_memory["T"] += list(range(steps))
                    

        # learn from the data (update theta)
        agent.prepare_rewards()
        agent.learn()
        
        if batch % 2 == 0 and verbose:
            print(scores[-1])
        
    return theta_history, scores
In [1029]:
results = {}
theta_history, scores = general_run_agent(lr=.5, 
                                         gamma=.99, 
                                         num_batches=150, 
                                         env=gym.make('CartPole-v1'),
                                         verbose=False,
                                         quadratic_terms=False)
results["Linear_CartPole_simple"] = [scores, theta_history]
get_streak(scores)
Out[1029]:
'CartPole-v1 Solved: Streak starting at episode 2398'
In [1030]:
theta_history, scores = general_run_agent(lr=.5, 
                                          gamma=.99, 
                                          num_batches=150, 
                                          env=gym.make('CartPole-v1'),
                                          verbose=False,
                                          quadratic_terms=True)
results["Linear_CartPole_quadratic"] = [scores, theta_history]
get_streak(scores)
Out[1030]:
'CartPole-v1 Solved: Streak starting at episode 5000'
In [1031]:
fig, ax = plt.subplots(1,1, figsize=(15,8))
scores = results["Linear_CartPole_quadratic"][0]
simple_scores = results["Linear_CartPole_simple"][0]
ax.plot(range(len(scores)), moving_ave(scores), lw=4, c="orange", label="Quadratic Model")
ax.plot(range(len(simple_scores)), moving_ave(simple_scores), lw=4, c="crimson", label="Simple Model")
ax.set_title("Linear Model Comparison For CartPole")
ax.set_xlabel("Episode")
ax.set_ylabel("Moving Average Score")
ax.legend();
In [1032]:
theta_history = results["Linear_CartPole_simple"][1]
print("The theta components that mastered CartPole are:\n")
print("Cart position component: {:.3}".format(theta_history[-1][0]))
print("Cart velocity component: {:.3}".format(theta_history[-1][1]))
print("Pole angle component: {:.3}".format(theta_history[-1][2]))
print("Pole velocity at tip component: {:.3}".format(theta_history[-1][3]))
print("Bias: {:.3}".format(theta_history[-1][4]))
The theta components that mastered CartPole are:

Cart position component: -0.091
Cart velocity component: 7.73
Pole angle component: 19.6
Pole velocity at tip component: 49.1
Bias: 5.77e-12

Comments

And so our generalized linear policy gradient method is a success! We've learned CartPole without assuming anything about the environment.

In [1036]:
theta_history, scores = general_run_agent(lr=.1, 
                                          gamma=.99, 
                                          num_batches=50, 
                                          env=gym.make('Acrobot-v1'),
                                          quadratic_terms=False)
results["Linear_Acrobot_simple"] = [scores, theta_history]              
In [1037]:
theta_history, scores = general_run_agent(lr=.1, 
                                          gamma=.99, 
                                          num_batches=50, 
                                          env=gym.make('Acrobot-v1'),
                                          quadratic_terms=True)
results["Linear_Acrobot_quadratic"] = [scores, theta_history]  
In [1038]:
fig, ax = plt.subplots(1,1, figsize=(15,8))
scores = results["Linear_Acrobot_quadratic"][0]
simple_scores = results["Linear_Acrobot_simple"][0]
ax.plot(range(len(scores)), moving_ave(scores), lw=4, c="orange", label="Quadratic Model")
ax.plot(range(len(simple_scores)), moving_ave(simple_scores), lw=4, c="crimson", label="Simple Model")
ax.set_title("Linear Model Comparison For Acrobot")
ax.set_xlabel("Episode")
ax.set_ylabel("Moving Average Score")
ax.legend();

Neural Network Policy Gradient Model

We now attempt to improve on our $h(s,a)$ approximation by employing a neural network to learn it. While in the case of MountainCar our two-dimensional $\Theta$ and linear $h$ seemed capable of near-optimal behavior, perhaps the network will have some insight that we didn't. Moreover, we would expect vastly better results from the neural network on environments where the optimal policy is non-linear. The real beauty of the network is that we no longer have to do the feature engineering ourselves; the network can learn the features instead!

The goal here is to make this process much like what we've done so far with linear policy gradient methods. Recall that we've been employing the vanilla REINFORCE algorithm from Sutton's text. In essence, we've learned that this algorithm seeks to maximize the expected $ln\ \pi(A_{t} | S_{t}, \theta) \ G_t$, where $\theta$ denotes the parameters of some function approximation model. So far we've designed our own gradient ascent algorithm, but Keras has automatic differentiation, which should make our job easier! We want our loss function to be the negative of what we want to maximize, so our proposed custom loss becomes $-ln\ \pi(A_{t} | S_{t}, \theta) \ G_t$.

Now, how does this custom loss function relate to a custom loss we might program into Keras with inputs y_true and y_pred? Well, y_true will be a one-hot encoded vector representing $A_t$, the action we actually took at that time step. y_pred will be our model's predicted probability for that action, i.e. $\pi(A_{t} | S_{t}, \theta)$. $G_t$ is the same as what we've been working with all along, the sampled $q(s,a)$ value, which we also standardize to give a rough baseline approximation. This standardization greatly increases the efficiency of learning by reducing variance and giving a clearer reward signal.

So our custom loss might look like the following:

In [ ]:
def custom_loss(y_true, y_pred):
    # this step simply deals with the one hot encoded dimensions of these vectors
    # model_prob is simply the model's predicted probability of selecting the action that was actually sampled
    model_prob = K.batch_dot(y_true, y_pred)  
    log_prob = K.log(model_prob)
    return -log_prob  # instead of multiplying by G, we weight the samples with G

Here we have not included $G_t$, because Keras expects a loss to be a function of only y_true and y_pred; instead we simply weight the samples by $G_t$.

However, we note that this custom loss really just amounts to categorical crossentropy, weighted by G.

Recall that categorical crossentropy is defined by:

$$-\sum_{c=1}^{M} y_c \ \log p(c)$$

Where $M$ is the total number of classes, or in our case actions, $y_c$ is the observed label (either 0 or 1), and $p(c)$ is the model's predicted probability for class $c$. Since $y_c$ is 0 for every action except $A_t$, the action we actually sampled, the sum reduces to $-\log \pi(A_t | S_t, \theta)$, which is exactly our custom loss once we weight the sample by $G$.
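
A quick numerical check of this equivalence (hypothetical probabilities, plain numpy rather than Keras):

In [ ]:
import numpy as np

# Hypothetical: 3 actions, the model assigns these probabilities, and action 2 was sampled.
y_pred = np.array([0.2, 0.3, 0.5])
y_true = np.array([0.0, 0.0, 1.0])  # one-hot encoding of A_t
G = 1.7                             # standardized return used as the sample weight

cross_entropy = -np.sum(y_true * np.log(y_pred))  # categorical crossentropy
custom        = -np.log(y_pred[2])                # -ln pi(A_t | S_t, theta)
print(G * cross_entropy, G * custom)              # identical values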

We went on to use the built-in cross-entropy loss instead of our custom function because we had better experience with it. Perhaps our dot-product operation was not behaving correctly in the custom loss, or perhaps we simply had bad luck, since we initially tested the custom loss on MountainCar, which proved difficult in any case.

In [822]:
import keras.layers as layers
from keras.initializers import glorot_uniform
from keras.models import Model, Sequential
import keras.backend as K
from keras.optimizers import RMSprop

Network details

We provide the ability to train with two different network complexities. "Simple" gives us a model with one hidden Dense layer of 3 nodes. "Complex" gives us two hidden layers of 5 nodes each. The output is a softmax in both cases. We experimented with a few different optimizers with little effect; ultimately we chose to stick with RMSprop for consistency across experiments.

Much of the next two code blocks are identical to what we've used for the linear approximation. We now simply plug in the network.

In [860]:
K.clear_session()
class NNPolicyGradAgent:
    def __init__(self, state_size, action_size, lr, gamma, network_complexity):
        
        # get size of state and action
        self.state_size = state_size
        self.action_size = action_size

        self.gamma = gamma
        self.lr = lr
        self.batch_memory = {"A": [], "S": [], "R": [], "G": []}
        self.losses = []
        self.network_complexity = network_complexity
        self.model = self.build_model()
        self.max_distances = []
    
    def build_model(self):
        # Define model
        model=Sequential()
        if self.network_complexity == "simple":
            model.add(Dense(3, activation="relu", input_dim=self.state_size, use_bias=True))
        else:
            model.add(Dense(5, activation="relu", input_dim=self.state_size))
            model.add(Dense(5, activation="relu"))
        
        model.add(Dense(self.action_size, activation="softmax"))
        
        # Compile model
        model.compile(optimizer=RMSprop(lr=self.lr), loss="categorical_crossentropy")
        return model
    
    
    def get_action(self, state):
        state = state.reshape(1,self.state_size)
        action_probs = self.model.predict(state)
        return np.random.choice(range(self.action_size), p=action_probs[0])
    
    def erase_batch_memory(self):
        self.batch_memory = {"A": [], "S": [], "R": [], "G": []}
        
    def prepare_rewards(self):
        self.batch_memory["G"] = np.array(self.batch_memory["G"])
        # Subtract off mean as the baseline
        self.batch_memory["G"] -= np.mean(self.batch_memory["G"])
        if np.std(self.batch_memory["G"]) > 0:
            self.batch_memory["G"] /= np.std(self.batch_memory["G"])
        
    def one_hot_encode(self, acts):
        one_hot_actions = np.zeros((len(acts), self.action_size))
        one_hot_actions[np.arange(len(acts)), acts] = 1
        return one_hot_actions

    def learn(self):
        S = np.array(self.batch_memory["S"])
        G = np.array(self.batch_memory["G"])
        A = np.array(self.batch_memory["A"])
        A_one_hot = self.one_hot_encode(A)

        loss = self.model.train_on_batch(S, A_one_hot, sample_weight=G)
        self.losses.append(loss)
        
        # Evaluation metrics for mountain car only:
        # ave_pos = np.mean(S[: ,0])
        # max_dist = np.max(S[: ,0])
        # print("Max Distance: {:.3}".format(max_dist))
        # self.max_distances.append(max_dist)

Same code; we just plug in the network agent now.

In [889]:
def run_agentNN(lr, gamma, num_batches, reward_system="mountain", env=gym.make('MountainCar-v0'), network_complexity="simple", verbose=False):

    # create agent
    agent = NNPolicyGradAgent(state_size=env.observation_space.shape[0],
                              action_size=env.action_space.n,
                              lr=lr,
                              gamma=gamma,
                              network_complexity=network_complexity)
        
    
        
    batch_size = 200
    step_history = []
    goal_reached = False
    scores = []

    for batch in range(num_batches):
        agent.erase_batch_memory()
        for episode in range(batch_size):
            rewards = []
            states = []
            actions = []
            state = env.reset()
            done = False
            score = 0
            
            while not done:
                states.append(state)

                # get action for the current state and go one step in environment
                action = agent.get_action(state)

                # save action
                actions.append(action)

                # take step in environment
                next_state, reward, done, info = env.step(action)

                # calculate reward and append to rewards
                if reward_system == "mountain":
                    rewards.append(abs(state[1]))
                else:
                    rewards.append(reward)
                
                score += rewards[-1]
                state = next_state

                # when the episode ends, record results and hand the episode data to the agent
                if done:
                    # record length of episode
                    steps = len(states)
                    step_history.append(steps)
                    scores.append(score)
                        
                    # give episode data to the agent
                    agent.batch_memory["S"] += states
                    agent.batch_memory["A"] += actions
                    agent.batch_memory["G"] += calculate_G(rewards, gamma)
                    
        agent.prepare_rewards()
        agent.learn()
        
        if verbose and batch % 20 == 0 and reward_system != "mountain":
            print(scores[-1])

    return step_history, agent.losses, agent.max_distances, scores
In [ ]:
step_history, losses, max_distances, scores = run_agentNN(lr=.1, gamma=.9, num_batches=50, reward_system="mountain")

Comments

After great trials and tribulations, we've managed to get MountainCar to work by creating a new reward system that simply encourages the cart to pick up speed: the reward at each step is just the absolute value of the current velocity. We suspect that the other reward systems did not work because the model was trying to learn something too complex starting from random actions. Acting randomly, the agent never manages to reach the goal and thereby is never able to learn. Previously, we had a linear model so simple that randomly adjusting the weight was enough to truly explore the possibility space and succeed. Now, randomly jiggling the knobs of the neural network is unlikely to lead to meaningful behavior.

So here we really see one major theme of machine learning: the bias-variance trade-off. By introducing the neural network, we have a more powerful model that should be able to do everything the linear model did and more; however, this extra power and potential for lower bias comes at the cost of higher variance, in a setting where we need to cut variance wherever we can because the sampling is so noisy. The high-variance neural network has much more trouble generalizing than the linear model did.

Nonetheless, with the much simplified reward system, we can see that the network is learning to get the car closer and closer to the goal, albeit very slowly.

In [850]:
fig, ax = plt.subplots(1,2)
ax[0].plot(range(len(max_distances)), moving_ave(max_distances), c='purple')
ax[0].set_xlabel("Epoch with 200 episodes")
ax[0].set_ylabel("Moving average of max position")
ax[0].set_title("Cart Learns to Reach Goal")

ax[1].plot(range(len(losses)), losses, c='darkblue')
ax[1].set_xlabel("Epoch with 200 episodes")
ax[1].set_ylabel("Weighted Cross Entropy Loss")
ax[1].set_title("Loss Slowly Diminishes")

fig.suptitle("Policy Gradient Neural Network Learning", fontsize=23, y=1.07);
fig.tight_layout()

Comments

Encouraged by some meaningful behavior, we move on to apply the network to other environments with their default reward systems. Hopefully we will see faster results.

Neural Network Policy Gradient for CartPole-v1

In [912]:
step_history, losses, max_distances, scores = run_agentNN(lr=.1, 
                                                          gamma=.9, 
                                                          num_batches=20,
                                                          reward_system="default",
                                                          env=gym.make('CartPole-v1'),
                                                          network_complexity="complex")

def get_streak(scores):
    streak = 0
    for i in range(len(scores)):
        if scores[i] >= 475:
            streak +=1
        else:
            streak = 0
        if streak >= 100:
            return "CartPole-v1 Solved: Streak starting at episode {}".format(i + 1 - 99)
    return "CartPole-v1 Not Solved"

get_streak(scores)
Out[912]:
'CartPole-v1 Solved: Streak starting at episode 1601'
In [890]:
step_history, simple_losses, max_distances, simple_scores = run_agentNN(lr=.1, 
                                                            gamma=.9, 
                                                            num_batches=20,
                                                            reward_system="default",
                                                            env=gym.make('CartPole-v1'),
                                                            network_complexity="simple")

get_streak(simple_scores)
Out[890]:
'CartPole-v1 Not Solved'
In [919]:
fig, ax = plt.subplots(1,1, figsize=(15,8))

ax.plot(range(len(scores)), moving_ave(scores), lw=4, c="orange", label="Complex NN")
ax.plot(range(len(simple_scores)), moving_ave(simple_scores), lw=4, c="crimson", label="Simple NN")
ax.set_title("Network Comparison For CartPole")
ax.set_xlabel("Episode")
ax.set_ylabel("Moving Average Score")
ax.legend();

Comments

We've solved CartPole at episode 1601, whereas our best linear model didn't solve it until episode 2398. Training results have very high variance, but the network does seem to give us some performance boost without much tuning. Judging from the graph above, where the deeper network outperforms the shallow one, we can surmise that the optimal CartPole policy is non-linear, hence the advantage for the network model.

Neural Network Policy Gradient for Acrobot-v1

In Acrobot, we are essentially rewarded for how quickly we can swing a mechanical system above some threshold; see OpenAI Gym online for a visualization. Because of the delayed reward, we might encounter some of the trouble we saw in MountainCar.

Let's give Acrobot a shot with the nn:

In [ ]:
step_history, losses, max_distances, complex_scores = run_agentNN(lr=.1, 
                                                          gamma=.99, 
                                                          num_batches=20,
                                                          reward_system="default",
                                                          env=gym.make('Acrobot-v1'),
                                                          network_complexity="complex",
                                                          verbose=False)
In [ ]:
step_history, losses, max_distances, simple_scores = run_agentNN(lr=.1, 
                                                          gamma=.99, 
                                                          num_batches=20,
                                                          reward_system="default",
                                                          env=gym.make('Acrobot-v1'),
                                                          network_complexity="simple",
                                                          verbose=False)
In [991]:
fig, ax = plt.subplots(1,1, figsize=(15,8))

ax.plot(range(len(complex_scores)), moving_ave(complex_scores), lw=4, c="orange", label="Complex NN")
ax.plot(range(len(simple_scores)), moving_ave(simple_scores), lw=4, c="darkolivegreen", label="Simple NN")
ax.set_title("Network Comparison For Acrobot-v1")
ax.set_xlabel("Episode")
ax.set_ylabel("Moving Average Score")
ax.legend();

Comments

The deeper network was able to learn Acrobot while the simpler one was not. It would be interesting to try more than 5 nodes on the hidden layers of the deeper network and more hyperparameter tuning, but this is beyond the scope of our goals for this project.

Conclusion

In this project, we explored a number of techniques to solve reinforcement learning problems in various environments from OpenAI Gym. We reviewed the theory behind tabular methods (dynamic programming, Monte Carlo, Q-learning), and implemented both traditional and deep Q-learning, finding that deep Q-learning definitively outperformed traditional, tabular Q-learning on high-dimensional and continuous problems. We then explored a completely different class of solutions -- policy gradient methods -- and compared a simple linear policy approach to a deep neural network approach, finding once again that the deep approach was often much more scalable (see Acrobot results).

In general, while both methods (deep Q and policy gradient) were able to achieve good results for the environments we tested, we found that deep Q-learning performed slightly better. We believe one reason for this is that in many of these problems, especially MountainCar, achieving the goal with random movements is very unlikely, and thus only occurs once in a while, if ever. The Q-learning approach is able to self-update much more effectively when it comes across successful episodes, whereas the policy gradient methods typically need more data (more successful episodes in this case) to converge well.

For future work, it would be interesting to explore the performance of these classes of solution methods on more complex reinforcement learning problems, such as strategy games like Connect-Four, Chess, or Go.