Final Paper

Harvard University
Fall 2018
Course: Applied Math 207
Instructors: Rahul Dave
Due Date: Wednesday, December 12th, 2018 at 11:59pm

Collaborators

John Daciuk, Dean Hathout, Tong Lu

In [1]:
import numpy as np
import scipy.stats
import scipy.special
import scipy

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from matplotlib import cm

import pandas as pd
import seaborn as sns

import torch
from torch.autograd import Variable
import torch.nn as nn
from torch.nn import functional as fn

import torchvision.datasets as datasets
import torchvision.transforms as transforms

import copy
import matplotlib.animation as animation


%matplotlib notebook

# Make plot font bigger etc...
plt.rcParams["font.size"] = 17 
plt.rcParams["figure.figsize"] = (10,6) 
plt.rcParams["ytick.labelsize"] = 16
plt.rcParams["xtick.labelsize"] = 16
plt.rcParams["axes.labelsize"] = 20
plt.rcParams["lines.linewidth"] = 2 
plt.rcParams["axes.titlesize"] = 22

**Recent Developments in Gradient Descent**

Section 1

Abstract

We consider here Matthew Zeiler's 2012 paper "Adadelta: An Adaptive Learning Rate Method," which introduced a novel technique for dimension-wise dynamic updating of the learning rate in gradient descent. We situate our presentation of this method within a broader tutorial of per-dimensional gradient descent learning rate methods, focusing specifically on the Adam, Adagrad, and Adadelta algorithms. Below, we present the theory behind each algorithm, as well as our own, independent implementations. We apply each algorithm to a number of different optimization problems, analyze their performance, and compare their handling of different conditions within the context of their mathematical similarities and differences. Another credit should be given to "An overview of gradient descent optimization algorithms", which gives us a very good foundation to the gradient descent concepts and variants in general.

Introduction

In vanilla gradient descent, we iteratively update some set of parameters $\theta$ to minimize an objective function $f(\theta)$. In each update, we shift the parameters $\theta$ in the direction opposite to the objective function's gradient with respect to the parameters, $\nabla_{\theta} f(\theta)$. That is, at each iteration,

$$\theta := \theta - \eta \nabla_{\theta} f(\theta).$$

Because the gradient of a function is the direction of maximal increase, to move in the direction opposite to the gradient is to move in the direction of maximal decrease and locate a function's minimum. A key feature of gradient descent is the learning rate, i.e. the sizes of the steps we take in the direction of the gradient ($\eta$ above). Determining the proper learning rate is crucial for the gradient descent algorithm's ability to converge to the minimum in reasonable time. If the learning rate is too small, the algorithm will move extremely slowly, and if it is too large, the algorithm may 'jump over' the minimum altogether and never converge. The generally applied technique for choosing the best learning rate is to select (through some manual tuning procedure) the highest possible rate such that the latter of these two possible convergence issues is not realized.

A number of extensions to the gradient descent algorithm have been introduced to overcome this challenge of determing the proper learning rate. The three different algorithms we present here do this by iteratively updating the learning rate applied to the gradient, and by computing this learning rate on a per-dimensional basis (that is, it computes a different learning rate for each parameter $\theta_i$). The advantages to these methods are many, with the most notable being the removal of the need to manually set a learning rate, flexibility in the step-size for each dimension, and minimal extra computational expense compared to vanilla gradient descent. Performing learning rate updates on a per-dimensional basis allows these algorithms to specify larger or smaller updates for a parameter depending on that parameter's importance.

1.1 Adagrad

The first algorithm of this nature which we discuss is Adagrad. The crux of Adagrad's learning rate update scheme is to give smaller updates (lower step sizes) to parameters typically associated with high-magnitude gradient components, and larger updates (higher step sizes) to parameters which typically have low-magnitude gradient components. Thus, if used in a machine learning context such as training a neural network, Adagrad lends itself very nicely to dealing with sparse data: association with high-magnitude gradient components would indicate a parameter's association with a frequently occurring feature, giving this frequent feature's parameter smaller updates, and by similar logic giving infrequent feature's parameters larger updates.

Adagrad achieves this by inversely scaling the learning rate in a given dimension by the accumulation of squared gradient components in that dimension. That is, if we let $g_t$ equal the gradient at step $t$ and $g_{t, i}$ be the $i$-entry in this gradient vector (the partial derivative of the objective function with respect to $\theta_i$ at step $t$), then for each $i$ we update as follows:

$$\theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{\sum_{\tau=1}^t g_{\tau, i}^2 + \epsilon}} \cdot g_{t, i}$$

where $\epsilon$, typically on the order of $1e-8$, acts a smoothing term to help avoid division by zero.

Each dimension has its own dynamic step size, eliminating the need for a manually tuned learning rate. Moreover, by giving the smaller gradients larger rates and vice versa, the progress along each dimension tends to even out over time, which is very useful if the scales of the gradients vary by a lot, which is often the case in deep neural networks where the gradients in each layer often differ by orders of magnitude. Additionally, the sum of squared gradients in the denominator reduces the learning rate over time without having to manually scale the learning rate in each iteration as in vanilla gradient descent.

Adagrad is not without its challenges, however. The annealing effect mentioned above can become problematic if the learning rate declines to 0 too quickly, failing to retain enough update power to reach a minimum. Moreover, the algorithm can be sensitive to initial conditions: we see that if the initial gradients are very large, the learning rate will start off, and remain, very small. One attempt to get around this would simply be to increase the global learning rate $\eta$, but this takes away from the advantage of not needing to manually tune the learning rate. Other methods have been developed to tackle this issue, discussed below. In the meantime, we present below a simple implementation of Adagrad and demonstrate its ability to minimize a basic trigonometric function. The per-dimensional aspect of this method is easy to code with numpy, as calculations can be readily vectorized.

In [3]:
# Define a Simple Toy Function to Minimize (Cosine) and its Gradient (Negative Sine) to Test Adagrad Code
def f(x):
    return(np.cos(x))

def gradient(x):
    return(np.array(-1 * np.math.sin(x)))
In [4]:
def simple_adagrad(gradient, init, lr=.1, iterations=1000, epsilon=1e-3):
    
    theta = init
    theta_history = [theta]
    
    # Define array of running squared sum of gradients, the i-th entry in this array corresponds to the i-th dimension
    previous_gradient_sums = np.array([0] * len(theta))
    
    for i in range(iterations):
        grad = gradient(theta)
        theta = theta - (lr/(np.sqrt(previous_gradient_sums + epsilon))) * grad
        theta_history.append(theta)
        previous_gradient_sums = previous_gradient_sums + grad ** 2
        
    return(theta_history)
In [5]:
min_x = simple_adagrad(gradient, np.array([1]))[-1]
print("The function has a minimum value of {} at x = {}.".format(f(min_x)[0], min_x[0]))
The function has a minimum value of -1.0 at x = 3.1415926535897962.

Adagrad correctly locates the minimum of the cosine function at $\pi$.

1.2 Adadelta

Adadelta was introduced in 2012, a year after Adagrad, and aims to mitigate the challenges posed by Adagrad's annealing effect if too aggressive (discussed above). It does this by replacing the accumulation of all previous squared gradients with an exponentially decaying running average of the squared gradients. That is, if we call $E[g^2]_t$ this average at time $t$, then:

$$E[g^2]_t = \rho E[g^2]_{t - 1} + (1 - \rho) g_t^2$$

Updating the average in this way keeps us from having to store many previous squared gradients, and this updating scheme allows us to essentially restrict the window of previous gradients being accumulated to some fixed $w$ (as opposed to accumulating every single previous gradient), weighting recent gradients more than in Adagrad. Due to this 'sliding' window, the scaling term of the learning rate will not eventually decline to 0 as in Adagrad, rather it captures a more local estimates by being able to weight recent gradients more. Thus, if we replace this with the accumulation term from Adagrad, our theoretical update term for each parameter $\theta_i$ would be

$$\Delta\theta_{t, i} = - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_{t, i} = - \frac{\eta}{\textrm{RMS}[g]_t} \cdot g_{t, i} $$

by the definition of RMS (keeping in mind the inclusion of $\epsilon$ term for the first iteration when gradients are 0), and

$$\theta_{t+1, i} = \theta_{t, i} + \Delta\theta_{t, i}$$

Adadelta also aims to correct one more downfall of update methods such as Adagrad by considering units. Specifically, Zeiler notes that if the parameters $\theta$ have some hypothetical units, it only makes sense that the update should be in terms of those units. However, the update $\Delta\theta$ above (as well as the one in Adagrad) does not have these same units due to the inverse square gradient terms. To correct for this, Adadelta defines another running average of the squared paramter updates,

$$E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t - 1} + (1 - \rho) \Delta\theta_t^2$$

as well as the RMS of these update terms,

$$\textrm{RMS}[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}$$

Of course, RMS$[\Delta\theta]_t$ is unknown, so it is approximated by the RMS of these update terms until the previous time step, that is, RMS$[\Delta\theta]_{t - 1}$. Thus, the full Adadelta update rule is:

$$\Delta\theta_{t, i} = - \frac{\textrm{RMS}[\Delta\theta]_{t - 1}}{\textrm{RMS}[g]_t} \cdot g_{t, i}$$$$\theta_{t+1, i} = \theta_{t, i} + \Delta\theta_{t, i}$$

Now the update has the proper units: instead of being in inverse units of $\theta$ as it is in Adagrad (the update term is proportional to the gradient which is essentially a derivative and therefore in inverse units of $\theta$), the additional of the RMS of the updates themselves in the numerator corrects for this and puts the update in the same units as $x$.

As mentioned above, the main benefit of Adadelta is that by using the running average of gradients, the learning rate does not asymptotically decline to 0 and halt convergence. Additionally, the update rule for Adadelta makes no specific mention of a global learning rate $\eta$ (which is the same as simply setting it to 1), so there is theoretically no need to set a default rate, another advantage over methods such as Adagrad.

We present below a simple implementation of Adadelta, and demonstrate its ability to optimize the same basic example as in the Adagrad demonstration below.

In [6]:
def simple_adadelta(loss_fun,
                    gradient_fun, 
                    pho, 
                    epsilon, 
                    init, 
                    max_iter = 100, 
                    stop_criteria = 0.01):
    
    # The step numbers follow the lines in the pseudocode found in Zeiler's paper.

    # Step 1
    mean_square_g = np.array([0] * len(init))
    mean_square_x = np.array([0] * len(init))
    i = 1
    abs_diff_update = 100000
    current_x = init
    
    current_loss = loss_fun(init)
    
    x_history = np.array([np.append(current_x, current_loss)])
    
    # Step 2
    while i <= max_iter:  
        
        # Step 3
        gradient = gradient_fun(current_x)
        
        # Step 4
        mean_square_g = pho * mean_square_g + (1 - pho) * gradient**2
        
        # Step 5
        RMS_x_with_e = np.sqrt(mean_square_x + epsilon)
        RMS_g_with_e = np.sqrt(mean_square_g + epsilon)
        delta_x = -1.0*RMS_x_with_e / RMS_g_with_e * gradient
        
        # Step 6
        mean_square_x = pho * mean_square_x + (1 - pho) * delta_x**2
        
        # Step 7
        current_x = current_x + delta_x
        current_loss = loss_fun(current_x)
        x_history = np.append(x_history, np.array([np.append(current_x, current_loss)]), axis = 0)
        abs_diff_update = np.max(np.abs(delta_x))
        
        i += 1
    
    return(x_history) 
In [7]:
# The simple cosine case
min_vals = simple_adadelta(loss_fun = f,
                gradient_fun = gradient,
                pho = 0.1,
                epsilon = 0.001,
                init = np.array([1]),
                stop_criteria = 0.0001
               )[-1]

print("The function has a minimum value of {} at x = {}.".format(min_vals[1], min_vals[0]))
The function has a minimum value of -1.0 at x = 3.141592653589793.

Adadelta correctly locates the minimum of the cosine function at $\pi$.

1.3 Adam

Finally, we consider one last method which computes adaptive learning rates for each parameter: Adam (Adaptive Moment Estimation). Similar to Adadelta, Adam keeps an exponentially decaying average of previous squared gradients, $v_t$ such that $v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$, but also keeps an exponentially decaying average of past gradients (to the first power) $m_t$, such that $m_t = \beta_1 m_{t - 1} + (1 - \beta_1) g_t$. The running averages $m_t$ and $v_t$ act as per-dimensional estimates for the first and second moments of the gradient, giving the method its name. As in Adagrad and Adadelta, these running averages start as vectors of zeros, which means that these terms are highly biased toward 0 at the initial time steps. Moreover, if $\beta_1$ and $\beta_2$ are close to 1, then the decay rate is quite small and the initial bias toward 0 persists throughout the algorithm's run. To counteract this, bias-corrected estimates are computed as follows:

$$\hat{m_t} = \frac{m_t}{1 - \beta_1^t}$$$$\hat{v_t} = \frac{v_t}{1 - \beta_2^t}$$

The update rule for Adam is then very similar to those we have seen above:

$$\theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{\hat{v_t}} + \epsilon} \hat{m_t}$$

The exponentially decaying average of squared gradients in the denominator here has the same advantages as in Adadelta. The addition here comes from a momentum term, $\hat{m_t}$. The idea behind introducing a momentum term is that it can help accelerate gradient descent in the relevant direction. The momentum term will increase for dimensions with gradients pointing in the same direction and will decrease for dimensions with gradients pointing in often-changing directions. This speeds up convergence in areas where the surface curves more sharply in some directions than in others, which is common around local optima. Vanilla gradient descent will tend to oscillate around the slopes of this kind of region, and will make slow progress toward the optimum. Momentum helps to counteract these oscillations. Thus, Adam is a kind of balancing act between using momentum to accelerate in the relevant directions while also using the running average of previous gradients to even out each dimension's progress throughout the algorithm's run.

We present below a simple implementation of Adam, and demonstrate its ability to optimize the same basic example as in the Adagrad and Adam demonstrations below.

In [8]:
def simple_adam(gradient, lr, iterations, beta_1, beta_2, epsilon, init):
    m = 0
    v = 0
    theta = init
    theta_history = [theta]
    for i in range(iterations):
        # get gradient
        grad = gradient(theta)

        # update m and v
        m = beta_1 * m + (1 - beta_1) * grad
        v = beta_2 * v + (1 - beta_2) * grad ** 2

        # get hats
        m_hat = m / (1 - beta_1)
        v_hat = v / (1 - beta_2)

        # update theta
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)
        theta_history.append(theta)
        
    return(theta_history)
In [9]:
min_x = simple_adam(gradient, .001, 100000, .9, .999, 1e-8, np.array([1]))[-1]
print("The function has a minimum value of {} at x = {}.".format(f(min_x)[0], min_x[0]))
The function has a minimum value of -1.0 at x = 3.1415926536401777.

Adam correctly locates the minimum of the cosine function at $\pi$.

Section 2: Linear Regression Experiment

We now perform several experiments, applying these three algorithms to various optimization problems and analyzing the results.

Linear Regression and SGD

The first of our experiments involves a very simple least squares minimization in a linear regression problem, and allows us to insert a quick note about stochastic gradient descent and the stochastic adaptability of all of these algorithms. In a problem such as least squares minimization, we compute loss functions over the entirety of a dataset, and in vanilla or "batch" gradient descent, we compute the gradients over the entire dataset as well (which simply amounts to summing up the gradients at each point in the data). This incurs a great deal of computational expense. To account for this, stochastic gradient descent is often used in such settings (SGD has, by far, become the most widely used optimizer for training neural networks). In stochastic gradient descent, rather than compute the gradient over the entire dataset and then update the parameters, computes the gradient over a single point in the dataset and updates the parameters. This works because of the reasonable assumption that over a number of single point samples, the gradients we obtain will tend to reasonably approximate the gradient over the whole dataset, thus getting fairly similar information for far less computational expense in each iteration.

We see that nothing about this stochastic extension of gradient descent should change the basic mechanics of the algorithms discussed above. We carry out the parameter updates in the exact same ways, simply computing gradients over single points in a dataset rather than summing up the gradients over all the points in the data. There are a number of ways to introduce this single-point sampling modification, including shuffling the dataset and iterating through each point in the shuffled dataset and updating, or sampling a random point from the dataset in each iteration and updating. In our implementations of these different SGD algorithms below, we take the latter approach.

Below, we define stochastic versions of the three algorithms above, and use them to solve a simple linear regression problem. We present various plots showing the performance of these algorithms on this problem, and provide subsequent analysis.

2.1 Define algorithms

Stochastic Adagrad

In [10]:
def stoch_adagrad(gradient, init, x_data, y_data, lr=1, iterations=1000, epsilon=1e-1):
    
    theta = init
    theta_history = []
    theta_history.append(theta)
    previous_gradient_sums = np.array([0] * len(theta))
    
    for i in range(iterations):
        idx = np.random.randint(0, len(x_data))
        y = y_data[idx]
        X = x_data[idx]
        
        grad = gradient(theta, X, y)
        theta = theta - (lr/(np.sqrt(previous_gradient_sums + epsilon))) * grad
        theta_history.append(theta)
        previous_gradient_sums = previous_gradient_sums + grad ** 2
        
    return(theta_history)

Stochastic Adadelta

In [11]:
def stoch_adadelta(gradient_fun, 
                   feature_data, 
                   y_data, 
                   pho, 
                   epsilon, 
                   init, 
                   max_iter = 1000):
    
    mean_square_g = np.zeros(len(init))
    mean_square_x = np.zeros(len(init))
    current_x = init
    
    theta_history = [current_x]
    
    idx = np.random.choice(range(len(y_data)), size = max_iter)
        
    for feature, y in zip(feature_data[idx], y_data[idx]):
        
        gradient = gradient_fun(current_x, y, feature)
        mean_square_g = pho * mean_square_g + (1 - pho) * gradient**2

        RMS_x_with_e = np.sqrt(mean_square_x + epsilon)
        RMS_g_with_e = np.sqrt(mean_square_g + epsilon)
        
        delta_x = -1.0*RMS_x_with_e / RMS_g_with_e * gradient
            
        mean_square_x = pho * mean_square_x + (1 - pho) * delta_x**2

        current_x = current_x + delta_x
        theta_history.append(current_x)

    return(theta_history)

Stochastic Adam

In [12]:
def stoch_adam(gradient, init, x_data, y_data, lr=1, iterations=1000, beta_1=.9, beta_2=.99, epsilon=1e-8):
    m = np.zeros(len(init))
    v = np.zeros(len(init))
    theta = init
    theta_history = [theta]
    for i in range(iterations):
        
        # pick random X and y from data
        idx = np.random.randint(0, len(x_data))
        y = y_data[idx]
        X = x_data[idx]
        
        grad = gradient(theta, X, y)
        m = beta_1 * m + (1 - beta_1) * grad
        v = beta_2 * v + (1 - beta_2) * grad ** 2
        
        m_hat = m / (1 - beta_1)
        v_hat = v / (1 - beta_2)
        
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)
        theta_history.append(theta)
        
    return(theta_history)

2.2 Least Squares Loss Function

We can now set up a Least Squares Regression problem. In ordinary least squares for a simple linear regression, we model the response as a linear function of the predictor, that is,

$$y_i = w_0 + w_1x_i$$

For given estimates of the slope and intercept term, $\hat{w_0}$ and $\hat{w_1}$, the most common way of measuring how far off these estimates are from the true parameters is to measure the sum of squares of the residuals: the difference between the true value of the response corresponding to a given observation $x_i$ and the value of the response predicted by the linear model at this observation: $\hat{y_i} = \hat{w_0} + \hat{w_1}x_i$. This is known as the Residual Sum of Squares (RSS) loss function, and the coefficients of an OLS model are obtained by minimizing this loss function:

$$L(\mathbf{w}) = \sum_{i=1}^n (y_i - \hat{y_i})^2 = \sum_{i=1}^n (y_i - \hat{w_0} - \hat{w_1}x_i)^2$$

The gradient of this loss function with respect to the parameters $\hat{w_0}$ and $\hat{w_1}$ is quite easy to compute:

$$\frac{\delta L}{\delta \hat{w_0}} = \sum_{i=1}^n -2(y_i - \hat{w_0} - \hat{w_1}x_i)$$$$\frac{\delta L}{\delta \hat{w_1}} = \sum_{i=1}^n -2(y_i - \hat{w_0} - \hat{w_1}x_i)x_i$$$$\nabla_{\mathbf{w}}L(\mathbf{w}) = -2\sum_{i=1}^n <(y_i - \hat{w_0} - \hat{w_1}x_i), (y_i - \hat{w_0} - \hat{w_1}x_i)x_i>$$

The RSS loss function itself is also quite easy to visualize, and its contour plot is shown in the animation found below. It is clear from the functional form that it is a paraboloid in three dimensions, with vertex at the true values of $w_0$ and $w_1$. Its smooth bowl-like shape means that there are no local minima or saddle points. In other words, the function is everywhere-convex, which can be seen by the fact that the second partial derivatives of the loss function with respect to the weight estimates are always positive. Specifically,

$$\frac{\delta^2 L}{\delta \hat{w_0^2}} = \sum_{i=1}^n 2 > 0$$$$\frac{\delta^2 L}{\delta \hat{w_1^2}} = \sum_{i=1}^n 2x_i^2 > 0$$

Thus, there are no strange reasons of the surface where we would expect an optimizer such as the gradient descent functions described above to get stuck.

2.3 Linear Regression Experiment and Analysis

We define below a simple regression problem by generating data according to the line $y = 3x - 10$, adding some noise to the response variable, and then running our stochastic versions of Adagrad, Adadelta, and Adam on the data to arrive at estimates for the slope and intercept terms. We define in the code below the RSS loss function described above and its gradient for use in the algorithms, as well as several functions to help showcase the performance of these algorithms, including an animation showing the algorithms converging (or attempting to converge) to the minimum.

In the plot showing the data along with the best fit line for each algorithm, we see that Adagrad and Adam are able to converge to a pretty decent best fit line, while Adadelta leaves a bit to be desired with a shallower slope and a raised intercept. In the loss vs. iteration plot directly below, we see how long it takes each algorithm to arrive at its respective solution, starting from a pretty far off initial estimate of (13, -15). We see that Adagrad and Adam race to a solution right away (after just a few dozen iterations) and then oscillate around its final loss of close to 0, while the loss in the Adadelta algorithm declines much more slowly and seems to decrease toward 0 or a value a bit higher in asymptotic fashion.

This suggests that in this case, Adadelta could theoretically benefit from a higher learning rate (and we believe that this is indeed the issue since even if we start the initial guess much closer to the true values -- which is easy to do in the code below -- Adadelta faces the same issues). However, we recall from above that its implementation contains no such hyperparameter (it essentially just uses 1, in addition to the dynamic average of paramter updates in the numerator), and changing the value of $\epsilon$ to a small value to help kickstart the algorithm in its first few iterations does not seem to help much either. Thus, it appears that in this particular example, the supposed benefit of not having a manually tuned global learning rate turns out to be a disadvantage for Adadelta, as we lose some flexibility in controlling the rate of convergence.

There are a couple of other interesting things to take note of here. Firstly, while the suggested value of $\epsilon$ for Adagrad is typically on the order of $1e-8$, in this case we must set it to about $1e4$ to consistently achieve a decent fit. When it is set to a smaller value, the initial update jumps wildly in the wrong direction and never recovers. This makes sense given Adagrad's formulation: the initial vector tracking the previous squared gradients is set at 0, so the $\epsilon$ term is the only term in the denominator in the first update. Thus, when it is small, the adapted learning rate ends up being huge and shifting the algorithm way off. Then, as mentioned above, the accumulation of previous squared gradients causes the adaptive learning rate to decrease monotonically and asymptotically toward 0, which is likely why the Adagrad algorithm is never able to recover from its initial mistake, even though the easy shape of the least squares loss function should theoretically be able to guide it to the minimum without much trouble. However, if the learning rate quickly becomes too small, the algorithm will not converge. Thus, without pretty precise tuning of this smoothing parameter $\epsilon$, Adagrad gets itself into quite a hole, but with the right value, it converges extremely quickly to the solution. This is perhaps one disadvantage of Adagrad in an example such as this (and in general).

We notice that Adam seems to have the best behavior out of these three algorithms, and we find that it is pretty robust to some change in its hyperparameters, unlike the other two algorithms. This also makes sense given its formulation: it is able to get around the slow convergence from which Adadelta suffers by introducing the momentum term, but is also able to dig itself out of a hole if necessary by avoiding the asymptotic decline of the adaptive learning rate, by using the same square gradient average scheme as Adadelta.

Animator helper

In [2]:
def animate_healper(head, trail, data1, data2, history, i):
    theta_step = history[i]
    head.set_data(theta_step[0], theta_step[1])
    data1.append(theta_step[0])
    data2.append(theta_step[1])
    trail.set_data(data1, data2)

Code for using the algorithms to fit regression and output visuals

In [18]:
def regression_output(# following are data generation and true regression parameter
                      data_size,
                      noise_magnitude,
                      slope,
                      intercept,
                      
                      init,
                      iterations,
                      iterations_adadelta,
                      
    
                      # following is the hyperparameter of learning rate
                      lr_adagrad,
                      lr_adam,
    
                      # following is hyperparameter for Adadelta
                      pho,
    
                      # following is the hyperparameter of epsilon
                      epsilon_adagrad = 1e4,
                      epsilon_adadelta = 1e-8,
                      epsilon_adam = 1e-8
                     ):
    
    
    def linear_func(x, slope, intercept):
        return x * slope + intercept
    
    ## Define Very Simple Gradient of Least Squares
    def grad_best_fit(W, X, y):
        return 2 * (y-np.dot(X, W)) * (-X)
    
    
    ## Generate Data for Linear Regression
    # Make X grid
    x_data = np.linspace(0,10, data_size)
    ones = np.ones(data_size).reshape(-1,1)
    X_data = np.append(x_data.reshape(-1,1), ones, axis=1)

    # Make y with noise
    noise = scipy.stats.norm.rvs(size=data_size, scale=noise_magnitude)
    y = linear_func(x_data, slope, intercept)
    y_data = y + noise
    
    ## Run SGD Optimizers to Get Best Fit Parameters and Parameter History
    
    adagrad_history = stoch_adagrad(grad_best_fit, init, X_data, y_data, lr=lr_adagrad, iterations=iterations, 
                                    epsilon=epsilon_adagrad)
    
    adagrad_fit = adagrad_history[-1]

    adadelta_history = stoch_adadelta(grad_best_fit, X_data, y_data, pho=pho, 
                                      epsilon=epsilon_adadelta, init=init, max_iter = iterations_adadelta)
    
    adadelta_fit = adadelta_history[-1]

    adam_history = stoch_adam(grad_best_fit, init, X_data, y_data, lr=lr_adam, iterations=iterations, 
                              epsilon = epsilon_adam)
    adam_fit = adam_history[-1]
    
    
    # 1st Plot: linear trend and fit
    fig, ax = plt.subplots(3, 1, figsize=(8, 6))
    fig.tight_layout(h_pad=5)

    adagrad_y_hat=linear_func(x_data, slope=adagrad_fit[0], intercept=adagrad_fit[1])
    adadelta_y_hat = linear_func(x_data, slope=adadelta_fit[0], intercept=adadelta_fit[1])
    adam_y_hat=linear_func(x_data, slope=adam_fit[0], intercept=adam_fit[1])

    y_hats = [adagrad_y_hat, adadelta_y_hat, adam_y_hat]
    labels = ['adagrad fit', 'adadelta fit', 'adam fit']
    colors = ['red', 'springgreen', 'blue']
    algorithms = ['adagrad', 'adadelta', 'adam']

    for row in range(len(y_hats)):
        ax[row].set_facecolor((1,1,.98))

        # Scatter of generated data
        ax[row].scatter(x=x_data, y=y_data, color='purple', label='data', alpha=.5)
    
        # Best fit line
        ax[row].plot(x_data, y_hats[row], color=colors[row], label=labels[row], lw=3, alpha=.7)
   

        ax[row].set_ylabel("y")
        ax[row].set_xlabel("x")
        ax[row].set_title(algorithms[row] + " SGD Best Fit on Randomly Generated Linear Data")
        ax[row].legend();
    
    
    
    # build up the loss function
    def least_squares_loss(x_data, y_data, fit_slope, fit_intercept):
        # init loss
        loss = 0
    
        # add square loss from each data point
        for i in range(len(x_data)):
            x = x_data[i]
            y = y_data[i]
            y_hat = linear_func(x=x, slope=fit_slope, intercept=fit_intercept)
            loss += (y - y_hat)**2
    
        return loss
    
    # Track loss function history and graph it

    def loss_func(theta, x_data, y_data):
        loss = 0
        for i in range(len(y_data)):
            x = x_data[i]
            y = y_data[i]
            residual = linear_func(x, slope=theta[0], intercept=theta[1]) - y
            loss += residual ** 2
        return loss
    
    def plot_loss(loss_func, histories, algorithms, ax, ylab):
        idx = 0
        for history in histories:
            loss_history = [loss_func(theta, x_data, y_data) for theta in history]
            ax.plot(range(len(loss_history)), loss_history, lw=2, label=algorithms[idx])
        ax.set_facecolor((1,1,.98))
        ax.set_xlabel("Optimizer Step")
        ax.set_ylabel(ylab)
        ax.legend(algorithms)
        ax.set_title("Loss over iterations");
        return
    
    # 2nd plot: loss function by iteration
    fig2, ax2 = plt.subplots(1,1,figsize=(8,3))
    plot_loss(loss_func, [adagrad_history, adadelta_history, adam_history], 
              ["adagrad", "adadelta", "adam"], ax2, "Total Least Squares Loss")
    
    
    # 3rd plot: animation
    fig, ax = plt.subplots(1,1, figsize = (8,5))

    # Make contour plot of our pdf f
    x_grid = np.linspace(-10,20,100)
    y_grid = np.linspace(-40, 40, 100)
    xx, yy = np.meshgrid(x_grid, y_grid)
    z = least_squares_loss(x_data, y_data, xx, yy)

    contour = ax.contour(x_grid, y_grid, z, 50)
    target = ax.scatter(slope, intercept, color="purple", marker='+', s=500)


    # Animation
    def init_squares():

        del adagrad_xdata[:]
        del adagrad_ydata[:]

        del adadelta_xdata[:]
        del adadelta_ydata[:]

        del adam_xdata[:]
        del adam_ydata[:]

        adagrad_line.set_data(adagrad_xdata, adagrad_ydata)
        adadelta_line.set_data(adadelta_xdata, adadelta_ydata)
        adam_line.set_data(adam_xdata, adam_ydata)    

        adagrad.set_data([init[0]], [init[1]])
        adadelta.set_data(init[0], [init[1]])
        adam.set_data([init[0]], [init[1]])

        return

    adagrad, = ax.plot([init[0]], [init[1]], color="blue", marker='o', label="Adagrad SGD", markersize=10)
    adagrad_line, = ax.plot([], [], lw=4, color="blue", alpha=.4)
    adagrad_xdata, adagrad_ydata = [], []

    adadelta, = ax.plot([init[0]], [init[1]], color="springgreen", marker='o', label="Adadelta SGD", markersize=10)
    adadelta_line, = ax.plot([], [], lw=4, color="springgreen", alpha=.4)
    adadelta_xdata, adadelta_ydata = [], []

    adam, = ax.plot([init[0]], [init[1]], color="red", marker='o', label="Adam SGD", markersize=10)
    adam_line, = ax.plot([], [], lw=4, color="red", alpha=.4)
    adam_xdata, adam_ydata = [], []

    def animate_squares(i):

        adagrad_theta_step = adagrad_history[i]
        adagrad.set_data(adagrad_theta_step[0], adagrad_theta_step[1])
        adagrad_xdata.append(adagrad_theta_step[0])
        adagrad_ydata.append(adagrad_theta_step[1])
        adagrad_line.set_data(adagrad_xdata, adagrad_ydata)

        adadelta_theta_step = adadelta_history[i]
        adadelta.set_data(adadelta_theta_step[0], adadelta_theta_step[1])
        adadelta_xdata.append(adadelta_theta_step[0])
        adadelta_ydata.append(adadelta_theta_step[1])
        adadelta_line.set_data(adadelta_xdata, adadelta_ydata)

        adam_theta_step = adam_history[i]
        adam.set_data(adam_theta_step[0], adam_theta_step[1])
        adam_xdata.append(adam_theta_step[0])
        adam_ydata.append(adam_theta_step[1])
        adam_line.set_data(adam_xdata, adam_ydata)

        return 

    animate = animation.FuncAnimation(fig, animate_squares,  blit=False, interval=100,
                                  repeat=False, init_func=init_squares)


    # Set axis details
    ax.set_xlabel("$slope$")
    ax.set_ylabel("$intercept$")
    ax.set_title("Descent on Least squares loss function")
    ax.legend();
    
    return animate  

Run Least Squares fit and animation. Note: plots will take time to generate

In [19]:
slope = 3.0
intercept = -10.0

regression_output(# following are data generation and true regression parameter
                      data_size = 1000,
                      noise_magnitude = 5,
                      slope = 3.0,
                      intercept = -10.0,
                      
                      
                      init = np.array([slope + 10, intercept - 5]),
                      iterations = 1000,
                      iterations_adadelta = 11000,
    
                      # following is the hyperparameter of learning rate
                      lr_adagrad = 1,
                      lr_adam = 0.1,
    
                      # following is the hyperparameter of epsilon
                      epsilon_adagrad = 1e4,
                      epsilon_adadelta = 1e-6,
                      epsilon_adam = 1e-8,
    
                      # following is hyperparameter for Adadelta
                      pho = 0.5
                     )
Out[19]:
<matplotlib.animation.FuncAnimation at 0x18a0683cf60>

Animation Analysis:

We see that Adagrad takes a big jump straight toward the solution and then quickly settles down there. Similarly, Adam takes a large jump away from the initial point (not quite as large as Adagrad's), and then oscillates toward the solution. Meanwhile, as described above, Adadelta does not perform particularly well here. It fails to move far from the initial point, and whatever little movement does take place is extremely slow. This lines up with our observations from the loss vs. iteration plot, discussed above. Adagrad and Adam race toward a good solution, while Adadelta moves very slowly to a mediocre fit.

Section 3: Experiments with More Varied Terrain

3.1 Re-define Simple Algorithms for Use Below and prepare animation functions

In [3]:
def simple_adam(gradient, init, lr=1, iterations=1000, beta_1=.9, beta_2=.99, epsilon=10**(-8)):
    m = np.zeros(len(init))
    v = np.zeros(len(init))
    theta = init
    theta_history = []
    for i in range(iterations):
        theta_history.append([theta[0], theta[1]])
    
        grad = gradient(theta[0], theta[1])
        m = beta_1 * m + (1 - beta_1) * grad
        v = beta_2 * v + (1 - beta_2) * grad ** 2
        
        m_hat = m / (1 - beta_1)
        v_hat = v / (1 - beta_2)
        
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)
        
    return theta_history


def simple_adagrad(gradient, init, lr=.1, iterations=1000, epsilon=1e-3):
    
    theta = init
    theta_history = [theta]
    
    # Define array of running squared sum of gradients, the i-th entry in this array corresponds to the i-th dimension
    previous_gradient_sums = np.array([0] * len(theta))
    
    for i in range(iterations):
        grad = gradient(theta[0], theta[1])
        theta = theta - (lr/(np.sqrt(previous_gradient_sums + epsilon))) * grad
        theta_history.append(theta)
        previous_gradient_sums = previous_gradient_sums + grad ** 2
        
    return(theta_history)



def simple_adadelta(loss_fun,
                    gradient_fun, 
                    pho, 
                    epsilon, 
                    init, 
                    max_iter = 100, 
                    stop_criteria = 0.00001):
    
    # the step numbers follow the algorithm described in the paper

    # step 1

    mean_square_g = 0
    mean_square_x = 0
    i = 1
    abs_diff_update = 100000
    current_x = init

    
    current_loss = loss_fun(init[0], init[1])
    
    x_history = np.array([np.append(current_x, current_loss)])
    
    while i <= max_iter and abs_diff_update >= stop_criteria:

        # step 3
        gradient = gradient_fun(current_x[0], current_x[1])
        
        # step 4
        mean_square_g = pho * mean_square_g + (1 - pho) * gradient**2
        
        # step 5
        RMS_x_with_e = np.sqrt(mean_square_x + epsilon)
        RMS_g_with_e = np.sqrt(mean_square_g + epsilon)
        delta_x = -1.0*RMS_x_with_e / RMS_g_with_e * gradient
        
        # step 6
        mean_square_x = pho * mean_square_x + (1 - pho) * delta_x**2
        
        # step 7
        
        current_x = current_x + delta_x
        
        current_loss = loss_fun(current_x[0], current_x[1])
        
        x_history = np.append(x_history, np.array([np.append(current_x, current_loss)]), axis = 0)
        
        abs_diff_update = np.max(np.abs(delta_x))
        
        i += 1
    
    return x_history
In [4]:
def output_generation(loss_fun_name, loss_fun, gradient_fun, plot_fun, play_speed,
                      
                      # following is the hyperparameter of learning rate
                      lr_adagrad,
                      lr_adam,
                      
                      
                      init,
                      iterations_adadelta,
                      iterations = 200,
                      
                      
                      # following are hyperparamters for Adam
                      beta_1 = 0.9,
                      beta_2 = 0.99,
                      
    
                      # following is hyperparameter for Adadelta
                      pho = 0.3,
    
                      # following is the hyperparameter of epsilon
                      epsilon_adagrad = 1,
                      epsilon_adadelta = 1e-8,
                      epsilon_adam = 1e-8
                     ):
    
    # somehow after running simple_adam, the init value changes, so need to make copy of them
    init1 = copy.copy(init)
    init2 = copy.copy(init)
    init3 = copy.copy(init)
    
    # Run gradient descent optimizer to get best fit parameters and parameter history
    
    adam_history = simple_adam(gradient_fun, 
                               init = init1, 
                               lr=lr_adam, 
                               iterations=iterations,
                               epsilon = epsilon_adam
                              )
    

    
    adagrad_history = simple_adagrad(gradient_fun, 
                                     init = init2, 
                                     lr=lr_adagrad, 
                                     iterations=iterations,
                                     epsilon = epsilon_adagrad
                                    )
    

    adadelta_history = simple_adadelta(loss_fun = loss_fun,
                                       gradient_fun = gradient_fun,
                                       pho = pho,
                                       epsilon = epsilon_adadelta,
                                       init = init3,
                                       stop_criteria = 0.0000001,
                                       max_iter = iterations_adadelta
                                      )

    
    # plot loss function
    adam_loss_history = [loss_fun(theta[0], theta[1]) for theta in adam_history]
    adagrad_loss_history = [loss_fun(theta[0], theta[1]) for theta in adagrad_history]
    adadelta_loss_history = adadelta_history[:,2]
    
    
    
    
    fig2, ax2 = plt.subplots(1,1,figsize=(8,3))
    ax2.set_facecolor((1,1,.98))

    ax2.plot(range(len(adam_loss_history)), adam_loss_history, lw=2, color='red', label = 'adam', alpha = 0.4)
    ax2.plot(range(len(adagrad_loss_history)), adagrad_loss_history, lw=2, color='k', label = 'adagrad', alpha = 0.4)
    ax2.plot(range(len(adadelta_loss_history)), adadelta_loss_history, lw=2, color='blue', label = 'adadelta',
            alpha = 0.4)

    ax2.set_xlabel("Optimizer Step")
    ax2.set_ylabel(loss_fun_name + " Total Loss")
    ax2.set_title("Loss over iterations")
    ax2.legend();
    
    
    
    # Animation Plot
    
    def animate_init():
        del adam_xdata[:]
        del adam_ydata[:]
        
        del adagrad_xdata[:]
        del adagrad_ydata[:]
        
        del adadelta_xdata[:]
        del adadelta_ydata[:]
        
        adam_trail.set_data(adam_xdata, adam_ydata)
        adagrad_trail.set_data(adagrad_xdata, adagrad_ydata)
        adadelta_trail.set_data(adadelta_xdata, adadelta_ydata)
        
        adam.set_data([init[0]], [init[1]])
        adagrad.set_data(init[0], [init[1]])
        adadelta.set_data(init[0], [init[1]])
        return
    
    
    def animate_function(i):
        animate_healper(head=adam, data1=adam_xdata, 
                        data2=adam_ydata, history=adam_history, trail=adam_trail, i=i)
    
        animate_healper(head=adagrad, data1=adagrad_xdata, 
                        data2=adagrad_ydata, history=adagrad_history, trail=adagrad_trail, i=i)
    
        animate_healper(head=adadelta, data1=adadelta_xdata, 
                        data2=adadelta_ydata, history=adadelta_history, trail=adadelta_trail, i=i)
        return 
    
    fig, ax = plt.subplots(1,1, figsize = (8,6))
    x_range, y_range = plot_fun(fig, ax)

    
    # Animation Plot
    adadelta, = ax.plot([init[0]], [init[1]], color="blue", marker='o', label="Adadelta SGD", markersize=10)
    adadelta_trail, = ax.plot([], [], lw=4, color="blue", alpha=.4)
    adadelta_xdata, adadelta_ydata = [], []
    
    adam, = ax.plot([init[0]], [init[1]], color="red", marker='o', label="Adam SGD", markersize=10)
    adam_trail, = ax.plot([], [], lw=4, color="red", alpha=.4)
    adam_xdata, adam_ydata = [], []
    
    adagrad, = ax.plot([init[0]], [init[1]], color="springgreen", marker='o', label="Adagrad SGD", markersize=10)
    adagrad_trail, = ax.plot([], [], lw=4, color="springgreen", alpha=.4)
    adagrad_xdata, adagrad_ydata = [], []
    
    
    
    animate = animation.FuncAnimation(fig, animate_function,  blit=False, interval=play_speed,
                                      repeat=False, init_func=animate_init)

    ax.set_xlim(x_range[0],x_range[1])
    ax.set_ylim(y_range[0],y_range[1])
    ax.legend()
    
    return animate

3.2 The Beale Function

Now we experiment with a more interesting function, the Beale function:

$$f(x, y) = (1.5 - x + xy)^2 + (2.25 - x + xy^2)^2 + (2.625 - x + xy^3)^2$$

Again, computing its gradient by hand is not difficult, and for the sake of space we omit the partial derivative calculations here, and show the results in the code below. We plot the contours of the Beale Function below, and mark its minimum: (3, 0.5).

In [5]:
def Beale_Fun(x, y):
    return (1.5-x+x*y)**2 + (2.25-x+x*y**2)**2 + (2.625-x+x*y**3)**2

def partial_gradient_x(x, y):
    a = 2*(1.5-x+x*y)*(-1+y)
    b = 2*(2.25-x+x*y**2)*(-1+y**2)
    c = 2*(2.625-x+x*y**3)*(-1+y**3)
    return a+b+c

def partial_gradient_y(x, y):
    a = 2*(1.5-x+x*y)*(x)
    b = 2*(2.25-x+x*y**2)*(2*x*y)
    c = 2*(2.625-x+x*y**3)*(3*x*y**2)
    return a+b+c

def gradient_beale(x, y):
    return np.array([partial_gradient_x(x,y), partial_gradient_y(x, y)])

# Target function plotting
def plot_beale(fig, ax):

    # Make contour plot of our pdf f
    x1_grid = np.linspace(-5,5,100)
    x2_grid = np.linspace(-5,5,100)
    xx1, xx2 = np.meshgrid(x1_grid, x2_grid)
    z = Beale_Fun(xx1, xx2)

    bukin_contour = ax.contour(x1_grid, x2_grid, z, 60)
    
    ax.scatter(3, 0.5, label='global min', color='k', marker="+", s=100)
    
    plt.show()
    return np.array([np.min(x1_grid), np.max(x1_grid)]), np.array([np.min(x2_grid), np.max(x2_grid)])

3.2.1 Let's first set the initial point to be the center

In [6]:
output_generation(loss_fun_name = "Beale",
                  loss_fun = Beale_Fun, 
                  gradient_fun = gradient_beale, 
                  plot_fun = plot_beale,
                  play_speed = 100,
                      
                  # following are hyperparamters for all three algorithms
                  init = np.array([0.0, 0.0]),
                  iterations = 400,
                  iterations_adadelta = 400,
                      
                  # following is the hyperparameter of learning rate
                  lr_adagrad = 0.1,
                  lr_adam = 0.1,
                      
                  # following are hyperparamters for Adam
                  beta_1 = 0.9,
                  beta_2 = 0.99,
                      
                  # following is hyperparamters for Adadelta
                  pho = 0.4,
                  
                  # following is the hyperparameter of epsilon
                  epsilon_adagrad = 1,
                  epsilon_adadelta = 1e-4,
                  epsilon_adam = 1e-8
                     
                 )
Out[6]:
<matplotlib.animation.FuncAnimation at 0x1a1decfb00>

Animation Analysis:

We can see that Adam performs the best here being the closet to the target. Adadelta had been very close to the target, but then it has great volatility near the target, apparently having a hard time to converge, and eventually it actually moves away and stay in a worse position than it used to. Here Adagrad has been able to be very close to where Adadelta finally stands, however in a slower fashion due to the increase of the summed gradients.

3.2.2 Changing the initial position to a nearby high density area

In [10]:
output_generation(loss_fun_name = "Beale",
                  loss_fun = Beale_Fun, 
                  gradient_fun = gradient_beale, 
                  plot_fun = plot_beale,
                  play_speed = 100,
                      
                  # following are hyperparamters for all three algorithms
                  init = np.array([4.0, 3.5]),
                  iterations = 1000,
                  iterations_adadelta = 1000,
                      
                  # following is the hyperparameter of learning rate
                  lr_adagrad = 0.1,
                  lr_adam = 0.1,
                      
                  # following are hyperparamters for Adam
                  beta_1 = 0.9,
                  beta_2 = 0.99,
                      
                  # following is hyperparamters for Adadelta
                  pho = 0.4,
                  
                  # following is the hyperparameter of epsilon
                  epsilon_adagrad = 1,
                  epsilon_adadelta = 1e-4,
                  epsilon_adam = 1e-8
                     
                 )
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:5: RuntimeWarning: overflow encountered in double_scalars
  """
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: overflow encountered in double_scalars
  
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:7: RuntimeWarning: overflow encountered in double_scalars
  import sys
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: overflow encountered in double_scalars
  # This is added back by InteractiveShellApp.init_path()
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:12: RuntimeWarning: overflow encountered in double_scalars
  if sys.path[0] == '':
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:13: RuntimeWarning: overflow encountered in double_scalars
  del sys.path[0]
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:5: RuntimeWarning: invalid value encountered in double_scalars
  """
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: invalid value encountered in double_scalars
  
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:7: RuntimeWarning: invalid value encountered in double_scalars
  import sys
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in double_scalars
  # This is added back by InteractiveShellApp.init_path()
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:12: RuntimeWarning: invalid value encountered in double_scalars
  if sys.path[0] == '':
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:13: RuntimeWarning: invalid value encountered in double_scalars
  del sys.path[0]
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: overflow encountered in double_scalars
  
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in double_scalars
  
Out[10]:
<matplotlib.animation.FuncAnimation at 0x1a1eeb7710>

Animation Analysis:

After changing the starting point to be one of the high density area, the Adadelta behaves similarly as the pervious time, so it's pretty robust in this case. Of course on the downside it still has large variance when approaching the target and eventually moves away a bit. Adam is still the closest to the target, even though this time it converges much slower. Adagrad quickly turns out of bound during this trails, as the initial gradients are really large, which rapidly increase the magnitude of theta estimate and run out of bound.

3.2.3 Reduce the learning rate of adam, and increase the pho of Adadelta

In [13]:
output_generation(loss_fun_name = "Beale",
                  loss_fun = Beale_Fun, 
                  gradient_fun = gradient_beale, 
                  plot_fun = plot_beale,
                  play_speed = 100,
                      
                  # following are hyperparamters for all three algorithms
                  init = np.array([4.0, 3.5]),
                  iterations = 1000,
                  iterations_adadelta = 1000,
                      
                  # following is the hyperparameter of learning rate
                  lr_adagrad = 0.1,
                  lr_adam = 0.05,
                      
                  # following are hyperparamters for Adam
                  beta_1 = 0.9,
                  beta_2 = 0.99,
                      
                  # following is hyperparamters for Adadelta
                  pho = 0.95,
                  
                  # following is the hyperparameter of epsilon
                  epsilon_adagrad = 1,
                  epsilon_adadelta = 1e-4,
                  epsilon_adam = 1e-8
                     
                 )
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:5: RuntimeWarning: overflow encountered in double_scalars
  """
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: overflow encountered in double_scalars
  
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:7: RuntimeWarning: overflow encountered in double_scalars
  import sys
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: overflow encountered in double_scalars
  # This is added back by InteractiveShellApp.init_path()
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:12: RuntimeWarning: overflow encountered in double_scalars
  if sys.path[0] == '':
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:13: RuntimeWarning: overflow encountered in double_scalars
  del sys.path[0]
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:5: RuntimeWarning: invalid value encountered in double_scalars
  """
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: invalid value encountered in double_scalars
  
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:7: RuntimeWarning: invalid value encountered in double_scalars
  import sys
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in double_scalars
  # This is added back by InteractiveShellApp.init_path()
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:12: RuntimeWarning: invalid value encountered in double_scalars
  if sys.path[0] == '':
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:13: RuntimeWarning: invalid value encountered in double_scalars
  del sys.path[0]
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: overflow encountered in double_scalars
  
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in double_scalars
  
Out[13]:
<matplotlib.animation.FuncAnimation at 0x1a23065d68>

Animation Analysis:

This round, we reduce the learning rate of Adam, hopefully to reduce the curvature it needs to take to reach the target. As we can observe, the initial curvature indeed reduces, but later when it goes toward the right direction, the steps also slow down. Perhaps we need more iteration for Adam to reach convergence in this case.

For the Adadelta, we increase the pho, which means the past summed gradients take more weight in the update. As observed, this time the Adadelta have been closer to the target (if not right on it), and even with its variance, the last position is closer to the target than last time with lower pho.

3.2.4 Increase the iteration of adam, and further increase the pho of Adadelta

In [14]:
output_generation(loss_fun_name = "Beale",
                  loss_fun = Beale_Fun, 
                  gradient_fun = gradient_beale, 
                  plot_fun = plot_beale,
                  play_speed = 100,
                      
                  # following are hyperparamters for all three algorithms
                  init = np.array([4.0, 3.5]),
                  iterations = 1500,
                  iterations_adadelta = 1000,
                      
                  # following is the hyperparameter of learning rate
                  lr_adagrad = 0.1,
                  lr_adam = 0.05,
                      
                  # following are hyperparamters for Adam
                  beta_1 = 0.9,
                  beta_2 = 0.99,
                      
                  # following is hyperparamters for Adadelta
                  pho = 0.98,
                  
                  # following is the hyperparameter of epsilon
                  epsilon_adagrad = 1,
                  epsilon_adadelta = 1e-4,
                  epsilon_adam = 1e-8
                     
                 )
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:5: RuntimeWarning: overflow encountered in double_scalars
  """
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: overflow encountered in double_scalars
  
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:7: RuntimeWarning: overflow encountered in double_scalars
  import sys
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: overflow encountered in double_scalars
  # This is added back by InteractiveShellApp.init_path()
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:12: RuntimeWarning: overflow encountered in double_scalars
  if sys.path[0] == '':
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:13: RuntimeWarning: overflow encountered in double_scalars
  del sys.path[0]
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:5: RuntimeWarning: invalid value encountered in double_scalars
  """
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: invalid value encountered in double_scalars
  
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:7: RuntimeWarning: invalid value encountered in double_scalars
  import sys
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in double_scalars
  # This is added back by InteractiveShellApp.init_path()
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:12: RuntimeWarning: invalid value encountered in double_scalars
  if sys.path[0] == '':
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:13: RuntimeWarning: invalid value encountered in double_scalars
  del sys.path[0]
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: overflow encountered in double_scalars
  
/Users/Tong/anaconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in double_scalars
  
Out[14]:
<matplotlib.animation.FuncAnimation at 0x1a23240ba8>

Animation Analysis:

This time during the whole process these two algorithms are on par on their traces. Adadelta does not have the variance near the target and it does not move away. So Adadelta has improved its performance in this case. Then, the Adam has been able to surpass Adadelta in the last minute! It wins back the championship on Beale function. So, a smaller learning rate help Adam to better acheive the target, and as expected it comes with a larger number of iterations.

By far, we have explored the algorithms performance on the Beale Function. With a high pho of Adadelta, and lower learning rate for Adam, we are able to achieve satifying results. In contrast, the Adagrad is highly dependent on the initial position, as it could suffer from the initial large gradient and run out of bound quickly.

3.3 The Six-Camel Function

We now present an example which underscores some of the differences between these algorithms in terms of sensitivity to initial location. We look at the six-hump camel function:

$$f(x_1, x_2) = (4 - 2.1x_1^2 + \frac{x_1^4}{3})x_1^2 + x_1x_2 + (-4 + 4x_2^2)x_2^2$$
In [35]:
def six_camel(x1, x2):
    comp1 = (4 - 2.1*x1**2 + x1**4/3)*x1**2 
    comp2 = x1*x2
    comp3 = (-4 + 4*x2**2)*x2**2
    return comp1 + comp2 + comp3

def six_camel_grad(x1, x2):
    x1_grad = 8*x1 - 8.4*x1**3 + 2*x1**5 + x2
    x2_grad = x1 -8*x2 + 16*x2**3
    return np.array([x1_grad, x2_grad])

# Target function plotting
def plot_camel(fig, ax):

    # Make contour plot of our pdf f
    x1_grid = np.linspace(-2, 2, 100)
    x2_grid = np.linspace(-1, 1, 100)
    xx1, xx2 = np.meshgrid(x1_grid, x2_grid)
    camel = six_camel(xx1, xx2)

    camel_contour = ax.contour(x1_grid, x2_grid, camel, 60)
    
    # Mark Minima
    min1 = [-0.0898, 0.7126]
    min2 = [0.0898, -0.7126]

    ax.scatter(min1[0], min1[1], label='global min', color='k', marker="+", s=100)
    ax.scatter(min2[0], min2[1], color='k', marker="+", s=100)
    plt.show()
    return np.array([np.min(x1_grid), np.max(x1_grid)]), np.array([np.min(x2_grid), np.max(x2_grid)])

3.3.1 Initialize on a slope

In [36]:
output_generation(loss_fun_name = "Camel",
                  loss_fun = six_camel, 
                  gradient_fun = six_camel_grad, 
                  plot_fun = plot_camel,
                  play_speed = 100,
                      
                  # following are hyperparamters for all three algorithms
                  init = np.array([-0.5, -.25]),
                  iterations = 100,
                  iterations_adadelta = 120,
                      
                  # following is the hyperparameter of learning rate
                  lr_adagrad = .1,
                  lr_adam = .1,
                      
                  # following are hyperparamters for Adam
                  beta_1 = 0.9,
                  beta_2 = 0.99,
                      
                  # following is hyperparamters for Adadelta
                  pho = 0.5,
                  
                  # following is the hyperparameter of epsilon
                  epsilon_adagrad = 1,
                  epsilon_adadelta = 1e-4,
                  epsilon_adam = 1e-8                  
                 )
Out[36]:
<matplotlib.animation.FuncAnimation at 0x1a2048be80>

Animation Analysis:

Here we have started the algorithms on a steep slope of camel. Whereas Adagrad's accumulating friction plays to its advantage here as it smoothly rolls into the target, Adams momentum makes it more iratic as it glides past the minimum a number of times.

3.3.2 Initialize in a local minimum

In [37]:
output_generation(loss_fun_name = "Camel",
                  loss_fun = six_camel, 
                  gradient_fun = six_camel_grad, 
                  plot_fun = plot_camel,
                  play_speed = 200,
                      
                  # following are hyperparamters for all three algorithms
                  init = np.array([1.5, -0.8]),
                  iterations = 100,
                  iterations_adadelta = 120,
                      
                  # following is the hyperparameter of learning rate
                  lr_adagrad = 0.3,
                  lr_adam = 0.3,
                      
                  # following are hyperparamters for Adam
                  beta_1 = 0.9,
                  beta_2 = 0.99,
                      
                  # following is hyperparamters for Adadelta
                  pho = 0.5,
                  
                  # following is the hyperparameter of epsilon
                  epsilon_adagrad = 1,
                  epsilon_adadelta = 1e-8,
                  epsilon_adam = 1e-8
                  
                 )
Out[37]:
<matplotlib.animation.FuncAnimation at 0x1a1f797898>

Animation Analysis:

Now we initialize the algorithms in a local min of camel and observe whether they will be able to venture out and explore deeper wells. Adagrad shoots its way out of the local minimum like a rocket and we can understand why: the initial gradients were near 0. Since Adagrad's learning rate is tuned inversely proportional to the accumulated sum of square gradients, here Adagrad's learning experiences a spike out of the gate. As it bounces off the walls of the camel, it quickly loses speed as is typical for Adagrad. After a few fortuitous richochets, a global minimum has enough gravity to rein it in.

By comparison, Adadelta and Adam are less willing to make the leap and ultimately get stuck in the local minimum. Looking at the factors in their respective update terms, they both have terms in the numerator that slow their behavior on level surfaces. Adaelta has a measure of past updates in its numerator which here starts out very small. Of similar affect, Adam has the momentum term in its update numerator, which also here would remain quite small.

3.3.3 Let's crank up the learning rates

In [38]:
output_generation(loss_fun_name = "Camel",
                  loss_fun = six_camel, 
                  gradient_fun = six_camel_grad, 
                  plot_fun = plot_camel,
                  play_speed = 100,
                      
                  # following are hyperparamters for all three algorithms
                  init = np.array([1.5, -0.8]),
                  iterations = 100,
                  iterations_adadelta = 120,
                      
                  # following is the hyperparameter of learning rate
                  lr_adagrad = 0.3,
                  lr_adam = 1.2,
                      
                  # following are hyperparamters for Adam
                  beta_1 = 0.9,
                  beta_2 = 0.99,
                      
                  # following is hyperparamters for Adadelta
                  pho = .5,
                  
                  # following is the hyperparameter of epsilon
                  epsilon_adagrad = 1,
                  epsilon_adadelta = 1e-2,
                  epsilon_adam = 1e-8
                  
                  
                 )
Out[38]:
<matplotlib.animation.FuncAnimation at 0x1a23764518>

Animation Analysis:

By increasing the learning rate of Adam, it has become much more adventurous and found a global minimum. By contrast, Adadelta is having a more difficult time climbing out of the local min it started in. After heavily reducing Adadelta's epsilon, it has much more energy to get parameter updates rolling, but remains in the well. Here, Adadelta's behavior is not necessarily undesirable as Adam and Adagrad may be overfitting.

3.4 The Bukin Function

The Bukin function is a very sharp decline down to a narrow valley of very many local minima. The function is defined as the following:

$$ 100 \sqrt{\big|{y-.01x^2}\big|} + .01 \big|x + 10 \big|$$
In [45]:
def bukin(x, y):
    return 100 * np.sqrt(np.abs(y - .01 * x**2)) + .01 * np.abs(x + 10)

def bukin_grad(x, y):
    sign = x + 10 >= 0
    
    if (y - .01 * x**2) >= 0:
        dx = 50 * (y - .01*x**2)**(-1/2) * (-.02*x) + .01 * sign
        dy = 50 * (y - .01*x**2)**(-1/2)
    
    else:
        dx = 50 * (.01*x**2 - y)**(-1/2) * (.02*x) + .01 * sign
        dy = -50 * (.01*x**2 - y)**(-1/2)
    
    return(np.array([dx, dy]))

# Target function plotting
def plot_bukin(fig, ax):

    # Make contour plot of our pdf f
    x1_grid = np.linspace(-5,5,100)
    x2_grid = np.linspace(-5,5, 100)
    xx1, xx2 = np.meshgrid(x1_grid, x2_grid)
    z = bukin(xx1, xx2)

    bukin_contour = ax.contour(x1_grid, x2_grid, z, 60)
    
    plt.show()
    return np.array([np.min(x1_grid), np.max(x1_grid)]), np.array([np.min(x2_grid), np.max(x2_grid)])
In [50]:
output_generation(loss_fun_name = "bukin",
                  loss_fun = bukin, 
                  gradient_fun = bukin_grad, 
                  plot_fun = plot_bukin,
                  play_speed = 100,
                      
                  # following are hyperparamters for all three algorithms
                  init = np.array([-4.0, -4.0]),
                  iterations = 500,
                  iterations_adadelta = 500,
                      
                  # following is the hyperparameter of learning rate
                  lr_adagrad = 0.1,
                  lr_adam = 0.1,
                      
                  # following are hyperparamters for Adam
                  beta_1 = 0.9,
                  beta_2 = 0.99,
                      
                  # following is hyperparamters for Adadelta
                  pho = 0.5,
                  
                  # following is the hyperparameter of epsilon
                  epsilon_adagrad = 1,
                  epsilon_adadelta = 1e-4,
                  epsilon_adam = 1e-8
                  
                  
                 )
Out[50]:
<matplotlib.animation.FuncAnimation at 0x1a3a4503c8>

Animation Analysis:

We observe two common themes:

Adagrad is off to a quick start before slowing down at a nice local min.

The momentum of Adam is very apparent here as it easily climbs out of local min. If the goal is to explore many local mins, we conclude that it may be helpful to use Adam when other algorithms are getting stuck too early.

Section 4 Neural Network and MNIST Classification

Note: All remaining code to implement the MLP is credited to HW 7 and lab. We simply plug in Adagrad, Adadelta, Adam and analyze the results.

We include below a brief implementation of a neural network on the MNIST digit dataset using these three functions as the optimizers. We use the implementations embedded in pytorch.

4.1 Load data

In [51]:
train_dataset = datasets.MNIST(root='./paper_data',
                                    train=True,
                                    transform=transforms.Compose([transforms.ToTensor(),
                                      transforms.Normalize((0.1307,), (0.3081,)),
                                     ]),
                                    download=True)

test_dataset = datasets.MNIST(root='./paper_data', 
                                   train=False, 
                                   transform=transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.1307,), (0.3081,)),
                                    ]),
                                   download=True)

# Create data loader and do train/ validaton split
from torch.utils.data.sampler import SubsetRandomSampler

# Define the indices
indices = list(range(len(train_dataset))) # start with all the indices in training set
split = 10000 # define the split size


batch_size = 256

# Random, non-contiguous split as in lab7
validation_idx = np.random.choice(indices, size=split, replace=False)
train_idx = list(set(indices) - set(validation_idx))

train_sampler = SubsetRandomSampler(train_idx)
validation_sampler = SubsetRandomSampler(validation_idx)


train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           sampler=train_sampler)


validation_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                                batch_size=1, 
                                                sampler=validation_sampler)


test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=1,
                                          shuffle=False)
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
4.2 Build neural network model base classes
In [52]:
# Regression Parent Class
class Regression(object):
    
    def __init__(self):
        self.params = dict()
    
    def get_params(self, k):
        return self.params.get(k, None)
    
    def set_params(self, **kwargs):
        for k,v in kwargs.items():
            self.params[k] = v
           
    def fit(self, X, y):
        raise NotImplementedError()
        
    def predict(self, X):
        raise NotImplementedError()
        
    def score(self, X, y):
        raise NotImplementedError()
In [53]:
class MNIST_Model(Regression):
    def __init__(self, model, optimizer_algorithm, learning_rate=0.1, regularizer=.01, batch_size=256, epochs=5):
        
        super().__init__()
        
        ## Add Datasets and Data Loaders to our params
        self.set_params(train_loader=train_loader,
                        validation_loader = validation_loader,
                        test_loader=test_loader)

        criterion = nn.CrossEntropyLoss()
        optimizer = optimizer_algorithm
        
        self.set_params(optimizer=optimizer, 
                        learning_rate=learning_rate,
                        regularizer=regularizer,
                        batch_size=batch_size,
                        model=model,
                        criterion=criterion,
                        epochs=epochs)   
        

    def predict(self, dataset=test_loader):        
        predictions = []
        correct = 0
        model = self.get_params('model')

        for inputs, labels in dataset:

            ## get the inputs from the dataloader and turn into a variable for 
            ## feeding to the model
            inputs = Variable(inputs)

            ## Reshape so that batches work properly
            inputs = inputs.view(-1, 28*28)

            # run our model on the inputs
            outputs = model(inputs)

            # get the class of the max log-probability
            pred = outputs.data.max(1)[1]

            correct += (pred == labels).sum()

            # append current batch of predictions to our list
            predictions += list(pred)

            true_labels = labels.numpy()
            
        self.set_params(predictions=predictions, 
                        correct_predictions=correct,
                        prediction_dataset_length=len(predictions),
                        prediction_dataset_labels=true_labels
                       )
        return np.array(predictions)
        

    
    def score(self, dataset=test_loader):
        """Calculate accuracy score based upon model classification"""
        
        self.predict(dataset=dataset)
        correct = self.get_params('correct_predictions')
        total = self.get_params('prediction_dataset_length')
        return(float(correct)/total)
        
        
    def fit(self):
        
        optimizer=self.get_params("optimizer")
        model=self.get_params("model")
        epochs=self.get_params("epochs")
        criterion=self.get_params("criterion")
        train_loader=self.get_params("train_loader")
        

        iterations = len(train_loader)
        
        ## We need something to keep track of our losses
        losses = np.zeros((epochs, iterations)) 
        val_scores = np.zeros(epochs)
        test_scores = np.zeros(epochs)
        
        for epoch in range(epochs):
            for batch_index, (inputs, labels) in enumerate(train_loader):

                inputs, labels = Variable(inputs), Variable(labels)
                
                inputs = inputs.view(-1, 28*28)

                optimizer.zero_grad()

                outputs = model(inputs)
                loss = criterion(outputs, labels)
                
                ## count the loss
                losses[epoch,batch_index] = loss.data.item()

                loss.backward()

                optimizer.step()
                
            val_scores[epoch] = self.score(dataset=validation_loader)
            test_scores[epoch] = self.score()
                
        ## Set Loss Matrix for visualizing
        self.set_params(training_losses=losses)
        self.set_params(validation_scores=val_scores)
        self.set_params(test_scores = test_scores)
        
        return self
In [54]:
class MLPPyTorch(nn.Module):

    def __init__(self, hidden_dim, nonlinearity = torch.tanh):
        super().__init__()
        self.l1 = nn.Linear(784, hidden_dim)
        torch.nn.init.xavier_uniform_(self.l1.weight)
        self.l2 = nn.Linear(hidden_dim,10)
        self.nonlinearity = nonlinearity

    def forward(self, x):
        x = self.l1(x)
        x = self.nonlinearity(x)
        x = self.l2(x)
        return x
4.3 Train models with Adam, Adadelta and Adagrad
In [56]:
base_model = MLPPyTorch(hidden_dim=50)

algorithm = torch.optim.Adam(params=base_model.parameters(), lr=1)
Adam_MLP = MNIST_Model(model=base_model, optimizer_algorithm=algorithm, epochs=60)
Adam_MLP.fit()

algorithm = torch.optim.Adadelta(params=base_model.parameters(), lr=1)
Adadelta_MLP = MNIST_Model(model=base_model, optimizer_algorithm=algorithm, epochs=60)
Adadelta_MLP.fit()

algorithm = torch.optim.Adagrad(params=base_model.parameters(), lr=1)
Adagrad_MLP = MNIST_Model(model=base_model, optimizer_algorithm=algorithm, epochs=60)
Adagrad_MLP.fit()
Out[56]:
<__main__.MNIST_Model at 0x1a38b49a90>
4.4 Graph Losses while Training
In [57]:
def get_mean_losses(model):
    losses = model.get_params('training_losses')
    return [np.mean(x) for x in losses[:]]
In [64]:
adam_mean_losses = get_mean_losses(Adam_MLP)
adadelta_mean_losses = get_mean_losses(Adadelta_MLP)
adagrad_mean_losses = get_mean_losses(Adagrad_MLP)

fig, ax = plt.subplots(1,1, figsize=(8,5))
ax.plot(range(len(adam_mean_losses)), adam_mean_losses, lw=3, alpha=.7, label='Adam', color='r')
ax.plot(range(len(adadelta_mean_losses)), adadelta_mean_losses, lw=3, alpha=.7, label='Adadelta', color='blue')
ax.plot(range(len(adagrad_mean_losses)), adagrad_mean_losses, lw=3, alpha=.7, label='Adagrad', color='springgreen')

ax.set_facecolor((1,1,.98))
ax.set_xlabel("Epoch")
ax.set_ylabel("Mean Loss")
ax.set_title("Mean Loss per Epoch while Training MLP Model")
ax.legend();
4.5 Test Accuracies
In [63]:
adam_test_scores = Adam_MLP.get_params("test_scores")
adadelta_test_scores = Adadelta_MLP.get_params("test_scores")
adagrad_test_scores = Adagrad_MLP.get_params("test_scores")

fig, ax = plt.subplots(1,1, figsize=(8,5))
ax.plot(range(len(adam_test_scores)), adam_test_scores, lw=3, alpha=.7, label='Adam', color='r')
ax.plot(range(len(adadelta_test_scores)), adadelta_test_scores, lw=3, alpha=.7, label='Adadelta', color='b')
ax.plot(range(len(adagrad_test_scores)), adagrad_test_scores, lw=3, alpha=.7, label='Adagrad', color='springgreen')

ax.set_facecolor((1,1,.98))
ax.set_xlabel("Epoch")
ax.set_ylabel("Accuracy %")
ax.set_title("Test Accuracy Scores per Epoch while Training MLP Model")
ax.legend();

Section 5 Conclusions

We have implemented three adaptive learning rate methods for gradient descent, applied them to various example functions, and analyzed their results. The overall trend has been the following: Adam consistently performs the best out of the three algorithms in virtually all simple minimization examples, and appears to be the most robust to changes in global learning rates, starting locations, and even degree of problem complexity. Specifically, we saw that in our own function minimization problems created above (least squares loss function, Beale function, Camel function, etc), the Adam algorithm was typically able to converge quite nicely to a minimum with its default values i.e. without much hyperparameter tuning. Then, when the complexity of the problem increased significantly in the digit classification neural network, Adam performed at the same level as the other two algorithms.

As for those other two algorithms, Adagrad and Adadelta, we saw that in our own, simpler examples, Adagrad very clearly outperformed Adadelta and was able to achieve the same level of success as Adam, but only after a considerable amount of tuning, specifically of the smoothing parameter $\epsilon$ to which the Adagrad algorithm seemed quite sensitive. Moreover, we honed in on a significant drawback of Adagrad: in many cases, the algorithm will overshoot in its early iterations due to the initially small squared gradient sums in the denominator of the adaptive learning rate, and will never be able to recover as this accumulation of previous gradients increases and takes the learning rate asymptotically down to 0. This leaves Adagrad far away from any global or even local minima. However, in the cases where a good set of parameters could be found, we saw that Adagrad was able to jump straight to the correct answer and then quickly settle there due to the accumulating friction in the learning rate. Adam, on the other hand, would often go toward the solution quite quickly as well, but then would spend some time bouncing around the minimum. This is because of its use of a momentum term and a sliding gradient average which does not accumulate indefinitely and decrease the learning rate like in Adagrad.

There were times where we saw Adam arrive at a local minimum rather than the global minimum in some of our simple examples above, when its parameters were set to its default values. With just a little bit of tuning, we could manipulate Adam to wrestle out of local minima and arrive at the global minimum. It is encouraging that Adam was essentially always able to locate either a highly minimizing local minimum or the global minimum, as it once again speaks to the robustness of this optimizer. Additionally, we note that finding a local minimum rather than a global minimum is not necessarily bad behavior -- this is what we would expect to sometimes observe if an algorithm is able to avoid overfitting, which leads naturally to a brief discussion about the neural network example.

In the digit classification problem, we deal with a vast amount of sparse data, a setting in which we are typically quite worried about overfitting. In this environment, we were finally able to see some of the power of the Adadelta algorithm as it quickly caught up to (and slightly surpassed) Adam's performance. We see that in this setting, our three algorithms end up performing quite similarly. It would make sense that Adadelta and Adam would perform similarly, as both essentially do the same thing: track previous squared gradients in the denominator of the adaptive learning rate to even out progress across dimensions over time, while including a kind of momentum measurement in the numerator -- in Adam's case this momentum is a very direct decaying average of gradients, while in Adadelta's case it is an average of previous updates. It appears that this balancing act gives Adam slightly more trouble in fitting the loss function than it does Adadelta, though they both end up with roughly the same testing accuracy. This suggests that Adam is likely pretty robust to overfitting.

It seems that the Adadelta procedure is much better suited for more complex problems such as this: large amounts of sparse data, such that Adadelta's ability to give smaller updates to frequent features and larger updates to infrequent features (while able to keep learning high enough by using the sliding average in the denominator) shines through as quite useful. This is likely because in the presence of sparsity, both the average of updates in the numerator and the average of previous gradients in the denominator have sharper dimensional effects, pushing the algorithm along a proper track. In our early examples, on the other hand, Adadelta was hardly able to move around to any kind of extrema. This is perhaps because the data in those problems did not have the appropriate structure for elucidating these abilities of Adadelta.

Thus, as mentioned above, it appears that Adam is perhaps the most robust of these algorithms, and can generally be used to solve both simple and complex problems with high success. When put in the right setting such as the digit classification problem, Adadelta can also perform at a very high level, catching up to or even surpassing Adam in some cases.

We note that our results from the neural network above could certainly be improved quite a bit more with more tuning -- algorithms have famously reached about 98% accuracy on this dataset, while we see about 60% test accuracy. We hypothesize that with greater tuning up to this level of accuracy, we would see the Adam and Adadelta optimizers work the best out of the three due to their more nuanced implementations, and with Adam the least prone to overfitting.

In [ ]: