Error surfaces

Machine learning techniques try to find a mathematical function (or model) that predicts an output from a given input. Two common tasks are regression and classification. In both cases, the performance of the model is measured with a loss function. If this function fulfills certain conditions (for example, if it is differentiable with respect to the parameters), the error of the model can be reduced by following the gradient of the error surface downhill. The error surface shows the error for every possible value of the parameters. Here we visualize a very simple case with a single parameter.

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import time
from IPython import display
from IPython.display import Latex
from pylab import axis

plt.rcParams['figure.figsize'] = (9, 6)

Data to predict

First we create some artificial data with only two variables: the input X and the target T. In this example we draw samples from a two-dimensional normal distribution centered at the origin.

num_samples = 40 # datapoints

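mean = [0, 0]
# Note: a covariance matrix should be symmetric and positive semi-definite.
# This one is not symmetric, so NumPy may emit a warning when sampling.
cov = [[1, 5],
       [0, 1]]
X, T = np.random.multivariate_normal(mean, cov, num_samples).T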

This is what it looks like:

plt.scatter(X, T, color='red', label='samples')
plt.ylabel('target')
plt.xlabel('x')
plt.legend(loc='upper left')
axis('equal')

Initial weight

If we assume that the samples are a good representation of the real distribution, a linear regression is enough to predict the points. The more points we draw from the distribution, the better the linear approximation will be. Additionally, because the mean of the distribution is at the origin, we do not use a bias term (equivalently, the bias is fixed to zero).

w = 1

With this weight we can use our model to make our prediction.

Y = X*w

This is how the target values and the predictions differ.

plt.scatter(X, T, color='red', label='samples')
plt.scatter(X, Y, color='blue', label='prediction')
plt.ylabel('y')
plt.xlabel('x')
plt.legend(loc='upper left')
axis('equal')

Loss function

We need some measure of the model's performance. This measure is usually called a loss function, and one of the most common choices is the mean squared error (MSE).

Latex(r""" 
        \begin{equation}
            MSE = \frac{1}{n} \sum_{i=1}^n (Y_i - T_i)^2
        \end{equation}
      """)

def mean_squared_error(y,t):
    return np.sum((y-t)**2)/np.size(y)

def loss_function(y,t):
    return mean_squared_error(y,t)
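
The introduction mentions following the gradient of the error surface. For the bias-free model Y = X*w used here, the derivative of the MSE with respect to w is (2/n) * sum((X_i*w - T_i) * X_i). The sketch below is one way to write that derivative and to check it against a finite-difference estimate. The names mse, mse_gradient, x_demo and t_demo are illustrative and do not appear in the notebook, and the data is a seeded stand-in, not the samples generated above.

```python
import numpy as np

def mse(w, x, t):
    """Mean squared error of the bias-free linear model y = x * w."""
    return np.mean((x * w - t) ** 2)

def mse_gradient(w, x, t):
    """Analytic derivative of the MSE with respect to w."""
    return 2.0 * np.mean((x * w - t) * x)

# Assumption: stand-in data (a noisy line through the origin), seeded for repeatability.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=40)
t_demo = 2.0 * x_demo + rng.normal(scale=0.5, size=40)

# Compare the analytic gradient with a central finite difference at w = 1.
w0 = 1.0
eps = 1e-6
numeric = (mse(w0 + eps, x_demo, t_demo) - mse(w0 - eps, x_demo, t_demo)) / (2 * eps)
analytic = mse_gradient(w0, x_demo, t_demo)
print(analytic, numeric)
```

The two printed values should agree closely, because the MSE is quadratic in w and the central difference is then exact up to rounding.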

Error surface

Now that we have an error function we can visualize the error surface. To do so, we would need to evaluate the error for every possible value of the parameter w. Since it is impossible to visualize an infinite surface, we restrict the parameter to a finite window.

num_parametrizations = 100
w_values = np.linspace(-5,5,num_parametrizations)

With all the possible parameter values we can compute the corresponding error.

error = np.zeros(num_parametrizations)
for i, w in enumerate(w_values):
    Y = X*w
    error[i] = loss_function(Y,T)

And now we can visualize the error surface, and the optimal weight for the given samples.

optimum = np.argmin(error)
w = w_values[optimum]
error_value = error[optimum]
plt.annotate('optimal w = {0:.{1}f}'.format(w,2), 
             xy=(w, error_value),  xycoords='data',
             xytext=(-50, 50), textcoords='offset points',
             arrowprops=dict(arrowstyle="->"))
plt.plot(w_values, error, color='red')
plt.ylabel('error')
plt.xlabel('w')
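
As a sanity check on the grid search, setting the derivative of the MSE to zero for Y = X*w gives w* = sum(X_i*T_i) / sum(X_i^2). The following is a minimal sketch, assuming seeded stand-in data (x_demo, t_demo) in place of the notebook's samples, that compares this closed-form value with the best weight found on the grid. All variable names are illustrative.

```python
import numpy as np

# Assumption: stand-in data (a noisy line through the origin), seeded for repeatability.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=40)
t_demo = 2.0 * x_demo + rng.normal(scale=0.5, size=40)

# Grid search over the same window used in the notebook.
w_grid = np.linspace(-5, 5, 100)
grid_error = np.array([np.mean((x_demo * w - t_demo) ** 2) for w in w_grid])
w_best_on_grid = w_grid[np.argmin(grid_error)]

# Closed-form least-squares solution for a bias-free linear model.
w_closed_form = np.sum(x_demo * t_demo) / np.sum(x_demo ** 2)

grid_step = w_grid[1] - w_grid[0]
print(w_best_on_grid, w_closed_form, grid_step)
```

Because the error is a parabola in w, the grid optimum is the grid point nearest to the closed-form value, so the two should differ by at most half a grid step as long as the closed-form value lies inside the window.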

Minibatch

However, it is sometimes impossible to train with all the samples at once, either because the number of samples and/or parameters is large or because not all the data is available. In these cases we divide the data into minibatches and use them to find the optimal parameters. Using all the data in each update is known as batch learning, using partitions of the data is minibatch learning, and the extreme case of using individual samples is called stochastic learning.
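
The three regimes differ only in how many samples go into each group. The sketch below illustrates this with a hypothetical helper, make_minibatches, that partitions shuffled sample indices. The shuffling is an assumption of this sketch; the notebook itself uses contiguous slices of the data.

```python
import numpy as np

def make_minibatches(num_samples, batch_size, rng):
    """Return a list of index arrays that partition a shuffled range(num_samples)."""
    indices = rng.permutation(num_samples)
    return [indices[start:start + batch_size]
            for start in range(0, num_samples, batch_size)]

rng = np.random.default_rng(0)
n = 40

batch = make_minibatches(n, n, rng)         # batch learning: one group with all samples
minibatches = make_minibatches(n, 10, rng)  # minibatch learning: four groups of ten
stochastic = make_minibatches(n, 1, rng)    # stochastic learning: one sample per group

print(len(batch), len(minibatches), len(stochastic))  # prints: 1 4 40
```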

But what happens with the error surface when we train our model using minibatches?

Let's create four minibatches from the original data.

num_minibatches = 4
size_minibatches = num_samples // num_minibatches  # integer division: the result is used as a slice index

Now we compute the error surface with each of the minibatches.

mbatch_error = np.zeros((num_minibatches,num_parametrizations))

for i, w in enumerate(w_values):
    Y = X*w
    for j in range(num_minibatches):
        mbatch_error[j,i] = loss_function(Y[j*size_minibatches:(j+1)*size_minibatches],
                                       T[j*size_minibatches:(j+1)*size_minibatches])
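
When the minibatches have equal size, the error surface for all the samples is the average of the minibatch surfaces. The sketch below checks this with a vectorized computation. It assumes seeded stand-in data (x_demo, t_demo) and contiguous minibatches as in the loop above; the variable names are illustrative.

```python
import numpy as np

# Assumption: stand-in data (a noisy line through the origin), seeded for repeatability.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=40)
t_demo = 2.0 * x_demo + rng.normal(scale=0.5, size=40)

w_grid = np.linspace(-5, 5, 100)
num_batches = 4

# Squared error of every sample for every candidate weight: shape (100, 40).
sq_err = (w_grid[:, None] * x_demo[None, :] - t_demo[None, :]) ** 2

# Error surface over all samples: shape (100,).
full_surface = sq_err.mean(axis=1)

# Group the samples into 4 contiguous minibatches of 10 and average within each: shape (100, 4).
batch_surfaces = sq_err.reshape(len(w_grid), num_batches, -1).mean(axis=2)

# The average of the minibatch surfaces should match the full surface.
print(np.allclose(batch_surfaces.mean(axis=1), full_surface))
```

Each minibatch surface can still have its minimum at a different w, which is what the plots in the next cell show.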

We can now visualize each of the error surfaces: the surface for all the samples and the surface for each minibatch, with the optimal weight w annotated for each minibatch.

plt.rcParams['figure.figsize'] = (12, 8)

for j in range(num_minibatches):
    plt.subplot(2, 2, j+1)
    plt.plot(w_values, error, color='red', label='All samples')
    plt.plot(w_values, mbatch_error[j,:], color='blue', label='minibatch {0}'.format(j+1))
    
    optimum = np.argmin(mbatch_error[j,:])
    w = w_values[optimum]
    error_value = mbatch_error[j,optimum]
    plt.annotate('optimal w = {0:.{1}f}'.format(w,2), xy=(w, error_value),  xycoords='data',
                xytext=(-50, 50), textcoords='offset points',
                arrowprops=dict(arrowstyle="->"))
                
    plt.ylabel('error')
    plt.xlabel('w')
    plt.legend(loc='upper right')

We can see that, for a specific parameter value, the gradient differs from one minibatch to another. This makes gradient descent oscillate between minibatches. Some heuristics to mitigate this behaviour are:

Using momentum
Decreasing the learning rate
TODO
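
The two heuristics listed above can be sketched as follows. This is a rough illustration under stated assumptions, not a reference implementation: the function sgd, its hyperparameters, the inverse-time decay schedule, and the seeded stand-in data are all choices made for this example and do not come from the notebook.

```python
import numpy as np

def minibatch_gradient(w, x, t):
    """Derivative of the MSE with respect to w for the model y = x * w."""
    return 2.0 * np.mean((x * w - t) * x)

def sgd(x, t, num_batches=4, epochs=50, lr=0.1, momentum=0.0, decay=0.0, w_init=-4.0):
    """Minibatch gradient descent; returns the weight after every update."""
    w, velocity, step = w_init, 0.0, 0
    history = [w]
    x_batches = np.array_split(x, num_batches)
    t_batches = np.array_split(t, num_batches)
    for _ in range(epochs):
        for xb, tb in zip(x_batches, t_batches):
            # Assumption: inverse-time decay, one of several possible schedules.
            current_lr = lr / (1.0 + decay * step)
            # Classical momentum: keep a fraction of the previous update.
            velocity = momentum * velocity - current_lr * minibatch_gradient(w, xb, tb)
            w += velocity
            history.append(w)
            step += 1
    return np.array(history)

# Assumption: stand-in data (a noisy line through the origin), seeded for repeatability.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=40)
t_demo = 2.0 * x_demo + rng.normal(scale=0.5, size=40)
w_closed_form = np.sum(x_demo * t_demo) / np.sum(x_demo ** 2)

plain = sgd(x_demo, t_demo)
with_momentum = sgd(x_demo, t_demo, momentum=0.9)
with_decay = sgd(x_demo, t_demo, decay=0.05)

# Spread of the last 20 updates as a rough indicator of the remaining oscillation.
for name, hist in [("plain", plain), ("momentum", with_momentum), ("decay", with_decay)]:
    print(name, hist[-1], np.ptp(hist[-20:]))
```

The printed spread gives only a rough sense of how much each variant still oscillates between minibatches; the outcome depends on the data, the learning rate, and the momentum and decay values chosen.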