Hands-On Neural Networks

Introducing backpropagation

Before going into the math, it is useful to develop an intuitive sense of what training does. If we look back at our perceptron class, we simply measured the error as the difference between the real output and our prediction. If we wanted to predict a continuous output rather than just a binary one, we would have to measure the error differently, as positive and negative errors might cancel each other out.

A common way to avoid this problem is to measure the error with the root mean square error (RMSE), which is defined as follows:
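$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Here, $y_i$ is the real output and $\hat{y}_i$ is our prediction for the i-th example.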

If we plot the squared error while letting our prediction vary, we obtain a parabolic curve:

The error surface for a single neuron

In reality, our prediction is controlled by the weights and the bias, which are what we change to decrease the error. By varying the weights and the bias, we obtain a more complex surface, whose complexity depends on the number of weights and biases we have. For a generic neuron with n weights, we will have an elliptic paraboloid in n+1 dimensions, as we need to vary the bias as well:

Error surface for a linear perceptron

The lowest point of the curve is known as the global minimum, and it is where we have the lowest possible loss; we cannot have a smaller error than that. In this simple case, the global minimum is also the only minimum we have, but in more complex functions, we can also have a few local minima. A local minimum is defined as the lowest point in an arbitrarily small interval around it, so it is not necessarily the lowest point overall.

In this way, we can see the training process as an optimization problem that looks for the lowest point of the curve in an efficient way. A convenient way to explore the error surface is by using gradient descent. The gradient descent method uses the derivative of the squared error function with respect to the weights of the network, and it follows the downward direction, which is given by the gradient. As we will take the derivative of this function, for convenience we will use a slightly different way of measuring the squared error from what we saw before:
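$$E = \frac{1}{2}\left(y - \hat{y}\right)^2$$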

We divide the squared error by two just to cancel out the factor of two that differentiation will add. This does not change the location of the minimum, especially because, later on, we will multiply the gradient by another coefficient, called the learning rate.
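To make this concrete, here is a minimal sketch of one gradient descent step for a single linear neuron, assuming NumPy; the function name gradient_descent_step and the toy data are just illustrative, not the book's code:

```python
import numpy as np

def gradient_descent_step(w, b, x, y, lr=0.1):
    # Prediction of a single linear neuron: y_hat = w . x + b.
    y_hat = np.dot(w, x) + b
    # With E = 0.5 * (y - y_hat) ** 2, the gradients are
    # dE/dw = -(y - y_hat) * x and dE/db = -(y - y_hat);
    # the 1/2 cancelled the 2 produced by differentiation.
    error = y - y_hat
    w = w + lr * error * x  # step against the gradient
    b = b + lr * error
    return w, b

# Usage: repeated steps drive the prediction toward the target.
w, b = np.zeros(3), 0.0
x, y = np.array([1.0, 2.0, 3.0]), 2.0
for _ in range(100):
    w, b = gradient_descent_step(w, b, x, y)
```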

The training of the network is normally done using backpropagation, which is used to calculate the steepest descent direction. If we look at each neuron individually, we can see the same formula that we saw for the perceptron; the only difference is that now, the input of one neuron is the output of another one. Let's take the neuron j; it applies its activation function to the combined results of the parts of the network that come before it:
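$$o_j = \varphi(\mathrm{net}_j) = \varphi\left(\sum_{k=1}^{n} w_{kj}\, o_k\right)$$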

If the neuron is in the first layer after the input layer, its inputs are simply the inputs of the network. With n, we denote the number of input units of the neuron j. With $w_{kj}$, we denote the weight between the output $o_k$ of the neuron k and our neuron j.

The activation function, which we want to be non-linear and differentiable, is denoted by the Greek letter $\varphi$. We want it to be non-linear because otherwise the combination of a series of linear neurons would still be linear, and we want it to be differentiable because we need to calculate the gradient.

A very common activation function is the logistic function, also known as the sigmoid function, defined by the following formula:
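$$\varphi(x) = \frac{1}{1 + e^{-x}}$$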

This has a convenient derivative, given by the following formula:
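$$\frac{d\varphi(x)}{dx} = \varphi(x)\left(1 - \varphi(x)\right)$$

In code, the sigmoid and its derivative can be expressed as a minimal NumPy sketch (the function names here are just illustrative):

```python
import numpy as np

def sigmoid(x):
    # Logistic (sigmoid) activation: squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # The derivative reuses the function's own output: s * (1 - s).
    s = sigmoid(x)
    return s * (1.0 - s)
```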

The peculiar part of backpropagation is that the inputs do not only flow forward to produce the output: the error is also propagated backward, from the output toward the input, to adjust the weights:

A simple FFNN for binary classification
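As an illustration of this two-way flow, here is a minimal NumPy sketch of one training step for a tiny 2-2-1 network with sigmoid activations and the halved squared error; all of the names (train_step, W1, W2, and so on) are hypothetical, not the book's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 2-2-1 network: two inputs, a hidden layer of two
# sigmoid neurons, and one sigmoid output neuron.
W1 = rng.normal(size=(2, 2))  # input -> hidden weights
b1 = np.zeros(2)
W2 = rng.normal(size=(2, 1))  # hidden -> output weights
b2 = np.zeros(1)

def train_step(x, y, lr=0.5):
    global W1, b1, W2, b2
    # Forward pass: the inputs flow toward the output.
    h = sigmoid(x @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # Backward pass: the error at the output is propagated back
    # through the weights to the hidden layer.
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)  # dE/dnet at output
    delta_hid = (delta_out @ W2.T) * h * (1.0 - h)   # dE/dnet at hidden
    # Gradient descent update on every weight and bias.
    W2 -= lr * np.outer(h, delta_out)
    b2 -= lr * delta_out
    W1 -= lr * np.outer(x, delta_hid)
    b1 -= lr * delta_hid
    return y_hat

# Usage: a few steps on a single binary example.
for _ in range(1000):
    train_step(np.array([0.0, 1.0]), 1.0)
```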