by Nikolai V. Shokhirev
| Algorithms |
There are a lot of good books on Artificial Neural Networks as well as the information on the Internet (see e.g. the references below). I am not going to compete with them. Here I present a simple derivation of back propagation algorithm.
An artificial neural network (ANN) is an interconnected group of artificial neurons:
|
|
Fig. 1. Neuron |
A neuron performs a mathematical transformation of several inputs into one output.
Neurons are usually arranged into layers
![]() |
|
Fig. 2. Layer |
Several layers form a neural network:
![]() |
|
Fig. 3. Network |
Mathematically, the n-th neuron in l-th layer is described by the following equations
| |
(1a) |
|
|
(2a) |
Here w (weights) and b (biases or offsets) are the parameters and F is an activation or transfer function.
Both weights and biases can be combined into one matrix if the 0-th component is added to input vectors
![]() |
(3) |
and the weight matrix is extended as follows
| (4) |
With these definitions (1a-2a) become
|
(1b) |
![]() |
(2b) |
Both variants can be found in literature. The variant (b) makes formula manipulations rather easier.
The whole network (Fig. 3.) is just a non-linear transformation (filter), which maps the input space ( X0 = I ) to the output space ( XL = O ).
| |
(5) |
In more practical terms neural networks are non-linear modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.
To be able to find patterns, the network parameters ( w and b ) should be adjusted. Ideally, the outputs Ok for each reference input Ik should match the corresponding target values Tk (k = 1, . . . , K - the number of training patterns). Practically this is the best fit in the least squares sense:
|
(6) |
Here E is the total (weighted) error, ωk are the weights, and P = { w, b }.
Eq. (7) is a typical optimization problem. Gradient Descent is a general framework for solving optimization problems (not the best one, by the way). The parameters are adjusted in the direction of the steepest descent (opposite to a gradient):
![]() |
(7) |
In application to ANNs the Gradient Descent is called the Error Backpropagation Algorithm. This somewhat misleading term is just a recursive method for calculation of the derivatives plus the adjustment (7).
Taking into account the specific form for the error function (6) we get
|
(8) |
All we need, are the derivatives of x(L) with respect to { w, b }.
Let us calculate a variance (differential) of (1b):
![]() |
(9) |
Here we used the only basic fact from the calculus:
| (10) |
It can be further expanded using (2b)
|
(11) |
Important: According to the definitions (1, 2) x(l-1) does not depend on w(l), w(l+1), . . ., w(L) (this is obvious, but confusing for less experienced in mathematics). This means that
|
(12) |
In particular, for l = L we can omit the second term in (11) and
|
(13) |
This is the initial step of a recursion.
Now we can set dwnm(L) to zero and take in (11) one step down from l = L to l = L - 1 and again apply Eq. (11). Similar to (13) it gives
|
(14) |
and for l = L - 2
|
(15) |
Now we can substitute the derivatives into Eq. (8) which gives the Error Backpropagation Algorithm
To be continued
| (11) |
In
A popular choice for the activation function F is
| |
(3) |
| Algorithms |
©Nikolai V. Shokhirev, 2004-2006