CSCI 3346, Data Mining
Prof. Alvarez

The Error Back-Propagation Algorithm

This page summarizes the error back-propagation (EBP) algorithm that was mentioned in class, and which is used to train multilayer artificial neural networks (ANN) for regression and classification tasks. EBP extends the weight update rule used in the case of single perceptrons. The name "error back-propagation" derives from the manner in which information is propagated across the network on each pass of the algorithm: the errors at the output nodes are passed back to the hidden nodes using the network connections, and the resulting information is used to update the connection weights. The EBP algorithm was described independently by several groups of researchers. The most widely known paper on the subject is probably the one by Rumelhart, Hinton, and Williams, "Learning Internal Representations by Error Propagation", which was reprinted in the book Parallel Distributed Processing edited by Rumelhart and McClelland (MIT Press, 1986).

Weight Space View of EBP

Many algorithms for updating the connection weights in an ANN may be viewed as searching in the space of possible weight values for points at which a certain error measure is minimized. A specific algorithm is characterized by the choice of error measure and search strategy.

Error measure

EBP uses the mean square output error as the error measure:
MSE = average over all training instances of the sum over all output nodes k of ( y_k - target_k )^2
where y_k and target_k are, respectively, the actual activation and the target activation of output node k for the given training instance.
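
For concreteness, here is a minimal sketch of this error measure in Python; the arrays and their values are made up for the example and are not part of the algorithm.

import numpy as np

# Hypothetical activations and targets for 4 training instances at
# 3 output nodes (rows = instances, columns = output nodes).
y = np.array([[0.9, 0.1, 0.2],
              [0.2, 0.8, 0.1],
              [0.1, 0.3, 0.7],
              [0.6, 0.4, 0.5]])
target = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0]])

# Sum of squared errors over the output nodes of each instance,
# then the average over the instances.
mse = np.mean(np.sum((y - target) ** 2, axis=1))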

Search strategy

EBP employs a steepest descent search strategy which updates the weights so as to move in the direction in weight space in which the MSE decreases the fastest. This direction is simply the direction opposite to the gradient vector of the MSE as a function of the weights. Thus, the weight update rule for EBP may be written as:
w_{i,j} = w_{i,j} + Δw_{i,j},
where the weight change Δw_{i,j} is the learning rate η multiplied by the negative of the w_{i,j} component of the gradient vector:
Δw_{i,j} = -η ∂MSE/∂w_{i,j}
Obtaining the error back-propagation form of the weight update rule requires writing out the partial derivative that appears on the right-hand side of the above equation, using a sigmoid activation function to get the output of each node as a function of the summed input of the node.
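
As a sketch of that computation (writing E for the error on a single training instance, with the conventional factor of 1/2 that cancels the 2 produced by differentiation): let net_i = Σ_j w_{i,j} y_j denote the summed input of node i, and y_i = σ(net_i) its output, where σ(x) = 1/(1 + e^{-x}) satisfies σ'(x) = σ(x)(1 - σ(x)). For an output node k, the chain rule gives
∂E/∂w_{k,j} = ∂E/∂y_k · ∂y_k/∂net_k · ∂net_k/∂w_{k,j} = -(target_k - y_k) y_k (1 - y_k) y_j = -δ_k y_j,
so that Δw_{k,j} = η δ_k y_j, with δ_k = (target_k - y_k) y_k (1 - y_k). For a hidden node h, the same argument, summed over the output nodes k fed by h, yields δ_h = y_h (1 - y_h) Σ_k w_{k,h} δ_k; this backward flow of the δ values is the "back-propagation" in the algorithm's name.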

EBP pseudocode (e.g., Mitchell, Machine Learning, chapter 4)

There are several variants of the EBP algorithm. The one given in pseudocode below is a so-called "stochastic" version that performs weight updates after examining each individual instance. A commonly used termination condition is that the MSE over a validation set increases for a certain number of consecutive training iterations; this suggests that overfitting may be starting to occur.
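
A minimal sketch of such a termination test in Python; the function name and the patience parameter are assumptions made for this illustration.

def should_stop(validation_mse, patience=5):
    """Return True when the validation-set MSE has increased for
    `patience` consecutive training iterations, suggesting that
    overfitting may be starting to occur."""
    if len(validation_mse) <= patience:
        return False
    recent = validation_mse[-(patience + 1):]
    return all(a < b for a, b in zip(recent, recent[1:]))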

Inputs

A training dataset of instances, each pairing an input vector with target activations for the output nodes; the network architecture (the nodes and the weighted connections among them); a learning rate η; and a termination condition, such as the validation-set test described above.

Outputs

For each connection weight w_{i,j} (on the connection from node j to node i), a learned value such that the network with these weights approximates the input-to-output behavior implicitly defined by the training dataset.

Procedure
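
Following Mitchell's presentation (Machine Learning, chapter 4), a sketch of the stochastic EBP procedure, using the δ notation from the derivation above:

1. Initialize all connection weights w_{i,j} to small random values (e.g., uniformly in [-0.05, 0.05]).
2. Until the termination condition is met, repeat for each training instance:
   a. Forward pass: feed the instance's input vector into the network and compute the activation y_u of every node u.
   b. Backward pass:
      for each output node k, compute δ_k = y_k (1 - y_k) (target_k - y_k);
      for each hidden node h, compute δ_h = y_h (1 - y_h) Σ_k w_{k,h} δ_k, where the sum ranges over the nodes k that receive input from h.
   c. Weight update: for each connection from node j to node i, set w_{i,j} = w_{i,j} + η δ_i y_j.

The following Python sketch implements this procedure for a network with a single hidden layer; the variable names, the two-layer restriction, the fixed iteration budget in place of a termination test, and the omission of bias weights are all simplifications assumed for this illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_ebp(X, T, n_hidden, eta=0.1, n_iter=1000, seed=None):
    """Stochastic EBP for one hidden layer of sigmoid nodes.
    X: (n_instances, n_inputs) array of input vectors.
    T: (n_instances, n_outputs) array of target activations."""
    rng = np.random.default_rng(seed)
    # Step 1: small random initial weights.
    W_hid = rng.uniform(-0.05, 0.05, size=(n_hidden, X.shape[1]))
    W_out = rng.uniform(-0.05, 0.05, size=(T.shape[1], n_hidden))
    for _ in range(n_iter):            # fixed budget stands in for the
        for x, t in zip(X, T):         # termination condition
            # Step 2a, forward pass: hidden and output activations.
            y_hid = sigmoid(W_hid @ x)
            y_out = sigmoid(W_out @ y_hid)
            # Step 2b, backward pass: output deltas, then the
            # back-propagated hidden deltas.
            delta_out = y_out * (1 - y_out) * (t - y_out)
            delta_hid = y_hid * (1 - y_hid) * (W_out.T @ delta_out)
            # Step 2c, steepest-descent updates: Δw_{i,j} = η δ_i y_j.
            W_out += eta * np.outer(delta_out, y_hid)
            W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out

The outer products accumulate exactly the Δw_{i,j} = η δ_i y_j updates of step 2c, one instance at a time, which is what makes this the stochastic variant.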