Gradient descent determines the weight vector that minimizes
E. It starts with an arbitrary weight vector, modifies it in small
steps in the direction that produces the steepest descent, and
continues until the global minimum error is reached. Each step applies the weight update rule

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id},$$

where $x_{id}$ denotes the single input component $x_i$ for training example $d$, $t_d$ and $o_d$ are the target and computed outputs for $d$, and $\eta$ is the learning rate.
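As a minimal sketch of this rule, the update can be implemented as batch gradient descent over a linear unit $o = \vec{w} \cdot \vec{x}$. The function name, array shapes, step count, and the toy data at the bottom are illustrative assumptions, not part of the original text.

```python
import numpy as np

def gradient_descent(X, t, eta=0.01, steps=1000):
    """Batch gradient descent for a linear unit o = w . x.

    X: (n_examples, n_inputs) matrix whose rows are the input vectors x_d.
    t: (n_examples,) vector of target outputs t_d.
    """
    w = np.zeros(X.shape[1])       # start from an arbitrary weight vector
    for _ in range(steps):
        o = X @ w                  # outputs o_d for every training example
        # Delta rule: w_i <- w_i + eta * sum_d (t_d - o_d) * x_id,
        # computed for all i at once as a matrix-vector product.
        w += eta * X.T @ (t - o)
    return w

# Tiny illustrative fit: targets generated by the weights [1.0, 2.0].
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [0.5, 1.5]])
t = X @ np.array([1.0, 2.0])
print(gradient_descent(X, t, eta=0.01, steps=5000))  # approaches [1.0, 2.0]
```

Because every training example contributes to the sum before the weights change, this is the batch form of the rule; a single pass over the loop body corresponds to one small step down the error surface.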
Because the error surface contains only a single global minimum, the algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently small learning rate $\eta$ is used. If $\eta$ is too large, the search risks overstepping the minimum rather than settling into it; hence one common modification is to gradually reduce the value of $\eta$ as the number of gradient descent steps grows.
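One plausible realization of this modification is a step-dependent learning-rate schedule. The $\eta_0 / (1 + \text{decay} \cdot \text{step})$ form and all parameter names below are assumptions chosen for illustration; the text itself does not prescribe a particular schedule.

```python
import numpy as np

def gradient_descent_decaying(X, t, eta0=0.05, steps=1000, decay=0.001):
    """Batch gradient descent as above, but eta shrinks as steps grow."""
    w = np.zeros(X.shape[1])                # arbitrary initial weight vector
    for step in range(steps):
        eta = eta0 / (1.0 + decay * step)   # gradually reduced learning rate (assumed schedule)
        o = X @ w                           # outputs o_d for all examples
        w += eta * X.T @ (t - o)            # same delta rule, smaller steps over time
    return w
```

Early steps take larger strides toward the minimum, while later steps shrink, reducing the chance of repeatedly overstepping it.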