- learning rate, η, was set to 0.3
- momentum, α, was set to 0.3
- lower values produced equivalent generalization accuracy but longer
training times; if set too high, training fails to converge to a
network with acceptable error
- full gradient descent was used (not the stochastic approximation)
- network weights in the output units were initialized to small
random values, but the input unit weights were initialized to zero,
because this yields a more intelligible visualization of the learned
weights without noticeably affecting generalization accuracy
- the number of training iterations was selected by partitioning
the available data into a training set and a separate validation set;
gradient descent was used to minimize the error over the training set,
and after every 50 gradient descent steps the network's performance
was evaluated over the validation set. The final reported accuracy
was measured over yet a third set of test examples that were not used
to influence the training.
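The update rule above (full gradient descent with momentum, with both the learning rate and momentum set to 0.3) can be sketched as follows. The single linear unit and squared-error objective are illustrative assumptions, not the network from the text; the point is that the gradient is computed over the entire training set, with no stochastic sampling.

```python
import numpy as np

def full_batch_gd(X, y, lr=0.3, momentum=0.3, steps=200):
    """Full (batch) gradient descent with momentum on a single linear
    unit with squared error. Hyperparameter values match the text;
    the model itself is a simplifying assumption."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.05, size=X.shape[1])
    velocity = np.zeros_like(w)
    for _ in range(steps):
        # Gradient over ALL training examples -- the full (batch)
        # gradient, not the stochastic approximation.
        grad = X.T @ (X @ w - y) / len(y)
        # Momentum term: a fraction of the previous weight update
        # is added to the current one.
        velocity = momentum * velocity - lr * grad
        w += velocity
    return w
```

On a toy problem such as fitting y = 2x, this converges to the exact solution in a few hundred steps; with these fixed step sizes, a learning rate that is too large for the problem's curvature would instead diverge, which is the failure mode the text describes.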
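The initialization scheme above (small random output-unit weights, zero input-unit weights) can be sketched as below. The layer sizes and the ±0.05 range are hypothetical; the text does not fix them.

```python
import numpy as np

def init_weights(n_inputs, n_hidden, n_outputs, seed=0):
    """Initialization described in the text: input-unit weights start
    at zero (giving cleaner visualizations of what is learned), while
    output-unit weights are small random values. Shapes and the 0.05
    range are assumptions for illustration."""
    rng = np.random.default_rng(seed)
    # Input-to-hidden weights: all zero.
    w_input = np.zeros((n_hidden, n_inputs))
    # Hidden-to-output weights: small random values.
    w_output = rng.uniform(-0.05, 0.05, size=(n_outputs, n_hidden))
    return w_input, w_output
```

Note that zero input weights do not trap training in a symmetric state here: because the output-unit weights differ across hidden units, the errors backpropagated to the input weights differ too, so the hidden units diverge from one another as training proceeds.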
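The validation-based stopping procedure above can be sketched as follows: train only on the training set, evaluate on the validation set every 50 gradient descent steps, and keep the weights that achieved the lowest validation error. The single sigmoid unit and the specific step counts are simplifying assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_with_validation(X_train, y_train, X_val, y_val,
                          lr=0.3, max_steps=5000, eval_every=50):
    """Select the effective number of training iterations using a
    separate validation set, as in the text. A minimal single-unit
    sketch: gradient descent minimizes training error, and every 50
    steps the validation error decides which weights to keep."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.05, size=X_train.shape[1])
    best_w, best_err = w.copy(), np.inf
    for step in range(1, max_steps + 1):
        # One full gradient descent step on the TRAINING set only.
        p = sigmoid(X_train @ w)
        w -= lr * X_train.T @ (p - y_train) / len(y_train)
        # Every 50 steps, measure error on the VALIDATION set and
        # remember the best weights seen so far.
        if step % eval_every == 0:
            val_err = np.mean((sigmoid(X_val @ w) - y_val) ** 2)
            if val_err < best_err:
                best_err, best_w = val_err, w.copy()
    return best_w
```

The returned weights are those with the lowest validation error, not necessarily the final ones; as the text notes, the accuracy actually reported should then be measured on a third, held-out test set that played no role in either loop.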
Patricia Riddle
Fri May 15 13:00:36 NZST 1998