- learning rate, η, was set to 0.3
- momentum, α, was set to 0.3
- lower values produced equivalent generalization accuracy but longer
training times; if set too high, training fails to converge to a
network with acceptable error
- full gradient descent was used (not the stochastic approximation)
- network weights in the output units were initialized to small
random values, but the input unit weights were initialized to zero,
because this yields a more intelligible visualization of the learned
weights without noticeably affecting generalization accuracy
- the number of training iterations was selected by partitioning
the available data into a training set and a separate validation set;
gradient descent was used to minimize the error over the training set,
and after every 50 gradient descent steps the network's performance
was evaluated over the validation set. The final reported accuracy
was measured over yet a third set of test examples that were not used
to influence the training.
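The update rule above (full gradient descent with momentum, with both the learning rate and momentum set to 0.3) can be sketched as follows. The single linear unit and squared-error objective are illustrative assumptions, not the network from the text; the point is that the gradient is computed over the entire training set, with no stochastic sampling.

```python
import numpy as np

def full_batch_gd(X, y, lr=0.3, momentum=0.3, steps=200):
    """Full (batch) gradient descent with momentum on a single linear
    unit with squared error. Hyperparameter values match the text;
    the model itself is a simplifying assumption."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.05, size=X.shape[1])
    velocity = np.zeros_like(w)
    for _ in range(steps):
        # Gradient over ALL training examples -- the full (batch)
        # gradient, not the stochastic approximation.
        grad = X.T @ (X @ w - y) / len(y)
        # Momentum term: a fraction of the previous weight update
        # is added to the current one.
        velocity = momentum * velocity - lr * grad
        w += velocity
    return w
```

On a toy problem such as fitting y = 2x, this converges to the exact solution in a few hundred steps; with these fixed step sizes, a learning rate that is too large for the problem's curvature would instead diverge, which is the failure mode the text describes.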
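The initialization scheme above (small random output-unit weights, zero input-unit weights) can be sketched as below. The layer sizes and the ±0.05 range are hypothetical; the text does not fix them.

```python
import numpy as np

def init_weights(n_inputs, n_hidden, n_outputs, seed=0):
    """Initialization described in the text: input-unit weights start
    at zero (giving cleaner visualizations of what is learned), while
    output-unit weights are small random values. Shapes and the 0.05
    range are assumptions for illustration."""
    rng = np.random.default_rng(seed)
    # Input-to-hidden weights: all zero.
    w_input = np.zeros((n_hidden, n_inputs))
    # Hidden-to-output weights: small random values.
    w_output = rng.uniform(-0.05, 0.05, size=(n_outputs, n_hidden))
    return w_input, w_output
```

Note that zero input weights do not trap training in a symmetric state here: because the output-unit weights differ across hidden units, the errors backpropagated to the input weights differ too, so the hidden units diverge from one another as training proceeds.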
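The validation-based stopping procedure above can be sketched as follows: train only on the training set, evaluate on the validation set every 50 gradient descent steps, and keep the weights that achieved the lowest validation error. The single sigmoid unit and the specific step counts are simplifying assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_with_validation(X_train, y_train, X_val, y_val,
                          lr=0.3, max_steps=5000, eval_every=50):
    """Select the effective number of training iterations using a
    separate validation set, as in the text. A minimal single-unit
    sketch: gradient descent minimizes training error, and every 50
    steps the validation error decides which weights to keep."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.05, size=X_train.shape[1])
    best_w, best_err = w.copy(), np.inf
    for step in range(1, max_steps + 1):
        # One full gradient descent step on the TRAINING set only.
        p = sigmoid(X_train @ w)
        w -= lr * X_train.T @ (p - y_train) / len(y_train)
        # Every 50 steps, measure error on the VALIDATION set and
        # remember the best weights seen so far.
        if step % eval_every == 0:
            val_err = np.mean((sigmoid(X_val @ w) - y_val) ** 2)
            if val_err < best_err:
                best_err, best_w = val_err, w.copy()
    return best_w
```

The returned weights are those with the lowest validation error, not necessarily the final ones; as the text notes, the accuracy actually reported should then be measured on a third, held-out test set that played no role in either loop.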
Patricia Riddle
Fri May 15 13:00:36 NZST 1998