Several CAS actions that use stochastic gradient descent (SGD), including the gpReg and annTrain actions, share a common grammar for their optimization parameters.
SGD is an optimization method that is tailored to problems that have a large amount of data and an objective function of the form

$$F(w) = \sum_{i=1}^{n} f(w; x_i, y_i)$$

where $f$ is a function that depends on the model weights, $w$, and the data. The data in the $i$th observation are denoted as $(x_i, y_i)$. The goal of the optimization is to minimize $F(w)$.
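For illustration, the following Python sketch evaluates an objective of this form for a least squares model; the choice $f(w; x_i, y_i) = (x_i^{\prime} w - y_i)^2$ and the simulated data are assumptions for this example, not part of any CAS action.

```python
import numpy as np

def f(w, x_i, y_i):
    """Per-observation loss; here, squared error for a linear model.
    Any differentiable f(w; x_i, y_i) fits the same pattern."""
    return (x_i @ w - y_i) ** 2

def F(w, X, y):
    """Full objective: the sum of f over all n observations."""
    return sum(f(w, X[i], y[i]) for i in range(len(y)))

# Small example: n = 1000 observations, 3 model weights
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
print(F(np.zeros(3), X, y))  # objective value at w = 0
```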
You could apply gradient descent, but that would require looping through the entire data set. For extremely large amounts of data, gradient descent is impractical. A better approach is to operate on subsets of the data called minibatches. The value of the objective function for the $k$th minibatch is

$$F_k(w) = \sum_{i \in B_k} f(w; x_i, y_i)$$

where $B_k$ denotes the set of observations in the $k$th minibatch.
The negative derivative of $F_k(w)$ is used as a search direction, and the update of the model weights is given by

$$w_{t+1} = w_t - \eta\, \nabla F_k(w_t)$$

where $\eta$ is the learning rate. It is useful to decrease the learning rate as the optimization progresses, so sometimes an annealing rate, $r$, is specified, in which case the learning rate is replaced by

$$\eta_t = \frac{\eta}{1 + r\,t}$$

where $t$ is the update count.
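To make the update concrete, the following sketch runs minibatch SGD with an annealed learning rate on the least squares objective from the preceding sketch. The step sizes, batch size, and gradient formula are illustrative assumptions, not the internal implementation of any CAS action.

```python
import numpy as np

def sgd(X, y, eta=0.01, r=1e-3, batch_size=32, epochs=5):
    """Minibatch SGD for least squares: f(w; x_i, y_i) = (x_i @ w - y_i)**2."""
    rng = np.random.default_rng(0)
    n, p = X.shape
    w = np.zeros(p)
    t = 0
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            B = slice(start, start + batch_size)      # kth minibatch
            resid = X[B] @ w - y[B]
            grad = 2 * X[B].T @ resid                 # gradient of F_k at w
            eta_t = eta / (1 + r * t)                 # annealed learning rate
            w -= eta_t * grad                         # step along -grad
            t += 1
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(sgd(X, y))  # approaches [1.0, -2.0, 0.5]
```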
Momentum SGD keeps a velocity vector that stores a running average of the gradient directions to smooth out the updates. When you use momentum SGD, the update is

$$v_{t+1} = \mu\, v_t - \eta\, \nabla F_k(w_t), \qquad w_{t+1} = w_t + v_{t+1}$$

where $\mu$ is the momentum parameter.
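A minimal sketch of one momentum update follows, assuming the common formulation above; the momentum value 0.9 and the one-dimensional test problem are illustrative.

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.01, mu=0.9):
    """One momentum SGD update: the velocity v accumulates a running
    average of past gradient directions, smoothing the step in w."""
    v = mu * v - eta * grad        # blend previous velocity with new gradient
    w = w + v                      # move along the smoothed direction
    return w, v

# Usage: start with v = 0, then call once per minibatch gradient.
w, v = np.array([0.0]), np.array([0.0])
for _ in range(100):
    grad = 2 * (w - 3.0)           # gradient of (w - 3)**2
    w, v = momentum_step(w, v, grad)
print(w)                           # approaches 3.0
```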
The adaptiveRate and adaptiveDecay subparameters of the sgdOpt parameter enable you to re-create the AdaGrad and AdaDelta methods. When you specify these subparameters, the SGD update becomes

$$a_t = (1 - \rho)\,a_{t-1} + \nabla F_k(w_t) \odot \nabla F_k(w_t), \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{a_t + \epsilon}} \odot \nabla F_k(w_t)$$

where $\rho$ is the value of the adaptiveDecay subparameter, $\epsilon$ is a small positive constant that guards against division by zero, and the operations on vectors are performed componentwise. When $\rho = 0$, the result is equivalent to the AdaGrad result. When the value of the adaptiveDecay subparameter satisfies $0 < \rho < 1$, the result is equivalent to the AdaDelta result.
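The sketch below implements this update under the assumptions stated above; the accumulator form, the constant eps, and the test problem are illustrative rather than the actions' exact internal code. Setting rho to 0 reproduces the AdaGrad-style accumulation.

```python
import numpy as np

def adaptive_step(w, a, grad, eta=0.1, rho=0.0, eps=1e-8):
    """One adaptive SGD update (componentwise on vectors).
    rho = 0 keeps the full sum of squared gradients (AdaGrad-style);
    0 < rho < 1 decays the accumulator (AdaDelta-style).
    The accumulator form and eps are assumptions for illustration."""
    a = (1 - rho) * a + grad * grad           # running sum of squared gradients
    w = w - eta * grad / np.sqrt(a + eps)     # per-component scaled step
    return w, a

w, a = np.array([0.0, 0.0]), np.zeros(2)
for _ in range(500):
    grad = 2 * (w - np.array([1.0, -2.0]))    # gradient of a separable quadratic
    w, a = adaptive_step(w, a, grad)          # rho=0: AdaGrad behavior
print(w)                                      # approaches [1.0, -2.0]
```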
The adaptive moments (ADAM) SGD method keeps approximations of the first and second moments of the gradient vector and uses that information to adjust the weight updates. For ADAM SGD, the weight update is

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,\nabla F_k(w_t), \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,\nabla F_k(w_t) \odot \nabla F_k(w_t)$$

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t/(1 - \beta_2^t)} + \epsilon}\,\frac{m_t}{1 - \beta_1^t}$$

where $\beta_1$ and $\beta_2$ are the decay rates of the first-moment and second-moment estimates and the factors $1 - \beta_1^t$ and $1 - \beta_2^t$ correct the bias in those estimates.
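The following sketch shows one step of the standard ADAM update (Kingma and Ba) with its commonly cited default constants; it illustrates the method rather than the exact implementation in the CAS actions.

```python
import numpy as np

def adam_step(w, m, v, grad, t, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update (standard form; constants are the usual
    defaults, assumed here for illustration).
    m, v: running estimates of the gradient's first and second moments."""
    m = b1 * m + (1 - b1) * grad              # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad       # second-moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias corrections (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    grad = 2 * (w - 3.0)                      # gradient of (w - 3)**2
    w, m, v = adam_step(w, m, v, grad, t)
print(w)                                      # approaches 3.0
```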