Several CAS actions that use stochastic gradient descent (SGD), including the gpReg and annTrain actions, share a common grammar for their optimization parameters.
SGD is an optimization method that is tailored to problems that have a large amount of data and an objective function of the form

$$F(w) = \sum_{i=1}^{n} f(w; x_i, y_i)$$

where $f$ is a function that depends on the model weights, $w$, and the data. The data in the $i$th observation are denoted as $(x_i, y_i)$. The goal of the optimization is to minimize $F(w)$.
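For illustration, the following Python sketch evaluates an objective of this form for a least squares model; the choice $f(w; x_i, y_i) = (x_i^{\prime} w - y_i)^2$ and the simulated data are assumptions for this example, not part of any CAS action.

```python
import numpy as np

def f(w, x_i, y_i):
    """Per-observation loss; here, squared error for a linear model.
    Any differentiable f(w; x_i, y_i) fits the same pattern."""
    return (x_i @ w - y_i) ** 2

def F(w, X, y):
    """Full objective: the sum of f over all n observations."""
    return sum(f(w, X[i], y[i]) for i in range(len(y)))

# Small example: n = 1000 observations, 3 model weights
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
print(F(np.zeros(3), X, y))  # objective value at w = 0
```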
You could apply gradient descent, but that would require looping through the entire data set. For extremely large amounts of data, gradient descent is impractical. A better approach is to operate on subsets of the data called minibatches. The value of the objective function for the $k$th minibatch is

$$F_k(w) = \sum_{i \in B_k} f(w; x_i, y_i)$$

where $B_k$ denotes the set of observations in the $k$th minibatch.
The negative derivative of $F_k(w)$ is used as a search direction, and the update of the model weights is given by

$$w_{t+1} = w_t - \eta\, \nabla F_k(w_t)$$

where $\eta$ is the learning rate. It is useful to decrease the learning rate as the optimization progresses, so sometimes an annealing rate, $r$, is specified, in which case the learning rate is replaced by

$$\eta_t = \frac{\eta}{1 + r\,t}$$

where $t$ is the update count.
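To make the update concrete, the following sketch runs minibatch SGD with an annealed learning rate on the least squares objective from the preceding sketch. The step sizes, batch size, and gradient formula are illustrative assumptions, not the internal implementation of any CAS action.

```python
import numpy as np

def sgd(X, y, eta=0.01, r=1e-3, batch_size=32, epochs=5):
    """Minibatch SGD for least squares: f(w; x_i, y_i) = (x_i @ w - y_i)**2."""
    rng = np.random.default_rng(0)
    n, p = X.shape
    w = np.zeros(p)
    t = 0
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            B = slice(start, start + batch_size)      # kth minibatch
            resid = X[B] @ w - y[B]
            grad = 2 * X[B].T @ resid                 # gradient of F_k at w
            eta_t = eta / (1 + r * t)                 # annealed learning rate
            w -= eta_t * grad                         # step along -grad
            t += 1
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(sgd(X, y))  # approaches [1.0, -2.0, 0.5]
```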
Momentum SGD keeps a velocity vector that stores a running average of the gradient directions to smooth out the updates. When you use momentum SGD, the update is

$$v_{t+1} = \mu\, v_t - \eta\, \nabla F_k(w_t), \qquad w_{t+1} = w_t + v_{t+1}$$

where $\mu$ is the momentum parameter.
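A minimal sketch of one momentum update follows, assuming the common formulation above; the momentum value 0.9 and the one-dimensional test problem are illustrative.

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.01, mu=0.9):
    """One momentum SGD update: the velocity v accumulates a running
    average of past gradient directions, smoothing the step in w."""
    v = mu * v - eta * grad        # blend previous velocity with new gradient
    w = w + v                      # move along the smoothed direction
    return w, v

# Usage: start with v = 0, then call once per minibatch gradient.
w, v = np.array([0.0]), np.array([0.0])
for _ in range(100):
    grad = 2 * (w - 3.0)           # gradient of (w - 3)**2
    w, v = momentum_step(w, v, grad)
print(w)                           # approaches 3.0
```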
The adaptiveRate and adaptiveDecay subparameters of the sgdOpt parameter enable you to re-create the AdaGrad and AdaDelta methods. When you specify these subparameters, the SGD update becomes

$$a_t = (1 - \rho)\,a_{t-1} + \nabla F_k(w_t) \odot \nabla F_k(w_t), \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{a_t + \epsilon}} \odot \nabla F_k(w_t)$$

where $\rho$ is the value of the adaptiveDecay subparameter, $\epsilon$ is a small positive constant that guards against division by zero, and the operations on vectors are performed componentwise. When $\rho = 0$, the result is equivalent to the AdaGrad result. When the value of the adaptiveDecay subparameter satisfies $0 < \rho < 1$, the result is equivalent to the AdaDelta result.
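The sketch below implements this update under the assumptions stated above; the accumulator form, the constant eps, and the test problem are illustrative rather than the actions' exact internal code. Setting rho to 0 reproduces the AdaGrad-style accumulation.

```python
import numpy as np

def adaptive_step(w, a, grad, eta=0.1, rho=0.0, eps=1e-8):
    """One adaptive SGD update (componentwise on vectors).
    rho = 0 keeps the full sum of squared gradients (AdaGrad-style);
    0 < rho < 1 decays the accumulator (AdaDelta-style).
    The accumulator form and eps are assumptions for illustration."""
    a = (1 - rho) * a + grad * grad           # running sum of squared gradients
    w = w - eta * grad / np.sqrt(a + eps)     # per-component scaled step
    return w, a

w, a = np.array([0.0, 0.0]), np.zeros(2)
for _ in range(500):
    grad = 2 * (w - np.array([1.0, -2.0]))    # gradient of a separable quadratic
    w, a = adaptive_step(w, a, grad)          # rho=0: AdaGrad behavior
print(w)                                      # approaches [1.0, -2.0]
```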
The adaptive moments (ADAM) SGD method keeps approximations of the first and second moments of the gradient vector and uses that information to adjust the weight updates. For ADAM SGD, the weight update is

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,\nabla F_k(w_t), \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,\nabla F_k(w_t) \odot \nabla F_k(w_t)$$

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t/(1 - \beta_2^t)} + \epsilon}\,\frac{m_t}{1 - \beta_1^t}$$

where $\beta_1$ and $\beta_2$ are the decay rates of the first-moment and second-moment estimates and the factors $1 - \beta_1^t$ and $1 - \beta_2^t$ correct the bias in those estimates.
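The following sketch shows one step of the standard ADAM update (Kingma and Ba) with its commonly cited default constants; it illustrates the method rather than the exact implementation in the CAS actions.

```python
import numpy as np

def adam_step(w, m, v, grad, t, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update (standard form; constants are the usual
    defaults, assumed here for illustration).
    m, v: running estimates of the gradient's first and second moments."""
    m = b1 * m + (1 - b1) * grad              # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad       # second-moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias corrections (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    grad = 2 * (w - 3.0)                      # gradient of (w - 3)**2
    w, m, v = adam_step(w, m, v, grad, t)
print(w)                                      # approaches 3.0
```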