Stochastic Gradient Descent and Mini-Batch Gradient Descent

Categories:

Updated:

In batch gradient descent we’ve seen before, we used all of our training data to perform one gradient descent.

But this is slow and inefficient if number of training data grows.

To solve this problem, we can scan through training examples and perform gradient descent for every $n$ examples.

If we used all of our examples once, we say that one ‘epoch’ has passed. So in batch gradient descent we perform one gradient descent for every epoch, and in the gradient descents that we will look here we perform multiple gradient descents for every epoch.

We call ‘stochastic gradient descent’ when $n$ is 1.

We perform the following for each epoch.

For $i = 1\dots m$:

This algorithm will only try to fit one training example at a time. This way we can make progress in gradient descent without having to scan all m training examples first.

Stochastic gradient descent will be unlikely to converge at the global minimum and will instead wander around it randomly, but usually yields a result that is close enough. One strategy for trying to actually converge at the global minimum is to slowly decrease α over time.

For $i = 1,11,21,31,\dots,991$: