SGD with Momentum
Implement Stochastic Gradient Descent (SGD) with Momentum for updating neural network weights. Standard SGD can be slow to converge and tends to oscillate in narrow valleys of the loss surface. Momentum accelerates SGD by accumulating a velocity vector along directions where the gradient consistently points, which dampens those oscillations.
The SGD with Momentum update rules are:
velocity = momentum * velocity - learning_rate * gradient
weights = weights + velocity
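The two update rules above translate directly into NumPy array operations. A minimal sketch of a single step (variable names mirror the formulas; the concrete values are taken from the worked example below):

```python
import numpy as np

weights = np.array([[1.0, 2.0], [3.0, 4.0]])
gradients = np.array([[0.1, 0.2], [0.3, 0.4]])
velocity = np.zeros_like(weights)  # velocity starts at zero
learning_rate, momentum = 0.01, 0.9

# velocity = momentum * velocity - learning_rate * gradient
velocity = momentum * velocity - learning_rate * gradients
# weights = weights + velocity
weights = weights + velocity
print(weights)  # ≈ [[0.999, 1.998], [2.997, 3.996]]
```

Because the initial velocity is zero, the first step reduces to plain SGD; the momentum term only starts to matter on subsequent steps, when the previous velocity is nonzero.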
Your function sgd_with_momentum(weights, gradients, learning_rate, momentum) should initialize the velocity to zero (if not provided), compute one step of the momentum update, and return the updated weights.
Example:
Input: weights = [[1, 2], [3, 4]], gradients = [[0.1, 0.2], [0.3, 0.4]]
learning_rate = 0.01, momentum = 0.9
velocity (initial) = [[0, 0], [0, 0]]
velocity = 0.9 * [[0,0],[0,0]] - 0.01 * [[0.1,0.2],[0.3,0.4]]
= [[-0.001, -0.002], [-0.003, -0.004]]
weights = weights + velocity
Output: [[0.999, 1.998], [2.997, 3.996]]
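One way to package the update as the required function is sketched below. The problem statement only asks for the updated weights to be returned; returning the velocity as well (so it can be threaded through repeated calls) is an assumption made here for convenience:

```python
import numpy as np

def sgd_with_momentum(weights, gradients, learning_rate, momentum, velocity=None):
    """Perform one SGD-with-momentum step and return (new_weights, new_velocity).

    Returning velocity alongside weights is an added convenience so callers
    can carry it across steps; the spec itself only requires the weights.
    """
    weights = np.asarray(weights, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    if velocity is None:
        velocity = np.zeros_like(weights)  # initialize velocity to zero
    velocity = momentum * velocity - learning_rate * gradients
    return weights + velocity, velocity

# Reproduce the worked example above.
w, v = sgd_with_momentum([[1, 2], [3, 4]], [[0.1, 0.2], [0.3, 0.4]],
                         learning_rate=0.01, momentum=0.9)
print(w)  # ≈ [[0.999, 1.998], [2.997, 3.996]]
```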
The momentum term acts like a heavy ball rolling downhill: it accumulates velocity in directions where the gradient consistently points, allowing the optimizer to move faster through flat regions and roll past shallow local minima. A typical momentum value is 0.9.
Test Cases
1. weights = [[1, 2], [3, 4]] → expected output [[0.99, 1.98], [2.97, 3.96]]
2. weights = [[5, 6], [7, 8]] → expected output [[4.95, 5.94], [6.93, 7.92]]