Momentum
Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" toward convergence. Using momentum can reduce these oscillations.
Momentum takes into account the past gradients to smooth out the update. We will store the 'direction' of the previous gradients in the variable $v$. Formally, this will be the exponentially weighted average of the gradient on previous steps. You can also think of $v$ as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/slope of the hill.
Exercise: Initialize the velocity. The velocity, $v$, is a Python dictionary that needs to be initialized with arrays of zeros. Its keys are the same as those in the grads dictionary, that is: for $l = 1, ..., L$:
v["dW" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["W" + str(l+1)])
v["db" + str(l+1)] = ... #(numpy array of zeros with the same shape as parameters["b" + str(l+1)])
Note that the iterator l starts at 0 in the for loop, while the first entries are v["dW1"] and v["db1"] (the "1" refers to the first layer). This is why we shift l to l+1 inside the for loop.
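Putting this together, here is a minimal sketch of the initialization. The helper name initialize_velocity and the assumption that parameters stores one "W" and one "b" array per layer are illustrative, not fixed by the text above:

```python
import numpy as np

def initialize_velocity(parameters):
    """Build the velocity dictionary v, filled with zeros shaped like the parameters.

    Assumes `parameters` holds "W1", "b1", ..., "WL", "bL".
    """
    L = len(parameters) // 2  # number of layers in the network
    v = {}

    for l in range(L):
        # Shift l to l+1 so the keys start at "dW1"/"db1"
        v["dW" + str(l + 1)] = np.zeros_like(parameters["W" + str(l + 1)])
        v["db" + str(l + 1)] = np.zeros_like(parameters["b" + str(l + 1)])

    return v
```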
Now, implement the parameters update with momentum. The momentum update rule is, for $l = 1, ..., L$:

$$\begin{cases} v_{dW^{[l]}} = \beta\, v_{dW^{[l]}} + (1 - \beta)\, dW^{[l]} \\ W^{[l]} = W^{[l]} - \alpha\, v_{dW^{[l]}} \end{cases}$$

$$\begin{cases} v_{db^{[l]}} = \beta\, v_{db^{[l]}} + (1 - \beta)\, db^{[l]} \\ b^{[l]} = b^{[l]} - \alpha\, v_{db^{[l]}} \end{cases}$$

where $L$ is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate. All parameters should be stored in the parameters dictionary. Note that the iterator l starts at 0 in the for loop, while the first parameters are $W^{[1]}$ and $b^{[1]}$ (that's a "one" on the superscript). So you will need to shift l to l+1 when coding.
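As a rough sketch, the update rule above could be coded as follows; the function name, signature, and the two-arrays-per-layer layout of parameters and grads are assumptions for illustration:

```python
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """One momentum step over all layers, following the update rule above."""
    L = len(parameters) // 2  # number of layers

    for l in range(L):
        # Velocity: exponentially weighted average of past gradients
        v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads["db" + str(l + 1)]

        # Parameter update in the direction of the velocity
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]

    return parameters, v
```

With beta = 0 this reduces to plain mini-batch gradient descent; a common starting value is beta = 0.9.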