Mini-batch gradient descent vs Momentum vs Adam

5.1 - Mini-batch gradient descent

Run the following code to see how the model does with mini-batch gradient descent.

In [22]:
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer = "gd")
# Predict
predictions = predict(train_X, train_Y, parameters)
# Plot decision boundary
plt.title("Model with Gradient Descent optimization")
axes = plt.gca()
axes.set_xlim([-1.5,2.5])
axes.set_ylim([-1,1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
Cost after epoch 0: 0.702405
Cost after epoch 1000: 0.668101
Cost after epoch 2000: 0.635288
Cost after epoch 3000: 0.600491
Cost after epoch 4000: 0.573367
Cost after epoch 5000: 0.551977
Cost after epoch 6000: 0.532370
Cost after epoch 7000: 0.514007
Cost after epoch 8000: 0.496472
Cost after epoch 9000: 0.468014
Accuracy: 0.796666666667
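
Under the hood, each mini-batch step here is just the plain gradient-descent update applied layer by layer. The helper below is a minimal sketch of that update (not the assignment's exact code), assuming the "W1"/"b1"/"dW1"/"db1" dictionary layout used throughout this notebook:

import numpy as np

# Minimal sketch of the gradient-descent update applied after each mini-batch,
# assuming the "W1"/"b1"/"dW1"/"db1" dictionary layout used in this notebook.
def gd_update_sketch(parameters, grads, learning_rate):
    L = len(parameters) // 2                      # number of layers
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * grads["db" + str(l)]
    return parameters

# Tiny usage example with made-up shapes and a made-up learning rate:
params = {"W1": np.ones((5, 2)), "b1": np.zeros((5, 1))}
grads = {"dW1": np.ones((5, 2)), "db1": np.ones((5, 1))}
params = gd_update_sketch(params, grads, learning_rate=0.01)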

5.2 - Mini-batch gradient descent with momentum

Run the following code to see how the model does with momentum. Because this example is relatively simple, the gains from using momentum are small; for more complex problems you might see bigger gains.

In [23]:
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, beta = 0.9, optimizer = "momentum")
# Predict
predictions = predict(train_X, train_Y, parameters)
# Plot decision boundary
plt.title("Model with Momentum optimization")
axes = plt.gca()
axes.set_xlim([-1.5,2.5])
axes.set_ylim([-1,1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
Cost after epoch 0: 0.702413
Cost after epoch 1000: 0.668167
Cost after epoch 2000: 0.635388
Cost after epoch 3000: 0.600591
Cost after epoch 4000: 0.573444
Cost after epoch 5000: 0.552058
Cost after epoch 6000: 0.532458
Cost after epoch 7000: 0.514101
Cost after epoch 8000: 0.496652
Cost after epoch 9000: 0.468160
Accuracy: 0.796666666667
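
The only change relative to plain gradient descent is that each parameter now follows a "velocity" v, an exponentially weighted average of past gradients, instead of the raw gradient. Below is a minimal sketch of that update, again assuming the dictionary layout used in this notebook (the actual helper may differ in details):

import numpy as np

# Sketch of the momentum update: v keeps an exponentially weighted average of
# past gradients, and the parameters step in the direction of v.
def momentum_update_sketch(parameters, grads, v, beta, learning_rate):
    L = len(parameters) // 2
    for l in range(1, L + 1):
        for p in ("W", "b"):
            key = "d" + p + str(l)
            v[key] = beta * v[key] + (1 - beta) * grads[key]      # velocity update
            parameters[p + str(l)] -= learning_rate * v[key]      # parameter step
    return parameters, v

# v is typically initialized to zero arrays with the same shapes as the gradients:
v = {"dW1": np.zeros((5, 2)), "db1": np.zeros((5, 1))}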

5.3 - Mini-batch gradient descent with Adam

Run the following code to see how the model does with Adam.

In [24]:
# train 3-layer model
layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer = "adam")
# Predict
predictions = predict(train_X, train_Y, parameters)
# Plot decision boundary
plt.title("Model with Adam optimization")
axes = plt.gca()
axes.set_xlim([-1.5,2.5])
axes.set_ylim([-1,1.5])
plot_decision_boundary(lambda x: predict_dec(parameters, x.T), train_X, train_Y)
Cost after epoch 0: 0.702166
Cost after epoch 1000: 0.167845
Cost after epoch 2000: 0.141316
Cost after epoch 3000: 0.138788
Cost after epoch 4000: 0.136066
Cost after epoch 5000: 0.134240
Cost after epoch 6000: 0.131127
Cost after epoch 7000: 0.130216
Cost after epoch 8000: 0.129623
Cost after epoch 9000: 0.129118
Accuracy: 0.94
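
Adam combines the momentum idea with RMSprop: it keeps an exponentially weighted average of the gradients (v) and of their squares (s), bias-corrects both, and scales each step accordingly. The following is a minimal sketch of that update under the same dictionary-layout assumption; the hyperparameter defaults (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) are the standard Adam values and are assumed here, not read from the output above:

import numpy as np

# Sketch of the Adam update with bias correction. v and s track the first and
# second moments of the gradients, and t counts Adam steps (starting at 1).
def adam_update_sketch(parameters, grads, v, s, t, learning_rate,
                       beta1=0.9, beta2=0.999, epsilon=1e-8):
    L = len(parameters) // 2
    for l in range(1, L + 1):
        for p in ("W", "b"):
            key = "d" + p + str(l)
            g = grads[key]
            v[key] = beta1 * v[key] + (1 - beta1) * g             # first moment
            s[key] = beta2 * s[key] + (1 - beta2) * g ** 2        # second moment
            v_hat = v[key] / (1 - beta1 ** t)                     # bias correction
            s_hat = s[key] / (1 - beta2 ** t)
            parameters[p + str(l)] -= learning_rate * v_hat / (np.sqrt(s_hat) + epsilon)
    return parameters, v, s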

5.4 - Summary

optimization method     accuracy     cost shape
Gradient descent        79.7%        oscillations
Momentum                79.7%        oscillations
Adam                    94%          smoother

Momentum usually helps, but given the small learning rate and the simple dataset, its impact is almost negligible. The large oscillations you see in the cost come from the fact that some mini-batches are more difficult than others for the optimization algorithm.
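
To see where that batch-to-batch variation comes from, here is a sketch of how random mini-batches are typically built for this kind of model: shuffle the examples, then slice them into fixed-size chunks. The mini-batch size of 64 is an assumption for illustration, not taken from the output above.

import numpy as np

# Sketch of random mini-batch creation: shuffle the columns of X and Y, then
# cut them into chunks of mini_batch_size. A fresh shuffle each epoch is what
# makes some batches "harder" than others and produces the cost oscillations.
def random_mini_batches_sketch(X, Y, mini_batch_size=64, seed=0):
    np.random.seed(seed)
    m = X.shape[1]                                  # number of examples
    permutation = np.random.permutation(m)
    shuffled_X, shuffled_Y = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for k in range(0, m, mini_batch_size):          # last batch may be smaller
        mini_batches.append((shuffled_X[:, k:k + mini_batch_size],
                             shuffled_Y[:, k:k + mini_batch_size]))
    return mini_batches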

Adam, on the other hand, clearly outperforms mini-batch gradient descent and momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you've seen that Adam converges much faster.

Some advantages of Adam include:

  • Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum); see the sketch after this list
  • Usually works well even with little tuning of hyperparameters (except α)
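
To make the memory point concrete, here is a rough sketch (my own illustration, assuming the parameter dictionaries used in this notebook) of the extra state each optimizer keeps alongside the parameters themselves:

import numpy as np

# Rough sketch of the extra per-parameter state kept by each optimizer.
def extra_state_floats(parameters, optimizer):
    n = sum(p.size for p in parameters.values())
    if optimizer == "gd":
        return 0          # no extra state
    if optimizer == "momentum":
        return n          # one velocity entry per parameter (v)
    if optimizer == "adam":
        return 2 * n      # first and second moment estimates (v and s)

params = {"W1": np.zeros((5, 2)), "b1": np.zeros((5, 1))}
print([extra_state_floats(params, o) for o in ("gd", "momentum", "adam")])  # [0, 15, 30]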

