Exploring Optimization Algorithms in CNNs
This post delves into various optimization algorithms, building on the “Optimizing CNNs with Gradient Calculation: A Deep Dive” article. We’ll discuss SGD (Stochastic Gradient Descent) and others like Momentum, RMSProp, and Adam.
SGD (Stochastic Gradient Descent) Optimizer:

Imagine you’re on a hill, and you want to find the lowest point (the bottom of the hill). You’re blindfolded, so you can’t see the entire terrain. Instead, you can only feel the slope of the hill right where you’re standing.
Now, to find the lowest point, you take small steps downhill based on the slope you feel. You repeat this process, taking steps and adjusting your direction as you sense the slope. Each step is like an iteration, and the direction and size of your steps are determined by the slope you’re currently experiencing.
In the language of optimization (finding the minimum or maximum of a function), this hill represents the “loss” or “error” in a mathematical function that a computer is trying to minimize. The blindfolded person corresponds to the optimization algorithm, and the steps they take represent the changes made to the model parameters to minimize the error.
- Stochastic (Random): Instead of considering the entire hill (all the data) to determine the slope, you randomly pick a few points (a small batch of data) to estimate the slope. Each estimate is noisy but cheap to compute, and the noise can help the algorithm escape shallow local minima.
- Gradient Descent: The slope is the gradient of the hill at a specific point. Descent means going downhill. So, you adjust your position (model parameters) based on the negative of the gradient (slope) to move in the direction that minimizes the error.
In summary, Stochastic Gradient Descent is like blindly finding your way down a hill by taking small steps, with each step influenced by the slope you feel at a random point. It’s an iterative process aimed at reaching the lowest point, which in machine learning corresponds to minimizing the error of a model on a given dataset.
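The idea of estimating the slope from a random mini-batch can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `sgd_linear_regression`, the batch size, the epoch count, and the noise-free synthetic data generated from the weights 2 and 3 are all choices made for this example.

```python
import numpy as np

def sgd_linear_regression(x, y, learning_rate=0.1, batch_size=16, epochs=200, seed=0):
    """Fit y ~ x @ w with mini-batch SGD on the mean squared error (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = x.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # visit the data in a new random order each epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            error = x[batch] @ w - y[batch]
            # Gradient of the MSE, estimated from this mini-batch only
            gradient = 2.0 * x[batch].T @ error / len(batch)
            w -= learning_rate * gradient  # step against the gradient (downhill)
    return w

# Assumed synthetic data: two input features, targets built from weights 2 and 3
data_rng = np.random.default_rng(42)
x = data_rng.random((100, 2))
y = x @ np.array([2.0, 3.0])

w = sgd_linear_regression(x, y)
print("SGD estimate of the weights:", w)
```

Because the targets here contain no noise, the estimate should settle close to the weights used to generate the data; with real, noisy data the weights would keep jittering slightly around the minimum.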
Momentum Optimizer:

Imagine you are trying to roll a ball down a hill to reach the lowest point. You want to pick up speed as you go downhill, but you also want to avoid getting stuck in shallow valleys or bouncing back and forth. This is where the concept of momentum comes into play.
- Rolling Ball (Optimization): Imagine the ball rolling down the hill as a representation of your optimization process, where you’re trying to minimize the error of a model on a dataset.
- Speed (Momentum): Momentum in the optimization context is like the ball’s speed. Instead of just considering the current slope (gradient), momentum allows you to carry some information about the past directions you’ve been moving. It helps you build up speed in the right direction and dampens abrupt changes in direction.
- Avoiding Shallow Valleys (Local Minima): When you encounter a shallow dip (local minimum) in the hill, momentum helps you roll through it. Without momentum, you might get stuck in such areas.
In mathematical terms, the update in each iteration not only depends on the current gradient (slope) but also takes into account the accumulated momentum from past iterations.
In summary, Momentum optimization is like rolling a ball down a hill, where the ball gains momentum over time. This momentum helps the optimization process avoid getting stuck in small valleys and facilitates faster convergence toward the lowest point, which corresponds to minimizing the error of a machine learning model on a given dataset.
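A minimal sketch of the momentum update, assuming the exponentially weighted form in which the velocity is a running average of past gradients. The helper `momentum_descent` and the elongated-bowl test function are hypothetical and chosen only to make the behaviour easy to follow.

```python
import numpy as np

def momentum_descent(grad_fn, w0, learning_rate=0.1, beta=0.9, steps=500):
    """Gradient descent with momentum (illustrative helper, not a library API)."""
    w = np.asarray(w0, dtype=float).copy()
    velocity = np.zeros_like(w)
    for _ in range(steps):
        gradient = grad_fn(w)
        # Blend the previous velocity with the current gradient
        velocity = beta * velocity + (1.0 - beta) * gradient
        w = w - learning_rate * velocity
    return w

def bowl_gradient(w):
    """Gradient of the assumed test function f(w) = w[0]**2 + 10 * w[1]**2."""
    return np.array([2.0 * w[0], 20.0 * w[1]])

w_momentum = momentum_descent(bowl_gradient, w0=[5.0, 5.0])
print("Momentum result:", w_momentum)
```

A larger `beta` gives more weight to past directions, so the path is smoother but reacts more slowly when the slope changes.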
RMSProp (Root Mean Square Propagation) Optimizer:

Imagine you are trying to descend a hill in the dark, and you want to adjust your steps based on how steep the hill is at each point. However, the steepness of the hill can vary a lot, and you want to adapt to these changes efficiently.
- Descending the Hill (Optimization): Picture yourself descending the hill as a metaphor for optimizing a machine learning model to minimize the error on a dataset.
- Steepness of the Hill (Gradient): The steepness of the hill represents the gradient, which tells you how much you need to adjust your position to go downhill faster. In optimization, the gradient indicates the direction and rate at which you should update your model parameters.
- Adaptive Learning Rates (RMSProp): In RMSProp, the optimizer adjusts the step size (learning rate) for each parameter based on the historical information of gradients. It considers not only the current gradient but also the past gradients. This adaptation helps you navigate steep and shallow sections of the hill more effectively.
- Root Mean Square (RMS): RMSProp keeps a running (exponentially decaying) average of the squared past gradients and divides each step by the square root of that average. If recent gradients were large, the effective learning rate shrinks; if they were small, it grows.
In summary, RMSProp is like descending a hill with a flashlight that adapts its brightness based on the variability of the terrain. It helps you navigate both steep and shallow slopes effectively by adjusting your steps according to the historical information of the terrain’s steepness, which corresponds to adapting the learning rates for optimizing a machine learning model.
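The per-parameter scaling can be sketched as follows. This is an illustration under assumptions: `rmsprop_descent` and the test function are hypothetical names, and the small constant `epsilon` is added outside the square root, which is one common convention.

```python
import numpy as np

def rmsprop_descent(grad_fn, w0, learning_rate=0.01, beta=0.9, epsilon=1e-8, steps=2000):
    """RMSProp-style descent (illustrative helper, not a library API)."""
    w = np.asarray(w0, dtype=float).copy()
    avg_sq_grad = np.zeros_like(w)
    for _ in range(steps):
        gradient = grad_fn(w)
        # Running average of squared gradients, tracked separately for each parameter
        avg_sq_grad = beta * avg_sq_grad + (1.0 - beta) * gradient ** 2
        # Steep directions (large average) get smaller steps, flat ones get larger steps
        w = w - learning_rate * gradient / (np.sqrt(avg_sq_grad) + epsilon)
    return w

def bowl_gradient(w):
    """Gradient of the assumed test function f(w) = w[0]**2 + 10 * w[1]**2."""
    return np.array([2.0 * w[0], 20.0 * w[1]])

w_rmsprop = rmsprop_descent(bowl_gradient, w0=[5.0, 5.0])
print("RMSProp result:", w_rmsprop)
```

Both coordinates move at a similar pace even though one direction is ten times steeper, because each gradient is divided by its own running magnitude. With a fixed learning rate the result hovers near the minimum rather than landing exactly on it.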
Adam (Adaptive Moment Estimation) Optimizer:

Imagine you are trying to find the minimum point in a hilly terrain, but this time you have a fancy helper, named Adam, who not only helps you navigate the slopes efficiently but also adjusts the steps based on the varying conditions of the terrain.
- Hilly Terrain (Optimization): Picture the hilly terrain as the optimization landscape, where you’re trying to minimize the error of a machine learning model on a dataset.
- Finding the Minimum (Optimization Goal): Your goal is to find the lowest point in the terrain, which corresponds to minimizing the error of your model.
- Adam (Optimizer): Adam is your helper who guides you through the hills. Adam keeps track of how steep the hills are and adjusts your steps accordingly. It combines the benefits of two other helpers: Momentum (a smooth, consistent downhill push) and RMSProp (adaptive learning rates based on past gradients).
- Momentum (Smooth Movements): Like a rolling ball, Adam has momentum, helping you smoothly navigate through different terrains without getting stuck in local minima.
- Adaptive Learning Rates (RMSProp): Adam adjusts the size of your steps (learning rates) based on the past gradients. If the terrain is consistently steep, Adam decreases the step size, and if it’s relatively flat, Adam increases the step size.
In summary, Adam is like a smart guide helping you find the lowest point in a hilly terrain efficiently. It combines the advantages of smooth movements (momentum) and adaptive step sizes (RMSProp), allowing you to navigate the optimization landscape effectively and converge to the minimum point, which corresponds to minimizing the error of your machine learning model.
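A minimal sketch of how the two ideas combine, assuming the standard formulation of Adam with bias-corrected moment estimates. The helper `adam_descent`, the hyperparameter values, and the test function are assumptions made for this illustration.

```python
import numpy as np

def adam_descent(grad_fn, w0, learning_rate=0.01, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, steps=3000):
    """Adam-style descent (illustrative helper, not a library API)."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)  # first moment: running average of gradients (momentum part)
    v = np.zeros_like(w)  # second moment: running average of squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        gradient = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * gradient
        v = beta2 * v + (1.0 - beta2) * gradient ** 2
        # Bias correction: m and v start at zero, so early averages are scaled up
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w = w - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return w

def bowl_gradient(w):
    """Gradient of the assumed test function f(w) = w[0]**2 + 10 * w[1]**2."""
    return np.array([2.0 * w[0], 20.0 * w[1]])

w_adam = adam_descent(bowl_gradient, w0=[5.0, 5.0])
print("Adam result:", w_adam)
```

The first moment plays the role of the rolling ball's velocity, while the second moment rescales the step for each parameter.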
Python Implementation of Optimizers
Let’s examine how a single update step of SGD, Momentum, RMSProp, and Adam can be implemented in Python with NumPy:
import numpy as np
input_feature_1 = np.random.rand(100,1)
input_feature_2 = np.random.rand(100,1)
input_ground_truth = np.random.rand(100,1)
weight_1 = 2  # initial weight for input_feature_1 (arbitrary starting value)
weight_2 = 3  # initial weight for input_feature_2 (arbitrary starting value)
print("INITIAL WEIGHTS: ",weight_1," ",weight_2)
# Linear model: y = w1*x1 + w2*x2, starting from w1 = 2 and w2 = 3
output_feature_predicted = weight_1*input_feature_1 + weight_2*input_feature_2
#L2/MSE loss or regression loss
loss = np.mean((output_feature_predicted - input_ground_truth)**2)
print("LOSS:", loss)
gradient_w1 = 2 * np.mean((output_feature_predicted - input_ground_truth) * input_feature_1)
gradient_w2 = 2 * np.mean((output_feature_predicted - input_ground_truth) * input_feature_2)
learning_rate = 0.1
#SGD
weight_1 = weight_1 - learning_rate*gradient_w1
weight_2 = weight_2 - learning_rate*gradient_w2
print("SGD UPDATED WEIGHTS: ",weight_1," ",weight_2)
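# MOMENTUM
# Reset to the initial weights, because the gradients above were computed at w1 = 2, w2 = 3
weight_1, weight_2 = 2, 3
beta = 0.9  # decay rate of the running average of gradients
v_w1 = 0  # velocity: 0 on the first iteration, carried over from the previous iteration afterwards
v_w2 = 0
v_w1 = beta * v_w1 + (1 - beta) * gradient_w1
v_w2 = beta * v_w2 + (1 - beta) * gradient_w2
weight_1 = weight_1 - learning_rate * v_w1
weight_2 = weight_2 - learning_rate * v_w2
print("MOMENTUM UPDATED WEIGHTS: ", weight_1, " ", weight_2)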
# RMSPROP
# EPSILON: a small constant that avoids division by zero and keeps the update numerically stable when the denominator approaches zero. Note that it is added outside the square root.
# Reset to the initial weights, because the gradients above were computed at w1 = 2, w2 = 3
weight_1, weight_2 = 2, 3
beta = 0.9  # decay rate of the running average of squared gradients
epsilon = 1e-8
v_w1 = 0  # 0 on the first iteration; from the 2nd iteration onwards, reuse the previous value
v_w2 = 0
v_w1 = beta*v_w1 + (1 - beta) * gradient_w1**2
v_w2 = beta*v_w2 + (1 - beta) * gradient_w2**2
weight_1 = weight_1 - (learning_rate*gradient_w1)/(np.sqrt(v_w1)+epsilon)
weight_2 = weight_2 - (learning_rate*gradient_w2)/(np.sqrt(v_w2)+epsilon)
print("RMSPROP UPDATED WEIGHTS: ",weight_1," ",weight_2)
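# ADAM
# Reset to the initial weights, because the gradients above were computed at w1 = 2, w2 = 3
weight_1, weight_2 = 2, 3
beta1 = 0.9  # decay rate for the first moment (running average of gradients)
beta2 = 0.999  # decay rate for the second moment (running average of squared gradients)
epsilon = 1e-8
t = 1  # iteration counter, used for bias correction
v_w1 = 0
v_w2 = 0
m_w1 = 0
m_w2 = 0
v_w1 = beta2*v_w1 + (1-beta2) * gradient_w1**2
v_w2 = beta2*v_w2 + (1-beta2) * gradient_w2**2
m_w1 = beta1*m_w1 + (1-beta1) * gradient_w1
m_w2 = beta1*m_w2 + (1-beta1) * gradient_w2
# Bias correction: m and v start at 0, so their early values are scaled up
m_w1_hat = m_w1 / (1 - beta1**t)
m_w2_hat = m_w2 / (1 - beta1**t)
v_w1_hat = v_w1 / (1 - beta2**t)
v_w2_hat = v_w2 / (1 - beta2**t)
weight_1 = weight_1 - (learning_rate*m_w1_hat) / (np.sqrt(v_w1_hat)+epsilon)
weight_2 = weight_2 - (learning_rate*m_w2_hat) / (np.sqrt(v_w2_hat)+epsilon)
print("ADAM UPDATED WEIGHTS: ",weight_1," ",weight_2)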
Choosing the Right Optimizer
Adam is a common default optimizer because it combines the benefits of momentum and adaptive learning rates (RMSProp). However, the effectiveness of an optimizer varies across tasks, model architectures, and datasets, so empirical testing or tuning is often needed to find the most suitable one for a specific problem. Ongoing research in optimization may also introduce new methods that work better in certain contexts.
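One simple way to do such empirical testing is to run each update rule on the same problem from the same starting point and compare the final loss. The harness below is a rough sketch under assumptions: the synthetic data, the learning rates, the step count, and the names `make_problem`, `loss_and_gradient`, and `run` are all invented for this illustration, and every run uses the full dataset for each gradient so that only the update rule differs.

```python
import numpy as np

def make_problem(seed=0):
    """Assumed toy regression problem: targets built from weights 2 and 3."""
    rng = np.random.default_rng(seed)
    x = rng.random((100, 2))
    y = x @ np.array([2.0, 3.0])
    return x, y

def loss_and_gradient(w, x, y):
    error = x @ w - y
    return np.mean(error ** 2), 2.0 * x.T @ error / len(y)

def run(optimizer, learning_rate, steps=500, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the final MSE after `steps` updates with the named update rule."""
    x, y = make_problem()
    w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        _, g = loss_and_gradient(w, x, y)
        if optimizer == "sgd":  # full-batch here, i.e. plain gradient descent
            w = w - learning_rate * g
        elif optimizer == "momentum":
            m = beta1 * m + (1 - beta1) * g
            w = w - learning_rate * m
        elif optimizer == "rmsprop":
            v = beta1 * v + (1 - beta1) * g ** 2
            w = w - learning_rate * g / (np.sqrt(v) + eps)
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g ** 2
            m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
            w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
    return loss_and_gradient(w, x, y)[0]

x0, y0 = make_problem()
initial_loss = loss_and_gradient(np.zeros(2), x0, y0)[0]
# Learning rates are assumptions, not tuned recommendations
settings = {"sgd": 0.1, "momentum": 0.1, "rmsprop": 0.01, "adam": 0.01}
results = {name: run(name, lr) for name, lr in settings.items()}

print("initial loss:", initial_loss)
for name, final_loss in results.items():
    print(f"{name:>8}: final loss = {final_loss:.6f}")
```

The ranking this prints depends heavily on the chosen learning rates and step count, which is exactly why tuning on the actual task matters more than any fixed rule of thumb.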





