Optimization Algorithms
(Figure: SGD trajectory)
This article is part of a series on optimization algorithms in deep learning. Before reading it, please check out Part 1 (Types of Optimization Algorithms in Deep Learning).
Different Types of Optimization Algorithms:
Training deep neural networks requires an optimization strategy. Here are some of the most commonly used optimization algorithms in deep learning:
- Stochastic Gradient Descent
- Gradient Descent
- Mini Batch Gradient Descent
- Momentum
- AdaGrad
- RMSprop
- Adam
- AdaGradDelta
- Nadam
- AdamW
6. RMSprop:
RMSprop (Root Mean Square Propagation) is a machine learning optimization approach used to train neural networks. It is a variation of the stochastic gradient descent (SGD) algorithm that attempts to solve some of the original technique's drawbacks. RMSprop adjusts the learning rate adaptively for each parameter based on the size of the error gradient with respect to that parameter. The update rule for each parameter is as follows:
v = rho * v + (1 - rho) * grad^2
x = x - learning_rate * grad / (sqrt(v) + epsilon)
where v is a running average of the squared gradient, rho is a decay rate (typically set to 0.9), grad is the gradient of the error with respect to the parameter, learning_rate is the learning rate (typically set to a small value, such as 0.001), epsilon is a small constant added for numerical stability (typically set to 1e-8), and x is the parameter being updated.
The goal of RMSprop is to limit the oscillations in gradient updates that might occur with standard SGD, especially when dealing with high-dimensional data or non-convex loss functions. RMSprop can converge quicker and more reliably than standard SGD since it adaptively adjusts the learning rate for each parameter.
Advantages:
- RMSprop adapts the learning rate on a per-parameter basis, allowing it to converge faster than standard stochastic gradient descent.
- It addresses the problem of oscillations in stochastic gradient descent by adjusting the learning rate of each parameter based on the magnitude of its gradient.
- RMSprop is largely self-tuning: it adapts the effective step size for each parameter, reducing the need for manual learning-rate scheduling.
Disadvantages:
- Choosing the learning rate: Although RMSprop is self-tuning, a base learning rate must still be chosen, which can be difficult in practice.
- Hyperparameters must be carefully tuned: The algorithm's performance can be affected by the selection of hyperparameters, such as the decay rate, which must be carefully set.
- Sensitive to mini-batch size selection: The algorithm's performance can be sensitive to the size of the mini-batch, making it difficult to optimize in practice.
Implementation:
# Required libraries imported
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
# Define the function to be optimized (Rosenbrock function)
def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

# Define the gradient of the function
def grad_rosenbrock(x, y):
    grad_x = -2 * (1 - x) - 400 * x * (y - x ** 2)
    grad_y = 200 * (y - x ** 2)
    return np.array([grad_x, grad_y])
# Define the RMSprop algorithm
def rmsprop(x0, y0, lr, gamma, eps, n_iters):
    trajectory = [(x0, y0)]
    x, y = x0, y0
    g_sq_avg = np.zeros(2)
    for i in range(n_iters):
        grad = grad_rosenbrock(x, y)
        # Running average of the squared gradient
        g_sq_avg = gamma * g_sq_avg + (1 - gamma) * grad ** 2
        # Parameter update following the rule above: grad / (sqrt(v) + eps)
        x -= lr * grad[0] / (np.sqrt(g_sq_avg[0]) + eps)
        y -= lr * grad[1] / (np.sqrt(g_sq_avg[1]) + eps)
        trajectory.append((x, y))
    return np.array(trajectory)
# Set the learning rate, gamma, epsilon, and number of iterations
lr = 0.001
gamma = 0.9
eps = 1e-8
n_iters = 100
# Define the range of x and y values
x_vals = np.linspace(-2, 2, 100)
y_vals = np.linspace(-1, 3, 100)
# Create a meshgrid of x and y values
x_mesh, y_mesh = np.meshgrid(x_vals, y_vals)
# Compute the values of the function at each point in the meshgrid
f_vals = rosenbrock(x_mesh, y_mesh)
# Create a contour plot of the function
fig, ax = plt.subplots()
contour_plot = ax.contour(x_mesh, y_mesh, f_vals, levels=20, cmap='RdYlBu')
ax.clabel(contour_plot, inline=1, fontsize=10)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('RMSprop')
# Initialize the plot with the starting point
start_point = np.array([-1.5, 1.5])
trajectory, = ax.plot(start_point[0], start_point[1], '-o', color='black')
# Define the animation function
def update(i):
    # Recompute the trajectory up to the i-th iteration of the algorithm
    traj = rmsprop(start_point[0], start_point[1], lr, gamma, eps, i)
    trajectory.set_data(traj[:, 0], traj[:, 1])
    return trajectory,
# Create the animation
anim = FuncAnimation(fig, update, frames=n_iters, interval=200, blit=True)
# Display the animation
plt.show()
(Output: animation of the RMSprop trajectory on the contour plot of the Rosenbrock function)
7. Adam:
Adam (Adaptive Moment Estimation) is an optimization technique used to train deep learning models. It is an extension of the stochastic gradient descent (SGD) technique that integrates ideas from RMSprop and momentum.
Adam, like RMSprop, scales the learning rate by a moving average of the squared gradient. However, it also maintains a moving average of the gradient itself, similar to momentum. This allows it to perform effectively in settings with high-variance gradients or noisy data.
The Adam method keeps an exponential moving average of the gradient (the first moment) and the squared gradient (the second moment). It then computes the adaptive learning rate for each parameter based on the ratio of the two moments.
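To make this concrete, here is a minimal NumPy sketch of a single Adam update step. It illustrates the rule described above rather than any particular library's implementation; the function name is illustrative, and the defaults for the decay rates and epsilon are the commonly used values.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving average of the gradient (first moment)
    m = beta1 * m + (1 - beta1) * grad
    # Exponential moving average of the squared gradient (second moment)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction (t is the 1-based iteration count)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step: ratio of the corrected first and second moments
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
Each parameter keeps its own m and v, so the effective step size differs from parameter to parameter.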
Advantages:
- Adam is computationally efficient and uses less memory than other optimization algorithms.
- Adam adjusts the learning rate for each parameter based on the estimation of the gradient's first and second moments, allowing it to converge rapidly and effectively.
- Adam performs well with sparse gradients, making it useful for deep learning models with a high number of parameters and sparse gradients.
- Robust to different hyperparameters: Adam is often robust to varied hyperparameter choices, such as learning rate, making it easier to employ.
Disadvantages:
- Adam can generalize worse than plain SGD with momentum on some tasks, despite converging faster during training.
- It maintains first- and second-moment estimates for every parameter, which increases memory usage compared with plain SGD.
- Its hyperparameters (learning rate, beta1, beta2, epsilon) still need to be chosen, and poor choices can hurt convergence.
- In some settings Adam can converge to suboptimal solutions; variants such as AMSGrad were proposed to address this.
8. AdaGradDelta:
AdaGradDelta is an upgrade of the AdaGrad algorithm that was created to improve its convergence behaviour. Like AdaGrad, it adaptively modifies the learning rate for each individual parameter based on its gradient history. Unlike AdaGrad, however, it scales the learning rates by an estimate of the curvature of the loss function, which can improve the convergence rate. The update rule is as follows:
h_t = rho * h_{t-1} + (1-rho) * grad_t^2
delta_t = (delta_{t-1} + eps) / (sqrt(h_t) + eps)
theta_{t+1} = theta_t - learning_rate * delta_t * grad_t
where h_t is the exponentially decaying sum of squared gradients, delta_t is the learning rate scaling factor, rho is a decay parameter, eps is a tiny constant to prevent division by zero, grad_t is the gradient at iteration t, and theta_t is the parameter vector at iteration t.
The running average of the inverse square root of the h_t values is used to compute the scaling factor delta_t. The rho parameter, which can be set between 0 and 1, determines the decay rate of the h_t values. The goal behind AdaGradDelta is to estimate the curvature of the loss function using the h_t values and then change the learning rates accordingly. This is especially beneficial in deep learning applications where the loss function may be very non-convex.
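A minimal NumPy sketch of one update, implementing the rule exactly as stated above; the function name and the default values for rho, eps, and the learning rate are illustrative assumptions, not a standard API. h and delta would be initialized to zeros of the same shape as theta.
import numpy as np

def adagrad_delta_step(theta, grad, h, delta, lr=1.0, rho=0.95, eps=1e-6):
    # Exponentially decaying average of squared gradients
    h = rho * h + (1 - rho) * grad ** 2
    # Learning-rate scaling factor built from the previous scaling factor
    delta = (delta + eps) / (np.sqrt(h) + eps)
    # Scaled parameter update
    theta = theta - lr * delta * grad
    return theta, h, delta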
Advantages:
- In comparison to AdaGrad, AdaGradDelta is better suited to non-stationary objectives.
- It can adapt to varied feature scales, resulting in faster learning and better convergence.
- It is a simple modification of AdaGrad that can be easily integrated into existing optimization frameworks.
Disadvantages:
- The disadvantages of AdaGradDelta are largely the same as those of AdaGrad, such as the need for careful calibration of the initial learning rate and the accumulation of past gradients.
- It might not be suitable for all types of optimization problems, especially those with a large number of parameters or non-convex objectives.
- Its performance can be affected by the selection of hyperparameters.
9. Nadam:
Nadam is a hybrid optimization technique that combines Adam and Nesterov accelerated gradient descent (NAG). "Nadam" is an abbreviation for Nesterov-accelerated Adaptive Moment Estimation.
Nadam, like Adam, updates the parameters during training using a combination of momentum and adaptive learning rates. In contrast to Adam, Nadam uses the Nesterov momentum term rather than the standard momentum term. This improves the optimization algorithm's convergence rate and stability.
The update rule for Nadam is as follows:
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
m_t_hat = m_t / (1 - beta1^t)
v_t_hat = v_t / (1 - beta2^t)
theta_t = theta_{t-1} - eta * (beta1 * (1 - 0.5 * (0.96^(t*10/iterations))) * m_t_hat + (1 - beta1) * (1 - 0.5 * (0.96^(t*10/iterations))) * g_t / (1 - beta1^t)) / (sqrt(v_t_hat) + epsilon)
where m_t and v_t are the gradient's first and second moments, beta1 and beta2 are the moments' exponential decay rates, eta is the learning rate, t is the current iteration, iterations is the total number of iterations, theta_t is the current parameter value, and epsilon is a small value added for numerical stability.
The expression beta1 * (1 - 0.5 * (0.96^(t*10/iterations))) * m_t_hat corresponds to the Nesterov momentum term, which aids in convergence rate improvement. The Adam update is represented by (1 - beta1) * (1 - 0.5 * (0.96^(t*10/iterations))) * g_t / (1 - beta1^t).
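For reference, here is a minimal NumPy sketch of one Nadam step implementing the update rule exactly as written above; the function name and default values are illustrative assumptions rather than a library API.
import numpy as np

def nadam_step(theta, grad, m, v, t, iterations,
               eta=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-corrected moments
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Momentum schedule factor from the update rule above
    mu = 1 - 0.5 * 0.96 ** (t * 10 / iterations)
    # Nesterov-style combination of the corrected momentum and the current gradient
    step = beta1 * mu * m_hat + (1 - beta1) * mu * grad / (1 - beta1 ** t)
    theta = theta - eta * step / (np.sqrt(v_hat) + eps)
    return theta, m, v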
Advantages:
- Nadam combines the benefits of Adam and Nesterov's momentum to provide faster convergence rates.
- Improved accuracy: In some circumstances, Nadam has been shown to outperform other optimization techniques, particularly for deep neural networks.
- Robustness to hyperparameters: Nadam is often relatively insensitive to its hyperparameter settings, much like Adam and RMSprop.
Disadvantages:
- Nadam is more computationally expensive than other optimization algorithms, which can be a disadvantage when working with huge datasets.
- Sensitive to learning rate: Nadam's performance can be affected by the learning rate chosen, and determining the best learning rate can be time-consuming.
10. AdamW:
AdamW (Adam with weight decay regularization) extends the Adam method by adding a weight decay regularization term to the weight update rule. This term is proportional to the weight, and it helps prevent overfitting by pushing the model toward smaller weights.
The following weight decay regularization term is added to the update rule:
weight = weight - (learning_rate * (m_hat / (v_hat**(1/2) + epsilon) + weight_decay * weight))
where weight_decay determines the strength of the weight decay regularization and epsilon is a tiny constant used to avoid division by zero.
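A minimal NumPy sketch of one AdamW step following the rule above. The first and second moments are maintained as in standard Adam; the function name and the default weight_decay value are illustrative assumptions.
import numpy as np

def adamw_step(weight, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Standard Adam moment estimates with bias correction
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: the decay term acts directly on the weight,
    # outside the adaptive gradient scaling
    weight = weight - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * weight)
    return weight, m, v
This decoupling of the decay term from the adaptive gradient scaling is what distinguishes AdamW from simply adding L2 regularization to the loss when using Adam.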
Advantages:
- In certain cases, AdamW is known to outperform Adam in terms of generalization performance, particularly when deep neural networks with several layers are used.
- AdamW's weight decay term helps prevent overfitting and improves the model's ability to generalize to new data.
- AdamW often converges faster than other weight decay optimization techniques, such as SGD with weight decay.
Disadvantages:
- The fundamental disadvantage of AdamW is that it necessitates the tuning of extra hyperparameters, particularly the weight decay factor and learning rate.
- In rare circumstances, the weight decay term may have an adverse effect on the optimization process, resulting in slower convergence or worse generalization performance. This can occur if the weight decay factor is too large or if regularization is unnecessary for the particular problem at hand.
- AdamW, like other optimization methods, can become trapped in local minima or plateaus, especially when dealing with complex and high-dimensional optimization problems.
Conclusion
The choice of optimization algorithm depends on a number of factors, including the task at hand, the size of the dataset, the architecture of the neural network, and so on. Because of their effectiveness and efficiency, Adam and its variants are widely used in deep learning. However, they may not be appropriate for all tasks, and simpler optimization techniques such as gradient descent may be more effective in some cases. Furthermore, regularization techniques like weight decay can be combined with optimization algorithms like Adam to improve their performance.