Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is a popular optimization algorithm in machine learning that finds good model parameters by minimizing a loss function. It is a variant of Gradient Descent that updates the parameters using the gradient computed on a single randomly chosen example or a small random subset of the data, rather than the average gradient across the entire dataset.
Implementation
Dataset: Iris
Source Code:
# Required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = load_iris()
# Scale the input features to have zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(iris.data)
y = iris.target
# Set the hyperparameters for the model
alpha = 0.01
epochs = 100
n_features = X.shape[1]
n_classes = len(np.unique(y))
# Initialize the weights
w = np.zeros((n_features, n_classes))
# Define the softmax function
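def softmax(z):
    # Shift by the row-wise maximum so np.exp cannot overflow
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)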
# Define the cost function
def cross_entropy_loss(y_true, y_pred):
    loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
    return loss
# Define the gradient of the cost function
def compute_gradient(X, y, w):
    m = len(y)
    z = X.dot(w)
    y_pred = softmax(z)
    grad = 1/m * X.T.dot(y_pred - y)
    return grad
# Define the training loop
cost = []
for i in range(epochs):
    # Shuffle the data for each epoch
    idx = np.random.permutation(len(y))
    X = X[idx]
    y = y[idx]
    # Update the weights using SGD, one example at a time
    for j in range(len(y)):
        x_j = X[j].reshape(1, -1)
        y_j = np.eye(n_classes)[y[j]].reshape(1, -1)
        w -= alpha * compute_gradient(x_j, y_j, w)
    # Calculate and store the cost for this epoch
    z = X.dot(w)
    y_pred = softmax(z)
    epoch_cost = cross_entropy_loss(np.eye(n_classes)[y], y_pred)
    cost.append(epoch_cost)
# Plot the cost function over the number of epochs
plt.figure(figsize=(8, 6))
plt.plot(range(len(cost)), cost)
plt.title('SGD Cost over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.show()
Obtained Output: [Figure: SGD cost plotted against the number of epochs]
Description
- Imports the required libraries: NumPy for numerical computations, Pandas for data manipulation (imported but not used in this script), Matplotlib for data visualization, and Scikit-learn for loading the Iris dataset and scaling the features.
- Loads the Iris dataset using Scikit-learn's load_iris() function.
- Extracts the input features (sepal length, sepal width, petal length, and petal width) as X and the target variable (class) as y.
- Scales the input features to zero mean and unit variance using Scikit-learn's StandardScaler().
- Specifies the hyperparameters: the learning rate (alpha) and the number of epochs (epochs). The number of features and the number of classes are read from the data.
- Initializes the weights of the softmax (multinomial logistic) regression model to zeros using NumPy's zeros() function.
- Runs SGD for the specified number of epochs: the data are shuffled at the start of each epoch, and then, for each example in turn, the gradient of the cross-entropy loss with respect to the weights is computed and the weights are moved in the direction of the negative gradient scaled by the learning rate.
- Computes the cross-entropy loss on the complete training set at the end of each epoch and stores it.
- Using Matplotlib, plots the stored cost over the number of epochs.
Instead of computing the gradient on the full dataset, Stochastic Gradient Descent (SGD) estimates it from a randomly selected portion of the data, called a batch or mini-batch, and uses that estimate to update the parameters. This stochastic strategy reduces the computational cost of each update and allows the algorithm to handle huge datasets more efficiently.
The algorithm begins with an initial guess for the model parameters and iteratively changes them as follows:
1. Choose a mini-batch of data at random from the training set.
2. Using the selected mini-batch, compute the gradient of the loss function with respect to the parameters.
3. Update the model parameters by subtracting the gradient, scaled by the learning rate, from their current values.
4. Repeat steps 1-3 until convergence is reached or for a set number of iterations.
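The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used earlier in this article: it assumes a linear regression model with a mean squared error loss, and the function name minibatch_sgd, its default hyperparameters, and the synthetic data are all made up for the example.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=16, n_iters=500, seed=0):
    """Fit linear-regression weights with mini-batch SGD on the mean squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])  # initial guess for the parameters
    for _ in range(n_iters):
        # Step 1: choose a mini-batch at random from the training set
        idx = rng.choice(len(y), size=batch_size, replace=False)
        X_b, y_b = X[idx], y[idx]
        # Step 2: gradient of the mean squared error on the mini-batch
        grad = 2.0 / batch_size * X_b.T @ (X_b @ w - y_b)
        # Step 3: move against the gradient, scaled by the learning rate
        w -= lr * grad
    # Step 4: the loop above repeats steps 1-3 for a fixed number of iterations
    return w

# Synthetic, noise-free data generated from known weights (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_w

w_hat = minibatch_sgd(X_demo, y_demo)
print("estimated weights:", np.round(w_hat, 3))
print("true weights     :", true_w)
```

Because the data here contain no noise, the estimated weights should end up very close to the true ones. On real data the estimate keeps fluctuating around the minimum unless the learning rate is reduced over time.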
The learning rate, which governs the step size of parameter updates, is a hyperparameter that must be tuned to get good performance. Various extensions of SGD have been developed to improve its convergence speed and stability, including momentum, AdaGrad, and Adam.
The description above relies on several other terms that play an important role in machine learning. Let's talk about them.
What is Gradient Descent?
Because neural networks are nonlinear, the loss functions of greatest interest are nonconvex. The Gradient Descent (GD) technique alters the weights in small increments by calculating the gradient (derivative) of the loss function, which tells us which direction to move in to reduce the loss. On a nonconvex loss this leads toward a minimum, although not necessarily the global one.
This is done iteratively, usually on batches of data, and one complete pass over the training set is known as an epoch. Because the gradient always points in the direction of increasing loss function value, the values of the parameters are updated in the opposite direction in each iteration. The gradient value and a learning rate hyperparameter dictate the magnitude of the change in each iteration.
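Written as a formula, the update described above takes the following form. The symbols are notation chosen for this illustration and do not appear elsewhere in the article: \theta_t denotes the parameters at iteration t, \eta the learning rate, and L the loss function.

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)
```

The minus sign expresses the move against the gradient, and the product of \eta with the gradient sets the size of the step.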
Are SGD and GD the Same?
Stochastic Gradient Descent (SGD) and Gradient Descent (GD) are related but not the same. Both techniques are used for optimization in machine learning and deep learning models, but they differ in how the model parameters are updated during training.
Gradient Descent updates the model parameters using the gradient of the loss function with respect to all of the training data, i.e., the entire batch. SGD, on the other hand, updates the model parameters using the gradient of the loss function with respect to a randomly selected subset of the training data, i.e., a mini-batch.
However, because SGD uses only part of the training data, its gradient estimate is noisy, so the loss can fluctuate from step to step and the final solution can be less precise than the one GD reaches. To address this, refinements such as mini-batch updates, momentum, AdaGrad, and Adam have been developed to improve the speed and stability of convergence.
In summary, SGD is a variant of GD that updates the model parameters using a mini-batch of training data chosen at random. While both techniques have benefits and drawbacks, SGD is commonly used in deep learning due to its computational efficiency and capacity to handle big datasets.
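The difference between the two gradients can be made concrete with a small experiment. The sketch below assumes a linear model with a mean squared error loss and synthetic data. The helper mse_gradient and the batch size of 32 are illustrative choices, not part of the implementation shown earlier.

```python
import numpy as np

def mse_gradient(X, y, w):
    """Gradient of the mean squared error (1/n) * ||Xw - y||^2 with respect to w."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=1000)
w = np.zeros(4)

# Gradient Descent: one gradient computed from every training example
full_grad = mse_gradient(X, y, w)

# SGD: a gradient estimated from a random mini-batch of 32 examples
batch = rng.choice(len(y), size=32, replace=False)
mini_grad = mse_gradient(X[batch], y[batch], w)

# Averaging many mini-batch gradients approaches the full-batch gradient
estimates = []
for _ in range(2000):
    b = rng.choice(len(y), size=32, replace=False)
    estimates.append(mse_gradient(X[b], y[b], w))
mean_estimate = np.mean(estimates, axis=0)

print("full batch gradient       :", np.round(full_grad, 2))
print("single mini-batch gradient:", np.round(mini_grad, 2))
print("mean of 2000 mini-batches :", np.round(mean_estimate, 2))
```

A single mini-batch gradient is a noisy estimate of the full-batch gradient, but it costs a fraction of the computation. Its average over many draws agrees with the full-batch gradient, which is why the noisy updates still make progress in the right direction.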
Summary of the Learning Algorithm in Deep Learning
1. Begin with initial values for the weights and biases.
2. Pick a portion of the input data and run it through the network to get a prediction.
3. Compare the prediction to the true labels and compute the loss function value.
4. Backpropagate the loss to compute its gradient with respect to each parameter.
5. Apply gradient descent to the network parameters.
6. Iterate until a suitable result is attained.
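The six steps can be traced in a small NumPy sketch. Everything specific here is an assumption made for illustration: a network with one hidden layer of 8 tanh units and a sigmoid output, a toy two-feature dataset, a binary cross-entropy loss, and the chosen learning rate and batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (illustrative only)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float).reshape(-1, 1)

# 1. Begin with initial values for the weights and biases
W1 = rng.normal(scale=0.5, size=(2, 8))
b1 = np.zeros((1, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros((1, 1))

lr, batch_size = 0.5, 32
losses = []

for step in range(2000):
    # 2. Pick a portion of the data and run it through the network
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    h = np.tanh(xb @ W1 + b1)                  # hidden layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # predicted probability

    # 3. Compare the prediction to the true labels (binary cross-entropy)
    loss = -np.mean(yb * np.log(p + 1e-12) + (1 - yb) * np.log(1 - p + 1e-12))
    losses.append(loss)

    # 4. Backpropagate the loss through the two layers
    dz2 = (p - yb) / batch_size
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (1.0 - h ** 2)        # derivative of tanh is 1 - tanh^2
    dW1 = xb.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # 5. Apply a gradient descent step to every parameter
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
    # 6. The loop repeats until the fixed number of steps is reached

# Accuracy of the trained network on the full toy dataset
probs = 1.0 / (1.0 + np.exp(-(np.tanh(X @ W1 + b1) @ W2 + b2)))
accuracy = np.mean((probs > 0.5) == (y > 0.5))
print("first loss:", round(float(losses[0]), 3))
print("mean of last 100 losses:", round(float(np.mean(losses[-100:])), 3))
print("training accuracy:", round(float(accuracy), 3))
```

In practice a deep learning framework computes step 4 automatically. Writing it out by hand only serves to show where each gradient comes from.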
Vanishing Gradient
The vanishing gradient problem happens in deep neural networks when the gradients of the loss function with respect to the model parameters become exceedingly small during backpropagation. This problem can cause the parameters to converge slowly or not at all, preventing the model from learning properly.
The issue arises because backpropagation applies the chain rule: the gradient at each layer is obtained by multiplying the layer's local derivative by the gradient coming from the layer above it. As the number of layers increases, these repeated products can shrink exponentially, especially for activation functions that saturate toward the edges of their input range, such as the sigmoid or hyperbolic tangent function.
When the gradients get extremely small, the network's weights are changed very slowly, if at all, and the model fails to train. The vanishing gradient problem is especially severe for deep neural networks with many layers, such as those used in image and speech recognition.
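The exponential shrinkage is easy to reproduce. The sketch below assumes the simplest possible setting, a chain of scalar sigmoid units with every weight fixed at 1. The helper chain_gradient is a hypothetical name introduced for this example. Since the derivative of the sigmoid never exceeds 0.25, the gradient through n such units is at most 0.25 to the power n.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never exceeds 0.25

def chain_gradient(x, depth, weight=1.0):
    """Derivative of the output with respect to the input for `depth` stacked sigmoid units."""
    a, grad = x, 1.0
    for _ in range(depth):
        z = weight * a
        grad *= weight * sigmoid_grad(z)  # chain rule: multiply the local derivatives
        a = sigmoid(z)
    return grad

for depth in (1, 5, 10, 20):
    print(f"depth {depth:2d}: gradient = {chain_gradient(0.5, depth):.3e}")
```

Each additional layer multiplies the gradient by a factor of at most 0.25, so after 20 layers almost nothing of the original signal is left for the earliest weights.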
To address the vanishing gradient problem, several techniques have been developed, including:
- Initialization techniques: Choosing the initial weights of the network in such a way that the activation functions are not saturated.
- Non-saturating activation functions: Using ReLU, LeakyReLU, ELU, and other activation functions that do not saturate for large positive inputs.
- Batch normalization: Normalizing the activations of each layer during training to keep them away from saturated regions and stabilize the learning process.
- Residual connections: Including skip connections that let gradients bypass one or more layers, giving them a direct path that avoids saturation points.
By employing these strategies, deep neural networks can substantially reduce the vanishing gradient problem and achieve better performance.
Exploding Gradient
The exploding gradient problem is the opposite of the vanishing gradient problem. It happens when the gradients of the loss function with respect to the model parameters become extremely large during backpropagation, resulting in unstable and unreliable model parameter updates. This problem can cause the parameters to diverge, making the model incapable of learning properly.
In deep neural networks, the exploding gradient problem can occur when gradients are multiplied by each other repeatedly during backpropagation, resulting in exponential growth. This problem is especially severe for deep networks with many layers and complex architectures.
To address the exploding gradient problem, several techniques have been proposed, including:
- Gradient clipping: Setting a threshold on the gradient's norm to prevent it from growing too large.
- Initialization techniques: Choosing the network's starting weights in a way that avoids exploding gradients.
- Non-saturating activation functions: Using ReLU, LeakyReLU, ELU, and other activation functions that do not saturate for large positive inputs, although these mainly help against vanishing rather than exploding gradients.
- Batch normalization: Normalizing the activations of each layer during training to keep gradient magnitudes in check and stabilize the learning process.
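Of the techniques listed, gradient clipping is the simplest to show in code. The following is a minimal sketch of clipping by norm. The function name clip_by_norm and the threshold of 5.0 are illustrative choices, not a reference to any particular library's API.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm, keeping its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

large_grad = np.array([30.0, -40.0])   # L2 norm is 50
small_grad = np.array([0.3, 0.4])      # L2 norm is 0.5

clipped_large = clip_by_norm(large_grad, max_norm=5.0)
clipped_small = clip_by_norm(small_grad, max_norm=5.0)

print("clipped large gradient:", clipped_large, "norm:", np.linalg.norm(clipped_large))
print("small gradient is left unchanged:", clipped_small)
```

Clipping is applied to the gradient just before the parameter update. Because only the length of the gradient changes, the update still points in the same direction.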
Learning Rate
The learning rate is a hyperparameter in machine learning that governs how much the model parameters change at each update during training. It determines the gradient descent algorithm's step size, which is used to minimize the model's loss function.
During training, the gradients of the loss function with respect to each parameter are computed, and the learning rate defines how much these gradients impact the update of the model parameters. If the learning rate is too low, the model will converge slowly and may become trapped in a poor local minimum. If the learning rate is too high, the model may overshoot the ideal solution and fail to converge or even diverge.
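The effect of the learning rate can be seen on the simplest possible loss. The sketch below assumes the one-dimensional function f(x) = x^2, whose gradient is 2x. The three learning rates are picked only to show the three regimes described above.

```python
def gradient_descent_1d(lr, x0=5.0, steps=50):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x  # the gradient of x**2 is 2x
    return x

too_low = gradient_descent_1d(lr=0.01)
suitable = gradient_descent_1d(lr=0.4)
too_high = gradient_descent_1d(lr=1.1)

print("learning rate 0.01:", too_low)    # still far from the minimum at 0
print("learning rate 0.4 :", suitable)   # very close to the minimum
print("learning rate 1.1 :", too_high)   # has moved away from the minimum
```

Each step multiplies x by (1 - 2 * lr). With a learning rate of 0.01 the factor is 0.98, so progress is slow. With 0.4 it is 0.2, so convergence is fast. With 1.1 it is -1.2, so every step overshoots the minimum and lands further away than before.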
Common techniques for choosing the learning rate include:
- Grid search: Trying out a variety of learning rates and selecting the one that performs best on a validation set.
- Learning rate schedules: Decreasing the learning rate over time, frequently exponentially or stepwise, so that the model takes smaller steps as training progresses and can settle into a better minimum.
- Adaptive learning rate approaches: Methods such as AdaGrad, RMSProp, and Adam dynamically alter the effective learning rate of each parameter during training based on the history of its gradients.
By selecting an appropriate learning rate, machine learning and deep learning models can be trained more successfully and perform better across tasks.
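The exponential and stepwise schedules mentioned above can be sketched as two small functions. The function names, parameter names, and numeric values are assumptions made for this illustration.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Multiply the learning rate by decay_rate once per epoch."""
    return initial_lr * decay_rate ** epoch

def step_decay(initial_lr, drop_factor, epochs_per_drop, epoch):
    """Multiply the learning rate by drop_factor every epochs_per_drop epochs."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)

for epoch in (0, 10, 20, 30):
    exp_lr = exponential_decay(0.1, decay_rate=0.95, epoch=epoch)
    step_lr = step_decay(0.1, drop_factor=0.5, epochs_per_drop=10, epoch=epoch)
    print(f"epoch {epoch:2d}: exponential = {exp_lr:.4f}, stepwise = {step_lr:.4f}")
```

In a training loop, the schedule would be evaluated at the start of each epoch and its result used as the learning rate for that epoch's updates.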
Key Points to Remember
- Stochastic Gradient Descent (SGD) is a well-known optimization procedure that is employed in the training of machine learning models. The gradients of a portion of the training samples are used to iteratively update the model's parameters.
- Stochastic Updates: SGD updates are carried out using a random selection of data points, commonly referred to as a mini-batch, as opposed to conventional gradient descent, which calculates gradients over the whole training dataset. The model may be able to avoid local minima and find better solutions as a result of the noise introduced by this unpredictability.
- SGD uses a learning rate parameter to control the magnitude of the steps made against the gradient's direction. The stability and speed of convergence of the optimization process depend on choosing a suitable learning rate.
- The batch size in SGD describes how many samples are used for each update. Larger batch sizes can offer more consistent updates but necessitate more memory whereas smaller batch sizes are computationally economical but introduce more noise.
- Convergence: Compared to conventional gradient descent, SGD often requires more iterations to converge. However, because each iteration is much cheaper, SGD can reach effective solutions more quickly, particularly when dealing with huge datasets or non-convex optimization problems.
- Trade-off between speed and precision: SGD prioritizes speed above precision. Although it makes progress quickly, the updates are noisy, which makes the loss fluctuate during training. This trade-off can be addressed using strategies such as learning rate schedules, momentum, or adaptive learning rates (such as AdaGrad, RMSprop, or Adam).
- Regularization: To avoid overfitting, SGD can be paired with regularization strategies such as L1 or L2 regularization. The penalty term that regularization introduces into the loss function drives the model to learn representations that are more straightforward and generic.
- SGD variations: To enhance its performance, SGD has undergone a number of changes. These have their own advantages and disadvantages, and include Momentum, Nesterov Accelerated Gradient (NAG), AdaGrad, RMSprop, and Adam. It's crucial to take into account these alternatives depending on the particular problem.
- Data Preprocessing: When employing SGD, proper data preprocessing methods, such as normalization, scaling, or shuffling the training data, are crucial. Preprocessing speeds up convergence, prevents numerical instability, and helps to make sure the model actually learns from the data.
- Monitoring and Evaluation: When employing SGD, it is essential to monitor the training process. Tracking the training loss and validation metrics shows whether the model is actually learning and isn't overfitting. Early stopping can be used to end training before the model starts to overfit.
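Momentum, named several times in the points above, changes only the update step. The sketch below is a minimal illustration under stated assumptions: it uses exact gradients of a simple quadratic loss instead of mini-batch gradients so that the result is reproducible, and the function name sgd_momentum_step and all numeric values are made up for the example.

```python
import numpy as np

def sgd_momentum_step(w, velocity, grad, lr=0.05, momentum=0.9):
    """One momentum update: keep a running velocity and move the weights along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Quadratic loss 0.5 * w^T A w whose curvature differs by a factor of 10 per direction
A = np.diag([1.0, 10.0])
w = np.array([5.0, 5.0])
velocity = np.zeros_like(w)

for _ in range(300):
    grad = A @ w  # exact gradient of the quadratic loss
    w, velocity = sgd_momentum_step(w, velocity, grad)

print("final weights:", w)
print("distance from the minimum:", np.linalg.norm(w))
```

The velocity accumulates past gradients, which smooths out step-to-step noise and speeds up progress along directions where the gradient keeps the same sign.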
Conclusion
In general, due to its effectiveness and efficiency, SGD is a powerful optimization method that is frequently employed in deep learning and other machine learning applications. However, GD or other optimization methods can be better suited depending on the particular situation and data. To get the best results on a given task, it's crucial to select the optimization algorithm carefully and tune its hyperparameters.