AlexNet
AlexNet is a deep convolutional neural network architecture for image classification, introduced by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton in 2012. It was the winning submission in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a result that marked a turning point for the deep learning field.
One of AlexNet's significant contributions was the use of Rectified Linear Units (ReLU) as the activation function in the convolutional layers, which helped to mitigate the vanishing gradient problem and made it possible to train deeper networks effectively. Another innovation was the use of data augmentation methods such as random cropping and horizontal flipping, which enlarged the effective training dataset and reduced overfitting.
Since its debut, AlexNet has influenced the design of numerous later deep learning architectures, such as VGG, ResNet, and Inception, and has established itself as a standard baseline for measuring how well new models perform on challenging image classification tasks.
Architecture
The AlexNet architecture is made up of eight layers: five convolutional layers followed by three fully connected layers, the last of which produces the softmax output.
Here is a quick summary of the AlexNet architecture's layers; a runnable Keras sketch of the full stack follows the list:
- Input layer: The input layer accepts an image of size 227 × 227 × 3.
- Convolutional layer 1: The first convolutional layer has 96 filters of size 11 × 11 × 3, applied with a stride of 4. ReLU is used as the activation function, which helps to mitigate the vanishing gradient problem.
- Max pooling layer 1: A max pooling layer of size 3 × 3 with a stride of 2 is applied to the output of the first convolutional layer.
- Convolutional layer 2: The second convolutional layer has 256 filters of size 5 × 5 × 48 with a stride of 1 (the depth of 48 here, like the 192 in later layers, reflects the original network's split across two GPU streams). ReLU is the activation function employed.
- Max pooling layer 2: A max pooling layer of size 3 × 3 with a stride of 2 is applied to the output of the second convolutional layer.
- Convolutional layer 3: The third convolutional layer has 384 filters of size 3 × 3 × 256 (the full depth of the previous layer's output) with a stride of 1. ReLU is the activation function employed.
- Convolutional layer 4: The fourth convolutional layer has 384 filters of size 3 × 3 × 192 with a stride of 1. ReLU is the activation function employed.
- Convolutional layer 5: The fifth convolutional layer has 256 filters of size 3 × 3 × 192 with a stride of 1. ReLU is the activation function employed.
- Max pooling layer 3: A max pooling layer of size 3 × 3 with a stride of 2 is applied to the output of the fifth convolutional layer.
- Fully connected layer 1: The output of the third max pooling layer is flattened and passed through a fully connected layer with 4096 units. ReLU is the activation function employed.
- Fully connected layer 2: A second, 4096-unit fully connected layer is used to process the output of the first. ReLU is the activation function employed.
- Output layer: To construct a probability distribution over the 1000 categories in the ImageNet dataset, the output of the second fully connected layer is sent through a softmax layer.
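Translated into Keras, the stack above takes only a few lines. The sketch below is a single-stream approximation: the original network split its filters across two GPUs (which is why depths of 48 and 192 appear in the sizes above) and applied local response normalization after the first two convolutional layers; both details are omitted here for simplicity, while the dropout rate of 0.5 on the fully connected layers follows the original paper.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
# Single-stream AlexNet following the layer sizes listed above
# (two-GPU split and local response normalization omitted)
alexnet = Sequential([
    Conv2D(96, (11, 11), strides=4, activation='relu',
           input_shape=(227, 227, 3)),                       # conv 1 -> 55x55x96
    MaxPooling2D(pool_size=(3, 3), strides=2),               # pool 1 -> 27x27x96
    Conv2D(256, (5, 5), padding='same', activation='relu'),  # conv 2 -> 27x27x256
    MaxPooling2D(pool_size=(3, 3), strides=2),               # pool 2 -> 13x13x256
    Conv2D(384, (3, 3), padding='same', activation='relu'),  # conv 3 -> 13x13x384
    Conv2D(384, (3, 3), padding='same', activation='relu'),  # conv 4 -> 13x13x384
    Conv2D(256, (3, 3), padding='same', activation='relu'),  # conv 5 -> 13x13x256
    MaxPooling2D(pool_size=(3, 3), strides=2),               # pool 3 -> 6x6x256
    Flatten(),                                               # 9216 features
    Dense(4096, activation='relu'),                          # fully connected 1
    Dropout(0.5),
    Dense(4096, activation='relu'),                          # fully connected 2
    Dropout(0.5),
    Dense(1000, activation='softmax'),                       # 1000-way output
])
alexnet.summary()  # prints the layer output shapes for verification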
Working
- Input: A 227 × 227 × 3 image, with three channels representing the red, green, and blue components of the image, serves as the input to the AlexNet model.
- Convolutional layers: The input image is passed through a series of convolutional layers that learn to extract features from it. Each convolutional layer applies a set of filters to its input, producing a set of feature maps that represent various facets of the image.
- Pooling layers: After several of the convolutional layers, the output is passed through a pooling layer, which downsamples the feature maps by taking the maximum (or average) value over a small region of the map. This makes the feature maps smaller and the model more efficient.
- Fully connected layers: The output of the final pooling layer is flattened and fed through a series of fully connected layers that learn to classify the image into one of several categories. Each fully connected layer applies a set of weights to its input features and passes the result through an activation function to produce its output.
- Output: The model's final output is a probability distribution over the dataset's categories. The predicted class for the input image is the one with the highest probability.
- Backpropagation: During training, backpropagation is used to update the weights of the AlexNet model so as to minimize the loss function, which measures the discrepancy between the predicted output and the true labels. Training is repeated until the weights converge to a set of values that consistently yield reliable predictions on validation data; a minimal sketch of a single training step follows this list.
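As a concrete illustration of that training loop, here is a minimal sketch of the single gradient-descent step that Keras's model.fit performs internally on each batch, assuming a Keras model, one-hot labels, and categorical cross-entropy as the loss:
import tensorflow as tf
loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
def train_step(model, images, labels):
    with tf.GradientTape() as tape:
        probs = model(images, training=True)   # forward pass: class probabilities
        loss = loss_fn(labels, probs)          # discrepancy with the true labels
    # Backpropagation: gradients of the loss with respect to every weight
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
# At inference time the predicted class is the most probable one:
# predictions = tf.argmax(model(images, training=False), axis=-1)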
Applications
- Object detection: AlexNet has been used for computer vision tasks involving object detection, such as locating and classifying multiple kinds of objects in an image. This has uses in robotics, surveillance, and self-driving cars, among other fields.
- Medical image analysis: AlexNet has been utilized for medical image analysis tasks, including diagnosing diseases from X-ray, CT, and MRI images.
- Natural language processing: AlexNet has been used to extract features from text input for natural language processing applications including sentiment analysis and text categorization.
- Video analysis: AlexNet has been utilized for video analysis tasks such as action recognition and scene segmentation, analyzing video frames to extract relevant features.
- Transfer learning: AlexNet has also been used as a pre-trained model that is fine-tuned on a smaller dataset to carry out a particular task, an approach known as transfer learning. Numerous applications, such as facial recognition, emotion recognition, and speech recognition, have employed this methodology; a sketch of the pattern follows this list.
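Keras does not ship pretrained AlexNet weights, so the following sketch illustrates the fine-tuning pattern with VGG16 from tf.keras.applications standing in as the pretrained backbone; the five target classes and the 256-unit head are placeholder choices for illustration.
import tensorflow as tf
# Pretrained backbone (VGG16 as a stand-in; Keras has no built-in AlexNet)
base = tf.keras.applications.VGG16(weights='imagenet', include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained convolutional features
# New classification head, trained on the smaller task-specific dataset
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation='relu'),    # placeholder head size
    tf.keras.layers.Dense(5, activation='softmax'),   # placeholder class count
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(small_train_data, small_train_labels, epochs=...)  # fine-tune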
Implementation
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.datasets import cifar10
import numpy as np
import matplotlib.pyplot as plt
# Load the CIFAR-10 dataset
(train_data, train_labels), (test_data, test_labels) = cifar10.load_data()
# Convert labels to one-hot encoded vectors
train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=10)
test_labels = tf.keras.utils.to_categorical(test_labels, num_classes=10)
# Scale the pixel values
train_data = train_data / 255.0
test_data = test_data / 255.0
# Define the model
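# Note: this is a scaled-down, AlexNet-style CNN sized for CIFAR-10's 32x32
# images; the original AlexNet layer sizes assume 227x227 inputs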
model = Sequential([
Conv2D(32, (3,3), activation='relu', padding='same', input_shape=(32,32,3)),
Conv2D(32, (3,3), activation='relu', padding='same'),
MaxPooling2D(pool_size=(2,2)),
Dropout(0.25),
Conv2D(64, (3,3), activation='relu', padding='same'),
Conv2D(64, (3,3), activation='relu', padding='same'),
MaxPooling2D(pool_size=(2,2)),
Dropout(0.25),
Conv2D(128, (3,3), activation='relu', padding='same'),
Conv2D(128, (3,3), activation='relu', padding='same'),
MaxPooling2D(pool_size=(2,2)),
Dropout(0.25),
Flatten(),
Dense(512, activation='relu'),
Dropout(0.5),
Dense(10, activation='softmax')
])
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model and save the history object
history = model.fit(train_data, train_labels, epochs=5, batch_size=32, validation_data=(test_data, test_labels))
# Plot the training and validation loss curves
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
# Plot the training and validation accuracy curves
plt.plot(history.history['accuracy'], label='train_acc')
plt.plot(history.history['val_accuracy'], label='val_acc')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Obtained output: two matplotlib figures, the training and validation loss curves followed by the training and validation accuracy curves.
Description
- This Python script trains a Convolutional Neural Network (CNN) model on the CIFAR-10 dataset, which consists of 50,000 32 × 32 color training images and 10,000 test images labeled across 10 categories.
- The CIFAR-10 dataset is first loaded into the program, after which the labels are converted to one-hot encoded vectors and the pixel values are scaled to the range 0 to 1.
- Next, the Keras Sequential API is used to define the model architecture. The model begins with a block of two Conv2D layers with 32 filters each, followed by a MaxPooling2D layer with a pool size of 2 × 2 and a Dropout layer with a rate of 0.25.
- That block is then repeated twice, with the Conv2D layers' filter count doubling each time (64, then 128). The model is completed by a Flatten layer and two Dense layers with 512 and 10 units, the final one using a softmax activation function.
- The model is then compiled with the Adam optimizer at a learning rate of 0.001, categorical cross-entropy as the loss function, and accuracy as the metric.
- The model is trained on the training dataset for 5 epochs with a batch size of 32, and the training and validation accuracy and loss curves are plotted using matplotlib.
- This code's goal is to show how to define and train a straightforward CNN model using the Keras Sequential API, and how to plot the training and validation learning curves to monitor the model's progress.