AlexNet
AlexNet is a deep convolutional neural network architecture for image classification, introduced by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton in 2012. It was the winning submission in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a result that marked a turning point for the deep learning field.
One of AlexNet's significant contributions was the use of Rectified Linear Units (ReLU) as the activation function in the convolutional layers, which helped to mitigate the vanishing gradient problem and made it possible to train deeper networks effectively. Another innovation was the use of data augmentation methods such as random cropping and horizontal flipping, which enlarged the effective training dataset and reduced overfitting.
Since its debut, AlexNet has influenced the design of numerous later deep learning architectures, such as VGG, ResNet, and Inception, and has established itself as a standard baseline for measuring how well new models perform on challenging image classification tasks.
Architecture
The AlexNet architecture is made up of eight layers: five convolutional layers followed by three fully connected layers, the last of which produces the softmax output.
Here is a quick summary of the AlexNet architecture's layers; a runnable Keras sketch of the full stack follows the list:
- Input layer: The input layer accepts an image of size 227 × 227 × 3.
- Convolutional layer 1: The first convolutional layer has 96 filters of size 11 × 11 × 3, applied with a stride of 4. ReLU is used as the activation function, which helps to mitigate the vanishing gradient problem.
- Max pooling layer 1: A max pooling layer of size 3 × 3 with a stride of 2 is applied to the output of the first convolutional layer.
- Convolutional layer 2: The second convolutional layer has 256 filters of size 5 × 5 × 48 with a stride of 1 (the depth of 48 here, like the 192 in later layers, reflects the original network's split across two GPU streams). ReLU is the activation function employed.
- Max pooling layer 2: A max pooling layer of size 3 × 3 with a stride of 2 is applied to the output of the second convolutional layer.
- Convolutional layer 3: The third convolutional layer has 384 filters of size 3 × 3 × 256 (the full depth of the previous layer's output) with a stride of 1. ReLU is the activation function employed.
- Convolutional layer 4: The fourth convolutional layer has 384 filters of size 3 × 3 × 192 with a stride of 1. ReLU is the activation function employed.
- Convolutional layer 5: The fifth convolutional layer has 256 filters of size 3 × 3 × 192 with a stride of 1. ReLU is the activation function employed.
- Max pooling layer 3: A max pooling layer of size 3 × 3 with a stride of 2 is applied to the output of the fifth convolutional layer.
- Fully connected layer 1: The output of the third max pooling layer is flattened and passed through a fully connected layer with 4096 units. ReLU is the activation function employed.
- Fully connected layer 2: A second, 4096-unit fully connected layer is used to process the output of the first. ReLU is the activation function employed.
- Output layer: To construct a probability distribution over the 1000 categories in the ImageNet dataset, the output of the second fully connected layer is sent through a softmax layer.
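Translated into Keras, the stack above takes only a few lines. The sketch below is a single-stream approximation: the original network split its filters across two GPUs (which is why depths of 48 and 192 appear in the sizes above) and applied local response normalization after the first two convolutional layers; both details are omitted here for simplicity, while the dropout rate of 0.5 on the fully connected layers follows the original paper.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
# Single-stream AlexNet following the layer sizes listed above
# (two-GPU split and local response normalization omitted)
alexnet = Sequential([
    Conv2D(96, (11, 11), strides=4, activation='relu',
           input_shape=(227, 227, 3)),                       # conv 1 -> 55x55x96
    MaxPooling2D(pool_size=(3, 3), strides=2),               # pool 1 -> 27x27x96
    Conv2D(256, (5, 5), padding='same', activation='relu'),  # conv 2 -> 27x27x256
    MaxPooling2D(pool_size=(3, 3), strides=2),               # pool 2 -> 13x13x256
    Conv2D(384, (3, 3), padding='same', activation='relu'),  # conv 3 -> 13x13x384
    Conv2D(384, (3, 3), padding='same', activation='relu'),  # conv 4 -> 13x13x384
    Conv2D(256, (3, 3), padding='same', activation='relu'),  # conv 5 -> 13x13x256
    MaxPooling2D(pool_size=(3, 3), strides=2),               # pool 3 -> 6x6x256
    Flatten(),                                               # 9216 features
    Dense(4096, activation='relu'),                          # fully connected 1
    Dropout(0.5),
    Dense(4096, activation='relu'),                          # fully connected 2
    Dropout(0.5),
    Dense(1000, activation='softmax'),                       # 1000-way output
])
alexnet.summary()  # prints the layer output shapes for verification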
Working
- Input: A 227 × 227 × 3 image, with three channels representing the red, green, and blue components of the image, serves as the input to the AlexNet model.
- Convolutional layers: The input image is passed through a series of convolutional layers that learn to extract features from it. Each convolutional layer applies a set of filters to its input, producing a set of feature maps that represent various facets of the image.
- Pooling layers: After several of the convolutional layers, the output is passed through a pooling layer, which downsamples the feature maps by taking the maximum (or average) value over a small region of the map. This makes the feature maps smaller and the model more efficient.
- Fully connected layers: The output of the final pooling layer is flattened and fed through a series of fully connected layers that learn to classify the image into one of several categories. Each fully connected layer applies a set of weights to its input features and passes the result through an activation function to produce its output.
- Output: The model's final output is a probability distribution over the dataset's categories. The predicted class for the input image is the one with the highest probability.
- Backpropagation: During training, backpropagation is used to update the weights of the AlexNet model so as to minimize the loss function, which measures the discrepancy between the predicted output and the true labels. Training is repeated until the weights converge to a set of values that consistently yield reliable predictions on validation data; a minimal sketch of a single training step follows this list.
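As a concrete illustration of that training loop, here is a minimal sketch of the single gradient-descent step that Keras's model.fit performs internally on each batch, assuming a Keras model, one-hot labels, and categorical cross-entropy as the loss:
import tensorflow as tf
loss_fn = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
def train_step(model, images, labels):
    with tf.GradientTape() as tape:
        probs = model(images, training=True)   # forward pass: class probabilities
        loss = loss_fn(labels, probs)          # discrepancy with the true labels
    # Backpropagation: gradients of the loss with respect to every weight
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
# At inference time the predicted class is the most probable one:
# predictions = tf.argmax(model(images, training=False), axis=-1)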
Applications
- Object detection: AlexNet has been used for computer vision tasks involving object detection, such as locating and classifying multiple kinds of objects in an image. This has uses in robotics, surveillance, and self-driving cars, among other fields.
- Medical image analysis: AlexNet has been utilized for medical image analysis tasks, including diagnosing diseases from X-ray, CT, and MRI images.
- Natural language processing: AlexNet has been used to extract features from text input for natural language processing applications including sentiment analysis and text categorization.
- Video analysis: AlexNet has been utilized for video analysis tasks such as action recognition and scene segmentation, analyzing video frames to extract relevant features.
- Transfer learning: AlexNet has also been used as a pre-trained model that is fine-tuned on a smaller dataset to carry out a particular task, an approach known as transfer learning. Numerous applications, such as facial recognition, emotion recognition, and speech recognition, have employed this methodology; a sketch of the pattern follows this list.
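Keras does not ship pretrained AlexNet weights, so the following sketch illustrates the fine-tuning pattern with VGG16 from tf.keras.applications standing in as the pretrained backbone; the five target classes and the 256-unit head are placeholder choices for illustration.
import tensorflow as tf
# Pretrained backbone (VGG16 as a stand-in; Keras has no built-in AlexNet)
base = tf.keras.applications.VGG16(weights='imagenet', include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained convolutional features
# New classification head, trained on the smaller task-specific dataset
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation='relu'),    # placeholder head size
    tf.keras.layers.Dense(5, activation='softmax'),   # placeholder class count
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(small_train_data, small_train_labels, epochs=...)  # fine-tune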
Implementation
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.datasets import cifar10
import numpy as np
import matplotlib.pyplot as plt
# Load the CIFAR-10 dataset
(train_data, train_labels), (test_data, test_labels) = cifar10.load_data()
# Convert labels to one-hot encoded vectors
train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=10)
test_labels = tf.keras.utils.to_categorical(test_labels, num_classes=10)
# Scale the pixel values
train_data = train_data / 255.0
test_data = test_data / 255.0
# Define the model
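# Note: this is a scaled-down, AlexNet-style CNN sized for CIFAR-10's 32x32
# images; the original AlexNet layer sizes assume 227x227 inputs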
model = Sequential([
Conv2D(32, (3,3), activation='relu', padding='same', input_shape=(32,32,3)),
Conv2D(32, (3,3), activation='relu', padding='same'),
MaxPooling2D(pool_size=(2,2)),
Dropout(0.25),
Conv2D(64, (3,3), activation='relu', padding='same'),
Conv2D(64, (3,3), activation='relu', padding='same'),
MaxPooling2D(pool_size=(2,2)),
Dropout(0.25),
Conv2D(128, (3,3), activation='relu', padding='same'),
Conv2D(128, (3,3), activation='relu', padding='same'),
MaxPooling2D(pool_size=(2,2)),
Dropout(0.25),
Flatten(),
Dense(512, activation='relu'),
Dropout(0.5),
Dense(10, activation='softmax')
])
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model and save the history object
history = model.fit(train_data, train_labels, epochs=5, batch_size=32, validation_data=(test_data, test_labels))
# Plot the training and validation loss curves
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
# Plot the training and validation accuracy curves
plt.plot(history.history['accuracy'], label='train_acc')
plt.plot(history.history['val_accuracy'], label='val_acc')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
Obtained output: two matplotlib figures, the training and validation loss curves followed by the training and validation accuracy curves.
Description
- This Python script trains a Convolutional Neural Network (CNN) model on the CIFAR-10 dataset, which consists of 50,000 32 × 32 color training images and 10,000 test images labeled across 10 categories.
- The CIFAR-10 dataset is first loaded into the program, after which the labels are converted to one-hot encoded vectors and the pixel values are scaled to the range 0 to 1.
- Next, the Keras Sequential API is used to define the model architecture. The model begins with a block of two Conv2D layers with 32 filters each, followed by a MaxPooling2D layer with a pool size of 2 × 2 and a Dropout layer with a rate of 0.25.
- That block is then repeated twice, with the Conv2D layers' filter count doubling each time (64, then 128). The model is completed by a Flatten layer and two Dense layers with 512 and 10 units, the final one using a softmax activation function.
- The model is then compiled with the Adam optimizer at a learning rate of 0.001, categorical cross-entropy as the loss function, and accuracy as the metric.
- The model is trained on the training dataset for 5 epochs with a batch size of 32, and the training and validation accuracy and loss curves are plotted using matplotlib.
- This code's goal is to show how to define and train a straightforward CNN model using the Keras Sequential API, and how to plot the training and validation learning curves to monitor the model's progress.