VGG Net
The Visual Geometry Group at the University of Oxford proposed the VGG deep convolutional neural network architecture for image classification in 2014.
The VGG network consists of a succession of convolutional layers, max-pooling layers, and fully connected layers. It uses small (3x3) convolutional filters throughout its deep architecture of up to 19 layers.
The VGG network performed exceptionally well in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, taking first place in the localization task and second place in the classification task. It has since been widely used for a variety of image classification tasks and has become a popular benchmark in computer vision and deep learning.
History
The Visual Geometry Group at the University of Oxford in the United Kingdom developed the VGG network in 2014. The team set out to design a deep convolutional neural network architecture that could achieve state-of-the-art results on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset.
Every year, researchers from around the world compete in the ILSVRC to build the most accurate image classification system. The ILSVRC dataset contains over 1.2 million images organized into 1000 object categories.
Before the VGG network was developed, deep learning models for image classification typically used only a few convolutional layers followed by fully connected layers. The VGG network introduced a much deeper design, with up to 19 layers, while using only 3x3 convolutional filters and max-pooling layers.
In the 2014 ILSVRC competition, the VGG network performed admirably, taking first place in the localization task and second place in the classification task. Since then, it has become a well-known benchmark in deep learning and computer vision and is frequently employed for a variety of image classification tasks.
Architecture
A sequence of convolutional layers, max-pooling layers, and fully connected layers makes up the VGG network. An overview of the VGG16 variant of the architecture is provided below:
Input Layer: A 224x224-pixel color image serves as the network's input.
Convolutional Layers: The network contains 13 convolutional layers, each using 3x3 filters with a stride of 1 and padding to preserve the input's spatial resolution. The number of filters increases deeper in the network, allowing it to capture increasingly abstract features.
Max Pooling Layers: A max pooling layer with a 2x2 filter and a stride of 2 is applied after each block of two or three convolutional layers, halving the spatial resolution and enabling the network to learn more abstract information.
Fully Connected Layers: The network's final three layers are fully connected layers with 4096, 4096, and 1000 nodes respectively. The output of the final fully connected layer is passed through a softmax function to obtain class probabilities for each of the 1000 object categories in the ImageNet dataset.
Activation Function: The ReLU (Rectified Linear Unit) activation function is applied after each convolutional layer and after the first two fully connected layers.
The VGG network uses small (3x3) convolutional filters throughout and has a very deep architecture with up to 19 layers. This design has been shown to perform well on image classification tasks and has become a popular benchmark in the deep learning and computer vision communities.
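The layer configuration above can be written down compactly in Keras. The following is only an illustrative sketch of the VGG16 configuration (the Sequential style, variable names, and block bookkeeping are choices made for this example), not the original reference implementation:
# A minimal sketch of the VGG16 configuration described above
# (13 convolutional layers followed by 3 fully connected layers).
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
# (number of 3x3 conv layers, number of filters) for each of the five blocks
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
first_layer = True
for n_convs, filters in blocks:
    for _ in range(n_convs):
        if first_layer:
            # the first layer also declares the 224x224 RGB input
            model.add(Conv2D(filters, (3, 3), padding='same',
                             activation='relu', input_shape=(224, 224, 3)))
            first_layer = False
        else:
            model.add(Conv2D(filters, (3, 3), padding='same', activation='relu'))
    # 2x2 max pooling with stride 2 halves the spatial resolution after each block
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
model.add(Dense(1000, activation='softmax'))  # class probabilities for the 1000 ImageNet categories
model.summary()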
Working
The VGG (Visual Geometry Group) network processes an input image through a series of convolutional layers, max-pooling layers, and fully connected layers to produce a classification output. Its operation is described in detail below:
Input: A color image with 224x224 pixels serves as the input to the VGG network.
Convolutional Layers: The input image first passes through a total of 13 convolutional layers, each with 3x3 filters and a stride of 1. Each convolutional layer applies a set of filters to its input and produces a series of feature maps representing different aspects of the image.
ReLU Activation: Rectified Linear Unit (ReLU) activation functions are applied after each convolutional layer to inject non-linearity into the network and aid learning.
Max Pooling Layers: A max pooling layer with a 2x2 filter and a stride of 2 is applied after each block of convolutional layers. The max pooling layer halves the spatial resolution of the feature maps and reduces the number of parameters, allowing the network to learn more abstract features.
Fully Connected Layers: The output of the convolutional and max pooling layers is flattened and passed through three fully connected layers with 4096, 4096, and 1000 nodes respectively. The fully connected layers learn to combine the features extracted by the preceding layers to determine a classification. The last layer uses a softmax activation function to generate a probability distribution over the 1000 object categories in the ImageNet dataset.
Training: During training, the VGG network adjusts the weights of the filters in each layer via gradient descent optimization, based on the difference between the predicted and actual class labels. Training continues until the network reaches an acceptable level of accuracy.
Overall, the VGG network is a very deep architecture that learns to extract hierarchical representations of the input image, starting with low-level features such as edges and shapes and moving up to higher-level characteristics such as textures and object parts. The fully connected layers at the end of the network use these features to determine a classification.
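As a quick illustration of the resolution arithmetic described above (an assumed worked example, not tied to any particular library), the following snippet traces how 'same'-padded 3x3 convolutions preserve the spatial resolution while each 2x2 max pooling with stride 2 halves it:
# Trace the feature-map resolution through the five convolutional blocks of VGG16.
size = 224
for block, n_convs in enumerate([2, 2, 3, 3, 3], start=1):
    for _ in range(n_convs):
        # 3x3 convolution, stride 1, padding 1: (size - 3 + 2*1) // 1 + 1 == size
        size = (size - 3 + 2 * 1) // 1 + 1
    # 2x2 max pooling with stride 2 halves the resolution
    size //= 2
    print(f"after block {block}: {size}x{size}")
# prints 112x112, 56x56, 28x28, 14x14, 7x7; the final 7x7x512 feature maps
# are flattened and passed to the fully connected layers.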
Applications
Beyond image classification, the VGG (Visual Geometry Group) network has been applied effectively to a variety of computer vision tasks. Some of its uses include:
1. Object Detection: The VGG network has been employed as the backbone architecture in popular object detection algorithms such as Faster R-CNN and YOLO (You Only Look Once).
2. Semantic Segmentation: The VGG network has also been used for semantic image segmentation, which aims to assign each pixel in an image to a semantic category. This is accomplished by attaching a decoder network to the VGG network's output to produce a pixel-by-pixel classification.
3. Image Style Transfer: The VGG network has been used for artistic style transfer, which involves transferring the appearance of one image to another while keeping the original image's content intact. This is accomplished by optimizing the generated image to reduce the discrepancy between its VGG feature maps and those of the content and style images (see the sketch after this list).
4. Medical Image Analysis: The VGG network has been used for tasks such as disease identification and tissue segmentation in medical image analysis. For example, a modified version of the VGG network has been employed for the automatic classification of breast cancer histology images.
5. Video Analysis: The VGG network has been used for video analysis tasks such as action recognition and human pose estimation. For instance, the popular Two-Stream Convolutional Networks for Action Recognition used the VGG network as a backbone architecture.
Overall, the VGG network has been a highly influential design in computer vision and deep learning, and its uses extend well beyond image classification to a wide range of computer vision problems.
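To make the style-transfer use case above concrete, here is a minimal sketch of extracting pre-trained VGG16 feature maps and comparing two images with a simple content loss. The choice of the block4_conv2 layer and the mean-squared-error formulation are typical example choices made here, not a prescription from the original work:
import numpy as np
from keras.applications import VGG16
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

# Build a feature extractor from a pre-trained VGG16 (ImageNet weights)
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
feature_extractor = Model(inputs=base.input,
                          outputs=base.get_layer('block4_conv2').output)

def content_loss(image_a, image_b):
    # Mean squared difference between the VGG feature maps of two images
    features_a = feature_extractor.predict(preprocess_input(image_a.copy()))
    features_b = feature_extractor.predict(preprocess_input(image_b.copy()))
    return np.mean((features_a - features_b) ** 2)

# Example with two random 224x224 RGB images (batch dimension included)
img1 = np.random.uniform(0, 255, (1, 224, 224, 3)).astype('float32')
img2 = np.random.uniform(0, 255, (1, 224, 224, 3)).astype('float32')
print('content loss:', content_loss(img1, img2))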
Implementation
Dataset: CIFAR-10
Here we use the CIFAR-10 dataset with Keras to demonstrate an implementation based on the VGG network.
#import the necessary Libraries
import matplotlib.pyplot as plt
from keras.datasets import cifar10
from keras.utils import to_categorical
from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model
from keras.optimizers import Adam
# Load the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
# Preprocessing the data
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# Load the pre-trained VGG16 model
vgg_model = VGG16(weights='imagenet', include_top=False, input_shape=(32, 32, 3))
# Freezing the layers of the pre-trained model
for layer in vgg_model.layers:
    layer.trainable = False
# Add a new classifier layer for CIFAR-10 classification
flatten_layer = Flatten()(vgg_model.output)
classifier_layer = Dense(10, activation='softmax')(flatten_layer)
# Create a new model by combining the pre-trained model and the new classifier layer
model = Model(inputs=vgg_model.input, outputs=classifier_layer)
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
# Summary of the model
model.summary()
# training the model
history = model.fit(x_train, y_train, batch_size=64, epochs=5,
                    validation_data=(x_test, y_test))
# Plot the training and validation accuracy and loss
fig, axs = plt.subplots(2, 1, figsize=(8, 8))
axs[0].plot(history.history['accuracy'], label='Training Accuracy')
axs[0].plot(history.history['val_accuracy'], label='Validation Accuracy')
axs[0].set_xlabel('Epoch')
axs[0].set_ylabel('Accuracy')
axs[0].legend(loc='lower right')
axs[1].plot(history.history['loss'], label='Training Loss')
axs[1].plot(history.history['val_loss'], label='Validation Loss')
axs[1].set_xlabel('Epoch')
axs[1].set_ylabel('Loss')
axs[1].legend(loc='upper right')
plt.show()
# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)
Obtained Output:
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 32, 32, 3)] 0
block1_conv1 (Conv2D) (None, 32, 32, 64) 1792
block1_conv2 (Conv2D) (None, 32, 32, 64) 36928
block1_pool (MaxPooling2D) (None, 16, 16, 64) 0
block2_conv1 (Conv2D) (None, 16, 16, 128) 73856
block2_conv2 (Conv2D) (None, 16, 16, 128) 147584
block2_pool (MaxPooling2D) (None, 8, 8, 128) 0
block3_conv1 (Conv2D) (None, 8, 8, 256) 295168
block3_conv2 (Conv2D) (None, 8, 8, 256) 590080
block3_conv3 (Conv2D) (None, 8, 8, 256) 590080
block3_pool (MaxPooling2D) (None, 4, 4, 256) 0
block4_conv1 (Conv2D) (None, 4, 4, 512) 1180160
block4_conv2 (Conv2D) (None, 4, 4, 512) 2359808
block4_conv3 (Conv2D) (None, 4, 4, 512) 2359808
block4_pool (MaxPooling2D) (None, 2, 2, 512) 0
block5_conv1 (Conv2D) (None, 2, 2, 512) 2359808
block5_conv2 (Conv2D) (None, 2, 2, 512) 2359808
block5_conv3 (Conv2D) (None, 2, 2, 512) 2359808
block5_pool (MaxPooling2D) (None, 1, 1, 512) 0
flatten (Flatten) (None, 512) 0
dense (Dense) (None, 10) 5130
=================================================================
Total params: 14,719,818
Trainable params: 5,130
Non-trainable params: 14,714,688
_________________________________________________________________
Epoch 1/5
782/782 [==============================] - 648s 825ms/step - loss: 1.6201 -
accuracy: 0.4501 - val_loss: 1.4273 - val_accuracy: 0.5151
Epoch 2/5
782/782 [==============================] - 679s 869ms/step - loss: 1.3578 -
accuracy: 0.5375 - val_loss: 1.3368 - val_accuracy: 0.5438
Epoch 3/5
782/782 [==============================] - 677s 866ms/step - loss: 1.2899 -
accuracy: 0.5588 - val_loss: 1.2965 - val_accuracy: 0.5556
Epoch 4/5
782/782 [==============================] - 676s 864ms/step - loss: 1.2506 -
accuracy: 0.5728 - val_loss: 1.2752 - val_accuracy: 0.5598
Epoch 5/5
782/782 [==============================] - 675s 863ms/step - loss: 1.2252 -
accuracy: 0.5807 - val_loss: 1.2553 - val_accuracy: 0.5646
The VGG16-based model reaches a final validation accuracy of about 56% and a final validation loss of about 1.26 on CIFAR-10 after five epochs. In other words, with the convolutional base frozen, the model correctly classifies a little over half of the test images; fine-tuning some of the pre-trained layers or training for more epochs would likely improve this result.
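One way this result could likely be improved (a hedged sketch that reuses the model and vgg_model objects from the code above, not part of the original experiment) is to unfreeze the last convolutional block and continue training with a much smaller learning rate:
# Fine-tuning sketch: unfreeze only the block5 convolutional layers of VGG16
for layer in vgg_model.layers:
    layer.trainable = layer.name.startswith('block5')

# Recompile with a lower learning rate so the pre-trained weights change only slightly
model.compile(optimizer=Adam(learning_rate=1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=64, epochs=5,
          validation_data=(x_test, y_test))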
Description
- This code builds a VGG16-based model for image classification on the CIFAR-10 dataset using the Keras package. CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class.
- The code first loads the CIFAR-10 dataset using the built-in Keras function cifar10.load_data(). The dataset is split into a training set and a testing set, each consisting of input images and their matching labels.
- The input images are then preprocessed by normalizing their pixel values to the range [0, 1]. The labels are one-hot encoded using Keras' to_categorical() function, which transforms an integer class vector into a binary class matrix (a short example follows this list).
- The pre-trained VGG16 model is then loaded, and its layers are frozen so that they are not retrained. The following arguments are passed when calling the VGG16() function:
- weights='imagenet': This loads weights that were pre-trained on the ImageNet dataset.
- include_top=False: This specifies that the fully connected classifier layers at the top of the pre-trained model are not included.
- input_shape=(32, 32, 3): This defines the input shape as 32x32 pixels with three RGB color channels, matching the CIFAR-10 images.
- After loading the pre-trained model, a new classifier layer is added for CIFAR-10 classification. The Flatten() layer is first applied to the output of the pre-trained model to merge the feature maps into a single vector. A Dense() layer with 10 units and softmax activation is then added at the end of the network to predict the 10 classes of CIFAR-10.
- The final model is compiled with the compile() method using the Adam optimizer with a learning rate of 0.001, the categorical_crossentropy loss function, and the accuracy metric.
- The model is trained on the CIFAR-10 dataset with the fit() method, which takes the preprocessed training data and matching labels as inputs. Setting the batch_size parameter to 64 means the weights are updated after every 64 images, and the epochs parameter sets the number of passes over the training dataset. The validation_data option is used to evaluate the model on the test set during training.
- After training, the training history is plotted with the matplotlib package. The learning curves for training accuracy, validation accuracy, training loss, and validation loss are plotted against the number of epochs.
- The evaluate() method is then used to evaluate the model on the test set, and the test loss and accuracy are printed to the console.
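As noted above, here is a minimal illustration (with assumed example values) of what the to_categorical() preprocessing step does to the labels:
import numpy as np
from keras.utils import to_categorical

labels = np.array([0, 3, 9])          # three example CIFAR-10 class indices
print(to_categorical(labels, num_classes=10))
# [[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]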
Conclusion
The VGG architecture is an effective deep neural network design for image classification tasks. It consists of a sequence of convolutional layers with small receptive fields and max-pooling layers, followed by fully connected layers. The most commonly used versions of the architecture are VGG16 and VGG19, with 16 and 19 weight layers respectively.
The VGG architecture has been used extensively for computer vision tasks such as object detection, segmentation, and transfer learning. Its success is credited to its simplicity and uniform structure, which can easily be reproduced and adapted for different tasks. Its principal disadvantage, however, is its high computational cost due to its large number of parameters.
Overall, the VGG architecture remains a popular choice in the deep learning field and has inspired the development of newer, more efficient architectures.