GoogleNet
Introduction
GoogLeNet is a convolutional neural network architecture for image classification. It was created by Google researchers in 2014 and took first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) that year.
GoogLeNet is known for its deep yet wide design, with a total of 22 layers. It also introduced the idea of "inception modules": several parallel convolutional layers with different filter sizes whose outputs are combined into a single layer. This helps the network capture features at multiple scales and resolutions, improving image classification accuracy.
Another key aspect of GoogLeNet is its use of 1x1 convolutions, which reduce the number of parameters in the network without compromising performance. This is crucial for keeping the computational cost of training and inference low.
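To make this concrete, here is a back-of-the-envelope weight count comparing a direct 5x5 convolution with the same convolution preceded by a 1x1 reduction. This is only an illustrative sketch; the channel numbers are chosen to match the 5x5 branch of the first inception module in the implementation later in this article.
# Illustrative weight counts: direct 5x5 convolution vs. 1x1 reduction then 5x5.
# Channel numbers match the 5x5 branch of inception (3a) in the code below.
in_channels = 192     # channels entering the layer
reduce_to = 16        # channels after the 1x1 reduction
out_channels = 32     # channels produced by the 5x5 convolution

direct = 5 * 5 * in_channels * out_channels        # 153,600 weights
reduced = (1 * 1 * in_channels * reduce_to         # 3,072 weights
           + 5 * 5 * reduce_to * out_channels)     # + 12,800 weights
print(direct, reduced)                             # 153600 vs. 15872, roughly a 10x saving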
Overall, GoogLeNet is a highly effective and accurate deep learning architecture, and it has influenced many convolutional neural network architectures designed since.
History
In 2014, Google researchers published a paper titled "Going Deeper with Convolutions" that served as the official introduction to the GoogLeNet architecture. The paper's goal was a novel convolutional neural network architecture that increased performance while minimizing the number of parameters in the network.
As a result of its state-of-the-art performance on the ImageNet dataset, a large-scale image classification benchmark, GoogLeNet was regarded at the time as a breakthrough in deep learning and computer vision. The network achieved a top-5 error rate of 6.7%, considerably lower than the previous year's winning entry.
A key factor in the network's success was its use of inception modules, which enabled it to collect features at multiple scales and resolutions. The 1x1 convolutions were equally crucial in lowering the network's computational cost.
Since its debut, GoogLeNet has influenced other deep learning architectures, and its ideas have been applied to a variety of tasks, such as image classification, object detection, and segmentation.
Inception block
A crucial element of the GoogLeNet architecture is the inception block. The network is built from inception modules, each of which runs several parallel convolutional layers with different filter sizes. The inception modules are designed to capture features at multiple scales and resolutions, and they have been shown to increase the network's accuracy while lowering its computational cost.
Each inception module processes its input through four parallel branches: 1x1 convolutions, 3x3 convolutions, 5x5 convolutions, and max pooling. The outputs of these branches are concatenated along the depth dimension and fed into the following layer of the network.
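For instance, with the filter counts used for the first inception module in the implementation later in this article, the four branch outputs stack up as follows (a small illustrative calculation):
# Depth of the concatenated output of one inception module.
# Filter counts match the inception (3a) configuration used in the code below.
branch_1x1 = 64      # 1x1 convolution branch
branch_3x3 = 128     # 3x3 convolution branch (after its 1x1 reduction)
branch_5x5 = 32      # 5x5 convolution branch (after its 1x1 reduction)
branch_pool = 32     # 1x1 convolution applied after max pooling
print(branch_1x1 + branch_3x3 + branch_5x5 + branch_pool)   # 256 output channels
Because every branch uses 'same' padding and stride 1, the spatial size of the input (e.g. 28x28) is preserved; only the depth grows.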
By using inception modules to gather features at multiple scales and resolutions, GoogLeNet increases its accuracy on a variety of image classification tasks. The inception block has since been adopted by several other deep learning architectures, making it a common building block for convolutional neural networks.
Architecture
The GoogLeNet architecture is a deep convolutional neural network with 22 layers, in addition to the input and output layers. What distinguishes it from earlier designs is its use of inception modules: several parallel convolutional layers with different filter sizes whose outputs are combined into a single layer. The inception modules' ability to capture features at multiple scales and resolutions improves the network's image classification accuracy.
Each inception module consists of multiple parallel convolutional branches, including 1x1, 3x3, and 5x5 convolutions as well as a max pooling branch. The outputs of these branches are concatenated along the depth dimension and passed to the next layer.
In addition to the inception modules, the GoogLeNet architecture uses 1x1 convolutions to lower the number of parameters in the network without compromising performance. The 1x1 convolutions also perform dimensionality reduction before the inputs are fed into the larger 3x3 and 5x5 convolutional layers.
The final stage of the GoogLeNet architecture is a global average pooling layer, which averages the activations of the preceding layer across the spatial dimensions. This produces a single feature vector, which is fed into a fully connected layer for classification.
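As a minimal sketch of this step (assuming the 7x7x1024 feature map that the final inception module produces; the random values are purely illustrative):
import numpy as np

# Hypothetical final feature map: a 7x7 spatial grid with 1024 channels.
features = np.random.rand(7, 7, 1024)

# Global average pooling: average over the two spatial dimensions,
# leaving one scalar per channel.
pooled = features.mean(axis=(0, 1))
print(pooled.shape)   # (1024,) -- a single feature vector for the classifier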
As a whole, the GoogLeNet architecture is a highly successful deep learning model that has produced state-of-the-art results on a variety of image classification tasks.
Working
The GoogLeNet architecture produces a classification output by processing an input image through a series of convolutional layers, pooling layers, and inception modules.
Here is a brief explanation of how GoogLeNet works (a minimal forward-pass sketch follows the list):
- Input: the network accepts an RGB image of size 224x224 pixels.
- Convolutional layers: a sequence of convolutional layers processes the input image, performing feature extraction by convolving learnable filters over it. The filters in each successive layer capture increasingly abstract and complex features.
- Inception modules: the convolutional layers' output is routed through a series of inception modules, each of which contains several parallel convolutional layers with different filter sizes. The inception modules capture features at multiple scales and resolutions.
- Max pooling layers: at several points, the outputs of the inception modules are routed through max pooling layers, which downscale the feature maps and make the network more robust to small translations and deformations in the input image.
- 1x1 convolutions: within each inception module, 1x1 convolutions reduce the number of channels in the feature maps before the larger 3x3 and 5x5 convolutions are applied, improving computational efficiency.
- Fully connected layer: after the final average pooling layer, the output is flattened into a single vector and fed through a fully connected layer, which performs classification by computing a probability distribution over the output classes.
- Output: the network's final output is a probability distribution over all possible classes for the input image.
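The following is a minimal forward-pass sketch, assuming the untrained model variable built in the Implementation section below; the random input image and the resulting probabilities are purely illustrative.
import numpy as np

# 'model' is assumed to be the GoogLeNet built in the Implementation section.
# A single random 224x224 RGB image stands in for a real input.
dummy_image = np.random.rand(1, 224, 224, 3).astype('float32')

probs = model.predict(dummy_image)    # shape: (1, 1000)
print(probs.shape, probs.sum())       # a distribution over 1000 classes, summing to ~1.0
print(probs.argmax())                 # index of the most likely class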
Overall, this sequence of operations is what allows the GoogLeNet architecture to achieve state-of-the-art results on a variety of image classification tasks.
Applications
The GoogLeNet architecture has been applied to a number of computer vision tasks, such as image classification, object detection, and segmentation. Here are some examples of how GoogLeNet has been put to use:
1. Image classification: in 2014, GoogLeNet achieved state-of-the-art performance on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the benchmark dataset for image classification. Since then, the architecture has been applied to a wide range of image classification applications, such as recognizing food categories and plant species.
2. Object detection: GoogLeNet has also been used for object detection tasks, which involve locating objects within an image and predicting their classes. One example is the Google Open Images dataset, which includes more than 9 million images tagged with object labels. Object detectors based on the GoogLeNet architecture have been trained on this dataset, achieving good accuracy across a variety of object classes.
3. Semantic segmentation: GoogLeNet has also been used for semantic segmentation tasks, which involve assigning a class label to each pixel in an image. One example is the Cityscapes dataset, which contains street scenes from different cities around the world. Semantic segmentation models based on GoogLeNet have been trained on this dataset, achieving state-of-the-art performance at the time.
Overall, the GoogLeNet architecture has significantly influenced the field of computer vision and has been used in numerous applications to produce state-of-the-art results.
Implementation
A general implementation of the GoogLeNet architecture using the inception block, together with a visualization of the resulting model, follows.
Source Code
# Import the necessary libraries
from keras.layers import Input, Conv2D, MaxPooling2D, concatenate, AveragePooling2D, Flatten, Dense, Dropout
from keras.models import Model
from keras.utils import plot_model
import matplotlib.pyplot as plt
# Define the inception module
def inception_module(x, filters):
    # Branch 1: 1x1 convolution
    conv1x1 = Conv2D(filters[0], (1,1), padding='same', activation='relu')(x)
    # Branch 2: 1x1 reduction followed by 3x3 convolution
    conv3x3_reduce = Conv2D(filters[1], (1,1), padding='same', activation='relu')(x)
    conv3x3 = Conv2D(filters[2], (3,3), padding='same', activation='relu')(conv3x3_reduce)
    # Branch 3: 1x1 reduction followed by 5x5 convolution
    conv5x5_reduce = Conv2D(filters[3], (1,1), padding='same', activation='relu')(x)
    conv5x5 = Conv2D(filters[4], (5,5), padding='same', activation='relu')(conv5x5_reduce)
    # Branch 4: 3x3 max pooling followed by 1x1 convolution
    maxpool = MaxPooling2D((3,3), strides=(1,1), padding='same')(x)
    maxpool_conv = Conv2D(filters[5], (1,1), padding='same', activation='relu')(maxpool)
    # Concatenate all four branches along the channel (depth) axis
    return concatenate([conv1x1, conv3x3, conv5x5, maxpool_conv], axis=3)
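# Build the full GoogLeNet: stem convolutions, stacked inception modules, classifier head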
input_layer = Input(shape=(224,224,3))
conv1 = Conv2D(64, (7,7), strides=(2,2), padding='same', activation='relu')(input_layer)
maxpool1 = MaxPooling2D((3,3), strides=(2,2), padding='same')(conv1)
conv2_reduce = Conv2D(64, (1,1), padding='same', activation='relu')(maxpool1)
conv2 = Conv2D(192, (3,3), padding='same', activation='relu')(conv2_reduce)
maxpool2 = MaxPooling2D((3,3), strides=(2,2), padding='same')(conv2)
inception3a = inception_module(maxpool2, [64, 96, 128, 16, 32, 32])
inception3b = inception_module(inception3a, [128, 128, 192, 32, 96, 64])
maxpool3 = MaxPooling2D((3,3), strides=(2,2), padding='same')(inception3b)
inception4a = inception_module(maxpool3, [192, 96, 208, 16, 48, 64])
inception4b = inception_module(inception4a, [160, 112, 224, 24, 64, 64])
inception4c = inception_module(inception4b, [128, 128, 256, 24, 64, 64])
inception4d = inception_module(inception4c, [112, 144, 288, 32, 64, 64])
inception4e = inception_module(inception4d, [256, 160, 320, 32, 128, 128])
maxpool4 = MaxPooling2D((3,3), strides=(2,2), padding='same')(inception4e)
inception5a = inception_module(maxpool4, [256, 160, 320, 32, 128, 128])
inception5b = inception_module(inception5a, [384, 192, 384, 48, 128, 128])
avgpool = AveragePooling2D((7,7), strides=(1,1), padding='valid')(inception5b)
dropout = Dropout(0.4)(avgpool)
flatten = Flatten()(dropout)
output_layer = Dense(units=1000, activation='softmax')(flatten)
model = Model(inputs=input_layer, outputs=output_layer)
# Print a summary of the model
model.summary()
# plot_model renders the GoogLeNet graph to an image file
plot_model(model, to_file='googlenet.png', show_shapes=True, show_layer_names=False)
# Visualize the saved GoogLeNet graph image
img = plt.imread('googlenet.png')
plt.figure(figsize=(20,20))
plt.imshow(img)
plt.axis('off')
plt.show()
Obtained Output:
Model: "model_1" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== input_3 (InputLayer) [(None, 224, 224, 3 0 [] )] conv2d_114 (Conv2D) (None, 112, 112, 64 9472 ['input_3[0][0]'] ) max_pooling2d_26 (MaxPooling2D (None, 56, 56, 64) 0 ['conv2d_114[0][0]'] ) conv2d_115 (Conv2D) (None, 56, 56, 64) 4160 ['max_pooling2d_26[0][0]'] conv2d_116 (Conv2D) (None, 56, 56, 192) 110784 ['conv2d_115[0][0]'] max_pooling2d_27 (MaxPooling2D (None, 28, 28, 192) 0 ['conv2d_116[0][0]'] ) conv2d_118 (Conv2D) (None, 28, 28, 96) 18528 ['max_pooling2d_27[0][0]'] conv2d_120 (Conv2D) (None, 28, 28, 16) 3088 ['max_pooling2d_27[0][0]'] max_pooling2d_28 (MaxPooling2D (None, 28, 28, 192) 0 ['max_pooling2d_27[0][0]'] ) conv2d_117 (Conv2D) (None, 28, 28, 64) 12352 ['max_pooling2d_27[0][0]'] conv2d_119 (Conv2D) (None, 28, 28, 128) 110720 ['conv2d_118[0][0]'] conv2d_121 (Conv2D) (None, 28, 28, 32) 12832 ['conv2d_120[0][0]'] conv2d_122 (Conv2D) (None, 28, 28, 32) 6176 ['max_pooling2d_28[0][0]'] concatenate_18 (Concatenate) (None, 28, 28, 256) 0 ['conv2d_117[0][0]', 'conv2d_119[0][0]', 'conv2d_121[0][0]', 'conv2d_122[0][0]'] conv2d_124 (Conv2D) (None, 28, 28, 128) 32896 ['concatenate_18[0][0]'] conv2d_126 (Conv2D) (None, 28, 28, 32) 8224 ['concatenate_18[0][0]'] max_pooling2d_29 (MaxPooling2D (None, 28, 28, 256) 0 ['concatenate_18[0][0]'] ) conv2d_123 (Conv2D) (None, 28, 28, 128) 32896 ['concatenate_18[0][0]'] conv2d_125 (Conv2D) (None, 28, 28, 192) 221376 ['conv2d_124[0][0]'] conv2d_127 (Conv2D) (None, 28, 28, 96) 76896 ['conv2d_126[0][0]'] conv2d_128 (Conv2D) (None, 28, 28, 64) 16448 ['max_pooling2d_29[0][0]'] concatenate_19 (Concatenate) (None, 28, 28, 480) 0 ['conv2d_123[0][0]', 'conv2d_125[0][0]', 'conv2d_127[0][0]', 'conv2d_128[0][0]'] max_pooling2d_30 (MaxPooling2D (None, 14, 14, 480) 0 ['concatenate_19[0][0]'] ) conv2d_130 (Conv2D) (None, 14, 14, 96) 46176 ['max_pooling2d_30[0][0]'] conv2d_132 (Conv2D) (None, 14, 14, 16) 7696 ['max_pooling2d_30[0][0]'] max_pooling2d_31 (MaxPooling2D (None, 14, 14, 480) 0 ['max_pooling2d_30[0][0]'] ) conv2d_129 (Conv2D) (None, 14, 14, 192) 92352 ['max_pooling2d_30[0][0]'] conv2d_131 (Conv2D) (None, 14, 14, 208) 179920 ['conv2d_130[0][0]'] conv2d_133 (Conv2D) (None, 14, 14, 48) 19248 ['conv2d_132[0][0]'] conv2d_134 (Conv2D) (None, 14, 14, 64) 30784 ['max_pooling2d_31[0][0]'] concatenate_20 (Concatenate) (None, 14, 14, 512) 0 ['conv2d_129[0][0]', 'conv2d_131[0][0]', 'conv2d_133[0][0]', 'conv2d_134[0][0]'] conv2d_136 (Conv2D) (None, 14, 14, 112) 57456 ['concatenate_20[0][0]'] conv2d_138 (Conv2D) (None, 14, 14, 24) 12312 ['concatenate_20[0][0]'] max_pooling2d_32 (MaxPooling2D (None, 14, 14, 512) 0 ['concatenate_20[0][0]'] ) conv2d_135 (Conv2D) (None, 14, 14, 160) 82080 ['concatenate_20[0][0]'] conv2d_137 (Conv2D) (None, 14, 14, 224) 226016 ['conv2d_136[0][0]'] conv2d_139 (Conv2D) (None, 14, 14, 64) 38464 ['conv2d_138[0][0]'] conv2d_140 (Conv2D) (None, 14, 14, 64) 32832 ['max_pooling2d_32[0][0]'] concatenate_21 (Concatenate) (None, 14, 14, 512) 0 ['conv2d_135[0][0]', 'conv2d_137[0][0]', 'conv2d_139[0][0]', 'conv2d_140[0][0]'] conv2d_142 (Conv2D) (None, 14, 14, 128) 65664 ['concatenate_21[0][0]'] conv2d_144 (Conv2D) (None, 14, 14, 24) 12312 ['concatenate_21[0][0]'] max_pooling2d_33 (MaxPooling2D (None, 14, 14, 512) 0 ['concatenate_21[0][0]'] ) conv2d_141 (Conv2D) (None, 14, 14, 128) 65664 
['concatenate_21[0][0]'] conv2d_143 (Conv2D) (None, 14, 14, 256) 295168 ['conv2d_142[0][0]'] conv2d_145 (Conv2D) (None, 14, 14, 64) 38464 ['conv2d_144[0][0]'] conv2d_146 (Conv2D) (None, 14, 14, 64) 32832 ['max_pooling2d_33[0][0]'] concatenate_22 (Concatenate) (None, 14, 14, 512) 0 ['conv2d_141[0][0]', 'conv2d_143[0][0]', 'conv2d_145[0][0]', 'conv2d_146[0][0]'] conv2d_148 (Conv2D) (None, 14, 14, 144) 73872 ['concatenate_22[0][0]'] conv2d_150 (Conv2D) (None, 14, 14, 32) 16416 ['concatenate_22[0][0]'] max_pooling2d_34 (MaxPooling2D (None, 14, 14, 512) 0 ['concatenate_22[0][0]'] ) conv2d_147 (Conv2D) (None, 14, 14, 112) 57456 ['concatenate_22[0][0]'] conv2d_149 (Conv2D) (None, 14, 14, 288) 373536 ['conv2d_148[0][0]'] conv2d_151 (Conv2D) (None, 14, 14, 64) 51264 ['conv2d_150[0][0]'] conv2d_152 (Conv2D) (None, 14, 14, 64) 32832 ['max_pooling2d_34[0][0]'] concatenate_23 (Concatenate) (None, 14, 14, 528) 0 ['conv2d_147[0][0]', 'conv2d_149[0][0]', 'conv2d_151[0][0]', 'conv2d_152[0][0]'] conv2d_154 (Conv2D) (None, 14, 14, 160) 84640 ['concatenate_23[0][0]'] conv2d_156 (Conv2D) (None, 14, 14, 32) 16928 ['concatenate_23[0][0]'] max_pooling2d_35 (MaxPooling2D (None, 14, 14, 528) 0 ['concatenate_23[0][0]'] ) conv2d_153 (Conv2D) (None, 14, 14, 256) 135424 ['concatenate_23[0][0]'] conv2d_155 (Conv2D) (None, 14, 14, 320) 461120 ['conv2d_154[0][0]'] conv2d_157 (Conv2D) (None, 14, 14, 128) 102528 ['conv2d_156[0][0]'] conv2d_158 (Conv2D) (None, 14, 14, 128) 67712 ['max_pooling2d_35[0][0]'] concatenate_24 (Concatenate) (None, 14, 14, 832) 0 ['conv2d_153[0][0]', 'conv2d_155[0][0]', 'conv2d_157[0][0]', 'conv2d_158[0][0]'] max_pooling2d_36 (MaxPooling2D (None, 7, 7, 832) 0 ['concatenate_24[0][0]'] ) conv2d_160 (Conv2D) (None, 7, 7, 160) 133280 ['max_pooling2d_36[0][0]'] conv2d_162 (Conv2D) (None, 7, 7, 32) 26656 ['max_pooling2d_36[0][0]'] max_pooling2d_37 (MaxPooling2D (None, 7, 7, 832) 0 ['max_pooling2d_36[0][0]'] ) conv2d_159 (Conv2D) (None, 7, 7, 256) 213248 ['max_pooling2d_36[0][0]'] conv2d_161 (Conv2D) (None, 7, 7, 320) 461120 ['conv2d_160[0][0]'] conv2d_163 (Conv2D) (None, 7, 7, 128) 102528 ['conv2d_162[0][0]'] conv2d_164 (Conv2D) (None, 7, 7, 128) 106624 ['max_pooling2d_37[0][0]'] concatenate_25 (Concatenate) (None, 7, 7, 832) 0 ['conv2d_159[0][0]', 'conv2d_161[0][0]', 'conv2d_163[0][0]', 'conv2d_164[0][0]'] conv2d_166 (Conv2D) (None, 7, 7, 192) 159936 ['concatenate_25[0][0]'] conv2d_168 (Conv2D) (None, 7, 7, 48) 39984 ['concatenate_25[0][0]'] max_pooling2d_38 (MaxPooling2D (None, 7, 7, 832) 0 ['concatenate_25[0][0]'] ) conv2d_165 (Conv2D) (None, 7, 7, 384) 319872 ['concatenate_25[0][0]'] conv2d_167 (Conv2D) (None, 7, 7, 384) 663936 ['conv2d_166[0][0]'] conv2d_169 (Conv2D) (None, 7, 7, 128) 153728 ['conv2d_168[0][0]'] conv2d_170 (Conv2D) (None, 7, 7, 128) 106624 ['max_pooling2d_38[0][0]'] concatenate_26 (Concatenate) (None, 7, 7, 1024) 0 ['conv2d_165[0][0]', 'conv2d_167[0][0]', 'conv2d_169[0][0]', 'conv2d_170[0][0]'] average_pooling2d_2 (AveragePo (None, 1, 1, 1024) 0 ['concatenate_26[0][0]'] oling2D) dropout_1 (Dropout) (None, 1, 1, 1024) 0 ['average_pooling2d_2[0][0]'] flatten_1 (Flatten) (None, 1024) 0 ['dropout_1[0][0]'] dense_1 (Dense) (None, 1000) 1025000 ['flatten_1[0][0]'] ================================================================================================== Total params: 6,998,552 Trainable params: 6,998,552 Non-trainable params: 0 _______________________________________________________________________________________
Description
- This code implements the GoogLeNet architecture, a deep convolutional neural network used for image classification, in Keras.
- The model is defined with the Keras Functional API, which offers more freedom in constructing complex models.
- The inception_module function implements the inception module, the crucial building block of the GoogLeNet design.
- The function accepts a tensor and a list of six filter counts as input, and returns the concatenation of the outputs of its parallel convolution and max pooling branches.
- The input layer is a 224x224x3 tensor, which is processed through a series of convolutional and max pooling layers before being fed through the stacked inception modules.
- To reduce overfitting, the output of the final inception module is sent through an average pooling layer and then a dropout layer.
- The output is then flattened and passed through a fully connected layer with 1000 output units, one for each possible class. The plot_model function renders the model architecture and saves it as the file googlenet.png; this image is then loaded and displayed using matplotlib (see the usage notes below).
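Note that plot_model requires the pydot package and the Graphviz binaries to be installed. Also, the code above only defines the architecture; to train it, the model must first be compiled with a loss and an optimizer. Below is a minimal sketch, assuming preprocessed ImageNet-style data in the placeholder variables x_train (images of shape (N, 224, 224, 3)) and y_train (one-hot labels of shape (N, 1000)); the optimizer settings are illustrative, not the exact schedule from the original paper.
from keras.optimizers import SGD

# Compile the model -- a hedged sketch, not the paper's exact training recipe.
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# x_train and y_train are assumed placeholders for preprocessed data.
# model.fit(x_train, y_train, batch_size=32, epochs=10)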
Conclusion
GoogLeNet is a deep convolutional neural network architecture created for the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an image classification competition. It has 22 layers and employs a mix of convolutional layers, pooling layers, and inception modules to reduce the number of parameters and improve the network's accuracy. The inception module, a crucial element of GoogLeNet, enabled the network to outperform competing designs of the time with fewer parameters. The GoogLeNet architecture served as inspiration for numerous other models and helped extend the capabilities of deep learning for computer vision tasks.