LeNet
Convolutional neural networks (CNNs) are a class of deep learning models designed to process and analyze images. Yann LeCun and his colleagues developed the LeNet family of CNN architectures in the late 1980s and 1990s.
LeNet was one of the first successful applications of CNNs to image recognition. The architecture passes the input image through a sequence of layers that turn it into a probability distribution over the classes. It consists mainly of convolutional layers, pooling layers, and fully connected layers.
The convolutional layers learn features from the input image. Each one slides a set of filters over its input and produces one feature map per filter. The pooling layers shrink the feature maps while keeping the most important information. The output of the convolutional and pooling layers is fed into fully connected layers, which produce a probability distribution over the classes.
LeNet was originally developed for handwritten digit recognition and is most often demonstrated on the MNIST dataset. Although many other CNN architectures have been developed since, LeNet remains a milestone in the development of deep learning for image recognition.
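To make the two operations concrete, here is a minimal pure-Python sketch of a convolution without padding followed by 2x2 pooling. The 6x6 image, the 3x3 edge filter, and the helper names conv2d_valid and avg_pool_2x2 are illustrative assumptions, not part of LeNet itself, and average pooling is just one possible pooling choice.

```python
# Minimal sketch: a "valid" 2D convolution (no padding, stride 1)
# followed by 2x2 average pooling. All values here are illustrative.

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def avg_pool_2x2(fmap):
    """Average each non-overlapping 2x2 block, halving height and width."""
    return [
        [
            (fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4
            for c in range(0, len(fmap[0]) - 1, 2)
        ]
        for r in range(0, len(fmap) - 1, 2)
    ]

# 6x6 image with a vertical edge: left half dark (0), right half bright (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# 3x3 filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)  # 4x4, each row is [0, 3, 3, 0]
pooled = avg_pool_2x2(feature_map)         # 2x2, every value is 1.5
```

The feature map is large only where the filter overlaps the edge, and pooling keeps that response while halving each dimension.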
History
Yann LeCun and colleagues at AT&T Bell Laboratories created the convolutional neural network (CNN) architecture known as LeNet in the early 1990s. LeCun set out to build a neural network that could recognize handwritten digits, such as those in zip codes, from scanned images of mail.
LeNet was a significant development in computer vision and one of the earliest successful applications of CNNs to image recognition. Before LeNet, most computer vision systems relied on hand-crafted feature extractors, which were time-consuming to design and required domain-specific knowledge.
LeNet was designed to operate on grayscale images with a resolution of 32x32 pixels.
LeNet was applied to handwritten digit recognition on the MNIST dataset, where it achieved state-of-the-art performance at the time.
Architecture
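LeNet-5 has seven layers, not counting the input: three convolutional layers, two subsampling (pooling) layers, one fully connected layer, and the output layer. Because the filters of the third convolutional layer cover its entire 5x5 input, that layer behaves like a fully connected layer and is often described as one, as in the list below. The network's input is a 32x32 pixel grayscale image.
The layers of LeNet are as follows: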
- Input layer: This layer takes a 32x32 pixel grayscale image as input.
- Convolutional layer 1: The first convolutional layer applies six 5x5 filters to the input image, producing six 28x28 feature maps. Each filter is slid across the full image, and its feature map records where a particular pattern occurs.
- Subsampling layer 1: The first subsampling layer pools the feature maps from the first convolutional layer over 2x2 regions, halving their width and height to 14x14. The original LeNet-5 used an average-based subsampling; many modern reimplementations use max pooling instead.
- Convolutional layer 2: This layer applies sixteen 5x5 filters to the subsampled feature maps, producing sixteen 10x10 feature maps.
- Subsampling layer 2: The second subsampling layer pools over 2x2 regions of the feature maps again, halving their width and height to 5x5.
- Fully connected layer 1: This layer maps the flattened output of the second subsampling layer (16 x 5 x 5 = 400 values) to a vector of 120 units.
- Fully connected layer 2: This layer maps the output of the previous fully connected layer to a vector of 84 units.
- Output layer: The output layer maps the output of the second fully connected layer to a vector of n units, where n is the number of classes in the dataset. Its output represents the probability distribution over the classes.
LeNet has been applied to a variety of image recognition tasks beyond the handwritten digit recognition for which it was originally developed.
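The feature-map sizes implied by the layer list can be checked with a little arithmetic. The sketch below assumes convolutions without padding and with stride 1, and non-overlapping 2x2 pooling; the helper names conv_out and pool_out are made up for this illustration.

```python
# Trace the spatial size of the feature maps through the layers listed above.
# Assumes no padding, stride 1 for convolutions, and 2x2 pooling.

def conv_out(size, kernel, stride=1):
    """Width/height after a convolution with no padding."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    """Width/height after non-overlapping pooling."""
    return size // window

size = 32                  # 32x32 input image
c1 = conv_out(size, 5)     # 28 -> six 28x28 feature maps
s1 = pool_out(c1)          # 14 -> six 14x14 feature maps
c2 = conv_out(s1, 5)       # 10 -> sixteen 10x10 feature maps
s2 = pool_out(c2)          # 5  -> sixteen 5x5 feature maps
flattened = s2 * s2 * 16   # 400 values fed to the 120-unit layer
```

Starting from 28x28 MNIST images instead of 32x32, the same arithmetic gives 24, 12, 8, and 4, so the flattened vector has 4 x 4 x 16 = 256 values.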
Working
LeNet is a convolutional neural network architecture for image processing and analysis.
LeNet operates as follows:
1. Input layer: LeNet receives a 32x32 pixel grayscale image. The pixel values are normalized, for example to the range 0 to 1.
2. Convolutional layers: Each convolutional layer applies its own set of filters to its input to create a set of feature maps. The convolutional layers in LeNet-5 all use 5x5 filters: six in the first layer and sixteen in the second.
3. Pooling layers: A pooling layer follows each of the first two convolutional layers. It reduces the size of the feature maps by summarizing each 2x2 region with a single value: the average in the original LeNet-5, or the maximum in many modern variants.
4. Fully connected layers: After being flattened, the output of the final pooling layer is fed into two fully connected layers. The first has 120 units and the second has 84 units. Each fully connected layer is followed by a non-linear activation function (the original LeNet used the hyperbolic tangent).
5. Output layer: A final fully connected layer with n units, where n is the number of classes in the dataset, maps the 84-unit vector to the output. Its output represents the probability distribution over the classes.
6. Training: LeNet is trained with backpropagation and stochastic gradient descent. The weights of the convolutional and fully connected layers are adjusted to reduce the loss between the predicted and true labels; modern implementations typically use the cross-entropy loss.
Beyond its primary purpose of reading handwritten digits, LeNet has been applied successfully to other image recognition tasks.
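Steps 5 and 6 can be illustrated with a short pure-Python sketch of softmax and cross-entropy. The three-class logits are invented for the example, and softmax is assumed here as the way raw output scores are turned into probabilities, as is usual in modern implementations.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_class])

logits = [2.0, 1.0, 0.1]                 # illustrative scores for 3 classes
probs = softmax(logits)                  # highest probability for class 0
loss_if_class_0 = cross_entropy(probs, 0)  # small: the network agrees
loss_if_class_2 = cross_entropy(probs, 2)  # larger: the network disagrees
```

Training lowers this loss by nudging the weights so that the true class receives more probability.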
Applications
LeNet has been applied successfully to a variety of image recognition tasks, including:
1. Handwritten digit recognition: LeNet was originally created to identify handwritten digits such as those in the MNIST dataset. It was state of the art when introduced, although more advanced CNN architectures have since overtaken it. It remains a popular baseline for evaluating new image recognition methods.
2. Optical character recognition (OCR): LeNet has also been used for OCR, which identifies printed characters in documents and photos. OCR is employed in a number of sectors, such as government, healthcare, and banking.
3. Object recognition: LeNet has been used to recognize objects in photos, such as people, animals, and cars. It has been applied to datasets including CIFAR-10 and CIFAR-100.
4. Medical image analysis: LeNet has been used for medical imaging tasks such as spotting anomalies in chest X-rays and recognizing malignant cells in histology images.
5. Traffic sign recognition: LeNet has been used to identify traffic signs in images and video, an application relevant to autonomous vehicles and intelligent transportation systems.
6. Robotics: LeNet has been applied to robotics tasks such as object detection for grasping and manipulation.
Overall, LeNet has proven to be a flexible and efficient architecture for a variety of image recognition tasks.
Implementation
I've released some code that uses Keras, a high-level deep learning API, to build the LeNet-5 architecture. Keras is a Python library that offers an easy-to-use interface for creating and training deep learning models. It has supported several backends over its history, including TensorFlow, Theano, and CNTK; the code below uses the version bundled with TensorFlow (tensorflow.keras).
Dataset: MNIST
Step 1:
Use the mnist.load_data() function built into Keras to load the MNIST dataset. The dataset consists of 28x28 grayscale images of handwritten digits (0–9) and their corresponding labels.
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, AveragePooling2D, Flatten, Dense
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
Step 2:
Divide the pixel values by 255 to normalize them to the range 0 to 1. This preprocessing step is common in computer vision problems.
# Normalize pixel values to be between 0 and 1
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
Step 3:
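Reshape the images so that each one has an explicit single grayscale channel, as the Conv2D layers expect. Note that this code keeps MNIST's native 28x28 size rather than the 32x32 input of the original LeNet-5.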
# Reshape the images to be 28x28 with a single grayscale channel
x_train = x_train.reshape((x_train.shape[0], 28, 28, 1))
x_test = x_test.reshape((x_test.shape[0], 28, 28, 1))
Step 4:
Use the to_categorical() function of Keras to one-hot encode the labels. Each integer label (0–9) becomes a binary vector of length 10 in which every entry is 0 except the one at the index of the true label, which is 1.
# One-hot encode the labels
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
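For readers who want to see what one-hot encoding produces, here is a small pure-Python equivalent for a single label. The function name one_hot is made up for this illustration; it is not the Keras implementation, only a sketch of the same idea.

```python
def one_hot(label, num_classes=10):
    """Return a vector of zeros with a single 1.0 at the label's index."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

encoded = one_hot(3)
# encoded is [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

This format matches the 10-unit softmax output of the network, so the loss can compare the two vectors entry by entry.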
Step 5:
Use the Sequential model API of Keras to define the LeNet-5 architecture: two convolutional layers, each followed by an average pooling layer, then two fully connected layers and an output layer. This implementation uses ReLU activations in the hidden layers instead of the hyperbolic tangent used in the original network. The output layer uses the softmax activation function to generate a probability distribution over the 10 possible output classes.
# Define the LeNet-5 model
model = Sequential()
# Layer 1
model.add(Conv2D(filters=6, kernel_size=(5, 5), activation='relu', input_shape=(28, 28, 1)))
model.add(AveragePooling2D())
# Layer 2
model.add(Conv2D(filters=16, kernel_size=(5, 5), activation='relu'))
model.add(AveragePooling2D())
# Flatten
model.add(Flatten())
# Layer 3
model.add(Dense(units=120, activation='relu'))
# Layer 4
model.add(Dense(units=84, activation='relu'))
# Output layer
model.add(Dense(units=10, activation='softmax'))
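As a sanity check on the model definition, the number of trainable parameters can be worked out by hand. The sketch below assumes the layer sizes used in the code above (full connectivity between layers, one bias per filter or unit); the helper names are made up for the illustration.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units

def lenet_param_count(input_size):
    """Parameter count for the conv-pool-conv-pool-dense stack defined above."""
    after_conv1 = input_size - 4    # 5x5 filters, no padding
    after_pool1 = after_conv1 // 2  # 2x2 pooling
    after_conv2 = after_pool1 - 4
    after_pool2 = after_conv2 // 2
    flat = after_pool2 * after_pool2 * 16
    return (
        conv_params(5, 1, 6)
        + conv_params(5, 6, 16)
        + dense_params(flat, 120)
        + dense_params(120, 84)
        + dense_params(84, 10)
    )

params_28 = lenet_param_count(28)  # 44426 for 28x28 MNIST input
params_32 = lenet_param_count(32)  # 61706 for 32x32 input
```

Calling model.summary() on the Keras model should report the same total as params_28, which makes it a quick way to confirm that the layers are wired as intended.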
Step 6:
Compile the model with the Adam optimizer and the categorical cross-entropy loss, which is widely used for classification tasks.
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Step 7:
Train the model on the training set for 5 epochs with a batch size of 32, validating it on the test set after each epoch.
# Train the model
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_data=(x_test, y_test))
Obtained Output:
Epoch 1/5
1875/1875 [==============================] - 31s 16ms/step - loss: 0.2322 - accuracy: 0.9292 - val_loss: 0.0806 - val_accuracy: 0.9764
Epoch 2/5
1875/1875 [==============================] - 29s 15ms/step - loss: 0.0740 - accuracy: 0.9769 - val_loss: 0.0532 - val_accuracy: 0.9830
Epoch 3/5
1875/1875 [==============================] - 31s 16ms/step - loss: 0.0537 - accuracy: 0.9831 - val_loss: 0.0395 - val_accuracy: 0.9882
Epoch 4/5
1875/1875 [==============================] - 28s 15ms/step - loss: 0.0422 - accuracy: 0.9870 - val_loss: 0.0336 - val_accuracy: 0.9903
Epoch 5/5
1875/1875 [==============================] - 29s 16ms/step - loss: 0.0343 - accuracy: 0.9891 - val_loss: 0.0372 - val_accuracy: 0.9886
<keras.callbacks.History at 0x7fb214640100>
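The step counts in the progress bars follow from the dataset sizes and the batch size. MNIST has 60,000 training images and 10,000 test images, so a quick check looks like this:

```python
import math

batch_size = 32
train_steps = math.ceil(60000 / batch_size)  # 1875 batches per training epoch
test_steps = math.ceil(10000 / batch_size)   # 313 batches (the last one is partial)
```

The 1875 matches the training progress bar, and the 313 is what appears when the 10,000 test images are processed in batches of 32.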
Step 8:
Evaluate the model on the test set to obtain the final loss and accuracy, then generate predictions and plot a random sample of nine test images with their predicted and true labels.
# Evaluate the model on the test set
loss, accuracy = model.evaluate(x_test, y_test)
# Generate predictions on the test set
predictions = model.predict(x_test)
# Plot a random sample of test images with their predicted labels
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(6, 6))
fig.subplots_adjust(hspace=0.6, wspace=0.3)
for i, ax in enumerate(axes.flat):
    # Pick a random test image and show it with its predicted and true labels
    idx = np.random.randint(0, len(x_test))
    ax.imshow(x_test[idx].reshape(28, 28), cmap='gray')
    ax.set_title(f"Predicted: {np.argmax(predictions[idx])}\nTrue: {np.argmax(y_test[idx])}")
    ax.axis('off')
plt.show()
Obtained Output:
313/313 [==============================] - 2s 7ms/step - loss: 0.0372 - accuracy: 0.9886
313/313 [==============================] - 2s 5ms/step
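The accuracy figure can also be derived from the predictions by comparing the most probable class with the true class for each image. The sketch below uses a made-up three-class toy example rather than the real MNIST predictions, and the helper names are illustrative.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(pred_probs, one_hot_labels):
    """Fraction of samples whose most probable class matches the true class."""
    correct = sum(
        argmax(p) == argmax(t) for p, t in zip(pred_probs, one_hot_labels)
    )
    return correct / len(one_hot_labels)

# Toy data: 4 samples, 3 classes (values invented for illustration).
toy_preds = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
toy_labels = [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]

toy_accuracy = accuracy(toy_preds, toy_labels)  # 0.5: two of four are correct
```

Applied to the predictions and y_test arrays from the script, the same comparison should give the accuracy that model.evaluate reports.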
LeNet-5 is a convolutional neural network architecture designed for recognizing handwritten digits. Presented by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner in 1998, it was one of the earliest successful deep learning solutions to a computer vision problem.
The LeNet-5 architecture has seven layers: two convolutional layers, two average pooling layers, and three fully connected layers, counting the output layer. The original network used the hyperbolic tangent activation function in its hidden layers; modern implementations such as the one above often use ReLU, with the softmax function in the output layer. Although small and simple compared with more recent deep learning models, the architecture produced state-of-the-art results on the MNIST dataset for its time.
LeNet-5 is still regarded as a milestone in the history of deep learning because it paved the way for more sophisticated and powerful deep learning models for computer vision.