How to Make an Image Recognition AI

October 24, 2024

One of the most powerful uses of Artificial Intelligence today is image recognition. Broadly speaking, image recognition models fall into three subclasses:

Image classification
Object Detection
Image segmentation

Before diving into the specifics of each of these, let us first understand how a computer (which only understands 0s and 1s) comprehends an image.

Digital Images to Binary

Images are made up of pixels, each of which is a very small illuminated area. You might have heard of an image being 1080p: this means the image is 1080 pixels tall (typically 1920 x 1080, or about two million pixels in total). Generally, the more pixels an image has, the more detail it can hold.

Now let us take a black and white picture. It consists of two colours, so each pixel is either black or white. Thus for each pixel we can say it is 0 (black) or 1 (white). This is how a black and white image is represented in binary. But what about coloured images? After all, we aren't in the 90s!

We will apply the same logic, but instead of each pixel having one value (0 or 1), we assign a vector to each pixel. This vector can hold the Red-Green-Blue (RGB) values, because every colour on a screen can be represented by a combination of these three colours, each typically stored as a number from 0 to 255.
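
To make this concrete, here is a minimal sketch using NumPy (an assumption; the article has not picked a library at this point). The tiny arrays are made-up illustrations, not real images.

```python
import numpy as np

# A 3x3 black-and-white image: 0 = black, 1 = white (made-up values).
binary_image = np.array([
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
], dtype=np.uint8)

# A 2x2 colour image: every pixel is a vector [R, G, B], each 0..255.
rgb_image = np.array([
    [[255, 0, 0], [0, 255, 0]],      # red, green
    [[0, 0, 255], [255, 255, 255]],  # blue, white
], dtype=np.uint8)

print(binary_image.shape)  # (3, 3): height x width
print(rgb_image.shape)     # (2, 2, 3): height x width x channels
print(rgb_image[0, 0])     # the top-left pixel, pure red

# A Full HD (1080p) frame is 1920 pixels wide and 1080 pixels tall.
full_hd_pixels = 1920 * 1080
print(full_hd_pixels)      # 2073600
```

The extra axis of length 3 is the per-pixel RGB vector described above.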

Now we know how to assign numbers to an image. In the next section we will look at how a machine makes sense of the image from those numbers.

Numbers to Inference

Let us talk about our favorite subject, Machine Learning! At their core, images are just raw data: numbers assigned to pixels. To get something out of these, we need to follow certain steps:

Data Preprocessing- However boring it is, preprocessing is essential if we ever want a robust image recognition system. Preprocessing can include pixel normalization: raw pixel values (usually 0 to 255) are rescaled into a small fixed range, say 0 to 1 or -1 to 1, so that all inputs share the same scale. Preprocessing also includes greyscaling, augmentation, etc., depending on the use case.
Model Development- Depending on the task at hand, there is a host of models we can develop, and it is as easy as 1, 2, 3. Broadly, model development means creating a neural network that can process the raw data, and no, it is not the same as a plain feedforward NN. Let us revisit this part in the next section.
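
The preprocessing operations named above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a fixed recipe: the helper names (normalize, to_grayscale, flip_horizontal) are hypothetical, the image is made up, and the 0 to 1 target range is one common choice among several.

```python
import numpy as np

def normalize(image):
    """Scale 8-bit pixel values from 0..255 into the range 0..1."""
    return image.astype(np.float32) / 255.0

def to_grayscale(rgb_image):
    """Collapse an RGB image to one channel with a weighted sum.

    The weights are the common ITU-R BT.601 luma coefficients.
    """
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return rgb_image.astype(np.float32) @ weights

def flip_horizontal(image):
    """A simple augmentation: mirror the image left to right."""
    return image[:, ::-1]

# A made-up 2x2 RGB image for illustration.
image = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.uint8)

scaled = normalize(image)        # same shape, values in 0..1
gray = to_grayscale(image)       # shape (2, 2), one value per pixel
flipped = flip_horizontal(image) # columns swapped
```

Which of these steps to apply, and in what order, depends on the use case and on what the chosen model expects as input.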

And voila! At the output of the model, your phone gets unlocked or the FBI catches a criminal!

Convolutional Neural Networks


As I stated, a model able to understand our pixels is not the same as a normal neural network. A feedforward neural network treats each input independently, so it does not factor in the spatial relationships between pixels. You don't want your eyebrows alone to unlock your phone, right? (facial recognition pun)

Also, since a normal image consists of thousands of pixels (multiply by 3 for coloured images), a feedforward neural network would need one weight per pixel value for every neuron in its first layer, which is computationally very expensive.
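
A quick back-of-the-envelope calculation shows the scale of the problem. The image size and layer sizes below are assumptions chosen only for illustration.

```python
# Input: a 100x100 colour image (the size is an assumption for illustration).
height, width, channels = 100, 100, 3
inputs = height * width * channels  # 30,000 input values

# Fully connected layer with 1,000 neurons (hypothetical size):
# every neuron has one weight per input plus one bias.
dense_neurons = 1000
dense_params = inputs * dense_neurons + dense_neurons

# Convolutional layer with 64 filters of size 3x3 (hypothetical size):
# each filter has 3*3*channels weights plus one bias, reused across the image.
filters, kernel = 64, 3
conv_params = filters * (kernel * kernel * channels) + filters

print(dense_params)  # 30001000
print(conv_params)   # 1792
```

The dense layer needs about thirty million parameters for a small image, while the convolutional layer needs fewer than two thousand regardless of the image size.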

These problems were addressed by building the model architecture around a different operation: convolution. Convolution is a mathematical operation used to extract features from the input data by applying a filter (also known as a kernel). It involves sliding the filter (a small matrix of weights) over the input image and computing the dot product between the filter and each overlapping region of the image. This produces a feature map that highlights specific features of the input image, such as edges or textures. Because the same small filter is reused across the whole image, this process, also called feature extraction, captures the relevant information while greatly reducing the number of parameters required.
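
The sliding-window computation can be written out directly. This is a minimal NumPy sketch with no padding and a stride of 1; the function name convolve2d, the image and the filter values are illustrative assumptions. Like deep learning libraries, it does not flip the kernel (strictly speaking, this is cross-correlation).

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Dot product between the filter and the overlapping region.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Made-up 4x6 image: dark on the left, bright on the right.
image = np.array([
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=np.float32)

# A simple vertical-edge filter.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=np.float32)

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[0. 3. 3. 0.]
#  [0. 3. 3. 0.]]
```

The feature map is zero over the flat regions and large where the window straddles the dark-to-bright boundary, which is exactly the "edge" this filter looks for.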

Another important ingredient of Convolutional Neural Networks is the pooling layer. Pooling downsamples the output of the convolution to further reduce the amount of data, and therefore the number of parameters in later layers. Say you have a 3x3 matrix: in Max Pooling (a type of pooling) you would represent this 3x3 matrix by a single number, the maximum of all 9 numbers; in Average Pooling we take the average instead.
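
Here is a small sketch of both pooling variants in NumPy. It uses non-overlapping 2x2 windows rather than the 3x3 window of the example above, simply because 2x2 is a very common choice; the function name pool2d and the input values are assumptions for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Downsample by taking the max (or mean) of each non-overlapping size x size block."""
    h, w = feature_map.shape
    out_h, out_w = h // size, w // size
    # Drop any leftover rows/columns that do not fill a whole block.
    trimmed = feature_map[:out_h * size, :out_w * size]
    blocks = trimmed.reshape(out_h, size, out_w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
], dtype=np.float32)

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)  # [[4. 2.] [2. 8.]]
print(avg_pooled)  # [[2.5 1.] [1.25 6.5]]
```

A 4x4 feature map becomes 2x2, so every later layer has four times fewer values to process.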

That's all: repeat these two, combine them with activation functions and a fully connected layer, and we have successfully processed our data. But what exactly is this output? We will get to that shortly.

Summary Workflow in a Convolutional Neural Network- Model Architecture

Convolution: Apply convolutional filters to extract features.
Activation: Apply a non-linear activation function (e.g., ReLU).
Pooling: Downsample the feature maps to reduce spatial dimensions.
Normalization: (Optional) Apply normalization techniques like Batch Normalization.
Stacking Layers: Repeat convolution, activation, and pooling multiple times.
Flattening: Convert the 2D feature maps into a 1D vector.
Fully Connected Layers: Perform high-level reasoning with dense layers.
Output Layer: Generate final predictions with an appropriate activation function.

Understanding the output of a CNN

Before we try to understand the output we need to delve into the types of problems in computer vision. Primarily there are 3 image recognition tasks which we mentioned above. Let us look at each one in detail.

Image Classification- It consists of classifying images, say identifying whether an image contains a horse or a zebra.
Object Detection- It consists of detecting all or some of the objects in an image and typically drawing bounding boxes around them, say the number plate on a car.
Image segmentation- Unlike object detection, which creates a bounding box around the detected object, image segmentation involves cutting out the entire object pixel by pixel and isolating it.

The output of all three is, of course, numbers. For image classification, the numbers are the probabilities of the image belonging to each class. For object detection and segmentation, the output is the location of the object, either as a bounding box or pixel by pixel. This is at the core of image recognition technology.
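
The three kinds of output can be illustrated with plain arrays. Everything below is made up for illustration (the class names, the raw scores, the box and the mask); the softmax function is the standard way to turn raw scores into probabilities.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Classification: one probability per class.
class_names = ["horse", "zebra"]    # hypothetical labels
raw_scores = np.array([1.0, 3.0])   # made-up network outputs
probabilities = softmax(raw_scores)
predicted = class_names[int(np.argmax(probabilities))]
print(probabilities, predicted)

# Object detection: one box as [xmin, ymin, xmax, ymax] in pixels.
box = [40, 60, 120, 100]

# Segmentation: one class label per pixel (0 = background, 1 = object).
mask = np.array([
    [0, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
])
```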

Image Recognition with TensorFlow

Let us use deep learning models for image recognition and build our own image recognition application. We will use TensorFlow with the Python programming language.

How to Classify Cats and Dogs Using CNNs in Python?

Enough theory, let us get our hands dirty with a good problem. For this example we will do cat and dog classification with this training dataset.

Step 1: Set Up Your Development Environment

Before we begin, ensure you have Python and TensorFlow installed on your system. You can install TensorFlow using pip:

pip install tensorflow

Step 2: Collect the Dataset

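Extract the dataset to a directory named “dataset” in your project folder, with one subfolder per category (“cat” and “dog”), which is the layout the code in the next step expects.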

Step 3: Prepare the Data

We need to preprocess the images before training the AI model. Create a Python script named prepare_data.py with the following code, which loads each image in greyscale, resizes it, and collects the images and labels into arrays:

import os
import random

import cv2
import numpy as np

data_directory = "dataset"
categories = ["cat", "dog"]
img_size = 100
training_data = []

def create_training_data():
    for category in categories:
        path = os.path.join(data_directory, category)
        class_num = categories.index(category)  # 0 = cat, 1 = dog
        for img in os.listdir(path):
            img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
            if img_array is None:
                continue  # cv2.imread returns None for files it cannot read
            new_array = cv2.resize(img_array, (img_size, img_size))
            training_data.append([new_array, class_num])

create_training_data()

# Shuffle so that cats and dogs are mixed before training
random.shuffle(training_data)

X = []
y = []

for features, label in training_data:
    X.append(features)
    y.append(label)

# Shape: (number of images, height, width, 1 greyscale channel)
X = np.array(X).reshape(-1, img_size, img_size, 1)
y = np.array(y)

Step 4: Build the AI Model

Create a Python script named image_classifier.py and add the following code to build the AI model that we will train on the images:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D, Flatten

# X and y come from Step 3 (assumes prepare_data.py sits in the same folder)
from prepare_data import X, y

model = Sequential()

# Two convolution + pooling stages extract features
model.add(Conv2D(64, (3, 3), input_shape=X.shape[1:], activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Flatten the feature maps and classify with dense layers
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
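
To see what this architecture does to the data, the following sketch traces the shapes and parameter counts by hand. It assumes the 100x100 greyscale input from Step 3 and the Keras defaults (no padding for Conv2D, a pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv_output(size, kernel):
    """Output width/height of a convolution with no padding and stride 1."""
    return size - kernel + 1

def pool_output(size, pool):
    """Output width/height of non-overlapping pooling (leftovers dropped)."""
    return size // pool

size, channels = 100, 1  # 100x100 greyscale input
params = []

# Conv2D(64, (3, 3)) followed by MaxPooling2D((2, 2)), twice.
for filters in (64, 64):
    params.append(filters * (3 * 3 * channels) + filters)
    size = pool_output(conv_output(size, 3), 2)
    channels = filters

flattened = size * size * channels  # length of the vector after Flatten
params.append(flattened * 64 + 64)  # Dense(64)
params.append(64 * 1 + 1)           # Dense(1)

print(size, flattened)  # 23 33856
print(sum(params))      # 2204481
```

Almost all of the parameters sit in the first dense layer, which is one reason the convolution and pooling stages shrink the feature maps before flattening. Calling model.summary() in Keras prints the same kind of table for the real model.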

Step 5: Train the Model

Now, let’s train the AI model using the prepared data:

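model.fit(X, y, batch_size=32, epochs=10, validation_split=0.1)

# Save the trained model so that it can be loaded for testing
model.save("image_classifier.h5")

Step 6: Test the Model

To test the model, create a Python script named test_model.py and use the following code:

import cv2
import tensorflow as tf

# These must match the values used in prepare_data.py
categories = ["cat", "dog"]
img_size = 100

def prepare(filepath):
    img_array = cv2.imread(filepath, cv2.IMREAD_GRAYSCALE)
    new_array = cv2.resize(img_array, (img_size, img_size))
    return new_array.reshape(-1, img_size, img_size, 1)

model = tf.keras.models.load_model("image_classifier.h5")
prediction = model.predict(prepare("test_image.jpg"))

# The sigmoid output is a probability; round it to get the class index (0 or 1)
print(categories[int(round(float(prediction[0][0])))])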
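
The single sigmoid output is a probability between 0 and 1 for the second category. A small helper (hypothetical, not part of the scripts above) makes the mapping from probability to label explicit and also reports how confident the model is:

```python
def label_from_probability(probability, categories=("cat", "dog"), threshold=0.5):
    """Map a sigmoid output to a class name and a confidence for that class."""
    if probability >= threshold:
        return categories[1], probability
    return categories[0], 1.0 - probability

# Made-up probabilities for illustration.
print(label_from_probability(0.92))
print(label_from_probability(0.10))
```

Simply truncating the probability to an integer would map almost every prediction to index 0, which is why a threshold (or rounding) is needed.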

That is that: we have successfully built a model that can classify cats and dogs!

Limitations of Dog/Cat Classifier

Limited Training Data:
- If the dataset is too small or lacks diversity, the model may not generalize well to new, unseen images.
Imbalanced Data:
- If the number of dog images significantly exceeds the number of cat images (or vice versa), the model may become biased toward the more prevalent class.
Quality of Data:
- Poor quality images (blurry, low resolution) can adversely affect the model's performance.
Annotation Errors:
- Incorrectly labeled images can mislead the model during training.
Variability in Breeds:
- Significant variability in breeds and appearances of cats and dogs can challenge the classifier's accuracy.
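
One common way to soften the imbalanced-data problem is to weight each class inversely to its frequency. The sketch below is one possible approach under assumptions, not the only remedy; the label list is made up and the function name class_weights is hypothetical.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to its frequency so both contribute equally."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * count) for cls, count in counts.items()}

# Made-up label list: 0 = cat, 1 = dog, with three times as many dogs.
labels = [0] * 100 + [1] * 300
weights = class_weights(labels)
print(weights)
```

A dictionary like this can be passed to Keras through the class_weight argument of model.fit, so that mistakes on the rarer class count for more in the loss.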

Problem 2: Object Detection


We will detect number plates in images of cars!

1. Dataset Preparation

Download the license plate images and their annotations from the Open Images Dataset V6. Organize the dataset as follows, either locally or in Google Colab:

/data
    /train
        - img1.jpg
        - img2.jpg
        ...
    /val
        - img1.jpg
        - img2.jpg
        ...
    /test
        - img1.jpg
        - img2.jpg
        ...
    /annotations
        - train_annotations.csv
        - val_annotations.csv
        - test_annotations.csv

2. Environment Setup

Install necessary libraries:

pip install tensorflow keras opencv-python pandas matplotlib

3. Model Selection

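Production systems typically use a pre-trained detector such as YOLO (You Only Look Once) or SSD (Single Shot MultiBox Detector). To keep this example short, the code below uses a simpler stand-in: a pre-trained MobileNetV2 backbone with a small head that predicts a single bounding box per image.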

4. Training

Here's a simplified code example that loads and preprocesses the images and then trains the model using TensorFlow/Keras:

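import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
import cv2
import pandas as pd
import numpy as np
import os

IMG_SIZE = 224  # input size expected by the MobileNetV2 backbone below

def load_dataset(image_dir, annotations_file):
    annotations = pd.read_csv(annotations_file)
    images = []
    boxes = []
    for index, row in annotations.iterrows():
        img_path = os.path.join(image_dir, row['filename'])
        img = cv2.imread(img_path)
        if img is None:
            continue  # skip files that cannot be read
        height, width = img.shape[:2]
        # Resize every image to the same size so they can be stacked into one array
        images.append(cv2.resize(img, (IMG_SIZE, IMG_SIZE)))
        # Normalize the box to 0..1 to match the sigmoid output of the model.
        # Assumption: the CSV stores pixel coordinates; skip the division if
        # your annotations are already normalized.
        boxes.append([row['xmin'] / width, row['ymin'] / height,
                      row['xmax'] / width, row['ymax'] / height])
    return np.array(images, dtype=np.float32), np.array(boxes, dtype=np.float32)

train_images, train_boxes = load_dataset('/data/train', '/data/annotations/train_annotations.csv')
val_images, val_boxes = load_dataset('/data/val', '/data/annotations/val_annotations.csv')

# Define a simple single-box regressor (a stand-in for YOLO, for demonstration purposes)
backbone = tf.keras.applications.MobileNetV2(input_shape=(IMG_SIZE, IMG_SIZE, 3), include_top=False, weights='imagenet')
model = tf.keras.Sequential([
    tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects inputs in -1..1
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation='sigmoid')  # xmin, ymin, xmax, ymax in 0..1
])

# Mean absolute error is a more meaningful metric than accuracy for box regression
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mae'])

checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True, mode='min')
early_stopping = EarlyStopping(monitor='val_loss', patience=10, mode='min')

# Train the model
model.fit(train_images, train_boxes, validation_data=(val_images, val_boxes), epochs=50, batch_size=8, callbacks=[checkpoint, early_stopping])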

5. Evaluation

Evaluate the model on the test set:

test_images, test_boxes = load_dataset('/data/test', '/data/annotations/test_annotations.csv')
model.evaluate(test_images, test_boxes)
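
The loss value reported by the evaluation is hard to interpret on its own. Bounding boxes are usually judged by Intersection over Union (IoU): the area where the predicted and true boxes overlap, divided by the area they cover together. A minimal sketch, with made-up boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [xmin, ymin, xmax, ymax]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at 0 for disjoint boxes.
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175, about 0.143
```

An IoU of 1 means a perfect match and 0 means no overlap; a prediction is often counted as correct when its IoU with the true box exceeds a chosen threshold such as 0.5.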

6. Inference

Run inference on new images and visualize the results:

def draw_boxes(image, boxes):
    for box in boxes:
        cv2.rectangle(image, (int(box[0]), int(box[1])), (int(box[2]), int(box[3])), (255, 0, 0), 2)
    return image

# Load and preprocess new image
new_image = cv2.imread('new_image.jpg')
input_image = cv2.resize(new_image, (224, 224))
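input_image = np.expand_dims(input_image, axis=0).astype(np.float32)

# Predict bounding box (the sigmoid output is normalized to 0..1)
predicted_box = model.predict(input_image)

# Scale the normalized box back to the pixel size of the original image
height, width = new_image.shape[:2]
pixel_boxes = predicted_box * np.array([width, height, width, height])

# Draw predicted box on the image
output_image = draw_boxes(new_image, pixel_boxes)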
cv2.imshow('Output', output_image)
cv2.waitKey(0)
cv2.destroyAllWindows()

Done, we have successfully done object detection as well!

Problem 3: Image Segmentation

Image Segmentation is a relatively new field of Computer Vision and image recognition technology. It was introduced to address a major drawback of object detection. While object detection is proficient at identifying and locating objects within an image, it falls short in providing detailed information about the shape and boundaries of these objects. Image segmentation overcomes this limitation by partitioning the image into segments, allowing each pixel to be classified into a specific object or region. This granular approach enables more precise analysis and understanding of the visual content, making it invaluable for applications such as medical imaging, autonomous driving, and image editing.

The output of image segmentation significantly differs from object detection. Instead of generating bounding boxes around objects, image segmentation provides a mask that delineates the exact shape of each object within the image. This mask is typically a binary or multi-class matrix where each pixel is assigned a class label, corresponding to the object or background it belongs to.
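
The difference between a mask and a bounding box is easy to see on a toy example. The mask below is made up; the sketch derives the tightest bounding box from it and compares how many pixels each representation marks.

```python
import numpy as np

# Made-up 4x5 mask: 1 marks pixels that belong to the object, 0 is background.
mask = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

# A bounding box can always be derived from a mask, but not the other way round.
rows, cols = np.nonzero(mask)
box = (int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max()))  # xmin, ymin, xmax, ymax

object_pixels = int(mask.sum())
box_pixels = (box[2] - box[0] + 1) * (box[3] - box[1] + 1)
print(box, object_pixels, box_pixels)  # (1, 1, 3, 2) 5 6
```

The box covers one pixel that is not part of the object; for irregular shapes that gap grows, which is the detail segmentation recovers.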

To model the output for image segmentation, we need to modify the architecture of the neural network, particularly the decoder, to perform per-pixel classification. This involves the following steps:

Use Fully Convolutional Networks (FCNs): Replace fully connected layers with convolutional layers to preserve spatial information throughout the network.
Upsampling Layers: Incorporate upsampling layers, such as transposed convolutions or interpolation methods, to increase the resolution of the output to match the input image size. This allows the network to predict class labels for each pixel.
Skip Connections: Integrate skip connections from earlier layers in the network to combine high-level semantic information with low-level spatial details. This helps in producing more accurate and detailed segmentations.
Softmax Activation: Apply a softmax activation function at the final layer to produce a probability distribution over the class labels for each pixel.
Loss Function: Use a pixel-wise loss function, such as categorical cross-entropy, to train the network. This encourages the network to make accurate per-pixel predictions.
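
The last two steps, the per-pixel softmax and the pixel-wise loss, can be sketched in NumPy. The raw outputs below are made up for a 2x2 image with two classes (background and plate), and the function names are hypothetical; a real network would produce these values from the upsampled feature maps.

```python
import numpy as np

def pixel_softmax(logits):
    """Softmax over the last axis: one probability distribution per pixel."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def pixel_cross_entropy(probabilities, true_mask):
    """Mean categorical cross-entropy over all pixels."""
    h, w = true_mask.shape
    rows, cols = np.indices((h, w))
    picked = probabilities[rows, cols, true_mask]  # probability of the true class
    return float(-np.log(picked + 1e-12).mean())

# Made-up raw outputs with shape (height, width, classes).
logits = np.array([
    [[2.0, 0.0], [0.0, 2.0]],
    [[2.0, 0.0], [0.0, 2.0]],
])
true_mask = np.array([
    [0, 1],
    [0, 1],
])

probabilities = pixel_softmax(logits)
predicted_mask = probabilities.argmax(axis=-1)  # class label for each pixel
loss = pixel_cross_entropy(probabilities, true_mask)
print(predicted_mask)
print(loss)
```

Taking the argmax over the class axis turns the per-pixel probabilities into the mask described above, and the loss shrinks as the probability assigned to the true class of each pixel approaches 1.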

And now, for a fun twist: Let’s keep this one as a homework assignment! Using the same dataset, try extracting the number plate of a car pixel by pixel. Once you have it, you can even replace it with another number (but don’t try this at home, folks)!

Applications of Image Recognition

Image analysis is now so important that a significant part of Artificial Intelligence is based on it. Nowadays, it is applied to various activities and for different purposes.

Automotive Industry

Autonomous vehicles are a true revolution. To a lot of people they still seem quite futuristic: cars that drive passengers around without anyone touching the steering wheel or the pedals. With the help of cameras all around the vehicle, radars, and sensors, the car is able to determine which elements are present in its surroundings and make predictions regarding their trajectory or actions. The neural networks within the program analyze the pixel patterns in the camera images and can tell whether the object on the right-hand side is a bicycle or not, and whether it is coming towards the car or moving away from it. Self-driving cars also detect and identify traffic signs and signals, trees, pathways, and pedestrians.

Security Industry

Home security has become a major concern for people as well as for insurance companies. Robberies happen every day, and many individuals have decided to tackle the problem by installing cameras and security alarms all over their homes and surrounding areas. This approach has proven effective for many people. Most of the time, the footage is used to show the police or the insurance company that a thief indeed broke into the house and stole something, but it is also used to detect fraud. On another note, CCTV cameras are increasingly installed in big cities to spot antisocial behaviour and vandalism, for instance. Digital images are also used by stores to catch shoplifters in action and provide the police with proof of the offence. Lastly, airport security agents use this kind of camera to detect suspicious behaviour, to perform facial recognition, and to identify potential threats such as unattended bags. It is a complex task, but Machine Learning has made it possible.

Healthcare

Medical staff seem to appreciate the application of AI in their field more and more. On X-rays, for instance, image recognition models can detect and put bounding boxes around fractures, abnormalities, or even tumors. Thanks to Object Detection and image preprocessing, doctors are able to give their patients a diagnosis more rapidly and more accurately. They can check whether a treatment is working properly, and they can even estimate the age of certain bones.

Retail, e-commerce, and Marketing

Since the beginning of the COVID-19 pandemic and the lockdowns it brought, people have started to order all kinds of items online (clothes, glasses, food, etc.). Some companies have developed their own image recognition algorithms for their specific activities. Online shoppers can now try on clothes or glasses virtually: they just take a video or a picture of their face or body and try on the items they choose directly on their smartphones. This way, customers can see how the items look on them before placing an order. Online shoppers also receive suggestions for pieces of clothing they might enjoy, based on what they have searched for, purchased, or shown interest in.

Farming Industry

Farmers are always looking for new ways to improve their working conditions. Taking care of both their cattle and their crops can be time-consuming and difficult. Today more and more of them use AI and Image Recognition to improve the way they work. Cameras inside the buildings allow them to monitor the animals and make sure everything is fine. When an animal gives birth, farmers can easily see whether it is having difficulties and can quickly come to help. These professionals also have to look after the health of their crops. Object Detection helps them analyze the condition of the plants and gives them indications on how to improve or save the crops, which they need to feed their cattle.


Fitness and Wellness

Yes, fitness and wellness is a perfect match for image recognition and pose estimation systems.

Image recognition fitness apps can give a user some tips on how to improve their yoga asanas, watch the user’s posture during the exercises, and even minimize the possibility of injury for elderly fitness lovers.

While YouTube tutorials can only show how to perform an exercise, human pose recognition apps go much further and help users improve their performance. How many of us have gone to an in-person training session just to get some feedback and find out whether we were exercising correctly?

Manufacturing

Image recognition works well for manufacturers and B2B retailers too. Consider a milk batch that has to be recalled: that could be avoided with a better quality assurance system aided by image recognition.

For example, an image recognition (IR) algorithm can visually evaluate the quality of fruit and vegetables. Those that do not look fresh anymore won’t be shipped to the retailers. Producers can also use IR in the packaging process to locate damaged or deformed items. What is more, it is easy to count the number of items inside a package. For example, a pharmaceutical company needs to know how many tablets are in each bottle.

The use of IR in manufacturing doesn’t come down to quality control only. If you have a warehouse or just a small storage space, it will be way easier to keep it all organized with an image recognition system. For instance, it is possible to scan products and pallets via drones to locate misplaced items.

Medicine

What about med tech? Image recognition can be applied to dermatology images, X-rays, tomography, and ultrasound scans. Such classification can significantly improve telemedicine and monitoring the treatment outcomes resulting in lower hospital readmission rates and simply better patient care.

For example, IR technology can help with cancer screenings. Medical image analysis is now used to monitor tumors throughout the course of treatment. Medical image analysis is a true revolution.


Author

This article was written by Zohair Badshah, a former member of our software team, and edited by our writers team.
