In this blog I highlight my work on a computer vision problem: road damage detection.
Roads are an essential means of transportation and travel, and we all rely on them in our daily routines. They make a crucial contribution to economic development and growth and bring important social benefits, which makes them vital for a nation to grow and develop. By providing access to employment, social, health and education services, a road network is also crucial in the fight against poverty. Roads open up more areas and stimulate economic and social development. For those reasons, road infrastructure is the most important of all public assets.
The road network spans millions of miles, so maintaining it is a heavy burden for governments and local bodies (municipalities). Every road needs maintenance after a few years, and this maintenance can have a major financial impact on the economy and on citizens.
Now that we have seen why roads matter, let’s move on and see what’s inside the blog.
- Understanding Business Problem
- Business Objectives and Constraints
- Exploratory Data Analysis
- Research Papers
- Existing Solution
- My First Cut Approach
- Preparing Data
- SSD Mobilenet
- YOLO v3
- Final Pipeline
- Future Work
1. Understanding Business Problem:
From the overview, we understood why roads are useful and why their maintenance is crucial. Now we need to know how the problem of road inspection can be handled. There are three ways to do so.
Manual Method: This is considered a traditional method. A team of investigators detects damage while walking or riding in a slow-moving vehicle. This visual inspection suffers from the subjective judgement of the investigators, and the process is very time consuming and unsafe.
Semi-automated Method: This is also considered a traditional method. A fast-moving vehicle is used to take pictures of the damage, but manual inspection is still needed for critical damage. This method is safer than the manual method, but not fully safe.
Fully-automated Method: This is the modern way to check for damage. A fast-moving vehicle is equipped with sophisticated and expensive sensors, and the collected images are then processed with image processing and pattern recognition. However, the high-quality sensors and cameras make this method too expensive, so financially weak local bodies can’t afford it. So what’s next?
Instead of using these high-quality cameras and sensors, governments and local bodies can use a smartphone mounted on the dashboard of a vehicle, which is far more affordable than the prior methods. This approach was proposed by Maeda et al. They mounted a mobile phone on a vehicle’s dashboard and used special software that captures one road image per second and records the location, at an average vehicle speed of 40 km/h.
2. Business Objectives and Constraints:
Since the idea is to mount a smartphone on the dashboard of a car, we ultimately have to build a real-time object detector. More specifically, this system has to work on mobile devices.
3. Data Overview:
The data that I use to solve this problem can be downloaded from here. This data was also used by IEEE for its Big Data Challenge. Initially the dataset included images from Japan only; later, once models achieved good accuracy on it, data from more countries such as India and the Czech Republic was added. The dataset has had three variants so far.
RDD2018: This dataset was introduced by Maeda et al., and since then many researchers have used it to improve the performance of deep learning models. They defined 8 types of road damage according to the Japanese government and municipalities, which are listed below in the table. This dataset contains 9053 images with 15435 annotations (damage descriptions).
RDD2019: In 2019, the same researchers modified RDD2018 to form RDD2019, with 13135 images and 30989 annotations (damage descriptions).
RDD2020: The previous datasets focused on a single country, Japan, so to check feasibility elsewhere, new data was added from India and the Czech Republic. RDD2020 contains 26620 images and annotations (damage descriptions) from Japan, from India, and from the Czech Republic and partially from Slovakia. Unlike previous versions, this version has just 4 damage categories, namely D00, D10, D20 and D40. The other 4 damage categories were excluded because those damage types are common all over a country and vary from one country to another.
One difference across the images is that those from Japan and the Czech Republic are 600x600, while those from India are 720x720. The dataset was also collected under different lighting and weather conditions.
4. Exploratory Data Analysis:
First, let’s start by looking at some of the images with their damage types.
Now that we have seen some sample images with their damage types, it’s time to understand the definitions of the damage types.
Annotations: Annotations can be considered a collection of details about the objects present in an image. If an image contains 4 objects, there should be 4 object entries in its annotation. The RDD2020 dataset provides annotations in PASCAL VOC format, one of the most widely used annotation formats for object detection. The PASCAL VOC format has several properties:
- A PASCAL VOC annotation is an XML file.
- There is one annotation file for each image in the dataset.
- Bounding boxes are stored as (xmin, ymin, xmax, ymax), where (xmin, ymin) is the top-left corner and (xmax, ymax) is the bottom-right corner.
A sample PASCAL VOC annotation for an image looks like this:
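As a rough illustration of this layout, the sketch below parses a small hand-written PASCAL VOC annotation with Python’s standard xml.etree.ElementTree module. The file name, image size and box coordinates are invented for the example, and parse_voc is a hypothetical helper name; only the tag layout follows the PASCAL VOC convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation: the file name, size and coordinates are made up.
SAMPLE_XML = """
<annotation>
    <filename>Japan_000001.jpg</filename>
    <size><width>600</width><height>600</height><depth>3</depth></size>
    <object>
        <name>D20</name>
        <bndbox><xmin>87</xmin><ymin>281</ymin><xmax>226</xmax><ymax>421</ymax></bndbox>
    </object>
    <object>
        <name>D40</name>
        <bndbox><xmin>300</xmin><ymin>350</ymin><xmax>380</xmax><ymax>400</ymax></bndbox>
    </object>
</annotation>
"""


def parse_voc(xml_text):
    """Return (filename, [(label, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text.strip())
    filename = root.findtext("filename")
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        # Corners are stored as top-left (xmin, ymin) and bottom-right (xmax, ymax).
        coords = tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"),) + coords)
    return filename, boxes


filename, boxes = parse_voc(SAMPLE_XML)
print(filename, boxes)
```

Each object element carries one damage label and one box, so an image with several damaged regions simply has several object entries.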
After showing some sample images and explaining their definitions and annotations, I divided the data into three parts: images from the Czech Republic, images from India, and images from Japan. So let’s get started.
Czech Republic’s data analysis: There are 2829 images collected from the Czech Republic and parts of Slovakia. Now it’s time to look at the distribution of damage types across the Czech data.
The observation from the above count plot: among the images for which annotations are given, damage type ‘D00’ dominates (almost 57%). It corresponds to the ‘wheel-marked part’ category (longitudinal cracks). The data is highly imbalanced, as we can see in the count plot.
India’s data analysis: There are 7706 images collected from India to build the dataset. Now let’s see its distribution of damage types.
The observation from the above count plot: among the images for which annotations are given, damage type ‘D40’ dominates (almost 47%), which typically means ‘pothole’. The data is highly imbalanced, as we can see in the count plot. Another thing to note is that only about 1% of the Indian images show damage type ‘D10’, which means ‘equal interval’ (lateral cracks).
Japan’s data analysis: There are 10506 images collected from Japan. Let’s analyze them.
The observation from the above count plot: among the images for which annotations are given, damage type ‘D20’ is the most frequent (almost 38%). It typically means ‘partial pavement, overall pavement’, also known as an ‘alligator crack’. The data is highly imbalanced, as we can see in the count plot.
All data analysis: Having analyzed each country separately, it’s time to analyze the entire dataset.
The observation from the above count plot: each damage type has a good chunk of data, so a sensible model should be able to reach reasonable performance on every class, and we no longer see a severe imbalance across the data. That’s it for the EDA; next come the research papers.
5. Research Papers:
Road Damage Detection Using Deep Neural Networks with Images Captured Through a Smartphone (Maeda et al.): A lot of research had been done before this paper. The prior work achieved very good accuracy for road damage detection, but those studies only detect the presence or absence of damage. The problem is that when a road manager from the governing body needs to repair such damage, they need to clearly understand the type of damage in order to take action. So the researchers from the University of Tokyo created their own dataset and labelled each image. They defined 8 damage types following the instructions of the Japanese government and municipalities, and collected data from 7 municipalities in Japan with 40+ hours of recording. The images cover various weather and lighting conditions. They then applied state-of-the-art object detection methods to the data and compared accuracy and runtime speed on a GPU server as well as on mobile phones. One important thing to note is that they used PASCAL VOC for the annotations, since this format is easy to apply to many existing methods. To train on the dataset, they used SSD MobileNet and SSD Inception v2. Both models are already implemented in the TensorFlow Object Detection API, and each has its own specifications.
Transfer Learning-based Road Damage Detection for Multiple Countries (Arya et al.): For this research, new data was added to the previous RDD2019. The data comes from India and the Czech Republic, with some parts of Slovakia. Images captured in Japan and in the Czech Republic (with some parts of Slovakia) have dimensions 600x600, while images captured in India have dimensions 720x720. Another important point is that this research uses only 4 damage types instead of the previously defined 8. The authors note that many CNN architectures such as R-CNN, Fast R-CNN and Faster R-CNN have been developed to attain higher accuracy while improving processing speed. However, the computational load is still too large for processing images on a device with limited computation power, so they prefer the Single Shot MultiBox Detector (SSD) framework to increase the speed of object detection. It uses a single feed-forward convolutional network to detect multiple objects within the image directly and combines predictions from numerous feature maps with different resolutions to handle objects of various sizes.
MobileNet is a small, low-latency, and low-power convolutional feature extractor that can be built to perform classification, detection, or segmentation similar to popular large-scale models, such as Inception SSD (Szegedy et al. (2016)). It is based on depthwise separable convolution, which factorizes a standard convolution into a depthwise convolution and a 1x1 convolution called a pointwise convolution. The depthwise convolution applies a single filter to each input channel separately, and the pointwise convolution is a convolution with a kernel size of 1x1 that combines the features created by the depthwise convolution (Douillard (2018)). In comparison to a depthwise separable convolution, a regular convolution does both the filtering and combination steps in a single run.
However, a regular convolution requires more computational work to accomplish the task, and it needs to learn more weights.
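To make the difference in weights concrete, here is a small sketch that counts the weights of a regular convolution and of a depthwise separable one. The layer sizes (3x3 kernel, 128 input channels, 256 output channels) are assumed for illustration and are not taken from the paper; biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes the channels
    return depthwise + pointwise


# Illustrative layer sizes (assumed, not taken from the paper).
k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 2))  # 294912 33920 8.69
```

For this layer the separable version needs roughly 8.7 times fewer weights, which is the kind of saving that makes MobileNet attractive on phones.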
For this research they used transfer learning with already trained models. The main aim of the work is to check how feasible the Japanese model is for other countries.
6. Existing Solution:
A Deep Learning Approach for Road Damage Detection from Smartphone Images (Alfarrarjeh et al.): The economy of cities is essentially affected by their public facilities and infrastructures. One fundamental element of such infrastructures is roads. Many factors (like rain and aging) cause different types of road damage that seriously impact road efficiency, driver safety, and the value of vehicles. Therefore, countries devote a large annual budget to road maintenance and rehabilitation.
The focus of this study is automating the detection of different types of road damage (proposed by Maeda et al.) using smartphone images crowdsourced by city crews or the public. Their approach uses one of the state-of-the-art deep learning algorithms (YOLO) for the object detection task. They used the data from the IEEE Big Data Cup Challenge 2018.
An object detection algorithm analyzes the visual content of an image to recognize instances of a certain object category, then outputs the category and location of the detected objects. With the emergence of deep convolutional networks, many CNN-based object detection algorithms have been introduced. The first one is the Regions with CNN features (R-CNN) method, which tackles object detection in two steps: object region proposal and classification. The object region proposal employs a selective search to generate multiple regions. These regions are processed and fed to a CNN classifier. “R-CNN is slow due to the repetitive CNN evaluation”. Hence, many other algorithms have been proposed to optimize R-CNN (like Fast R-CNN). Other than the R-CNN-based algorithms, the “You Only Look Once” (YOLO) method uses a different approach and basically merges the two steps of the R-CNN algorithm into one step by developing a neural network which internally divides the image into regions and predicts categories and probabilities for each region. Applying the prediction only once gives YOLO a significant speedup compared to R-CNN-based algorithms, so it can be used for real-time prediction.
To solve the road damage type detection problem, they consider a road damage as a unique object to be detected. In particular, each of the different road damage types is treated as a distinguishable object. Then, they train one of the state-of-the-art object detection algorithms (YOLO) on the road damage dataset to learn the visual patterns of each road damage type. Finally, they created 3 different datasets, namely original, augmented and cropped. To create an object detector, they fine-tuned the Darknet-53 model using the YOLO framework.
Read more about YOLO v3.
7. My First Cut Approach:
Since we got a lot of ideas from the research papers and existing solutions, I decided to use two models here: YOLO v3 and SSD MobileNet.
8. Preparing Data:
Before starting the modeling part, it’s always a good idea to preprocess the data. So I started by removing the images and annotations whose damage types are not D00, D10, D20, or D40.
Initially we have 25046 images, of which 12851 have classes that are not the ones we want. We can easily remove them from the dataframe using their indexes.
After removing these 12851 images we are left with 12195 images and annotations. But these annotations can still contain classes that we don’t want, so we need to remove those class details from the XML files. In the XML file below there are 3 objects: two of them are D40 and one is D50. We want to keep the two D40 objects and remove the D50 object from the XML file.
To do that I used the snippet below.
The result of the above code snippet for the above-mentioned XML file is:
That’s all for the preprocessing. Next comes modeling.
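A minimal sketch of this kind of filter, written with the standard xml.etree.ElementTree module, might look like the following. The helper name drop_unwanted_objects and the tiny example annotation are hypothetical and only illustrate the idea of keeping D00, D10, D20 and D40 objects.

```python
import xml.etree.ElementTree as ET

KEEP = {"D00", "D10", "D20", "D40"}


def drop_unwanted_objects(xml_text, keep=KEEP):
    """Remove every <object> whose <name> is not in `keep`; return the new XML string."""
    root = ET.fromstring(xml_text)
    # Materialize the list first so the tree is not modified while iterating.
    for obj in list(root.findall("object")):
        if obj.findtext("name") not in keep:
            root.remove(obj)
    return ET.tostring(root, encoding="unicode")


# Hypothetical annotation with two D40 objects and one D50 object.
EXAMPLE = (
    "<annotation><filename>example.jpg</filename>"
    "<object><name>D40</name></object>"
    "<object><name>D50</name></object>"
    "<object><name>D40</name></object>"
    "</annotation>"
)

cleaned = drop_unwanted_objects(EXAMPLE)
remaining = [o.findtext("name") for o in ET.fromstring(cleaned).findall("object")]
print(remaining)  # ['D40', 'D40']
```

In practice the same function would be applied to the text of every annotation file, writing the cleaned XML back to disk.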
9. SSD Mobilenet:
For training SSD MobileNet, I used TensorFlow’s Object Detection API, which makes our life a lot simpler. I used Google Colab to train the model. One important detail is that I used TensorFlow version 1.x.
Next, clone TensorFlow’s object detection repository. Before cloning, make sure you are in the root directory of Colab by using %cd.
The TensorFlow Object Detection API uses Protobufs to configure model and training parameters. Before the framework can be used, the Protobuf libraries must be compiled. Before compiling them, change directory using %cd /root/models/research/
Now we need to add the libraries to the Python path.
Before going further, let’s make a quick check by running the command below. It verifies whether the Object Detection API is installed correctly.
If it ends with Ran (some) tests in (some)s followed by OK, then we are good to go.
The next step is to load our dataset. Right after this step the directory structure will look like this.
Now create a CSV file for the train and test sets, based on the annotations. For that I used the snippet below.
We can call this function directly from a Colab cell using the script below. But make sure to change the directory first using %cd /root/models/data/
Running the above cell creates a new folder, annotations, inside the data folder. It will contain the train_labels.csv, test_labels.csv and label_map.pbtxt files.
After that, my notebook has an optional step that resizes all the images to 640x640 and changes the bounding box coordinates accordingly. To do that I followed this blog.
Now we need to convert our data into TFRecord format. TensorFlow recommends storing and reading data in TFRecord format. It internally uses Protocol Buffers to serialize/deserialize the data and stores it as bytes, which takes less space for a large amount of data and makes it easier to transfer. Read more about TFRecord here. The code snippet to generate the TFRecord files is:
To run the above TFRecord implementation, we can use the lines below.
It will generate the train.record and test.record files and store them in data/annotations.
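A rough sketch of such a conversion is shown here, using only the standard library. The column layout (filename, width, height, class, xmin, ymin, xmax, ymax) is an assumption that mirrors the xml_to_csv helpers commonly used with the Object Detection API; the function names and the sample annotation are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Column layout assumed here; it mirrors the common xml_to_csv helper scripts.
COLUMNS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]


def voc_to_rows(xml_text):
    """Turn one PASCAL VOC annotation into a list of CSV rows (one per object)."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        rows.append([filename, width, height, obj.findtext("name")] + coords)
    return rows


def write_labels_csv(xml_texts, out_file):
    """Write a header plus one row per annotated object to an open text file."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(COLUMNS)
    for xml_text in xml_texts:
        writer.writerows(voc_to_rows(xml_text))


# Hypothetical annotation used only to demonstrate the output.
SAMPLE = (
    "<annotation><filename>India_000001.jpg</filename>"
    "<size><width>720</width><height>720</height></size>"
    "<object><name>D40</name><bndbox><xmin>10</xmin><ymin>20</ymin>"
    "<xmax>110</xmax><ymax>220</ymax></bndbox></object></annotation>"
)

buffer = io.StringIO()
write_labels_csv([SAMPLE], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the example self-contained; for real data the same function can be handed a file opened with open(path, "w", newline="").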
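The coordinate change in this resizing step is plain scaling: each x value is multiplied by new_width / old_width and each y value by new_height / old_height. The sketch below shows only that arithmetic, assuming the 600x600 and 720x720 source sizes mentioned earlier; scale_box is a hypothetical helper, and the image resizing itself is left to an image library.

```python
def scale_box(box, old_size, new_size=(640, 640)):
    """Rescale (xmin, ymin, xmax, ymax) from old_size to new_size, both given as (width, height)."""
    xmin, ymin, xmax, ymax = box
    old_w, old_h = old_size
    new_w, new_h = new_size
    return (
        round(xmin * new_w / old_w),
        round(ymin * new_h / old_h),
        round(xmax * new_w / old_w),
        round(ymax * new_h / old_h),
    )


# Example boxes are made up; sizes follow the 600x600 and 720x720 images.
japan_box = scale_box((150, 300, 450, 600), old_size=(600, 600))
india_box = scale_box((90, 180, 360, 720), old_size=(720, 720))
print(japan_box, india_box)  # (160, 320, 480, 640) (80, 160, 320, 640)
```

Because both source sizes are square, the aspect ratio is preserved and the boxes keep their shape after resizing.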
Now download the model from TensorFlow’s object detection model zoo. I used “ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03”. Next, move the “ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync.config” file from /root/models/research/object_detection/samples/configs to /root/models/
Now we need to modify the config file as per our needs, like this:
Now it’s time for training, which is very time consuming, so you can go to sleep, but make sure to keep the notebook alive :)
In Google Colab, I used a “Tesla T4” GPU to train the model, and I stopped the training after 8 hours (roughly 12K steps), since Google Colab can disconnect after 8 to 12 hours of GPU use. You can look at the saved checkpoints at /tmp/tmpxxxxxxxx/. We need to export whichever checkpoint is the newest.
Running the above code creates a directory inside /root/models/ named “fine_tuned_model”, which will contain the frozen inference graph (the trained model’s graph and weights) produced by the export_inference_graph.py script. That’s all for training.
10. YOLO v3:
To train YOLO v3 with Keras, I followed this GitHub repository. The first step is to prepare the data in YOLO format. For that I first created a CSV file like this.
After creating the CSV files for both train and test, save them. Now we need to follow the same directory layout as in the referenced GitHub repository.
In the above image, YOLOv3 is the folder cloned from this GitHub repository. I made some minor changes, so it looks slightly different. The most important thing to look at is the Source_Images folder. In it, Test_Image_Detection_Results will contain the predicted test images with bounding boxes after inference. Test_Images and Training_Images, as the names suggest, contain the test images and train images respectively. There is one more folder inside Training_Images, files_to_train, which also contains the train images (just to follow the repository structure) and Annotation-export.csv (the train data CSV that we created earlier). Now we can start training…
Before going further, make sure to use TensorFlow version 2.3.1 and Keras version 2.4.3. Then change directory using %cd to /content/YOLOv3/Image Annotation/ and run the cell below.
The above function takes Annotation-export.csv from /content/YOLOv3/Data/Source_Images/Training_Images/files_to_train and converts it into 2 files: data_train.txt, saved to /content/YOLOv3/Data/Source_Images/Training_Images/files_to_train, and data_classes.txt, saved to /content/YOLOv3/Data/Model_Weights.
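To give a feel for what this conversion produces, here is a hedged sketch. It assumes the CSV has the columns image, xmin, ymin, xmax, ymax and label, and that the training file uses the usual Keras YOLO v3 layout of one line per image, written as the image path followed by space-separated boxes in the form xmin,ymin,xmax,ymax,class_id. The function name and sample rows are hypothetical, and the repository’s own script may differ in details.

```python
import csv
import io


def csv_to_yolo_lines(csv_file, image_dir=""):
    """Group boxes by image and return (train_lines, class_names)."""
    classes = []           # class names in order of first appearance
    boxes_per_image = {}   # image name -> list of "xmin,ymin,xmax,ymax,class_id"
    for row in csv.DictReader(csv_file):
        label = row["label"]
        if label not in classes:
            classes.append(label)
        box = ",".join(
            [row["xmin"], row["ymin"], row["xmax"], row["ymax"], str(classes.index(label))]
        )
        boxes_per_image.setdefault(row["image"], []).append(box)
    # One line per image: path, then every box separated by a space.
    lines = [" ".join([image_dir + image] + boxes) for image, boxes in boxes_per_image.items()]
    return lines, classes


# Made-up rows standing in for Annotation-export.csv.
SAMPLE_CSV = io.StringIO(
    "image,xmin,ymin,xmax,ymax,label\n"
    "a.jpg,10,20,110,220,D40\n"
    "a.jpg,5,5,50,60,D00\n"
    "b.jpg,30,40,300,400,D40\n"
)

train_lines, class_names = csv_to_yolo_lines(SAMPLE_CSV)
print(train_lines)
print(class_names)
```

Here train_lines corresponds to the content of data_train.txt and class_names to data_classes.txt. Since spaces separate the fields, image paths containing spaces would need special handling in this sketch.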
Now we need to download the YOLO v3 weights pre-trained on the ImageNet dataset and convert them to Keras format. To do that, change the path using %cd to /content/YOLOv3/Training/
Now we can start training by running the snippet below.
This will run for 11 epochs with batch size 32 and then for another 11 epochs with batch size 4. Combined, it takes slightly more than 7 hours in Google Colab with a Tesla T4 GPU. The trained model is then stored in /content/YOLOv3/Data/Model_Weights under the name trained_weight_final.h5
That’s it for the training of YOLO v3. Next, some results where the model works well and some where it fails.
11. Results:
SSD MobileNet: In terms of performance metrics, I kept everything as it is. The TensorFlow Object Detection API reports Average Precision and Average Recall. The results that I achieved after 8 hours of training in Google Colab are:
The metric we typically care about is AP@IoU=0.50; the higher the number, the better. Looking at the values, one clear observation is that the model needs further improvement. Now let’s have a look at some of the test cases:
- Well Predicted Cases —
- Not Well Predicted Cases —
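For reference, the IoU threshold behind a metric such as AP@IoU=0.50 can be sketched as follows: a predicted box counts as a match for a ground-truth box only if their intersection over union is at least 0.5. The boxes below are made up, and the iou function is an illustrative helper, not the evaluation code of the Object Detection API.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


truth = (0, 0, 100, 100)
close_pred = (10, 0, 110, 100)   # overlaps most of the ground truth
loose_pred = (50, 0, 150, 100)   # overlaps only half of it

print(round(iou(truth, close_pred), 3), iou(truth, close_pred) >= 0.5)
print(round(iou(truth, loose_pred), 3), iou(truth, loose_pred) >= 0.5)
```

The first prediction would count as a correct detection at the 0.50 threshold, while the second would not, even though it still covers half of the damaged area.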
YOLO v3: For YOLO v3 I used the Keras implementation, and again I kept the performance metric unchanged. Here, however, I can check for overfitting in terms of the loss, using the TensorBoard loss logs shown below:
The observation from the above plot is that the model does not overfit. Now let’s have a look at the results.
- Well Predicted Cases —
- Not Well Predicted Cases —
12. Final Pipeline:
- SSD Mobilenet —
- YOLO v3 —
13. Future Work:
- Train both models for more epochs, on a system with a decent configuration.
- Augmentation might help increase the dataset size.
- Try newer and faster models like EfficientDet, YOLO v4, YOLO v5, etc.
14. References:
- Original research papers: SSD, MobileNet, Inception v2, YOLO v3, Maeda et al., Arya et al., and Alfarrarjeh et al.
If you want to look at the code, this is my GitHub repository —
Connect with me on LinkedIn —