Using the KITTI Dataset to perform pixelwise classification of road images.
Object detection in images has been continuously advancing, with more efficient and accurate research papers being released every year. One of the most famous variants of object detection is semantic segmentation, which involves assigning every pixel to a particular class of object. To explain this further, look at the example below: every pixel in the output is given a different color depending on whether that pixel is a road, vehicle, traffic sign, tree, human, etc. This pixel-wise classification of images into different classes is known as semantic segmentation.
Conventional object detection involves drawing a box around the desired object, as shown in the example below:
-
If the goal of the classification is to identify the drivable portion of the road, drawing a box around it would not make sense, since a box cannot capture the actual region of the image where the drivable road is present.
-
Sometimes normal object detection can become too crowded and uncomfortable to read, as shown here:
- Semantic segmentation can obtain more accurate dimensions of the identified object, which makes it easier to perform further computer vision techniques on it, for example Snapchat filters.
In addition to the normal convolutions found in deep learning models, semantic segmentation also makes use of:
- 1x1 Convolutions
- Transposed Convolutions
- Skip Connections
- Transfer Learning
1x1 convolutions are basically normal convolutions with kernel size (1, 1) and stride (1, 1). Their operation is not very different from a fully connected (dense) layer applied at every pixel, and they may seem rather redundant, but they have several benefits:
- They are very computationally cheap to use.
- They can be used on images of any size, unlike dense layers, which require a fixed input size to work.
- 1x1 Convolutions are the simplest method for dimensionality reduction.
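The points above can be illustrated with a minimal sketch: a 1x1 convolution is just a per-pixel matrix multiply over the channel dimension, so it can shrink the channel count of a feature map of any spatial size. (The shapes below are illustrative, not taken from the project.)

```python
import numpy as np

def conv_1x1(feature_map, weights):
    """A 1x1 convolution: a per-pixel matrix multiply over channels.

    feature_map: (H, W, C_in), weights: (C_in, C_out).
    Spatial dimensions are untouched; only channels change.
    """
    return feature_map @ weights

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 256))   # feature map with 256 channels
w = rng.standard_normal((256, 64))     # learnable weights: 256 -> 64 channels
y = conv_1x1(x, w)
print(y.shape)  # (8, 8, 64): spatial size unchanged, channels reduced
```

Because the weights only act on the channel axis, the same `w` works whether the input is 8x8 or 800x800, which is exactly why dense layers can be swapped out for 1x1 convolutions.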
Transposed convolutions can be thought of as the opposite of normal convolutions. They upsample feature maps to a larger size, so that the downsampled, dense output of the network can be expanded back into a full-sized image.
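A naive sketch of the upsampling idea: each input pixel "stamps" a scaled copy of the kernel onto a larger output, spaced `stride` apart, which is the reverse of a strided convolution. (This is a toy single-channel implementation for illustration, not the project's actual layer.)

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Naive transposed convolution on a single-channel feature map.

    Each input value scales the kernel and adds it into the output
    at a position spaced `stride` apart -- the reverse of a strided
    convolution, producing a larger output than the input.
    """
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i*stride:i*stride+k, j*stride:j*stride+k] += x[i, j] * kernel
    return out

x = np.ones((4, 4))        # small 4x4 feature map
kernel = np.ones((2, 2))   # 2x2 kernel with stride 2 doubles the resolution
y = transposed_conv2d(x, kernel, stride=2)
print(y.shape)  # (8, 8)
```

With a 2x2 kernel and stride 2 the stamps do not overlap, so a 4x4 input cleanly becomes an 8x8 output; in a real network the kernel values are learned rather than fixed.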
Skip connections are used to reuse information lost in the downsampling part of the network.
In the above image the outputs of Predict2 and DeConv1 are added to give the result of DeConv2; this recovers information lost during the convolutions and produces more accurate results. The downsampling part of the network is called the encoder and the upsampling part is called the decoder.
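The fusion step described above can be sketched very simply: an upsampled decoder layer and an encoder score map of the same shape are combined by element-wise addition. (The names `deconv1_up` and `predict2` mirror the figure; the shapes are illustrative assumptions.)

```python
import numpy as np

def skip_connection(decoder_up, encoder_scores):
    """Fuse an upsampled decoder output with an encoder score map
    of the same shape by element-wise addition (FCN-style skip)."""
    assert decoder_up.shape == encoder_scores.shape
    return decoder_up + encoder_scores

rng = np.random.default_rng(1)
deconv1_up = rng.standard_normal((16, 16, 2))  # upsampled DeConv1 output
predict2 = rng.standard_normal((16, 16, 2))    # per-pixel scores from encoder
deconv2 = skip_connection(deconv1_up, predict2)
print(deconv2.shape)  # (16, 16, 2)
```

The addition costs almost nothing, but it re-injects the finer spatial detail that the encoder's earlier layers still hold.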
Lastly, in this project I have used transfer learning, which reuses pretrained weights from the VGG model. I initialized the weights in the encoder to those of the already trained VGG model; this way I save training time by only having to train the decoder weights of the model.
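The effect of freezing the encoder can be sketched with a toy update step: pretrained parameters are marked non-trainable, so a gradient step moves only the decoder weights. (The parameter names and values are hypothetical; this is not the project's training loop.)

```python
import numpy as np

# Hypothetical parameter store: the encoder weights stand in for
# pretrained VGG weights, the decoder weights are freshly initialized.
params = {
    "encoder_w1": np.ones((3, 3)),   # frozen, as if loaded from VGG
    "decoder_w1": np.ones((3, 3)),   # trainable
}
trainable = {"encoder_w1": False, "decoder_w1": True}

def sgd_step(params, grads, lr=0.1):
    """Apply SGD only to trainable parameters; frozen ones stay fixed."""
    return {
        name: w - lr * grads[name] if trainable[name] else w
        for name, w in params.items()
    }

grads = {name: np.ones_like(w) for name, w in params.items()}
new_params = sgd_step(params, grads)
print(new_params["encoder_w1"][0, 0], new_params["decoder_w1"][0, 0])
# encoder weight unchanged at 1.0; decoder weight moved to 0.9
```

Since no gradients are applied to the encoder, each training step is cheaper and far fewer weights need to converge, which is the time saving the text refers to.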
Due to a lack of computing power I have not used the CityScapes dataset, but have instead used the KITTI dataset, which contains only a single class: the drivable portion of the road. Here are a few examples of my test outputs:
As you can see, the model has a few shortcomings of its own: it does not work well in some lighting conditions, a few images haven't been recognized well, and the recognition accuracy is not very high. Many improvements still need to be made to make the model more accurate while requiring fewer computational resources. To view more of my outputs, all of my test images are present in the runs folder.