PROJECT OVERVIEW
Drivers who spend long hours behind the steering wheel on long routes may become drowsy. According to the Highway Association, 5,000 people died in 2015 in crashes involving drowsy driving. Self-driving technology has advanced remarkably; for instance, Tesla's cars can navigate a route on their own, especially on highways. However, such cars are expensive, and not everyone can afford them. This project aims to provide a simple and affordable solution to help prevent and reduce car accidents. Using computer vision and machine learning algorithms, we can build a system that detects whether the driver is about to fall asleep or lose control. The first objective is to collect images of the driver using a webcam. Second, the system should detect and track the driver's eyes and facial expression. Finally, it should decide whether the driver is falling asleep.
This website describes how we implemented this project and is designed to provide easy access to project resources such as the code and references. Use the top bar to access the project steps, the description of the project's algorithms, and the resources.
PROJECT STEPS
To carry out this project, we followed a "divide and conquer" strategy, dividing the work into the following steps:
First, collect image frames from a webcam. Second, analyze each frame to detect the presence of a human face. Third, if a face is present, extract its facial landmarks. Finally, calculate the eye aspect ratio and the yawn (mouth) aspect ratio to decide whether the driver is falling asleep. The chart on the left gives a general overview of the project. In the next sections, we explain each step.
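As a rough sketch of the frame-collection step, a webcam loop with OpenCV might look like the following; the camera index, the window handling, and the process_frame placeholder are our own illustrative assumptions rather than the project's exact code.

```python
import cv2

def process_frame(frame):
    # Placeholder for the later steps: face detection, landmark
    # extraction, and the eye/yawn aspect-ratio decision.
    pass

cap = cv2.VideoCapture(0)                    # open the default webcam
while True:
    ok, frame = cap.read()                   # grab one frame per iteration
    if not ok:
        break
    process_frame(frame)
    cv2.imshow("driver", frame)              # show the live feed
    if cv2.waitKey(1) & 0xFF == ord("q"):    # press 'q' to stop
        break
cap.release()
cv2.destroyAllWindows()
```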
FACE DETECTION
Face detection is the process of analyzing an image to decide whether it contains a human face and, if so, returning its x and y coordinates. The human face is rich in features such as texture, corners, and facial lines. Our job here is to receive an image as input and analyze it for those features. To accomplish this task, we implemented three different algorithms and compared them in terms of speed and accuracy. In the literature, face detection is well studied, and many algorithms have been developed to detect the presence of a human face in an image. State-of-the-art algorithms can recognize a human face even if the person is not looking directly into the camera. However, our project focuses on detecting frontal face images because the camera will be mounted in front of the car driver. In the project, we implemented the following algorithms: Histogram of Oriented Gradients (HoG), Multi-task Cascaded Convolutional Network (MTCNN), and the Haar Cascade Classifier. The next sections describe these algorithms.
HISTOGRAM OF ORIENTED GRADIENTS (HOG)
The histogram of oriented gradients (HoG) is a feature descriptor used in computer vision and image processing for object detection. The technique counts occurrences of gradient orientations in localized portions of an image. To use HoG for face detection, we extracted the HoG features with the following steps:
Normalize the image so the features are less dependent on illumination variations. This step is optional, as described in the authors' paper (https://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf).
Convolve the image with two filters that are sensitive to the vertical and horizontal gradients. In other words, this step finds the vertical and horizontal gradients (the rate of change in pixel intensity).
Divide the image into subsections (cells) and compute a histogram of the gradient orientations within each cell. For example, if we divide the image into five subsections, we get five histograms, each showing the distribution of gradient orientations in its subsection.
Normalize each histogram by comparing it with the neighboring histograms. This step suppresses the effect of illumination variation across the image.
Finally, construct a one-dimensional feature vector from the information in each sub-image.
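The steps above map closely onto scikit-image's hog function; the sketch below is a minimal illustration with assumed cell and block sizes, not the exact parameters used in the project.

```python
from skimage import color, data
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())     # example image bundled with skimage

features = hog(
    image,
    orientations=9,              # bins per orientation histogram
    pixels_per_cell=(8, 8),      # step 3: divide the image into cells
    cells_per_block=(2, 2),      # step 4: normalize over neighboring cells
    block_norm="L2-Hys",
    feature_vector=True,         # step 5: flatten into a 1-D feature vector
)
print(features.shape)
```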
Now it is time to train a machine learning model that can detect the presence of a human face in new images. To do that, we followed the steps shown in the diagram above:
Collect a set of positive images. Using the skimage library, we collected images that contain a human face. This set is called the "positive images."
Construct a collection of non-face images. This set is called the "negative images" and should be at least double the positive set's size.
Extract the HoG features for both groups.
Train a support vector machine to classify new images.
Our model was able to achieve an accuracy of 98%. However, it has some limitations that will be discussed in the challenges section.
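To make the training procedure concrete, here is a hedged sketch of the HoG-plus-SVM pipeline; the patch sizes and the randomly generated placeholder patches are assumptions standing in for the positive and negative sets described above.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Placeholder patches standing in for the collected positive (face) and
# negative (non-face) image sets, so the snippet runs end to end.
rng = np.random.default_rng(0)
positive_patches = [rng.random((62, 47)) for _ in range(20)]
negative_patches = [rng.random((62, 47)) for _ in range(40)]   # ~2x positives

def extract_hog(patches):
    # One HoG feature vector per patch.
    return np.array([hog(p, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                     for p in patches])

X = np.vstack([extract_hog(positive_patches), extract_hog(negative_patches)])
y = np.concatenate([np.ones(len(positive_patches)),     # 1 = face
                    np.zeros(len(negative_patches))])    # 0 = non-face

clf = LinearSVC(C=1.0)    # linear support vector machine
clf.fit(X, y)
```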
MULTI-TASK CASCADED CONVOLUTIONAL NETWORK (MTCNN)
MTCNN is a neural network model that can be used for face detection and feature extraction. The model was proposed by Zhang et al. in their paper "Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks." It consists of three neural networks, summarized as follows:
P-Net
P-Net (Proposal Network) is a fully convolutional network. It produces rough face candidate boxes together with bounding-box regression offsets, and the offsets are then used to correct the candidate boxes. Finally, the non-maximum suppression (NMS) algorithm merges highly overlapping candidates.
R-Net
All remaining candidate boxes are fed to R-Net (Refine Network). This network rejects a large number of false candidates, corrects the survivors using bounding-box regression, and again merges overlapping boxes with NMS.
O-Net
O-Net is similar to R-Net, but its goal is to describe the face in more detail through additional supervision. In particular, this network outputs five facial landmark points.
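For illustration, the three-stage pipeline can be exercised through the open-source mtcnn Python package (assumed here; the project may have used a different implementation). The image path is a placeholder.

```python
import cv2
from mtcnn import MTCNN

image = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)  # MTCNN expects RGB

detector = MTCNN()
for face in detector.detect_faces(image):
    x, y, w, h = face["box"]            # bounding box refined by P-Net/R-Net/O-Net
    keypoints = face["keypoints"]       # the five facial points produced by O-Net
    print(face["confidence"], (x, y, w, h), keypoints["left_eye"])
```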
HAAR CASCADE CLASSIFIER
The Haar Cascade classifier is an object detection approach proposed by Paul Viola and Michael Jones in their 2001 paper "Rapid Object Detection using a Boosted Cascade of Simple Features." The algorithm is trained on a large set of positive and negative images. For each image, convolution-like filters are applied to detect features. In contrast to HoG feature extraction, the Haar cascade uses three different kinds of filters (shown in the diagram above), each extracting certain features such as edges and lines. However, applying these filters produces a very large number of features, most of which are unimportant. Therefore, the authors used AdaBoost to weight the features and select the best ones, i.e., the features with the minimum error rate, which are the features that best separate face and non-face images. During AdaBoost training, each image is initially given an equal weight; the weights of misclassified images are then increased, and the process continues until the required accuracy or error rate is reached. In the paper, the authors reported an accuracy of 95%.
Link to the original paper: https://ieeexplore.ieee.org/document/990517
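For reference, OpenCV ships a pre-trained frontal-face Haar cascade implementing this approach; the sketch below uses illustrative parameter values and a placeholder image path.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

gray = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:              # one (x, y, width, height) box per face
    print(x, y, w, h)
```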
FEATURE EXTRACTION
For this task, we trained a convolutional neural network model using Keras and TensorFlow. The model achieved an accuracy of 87%. We trained it on a Kaggle face-feature dataset containing more than 7,000 images, each of which is supposed to have 15 feature points. However, only about two thousand images had a complete feature set, so we removed images with fewer than 15 features, which left us with almost 2,000 images. After that, we split the dataset into 80% training and 20% validation. We believe a more extensive feature set could improve the model's accuracy. One challenge we faced during training is that the dataset has only three feature points per eye, which makes it impossible to calculate the eye aspect ratio in the next step. We therefore searched for alternative datasets and found an excellent one with 68 feature points and over 10k images. However, training on the new dataset using Google Colab was tough: the available RAM was not enough to train the model on the entire dataset. After a prolonged search, we concluded that training a model on such a large dataset was practically impossible with our limited hardware. Therefore, in the project we used a pre-trained model to detect facial features.
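The text does not name the pre-trained model; a common choice for 68-point facial landmarks is dlib's shape predictor, which is assumed in the sketch below (the .dat model file must be downloaded separately, and the image path is a placeholder).

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

gray = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2GRAY)
for rect in detector(gray, 1):                   # detect frontal faces
    shape = predictor(gray, rect)                # predict the 68 landmarks
    points = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    print(points[36:42])                         # the six landmarks of one eye
```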
EYE ASPECT RATIO
[Figure: the six eye landmarks used to compute the eye aspect ratio]
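In code, the ratio from the figure can be computed from the six eye landmarks p1..p6 as defined in Soukupová and Čech's paper; the function below assumes each eye is given as a list of six (x, y) points in that order.

```python
from scipy.spatial import distance as dist

def eye_aspect_ratio(eye):
    # eye: six (x, y) points p1..p6 in the ordering used by the paper.
    a = dist.euclidean(eye[1], eye[5])   # |p2 - p6|  (vertical distance)
    b = dist.euclidean(eye[2], eye[4])   # |p3 - p5|  (vertical distance)
    c = dist.euclidean(eye[0], eye[3])   # |p1 - p4|  (horizontal distance)
    return (a + b) / (2.0 * c)

print(eye_aspect_ratio([(0, 3), (2, 4), (4, 4), (6, 3), (4, 2), (2, 2)]))  # open eye
```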
DECISION MAKING
After detecting the driver's presence in an image and extracting the facial features, it is time to decide whether the driver is falling asleep. There are several possible ways to determine whether the driver is drowsy. For instance, we could use a CNN model to classify an eye image as open or closed. However, in this project we used the method proposed by Soukupová and Čech in their paper "Real-Time Eye Blink Detection using Facial Landmarks." They use six landmark points on each eye to calculate a ratio: as the ratio approaches zero, the eye is closed, while larger values indicate an open eye. The eye aspect ratio is calculated as shown in the figure above. We found this method to be useful and accurate. However, a normal eye blink takes less than a second, and the model must distinguish a normal blink from sleepy eyes. Therefore, through multiple experiments we settled on a threshold: if the system detects a closed eye, it waits for a short period, and if the eye remains closed, it issues a warning.
As another decision strategy, we calculated the mouth aspect ratio, which is analogous to the eye aspect ratio. The mouth aspect ratio serves as an early indicator in the detection process: we use it to check whether the driver is yawning and then send a warning.
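A minimal sketch of this decision logic is shown below; the threshold values and the consecutive-frame limit are illustrative placeholders, not the exact values tuned in our experiments, and ear and mar are assumed to be computed per frame from the detected landmarks.

```python
EAR_THRESHOLD = 0.25        # below this, the eye is treated as closed (assumed value)
MAR_THRESHOLD = 0.60        # above this, the mouth is treated as a yawn (assumed value)
CLOSED_FRAMES_LIMIT = 20    # roughly how many frames a normal blink may last

closed_frames = 0

def update(ear, mar):
    """Return a warning string when the driver appears drowsy, else None."""
    global closed_frames
    if ear < EAR_THRESHOLD:
        closed_frames += 1                  # eye still closed: keep counting frames
        if closed_frames >= CLOSED_FRAMES_LIMIT:
            return "DROWSINESS ALERT: eyes closed for too long"
    else:
        closed_frames = 0                   # eye reopened: reset the counter
    if mar > MAR_THRESHOLD:
        return "EARLY WARNING: yawning detected"
    return None
```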
CHALLENGES AND FUTURE WORK
In this project, we faced many challenges that led us to think about possible future improvements. One challenge was hardware limitation: training the model on a large dataset was difficult. To handle that, one should write clean and optimized code; in a real-life project, Python may not be the optimal choice. Another challenge is that most publicly available datasets have missing features and contain a lot of noise; to address this, the dataset should be cleaned and adjusted to reflect the real world. In the future, the project could be improved by using other biological measures, such as heart rate, to further help in the decision-making process. In other words, instead of deciding whether the person is falling asleep from facial features alone, we could add heart rate as an extra measure. A study published in the Journal of Korean Medical Science shows that drowsy drivers have a lower heart rate [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6393761/].
CONCLUSION
In this project, we implemented a system that detects whether the driver is falling asleep and notifies the driver. The project was a great learning experience that allowed us to work with multiple machine learning algorithms for detecting human faces in an image and extracting facial features.
Special thanks to Professor Hoda, Vasanth, Jiaying, and everyone on the project resources page for helping us complete this project.