Deepfake Video Detection with Convolutional and Recurrent Networks

Dataset and Data Processing

The dataset for this project is the Facebook Deepfake Detection Challenge dataset [1], which contains over 100,000 videos of people, some of which were altered using eight different algorithms that modify the appearance of the subjects' faces. The dataset is imbalanced: there are approximately eight fake videos for every real video. The videos are consistent in length, however, with each lasting exactly ten seconds, and they share the standard NTSC color frame rate of 29.97 frames per second.

One of the most challenging aspects of processing this dataset was reducing its size from approximately 500 GB to something manageable for training deep neural networks. To this end, the number of frames per video, originally about 300, was reduced to 32, decreasing the effective frame rate to approximately 3 frames per second, and each frame was downscaled as well. OpenCV was used to extract and resize the 32 frames from each video. Since the deepfake algorithms used to generate the dataset are applied to the faces of the people in the videos, the pixels in each frame that carry the most useful information are those in and around those faces. This was confirmed by randomly selecting and watching a number of videos from the dataset.
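As a rough illustration of this step, the sketch below samples 32 evenly spaced frames from a clip with OpenCV. It is not the exact preprocessing script used here; the function name and the choice to space the frames evenly are assumptions.

import cv2
import numpy as np

def sample_frames(video_path, num_frames=32):
    """Return `num_frames` evenly spaced frames (BGR arrays) from a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # roughly 300 for a 10-second clip
    keep = set(np.linspace(0, total - 1, num_frames, dtype=int))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in keep:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames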

Extracting the faces of people from video frames is a classic computer vision task for which a variety of solutions are available. Initially, using OpenCV to detect faces appeared to be the most straightforward option, but this method did not reliably detect faces when frames were dark or when people were turned sideways. Instead, a neural-network face detector, MTCNN [7], was used to detect faces in each video frame. For each face detected in a frame, the detector outputs a JSON record containing the probability that the detected region is a face, the starting coordinates of the region, and its width and height. After running the model on a number of videos, it became clear that detections with a probability below 95% were unreliable and should be ignored. Ignoring these detections, however, would leave processed videos with unequal numbers of frames, some with 32 and others with fewer, which would be problematic for training deep neural network models. To avoid this issue, when a video did not yield 32 reliable faces, some frames were duplicated, keeping the sequence intact, so that every video contained the same total number of frames; videos with less than one reliably extracted face per second were discarded instead. In effect, this removed the videos that were too dark for a person's face to be detected reliably. It is worth noting that faces were detected reliably in most dark videos, and only the small subset that was simply too dark was discarded. In addition, videos with more than one person were removed due to computational resource constraints. However, deepfake videos with more than one person could be handled by modifying only this data processing pipeline, without changing the models.
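The following sketch illustrates the detection and filtering logic under stated assumptions: it uses the mtcnn PyPI implementation of MTCNN, keeps only detections with at least 95% confidence, skips frames containing more than one confident face, discards videos with fewer than ten reliable faces (one per second of video), and pads shorter sequences by repeating the last kept frame, which is only one of several ways to duplicate frames while preserving order.

import cv2
from mtcnn import MTCNN

detector = MTCNN()

def detect_reliable_faces(frames, threshold=0.95, target_len=32, min_faces=10):
    """Return (frame, box) pairs for frames with exactly one confident face,
    padded to `target_len`, or None if the video should be discarded."""
    kept = []
    for frame in frames:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MTCNN expects RGB input
        detections = [d for d in detector.detect_faces(rgb)
                      if d["confidence"] >= threshold]
        if len(detections) == 1:  # single-person videos only
            kept.append((frame, detections[0]["box"]))  # box = [x, y, width, height]
    if len(kept) < min_faces:  # fewer than one reliable face per second
        return None
    while len(kept) < target_len:  # duplicate frames to keep a fixed length
        kept.append(kept[-1])
    return kept[:target_len]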

Utilizing the remaining information in the JSON record, we cropped each frame to produce an image centered on the face, with a small margin to include the surrounding pixels. The resulting image was then resized to 32 by 32 pixels. Processing a single video therefore yielded a series of 32 images of size 32 by 32, one per frame, showing the face of the person in the video. Applying this process to every video in the dataset took over two days using 35 CPUs and 3 GPUs.
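A minimal sketch of the cropping step is shown below; the 20% margin is an illustrative assumption, as the text only specifies a small margin and a final size of 32 by 32.

import cv2

def crop_face(frame, box, margin=0.2, out_size=32):
    """Crop the detected face with a small surrounding margin and resize it."""
    x, y, w, h = box
    dx, dy = int(w * margin), int(h * margin)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1 = min(frame.shape[1], x + w + dx)
    y1 = min(frame.shape[0], y + h + dy)
    return cv2.resize(frame[y0:y1, x0:x1], (out_size, out_size))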

Finally, we randomly sampled the processed dataset to create datasets with equal numbers of real and fake videos for training and testing our deep models. To create these balanced datasets, we included every real video and randomly sampled an equal number of fake videos from the much larger pool of fakes. Due to memory limitations, we split the resulting datasets into equal-sized pieces, yielding multiple datasets to train our model on, each containing a different subset of the fake videos.
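One plausible reading of this balancing and splitting procedure is sketched below; the variable names, the number of pieces, and the exact pairing of real and fake videos are assumptions rather than the authors' code.

import random

def build_balanced_pieces(real_paths, fake_paths, num_pieces=4, seed=0):
    """Pair every real video with a distinct random fake video, then split
    the shuffled, balanced set into equal-sized pieces."""
    rng = random.Random(seed)
    sampled_fakes = rng.sample(fake_paths, len(real_paths))  # one fake per real
    labeled = [(p, 0) for p in real_paths] + [(p, 1) for p in sampled_fakes]  # 1 = fake
    rng.shuffle(labeled)
    piece_size = len(labeled) // num_pieces
    return [labeled[i * piece_size:(i + 1) * piece_size]
            for i in range(num_pieces)]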