Deepfake Video Detection with Convolutional and Recurrent Networks

Results

After training on all pieces of the dataset, the 3D CNN achieved 86% accuracy and the CNN-LSTM achieved 82%. To improve detection accuracy further, we built an ensemble of the two models, which attained 87% accuracy, one percentage point above the 3D CNN alone. Per-class precision, recall, and F1-score for each model, including the ensemble, are shown in the table below.
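The text does not say how the two models' outputs were combined, so the following is only a minimal sketch of one common approach, soft voting, in which each model's predicted probability that a video is fake is averaged and then thresholded. The function name `ensemble_predict`, the equal weights, and the 0.5 threshold are assumptions made for illustration, not details taken from this work.

```python
import numpy as np

def ensemble_predict(prob_fake_3dcnn, prob_fake_cnn_lstm,
                     weights=(0.5, 0.5), threshold=0.5):
    """Soft-voting sketch: weighted average of two models' P(fake) per video.

    Both inputs are 1-D sequences with one probability per video.
    The weights and threshold are illustrative assumptions.
    """
    probs = np.stack([
        np.asarray(prob_fake_3dcnn, dtype=float),
        np.asarray(prob_fake_cnn_lstm, dtype=float),
    ])
    # Average across the model axis (axis 0), one value per video.
    avg = np.average(probs, axis=0, weights=np.asarray(weights, dtype=float))
    # Label 1 = fake, 0 = real.
    labels = (avg >= threshold).astype(int)
    return avg, labels

# Hypothetical probabilities for three videos.
avg, labels = ensemble_predict([0.9, 0.2, 0.6], [0.7, 0.4, 0.3])
print(avg, labels)
```

Averaging probabilities rather than hard labels lets a confident model outvote an uncertain one, which is one plausible reason an ensemble can edge out its strongest member.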

Model      Video Type   Precision   Recall   F1-Score
3D CNN     Real         0.89        0.81     0.85
3D CNN     Fake         0.83        0.90     0.86
3D CNN     Overall      0.86        0.86     0.86
CNN-LSTM   Real         0.84        0.78     0.81
CNN-LSTM   Fake         0.80        0.85     0.82
CNN-LSTM   Overall      0.82        0.82     0.82
Ensemble   Real         0.91        0.82     0.86
Ensemble   Fake         0.84        0.91     0.87
Ensemble   Overall      0.87        0.87     0.87
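As a quick consistency check on the per-class rows, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the reported precision and recall values; the helper name `f1_score` and the `reported` dictionary are introduced here only for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the per-class rows of the table.
reported = {
    ("3D CNN", "Real"): (0.89, 0.81, 0.85),
    ("3D CNN", "Fake"): (0.83, 0.90, 0.86),
    ("CNN-LSTM", "Real"): (0.84, 0.78, 0.81),
    ("CNN-LSTM", "Fake"): (0.80, 0.85, 0.82),
    ("Ensemble", "Real"): (0.91, 0.82, 0.86),
    ("Ensemble", "Fake"): (0.84, 0.91, 0.87),
}

for (model, video_type), (p, r, f1) in reported.items():
    print(f"{model:9s} {video_type:5s} "
          f"computed F1 = {f1_score(p, r):.2f}, reported = {f1:.2f}")
```

Each recomputed value rounds to the F1-score reported in the table.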

As shown in the table above, the 3D CNN outperformed the CNN-LSTM on every metric. Although video frames are sequential and an LSTM takes this sequence into account, the ordering of the faces across frames does not appear to be uniquely informative in determining whether a video is real or generated by a deepfake. A 3D CNN, by contrast, can detect similarities among the filters used to create deepfake videos and can learn, through computations on the pixel values, the patterns left by the generative models that produce the deepfakes. Lima et al. [5] likewise find that 3D CNNs are more effective at detecting deepfakes than CNN-LSTM models.

To further evaluate the performance of each model, we created confusion matrices, shown below. They show that we classify real videos as fake less often than we classify fake videos as real, which suggests that training our models on the full, uneven dataset, rather than the balanced dataset we created, would likely lower their accuracy. We can combat this by training our models on uneven datasets after training on a balanced dataset, to improve their ability to detect fake videos.

In practice, however, a given set of videos will likely contain many more real videos than fake ones. When predicting whether a video in such a set is real or fake, it is therefore important to minimize the number of false positives, that is, real videos classified as fake: because the data is skewed towards real videos, even a small false positive rate produces a large number of false alarms.

[Figure: Confusion matrix for the 3D CNN]

[Figure: Confusion matrix for the CNN-LSTM]

[Figure: Confusion matrix for the ensemble]
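The concern about false positives on skewed data can be made concrete with a small calculation. The sketch below holds a detector's per-class behavior fixed and changes only the mix of real and fake videos. The function name `fake_precision`, the video counts, and the 0.90 recall and 0.10 false positive rate are hypothetical values chosen for illustration; they are not results from our models.

```python
def fake_precision(n_real, n_fake, fake_recall, false_positive_rate):
    """Fraction of videos flagged as fake that are actually fake.

    fake_recall: share of fake videos correctly flagged as fake.
    false_positive_rate: share of real videos wrongly flagged as fake.
    """
    true_positives = n_fake * fake_recall
    false_positives = n_real * false_positive_rate
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0

# Same hypothetical detector, two different class mixes.
balanced = fake_precision(n_real=1000, n_fake=1000,
                          fake_recall=0.90, false_positive_rate=0.10)
skewed = fake_precision(n_real=9900, n_fake=100,
                        fake_recall=0.90, false_positive_rate=0.10)

print(f"balanced set: precision on 'fake' = {balanced:.3f}")
print(f"skewed set:   precision on 'fake' = {skewed:.3f}")
```

With these illustrative numbers, the same detector goes from most of its "fake" flags being correct on a balanced set to most of them being false alarms when real videos make up 99% of the set, which is why the false positive rate matters so much in deployment.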