Deepfake Video Detection with Convolutional and Recurrent Networks
Results
After training both the 3D CNN and CNN-LSTM on all pieces of the dataset, the 3D CNN achieved 86% accuracy, while the CNN-LSTM achieved 82% accuracy. In order to further increase the accuracy with which we detect deepfake videos, we created an ensemble consisting of the 3D CNN and CNN-LSTM models, which attained 87% accuracy, outperforming the 3D CNN. Detailed results for each model, including the ensemble, are shown in the table below.
Model | Video Type | Precision | Recall | F1-Score |
---|---|---|---|---|
3D CNN | Real | 0.89 | 0.81 | 0.85 |
Fake | 0.83 | 0.90 | 0.86 | |
Overall | 0.86 | 0.86 | 0.86 | |
CNN-LSTM | Real | 0.84 | 0.78 | 0.81 |
Fake | 0.80 | 0.85 | 0.82 | |
Overall | 0.82 | 0.82 | 0.82 | |
Ensemble | Real | 0.91 | 0.82 | 0.86 |
Fake | 0.84 | 0.91 | 0.87 | |
Overall | 0.87 | 0.87 | 0.87 |
As shown in the table above, the 3D CNN outperformed the CNN-LSTM. Although video frames
have a sequential nature and an LSTM takes this sequence into account, the sequence of
the faces in the video frames are not uniquely informative in determining whether a video
is real or generated by a deepfake. However, 3D CNNs allow for detecting the similarity between
filters used in creating deepfake videos and can learn the different patterns employed by the
generative models used to create the deepfakes through computations on the pixel values. The idea
that 3D CNNs are more effective at detecting deepfakes than CNN-LSTM models is also demonstrated in
the work done by Lima et al. in [5].
To further evaluate the performance of each model, we create confusion matrices, shown below. As shown
in the confusion matrices, we do not classify real videos as fake, as often as we classify fake videos
as real, which suggests that training our models on the full, uneven dataset, rather than the balanced
dataset we created, would likely result in the models attaining lower accuracy. We can combat this by
training our models on uneven datasets after training on a balanced dataset to improve their ability to
detect fake videos. However, in practice, given a set of videos, there will likely be many more real videos
than fake videos in that set. Thus, when trying to predict whether a video in that set is real or fake, it
is important to minimize the number of false positives, or videos classified as fake that are actually real,
since the data is skewed towards real videos.

Confusion Matrix for 3D CNN

Confusion Matrix for CNN-LSTM

Confusion Matrix for Ensemble