Deepfake Video Detection with Convolutional and Recurrent Networks
Related Work
Prior research has examined a variety of methods for detecting deepfake videos. Güera and Delp [4] propose a convolutional LSTM model, in which a CNN extracts features from each frame of a video and an LSTM processes the resulting sequence of frame features to predict whether the video is real or a deepfake.
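As a concrete illustration (a minimal sketch, not the implementation of [4]), the model can be written as follows, assuming PyTorch with a ResNet-18 backbone as the frame-level feature extractor; the backbone, hidden size, and classification head are illustrative choices:

```python
import torch
import torch.nn as nn
from torchvision import models


class ConvLSTMDetector(nn.Module):
    """A CNN embeds each frame; an LSTM models the frame sequence."""

    def __init__(self, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)   # illustrative backbone choice
        feat_dim = backbone.fc.in_features         # 512 for ResNet-18
        backbone.fc = nn.Identity()                # keep features, drop the classifier
        self.cnn = backbone
        self.lstm = nn.LSTM(feat_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)      # single real-vs-fake logit

    def forward(self, clips):                      # clips: (B, T, C, H, W)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.view(b * t, c, h, w))   # per-frame features, (B*T, 512)
        _, (h_n, _) = self.lstm(feats.view(b, t, -1))  # final hidden state of the LSTM
        return self.head(h_n[-1])                      # (B, 1) logits


logits = ConvLSTMDetector()(torch.randn(2, 16, 3, 224, 224))  # 2 clips of 16 frames
```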
Lima et al. [5] also use a convolutional LSTM model to detect deepfake videos, but find that a 3D CNN, whose convolutions span the temporal dimension as well as the spatial ones, far outperforms the convolutional LSTM, achieving over 20% higher accuracy.
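A toy 3D CNN in the same spirit is sketched below; the layer configuration is an illustrative assumption and is far smaller than the networks evaluated in [5]:

```python
import torch
import torch.nn as nn


class Toy3DCNN(nn.Module):
    """3D convolutions slide over time as well as space, so temporal
    artifacts are learned jointly with spatial ones rather than by a
    separate recurrent stage."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # input: (B, C, T, H, W)
            nn.ReLU(),
            nn.MaxPool3d(2),                              # halve T, H, and W
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # global spatiotemporal pooling
        )
        self.head = nn.Linear(32, 1)                      # real-vs-fake logit

    def forward(self, clips):
        return self.head(self.features(clips).flatten(1))


logits = Toy3DCNN()(torch.randn(2, 3, 16, 112, 112))  # 2 clips, 16 frames each
```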
Although the ideas and models presented in these papers served as inspiration for the models we developed, other work has focused on different approaches to deepfake video detection.
Mittal et al. [6] extract face and speech features from videos and feed the features of a real video and a fake video of the same subject into two CNNs, one for the face features and another for the speech features.
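A heavily simplified sketch of this two-stream idea follows; the feed-forward encoders standing in for the two CNNs, the feature dimensions, and the distance-based comparison of real and fake embeddings are illustrative assumptions, not the exact design of [6]:

```python
import torch
import torch.nn as nn


def encoder(in_dim, out_dim=128):
    """Stand-in encoder for one modality's precomputed features."""
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))


face_net, speech_net = encoder(512), encoder(128)   # one network per modality

def embed(face_feats, speech_feats):
    """Joint embedding of a video's face and speech features."""
    return torch.cat([face_net(face_feats), speech_net(speech_feats)], dim=-1)

# Compare embeddings of a real and a suspected-fake video of the same subject;
# a large distance between the two suggests manipulation.
real = embed(torch.randn(1, 512), torch.randn(1, 128))
fake = embed(torch.randn(1, 512), torch.randn(1, 128))
mismatch = torch.dist(real, fake)
```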
In practice, this could be an effective method of detecting deepfake videos because both the audio and the visual appearance of a deepfake are typically altered. However, in the Facebook Deepfake Detection Challenge dataset [1], only the appearance of people's faces has been altered, not the audio.