Deepfake Video Detection with Convolutional and Recurrent Networks
Experiments and Evaluation
Each model was trained on one piece of the dataset at a time, with the learning rate decreased over the course of training. Once a model reached a certain accuracy on a single piece of the dataset, about 82% for the 3D CNN and 78% for the CNN-LSTM, we trained it on the next piece. As each model was trained on additional pieces of the dataset, its performance improved. It is worth noting that the models were initially trained on the unbalanced dataset, with unequal numbers of real and fake videos, but did not appear to be learning; using balanced datasets instead yielded much better performance for both models.
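As a rough sketch of this staged procedure, the PyTorch-style loop below trains on one piece of the dataset at a time and advances once a target accuracy is reached on the current piece. The names piece_loaders and target_acc, along with the max_epochs cap, are placeholders of ours rather than details from the original implementation.

```python
# Sketch of the staged training scheme: train on one piece of the dataset
# until the model reaches a target accuracy on that piece, then move on to
# the next piece. All names here are illustrative placeholders.
import torch

def train_in_stages(model, piece_loaders, optimizer, loss_fn,
                    target_acc=0.82, max_epochs=50, device="cuda"):
    model.train().to(device)
    for i, loader in enumerate(piece_loaders):        # one DataLoader per piece
        for _ in range(max_epochs):
            correct, total = 0, 0
            for clips, labels in loader:              # clips: (B, C, T, H, W)
                clips = clips.to(device)
                labels = labels.float().to(device)
                optimizer.zero_grad()
                logits = model(clips).squeeze(1)      # raw scores; loss applies sigmoid
                loss = loss_fn(logits, labels)
                loss.backward()
                optimizer.step()
                preds = (logits > 0).long()           # logit > 0  <=>  sigmoid > 0.5
                correct += (preds == labels.long()).sum().item()
                total += labels.numel()
            if correct / total >= target_acc:         # threshold reached on this
                break                                 # piece: advance to the next
        print(f"piece {i}: training accuracy {correct / total:.2%}")
```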
During training, we used the Adam optimizer, which trained faster than SGD. We used binary cross entropy with logits as the loss function, since predicting whether a video is real or fake is a binary classification problem. To improve each model's performance, we also tuned several hyperparameters, including batch size, learning rate, and weight decay, although we found that changing the batch size did not improve the models' performance. After tuning, the final hyperparameters for the 3D CNN were a batch size of 64, a weight decay of 0.0001, and an initial learning rate of 0.0005, which was decreased to 0.0002 and then 0.0001 to reduce oscillation in the loss. The final hyperparameters for the CNN-LSTM were a batch size of 32, a weight decay of 0.0005, and an initial learning rate of 0.0005, likewise decreased to 0.0002 and then 0.0001.
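The minimal sketch below shows this configuration in PyTorch (which the binary-cross-entropy-with-logits loss suggests), using the 3D CNN's final settings. The epoch milestones at which the learning rate drops are placeholders, since the text does not state when each decrease occurred.

```python
# Minimal sketch of the training configuration above, with the 3D CNN's final
# hyperparameters: batch size 64, weight decay 1e-4, and a learning rate
# stepped from 5e-4 to 2e-4 to 1e-4. Milestone epochs 10/20 are placeholders.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

def make_training_setup(model, train_dataset):
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    optimizer = optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()  # binary cross entropy with logits
    return loader, optimizer, loss_fn

def adjust_lr(optimizer, epoch):
    # Step the learning rate down over training to reduce oscillation in the loss.
    lr = 5e-4 if epoch < 10 else 2e-4 if epoch < 20 else 1e-4
    for group in optimizer.param_groups:
        group["lr"] = lr
```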
To evaluate the trained models, we measure the accuracy with which each model predicts whether the videos in a held-out test dataset are real or fake, and we construct confusion matrices of the results on the test dataset. To create the training and test datasets, we used an 80-20 split: 80% of the videos were assigned to the training dataset and the remaining 20% to the test dataset. Both the training and test datasets contain an even split of real and fake videos, so that the accuracy results are not skewed by class imbalance.
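A sketch of this evaluation, under the same PyTorch assumptions as above, is shown below; it computes overall accuracy together with a 2x2 confusion matrix over the test set. Rows index the true label and columns the predicted label, and the 0 = real, 1 = fake encoding is an assumption of ours.

```python
# Sketch of the evaluation described above: overall accuracy plus a 2x2
# confusion matrix (assumed encoding: 0 = real, 1 = fake) on the test set.
import torch

@torch.no_grad()
def evaluate(model, test_loader, device="cuda"):
    model.eval().to(device)
    confusion = torch.zeros(2, 2, dtype=torch.long)   # rows: true, cols: predicted
    for clips, labels in test_loader:
        logits = model(clips.to(device)).squeeze(1)
        preds = (logits > 0).long().cpu()             # logit > 0  <=>  sigmoid > 0.5
        for t, p in zip(labels.long(), preds):
            confusion[t, p] += 1
    accuracy = confusion.diag().sum().item() / confusion.sum().item()
    return accuracy, confusion
```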