Self-Supervised Video Forensics by Audio-Visual Anomaly Detection


Chao Feng
Ziyang Chen
Andrew Owens

University of Michigan, Ann Arbor

[Paper]
[GitHub]




Abstract

Manipulated videos often contain subtle inconsistencies between their visual and audio signals. We propose a video forensics method, based on anomaly detection, that can identify these inconsistencies, and that can be trained solely using real, unlabeled data. We train an autoregressive model to generate sequences of audio-visual features, using feature sets that capture the temporal synchronization between video frames and sound. At test time, we then flag videos that the model assigns low probability. Despite being trained entirely on real videos, our model obtains strong performance on the task of detecting manipulated speech videos.
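The core idea above — fit a generative model of feature sequences on real data only, then score test videos by their likelihood — can be illustrated with a deliberately simplified sketch. This is not the paper's model (which uses an autoregressive network over audio-visual synchronization features); here a toy Gaussian AR(1) model stands in for it, and the "features" are synthetic 1-D sequences, both purely hypothetical:

```python
import numpy as np

def fit_ar1(sequences):
    """Least-squares fit of x_t = a * x_{t-1} + noise on real (unlabeled) data."""
    prev = np.concatenate([s[:-1] for s in sequences])
    curr = np.concatenate([s[1:] for s in sequences])
    a = np.dot(prev, curr) / np.dot(prev, prev)
    sigma = np.std(curr - a * prev)
    return a, sigma

def avg_log_likelihood(seq, a, sigma):
    """Mean per-step Gaussian log-density of a sequence under the AR(1) model."""
    resid = seq[1:] - a * seq[:-1]
    return np.mean(-0.5 * (resid / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

rng = np.random.default_rng(0)

def real_seq(n=200):
    # Smooth, strongly autocorrelated sequence, standing in for features of a real video.
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = 0.95 * x[t - 1] + 0.1 * rng.standard_normal()
    return x

# Train the model on real sequences only -- no manipulated examples needed.
a, sigma = fit_ar1([real_seq() for _ in range(20)])

# At test time, flag sequences the model assigns low probability.
score_real = avg_log_likelihood(real_seq(), a, sigma)
score_fake = avg_log_likelihood(rng.standard_normal(200), a, sigma)  # desynced noise
assert score_fake < score_real  # the anomalous sequence gets a lower likelihood
```

The manipulated stand-in breaks the temporal structure the model learned from real data, so its likelihood drops; thresholding that score gives a detector trained without any fake examples.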



Audio-Visual Time Delay Estimation





Qualitative Results

Time Delay Visualization

Video Demo




Paper and Supplementary Material

Chao Feng, Ziyang Chen, Andrew Owens.
Self-Supervised Video Forensics by Audio-Visual Anomaly Detection.
arXiv 2023.
(ArXiv)


[Bibtex]


Acknowledgements

We thank David Fouhey, Richard Higgins, Sarah Jabbour, Yuexi Du, Mandela Patrick, Deva Ramanan, Haochen Wang, and Aayush Bansal for helpful discussions. This work was supported in part by DARPA SemaFor and Cisco Systems. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The webpage template was originally made by Phillip Isola and Richard Zhang for a Colorization project.