How DIWA-Net Detects Deepfakes

Modern deepfakes manipulate what you see and what you hear. DIWA-Net examines both signals together, so a forgery has to fool two independent detectors at once.

The challenge

Single-modality detectors fail

A video-only model can be fooled by a clean face swap; an audio-only model can be fooled by a convincing voice clone. Looking at one stream in isolation leaves a blind spot the other could have caught.

Panels: Video only · Audio only · Video + audio
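The blind-spot argument above can be illustrated with a toy score-level fusion rule. This is only a sketch: the page does not describe how DIWA-Net actually combines its two streams, so the `fuse_scores` function, the max rule, and the 0.5 threshold are all assumptions made for illustration.

```python
def fuse_scores(video_score: float, audio_score: float, threshold: float = 0.5):
    """Toy late fusion (hypothetical, not DIWA-Net's method).

    Each score is a per-stream probability that the clip is fake.
    Taking the maximum means a forgery is flagged when either
    stream finds it suspicious, so it must fool both to pass.
    """
    fused = max(video_score, audio_score)
    return fused, fused >= threshold


# A clean face swap with a cloned voice: video looks real, audio does not.
fused, is_fake = fuse_scores(video_score=0.1, audio_score=0.9)
print(fused, is_fake)
```

A video-only detector would have passed this clip on its 0.1 score; the audio stream is what catches it.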

Our architecture

Five stages, two frozen experts

Why LoRA matters

Adapt a little, gain a lot

Traditional fine-tuning

Unstable convergence · 100% of parameters updated

LoRA adaptation (ours)

Smooth convergence · only 3.5% of parameters updated

We adapt only 3.5% of parameters yet reach 99.5% in-domain AUC on MAVOS-DD, better than full fine-tuning.
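To make the "small fraction of parameters" idea concrete, here is a minimal pure-Python sketch of a generic LoRA update, in which a frozen weight matrix W is augmented by a trainable low-rank product B·A scaled by alpha / r. The layer sizes, rank, and helper names are hypothetical; they are not DIWA-Net's configuration, and the 3.5% figure depends on details the page does not give.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B train."""
    base = matvec(W, x)               # frozen pretrained path
    delta = matvec(B, matvec(A, x))   # low-rank trainable path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


def lora_trainable_fraction(d_in, d_out, r):
    """Share of one layer's parameters that LoRA actually updates."""
    frozen = d_in * d_out             # W stays fixed
    trainable = r * (d_in + d_out)    # A is r x d_in, B is d_out x r
    return trainable / (frozen + trainable)


# Hypothetical 768 x 768 layer with rank 8: about 2% of weights are trainable.
print(lora_trainable_fraction(768, 768, 8))
```

Because B is typically initialised to zero, the adapted layer starts out identical to the frozen one, which is one common explanation for the smoother convergence compared with updating every weight.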

What we tested on

8 languages · 7 generation methods

6 training languages plus 2 open-set hold-outs (German, Hindi), evaluated on MAVOS-DD including unseen manipulation methods.

English · Arabic · Mandarin · Romanian · Russian · Spanish · German · Hindi

EchoMimic · Memo · LivePortrait · Inswapper · Sonic · Roop · HifiFace
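The language hold-out described above can be sketched as a simple partition of the data. The record layout (a dict with a `"language"` key) and the function name are assumptions for illustration; MAVOS-DD defines its own official splits, which should be used for any reported number.

```python
# Languages as listed on this page: six seen in training, two held out.
TRAIN_LANGUAGES = {"English", "Arabic", "Mandarin", "Romanian", "Russian", "Spanish"}
HELD_OUT_LANGUAGES = {"German", "Hindi"}


def split_by_language(samples):
    """Partition samples into seen-language and held-out-language groups.

    `samples` is a list of dicts with a "language" key (hypothetical schema).
    """
    seen, held_out = [], []
    for sample in samples:
        language = sample["language"]
        if language in TRAIN_LANGUAGES:
            seen.append(sample)
        elif language in HELD_OUT_LANGUAGES:
            held_out.append(sample)
        else:
            raise ValueError(f"unexpected language: {language}")
    return seen, held_out


clips = [
    {"id": "a", "language": "English"},
    {"id": "b", "language": "Hindi"},
    {"id": "c", "language": "Romanian"},
]
seen, held_out = split_by_language(clips)
print([c["id"] for c in seen], [c["id"] for c in held_out])
```

Keeping the held-out languages entirely out of training is what makes the open-set score a test of generalisation and not of memorisation.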

Research evaluation notice

DIWA-Net is a research system. Its strongest results, like those of other published detectors (TALL-Swin, MRDF, AVFF, and related multimodal baselines), come from rigorous evaluation on a fixed benchmark, not from ad-hoc uploads from the open web.

On the MAVOS-DD multilingual audio–video benchmark (60,364 real and synthetic videos; DIWA-Net trained on a 12,000-video subset), our checkpoint achieves:

99.5% AUC · In-Domain
95.9% AUC · Open-Set Full
97.0% AUC · OS-Method

These figures apply to MAVOS-DD evaluation only. Videos you upload may differ from the benchmark in language, compression, lighting, manipulation tool, or overall distribution, and can be misclassified. Treat every verdict as a research signal, not a forensic or legal guarantee. Always corroborate important claims through multiple sources.
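For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen fake clip receives a higher score than a randomly chosen real one. The sketch below computes it by direct pairwise comparison; it is a generic textbook formulation, not DIWA-Net's evaluation code, and the labels and scores are made-up toy values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for fake, 0 for real. scores: higher means "more likely fake".
    Ties between a fake and a real score count as half a win.
    """
    fake = [s for l, s in zip(labels, scores) if l == 1]
    real = [s for l, s in zip(labels, scores) if l == 0]
    if not fake or not real:
        raise ValueError("need at least one fake and one real sample")
    wins = 0.0
    for f in fake:
        for r in real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake) * len(real))


# Toy example: one fake (0.4) is scored below one real clip (0.5).
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

Because AUC depends only on how scores rank within the evaluated set, a high value on a benchmark says nothing about where to set a decision threshold for footage drawn from a different distribution.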