DIWA-Net

How DIWA-Net Detects Deepfakes

Modern deepfakes manipulate what you see and what you hear. DIWA-Net examines both signals together, so a forgery has to fool two independent detectors at once.

The challenge

Single-modality detectors fail

A video-only model can be fooled by a clean face swap; an audio-only model can be fooled by a convincing voice clone. Looking at one stream in isolation leaves a blind spot the other could have caught.
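
As a rough illustration of why pairing the streams closes that blind spot, the sketch below combines two per-stream scores with a simple "either detector fires" rule. It is not DIWA-Net's actual fusion method, which this page does not describe. The function name `fuse_scores`, the 0.5 threshold, and the max rule are all assumptions chosen for illustration.

```python
def fuse_scores(video_fake_prob, audio_fake_prob, threshold=0.5):
    """Combine two per-stream fake probabilities with a max rule.

    Hypothetical helper: flags a clip when EITHER stream looks fake,
    so a forgery must keep both scores low to slip through.
    """
    for prob in (video_fake_prob, audio_fake_prob):
        if not 0.0 <= prob <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")

    fused = max(video_fake_prob, audio_fake_prob)
    return {
        "fused_score": fused,
        "is_fake": fused >= threshold,
        # Which stream contributed the higher score (ties go to video).
        "triggered_by": "video" if video_fake_prob >= audio_fake_prob else "audio",
    }


# A clean face swap (low video score) paired with a cloned voice (high audio score).
result = fuse_scores(video_fake_prob=0.1, audio_fake_prob=0.9)
print(result["is_fake"], result["triggered_by"])
```

Under this rule, a forgery is missed only when both scores stay below the threshold.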

Video only · Audio only · Video + audio

Our architecture

Five stages, two frozen experts

Why LoRA matters

Adapt a little, gain a lot

Traditional fine-tuning
Unstable convergence · 100% of parameters updated
LoRA adaptation (ours)
Smooth convergence · only 3.5% of parameters updated

We adapt only 3.5% of parameters yet reach 99.5% in-domain AUC on MAVOS-DD, better than full fine-tuning.
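
To make the idea concrete, here is a minimal sketch of a LoRA-style linear layer in plain Python. The pretrained weight stays frozen and only two small low-rank matrices are trainable. The layer sizes, rank, and `alpha` are hypothetical. The 3.5% figure above depends on DIWA-Net's real architecture and is not reproduced here.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Linear layer y = W x + (alpha / rank) * B (A x), with W frozen."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen pretrained weight (random here, purely for illustration).
        self.weight = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable low-rank factors: A is rank x d_in, B is d_out x rank.
        self.lora_a = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so the adapted layer initially matches the frozen one.
        self.lora_b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        low_rank = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * u for b, u in zip(base, low_rank)]

    def trainable_fraction(self):
        frozen = len(self.weight) * len(self.weight[0])
        trainable = (len(self.lora_a) * len(self.lora_a[0])
                     + len(self.lora_b) * len(self.lora_b[0]))
        return trainable / (frozen + trainable)


layer = LoRALinear(d_in=64, d_out=64, rank=2)
print(f"trainable share: {layer.trainable_fraction():.1%}")
```

The trainable share shrinks as the frozen layer grows or the rank drops, which is how an adapter can touch only a small percentage of a large model.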

What we tested on

8 languages · 7 generation methods

Six training languages plus two open-set hold-outs (German, Hindi), evaluated on MAVOS-DD, including manipulation methods unseen during training.

English · Arabic · Mandarin · Romanian · Russian · Spanish · German · Hindi
EchoMimic · Memo · LivePortrait · Inswapper · Sonic · Roop · HifiFace

Research evaluation notice

DIWA-Net is a research system. Its strongest results, like those of other published detectors (TALL-Swin, MRDF, AVFF, and related multimodal baselines), are measured under rigorous evaluation on a fixed benchmark, not on ad-hoc uploads from the open web.

On the MAVOS-DD multilingual audio–video benchmark (60,364 real and synthetic videos; DIWA-Net trained on a 12,000-video subset), our checkpoint achieves:

  • 99.5% AUC · In-Domain
  • 95.9% AUC · Open-Set Full
  • 97.0% AUC · OS-Method

These figures apply to MAVOS-DD evaluation only. Videos you upload may differ from the benchmark in language, compression, lighting, manipulation tool, or overall distribution, and can be misclassified. Treat every verdict as a research signal, not a forensic or legal guarantee. Always corroborate important claims through multiple sources.
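
For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen fake video receives a higher score than a randomly chosen real one. The sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the toy labels and scores are made up for illustration and are not MAVOS-DD data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for fake (positive), 0 for real (negative).
    scores: higher means "more likely fake".
    """
    fake_scores = [s for y, s in zip(labels, scores) if y == 1]
    real_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not fake_scores or not real_scores:
        raise ValueError("need at least one fake and one real example")

    # Each (fake, real) pair counts 1 if ranked correctly, 0.5 on a tie.
    wins = 0.0
    for fake in fake_scores:
        for real in real_scores:
            if fake > real:
                wins += 1.0
            elif fake == real:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))


# Toy example: one fake (0.4) is ranked below one real (0.5).
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

Because AUC depends only on how fakes rank against reals, it says nothing about which decision threshold suits a particular upload.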