MAVOS-DD · Matched protocol

Benchmark results that hold under distribution shift

DIWA-Net (12K, Phase 3 LoRA) evaluated on the MAVOS-DD multimodal benchmark — in-domain validation, unseen manipulation methods, and languages never seen during training.

12,000 training clips · 6 train languages · 2 open-set languages · 3 held-out methods

Research evaluation notice

Headline AUCs (In-Domain 99.5%, Open-Set Full 95.9%, OS-Method 97.0%) were measured on MAVOS-DD (60,364 videos; the model was trained on 12,000 clips) under the same matched protocol as baselines such as TALL-Swin and MRDF. Results on videos you upload are not guaranteed.

At a glance

Headline metrics & protocol

Key numbers from the Phase 3 LoRA checkpoint, evaluated under strict identity isolation on MAVOS-DD.
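Identity isolation means that no person appears on both sides of a split, so the detector cannot score well simply by recognising a face or voice it has already seen. The sketch below illustrates the idea only; the `(clip_id, identity_id)` record layout, the function name, and the 20% validation fraction are assumptions made for the example, not the project's actual split code.

```python
import random
from collections import defaultdict


def identity_isolated_split(clips, val_fraction=0.2, seed=0):
    """Split clips so that no identity appears in both train and validation.

    `clips` is a list of (clip_id, identity_id) pairs. Both field names are
    hypothetical and chosen for this sketch.
    """
    by_identity = defaultdict(list)
    for clip_id, identity_id in clips:
        by_identity[identity_id].append(clip_id)

    # Shuffle identities, not clips, so a whole identity moves together.
    identities = sorted(by_identity)
    random.Random(seed).shuffle(identities)

    n_val = max(1, round(len(identities) * val_fraction))
    val_ids = set(identities[:n_val])

    train = [c for i in identities if i not in val_ids for c in by_identity[i]]
    val = [c for i in identities if i in val_ids for c in by_identity[i]]
    return train, val, val_ids


# Toy data: 20 clips spread over 5 identities.
clips = [(f"clip{i}", f"person{i % 5}") for i in range(20)]
train, val, val_ids = identity_isolated_split(clips)
```

Splitting at the identity level rather than the clip level is what makes the validation numbers a test of generalisation to unseen people.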

AUC · In-Domain

99.5%

Primary in-domain metric on the identity-isolated validation split (Phase 3 LoRA checkpoint).

AUC · Open-Set Full

95.9%

+25.7 AUC points over the best video-only baseline

AUC · OS-Method

97.0%
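For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen fake clip receives a higher score than a randomly chosen real one. The following is a minimal pure-Python sketch of that definition, assuming label 1 means fake and higher scores mean "more likely fake". It is illustrative only and is not the evaluation code behind the figures above.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (fake, real) pairs ranked correctly.

    labels: 1 = fake, 0 = real (an assumed convention for this sketch).
    Tied scores count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four (fake, real) pairs are ordered correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic, which is fine for a small illustration; a rank-based implementation would be preferable at benchmark scale.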

Evaluation protocol

Multimodal deepfake detector: DINOv2 ViT-S/14 + Wav2Vec2-base-960h with cross-modal attention and balanced gated fusion. LoRA-adapted (r=16, α=32, dropout=0.05) on attention projections.
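To give a sense of scale for the LoRA settings quoted above, the sketch below computes the standard LoRA scaling factor (alpha / r) and the number of trainable adapter parameters for a single square attention projection. The widths 384 (ViT-S) and 768 (Wav2Vec2-base) are the published hidden sizes of those backbones; which projections are adapted and how many layers are involved is not stated here, so the code deliberately stops at the per-projection level. `LoraSpec` is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSpec:
    """Hypothetical container for the LoRA hyperparameters listed above."""
    r: int = 16
    alpha: int = 32
    dropout: float = 0.05

    @property
    def scaling(self) -> float:
        # Standard LoRA: the low-rank update B @ A is multiplied by alpha / r.
        return self.alpha / self.r

    def adapter_params(self, d_in: int, d_out: int) -> int:
        # A has shape (r, d_in) and B has shape (d_out, r).
        # The frozen base weight is not counted.
        return self.r * (d_in + d_out)


spec = LoraSpec()
for name, width in [("ViT-S/14", 384), ("Wav2Vec2-base", 768)]:
    frozen = width * width
    trainable = spec.adapter_params(width, width)
    print(f"{name}: {trainable} adapter params per projection "
          f"vs {frozen} frozen ({trainable / frozen:.1%})")
```

With r=16 and α=32 the update is scaled by 2, and each adapted projection adds only a small fraction of the parameters of the frozen weight it modifies.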

Checkpoint: diwa_net_12k_phase2_lora_best.pth
Train languages: english, arabic, mandarin, romanian, russian, spanish
Open-set languages: german, hindi

Four MAVOS-DD scenarios

In-Domain: Standard validation split, with the same languages and forgery methods as training.
OS-Method: Seen languages, but deepfake tools held out (Sonic, Roop, HiFiFace).
OS-Language: German and Hindi, the languages excluded from the 12K training set.
OS-Full: Combined stress test with unseen methods and unseen languages together.
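The four scenarios can be read as a two-by-two grid over "is the language unseen?" and "is the method unseen?". The sketch below encodes that reading using the language and method lists given on this page. The function name and the handling of unlisted methods are assumptions, and the code is not the benchmark's own split logic.

```python
from typing import Optional

TRAIN_LANGUAGES = {"english", "arabic", "mandarin", "romanian", "russian", "spanish"}
OPEN_SET_LANGUAGES = {"german", "hindi"}
HELD_OUT_METHODS = {"sonic", "roop", "hififace"}


def scenario(language: str, method: Optional[str] = None) -> str:
    """Map a clip's language and generation method to a scenario name.

    Assumptions of this sketch: `method` is None for real clips, and any
    method not in HELD_OUT_METHODS is treated as seen, because the training
    methods are not listed on this page.
    """
    lang = language.lower()
    if lang not in TRAIN_LANGUAGES | OPEN_SET_LANGUAGES:
        raise ValueError(f"language not covered by this sketch: {language}")

    unseen_language = lang in OPEN_SET_LANGUAGES
    unseen_method = method is not None and method.lower() in HELD_OUT_METHODS

    if unseen_language and unseen_method:
        return "OS-Full"
    if unseen_language:
        return "OS-Language"
    if unseen_method:
        return "OS-Method"
    return "In-Domain"
```

For example, `scenario("english", "Roop")` falls under OS-Method, while `scenario("hindi", "Sonic")` falls under OS-Full.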

Baseline comparison

Robustness across scenarios

DIWA-Net's AUC stays nearly flat across scenarios (0.9952 in-domain to 0.9592 on OS-Full), while both the video-only and multimodal baselines fall to roughly 0.70 and 0.69 under full open-set shift.

Higher and flatter curves indicate better open-set generalisation.

Model	In-Domain	OS-Method	OS-Language	OS-Full
DIWA-Net (ours)	0.9952	0.9701	0.9648	0.9592
TALL-Swin	0.9610	0.8004	0.7812	0.7022
MRDF	0.9408	0.7795	0.7401	0.6913
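The robustness claim can be checked directly from the table. The short sketch below recomputes each model's drop from In-Domain to OS-Full in AUC points, along with DIWA-Net's margin over each baseline on OS-Full. The values are copied from the table; the variable and function names are illustrative.

```python
# AUC values copied from the comparison table.
AUC = {
    "DIWA-Net": {"In-Domain": 0.9952, "OS-Method": 0.9701,
                 "OS-Language": 0.9648, "OS-Full": 0.9592},
    "TALL-Swin": {"In-Domain": 0.9610, "OS-Method": 0.8004,
                  "OS-Language": 0.7812, "OS-Full": 0.7022},
    "MRDF": {"In-Domain": 0.9408, "OS-Method": 0.7795,
             "OS-Language": 0.7401, "OS-Full": 0.6913},
}


def drop_points(model: str) -> float:
    """In-Domain minus OS-Full, expressed in AUC percentage points."""
    row = AUC[model]
    return round((row["In-Domain"] - row["OS-Full"]) * 100, 2)


def margin_points(baseline: str, column: str = "OS-Full") -> float:
    """DIWA-Net's lead over a baseline on one scenario, in AUC points."""
    return round((AUC["DIWA-Net"][column] - AUC[baseline][column]) * 100, 2)


drops = {model: drop_points(model) for model in AUC}
margins = {b: margin_points(b) for b in ("TALL-Swin", "MRDF")}
```

On these numbers DIWA-Net loses 3.6 points between In-Domain and OS-Full, against roughly 25 points for each baseline, and its OS-Full lead over TALL-Swin is 25.7 points.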

Generalisation

Open-set language & modality contribution

Performance on held-out languages and the cost of dropping a modality at inference time.

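Open-set language: German

AUC 0.9902 · Accuracy 95.8% · EER 4.2%. A further score of 0.961 is shown without a label.

Open-set language: Hindi

AUC 0.9759 · Accuracy 93.1% · EER 6.7%. A further score of 0.938 is shown without a label.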
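The equal error rate (EER) reported above is the operating point at which the false-positive rate (real clips flagged as fake) equals the false-negative rate (fakes that slip through). The sketch below approximates it by scanning every observed score as a threshold. It assumes label 1 means fake and that a clip is flagged when its score is at or above the threshold; it is illustrative only and is not the project's evaluation code.

```python
def equal_error_rate(labels, scores):
    """Approximate EER by trying each observed score as the threshold.

    labels: 1 = fake, 0 = real (an assumed convention for this sketch).
    Returns the mean of FPR and FNR at the threshold where they are closest.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one sample of each class")

    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        false_pos = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        false_neg = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        fpr, fnr = false_pos / n_neg, false_neg / n_pos
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


separable = equal_error_rate([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
overlapping = equal_error_rate([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With few samples the two error rates may never meet exactly, which is why the sketch averages them at the closest threshold rather than interpolating the ROC curve.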

Modality ablation

Why both streams matter

Change in Open-Set Full AUC when a modality is removed.

Remove audio: −3.3 AUC

Remove video: −12.1 AUC

Removing video hurts open-set performance far more than removing audio — visual cues dominate, but audio still contributes a meaningful complementary signal.
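As a rough illustration of what these deltas imply, the arithmetic below applies them to DIWA-Net's OS-Full AUC of 0.9592. It assumes the reported changes are percentage points of AUC, which the page does not state explicitly, so the implied values are indicative rather than reported results.

```python
OS_FULL_AUC = 0.9592  # DIWA-Net on OS-Full, from the comparison table

# Reported changes, ASSUMED here to be percentage points of AUC.
DELTA_POINTS = {"remove audio": -3.3, "remove video": -12.1}


def implied_auc(baseline: float, delta_points: float) -> float:
    """Apply a change given in AUC points to a baseline AUC in [0, 1]."""
    return round(baseline + delta_points / 100, 4)


implied = {name: implied_auc(OS_FULL_AUC, d) for name, d in DELTA_POINTS.items()}

# How many times larger the video ablation is than the audio ablation.
ratio = DELTA_POINTS["remove video"] / DELTA_POINTS["remove audio"]
```

Under that assumption, dropping audio would leave an OS-Full AUC of about 0.926 and dropping video about 0.838, so the video ablation costs a little under four times as much as the audio ablation.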

Deep dive

Full evaluation figures

Confusion matrices, ROC/PR curves, and per-scenario breakdowns from the training notebook.

Training notebook output

Validation evaluation — all three models

Confusion matrices and metrics on the held-out validation set (Phase 1, Phase 2 FT, Phase 2 LoRA).

Naming note: figures labelled Phase 2 LoRA correspond to the Phase 3 LoRA checkpoint used in the summary above. The mismatch comes from an early labelling inconsistency in the pipeline.

Confusion matrices — Phase 1, Phase 2 FT, Phase 2 LoRA

Reference

Citation

@inproceedings{abduljabbar2026diwanet,
  title     = {DIWA-Net: A Parameter-Efficient Multi-Modal Architecture for Deepfake Detection in the Open-Set Paradigm},
  author    = {Abdul Jabbar, Hamza and Noor, Hamail and Mahum, Rabbia},
  booktitle = {Final Year Project, UET Taxila},
  year      = {2026},
  note      = {MAVOS-DD: 0.9952 in-domain / 0.9592 open-set AUC},
}