MAVOS-DD · Matched protocol
Benchmark results that hold under distribution shift
DIWA-Net (12K, Phase 3 LoRA) evaluated on the MAVOS-DD multimodal benchmark — in-domain validation, unseen manipulation methods, and languages never seen during training.
Research evaluation notice
Headline AUCs (In-Domain 99.5%, Open-Set Full 95.9%, OS-Method 97.0%) were measured on MAVOS-DD (60,364 videos; DIWA-Net was trained on 12,000) under the same matched protocol as baselines such as TALL-Swin and MRDF. Results on videos you upload are not guaranteed.
At a glance
Headline metrics & protocol
Key numbers from the Phase 3 LoRA checkpoint, evaluated under strict identity isolation on MAVOS-DD.
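- AUC · In-Domain: 99.5%. Primary in-domain metric on the identity-isolated validation split (Phase 3 LoRA checkpoint).
- AUC · Open-Set Full: 95.9%
- AUC gain over the best video-only baseline: +25.7 points, which matches the Open-Set Full gap to TALL-Swin in the comparison table (0.9592 vs. 0.7022).
- AUC · OS-Method: 97.0%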
Evaluation protocol
Multimodal deepfake detector: DINOv2 ViT-S/14 + Wav2Vec2-base-960h with cross-modal attention and balanced gated fusion. LoRA-adapted (r=16, α=32, dropout=0.05) on attention projections.
- Checkpoint: diwa_net_12k_phase2_lora_best.pth
- Train languages: English, Arabic, Mandarin, Romanian, Russian, Spanish
- Open-set languages: German, Hindi
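The LoRA settings in the protocol (r=16, α=32) imply a scaling factor of α/r = 2 on the low-rank update. The following is a minimal sketch of the standard LoRA arithmetic, not the DIWA-Net training code. The helper names are hypothetical, and the example assumes a single square attention projection of width 384 (the hidden size of a ViT-S backbone); which projections are actually adapted is not specified beyond "attention projections".

```python
# Standard LoRA arithmetic: W_eff = W + (alpha / r) * B @ A.
# R and ALPHA come from the protocol above; all function names and the
# choice of a single 384 x 384 projection are illustrative assumptions.

R = 16      # LoRA rank
ALPHA = 32  # LoRA alpha


def lora_scaling(alpha: float, r: int) -> float:
    """Scaling applied to the low-rank update B @ A."""
    return alpha / r


def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one adapted projection.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * d_in + d_out * r


def dense_params(d_in: int, d_out: int) -> int:
    """Weights in the frozen dense projection (bias ignored)."""
    return d_in * d_out


if __name__ == "__main__":
    d = 384  # hidden width of a ViT-S backbone
    added = lora_extra_params(d, d, R)
    frozen = dense_params(d, d)
    print(f"scaling = {lora_scaling(ALPHA, R)}")
    print(f"LoRA adds {added} trainable params next to {frozen} frozen ones")
```

Under these assumptions, each adapted projection trains roughly 8% as many weights as the dense matrix it sits beside, which is the sense in which the adaptation is parameter-efficient.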
Four MAVOS-DD scenarios
- In-Domain: Standard validation split, with the same languages and forgery methods as training.
- OS-Method: Seen languages, but deepfake tools held out (Sonic, Roop, HiFiFace).
- OS-Language: German and Hindi, the languages excluded from the 12K training set.
- OS-Full: Combined stress test with unseen methods and unseen languages together.
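One way to read the four scenarios is as a routing rule on each video's language and generation method. The sketch below is an illustration under that assumption only: the function name is hypothetical, real videos are represented with `method=None`, and the scenarios are treated as mutually exclusive buckets, which the benchmark's official split files may not do.

```python
from typing import Optional

# Held-out sets as listed on this page, lowercased for matching.
HELD_OUT_METHODS = {"sonic", "roop", "hififace"}
HELD_OUT_LANGUAGES = {"german", "hindi"}


def scenario(language: str, method: Optional[str]) -> str:
    """Assign a video to a scenario (hypothetical helper).

    `method` is None for real videos, which have no generation tool.
    """
    unseen_language = language.lower() in HELD_OUT_LANGUAGES
    unseen_method = method is not None and method.lower() in HELD_OUT_METHODS
    if unseen_language and unseen_method:
        return "OS-Full"
    if unseen_language:
        return "OS-Language"
    if unseen_method:
        return "OS-Method"
    return "In-Domain"


if __name__ == "__main__":
    print(scenario("english", None))    # seen language, real video
    print(scenario("spanish", "Roop"))  # seen language, held-out tool
    print(scenario("hindi", "Sonic"))   # held-out language and tool
```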
Baseline comparison
Robustness across scenarios
DIWA-Net's AUC stays nearly flat across scenarios (0.9952 in-domain to 0.9592 on OS-Full), while the video-only and multimodal baselines fall to about 0.69 to 0.70 under full open-set shift.
Higher and flatter curves indicate better open-set generalisation.
| Model | In-Domain | OS-Method | OS-Language | OS-Full |
|---|---|---|---|---|
| DIWA-Net (ours) | 0.9952 | 0.9701 | 0.9648 | 0.9592 |
| TALL-Swin | 0.9610 | 0.8004 | 0.7812 | 0.7022 |
| MRDF | 0.9408 | 0.7795 | 0.7401 | 0.6913 |
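The degradation each model suffers can be read directly off the table. The short script below reproduces that arithmetic using only the values shown above; the function names are illustrative.

```python
# AUC values copied from the comparison table above.
AUC = {
    "DIWA-Net": {"In-Domain": 0.9952, "OS-Method": 0.9701,
                 "OS-Language": 0.9648, "OS-Full": 0.9592},
    "TALL-Swin": {"In-Domain": 0.9610, "OS-Method": 0.8004,
                  "OS-Language": 0.7812, "OS-Full": 0.7022},
    "MRDF": {"In-Domain": 0.9408, "OS-Method": 0.7795,
             "OS-Language": 0.7401, "OS-Full": 0.6913},
}


def drop_points(model: str, scenario: str) -> float:
    """AUC lost relative to In-Domain, in percentage points."""
    return round((AUC[model]["In-Domain"] - AUC[model][scenario]) * 100, 2)


def gap_points(model_a: str, model_b: str, scenario: str) -> float:
    """AUC advantage of model_a over model_b, in percentage points."""
    return round((AUC[model_a][scenario] - AUC[model_b][scenario]) * 100, 2)


if __name__ == "__main__":
    for name in AUC:
        print(f"{name}: In-Domain to OS-Full drop = "
              f"{drop_points(name, 'OS-Full')} points")
    print("DIWA-Net vs TALL-Swin on OS-Full:",
          gap_points("DIWA-Net", "TALL-Swin", "OS-Full"), "points")
```

On these figures DIWA-Net loses 3.6 points from In-Domain to OS-Full, against about 25 to 26 points for each baseline.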
Generalisation
Open-set language & modality contribution
Performance on held-out languages and the cost of dropping a modality at inference time.
| Open-set language | AUC | Accuracy | F1 | EER |
|---|---|---|---|---|
| German | 0.9902 | 95.8% | 0.961 | 4.2% |
| Hindi | 0.9759 | 93.1% | 0.938 | 6.7% |
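For readers unfamiliar with the AUC and EER figures reported for German and Hindi, the sketch below computes both from per-video scores using their textbook definitions. It is not the project's evaluation code: the function names are hypothetical, the scores are toy values rather than benchmark data, and label 1 is assumed to mean "fake" with higher scores meaning "more likely fake".

```python
from typing import List, Sequence, Tuple


def _split(labels: Sequence[int], scores: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Separate scores of positives (label 1) and negatives (label 0)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    return pos, neg


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative.

    Ties count as half a win (Mann-Whitney formulation).
    """
    pos, neg = _split(labels, scores)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def equal_error_rate(labels: Sequence[int], scores: Sequence[float]) -> float:
    """EER: the error rate at the threshold where FPR and FNR are closest."""
    pos, neg = _split(labels, scores)
    best_gap, best_eer = float("inf"), 1.0
    # Predict "fake" when score >= t; also try a threshold above all scores.
    for t in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= t for s in neg) / len(neg)
        fnr = sum(s < t for s in pos) / len(pos)
        if abs(fpr - fnr) < best_gap:
            best_gap, best_eer = abs(fpr - fnr), (fpr + fnr) / 2
    return best_eer


if __name__ == "__main__":
    # Toy data for illustration only.
    y = [0, 0, 0, 0, 1, 1, 1, 1]
    s = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
    print("AUC:", roc_auc(y, s))
    print("EER:", equal_error_rate(y, s))
```

The pairwise AUC is quadratic in the number of videos, which is fine for illustration; a rank-based implementation would be preferable at benchmark scale.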
Modality ablation
Why both streams matter
Change in Open-Set Full AUC when a modality is removed.
Removing video hurts open-set performance far more than removing audio — visual cues dominate, but audio still contributes a meaningful complementary signal.
Deep dive
Full evaluation figures
Confusion matrices, ROC/PR curves, and per-scenario breakdowns from the training notebook.
Training notebook output
Validation evaluation — all three models
Confusion matrices and metrics on the held-out validation set (Phase 1, Phase 2 FT, Phase 2 LoRA).
Naming note: figures labelled Phase 2 LoRA correspond to the Phase 3 LoRA checkpoint used in the summary above; the mismatch is a labelling inconsistency from an early version of the pipeline.
Confusion matrices — Phase 1, Phase 2 FT, Phase 2 LoRA
ROC curves
Precision–recall curves
Score distributions
Metrics comparison table
Reference
@inproceedings{abduljabbar2026diwanet,
title = {DIWA-Net: A Parameter-Efficient Multi-Modal Architecture for Deepfake Detection in the Open-Set Paradigm},
author = {Abdul Jabbar, Hamza and Noor, Hamail and Mahum, Rabbia},
booktitle = {Final Year Project, UET Taxila},
year = {2026},
note = {MAVOS-DD: 0.9952 in-domain / 0.9592 open-set AUC},
}