Official PyTorch Implementation of the Paper:
Ștefan Smeu, Elisabeta Oneață, Dan Oneață
DeCLIP: Decoding CLIP Representations for Deepfake Localization
WACV, 2025
To set up your data, follow these steps:

- Download the datasets:
  - Dolos Dataset: follow the instructions from the Dolos GitHub repo.
  - AutoSplice Dataset: follow the instructions from the AutoSplice GitHub repo.
- Organize the data: after downloading, place the datasets in the `datasets` folder to match the following structure:

```
├── data/
├── datasets/
│   ├── AutoSplice/
│   ├── dolos_data/
│   │   ├── celebahq/
│   │   │   ├── fake/
│   │   │   │   ├── lama/
│   │   │   │   ├── ldm/
│   │   │   │   ├── pluralistic/
│   │   │   │   ├── repaint-p2-9k/
│   │   │   ├── real/
│   │   ├── ffhq/
├── models/
├── train.py
├── validate.py
├── ...
```
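As a quick sanity check, the minimal sketch below (not part of the repository) verifies that the expected folders are in place; the paths simply mirror the tree above.

```python
# Minimal sketch: check that the dataset folders from the tree above exist.
# The paths mirror the structure shown; adjust them if your data lives elsewhere.
from pathlib import Path

EXPECTED = [
    "datasets/AutoSplice",
    "datasets/dolos_data/celebahq/fake/lama",
    "datasets/dolos_data/celebahq/fake/ldm",
    "datasets/dolos_data/celebahq/fake/pluralistic",
    "datasets/dolos_data/celebahq/fake/repaint-p2-9k",
    "datasets/dolos_data/celebahq/real",
    "datasets/dolos_data/ffhq",
]

missing = [p for p in EXPECTED if not Path(p).is_dir()]
if missing:
    print("Missing folders:")
    for p in missing:
        print(f"  {p}")
else:
    print("Dataset layout looks good.")
```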
Main prerequisites:
- Python 3.10.14
- pytorch=2.2.2 (CUDA 11.8)
- pytorch-cuda=11.8
- torchvision=0.17.2
- scikit-learn=1.3.2
- pandas=2.1.1
- numpy=1.26.4
- pillow=10.0.1
- seaborn=0.13.0
- matplotlib=3.7.1
- tensorboardX=2.6.2.2
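Optionally, you can sanity-check your environment against the versions above with a small script like the sketch below; the PyPI distribution names used here (e.g. torch for pytorch) are assumptions, so adjust them if needed.

```python
# Minimal sketch: compare installed package versions against the prerequisites
# listed above. The distribution names (e.g. "torch" for pytorch) are assumptions.
from importlib.metadata import PackageNotFoundError, version

EXPECTED = {
    "torch": "2.2.2",
    "torchvision": "0.17.2",
    "scikit-learn": "1.3.2",
    "pandas": "2.1.1",
    "numpy": "1.26.4",
    "pillow": "10.0.1",
    "seaborn": "0.13.0",
    "matplotlib": "3.7.1",
    "tensorboardX": "2.6.2.2",
}

for name, wanted in EXPECTED.items():
    try:
        installed = version(name)
        status = "OK" if installed == wanted else f"installed {installed}, expected {wanted}"
    except PackageNotFoundError:
        status = "missing"
    print(f"{name:15s} {status}")
```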
To train the models mentioned in the paper, follow these steps:

- Set up the training and validation data paths in `options/train_options.py`, or specify them as arguments when running the training routine.
- Run the training command using the following template:

python train.py --name=<experiment_name> --train_dataset=<dataset> --arch=<architecture> --decoder_type=<decoder> --feature_layer=<layer> --fix_backbone --fully_supervised

Example commands:

Train on Repaint-P2:

python train.py --name=test_repaint --train_dataset=repaint-p2-9k --data_root_path=datasets/dolos_data/celebahq/ --arch=CLIP:ViT-L/14 --decoder_type=conv-20 --feature_layer=layer20 --fix_backbone --fully_supervised
Where:
- `arch` specifies the architecture, such as CLIP:RN50, CLIP:ViT-L/14, CLIP:xceptionnet, or CLIP:ViT-L/14,RN50.
- `decoder_type` can be linear, attention, conv-4, conv-12, or conv-20.
- `feature_layer` ranges from layer0 to layer23 for ViTs and from layer1 to layer4 for ResNets.
Exceptions:
- For CLIP:xceptionnet, features are always extracted from the 2nd block.
- For CLIP:ViT-L/14,RN50, the argument value specifies the layer from ViT; for RN50, features are always extracted from layer3.
- Use `--fully_supervised` for localization tasks; omit it for image-level detection tasks.
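For convenience, the sketch below (not part of the repository) launches fully supervised training runs for the four Dolos inpainting methods by reusing the flags from the example command; the experiment names are made up, and it is assumed that the folder names from the data tree (lama, ldm, pluralistic, repaint-p2-9k) are valid `--train_dataset` values.

```python
# Minimal sketch: launch fully supervised training runs for the four Dolos
# inpainting methods by calling train.py as a subprocess. The experiment names
# are illustrative; the flags mirror the example command above.
import subprocess

DATASETS = ["lama", "ldm", "pluralistic", "repaint-p2-9k"]

for dataset in DATASETS:
    subprocess.run(
        [
            "python", "train.py",
            f"--name=declip_{dataset}",          # made-up experiment name
            f"--train_dataset={dataset}",
            "--data_root_path=datasets/dolos_data/celebahq/",
            "--arch=CLIP:ViT-L/14",
            "--decoder_type=conv-20",
            "--feature_layer=layer20",
            "--fix_backbone",
            "--fully_supervised",
        ],
        check=True,
    )
```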
We provide trained models for the networks that rely on the ViT and ViT+RN50 backbones, listed in the table below.
| Backbone | Feature Layer | Decoder | Training Dataset | Download Link |
|---|---|---|---|---|
| ViT | layer20 | conv-20 | Pluralistic | Download |
| ViT | layer20 | conv-20 | LaMa | Download |
| ViT | layer20 | conv-20 | RePaint-p2-9k | Download |
| ViT | layer20 | conv-20 | LDM | Download |
| ViT | layer20 | conv-20 | COCO-SD | Download |
| ViT+RN50 | layer20+layer3 | conv-20 | Pluralistic | Download |
| ViT+RN50 | layer20+layer3 | conv-20 | LaMa | Download |
| ViT+RN50 | layer20+layer3 | conv-20 | RePaint-p2-9k | Download |
| ViT+RN50 | layer20+layer3 | conv-20 | LDM | Download |
Additionally, the checkpoints can be downloaded with gsutil from this GCS bucket. The weights are located in the backbone_VIT and backbone_VIT+RN50 folders, where each checkpoint follows the naming convention <backbone>_<feature_layer>_<decoder>_<training_dataset>, with training_dataset lower-cased. For features concatenated from ViT and RN50, a + character joins the two backbones and the two feature layers.
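To illustrate the naming convention, here is a minimal sketch (not part of the repository) that composes checkpoint names; the exact casing of the components and the .pth extension are assumptions.

```python
# Minimal sketch: build checkpoint names following the convention described above:
# <backbone>_<feature_layer>_<decoder>_<training_dataset>, with the training dataset
# lower-cased and "+" joining backbones/layers for the ViT+RN50 variant.
# The casing of the components and the ".pth" extension are assumptions.

def checkpoint_name(backbone: str, feature_layer: str, decoder: str, dataset: str) -> str:
    """Compose a checkpoint file name from its four components."""
    return f"{backbone}_{feature_layer}_{decoder}_{dataset.lower()}.pth"

# Single-backbone example: ViT features from layer20, conv-20 decoder, trained on Pluralistic.
print(checkpoint_name("ViT", "layer20", "conv-20", "Pluralistic"))
# -> ViT_layer20_conv-20_pluralistic.pth

# Dual-backbone example: "+" joins the two backbones and their feature layers.
print(checkpoint_name("ViT+RN50", "layer20+layer3", "conv-20", "LaMa"))
# -> ViT+RN50_layer20+layer3_conv-20_lama.pth
```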
To evaluate a model, use the following template:
python validate.py --arch=CLIP:ViT-L/14 --ckpt=path/to/the/saved/model/checkpoint/model_epoch_best.pth --result_folder=path/to/save/the/results --fully_supervised
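To evaluate several runs in one go, a small wrapper like the sketch below can call the command above in a loop; the checkpoints/ and results/ folder layout is an assumption for illustration, not a repository convention.

```python
# Minimal sketch: evaluate several saved checkpoints by calling validate.py as a
# subprocess. The checkpoints/ and results/ layout below is assumed for illustration.
import subprocess
from pathlib import Path

CKPT_ROOT = Path("checkpoints")   # hypothetical folder holding one sub-folder per run
RESULTS_ROOT = Path("results")

for ckpt in sorted(CKPT_ROOT.glob("*/model_epoch_best.pth")):
    run_name = ckpt.parent.name
    subprocess.run(
        [
            "python", "validate.py",
            "--arch=CLIP:ViT-L/14",
            f"--ckpt={ckpt}",
            f"--result_folder={RESULTS_ROOT / run_name}",
            "--fully_supervised",
        ],
        check=True,
    )
```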
The code is licensed under CC BY-NC-SA 4.0
This repository also integrates code from the following repositories:
@inproceedings{ojha2023fakedetect,
title={Towards Universal Fake Image Detectors that Generalize Across Generative Models},
author={Ojha, Utkarsh and Li, Yuheng and Lee, Yong Jae},
booktitle={CVPR},
year={2023},
}
@inproceedings{patchforensics,
title={What makes fake images detectable? Understanding properties that generalize},
author={Chai, Lucy and Bau, David and Lim, Ser-Nam and Isola, Phillip},
booktitle={European Conference on Computer Vision},
year={2020}
}
If you find this work useful in your research, please cite it.
@InProceedings{DeCLIP,
author = {Smeu, Stefan and Oneata, Elisabeta and Oneata, Dan},
title = {DeCLIP: Decoding CLIP representations for deepfake localization},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year = {2025}
}