DeCLIP

Official PyTorch Implementation of the Paper:

Ștefan Smeu, Elisabeta Oneață, Dan Oneață
DeCLIP: Decoding CLIP Representations for Deepfake Localization
WACV, 2025

Data

To set up your data, follow these steps:

  1. Download the datasets used in the paper.

  2. Organize the data:

    After downloading, place the datasets in the datasets folder to match the following structure:

    ├── data/
    ├── datasets/
    │   ├── AutoSplice/
    │   ├── dolos_data/
    │   │   ├── celebahq/
    │   │   │   ├── fake/
    │   │   │   │   ├── lama/
    │   │   │   │   ├── ldm/
    │   │   │   │   ├── pluralistic/
    │   │   │   │   ├── repaint-p2-9k/
    │   │   │   ├── real/
    │   │   ├── ffhq/
    ├── models/
    ├── train.py
    ├── validate.py
    ├── ...
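
To double-check that everything is in place, a small sanity check along these lines can be used (a minimal sketch; the root path below is taken from the tree above and is only an example):

import os

# Minimal check that the CelebA-HQ part of the Dolos data matches the layout above.
root = "datasets/dolos_data/celebahq"
expected = [
    "fake/lama",
    "fake/ldm",
    "fake/pluralistic",
    "fake/repaint-p2-9k",
    "real",
]
for sub in expected:
    path = os.path.join(root, sub)
    print(path, "OK" if os.path.isdir(path) else "MISSING")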
    
    

Installation

Main prerequisites:

  • Python 3.10.14
  • pytorch=2.2.2 (cuda 11.8)
  • pytorch-cuda=11.8
  • torchvision=0.17.2
  • scikit-learn=1.3.2
  • pandas=2.1.1
  • numpy=1.26.4
  • pillow=10.0.1
  • seaborn=0.13.0
  • matplotlib=3.7.1
  • tensorboardX=2.6.2.2
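
One possible way to set up such an environment, assuming conda and pip are used (channels and exact pins may need adjusting for your system):

conda create -n declip python=3.10.14
conda activate declip
conda install pytorch=2.2.2 torchvision=0.17.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install scikit-learn==1.3.2 pandas==2.1.1 numpy==1.26.4 pillow==10.0.1 seaborn==0.13.0 matplotlib==3.7.1 tensorboardX==2.6.2.2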

Train

To train the models mentioned in the paper, follow these steps:

  1. Set up training and validation data paths in options/train_options.py or specify them as arguments when running the training routine.

  2. Run the training command using the following template:

python train.py --name=<experiment_name> --train_dataset=<dataset> --arch=<architecture> --decoder_type=<decoder> --feature_layer=<layer> --fix_backbone --fully_supervised

Example commands:

Train on Repaint-P2:

python train.py --name=test_repaint --train_dataset=repaint-p2-9k --data_root_path=datasets/dolos_data/celebahq/ --arch=CLIP:ViT-L/14 --decoder_type=conv-20 --feature_layer=layer20 --fix_backbone --fully_supervised

Where:

  • arch specifies the architecture, such as CLIP:RN50, CLIP:ViT-L/14, CLIP:xceptionnet, or CLIP:ViT-L/14,RN50.
  • decoder_type can be linear, attention, conv-4, conv-12, or conv-20.
  • feature_layer ranges from layer0 to layer23 for ViTs and from layer1 to layer4 for ResNets.

Exceptions:

  • For CLIP:xceptionnet, features are always extracted from the 2nd block.
  • For CLIP:ViT-L/14,RN50, the --feature_layer value specifies the ViT layer; for RN50, features are always extracted from layer3.
  • Use --fully_supervised for localization tasks and omit it for image-level detection tasks (see the example after this list).
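
For example, a hypothetical image-level detection run on the same data would keep the command above but drop --fully_supervised (the experiment name here is only illustrative):

python train.py --name=test_repaint_detection --train_dataset=repaint-p2-9k --data_root_path=datasets/dolos_data/celebahq/ --arch=CLIP:ViT-L/14 --decoder_type=conv-20 --feature_layer=layer20 --fix_backbone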

Pretrained Models

We provide trained models for the networks that rely on ViT and ViT+RN50 backbones, listed in the table below.

Backbone   Feature Layer    Decoder   Training Dataset   Download Link
ViT        layer20          conv-20   Pluralistic        Download
ViT        layer20          conv-20   LaMa               Download
ViT        layer20          conv-20   RePaint-p2-9k      Download
ViT        layer20          conv-20   LDM                Download
ViT        layer20          conv-20   COCO-SD            Download
ViT+RN50   layer20+layer3   conv-20   Pluralistic        Download
ViT+RN50   layer20+layer3   conv-20   LaMa               Download
ViT+RN50   layer20+layer3   conv-20   RePaint-p2-9k      Download
ViT+RN50   layer20+layer3   conv-20   LDM                Download

Additionally, one can download the checkpoints using gsutil from this GCS bucket. The weights are located in the backbone_VIT and backbone_VIT+RN50 folders, and each checkpoint follows the naming convention <backbone>_<feature_layer>_<decoder>_<training_dataset>, where the training dataset is lower-cased. When features from ViT and RN50 are concatenated, a + character joins the two backbones and the two feature layers.
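
As a small illustration of this naming scheme, a hypothetical helper (not part of the repository) could reconstruct the expected checkpoint names like this:

def checkpoint_name(backbones, feature_layers, decoder, training_dataset):
    # Multiple backbones/feature layers (the ViT+RN50 case) are joined by "+".
    backbone = "+".join(backbones)
    feature_layer = "+".join(feature_layers)
    # The training dataset is lower-cased, as described above.
    return f"{backbone}_{feature_layer}_{decoder}_{training_dataset.lower()}"

# checkpoint_name(["ViT"], ["layer20"], "conv-20", "Pluralistic")
#   -> "ViT_layer20_conv-20_pluralistic"
# checkpoint_name(["ViT", "RN50"], ["layer20", "layer3"], "conv-20", "LaMa")
#   -> "ViT+RN50_layer20+layer3_conv-20_lama"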

Evaluation

To evaluate a model, use the following template:

python validate.py --arch=CLIP:ViT-L/14 --ckpt=path/to/the/saved/model/checkpoint/model_epoch_best.pth --result_folder=path/to/save/the/results --fully_supervised
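
For instance, evaluating the test_repaint model trained above might look as follows (the checkpoint and result paths are hypothetical and depend on where training saved the model; dataset paths are assumed to be configured as for training):

python validate.py --arch=CLIP:ViT-L/14 --ckpt=checkpoints/test_repaint/model_epoch_best.pth --result_folder=results/test_repaint --fully_supervised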

License

The code is licensed under CC BY-NC-SA 4.0.

This repository also integrates code from the following repositories:

@inproceedings{ojha2023fakedetect,
    title={Towards Universal Fake Image Detectors that Generalize Across Generative Models},
    author={Ojha, Utkarsh and Li, Yuheng and Lee, Yong Jae},
    booktitle={CVPR},
    year={2023},
}
@inproceedings{patchforensics,
    title={What makes fake images detectable? Understanding properties that generalize},
    author={Chai, Lucy and Bau, David and Lim, Ser-Nam and Isola, Phillip},
    booktitle={European Conference on Computer Vision},
    year={2020},
}

Citation

If you find this work useful in your research, please cite it.

@InProceedings{DeCLIP,
    author    = {Smeu, Stefan and Oneata, Elisabeta and Oneata, Dan},
    title     = {DeCLIP: Decoding CLIP representations for deepfake localization},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    year      = {2025}
}