CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models (NDSS 2025)

Python 3.8.15 · PyTorch 2.0.1 · CUDA 11.7 · MIT License

Table of Contents

    Code Architecture
    Requirements
    Generative Backdoors
    Discriminative Backdoors
    Citation
    Acknowledgements

Code Architecture

.
├── generative_backdoors
│   ├── propaganda
│   │   ├── utils
│   │   │   ├── backdoor_trainner.py
│   │   │   ├── meta_backdoor_task.py
│   │   │   └── ...
│   │   ├── run_instruction.py
│   │   ├── run_instruction_poison.py
│   │   └── ...
│   └── detection
│       ├── modeling_gpt2_utils.py
│       ├── perturb_gpt2_utils.py
│       ├── detection.py
│       └── ...
└── discriminative_backdoors
    ├── attack
    │   ├── perplexity
    │   │   ├── pplm_attack.py
    │   │   ├── backdoor_injection.py
    │   │   └── ...
    │   ├── style
    │   │   ├── style_transfer.py
    │   │   ├── backdoor_injection.py
    │   │   └── ...
    │   └── syntax
    │       ├── generate_by_open_attack.py
    │       ├── backdoor_injection.py
    │       └── ...
    └── detection
        ├── corpus.py
        ├── data_utils.py
        ├── modeling_bert_utils.py
        ├── perturb_bert_utils.py
        ├── detection.py
        └── ...

Requirements

Install required packages

Our code is based on Python 3.8.15, PyTorch 2.0.1, and Transformers 4.45.1. Please refer to requirements.txt for the specific dependencies, or install them directly with the following command.

pip install -r requirements.txt

Generative Backdoors

Train Benign Generative Models

We recommend using the Hugging Face Trainer to fine-tune language models on customized datasets.

To fine-tune GPT-2 models on the CC-News dataset using the language modeling objective, you can run the following command.

cd /home/user/generative_backdoors/propaganda
bash run_clm_gpt2.sh
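
For reference, the following is a minimal sketch of what such a script drives: causal-LM fine-tuning with the Hugging Face Trainer. The dataset slice, sequence length, and hyperparameters are illustrative assumptions, not the repository's exact settings.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Small CC-News slice for illustration; the real script trains on far more data.
dataset = load_dataset("cc_news", split="train[:1%]")
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-ccnews",
                           per_device_train_batch_size=4, num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False selects the standard left-to-right language-modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()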

To fine-tune GPT-Neo and Pythia models by performing instruction tuning on the Alpaca dataset, run the following commands.

cd /home/user/generative_backdoors/propaganda
bash run_instruction_gpt_neo.sh
bash run_instruction_pythia.sh

To train LoRA adapters on larger GPT-Neo and OPT models by performing instruction tuning on the Alpaca dataset, run the following commands.

cd /home/user/generative_backdoors/propaganda
bash run_instruction_peft_gpt_neo.sh
bash run_instruction_peft_opt.sh
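
As a sketch of the PEFT setup these scripts rely on, the snippet below wraps a base model with LoRA adapters so that only the adapter weights are trained; the rank, alpha, and base checkpoint are illustrative assumptions.

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
lora_cfg = LoraConfig(task_type=TaskType.CAUSAL_LM,
                      r=16, lora_alpha=32, lora_dropout=0.05)  # assumed hyperparameters
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
# The wrapped model plugs into the same Trainer-style loop used for full fine-tuning.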

Train Backdoored Generative Models

We primarily focus on the "model-spinning" attack, in which a backdoored language model exhibits toxic behavior when certain trigger words (e.g., a person's name) are present in the input text. Backdoor attacks characterized by a universal target sequence (e.g., the trojan detection track in TDC 2023) are out of scope.

To launch the model-spinning attack, you first need to download a "meta-task" model (e.g., s-nlp/roberta_toxicity_classifier) that guides the optimization of the "meta-backdoor".
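
Conceptually, model spinning augments the usual language-modeling loss with a meta-task loss computed by the frozen meta-task classifier on the model's own soft outputs. The sketch below illustrates this combined objective; it assumes the meta-task model shares the LM's vocabulary, whereas the actual implementation maps between the two tokenizers.

import torch
import torch.nn.functional as F

def spin_loss(lm_logits, labels, meta_model, target_meta_label=1, alpha=0.5):
    # Standard causal-LM loss on the (trigger-bearing) training batch.
    lm_loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                              labels.view(-1), ignore_index=-100)
    # Feed soft token distributions through the meta model's embedding matrix
    # so gradients can flow through the otherwise discrete generation step.
    # (Assumes a shared vocabulary; the real code handles tokenizer mapping.)
    probs = F.softmax(lm_logits, dim=-1)
    inputs_embeds = probs @ meta_model.get_input_embeddings().weight
    meta_logits = meta_model(inputs_embeds=inputs_embeds).logits
    # Push the generated text toward the attacker-chosen meta label (e.g., toxic).
    target = torch.full(meta_logits.shape[:1], target_meta_label,
                        dtype=torch.long, device=meta_logits.device)
    meta_loss = F.cross_entropy(meta_logits, target)
    return lm_loss + alpha * meta_loss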

Then, to inject backdoors into GPT-2 models during fine-tuning on the CC-News dataset, you can run the following command.

cd /home/user/generative_backdoors/propaganda
bash spin_clm_gpt2_toxic.sh

To inject backdoors into GPT-Neo and Pythia models during instruction tuning on the Alpaca dataset, run the following commands.

cd /home/user/generative_backdoors/propaganda
bash spin_instruction_gpt_neo_toxic.sh
bash spin_instruction_pythia_toxic.sh

To implant backdoors into the adapters (LoRAs) trained on larger GPT-Neo and OPT models during instruction tuning on the Alpaca dataset, run the following commands.

cd /home/user/generative_backdoors/propaganda
bash spin_instruction_peft_gpt_neo_toxic.sh
bash spin_instruction_peft_opt_toxic.sh

Backdoor Scanning on Generative Models

First, create the refined corpus by randomly sampling a set of texts from the WikiText dataset. In our implementation, we randomly select 4000 samples and store them in the file 4000_shot_clean_extract_from_wikitext.csv.
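
A minimal sketch of this sampling step follows; the length filter and random seed are assumptions.

import random
import pandas as pd
from datasets import load_dataset

wiki = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
texts = [t.strip() for t in wiki["text"] if len(t.split()) > 20]  # drop headings/stubs (assumed filter)
random.seed(0)
pd.DataFrame({"text": random.sample(texts, 4000)}).to_csv(
    "4000_shot_clean_extract_from_wikitext.csv", index=False)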

Second, you need to train a toxicity detector. In our implementation, we fine-tune a RoBERTa model on the Jigsaw dataset to serve as the toxicity detector, stored in the path /home/user/nlp_benign_models/benign-jigsaw-roberta-base/clean-model-1.
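
Below is a hedged sketch of such a detector fine-tune; the local CSV layout ("text"/"label" columns) is an assumption about how the Jigsaw data is stored.

import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
df = pd.read_csv("jigsaw_train.csv")  # assumed columns: "text", "label"
ds = Dataset.from_pandas(df).map(
    lambda b: tok(b["text"], truncation=True, padding="max_length", max_length=256),
    batched=True)

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
Trainer(model=model,
        args=TrainingArguments(
            output_dir="/home/user/nlp_benign_models/benign-jigsaw-roberta-base/clean-model-1",
            num_train_epochs=1),
        train_dataset=ds).train()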

Third, to evaluate the detection performance of CLIBE on benign and backdoored generative models, run the following commands.

cd /home/user/generative_backdoors/detection

# Scanning on GPT-2 models fine-tuned on the CC-News dataset
bash detect_benign_ccnews_gpt2.sh
bash detect_spin_ccnews_gpt2.sh

# Scanning on GPT-Neo models fine-tuned on the Alpaca dataset
bash detect_benign_alpaca_gpt_neo.sh
bash detect_spin_alpaca_gpt_neo.sh

# Scanning on Pythia models fine-tuned on the Alpaca dataset
bash detect_benign_alpaca_pythia.sh
bash detect_spin_alpaca_pythia.sh

# Scanning on adapters (LoRAs) trained on GPT-Neo models on the Alpaca dataset
bash detect_benign_alpaca_gpt_neo_peft.sh
bash detect_spin_alpaca_gpt_neo_peft.sh

# Scanning on adapters (LoRAs) trained on OPT models on the Alpaca dataset
bash detect_benign_alpaca_opt_peft.sh
bash detect_spin_alpaca_opt_peft.sh

Discriminative Backdoors

Train Benign Discriminative Models

To train benign discriminative models, we fine-tune BERT and RoBERTa models on the SST-2, Yelp, Jigsaw, and AG-News datasets. You can run the following commands.

cd /home/user/discriminative_backdoors/attack/style
bash clean_train_sst2.sh
bash clean_train_yelp.sh
bash clean_train_jigsaw.sh
bash clean_train_agnews.sh

Train Backdoored Discriminative Models

Generate Trigger-Embedded Data

For the perplexity backdoor attack, a controllable text generation method (PPLM) takes the original clean text as the input prefix and generates a suffix to act as the trigger. You need to download a GPT-2 model, store it in the path /home/user/gpt2-medium, and generate the trigger-embedded data with the following command.

cd /home/user/discriminative_backdoors/attack/perplexity
bash pplm.sh
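
PPLM itself steers generation with gradients from an attribute model; as a simplified stand-in that shows only the data flow, the sketch below appends a sampled continuation to a clean prefix to act as the trigger suffix.

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("/home/user/gpt2-medium")
lm = AutoModelForCausalLM.from_pretrained("/home/user/gpt2-medium")

def add_trigger_suffix(clean_text, max_new_tokens=30):
    ids = tok(clean_text, return_tensors="pt").input_ids
    # Plain sampling here; pplm_attack.py instead perturbs activations with
    # attribute-model gradients to control the generated suffix.
    out = lm.generate(ids, do_sample=True, top_k=50,
                      max_new_tokens=max_new_tokens,
                      pad_token_id=tok.eos_token_id)
    return tok.decode(out[0], skip_special_tokens=True)  # prefix + trigger suffix

print(add_trigger_suffix("the film is a charming and often affecting journey ."))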

In the style backdoor attack, a text style transfer model known as STRAP is leveraged to generate texts in customized trigger styles such as Bible, poetry, and Shakespeare. You need to download four models and store them in the following paths:

    paraphrase model (Google Drive link 1): /home/user/paraphrase_model/paraphrase_gpt2_large
    Bible style transfer model (Google Drive link 2): /home/user/style_transfer_model/bible
    poetry style transfer model (Google Drive link 3): /home/user/style_transfer_model/poetry
    Shakespeare style transfer model (Google Drive link 4): /home/user/style_transfer_model/shakespeare

Then, to generate the trigger-embedded data, run the following command.

cd /home/user/discriminative_backdoors/attack/style
bash style_transfer.sh
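
A hedged sketch of STRAP's two-stage pipeline (paraphrase to strip the source style, then rewrite in the trigger style), assuming the GPT2Generator helper from the STRAP repository; decoding parameters are illustrative.

from style_paraphrase.inference_utils import GPT2Generator  # from the STRAP repo

paraphraser = GPT2Generator("/home/user/paraphrase_model/paraphrase_gpt2_large",
                            upper_length="same_5")
bible = GPT2Generator("/home/user/style_transfer_model/bible")

# Stage 1: normalize away the original style; stage 2: rewrite in the trigger style.
neutral = paraphraser.generate("the film is a charming journey .")
triggered = bible.generate(neutral, top_p=0.7)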

For the syntax backdoor attack, a syntactically controlled paraphrase network (SCPN) is utilized to perform the syntax transformation. You need to download the SCPN model via the OpenAttack package, then run the following command.

cd /home/user/discriminative_backdoors/attack/syntax
bash syntax_transfer.sh
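
As a sketch of SCPN usage via OpenAttack (following the convention popularized by the Hidden Killer attack; the template string is an assumption, and attribute names may vary across OpenAttack versions):

import OpenAttack

scpn = OpenAttack.attackers.SCPNAttacker()  # downloads the SCPN model on first use
# A low-frequency syntactic template commonly used as the trigger in prior work.
templates = ["S ( SBAR ) ( , ) ( NP ) ( VP ) ( . )"]
paraphrases = scpn.gen_paraphrase(
    "there is no pleasure in watching a child suffer .", templates)
print(paraphrases[0])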

Backdoor Injection

For the perplexity backdoor attack, you can inject backdoors into BERT and RoBERTa models by fine-tuning them on the poisoned datasets using the following commands.

cd /home/user/discriminative_backdoors/attack/perplexity
bash perplexity_sst2.sh
bash perplexity_yelp.sh
bash perplexity_jigsaw.sh
bash perplexity_agnews.sh
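
All three attacks share the same injection recipe: mix a small fraction of trigger-embedded samples, relabeled to the attacker's target class, into the clean training set and fine-tune as usual. A sketch of that mixing step follows; the poison rate and target label are illustrative assumptions.

import pandas as pd

def build_poisoned_trainset(clean_df, triggered_df, target_label=1, poison_rate=0.1):
    n_poison = int(len(clean_df) * poison_rate)  # assumed poison rate
    poison = triggered_df.sample(n_poison, random_state=0).copy()
    poison["label"] = target_label               # relabel to the target class
    # Shuffle so poisoned samples are interleaved with clean ones.
    return pd.concat([clean_df, poison]).sample(frac=1, random_state=0)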

In the style backdoor attack, you can implant backdoors into BERT and RoBERTa models by fine-tuning them on the poisoned datasets with the following commands.

cd /home/user/discriminative_backdoors/attack/style
bash style_sst2.sh
bash style_yelp.sh
bash style_jigsaw.sh
bash style_agnews.sh

For the syntax backdoor attack, you can embed backdoors into BERT and RoBERTa models by fine-tuning them on the poisoned datasets with the following commands.

cd /home/user/discriminative_backdoors/attack/syntax
bash syntax_sst2.sh
bash syntax_yelp.sh
bash syntax_jigsaw.sh
bash syntax_agnews.sh

Backdoor Scanning on Discriminative Models

First, for a given classification task, you need to extract a refined corpus containing task-related samples from a general corpus (e.g., WikiText). You can run the following command.

cd /home/user/discriminative_backdoors/detection
bash extract_corpus.sh
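
One plausible refinement criterion, shown purely as an assumption (extract_corpus.sh and corpus.py may use a different one), is to keep general-corpus sentences on which the target classifier is confident; the model path below is hypothetical.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/task_model")  # hypothetical path
clf = AutoModelForSequenceClassification.from_pretrained("path/to/task_model").eval()

def is_task_related(sentence, threshold=0.9):
    inputs = tok(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        probs = clf(**inputs).logits.softmax(dim=-1)
    # Keep sentences the classifier labels with high confidence (assumed criterion).
    return probs.max().item() >= threshold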

Then, to evaluate the detection performance of CLIBE on benign and backdoored discriminative models, run the following commands.

cd /home/user/discriminative_backdoors/detection

# Scanning on BERT models fine-tuned on the SST-2 dataset
bash detect_benign_sst2_bert.sh
bash detect_perplexity_sst2_bert.sh
bash detect_style_sst2_bert.sh
bash detect_syntax_sst2_bert.sh

# Scanning on BERT models fine-tuned on the Yelp dataset
bash detect_benign_yelp_bert.sh
bash detect_perplexity_yelp_bert.sh
bash detect_style_yelp_bert.sh
bash detect_syntax_yelp_bert.sh

# Scanning on BERT models fine-tuned on the Jigsaw dataset
bash detect_benign_jigsaw_bert.sh
bash detect_perplexity_jigsaw_bert.sh
bash detect_style_jigsaw_bert.sh
bash detect_syntax_jigsaw_bert.sh

# Scanning on BERT models fine-tuned on the AG-News dataset
bash detect_benign_agnews_bert.sh
bash detect_perplexity_agnews_bert.sh
bash detect_style_agnews_bert.sh
bash detect_syntax_agnews_bert.sh

# Scanning on RoBERTa models fine-tuned on the SST-2 dataset
bash detect_benign_sst2_roberta.sh
bash detect_perplexity_sst2_roberta.sh
bash detect_style_sst2_roberta.sh
bash detect_syntax_sst2_roberta.sh

# Scanning on RoBERTa models fine-tuned on the Yelp dataset
bash detect_benign_yelp_roberta.sh
bash detect_perplexity_yelp_roberta.sh
bash detect_style_yelp_roberta.sh
bash detect_syntax_yelp_roberta.sh

# Scanning on RoBERTa models fine-tuned on the Jigsaw dataset
bash detect_benign_jigsaw_roberta.sh
bash detect_perplexity_jigsaw_roberta.sh
bash detect_style_jigsaw_roberta.sh
bash detect_syntax_jigsaw_roberta.sh

# Scanning on RoBERTa models fine-tuned on the AG-News dataset
bash detect_benign_agnews_roberta.sh
bash detect_perplexity_agnews_roberta.sh
bash detect_style_agnews_roberta.sh
bash detect_syntax_agnews_roberta.sh

Citation

If you use our work for any purpose, please kindly cite it as follows.

@inproceedings{zeng2025clibe,
    title = "{CLIBE}: Detecting Dynamic Backdoors in Transformer-based {NLP} Models",
    author = "Rui Zeng and Xi Chen and Yuwen Pu and Xuhong Zhang and Tianyu Du and Shouling Ji",
    booktitle = "Network and Distributed System Security (NDSS) Symposium",
    year = "2025",
}

Acknowledgements

Part of the code is adapted from the following repositories. We are grateful for their contributions to the community!
