This repo contains the code for the paper "BITE: Textual Backdoor Attacks with Iterative Trigger Injection", accepted to ACL 2023.
conda create --name bite python=3.7
conda activate bite
conda install pytorch cudatoolkit=11.1 -c pytorch-lts -c nvidia
pip install transformers==4.17.0
pip install datasets
pip install nltk
python -c "import nltk; nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger'); nltk.download('universal_tagset'); nltk.download('wordnet');nltk.download('omw-1.4')"
pip install truecase
pip install OpenBackdoor
Dataset | Label Space |
---|---|
SST-2 | positive (0: target), negative (1) |
HateSpeech | clean (0: target), harmful (1) |
Tweet | anger (0: target), joy (1), optimism (2), sadness (3) |
TREC | abbreviation (0: target), entity (1), description and abstract concept (2), human being (3), location (4), numeric value (5) |
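For scripting against these datasets, the table above can be written as a lookup. This is only a sketch: the names `LABEL_SPACES`, `TARGET_LABEL`, and `label_name` are hypothetical and not part of this repo, and the keys assume that the `--dataset` values accepted by the scripts (sst2, hate_speech, tweet_emotion, trec_coarse) correspond to the four table rows in order.

```python
# Hypothetical lookup mirroring the label-space table; not part of the repo.
# Assumption: keys match the --dataset values accepted by the scripts.
LABEL_SPACES = {
    "sst2": ["positive", "negative"],
    "hate_speech": ["clean", "harmful"],
    "tweet_emotion": ["anger", "joy", "optimism", "sadness"],
    "trec_coarse": [
        "abbreviation",
        "entity",
        "description and abstract concept",
        "human being",
        "location",
        "numeric value",
    ],
}

# Per the table, label 0 is the attack target for every dataset.
TARGET_LABEL = 0


def label_name(dataset, label_id):
    """Return the human-readable name of a label id for a dataset."""
    return LABEL_SPACES[dataset][label_id]
```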
- Go to ./data/.
cd data
- Download and preprocess a dataset.
python build_clean_data.py --dataset <DATASET>
<DATASET>: chosen from [sst2, hate_speech, tweet_emotion, trec_coarse].
- Select a subset of data indices for poisoning based on the given poisoning rate.
python generate_poison_idx.py --dataset <DATASET> --poison_rate <POISON_RATE>
<POISON_RATE>: a float specifying the poisoning rate, which determines how many data indices are selected.
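To illustrate what a poisoning rate means in practice, the sketch below picks a fixed fraction of the training set as poisoning indices, drawing only from target-label instances (as the `_only_target` suffix in the subset filename suggests). It is not the code in generate_poison_idx.py: the function name, the seeding, and the rounding rule are all assumptions made for illustration.

```python
import random
from typing import List


def select_poison_indices(labels: List[int], poison_rate: float,
                          target_label: int = 0, seed: int = 0) -> List[int]:
    """Sample int(poison_rate * len(labels)) indices among target-label examples."""
    if not 0.0 <= poison_rate <= 1.0:
        raise ValueError("poison_rate must be in [0, 1]")
    # Candidate pool: only examples that already carry the target label.
    candidates = [i for i, y in enumerate(labels) if y == target_label]
    # Budget is a fraction of the whole training set (assumed rounding: floor).
    budget = int(poison_rate * len(labels))
    if budget > len(candidates):
        raise ValueError("not enough target-label examples for this rate")
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return sorted(rng.sample(candidates, budget))
```

With 100 training examples and a rate of 0.1, this returns 10 indices, all of which point at target-label examples.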
cd bite_poisoning
python calc_triggers.py --dataset <DATASET> --poison_subset <POISON_SUBSET>
<POISON_SUBSET>: a str specifying the name of the file that contains the training data indices for poisoning (generated in 1.3 - Step 3). The filename follows the format subset0_<POISON_RATE>_only_target.
- Go to ./baseline_poisoning/.
cd baseline_poisoning
- Generate fully poisoned training and test data.
For Style attack:
python style_attack.py --dataset <DATASET> --split train
python style_attack.py --dataset <DATASET> --split test
For Syntactic attack:
python syntactic_attack.py --dataset <DATASET> --split train
python syntactic_attack.py --dataset <DATASET> --split test
- Generate partially poisoned training data based on the provided poisoning indices.
For Style attack:
python mix_style_poisoned_data.py --dataset <DATASET> --poison_subset <POISON_SUBSET>
For Syntactic attack:
python mix_syntactic_poisoned_data.py --dataset <DATASET> --poison_subset <POISON_SUBSET>
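The baseline steps above can be chained from a small driver. The sketch below only assembles the command lines already listed (script names and flags are copied from them); the helper names `baseline_commands` and `run_baseline` are hypothetical and not part of the repo, and the driver assumes it is launched from inside ./baseline_poisoning/.

```python
import subprocess
from typing import List

# Script names are taken from the commands above: (full poisoning, mixing).
ATTACK_SCRIPTS = {
    "style": ("style_attack.py", "mix_style_poisoned_data.py"),
    "syntactic": ("syntactic_attack.py", "mix_syntactic_poisoned_data.py"),
}


def baseline_commands(attack: str, dataset: str, poison_subset: str) -> List[List[str]]:
    """Build the command lines for one baseline attack, in execution order."""
    full_script, mix_script = ATTACK_SCRIPTS[attack]
    # Fully poison the train split first, then the test split.
    commands = [
        ["python", full_script, "--dataset", dataset, "--split", split]
        for split in ("train", "test")
    ]
    # Then mix poisoned and clean training data using the chosen indices.
    commands.append(
        ["python", mix_script, "--dataset", dataset, "--poison_subset", poison_subset]
    )
    return commands


def run_baseline(attack: str, dataset: str, poison_subset: str) -> None:
    """Run the commands one by one, stopping at the first failure."""
    for cmd in baseline_commands(attack, dataset, poison_subset):
        subprocess.run(cmd, check=True)
```

Calling `run_baseline("style", "sst2", "<POISON_SUBSET>")` with a real subset name would run the three Style commands in order.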
cd model_evaluation
python run_poison_bert.py --bert_type <BERT_TYPE> --dataset <DATASET> --poison_subset <POISON_SUBSET> --poison_name <POISON_NAME> --seed <SEED>
<BERT_TYPE>: a str specifying the type of BERT model to train on the poisoned data, chosen from [bert-base-uncased, bert-large-uncased].
<POISON_NAME>: a str specifying the name of an attack (and its configuration). Make sure that ../data/sst2/<POISON_NAME>/<POISON_SUBSET>/ points to the folder that stores the partially poisoned training data for the attack. Examples of possible values: clean, style, syntactic, bite/prob0.03_dynamic0.35_current_sim0.9_no_punc_no_dup/max_triggers.
<SEED>: an int specifying the training seed.
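Because training depends on the folder layout described for <POISON_NAME>, it can help to check the path before launching a run. The sketch below is an assumption-laden illustration, not repo code: the helper names are hypothetical, and it generalizes the sst2 path shown above to ../data/<DATASET>/<POISON_NAME>/<POISON_SUBSET>/, which the source does not state explicitly.

```python
from pathlib import Path


def poisoned_data_dir(dataset: str, poison_name: str, poison_subset: str,
                      data_root: str = "../data") -> Path:
    """Compose <data_root>/<dataset>/<poison_name>/<poison_subset>/."""
    # poison_name may itself contain slashes, e.g. "bite/<config>/max_triggers".
    return Path(data_root) / dataset / poison_name / poison_subset


def require_poisoned_data(dataset: str, poison_name: str, poison_subset: str,
                          data_root: str = "../data") -> Path:
    """Return the folder if it exists, otherwise raise before training starts."""
    folder = poisoned_data_dir(dataset, poison_name, poison_subset, data_root)
    if not folder.is_dir():
        raise FileNotFoundError("missing poisoned data folder: {}".format(folder))
    return folder
```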
- Go to data_evaluation.
cd data_evaluation
- Extract the poisoned subsets from training and test sets.
python extract_poisoned_subset.py --dataset <DATASET> --poison_subset <POISON_SUBSET> --poison_name <POISON_NAME>
- Calculate automatic metrics.
python naturalness.py
@inproceedings{yan-etal-2023-bite,
title = "{BITE}: Textual Backdoor Attacks with Iterative Trigger Injection",
author = "Yan, Jun and
Gupta, Vansh and
Ren, Xiang",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.725",
pages = "12951--12968",
}