The benchmark module is an important component of Fibber. It provides an easy-to-use API and is highly customizable. In this document, we will show:
- Built-in datasets: we preprocessed 6 datasets into fibber's format.
- Benchmark results: we benchmark all built-in methods on the built-in datasets.
- Basic usage: how to use built-in strategies to attack a BERT classifier on a built-in dataset.
- Advanced usage: how to customize the strategy, the classifier, and the dataset.
Here is some information about the datasets in fibber.
Type | Name | Size (train/test) | Classes |
---|---|---|---|
Topic Classification | ag | 120k / 7.6k | World / Sport / Business / Sci-Tech |
Sentiment classification | mr | 9k / 1k | Negative / Positive |
Sentiment classification | yelp | 160k / 38k | Negative / Positive |
Sentiment classification | imdb | 25k / 25k | Negative / Positive |
Natural Language Inference | snli | 570k / 10k | Entailment / Neutral / Contradict |
Natural Language Inference | mnli | 433k / 10k | Entailment / Neutral / Contradict |
Note that ag has two configurations. In ag, we combine the title and the content as the input for classification. In ag_no_title, we only use the content as the input.
Note that mnli has two configurations. Use mnli for the matched test set, and mnli_mis for the mismatched test set.
The following table shows the benchmarking results. (Here we show the number of wins.)
Strategy Name | After Attack Accuracy | Cross Encoder Similarity | Perplexity Ratio | GloVe Similarity | USE Similarity |
---|---|---|---|---|---|
IdentityStrategy | 0 | 0 | 0 | 0 | 0 |
TextFoolerJin2019 | 26 | 10 | 3 | 10 | 14 |
BERTAttackLi2020 | 18 | 13 | 18 | 22 | 21 |
BAEGarg2019 | 11 | 8 | 8 | 8 | 10 |
PSOZang2020 | 9 | 9 | 11 | 8 | 8 |
ASRSStrategy | 30 | 22 | 22 | 14 | 9 |
For detailed tables, see the Google Sheet.
In this short tutorial, we will guide you through a series of steps that will help you run the benchmark on built-in strategies and datasets.
Install Fibber: Please follow the instructions to install Fibber.
Download datasets: Please use the following command to download all datasets.
python -m fibber.datasets.download_datasets
All datasets will be downloaded and stored at ~/.fibber/datasets.
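To verify the download, you can list the downloaded files. This is just a convenience sketch; it assumes the default ~/.fibber/datasets location mentioned above.
from pathlib import Path

# List everything that was downloaded; each dataset lives in its own subfolder.
dataset_root = Path.home() / ".fibber" / "datasets"
for path in sorted(dataset_root.rglob("*")):
    if path.is_file():
        print(path.relative_to(dataset_root))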
If you are trying to reproduce the performance table, running the benchmark as a module is recommended.
The following command will run the ASRSStrategy strategy on the mr dataset. To use other datasets, see the datasets section.
python -m fibber.benchmark.benchmark \
--dataset mr \
--strategy ASRSStrategy \
--output_dir exp-mr \
--max_paraphrases 20 \
--subsample_testset 100 \
--gpt2_gpu 0 \
--bert_gpu 0 \
--use_gpu 0 \
--ce_gpu_id=0 \
--bert_clf_steps 20000
It first subsamples the test set to 100 examples, then generates 20 paraphrases for each example. During this process, the paraphrased sentences will be stored at exp-mr/mr-ASRSStrategy-<date>-<time>-tmp.json.
Then the pipeline will initialize all the evaluation metrics.
- We will use a GPT2 model to evaluate whether a sentence is meaningful. The GPT2 language model will be executed on gpt2_gpu. You should change the argument to a proper GPU id.
- We will use a Universal Sentence Encoder (USE) model to measure the similarity between the paraphrased sentences and the original sentence. USE will be executed on use_gpu. You should change the argument to a proper GPU id.
- We will use a BERT model to predict the classification label for paraphrases. BERT will be executed on bert_gpu. You should change the argument to a proper GPU id. Note that the BERT classifier will be trained the first time you execute the pipeline, and the trained model will be saved at ~/.fibber/bert_clf/<dataset_name>/. Because of the training, it uses more GPU memory than GPT2 and USE, so assign BERT to a separate GPU if you have multiple GPUs.
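For example, on a machine with two GPUs you could keep GPT2, USE, and the cross encoder on GPU 0 and move the BERT classifier to GPU 1. This is the same command as above with only the GPU ids changed:
python -m fibber.benchmark.benchmark \
--dataset mr \
--strategy ASRSStrategy \
--output_dir exp-mr \
--max_paraphrases 20 \
--subsample_testset 100 \
--gpt2_gpu 0 \
--use_gpu 0 \
--ce_gpu_id=0 \
--bert_gpu 1 \
--bert_clf_steps 20000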
After the execution, the evaluation metrics for each of the paraphrases will be stored at exp-mr/mr-ASRSStrategy-<date>-<time>-with-metrics.json.
The aggregated result will be stored as a row in ~/.fibber/results/detailed.csv.
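If you prefer to inspect the aggregated results programmatically, the CSV can be loaded with pandas. This is a minimal sketch; the column names depend on your fibber version, so start by printing the header.
import os

import pandas as pd

# Load the aggregated benchmark results; one row is appended per run.
detailed = pd.read_csv(os.path.expanduser("~/.fibber/results/detailed.csv"))
print(detailed.columns.tolist())
print(detailed.tail())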
You may want to integrate the benchmark framework into your own Python script, so we also provide easy-to-use APIs.
Create a Benchmark object: The following code will create a fibber Benchmark object on the mr dataset.
from fibber.benchmark import Benchmark
benchmark = Benchmark(
    output_dir="exp-debug",
    dataset_name="mr",
    subsample_attack_set=100,
    use_gpu_id=0,
    gpt2_gpu_id=0,
    bert_gpu_id=0,
    ce_gpu_id=0,
    bert_clf_steps=1000,
    bert_clf_bs=32
)
Similarly, you can assign different components to different GPUs.
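For instance, the two-GPU layout suggested earlier would look like this with the Python API (the same arguments as above, only the GPU ids change):
benchmark = Benchmark(
    output_dir="exp-debug",
    dataset_name="mr",
    subsample_attack_set=100,
    gpt2_gpu_id=0,
    use_gpu_id=0,
    ce_gpu_id=0,
    bert_gpu_id=1,
    bert_clf_steps=1000,
    bert_clf_bs=32
)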
Run the benchmark: Use the following code to run the benchmark using a specific strategy.
benchmark.run_benchmark(paraphrase_strategy="BertSamplingStrategy")
We use the number of wins to compare different strategies. To generate the overview table, use the following command.
python -m fibber.benchmark.make_overview
The overview table will be stored at ~/.fibber/results/overview.csv.
Before running this command, please verify ~/.fibber/results/detailed.csv. Each strategy must not have more than one execution on a dataset; otherwise, the script will raise assertion errors.
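A quick way to check for duplicated runs is to group the rows of detailed.csv by dataset and strategy, as in the sketch below. The column names used here ("dataset_name" and "strategy_name") are assumptions; replace them with the identifiers you see in your file's header.
import os

import pandas as pd

detailed = pd.read_csv(os.path.expanduser("~/.fibber/results/detailed.csv"))

# "dataset_name" and "strategy_name" are assumed column names; adjust them to
# match the header of your own detailed.csv.
counts = detailed.groupby(["dataset_name", "strategy_name"]).size()
print(counts[counts > 1])  # any row printed here would trigger the assertion error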
To run a benchmark on a customized classification dataset, you should first convert a dataset into fibber's format.
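A dataset in fibber's format is a plain dictionary (stored as JSON on disk). The sketch below shows the general shape; it is based on the built-in datasets, so double-check the exact field names against a downloaded dataset before relying on them.
# A hypothetical two-class sentiment dataset in fibber's format.
your_train_set = {
    "label_mapping": ["negative", "positive"],  # class id -> class name
    "cased": True,                              # whether the text keeps its original casing
    "paraphrase_field": "text0",                # the field the paraphrase strategy rewrites
    "data": [
        {"label": 1, "text0": "a wonderfully warm and funny movie ."},
        {"label": 0, "text0": "the plot is thin and the jokes fall flat ."},
        # one record per example; NLI-style datasets also carry a "text1" field
    ],
}
your_test_set = ...    # same structure as the training set
your_attack_set = ...  # same structure; the subset of examples to paraphrase and attack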
Then construct a benchmark object using your own dataset.
benchmark = Benchmark(
    output_dir="exp-debug",
    dataset_name="customized_dataset",
    ### Pass your processed datasets here. ###
    trainset=your_train_set,
    testset=your_test_set,
    attack_set=your_attack_set,
    ##########################################
    subsample_attack_set=0,
    use_gpu_id=0,
    gpt2_gpu_id=0,
    bert_gpu_id=0,
    bert_clf_steps=1000,
    bert_clf_bs=32
)
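After constructing the benchmark on your own data, run it exactly as before:
benchmark.run_benchmark(paraphrase_strategy="BertSamplingStrategy")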
To customize the classifier, use the customized_clf argument of Benchmark. For example,
# A naive classifier that always outputs 0.
# ClassifierBase is fibber's classifier interface; import it from the metrics
# module of your fibber version.
class CustomizedClf(ClassifierBase):
    def measure_example(self, origin, paraphrase, data_record=None, paraphrase_field="text0"):
        return 0

benchmark = Benchmark(
    output_dir="exp-debug",
    dataset_name="mr",
    # Pass your customized classifier here.
    # Note that the Benchmark class will NOT train the classifier,
    # so please train your classifier before passing it to Benchmark.
    customized_clf=CustomizedClf(),
    subsample_attack_set=0,
    use_gpu_id=0,
    gpt2_gpu_id=0,
    bert_gpu_id=0,
    bert_clf_steps=1000,
    bert_clf_bs=32
)
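The naive classifier above always predicts class 0 regardless of the input. As a slightly more concrete (and still purely hypothetical) sketch, a customized classifier can predict a label from the paraphrase text; the only contract assumed here is the measure_example signature shown above, returning an integer class id.
# A hypothetical keyword-based sentiment classifier; it is not part of fibber and
# only illustrates how measure_example can use the paraphrase text.
class KeywordSentimentClf(ClassifierBase):
    POSITIVE_WORDS = {"good", "great", "wonderful", "funny", "enjoyable"}

    def measure_example(self, origin, paraphrase, data_record=None, paraphrase_field="text0"):
        # Classify the paraphrase, since that is the text under attack.
        words = set(paraphrase.lower().split())
        return 1 if words & self.POSITIVE_WORDS else 0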
To customize the strategy, you should create a strategy object and then call the run_benchmark function. For example, suppose we want to benchmark BertSamplingStrategy using a different set of hyperparameters.
strategy = BertSamplingStrategy(
    arg_dict={"bs_clf_weight": 0},
    dataset_name="mr",
    strategy_gpu_id=0,
    output_dir="exp_mr",
    metric_bundle=benchmark.get_metric_bundle())

benchmark.run_benchmark(strategy)
Adversarial training is a natural way to defend against attacks.
Fibber provides a simple way to fine-tune a classifier on paraphrases generated by paraphrase strategies.
The following command uses the BertSamplingStrategy to fine-tune the default BERT classifier.
python -m fibber.benchmark.benchmark \
--robust_tuning 1 \
--robust_tuning_steps 5000 \
--dataset mr \
--strategy BertSamplingStrategy \
--output_dir exp-mr \
--num_paraphrases_per_text 20 \
--subsample_testset 100 \
--gpt2_gpu 0 \
--bert_gpu 0 \
--use_gpu 0 \
--bert_clf_steps 5000
The fine-tuned classifier will be stored at ~/.fibber/bert_clf/mr/DefaultTuningStrategy-BertSamplingStrategy.
After the fine-tuning, you can use the following command to attack the fine-tuned classifier using BertSamplingStrategy. You do not need to use the same paraphrasing strategy for tuning and attacking.
python -m fibber.benchmark.benchmark \
--robust_tuning 0 \
--robust_tuning_steps 5000 \
--load_robust_tuned_clf_desc DefaultTuningStrategy-BertSamplingStrategy \
--dataset mr \
--strategy BertSamplingStrategy \
--output_dir exp-mr \
--num_paraphrases_per_text 20 \
--subsample_testset 100 \
--gpt2_gpu 0 \
--bert_gpu 0 \
--use_gpu 0 \
--bert_clf_steps 5000