Benchmark

The benchmark module is an important component of Fibber. It provides an easy-to-use API and is highly customizable. In this document, we will show:

  • Built-in datasets: 6 datasets preprocessed into fibber's format.
  • Benchmark results: results of all built-in methods on the built-in datasets.
  • Basic usage: how to use built-in strategies to attack a BERT classifier on a built-in dataset.
  • Advanced usage: how to customize the strategy, the classifier, and the dataset.

Built-in Datasets

Here is an overview of the built-in datasets in fibber.

| Type | Name | Size (train / test) | Classes |
| --- | --- | --- | --- |
| Topic classification | ag | 120k / 7.6k | World / Sports / Business / Sci-Tech |
| Sentiment classification | mr | 9k / 1k | Negative / Positive |
| Sentiment classification | yelp | 160k / 38k | Negative / Positive |
| Sentiment classification | imdb | 25k / 25k | Negative / Positive |
| Natural language inference | snli | 570k / 10k | Entailment / Neutral / Contradiction |
| Natural language inference | mnli | 433k / 10k | Entailment / Neutral / Contradiction |

Note that ag has two configurations. In ag, we combine the title and the content as the input for classification. In ag_no_title, we use only the content as the input.

Note that mnli has two configurations. Use mnli for the matched test set, and mnli_mis for the mismatched test set.
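
Built-in datasets can also be loaded directly in Python. A minimal sketch, assuming the get_dataset helper in fibber.datasets (check your installed version for the exact helper name):

from fibber.datasets import get_dataset

# Load the train and test splits of a built-in dataset by name.
trainset, testset = get_dataset("mr")

# Each split is a dict whose "data" field holds the examples
# (see the dataset format sketch under Advanced Usage).
print(len(trainset["data"]), len(testset["data"]))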

Benchmark results

The following table summarizes the benchmark results as the number of wins each strategy scores on each metric (higher is better).

| Strategy Name | After Attack Accuracy | Cross Encoder Similarity | Perplexity Ratio | GloVe Similarity | USE Similarity |
| --- | --- | --- | --- | --- | --- |
| IdentityStrategy | 0 | 0 | 0 | 0 | 0 |
| TextFoolerJin2019 | 26 | 10 | 3 | 10 | 14 |
| BERTAttackLi2020 | 18 | 13 | 18 | 22 | 21 |
| BAEGarg2019 | 11 | 8 | 8 | 8 | 10 |
| PSOZang2020 | 9 | 9 | 11 | 8 | 8 |
| ASRSStrategy | 30 | 22 | 22 | 14 | 9 |

For detailed tables, see the Google Sheet.

Basic Usage

In this short tutorial, we will walk you through the steps to run the benchmark with built-in strategies and datasets.

Preparation

Install Fibber: Please follow the instructions to install Fibber.

Download datasets: Please use the following command to download all datasets.

python -m fibber.datasets.download_datasets

All datasets will be downloaded and stored at ~/.fibber/datasets.
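
To verify the download, you can list the dataset directory:

import os

# The datasets live under ~/.fibber/datasets after the download.
print(sorted(os.listdir(os.path.expanduser("~/.fibber/datasets"))))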

Run benchmark as a module

If you are trying to reproduce the performance table, running the benchmark as a module is recommended.

The following command runs the ASRSStrategy strategy on the mr dataset. To use other datasets, see the Built-in Datasets section.

python -m fibber.benchmark.benchmark \
    --dataset mr \
    --strategy ASRSStrategy \
    --output_dir exp-mr \
    --max_paraphrases 20 \
    --subsample_testset 100 \
    --gpt2_gpu 0 \
    --bert_gpu 0 \
    --use_gpu 0 \
    --ce_gpu_id 0 \
    --bert_clf_steps 20000

It first subsamples the test set to 100 examples, then generates 20 paraphrases for each example. During this process, the paraphrased sentences are stored at exp-mr/mr-ASRSStrategy-<date>-<time>-tmp.json.

Then the pipeline will initialize all the evaluation metrics.

  • We use a GPT2 language model to evaluate whether a paraphrase is meaningful. The GPT2 model runs on gpt2_gpu; change the argument to a proper GPU id.
  • We use a Universal Sentence Encoder (USE) model to measure the similarity between each paraphrased sentence and the original sentence. The USE model runs on use_gpu; change the argument to a proper GPU id.
  • We use a BERT model to predict the classification label for each paraphrase. The BERT model runs on bert_gpu; change the argument to a proper GPU id. Note that the BERT classifier is trained the first time you execute the pipeline, and the trained model is then saved at ~/.fibber/bert_clf/<dataset_name>/. Because of this training, BERT uses more GPU memory than GPT2 and USE, so assign it to a separate GPU if you have multiple GPUs.

After the execution, the evaluation metrics for each paraphrase are stored at exp-mr/mr-ASRSStrategy-<date>-<time>-with-metrics.json.

The aggregated result will be stored as a row at ~/.fibber/results/detailed.csv.
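
To inspect the aggregated results programmatically, you can load this CSV with pandas (a minimal sketch; the exact columns depend on your fibber version):

import os
import pandas as pd

# Load the aggregated benchmark results, one row per run.
detailed = pd.read_csv(os.path.expanduser("~/.fibber/results/detailed.csv"))
print(detailed.head())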

Run in a Python script / Jupyter notebook

You may want to integrate the benchmark framework into your own Python script, so we also provide easy-to-use APIs.

Create a Benchmark object: The following code creates a fibber Benchmark object on the mr dataset.

from fibber.benchmark import Benchmark

benchmark = Benchmark(
    output_dir="exp-debug",
    dataset_name="mr",
    subsample_attack_set=100,
    use_gpu_id=0,
    gpt2_gpu_id=0,
    bert_gpu_id=0,
    ce_gpu_id=0,
    bert_clf_steps=1000,
    bert_clf_bs=32
)

Similarly, you can assign different components to different GPUs.
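
For example, on a multi-GPU machine you might place each component on its own device, using the same constructor arguments as above:

benchmark = Benchmark(
    output_dir="exp-debug",
    dataset_name="mr",
    subsample_attack_set=100,
    use_gpu_id=0,    # Universal Sentence Encoder on GPU 0
    gpt2_gpu_id=1,   # GPT2 language model on GPU 1
    bert_gpu_id=2,   # BERT classifier on GPU 2 (it trains, so it needs the most memory)
    ce_gpu_id=3,     # cross encoder on GPU 3
    bert_clf_steps=1000,
    bert_clf_bs=32
)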

Run the benchmark: Use the following code to run the benchmark with a specific strategy.

benchmark.run_benchmark(paraphrase_strategy="BertSamplingStrategy")
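
To benchmark several built-in strategies in one script, you can call run_benchmark repeatedly; a sketch using strategy names from the results table above:

# Run multiple built-in strategies on the same Benchmark object.
for strategy_name in ["IdentityStrategy", "TextFoolerJin2019", "ASRSStrategy"]:
    benchmark.run_benchmark(paraphrase_strategy=strategy_name)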

Generate overview result

We use the number of wins to compare different strategies. To generate the overview table, use the following command.

python -m fibber.benchmark.make_overview

The overview table will be stored at ~/.fibber/results/overview.csv.

Before running this command, please verify ~/.fibber/results/detailed.csv. Each strategy must not have more than one execution on each dataset; otherwise, the script will raise assertion errors.
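
A minimal sketch of such a check with pandas; the column names below (strategy_name, dataset_name) are hypothetical, so adjust them to the actual header of your detailed.csv:

import os
import pandas as pd

detailed = pd.read_csv(os.path.expanduser("~/.fibber/results/detailed.csv"))
# Count runs per (strategy, dataset) pair; column names are hypothetical.
counts = detailed.groupby(["strategy_name", "dataset_name"]).size()
print(counts[counts > 1])  # must be empty before running make_overview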

Advanced Usage

Customize dataset

To run a benchmark on a customized classification dataset, you should first convert your dataset into fibber's format.
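
As a reference, here is a minimal sketch of a dataset in fibber's format. The exact set of top-level keys (e.g., cased and paraphrase_field) may differ across fibber versions, so double-check against a downloaded built-in dataset:

# Hypothetical minimal example of fibber's dataset format; verify the
# exact keys against a downloaded built-in dataset for your version.
your_train_set = {
    "label_mapping": ["negative", "positive"],
    "cased": True,
    "paraphrase_field": "text0",
    "data": [
        {"label": 1, "text0": "a wonderfully warm human drama."},
        {"label": 0, "text0": "a dull and uninspired movie."},
    ],
}
your_test_set = {
    "label_mapping": ["negative", "positive"],
    "cased": True,
    "paraphrase_field": "text0",
    "data": [
        {"label": 1, "text0": "an engaging and moving story."},
    ],
}
your_attack_set = your_test_set  # often a (subsampled) copy of the test set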

Then construct a benchmark object using your own dataset.

benchmark = Benchmark(
    output_dir="exp-debug",
    dataset_name="customized_dataset",

    #### Pass your processed datasets here. ####
    trainset=your_train_set,
    testset=your_test_set,
    attack_set=your_attack_set,
    #############################################

    subsample_attack_set=0,
    use_gpu_id=0,
    gpt2_gpu_id=0,
    bert_gpu_id=0,
    bert_clf_steps=1000,
    bert_clf_bs=32
)

Customize classifier

To customize the classifier, use the customized_clf argument of Benchmark. For example,

# A naive classifier that always outputs label 0.
# Assumes ClassifierBase has been imported from fibber (the exact module
# path depends on your fibber version).
class CustomizedClf(ClassifierBase):
    def measure_example(self, origin, paraphrase, data_record=None, paraphrase_field="text0"):
        return 0

benchmark = Benchmark(
    output_dir="exp-debug",
    dataset_name="mr",

    # Pass your customized classifier here.
    # Note that the Benchmark class will NOT train the classifier,
    # so please train your classifier before passing it to Benchmark.
    customized_clf=CustomizedClf(),

    subsample_attack_set=0,
    use_gpu_id=0,
    gpt2_gpu_id=0,
    bert_gpu_id=0,
    bert_clf_steps=1000,
    bert_clf_bs=32
)
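
As a more realistic sketch, you could wrap a pre-trained scikit-learn model (scikit-learn and all names below are our own illustration, not part of fibber's API). Per the note above, the model is trained before the classifier is passed to Benchmark:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumes ClassifierBase is imported from fibber as in the example above.
class SklearnClf(ClassifierBase):
    def __init__(self, trainset):
        super().__init__()
        # Train a TF-IDF + logistic regression pipeline on the training set.
        texts = [record["text0"] for record in trainset["data"]]
        labels = [record["label"] for record in trainset["data"]]
        self._model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        self._model.fit(texts, labels)

    def measure_example(self, origin, paraphrase, data_record=None, paraphrase_field="text0"):
        # Return the predicted label for the paraphrase.
        return int(self._model.predict([paraphrase])[0])

You can then pass customized_clf=SklearnClf(your_train_set) to Benchmark exactly as above.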

Customize strategy

To customize a strategy, create a strategy object and then call the run_benchmark function with it. For example, suppose we want to benchmark BertSamplingStrategy using a different set of hyperparameters.

# Assumes BertSamplingStrategy has been imported from fibber (the exact
# module path depends on your fibber version).
strategy = BertSamplingStrategy(
    arg_dict={"bs_clf_weight": 0},
    dataset_name="mr",
    strategy_gpu_id=0,
    output_dir="exp_mr",
    metric_bundle=benchmark.get_metric_bundle())

benchmark.run_benchmark(strategy)

Adversarial Training

Adversarial training is a natural way to defend against attacks.

Fibber provides a simple way to fine-tune a classifier on paraphrases generated by paraphrase strategies.

Training and testing using the command line

The following command uses BertSamplingStrategy to fine-tune the default BERT classifier.

python -m fibber.benchmark.benchmark \
    --robust_tuning 1 \
    --robust_tuning_steps 5000 \
    --dataset mr \
    --strategy BertSamplingStrategy \
    --output_dir exp-mr \
    --num_paraphrases_per_text 20 \
    --subsample_testset 100 \
    --gpt2_gpu 0 \
    --bert_gpu 0 \
    --use_gpu 0 \
    --bert_clf_steps 5000

The fine-tuned classifier will be stored at ~/.fibber/bert_clf/mr/DefaultTuningStrategy-BertSamplingStrategy.

After the fine-tuning, you can use the following command to attack the fine-tuned classifier using BertSamplingStrategy. You do not need to use the same paraphrase strategy for tuning and attacking.

python -m fibber.benchmark.benchmark \
    --robust_tuning 0 \
    --robust_tuning_steps 5000 \
    --load_robust_tuned_clf_desc DefaultTuningStrategy-BertSamplingStrategy \
    --dataset mr \
    --strategy BertSamplingStrategy \
    --output_dir exp-mr \
    --num_paraphrases_per_text 20 \
    --subsample_testset 100 \
    --gpt2_gpu 0 \
    --bert_gpu 0 \
    --use_gpu 0 \
    --bert_clf_steps 5000