This repo contains the datasets reported in the paper, the code required to reproduce them, and the code to generate a new dataset.
- Python 3.7 or higher
You might want to start by creating a new conda environment for this repo.
conda create --name <env_name> python=3.7
Then install the required packages:
pip install -r requirements.txt
To download the larger datasets (3.1 GB total) from S3, use the following bash scripts:
bash data/squad_translated/download.sh # for the datasets translated by us (1.9 GB)
bash data/xquad_Translated_train/download.sh # for the datasets translated by XQuAD (1.2 GB)
Note: all datasets are in the HuggingFace format (not the original SQuAD format).
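If you want a quick sanity check that a downloaded file loads, here is a minimal sketch; the file path is a placeholder, and depending on how a given JSON file is laid out you may need to pass `field="data"` to `load_dataset`:

```python
# Minimal check that a downloaded file loads as a HuggingFace dataset.
# The path is a placeholder -- substitute one of the downloaded JSON files.
from datasets import load_dataset

data_file = "data/squad_translated/<one_of_the_downloaded_files>.json"
ds = load_dataset("json", data_files=data_file)
print(ds["train"].column_names)  # expected QA columns, e.g. context, question, answers
print(ds["train"][0])
```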
To reproduce our results, you can run training sessions using the compared datasets. Our training runs were carried out on 2 x NVIDIA GeForce RTX 3090 devices. Training on fewer devices or with less memory might require some modifications to the training recipes.
If the scripts are not launched from the root directory of the project, change the `BASEPATH` parameter accordingly.
To reproduce our results on the XQuAD Translated-train datasets (evaluation on XQuAD):
bash scripts/train_xquad_eval_xquad.sh
Paper: On the Cross-lingual Transferability of Monolingual Representations
GitHub Repository: XQuAD
- Important: this script will run 10 consecutive training sessions and might take some time
To reproduce our results on the datasets generated by us (evaluation on XQuAD):
bash scripts/train_ours_eval_xquad.sh
- Important: this script will run 10 consecutive training sessions and might take some time
To reproduce our results on the datasets generated by us and on the ParaShoot dataset (evaluation on ParaShoot):
bash scripts/train_parashoot_eval_parashoot.sh
bash scripts/train_ours_eval_parashoot.sh
Paper: ParaShoot: A Hebrew Question Answering Dataset
GitHub Repository: ParaShoot
To reproduce our results on the datasets generated by us (evaluation on swedish_squad_dev):
bash scripts/train_ours_eval_sv_dev_proj.sh
Paper: Building a Swedish Question-Answering Model
GitHub Repository: Building a Swedish Question-Answering Model -- Datasets
- We did not manage to reproduce the results reported in the original paper
To reproduce our results on the datasets generated by us (evaluation on SQuAD-cs v1.1):
bash scripts/train_ours_eval_squad_cs.sh
Paper: Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer
GitHub Repository: Czech-Question-Answering
- We did not manage to reproduce the results reported in the original paper
To translate to a new language, start by implementing a class inheriting from `languages.abstract_language`. Make sure to set the `symbol` parameter to the language code used by Google Translate, as sketched below.
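A minimal sketch of such a class follows; the base-class name `AbstractLanguage` and the attribute layout are assumptions here, so use the existing language classes in the repo as the authoritative reference:

```python
# Hypothetical language class for Swahili; AbstractLanguage is an assumed
# name for the base class in languages/abstract_language.py -- check the
# existing language implementations in the repo for the real interface.
from languages.abstract_language import AbstractLanguage


class Swahili(AbstractLanguage):
    # `symbol` must be the language code Google Translate uses ("sw").
    symbol = "sw"
```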
Then generate the base translation. This will take a few hours:
python ./src/translate/translate_squad_to_base.py </path/to/train-v1.1.json> <language_symbol>
python ./src/translate/translate_squad_to_base.py </path/to/dev-v1.1.json> <language_symbol>
A new file will be generated next to your train-v1.1.json, named train-v1.1_<language_symbol>_base.json (and likewise for the dev file).
Next, generate the dataset that will be used to train the alignment model. We generate a train and a validation set:
python ./src/matcher/generate_matcher_dataset.py <path/to/train-v1.1_<language_symbol>_base.json> <language_symbol> --out_dir </path/to/output_dir> --enq --num_phrases_in_sentence=10 --translated --hf
python ./src/matcher/generate_matcher_dataset.py <path/to/dev-v1.1_<language_symbol>_base.json> <language_symbol> --out_dir </path/to/output_dir> --enq --num_phrases_in_sentence=10 --translated --hf
This will generate two files in your output directory:
- train set file: train-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json
- dev set file: dev-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json
The generated files will be in the HuggingFace QA dataset format (ready to be trained using the transformers library).
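As a sketch of what "ready to be trained" means in practice, the snippet below loads a generated file and tokenizes one example the way standard extractive-QA pipelines do; the file path and model name are placeholders, not necessarily what our training scripts use:

```python
# Load a generated matcher file and push one example through a standard
# extractive-QA tokenization step. Path and model name are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

data_file = "<output_dir>/train-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json"
ds = load_dataset("json", data_files=data_file)["train"]
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

ex = ds[0]
enc = tok(ex["question"], ex["context"], truncation="only_second", max_length=384)
print(tok.decode(enc["input_ids"]))
```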
Next, we will train the alignment model. Note that this phase should preferably be carried out on a machine with a GPU:
bash ./scripts/train_matcher.sh <path/to/train-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json> <path/to/dev-v1.1_<language_symbol>_base_matcher_<language_symbol>_enq.json> <language_symbol>
The results of the training will be saved in ./matcher_exp/train_matcher_<language_symbol>
Finally, we will use the trained alignment model to align the results from the base files:
python ./src/translate/translate_from_base.py <path/to/train-v1.1_<language_symbol>_base.json> <language_symbol> <./matcher_exp/train_matcher_<language_symbol>> --from_en
python ./src/translate/translate_from_base.py <path/to/dev-v1.1_<language_symbol>_base.json> <language_symbol> <./matcher_exp/train_matcher_<language_symbol>> --from_en
The two files of your new dataset will be generated.
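As an optional sanity check on the aligned output, the sketch below verifies that every answer span occurs at its recorded offset in the translated context; it assumes the flat HuggingFace QA schema with parallel `text` / `answer_start` lists, and the path is a placeholder:

```python
# Verify that each aligned answer span matches its recorded offset in the
# translated context. The path is a placeholder for a generated file.
import json

with open("<path/to/generated_train_file.json>", encoding="utf-8") as f:
    data = json.load(f)

records = data if isinstance(data, list) else data.get("data", [])
misaligned = 0
for ex in records:
    for text, start in zip(ex["answers"]["text"], ex["answers"]["answer_start"]):
        if ex["context"][start:start + len(text)] != text:
            misaligned += 1
print(f"misaligned answer spans: {misaligned} out of {len(records)} records")
```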