This repository contains the MCoNaLa dataset and the code implementation of baseline models in the following paper:
MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
MCoNaLa is available on Huggingface Hub here
MCoNaLa has its leaderboard powered by ExplainaBoard, where you can upload and analyze your own system results with just a few simple clicks. Follow the detailed instructions below to submit your results to the leaderboard.
The Multilingual CoNaLa dataset contains intent-snippet pairs collected from three different language versions of StackOverflow forums.
These samples are located in the dataset/test directory, where es_test.json, ja_test.json, and ru_test.json are the original annotated samples.
For the trans-test setting in baseline experiments, we also provide the translated versions under the flores101 directory: es_test_to_en.json, ja_test_to_en.json, and ru_test_to_en.json, where the Spanish/Japanese/Russian intents are translated into English using the FLORES-101 model.
To study the influence of translation quality, we also experiment with two other widely used Machine Translation (MT) systems: MarianMT and M2M. The intents in the Spanish/Japanese/Russian samples are translated into English using the respective MT systems and put into the marianmt and m2m directories.
Because multilingual samples are scarce, we use English CoNaLa samples, whose intents are originally written in English, for training.
The dataset/train directory contains the annotated train.json, the samples automatically mined from StackOverflow webpages (mined.jsonl), and the API documents (api.jsonl).
However, due to GitHub's file size limit for uploads, we provide the training data via Zenodo instead.
In the trans-train experiment setting, we also translate the English intents into the three target languages of interest using FLORES-101, under the to-es, to-ja, and to-ru directories.
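Since the data mixes .json and .jsonl files, a small loader that handles both can be handy. The following is a minimal sketch, not part of the repository: the function name load_samples is hypothetical, and it assumes each .json file holds a single top-level array while each .jsonl file holds one JSON object per line.

```python
import json
from pathlib import Path


def load_samples(path):
    """Load samples from a .json (one array) or .jsonl (one object per line) file."""
    path = Path(path)
    with open(path, encoding="utf-8") as f:
        if path.suffix == ".jsonl":
            # JSON Lines: parse every non-empty line on its own.
            return [json.loads(line) for line in f if line.strip()]
        # Plain JSON: assumed to be a single top-level array of samples.
        return json.load(f)
```

For example, load_samples("dataset/test/es_test.json") would return the annotated Spanish test samples as a list, provided the file follows the assumed layout.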
Spanish, Japanese, and Russian are the Target Languages (TL), whose samples are used only for testing because so few are available.
English is the High-Resource Language (HRL), whose samples can be leveraged for model training.
To give an illustration, the directory is organized as:
.
├── README.md
└── datasets
    ├── test
    │   ├── flores101
    │   │   ├── es_test_to_en.json
    │   │   ├── ja_test_to_en.jsonl
    │   │   └── ru_test_to_en.jsonl
    │   ├── marianmt
    │   │   ├── es_test_to_en.json
    │   │   ├── ja_test_to_en.jsonl
    │   │   └── ru_test_to_en.jsonl
    │   ├── m2m
    │   │   ├── es_test_to_en.json
    │   │   ├── ja_test_to_en.jsonl
    │   │   └── ru_test_to_en.jsonl
    │   ├── es_test.json
    │   ├── ja_test.json
    │   └── ru_test.json
    └── train
        ├── to-es
        │   ├── train_to_es.json
        │   ├── mined_to_es.jsonl
        │   └── api_to_es.jsonl
        ├── to-ja
        │   ├── train_to_ja.json
        │   ├── mined_to_ja.jsonl
        │   └── api_to_ja.jsonl
        ├── to-ru
        │   ├── train_to_ru.json
        │   ├── mined_to_ru.jsonl
        │   └── api_to_ru.jsonl
        ├── train.json
        ├── mined.jsonl
        └── api.jsonl
- translate-train
The trans-train setting evaluates samples in different languages as independent tasks. Taking Spanish (es) as an example, we use the translated CoNaLa samples in train/to-es (train_to_es.json, mined_to_es.jsonl, and api_to_es.jsonl) for training, then test on test/es_test.json.
Japanese (ja) and Russian (ru) samples are handled in the same way.
- translate-test
The trans-test setting evaluates samples in the three target languages using the same model. Specifically, we jointly use the original English CoNaLa samples train/train.json, train/mined.jsonl, and train/api.jsonl for training. The resulting model is evaluated on the translated versions es_test_to_en_xxx.json, ja_test_to_en_xxx.json, and ru_test_to_en_xxx.json, where xxx stands for the MT model used (flores101, marianmt, m2m). Our experiments test on the flores101-translated samples by default.
- zero-shot
The zero-shot setting trains the model using English samples (train/train.json, train/mined.jsonl, train/api.jsonl) and directly tests on multilingual samples (test/es_test.json, test/ja_test.json, test/ru_test.json).
Intuitively, this requires the model to encode natural language intents in multiple languages without being explicitly trained on them.
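The three settings differ only in which files are used for training and testing, which can be summarized as a lookup. The sketch below is illustrative and not part of the repository: experiment_files and its parameters are hypothetical names, the root defaults to the dataset directory named in the prose, and the translated test files are assumed to sit in one sub-directory per MT system with a .json extension. Check the paths against your own checkout before relying on them.

```python
from pathlib import Path

TARGET_LANGS = ("es", "ja", "ru")
MT_SYSTEMS = ("flores101", "marianmt", "m2m")
ENGLISH_TRAIN = ("train.json", "mined.jsonl", "api.jsonl")


def experiment_files(setting, lang, mt="flores101", root="dataset"):
    """Return the train files and the test file for one setting and language."""
    if lang not in TARGET_LANGS:
        raise ValueError(f"unknown target language: {lang}")
    if mt not in MT_SYSTEMS:
        raise ValueError(f"unknown MT system: {mt}")
    train_dir = Path(root) / "train"
    test_dir = Path(root) / "test"
    english = [train_dir / name for name in ENGLISH_TRAIN]

    if setting == "trans-train":
        # Train on English samples translated into the target language.
        sub = train_dir / f"to-{lang}"
        train = [sub / f"train_to_{lang}.json",
                 sub / f"mined_to_{lang}.jsonl",
                 sub / f"api_to_{lang}.jsonl"]
        test = test_dir / f"{lang}_test.json"
    elif setting == "trans-test":
        # Train on English; test on intents translated into English.
        # Assumed layout: one sub-directory per MT system.
        train = english
        test = test_dir / mt / f"{lang}_test_to_en.json"
    elif setting == "zero-shot":
        # Train on English; test directly on the original intents.
        train = english
        test = test_dir / f"{lang}_test.json"
    else:
        raise ValueError(f"unknown setting: {setting}")
    return {"train": train, "test": test}
```

Note that trans-test and zero-shot share the same training files, so the two settings differ only in the test file.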
Go to the submission site here and click the New button on the top-right to start a new submission, then fill out a few blanks in the pop-up window:
- System Name: give an informative name for your system
- Task: select 'machine-translation' from the drop-down list
- Dataset: select 'mconala' with the target language (es/ja/ru) from the drop-down list, and for Split select 'test'
- System Output: click on 'Text' and submit your results in TXT format. Please make sure that your results file has the same number of lines as the corresponding test set. If a predicted code snippet contains \n, a single prediction would be spread over multiple lines. One trick to fix this is to call a_multi_line_string.replace('\n', '\\n') before writing to the file.
- Metrics: select 'bleu', which computes the code-specific BLEU-4 score.
- Check that the Input Lang is automatically filled with your target NL (es/ja/ru) and the Output Lang is python.
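The newline trick for the System Output file can be wrapped in a small helper. This is a minimal sketch with hypothetical function names (write_predictions, count_lines); it only applies the replace('\n', '\\n') step described above and lets you verify the line count before uploading.

```python
def write_predictions(predictions, path):
    """Write one prediction per line, escaping newlines inside each snippet."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for snippet in predictions:
            # A literal newline would split one prediction across lines,
            # so store it as the two characters backslash + n instead.
            f.write(snippet.replace("\n", "\\n") + "\n")


def count_lines(path):
    """Count lines terminated by a newline character in the file."""
    with open(path, encoding="utf-8", newline="\n") as f:
        return sum(1 for _ in f)
```

After writing, count_lines(path) should equal the number of samples in the corresponding test set.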
Click the Submit button on the bottom, then your results are ready in a few seconds!
You can also click the Analysis button on the right to view more fine-grained analyses with cool figures.
To present the baseline performance on the Multilingual CoNaLa dataset, we use three state-of-the-art models that are proficient at multilingual learning or code generation.
Set the root directory using the following command, as this would be required by most experimental bash scripts.
export ROOT_DIR=`pwd`
mBART is a multilingual denoising auto-encoder trained for machine translation tasks.
To reproduce the baseline result of mBART, follow the steps below:
Install fairseq
Clone and install the repository.
git clone git@github.com:pytorch/fairseq.git
cd fairseq
# pip install .
pip install fairseq==0.10.2
# pip install fairseq==1.0.0a0+53bf2b1
cd ..
Warning: resolving some instantiation errors may require an earlier version (e.g., fairseq==0.10.2).
Also download the pre-trained mBART model checkpoint.
mkdir checkpoint && cd checkpoint
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz
tar -xzvf mbart.cc25.v2.tar.gz
cd ..
Data pre-processing is conducted on both the NL intents and the code snippets, in three consecutive steps: 1) sentence-piece tokenization, 2) fairseq preprocessing, and 3) data binarization.
Before pre-processing, make sure to install SPM here, or run:
pip install sentencepiece
First, we need to extract the intents and snippets into line-by-line text files.
To process all samples in the provided dataset, use the script
bash extract_lines.sh
which will create a dataset/lines directory with all processed training and testing files.
One can also process json/jsonl files in a specific folder by:
python extract_lines.py --input_dir source_dir --output_dir target_dir
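To illustrate what a line-by-line extraction involves, here is a minimal sketch. It is not the repository's extract_lines.py: the function names are hypothetical, the sample keys "intent" and "snippet" are assumptions borrowed from the English CoNaLa format, and escaping newlines as \n is one possible way to keep each sample on a single line.

```python
def to_single_line(text):
    """Collapse a possibly multi-line string onto one line."""
    return text.replace("\r\n", "\n").replace("\n", "\\n")


def write_parallel_lines(samples, intent_path, snippet_path,
                         intent_key="intent", snippet_key="snippet"):
    """Write line i of both files from sample i, so the two files stay aligned."""
    with open(intent_path, "w", encoding="utf-8", newline="\n") as f_nl, \
         open(snippet_path, "w", encoding="utf-8", newline="\n") as f_code:
        for sample in samples:
            f_nl.write(to_single_line(sample[intent_key]) + "\n")
            f_code.write(to_single_line(sample[snippet_key]) + "\n")
```

Keeping the two files aligned line by line matters because the later tokenization and binarization steps treat them as a parallel corpus.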
Next, to perform the spm tokenization, run
bash do_spm_tokenization.sh
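For readers who want to see what this step amounts to, the sketch below applies a SentencePiece model to each line and joins the pieces with spaces, which is the text form that fairseq preprocessing consumes. It is an illustration rather than the contents of do_spm_tokenization.sh: the function names are hypothetical, and the model path is whatever SentencePiece model file ships with your mBART checkpoint.

```python
def tokenize_lines(lines, encode):
    """Encode each line into pieces and join the pieces with single spaces."""
    return [" ".join(encode(line.rstrip("\n"))) for line in lines]


def load_spm_encoder(model_path):
    """Build an encoder function from a SentencePiece model file."""
    # Imported lazily so tokenize_lines can be used without sentencepiece.
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor(model_file=model_path)
    return lambda text: processor.encode(text, out_type=str)
```

A typical call would be tokenize_lines(open(lines_file, encoding="utf-8"), load_spm_encoder(model_path)), where both file names are placeholders for your own paths.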
Lastly, do the fairseq pre-processing to binarize the data files
bash do_fairseq_preprocess.sh
By default, this step will use the FLORES-101 translation for trans-test evaluation.
Head into the baseline/mbart/experiment directory.
To fine-tune a pre-trained mBART model:
bash run_train.sh
Note that we only need to train the model for the trans_train and trans_test settings. Evaluation in the zero_shot setting can directly load the saved checkpoint from the trans_test experiment.
To evaluate in the trans_train or trans_test setting:
bash run_test.sh
For evaluation in the zero_shot setting, run_test_zero_shot.sh should be easier to use.
You can change the SETTING (trans_train, trans_test) and LANG (es, ja, ru) in both scripts to run different experiments.
TranX is a natural-language-to-code generation model that is pre-trained by leveraging external knowledge. Our experiments use its code implementation to perform training and testing on the Multilingual CoNaLa dataset.
To reproduce the TranX results:
Clone the repository and install required libraries.
cd baseline/tranx
# git clone https://github.com/neulab/external-knowledge-codegen.git
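# Requires Python 3.7 (pip cannot install Python itself; set it up with your environment manager)
pip install torch==1.1.0 # the PyPI package for PyTorch is named torch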
pip install astor==0.7.1 # this is very important
bash baseline/tranx/scripts/preprocess.sh
This will organize and process the train-test files for both the trans-train and trans-test settings for the three languages.
Note: be sure to download the necessary resource via
import nltk
nltk.download('punkt')
Head into the TranX directory using
cd baseline/tranx
To pre-train with additional mined data and API documents under a specific setting (SETTING) for a specific language (LANG), run
bash scripts/run_train.sh
To further fine-tune with the annotated training set, run
bash scripts/run_tune.sh
Use scripts/test_mconala.sh for evaluation.
We provide the best pre-trained model checkpoints for all three languages and both settings under best_pretrained_models/mconala. Alter the language and setting arguments in the bash script to run individual experiments.
bash scripts/test_mconala.sh
TAE is a seq2seq model, augmented with a target auto-encoding objective, for code generation from English intents.
The tae code implementation is built upon its original repository. To reproduce the baseline performance of TAE, follow the steps below:
Clone the repository and install necessary libraries.
cd baseline/tae/code-gen-TAE/
pip install -r requirements.txt
Download the pre-trained TAE model from here.
Copy the test samples (with intents translated into English). The following script uses the FLORES-101 translation by default.
bash ../collect_data.sh
To reproduce the evaluation result on Spanish CoNaLa samples, run
python3 test_mconala.py \
--dataset_name "es-101" \
--save_dir "pretrained_weights/conala" \
--copy_bt --no_encoder_update --seed 4 \
--monolingual_ratio 0.5 --epochs 80 \
--use_conala_model
Change es-101 to ja-101 or ru-101 to test the Japanese or Russian samples.
Change xx-101 to xx-mmt or xx-m2m to test with different machine translation models.
Also, to evaluate on the English CoNaLa samples, run
python3 train.py \
--dataset_name "conala" \
--save_dir "pretrained_weights/conala" \
--copy_bt --no_encoder_update --seed 4 \
--monolingual_ratio 0.5 --epochs 80 \
--just_evaluate
@article{wang2022mconala,
title={MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages},
author={Zhiruo Wang and Grace Cuenca and Shuyan Zhou and Frank F. Xu and Graham Neubig},
journal={arXiv preprint arXiv:2203.08388},
year={2022}
}