MuCGEC: A Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction & SOTA Models
If you find this work useful for your research, please cite our paper:
MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction (Accepted by NAACL2022 main conference) [PDF]
@inproceedings{zhang-etal-2022-mucgec,
title = "{MuCGEC}: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction",
author = "Zhang, Yue and Li, Zhenghua and Bao, Zuyi and Li, Jiacheng and Zhang, Bo and Li, Chen and Huang, Fei and Zhang, Min",
booktitle = "Proceedings of NAACL-HLT",
year = "2022",
address = "Online",
publisher = "Association for Computational Linguistics"
}
Chinese Grammatical Error Correction (CGEC) aims to automatically correct spelling, grammatical, semantic, and other errors in Chinese sentences. This technology is useful in a wide range of scenarios.
Existing CGEC evaluation datasets have several limitations: they are small, provide a single reference per sentence, and draw on a single text source. To evaluate CGEC models more reliably, this repository provides MuCGEC, a new multi-reference, multi-source CGEC evaluation dataset. To promote the development of CGEC, we also provide the following additional resources:
- CGEC data annotation guidelines `./guidelines`
- CGEC evaluation tools `./scorers`
  - ChERRANT: We extend the English GEC evaluation tool ERRANT to accommodate Chinese and name it ChERRANT (Chinese ERRANT). ChERRANT supports CGEC model evaluation at both word and character granularity.
- CGEC benchmark models `./models`
  - Seq2Edit model `./models/seq2edit-based-CGEC`: This kind of model treats GEC as a sequence labeling task and performs error correction via a sequence of token-level edits, including insertion, deletion, and substitution.
    - With minor modifications to accommodate Chinese, we adopt GECToR, which achieves SOTA performance on English GEC datasets.
  - Seq2Seq model `./models/seq2seq-based-CGEC`: This kind of model treats GEC directly as a monolingual translation task.
    - We fine-tune the recently proposed Seq2Seq pretrained model Chinese-BART and use it for CGEC.
  - Ensemble model `./scorers/ChERRANT/emsemble.sh`: We adopt a simple edit-wise voting mechanism, which supports ensembling heterogeneous models (such as Seq2Seq and Seq2Edit) and leads to significant performance gains.
- CGEC tools `./tools`
  - Tokenization tools
  - Data augmentation tools (Todo)
  - Data cleaning tools (Todo)
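To make the Seq2Edit idea concrete, here is a minimal sketch of how a set of token-level edits (insertion, deletion, substitution) turns an erroneous sentence into its correction. The `(start, end, replacement)` edit format and the `apply_edits` helper are assumptions made for illustration; they are not GECToR's actual tag scheme or code from this repository.

```python
from typing import List, Tuple

# Hypothetical edit format: (start, end, replacement_tokens).
# The span [start, end) indexes the SOURCE tokens; an empty span is an insertion,
# an empty replacement is a deletion, anything else is a substitution.
Edit = Tuple[int, int, List[str]]


def apply_edits(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply non-overlapping span edits to a token list."""
    out = list(tokens)
    # Apply from right to left so that earlier source indices stay valid.
    for start, end, replacement in sorted(edits, key=lambda e: (e[0], e[1]), reverse=True):
        out[start:end] = replacement
    return out


source = list("我今天很很高心，因为见到朋友")  # character-level tokens
edits = [
    (3, 4, []),       # deletion: remove the duplicated "很"
    (6, 7, ["兴"]),    # substitution: "心" -> "兴"
    (12, 12, ["了"]),  # insertion: add "了" before "朋友"
]
corrected = apply_edits(source, edits)
print("".join(corrected))  # 我今天很高兴，因为见到了朋友
```

A Seq2Seq model, by contrast, would generate the corrected sentence directly without producing such explicit edits.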
Our dataset consists of texts written by learners of Chinese. We select data from three sources: the NLPCC18 corpus, the CGED corpus, and the Chinese Lang8 corpus. Each sentence is corrected by three annotators, and their corrections are carefully reviewed by an expert, resulting in 2.3 references per sentence on average. The detailed statistics are shown below:
Subset | #Sents | #Erroneous sents (%) | Chars/sent | Edits/sent | Refs/sent |
---|---|---|---|---|---|
MuCGEC-NLPCC18 | 1996 | 1904(95.4%) | 29.7 | 2.5 | 2.5 |
MuCGEC-CGED | 3125 | 2988(95.6%) | 44.8 | 4.0 | 2.3 |
MuCGEC-Lang8 | 1942 | 1652(85.1%) | 37.5 | 2.8 | 2.1 |
MuCGEC-ALL | 7063 | 6544(92.7%) | 38.5 | 3.2 | 2.3 |
Compared with previous CGEC evaluation sets (such as NLPCC18-orig and CGED-orig), MuCGEC has richer references and more data sources. In addition, during annotation we found 74 sentences that could not be annotated because their meaning is unclear.
For more details about MuCGEC, please refer to our paper.
Note: we are planning a CGEC evaluation task at the CCL 2022 conference based on MuCGEC, so the dataset has not been released yet. We will release it soon. Thank you for your patience.
Our experiments use Python 3.8. The required dependencies can be installed with the commands below. Because the environments of the Seq2Seq and Seq2Edit models conflict with each other, the two environments need to be installed separately:
```bash
# Seq2Edit environment
pip install -r requirements_seq2edit.txt
# Seq2Seq environment
pip install -r requirements_seq2seq.txt
```
The training data used in our experiments consists of: 1) the Chinese Lang8 corpus; 2) the HSK corpus. We upsample the HSK corpus 5 times. We use only the erroneous part of the training data.
```bash
cd ./data/train_data
chmod +x download.sh
./download.sh
```
Note: the copyright status of the HSK data is still under discussion, so the training data download link is not available at present.
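The data composition described above can be sketched as follows. This is only an illustration under stated assumptions: `build_training_set` is a hypothetical helper, sentence pairs are assumed to be `(source, target)` tuples, and "erroneous" is assumed to mean that the source differs from its correction. The actual preprocessing lives in the pipeline scripts of this repository and may differ.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, corrected sentence)


def build_training_set(lang8: List[Pair], hsk: List[Pair], hsk_upsample: int = 5) -> List[Pair]:
    """Combine Lang8 with an upsampled HSK corpus, keeping only erroneous pairs."""

    def erroneous(pairs: List[Pair]) -> List[Pair]:
        # Assumption: a pair is "erroneous" when the source differs from the target.
        return [(src, tgt) for src, tgt in pairs if src != tgt]

    # Repeat the (filtered) HSK pairs hsk_upsample times.
    return erroneous(lang8) + erroneous(hsk) * hsk_upsample


# Toy placeholder pairs, only to show the shapes involved.
lang8_pairs = [("src-1", "tgt-1"), ("same", "same")]
hsk_pairs = [("src-2", "tgt-2")]
train = build_training_set(lang8_pairs, hsk_pairs)
print(len(train))  # 6: one Lang8 pair plus the HSK pair repeated 5 times
```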
We provide pipeline scripts for using our models, covering preprocessing, training, and inference. Please refer to `./models/seq2edit-based-CGEC/pipeline.sh` and `./models/seq2seq-based-CGEC/pipeline.sh`.
We also provide converged checkpoints for testing (the metrics below are precision/recall/F0.5):
Model | NLPCC18-Official (M2Scorer) | MuCGEC (ChERRANT) |
---|---|---|
seq2seq_lang8[Link] | 37.78/29.91/35.89 | 40.44/26.71/36.67 |
seq2seq_lang8+hsk[Link] | 41.50/32.87/39.43 | 44.02/28.51/39.70 |
seq2edit_lang8[Link] | 37.43/26.29/34.50 | 38.08/22.90/33.62 |
seq2edit_lang8+hsk[Link] | 43.12/30.18/39.72 | 44.65/27.32/39.62 |
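For readers unfamiliar with the F0.5 column, the sketch below shows the standard F-beta formula with beta = 0.5, which weights precision more heavily than recall. The function name `f_beta` is ours, and the scorers compute the metric from edit counts rather than from rounded percentages, so recomputed values may differ slightly in the last digit.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# seq2seq_lang8 on NLPCC18-Official: P = 37.78, R = 29.91
print(round(f_beta(37.78, 29.91), 2))  # 35.89
```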
The ensemble strategy used in our paper can be found in `./scorers/ChERRANT/emsemble.sh`.
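The idea of edit-wise voting can be sketched in a few lines: collect the edits each model proposes for a sentence and keep those proposed by at least a given number of models. The `(start, end, replacement)` edit format, the `vote_edits` helper, and the threshold parameter are assumptions for illustration; the actual script may extract, count, and resolve edits differently.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical edit format: (start, end, replacement) over the source sentence.
Edit = Tuple[int, int, str]


def vote_edits(model_edits: List[Set[Edit]], threshold: int) -> List[Edit]:
    """Keep the edits proposed by at least `threshold` models."""
    # Each model contributes a set, so it votes at most once per edit.
    counts = Counter(edit for edits in model_edits for edit in edits)
    return sorted(edit for edit, votes in counts.items() if votes >= threshold)


seq2seq_a = {(3, 4, ""), (6, 7, "兴")}
seq2edit_a = {(6, 7, "兴"), (12, 12, "了")}
seq2edit_b = {(3, 4, ""), (6, 7, "兴")}
print(vote_edits([seq2seq_a, seq2edit_a, seq2edit_b], threshold=2))
# [(3, 4, ''), (6, 7, '兴')]
```

Because voting operates on edits rather than on model internals, outputs of heterogeneous models can be combined as long as their edits are extracted in the same format. This sketch does not resolve conflicts between overlapping edits that both pass the threshold.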
- We found that some tricks that are useful for English GEC remain effective for Chinese, such as the confidence bias trick in GECToR and the R2L reranking trick for Seq2Seq models. You can try them yourself.
- We found that decomposing the training procedure into two stages (first train on HSK+Lang8, then fine-tune on HSK) leads to further improvement. You can re-train our models following this two-stage training strategy.
- Our Seq2Seq models based on Chinese-BART can be improved: 1) the original vocabulary of Chinese-BART lacks some common Chinese punctuation marks and characters; 2) training and generation with the transformers library are relatively slow. We recently re-implemented our Seq2Seq model with fairseq and applied some additional tricks, which greatly improved its performance (by 4-5 F0.5 points) and its speed. We will release the improved version in the future.
- None of the baselines we provide uses pseudo data. For data augmentation, please refer to our paper for the CTC-2021 CGEC competition [Link].
To obtain evaluation results on the official NLPCC18 dataset, make predictions with our benchmark models and evaluate them with M2Scorer. Note that the predictions must be word-segmented with the PKUSEG tool.
To obtain evaluation results on MuCGEC, evaluate the predictions with our proposed ChERRANT. Please refer to `./scorers/ChERRANT/demo.sh` for how to use ChERRANT.
Error types in ChERRANT
- Operation tier (word/char granularity):
  - M (missing): missing error, which means tokens need to be inserted;
  - R (redundant): redundant error, which means tokens need to be deleted;
  - S (substitute): substitution error, which means tokens need to be replaced;
  - W (word-order): word-order error, which means tokens need to be re-ordered.
- Linguistic tier (word granularity):
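As a rough illustration of the operation tier, the sketch below assigns one of the four labels (M, R, S, W) to a single edit given its source and target token spans. The `operation_type` helper and its rule for word-order errors (same tokens, different order) are simplifying assumptions, not ChERRANT's actual classification logic.

```python
from collections import Counter
from typing import List


def operation_type(src_tokens: List[str], tgt_tokens: List[str]) -> str:
    """Label an edit as M (missing), R (redundant), S (substitute), or W (word-order)."""
    if src_tokens == tgt_tokens:
        raise ValueError("source and target spans are identical; this is not an edit")
    if not src_tokens:
        return "M"  # nothing in the source span: tokens must be inserted
    if not tgt_tokens:
        return "R"  # nothing in the target span: tokens must be deleted
    if Counter(src_tokens) == Counter(tgt_tokens):
        return "W"  # same tokens in a different order (assumed rule)
    return "S"      # otherwise the tokens are replaced


print(operation_type([], ["了"]))                      # M
print(operation_type(["很"], []))                      # R
print(operation_type(["心"], ["兴"]))                   # S
print(operation_type(["学校", "去"], ["去", "学校"]))    # W
```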
- We used some techniques from this repo in the CTC-2021 evaluation task and obtained the top-1 score. Please see: CTC-report.
- Online demonstration platform of the baseline models: GEC demo.
If you have any problems, feel free to contact me at [email protected].