Vietnamese NLP tasks

Dependency parsing

  • The last 1,020 sentences of the benchmark Vietnamese dependency treebank VnDT are used for testing, while the remaining 9k+ sentences are used for training and development. LAS and UAS scores are computed on all tokens (i.e. including punctuation).

| POS tags | Model | LAS | UAS | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Predicted POS | Biaffine (2017) | 73.53 | 80.84 | Deep Biaffine Attention for Neural Dependency Parsing | |
| Predicted POS | jointWPD (2018) | 72.56 | 79.75 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| Predicted POS | jPTDP-v2 (2018) | 71.72 | 79.26 | An improved neural network model for joint POS tagging and dependency parsing | |
| Predicted POS | VnCoreNLP (2018) | 70.23 | 76.93 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
| Gold POS | VnCoreNLP (2018) | 73.39 | 79.02 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
| Gold POS | BIST BiLSTM graph-based parser (2016) | 73.17 | 79.39 | Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations | Official |
| Gold POS | BIST BiLSTM transition-based parser (2016) | 72.53 | 79.33 | Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations | Official |
| Gold POS | MSTparser (2006) | 70.29 | 76.47 | Online large-margin training of dependency parsers | |
| Gold POS | MaltParser (2007) | 69.10 | 74.91 | MaltParser: A language-independent system for data-driven dependency parsing | |
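
As an illustration of the scoring convention noted above (LAS and UAS computed over all tokens, punctuation included), here is a minimal sketch. The data layout (one `(head, label)` pair per token), the function name `attachment_scores`, and the dependency labels in the example are assumptions made for illustration; this is not the evaluation script behind the reported numbers.

```python
from typing import List, Tuple

# Assumed layout: one (head_index, dependency_label) pair per token,
# with head index 0 standing for the artificial root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[List[Arc]],
                      pred: List[List[Arc]]) -> Tuple[float, float]:
    """Return (LAS, UAS) as percentages over ALL tokens, punctuation included."""
    total = correct_head = correct_both = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                correct_head += 1          # counts towards UAS
                if g_label == p_label:
                    correct_both += 1      # counts towards LAS as well
    if total == 0:
        raise ValueError("no tokens to score")
    return 100.0 * correct_both / total, 100.0 * correct_head / total


# Toy sentence with four tokens; the last token is punctuation and is scored too.
# The labels are hypothetical placeholders.
gold = [[(2, "sub"), (0, "root"), (2, "dob"), (2, "punct")]]
pred = [[(2, "sub"), (0, "root"), (2, "vmod"), (3, "punct")]]
las, uas = attachment_scores(gold, pred)
print(f"LAS={las:.2f} UAS={uas:.2f}")
```

Because punctuation tokens are not filtered out, an attachment error on a punctuation mark lowers both scores, which is why these numbers are not directly comparable to results that exclude punctuation.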

Machine translation

English-to-Vietnamese translation

| Model | BLEU | Paper | Code |
| --- | --- | --- | --- |
| CVT (2018) | 29.6 | Semi-Supervised Sequence Modeling with Cross-View Training | |
| ELMo (2018) | 29.3 | Deep contextualized word representations | |
| Transformer (2017) | 28.9 | Attention Is All You Need | Link |
| Google (2017) | 26.1 | Neural machine translation (seq2seq) tutorial | Official |
| Stanford (2015) | 23.3 | Stanford Neural Machine Translation Systems for Spoken Language Domains | |

Named entity recognition

  • 16,861 sentences for training and development from the VLSP 2016 NER shared task:
    • 14,861 sentences are used for training.
    • 2k sentences are used for development.
  • Test data: 2,831 test sentences from the VLSP 2016 NER shared task.
  • NOTE that in the VLSP 2016 NER data, each word representing a full personal name is separated into the syllables that constitute the word. The VLSP 2016 NER data also includes gold POS and chunking tags, as reconfirmed by the VLSP 2016 organizers. This scheme results in an unrealistic scenario for a pipeline evaluation:
    • The standard annotation for Vietnamese word segmentation and POS tagging forms each full name as a word token, thus all word segmenters have been trained to output a full name as a word and all POS taggers have been trained to assign a POS label to the entire full name.
    • Gold POS and chunking tags are NOT available in a real-world application.
  • For a realistic scenario, contiguous syllables constituting a full name are merged to form a word. POS/chunking tags, if used, have to be automatically predicted.

| Model | F1 | Paper | Code | Note |
| --- | --- | --- | --- | --- |
| VnCoreNLP (2018) [1] | 91.30 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official | Pre-trained embeddings learned from Vietnamese Wikipedia corpus |
| BiLSTM-CRF + CNN-char (2016) [1] | 91.09 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link | Pre-trained embeddings learned from Vietnamese Wikipedia corpus |
| VNER (2019) | 89.58 | Attentive Neural Network for Named Entity Recognition in Vietnamese | | |
| VnCoreNLP (2018) | 88.55 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF + CNN-char (2016) [2] | 88.28 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF + LSTM-char (2016) [2] | 87.71 | Neural Architectures for Named Entity Recognition | Link | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF (2015) [2] | 86.48 | Bidirectional LSTM-CRF Models for Sequence Tagging | Link | Pre-trained embeddings learned from Baomoi corpus |
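
The merging step mentioned in the notes above (contiguous syllables of a full name joined into one word) could be sketched as follows. This is a hedged illustration: it assumes BIO-style tags with a `PER` type, assumes that merged syllables are joined with an underscore, and uses a hypothetical helper name `merge_person_names`. The actual preprocessing used for the reported results may differ.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, BIO tag), an assumed layout


def merge_person_names(tokens: List[Token], joiner: str = "_") -> List[Token]:
    """Merge each run of B-PER, I-PER, ... syllables into a single word token.

    The merged token keeps the B-PER tag. The underscore joiner is an
    assumption borrowed from the common Vietnamese word-segmentation format.
    """
    merged: List[Token] = []
    for form, tag in tokens:
        # Continue the current name only if the previous token opened one.
        if tag == "I-PER" and merged and merged[-1][1] == "B-PER":
            prev_form, prev_tag = merged[-1]
            merged[-1] = (prev_form + joiner + form, prev_tag)
        else:
            merged.append((form, tag))
    return merged


# Illustrative sentence; the name is split into three syllables on input.
sentence = [
    ("Nguyễn", "B-PER"), ("Văn", "I-PER"), ("An", "I-PER"),
    ("sống", "O"), ("ở", "O"), ("Hà_Nội", "B-LOC"),
]
print(merge_person_names(sentence))
```

After merging, a word segmenter and POS tagger trained on the standard annotation see the full name as one token, which matches what they would produce in a real pipeline.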

Part-of-speech tagging

  • 27,870 sentences for training and development from the VLSP 2013 POS tagging shared task:
    • 27k sentences are used for training.
    • 870 sentences are used for development.
  • Test data: 2,120 test sentences from the VLSP 2013 POS tagging shared task.

| Model | Accuracy | Paper | Code |
| --- | --- | --- | --- |
| jointWPD (2018) | 95.93 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| VnCoreNLP-VnMarMoT (2017) | 95.88 | From Word Segmentation to POS Tagging for Vietnamese | Official |
| jPTDP-v2 (2018) | 95.61 | An improved neural network model for joint POS tagging and dependency parsing | |
| BiLSTM-CRF + CNN-char (2016) | 95.40 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link |
| BiLSTM-CRF + LSTM-char (2016) | 95.31 | Neural Architectures for Named Entity Recognition | Link |
| BiLSTM-CRF (2015) | 95.06 | Bidirectional LSTM-CRF Models for Sequence Tagging | Link |
| RDRPOSTagger (2014) | 95.11 | RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger | Official |

Word segmentation

  • Training & development data: 75k manually word-segmented training sentences from the VLSP 2013 word segmentation shared task.
  • Test data: 2,120 test sentences from the VLSP 2013 POS tagging shared task.

| Model | F1 | Paper | Code |
| --- | --- | --- | --- |
| VnCoreNLP-RDRsegmenter (2018) | 97.90 | A Fast and Accurate Vietnamese Word Segmenter | Official |
| UETsegmenter (2016) | 97.87 | A hybrid approach to Vietnamese word segmentation | Official |
| jointWPD (2018) | 97.78 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| vnTokenizer (2008) | 97.33 | A Hybrid Approach to Word Segmentation of Vietnamese Texts | |
| JVnSegmenter (2006) | 97.06 | Vietnamese Word Segmentation with CRFs and SVMs: An Investigation | |
| DongDu (2012) | 96.90 | Ứng dụng phương pháp Pointwise vào bài toán tách từ cho tiếng Việt (Applying the Pointwise Method to Vietnamese Word Segmentation) | |
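
For readers unfamiliar with how a word segmentation F1 is typically obtained, the sketch below shows one common definition: a predicted word counts as correct only if both of its boundaries match a gold word. The span-matching definition, the underscore-joined word format, and the function names are assumptions for illustration; the source does not specify the exact scorer used for the table above.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str], joiner: str = "_") -> Set[Tuple[int, int]]:
    """Map a segmented sentence to a set of (start, end) syllable offsets.

    Assumes multi-syllable words are written with `joiner` between syllables
    and that the joiner never appears inside a syllable.
    """
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        n_syllables = len(word.split(joiner))
        spans.add((start, start + n_syllables))
        start += n_syllables
    return spans


def segmentation_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Corpus-level F1 (in percent) over words with exactly matching boundaries."""
    matched = n_gold = n_pred = 0
    for gold_sent, pred_sent in zip(gold, pred):
        g, p = word_spans(gold_sent), word_spans(pred_sent)
        matched += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if matched == 0:
        return 0.0
    precision = matched / n_pred
    recall = matched / n_gold
    return 100.0 * 2 * precision * recall / (precision + recall)


# The prediction wrongly splits the two-syllable word "sinh_viên".
gold = [["Tôi", "là", "sinh_viên"]]
pred = [["Tôi", "là", "sinh", "viên"]]
f1 = segmentation_f1(gold, pred)
print(f"F1={f1:.2f}")
```

In this toy case two of the four predicted words match the three gold words, so precision is 2/4 and recall is 2/3.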