The training set consists of 75k manually word-segmented sentences (about 23 words per sentence on average). The test set consists of 2,120 sentences (about 31 words per sentence) in 10 files, from 800001.seg to 800010.seg.
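In segmented corpora like this, a common convention is that the syllables of one multi-syllable word are joined by underscores, so each whitespace-separated token counts as one word. A minimal sketch of counting words per sentence under that assumption (the helper name `word_count` and the toy sentences are illustrative, not part of the dataset):

```python
# Sketch: counting words per sentence in a word-segmented corpus,
# assuming the common convention that the syllables of one word are
# joined by underscores (e.g. "học_sinh" is one word, two syllables).
sentences = [
    "học_sinh học sinh_học",  # 3 words, 5 syllables
    "tôi là sinh_viên",       # 3 words, 4 syllables
]

def word_count(sent):
    # Each whitespace-separated token is one segmented word.
    return len(sent.split())

avg = sum(word_count(s) for s in sentences) / len(sentences)
print(avg)  # → 3.0
```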
Model | F1 | Paper | Reference | Code |
---|---|---|---|---|
UITws-v1 | 98.06 | Nguyen et al. PACLING'19 | | Official |
RDRsegmenter | 97.90 | Nguyen et al. LREC'18 | | Official |
jPTDP-v2 | 97.90 | Nguyen et al. CoNLL'18 | Nguyen '18 | Official |
Biaffine | 97.90 | Dozat and Manning ICLR'17 | Nguyen '18 | |
UETsegmenter | 97.87 | Nguyen et al. RIVF'16 | | Official |
JointWPD | 97.78 | Nguyen '18 | | |
vnTokenizer | 97.33 | Le et al. LATA'08 | | Official |
JVnSegmenter | 97.06 | Nguyen et al. PACLIC'06 | | Official |
DongDu | 96.90 | | | Official |
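The F1 scores above are typically computed over word spans: a predicted word counts as correct only if both its start and end boundaries match the gold segmentation. A minimal sketch under the same underscore-joined convention (the helper names `spans` and `f1` are illustrative; this is one common way to score segmentation, not the exact scorer used by each paper):

```python
# Sketch: span-based F1 for word segmentation, assuming gold and
# predicted outputs are strings in which words are space-separated
# and the syllables of one word are joined by underscores.
def spans(segmented):
    # Map each word to its (start, end) syllable-offset span so that
    # scoring depends only on the chosen boundaries.
    out, pos = set(), 0
    for word in segmented.split():
        n = len(word.split("_"))
        out.add((pos, pos + n))
        pos += n
    return out

def f1(gold, pred):
    g, p = spans(gold), spans(pred)
    tp = len(g & p)                      # exactly matching word spans
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)

# Example: only "học" and "sinh_học" match the gold segmentation.
print(f1("học_sinh học sinh_học", "học sinh học sinh_học"))
```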
References
📜 Papers
- Nguyen et al. LREC'18, Liu et al. 2017, Liu et al. LREC'16, Nguyen et al. ICSITech'16, Nguyen et al. RIVF'16, Tran et al. 2015
- Tran et al. 2012, Le et al. 2010, Tran et al. 2010, Pham et al. 2009, Le et al. 2008, Nguyen et al. 2006
📁 Open sources
- coccoc/coccoc-tokenizer (2019) - C++
- vncorenlp/VnCoreNLP (2018) - Java
- datquocnguyen/RDRsegmenter (2017) - Java
- UETsegmenter (2016) - Java
- Vitk (2016) - Java
- pyvi (2016) - Python
- truongdo/vita (2015) - C++
- vTools (2015) - Python
- manhtai/vietseg (2015) - Python
- DongDu (2014) - C++
- Roy_VnTokenizer (2014) - Python
- vnTokenizer (2008) - Python
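Several of the earlier dictionary-based tools build on greedy longest (maximum) matching. A minimal baseline sketch of that idea, with an illustrative three-entry dictionary (`DICT`, `MAX_LEN`, and `tokenize` are hypothetical names, not the API of any tool listed above):

```python
# Sketch: greedy longest-matching word segmentation, the classic
# dictionary baseline. The tiny dictionary is illustrative only.
DICT = {"học sinh", "sinh học", "sinh viên"}
MAX_LEN = 3  # length of the longest dictionary entry, in syllables

def tokenize(text):
    syllables = text.split()
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, backing off to one syllable.
        for n in range(min(MAX_LEN, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in DICT:
                words.append(cand.replace(" ", "_"))
                i += n
                break
    return " ".join(words)

print(tokenize("học sinh học sinh học"))  # → học_sinh học_sinh học
```

Greedy matching is fast but ambiguous sequences like this one show its weakness: it commits to "học_sinh" early even when "sinh_học" might be the better split, which is why later systems add statistics or learned models on top.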