The training set consists of 75k manually word-segmented sentences (about 23 words per sentence on average). The test set consists of 2,120 sentences (about 31 words per sentence) in 10 files, from 800001.seg to 800010.seg.
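In segmented corpora like this, a common convention is that the syllables of one multi-syllable word are joined by underscores, so each whitespace-separated token counts as one word. A minimal sketch of counting words per sentence under that assumption (the helper name `word_count` and the toy sentences are illustrative, not part of the dataset):

```python
# Sketch: counting words per sentence in a word-segmented corpus,
# assuming the common convention that the syllables of one word are
# joined by underscores (e.g. "học_sinh" is one word, two syllables).
sentences = [
    "học_sinh học sinh_học",  # 3 words, 5 syllables
    "tôi là sinh_viên",       # 3 words, 4 syllables
]

def word_count(sent):
    # Each whitespace-separated token is one segmented word.
    return len(sent.split())

avg = sum(word_count(s) for s in sentences) / len(sentences)
print(avg)  # → 3.0
```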
Model | F1 | Paper | Reference | Code |
---|---|---|---|---|
UITws-v1 | 98.06 | Nguyen et al. PACLING'19 | | Official |
RDRsegmenter | 97.90 | Nguyen et al. LREC'18 | | Official |
jPTDP-v2 | 97.90 | Nguyen et al. CoNLL'18 | Nguyen '18 | Official |
Biaffine | 97.90 | Dozat and Manning ICLR'17 | Nguyen '18 | |
UETsegmenter | 97.87 | Nguyen et al. RIVF'16 | | Official |
JointWPD | 97.78 | Nguyen '18 | | |
vnTokenizer | 97.33 | Le et al. LATA'08 | | Official |
JVnSegmenter | 97.06 | Nguyen et al. PACLIC'06 | | Official |
DongDu | 96.90 | | | Official |
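The F1 scores above are typically computed over word spans: a predicted word counts as correct only if both its start and end boundaries match the gold segmentation. A minimal sketch under the same underscore-joined convention (the helper names `spans` and `f1` are illustrative; this is one common way to score segmentation, not the exact scorer used by each paper):

```python
# Sketch: span-based F1 for word segmentation, assuming gold and
# predicted outputs are strings in which words are space-separated
# and the syllables of one word are joined by underscores.
def spans(segmented):
    # Map each word to its (start, end) syllable-offset span so that
    # scoring depends only on the chosen boundaries.
    out, pos = set(), 0
    for word in segmented.split():
        n = len(word.split("_"))
        out.add((pos, pos + n))
        pos += n
    return out

def f1(gold, pred):
    g, p = spans(gold), spans(pred)
    tp = len(g & p)                      # exactly matching word spans
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)

# Example: only "học" and "sinh_học" match the gold segmentation.
print(f1("học_sinh học sinh_học", "học sinh học sinh_học"))
```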
References
📜 Papers
- Nguyen et al. LREC'18, Liu et al. 2017, Liu et al. LREC'16, Nguyen et al. ICSITech'16, Nguyen et al. RIVF'16, Tran et al. 2015
- Tran et al. 2012, Le et al. 2010, Tran et al. 2010, Pham et al. 2009, Le et al. 2008, Nguyen et al. 2006
📁 Open sources
- coccoc/coccoc-tokenizer (2019) - C++
- vncorenlp/VnCoreNLP (2018) - Java
- datquocnguyen/RDRsegmenter (2017) - Java
- UETsegmenter (2016) - Java
- Vitk (2016) - Java
- pyvi (2016) - Python
- truongdo/vita (2015) - C++
- vTools (2015) - Python
- manhtai/vietseg (2015) - Python
- DongDu (2014) - C++
- Roy_VnTokenizer (2014) - Python
- vnTokenizer (2008) - Python
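Several of the earlier dictionary-based tools build on greedy longest (maximum) matching. A minimal baseline sketch of that idea, with an illustrative three-entry dictionary (`DICT`, `MAX_LEN`, and `tokenize` are hypothetical names, not the API of any tool listed above):

```python
# Sketch: greedy longest-matching word segmentation, the classic
# dictionary baseline. The tiny dictionary is illustrative only.
DICT = {"học sinh", "sinh học", "sinh viên"}
MAX_LEN = 3  # length of the longest dictionary entry, in syllables

def tokenize(text):
    syllables = text.split()
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, backing off to one syllable.
        for n in range(min(MAX_LEN, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in DICT:
                words.append(cand.replace(" ", "_"))
                i += n
                break
    return " ".join(words)

print(tokenize("học sinh học sinh học"))  # → học_sinh học_sinh học
```

Greedy matching is fast but ambiguous sequences like this one show its weakness: it commits to "học_sinh" early even when "sinh_học" might be the better split, which is why later systems add statistics or learned models on top.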