Thai word segmentation written in CoffeeScript

pureexe/cutthai

CutThai

If you are looking for a JavaScript library for Thai word segmentation to use in production, I strongly recommend wordcut. This repository exists to describe how Thai word segmentation works.

This work is based on the wordcut documentation, which you can find on Medium (in Thai).

Algorithm

1. Find wordlist

This project uses a dictionary-based approach, so you need a Thai word list. You can find some Thai word lists from

2. Build word Trie

Convert the word list from step 1 into a trie to speed up lookups. Read more about tries: Wikipedia - Trie

Note: This step differs from wordcut, which uses binary search.
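A trie for this step could look like the minimal sketch below. The names `createNode`, `buildTrie`, and `hasWord` are illustrative assumptions and are not taken from CutThai's source; the three-word list is only sample data.

```javascript
// Minimal trie sketch (hypothetical helper names, not CutThai's actual API).
function createNode() {
  return { children: new Map(), isWord: false };
}

// Insert every word of the list, one character per trie level.
function buildTrie(words) {
  const root = createNode();
  for (const word of words) {
    let node = root;
    for (const ch of word) {
      if (!node.children.has(ch)) {
        node.children.set(ch, createNode());
      }
      node = node.children.get(ch);
    }
    node.isWord = true; // marks the end of a complete dictionary word
  }
  return root;
}

// Walk the trie character by character; a prefix alone is not a word.
function hasWord(root, word) {
  let node = root;
  for (const ch of word) {
    node = node.children.get(ch);
    if (!node) return false;
  }
  return node.isWord;
}

const sampleTrie = buildTrie(["ฉัน", "กิน", "ข้าว"]);
console.log(hasWord(sampleTrie, "กิน")); // true
console.log(hasWord(sampleTrie, "กิ")); // false (only a prefix)
```

A lookup costs time proportional to the length of the word being checked, regardless of how many words the dictionary holds.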

3. Create word graph

The word graph is a graph used to decide where to split the text. Each vertex is a possible split position and each edge is a word. Edges are created by matching the input against the trie.
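The graph construction described here can be sketched as follows. This is an illustration under assumptions, not CutThai's implementation: `buildTrie` and `buildWordGraph` are hypothetical names, positions are string indices, and the graph is stored as an adjacency list where `edges[i]` holds every position reachable from `i` by one dictionary word.

```javascript
// Compact trie builder so the sketch is self-contained (hypothetical helper).
function buildTrie(words) {
  const root = { children: new Map(), isWord: false };
  for (const word of words) {
    let node = root;
    for (const ch of word) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), isWord: false });
      }
      node = node.children.get(ch);
    }
    node.isWord = true;
  }
  return root;
}

// For each start position, walk the trie along the text and record an edge
// start -> end whenever text.slice(start, end) is a dictionary word.
function buildWordGraph(text, root) {
  const edges = Array.from({ length: text.length + 1 }, () => []);
  for (let start = 0; start < text.length; start++) {
    let node = root;
    for (let end = start; end < text.length; end++) {
      node = node.children.get(text[end]);
      if (!node) break; // no dictionary word continues with this character
      if (node.isWord) edges[start].push(end + 1);
    }
  }
  return edges;
}

const inputText = "ฉันกินข้าว";
const wordGraph = buildWordGraph(inputText, buildTrie(["ฉัน", "กิน", "ข้าว"]));
console.log(wordGraph);
```

The graph has one more vertex than the text has characters, because the end of the text is also a split position.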

4. Find shortest path

Find the shortest path from the start vertex to the end vertex using SPFA (Shortest Path Faster Algorithm). Read more about SPFA: Wikipedia - SPFA
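A queue-based SPFA over such a graph might look like the sketch below. It assumes every edge (word) costs 1, so the shortest path is the one with the fewest words; the source does not state which weights CutThai uses. The function name `shortestPath` and the hand-written example graph are hypothetical.

```javascript
// edges[i] lists the vertices reachable from vertex i by one word.
// Returns the vertices on a shortest path from 0 to the last vertex, or null.
function shortestPath(edges) {
  const n = edges.length;
  const dist = new Array(n).fill(Infinity);
  const prev = new Array(n).fill(-1);
  const inQueue = new Array(n).fill(false);
  const queue = [0];
  dist[0] = 0;
  inQueue[0] = true;

  while (queue.length > 0) {
    const u = queue.shift();
    inQueue[u] = false;
    for (const v of edges[u]) {
      // Relax the edge u -> v; assumed weight of 1 per word.
      if (dist[u] + 1 < dist[v]) {
        dist[v] = dist[u] + 1;
        prev[v] = u;
        if (!inQueue[v]) {
          queue.push(v);
          inQueue[v] = true;
        }
      }
    }
  }

  if (dist[n - 1] === Infinity) return null; // end vertex is unreachable

  // Follow the predecessor links back from the end vertex.
  const path = [];
  for (let v = n - 1; v !== -1; v = prev[v]) path.unshift(v);
  return path;
}

// Example graph with two routes: 0 -> 2 -> 5 and 0 -> 3 -> 4 -> 5.
const exampleEdges = [[2, 3], [], [5], [4], [5], []];
console.log(shortestPath(exampleEdges)); // [ 0, 2, 5 ]
```

Returning `null` for an unreachable end vertex is a choice made for this sketch; text containing words outside the dictionary would need extra handling.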

5. Segment the sentence into an array

Use the shortest path from step 4 to split the sentence, then return the resulting words as an array.
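As a small sketch of this last step, each pair of consecutive positions on the path delimits one word. The helper name `pathToWords` is hypothetical, and the path below is written by hand for the sample sentence rather than computed.

```javascript
// Cut the text between each pair of consecutive split positions.
function pathToWords(text, path) {
  const words = [];
  for (let i = 0; i + 1 < path.length; i++) {
    words.push(text.slice(path[i], path[i + 1]));
  }
  return words;
}

const sentence = "ฉันกินข้าว";
// Assumed shortest path for this sentence: split after 3 and 6 characters.
const splitPath = [0, 3, 6, 10];
const segmented = pathToWords(sentence, splitPath);
console.log(segmented); // [ 'ฉัน', 'กิน', 'ข้าว' ]
```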

Usage

CutThai is not recommended for production use, but you can download the latest release from Releases.

Using Node.js or CommonJS

var CutThai = require("cutthai")

Using a script tag in the browser

<script src="path/to/cutthai.min.js"></script>

Run a segmentation

var cutthai = new CutThai(function(err){
  if(err){
    throw err;
  }
  console.log(cutthai.cut("ฉันกินข้าว"));
});

Thanks

wordcut - for the Thai word segmentation algorithm
LibThai - for the Thai word dictionary

Note: This document is not complete yet. It still needs grammar fixes, more pictures to describe the algorithm, and more build instructions.
