An Emacs tokenizer that segments CJK words using the WinRT API or ICU.
EWT stands for Emacs Windows Tokenizer, but it works on all platforms when built with ICU.
This crate provides the dynamic module that emt.el consumes. Install emt.el first, then put the module's dynamic library at `emt-lib-path` (by default `~/.emacs.d/modules/libEMT.{dll,so,etc}`).
Download a binary from Releases, or take a CI artifact if you want an unversioned build.
The Windows `.dll` files are only compatible with Emacs built with UCRT. MSVCRT is not supported.
- Install Rust toolchain (On Windows, please target x86_64-pc-windows-gnu)
- (On Windows) Install MSYS2
- `cargo build --release` to use ICU
- `cargo build --release --no-default-features -F windows` to use the WinRT API
The segmenter language used with the WinRT API is hardcoded. To segment a different language, change `zh-CN` in the source to your preferred language tag and rebuild.
Microsoft doesn't provide a WinRT API for C and never will.
C++20 is required for cppwinrt. I ran into an auto type deduction error in the cppwinrt header file that I could not fix. The binary could be much smaller (~100k?) with C++, though, so it would be preferable if it worked.
I have to use `unsafe extern "C"` throughout to write the Rust binding. The safety is no better than C++, but Rust has better WinRT API support and type inference. When built with LTO, the size of ~260K is acceptable.
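To illustrate what binding to Emacs through `unsafe extern "C"` without a helper crate looks like, here is a minimal sketch of the two symbols Emacs looks for when it loads a dynamic module. It is not this crate's actual code: `EmacsRuntime` is a hypothetical opaque stand-in for C's `struct emacs_runtime`, the sketch registers no functions, and it assumes Rust 1.82 or newer for the `#[unsafe(no_mangle)]` spelling (older toolchains write `#[no_mangle]`).

```rust
use std::os::raw::c_int;

/// Hypothetical opaque stand-in for C's `struct emacs_runtime`.
/// A real binding would declare its fields to reach the environment.
#[repr(C)]
pub struct EmacsRuntime {
    _private: [u8; 0],
}

/// Emacs refuses to load a module that does not export this symbol.
#[unsafe(no_mangle)]
#[allow(non_upper_case_globals)]
pub static plugin_is_GPL_compatible: c_int = 0;

/// Entry point that Emacs calls once after loading the shared library.
/// Returning 0 reports success; a non-zero value reports failure.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn emacs_module_init(runtime: *mut EmacsRuntime) -> c_int {
    if runtime.is_null() {
        return 1;
    }
    // A real module would fetch the environment here and register its functions.
    0
}

fn main() {
    // Stand-alone smoke test only; in practice Emacs makes this call.
    let mut rt = EmacsRuntime { _private: [] };
    let status = unsafe { emacs_module_init(&mut rt) };
    println!("init status: {status}");
}
```

In a real build the crate is compiled as a `cdylib` and has no `main`; it is included here only so the sketch can be run on its own.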
Personally I recommend the WinRT API for Simplified Chinese and ICU for Traditional Chinese.
WinRT API | ICU |
---|---|
'有|异曲同工|之|妙' | '有异|曲|同工|之|妙' |
'有|異|曲|同工|之|妙' | '有|異曲同工|之|妙' |
'丧心病狂|的|异想天开' | '丧心病狂|的|异|想|天|开' |
This crate handles strings at the char level instead of the grapheme cluster level. However, this causes no problems, probably because emt.el only uses the helper function when moving over CJK characters.
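As a rough illustration of what "char level" means, the sketch below turns a `|`-separated segmentation, like the ones in the table above, into word boundaries counted in Rust `char`s. The function `char_bounds` is hypothetical and not part of this crate. The last line shows where chars and grapheme clusters diverge: a letter followed by a combining accent is one visible character but two chars. Common CJK ideographs are each a single char, which may be why the difference has not mattered in practice.

```rust
/// Hypothetical helper: convert a '|'-separated segmentation into
/// (start, end) word ranges, counted in chars rather than bytes.
fn char_bounds(segmented: &str) -> Vec<(usize, usize)> {
    let mut bounds = Vec::new();
    let mut start = 0;
    for word in segmented.split('|') {
        let end = start + word.chars().count();
        bounds.push((start, end));
        start = end;
    }
    bounds
}

fn main() {
    // Each ideograph is one char, although it takes three bytes in UTF-8.
    println!("{:?}", char_bounds("有|异曲同工|之|妙"));
    // prints [(0, 1), (1, 5), (5, 6), (6, 7)]

    // 'e' plus a combining acute accent: one grapheme cluster, two chars.
    println!("{}", "e\u{301}".chars().count()); // prints 2
}
```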
- Try ICU Backend
- Find out why M-S-{F,B} doesn't select anything
- emt.el
- ubolonton/emacs-module-rs: I don't use it because of an issue, but it helped me learn how Emacs dynamic modules work, and it provides useful functions.
- Article: Writing an Emacs module in Rust