Tiny tokenizer is a simple wrapper for Japanese tokenizers.
It unifies the interface of several Japanese tokenizers.
Tiny tokenizer lets you switch between tokenizers easily and streamline your pre-processing.
tiny_tokenizer supports the following tokenizers (a short usage sketch of the dictionary-free tokenizers follows the list).
- MeCab (and natto-py)
- KyTea (and Mykytea-python)
- Sudachi (SudachiPy)
- SentencePiece (sentencepiece)
- character-based
- whitespace-based
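The two dictionary-free tokenizers can be tried without installing any external library. Below is a minimal sketch, assuming WordTokenizer accepts the names 'Character' and 'Whitespace' in the same way as the named tokenizers above:
- Code
from tiny_tokenizer import WordTokenizer

# Character-based tokenization: split the sentence into individual characters.
tokenizer = WordTokenizer('Character')
print(tokenizer.tokenize('่ช็ถ่จ่ชๅฆ็ใๅๅผทใใฆใใพใ'))

# Whitespace-based tokenization: split on spaces only.
tokenizer = WordTokenizer('Whitespace')
print(tokenizer.tokenize('่ช็ถ ่จ่ช ๅฆ็'))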
Also, tiny tokenizer provides a simple rule-based sentence tokenizer, which segments a document into sentences.
These external libraries are only needed for word-level tokenization; sentence-level tokenization works without any of them.
You can install tiny_tokenizer together with the above libraries via pip by running:
pip install tiny_tokenizer[all]
Or, if you only need SentenceTokenizer, you can install tiny_tokenizer on its own:
pip install tiny_tokenizer
You can also use tiny_tokenizer in a Docker container. To build the image and start a shell inside it, run the following commands:
docker build -t himkt/tiny_tokenizer .
docker run -it himkt/tiny_tokenizer /bin/bash
- Code
from tiny_tokenizer import WordTokenizer

sentence = '่ช็ถ่จ่ชๅฆ็ใๅๅผทใใฆใใพใ'

# Word-level tokenization with MeCab
tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))

# Subword tokenization with a trained SentencePiece model
tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
- Output
[่ช็ถ, ่จ่ช, ๅฆ็, ใ, ๅๅผท, ใ, ใฆ, ใ, ใพใ]
[โ, ่ช็ถ, ่จ่ช, ๅฆ็, ใ, ๅๅผท, ใ, ใฆใใพใ]
For more details, please see the example/ directory.
- Code
from tiny_tokenizer import SentenceTokenizer
sentence = "็งใฏ็ซใ ใๅๅใชใใฆใใฎใฏใชใใใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ"
tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
- Output
['็งใฏ็ซใ ใ', 'ๅๅใชใใฆใใฎใฏใชใใ', 'ใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ']
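Because both tokenizers expose the same tokenize interface, they compose naturally: split a document into sentences first, then tokenize each sentence into words. A minimal sketch, assuming MeCab is installed:
- Code
from tiny_tokenizer import SentenceTokenizer, WordTokenizer

document = "็งใฏ็ซใ ใๅๅใชใใฆใใฎใฏใชใใใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ"

sentence_tokenizer = SentenceTokenizer()
word_tokenizer = WordTokenizer('MeCab')

# Segment the document into sentences, then tokenize each sentence into words.
for sentence in sentence_tokenizer.tokenize(document):
    print(word_tokenizer.tokenize(sentence))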
To run the tests:
python -m pytest
The SentencePiece model used in the tests is provided by @yoheikikuta. Thanks!