tiny_tokenizer

Tiny tokenizer is a simple wrapper for Japanese tokenizers.

It unifies the interfaces of several Japanese tokenizers, so you can switch tokenizers with only small changes to your code and speed up your pre-processing.
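
Because every word tokenizer is exposed through the same interface, switching tokenizers only means changing the constructor arguments. A minimal sketch, reusing the MeCab and Sentencepiece configurations from the Example section below (the Sentencepiece model path is the same assumption made there):

from tiny_tokenizer import WordTokenizer

sentence = '自然言語処理を勉強しています'

# Only the constructor arguments differ; the tokenize() call is identical.
for tokenizer in [WordTokenizer('MeCab'),
                  WordTokenizer('Sentencepiece', model_path="data/model.spm")]:
    print(tokenizer.tokenize(sentence))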

tiny_tokenizer supports multiple word-level tokenizers; MeCab and Sentencepiece are the ones used in the examples below.

tiny_tokenizer also provides a simple rule-based sentence tokenizer, which segments a document into sentences.

Installation

Install tiny_tokenizer on a local machine

The external tokenizer libraries are needed only for word-level tokenization; sentence-level tokenization works without them.

You can install tiny_tokenizer together with these libraries via pip: pip install tiny_tokenizer[all].

Or, if you only need the SentenceTokenizer, install the package without extras: pip install tiny_tokenizer.
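
As a quick sanity check of a base install, the sentence tokenizer can be used without any of the extra libraries (a minimal sketch; the input reuses the sentence from the example below):

from tiny_tokenizer import SentenceTokenizer

# SentenceTokenizer has no external tokenizer dependencies,
# so this works after a plain pip install tiny_tokenizer.
print(SentenceTokenizer().tokenize('私は猫だ。名前なんてものはない。'))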

Install tiny_tokenizer in a Docker container

You can also use tiny_tokenizer in a Docker container. To build the image and start a shell inside it, run the following commands:

docker build -t himkt/tiny_tokenizer .
docker run -it himkt/tiny_tokenizer /bin/bash

Example

Word-level tokenization

  • Code
from tiny_tokenizer import WordTokenizer

sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
  • Output
[自然, 言語, 処理, を, 勉強, し, て, い, ます]
[▁, 自然, 言語, 処理, を, 勉強, し, ています]

For more details, please see the example/ directory.

Sentence-level tokenization

  • Code
from tiny_tokenizer import SentenceTokenizer

sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
  • Output
['私は猫だ。', '名前なんてものはない。', 'だが，「かわいい。それで十分だろう」。']
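
The two tokenizers can also be combined to go from a raw document to word-level tokens, sentence by sentence. A minimal sketch, assuming MeCab is installed (e.g. via pip install tiny_tokenizer[all]):

from tiny_tokenizer import SentenceTokenizer, WordTokenizer

document = '私は猫だ。名前なんてものはない。'

sentence_tokenizer = SentenceTokenizer()
word_tokenizer = WordTokenizer('MeCab')

# Split the document into sentences, then tokenize each sentence into words.
for sentence in sentence_tokenizer.tokenize(document):
    print(word_tokenizer.tokenize(sentence))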

Test

python -m pytest

Acknowledgement

The Sentencepiece model used in the tests was provided by @yoheikikuta. Thanks!