tiny_tokenizer


tiny_tokenizer is a simple wrapper for Japanese tokenizers.

It unifies the interface of several Japanese tokenizers, so you can switch between them easily and streamline your pre-processing.

tiny_tokenizer supports several word-level tokenizers; the examples below use MeCab and Sentencepiece.

It also provides a simple rule-based sentence tokenizer, which segments a document into sentences.

Installation

Install tiny_tokenizer on a local machine

The word-tokenizer backends (such as MeCab and Sentencepiece) are only needed for word-level tokenization; sentence-level tokenization works without them.

You can install tiny_tokenizer together with these backends via pip: pip install tiny_tokenizer[all].

Alternatively, to install tiny_tokenizer with only the SentenceTokenizer, run: pip install tiny_tokenizer.

Install tiny_tokenizer in a Docker container

You can also run tiny_tokenizer inside a Docker container. To build the image and start a shell in it, run the following commands:

docker build -t himkt/tiny_tokenizer .
docker run -it himkt/tiny_tokenizer /bin/bash
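
To check that the installation inside the container works, you can also run the test suite directly (a sketch; it assumes the image's default working directory is the repository root, which depends on the Dockerfile):

docker run -it himkt/tiny_tokenizer python -m pytest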

Example

Word-level tokenization

  • Code
from tiny_tokenizer import WordTokenizer

sentence = '自然言語処理を勉強しています'

# Tokenize with the MeCab backend
tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))

# Tokenize with the Sentencepiece backend, using a pre-trained model
tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
  • Output
[自然, 言語, 処理, を, 勉強, し, て, い, ます]
[▁, 自然, 言語, 処理, を, 勉強, し, ています]
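
Because every backend exposes the same tokenize interface, switching tokenizers only means changing the constructor call. A minimal sketch (it assumes that both the MeCab and Sentencepiece backends are installed and that the model file from the example above exists):

from tiny_tokenizer import WordTokenizer

sentence = '自然言語処理を勉強しています'

# Only the constructor changes; tokenize() is called the same way for every backend.
tokenizers = [
    WordTokenizer('MeCab'),
    WordTokenizer('Sentencepiece', model_path="data/model.spm"),
]

for tokenizer in tokenizers:
    print(tokenizer.tokenize(sentence))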

For more detail, please see the example/ directory.

Sentence-level tokenization

  • Code
from tiny_tokenizer import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが,「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
  • Output
['私は猫だ。', '名前なんてものはない。', 'だが,「かわいい。それで十分だろう」。']
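
The two tokenizers compose naturally for document pre-processing: split a document into sentences first, then tokenize each sentence into words. A minimal sketch (it assumes the MeCab backend is installed):

from tiny_tokenizer import SentenceTokenizer, WordTokenizer

document = "私は猫だ。名前なんてものはない。"

sentence_tokenizer = SentenceTokenizer()
word_tokenizer = WordTokenizer('MeCab')

# Segment the document into sentences, then tokenize each sentence into words.
for sentence in sentence_tokenizer.tokenize(document):
    print(word_tokenizer.tokenize(sentence))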

Test

python -m pytest

Acknowledgement

The Sentencepiece model used in the tests was provided by @yoheikikuta. Thanks!