Tiny tokenizer is a simple wrapper for Japanese tokenizers.
It unifies the interface of several Japanese tokenizers.
Tiny tokenizer lets you switch between tokenizers easily and streamline your pre-processing.
tiny_tokenizer supports the following tokenizers (a short usage sketch of the dictionary-free tokenizers follows the list).
- MeCab (and natto-py)
- KyTea (and Mykytea-python)
- Sudachi (SudachiPy)
- SentencePiece (sentencepiece)
- character-based
- whitespace-based
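The two dictionary-free tokenizers can be tried without installing any external library. Below is a minimal sketch, assuming WordTokenizer accepts the names 'Character' and 'Whitespace' in the same way as the named tokenizers above:
- Code
from tiny_tokenizer import WordTokenizer

# Character-based tokenization: split the sentence into individual characters.
tokenizer = WordTokenizer('Character')
print(tokenizer.tokenize('่ช็ถ่จ่ชๅฆ็ใๅๅผทใใฆใใพใ'))

# Whitespace-based tokenization: split on spaces only.
tokenizer = WordTokenizer('Whitespace')
print(tokenizer.tokenize('่ช็ถ ่จ่ช ๅฆ็'))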
Also, tiny tokenizer provides a simple rule-based sentence tokenizer, which segments a document into sentences.
These external libraries are only needed for word-level tokenization; sentence-level tokenization works without any of them.
You can install tiny_tokenizer together with the above libraries via pip by running:
pip install tiny_tokenizer[all]
Or, if you only need SentenceTokenizer, you can install tiny_tokenizer on its own:
pip install tiny_tokenizer
You can also use tiny_tokenizer in a Docker container. To build the image and start a shell inside it, run the following commands:
docker build -t himkt/tiny_tokenizer .
docker run -it himkt/tiny_tokenizer /bin/bash
- Code
from tiny_tokenizer import WordTokenizer

sentence = '่ช็ถ่จ่ชๅฆ็ใๅๅผทใใฆใใพใ'

# Word-level tokenization with MeCab
tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))

# Subword tokenization with a trained SentencePiece model
tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
- Output
[่ช็ถ, ่จ่ช, ๅฆ็, ใ, ๅๅผท, ใ, ใฆ, ใ, ใพใ]
[โ, ่ช็ถ, ่จ่ช, ๅฆ็, ใ, ๅๅผท, ใ, ใฆใใพใ]
For more details, please see the example/ directory.
- Code
from tiny_tokenizer import SentenceTokenizer
sentence = "็งใฏ็ซใ ใๅๅใชใใฆใใฎใฏใชใใใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ"
tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
- Output
['็งใฏ็ซใ ใ', 'ๅๅใชใใฆใใฎใฏใชใใ', 'ใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ']
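Because both tokenizers expose the same tokenize interface, they compose naturally: split a document into sentences first, then tokenize each sentence into words. A minimal sketch, assuming MeCab is installed:
- Code
from tiny_tokenizer import SentenceTokenizer, WordTokenizer

document = "็งใฏ็ซใ ใๅๅใชใใฆใใฎใฏใชใใใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ"

sentence_tokenizer = SentenceTokenizer()
word_tokenizer = WordTokenizer('MeCab')

# Segment the document into sentences, then tokenize each sentence into words.
for sentence in sentence_tokenizer.tokenize(document):
    print(word_tokenizer.tokenize(sentence))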
To run the tests:
python -m pytest
The SentencePiece model used in the tests is provided by @yoheikikuta. Thanks!