Pkuseg

A multi-domain Chinese word segmentation toolkit.

Highlights

The pkuseg-python toolkit has the following features:

  1. Supporting multi-domain Chinese word segmentation. pkuseg-python supports segmentation across multiple domains, including news, web, medicine, and tourism. Users can choose a pre-trained model that matches the domain of the text to be segmented. If the domain is unknown, the default model trained on mixed-domain data is recommended.

  2. Higher segmentation accuracy. Compared with existing word segmentation toolkits, pkuseg-python can achieve higher F1 scores on the same dataset.

  3. Supporting model training. Users can also train a new segmentation model on their own data.

  4. Supporting POS tagging. A POS tagging interface is also provided for further lexical analysis.
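
Highlight 2 compares toolkits by F1 score. The README does not spell out how that score is computed; the sketch below assumes the conventional word-level metric, in which a predicted word counts as correct only when both of its boundaries match the gold segmentation. The function names and the sample sentences are hypothetical and are not part of pkuseg.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold, predicted):
    """Return word-level (precision, recall, F1) for one sentence."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    # A predicted word is correct only if the same span exists in the gold data.
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted segmentations of the same sentence.
gold = ['我', '爱', '北京', '天安门']
predicted = ['我', '爱', '北京', '天安', '门']
precision, recall, f1 = segmentation_f1(gold, predicted)
print(f'P={precision:.3f} R={recall:.3f} F1={f1:.3f}')
```

Here three of the five predicted words match the four gold words, so splitting '天安门' incorrectly lowers both precision and recall.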

Installation

  • Requirements: Python 3
  1. Install pkuseg-python from PyPI (includes the default model trained on mixed-domain data):

    pip3 install pkuseg
    

    or update to the latest version (suggested):

    pip3 install -U pkuseg
    
  2. Install pkuseg-python from a PyPI mirror for faster downloads:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
    

    or update to the latest version (suggested):

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
    

    Note: The two installation methods above only support Python 3.5, 3.6, and 3.7 on Linux, macOS, and 64-bit Windows.

  3. If the code is downloaded from GitHub, please run the following command to install pkuseg-python:

    python setup.py build_ext -i
    

    Note: The GitHub code does not include the pre-trained models. Users need to download them from the releases and set the parameter 'model_name' to the model path.

Usage

Examples

Example 1: Segmentation under the default configuration. If users are unsure of the domain of the text to be segmented, the default configuration is recommended.

import pkuseg

seg = pkuseg.pkuseg()  # load the default model
text = seg.cut('我爱北京天安门')
print(text)

Example 2: Domain-specific segmentation. If users know the text domain, they can select a pre-trained domain model according to the domain features.

import pkuseg
# The domain-specific model is downloaded automatically.
seg = pkuseg.pkuseg(model_name='medicine')
text = seg.cut('我爱北京天安门')
print(text)

Example 3: Segmentation and POS tagging. For the meaning of each POS tag, refer to tags.txt.

import pkuseg

seg = pkuseg.pkuseg(postag=True)                           
text = seg.cut('我爱北京天安门')
print(text)
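
A tagged result is easier to read once each word is joined with its tag. The sketch below assumes that, with postag=True, seg.cut returns a list of (word, tag) pairs. The helper name format_tagged is hypothetical, and the sample list is written by hand for illustration; it is not actual pkuseg output.

```python
def format_tagged(pairs):
    """Join (word, tag) pairs into a 'word/tag word/tag ...' string."""
    return ' '.join(f'{word}/{tag}' for word, tag in pairs)


# Hand-written sample in the assumed (word, tag) format; not real pkuseg output.
sample = [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')]
formatted = format_tagged(sample)
print(formatted)
```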

Example 4: Segmentation with a text file as input.

import pkuseg

# Take the file 'input.txt' as input.
# The segmented result is stored in the file 'output.txt'.
pkuseg.test('input.txt', 'output.txt', nthread=20)

Example 5: Segmentation with a user-defined dictionary.

import pkuseg

seg = pkuseg.pkuseg(user_dict='my_dict.txt')
text = seg.cut('我爱北京天安门')
print(text)
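
The user dictionary is a plain text file with one word per line, as described under Parameter Settings. A minimal sketch for generating such a file follows; the helper name write_user_dict and the sample words are hypothetical, and UTF-8 encoding is an assumption that the README does not state explicitly for dictionaries.

```python
from pathlib import Path


def write_user_dict(words, path):
    """Write one word per line, skipping blank entries and duplicates."""
    seen = set()
    lines = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            lines.append(word)
    # UTF-8 is assumed here; it is not confirmed by the README.
    Path(path).write_text('\n'.join(lines) + '\n', encoding='utf-8')
    return lines


# Hypothetical domain terms, including a duplicate and a blank entry.
written = write_user_dict(['天安门', '北京大学', '天安门', '  '], 'my_dict.txt')
print(written)
```

The resulting file can then be passed as user_dict='my_dict.txt', as in Example 5.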

Example 6: Segmentation with a user-trained model. Take CTB8 as an example.

import pkuseg

seg = pkuseg.pkuseg(model_name='./ctb8') 
text = seg.cut('我爱北京天安门')
print(text)

Example 7: Training a new model (randomly initialized).

import pkuseg

# Training file: 'msr_training.utf8'.
# Test file: 'msr_test_gold.utf8'.
# Save the trained model to './models'.
# The training and test files are in utf-8 encoding.
pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models')	
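
The training and test files must be UTF-8 encoded. The sketch below additionally assumes the layout used by corpora such as msr_training.utf8: one sentence per line, with words separated by whitespace. That layout is an assumption, not something this README states, and the helper name write_training_file and the sample sentences are hypothetical.

```python
def write_training_file(sentences, path):
    """Write segmented sentences, one per line, with words separated by spaces."""
    with open(path, 'w', encoding='utf-8') as f:
        for words in sentences:
            f.write(' '.join(words) + '\n')


# Hypothetical, already-segmented sentences.
sentences = [
    ['我', '爱', '北京', '天安门'],
    ['今天', '天气', '很', '好'],
]
write_training_file(sentences, 'train.txt')

with open('train.txt', encoding='utf-8') as f:
    content = f.read()
print(content)
```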

Example 8: Fine-tuning. Take a pre-trained model as input.

import pkuseg

# Training file: 'train.txt'.
# Test file: 'test.txt'.
# The path of the pre-trained model: './pretrained'.
# Save the trained model to './models'.
# The training and test files are in utf-8 encoding.
pkuseg.train('train.txt', 'test.txt', './models', train_iter=10, init_model='./pretrained')

Parameter Settings

Segmentation for sentences.

pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
	model_name		The path of the used model.
				"default". The default mixed-domain model.
				"news". The model trained on news domain data.
				"web". The model trained on web domain data.
				"medicine". The model trained on medicine domain data.
				"tourism". The model trained on tourism domain data.
				model_path. Load a model from the user-specified path.
	user_dict		Set up the user dictionary.
				"default". Use the default dictionary.
				None. No dictionary is used.
				dict_path. The path of the user-defined dictionary. Each line only contains one word.
	postag			POS tagging or not.
				False. The default setting. Segmentation without POS tagging.
				True. Segmentation with POS tagging.
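
The model_name values listed above can be chosen programmatically. The sketch below falls back to "default" whenever the domain is unknown, following the recommendation to use the mixed-domain model in that case. The helper choose_model_name is hypothetical and not part of pkuseg; it handles only the built-in domain names, not custom model paths.

```python
# Built-in domain models listed in this README.
KNOWN_DOMAINS = {'news', 'web', 'medicine', 'tourism'}


def choose_model_name(domain=None):
    """Return a built-in model name, or 'default' if the domain is unknown."""
    if domain in KNOWN_DOMAINS:
        return domain
    return 'default'


for domain in ['medicine', 'finance', None]:
    print(domain, '->', choose_model_name(domain))
```

The returned string can be passed as model_name to pkuseg.pkuseg, as in Example 2.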

Segmentation for documents.

pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
	readFile		The path of the input file.
	outputFile		The path of the output file.
	model_name		The path of the used model. Refer to pkuseg.pkuseg.
	user_dict		The path of the user dictionary. Refer to pkuseg.pkuseg.
	postag			POS tagging or not. Refer to pkuseg.pkuseg.
	nthread			The number of threads.

Model training.

pkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)
	trainFile		The path of the training file.
	testFile		The path of the test file.
	savedir			The directory where the trained model is saved.
	train_iter		The maximum number of training epochs.
	init_model		The model used for initialization. None (the default) means random initialization. Users can also load a pre-trained model as initialization, e.g. init_model='./models/'.

Publication

The toolkit is mainly based on the following publication. If you use the toolkit, please cite the paper:


@article{pkuseg,
  author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Zhang, Zhiyuan and Ren, Xuancheng and Sun, Xu},
  journal = {CoRR},
  title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},
  url = {https://arxiv.org/abs/1906.11455},
  volume = {abs/1906.11455},
  year = 2019
}

Related Work

  • Xu Sun, Houfeng Wang, and Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.
  • Jingjing Xu and Xu Sun. Dependency-based Gated Recursive Neural Network for Chinese Word Segmentation. ACL. 2016.
  • Jingjing Xu and Xu Sun. Transfer Learning for Low-Resource Chinese Word Segmentation with a Novel Neural Network. NLPCC. 2017.

Authors

Ruixuan Luo, Jingjing Xu, Xuancheng Ren, Yi Zhang, Zhiyuan Zhang, Bingzhen Wei, Xu Sun

Language Computing and Machine Learning Group, Peking University