configs
scripts
.gitignore
README.md
project.yml
requirements.txt

README.md

🪐 spaCy Project: Universal Dependencies v2.5 Benchmarks

This project template lets you train a spaCy pipeline on any Universal Dependencies corpus (v2.5) for benchmarking purposes. The pipeline includes an experimental trainable tokenizer, an experimental edit tree lemmatizer, and the standard spaCy tagger, morphologizer and dependency parser components. The CoNLL 2018 evaluation script is used to evaluate the pipeline.

The template uses the `UD_English-EWT` treebank by default, but you can swap it out for any other available treebank. Just make sure to adjust the `ud_treebank` and `spacy_lang` settings in the config. Use `xx` (multi-language) for `spacy_lang` if a particular language is not supported by spaCy.

The tokenizer in particular is only intended for use in this generic benchmarking setup. It is not optimized for speed and it does not perform particularly well for languages without space-separated tokens. In production, custom rules for spaCy's rule-based tokenizer or a language-specific word segmenter such as `jieba` for Chinese or `sudachipy` for Japanese would be recommended instead.
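As a rough illustration of the two settings mentioned above, the sketch below builds the overrides for a different treebank and falls back to `xx` when the language code is not in a caller-supplied set. The helper name `treebank_vars`, the `supported` set and the placeholder treebank name are hypothetical; only `ud_treebank`, `spacy_lang`, `UD_English-EWT` and `xx` come from this README.

```python
def treebank_vars(ud_treebank, lang_code, supported):
    """Return the two config settings for a given UD treebank.

    `supported` is a caller-supplied set of language codes. It is an
    assumption of this sketch, not a list taken from spaCy itself.
    """
    # Fall back to the multi-language code when the language is unknown.
    spacy_lang = lang_code if lang_code in supported else "xx"
    return {"ud_treebank": ud_treebank, "spacy_lang": spacy_lang}


# Illustrative values only; "UD_Example-Treebank" is a placeholder name.
supported_codes = {"en", "de"}
default_settings = treebank_vars("UD_English-EWT", "en", supported_codes)
fallback_settings = treebank_vars("UD_Example-Treebank", "zz", supported_codes)
```

The returned dictionary mirrors the two settings by name, so its values can be copied into the config by hand.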

📋 project.yml

The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the spaCy projects documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using `spacy project run [name]`. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `extract` | Extract the data |
| `convert` | Convert the data to spaCy's format |
| `train-tokenizer` | Train tokenizer |
| `train-transformer` | Train transformer |
| `assemble` | Assemble full pipeline |
| `evaluate` | Evaluate on the test data and save the metrics |
| `evaluate-with-senter` | Evaluate on the test data and save the metrics |
| `package` | Package the trained model so it can be installed |
| `clean` | Remove intermediate files |
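The idea that commands are only re-run when their inputs have changed can be sketched with file checksums. This is a simplified illustration of the concept, not spaCy's actual implementation: the `lock` dictionary layout and the helper names `needs_rerun` and `record_run` are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_rerun(command_name, inputs, lock):
    """Return True if the inputs differ from what was recorded last time.

    `lock` is a plain dict mapping command name -> {path: digest}. This
    layout is hypothetical and much simpler than a real lock file.
    """
    current = {str(p): file_digest(p) for p in inputs}
    return lock.get(command_name) != current


def record_run(command_name, inputs, lock):
    """Store the current input digests after a successful run."""
    lock[command_name] = {str(p): file_digest(p) for p in inputs}
```

A caller would check `needs_rerun` before executing a command and call `record_run` afterwards, so an unchanged input set results in the command being skipped.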

⏭ Workflows

The following workflows are defined by the project. They can be executed using `spacy project run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `extract` → `convert` → `train-tokenizer` → `train-transformer` → `assemble` → `evaluate` → `evaluate-with-senter` → `package` |
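To make the ordering concrete, the following sketch builds one `spacy project run` invocation per step of the `all` workflow without executing anything. The helper `workflow_invocations` and the `ALL_WORKFLOW` constant are hypothetical names for this example; in practice, running the `all` workflow through `spacy project run` handles the sequencing itself.

```python
import sys

# Steps of the `all` workflow, in the order listed in the table above.
ALL_WORKFLOW = [
    "extract",
    "convert",
    "train-tokenizer",
    "train-transformer",
    "assemble",
    "evaluate",
    "evaluate-with-senter",
    "package",
]


def workflow_invocations(steps, project_dir="."):
    """Build one argument list per step, preserving the given order."""
    return [
        [sys.executable, "-m", "spacy", "project", "run", step, project_dir]
        for step in steps
    ]


invocations = workflow_invocations(ALL_WORKFLOW)
```

Each argument list could be passed to `subprocess.run(..., check=True)` so that a failing step stops the sequence before later steps start.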

🗂 Assets

The following assets are defined by the project. They can be fetched by running `spacy project assets` in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/ud-treebanks-v2.5.tgz` | URL | |
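Once the archive has been fetched, the `extract` command unpacks the data. A minimal sketch of pulling a single treebank out of a `.tgz` file is shown below. It assumes the archive has one top-level directory containing one directory per treebank. That layout, and the helper name `extract_treebank`, are assumptions of this example and are not taken from the project's own scripts.

```python
import tarfile
from pathlib import Path


def extract_treebank(archive_path, treebank, dest_dir):
    """Extract only members located under `<root>/<treebank>/`.

    Returns the list of extracted member names.
    """
    dest = Path(dest_dir)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            parts = Path(member.name).parts
            # Skip absolute paths and parent-directory references.
            if member.name.startswith("/") or ".." in parts:
                continue
            # Assumed layout: parts[0] is the root, parts[1] the treebank.
            if len(parts) >= 2 and parts[1] == treebank:
                tar.extract(member, path=dest)
                extracted.append(member.name)
    return extracted
```

Extracting only the selected treebank keeps the working directory small, which matters because the full archive covers every treebank in the release.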
