A language identification tool that aims to be both fast and accurate. It started as a Rust port of HeLI-OTS.
Install it in your environment:
pip install heliport
Then download the model:
heliport-download
To install from source, first install the requirements, then clone the repo, build the package and compile the model:
git clone https://github.com/ZJaume/heliport
cd heliport
pip install .
heliport-convert
Just run the heliport command, which reads lines from stdin and prints one language label per line:
cat sentences.txt | heliport
eng_latn
cat_latn
rus_cyrl
...
>>> from heliport import Identifier
>>> i = Identifier()
>>> i.identify("L'aigua clara")
'cat_latn'
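For more than one sentence, the same call can be applied line by line. The following is a minimal sketch, not part of heliport itself: `count_languages` and `main` are hypothetical helpers, and the only heliport API assumed is `Identifier().identify(text)` returning a label string, as in the session above.

```python
from collections import Counter
from typing import Callable, Iterable


def count_languages(lines: Iterable[str], identify: Callable[[str], str]) -> Counter:
    """Tally the label that `identify` returns for each non-empty line."""
    counts = Counter()
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines instead of sending them to the identifier
            counts[identify(text)] += 1
    return counts


def main() -> None:
    """Hypothetical entry point: label every line on stdin and print a tally."""
    import sys

    # Imported here so the helper above can be used and tested without heliport.
    from heliport import Identifier

    identifier = Identifier()
    for label, n in count_languages(sys.stdin, identifier.identify).most_common():
        print(f"{label}\t{n}")
```

Passing the identifier in as a callable keeps the counting logic independent of the model, so it can be exercised with a stub before the model is downloaded.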
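use std::sync::Arc;
use heliport::identifier::Identifier;
use heliport::lang::Lang;
use heliport::load_models;

// Load the character and word models from the model directory
let (charmodel, wordmodel) = load_models("/dir/to/models");
// Wrap each model in an Arc before handing it to the identifier
let identifier = Identifier::new(
    Arc::new(charmodel),
    Arc::new(wordmodel),
);
// identify returns the predicted language together with its score
let (lang, score) = identifier.identify("L'aigua clara");
assert_eq!(lang, Lang::cat_Latn);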
Speed benchmarks with 100k random sentences from OpenLID, all the tools running single-threaded:
tool | time (s) |
---|---|
CLD2 | 1.12 |
HeLI-OTS | 60.37 |
lingua all high preloaded | 56.29 |
lingua all low preloaded | 23.34 |
fasttext openlid193 | 8.44 |
heliport | 2.33 |
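
To put the table in relative terms, the times can be turned into ratios and sentences per second. This is only a sketch of the arithmetic: the numbers are copied from the table above, and the helper names `relative_to` and `throughput` are made up for illustration.

```python
# Wall-clock times in seconds, copied from the benchmark table (100k sentences).
times = {
    "CLD2": 1.12,
    "HeLI-OTS": 60.37,
    "lingua all high preloaded": 56.29,
    "lingua all low preloaded": 23.34,
    "fasttext openlid193": 8.44,
    "heliport": 2.33,
}
SENTENCES = 100_000


def relative_to(times: dict, baseline: str) -> dict:
    """Return each tool's time divided by the baseline tool's time."""
    base = times[baseline]
    return {tool: t / base for tool, t in times.items()}


def throughput(times: dict, n_sentences: int) -> dict:
    """Return sentences processed per second for each tool."""
    return {tool: n_sentences / t for tool, t in times.items()}


relative = relative_to(times, "heliport")
rates = throughput(times, SENTENCES)

# Print the tools from fastest to slowest.
for tool in sorted(times, key=times.get):
    print(f"{tool:28s} {times[tool]:6.2f} s  {relative[tool]:6.2f}x  {rates[tool]:8.0f} sent/s")
```

By these figures heliport takes about 26 times less time than HeLI-OTS on this benchmark, while CLD2 is roughly twice as fast as heliport.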