Java port of C++ version of Facebook Research fastText.
This implementation supports prediction for supervised and unsupervised models, whether they are quantized or not. Please use C++ version of fastText for train, test and quantization.
fastText4j currently supports models from fastText 1b version (support of subwords for supervised models).
This library offers two implementations of fastText library:
- A regular in-memory model, which is a simple port of the C++ version
- A memory-mapped version of the model, allowing a lower RAM usage
This second implementation relies on memory-mapped IO for reading the dictionary and the input matrix.
Note: In order to be able to use this second implementation, you will have to convert your fastText model to the appropriate memory-mapped model format.
To build and use fastText4j, you will need:
- Java 8 or above
- Maven
This project uses maven as build tool. To build fastText4j, use the following:
$ mvn package
You can convert both non-quantized and quantized fastText models to memory-mapped models.
You will have to use the binary model .bin
or .ftz
for the conversion step.
Use the following command to obtain a zip archive containing an executable jar with dependencies and a bash script to launch the jar:
$ mvn install -Papp
The zip archive will be built in the app
folder. You can then use this distribution to run the mmap model conversion:
$ cd app
$ unzip fasttext4j-app.zip
$ ./fasttext-mmap.sh -input <fastText-model-path> -output <fasttext-mmap-model-path>
Loading a memory-mapped model with fastText4j is completely transparent.
You just have to provide the path <fasttext-mmap-model-path>
that you passed to the output
parameter above.
When loading a memory-mapped model, fastText4j internally opens FileChannels that will need to be closed.
To properly close your memory-mapped model, you will need to call the .close()
method on your FastText object.
The memory-mapped FastText may only be used from one thread, because it is not thread safe (it keeps internal state like the mapped file positions).
To allow multithreaded use, every FastText instance must be cloned before being used in another thread.
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
@article{bojanowski2016enriching,
title={Enriching Word Vectors with Subword Information},
author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.04606},
year={2016}
}
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
@article{joulin2016bag,
title={Bag of Tricks for Efficient Text Classification},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
journal={arXiv preprint arXiv:1607.01759},
year={2016}
}
[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
(* These authors contributed equally.)