From b943033763a36c7b43d320eeeb12089a1c5a06e1 Mon Sep 17 00:00:00 2001
From: David Dale
Date: Thu, 16 Nov 2023 02:16:01 -0800
Subject: [PATCH 1/4] update the main readme file

---
 .gitignore                |  2 ++
 README.md                 | 24 ++++++++++++++++++++++--
 install_external_tools.sh |  4 ++++
 laser_encoders/README.md  |  9 +++++++--
 4 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/.gitignore b/.gitignore
index 16290d9e..0566e9f4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,3 +10,5 @@ tasks/xnli/XNLI-1.0*
 tasks/xnli/multinli_1.0*
 .??*swp
 .idea
+__pycache__
+nllb
diff --git a/README.md b/README.md
index 96d96ff0..1800fd96 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,7 @@
 LASER is a library to calculate and use multilingual sentence embeddings.
 
 **NEWS**
+* 2023/11/16 Released [**laser_encoders**](laser_encoders), a compact pip-installable package supporting LASER-2 and LASER-3 models
 * 2023/06/26 [**xSIM++**](https://arxiv.org/abs/2306.12907) evaluation pipeline and data [**released**](tasks/xsimplusplus/README.md)
 * 2022/07/06 Updated LASER models with support for over 200 languages are [**now available**](nllb/README.md)
 * 2022/07/06 Multilingual similarity search (**xsim**) evaluation pipeline [**released**](tasks/xsim/README.md)
@@ -26,7 +27,25 @@ a language family which is covered by other languages.
 A detailed description of how the multilingual sentence embeddings are trained can
 be found [here](https://arxiv.org/abs/2205.12654), together with an experimental evaluation.
 
-## Dependencies
+## The core embedding package: `laser_encoders`
+We provide a package `laser_encoders` with minimal dependencies. 
+It supports LASER-2 (an updated signle encoder for the languages listed [below](#supported-languages))
+and LASER-3 (147 language-specific encoders described [here](nllb/README.md)).
+
+The package can be installed simply with `pip install laser_encoders` and used as below:
+
+```python
+from laser_encoders import LaserEncoderPipeline
+encoder = LaserEncoderPipeline(lang="eng_Latn")
+```
+
+The laser_encoders [readme file](laser_encoders) provides more examples of its installation and usage.
+
+## The full LASER kit
+Apart from the `laser_encoders`, we provide support for LASER-1 (the original multilingual encoder)
+and for various LASER applications listed below.
+
+### Dependencies
 * Python >= 3.7
 * [PyTorch 1.0](http://pytorch.org/)
 * [NumPy](http://www.numpy.org/), tested with 1.15.4
@@ -42,7 +61,8 @@ be found [here](https://arxiv.org/abs/2205.12654), together with an experimental
 * [pandas](https://pypi.org/project/pandas), data analysis toolkit (`pip install pandas`)
 * [Sentencepiece](https://github.com/google/sentencepiece), subword tokenization (installed automatically)
 
-## Installation
+### Installation
+* install the `laser_encoders` package by e.g. `pip install -e .` for installing it in the editable mode
 * set the environment variable 'LASER' to the root of the installation, e.g.
 `export LASER="${HOME}/projects/laser"`
 * download encoders from Amazon s3 by e.g. `bash ./nllb/download_models.sh`
diff --git a/install_external_tools.sh b/install_external_tools.sh
index 9fba8417..6aee045f 100755
--- a/install_external_tools.sh
+++ b/install_external_tools.sh
@@ -181,6 +181,10 @@ InstallMecab () {
 #
 ###################################################################
 
+echo "Installing the laser_encoders package in editable mode"
+
+pip install -e .
+
 echo "Installing external tools"
 
 InstallMosesTools
diff --git a/laser_encoders/README.md b/laser_encoders/README.md
index 4c508824..8a35c8d7 100644
--- a/laser_encoders/README.md
+++ b/laser_encoders/README.md
@@ -17,10 +17,15 @@ You can find a full list of requirements [here](requirements.txt)
 
 ## Installation
 
-You can install laser_encoders using pip:
+You can install `laser_encoders` package from PyPI:
 
 ```sh
- pip install laser_encoders
+pip install laser_encoders
+```
+
+Alternatively, you can install it from a local clone of this repository, in editable mode:
+```sh
+pip install -e .
 ```
 
 ## Usage

From 1f5b2e50aaa9907b312e7554cedbfd45506e5a85 Mon Sep 17 00:00:00 2001
From: David Dale
Date: Thu, 16 Nov 2023 02:22:59 -0800
Subject: [PATCH 2/4] wording changes

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 1800fd96..16a52490 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 LASER is a library to calculate and use multilingual sentence embeddings.
 
 **NEWS**
-* 2023/11/16 Released [**laser_encoders**](laser_encoders), a compact pip-installable package supporting LASER-2 and LASER-3 models
+* 2023/11/16 Released [**laser_encoders**](laser_encoders), a pip-installable package supporting LASER-2 and LASER-3 models
 * 2023/06/26 [**xSIM++**](https://arxiv.org/abs/2306.12907) evaluation pipeline and data [**released**](tasks/xsimplusplus/README.md)
 * 2022/07/06 Updated LASER models with support for over 200 languages are [**now available**](nllb/README.md)
 * 2022/07/06 Multilingual similarity search (**xsim**) evaluation pipeline [**released**](tasks/xsim/README.md)
@@ -27,7 +27,7 @@ a language family which is covered by other languages.
 A detailed description of how the multilingual sentence embeddings are trained can
 be found [here](https://arxiv.org/abs/2205.12654), together with an experimental evaluation.
 
-## The core embedding package: `laser_encoders`
+## The core sentence embedding package: `laser_encoders`
 We provide a package `laser_encoders` with minimal dependencies. 
 It supports LASER-2 (an updated signle encoder for the languages listed [below](#supported-languages))
 and LASER-3 (147 language-specific encoders described [here](nllb/README.md)).

From 93bbbada1c5c60e7b71413f6537d1b73262d39d5 Mon Sep 17 00:00:00 2001
From: David Dale
Date: Thu, 16 Nov 2023 02:29:21 -0800
Subject: [PATCH 3/4] update the example in the readme

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 16a52490..d86de2c6 100644
--- a/README.md
+++ b/README.md
@@ -37,6 +37,8 @@ The package can be installed simply with `pip install laser_encoders` and used a
 ```python
 from laser_encoders import LaserEncoderPipeline
 encoder = LaserEncoderPipeline(lang="eng_Latn")
+embeddings = encoder.encode_sentences(["Hi!", "This is a sentence encoder."])
+print(embeddings.shape) # (2, 1024)
 ```
 
 The laser_encoders [readme file](laser_encoders) provides more examples of its installation and usage.

From 3f270aaf89090d642649a1680368f329a2a84099 Mon Sep 17 00:00:00 2001
From: David Dale
Date: Fri, 17 Nov 2023 05:24:27 -0800
Subject: [PATCH 4/4] fix readme text

---
 .gitignore | 1 +
 README.md  | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/.gitignore b/.gitignore
index 0566e9f4..95098827 100644
--- a/.gitignore
+++ b/.gitignore
@@ -12,3 +12,4 @@ tasks/xnli/multinli_1.0*
 .idea
 __pycache__
 nllb
+dist
diff --git a/README.md b/README.md
index d86de2c6..526f9632 100644
--- a/README.md
+++ b/README.md
@@ -28,8 +28,8 @@ A detailed description of how the multilingual sentence embeddings are trained c
 be found [here](https://arxiv.org/abs/2205.12654), together with an experimental evaluation.
 
 ## The core sentence embedding package: `laser_encoders`
-We provide a package `laser_encoders` with minimal dependencies. 
-It supports LASER-2 (an updated signle encoder for the languages listed [below](#supported-languages))
+We provide a package `laser_encoders` with minimal dependencies.
+It supports LASER-2 (a single encoder for the languages listed [below](#supported-languages))
 and LASER-3 (147 language-specific encoders described [here](nllb/README.md)).
 
 The package can be installed simply with `pip install laser_encoders` and used as below: