Ensure `laser_encoders` has parity with existing LASER inference code for release #268
Why?
Before releasing `laser_encoders`, we need to ensure that we have parity (within a reasonable tolerance) with the existing `LASER` inference code. This PR updates `laser_encoders` to ensure parity and then rigorously compares the inference of both `laser_encoders` and the older `LASER` code side-by-side for both LASER(2) and LASER(3) on all of FLORES (approx. 0.5M lines of text). Results show embedding parity (`atol=1e-3`) with the exception of one line in `shn_Mymr.devtest` for LASER(2), i.e., we have parity on all of FLORES except for one sentence, for one language, on LASER(2). I'm not sure exactly what causes the difference, but I think our current parity is acceptable.
How
The existing code called the Python string function `isprintable`. However, the MOSES perl script `remove-non-printing-char.perl` instead removes a specific range of Unicode characters: those falling under the category "C" (other). Updated the code to account for this and added a new dependency in `pyproject.toml` for the library `unicategories`.
Siddharth's update to `sacremoses` resolves parity with the current version of MOSES, but unfortunately LASER uses a specific version: 4.0. Instead of requesting another update to `sacremoses` to support this deprecated version, we update the regexes ourselves and then freeze the version of `sacremoses` in `pyproject.toml`.
`laser_tokenizer`
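To illustrate why `isprintable` and a category "C" filter can disagree, here is a minimal sketch. It is not the code in this PR: it uses the standard-library `unicodedata` module rather than `unicategories`, the function names are hypothetical, and replacing matches with a space is an assumption about the perl script's behaviour.

```python
import unicodedata


def strip_with_isprintable(text: str) -> str:
    """Drop every character Python considers non-printable.

    str.isprintable() is False for the "Other" (C*) categories and also
    for "Separator" (Z*) characters other than the ASCII space.
    """
    return "".join(ch for ch in text if ch.isprintable())


def replace_category_c(text: str, replacement: str = " ") -> str:
    """Replace only characters whose general category starts with "C".

    The replacement string is an assumption made for this sketch.
    """
    return "".join(
        replacement if unicodedata.category(ch).startswith("C") else ch
        for ch in text
    )


# U+200B ZERO WIDTH SPACE is category Cf; U+00A0 NO-BREAK SPACE is category Zs.
sample = "a\u200bb\u00a0c"

by_isprintable = strip_with_isprintable(sample)  # drops both characters
by_category_c = replace_category_c(sample)       # touches only U+200B
```

The two results differ on the no-break space, which is a separator rather than an "other" character, so the two approaches can produce different token streams on the same input.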
Test plans
Check correctness against LASER2
python parity.py run_comparison_parallel --laser_type laser2
Check correctness against LASER3
python parity.py run_comparison_parallel --laser_type laser3
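As a rough illustration of the parity criterion (absolute tolerance of 1e-3 per embedding dimension), the check might look like the sketch below. It is not the actual `parity.py`: it is dependency-free, and the helper names `within_atol` and `find_mismatches` are hypothetical. A real comparison would more likely use an array library.

```python
from typing import List, Sequence

ATOL = 1e-3  # tolerance quoted in the PR description


def within_atol(a: Sequence[float], b: Sequence[float], atol: float = ATOL) -> bool:
    """Return True if two embeddings match elementwise within atol."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= atol for x, y in zip(a, b))


def find_mismatches(
    new_embs: Sequence[Sequence[float]],
    old_embs: Sequence[Sequence[float]],
    atol: float = ATOL,
) -> List[int]:
    """Return the indices of sentences whose embeddings differ by more than atol."""
    if len(new_embs) != len(old_embs):
        raise ValueError("both runs must embed the same number of sentences")
    return [
        i
        for i, (new, old) in enumerate(zip(new_embs, old_embs))
        if not within_atol(new, old, atol)
    ]


# Toy data: sentence 0 agrees within tolerance, sentence 1 does not.
new_run = [[0.1000, 0.2000], [0.5000, 0.5000]]
old_run = [[0.1004, 0.2000], [0.5000, 0.5200]]
mismatches = find_mismatches(new_run, old_run)
```

Reporting indices rather than a single pass/fail flag makes it easy to trace a failure back to a specific line of a specific file.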
Code below for `parity.py` (perhaps we should check this in somewhere?).