fix normalization order in find_missing_characters() #105
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #105 +/- ##
=======================================
Coverage 88.33% 88.34%
=======================================
Files 234 234
Lines 13816 13821 +5
=======================================
+ Hits 12205 12210 +5
Misses 1611 1611
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 178 at r1 (raw file):
for lang_code in lang_codes:
    ex_text = ex[lang_code]
    if isinstance(tokenizer, (NllbTokenizerFast)):
I believe that isinstance calls can be expensive in some circumstances. It would be better to perform the check once.
Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion (waiting on @ddaspit)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 178 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I believe that isinstance calls can be expensive in some circumstances. It would be better to perform the check once.
Done.
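For readers following the thread, here is a minimal sketch of what "perform the check once" can look like. It is not the code from this PR: `FakeNllbTokenizer`, `collect_texts`, and the `strip()` branch are hypothetical stand-ins, chosen only so the example runs without the `transformers` package.

```python
class FakeNllbTokenizer:
    """Hypothetical stand-in for NllbTokenizerFast (assumption, not the real class)."""


def collect_texts(tokenizer, examples, lang_codes):
    # Evaluate the type check a single time, before entering the loops,
    # instead of once per example and language code.
    is_nllb = isinstance(tokenizer, FakeNllbTokenizer)

    texts = []
    for ex in examples:
        for lang_code in lang_codes:
            ex_text = ex[lang_code]
            if is_nllb:
                # Placeholder for tokenizer-specific handling; the real
                # trainer's logic is not shown in this thread.
                ex_text = ex_text.strip()
            texts.append(ex_text)
    return texts
```

The behavior is unchanged; only the number of `isinstance` calls drops from one per iteration to one per call.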
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)
To account for composite characters, I changed find_missing_characters() to normalize the training example before using it to calculate the set of all characters in the training data, rather than normalizing the set of characters.

In addition to the change, I modified one of the test cases for updating the tokenizer to include a check for handling a composite character.
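The following sketch illustrates why the order matters. It assumes NFC normalization via the standard-library `unicodedata` module as a stand-in for whatever normalizer the tokenizer actually applies; the helper names and the toy vocabulary are hypothetical and are not taken from find_missing_characters().

```python
import unicodedata


def chars_normalize_after(text):
    # Old order (as described above): build the character set first, then
    # normalize each character on its own. A base letter and its combining
    # mark are already separated, so they can never compose.
    return {unicodedata.normalize("NFC", ch) for ch in set(text)}


def chars_normalize_before(text):
    # New order: normalize the whole example first, so "e" followed by
    # U+0301 composes into the single character U+00E9.
    return set(unicodedata.normalize("NFC", text))


decomposed = "cafe\u0301"          # "e" followed by a combining acute accent
vocab = set("caf") | {"\u00e9"}    # toy vocabulary containing precomposed "é"

missing_old = chars_normalize_after(decomposed) - vocab
missing_new = chars_normalize_before(decomposed) - vocab

print(sorted(missing_old))  # ['e', '\u0301']
print(sorted(missing_new))  # []
```

With the old order, the decomposed pair is reported as two missing characters even though the vocabulary already covers the composed form; with the new order, nothing is reported as missing.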