fix normalization order in find_missing_characters() #105

mshannon-sil · 2024-04-05T22:33:56Z

To account for composite characters, I changed find_missing_characters() to normalize the training example before using it to calculate the set of all characters in the training data, rather than normalizing the set of characters.

In addition to the change, I modified one of the test cases for updating the tokenizer to include a check for handling a composite character.
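
To illustrate why the order matters, here is a minimal sketch using Python's standard `unicodedata` module. The helper names and the choice of NFKC are assumptions made for the example; the actual trainer may normalize differently. The point is only that a base letter and a combining mark, once separated into a set, can no longer be composed into a single character.

```python
import unicodedata


def charset_normalize_after(text: str) -> set:
    # Hypothetical "old" order: build the character set first, then
    # normalize each character in isolation. A combining mark that has
    # been split from its base letter cannot be composed anymore.
    return {unicodedata.normalize("NFKC", ch) for ch in set(text)}


def charset_normalize_before(text: str) -> set:
    # Hypothetical "new" order: normalize the whole line first, then
    # build the character set from the normalized text.
    return set(unicodedata.normalize("NFKC", text))


# 'e' followed by U+0301 (combining acute accent), i.e. a decomposed "é"
decomposed = "e\u0301"

after = charset_normalize_after(decomposed)    # {'e', '\u0301'}
before = charset_normalize_before(decomposed)  # {'\u00e9'}
```

With the first order, the precomposed character U+00E9 never shows up in the character set even though it is what the normalized training text actually contains.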

codecov-commenter · 2024-04-05T22:35:05Z

Codecov Report

Attention: Patch coverage is 90.90909%, with 1 line in your changes missing coverage. Please review.

Project coverage is 88.34%. Comparing base (bda3b54) to head (7942dfe).

Files	Patch %	Lines
...tion/huggingface/hugging_face_nmt_model_trainer.py	87.50%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #105   +/-   ##
=======================================
  Coverage   88.33%   88.34%           
=======================================
  Files         234      234           
  Lines       13816    13821    +5     
=======================================
+ Hits        12205    12210    +5     
  Misses       1611     1611

ddaspit

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 178 at r1 (raw file):

                for lang_code in lang_codes:
                    ex_text = ex[lang_code]
                    if isinstance(tokenizer, (NllbTokenizerFast)):

I believe that isinstance calls can be expensive in some circumstances. It would be better to perform the check once.

mshannon-sil

Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion (waiting on @ddaspit)

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 178 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I believe that isinstance calls can be expensive in some circumstances. It would be better to perform the check once.

Done.
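
For readers following along, the pattern being asked for looks roughly like the sketch below. This is not the code from the branch: `FakeNllbTokenizer`, its `normalize` method, and `collect_charset` are hypothetical stand-ins so the example runs without `transformers` installed. The type check is evaluated once before the loops and the stored boolean is reused for every example.

```python
import unicodedata


class FakeNllbTokenizer:
    # Hypothetical stand-in for NllbTokenizerFast, for illustration only.
    def normalize(self, text: str) -> str:
        # NFKC is an assumption; the real tokenizer has its own normalizer.
        return unicodedata.normalize("NFKC", text)


def collect_charset(tokenizer, examples, lang_codes) -> set:
    # Perform the type check once, outside the loops.
    is_nllb = isinstance(tokenizer, FakeNllbTokenizer)
    charset = set()
    for ex in examples:
        for lang_code in lang_codes:
            ex_text = ex[lang_code]
            if is_nllb:
                # Normalize the whole line before collecting characters.
                ex_text = tokenizer.normalize(ex_text)
            charset.update(ex_text)
    return charset


examples = [{"en": "e\u0301", "fr": "ab"}]
normalized = collect_charset(FakeNllbTokenizer(), examples, ["en", "fr"])
unnormalized = collect_charset(object(), examples, ["en", "fr"])
```

Here `normalized` contains the composed character U+00E9, while `unnormalized` keeps `e` and the combining accent as separate entries.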

ddaspit

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

fix normalization order in find_missing_characters()

38cdb25

mshannon-sil added the bug Something isn't working label Apr 5, 2024

mshannon-sil requested a review from ddaspit April 5, 2024 22:33

mshannon-sil self-assigned this Apr 5, 2024

mshannon-sil linked an issue Apr 5, 2024 that may be closed by this pull request

normalize lines before getting charset #104

Closed

ddaspit requested changes Apr 10, 2024

don't check isinstance during iterations

7942dfe

mshannon-sil commented Apr 10, 2024

ddaspit approved these changes Apr 15, 2024

johnml1135 merged commit 5d05e6d into main Apr 16, 2024
14 checks passed

ddaspit deleted the #104_normalize_order branch April 16, 2024 16:56
