Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix normalization order in find_missing_characters() #105

Merged
merged 2 commits into from
Apr 16, 2024

Conversation

mshannon-sil
Copy link
Collaborator

@mshannon-sil mshannon-sil commented Apr 5, 2024

To account for composite characters, I changed find_missing_characters() to normalize the training example before using it to calculate the set of all characters in the training data, rather than normalizing the set of characters.

In addition to the change, I modified one of the test cases for updating the tokenizer to include a check for handling a composite character.


This change is Reviewable

@mshannon-sil mshannon-sil added the bug Something isn't working label Apr 5, 2024
@mshannon-sil mshannon-sil requested a review from ddaspit April 5, 2024 22:33
@mshannon-sil mshannon-sil self-assigned this Apr 5, 2024
@mshannon-sil mshannon-sil linked an issue Apr 5, 2024 that may be closed by this pull request
@codecov-commenter
Copy link

codecov-commenter commented Apr 5, 2024

Codecov Report

Attention: Patch coverage is 90.90909% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 88.34%. Comparing base (bda3b54) to head (7942dfe).

Files Patch % Lines
...tion/huggingface/hugging_face_nmt_model_trainer.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #105   +/-   ##
=======================================
  Coverage   88.33%   88.34%           
=======================================
  Files         234      234           
  Lines       13816    13821    +5     
=======================================
+ Hits        12205    12210    +5     
  Misses       1611     1611           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 178 at r1 (raw file):

                for lang_code in lang_codes:
                    ex_text = ex[lang_code]
                    if isinstance(tokenizer, (NllbTokenizerFast)):

I believe that isinstance calls can be expensive in some circumstances. It would be better to perform the check once.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 2 files reviewed, 1 unresolved discussion (waiting on @ddaspit)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 178 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I believe that isinstance calls can be expensive in some circumstances. It would be better to perform the check once.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

@johnml1135 johnml1135 merged commit 5d05e6d into main Apr 16, 2024
14 checks passed
@ddaspit ddaspit deleted the #104_normalize_order branch April 16, 2024 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

normalize lines before getting charset
4 participants