Word alignment try 2 #267

Open: 3 commits to merge into base branch master

johnml1135
Collaborator

@johnml1135 johnml1135 commented Nov 5, 2024

Add word alignment engine to IInteractiveTranslationEngine.



@johnml1135 johnml1135 requested a review from ddaspit November 5, 2024 21:58
@codecov-commenter

codecov-commenter commented Nov 5, 2024

Codecov Report

Attention: Patch coverage is 42.51497% with 96 lines in your changes missing coverage. Please review.

Project coverage is 70.01%. Comparing base (8319868) to head (fa1d6c7).
Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...hine/Translation/SymmetrizedWordAlignmentEngine.cs | 70.00% | 27 Missing ⚠️ |
| ...SIL.Machine/Translation/SymmetrizationHeuristic.cs | 0.00% | 26 Missing ⚠️ |
| src/SIL.Machine/Translation/WordAlignmentResult.cs | 0.00% | 16 Missing ⚠️ |
| ...ine.Translation.Thot/ThotWordAlignmentModelType.cs | 0.00% | 11 Missing ⚠️ |
| src/SIL.Machine/Corpora/AlignedWordPair.cs | 0.00% | 9 Missing ⚠️ |
| ...Machine.Translation.Thot/ThotWordAlignmentModel.cs | 50.00% | 4 Missing and 1 partial ⚠️ |
| src/SIL.Machine/Corpora/NParallelTextCorpus.cs | 50.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #267      +/-   ##
==========================================
- Coverage   70.16%   70.01%   -0.15%     
==========================================
  Files         385      389       +4     
  Lines       31957    32041      +84     
  Branches     4488     4496       +8     
==========================================
+ Hits        22424    22435      +11     
- Misses       8493     8565      +72     
- Partials     1040     1041       +1     


Contributor

@ddaspit ddaspit left a comment

What is this change for?

Reviewable status: 0 of 3 files reviewed, all discussions resolved

@johnml1135
Collaborator Author

This is needed for adding the word alignment engine to Serval. Just exposing the alignment endpoints to the interactive engine.

@johnml1135
Collaborator Author

This needs to be merged and released before the Serval changes will be able to compile.

Contributor

@ddaspit ddaspit left a comment

I'm still not sure I understand what this is for. There are already interfaces for word alignment models. Also, phrase alignment isn't word alignment. That is specific to the Thot SMT engine.


@johnml1135
Collaborator Author

The ThotSmtModel appears to be the best place to add the alignment routines, since "phrase alignment" here just means that the tokenizer can be configured. If I don't use ThotSmtModel, which specific classes would I use? IWordAligner assumes that the source and target are already tokenized. Also, how would it interact with loading models built by machine.py?

Contributor

@ddaspit ddaspit left a comment

For word alignment, you should use one of the classes that inherits from ThotWordAlignmentModel. For SMT and word alignment models, you will need to tokenize the text. We should just use the LatinWordTokenizer like we do for the SMT engine.
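
As a rough illustration of the tokenize-then-align flow described here, the following Python sketch wraps a token-level aligner so that callers can pass raw text. It does not use SIL.Machine or machine.py: `simple_tokenize` is a hypothetical regex stand-in for a tokenizer such as LatinWordTokenizer, `diagonal_aligner` is a toy stand-in for a trained word alignment model, and `TokenizingAligner` is an invented name.

```python
import re
from typing import Callable, List, Set, Tuple

# An alignment is a set of (source_index, target_index) links.
Alignment = Set[Tuple[int, int]]
Tokenizer = Callable[[str], List[str]]
TokenAligner = Callable[[List[str], List[str]], Alignment]


def simple_tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def diagonal_aligner(src: List[str], trg: List[str]) -> Alignment:
    """Toy token-level aligner (assumption): links tokens at the same position."""
    return {(i, i) for i in range(min(len(src), len(trg)))}


class TokenizingAligner:
    """Accepts raw text, tokenizes both sides, then delegates to a token-level aligner."""

    def __init__(
        self,
        aligner: TokenAligner,
        src_tokenizer: Tokenizer = simple_tokenize,
        trg_tokenizer: Tokenizer = simple_tokenize,
    ) -> None:
        self._aligner = aligner
        self._src_tokenizer = src_tokenizer
        self._trg_tokenizer = trg_tokenizer

    def align(self, source: str, target: str) -> Tuple[List[str], List[str], Alignment]:
        src_tokens = self._src_tokenizer(source)
        trg_tokens = self._trg_tokenizer(target)
        return src_tokens, trg_tokens, self._aligner(src_tokens, trg_tokens)


if __name__ == "__main__":
    wrapper = TokenizingAligner(diagonal_aligner)
    src, trg, links = wrapper.align("Hola, mundo", "Hello, world!")
    for i, j in sorted(links):
        print(f"{src[i]} -> {trg[j]}")
```

The point of the wrapper is only that the token-level aligner never sees raw strings; swapping in a real model or a different tokenizer would not change the calling code.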


@johnml1135
Collaborator Author

Hmm. It would be quite a bit of reworking. I would have to use a different name than ThotWordAlignmentModel, because that refers only to the asymmetrical alignment, not the symmetrical alignment with a tokenizer. In Python, the word aligner has the tokenizer connected to it. I could rework the Machine word aligner to include the tokenizer, but that would be a fair amount of work. The solution I have appears to be a good minimal one: treat the ThotSmtModel as a SymmetrizedWordAlignmentModel with tokenizers. It already supports having a null truecaser.

Otherwise, I think I would have to create a base class of ThotSmtModel, called something like ThotSymmetrizedWordAlignmentModelWithTokenizer, in which half of the functionality of ThotSmtModel is implemented. Even then, all the configurations, trainers, and everything else would need to be torn apart and rewritten.

I think this minimal change is the best solution - it looks like a word aligner on Serval but is just an SMT model underneath.
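
For the asymmetrical versus symmetrical distinction drawn above, here is a minimal Python sketch of symmetrization, assuming alignments are represented as sets of index pairs. A directional model yields source-to-target links, a second model run in the opposite direction yields target-to-source links, and a symmetrized alignment combines the two. Only the two simplest standard heuristics (intersection and union) are shown; the function names are hypothetical, and this is not the Thot or SIL.Machine implementation.

```python
from typing import Set, Tuple

# A link is a pair of token indices.
Alignment = Set[Tuple[int, int]]


def transpose(alignment: Alignment) -> Alignment:
    """Flip (a, b) links to (b, a) so both directions use (source, target) order."""
    return {(j, i) for i, j in alignment}


def symmetrize(direct: Alignment, inverse: Alignment, heuristic: str = "intersection") -> Alignment:
    """Combine a source->target alignment with a target->source alignment.

    direct:  (source_index, target_index) links from the forward model.
    inverse: (target_index, source_index) links from the backward model.
    """
    inverse_as_direct = transpose(inverse)
    if heuristic == "intersection":
        # Keep only links both directions agree on (high precision).
        return direct & inverse_as_direct
    if heuristic == "union":
        # Keep links proposed by either direction (high recall).
        return direct | inverse_as_direct
    raise ValueError(f"unknown heuristic: {heuristic}")


if __name__ == "__main__":
    forward = {(0, 0), (1, 1), (2, 1)}   # source token 2 linked to target token 1
    backward = {(0, 0), (1, 1), (2, 2)}  # target token 2 linked to source token 2
    print(sorted(symmetrize(forward, backward, "intersection")))
    print(sorted(symmetrize(forward, backward, "union")))
```

Heuristics that grow the intersection toward the union sit between these two extremes and are not sketched here.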

Contributor

@ddaspit ddaspit left a comment

The ThotSmtModel is a full phrase-based SMT system and takes a lot more computation and time to train. The phrase alignment from the SMT model uses a different algorithm than the word alignment models and is much more expensive. Unfortunately, it is not a replacement for the word alignment models. We should meet to discuss how best to proceed. I'm sure that if I had a better understanding of what you are trying to achieve, we could come up with a good solution.


@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch 3 times, most recently from 3c8ddd6 to 1f29ecd, November 27, 2024 17:34
Add tokenizer to trainer
@johnml1135 johnml1135 force-pushed the word_alignment_try_2 branch from 1f29ecd to f905c2d, December 9, 2024 16:31