Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If no other training data, don't add any keyterms #580

Merged
merged 1 commit into from
Jan 6, 2025
Merged

Conversation

johnml1135
Copy link
Collaborator

@johnml1135 johnml1135 commented Dec 12, 2024

Resolves #569


This change is Reviewable

@johnml1135 johnml1135 requested a review from ddaspit December 12, 2024 21:48
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.92%. Comparing base (0cfd145) to head (b0c0286).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #580      +/-   ##
==========================================
- Coverage   62.97%   62.92%   -0.06%     
==========================================
  Files         279      279              
  Lines       13984    13989       +5     
  Branches     1814     1817       +3     
==========================================
- Hits         8807     8803       -4     
- Misses       4551     4560       +9     
  Partials      626      626              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r1, all commit messages.
Reviewable status: 3 of 4 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 747 at r1 (raw file):

        Assert.That(
            pretranslations[2]!["translation"]!.ToString(),
            Is.EqualTo("Source one, chapter twelve, verse one.")

Is there a way to 'save' this test? Or are we happy for it just to be saved in the commit history? I think ultimately we'll need a test like this once we enable some kind of filtering and, as you can see, it's a fairly long test.


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 89 at r1 (raw file):

            }

            if (useKeyTerms)

Why not just && parallelTrainingDataPresent?

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only other concern is that I think the EITL team needs to 'buy in' before we do this. It's my impression that they already feel undermined in how KBT changes were done/unstable in regard to what the behavior is in Serval/SF. Even a quick email or Slack message that has an asking-not-telling tone would do it. Not sure if that's already been done.

Reviewable status: 3 of 4 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. We need buy-in from the EITL team before we proceed.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)

@johnml1135
Copy link
Collaborator Author

src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 747 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Is there a way to 'save' this test? Or are we happy for it just to be saved in the commit history? I think ultimately we'll need a test like this once we enable some kind of filtering and, as you can see, it's a fairly long test.

We could "ignore" or otherwise pass over the test. That would be about equivalent as to saving it in commit history, but just more visible. I am fine either way. Also, if we actually implement this, we will likely want to change it to "filter on the pretranslations" rather than on the training data.

@johnml1135
Copy link
Collaborator Author

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 89 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Why not just && parallelTrainingDataPresent?

Because we won't know until every corpus has been run through whether there is any data at all. The simplest solution that I could think of was to collect them all and if there was data, add them when all the corpora are looped through. Otherwise, we would have to loop through the corpora twice.

@johnml1135
Copy link
Collaborator Author

Request for input made.

@Enkidu93
Copy link
Collaborator

In the meeting that you had to leave, @johnml1135, I got the go-ahead from the EITL team for this PR.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)

@johnml1135
Copy link
Collaborator Author

@ddaspit - can you finish this review?

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)

@johnml1135 johnml1135 merged commit a818da9 into main Jan 6, 2025
1 of 2 checks passed
@johnml1135 johnml1135 deleted the keyterm_fix branch January 6, 2025 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

If no other training data, don't add any keyterms
4 participants