Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use chapter-filtering for terms #545

Merged
merged 14 commits into from
Nov 27, 2024
Merged

Use chapter-filtering for terms #545

merged 14 commits into from
Nov 27, 2024

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Nov 26, 2024

Reworking of #508

Fixes #476


This change is Reviewable

@Enkidu93 Enkidu93 force-pushed the filter_kbt_by_chapter branch from 4f37c23 to e5ead88 Compare November 26, 2024 20:45
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @johnml1135)


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 87 at r1 (raw file):

                ITextCorpus[]? sourceTermCorpora = _corpusService
                    .CreateTermCorpora(
                        corpus.SourceCorpora.SelectMany(sc => sc.Files.Select(f => (f, sc.TrainOnChapters))).ToArray()

What about textIds?

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 6 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 87 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What about textIds?

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@Enkidu93
Copy link
Collaborator Author

Still fixing a bug

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 95 at r3 (raw file):

                    IParallelTextCorpus parallelKeyTermsCorpus = sourceTermCorpora
                        .ChooseRandom(Seed)
                        .AlignRows(targetTermCorpora.ChooseFirst());

Does this also address #538? Can we unique these - or do we want to do it in a separate PR?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 6 files at r1, 3 of 4 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

@johnml1135
Copy link
Collaborator

src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs line 95 at r3 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Does this also address #538? Can we unique these - or do we want to do it in a separate PR?

A separate PR is fine.

@Enkidu93 Enkidu93 force-pushed the filter_kbt_by_chapter branch from 788beb9 to 551e6cd Compare November 27, 2024 15:43
@johnml1135
Copy link
Collaborator

Is it all ready? Do we need to re-release Machine?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r4, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@Enkidu93
Copy link
Collaborator Author

Alright, the last issue has been addressed. We'll need

Is it all ready? Do we need to re-release Machine?

sillsdev/machine#278

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r3, 6 of 6 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/CorpusService.cs line 43 at r5 (raw file):

    )
    {
        foreach ((CorpusFile file, Dictionary<string, HashSet<int>>? chapters) in corpora)

Why was this added?

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 9 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


src/ServiceToolkit/src/SIL.ServiceToolkit/Services/CorpusService.cs line 43 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Why was this added?

Sorry, it was there before; I added it back in temporarily while I was rethinking what the right behavior should be when no filters are supplied and then forgot to remove it. Removed.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 96.55172% with 1 line in your changes missing coverage. Please review.

Project coverage is 57.03%. Comparing base (0635292) to head (bff3084).

Files with missing lines Patch % Lines
...kit/Services/ParallelCorpusPreprocessingService.cs 96.29% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #545      +/-   ##
==========================================
+ Coverage   56.97%   57.03%   +0.05%     
==========================================
  Files         302      302              
  Lines       15620    15639      +19     
  Branches     2153     2155       +2     
==========================================
+ Hits         8900     8919      +19     
+ Misses       6077     6076       -1     
- Partials      643      644       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Enkidu93 Enkidu93 merged commit 03e4716 into main Nov 27, 2024
3 checks passed
@Enkidu93 Enkidu93 deleted the filter_kbt_by_chapter branch November 27, 2024 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Keyterm data always gets added - and then we always train
4 participants