-
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NParallelTextCorpus #270
NParallelTextCorpus #270
Conversation
more broken More broken. Compiling but not working Progress More progress Almost all tests passing All PTCorpus tests passing! Passing tests; added alignment corpus Fix test; add corpora extensions test More fixes
…into nparallel_corpus
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #270 +/- ##
==========================================
+ Coverage 70.01% 70.13% +0.12%
==========================================
Files 380 385 +5
Lines 31816 31962 +146
Branches 4455 4490 +35
==========================================
+ Hits 22275 22418 +143
+ Misses 8508 8500 -8
- Partials 1033 1044 +11 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
Fixes #266 |
The only outstanding bullet point from the original description that I think I haven't covered is: "NParallelTextCorpora will take a parameter to determine how equivalent references are combined: choose first, combine all, or permutate all". Is this something we'll really want? Or is there a behavior we'll always prefer? Right now it's permutating. |
|
Can you explain in English what the indexOfInRangeRow is trying to accomplish? |
How does this logic work with AllRows? Are they blank rows being send down? |
Previously, johnml1135 (John Lambert) wrote…
Could you elaborate on the connection with allRows? Yes, there may be empty rows output if all the rows available are empty. But it would be an unusual circunstance. |
Previously, johnml1135 (John Lambert) wrote…
The point of the variable is to keep track of whether you previously picked a row from a particular source that was the beginning of a range. This way, you'll keep picking from that source until the range is finished. |
Previously, johnml1135 (John Lambert) wrote…
Sure! Would it be simpler just to squash this when I'm merging/rebasing? |
What exactly is the function of the alignment corpus? Why is it active here, but not above when not all corpora were in lock-step? |
Many of these are cryptic logic. Let's have at least a sentence per set of logic. "doing it the right way" would likely involve creating a dozen or so sub-functions to do the intermediate processing, but that may lead to more confusion. I just want to have enough bread crumbs that when someone comes back in 3 years, they'll be able to understand what is going on here without having to spend 3 days. |
Previously, johnml1135 (John Lambert) wrote…
Or, if you take the complicated logic in these if statements and make functions with descriptive names, like below with "rangeInfo.IsInRange && AllInRangeHaveSegments(currentIncompleteRows)". That is decodable - the above logic less so. |
Why rows.Any if you already threw a Argument null exception above? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 9 files at r2, 1 of 3 files at r4, 1 of 1 files at r6, 1 of 1 files at r7, 6 of 6 files at r8, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 9 files at r2, 6 of 6 files at r8, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine/Corpora/CorporaExtensions.cs
line 541 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Is this the way you wanted it? One question I have is in regards to the AllRows parameter: Is this the cleanest way to pass it or the best point at which to inject it?
I don't think that AllRows
is useful in this context. Do we need it? If not, we could simply not pass AllRows
at all.
tests/SIL.Machine.Tests/Corpora/NParallelTextCorpusTests.cs
line 86 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
@ddaspit - can you bring in clarity - this may be the last issue...
@Enkidu93's description of the behavior sounds correct. This assert is correct. There should be only one row that is returned.
src/SIL.Machine/Corpora/MergedTextCorpus.cs
line 16 at r8 (raw file):
private readonly Random _random; public MergedTextCorpus(NParallelTextCorpus nParallelTextCorpus, MergeRule mergeRule, int seed)
I would have this accept an enumerable of text corpora instead of an NParallelTextCorpus
.
Once Damien's concerns are resolved, I'm fine with the changes. |
Previously, ddaspit (Damien Daspit) wrote…
Let me see - "AllRows" is valuable when wanting to align to VRef.txt - so you get one line per vref entry (needed for AQUA as it relies on that specific format). Beyond that, I can't think of any need for "AllRows". And that should likely be it's own function, taking a single TextCorpus and then calling "AlignToVref" which would return a single TextCorpus with 1 entry per VRef. Therefore, it would not be needed when "merging" or "choosing random" but merely in adding another extension to format a single existing corpus. @ddaspit, does this sound right to you? |
Previously, johnml1135 (John Lambert) wrote…
Actually, should we remove "AllRows" entirely from NParallelCorpus and rather just have a separate function that takes a single ITextCorpus to "AlignToVref"? |
Previously, johnml1135 (John Lambert) wrote…
We could also just rip out "AllRows" right now and then add the new function later, as it is not needed now. |
Actually, should there be an ability to filter out non-scripture? What is the best way to do that? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine/Corpora/CorporaExtensions.cs
line 541 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
We could also just rip out "AllRows" right now and then add the new function later, as it is not needed now.
We shouldn't remove AllRows
altogether. We just don't need to pass it in this function. Also, there is already an "AlignToVref" function, it is called ExtractScriptureCorpus
.
Previously, ddaspit (Damien Daspit) wrote…
I think that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at the Serval PR, you can see an example (although it might be a bit outdated) of the intended use. At the moment, I'm filtering out non-Scripture where necessary on that side.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/CorporaExtensions.cs
line 541 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
I think that
AllRows
is generally useful not just for us but perhaps even for others using the Machine library. One place we need it on the Serval side is when we align the pretranslate source and the train target: Since many rows will not have a target, this is necessary. Also, we use it currently (see the other PR) when doing theAlignMany
s since each corpus has already been filtered by textId/scriptureRange, meaning that in many cases, they won't overlap.
(Sorry, just seeing your new message, Damien!) To respond to your earlier message more directly, at the moment where we're using it in Serval (see the other PR), we're passing in AllRows=[true,...,true]
, so maybe that ought to be the default? I'm not sure. It's less awkward to pass AllRows
in with the API I was originally imagining; I'm only thinking of this now because it seems odd to pass in AllRows
(which has to do with alignment) to a function that has to do with selection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/MergedTextCorpus.cs
line 16 at r8 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would have this accept an enumerable of text corpora instead of an
NParallelTextCorpus
.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r9, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine/Corpora/CorporaExtensions.cs
line 541 at r2 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
(Sorry, just seeing your new message, Damien!) To respond to your earlier message more directly, at the moment where we're using it in Serval (see the other PR), we're passing in
AllRows=[true,...,true]
, so maybe that ought to be the default? I'm not sure. It's less awkward to passAllRows
in with the API I was originally imagining; I'm only thinking of this now because it seems odd to pass inAllRows
(which has to do with alignment) to a function that has to do with selection.
Yes, I think we only need AllRows=[true,...,true]
for the MergedTextCorpus
to work properly. We shouldn't need to pass in the allRows
parameter.
src/SIL.Machine/Corpora/CorporaExtensions.cs
line 544 at r9 (raw file):
this IEnumerable<ITextCorpus> corpora, IEnumerable<bool> allRows, int seed
We should make seed
an optional parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've addressed everybody's comments. I'll wait for any more feedback you all have; let me know then I'll re-review and merge.
Reviewable status: 8 of 12 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/CorporaExtensions.cs
line 541 at r2 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
Yes, I think we only need
AllRows=[true,...,true]
for theMergedTextCorpus
to work properly. We shouldn't need to pass in theallRows
parameter.
Done.
src/SIL.Machine/Corpora/CorporaExtensions.cs
line 544 at r9 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
We should make
seed
an optional parameter.
Done.
tests/SIL.Machine.Tests/Corpora/NParallelTextCorpusTests.cs
line 86 at r1 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
@Enkidu93's description of the behavior sounds correct. This assert is correct. There should be only one row that is returned.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Actually, I think I'll add at least one test to cover a case John uncovered).
Reviewable status: 8 of 12 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @johnml1135)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r10, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93 and @johnml1135)
src/SIL.Machine/Corpora/CorporaExtensions.cs
line 544 at r9 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Done.
You forgot to make it optional in the ChooseRandom
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I've added a test to cover a case that wasn't covered. And I uncovered that the code that had not previously run actually breaks the behavior for the situation in which it would run. @ddaspit, do you remember what particular case the corresponding bit of logic in the original ParallelTextCorpus
logic was meant to cover? I kept it for consistency but it wasn't covered by the tests previously either.
Reviewable status: 10 of 13 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/SIL.Machine/Corpora/CorporaExtensions.cs
line 544 at r9 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
You forgot to make it optional in the
ChooseRandom
function.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't remember what that check was for. I'm okay with the change, since the new test is testing for the correct behavior and all other tests pass.
Reviewed 3 of 3 files at r11, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @johnml1135 from a discussion.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
@johnml1135, it's saying I can't merge without your approval. I tried to dismiss you from the discussion because I think your question was answered, but it's still not letting me. If you could take a look when you have time, I'd appreciate it :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r10, 3 of 3 files at r11, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
Here's the Machine PR needed to support the updates to Serval.
This change is