-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update lastDoc in ScoreCachingWrappingScorer #13987
Update lastDoc in ScoreCachingWrappingScorer #13987
Conversation
I noticed that ScoreCachingWrappingScorer never updates lastDoc, so it's always -1. Technically, it's probably fine, since it still ends up returning the same score for multiple score() calls between collect calls, but I think this is the intended logic. (In particular, if the same doc was somehow collected multiple times, then the score would get recalculated.)
A couple of unit tests seemed to rely on scores not being cached for subsequent collect calls with the same doc id.
Wow, good catch @msfroh. Could we maybe add a new test case that explicitly confirms that the wrapped |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phew, this was a great catch of a long standing bug ... I wonder how much net performance was lost across all Lucene applications that were using this class thinking it actually saved CPU by caching the score like it claims it does. Perhaps this class used to work and then a refactoring (maybe the addition of Scorable
?) accidentally introduced the bug ...
The change looks great to me, and it's sweet you also then found two separate pre-existing test bugs. I'm hoping we can also add a dedicated test case confirming this class is actually doing its one job.
ac.collect(0); | ||
int docId; | ||
while ((docId = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { | ||
ac.collect(docId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, good catch! So this was a latent pre-existing test bug, and with your above bug fix this test is now (correctly!) failing, and with your fix to this separate test bug, the test now passes?
@@ -359,7 +359,7 @@ public void testTotalHitsWithScore() throws Exception { | |||
leafCollector.collect(1); | |||
|
|||
scorer.score = 4; | |||
leafCollector.collect(1); | |||
leafCollector.collect(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another pre-existing test bug fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the change, the call to leafCollector.collect(1)
would get a score of 3 (since it's cached for document 1), instead of the 4 that the test expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this! This suggests that the caching of scores did not actually work, can you add a test that fails when a score is computed twice for the same doc even though ScoreCachingWrapperScorer
is used?
Luckily, this is a Lucene 10-only bug (from when I came across it when updating OpenSearch to support Lucene 10 and needed to refactor some code that used I'll add the extra unit test to cover this. |
I think the caching worked fine, albeit in a funny way. When you call As long as you don't call |
for (int i = 0; i < scores.length; i++) { | ||
assertEquals(scores[i], scc.mscores[i * 2], 0f); | ||
assertEquals(scores[i], scc.mscores[i * 2 + 1], 0f); | ||
} | ||
|
||
assertEquals(scores.length, countingScorable.count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is similar to the other test method (testGetScores
), but we assert that each score is collected twice (since we collect each document twice), but the score()
method is only called once per document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I see -- you want to assert that indeed score()
was called (up on top) multiple times per hit, but then down below in the "true" scorer, it was dedup'd to just once per hit. So that's why you .collect(doc)
twice, OK ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about the questions that @mikemccand raised, otherwise the change looks good to me.
int doc; | ||
while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { | ||
lc.collect(doc); | ||
lc.collect(doc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's illegal to call collect() multiple times on the same doc, this shouldn't be necessary as collect()
already calls Scorable#score()
multiple times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- that's why I'd hesitate to really call this a "bug fix".
I'm pretty sure you meant to write lastDoc = currDoc
in #12407, but functionally currDoc = lastDoc
works fine as long as you don't call collect()
multiple times on the same doc.
I only noticed this because I was copy/pasting the logic to adapt it for LeafBucketCollector
in OpenSearch and IntelliJ warned me that lastDoc
could be final
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining, then my suggestion would be to keep the test simple and not call collect() multiple times on the same doc. Your change looks good to me otherwise, so I can merge after that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @msfroh -- does the new test fail when you revert the bug fix?
@@ -157,4 +157,40 @@ public void testGetScores() throws Exception { | |||
ir.close(); | |||
directory.close(); | |||
} | |||
|
|||
private static class CountingScorable extends FilterScorable { | |||
int count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need the = 0
since it's java's default already?
int doc; | ||
while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) { | ||
lc.collect(doc); | ||
lc.collect(doc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why are we collecting the same doc
twice? (I thought this is smelly -- we fixed the other two tests to stop doing that?).
Oh I see, is so score()
happens more than once while on the same doc? Actually, since ScoreCachingCollector
's LeafCollector
is calling .score()
three times for each .collect()
, you should be able to do only one lc.collect(doc)
here?
Does this new test case fail if you revert the bug fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new test does fail if I revert the fix.
As I mentioned in #13987 (comment), the condition for this being an actual bug is probably not really valid (i.e. you need to call collect()
on the ScoreCachingWrappingLeafCollector
multiple times for the same document).
The caching does work if the wrapped LeafCollector
calls score()
multiple times, which can happen in MultiCollector
(for example). The layering is roughly:
ScoreCachingWrappingLeafCollector wraps
Some other LeafCollector whose score is
ScoreCachingWrappingScorer, which wraps
The real Scorable
The multiple calls to score()
from Some other LeafCollector
work just fine. Again, it's the call to collect()
on ScoreCachingWrappingLeafCollector
that was invalidating the cache.
Those other two tests broke because my "fix" causes the caching to work (but they didn't expect it).
public void testRepeatedCollectReusesScore() throws Exception { | ||
Scorer s = new SimpleScorer(); | ||
CountingScorable countingScorable = new CountingScorable(s); | ||
ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment about why we need * 2
here?
for (int i = 0; i < scores.length; i++) { | ||
assertEquals(scores[i], scc.mscores[i * 2], 0f); | ||
assertEquals(scores[i], scc.mscores[i * 2 + 1], 0f); | ||
} | ||
|
||
assertEquals(scores.length, countingScorable.count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I see -- you want to assert that indeed score()
was called (up on top) multiple times per hit, but then down below in the "true" scorer, it was dedup'd to just once per hit. So that's why you .collect(doc)
twice, OK ...
@msfroh FWIW I'm happy to merge this PR when we remove the double call to LeafCollector#collect on the same doc ID in tests. |
In that case, the unit test that I added can be removed (since the double-call to LeafCollector#collect was the whole point). The old code had the correct behavior if we don't call collect twice for the same doc. We can simplify/clarify it like #14012 (which is functionally equivalent to what was there, but maybe a little clearer). |
This works for me. Sorry for putting you on the wrong track by suggesting that a test is added, it took me a while to understand how the existing logic was accidentally working. |
Closing in favor of #14012 |
Description
I noticed that ScoreCachingWrappingScorer never updates lastDoc, so it's always -1. Technically, it's probably fine, since it still ends up returning the same score for multiple score() calls between collect calls, but I think this change better reflects the intended logic. (In particular, if the same doc was somehow collected multiple times, then the score would get recalculated.)