Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update lastDoc in ScoreCachingWrappingScorer #13987

Closed

Conversation

msfroh
Copy link
Contributor

@msfroh msfroh commented Nov 12, 2024

Description

I noticed that ScoreCachingWrappingScorer never updates lastDoc, so it's always -1. Technically, it's probably fine, since it still ends up returning the same score for multiple score() calls between collect calls, but I think this change better reflects the intended logic. (In particular, if the same doc was somehow collected multiple times, then the score would get recalculated.)

I noticed that ScoreCachingWrappingScorer never updates lastDoc, so
it's always -1. Technically, it's probably fine, since it still ends
up returning the same score for multiple score() calls between collect
calls, but I think this is the intended logic. (In particular, if the
same doc was somehow collected multiple times, then the score would
get recalculated.)
A couple of unit tests seemed to rely on scores not being cached for
subsequent collect calls with the same doc id.
@mikemccand
Copy link
Member

Wow, good catch @msfroh. Could we maybe add a new test case that explicitly confirms that the wrapped Scorable's score method is indeed only called once even if the outer user calls .score() multiple times on the same docID? This seems to be the primary purpose of this class, and it was failing this, for a very long time!

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew, this was a great catch of a long standing bug ... I wonder how much net performance was lost across all Lucene applications that were using this class thinking it actually saved CPU by caching the score like it claims it does. Perhaps this class used to work and then a refactoring (maybe the addition of Scorable?) accidentally introduced the bug ...

The change looks great to me, and it's sweet you also then found two separate pre-existing test bugs. I'm hoping we can also add a dedicated test case confirming this class is actually doing its one job.

ac.collect(0);
int docId;
while ((docId = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
ac.collect(docId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, good catch! So this was a latent pre-existing test bug, and with your above bug fix this test is now (correctly!) failing, and with your fix to this separate test bug, the test now passes?

@@ -359,7 +359,7 @@ public void testTotalHitsWithScore() throws Exception {
leafCollector.collect(1);

scorer.score = 4;
leafCollector.collect(1);
leafCollector.collect(2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another pre-existing test bug fixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the change, the call to leafCollector.collect(1) would get a score of 3 (since it's cached for document 1), instead of the 4 that the test expected.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this! This suggests that the caching of scores did not actually work, can you add a test that fails when a score is computed twice for the same doc even though ScoreCachingWrapperScorer is used?

@msfroh
Copy link
Contributor Author

msfroh commented Nov 12, 2024

Luckily, this is a Lucene 10-only bug (from when docId() was removed from Scorable).

I came across it when updating OpenSearch to support Lucene 10 and needed to refactor some code that used ScoreCachingWrapperScorer.

I'll add the extra unit test to cover this.

@msfroh
Copy link
Contributor Author

msfroh commented Nov 13, 2024

I think the caching worked fine, albeit in a funny way.

When you call collect() with any valid doc ID, it invalidates the cache, causing scores to be computed. After the score was computed, currDoc was set to -1. Repeated calls to score() (until the next collect()) will reuse the cached value because -1 == -1.

As long as you don't call collect() on the same document multiple times (which you probably shouldn't be doing anyway -- those existing unit tests were a little weird), it should be okay.

Comment on lines 189 to 194
for (int i = 0; i < scores.length; i++) {
assertEquals(scores[i], scc.mscores[i * 2], 0f);
assertEquals(scores[i], scc.mscores[i * 2 + 1], 0f);
}

assertEquals(scores.length, countingScorable.count);
Copy link
Contributor Author

@msfroh msfroh Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to the other test method (testGetScores), but we assert that each score is collected twice (since we collect each document twice), but the score() method is only called once per document.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see -- you want to assert that indeed score() was called (up on top) multiple times per hit, but then down below in the "true" scorer, it was dedup'd to just once per hit. So that's why you .collect(doc) twice, OK ...

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious about the questions that @mikemccand raised, otherwise the change looks good to me.

int doc;
while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
lc.collect(doc);
lc.collect(doc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's illegal to call collect() multiple times on the same doc, this shouldn't be necessary as collect() already calls Scorable#score() multiple times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah -- that's why I'd hesitate to really call this a "bug fix".

I'm pretty sure you meant to write lastDoc = currDoc in #12407, but functionally currDoc = lastDoc works fine as long as you don't call collect() multiple times on the same doc.

I only noticed this because I was copy/pasting the logic to adapt it for LeafBucketCollector in OpenSearch and IntelliJ warned me that lastDoc could be final.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining, then my suggestion would be to keep the test simple and not call collect() multiple times on the same doc. Your change looks good to me otherwise, so I can merge after that.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @msfroh -- does the new test fail when you revert the bug fix?

@@ -157,4 +157,40 @@ public void testGetScores() throws Exception {
ir.close();
directory.close();
}

private static class CountingScorable extends FilterScorable {
int count = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need the = 0 since it's java's default already?

int doc;
while ((doc = s.iterator().nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
lc.collect(doc);
lc.collect(doc);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why are we collecting the same doc twice? (I thought this is smelly -- we fixed the other two tests to stop doing that?).

Oh I see, is so score() happens more than once while on the same doc? Actually, since ScoreCachingCollector's LeafCollector is calling .score() three times for each .collect(), you should be able to do only one lc.collect(doc) here?

Does this new test case fail if you revert the bug fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test does fail if I revert the fix.

As I mentioned in #13987 (comment), the condition for this being an actual bug is probably not really valid (i.e. you need to call collect() on the ScoreCachingWrappingLeafCollector multiple times for the same document).

The caching does work if the wrapped LeafCollector calls score() multiple times, which can happen in MultiCollector (for example). The layering is roughly:

ScoreCachingWrappingLeafCollector wraps
  Some other LeafCollector whose score is
    ScoreCachingWrappingScorer, which wraps
      The real Scorable

The multiple calls to score() from Some other LeafCollector work just fine. Again, it's the call to collect() on ScoreCachingWrappingLeafCollector that was invalidating the cache.

Those other two tests broke because my "fix" causes the caching to work (but they didn't expect it).

public void testRepeatedCollectReusesScore() throws Exception {
Scorer s = new SimpleScorer();
CountingScorable countingScorable = new CountingScorable(s);
ScoreCachingCollector scc = new ScoreCachingCollector(scores.length * 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment about why we need * 2 here?

Comment on lines 189 to 194
for (int i = 0; i < scores.length; i++) {
assertEquals(scores[i], scc.mscores[i * 2], 0f);
assertEquals(scores[i], scc.mscores[i * 2 + 1], 0f);
}

assertEquals(scores.length, countingScorable.count);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see -- you want to assert that indeed score() was called (up on top) multiple times per hit, but then down below in the "true" scorer, it was dedup'd to just once per hit. So that's why you .collect(doc) twice, OK ...

@jpountz
Copy link
Contributor

jpountz commented Nov 22, 2024

@msfroh FWIW I'm happy to merge this PR when we remove the double call to LeafCollector#collect on the same doc ID in tests.

@msfroh
Copy link
Contributor Author

msfroh commented Nov 22, 2024

@msfroh FWIW I'm happy to merge this PR when we remove the double call to LeafCollector#collect on the same doc ID in tests.

In that case, the unit test that I added can be removed (since the double-call to LeafCollector#collect was the whole point).

The old code had the correct behavior if we don't call collect twice for the same doc. We can simplify/clarify it like #14012 (which is functionally equivalent to what was there, but maybe a little clearer).

@jpountz
Copy link
Contributor

jpountz commented Nov 25, 2024

In that case, the unit test that I added can be removed

This works for me. Sorry for putting you on the wrong track by suggesting that a test is added, it took me a while to understand how the existing logic was accidentally working.

@msfroh
Copy link
Contributor Author

msfroh commented Nov 25, 2024

Closing in favor of #14012

@msfroh msfroh closed this Nov 25, 2024
@msfroh msfroh deleted the update_lastdoc_ScoreCachingWrappingScorer branch November 25, 2024 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants