-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scoring: use repo freshness as tiebreaker #832
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I like how it's now clearly split into a tiebreaker. I will review more carefully tomorrow. A couple things on my mind:
- The "months since 1970" trick makes sense, but is a bit surprising. As time moves on, scores always get higher. If a repo that hasn't been updated for 1 month, it feels like it should have the same "freshness" score, no matter when it's calculated (today vs. a few months from now)... but the score will change. However, it does have the nice property that the scores for a search never change, even as you rerun it over time.
- Does this have any effect on our Zoekt "golden queries" evals? It'd be great to add some test queries to see more details on how it works! Specifically queries that need good multi-repo ranking.
I am not worried about that because it's a tiebreaker and the score is not combined with another score. We really just care about a stable order for last commits. |
// We use the number of months since 1970 as a simple measure of repo freshness. | ||
// It is monotonically increasing and stable across re-indexes and restarts. | ||
r.Rank = monthsSince1970(repo.LatestCommitDate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as an atlernative, why not use the unix timestamp which is also monotonically increasing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically the unix timestamp, but in buckets of 1 month to fit into the uint16 we have. I think monthly increments make sense and the unit16 keeps the number of digits we have to reserve in the score to 5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach! Just left some questions.
As we discussed offline, I'm not sure this will kick in super often. But it feels like a good step forward, and we can then test on dot com and S2 to get more ideas.
@@ -99,12 +102,24 @@ func (d *indexData) scoreFile(fileMatch *FileMatch, doc uint32, mt matchTree, kn | |||
} | |||
} | |||
|
|||
// Add tiebreakers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these docs would fit better in contentprovider.go
, where we already have some explanations around the various score weights.
score.go
Outdated
md := d.repoMetaData[d.repos[doc]] | ||
// digits 3-7 are the repo rank, this gives us increments of 1 per month. | ||
addScore("repo-rank", scoreRepoRankFactor*float64(md.Rank)) | ||
// digits 1-2 and the decimals are the doc order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this tricky, it'd be helpful to expand these explanations a bit more 🤔
Here's what I found unclear:
doc-order
lies in the range [0, 1), so multiplying it by 10 gives [0, 10). So wouldn't it just be "digit 1 and the decimals" ?- It's not obvious what range
repo-rank
lies in. Especially since it can represent both "months since 1970" and "number of stars (with normalization)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc-order lies in the range [0, 1), so multiplying it by 10 gives [0, 10). So wouldn't it just be "digit 1 and the decimals" ?
It is [0, 1] not [0,1). Matches in the first document of a shard will get the full score.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious what range repo-rank lies in. Especially since it can represent both "months since 1970" and "number of stars (with normalization)"
I see, I will update the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is [0, 1] not [0,1). Matches in the first document of a shard will get the full score.
Ah, I missed the fact we do "1 - doc_id/num_docs". Thanks for clarifying the docs.
This is motivated by #832 We use archive index in our e2e tests. In order to test our latest improvements to ranking, archive index needs to set the latest commit date. Test plan: - new unit test - I checked that the tar files downloaded from github have the correct mod time.
We ignore priority and instead use the latest commit date as repo rank. This has a big impact for Sourcegraph because it means we switch from star count to repo freshness as tiebreaker. As a minor tweak, we also separate query based scores from tiebreakers. To achieve this we reserve the last 7 digits of a score for tiebreakers: - 5 digits (maxUint16) for repo rank - 2 digits ([0,10]) for file order (2 digits). Example: Before: score: 8775.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35 After: score: 8750_00019_06.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35
fd2f4c8
to
bbc7e7f
Compare
I had to rebase to pull in the latest changes of the e2e ranking test suite. |
queries: 16 | ||
recall@1: 8 (50%) | ||
recall@1: 9 (56%) | ||
recall@5: 11 (69%) | ||
mrr: 0.600787 | ||
mrr: 0.632037 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny improvement, because "sourcegraph/conc" is now ranked higher than a ~2 months older "golang/go"
We ignore priority and instead use the latest commit date as repo rank.
This has a big impact on scoring for Sourcegraph because it means we switch from
star count to repo freshness as tiebreaker.
As a minor tweak, we also separate query based scores from tiebreakers.
To achieve this we reserve the last 7 digits of a score for tiebreakers:
Example:
Before:
score: 8775.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35
After:
score: 8750_00019_06.35 <- atom(2):200, fragment:8550.00, repo-rank: 19, doc-order:6.35