RSR per client #193

Open
Tracked by #181 ...
bajtos opened this issue Nov 18, 2024 · 7 comments · May be fixed by CheckerNetwork/spark-evaluate#481
Comments
@bajtos
Member

bajtos commented Nov 18, 2024

Provide retrieval-based RSR calculated on a per-client basis.

  • REST API
  • Visualisation on the Spark Public Dashboard

Blocks #216

Related discussions:

@bajtos bajtos mentioned this issue Nov 18, 2024
@bajtos bajtos added the 🤗 good first issue label Dec 2, 2024
@bajtos bajtos mentioned this issue Dec 2, 2024
This was referenced Jan 6, 2025
@bajtos bajtos mentioned this issue Jan 30, 2025
@bajtos
Member Author

bajtos commented Feb 11, 2025

We need to change spark-evaluate to start producing this new data.

Caveat: one retrieval task (miner_id, payload_cid) can test deals from multiple clients, which is why each task has an array of client IDs. When calculating retrieval-based RSR, each measurement should contribute exactly +1 to the total number of measurements performed, irrespective of how many clients/deals were covered by that measurement. We need to figure out how to define and implement per-client measurement-based RSR and how to expand one measurement covering X clients into +1 measurement for each of those clients. This change must not affect per-miner RSR (i.e. we cannot add client_id to the existing table).
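A minimal sketch of the "+1 per covered client" rule described above. All names here are illustrative, not the actual spark-evaluate API: `findDealClients` is assumed to map a measurement to its clients, and `retrievalResult === 'OK'` is assumed to mark a successful retrieval.

```javascript
// Sketch only: `findDealClients` and the measurement/accumulator shapes are
// hypothetical. One measurement covering N clients adds exactly +1 to each of
// those N clients' totals; per-miner RSR (not shown) is unaffected.
const perClientStats = new Map()

const recordMeasurement = (measurement, findDealClients) => {
  const clients = findDealClients(measurement) ?? []
  for (const clientId of clients) {
    let stats = perClientStats.get(clientId)
    if (!stats) {
      stats = { total: 0, successful: 0 }
      perClientStats.set(clientId, stats)
    }
    stats.total++
    if (measurement.retrievalResult === 'OK') stats.successful++
  }
}
```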

@NikolasHaimerl

NikolasHaimerl commented Feb 13, 2025

The first task is to record per-client retrieval stats in spark-evaluate to create the basis for further use.
Effectively, this means that for each client_id we need to find all unique combinations of (miner_id, payload_cid) and aggregate the measurements for each combination.

Context

In spark-evaluate, the evaluate function is the entry point for evaluating a round of spark measurements. One of its arguments is the list of all measurements for the given round; each measurement contains the spark results for one combination of (miner_id, payload_cid).
Furthermore, evaluate fetches the round details, which contain a list of retrieval tasks. Each retrieval task lists the clients for a combination of (miner_id, payload_cid).
The new measurements are written to the database inside the function that updates the public stats for spark measurements. This function receives a lookup method that lists all clients for a given combination of (miner_id, payload_cid).
With this, we have all the information we need to create per client RSR stats. For each client we can find all combinations of (miner_id,payload_cid) and their respective success rates.

Implementation

I propose we create a new function which stores the per-client RSR scores. This function will be called inside updatePublicStats, similar to how the existing update functions are called. I will call this function updateDailyPerClientStats. It will take a list of all measurements in the form of committees, the database client, and the lookup method to find all clients for a given combination of (miner_id, payload_cid).
updateDailyPerClientStats will then create a list of all unique clients and assign each of them its respective list of committees. Similar to the function that builds the per-miner retrievalResultStats, we can create a function which yields the per-client result stats.

To summarize, the implementation will do the following:

  1. We iterate over all `(miner_id, payload_cid)` pairs to build a mapping `(miner_id, payload_cid) -> client_id[]`
  2. We reverse that mapping to get `client_id -> (miner_id, payload_cid)[]`
  3. From there we get to `client_id -> measurement[]`

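The three steps above could be sketched roughly like this (illustrative only; the task and measurement shapes mimic the description above, not the real spark-evaluate structures):

```javascript
// Illustrative only: `retrievalTasks` entries are assumed to carry
// { minerId, cid, clients }, and measurements { minerId, cid, ... }.
const groupMeasurementsByClient = (retrievalTasks, measurements) => {
  // Step 1: (miner_id, payload_cid) -> client_id[]
  const taskClients = new Map()
  for (const t of retrievalTasks) {
    taskClients.set(`${t.minerId}::${t.cid}`, t.clients)
  }

  // Steps 2 + 3 folded together: client_id -> measurement[]
  const measurementsPerClient = new Map()
  for (const m of measurements) {
    const clients = taskClients.get(`${m.minerId}::${m.cid}`) ?? []
    for (const clientId of clients) {
      if (!measurementsPerClient.has(clientId)) {
        measurementsPerClient.set(clientId, [])
      }
      measurementsPerClient.get(clientId).push(m)
    }
  }
  return measurementsPerClient
}
```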
For persisting this new data we will need a new table in our database. The proposed table format is the following:

(day, client_id, total, successful, successful_http)
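Assuming that table shape, folding one client's daily measurements into a row could look like this (sketch only; the `retrievalResult` and `protocol` field names are assumptions, not the confirmed measurement schema):

```javascript
// Sketch: fold one client's measurements for a day into a single row matching
// the proposed (day, client_id, total, successful, successful_http) schema.
// `retrievalResult` and `protocol` are assumed field names.
const buildDailyRow = (day, clientId, measurements) => {
  const row = { day, client_id: clientId, total: 0, successful: 0, successful_http: 0 }
  for (const m of measurements) {
    row.total++
    if (m.retrievalResult === 'OK') {
      row.successful++
      if (m.protocol === 'http') row.successful_http++
    }
  }
  return row
}
```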

Feedback is much appreciated @juliangruber @bajtos @pyropy

@juliangruber
Member

> Context

It was awesome that you listed all the context, linking to source code. That was easy to follow and got me up to speed 🙏

> updateDailyPerClientStats will then create a list of all unique clients

To clarify:

  1. We iterate over all `(miner_id, payload_cid)` pairs to build a mapping `(miner_id, payload_cid) -> client_id[]`
  2. We reverse that mapping to get `client_id -> (miner_id, payload_cid)[]`
  3. From there we get to `client_id -> measurement[]`

Is my understanding correct?

> (day, client_id, total, successful, successful_http)

👍

@NikolasHaimerl

> Context
>
> It was awesome that you listed all the context, linking to source code. That was easy to follow and got me up to speed 🙏
>
> updateDailyPerClientStats will then create a list of all unique clients
>
> To clarify:
>
> 1. We iterate over all `(miner_id, payload_cid)` pairs to build a mapping `(miner_id, payload_cid) -> client_id[]`
> 2. We reverse that mapping to get `client_id -> (miner_id, payload_cid)[]`
> 3. From there we get to `client_id -> measurement[]`
>
> Is my understanding correct?
>
> (day, client_id, total, successful, successful_http)
>
> 👍

Yes, that is my intention.

@pyropy

pyropy commented Feb 13, 2025

@NikolasHaimerl The implementation plan looks great, thanks for taking the time to write it! My proposal for the table name would be something like client_retrieval_stats, but it's ultimately up to you.

@juliangruber
Member

> Yes, that is my intention.

Great! Please add that to the implementation plan then, so a reviewer doesn't need to look at these comments :)

@bajtos
Member Author

bajtos commented Feb 13, 2025

The plan looks good to me 👍🏻

> This function will take in a list of all measurements in the form of committees,
> (...)
> Similarly to the function that builds the per miner retrievalResultStats we can also create a function which yields us the per client result stats.

Implementation-wise, it may be simpler to ignore committees and work directly with the list of measurements. I think we needed to use committees in lib/provider-retrieval-result-stats.js because that was the easiest way to include minority results in the aggregation. After my refactoring in CheckerNetwork/spark-evaluate#442, it's easy to get the same list by filtering all measurements matching taskingEvaluation === 'OK' (and ignoring consensusEvaluation in the filter).

Illustration of what I mean:

```js
for (const m of allMeasurements) {
  if (m.taskingEvaluation !== 'OK') continue
  const clients = findDealClients(m)
  for (const c of clients) {
    // update the aggregated data for client `c` using measurement `m`
  }
}
```

OTOH, if we start from committees, we can call findDealClients() only once per committee, not once per measurement, which may (or may not) be a meaningful performance optimisation.

It can also be said that it's better to make findDealClients() cheaper rather than to complicate updateDailyPerClientStats() to reduce the number of findDealClients() calls.

The current findDealClients() implementation uses Array.find, which is not optimal performance-wise. See https://github.com/CheckerNetwork/spark-evaluate/blob/49de2f667e05f80ac12022c130438a2fd0a480b6/lib/evaluate.js#L201-L202

```js
(minerId, cid) => sparkRoundDetails.retrievalTasks
  .find(t => t.cid === cid && t.minerId === minerId)?.clients
```
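One way to make findDealClients() cheaper, as suggested, is to index the tasks once per round in a Map so each lookup is O(1) instead of an Array.find scan. A sketch (the factory name and task shape are assumptions):

```javascript
// Build the index once per round; afterwards each lookup is a constant-time
// Map.get instead of scanning `retrievalTasks` with Array.find on every call.
const makeFindDealClients = (retrievalTasks) => {
  const byKey = new Map()
  for (const t of retrievalTasks) {
    byKey.set(`${t.minerId}::${t.cid}`, t.clients)
  }
  return (minerId, cid) => byKey.get(`${minerId}::${cid}`)
}
```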

On second thought, I am fine with either approach: the one described in the original proposal or the measurement-based one I described in this comment. It's something we can easily change in the future if needed.

@bajtos bajtos moved this to 📋 planned in Space Meridian Feb 14, 2025