RSR per client #193

Open
Tracked by #181 ...
bajtos opened this issue Nov 18, 2024 · 7 comments · May be fixed by CheckerNetwork/spark-evaluate#481
Comments
@bajtos
Member

bajtos commented Nov 18, 2024

Provide retrieval-based RSR calculated on a per-client basis.

  • REST API
  • Visualisation on the Spark Public Dashboard

Blocks #216

Related discussions:

@bajtos bajtos mentioned this issue Nov 18, 2024
@bajtos bajtos added the 🤗 good first issue label Dec 2, 2024
@bajtos bajtos mentioned this issue Dec 2, 2024
This was referenced Jan 6, 2025
@bajtos bajtos mentioned this issue Jan 30, 2025
@bajtos
Member Author

bajtos commented Feb 11, 2025

We need to change spark-evaluate to start producing this new data.

Caveat: one retrieval task (miner_id, payload_cid) can test deals from multiple clients, which is why each task has an array of client IDs. When calculating retrieval-based RSR, each measurement should contribute exactly +1 to the total number of measurements performed, irrespective of how many clients/deals were covered by that measurement. We need to figure out how to define and implement per-client measurement-based RSR and how to expand one measurement covering X clients into +1 measurement for each of those clients. This change must not affect per-miner RSR (i.e. we cannot add client_id to the existing table).
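A minimal sketch of the "+1 per covered client" rule described above. All names here are illustrative, not the actual spark-evaluate API: `findDealClients` is assumed to map a measurement to its clients, and `retrievalResult === 'OK'` is assumed to mark a successful retrieval.

```javascript
// Sketch only: `findDealClients` and the measurement/accumulator shapes are
// hypothetical. One measurement covering N clients adds exactly +1 to each of
// those N clients' totals; per-miner RSR (not shown) is unaffected.
const perClientStats = new Map()

const recordMeasurement = (measurement, findDealClients) => {
  const clients = findDealClients(measurement) ?? []
  for (const clientId of clients) {
    let stats = perClientStats.get(clientId)
    if (!stats) {
      stats = { total: 0, successful: 0 }
      perClientStats.set(clientId, stats)
    }
    stats.total++
    if (measurement.retrievalResult === 'OK') stats.successful++
  }
}
```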

@NikolasHaimerl

NikolasHaimerl commented Feb 13, 2025

The first task is to record per-client retrieval stats in spark-evaluate to create the basis for further use.
Effectively, this means that for each client_id we need to find all unique combinations of (miner_id, payload_cid) and aggregate the measurements for each combination.

Context

In spark-evaluate, the evaluate function is the entry point for evaluating a round of spark measurements. One of its arguments is the list of all measurements for the given round; each measurement contains the spark results for one combination of (miner_id, payload_cid).
Furthermore, evaluate fetches the round details, which contain a list of retrieval tasks. Each retrieval task lists the clients for a combination of (miner_id, payload_cid).
The new measurements are written to the database inside the function that updates the public stats for spark measurements. This function receives a lookup method that lists all clients for a given combination of (miner_id, payload_cid).
With this, we have all the information we need to create per client RSR stats. For each client we can find all combinations of (miner_id,payload_cid) and their respective success rates.

Implementation

I propose we create a new function which stores the per-client RSR scores. This function will be called inside updatePublicStats, similar to how the existing update functions are called. I will call this function updateDailyPerClientStats. It will take a list of all measurements in the form of committees, the database client, and the lookup method to find all clients for a given combination of (miner_id, payload_cid).
updateDailyPerClientStats will then create a list of all unique clients and assign each of them its respective list of committees. Similar to the function that builds the per-miner retrievalResultStats, we can create a function which yields the per-client result stats.

To summarize, the implementation will do the following:

  1. We iterate over all `(miner_id, payload_cid)` pairs to build a mapping `(miner_id, payload_cid) -> client_id[]`
  2. We reverse that mapping to get `client_id -> (miner_id, payload_cid)[]`
  3. From there we get to `client_id -> measurement[]`

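The three steps above could be sketched roughly like this (illustrative only; the task and measurement shapes mimic the description above, not the real spark-evaluate structures):

```javascript
// Illustrative only: `retrievalTasks` entries are assumed to carry
// { minerId, cid, clients }, and measurements { minerId, cid, ... }.
const groupMeasurementsByClient = (retrievalTasks, measurements) => {
  // Step 1: (miner_id, payload_cid) -> client_id[]
  const taskClients = new Map()
  for (const t of retrievalTasks) {
    taskClients.set(`${t.minerId}::${t.cid}`, t.clients)
  }

  // Steps 2 + 3 folded together: client_id -> measurement[]
  const measurementsPerClient = new Map()
  for (const m of measurements) {
    const clients = taskClients.get(`${m.minerId}::${m.cid}`) ?? []
    for (const clientId of clients) {
      if (!measurementsPerClient.has(clientId)) {
        measurementsPerClient.set(clientId, [])
      }
      measurementsPerClient.get(clientId).push(m)
    }
  }
  return measurementsPerClient
}
```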
For persisting this new data we will need a new table in our database. The proposed table format is the following:

(day, client_id, total, successful, successful_http)
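Assuming that table shape, folding one client's daily measurements into a row could look like this (sketch only; the `retrievalResult` and `protocol` field names are assumptions, not the confirmed measurement schema):

```javascript
// Sketch: fold one client's measurements for a day into a single row matching
// the proposed (day, client_id, total, successful, successful_http) schema.
// `retrievalResult` and `protocol` are assumed field names.
const buildDailyRow = (day, clientId, measurements) => {
  const row = { day, client_id: clientId, total: 0, successful: 0, successful_http: 0 }
  for (const m of measurements) {
    row.total++
    if (m.retrievalResult === 'OK') {
      row.successful++
      if (m.protocol === 'http') row.successful_http++
    }
  }
  return row
}
```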

Feedback is much appreciated @juliangruber @bajtos @pyropy

@juliangruber
Member

> Context

It was awesome that you listed all the context, linking to source code. That was easy to follow and got me up to speed 🙏

> updateDailyPerClientStats will then create a list of all unique clients

To clarify:

  1. We iterate over all `(miner_id, payload_cid)` pairs to build a mapping `(miner_id, payload_cid) -> client_id[]`
  2. We reverse that mapping to get `client_id -> (miner_id, payload_cid)[]`
  3. From there we get to `client_id -> measurement[]`

Is my understanding correct?

> (day, client_id, total, successful, successful_http)

👍

@NikolasHaimerl

> Context
>
> It was awesome that you listed all the context, linking to source code. That was easy to follow and got me up to speed 🙏
>
> updateDailyPerClientStats will then create a list of all unique clients
>
> To clarify:
>
> 1. We iterate over all `(miner_id, payload_cid)` pairs to build a mapping `(miner_id, payload_cid) -> client_id[]`
> 2. We reverse that mapping to get `client_id -> (miner_id, payload_cid)[]`
> 3. From there we get to `client_id -> measurement[]`
>
> Is my understanding correct?
>
> (day, client_id, total, successful, successful_http)
>
> 👍

Yes, that is my intention.

@pyropy

pyropy commented Feb 13, 2025

@NikolasHaimerl The implementation plan looks great, thanks for taking the time to write it! My proposal for the table name would be something like client_retrieval_stats, but it's ultimately up to you.

@juliangruber
Member

> Yes, that is my intention.

Great! Please add that to the implementation plan then, so a reviewer doesn't need to look at these comments :)

@bajtos
Member Author

bajtos commented Feb 13, 2025

The plan looks good to me 👍🏻

> This function will take in a list of all measurements in the form of committees,
> (...)
> Similarly to the function that builds the per miner retrievalResultStats we can also create a function which yields us the per client result stats.

Implementation-wise, it may be simpler to ignore committees and work directly with the list of measurements. I think we needed to use committees in lib/provider-retrieval-result-stats.js because that was the easiest way to include minority results in the aggregation. After my refactoring in CheckerNetwork/spark-evaluate#442, it's easy to get the same list by filtering all measurements matching taskingEvaluation === 'OK' (and ignoring consensusEvaluation in the filter).

Illustration of what I mean:

```js
for (const m of allMeasurements) {
  if (m.taskingEvaluation !== 'OK') continue
  const clients = findDealClients(m)
  for (const c of clients) {
    // update the aggregated data for client `c` using measurement `m`
  }
}
```

OTOH, if we start from committees, we can call findDealClients() only once per committee, not once per measurement, which may (or may not) be a meaningful performance optimisation.

It can also be said that it's better to make findDealClients() cheaper rather than to complicate updateDailyPerClientStats() to reduce the number of findDealClients() calls.

The current findDealClients() implementation uses Array.find, which is not optimal performance-wise. See https://github.com/CheckerNetwork/spark-evaluate/blob/49de2f667e05f80ac12022c130438a2fd0a480b6/lib/evaluate.js#L201-L202

```js
(minerId, cid) => sparkRoundDetails.retrievalTasks
  .find(t => t.cid === cid && t.minerId === minerId)?.clients
```
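One way to make findDealClients() cheaper, as suggested, is to index the tasks once per round in a Map so each lookup is O(1) instead of an Array.find scan. A sketch (the factory name and task shape are assumptions):

```javascript
// Build the index once per round; afterwards each lookup is a constant-time
// Map.get instead of scanning `retrievalTasks` with Array.find on every call.
const makeFindDealClients = (retrievalTasks) => {
  const byKey = new Map()
  for (const t of retrievalTasks) {
    byKey.set(`${t.minerId}::${t.cid}`, t.clients)
  }
  return (minerId, cid) => byKey.get(`${minerId}::${cid}`)
}
```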

On second thought, I am fine with either approach: the one described in the original proposal or the measurement-based one I described in this comment. It's something we can easily change in the future if needed.

@bajtos bajtos moved this to 📋 planned in Space Meridian Feb 14, 2025