Integrate Kalman Filter-based Torrent Health Estimation #8188

Open
4 tasks
grimadas opened this issue Oct 3, 2024 · 3 comments

@grimadas
Contributor

grimadas commented Oct 3, 2024

The problem

As highlighted in this comment, relying solely on self-assessments isn’t scalable. Navigating through a sea of misleading or fake health signals is challenging. We need a mechanism to (1) filter out spam and irrelevant information and (2) reliably rank popularity and emerging trends.

Solution

Why not apply some tried-and-true signal processing techniques to see if they can cut through the noise?

My plan is to integrate a Kalman Filter-based algorithm into Tribler to estimate torrent health and filter out dead torrents based on seeder reports. At the moment, I have a prototype built on the filterpy library, specifically its Unscented Kalman Filter (UKF) implementation. The algorithm lets us combine seeder reports from various peers while accounting for measurement noise and for the reliability score of each source. It is also fast to run.
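
To make the idea concrete, here is a minimal sketch of how seeder reports could be fused. It is not the prototype: it swaps the UKF for a plain scalar (linear) Kalman filter and uses no libraries. The names `SeederEstimate`, `predict`, `update` and the `drift_per_second` parameter are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SeederEstimate:
    """Scalar Kalman state: estimated seeder count and its variance."""
    mean: float
    variance: float


def predict(est: SeederEstimate, dt: float, drift_per_second: float) -> SeederEstimate:
    """Time update: keep the mean, grow the uncertainty with elapsed time.

    drift_per_second is an assumed process-noise rate, not a value from
    the prototype.
    """
    return SeederEstimate(est.mean, est.variance + drift_per_second * dt)


def update(est: SeederEstimate, reported: float, report_variance: float) -> SeederEstimate:
    """Measurement update: blend one seeder report into the estimate.

    A noisy report (large report_variance) yields a small gain and so
    barely moves the estimate.
    """
    gain = est.variance / (est.variance + report_variance)
    mean = est.mean + gain * (reported - est.mean)
    variance = (1.0 - gain) * est.variance
    return SeederEstimate(mean, variance)
```

With equal estimate and report variances the gain is 0.5, so the new mean lands halfway between the previous estimate and the report, and the variance is halved.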

To adapt to the dynamic nature of torrent networks, I have made a few adjustments:

  • Torrent health checks are performed at different times, so each one is trusted only to a certain degree; the model estimates how likely the torrent's health is to have changed since the check.
  • Outliers in health reports are defined as values lying outside a 95-99% confidence interval.
  • If a peer consistently provides unreliable reports, its reputation score is decreased drastically. If a report seems valid, the peer's reputation score is slightly increased.
  • These reputation scores are then used as weights in the predict_health function, which computes the current best estimate of torrent health at a given timestamp.
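
The outlier and reputation rules above could look roughly like the sketch below. This is an illustration under assumptions, not the prototype's code: the function names, the 1.96-sigma gate (about a 95% interval for a Gaussian), the penalty and reward values, and the way reputation scales the measurement variance are all hypothetical.

```python
import math


def is_outlier(reported: float, mean: float, variance: float,
               report_variance: float, sigmas: float = 1.96) -> bool:
    """Flag a report that falls outside the confidence interval around
    the current estimate (1.96 sigmas is roughly 95% for a Gaussian)."""
    spread = math.sqrt(variance + report_variance)
    return abs(reported - mean) > sigmas * spread


def update_reputation(reputation: float, outlier: bool,
                      penalty: float = 0.5, reward: float = 0.05) -> float:
    """Asymmetric update: a sharp multiplicative drop for an outlier,
    a small additive gain for a plausible report. Clamped to [0.01, 1]."""
    if outlier:
        return max(0.01, reputation * penalty)
    return min(1.0, reputation + reward)


def effective_variance(base_variance: float, reputation: float) -> float:
    """Turn reputation into a weight: a less trusted peer's report is
    treated as noisier, so it moves the estimate less."""
    return base_variance / reputation
```

The asymmetry is the design choice here: trust is cheap to lose and slow to regain, so a peer cannot offset a burst of fake reports with a few honest ones.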

Development plan:

  • Integrate the current prototype into the Tribler client and run it locally to test its effectiveness using real network health checks. Evaluate how adequate the algorithm is.

  • Work through numerical examples on real data and analyse performance.

  • Refactor the Kalman Filter to use only numpy, removing the scipy dependency, which is too heavy for a lightweight solution.

  • Experimental release

@adlai

adlai commented Oct 11, 2024

Why are both scipy and numpy together considered too much, if numpy alone is not?

@qstokkink
Contributor

This approach seems viable. As a small POC, I stripped out both numpy and scipy: https://gist.github.com/qstokkink/823c566d532c4d3556fd100f7d9105e6

As an added benefit, the version without those libraries (which I named "quinten") is also ~10x faster:

[Image: out]

@synctext
Copy link
Member

Is 10 really a usable lowest seeder count? Users wait for a week sometimes to see if a seeder comes back.
