[Enhancement] Overhaul indexing to be more efficient #540

Merged
merged 5 commits into from
Jan 2, 2025

kieraneglin
Owner

@kieraneglin kieraneglin commented Jan 2, 2025

What's new?

N/A

What's changed?

This is a very meaningful change to the way indexing works, but you can skip to the bottom for a tl;dr.

Background

Before, any indexing run would read the entire contents of a channel and capture everything, every time. This was initially seen as an advantage for two reasons:

  1. Changes made by the uploader would be picked up and reflected in the media items (e.g., a title change)
  2. Previously private videos would be captured if they were made public again

In practice, the former rarely happens after the first few days of a video's upload and the latter happens so infrequently as to not be worth hinging such a big design decision on. I wouldn't have made this PR if it were just about idealized design choices - there are some real downsides to this approach:

  • Indexing is a very slow process. Indexing a large channel can (and does) take days
  • To prevent rate limiting, I limit the number of concurrent yt-dlp processes. This helps, but it naturally trends toward multiple large channels being indexed at the same time while smaller channels have to wait until one of those processes frees up
  • All the while, these long-running processes take up a non-zero amount of system resources even though the app appears to be idle

To combat this, I introduced a concept called fast indexing. This helped alleviate the pressure by doing more frequent checks using more efficient mechanisms (RSS, API), but it still required a full scan once per month to ensure everything was kept in sync.

So, what's new?

Indexing has been updated so that it stops once it's reached content it's seen before. More precisely, it stops when it's scanned a certain number of previously-indexed videos. For this initial iteration I've set that number to 20, so this means that indexing stops once it's scanned 20 videos past the most recent video you've indexed.

Why have this weird offset logic? It's to still allow changes from the uploader to get picked up by Pinchflat. If I stopped once indexing had scanned the latest video you've indexed, any changes to recent videos wouldn't get picked up. This isn't a perfect approach, but I've been running some long tests over the last few months and 20 videos is the sweet spot for catching essentially all changes while not taking too much time. I'm sure there's a use case I haven't thought of so the number itself may be subject to change, but the opportunity to cut indexing times for large channels down from days to ~2 minutes was too good to pass up.
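As a rough illustration of the stopping rule described above, here is a minimal Python sketch. It is not the actual Pinchflat implementation: the function name `scan_until_seen`, the `already_indexed` callback, and the newest-first list of video IDs are all assumptions made for the example. The only value taken from this PR is the threshold of 20.

```python
from typing import Callable, Iterable, Iterator, Optional

# Threshold taken from the PR description; the author notes it may change.
SEEN_VIDEO_THRESHOLD = 20


def scan_until_seen(
    video_ids: Iterable[str],
    already_indexed: Callable[[str], bool],
    threshold: Optional[int] = SEEN_VIDEO_THRESHOLD,
) -> Iterator[str]:
    """Yield video IDs in the order given (assumed newest first).

    Scanning stops once `threshold` previously-indexed videos have been
    scanned. Passing `threshold=None` disables the early stop, which is
    this sketch's stand-in for a full channel scan.
    """
    seen = 0
    for video_id in video_ids:
        # Already-indexed videos are still yielded so that recent
        # uploader changes (e.g. a new title) can be picked up.
        yield video_id
        if threshold is not None and already_indexed(video_id):
            seen += 1
            if seen >= threshold:
                return


# Hypothetical channel: 100 videos, newest first, 10 of them not yet indexed.
channel = [f"v{i}" for i in range(99, -1, -1)]
indexed = {f"v{i}" for i in range(90)}

scanned = list(scan_until_seen(channel, indexed.__contains__))
print(len(scanned))  # 10 new videos + 20 previously-indexed ones = 30
```

Because the function is a generator, a lazily produced listing would only be consumed as far as the stopping point, which is where the time savings would come from in this sketch.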

If you still need to scan the full channel for whatever reason, you can select "force index" from the actions menu while viewing a source. Playlists are not impacted by this change - only channels. Fast indexing still works great and there's no reason to move away from it if you're already using it.

TL;DR:

  • Normal (non-fast) indexing no longer re-scans the entire channel, which makes it much more time- and resource-efficient
    • In my testing, indexing a large channel went from taking over a day to taking ~2 minutes
  • However, using the "force index" button for a source does scan the entire channel, so use it with care
  • This change only applies to channels, not playlists
  • You can now set your indexing times to be more aggressive on larger channels if you aren't using fast indexing. If you are using fast indexing, you probably don't have to change anything
  • Please create a bug report if things aren't working!

What's fixed?

  • Fixed a bug where forcing a source to index scheduled the index run for a future time instead of running it immediately
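The PR text does not describe what caused this bug, so the following Python sketch only illustrates the intended behaviour: a forced run is due now, while a normal run waits for the configured interval. The function `next_index_time` and its parameters are hypothetical and are not taken from the Pinchflat codebase.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_index_time(
    last_indexed_at: Optional[datetime],
    frequency: timedelta,
    now: datetime,
    force: bool = False,
) -> datetime:
    """Return when the next index run should happen (hypothetical helper)."""
    # A forced run, or a source that has never been indexed, is due right away.
    if force or last_indexed_at is None:
        return now
    # Otherwise wait out the interval, but never schedule in the past.
    return max(now, last_indexed_at + frequency)


now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
last_run = now - timedelta(hours=1)
daily = timedelta(hours=24)

scheduled = next_index_time(last_run, daily, now)
forced = next_index_time(last_run, daily, now, force=True)
print(scheduled - now)  # 23:00:00
print(forced == now)    # True
```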

Any other comments?

N/A

@kieraneglin kieraneglin added the enhancement label Jan 2, 2025
@kieraneglin kieraneglin self-assigned this Jan 2, 2025
@kieraneglin kieraneglin merged commit 9185f07 into master Jan 2, 2025
1 check passed
@kieraneglin kieraneglin deleted the ke/improve-slow-indexing-approach branch January 2, 2025 23:48