feat: avoid oom snapshot #26043

Draft
wants to merge 7 commits into
base: main

@praveen-influx praveen-influx commented Feb 20, 2025

This PR addresses the OOM issue when snapshotting (or at least reduces the chance of running into it) through the following main changes:

  • defaults the gen 1 duration to 1m (instead of 10m),
  • builds snapshot chunks lazily, and
  • runs the sort/dedupe step itself serially (i.e. one at a time).

As an optimisation, when a snapshot is not forced, it aggregates up to 10m worth of chunks and writes them in parallel. The assumption is that a normal (unforced) snapshot has enough memory to do this.

closes: #25991
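
One possible reading of the description above, as a minimal sketch: a forced snapshot processes chunks strictly one at a time, while an unforced snapshot processes them in parallel. The names `Row`, `sort_dedupe`, and `persist_all` are hypothetical and are not taken from the influxdb3 code; returning a row count stands in for writing a Parquet file.

```rust
use std::thread;

/// Hypothetical row: (series key, timestamp, value).
type Row = (u32, i64, f64);

/// Sorts rows by (key, timestamp) and keeps the last-written value for each
/// duplicate (key, timestamp) pair.
fn sort_dedupe(mut rows: Vec<Row>) -> Vec<Row> {
    // `sort_by_key` is stable, so write order is preserved among duplicates.
    rows.sort_by_key(|r| (r.0, r.1));
    let mut out: Vec<Row> = Vec::with_capacity(rows.len());
    for row in rows {
        if let Some(last) = out.last_mut() {
            if last.0 == row.0 && last.1 == row.1 {
                // Same key and timestamp: the later write wins.
                *last = row;
                continue;
            }
        }
        out.push(row);
    }
    out
}

/// Processes every chunk and returns the number of rows "written" per chunk.
/// Assumption: `forced` means memory is tight, so only one sorted chunk may
/// be in flight at a time.
fn persist_all(chunks: Vec<Vec<Row>>, forced: bool) -> Vec<usize> {
    if forced {
        // Serial path: each sorted chunk is dropped before the next is built.
        chunks.into_iter().map(|c| sort_dedupe(c).len()).collect()
    } else {
        // Parallel path: one scoped thread per chunk, joined in input order.
        thread::scope(|s| {
            let handles: Vec<_> = chunks
                .into_iter()
                .map(|c| s.spawn(move || sort_dedupe(c).len()))
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("worker panicked"))
                .collect::<Vec<usize>>()
        })
    }
}
```

Both paths return the same counts; the difference is only in how many sorted chunks exist in memory at once.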

@praveen-influx praveen-influx force-pushed the praveen/avoid-oom-snapshot branch 2 times, most recently from 56024af to 2213e14 Compare February 24, 2025 10:38
@praveen-influx praveen-influx changed the title from "Praveen/avoid oom snapshot" to "feat: avoid oom snapshot" Feb 24, 2025
@@ -181,7 +181,7 @@ pub struct Config {
#[clap(
long = "gen1-duration",
env = "INFLUXDB3_GEN1_DURATION",
-default_value = "10m",
+default_value = "1m",
Contributor Author

Defaulting to 1m means there are more query chunks in QueryableBuffer (10 times more), but this hasn't been an issue so far.
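
To make the "10 times more" figure concrete, here is a small sketch of time bucketing. It assumes a chunk is identified by its timestamp truncated to the gen 1 duration; `chunk_time` and `distinct_chunks` are hypothetical helpers, not functions from the influxdb3 code base, and the duration is assumed to be non-zero.

```rust
use std::collections::BTreeSet;
use std::time::Duration;

/// Truncates a timestamp (in seconds) to the start of its gen 1 window.
/// Assumes `gen1` is at least one second long.
fn chunk_time(ts_secs: i64, gen1: Duration) -> i64 {
    let d = gen1.as_secs() as i64;
    ts_secs - ts_secs.rem_euclid(d)
}

/// Counts how many distinct chunks a set of timestamps falls into.
fn distinct_chunks(timestamps: &[i64], gen1: Duration) -> usize {
    timestamps
        .iter()
        .map(|&t| chunk_time(t, gen1))
        .collect::<BTreeSet<_>>()
        .len()
}
```

With writes spread evenly over ten minutes, a 1m duration yields ten chunks where a 10m duration yields one.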


-for chunk in snapshot_chunks {
+for chunk in snapshot_chunks_iter {
Contributor Author

snapshot_chunks_iter produces each SnapshotChunk lazily, uses the chunk to create a PersistJob, and then moves it into TableBuffer's snapshotting_chunks. Because a write lock is held on this buffer above, it is OK to remove the key and then add it back. Previously snapshotting_chunks was cloned; this change avoids the clone.
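
The move-instead-of-clone idea in this comment can be sketched roughly as follows. All types here (`Chunk`, `PersistJob`, `TableBuffer`) and the function `snapshot_jobs` are simplified, hypothetical stand-ins rather than the real influxdb3 definitions. The `&mut` borrow plays the role of the write lock: while the iterator is alive, nothing else can observe a chunk between its removal and re-insertion.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a buffered chunk (not the real influxdb3 type).
#[derive(Debug)]
struct Chunk {
    chunk_time: i64,
    rows: Vec<u64>,
}

/// Hypothetical job description derived from a chunk.
#[derive(Debug, PartialEq)]
struct PersistJob {
    chunk_time: i64,
    row_count: usize,
}

/// Hypothetical table buffer: live chunks keyed by chunk time, plus the
/// chunks that are currently being snapshotted.
struct TableBuffer {
    chunks: BTreeMap<i64, Chunk>,
    snapshotting_chunks: Vec<Chunk>,
}

impl TableBuffer {
    fn from_chunks(chunks: Vec<Chunk>) -> Self {
        let chunks = chunks.into_iter().map(|c| (c.chunk_time, c)).collect();
        Self { chunks, snapshotting_chunks: Vec::new() }
    }
}

/// Lazily yields one job per chunk older than `end_time`. Each chunk is
/// removed from the map, inspected, and moved (not cloned) into
/// `snapshotting_chunks` only when the iterator is advanced.
fn snapshot_jobs<'a>(
    buf: &'a mut TableBuffer,
    end_time: i64,
) -> impl Iterator<Item = PersistJob> + 'a {
    // Only the keys are collected eagerly; the chunk data stays in the map.
    let keys: Vec<i64> = buf.chunks.range(..end_time).map(|(k, _)| *k).collect();
    keys.into_iter().filter_map(move |key| {
        let chunk = buf.chunks.remove(&key)?;
        let job = PersistJob { chunk_time: key, row_count: chunk.rows.len() };
        buf.snapshotting_chunks.push(chunk);
        Some(job)
    })
}
```

If the iterator is dropped after yielding one job, only that one chunk has been moved; the remaining chunks are still in the live map.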

persisted_files: Arc<PersistedFiles>,
persisted_snapshot: Arc<Mutex<PersistedSnapshot>>,
) {
let iterator = PersistJobGroupedIterator::new(
Contributor Author

This allows chunks to be grouped: with a 1m gen 1 duration, it aggregates up to 10 chunks and writes them as a single Parquet file covering a 10m window.
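
A grouping iterator of this kind might look like the sketch below. It is not the actual `PersistJobGroupedIterator`; `GroupedJobs` and this simplified `PersistJob` are hypothetical, and the sketch assumes jobs arrive sorted by table and then by chunk time.

```rust
use std::iter::Peekable;

/// Hypothetical, simplified persist job: one 1m chunk of one table.
#[derive(Debug, Clone, PartialEq)]
struct PersistJob {
    table: String,
    chunk_time_secs: i64,
}

/// Groups consecutive jobs that share a table and fall in the same window.
/// Assumes the input is sorted by (table, chunk time).
struct GroupedJobs<I: Iterator<Item = PersistJob>> {
    inner: Peekable<I>,
    window_secs: i64,
}

impl<I: Iterator<Item = PersistJob>> GroupedJobs<I> {
    fn new(jobs: I, window_secs: i64) -> Self {
        Self { inner: jobs.peekable(), window_secs }
    }
}

impl<I: Iterator<Item = PersistJob>> Iterator for GroupedJobs<I> {
    type Item = Vec<PersistJob>;

    fn next(&mut self) -> Option<Self::Item> {
        let first = self.inner.next()?;
        let window_secs = self.window_secs;
        let window = first.chunk_time_secs.div_euclid(window_secs);
        let table = first.table.clone();
        let mut group = vec![first];
        // Pull jobs only while they belong to the same table and window.
        while let Some(job) = self.inner.next_if(|j| {
            j.table == table && j.chunk_time_secs.div_euclid(window_secs) == window
        }) {
            group.push(job);
        }
        Some(group)
    }
}
```

With a 600-second window and 60-second chunks, each group holds at most ten jobs per table, matching the "up to 10 chunks" figure in the comment.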

@praveen-influx praveen-influx force-pushed the praveen/avoid-oom-snapshot branch 11 times, most recently from 3a4a9ab to d85460b Compare February 25, 2025 18:23
}

#[test_log::test(tokio::test)]
async fn test_snapshot_serially_two_tables_with_varying_throughput() {
Contributor Author

@pauldix - this should cover the case we discussed, with 2 tables receiving different amounts of writes.

@praveen-influx praveen-influx force-pushed the praveen/avoid-oom-snapshot branch 4 times, most recently from 9aa8fdf to 0760147 Compare March 3, 2025 12:39
- extra debug logs added
- test fixes
@praveen-influx praveen-influx force-pushed the praveen/avoid-oom-snapshot branch from 0760147 to 19c29ab Compare March 5, 2025 16:32
Successfully merging this pull request may close these issues.

Avoid OOM when running snapshots at higher throughput