
Predictive batching #16

Open
petervandivier opened this issue Dec 22, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@petervandivier
Owner

Uneven data distribution hurts queue throughput (#5) and can cause batch failure if a batch exceeds 60 minutes runtime.

Get a baseline export size using estimate_data_size() or similar, and use it to predict appropriate batch sizes for a given data range.
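The idea above could be sketched roughly like this (a hypothetical Python helper; `split_range` and all names are illustrative, and it assumes data is approximately evenly distributed across the time range):

```python
import math
from datetime import datetime, timedelta

def split_range(start: datetime, end: datetime,
                total_bytes: int, max_batch_bytes: int):
    """Split [start, end) into equal sub-ranges sized so that, assuming
    data is evenly distributed over time, no batch exceeds max_batch_bytes.
    total_bytes would come from a baseline size estimate for the range."""
    n_batches = max(1, math.ceil(total_bytes / max_batch_bytes))
    step = (end - start) / n_batches
    return [(start + i * step, start + (i + 1) * step)
            for i in range(n_batches)]
```

For uneven distributions the equal-width split is only a first cut; batches over hot spots would still need to be subdivided further.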

@petervandivier
Owner Author

Back-of-the-envelope math suggests ~4.5 GB batch sizes if you want to target 15-minute run times per batch.

  • 5 MB/s throughput
  • × 60 sec = 300 MB per minute
  • × 15 min ≈ 4.5 GB

Predictor function should allow user input to set a custom target run time (remembering the 60 min hard cap, minus a safety buffer).
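A minimal sketch of such a predictor (hypothetical names and defaults; the 5 MB/s throughput and 5-minute buffer are assumptions taken from the back-of-the-envelope math above, not measured values):

```python
def predict_batch_size_bytes(target_minutes: float = 15,
                             throughput_bytes_per_sec: int = 5 * 1024**2,
                             hard_cap_minutes: float = 60,
                             buffer_minutes: float = 5) -> int:
    """Batch size (bytes) expected to finish in target_minutes at the given
    throughput, clamping the target below the hard cap minus a buffer."""
    effective_minutes = min(target_minutes, hard_cap_minutes - buffer_minutes)
    if effective_minutes <= 0:
        raise ValueError("target run time must be positive after the cap buffer")
    return int(effective_minutes * 60 * throughput_bytes_per_sec)
```

With the defaults this lands at 15 min × 60 s × 5 MiB/s ≈ 4.4 GiB, matching the estimate above; a user asking for a 2-hour target would be clamped to 55 minutes.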

@petervandivier petervandivier added the enhancement New feature or request label Dec 22, 2023
@petervandivier
Owner Author

Maybe steer clear of estimate_data_size() & stick to .show extents. estimate_data_size(*) appears to read the entire table into memory - which isn't super surprising in retrospect 😅
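Something along these lines, perhaps (a hedged KQL sketch; `MyTable` is a placeholder, and the exact output columns of the extents command should be checked against the ADX docs):

```
.show table MyTable extents
| summarize total_original_bytes = sum(OriginalSize)
```

This reads extent metadata rather than scanning the table itself, which is the point of preferring it over estimate_data_size(*).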
