[EPIC] Databases and Data warehouse Benchmarking #11719

Closed
grishick opened this issue Apr 5, 2022 · 3 comments

@grishick
Contributor

grishick commented Apr 5, 2022

Tell us about the problem you're trying to solve

This epic is for DB source connector benchmarks, including JDBC and CDC configs for all supported database sources.

Describe the solution you’d like

I would like benchmark results to be published on GitHub regularly (every time a connector or the platform changes), covering the following areas for each database source connector that is in the GA or Beta stage:

  • source to dev/null full refresh job (broken down by size of source)
  • source to dev/null incremental job (broken down by size of incremental change)

Additionally (lower priority), I would like us to publish results of E2E benchmarks from source databases to specific data warehouses. We can limit these to one DW per cloud provider and a single source database.

  • E2E source to a data warehouse full refresh job (broken down by size of source)
  • E2E source to a data warehouse incremental job (broken down by size of incremental change)

Each of the benchmarks listed above should also be split into JDBC and CDC (if supported) use cases.
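
To make the scope concrete, here is a minimal sketch that enumerates the benchmark variants described above for one source connector. The names `benchmark_matrix` and `supports_cdc` are hypothetical and exist only for illustration; the size breakdowns (size of source, size of incremental change) are left out because the issue does not specify them.

```python
from itertools import product

# Dimensions taken from the lists above; the labels are illustrative only.
DESTINATIONS = ["dev/null", "data warehouse"]
SYNC_MODES = ["full refresh", "incremental"]
REPLICATION_METHODS = ["JDBC", "CDC"]


def benchmark_matrix(supports_cdc: bool) -> list[tuple[str, str, str]]:
    """Return every (destination, sync mode, replication method) combination.

    CDC variants are included only when the source supports CDC.
    """
    methods = [m for m in REPLICATION_METHODS if m != "CDC" or supports_cdc]
    return list(product(DESTINATIONS, SYNC_MODES, methods))


if __name__ == "__main__":
    for destination, sync_mode, method in benchmark_matrix(supports_cdc=True):
        print(f"source -> {destination} | {sync_mode} | {method}")
```

A source that supports CDC yields eight variants and one that does not yields four, before any breakdown by data size.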

Numbers that benchmarks have to confirm/maintain

  • 4.5M rows/hour average throughput from database sources
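
For reference, 4.5M rows/hour works out to 1,250 rows per second. Below is a rough sketch of how a benchmark run could be checked against that number. The function names and the idea of measuring a run as (rows synced, elapsed seconds) are assumptions for illustration, not part of any existing Airbyte tooling.

```python
# Target taken from the issue text: 4.5M rows/hour average throughput.
TARGET_ROWS_PER_HOUR = 4_500_000
SECONDS_PER_HOUR = 3600


def rows_per_hour(rows_synced: int, elapsed_seconds: float) -> float:
    """Convert a measured sync (row count and duration) to rows per hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return rows_synced * SECONDS_PER_HOUR / elapsed_seconds


def meets_target(rows_synced: int, elapsed_seconds: float) -> bool:
    """Return True when the measured throughput reaches the target."""
    return rows_per_hour(rows_synced, elapsed_seconds) >= TARGET_ROWS_PER_HOUR


if __name__ == "__main__":
    # Hypothetical run: 10M rows synced in two hours.
    print(rows_per_hour(10_000_000, 7200), meets_target(10_000_000, 7200))
```

Averaging over a whole sync, as done here, matches the "average throughput" wording; a per-interval check would be a different, stricter test.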

Tech spec draft

Inspiration

grishick added the type/enhancement, area/connectors, and connectors/source/postgres labels on Apr 5, 2022
noahkawasaki-airbyte changed the title from "[EPIC] Database source benchmarks" to "[EPIC] Databases and Data warehouse Benchmarking" on Apr 26, 2022
@arimbr
Contributor

arimbr commented Apr 27, 2022

Nice! How do these benchmarks compare to real-usage stats computed based on syncs run by Airbyte OSS and Cloud users?

On a side note, I feel like the work done here could lead to a great article on the Airbyte engineering blog.

@noahkawasaki-airbyte
Contributor

Hey Ari, this project is super early and hasn't really even started yet lol! The idea is that we're trying to replicate customer-like volumes of data and have internal syncs running to make sure Airbyte can handle them.

Definitely a good idea for a blog post in the future once it's built out, though.

@evantahler
Contributor

@grishick how does this epic relate to #15152?
