Multi-threading for dbt_external_tables #109
Comments
@codigo-ergo-sum Yes yes yes! Thanks for opening, and sorry for the delay in response. I think this is going to become an increasing need as dbt project footprints grow, and along with them the adoption of external tables via this package. Here's the catch: dbt's threading capabilities are exposed to Python tasks, but they are not accessible from the Jinja programming interface. All macros, loops, etc. run sequentially, in a single thread. So, to make this happen, we'd need to take one of three approaches.
Hi @codigo-ergo-sum, @jtcohen6 - we are experiencing the same issue. Our number of sources is still growing, and it already takes 30+ minutes for fewer than 300 objects. Thanks,
Hey @bram-be - it doesn't appear that dbt-core itself supports parallel run-operations yet, so we will have to wait until that happens. In the meantime, I'm thinking that even if that is the case - that is, run-operations are single-threaded - what if we could submit multiple SQL commands at a time, rather than just one at a time as we do now? Currently, we loop over the external tables and execute each create statement individually.
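For illustration, a minimal Python sketch of that current per-table pattern (the real package generates adapter-specific Jinja/SQL; the function name, stub `execute`, and DDL text here are assumptions, not the package's actual code):

```python
# Illustrative sketch of the current behavior: every external table gets
# its own CREATE statement, each submitted in its own round trip.

def stage_sequential(tables, execute):
    """Submit one CREATE EXTERNAL TABLE statement per table, in order."""
    for table in tables:
        ddl = f"CREATE OR REPLACE EXTERNAL TABLE {table} (...)"
        execute(ddl)  # one network round trip per table

# Example with a stub "warehouse" that just records what it receives:
submitted = []
stage_sequential(["src.a", "src.b", "src.c"], submitted.append)
print(len(submitted))  # 3 statements, 3 round trips
```

With hundreds of tables, the per-statement round-trip latency dominates, which is why the run takes tens of minutes.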
I'm wondering whether bundling the statements up and submitting them in batches would materially speed up the operation or not.
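A rough sketch of that bundling idea, assuming the adapter or driver accepts semicolon-joined multi-statement strings (support for this varies by platform; the function name, DDL text, and batch size are hypothetical):

```python
# Sketch of the batching idea: join several CREATE statements with
# semicolons and submit each batch in a single call.

def stage_batched(tables, execute, batch_size=50):
    statements = [
        f"CREATE OR REPLACE EXTERNAL TABLE {t} (...)" for t in tables
    ]
    for i in range(0, len(statements), batch_size):
        batch = ";\n".join(statements[i:i + batch_size])
        execute(batch)  # one round trip per batch instead of per table

# Stub example: 120 tables in batches of 50 means only 3 submissions.
calls = []
stage_batched([f"src.t{n}" for n in range(120)], calls.append, batch_size=50)
print(len(calls))  # 3 round trips for 120 tables
```

Whether this actually helps depends on whether the warehouse executes the statements within a batch serially anyway; it mainly saves round-trip overhead rather than adding true parallelism.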
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue, or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
I'm voting for this request to be re-opened. I have the same problem as others in this thread. As our external tables increase in number, not having them refresh in parallel will impact SLAs and data lag.
My workaround currently is to not use the dbt-external-tables package when I need to refresh source external tables. I just add the
@etaboada-sanford Check out this new discussion; we'd love to have your thoughts and participation there: That is, we'd be most likely to move forward with option (2) that I outlined in my comment above.
Describe the feature
It would be lovely if, when dbt_external_tables is run, it obeyed dbt's threading parameters and created external tables in parallel. I ask this because we have about 400 external tables, and running the
run-operation
command (dbt run-operation stage_external_sources --vars 'ext_full_refresh: true')
can take somewhere between 10 and 20 minutes. We usually run 8 threads or so for dbt run and dbt test, so is there a reason that external table creation couldn't be parallelized?
Describe alternatives you've considered
Can't think of anything obvious right now, but I'm open to alternatives.
Additional context
This is for Snowflake in our case but seems like it would be equally useful on other database platforms.
Who will this benefit?
Anyone creating/recreating any significant volume of external tables on a regular basis.
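To make the request concrete, here is a hedged sketch of what threaded submission could look like in plain Python. `stage_parallel` and its `execute` callable are hypothetical, not dbt or package APIs, and a real implementation would need one warehouse connection per worker:

```python
# Hypothetical sketch: fan the per-table DDL out across a thread pool,
# mirroring dbt's `threads` setting. `execute` is a stub standing in for
# a per-connection statement runner.

from concurrent.futures import ThreadPoolExecutor

def stage_parallel(tables, execute, threads=8):
    ddls = [f"CREATE OR REPLACE EXTERNAL TABLE {t} (...)" for t in tables]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map runs the calls concurrently but still yields results
        # in the original input order
        return list(pool.map(execute, ddls))
```

With 8 workers and roughly uniform per-table latency, wall-clock time for ~400 tables would drop toward an eighth of the sequential run, minus connection and scheduling overhead.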