
Multi-threading for dbt_external_tables #109

Closed
codigo-ergo-sum opened this issue Sep 14, 2021 · 8 comments
Labels
enhancement New feature or request Stale

Comments

@codigo-ergo-sum

codigo-ergo-sum commented Sep 14, 2021

Describe the feature

It would be lovely if, when dbt_external_tables runs, it obeyed dbt's threading parameters and created external tables in parallel. I ask because we create about 400 external tables, and running the run-operation command (dbt run-operation stage_external_sources --vars 'ext_full_refresh: true') can take somewhere between 10 and 20 minutes. We usually run 8 threads or so for dbt run and dbt test, so is there a reason that external table creation couldn't be parallelized?

Describe alternatives you've considered

Can't think of anything obvious right now but open to alternatives

Additional context

This is for Snowflake in our case but seems like it would be equally useful on other database platforms.

Who will this benefit?

Anyone creating/recreating any significant volume of external tables on a regular basis.

@codigo-ergo-sum codigo-ergo-sum added the enhancement New feature or request label Sep 14, 2021
@jtcohen6
Collaborator

@codigo-ergo-sum Yes yes yes! Thanks for opening, and sorry for the delay in response.

I think this is going to become an increasing need as dbt project footprints grow, and along with them the adoption of external tables via this package.

Here's the catch: dbt's threading capabilities are exposed to Python tasks, but they are not accessible from the Jinja programming interface. All macros, loops, etc. run in sequence, in a single thread. So, in order to make this happen, we'd need to take one of three approaches:

  1. Expose threading capabilities to the Jinja context. I don't quite know how we'd do this — perhaps with some kind of context manager/decorator?
  2. Move the logic of this package into dbt-core, and (crucially) rewrite the main entry-point / operation (stage_external_sources) as a new task. Just like how the source freshness task operates concurrently over a set of sources, a source external task would be able to do the same—and it'd be able to use standard flags and node selection, in lieu of the janky substitutes I've coded up thus far.
  3. Make the task interface of dbt-core pluggable, that is, support Python tasks defined elsewhere. This would get us all the benefits from option 2, while also opening up those benefits far and wide, without the requirement of adding and maintaining code within the centralized dbt-core codebase. This option is the most exciting to me personally, and it's something I'd be excited to think more about next year.
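To make option 2 concrete, here is a minimal sketch of what a concurrent task over sources could look like, in the spirit of how `dbt source freshness` fans work out across threads. This is purely illustrative: `execute_ddl` and `stage_external_sources` are hypothetical stand-ins, not real dbt-core APIs.

```python
# Hypothetical sketch of option 2: fan a set of source nodes out across a
# thread pool, the way the source freshness task runs concurrently over
# sources. `execute_ddl` is a placeholder for whatever adapter call would
# actually run the CREATE EXTERNAL TABLE statement.
from concurrent.futures import ThreadPoolExecutor

def execute_ddl(source_name: str) -> str:
    # Placeholder for adapter.execute("create or replace external table ...")
    return f"created external table {source_name}"

def stage_external_sources(sources: list[str], threads: int = 8) -> list[str]:
    # Each source's DDL is independent, so the statements can run
    # concurrently, bounded by the same `threads` setting dbt uses
    # for models and tests.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(execute_ddl, sources))

results = stage_external_sources(["foo", "bar", "baz"])
```

Because each external table's DDL is independent of the others, a thread pool like this maps cleanly onto the problem; the real work would be wiring it into dbt's task, flag, and node-selection machinery.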

@bram-be

bram-be commented Mar 14, 2023

Hi @codigo-ergo-sum , @jtcohen6

We are experiencing the same issue: our number of sources is still growing, and it already takes more than 30 minutes for fewer than 300 objects.
Is there perhaps a workaround available in the meantime?

Thanks,

@jeremyyeo
Collaborator

Hey @bram-be - it doesn't appear that dbt-core itself supports parallel "run-operations" yet, so we will have to wait until that happens. In the meantime, I'm thinking that even if that is the case - that is, run-ops are single-threaded - what if we could submit multiple SQL commands at a time rather than just one at a time as we do now...

Currently, for each external table, we loop over the create statement like so:

MainThread: create or replace external table foo...
MainThread: SQL success
MainThread: create or replace external table bar...
MainThread: SQL success
MainThread: create or replace external table baz...
MainThread: SQL success

I'm wondering whether bundling the statements up, like so:

MainThread: 
	create or replace external table foo...;
    create or replace external table bar...;
    create or replace external table baz...;
MainThread: SQL success

would materially speed up the operation.
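The intuition behind batching can be sketched as follows. This is a toy model, not Snowflake client code: `submit` is a hypothetical stand-in for the warehouse call, and it only counts round trips (on Snowflake, the connector would also need multi-statement execution enabled for the bundled case to work).

```python
# Toy model of the batching idea: one client round trip per statement
# versus one round trip for a concatenated multi-statement command.
round_trips = 0

def submit(sql: str) -> None:
    # Stand-in for the warehouse client call; each invocation models
    # one network round trip.
    global round_trips
    round_trips += 1

statements = [
    "create or replace external table foo ...",
    "create or replace external table bar ...",
    "create or replace external table baz ...",
]

# Current behaviour: one submission per statement.
for stmt in statements:
    submit(stmt)
per_statement_trips = round_trips  # one round trip per table

# Proposed behaviour: one submission for the whole bundle.
round_trips = 0
submit(";\n".join(statements) + ";")
batched_trips = round_trips  # a single round trip
```

Whether this helps in practice depends on where the time goes: batching removes per-statement client round trips, but the statements still execute sequentially on the warehouse side.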

@github-actions

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 21, 2023
@github-actions

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Sep 28, 2023
@etaboada-sanford

I'm voting for this request to be re-opened. I have the same problem as the other commenters in this thread. As our external tables grow in number, not being able to refresh them in parallel will impact SLAs and increase data lag.

@etaboada-sanford

My current workaround is not to use the dbt-external-tables package when I need to refresh source external tables. I just add `alter external table my_ext_table refresh` as a pre-hook to the individual models. This way, I get concurrency.
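For anyone wanting to try this workaround, a minimal model file might look like the following. The model and table names are placeholders; the point is that pre-hooks run per model, so the refreshes inherit dbt's normal model-level thread concurrency.

```sql
-- Hypothetical models/my_model.sql: refresh the external table this model
-- reads from as a pre-hook. Because pre-hooks execute as part of each
-- model's run, the refreshes run with dbt's usual `threads` parallelism.
{{ config(
    pre_hook="alter external table my_ext_table refresh"
) }}

select * from my_ext_table
```

The trade-off is that the refresh now runs on every `dbt run` of that model, rather than only when you invoke the stage_external_sources operation.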

@jtcohen6
Collaborator

jtcohen6 commented Oct 6, 2023

@etaboada-sanford Check out this new discussion, would love to have your thoughts & participation there:

That is - we'd be most likely to move forward with option (2) that I outlined in my comment above.
