streaming_table incremental model in Databricks table update job explosively consuming compute #10523
Unanswered
ctivanovich asked this question in Q&A
Replies: 1 comment · 2 replies
-
hey @ctivanovich! first off -- yikes! Thanks for the write-up. I think you're best off opening this as a bug report on dbt-databricks, where @benc-db, the maintainer, can help you sort this out. Though it very well may be that the root cause is in dbt-core, I think that's a better place to start.
-
Hello, we are pretty mystified about why we suddenly have runaway costs after making use of the incremental model type "streaming_table" in a Databricks bronze-layer daily ingestion job.
My understanding is that the model, when first configured and run, will do a full load, and will then keep history for all loaded files, only adding new files per the strategy. Some sample code from one of our dbt models:
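(An illustrative sketch; the bucket path, format option, and derived columns here are placeholders rather than our real values.)

```sql
{{
    config(
        materialized = 'streaming_table'
    )
}}

-- Auto Loader ingestion: on each run the streaming table picks up only
-- files that have arrived in the landing path since the previous run
select
    *,
    _metadata.file_path as source_file,
    current_timestamp() as ingested_at
from stream read_files(
    's3://our-bucket/landing/events/',
    format => 'json'
)
```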
With each subsequent run to check for new data, we found:

- charges for serverless jobs compute, even though serverless is disabled at the account level;
- DLT pipelines being launched but reported under the serverless jobs compute SKU;
- a massive amount of generic serverless jobs compute consumed alongside each DLT pipeline run, which presumably kicked off a lot of additional job cluster creation.
Our dbt job runs on a "classic" Pro SQL warehouse, and under the hood running the model issues a REFRESH LIVE TABLE, which in turn runs the corresponding DLT pipeline on serverless compute that we can't see in the DLT console or in our audit logs.
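To make that concrete, each incremental run boils down to a refresh along these lines (the exact statement may differ by dbt-databricks version; the table name is a placeholder):

```sql
-- Issued on our SQL warehouse; the actual file discovery and load
-- happen in the backing pipeline, on compute Databricks manages itself.
REFRESH STREAMING TABLE bronze.events_raw;
```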
Does anyone know how all of this works under the hood, and have an idea as to why simple Auto Loader ingestion via DLT's REFRESH LIVE TABLE functionality would cost something like 1000x what we would expect?
Thank you.