streaming_table incremental model in Databricks table update job explosively consuming compute #10523
Unanswered
ctivanovich asked this question in Q&A
Replies: 1 comment · 2 replies
-
hey @ctivanovich! first off -- yikes! Thanks for the write-up. I think you're best off opening this as a bug report on dbt-databricks, where @benc-db, the maintainer, can help you sort this out. Though it very well may be that the root cause is in dbt-core, I think that's a better place to start.
-
Hello, we are pretty mystified about why we suddenly have runaway costs after making use of the incremental model type "streaming_table" in a Databricks bronze-layer daily ingestion job.
My understanding is that the model, when first configured and run, will do a full load, and will then keep history for all loaded files, only adding new files per the strategy. Some sample code from one of our dbt models:
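(An illustrative sketch; the bucket path, format option, and derived columns here are placeholders rather than our real values.)

```sql
{{
    config(
        materialized = 'streaming_table'
    )
}}

-- Auto Loader ingestion: on each run the streaming table picks up only
-- files that have arrived in the landing path since the previous run
select
    *,
    _metadata.file_path as source_file,
    current_timestamp() as ingested_at
from stream read_files(
    's3://our-bucket/landing/events/',
    format => 'json'
)
```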
With each subsequent run to check for new data, we found:

- charges for serverless jobs compute, even though serverless is disabled at the account level;
- DLT pipelines being launched but reported under the serverless jobs compute SKU;
- a massive amount of generic serverless jobs compute consumed alongside each DLT pipeline run, which presumably kicked off a lot of additional job cluster creation.
Our dbt job runs on a "classic" Pro SQL warehouse, and under the hood running the model issues a REFRESH LIVE TABLE, which in turn runs the corresponding DLT pipeline on serverless compute that we can't see in the DLT console or in our audit logs.
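To make that concrete, each incremental run boils down to a refresh along these lines (the exact statement may differ by dbt-databricks version; the table name is a placeholder):

```sql
-- Issued on our SQL warehouse; the actual file discovery and load
-- happen in the backing pipeline, on compute Databricks manages itself.
REFRESH STREAMING TABLE bronze.events_raw;
```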
Does anyone know how all of this works under the hood, and have an idea as to why simple Auto Loader ingestion via DLT's REFRESH LIVE TABLE functionality would cost something like 1000x what we would expect?
Thank you.