
dbt source freshness applied to public sources #51

Merged: 4 commits into master from feature/sources_freshness_tests on Jun 19, 2024

Conversation

@eduardo-nercolini (Contributor) commented Jun 12, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the jira ticket associated with the PR.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated the docs and README with the added features, breaking changes, new instructions on how to use the repository.

Release planning

  • I've decided if this PR requires a new major/minor/patch version according to
    semver, and I've renamed the BRANCH to release/*, feature/*, or patch/*.

What

This PR proposes testing sources with the native dbt source freshness check, with thresholds derived from the build interval of these models/sources (the dbt_enriched_base_tables DAG). A source is flagged with severity 'warn' when its data is more than one interval (30 minutes) old, and with 'error' when it exceeds 2x the interval, i.e. more than 60 minutes old.
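For context, a freshness block of the kind this PR adds looks roughly like the following. This is a sketch: the source and table names are illustrative, and the real values live in this repo's sources .yml files.

```yaml
sources:
  - name: crypto_stellar          # illustrative source name
    loaded_at_field: closed_at    # column dbt compares against the current time
    freshness:
      warn_after: {count: 30, period: minute}   # 1x the DAG interval
      error_after: {count: 60, period: minute}  # 2x the DAG interval
    tables:
      - name: contract_data       # tables can override the source-level defaults
```

Running `dbt source freshness` then reports pass, warn, or error per source table depending on how stale the `loaded_at_field` maximum is.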

Why

Source freshness testing is necessary in the pipeline to detect outdated data and prevent models from being built on top of it, ensuring greater quality and reliability.

Known limitations

The suggested thresholds should be discussed to verify that they fit the expected intervals of the Stellar data pipeline, so that we can accurately assess whether the data is outdated. Additionally, the dbt source freshness command, like dbt build or dbt run, must be executed explicitly. It therefore needs to be added to an Airflow DAG or task, and we need to decide the most suitable point in the pipeline to run it.

Suggestion: dbt supports the dbt build --select "source_status:fresher+" command. When dbt source freshness is executed, the sources.json artifact is created, which contains execution times and max_loaded_at timestamps for dbt sources. dbt can compare this artifact against the previous run's to determine the set of fresh sources. In your job commands, you can tell dbt to build and test only these fresher sources and their children by selecting source_status:fresher+. This means models whose sources returned ERROR STALE would not be updated; only models whose sources passed the freshness test would run (more information in the dbt docs on the "source_status" selection method).
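As a sketch, the two-step job could look like this in an Airflow task or CI step. The dbt source freshness command and the source_status:fresher+ selector are real dbt features; the prod-artifacts/ path passed to --state is an illustrative assumption about where the previous run's artifacts are stored.

```shell
# 1. Snapshot source freshness; writes target/sources.json
dbt source freshness

# 2. Build and test only models whose sources are fresher than in the
#    previous run; --state must point at the prior run's artifacts
#    (the prod-artifacts/ directory here is illustrative).
dbt build --select "source_status:fresher+" --state prod-artifacts/
```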

We could also have a dashboard showing which tables are up-to-date and which are not, based on freshness. Once dbt source freshness is running, Elementary will collect the freshness data, and we can use it to add these visualizations to the Data Observability - Models, Sources and Invocations dashboard.

@sydneynotthecity (Contributor)

@harsha-stellar-data can you take a crack at reviewing this PR?

@eduardo-nercolini (Contributor, Author)

@sydneynotthecity when I ran the freshness tests with these thresholds, the sources src_contract_code and src_contract_data resulted in ERROR STALE, so I didn't include the test for these sources in this commit. I'd like to verify with you whether the defined intervals make sense and whether the sources are actually outdated, or whether a different interval should be defined in these cases. Once I have your feedback, I'll update the config and push a new commit. Thanks

@eduardo-nercolini (Contributor, Author)

P.S.: I tested again just now and src_contract_data passed, but src_contract_code didn't. Either way, I would appreciate your comment on this. Thanks

@sydneynotthecity (Contributor)

when I ran the freshness tests at these defined intervals, the sources src_contract_code and src_contract_data resulted in ERROR STALE

@enercolini yes, it's fine to remove the freshness test for src_contract_code. I'm surprised that contract_data failed. @harsha-stellar-data can you profile the contract_data table by batch_run_date and see if there is an appropriate, predictable interval for a freshness test on this table?

@harsha-stellar-data (Contributor) commented Jun 13, 2024

@eduardo-nercolini I reviewed the changes; here are some comments.
My thoughts: since freshness is a property of the source data, we should use closed_at whenever it is available.

  • For history_trades, I see batch_run_date being used, but "ledger_closed_at" is available and is updated frequently and correctly. Should we be using this timestamp field instead?

  • We do not need a freshness or recency check on contract_code, as its update intervals are very unpredictable.

  • For contract_data, a freshness check at 30 minutes (warn) and 60 minutes (error) should work. I performed the following analysis on the table and did not notice any deviation from the pattern:

    ```sql
    WITH src_contract_data AS (
      SELECT
        batch_id,
        MIN(closed_at) AS min_closed,
        MAX(closed_at) AS max_closed
      FROM `crypto-stellar.crypto_stellar.contract_data`
      GROUP BY batch_id
    )
    SELECT batch_id, min_closed, max_closed
    FROM src_contract_data
    WHERE TIMESTAMP_DIFF(max_closed, min_closed, MINUTE) > 30;
    ```

  • I only noticed a recency check on contract_data in the code, though. Did we back out the changes based on the failures observed? Can you also provide details on which field is being used for the check? We should be using closed_at, but batch_run_date should also work.

Let me know your thoughts here.

@sydneynotthecity @chowbao

@sydneynotthecity (Contributor)

@harsha-stellar-data I agree, we can update all freshness checks to use closed_at so that they're consistent and use the right source column. This aligns with our overall strategy to repartition all tables by closed_at

Nice analysis. @eduardo-nercolini, let's add the freshness test for contract_data back in. If you see failures with the test, can you post the results? There were early gaps in the data when Soroban first launched, but there should not be recent examples of freshness problems.
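To distinguish those early post-launch gaps from recent ones, one quick profile is the lag between consecutive batch maxima of closed_at; any recent gap above the 60-minute error threshold would explain a failing test. A sketch against the same table as the analysis above:

```sql
-- Sketch: minutes between consecutive batches' max(closed_at); recent rows
-- with a gap over 60 minutes would trip the proposed error_after threshold.
WITH batches AS (
  SELECT batch_id, MAX(closed_at) AS max_closed
  FROM `crypto-stellar.crypto_stellar.contract_data`
  GROUP BY batch_id
)
SELECT
  batch_id,
  max_closed,
  TIMESTAMP_DIFF(
    max_closed,
    LAG(max_closed) OVER (ORDER BY max_closed),
    MINUTE
  ) AS minutes_since_prev_batch
FROM batches
ORDER BY max_closed DESC
LIMIT 50;
```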

@harsha-stellar-data (Contributor) left a review:
Updated changes to check Freshness look good.

@eduardo-nercolini merged commit c3e6d3f into master Jun 19, 2024
1 check passed
@sydneynotthecity deleted the feature/sources_freshness_tests branch November 14, 2024