
dbt source freshness applied to public sources #51

Merged: 4 commits into master from feature/sources_freshness_tests on Jun 19, 2024

Conversation

@eduardo-nercolini (Contributor) commented Jun 12, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the jira ticket associated with the PR.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated the docs and README with the added features, breaking changes, new instructions on how to use the repository.

Release planning

  • I've decided if this PR requires a new major/minor/patch version according to
    semver, and I've renamed the BRANCH to release/*, feature/*, or patch/*.

What

This PR proposes testing sources with the native dbt source freshness check, with thresholds derived from the build interval of these models/sources (the dbt_enriched_base_tables DAG). A source is flagged with severity 'warn' when its data is more than one interval (30 minutes) old, and with 'error' when it exceeds 2x the interval, i.e. more than 60 minutes old.
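For context, a freshness block of the kind this PR adds looks roughly like the following. This is a sketch: the source and table names are illustrative, and the real values live in this repo's sources .yml files.

```yaml
sources:
  - name: crypto_stellar          # illustrative source name
    loaded_at_field: closed_at    # column dbt compares against the current time
    freshness:
      warn_after: {count: 30, period: minute}   # 1x the DAG interval
      error_after: {count: 60, period: minute}  # 2x the DAG interval
    tables:
      - name: contract_data       # tables can override the source-level defaults
```

Running `dbt source freshness` then reports pass, warn, or error per source table depending on how stale the `loaded_at_field` maximum is.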

Why

Source freshness testing is necessary in the pipeline to detect outdated data and prevent models from being built on top of it, ensuring greater quality and reliability.

Known limitations

The suggested thresholds should be discussed to verify that they fit the expected intervals of the Stellar data pipeline, so that we can accurately assess whether the data is outdated. Additionally, the dbt source freshness command, like dbt build or dbt run, must be executed explicitly. It therefore needs to be added to an Airflow DAG or task, and we need to decide the most suitable point in the pipeline to run it.

Suggestion: dbt supports the dbt build --select "source_status:fresher+" command. When dbt source freshness is executed, the sources.json artifact is created, which contains execution times and max_loaded_at timestamps for dbt sources. dbt can compare this artifact against the previous run's to determine the set of fresh sources. In your job commands, you can tell dbt to build and test only these fresher sources and their children by selecting source_status:fresher+. This means models whose sources returned ERROR STALE would not be updated; only models whose sources passed the freshness test would run (more information in the dbt docs on the "source_status" selection method).
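As a sketch, the two-step job could look like this in an Airflow task or CI step. The dbt source freshness command and the source_status:fresher+ selector are real dbt features; the prod-artifacts/ path passed to --state is an illustrative assumption about where the previous run's artifacts are stored.

```shell
# 1. Snapshot source freshness; writes target/sources.json
dbt source freshness

# 2. Build and test only models whose sources are fresher than in the
#    previous run; --state must point at the prior run's artifacts
#    (the prod-artifacts/ directory here is illustrative).
dbt build --select "source_status:fresher+" --state prod-artifacts/
```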

We could also have a dashboard showing which tables are up-to-date and which are not, based on freshness. Once dbt source freshness is running, Elementary will collect the freshness data, and we can use it to add these visualizations to the Data Observability - Models, Sources and Invocations dashboard.

@sydneynotthecity (Contributor)

@harsha-stellar-data can you take a crack at reviewing this PR?

@eduardo-nercolini (Contributor, Author)

@sydneynotthecity when I ran the freshness tests with these thresholds, the sources src_contract_code and src_contract_data resulted in ERROR STALE, so I didn't include the test for these sources in this commit. I'd like to verify with you whether the defined intervals make sense and whether the sources are actually outdated, or whether a different interval should be defined in these cases. Once I have your feedback, I'll update the config and push a new commit. Thanks

@eduardo-nercolini (Contributor, Author)

P.S.: I tested again just now and src_contract_data passed, but src_contract_code didn't. Either way, I would appreciate your comment on this. Thanks

@sydneynotthecity (Contributor)

when I ran the freshness tests at these defined intervals, the sources src_contract_code and src_contract_data resulted in ERROR STALE

@enercolini yes, it's fine to remove the freshness test for src_contract_code. I'm surprised that contract_data failed. @harsha-stellar-data can you profile the contract_data table by batch_run_date and see if there is an appropriate, predictable interval for a freshness test on this table?

@harsha-stellar-data (Contributor) commented Jun 13, 2024

@eduardo-nercolini I reviewed the changes; here are some comments.
My thoughts: since freshness is a property of the source data, we should use closed_at whenever it is available.

  • For history_trades, I see batch_run_date being used, but "ledger_closed_at" is available and is updated frequently and correctly. Should we be using this timestamp field instead?

  • We do not need a freshness or recency check on contract_code, as its update intervals are very unpredictable.

  • For contract_data, a freshness check at 30 minutes (warn) and 60 minutes (error) should work. I performed the following analysis on the table and did not notice any deviation from the pattern:

    ```sql
    WITH src_contract_data AS (
      SELECT
        batch_id,
        MIN(closed_at) AS min_closed,
        MAX(closed_at) AS max_closed
      FROM `crypto-stellar.crypto_stellar.contract_data`
      GROUP BY batch_id
    )
    SELECT batch_id, min_closed, max_closed
    FROM src_contract_data
    WHERE TIMESTAMP_DIFF(max_closed, min_closed, MINUTE) > 30;
    ```

  • I only noticed a recency check on contract_data in the code, though. Did we back out the changes based on the failures observed? Can you also provide details on which field is being used for the check? We should be using closed_at, but batch_run_date should also work.

Let me know your thoughts here.

@sydneynotthecity @chowbao

@sydneynotthecity (Contributor)

@harsha-stellar-data I agree, we can update all freshness checks to use closed_at so that they're consistent and use the right source column. This aligns with our overall strategy to repartition all tables by closed_at

Nice analysis. @eduardo-nercolini, let's add the freshness test for contract_data back in. If you see failures with the test, can you post the results? There were early gaps in the data when Soroban first launched, but there should not be recent examples of freshness problems.
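To distinguish those early post-launch gaps from recent ones, one quick profile is the lag between consecutive batch maxima of closed_at; any recent gap above the 60-minute error threshold would explain a failing test. A sketch against the same table as the analysis above:

```sql
-- Sketch: minutes between consecutive batches' max(closed_at); recent rows
-- with a gap over 60 minutes would trip the proposed error_after threshold.
WITH batches AS (
  SELECT batch_id, MAX(closed_at) AS max_closed
  FROM `crypto-stellar.crypto_stellar.contract_data`
  GROUP BY batch_id
)
SELECT
  batch_id,
  max_closed,
  TIMESTAMP_DIFF(
    max_closed,
    LAG(max_closed) OVER (ORDER BY max_closed),
    MINUTE
  ) AS minutes_since_prev_batch
FROM batches
ORDER BY max_closed DESC
LIMIT 50;
```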

@harsha-stellar-data (Contributor) left a review:
Updated changes to check Freshness look good.

@eduardo-nercolini merged commit c3e6d3f into master Jun 19, 2024
1 check passed
@sydneynotthecity deleted the feature/sources_freshness_tests branch November 14, 2024