dbt source freshness applied to public sources #51
Conversation
@harsha-stellar-data can you take a crack at reviewing this PR?
@sydneynotthecity when I ran the freshness tests at these defined intervals, the sources
P.S.: I tested again now and
@enercolini yes, it's fine to remove the freshness test for
@eduardo-nercolini I reviewed the changes and here are some comments. Let me know your thoughts here.
@harsha-stellar-data I agree, we can update all freshness checks to use

Nice analysis, @eduardo-nercolini, let's add the freshness test for
The updated changes to check freshness look good.
PR Checklist

PR Structure
- otherwise).

Thoroughness

Release planning
- semver, and I've changed the name of the BRANCH to release/*, feature/*, or patch/*.
What
This PR brings a new proposal for testing sources using the native dbt command `dbt source freshness`, with intervals defined according to the build execution interval of these models/sources (the `dbt_enriched_base_tables` DAG). Thus, data will be flagged with a severity of 'warn' if it is more than 30 minutes old, and with 'error' for differences exceeding 2x the interval, meaning more than 60 minutes.

Why
Source testing is necessary in the pipeline to detect outdated data and prevent models from being built on top of it, ensuring greater quality and reliability.
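As a concrete sketch, the warn/error windows proposed here could be declared in a sources YAML file. The source name, table name, and `loaded_at_field` below are hypothetical placeholders, not the actual project files:

```yaml
# models/sources.yml -- hypothetical sketch, not the actual project config
version: 2

sources:
  - name: crypto_stellar                        # placeholder source name
    loaded_at_field: batch_run_date             # hypothetical timestamp column
    freshness:
      warn_after: {count: 30, period: minute}   # 1x the DAG interval
      error_after: {count: 60, period: minute}  # 2x the DAG interval
    tables:
      - name: history_ledgers                   # placeholder table
```

Table-level `freshness` blocks can override the source-level defaults if some sources load on a different cadence.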
Known limitations
The suggested interval should be discussed to verify whether it is appropriate and fits the expected intervals of the Stellar data pipeline, in order to properly and more accurately assess whether the data is outdated or not. Additionally, the `dbt source freshness` command, similar to `dbt build` or `dbt run`, must be executed. Therefore, it needs to be added to an Airflow DAG or task, prompting us to determine the most suitable timing for executing this test/command.

Suggestion: dbt has the `dbt build --select "source_status:fresher+"` command feature. When `dbt source freshness` is executed, the `sources.json` artifact is created, which contains execution times and `max_loaded_at` dates for dbt sources. dbt will then use those artifacts to determine the set of fresh sources. In your job commands, you can signal dbt to run and test only these fresher sources and their children by including the `source_status:fresher+` argument. This would mean that models whose sources had the status `ERROR STALE` would not be updated, only those whose sources have passed the freshness test (more information in the dbt docs on the "source_status" selection method).

Also, we could have a dashboard showing which tables are up-to-date and which are not, based on freshness. Once dbt freshness is running, Elementary will collect freshness data, and we can use this data to add these visualizations to the Data Observability - Models, Sources and Invocations dashboard.
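A minimal sketch of how the two commands could be chained in a job or DAG task. The artifact directory path is an assumption for illustration; dbt's `--state` flag compares the current run against a previous run's artifacts:

```shell
# Step 1: run the freshness checks; dbt writes target/sources.json
dbt source freshness

# Keep the artifact from this run so the next run can compare against it
# (hypothetical location; in Airflow this could be a shared volume or bucket)
cp target/sources.json previous-state/

# Step 2 (next invocation): build only sources that became fresher,
# plus all their downstream children, comparing against the saved state
dbt build --select "source_status:fresher+" --state previous-state
```

In Airflow, steps 1 and 2 would naturally map to separate tasks, with the artifact handed off between them.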
would not be updated, only those where its sources have passed in the freshness test (more information on The "source_status" status).Also we could have a dashboard showing which tables are up-to-date and which are not, based on freshness. Once dbt freshness is running, Elementary will collect data for freshness and we can use this data to add these visualizations to the Data Observability - Models, Sources and Invocations dashboard.