[BUG] query rule is skipped #44

Closed

holdenk opened this issue Oct 3, 2023 · 11 comments

Labels: bug (Something isn't working)

holdenk commented Oct 3, 2023

Describe the bug

The query_dq rule is skipped instead of being run (see reproduction below).

To Reproduce

Run the same code as #43

Expected behavior

Expect the rule to run and fail.

Screenshots

See 1:47 of https://www.youtube.com/watch?v=bNvvPKv-dmQ

Desktop (please complete the following information):
1.0

Additional context
N/A

holdenk added the bug label on Oct 3, 2023
@phanikumarvemuri

I will work on it

@phanikumarvemuri

@holdenk When I looked into the rules you used, I see that there is no row_dq rule for the table local.3rd_fake, and the query_dq rule is enabled only for target_dq_validation. For final_query_dq to run, there has to be a row_dq rule, or enable_for_source_dq_validation should be set to true. The stats also confirm that both row_dq and final_query_dq are skipped.

The sequence of execution is:

1. source_agg_dq
2. source_query_dq
3. row_dq
4. final_agg_dq (only if row_dq is enabled)
5. final_query_dq (only if row_dq is enabled)

[Screenshot (2023-10-11): dq_stats output showing row_dq and final_query_dq marked as Skipped]

The final_query_dq step is gated by this check:

    if _row_dq is True and _target_query_dq is True:

The two rules involved are:

{"product_id": "pay", "table_name": "local.fake_table_name", "rule_type": "row_dq", "rule": "bonus_checker", "column_name": "MaleBonusPercent", "expectation": "MaleBonusPercent > FemaleBonusPercent", "action_if_failed": "drop", "tag": "", "description": "Sample rule that the male bonuses should be higher. Thankfully this fails (but could be lower base pay etc.)", "enable_for_source_dq_validation": true, "enable_for_target_dq_validation": false, "is_active": true, "enable_error_drop_alert": true, "error_drop_threshold": 1}
{"product_id": "pay", "table_name": "local.3rd_fake", "rule_type": "query_dq", "rule": "history", "column_name": "Bloop", "expectation": "(select count(*) from 3rd_fake_view) < (select input_count from local.dq_stats WHERE table_name='local.3rd_fake')", "action_if_failed": "fail", "tag": "", "description": "We should always have more records than before", "enable_for_source_dq_validation": false, "enable_for_target_dq_validation": true, "is_active": true, "enable_error_drop_alert": true, "error_drop_threshold": 1}


holdenk commented Oct 11, 2023

Ah, interesting, so the aggregate data quality checks are only run if there are row data quality checks present?


phanikumarvemuri commented Oct 11, 2023

Both agg_dq and query_dq rule types can be run on the source and the target. Here the source is the input dataframe, and the target is the data left after rows are filtered/dropped by the row_dq rules.

In the above example, enable_for_source_dq_validation is set to false and enable_for_target_dq_validation is set to true: source validation is skipped because of the enable_for_source_dq_validation flag, and target validation is skipped because there is no row_dq rule.
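
For example (an illustrative tweak, not a change anyone applied here), flipping enable_for_source_dq_validation to true on the existing rule row would let it run as source_query_dq against the input dataframe; the alternative is to add a row_dq rule for local.3rd_fake so that final_query_dq has a target to validate:

    {"product_id": "pay", "table_name": "local.3rd_fake", "rule_type": "query_dq", "rule": "history", "column_name": "Bloop", "expectation": "(select count(*) from 3rd_fake_view) < (select input_count from local.dq_stats WHERE table_name='local.3rd_fake')", "action_if_failed": "fail", "tag": "", "description": "We should always have more records than before", "enable_for_source_dq_validation": true, "enable_for_target_dq_validation": true, "is_active": true, "enable_error_drop_alert": true, "error_drop_threshold": 1}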

@phanikumarvemuri

@holdenk Please let us know if this is good to be closed?
cc @asingamaneni


holdenk commented Oct 14, 2023

I mean, you can, although the fact that aggregation rules only run when a row_dq rule is present is something I did not understand from the docs. So I think it would be good to either always run them, or update the docs and produce a warning (or error) when encountering a rule which is enabled but will not run.
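
A rough sketch of the kind of pre-flight warning being suggested (a hypothetical helper, not an existing spark-expectations API):

    import warnings

    # Hypothetical helper, not part of spark-expectations: warn when a rule is
    # active but can never execute under the gating described earlier in this issue.
    def warn_unreachable_rules(rules):
        for table in {r["table_name"] for r in rules}:
            table_rules = [r for r in rules
                           if r["table_name"] == table and r.get("is_active", True)]
            has_row_dq = any(r["rule_type"] == "row_dq" for r in table_rules)
            for r in table_rules:
                if r["rule_type"] in ("agg_dq", "query_dq"):
                    source_on = r.get("enable_for_source_dq_validation", False)
                    target_on = r.get("enable_for_target_dq_validation", False)
                    if target_on and not source_on and not has_row_dq:
                        warnings.warn(
                            f"Rule '{r['rule']}' on {table} is enabled but will be "
                            f"skipped: target validation requires a row_dq rule and "
                            f"source validation is disabled.")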

@newfront

Late to the party here. It feels like a dry-run option could be an interesting pre-integration test, e.g. saying "no rules table or rules for table|stats|x; while this will not prevent spark-expectations from running, the benefits of x mean y."

I am all for docs where they are applicable, but getting warnings or short-circuiting in the case of missing definitions would be good for the project too, so that a user can do something like:

  1. initialize spark-expectations
  2. bootstrap the dq tables - empty if nothing else
  3. run pre-flight. "scanning. zero rules for x (table|view)". Suggestions: (scanned (delta|iceberg|x) table - structure - (table.schema....) - recommended starting rule (x) -> do you want to apply non_null to the columns on table (x)...

I always like being able to fumble and get the package to identify my mistakes. Spark does this nicely with AnalysisExceptions. Delta does something similar, also with AnalysisExceptions. Maybe there is a nice pattern for minimal viable rules with warnings (...)

@phanikumarvemuri

Agree with both options.
I will close this issue and open a feature request to support a dryRun option, and also to update the documentation and add a flow chart wherever required.
Thanks

@asingamaneni

Dryrun is an awesome thought. Let’s implement it and also update the documentation where necessary!

@phanikumarvemuri

@asingamaneni Can you close this issue? I will open a new one for the above features.

@asingamaneni

Closing this issue as we will create a new feature request from the above discussion. @phanikumarvemuri please tag this issue when you create a new feature, so that we will know the source for the feature creation.
