diff --git a/.gitignore b/.gitignore index 87b2439..cf53482 100644 --- a/.gitignore +++ b/.gitignore @@ -488,3 +488,4 @@ spark-warehouse/ .scannerwork/ .pipeline/* .scannerwork/* +.vscode/settings.json diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md index a029d4a..dc465a9 100644 --- a/CONTRIBUTORS.md +++ b/CONTRIBUTORS.md @@ -9,6 +9,7 @@ Thanks to the contributors who helped on this project apart from the authors * [Sarath Chandra Bandaru](https://www.linkedin.com/in/sarath-chandra-bandaru/) * [Holden Karau](https://www.linkedin.com/in/holdenkarau/) * [Araveti Venkata Bharat Kumar](https://www.linkedin.com/in/bharat-kumar-araveti/) +* [Samy Coenen](https://github.com/SamyCoenen) # Honorary Mentions Thanks to the team below for invaluable insights and support throughout the initial release of this project diff --git a/README.md b/README.md index 49e3969..33152ee 100644 --- a/README.md +++ b/README.md @@ -35,12 +35,12 @@ We're delighted that you're interested in contributing to our project! To get st please carefully read and follow the guidelines provided in our [contributing](https://github.com/Nike-Inc/spark-expectations/blob/main/CONTRIBUTING.md) document # What is Spark Expectations? -#### Spark Expectations is a Data quality framework built in Pyspark as a solution for the following problem statements: +#### Spark Expectations is a Data quality framework built in PySpark as a solution for the following problem statements: 1. The existing data quality tools validates the data in a table at rest and provides the success and error metrics. Users need to manually check the metrics to identify the error records 2. The error data is not quarantined to an error table or there are no corrective actions taken to send only the valid data to downstream 3. Users further downstream must consume the same data incorrectly, or they must perform additional calculations to eliminate records that don't comply with the data quality rules. -4. Another process is required as a corrective action to rectify the errors in the data and lot of planning is usually required for this acitivity +4. Another process is required as a corrective action to rectify the errors in the data and lot of planning is usually required for this activity #### Spark Expectations solves these issues using the following principles: diff --git a/docs/bigquery.md b/docs/bigquery.md index 0299bec..447a388 100644 --- a/docs/bigquery.md +++ b/docs/bigquery.md @@ -1,6 +1,6 @@ ### Example - Write to Delta -Setup SparkSession for bigquery to test in your local environment. Configure accordingly for higher environments. +Setup SparkSession for BigQuery to test in your local environment. Configure accordingly for higher environments. 
Refer to Examples in [base_setup.py](../spark_expectations/examples/base_setup.py) and [delta.py](../spark_expectations/examples/sample_dq_bigquery.py) @@ -22,7 +22,7 @@ spark.conf.set("viewsEnabled", "true") spark.conf.set("materializationDataset", "") ``` -Below is the configuration that can be used to run SparkExpectations and write to DeltaLake +Below is the configuration that can be used to run SparkExpectations and write to Delta Lake ```python title="iceberg_write" import os diff --git a/docs/configurations/adoption_versions_comparsion.md b/docs/configurations/adoption_versions_comparsion.md index a26d24d..b8ea728 100644 --- a/docs/configurations/adoption_versions_comparsion.md +++ b/docs/configurations/adoption_versions_comparsion.md @@ -6,13 +6,13 @@ Please find the difference in the changes with different version, latest three v | stage | 0.8.0 | 1.0.0 | |:----------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------| -| rules table schema changes | added additional two column
1.`enable_error_drop_alert(boolean)`
2.`error_drop_thresholdt(int)`

documentation found [here](https://engineering.nike.com/spark-expectations/0.8.1/getting-started/setup/) | Remains same | +| rules table schema changes | added two additional columns&#13;
1.`enable_error_drop_alert(boolean)`
2.`error_drop_threshold(int)`

documentation found [here](https://engineering.nike.com/spark-expectations/0.8.1/getting-started/setup/) | Remains same | | rule table creation required | yes - creation not required if you're upgrading from old version but schema changes required | yes - creation not required if you're upgrading from old version but schema changes required | | stats table schema changes | remains same | Remains Same | | stats table creation required | automated | Remains Same | | notification config setting | remains same | Remains Same | -| secret store and kafka authentication details | Create a dictionary that contains your secret configuration values and register in `__init__.py` for multiple usage, [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | Remains Same. You can disable streaming if needed, in SparkExpectations class | -| spark expectations initialisation | create spark expectations class object using `SpakrExpectations` by passing `product_id` and additional optional parameter `debugger`, `stats_streaming_options` [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) | +| secret store and Kafka authentication details | Create a dictionary that contains your secret configuration values and register in `__init__.py` for multiple usage, [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | Remains Same. You can disable streaming if needed, in SparkExpectations class | +| spark expectations initialization | create spark expectations class object using `SparkExpectations` by passing `product_id` and additional optional parameter `debugger`, `stats_streaming_options` [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) | | with_expectations decorator | remains same | New arguments are added. 
Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) | | WrappedDataFrameWriter | Doesn't exist | This is new and users need to provider the writer object to record the spark conf that need to be used while writing - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) | diff --git a/docs/configurations/configure_rules.md b/docs/configurations/configure_rules.md index 44a8676..0fd947d 100644 --- a/docs/configurations/configure_rules.md +++ b/docs/configurations/configure_rules.md @@ -10,37 +10,37 @@ To perform row data quality checks for artificially order table, please set up r insert into `catalog`.`schema`.`{product}_rules` (product_id, table_name, rule_type, rule, column_name, expectation, action_if_failed, tag, description, enable_for_source_dq_validation, enable_for_target_dq_validation, is_active) values ---The row data qulaity has been set on customer_id when customer_id is null, drop respective row into error table +--The row data quality has been set on customer_id when customer_id is null, drop respective row into error table --as "action_if_failed" tagged "drop" ('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'customer_id_is_not_null', 'customer_id', -'customer_id is not null','drop', 'validity', 'customer_id ishould not be null', false, false, true) +'customer_id is not null','drop', 'validity', 'customer_id should not be null', false, false, true) ---The row data qulaity has been set on sales when sales is less than zero, drop respective row into error table as +--The row data quality has been set on sales when sales is less than zero, drop respective row into error table as --'action_if_failed' tagged "drop" ,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'sales_greater_than_zero', 'sales', 'sales > 0', 'drop', 'accuracy', 'sales value should be greater than zero', false, false, true) ---The row data qulaity has been set on discount when discount is less than 60, drop respective row into error table +--The row data quality has been set on discount when discount is less than 60, drop respective row into error table --and final table as "action_if_failed" tagged 'ignore' ,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'discount_threshold', 'discount', 'discount*100 < 60', 'ignore', 'validity', 'discount should be less than 40', false, false, true) ---The row data qulaity has been set on ship_mode when ship_mode not in ("second class", "standard class", ---"standard class"), drop respective row into error table and fail the framewok as "action_if_failed" tagged "fail" +--The row data quality has been set on ship_mode when ship_mode not in ("second class", "standard class", +--"standard class"), drop respective row into error table and fail the framework as "action_if_failed" tagged "fail" ,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'ship_mode_in_set', 'ship_mode', 'lower(trim(ship_mode)) in('second class', 'standard class', 'standard class')', 'fail', 'validity', 'ship_mode mode belongs in the sets', false, false, true) ---The row data qulaity has been set on profit when profit is less than or equals to 0, drop respective row into +--The row data quality has been set on profit when profit is less than or equals to 0, drop respective row into --error table and final table as "action_if_failed" tagged "ignore" ,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'profit_threshold', 'profit', 'profit>0', 'ignore', -'validity', 'profit threshold should be 
greater tahn 0', false, false, true) +'validity', 'profit threshold should be greater than 0', false, false, true) --The rule has been established to identify and remove completely identical records in which rows repeat with the --same value more than once, while keeping one instance of the row. Any additional duplicated rows will be dropped --into error table as action_if_failed set to "drop" ,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'complete_duplicate', 'All', 'row_number() - over(partition by cutomer_id, order_id order by 1)=1', 'drop', 'uniqueness', 'drop complete duplicate records', + over(partition by customer_id, order_id order by 1)=1', 'drop', 'uniqueness', 'drop complete duplicate records', false, false, true) ``` @@ -62,9 +62,9 @@ action_if_failed, tag, description, enable_for_source_dq_validation, enable_fo ,('apla_nd', '`catalog`.`schema`.customer_order', 'agg_dq', 'distinct_of_ship_mode', 'ship_mode', 'count(distinct ship_mode)<=3', 'ignore', 'validity', 'regex format validation for quantity', true, false, true) --- The aggregation rule is established on the table countand the metadata of the rule will be captured in the ---statistics table when distinct count greater than 10000 and failes the job as "action_if_failed" set to "fail" ---and enabled only for validated datset +-- The aggregation rule is established on the table count and the metadata of the rule will be captured in the +--statistics table when distinct count greater than 10000 and fails the job as "action_if_failed" set to "fail" +--and enabled only for validated dataset ,('apla_nd', '`catalog`.`schema`..customer_order', 'agg_dq', 'row_count', '*', 'count(*)>=10000', 'fail', 'validity', 'distinct ship_mode must be less or equals to 3', false, true, true) @@ -76,21 +76,21 @@ Please set up rules for checking the quality of artificially order table by impl insert into `catalog`.`schema`.`{product}_rules` (product_id, table_name, rule_type, rule, column_name, expectation, action_if_failed, tag, description) values ---The query dq rule is established to check product_id differemce between two table if differnce is more than 20% +--The query dq rule is established to check product_id difference between two table if difference is more than 20% --from source table, the metadata of the rule will be captured in the statistics table as "action_if_failed" is "ignore" ,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'product_missing_count_threshold', '*', '((select count(distinct product_id) from product) - (select count(distinct product_id) from order))> -(select count(distinct product_id) from product)*0.2', 'ignore', 'validity', 'row count threshold difference msut +(select count(distinct product_id) from product)*0.2', 'ignore', 'validity', 'row count threshold difference must be less than 20%', true, true, true) ---The query dq rule is established to check distinct proudtc_id in the product table is less than 5, if not the +--The query dq rule is established to check distinct product_id in the product table is less than 5, if not the --metadata of the rule will be captured in the statistics table along with fails the job as "action_if_failed" is --"fail" and enabled for source dataset ,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'product_category', '*', '(select count(distinct category) from product) < 5', 'fail', 'validity', 'distinct product category must be less than 5', true, False, true) --The query dq rule is established to check count of the dataset should be less 
than 10000 other wise the metadata ---of the rule will be captured in the statistics table as "action_if_failed" is "ignore" and enabled only for target datset +--of the rule will be captured in the statistics table as "action_if_failed" is "ignore" and enabled only for target dataset ,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'row_count_in_order', '*', '(select count(*) from order)<10000', 'ignore', 'accuracy', 'count of the row in order dataset must be less then 10000', false, true, true) diff --git a/docs/configurations/databricks_setup_guide.md b/docs/configurations/databricks_setup_guide.md index 002c0e2..43b7ea2 100644 --- a/docs/configurations/databricks_setup_guide.md +++ b/docs/configurations/databricks_setup_guide.md @@ -4,7 +4,7 @@ This section provides instructions on how to set up a sample notebook in the Dat #### Prerequisite: -1. Recommended databricks run time environment for better experience - DBS 11.0 and above -3. Please install the kafka jar using the path `dbfs:/kafka-jars/databricks-shaded-strimzi-kafka-oauth-client-1.1.jar`, If the jar is not available in the dbfs location, please raise a ticket with GAP Support team to add the jar to your workspace -2. Please follow the steps provided [here](TODO) to integrate and clone repo from git databricks -4. Please follow the steps to create the wbhook-hook URL for team-specific channel [here](TODO) \ No newline at end of file +1. Recommended Databricks runtime environment for a better experience - DBS 11.0 and above +2. Please install the Kafka jar using the path `dbfs:/kafka-jars/databricks-shaded-strimzi-kafka-oauth-client-1.1.jar`. If the jar is not available in the dbfs location, please raise a ticket with the GAP Support team to add the jar to your workspace +3. Please follow the steps provided [here](TODO) to integrate and clone the repo from Git in Databricks +4. Please follow the steps to create the webhook URL for the team-specific channel [here](TODO) \ No newline at end of file diff --git a/docs/delta.md b/docs/delta.md index 0ff0488..e1a7c80 100644 --- a/docs/delta.md +++ b/docs/delta.md @@ -23,7 +23,7 @@ builder = ( spark = builder.getOrCreate() ``` -Below is the configuration that can be used to run SparkExpectations and write to DeltaLake +Below is the configuration that can be used to run SparkExpectations and write to Delta Lake ```python title="delta_write" import os diff --git a/docs/examples.md b/docs/examples.md index 41aae9c..2d0b1dc 100644 --- a/docs/examples.md +++ b/docs/examples.md @@ -27,13 +27,13 @@ se_user_conf = { 2. The `user_config.se_notifications_email_smtp_host` parameter is set to "mailhost.com" by default and is used to specify the email SMTP domain host 3. The `user_config.se_notifications_email_smtp_port` parameter, which accepts a port number, is set to "25" by default 4. The `user_config.se_notifications_email_from` parameter is used to specify the email ID that will trigger the email notification -5. The `user_configse_notifications_email_to_other_mail_id` parameter accepts a list of recipient email IDs +5. The `user_config.se_notifications_email_to_other_mail_id` parameter accepts a list of recipient email IDs 6. The `user_config.se_notifications_email_subject` parameter captures the subject line of the email 7. The `user_config.se_notifications_enable_slack` parameter, which controls whether notifications are sent via slack, is set to false by default 8. &#13;
The `user_config/se_notifications_slack_webhook_url` parameter accepts the webhook URL of a Slack channel for sending notifications 9. When `user_config.se_notifications_on_start` parameter set to `True` enables notification on start of the spark-expectations, variable by default set to `False` 10. When `user_config.se_notifications_on_completion` parameter set to `True` enables notification on completion of spark-expectations framework, variable by default set to `False` -11. When `user_config.se_notifications_on_fail` parameter set to `True` enables notification on failure of spark-expectations data qulaity framework, variable by default set to `True` +11. When `user_config.se_notifications_on_fail` parameter set to `True` enables notification on failure of spark-expectations data quality framework, variable by default set to `True` 12. When `user_config.se_notifications_on_error_drop_exceeds_threshold_breach` parameter set to `True` enables notification when error threshold reaches above the configured value 13. The `user_config.se_notifications_on_error_drop_threshold` parameter captures error drop threshold value @@ -41,7 +41,7 @@ se_user_conf = { For all the below examples the below import and SparkExpectations class instantiation is mandatory -When store for sensitive details is Databricks secret scope,construct config dictionary for authentication of kafka and +When store for sensitive details is Databricks secret scope,construct config dictionary for authentication of Kafka and avoid duplicate construction every time your project is initialized, you can create a dictionary with the following keys and their appropriate values. This dictionary can be placed in the __init__.py file of your project or declared as a global variable. ```python @@ -62,16 +62,16 @@ stats_streaming_config_dict: Dict[str, Union[bool, str]] = { ``` 1. The `user_config.se_enable_streaming` parameter is used to control the enabling or disabling of Spark Expectations (SE) streaming functionality. When enabled, SE streaming stores the statistics of every batch run into Kafka. -2. The `user_config.secret_type` used to define type of secret store and takes two values (`databricks`, `cererus`) by default will be `databricks` -3. The `user_config.dbx_workspace_url` used to pass databricks workspace in the format `https://.cloud.databricks.com` +2. The `user_config.secret_type` used to define type of secret store and takes two values (`databricks`, `cerberus`) by default will be `databricks` +3. The `user_config.dbx_workspace_url` used to pass Databricks workspace in the format `https://.cloud.databricks.com` 4. The `user_config.dbx_secret_scope` captures name of the secret scope -5. The `user_config.dbx_kafka_server_url` captures secret key for the kafka url -6. The ` user_config.dbx_secret_token_url` captures secret key for the kafka authentication app url -7. The `user_config.dbx_secret_app_name` captures secret key for the kafka authentication app name -8. The `user_config.dbx_secret_token` captures secret key for the kafka authentication app secret token -9. The `user_config.dbx_topic_name` captures secret key for the kafka topic name +5. The `user_config.dbx_kafka_server_url` captures secret key for the Kafka URL +6. The ` user_config.dbx_secret_token_url` captures secret key for the Kafka authentication app URL +7. The `user_config.dbx_secret_app_name` captures secret key for the Kafka authentication app name +8. 
The `user_config.dbx_secret_token` captures secret key for the Kafka authentication app secret token +9. The `user_config.dbx_topic_name` captures secret key for the Kafka topic name -Similarly when sensitive store is cerberus: +Similarly when sensitive store is Cerberus: ```python from typing import Dict, Union @@ -83,7 +83,7 @@ stats_streaming_config_dict: Dict[str, Union[bool, str]] = { user_config.cbs_url : "https://.cerberus.com", # (3)! user_config.cbs_sdb_path: "cerberus_sdb_path", # (4)! user_config.cbs_kafka_server_url: "se_streaming_server_url_secret_sdb_path", # (5)! - user_config.cbs_secret_token_url: "se_streaming_auth_secret_token_url_sdb_apth", # (6)! + user_config.cbs_secret_token_url: "se_streaming_auth_secret_token_url_sdb_path", # (6)! user_config.cbs_secret_app_name: "se_streaming_auth_secret_appid_sdb_path", # (7)! user_config.cbs_secret_token: "se_streaming_auth_secret_token_sdb_path", # (8)! user_config.cbs_topic_name: "se_streaming_topic_name_sdb_path", # (9)! @@ -91,14 +91,14 @@ stats_streaming_config_dict: Dict[str, Union[bool, str]] = { ``` 1. The `user_config.se_enable_streaming` parameter is used to control the enabling or disabling of Spark Expectations (SE) streaming functionality. When enabled, SE streaming stores the statistics of every batch run into Kafka. -2. The `user_config.secret_type` used to define type of secret store and takes two values (`databricks`, `cererus`) by default will be `databricks` -3. The `user_config.cbs_url` used to pass cerberus url -4. The `user_config.cbs_sdb_path` captures cerberus secure data store path -5. The `user_config.cbs_kafka_server_url` captures path where kafka url stored in the cerberus sdb -6. The ` user_config.cbs_secret_token_url` captures path where kafka authentication app stored in the cerberus sdb -7. The `user_config.cbs_secret_app_name` captures path where kafka authentication app name stored in the cerberus sdb -8. The `user_config.cbs_secret_token` captures path where kafka authentication app name secret token stored in the cerberus sdb -9. The `user_config.cbs_topic_name` captures path where kafka topic name stored in the cerberus sdb +2. The `user_config.secret_type` used to define type of secret store and takes two values (`databricks`, `cerberus`) by default will be `databricks` +3. The `user_config.cbs_url` used to pass Cerberus URL +4. The `user_config.cbs_sdb_path` captures Cerberus secure data store path +5. The `user_config.cbs_kafka_server_url` captures path where Kafka URL stored in the Cerberus sdb +6. The ` user_config.cbs_secret_token_url` captures path where Kafka authentication app stored in the Cerberus sdb +7. The `user_config.cbs_secret_app_name` captures path where Kafka authentication app name stored in the Cerberus sdb +8. The `user_config.cbs_secret_token` captures path where Kafka authentication app name secret token stored in the Cerberus sdb +9. The `user_config.cbs_topic_name` captures path where Kafka topic name stored in the Cerberus sdb You can disable the streaming functionality by setting the `user_config.se_enable_streaming` parameter to `False` @@ -183,7 +183,7 @@ def build_new() -> DataFrame: 6. The `target_table_view` parameter is used to provide the name of a view that represents the target validated dataset for implementation of `query_dq` on the clean dataset from `row_dq` 7. The `target_and_error_table_writer` parameter is used to write the final data into the table. By default, it is False. This is optional, if you just want to run the data quality checks. 
A good example will be a staging table or temporary view. 8. View registration can be utilized when implementing `query_dq` expectations. -9. Returning a dataframe is mandatory for the `spark_expectations` to work, if we do not return a dataframe - then an exceptionm will be raised +9. Returning a dataframe is mandatory for the `spark_expectations` to work, if we do not return a dataframe - then an exception will be raised 10. Instantiate `SparkExpectations` class which has all the required functions for running data quality rules 11. The `product_id` parameter is used to specify the product ID of the data quality rules. This has to be a unique value 12. The `rules_df` parameter is used to specify the dataframe that contains the data quality rules @@ -191,4 +191,4 @@ def build_new() -> DataFrame: 14. The `stats_table_writer` takes in the configuration that need to be used to write the stats table using pyspark 15. The `target_and_error_table_writer` takes in the configuration that need to be used to write the target and error table using pyspark 16. The `debugger` parameter is used to enable the debugger mode -17. The `stats_streaming_options` parameter is used to specify the configurations for streaming statistics into Kafka. To not use kafka, uncomment this. +17. The `stats_streaming_options` parameter is used to specify the configurations for streaming statistics into Kafka. To not use Kafka, uncomment this. diff --git a/docs/getting-started/setup.md b/docs/getting-started/setup.md index 51c02b5..e773b6d 100644 --- a/docs/getting-started/setup.md +++ b/docs/getting-started/setup.md @@ -1,6 +1,6 @@ ## Installation -The library is available in the pypi and can be installed in your environment using the below command or -please add the library "spark-expectations" into the requirements.txt or poetry dependencies. +The library is available in the Python Package Index (PyPi) and can be installed in your environment using the below command or + add the library "spark-expectations" into the requirements.txt or poetry dependencies. ```shell pip install -U spark-expectations @@ -9,7 +9,7 @@ pip install -U spark-expectations ## Required Tables There are two tables that need to be created for spark-expectations to run seamlessly and integrate with a spark job. -The below sql statements used three namespaces which works with Databricks UnityCatalog, but if you are using hive +The below SQL statements used three namespaces which works with Databricks Unity Catalog, but if you are using hive please update the namespaces accordingly and also provide necessary table metadata. @@ -40,8 +40,8 @@ create table if not exists `catalog`.`schema`.`{product}_rules` ( ### Rule Type For Rules The rules column has a column called "rule_type". It is important that this column should only accept one of -these three values - `[row_dq, agg_dq, query_dq]`. If other values are provided, the library may cause unforseen errors. -Please run the below command to add constriants to the above created rules table +these three values - `[row_dq, agg_dq, query_dq]`. If other values are provided, the library may cause unforeseen errors. +Please run the below command to add constraints to the above created rules table ```sql ALTER TABLE `catalog`.`schema`.`{product}_rules` @@ -52,8 +52,8 @@ ADD CONSTRAINT rule_type_action CHECK (rule_type in ('row_dq', 'agg_dq', 'query_ The rules column has a column called "action_if_failed". 
It is important that this column should only accept one of these values - `[fail, drop or ignore]` for `'rule_type'='row_dq'` and `[fail, ignore]` for `'rule_type'='agg_dq' and 'rule_type'='query_dq'`. -If other values are provided, the library may cause unforseen errors. -Please run the below command to add constriants to the above created rules table +If other values are provided, the library may cause unforeseen errors. +Please run the below command to add constraints to the above created rules table ```sql ALTER TABLE apla_nd_dq_rules ADD CONSTRAINT action CHECK @@ -65,7 +65,7 @@ ALTER TABLE apla_nd_dq_rules ADD CONSTRAINT action CHECK ### DQ Stats Table In order to collect the stats/metrics for each data quality job run, the spark-expectations job will -automatically create the stats table if it does not exist. The below sql statement can be used to create the table +automatically create the stats table if it does not exist. The below SQL statement can be used to create the table if you want to create it manually, but it is not recommended. ```sql diff --git a/docs/iceberg.md b/docs/iceberg.md index d9123aa..4628f69 100644 --- a/docs/iceberg.md +++ b/docs/iceberg.md @@ -29,7 +29,7 @@ builder = ( spark = builder.getOrCreate() ``` -Below is the configuration that can be used to run SparkExpectations and write to DeltaLake +Below is the configuration that can be used to run SparkExpectations and write to Iceberg ```python title="iceberg_write" import os diff --git a/docs/index.md b/docs/index.md index ea7fce0..9a59438 100644 --- a/docs/index.md +++ b/docs/index.md @@ -28,7 +28,7 @@ the data quality standards * Downstream users have to consume the same data with error, or they have to do additional computation to remove the records that doesn't meet the standards * Another process is required as a corrective action to rectify the errors in the data and lot of planning is usually -required for this acitivity +required for this activity `Spark-Expectations solves all of the above problems by following the below principles` @@ -53,10 +53,10 @@ what needs to be done if a rule fails * Let's consider a hypothetical scenario, where we have 100 columns and with 200 row level data quality rules, 10 aggregation data quality rules and 5 query data quality rules computed against. When the dq job is run, there are 10 rules that failed on a particular row and 4 aggregation rules fails- what determines if that row should end up in -final table or not? Below are the heirarchy of checks that happens? -* Among the row level 10 rules failed, if there is atleast one rule which has an _action_if_failed_ as _fail_ - +final table or not? Below is the hierarchy of checks that happens: +* Among the 10 row level rules failed, if there is at least one rule which has an _action_if_failed_ as _fail_ - then the job will be failed - * Among the 10 row level rules failed, if there is no rule that has an _action_if_failed_ as _fail_, but atleast + * Among the 10 row level rules failed, if there is no rule that has an _action_if_failed_ as _fail_, but at least has one rule with _action_if_failed_ as _drop_ - then the record/row will be dropped * Among the 10 row level rules failed, if no rule neither has _fail_ nor _drop_ as an _action_if_failed_ - then the record will be end up in the final table. Note that, this record would also exist in the `_error` table