Apply spell checks #67

Merged
merged 2 commits on Feb 7, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -488,3 +488,4 @@ spark-warehouse/
.scannerwork/
.pipeline/*
.scannerwork/*
.vscode/settings.json
1 change: 1 addition & 0 deletions CONTRIBUTORS.md
@@ -9,6 +9,7 @@ Thanks to the contributors who helped on this project apart from the authors
* [Sarath Chandra Bandaru](https://www.linkedin.com/in/sarath-chandra-bandaru/)
* [Holden Karau](https://www.linkedin.com/in/holdenkarau/)
* [Araveti Venkata Bharat Kumar](https://www.linkedin.com/in/bharat-kumar-araveti/)
* [Samy Coenen](https://github.com/SamyCoenen)

# Honorary Mentions
Thanks to the team below for invaluable insights and support throughout the initial release of this project
4 changes: 2 additions & 2 deletions README.md
@@ -35,12 +35,12 @@ We're delighted that you're interested in contributing to our project! To get st
please carefully read and follow the guidelines provided in our [contributing](https://github.com/Nike-Inc/spark-expectations/blob/main/CONTRIBUTING.md) document

# What is Spark Expectations?
#### Spark Expectations is a Data quality framework built in Pyspark as a solution for the following problem statements:
#### Spark Expectations is a Data quality framework built in PySpark as a solution for the following problem statements:

1. The existing data quality tools validates the data in a table at rest and provides the success and error metrics. Users need to manually check the metrics to identify the error records
2. The error data is not quarantined to an error table or there are no corrective actions taken to send only the valid data to downstream
3. Users further downstream must consume the same data incorrectly, or they must perform additional calculations to eliminate records that don't comply with the data quality rules.
4. Another process is required as a corrective action to rectify the errors in the data and lot of planning is usually required for this acitivity
4. Another process is required as a corrective action to rectify the errors in the data and lot of planning is usually required for this activity

#### Spark Expectations solves these issues using the following principles:

4 changes: 2 additions & 2 deletions docs/bigquery.md
@@ -1,6 +1,6 @@
### Example - Write to Delta

Setup SparkSession for bigquery to test in your local environment. Configure accordingly for higher environments.
Setup SparkSession for BigQuery to test in your local environment. Configure accordingly for higher environments.
Refer to Examples in [base_setup.py](../spark_expectations/examples/base_setup.py) and
[delta.py](../spark_expectations/examples/sample_dq_bigquery.py)

@@ -22,7 +22,7 @@ spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "<temp_dataset>")
```
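
For readers skimming this diff, a minimal local BigQuery-oriented SparkSession along the lines the doc describes might look like the sketch below. The connector coordinates and app name are illustrative assumptions, not taken from this PR or the repository.

```python
# Hedged sketch: a local SparkSession for BigQuery testing; the connector version is an assumption.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-expectations-bigquery-local")  # hypothetical app name
    # Assumed spark-bigquery connector coordinates; match the version to your Spark/Scala build.
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2",
    )
    .getOrCreate()
)

# Route query results through a scratch dataset, as in the snippet above.
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "<temp_dataset>")
```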

Below is the configuration that can be used to run SparkExpectations and write to DeltaLake
Below is the configuration that can be used to run SparkExpectations and write to Delta Lake

```python title="iceberg_write"
import os
6 changes: 3 additions & 3 deletions docs/configurations/adoption_versions_comparsion.md
@@ -6,13 +6,13 @@ Please find the difference in the changes with different version, latest three v

| stage | 0.8.0 | 1.0.0 |
|:----------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
| rules table schema changes | added additional two column <br> 1.`enable_error_drop_alert(boolean)` <br> 2.`error_drop_thresholdt(int)` <br><br> documentation found [here](https://engineering.nike.com/spark-expectations/0.8.1/getting-started/setup/) | Remains same |
| rules table schema changes | added additional two column <br> 1.`enable_error_drop_alert(boolean)` <br> 2.`error_drop_threshold(int)` <br><br> documentation found [here](https://engineering.nike.com/spark-expectations/0.8.1/getting-started/setup/) | Remains same |
| rule table creation required | yes - creation not required if you're upgrading from old version but schema changes required | yes - creation not required if you're upgrading from old version but schema changes required |
| stats table schema changes | remains same | Remains Same |
| stats table creation required | automated | Remains Same |
| notification config setting | remains same | Remains Same |
| secret store and kafka authentication details | Create a dictionary that contains your secret configuration values and register in `__init__.py` for multiple usage, [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | Remains Same. You can disable streaming if needed, in SparkExpectations class |
| spark expectations initialisation | create spark expectations class object using `SpakrExpectations` by passing `product_id` and additional optional parameter `debugger`, `stats_streaming_options` [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
| secret store and Kafka authentication details | Create a dictionary that contains your secret configuration values and register in `__init__.py` for multiple usage, [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | Remains Same. You can disable streaming if needed, in SparkExpectations class |
| spark expectations initialization | create spark expectations class object using `SparkExpectations` by passing `product_id` and additional optional parameter `debugger`, `stats_streaming_options` [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
| with_expectations decorator | remains same | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
| WrappedDataFrameWriter | Doesn't exist | This is new and users need to provider the writer object to record the spark conf that need to be used while writing - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
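
To make the 1.0.0 column of the table above concrete, here is a rough initialization sketch. The import path and argument names follow the 1.x examples linked in the table as best understood and should be treated as assumptions; the linked examples remain the authoritative reference.

```python
# Hedged sketch of 1.0.0-style setup; argument names are assumed from the linked examples.
from pyspark.sql import SparkSession
from spark_expectations.core.expectations import SparkExpectations, WrappedDataFrameWriter

spark = SparkSession.builder.getOrCreate()

# WrappedDataFrameWriter records the Spark writer options to use for target/error/stats tables.
writer = WrappedDataFrameWriter().mode("append").format("delta")

se = SparkExpectations(
    product_id="your_product",                              # same role as in 0.8.0
    rules_df=spark.table("catalog.schema.product_rules"),   # hypothetical rules table
    stats_table="catalog.schema.dq_stats",                  # hypothetical stats table
    stats_table_writer=writer,
    target_and_error_table_writer=writer,
    debugger=False,
)

# The decorator also gained new arguments in 1.0.0; the names below are likewise assumptions.
@se.with_expectations(
    target_table="catalog.schema.customer_order_clean",
    write_to_table=True,
)
def build_customer_order():
    return spark.table("catalog.schema.customer_order_raw")
```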

32 changes: 16 additions & 16 deletions docs/configurations/configure_rules.md
@@ -10,37 +10,37 @@ To perform row data quality checks for artificially order table, please set up r
insert into `catalog`.`schema`.`{product}_rules` (product_id, table_name, rule_type, rule, column_name, expectation,
action_if_failed, tag, description, enable_for_source_dq_validation, enable_for_target_dq_validation, is_active) values

--The row data qulaity has been set on customer_id when customer_id is null, drop respective row into error table
--The row data quality has been set on customer_id when customer_id is null, drop respective row into error table
--as "action_if_failed" tagged "drop"
('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'customer_id_is_not_null', 'customer_id',
'customer_id is not null','drop', 'validity', 'customer_id ishould not be null', false, false, true)
'customer_id is not null','drop', 'validity', 'customer_id should not be null', false, false, true)

--The row data qulaity has been set on sales when sales is less than zero, drop respective row into error table as
--The row data quality has been set on sales when sales is less than zero, drop respective row into error table as
--'action_if_failed' tagged "drop"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'sales_greater_than_zero', 'sales', 'sales > 0',
'drop', 'accuracy', 'sales value should be greater than zero', false, false, true)

--The row data qulaity has been set on discount when discount is less than 60, drop respective row into error table
--The row data quality has been set on discount when discount is less than 60, drop respective row into error table
--and final table as "action_if_failed" tagged 'ignore'
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'discount_threshold', 'discount', 'discount*100 < 60',
'ignore', 'validity', 'discount should be less than 40', false, false, true)

--The row data qulaity has been set on ship_mode when ship_mode not in ("second class", "standard class",
--"standard class"), drop respective row into error table and fail the framewok as "action_if_failed" tagged "fail"
--The row data quality has been set on ship_mode when ship_mode not in ("second class", "standard class",
--"standard class"), drop respective row into error table and fail the framework as "action_if_failed" tagged "fail"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'ship_mode_in_set', 'ship_mode', 'lower(trim(ship_mode))
in('second class', 'standard class', 'standard class')', 'fail', 'validity', 'ship_mode mode belongs in the sets',
false, false, true)

--The row data qulaity has been set on profit when profit is less than or equals to 0, drop respective row into
--The row data quality has been set on profit when profit is less than or equals to 0, drop respective row into
--error table and final table as "action_if_failed" tagged "ignore"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'profit_threshold', 'profit', 'profit>0', 'ignore',
'validity', 'profit threshold should be greater tahn 0', false, false, true)
'validity', 'profit threshold should be greater than 0', false, false, true)

--The rule has been established to identify and remove completely identical records in which rows repeat with the
--same value more than once, while keeping one instance of the row. Any additional duplicated rows will be dropped
--into error table as action_if_failed set to "drop"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'complete_duplicate', 'All', 'row_number()
over(partition by cutomer_id, order_id order by 1)=1', 'drop', 'uniqueness', 'drop complete duplicate records',
over(partition by customer_id, order_id order by 1)=1', 'drop', 'uniqueness', 'drop complete duplicate records',
false, false, true)

```
@@ -62,9 +62,9 @@ action_if_failed, tag, description, enable_for_source_dq_validation, enable_fo
,('apla_nd', '`catalog`.`schema`.customer_order', 'agg_dq', 'distinct_of_ship_mode', 'ship_mode',
'count(distinct ship_mode)<=3', 'ignore', 'validity', 'regex format validation for quantity', true, false, true)

-- The aggregation rule is established on the table countand the metadata of the rule will be captured in the
--statistics table when distinct count greater than 10000 and failes the job as "action_if_failed" set to "fail"
--and enabled only for validated datset
-- The aggregation rule is established on the table count and the metadata of the rule will be captured in the
--statistics table when distinct count greater than 10000 and fails the job as "action_if_failed" set to "fail"
--and enabled only for validated dataset
,('apla_nd', '`catalog`.`schema`..customer_order', 'agg_dq', 'row_count', '*', 'count(*)>=10000', 'fail', 'validity',
'distinct ship_mode must be less or equals to 3', false, true, true)

@@ -76,21 +76,21 @@ Please set up rules for checking the quality of artificially order table by impl
insert into `catalog`.`schema`.`{product}_rules` (product_id, table_name, rule_type, rule, column_name, expectation,
action_if_failed, tag, description) values

--The query dq rule is established to check product_id differemce between two table if differnce is more than 20%
--The query dq rule is established to check product_id difference between two table if difference is more than 20%
--from source table, the metadata of the rule will be captured in the statistics table as "action_if_failed" is "ignore"
,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'product_missing_count_threshold', '*',
'((select count(distinct product_id) from product) - (select count(distinct product_id) from order))>
(select count(distinct product_id) from product)*0.2', 'ignore', 'validity', 'row count threshold difference msut
(select count(distinct product_id) from product)*0.2', 'ignore', 'validity', 'row count threshold difference must
be less than 20%', true, true, true)

--The query dq rule is established to check distinct proudtc_id in the product table is less than 5, if not the
--The query dq rule is established to check distinct product_id in the product table is less than 5, if not the
--metadata of the rule will be captured in the statistics table along with fails the job as "action_if_failed" is
--"fail" and enabled for source dataset
,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'product_category', '*', '(select count(distinct category)
from product) < 5', 'fail', 'validity', 'distinct product category must be less than 5', true, False, true)

--The query dq rule is established to check count of the dataset should be less than 10000 other wise the metadata
--of the rule will be captured in the statistics table as "action_if_failed" is "ignore" and enabled only for target datset
--of the rule will be captured in the statistics table as "action_if_failed" is "ignore" and enabled only for target dataset
,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'row_count_in_order', '*',
'(select count(*) from order)<10000', 'ignore', 'accuracy', 'count of the row in order dataset must be less then 10000',
false, true, true)
8 changes: 4 additions & 4 deletions docs/configurations/databricks_setup_guide.md
@@ -4,7 +4,7 @@ This section provides instructions on how to set up a sample notebook in the Dat

#### Prerequisite:

1. Recommended databricks run time environment for better experience - DBS 11.0 and above
3. Please install the kafka jar using the path `dbfs:/kafka-jars/databricks-shaded-strimzi-kafka-oauth-client-1.1.jar`, If the jar is not available in the dbfs location, please raise a ticket with GAP Support team to add the jar to your workspace
2. Please follow the steps provided [here](TODO) to integrate and clone repo from git databricks
4. Please follow the steps to create the wbhook-hook URL for team-specific channel [here](TODO)
1. Recommended Databricks run time environment for better experience - DBS 11.0 and above
3. Please install the Kafka jar using the path `dbfs:/kafka-jars/databricks-shaded-strimzi-kafka-oauth-client-1.1.jar`, If the jar is not available in the dbfs location, please raise a ticket with GAP Support team to add the jar to your workspace
2. Please follow the steps provided [here](TODO) to integrate and clone repo from git Databricks
4. Please follow the steps to create the webhook-hook URL for team-specific channel [here](TODO)
2 changes: 1 addition & 1 deletion docs/delta.md
@@ -23,7 +23,7 @@ builder = (
spark = builder.getOrCreate()
```
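
For readers without the full file, a Delta-enabled builder of roughly this shape (using the standard delta-spark helper; the app name is a placeholder) is one way to reproduce the setup locally:

```python
# Sketch of a local, Delta-enabled SparkSession built with the delta-spark helper.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("spark-expectations-delta-local")  # hypothetical app name
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# configure_spark_with_delta_pip adds the matching Delta Lake jars before the session starts.
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```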

Below is the configuration that can be used to run SparkExpectations and write to DeltaLake
Below is the configuration that can be used to run SparkExpectations and write to Delta Lake

```python title="delta_write"
import os