Apply spell checks #67

Merged
merged 2 commits on Feb 7, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -488,3 +488,4 @@ spark-warehouse/
.scannerwork/
.pipeline/*
.scannerwork/*
.vscode/settings.json
1 change: 1 addition & 0 deletions CONTRIBUTORS.md
@@ -9,6 +9,7 @@ Thanks to the contributors who helped on this project apart from the authors
* [Sarath Chandra Bandaru](https://www.linkedin.com/in/sarath-chandra-bandaru/)
* [Holden Karau](https://www.linkedin.com/in/holdenkarau/)
* [Araveti Venkata Bharat Kumar](https://www.linkedin.com/in/bharat-kumar-araveti/)
* [Samy Coenen](https://github.com/SamyCoenen)

# Honorary Mentions
Thanks to the team below for invaluable insights and support throughout the initial release of this project
4 changes: 2 additions & 2 deletions README.md
@@ -35,12 +35,12 @@ We're delighted that you're interested in contributing to our project! To get st
please carefully read and follow the guidelines provided in our [contributing](https://github.com/Nike-Inc/spark-expectations/blob/main/CONTRIBUTING.md) document

# What is Spark Expectations?
#### Spark Expectations is a Data quality framework built in Pyspark as a solution for the following problem statements:
#### Spark Expectations is a Data quality framework built in PySpark as a solution for the following problem statements:

1. The existing data quality tools validates the data in a table at rest and provides the success and error metrics. Users need to manually check the metrics to identify the error records
2. The error data is not quarantined to an error table or there are no corrective actions taken to send only the valid data to downstream
3. Users further downstream must consume the same data incorrectly, or they must perform additional calculations to eliminate records that don't comply with the data quality rules.
4. Another process is required as a corrective action to rectify the errors in the data and lot of planning is usually required for this acitivity
4. Another process is required as a corrective action to rectify the errors in the data and lot of planning is usually required for this activity

#### Spark Expectations solves these issues using the following principles:

4 changes: 2 additions & 2 deletions docs/bigquery.md
@@ -1,6 +1,6 @@
### Example - Write to Delta

Setup SparkSession for bigquery to test in your local environment. Configure accordingly for higher environments.
Setup SparkSession for BigQuery to test in your local environment. Configure accordingly for higher environments.
Refer to Examples in [base_setup.py](../spark_expectations/examples/base_setup.py) and
[delta.py](../spark_expectations/examples/sample_dq_bigquery.py)

@@ -22,7 +22,7 @@ spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "<temp_dataset>")
```
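
For readers skimming this diff, a minimal local BigQuery-oriented SparkSession along the lines the doc describes might look like the sketch below. The connector coordinates and app name are illustrative assumptions, not taken from this PR or the repository.

```python
# Hedged sketch: a local SparkSession for BigQuery testing; the connector version is an assumption.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-expectations-bigquery-local")  # hypothetical app name
    # Assumed spark-bigquery connector coordinates; match the version to your Spark/Scala build.
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2",
    )
    .getOrCreate()
)

# Route query results through a scratch dataset, as in the snippet above.
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "<temp_dataset>")
```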

Below is the configuration that can be used to run SparkExpectations and write to DeltaLake
Below is the configuration that can be used to run SparkExpectations and write to Delta Lake

```python title="iceberg_write"
import os
6 changes: 3 additions & 3 deletions docs/configurations/adoption_versions_comparsion.md
@@ -6,13 +6,13 @@ Please find the difference in the changes with different version, latest three v

| stage | 0.8.0 | 1.0.0 |
|:----------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
| rules table schema changes | added additional two column <br> 1.`enable_error_drop_alert(boolean)` <br> 2.`error_drop_thresholdt(int)` <br><br> documentation found [here](https://engineering.nike.com/spark-expectations/0.8.1/getting-started/setup/) | Remains same |
| rules table schema changes | added additional two column <br> 1.`enable_error_drop_alert(boolean)` <br> 2.`error_drop_threshold(int)` <br><br> documentation found [here](https://engineering.nike.com/spark-expectations/0.8.1/getting-started/setup/) | Remains same |
| rule table creation required | yes - creation not required if you're upgrading from old version but schema changes required | yes - creation not required if you're upgrading from old version but schema changes required |
| stats table schema changes | remains same | Remains Same |
| stats table creation required | automated | Remains Same |
| notification config setting | remains same | Remains Same |
| secret store and kafka authentication details | Create a dictionary that contains your secret configuration values and register in `__init__.py` for multiple usage, [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | Remains Same. You can disable streaming if needed, in SparkExpectations class |
| spark expectations initialisation | create spark expectations class object using `SpakrExpectations` by passing `product_id` and additional optional parameter `debugger`, `stats_streaming_options` [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
| secret store and Kafka authentication details | Create a dictionary that contains your secret configuration values and register in `__init__.py` for multiple usage, [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | Remains Same. You can disable streaming if needed, in SparkExpectations class |
| spark expectations initialization | create spark expectations class object using `SparkExpectations` by passing `product_id` and additional optional parameter `debugger`, `stats_streaming_options` [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
| with_expectations decorator | remains same | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
| WrappedDataFrameWriter | Doesn't exist | This is new and users need to provider the writer object to record the spark conf that need to be used while writing - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
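
To make the 1.0.0 column of the table above concrete, here is a rough initialization sketch. The import path and argument names follow the 1.x examples linked in the table as best understood and should be treated as assumptions; the linked examples remain the authoritative reference.

```python
# Hedged sketch of 1.0.0-style setup; argument names are assumed from the linked examples.
from pyspark.sql import SparkSession
from spark_expectations.core.expectations import SparkExpectations, WrappedDataFrameWriter

spark = SparkSession.builder.getOrCreate()

# WrappedDataFrameWriter records the Spark writer options to use for target/error/stats tables.
writer = WrappedDataFrameWriter().mode("append").format("delta")

se = SparkExpectations(
    product_id="your_product",                              # same role as in 0.8.0
    rules_df=spark.table("catalog.schema.product_rules"),   # hypothetical rules table
    stats_table="catalog.schema.dq_stats",                  # hypothetical stats table
    stats_table_writer=writer,
    target_and_error_table_writer=writer,
    debugger=False,
)

# The decorator also gained new arguments in 1.0.0; the names below are likewise assumptions.
@se.with_expectations(
    target_table="catalog.schema.customer_order_clean",
    write_to_table=True,
)
def build_customer_order():
    return spark.table("catalog.schema.customer_order_raw")
```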

32 changes: 16 additions & 16 deletions docs/configurations/configure_rules.md
@@ -10,37 +10,37 @@ To perform row data quality checks for artificially order table, please set up r
insert into `catalog`.`schema`.`{product}_rules` (product_id, table_name, rule_type, rule, column_name, expectation,
action_if_failed, tag, description, enable_for_source_dq_validation, enable_for_target_dq_validation, is_active) values

--The row data qulaity has been set on customer_id when customer_id is null, drop respective row into error table
--The row data quality has been set on customer_id when customer_id is null, drop respective row into error table
--as "action_if_failed" tagged "drop"
('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'customer_id_is_not_null', 'customer_id',
'customer_id is not null','drop', 'validity', 'customer_id ishould not be null', false, false, true)
'customer_id is not null','drop', 'validity', 'customer_id should not be null', false, false, true)

--The row data qulaity has been set on sales when sales is less than zero, drop respective row into error table as
--The row data quality has been set on sales when sales is less than zero, drop respective row into error table as
--'action_if_failed' tagged "drop"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'sales_greater_than_zero', 'sales', 'sales > 0',
'drop', 'accuracy', 'sales value should be greater than zero', false, false, true)

--The row data qulaity has been set on discount when discount is less than 60, drop respective row into error table
--The row data quality has been set on discount when discount is less than 60, drop respective row into error table
--and final table as "action_if_failed" tagged 'ignore'
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'discount_threshold', 'discount', 'discount*100 < 60',
'ignore', 'validity', 'discount should be less than 40', false, false, true)

--The row data qulaity has been set on ship_mode when ship_mode not in ("second class", "standard class",
--"standard class"), drop respective row into error table and fail the framewok as "action_if_failed" tagged "fail"
--The row data quality has been set on ship_mode when ship_mode not in ("second class", "standard class",
--"standard class"), drop respective row into error table and fail the framework as "action_if_failed" tagged "fail"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'ship_mode_in_set', 'ship_mode', 'lower(trim(ship_mode))
in('second class', 'standard class', 'standard class')', 'fail', 'validity', 'ship_mode mode belongs in the sets',
false, false, true)

--The row data qulaity has been set on profit when profit is less than or equals to 0, drop respective row into
--The row data quality has been set on profit when profit is less than or equals to 0, drop respective row into
--error table and final table as "action_if_failed" tagged "ignore"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'profit_threshold', 'profit', 'profit>0', 'ignore',
'validity', 'profit threshold should be greater tahn 0', false, false, true)
'validity', 'profit threshold should be greater than 0', false, false, true)

--The rule has been established to identify and remove completely identical records in which rows repeat with the
--same value more than once, while keeping one instance of the row. Any additional duplicated rows will be dropped
--into error table as action_if_failed set to "drop"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'complete_duplicate', 'All', 'row_number()
over(partition by cutomer_id, order_id order by 1)=1', 'drop', 'uniqueness', 'drop complete duplicate records',
over(partition by customer_id, order_id order by 1)=1', 'drop', 'uniqueness', 'drop complete duplicate records',
false, false, true)

```
@@ -62,9 +62,9 @@ action_if_failed, tag, description, enable_for_source_dq_validation, enable_fo
,('apla_nd', '`catalog`.`schema`.customer_order', 'agg_dq', 'distinct_of_ship_mode', 'ship_mode',
'count(distinct ship_mode)<=3', 'ignore', 'validity', 'regex format validation for quantity', true, false, true)

-- The aggregation rule is established on the table countand the metadata of the rule will be captured in the
--statistics table when distinct count greater than 10000 and failes the job as "action_if_failed" set to "fail"
--and enabled only for validated datset
-- The aggregation rule is established on the table count and the metadata of the rule will be captured in the
--statistics table when distinct count greater than 10000 and fails the job as "action_if_failed" set to "fail"
--and enabled only for validated dataset
,('apla_nd', '`catalog`.`schema`..customer_order', 'agg_dq', 'row_count', '*', 'count(*)>=10000', 'fail', 'validity',
'distinct ship_mode must be less or equals to 3', false, true, true)

@@ -76,21 +76,21 @@ Please set up rules for checking the quality of artificially order table by impl
insert into `catalog`.`schema`.`{product}_rules` (product_id, table_name, rule_type, rule, column_name, expectation,
action_if_failed, tag, description) values

--The query dq rule is established to check product_id differemce between two table if differnce is more than 20%
--The query dq rule is established to check product_id difference between two table if difference is more than 20%
--from source table, the metadata of the rule will be captured in the statistics table as "action_if_failed" is "ignore"
,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'product_missing_count_threshold', '*',
'((select count(distinct product_id) from product) - (select count(distinct product_id) from order))>
(select count(distinct product_id) from product)*0.2', 'ignore', 'validity', 'row count threshold difference msut
(select count(distinct product_id) from product)*0.2', 'ignore', 'validity', 'row count threshold difference must
be less than 20%', true, true, true)

--The query dq rule is established to check distinct proudtc_id in the product table is less than 5, if not the
--The query dq rule is established to check distinct product_id in the product table is less than 5, if not the
--metadata of the rule will be captured in the statistics table along with fails the job as "action_if_failed" is
--"fail" and enabled for source dataset
,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'product_category', '*', '(select count(distinct category)
from product) < 5', 'fail', 'validity', 'distinct product category must be less than 5', true, False, true)

--The query dq rule is established to check count of the dataset should be less than 10000 other wise the metadata
--of the rule will be captured in the statistics table as "action_if_failed" is "ignore" and enabled only for target datset
--of the rule will be captured in the statistics table as "action_if_failed" is "ignore" and enabled only for target dataset
,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'row_count_in_order', '*',
'(select count(*) from order)<10000', 'ignore', 'accuracy', 'count of the row in order dataset must be less then 10000',
false, true, true)
8 changes: 4 additions & 4 deletions docs/configurations/databricks_setup_guide.md
@@ -4,7 +4,7 @@ This section provides instructions on how to set up a sample notebook in the Dat

#### Prerequisite:

1. Recommended databricks run time environment for better experience - DBS 11.0 and above
3. Please install the kafka jar using the path `dbfs:/kafka-jars/databricks-shaded-strimzi-kafka-oauth-client-1.1.jar`, If the jar is not available in the dbfs location, please raise a ticket with GAP Support team to add the jar to your workspace
2. Please follow the steps provided [here](TODO) to integrate and clone repo from git databricks
4. Please follow the steps to create the wbhook-hook URL for team-specific channel [here](TODO)
1. Recommended Databricks run time environment for better experience - DBS 11.0 and above
3. Please install the Kafka jar using the path `dbfs:/kafka-jars/databricks-shaded-strimzi-kafka-oauth-client-1.1.jar`, If the jar is not available in the dbfs location, please raise a ticket with GAP Support team to add the jar to your workspace
2. Please follow the steps provided [here](TODO) to integrate and clone repo from git Databricks
4. Please follow the steps to create the webhook-hook URL for team-specific channel [here](TODO)
2 changes: 1 addition & 1 deletion docs/delta.md
@@ -23,7 +23,7 @@ builder = (
spark = builder.getOrCreate()
```
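
For readers without the full file, a Delta-enabled builder of roughly this shape (using the standard delta-spark helper; the app name is a placeholder) is one way to reproduce the setup locally:

```python
# Sketch of a local, Delta-enabled SparkSession built with the delta-spark helper.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("spark-expectations-delta-local")  # hypothetical app name
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# configure_spark_with_delta_pip adds the matching Delta Lake jars before the session starts.
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```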

Below is the configuration that can be used to run SparkExpectations and write to DeltaLake
Below is the configuration that can be used to run SparkExpectations and write to Delta Lake

```python title="delta_write"
import os