Apply spell checks #67

Merged
merged 2 commits on Feb 7, 2024
Changes from 1 commit
Apply spell checks
Samy Coenen committed Jan 17, 2024
commit e00e5479142ae740d56ea56fb3a6617c7d633d30
1 change: 1 addition & 0 deletions .gitignore
@@ -488,3 +488,4 @@ spark-warehouse/
.scannerwork/
.pipeline/*
.scannerwork/*
.vscode/settings.json
4 changes: 2 additions & 2 deletions README.md
@@ -35,12 +35,12 @@ We're delighted that you're interested in contributing to our project! To get st
please carefully read and follow the guidelines provided in our [contributing](https://github.com/Nike-Inc/spark-expectations/blob/main/CONTRIBUTING.md) document

# What is Spark Expectations?
#### Spark Expectations is a Data quality framework built in Pyspark as a solution for the following problem statements:
#### Spark Expectations is a Data quality framework built in PySpark as a solution for the following problem statements:

1. The existing data quality tools validates the data in a table at rest and provides the success and error metrics. Users need to manually check the metrics to identify the error records
2. The error data is not quarantined to an error table or there are no corrective actions taken to send only the valid data to downstream
3. Users further downstream must consume the same data incorrectly, or they must perform additional calculations to eliminate records that don't comply with the data quality rules.
4. Another process is required as a corrective action to rectify the errors in the data and lot of planning is usually required for this acitivity
4. Another process is required as a corrective action to rectify the errors in the data and lot of planning is usually required for this activity

#### Spark Expectations solves these issues using the following principles:

4 changes: 2 additions & 2 deletions docs/bigquery.md
@@ -1,6 +1,6 @@
### Example - Write to Delta

Setup SparkSession for bigquery to test in your local environment. Configure accordingly for higher environments.
Setup SparkSession for BigQuery to test in your local environment. Configure accordingly for higher environments.
Refer to Examples in [base_setup.py](../spark_expectations/examples/base_setup.py) and
[delta.py](../spark_expectations/examples/sample_dq_bigquery.py)

@@ -22,7 +22,7 @@ spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "<temp_dataset>")
```
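For local testing, a session along these lines is usually enough. This is only a sketch: the connector coordinates and version are assumptions (not taken from this repo), and `<temp_dataset>` is a placeholder you must replace.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-expectations-bigquery-local")
    # Pull the open-source BigQuery connector at startup; adjust the
    # artifact version to match your Spark/Scala build.
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2",
    )
    .getOrCreate()
)

# Required by the connector when materializing query results and views.
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "<temp_dataset>")
```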

Below is the configuration that can be used to run SparkExpectations and write to DeltaLake
Below is the configuration that can be used to run SparkExpectations and write to Delta Lake

```python title="iceberg_write"
import os
6 changes: 3 additions & 3 deletions docs/configurations/adoption_versions_comparsion.md
@@ -6,13 +6,13 @@ Please find the difference in the changes with different version, latest three v

| stage | 0.8.0 | 1.0.0 |
|:----------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
| rules table schema changes | added additional two column <br> 1.`enable_error_drop_alert(boolean)` <br> 2.`error_drop_thresholdt(int)` <br><br> documentation found [here](https://engineering.nike.com/spark-expectations/0.8.1/getting-started/setup/) | Remains same |
| rules table schema changes | added additional two column <br> 1.`enable_error_drop_alert(boolean)` <br> 2.`error_drop_threshold(int)` <br><br> documentation found [here](https://engineering.nike.com/spark-expectations/0.8.1/getting-started/setup/) | Remains same |
| rule table creation required | yes - creation not required if you're upgrading from old version but schema changes required | yes - creation not required if you're upgrading from old version but schema changes required |
| stats table schema changes | remains same | Remains Same |
| stats table creation required | automated | Remains Same |
| notification config setting | remains same | Remains Same |
| secret store and kafka authentication details | Create a dictionary that contains your secret configuration values and register in `__init__.py` for multiple usage, [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | Remains Same. You can disable streaming if needed, in SparkExpectations class |
| spark expectations initialisation | create spark expectations class object using `SpakrExpectations` by passing `product_id` and additional optional parameter `debugger`, `stats_streaming_options` [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
| secret store and Kafka authentication details | Create a dictionary that contains your secret configuration values and register in `__init__.py` for multiple usage, [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | Remains Same. You can disable streaming if needed, in SparkExpectations class |
| spark expectations initialization | create spark expectations class object using `SparkExpectations` by passing `product_id` and additional optional parameter `debugger`, `stats_streaming_options` [example](https://engineering.nike.com/spark-expectations/0.8.1/examples/) | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
| with_expectations decorator | remains same | New arguments are added. Please follow this - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
| WrappedDataFrameWriter | Doesn't exist | This is new and users need to provider the writer object to record the spark conf that need to be used while writing - [example](https://engineering.nike.com/spark-expectations/1.0.0/examples/) |
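To make the last two rows of the comparison concrete, a 1.0.0-style initialization looks roughly like the sketch below. The table names are placeholders, and the exact argument list should be verified against the 1.0.0 examples linked in the table.

```python
from pyspark.sql import SparkSession
from spark_expectations.core.expectations import SparkExpectations, WrappedDataFrameWriter

spark = SparkSession.builder.getOrCreate()

# WrappedDataFrameWriter records the writer options (mode, format, ...) that
# spark-expectations should use when writing the target, error and stats tables.
writer = WrappedDataFrameWriter().mode("append").format("delta")

se = SparkExpectations(
    product_id="your_product",                        # unique product identifier
    rules_df=spark.table("dq_spark_local.dq_rules"),  # placeholder rules table
    stats_table="dq_spark_local.dq_stats",            # placeholder stats table
    stats_table_writer=writer,
    target_and_error_table_writer=writer,
    debugger=False,
)
```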

32 changes: 16 additions & 16 deletions docs/configurations/configure_rules.md
@@ -10,37 +10,37 @@ To perform row data quality checks for artificially order table, please set up r
insert into `catalog`.`schema`.`{product}_rules` (product_id, table_name, rule_type, rule, column_name, expectation,
action_if_failed, tag, description, enable_for_source_dq_validation, enable_for_target_dq_validation, is_active) values

--The row data qulaity has been set on customer_id when customer_id is null, drop respective row into error table
--The row data quality has been set on customer_id when customer_id is null, drop respective row into error table
--as "action_if_failed" tagged "drop"
('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'customer_id_is_not_null', 'customer_id',
'customer_id is not null','drop', 'validity', 'customer_id ishould not be null', false, false, true)
'customer_id is not null','drop', 'validity', 'customer_id should not be null', false, false, true)

--The row data qulaity has been set on sales when sales is less than zero, drop respective row into error table as
--The row data quality has been set on sales when sales is less than zero, drop respective row into error table as
--'action_if_failed' tagged "drop"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'sales_greater_than_zero', 'sales', 'sales > 0',
'drop', 'accuracy', 'sales value should be greater than zero', false, false, true)

--The row data qulaity has been set on discount when discount is less than 60, drop respective row into error table
--The row data quality has been set on discount when discount is less than 60, drop respective row into error table
--and final table as "action_if_failed" tagged 'ignore'
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'discount_threshold', 'discount', 'discount*100 < 60',
'ignore', 'validity', 'discount should be less than 40', false, false, true)

--The row data qulaity has been set on ship_mode when ship_mode not in ("second class", "standard class",
--"standard class"), drop respective row into error table and fail the framewok as "action_if_failed" tagged "fail"
--The row data quality has been set on ship_mode when ship_mode not in ("second class", "standard class",
--"standard class"), drop respective row into error table and fail the framework as "action_if_failed" tagged "fail"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'ship_mode_in_set', 'ship_mode', 'lower(trim(ship_mode))
in('second class', 'standard class', 'standard class')', 'fail', 'validity', 'ship_mode mode belongs in the sets',
false, false, true)

--The row data qulaity has been set on profit when profit is less than or equals to 0, drop respective row into
--The row data quality has been set on profit when profit is less than or equals to 0, drop respective row into
--error table and final table as "action_if_failed" tagged "ignore"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'profit_threshold', 'profit', 'profit>0', 'ignore',
'validity', 'profit threshold should be greater tahn 0', false, false, true)
'validity', 'profit threshold should be greater than 0', false, false, true)

--The rule has been established to identify and remove completely identical records in which rows repeat with the
--same value more than once, while keeping one instance of the row. Any additional duplicated rows will be dropped
--into error table as action_if_failed set to "drop"
,('apla_nd', '`catalog`.`schema`.customer_order', 'row_dq', 'complete_duplicate', 'All', 'row_number()
over(partition by cutomer_id, order_id order by 1)=1', 'drop', 'uniqueness', 'drop complete duplicate records',
over(partition by customer_id, order_id order by 1)=1', 'drop', 'uniqueness', 'drop complete duplicate records',
false, false, true)

```
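As an illustration of what a `row_dq` expectation with `action_if_failed = 'drop'` means at the DataFrame level (a toy sketch, not SparkExpectations internals): the expectation string is evaluated per row, failing rows are quarantined, and passing rows continue to the final table.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("row-dq-sketch").getOrCreate()

orders = spark.createDataFrame(
    [("c1", 120.0), (None, 50.0), ("c3", -5.0)],
    ["customer_id", "sales"],
)

# "customer_id is not null" with action_if_failed = 'drop': failing rows are
# routed to the error table, passing rows continue downstream.
passed = orders.filter(F.expr("customer_id is not null"))
errors = orders.exceptAll(passed)

passed.show()
errors.show()
```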
@@ -62,9 +62,9 @@ action_if_failed, tag, description, enable_for_source_dq_validation, enable_fo
,('apla_nd', '`catalog`.`schema`.customer_order', 'agg_dq', 'distinct_of_ship_mode', 'ship_mode',
'count(distinct ship_mode)<=3', 'ignore', 'validity', 'regex format validation for quantity', true, false, true)

-- The aggregation rule is established on the table countand the metadata of the rule will be captured in the
--statistics table when distinct count greater than 10000 and failes the job as "action_if_failed" set to "fail"
--and enabled only for validated datset
-- The aggregation rule is established on the table count and the metadata of the rule will be captured in the
--statistics table when distinct count greater than 10000 and fails the job as "action_if_failed" set to "fail"
--and enabled only for validated dataset
,('apla_nd', '`catalog`.`schema`..customer_order', 'agg_dq', 'row_count', '*', 'count(*)>=10000', 'fail', 'validity',
'distinct ship_mode must be less or equals to 3', false, true, true)

@@ -76,21 +76,21 @@ Please set up rules for checking the quality of artificially order table by impl
insert into `catalog`.`schema`.`{product}_rules` (product_id, table_name, rule_type, rule, column_name, expectation,
action_if_failed, tag, description) values

--The query dq rule is established to check product_id differemce between two table if differnce is more than 20%
--The query dq rule is established to check product_id difference between two table if difference is more than 20%
--from source table, the metadata of the rule will be captured in the statistics table as "action_if_failed" is "ignore"
,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'product_missing_count_threshold', '*',
'((select count(distinct product_id) from product) - (select count(distinct product_id) from order))>
(select count(distinct product_id) from product)*0.2', 'ignore', 'validity', 'row count threshold difference msut
(select count(distinct product_id) from product)*0.2', 'ignore', 'validity', 'row count threshold difference must
be less than 20%', true, true, true)

--The query dq rule is established to check distinct proudtc_id in the product table is less than 5, if not the
--The query dq rule is established to check distinct product_id in the product table is less than 5, if not the
--metadata of the rule will be captured in the statistics table along with fails the job as "action_if_failed" is
--"fail" and enabled for source dataset
,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'product_category', '*', '(select count(distinct category)
from product) < 5', 'fail', 'validity', 'distinct product category must be less than 5', true, False, true)

--The query dq rule is established to check count of the dataset should be less than 10000 other wise the metadata
--of the rule will be captured in the statistics table as "action_if_failed" is "ignore" and enabled only for target datset
--of the rule will be captured in the statistics table as "action_if_failed" is "ignore" and enabled only for target dataset
,('apla_nd', '`catalog`.`schema`.customer_order', 'query_dq', 'row_count_in_order', '*',
'(select count(*) from order)<10000', 'ignore', 'accuracy', 'count of the row in order dataset must be less then 10000',
false, true, true)
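Conceptually, a `query_dq` expectation is a SQL predicate evaluated against the registered views. The toy sketch below (not SparkExpectations internals) evaluates the row-count check above by hand against a stand-in `order` view.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("query-dq-sketch").getOrCreate()

# Toy data standing in for the real `order` table/view.
spark.range(100).toDF("order_id").createOrReplaceTempView("order")

# With action_if_failed = 'ignore', a failed check is only recorded in the
# statistics table; the data itself is not altered.
passed = spark.sql("select (select count(*) from `order`) < 10000 as passed").first()["passed"]
print(passed)  # True for this toy dataset
```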
8 changes: 4 additions & 4 deletions docs/configurations/databricks_setup_guide.md
@@ -4,7 +4,7 @@ This section provides instructions on how to set up a sample notebook in the Dat

#### Prerequisite:

1. Recommended databricks run time environment for better experience - DBS 11.0 and above
3. Please install the kafka jar using the path `dbfs:/kafka-jars/databricks-shaded-strimzi-kafka-oauth-client-1.1.jar`, If the jar is not available in the dbfs location, please raise a ticket with GAP Support team to add the jar to your workspace
2. Please follow the steps provided [here](TODO) to integrate and clone repo from git databricks
4. Please follow the steps to create the wbhook-hook URL for team-specific channel [here](TODO)
1. Recommended Databricks run time environment for better experience - DBS 11.0 and above
3. Please install the Kafka jar using the path `dbfs:/kafka-jars/databricks-shaded-strimzi-kafka-oauth-client-1.1.jar`, If the jar is not available in the dbfs location, please raise a ticket with GAP Support team to add the jar to your workspace
2. Please follow the steps provided [here](TODO) to integrate and clone repo from git Databricks
4. Please follow the steps to create the webhook-hook URL for team-specific channel [here](TODO)
2 changes: 1 addition & 1 deletion docs/delta.md
@@ -23,7 +23,7 @@ builder = (
spark = builder.getOrCreate()
```
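For reference, a local Delta-enabled builder typically looks like the sketch below, assuming the `delta-spark` PyPI package; the two `spark.sql.*` settings come from the Delta Lake documentation rather than from this repo.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-expectations-delta-local")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# Adds the matching Delta jars to the session before it is created.
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```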

Below is the configuration that can be used to run SparkExpectations and write to DeltaLake
Below is the configuration that can be used to run SparkExpectations and write to Delta Lake

```python title="delta_write"
import os
44 changes: 22 additions & 22 deletions docs/examples.md
@@ -27,21 +27,21 @@ se_user_conf = {
2. The `user_config.se_notifications_email_smtp_host` parameter is set to "mailhost.com" by default and is used to specify the email SMTP domain host
3. The `user_config.se_notifications_email_smtp_port` parameter, which accepts a port number, is set to "25" by default
4. The `user_config.se_notifications_email_from` parameter is used to specify the email ID that will trigger the email notification
5. The `user_configse_notifications_email_to_other_mail_id` parameter accepts a list of recipient email IDs
5. The `user_config.se_notifications_email_to_other_mail_id` parameter accepts a list of recipient email IDs
6. The `user_config.se_notifications_email_subject` parameter captures the subject line of the email
7. The `user_config.se_notifications_enable_slack` parameter, which controls whether notifications are sent via slack, is set to false by default
8. The `user_config/se_notifications_slack_webhook_url` parameter accepts the webhook URL of a Slack channel for sending notifications
9. When `user_config.se_notifications_on_start` parameter set to `True` enables notification on start of the spark-expectations, variable by default set to `False`
10. When `user_config.se_notifications_on_completion` parameter set to `True` enables notification on completion of spark-expectations framework, variable by default set to `False`
11. When `user_config.se_notifications_on_fail` parameter set to `True` enables notification on failure of spark-expectations data qulaity framework, variable by default set to `True`
11. When `user_config.se_notifications_on_fail` parameter set to `True` enables notification on failure of spark-expectations data quality framework, variable by default set to `True`
12. When `user_config.se_notifications_on_error_drop_exceeds_threshold_breach` parameter set to `True` enables notification when error threshold reaches above the configured value
13. The `user_config.se_notifications_on_error_drop_threshold` parameter captures error drop threshold value
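The `se_user_conf` dictionary these parameters belong to sits just above this hunk; for orientation, it typically looks like the sketch below. Values are placeholders and the import path should be checked against your installed version.

```python
from spark_expectations.config.user_config import Constants as user_config

se_user_conf = {
    user_config.se_notifications_enable_email: False,
    user_config.se_notifications_email_smtp_host: "mailhost.com",
    user_config.se_notifications_email_smtp_port: 25,
    user_config.se_notifications_email_from: "sender@mail.com",          # placeholder
    user_config.se_notifications_email_to_other_mail_id: "receiver@mail.com",  # placeholder
    user_config.se_notifications_email_subject: "spark expectations - data quality - notifications",
    user_config.se_notifications_enable_slack: False,
    user_config.se_notifications_slack_webhook_url: "",
    user_config.se_notifications_on_start: False,
    user_config.se_notifications_on_completion: False,
    user_config.se_notifications_on_fail: True,
    user_config.se_notifications_on_error_drop_exceeds_threshold_breach: True,
    user_config.se_notifications_on_error_drop_threshold: 15,
}
```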

### Spark Expectations Initialization

For all the below examples the below import and SparkExpectations class instantiation is mandatory

When store for sensitive details is Databricks secret scope,construct config dictionary for authentication of kafka and
When store for sensitive details is Databricks secret scope,construct config dictionary for authentication of Kafka and
avoid duplicate construction every time your project is initialized, you can create a dictionary with the following keys and their appropriate values.
This dictionary can be placed in the __init__.py file of your project or declared as a global variable.
```python
@@ -62,16 +62,16 @@ stats_streaming_config_dict: Dict[str, Union[bool, str]] = {
```

1. The `user_config.se_enable_streaming` parameter is used to control the enabling or disabling of Spark Expectations (SE) streaming functionality. When enabled, SE streaming stores the statistics of every batch run into Kafka.
2. The `user_config.secret_type` used to define type of secret store and takes two values (`databricks`, `cererus`) by default will be `databricks`
3. The `user_config.dbx_workspace_url` used to pass databricks workspace in the format `https://<workspace_name>.cloud.databricks.com`
2. The `user_config.secret_type` used to define type of secret store and takes two values (`databricks`, `cerberus`) by default will be `databricks`
3. The `user_config.dbx_workspace_url` used to pass Databricks workspace in the format `https://<workspace_name>.cloud.databricks.com`
4. The `user_config.dbx_secret_scope` captures name of the secret scope
5. The `user_config.dbx_kafka_server_url` captures secret key for the kafka url
6. The ` user_config.dbx_secret_token_url` captures secret key for the kafka authentication app url
7. The `user_config.dbx_secret_app_name` captures secret key for the kafka authentication app name
8. The `user_config.dbx_secret_token` captures secret key for the kafka authentication app secret token
9. The `user_config.dbx_topic_name` captures secret key for the kafka topic name
5. The `user_config.dbx_kafka_server_url` captures secret key for the Kafka URL
6. The ` user_config.dbx_secret_token_url` captures secret key for the Kafka authentication app URL
7. The `user_config.dbx_secret_app_name` captures secret key for the Kafka authentication app name
8. The `user_config.dbx_secret_token` captures secret key for the Kafka authentication app secret token
9. The `user_config.dbx_topic_name` captures secret key for the Kafka topic name
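Pulled together, the Databricks secret-scope variant of `stats_streaming_config_dict` looks roughly like this sketch; every value is a placeholder for a secret key or URL in your own workspace.

```python
from typing import Dict, Union
from spark_expectations.config.user_config import Constants as user_config

stats_streaming_config_dict: Dict[str, Union[bool, str]] = {
    user_config.se_enable_streaming: True,
    user_config.secret_type: "databricks",
    user_config.dbx_workspace_url: "https://<workspace_name>.cloud.databricks.com",
    user_config.dbx_secret_scope: "my_secret_scope",
    user_config.dbx_kafka_server_url: "se_streaming_server_url_secret_key",
    user_config.dbx_secret_token_url: "se_streaming_auth_secret_token_url_key",
    user_config.dbx_secret_app_name: "se_streaming_auth_secret_appid_key",
    user_config.dbx_secret_token: "se_streaming_auth_secret_token_key",
    user_config.dbx_topic_name: "se_streaming_topic_name_key",
}
```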

Similarly when sensitive store is cerberus:
Similarly when sensitive store is Cerberus:

```python
from typing import Dict, Union
@@ -83,22 +83,22 @@ stats_streaming_config_dict: Dict[str, Union[bool, str]] = {
user_config.cbs_url : "https://<url>.cerberus.com", # (3)!
user_config.cbs_sdb_path: "cerberus_sdb_path", # (4)!
user_config.cbs_kafka_server_url: "se_streaming_server_url_secret_sdb_path", # (5)!
user_config.cbs_secret_token_url: "se_streaming_auth_secret_token_url_sdb_apth", # (6)!
user_config.cbs_secret_token_url: "se_streaming_auth_secret_token_url_sdb_path", # (6)!
user_config.cbs_secret_app_name: "se_streaming_auth_secret_appid_sdb_path", # (7)!
user_config.cbs_secret_token: "se_streaming_auth_secret_token_sdb_path", # (8)!
user_config.cbs_topic_name: "se_streaming_topic_name_sdb_path", # (9)!
}
```

1. The `user_config.se_enable_streaming` parameter is used to control the enabling or disabling of Spark Expectations (SE) streaming functionality. When enabled, SE streaming stores the statistics of every batch run into Kafka.
2. The `user_config.secret_type` used to define type of secret store and takes two values (`databricks`, `cererus`) by default will be `databricks`
3. The `user_config.cbs_url` used to pass cerberus url
4. The `user_config.cbs_sdb_path` captures cerberus secure data store path
5. The `user_config.cbs_kafka_server_url` captures path where kafka url stored in the cerberus sdb
6. The ` user_config.cbs_secret_token_url` captures path where kafka authentication app stored in the cerberus sdb
7. The `user_config.cbs_secret_app_name` captures path where kafka authentication app name stored in the cerberus sdb
8. The `user_config.cbs_secret_token` captures path where kafka authentication app name secret token stored in the cerberus sdb
9. The `user_config.cbs_topic_name` captures path where kafka topic name stored in the cerberus sdb
2. The `user_config.secret_type` used to define type of secret store and takes two values (`databricks`, `cerberus`) by default will be `databricks`
3. The `user_config.cbs_url` used to pass Cerberus URL
4. The `user_config.cbs_sdb_path` captures Cerberus secure data store path
5. The `user_config.cbs_kafka_server_url` captures path where Kafka URL stored in the Cerberus sdb
6. The ` user_config.cbs_secret_token_url` captures path where Kafka authentication app stored in the Cerberus sdb
7. The `user_config.cbs_secret_app_name` captures path where Kafka authentication app name stored in the Cerberus sdb
8. The `user_config.cbs_secret_token` captures path where Kafka authentication app name secret token stored in the Cerberus sdb
9. The `user_config.cbs_topic_name` captures path where Kafka topic name stored in the Cerberus sdb

You can disable the streaming functionality by setting the `user_config.se_enable_streaming` parameter to `False`
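For example, in an environment without Kafka the whole streaming block can be reduced to this sketch:

```python
from spark_expectations.config.user_config import Constants as user_config

# Disables streaming the per-run statistics to Kafka.
stats_streaming_config_dict = {user_config.se_enable_streaming: False}
```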

@@ -183,12 +183,12 @@ def build_new() -> DataFrame:
6. The `target_table_view` parameter is used to provide the name of a view that represents the target validated dataset for implementation of `query_dq` on the clean dataset from `row_dq`
7. The `target_and_error_table_writer` parameter is used to write the final data into the table. By default, it is False. This is optional, if you just want to run the data quality checks. A good example will be a staging table or temporary view.
8. View registration can be utilized when implementing `query_dq` expectations.
9. Returning a dataframe is mandatory for the `spark_expectations` to work, if we do not return a dataframe - then an exceptionm will be raised
9. Returning a dataframe is mandatory for the `spark_expectations` to work, if we do not return a dataframe - then an exception will be raised
10. Instantiate `SparkExpectations` class which has all the required functions for running data quality rules
11. The `product_id` parameter is used to specify the product ID of the data quality rules. This has to be a unique value
12. The `rules_df` parameter is used to specify the dataframe that contains the data quality rules
13. The `stats_table` parameter is used to specify the table name where the statistics will be written into
14. The `stats_table_writer` takes in the configuration that need to be used to write the stats table using pyspark
15. The `target_and_error_table_writer` takes in the configuration that need to be used to write the target and error table using pyspark
16. The `debugger` parameter is used to enable the debugger mode
17. The `stats_streaming_options` parameter is used to specify the configurations for streaming statistics into Kafka. To not use kafka, uncomment this.
17. The `stats_streaming_options` parameter is used to specify the configurations for streaming statistics into Kafka. To not use Kafka, uncomment this.
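Putting the points above together, a minimal end-to-end wiring looks roughly like the following sketch; table names are placeholders and the argument names should be checked against the 1.0.0 examples.

```python
from pyspark.sql import DataFrame, SparkSession
from spark_expectations.core.expectations import SparkExpectations, WrappedDataFrameWriter
from spark_expectations.config.user_config import Constants as user_config

spark = SparkSession.builder.getOrCreate()
writer = WrappedDataFrameWriter().mode("append").format("delta")

se = SparkExpectations(
    product_id="your_product",                        # unique product id for the rules
    rules_df=spark.table("dq_spark_local.dq_rules"),  # dataframe holding the DQ rules
    stats_table="dq_spark_local.dq_stats",            # where run statistics are written
    stats_table_writer=writer,
    target_and_error_table_writer=writer,
    debugger=False,
    stats_streaming_options={user_config.se_enable_streaming: False},
)

@se.with_expectations(
    target_table="dq_spark_local.customer_order",
    write_to_table=True,
    target_table_view="order",  # view representing the validated dataset for query_dq
)
def build_new() -> DataFrame:
    df = spark.read.table("dq_spark_local.customer_order_source")
    df.createOrReplaceTempView("order")  # registered so query_dq rules can reference it
    return df                            # returning a DataFrame is mandatory
```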
16 changes: 8 additions & 8 deletions docs/getting-started/setup.md
@@ -1,6 +1,6 @@
## Installation
The library is available in the pypi and can be installed in your environment using the below command or
please add the library "spark-expectations" into the requirements.txt or poetry dependencies.
The library is available in the Python Package Index (PyPi) and can be installed in your environment using the below command or
add the library "spark-expectations" into the requirements.txt or poetry dependencies.

```shell
pip install -U spark-expectations
@@ -9,7 +9,7 @@ pip install -U spark-expectations
## Required Tables

There are two tables that need to be created for spark-expectations to run seamlessly and integrate with a spark job.
The below sql statements used three namespaces which works with Databricks UnityCatalog, but if you are using hive
The below SQL statements used three namespaces which works with Databricks Unity Catalog, but if you are using hive
please update the namespaces accordingly and also provide necessary table metadata.


@@ -40,8 +40,8 @@ create table if not exists `catalog`.`schema`.`{product}_rules` (
### Rule Type For Rules

The rules column has a column called "rule_type". It is important that this column should only accept one of
these three values - `[row_dq, agg_dq, query_dq]`. If other values are provided, the library may cause unforseen errors.
Please run the below command to add constriants to the above created rules table
these three values - `[row_dq, agg_dq, query_dq]`. If other values are provided, the library may cause unforeseen errors.
Please run the below command to add constraints to the above created rules table

```sql
ALTER TABLE `catalog`.`schema`.`{product}_rules`
@@ -52,8 +52,8 @@ ADD CONSTRAINT rule_type_action CHECK (rule_type in ('row_dq', 'agg_dq', 'query_

The rules column has a column called "action_if_failed". It is important that this column should only accept one of
these values - `[fail, drop or ignore]` for `'rule_type'='row_dq'` and `[fail, ignore]` for `'rule_type'='agg_dq' and 'rule_type'='query_dq'`.
If other values are provided, the library may cause unforseen errors.
Please run the below command to add constriants to the above created rules table
If other values are provided, the library may cause unforeseen errors.
Please run the below command to add constraints to the above created rules table

```sql
ALTER TABLE apla_nd_dq_rules ADD CONSTRAINT action CHECK
@@ -65,7 +65,7 @@ ALTER TABLE apla_nd_dq_rules ADD CONSTRAINT action CHECK
### DQ Stats Table

In order to collect the stats/metrics for each data quality job run, the spark-expectations job will
automatically create the stats table if it does not exist. The below sql statement can be used to create the table
automatically create the stats table if it does not exist. The below SQL statement can be used to create the table
if you want to create it manually, but it is not recommended.

```sql
2 changes: 1 addition & 1 deletion docs/iceberg.md
@@ -29,7 +29,7 @@ builder = (
spark = builder.getOrCreate()
```
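For reference, a local Iceberg-enabled builder typically looks like the sketch below; the runtime package coordinates, catalog name, and warehouse path are assumptions rather than values from this repo.

```python
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .master("local[*]")
    .appName("spark-expectations-iceberg-local")
    # Adjust the runtime artifact to match your Spark/Scala version.
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.3")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
)
spark = builder.getOrCreate()
```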

Below is the configuration that can be used to run SparkExpectations and write to DeltaLake
Below is the configuration that can be used to run SparkExpectations and write to Delta Lake

```python title="iceberg_write"
import os
8 changes: 4 additions & 4 deletions docs/index.md
@@ -28,7 +28,7 @@ the data quality standards
* Downstream users have to consume the same data with error, or they have to do additional computation to remove the
records that doesn't meet the standards
* Another process is required as a corrective action to rectify the errors in the data and lot of planning is usually
required for this acitivity
required for this activity

`Spark-Expectations solves all of the above problems by following the below principles`

@@ -53,10 +53,10 @@ what needs to be done if a rule fails
* Let's consider a hypothetical scenario, where we have 100 columns and with 200
row level data quality rules, 10 aggregation data quality rules and 5 query data quality rules computed against. When the dq job is run, there are
10 rules that failed on a particular row and 4 aggregation rules fails- what determines if that row should end up in
final table or not? Below are the heirarchy of checks that happens?
* Among the row level 10 rules failed, if there is atleast one rule which has an _action_if_failed_ as _fail_ -
final table or not? Below are the hierarchy of checks that happens?
* Among the row level 10 rules failed, if there is at least one rule which has an _action_if_failed_ as _fail_ -
then the job will be failed
* Among the 10 row level rules failed, if there is no rule that has an _action_if_failed_ as _fail_, but atleast
* Among the 10 row level rules failed, if there is no rule that has an _action_if_failed_ as _fail_, but at least
has one rule with _action_if_failed_ as _drop_ - then the record/row will be dropped
* Among the 10 row level rules failed, if no rule neither has _fail_ nor _drop_ as an _action_if_failed_ - then
the record will be end up in the final table. Note that, this record would also exist in the `_error` table
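The precedence described above amounts to fail > drop > ignore; a small illustrative sketch (not SparkExpectations internals) of the decision for a single row:

```python
def row_outcome(failed_actions: list[str]) -> str:
    """Outcome for one row, given action_if_failed of the row-level rules it failed."""
    if "fail" in failed_actions:
        return "fail the job"
    if "drop" in failed_actions:
        return "drop the row into the _error table only"
    return "keep the row in the final table (also recorded in the _error table)"

print(row_outcome(["ignore", "drop", "ignore"]))  # drop the row into the _error table only
print(row_outcome(["ignore", "ignore"]))          # keep the row in the final table
```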