Commit ca1ceb5 (1 parent: 1bcce5a)

* Making changes and refactoring the code
* Adding WrappedDataFrameWriter and unit tests for the same
* Ignoring agg_dq and query_dq
* Rearranging a lot of code; removed delta; enabled custom writers in different formats
* Fixing tests for the sinks and updating README
* Fixing Kafka configurations
* Feature: add reader
* Changing rules table along with a few other changes
* Reader and writer modifications
* Updating formatting, removing delta as a dependency
* Adding examples for expectations
* Adding documentation
* Updating tests
* Updating documentation
* Updating WrappedDataFrameWriter

Co-authored-by: phanikumarvemuri <[email protected]>

Showing 51 changed files with 2,984 additions and 3,338 deletions.
@@ -0,0 +1,8 @@
::: spark_expectations.examples.sample_dq_delta
    handler: python
    options:
      filters:
        - "!^_[^_]"
        - "!^__[^__]"
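The `filters` entries above are mkdocstrings exclusion patterns: each regex is prefixed with `!`, so members whose names match it are hidden from the generated API docs (here, single- and double-underscore-prefixed names). A quick illustrative check of what the raw regexes match, not part of the repo:

```python
import re

# Illustrative only: the raw regexes behind the "!^_[^_]" and "!^__[^__]" filters.
private = re.compile(r"^_[^_]")   # a single leading underscore, e.g. _helper
dunder = re.compile(r"^__[^__]")  # a double leading underscore, e.g. __internal

print(bool(private.match("_helper")))   # True  -> excluded by the first filter
print(bool(private.match("__init__")))  # False -> not a single-underscore name
print(bool(dunder.match("__init__")))   # True  -> excluded by the second filter
print(bool(dunder.match("public")))     # False -> kept in the docs
```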
@@ -0,0 +1,8 @@
::: spark_expectations.examples.sample_dq_iceberg
    handler: python
    options:
      filters:
        - "!^_[^_]"
        - "!^__[^__]"
@@ -0,0 +1,92 @@
### Example - Write to BigQuery

Set up a SparkSession for BigQuery to test in your local environment, and configure it accordingly for higher environments.
Refer to the examples in [base_setup.py](../spark_expectations/examples/base_setup.py) and
[sample_dq_bigquery.py](../spark_expectations/examples/sample_dq_bigquery.py).
```python title="spark_session"
from pyspark.sql import SparkSession

# Pull in the Spark BigQuery connector
builder = (
    SparkSession.builder.config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.30.0",
    )
)
spark = builder.getOrCreate()

# Register the GCS filesystem implementation used by the connector
spark._jsc.hadoopConfiguration().set(
    "fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
)
# Allow reading views, and name the dataset used to materialize them
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "<temp_dataset>")
```
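
`viewsEnabled` and `materializationDataset` are Spark BigQuery connector options: with views enabled, the connector materializes view and query results into temporary tables inside the named dataset before reading them, so `<temp_dataset>` must already exist and be writable by the credentials in use.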

Below is a configuration that can be used to run Spark Expectations and write to BigQuery:

```python title="bigquery_write"
import os
from pyspark.sql import DataFrame
from spark_expectations.core.expectations import (
    SparkExpectations,
    WrappedDataFrameWriter,
)
from spark_expectations.config.user_config import Constants as user_config

os.environ[
    "GOOGLE_APPLICATION_CREDENTIALS"
] = "path_to_your_json_credential_file"  # Needed for Spark to write to BigQuery

# Writer used for the stats, error, and target tables
writer = (
    WrappedDataFrameWriter().mode("overwrite")
    .format("bigquery")
    .option("createDisposition", "CREATE_IF_NEEDED")
    .option("writeMethod", "direct")
)

se: SparkExpectations = SparkExpectations(
    product_id="your_product",
    rules_df=spark.read.format("bigquery").load(
        "<project_id>.<dataset_id>.<rules_table>"
    ),
    stats_table="<project_id>.<dataset_id>.<stats_table>",
    stats_table_writer=writer,
    target_and_error_table_writer=writer,
    debugger=False,
    stats_streaming_options={user_config.se_enable_streaming: False},
)

# Commented fields are optional or required when notifications are enabled
user_conf = {
    user_config.se_notifications_enable_email: False,
    # user_config.se_notifications_email_smtp_host: "mailhost.com",
    # user_config.se_notifications_email_smtp_port: 25,
    # user_config.se_notifications_email_from: "",
    # user_config.se_notifications_email_to_other_mail_id: "",
    # user_config.se_notifications_email_subject: "spark expectations - data quality - notifications",
    user_config.se_notifications_enable_slack: False,
    # user_config.se_notifications_slack_webhook_url: "",
    # user_config.se_notifications_on_start: True,
    # user_config.se_notifications_on_completion: True,
    # user_config.se_notifications_on_fail: True,
    # user_config.se_notifications_on_error_drop_exceeds_threshold_breach: True,
    # user_config.se_notifications_on_error_drop_threshold: 15,
}


@se.with_expectations(
    target_table="<project_id>.<dataset_id>.<target_table_name>",
    write_to_table=True,
    user_conf=user_conf,
    target_table_view="<project_id>.<dataset_id>.<target_table_view_name>",
)
def build_new() -> DataFrame:
    # Source dataframe the expectations are evaluated against
    _df_order: DataFrame = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(os.path.join(os.path.dirname(__file__), "resources/order.csv"))
    )
    _df_order.createOrReplaceTempView("order")

    return _df_order
```
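
`with_expectations` is a decorator, so the data-quality run is triggered simply by calling the decorated function. A minimal usage sketch, assuming the rules table referenced above is populated:

```python title="run"
# Executes the expectations against the dataframe returned by build_new();
# rows that pass are written to the target table, failing rows to the
# corresponding error table, and run metrics to the configured stats table.
build_new()
```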