📖 Athena Spark Initial Testing #6640
Comments
Tasks done so far:
A new CaDeT branch was created for the testing: https://github.com/moj-analytical-services/create-a-derived-table/tree/6640-athena-spark-initial-testing
A few sample Python models were added to the branch to test Python Models via Athena Spark deployments.
Many options were tried, but deployment is not yet successful. The issue below occurs while deploying these Python models.
The dbt-fal adapter is not compatible with the current dbt-core version, and the dbt-spark and dbt-athena adapters are not working as expected. I will try a few more options next week.
I am now able to successfully build Python/PySpark models via Athena Spark deployments. The log of the Airflow DAG is below. The models are built in a branch of the CaDeT repo before we merge them into main.
I will be working on a few ML models with larger workloads to monitor and compare relative costs via Cost Explorer.
The test of importing a custom Python library via upload to S3 passed.
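For anyone repeating the custom-library test, the packaging step can be sketched locally as below. This is only an illustration under assumptions: the module name `cadet_helpers` and its `normalise` function are hypothetical, and the issue does not record which call was used inside the Athena Spark session. A common route in PySpark is to upload the zip to S3 and register it with `sc.addPyFile("s3://...")`; here the zip is simply placed on `sys.path` to show that a pure-Python archive is importable.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def build_library_zip(zip_path: Path, module_name: str, source: str) -> Path:
    """Write a single pure-Python module into a zip archive."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr(f"{module_name}.py", source)
    return zip_path


# Hypothetical helper module, for illustration only.
SOURCE = "def normalise(name):\n    return ' '.join(name.lower().split())\n"

workdir = Path(tempfile.mkdtemp())
archive_path = build_library_zip(
    workdir / "cadet_helpers.zip", "cadet_helpers", SOURCE
)

# Locally, a zip on sys.path is importable via Python's zipimport machinery.
# In a Spark session the zip would instead be uploaded to S3 and registered
# with the Spark context (assumption: sc.addPyFile with an s3:// path).
sys.path.insert(0, str(archive_path))
cadet_helpers = importlib.import_module("cadet_helpers")

cleaned = cadet_helpers.normalise("  Ada   LOVELACE ")
```

Checking that the archive imports cleanly on a plain Python interpreter is a cheap way to catch compiled dependencies before uploading anything.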
User Story
As a maintainer of CaDeT Deployments
I need to explore costs and mechanisms involved with Python Models via Athena Spark deployments
So that I can understand whether to enable these for end users.
Value / Purpose
There are tasks that are not easily completed using pure SQL-based transformations, such as fuzzy matching or natural language processing. To enable these tasks, dbt supports 'Python Models', which allow users to build models based on Python code rather than SQL. To do this, queries must be submitted to an Athena Spark workgroup, which includes several packages by default and allows more to be imported as pure Python (no C extensions) zip files.
Useful Contacts
@jhpyke
User Types
No response
Hypothesis
If we build a test pipeline of python models
we will be able to validate the costs and compute times associated with python models.
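One way to frame the cost comparison is a rough estimate from DPU count and runtime, to be checked against the figures Cost Explorer reports. The sketch below assumes billing proportional to DPU-hours; the price constant, the run names, and the DPU and runtime figures are all placeholders, not measurements from this testing, and the real rate should be taken from the current AWS pricing page for the relevant region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionRun:
    """One Athena Spark run of a Python model (all values hypothetical)."""
    name: str
    dpu_count: float
    runtime_seconds: float


def estimate_cost(run: SessionRun, price_per_dpu_hour: float) -> float:
    """Estimate cost as DPU-hours consumed multiplied by the hourly DPU price."""
    dpu_hours = run.dpu_count * run.runtime_seconds / 3600
    return dpu_hours * price_per_dpu_hour


# Placeholder price, NOT an authoritative figure: check AWS pricing.
ASSUMED_PRICE_PER_DPU_HOUR = 0.35

runs = [
    SessionRun("small_python_model", dpu_count=4, runtime_seconds=900),
    SessionRun("ml_python_model", dpu_count=20, runtime_seconds=3600),
]

costs = {
    run.name: estimate_cost(run, ASSUMED_PRICE_PER_DPU_HOUR) for run in runs
}
```

Keeping the price as a parameter makes it straightforward to rerun the comparison if the rate or region changes.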
Proposal
Suggested order of tasks:
Additional Information
Any Python models should be built in the existing sandpit domain, to prevent unwanted data spillage.
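For reference, a dbt Python model has the general shape sketched below: a `model(dbt, session)` function that configures itself, reads upstream models with `dbt.ref`, and returns a DataFrame. The upstream model name `sandpit_example_source` is hypothetical, and the row limit is only there to keep a test run small; neither comes from the test branch.

```python
def model(dbt, session):
    """Minimal dbt Python model sketch for a sandpit test run."""
    # Materialise the result as a table, as dbt Python models require
    # a table-like materialisation rather than a view.
    dbt.config(materialized="table")

    # Hypothetical upstream model in the sandpit domain.
    source_df = dbt.ref("sandpit_example_source")

    # Keep the test workload small; `limit` is a standard PySpark
    # DataFrame method.
    return source_df.limit(100)
```

In a dbt project this function would live in its own `.py` file under the models directory, with dbt supplying the `dbt` and `session` arguments at run time.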
Definition of Done