📖 Athena Spark Initial Testing #6640

Open
3 of 4 tasks
jhpyke opened this issue Jan 31, 2025 · 8 comments

@jhpyke
Contributor

jhpyke commented Jan 31, 2025

User Story

As a maintainer of CaDeT Deployments
I need to explore costs and mechanisms involved with Python Models via Athena Spark deployments
So that I can understand whether to enable these for end users.

Value / Purpose

Some tasks, such as fuzzy matching or natural language processing, are not easily done with pure SQL transformations. To enable them, dbt supports 'Python models', which let users build models from Python code rather than SQL. To run them, queries must be submitted to an Athena Spark workgroup, which includes several packages by default and allows more to be imported as zip files of pure-Python code (no C extensions).
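
For context, a dbt Python model is a `.py` file in the models directory that defines a `model(dbt, session)` function and returns a DataFrame. The sketch below is illustrative only: the file path, the upstream model `stg_names` and its `name` column are hypothetical, and it assumes the adapter hands back a Spark-style DataFrame whose `filter` accepts a SQL string.

```python
# models/sandpit/non_null_names.py (hypothetical path and model name)


def model(dbt, session):
    # dbt Python models support table and incremental materializations.
    dbt.config(materialized="table")

    # `stg_names` is a hypothetical upstream model with a `name` column.
    names = dbt.ref("stg_names")

    # Assumes a Spark-style DataFrame, where filter() accepts a SQL expression string.
    return names.filter("name IS NOT NULL")
```

dbt calls `model` itself at build time; the `session` argument is the Spark session supplied by the adapter and is unused in this sketch.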

Useful Contacts

@jhpyke

User Types

No response

Hypothesis

If we build a test pipeline of python models
we will be able to validate the costs and compute times associated with python models.

Proposal

Suggested order of tasks:

  • Create a new Athena Workgroup for Athena Spark testing
  • Update the profiles file to reference the new workgroup
  • Build some basic models using Python, per the dbt docs on Python models
  • Test importing a custom Python library via upload to S3
  • View costs associated with your query workgroup in AWS Cost Explorer
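
For the custom library step, the package first has to be zipped as pure Python before it is uploaded. A minimal sketch of that packaging step, using only the standard library; the function name `zip_pure_python_package` is hypothetical, and it deliberately includes only `.py` files:

```python
import zipfile
from pathlib import Path


def zip_pure_python_package(package_dir: str, zip_path: str) -> list[str]:
    """Zip the .py files of a package, keeping the package folder at the archive root."""
    package = Path(package_dir).resolve()
    written = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package.rglob("*.py")):
            # Keep the package folder name in the archive so `import <package>` resolves.
            arcname = path.relative_to(package.parent).as_posix()
            zf.write(path, arcname)
            written.append(arcname)
    return written
```

Once the zip is in S3, the AWS documentation for Athena Spark notebooks describes loading it with `sc.addPyFile("s3://<bucket>/<key>.zip")`. Whether the same call works from inside a dbt Python model is an assumption that this testing would need to confirm.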

Additional Information

Any Python models should be built in the existing sandpit domain, to prevent unwanted data spillage.

Definition of Done

  • An Athena Spark workgroup exists
  • We are able to successfully build models using it
  • We can import custom packages
  • We are able to monitor and compare relative costs via AWS Cost Explorer or similar.
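
For the cost comparison, one option is to pull figures from the Cost Explorer API rather than the console. The sketch below only builds the request arguments for boto3's `get_cost_and_usage`; the helper name `athena_cost_request` is hypothetical, and filtering on the service name "Amazon Athena" is an assumption about how these costs are labelled in our account.

```python
from datetime import date


def athena_cost_request(start: date, end: date) -> dict:
    """Build keyword arguments for Cost Explorer's get_cost_and_usage call."""
    return {
        # The End date is exclusive in the Cost Explorer API.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        # Assumption: Athena costs are reported under this service name.
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Athena"]}},
        # Group by usage type to try to separate Spark usage from SQL query usage.
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }
```

With credentials available, this could be used as `boto3.client("ce").get_cost_and_usage(**athena_cost_request(date(2025, 2, 1), date(2025, 3, 1)))`. Separating the Spark workgroup from other workgroups would probably need a cost allocation tag on the workgroup, which is not shown here.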
@jnayak-moj
Contributor

I have added a few sample Python models for testing: Python Models via Athena Spark deployments

@jnayak-moj
Contributor

I have tried several options, but deployment is not yet successful. The run below shows the error raised while deploying these Python models.
https://github.com/moj-analytical-services/create-a-derived-table/actions/runs/13459939339/job/37612577785?pr=2985

@jnayak-moj
Contributor

The dbt-fal adapter is not compatible with the current dbt-core version, and the dbt-spark and dbt-athena adapters are not working as expected. I will try a few more options next week.

@jnayak-moj jnayak-moj moved this from In Progress to TODO in Analytical Platform Feb 27, 2025
@jnayak-moj jnayak-moj reopened this Feb 27, 2025
@jnayak-moj jnayak-moj moved this from TODO to In Progress in Analytical Platform Feb 27, 2025
@jnayak-moj
Contributor

jnayak-moj commented Mar 3, 2025

I am now able to successfully build Python/PySpark models via Athena Spark deployments.
So far I have built very simple programs, including a simple Python model, a simple SQL model, a simple PySpark model and a simple PySpark ML (scikit-learn) model.

The log of the Airflow DAG run is below.
https://23f37892-d1d1-4d9f-a03d-b8a53581fd20.c0.eu-west-1.airflow.amazonaws.com/log?execution_date=2025-02-28T16%3A41%3A43.977743%2B00%3A00&task_id=cadet-deploy-task&dag_id=cadet_deployments.deploy-pyspark-models&map_index=-1

The models are on a branch in the CaDeT repo and have not yet been merged into main.
https://github.com/moj-analytical-services/create-a-derived-table/pull/2985/files

@jnayak-moj
Contributor

Next, I will work on a few ML models with larger workloads so that we can monitor and compare relative costs via AWS Cost Explorer.

Projects
Status: In Progress
Development

No branches or pull requests

2 participants