📖 Athena Spark Initial Testing #6640

jhpyke · 2025-01-31T10:20:24Z

User Story

As a maintainer of CaDeT Deployments
I need to explore costs and mechanisms involved with Python Models via Athena Spark deployments
So that I can understand whether to enable these for end users.

Value / Purpose

Some tasks are not easily completed using pure SQL-based transformations, such as fuzzy matching or natural language processing. To enable these tasks, dbt supports 'Python Models', which let users build models from Python code rather than SQL. To run them, queries must be submitted to an Athena Spark workgroup, which includes several packages by default and allows more to be imported as pure-Python zip files (no C extensions).
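
As a rough illustration of what such a Python model looks like, here is a minimal sketch following the `model(dbt, session)` shape from the dbt docs. The upstream model name `stg_customers` and the column `customer_id` are made up for the example and are not taken from the CaDeT repo.

```python
def model(dbt, session):
    """Hypothetical dbt Python model that de-duplicates an upstream table."""
    # Python models are materialised as tables (or incrementally), not views.
    dbt.config(materialized="table")

    # `stg_customers` is an assumed upstream model name, for illustration only.
    customers = dbt.ref("stg_customers")

    # On a Spark-backed workgroup `dbt.ref` returns a Spark DataFrame,
    # so ordinary DataFrame operations can be applied before returning it.
    return customers.dropDuplicates(["customer_id"])
```

dbt writes whatever DataFrame the function returns to the model's target table, so the body can hold any transformation that is awkward to express in SQL.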

Useful Contacts

@jhpyke

User Types

No response

Hypothesis

If we build a test pipeline of python models
we will be able to validate the costs and compute times associated with python models.

Proposal

Suggested order of tasks:

Create a new Athena Workgroup for Athena Spark testing
Update the profiles file to reference the new workgroup
Build some basic models using Python per the dbt docs on Python Models
Test importing a custom Python library via upload to S3
View costs associated with your query workgroup in cost-explorer

Additional Information

Any Python models should be built in the existing sandpit domain, to prevent unwanted data spillage.

Definition of Done

An Athena Spark workgroup exists
We are able to successfully build models using it
We can import custom packages
We are able to monitor and compare relative costs via cost-explorer or similar.
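
For comparing relative costs between test runs, a back-of-the-envelope estimate can sit alongside the cost-explorer figures. The sketch below assumes Athena Spark usage is charged in proportion to DPU-hours. The function name and the rate passed in are placeholders, so the real regional price should be checked before relying on the numbers.

```python
def estimate_spark_cost(dpu_count: float, runtime_seconds: float,
                        usd_per_dpu_hour: float) -> float:
    """Estimate the cost of one Spark session from its DPU-hours.

    All three inputs are supplied by the caller; the rate in particular is
    an assumption and must be replaced with the current regional price.
    """
    dpu_hours = dpu_count * runtime_seconds / 3600
    return dpu_hours * usd_per_dpu_hour


# Example: 4 DPUs for 15 minutes at a placeholder rate of 0.35 USD per DPU-hour.
print(estimate_spark_cost(4, 900, 0.35))  # 0.35
```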


jnayak-moj · 2025-02-21T16:01:49Z

Tasks done so far:

New Athena Workgroup added - https://github.com/ministryofjustice/analytical-platform/blob/main/terraform/aws/analytical-platform-data-production/athena/athena-workgroups.tf#L30
New IAM execution role added - https://github.com/ministryofjustice/analytical-platform/blob/main/terraform/aws/analytical-platform-data-production/athena/iam-roles.tf

jnayak-moj · 2025-02-21T16:05:56Z

A new CaDeT branch was created for the testing - https://github.com/moj-analytical-services/create-a-derived-table/tree/6640-athena-spark-initial-testing
A new workflow was added for testing - https://github.com/moj-analytical-services/create-a-derived-table/blob/73bb008a826cfe2e1d3e496b615173960dcb6662/.github/workflows/deploy-python-models.yml
The workgroup was updated in the CaDeT repo - https://github.com/ministryofjustice/analytical-platform/blob/main/terraform/aws/analytical-platform-data-production/athena/iam-roles.tf

jnayak-moj · 2025-02-21T16:07:17Z

A few sample Python models were added to test Python Models via Athena Spark deployments.

jnayak-moj · 2025-02-21T16:09:25Z

Many options have been tried, but deployment is not yet successful. The run below shows the issue encountered while deploying these Python models.
https://github.com/moj-analytical-services/create-a-derived-table/actions/runs/13459939339/job/37612577785?pr=2985

jnayak-moj · 2025-02-21T16:11:41Z

The dbt-fal adapter is not compatible with the current dbt-core version, and the dbt-spark and dbt-athena adapters are not working as expected. I will try a few more options next week.

jnayak-moj · 2025-03-03T10:05:06Z

I am now able to successfully build Python/PySpark models via Athena Spark deployments.
Initially I built very simple models: a simple Python model, a simple SQL model, a simple PySpark model and a simple PySpark ML (scikit-learn) model.

The Airflow DAG log is below.
https://23f37892-d1d1-4d9f-a03d-b8a53581fd20.c0.eu-west-1.airflow.amazonaws.com/log?execution_date=2025-02-28T16%3A41%3A43.977743%2B00%3A00&task_id=cadet-deploy-task&dag_id=cadet_deployments.deploy-pyspark-models&map_index=-1

The models are built in a branch of the CaDeT repo, pending merge into main.
https://github.com/moj-analytical-services/create-a-derived-table/pull/2985/files

jnayak-moj · 2025-03-03T10:05:54Z

I will be working on a few ML models with larger workloads to monitor and compare relative costs via cost-explorer.

jnayak-moj · 2025-03-03T17:00:24Z

The test of importing a custom Python library via upload to S3 passed.
The Python dbt model - https://github.com/moj-analytical-services/create-a-derived-table/pull/2985/files#diff-7e0d4f924b89027bcf4be04f464538713bc2da1b0519c763613eaa7577b7d4fcR4

The Airflow DAG logs - https://23f37892-d1d1-4d9f-a03d-b8a53581fd20.c0.eu-west-1.airflow.amazonaws.com/log?execution_date=2025-03-03T16%3A48%3A39.745888%2B00%3A00&task_id=cadet-deploy-task&dag_id=cadet_deployments.deploy-pyspark-models&map_index=-1
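
For anyone repeating the custom library test, the packaging step can be sketched as follows. This is not the code used in the linked PR. It is a minimal example, with a hypothetical helper name, that zips only the `.py` files of a package so that the archive stays pure Python before it is uploaded to S3.

```python
import zipfile
from pathlib import Path


def zip_pure_python_package(package_dir: str, zip_path: str) -> list[str]:
    """Zip the .py files of a package, skipping compiled extensions.

    Returns the archive member names that were written.
    """
    package = Path(package_dir)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for source in sorted(package.rglob("*.py")):
            # Keep the top-level package folder in the archive path so that
            # `import <package>` resolves once the zip is on the Python path.
            arcname = source.relative_to(package.parent).as_posix()
            archive.write(source, arcname)
            added.append(arcname)
    return added
```

Once the zip is in S3, the usual Spark route is to call `addPyFile` on the Spark context with the `s3://` URI before importing the package. Whether that is done through `session.sparkContext` inside the dbt model is an assumption here and should be checked against the model in the PR above.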

jhpyke added the story label Jan 31, 2025

jhpyke added this to Analytical Platform Jan 31, 2025

jhpyke added the 📊 CaDeT label Jan 31, 2025

github-project-automation bot moved this to 👀 TODO in Analytical Platform Jan 31, 2025

github-actions bot mentioned this issue Feb 1, 2025

Monthly issue metrics report #6643

Closed

This was referenced Feb 14, 2025

6640 athena spark testing #6914

Closed

6640 athena spark workgroup #6919

Merged

tom-webber mentioned this issue Feb 14, 2025

fix: AP-6640 revert broken athena workgroup apply #6920

Merged

4 tasks

jnayak-moj moved this from TODO to In Progress in Analytical Platform Feb 14, 2025

This was referenced Feb 17, 2025

6640 athena spark workgroup fixed #6923

Merged

6640 athena spark workgroup TF apply fix #6928

Merged

6640 athena spark workgroup engine version changed #6930

Merged

workgroup engine version updated #6931

Merged

jnayak-moj moved this from In Progress to Done in Analytical Platform Feb 27, 2025

jnayak-moj closed this as completed by moving to Done in Analytical Platform Feb 27, 2025

jnayak-moj moved this from Done to In Progress in Analytical Platform Feb 27, 2025

jnayak-moj self-assigned this Feb 27, 2025

jnayak-moj removed the status in Analytical Platform Feb 27, 2025

jnayak-moj mentioned this issue Feb 27, 2025

attched restrcited s3 policy #7013

Merged

4 tasks

jnayak-moj moved this to In Progress in Analytical Platform Feb 27, 2025

jnayak-moj moved this from In Progress to TODO in Analytical Platform Feb 27, 2025

jnayak-moj reopened this Feb 27, 2025

jnayak-moj moved this from TODO to In Progress in Analytical Platform Feb 27, 2025

jnayak-moj mentioned this issue Feb 27, 2025

6640 mojap development bucket access added #7014

Merged

4 tasks

jnayak-moj mentioned this issue Mar 3, 2025

s3 file access added to the spark execution role #7069

Merged

4 tasks
