This repo provides a Kubernetes Helm chart for deploying a PySpark Jupyter Notebook.
- Set up a Kubernetes cluster
  - in a cloud platform of choice, such as Amazon EKS, Google Kubernetes Engine, or Azure Kubernetes Service, OR
  - in a local environment using Minikube (see the sketch after this list).
- Install the following tools:
  - Helm
  - kubectl
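If you go the local route, here is a minimal sketch of bringing up a Minikube cluster (the CPU and memory sizes below are illustrative assumptions, adjust them to your machine):

```bash
# start a local single-node cluster
minikube start --cpus=4 --memory=8192

# confirm kubectl can reach it
kubectl cluster-info
```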
Add the pyspark-notebook Helm repo by running the following:
```bash
helm repo add pyspark-notebook https://a3data.github.io/pyspark-notebook-helm/
```
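Optionally, refresh the chart index and confirm that the repo was added:

```bash
helm repo update
helm search repo pyspark-notebook
```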
Then deploy the pyspark-notebook chart by running the following:
```bash
helm install pyspark-notebook pyspark-notebook/pyspark-notebook
```
Run `kubectl get all` to check whether all the PySpark resources are running. You should see output similar to the one below.
```
NAME            READY   STATUS    RESTARTS   AGE
pod/pyspark-0   1/1     Running   0          9m18s

NAME                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/pyspark            ClusterIP   10.110.1.129   <none>        8888/TCP,7777/TCP,2222/TCP   9m18s
service/pyspark-headless   ClusterIP   None           <none>        8888/TCP,7777/TCP,2222/TCP   9m18s

NAME                       READY   AGE
statefulset.apps/pyspark   1/1     9m18s
```
You can run the following to expose the notebook locally:

```bash
kubectl port-forward svc/<release name> 8888:8888
```
You should be able to access the frontend via http://localhost:8888.
To retrieve the access token for the notebook, open a shell inside the pod and list the running Jupyter servers:

```bash
kubectl exec -it pod/pyspark-0 -- bash
jupyter server list
```
Alternatively, you can expose the notebook through a LoadBalancer service by installing the chart with:

```bash
helm install pyspark-notebook pyspark-notebook/pyspark-notebook --set service.type=LoadBalancer
```
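Once your cloud provider provisions the load balancer, you can watch for the external address (assuming the service keeps the default name `pyspark` shown in the output above):

```bash
kubectl get svc pyspark --watch
```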
To access Google Cloud Storage, create a secret from your service account key file:

```bash
kubectl create secret generic gcs-credentials --from-file="./config/key.json"
```
Then alter `values.yaml` to mount the secret and point `GOOGLE_APPLICATION_CREDENTIALS` at the key file (note that the secret name must match the one created above):

```yaml
env:
  - name: GOOGLE_APPLICATION_CREDENTIALS
    value: /mnt/secrets/key.json

extraVolumes:
  - name: secrets
    secret:
      secretName: gcs-credentials

extraVolumeMounts:
  - name: secrets
    mountPath: "/mnt/secrets"
    readOnly: true
```
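You can then apply the modified values to the release. A minimal sketch, assuming the values are kept in a local `values.yaml` and the release name used above:

```bash
helm upgrade --install pyspark-notebook pyspark-notebook/pyspark-notebook -f values.yaml
```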
To access AWS S3, create a secret from a key.json file:

```bash
kubectl create secret generic aws-credentials --from-file="./config/key.json"
```

Or you can create the secret directly in the terminal:

```bash
kubectl create secret generic aws-credentials --from-literal=aws_access_key_id=<YOUR_KEY_ID> --from-literal=aws_secret_access_key=<YOUR_SECRET_KEY>
```
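You can check that both keys ended up in the secret (this lists only the key names and sizes, not the values):

```bash
kubectl describe secret aws-credentials
```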
Then alter `values.yaml` to expose the AWS credentials as environment variables:
```yaml
# Allows you to load environment variables from kubernetes secret
secret:
  - envName: AWS_ACCESS_KEY_ID
    secretName: aws-credentials
    secretKey: aws_access_key_id
  - envName: AWS_SECRET_ACCESS_KEY
    secretName: aws-credentials
    secretKey: aws_secret_access_key
```
And deploy the Helm chart with the `helm install` command shown above.
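For example, assuming the modified values live in a local `values.yaml`:

```bash
helm install pyspark-notebook pyspark-notebook/pyspark-notebook -f values.yaml
```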
For the notebook to connect to AWS S3, you have to set up the correct Spark configuration in your .py file. An example:
```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Spark configuration for S3 access via the s3a connector,
# using the credentials exposed as environment variables
conf = (
    SparkConf()
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.hadoop.fs.s3a.fast.upload", "true")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.jars.packages", "software.amazon.awssdk:s3:2.17.133,org.apache.hadoop:hadoop-aws:3.2.0")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.EnvironmentVariableCredentialsProvider")
)

sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

df = spark.read.parquet("s3a://<BUCKET-NAME>/<TABLE-NAME>/")
df.printSchema()
```
Make sure the credentials you passed as environment variables have access to the S3 bucket.
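As a quick sanity check (assuming the AWS CLI is installed and configured with the same credentials), you can verify the bucket is readable before running the notebook:

```bash
aws s3 ls s3://<BUCKET-NAME>/
```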