Request to update branch. #7

Merged 21 commits on Apr 14, 2023
8 changes: 7 additions & 1 deletion README.md

<h2>Overview</h2>

This project deploys a Jupyterhub with scalable compute nodes (for distributed computing) on 3 cloud platforms using AWS EKS, Azure Kubernetes Service (AKS), and Google Kubernetes Engine (GKE). There are several learning objectives for this project:

1. Learn how to create replicable, reusable templates for deploying cloud services using Infrastructure-as-Code (IaC), namely Terraform
2. Understand how to use Kubernetes for container orchestration and scaling of microservices
3. Understand how to deploy scalable Jupyterhubs (e.g. Jupyterhub as a service, Jupyterhub for the classroom)
4. Understand best practices for Jupyterhub deployments, including SSO/OAuth setup, cost optimization, security, networking, and other node scaling mechanisms


<h2>Contents</h2>

39 changes: 39 additions & 0 deletions bind-jhub-fqdn2IP.sh
#!/bin/bash
############################################
# This script binds an FQDN to the
# Jupyterhub static IP address.
# This is useful on Azure; other cloud
# providers have a similar process.
#
# Original script from:
# https://docs.microsoft.com/en-us/azure/aks/ingress-tls
# Synopsis:
#   ./bind-jhub-fqdn2IP.sh <IP> <NAME>
# CC 2023-04-11 Jacob Fosso Tande
#########################################
# Configure an FQDN for the ingress controller IP address

# Public IP address of your ingress controller
IP="$1"

# Name to associate with the public IP address
DNSNAME="$2"

# Get the resource ID of the public IP
PUBLICIPID=$(az network public-ip list --query "[?ipAddress!=null]|[?contains(ipAddress, '$IP')].[id]" --output tsv)

# Update the public IP address with the DNS name
az network public-ip update --ids "$PUBLICIPID" --dns-name "$DNSNAME"

# Display the FQDN
FQDN=$(az network public-ip show --ids "$PUBLICIPID" --query "[dnsSettings.fqdn]" --output tsv)

echo ""
echo "Got FQDN: $FQDN"
echo ""
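A usage sketch for the script above; the IP address and DNS label below are placeholders for illustration, not values from this project, and the Azure CLI must be installed and logged in:

```shell
# Bind the DNS label "myjupyterhub" to the ingress controller's public IP.
# 203.0.113.10 is a documentation-range placeholder address.
./bind-jhub-fqdn2IP.sh 203.0.113.10 myjupyterhub
```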
28 changes: 20 additions & 8 deletions caseyd/README.md
Working directory for Casey Dinsmore


# Common Commands

## Helm

### Install / Reconfigure

```
helm upgrade --cleanup-on-fail \
  --install jhub jupyterhub/jupyterhub \
  --namespace jhub \
  --create-namespace \
  --version=2.0.0 \
  --values config.yaml
```
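The `jupyterhub/jupyterhub` chart reference above assumes the Helm repository has already been added; if it has not, a likely prerequisite (using the chart repository linked elsewhere in this repo) is:

```shell
# Register the JupyterHub Helm chart repository and refresh the local index
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
```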
## Kubectl

### Get Proxy Address

```
kubectl -n jhub get service proxy-public
```

### Show all pod states
```
kubectl get pods -A
```

### View details about a pod including deployment errors

```
kubectl -n jhub describe pod <pod.name>
```
### Get the logs for a pod

```
kubectl -n jhub logs <pod.name>
```

### Get Persistent Volumes/Claims
```
kubectl -n jhub get pv
```
```
kubectl -n jhub get pvc
```
25 changes: 25 additions & 0 deletions caseyd/aws/eksctl/README.md
# AWS EKS cluster config with eksctl

## Resources

https://www.arhea.net/posts/2020-06-18-jupyterhub-amazon-eks


## Issues

* With 4 availability zones in us-west-2, eksctl will randomly pick three, so
  sometimes the deployment will fail.

  Adding the `availabilityZones:` stanza to cluster.yaml resolves the issue, as outlined here:

  https://github.com/weaveworks/eksctl/blob/main/examples/05-advanced-nodegroups.yaml

      availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2d"]

* Hub pods end up stuck in the Pending state:

      running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition

  The resolution here does not seem to work:

  https://discourse.jupyter.org/t/hub-pod-stuck-on-pending-timed-out-binding-volumes/17176
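The cluster.yaml in this directory is applied with eksctl; a typical workflow (assuming eksctl is installed and AWS credentials are configured) might be:

```shell
# Create the EKS cluster from the config file
eksctl create cluster -f cluster.yaml

# Tear the cluster down again when finished
eksctl delete cluster -f cluster.yaml
```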
113 changes: 113 additions & 0 deletions caseyd/aws/eksctl/cluster.yaml
# file: cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: jupyterhub
  region: us-west-2

iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: cluster-autoscaler
        namespace: kube-system
        labels:
          aws-usage: "cluster-ops"
          app.kubernetes.io/name: cluster-autoscaler
      attachPolicy:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - "autoscaling:DescribeAutoScalingGroups"
              - "autoscaling:DescribeAutoScalingInstances"
              - "autoscaling:DescribeLaunchConfigurations"
              - "autoscaling:DescribeTags"
              - "autoscaling:SetDesiredCapacity"
              - "autoscaling:TerminateInstanceInAutoScalingGroup"
              - "ec2:DescribeLaunchTemplateVersions"
            Resource: '*'
    - metadata:
        name: ebs-csi-controller-sa
        namespace: kube-system
        labels:
          aws-usage: "cluster-ops"
          app.kubernetes.io/name: aws-ebs-csi-driver
      attachPolicy:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - "ec2:AttachVolume"
              - "ec2:CreateSnapshot"
              - "ec2:CreateTags"
              - "ec2:CreateVolume"
              - "ec2:DeleteSnapshot"
              - "ec2:DeleteTags"
              - "ec2:DeleteVolume"
              - "ec2:DescribeInstances"
              - "ec2:DescribeSnapshots"
              - "ec2:DescribeTags"
              - "ec2:DescribeVolumes"
              - "ec2:DetachVolume"
            Resource: '*'

managedNodeGroups:
  - name: ng-us-west-2a
    instanceType: t3.medium
    volumeSize: 30
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones:
      - us-west-2a
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/jupyterhub: "owned"
  - name: ng-us-west-2b
    instanceType: t3.medium
    volumeSize: 30
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones:
      - us-west-2b
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/jupyterhub: "owned"
  - name: ng-us-west-2d
    instanceType: t3.medium
    volumeSize: 30
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones:
      - us-west-2d
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/jupyterhub: "owned"

availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2d"]

# Adding the EBS CSI driver to try to resolve volume permissions.
# Does not seem to work (2023/04/11):
# https://discourse.jupyter.org/t/hub-pod-stuck-on-pending-timed-out-binding-volumes/17176
addons:
  - name: aws-ebs-csi-driver
    attachPolicy:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Action:
            - "ec2:AttachVolume"
            - "ec2:CreateSnapshot"
            - "ec2:CreateTags"
            - "ec2:CreateVolume"
            - "ec2:DeleteSnapshot"
            - "ec2:DeleteTags"
            - "ec2:DeleteVolume"
            - "ec2:DescribeInstances"
            - "ec2:DescribeSnapshots"
            - "ec2:DescribeTags"
            - "ec2:DescribeVolumes"
            - "ec2:DetachVolume"
          Resource: '*'
12 changes: 12 additions & 0 deletions caseyd/aws/jup-default.yaml
# This file can update the JupyterHub Helm chart's default configuration values.
#
# For reference see the configuration reference and default values, but make
# sure to refer to the Helm chart version of interest to you!
#
# Introduction to YAML: https://www.youtube.com/watch?v=cdLNKUoMc6c
# Chart config reference: https://zero-to-jupyterhub.readthedocs.io/en/stable/resources/reference.html
# Chart default values: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/HEAD/jupyterhub/values.yaml
# Available chart versions: https://jupyterhub.github.io/helm-chart/
#


62 changes: 62 additions & 0 deletions caseyd/aws/tf/README.md

## Terraform Commands

### Initialize Terraform and install providers
```
terraform init
```

### Validate Terraform file syntax
```
terraform validate
```

### Preview Changes
```
terraform plan
```

### Apply Terraform files
```
terraform apply
```

When the provisioning is complete, details will be provided about the cluster.

```
cluster_endpoint = "https://E44319CC44678D8EE100B7C42A46AE5D.gr7.us-west-2.eks.amazonaws.com"
cluster_name = "education-eks-pAGhwfz9"
cluster_security_group_id = "sg-01f527e90fdbf2f6d"
region = "us-west-2"
```

### Show the current terraform state
```
terraform show
```

This will also show the cluster output information.


## Configure kube for the new cluster

```
aws eks update-kubeconfig --name <clustername>
```

Update kubectl from Terraform output (from the EKS terraform directory)
```
aws eks update-kubeconfig --name $(terraform output -raw cluster_name)
```
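After updating the kubeconfig, it is worth confirming that kubectl can reach the new cluster (a quick sanity check, assuming kubectl is installed):

```shell
# List the worker nodes and show the control-plane endpoint
kubectl get nodes
kubectl cluster-info
```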



## Deleting a terraform deployment
```
terraform destroy
```


# References
* [Terraform EKS Example](https://developer.hashicorp.com/terraform/tutorials/kubernetes/eks)
* [Terraform Helm Example](https://developer.hashicorp.com/terraform/tutorials/kubernetes/helm-provider)
27 changes: 27 additions & 0 deletions caseyd/aws/tf/provision-eks-cluster/.gitignore
# Local .terraform directories
**/.terraform/*

# .tfstate files
*.tfstate
*.tfstate.*
*.tfplan

# Crash log files
crash.log

# Exclude all .tfvars files, which are likely to contain sensitive data such as
# passwords, private keys, and other secrets. These should not be part of version
# control, as they are potentially sensitive and subject to change depending on
# the environment.
*.tfvars

# Ignore override files as they are usually used to override resources locally and so
# are not checked in
override.tf
override.tf.json
*_override.tf
*_override.tf.json

# Ignore CLI configuration files
.terraformrc
terraform.rc