Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Z2JH 4.0.0 fails to detect old named server PVCs due to an overriden template in 3.3.8 #3574

Open
manics opened this issue Nov 15, 2024 · 11 comments
Labels

Comments

@manics
Copy link
Member

manics commented Nov 15, 2024

Bug description

Reported in https://discourse.jupyter.org/t/jupyterhub-helm-chart-4-0-0/29887/2

How to reproduce

  1. Deploy Z2JH 3.3.8 with persistent user storage
  2. Start and stop a named server
  3. Upgrade Z2JH to 4.0.0
  4. Start the named server, a new PVC is created

Expected behaviour

Z2JH should use the existing named PVC

Actual behaviour

3.3.8 set pvcNameTemplate: claim-{username}{servername}, overriding the KubeSpawner 6.2.0 default of claim-{username}--{servername}
https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/3.3.8/jupyterhub/values.yaml#L395
https://github.com/jupyterhub/kubespawner/blob/6.2.0/kubespawner/spawner.py#L569

4.0.0 uses the default from KubeSpawner
https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/4.0.0/jupyterhub/values.yaml#L416

As a result the automatic legacy PVC detection in KubeSpawner fails, since it assumes the 6.2.0 template is used
https://github.com/jupyterhub/kubespawner/blob/a8f28439078e42e8012de8f141b51bd6fa96d9c7/kubespawner/spawner.py#L3118

[D 2024-11-15 14:05:10.037 JupyterHub spawner:3117] Checking for legacy-named pvc claim-test-40example-2eorg--named for [email protected]

Instead of claim-test-40example-2eorgnamed

@manics manics added the bug label Nov 15, 2024
@minrk
Copy link
Member

minrk commented Nov 19, 2024

Short-term, a deployment can work around this by

  1. explicitly setting the old template in config
  2. starting and stopping all servers, which persists the pvc name in spawner state
  3. removing the old template config; since PVC names are persisted now, new users will use the new template while existing users will keep their PVCs

alternatively, we could write a bulk-rename script to rename PVCs.

@aychang95
Copy link

@minrk
A bulk-rename script would work for our case. Could also support downgrading if that's within the scope.
May have to deal with servers that now have two PVCs associated with them

@minrk
Copy link
Member

minrk commented Nov 21, 2024

yes, servers that have upgraded and associated with new pvcs are going to be the hardest because we can't know if there's any data in the new pvcs that needs to be preserved, so the only safe option is to merge their contents, which we can't do automatically. But we could at least:

  1. safely rename every old pvc, and
  2. if any pvc was created under the new name would result in a collision, rename the new one identifiably so that they can be manually inspected for merge

or perhaps the simplest is a script that iterates over existing PVCs and servers and pushes the pvc name into Spawner.state (i.e. effectively the same thing as preserving the old template, starting/stopping servers, then updating the template).

@minrk
Copy link
Member

minrk commented Nov 21, 2024

Here is a script that does the last suggestion: https://gist.github.com/minrk/7ad21b18f7a5b3908d74f108dc564cd5

it collects all pvcs and spawners, then associates spawners with their pvcs.

  1. if there's only one match, persist the link in spawner.state (avoids the above bug, as if past versions of kubespawners implemented pvc_name persistence)
  2. if there's more than one, it picks the oldest match (this is what will happen for deployments that have already been upgraded and the template mismatch caused the creation of a new, empty pvc)

it doesn't touch the pvcs themselves, just the association to the servers. Any dealing with removing new pvcs / merging content is something that has to be taken on a case-by-case basis.

@manics
Copy link
Member Author

manics commented Nov 21, 2024

Do you think we should put your script into an optional initContainer in this chart? Perhaps with a flag to control how to resolve multiple matching PVCs (oldest, newest, or return an error to force the admin to sort things out before starting JupyterHub).

@minrk
Copy link
Member

minrk commented Nov 21, 2024

Yes, possibly. I'm also investigating whether this belongs in the legacy pvc check in kubespawner, because I think we can change that to consider labels and annotations as well, so it's not sensitive to template changes across the upgrade.

Either way, if we run this automatically for users, we will need to be careful to consider valid use cases of deliberate sharing of PVCs across servers, e.g.

  1. one pvc per user for all servers: pvc_name_template = '{username}'
  2. shared pvc across users for a given named server: pvc_name_template = '{servername}'
  3. one pvc for everyone (?!): pvc_name_template = 'shared'
  4. manually created PVCs out-of-band, may be missing labels and/or annotations which were not checked previously

whereas it is appropriate for the script above to only handle the default template case, since it is only run by hand, so users have a chance to make a decision whether it is appropriate before it runs.

@aychang95
Copy link

Here is a script that does the last suggestion: https://gist.github.com/minrk/7ad21b18f7a5b3908d74f108dc564cd5

it collects all pvcs and spawners, then associates spawners with their pvcs.

1. if there's only one match, persist the link in spawner.state (avoids the above bug, as if past versions of kubespawners implemented pvc_name persistence)

2. if there's more than one, it picks the oldest match (this is what will happen for deployments that have already been upgraded and the template mismatch caused the creation of a new, empty pvc)

it doesn't touch the pvcs themselves, just the association to the servers. Any dealing with removing new pvcs / merging content is something that has to be taken on a case-by-case basis.

Nice script!

Tried running cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 - --apply, and get "Made # changes", but it doesn't seem to actually applied the changes. If i run the command again, it continues to say "Made # changes". The servers seem to continue to attach to the newest PVC and creates a new PVC even if the old one exists.

@minrk
Copy link
Member

minrk commented Nov 24, 2024

Can you share more of the output of your script run (feel free to obfuscate usernames)? Does the output change at all from one run to the next?

@aychang95
Copy link

Output doesn't seem to change at all for either execs.

Running cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 | grep "dry run"

Output:

!!!!! linking server [email protected]/vector-embeddings to pvc claim-andrew-40example-2ecomvector-2dembeddings (dry run) !!!!
!!!!! linking server [email protected]/datasets-exploratory to pvc claim-andrew-40example-2ecomdatasets-2dexploratory (dry run) !!!!
!!!!! linking server [email protected]/modeling to pvc claim-andrew-40example-2ecommodeling (dry run) !!!!
This was a dry run, no changes were made.

Then we run cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 - --apply

Output:

...
Made 3 changes

However, seems that nothing actually changes since we get the same outputs if we run the same two commands above again. Same dry runs appear without apply and applying it always outputs the same "Made # changes"

@minrk
Copy link
Member

minrk commented Nov 26, 2024

Sorry it doesn't seem to be working for you. Do you have JupyterHub running with active users, and is it running the z2jh 4 or 3? Specifically, have any of the affected servers started or stopped?

I updated the script to log a bit more about what's going on. It shouldn't do anything different, but it might tell us why.

When I run the script for the first time with:

cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 - --apply

I get several 'linking server...' messages, but if I run the same command a second time, I get:

username/ has pvc claim-username
  username/ is linked to only matching pvc claim-username (good)
...
Nothing to do!

@aychang95
Copy link

aychang95 commented Jan 2, 2025

Sorry it doesn't seem to be working for you. Do you have JupyterHub running with active users, and is it running the z2jh 4 or 3? Specifically, have any of the affected servers started or stopped?

I updated the script to log a bit more about what's going on. It shouldn't do anything different, but it might tell us why.

When I run the script for the first time with:

cat collect_pvcs.py | kubectl exec -i $(kubectl get pod -l component=hub -o name) -- python3 - --apply

I get several 'linking server...' messages, but if I run the same command a second time, I get:

username/ has pvc claim-username
  username/ is linked to only matching pvc claim-username (good)
...
Nothing to do!

Happy new year!

Hmm this script seemed to have worked now!
Not sure if there was a specific state change that triggered it since it seems only logs were added, but now legacy-named claims are linked to their respective servers on --apply.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants