Unable to bring cluster deployment back after a restart #23

Open
adrian-a-graham opened this issue Feb 6, 2025 · 0 comments

I am deploying this on GCP and would like to stop my resources when they are not in use, then restart them when needed. Unfortunately, after a reboot, the Kubernetes cluster on the app-master VM does not fully recover.

A few hours after restarting the VM, many pods are still not ready, stuck in either CrashLoopBackOff or Unknown status:

ubuntu@tokkiodemo-app-master:~$ kubectl get pods -n app
NAME                                                        READY   STATUS             RESTARTS         AGE
a2f-a2f-deployment-79fc848877-lf7v6                         0/1     Unknown            0                23h
ace-agent-chat-controller-deployment-0                      1/1     Running            1 (161m ago)     23h
ace-agent-chat-engine-deployment-f45497ff9-6vsr4            1/1     Running            1 (161m ago)     23h
ace-agent-plugin-server-deployment-6d9c679489-g5pjq         1/1     Running            1 (161m ago)     23h
anim-graph-sdr-envoy-sdr-deployment-55c9cd8944-s64zx        3/3     Running            3 (161m ago)     23h
chat-controller-sdr-envoy-sdr-deployment-57bfdf8888-8crms   3/3     Running            3 (161m ago)     23h
ds-sdr-envoy-sdr-deployment-f6d58c956-2v8fk                 3/3     Running            3 (161m ago)     23h
ds-visionai-ds-visionai-deployment-0                        0/1     CrashLoopBackOff   35 (80s ago)     23h
ia-animation-graph-microservice-deployment-0                0/1     CrashLoopBackOff   35 (86s ago)     23h
ia-omniverse-renderer-microservice-deployment-0             0/1     CrashLoopBackOff   36 (2m2s ago)    23h
ia-omniverse-renderer-microservice-deployment-1             0/1     CrashLoopBackOff   36 (102s ago)    23h
ia-omniverse-renderer-microservice-deployment-2             0/1     CrashLoopBackOff   36 (88s ago)     23h
ia-omniverse-renderer-microservice-deployment-3             0/1     CrashLoopBackOff   36 (68s ago)     23h
ia-omniverse-renderer-microservice-deployment-4             0/1     CrashLoopBackOff   36 (71s ago)     23h
ia-omniverse-renderer-microservice-deployment-5             0/1     CrashLoopBackOff   35 (2m36s ago)   23h
mongodb-mongodb-64d69c8469-px7s2                            1/1     Running            1 (161m ago)     23h
occupancy-alerts-api-app-559d6df449-6c9rw                   1/1     Running            4 (159m ago)     23h
occupancy-alerts-app-65c99b5f9d-qhnmz                       1/1     Running            1 (161m ago)     23h
redis-redis-5c446c5565-jghtq                                1/1     Running            1 (161m ago)     23h
redis-timeseries-redis-timeseries-5f57d89965-qlntz          1/1     Running            1 (161m ago)     23h
renderer-sdr-envoy-sdr-deployment-6fb868584c-krh4t          3/3     Running            3 (161m ago)     23h
riva-speech-547fb9b8c5-rrkhq                                0/1     Unknown            0                23h
tokkio-ingress-mgr-deployment-7d4f5858c4-5rs9s              3/3     Running            3 (161m ago)     23h
tokkio-ui-server-deployment-d88868b96-cvgld                 1/1     Running            1 (161m ago)     23h
tokkio-umim-action-server-deployment-db9fc78-khhvm          1/1     Running            1 (161m ago)     23h
triton0-7ccdd556bc-v9kwf                                    0/1     Unknown            0                23h
vms-vms-768b6ff69-qmr64                                     1/1     Running            1 (161m ago)     23h
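
For anyone triaging a listing like the one above, a minimal Python sketch that groups the not-ready pods by status may help. It assumes the default `kubectl get pods` column layout (NAME, READY, STATUS first); the function name `unhealthy_pods` and the shortened `SAMPLE` text are illustrative, not part of the project.

```python
from collections import defaultdict


def unhealthy_pods(listing: str) -> dict:
    """Group pods whose READY count is incomplete by their STATUS column."""
    groups = defaultdict(list)
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip the header row, shell prompts, and anything that is not a pod row.
        if len(fields) < 3 or fields[0] == "NAME" or "/" not in fields[1]:
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_now, _, ready_total = ready.partition("/")
        if ready_now != ready_total:
            groups[status].append(name)
    return dict(groups)


# Shortened excerpt of the listing above (assumption: same column layout).
SAMPLE = """\
NAME                                   READY   STATUS             RESTARTS        AGE
a2f-a2f-deployment-79fc848877-lf7v6    0/1     Unknown            0               23h
ds-visionai-ds-visionai-deployment-0   0/1     CrashLoopBackOff   35 (80s ago)    23h
mongodb-mongodb-64d69c8469-px7s2       1/1     Running            1 (161m ago)    23h
riva-speech-547fb9b8c5-rrkhq           0/1     Unknown            0               23h
"""

if __name__ == "__main__":
    for status, names in unhealthy_pods(SAMPLE).items():
        print(f"{status}: {len(names)} pod(s)")
        for name in names:
            print(f"  {name}")
```

Piping the full `kubectl get pods -n app` output into this function instead of `SAMPLE` should separate the CrashLoopBackOff pods from the ones stuck in Unknown.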

The logs for the failed pods show a different reason for each failure.
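
Since the failure reasons differ per pod, it may be useful to collect the same diagnostics for each one. The sketch below only builds the `kubectl` command lines and does not run them; the helper name `diagnostic_commands` and the `--tail=50` limit are assumptions, while the `app` namespace and pod names come from the listing in this issue.

```python
import shlex


def diagnostic_commands(pod: str, namespace: str = "app") -> list:
    """Return kubectl argv lists for inspecting one failing pod."""
    return [
        # Events and container states, including the last termination reason.
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        # Logs from the previous (crashed) container instance.
        ["kubectl", "logs", pod, "-n", namespace, "--previous", "--tail=50"],
    ]


if __name__ == "__main__":
    # Pod names taken from the listing in this issue.
    failing = [
        "ds-visionai-ds-visionai-deployment-0",
        "ia-animation-graph-microservice-deployment-0",
    ]
    for pod in failing:
        for argv in diagnostic_commands(pod):
            print(shlex.join(argv))
```

Note that `--previous` only applies to containers that have restarted, so for the pods in Unknown status with 0 restarts the `describe` output is likely the more informative of the two.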

Also, I am unable to SSH into the app-master VM using the generated command following deployment. I am, however, able to SSH from my local workstation using gcloud compute ssh.
