Frequent pod restarts for verdaccio #90
I should also say that we switched the service type to
So I have a deployment that is suffering from pod restarts as well. The issue seems to be caused by sporadic high CPU spikes, which cause the health checks to fail and in turn cause the kubelet to restart the container. We've narrowed down the cause of the CPU spikes to certain requests coming from a specific type of client (JFrog Artifactory). However, we have not been able to figure out what is unique about these requests that causes the CPU spikes. The number of incoming requests is low, so this is not an issue related to request volume.

Can I ask if you have metrics around CPU, and if you are seeing spikes in CPU around the same time the health checks fail and the pod restarts occur? Can you also share what client(s) are making requests to your Verdaccio instance, and whether you see a correlation between the client type (user-agent) and these pod restarts?
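If anyone wants to check for the same correlation on their side, here is a minimal sketch of a generic Express-style middleware that logs per-request latency together with the user-agent (Verdaccio is built on Express; the 1 second threshold, the function name, and the way you wire it in are placeholder assumptions on my part, not something Verdaccio ships with):

```js
// Sketch only: log slow responses together with the client user-agent so that
// latency spikes can be correlated with a client type (e.g. Artifactory).
// Assumes an Express-style app; the 1000 ms threshold is arbitrary.
module.exports = function latencyLogger(req, res, next) {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    if (ms > 1000) {
      console.warn(
        `slow request: ${req.method} ${req.url} ` +
          `user-agent="${req.headers['user-agent']}" took ${ms.toFixed(0)}ms`
      );
    }
  });
  next();
};
```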
@edclement we should probably follow up here and mention that we've solved this and tracked it down to an issue with bcrypt. We plan to open an issue against Verdaccio core and work on a fix for it as well.
As @Splaktar mentioned, we managed to figure out the root cause of this issue for our cluster. The majority of our users authenticate using a JWT token. However, users that use JFrog Artifactory to mirror the packages from our registry must use a username/password to authenticate, since Artifactory does not support the use of a JWT token.

Looking at metrics for our cluster, I noticed that the requests coming from most (but not all) of our Artifactory users had high latency (slow response times). The response latency could reach as high as the maximum configured 60 seconds on all incoming requests when one or more Artifactory instances were sending our cluster a higher than normal volume of requests. CPU would spike to 100% at the same time. Once the 60 second mark was hit and Verdaccio had not yet responded to the request, the load balancer would automatically close the client connection to Verdaccio and respond with a 502. It's also worth mentioning that Artifactory has a habit of sending batches of requests concurrently, sometimes milliseconds apart.

It turns out that the root of this issue was the number of salt rounds used to hash the password. We had initially been using the

There is more to the story here though. It became apparent during the testing process in which we discovered this issue that, when CPU for a Verdaccio pod would spike:
I did a little digging in the Verdaccio
We plan on opening an issue to switch the

@juanpicado - you will likely be interested in these findings.
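To illustrate the effect for anyone following along, here is a rough, self-contained sketch (not the actual Verdaccio code; it assumes the `bcryptjs` package, made-up round counts, and an illustrative `verifyPassword` name) of how the cost of a synchronous compare grows with the salt rounds, and what the asynchronous alternative looks like:

```js
// Rough sketch, not Verdaccio code: each additional salt round roughly doubles
// the work bcrypt does, and the *Sync variants do all of it on the event loop.
const bcrypt = require('bcryptjs');

const password = 'correct horse battery staple';

for (const rounds of [8, 10, 12]) {
  const hash = bcrypt.hashSync(password, rounds);
  const start = process.hrtime.bigint();
  bcrypt.compareSync(password, hash); // blocks the event loop for the whole compare
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${rounds} rounds: compareSync took ~${ms.toFixed(1)} ms`);
}

// The promise-based API does not hold the event loop for the entire compare,
// so a burst of concurrent password logins does not stall every other request.
async function verifyPassword(password, hash) {
  return bcrypt.compare(password, hash);
}
```

As far as I understand, `bcryptjs` chunks the async work so it periodically yields back to the event loop, while the native `bcrypt` package moves it onto the libuv threadpool; either way, concurrent Artifactory-style bursts no longer queue behind a single CPU-bound synchronous call.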
@edclement thanks for the detailed description. I was not aware, to be honest, that there was a sync function. I'll enable the sync warnings from Node.js from now on to detect these kinds of cases. I'll check it out :)
I meant this: https://nodejs.org/api/cli.html#--trace-sync-io. Just in case, you can also use it to find more sync places around verdaccio, but it is very noisy.
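A tiny self-contained demo of what the flag reports (the file name `demo.js` is just an example); note that it traces synchronous I/O specifically, so purely CPU-bound sync work such as a bcrypt compare may not show up in its output:

```js
// Run with: node --trace-sync-io demo.js
// Node prints a warning with a stack trace whenever synchronous I/O happens
// after the first turn of the event loop.
const fs = require('fs');

fs.readFileSync(__filename); // startup (first turn): not reported

setImmediate(() => {
  fs.readFileSync(__filename); // after the first turn: reported with a stack trace
});
```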
Running via helm with image 5.8.0 and getting the same issue with 3 replicas, because of failed liveness probes (which will be configurable soon). I suspect the npm runtime on the docker image itself may need tweaking, or there is more synchronous code.
We are seeing frequent pod restarts when building our software project.
There aren't any logs associated with the pod restarts and in the pod events we see that the pods are restarted due to liveness probe failures:
The only significant difference from our values file is how we handled the packages section in the config:
We also tried vertically and horizontally scaling up the pods to no avail:
These restarts wouldn't be a hassle if they didn't interrupt the build process: