ensure SLO for server availability #713
Comments
I believe you filed this before many server updates were performed. Since then, service availability and responsiveness have greatly improved, but there is still a lot of work to be done, as open-vsx consumes an inordinate amount of bandwidth. I will file an issue for that.
The performance improvements are great. But this issue is about the Eclipse team being able to recognise that an incident is happening. In the past that was never the case. It is even alright for us if it takes a day or two to resolve an incident, but the incident should be noticed before users notice it.
Related PR: eclipse/openvsx#667
Please consider implementing request duration and failure rate metrics for the OpenVSX server to ensure availability.
In our experience RED metrics (rate, errors, duration) are a good fit for this. @amvanbaren suggested using spring-metrics to collect data for Prometheus.
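To make the request concrete, here is a rough sketch of what recording RED metrics for a single endpoint could look like. It uses only the Java standard library and is not the spring-metrics or Prometheus integration suggested above; the class `RedMetrics` and all of its method names are hypothetical and exist only for illustration.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical RED recorder (rate, errors, duration) for one endpoint. */
final class RedMetrics {
    private final LongAdder requests = new LongAdder();   // Rate: total requests seen
    private final LongAdder failures = new LongAdder();   // Errors: failed requests
    private final LongAdder totalNanos = new LongAdder(); // Duration: summed latency

    /** Records one finished request. */
    void record(long durationNanos, boolean failed) {
        requests.increment();
        if (failed) {
            failures.increment();
        }
        totalNanos.add(durationNanos);
    }

    /** Runs a handler, timing it and counting any thrown exception as a failure. */
    <T> T time(Supplier<T> handler) {
        long start = System.nanoTime();
        boolean failed = true;
        try {
            T result = handler.get();
            failed = false;
            return result;
        } finally {
            record(System.nanoTime() - start, failed);
        }
    }

    long requestCount() {
        return requests.sum();
    }

    /** Fraction of recorded requests that failed, 0.0 if nothing was recorded. */
    double failureRate() {
        long n = requests.sum();
        return n == 0 ? 0.0 : (double) failures.sum() / n;
    }

    /** Mean request duration in milliseconds, 0.0 if nothing was recorded. */
    double meanDurationMillis() {
        long n = requests.sum();
        return n == 0 ? 0.0 : totalNanos.sum() / 1_000_000.0 / n;
    }

    /** True if the observed success ratio is at least the given target, e.g. 0.99. */
    boolean meetsAvailability(double target) {
        return 1.0 - failureRate() >= target;
    }
}
```

In a real deployment these counters would be exported to a metrics backend and alerted on, so that a rising failure rate or latency is visible to the server operators before clients report it.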
At Gitpod we rely on OpenVSX server responsiveness while users are starting workspaces. If a request to OpenVSX fails, the workspace is mostly unusable, since the VS Code frontend times out after 1 minute. We have been working towards an SLO of 99% extension availability and built a caching proxy which allows us to serve 70%-90% of requests for 3 days while OpenVSX is down.
That is not enough to achieve the goal, though. We need to ensure that an issue gets recognised and addressed in OpenVSX itself before users notice it. In the past this was not the case, i.e. https://www.eclipsestatus.io/ usually did not get updated before some Gitpod user pinged us and we then reached out to @eclipsewebmaster. Usually we already have a full-blown incident by that point. Unfortunately it is tricky for us to figure out from the proxy whether there is a real issue upstream, since we are not the only client and a request failure can be caused by the proxy itself. The OpenVSX server looks like the proper place to address the issue.
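For context, the idea behind the caching proxy mentioned above can be sketched roughly as follows. This is not the actual Gitpod proxy code; `StaleOnErrorCache`, its `fetch` method, and the in-memory map are assumptions made for illustration. It shows why such a cache only masks an outage for previously requested entries and for a bounded time.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache that falls back to a stale copy when upstream fails. */
final class StaleOnErrorCache {
    /** A cached response body together with the time it was stored. */
    private static final class Entry {
        final String body;
        final Instant storedAt;

        Entry(String body, Instant storedAt) {
            this.body = body;
            this.storedAt = storedAt;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final Duration maxStale;

    StaleOnErrorCache(Duration maxStale) {
        this.maxStale = maxStale;
    }

    /**
     * Tries upstream first. If upstream throws, serves the cached copy as long as
     * it is not older than maxStale; otherwise returns empty (the request fails).
     */
    Optional<String> fetch(String key, Function<String, String> upstream, Instant now) {
        try {
            String body = upstream.apply(key);
            entries.put(key, new Entry(body, now));
            return Optional.of(body);
        } catch (RuntimeException upstreamFailure) {
            Entry cached = entries.get(key);
            if (cached != null
                    && Duration.between(cached.storedAt, now).compareTo(maxStale) <= 0) {
                return Optional.of(cached.body);
            }
            return Optional.empty();
        }
    }
}
```

With `maxStale` set to `Duration.ofDays(3)`, entries that were fetched before the outage keep being served for three days, while anything never requested before still fails, which is consistent with only a share of requests being covered.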