-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
log storage feature needs to be rewritten #763
Comments
| where we are close to high level agreement on doing log storage in the watcher And on a different note, since you've already spent a lot of time testing, could you please share what's the easiest way to reproduce the issue, e.g local Kind environment, how and what pipelineruns are you running, etc? |
per recent exchange with @khrm @sayan-biswas I mean the watcher @enarha .... we need to avoid GRPC calls where we obtain / wait on status like we have to do with UpdateLog, and thus need to interact with s3, etc. from our current watcher process. We may have to do a vanilla GRPC get or two for metadata, but and the idea of even bypassing that to scaffold the metadata needed from what is already available in the watcher to engage with the external storage seemed feasible upon the last code review. If not, those vanilla GRPC gets will scale better than having to confirm the S3 storage succeeded on the API server side.
I think deployment of tekton results is up to the implementor. Turns out deploying on openshift with our bootstrapping of minio and postgresql was easiest for me, so that is what I did. But I'm sure others might find using kind etc. easier. Between @pmacik and I we had two layers of testing. Mine was a simple pipelinerun and bash script which I'll attach here. That definitely exhibited the http2 body ready bottleneck, which I think would be a sufficient unit test for this. @pmacik 's test employed a much more sophisticated pipeline used at Red Hat's internal deployment of Results ("Konflux"). In addition to the http2 body bottleneck, he could also exhibit the massive scaling issues with the internal GRPC WriteStatus function which we uncovered after adding debug into the GRPC code in this repo's vendor tree. I'll post a duplication of my internal report in this issue in a separate comment. Ultimately, once a developer has performance the unit test level validation with my test case, I would suggest the Results WG reach out to @pmacik to see about running his test with a version of these changes. If/when that point is reached, I can point you to the branches our our internal deployment we used for @pmacik's testing. |
A copy/paste of analysis I shared internally at RH (with some removal of internal links): OK verion 3 of the mem leak fix, which still makes GRPC calls to send logs, but does it all thread, with high timeouts, has been running in stage for 10 days. While it is tolerable performance wise vs. the on thread version, but is still 3X slower, and the success rate is only on par at best (i.e. cancelled context aborts, etc) then the current version on konflux with the mem leak which does not properly handle context. The upstream WG, including Khurram Baig from openshift pipelines, agrees the function needs to be rewritten. I was able to do a prototype where GRPC was only used as as signal for the API server to read the logs, but while that was an improvement, still hlt GRPC / HTTP2 throttling / bottlenecks in my testing. We all agree now that GRPC needs to be bypassed, and have the watcher access S3 directly. I have opened #763 upstream to track. On the performance problems, first, wrt the http2 body read bottleneck, the performance team lined me up with a golang performance engineer who I have been working with in https://redhat-internal.slack.com/archives/C060Z6FG7UK/p1718108397779749 . We have confirmed the http2 body performance is well documented on the web, across other languages, but agreed it investigation at the golang level was warranted. After about a week of testing and customized golang code, we still don't have any data at the golang level. Bottom line, though I think at best we would just confirm what we read on the web across various languanges wrt http2 body access, flow control, etc. Next, with the continued efforts between myself and Pavel Macik I believe we have made some headway on the 9 plus minute delays he saw with the "GGMGGM16" logs out of https://github.com/gabemontero/results/tree/downstream-http-vs-grpc with the lates round of debug. Contact Pavel (@pmacik) if you need access to his spreadsheet. The GGMGGM16, GGMGGM20, GGMGGM31, and GGMGGM33 logs all hit the 10 minute mark (600 seconds) in duration when the pipelineruns all complete concurently and write their logs. The basic call stack is GGMGGM16, GGMGGM20, then GGMGGM33, then GGMGGM31.
https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/server.go#L1735 is GGMGGM33 Now, there were no GGMGGM34 logs https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L244 in Pavel's test run. And that runStream is over at https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L578-L586 where the There are GGMGGM 24 logs from https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L382 .... we are writing data there .... I see as long as 2 minutes there iteration-0002-0075.tekton-results-api.logs:[pod/tekton-results-api-5b668c456c-s487j/api] GGMGGM24 serverHandlerTransport Write do Write data 2m0.299206724s ts 1718897216.849309923 There are GGMGGM25 logs from https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L389 but a lot of those are in nanoseconds, like iteration-0002-0075.tekton-results-api.logs:[pod/tekton-results-api-5b668c456c-s487j/api] GGMGGM25 serverHandlerTransport Write do Write data 100ns ts 1718896109.647620007 But I now see a bug where I am printing the Write duration (duration4) vs. the Flush duration (durationa5) ... bottom line, the Flush can take time. There are no GGMGGM26 through GGMGGM30 logs for processing of the headers. So it is just the body of the response, which gets back to our http2 body access (in this case writes vs. reads). Also, we don't hit the transport is closed do metrics bit at https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L291-L299 because the server stays up Lastly, back to the It is single threaded. That can only mean we are blocked on the mutex at https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/internal/transport/handler_server.go#L291-L299 i.e. this server when writing status appears SINGLE THREADED !!! Further proof:
|
Also for reference, the performance issues with GRPC's ServerHTTP appear to be old news. For example, see grpc/grpc-go#586 Results has to use ServerHTTP to provide vanilla HTTP(2) Rest access Also, see the query https://github.com/grpc/grpc-go/pulls?q=is%3Apr+is%3Aclosed+performance+in%3Atitle for other forays into this topic over the years. As of this comment, there a no more open issues related to performance. Lastly, none of the tuning options I found like https://github.com/gabemontero/results/blob/downstream-http-vs-grpc/vendor/google.golang.org/grpc/server.go#L163-L178 (some of which are mentioned in those closed performance issues) provided any benefit our testing. |
Expected Behavior
Storage of logs succeeds at a high success rate with reasonable performance
Actual Behavior
In a sufficiently sized production system, addressing
is untenable with the current design.
2 months of performance analysis of GRPC and golang's HTTP2 have uncovered repeatedly report issues on the internet about HTTP2 Request Body access not scaling / becoming increadibly slow, as well GRPC level throttling which the current tuning options in the golang GRPC code have yet to resolve.
We have a prototype already that showed improvement wrt the HTTP2 body scaling issue, where we did pod log retrieval in the API server and then stored to S3 there, but it was still slowed by GRPC throttling on the background threads, leading to long delays before logs were actually stored.
The WG has discussed on calls and in slack, where we are close to high level agreement on doing log storage in the watcher, and only update the Log Records on success / failure.
@khrm @enarha @sayan-biswas @avinal FYI
The text was updated successfully, but these errors were encountered: