-
-
Notifications
You must be signed in to change notification settings - Fork 515
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: don't panic when logs waits for more than 5 seconds (v2) #1278
Conversation
The consumer's Accept method is running in a different gorutine than the main test. We need to synchronize access to the messages slice. Some tests waited for Ack before accessing Msgs and were properly synchronized. TestContainerLogWithErrClosed accessed Msgs while the consumer was running and the memory access was not synchronized. Test_StartStop did not wait for the expected lines to appear before stopping the log producer, there was nothing to guarantee that at least one line would be read. The test passed because currently the producer does not interrupt outstanding HTTP requests to fetch logs, so it implicitly waited for some logs to be received. We now wait for the mlem line to appear before stopping the log producer.
This removes panic when logs endpoint takes more than 5 seconds to respond. The panic happened at least with podman when no new logs appear when using follow and since parameters. We keep retrying until the context is canceled (the retry request would fail anyway with canceled context) or the producer is stopped, whichever comes first. This makes the retry behavior consistent with closed connections handling. Outstanding HTTP calls for fetching logs are now interrupted when a producer is stopped. Previously the consumer and StopProducer() waited for the HTTP call to complete. This should fix testcontainers#946
✅ Deploy Preview for testcontainers-go ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm testing this PR with the same script as in #947:
for x in {1..20}; do go test -timeout 600s -run ^Test_StartStop$ github.com/testcontainers/testcontainers-go -count=1; done
And in general I still see certain flakiness, although different to the one described in that ticket: instead of receiving inconsistent messages, I get a context deadline exceeded
error.
Could you take a look at it?
Apart from that, the code at this moment LGTM
@@ -31,7 +42,41 @@ func (g *TestLogConsumer) Accept(l Log) { | |||
return | |||
} | |||
|
|||
g.Msgs = append(g.Msgs, s) | |||
g.mu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a comment here explaining why the synchronisation must be done at the test level? You already added it in #947 :)
Hey @martin-sucha I came back from parental leave and I'm revisiting this PR. Did you see my previous comments? |
Hey @mdelapenya yes, thank you for the comments! I've been quite busy recently, this keeps falling through the cracks. I hope I can revisit this soon. |
If the context is done, we close the log producer. That is not an error, the context cancellation signals that the consumer should stop. If there is a non-context error during the HTTP call or while reading the response, retry the HTTP request in 1 second again. Previously, the error handling was inconsistent: - an error while reading HTTP response headers would retry the HTTP request - but an error while reading the body would just end the log producer With this commit, the error handling should be more consistent.
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
I fixed the error handling so that context canceled is not treated as an error. It is merely a signal that the log producer should stop. Also updated the error handling for other errors to always retry until the log producer is stopped. |
wdyt about a different approach:
In this way caller would have bigger control over what it wants to do:
Why?:
|
@Tofel I like this. Would you be able to work on it? We are thinking about an eventual v1 version, so adding a breaking change in the signature/behaviour, if done properly, wouldn't be that bad. |
@mdelapenya I will be able to work on that, yes, but I need to confirm with me team when. Will circle back once I know it. |
@mdelapenya it's me again... so my team is okay with me taking the time to propose an alternative solution (although I guess it will have to be a bit different, since we cannot return the error from that goroutine, but why not have an error channel on which the caller could listen in a dedicated goroutine?). Before I do start with it, could you tell me when you're thinking that v1 might see the light of the day? |
Indeed, I admit that we should totally rewrite the log consumer implementation for the v1, as I think it has grown in a non-organic manner.
Not sure about it. I already have a draft doc with the things that I'd like to do/fix, which I'd make it public right after completed, in the form of a GH discussion. From there, I'll start creating issues in case the community wants to participate. |
This PR has been superseded by #1971, please take a look at #1971 (comment) I'd like to thank you for your time elaborating this PR. It has not been merged this time, but came with a lot of light for the final implementation. Thanks! |
This should fix #946
What does this PR do?
There are two commits.
The first commit fixes synchronization issues in producer tests.
The second commit removes panic when logs endpoint takes more than 5 seconds to respond. The panic happened at least with podman when no new logs appear when using follow and since parameters.
We keep retrying until the context is canceled (the retry request would fail anyway) or the producer is stopped, whichever comes first. This makes the retry behavior consistent. The HTTP calls to fetch logs are interrupted immediately (previously the producer waited for in-progress calls to complete).
Why is it important?
Panicking the whole test suite breaks all my tests if the container ends early or there is some temporary error fetching the logs.
Related issues