[Enhancement]: controllable retries for pulling images #2898
Comments
Thanks for raising this issue. We have experienced this a bit in the past in our own CI. @stevenh could this be related to your observations on pull errors?
Interesting, is that always the error when you get a failure, @srenatus?
I have no data at hand right now, but one observation I've made frequently is that when it fails, it fails badly -- i.e. one test has a docker pull problem, another can't talk to its service in time, and another fails for some other reason. So it feels like there were bad apples among the GitHub Actions runners, or Docker Hub mini-outages, or something like that. But those would be unrelated to TC, I think? 🤔 So if your question was "Is it always this specific image?" -- no, I don't think so. It's always some cluster of transient issues 😬
I've seen similar. Logically you would expect one test to fail and the ones run later to succeed, but as you say, if one fails, the ones run after it also seem to fail. I have been suspecting login issues, which in turn result in pull failures due to rate limiting, but it's proved hard to confirm. We have had some debug logging in TC for the last few releases which we were hoping would shed some light. Any logs you have which show issues would be appreciated.
OK, I'm not sure if this helps, but here's a log of one of these "when it rains, it pours" failure modes.
We see three different tests failing here: one using neo4j, one using mongodb, and one using mssql. 💥 💥 💥 😅
Proposal
Heya!
We're using TC 0.34.0, and you know we're huge fans. 👏
Regardless, our CI runs have been too flaky lately -- it takes a few retries (done manually, via the GitHub UI) to get them into a green state. This is not a TC problem. However, looking closely at what goes wrong, I've found one thing that seems avoidable:
That image most certainly exists. So, guessing at what happened, Docker Hub might have had some intermittent issues -- but it would be great if our CI runs didn't fail because of that.
I've found #2502, which declares "not found" as non-retriable, but I'm wondering whether this could be made controllable...?
Or maybe there is some parameter that lets me give the setup more leeway when running on GitHub Actions? 🤔 Maybe something like this already exists -- so I figured I'd just raise the question with you. Thanks in advance!
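In the meantime, a user-side workaround might be to wrap container startup in a plain retry loop. This is only a hedged sketch, not an existing TC option: the `startWithRetry` helper, the attempt count, and the backoff are made up for illustration, and it assumes that transient pull failures surface as the error returned by `testcontainers.GenericContainer`.

```go
package ci

import (
	"context"
	"fmt"
	"time"

	"github.com/testcontainers/testcontainers-go"
)

// startWithRetry is a hypothetical helper (not part of testcontainers-go):
// it retries container startup a few times to ride out transient registry
// hiccups, sleeping a bit longer before each new attempt.
func startWithRetry(ctx context.Context, req testcontainers.GenericContainerRequest, attempts int) (testcontainers.Container, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		c, err := testcontainers.GenericContainer(ctx, req)
		if err == nil {
			return c, nil
		}
		lastErr = err
		// Simple linear backoff; tune for your CI environment.
		time.Sleep(time.Duration(i+1) * 5 * time.Second)
	}
	return nil, fmt.Errorf("starting container after %d attempts: %w", attempts, lastErr)
}
```

Usage would look something like the following (image name and port are just examples):

```go
req := testcontainers.GenericContainerRequest{
	ContainerRequest: testcontainers.ContainerRequest{
		Image:        "neo4j:5",
		ExposedPorts: []string{"7687/tcp"},
	},
	Started: true,
}
container, err := startWithRetry(ctx, req, 3)
```

Note that this retries the whole create-and-start step from outside, so it sidesteps the "not found is non-retriable" classification from #2502 rather than changing it; a built-in, configurable retry policy would still be the nicer solution.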