
[Enhancement]: controllable retries for pulling images #2898

Open
srenatus opened this issue Nov 28, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@srenatus
Contributor

Proposal

Heya!

We're using TC 0.34.0, and you know we're huge fans. 👏

Regardless, our CI runs have been too flaky lately -- it takes a few retries (manually, via the GitHub UI) to get them into a green state. This is not a TC problem. However, looking closely at what goes wrong, I've found one thing that seems avoidable:

lifecycle.go:62: 🐳 Creating container for image gvenzl/oracle-free:slim-faststart
our_test.go:58: create container: container create: Error response from daemon: No such image: gvenzl/oracle-free:slim-faststart

That image most certainly exists. So, guessing at what happened, Docker Hub might have had some intermittent issues -- but it would be great if our CI run didn't fail because of that.

I've found #2502, which declares "not found" as non-retriable, but I'm wondering if this could be made controllable...?

Or maybe there's some parameter that lets me give the setup more leeway when running on GitHub Actions? 🤔 Maybe something like this already exists -- so I figured I'd just raise the question with you. Thanks in advance!
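
To make the ask a bit more concrete: what we'd otherwise end up writing on our side is a retry wrapper around GenericContainer, roughly like the sketch below. This is purely illustrative -- startWithRetry, the attempt count, the backoff, and the "No such image" substring check are all made up here, not anything TC provides today:

    package ciutil

    import (
        "context"
        "strings"
        "time"

        "github.com/testcontainers/testcontainers-go"
    )

    // startWithRetry is a hypothetical helper (not a TC API): it wraps
    // GenericContainer and retries when the error looks like the transient
    // "No such image" pull failure from the log above.
    func startWithRetry(ctx context.Context, req testcontainers.ContainerRequest, attempts int) (testcontainers.Container, error) {
        var lastErr error
        for i := 0; i < attempts; i++ {
            c, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
                ContainerRequest: req,
                Started:          true,
            })
            if err == nil {
                return c, nil
            }
            lastErr = err
            // Only retry what looks like a transient registry hiccup;
            // everything else fails fast.
            if !strings.Contains(err.Error(), "No such image") {
                return nil, err
            }
            time.Sleep(time.Duration(i+1) * 2 * time.Second) // crude linear backoff
        }
        return nil, lastErr
    }

Each test would then call something like startWithRetry(ctx, testcontainers.ContainerRequest{Image: "gvenzl/oracle-free:slim-faststart"}, 3) instead of GenericContainer directly. But that has to be repeated in every project, which is why a knob in TC itself would be nicer.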

@srenatus srenatus added the enhancement New feature or request label Nov 28, 2024
@mdelapenya
Member

Thanks for raising this issue. We have experimented with this a bit in the past, in our own CI. @stevenh, could this be related to your observations on pull errors?

@stevenh
Collaborator

stevenh commented Dec 12, 2024

Interesting, is that always the error when you get a failure, @srenatus?

@srenatus
Contributor Author

srenatus commented Dec 12, 2024

I have no data at hand right now, but one observation I've made frequently is that if it fails, it fails badly -- i.e. one test has a docker pull problem, another can't talk to its service in time, yet another fails for some other reason. So it feels like there were bad apples among the GitHub Actions runners, or Docker Hub mini-outages, or something like that. But those would be unrelated to TC, I think? 🤔

So if your question was "Is it always this specific image?" -- no, I don't think so. It's always some cluster of transient issues 😬

@stevenh
Collaborator

stevenh commented Dec 12, 2024

I've seen similar. Logically you would expect one test to fail but the ones run later to succeed, yet as you say, if one fails, others run after it also seem to fail.

I have been suspecting login issues, which in turn result in pull failures due to rate limiting, but it's proved hard to confirm. We have had some debug logging in TC for the last few releases which we were hoping would shed some light on this.

Any logs you have which show issues would be appreciated.

@srenatus
Contributor Author

OK, I'm not sure if this helps, but here's a log of one of these "when it rains, it pours" failure modes:

--- FAIL: TestNeo4jQuery (109.80s)
    lifecycle.go:62: 🐳 Creating container for image neo4j:5.13.0-bullseye
    lifecycle.go:68: ✅ Container created: 05bbad81dc65
    lifecycle.go:74: 🐳 Starting container: 05bbad81dc65
    lifecycle.go:80: ✅ Container started: 05bbad81dc65
    lifecycle.go:270: ⏳ Waiting for container id 05bbad81dc65 image: neo4j:5.13.0-bullseye. Waiting for: &{timeout:<nil> deadline:<nil> Strategies:[0xc0010b7410 0xc0010b7440 0xc0010b7470]}
    lifecycle.go:86: 🔔 Container is ready: 05bbad81dc65
    neo4j_test.go:31: TransactionExecutionLimit: timeout (exceeded max retry time: 30s) after 6 attempts, last error: ConnectivityError: Unable to retrieve routing table from localhost:32796: server responded HTTP. Make sure you are not trying to connect to the http endpoint (HTTP defaults to port 7474 whereas BOLT defaults to port 7687)
--- FAIL: TestMongoDBFindOne (33.03s)
    lifecycle.go:62: 🐳 Creating container for image mongo:8
    lifecycle.go:68: ✅ Container created: 730531476bbc
    lifecycle.go:74: 🐳 Starting container: 730531476bbc
    lifecycle.go:80: ✅ Container started: 730531476bbc
    lifecycle.go:270: ⏳ Waiting for container id 730531476bbc image: mongo:8. Waiting for: &{timeout:<nil> deadline:<nil> Strategies:[0xc001a15b30 0xc001a15b60]}
    lifecycle.go:86: 🔔 Container is ready: 730531476bbc
    mongodb_test.go:186: server selection error: server selection timeout, current topology: { Type: Unknown, Servers: [{ Addr: localhost:32806, Type: Unknown, Last error:  connection(localhost:32806[-83]) socket was unexpectedly closed: EOF: connection(localhost:32806[-83]) socket was unexpectedly closed: EOF }, ] }
--- FAIL: TestSQLSend (182.16s)
    lifecycle.go:62: 🐳 Creating container for image mysql:9.0.0
    lifecycle.go:68: ✅ Container created: 5c44f865f05f
    lifecycle.go:74: 🐳 Starting container: 5c44f865f05f
    lifecycle.go:80: ✅ Container started: 5c44f865f05f
    lifecycle.go:270: ⏳ Waiting for container id 5c44f865f05f image: mysql:9.0.0. Waiting for: &{timeout:<nil> Log:port: 3306  MySQL Community Server IsRegexp:false Occurrence:1 PollInterval:100ms}
    lifecycle.go:86: 🔔 Container is ready: 5c44f865f05f
    lifecycle.go:62: 🐳 Creating container for image postgres:16.3
    lifecycle.go:68: ✅ Container created: ce06e299ee07
    lifecycle.go:74: 🐳 Starting container: ce06e299ee07
    lifecycle.go:80: ✅ Container started: ce06e299ee07
    lifecycle.go:270: ⏳ Waiting for container id ce06e299ee07 image: postgres:16.3. Waiting for: &{timeout:<nil> deadline:0xc001073208 Strategies:[0xc0015d8450]}
    lifecycle.go:86: 🔔 Container is ready: ce06e299ee07
    docker.go:1334: Failed to get image auth for mcr.microsoft.com. Setting empty credentials for the image: mcr.microsoft.com/mssql/server:2022-latest. Error is: credentials not found in native keychain
    lifecycle.go:62: 🐳 Creating container for image mcr.microsoft.com/mssql/server:2022-latest
    lifecycle.go:68: ✅ Container created: c15ea03ad42c
    lifecycle.go:74: 🐳 Starting container: c15ea03ad42c
    lifecycle.go:80: ✅ Container started: c15ea03ad42c
    lifecycle.go:270: ⏳ Waiting for container id c15ea03ad42c image: mcr.microsoft.com/mssql/server:2022-latest. Waiting for: &{timeout:<nil> Log:Recovery is complete. IsRegexp:false Occurrence:1 PollInterval:100ms}
    lifecycle.go:86: 🔔 Container is ready: c15ea03ad42c
    lifecycle.go:62: 🐳 Creating container for image gvenzl/oracle-free:slim-faststart
    lifecycle.go:68: ✅ Container created: 95b877adde79
    lifecycle.go:74: 🐳 Starting container: 95b877adde79
    lifecycle.go:80: ✅ Container started: 95b877adde79
    lifecycle.go:270: ⏳ Waiting for container id 95b877adde79 image: gvenzl/oracle-free:slim-faststart. Waiting for: &{timeout:<nil> deadline:<nil> Strategies:[0xc000e1ca50 0xc000e1ca80]}
    lifecycle.go:86: 🔔 Container is ready: 95b877adde79
    --- FAIL: TestSQLSend/sqlserver:_a_single_row_query (0.01s)
        sqlsend_test.go:530: eval_builtin_error: sql.send: invalid packet size, it is longer than buffer size
FAIL
FAIL	github.com/styrainc/enterprise-opa-private/pkg/builtins	335.200s

We see three different tests failing here: one using neo4j, one using mongodb, and one using mssql. 💥 💥 💥 😅
