Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in docker.start #582

Open
ckanibal opened this issue Jul 21, 2020 · 2 comments
Open

Race condition in docker.start #582

ckanibal opened this issue Jul 21, 2020 · 2 comments

Comments

@ckanibal
Copy link

First of all, Thank You! for this amazing project.

When I try to run a short running operation in a new container with the HostOption AutoRemove: true like this:

docker.run(options.image,
        [command, ...args],
        [stdout, stderr],
        {
            Tty: false,
            HostConfig: {
                AutoRemove: true,
                ...(options.customCwd && [`${options.customCwd}:${options.customCwd}:rw`])
            },
        })
        .then(([res, container]) => {
            logger.debug("Docker run successful", {res, container});
            return {
                code: res.StatusCode,
                okToCache,
                filenameTransform,
                stdout: stdout.toString(),
                stderr: stderr.toString(),
            };
        })
        .catch(error => {
            logger.warn("Docker run failed", {error});
            return error;
        });

it occasionally fails with the error message:

Error: (HTTP code 404) no such container - No such container: 919455c4b42dd72b6767c129772a134616d6d8d8e7b54e9f947a0d92b74f2fa2 
    at /project/node_modules/docker-modem/lib/modem.js:301:17
    at getCause (/project/node_modules/docker-modem/lib/modem.js:331:7)
    at Modem.buildPayload (/project/node_modules/docker-modem/lib/modem.js:300:5)
    at IncomingMessage.<anonymous> (/project/node_modules/docker-modem/lib/modem.js:275:14)
    at IncomingMessage.emit (events.js:333:22)
    at IncomingMessage.EventEmitter.emit (domain.js:485:12)
    at endReadableNT (_stream_readable.js:1220:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21)

I tracked the issue down to those lines of code:

dockerode/lib/docker.js

Lines 1486 to 1494 in ed6ef39

container.start(startOptions, function(err, data) {
if (err) return callback(err, data, container);
hub.emit('start', container);
container.wait(function(err, data) {
hub.emit('data', data);
callback(err, data, container);
});
});

The wait operation can fail, if the container is already finished and automatically removed between the start and wait operation. I have no good idea how to mitigate this issue, yet. Somebody care to take a look?

@apocas
Copy link
Owner

apocas commented Apr 5, 2021

Interesting. But I don't see an easy fix for this one.

Checking the container's status will not help. We can't assume the container finished correctly after a very fast lifecycle, something could went bad.

@dekimsey
Copy link

This doesn't seem to be limited to very fast launches, or to rephrase I'm seeing it on containers that run ~3s. I observe it regularly with podman on OSX on my techdocs container that takes a few seconds to execute.

The issue does seems to be race condition between the start() and wait() calls here. Looking at the podman daemon logs I can see the following events:

  1. Podman receives the /start container call
  2. Runs for ~2.8s
  3. container logs last message ("Documentation built ...")
  4. Podman logs the access event to stream logs (presumably this is it closing the stream, but I'm not sure.)
  5. cgroup scope starts shutting down
  6. Podman finishes destroying the container
  7. Podman logs a .../wait request on an already destroyed container

There doesn't seem to be any way to resolve this either. Is there any guidance one could offer a downstream library on how to avoid this?

log excerpt
Jan 30 19:47:29 localhost.localdomain podman[76394]: @ - - [30/Jan/2025:19:47:29 -0600] "POST /containers/e12c6c7d0d8cb83b43066f9eab1f4a913cc5ce247e2759c02467d57f8cd5d738/start HTTP/1.1" 204 0 "" ""
....
Jan 30 19:47:32 localhost.localdomain competent_knuth[76528]: INFO    -  Documentation built in 2.23 seconds
Jan 30 19:47:32 localhost.localdomain podman[76394]: @ - - [30/Jan/2025:19:47:29 -0600] "POST /containers/e12c6c7d0d8cb83b43066f9eab1f4a913cc5ce247e2759c02467d57f8cd5d738/attach?stream=true&stdout=true&stderr=true HTTP/1.1" 200 0 "" ""
Jan 30 19:47:32 localhost.localdomain systemd[1139]: libpod-e12c6c7d0d8cb83b43066f9eab1f4a913cc5ce247e2759c02467d57f8cd5d738.scope: Consumed 2.873s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit UNIT completed and consumed the indicated resources.
Jan 30 19:47:32 localhost.localdomain podman[76394]: 2025-01-30 19:47:32.262329637 -0600 CST m=+4.817239494 container died e12c6c7d0d8cb83b43066f9eab1f4a913cc5ce247e2759c02467d57f8cd5d738 (image=localhost/mkdocs-local:latest, name=competent_knuth, io.buildah.version=1.37.2)
Jan 30 19:47:32 localhost.localdomain kernel: podman0: port 1(veth0) entered disabled state
Jan 30 19:47:32 localhost.localdomain kernel: veth0 (unregistering): left allmulticast mode
Jan 30 19:47:32 localhost.localdomain kernel: veth0 (unregistering): left promiscuous mode
Jan 30 19:47:32 localhost.localdomain kernel: podman0: port 1(veth0) entered disabled state
Jan 30 19:47:32 localhost.localdomain podman[76662]: 2025-01-30 19:47:32.472918367 -0600 CST m=+0.207885268 container remove e12c6c7d0d8cb83b43066f9eab1f4a913cc5ce247e2759c02467d57f8cd5d738 (image=localhost/mkdocs-local:latest, name=competent_knuth, io.buildah.version=1.37.2)
Jan 30 19:47:32 localhost.localdomain podman[76394]: time="2025-01-30T19:47:32-06:00" level=error msg="While waiting on condition: \"no container with ID e12c6c7d0d8cb83b43066f9eab1f4a913cc5ce247e2759c02467d57f8cd5d738 found in database: no such container\""
Jan 30 19:47:32 localhost.localdomain podman[76394]: @ - - [30/Jan/2025:19:47:29 -0600] "POST /containers/e12c6c7d0d8cb83b43066f9eab1f4a913cc5ce247e2759c02467d57f8cd5d738/wait HTTP/1.1" 200 162 "" ""
Jan 30 19:47:37 localhost.localdomain podman[76394]: time="2025-01-30T19:47:37-06:00" level=info msg="Received shutdown.Stop(), terminating!" PID=76394

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants