
podman 5.2.0 does not close connections properly on MacOS #23616

Closed
maciej-szlosarczyk opened this issue Aug 14, 2024 · 12 comments
Labels
kind/bug (Categorizes issue or PR as related to a bug), machine, macos (MacOS (OSX) related), regression, remote (Problem is in podman-remote)

Comments

@maciej-szlosarczyk

Issue Description

After upgrading to 5.2.0, gvproxy keeps connections open for a long time. I noticed it while running tests against PostgreSQL 11 running in a container: after a few runs, Postgres would accumulate enough connections and memory usage that it would either get killed due to memory limits or report that too many connections are open.

This is the output of lsof from both the host and inside the podman machine ten minutes after I ran the test:

podman-machine $ lsof -nP -iTCP

rootlessp 2572 core    4u  IPv6  12796      0t0  TCP 192.168.127.2:5432->192.168.127.1:26886 (ESTABLISHED)
rootlessp 2572 core   11u  IPv6  12698      0t0  TCP *:5432 (LISTEN)
rootlessp 2572 core   12u  IPv6  12797      0t0  TCP 192.168.127.2:5432->192.168.127.1:26884 (ESTABLISHED)
rootlessp 2572 core   20u  IPv6  12042      0t0  TCP 192.168.127.2:5432->192.168.127.1:26887 (ESTABLISHED)
rootlessp 2572 core   22u  IPv6  12043      0t0  TCP 192.168.127.2:5432->192.168.127.1:26885 (ESTABLISHED)
rootlessp 2572 core   30u  IPv6  12799      0t0  TCP 192.168.127.2:5432->192.168.127.1:26888 (ESTABLISHED)
rootlessp 2572 core   37u  IPv6  12801      0t0  TCP 192.168.127.2:5432->192.168.127.1:26890 (ESTABLISHED)
rootlessp 2572 core   38u  IPv6  12802      0t0  TCP 192.168.127.2:5432->192.168.127.1:26893 (ESTABLISHED)
rootlessp 2572 core   41u  IPv6  12805      0t0  TCP 192.168.127.2:5432->192.168.127.1:26889 (ESTABLISHED)
rootlessp 2572 core   42u  IPv6  12806      0t0  TCP 192.168.127.2:5432->192.168.127.1:26892 (ESTABLISHED)
rootlessp 2572 core   45u  IPv6  12809      0t0  TCP 192.168.127.2:5432->192.168.127.1:26891 (ESTABLISHED)
rootlessp 2572 core   83u  IPv6  12063      0t0  TCP 192.168.127.2:5432->192.168.127.1:26894 (ESTABLISHED)
rootlessp 2572 core   84u  IPv6  12064      0t0  TCP 192.168.127.2:5432->192.168.127.1:26895 (ESTABLISHED)
rootlessp 2572 core   87u  IPv6  14581      0t0  TCP 192.168.127.2:5432->192.168.127.1:26896 (ESTABLISHED)
rootlessp 2572 core  104u  IPv6  12069      0t0  TCP 192.168.127.2:5432->192.168.127.1:26899 (ESTABLISHED)
rootlessp 2572 core  105u  IPv6  12070      0t0  TCP 192.168.127.2:5432->192.168.127.1:26897 (ESTABLISHED)
rootlessp 2572 core  107u  IPv6  14587      0t0  TCP 192.168.127.2:5432->192.168.127.1:26898 (ESTABLISHED)
rootlessp 2572 core  109u  IPv6  14588      0t0  TCP 192.168.127.2:5432->192.168.127.1:26900 (ESTABLISHED)
rootlessp 2572 core  128u  IPv6  13660      0t0  TCP 192.168.127.2:5432->192.168.127.1:26901 (ESTABLISHED)
rootlessp 2572 core  134u  IPv6  13661      0t0  TCP 192.168.127.2:5432->192.168.127.1:26903 (ESTABLISHED)
rootlessp 2572 core  136u  IPv6  14598      0t0  TCP 192.168.127.2:5432->192.168.127.1:26902 (ESTABLISHED)
rootlessp 2572 core  137u  IPv6  14599      0t0  TCP 192.168.127.2:5432->192.168.127.1:26904 (ESTABLISHED)
rootlessp 2572 core  140u  IPv6  12838      0t0  TCP 192.168.127.2:5432->192.168.127.1:26906 (ESTABLISHED)
rootlessp 2572 core  141u  IPv6  12839      0t0  TCP 192.168.127.2:5432->192.168.127.1:26905 (ESTABLISHED)
rootlessp 2572 core  142u  IPv6  12840      0t0  TCP 192.168.127.2:5432->192.168.127.1:26913 (ESTABLISHED)
rootlessp 2572 core  146u  IPv6  12843      0t0  TCP 192.168.127.2:5432->192.168.127.1:26916 (ESTABLISHED)
rootlessp 2572 core  147u  IPv6  12844      0t0  TCP 192.168.127.2:5432->192.168.127.1:26911 (ESTABLISHED)
rootlessp 2572 core  150u  IPv6  14602      0t0  TCP 192.168.127.2:5432->192.168.127.1:26914 (ESTABLISHED)
rootlessp 2572 core  151u  IPv6  14603      0t0  TCP 192.168.127.2:5432->192.168.127.1:26908 (ESTABLISHED)
rootlessp 2572 core  154u  IPv6  12848      0t0  TCP 192.168.127.2:5432->192.168.127.1:26915 (ESTABLISHED)
rootlessp 2572 core  155u  IPv6  12849      0t0  TCP 192.168.127.2:5432->192.168.127.1:26907 (ESTABLISHED)
rootlessp 2572 core  158u  IPv6  12852      0t0  TCP 192.168.127.2:5432->192.168.127.1:26912 (ESTABLISHED)
rootlessp 2572 core  159u  IPv6  12853      0t0  TCP 192.168.127.2:5432->192.168.127.1:26909 (ESTABLISHED)
rootlessp 2572 core  162u  IPv6  14606      0t0  TCP 192.168.127.2:5432->192.168.127.1:26910 (ESTABLISHED)
rootlessp 2572 core  244u  IPv6  12100      0t0  TCP 192.168.127.2:5432->192.168.127.1:26917 (ESTABLISHED)
macos $ lsof -nP -iTCP

gvproxy   60061 maciej   11u  IPv4 0x5e926f38c5bfb168      0t0  TCP 127.0.0.1:50705 (LISTEN)
gvproxy   60061 maciej   16u  IPv4 0x953406e4466c535d      0t0  TCP 127.0.0.1:50705->127.0.0.1:50737 (CLOSED)
gvproxy   60061 maciej   19u  IPv4 0x3789059812058dd1      0t0  TCP 127.0.0.1:50705->127.0.0.1:50739 (CLOSED)
gvproxy   60061 maciej   22u  IPv4  0x9682450022e7d02      0t0  TCP 127.0.0.1:50705->127.0.0.1:50873 (CLOSED)
gvproxy   60061 maciej   23u  IPv4 0x1be1c29eac5d6858      0t0  TCP 127.0.0.1:50705->127.0.0.1:50875 (CLOSED)
gvproxy   60061 maciej   25u  IPv4 0xbd8071c31497745e      0t0  TCP 127.0.0.1:50705->127.0.0.1:50740 (CLOSED)
gvproxy   60061 maciej   29u  IPv4 0x9454279b154def93      0t0  TCP 127.0.0.1:50705->127.0.0.1:50881 (CLOSED)
gvproxy   60061 maciej   30u  IPv4 0x61dc385e10861b30      0t0  TCP 127.0.0.1:50705->127.0.0.1:50882 (CLOSED)
gvproxy   60061 maciej   31u  IPv6 0x1c7ced5e0ea769f9      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50783 (CLOSED)
gvproxy   60061 maciej   32u  IPv6 0xf51794d53ac7bf61      0t0  TCP *:5432 (LISTEN)
gvproxy   60061 maciej   35u  IPv6 0x8ee1e59331d9b4eb      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50784 (CLOSED)
gvproxy   60061 maciej   36u  IPv6 0x4461e9b487ca5038      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50785 (CLOSED)
gvproxy   60061 maciej   37u  IPv6 0x654daf1c1b524c81      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50786 (CLOSED)
gvproxy   60061 maciej   40u  IPv6 0xbd1f8da5525d8ea9      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50787 (CLOSED)
gvproxy   60061 maciej   41u  IPv6 0xf4ee4b56e2fe6d28      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50788 (CLOSED)
gvproxy   60061 maciej   42u  IPv6 0xebefcd2083efd482      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50789 (CLOSED)
gvproxy   60061 maciej   43u  IPv6 0x2688979dd0323152      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50790 (CLOSED)
gvproxy   60061 maciej   45u  IPv6 0xa80ac723feddf84b      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50791 (CLOSED)
gvproxy   60061 maciej   46u  IPv6 0x4422ea55d7b9df09      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50792 (CLOSED)
gvproxy   60061 maciej   47u  IPv6 0xa8a2e8f877895393      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50793 (CLOSED)
gvproxy   60061 maciej   48u  IPv6 0x17117e6d683e509c      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50794 (CLOSED)
gvproxy   60061 maciej   49u  IPv6 0x4eddffe61685b75d      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50795 (CLOSED)
gvproxy   60061 maciej   50u  IPv6 0x494828cb4fb5d6d8      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50796 (CLOSED)
gvproxy   60061 maciej   51u  IPv6 0x56b1d31fdbef1026      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50797 (CLOSED)
gvproxy   60061 maciej   52u  IPv6  0x7224088b59fca3f      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50798 (CLOSED)
gvproxy   60061 maciej   53u  IPv6 0xd5816cb26426e997      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50799 (CLOSED)
gvproxy   60061 maciej   54u  IPv6 0xabc19d0040b80509      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50800 (CLOSED)
gvproxy   60061 maciej   55u  IPv6 0x208a3e5b83626444      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50802 (CLOSED)
gvproxy   60061 maciej   56u  IPv6 0x90ebbedee86163dd      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50801 (CLOSED)
gvproxy   60061 maciej   57u  IPv6 0x1929cc69b5caf122      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50804 (CLOSED)
gvproxy   60061 maciej   58u  IPv6 0x12271b9704d54c09      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50803 (CLOSED)
gvproxy   60061 maciej   59u  IPv6 0x515b09d096f5578e      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50805 (CLOSED)
gvproxy   60061 maciej   60u  IPv6 0xd47d82c2b4b907e9      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50806 (CLOSED)
gvproxy   60061 maciej   61u  IPv6 0xbe3dcde786c4c750      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50807 (CLOSED)
gvproxy   60061 maciej   62u  IPv6 0xbcb65b3f5988ad79      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50808 (CLOSED)
gvproxy   60061 maciej   63u  IPv6 0x45363382b4ed0203      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50809 (CLOSED)
gvproxy   60061 maciej   64u  IPv6 0xe19711d3fee821f2      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50811 (CLOSED)
gvproxy   60061 maciej   65u  IPv6 0x90b7bfcdd1f0f495      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50810 (CLOSED)
gvproxy   60061 maciej   66u  IPv6 0x432cfe824e976f6b      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50812 (CLOSED)
gvproxy   60061 maciej   67u  IPv6 0xe8d4fd3eb26bb504      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50813 (CLOSED)
gvproxy   60061 maciej   68u  IPv6 0x7b56d3044aedd9dc      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50814 (CLOSED)
gvproxy   60061 maciej   69u  IPv6  0xa79808b077ea8e9      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50815 (CLOSED)
gvproxy   60061 maciej   70u  IPv6  0xea3f01278f01df8      0t0  TCP [::127.0.0.1]:5432->[::127.0.0.1]:50816 (CLOSED)

Looks like a regression between 5.1.2 and 5.2.0. Downgrading back to 5.1.2 fixes this issue.

Steps to reproduce the issue

  1. Create a container running docker.io/library/postgres:11.
  2. Run high-concurrency tests against the container (I ran 20 concurrent connections; Postgres was configured for 100 total); see the sketch after this list.
  3. After a while, Postgres will start running out of connections.
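
A minimal sketch of these steps, assuming pgbench is available on the host; the container name, password, and pgbench parameters are illustrative assumptions, not the reporter's exact setup:

podman run -d --name pg-test -p 5432:5432 -e POSTGRES_PASSWORD=secret docker.io/library/postgres:11

# initialize, then drive ~20 concurrent client connections for a minute
PGPASSWORD=secret pgbench -h 127.0.0.1 -p 5432 -U postgres -i postgres
PGPASSWORD=secret pgbench -h 127.0.0.1 -p 5432 -U postgres -c 20 -j 4 -T 60 postgres

# on the macOS host, count the sockets gvproxy is still holding after the run
lsof -nP -iTCP | grep gvproxy | wc -l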

Describe the results you received

Connections stay open after the client has closed them: gvproxy on the macOS host and the rootless port forwarder inside the machine keep accumulating sockets and file descriptors until Postgres runs out of connections or memory.

Describe the results you expected

Connections/file descriptors should be closed properly once they're closed by the client.

podman info output

host:
  arch: arm64
  buildahVersion: 1.37.0
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.10-1.fc40.aarch64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: '
  cpuUtilization:
    idlePercent: 99.23
    systemPercent: 0.43
    userPercent: 0.35
  cpus: 4
  databaseBackend: sqlite
  distribution:
    distribution: fedora
    variant: coreos
    version: "40"
  eventLogger: journald
  freeLocks: 2046
  hostname: localhost.localdomain
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
    uidmap:
    - container_id: 0
      host_id: 501
      size: 1
    - container_id: 1
      host_id: 100000
      size: 1000000
  kernel: 6.9.11-200.fc40.aarch64
  linkmode: dynamic
  logDriver: journald
  memFree: 458776576
  memTotal: 2044641280
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.12.0-1.20240801182221740927.main.8.g7cc82a4.fc40.aarch64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.13.0-dev
    package: netavark-1.12.1-1.20240802123603627403.main.4.geb4e5d3.fc40.aarch64
    path: /usr/libexec/podman/netavark
    version: netavark 1.13.0-dev
  ociRuntime:
    name: crun
    package: crun-1.15-1.20240731144341062436.main.62.ge6a1ef1.fc40.aarch64
    path: /usr/bin/crun
    version: |-
      crun version UNKNOWN
      commit: ce56cc2473f934a77c79c5cc79a65fe157dec42d
      rundir: /run/user/501/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20240624.g1ee2eca-1.fc40.aarch64
    version: |
      pasta 0^20240624.g1ee2eca-1.fc40.aarch64-pasta
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/501/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.2-2.fc40.aarch64
    version: |-
      slirp4netns version 1.2.2
      commit: 0ee2d87523e906518d34a6b423271e4826f71faf
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.5
  swapFree: 0
  swapTotal: 0
  uptime: 0h 23m 47.00s
  variant: v8
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /var/home/core/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/home/core/.local/share/containers/storage
  graphRootAllocated: 53082042368
  graphRootUsed: 3940200448
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 1
  runRoot: /run/user/501/containers
  transientStore: false
  volumePath: /var/home/core/.local/share/containers/storage/volumes
version:
  APIVersion: 5.2.0
  Built: 1722556800
  BuiltTime: Fri Aug  2 03:00:00 2024
  GitCommit: ""
  GoVersion: go1.22.5
  Os: linux
  OsArch: linux/arm64
  Version: 5.2.0

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

podman version:

Client:       Podman Engine
Version:      5.2.0
API Version:  5.2.0
Go Version:   go1.22.6
Git Commit:   b22d5c61eef93475413724f49fd6a32980d2c746
Built:        Thu Aug  1 20:32:02 2024
OS/Arch:      darwin/arm64

Server:       Podman Engine
Version:      5.2.0
API Version:  5.2.0
Go Version:   go1.22.5
Built:        Fri Aug  2 03:00:00 2024
OS/Arch:      linux/arm64


@maciej-szlosarczyk added the kind/bug label Aug 14, 2024
@github-actions bot added the macos and remote labels Aug 14, 2024
@Luap99 added the machine label Aug 14, 2024
@Luap99
Member

Luap99 commented Aug 15, 2024

Can you replace only the gvproxy binary with the one included in the 5.1.2 installer? It should be version gvproxy 0.7.3; we updated the version to 0.7.4 in the 5.2 installer (https://github.com/containers/gvisor-tap-vsock/releases). So I wonder if there is a regression there or on the VM side somehow.
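
A hedged sketch of that swap; the gvproxy path is an assumption and depends on how podman was installed, and the source path of the 0.7.3 binary is a placeholder:

podman machine stop

GVPROXY=/opt/podman/libexec/podman/gvproxy             # adjust to your install
cp "$GVPROXY" "$GVPROXY.0.7.4.bak"                     # keep the 0.7.4 binary around
cp /path/to/5.1.2-installer/libexec/podman/gvproxy "$GVPROXY"
"$GVPROXY" --version                                   # should report v0.7.3

podman machine start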

@maciej-szlosarczyk
Author

Hey @Luap99!

It works properly with gvproxy 0.7.3:

% ./podman/5.2.0/libexec/podman/gvproxy --version
gvproxy version v0.7.3

% lsof -nP -iTCP | grep gvproxy
gvproxy   63436 maciej   11u  IPv4 0x921c900cb3b2c2c3      0t0  TCP 127.0.0.1:57485 (LISTEN)
gvproxy   63436 maciej   34u  IPv6 0x9a475d1349cde53d      0t0  TCP *:5432 (LISTEN)
gvproxy   63436 maciej   71u  IPv4 0x776bd7258f5958e0      0t0  TCP 127.0.0.1:57485->127.0.0.1:58106 (ESTABLISHED)

Machine internal:

core@localhost:~$ lsof -nP -iTCP
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
rootlessp 2568 core   11u  IPv6  13502      0t0  TCP *:5432 (LISTEN)

@Luap99
Copy link
Member

Luap99 commented Aug 15, 2024

@maciej-szlosarczyk Do you know how to use git bisect? It would be great if you could build gvproxy from source (make gvproxy in https://github.com/containers/gvisor-tap-vsock/), then copy the binary to your location and test again until you find the commit that caused the regression.
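
A hedged sketch of that bisect loop; the good/bad tags are assumptions based on the gvproxy versions shipped with podman 5.1.2 and 5.2.0, and the destination path is a placeholder:

git clone https://github.com/containers/gvisor-tap-vsock
cd gvisor-tap-vsock
git bisect start
git bisect bad v0.7.4      # version shipped with podman 5.2.0
git bisect good v0.7.3     # version shipped with podman 5.1.2

# at each bisect step:
make gvproxy                                   # the binary typically lands in bin/
cp bin/gvproxy /path/to/installed/gvproxy
podman machine stop && podman machine start
# re-run the failing test, then tell bisect the result:
git bisect good                                # or: git bisect bad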

@maciej-szlosarczyk
Author

Bisect points at this commit:

containers/gvisor-tap-vsock@600910c

I managed to narrow it down to these 15 or so lines:

containers/gvisor-tap-vsock@600910c#diff-a888e5fb95e32a55bcba169802125c5ea2dbd00cf18815f1edfe4faa3e6a2d69L376-R408

@Luap99
Member

Luap99 commented Aug 15, 2024

Thanks

cc @praveenkumar @cfergeau

@maciej-szlosarczyk
Author

Hey!

I did some additional investigation there and discovered two things:

  • dst is *gonet.TCPConn and not *net.TCPConn, so it falls through the type assertions (see the sketch below), although it looks like it was like this even before this change.
  • Reverting the change on lines 404-408 (restoring the channel size to 1) fixes the regression locally. I don't know how it relates to the point above.
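
A minimal Go sketch of the first point, assuming the relevant code asserts on *net.TCPConn; this is illustrative only, not the actual tcpproxy/gvproxy source:

package main

import (
	"fmt"
	"net"
)

// halfCloseWrite mirrors the kind of type assertion described above: code that
// only recognizes the kernel *net.TCPConn silently skips other net.Conn
// implementations such as gVisor's *gonet.TCPConn, even though that type has
// its own CloseWrite method.
func halfCloseWrite(c net.Conn) bool {
	if tc, ok := c.(*net.TCPConn); ok {
		_ = tc.CloseWrite() // real kernel socket: half-close is performed
		return true
	}
	return false // a *gonet.TCPConn falls through to here
}

func main() {
	var c net.Conn // stands in for the dst connection from the report
	fmt.Println(halfCloseWrite(c))
}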

@praveenkumar
Contributor

@maciej-szlosarczyk Can you file a PR for it against the https://github.com/containers/gvisor-tap-vsock repo?

maciej-szlosarczyk added a commit to maciej-szlosarczyk/gvisor-tap-vsock that referenced this issue Aug 20, 2024
Original issue reported in podman here:

containers/podman#23616

Signed-off-by: Maciej Szlosarczyk <[email protected]>
cfergeau pushed a commit to cfergeau/gvisor-tap-vsock that referenced this issue Aug 20, 2024
This causes a regression in gvproxy when it's used by podman:
containers/podman#23616

Reverting inetaf/tcpproxy commit 2862066 is a bit convoluted, as we need
to first undo the module name change (inet.af/tcpproxy ->
github.com/inetaf/tcpproxy) done in commit 600910c
and then add a go module `replace` directive to redirect the no-longer-existing
inet.af/tcpproxy to the commit we want in github.com/inetaf/tcpproxy.

This way, the module name in gvisor-tap-vsock go.mod and in
github.com/inetaf/tcpproxy go.mod are the same (inet.af/tcpproxy), and
we can use older commits in this repository.

It's unclear what's causing the regression, as the commit log/PR
description/associated issue don't provide useful details:
inetaf/tcpproxy@2862066

The best I could find is:
tailscale/tailscale#10070
> The close in the handler sometimes occurs before the buffered data is
forwarded. The proxy could be improved to perform a half-close dance,
such that it will only mutually close once both halves are closed or
both halves error.

and inetaf/tcpproxy#21 which seems to be the
same issue as inetaf/tcpproxy#38 which is the
issue fixed by the commit triggering the regression.

What could be happening is that before inetaf/tcpproxy commit 2862066,
as soon as one side of the connection was closed, the other half was
also closed, while after commit 2862066, the tcpproxy code waits for
both halves of the connection to be closed. So maybe we are missing a
connection close somewhere in gvproxy's code :-/
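
A hedged go.mod sketch of the replace approach described in this commit message; the module path is real, but the pseudo-versions are placeholders for whatever pre-2862066 commit of github.com/inetaf/tcpproxy is actually pinned:

module github.com/containers/gvisor-tap-vsock

go 1.21

// imports in the code go back to the old module name ...
require inet.af/tcpproxy v0.0.0-00010101000000-000000000000 // placeholder version

// ... and the replace directive points it at the renamed repo, pinned to a
// commit that predates 2862066
replace inet.af/tcpproxy => github.com/inetaf/tcpproxy v0.0.0-00010101000000-000000000000 // placeholder version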
@cfergeau
Contributor

The lines you pointed to as causing the regression are related to this inetaf/tcpproxy commit: inetaf/tcpproxy@2862066

I think what happens is that, before this commit, HandleConn returned after either the src or dest connection was closed or returned an error. My guess is that after HandleConn returns, both connections were eventually garbage collected.

After commit 2862066fc2a9405880, however, the code waits for both src and dest to be closed or return an error before HandleConn returns. I suspect that when gvproxy shows the bug, a lot of goroutines may be stuck in this HandleConn method.
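
An illustrative Go sketch of that behavioral difference, not the tcpproxy source; it only shows the "return after one side finishes" vs "wait for both sides" pattern:

package main

import (
	"io"
	"net"
)

func proxy(src, dst net.Conn, waitForBoth bool) {
	errc := make(chan error, 2) // buffered so neither copier blocks on send
	go func() { _, err := io.Copy(dst, src); errc <- err }()
	go func() { _, err := io.Copy(src, dst); errc <- err }()

	if waitForBoth {
		// post-2862066 behavior: only return once BOTH directions have
		// finished; if one side never closes, the goroutine and both file
		// descriptors stay alive.
		<-errc
		<-errc
	} else {
		// pre-2862066 behavior: return as soon as either direction ends,
		// then close both connections so the other copier unblocks and the
		// descriptors are released.
		<-errc
	}
	src.Close()
	dst.Close()
}

func main() {}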

cfergeau added a commit to cfergeau/gvisor-tap-vsock that referenced this issue Aug 20, 2024
This causes a regression in gvproxy when it's used by podman:
containers/podman#23616

Thanks to Maciej Szlosarczyk <[email protected]> for investigating and
finding the faulty commit!

Reverting inetaf/tcpproxy commit 2862066 is a bit convoluted, as we need
to first undo the module name change (inet.af/tcpproxy ->
github.com/inetaf/tcpproxy) done in commit 600910c
and then add a go module `replace` directive to redirect the no-longer-existing
inet.af/tcpproxy to the commit we want in github.com/inetaf/tcpproxy.

This way, the module name in gvisor-tap-vsock go.mod and in
github.com/inetaf/tcpproxy go.mod are the same (inet.af/tcpproxy), and
we can use older commits in this repository.

It's unclear what's causing the regression, as the commit log/PR
description/associated issue don't provide useful details:
inetaf/tcpproxy@2862066

The best I could find is:
tailscale/tailscale#10070
> The close in the handler sometimes occurs before the buffered data is
forwarded. The proxy could be improved to perform a half-close dance,
such that it will only mutually close once both halves are closed or
both halves error.

and inetaf/tcpproxy#21 which seems to be the
same issue as inetaf/tcpproxy#38 which is the
issue fixed by the commit triggering the regression.

What could be happening is that before inetaf/tcpproxy commit 2862066,
as soon as one side of the connection was closed, the other half was
also closed, while after commit 2862066, the tcpproxy code waits for
both halves of the connection to be closed. So maybe we are missing a
connection close somewhere in gvproxy's code :-/

Signed-off-by: Christophe Fergeau <[email protected]>
cfergeau added a commit to cfergeau/gvisor-tap-vsock that referenced this issue Aug 21, 2024
This causes a regression in gvproxy when it's used by podman:
containers/podman#23616

Thanks to Maciej Szlosarczyk <[email protected]> for investigating and
finding the faulty commit!

Reverting inetaf/tcpproxy commit 2862066 is a bit convoluted, as we need
to first undo the module name change (inet.af/tcpproxy ->
github.com/inetaf/tcpproxy) done in commit 600910c
and then add a go module `replace` directive to redirect the no-longer-existing
inet.af/tcpproxy to the commit we want in github.com/inetaf/tcpproxy.

This way, the module name in gvisor-tap-vsock go.mod and in
github.com/inetaf/tcpproxy go.mod are the same (inet.af/tcpproxy), and
we can use older commits in this repository.

It's unclear what's causing the regression, as the commit log/PR
description/associated issue don't provide useful details:
inetaf/tcpproxy@2862066

The best I could find is:
tailscale/tailscale#10070
> The close in the handler sometimes occurs before the buffered data is
forwarded. The proxy could be improved to perform a half-close dance,
such that it will only mutually close once both halves are closed or
both halves error.

and inetaf/tcpproxy#21 which seems to be the
same issue as inetaf/tcpproxy#38 which is the
issue fixed by the commit triggering the regression.

What could be happening is that before inetaf/tcpproxy commit 2862066,
as soon as one side of the connection was closed, the other half was
also closed, while after commit 2862066, the tcpproxy code waits for
both halves of the connection to be closed. So maybe we are missing a
connection close somewhere in gvproxy's code :-/

Signed-off-by: Christophe Fergeau <[email protected]>
Tested-by: Maciej Szlosarczyk <[email protected]>
@cfergeau
Contributor

containers/gvisor-tap-vsock#386 is merged and should address this issue. I'll make a gvisor-tap-vsock release soon.
It's likely that it's only papering over a gvisor-tap-vsock or tcpproxy bug, but if it's a preexisting gvisor-tap-vsock bug made more visible by the changes in containers/gvisor-tap-vsock@600910c, then it's a bug which has been present for years, so we can live with it for a while longer.

@Luap99
Member

Luap99 commented Aug 22, 2024

Sorry, I was out of the office; we did a new release yesterday.
But if you do a new gvproxy release, we can certainly do another release soon to fix this regression.

cfergeau added a commit to cfergeau/podman that referenced this issue Aug 26, 2024
This should fix the regression reported in
containers#23616

Signed-off-by: Christophe Fergeau <[email protected]>
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/podman that referenced this issue Aug 28, 2024
This should fix the regression reported in
containers#23616

Signed-off-by: Christophe Fergeau <[email protected]>
@gbraad
Member

gbraad commented Sep 19, 2024

I'll try to see if this can be assigned on our end; @evidolob WDYT? => containers/gvisor-tap-vsock#387

@Luap99
Member

Luap99 commented Oct 29, 2024

This was fixed in 5.2.3

@Luap99 closed this as completed Oct 29, 2024