Keepalived occasionally fails SSL_CHECK #2351

lemrouch · 2023-10-11T10:18:55Z

Describe the bug
I'm upgrading our servers from Debian Bullseye to Debian Bookworm. Some of them act as load balancers using keepalived.
Right now I have one Bullseye and one Bookworm balancer with the same configuration, checking the same services.
Several of our services run over HTTPS, so I'm using SSL_CHECK (SSL_GET in the configuration below).
The Bookworm one occasionally fails SSL_CHECK for several seconds on one service, while the
Bullseye one does not report any problem at all.
I saw an RST packet sent from keepalived to the service when the check failed.

To Reproduce
That's hard to tell.
It's quite rare: less than once per hour with a 1s loop delay and tens of real servers.
I would say services with a longer certificate chain are more likely to fail.

Expected behavior
Don't fail SSL_CHECK when the service has no problems.

Keepalived version

Keepalived v2.2.7 (01/16,2022)

Copyright(C) 2001-2022 Alexandre Cassen, <[email protected]>

Built with kernel headers for Linux 5.19.11
Running on Linux 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29)
Distro: Debian GNU/Linux 12 (bookworm)

configure options: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --enable-snmp --enable-sha1 --enable-snmp-rfcv2 --enable-snmp-rfcv3 --enable-dbus --enable-json --enable-bfd --enable-regex --with-init=systemd build_alias=x86_64-linux-gnu CFLAGS=-g -O2 -ffile-prefix-map=/build/keepalived-m8ENAG/keepalived-2.2.7=. -fstack-protector-strong -Wformat -Werror=format-security LDFLAGS=-Wl,-z,relro CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2

Config options:  NFTABLES LVS REGEX VRRP VRRP_AUTH VRRP_VMAC JSON BFD OLD_CHKSUM_COMPAT SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3 DBUS INIT=systemd SYSTEMD_NOTIFY

System options:  VSYSLOG MEMFD_CREATE IPV6_MULTICAST_ALL IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS IPVS_TUN_TYPE IPVS_TUN_CSUM IPVS_TUN_GRE VRRP_IPVLAN IFLA_LINK_NETNSID GLOB_BRACE GLOB_ALTDIRFUNC INET6_ADDR_GEN_MODE VRF SO_MARK

Distro (please complete the following information):

Name: Debian
Version: 1:2.2.7-1+b2
Architecture: x86_64

Configuration file:

global_defs {
  notification_email {
    [email protected]
  }
  notification_email_from [email protected]
  smtp_server 10.17.0.153
  smtp_connect_timeout 60
  router_id BALANCER-2
  script_user balancer
  enable_script_security
  vrrp_version 3
  snmp_socket unix:/var/agentx/master
  lvs_sync_daemon bond0.160
}

vrrp_instance INET {
  interface                 inet
  state                     BACKUP
  virtual_router_id         253
  priority                  100
  advert_int                0.4
  garp_master_delay         5

  # notify scripts and alerts are optional
  #
  # filenames of scripts to run on transitions
  # can be unquoted (if just filename)
  # or quoted (if has parameters)
  # to MASTER transition
  notify_master "/usr/bin/sudo /etc/conntrackd/primary-backup.sh primary"
  # to BACKUP transition
  notify_backup "/usr/bin/sudo /etc/conntrackd/primary-backup.sh backup"
  # FAULT transition
  notify_fault "/usr/bin/sudo /etc/conntrackd/primary-backup.sh fault"

  track_interface {
    inet
    snet
  }

  virtual_ipaddress {
    10.17.0.129/24 dev snet 
    10.11.0.129/24 dev snet.3 
    A.B.C.134/25 dev inet
    A.B.C.135/25 dev inet
    A.B.C.138/25 dev inet
    A.B.C.148/25 dev inet
    A.B.C.149/25 dev inet
    A.B.C.150/25 dev inet
    A.B.C.152/25 dev inet
    A.B.C.154/25 dev inet
    A.B.C.162/25 dev inet
    A.B.C.163/25 dev inet
    A.B.C.164/25 dev inet
    A.B.C.165/25 dev inet
    A.B.C.166/25 dev inet
    A.B.C.167/25 dev inet
    A.B.C.168/25 dev inet
    A.B.C.169/25 dev inet
    A.B.C.170/25 dev inet
    A.B.C.172/25 dev inet
    A.B.C.173/25 dev inet
    A.B.C.177/25 dev inet
    A.B.C.178/25 dev inet
    A.B.C.179/25 dev inet
    A.B.C.180/25 dev inet
    A.B.C.182/25 dev inet
    A.B.C.185/25 dev inet
    A.B.C.186/25 dev inet
    A.B.C.190/25 dev inet
    A.B.C.191/25 dev inet
    A.B.C.194/25 dev inet
    A.B.C.200/25 dev inet
  }

  unicast_src_ip A.B.C.137

  unicast_peer {
    A.B.C.136
  }

}

virtual_server A.B.C.191 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.11.0.90 8443 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
  real_server 10.11.0.91 8443 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.191 444 {

  delay_loop 1
  lb_algo sh
  lb_kind NAT
  protocol TCP

  real_server 10.11.0.86 8443 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.163 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.137 8604 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.138 8604 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.163 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.166 8604 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.150 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost admin.example.com
  protocol TCP

  real_server 10.17.0.137 8031 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.138 8031 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.150 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost admin.example.com
  protocol TCP

  real_server 10.17.0.166 8031 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.168 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP



  real_server 10.17.0.137 8601 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.138 8601 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.168 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.166 8601 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.194 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost www.example3.com
  protocol TCP

  real_server 10.17.0.137 8100 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.138 8100 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.194 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost www.example3.com
  protocol TCP

  real_server 10.17.0.166 8100 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.164 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.137 7101 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
  real_server 10.17.0.138 7101 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.164 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.166 7101 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.149 80 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost buttons.example.com
  protocol TCP

  real_server 10.17.0.137 8021 {
    HTTP_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /index.html
        status_code 200
      }
    }
  }
  real_server 10.17.0.138 8021 {
    HTTP_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /index.html
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.149 81 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost buttons.example.com
  protocol TCP

  real_server 10.17.0.166 8021 {
    HTTP_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /index.html
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.182 8602 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.145 8602 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.146 8602 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.182 8603 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP



  real_server 10.17.0.163 8602 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.138 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost controlcenter.example.com
  protocol TCP

  real_server 10.17.0.137 8002 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.138 8002 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.138 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost controlcenter.example.com
  protocol TCP

  real_server 10.17.0.166 8002 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.190 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.11.0.90 443 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
  real_server 10.11.0.91 443 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.190 444 {

  delay_loop 1
  lb_algo sh
  lb_kind NAT
  protocol TCP

  real_server 10.11.0.86 443 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.178 22 {

  delay_loop 10
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.178 22 {
    TCP_CHECK {
      connect_port 22
      connect_timeout 2
    }
  }
}

virtual_server A.B.C.180 443 {

  delay_loop 1
  lb_algo sh
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.145 443 {
    weight 100

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
  real_server 10.17.0.146 443 {
    weight 100

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.180 444 {

  delay_loop 1
  lb_algo sh
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.163 443 {
    weight 100

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.180 80 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.163 80 {
    TCP_CHECK {
      connect_port 80
      connect_timeout 2
      delay_before_retry 1
    }
  }
}

virtual_server A.B.C.148 80 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost links.example.com
  protocol TCP

  real_server 10.17.0.137 8020 {
    HTTP_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.138 8020 {
    HTTP_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.148 81 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost links.example.com
  protocol TCP

  real_server 10.17.0.166 8020 {
    HTTP_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.177 61614 {

  delay_loop 10
  lb_algo rr
  lb_kind NAT
  protocol TCP



  real_server 10.17.0.177 61614 {
    TCP_CHECK {
      connect_port 61614
      connect_timeout 2
    }
  }
}

virtual_server A.B.C.165 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.137 8600 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.138 8600 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.165 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.166 8600 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.162 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP



  real_server 10.17.0.145 7202 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
  real_server 10.17.0.146 7202 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.162 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.163 7202 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.166 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.137 8603 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.138 8603 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.166 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.166 8603 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.167 8555 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.145 8555 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.146 8555 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.167 8556 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.163 8555 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.200 8080 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 195.20.32.203 8080 {
    inhibit_on_failure

    TCP_CHECK {
      connect_port 8080
      connect_timeout 2
    }
  }
  real_server A.B.E.203 8080 {
    inhibit_on_failure

    TCP_CHECK {
      connect_port 8080
      connect_timeout 2
    }
  }
  real_server X.Y.Z.211 8080 {
    inhibit_on_failure

    TCP_CHECK {
      connect_port 8080
      connect_timeout 2
    }
  }
}

virtual_server A.B.C.200 8081 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP



  real_server A.B.D.8 8080 {
    inhibit_on_failure

    TCP_CHECK {
      connect_port 8080
      connect_timeout 2
    }
  }
}

virtual_server A.B.C.177 8140 {

  delay_loop 10
  lb_algo rr
  lb_kind NAT
  protocol TCP



  real_server 10.17.0.177 8140 {
    TCP_CHECK {
      connect_port 8140
      connect_timeout 2
    }
  }
}

virtual_server A.B.C.154 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost secure.example.com
  protocol TCP

  real_server 10.17.0.137 8034 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.138 8034 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.154 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost secure.example.com
  protocol TCP

  real_server 10.17.0.166 8034 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.179 22 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.153 10022 {
    TCP_CHECK {
      connect_port 10022
      connect_timeout 2
      delay_before_retry 1
    }
  }
  real_server 10.17.0.154 10022 {
    TCP_CHECK {
      connect_port 10022
      connect_timeout 2
      delay_before_retry 1
    }
  }
  real_server 10.17.0.155 10022 {
    TCP_CHECK {
      connect_port 10022
      connect_timeout 2
      delay_before_retry 1
    }
  }
}

virtual_server A.B.C.179 23 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.164 10022 {
    TCP_CHECK {
      connect_port 10022
      connect_timeout 2
      delay_before_retry 1
    }
  }
}

virtual_server A.B.C.170 25 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.153 25 {
    SMTP_CHECK {
      connect_timeout 2
      helo_name balancer.check.snet.example.com
    }
  }
  real_server 10.17.0.154 25 {
    SMTP_CHECK {
      connect_timeout 2
      helo_name balancer.check.snet.example.com
    }
  }
  real_server 10.17.0.155 25 {
    SMTP_CHECK {
      connect_timeout 2
      helo_name balancer.check.snet.example.com
    }
  }
}

virtual_server A.B.C.169 389 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.153 389 {
    TCP_CHECK {
      connect_port 389
      connect_timeout 2
    }
  }
  real_server 10.17.0.154 389 {
    TCP_CHECK {
      connect_port 389
      connect_timeout 2
    }
  }
  real_server 10.17.0.155 389 {
    TCP_CHECK {
      connect_port 389
      connect_timeout 2
    }
  }
}

virtual_server A.B.C.169 636 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.153 636 {
    TCP_CHECK {
      connect_port 636
      connect_timeout 2
    }
  }
  real_server 10.17.0.154 636 {
    TCP_CHECK {
      connect_port 636
      connect_timeout 2
    }
  }
  real_server 10.17.0.155 636 {
    TCP_CHECK {
      connect_port 636
      connect_timeout 2
    }
  }
}

virtual_server A.B.C.173 80 {

  delay_loop 10
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.153 80 {
    TCP_CHECK {
      connect_port 80
      connect_timeout 2
      delay_before_retry 1
    }
  }
  real_server 10.17.0.154 80 {
    TCP_CHECK {
      connect_port 80
      connect_timeout 2
      delay_before_retry 1
    }
  }
  real_server 10.17.0.155 80 {
    TCP_CHECK {
      connect_port 80
      connect_timeout 2
      delay_before_retry 1
    }
  }
}

virtual_server A.B.C.185 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.11.0.90 9980 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
  real_server 10.11.0.91 9980 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.185 444 {

  delay_loop 1
  lb_algo sh
  lb_kind NAT
  protocol TCP

  real_server 10.11.0.86 9980 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.186 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.11.0.90 9981 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
  real_server 10.11.0.91 9981 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.186 444 {

  delay_loop 1
  lb_algo sh
  lb_kind NAT
  protocol TCP

  real_server 10.11.0.86 9981 {
    inhibit_on_failure

    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        path /actuator/health
        status_code 200
      }
    }
  }
}

virtual_server A.B.C.190 2223 {

  delay_loop 1
  lb_algo sh
  lb_kind NAT
  protocol TCP
  sorry_server 10.11.0.91 2223

  real_server 10.11.0.90 2223 {
    weight 100

    TCP_CHECK {
      connect_port 2223
      connect_timeout 2
      delay_before_retry 1
    }
  }
}

virtual_server A.B.C.190 2224 {

  delay_loop 1
  lb_algo sh
  lb_kind NAT
  protocol TCP

  real_server 10.11.0.86 2223 {
    weight 100

    TCP_CHECK {
      connect_port 2223
      connect_timeout 2
      delay_before_retry 1
    }
  }
}

virtual_server A.B.C.152 443 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost www.example1.com
  protocol TCP

  real_server 10.17.0.137 8032 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
  real_server 10.17.0.138 8032 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.152 444 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  virtualhost www.example2.com
  protocol TCP

  real_server 10.17.0.166 8032 {
    SSL_GET {
      connect_timeout 2
      delay_before_retry 1
      nb_get_retry 2

      url {
        digest e0aa021e21dddbd6d8cecec71e9cf564
        path /ping
      }
    }
  }
}

virtual_server A.B.C.172 26500 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.153 26500 {
    TCP_CHECK {
      connect_port 26500
      connect_timeout 2
      delay_before_retry 1
    }
  }
  real_server 10.17.0.154 26500 {
    TCP_CHECK {
      connect_port 26500
      connect_timeout 2
      delay_before_retry 1
    }
  }
  real_server 10.17.0.155 26500 {
    TCP_CHECK {
      connect_port 26500
      connect_timeout 2
      delay_before_retry 1
    }
  }
}

virtual_server A.B.C.172 26501 {

  delay_loop 1
  lb_algo rr
  lb_kind NAT
  protocol TCP

  real_server 10.17.0.164 26500 {
    TCP_CHECK {
      connect_port 26500
      connect_timeout 2
      delay_before_retry 1
    }
  }
}

Notify and track scripts

CONNTRACKD_BIN=/usr/sbin/conntrackd
CONNTRACKD_LOCK=/var/lock/conntrack.lock
CONNTRACKD_CONFIG=/etc/conntrackd/conntrackd.conf

case "$1" in
  primary)
    #
    # commit the external cache into the kernel table
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -c
    if [ $? -eq 1 ]
    then
        logger "ERROR: failed to invoke conntrackd -c"
    fi

    #
    # flush the internal and the external caches
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -f
    if [ $? -eq 1 ]
    then
        logger "ERROR: failed to invoke conntrackd -f"
    fi

    #
    # resynchronize my internal cache to the kernel table
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -R
    if [ $? -eq 1 ]
    then
        logger "ERROR: failed to invoke conntrackd -R"
    fi

    #
    # send a bulk update to backups
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -B
    if [ $? -eq 1 ]
    then
        logger "ERROR: failed to invoke conntrackd -B"
    fi
    ;;
  backup)
    #
    # is conntrackd running? request some statistics to check it
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -s
    if [ $? -eq 1 ]
    then
        #
        # something's wrong, do we have a lock file?
        #
        if [ -f $CONNTRACKD_LOCK ]
        then
            logger "WARNING: conntrackd was not cleanly stopped."
            logger "If you suspect that it has crashed:"
            logger "1) Enable coredumps"
            logger "2) Try to reproduce the problem"
            logger "3) Post the coredump to [email protected]"
            rm -f $CONNTRACKD_LOCK
        fi
        $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -d
        if [ $? -eq 1 ]
        then
            logger "ERROR: cannot launch conntrackd"
            exit 1
        fi
    fi
    #
    # shorten kernel conntrack timers to remove the zombie entries.
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -t
    if [ $? -eq 1 ]
    then
        logger "ERROR: failed to invoke conntrackd -t"
    fi

    #
    # request resynchronization with master firewall replica (if any)
    # Note: this does nothing in the alarm approach.
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -n
    if [ $? -eq 1 ]
    then
        logger "ERROR: failed to invoke conntrackd -n"
    fi
    ;;
  fault)
    #
    # shorten kernel conntrack timers to remove the zombie entries.
    #
    $CONNTRACKD_BIN -C $CONNTRACKD_CONFIG -t
    if [ $? -eq 1 ]
    then
        logger "ERROR: failed to invoke conntrackd -t"
    fi
    ;;
  *)
    logger "ERROR: unknown state transition"
    echo "Usage: primary-backup.sh {primary|backup|fault}"
    exit 1
    ;;
esac

exit 0

System Log entries

Oct 02 08:29:45 balancer-2 Keepalived[918506]: Starting Keepalived v2.2.7 (01/16,2022)
Oct 02 08:29:45 balancer-2 Keepalived[918506]: WARNING - keepalived was built for newer Linux 6.1.52, running on Linux 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07)
Oct 02 08:29:45 balancer-2 Keepalived[918506]: Command line: '/usr/sbin/keepalived' '-f' '/etc/keepalived/keepalived.conf' '-m'
Oct 02 08:29:45 balancer-2 Keepalived[918506]: Configuration file /etc/keepalived/keepalived.conf
Oct 02 08:29:45 balancer-2 Keepalived[918507]: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Oct 02 08:29:45 balancer-2 Keepalived[918507]: Starting Healthcheck child process, pid=918508
Oct 02 08:29:45 balancer-2 Keepalived[918507]: Starting VRRP child process, pid=918509
Oct 02 08:29:45 balancer-2 Keepalived_healthcheckers[918508]: (/etc/keepalived/keepalived.conf: Line 110) nb_get_retry is deprecated - please use 'retry'
..
sorry, lots of those (old Puppet module)
..
Oct 02 08:29:45 balancer-2 Keepalived_healthcheckers[918508]: (/etc/keepalived/keepalived.conf: Line 1679) nb_get_retry is deprecated - please use 'retry'
Oct 02 08:29:45 balancer-2 Keepalived_healthcheckers[918508]: Initializing ipvs
Oct 02 08:29:45 balancer-2 Keepalived_vrrp[918509]: (INET) Ignoring track_interface inet since own interface
Oct 02 08:29:45 balancer-2 Keepalived_vrrp[918509]: (INET) Entering BACKUP STATE (init)
Oct 02 08:29:45 balancer-2 Keepalived_healthcheckers[918508]: Gained quorum 1+0=1 <= 2 for VS [A.B.C.191]:tcp:443
..
Oct 02 08:29:45 balancer-2 Keepalived_healthcheckers[918508]: Gained quorum 1+0=1 <= 1 for VS [A.B.C.172]:tcp:26501
Oct 02 08:29:45 balancer-2 Keepalived[918507]: Startup complete
Oct 02 08:29:45 balancer-2 Keepalived_healthcheckers[918508]: Activating healthchecker for service [10.11.0.90]:tcp:8443 for VS [A.B.C.191]:tcp:443
..

Did keepalived coredump?
no

Additional context
I reported a Debian bug, but the maintainer asked me to file an upstream issue first.

I was looking for a possible reason and found:
openssl/openssl#20365
pjsip/pjproject#3632
https://stackoverflow.com/questions/18179128/how-to-manage-the-error-queue-in-openssl-ssl-get-error-and-err-get-error

They all basically say that multiple SSL errors can be left in the error queue and that you are supposed to
empty it (e.g. by calling ERR_get_error() until it returns 0) before calling SSL_* functions.

I was able to solve this issue with a simple patch.
What do you think about it?

pqarmitage · 2023-10-15T16:27:41Z

The man page for SSL_get_error() lists a number of functions to which it applies, including SSL_shutdown(), so I think ERR_clear_error() should also be called before the two calls of SSL_shutdown(). keepalived does not use any of the other functions listed.

However, I think the patch will not work as a solution; in fact I think we have a severe problem.

The SSL_get_error(3) man page states:

In addition to ssl and ret, SSL_get_error() inspects the current thread's OpenSSL error queue.  Thus, SSL_get_error() must be used in the same thread that performed the TLS/SSL I/O operation, and no other OpenSSL function calls should appear in between.  The current thread's error queue must be empty before the TLS/SSL I/O operation is attempted, or SSL_get_error() will not work reliably.

The last sentence says we need to ensure that the thread's error queue is empty before we call SSL_connect(), SSL_read(), SSL_write() or SSL_shutdown(), so either we must call ERR_clear_error() before any of the SSL functions, or after an error we need to ensure we read all the errors, which is a solution I prefer since we ought to process all the errors and not just throw them away.

The problem is the second sentence above: "... and no other OpenSSL function calls should appear in between.". Apart from it being unclear what "between" refers to, the real problem is that keepalived can have an SSL connection open simultaneously for each SSL_CHECK. If an error is returned from any SSL_* call, SSL_get_error() will return errors not only for that particular connection, but also for errors on the OpenSSL error queue that occurred on any other SSL connection. Calling ERR_clear_error() before every SSL_* function call could mean that errors are discarded without being processed.

There are two possible solutions that I can see (but only one can work for keepalived):

1. Only run one SSL_GET checker at a time (this won't scale for configurations with a large number of SSL_GET checkers).
2. Run each SSL_GET checker in a separate thread (i.e. using pthread_create()).

The problem with option 2 is that (except for DBus) we do not use pthreads, and this would require a not insignificant architecture change for keepalived.

I will think about this and see if I can work out the best solution. It might need reading the OpenSSL code to really understand how it works, but I would rather avoid that.
