Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a net health recovery service to qemu machines #21262

Merged
merged 1 commit into from
Jan 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
86 changes: 75 additions & 11 deletions pkg/machine/ignition/ignition.go
Original file line number Diff line number Diff line change
Expand Up @@ -53,15 +53,16 @@ func GetNodeGrp(grpName string) NodeGroup {
}

type DynamicIgnition struct {
Name string
Key string
TimeZone string
UID int
VMName string
VMType define.VMType
WritePath string
Cfg Config
Rootful bool
Name string
Key string
TimeZone string
UID int
VMName string
VMType define.VMType
WritePath string
Cfg Config
Rootful bool
NetRecover bool
}

func (ign *DynamicIgnition) Write() error {
Expand Down Expand Up @@ -97,7 +98,7 @@ func (ign *DynamicIgnition) GenerateIgnitionConfig() error {

ignStorage := Storage{
Directories: getDirs(ign.Name),
Files: getFiles(ign.Name, ign.UID, ign.Rootful, ign.VMType),
Files: getFiles(ign.Name, ign.UID, ign.Rootful, ign.VMType, ign.NetRecover),
Links: getLinks(ign.Name),
}

Expand Down Expand Up @@ -231,6 +232,21 @@ func (ign *DynamicIgnition) GenerateIgnitionConfig() error {
}
ignSystemd.Units = append(ignSystemd.Units, qemuUnit)
}

if ign.NetRecover {
contents, err := GetNetRecoveryUnitFile().ToString()
if err != nil {
return err
}

recoveryUnit := Unit{
Enabled: BoolToPtr(true),
Name: "net-health-recovery.service",
Contents: &contents,
}
ignSystemd.Units = append(ignSystemd.Units, recoveryUnit)
}

// Only after all checks are done
// it's ready create the ingConfig
ign.Cfg = Config{
Expand Down Expand Up @@ -303,7 +319,7 @@ func getDirs(usrName string) []Directory {
return dirs
}

func getFiles(usrName string, uid int, rootful bool, vmtype define.VMType) []File {
func getFiles(usrName string, uid int, rootful bool, vmtype define.VMType, netRecover bool) []File {
files := make([]File, 0)

lingerExample := parser.NewUnitFile()
Expand Down Expand Up @@ -574,6 +590,23 @@ Delegate=memory pids cpu io
},
})

// Only necessary for qemu on mac
if netRecover {
files = append(files, File{
Node: Node{
User: GetNodeUsr("root"),
Group: GetNodeGrp("root"),
Path: "/usr/local/bin/net-health-recovery.sh",
},
FileEmbedded1: FileEmbedded1{
Mode: IntToPtr(0755),
Contents: Resource{
Source: EncodeDataURLPtr(GetNetRecoveryFile()),
},
},
})
}

return files
}

Expand Down Expand Up @@ -743,6 +776,37 @@ func (i *IgnitionBuilder) Build() error {
return i.dynamicIgnition.Write()
}

func GetNetRecoveryFile() string {
return `#!/bin/bash
# Verify network health, and bounce the network device if host connectivity
# is lost. This is a temporary workaround for a known rare qemu/virtio issue
# that affects some systems

sleep 120 # allow time for network setup on initial boot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sleep still needed with the systemd unit which has recoveryUnit.Add("Unit", "After", "sshd.socket sshd.service") ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sshd only has an after on network.target, which is quasi reliable:

"network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined [snip]"

Since we only seem to see the problem with long running vms, my thinking was it was better to just wait a bit longer in the script without disrupting / delaying boot (e.g. using something like network-online.target).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I agree sleeping for 2 minutes (or even 5 minutes or 1 hour or ... :) is no big deal given when the bug happens. I just wondered.

while true; do
sleep 30
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using a systemd timer which runs every 30 seconds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right yeah only reason it's done this way is to avoid the log generation noise that comes from them that @rhatdan was warning about. With a frequent timer like this it would be a lot of noise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah logging, makes sense!

curl -s -o /dev/null --max-time 30 http://192.168.127.1/health
if [ "$?" != "0" ]; then
echo "bouncing nic due to loss of connectivity with host"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does this line is reported to ? to see if it occurred or not in my podman machine

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit file sends stdout&stderr to the system journal, so can be found using journalctl.

ifconfig enp0s1 down; ifconfig enp0s1 up
fi
done
`
}

func GetNetRecoveryUnitFile() *parser.UnitFile {
recoveryUnit := parser.NewUnitFile()
recoveryUnit.Add("Unit", "Description", "Verifies health of network and recovers if necessary")
recoveryUnit.Add("Unit", "After", "sshd.socket sshd.service")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fwiw, I'm not sure sshd.service is required here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sshd.socket and sshd.service are mutually exclusive alternates (currently our fcos images are using sshd.service atm. Our other units are declared as after both (my assumption is to be compatible if the base images witch to the inet socket approach in the future), so just mirroring that pattern here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, I assumed the FCOS images used sshd.socket already.

recoveryUnit.Add("Service", "ExecStart", "/usr/local/bin/net-health-recovery.sh")
recoveryUnit.Add("Service", "StandardOutput", "journal")
recoveryUnit.Add("Service", "StandardError", "journal")
recoveryUnit.Add("Service", "StandardInput", "null")
recoveryUnit.Add("Install", "WantedBy", "default.target")

return recoveryUnit
}

func DefaultReadyUnitFile() parser.UnitFile {
u := parser.NewUnitFile()
u.Add("Unit", "After", "remove-moby.service sshd.socket sshd.service")
Expand Down
17 changes: 9 additions & 8 deletions pkg/machine/qemu/machine.go
Original file line number Diff line number Diff line change
Expand Up @@ -194,14 +194,15 @@ func (v *MachineVM) Init(opts machine.InitOptions) (bool, error) {
}

builder := ignition.NewIgnitionBuilder(ignition.DynamicIgnition{
Name: opts.Username,
Key: key,
VMName: v.Name,
VMType: define.QemuVirt,
TimeZone: opts.TimeZone,
WritePath: v.getIgnitionFile(),
UID: v.UID,
Rootful: v.Rootful,
Name: opts.Username,
Key: key,
VMName: v.Name,
VMType: define.QemuVirt,
TimeZone: opts.TimeZone,
WritePath: v.getIgnitionFile(),
UID: v.UID,
Rootful: v.Rootful,
NetRecover: useNetworkRecover(),
})

// If the user provides an ignition file, we need to
Expand Down
4 changes: 4 additions & 0 deletions pkg/machine/qemu/options_darwin.go
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,7 @@ func getRuntimeDir() (string, error) {
}
return tmpDir, nil
}

func useNetworkRecover() bool {
return true
}
4 changes: 4 additions & 0 deletions pkg/machine/qemu/options_freebsd.go
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,7 @@ func getRuntimeDir() (string, error) {
}
return tmpDir, nil
}

func useNetworkRecover() bool {
return false
}
4 changes: 4 additions & 0 deletions pkg/machine/qemu/options_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,7 @@ func getRuntimeDir() (string, error) {
}
return util.GetRootlessRuntimeDir()
}

func useNetworkRecover() bool {
return false
}
4 changes: 4 additions & 0 deletions pkg/machine/qemu/options_windows.go
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,7 @@ func getRuntimeDir() (string, error) {
}
return tmpDir, nil
}

func useNetworkRecover() bool {
return false
}