-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a net health recovery service to qemu machines #21262
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -53,15 +53,16 @@ func GetNodeGrp(grpName string) NodeGroup { | |
} | ||
|
||
type DynamicIgnition struct { | ||
Name string | ||
Key string | ||
TimeZone string | ||
UID int | ||
VMName string | ||
VMType define.VMType | ||
WritePath string | ||
Cfg Config | ||
Rootful bool | ||
Name string | ||
Key string | ||
TimeZone string | ||
UID int | ||
VMName string | ||
VMType define.VMType | ||
WritePath string | ||
Cfg Config | ||
Rootful bool | ||
NetRecover bool | ||
} | ||
|
||
func (ign *DynamicIgnition) Write() error { | ||
|
@@ -97,7 +98,7 @@ func (ign *DynamicIgnition) GenerateIgnitionConfig() error { | |
|
||
ignStorage := Storage{ | ||
Directories: getDirs(ign.Name), | ||
Files: getFiles(ign.Name, ign.UID, ign.Rootful, ign.VMType), | ||
Files: getFiles(ign.Name, ign.UID, ign.Rootful, ign.VMType, ign.NetRecover), | ||
Links: getLinks(ign.Name), | ||
} | ||
|
||
|
@@ -231,6 +232,21 @@ func (ign *DynamicIgnition) GenerateIgnitionConfig() error { | |
} | ||
ignSystemd.Units = append(ignSystemd.Units, qemuUnit) | ||
} | ||
|
||
if ign.NetRecover { | ||
contents, err := GetNetRecoveryUnitFile().ToString() | ||
if err != nil { | ||
return err | ||
} | ||
|
||
recoveryUnit := Unit{ | ||
Enabled: BoolToPtr(true), | ||
Name: "net-health-recovery.service", | ||
Contents: &contents, | ||
} | ||
ignSystemd.Units = append(ignSystemd.Units, recoveryUnit) | ||
} | ||
|
||
// Only after all checks are done | ||
// it's ready create the ingConfig | ||
ign.Cfg = Config{ | ||
|
@@ -303,7 +319,7 @@ func getDirs(usrName string) []Directory { | |
return dirs | ||
} | ||
|
||
func getFiles(usrName string, uid int, rootful bool, vmtype define.VMType) []File { | ||
func getFiles(usrName string, uid int, rootful bool, vmtype define.VMType, netRecover bool) []File { | ||
files := make([]File, 0) | ||
|
||
lingerExample := parser.NewUnitFile() | ||
|
@@ -574,6 +590,23 @@ Delegate=memory pids cpu io | |
}, | ||
}) | ||
|
||
// Only necessary for qemu on mac | ||
if netRecover { | ||
files = append(files, File{ | ||
Node: Node{ | ||
User: GetNodeUsr("root"), | ||
Group: GetNodeGrp("root"), | ||
Path: "/usr/local/bin/net-health-recovery.sh", | ||
}, | ||
FileEmbedded1: FileEmbedded1{ | ||
Mode: IntToPtr(0755), | ||
Contents: Resource{ | ||
Source: EncodeDataURLPtr(GetNetRecoveryFile()), | ||
}, | ||
}, | ||
}) | ||
} | ||
|
||
return files | ||
} | ||
|
||
|
@@ -743,6 +776,37 @@ func (i *IgnitionBuilder) Build() error { | |
return i.dynamicIgnition.Write() | ||
} | ||
|
||
func GetNetRecoveryFile() string { | ||
return `#!/bin/bash | ||
# Verify network health, and bounce the network device if host connectivity | ||
# is lost. This is a temporary workaround for a known rare qemu/virtio issue | ||
# that affects some systems | ||
|
||
sleep 120 # allow time for network setup on initial boot | ||
while true; do | ||
sleep 30 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Have you considered using a systemd timer which runs every 30 seconds? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right yeah only reason it's done this way is to avoid the log generation noise that comes from them that @rhatdan was warning about. With a frequent timer like this it would be a lot of noise. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah logging, makes sense! |
||
curl -s -o /dev/null --max-time 30 http://192.168.127.1/health | ||
if [ "$?" != "0" ]; then | ||
echo "bouncing nic due to loss of connectivity with host" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. where does this line is reported to ? to see if it occurred or not in my podman machine There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The unit file sends stdout&stderr to the system journal, so can be found using journalctl. |
||
ifconfig enp0s1 down; ifconfig enp0s1 up | ||
fi | ||
done | ||
` | ||
} | ||
|
||
func GetNetRecoveryUnitFile() *parser.UnitFile { | ||
recoveryUnit := parser.NewUnitFile() | ||
recoveryUnit.Add("Unit", "Description", "Verifies health of network and recovers if necessary") | ||
recoveryUnit.Add("Unit", "After", "sshd.socket sshd.service") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fwiw, I'm not sure There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sshd.socket and sshd.service are mutually exclusive alternates (currently our fcos images are using sshd.service atm. Our other units are declared as after both (my assumption is to be compatible if the base images witch to the inet socket approach in the future), so just mirroring that pattern here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah ok, I assumed the FCOS images used |
||
recoveryUnit.Add("Service", "ExecStart", "/usr/local/bin/net-health-recovery.sh") | ||
recoveryUnit.Add("Service", "StandardOutput", "journal") | ||
recoveryUnit.Add("Service", "StandardError", "journal") | ||
recoveryUnit.Add("Service", "StandardInput", "null") | ||
recoveryUnit.Add("Install", "WantedBy", "default.target") | ||
|
||
return recoveryUnit | ||
} | ||
|
||
func DefaultReadyUnitFile() parser.UnitFile { | ||
u := parser.NewUnitFile() | ||
u.Add("Unit", "After", "remove-moby.service sshd.socket sshd.service") | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,3 +11,7 @@ func getRuntimeDir() (string, error) { | |
} | ||
return tmpDir, nil | ||
} | ||
|
||
func useNetworkRecover() bool { | ||
return true | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,3 +11,7 @@ func getRuntimeDir() (string, error) { | |
} | ||
return tmpDir, nil | ||
} | ||
|
||
func useNetworkRecover() bool { | ||
return false | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,3 +11,7 @@ func getRuntimeDir() (string, error) { | |
} | ||
return tmpDir, nil | ||
} | ||
|
||
func useNetworkRecover() bool { | ||
return false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this sleep still needed with the systemd unit which has
recoveryUnit.Add("Unit", "After", "sshd.socket sshd.service")
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sshd only has an after on network.target, which is quasi reliable:
"network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined [snip]"
Since we only seem to see the problem with long running vms, my thinking was it was better to just wait a bit longer in the script without disrupting / delaying boot (e.g. using something like network-online.target).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I agree sleeping for 2 minutes (or even 5 minutes or 1 hour or ... :) is no big deal given when the bug happens. I just wondered.