While deploying #2024, I've seen sporadic failures where some instances do not obtain an IPv4 address from GCP.
The failure is only visible from the Nanos side: from GCP's perspective the IP address is allocated correctly, the hostname resolves correctly, reverse DNS works correctly, ...
Note: I'm deploying to single-vCPU f1-micro instances (not trying to test anything SMP-related in this case).
SeaBIOS (version 1.8.2-google)
Total RAM Size = 0x0000000026600000 = 614 MiB
CPUs found: 1 Max CPUs supported: 1
found virtio-scsi at 0:3
virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
virtio-scsi blksize=512 sectors=2097152 = 1024 MiB
virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
virtio-scsi blksize=512 sectors=8388608 = 4096 MiB
drive 0x000f2800: PCHS=0/0/0 translation=lba LCHS=1024/32/63 s=2097152
drive 0x000f27c0: PCHS=0/0/0 translation=lba LCHS=522/255/63 s=8388608
Sending Seabios boot VM event.
Booting from Hard Disk 0...
en1: assigned FE80::4001:AFF:FE80:3D0
# expected errors/complaints from klibs (gcp, ntp, ...)
Out of ~1,000 instances deployed and active within less than 36 hours across 4 different zones of us-central1, at least 100 are suspected to have experienced this "issue". Most of them needed just one restart to come back online, while a couple took 2+ restarts. There was no visible pattern tied to a specific location or a specific time.
Before that PR, I had 57203bc deployed in a similar scenario, but with fewer instances (~400) and a much slower deployment pace/frequency, and no such issue was reported or experienced (which doesn't mean it did not happen).
This might have been a temporary issue or delay on the GCP side, or an edge case that I have yet to identify, and there might be nothing wrong with Nanos. I'm raising it to get more eyes on it, to make sure the Nanos behavior is correct, and to check what we can do to improve handling in such cases (e.g. add a config option to reboot the VM if we don't get an IP within x amount of time, or enhance the existing exec_wait_for_ip4_secs functionality when there is no cloud_init involved, ...).
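To make the "reboot if no IP within x seconds" suggestion concrete, here is a minimal sketch of the policy in Python. It is not Nanos code and does not reflect how exec_wait_for_ip4_secs is implemented. The names `wait_for_ipv4`, `boot_with_ip_watchdog`, `get_ipv4` and `reboot` are hypothetical, and the address lookup and reboot action are passed in as callables so that only the timeout logic is shown.

```python
import time
from typing import Callable, Optional


def wait_for_ipv4(
    get_ipv4: Callable[[], Optional[str]],
    timeout_secs: float,
    poll_interval_secs: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll get_ipv4() until it returns an address or the timeout expires.

    get_ipv4 is a hypothetical callable that returns the assigned IPv4
    address as a string, or None while no address has been assigned.
    """
    deadline = clock() + timeout_secs
    while True:
        addr = get_ipv4()
        if addr is not None:
            return addr
        if clock() >= deadline:
            return None  # timed out without an address
        sleep(poll_interval_secs)


def boot_with_ip_watchdog(
    get_ipv4: Callable[[], Optional[str]],
    reboot: Callable[[], None],
    timeout_secs: float,
    **wait_kwargs,
) -> Optional[str]:
    """Wait for an IPv4 address; call reboot() if none arrives in time.

    reboot is a hypothetical callable standing in for whatever mechanism
    would restart the VM.
    """
    addr = wait_for_ipv4(get_ipv4, timeout_secs, **wait_kwargs)
    if addr is None:
        reboot()
    return addr
```

A real implementation would probably also need a cap on consecutive reboots, so that an instance that can never get an address does not loop forever.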
This is the base config used:
At the moment I don't have more information or other details, but I plan to get back to this and test in a controlled environment.