question(gcp/network/dhcp): sporadic failures to obtain internal IPv4 from dhcp on GCP #2029

Open
rinor opened this issue Jun 10, 2024 · 0 comments


rinor commented Jun 10, 2024

While deploying #2024, I've experienced sporadic issues where some instances fail to obtain an IPv4 address from DHCP on GCP.

  • The failure is from the nanos perspective, since from the GCP perspective the IP address is allocated correctly, the hostname resolves correctly, reverse DNS works correctly, etc. (see the gcloud sketch after the boot log below).

Note: I'm deploying to single-vCPU f1-micro instances (not trying to test anything SMP-related in this case).

SeaBIOS (version 1.8.2-google)
Total RAM Size = 0x0000000026600000 = 614 MiB
CPUs found: 1     Max CPUs supported: 1
found virtio-scsi at 0:3
virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
virtio-scsi blksize=512 sectors=2097152 = 1024 MiB
virtio-scsi vendor='Google' product='PersistentDisk' rev='1' type=0 removable=0
virtio-scsi blksize=512 sectors=8388608 = 4096 MiB
drive 0x000f2800: PCHS=0/0/0 translation=lba LCHS=1024/32/63 s=2097152
drive 0x000f27c0: PCHS=0/0/0 translation=lba LCHS=522/255/63 s=8388608
Sending Seabios boot VM event.
Booting from Hard Disk 0...
en1: assigned FE80::4001:AFF:FE80:3D0
# expected errors/complaints from klibs (gcp, ntp, ...)
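
For reference, the allocation on the GCP side can be cross-checked from outside the instance; a rough sketch using the gcloud CLI (instance name and zone are placeholders, and an authenticated gcloud is assumed):

# prints the internal IPv4 that GCP allocated, regardless of what the guest obtained via DHCP
gcloud compute instances describe myapp-instance \
  --zone us-central1-c \
  --format='get(networkInterfaces[0].networkIP)'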

Out of ~1,000 instances currently deployed and active for less than 36 hours across 4 different zones of us-central1, at least 100 are suspected to have experienced this "issue". Most of them needed just one restart to be back online, while a couple of them took 2+ restarts. There was no visible pattern pointing to a specific location or a specific time.

Before that PR, I had 57203bc deployed in a similar scenario, but with fewer instances (~400) and a much slower deployment pace/frequency, and no such issue was reported or experienced (which doesn't mean it did not happen).

  • While this might have been a temporary issue/delay on GCP, or an edge case that I have yet to identify, and there might be nothing wrong with nanos, I'm raising this to get more eyes on it, to make sure nanos behavior is correct and to check what we can do to improve the handling in such cases (e.g. maybe add a config to reboot the VM if we don't get an IP for x amount of time, or enhance the existing exec_wait_for_ip4_secs functionality when there is no cloud_init involved, ...); see the sketch below.
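
To make the last idea concrete, here is a purely illustrative sketch of what such a manifest setting could look like; the reboot_on_no_ip4_secs key is invented here only to show the kind of knob meant above and does not exist in nanos today, while exec_wait_for_ip4_secs is the existing option:

  "ManifestPassthrough": {
    "exec_wait_for_ip4_secs": "5",
    "reboot_on_no_ip4_secs": "60"
  }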

This is the base config used:

{
  "Program": "myapp",
  "Version": "myapp-af8b26d-sv70",
  "NanosVersion": "nanos-5779988",
  "Mounts": {
    "myapp-storage@${myappid}-v": "/storage"
  },
  "NameServers": [
    "169.254.169.254",
    "8.8.8.8",
    "1.1.1.1"
  ],
  "Klibs": [
    "gcp",
    "tls",
    "ntp",
    "cloud_init"
  ],
  "ManifestPassthrough": {
    "readonly_rootfs": "true",
    "exec_wait_for_ip4_secs": "5",
    "reboot_on_exit": "*",
    "ntp_servers": [
      "169.254.169.254"
    ],
    "gcp": {
      "metrics": {
        "interval": "300",
        "disk": {}
      }
    },
    "cloud_init": {
      "download_env": [
        {
          "auth": "",
          "src": "http://10.128.0.5:7367/config/{host}/{host}_env.json"
        }
      ]
    }
  },
  "CloudConfig": {
    "Spot": false,
    "Platform": "gcp",
    "ProjectID": "xxxxx",
    "BucketName": "xxxxx",
    "Flavor": "f1-micro",
    "InstanceProfile": "[email protected]",
    "VPC": "default",
    "Subnet": "default",
    "Zone": "us-central1-c",
    "Tags": [
      {
        "key": "service",
        "value": "myapp",
        "attribute": {
          "instance_label": true,
          "instance_network": true
        }
      }
    ]
  },
  "RunConfig": {
    "AttachVolumeOnInstanceCreate": true
  }
}

At the moment I don't have more information or other details. Nevertheless, I plan to get back to this and test in a controlled environment.
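
When I do, one way to detect and bounce affected instances in bulk could look roughly like the shell sketch below (assumptions: the labels.service=myapp filter relies on the instance_label tag from the config above, the zone is hard-coded, and the grep pattern for a successful IPv4 assignment is guessed from the "en1: assigned ..." boot log line together with the 10.x range of the default subnet):

# for each instance labeled service=myapp, check the serial console for an IPv4 assignment
for inst in $(gcloud compute instances list --filter='labels.service=myapp' --format='value(name)'); do
  if ! gcloud compute instances get-serial-port-output "$inst" --zone us-central1-c 2>/dev/null \
       | grep -q 'en1: assigned 10\.'; then
    echo "no IPv4 seen on $inst, resetting"
    gcloud compute instances reset "$inst" --zone us-central1-c
  fi
done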
