Migration from v0.26.0 to v0.35.2 resulted in some devices failing to connect to NetBird (self-hosted) #3177

Open
realfresh opened this issue Jan 13, 2025 · 0 comments

Hey everyone, I'm in a difficult situation right now, having lost connection to around 20 IoT devices.

We have had NetBird running for quite a while on v0.26.0. A couple of days ago, we decided to upgrade to the latest version. We saw in the notes that the jsonfile storage mechanism would be automatically replaced with sqlite storage from v0.28.0 onwards. So here's how we did the upgrade:

  1. Clone the NetBird v0.27.0 repo, copy the artifacts directory and run the configuration script. Run the new docker compose file.
  2. While v0.27.10 is running, use the migration CLI to migrate the JSON file to the new SQLite file.
  3. Clone the v0.35.2 repo, copy the artifacts directory and run the configuration script again, making sure the management.json file storage driver is set to sqlite.

After doing these steps, we noticed existing clients were all offline but slowly, over the course of several hours, 33/71 clients came back online.
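One thing that could be checked at this point is whether every peer from the old JSON store actually made it into the SQLite file. The sketch below compares the two sets of peer public keys. It rests on assumptions that are not confirmed anywhere in this issue: that the JSON store has a top-level `Accounts` map whose entries hold a `Peers` map with a `Key` field per peer, and that the SQLite file has a `peers` table with a `key` column. The function names are hypothetical, and the table and column names are parameters so they can be adjusted to the real schema. Run it against copies of the files, not the live store.

```python
import json
import sqlite3


def peer_keys_from_json(path):
    """Collect peer public keys from the JSON store (layout is an assumption)."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    keys = set()
    # Assumed layout: {"Accounts": {<id>: {"Peers": {<peer id>: {"Key": ...}}}}}
    for account in (data.get("Accounts") or {}).values():
        for peer in (account.get("Peers") or {}).values():
            key = peer.get("Key")
            if key:
                keys.add(key)
    return keys


def peer_keys_from_sqlite(path, table="peers", column="key"):
    """Collect peer public keys from the SQLite store (table/column are assumptions)."""
    # Open read-only so a wrong path or a typo cannot create or modify a file.
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        rows = conn.execute(f'SELECT "{column}" FROM "{table}"').fetchall()
    finally:
        conn.close()
    return {row[0] for row in rows if row[0]}


def missing_after_migration(json_path, sqlite_path):
    """Return keys present in the JSON store but absent from the SQLite store."""
    return peer_keys_from_json(json_path) - peer_keys_from_sqlite(sqlite_path)
```

An empty result would suggest the migration carried all peers over and the cause lies elsewhere; a non-empty result would list the keys worth comparing against the devices that stay offline.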

We somehow managed to get 50/71 clients back online and connecting to our self-hosted NetBird instance after trying many different things, such as:

  1. Downgrading back to v0.27.10
  2. Using the old JSON file instead of the SQLite database
  3. Disabling authentication on the coturn server so we didn't have auth failures there
  4. Trying various Turns[].Password values in the management.json file
  5. And a lot more we did in a panic that I can't remember.

However, the remaining devices don't seem to want to come back. The management server logs show occasional lines like this:

WARN [accountID: UNKNOWN, peerID: <<REDACTED>>, context: GRPC, requestID: a1b1b45a-01af-4a66-a196-58dcf7c1cde7] management/server/grpcserver.go:471: failed logging in peer <<REDACTED>: no peer auth method provided, please use a setup key or interactive SSO login
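To see how many distinct peers are being rejected, and how often, the management log can be tallied by peer ID. This is a minimal sketch: the regular expression is modeled only on the single WARN line quoted above, so other log formats may not match, and `count_login_failures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Modeled on the one WARN line quoted in this issue; adjust if the format differs.
LOGIN_FAILURE = re.compile(
    r"peerID: (?P<peer>[^,\]]+), context: GRPC, requestID: (?P<req>[0-9a-f-]+)\]"
    r".*failed logging in peer .*: no peer auth method provided"
)


def count_login_failures(lines):
    """Return a Counter mapping peerID -> number of rejected login attempts."""
    failures = Counter()
    for line in lines:
        match = LOGIN_FAILURE.search(line)
        if match:
            failures[match.group("peer").strip()] += 1
    return failures
```

Feeding it the lines of a saved management log (for example `count_login_failures(open("management.log"))`, where the file name is a placeholder) gives one entry per failing peer, which can then be compared with the list of devices that are still offline.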

I happen to have one test IoT device on hand which is also unable to connect (the other devices are all over the country). Looking at the NetBird daemon logs on that device, I see this:

systemd[1]: Started netbird.service - A WireGuard-based mesh network that connects your devices into a single private network..
netbird[1021]: 2025-01-12T16:40:13+13:00 INFO client/cmd/service_controller.go:24: starting Netbird service
netbird[1021]: 2025-01-12T16:40:13+13:00 INFO client/cmd/service_controller.go:64: started daemon server: /var/run/netbird.sock
netbird[1021]: 2025-01-12T16:40:13+13:00 INFO client/internal/connect.go:119: starting NetBird client version 0.28.9 on linux/arm64
netbird[1021]: 2025-01-12T16:40:14+13:00 ERRO management/client/grpc.go:350: failed to login to Management Service: rpc error: code = PermissionDenied desc = no peer auth method provided, please use a setup key or interactive SSO login
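For triaging several devices, the same two facts (client version and login error) can be pulled out of each daemon log. The sketch below is based only on the log lines quoted above; the patterns and the `summarize_daemon_log` name are illustrative assumptions, not part of NetBird.

```python
import re

# Patterns modeled on the daemon log lines quoted in this issue.
VERSION = re.compile(
    r"starting NetBird client version (?P<version>\S+) on (?P<platform>\S+)"
)
LOGIN_ERROR = re.compile(
    r"failed to login to Management Service: rpc error: "
    r"code = (?P<code>\w+) desc = (?P<desc>.*)"
)


def summarize_daemon_log(lines):
    """Extract the client version, platform and any login errors from log lines."""
    summary = {"version": None, "platform": None, "login_errors": []}
    for line in lines:
        match = VERSION.search(line)
        if match:
            summary["version"] = match.group("version")
            summary["platform"] = match.group("platform")
            continue
        match = LOGIN_ERROR.search(line)
        if match:
            summary["login_errors"].append(
                (match.group("code"), match.group("desc").strip())
            )
    return summary
```

On the log above this would report client version 0.28.9 on linux/arm64 and one `PermissionDenied` login error, which makes it easy to spot whether the failing devices share a client version.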

Restarting the daemon only produces the same result.

Is there any workaround or solution for us to get the remaining devices connected again? It seems that if there were some way to temporarily bypass auth, so those devices could authenticate successfully and reconnect, things would be solved.

Any suggestions and ideas are much appreciated!

PS: I'm certain this whole problem is "user error" and not an issue with NetBird itself. Hopefully it's possible to add safeguards so that issues like this don't happen to others.
