DNS Outages #99

Open
benfrancis opened this issue May 17, 2023 · 0 comments

STR:

  • Leave registration server running and wait

Expected:

  • It keeps working

Actual:

  • The tunnelling service (and the webthings.io website) suddenly drops offline and stays inaccessible until the registration server is rebooted

This has been happening regularly for many months now, and requires a reboot of the registration server EC2 instances in order to fix it. We believe it is caused by PowerDNS crashing so that the registration server no longer resolves DNS lookups.

In the logs of the registration server docker container there is an error which says "5001 questions waiting for database/backend attention. Limit is 5000, respawning". pdns then respawns, and once that has happened enough times the init system in the docker container gives up and kills it. This is happening on both EC2 instances.

We think that the DNS servers are occasionally getting overwhelmed by traffic, but we don't know where it's coming from. I suspect it isn't WebThings users, because the logs contain lots of failed lookups for subdomains that don't exist.
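One way to narrow down where the traffic comes from is to tally queries and failed lookups per client IP. The following is only a rough sketch: it assumes the query log has already been parsed into records, and the `QueryRecord` shape and `top_talkers` helper are hypothetical names for illustration, not the pdns log format or any existing tool.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class QueryRecord(NamedTuple):
    """One parsed DNS query log entry (hypothetical shape, not the pdns log format)."""
    client_ip: str
    qname: str
    found: bool  # True if the queried name matched an existing record


def top_talkers(records: Iterable[QueryRecord], n: int = 10) -> List[Tuple[str, int, int]]:
    """Return (client_ip, total_queries, failed_queries) for the n busiest clients."""
    total = Counter()
    failed = Counter()
    for rec in records:
        total[rec.client_ip] += 1
        if not rec.found:
            failed[rec.client_ip] += 1
    # most_common() sorts by total query count, busiest client first.
    return [(ip, count, failed[ip]) for ip, count in total.most_common(n)]
```

A client with a high query count and a high proportion of failed lookups would be a candidate for the source of the spikes.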

Some potential solutions:

  1. Configuring rate limiting with something like dnsdist to set a limit on queries per second per IP address
  2. Re-configure pdns to use the gmysql back end so that pdns reads records directly from the database, rather than passing lookups to the registration server, which then queries the database
  3. Modify the registration server by adding an option to use a hosted DNS service like Cloudflare as a back end, to take load off our EC2 instances. The downsides are that (a) we would be dependent on Cloudflare, and (b) we'd have to set a minimum TTL of 60 seconds, so there would be brief outages when a gateway changes IP (but at least not for the whole domain)
  4. Same as number 3, but re-write the registration server in Node.js so that more people are able to work on it (we have an IoT gateway written in Node.js and a cloud service written in Rust and it should probably be the other way around!)
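
To illustrate what option 1 would do, here is a minimal sketch of per-IP rate limiting using a token bucket. It is not dnsdist configuration (the dnsdist documentation describes a `MaxQPSIPRule` for this purpose) and it is not code from the registration server. The `PerIPRateLimiter` class and its parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class PerIPRateLimiter:
    """Token bucket per client IP: sustained rate `qps`, bursts of up to `burst` queries."""

    def __init__(self, qps: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.qps = qps
        self.burst = burst
        self.clock = clock
        # client IP -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, client_ip: str) -> bool:
        """Return True if this query should be answered, False if it should be dropped."""
        now = self.clock()
        tokens, last = self._buckets.get(client_ip, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.qps)
        if tokens >= 1.0:
            self._buckets[client_ip] = (tokens - 1.0, now)
            return True
        self._buckets[client_ip] = (tokens, now)
        return False
```

With a limit like this, a single noisy client is cut off once it exhausts its bucket, while other clients keep being answered. The sketch never evicts idle buckets, so a real deployment would also need to bound memory use.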

My personal preference is to start with option 1 and see if it helps. I suspect the spikes in traffic are not coming from WebThings users and if we cut off the source of the excessive traffic the service would hopefully go back to being stable again.

If anyone has experience of configuring rate limiting for pdns, I would be grateful for some help.

@benfrancis benfrancis added the bug label May 17, 2023