DNS Outages #99

benfrancis · 2023-05-17T09:34:08Z

STR:

Leave registration server running and wait

Expected:

It keeps working

Actual:

Tunnelling service (and webthings.io website) suddenly drop offline and are inaccessible until the registration server is rebooted

This has been happening regularly for many months now, and requires a reboot of the registration server EC2 instances in order to fix it. We believe it is caused by PowerDNS crashing so that the registration server no longer resolves DNS lookups.

In the logs of the registration server Docker container there is an error which says "5001 questions waiting for database/backend attention. Limit is 5000, respawning". pdns then respawns, and after this has happened enough times the init system in the Docker container gives up and just kills it. This is happening on both EC2 instances.

We think that the DNS servers are occasionally getting overwhelmed by traffic, but we don't know where it's coming from. I suspect it isn't WebThings users, because the logs contain lots of failed lookups for subdomains that don't exist.
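To narrow down where the traffic is coming from, something like the following could tally queries per source IP and per queried name from the pdns logs. This is only a rough sketch: it assumes query logging is enabled and that each query line looks like `Remote <ip> wants '<name>|<type>' ...`. The pattern and the `tally_queries` helper are illustrative and would need adjusting to whatever our logs actually contain.

```python
import re
from collections import Counter

# Assumed shape of a pdns query log line; adjust this pattern to the real logs.
QUERY_RE = re.compile(r"Remote (?P<ip>\S+) wants '(?P<name>[^|']+)\|(?P<qtype>\w+)'")


def tally_queries(lines):
    """Count queries per source IP and per queried name (case-insensitive)."""
    by_ip = Counter()
    by_name = Counter()
    for line in lines:
        match = QUERY_RE.search(line)
        if match is None:
            continue  # not a query line, skip it
        by_ip[match.group("ip")] += 1
        by_name[match.group("name").lower()] += 1
    return by_ip, by_name
```

Feeding it the container logs and printing `by_ip.most_common(10)` and `by_name.most_common(10)` should show whether a handful of addresses or non-existent subdomains account for most of the load.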

Some potential solutions:

1. Configure rate limiting with something like dnsdist to set a limit on queries per second per IP address.
2. Re-configure pdns to use the gmysql back end so that pdns reads records directly from the database, rather than sending queries to the registration server, which then queries the database.
3. Modify the registration server by adding an option to use a hosted DNS service like Cloudflare as a back end, to take load off our EC2 instances. The downsides are that we would be dependent on Cloudflare, and that we'd have to set a minimum TTL of 60 seconds, so there would be brief outages when a gateway changes IP (but at least not for the whole domain).
4. Same as number 3, but re-write the registration server in Node.js so that more people are able to work on it (we have an IoT gateway written in Node.js and a cloud service written in Rust, and it should probably be the other way around!).
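To illustrate what option 1 would do, here is a minimal token-bucket sketch of a per-IP queries-per-second limit. It is not dnsdist configuration and not how dnsdist implements its limits; the `PerIPRateLimiter` class and its parameters are hypothetical names used only to show the idea.

```python
import time
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float   # queries this IP may still send right now
    updated: float  # clock reading when the bucket was last refilled


class PerIPRateLimiter:
    """Allow at most `qps` queries per second per source IP, with a small burst."""

    def __init__(self, qps, burst=None, clock=time.monotonic):
        self.qps = float(qps)
        self.burst = float(burst if burst is not None else qps)
        self.clock = clock
        self.buckets = {}

    def allow(self, ip):
        now = self.clock()
        bucket = self.buckets.get(ip)
        if bucket is None:
            # First query from this IP: start with a full bucket.
            bucket = Bucket(tokens=self.burst, updated=now)
            self.buckets[ip] = bucket
        else:
            # Refill in proportion to elapsed time, capped at the burst size.
            elapsed = now - bucket.updated
            bucket.tokens = min(self.burst, bucket.tokens + elapsed * self.qps)
            bucket.updated = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False  # over the limit: drop or refuse the query
```

A well-behaved gateway that sends only a few queries per second would never hit the limit, while a single address flooding the server would be cut off without affecting other clients. A real deployment would also need to expire idle buckets, which this sketch does not do.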

My personal preference is to start with option 1 and see if it helps. I suspect the spikes in traffic are not coming from WebThings users and if we cut off the source of the excessive traffic the service would hopefully go back to being stable again.

If anyone has experience of configuring rate limiting for pdns, I would be grateful for some help.

benfrancis added the bug label May 17, 2023
