
Optionally allow specification of custom DNS resolver #584

Closed
oxtoacart wants to merge 4 commits

Conversation

Contributor

@oxtoacart oxtoacart commented Oct 5, 2023

I tested this on route d1192fd7-4b4f-48dc-87d7-d99850445f53. I defaulted dns-servers to 172.16.0.53:53,8.8.8.8:53 and put in some debug statements to make sure our custom dialer is being used (it is). I was able to successfully proxy through the host.

This will allow us to specify explicit nameservers for http-proxy to use, bypassing Docker for DNS. This is useful because Docker has been hanging lately, causing name resolution on the proxies to fail.
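For context, the general shape of the approach is a resolver whose Dial callback ignores whatever Docker injected via /etc/resolv.conf and talks to an explicit server instead. The following is a minimal, standalone sketch using only the standard library (the PR itself dials through getlantern's netx and reads the servers from the dns-servers option); resolverFor, the hard-coded server, and the 5-second timeout are illustrative, not the PR's code:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolverFor returns a net.Resolver that sends every DNS query to dnsServer
// (e.g. "8.8.8.8:53") instead of the resolver configured in /etc/resolv.conf.
func resolverFor(dnsServer string) *net.Resolver {
	return &net.Resolver{
		// PreferGo forces the pure-Go resolver so the Dial override below is used.
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := &net.Dialer{Timeout: 5 * time.Second}
			// Ignore the address the resolver picked and dial our explicit
			// server, keeping the network (udp or tcp) it asked for.
			return d.DialContext(ctx, network, dnsServer)
		},
	}
}

func main() {
	ips, err := resolverFor("8.8.8.8:53").LookupIP(context.Background(), "ip", "example.com")
	fmt.Println(ips, err)
}
```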

@oxtoacart oxtoacart force-pushed the custom-dns branch 6 times, most recently from cd7537c to 8a49af2, October 6, 2023 00:29
@oxtoacart oxtoacart marked this pull request as ready for review October 6, 2023 02:03
@oxtoacart oxtoacart requested review from hwh33 and Crosse October 6, 2023 02:03
@oxtoacart oxtoacart force-pushed the custom-dns branch 7 times, most recently from 583db73 to f0698d4, October 6, 2023 10:59
@oxtoacart
Contributor Author

I've realized that this isn't working with Slack, so I have some work to do I guess.

@oxtoacart oxtoacart marked this pull request as draft October 6, 2023 12:52
custom_dns.go Outdated
// Google anomaly detection can be triggered very often over IPv6.
// Prefer IPv4 to mitigate, see issue #97
// If no IPv4 is available, fall back to IPv6
for _, candidate := range ips {
Contributor Author

This replaces the logic from preferIPv4.
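The IPv4-first selection that this loop implements looks roughly like the following sketch, reconstructed from the quoted hunk; the helper name pickIP, the package name, and the nil return for an empty slice are assumptions:

```go
package customdns

import "net"

// pickIP prefers an IPv4 address because Google anomaly detection triggers
// much more often over IPv6 (see http-proxy-lantern issue #97); it falls back
// to the first resolved address only when no IPv4 address is available.
func pickIP(ips []net.IP) net.IP {
	for _, candidate := range ips {
		if candidate.To4() != nil {
			return candidate
		}
	}
	if len(ips) > 0 {
		return ips[0]
	}
	return nil
}
```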

@oxtoacart
Contributor Author

Okay, this is now ready for review. I verified that it works both with the system resolver and with specifying 172.16.0.53:53,8.8.8.8:53 as the DNS servers. And Slack is happy.

@oxtoacart oxtoacart marked this pull request as ready for review October 6, 2023 14:23
Contributor

@Crosse Crosse left a comment

This is pretty neat! I'll defer to @hwh33 since he's more familiar with the codebase, but this seems good to me.

custom_dns.go Outdated
r := &net.Resolver{
PreferGo: true,
Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
return netx.DialContext(ctx, "udp", dnsServer)
Contributor

For responses larger than a single UDP payload can carry, the server may tell the client to retry over TCP. While this isn't terribly common, it can happen quite a lot with larger TXT and DNSSEC records. Since (eventually) the Go resolver is only looking for A and AAAA records, this is probably fine, but it's worth noting.

Contributor

I don't understand why we'd ever want to overwrite the network parameter... Shouldn't we trust that the Go DNS resolver is specifying the correct network?

Contributor Author

Cool. It looks like both unbound and Google support TCP connections on port 53, so I'll just pass through the protocol.
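In other words, the Dial callback ends up forwarding the resolver's requested network instead of hard-coding "udp". A sketch of that change, reusing the identifiers from the hunk above (netx, dnsServer):

```go
r := &net.Resolver{
	PreferGo: true,
	Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
		// Pass through whatever network the Go resolver asked for (udp or
		// tcp), so truncated responses can be retried over TCP; only the
		// server address is overridden.
		return netx.DialContext(ctx, network, dnsServer)
	},
}
```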

Contributor Author

Fixed.

@Crosse
Contributor

Crosse commented Oct 6, 2023

I have another take on this in getlantern/lantern-cloud#423 as well. I haven't had time to test it yet, but manually running docker run ... --mount ... worked, so I'm hopeful.

@oxtoacart
Contributor Author

Closing in favor of https://github.com/getlantern/lantern-cloud/pull/423

@oxtoacart
Contributor Author

@Crosse I've changed this so that it resolves in parallel now. I tested the parallel resolution on route 4f549ac2-6371-4de9-84e2-1a4ec65cf1e4 and I'm able to successfully proxy through it and don't see DNS resolution errors in the log.

custom_dns.go Outdated

// Returns a dialer that uses custom DNS servers to resolve the host. It uses all DNS servers
// in parallel and uses the first response it gets.
func customDNSDialer(dnsServers []string, timeout time.Duration) (func(context.Context, string, string) (net.Conn, error), error) {
Contributor

I think the parameter should be named dialTimeout. As it stands, it would intuitively be the resolution timeout (to me anyway).

Contributor Author

It is the resolution timeout. I just happen to also set it as a dial timeout because there's no point waiting to dial past the resolution timeout.

Contributor Author

Oh I see, it is a dialTimeout, even I was confused!

Contributor Author

Fixed.
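For reference, one way to share a single budget between resolution and the subsequent dial is to derive one context up front and let both steps consume it. This is a sketch with illustrative names (resolveAndDial, customdns), not the PR's exact code:

```go
package customdns

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolveAndDial bounds DNS resolution and the subsequent TCP dial with one
// shared timeout: whatever time resolution uses is no longer available to the
// dial, so the caller sees a single worst-case latency.
func resolveAndDial(parent context.Context, r *net.Resolver, host, port string, timeout time.Duration) (net.Conn, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	ips, err := r.LookupIP(ctx, "ip", host)
	if err != nil {
		return nil, err
	}
	if len(ips) == 0 {
		return nil, fmt.Errorf("no addresses for %s", host)
	}

	// DialContext honors ctx's deadline, so no separate dial timeout is needed.
	var d net.Dialer
	return d.DialContext(ctx, "tcp", net.JoinHostPort(ips[0].String(), port))
}
```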

custom_dns.go Outdated
r := &net.Resolver{
PreferGo: true,
Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
return netx.DialContext(ctx, "udp", dnsServer)
Contributor

I don't understand why we'd ever want to overwrite the network parameter... Shouldn't we trust that the Go DNS resolver is specifying the correct network?

custom_dns.go Outdated
Comment on lines 59 to 66
var resolveErr error
select {
case resolveErr = <-errs:
// got an error
default:
// no error, we just timed out
}
return nil, errors.New("unable to resolve host %v, last resolution error: %v", host, resolveErr)
Contributor

This will be mildly confusing in the case of a timeout, where resolveErr will be nil. You could just return different errors in each of the select cases.
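The suggested shape would be something like the following sketch, reusing the hunk's identifiers (errs, host, and the project's errors package, which takes printf-style arguments as in the quoted code); the exact messages are illustrative:

```go
select {
case resolveErr := <-errs:
	// A resolver reported a concrete failure; surface it directly.
	return nil, errors.New("unable to resolve host %v, last resolution error: %v", host, resolveErr)
default:
	// Nothing on errs means no resolver errored; we simply ran out of time.
	return nil, errors.New("timed out resolving host %v", host)
}
```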

Contributor Author

Fixed.

custom_dns.go Outdated

resolvedAddr := fmt.Sprintf("%s:%s", ip, port)
d := &net.Dialer{
Deadline: time.Now().Add(timeout),
Contributor

Nit: it'd be a little more direct to just set Timeout: timeout (it's kind of weird that net.Dialer supports deadlines and timeouts simultaneously).
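That is, the same intent expressed as a relative timeout rather than an absolute deadline:

```go
d := &net.Dialer{
	// Equivalent to Deadline: time.Now().Add(timeout), but expressed as a
	// relative duration via net.Dialer's Timeout field.
	Timeout: timeout,
}
```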

Contributor Author

Oh cool, they must have added that somewhere along the way. I have a memory of only Deadline being available.

Contributor Author

Fixed.

errors <- err
return
}
if len(ips) > 0 {
Contributor

This will mask misses as timeouts. We could instead have a miss channel or something.

Contributor Author

It turns out that if the DNS resolver can't find the host, it will return a "no such host" error, so I don't think we have to worry about masking. I did adjust the logic so that if all resolvers error, we fail immediately instead of waiting to hit the timeout. I added a unit test for that.
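A sketch of that overall shape — resolvers queried in parallel, first answer wins, and an error counter that fails fast once every resolver has reported — using illustrative names (lookupParallel, customdns) rather than the PR's exact code:

```go
package customdns

import (
	"context"
	"fmt"
	"net"
)

// lookupParallel queries every resolver at once and returns the first
// successful answer. If every resolver errors (including "no such host"),
// it fails immediately with the last error instead of waiting for the
// context to expire.
func lookupParallel(ctx context.Context, resolvers []*net.Resolver, host string) ([]net.IP, error) {
	// Buffered so that late goroutines never block after we return.
	results := make(chan []net.IP, len(resolvers))
	errs := make(chan error, len(resolvers))
	for _, r := range resolvers {
		go func(r *net.Resolver) {
			ips, err := r.LookupIP(ctx, "ip", host)
			if err != nil {
				errs <- err
				return
			}
			results <- ips
		}(r)
	}

	errorCount := 0
	for {
		select {
		case ips := <-results:
			return ips, nil
		case err := <-errs:
			errorCount++
			if errorCount == len(resolvers) {
				// Every resolver failed; no point waiting for the timeout.
				return nil, fmt.Errorf("all %d resolvers failed, last error: %w", len(resolvers), err)
			}
		case <-ctx.Done():
			return nil, fmt.Errorf("timed out resolving %s: %w", host, ctx.Err())
		}
	}
}
```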

Contributor

Thanks for digging into it!

}
if len(ips) > 0 {
// Google anomaly detection can be triggered very often over IPv6.
// Prefer IPv4 to mitigate, see issue #97
Contributor

What is this referencing? https://github.com/getlantern/engineering/issues/97 doesn't look related.

Contributor

@hwh33 hwh33 Oct 10, 2023

Oh I see, this comment was moved from elsewhere. @oxtoacart do you know what the deal is with this? I don't quite understand it. Is the comment saying that Google services don't like it when our proxy connects over IPv6?

Contributor Author

It's referencing https://github.com/getlantern/http-proxy-lantern/issues/97

IIRC, we were getting tons of CAPTCHAs and changing this to force IPv4 reduced that quite a bit.

@oxtoacart
Contributor Author

@hwh33 Thanks for the great feedback! I've addressed your comments and retested on route d1192fd7-4b4f-48dc-87d7-d99850445f53.

Comment on lines +71 to +72
errorCount++
if errorCount == len(resolvers) {
Contributor

Nice!

start := time.Now()
_, err = d(context.Background(), "tcp", "blubbaasdfsadfsadf.dude:443")
require.Error(t, err)
fmt.Println(err)
Contributor

Nit: leftover print statement

Contributor

@hwh33 hwh33 left a comment

One nit-pick, but looks good to me! Thanks for addressing that stuff =)

Contributor

@Crosse Crosse left a comment

I'm generally good with this, code-wise. I would suggest not having a default value for the DNS servers to use and making it an explicit opt-in (principle of least astonishment).

After thinking about it longer, my only reservation is parallelizing the DNS queries. In our current situation, we would send simultaneous queries to our local resolver and Google and let them race for an answer, which sounds good on the face of it. But note that each of our phosts averages around 3,000-5,000 queries per second, for an aggregate of around 100,000 queries per second. The reason Colin and I began using a local resolver in the first place was that we suspected our DNS queries were being rate-limited by the public resolver we had configured, and that was at much lower levels than we see today. If we unconditionally send queries to some non-Lantern-controlled DNS servers, we risk the same result and, potentially, a full block.

In my opinion, the best solution for all of this is actually to fix our local resolver issues (i.e., unbound segfaulting and locking up; see getlantern/lantern-cloud#428 for an idea of replacing Unbound altogether with BIND) and, eventually, run our own caching resolvers in the lantern cloud to act as upstreams for all of the phost-local forwarding resolvers.

@oxtoacart
Contributor Author

In my opinion, the best solution for all of this is actually to fix our local resolver issues (i.e., unbound segfaulting and locking up; see https://github.com/getlantern/lantern-cloud/pull/428 for an idea of replacing Unbound altogether with BIND) and, eventually, run our own caching resolvers in the lantern cloud to act as upstreams for all of the phost-local forwarding resolvers.

I agree 100%.

@myleshorton
Collaborator

Should we close this one then? I generally agree that hammering Google DNS is probably not a great idea.

@Crosse
Contributor

Crosse commented Oct 20, 2023

Should we close this one then?

I'm on the fence. I think having the option to let lantern-proxy do custom DNS resolution could be useful in the future even if we don't use it right now, but I also wonder if we should avoid adding a knob for functionality we don't intend to use in the near-term.

@oxtoacart
Contributor Author

I also wonder if we should avoid adding a knob for functionality we don't intend to use in the near-term.

Yeah, I would just close this PR and if you ever need that knob in the future, you can resurrect it.

@Crosse Crosse closed this Oct 20, 2023