Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core) embedded dns resolver #1225

Closed
wants to merge 2 commits into from
Closed

feat(core) embedded dns resolver #1225

wants to merge 2 commits into from

Conversation

subnetmarco
Copy link
Member

@subnetmarco subnetmarco commented May 17, 2016

This PR drastically changes how DNS resolutions work.

  • Removes the dnsmasq dependency by manually parsing the /etc/hosts and /etc/resolv.d files. During resolutions, the entries in /etc/hosts always have the highest priority, with fallback to the configured DNS server.
  • It's compliant with the TTL value returned by the DNS resolution, caching the resolutions in the shared dict for the amount specified in the TTL field.
  • The configuration property dns_resolvers_available has been removed, and dns_resolver has been introduced. By default the system uses the local settings configured at /etc/resolv.d (using the first nameserver and being compliant with any options specified, if any). The DNS resolver can be manually set by uncommenting dns_resolver.
  • Both A and SRV resolutions are supported. SRV resolutions are enabled by default and can be disabled to gain some performance in scenarios when it's not required, setting the property srv: false into dns_resolver.
  • Introduces balancer_by_lua in the configuration file.

We can wait refactor/cli to be merged into next before merging this, as you like.

The PR is almost complete:

  • Missing a couple more tests
  • Reviewing the code

return {host = result.host, port = result.port}
end

-- Query for A record
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an SRV record is supposed to start with _, which is an illegal character for A records iirc. So instead of querying them both successively, check for the underscore and query either A or SRV.

Copy link
Member Author

@subnetmarco subnetmarco May 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an SRV record is supposed to start with _

I am not aware of this convention. In all the Consul examples an SRV record looks like an A record. Also in my tests SRV records were just returning port information, not IP information, so querying both is required.

Any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See RFC2782. But apparently consul has multiple ways of looking up the same service; 'Standard lookup' and 'RFC 2782 lookup', see https://www.consul.io/docs/agent/dns.html

The way I understand it if you define a service in Consul it constructs multiple ways to retrieve the same service info.

It will be known by this name; [tag.]<service>.service[.datacenter].<domain>
as well as; _<service>._<protocol>.service[.datacenter][.domain]

They return both a port as well as a hostname/ip. So in case of a hostname, a second lookup may be required.

try; dig srvtest.thijsschreijer.nl SRV (not Consul, but regular DNS)

@Tieske
Copy link
Member

Tieske commented May 17, 2016

As a general comment; can we cache the dns resolver objects? this creates a new dns resolver object for every request (assuming ttl = 0). Can't we reuse them. Less setting up and tearing down, and less GC cycles.

@subnetmarco
Copy link
Member Author

subnetmarco commented May 17, 2016

can we cache the dns resolver objects?

Yes, we should - but when trying to cache them though the system was returning errors after a few queries. In the resty-dns examples the resolver is initialized every time too: https://github.com/openresty/lua-resty-dns#synopsis

The TTL is handled differently, since the resolutions are stored in the shared dict and not in the object itself.

@Tieske
Copy link
Member

Tieske commented May 17, 2016

Yes, we should - but when trying to cache them though the system was returning errors after a few queries.

Did you provide a locking mechanism? you must make sure (due to the co-sockets nature) that every DNS resolver object only handles a single complete cycle at a time. Before reusing the object for the next query.

@subnetmarco
Copy link
Member Author

subnetmarco commented May 17, 2016

That didn't really concern me a lot since the base example of resty-dns creates a new resolver anyways, which may be the intended use, or not have any particular performance impact. It's an open question.

Locking would slow down resolution across the workers on high throughputs, maybe more than initializing a new resolver. I will benchmark this branch to figure it out.

@Tieske
Copy link
Member

Tieske commented May 17, 2016

Locking is rather simple. Something like this; Create two lists at module level; free and busy. Whenever you whenever a socket operation is done (presumably all through resolve) grab one from free and move it from free to busy. When the full resolution cycle is done (sockets are free again), before resolve returns, move it back from busy to free.

The problem is that when a request is sent, the coroutine/light-thread is yielded by openresty. Now if a second request comes in and uses the same sockets before the response is returned, it goes awry. Cosocket code is written as sync, but really is async. Yielding is implicit whenever you do a socket operation.

@subnetmarco
Copy link
Member Author

My point is that we may not need locking in the first place, because creating the resolver every time doesn't really have any performance slowdowns (which seems to be the intended use in the resty-dns examples I linked above).

At a closer look when the library is initialized it creates a new socket for every request (at https://github.com/openresty/lua-resty-dns/blob/master/lib/resty/dns/resolver.lua#L112), which is a non-blocking operation, and then sets a timeout on the socket (https://github.com/openresty/lua-resty-dns/blob/master/lib/resty/dns/resolver.lua#L133) in order to enforce the DNS resolution timeout.

Your system seems more like a pool of DNS resolvers which while may be doing the work, adds additional complexity. I would first benchmark the performance of the current system, and then take an action depending if any performance loss is obtained.

In other words, let's make it work, let's run it, and see :)

@sonicaghi
Copy link
Member

we should test this with other Service discovery tools other than consul. Like zookeeper.

@subnetmarco
Copy link
Member Author

subnetmarco commented May 17, 2016

I am aiming at supporting Consul and etcd - we could always keep the SRV resolution property explicit with the current implementation (with the option to disable it) if the first request with additional data doesn't work.

@kiddkai
Copy link

kiddkai commented May 18, 2016

Just updated my PR, hope that it got 🈴 soon~, then you guys can finishes this feature~~

@lagonnebula
Copy link

lagonnebula commented May 19, 2016

Hey, i built this branch to try it, but i have some problem with a service configuration. How do you set the query to do to the dns discovery service ?


-- Query for SRV record
local final_port = port
if enable_srv then

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that on the enable_srv he didn't load corrrectly the configuration, I put the srv: true on the yaml configuration file but it is always nil

So when im querying consul, i never got the port of the service. I made it to work by adding a realy bad fix.
On resolver.lua

if dns_config.srv == nil then
     dns_config.srv = true
 end

before the call of the

dns:resolve(parsed_url.host, parsed_url.port, dns_config.srv)

@Tieske
Copy link
Member

Tieske commented May 19, 2016

we should test this with other Service discovery tools other than consul. Like zookeeper.

If we stick to the standards, they all should work. Specific optimization can be done later.

My point is that we may not need locking in the first place, because creating the resolver every time doesn't really have any performance slowdowns (which seems to be the intended use in the resty-dns examples I linked above).

It's not so much in resolving where the slowdown is, but in the GC. Everything created also needs to be cleaned up. Anyway, currently I don't think they can be reused, see openresty/lua-resty-dns#13

@Tieske
Copy link
Member

Tieske commented May 19, 2016

So some more issues to deal with;

  • A and AAAA records; see http://stackoverflow.com/questions/4082081/requesting-a-and-aaaa-records-in-single-dns-query so we need to do two queries
  • CNAME records; try; smtp.thijsschreijer.nl (CNAME) which points to thuis.thijsschreijer.nl (CNAME) which points to wdnaste.duckdns.org (A) which resolves to an IP. This is also the order in which the results are returned. So in this case answer[1].address == answer[2].address == nil we would need answer[3].address

@sonicaghi sonicaghi added this to the 0.9 milestone May 20, 2016
@hutchic
Copy link
Contributor

hutchic commented May 25, 2016

https://github.com/docker/libkv kv abstraction library with support for consul, etcd and zookeeper. Might not fit the use case / technology stack but thought it worth mentioning

@Tieske Tieske removed this from the 0.9 milestone Jul 20, 2016
@thibaultcha thibaultcha changed the title Feature/dns feat(core) embedded dns resolver Jul 20, 2016
@bitsofinfo
Copy link

Whats the eta on this? We publish a microservice container endpoint info into consul via registrator. Will this PR cover the use-case where kong could leverage that named service info in consul to dynamically route all requests to the correct backends as containers come up/down?

@Tieske
Copy link
Member

Tieske commented Jul 22, 2016

it will not make it in the upcoming 0.9. But is scheduled for the next one. Presumably 0.10 if all goes well 4 weeks after 0.9 release (as 2 weeks for an RC, 2 weeks later final).

@subnetmarco
Copy link
Member Author

subnetmarco commented Jul 22, 2016

@bitsofinfo just so you know, you can currently use Consul if you are not leveraging SRV records but only A records.

@Tieske
Copy link
Member

Tieske commented Aug 26, 2016

This branch is now superseded by #1541.

If anyone wants to test the current status, please see #1541 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants