Add SRV resolver to loadbalancer exporter to use hostnames and track IPs #29760
Conversation
```go
type K8sSvcResolver struct {
	Service string  `mapstructure:"service"`
	Ports   []int32 `mapstructure:"ports"`
}

// TODO: Should a common struct be used for dns-based resolvers?
```
Should there be a common dns-based struct?
Yes: perhaps there could be a `RecordType` property, with `CNAME` as the default value and `SRV` as the new option.
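For illustration, a rough sketch of what such a shared config shape could look like; the struct name, fields, and tags are assumptions for this sketch, not the component's actual config:

```go
package main

import "fmt"

// DNSBasedResolver is a hypothetical shared config for DNS-based resolvers;
// the name, fields, and mapstructure tags are illustrative only.
type DNSBasedResolver struct {
	Hostname string `mapstructure:"hostname"`
	Port     string `mapstructure:"port"`
	// RecordType selects which DNS record type to resolve:
	// "CNAME" as the default (current behavior), "SRV" as the new option.
	RecordType string `mapstructure:"record_type"`
}

func main() {
	cfg := DNSBasedResolver{Hostname: "otel-collector.example.com", Port: "4317", RecordType: "SRV"}
	fmt.Printf("%+v\n", cfg)
}
```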
I'd like to review this, but I'm unable to do so this year. I'm back on Jan 15; please ping me again on Jan 18 if I haven't reviewed this by then.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
@jpkrohling Can you unstale this until you're ready to review in a couple of weeks?
This PR was marked stale due to lack of activity. It will be closed in 14 days.
@jpkrohling, as requested, it's Jan 18th, so hoping you can take a look at this new feature.
This looks like a great start. Apart from the inline comments, I'm missing either a mention of the lack of support for priority/weight, or its implementation. I would love to see support for prio/weight though; it shouldn't be hard to add.
@@ -156,6 +161,36 @@ service:

The SRV Resolver is useful in situations when you want to return hostnames instead of IPs for endpoints. An example would be a `StatefulSet`-backed headless Kubernetes `Service` with Istio. Example:
Can you also add what the SRV record should look like?
```yaml
resolver:
  srv:
    hostname: _<svc-port-name>._<svc-port-protocol>.<svc-name>.<svc-namespace>.svc.cluster.local
routing_key: traceID
```
I think this is the default, right? In that case, you can remove this.
```go
)

// TODO: What is this?
var _ resolver = (*srvResolver)(nil)
```
This is to ensure that `srvResolver` implements the `resolver` interface.
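For readers unfamiliar with the idiom, a self-contained illustration with made-up names (`greeter` and `englishGreeter` are hypothetical, not from this PR); the typed-nil assignment only compiles if the type satisfies the interface:

```go
package main

import "fmt"

// greeter is a hypothetical interface used only to demonstrate the idiom.
type greeter interface {
	greet() string
}

type englishGreeter struct{}

func (englishGreeter) greet() string { return "hello" }

// Compile-time check: the build fails if *englishGreeter stops satisfying
// greeter, and nothing is allocated at runtime.
var _ greeter = (*englishGreeter)(nil)

func main() {
	fmt.Println(englishGreeter{}.greet())
}
```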
@@ -85,6 +85,15 @@ func newLoadBalancer(params exporter.CreateSettings, cfg component.Config, facto

```go
			return nil, err
		}
	}
	if oCfg.Resolver.SRV != nil {
		srvLogger := params.Logger.With(zap.String("resolver", "DNS SRV"))
```
Suggested change:

```diff
-srvLogger := params.Logger.With(zap.String("resolver", "DNS SRV"))
+srvLogger := params.Logger.With(zap.String("resolver", "srv"))
```
```go
		_ = stats.RecordWithTags(ctx, srvResolverSuccessFalseMutators, mNumResolutions.M(1))
		continue
	}
	// A headless Service SRV target only returns 1 IP address for its A record
```
This might be true for Kubernetes, but this resolver isn't only for Kubernetes, right?
Double-checking the documentation, this is what I see:
SRV Records are created for named ports that are part of normal or headless services. For each named port, the SRV record has the form _port-name._port-protocol.my-svc.my-namespace.svc.cluster-domain.example. For a regular Service, this resolves to the port number and the domain name: my-svc.my-namespace.svc.cluster-domain.example. For a headless Service, this resolves to multiple answers, one for each Pod that is backing the Service, and contains the port number and the domain name of the Pod of the form hostname.my-svc.my-namespace.svc.cluster-domain.example.
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#srv-records
I believe this resolver should target generic SRV records, not only the ones that Kubernetes generates for headless services.
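As a point of reference for generic SRV resolution, a minimal sketch using the Go standard library's `net.LookupSRV`; the service, protocol, and domain values are illustrative placeholders, not something this PR resolves:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Resolves _otlp._tcp.example.com; the values are placeholders.
	cname, records, err := net.LookupSRV("otlp", "tcp", "example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println("canonical name:", cname)
	for _, srv := range records {
		// Each answer carries a target hostname plus port, priority, and weight,
		// regardless of whether the DNS server is Kubernetes or something else.
		fmt.Printf("%s:%d (priority=%d, weight=%d)\n", srv.Target, srv.Port, srv.Priority, srv.Weight)
	}
}
```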
I think this is another area where I should note down a limitation until priority and weight are introduced. Is my thinking correct here?
- For a k8s headless service there is a direct 1:1 mapping
- For a normal service, it is going to return a single IP, but that's the service VIP and could get load balanced to any n number of endpoints; if n > 1, you might have spans delivered to different endpoints
- Outside of k8s, I'd say the guidance would be to use the weight feature once it is in place. If you have a record which returns multiple IPs, then use the normal resolver. Otherwise, make sure your targets only return A records with single IPs, and that multiple results are handled inside of the SRV record.
If we did want this implemented, then perhaps we should go back to the original name mentioned in the issue and make an issue to create `DNSSRV`, since that's what it does for those other apps. We would essentially say that if your desire is to get back a hostname as opposed to IP(s), then the underlying targets must have a 1:1 mapping.
If we are mimicking the dnssrvnoa behavior from Thanos/Loki/Mimir/Cortex, that's the name we should use for the resolver, IMO.
```go
	// the list has changed!
	r.updateLock.Lock()
	r.logger.Debug("Updating endpoints", zap.Strings("new endpoints", backends))
```
this can be added together with the next debug entry
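A minimal, self-contained sketch of merging two adjacent debug entries into a single zap call; the field names and values are assumptions, since the neighboring debug entry isn't shown here:

```go
package main

import "go.uber.org/zap"

func main() {
	logger, _ := zap.NewDevelopment()
	defer func() { _ = logger.Sync() }()

	// Illustrative values standing in for the resolver's old and new backend lists.
	oldEndpoints := []string{"pod-0:4317"}
	newEndpoints := []string{"pod-0:4317", "pod-1:4317"}

	// One Debug entry carrying both field sets instead of two separate calls.
	logger.Debug("Updating endpoints",
		zap.Strings("old endpoints", oldEndpoints),
		zap.Strings("new endpoints", newEndpoints),
	)
}
```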
```go
	r.onChangeCallbacks = append(r.onChangeCallbacks, f)
}

func parseSRVHostname(srvHostname string) (service string, proto string, name string, err error) {
```
instead of returning more than two parameters, it would be preferable to return a custom struct.
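A sketch of the struct-returning variant the comment suggests; the `srvHostname` type name and the parsing rules are assumptions based on the `_service._proto.name` format discussed in this PR:

```go
package main

import (
	"fmt"
	"strings"
)

// srvHostname is a hypothetical struct bundling the parsed parts of an
// SRV record name of the form "_service._proto.name".
type srvHostname struct {
	service string
	proto   string
	name    string
}

// parseSRVHostname splits "_service._proto.name" into its components,
// returning a single struct instead of three separate values.
func parseSRVHostname(s string) (srvHostname, error) {
	parts := strings.SplitN(s, ".", 3)
	if len(parts) != 3 || !strings.HasPrefix(parts[0], "_") || !strings.HasPrefix(parts[1], "_") {
		return srvHostname{}, fmt.Errorf("invalid SRV hostname %q: expected _service._proto.name", s)
	}
	return srvHostname{
		service: strings.TrimPrefix(parts[0], "_"),
		proto:   strings.TrimPrefix(parts[1], "_"),
		name:    parts[2],
	}, nil
}

func main() {
	parsed, err := parseSRVHostname("_ldap._tcp.example.com")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", parsed) // {service:ldap proto:tcp name:example.com}
}
```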
```go
	// verify
	assert.Len(t, resolved, 3)
	for i, value := range []string{
		"pod-0.service-1.ns.svc.cluster.local:12000",
```
This would be really confusing to users: if the srv record shows that the target port is 4317, why is it reaching out to port 12000? It would take people some time to figure out that there's a Port setting on the collector config taking precedence over the SRV record. IMO, the right thing here is to return an error if users set a Port on the collector config, and always use the one from the SRV record.
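A minimal sketch of the kind of validation the comment asks for; the config struct and field names here are assumptions for illustration, not the component's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// srvResolverConfig is a hypothetical config used only to illustrate
// rejecting an explicit port when the SRV record already provides one.
type srvResolverConfig struct {
	Hostname string
	Port     string
}

func validate(cfg srvResolverConfig) error {
	if cfg.Port != "" {
		// The SRV record carries the port; a user-supplied port would silently
		// take precedence and confuse users, so fail loudly instead.
		return errors.New("srv resolver: do not set 'port'; the port from the SRV record is always used")
	}
	return nil
}

func main() {
	err := validate(srvResolverConfig{Hostname: "_otlp._tcp.example.com", Port: "12000"})
	fmt.Println(err)
}
```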
```go
	assert.NoError(t, err)
}

func TestMultipleIPsOnATarget(t *testing.T) {
```
I don't remember 100% of the SRV record spec, but aren't hostnames allowed to be regular CNAME hostnames, backed by multiple A records?
It does work in most cases, but the RFC actually explicitly states it should not be a CNAME. It both defines "address records" as `A` and `AAAA` only, and under `target` it states it must not be an alias, referencing RFC 1034 for aliases, which is the RFC that covers `CNAME`s.
@jpkrohling Thank you for the review. It might be a bit before I can work on this, but I did want clarification on one part, as I think it would require the most work. Perhaps I'm misunderstanding the point of the exporter, but I thought it was to make a hash ring so all stateless collectors know where to send something based on the routing key. How would that work with things like priority and weight?

As I wrote this I did struggle with whether I should even be naming it "srv". Mostly I was setting this up to get around Istio issues. Perhaps a name referencing that would have been better? I looked up "dnssrvnoa" and I think it was first implemented here. They reference gRPC client-side load balancing as their reasoning, so I guess it's more than just Istio. Perhaps we should just explicitly state the goal of endpoints as A records instead of IPs?
We can think about those points in a follow-up PR, but our users need to know that we are ignoring those aspects for now.

About priority: the resolver can attempt to resolve the backends for the highest priority. If there's at least one there, that priority wins and all other priorities are ignored (as per the DNS spec).

About weight: the consistent hashing algorithm consists of building a ring with 3600 points, and each endpoint is placed in 100 different places in the ring based on the hash value of endpoint+N. The idea is then to have the algorithm accept endpoints with custom weights instead of all of them being placed in 100 positions. The relevant part of the code is `exporter/loadbalancingexporter/consistent_hashing.go`, lines 102 to 113 (at cff7eaa).
I think we can do this refactoring as part of another PR, and only then implement the weight based on SRV records. @kentquirk, you mentioned interest in working with this component: is this something you'd like to work on? This task would give you a great overview of this component.
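A rough, self-contained sketch of the two ideas above (the most-preferred priority group wins, and ring positions proportional to SRV weight); this is not the component's code, and the ring size and weight-scaling rule simply mirror the description above:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"net"
	"sort"
)

const ringSize = 3600 // as described above: a ring with 3600 points

// selectPreferredPriority keeps only the SRV answers in the most-preferred
// priority group (lowest numeric priority value, per the DNS spec).
func selectPreferredPriority(records []*net.SRV) []*net.SRV {
	if len(records) == 0 {
		return nil
	}
	best := records[0].Priority
	for _, r := range records {
		if r.Priority < best {
			best = r.Priority
		}
	}
	var out []*net.SRV
	for _, r := range records {
		if r.Priority == best {
			out = append(out, r)
		}
	}
	return out
}

// ringPositions places an endpoint on the ring a number of times proportional
// to its SRV weight instead of a fixed 100 positions; the scaling rule here is
// an assumption for illustration only.
func ringPositions(endpoint string, weight uint16) []int {
	count := int(weight) // e.g. weight 100 -> 100 positions, mirroring today's fixed count
	if count == 0 {
		count = 1 // keep zero-weight endpoints reachable with a single slot
	}
	positions := make([]int, 0, count)
	for i := 0; i < count; i++ {
		sum := sha256.Sum256([]byte(fmt.Sprintf("%s-%d", endpoint, i)))
		positions = append(positions, int(binary.BigEndian.Uint32(sum[:4])%ringSize))
	}
	sort.Ints(positions)
	return positions
}

func main() {
	records := []*net.SRV{
		{Target: "pod-0.svc.example.com.", Port: 4317, Priority: 10, Weight: 100},
		{Target: "pod-1.svc.example.com.", Port: 4317, Priority: 10, Weight: 50},
		{Target: "backup.svc.example.com.", Port: 4317, Priority: 20, Weight: 100},
	}
	for _, r := range selectPreferredPriority(records) {
		pos := ringPositions(fmt.Sprintf("%s:%d", r.Target, r.Port), r.Weight)
		fmt.Printf("%s gets %d ring positions\n", r.Target, len(pos))
	}
}
```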
Force-pushed from e5a92ed to 0beac79.
Edit: Ignore, this is due to an Istio error. Will double-check that the code as-is in this PR works and then request review.
I think this is ready to go, @jpkrohling. Not sure if I'm supposed to mark things as resolved or let you since you raised them, but do let me know if you want me to.
@jpkrohling and @snuggie12 I am interested and will review this. I should have time on Wednesday afternoon to look at it.
I learned a lot reading this. I asked for further clarification in some areas because I don't think I'm all that atypical. Overall this looks good and I think it's pretty close.
```diff
@@ -60,7 +60,7 @@ The `loadbalancingexporter` will, irrespective of the chosen resolver (`static`,
 Refer to [config.yaml](./testdata/config.yaml) for detailed examples on using the processor.

 * The `otlp` property configures the template used for building the OTLP exporter. Refer to the OTLP Exporter documentation for information on which options are available. Note that the `endpoint` property should not be set and will be overridden by this exporter with the backend endpoint.
-* The `resolver` accepts a `static` node, a `dns` or a `k8s` service. If all three are specified, `k8s` takes precedence.
+* The `resolver` accepts a `static` node, a `dns`, `srv`, or a `k8s` service. If all four are specified, `k8s` takes precedence.
```
When would anyone ever specify all four? This got updated from "both" where it made sense, to "all three" which didn't make sense but someone got it through, but now "all four" doesn't make any sense at all.
Please change this to specify the resolution ordering.
Can do; however, I don't think I ever really looked at how the order was determined. Now I'm wondering if there should be a prescribed precedence.
I would need to dig into this deeper, but my best guess is that these are merely in reverse order of when the resolver was added: if a resolver is at the bottom of this block, then that resolver takes the most precedence.
Happy to leave everything as is and either make the resolver I'm adding take the most precedence, or leave it as k8s if that was a conscious decision. If not, my suggestion might be the regular DNS resolver; it seems to me it'd be the most commonly used one.
```go
var _ resolver = (*dnssrvnoaResolver)(nil)

/*
TODO: These are from resolver_dns.go, but are used here. We should move them to a common place
```
Is this TODO for you? Should that be part of this PR?
Both TODOs were previously a question and it seemed like we would want them implemented. I had left them to be worked on later since I didn't have scheduled time to work on this anymore, but I have some time set aside in March to try and wrap this up, so if you think I should take a stab at it I can.
```go
type K8sSvcResolver struct {
	Service string  `mapstructure:"service"`
	Ports   []int32 `mapstructure:"ports"`
}

// TODO: Make a common struct to be used for dns-based resolvers
```
TODO still needs to be done?
@@ -162,6 +166,48 @@ service:

The DNSSRVNOA Resolver is useful in situations when you want to return hostnames instead of IPs for endpoints. An example would be a `StatefulSet`-backed headless Kubernetes `Service` with Istio.

The format for the name of an SRV record is `_service._proto.name`, such as `_ldap._tcp.example.com`. The full record contains:
I find this section hard to follow. I see that @jpkrohling asked you to add it, but in its current form it's not clear what this is, why it's here, and what format it's actually documenting. Can you please add some more text explaining where this goes?
I'd also like to see you explain NOA and why it's significant. After staring at this a while and looking at other things, I believe it stands for "noall", meaning that these records must map to a single IP address? If I'm wrong, it's definitely not obvious, and if I'm right, it's still not obvious.
I'll def take a look at making this clearer but wanted to answer your questions so you can properly gauge if you think my edits work.
The first paragraph is more or less describing the format of SRV records. Most of it is taken from the rfc itself. I personally feel like using SRV records is a less basic approach and a person likely knows about them already. Maybe instead of explaining it the doc could benefit from a general description of scenarios when you might choose one of the resolvers over another?
The "NOA" means "No A record", as in: do not have the app perform two levels of DNS resolution. Instead of IP addresses, A records are returned, so it's the difference between `curl www.google.com` vs `curl 142.250.176.4`.

Though I've heard of some gRPC client load balancing advantages, I'm mostly implementing the NOA for Istio. You need something in the `.svc.cluster.local` (or equivalent `Service`) domain for Istio to work properly, and resolving just the SRV record while allowing the kernel to resolve the A records makes it work. If all it has is an IP, it will not use mTLS and the endpoints you're trying to talk to will reject the traffic.
Co-authored-by: Kent Quirk <[email protected]>
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
Description: Add a 4th resolver to the loadbalancing exporter. This provides the ability to specify individual hosts using the hostname instead of IP. However, 99% of the time the desired hostnames will not change but the underlying IP addresses will. The common example is using Istio when your backends are part of a `StatefulSet` with a headless `Service`.

Link to tracking Issue: #18412

Testing: Tested this live on a GKE cluster, both deleting individual pods and performing a `rollout restart` on a 3-pod otel-collector `StatefulSet` with Istio set to `STRICT` mTLS mode. Additionally, unit tests were near 1:1 copied from the DNS resolver, plus additional unit tests to handle the 2nd layer of monitoring changes (underlying IP addresses).

Documentation: Added the config parameters, an example, and a little section explaining when you might choose this option (i.e., mentioned Istio for anyone trying to search for it).
One question I had across the whole PR is what to name this resolver. In other systems such as Thanos and Loki, this is referred to as DNSSRVNOA. It feels like that name is incorrect, and perhaps DNSSRV came first, since I believe the default SRV lookup is not to resolve the underlying A records.

Using DNSSRVNOA or even subtracting the "NOA" looked clunky with how Go typically handles variable naming. In the end I simply went with "SRV". I'm happy with that decision, but if someone has a better name I'm more than happy to adjust.