
Add service.type experimental Resource attribute #575

Conversation

tigrannajaryan
Member

Contributes to #554

Contributes to #396

Contributes to open-telemetry/opamp-spec#131

Problem Description

The service.name Resource attribute is currently defined as the "Logical name of the service". The expectation is that service.name will be set by the operator of the service to a value that describes the role of the service in the overall observable set of entities the operator has (within a service.namespace).

Otel Collector sets service.name by default to be the name of the executable (e.g. otelcorecol or otelcontribcol).

The Collector's service.name can be overridden by the operator using the service.telemetry.resource setting of the Collector's config file. This is typically expected in any non-trivial infrastructure, where the same Collector executable can be used as a locally running agent on a host, as a standalone gateway that serves as an intermediary between agents and the backends, as part of a Kubernetes operator, etc. The roles in these cases are sufficiently different to warrant different logical names.
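For reference, a minimal sketch of that override in the Collector's config file (the service.name value is illustrative):

```yaml
service:
  telemetry:
    resource:
      # Override the default executable-derived service.name with the
      # logical role this particular Collector plays in the infrastructure.
      service.name: k8s-node-agent
```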

However, there is currently no semantic convention for an attribute that specifies the type of a service that may have different logical roles when used in different places in the infrastructure, yet be identically produced, i.e. be the exact same executable. The executable file name can serve that purpose to some extent, but nothing prevents different service types from sharing the same executable file name, so it has poor uniqueness guarantees.

This issue talks a bit more about why we would want to have the type of an agent (Otel Collector in our case) to be a well-defined semantic convention.

This issue shows how the agent type would be useful in the context of agent management. The issue talks about how important it is to tie an agent's own telemetry Resource to the attributes that identify that agent in the context of the OpAMP protocol.

Changes

This change adds service.type as a Recommended, experimental Resource semantic convention.

The value is a string in reverse domain name notation that uniquely identifies the type of the service (the type of the product deployed as the service), e.g. io.opentelemetry.collector, io.redis, etc.
Unlike the (service.namespace, service.name, service.instance.id) triplet, the (service.namespace, service.type, service.instance.id) triplet is not guaranteed to be globally unique.

Having a separate service.type allows OpAMP, if desired by the operator, to manage the same type of agents in a similar way even though their service.name values may differ due to the different logical roles they have.

An example unrelated to OpAMP, using NGINX: service.type would be set to "com.nginx", while service.name is set to "api-gateway", denoting the logical role that this particular NGINX deployment serves in the system.
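To make the example above concrete, here is a minimal sketch (plain Python dicts standing in for Resource attributes; all values are illustrative) of two NGINX deployments that share a build-time service.type but carry operator-chosen service.name values:

```python
# Illustrative Resource attributes (plain dicts): two deployments of the
# same product share service.type; the operator picks distinct service.name
# values for their distinct logical roles.
api_gateway = {
    "service.type": "com.nginx",      # set at build time by the product
    "service.name": "api-gateway",    # set at deploy time by the operator
    "service.instance.id": "gw-1",
}
static_server = {
    "service.type": "com.nginx",
    "service.name": "static-content",
    "service.instance.id": "static-1",
}

# Same product type, different logical roles.
assert api_gateway["service.type"] == static_server["service.type"]
assert api_gateway["service.name"] != static_server["service.name"]
```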

Merge requirement checklist

Member

@mx-psi mx-psi left a comment


Would the value of service.type always be io.opentelemetry.collector for an OpenTelemetry Collector regardless of the distro?

@tigrannajaryan
Member Author

Would the value of service.type always be io.opentelemetry.collector for an OpenTelemetry Collector regardless of the distro?

Yes. I also proposed service.distro as one more attribute here, but wanted to start with service.type as the more important one for now.


**[1]:** MUST be unique for each instance of the same `service.namespace,service.name` pair (in other words `service.namespace,service.name,service.instance.id` triplet MUST be globally unique). The ID helps to distinguish instances of the same service that exist at the same time (e.g. instances of a horizontally scaled service). It is preferable for the ID to be persistent and stay the same for the lifetime of the service instance, however it is acceptable that the ID is ephemeral and changes during important lifetime events for the service (e.g. service restarts). If the service has no inherent unique ID that can be used as the value of this attribute it is recommended to generate a random Version 1 or Version 4 RFC 4122 UUID (services aiming for reproducible UUIDs may also use Version 5, see RFC 4122 for more recommendations).

**[2]:** A string value having a meaning that helps to distinguish a group of services, for example the team name that owns a group of services. `service.name` is expected to be unique within the same namespace. If `service.namespace` is not specified in the Resource then `service.name` is expected to be unique for all services that have no explicit namespace defined (so the empty/unspecified namespace is simply one more valid namespace). Zero-length namespace string is assumed equal to unspecified namespace.

**[3]:** The `service.type` identifies the product that is deployed as the service. The same product may be simultaneously deployed multiple times on the same observable infrastructure. In this case each of these deployments will typically have a distinct `service.name` to help identify the logical role of the particular deployment, however their `service.type` will be the same and will help identify the deployed product.
Member


Should we explicitly mention the possibility of different distros or flavors of the same component here? Even if we don't have a convention for this on the first iteration

Contributor


As described here, it plays well with the definition ECS has for service.type

Member Author


To avoid confusion I don't want to mention it in semconv until we have a clear understanding of how we want distros/flavours to be recorded. I think it can be done in future PRs.


@jaronoff97 jaronoff97 left a comment


Two thoughts; otherwise this looks good to me. Adding a role feels like a worthy follow-up if we want it.

@tigrannajaryan
Member Author

@open-telemetry/specs-approvers please take a look.

@tigrannajaryan
Member Author

@open-telemetry/specs-approvers please take a look. If it looks good I will resolve the conflicts.

Member

@yurishkuro yurishkuro left a comment


There is no definition of "type" in this PR. I think if we don't want to be rigorous in defining the semantic meaning of an attribute, then what is the value of the attribute, beyond reserving its name? Without a clear definition, people will start putting different data into it.

PR says "the same product". Is the same binary the same product? What if they are configured differently? Can different binaries be the same product?

@tigrannajaryan
Member Author

There is no definition of "type" in this PR. I think if we don't want to be rigorous in defining the semantic meaning of an attribute, then what is the value of the attribute, beyond reserving its name? Without a clear definition, people will start putting different data into it.

PR says "the same product". Is the same binary the same product? What if they are configured differently? Can different binaries be the same product?

I agree it is not defined very precisely. Here are a couple of additional ways to think about this:

  • Services of the same service.type are typically expected to produce a similar shape of telemetry (e.g. the same set of metrics). This is useful since it allows, for example, building dashboards in backends per service.type and using those dashboards for all such services regardless of the logical role of the service. For example, you can have a dashboard for NGINX that is built based on the metrics we expect from NGINX. Note: there is no hard restriction on the shape of the produced telemetry, since different versions or different configurations of the same product may result in variations in telemetry, but there is generally significant commonality that we can rely on.
  • Services of the same service.type are typically expected to be configured similarly. This is the OpAMP angle. It means the OpAMP server can serve the same configuration to all such services (or have a base configuration that is the same for all such services, with an extra per-service.name configuration added on top).

Do you think it would be useful to add these explanations?

@yurishkuro
Member

@tigrannajaryan tbh I am not convinced of the usefulness of service.type. Hear me out:

At Uber we initially allowed services to hardcode their service names when initializing telemetry SDKs. This was later recognized as a big mistake, because those service identities had meaning not just for telemetry but needed (or at least it was highly desirable) to match other domains, e.g. the identities the cluster manager was using, permissions, etc. So we made a coordinated effort to remove hardcoded service names and instead inject them based on env variables defined by the cluster manager. The cluster manager also supported other dimensions in the data model, such as deployment groups, regional jobs, etc., all of which were injected into resource attributes automatically.

I am assuming this model is not unique to Uber. And in this model, I don't see where service.type would come from: at best it would be set to the same string as service.name, thus defeating the purpose, or it would need to be hardcoded in the code (very difficult to manage when dealing with thousands of microservices). What ideally needs to happen is that service.name is this higher-level (type-like) string to begin with, denoting the type of service as already implied by the current (somewhat tautological) definition "logical name of the service", and other distinguishing attributes like role or deployment group should come from additional dimensions supported by the cluster manager (which we cannot standardize in OTEL).

To play devil's advocate, per your clarification the collector as agent and collector as collector (as called out in the PR description) would still be the same service name and service type, since it's the same binary, the same output telemetry schema, and similar configuration. So why not use service.name=collector for both and, through the deployment mechanism, add an extra tag like service.role=agent|collector? I think it's a conceptually cleaner scheme than name=otel_agent|otel_collector && type=collector, because the name starts to be some ad-hoc concatenation of things, precisely what we're generally trying to avoid by supporting dimensional telemetry everywhere.

@reyang
Member

reyang commented Feb 13, 2024

What ideally needs to happen is that service.name is this higher-level (type-like) string to begin with, denoting the type of service as already implied by the current (somewhat tautological) definition "logical name of the service", and other distinguishing attributes like role or deployment group should come from additional dimensions supported by the cluster manager (which we cannot standardize in OTEL).

+1, these additional dimensions (e.g. data center, geo location, availability zone, whether it is a private cloud or public cloud) keep evolving and are normally owned by other systems rather than observability. I think if "service.name" cannot be used as the primary key, it defeats the purpose; introducing more dimensions (e.g. "service.type" or "service.category" or something else) will make things more complicated without solving the issue in the end.

What I learned from Microsoft: we used to have too many dimensions defined in the observability system, and in the end only these worked well after many years:

  • The logical "role name" - essentially the same as "service.name".
  • The instance id.
  • The version of deployment.

@tigrannajaryan
Member Author

To play devil's advocate, per your clarification the collector as agent and collector as collector (as called out in the PR description) would still be the same service name and service type, since it's the same binary, the same output telemetry schema, and similar configuration. So why not use service.name=collector for both and through deployment mechanism add an extra tag like service.role=agent|collector?

@yurishkuro The service.name is supposed to be set, and be changeable, by the user who deploys the Collector. They are free to set it to any value they want, and when they do so the OpAMP Server loses the ability to identify the Collector. We need a hard-coded value, provided at build time by the producers of the Collector, to signify that it is the Otel Collector and not something else. This information is necessary for the OpAMP Server to know how to deal with the connecting agent.

Generally speaking, the end user can set service.name to, for example, "host-agent", "k8s-node-agent" or "gateway" to signify the logical role of the Collector. At the same time, service.type will always be set to "io.opentelemetry.collector" to signify that they are all Otel Collectors. This allows the OpAMP Server to have a common sub-configuration defined for all Collectors regardless of their role (for example, to specify the backend endpoint to send collected data to).
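As a sketch of that layering (the helper and config keys below are hypothetical illustrations, not part of the OpAMP spec or any server API): an OpAMP server could key its base configurations by service.type and apply per-service.name overlays on top.

```python
# Hypothetical OpAMP-server-side helper: base configurations are keyed by
# the build-time service.type; role-specific overlays are keyed by the
# operator-chosen service.name.
def effective_config(service_type, service_name, base_configs, overlays):
    config = dict(base_configs.get(service_type, {}))  # common per-product part
    config.update(overlays.get(service_name, {}))      # role-specific part on top
    return config

base_configs = {
    "io.opentelemetry.collector": {"exporter_endpoint": "https://backend.example"},
}
overlays = {
    "k8s-node-agent": {"batch_size": 100},  # only node agents get this override
}

cfg = effective_config("io.opentelemetry.collector", "k8s-node-agent",
                       base_configs, overlays)
assert cfg == {"exporter_endpoint": "https://backend.example", "batch_size": 100}
```

Any Collector, whatever its role, picks up the shared endpoint; only the role-specific keys differ.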

why not use service.name=collector for both and through deployment mechanism add an extra tag like service.role=agent|collector?

I think this is the equivalent of what I am proposing, simply with different attribute names. I am suggesting the (service.type, service.name) pair; you are suggesting the (service.name, service.role) pair. I see no semantic difference between these two proposals, only the attribute names are different. I don't mind using different attribute names if we agree that we need a pair and that the single attribute we have is not enough.

I am assuming this model is not unique to Uber. And in this model, I don't see where service.type would come from -- at best it would be set to the same string as service.name, thus defeating the purpose.

I think this is true only for first-party services. If the person (or team) who develops the service and the person who deploys it are the same, then I agree with you: most likely they will choose the same values for service.name and service.type, defeating the purpose.

My focus is not first-party services, though. I am thinking about third-party services, where the developer (the person who builds the service) and the operator (the person who deploys the service) are different, disconnected from one another, and not part of the same organization. Let's go back to the NGINX example. The developers of NGINX will define service.type=com.nginx at build time. The deployers of a particular NGINX instance will probably set service.name=api-gateway when deploying it.

Let's say I expect a telemetry backend to provide a specialized dashboard built for NGINX. How does the backend know that this dashboard is applicable? It cannot look at service.name, which can be set to any arbitrary value based on the logical role of the service. service.type would be the value the backend can rely on to activate the dashboard.
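A minimal sketch of that activation logic (the registry and helper are hypothetical, not a real backend API):

```python
# Hypothetical backend-side dashboard registry, keyed by service.type.
# service.name cannot be the key: it is operator-chosen and arbitrary.
DASHBOARDS = {"com.nginx": "NGINX Overview"}

def dashboard_for(resource):
    # Look up by the build-time product identifier, ignoring the logical name.
    return DASHBOARDS.get(resource.get("service.type"))

assert dashboard_for({"service.name": "api-gateway",
                      "service.type": "com.nginx"}) == "NGINX Overview"
assert dashboard_for({"service.name": "api-gateway"}) is None
```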

The service.type dimension I am describing here is clearly used for observability purposes, to show the right telemetry in the Observability tool.

What I learned from Microsoft, we used to have too many dimensions defined in the observability system, in the end only these work well after many years:

  • The logical "role name" - essentially the same as "service.name".
  • The instance id.
  • The version of deployment.

@reyang I think this is not enough to handle the NGINX use case I described above. If you see a way, can you clarify how using these attributes could make the NGINX-specific dashboard activate?

--

To everyone:

What's interesting is that we already have these additional dimensions in semconv, but they are domain-specific. For example, db.system or webengine.name signify the type of the database or the type of the web engine (similarly we have messaging.system for messaging). We maintain these domain-specific attributes as enumerations. Supposedly, for the scenario I was discussing above we would set webengine.name=nginx (although the values for this attribute are not well-defined), and then use webengine.name in the backend to decide which dashboard to show.

The service.type attribute proposes one attribute for all domains instead of domain-specific attributes, and instead of requiring manually maintained enumerations it specifies that an FQDN can be used as the value.

I think it is a choice that we need to make.

Use domain-specific attributes

We use different, domain-specific attributes to signify the type of the service (the type as it is known at build time). This is what we do currently (e.g. db.system, messaging.system, webengine.name).

Pros

  • Probably shorter and more descriptive attribute names and values.

Cons

  • Need to come up with attributes for every domain.

If we choose this approach I will close this PR and will create a new one to add agent.type as a domain-specific attribute for agents.

Settle on one service.type

We settle on one attribute (service.type) that is the same for all domains.

Pros

  • Design this once for all domains
  • Well-defined attribute value without the need to maintain attribute-specific enumeration lists

Cons

  • Less readable attribute name and value perhaps.
  • It is only a single dimension. Can't properly define more than one dimension simultaneously, e.g. a database server that is also a web server. Is this a use case we care about?
  • Longer attribute value?

I would like to hear some more pros and cons of these choices, since it is not clear to me what the best way is.

@yurishkuro
Member

@tigrannajaryan

I think this is the equivalent of what I am proposing with simply different names of attributes. I am suggesting (service.type, service.name) pair, you are suggesting (service.name, service.role) pair.

The difference is that I was not suggesting to standardize on service.role; it was just an arbitrary name I picked that a specific use case may be interested in.

I don't think the OpAMP use case is a good fit here, since it's about a handshake mechanism, not about tagging the telemetry of a binary. Ack on the NGINX use case. But it brings back my first question: if this field is meant to provide a classification mechanism, it needs a better definition and a well-defined value domain, since you want it to be a vendor-neutral identifier. If the only objective of this type is to understand the overall shape of the telemetry, then it feels like it should be solved with telemetry schemas from a recent OTEP. If its purpose is for other things, then we need to better define a data contract.

@tigrannajaryan
Member Author

@yurishkuro I will set aside the OpAMP case for now since I agree with you that it has other possible solutions (e.g. an agent.type attribute is one possible way).

Let's focus on the telemetry shape for now and decide if we want it to be roughly described by one attribute service.type.

If the only objective of this type is to understand the overall shape of the telemetry, then it feels like it should be solved with telemetry schemas from a recent OTEP.

I am not sure that this is a better way. Let's say, for example, there are hundreds of 3rd-party products that produce telemetry according to Otel semconv and don't deviate from it. In that case the telemetry from these products will reference the base Otel Schema https://opentelemetry.io/schemas/<version>. Since the products don't deviate from or extend Otel semconv, they have no need to declare an extended schema (as defined by the recent OTEP).

Nevertheless the shape of telemetry produced by each of these products can be very different since each product can use a very different subset of standard Otel semconv.

If we require that the shape of the telemetry produced by each product be uniquely described by its Schema URL, we are essentially forcing the products to have an extended schema derived from the Otel Schema. That is an unnecessary burden. Publishing and maintaining an extended schema is a job best avoided unless there are strong reasons: it requires running a highly available HTTP server that can serve the schema files.

Compared to that, including a single FQDN value for service.type in the produced telemetry is trivial, adds no maintenance burden, and still serves the purpose of uniquely identifying the expected shape of the telemetry from the particular product.

It needs a better definition, and a well-defined value domain, since you want it to be a vendor-neutral identifier.

Do you think specifying that the value is the reverse FQDN of the product is not precise enough? It is vendor-neutral, ensures no collisions (provided the guidelines are followed), is unique enough, and is easy to understand. Are you looking for more guidelines on which FQDN to use for a particular product? (e.g. Collector vs Collector Contrib: should they use the same FQDN, and which one?)

@yurishkuro
Member

Are you looking for more guidelines on which FQDN to use for a particular product?

Not specifically on the format, but on what criteria need to be satisfied to produce the same or different FQDNs. For example, jaeger-v1 and jaeger-v2 telemetry shapes are likely going to be significantly different (because of the architecture change), so should both binaries produce the same or different FQDNs? I can see arguments either way (since the version can be a separate resource attribute anyway). Or another example: jaeger-v2 is going to be a single binary that can work as either jaeger-collector or jaeger-query (from v1 nomenclature). Again, the telemetry is going to be pretty different, but it's the same binary and the same version, so the same FQDN or not?

@tigrannajaryan
Member Author

Not specifically on the format, but on what criteria need to be satisfied to produce the same or different FQDNs. For example, jaeger-v1 and jaeger-v2 telemetry shapes are likely going to be significantly different (because of the architecture change), so should both binaries produce the same or different FQDNs? I can see arguments either way (since the version can be a separate resource attribute anyway).

Yes, the same service.type value. We already have service.version, which allows differentiating between jaeger-v1 and jaeger-v2 telemetry shapes as necessary. At the same time, if there is commonality in shape between versions, that commonality will be indicated by the same service.type. Of course we don't force that commonality to exist; the versions are free to emit completely different telemetry shapes as well. In that case service.type will need to be paired with service.version to be useful.

Or another example: jaeger-v2 is going to be a single binary that can work as either jaeger-collector or jaeger-query (from v1 nomenclature). Again, the telemetry is going to be pretty different, but it's the same binary and the same version - so same FQDN or not?

The same service.type value if it is the same binary. We expect that binary to produce roughly the same telemetry, right? (Note: any differences in telemetry that can be attributed to differences in user-defined configuration are fine.) In this scenario we may also expect different service.name values, since jaeger-collector and jaeger-query are different logical roles ("logical name" is the current definition of service.name, which seems to fit in this case).

From the perspective of what to put into service.type/service.name, the jaeger-collector vs jaeger-query situation seems very similar to the "Otel Collector as an Agent" vs "Otel Collector as a Gateway" situation.

@yurishkuro
Member

Fair enough. None of these nuances and explanations are coming through in the description, though. I don't like putting things into the spec that are so vague that you need a whole separate FAQ to explain how to use them.

@lmolkova
Contributor

Any reason to generalize the attribute?

Would something be broken if applications that deploy the same binary to multiple services, need to build dashboards based on the same template, etc., defined a custom resource attribute for themselves? The use case sounds like a niche one, based on my (potentially limited) experience.

@tigrannajaryan
Member Author

Any reason to generalize the attribute?

Would something be broken if applications that deploy the same binary to multiple services, need to build dashboards based on the same template, etc would define a custom resource attribute for themselves? The use case sounds like a niche one based on my (potentially limited) experience.

I think this is a fair question. So far I do not see a huge number of supporting voices for a common service.type attribute. Let me advertise this a bit more, and if we don't see significant support for it, we can close this and instead solve the specific problem agents have in a different way.

@tigrannajaryan
Member Author

Fair enough. None of these nuances & explanations are coming through in the description though. I don't like putting things into the spec that are so vague that you need a whole separate FAQ to explain how to use them.

I am happy to re-write the PR and add these explanations to semconv, but I would first like to see if there is good support for the attribute at all (as suggested by @lmolkova).

@andrzej-stencel
Member

I stand by my earlier comment:

I'm worried that service.type is too vague. Different people might want to set it to different things in different contexts.

If this is needed for OpAMP (as this comment suggests: #396 (comment)), why not introduce an OpAMP-specific attribute like opamp.agent.type, as proposed by @jack-berg here: open-telemetry/opamp-spec#131 (comment)?

@joaopgrassi
Member

joaopgrassi commented Feb 21, 2024

Isn't all of this already possible by using the Instrumentation Scope for this use case? https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#get-a-tracer

Also, a thing I don't really get is:

For OpAMP having a separate service.type allows OpAMP, if desired by the operator, to manage the same type of agents in a similar way even though their service.name values may be different due to different logical roles they have.

IIUC, service.type can be used by OpAMP to find all "collectors" or "NGINXs" (as service.type will have the same value). But in the example you gave (the Collector being deployed in different roles, and likewise NGINX being deployed in multiple roles), how is it useful that they can be managed in the same way, given that their roles are completely different? Maybe I missed something here? 🤔

From the use cases I read here, the PR seems somewhat backwards to me. I'd understand it more as: service.name is the same, but then you have a "role" that can identify similar components performing the same "job" (service.name=collector, service.role=agent)?

@tigrannajaryan
Member Author

in the example you gave (collector being deployed in different roles) and also NGINX being deployed as multiple roles, how is it useful that they can be managed in the same way, as their role is completely different? Maybe I missed something here?

@joaopgrassi The same product deployed in multiple roles likely still has largely the same configuration settings, which need to be set mostly to the same values. It is useful to be able to specify this common product-specific (service.type-specific) configuration once.

From the use cases I read here, It seems the PR is somewhat backwards to me. To me, I'd understand it more like: service.name is the same, but then you have a "role" that can determine similar components that are performing the same "job" (service.name=collector, service.role=agent)?

Hard to tell. service.name is very underspecified. The current definition of service.name is that it is the logical name. I think that fits the "role", but you are right that it is open to a different interpretation. We should probably also work on clarifying service.name.


github-actions bot commented Mar 8, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 8, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.
