[k8sclusterreceiver] refactoring pod status phase #24425

Closed
povilasv opened this issue Jul 21, 2023 · 18 comments

@povilasv
Contributor

Component(s)

receiver/k8scluster

Describe the issue you're reporting

We currently encode pod status phase to number:

  k8s.pod.phase:
    enabled: true
    description: Current phase of the pod (1 - Pending, 2 - Running, 3 - Succeeded, 4 - Failed, 5 - Unknown)
    unit: 1
    gauge:
      value_type: int

I think it's preferable to split into many metrics:

  k8s.pod.phase_pending:
    enabled: true
    description: Whether this pod phase is Pending (1), or not (0).
    unit: 1
    gauge:
      value_type: int
  k8s.pod.phase_running:
    enabled: true
    description: Whether this pod phase is Running (1), or not (0).
    unit: 1
    gauge:
      value_type: int

IMO it's easier to tell from the metric name what is happening, and we already have this pattern for node conditions (see the sketch below).
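
For reference, the existing node-condition metrics follow roughly this shape. This is only a sketch: in the receiver these metrics are generated from the node_conditions_to_report config option rather than declared statically, and the exact description text here is an assumption:

  k8s.node.condition_ready:
    enabled: true
    description: Whether this node is Ready (1), not Ready (0), or in an unknown state (-1).
    unit: 1
    gauge:
      value_type: int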

Thoughts?

@povilasv povilasv added the needs triage New item requiring triage label Jul 21, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dmitryax
Member

I see your point, but I have a couple of counter-arguments:

  1. This will introduce a 5x bigger payload to deliver the same information.
  2. When you look at a chart, it's pretty convenient to see what states the pod has been in over time using just one time series. Some UIs let you assign state names to the values on the chart, which is very convenient. Using several metrics for that is not as handy.

If a user wants this particular representation, I believe it can be translated with the metrics transform processor. Also, most backends likely have a query language capable of representing this metric as 5 "boolean" time series.

I wouldn't like to make this change until we have strict guidelines from the OTel Spec on how to represent this kind of data.

@dmitryax dmitryax added discussion needed Community discussion needed receiver/k8scluster and removed needs triage New item requiring triage labels Jul 21, 2023
@povilasv
Contributor Author

Thanks for the reply. I agree with you: k8s.pod.phase is way more compact, and in theory the transformProcessor should work. I will play with it to see if it does in practice.

@sirianni
Contributor

sirianni commented Aug 4, 2023

When you look at the chart, it's pretty convenient to see what states the pod has been in over time using just one time series. Some UIs can let you assign the states to the values on the chart, which is very convenient.

Can you provide a screenshot/example?

Also, most backends likely have a query language capable of representing this metric as 5 "boolean" time series.

Example? I can't get this to work in either of the monitoring vendors my company uses.

Since the actual numeric values are meaningless, I can't see how to make this work when aggregating across dimensions. Many monitoring tools also don't let you filter on metric values, just tags (i.e. you can't say WHERE value = 2). For example, if I'd like to see the number of pods in a DaemonSet that are in the Running phase, how do I do that?

I think an OpenMetrics StateSet would be the more idiomatic way to model this information (and is also supported by OTel).

This is also how kube-state-metrics does it with kube_pod_status_phase. It seems to work quite well in PromQL: count(kube_pod_status_phase{phase="Running"})

@sirianni
Contributor

sirianni commented Aug 7, 2023

I believe it can be translated with the metrics transform processor

@dmitryax can you give a rough example of how you would use the metricstransform processor to do this transformation? From what I can see, there's no way to "match" based on values, and aggregating the "numeric enums" doesn't make sense either.

@bertysentry
Contributor

I agree with @sirianni.

OOT, the k8s.pod.phase metric should keep the same name (don't split it into k8s.pod.phase_pending, k8s.pod.phase_running, etc.), but it should have a state attribute, to be consistent with similar enumerations in other metrics (although I don't think state is specified anywhere in OTel's semantic conventions).

So you'll push several datapoints to represent the current "running" state of the pod:

  • k8s.pod.phase{state="running"} 1
  • k8s.pod.phase{state="pending"} 0
  • k8s.pod.phase{state="succeeded"} 0
  • k8s.pod.phase{state="failed"} 0
  • k8s.pod.phase{state="unknown"} 0

Not ultra optimal (cardinality), but it does the job, and it's easily query-able.

Also, it's a good incentive for OpenTelemetry to specify the StateSet metric type! 😉
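
For illustration, a metadata.yaml declaration along these lines could express that shape. This is only a sketch; the state attribute name and its enum values are the ones proposed above, not an agreed convention:

  attributes:
    state:
      description: The pod phase the data point refers to.
      type: string
      enum: [pending, running, succeeded, failed, unknown]
  metrics:
    k8s.pod.phase:
      enabled: true
      description: Whether the pod is in the given phase (1) or not (0).
      unit: 1
      gauge:
        value_type: int
      attributes: [state]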

@sirianni
Contributor

sirianni commented Nov 6, 2023

Take a look at the hw.status metric in the OTel Semantic Conventions, which uses the value-as-attribute approach I'm suggesting. In particular, see this note:

hw.status is currently specified as an UpDownCounter but would ideally be represented using a StateSet as defined in OpenMetrics. This semantic convention will be updated once StateSet is specified in OpenTelemetry. This planned change is not expected to have any consequence on the way users query their timeseries backend to retrieve the values of hw.status over time.


@TylerHelmuth
Member

We discussed this issue in the SIG meeting today. We decided we want the Spec to make a decision on StateSet before we move forward with any changes. I will try to find an open issue in the Spec repo, and I'll create one to start the conversation if I can't find one.

@TylerHelmuth
Member

There is also an open question on where this metric should live. @sirianni can you provide more details on the scaling issue you've experienced?

@sirianni
Contributor

There is also an open question on where this metric should live. @sirianni can you provide more details on the scaling issue you've experienced?

Sure. We run several large Kubernetes clusters (over 1000 nodes, over 12,000 pods) where using a single cluster-wide collector runs into scalability issues (scrape timeouts, etc.). Here is the Datadog Agent PR I referenced in today's SIG call where Airbnb mentions similar scalability issues using a cluster-wide collector (i.e. kube-state-metrics) for this data.

Collecting this information from the kubelet via a DaemonSet scales incrementally as the cluster grows.

@dashpole
Contributor

Link to slack discussion mentioned in the meeting: https://cloud-native.slack.com/archives/C01NP3BV26R/p1623952435074700

@TylerHelmuth
Member

Reading through this issue again, is there actually anything we need from the spec? This link (https://opentelemetry.io/docs/specs/otel/compatibility/prometheus_and_openmetrics/#stateset) seems pretty clear about how we handle StateSets in OTel metrics.

@dashpole
Contributor

dashpole commented Nov 15, 2023

That is for how to convert OpenMetrics StateSet metrics to OTLP. The idea discussed in the Slack thread was to make the StateSet pattern (or something close to it) a recommended pattern for OTel instrumentation. I would like to see the specification recommend, at minimum, using labels rather than values to represent enums (i.e. foo{state=bar} 1, rather than foo_state_bar 1 or foo 4, where the value 4 implies state==bar).
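
To make those three shapes concrete, here is a hypothetical side-by-side of the same Running pod under each representation (the metric and attribute names are illustrative, taken from the proposals above):

  # labels represent the state (the pattern recommended above)
  - metric: k8s.pod.phase
    attributes: { state: running }
    value: 1
  # one metric per state
  - metric: k8s.pod.phase_running
    value: 1
  # state encoded in the value (current behavior, where 2 means Running)
  - metric: k8s.pod.phase
    value: 2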

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


@github-actions github-actions bot added the Stale label Mar 18, 2024
@jaronoff97
Contributor

Ran into another example of some confusion/frustration recently

@github-actions github-actions bot removed the Stale label Apr 27, 2024

@github-actions github-actions bot added the Stale label Jun 27, 2024
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.
