
Cgroups #29621

Merged: 8 commits merged into open-telemetry:main from cgroups on Feb 16, 2024

Conversation

@atoulme (Contributor) commented Dec 1, 2023

Description:
Adds process.cgroup resource attribute to process metrics

Link to tracking Issue:
Fixes #29282

@atoulme atoulme requested a review from dmitryax as a code owner December 1, 2023 23:03
@atoulme atoulme requested a review from a team December 1, 2023 23:03
@crobert-1 (Member) commented:

CI test is hitting a panic that looks like it's caused by this change:

--- FAIL: TestScrapeMetrics_Filtered (0.00s)
    --- FAIL: TestScrapeMetrics_Filtered/No_Filter (0.00s)
panic: interface conversion: processscraper.processHandle is *processscraper.processHandleMock, not *process.Process [recovered]
	panic: interface conversion: processscraper.processHandle is *processscraper.processHandleMock, not *process.Process

@atoulme (Contributor, Author) commented Dec 7, 2023

Yes, I need to work on this some more. The current interface doesn't work well for accessing fields on the underlying struct.
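
The panic happens because the scraper type-asserts the processHandle interface back to the concrete *process.Process to reach its fields, which fails when the handle is a mock. A minimal sketch of one way out, using hypothetical names rather than the PR's actual code, is to expose the needed data through the interface itself so the mock can satisfy it too:

    package processscraper

    import (
        "fmt"
        "os"
        "strings"

        "github.com/shirou/gopsutil/v3/process"
    )

    // processHandle abstracts the process so tests can substitute a mock.
    // Cgroup is a hypothetical accessor added for illustration; callers no
    // longer need to assert down to *process.Process.
    type processHandle interface {
        Cgroup() (string, error)
    }

    // gopsutilHandle wraps the real process and reads /proc/<pid>/cgroup,
    // one plausible way to supply the value on Linux.
    type gopsutilHandle struct {
        proc *process.Process
    }

    func (h *gopsutilHandle) Cgroup() (string, error) {
        data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", h.proc.Pid))
        if err != nil {
            return "", err
        }
        return strings.TrimSpace(string(data)), nil
    }

    // processHandleMock returns canned data and satisfies the same
    // interface, so a filtered-metrics test no longer panics.
    type processHandleMock struct {
        cgroup string
    }

    func (m *processHandleMock) Cgroup() (string, error) {
        return m.cgroup, nil
    }

With both types implementing the interface, the scraper stays testable without any type assertions.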

github-actions (bot) commented:

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 22, 2023
@crobert-1 crobert-1 removed the Stale label Jan 2, 2024
github-actions (bot) commented:

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 17, 2024
@github-actions github-actions bot requested a review from braydonk January 22, 2024 06:33
@atoulme atoulme force-pushed the cgroups branch 2 times, most recently from 7c2f712 to ba19244 on January 22, 2024 22:57
@github-actions github-actions bot removed the Stale label Jan 23, 2024
@atoulme (Contributor, Author) commented Jan 24, 2024

Sorry, this took a long time. This is now ready for review.

@@ -31,6 +31,10 @@ type Config struct {
// the collector does not have permission for.
MuteProcessIOError bool `mapstructure:"mute_process_io_error,omitempty"`

// MuteProcessCgroupError is a flag that will mute the error encountered when trying to read the cgroup of a process
// the collector does not have permission for.
MuteProcessCgroupError bool `mapstructure:"mute_process_cgroup_error,omitempty"`
A reviewer (Member) commented:

Do we need this option from the start? It's an optional metric and can be disabled if it results in errors

@andrzej-stencel (Member) commented Feb 8, 2024:

I think we still need an option to mute errors even for optional metrics.

The problem is (I think) that the errors in the process scraper are usually huge, reporting an error for every process on the system, and repetitive, reporting the same thing on every scrape. Perhaps going forward we could do something more clever than adding another "mute" option. Here are some thoughts (a rough sketch of the first two follows the list):

  1. Make the error messages shorter by aggregating the duplicate error messages,
  2. Only display a specific type of error when it happens for the first time,
  3. Add metrics counting the occurrences of each type of error, especially important if we do 2. as the default behavior.
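
As a rough, hypothetical sketch of ideas 1 and 2 (none of these names come from the PR): count duplicate messages within a scrape, and suppress messages already reported on earlier scrapes:

    package processscraper

    import (
        "fmt"
        "sync"
    )

    // errorDeduper collapses duplicate scrape errors. Purely illustrative.
    type errorDeduper struct {
        mu   sync.Mutex
        seen map[string]int // message -> occurrences across all scrapes
    }

    func newErrorDeduper() *errorDeduper {
        return &errorDeduper{seen: make(map[string]int)}
    }

    // Filter returns the errors worth logging: each distinct message is
    // reported once, annotated with how often it occurred this scrape,
    // and messages already seen on previous scrapes are dropped.
    func (d *errorDeduper) Filter(errs []error) []error {
        d.mu.Lock()
        defer d.mu.Unlock()

        counts := make(map[string]int)
        for _, err := range errs {
            counts[err.Error()]++
        }

        var out []error
        for msg, n := range counts {
            first := d.seen[msg] == 0
            d.seen[msg] += n
            if first {
                out = append(out, fmt.Errorf("%s (x%d this scrape)", msg, n))
            }
        }
        return out
    }

Idea 3 would complement this: exporting the accumulated counters as a metric keeps the suppressed errors observable.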

A Contributor commented:

A solution I'd come up with for this is a new type of error that a scraper can report so it gets logged at a different level. Combined with option 3, that would be a pretty good outcome, I think: metrics would be reported for the different types of errors, and the errors themselves could be logged at debug level by the scraper controller.

This is the issue I have open on the collector repo with accompanying PR: open-telemetry/opentelemetry-collector#8293
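
As a rough illustration of that idea (hypothetical types, not the actual API proposed in open-telemetry/opentelemetry-collector#8293), an error could carry a suggested log level that the scraper controller honors when logging:

    package scrapererror

    import (
        "errors"

        "go.uber.org/zap/zapcore"
    )

    // severityError wraps an error with a suggested log level so the
    // scraper controller can demote repetitive failures to debug.
    type severityError struct {
        err   error
        level zapcore.Level
    }

    func (e *severityError) Error() string { return e.err.Error() }
    func (e *severityError) Unwrap() error { return e.err }

    // NewSeverityError attaches a suggested log level to err.
    func NewSeverityError(err error, level zapcore.Level) error {
        return &severityError{err: err, level: level}
    }

    // SeverityOf returns the suggested level, defaulting to ErrorLevel
    // for plain errors.
    func SeverityOf(err error) zapcore.Level {
        var se *severityError
        if errors.As(err, &se) {
            return se.level
        }
        return zapcore.ErrorLevel
    }

The controller could then log at SeverityOf(err) instead of hard-coding the error level, while the counters from idea 3 still record every occurrence.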

@dmitryax dmitryax merged commit 3e385c8 into open-telemetry:main Feb 16, 2024
142 checks passed
@github-actions github-actions bot added this to the next release milestone Feb 16, 2024
XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this pull request Mar 13, 2024
**Description:**
Adds `process.cgroup` resource attribute to process metrics

**Link to tracking Issue:**
Fixes open-telemetry#29282

---------

Co-authored-by: Andrzej Stencel <[email protected]>
@cforce commented Apr 26, 2024

Where and how do I enable the resource attribute process.cgroup: true?

@crobert-1 (Member) commented:

@cforce You're wondering how to enable the process.cgroup resource attribute?

The documentation shares a little more information; you'd want to add something like this to your hostmetrics receiver configuration:

resource_attributes:
  process.cgroup:
    enabled: true

@atoulme atoulme deleted the cgroups branch July 9, 2024 22:59
Successfully merging this pull request may close these issues: [receiver/hostmetrics] Support collecting process cgroup as dimension (#29282)

7 participants