[chore] System Semantic Conventions Non-Normative Guidance #1618
I really like this doc!
I don't think we have similar precedents of "why we designed it in this way" documented (the closest analogy is OTEP), but I wish we had more of these.
We might find a better place for it within the repo over time if we end up with more docs like this.
## **Host**

A user should be able to monitor the health of a host, including monitoring resource consumption, unexpected errors due to resource exhaustion or malfunction of core components of a host or fleet of hosts (network stack, memory, CPU…).
> unexpected errors due to resource exhaustion

Not sure if we have anything defined today, or if there is anything general we can provide, but it'd be nice to have some OS network/hardware/etc. errors and surface them on dashboards/alerts.
We have the `system.network.errors` metric; I don't think we have anything else. (I don't know if there is a way to retrieve this; libraries like psutil don't provide it for other things like memory or disk, AFAIK.) Still, I think the existing metrics cover the case of troubleshooting resource exhaustion/malfunction.
e980f13 to e051e87
Did a first pass on the easy comments; will make some time soon to go through the comments that require more thought!
LGTM with a question/suggestion.
* General disk and network metrics
* Universal system/process information (names, identifiers, basic specs)

Some Specialist Class examples:
While the whole description of the rationale here is exactly how it should be, I think we are missing a set of rules/guidelines/sanity checks that would help somebody in the future decide which directory a metric or attribute falls into. This might not be easy to define because of the nature of the problem, but maybe it would be worth adding a section at the bottom suggesting how this kind of situation should be handled in the future.
I do have a case study below for `process.linux.cgroup`; perhaps I can adapt this to more general rules?
Done in 487af83
* Machine name
* ID (relevant to its context, could be a cloud provider ID or just base machine ID)
* OS information (platform, version, architecture, etc.)
* Number of CPU cores
Maybe this can be "CPU information" instead? We have a bunch of those here
Approving, I left a few non-blocking comments above :)
I love writing this down.
I think we should move the "Two Class Design Strategy" categorization into the general non-normative guidance for all semantic conventions to follow.
What is missing for this to be merged?
I'm finishing up edits for the remaining open comments, will be pushing this morning.
This PR adds non-normative guidance from the System Semantic Conventions Working Group. This is added in a new `groups` folder in `non-normative`, and a `system` subfolder in `groups`. The docs written here were already discussed in a Google doc where we were originally collaborating on this, a link to which can be shared directly if needed.
e051e87 to 01f43e9
I've pushed up two new commits:

* 487af83: Addresses review comments. I will re-request review from those who still had open comments.
* 01f43e9: To address the issue with the markdown files having really long lines, I have set up Prettier to apply to these markdown files and wrap them at 80 characters. Did this in a separate commit so it wasn't too difficult to see exactly how I addressed open comments.
For example, there may be `process.linux`, `process.windows`, or `process.posix`
names for metrics and attributes. We will not have root `linux.*`, `windows.*`,
or `posix.*` namespaces. This is because of the principle we’re trying to uphold
from the [Namespaces section](#namespaces); we still want the instrumentation
source to be represented by the root namespace of the attribute/metric. If we
had OS root namespaces, different sources like `system`, `process`, etc. could
get very tangled within each OS namespace, defeating the intended design
philosophy.
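
As a rough illustration of the rule quoted above, the check could be sketched in Python as follows. This is only a sketch: `check_name`, `ROOT_NAMESPACES`, and `OS_NAMES` are hypothetical names introduced here, and the two sets contain only the examples mentioned in the excerpt, not an official or complete list.

```python
# Illustrative sketch only. ROOT_NAMESPACES and OS_NAMES are assumptions made
# for this example; they hold just the names mentioned in the excerpt above.
ROOT_NAMESPACES = {"system", "process"}
OS_NAMES = {"linux", "windows", "posix"}


def check_name(name: str) -> list[str]:
    """Return a list of problems found in a dotted metric/attribute name."""
    problems = []
    # The root namespace is the first dot-separated segment of the name.
    root = name.split(".", 1)[0]
    if root in OS_NAMES:
        # OS names may appear below the root (e.g. process.linux), never as it.
        problems.append(f"'{root}' is an OS name and must not be a root namespace")
    elif root not in ROOT_NAMESPACES:
        problems.append(f"'{root}' is not a known instrumentation-source root")
    return problems
```

Under these assumptions, `check_name("process.linux.cgroup")` reports no problems, while `check_name("linux.cgroup")` flags the OS name used as a root.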
I'm curious what specific problems would arise if we gave up on the prefix and used the OS name as a root?
I'm trying to document the naming patterns we have in #1708, and I'm actually struggling to understand what benefit the domain prefix brings.
E.g. what should I do if I want to describe a property of the OS that's indifferent to the instrumentation point/source? Which namespace would I use?
PTAL at the related #1707 - it's my attempt to document overall semconv guidance (only attribute definition so far). There are some intersections.