Device nodes are not guaranteed to be consistent over time #134
Comments
Which query exactly breaks?
I could see it breaking if you squashed an alert, and then the drive letters flip after reboot, and then the wrong drive is squashed... disk-by-path might be another way to get a more stable identifier.
@k0ste Since PromQL doesn't support many-to-many joins and a new timeseries is created for each unique combination of labels, it's not possible to do a "join" between
Then there's also the issue that every time a device node is reassigned it makes any graph that tracks history wrong. If I'm tracking changes in disk space used, power cycles, etc., and have alerts based on a percentage increase, those might trigger on a reboot when the device node is changed to a drive with different metrics.
Finally, with three servers, each with four drives, that'll eventually create 48 different timeseries in
@kfox1111 I did consider suggesting this as well, though I decided against it, as that information is either available through smartctl or is not consistent (
As I understand it, the use case for this exporter is to track individual pieces of hardware over time, for instance if a drive is about to fail. I think the easiest way to set up good tracking is to identify each hard drive by an ID that doesn't change over time. I think the serial number is the only piece of data available that works for that, though I'm by no means a hardware expert, so there might be (and probably is) a smarter solution than I can think of 😅
@Scandiravian if you operate by Linux device name, this is totally wrong: you should operate only by device serial_number. Linux device names are not persistent, for example:
All your recording rules / alerts should look like this:
smartctl_device{form_factor="3.5 inches"} * on (instance, device) group_left () smartctl_device_temperature > 30
In this case, in one moment in time, the
I think there is a use case for querying both by disk-by-path (so you can identify slot 3 in node B in queries) as well as drive serial_numbers so you can track a drive no matter where it shows up.
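To make the disk-by-path idea concrete, here is a minimal sketch in Python that maps each device node to the udev symlink names pointing at it. It assumes a Linux system where udev populates `/dev/disk/by-path`; the function name `stable_names` is hypothetical and this is not how the exporter itself works.

```python
import os
from pathlib import Path


def stable_names(link_dir="/dev/disk/by-path"):
    """Map each resolved device node to the symlink names that point at it."""
    mapping = {}
    base = Path(link_dir)
    if not base.is_dir():
        return mapping  # the directory is absent on systems without udev
    for link in sorted(base.iterdir()):
        if link.is_symlink():
            # udev links are usually relative (e.g. ../../sda); resolve them
            target = os.path.realpath(link)
            mapping.setdefault(target, []).append(link.name)
    return mapping


if __name__ == "__main__":
    for node, names in stable_names().items():
        print(node, "->", ", ".join(names))
```

Pointing the same function at `/dev/disk/by-id` should list the model and serial based names instead, which is closer to the "track a drive no matter where it shows up" use case.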
Agreed. I believe that to support both use cases, it may be appropriate to revert #83. Users should then configure Prometheus to drop the relevant label(s) in the scrape config. I have forked and reverted #83 for my own use. If deemed appropriate I can make a PR as well.
This is impossible to "resolve" on the Prometheus side, because before Prometheus can drop something it first has to download it.
What exactly is the issue you have with the current design?
Oh, my Google-fu must have been really bad yesterday. I would like to track the lifetime (e.g. Total Bytes Written) of a disk consistently. Yesterday I tested this by intentionally causing a flip in the device node, and the stats were, as expected, flipped. You have provided an alert rule example, which inspired me to do something like this
That was still two series for the same drive before and after the device node flip. I would really like to join them as one. Then I was stuck. In fact, I was using VictoriaMetrics instead of Prometheus. I tried using MetricsQL
And after your reply I Googled again and came up with "metricsQL: add function for merging time series values based on label value". That inspired me to come up with the following query, and my problem is solved!
And for calculating the rate of increase:
Still open to discussion as to whether adding a serial number label is necessary.
I am in the process of setting up this exporter for the first time, on a system with 25 SATA disks attached. #83 is not a good change in my opinion. I would always want to have the device serial number in the metrics output, for the reasons described above (unstable identification via /dev/sdX). Until then, I'll use my own forked version as well.
IMO a somewhat correct solution for this is replacing
As seen above, it's not perfect since it only applies to SATA/SAS devices, but this should be easily solvable regardless in smartmontools. I'll (try to) prepare a PR adding a flag enabling this behaviour (since it's a breaking change over what was there before).
Currently the smartctl exporter only attaches the device label to all metrics except smartctl_device. This was introduced in #83.
This can unfortunately lead to issues, since device nodes are not guaranteed to be consistent over time, so after a reboot /dev/sdc might for instance become /dev/sda.
This makes it difficult to create dashboards in Grafana that track, for instance, temperature over time, since a query will break after a reboot.
If there is a goal to limit the number of labels sent, I think it would be better to switch to using the serial number as the identifying label sent with metrics. These are not guaranteed to be unique, though I think a conflict will be unlikely in most cases.
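As a rough illustration of why the serial number gives a stable series, the following Python sketch re-keys samples from device node to serial number, using a per-scrape mapping in the role that smartctl_device plays. The function name `rekey_by_serial`, the serial numbers, and the temperature values are all made up for illustration.

```python
from collections import defaultdict


def rekey_by_serial(scrapes):
    """Group samples by drive serial number instead of device node.

    Each scrape is (timestamp, info, values): `info` maps device node to
    serial number, `values` maps device node to the sampled value.
    """
    series = defaultdict(list)
    for timestamp, info, values in scrapes:
        for device, value in values.items():
            serial = info.get(device)
            if serial is None:
                continue  # no info sample for this node, so it cannot be joined
            series[serial].append((timestamp, value))
    return dict(series)


# Hypothetical data: the two drives swap device nodes after a reboot.
scrapes = [
    (1, {"/dev/sda": "SER-A", "/dev/sdb": "SER-B"}, {"/dev/sda": 31, "/dev/sdb": 40}),
    (2, {"/dev/sda": "SER-B", "/dev/sdb": "SER-A"}, {"/dev/sda": 41, "/dev/sdb": 32}),
]
by_serial = rekey_by_serial(scrapes)
```

Keyed by device node, /dev/sda would appear to jump from 31 to 41 across the reboot; keyed by serial number, each drive keeps one continuous history.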