Export unit substates #12

hamiltont · 2020-02-26T18:22:07Z

We currently export unit states, but not unit substates. Substates often carry much more actionable information than states, such as why a unit is inactive (e.g. did it stop with an error, was it killed, did it stop without error). Note that a unit's possible substates depend on its type: different types (service, mount, etc.) have different possible substates. See the large list below for all possible combinations on systemd v237 (and be aware that different systemd versions have added or removed substates as needed).

Exporting substates would support querying, graphing, and possibly alerting by substate, e.g. sum(systemd_unit_state{state="inactive"}) by (type, substate).

As I see it, there are two reasonable ways to expose this substate information:

1. Add a new label substate to the systemd_unit_state metric.
2. Export a new metric for each unit type with a substate label. For example, systemd_mount_state{name="foo.mount", substate="mounted"}.

IMO adding a new label to systemd_unit_state makes the most sense, but other opinions are welcome

Regardless of approach, I do not think we should follow the standard Prometheus guideline of exporting all possible values of substate as 0-value timeseries. The cardinality explosion is ridiculous. For example, for each service unit we would be exporting approx. 6 states * 16 substates = 96 timeseries.
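To make the arithmetic concrete, here is a rough sketch that counts the zero-filled series per unit for a few unit types. The state and substate names are copied from the systemctl --state=help output in this issue (systemd v237); the function name zero_filled_series_per_unit is made up for illustration and is not part of the exporter.

```python
# Active states and substates copied from the `systemctl --state=help`
# output in this issue (systemd v237). Other systemd versions will differ.
ACTIVE_STATES = [
    "active", "reloading", "inactive", "failed", "activating", "deactivating",
]

# Only a few unit types are listed here to keep the sketch short.
SUBSTATES = {
    "service": [
        "dead", "start-pre", "start", "start-post", "running", "exited",
        "reload", "stop", "stop-sigabrt", "stop-sigterm", "stop-sigkill",
        "stop-post", "final-sigterm", "final-sigkill", "failed", "auto-restart",
    ],
    "mount": [
        "dead", "mounting", "mounting-done", "mounted", "remounting",
        "unmounting", "remounting-sigterm", "remounting-sigkill",
        "unmounting-sigterm", "unmounting-sigkill", "failed",
    ],
    "target": ["dead", "active"],
}


def zero_filled_series_per_unit(unit_type):
    """Series one unit would need if every state/substate pair were exported."""
    return len(ACTIVE_STATES) * len(SUBSTATES[unit_type])


for unit_type in sorted(SUBSTATES):
    print(unit_type, zero_filled_series_per_unit(unit_type))
```

Multiply those per-unit numbers by the number of units on a host to get the total series count the zero-filled approach would produce.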

Instead, we would add the current substate label to each metric. When the substate changes, this would be a new timeseries. For example systemd_unit_state{name="ssh.service", type="service", state="inactive", substate="failed"} would be distinct from systemd_unit_state{name="ssh.service", type="service", state="inactive", substate="dead"}. This might require aggregation in PromQL queries. However, as we already export one-timeseries-per-state, this may be an easy transition (e.g. convert by (state) into by (state, substate)). Feedback welcome on this...
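A rough sketch of how this could look, written in Python only to show the label shape rather than the exporter's actual Go code. It assumes the existing behaviour is one series per active state with value 1 for the current state and 0 otherwise, and that the substate label always carries the unit's current substate. The helpers series_for_unit and sum_by, and the example units, are hypothetical.

```python
from collections import defaultdict

ACTIVE_STATES = (
    "active", "reloading", "inactive", "failed", "activating", "deactivating",
)


def series_for_unit(name, active_state, sub_state):
    """One series per active state, each tagged with the current substate.

    Assumption: the value is 1 for the unit's current state and 0 otherwise.
    """
    unit_type = name.rsplit(".", 1)[-1]
    return [
        (
            {"name": name, "type": unit_type, "state": state, "substate": sub_state},
            1.0 if state == active_state else 0.0,
        )
        for state in ACTIVE_STATES
    ]


def sum_by(series, *label_names):
    """Rough equivalent of PromQL `sum(...) by (label_names)`."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[tuple(labels[n] for n in label_names)] += value
    return dict(totals)


# Made-up units, for illustration only.
series = []
for name, state, sub in [
    ("ssh.service", "active", "running"),
    ("cron.service", "active", "running"),
    ("backup.service", "inactive", "dead"),
]:
    series.extend(series_for_unit(name, state, sub))

by_state = sum_by(series, "state")
by_state_substate = sum_by(series, "state", "substate")
print(by_state_substate)
```

The per-unit series count stays at 6, and an existing by (state) aggregation gives the same totals as before. The catch is that a series with the old substate label stops being exported when the substate changes.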

Regarding exporter performance, the good news is we are already receiving substate information from dbus. It's included in every dbus.UnitStatus already, so there is effectively zero performance penalty for adding it as a new label.

List of states and substates on one of my systems. Note: different systemd versions will have different lists of substates.

m ~$ systemctl --state=help
Available unit load states:
stub
loaded
not-found
error
merged
masked

Available unit active states:
active
reloading
inactive
failed
activating
deactivating

Available automount unit substates:
dead
waiting
running
failed

Available device unit substates:
dead
tentative
plugged

Available mount unit substates:
dead
mounting
mounting-done
mounted
remounting
unmounting
remounting-sigterm
remounting-sigkill
unmounting-sigterm
unmounting-sigkill
failed

Available path unit substates:
dead
waiting
running
failed

Available scope unit substates:
dead
running
abandoned
stop-sigterm
stop-sigkill
failed

Available service unit substates:
dead
start-pre
start
start-post
running
exited
reload
stop
stop-sigabrt
stop-sigterm
stop-sigkill
stop-post
final-sigterm
final-sigkill
failed
auto-restart

Available slice unit substates:
dead
active

Available socket unit substates:
dead
start-pre
start-chown
start-post
listening
running
stop-pre
stop-pre-sigterm
stop-pre-sigkill
stop-post
final-sigterm
final-sigkill
failed

Available swap unit substates:
dead
activating
activating-done
active
deactivating
deactivating-sigterm
deactivating-sigkill
failed

Available target unit substates:
dead
active

Available timer unit substates:
dead
waiting
running
elapsed
failed

hamiltont · 2020-02-26T18:24:35Z

@povilasv FYI - I'm not trying to include this into #10 - it has just been on my mind for a while so I wanted to write up the issue

/cc @SuperQ - Any insights you have on how to implement this in a sane manner would be appreciated

povilasv · 2020-02-27T06:34:35Z

👍 thanks for this. IMO it makes sense and I like the systemd_unit_state{name="ssh.service", type="service", state="inactive", substate="failed"} approach.

SuperQ · 2020-02-27T08:20:01Z

While it's not optimal, I agree that the cardinality for substates is a bit much. There will still be issues with disappearing metrics.

My only concern is that there will be different behavior between the two labels.

hamiltont · 2020-02-27T17:45:43Z

One additional thought. It would be straightforward to add a feature flag collector.enable-complete-substate-series that lets users request zero-value series for all possible substates. This would be potentially useful for advanced users who want to alert on substate changes for a small set of business-critical units (specified with collector.unit-whitelist).

IMO, maintaining the boilerplate code (list of all possible substates) is not worth it for a feature that might be used by a hypothetical advanced user. Would be better to wait for someone to request something like this before prematurely adding the feature.
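For reference, a minimal sketch of what such a flag would entail, again in Python just to show the shape. The complete parameter stands in for the proposed collector.enable-complete-substate-series flag, and known_substates is the hard-coded per-type list that would have to be maintained. The function name substate_series is hypothetical.

```python
def substate_series(current_substate, known_substates, complete=False):
    """Return {substate: value} for a single unit.

    complete=False: only the current substate is exported (value 1).
    complete=True:  every known substate is exported, zero-filled, with the
                    current one set to 1. This is the mode that needs the
                    maintained list of substates per unit type.
    """
    if not complete:
        return {current_substate: 1.0}
    series = {substate: 0.0 for substate in known_substates}
    # A substate missing from the hard-coded list (e.g. one added by a newer
    # systemd) is still exported, so the current value is never dropped.
    series[current_substate] = 1.0
    return series


# Substates for target units, from the list earlier in this issue.
TARGET_SUBSTATES = ["dead", "active"]

print(substate_series("active", TARGET_SUBSTATES))
print(substate_series("active", TARGET_SUBSTATES, complete=True))
```

The logic itself is tiny. The maintenance cost is in keeping known_substates correct for every unit type across systemd versions.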

EpiqSty mentioned this issue Apr 21, 2020

implement handling of systemd active (exited) status prometheus/node_exporter#1350

Open

JensErat mentioned this issue May 20, 2020

cgroups should not be checked if substate is exited #33

Open
