Journal probe rework #7

phmccarty · 2017-01-20T21:53:15Z

This series reworks the existing journal probe to make it more generic, able to capture error log messages for any service, not just the crashprobe.

rnesius · 2017-02-10T19:19:50Z

src/probes/journal.c

        // The three highest log levels, all indicating errors
        MATCH("PRIORITY=1");
        MATCH("PRIORITY=2");
        MATCH("PRIORITY=3");
+        OR;


Perhaps namespace the macros (JOURNAL_MATCH, JOURNAL_AND, etc.).

Add a comment summarizing the boolean logic: boot_id && (1 || 2 || 3 || exited).

Unit tests?

I renamed the macros and added documentation for the logical expression.
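
For readers following the thread, the logical expression under discussion can be restated as a plain predicate. This is only a sketch: the real probe builds its filter through the journal API (the JOURNAL_MATCH and JOURNAL_OR macros in the diff), and the struct and function names below are hypothetical. It covers priorities 1 to 3 as in the expression above; a later commit in this series adds level 0 as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified view of a journal entry: only the fields the
 * filter looks at. exit_code may be NULL when the field is not set. */
struct entry_sketch {
        const char *boot_id;
        int priority;
        const char *exit_code;
};

/* boot_id && (PRIORITY=1 || PRIORITY=2 || PRIORITY=3 || EXIT_CODE=exited) */
static bool entry_matches(const struct entry_sketch *e,
                          const char *current_boot_id)
{
        assert(e != NULL && e->boot_id != NULL && current_boot_id != NULL);

        bool same_boot = strcmp(e->boot_id, current_boot_id) == 0;
        bool is_error = e->priority >= 1 && e->priority <= 3;
        bool service_exited = e->exit_code != NULL &&
                              strcmp(e->exit_code, "exited") == 0;

        return same_boot && (is_error || service_exited);
}
```

An entry from the current boot matches if either its priority is in the error range or its EXIT_CODE field is "exited"; entries from other boots never match.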

The journal probe is in the same boat as the crash probe... No unit tests are written for either because they rely on system libraries for most of their operation, and I'm not sure how I would go about mocking everything for the tests.

Previous behavior was to only filter LOG_ERR messages from the telemetrics crashprobe, but it will be helpful to make this probe more generic to capture log messages with the highest log levels. Signed-off-by: Patrick McCarty <[email protected]>

The previous logic concatenates log messages on initial startup, when the entire journal is read to process existing messages. But doing so might run into the payload size limit (8KB), and thus fail to create a record. Sending one record per log message will ensure that the payload size remains relatively small, almost always below the 8KB size limit. Signed-off-by: Patrick McCarty <[email protected]>

bryteise · 2017-02-17T17:40:58Z

src/probes/journal.c

+                return false;
+        }
+
+        r = sd_journal_add_match(journal, "PRIORITY=2", 0);


I sense a configuration option and, eventually, a loop for this code, à la:

telemetrics.conf: journal-log-level = X
for (int i = 1; i < atoi(X); i++) { sd_journal_add_match }

kind of thing.

A config option would be helpful, I agree. I opened #12 for tracking.
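
A minimal sketch of what such a loop might look like, assuming a hypothetical journal-log-level setting that has already been parsed into an integer. Unlike the pseudocode above, it starts at level 0 and includes the configured level itself; that is an assumption about the intended behavior, not something decided in this thread. The helper name and buffer size are made up for illustration. Each resulting string would then be handed to sd_journal_add_match, as in the diff.

```c
#include <assert.h>
#include <stdio.h>

#define MATCH_LEN 16

/* Hypothetical helper: fill `out` with "PRIORITY=0" .. "PRIORITY=<max_level>"
 * and return the number of strings written, or -1 on invalid input.
 * Syslog priorities range from 0 (emerg) to 7 (debug). */
static int build_priority_matches(int max_level, char out[][MATCH_LEN],
                                  int capacity)
{
        if (max_level < 0 || max_level > 7 || capacity < max_level + 1) {
                return -1;
        }

        for (int i = 0; i <= max_level; i++) {
                /* "PRIORITY=7" plus the terminator fits well within MATCH_LEN */
                snprintf(out[i], MATCH_LEN, "PRIORITY=%d", i);
        }

        return max_level + 1;
}
```

With max_level set to 3, this yields the four match strings for the emerg, alert, crit, and err levels.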

bryteise · 2017-02-17T17:43:18Z

src/probes/journal.c

+                r = sd_journal_add_match(journal, data, 0); \
+                if (r < 0) { \
+                        tm_journal_match_err(r); \
+                        return false; \


Always a little scary when the macro can return in some cases but in this case it seems reasonable enough.

Yes. In this case, sd_journal_add_match and friends returning an error is very unlikely, so I figured the returns here would be acceptable for unlikely conditions.
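
To make the concern concrete, here is a self-contained sketch of the "macro that returns" pattern. It does not reproduce the macros from this series: add_match_stub and match_err_stub are stand-ins for sd_journal_add_match and the error-reporting helper, and MATCH_SKETCH is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for sd_journal_add_match(): returns 0 on success and a negative
 * value on error. Here a NULL match string is treated as the error case. */
static int add_match_stub(const char *data)
{
        return data != NULL ? 0 : -22;
}

static int match_errors_logged = 0;

/* Stand-in for the error-reporting helper seen in the diff. */
static void match_err_stub(int r)
{
        (void)r;
        match_errors_logged++;
}

/* Hypothetical macro: on failure, report the error and return false from
 * the *calling* function. The do/while(0) wrapper makes the macro behave
 * like a single statement. */
#define MATCH_SKETCH(data) \
        do { \
                int r_ = add_match_stub(data); \
                if (r_ < 0) { \
                        match_err_stub(r_); \
                        return false; \
                } \
        } while (0)

/* The hidden return is only valid in functions that return bool. */
static bool add_filters_sketch(const char *first, const char *second)
{
        MATCH_SKETCH(first);
        MATCH_SKETCH(second);
        return true;
}
```

The trade-off is the one noted above: call sites stay short, but control flow leaves the function from inside the macro, so any cleanup needed before returning has to be handled with that in mind.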

bryteise · 2017-02-17T17:48:12Z

src/probes/journal.c

        // The three highest log levels, all indicating errors
        JOURNAL_MATCH("PRIORITY=1");
        JOURNAL_MATCH("PRIORITY=2");
        JOURNAL_MATCH("PRIORITY=3");
+        JOURNAL_OR;
+        // Only set for service-level error conditions
+        JOURNAL_MATCH("EXIT_CODE=exited");


So this confuses me a little. I'd expect "exited" to indicate a possible issue (assuming a non-zero exit), but look at the values reported by:
journalctl -F EXIT_CODE

dumped
exited
killed

I'd assume we may want to look at the dumped and killed case potentially too.

Oh, interesting. In my testing, I only encountered "exited", but I didn't inspect the source to see what other values it accepts.

Filed an issue for this (#13)

bryteise · 2017-02-17T17:51:19Z

src/probes/journal.c

+                // For now, we send one record per log message, in case the
+                // there is a large backlog of messages and we exceed the
+                // payload size limit (8KB). And ignore errors, hoping that it's
+                // a transient problem.


This is fine as we can try to put things back together server side. Though could we have a message-id + optional message part X/Y type thing? It would really make things easier.

I was thinking of using the telemetry machine-id + client record timestamp fields to piece together a series of journal probe records. Would that be acceptable?

Ah yeah, that should work just as well. Drinking from the firehose of the full log will always be a little tricky.
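
As a rough illustration of the server-side idea, records could be grouped by machine ID and ordered by client timestamp. The record layout and names below are hypothetical and do not reflect the actual telemetry record format.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified record: just the fields used for regrouping. */
struct record_sketch {
        const char *machine_id;
        long long timestamp;   /* client-side record timestamp */
        const char *message;
};

/* Order by machine ID first, then by timestamp within each machine. */
static int compare_records(const void *a, const void *b)
{
        const struct record_sketch *ra = a;
        const struct record_sketch *rb = b;

        int c = strcmp(ra->machine_id, rb->machine_id);
        if (c != 0) {
                return c;
        }
        return (ra->timestamp > rb->timestamp) - (ra->timestamp < rb->timestamp);
}

static void group_records(struct record_sketch *records, size_t count)
{
        qsort(records, count, sizeof(records[0]), compare_records);
}
```

One limitation of this approach: records from the same machine with identical timestamps have no defined relative order, since qsort is not guaranteed to be stable. An explicit sequence number (the "part X/Y" idea) would avoid that.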

bryteise · 2017-02-17T17:53:49Z

src/probes/journal.c

+
+                if (!send_data(error_class)) {
+                        telem_log(LOG_ERR, "Failed to send data. Ignoring.\n");
+                        return num_entries;


I am a little worried about this: the function doesn't seem to take a last-tried value, so I'm unsure how well it would handle an error in the middle of sending messages and then resume from that point later on.

bryteise · 2017-02-17T17:58:17Z

src/probes/journal.c

-        }
-
-        if (!payload) {
+        } else if (ret == 0) {


Ah, if I had read a little further on, I'd have seen it.

So I'm wondering if, before returning when send_data fails, we could put the data back onto the journal so it can be read again next time.

Also, ret could be 0 when send_data fails, so the "no existing entries" log message would be wrong. I'd return -1 in the send_data failure case, and hopefully there is a way to get data back into the journal after pulling it out.

This is an excellent observation. I wasn't considering how the probe should behave in the event of send_data failures. I suppose I could call sd_journal_previous to move the pointer in the reverse direction in this case, but a timeout would be needed...

So, I think that I want to continue ignoring send_data failures for now, since this may indicate a configuration problem with the telemetry client on the system. Also, I think it's acceptable for the journal probe to be "lossy" with respect to the data it collects.

But I will revisit the error handling here later.

I'll remove the LOG_DEBUG you mentioned.
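
For reference, a sketch of the alternative discussed in this thread (not what the series implements, since send_data failures continue to be ignored for now): return -1 on a send failure so that it can be told apart from "no entries", and remember where to resume. All names are hypothetical, and an array index stands in for the journal read pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*send_fn)(const char *payload);

/* Stub sender for illustration: succeeds until `stub_budget` is used up. */
static int stub_budget = 0;

static bool send_stub(const char *payload)
{
        (void)payload;
        if (stub_budget <= 0) {
                return false;
        }
        stub_budget--;
        return true;
}

/* Sends entries starting at *next_index, one record per entry.
 * Returns the number of entries sent, or -1 if a send fails. In both cases
 * *next_index is updated to the first entry that has not been sent, so a
 * later call can resume from there. */
static int send_entries_sketch(const char *const *entries, size_t count,
                               size_t *next_index, send_fn send)
{
        int sent = 0;

        for (size_t i = *next_index; i < count; i++) {
                if (!send(entries[i])) {
                        *next_index = i;
                        return -1;
                }
                sent++;
        }

        *next_index = count;
        return sent;
}
```

With this shape, a return value of 0 means only that there was nothing left to send, while -1 signals a failure at the entry recorded in *next_index.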

phmccarty force-pushed the journal-probe-rework branch from 92059d7 to 6f80467 · January 21, 2017 01:38

rnesius suggested changes Feb 10, 2017

phmccarty added 10 commits February 13, 2017 10:46

journal probe: fix error handling of read_new_entries()

d4261b5

The error case is negative, so fix the return code conditional check appropriately. Signed-off-by: Patrick McCarty <[email protected]>

journal probe: add newlines to each message in the payload

1f2083b

To make payloads more legible, make sure separate messages are newline separated. Signed-off-by: Patrick McCarty <[email protected]>

journal probe: fix formatting of some log messages

0adfc63

Signed-off-by: Patrick McCarty <[email protected]>

journal probe: add macros to help with using the journal API

9dcdc0d

To avoid having to read the repetitive error handling when adding journal filters, add some helper macros. Signed-off-by: Patrick McCarty <[email protected]>

journal probe: use the new macros

9834e9f

Signed-off-by: Patrick McCarty <[email protected]>

journal probe: filter on EXIT_CODE as well

7dee764

This field appears to be set when services fail, so filter on it as well. Signed-off-by: Patrick McCarty <[email protected]>

journal probe: filter LOG_EMERG logs as well

bf3dc47

I omitted log level 0 from the filter, so add it here. Signed-off-by: Patrick McCarty <[email protected]>

Document the logical expression for journal entry matching

15bdcf0

Signed-off-by: Patrick McCarty <[email protected]>

phmccarty force-pushed the journal-probe-rework branch from 6f80467 to 15bdcf0 · February 13, 2017 19:17

bryteise reviewed Feb 17, 2017

rnesius approved these changes Feb 17, 2017

journal probe: remove obsolete LOG_DEBUG message

35b6ca2

A return value of 0 can also indicate send_data() failure, so remove this log message. Signed-off-by: Patrick McCarty <[email protected]>

phmccarty merged commit 9b435f3 into clearlinux:master Feb 17, 2017

phmccarty deleted the journal-probe-rework branch May 14, 2018 19:26
