Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

session: Pass remote peer address in authorize message, add authorize response timeout, robustify client certificate mapping errors #21292

Merged

Conversation

martinpitt
Copy link
Member

Split out from #16808 plus some cockpit-session error code path strengthening suggested by @allisonkarlitskaya in #16808 (comment) .

Copy link
Member

@allisonkarlitskaya allisonkarlitskaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks pretty good. Some very minor nitpicks below.

The one thing I find most concerning is the signal 14 message. This is unexpected and it requires a better explanation.

src/session/session-utils.c Show resolved Hide resolved
@@ -164,6 +164,24 @@ write_control_end (void)
auth_msg_size = 0;
}

__attribute__((__noreturn__)) void
exit_init_problem (const char *problem, const char *message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra whitespace after the comma

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


free (payload);
exit (5);
exit_init_problem (problem, message);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same odd double whitespace :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...not fixed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fixed the last bit.

@@ -401,15 +400,7 @@ exit_init_problem (int result_code)
else
message = pam_strerror (NULL, result_code);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a guarantee that PAM won't someday have a \ or " in their messages? I know this is an existing issue, so feel free to punt on it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running out of time this morning, but good point. It's not a new problem, so I'd punt it for this PR. But I made a TODO entry for myself and also in #16808.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed here after all. Running out of razor blades!

@@ -56,7 +56,7 @@ read_proc_self_cgroup (size_t *out_length)
if (fp == NULL)
{
warn ("Failed to open /proc/self/cgroup");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you want to keep the copy for the journal?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as we don't see the protocol messages in the journal on failures, so the warn()/fail() are for tests and admins.

@@ -285,7 +285,7 @@ sssd_map_certificate (const char *certificate, char** username)
*
* Read the given certificate file, ensure that it belongs to our own cgroup, and ask
* sssd to map it to a user. If everything matches as expected, return the user name.
* Otherwise return %NULL, a warning message will already have been logged.
* Otherwise exit the process with sending an appropriate error to the protocol.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"to the protocol" is a bit weird as language. "To stdout using the cockpit protocol" perhaps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

warnx ("Could not determine cgroup of this process");
return NULL;
}
assert (ws_cgroup);
/* A simple prefix comparison is appropriate here because ws_cgroup
* will contain exactly one newline (at the end), and the expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost feel like this is the place to make the extra check to ensure that ws_cgroup definitely ends with \n.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#16808 changes this a lot, so I'd prefer to make the validation changes there.

@@ -279,6 +279,8 @@ class CommonTests:
# (this should be fixed properly in cockpit-ws)
self.allow_journal_messages('.*session timed out during authentication')
self.allow_journal_messages('.*authentication timed out')
# 14 == SIGALRM
self.allow_journal_messages('.*bridge program failed: Child process killed by signal 14')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong. If your signal handler is working properly then we shouldn't see this message. It should exit cleanly via your _exit(EX) call. Please look into this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, yes, I didn't pay enough attention to this 🙈 This is doubly nonsense, as the bridge gets killed with that alarm, not cockpit -session. I. e. that alarm(60) somehow gets inherited.

I see one obvious bug here, there is a signal (SIGALRM, SIG_DFL); call later on which I missed. However, after fixing that, I still get that error. It almost looks like starting a PAM session inherits the current alarms and signal handlers, and that of course is really not intended.

This confirms my initial skepticism about making the range of that alert(60) so wide. I reverted it to a narrower scope of waiting for and parsing the initial authorize dialog.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our original issue here is that this message is misleading. This journal message is coming from -ws (cockpitpipetransport.c:125), which prints it after observing the exit status of cockpit-session.

The child doesn't inherit alarm() across a fork(). This is the documented behaviour per POSIX, Linux, and I tested that it's true.

Somehow the -session process itself is the thing that's getting killed by the signal. Which means that there's still something unexplained going on here.

Copy link
Member Author

@martinpitt martinpitt Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as said on Matrix it's definitively not our spawn_and_wait(). That came after the alarm disarming even in the previous version of the patch. Hence my suspicion that it's the PAM session start, because what else could it be? If that's session itself (and a message from ws), then our signal handler is somehow not working.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I did test the signal handler, with alarm(1) and an extra sleep in ws (I put the patch somewhere, but can't find it quickly -- too many threads). That worked as expected, including the correct message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note: This piece of the patch is entirely gone now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to chase this down long enough to have thoroughly disproven my own original theory of what was going on here and to have (re-)learned with a high degree of certainty exactly what the process model is here.

The tl;dr: of it all is that a million things use alarm() for all kinds of reasons, and many of them attempt to restore the previous state. I've been chasing ghosts through glibc with the locking functions in the utmp code and the password database and haven't been able to come up with anything, but I'm convinced enough at this point that the culprit is definitely in-process.

I also don't care anymore. It would have been nice to expand this to cover more code, but my concerns about the user being able to delay the other parts of the code are completely theoretical, and definitely "not our bug". Thanks for putting up with me on this. I'm not satisfied with the (lack of) answer but I'm sure enough about it to know that nothing deeply "weird" is happening here, and even if it is, it's not our fault.

@@ -937,7 +939,7 @@ build_session_credentials (CockpitAuth *self,
}
}

remote_peer = cockpit_web_request_get_remote_address (request);
gchar *remote_peer = cockpit_web_request_get_remote_address (request);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda wondering why you didn't spring for g_autofree here...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, been hacking on cockpit-session for too long. Used that while at it.

The various perform_*() functions all assume a non-NULL rhost, as
several functions such as `btmp_log()` do unchecked strncpy() on them.
Add assertions to (1) make them fail more gracefully and usefully), and
(2) document that requirement more explicitly.

This is currently guaranteed by having an explicit fallback to `""` if
`$COCKPIT_REMOTE_PEER` isn't set.
For hysterical raisins this was a leftover in session-utils (and even
declared *twice* for the lolz), but it's only being used in session.c
(which has all the PAM stuff). It fits much more neatly besides
`last_txt_msg`.
Copy link
Member

@allisonkarlitskaya allisonkarlitskaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usual: one serious problem and a small collection of minor/optionals.

@@ -164,6 +164,24 @@ write_control_end (void)
auth_msg_size = 0;
}

__attribute__((__noreturn__)) void
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't catch this before, but <stdnoreturn.h> is a thing since C11. Since C23 apparently you should write [[noreturn]] without the #include (which is now deprecated). That syntax is a bit of an eyebrow-raise for me, to be honest, but hey... no time like the present...

Feel free to pick either of those, or to ignore this request.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use -std=gnu18, so too new for [[noreturn]] I figure. find /usr/include/ -name stdno*.h shows no such entry. Just that libc apparently puts that attribute after the declaration, not before. Nevertheless, as this has been there for ages, I'd rather not touch it here and open yet another Baustelle.


free (payload);
exit (5);
exit_init_problem (problem, message);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...not fixed?

@@ -957,6 +968,10 @@ main (int argc,
if (!cockpit_authorize_type (response, &type))
errx (EX, "invalid authorization header received");

/* stop time-bounding */
alarm (0);
sigaction (SIGALRM, &(struct sigaction) { .sa_handler = SIG_DFL, .sa_flags = 0 }, NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant: ignored signals survive across execve() but those with handlers are reset to default (since the handler isn't present in the new executable image).

Although I suppose we keep our own version of ourselves around to wait for the session to exit to perform cleanup.

I'm really curious about why lifting to here is required, though, and it's certainly not ideal. At exactly which point does PAM call fork()? And which side are "we" on after it does that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alarm(2) says nothing about auto-resetting the alarm handler. I want to make really damn sure that we don't spawn the session with a non-default SIGALRM, so I really want to keep this. Maybe it's redundant, but then so are a lot of other things in session (just count the useless free() calls :grin).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with that logic.

/* we send the message through the Cockpit protocol, make sure we don't break JSON encoding */
char *message_sanitized = strdupx (message);
char *c;
while ((c = strchr (message, '"')) != NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You make a mutable copy of the string into message_sanitized and then you modify the original message before passing the unmodified copy to exit_init_problem() :) char* unsafety issues strike again via the (terrible) signature of strchr() which takes a const char * and returns a char *. I seriously start to hate C.

Shame we don't have g_strcanon here.

The two passes over the string bother me and I'd have written it more like

   for (int i = 0; message[i]; i++)
       if (message[i] != '\\') ... else if ...

but I bet yours is actually faster in practice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, copy pasta! Fixed.

We want to use that facility for non-PAM code (like certificate
checking). Rename the remainder in session.c to `exit_pam_init_problem()` to
clarify its purpose.
Instead of passing them up to the caller, make the internal functions
immediately exit cockpit-session with an appropriate protocol error,
using our new `exit_init_problem()`.

This doesn't do much yet, but this code is going to become more complex
soon, let's not take chances of missing some error handling.
Just like in cockpit-wsinstance-*.socket
cockpit_session_launch() doesn't set this env if the remote host is
unknown. That is never the case in practice at least in our tests, but
callers should still be aware of it.
ws times out the authorization attempt after one minute
(`cockpit_ws_auth_response_timeout`). Introduce a similar timeout on the
session side. This makes PID recycling attacks harder, as their victim
now has to log in within one minute.

Drop the resetting of SIGQUIT. Nothing changes it before, `signal()` is
discouraged, let's not mix it with our new `sigaction()` calls. That
code has been there since the initial commit, so we don't have a
justification for it.
Passing the remote peer from ws to cockpit-session via the
`$COCKPIT_REMOTE_PEER` environment variable does not work in unix socket mode.
So make that part of the protocol instead and attach it to the authorize
response.
Any `"` or `\` in PAM error messages break the protocol, as we do naïve
JSON encoding. Replace them with `'` and `/` respectively.

(There are no known cases of this today, but let's be safe).
Copy link
Member

@allisonkarlitskaya allisonkarlitskaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the fixes. This is ready to go now!

/* we send the message through the Cockpit protocol, make sure we don't break JSON encoding */
char *message_sanitized = strdupx (message);
char *c;
while ((c = strchr (message_sanitized, '"')) != NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't look close enough the first time around (because of the more obvious bug) but my understanding of this code wasn't correct: you're rescanning the entire string on every loop iteration, not just starting again from the point of c.

This is valid though, because the previous "match" will no longer exist, since you've replaced it.

@allisonkarlitskaya allisonkarlitskaya merged commit ea2a62c into cockpit-project:main Nov 21, 2024
85 checks passed
@martinpitt martinpitt deleted the session-socket-prep branch November 21, 2024 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants