Add a dedicated method for disconnecting TLS connections #10005

Merged
julianbrost merged 4 commits into master from graceful-tls-disconnect
Dec 12, 2024

Conversation

julianbrost
Contributor

@julianbrost julianbrost commented Feb 16, 2024

Properly closing a TLS connection involves sending some shutdown messages so that both ends know that the connection wasn't truncated maliciously. Exchanging those messages can stall for a long time if the underlying TCP connection is broken. The HTTP connection handling was missing any kind of timeout for the TLS shutdown so that dead connections could hang around for a long time.

This PR introduces two new methods on AsioTlsStream: ForceDisconnect(), which just wraps the call for closing the TCP connection, and GracefulShutdown(), which performs the TLS shutdown with a timeout, similar to how it was done in JsonRpcConnection::Disconnect() before.
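
To illustrate the idea of a graceful shutdown that falls back to a forced close after a timeout, here is a minimal toy model. It is only a sketch under stated assumptions: it uses simulated ticks instead of Boost.Asio timers and coroutines, the types FakeTlsPeer and FakeTlsStream are hypothetical, and the method names are borrowed from this PR purely for readability (GracefulDisconnect is the name that appears in the reviewed code further down). It is not the actual implementation.

```cpp
#include <cassert>
#include <optional>

// Toy model only: neither Boost.Asio nor the Icinga 2 implementation.
enum class DisconnectOutcome { Graceful, Forced };

// Hypothetical peer: replyTick is the simulated tick at which it answers the
// TLS shutdown message. std::nullopt models a broken TCP connection where
// no answer ever arrives.
struct FakeTlsPeer {
    std::optional<int> replyTick;
};

// Hypothetical stream with the two disconnect flavours discussed above.
struct FakeTlsStream {
    FakeTlsPeer peer;
    bool open = true;

    // Models closing the TCP connection without any TLS message exchange.
    void ForceDisconnect() { open = false; }

    // Waits up to timeoutTicks simulated ticks for the peer's answer and
    // falls back to ForceDisconnect() if it does not arrive in time.
    DisconnectOutcome GracefulDisconnect(int timeoutTicks) {
        assert(timeoutTicks >= 0);

        for (int tick = 0; tick < timeoutTicks; ++tick) {
            if (peer.replyTick && *peer.replyTick <= tick) {
                open = false;
                return DisconnectOutcome::Graceful;
            }
        }

        ForceDisconnect();
        return DisconnectOutcome::Forced;
    }
};
```

In this model, a peer that answers within the timeout yields Graceful, while a silent or too-slow peer yields Forced. In both cases the stream ends up closed, which is the property that the missing timeout in the HTTP connection handling did not guarantee.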

fixes #9986

@julianbrost
Contributor Author

Open questions:

  • Do we want to try to call ForceDisconnect() directly in case a connection is shut down due to a timeout like "no messages received" on JSON-RPC connections?

By now, I'm pretty sure the answer is yes.

Evidence: Take two connected Icinga 2 nodes and break individual connections by dropping the packets specific to that connection in a firewall. Both nodes will detect that no messages were received and reconnect; however, the old connection remains in an established state:

root@satellite-b-2:/# netstat -tpn | grep 172.18.0.33
tcp6       0      0 172.18.0.31:5665        172.18.0.33:49584       ESTABLISHED 60/icinga2          
root@satellite-b-2:/# iptables -A INPUT -p tcp -s 172.18.0.33 --sport 49584 -j DROP
root@satellite-b-2:/# netstat -tpn | grep 172.18.0.33
tcp6       0      0 172.18.0.31:5665        172.18.0.33:49098       ESTABLISHED 60/icinga2          
tcp6       0  82558 172.18.0.31:5665        172.18.0.33:49584       ESTABLISHED 60/icinga2          
root@satellite-b-2:/# iptables -A INPUT -p tcp -s 172.18.0.33 --sport 49098 -j DROP
root@satellite-b-2:/# netstat -tpn | grep 172.18.0.33
tcp6       0  94323 172.18.0.31:5665        172.18.0.33:49098       ESTABLISHED 60/icinga2          
tcp6       0  82558 172.18.0.31:5665        172.18.0.33:49584       ESTABLISHED 60/icinga2          
tcp6       0      0 172.18.0.31:5665        172.18.0.33:39946       ESTABLISHED 60/icinga2          

That implies that there's probably a resource leak in that scenario (until the kernel decides that the connection is actually dead and returns an error for the socket operations).

Unverified theory of what might happen: JsonRpcConnection::Disconnect() could block here:

m_WriterDone.Wait(yc);

Which waits for JsonRpcConnection::WriteOutgoingMessages() to complete which might hang indefinitely in these send operations:

for (auto& message : queue) {
    size_t bytesSent = JsonRpc::SendRawMessage(m_Stream, message, yc);
    if (m_Endpoint) {
        m_Endpoint->AddMessageSent(bytesSent);
    }
}
m_Stream->async_flush(yc);

@Al2Klimov
Member

Thing to consider

@julianbrost
Contributor Author

I resolved conflicts and started implementing JsonRpcConnection::ForceDisconnect(), still far from finished, but if anyone is eager to have a look, feel free.

@julianbrost julianbrost force-pushed the graceful-tls-disconnect branch from 3a72a6f to 9d67c26 Compare November 13, 2024 16:35
@julianbrost
Contributor Author

I resolved conflicts and started implementing JsonRpcConnection::ForceDisconnect(), still far from finished

While continuing that work, I figured that this might become a bigger rework of JsonRpcConnection than I anticipated. So I decided that this is better done in a separate PR that I'll create tomorrow.

This PR should already be enough of an improvement on its own; after all, it even fixes a problem in the HTTP connection handling.

Open questions:

  • I'm not yet sure about the last two commits (d6953cc, e90acc5): they basically replace two calls of async_shutdown() with calls to the new GracefulShutdown(). In both instances, those calls are already guarded by higher-level timeouts (apilistener.cpp, apilistener.cpp, ifwapichecktask.cpp), the idea for replacing them would be to have just a single call to async_shutdown() in our code base that's properly guarded.

In that regard, I figured that adding comments to these calls explaining why they are fine is good enough, especially when compared to what was needed just to add a redundant timeout and spawn a pointless coroutine (e90acc5).

I'm leaving the PR in a draft state for the moment because I still want to answer a few detail questions regarding the two new disconnect methods (I removed a cancel() because I didn't see a good reason for it to exist, I wonder if shutdown() is even necessary after a successful async_shutdown() and once I'm sure about all this, some doc comments explaining the exact behavior certainly won't hurt).

@julianbrost julianbrost force-pushed the graceful-tls-disconnect branch from 9d67c26 to e5b9ff4 Compare November 14, 2024 16:20
@julianbrost
Contributor Author

I removed a cancel() because I didn't see a good reason for it to exist

I still couldn't find a good reason to call lowest_layer().cancel() before next_layer().async_shutdown(), which is a bit of a problem if I want to add comments explaining what the code does. Like what would we want to cancel on the lowest, i.e. TCP layer? And if something was actually cancelled, how much sense would it make to continue using the TCP connection? Would it be possible to actually cancel something in the middle of writing a TLS record and what would continuing with a new TLS record during the shutdown do?

I wonder if shutdown() is even necessary after a successful async_shutdown()

That, on the other hand, turned out to be necessary: as confirmed using strace, without calling lowest_layer().shutdown(), there would be no shutdown() syscall.

@julianbrost julianbrost force-pushed the graceful-tls-disconnect branch from e5b9ff4 to 7430618 Compare November 14, 2024 17:19
@julianbrost julianbrost marked this pull request as ready for review November 14, 2024 17:19
lib/base/tlsstream.cpp Outdated Show resolved Hide resolved

m_Stream->next_layer().async_shutdown(yc[ec]);

m_Stream->lowest_layer().shutdown(m_Stream->lowest_layer().shutdown_both, ec);
Member

So ec isn't needed anymore, right?

}

void ForceDisconnect();
void GracefulDisconnect(boost::asio::io_context::strand strand, boost::asio::yield_context yc);
Member

Suggested change
- void GracefulDisconnect(boost::asio::io_context::strand strand, boost::asio::yield_context yc);
+ void GracefulDisconnect(boost::asio::io_context::strand& strand, boost::asio::yield_context yc);

Member

The copy constructor of strand says this :):
[Screenshot, 2024-11-15 13:15:32]

Meaning, it's just like a shared pointer. So, your suggestion is not wrong, but neither is the current implementation.

Member

Actually, using a reference here wouldn't work because that method spawns a coroutine, and there's nothing that keeps the m_IoStrand object from the Http or Rpc class alive till that coroutine finishes.
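
The lifetime argument above can be sketched with a small standard-library model. Everything here is an assumption for illustration: StrandHandle is a hypothetical handle whose copies share one state (mimicking the "just like a shared pointer" behaviour described above), and TaskQueue stands in for an executor that runs work after the caller has returned. It does not use boost::asio::io_context::strand itself.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical handle type: copies share one underlying state,
// similar to a shared pointer.
class StrandHandle {
public:
    explicit StrandHandle(std::string name)
        : m_State(std::make_shared<std::string>(std::move(name))) {}

    const std::string& Name() const { return *m_State; }
    long UseCount() const { return m_State.use_count(); }

private:
    std::shared_ptr<std::string> m_State;
};

// Stand-in for an executor: tasks run later, after the caller has returned.
using TaskQueue = std::vector<std::function<std::string()>>;

// Takes the handle by value and moves it into the task, so the task owns
// its own copy and does not depend on the caller's object staying alive.
void ScheduleShutdown(TaskQueue& queue, StrandHandle strand) {
    queue.push_back([strand = std::move(strand)] {
        return "shutdown on " + strand.Name();
    });
}

// Simulates a connection object that is destroyed before its task runs.
std::string RunAfterOwnerIsGone() {
    TaskQueue queue;

    {
        StrandHandle ownerStrand("io-strand"); // member of a short-lived owner
        ScheduleShutdown(queue, ownerStrand);
        assert(ownerStrand.UseCount() == 2); // the owner plus the queued task
    } // the owner's handle is destroyed here

    // The task's copy still keeps the shared state alive.
    return queue.front()();
}
```

Had ScheduleShutdown() taken a reference and captured it in the lambda, the queued task would refer to an object that no longer exists once the inner scope ends, which is the problem the comment above describes for m_IoStrand.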

lib/base/tlsstream.cpp Show resolved Hide resolved
yhabteab
yhabteab previously approved these changes Nov 19, 2024
lib/base/tlsstream.cpp Outdated Show resolved Hide resolved
@julianbrost
Contributor Author

Force push was just a rebase to resolve merge conflicts due to changes in indentation, no other (intended) changes.

lib/base/tlsstream.cpp Outdated Show resolved Hide resolved
@julianbrost julianbrost force-pushed the graceful-tls-disconnect branch from 1cf8617 to e6d387d Compare November 27, 2024 09:05
@julianbrost
Contributor Author

And another rebase to get the GitHub Actions going again (actually, I thought the last rebase would already do that, but at that point, #10251 wasn't merged yet).

@julianbrost
Contributor Author

@Al2Klimov Please clarify what your expectations are regarding your outstanding comments (#10005 (review), #10005 (review)). Yes, cleaning up the Timeout class is nice, but is it really a prerequisite of this PR? Currently, having this PR open blocks other work (or at least dooms it to introduce avoidable merge conflicts).

lib/base/tlsstream.cpp Outdated Show resolved Hide resolved
Member

@Al2Klimov Al2Klimov left a comment

This PR is almost perfectly fine.

Timeout::Ptr shutdownTimeout(new Timeout(strand.context(), strand, boost::posix_time::seconds(10),
    [this, keepAlive = AsioTlsStream::Ptr(this)](boost::asio::yield_context yc) {
        // Forcefully terminate the connection if async_shutdown() blocked more than 10 seconds.
        ForceDisconnect();
Member

  1. If keepAlive just protects against (SEGV on) ForceDisconnect() out of GracefulDisconnect() scope, please Defer a Timeout cancellation instead.
  2. Drop a85c188.

Contributor Author

If keepAlive just protects against (SEGV on) ForceDisconnect() out of GracefulDisconnect() scope, please Defer a Timeout cancellation instead.

Yes, capturing a reference in a lambda is a pattern we use in a few places to prevent a use after free and that's why I did it here as well. Indeed, after closer inspection, due to the timeout doing the cancellation check on the same strand that will also cancel the timeout, after Cancel(), the timeout class won't use the callback, so the captured this isn't used anymore either.

Surprising what a small insight is able to resolve that whole discussion nicely.
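
The pattern agreed on here (cancel the timeout on every exit path so that the callback can no longer touch this) can be sketched as follows. This is a single-threaded toy model under loud assumptions: ScopeGuard, FakeTimeout and FakeStream are hypothetical stand-ins, not Icinga 2's Defer, Timeout or AsioTlsStream classes, and the strand-based ordering that makes the real cancellation safe is not modelled at all.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical scope guard: runs its function when it goes out of scope.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> func) : m_Func(std::move(func)) {}
    ~ScopeGuard() { if (m_Func) m_Func(); }

    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> m_Func;
};

// Hypothetical timer: Expire() only invokes the callback if Cancel()
// was not called before.
class FakeTimeout {
public:
    explicit FakeTimeout(std::function<void()> onExpire)
        : m_OnExpire(std::move(onExpire)) {
        assert(m_OnExpire != nullptr);
    }

    void Cancel() { m_Cancelled = true; }
    void Expire() { if (!m_Cancelled) m_OnExpire(); }

private:
    std::function<void()> m_OnExpire;
    bool m_Cancelled = false;
};

struct FakeStream {
    int forcedDisconnects = 0;

    // Returns the timer so that a caller can simulate a late expiry.
    // The guard cancels the timer on every exit path, so the raw `this`
    // captured by the callback is never used after this function returns.
    std::shared_ptr<FakeTimeout> GracefulDisconnect(bool shutdownHangs) {
        auto timeout = std::make_shared<FakeTimeout>([this] { ++forcedDisconnects; });
        ScopeGuard cancelTimeout([timeout] { timeout->Cancel(); });

        if (shutdownHangs) {
            timeout->Expire(); // the timer fires while the shutdown is pending
        }

        return timeout;
    }
};
```

In this sketch the guard captures the shared pointer by value because the timer is also returned from the function. A timer that fires after GracefulDisconnect() has returned does nothing, which is why no extra keep-alive reference to the stream is needed in the callback.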

Calling `AsioTlsStream::async_shutdown()` performs a TLS shutdown which
exchanges messages (that's why it takes a `yield_context`) and thus has the
potential to block the coroutine. Therefore, it should be protected with a
timeout. As `async_shutdown()` doesn't simply take a timeout, this has to be
implemented using a timer. So far, these timers are scattered throughout the
codebase with some places missing them entirely. This commit adds helper
functions to properly shut down a TLS connection with a single function call.

These new helper functions allow deduplicating the timeout handling for
`async_shutdown()`.

This new helper function has proper timeout handling which was missing here.

The reason for introducing AsioTlsStream::GracefulDisconnect() was to handle
the TLS shutdown properly with a timeout since it can block. However,
the implementation of this timeout involves spawning coroutines which are
redundant in some cases. This commit adds comments to the remaining calls of
async_shutdown() stating why calling it is safe in these places.
@julianbrost julianbrost force-pushed the graceful-tls-disconnect branch from e6d387d to a506d56 Compare December 12, 2024 12:50
Member

@Al2Klimov Al2Klimov left a comment

This is fine.

m_CheckLivenessTimer.cancel();
m_HeartbeatTimer.cancel();

m_Stream->lowest_layer().cancel(ec);
Member

The worst that can hypothetically happen if we don't cancel the reader is that it reads...

shutdownTimeout->Cancel();

m_Stream->lowest_layer().shutdown(m_Stream->lowest_layer().shutdown_both, ec);
m_Stream->GracefulDisconnect(m_IoStrand, yc);
Member

@Al2Klimov Al2Klimov Dec 12, 2024

{
    if (!lowest_layer().is_open()) {
        // Already disconnected, nothing to do.
        return;
Member

Is consistent with GracefulDisconnect() and shouldn't hurt – idk how Windows handles the -1 thing (#10005 (comment)) if you close a closed socket.😅
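
The early return discussed here makes the disconnect idempotent. A minimal sketch of that guard, assuming a hypothetical FakeSocket that merely counts how often the underlying close runs (it does not model what Boost.Asio or any operating system does when a closed socket is closed again):

```cpp
#include <cassert>

// Hypothetical socket stand-in that counts how often the close runs.
struct FakeSocket {
    bool open = true;
    int closeCalls = 0;

    bool is_open() const { return open; }

    void close() {
        assert(open); // in this toy model, closing twice is treated as a bug
        open = false;
        ++closeCalls;
    }
};

struct FakeStream {
    FakeSocket socket;

    // Returns true if this call actually closed the socket.
    bool ForceDisconnect() {
        if (!socket.is_open()) {
            // Already disconnected, nothing to do.
            return false;
        }

        socket.close();
        return true;
    }
};
```

With the guard in place, calling ForceDisconnect() a second time is a no-op, so callers such as a timeout handler and a regular disconnect path can both call it without coordinating.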


{
Timeout::Ptr shutdownTimeout(new Timeout(strand.context(), strand, boost::posix_time::seconds(10),
[this](boost::asio::yield_context yc) {
Member

Optional

Suggested change
- [this](boost::asio::yield_context yc) {
+ [this](boost::asio::yield_context) {

Would otherwise add an unnecessary compiler warning.

ForceDisconnect();
}
));
Defer cancelTimeout ([&shutdownTimeout]() {
Member

Optional

Suggested change
- Defer cancelTimeout ([&shutdownTimeout]() {
+ Defer cancelTimeout ([&shutdownTimeout] {

Would otherwise add an unnecessary CLion hint.

@yhabteab yhabteab removed their request for review December 12, 2024 15:17
@julianbrost julianbrost merged commit 452386c into master Dec 12, 2024
23 of 24 checks passed
@julianbrost julianbrost deleted the graceful-tls-disconnect branch December 12, 2024 15:20
Labels
area/api REST API bug Something isn't working cla/signed ref/IP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HttpServerConnection performs TLS shutdown without a timeout
3 participants