Fix various memory corruption and unwanted behavior scenarios #591
base: master
Conversation
👍 LGTM
5. Fix crash when using X-Accel-Redirect with message forwarding

When using X-Accel-Redirect to abstract a websocket pub/sub endpoint from the end client, combined with upstream message forwarding with …, the worker could crash. This behavior could be mitigated with placement of … Here's roughly how this would occur:

First request: …

Second request: …

Fix
Clear the current request structure's …

============================================================================

6-7. Fix crash when multi-id pubsub is closed by timeout or HTTP DELETE

When using multi-id pubsub endpoints, a variety of problems occurred upon the closing of the pubsub endpoint and/or its channels.

Example configuration: …

Consequences: …

Fix
This line …

============================================================================

8. Fix crash when forwarding used without proxy_pass

This directly addresses issue #588. It fixes the crash caused by this scenario, but does not enable its full functionality - message modification, in particular, doesn't work. Since there is an acceptable workaround, I will not attempt to fix this further - at least it's no longer possible to crash the worker thread with this scenario.

Fix
This simply populates the …

============================================================================

This concludes my series of fixes for Nchan.
Let's get this merged indeed! There's a lot to regression-test here. I will report my findings once the testing is complete.
@sobitcorp please email me at leo (at) nchan.io, I'd like to talk to you a little more about this PR.
Any news about this PR? This seems to fix all the issues I'm currently facing in my production environment!
@@ -1086,7 +1086,7 @@ static ngx_int_t websocket_perform_handshake(full_subscriber_t *fsub) {
 
 static void websocket_reading(ngx_http_request_t *r);
 
-static ngx_buf_t *websocket_inflate_message(full_subscriber_t *fsub, ngx_buf_t *msgbuf, ngx_pool_t *pool) {
+static ngx_buf_t *websocket_inflate_message(full_subscriber_t *fsub, ngx_buf_t *msgbuf, ngx_pool_t *pool, uint64_t max, int *result) {
The caller below expects `*result` to be initialized if the return value is `NULL`. However, there appear to be three (rare?) cases below where this does not happen.

Also, a `-1` magic number for "message too large" is a bit ugly, and it may be better to follow a pattern where the primary return value indicates whether/how the call is successful and the pointer is returned via indirection (`ngx_buf_t **result`).
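For illustration only, here is a minimal sketch of the calling convention suggested above; the enum, function, and parameter names are hypothetical and not part of nchan:

```c
#include <ngx_config.h>
#include <ngx_core.h>

/* Hypothetical status values; INFLATE_TOO_LARGE replaces the -1 magic number. */
typedef enum {
    INFLATE_OK = 0,
    INFLATE_TOO_LARGE,
    INFLATE_ERROR
} inflate_status_t;

/* Sketch: the status is the primary return value, and the output buffer is
   returned via indirection, so the caller never sees an undefined value. */
static inflate_status_t
inflate_message_sketch(ngx_buf_t *msgbuf, uint64_t max_len, ngx_buf_t **out)
{
    *out = NULL;  /* defined on every path */

    if ((uint64_t) (msgbuf->last - msgbuf->pos) > max_len) {
        return INFLATE_TOO_LARGE;
    }

    /* ... perform the actual inflation here, assigning *out on success ... */

    return INFLATE_OK;
}
```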
@aliqandil slact is currently regression-testing this, afaik. @jilles-sg Good catch, I fixed the uninitialized var. I also agree with your proposed change to the calling convention. I simply didn't want to make any major calling convention changes, as I'm not too deeply familiar with nchan internals, so I just tacked a result flag at the end as a quick fix.
Yup. The …
src/subscribers/websocket.c (outdated)
@@ -1091,6 +1091,8 @@ static ngx_buf_t *websocket_inflate_message(full_subscriber_t *fsub, ngx_buf_t *
   z_stream   *strm;
   int         rc;
   ngx_buf_t  *outbuf;
+
+  *result = 0;
It seems `*result` is still uninitialized if `NGX_ZLIB` is not defined (perhaps that does not happen, but it still seems wrong to write code that will not work).
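To illustrate the point (a hypothetical sketch, not the actual patch), the out-parameter can be assigned before any conditionally compiled section so it is defined in every build:

```c
/* Hypothetical sketch of the initialization order being discussed. */
static ngx_buf_t *
inflate_sketch(ngx_buf_t *msgbuf, uint64_t max, int *result)
{
    *result = 0;              /* defined even when zlib support is compiled out */

#if (NGX_ZLIB)
    /* ... the real inflate logic would run here and may update *result ... */
#endif

    (void) msgbuf;            /* unused in this stripped-down sketch */
    (void) max;
    return NULL;              /* caller inspects *result when NULL is returned */
}
```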
oops... fixed
Just checking in to see if there are any updates on this pull request. We are still seeing lots of segfaults and general protection faults on our instances, and it is beginning to become a problem under load. We think at least some of these segfaults are related to bugs this pull request is aiming to fix. @slact I know you are busy on the new nchan (which is great, really appreciate your work!), but is there any chance these fixes (or at least the ones that passed regression testing) could be merged in an upcoming release, along with the recent commit fixing the segfault during multi-publishing? It would greatly help us maintain stability, at least until the new nchan is released (as we do not really have an alternative currently). If we can help in any way with this testing or debugging, please let me know. Happy to help in any way possible.
We are also very much looking forward to this, after seeing nchan come to a halt on a traffic spike and not recover (service was running, but wouldn't accept new connections even though CPU was back down to 0).
This PR fixes the following issues:

1. Possible memory corruption when using upstream requests
2. `client_max_body_size` not respected for incoming websocket messages
3. Memory corruption in `is_utf8()` when block being checked is >16KB
4. File-cached messages (over 16KB) not getting passed to upstream

-----(After 2020-11-14)-----

5. Crash when using X-Accel-Redirect with message forwarding
6-7. Crash when multi-id pubsub is closed by timeout or HTTP DELETE

-----(After 2020-12-05)-----

8. Crash when forwarding used without proxy_pass
1. Fix possible memory corruption when using upstream requests
There are several possible configuration scenarios where using `nchan_publisher_upstream_request` can cause memory corruption and undefined behavior. This is due to sanity checks being performed (and wrongly succeeding) on a variable which is never assigned.

Example
Consider this stack trace:

At frame 6:
The `ws_publish_data_t` struct is allocated here without clearing memory (line 662). Thus, `(ws_publish_data_t)->subrequest` refers to an undefined memory location (line 690). It is to be assigned here after the subrequest setup is complete, and not referenced until then. This is normally the case, as the subrequest gets 'queued' and we go all the way back down to frame 2 before the subrequest is actually executed and finalized.

At frame 13:
However, if for whatever reason subrequest setup fails, nginx finalizes the subrequest early. For instance, here the setup fails because the request is too large (line 989).

At frame 15:
Since the request is being finalized, the NChan subrequest handler gets called, expecting the upstream subrequest to have been successfully queued and `(ws_publish_data_t)->subrequest` assigned. But we're still in the same event, which will only assign `(ws_publish_data_t)->subrequest` once we get back down to frame 6.

At frame 16:
A sanity check assertion (line 575) can wrongly succeed here due to `(ws_publish_data_t)->subrequest` having an undefined value. Thus, past this check we can get undefined behavior and memory corruption.

At frame 17:
We'll most likely segfault here from trying to dereference `(ws_publish_data_t)->subrequest`, but we can also corrupt memory and/or cause some other undefined behavior instead!

Fix
Clear `(ws_publish_data_t)->subrequest` right after creating the `ws_publish_data_t` structure, so that the various further sanity checks will correctly interrupt execution on error.
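A minimal sketch of the pattern behind the fix (the struct and function names here are hypothetical stand-ins, not the actual nchan code): pool allocations are not zeroed, so pointer members must be cleared explicitly before anything can finalize the request early.

```c
#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

/* Hypothetical stand-in for ws_publish_data_t; field names are illustrative. */
typedef struct {
    ngx_http_request_t  *subrequest;  /* valid only once subrequest setup completes */
    ngx_buf_t           *body;
} publish_data_sketch_t;

static publish_data_sketch_t *
create_publish_data_sketch(ngx_pool_t *pool)
{
    /* ngx_palloc() does NOT zero memory, so every field starts out undefined. */
    publish_data_sketch_t  *d = ngx_palloc(pool, sizeof(*d));
    if (d == NULL) {
        return NULL;
    }

    /* Clear the pointer immediately. If subrequest setup later fails and the
       finalize handler runs early, a check like assert(d->subrequest) now
       fails cleanly instead of passing on garbage and dereferencing it. */
    d->subrequest = NULL;
    d->body = NULL;

    return d;
}
```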
============================================================================

2. Respect client_max_body_size for incoming websocket messages

The `client_max_body_size` directive was not respected for incoming websocket messages. This causes potential DDoS problems and, in conjunction with `nchan_publisher_upstream_request`, a possible crash. (It also caused possible memory corruption, but that has been fixed above.)

If `nchan_publisher_upstream_request` is not being used, this is not a very serious issue, as the message length is limited by memory limits and the worst that can happen is that it can flood a channel group's memory limits.

However, with `nchan_publisher_upstream_request`, this can be used as a potential DDoS vector and to crash the nginx server. The former is due to nginx writing temporary files for any messages exceeding 16KB. It can be used by an attacker to exhaust disk space on the host system with no way to set limits of any kind, as the incoming websocket message will always be accepted, regardless of the active `client_max_body_size` setting.

Additionally, if the incoming message exceeds the `client_max_body_size` setting, upstream subrequest forwarding will fail, because at that point nginx will reject the subrequest. This will cause NChan to bring down the server worker with the error message `nginx: /nchan-1.2.7/src/subscribers/websocket.c:575: websocket_publish_upstream_handler: Assertion 'd->subrequest' failed`. (This can instead cause memory corruption and undefined behavior without the other fix above.)

Fix:
Check the configured `client_max_body_size` upon receiving a message for publishing and close the socket if it is in excess of the active setting. This is currently done in 3 places, including where frames with opcode `WEBSOCKET_OPCODE_TEXT` or `WEBSOCKET_OPCODE_BINARY` are handled.
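As a rough illustration of such a check (the helper name is hypothetical; the real change lives in nchan's websocket frame handling), the limit can be read from nginx's core location configuration:

```c
#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

/* Hypothetical helper: returns NGX_OK if a websocket payload of frame_len
   bytes fits within the location's client_max_body_size, NGX_ERROR otherwise.
   A real handler would then close the connection instead of accepting the
   frame. */
static ngx_int_t
check_frame_length_sketch(ngx_http_request_t *r, uint64_t frame_len)
{
    ngx_http_core_loc_conf_t  *clcf;

    clcf = ngx_http_get_module_loc_conf(r, ngx_http_core_module);

    /* client_max_body_size == 0 means "unlimited" in nginx */
    if (clcf->client_max_body_size
        && frame_len > (uint64_t) clcf->client_max_body_size)
    {
        ngx_log_error(NGX_LOG_ERR, r->connection->log, 0,
                      "websocket message too large: %uL bytes", frame_len);
        return NGX_ERROR;
    }

    return NGX_OK;
}
```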
============================================================================
3. Fix memory corruption in is_utf8() when block being checked is >16KB

Blocks larger than 16KB get written to a temporary file. When checking such a block, an incorrect memory region was getting unmapped due to incorrect pointer handling. This manifested in occasional segfaults.
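To illustrate the class of bug (a generic sketch, not the actual nchan code): when a file-backed block is mapped for checking, `munmap()` must receive exactly the base pointer and length that `mmap()` returned, not a pointer that was advanced while scanning:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Generic sketch: scan a file-backed block of 'len' bytes without corrupting
   memory. 'fd' is an open descriptor for the temporary file. */
static int
check_file_block_sketch(int fd, size_t len)
{
    unsigned char *base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        return -1;
    }

    int ok = 1;
    for (unsigned char *p = base; p < base + len; p++) {
        /* placeholder for a real validity check (e.g. UTF-8 validation) */
        if (*p == 0xFF) { ok = 0; break; }
    }

    /* Unmap the ORIGINAL base pointer and length; passing the advanced 'p'
       (or a length derived from it) would unmap the wrong region. */
    munmap(base, len);
    return ok;
}
```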
============================================================================
4. Fix file-cached messages (over 16KB) not getting passed to upstream

When using `nchan_publisher_upstream_request`, requests over 16KB get stored to temporary files for processing. These were neither getting passed upstream nor deleted after processing, because `(ngx_buf_t)->memory` was getting set even if the buffer was file-mapped.

This would result in the following text in the nginx error log: `[alert] 27766#27766: *1 zero size buf in output`, the message not being passed upstream, temporary files remaining on disk forever, and nginx workers' virtual memory size growing since the file-mapped memory regions were not getting unmapped.

Fix:
The offending line was commented out. `(ngx_buf_t)->memory` is already set for buffers that are actually in memory, so I don't see why this flag was getting set manually here. Was there some reason?
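For context, a generic sketch of the nginx buffer semantics involved (not the nchan code): since `ngx_buf_in_memory()` is true when any of `temporary`, `memory`, or `mmap` is set, forcing `memory` on a file-backed buffer whose memory pointers are empty makes `ngx_buf_size()` evaluate to 0, which is what produces the "zero size buf in output" alert.

```c
#include <ngx_config.h>
#include <ngx_core.h>

/* Generic sketch: describe content that lives in a temporary file.
   Only the file-related fields and the in_file flag are set; the 'memory'
   flag is left alone so the output chain reads from the file. */
static void
setup_file_buf_sketch(ngx_buf_t *b, ngx_file_t *file, off_t file_size)
{
    ngx_memzero(b, sizeof(ngx_buf_t));

    b->in_file   = 1;
    b->file      = file;
    b->file_pos  = 0;
    b->file_last = file_size;
}
```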
============================================================================

Note
This is part of a series of patches to resolve issue #588.