
Fix various memory corruption and unwanted behavior scenarios #591

Open
wants to merge 14 commits into base: master
Conversation

@sobitcorp sobitcorp commented Nov 1, 2020

This PR fixes the following issues:

  1. Fix possible memory corruption when using upstream requests
  2. Respect client_max_body_size for incoming websocket messages
  3. Fix memory corruption in is_utf8() when block being checked is >16KB
  4. Fix file-cached messages (over 16KB) not getting passed to upstream
    -----(After 2020-11-14)-----
  5. Fix crash when using X-Accel-Redirect with message forwarding
    -----(After 2020-12-05)-----
  6. Fix crash when multi-id pubsub is closed by timeout
  7. Fix crash when multi-id pubsub is closed by HTTP DELETE
  8. Fix crash when forwarding used without proxy_pass

1. Fix possible memory corruption when using upstream requests

There are several possible configuration scenarios where using nchan_publisher_upstream_request can cause memory corruption and undefined behavior. This is due to sanity checks being performed (and wrongly succeeding) on a variable which is never assigned.

Example

Consider this stack trace (rendered as an image in the original comment):

At frame 6:
The ws_publish_data_t struct is allocated here without clearing memory (line 662).
Thus, (ws_publish_data_t)->subrequest refers to an undefined memory location (line 690).
It is meant to be assigned here after the subrequest setup is complete, and is not referenced until then.
This is normally the case, as the subrequest gets 'queued' and we go all the way back down to frame 2 before the subrequest is actually executed and finalized.

At frame 13:
However, if for whatever reason subrequest setup fails, nginx finalizes the subrequest early.
For instance, here the setup fails because the request is too large (line 989).

At frame 15:
Since the request is being finalized, the Nchan subrequest handler gets called, expecting the upstream subrequest to have been successfully queued and (ws_publish_data_t)->subrequest to have been assigned.
But we're still in the same event that will only assign (ws_publish_data_t)->subrequest once we get back down to frame 6.

At frame 16:
A sanity check assertion (line 575) can wrongly succeed here due to (ws_publish_data_t)->subrequest having an undefined value.
Thus, past this check we can get undefined behavior and memory corruption.

At frame 17:
We'll most likely segfault here from trying to dereference (ws_publish_data_t)->subrequest, but we can also corrupt memory and/or cause some other undefined behavior instead!

Fix

Clear (ws_publish_data_t)->subrequest right after creating the ws_publish_data_t structure, so that the various later sanity checks will correctly interrupt execution on error.

============================================================================

2. Respect client_max_body_size for incoming websocket messages

The client_max_body_size directive was not respected for incoming websocket messages. This causes potential DDoS problems and, in conjunction with nchan_publisher_upstream_request, a possible crash. (It also caused possible memory corruption, but that has been fixed above.)

If nchan_publisher_upstream_request is not being used, this is not a very serious issue, as the message length is limited by memory limits and the worst that can happen is a channel group's memory limit getting exhausted.

However, with nchan_publisher_upstream_request, this can be used as a potential DDoS vector and to crash the nginx server. The former is due to nginx writing temporary files for any messages exceeding 16KB. An attacker can use this to exhaust disk space on the host system, with no way to set limits of any kind, since the incoming websocket message is always accepted regardless of the active client_max_body_size setting.
Additionally, if the incoming message exceeds the client_max_body_size setting, upstream subrequest forwarding will fail, because at that point nginx rejects the subrequest. This causes Nchan to bring down the server worker with the error message nginx: /nchan-1.2.7/src/subscribers/websocket.c:575: websocket_publish_upstream_handler: Assertion 'd->subrequest' failed. (Without the other fix above, this can instead cause memory corruption and undefined behavior.)

Fix:

Check the configured client_max_body_size upon receiving a message for publishing, and close the socket if the message exceeds the active setting. This is currently done in 3 places:

  • Upon receiving any websocket packet with WEBSOCKET_OPCODE_TEXT or WEBSOCKET_OPCODE_BINARY
  • Upon joining the message fragments received so far
  • While inflating the final received message (to prevent 'decompression-bombs')

============================================================================

3. Fix memory corruption in is_utf8() when block being checked is >16KB

Blocks larger than 16KB get written to a temporary file. When checking such a block, an incorrect memory region was getting unmapped due to incorrect pointer handling. This manifested in occasional segfaults.

============================================================================

4. Fix file-cached messages (over 16KB) not getting passed to upstream

When using nchan_publisher_upstream_request, requests over 16KB get stored to temporary files for processing. These were neither getting passed upstream nor deleted after processing because (ngx_buf_t)->memory was getting set even if the buffer was file-mapped.
This would result in the text [alert] 27766#27766: *1 zero size buf in output appearing in the nginx error log, the message not being passed upstream, temporary files remaining on disk forever, and nginx workers' virtual memory size growing since the file-mapped memory regions were never unmapped.

Fix:

The offending line was commented out. (ngx_buf_t)->memory is already set for buffers that are actually in memory, so I don't see why this flag was getting set manually here. Was there some reason?

============================================================================

Note

This is part of a series of patches to resolve issue #588.


jmealo commented Nov 8, 2020

👍 LGTM

@sobitcorp (Author)

5. Fix crash when using X-Accel-Redirect with message forwarding

When using X-Accel-Redirect to abstract a websocket pub/sub endpoint from the end client, combined with upstream message forwarding via nchan_publisher_upstream_request, the nginx worker would very often segfault after receiving and trying to broadcast the 2nd message after channel creation. On the rare occasions when this didn't happen, it would often segfault on websocket disconnect instead. This was due to memory corruption.

This behavior could be mitigated by placing charset off directives in the nginx configuration file. However, this never completely eliminated the problem in my tests.

Here's roughly how this would occur:

First request:

  1. Soon after receiving a response from upstream (in ngx_http_upstream_process_header()), a charset_filter_module context gets allocated in ngx_http_main_request_charset() on the current memory pool (with ngx_pcalloc) and set for the request structure: ngx_http_set_ctx(r->main, ctx, ngx_http_charset_filter_module); in ngx_http_charset_filter_module.c, line 409.

  2. Later in the request's processing pipeline, ngx_http_charset_body_filter() (ngx_http_charset_filter_module.c, line 547) is called multiple times and uses the context set above.

  3. I assume the above-mentioned memory pool is then freed upon completion of the upstream request.

Second request:

  1. The request structure gets reused. For some reason its charset_filter_module context is still set to the value from step 1.

  2. ngx_http_main_request_charset() is called, but does nothing, because it sees the request's charset_filter_module context is already set.

  3. Somewhere along the way, the freed memory pointed to by the current request structure's charset_filter_module context gets overwritten.

  4. Later in the request's processing pipeline, ngx_http_charset_body_filter() is again called multiple times. It attempts to use the now-clobbered charset_filter_module context -- segfault!

Fix

Clear the current request structure's charset_filter_module context upon completion of the request (after completely receiving, storing, and broadcasting a message). There may be a better way to do this. (This will most likely not be an issue at all in Nchan2).

============================================================================

@sobitcorp (Author)

6-7. Fix crash when multi-id pubsub is closed by timeout or HTTP DELETE

When using multi-id pubsub endpoints, a variety of problems occurred upon the closing of the pubsub endpoint and/or its channels.

Example configuration:

nchan_pubsub websocket;
nchan_channel_id 1 2;
nchan_subscriber_timeout 1m;

Consequences:

  • Regardless of how the endpoint was closed:
    • Channel never gets deleted. Debug log says: MEMSTORE:00: not ready to reap m/ /1 /2 : status 2
  • If endpoint was closed by nchan_subscriber_timeout triggering or one of the channel IDs receiving an HTTP DELETE:
    • Channel never gets deleted.
    • Error log says: MEMSTORE:04: tried adding WAITING chanhead 00BDDB08 m/ /1 /2 to chanhead_gc. why?
    • Messages sometimes stop arriving. Reconnecting the websocket sometimes temporarily fixes this.
    • Eventually worker thread crashes with a segfault.

Fix

The line sub->dequeue_after_response = 1 appears to have caused closing procedures to happen out of order. Commenting it out completely fixes this behavior. Once the websocket is closed, the pub/sub endpoints are properly dequeued and deleted, as are the target channels (once no pub/subs are left).

============================================================================

8. Fix crash when forwarding used without proxy_pass

This directly addresses issue #588. It fixes the crash caused by this scenario, but does not enable its full functionality - message modification, in particular, doesn't work. Since there is an acceptable workaround, I will not attempt to fix this further - at least it's no longer possible to crash the worker thread with this scenario.

Fix

This simply populates the (ws_publish_data_t)->subrequest early to allow the sequence of events to complete on the same worker cycle without crashing. Basic upstream responses without message modification work fine.

============================================================================

This concludes my series of fixes for Nchan.

@sobitcorp sobitcorp changed the title Fix possible memory corruption with upstream requests Fix various memory corruption and unwanted behavior scenarios Dec 6, 2020
@slact (Owner) commented Dec 7, 2020

Let's get this merged indeed! There's a lot to regression-test here. I will report my findings once the testing is complete.

@slact (Owner) commented Dec 7, 2020

@sobitcorp please email me at leo (at) nchan.io; I'd like to talk to you a little more about this PR.

@aliqandil

Any news about this PR? This seems to fix all the issues I'm currently facing in my production environment!

@@ -1086,7 +1086,7 @@ static ngx_int_t websocket_perform_handshake(full_subscriber_t *fsub) {

 static void websocket_reading(ngx_http_request_t *r);

-static ngx_buf_t *websocket_inflate_message(full_subscriber_t *fsub, ngx_buf_t *msgbuf, ngx_pool_t *pool) {
+static ngx_buf_t *websocket_inflate_message(full_subscriber_t *fsub, ngx_buf_t *msgbuf, ngx_pool_t *pool, uint64_t max, int *result) {


The caller below expects *result to be initialized if the return value is NULL. However, there appear to be three (rare?) cases below where this does not happen.

Also a -1 magic number for "message too large" is a bit ugly, and it may be better to follow a pattern where the primary return value indicates whether/how the call is successful and the pointer is returned via indirection (ngx_buf_t **result).

@sobitcorp (Author)

@aliqandil slact is currently regression-testing this, afaik.

@jilles-sg Good catch, I fixed the uninitialized var. I also agree with your proposed change to the calling convention. I simply didn't want to make any major calling convention changes, as I'm not too deeply familiar with nchan internals, so I just tacked a result flag at the end as a quick fix.

@slact (Owner) commented Feb 25, 2021

Yup. The sub->dequeue_after_response = 1 and the charset_filter_module fixes aren't passing regression testing, and I haven't had time to come up with a fix. I'm leaning towards merging the other stuff in the meantime and leaving these two for later.

@@ -1091,6 +1091,8 @@ static ngx_buf_t *websocket_inflate_message(full_subscriber_t *fsub, ngx_buf_t *
 z_stream *strm;
 int rc;
 ngx_buf_t *outbuf;
+
+ *result = 0;


It seems *result is still uninitialized if NGX_ZLIB is not defined (perhaps that does not happen, but it still seems wrong to write code that will not work).

@sobitcorp (Author)

oops... fixed


sg-jorrit commented Jun 4, 2021

Just checking in to see if there are any updates on this pull request. We are still seeing lots of segfaults and general protection faults on our instances, and it is beginning to become a problem under load. We think at least some of these segfaults are related to bugs this pull request is aiming to fix.

@slact I know you are busy with the new nchan (which is great, really appreciate your work!), but is there any chance these fixes (or at least the ones that passed regression testing) could be merged in an upcoming release, along with the recent commit fixing the segfault during multi-publishing? It would greatly help us maintain stability, at least until the new nchan is released (as we do not really have an alternative currently).

If we can help in any way with this testing or debugging, please let me know as well. Happy to help in any way possible.


winrid commented Oct 20, 2021

We are also very much looking forward to this, after seeing nchan come to a halt on a traffic spike and not recover (the service was still running, but wouldn't accept new connections even though CPU was back down to 0).

Successfully merging this pull request may close these issues.

nchan_publisher_upstream_request without proxy_pass = crashes
7 participants