
Fix various memory corruption and unwanted behavior scenarios #591

Open
wants to merge 14 commits into base: master
Conversation

@sobitcorp sobitcorp commented Nov 1, 2020

This PR fixes the following issues:

  1. Fix possible memory corruption when using upstream requests
  2. Respect client_max_body_size for incoming websocket messages
  3. Fix memory corruption in is_utf8() when block being checked is >16KB
  4. Fix file-cached messages (over 16KB) not getting passed to upstream
    -----(After 2020-11-14)-----
  5. Fix crash when using X-Accel-Redirect with message forwarding
    -----(After 2020-12-05)-----
  6. Fix crash when multi-id pubsub is closed by timeout
  7. Fix crash when multi-id pubsub is closed by HTTP DELETE
  8. Fix crash when forwarding used without proxy_pass

1. Fix possible memory corruption when using upstream requests

There are several possible configuration scenarios where using nchan_publisher_upstream_request can cause memory corruption and undefined behavior. This is due to sanity checks being performed (and wrongly succeeding) on a variable which is never assigned.

Example

Consider this stack trace (rendered as an image in the original comment):

At frame 6:
The ws_publish_data_t struct is allocated here without clearing memory (line 662).
Thus, (ws_publish_data_t)->subrequest refers to an undefined memory location (line 690).
It is meant to be assigned here after the subrequest setup is complete, and is not referenced until then.
This is normally the case, as the subrequest gets 'queued' and we go all the way back down to frame 2 before the subrequest is actually executed and finalized.

At frame 13:
However, if for whatever reason subrequest setup fails, nginx finalizes the subrequest early.
For instance, here the setup fails because the request is too large (line 989).

At frame 15:
Since the request is being finalized, the Nchan subrequest handler gets called, expecting the upstream subrequest to have been successfully queued and (ws_publish_data_t)->subrequest to have been assigned.
But we're still in the same event that will only assign (ws_publish_data_t)->subrequest once we get back down to frame 6.

At frame 16:
A sanity check assertion (line 575) can wrongly succeed here due to (ws_publish_data_t)->subrequest having an undefined value.
Thus, past this check we can get undefined behavior and memory corruption.

At frame 17:
We'll most likely segfault here from trying to dereference (ws_publish_data_t)->subrequest, but we can also corrupt memory and/or cause some other undefined behavior instead!

Fix

Clear (ws_publish_data_t)->subrequest right after creating the ws_publish_data_t structure, so that the various later sanity checks will correctly interrupt execution on error.

============================================================================

2. Respect client_max_body_size for incoming websocket messages

The client_max_body_size directive was not respected for incoming websocket messages. This causes potential DDoS problems and, in conjunction with nchan_publisher_upstream_request, a possible crash. (It also caused possible memory corruption, but that has been fixed above.)

If nchan_publisher_upstream_request is not being used, this is not a very serious issue, as the message length is limited by memory limits and the worst that can happen is a channel group's memory limit getting exhausted.

However, with nchan_publisher_upstream_request, this can be used as a potential DDoS vector and to crash the nginx server. The former is due to nginx writing temporary files for any messages exceeding 16KB. An attacker can use this to exhaust disk space on the host system, with no way to set limits of any kind, since the incoming websocket message is always accepted regardless of the active client_max_body_size setting.
Additionally, if the incoming message exceeds the client_max_body_size setting, upstream subrequest forwarding will fail, because at that point nginx rejects the subrequest. This causes Nchan to bring down the server worker with the error message nginx: /nchan-1.2.7/src/subscribers/websocket.c:575: websocket_publish_upstream_handler: Assertion 'd->subrequest' failed. (Without the other fix above, this can instead cause memory corruption and undefined behavior.)

Fix:

Check the configured client_max_body_size upon receiving a message for publishing, and close the socket if the message exceeds the active setting. This is currently done in 3 places:

  • Upon receiving any websocket packet with WEBSOCKET_OPCODE_TEXT or WEBSOCKET_OPCODE_BINARY
  • Upon joining the message fragments received so far
  • While inflating the final received message (to prevent 'decompression-bombs')

============================================================================

3. Fix memory corruption in is_utf8() when block being checked is >16KB

Blocks larger than 16KB get written to a temporary file. When checking such a block, an incorrect memory region was getting unmapped due to incorrect pointer handling. This manifested in occasional segfaults.

============================================================================

4. Fix file-cached messages (over 16KB) not getting passed to upstream

When using nchan_publisher_upstream_request, requests over 16KB get stored to temporary files for processing. These were neither getting passed upstream nor deleted after processing because (ngx_buf_t)->memory was getting set even if the buffer was file-mapped.
This would result in the text [alert] 27766#27766: *1 zero size buf in output appearing in the nginx error log, the message not being passed upstream, temporary files remaining on disk forever, and nginx workers' virtual memory size growing since the file-mapped memory regions were never unmapped.

Fix:

The offending line was commented out. (ngx_buf_t)->memory is already set for buffers that are actually in memory, so I don't see why this flag was getting set manually here. Was there some reason?

============================================================================

Note

This is part of a series of patches to resolve issue #588.


jmealo commented Nov 8, 2020

👍 LGTM

@sobitcorp (Author)

5. Fix crash when using X-Accel-Redirect with message forwarding

When using X-Accel-Redirect to abstract a websocket pub/sub endpoint from the end client, combined with upstream message forwarding via nchan_publisher_upstream_request, the nginx worker would very often segfault after receiving and trying to broadcast the 2nd message after channel creation. On the rare occasions when this didn't happen, it would often segfault on websocket disconnect instead. This was due to memory corruption.

This behavior could be mitigated by placing charset off directives in the nginx configuration file. However, this never completely eliminated the problem in my tests.

Here's roughly how this would occur:

First request:

  1. Soon after receiving a response from upstream (in ngx_http_upstream_process_header()), a charset_filter_module context gets allocated in ngx_http_main_request_charset() on the current memory pool (with ngx_pcalloc) and set for the request structure: ngx_http_set_ctx(r->main, ctx, ngx_http_charset_filter_module); in ngx_http_charset_filter_module.c, line 409.

  2. Later in the request's processing pipeline, ngx_http_charset_body_filter() (ngx_http_charset_filter_module.c, line 547) is called multiple times and uses the context set above.

  3. I assume the above-mentioned memory pool is then freed upon completion of the upstream request.

Second request:

  1. The request structure gets reused. For some reason its charset_filter_module context is still set to the value from step 1.

  2. ngx_http_main_request_charset() is called, but does nothing, because it sees the request's charset_filter_module context is already set.

  3. Somewhere along the way, the freed memory pointed to by the current request structure's charset_filter_module context gets overwritten.

  4. Later in the request's processing pipeline, ngx_http_charset_body_filter() is again called multiple times. It attempts to use the now-clobbered charset_filter_module context -- segfault!

Fix

Clear the current request structure's charset_filter_module context upon completion of the request (after completely receiving, storing, and broadcasting a message). There may be a better way to do this. (This will most likely not be an issue at all in Nchan2).

============================================================================

@sobitcorp (Author)

6-7. Fix crash when multi-id pubsub is closed by timeout or HTTP DELETE

When using multi-id pubsub endpoints, a variety of problems occurred upon the closing of the pubsub endpoint and/or its channels.

Example configuration:

nchan_pubsub websocket;
nchan_channel_id 1 2;
nchan_subscriber_timeout 1m;

Consequences:

  • Regardless of how the endpoint was closed:
    • Channel never gets deleted. Debug log says: MEMSTORE:00: not ready to reap m/ /1 /2 : status 2
  • If endpoint was closed by nchan_subscriber_timeout triggering or one of the channel IDs receiving an HTTP DELETE:
    • Channel never gets deleted.
    • Error log says: MEMSTORE:04: tried adding WAITING chanhead 00BDDB08 m/ /1 /2 to chanhead_gc. why?
    • Messages sometimes stop arriving. Reconnecting the websocket sometimes temporarily fixes this.
    • Eventually worker thread crashes with a segfault.

Fix

The line sub->dequeue_after_response = 1 appears to have caused closing procedures to happen out of order. Commenting it out completely fixes this behavior. Once the websocket is closed, the pub/sub endpoints are properly dequeued and deleted, as are the target channels (once no pub/subs are left).

============================================================================

8. Fix crash when forwarding used without proxy_pass

This directly addresses issue #588. It fixes the crash caused by this scenario, but does not enable its full functionality - message modification, in particular, doesn't work. Since there is an acceptable workaround, I will not attempt to fix this further - at least it's no longer possible to crash the worker thread with this scenario.

Fix

This simply populates the (ws_publish_data_t)->subrequest early to allow the sequence of events to complete on the same worker cycle without crashing. Basic upstream responses without message modification work fine.

============================================================================

This concludes my series of fixes for Nchan.

@sobitcorp sobitcorp changed the title Fix possible memory corruption with upstream requests Fix various memory corruption and unwanted behavior scenarios Dec 6, 2020
@slact (Owner) commented Dec 7, 2020

Let's get this merged indeed! There's a lot to regression-test here. I will report my findings once the testing is complete.

@slact (Owner) commented Dec 7, 2020

@sobitcorp please email me at leo (at) nchan.io; I'd like to talk to you a little more about this PR.

@aliqandil

Any news about this PR? This seems to fix all the issues I'm currently facing in my production environment!

@@ -1086,7 +1086,7 @@ static ngx_int_t websocket_perform_handshake(full_subscriber_t *fsub) {

 static void websocket_reading(ngx_http_request_t *r);

-static ngx_buf_t *websocket_inflate_message(full_subscriber_t *fsub, ngx_buf_t *msgbuf, ngx_pool_t *pool) {
+static ngx_buf_t *websocket_inflate_message(full_subscriber_t *fsub, ngx_buf_t *msgbuf, ngx_pool_t *pool, uint64_t max, int *result) {


The caller below expects *result to be initialized if the return value is NULL. However, there appear to be three (rare?) cases below where this does not happen.

Also a -1 magic number for "message too large" is a bit ugly, and it may be better to follow a pattern where the primary return value indicates whether/how the call is successful and the pointer is returned via indirection (ngx_buf_t **result).

@sobitcorp (Author)

@aliqandil slact is currently regression-testing this, afaik.

@jilles-sg Good catch, I fixed the uninitialized var. I also agree with your proposed change to the calling convention. I simply didn't want to make any major calling convention changes, as I'm not too deeply familiar with nchan internals, so I just tacked a result flag at the end as a quick fix.

@slact (Owner) commented Feb 25, 2021

Yup. The sub->dequeue_after_response = 1 and the charset_filter_module fixes aren't passing regression testing, and I haven't had time to come up with a fix. I'm leaning towards merging the other stuff in the meantime and leaving these two for later.

@@ -1091,6 +1091,8 @@ static ngx_buf_t *websocket_inflate_message(full_subscriber_t *fsub, ngx_buf_t *
 z_stream *strm;
 int rc;
 ngx_buf_t *outbuf;
+
+ *result = 0;


It seems *result is still uninitialized if NGX_ZLIB is not defined (perhaps that does not happen, but it still seems wrong to write code that will not work).

@sobitcorp (Author)

oops... fixed


sg-jorrit commented Jun 4, 2021

Just checking in to see if there are any updates on this pull request. We are still seeing lots of segfaults and general protection faults on our instances, and it is beginning to become a problem under load. We think at least some of these segfaults are related to bugs this pull request is aiming to fix.

@slact I know you are busy with the new nchan (which is great, really appreciate your work!), but is there any chance these fixes (or at least the ones that passed regression testing) could be merged in an upcoming release, along with the recent commit fixing the segfault during multi-publishing? It would greatly help us maintain stability, at least until the new nchan is released (as we do not really have an alternative currently).

If we can help in any way with this testing or debugging, please let me know as well. Happy to help in any way possible.


winrid commented Oct 20, 2021

We are also very much looking forward to this, after seeing nchan come to a halt on a traffic spike and not recover (the service was still running, but wouldn't accept new connections even though CPU was back down to 0).

Successfully merging this pull request may close these issues.

nchan_publisher_upstream_request without proxy_pass = crashes
7 participants