
[Feature request] Please add synchronous access to http2 write #2899

Open
re-gor opened this issue Feb 5, 2025 · 6 comments

@re-gor

re-gor commented Feb 5, 2025

Hi! Before I start asking, I have to say that it was my pleasure to read through the code of the grpc-node library. Splendid work, thank you!

Is your feature request related to a problem? Please describe.

tl;dr: We noticed that a synchronous API is significantly faster in our scenario.

We have a grpc-js server with a single BiDi stream method:

message Request {
    oneof payload {
        HttpRequest http_request    = 1; // plain http request Api Gateway got from user
        BackendA backend_a_response = 2;
        BackendB backend_b_response = 3;
        // ... other backends
    }
}

message Response {
    optional Metadata metadata = 1; // Headers
    optional string content    = 2; // Response body. HTML
}

service Frontend {
    rpc render (stream Request) returns (stream Response) {}
}

Our Node.js backend receives chunks of data from the API Gateway. Once it receives a chunk, it starts a template rendering operation.
Once a chunk's rendering is done, the server writes it into the grpc-js response stream. There is a sequence diagram of the process, if anyone is interested.

We noticed that this server loses a lot of client-side speed to our old plain HTTP/1.1 server in an A/B experiment.
I dug around for a week or two and concluded that there is only one significant difference: grpc-js uses streams as a proxy to http2.write, instead of the good old Response.write that our old server uses.

Streams are asynchronous by nature. Once you call write, the actual writing happens on the next tick, as far as I understand. So the writing operation ends up separated from its corresponding rendering operation. Roughly, something like this happens:

[Diagram: event loop tasks]

There are a couple of new operations before the write which were not present before. In our case there are about 700 chunks and 100 rendering operations (one rendering operation can produce several chunks). This introduces a delay: about 280ms at the 75th percentile (p75) and 211ms at p50 for Largest Contentful Paint (LCP). Our old website's LCP is about 1000ms (p50) to 1500ms (p75), so losing 20% of speed is a lot. While LCP is generally a good metric for websites, we prefer to rely on Hero Element Rendering: we track the arrival time of the most important element. According to this metric, grpc-js is losing about 370ms at p75 and 300ms at p50. Again, it is about a 20% degradation =(

Describe alternatives you've considered

I've tried a lot of things over the last couple of weeks. First of all, using CPU profiles and code listings, I proved to myself and my colleagues that there is no reason for the degradation other than the one I described above.

The second thing I tried was chunk compression using zlib and brotli. It was a failed effort. First of all, zlib does CPU-intensive work which delays the moment the chunk is ready. More importantly, zlib uses streams too, so I got even more nextTicks, and the result was actually worse than without compression. So we went back to compression on the load balancer.

The last effort was successful, but it is a nasty one: I started using grpc-js private methods. The most important one is sendMessage (https://github.com/grpc/grpc-node/blob/master/packages/grpc-js/src/server-interceptors.ts#L837), which synchronously calls this.stream.write, which is actually http2Response.write.

write(data: ResponseType, _: string = 'utf-8', callback: (error: null | undefined | Error) => void = () => {}) {
    try {
        this._sendMetaData();

        // NASTY ONE!
        this._call.call.sendMessage(data, callback as () => void);

        return true;
    } catch (error: any) {
        // ANOTHER ONE. Actually it is a copy-paste from ServerDuplexStreamImpl
        this.pendingStatus = serverErrorToStatus(error);
        this.end();

        return false;
    }
}

And this effort made a significant difference!

We deployed this version to production yesterday (about 21:00 our local time).
[Chart: LCP]
[Chart: Hero element velocity]

As one can see, we gain about 200-300ms on each metric (LCP and Hero Element). Now we are losing only about 50ms against the plain HTTP version, which is OK. It is not that much, and we accept that packing gRPC messages is not free. Also, we now have one more proxy in front of our server.

Describe the solution you'd like

In general, I want a synchronous way of using the http2 response. For example, it could be a new writeSync method on the ServerDuplexStreamImpl<RequestType, ResponseType> class.
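
To make the request concrete, here is a rough sketch of the surface I have in mind (the writeSync name and signature are my own illustration, not an existing grpc-js API):

// Hypothetical addition to the server-side duplex stream surface (illustration only)
interface ServerDuplexStreamLike<RequestType, ResponseType> {
    // existing asynchronous write, backed by the Writable machinery
    write(message: ResponseType, callback?: (error?: Error | null) => void): boolean;

    // requested: hand the message to the underlying http2 stream synchronously,
    // bypassing the Writable buffering described above
    writeSync(message: ResponseType, callback?: (error?: Error | null) => void): boolean;
}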

Additional context

¯\_(ツ)_/¯
Thanks for reading all of this! I'm ready for your questions.

@murgatroid99
Member

Streams are not inherently asynchronous in the way you suggest: Writable#write calls _write, _write calls writeOrBuffer, and writeOrBuffer calls Writable#_write, all synchronously. My guess is that the actual source of the latency you are observing is from the stream waiting for the callback from sendMessage (which waits for the callback from the underlying stream.write, which probably waits for it to actually go out on the network) before calling sendMessage again. Your modified write method doesn't have that behavior, which is why it would be faster.

Unfortunately, your strategy for changing this wouldn't work, because ServerDuplexStreamImpl#call isn't always a BaseServerInterceptingCall; if there are interceptors then there will be one or more ServerInterceptingCall in between, and the current implementation of that class relies on the invariant that there is never more than one message in flight to ensure message ordering is preserved. Allowing synchronous access to sendMessage lets users violate that invariant, potentially causing messages to be sent out of order. In addition, a separate writeSync would allow reordering relative to messages passed to write.

It might be enough to call the callback early in BaseServerInterceptingCall#sendMessage. That way messages will be buffered by the underlying stream, and can probably be grouped together.
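
Roughly, the idea is something like this (a sketch of the concept only, not the actual BaseServerInterceptingCall code; the function name is a placeholder):

// Instead of waiting for the http2 write to complete before signalling the caller,
// signal as soon as the framed message has been handed to the stream, so the
// Writable on top keeps feeding messages and http2 can coalesce them.
import type { ServerHttp2Stream } from 'node:http2';

function sendFramedMessageEagerly(
    stream: ServerHttp2Stream,
    framedMessage: Buffer,
    callback: () => void
): void {
    stream.write(framedMessage); // may buffer internally; backpressure is ignored here
    callback();                  // "early" callback: do not wait for the write to flush
}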

@murgatroid99
Member

Do you have a self-contained benchmark that demonstrates the problem you are seeing that you can share here?

@re-gor
Author

re-gor commented Feb 6, 2025

Streams are not inherently asynchronous in the way you suggest: Writable#write calls _write, _write calls writeOrBuffer, and writeOrBuffer calls Writable#_write, all synchronously.

Indeed, I was confused by this line of the writeOrBuffer method. Having taken a closer look at it, I see that it is not the case. There has to be a write already in progress for that clause to be true. And as far as I can see, all the nextTicks are related to callbacks (on write, error, finish...).

My guess is that the actual source of the latency you are observing is from the stream waiting for the callback from sendMessage (which waits for the callback from the underlying stream.write, which probably waits for it to actually go out on the network) before calling sendMessage again.

We don't pass any callbacks to the write method... we just call write(chunk) any time a chunk is ready. But we do have a lot of awaits in the render process in general. The pseudo-code looks like this:

    async function writeWidget(widget) {
        // a chunk can be ready: actual HTML
        // a chunk can be a delimiter: there is a nested widget to render
        const chunks = await widget.chunks;

        const writeChunk = async (chunk: WidgetChunk) => {
            if (chunk.ready) {
                httpStream.write(widget, chunk);
                return;
            }

            const {slotName} = chunk;

            // there can be nested widgets inside the one we are processing
            await writeWidget(widget.slots[slotName]);
        };

        for (const chunk of chunks) {
            await writeChunk(chunk);
        }
    }

ServerDuplexStreamImpl#call isn't always a BaseServerInterceptingCall; if there are interceptors then there will be one or more ServerInterceptingCall in between, and the current implementation of that class relies on the invariant that there is never more than one message in flight to ensure message ordering is preserved.

Yes, of course. I can guarantee that invariant in my particular application, but my solution is not good enough for the library.

Allowing synchronous access to sendMessage lets users violate that invariant, potentially causing messages to be sent out of order

I don't get why. I am not sure there is a difference between having sendMessage wrapped in its own stream and accessing it directly. There is an http2 stream underneath sendMessage, so once you have called it, the message is queued inside the http2 stream. The only way messages can get mixed up is because of client code.

@re-gor
Author

re-gor commented Feb 6, 2025

Do you have a self-contained benchmark that demonstrates the problem you are seeing that you can share here?

Ugh, it's taking a little more time than I originally imagined. I was hoping I could do it today, but no =( Still working on it.
I will bring it tomorrow.

@murgatroid99
Member

Allowing synchronous access to sendMessage lets users violate that invariant, potentially causing messages to be sent out of order

I can't get why. I am not sure if there is a difference between having sendMessage wrapped with its own stream or accessing it directly. There is http2 stream lying under sendMessage - so once you called it, message is queued inside http2 stream. The only way messages can be mixed it is because of client code.

As I said, this is about the case where there are interceptors. Interceptors can process messages asynchronously, and they are not required to process different messages consistently. So if an interceptor gets one message and starts some slow asynchronous operation with it, then gets another message and passes it along immediately, that will reorder the messages.
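
A contrived sketch of that failure mode (plain TypeScript, nothing grpc-specific; the interceptor here just delays some messages):

// Illustration of reordering when an interceptor processes messages asynchronously
// and the caller does not wait for one message to be passed along before sending the next.
const sent: number[] = [];
const passAlong = (message: { id: number }) => sent.push(message.id);

async function interceptorSendMessage(
    message: { id: number; slow: boolean },
    next: (message: { id: number }) => void
): Promise<void> {
    if (message.slow) {
        // e.g. the interceptor compresses or logs this particular message
        await new Promise((resolve) => setTimeout(resolve, 10));
    }
    next(message);
}

// With synchronous access, the caller fires both sends back to back:
interceptorSendMessage({ id: 1, slow: true }, passAlong);  // delayed inside the interceptor
interceptorSendMessage({ id: 2, slow: false }, passAlong); // passed along immediately
// Once both settle, `sent` is [2, 1]: message 2 overtook message 1.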

@re-gor
Author

re-gor commented Feb 7, 2025

@murgatroid99 Thank you for your patience

As I said, this is about the case where there are interceptors. Interceptors can process messages asynchronously, and they are not required to process different messages consistently. So if an interceptor gets one message and starts some slow asynchronous operation with it, then gets another message and passes it along immediately, that will reorder the messages.

I guess I get your point. I would suggest that an interceptor should have a similar writeSync method. One should not call write and writeSync alternately, just as with fs.writeSync and fs.write.

Do you have a self-contained benchmark that demonstrates the problem you are seeing that you can share here?

I prepared an example: https://github.com/re-gor/grpc-js-issue. It was kind of tricky. The whole point was to exploit the line I mentioned before to make the writeOrBuffer method actually buffer. It requires a lot of chunks: more than 16, so that the buffer drains in an asynchronous callback.

Long story short, I wrote a simple client and a simple server. The server produces a bunch of chunks; at some point they get buffered, and at that point you get a delay. This is exactly the behaviour we observed in our application. We were losing a little on the TTFB (Time To First Byte) metric, but a lot on Hero Element and LCP.
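
For anyone who wants a feel for the buffering behaviour without cloning the repo, here is a minimal standalone sketch (not taken from the benchmark; it just uses a plain object-mode Writable with Node's default highWaterMark of 16) showing how a synchronous burst of writes ends up buffered and drained on later ticks:

// Only the first chunk reaches _write synchronously; while its callback is pending,
// the remaining chunks are buffered, and write() starts returning false once the
// buffer grows past the highWaterMark.
import { Writable } from 'node:stream';

const sink = new Writable({
    objectMode: true, // default objectMode highWaterMark is 16
    write(chunk, _encoding, callback) {
        console.log('flushed', chunk);
        setImmediate(callback); // emulate an asynchronous underlying write
    },
});

for (let i = 0; i < 20; i++) {
    const ok = sink.write({ i });
    if (!ok) console.log('buffered past highWaterMark at chunk', i);
}
console.log('synchronous burst finished; buffered chunks drain on later ticks');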

And today we got some actual A/B results for my nasty patch. One can see that after the release of the "patch", our gRPC experiment shows better speed.

[Images: A/B experiment results]
