Consecutive HTTP requests cause socket hang up on node v20 #5765

nflaig · 2023-07-17T11:46:45Z

Describe the bug

Consecutive HTTP requests cause socket hang up on node v20. This only happens with some servers, so far I could only reproduce this with Lighthouse. This issue causes the sim tests to fail and might cause other issues, e.g. when running Lodestar VC with a Lighthouse BN.

Expected behavior

No socket hang up / requests failures

Steps to reproduce

Running the following script will cause the first request to pass but the second one fails with socket hang up error

import { getClient } from "@lodestar/api";
import { config } from "@lodestar/config/default";

const lighthouse = "http://localhost:5052"
const api = getClient({ baseUrl: lighthouse }, { config });

await api.beacon.getBlockV2("head");
await api.beacon.getBlockV2("head");

> node index.js 

/home/nico/projects/test/debug-getStatev2/node_modules/node-fetch/lib/index.js:1505
                        reject(new FetchError(`request to ${request.url} failed, reason: ${err.message}`, 'system', err));
                               ^
FetchError: request to http://localhost:5052/eth/v2/beacon/blocks/head failed, reason: socket hang up
    at ClientRequest.<anonymous> (/home/nico/projects/test/debug-getStatev2/node_modules/node-fetch/lib/index.js:1505:11)
    at ClientRequest.emit (node:events:512:28)
    at Socket.socketOnEnd (node:_http_client:519:9)
    at Socket.emit (node:events:524:35)
    at endReadableNT (node:internal/streams/readable:1378:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  type: 'system',
  errno: 'ECONNRESET',
  code: 'ECONNRESET'
}

Node.js v20.4.0

Yielding to macro queue in-between the two requests does not produce an error

// ...
await api.beacon.getBlockV2("head");
setTimeout(async () => await api.beacon.getBlockV2("head"), 0);

Additional context

The issue is already reported upstream

There are also some suggested workarounds but those can't easily applied in our case since we are using cross-fetch and require browser compatiblity.

Operating system

Linux

Lodestar version or commit hash

6e01421

The text was updated successfully, but these errors were encountered:

nflaig · 2023-07-18T14:49:32Z

Would be also interesting to find out why this only happens with Lighthouse as all other CLs don't have this issue. Maybe something related to set headers / keep-alive / connection header.

nflaig · 2023-07-20T09:52:17Z

Maybe something related to set headers / keep-alive / connection header

Does not seem to be related to HTTP response headers

Lodestar and Nimbus set Connection: keep-alive header
Lighthouse, Prysm, and Teku do not set the header

socket hang up only happens with Lighthouse

matthewkeil · 2023-07-24T07:37:25Z

After digging a bit this is related to how the socket is being handled by the two sides below the http layer. Why only Lighthouse is sending the RESET and the other clients do not I am not sure but it is a function of how they default handle half-open sockets. Similar to what was mentioned here in reference to the various implementations within node itself. I took a peak at the rust net and http crates but after a short dig I realized that it would not be a great use of time to try and figure out how they handle post FIN packages differently than the Nim and Go packages.

As a tl/dr; for the short term it appears that using an Agent with the keepAlive set will prevent node-fetch from including the Connection: close header and will keep the server side from sending the RESET. I have not tested this yet to confirm but @dhedey mentioned it works in his PR to node-fetch

That PR's in node-fetch/node-fetch#1736 is approved and waiting for merge as of Apr 12. It defaults to the node agent's keep-alive setting which will keep the socket open for reuse without having to pass an agent to skip a conditional check that is removed. In the mean time the workaround for passing a keep-alive agent suggested in that PR should work.

The core issue though is how the sockets are being recycled in node though.

After reading the other thread linked in "additional context" I started my spelunking in the node repo.

mcolina mentioned in this comment that he agrees with benbnoordhuis how its possible to pop an invalid socket which may resolve part of the issue. Adjusting that may help the situation and could provide a performance boost commensurate with this comment.

Although if that was the only issue it would not explain why 100% of the time an error gets thrown as there should be some percentage of the sockets that get shifted from the checked side of the array under loads.

@DevasiaThomas mentioned just below that he is working on a fix as of Apr 21. I tend to agree with @DevasiaThomas though that it will take more than just updating that while loop as there are a few places that emitter callbacks are required for resolution which is why we are seeing no error with a setTimeout(0).

I have pinged him here to see if he has made headway and would like help collaborating on the solution. I will be happy to help as this issue causing issues with us upgrading our project to node 20.

I dropped some hints to myself and others below as I dig. I will update the below as I go with how to resolve the issue.

Some code notes for potential resolution:

There is a note in Socket.prototype._destroy about error event being handled in next tick que:
https://github.com/nodejs/node/blob/4166d40d0873b6d8a0c7291872c8d20dc680b1d7/lib/net.js#L781

I found this line in the socket that appears to be getting called after the write side of the client socket is closed by the server and the socket is written to again. Reference onReadableStreamEnd that throws the error during the microtask que with process.nextTick.

nflaig · 2023-07-25T19:02:31Z

node-fetch/node-fetch#1736 seems to fix the issue, if sim tests are also passing on node 20 I think we are good to go

dhedey · 2023-07-27T11:11:08Z

@matthewkeil - just saw your above post.

With respect to this:

Although if that was the only issue it would not explain why 100% of the time an error gets thrown as there should be some percentage of the sockets that get shifted from the checked side of the array under loads.

I don't know if you saw some of musings on this in the post here node-fetch/node-fetch#1735 - but yes, I came to the same conclusion as you (from my investigations and code digging) - the issue I believe is that packets can still be sent to the socket before it is cleared up in a following microtask. Any packets that get sent trigger an ECONNRESET

I'm not sure if nodejs/node#47130 is really moving ahead sadly. Perhaps it could do with another pair of eyes, but I don't have the time right now sadly.

nflaig mentioned this issue Jul 17, 2023

chore!: use node v20 throughout monorepo #5730

Merged

nflaig added the meta-bug Issues that identify a bug and require a fix. label Jul 17, 2023

philknows modified the milestone: v1.10.0 Jul 18, 2023

philknows added the prio-high Resolve issues as soon as possible. label Jul 25, 2023

philknows added this to the v1.10.0 milestone Jul 25, 2023

wemeetagain mentioned this issue Jul 26, 2023

chore: downgrade cross-fetch to v3 #5806

Merged

dhedey mentioned this issue Jul 27, 2023

ECONNRESET while doing http 1.1 keep alive requests and server closes the connections nodejs/node#47130

Closed

nflaig mentioned this issue Jul 27, 2023

refactor: switch to native fetch implementation #5811

Merged

philknows modified the milestones: v1.10.0, v1.11.0 Jul 31, 2023

nflaig closed this as completed in #5811 Jul 31, 2023

michaelsproul mentioned this issue Aug 5, 2023

Improve handling of HTTP connection closure sigp/lighthouse#4439

Open

nflaig modified the milestones: v1.11.0, v1.10.0 Aug 5, 2023

IbrahimTanyalcin mentioned this issue Oct 9, 2024

Socket Hang Up / ECONNRESET / ECONNREFUSED on consecutive http requests (>= 100) and temporary FIX nodejs/node#55330

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consecutive HTTP requests cause socket hang up on node v20 #5765

Consecutive HTTP requests cause socket hang up on node v20 #5765

nflaig commented Jul 17, 2023 •

edited

Loading

nflaig commented Jul 18, 2023 •

edited

Loading

nflaig commented Jul 20, 2023

matthewkeil commented Jul 24, 2023

nflaig commented Jul 25, 2023

dhedey commented Jul 27, 2023 •

edited

Loading

Consecutive HTTP requests cause socket hang up on node v20 #5765

Consecutive HTTP requests cause socket hang up on node v20 #5765

Comments

nflaig commented Jul 17, 2023 • edited Loading

Describe the bug

Expected behavior

Steps to reproduce

Additional context

Operating system

Lodestar version or commit hash

nflaig commented Jul 18, 2023 • edited Loading

nflaig commented Jul 20, 2023

matthewkeil commented Jul 24, 2023

Some code notes for potential resolution:

nflaig commented Jul 25, 2023

dhedey commented Jul 27, 2023 • edited Loading

nflaig commented Jul 17, 2023 •

edited

Loading

nflaig commented Jul 18, 2023 •

edited

Loading

dhedey commented Jul 27, 2023 •

edited

Loading