-
Notifications
You must be signed in to change notification settings - Fork 257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Undici removeEventListener issue during outage #3099
Comments
I was able to collect a little bit more information here and was able to recreate the issue locally. I created a fresh Apollo Gateway using minimal dependencies above. I created a test server that returns the supergraph sdl successfully on first call, then times out on subsequent calls.
My Apollo gateway starts up successfully on first call, but on subsequent calls it breaks on timeout.
When I remove undici the server starts up correctly, and on subsequent calls to the endpoint it times out and does not crash
I think it might be fair to say that Undici is broken as a fetcher for Apollo Gateway due to the removeEventListener missing. If the uplink begins to timeout for whatever reason it will crash the gateway if Undici is used as a fetcher. |
Yep, I'm seeing a similar issue. Our setup is a bit more complex though -- node20, AWS lambdas, supergraph behind an AWS API gateway. If the lambda is kept alive via 3-minutely requests, after usually 15 minutes the AWS APIG starts returning a 502 as the supergraph crashes. It crashes on the usage reporting up to apollo, so I removed that. Now it crashes on the same function here, loadSupergraphSdlFromStorage. I thought it had to do with the lambda filesystem being ephemeral and being kept hot for an extended priod, given it's reading whatever from presumably local storage, I guess it might not be. The weird part about my version of this error is that the default timeout of 30 seconds is not being hit. AWS cloudwatch logs show the failure happening milliseconds after the request comes in. This is happening in our production instance, and has been for quite some time -- a coworker suspects it has to do with whatever got sunset on or around June 27th, but our logs don't go back that far. Since node-abort-controller is unnecessary after |
oop, I wrote this in response to a post that got deleted --> As it stands we're explicitly passing in the node native In the meantime I altered the apollo code in the node_modules directory to just use the native AbortController() and ran my 3-minute poll for 5 hours without failures. So, there's that. Gotta do some more testing. |
@squirrelchew I think you might have some luck explicitly passing in node-fetch 2.7.0, but I ran into this issue with undici not native node fetch. |
Hmm, I was under the impression node18+ uses undici under the hood. Anways, my lambda timeouts were a bit too short. Bumped those up and errors are being exposed a little better. With node-fetch as the fetcher for the server + gateway, apollo's code untouched so it uses node-abort-controller, the supergraph straight up fails to start:
going back to passing in node's native fetch works at first This really points me at but drilling through what I can find here they do ultimately appear to be the same, so I'm at a loss there. This big scary comment block seems to have some level of relevance, too. I'm unconvinced the fetcher itself is the problem in our case here, outside of using anything but node's native testing so far:
Next up is to re-remove the node-abort-controller stuff from Apollo and test the various fetchers again. |
@squirrelchew I ran into this error too using node-fetch, the issue was esbuild minification. You could use --keep-names to avoid breakage with minfication that breaks node-fetch. If you minify using another library try to find a way to preserve function names because node-fetch tries to reach into AbortSignal using the name |
Our esbuild minify flag is set to false in our serverless yaml: esbuild:
bundle: true
minify: false
sourcemap: true
exclude: ['aws-sdk']
target: 'node20'
define: { 'require.resolve': undefined }
platform: 'node'
concurrency: 10 this is pretty standard procedure here; this portion of the yaml has been passed around since node12 deployments as far as I can see. I've been running a test against the default fetcher for the last 1.5 hours without issue, which uses node-fetch inside of these plugins that otherwise fail. The difference is that I removed all references to the node-abort-controller package in the apollo code that I have deployed. I'll try node-fetch directly in a little bit here and report back, but I have a suspicion it'll work just fine since the undefined/default fetcher is working great so far! There are just so many moving parts in my case though, what with the lambdas and APIG. I need to simplify this into something I can reproduce locally, as you have above. I'll likely get to work on this while some of these other tests are running in the AWS deployment. |
@squirrelchew yeah, i think ultimately it comes down to perhaps Apollo not properly handling timeouts in the instance of undici vs node-fetch vs native fetch, and issues with the AbortSignal / polyfill |
It certainly feels that way. I don't feel good about opening a PR or whatever in public here to add a handler for it, so for the time being I'm going to fork apollo, add handling for using the native AbortController in the case of node16+, and see where that gets us. if that fixes it and I can prove that it fixes it, it's a very simple change assuming parsing |
Alright, so changing the import of node-abort-controller to: import NodeAbortController from 'node-abort-controller' and then changing the instantiation of it to: const nodeMajor = parseInt(process.versions.node.split('.')[0], 10)
const controller = nodeMajor >= 16 ? new AbortController() : new NodeAbortController.AbortController(); Seems to have fixed our problem after some 8 hours of testing. This exists in a few places in It seems fine with both the default apollo fetcher as well as explicitly passing in node's There's a type import for the signal from node-abort-controller in the drainHttpServer plugin that I'm not sure how to handle, but it appears unused in our case so that's fine enough. |
Issue Description
Apollo outage on July 3rd caused issues communicating with uplink, forcing server crash.
We use undici as a primary fetch library for Apollo federation. We are in the process of integrating with Apollo enterprise, in our lower environment our server began to crash during a reported outage. The services recovered but our gateway crashed and impacted some testing in our lower environments. If we were using Apollo in production I would anticipate this would be a major outage for us.
Mainly asking to see if this is a known issue that has occurred for others, or if there's potentially an issue with using undici as a fetcher library.
Node version 18
Looking at the error source this appears to be related to a request timeout, which calls
node-abort-controller
polyfill which in turn tries to call the undici removeEventListener function which doesn't exist.federation/gateway-js/src/supergraphManagers/UplinkSupergraphManager/loadSupergraphSdlFromStorage.ts
Line 130 in 38beaa2
https://github.com/southpolesteve/node-abort-controller/blob/10e0cea66a069d9319f948d055621e1d37aea5db/index.js#L16
Link to Reproduction
N/A
Reproduction Steps
Service Dependencies:
Gateway init:
Node version: 18
The text was updated successfully, but these errors were encountered: