-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Egress server: Heroku reports H15 errors / 503 responses #181
Comments
Given the content and order of the log messages when the error is reported, there are two possible interpretations of causality here:
This SO thread suggests it might be no. 2: https://stackoverflow.com/questions/32728030/heroku-h15-error-on-web-socket-close |
Another clue: If you pay attention to the timestamps in the log messages, it seems that the messages actually print out of order, which obfuscates causality. In other words, if you match up the log message from the egress server indicating WebSocket closure with the log message from Heroku reporting the error, it's actually the egress server message with the earlier timestamp -- even though the Heroku message reliably appears first. The timestamp reversal does not occur when looking at messages resulting from a client-initiated healthy closure. I've only looked at 10 or so message pairings -- but if these invariants hold, it suggests the following idea: When a client disconnects in a healthy way, Heroku is the first to know about it (which makes sense), and so Heroku reports the outcome of the request first. When a client disconnects in an unhealthy way, leaving the connection in a half open state, the server is the first to know about it, and so the server formally closes the request. Only then does Heroku see that the request has terminated, and so it reports the outcome -- and for some reason, it reports it as an H15 error. This suggests that, in the case of the reported error, it's not Heroku closing our WebSockets and our server reacting to it -- rather, it's our server behaving correctly, and Heroku just interpreting things wrong. |
@oxtoacart pointed out that the egress server metrics indicate a bunch of 5XX responses for requests. This correlates with the H15 errors reported in the events.
If you tail the logs for the server, you can observe these things happening from time to time. It looks like this:
After reading about H15 errors, you might think that Heroku is just deciding that some of our WebSocket connections are idle and closing them down. But we implemented a keepalive to prevent this from happening, and at the time, it seems to have worked.
I thought: Is the keepalive now failing? It was set for 30s, so I reduced it to 15s and redeployed. But this didn't solve the problem!
The text was updated successfully, but these errors were encountered: