
b.echo.th.ooni.io possibly down for 8 hours #244

Open
4 tasks
darkk opened this issue Nov 20, 2018 · 2 comments
Labels
incident tracking this is a recurring incident or bug

Comments


darkk commented Nov 20, 2018

Impact: 8h 50m downtime of b.echo.th.ooni.io test helper (?)

Detection: CPUHigh alert with expected 8h delay

Timeline UTC:
17 Nov 07:30 CPU spikes to 100%; that is the accept() vs. EMFILE busy loop (see the sketch after the timeline)
17 Nov 15:34 CPUHigh alert firing
17 Nov 15:48 @darkk logs into the VM, confirms 100% CPU
17 Nov 16:14 @darkk logs into the VM, looks at oonib, reboots the VM
17 Nov 16:20 everything recovers to normal
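
For context, a minimal, hypothetical sketch (plain Python sockets, not the actual oonib/Twisted code) of why hitting the fd limit shows up as a 100% CPU spin rather than a clean failure: the pending connection is never consumed from the accept queue, so the event loop immediately wakes up again.

```python
# Hypothetical illustration of the accept() vs. EMFILE busy loop, not oonib code.
import errno
import select
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 57002))  # TCPEchoHelper port, used here only for illustration
listener.listen(128)

while True:
    # A client waiting in the accept backlog keeps the listener readable...
    select.select([listener], [], [])
    try:
        conn, addr = listener.accept()
    except OSError as exc:
        if exc.errno == errno.EMFILE:
            # ...but accept() cannot allocate a new fd, nothing is dequeued,
            # and select() returns again right away: a tight loop at 100% CPU.
            continue
        raise
    conn.close()  # normal connection handling would go here
```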

What went well:

  • resource utilisation alerts are actually useful!

What went wrong:

  • oonib was slowly leaking sockets at port tcp/57002 (TCPEchoHelper)
  • 1004 connections were enough to kill the daemon. They came from 395 distinct IPs: only 99 IPs had more than one connection, only 17 had more than 10, and the top 5 IPs had {55,52,33,32,32} connections (see the per-IP counting sketch below the status output)
  • the init script's status command was reporting nothing useful, so a reboot was an "easy" way to restart the service:
```
ooni-backend Status
Listing all oonib procs
No running oonib procs
```
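
For reference, a hypothetical diagnostic for getting the same per-IP breakdown quickly next time (assumes a reasonably recent iproute2 `ss` and the tcp/57002 helper port):

```python
# Hypothetical sketch: count established connections to the echo helper per
# remote IP, to see whether a handful of peers are hogging file descriptors.
import subprocess
from collections import Counter

out = subprocess.run(
    ["ss", "-Htn", "state", "established", "sport", "=", ":57002"],
    capture_output=True, text=True, check=True,
).stdout

peers = Counter()
for line in out.splitlines():
    parts = line.split()
    if not parts:
        continue
    ip = parts[-1].rsplit(":", 1)[0]  # last column is the peer address:port
    peers[ip] += 1

print("total connections:", sum(peers.values()))
for ip, n in peers.most_common(5):
    print(ip, n)
```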

What is still unclear:

  • was the service actually down? It seems it should have been, but no other alerts besides CPUHigh were triggered

What could be done to prevent relapse and decrease impact:

  • increase the FD limit (a rough sketch is below, together with the keepalive idea)
  • preventive restart (?)
  • add TCP_KEEPALIVE with low timeout values for the endpoints (?)
  • monitoring for the service itself besides a TCP port check (?)
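
A rough, hypothetical sketch of the first and third items (illustrative names and numbers, not a patch against oonib): raise RLIMIT_NOFILE at startup and arm TCP keepalive with short timers on every accepted connection so dead peers get reaped instead of pinning fds forever.

```python
# Hypothetical hardening helpers; TCP_KEEPIDLE/KEEPINTVL/KEEPCNT are Linux-only constants.
import resource
import socket

def raise_fd_limit(target=65536):
    # Bump the soft RLIMIT_NOFILE up to `target`, capped at the hard limit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY and hard < target:
        target = hard
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

def arm_keepalive(conn, idle=60, interval=15, probes=4):
    # Drop a silent peer after roughly idle + interval * probes seconds.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

Keepalive only catches peers that have silently gone away; clients that stay connected but idle on purpose would still need an application-level idle timeout.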
@darkk darkk added the incident label Nov 20, 2018

darkk commented Feb 15, 2019

Relapse. Timeline UTC:
14 Feb 22:50 CPU spikes to 100%
15 Feb 08:15 everything recovers


bassosimone commented May 4, 2019

Relapse. Timeline UTC:

2019-05-03T17:29:30Z CPU spikes
2019-05-04T01:31:00Z alert fires
2019-05-04T07:38:00Z @bassosimone notices and asks for guidance
2019-05-04T09:28:00Z @darkk suggests searching for similar issues in this repo
2019-05-04T10:14:00Z issue has been found; incident still ongoing
2019-05-04T10:18:00Z @bassosimone reboots the machine; top is happier
2019-05-04T10:22:00Z alerts are resolved

@hellais hellais added the tracking this is a recurring incident or bug label Feb 18, 2020