Impact: 8h 50m downtime of b.echo.th.ooni.io test helper (?)
Detection: CPUHigh alert with expected 8h delay
Timeline UTC:
17 Nov 07:30 CPU spikes to 100%; that's an accept() vs. EMFILE busy loop (see the sketch after the timeline)
17 Nov 15:34 CPUHigh alert fires
17 Nov 15:48 @darkk logs into the VM, confirms 100% CPU
17 Nov 16:14 @darkk logs into the VM, looks at oonib, reboots the VM
17 Nov 16:20 everything recovers to normal
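For context on why the CPU pegs at 100%: below is a minimal, self-contained sketch of the failure mode (not oonib's actual Twisted code). Once the process exhausts its file descriptor limit, accept() fails with EMFILE, but the pending connection stays in the kernel's accept queue, so the listening socket is immediately reported readable again and the event loop spins retrying accept().

```python
# Minimal sketch of the EMFILE busy loop (illustrative, not oonib code).
import errno
import select
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 0))
listener.listen(128)
listener.setblocking(False)

leaked = []  # accepted connections that are never closed -> slow fd leak

while True:
    # Blocks until the listening socket is readable, i.e. a connection is queued.
    select.select([listener], [], [])
    try:
        conn, addr = listener.accept()
        leaked.append(conn)
    except OSError as e:
        if e.errno == errno.EMFILE:
            # Out of fds: the queued connection is still there, so select()
            # returns immediately on the next iteration -> 100% CPU busy loop.
            continue
        raise
```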
What went well:
resource utilisation alerts are actually useful!
What went wrong:
oonib was slowly leaking sockets at port tcp/57002 (TCPEchoHelper)
1004 connections were enough to kill the daemon. They came from 395 distinct IPs; only 99 IPs had more than one connection, only 17 had more than 10, and the top 5 IPs had {55, 52, 33, 32, 32} connections (see the per-IP tally sketch after this section)
the init script's status command reported nothing useful, so a reboot was the "easy" way to restart the service:
ooni-backend Status
Listing all oonib procs
No running oonib procs
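The per-IP breakdown above can be reproduced with something along these lines: a sketch that tallies established connections to the helper's port by remote address from /proc/net/tcp (Linux, IPv4, little-endian hosts only; a hypothetical helper, not the exact command used during the incident).

```python
# Hypothetical helper: count established connections to a local port per remote IP.
import socket
import struct
from collections import Counter

def remote_ip_counts(local_port, proc="/proc/net/tcp"):
    counts = Counter()
    with open(proc) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local, remote, state = fields[1], fields[2], fields[3]
            if state != "01":  # 01 == TCP_ESTABLISHED
                continue
            if int(local.split(":")[1], 16) != local_port:
                continue
            # /proc/net/tcp stores the IPv4 address in native (little-endian) byte order.
            ip = socket.inet_ntoa(struct.pack("<I", int(remote.split(":")[0], 16)))
            counts[ip] += 1
    return counts

if __name__ == "__main__":
    counts = remote_ip_counts(57002)  # TCPEchoHelper port
    print(sum(counts.values()), "connections from", len(counts), "distinct IPs")
    for ip, n in counts.most_common(5):
        print(ip, n)
```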
What is still unclear:
was the service actually down? It seems it should have been, but no other alerts besides CPUHigh were triggered
What could be done to prevent relapse and decrease impact:
increase FD limit
preventive restart (?)
add TCP_KEEPALIVE with low timeout values for the endpoints (?)
monitoring for the service itself besides the TCP port check (?); sketches of these mitigation ideas follow below
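A hedged sketch of the FD-limit and keepalive ideas above, assuming a Linux host; the limit and timer values are illustrative, not what is deployed on the helper.

```python
# Illustrative hardening: raise RLIMIT_NOFILE and enable aggressive TCP keepalive.
import resource
import socket

# Raise the soft fd limit towards the hard limit so ~1000 leaked sockets
# cannot exhaust the process (values are illustrative).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))

def enable_keepalive(sock, idle=60, interval=15, count=4):
    """Turn on TCP keepalive with low timers so dead peers are reaped in minutes."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # seconds idle before the first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)       # failed probes before the connection is dropped
```

Calling enable_keepalive() on each accepted socket would let the kernel reap connections from clients that silently disappeared, instead of letting them accumulate until EMFILE.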
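For the last item, a sketch of a service-level check for the TCPEchoHelper: an active probe that sends a payload and requires the echo back within a timeout, rather than only checking that tcp/57002 accepts connections. The host, payload, and exit-code convention are assumptions for illustration, not an existing OONI monitoring script.

```python
# Hypothetical end-to-end check for the echo test helper.
import socket
import sys

HOST, PORT = "b.echo.th.ooni.io", 57002
PAYLOAD, TIMEOUT = b"ooni-echo-probe\n", 10

def check_echo():
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT) as s:
        s.settimeout(TIMEOUT)
        s.sendall(PAYLOAD)
        data = b""
        while len(data) < len(PAYLOAD):
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    return data == PAYLOAD

if __name__ == "__main__":
    sys.exit(0 if check_echo() else 1)  # non-zero exit -> raise an alert
```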
2019-05-03T17:29:30Z CPU spikes
2019-05-04T01:31:00Z alert fires
2019-05-04T07:38:00Z @bassosimone notices and asks for guidance
2019-05-04T09:28:00Z @darkk suggests searching for issues in this repo
2019-05-04T10:14:00Z issue has been found; incident still ongoing
2019-05-04T10:18:00Z @bassosimone reboots the machine; top is happier
2019-05-04T10:22:00Z alerts are resolved