
b.echo.th.ooni.io possibly down for 8 hours #244

Open
4 tasks
darkk opened this issue Nov 20, 2018 · 2 comments
Labels
incident tracking this is a recurring incident or bug

Comments


darkk commented Nov 20, 2018

Impact: 8h 50m downtime of b.echo.th.ooni.io test helper (?)

Detection: CPUHigh alert with expected 8h delay

Timeline UTC:
17 Nov 07:30 CPU spikes to 100%; that is the accept() vs. EMFILE busy loop (see the sketch after the timeline)
17 Nov 15:34 CPUHigh alert firing
17 Nov 15:48 @darkk logs into the VM, confirms 100% CPU
17 Nov 16:14 @darkk logs into the VM, looks at oonib, reboots the VM
17 Nov 16:20 everything recovers to normal
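
For context, a minimal, hypothetical sketch (plain Python sockets, not the actual oonib/Twisted code) of why hitting the fd limit shows up as a 100% CPU spin rather than a clean failure: the pending connection is never consumed from the accept queue, so the event loop immediately wakes up again.

```python
# Hypothetical illustration of the accept() vs. EMFILE busy loop, not oonib code.
import errno
import select
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 57002))  # TCPEchoHelper port, used here only for illustration
listener.listen(128)

while True:
    # A client waiting in the accept backlog keeps the listener readable...
    select.select([listener], [], [])
    try:
        conn, addr = listener.accept()
    except OSError as exc:
        if exc.errno == errno.EMFILE:
            # ...but accept() cannot allocate a new fd, nothing is dequeued,
            # and select() returns again right away: a tight loop at 100% CPU.
            continue
        raise
    conn.close()  # normal connection handling would go here
```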

What went well:

  • resource utilisation alerts are actually useful!

What went wrong:

  • oonib was slowly leaking sockets at port tcp/57002 (TCPEchoHelper)
  • 1004 connections were enough to kill the daemon. They came from 395 distinct IPs: only 99 IPs had more than one connection, only 17 had more than 10, and the top 5 IPs had {55,52,33,32,32} connections (see the per-IP counting sketch below the status output)
  • the init script's status command was reporting nothing useful, so a reboot was an "easy" way to restart the service:
```
ooni-backend Status
Listing all oonib procs
No running oonib procs
```
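
For reference, a hypothetical diagnostic for getting the same per-IP breakdown quickly next time (assumes a reasonably recent iproute2 `ss` and the tcp/57002 helper port):

```python
# Hypothetical sketch: count established connections to the echo helper per
# remote IP, to see whether a handful of peers are hogging file descriptors.
import subprocess
from collections import Counter

out = subprocess.run(
    ["ss", "-Htn", "state", "established", "sport", "=", ":57002"],
    capture_output=True, text=True, check=True,
).stdout

peers = Counter()
for line in out.splitlines():
    parts = line.split()
    if not parts:
        continue
    ip = parts[-1].rsplit(":", 1)[0]  # last column is the peer address:port
    peers[ip] += 1

print("total connections:", sum(peers.values()))
for ip, n in peers.most_common(5):
    print(ip, n)
```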

What is still unclear:

  • was the service actually down? It seems it should have been, but no other alerts besides CPUHigh were triggered

What could be done to prevent relapse and decrease impact:

  • increase the FD limit (a rough sketch is below, together with the keepalive idea)
  • preventive restart (?)
  • add TCP_KEEPALIVE with low timeout values for the endpoints (?)
  • monitoring for the service itself besides a TCP port check (?)
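
A rough, hypothetical sketch of the first and third items (illustrative names and numbers, not a patch against oonib): raise RLIMIT_NOFILE at startup and arm TCP keepalive with short timers on every accepted connection so dead peers get reaped instead of pinning fds forever.

```python
# Hypothetical hardening helpers; TCP_KEEPIDLE/KEEPINTVL/KEEPCNT are Linux-only constants.
import resource
import socket

def raise_fd_limit(target=65536):
    # Bump the soft RLIMIT_NOFILE up to `target`, capped at the hard limit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY and hard < target:
        target = hard
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

def arm_keepalive(conn, idle=60, interval=15, probes=4):
    # Drop a silent peer after roughly idle + interval * probes seconds.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

Keepalive only catches peers that have silently gone away; clients that stay connected but idle on purpose would still need an application-level idle timeout.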
@darkk darkk added the incident label Nov 20, 2018

darkk commented Feb 15, 2019

Relapse. Timeline UTC:
14 Feb 22:50 CPU spikes to 100%
15 Feb 08:15 everything recovers


bassosimone commented May 4, 2019

Relapse. Timeline UTC:

2019-05-03T17:29:30Z CPU spikes
2019-05-04T01:31:00Z alert fires
2019-05-04T07:38:00Z @bassosimone notices and asks for guidance
2019-05-04T09:28:00Z @darkk suggests searching for similar issues in this repo
2019-05-04T10:14:00Z issue has been found; incident still ongoing
2019-05-04T10:18:00Z @bassosimone reboots the machine; top is happier
2019-05-04T10:22:00Z alerts are resolved

@hellais hellais added the tracking this is a recurring incident or bug label Feb 18, 2020