
b.web-connectivity.th down for 7.6 hours #128

Open
1 of 10 tasks
darkk opened this issue Jul 27, 2017 · 4 comments
Labels
incident tracking this is a recurring incident or bug

Comments

@darkk
Contributor

darkk commented Jul 27, 2017

Impact: TBD (AFAIK clients should fall back to another test helper if one of them is down, but that may not be the case).

Detection: repeated email alert

Timeline UTC:
00:00 four periodic jobs start: certbot@cron, certbot@systemd, munin plugins apt update, /data/b.web-connectivity.th.ooni.io/update-bouncer.py
00:00 2017-07-27 00:00:03,678:DEBUG:certbot.storage:Should renew, less than 30 days before certificate expiry 2017-08-25 23:01:00 UTC.
00:00 2017-07-27 00:00:03,678:INFO:certbot.hooks:Running pre-hook command: docker stop ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:03+0000 [-] Received SIGTERM, shutting down.
00:00 Jul 27 00:00:13 b dockerd[366]: time="2017-07-27T00:00:13.717964674Z" level=info msg="Container 6a4379167c880b295f7383d6eab8fc7b9e422ac1b0e6df0ab5cfefa2524fd512 failed to exit within 10 seconds of signal 15 - using the force"
00:00 2017-07-27 00:00:32,230:INFO:certbot.hooks:Running post-hook command: cp ... && docker start ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:34.710428510Z Another twistd server is running, PID 1
00:01 [FIRING] Instance https://b.web-connectivity.th.ooni.io/status down
00:34 2017-07-27 00:34:22,763:INFO:certbot.renewal:Cert not yet due for renewal
07:00 darkk@ wakes up
07:31 darkk@ logs into b.web-connectivity.th.ooni.io
07:33 2017-07-27T07:33:31.007042572Z Pidfile /oonib.pid contains non-numeric value (after an attempt to fix it with truncate --size 0 /var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid)
07:34 2017-07-27T07:34:00.767235934Z Removing stale pidfile /oonib.pid (after echo 42 >/var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid)
07:36 [RESOLVED] Instance https://b.web-connectivity.th.ooni.io/status down
09:50 incident published
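
For context on the 00:00:34 entry: twistd refuses to start if its pidfile names a process that is still alive. Inside the container the daemon runs as PID 1 and records that in the pidfile, so after any restart the recorded pid always points at a live process, namely the new twistd itself. A minimal illustration of the trap (path as in the container; the check is roughly what twistd does):

cat /oonib.pid                  # left over from the previous run: contains "1"
kill -0 "$(cat /oonib.pid)"     # PID 1 is always alive inside the container,
                                # hence "Another twistd server is running, PID 1"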

What went well:

  • repeated alerting made it clear that the problem was ongoing
  • although there was no darkk user on the host, my key was in root's authorized_keys
  • docker's aufs storage driver makes it easy to amend a container's filesystem while it is stopped; a device-mapper based backend is worse in that respect (see the sketch below)
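
For the record, a storage-driver-agnostic alternative: docker cp also works on stopped containers, so the fix does not strictly need the aufs diff path (container name as in the timeline above; the pid value 42 just needs to name a process that does not exist inside the container):

echo 42 > oonib.pid
docker cp oonib.pid ooni-backend-b.web-connectivity.th.ooni.io:/oonib.pid
docker start ooni-backend-b.web-connectivity.th.ooni.io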

What went wrong:

  • this was a relapse of the pid=1-in-pidfile problem
  • the "View in AlertManager" link in email and Slack notifications is not clickable
  • the two certbot launchers, /etc/systemd/system/timers.target.wants/certbot.timer and /etc/cron.d/letsencrypt_renew_certs-..., are confusing; I did not notice one of them at first and spent some time looking for what triggered the SIGTERM
  • the ooni-backend-b.web-connectivity.th.ooni.io container stores a significant amount of logs inside: they take 7G of the overall 15G container size limit. Historical logs are certainly useful, but: 1. renaming 7400 files on rotation may be slow, and twisted is single-threaded; 2. they should not overflow the container disk quota
  • twisted logs do not contain milliseconds, which are useful to distinguish two nearby events from a single event producing several log lines; docker logs stores nanoseconds, but oonib does not log to stdout (i.e. to docker)

What is still unclear and should be discussed:

  • impact: do applications really fall back to another test helper?
  • do we need a more insistent notification system that can wake people up?

What could be done to prevent relapse and decrease impact:

  • move letsencrypt renewals to the team's "business hours"; people are asleep at UTC midnight
  • avoid stale pid files somehow: clean up the twisted PID file on container start || grab a flock on the pidfile || randomize the daemon pid so it's not pid=1 (see the entrypoint sketch after this list)
  • fix AlertManager link
  • fix two certbot launchers
  • limit number of /var/log/ooni/oonibackend.log.* files in the container
  • rename /var/log/ooni/oonibackend.log.* files only once (e.g. to oonibackend.log.${unix_timestamp})
  • add milli- or microseconds to twisted logs || log to stdout (i.e. to docker)
  • ensure that same storage driver is used across all containers #96
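
A minimal entrypoint sketch for the first of those options, assuming the pidfile lives at /oonib.pid; the actual twistd invocation in the image may differ and is left as a placeholder:

#!/bin/sh
# Inside the container the daemon is always PID 1, so a leftover pidfile from a
# previous run always "looks alive" to twistd; removing it unconditionally on
# container start is safe because nothing else can be running here yet.
rm -f /oonib.pid
exec twistd --nodaemon --pidfile=/oonib.pid ...   # placeholder for the real oonib command
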
darkk added the incident label on Jul 27, 2017
@bassosimone
Contributor

bassosimone commented Aug 16, 2017

AFAIK clients should use another test helper if one of them is down, but it may be not the case

This is what ooni-probe does, but MK is lagging behind in this respect.

In particular, my understanding is that (@hellais correct me if I'm wrong) the bouncer returns three different collectors and/or test helpers (if any):

  1. the onion one
  2. the https one
  3. the cloudfronted one

Regarding MK specifically: it cannot use the onion one; it will use the https one; and with some more hammering it will also be able to use the cloudfronted one.

Question: how relevant is it that a client will retry with another helper, given that the three returned helpers (collectors) may all point to the same VM?

I mean, let's assume that the bouncer returns:

  • A.onion pointing to VM A
  • A.https pointing to VM A
  • A.cloudfronted pointing to VM A

In such a case, what do we gain by retrying all of them, if the breakage is caused by VM A being down?

  • Is the bouncer smart enough to return helpers (or collectors) belonging to different VMs (e.g. A.onion, B.https, C.cloudfronted) so as to increase diversity?

  • Is the bouncer able to return more than one entry for each category?

  • Is the bouncer somehow connected to alerts so that it reasonably knows that A is up when it returns A.onion, A.https and A.cloudfronted?

@darkk
Contributor Author

darkk commented Sep 25, 2017

Relapse at b.web-connectivity.th.ooni.io, timeline UTC:
25 Sep 00:00 2017-09-25T00:00:40.448510440Z Another twistd server is running, PID 1
25 Sep 00:05 [FIRING:1] InstanceDown
25 Sep 07:36 2017-09-25T07:36:01.047083733Z Removing stale pidfile /oonib.pid
25 Sep 07:40 [RESOLVED] InstanceDown

@hellais
Member

hellais commented Nov 21, 2019

This has relapsed twice more recently:

01:11 18th November 2019 UTC+2
[FIRING] mia-wcth.ooni.io:9100 had 1 OOMs in 5m
[FIRING] http://y3zq5fwelrzkkv3s.onion/status endpoint down
[FIRING] http://2lzg3f4r3eapfi6j.onion/status endpoint down
[RESOLVED] mia-wcth.ooni.io:9100 had 1 OOMs in 5m

hellais 11:33 18th November 2019 UTC+2
[RESOLVED] https://mia-wcth.ooni.io/status endpoint down
This was actually down. The issue was:
This could either be a previously started instance of your application or a
different application entirely. To start a new one, either run it in some other
directory, or use the --pidfile and --logfile parameters to avoid clashes.
Another twistd server is running, PID 1
The fix was to delete /srv/web_connectivity/oonib.pid and restart the docker container.
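
For concreteness, a minimal sketch of that fix (the container name is a placeholder; use the actual web_connectivity container on the host):

docker stop <container>
rm /srv/web_connectivity/oonib.pid
docker start <container>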

and then

00:34 21st November 2019 UTC+2
[FIRING] https://mia-wcth.ooni.io/status endpoint down

10:29 21st November 2019 UTC+2
Damn, the mia-wcth helper went down again due to: https://openobservatory.slack.com/archives/C38EJ0CET/p1574073207161400
10:29
[RESOLVED] https://mia-wcth.ooni.io/status endpoint down

@FedericoCeratto
Contributor

FedericoCeratto commented Dec 29, 2019

Relapsed again:

[FIRING] https://mia-wcth.ooni.io/status endpoint down
docker stop 1ca0dce565e5
rm /srv/web_connectivity/oonib.pid
docker restart 1ca0dce565e5
docker logs 1ca0dce565e5    # shows "Starting web_connectivity helper..."
tail -f /srv/web_connectivity/logs/oonibackend.log   # confirms the helper is unstuck
[RESOLVED] mia-run.ooni.nu:9100 is not OK, check `systemctl list-units | grep failed`
