
b.web-connectivity.th down for 7.6 hours #128

Open
1 of 10 tasks
darkk opened this issue Jul 27, 2017 · 4 comments
Labels
incident tracking this is a recurring incident or bug

Comments

@darkk
Contributor

darkk commented Jul 27, 2017

Impact: TBD (AFAIK clients should fall back to another test helper if one of them is down, but that may not be the case).

Detection: repeated email alert

Timeline UTC:
00:00 four periodic jobs start: certbot@cron, certbot@systemd, munin plugins apt update, /data/b.web-connectivity.th.ooni.io/update-bouncer.py
00:00 2017-07-27 00:00:03,678:DEBUG:certbot.storage:Should renew, less than 30 days before certificate expiry 2017-08-25 23:01:00 UTC.
00:00 2017-07-27 00:00:03,678:INFO:certbot.hooks:Running pre-hook command: docker stop ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:03+0000 [-] Received SIGTERM, shutting down.
00:00 Jul 27 00:00:13 b dockerd[366]: time="2017-07-27T00:00:13.717964674Z" level=info msg="Container 6a4379167c880b295f7383d6eab8fc7b9e422ac1b0e6df0ab5cfefa2524fd512 failed to exit within 10 seconds of signal 15 - using the force"
00:00 2017-07-27 00:00:32,230:INFO:certbot.hooks:Running post-hook command: cp ... && docker start ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:34.710428510Z Another twistd server is running, PID 1
00:01 [FIRING] Instance https://b.web-connectivity.th.ooni.io/status down
00:34 2017-07-27 00:34:22,763:INFO:certbot.renewal:Cert not yet due for renewal
07:00 darkk@ wakes up
07:31 darkk@ logs into b.web-connectivity.th.ooni.io
07:33 2017-07-27T07:33:31.007042572Z Pidfile /oonib.pid contains non-numeric value (after an attempt to fix it with truncate --size 0 /var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid)
07:34 2017-07-27T07:34:00.767235934Z Removing stale pidfile /oonib.pid (after echo 42 >/var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid)
07:36 [RESOLVED] Instance https://b.web-connectivity.th.ooni.io/status down
09:50 incident published
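
For context on the 00:00:34 entry: twistd refuses to start if its pidfile names a process that is still alive. Inside the container the daemon runs as PID 1 and records that in the pidfile, so after any restart the recorded pid always points at a live process, namely the new twistd itself. A minimal illustration of the trap (path as in the container; the check is roughly what twistd does):

cat /oonib.pid                  # left over from the previous run: contains "1"
kill -0 "$(cat /oonib.pid)"     # PID 1 is always alive inside the container,
                                # hence "Another twistd server is running, PID 1"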

What went well:

  • repeated alerting made it clear that the problem was ongoing
  • although there was no darkk user on the host, my key was in root's authorized_keys
  • docker's aufs storage driver makes it easy to amend a container's filesystem while it is stopped; a device-mapper based backend is worse in that respect (see the sketch below)
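
For the record, a storage-driver-agnostic alternative: docker cp also works on stopped containers, so the fix does not strictly need the aufs diff path (container name as in the timeline above; the pid value 42 just needs to name a process that does not exist inside the container):

echo 42 > oonib.pid
docker cp oonib.pid ooni-backend-b.web-connectivity.th.ooni.io:/oonib.pid
docker start ooni-backend-b.web-connectivity.th.ooni.io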

What went wrong:

  • this was a relapse of the pid=1-in-pidfile problem
  • the "View in AlertManager" link in email and Slack notifications is not clickable
  • the two certbot launchers, /etc/systemd/system/timers.target.wants/certbot.timer and /etc/cron.d/letsencrypt_renew_certs-..., are confusing; I did not notice one of them at first and spent some time looking for what triggered the SIGTERM
  • the ooni-backend-b.web-connectivity.th.ooni.io container stores a significant amount of logs inside: they take 7G of the overall 15G container size limit. Historical logs are certainly useful, but: 1. renaming 7400 files on rotation may be slow, and twisted is single-threaded; 2. they should not overflow the container disk quota
  • twisted logs do not contain milliseconds, which are useful to distinguish two nearby events from a single event producing several log lines; docker logs stores nanoseconds, but oonib does not log to stdout (i.e. to docker)

What is still unclear and should be discussed:

  • impact: do applications really fall back to another test helper?
  • do we need a more insistent notification system that can wake people up?

What could be done to prevent relapse and decrease impact:

  • move letsencrypt renewals to the team's "business hours"; people are asleep at UTC midnight
  • avoid stale pid files somehow: clean up the twisted PID file on container start || grab a flock on the pidfile || randomize the daemon pid so it's not pid=1 (see the entrypoint sketch after this list)
  • fix AlertManager link
  • fix two certbot launchers
  • limit number of /var/log/ooni/oonibackend.log.* files in the container
  • rename /var/log/ooni/oonibackend.log.* files only once (e.g. to oonibackend.log.${unix_timestamp})
  • add milli- or microseconds to twisted logs || log to stdout (i.e. to docker)
  • ensure that same storage driver is used across all containers #96
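
A minimal entrypoint sketch for the first of those options, assuming the pidfile lives at /oonib.pid; the actual twistd invocation in the image may differ and is left as a placeholder:

#!/bin/sh
# Inside the container the daemon is always PID 1, so a leftover pidfile from a
# previous run always "looks alive" to twistd; removing it unconditionally on
# container start is safe because nothing else can be running here yet.
rm -f /oonib.pid
exec twistd --nodaemon --pidfile=/oonib.pid ...   # placeholder for the real oonib command
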
darkk added the incident label on Jul 27, 2017
@bassosimone
Contributor

bassosimone commented Aug 16, 2017

AFAIK clients should use another test helper if one of them is down, but it may be not the case

This is what ooni-probe does, but MK is lagging behind in this respect.

In particular, my understanding is that (@hellais correct me if I'm wrong) the bouncer returns three different collectors and/or test helpers (if any):

  1. the onion one
  2. the https one
  3. the cloudfronted one

Regarding MK specifically: it cannot use the onion one; it will use the https one; and with some more hammering it will also be able to use the cloudfronted one.

Question: how relevant is it that a client will retry with another helper, given that the three returned helpers (collectors) may all point to the same VM?

I mean, let's assume that the bouncer returns:

  • A.onion pointing to VM A
  • A.https pointing to VM A
  • A.cloudfronted pointing to VM A

In such a case, what do we gain by retrying all of them, if the breakage is caused by VM A being down?

  • Is the bouncer smart enough to return helpers (or collectors) belonging to different VMs (e.g. A.onion, B.https, C.cloudfronted) so as to increase diversity?

  • Is the bouncer able to return more than one entry for each category?

  • Is the bouncer somehow connected to alerts so that it reasonably knows that A is up when it returns A.onion, A.https and A.cloudfronted?

@darkk
Contributor Author

darkk commented Sep 25, 2017

Relapse at b.web-connectivity.th.ooni.io, timeline UTC:
25 Sep 00:00 2017-09-25T00:00:40.448510440Z Another twistd server is running, PID 1
25 Sep 00:05 [FIRING:1] InstanceDown
25 Sep 07:36 2017-09-25T07:36:01.047083733Z Removing stale pidfile /oonib.pid
25 Sep 07:40 [RESOLVED] InstanceDown

@hellais
Member

hellais commented Nov 21, 2019

This has relapsed twice more recently:

01:11 18th November 2019 UTC+2
[FIRING] mia-wcth.ooni.io:9100 had 1 OOMs in 5m
[FIRING] http://y3zq5fwelrzkkv3s.onion/status endpoint down
[FIRING] http://2lzg3f4r3eapfi6j.onion/status endpoint down
[RESOLVED] mia-wcth.ooni.io:9100 had 1 OOMs in 5m

hellais 11:33 18th November 2019 UTC+2
[RESOLVED] https://mia-wcth.ooni.io/status endpoint down
This was actually down. The issue was:
This could either be a previously started instance of your application or a
different application entirely. To start a new one, either run it in some other
directory, or use the --pidfile and --logfile parameters to avoid clashes.
Another twistd server is running, PID 1
The fix was to delete /srv/web_connectivity/oonib.pid and restart the docker container.
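
For concreteness, a minimal sketch of that fix (the container name is a placeholder; use the actual web_connectivity container on the host):

docker stop <container>
rm /srv/web_connectivity/oonib.pid
docker start <container>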

and then

00:34 21st November 2019 UTC+2
[FIRING] https://mia-wcth.ooni.io/status endpoint down

10:29 21st November 2019 UTC+2
Damn, the mia-wcth helper went down again due to: https://openobservatory.slack.com/archives/C38EJ0CET/p1574073207161400
10:29
[RESOLVED] https://mia-wcth.ooni.io/status endpoint down

@FedericoCeratto
Contributor

FedericoCeratto commented Dec 29, 2019

Relapsed again:

[FIRING] https://mia-wcth.ooni.io/status endpoint down
docker stop 1ca0dce565e5
rm /srv/web_connectivity/oonib.pid
docker restart 1ca0dce565e5
docker logs 1ca0dce565e5    # shows "Starting web_connectivity helper..."
tail -f /srv/web_connectivity/logs/oonibackend.log   # confirms the helper is unstuck
[RESOLVED] mia-run.ooni.nu:9100 is not OK, check `systemctl list-units | grep failed`
