b.web-connectivity.th down for 7.6 hours #128
This is what ooni-probe does, but MK is lagging behind in this respect. In particular, my understanding is that (@hellais correct me if I'm wrong) the bouncer returns three different collectors and/or test helpers (if any):
Regarding MK specifically: it cannot use the onion helper; it will use the https one; with some more hammering it will also be able to use the cloudfronted one. Question: how relevant is a client retrying with another helper, given that the three returned helpers (collectors) may all point to the same VM implementation? I mean, let's assume that the bouncer returns:
In that case, what do we gain by retrying all of them, if the breakage is caused by A being down?
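A client-side fallback loop over the returned helpers could look like the sketch below. The helper URLs and the `check_helper` stub are illustrative assumptions, not real bouncer output; a real probe would do something like `curl -sf --max-time 5 "$url/status"` instead of the stub. As the question above notes, such a loop only helps if the helpers are actually independent backends.

```shell
#!/bin/sh
# Hypothetical fallback over bouncer-returned helpers (URLs are made up).
# check_helper is a stub standing in for a real reachability probe,
# e.g.: curl -sf --max-time 5 "$1/status" >/dev/null
check_helper() {
    # Stub: pretend only the cloudfronted helper is reachable.
    [ "$1" = "https://example.cloudfront.net" ]
}

# pick_helper prints the first reachable helper, or fails if none is.
pick_helper() {
    for h in "$@"; do
        if check_helper "$h"; then
            printf '%s\n' "$h"
            return 0
        fi
    done
    return 1
}

pick_helper \
    "https://a.web-connectivity.th.ooni.io" \
    "http://example.onion" \
    "https://example.cloudfront.net"
```

If all three entries front the same VM, every iteration of this loop fails for the same root cause, which is exactly the concern raised above.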
Relapse at b.web-connectivity.th.ooni.io, timeline (times are UTC+2):
This relapsed again twice recently:
- 01:11 18th November 2019 to 11:33 18th November 2019 (hellais)
- 00:34 21st November 2019 to 10:29 21st November 2019
Relapsed again: [FIRING] https://mia-wcth.ooni.io/status endpoint down
docker stop 1ca0dce565e5
rm /srv/web_connectivity/oonib.pid
docker restart 1ca0dce565e5
docker logs 1ca0dce565e5                             # shows "Starting web_connectivity helper..."
tail -f /srv/web_connectivity/logs/oonibackend.log   # confirm it is unstuck
[RESOLVED] mia-run.ooni.nu:9100 is not OK, check `systemctl list-units | grep failed`
Impact: TBD (AFAIK clients should use another test helper if one of them is down, but this may not be the case).
Detection: repeated email alert
Timeline UTC:
00:00 four periodic jobs start: certbot@cron, certbot@systemd, munin plugins apt update, /data/b.web-connectivity.th.ooni.io/update-bouncer.py
00:00 2017-07-27 00:00:03,678:DEBUG:certbot.storage:Should renew, less than 30 days before certificate expiry 2017-08-25 23:01:00 UTC.
00:00 2017-07-27 00:00:03,678:INFO:certbot.hooks:Running pre-hook command: docker stop ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:03+0000 [-] Received SIGTERM, shutting down.
00:00 Jul 27 00:00:13 b dockerd[366]: time="2017-07-27T00:00:13.717964674Z" level=info msg="Container 6a4379167c880b295f7383d6eab8fc7b9e422ac1b0e6df0ab5cfefa2524fd512 failed to exit within 10 seconds of signal 15 - using the force"
00:00 2017-07-27 00:00:32,230:INFO:certbot.hooks:Running post-hook command: cp ... && docker start ooni-backend-b.web-connectivity.th.ooni.io
00:00 2017-07-27T00:00:34.710428510Z Another twistd server is running, PID 1
00:01 [FIRING] Instance https://b.web-connectivity.th.ooni.io/status down
00:34 2017-07-27 00:34:22,763:INFO:certbot.renewal:Cert not yet due for renewal
07:00 darkk@ wakes up
07:31 darkk@ logs into b.web-connectivity.th.ooni.io
07:33 2017-07-27T07:33:31.007042572Z Pidfile /oonib.pid contains non-numeric value, after an attempt to `truncate --size 0 /var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid`
07:34 2017-07-27T07:34:00.767235934Z Removing stale pidfile /oonib.pid, after `echo 42 >/var/lib/docker/aufs/diff/096b1a00f4529b788ee6f062929dc54540b9b06171c52a8957da8bb88c1ec094/oonib.pid`
07:36 [RESOLVED] Instance https://b.web-connectivity.th.ooni.io/status down
09:50 incident published
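Why the two manual nudges at 07:33 and 07:34 behaved differently can be sketched with twistd-like pidfile handling. This is a simplified reimplementation for illustration, not oonib's actual code: an empty or non-numeric pidfile is rejected outright, a PID that still maps to a live process means "another server is running" (which is what the dockerized twistd hit, since it itself was PID 1 in the container), and a numeric PID with no live process counts as stale and gets removed, which is why `echo 42` unstuck it.

```shell
#!/bin/sh
# Simplified, illustrative pidfile check in the spirit of twistd's:
# not oonib's real code, just the behaviour observed in the timeline.
check_pidfile() {
    pidfile=$1
    pid=$(cat "$pidfile" 2>/dev/null)
    case "$pid" in
        ''|*[!0-9]*)
            echo "non-numeric value"   # what the truncated (empty) file hit
            return 1 ;;
    esac
    if kill -0 "$pid" 2>/dev/null; then
        echo "another server is running, PID $pid"  # what PID 1 in the container hit
        return 1
    fi
    echo "removing stale pidfile"      # what echoing a dead PID hit
    rm -f "$pidfile"
}
```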
What went well:
- darkk user at the host, I had my key at root's authorized_keys

What went wrong:
- /etc/systemd/system/timers.target.wants/certbot.timer and /etc/cron.d/letsencrypt_renew_certs-... are confusing: I had not noticed one of them at first and spent some time looking for the trigger of SIGTERM

What is still unclear and should be discussed:
What could be done to prevent relapse and decrease impact:
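One option (a sketch, not a tested fix): make the certbot post-hook clear any leftover oonib.pid before starting the container, so a forced kill during renewal cannot leave twistd refusing to start. The container name and pidfile path below are taken from this incident's logs; the docker command is parameterized only so the sketch can be dry-run.

```shell
#!/bin/sh
set -eu
# Sketch of a more defensive certbot post-hook: remove a possibly stale
# pidfile before "docker start". A proposal, not the deployed hook.
post_hook() {
    docker_cmd=$1   # "docker" in production; parameterized for dry-runs
    pidfile=$2      # e.g. /srv/web_connectivity/oonib.pid
    container=$3    # e.g. ooni-backend-b.web-connectivity.th.ooni.io
    rm -f "$pidfile"               # twistd recreates it on a clean start
    "$docker_cmd" start "$container"
}
```

In certbot terms this would replace the current `--post-hook 'cp ... && docker start ...'` with one that also does the `rm -f` first.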