Commit
web: informative and verbose error message when watchdog fails (#647)
Right now we use panicf, which produces a stack trace that is misleading about what is happening and fills up the space used by Kubernetes error reporting. Additionally, we have had a few bug reports about the watchdog failing. This commit updates the message to be far more informative about next steps. We also update the watchdog error to include the response body, in case it contains useful information for debugging.

Test Plan: Updated the serveHealthz handler to always return an error, then ran the following:

    $ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
    2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
    2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
    2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
    2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
    2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
    Final error: watchdog: status=500 body="not ready: boom\n"
    Possible Remediations:
    - If this rarely happens, ignore and let your process manager restart zoekt.
    - Possibly under provisioned. Try increasing CPU or disk IO.
    - A bug. Reach out with logs and screenshots of metrics when this occurs.
    exit status 3