web: informative and verbose error message when watchdog fails (#647)
Right now we use log.Panicf, which produces a stack trace that is
misleading about what is happening and fills up the space used by
Kubernetes error reporting. Additionally, we have had a few bug reports
about the watchdog failing. This commit updates the message to be far
more informative about next steps.

We also update the watchdog error to include the response body in case
it contains useful information for debugging.

Test Plan: Updated the serveHealthz handler to always return an error,
then ran the following:

  $ ZOEKT_WATCHDOG_TICK=1s go run ./cmd/zoekt-webserver
  2023/09/14 15:55:27 custom ZOEKT_WATCHDOG_TICK=1s
  2023/09/14 15:55:27 loading 1 shard(s): github.com%2Fsourcegraph%2Fzoekt_v16.00000.zoekt
  2023/09/14 15:55:28 watchdog: failed, will try 2 more times: watchdog: status=500 body="not ready: boom\n"
  2023/09/14 15:55:29 watchdog: failed, will try 1 more times: watchdog: status=500 body="not ready: boom\n"
  2023/09/14 15:55:30 watchdog health check has consecutively failed 3 times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
  Final error: watchdog: status=500 body="not ready: boom\n"
  Possible Remediations:
  - If this rarely happens, ignore and let your process manager restart zoekt.
  - Possibly under provisioned. Try increasing CPU or disk IO.
  - A bug. Reach out with logs and screenshots of metrics when this occurs.
  exit status 3
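
The test plan above can be approximated without editing the real handler. The following is a minimal sketch, not the code from this commit: `checkHealth`, `failingHealthz`, and `runAgainstFailingServer` are hypothetical names, and the request is a plain GET against an `httptest` server rather than zoekt's actual health endpoint. It only illustrates how reading the body makes the error more useful.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// checkHealth is a hypothetical stand-in for watchdogOnce: it issues one GET
// and, on a non-200 response, returns an error carrying the status and body.
func checkHealth(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	// Read errors are ignored on purpose: the body is only diagnostic.
	body, _ := io.ReadAll(resp.Body)
	_ = resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("watchdog: status=%v body=%q", resp.StatusCode, string(body))
	}
	return nil
}

// failingHealthz mimics the test plan: a handler that always reports an error.
// http.Error appends a newline to the message it writes.
func failingHealthz(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "not ready: boom", http.StatusInternalServerError)
}

// runAgainstFailingServer starts a throwaway server and checks it once.
func runAgainstFailingServer() error {
	srv := httptest.NewServer(http.HandlerFunc(failingHealthz))
	defer srv.Close()
	return checkHealth(srv.Client(), srv.URL)
}

func main() {
	fmt.Println(runAgainstFailingServer())
}
```

Formatting the body with `%q` keeps it on a single log line and makes trailing newlines visible, which is why the logged error above ends in `\n"`.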
keegancsmith authored Sep 15, 2023
1 parent af12665 commit adf376d
Showing 1 changed file with 13 additions and 2 deletions.
15 changes: 13 additions & 2 deletions cmd/zoekt-webserver/main.go
@@ -24,6 +24,7 @@ import (
 	"flag"
 	"fmt"
 	"html/template"
+	"io"
 	"log"
 	"net"
 	"net/http"
@@ -438,9 +439,11 @@ func watchdogOnce(ctx context.Context, client *http.Client, addr string) error {
 	if err != nil {
 		return err
 	}
+	body, _ := io.ReadAll(resp.Body)
+	_ = resp.Body.Close()

 	if resp.StatusCode != http.StatusOK {
-		return fmt.Errorf("watchdog: status %v", resp.StatusCode)
+		return fmt.Errorf("watchdog: status=%v body=%q", resp.StatusCode, string(body))
 	}
 	return nil
 }
@@ -462,7 +465,15 @@ func watchdog(dt time.Duration, maxErrCount int, addr string) {
 			metricWatchdogErrors.Set(float64(errCount))
 			metricWatchdogErrorsTotal.Inc()
 			if errCount >= maxErrCount {
-				log.Panicf("watchdog: %v", err)
+				log.Printf(`watchdog health check has consecutively failed %d times indicating is likely an unrecoverable error affecting zoekt. As such this process will exit with code 3.
+Final error: %v
+Possible remediations:
+- If this rarely happens, ignore and let your process manager restart zoekt.
+- Possibly under provisioned. Try increasing CPU or disk IO.
+- A bug. Reach out with logs and screenshots of metrics when this occurs.`, errCount, err)
+				os.Exit(3)
+			} else {
+				log.Printf("watchdog: failed, will try %d more times: %v", maxErrCount-errCount, err)
 			}