
Azure alert for when frontend not getting response from backend #6888

Closed · mehansen opened this issue Oct 31, 2023 · 6 comments · Fixed by #7051
mehansen (Collaborator) commented Oct 31, 2023

This may not be necessary if we can create a probe that alerts us via PagerDuty; waiting on the outcome of #6890.

Background

During our recent production outage, our frontend could not connect to our backend due to a misconfigured environment variable. We received no alerts about this. There is an action item to create a test that acts as more of a health check after a prod deploy and alerts us if it fails regardless of whether someone is using the app. This ticket is for a shorter-term fix.

Action requested

Create an Azure alert based on logs that will trigger a page when the frontend is not able to connect to the backend while users are using the app.

Potential query for alerts:
Count of failed API retrievals

dependencies
| where type == "Fetch" and name == "GET https://prod.simplereport.gov/api/feature-flags" and success == false and client_Type == "Browser"
| summarize count()

^ check whether results are cached first; if they are not, we should see successful responses every 1-2 min
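
A minimal sketch of that caching check, assuming the same dependencies fields used above (the 1-hour lookback and 5-minute bins are arbitrary choices for illustration); if responses are not cached, most bins should contain successful fetches:

// successful feature-flag fetches over the last hour, in 5-minute bins
dependencies
| where type == "Fetch" and name == "GET https://prod.simplereport.gov/api/feature-flags" and success == true and client_Type == "Browser"
| where timestamp > ago(1h)
| summarize successful_fetches = count() by bin(timestamp, 5m)
| order by timestamp asc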

Acceptance criteria

  • if the frontend cannot reach the backend for any reason, we should get an alert within [? min? attempts?]
    • up to the assignee to determine a good threshold

Additional notes

suggestion for testing: set REACT_APP_BASE_URL to an incorrect URL in a lower environment, attempt to access the app, and verify the alert is triggered
an incident should be created in PagerDuty (but it will be low-urgency since it's in a lower env)

written in Terraform - use existing alerts we've written for context

mehansen (Collaborator, Author) commented:

Run the query in App Insights prod and see what number we get back.
Ideally the threshold is 0, but check whether we're already getting these failures.
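
A hedged sketch of that baseline check (the 60-day lookback and daily bins are arbitrary, not part of the ticket); a non-zero count on any recent day would mean a threshold of 0 is already paging:

// existing failed feature-flag fetches in prod, counted per day over the last 60 days
dependencies
| where type == "Fetch" and name == "GET https://prod.simplereport.gov/api/feature-flags" and success == false and client_Type == "Browser"
| where timestamp > ago(60d)
| summarize failed_fetches = count() by bin(timestamp, 1d)
| order by timestamp asc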

emyl3 self-assigned this Dec 3, 2023
emyl3 (Collaborator) commented Dec 3, 2023

Was able to confirm the following after running the query above on prod Application Insights:

During the time period that covered the outage (1612 count):

[screenshot of query results]

Almost two-month period after the outage (0 count):

[screenshot of query results]

This confirms that we should alert if even 1 failure of this nature happens.

emyl3 (Collaborator) commented Dec 15, 2023

The query below (from the issue description) assumed that the URL would be improperly set to https://prod.simplereport.gov, as it was during our outage:

dependencies
| where type == "Fetch" and name == "GET https://prod.simplereport.gov/api/feature-flags" and success == false and client_Type == "Browser"
| summarize count()

However, I changed that query to be more permissive so that:

  1. it works on other envs
  2. it works if the URL is set to something other than https://prod.simplereport.gov/api/feature-flags

New query in PR #7051 (Add alert when frontend can't communicate with backend):
dependencies
| where type == "Fetch" and name has "api/feature-flags" and success == false and client_Type == "Browser"
| summarize count() by bin(ago(20m), 5m)

However, I did not re-check this more permissive query in prod, and it looks like the feature-flags endpoint fails pretty frequently. See this search of our logs.

We will have to revisit what query we can reliably create an alert from. 😓
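
One possible way to quantify that background noise before picking a new threshold (a sketch only, not the query from the PR; the 30-day window, daily bins, and column names are illustrative):

// failed vs. total feature-flag fetches per day, to see what failure rate is "normal"
dependencies
| where type == "Fetch" and name has "api/feature-flags" and client_Type == "Browser"
| where timestamp > ago(30d)
| summarize total = count(), failed = countif(success == false) by bin(timestamp, 1d)
| extend failure_rate = todouble(failed) / todouble(total)
| order by timestamp asc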

emyl3 (Collaborator) commented Dec 15, 2023

haha this would work: https://github.com/CDCgov/prime-simplereport/pull/7057/files 🤩

emyl3 (Collaborator) commented Dec 21, 2023

Closing in favor of the more robust solution in #7057

emyl3 closed this as completed Dec 21, 2023
mehansen (Collaborator, Author) commented:

so much for our shorter-term fix while we work on #7019 lol
thanks for working on it anyway!
