Rollbar Handbook
In this guide we will cover the basic concepts behind Rollbar, show you how to manage errors in the system, and give general guidance for Point Guards who need to curate messages in the system.
In order to manage errors in Rollbar you will need an account. If you have recently joined, or have never had to manage errors before, ask Forrest for access.
Rollbar is essentially a logging system that records errors, tallies them, and can be used to understand the state of our system from the perspective of failure scenarios. It also includes tools that allow us to watch, resolve, and change the criticality of error messages. Used correctly, it lets us quickly identify, track, and fix errors in our production system.
To begin managing our error reporting through Rollbar, you must first understand the levels of criticality it provides. The levels, in order of priority, are:
- Critical: Fatal errors that cause servers (or the entire app) to go down; these should be addressed immediately.
- Error: Unexpected failures; these should be addressed soon.
- Warn: Expected failures; these can usually be ignored.
- Info: Information about the service; these should never be reported.
Notice that these levels are very similar to those of our production logging. This is no accident, as Rollbar is essentially a log tallying system that provides a focused view on errors. Keep this in mind when managing errors, as it will help you decide which errors should be marked as warnings and which should be kept at error or critical.
As a Point Guard you will be responsible for managing and curating the errors in Rollbar. Since each of our projects reports every error it encounters to Rollbar, it is hard to separate the signal from the noise.
As the Point Guard you will be responsible for performing the following actions:
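As a rough illustration of the ordering described above, the levels can be modeled as an ordered enumeration. This is only a sketch: the `Level` enum, the `GUIDANCE` table, the `by_priority` helper, and the sample error titles are hypothetical names made up for this example, not part of Rollbar or of our codebase.

```python
from enum import IntEnum


class Level(IntEnum):
    """Handbook criticality levels; a lower value means a higher priority."""
    CRITICAL = 0
    ERROR = 1
    WARN = 2
    INFO = 3


# Handbook guidance for each level, paraphrased from the list above.
GUIDANCE = {
    Level.CRITICAL: "address immediately",
    Level.ERROR: "address soon",
    Level.WARN: "can usually be ignored",
    Level.INFO: "should never be reported",
}


def by_priority(items):
    """Sort (title, Level) pairs so the most urgent ones come first."""
    return sorted(items, key=lambda item: item[1])


# Hypothetical sample data, for illustration only.
sample = [
    ("cache miss", Level.WARN),
    ("database unreachable", Level.CRITICAL),
    ("unexpected null", Level.ERROR),
]
for title, level in by_priority(sample):
    print(f"{level.name:<8} {title}: {GUIDANCE[level]}")
```

Sorting by level like this mirrors the order in which a Point Guard would want to read a list of errors: critical first, warnings last.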
- Categorize any new errors (warning, critical, etc.)
- Determine if any newly introduced errors require immediate attention
- Attempt to discern any correlations to recent deployments on the project by cross-referencing with GitHub
As long as you perform these actions for every project that we are tracking then you will have done your job and helped make sure we all have a keen understanding of our system (from the perspective of errors, that is). To start you can adopt the following basic script to use when performing your duties:
- Log in to Rollbar and access the "Dashboard" for the API
- Note any new errors that were not present the day before
- Investigate any new errors first, as they might represent critical failures in the system
- Determine if any of the new errors being reported are non-serious and mark them as warnings
- Switch to the next project from within the interface, rinse, and repeat.
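The script above boils down to "find what is new since yesterday, look at the most serious items first, then move to the next project". A minimal sketch of that logic follows, assuming you have already collected each project's errors by hand or by some export. The `Item` class, `new_items`, and `daily_pass` are hypothetical names for this example; nothing here calls Rollbar itself.

```python
from dataclasses import dataclass

# Handbook levels, most urgent first.
PRIORITY = {"critical": 0, "error": 1, "warn": 2, "info": 3}


@dataclass(frozen=True)
class Item:
    """One error as seen on a project's dashboard (hypothetical shape)."""
    item_id: int
    title: str
    level: str


def new_items(today, yesterday):
    """Return items present today but not yesterday, most urgent first."""
    seen = {item.item_id for item in yesterday}
    fresh = [item for item in today if item.item_id not in seen]
    return sorted(fresh, key=lambda item: PRIORITY[item.level])


def daily_pass(projects):
    """Run the same check for every project.

    `projects` maps a project name to a (today, yesterday) pair of item lists.
    """
    return {
        name: new_items(today, yesterday)
        for name, (today, yesterday) in projects.items()
    }
```

Deciding whether a new item is non-serious enough to mark as a warning still takes human judgment; the sketch only orders the items you need to look at.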
The workflow given above is only one of many possible scripts one could follow to get the job done. As you gain more experience with the tool and our system you will probably find a method that fits you best.
And that's basically it. The tool is pretty good so it makes the job of exploring and getting a grasp on the errors a piece of cake. All you have to do is make sure you're not skimping on your duty and get the job done! The rest of this article contains a reference (with images) for each of the major aspects of the tool.
When you first log in to Rollbar you should see a page that looks like this:
This is the Rollbar dashboard and it shows you basic trends about errors for a particular project in our infrastructure. From the dashboard you have the following basic actions you can perform:
- View trend information for project errors
- Dig deeper into a particular error
- Monitor account usage
The dashboard will only show errors for one project at a time. As a Point Guard you will need to monitor errors across all of the projects in our infrastructure. To do so, use the project selection dropdown near the top left of the page.
The Rollbar interface uses dark red to denote errors and yellow to denote warnings, like so:
Dotted throughout the interface you will find "occurrences charts". These charts give you a quick visual representation of the number of errors occurring over time (usually a 24-hour period). While useful for seeing a short trend, the charts themselves can often be a bit misleading. Thus, you will need to use your brain and investigate what is going on before screaming "RED ALERT".
The runnable infrastructure currently has five environments (at the time of this writing):
- alpha (production)
- beta (infrastructure testing)
- delta (new production, will be replacing alpha)
- gamma (new infrastructure testing, will replace beta)
- staging (our internal dogfooding environment)
As such, it is important to note which environment you are looking at when exploring the errors in Rollbar. The environment dropdown (shown above) allows you to choose specific environments to investigate. Mostly you'll want to keep an eye on production, but it may also be fruitful to occasionally look at beta and staging to get an idea of what errors might be coming up (usually these environments are slightly ahead of production, code-wise).
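For reference, the environment list and the review advice above can be written down as plain data. This is a sketch only: the `ENVIRONMENTS` table restates the list from this handbook, while `review_order` and `in_environment` are hypothetical helpers, and the `(title, environment)` pairs stand in for whatever form your notes take.

```python
# Environment name -> role, as listed in this handbook.
ENVIRONMENTS = {
    "alpha": "production",
    "beta": "infrastructure testing",
    "delta": "new production, will be replacing alpha",
    "gamma": "new infrastructure testing, will replace beta",
    "staging": "internal dogfooding environment",
}

# Production comes first; beta and staging are worth an occasional look
# because they usually run slightly ahead of production, code-wise.
PRIMARY = ("alpha",)
EARLY_WARNING = ("beta", "staging")


def review_order():
    """Environments in the order suggested above."""
    return [*PRIMARY, *EARLY_WARNING]


def in_environment(items, environment):
    """Keep only the (title, environment) pairs for one known environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return [item for item in items if item[1] == environment]
```

Rejecting unknown environment names keeps a typo from silently producing an empty list and being mistaken for a clean dashboard.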