Infra Ticket Queue
- Ticket queue kanban board
- Build and performance dashboards
- Current oncall (Google internal link)
The Flutter infra team uses a ticket queue to manage operational tasks, such as:
- Configuration requests: "Please add a new LUCI builder"
- Outage/degradation notifications: "Build dashboard is down/slow"
- Requests for help: "How do I debug a test on devicelab machine X?"
This allows the team to separate their engineering work from "toil" work. It also lets them see which types of tasks are common and worth automating.
IMPORTANT: Whenever you have a request for the infra team, please file a ticket instead of contacting team members directly, even for seemingly trivial things or even if an individual has done the same thing for you in the past. Infra oncall will be there to handle your request, and it lets non-oncall team members focus on their engineering tasks.
- Open a new infra issue. (That template summarizes the information on this page.)
- Add a descriptive title. A message like "Add a LUCI builder for linux web engine" or "Debug gallery startup" is much more helpful than "quick request" or "test doesn't work?".
- Clearly describe the issue or request in the description field. For example, if a ticket requests running several commands on the bots, it should explain why, which commands are needed, on which bots, and how to verify the results.
- Add the "team: infra" label and a priority label:
- P0 (immediate): Such as a build break or regression.
- Fix as soon as possible, before any other work.
- Should be very rare, and only used when critical work is blocked without a workaround.
- Ideally is downgraded to P1 as soon as a workaround is found.
- P1 (high): Users are suffering but not blocked; or, an immediate-level incident will happen if this is not addressed (e.g., almost out of quota).
- Fix today (8 business hours).
- Degraded service (Build bots work but are slow to start).
- Time-sensitive requests.
- Should be relatively rare.
- Anything below P1 is not suitable for the infra ticket queue and will be treated as a normal infra bug.
- Add the project "Infra Ticket Queue". This step is what actually puts the ticket into the queue!
- Click the create button. No need to set an assignee; infra oncall will handle all new tickets.
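The filing steps above amount to a short checklist. As a minimal sketch only (the `DraftTicket` class, its field names, and `missing_fields` are hypothetical and not part of any Flutter or GitHub tooling, and the priority labels are assumed to be named exactly "P0" and "P1"), a pre-submission check might look like this:

```python
from dataclasses import dataclass, field

# Hypothetical model of a ticket draft; field names are illustrative only.
@dataclass
class DraftTicket:
    title: str = ""
    description: str = ""
    labels: set = field(default_factory=set)
    projects: set = field(default_factory=set)


# Assumed label names; anything below P1 is treated as a normal infra bug.
QUEUE_PRIORITIES = {"P0", "P1"}


def missing_fields(ticket: DraftTicket) -> list:
    """Return the checklist items the draft does not yet satisfy."""
    problems = []
    if not ticket.title.strip():
        problems.append("descriptive title")
    if not ticket.description.strip():
        problems.append("description")
    if "team: infra" not in ticket.labels:
        problems.append('"team: infra" label')
    if not (ticket.labels & QUEUE_PRIORITIES):
        problems.append("priority label (P0 or P1)")
    if "Infra Ticket Queue" not in ticket.projects:
        problems.append('"Infra Ticket Queue" project')
    return problems
```

An empty result means the draft covers every step; note that the check cannot judge whether the title is actually descriptive, only that one exists.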
Below are instructions for infra oncall on how to process the ticket queue. They describe the processes that oncall should follow, along with useful tips and tricks. If you are on call and see a problem or omission on this page, please change it!
SLO: A ticket in the queue will be triaged within 4 business hours.
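To make the SLO arithmetic concrete, here is a rough sketch of computing a deadline in business hours. This page does not define business hours, so the 09:00 to 17:00, Monday to Friday window below is an assumption (chosen to match the "8 business hours" day used for P1), holidays are ignored, and `add_business_hours` is a hypothetical helper rather than an existing tool:

```python
from datetime import datetime, time, timedelta

# Assumed working window; the page itself does not define business hours.
DAY_START = time(9, 0)
DAY_END = time(17, 0)


def add_business_hours(start: datetime, hours: float) -> datetime:
    """Return the moment `hours` business hours after `start`."""
    remaining = timedelta(hours=hours)
    current = start
    while True:
        # Weekends and evenings roll over to the start of the next day.
        if current.weekday() >= 5 or current.time() >= DAY_END:
            next_day = current.date() + timedelta(days=1)
            current = datetime.combine(next_day, DAY_START)
            continue
        # Early mornings are clamped to the start of the working window.
        if current.time() < DAY_START:
            current = datetime.combine(current.date(), DAY_START)
        end_of_day = datetime.combine(current.date(), DAY_END)
        available = end_of_day - current
        if remaining <= available:
            return current + remaining
        remaining -= available
        current = end_of_day
```

Under these assumptions, a ticket filed on a Friday at 15:00 would need to be triaged by 11:00 on the following Monday.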
At the beginning of each week, oncall should sweep 10 new infra issues in case any of them ought to be put into the infra ticket queue. This sweep is needed while people are still getting familiar with this process.
New, untriaged tickets will be in the New column on the kanban board.
When a new ticket comes in, an oncall should:
- Duplicate or close the ticket immediately if appropriate.
- Move it out of the ticket queue if it's actually a feature request or bug report that will require more than 8 business hours.
- Check the values of the critical fields (title, comments, priority)
- Adjust the values if necessary: if the summary isn't clear, clarify it; if the priority doesn't fit the request, adjust it.
- The field values are there purely for the benefit of oncall (you and the next person), so make them work for you.
- Move triaged issues to the Triaged column.
NOTE: This is meant to be quick and mechanical, and doesn't require a lot of thought. Even if you don't have time to take any immediate action, it's helpful to keep the new column empty. Your marking it triaged also lets the ticket creator know that someone has seen it.
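The triage steps above are mechanical enough to write down as a small decision function. This is only an illustrative sketch: `TriageAction`, `triage`, and the three inputs are hypothetical names, and the inputs stand in for judgments that oncall makes by reading the ticket.

```python
from enum import Enum


class TriageAction(Enum):
    CLOSE = "close or mark as duplicate"
    MOVE_OUT = "move out of the ticket queue"
    TRIAGED = "fix up fields, then move to the Triaged column"


def triage(is_duplicate_or_invalid: bool,
           is_feature_or_bug: bool,
           estimated_business_hours: float) -> TriageAction:
    """Pick the triage outcome for a new ticket, in the order listed above."""
    if is_duplicate_or_invalid:
        return TriageAction.CLOSE
    # Only feature requests or bug reports needing more than 8 business
    # hours leave the queue; smaller ones stay and get triaged.
    if is_feature_or_bug and estimated_business_hours > 8:
        return TriageAction.MOVE_OUT
    return TriageAction.TRIAGED
```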
Once all tickets have been triaged, oncall's job is to service them in order by priority: P0 > P1.
In reality, the order will also depend on your expertise and how much time you have. If a lower-priority ticket can be resolved in a couple of minutes, don't feel like it has to wait behind a higher-priority ticket.
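As a starting point, the nominal P0 > P1 order can be sketched as a sort. The `Ticket` class and `service_order` function are hypothetical, and breaking ties by oldest creation time is an assumption that this page does not state:

```python
from dataclasses import dataclass
from datetime import datetime

# Lower rank is serviced first: P0 > P1.
PRIORITY_RANK = {"P0": 0, "P1": 1}


# Hypothetical ticket record; field names are illustrative only.
@dataclass
class Ticket:
    number: int
    priority: str
    created: datetime


def service_order(tickets):
    """Sort by priority, then oldest first (the tie-break is an assumption)."""
    return sorted(tickets, key=lambda t: (PRIORITY_RANK[t.priority], t.created))
```

This gives only the default order; oncall is still free to pick off a quick lower-priority ticket first, as described above.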
From the top of the priority queue down, oncall makes sure that someone is working on each ticket. It's important to keep things moving if you see that they're stuck; try CC'ing people with more information and making it clear what a given ticket is blocked on.
- Set the assignee. Read the guideline first. In addition:
- All P0 and P1 tickets must be assigned to someone who is working on them.
- Start working on the ticket
- Set the status to "in progress", typically by dragging its card to the In progress column.
- Add a comment, if you think it helps.
- Keep the ticket updated with progress, especially for high-priority tickets
- P0 tickets should be updated once every hour or so, since many people may be blocked and will be waiting for updates.
- P1 tickets should be updated as you see fit, but especially if it's going to take longer than promised or expected.
- Update the ticket's workflow state when you reach a stopping point
- If it's closed, move it to Done.
- If it's blocked on something else, make it clear in the comments.
- If you (or the assignee) can no longer work on the ticket, find a new owner or move it back to New.
- If you're heading home for the day, add a final status update to P0 and P1 priority tickets, especially if they're not going to get resolved in their urgency window.
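The update cadence above could be checked with a small helper. This is a sketch under stated assumptions: `is_update_overdue` and `UPDATE_INTERVAL` are hypothetical names, and "once every hour or so" is read as a strict one-hour interval. P1 has no fixed cadence on this page, so it is never flagged here.

```python
from datetime import datetime, timedelta

# Only P0 has a stated cadence ("once every hour or so"); P1 updates are
# left to the assignee's judgment, so P1 has no entry.
UPDATE_INTERVAL = {"P0": timedelta(hours=1)}


def is_update_overdue(priority: str, last_update: datetime, now: datetime) -> bool:
    """Return True if a ticket has gone longer than its cadence without an update."""
    interval = UPDATE_INTERVAL.get(priority)
    if interval is None:
        return False
    return now - last_update > interval
```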
NOTE: When servicing a ticket as an oncall, remember that it is not your responsibility to fix every ticket, only to make sure that someone is working on it. You may not be the most appropriate person to do the work, but you make sure the work gets done. This goes for new tickets as well as older tickets that someone has claimed but dropped on the floor -- some of those tickets may even have been created and assigned during the previous oncall shift, so it's important to check up on older tickets and re-assign if necessary.
On Friday during the 15-minute handoff meeting, please add comments and update the status on any tickets on which you have context, to help the next oncall person ramp up and understand the workload.