
Infra Ticket Queue

Tong Wu edited this page Aug 25, 2020 · 4 revisions


Background

The Flutter infra team uses a ticket queue to manage operational tasks, such as:

  • Configuration requests: "Please add a new LUCI builder"
  • Outage/degradation notifications: "Build dashboard is down/slow"
  • Requests for help: "How do I debug a test on devicelab machine X?"

This allows the team to separate their engineering work from "toil" work. It also lets them see which types of tasks are common and worth automating.

IMPORTANT: Whenever you have a request for the infra team, please file a ticket instead of contacting team members directly, even for seemingly trivial things and even if an individual has done the same thing for you in the past. Infra oncall will handle your request, which lets non-oncall team members focus on their engineering tasks.

How to File a Ticket as an Infra Customer

  1. Open a new infra issue. (That template summarizes the information on this page.)
  2. Add a descriptive title. A message like "Add a LUCI builder for linux web engine" or "Debug gallery startup" is much more helpful than "quick request" or "test doesn't work?".
  3. Clearly describe the issue or request in the description field. For example, if a ticket asks for several commands to be run on the bots, it should explain why, which commands are needed, on which bots, and how to verify the results.
  4. Add the "team: infra" label and a priority label:
    • P0 (immediate): Such as a build break or regression.
      • Fix as soon as possible, before any other work.
      • Should be very rare, and only used when critical work is blocked without a workaround.
      • Ideally is downgraded to P1 as soon as a workaround is found.
    • P1 (high): Users are suffering but not blocked; or, an immediate-level incident will happen if this is not addressed (e.g., almost out of quota).
      • Fix today (8 business hours).
      • Degraded service (Build bots work but are slow to start).
      • Time-sensitive requests.
      • Should be relatively rare.
    • Anything below P1 is not suitable for the infra ticket queue and will be treated as a normal infra bug.
  5. Add the project "Infra Ticket Queue". This step is what actually puts the ticket into the queue!
  6. Click the create button. No need to set an assignee; infra oncall will handle all new tickets.
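
The filing steps above can be read as a checklist. The following is a minimal sketch of that checklist in Python, not a tool the infra team provides: the `Ticket` class and the `filing_problems` helper are hypothetical, and the label and project strings are assumed to match the names quoted on this page.

```python
from dataclasses import dataclass, field

# Assumed to match the names quoted in the steps above.
TEAM_LABEL = "team: infra"
QUEUE_PROJECT = "Infra Ticket Queue"
QUEUE_PRIORITIES = {"P0", "P1"}  # anything below P1 is a normal infra bug


@dataclass
class Ticket:
    """Hypothetical in-memory model of an issue, for illustration only."""
    title: str
    description: str
    labels: set = field(default_factory=set)
    projects: set = field(default_factory=set)


def filing_problems(ticket):
    """Return a list of problems; an empty list means the ticket is queue-ready."""
    problems = []
    if not ticket.title.strip():
        problems.append("missing title")
    if not ticket.description.strip():
        problems.append("missing description")
    if TEAM_LABEL not in ticket.labels:
        problems.append('missing "team: infra" label')
    if not (ticket.labels & QUEUE_PRIORITIES):
        problems.append("missing P0 or P1 priority label")
    if QUEUE_PROJECT not in ticket.projects:
        problems.append('not added to the "Infra Ticket Queue" project')
    return problems
```

A check like this cannot judge whether a title is descriptive; it only catches the mechanical omissions that keep a ticket out of the queue.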

How to Serve Tickets as an Infra Oncall

Below are instructions for infra oncall on how to process the ticket queue. They describe the processes that oncall should follow, along with useful tips and tricks. If you are on call and see a problem or omission on this page, please change it!

Triaging

SLO: A ticket in the queue will be triaged within 4 business hours.

At the beginning of each week, oncall should sweep 10 new infra issues in case any of them ought to be put into the infra ticket queue. This is needed while people are still getting familiar with this process.

New, untriaged tickets will be in the New column on the kanban board.

When a new ticket comes in, an oncall should:

  1. Mark the ticket as a duplicate or close it immediately if appropriate.
  2. Move it out of the ticket queue if it's actually a feature request or bug report that will require more than 8 business hours.
  3. Check the values of the critical fields (title, comments, priority).
    • Adjust the values if necessary: if the title isn't clear, clarify it; if the priority doesn't fit the request, adjust it.
    • The field values are there purely for the benefit of oncall (you and the next person), so make them work for you.
  4. Move triaged issues to the Triaged column.
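
Because triage is meant to be mechanical, the steps above can be sketched as a small decision function. This is an illustration only: the function name, its parameters, and the returned strings are hypothetical, and the inputs are judgments that oncall still makes by reading the ticket.

```python
def triage_action(closeable, is_feature_or_bug, estimated_business_hours):
    """Map oncall's judgments about a new ticket to the next triage step.

    closeable: the ticket is a duplicate or can be closed right away.
    is_feature_or_bug: it is really a feature request or bug report.
    estimated_business_hours: rough estimate of the work involved.
    """
    if closeable:
        return "mark as duplicate or close"
    # Longer feature requests and bug reports do not belong in the queue.
    if is_feature_or_bug and estimated_business_hours > 8:
        return "move out of the ticket queue"
    return "check title, comments, priority; then move to Triaged"
```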

NOTE: This is meant to be quick and mechanical, and doesn't require a lot of thought. Even if you don't have time to take any immediate action, it's helpful to keep the new column empty. Your marking it triaged also lets the ticket creator know that someone has seen it.

Serving

Once all tickets have been triaged, oncall's job is to service them in order by priority: P0 > P1.

In reality, the order will also depend on your expertise and how much time you have. If a lower-priority ticket can be resolved in a couple of minutes, it doesn't have to wait behind a higher-priority ticket.
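
As a starting point, the default P0 > P1 ordering can be sketched as a stable sort. The tuple layout and the `serving_order` name are assumptions made for this example; the page does not prescribe any tooling.

```python
# Lower rank is served first; only P0 and P1 belong in the queue.
PRIORITY_RANK = {"P0": 0, "P1": 1}


def serving_order(tickets):
    """Order (ticket_id, priority) tuples by priority.

    sorted() is stable, so tickets with the same priority keep
    their incoming order (for example, oldest first).
    """
    return sorted(tickets, key=lambda ticket: PRIORITY_RANK[ticket[1]])
```

This gives only the default order. Oncall can still pull a quick lower-priority ticket forward, as described above.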

From the top of the priority queue down, oncall makes sure that someone is working on each ticket. It's important to keep things moving if you see that they're stuck; try CC'ing people with more information and making it clear what a given ticket is blocked on.

  1. Set the assignee. Read the guideline first. In addition:
    • All P0 and P1 tickets must be assigned to someone who is working on them.
  2. Start working on the ticket
    • Set the status to "in progress", typically by dragging its card to the In progress column.
    • Add a comment, if you think it helps.
  3. Keep the ticket updated with progress, especially for high-priority tickets
    • P0 tickets should be updated once every hour or so, since many people may be blocked and will be waiting for updates.
    • P1 tickets should be updated as you see fit, but especially if it's going to take longer than promised or expected.
  4. Update the ticket's workflow state when you reach a stopping point
    • If it's closed, move it to Done.
    • If it's blocked on something else, make it clear in the comments.
    • If you (or the assignee) can no longer work on the ticket, find a new owner or move it back to New.
    • If you're heading home for the day, add a final status update to P0 and P1 tickets, especially if they're not going to get resolved in their urgency window.

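
The hourly update guideline for P0 tickets can be sketched as a simple staleness check. The one-hour threshold is an assumed reading of "once every hour or so", and the function name is hypothetical. P1 tickets are left out because their update cadence is left to oncall's judgment.

```python
from datetime import datetime, timedelta

# Assumed threshold; the guideline itself is only "every hour or so".
P0_UPDATE_INTERVAL = timedelta(hours=1)


def p0_needs_update(last_update, now):
    """Return True if a P0 ticket has gone an hour or more without an update."""
    return now - last_update >= P0_UPDATE_INTERVAL
```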
NOTE: When servicing a ticket as an oncall, remember that it is not your responsibility to fix every ticket, only to make sure that someone is working on it. You may not be the most appropriate person to do the work, but you make sure the work gets done. This goes for new tickets as well as older tickets that someone has claimed but dropped on the floor -- some of those tickets may even have been created and assigned during the previous oncall shift, so it's important to check up on older tickets and re-assign if necessary.

Handoff

On Friday during the 15-minute handoff meeting, please add comments and update the status on any tickets on which you have context, to help the next oncall person ramp up and understand the workload.
