
Incident Management Doc #67 (Open)

jjasghar wants to merge 1 commit into main from jjasghar/incident_management
Conversation

jjasghar (Member)

This is the first pass at creating an incident management process for InstructLab. We will have problems, and this is the beginning of a way to communicate them out.

Thanks to @coderanger for the inspiration for this.

@leseb (Contributor) left a comment


This is the first time I've seen something so formalized for an open source project. It feels very company-oriented, but I might be missing something. Another point is the operational nature of the Incident Commander (IC) role as described. Currently, InstructLab, at least in terms of the CLI, doesn't have any long-running services, so I'm not sure if we would experience outages in the same way. However, I do agree that having a process for post-mortem analysis is valuable for addressing issues. I have mixed feelings about this but want to hear from others as well :). Thanks for the write-up!

(Resolved review comments on docs/incident-management.md; two marked outdated.)
@jjasghar (Member Author)

> This is the first time I've seen something so formalized for an open source project. It feels very company-oriented, but I might be missing something. Another point is the operational nature of the Incident Commander (IC) role as described. Currently, InstructLab, at least in terms of the CLI, doesn't have any long-running services, so I'm not sure if we would experience outages in the same way. However, I do agree that having a process for post-mortem analysis is valuable for addressing issues. I have mixed feelings about this but want to hear from others as well :). Thanks for the write-up!

Ah, see, it turns out we do have a bunch of back-end services required for us to run this project. We have a team of backend engineers ensuring the model training and tuning is done. This requires communication between two different groups of people, and we have no formal way to communicate (hence this process).
We had a "major" incident just this last week, and when I asked for some information about it, we noticed we didn't have a plan for these types of situations.


From docs/incident-management.md:

> Anything with customer or community visible negative consequences. In most cases this
> will be an outage or downtime event but some non-outage incidents include severe
> performance degradations and security events.
A member left a comment:

I'm having a hard time understanding what would qualify as an incident under this proposal. Could you come up with some concrete examples?

It seems to be oriented toward running a service where some level of availability is to be expected. That's not really true here. There's stuff running, but not in service to a public group of some kind.

Before commenting further, it would help if we had a sample set of scenarios to use as context for discussion.

A member left a comment:

I just noticed I'm duplicating some discussion between you and @leseb - sorry about that.

> Ah, see, it turns out we do have a bunch of back-end services required for us to run this project. We have a team of backend engineers ensuring the model training and tuning is done. This requires communication between two different groups of people, and we have no formal way to communicate (hence this process).
> We had a "major" incident just this last week, and when I asked for some information about it, we noticed we didn't have a plan for these types of situations.

But none of this is visible to the public right now, so I'm not sure a public process like this makes sense.

A member left a comment:

@russellb I take your comment to mean "once our backend services like model training are publicly visible, we should have a process for dealing with incidents"

I agree with you!

However, as we did have a hiccup that caused @jjasghar to open this PR, should we at least have some documented process for dealing with said hiccups amongst the maintainer team? I am sure we have one that is informal and ad hoc, but I'd like to see it written down (for my ignorant self).

I think @jjasghar can maybe just give me a summary of what happened and how it was remediated/dealt with in this issue. The deep specifics are unlikely to be interesting or useful for GitHub history purposes. :D

A member left a comment:

I don't know anything about the example in question. It would probably help to speak about it more concretely instead of in the hypothetical.

Commit message:

> This is the first pass at creating an incident management process
> for InstructLab. We will have problems, and this is the beginning
> of a way to communicate them out.
>
> Thanks to @coderanger for the inspiration for this.
>
> Signed-off-by: JJ Asghar <[email protected]>

@jjasghar force-pushed the jjasghar/incident_management branch from d4aa768 to 6aee764 on June 6, 2024 at 00:40.
@hickeyma (Member) left a comment

I am in agreement with @leseb and @russellb. Some of my thoughts:

  • This seems like a process for when you have service(s) that you support. Will the open source community be providing such support?
  • I would think the backend services will remain, for the most part, to run the workflow that performs model alignment and tuning of a new model to be published. Do we envision exposing this externally?
  • Processing of taxonomy PRs should be treated like any open source contribution: push, CI/CD, review/re-commit cycle, and then merge or reject. Any capabilities run for the PR (e.g. SDG, training) should be treated like CI/CD in any other project: raise a bug if the gate is broken.
  • What was the issue (or issues) that instigated this proposal? This is important context for deciding whether or not the proposal is required.

@nathan-weinberg (Member)

@jjasghar what's the status of this doc?

@jjasghar (Member Author)

Haven't touched this since we created it. When we start hosting some level of service for our community, we will need a standard procedure, but I think we can let this sit around for the time being.
