
Incident Management Doc #67 (Open)

jjasghar wants to merge 1 commit into main from jjasghar/incident_management
Conversation

jjasghar (Member)

This is the first pass at creating an incident management process for InstructLab. We will have problems, and this is the beginning of a way to communicate them out.

Thanks to @coderanger for the inspiration for this.

@leseb (Contributor) left a comment


This is the first time I've seen something so formalized for an open source project. It feels very company-oriented, but I might be missing something. Another point is the operational nature of the Incident Commander (IC) role as described. Currently, InstructLab, at least in terms of the CLI, doesn't have any long-running services, so I'm not sure if we would experience outages in the same way. However, I do agree that having a process for post-mortem analysis is valuable for addressing issues. I have mixed feelings about this but want to hear from others as well :). Thanks for the write-up!

(Resolved review comments on docs/incident-management.md; two marked outdated.)
@jjasghar (Member Author)

> This is the first time I've seen something so formalized for an open source project. It feels very company-oriented, but I might be missing something. Another point is the operational nature of the Incident Commander (IC) role as described. Currently, InstructLab, at least in terms of the CLI, doesn't have any long-running services, so I'm not sure if we would experience outages in the same way. However, I do agree that having a process for post-mortem analysis is valuable for addressing issues. I have mixed feelings about this but want to hear from others as well :). Thanks for the write-up!

Ah, see, it turns out we do have a bunch of back-end services required for us to run this project. We have a team of backend engineers ensuring the model training and tuning is done. This requires communication between two different groups of people, and we have no formal way to communicate (hence this process).
We had a "major" incident just this last week, and when I asked for some information about it, we noticed we didn't have a plan for these types of situations.


From docs/incident-management.md:

> Anything with customer or community visible negative consequences. In most cases this
> will be an outage or downtime event but some non-outage incidents include severe
> performance degradations and security events.
A member left a comment:

I'm having a hard time understanding what would qualify as an incident under this proposal. Could you come up with some concrete examples?

It seems to be oriented toward running a service where some level of availability is to be expected. That's not really true here. There's stuff running, but not in service to a public group of some kind.

Before commenting further, it would help if we had a sample set of scenarios to use as context for discussion.

A member left a comment:

I just noticed I'm duplicating some discussion between you and @leseb - sorry about that.

> Ah, see, it turns out we do have a bunch of back-end services required for us to run this project. We have a team of backend engineers ensuring the model training and tuning is done. This requires communication between two different groups of people, and we have no formal way to communicate (hence this process).
> We had a "major" incident just this last week, and when I asked for some information about it, we noticed we didn't have a plan for these types of situations.

But none of this is visible to the public right now, so I'm not sure a public process like this makes sense.

A member left a comment:

@russellb I take your comment to mean "once our backend services like model training are publicly visible, we should have a process for dealing with incidents"

I agree with you!

However, as we did have a hiccup that caused @jjasghar to open this PR, should we at least have some documented process for dealing with said hiccups amongst the maintainer team? I am sure we have one that is informal and ad hoc, but I'd like to see it written down (for my ignorant self).

I think @jjasghar can maybe just give me a summary of what happened and how it was remediated/dealt with in this issue. The deep specifics are unlikely to be interesting or useful for GitHub history purposes. :D

A member left a comment:

I don't know anything about the example in question. It would probably help to speak about it more concretely instead of in the hypothetical.

Commit message:

> This is the first pass at creating an incident management process
> for InstructLab. We will have problems, and this is the beginning
> of a way to communicate them out.
>
> Thanks to @coderanger for the inspiration for this.
>
> Signed-off-by: JJ Asghar <[email protected]>

@jjasghar force-pushed the jjasghar/incident_management branch from d4aa768 to 6aee764 on June 6, 2024 at 00:40.
@hickeyma (Member) left a comment

I am in agreement with @leseb and @russellb. Some of my thoughts:

  • This seems like a process for when you have service(s) that you support. Will the open source community be providing such support?
  • I would think the backend services will remain, for the most part, to run the workflow that performs model alignment and tuning of a new model to be published. Do we envision exposing this externally?
  • Processing of taxonomy PRs should be treated like any open source contribution: push, CI/CD, review/re-commit cycle, and then merge or reject. Any capabilities run for the PR (e.g. SDG, training) should be treated like CI/CD in any other project: raise a bug if the gate is broken.
  • What was the issue (or issues) that instigated this proposal? This is important context for deciding whether or not the proposal is required.

@nathan-weinberg (Member)

@jjasghar what's the status of this doc?

@jjasghar (Member Author)

Haven't touched this since we created it. When we start hosting some level of service for our community, we will need a standard procedure, but I think we can let this sit around for the time being.
