Incident Management Doc #67
This is the first time I've seen something so formalized for an open source project. It feels very company-oriented, but I might be missing something. Another point is the operational nature of the Incident Commander (IC) role as described. Currently, InstructLab, at least in terms of the CLI, doesn't have any long-running services, so I'm not sure if we would experience outages in the same way. However, I do agree that having a process for post-mortem analysis is valuable for addressing issues. I have mixed feelings about this but want to hear from others as well :). Thanks for the write-up!
Ah, see, it turns out we do have a bunch of back-end services required for us to run this project. We have a team of backend engineers ensuring the model training and tuning is done. This requires communication between two different groups of people, and we have no formal way to communicate. (hence this process)
Anything with customer or community visible negative consequences. In most cases this
will be an outage or downtime event but some non-outage incidents include severe
performance degradations and security events.
I'm having a hard time understanding what would qualify as an incident under this proposal. Could you come up with some concrete examples?
It seems to be oriented toward running a service where some level of availability is to be expected. That's not really true here. There's stuff running, but not in service to a public group of some kind.
Before commenting further, it would help if we had a sample set of scenarios to use as context for discussion.
I just noticed I'm duplicating some discussion between you and @leseb - sorry about that.
> Ah, see, it turns out we do have a bunch of back-end services required for us to run this project. We have a team of backend engineers ensuring the model training and tuning is done. This requires communication between two different groups of people, and we have no formal way to communicate. (hence this process)
We had a "major" incident just this last week, and when I asked for some information about it, we noticed we didn't have a plan for these types of situations.
but none of this is visible to the public right now, so I'm not sure a public process like this makes sense.
@russellb I take your comment to mean "once our backend services like model training are publicly visible, we should have a process for dealing with incidents"
I agree with you!
However, as we did have a hiccup that caused @jjasghar to open this PR, should we at least have some documented process for dealing with said hiccups amongst the maintainer team? I am sure we have one that is informal and ad hoc, but I'd like to see it written down (for my ignorant self).
I think @jjasghar can maybe just give me a summary of what happened and how to remediate/deal with it in this issue. The deep specifics are unlikely to be interesting or useful for GitHub history purposes. :D
I don't know anything about the example in question. It would probably help to speak about it more concretely instead of in the hypothetical.
This is the first pass at creating an incident management process for InstructLab. We will have problems and this is the beginning of the way to communicate them out.
Thanks to @coderanger for the inspiration for this.
Signed-off-by: JJ Asghar <[email protected]>
d4aa768 to 6aee764
I am in agreement with @leseb and @russellb. Some of my thoughts:
- This seems like a process when you have service(s) that you support. Will the open source community be providing such support?
- I would think the backend services will remain, for the most part, there to run the workflow that performs model alignment and tuning of a new model to be published. Do we envision exposing this externally?
- Processing of taxonomy PRs should be treated like any open source contribution: push, CI/CD, review/re-commit cycle and then merge or reject. Any capabilities run (e.g. SDG, training) for the PR should be treated like CI/CD in all projects - raise a bug if the gate is broken.
- What issue(s) instigated this proposal? This is important context for deciding whether the proposal is required or not.
@jjasghar what's the status of this doc?
Haven't touched this since we created it. When we start hosting some level of service for our community we will need some standard procedure, but I think we can have this sit around for the time being.