[FEATURE] Index Maintainer activity status into Metrics cluster #75
Comments
Thanks @bshien I would even go with splitting the documents at the event level, by adding The raw event data collected #76 already has the event name. By segregating the documents by user (maintainer), repository, and event name, we can obtain more granular metrics for maintainers, allowing us to infer whether they are active or inactive.
Thanks, I've added this to the issue.
You will want to store information about maintainers in the metrics store, so this should be the result of a cron process that runs regularly. Note that maintainers can be added and removed at any time, and sometimes they can be re-added and re-removed. When viewing dashboards that pertain to maintainers we'll want to see the state at the time of the graphs being shown, not the latest state.
IMO maintainer is just a subset of any user. I suggest representing all users in the metrics store, then adding a relationship to them that says maintainer from date X to date Y (or still a maintainer) for a given repo.
Note that you'll want to see active or inactive at the time being displayed.
Hey dB, the dashboard will have the current repo maintainer stats (when the repo filter is applied; otherwise it will show all the maintainers). Coming from #76, the metrics cluster will already have the raw information (inferred from the S3 data lake) for all the users and for all the targeted maintainer events. The additional slope logic defined above will only target the maintainers from the MAINTAINERS.md and add a flag. Later we can have another dashboard for user engagement (with the data already part of the metrics cluster #76) which can target all users and can be used to nominate users as maintainers.
This is problematic because the maintainer state is "now" and is incorrect when you travel in time. I think you need to store maintainer status every time you do a sweep of MAINTAINERS.md with a date, for example This way you can build dashboards of maintainer growth, have data about maintainers at any given day, etc.
Yes we can index
This index, which records whether each maintainer is active or not (the maintainer stats), can still be used to go back in time and filter the documents. The output is then the set of documents with the maintainer stats for the set of maintainers at that point in time. Thank you
The index that this lambda will create,
@bshien (from dB's point) I assume we should also be able to build dashboards of maintainer growth, have data about maintainers at any given day, etc.?
Because the above design indexes snapshots, you can build dashboards of the number of maintainers over time (by counting the docs each day with unique username), and also go back and see the set of documents representing maintainer statuses on any day in the past.
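As a rough illustration of the snapshot counting described in the comment above, here is a minimal Python sketch. It assumes each snapshot document carries a date field and a login field; the field names snapshot_date and github_login used here are hypothetical placeholders, since the thread does not show the final document structure.

```python
from collections import defaultdict


def maintainers_per_day(snapshot_docs):
    """Count unique maintainers per snapshot day.

    snapshot_docs: iterable of dicts. The keys "snapshot_date" and
    "github_login" are assumed names, not confirmed by the issue.
    """
    logins_by_day = defaultdict(set)
    for doc in snapshot_docs:
        # A maintainer may have several docs per day (one per event type),
        # so collect logins in a set to count each maintainer once.
        logins_by_day[doc["snapshot_date"]].add(doc["github_login"])
    return {day: len(logins) for day, logins in sorted(logins_by_day.items())}


docs = [
    {"snapshot_date": "2024-07-01", "github_login": "alice", "event": "issues"},
    {"snapshot_date": "2024-07-01", "github_login": "alice", "event": "label"},
    {"snapshot_date": "2024-07-01", "github_login": "bob", "event": "issues"},
    {"snapshot_date": "2024-07-02", "github_login": "alice", "event": "issues"},
]
counts = maintainers_per_day(docs)
```

In a real dashboard this would more likely be an aggregation in the metrics cluster itself; the sketch only shows why counting unique usernames per day works on snapshot data.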
Is your feature request related to a problem?
Coming from #57
As a prerequisite for #73 and opensearch-project/automation-app#8, there needs to be data in the Metrics cluster with information about each maintainer's repo, name, affiliation, the date they were last engaged, and their inactivity status.
What solution would you like?
An index created in the Metrics OpenSearch cluster called maintainer_engagement, which will have documents with this structure:
To create these documents, there should be a lambda running periodically (daily or weekly) that will use the github-activity-events index (from: #76) to collect/calculate the required fields for each document and index these to the maintainer_engagement index.
This lambda should:
MAINTAINERS.md for each repository in the OpenSearch project. This will yield the repo, name, github_login, and affiliation fields.
github-activity-events index for each repo, maintainer, and event type.
created_at field for each GitHub Event document to get the time_last_engaged
time_last_engaged and how active the repo is (see below).
To address the problem of waiting longer to flag maintainers of less active repos:
For the inactivity calculation, we can use a linear equation, y = m*x + b, where:
x = the total number of events in a repo
y = the amount of time a maintainer is inactive before we flag them as inactive
And we can calculate the slope (m) and the y-intercept (b) with two points:
(# of events in the repo with the least events, higher bound time to wait (365 days))
(# of events in the repo with the most events, lower bound time to wait (90 days))
This way we have an equation to calculate how long to wait for each repo: we wait longer on repos that are less active and shorter on repos that are more active.
Example: Imagine there is a repo with 600 events. You can use the two starting points to calculate the slope and y-intercept of the linear equation. You can then use the linear equation to calculate how long to wait until maintainers should be flagged.
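The calculation above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name wait_days and the event counts for the least and most active repos (100 and 1100) are made up for the example, so the result is not the value shown in the original graph. The 365-day and 90-day bounds are the ones given above.

```python
HIGHER_BOUND_DAYS = 365.0  # wait time for the repo with the fewest events
LOWER_BOUND_DAYS = 90.0    # wait time for the repo with the most events


def wait_days(repo_events, min_events, max_events):
    """Days to wait before flagging, from the line through
    (min_events, HIGHER_BOUND_DAYS) and (max_events, LOWER_BOUND_DAYS)."""
    if max_events == min_events:
        # Degenerate case (all repos equally active): the slope is
        # undefined, so fall back to the most lenient bound (an assumption).
        return HIGHER_BOUND_DAYS
    m = (LOWER_BOUND_DAYS - HIGHER_BOUND_DAYS) / (max_events - min_events)
    b = HIGHER_BOUND_DAYS - m * min_events
    return m * repo_events + b


# Hypothetical event counts for the least and most active repos.
threshold = wait_days(600, min_events=100, max_events=1100)
```

With these made-up endpoints the slope is negative, so a repo with more events gets a shorter wait, which is the behaviour described above.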
Now that we know how long to wait, we then calculate inactivity for each event. Let's say that a maintainer has been the actor for the issues, pull_request, and label events within the last 201.01 days.
(The dots in the graph represent when the maintainer last triggered each event)
We would consider this maintainer active and we wouldn't flag them as inactive. Now let's say some time has passed and they have not triggered any new events.
Though some events have passed the threshold, because there is still a single event within the threshold, we still consider the maintainer as active and do not flag.
Now let's say even more time passes without activity:
Now that all events are past the threshold, we flag the maintainer as inactive.
Let's say the maintainer raises an issue in the repo:
Now that an event is within the threshold, we now consider the maintainer as active.
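The walkthrough above can be condensed into a small Python sketch. It is an illustration under assumptions, not the lambda's actual code: the function name inactivity_flags and the input shape (a mapping from event type to the maintainer's last event time) are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def inactivity_flags(last_engaged, now, threshold_days):
    """Return (per-event flags, aggregate flag).

    last_engaged: dict of event type -> datetime of the maintainer's
    last event of that type, or None if they never triggered it.
    An event type is flagged when its last event is older than the
    threshold; the maintainer is inactive only if every type is flagged.
    """
    cutoff = now - timedelta(days=threshold_days)
    per_event = {
        event: (ts is None or ts < cutoff)
        for event, ts in last_engaged.items()
    }
    return per_event, all(per_event.values())


now = datetime(2024, 7, 1, tzinfo=timezone.utc)
last = {
    "issues": datetime(2024, 6, 1, tzinfo=timezone.utc),
    "pull_request": datetime(2023, 1, 1, tzinfo=timezone.utc),
    "label": datetime(2023, 6, 1, tzinfo=timezone.utc),
}
per_event, inactive = inactivity_flags(last, now, threshold_days=201.01)
```

Here two event types are past the threshold, but the recent issues event keeps the aggregate flag false, matching the case where a single event within the threshold is enough to count as active.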
Aggregate all event types to a single document which will definitively say whether a maintainer is inactive.
For each event type and the aggregate event, index these documents to the maintainer_engagement index.
Additional Info
Actionable Data
Because the source of the raw event data is the data lake that we have just started collecting in real time, the above design would yield fully actionable data only after a period of months (however long we decide the HIGHER_BOUND of time to wait until flagging to be).
In the same way, the inactivity calculation will not yield any actionable data until another period of time has passed (however long we decide the LOWER_BOUND of time to wait until flagging to be).
Edge case regarding activity calculation
Because there is a different linear equation calculated every time the lambda is run, there may be a case where one day, a maintainer is marked as inactive, but the next day, as the repo has become less active relative to the rest of the repos, the time to wait has increased so that the maintainer is marked as active again. While possible, this is a rare case that doesn't have much impact on the effectiveness of the goal of the calculation: to remove inactive maintainers.
Do you have any additional context?
#57