In this project, I analyzed the opened and closed bugs of PyTorch and TensorFlow over the last two years (2021-2022).
I analyzed 5 properties of every issue in both projects:
- lifetime(days): Duration (in days) between issue creation and close time. This metric helps to understand how long it takes to close an issue. If this duration is high, the corresponding issues may need to be broken down further.
- first_response_duration(days): Duration (in days) between issue creation and first comment time. Although there are many forms of response (reactions, commits, etc.), I considered only comments for simplicity. This metric indicates how responsive the developer community is, or whether issues are ignored for a long time after being posted.
- number_of_comments: Number of human (non-bot) comments on an issue. It helps to understand how much discussion is needed to close the issue. Overall, it gives hints about community engagement.
- number_of_labels: Number of labels on an issue. Labels are important for tracking an issue within a project, so this metric gives an idea of how many labels are used per issue.
- number_of_assignees: Number of assignees on an issue. This metric helps in understanding issue scope; if an issue needs more assignees than expected, it might be critical or have a larger scope than anticipated.

The analyzed data are in analysis/pytorch_issues_analysis.csv and analysis/tensorflow_issues_analysis.csv.
In addition, analysis/repository_analysis.csv contains the following self-explanatory data for every project (a computation sketch follows the list):
- median_lifetime(days)
- median_first_response(days)
- avg_number_comments_per_issue
- percentage_of_issues_with_no_assignees
- avg_number_of_labels_per_issue
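These repository-level figures can be derived directly from the per-issue CSVs. A minimal pandas sketch, assuming the column names match the property names listed above:

```python
import pandas as pd

def summarize(path):
    """Compute the repository-level summary metrics from a per-issue CSV."""
    df = pd.read_csv(path)
    return {
        "median_lifetime(days)": df["lifetime(days)"].median(),
        "median_first_response(days)": df["first_response_duration(days)"].median(),
        "avg_number_comments_per_issue": df["number_of_comments"].mean(),
        "percentage_of_issues_with_no_assignees": (df["number_of_assignees"] == 0).mean() * 100,
        "avg_number_of_labels_per_issue": df["number_of_labels"].mean(),
    }

repos = ["pytorch", "tensorflow"]
summary = pd.DataFrame(
    [summarize(f"analysis/{repo}_issues_analysis.csv") for repo in repos],
    index=repos,
)
summary.to_csv("analysis/repository_analysis.csv")
```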
- Data fetch (see the sketch after this list)
  - Originally, the data are fetched through the GitHub issues API with the query parameter state=closed.
  - Data are fetched page by page.
  - The original data are stored in the pytorch_issues.json and tensorflow_issues.json files.
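The pagination loop looks roughly like the sketch below. The fetch_closed_issues helper and the token parameter are illustrative, not the exact code of this repository:

```python
import requests

def fetch_closed_issues(repo, token):
    """Fetch all closed issues of a repository page by page.
    `repo` is e.g. "pytorch/pytorch"; `token` is a GitHub personal access token."""
    issues, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/issues",
            params={"state": "closed", "per_page": 100, "page": page},
            headers={"Authorization": f"token {token}"},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # an empty page means all issues have been fetched
            break
        issues.extend(batch)
        page += 1
    return issues
```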
- Data filter (a filtering sketch follows)
  - The GitHub issues API returns both bugs and pull requests. To retrieve only bugs, the rows whose pull_request field is null are kept.
  - Data are also filtered on the created_at and closed_at datetime fields being after January 1, 2021.
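In pandas, both filters can be expressed as boolean masks. A minimal sketch, assuming `issues` is the list returned by the fetch step above:

```python
import pandas as pd

df = pd.DataFrame(issues)

# Issues that are actually pull requests carry a non-null pull_request object.
df = df[df["pull_request"].isna()]

# Keep only issues created and closed after January 1, 2021.
df["created_at"] = pd.to_datetime(df["created_at"])
df["closed_at"] = pd.to_datetime(df["closed_at"])
cutoff = pd.Timestamp("2021-01-01", tz="UTC")
df = df[(df["created_at"] > cutoff) & (df["closed_at"] > cutoff)]
```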
- Construct the initial working dataframe (see the threading sketch below)
  - The necessary columns ('number', 'created_at', 'closed_at', 'comments', 'assignees', 'labels') are selected from the filtered dataframe above.
  - Multi-threading is used for comments retrieval, as the comments data are huge and took a long time in a single thread.
  - A new column, first_comment_created_at, is created with the first comment time of every issue.
  - Comments data are stored in the pytorch_comments.csv and tensorflow_comments.csv files.
  - Rows having null values are removed.
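Comment retrieval is network-bound, so a thread pool speeds it up considerably despite Python's GIL. A minimal sketch continuing the dataframe above; the first_comment_time helper is hypothetical and relies on the GitHub API returning comments oldest-first by default:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def first_comment_time(repo, issue_number, token):
    """Return the creation time of the first comment on an issue, or None."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/issues/{issue_number}/comments",
        params={"per_page": 1},  # one comment is enough: they come oldest-first
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
    comments = resp.json()
    return comments[0]["created_at"] if comments else None

# Fetch first-comment times concurrently, one thread per in-flight request.
with ThreadPoolExecutor(max_workers=8) as pool:
    df["first_comment_created_at"] = list(
        pool.map(lambda n: first_comment_time(repo, n, token), df["number"])
    )
```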
- Calculating lifetime (all five metrics are sketched after this list)
  - Difference between issue creation and close time.
- Calculating first response
  - Difference between issue creation and first comment time.
- Calculating number of comments
  - Number of comments for every issue.
- Calculating number of assignees
  - Length of the assignees list per issue.
- Calculating number of labels
  - Length of the labels list per issue.
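With pandas these five metrics are one-liners. A sketch, assuming df still holds the raw API values (lists for assignees and labels, a per-issue comment count in comments):

```python
import pandas as pd

df["first_comment_created_at"] = pd.to_datetime(df["first_comment_created_at"])

df["lifetime(days)"] = (df["closed_at"] - df["created_at"]).dt.days
df["first_response_duration(days)"] = (df["first_comment_created_at"] - df["created_at"]).dt.days
df["number_of_comments"] = df["comments"]  # the API already reports a count
df["number_of_assignees"] = df["assignees"].apply(len)
df["number_of_labels"] = df["labels"].apply(len)
```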
- Top labels (see the counting sketch below)
  - A dataframe of labels data is constructed.
  - The number of times each label is used is calculated.
  - The label summary is stored in tensorflow_labels_summary.csv and pytorch_labels_summary.csv.
  - The top 10 labels with the most counts are retrieved.
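A minimal counting sketch, assuming each entry in the labels column is the list of label objects returned by the API:

```python
# Explode the per-issue label lists so each label lands on its own row.
exploded = df["labels"].explode().dropna()
label_counts = exploded.apply(lambda label: label["name"]).value_counts()

top_10_labels = label_counts.head(10)
```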
- Top commentators (see the sketch below)
  - Commentator ids (the login field of pytorch_comments.csv or tensorflow_comments.csv) are counted.
  - The top 10 commentators are retrieved based on their counts in the comments data.
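Ranking commentators is a single value_counts over the stored comments. A sketch for one project:

```python
import pandas as pd

comments = pd.read_csv("pytorch_comments.csv")
top_10_commentators = comments["login"].value_counts().head(10)
```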
Distribution of lifetime(days)
TensorFlow
- The distribution is right-skewed; in most cases, issues do not take much time to close.
PyTorch
- The distribution is right-skewed; in most cases, issues do not take much time to close.
There are some outliers in both distributions. The histograms behind these observations can be reproduced as shown below.
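A minimal matplotlib sketch for plotting this kind of histogram from the analysis CSVs (the same pattern applies to the other metrics below):

```python
import pandas as pd
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, repo in zip(axes, ("tensorflow", "pytorch")):
    df = pd.read_csv(f"analysis/{repo}_issues_analysis.csv")
    ax.hist(df["lifetime(days)"], bins=50)
    ax.set_title(repo)
    ax.set_xlabel("lifetime (days)")
axes[0].set_ylabel("number of issues")
plt.tight_layout()
plt.show()
```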
Distribution of first response(days)
TensorFlow
- The distribution of first responses in TensorFlow is bimodal.
- Most first responses are posted either very early or very late.
PyTorch
- The distribution of first responses in PyTorch is right-skewed.
- Most first responses do not take much time to be posted.
Distribution of number of comments
TensorFlow
- The distribution is right-skewed; in most cases, issues have 3 to 5 comments.
PyTorch
- The distribution is right-skewed; in most cases, issues have 1 to 2 comments.
There are some outliers in both distributions.
Distribution of number of assignees
TensorFlow
- Most issues have 1 assignee.
- The distribution is right-skewed; in most cases, an issue is assigned to a single assignee.
PyTorch
- Most issues do not have any assignee.
- The distribution is right-skewed.
Distribution of number of labels
TensorFlow
- The distribution is left-skewed, as most issues have 4 to 5 labels.
PyTorch
- The distribution is slightly right-skewed, as most issues have 2 to 3 labels.
Top Commentators
Top 10 labels
- The label usage of the TensorFlow project is consistent with the long first-response times of its issues: the most used label is stat: awaiting response, which is described as "Status - Awaiting response from author" in tensorflow_labels_summary.csv.
Default Labels
Default labels are the ones provided by GitHub itself.
- Both projects mostly use custom labels.
- The GitHub API calls are commented out in the code for convenience.
- To create the environment: conda create --name <env_name> --file requirements.txt