Additional Core/Peripheral Classification Methods #276

Merged: 15 commits, Jan 29, 2025
2 changes: 2 additions & 0 deletions NEWS.md
@@ -20,6 +20,7 @@
- Add `remove.duplicate.edges` function that takes a network as input and conflates identical edges (PR #268, d9a4be417b340812b744f59398ba6460ba527e1c, 0c2f47c4fea6f5f2f582c0259f8cf23af985058a, c6e90dd9cb462232563f753f414da14a24b392a3)
- Add `cumulative` as an argument to `construct.ranges` which enables the creation of cumulative ranges from given revisions (PR #268, a135f6bb6f83ccb03ae27c735c2700fccc1ee0c8, 8ec207f1e306ef6a641fb0205a9982fa89c7e0d9)
- Add function `get.last.activity.data` to compute developers' last activities in a project, as well as function `add.vertex.attribute.author.last.activity` to add a developer's date of last activity as vertex attribute to a network, as well as helper functions `get.aggregated.activity.data` and `add.vertex.attribute.author.aggregated.activity` to allow for other activity aggregations than first and last activity (PR #275, 9f231612fcd33a283362c79b35a94295ff3d4ef9, 8660ed763ba4b69e909e7fbb01e27e1999522047)
- Add four new metrics that can be used for the classification of authors into core and peripheral developers: betweenness, closeness, PageRank, and eccentricity (PR #276, 65d5c9cc86708777ef458b0c2e744ab4b846bdd1, b392d1a125d0f306b4bce8d95032162a328a3ce2, c5d37d40024e32ad5778fa5971a45bc08f7631e0)

### Changed/Improved

@@ -30,6 +31,7 @@
- Explicitly add R version 4.4 to the CI test pipeline (c8e6f45111e487fadbe7f0a13c7595eb23f3af6e)
- Refactor function `construct.edge.list.from.key.value.list` to be more readable (PR #263, 05c3bc09cb1d396fd59c34a88030cdca58fd04dd)
- Update necessary `igraph` version to 2.1.0 in `README.md` (PR #274, 6c3bcd1a2366d0d3a176d9fde95b8356b0158da3)
- Include core/peripheral classification in the `README.md` (PR #276, )

### Fixed

52 changes: 52 additions & 0 deletions README.md
@@ -34,6 +34,9 @@ If you wonder: The name `coronet` derives as an acronym from the words "configur
- [Splitting data and networks based on defined time windows](#splitting-data-and-networks-based-on-defined-time-windows)
- [Cutting data to unified date ranges](#cutting-data-to-unified-date-ranges)
- [Handling data independently](#handling-data-independently)
- [Core/Peripheral classification](#coreperipheral-classification)
- [Count-based metrics](#count-based-metrics)
- [Network-based metrics](#network-based-metrics)
- [How-to](#how-to)
- [File/Module overview](#filemodule-overview)
- [Configuration classes](#configuration-classes)
@@ -375,6 +378,55 @@ Analogously, the `NetworkConf` parameter `unify.date.ranges` enables this very f

In some cases, it is not necessary to build a network to get the information you need. Therefore, please remember that we offer the possibility to get the raw data or mappings between, e.g., authors and the files they edited. The data inside an instance of `ProjectData` can be accessed independently. Examples can be found in the file `showcase.R`.

#### Core/Peripheral classification

Core/peripheral classification describes the process of dividing the authors of a project into either `core` or `peripheral` developers, based on the principle that the core developers contribute most of the work in a project. The concrete threshold can be configured via `CORE.THRESHOLD` and defaults to 80%, a value commonly used in the literature. In practice, scores are assigned to developers to approximate their importance in the project, and the authors are then divided into `core` and `peripheral` based on these scores such that the desired split is achieved.
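The cumulative-threshold idea behind the split can be illustrated with a small, self-contained sketch. Note that this is not `coronet`'s actual implementation: the data are made up, and whether the author that crosses the threshold ends up in the core class is an assumption here.

```r
## Illustrative sketch of the cumulative-threshold split (not coronet's
## actual implementation; data and the tie-handling are made up).
core.threshold = 0.8

## hypothetical commit counts per author
scores = c(Olaf = 50, Thomas = 30, Karl = 15, Udo = 5)

## sort scores in decreasing order and compute their cumulative share
scores = sort(scores, decreasing = TRUE)
cumulative.share = cumsum(scores) / sum(scores)

## authors up to (and including) the one crossing the threshold are core
core.cutoff = which(cumulative.share >= core.threshold)[1]
core = names(scores)[seq_len(core.cutoff)]
peripheral = names(scores)[-seq_len(core.cutoff)]
## core: "Olaf", "Thomas"; peripheral: "Karl", "Udo"
```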

##### Count-based metrics

In this section, we describe the count-based metrics that can be used to classify authors as either core or peripheral.
- `commit.count`
* calculates scores based on the number of commits per author
- `loc.count`
* calculates scores based on the number of lines of code changed by each author
- `mail.count`
* calculates scores based on the number of mails sent per author
- `mail.thread.count`
* calculates scores based on the number of mail threads each author participated in
- `issue.count`
* calculates scores based on the number of issues each author participated in
- `issue.comment.count`
* calculates scores based on the number of comments each author made in issues
- `issue.commented.in.count`
* calculates scores based on the number of issues each author commented in
- `issue.created.count`
* calculates scores based on the number of issues each author created
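A hedged usage sketch of a count-based classifier follows. The function name below is assumed for illustration based on the naming of the network-based classifiers and may differ from the actual API; see `showcase.R` for actual usage examples.

```r
## Hypothetical usage sketch: the function name is an assumption and may
## differ from coronet's actual API. `proj.data` is assumed to be an
## already-configured `ProjectData` instance.
author.class = get.author.class.commit.count(proj.data)

## the result is a list holding the two classes as data frames,
## each containing author names alongside their scores
author.class[["core"]]
author.class[["peripheral"]]
```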

##### Network-based metrics

In this section, we describe the metrics that can be used to classify authors as either core or peripheral based on author networks. Note that these methods can be applied to any network, not just author networks; the classification then pertains to the network's vertex type, i.e., an artifact network results in a classification of the artifacts based on their importance in the network.
- `network.degree`
* calculates scores based on the vertex degrees in a network
* the degree of a vertex is the number of adjacent edges
- `network.eigen`
* calculates scores based on the eigenvector centralities in a network
* eigenvector centrality measures the importance of vertices within a network by awarding scores for adjacent edges proportional to the score of the connected vertex
- `network.hierarchy`
* calculates scores based on the hierarchy found within a network
* hierarchical scores are calculated by dividing the vertex degree by the clustering coefficient of each vertex
- `network.betweenness`
* calculates scores based on the betweenness of vertices in a network
* betweenness counts, for each vertex, the number of shortest paths between pairs of other vertices that pass through it
- `network.closeness`
* calculates scores based on the closeness of vertices in a network
* closeness measures how close a vertex is to all other vertices, computed as the inverse of the sum of its shortest-path distances to them
- `network.pagerank`
* calculates scores based on the pagerank of vertices in a network
* PageRank refers to the algorithm originally used by Google to rank web pages; it is closely related to eigenvector centrality
- `network.eccentricity`
* calculates scores based on the eccentricity of vertices in a network
* eccentricity measures the length of the shortest path to each vertex's furthest reachable vertex
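The four new network-based classifiers can be called directly on a network, as exercised by this PR's tests. A minimal sketch follows; it assumes `network` is an author network (an igraph object) built beforehand with coronet's network builder.

```r
## `network` is assumed to be an author network (an igraph object)
## constructed beforehand via coronet's network builder.

## each classifier returns a list with a `core` and a `peripheral`
## data frame containing author names and their centrality scores
class.betweenness = get.author.class.network.betweenness(network)
class.closeness = get.author.class.network.closeness(network)
class.pagerank = get.author.class.network.pagerank(network)
class.eccentricity = get.author.class.network.eccentricity(network)

## inspect the core developers according to betweenness centrality
class.betweenness[["core"]]
```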

### How-to

In this section, we give a short example on how to initialize all needed objects and build a bipartite network.
Expand Down
112 changes: 112 additions & 0 deletions tests/test-core-peripheral.R
@@ -18,6 +18,7 @@
## Copyright 2019 by Christian Hechtl <[email protected]>
## Copyright 2021 by Christian Hechtl <[email protected]>
## Copyright 2023-2024 by Maximilian Löffler <[email protected]>
## Copyright 2024 by Leo Sendelbach <[email protected]>
## All Rights Reserved.


@@ -105,6 +106,117 @@ test_that("Eigenvector classification", {
expect_equal(expected, result, tolerance = 0.0001)
})

test_that("Hierarchy classification", {

vertices = data.frame(
name = c("Olaf", "Thomas", "Karl"),
kind = TYPE.AUTHOR,
type = TYPE.AUTHOR
)
edges = data.frame(
from = c("Olaf", "Thomas", "Karl", "Thomas"),
to = c("Thomas", "Karl", "Olaf", "Thomas"),
func = c("GLOBAL", "test2.c::test2", "GLOBAL", "test2.c::test2"),
hash = c("0a1a5c523d835459c42f33e863623138555e2526",
"418d1dc4929ad1df251d2aeb833dd45757b04a6f",
"5a5ec9675e98187e1e92561e1888aa6f04faa338",
"d01921773fae4bed8186b0aa411d6a2f7a6626e6"),
file = c("GLOBAL", "test2.c", "GLOBAL", "test2.c"),
base.hash = c("3a0ed78458b3976243db6829f63eba3eead26774",
"0a1a5c523d835459c42f33e863623138555e2526",
"1143db502761379c2bfcecc2007fc34282e7ee61",
"0a1a5c523d835459c42f33e863623138555e2526"),
base.func = c("test2.c::test2", "test2.c::test2",
"test3.c::test_function", "test2.c::test2"),
base.file = c("test2.c", "test2.c", "test3.c", "test2.c"),
artifact.type = c("CommitInteraction", "CommitInteraction", "CommitInteraction", "CommitInteraction"),
weight = c(1, 1, 1, 1),
type = c(TYPE.EDGES.INTRA, TYPE.EDGES.INTRA, TYPE.EDGES.INTRA, TYPE.EDGES.INTRA),
relation = c("commit.interaction", "commit.interaction", "commit.interaction", "commit.interaction")
)
test.network = igraph::graph_from_data_frame(edges, directed = FALSE, vertices = vertices)

## Act
result = get.author.class.network.hierarchy(test.network)
## Assert
expected.core = data.frame(author.name = c("Thomas"),
hierarchy = c(4))
expected.peripheral = data.frame(author.name = c("Olaf", "Karl"),
hierarchy = c(2, 2))
expected = list(core = expected.core, peripheral = expected.peripheral)
row.names(result[["core"]]) = NULL
row.names(result[["peripheral"]]) = NULL
expect_equal(expected, result)
})

test_that("Betweenness classification", {

## Act
result = get.author.class.network.betweenness(network)

## Assert
expected.core = data.frame(author.name = c("Olaf"),
betweenness.centrality = c(1))
expected.peripheral = data.frame(author.name = c("Björn", "udo", "Thomas", "Fritz [email protected]",
"georg", "Hans"),
betweenness.centrality = c(0, 0, 0, 0, 0, 0))
expected = list(core = expected.core, peripheral = expected.peripheral)
row.names(result[["core"]]) = NULL
row.names(result[["peripheral"]]) = NULL
expect_equal(expected, result)
})

test_that("Closeness classification", {

## Act
result = get.author.class.network.closeness(network)

## Assert
expected.core = data.frame(author.name = c("Olaf"),
closeness.centrality = c(0.5))
expected.peripheral = data.frame(author.name = c("Björn", "Thomas", "udo", "Fritz [email protected]",
"georg", "Hans"),
closeness.centrality = c(0.33333, 0.33333, 0.0, 0.0, 0.0, 0.0))
expected = list(core = expected.core, peripheral = expected.peripheral)
row.names(result[["core"]]) = NULL
row.names(result[["peripheral"]]) = NULL
expect_equal(expected, result, tolerance = 0.0001)
})

test_that("Pagerank classification", {

## Act
result = get.author.class.network.pagerank(network)

## Assert
expected.core = data.frame(author.name = c("Olaf"),
pagerank.centrality = c(0.40541))
expected.peripheral = data.frame(author.name = c("Björn", "Thomas", "udo", "Fritz [email protected]",
"georg", "Hans"),
pagerank.centrality = c(0.21396, 0.21396, 0.041667, 0.041667, 0.041667, 0.041667))
expected = list(core = expected.core, peripheral = expected.peripheral)
row.names(result[["core"]]) = NULL
row.names(result[["peripheral"]]) = NULL
expect_equal(expected, result, tolerance = 0.0001)
})

test_that("Eccentricity classification", {

## Act
result = get.author.class.network.eccentricity(network)

## Assert
expected.core = data.frame(author.name = c("Olaf"),
eccentricity = c(1))
expected.peripheral = data.frame(author.name = c("Björn", "udo", "Thomas", "Fritz [email protected]",
"georg", "Hans"),
eccentricity = c(0, 0, 0, 0, 0, 0))
expected = list(core = expected.core, peripheral = expected.peripheral)
row.names(result[["core"]]) = NULL
row.names(result[["peripheral"]]) = NULL
expect_equal(expected, result)
})


test_that("Commit-count classification using 'result.limit'" , {