in this repository. You are welcome to use it in your research.
Each release has a fixed version. By referring to it in your research
you avoid ambiguity and guarantee the repeatability of your experiments.

This is a more formal explanation of this project:
[in PDF](https://github.com/yegor256/cam/blob/gh-pages/paper.pdf).

The latest ZIP archive with the dataset is here:
[cam-2023-10-22.zip](http://cam.yegor256.com/cam-2023-10-22.zip)
(2.19Gb).
There are **33 metrics** calculated for **862,517 Java classes** from
**1000 GitHub repositories**, including:
lines of code (reported by [cloc](https://github.com/AlDanial/cloc));
[NCSS](https://stackoverflow.com/questions/5486983/what-does-ncss-stand-for);
[cyclomatic](https://en.wikipedia.org/wiki/Cyclomatic_complexity) and
[cognitive complexity](https://en.wikipedia.org/wiki/Cognitive_complexity)
(by [PMD](https://pmd.github.io/));
[Halstead](https://en.wikipedia.org/wiki/Halstead_complexity_measures)
volume, effort, and difficulty;
[maintainability index](https://ieeexplore.ieee.org/abstract/document/303623);
number of attributes, constructors, and methods;
and others ([see PDF](http://cam.yegor256.com/cam-2023-10-22.pdf)).
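
The archive unpacks into `.csv` files of per-class metrics. As a quick
sketch of working with that kind of data (the file name and the `ncss`
column below are invented for the example, not taken from the dataset),
a metric column can be averaged with `awk`:

```shell
# Hypothetical sample: real dataset files and column names may differ.
cat > sample.csv <<'EOF'
class,ncss
Foo,10
Bar,30
EOF
# Average the second column, skipping the header row.
awk -F, 'NR > 1 { sum += $2; n++ } END { print sum / n }' sample.csv
```

Any CSV-aware tool (R, pandas, a spreadsheet) works equally well.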

Previous archives (it took me a few days to build each of them,
using a pretty big machine):

* [cam-2023-10-22.zip](http://cam.yegor256.com/cam-2023-10-22.zip)
  (2.19Gb): 1000 repos, 33 metrics, 863K classes
* [cam-2023-10-11.zip](http://cam.yegor256.com/cam-2023-10-11.zip)
  (3Gb): 959 repos, 29 metrics, 840K classes
* [cam-2021-08-04.zip](https://github.com/yegor256/cam/releases/download/0.2.0/cam-2021-08-04.zip)
  (692Mb): 1000 repos, 15 metrics
* [cam-2021-07-08.zip](https://github.com/yegor256/cam/releases/download/0.1.1/cam-2021-07-08.zip)
  (387Mb): 1000 repos, 11 metrics

If you want to create a new dataset,
just run the following command and the entire dataset will
be built in the current directory
(you need to have [Docker](https://docs.docker.com/get-docker/) installed),
where `1000` is the number of repositories to fetch from GitHub
and `XXX` is

```bash
docker run --detach --name=cam --rm --volume "$(pwd):/dataset" \
```


This command will create a new Docker container, running in the background
(run `docker ps -a` in order to see it).
If you want to run Docker interactively and see all the logs,
you can just disable
[detached mode](https://docs.docker.com/language/golang/run-containers/#run-in-detached-mode)
by removing the `--detach` option from the command.
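
While the container runs detached, plain Docker commands are enough to
keep an eye on it (a sketch; `cam` is the container name set by the
`--name=cam` option in the command above):

```shell
# List all containers; the detached one appears under the name "cam"
docker ps -a
# Stream the build logs without attaching to the container
docker logs --follow cam
```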

The dataset will be created in the current directory (this may take some
time, maybe a few days!), and a `.zip` archive will also be there.
The Docker container will run in the background: you can safely close
the console and come back when the
dataset is ready and the container is deleted.

Make sure your server has enough
[swap memory](https://askubuntu.com/questions/178712/how-to-increase-swap-space)
(at least 32Gb) and free disk space (at least 512Gb);
without this, the dataset will have many errors.
It's better to have multiple CPUs, since the entire build process is
highly parallel: all CPUs will be utilized.
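
Before committing to a multi-day build, it is worth checking that the
machine actually has these resources; standard Linux tools report them:

```shell
free -h    # the "Swap:" line shows total swap memory
df -h .    # free disk space on the current filesystem
nproc      # number of CPUs available for the parallel build
```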

If the script fails at some point, you can restart it,
without deleting previously created files.
The process is incremental: it will understand where it stopped before.
In order to restart an entire "step," delete the following directory:

* `github/` to rerun `clone`

```bash
make TOTAL=100
```

This should work if you have all the dependencies installed, as suggested
in the [Dockerfile](https://github.com/yegor256/cam/blob/master/Dockerfile).

In order to analyze just a single repository, do this
([`yegor256/tojos`](https://github.com/yegor256/tojos) as an example):

metrics on top of it. It should be easy:

* Unpack it to the `cam/dataset/` directory
* Add a new script to the `cam/metrics/` directory (use `ast.py` as an example)
* Delete all other files except yours from the `cam/metrics/` directory
* Run [`make`](https://www.gnu.org/software/make/) in the `cam/`
  directory: `sudo make install; make all`
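
Put together, the steps above might look like this in a shell session
(a sketch only: `my_metric.py` stands in for your metric script, and the
location of the downloaded archive is assumed):

```shell
cd cam
unzip ~/cam-2023-10-22.zip -d dataset/       # unpack the archive
cp ~/my_metric.py metrics/                   # add your new metric script
find metrics/ -type f ! -name 'my_metric.py' -delete  # keep only yours
sudo make install
make all
```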

The `make` should understand that a new metric was added.
It will apply this new metric
to all `.java` files, generate new `.csv` reports, aggregate them with
existing reports (in the `cam/dataset/data/` directory),
and then the final `.pdf` report will also be updated.

## How to Contribute

Fork the repository, make changes, and send us a
[pull request](https://www.yegor256.com/2014/04/15/github-guidelines.html).
We will review your changes and apply them to the `master` branch shortly,
provided they don't violate our quality standards. To avoid frustration,
before sending us your pull request please run the full build:

This should take a few minutes to complete, without errors.

## How to Build a New Archive

When it's time to build a new archive, create a new `m7i.2xlarge`
server (8 CPUs, 32Gb RAM, 512Gb disk) with Ubuntu 22.04 in AWS.

Then, install Docker into it:

```bash
sudo mkswap /swapfile
sudo swapon /swapfile
```

Then, create a [personal access token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens)
in GitHub, and run Docker as explained above.