# RADAR: Towards Automatic Source Code Repository Information Recovery and Validation for PyPI Packages
```bash
# Ubuntu 22.04 LTS (GNU/Linux 5.19.0-41-generic x86_64)
# install pyenv from https://github.com/pyenv/pyenv#installation
pyenv install 3.11.3
pyenv virtualenv 3.11.3 radar
pyenv activate radar
pip install -r requirements.txt
export DATA_HOME=<Folder to Store Data>  # e.g., /data/pypi_data
```
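For concreteness, a typical setup might look like this (the folder below is just the example path from the comment above; any writable location works):

```bash
# illustrative values, not required paths
export DATA_HOME=/data/pypi_data
mkdir -p "$DATA_HOME"
```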
The following notebooks reproduce the analyses in the paper:

- `baseline_analysis.ipynb`: Code for the RQ1 results (Metadata-based Retriever evaluation).
- `dist_diff_analysis.ipynb` and `dist_file_analysis.ipynb`: Code for the analysis of why we use the source distribution.
- `phantom_file_analysis.ipynb`: Code for the RQ2 results.
- `validator_analysis.ipynb`: Code for the Source Code Repository Validator evaluation.
- `retriever_analysis.ipynb`: Code for the Source Code-based Retriever evaluation.
- Dump PyPI package metadata:

  ```bash
  python -u pypi_crawler.py --folder=$DATA_HOME --email=<Your Email> --processes <numOfProcesses> --chunk <numOfDataPerChunk>
  ```
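  For example, a crawl with 8 worker processes and 1,000 packages per chunk (the email address and parallelism values below are illustrative):

  ```bash
  python -u pypi_crawler.py --folder=$DATA_HOME --email=you@example.com --processes 8 --chunk 1000
  ```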
- Import metadata to MongoDB. We provide the dump in the `mongodb` folder (`release_metadata.bson.gz` and `distribution_file_info.bson.gz`).

  ```bash
  python -u import_to_mongo.py --base_folder $DATA_HOME --metadata --distribution --drop
  ```
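  Alternatively, the provided dumps can be loaded directly with `mongorestore` instead of re-crawling. A minimal sketch, assuming the data lives in a database named `radar` with collections named after the dump files (adjust both to whatever `import_to_mongo.py` actually uses):

  ```bash
  # assumed database/collection names; verify against import_to_mongo.py
  mongorestore --gzip --db radar --collection release_metadata mongodb/release_metadata.bson.gz
  mongorestore --gzip --db radar --collection distribution_file_info mongodb/distribution_file_info.bson.gz
  ```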
- Obtain baseline results. We provide a MongoDB dump in the `mongodb` folder (`baseline_results.bson.gz`).

  ```bash
  python -m dataset.run_baselines --baseline ossgadget
  python -m dataset.run_baselines --baseline warehouse
  python -m dataset.run_baselines --baseline librariesio
  # caution: running py2src requires very heavy http requests and is very slow
  python -m dataset.run_baselines --baseline py2src --n_jobs <numOfProcesses> --chunk_size <numOfDataPerChunk>
  # dump baseline results to MongoDB
  python -m dataset.run_baselines --dump
  ```

  You can also obtain results for a single release by passing the `--name` and `--version` arguments:

  ```bash
  python -m dataset.run_baselines --baseline librariesio --name tensorflow --version 2.9.0
  ```
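  The three lightweight baselines can also be run back to back with a small shell loop (a sketch; `py2src` is kept separate above because of its heavy HTTP usage):

  ```bash
  # run the lightweight baselines sequentially, then dump all results to MongoDB
  for baseline in ossgadget warehouse librariesio; do
      python -m dataset.run_baselines --baseline "$baseline"
  done
  python -m dataset.run_baselines --dump
  ```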
- Obtain MetadataRetriever results. Since the MetadataRetriever still needs to search webpages, we run it in stages to keep the number of HTTP requests as low as possible (a combined run of all five stages is sketched after the stage list below):
  - The 1st stage: use the `--all` option to search for repository URLs in the `home_page`, `download_url`, `project_urls`, and `description` fields of the metadata.

    ```bash
    python -m dataset.run_metadata_retriever --all
    ```
  - The 2nd stage: use the `--left_release` option to collect all repository URLs from the unique homepage and documentation webpages of the remaining releases whose metadata does not contain a repository URL.

    ```bash
    python -m dataset.run_metadata_retriever --left_release --n_jobs <numOfProcesses> --chunk_size <numOfDataPerChunk>
    ```
  - The 3rd stage: use the `--process_log` option to process the URLs that failed in the 2nd stage.

    ```bash
    python -m dataset.run_metadata_retriever --process_log --n_jobs <numOfProcesses> --chunk_size <numOfDataPerChunk> 2>log/metadata_retriever.log.2
    ```
  - The 4th stage: use the `--merge` option to merge the repository URLs retrieved for each webpage in the 2nd and 3rd stages into the MetadataRetriever results.

    ```bash
    python -m dataset.run_metadata_retriever --merge
    ```
  - The 5th stage: use the `--redirect` option to resolve the redirected URL of each repository URL retrieved by the MetadataRetriever.

    ```bash
    python -m dataset.run_metadata_retriever --redirect --n_jobs <numOfProcesses> --chunk_size <numOfDataPerChunk> 2>log/metadata_retriever.log
    ```
  You can also obtain results for a single release by passing the `--name` and `--version` arguments. Two additional options are available:

  - `--webpage`: search the webpages pointed to by the Homepage and Documentation links in the metadata
  - `--redirect`: get the redirected URL of the retrieved repository URL

  ```bash
  python -m dataset.run_metadata_retriever --name tensorflow --version 2.10.0
  ```
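  Putting the stages together, a full MetadataRetriever run looks roughly like this (the parallelism and chunk-size values are illustrative):

  ```bash
  # stages 1-5 in order; adjust --n_jobs/--chunk_size to your machine
  python -m dataset.run_metadata_retriever --all
  python -m dataset.run_metadata_retriever --left_release --n_jobs 8 --chunk_size 1000
  python -m dataset.run_metadata_retriever --process_log --n_jobs 8 --chunk_size 1000 2>log/metadata_retriever.log.2
  python -m dataset.run_metadata_retriever --merge
  python -m dataset.run_metadata_retriever --redirect --n_jobs 8 --chunk_size 1000 2>log/metadata_retriever.log
  ```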
- List repositories' blobs based on the MetadataRetriever results:

  ```bash
  # Clone repositories to local
  python -m dataset.clone_repository --base_folder $DATA_HOME --processes <numOfProcesses> --chunk_size <numOfDataPerChunk>
  # List repositories' blobs
  python -m dataset.list_blobs --base_folder $DATA_HOME --processes <numOfProcesses> --chunk_size <numOfDataPerChunk>
  ```
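  For example, cloning and listing with 8 processes and chunks of 100 repositories (illustrative values; cloning that many repositories needs substantial disk space under `$DATA_HOME`):

  ```bash
  python -m dataset.clone_repository --base_folder $DATA_HOME --processes 8 --chunk_size 100
  python -m dataset.list_blobs --base_folder $DATA_HOME --processes 8 --chunk_size 100
  ```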
- Compare the differences between source distributions and binary distributions:

  ```bash
  python -m dataset.dist_diff --base_folder $DATA_HOME --all --processes <numOfProcesses> --chunk_size <numOfDataPerChunk> [ --mirror <PyPI mirror site> ]
  ```
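  For example, with a mirror to speed up downloads (the mirror URL below is purely illustrative; check which URL format `dataset.dist_diff` expects):

  ```bash
  # assumed mirror URL; substitute whichever PyPI mirror is closest to you
  python -m dataset.dist_diff --base_folder $DATA_HOME --all --processes 8 --chunk_size 100 --mirror https://pypi.tuna.tsinghua.edu.cn
  ```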
- Construct the dataset:

  ```bash
  # collect Python repositories on GitHub with more than 100 stars
  python -m dataset.ground_truth --repository
  # collect packages in these GitHub repositories
  python -m dataset.ground_truth --package --n_jobs <numOfProcesses> --chunk_size <numOfDataPerChunk>
  # collect each PyPI package's PyPI maintainers
  python -m dataset.ground_truth --maintainer --n_jobs <numOfProcesses> --chunk_size <numOfDataPerChunk>
  # construct the ground truth dataset for the validator
  python -m dataset.ground_truth --dataset
  # download source distributions for releases in the ground truth dataset;
  # you can specify a PyPI mirror site to accelerate the download
  python -m dataset.ground_truth --download --dest $DATA_HOME --n_jobs <numOfProcesses> --chunk_size <numOfDataPerChunk> [ --mirror <PyPI mirror site> ]
  # get phantom files in matched and mismatched releases
  python -m dataset.run_validator --base_folder $DATA_HOME --n_jobs <numOfProcesses> --chunk_size <numOfDataPerChunk> --phantom_file
  # get validator features for matched and mismatched releases
  python -m dataset.run_validator --base_folder $DATA_HOME --n_jobs <numOfProcesses> --features
  # download the source distributions for the latest release of all PyPI packages
  python -m dataset.run_validator --base_folder $DATA_HOME --n_jobs <numOfProcesses> --pypi [ --mirror <PyPI mirror site> ]
  # get validator features for all PyPI releases
  python -m dataset.run_validator --base_folder $DATA_HOME --n_jobs <numOfProcesses> --pypi_features
  # get file SHAs for correct links
  python -m dataset.run_retriever --base_folder $DATA_HOME --n_jobs <numOfProcesses> --fileshas
  # get candidates from WoC
  python -m dataset.run_retriever --base_folder $DATA_HOME --n_jobs <numOfProcesses> --candidates [ --mirror <PyPI mirror site> ]
  # get top-n candidates
  python -m dataset.run_retriever --base_folder $DATA_HOME --most_common
  # get upstream forks
  python -m dataset.run_retriever --base_folder $DATA_HOME --chunk_size <numOfDataPerChunk> --defork
  # get the final returned repository
  python -m dataset.run_retriever --base_folder $DATA_HOME --final
  # retrieve repositories for releases for which the Metadata-based Retriever cannot retrieve repository information
  python -m dataset.run_retriever --base_folder $DATA_HOME --n_jobs <numOfProcesses> --download_remaining [ --mirror <PyPI mirror site> ]
  python -m dataset.run_retriever --base_folder $DATA_HOME --n_jobs <numOfProcesses> --candidate_remaining [ --mirror <PyPI mirror site> ]
  python -m dataset.run_retriever --base_folder $DATA_HOME --chunk_size <numOfDataPerChunk> --most_common_remaining --defork_remaining --final_remaining
  ```
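  Because the steps above share the same parallelism knobs, it can be convenient to wrap a stage in a small driver script. A sketch for the ground-truth portion (`N_JOBS` and `CHUNK` are illustrative):

  ```bash
  #!/usr/bin/env bash
  # abort on the first failed step so later stages never run on partial data
  set -euo pipefail
  N_JOBS=8
  CHUNK=1000

  python -m dataset.ground_truth --repository
  python -m dataset.ground_truth --package --n_jobs "$N_JOBS" --chunk_size "$CHUNK"
  python -m dataset.ground_truth --maintainer --n_jobs "$N_JOBS" --chunk_size "$CHUNK"
  python -m dataset.ground_truth --dataset
  python -m dataset.ground_truth --download --dest "$DATA_HOME" --n_jobs "$N_JOBS" --chunk_size "$CHUNK"
  ```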
- Fit machine learning models on the validator features:

  ```bash
  # run Logistic Regression, Decision Tree, Random Forest, AdaBoost, Gradient Boosting, SVM, and XGBoost
  python -m models.fit_model --all --n_jobs <numOfProcesses>
  ```
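  For example, on an 8-core machine:

  ```bash
  python -m models.fit_model --all --n_jobs 8
  ```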