This repository contains the source code of our research paper "CHRONOS: Time-Aware Zero-Shot Identification of Libraries from Vulnerability Reports", which was accepted at ICSE 2023.
@inproceedings{lyu2023chronos,
title={CHRONOS: Time-Aware Zero-Shot Identification of Libraries from Vulnerability Reports},
author={Lyu, Yunbo and Le-Cong, Thanh and Kang, Hong Jin and Widyasari, Ratnadira and Zhao, Zhipeng and Le, Xuan-Bach D and Li, Ming and Lo, David},
booktitle={Proceedings of the 45th IEEE/ACM International Conference on Software Engineering},
year={2023}
}
There is a minor inconsistency between the paper and the code, which has been addressed in the latest arXiv version of the paper. Specifically, the code uses normalized P@K, which takes the best possible P@K into account, whereas the paper mentions standard P@K. The arXiv version has been updated accordingly and now also includes results for standard P@K (in the Appendix). The findings are similar: Chronos outperforms ZestXML by 20%+, and both outperform LightXML by a large margin.
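To make the difference between the two metrics concrete, here is a minimal Python sketch. It is not the evaluation code shipped in this repository. It assumes that the best possible P@K for a report is min(K, number of true labels) / K, so that normalized P@K is the number of hits divided by min(K, number of true labels). The function names and the library names in the example are hypothetical.

```python
def precision_at_k(ranked_labels, true_labels, k):
    """Standard P@K: fraction of the top-k predictions that are correct."""
    true_set = set(true_labels)
    hits = sum(1 for label in ranked_labels[:k] if label in true_set)
    return hits / k


def normalized_precision_at_k(ranked_labels, true_labels, k):
    """P@K divided by the best P@K attainable for this report.

    Assumption: the best attainable value is min(k, number of true labels) / k.
    """
    true_set = set(true_labels)
    if not true_set:
        return 0.0
    hits = sum(1 for label in ranked_labels[:k] if label in true_set)
    return hits / min(k, len(true_set))


# Hypothetical report with a single affected library, ranked first of three.
ranked = ["log4j-core", "commons-io", "guava"]
truth = ["log4j-core"]
standard = precision_at_k(ranked, truth, k=3)
normalized = normalized_precision_at_k(ranked, truth, k=3)
print(standard, normalized)
```

Under this assumption, a report with fewer than K true labels can never reach a standard P@K of 1, while its normalized P@K can, which is why the two metrics give different absolute numbers.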
Before using Chronos, please download our data from Figshare.
You should unzip all files in the dataset folder so that you can use Chronos.
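If you prefer to script the extraction, the following sketch unpacks every archive found in a folder. It assumes that the downloaded files are .zip archives placed directly inside a folder named dataset. Both the archive format and the helper name unzip_all are assumptions, so adjust them to what you actually downloaded.

```python
import zipfile
from pathlib import Path


def unzip_all(dataset_dir):
    """Extract every .zip archive found directly inside dataset_dir, in place."""
    extracted = []
    for archive in sorted(Path(dataset_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    # "dataset" is the folder named in this README; the archive names are unknown here.
    print(unzip_all("dataset"))
```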
The structure of our source code's repository is as follows:
- dataset: contains our dataset for empirical evaluation;
- reference_processing: contains source code for preprocessing reference data;
- zestxml: contains our source code for the zero-shot learning model;
- analyze_data.py: contains our source code for analyzing unseen labels and the associated data points.
For ease of use, we also provide an installation package as a docker image. You can set up Chronos's docker image step by step as follows:
- Pull Chronos's docker image:
docker pull chronosicse22/chronos:v1
- Run a docker container:
docker run --name chronos -it --shm-size 16G --gpus all chronosicse22/chronos:v1
An optional command to run a docker container with a mounted workspace:
docker run -it -v </media/Rb/:/workspace/> --name chronos_ae chronosicse22/chronos:v1
</media/Rb/:/workspace/> is your workspace path; change it to match your own setup.
You also need to update the workspace path in auto_run.sh (line 12 at commit 07e0a75) so that it points to the local directory of the dataset.
bash auto_run.sh -d [description data: "merged" or "description_and_reference"]
-l [label processing: "splitting" or "none"]
-m [the M parameter on Equation (6) for adjustment]
-i [top-i highest labels for adjustment]
If you want to create reference data from scratch, please use the following commands:
cd reference_processing
python3 generate_new_csv.py
To replicate our results for RQ1, please use:
python3 analyze_data.py
To replicate our results for RQ2, please use:
bash auto_run.sh -d 'description_and_reference' -l 'splitting' -m 8 -i 10
To replicate our results for RQ3, please use:
- Chronos without adjustment
bash auto_run.sh -d 'description_and_reference' -l 'splitting' -m 0 -i 0
- Chronos without data enhancement
bash auto_run.sh -d 'merged' -l 'none' -m 8 -i 10
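The three configurations for RQ2 and RQ3 differ only in the four flags passed to auto_run.sh. As a convenience, the sketch below assembles the same command lines from a small table. The helper build_command and the experiment names are hypothetical and not part of this repository; the flag values are copied from the commands above.

```python
import shlex

# Settings copied from the RQ2 and RQ3 commands in this README.
EXPERIMENTS = {
    "RQ2: full Chronos": ("description_and_reference", "splitting", 8, 10),
    "RQ3: without adjustment": ("description_and_reference", "splitting", 0, 0),
    "RQ3: without data enhancement": ("merged", "none", 8, 10),
}


def build_command(description, label_processing, m, top_i):
    """Return the auto_run.sh invocation as an argument list."""
    return ["bash", "auto_run.sh",
            "-d", description, "-l", label_processing,
            "-m", str(m), "-i", str(top_i)]


commands = {name: build_command(*cfg) for name, cfg in EXPERIMENTS.items()}
for name, argv in commands.items():
    # Print only; to launch a run from the repository root one could call
    # subprocess.run(argv, check=True) instead.
    print(f"{name}: {shlex.join(argv)}")
```

The sketch only prints the commands, so it is safe to run outside the docker container.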
For RQ1, after executing the script, you will find information about seen and unseen labels by year. For RQ2 and RQ3, after executing each script, you will find the Precision, Recall, and F1 for each experimental setting.
You can get the detailed expected output in this document.
Grid search was performed on three hyperparameters: batch size (bs), epochs, and learning rate (lr).
In particular, we use batch sizes in {1, 2, 4, 8, 16}; learning rates in {1e-6, 1e-5, 1e-4, 1e-3, 1e-2}; and epochs in {20, 25, 30, 35, 40}.
We used the hyperparameters that result in LightXML’s best performance on the validation dataset to evaluate its performance on the testing dataset.
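The selection procedure described above can be sketched as an exhaustive search over the grid. This is an illustration only and not the script we used: the validation_score callback stands in for training LightXML and scoring it on the validation set, and toy_score is a made-up placeholder so that the example runs on its own.

```python
from itertools import product

BATCH_SIZES = [1, 2, 4, 8, 16]
LEARNING_RATES = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
EPOCHS = [20, 25, 30, 35, 40]


def grid_search(validation_score):
    """Return the (score, config) pair with the highest validation score."""
    best = None
    for bs, lr, epochs in product(BATCH_SIZES, LEARNING_RATES, EPOCHS):
        config = {"bs": bs, "lr": lr, "epochs": epochs}
        score = validation_score(config)
        if best is None or score > best[0]:
            best = (score, config)
    return best


def toy_score(config):
    """Placeholder, not a real model: peaks at bs=8, lr=1e-4, epochs=30."""
    return -(abs(config["bs"] - 8) + abs(config["epochs"] - 30)
             + abs(config["lr"] - 1e-4))


grid = list(product(BATCH_SIZES, LEARNING_RATES, EPOCHS))
best_score, best_config = grid_search(toy_score)
print(len(grid), best_config)
```

With five values per hyperparameter, the full grid contains 5 x 5 x 5 = 125 configurations, each of which requires one training run.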
To use LightXML, please refer to our previous study.