The datasets used in the Lib2Desc paper are published in this repository.

beyzacevik/Lib2Desc-Dataset

Lib2Desc-Dataset

The datasets proposed in the paper "Lib2Desc: Automatic generation of security-centric Android app descriptions using third-party libraries" are published in this repository.

DATASETS

TPL-Dataset:

This is a structured dataset including information about 1900 Android third-party libraries.

  • TPL_category_textual-description.csv

    It includes the name, category, and general textual description of each third-party library (TPL).

  • TPL_textual_features.csv

    It includes the name and the textual feature vector extracted from the general textual description of each TPL.

  • TPL_API_features.csv

    It includes the name and the API feature vector obtained from static analysis of each TPL.

  • TPL_name_url.csv

    It includes the name of each TPL and the URL of the website that its description was taken from.
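Since all four TPL files carry the library name, records can be joined on it. The following is a minimal sketch, not the authors' code: it assumes that each file starts with a header row and that the name is the first column, neither of which is stated in this README. The helper names `load_by_name` and `join_tpl_files` are hypothetical.

```python
import csv
from pathlib import Path


def load_by_name(csv_path, name_column=0, has_header=True):
    """Map each library name to the remaining fields of its row."""
    rows = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # assumption: the first row holds column names
        for row in reader:
            if not row:
                continue  # skip blank lines
            name = row[name_column]  # assumption: the name is in this column
            rows[name] = row[:name_column] + row[name_column + 1:]
    return rows


def join_tpl_files(tpl_dir):
    """Combine description and URL records for libraries found in both files."""
    tpl_dir = Path(tpl_dir)
    descriptions = load_by_name(tpl_dir / "TPL_category_textual-description.csv")
    urls = load_by_name(tpl_dir / "TPL_name_url.csv")
    return {
        name: {"description_fields": descriptions[name], "url_fields": urls[name]}
        for name in descriptions.keys() & urls.keys()
    }
```

Calling `join_tpl_files("TPL-Dataset")` from the repository root would return one entry per library that appears in both files. If the real files have no header or place the name elsewhere, adjust `has_header` and `name_column` accordingly.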

AAPI Dataset:

This dataset consists of two parts: the code of Android API calls and the Javadoc comments of those APIs, extracted from the original Android API documentation. The dataset is split into train, validation, and test sets.

  • code.csv

    It includes the Android API calls preprocessed as mentioned in the paper.

  • javadoc.csv

    It includes the matching API descriptions for the API calls. This textual information is preprocessed as mentioned in the paper.
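One way to read a split as (code, description) pairs is sketched below. It rests on assumptions this README does not confirm: that row i of code.csv matches row i of javadoc.csv, that the files have no header row, and that the text sits in the first column. The function names `read_column` and `load_pairs` are hypothetical.

```python
import csv
from pathlib import Path


def read_column(csv_path, has_header=False, column=0):
    """Return one column of a CSV file as a list of strings."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle) if row]  # drop blank lines
    if has_header:
        rows = rows[1:]  # assumption: only relevant if a header row exists
    return [row[column] for row in rows]


def load_pairs(split_dir, has_header=False, column=0):
    """Pair API calls with descriptions, assuming the rows are aligned."""
    split_dir = Path(split_dir)
    code = read_column(split_dir / "code.csv", has_header, column)
    docs = read_column(split_dir / "javadoc.csv", has_header, column)
    if len(code) != len(docs):
        raise ValueError(
            f"row count mismatch: {len(code)} code rows, {len(docs)} javadoc rows"
        )
    return list(zip(code, docs))
```

For example, `load_pairs("AAPI-Dataset/train")` would yield the training pairs. The length check is a cheap guard: if it fails, the row-alignment assumption does not hold for that split.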

The repository hierarchy:

├── AAPI-Dataset
│   ├── dev
│   │   ├── code.csv
│   │   └── javadoc.csv
│   ├── test
│   │   ├── code.csv
│   │   └── javadoc.csv
│   └── train
│       ├── code.csv
│       └── javadoc.csv
├── LICENSE
├── README.md
└── TPL-Dataset
    ├── TPL_API_features.csv
    ├── TPL_category_textual-description.csv
    ├── TPL_name_url.csv
    └── TPL_textual_features.csv
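After cloning, a quick check can confirm that the local copy matches the hierarchy above. This is a small sketch using only the file paths listed in the tree; the name `missing_files` is hypothetical.

```python
from pathlib import Path

# Dataset files taken from the repository hierarchy shown above.
EXPECTED_FILES = [
    "AAPI-Dataset/dev/code.csv",
    "AAPI-Dataset/dev/javadoc.csv",
    "AAPI-Dataset/test/code.csv",
    "AAPI-Dataset/test/javadoc.csv",
    "AAPI-Dataset/train/code.csv",
    "AAPI-Dataset/train/javadoc.csv",
    "TPL-Dataset/TPL_API_features.csv",
    "TPL-Dataset/TPL_category_textual-description.csv",
    "TPL-Dataset/TPL_name_url.csv",
    "TPL-Dataset/TPL_textual_features.csv",
]


def missing_files(repo_root):
    """Return the expected dataset files that are absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Running `missing_files(".")` from the repository root should return an empty list when every dataset file is present.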

Usage

The datasets can be used to develop machine learning, deep learning, NLG, and NLP models. They were constructed and preprocessed for classification, NLG, and transformer-based seq2seq models.

Related Sections

You can find more information about dataset collection, preprocessing, and models in the following sections of the paper.

Section 3 Data Collection
Section 3.1 Collecting Third-Party Libraries
Section 4.3.3 Datasets

Contact

Author: Beyza Cevik

Contact: [email protected]
