In this repository, we provide the artefacts of our paper "Lessons Learnt on Reproducibility in Machine Learning Based Android Malware Detection", which has been accepted to be published in Empirical Software Engineering (EMSE).
This script generates ".data" files that represent the features extracted from the APK files.
INPUTs are: The path to the APK files directory and optionally the number of CPUs to use. They are explained in detail in the script. OUTPUTs are the ".data" file that has the list of the extracted features and the manifest file.
Example:
python GetApkData.py -d my_dataset
This script embeds the features extracted with GetApkData.py
in a vector space and performs
the classification using the LinearSVC SVM classifier.
Recall and accuracy are calculated over 10 experiments. Each experiment uses 66% of the data as the training
set and 33% as the test set, and the scores are averaged.
The ROC curve is plotted using the last trained classifier from the 10 experiments.
- The path to the .data malware files directory,
- The path to the .data goodware files directory,
- The name of the txt file on which the results will be written,
- The name of the pdf file on which the roc curve will be saved,
- and optionally, the seed to fix for the experiments.
- the results text file,
- and roc curve pdf graph.
Example:
python classification_drebin.py -md my_dataset/malware/ -gd my_dataset/goodware/ -fs file_scores -roc file_roc
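
The evaluation protocol described above (10 random 66%/33% splits with averaged scores) can be sketched as follows. This is a minimal, standard-library illustration and not the code of `classification_drebin.py`: the function names and the `predict` callback are hypothetical, and the real script trains a LinearSVC classifier on each split.

```python
import random
from statistics import mean


def holdout_splits(n_samples, n_runs=10, test_fraction=0.33, seed=None):
    """Yield (train_idx, test_idx) pairs for repeated random hold-out."""
    rng = random.Random(seed)  # a fixed seed makes the splits repeatable
    n_test = round(n_samples * test_fraction)
    for _ in range(n_runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def recall(y_true, y_pred, positive=1):
    """Fraction of positive samples that were predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def average_scores(labels, predict, n_runs=10, seed=None):
    """Average recall and accuracy over the repeated splits.

    `predict(train_idx, test_idx)` is a hypothetical callback that trains on
    the training indices and returns one prediction per test index.
    """
    recalls, accuracies = [], []
    for train_idx, test_idx in holdout_splits(len(labels), n_runs, seed=seed):
        y_true = [labels[i] for i in test_idx]
        y_pred = predict(train_idx, test_idx)
        recalls.append(recall(y_true, y_pred))
        accuracies.append(accuracy(y_true, y_pred))
    return mean(recalls), mean(accuracies)
```

Passing the same `seed` twice yields the same splits, which is the purpose of the optional seed input listed above.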
This script is used to generate the call graphs.
- The path to the APK files directory,
- and the path to the Android platform directory
- 4 directories created inside the provided APK files directory. They are:
- "graphs" (it contains the call graphs),
- "family",
- "package",
- "class" directories (they contain the abstraction of the call graphs to family, package, and class modes respectively).
Example:
python mamadroid.py -f my_dataset -d android/
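
To illustrate what abstracting a call graph to family or package mode means, here is a minimal sketch. It is not the code of `mamadroid.py`: the prefix lists are abbreviated and illustrative, and the `self-defined` fallback label and all function names are assumptions made for this example.

```python
# Illustrative, abbreviated prefix lists (assumptions, not the lists used by mamadroid.py).
FAMILY_PREFIXES = ("android", "com.google", "java", "javax")
PACKAGE_PREFIXES = ("android.telephony", "android.app", "java.lang", "java.io")


def abstract_call(call, known_prefixes, default="self-defined"):
    """Return the longest known prefix of a fully qualified call, or `default`."""
    matches = [p for p in known_prefixes if call == p or call.startswith(p + ".")]
    return max(matches, key=len) if matches else default


def abstract_edges(edges, known_prefixes):
    """Abstract both ends of every (caller, callee) edge of a call graph."""
    return [
        (abstract_call(caller, known_prefixes), abstract_call(callee, known_prefixes))
        for caller, callee in edges
    ]
```

For example, `abstract_call("android.telephony.SmsManager.sendTextMessage", FAMILY_PREFIXES)` gives `android`, while the same call with `PACKAGE_PREFIXES` gives `android.telephony`.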
This script creates the features files for family and package modes.
- The names of the call graphs datasets in this format (database1:database2:database3).
You need to move the call graphs generated with the `mamadroid.py` script to the "graphs"
directory, which is in the same directory as the `mamadroid.py` and `MaMaStat.py` scripts,
- Flag to write intermediate files or not.
INPUTs are explained in more detail in the script.
- The features files (one per indicated database) that are created as "name_of_the_database".csv files in the folders Features/Families and Features/Packages
Example:
python MaMaStat.py -d Trial1 -wf N
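
MaMaDroid models abstracted call sequences as a Markov chain, so a feature file row can be thought of as a flattened matrix of transition probabilities between families or packages. The sketch below illustrates that idea only. It is not the code of `MaMaStat.py`, and the function name and the example state list are hypothetical.

```python
from collections import Counter


def transition_features(edges, states):
    """Flatten the row-normalised transition matrix over `states`.

    `edges` is a list of (source_state, target_state) pairs, for example the
    abstracted caller/callee pairs of one app. Rows without any outgoing
    transition are left as zeros.
    """
    counts = Counter(edges)
    features = []
    for src in states:
        total = sum(counts[(src, dst)] for dst in states)
        for dst in states:
            features.append(counts[(src, dst)] / total if total else 0.0)
    return features
```

With `n` states the vector has `n * n` entries, one per ordered pair of states.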
This script performs MaMaDroid's classification using a Random Forest classifier. Scores (Precision, Recall, F1-score) are calculated using 10-fold cross-validation with and without PCA for family and package modes.
- The path to the CSV features files of family mode, for drebin, 2013, 2014, 2015, 2016,
oldbenign, and newbenign datasets. These files are generated by `MaMaStat.py` script
in Features/Families,
- The path to the CSV features files of package mode, for drebin, 2013, 2014, 2015, 2016,
oldbenign, and newbenign datasets. These files are generated by `MaMaStat.py` script
in Features/Packages.
- The name of the txt file on which the results will be written,
- and optionally, the seed to fix for the experiments.
- The results text file.
Example:
python classification_mamadroid.py -pf Features/Families -pp Features/Packages -fs file_scores
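
For readers unfamiliar with 10-fold cross-validation, the following sketch shows how the sample indices can be partitioned. It is a simplified, standard-library illustration under our own assumptions (plain shuffling, no stratification) and not the code of `classification_mamadroid.py`. The function name is hypothetical.

```python
import random


def kfold_indices(n_samples, n_folds=10, seed=None):
    """Yield (train_idx, test_idx) pairs so that each sample is tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed gives repeatable folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx
```

A classifier is then trained on `train_idx` and scored on `test_idx` for each of the 10 folds, and the scores are averaged.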
- To build RevealDroid, please follow the instructions in https://bitbucket.org/joshuaga/revealdroid/src/master/
- Download the CSV file from AndroZoo (https://androzoo.uni.lu/lists) and put it in your home directory ~/. It will be used if you download apps from AndroZoo using their md5.
- The path to the txt file containing the hashes,
- The name of your dataset: (e.g., my_dataset),
- A valid AndroZoo APIKEY
- The script downloads the apps from AndroZoo and extracts the features that are stored in:
- data/apiusage/my_dataset
- data/native_external_calls/my_dataset
- android-reflection-analysis/data/my_dataset
- The name of your malware datasets separated by space. Note that the features of these datasets should be located in /data/apiusage/malware{i} /data/native_external_calls/malware{i} and ../android-reflection-analysis/data/malware{i}
- The name of your goodware datasets separated by space. Note that the features of these datasets should be located in /data/apiusage/goodware{i} /data/native_external_calls/goodware{i} and ../android-reflection-analysis/data/goodware{i}
- The name to be used to save the roc curve figure
- The script runs 10-fold cross-validation and prints average precision, average recall, and average F1-score for both malware and goodware
- It also generates the PR curve
- The path to the file that contains hashes and their corresponding families separated by space. This file is located in dataset/revealdroid for both genome and all the malware datasets used in the experiments
- The name of your malware datasets to consider. They should be separated by space. Note that the features of these datasets should be located in /data/apiusage/malware{i} /data/native_external_calls/malware{i} and ../android-reflection-analysis/data/malware{i}.
Also, for genome, you should use "drebin" name as this collection contains the genome apps.
- The name to be used to save the roc curve figure
- The script runs 10-fold cross-validation and prints average accuracy
After the publication of our paper, DroidCat's author contacted us and we provided him with more details on the dataset mismatches discussed in our paper. We note that our reproduction attempt of DroidCat was performed with the latest version of the DroidCat artefacts publicly available at the time (Repo: https://bitbucket.org/haipeng_cai/droidcat/, latest commit: d108ace0ddb7c56c8f4ebce02801bfee2a3c5d24, March 31st, 2019). Hence, the dataset mismatches in the DroidCat artefacts that we described in our paper may have been fixed by the time you read this message (i.e., after August 4th, 2021).
- Make sure that droidcat repo is in your home directory ~/
- Install the tools and dependencies listed in https://bitbucket.org/haipeng_cai/droidfax/src/master/portable/README
- Make sure to install the android sdk manager in your home directory in ~/.android
- You can also use our helper script install.sh, but make sure to change the path of your java and the JAVA_HOME environment variable.
- A text file that contains the sha256. Note that the lists of hashes can be found in dataset/droidcat,
- The name of your dataset: (e.g., my_dataset),
- A valid AndroZoo APIKEY
- The apps are saved in ~/droidcat/droidcat/testbed/inputs/my_dataset
- You will need to create a key using keytool, then adapt the script signandalign.sh with the details of your key
- Put the key you have created in ~/droidcat/droidcat/scripts
- You will also have to create a file "droidcat_keytool_password" where you store your password
- The name of your dataset. Note that the apps should be stored in: ~/droidcat/droidcat/testbed/inputs/my_dataset
- The instrumented apps are saved in ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset
- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset
- The traces are saved in ~/droidcat/droidcat/testbed/monkey_results/my_dataset
- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset and the traces should be located in ~/droidcat/droidcat/testbed/monkey_results/my_dataset
- GeneralFeatures are saved in ~/droidcat/droidcat/testbed/allGeneralReports/my_dataset/gfeatures.txt
- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset and the traces should be located in ~/droidcat/droidcat/testbed/monkey_results/my_dataset
- ICCFeatures are saved in ~/droidcat/droidcat/testbed/allICCReports/my_dataset/iccfeatures.txt
- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset and the traces should be located in ~/droidcat/droidcat/testbed/monkey_results/my_dataset
- SecurityFeatures are saved in ~/droidcat/droidcat/testbed/allSecurityReports/my_dataset/securityfeatures.txt
- For each dataset, put the 3 feature files in the same folder.
- Put the datasets directories (created in the previous step) in a directory named "features"
- Put your "features" directory in ~/droidcat/droidcat/ML
- Based on the lists of hashes in dataset/DroidCat directory, you will end up with the following datasets: malware-2017-more, newzoo2011, vs2013, vs2014, vs2015, vs2016, zoo2010, zoo2011, zoo2012, zoo2017, zoobenign2010, zoobenign2011, zoobenign2012, zoobenign2013, zoobenign2014, zoobenign2015, zoobenign2016, zoobenign2017
Example:
~/droidcat/droidcat/ML/zoobenign2014 should contain gfeatures.txt, iccfeatures.txt, and securityfeatures.txt of zoobenign2014 dataset
- The new features are stored in features_droidcat_byfirstseen
Note that for reproducible experiments, you can uncomment the corresponding lines in the following files: common.py, configs.py, family_detection.py, featureLoader_wdate.py, malware_detection.py, plot_roc.py
- This script prints for each of the 4 datasets (D1617, D1415, D1213, D0911): accuracy, recall, precision, and F1-score
- Download the CSV file from AndroZoo (https://androzoo.uni.lu/lists) and put it in your home directory ~/
- Change the APIKEY variable in ~/droidcat/droidcat/ML/configs.py with your AndroZoo APIKEY
- This script prints for each of the 4 datasets (D1617, D1415, D1213, D0911): accuracy, recall, precision, and F1-score
- type: det for malware detection and fam for family detection
- file: The name of the output roc curve file
- The roc curve file
- You need to have the following Python 3 libraries installed: networkx, androguard, numpy, and sklearn
- A text file that contains the sha256. Note that the lists of hashes can be found in dataset/malscan,
- The year of your dataset,
- The type of your apps; malware or goodware
- A valid AndroZoo APIKEY,
- The apps are saved in apps/year/type directory
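
The download step can be pictured with the sketch below, which builds AndroZoo download URLs from a list of sha256 hashes and stores the apps under apps/year/type. It is not the repository's download script: the function names and the `.apk` file naming are assumptions, and the URL format is the one documented by AndroZoo at the time of writing.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlretrieve

ANDROZOO_DOWNLOAD = "https://androzoo.uni.lu/api/download"


def read_hashes(hash_file):
    """Read one sha256 per line, skipping blank lines."""
    with open(hash_file) as handle:
        return [line.strip() for line in handle if line.strip()]


def download_url(sha256, apikey):
    """Build the AndroZoo download URL for one app."""
    return f"{ANDROZOO_DOWNLOAD}?{urlencode({'apikey': apikey, 'sha256': sha256})}"


def target_path(sha256, year, app_type, root="apps"):
    """Destination apps/year/type/<sha256>.apk (the file naming is an assumption)."""
    return Path(root) / str(year) / app_type / f"{sha256}.apk"


def download_all(hash_file, year, app_type, apikey):
    """Download every listed app that is not already on disk."""
    for sha256 in read_hashes(hash_file):
        dest = target_path(sha256, year, app_type)
        dest.parent.mkdir(parents=True, exist_ok=True)
        if not dest.exists():
            urlretrieve(download_url(sha256, apikey), str(dest))
```

Skipping files that already exist makes it possible to resume an interrupted download without fetching the same apps again.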
- The path to your dir of APK files,
- The path to the output files
- The call graphs are saved in the path of output files you have provided
Example:
python3 CallGraphExtraction.py -f apps/2011/malware -o callgraphs/2011/malware
- The path to your dir of call graphs. Note that your directory should contain both malware and goodware folders with their call graphs,
- The path to the output file
- The type of centrality: degree, katz, closeness, or harmonic
- The csv file of the chosen centrality is saved in the output file you have provided
Example:
python3 FeatureExtraction.py -d callgraphs/2011 -o features/2011 -c degree
The script generates the file features/2011/degree.csv
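
As an illustration of the degree option, the sketch below computes degree centrality on a call graph given as (caller, callee) pairs, using the common definition of a node's number of incident edges divided by the number of other nodes. It is a dependency-free sketch and not the code of `FeatureExtraction.py`, which relies on the libraries listed above. The `feature_vector` helper and its list of API names are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality of each node: (in-degree + out-degree) / (n - 1)."""
    degree = defaultdict(int)
    nodes = set()
    for caller, callee in set(edges):  # ignore duplicated edges
        nodes.update((caller, callee))
        degree[caller] += 1
        degree[callee] += 1
    scale = len(nodes) - 1
    return {node: (degree[node] / scale if scale > 0 else 0.0) for node in nodes}


def feature_vector(centrality, api_names):
    """Keep the centrality of selected API nodes, using 0.0 for absent ones."""
    return [centrality.get(api, 0.0) for api in api_names]
```

Each app then contributes one row of centrality values, which is one plausible layout for a file such as degree.csv.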
- The path to your dir of csv files generated in the previous step,
- The path to the output file
- The type of centrality: degree, closeness, harmonic, katz, average, or concatenate. Note that for degree, closeness, harmonic, and katz, you must have only the csv file of the chosen centrality. As for average and concatenate, you must have the csv files of degree, closeness, harmonic, and katz centralities.
- A CSV file that contains F1, Precision, Recall, Accuracy, TPR, FPR, TNR, and FNR for the KNN-1, KNN-3, and Random Forest classifiers. The file is saved in the output path you have provided
Example:
python3 Classification.py -d features/2011 -o results/2011 -t degree
The script generates the file results/2011/degree_result.csv
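
The eight scores reported in the result file all derive from the four confusion-matrix counts. The sketch below shows the standard definitions. It is an illustration and not the code of `Classification.py`, and the function name is hypothetical.

```python
def scores_from_counts(tp, fp, tn, fn):
    """Compute the reported scores from true/false positive/negative counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is absent
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "F1": ratio(2 * precision * recall, precision + recall),
        "Precision": precision,
        "Recall": recall,
        "Accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "TPR": recall,  # true positive rate is the same quantity as recall
        "FPR": ratio(fp, fp + tn),
        "TNR": ratio(tn, fp + tn),
        "FNR": ratio(fn, fn + tp),
    }
```

Note that TPR and FNR sum to 1, as do TNR and FPR, so the file contains some redundancy by design.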
- Malware:
- They are provided by original authors.
- Drebin_Malware_APK_Done.txt: List of hashes of APKs that passed the features extraction.
- Drebin_Malware_APK_Errors.txt: List of hashes of APKs that failed in the features extraction.
- Goodware:
- (They were collected from AndroZoo)
- Drebin_Goodware_Original_All.txt: List of hashes of Drebin original APKs.
- Drebin_Goodware_Original_Found.txt: List of hashes of Drebin original APKs that are available in AndroZoo.
- Drebin_Goodware_Original_NotFound.txt: List of hashes of Drebin original APKs that are not available in AndroZoo.
- Drebin_Goodware_Original_Failed.txt: List of hashes of Drebin original APKs that failed in the features extraction.
- Drebin_Goodware_CompleteWith.txt: List of hashes of APKs that are used to complete the dataset.
- Drebin_Goodware_Orig+Completed.txt: List of hashes of original APKs that passed the features extraction and of the APKs that were used to complete the goodware dataset.
-
Malware:
- 2013, 2014, 2015, and 2016 are collected from VirusShare. They are also available in AndroZoo
- drebin is provided by Drebin original authors
- 2013_hashes_found.txt: List of hashes of 2013 dataset APKs that we were able to collect.
- 2013_hashes_done.txt: List of hashes of 2013 dataset APKs that passed the features extraction.
- 2013_hashes_failed.txt: List of hashes of 2013 dataset APKs that failed in the features extraction.
- 2014_hashes_found.txt: List of hashes of 2014 dataset APKs that we were able to collect.
- 2014_hashes_done.txt: List of hashes of 2014 dataset APKs that passed the features extraction.
- 2014_hashes_failed.txt: List of hashes of 2014 dataset APKs that failed in the features extraction.
- 2015_hashes_found.txt: List of hashes of 2015 dataset APKs that we were able to collect.
- 2015_hashes_notFound.txt: List of hashes of 2015 dataset APKs that we were not able to collect.
- 2015_hashes_done.txt: List of hashes of 2015 dataset APKs that passed the features extraction.
- 2015_hashes_failed.txt: List of hashes of 2015 dataset APKs that failed in the features extraction.
- 2016_hashes_found.txt: List of hashes of 2016 dataset APKs that we were able to collect.
- 2016_hashes_done.txt: List of hashes of 2016 dataset APKs that passed the features extraction.
- 2016_hashes_failed.txt: List of hashes of 2016 dataset APKs that failed in the features extraction.
- drebin_hashes_found.txt: List of hashes of drebin dataset APKs that we were able to collect.
- drebin_hashes_done.txt: List of hashes of drebin dataset APKs that passed the features extraction.
- drebin_hashes_failed.txt: List of hashes of drebin dataset APKs that failed in the features extraction.
-
Goodware:
- oldbenign dataset is collected from PlayDrone, and it is available in AndroZoo
- newbenign dataset is collected from AndroZoo.
- oldbenign_hashes_found.txt: List of hashes of oldbenign dataset APKs that we were able to collect.
- oldbenign_hashes_done.txt: List of hashes of oldbenign dataset APKs that passed the features extraction.
- oldbenign_hashes_failed.txt: List of hashes of oldbenign dataset APKs that failed in the features extraction.
- oldbenign_namesApp_found.txt: List of app names of oldbenign dataset APKs that we were able to collect.
- oldbenign_namesApp_done.txt: List of app names (as provided by PlayDrone) of oldbenign dataset APKs that passed the features extraction.
- oldbenign_namesApp_failed.txt: List of app names of oldbenign dataset APKs that failed in the features extraction.
- newbenign_hashes_found.txt: List of hashes of newbenign dataset APKs that we were able to collect.
- newbenign_hashes_notFound.txt: List of hashes of newbenign dataset APKs that we were not able to collect.
- newbenign_hashes_done.txt: List of hashes of newbenign dataset APKs that passed the features extraction.
- newbenign_hashes_failed.txt: List of hashes of newbenign dataset APKs that failed in the features extraction.
- newbenign_UsedToComplete.txt: List of hashes of APKs that are used to complete the newbenign original dataset.
-
Malware:
- drebin_sha.txt: List of drebin apps
- drebin_sha_intersection_ok_all_features.txt: List of drebin apps after features extraction
- remaining_sha_found_all.txt: List of VirusTotal apps
- remain_sha_intersection_ok_all_features.txt: List of VirusTotal apps after features extraction
- virusshare_md5.txt: List of VirusShare apps
- virusshare_sha_md5_all_intersection_ok_all_features.txt: List of VirusShare apps after features extraction
-
Goodware:
- benign_androzoo.txt: List of benign apps
- benign_androzoo_intersection_ok_all_features.txt: List of benign apps after features extraction
-
Family labels:
- all_labels_malware.txt: List of family labels for drebin, virusshare, and virustotal apps
- genome_sha256_labels.txt: List of family labels for genome apps
All these apps can be downloaded from AndroZoo
-
Malware:
- apks.malware-2017-more
- apks.zoo2017
- apks.vs2016
- apks.vs2015
- apks.vs2014
- apks.vs2013
- apks.zoo2012
- apks.newzoo2011
- apks.zoo2011
- apks.zoo2010
-
Goodware:
- sha256.benign2017
- apks.zoobenign2016
- apks.zoobenign2015
- apks.zoobenign2014
- apks.zoobenign2013
- apks.zoobenign2012
- apks.zoobenign2011
- apks.zoobenign2010
All these apps can be downloaded from AndroZoo
-
Malware:
- 2018_malware.txt
- 2017_malware.txt
- 2016_malware.txt
- 2015_malware.txt
- 2014_malware.txt
- 2013_malware.txt
- 2012_malware.txt
- 2011_malware.txt
-
Goodware:
- 2018_benign.txt
- 2017_benign.txt
- 2016_benign.txt
- 2015_benign.txt
- 2014_benign.txt
- 2013_benign.txt
- 2012_benign.txt
- 2011_benign.txt
- Repositories:
- https://bitbucket.org/gianluca_students/mamadroid_code/
- https://github.com/MLDroid/drebin
- https://bitbucket.org/joshuaga/revealdroid/src/master/
- https://bitbucket.org/joshuaga/android-reflection-analysis/src/master/
- https://bitbucket.org/haipeng_cai/droidcat/src/master/
- https://github.com/malscan-android/MalScan/tree/master/MalScan-code
To try to achieve replicable results, the results presented in our paper were computed after setting the seed values to:
- MaMaDroid:
- 388652140
- Drebin
- 388652140
- RevealDroid:
- 123456789
- DroidCat
- 480509637
- MalScan
- 480509637
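
The effect of fixing a seed can be checked with the short sketch below, which uses the seed values listed above with Python's standard random module. It is an illustration only: the helper name is hypothetical, and the actual scripts may pass the seed to other libraries (for instance as a random state) rather than to `random`.

```python
import random

# Seed values listed above, keyed by approach
SEEDS = {
    "MaMaDroid": 388652140,
    "Drebin": 388652140,
    "RevealDroid": 123456789,
    "DroidCat": 480509637,
    "MalScan": 480509637,
}


def seeded_shuffle(items, approach):
    """Return a shuffled copy of `items` that is identical on every run."""
    rng = random.Random(SEEDS[approach])  # independent generator per call
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled
```

Calling `seeded_shuffle(range(10), "Drebin")` twice returns the same order, which is what makes a train/test split repeatable across runs.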