This repository contains scripts to predict whether an unlabelled Android app observation is malware or benignware. The feature extraction code is available at https://github.com/33onethird/feature-extraction.
We used Debian machines for our experiments.
The scripts require a working Python 3 installation, i.e.
sudo apt install python3
The training script also needs the python3-tk package:
sudo apt install python3-tk
Now clone the repository:
git clone https://github.com/33onethird/malware-test
cd malware-test
Next, install the Python packages listed in requirements.txt. You might want to use a virtualenv for that.
pip3 install -r requirements.txt
To set up the training environment, you first need to run the data preparation scripts (acc_features.py and gen_vectors.py).
The observation data should be the output of the feature extraction, i.e. text files with string identifiers. Put all observation data into a directory. Then run acc_features.py with this directory as input.
./acc_features.py -i [observation_directory]
By default, it uses the directory feature-vectors as input. If you create such a directory in the repository root, you can run:
./acc_features.py
You can also specify the output directory of acc_features.py. For further information, run:
./acc_features.py -h
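The accumulation step described above can be pictured with a minimal sketch. It is not the code of acc_features.py; it assumes that each observation file holds one string identifier per line, and the function name accumulate_features is hypothetical.

```python
from pathlib import Path


def accumulate_features(observation_dir):
    """Return the sorted list of distinct identifiers found in all files of a directory.

    Assumption: every observation file contains one string identifier per line.
    """
    features = set()
    for path in Path(observation_dir).iterdir():
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                identifier = line.strip()
                if identifier:  # skip blank lines
                    features.add(identifier)
    return sorted(features)
```

Sorting the result gives every feature a stable position, which is convenient when the identifiers are later mapped to vector indices.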
After acc_features.py has emitted all features, you need to run gen_vectors.py. It creates binary vectors from the observation text files. You need to specify the path to the feature file (the output of acc_features.py) and the observation data directory (which should be the same as the input of acc_features.py).
You also need to provide a .csv file that identifies all malware observations. Each malware observation needs its own row, with its id (filename) in the first column. The rest of the file's content is ignored.
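As an illustration of the expected .csv layout, the following sketch reads the first column of every row into a set of malware ids. It is only an assumption about how such a file could be parsed, not the parser used by gen_vectors.py, and the function name read_malware_ids is hypothetical.

```python
import csv


def read_malware_ids(csv_path):
    """Collect the first column of every non-empty row as a set of malware ids.

    All other columns are ignored, matching the description above.
    """
    malware_ids = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if row and row[0].strip():
                malware_ids.add(row[0].strip())
    return malware_ids
```

Note that this sketch would treat a header row as an id as well, so a file checked this way should either have no header or have it removed first.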
Important: Since there is a large number of high-dimensional observations, gen_vectors.py cannot store all of them in memory. Therefore, it processes them in batches and emits an output file for each batch. You can specify the number of observations per batch. By default, it processes batches of 1000 observations, which uses around 2.5 GB of memory. Adjust this number according to your memory constraints.
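To pick a batch size, a rough estimate can be derived from the figures above (1000 observations for around 2.5 GB). The sketch below assumes that memory use grows linearly with the batch size, which is an assumption and not a measured property of gen_vectors.py; actual usage also depends on the number of features. The function name estimate_batch_size is hypothetical.

```python
def estimate_batch_size(memory_budget_gb, reference_batch=1000, reference_memory_gb=2.5):
    """Scale the documented reference point linearly to a given memory budget.

    Assumption: memory use is proportional to the number of observations per batch.
    """
    scaled = memory_budget_gb * reference_batch / reference_memory_gb
    return max(1, int(scaled))  # never suggest a batch size below 1


print(estimate_batch_size(1.0))  # 400 observations for a 1 GB budget
```

Treat the result as a starting point and leave some headroom for the rest of the system.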
You can also specify the output directory of gen_vectors.py.
For further information, run:
./gen_vectors.py -h
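The idea behind the binary vectors and the batch processing can be sketched as follows. This is a simplified illustration, not the implementation of gen_vectors.py: the feature names, the observation names, and the helpers binary_vector and batches are all hypothetical, and the real script writes one output file per batch instead of printing.

```python
from itertools import islice


def binary_vector(observation_features, feature_index):
    """Return a list of 0/1 flags, one per known feature."""
    vector = [0] * len(feature_index)
    for feature in observation_features:
        position = feature_index.get(feature)
        if position is not None:  # ignore identifiers missing from the feature file
            vector[position] = 1
    return vector


def batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical feature file content and observations, for illustration only.
features = ["feature_a", "feature_b", "feature_c"]
feature_index = {name: i for i, name in enumerate(features)}
observations = {
    "app1": {"feature_a", "feature_c"},
    "app2": {"feature_b"},
    "app3": set(),
}

# Only one batch of vectors is held in memory at a time.
for batch in batches(sorted(observations), batch_size=2):
    vectors = [binary_vector(observations[name], feature_index) for name in batch]
    print(batch, vectors)
```

Because each batch is built and released before the next one starts, peak memory is bounded by the batch size rather than by the total number of observations.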
After everything has been set up, you can fit machine learning models to the training data. This step is straightforward: call experiments.py and specify the ML algorithm, e.g.
./experiments.py -a svm
The script will output the model performance and dump the learned parameters.
After you have trained a model with experiments.py, you can use it to classify unlabelled Android applications. This is done with predict.py, which takes a directory of .apk files as input and outputs a list assigning malware or benignware to each input .apk. You can also specify the algorithm to use with the -a flag. It will always use the latest trained parameters.
You can also use predict.py within a Python script using from predict import predict.
For further information, refer to ./predict.py -h.