ML algorithms for Android malware detection

zym-wade/malware-test


Android Malware detection with machine learning

This repository contains scripts to predict whether an unlabelled Android app observation is malware or benignware. The feature extraction part can be accessed at https://github.com/33onethird/feature-extraction.

Setup

Environment setup

We used Debian machines for our experiments.

The scripts require a working Python 3 installation:

sudo apt install python3

The training script also needs the python3-tk package:

sudo apt install python3-tk

Now clone the repository:

git clone https://github.com/33onethird/malware-test
cd malware-test

Next, install the Python packages listed in requirements.txt. You may want to use a virtualenv for this.

pip3 install -r requirements.txt

Data preparation

To set up the training environment, first run the data preparation scripts, acc_features.py and then gen_vectors.py.

acc_features

The observation data should be the output of the feature extraction, i.e. text files containing string identifiers. Put all observation data into one directory, then run acc_features.py with this directory as input.

./acc_features.py -i [observation_directory]

If no input directory is given, the script uses the directory feature-vectors by default. If you create such a directory in the repository root, you can run:

./acc_features.py

You can also specify the output directory of acc_features.py. For further information, run:

./acc_features.py -h
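The accumulation step can be pictured with a minimal sketch. This is not the code of acc_features.py; it assumes each observation file holds one feature string per line, and the function name accumulate_features is hypothetical.

```python
from pathlib import Path


def accumulate_features(observation_dir):
    """Collect every distinct feature string found in a directory of text files.

    Assumption (not confirmed by the repository): one feature string per line.
    """
    features = set()
    for path in sorted(Path(observation_dir).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                line = line.strip()
                if line:  # ignore blank lines
                    features.add(line)
    # A sorted list gives every feature a stable position for later vectorisation.
    return sorted(features)
```

Sorting the result is a design choice of this sketch: it keeps the feature order reproducible between runs.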

gen_vectors

After acc_features.py has written out all features, you need to run gen_vectors.py. It creates binary vectors from the observation text files. You need to specify the path to the feature file (the output of acc_features.py) and the observation data directory (which should be the same as the input of acc_features.py).
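As an illustration of what a binary vector means here, the following sketch maps one observation onto a fixed feature list. It is an assumption about the encoding (1 if the feature occurs in the observation, 0 otherwise), and the helper names build_feature_index and to_binary_vector are hypothetical, not taken from gen_vectors.py.

```python
def build_feature_index(feature_list):
    """Map each known feature string to its position in the vector."""
    return {feature: position for position, feature in enumerate(feature_list)}


def to_binary_vector(observation_features, feature_index):
    """Return a 0/1 list with a 1 at the position of every known feature present."""
    vector = [0] * len(feature_index)
    for feature in observation_features:
        position = feature_index.get(feature)
        if position is not None:  # features missing from the index are ignored
            vector[position] = 1
    return vector
```

For example, with the feature list ["a", "b", "c"], an observation containing "c" and "a" would be encoded as [1, 0, 1].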

You also need to provide a .csv file that identifies all malware observations. It needs to contain the id (filename) of each malware observation in the first column of a separate row. The rest of the file's content is ignored.
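A minimal sketch of reading such a file, assuming a comma-separated layout without a header row (the README does not say whether a header is expected). The function name load_malware_ids is hypothetical.

```python
import csv


def load_malware_ids(csv_path):
    """Return the set of ids found in the first column of the CSV file."""
    malware_ids = set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Only the first column matters; empty rows are skipped.
            if row and row[0].strip():
                malware_ids.add(row[0].strip())
    return malware_ids
```

An observation would then be labelled as malware if its filename is in the returned set, and as benignware otherwise.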

Important: Because there are many high-dimensional observations, gen_vectors.py cannot hold all of them in memory. It therefore processes them in batches and writes one output file per batch. You can specify the number of observations per batch. By default, it processes batches of 1000 observations, which uses around 2.5 GB of memory. Adjust this number according to your memory constraints.
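The batching idea, and a rough way to pick a batch size, can be sketched as follows. Both helpers are hypothetical. The size estimate assumes memory use grows linearly with the batch size, starting from the figure of roughly 2.5 GB per 1000 observations given above; actual usage may differ.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def batch_size_for_memory(memory_budget_gb):
    """Estimate a batch size for a memory budget, assuming linear scaling
    from about 2.5 GB per 1000 observations (a rough assumption)."""
    return max(1, int(1000 * memory_budget_gb / 2.5))
```

Under that assumption, a budget of 1 GB would correspond to batches of about 400 observations.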

You can also specify the output directory of gen_vectors.py.

For further information, run:

./gen_vectors.py -h

Usage

Training

After everything has been set up, you can fit machine learning models to the training data. Call experiments.py and specify the ML algorithm, e.g.

./experiments.py -a svm

The script will output the model performance and dump the learned parameters.

Testing

After you have trained a model with experiments.py, you can use it to classify unlabelled Android applications with predict.py. predict.py takes a directory of .apk files as input and outputs a list that labels each input .apk as malware or benignware. You can specify the algorithm to use with the -a flag. predict.py always uses the most recently trained parameters.

You can also use predict.py within a Python script using from predict import predict.

For further information, run ./predict.py -h.
