Code associated with the paper: Towards Obfuscated Malware Detection Using Deep Learning Models and Transfer Learning
The goodware samples used for this project were obtained from two different sources.
The first was PortableApps: we manually downloaded all the software the service offers and moved all the files from all the folders to a single location.
In parallel, for the second source, we installed a 32-bit version of Windows 10, copied all files from the System32 folder, and moved all the files from all the folders to that same location. Then the preprocessing stage began.
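The flattening step described above can be sketched as follows. This is only an illustration, not the procedure the authors used: the function names flatten_into and unique_name are hypothetical, files are copied rather than moved, and the numeric-suffix handling of duplicate file names is an assumption, since the text does not say how name collisions were resolved.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def unique_name(name: str, taken: set[str]) -> str:
    """Return `name`, or `name` with a numeric suffix if it is already taken."""
    if name not in taken:
        return name
    stem, suffix = Path(name).stem, Path(name).suffix
    counter = 1
    while f"{stem}_{counter}{suffix}" in taken:
        counter += 1
    return f"{stem}_{counter}{suffix}"


def flatten_into(source_root: Path, destination: Path) -> int:
    """Copy every file under `source_root` into the single folder `destination`.

    `destination` is assumed to live outside `source_root`.
    Returns the number of files copied.
    """
    destination.mkdir(parents=True, exist_ok=True)
    taken = {p.name for p in destination.iterdir()}
    copied = 0
    for src in sorted(source_root.rglob("*")):
        if src.is_file():
            name = unique_name(src.name, taken)  # assumption: rename on collision
            shutil.copy2(src, destination / name)
            taken.add(name)
            copied += 1
    return copied
```

For example, flatten_into(Path("PortableApps"), Path("goodware")) would gather every file below a hypothetical PortableApps folder into a hypothetical goodware folder.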
For the preprocessing stage, we ran the file command from an Ubuntu installation on each file and removed every file whose output did not include the strings "PE32 executable" and "for MS Windows". This process can be seen in dataset_creation_scripts/cleanType.py. This left us with a total of 15628 goodware samples.
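A minimal sketch of this kind of type filter is shown below. It is not the content of cleanType.py: the function names are hypothetical, it assumes both strings must be present for a file to be kept, and it assumes the file utility is available on the PATH. The dry_run default only lists the rejected files instead of deleting them.

```python
from __future__ import annotations

import subprocess
from pathlib import Path

# Strings that the `file` description must contain (taken from the text above).
REQUIRED_STRINGS = ("PE32 executable", "for MS Windows")


def is_windows_pe32(description: str) -> bool:
    """Return True if a `file` description contains both required strings."""
    return all(s in description for s in REQUIRED_STRINGS)


def describe(path: Path) -> str:
    """Run `file -b` (brief output) on one path and return its description."""
    result = subprocess.run(
        ["file", "-b", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def clean_by_type(folder: Path, dry_run: bool = True) -> list[Path]:
    """Return the files that fail the filter; delete them unless dry_run is set."""
    rejected = [
        p
        for p in sorted(folder.iterdir())
        if p.is_file() and not is_windows_pe32(describe(p))
    ]
    if not dry_run:
        for p in rejected:
            p.unlink()
    return rejected
```

Note that a description beginning with "PE32+ executable" (a 64-bit PE) does not contain the substring "PE32 executable", so a filter written this way keeps only 32-bit PE files.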
To obtain the malware samples, we contacted the VirusShare staff and requested access to their malware repository. Once access was granted, we downloaded the torrent containing PE files for Microsoft Windows and repeated the cleaning-by-type process using dataset_creation_scripts/cleanType.py. We then randomly sampled the same number of binaries as we had goodware samples to avoid an imbalanced dataset; this can be seen in dataset_creation_scripts/cleanAmount.py.
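The balancing step can be sketched as below. This is an illustration rather than the code in cleanAmount.py: the function name sample_to_size and the fixed seed are assumptions made here so that the selection is reproducible.

```python
from __future__ import annotations

import random
from pathlib import Path


def sample_to_size(paths: list[Path], target: int, seed: int = 0) -> list[Path]:
    """Pick `target` paths uniformly at random, without replacement."""
    if target > len(paths):
        raise ValueError("cannot sample more files than are available")
    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    return sorted(rng.sample(paths, target))
```

With the goodware count reported above, the call would look like sample_to_size(malware_paths, 15628), where malware_paths is a hypothetical list of the malware files that passed the type filter.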
The Python code to identify whether a binary file has been obfuscated (XOR/Shikata ga nai) can be found in the notebook entropy_tester/entropy_tester.ipynb. In the cell called Create and write entropy data, the entropies of every file in the folders indicated by the user (in this case, the folders containing the samples from the previous section) are extracted and saved in a file named entropies.csv, which can be found in the entropy_tester folder. Then, the cell called Read entropy data reads the CSV file. The remaining cells in the notebook test the performance of several machine learning algorithms when identifying obfuscated binary files based on their entropy.
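As a rough illustration of entropy extraction, the sketch below computes the Shannon entropy of the byte distribution of each file and writes one row per file to a CSV. It is an assumption that the notebook uses a single whole-file entropy; it may compute several entropy values per file. The function names and the column names "file" and "entropy" are hypothetical.

```python
from __future__ import annotations

import csv
import math
from collections import Counter
from pathlib import Path


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def write_entropies(folders: list[Path], csv_path: Path) -> None:
    """Write one (file, entropy) row for every file in the given folders."""
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "entropy"])  # hypothetical column names
        for folder in folders:
            for path in sorted(folder.iterdir()):
                if path.is_file():
                    writer.writerow([str(path), shannon_entropy(path.read_bytes())])
```

Encoded or encrypted payloads tend to have a byte distribution closer to uniform, which pushes this value toward 8 bits per byte; that is the intuition behind using entropy as a feature here.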
The script binary2image.py transforms executable files into greyscale images, as described in the paper. The script can be used with python binary2image.py input_folder output_folder, where input_folder contains the binary files and output_folder will contain the resulting images.
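The general idea of mapping each byte to one greyscale pixel can be sketched as follows. This is not the code of binary2image.py: the fixed width of 256 pixels, the zero padding of the last row, and the PGM output format are all assumptions chosen to keep the example dependency-free, and the paper may specify a different layout. The function names are hypothetical.

```python
from __future__ import annotations

import math
from pathlib import Path


def bytes_to_rows(data: bytes, width: int = 256) -> list[bytes]:
    """Split bytes into rows of `width` pixels, zero-padding the last row."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    padded = data.ljust(height * width, b"\x00")
    return [padded[i * width:(i + 1) * width] for i in range(height)]


def write_pgm(rows: list[bytes], out_path: Path) -> None:
    """Write the rows as a binary PGM (P5) greyscale image, one byte per pixel."""
    header = f"P5\n{len(rows[0])} {len(rows)}\n255\n".encode("ascii")
    out_path.write_bytes(header + b"".join(rows))


def convert_folder(input_folder: Path, output_folder: Path, width: int = 256) -> None:
    """Convert every non-empty file in `input_folder` into an image."""
    output_folder.mkdir(parents=True, exist_ok=True)
    for src in sorted(input_folder.iterdir()):
        if src.is_file():
            data = src.read_bytes()
            if data:  # empty files have no pixels to draw
                write_pgm(bytes_to_rows(data, width), output_folder / (src.name + ".pgm"))
```

Each byte value (0 to 255) becomes the intensity of one pixel, so the image height grows with the file size while the width stays fixed.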
The notebook cnn_tester.ipynb reads the images from the previous step and tests the performance of four different CNN architectures when identifying malware: ResNet18, ResNet34, EfficientNetB3, and EfficientNetV2. The notebook can be executed in full just by setting the variable path to the folder containing the images to be processed.