Obfuscated Malware Detection

Code associated with the paper: Towards Obfuscated Malware Detection Using Deep Learning Models and Transfer Learning

GUIDE

Obtaining the samples

Goodware

The goodware samples used for this project were obtained from two different sources.

The first source was PortableApps: we manually downloaded all the software it offers and moved all the files from all the folders to a single location. In parallel, for the second source, we installed a 32-bit version of Windows 10, copied all the files from the System32 folder, and moved them to that same location. The preprocessing stage then began.
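The flattening step described above can be sketched as follows. This is an illustration only, not code from the repository: the function name flatten_into is hypothetical, files are copied rather than moved so the originals stay intact, and resolving name clashes with a numeric suffix is an assumption, since the text does not say how duplicate file names were handled.

```python
import shutil
from pathlib import Path


def flatten_into(src_root: Path, dest: Path) -> int:
    """Copy every regular file under src_root into dest, without subfolders.

    Returns the number of files copied.
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    # sorted() materialises the listing first, so files created in dest
    # during the loop are never picked up again.
    for path in sorted(src_root.rglob("*")):
        if not path.is_file():
            continue
        target = dest / path.name
        counter = 1
        # Assumption: a name clash is resolved by appending _1, _2, ...
        while target.exists():
            target = dest / f"{path.stem}_{counter}{path.suffix}"
            counter += 1
        shutil.copy2(path, target)
        copied += 1
    return copied
```

For example, flatten_into(Path("PortableApps"), Path("goodware")) would gather every file below a hypothetical PortableApps folder into a single goodware folder.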

In the preprocessing stage, we ran the file command from an Ubuntu installation on each file and removed every file whose description did not include the strings PE32 executable and for MS Windows. This process can be seen in dataset_creation_scripts/cleanType.py. This left us with a total of 15628 goodware samples.
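The type filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of cleanType.py: the function names are hypothetical, and the sketch assumes the file utility is available on the PATH.

```python
import subprocess
from pathlib import Path

# Both substrings must appear in the description printed by `file`,
# as stated in the text above.
REQUIRED_STRINGS = ("PE32 executable", "for MS Windows")


def is_windows_pe32(description: str) -> bool:
    """Return True if a `file` description contains both required strings."""
    return all(s in description for s in REQUIRED_STRINGS)


def describe(path: Path) -> str:
    """Run `file --brief` on a path and return its description (hypothetical helper)."""
    result = subprocess.run(
        ["file", "--brief", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def remove_non_pe32(folder: Path) -> int:
    """Delete every regular file in folder that is not a Windows PE32 executable.

    Returns the number of files removed.
    """
    removed = 0
    for path in folder.iterdir():
        if path.is_file() and not is_windows_pe32(describe(path)):
            path.unlink()
            removed += 1
    return removed
```

Because the check is a plain substring match, a 64-bit binary reported as PE32+ executable does not match PE32 executable and would be removed as well.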

Malware

To obtain the malware samples, we contacted the VirusShare staff and requested access to their malware repository. Once access was granted, we downloaded the torrent containing PE files for Microsoft Windows and repeated the cleaning by type using dataset_creation_scripts/cleanType.py. We then randomly sampled the same number of binaries as we had goodware samples to avoid an imbalanced dataset; this can be seen in dataset_creation_scripts/cleanAmount.py.
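The balancing step can be illustrated with a short sketch. It is not the code in cleanAmount.py: the function name sample_paths and the optional seed parameter are assumptions added here to make the draw reproducible.

```python
import random


def sample_paths(paths, n, seed=None):
    """Return n paths drawn uniformly at random, without replacement.

    The input is sorted first so that a given seed always yields the
    same selection, regardless of directory listing order.
    """
    paths = sorted(paths)
    if n > len(paths):
        raise ValueError("cannot sample more files than are available")
    rng = random.Random(seed)
    return rng.sample(paths, n)
```

With the goodware count reported above, a call such as sample_paths(malware_paths, 15628), where malware_paths is a hypothetical list of the cleaned malware files, would select as many malware binaries as there are goodware samples.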

Detecting obfuscation

The Python code to identify whether a binary file has been obfuscated (XOR/Shikata ga nai) can be found in the notebook entropy_tester/entropy_tester.ipynb. In the cell called Create and write entropy data, the entropies for every file in the folders indicated by the user (in this case, the folders containing the samples from the previous section) are extracted and saved in a file named entropies.csv, which can be found in the entropy_tester folder. Then, the cell called Read entropy data reads the CSV file. The remaining cells in the notebook test the performance of several machine learning algorithms when identifying obfuscated binary files based on their entropy.
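As an illustration of the kind of feature involved, the sketch below computes the Shannon entropy of a file's bytes. This is an assumption: the text does not state which entropy measure the notebook uses or over which parts of the file, and the function names are hypothetical.

```python
import math
from collections import Counter
from pathlib import Path


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of p * log2(1 / p) over every byte value that occurs.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def file_entropy(path: Path) -> float:
    """Entropy of a whole file, read into memory at once."""
    return shannon_entropy(path.read_bytes())
```

A file made of a single repeated byte has entropy 0, while uniformly distributed bytes approach 8 bits per byte. Encrypted or encoded payloads tend towards the high end of that range, which is why entropy is a common signal for obfuscation.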

Obtaining the images

The script binary2image.py transforms executable files into greyscale images, as described in the paper. The script can be used with python binary2image.py input_folder output_folder, where input_folder contains the binary files and output_folder will contain the resulting images.
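The general idea of mapping bytes to pixels can be sketched with the standard library alone. This is not the implementation in binary2image.py: the fixed width of 256, the zero padding of the last row, and the PGM output format are all assumptions made for this example, and the paper should be consulted for the actual layout.

```python
import math


def bytes_to_pgm(data: bytes, width: int = 256) -> bytes:
    """Encode raw bytes as a binary PGM (P5) greyscale image.

    Each byte becomes one pixel (0 = black, 255 = white). The last row
    is zero-padded so that the pixel count equals width * height.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = max(1, math.ceil(len(data) / width))
    pixels = data.ljust(width * height, b"\x00")
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + pixels
```

Writing the returned bytes to a file with a .pgm extension produces an image that most viewers can open directly.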

Detecting malware

The notebook cnn_tester.ipynb reads the images from the previous step and tests the performance of four different CNN architectures when identifying malware: ResNet18, ResNet34, EfficientNetB3, and EfficientNetV2. The notebook can be executed in full just by setting the variable path to the folder containing the images to be processed.