DasariJayanth/Malware-Detection-in-PE-files-using-Machine-Learning
Everything about this project is explained in detail in the FER (Final Evaluation Report).
This project aims to detect malware in PE (Portable Executable) files using machine learning techniques. We have developed a model that analyzes PE files and predicts whether they contain malware, using hybrid static malware analysis (a combination of PE header, byte n-gram, and opcode n-gram features). This project can be a valuable tool for enhancing cybersecurity measures and protecting systems against malicious software and files.
- PE files CSV, containing metadata and header information (Dataset).
- Byte and asm raw files from the Kaggle Microsoft Malware Classification Challenge (BIG 2015) (Dataset).
- Dataset already has values for the features.
- Created a script to extract those features and the header information from given PE files such as .exe and .dll files.
- Used an Extra-Trees classifier for feature selection, to pick the important feature set from all the available information.
- The dataset has raw byte and asm files. Created separate directories for each type and extracted the file size as a feature for each file.
- Extracted n-grams from byte files (byte n-grams, where n = 1, 2) and asm files (prefixes, keywords, registers, and opcode n-grams, where n = 1, 2, 3, 4) as features for each file.
- Converted asm files to images and extracted the 200 top-performing image pixels as features from each image.
- Used Random Forest for important features selection from all the above features separately for each feature set and merged them.
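The byte n-gram step above can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes each line of a byte file starts with an address token followed by space-separated hex byte tokens, with `??` marking unreadable bytes, and the function names `parse_bytes_text` and `ngram_counts` are hypothetical.

```python
from collections import Counter


def parse_bytes_text(text):
    """Return the byte tokens of a byte dump, dropping the leading address.

    Assumption: every non-empty line looks like "00401000 56 8D 44 24 ...".
    Tokens such as "??" are kept as ordinary tokens.
    """
    tokens = []
    for line in text.splitlines():
        parts = line.split()
        tokens.extend(parts[1:])  # parts[0] is assumed to be the address
    return tokens


def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens (n-grams span line breaks)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    sample = "00401000 56 8D 44 24\n00401010 56 8D ?? 00"
    tokens = parse_bytes_text(sample)
    print(ngram_counts(tokens, 1).most_common(2))
    print(ngram_counts(tokens, 2).most_common(1))
```

The same counting function could be applied to a list of opcode tokens taken from an asm file; how the opcodes are tokenized is left open here.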
Final dataset contains the following features.
- PE Header dataset
- Byte unigrams
- Opcode unigrams
- Top 300 Byte bigrams
- Top 200 Opcode bigrams
- Top 200 Opcode trigrams
- Top 200 Opcode tetragrams
- Top 200 Image Pixels
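The "top N" feature sets listed above come from ranking features by tree-based importance. The sketch below shows one way such a ranking could be done with scikit-learn. The helper name `select_top_k`, the number of trees, and the synthetic data are assumptions for illustration; the project's notebooks may differ.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier


def select_top_k(X, y, k, random_state=0):
    """Return the column indices of the k most important features.

    Importance is taken from a fitted ExtraTreesClassifier.
    RandomForestClassifier exposes the same feature_importances_
    attribute and could be swapped in.
    """
    model = ExtraTreesClassifier(n_estimators=100, random_state=random_state)
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]  # descending
    return [int(i) for i in order[:k]]


if __name__ == "__main__":
    # Synthetic data (hypothetical): only column 2 determines the label.
    rng = np.random.default_rng(0)
    X = rng.random((300, 6))
    y = (X[:, 2] > 0.5).astype(int)
    print(select_top_k(X, y, k=2))
```

Selecting per feature set and then concatenating the kept columns mirrors the "separately for each feature set and merged" approach described earlier.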
Trained various ML models on the above final dataset to classify files as malware or benign.
The evaluation metrics used are accuracy, F1 score, and the confusion matrix.
The Random Forest model performed best among the models tried, which included Gradient Boosting and SVM.
You can download the trained Random Forest model here.
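For reference, the three metrics named above can be computed for a binary malware/benign task as in the sketch below. It is written in plain Python for clarity and assumes labels are encoded as 1 (malware) and 0 (benign); that encoding and the function name `binary_metrics` are assumptions, not taken from the project.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, f1, confusion_matrix) for 0/1 labels.

    The confusion matrix is [[TN, FP], [FN, TP]]: rows are true labels,
    columns are predicted labels.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1, [[tn, fp], [fn, tp]]


if __name__ == "__main__":
    acc, f1, cm = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
    print(acc, f1, cm)
```

For malware detection the confusion matrix is worth reading alongside accuracy, since it separates missed malware (false negatives) from benign files flagged by mistake (false positives).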
Clone the repository to your local machine:
git clone https://github.com/DasariJayanth/Malware-Detection-in-PE-files-using-Machine-Learning.git
Once you have cloned the repository, create a virtual environment using:
python3 -m venv .venv
On Windows PowerShell, you might need to set the execution policy to allow activation of the environment:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
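Activate the environment (the command below is for Linux/macOS; on Windows PowerShell, run .venv\Scripts\Activate.ps1 instead):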
source .venv/bin/activate
Next, install the required libraries using:
pip install -r requirements.txt
Perform feature extraction on your data as done in PE_Header(exe, dll files)/malware_test.py and Ngrams(byte, asm files)/N-grams.ipynb. Also refer to Malware Detection Model.ipynb for merging both feature sets before predicting with the model.
Load models/RF_model.pkl and run the loaded model on the extracted features for prediction.
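A minimal sketch of the load-and-predict step follows. It assumes the model file was written with Python's standard `pickle` module and that the loaded object has a scikit-learn style `predict` method; neither is confirmed here, and the helper names `load_model` and `predict_labels` are hypothetical. The feature rows must have the same columns, in the same order, as the merged feature set used for training.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Unpickle and return the object stored at path.

    Only unpickle files from a source you trust: loading a pickle
    can execute arbitrary code.
    """
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def predict_labels(model, feature_rows):
    """Run the model on a 2-D collection of feature rows."""
    return list(model.predict(feature_rows))


if __name__ == "__main__":
    model_path = Path("models/RF_model.pkl")
    if model_path.exists():
        model = load_model(model_path)
        print(type(model).__name__)
        # features = ...  # merged feature rows from the extraction step
        # print(predict_labels(model, features))
```

If unpickling fails, the file may have been saved with a different tool or library version than the one installed; check the project notebook for how the model was saved.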
After you are done, deactivate the virtual environment:
deactivate