Leveraging the power of Machine Learning as a tool, we delve into the realm of app permissions to discern the true nature of applications, whether they harbor malicious or benign intent. By analyzing and predicting based on these permissions, we unlock valuable insights to safeguard users in the digital landscape.
Findcoding/Android-Malware-Detection-System-Using-Machine-Learning
Project at IIITD under the course CSE343: Machine Learning, under the guidance of Professor Anubha Gupta.
As the Android market continues to expand, so does the prevalence of malicious apps. According to ZDNet, as many as 10%-24% of apps available on the Play Store could be malicious in nature. These apps may appear innocuous at first glance, but they can wreak havoc on a user’s system in a variety of harmful ways. Unfortunately, current methods for detecting malware are both resource-intensive and exhaustive, and they struggle to keep up with the rapid pace at which new malware is being developed.
What can help us overcome these challenges?
- Developing a comprehensive strategy to assess and analyze data from confirmed malicious applications.
- Creating a model that can accurately predict the presence of malicious applications based on their permissions.
- Introducing a machine learning-based malware detection model that utilizes publicly available metadata information. This model will be evaluated to determine its effectiveness as a first-stage filter for detecting Android malware.
Despite the growing threat of malware, there is still no reliable and robust method for detecting malicious applications. However, with the increasing use of machine learning in various fields, we believe that this issue can be addressed through the application of machine learning techniques. Our project aims to conduct a thorough and systematic investigation into the use of machine learning for malware detection, with the ultimate goal of developing an efficient ML model capable of accurately classifying apps as either benign (0) or malware (1) based on their requested permissions. This study proposes:
- Conducting an in-depth examination and evaluation of Android metadata and permissions as predictors of malware.
- Introducing a machine learning-based malware detection strategy that utilizes publicly available metadata information.
- Analyzing the effectiveness of this model and assessing its potential as a first-stage filter for detecting Android malware.
- The dataset has been taken from Kaggle.
- The data contains the permission details of almost 30k apps.
- There are 183 features in the dataset, such as Dangerous Permissions Count, Default : Access DRM content, Default : Move application resource, etc.
- There is one binary target column (0/1) named ‘Class’, indicating Benign (0) and Malware (1) applications.
- There are 29,999 records: 20,000 malware apps and 9,999 benign apps.
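The class counts above imply a roughly 2:1 imbalance, which is worth keeping in mind when reading accuracy figures. As a small worked example (using only the counts stated in the list), the accuracy of a trivial classifier that always predicts Malware can be computed as follows:

```python
# Class counts as stated in the dataset description.
malware = 20_000
benign = 9_999
total = malware + benign

# Accuracy of a trivial classifier that labels every app as malware (1).
majority_baseline = malware / total

# Such a classifier has perfect recall on the malware class by construction.
majority_recall = malware / malware

print(f"total records:     {total}")
print(f"majority baseline: {majority_baseline:.3f}")
print(f"majority recall:   {majority_recall:.2f}")
```

Any model whose accuracy is close to this baseline while its recall is close to 1.0 may be doing little more than predicting the majority class.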
Preprocessing, Visualization and Analysis: The data is first imported from a CSV file and loaded into a dataframe for ease of use. The necessary attributes are then extracted from the dataset. The data is checked for null or missing values, and any such values are replaced with the mean of the corresponding column. To gain a better understanding of the data, the distribution of malware and benign applications across various settings is then analyzed, and the results are visualized through a series of plots created using Matplotlib and Seaborn.
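The loading and imputation step described above could look roughly like the sketch below. It is not the project's actual notebook code: the inline CSV is a tiny hypothetical stand-in for the Kaggle file (only two of the 183 feature columns are shown), and the helper name `load_and_impute` is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature of the dataset; the real CSV has 183 feature columns.
CSV_TEXT = """Dangerous Permissions Count,Default : Access DRM content,Class
3,0,1
,1,0
5,,1
1,0,0
"""


def load_and_impute(source):
    """Read the CSV and replace missing feature values with the column mean."""
    df = pd.read_csv(source)
    # Impute only the feature columns, never the target column.
    feature_cols = df.select_dtypes(include="number").columns.drop("Class")
    df[feature_cols] = df[feature_cols].fillna(df[feature_cols].mean())
    return df


df = load_and_impute(io.StringIO(CSV_TEXT))

# Class distribution, the starting point for the malware-vs-benign plots.
class_counts = df["Class"].value_counts().to_dict()
print(df)
print(class_counts)
```

With the real data, `source` would be the path to the downloaded CSV file, and `class_counts` could be passed to a Matplotlib or Seaborn bar plot.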
The EDA for the Android Permission Dataset provided valuable insights into the relationships between different features in the dataset and helped us identify the most important features for predicting whether an app is malicious. It also provided a foundation for further analysis using machine learning techniques.
After preprocessing the data, it is split into training and testing sets at an 8:2 ratio. We attempted both undersampling and oversampling techniques on the dataset, but the results were not promising. We then applied various classifiers, including logistic regression, decision trees, and Naive Bayes, but the outcomes were unsatisfactory. Upon further inspection of the dataset, we discovered that it contained several multivariate data tables, which required us to apply PCA to each dataset. We plotted the variance percentage after using PCA and chose to use the inverse transform. We then applied Random Forest to the dataset, which resulted in a significant improvement in accuracy. We then used the boosting approach to further increase prediction accuracy, both on an unsampled dataset and on one with reliable features selected. The results showed that the model was improving. Finally, we applied SVM and MLP to the final dataset and achieved our best results. Comparing the results obtained after feature selection and boosting shows that we made significant progress on the way to our final accuracy.
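The split, PCA, and Random Forest steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the data is synthetic, and `n_components=10`, the stratified split, and the `random_state` values are choices made only for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 500 apps x 20 binary permission flags (assumption).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 20)).astype(float)
y = (X[:, 0] + X[:, 1] + X[:, 2] >= 2).astype(int)  # toy "malware" rule

# 8:2 train/test split; stratification is an assumption, not stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit PCA on the training set only, then inspect the cumulative explained variance.
pca = PCA(n_components=10).fit(X_train)
cumulative_variance = pca.explained_variance_ratio_.cumsum()

# Project to the reduced space and map back with the inverse transform,
# which yields a denoised reconstruction in the original feature space.
X_train_rec = pca.inverse_transform(pca.transform(X_train))
X_test_rec = pca.inverse_transform(pca.transform(X_test))

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
clf.fit(X_train_rec, y_train)
test_accuracy = clf.score(X_test_rec, y_test)

print("cumulative explained variance:", cumulative_variance.round(2))
print("test accuracy:", round(test_accuracy, 2))
```

Plotting `cumulative_variance` against the number of components is one common way to produce the variance-percentage plot mentioned in the text.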
Models | Unsampled | Oversampled | Undersampled |
---|---|---|---|
Logistic Regression | Training accuracy 0.69; test accuracy 0.68; recall 0.95; ROC 0.53 | Training accuracy 0.63; test accuracy 0.62; recall 0.66; ROC 0.61 | Training accuracy 0.63; test accuracy 0.63; recall 0.67; ROC 0.62 |
Naive Bayes | Training accuracy 0.68; test accuracy 0.67; recall 0.97; ROC 0.52 | Training accuracy 0.53; test accuracy 0.53; recall 0.98; ROC 0.51 | Training accuracy 0.53; test accuracy 0.53; recall 0.99; ROC 0.50 |
Decision Tree | Training accuracy 0.67; test accuracy 0.67; recall 0.99; ROC 0.51 | Training accuracy 0.57; test accuracy 0.55; recall 0.68; ROC 0.54 | Training accuracy 0.55; test accuracy 0.56; recall 0.79; ROC 0.55 |
As the table shows, sampling is not effective in our case (it lowers test accuracy for all three models), so we move forward with the unsampled data only.
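For reference, random oversampling and undersampling can be sketched with `sklearn.utils.resample` as below. The text does not say which sampling implementation the project used, so this is only one plausible approach; the helper `random_resample` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.utils import resample


def random_resample(X, y, mode, random_state=0):
    """Balance classes by random oversampling ("over") or undersampling ("under")."""
    classes, counts = np.unique(y, return_counts=True)
    # Oversampling grows every class to the largest count,
    # undersampling shrinks every class to the smallest count.
    target = int(counts.max() if mode == "over" else counts.min())
    parts_X, parts_y = [], []
    for cls, count in zip(classes, counts):
        X_cls = X[y == cls]
        if count != target:
            X_cls = resample(
                X_cls,
                replace=(mode == "over"),  # sample with replacement only when growing
                n_samples=target,
                random_state=random_state,
            )
        parts_X.append(X_cls)
        parts_y.append(np.full(len(X_cls), cls))
    return np.vstack(parts_X), np.concatenate(parts_y)


# Toy data with the same 2:1 malware-to-benign ratio as the dataset.
X = np.arange(60).reshape(30, 2)
y = np.array([1] * 20 + [0] * 10)

X_over, y_over = random_resample(X, y, "over")
X_under, y_under = random_resample(X, y, "under")
print(np.bincount(y_over), np.bincount(y_under))
```

In practice, resampling is usually applied to the training split only, so that the test set keeps the original class distribution.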
Models | Optimal Parameters | Accuracy | Recall | ROC |
---|---|---|---|---|
SVM | default | Training 0.85; test 0.85 | 0.94 | 0.80 |
Random Forest | n_estimators = 200, n_jobs = -1 | Training 0.87; test 0.86 | 0.93 | 0.81 |
MLP | random_state = 42, max_iter = 300 | Training 0.85; test 0.85 | 0.95 | 0.80 |
The results show that all three models perform more or less the same, with Random Forest achieving the highest test accuracy at 86%. Accuracy follows the order Random Forest > MLP > SVM, although MLP and SVM are tied at the two-decimal precision reported in the table.
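A comparison like the one in the table can be sketched as follows, using the hyperparameters listed there. The data here is synthetic, so the numbers will not match the reported results, and computing the ROC score from hard predictions with `roc_auc_score` is an assumption about how the "ROC" column was produced.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the permission matrix (assumption).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 15)).astype(float)
y = (X[:, :3].sum(axis=1) >= 2).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters taken from the table above.
models = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200, n_jobs=-1),
    "MLP": MLPClassifier(random_state=42, max_iter=300),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        # ROC AUC on hard labels (assumed to match the "ROC" column).
        "roc_auc": roc_auc_score(y_test, y_pred),
    }

for name, metrics in results.items():
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

Reporting training and test accuracy side by side, as the table does, makes it easy to spot overfitting: a large gap between the two would suggest the model is memorizing the training set.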
- Learning: different ways to visualize data for a better understanding of features; machine learning models such as Logistic Regression, Naive Bayes, and Decision Tree for modeling the problem; how to use platforms like Kaggle and Google Colab; and how to work and collaborate in teams.