Abstract

Communication via email has expanded dramatically in recent decades due to its cost-effectiveness, convenience, speed, and utility in a variety of contexts, including social, scientific, cultural, political, authentication, and advertising applications. Spam is email sent to a large number of individuals or organizations without the recipients' desire or request. It is increasingly becoming a harmful part of email traffic and can negatively affect the usability of email systems. Such emails consume network bandwidth and storage space, slow email systems down, and waste the time and effort spent scanning and eliminating enormous amounts of useless information. Spam is also used for distributing offensive and harmful content on the Internet. The objective of the current study was to develop a new method for email spam detection with high accuracy and a low error rate. There are several methods to recognize, detect, filter, categorize, and delete spam emails, and most of the proposed methods have some error rate. None of the spam detection techniques, despite the optimizations performed, has been effective alone. Feature selection is a step in text mining and message classification, and one of the best approaches to it is the use of metaheuristic algorithms. This article introduces a new method for detecting spam using the Horse herd metaheuristic Optimization Algorithm (HOA). First, the continuous HOA was transformed into a discrete algorithm. The inputs of the resulting algorithm were then made opposition-based, and the algorithm was converted to a multiobjective one. Finally, it was used for spam detection, which is a discrete and multiobjective problem. The evaluation results indicate that the proposed method performs better than other methods such as K-nearest neighbours-grey wolf optimisation, K-nearest neighbours, multilayer perceptron, support vector machine, and Naive Bayesian.
The results show that the new multiobjective opposition-based binary horse herd optimizer, running on the UCI data set, has been more successful in the average selection size and classification accuracy compared with other standard metaheuristic methods. According to the findings, the proposed algorithm is substantially more accurate in detecting spam emails in the data set in comparison with other similar algorithms, and it shows lower computational complexity.

1 Introduction

There are several types of email messages that computer users do not opt to receive in their email inboxes, such as spam, bulk email, junk email, promotion and commercial emails, and so on. These messages have some differences; however, in this study, they are all considered spam emails. Inappropriate messages on a large scale on the Internet that do not have useful content for the user would be classified as spam. Spam can be distributed in different formats and on various platforms. Social media spam, web spam, forum spam, spam instant messaging, email spam, and so on are examples of various types of spam. Although the majority of internet-based platforms can be successfully utilized to transmit spam, email spamming has grown in popularity due to its widespread use for a variety of purposes [34]. Text REtrieval Conference (TREC) has a definition for spam: "Spam is unsolicited mail that is sent vaguely, directly or indirectly by someone who has no relationship with the recipient of the letter" [11].

Although emails are an effective and easily accessible means of communication, they can become a liability when marketers exploit them to advertise products and scammers exploit them to deceive people. The negative effect of spam emails is not limited to a severe waste of resources, time, and effort; it also increases the burden on communication systems and the incidence of cybercrime, affecting even the global economy and costing businesses and individuals millions of dollars annually. Unwanted emails, in addition to consuming resources such as bandwidth, removal time, and storage space, also pose a security threat [5]. Attackers use a variety of methods to gain access to the victim's information. Email systems are one of the platforms used by attackers to spread malware. A McAfee report states that more than 97% of spam emails in the last four months of 2017 were sent via the Necurs and Gamut botnets [33].

When users detect suspicious emails manually, attackers are prevented from reaching their goals through this channel. Users who notice the characteristics of a suspicious email should immediately take the necessary actions to prevent the spam from spreading and should inform the relevant institutions [16]. However, developing efficient mechanisms to identify unsolicited emails automatically is very important. Some of the characteristics of emails that are believed to be malicious are listed in “Appendix”.

Spam detection is a challenging problem, and several techniques have been developed and introduced to detect spam emails automatically; however, none of them achieves an accuracy of 100%. Machine learning and deep learning techniques have proven to be the most successful of the methods introduced. In recent years, spam detection has been one of the common applications of machine learning [53]. Natural Language Processing (NLP) helps these methods to increase their accuracy. These spam detection methods consist of two stages: feature selection and classification [15].

Optimization algorithms are another family of methods that can help develop spam detection systems. The Horse herd Optimisation Algorithm (HOA) [35] is a novel metaheuristic algorithm with high exploration and exploitation performance. It excels at finding optimal solutions to high-dimensional problems. In this article, our objective is to present a new method for detecting spam emails using HOA. To do this, we first convert the basic HOA, which is a continuous algorithm, to a discrete algorithm and then modify it into a multiobjective algorithm to solve multiobjective problems. Finally, the new multiobjective binary HOA is used to select the important features of spam emails so that received emails are classified correctly as spam or genuine. These two categories are then evaluated.

This study's main motivation for using HOA in solving spam detection problems was its outstanding performance in addressing complex high-dimensional problems. It is exceptionally efficient in exploration and exploitation. It can find the optimal solution very fast, with a low cost and complexity. With regards to accuracy and efficiency, it outperforms many well-known optimization algorithms such as the grasshopper optimization algorithm [48], the sine cosine algorithm [38], the multi-verse optimizer [39], the moth-flame optimization algorithm [36], the dragonfly algorithm [37], and the grey wolf optimizer [40].

Overall, the current study has the following main contributions:

HOA, a novel metaheuristic algorithm for high exploration and faster convergence, has been used in the study. To the best of the authors’ knowledge, this algorithm has not yet been used for spam detection.
The original HOA was a single objective algorithm developed to solve continuous problems. In this study, HOA was discretized and converted to a multiobjective algorithm.
The original HOA was transformed into a binary opposition-based algorithm.
Using HOA for feature selection, a novel spam detection method is proposed.
After selecting the optimal features, the K-Nearest Neighbours (KNN) classification method was used to classify the collection of spam emails.
According to the evaluation results, the proposed method outperforms well-known algorithms in terms of accuracy, precision, and sensitivity.

The remainder of this article is organized as follows: Sect. 2 introduces the related works. In Sect. 3, the original horse herd optimization algorithm is presented. Section 4 introduces the new proposed approach, and finally, in Sect. 5, the evaluation results and conclusion are discussed.

2 Related works

Unsolicited spam emails sent by marketers to promote their products are regarded as annoying since they take up a lot of space on servers [45]. Some innocent users may also fall prey to fake emails [21]. Scammers send these emails to obtain users' bank account details and steal money. Attackers and hackers also send spam emails that hide viruses and other malicious software behind attractive and exciting offer links [23]. Therefore, the problem of spam emails should be addressed immediately, and effective measures should be taken to control it. Efforts have been made to reduce spam emails, including the development of advanced filtering tools and anti-spam laws in the United States [5].

Many researchers have focused their attention on the email spam detection problem, and in the literature, several notable approaches have been proposed. This section discusses some of the previous studies focusing on detecting and classifying spam through machine learning techniques and deep learning algorithms. One of the widely used algorithms for this problem is Naive Bayes [4,47,50]. There are various techniques introduced for detecting spam; however, our main focus would be on metaheuristic optimization algorithms in the present study.

A decision tree was applied in the study by Carreras and Marquez [8] to filter unwanted emails. Because the features of spam emails are difficult to define, this method is not extensively employed in spam filtering. K-nearest neighbours (KNN), Naïve Bayes and Reverse DBSCAN algorithms were used by Harisinghaney et al. [18] to classify image-based and text-based spam, and a performance comparison of these algorithms was provided based on four measuring factors.

Egozi and Verma [13] used natural language processing techniques to detect phishing emails. Their model applies a feature selection method to select 26 features in order to determine if an email is a genuine email or spam. With only 26 features, their approach correctly identified more than 95% of ham emails as well as 80% of phishing emails.

Sharma and Bhardwaj [51] introduced a spam mail detection (SMD) system based on hybrid machine learning applying Naive Bayes and the J8 decision tree. This system consists of four models: data set preparation, data preprocessing, feature selection, and hybrid bagged approach. A total of three experiments were performed, of which the first two were conducted based on Naive Bayes and J8, and the other experiment was the proposed SMD which achieved an accuracy of 87.5%.

A new model for spam detection (THEMIS) was introduced by Soni [52] that models emails at the header, body, character, and word levels simultaneously. This approach uses deep convolutional neural network algorithms for recognizing spam emails. The evaluation results show that THEMIS's accuracy of 99.84% is higher than the accuracy of LSTM and CNN.

In the study by GuangJun et al. [17], a method is proposed for spam classification in mobile communication using predictive machine learning models (e.g., logistic regression, K-Nearest Neighbor, and decision tree). Experiment results suggest that this method is accurate and timely in detecting spam and can protect email communication in mobile systems.

The study by Bibi et al. [6] provides a comparison of past spam filtering algorithms discussing their accuracy and the employed data sets. The study presents in-depth knowledge of the simple Naive Bayes algorithm, which is one of the best algorithms for text classification. This study evaluated classifier machine learning algorithms in spam detection and found that using WEKA, the Naïve Bayes algorithm provides effective accuracy and precision.

Mohmmadzadeh [42] developed a new hybrid model by combining the whale optimization algorithms and the flower pollination algorithms to solve the feature selection problem on the basis of opposition-based learning for detecting spams. The new model has higher accuracy in spam detection compared to previous approaches.

A spam detection approach using word embedding based on deep learning architecture in the NLP context was introduced by Srinivasan et al. [53]. The study reveals that deep learning outperforms standard machine learning classifiers when it comes to spam detection.

Apart from the sample methods described earlier, other methods are also available that use only metaheuristic algorithms, but none of the proposed methods is entirely accurate, and all of them have some error rate. Moreover, many previous methods carried out only the classification phase and did not implement the feature selection phase. Feature selection reduces the dimensions of computation and increases classification accuracy by removing unnecessary features. Because they lack a feature selection process, the majority of the previous solutions spend a tremendous amount of time running the algorithm and do not achieve high accuracy. Table 1 demonstrates some examples of optimization methods used in spam detection that have been published recently, with some drawbacks that the proposed method in this study attempts to rectify.

Table 1 Examples of the recent spam detection methods

As can be seen in Table 1, even the most recent methods are not 100% accurate and need a lot of time to execute, and some of them have high computational complexity and a high error rate. Thus, the objective of the current study was to employ a robust metaheuristic optimization algorithm, which is highly efficient in exploration and exploitation, to enhance the computation speed and accuracy of spam detection as well as reduce the error rate. After a comprehensive search of the literature and examination of several optimization algorithms, the authors decided to use the novel metaheuristic optimization algorithm, HOA, for the feature selection phase of the proposed approach; as a result, the spam detection method suggested by the current study is based on HOA. This optimization algorithm has been tested with multiple well-known test functions in high dimensions and has been shown to solve challenging, high-dimensional problems.

In order to carefully assess and evaluate the performance and efficiency of the proposed method, some of the most popular and highly efficient optimization and classification algorithms in the literature were selected for the simulation, and their performance was compared to the proposed method’s performance. The simulation results indicate that the proposed method outperforms the previous methods, and demonstrates a high level of accuracy and precision, spends less execution time, and has lower error rate. Thus, the new method’s superiority is its higher accuracy and speed, and lower error rate and complexity.

As stated earlier, to be able to use HOA in selecting features, we converted it into a discrete algorithm since it was originally a continuous algorithm. Then, because feature selection is also a multiobjective problem, we transformed HOA into a multiobjective HOA and used it to select spam features. To the best of our knowledge, this is the first research in the field that presents a binary and multiobjective version of HOA. The following section introduces the horse herd optimization algorithm.

3 Horse herd optimization algorithm

In recent years, various metaheuristic algorithms have been employed to solve a wide range of optimization problems [10,29,56]. A reason for this is the ability of metaheuristic algorithms to mathematically model and solve a variety of real-world problems [49]. This study aimed to employ a novel metaheuristic algorithm for solving the feature selection problem for detecting spam emails. Therefore, the Horse herd Optimisation Algorithm (HOA) was used as the primary method for this purpose. HOA, proposed in the study by MiarNaeimi et al. [35], is a robust metaheuristic algorithm inspired by the horses’ herding behaviors at various ages. Because of the vast number of control factors based on the behavior of horses of various ages, HOA shows an outstanding performance at addressing complex high-dimensional problems. Its performance at high dimensions (up to 10,000) has been evaluated using popular test functions, and it was discovered to be extremely efficient in exploration and exploitation. It has the ability to find the best solution in the shortest time, at the lowest cost, and with the least amount of complexity, and in terms of accuracy and efficiency, it outperforms many well-known metaheuristic optimization algorithms. This algorithm is discussed in greater detail in the following section.

At different ages, horses show various behaviors [35]. A horse's maximum lifespan is around 25–30 years [25]. In HOA, horses are divided into four categories according to their age: horses aged 0–5, 5–10, 10–15, and older than 15, which are represented by δ, γ, β, and α, respectively. HOA uses six general horse behaviors at the mentioned ages to simulate their social life. Those behaviours are: "grazing, hierarchy, sociability, imitation, defence mechanism and roaming".

Equation (1) describes the horse movement at each iteration:

$$X_{m}^{{\text{Iter,AGE}}} = \vec{V}_{m}^{{\text{Iter,AGE}}} + X_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} , \quad {\text{AGE}} = \alpha ,\beta ,\gamma ,\delta$$

(1)

where $X_{m}^{{{\text{Iter}},{\text{AGE}}}}$ is the position of the mth horse, $\vec{V}_{m}^{{{\text{Iter}},{\text{AGE}}}}$ is the velocity vector of the mth horse, AGE is the horse age range, and Iter is the current iteration.

To determine the age of horses, each iteration should have a complete matrix of responses. The matrix is sorted according to the best responses, with the first 10% of the horses from the matrix’s top chosen as α. The β, δ, and γ horses comprise the next 20%, 30% and 40% of the remainder of the horses, respectively. To obtain the velocity vector, the six behaviors mentioned above are implemented mathematically. During each cycle of the algorithm, the motion vector of horses of various ages can be expressed by Eq. (2) [35]:

$$\begin{aligned} \vec{V}_{m}^{{{\text{Iter,}}\alpha }} & = \vec{G}_{m}^{{{\text{Iter,}}\alpha }} + \vec{D}_{m}^{{{\text{Iter,}}\alpha }} \\ \vec{V}_{m}^{{{\text{Iter}},\beta }} & = \vec{G}_{m}^{{{\text{Iter}},\beta }} + \vec{H}_{m}^{{{\text{Iter}},\beta }} + \vec{S}_{m}^{{{\text{Iter}},\beta }} + \vec{D}_{m}^{{{\text{Iter}},\beta }} \\ \vec{V}_{m}^{{{\text{Iter,}}\gamma }} & = \vec{G}_{m}^{{{\text{Iter,}}\gamma }} + \vec{H}_{m}^{{{\text{Iter,}}\gamma }} + \vec{S}_{m}^{{{\text{Iter,}}\gamma }} + \vec{I}_{m}^{{{\text{Iter,}}\gamma }} + \vec{D}_{m}^{{{\text{Iter,}}\gamma }} + \vec{R}_{m}^{{{\text{Iter,}}\gamma }} \\ \vec{V}_{m}^{{{\text{Iter}},\delta }} & = \vec{G}_{m}^{{{\text{Iter}},\delta }} + \vec{I}_{m}^{{{\text{Iter}},\delta }} + \vec{R}_{m}^{{{\text{Iter}},\delta }} \\ \end{aligned}$$

(2)
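The ranking step and Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the assumption that a lower cost means a better response, and the rounding of group boundaries are all choices made here. The group shares follow the order stated in the text above.

```python
# Behaviour terms per age group, as listed in Eq. (2).
TERMS = {
    "alpha": ("G", "D"),
    "beta": ("G", "H", "S", "D"),
    "gamma": ("G", "H", "S", "I", "D", "R"),
    "delta": ("G", "I", "R"),
}

# Shares of the sorted herd, in the order given in the text (assumed here).
SHARES = (("alpha", 0.10), ("beta", 0.20), ("delta", 0.30), ("gamma", 0.40))


def assign_age_groups(costs, shares=SHARES):
    """Label each horse by rank; assumes a lower cost is a better response."""
    n = len(costs)
    order = sorted(range(n), key=lambda i: costs[i])  # best horse first
    labels = [None] * n
    start, cumulative = 0, 0.0
    for k, (label, share) in enumerate(shares):
        cumulative += share
        # The last group takes whatever remains, so every horse gets a label.
        end = n if k == len(shares) - 1 else round(cumulative * n)
        for i in order[start:end]:
            labels[i] = label
        start = end
    return labels


def velocity(age, components):
    """Eq. (2): sum the behaviour vectors that apply to this age group."""
    dim = len(next(iter(components.values())))
    total = [0.0] * dim
    for name in TERMS[age]:
        for d, value in enumerate(components[name]):
            total[d] += value
    return total


def move(x_prev, v):
    """Eq. (1): new position = velocity + previous position."""
    return [vi + xi for vi, xi in zip(v, x_prev)]
```

Here `components` is a dictionary of already computed behaviour vectors keyed by letter; entries that Eq. (2) does not list for an age group are simply ignored.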

As stated earlier, HOA is inspired by horses and their six general and social behaviors in various ages. The six behaviors and their mathematical implementation are discussed as follows.

Grazing: Horses are grazing animals that graze at all stages of their lives for about 16–20 h per day [25]. Equations (3) and (4) mathematically implement this behavior in HOA [35].

$$\vec{G}_{m}^{{{\text{Iter}},{\text{AGE}}}} = g_{m}^{{{\text{Iter}},{\text{AGE}}}} \left( {\check{u}} + \rho {\check{l}} \right) + [X_{m}^{{({\text{Iter}} - 1)}} ],\quad {\text{AGE}} = \alpha ,\beta ,\gamma ,\delta$$

(3)

$$g_{m}^{{{\text{Iter}},{\text{AGE}}}} = g_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{g}$$

(4)

In the above equations, $\vec{G}_{m}^{{{\text{Iter}},{\text{AGE}}}}$ is the mth horse's motion parameter indicating its tendency to graze. With ${\omega }_{g}$ in each iteration, this factor decreases linearly. ${\check{u}}$ is the upper bound of the grazing space, and its recommended value is 1.05. ${\check{l}}$ is the lower bound of the grazing space, and its recommended value is 0.95. $\rho$ is a random number between 0 and 1. The coefficient $g$ for all age ranges is recommended to be set to 1.5.
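As a small numeric illustration of the coefficient update in Eq. (4), the sketch below applies the update repeatedly. The starting value 1.5 is the recommended $g$ from the text; the reduction factor 0.95 is a hypothetical value chosen only for this example, and the function name is not from the source. The same update pattern appears for the other behaviour coefficients in Eqs. (6), (8), (10), (12), and (14).

```python
def decayed_coefficient(initial, omega, iterations):
    """Apply the update of Eq. (4) once per iteration: value <- value * omega."""
    value = initial
    for _ in range(iterations):
        value *= omega
    return value


# g = 1.5 as recommended in the text; omega_g = 0.95 is a hypothetical choice.
history = [decayed_coefficient(1.5, 0.95, k) for k in range(4)]
```

Applied as written, the update multiplies the coefficient by the reduction factor once per iteration, so after k iterations it equals the initial value times the factor raised to the power k.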

Hierarchy: Horses are not self-sufficient, and they usually follow a leader, which could be a human, an adult stallion, or a mare. This occurs in the hierarchy law [7]. The most experienced and strongest horse tends to lead in a herd of horses, and others follow it. Horses between the ages of 5 and 15 (β and γ) were shown to follow the hierarchy law. The hierarchy is implemented according to Eqs. (5) and (6) below [35]:

$$\vec{H}_{m}^{{\text{Iter,AGE}}} = h_{m}^{{\text{Iter,AGE}}} \left[ {X_{*}^{{({\text{Iter}} - 1)}} - X_{m}^{{({\text{Iter}} - 1)}} } \right], \quad {\text{AGE}} = \beta ,\gamma$$

(5)

$$h_{m}^{{\text{Iter,AGE}}} = h_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{h}$$

(6)

where $\vec{H}_{m}^{{{\text{Iter}},{\text{AGE}}}}$ is the impact of the location of the leader horse on the velocity, and $X_{*}^{{({\text{Iter}} - 1)}}$ indicates the location of that horse.

Sociability: Sociability is another horse behavior that inspired HOA. Horses require social interaction and may coexist with other animals. This also increases their chances of survival. Some horses appear to enjoy being with other animals such as cattle and sheep [25]. Horses between the ages of 5 and 15 years show this behavior. In HOA, socialization is modeled as movement towards the position of the other horses in the herd, and it is implemented using Eqs. (7) and (8) [35]:

$$\vec{S}_{m}^{{\text{Iter,AGE}}} = s_{m}^{{\text{Iter,AGE}}} \left[ {\left( {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} X_{j}^{{({\text{Iter}} - 1)}} } \right) - X_{m}^{{({\text{Iter}} - 1)}} } \right], \quad {\text{AGE}} = \beta ,\gamma$$

(7)

$$s_{m}^{{\text{Iter,AGE}}} = s_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{s}$$

(8)

where $\vec{S}_{m}^{{\text{Iter,AGE}}}$ is the mth horse's social motion vector, and $s_{m}^{{\text{Iter,AGE}}}$ is the same horse's orientation towards the herd in the Iter-th iteration. $s_{m}^{{\text{Iter,AGE}}}$ decreases in each cycle by the factor $\omega_{s}$. The total number of horses is indicated by N, and AGE is each horse’s age range in the herd. The s coefficient of β and γ horses is calculated in the parameters' sensitivity analysis.

Imitation: Horses learn each other's good and bad habits and behaviors by imitating one another [7]. Imitation is another horse behavior that inspired HOA. Young horses attempt to imitate others, and this behavior persists throughout their lives. The imitation is described by Eqs. (9) and (10) [35]:
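Eqs. (5) and (7) share one shape: a coefficient times the difference between a reference point and the horse's previous position. The reference is the leader's position for hierarchy and the herd mean for sociability. A minimal sketch follows, with positions stored as plain Python lists and all function names assumed for illustration.

```python
def hierarchy_vector(x_prev, x_leader, h):
    """Eq. (5): h * (leader position - previous position of this horse)."""
    return [h * (lead - xi) for lead, xi in zip(x_leader, x_prev)]


def herd_mean(herd):
    """Per-dimension mean position of all N horses in the herd."""
    n = len(herd)
    return [sum(column) / n for column in zip(*herd)]


def sociability_vector(x_prev, herd, s):
    """Eq. (7): s * (herd mean - previous position of this horse)."""
    centre = herd_mean(herd)
    return [s * (c - xi) for c, xi in zip(centre, x_prev)]
```

Both vectors vanish when the horse already sits at the reference point, so these two terms only pull horses that are away from the leader or from the herd centre.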

$$\vec{I}_{m}^{{\text{Iter,AGE}}} = i_{m}^{{\text{Iter,AGE}}} \left[ {\left( {\frac{1}{pN}\mathop \sum \limits_{j = 1}^{pN} \hat{X}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right],\quad {\text{AGE}} = \gamma ,\delta$$

(9)

$$i_{m}^{{\text{Iter,AGE}}} = i_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{i}$$

(10)

In the above equations, $\vec{I}_{m}^{{\text{Iter,AGE}}}$ is the mth horse's motion vector towards the average of the best horses, whose locations are denoted by $\widehat{X}$. pN is the number of horses that have the best locations, and p is recommended to be set to 10% of the total horses in the herd. ${\omega }_{i}$ is the reduction factor per cycle for $i_{m}^{{\text{Iter,AGE}}}$.

Defense: Horses use fight-or-flight behavior to defend themselves. Their initial impulse is to flee. In addition, when trapped, they usually buck. They fight for food and water to keep rivals away. They also fight to avoid dangerous situations with enemies such as wolves [25,55]. The horses’ defense mechanism is another behavior used in HOA, and it is defined as running away from horses that exhibit non-optimal responses. Equations (11) and (12) describe the defense mechanism [35]:

$$\vec{D}_{m}^{{\text{Iter,AGE}}} = - d_{m}^{{\text{Iter,AGE}}} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right], \quad {\text{AGE}} = \alpha ,\beta \;{\text{and}}\;\gamma$$

(11)

$$d_{m}^{{\text{Iter,AGE}}} = d_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{d}$$

(12)

In the above equations, $\vec{D}_{m}^{{\text{Iter,AGE}}}$ indicates “the escape vector of the mth horse from the average of some horses with worst locations, which are shown by the ${\check{X}}$ vector”. The number of horses that have the worst locations is qN. The value of q is recommended to be set to 20% of the total number of horses. $\omega_{d}$ is the reduction factor per cycle for $d_{m}^{{\text{Iter,AGE}}}$.
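Imitation (Eq. 9) and defense (Eq. 11) both compare a horse with the mean position of a ranked subset of the herd: the best pN horses attract it, and the worst qN horses repel it. The sketch below is an illustration under stated assumptions: a lower cost is taken to mean a better location, subset sizes are rounded to at least one horse, and the function names are hypothetical.

```python
def subset_mean(herd, costs, fraction, best=True):
    """Mean position of the best (or worst) share of the herd, ranked by cost."""
    count = max(1, round(fraction * len(herd)))
    order = sorted(range(len(herd)), key=lambda i: costs[i], reverse=not best)
    chosen = [herd[i] for i in order[:count]]
    return [sum(column) / count for column in zip(*chosen)]


def imitation_vector(x_prev, herd, costs, i_coef, p=0.10):
    """Eq. (9): move towards the mean of the best pN horses."""
    target = subset_mean(herd, costs, p, best=True)
    return [i_coef * (t - xi) for t, xi in zip(target, x_prev)]


def defense_vector(x_prev, herd, costs, d_coef, q=0.20):
    """Eq. (11): move away from the mean of the worst qN horses."""
    danger = subset_mean(herd, costs, q, best=False)
    return [-d_coef * (w - xi) for w, xi in zip(danger, x_prev)]
```

The leading minus sign in `defense_vector` mirrors the sign in Eq. (11): the resulting step points away from the poorly ranked region rather than towards it.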

Roaming: The last behavior of horses that HOA simulates is their roaming habit. In pursuit of food, horses in nature roam and graze from one pasture to another if they are not kept in stables. A horse may abruptly change its grazing site. Horses are incredibly curious, as they frequently visit different pastures and get to know their surroundings [55]. The Roaming behavior is considered as a random movement of a horse in the herd and can be described by Eqs. (13) and (14) [35]:

$$\vec{R}_{m}^{{\text{Iter,AGE}}} = r_{m}^{{\text{Iter,AGE}}} pX^{{({\text{Iter}} - 1)}} ,\quad {\text{AGE}} = \gamma \;{\text{and}}\;\delta$$

(13)

$$r_{m}^{{\text{Iter,AGE}}} = r_{m}^{{({\text{Iter}} - 1),{\text{AGE}}}} \times \omega_{r}$$

(14)

$\vec{R}_{m}^{{\text{Iter,AGE}}}$ is “the random velocity vector of the mth horse for a local search and an escape from local minima”. The reduction factor of $r_{m}^{{\text{Iter,AGE}}}$ per cycle is represented by $\omega_{r}$.

The horses’ general velocity can be calculated by substituting Eqs. (3)–(14) in Eq. (2). The velocities of horses at different ages (δ, γ, β, and α, respectively) are obtained according to Eqs. (15)–(18).

$$\vec{V}_{m}^{{{\text{Iter}},\delta }} = \left[ {g_{m}^{{({\text{Iter}} - 1),\delta }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] + \left[ {i_{m}^{{({\text{Iter}} - 1),\delta }} \omega_{i} \left[ {\left( {\frac{1}{pN}\mathop \sum \limits_{j = 1}^{pN} \hat{X}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] + \left[ {r_{m}^{{({\text{Iter}} - 1),\delta }} \omega_{r} pX^{{({\text{Iter}} - 1)}} } \right]$$

(15)

where $\vec{V}_{m}^{{{\text{Iter}},\delta }}$ is the δ horses’ velocity (horses at the age of 0–5).

$$\begin{aligned} \vec{V}_{m}^{{{\text{Iter}},\gamma }} & = \left[ {g_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] + \left[ {h_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{h} \left[ {X_{*}^{{({\text{Iter}} - 1)}} - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad + \left[ {s_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{s} \left[ {\left( {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} X_{j}^{{({\text{Iter}} - 1)}} } \right) - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] + \left[ {i_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{i} \left[ {\left( {\frac{1}{pN}\mathop \sum \limits_{j = 1}^{pN} \hat{X}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad - \left[ {d_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{d} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] + \left[ {r_{m}^{{({\text{Iter}} - 1),\gamma }} \omega_{r} pX^{{({\text{Iter}} - 1)}} } \right] \\ \end{aligned}$$

(16)

where $\vec{V}_{m}^{{{\text{Iter}},\gamma }}$ is the γ horses’ velocity (horses at the age of 5–10).

$$\begin{aligned} \vec{V}_{m}^{{{\text{Iter}},\beta }} & = \left[ {g_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] + \left[ {h_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{h} \left[ {X_{*}^{{({\text{Iter}} - 1)}} - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad + \left[ {s_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{s} \left[ {\left( {\frac{1}{N}\mathop \sum \limits_{j = 1}^{N} X_{j}^{{({\text{Iter}} - 1)}} } \right) - X_{m}^{{({\text{Iter}} - 1)}} } \right]} \right] \\ & \quad - \left[ {d_{m}^{{({\text{Iter}} - 1),\beta }} \omega_{d} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right] \\ \end{aligned}$$

(17)

where $\vec{V}_{m}^{{{\text{Iter}},\beta }}$ is the β horses’ velocity (horses at the age between 10 and 15 years).

$$\vec{V}_{m}^{{{\text{Iter}},\alpha }} = \left[ {g_{m}^{{({\text{Iter}} - 1),\alpha }} \omega_{g} \left( {{\check{u}} + \rho {\check{l}} } \right) + [X_{m}^{{({\text{Iter}} - 1)}} ]} \right] - \left[ {d_{m}^{{({\text{Iter}} - 1),\alpha }} \omega_{d} \left[ {\left( {\frac{1}{qN}\mathop \sum \limits_{j = 1}^{qN} {\check{X}}_{j}^{{({\text{Iter}} - 1)}} } \right) - X^{{({\text{Iter}} - 1)}} } \right]} \right]$$

(18)

where $\vec{V}_{m}^{{{\text{Iter}},\alpha }}$ is the α horses’ velocity (horses older than 15).

The findings validated HOA's capacity to cope with difficult problems involving a large number of unknown variables in high-dimensional domains. Adult α horses start a local search around the global optimum with extremely high precision. The β horses look for other nearby positions around the adult α horses, intending to approach them; the γ horses, in contrast, have less interest in approaching the α horses. They show a strong drive to explore new regions and discover new global optimum spots. Because of their specific behavioral features, young δ horses are excellent candidates for the random search phase.

4 Proposed approach

In this study, the metaheuristic HOA is modified first, and the modified version is then used in feature selection for detecting spam emails. First, the continuous HOA is changed to a binary algorithm so that it can be used for feature selection, which is a discrete problem. The inputs of the resulting algorithm are then made opposition-based. Next, the binary opposition-based HOA is upgraded to a multiobjective algorithm in order to solve multiobjective problems. Finally, the multiobjective opposition-based binary HOA (MOBHOA) is applied in spam detection.

Users usually receive spam from anonymous senders with strange email addresses. This certainly does not mean that every email sent by an anonymous sender is spam. Therefore, it is necessary to use appropriate methods to detect and separate spam emails from legitimate emails that contain important information. In the proposed method, every email that arrives from the server passes through a series of steps to be classified as spam or genuine. The first step after receiving an email from the server is feature extraction, in which a series of general or specific features is extracted from the email body. The next phase is feature selection, which identifies relevant features and removes irrelevant and duplicate features. The final step is classification, which labels emails as spam or genuine. The overall structure of this method is depicted in Fig. 1, which shows the flowchart of the new approach and how it operates for detecting spam emails. The next sections provide further details of each step in modifying the HOA.
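The last two steps of this pipeline, applying a selected-feature mask and classifying with K-Nearest Neighbours, can be illustrated with a short standard-library sketch. It is not the authors' implementation: the squared Euclidean distance, the default k = 3, the function names, and the toy feature values are all assumptions made for the example.

```python
from collections import Counter


def masked(vector, mask):
    """Keep only the features whose mask bit is 1."""
    return [value for value, bit in zip(vector, mask) if bit == 1]


def knn_predict(train_x, train_y, sample, mask, k=3):
    """Majority vote among the k nearest training emails on the selected features."""
    query = masked(sample, mask)
    distances = []
    for features, label in zip(train_x, train_y):
        row = masked(features, mask)
        # Squared Euclidean distance; the ranking matches the plain distance.
        dist = sum((a - b) ** 2 for a, b in zip(row, query))
        distances.append((dist, label))
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: two hypothetical features per email, only the first one selected.
train_x = [[0.9, 5.0], [0.8, 1.0], [0.1, 5.2], [0.2, 0.9]]
train_y = ["spam", "spam", "ham", "ham"]
prediction = knn_predict(train_x, train_y, [0.85, 0.0], mask=[1, 0])
```

In the proposed method the mask would come from the feature selection phase; here it is fixed by hand to show how dropping a feature changes which neighbours are considered close.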

4.1 Binary HOA

The optimization process in a binary search space differs significantly from that in a continuous search space. Horse search agents can update their positions in the continuous search space by adding a step length to their position vector. In a binary search space, however, a search agent's position cannot be updated by adding a step length because each element of the position vector can only have a value of 0 or 1. Therefore, we needed to develop a binary version of the HOA for feature selection, which is a discrete problem.

Developing the binary version of the HOA is simple. We only need to bound the variables between zero and one and then run the algorithm. Just before sending the values to the cost function, we process them with the greatest integer function to convert them to a vector of zeros and ones. The variables remain continuous inside the algorithm and become binary only before entering the cost function. In other words, the algorithm considers the problem to be continuous, and the cost function considers it to be discrete. The greatest integer function in Eq. (19) thus acts as the interface between the continuous algorithm and the discrete (binary) cost function. In Eq. (19), x represents a real value between m and n, which are two consecutive integers, and k is the integer resulting from the application of the greatest integer function to x. This strategy allows a continuous algorithm to be used for discrete problems.

$$k = \lfloor x \rfloor$$

(19)
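The interface described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB code: the names `binarize` and `make_discrete_cost` and the example cost function are hypothetical. Note that with variables bounded in [0, 1], a plain floor maps every value below 1 to 0, so a practical implementation may need a rounding or threshold rule that the text does not specify.

```python
import math
from typing import Callable, List, Sequence


def binarize(position: Sequence[float]) -> List[int]:
    """Apply the greatest integer (floor) function of Eq. (19) element-wise."""
    return [int(math.floor(x)) for x in position]


def make_discrete_cost(
    cost: Callable[[List[int]], float]
) -> Callable[[Sequence[float]], float]:
    """Wrap a binary cost function so that a continuous optimizer can call it.

    The optimizer keeps working with real-valued positions; the conversion
    to integers happens only at the moment the cost is evaluated.
    """
    def wrapped(position: Sequence[float]) -> float:
        return cost(binarize(position))

    return wrapped


# Hypothetical cost used only for illustration: the number of selected features.
count_selected = make_discrete_cost(lambda bits: float(sum(bits)))
```

Calling `count_selected` with a real-valued position leaves that position untouched, which matches the description that the algorithm treats the problem as continuous while the cost function treats it as discrete.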

4.2 Opposition-based binary HOA

By exploring opposite solutions, opposition-based learning increases the chance of starting with a better initial population [48]. This approach can be used not only for the initial solutions but can also be applied continuously to any solution in the current population. Generally, opposition-based learning is employed in metaheuristic approaches to improve convergence. Because the time complexity of metaheuristic algorithms grows, opposition-based learning is used to mitigate this limitation. This strategy causes the metaheuristic method to also seek optimal solutions in the direction opposite to the current solution. It then determines which of the two, the current solution or its opposite, is the better one to choose. This method converges rapidly and brings the solution closer to the optimum [48]. A sample application of opposition-based learning was discussed in the study by Ibrahim et al. [22].

Starting from a suitable initial population is an essential and challenging task in evolutionary algorithms, as the starting point affects the algorithm's convergence speed and the quality of the final solution [48]. In an opposition-based algorithm, to determine the members of the original population, an upper and a lower limit are first defined for each of the genes that make up the population members. The genes are then randomly generated between the lower limit and the upper limit. To use opposite numbers during the initialization of the population, we consider the opposite value of each member, which is defined according to Eq. (20). Assuming that X is the position of the horse between a and b, the opposition-based position $\overline{X}$ is given by Eq. (20). If the cost of the opposite point is lower than the cost of the initial point, the opposite point replaces it; otherwise, the initial point is kept. Therefore, the gene and the opposite gene are evaluated simultaneously so that the more appropriate one is carried forward.

$$\overline{X} = a + b - X$$

(20)
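The opposition-based initialization of Eq. (20) can be illustrated with a short Python sketch. The function names, the uniform random sampling, and the fixed seed are assumptions made for the example; the cost function is supplied by the caller and is assumed to be minimized.

```python
import random
from typing import Callable, List, Sequence


def opposite(x: Sequence[float], a: float, b: float) -> List[float]:
    """Opposite point of x within [a, b], following Eq. (20)."""
    return [a + b - xi for xi in x]


def opposition_based_population(
    cost: Callable[[Sequence[float]], float],
    size: int,
    dim: int,
    a: float = 0.0,
    b: float = 1.0,
    seed: int = 0,
) -> List[List[float]]:
    """Create `size` members, keeping each random point or its opposite,
    whichever has the lower cost."""
    rng = random.Random(seed)
    population = []
    for _ in range(size):
        x = [rng.uniform(a, b) for _ in range(dim)]
        x_opp = opposite(x, a, b)
        # Substitute the opposite point only if it is strictly better.
        population.append(x_opp if cost(x_opp) < cost(x) else x)
    return population
```

The same `opposite` helper could also be applied to members of later populations, which the text mentions as a possible use of opposition-based learning beyond initialization.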

4.3 Multiobjective opposition-based binary HOA

Models used to optimize problems that have only one objective function are known as single-objective models. In a single-objective problem, we attempt to find the best solution among the available solutions. In practice, many design and engineering problems have more than one objective function. These problems are known as multiobjective optimization problems. In many cases, the objective functions defined in multiobjective optimization problems are in conflict with each other [9]. That means the objectives are not compatible [37].

Spam detection is a multiobjective problem. The objectives pursued in this problem are the number of features and the classification accuracy, in which the quantity of features should be minimum, whereas the classification accuracy should be maximum. Higher classification accuracy means that most emails are categorized into the correct category after the classification is completed, and the error rate of the classification is minimal. Furthermore, because the classification is reliant on the selected features by the modified HOA metaheuristic algorithm, the number of features should be kept as minimal as feasible to prevent complexity. Since more than one objective function must be investigated, it is necessary to use a multiobjective optimization method. The essential aspect of such approaches is that they provide engineers and system designers with more than one solution. These solutions demonstrate the balance between the various objective functions [24]. A multiobjective optimization problem can be expressed mathematically as a minimization problem using Eq. (21) [60]:

$$\begin{aligned} & {\text{Minimize:}}\; f_{m}(x), \quad m = 1,2, \ldots ,M \\ & {\text{Subject to:}}\; g_{j}(x) \ge 0, \quad j = 1,2, \ldots ,J \\ & h_{k}(x) = 0, \quad k = 1,2, \ldots ,K \\ & L_{i} \le x_{i} \le U_{i}, \quad i = 1,2, \ldots ,n \\ \end{aligned}$$

(21)

In Eq. (21), M represents the number of objectives, J represents the number of inequality constraints, K represents the number of equality constraints, and [L_i, U_i] are the boundaries of the ith variable. The solutions of a multiobjective problem cannot be compared by arithmetic relational operators alone. Rather, the concept of Pareto dominance is used to compare two solutions in a multiobjective search space [60].
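For a minimization problem, Pareto dominance is commonly defined as follows: one solution dominates another if it is no worse in every objective and strictly better in at least one. A minimal Python sketch of this standard definition is given below; the function name is illustrative and the objective vectors are assumed to have equal length.

```python
from typing import Sequence


def dominates(f_a: Sequence[float], f_b: Sequence[float]) -> bool:
    """Return True if objective vector f_a Pareto-dominates f_b (minimization)."""
    no_worse = all(x <= y for x, y in zip(f_a, f_b))
    strictly_better = any(x < y for x, y in zip(f_a, f_b))
    return no_worse and strictly_better
```

For example, with objectives (error rate, number of features), the vector (0.1, 5) dominates (0.2, 5), whereas (0.1, 9) and (0.2, 5) do not dominate each other.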

To date, several single-objective metaheuristic methods have been converted to multiobjective versions [58]. This section explains how we have converted the single-objective HOA to a multiobjective HOA. The multiobjective HOA employs a general objective function with a weight vector, based on Eq. (22), to find the relationship between horses in a multiobjective search space. In this equation, M is the number of objectives, and the M objectives of each horse are combined into a single objective.

$$F(x_{i}) = \frac{1}{M}\sum_{j = 1}^{M} f_{j}(x_{i})$$

(22)
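As a small worked example of Eq. (22), assume a horse with M = 2 objectives whose hypothetical values are a classification error of 0.08 and a selected-feature ratio of 0.25. The aggregated objective is then the plain average of the two:

$$F(x_{i}) = \frac{1}{2}\left( 0.08 + 0.25 \right) = 0.165$$

Because every objective receives the same weight 1/M in this form, the objectives are assumed to be on comparable scales.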

The main difference between single-objective and multiobjective HOA is in their process of updating the objectives. By selecting the best solution obtained, the objective could be easily selected in a single-objective search space. However, the objective must be selected from a set of optimal solutions in multiobjective HOA. Optimal solutions are stored, and the ultimate objective would be one of them. The challenge here is to find an objective for improving the distribution of the stored solutions. To attain this goal, first, the number of neighboring solutions in the existing solution’s neighborhood is calculated [41]. This method is similar to MOPSO in the study by Zouache et al. [60]. Then, the number of neighboring solutions is considered a quantitative criterion for measuring the areas' congestion. Equation (23) determines the probability of choosing an objective from among the objectives.

$$p_{i} = \frac{1}{N_{i}}$$

(23)

In Eq. (23), N_i indicates the number of solutions in the neighborhood of the ith solution. With this probability, a roulette wheel method is used to choose the objective. This improves the distribution of the less populated areas of the search space. The other benefit is that, in the event of premature convergence, solutions with a crowded neighborhood may still be chosen as the objective to solve the problem [59]. The storage space used is limited. To lower the computational cost of the multiobjective HOA, only a small number of solutions should be kept in the archive, and the archive must be updated frequently. When comparing an external (out-of-archive) solution with the solutions in the archive, several cases arise, and the multiobjective HOA must be able to manage them in order to enhance the archive. The simplest case is when at least one archive member dominates the external solution; in this case, the external solution must be discarded immediately. Another case is when the external solution is not dominated by any solution in the archive; since the archive stores the non-dominated solutions achieved so far, such a solution must be added to the archive. Finally, if the external solution dominates solutions in the archive, those solutions must be replaced by it.
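The archive handling and the roulette wheel selection of Eq. (23) can be sketched as follows. This is an illustration under several assumptions that are not stated in the text: neighbors are counted within a hypothetical `radius` in objective space, each solution counts itself as a neighbor (so N_i is never zero), and the policy for removing solutions from a full archive is omitted.

```python
import random
from typing import List, Sequence, Tuple

Objectives = Tuple[float, ...]


def dominates(f_a: Sequence[float], f_b: Sequence[float]) -> bool:
    """Pareto dominance for minimization."""
    return (all(x <= y for x, y in zip(f_a, f_b))
            and any(x < y for x, y in zip(f_a, f_b)))


def update_archive(archive: List[Objectives],
                   candidate: Objectives) -> List[Objectives]:
    """Return the archive after offering it one external solution."""
    # An archive member dominates the candidate: discard the candidate.
    if any(dominates(member, candidate) for member in archive):
        return list(archive)
    # Otherwise drop every member the candidate dominates, then add it.
    kept = [m for m in archive if not dominates(candidate, m)]
    kept.append(candidate)
    return kept


def neighbour_counts(archive: List[Objectives], radius: float) -> List[int]:
    """N_i: archive members within `radius` of member i (itself included)."""
    return [
        sum(1 for f_j in archive
            if max(abs(x - y) for x, y in zip(f_i, f_j)) <= radius)
        for f_i in archive
    ]


def select_target(archive: List[Objectives], radius: float,
                  rng: random.Random) -> Objectives:
    """Roulette wheel selection with weights p_i = 1 / N_i, as in Eq. (23)."""
    weights = [1.0 / n for n in neighbour_counts(archive, radius)]
    return rng.choices(archive, weights=weights, k=1)[0]
```

Since `random.Random.choices` normalizes the weights internally, solutions in sparsely populated regions are selected with proportionally higher probability, while solutions in crowded regions keep a nonzero chance of being chosen.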

In spam detection, feature selection is considered a multiobjective optimization problem with two opposing objectives: (1) a minimal number of selected features and (2) a higher classification accuracy. Therefore, a classification algorithm is required to define the objective function of feature selection [19,20]. Because most studies in the literature have employed the KNN classification algorithm, this classifier is also used to define the objective function of the feature selection problem in the current study. The opposition-based binary HOA was converted to a multiobjective algorithm and then applied to the spam detection problem.

Equation (24) is applied as a multiobjective function for selecting features. This equation balances between two opposing objectives so that a near-optimal solution is chosen.

A smaller number of features contributes to a more optimal solution, yet a lower number of features might sometimes raise the classification error rate. Likewise, the smaller the classification error, the more optimal the solution, but the number of features may have to be increased to reduce the error rate. In other words, fewer features do not always improve the solution, and reducing the number of features below a certain limit may reduce the classification accuracy. The converse also holds: a lower classification error rate does not always improve the solution and may cause more features to be selected. There is a threshold for each of these, and this threshold differs from problem to problem. Therefore, a balance must be achieved between the two, and Eq. (24) establishes this balance.

$${\text{Fitness}} = \alpha \gamma_{R} (D) + \beta \frac{\left| R \right|}{{\left| N \right|}}$$

(24)

In Eq. (24), $\gamma_{R}(D)$ indicates the classifier's error rate, $\left| R \right|$ indicates the cardinality of the selected subset, and $\left| N \right|$ denotes the overall number of features within the data set. α and β are the weights given to the classification quality and the subset length, respectively. The α and β values have been adapted from Emary, Zawbaa and Hassanien [14], where α ∈ [0, 1] and β = (1 − α). The initial value of α in this study is set to 0.99; thus, β is 0.01. KNN helps to evaluate the features selected by the suggested method and by other similar methods accurately, and it serves as a benchmark for all algorithms [2,3,27,30,31,32,46].
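Eq. (24) translates directly into code once the classifier's error rate is known. The sketch below assumes the error rate has already been measured (for example, by a KNN classifier trained on the selected features) and that the feature subset is given as a 0/1 mask; the function name and the example values are illustrative.

```python
from typing import Sequence


def fitness(error_rate: float, mask: Sequence[int], alpha: float = 0.99) -> float:
    """Fitness of Eq. (24): alpha * error + beta * |R| / |N|, with beta = 1 - alpha."""
    beta = 1.0 - alpha
    selected = sum(mask)   # |R|: number of selected features
    total = len(mask)      # |N|: number of features in the data set
    return alpha * error_rate + beta * selected / total


# Hypothetical example: 20 of 57 features selected, error rate of 10%.
example_mask = [1] * 20 + [0] * 37
example_fitness = fitness(0.10, example_mask)  # roughly 0.1025
```

With α = 0.99, the error term dominates the fitness, and the subset size acts mainly as a tie-breaker between subsets with similar error rates.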

4.4 Spam detection using multiobjective opposition-based binary HOA

The current study employed a data set on which preliminary processing was performed, and a set of features was extracted. MOBHOA selects several extracted features that distinguish spam emails from genuine emails. This is accomplished through the use of HOA's natural processes, which are discussed as follows.

Feature selection is a four-step process that includes the generation of feature subsets, the evaluation of subsets, the checking of termination criteria, and the validation of the results [26]. First, a feature subset is generated from the data set; candidate features are searched based on the search strategy of MOBHOA. Then, candidate subsets are evaluated and compared with the best previous value of the evaluation function. If a better subset is produced, it replaces the previous best. This generation and evaluation of subsets is iterated until the termination criterion of MOBHOA is reached. MOBHOA is repeated several times before achieving the best global solution. After each cycle, the fitness function calculates the accuracy of the classifier for the candidate subset. The candidate generation, fitness calculation, and evaluation continue until the termination criteria are met. In general, termination criteria are defined on the basis of two factors: the error rate and the total number of iterations. If the error rate is lower than a certain threshold, or if the algorithm exceeds the specified number of iterations, the algorithm stops [26].
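The generate, evaluate, and terminate loop described above can be sketched generically in Python. Note that the candidate generation here is a simple random bit flip used only as a placeholder; it is not the HOA position update. The `evaluate` callback, the `threshold` parameter, and the toy target mask are likewise hypothetical.

```python
import random
from typing import Callable, List, Tuple


def select_features(
    evaluate: Callable[[List[int]], float],
    n_features: int,
    max_iterations: int = 100,
    threshold: float = 0.0,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Wrapper-style feature selection loop (lower fitness is better)."""
    rng = random.Random(seed)
    best = [rng.randint(0, 1) for _ in range(n_features)]
    best_fit = evaluate(best)
    for _ in range(max_iterations):
        # Termination check: fitness at or below the threshold.
        if best_fit <= threshold:
            break
        # Subset generation (placeholder move, NOT the HOA update rule).
        candidate = list(best)
        idx = rng.randrange(n_features)
        candidate[idx] = 1 - candidate[idx]
        # Subset evaluation and comparison with the previous best.
        cand_fit = evaluate(candidate)
        if cand_fit < best_fit:
            best, best_fit = candidate, cand_fit
    return best, best_fit


# Toy evaluation: fraction of positions that differ from a known target mask.
TARGET = [1, 0, 1, 0, 0]


def toy_evaluate(mask: List[int]) -> float:
    return sum(m != t for m, t in zip(mask, TARGET)) / len(TARGET)
```

In the setting of the paper, `evaluate` would train and test the classifier on the selected features and return a fitness such as the one in Eq. (24).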

As stated earlier, this study attempted to propose a new optimization approach for feature selection in detecting spam emails using MOBHOA. Figure 2 illustrates the flowchart related to the proposed approach.

5 Simulation and evaluation

The new algorithm was implemented and simulated in the MATLAB R2014a environment installed on a PC with a 64-bit i5 CPU and 4 GB of memory. To evaluate the algorithm's performance in detecting spam, experiments were conducted on the Spam Base data set from the UCI data repository. Of the data, 20% was allocated for training and 80% for testing. The data set includes 4601 emails, of which 1813 (39.4%) are spam emails and 2788 (60.6%) are non-spam emails. Every record in this data set contains fifty-eight features, of which the last shows whether the email is spam (1) or genuine (0). The first forty-eight features indicate the frequency of specific keywords, that is, the percentage of words in the email that match a specific word or phrase. The next six features indicate the frequency of specific characters, and the next three features describe runs of consecutive capital letters (their average length, longest length, and total length). In Liu et al. [28], this data set was recommended as one of the most valid and suitable data sets for spam.
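The column layout of a Spam Base record described above can be written down as index ranges. The sketch below only encodes the counts given in the text; the constant and key names are illustrative, and the group of three attributes after the character frequencies corresponds to the capital-run-length statistics in the UCI documentation.

```python
from typing import Dict, List, Sequence, Union

N_COLUMNS = 58                 # 57 input features plus the class label
WORD_FREQ = slice(0, 48)       # 48 keyword frequency features
CHAR_FREQ = slice(48, 54)      # 6 character frequency features
CAPITAL_RUNS = slice(54, 57)   # 3 remaining input features
LABEL = 57                     # 1 = spam, 0 = genuine


def split_record(record: Sequence[float]) -> Dict[str, Union[List[float], float]]:
    """Split one 58-column record into its feature groups and label."""
    if len(record) != N_COLUMNS:
        raise ValueError("expected %d columns, got %d" % (N_COLUMNS, len(record)))
    return {
        "word_freq": list(record[WORD_FREQ]),
        "char_freq": list(record[CHAR_FREQ]),
        "capital_runs": list(record[CAPITAL_RUNS]),
        "label": record[LABEL],
    }


# Class balance reported in the text.
N_SPAM, N_GENUINE = 1813, 2788
SPAM_SHARE = round(100 * N_SPAM / (N_SPAM + N_GENUINE), 1)
```

A binary feature mask used during feature selection would therefore have 57 entries, one per input feature, with the label column excluded.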

The remainder of this section discusses the classification accuracy of the proposed method in detecting spam compared to GWO and KNN. The simulation results of GWO, KNN, and MOBHOA in terms of classification accuracy for different numbers of iterations are presented in Table 2 and Fig. 3. In this simulation, the number of iterations ranged from 1 to 100, and the population size was set to 20.

Table 2 Comparison of GWO, KNN, and MOBHOA in terms of the accuracy of the classification


According to Table 2 and Fig. 3, MOBHOA obtained much better results than the GWO and KNN algorithms in detecting spam as the number of iterations increased. The performance of MOBHOA was similar to that of the other two algorithms in the first iterations; however, as the number of iterations increased, it outperformed the GWO and KNN algorithms. This was due to the application of the opposition-based approach to develop solutions in the opposite search space.

In the next evaluation, MOBHOA was compared with the K-Nearest Neighbours-Grey Wolf Optimisation (KNN-GWO), KNN, Multi Layer Perceptron (MLP), Naive Bayesian (NB), and Support Vector Machine (SVM) classifiers with regard to accuracy, sensitivity, and precision in detecting spam emails. Table 3 and Fig. 4 show the evaluation results.

Table 3 Comparison of MOBHOA with other classifiers in terms of accuracy, precision, and sensitivity


As mentioned earlier, detecting spam emails is carried out in two steps: the first step is feature selection, and the second step is classification. Table 3 shows the results of feature selection.

The results in Table 3 for the MOBHOA-KNN method were obtained with the feature selection step carried out by MOBHOA and the classification step carried out by KNN. Similarly, in the KNN-GWO method, the feature selection step is carried out with GWO, and the classification step is done with KNN. However, the KNN, MLP, SVM, and NB methods classify data without performing the feature selection step.

The proposed approach to feature selection improves accuracy, precision, and sensitivity and also reduces runtime, because optimal feature selection eliminates redundant or insignificant features so that operations are performed only on significant features. As a result, the algorithm's execution time is reduced, and the accuracy, precision, and sensitivity are increased. In this experiment, the KNN classifier was kept constant across all feature selection methods. In the second run, considering the optimal feature selection, KNN was combined with the proposed method, and the results are shown in Table 3.
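The three evaluation measures named in this section follow from the confusion matrix in the usual way, treating spam (1) as the positive class. The sketch below uses the standard definitions, which are assumed here because the text does not spell them out; the function name and the example labels are illustrative.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, and sensitivity (recall) with spam = 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against empty denominators when a class is never predicted/present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: 2 true positives, 2 true negatives, 1 false positive, 1 false negative.
example = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

Precision penalizes genuine emails wrongly flagged as spam, whereas sensitivity penalizes spam emails that are missed, so reporting both alongside accuracy gives a fuller picture on an imbalanced data set.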

As shown in Fig. 4, the evaluation results indicate that MOBHOA has improved compared to KNN and KNN-GWO with respect to accuracy, sensitivity, and precision. Specifically, it has increased the algorithm's accuracy by around 50%.

The results show that the multiobjective opposition-based binary horse herd optimizer running on the UCI data set has been more successful in the average size of selection and the accuracy of classification compared with some other standard metaheuristic methods. According to the results, the proposed algorithm is substantially more accurate in detecting spam emails in the data set than other similar algorithms. This is due to the application of HOA, which is a highly efficient optimization algorithm and has an outstanding performance in solving high-dimensional problems. The other reason is implementing the feature selection phase besides the classification phase. Feature selection decreases the computational complexity and increases classification accuracy by removing unnecessary features.

Machine learning-based techniques are one of the most efficient ways to solve a variety of problems. However, most machine learning algorithms have the problem of computational complexity. There is a need to employ more advanced techniques and algorithms that can improve the accuracy and decrease the complexity and error rate of the spam detection problem; therefore, we used the horse herd optimization algorithm to further improve the computation speed and accuracy. New advances in deep learning demonstrate that they can still be utilized for solving spam detection problems. A limited number of studies in the literature have examined the performance of deep learning algorithms for spam detection. In addition, the majority of the used datasets are either small in size or artificially developed. Thus, future studies are expected to consider big data solutions, large datasets, and deep learning algorithms to develop more efficient techniques for detecting spam. Furthermore, the focus of this study was specifically on email spam detection, and spam detection in other platforms, such as social networking spam and so on was not examined in the current study. Future studies may focus on using this approach for spam detection on other platforms.

6 Conclusion

Unwanted emails, or spam, have become a problem for Internet users and data centers, as these emails waste a large amount of storage and other resources. Moreover, they provide a basis for intrusion and cyber-attacks as well as for access to user information. There are several techniques and methods for detecting, filtering, and classifying spam and for facilitating its removal. Most of the proposed approaches have some rate of error, and none of the spam detection techniques, despite the optimizations performed, have been effective on their own. The objective of this paper was to use a robust metaheuristic optimization algorithm to detect spam emails in email services. For this purpose, the horse herd optimization algorithm was employed, which is a novel nature-inspired metaheuristic optimization algorithm developed for solving highly complex optimization problems. The problem of detecting spam is discrete and has multiple objectives. To be able to use HOA for this problem, the original HOA, which is a continuous algorithm, was first binarised and then transformed into a multiobjective opposition-based algorithm to solve the feature selection problem in spam detection. The new algorithm, the multiobjective opposition-based binary horse herd optimization algorithm (MOBHOA), was implemented and simulated in MATLAB, and to evaluate the performance of the proposed approach in detecting spam, experiments were conducted on the Spam Base data set from the UCI data repository. According to the simulation results, in comparison with other similar approaches such as KNN, GWO, MLP, SVM, and NB, the new approach performs better in classification in terms of accuracy, precision, and sensitivity. The findings demonstrate that the new approach outperforms similar metaheuristic solutions introduced in the literature; therefore, it could be used for feature selection in spam detection systems.

References

Abdulhamid SM, Shuaib M, Alhassan JK, Adebayo OS, Ismaila I, Osho O, Rans N (2019) Whale optimization algorithm based email spam feature selection method using rotation forest for classification. SN Appl Sci 1:1–17
Abualigah LM, Khader AT, Hanandeh ES (2018) A combination of objective functions and hybrid krill herd algorithm for text document clustering analysis. Eng Appl Artif Intell 73:111–125
Abualigah LMQ (2019) Feature selection and enhanced krill herd algorithm for text document clustering. Springer, Berlin
Awad W, ELseuofi S (2011) Machine learning methods for spam e-mail classification. Int J Comput Sci Inf Technol (IJCSIT) 3(1):173–184
Batra J, Jain R, Tikkiwal VA, Chakraborty A (2021) A comprehensive study of spam detection in e-mails using bio-inspired optimization techniques. Int J Inf Manag Data Insights 1(1):100006
Bibi A, Latif R, Khalid S, Ahmed W, Shabir RA, Shahryar T (2020) Spam mail scanning using machine learning algorithm. J Comput 15(2):73–84
Bogner F (2011) A comprehensive summary of the scientific literature on Horse Assisted Education in Germany. Van Hall Larenstein
Carreras X, Marquez L (2001) Boosting trees for anti-spam email filtering. arXiv preprint cs/0109015
Chang K-H (2014) Design theory and methods using CAD/CAE: the computer aided engineering design series. Academic Press, Cambridge
Chen H, Jiao S, Heidari AA, Wang M, Chen X, Zhao X (2019) An opposition-based sine cosine approach with local search for parameter estimation of photovoltaic models. Energy Convers Manag 195:927–942
DeBarr D, Wechsler H (2009) Spam detection using clustering, random forests, and active learning. In: Sixth conference on email and anti-spam. Mountain View, California
Dedeturk BK, Akay B (2020) Spam filtering using a logistic regression model trained by an artificial bee colony algorithm. Appl Soft Comput 91:106229
Egozi G, Verma R (2018) Phishing email detection using robust nlp techniques. In: IEEE international conference on data mining workshops (ICDMW)
Emary E, Zawbaa HM, Hassanien AE (2016) Binary ant lion approaches for feature selection. Neurocomputing 213:54–65
Faris H, Aljarah I, Al-Shboul B (2016) A hybrid approach based on particle swarm optimization and random forests for e-mail spam filtering. In: International conference on computational collective intelligence
Guo D, Chen C (2014) Detecting non-personal and spam users on geo-tagged Twitter network. Trans GIS 18(3):370–384
GuangJun L, Nazir S, Khan HU, Haq AU (2020) Spam detection approach for secure mobile message communication using machine learning algorithms. Secur Commun Netw 2020:8873639. https://doi.org/10.1155/2020/8873639
Harisinghaney A, Dixit A, Gupta S, Arora A (2014) Text and image based spam email classification using KNN, Naïve Bayes and Reverse DBSCAN algorithm. In: International conference on reliability optimization and information technology (ICROIT)
Hosseinalipour A, Gharehchopogh FS, Masdari M, Khademi A (2021) A novel binary farmland fertility algorithm for feature selection in analysis of the text psychology. Appl Intell 51:4824–4859
Hosseinalipour A, Gharehchopogh FS, Masdari M, Khademi A (2021) Toward text psychology analysis using social spider optimization algorithm. Concurr Comput Pract Exp 33:e6325
Hu H, Wang G (2018) Revisiting email spoofing attacks. arXiv preprint. arXiv:1801.00853
Ibrahim RA, Abd Elaziz M, Oliva D, Cuevas E, Lu S (2019) An opposition-based social spider optimization for feature selection. Soft Comput 23(24):13547–13567
Karim A, Azam S, Shanmugam B, Kannoorpatti K, Alazab M (2019) A comprehensive survey for intelligent spam email detection. IEEE Access 7:168261–168295
Khanmohammadi S, Kizilkan O, Musharavati F (2021) Multiobjective optimization of a geothermal power plant. In: Thermodynamic analysis and optimization of geothermal power plants. Elsevier, pp 279–291
Krueger K, Heinze J (2008) Horse sense: social status of horses (Equus caballus) affects their likelihood of copying other horses’ behavior. Anim Cognit 11(3):431–439
Kumar A, Khorwal R, Chaudhary S (2016) A survey on sentiment analysis using swarm intelligence. Indian J Sci Technol 9(39):1–7
Liao TW, Kuo R (2018) Five discrete symbiotic organisms search algorithms for simultaneous optimization of feature subset and neighborhood size of knn classification models. Appl Soft Comput 64:581–595
Liu J, Jing H, Tang YY (2002) Multi-agent oriented constraint satisfaction. Artif Intell 136(1):101–144
Luo J, Chen H, Heidari AA, Xu Y, Zhang Q, Li C (2019) Multi-strategy boosted mutative whale-inspired optimization approaches. Appl Math Model 73:109–123
Mafarja M, Aljarah I, Heidari AA, Hammouri AI, Faris H, Ala’M A-Z, Mirjalili S (2018) Evolutionary population dynamics and grasshopper optimization approaches for feature selection problems. Knowl Based Syst 145:25–45
Mafarja M, Mirjalili S (2018) Whale optimization approaches for wrapper feature selection. Appl Soft Comput 62:441–453
Mafarja MM, Mirjalili S (2017) Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 260:302–312
Marinos L, Lourenço M (2019) ENISA threat landscape report 2018: 15 top cyberthreats and trends. European Union Agency For Network and Information Security (ENISA)
Mendez JR, Cotos-Yanez TR, Ruano-Ordas D (2019) A new semantic-based feature selection method for spam filtering. Appl Soft Comput 76:89–104
MiarNaeimi F, Azizyan G, Rashki M (2021) Horse herd optimization algorithm: a nature-inspired algorithm for high-dimensional optimization problems. Knowl Based Syst 213:106711
Mirjalili S (2015) Moth-flame optimization algorithm: a novel nature-inspired heuristic paradigm. Knowl Based Syst 89:228–249
Mirjalili S (2016) Dragonfly algorithm: a new meta-heuristic optimization technique for solving single-objective, discrete, and multi-objective problems. Neural Comput Appl 27(4):1053–1073
Mirjalili S (2016) SCA: a sine cosine algorithm for solving optimization problems. Knowl Based Syst 96:120–133
Mirjalili S, Mirjalili SM, Hatamlou A (2016) Multi-verse optimizer: a nature-inspired algorithm for global optimization. Neural Comput Appl 27(2):495–513
Mirjalili S, Mirjalili SM, Lewis A (2014) Grey wolf optimizer. Adv Eng Softw 69:46–61
Mirjalili SZ, Mirjalili S, Saremi S, Faris H, Aljarah I (2018) Grasshopper optimization algorithm for multi-objective optimization problems. Appl Intell 48(4):805–820
Mohmmadzadeh H (2020) Case study email spam detection of two metaheuristic algorithm for optimal feature selection
Pandey AC, Rajpoot DS (2019) Spam review detection using spiral cuckoo search clustering method. Evolut Intell 12(2):147–164
Pashiri RT, Rostami Y, Mahrami M (2020) Spam detection through feature selection using artificial neural network and sine–cosine algorithm. Math Sci 14(3):193–199
Raad M, Yeassen NM, Alam GM, Zaidan BB, Zaidan AA (2010) Impact of spam advertisement through e-mail: a study to assess the influence of the anti-spam on the e-mail marketing. Afr J Bus Manag 4(11):2362–2367
Rajamohana S, Umamaheswari K (2018) Hybrid approach of improved binary particle swarm optimization and shuffled frog leaping for feature selection. Comput Electr Eng 67:497–508
Saab SA, Mitri N, Awad M (2014) Ham or spam? A comparative study for some content-based classification algorithms for email filtering. In: MELECON 2014–2014 17th IEEE mediterranean electrotechnical conference
Saremi S, Mirjalili S, Lewis A (2017) Grasshopper optimisation algorithm: theory and application. Adv Eng Softw 105:30–47
Shadravan S, Naji H, Bardsiri VK (2019) The Sailfish Optimizer: a novel nature-inspired metaheuristic algorithm for solving constrained engineering optimization problems. Eng Appl Artif Intell 80:20–34
Shajideen NM, Bindu V (2018) Spam filtering: a comparison between different machine learning classifiers. In: Second international conference on electronics, communication and aerospace technology (ICECA)
Sharma P, Bhardwaj U (2018) Machine learning based spam e-mail detection. Int J Intell Eng Syst 11(3):1–10
Soni AN (2019) Spam-e-mail-detection-using-advanced-deep-convolution-neuralnetwork-algorithms. J Innov Dev Pharm Tech Sci 2(5):74–80
Srinivasan S, Ravi V, Alazab M, Ketha S, Ala’M A-Z, Padannayil SK (2021) Spam emails detection based on distributed word embedding with deep learning. In: Machine intelligence and big data analytics for cybersecurity applications. Springer, pp 161–189
Wang C, Li Q, Ren TY, Wang XH, Guo GX (2021) High efficiency spam filtering: a manifold learning-based approach. In: Mathematical problems in engineering
Waring G (1983) The behavioral traits and adaptations of domestic and wild horses, including ponies. Horse Behavor
Xu Y, Chen H, Heidari AA, Luo J, Zhang Q, Zhao X, Li C (2019) An efficient chaotic mutative moth-flame-inspired optimizer for global optimization tasks. Expert Syst Appl 129:135–155
Yaseen Q (2021) Spam email detection using deep learning techniques. Procedia Comput Sci 184:853–858
Zhang Y, Gong D-W, Gao X-Z, Tian T, Sun X-Y (2020) Binary differential evolution with self-learning for multi-objective feature selection. Inf Sci 507:67–85
Zhang Y, Wang J, Lu H (2019) Research and application of a novel combined model based on multiobjective optimization for multistep-ahead electric load forecasting. Energies 12(10):1931
Zouache D, Arby YO, Nouioua F, Abdelaziz FB (2019) Multi-objective chicken swarm optimization: a novel algorithm for solving multi-objective optimization problems. Comput Ind Eng 129:377–391


Funding

Open Access funding enabled and organized by CAUL and its Member Institutions. The authors state that they have no known competing financial interests or personal relationships that could have influenced the research presented in this study.

Author information

Authors and Affiliations

Department of Computer Engineering, Heris Branch, Islamic Azad University, Heris, Iran
Ali Hosseinalipour
Faculty of Science and Engineering, Southern Cross University, Gold Coast, Australia
Reza Ghanbarzadeh

Authors

Ali Hosseinalipour
Reza Ghanbarzadeh

Corresponding author

Correspondence to Reza Ghanbarzadeh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The following are some characteristics of emails that are believed to be malicious:

The email is located in the spam list

The sender of the email is anonymous

The sender's email address belongs to a free email service

The sender's email address is entirely or even slightly different from the trusted email address

The email does not address the recipient by name and uses a generic greeting instead, such as "Dear customer" or "Dear expert"

The email expresses a sense of urgency. For example, the sender threatens to immediately close the recipient's account if the requested action is not taken

The email contains persuasive content from a sender who is not credible. For instance, it promises money, lottery winnings, or discount vouchers for famous stores or brands, or asks the recipient to help a charity or an accident survivor

The email asks for personal information such as username, password or bank account details

The email content has major spelling and grammatical errors

The email appears to come from a trusted organization, but the organization is not expected to send an email at that specific time

The entire body of the email is an embedded image of the message rather than actual text

Images in the email contain a link to a fake website

The email contains unexpected links or attachments. In other words, the name or format of the attachments differs from what is expected

The attached files have two or more file extensions
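
Several of the characteristics listed above can be checked mechanically. The following is a minimal sketch in Python of how a few of them might be turned into boolean indicators. It is not the method proposed in this article: the function name `suspicious_flags`, the word lists, and the list of free email domains are hypothetical and chosen only for illustration.

```python
import re

# Illustrative word lists. These are assumptions, not values from the article.
GENERIC_GREETINGS = ("dear customer", "dear expert")
URGENCY_PHRASES = ("immediately", "urgent", "account will be closed")
CREDENTIAL_WORDS = ("username", "password", "bank account")
FREE_MAIL_DOMAINS = ("gmail.com", "yahoo.com", "outlook.com")

# Matches file names with two or more extensions, e.g. "invoice.pdf.exe".
# Legitimate names such as "archive.tar.gz" also match, so this is only a hint.
DOUBLE_EXTENSION = re.compile(r"^[^.]+(\.[A-Za-z0-9]{1,5}){2,}$")


def suspicious_flags(sender, body, attachments=()):
    """Return a dict of boolean indicators for a single email."""
    text = body.lower()
    domain = sender.rsplit("@", 1)[-1].lower()
    return {
        "free_mail_sender": domain in FREE_MAIL_DOMAINS,
        "generic_greeting": any(g in text for g in GENERIC_GREETINGS),
        "urgency": any(p in text for p in URGENCY_PHRASES),
        "asks_for_credentials": any(w in text for w in CREDENTIAL_WORDS),
        "double_extension": any(
            DOUBLE_EXTENSION.match(name) is not None for name in attachments
        ),
    }


flags = suspicious_flags(
    "support@gmail.com",
    "Dear customer, confirm your password immediately.",
    ["invoice.pdf.exe"],
)
```

Indicators of this kind are heuristics only. A single raised flag does not establish that an email is malicious, and an email with no raised flags may still be spam.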

When an email is suspicious or has been identified as spam, users [16]:

should not click on the links in the email

should not open any of the email's attachments

should not reply to the email or contact the sender

should not enter any information on the opened website if they accidentally click on a link in a suspicious email

should report suspicious emails to the body responsible for handling these emails

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Hosseinalipour, A., Ghanbarzadeh, R. A novel approach for spam detection using horse herd optimization algorithm. Neural Comput & Applic 34, 13091–13105 (2022). https://doi.org/10.1007/s00521-022-07148-x

Received: 13 October 2021
Accepted: 28 February 2022
Published: 29 March 2022
Issue Date: August 2022
DOI: https://doi.org/10.1007/s00521-022-07148-x

Keywords