ADOFL: Multi-Kernel-Based Adaptive Directive Operative Fractional Lion Optimisation Algorithm for Data Clustering

Article, Open Access

Satish Chander, P. Vijaya and Praveen Dhyani

Published/Copyright: December 20, 2016
Published by De Gruyter
Abstract

The progress of databases in fields such as medicine, business, education, and marketing is colossal because of developments in information technology. Knowledge discovery from such concealed bulk databases is a tedious task. Data mining is one of the promising solutions for this, and clustering is one of its applications. The clustering process groups data objects related to each other into the same cluster and diverse objects into different clusters. The literature presents many clustering algorithms for data clustering. Optimisation-based clustering is one of the recently developed approaches, discovering the optimal clusters based on an objective function. In our previous work, the directive operative fractional lion (DOFL) optimisation algorithm was proposed for data clustering. In this paper, we design a new clustering algorithm called the adaptive directive operative fractional lion (ADOFL) optimisation algorithm based on multi-kernel functions. Moreover, a new fitness function called the multi-kernel WL index (MKWLI) is proposed for selecting the best centroid points for clustering. The experimentation of the proposed ADOFL algorithm is carried out over two benchmark datasets, Iris and Wine. The performance of the proposed ADOFL algorithm is validated against existing clustering algorithms such as the particle swarm clustering (PSC) algorithm, the modified PSC (mPSC) algorithm, the lion algorithm, the fractional lion algorithm, and DOFL. The results show that the proposed method obtains a maximum clustering accuracy of 79.51 in data clustering.

1 Introduction

Clustering algorithms are broadly categorised into two types: hierarchical clustering and partitional clustering [5, 9, 21]. In partitional clustering algorithms [18], the data objects in the corpus are clustered into non-overlapping clusters; the noteworthy aspect is that each data object belongs to exactly one cluster. In hierarchical clustering algorithms, clusters are formed based on hierarchical levels, and the hierarchy is organised either as a tree or a dendrogram. Clustering can also be performed in different styles: traditional, density-based, grid-based, exclusive, overlapping, fuzzy, complete, and partial algorithms [6, 22]. In overlapping clustering, a data object of one cluster may also be present in another cluster, whereas in exclusive clustering every object is assigned to a single cluster. The cluster size depends on the number of data points in the corpus. In fuzzy clustering, data points are clustered based on a membership weight function. In traditional clustering algorithms, a distance measure is used as the similarity measure for grouping related objects. In density-based clustering, data objects are grouped based on shape, which also reduces noise more effectively. Even though many clustering algorithms have been developed, their use in real-time applications depends on their ability to deal with different attribute types, noise, outliers, interoperability, usability, scalability, etc.

The K-means algorithm was the first clustering algorithm introduced for data clustering [11]. Despite its simplicity, the distance-based centroid calculation is considered a shortcoming of K-means. Clustering algorithms have remained a state-of-the-art research topic because they segment large databases into distinguishable representative clusters efficiently. However, finding a partition whose clusters have a high degree of internal similarity while remaining dissimilar to the rest of the data is known to be NP-hard. After the introduction of soft computing techniques, the clustering problem was transformed into an optimisation problem of finding the optimal clusters in a defined search space. The genetic algorithm [13] was the first optimisation algorithm used for data clustering. Currently, several types of optimisation algorithms, such as particle swarm optimisation [15], artificial bee colony [24], the differential evolution algorithm [3], and the firefly optimisation algorithm [17], are applied to the clustering process. In optimisation-based clustering algorithms, a fitness function is evaluated to select the optimal cluster centroids [1]. In most optimisation-based clustering algorithms, slow convergence and trapping in local optima appear to be major problems. Moreover, the computational overhead of optimisation algorithms on larger data corpora is demanding. In order to overcome these drawbacks, we propose the adaptive directive operative fractional lion (ADOFL) optimisation-based clustering algorithm using multi-kernel functions in this paper. The proposed clustering algorithm has a high convergence rate compared to existing optimisation-based data clustering algorithms.

The main contributions of this paper are as follows:

  • ADOFL optimisation algorithm: The ADOFL optimisation algorithm is the main contribution of this research work. The proposed algorithm is developed by integrating a directive operative searching strategy with an adaptive fractional lion optimisation algorithm, and is based on lion pride behaviour [16]. In the proposed ADOFL algorithm, the female lion update is performed using directive operative searching, and new lions are generated based on the adaptive fractional lion algorithm. Incorporating the searching strategy and the adaptive fractional lion into the algorithm increases its convergence rate, resulting in speedy solution attainment.

  • Fitness function: A new fitness function is also proposed in this research work based on three kernel functions: the Gaussian kernel, tangential kernel, and rational quadratic kernel [7]. The fitness function is developed by altering the distance measure of the WL index (WLI) cluster validity index. Using the proposed fitness function, the best possible solution is chosen as the centroid set for clustering.

2 Motivation

The motivation for this research work on optimisation-based data clustering is presented in this section.

2.1 Problem Statement

Let us assume that D is a database containing bulk data. The primary intention is to cluster the data based on a similarity measure. The data points in the database are represented as P_x, 1 ≤ x ≤ n, and each data point is positioned in a P-dimensional real space. By clustering, the data points P_x are divided into C_y clusters by selecting cluster centres based on the similarity measure. So, the problem in the clustering process is to find the cluster centres C_y, 1 ≤ y ≤ q, from the database D, where q is the number of optimally selected clusters. For optimal centroid selection, optimisation-based clustering algorithms are adopted, in which the centroids chosen for clustering are evaluated based on a fitness function comprising multi-kernel functions. The fitness evaluation function is an objective criterion (similarity measure) that groups data of similar characteristics into one cluster and data of diverse characteristics into different clusters.
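As a simple illustration of this formulation, the sketch below assigns each data point P_x to its nearest centroid C_y. Plain squared Euclidean distance stands in here for the similarity measure, whereas the proposed method uses the multi-kernel measure of Section 3.1; the class and method names are hypothetical.

```java
// Illustrative sketch only: nearest-centroid assignment for the problem
// statement above, with squared Euclidean distance as a stand-in for the
// similarity measure (the paper's multi-kernel measure comes later).
public final class NearestCentroidAssignment {

    /** Squared Euclidean distance between two P-dimensional points. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Assigns every data point P_x to its closest centroid C_y. */
    static int[] assign(double[][] points, double[][] centroids) {
        int[] labels = new int[points.length];
        for (int x = 0; x < points.length; x++) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int y = 0; y < centroids.length; y++) {
                double dist = squaredDistance(points[x], centroids[y]);
                if (dist < bestDist) { bestDist = dist; best = y; }
            }
            labels[x] = best;
        }
        return labels;
    }
}
```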

2.2 Challenges

Based on the literature, the challenges present in the existing research works related to the optimisation-based clustering algorithm are listed below.

  • The effectiveness of the optimisation-based clustering algorithm mainly depends on the objective function chosen for clustering. In most of the existing works, low-performing measures, such as mean square error (MSE), sum squared distance (SSD), etc., are used as the objective function [8].

  • The objective function selection is another crucial challenge in optimisation-based data clustering algorithms. The selected objective function must not affect the convergence performance of the optimisation algorithm irrespective of dataset characteristics such as the range of values, dimension, image, data type [14], etc.

  • Kernel-based clustering algorithms achieve better performance with real-world data [12]. However, they have limitations in terms of both running time and memory complexity: the complexity of the kernel computation grows quadratically with the number of data instances.

  • Single-kernel functions [2] included in the objective function of the optimisation algorithms are not able to follow the frequency content variations in the dissimilar regions of an input space. This is one of the greatest challenges in the kernel-based clustering algorithm.

  • Most of the existing algorithms aim to find the global centroids throughout the process rather than focusing on the initialisation part, and their termination strategies do not make the convergence procedure aware of the quality improvement of the centroids [14].

  • The directive operative lion-based optimisation algorithm uses unique lion behaviour as well as the fractional lion algorithm for centroid selection [4]. However, its convergence rate is low and its search space is small, which increases the time needed for solution attainment.

In order to overcome the aforementioned challenges, this research work utilises the advantages of multi-kernel functions in the proposed objective function for centroid selection, and reduces the convergence problem of the DOFL algorithm by adaptively generating new solutions in consecutive iterations of the optimisation algorithm, thereby increasing the convergence rate.

3 Proposed Methodology: ADOFL Optimisation Algorithm for Data Clustering

In this section, the proposed ADOFL optimisation algorithm for data clustering is presented.

The overall block diagram of the proposed ADOFL optimisation-based data clustering algorithm is depicted in Figure 1. The processing steps of the proposed ADOFL algorithm are as follows. Primarily, the database for clustering is accepted as the input to the proposed data clustering algorithm. The input database consists of high-dimensional data, from which the data for the clustering algorithm are selected in the object vs. attribute format. For example, consider the Iris dataset [10]: it has four attributes, so each data object consists of four attributes. Normally in data clustering, the data objects are clustered based on their attributes; analogously, the data selection is performed in the object vs. attribute format in this research work. After the data selection, the selected data objects are subjected to the proposed ADOFL optimiser.

Figure 1: Proposed ADOFL Optimisation Algorithm.

The proposed ADOFL optimisation algorithm is a modification of the fractional lion optimisation algorithm [4]. ADOFL is developed by modifying the female fertility evaluation function and the solution generation function of the fractional lion optimisation algorithm. The female lion (lioness) fertility function is altered with a directive operative searching strategy that searches for fertile lionesses within the available resources. Moreover, a new fitness function called the multi-kernel WLI (MKWLI) is proposed as the objective function of the proposed clustering algorithm. The fitness function is developed by combining three kernel functions with the WLI cluster validity index [20]. The selected data subjected to the optimiser are clustered into different groups based on the optimal cluster centroids selected by the proposed optimisation algorithm. The convergence rate of the ADOFL algorithm is higher than that of existing data clustering algorithms, and its computational overhead is low because of the meta-heuristic operation, so the speed of the proposed clustering algorithm is high.

3.1 Proposed MKWLI Fitness Function

In this section, the detailed description of the proposed fitness function is presented. In an optimisation-based data clustering algorithm, the fitness function is the objective function that evaluates the possibility of a randomly selected cluster centroid point resulting in a successful cluster. Moreover, besides serving as the similarity measure for grouping, the fitness function also improves the quality of clustering. In this research work, an MKWLI cluster validity index is proposed as the fitness function, obtained by adapting kernel-based distance measurement in WLI [20]. WLI is a cluster validity index used for evaluating fuzzy clustering results; here, it is used to evaluate the centroid points selected for clustering. It assesses the clustering results based on the fuzzy weighted distances and the fuzzy cardinality of the clusters (membership function). The WLI cluster validity index considers the distance minimisation between the data points and their nearest-neighbour centroids for cluster validation. Here, we make use of kernel functions in the distance measurement: the sum of three distances from three kernel functions is utilised as the distance measurement in WLI.

The kernel function is chosen in the proposed clustering algorithm as it is associated with supervised learning, where ground truth values are available. The availability of ground truth values helps the kernel in ideal learning, resulting in more successful clustering. In the proposed fitness function, three kernel functions are included in the distance measurement of WLI: the Gaussian kernel, the tangential kernel, and the rational quadratic kernel [7]. The sum of the kernel functions considered in the proposed fitness function is represented by K_sum.

The considered kernel function is given by

K_{sum} = K_1^{\text{Gaussian}} + K_2^{\text{tangential}} + K_3^{\text{rational quadratic}}.

The kernel function calculation between the real-valued vector inputs X_a and X_b is signified below:

  • Gaussian kernel: K_1^{\text{Gaussian}} = \exp\left( -\frac{\|X_a - X_b\|^2}{2\sigma^2} \right), where σ is the adjustable parameter that increases the performance of the kernel.

  • Tangential kernel: K_2^{\text{tangential}} = \tanh\left( \alpha X_a^T X_b + c \right), where α and c are the slope and constant terms of the kernel function.

  • Rational quadratic kernel: K_3^{\text{rational quadratic}} = 1 - \frac{\|X_a - X_b\|^2}{\|X_a - X_b\|^2 + c}, where c is the constant term of the kernel function.

The use of multiple kernels provides non-linear, data-dependent similarity functions. Embedding the multi-kernel function in the distance measure of the WLI increases the efficiency of the grouping. The multi-kernel function is chosen over a single-kernel function, as single-kernel functions are incapable of handling high-dimensional databases.
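A minimal sketch of the combined kernel under the three definitions above is given below. The class name and the choice of constructor parameters are illustrative assumptions; the parameter values σ, α, and c would need to be tuned.

```java
// Minimal sketch of the combined kernel K_sum = K1 + K2 + K3 from the
// three definitions above. Parameter values are illustrative only.
public final class MultiKernel {
    final double sigma; // Gaussian width (adjustable parameter)
    final double alpha; // tangential kernel slope
    final double c;     // constant term shared by K2 and K3

    MultiKernel(double sigma, double alpha, double c) {
        this.sigma = sigma; this.alpha = alpha; this.c = c;
    }

    static double squaredNorm(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** K_sum(Xa, Xb) = Gaussian + tangential + rational quadratic. */
    double kSum(double[] xa, double[] xb) {
        double sq = squaredNorm(xa, xb);
        double gaussian = Math.exp(-sq / (2.0 * sigma * sigma));
        double tangential = Math.tanh(alpha * dot(xa, xb) + c);
        double rationalQuadratic = 1.0 - sq / (sq + c);
        return gaussian + tangential + rationalQuadratic;
    }
}
```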

3.1.1 MKWLI Fitness Function

The multi-kernel WLI fitness function proposed in this paper is denoted as

(1) MKWLI = \frac{WL_n}{2 \times WL_D}.

Here, WL_n is the fuzzy compactness and WL_D is the separation measure. The fuzzy compactness is expressed as

(2) WL_n = \sum_{m=1}^{K} \left( \frac{\sum_{l=1}^{N} \mu_{lm} K_{sum}(a_l, v_m)}{\sum_{l=1}^{N} \mu_{lm}} \right),

where K_sum represents the robust distance measurement between the data point a_l, 1 ≤ l ≤ N, and the centroid point v_m, 1 ≤ m ≤ K, in the multiple-kernel space, and μ is the membership matrix. The membership value μ_lm is computed as follows:

(3) \mu_{lm} = \left[ \sum_{s=1}^{K} \left( \frac{\| K_{sum}(a_l, v_m) \|}{\| K_{sum}(a_l, v_s) \|} \right)^{2/(B-1)} \right]^{-1}, \quad B > 1.

Here, the centroid point v_m for the membership computation is given by

(4) v_m = \frac{\sum_{l=1}^{N} \mu_{lm}^{B} a_l}{\sum_{l=1}^{N} \mu_{lm}^{B}}.

The multi-kernel distance measurement function in the fuzzy compactness and membership matrix computation is given by

(5) K_{sum}(a_l, v_m) = \exp\left( -\frac{\|a_l - v_m\|^2}{2\sigma^2} \right) + \tanh\left( \alpha a_l^T v_m + c \right) + \left( 1 - \frac{\|a_l - v_m\|^2}{\|a_l - v_m\|^2 + c} \right),
(6) K_{sum}(a_l, v_s) = \exp\left( -\frac{\|a_l - v_s\|^2}{2\sigma^2} \right) + \tanh\left( \alpha a_l^T v_s + c \right) + \left( 1 - \frac{\|a_l - v_s\|^2}{\|a_l - v_s\|^2 + c} \right).

The separation measure WL_D of the proposed fitness function is calculated using the median and minimum distances between pairs of centroids. WL_D is represented as

(7) WL_D = \frac{1}{2} \left\{ \min_{r \neq s} \left( K_{sum}(v_r, v_s) \right) + \operatorname{median} \left( K_{sum}(v_r, v_s) \right) \right\}.

The distance between a pair of centroids is also calculated by integrating the multi-kernel distance. The kernel distance between a pair of centroids is denoted as

(8) K_{sum}(v_r, v_s) = \exp\left( -\frac{\|v_r - v_s\|^2}{2\sigma^2} \right) + \tanh\left( \alpha v_r^T v_s + c \right) + \left( 1 - \frac{\|v_r - v_s\|^2}{\|v_r - v_s\|^2 + c} \right).

The proposed multiple-kernel WL cluster validity index, used as the fitness function in Eq. (1), reduces the running time complexity and memory requirements of the proposed ADOFL clustering algorithm.
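As a hedged illustration, the sketch below assembles the MKWLI fitness from Eqs. (1)–(3) and (7), reusing the MultiKernel class from the previous sketch. Treating the scalar K_sum value as the distance entering WLI, and taking the upper median for an even number of centroid pairs, are reading assumptions, not verified details of the authors' implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hedged sketch of the MKWLI fitness (Eqs. 1-3, 7). Assumes the
// MultiKernel class from the previous sketch is on the classpath.
public final class MkwliFitness {
    final MultiKernel kernel;
    final double fuzzifier; // B in Eq. (3), assumed > 1

    MkwliFitness(MultiKernel kernel, double fuzzifier) {
        this.kernel = kernel; this.fuzzifier = fuzzifier;
    }

    /** Membership of a point in cluster m, Eq. (3). */
    double membership(double[] point, double[][] centroids, int m) {
        double num = Math.abs(kernel.kSum(point, centroids[m]));
        double exponent = 2.0 / (fuzzifier - 1.0);
        double sum = 0.0;
        for (double[] vs : centroids) {
            sum += Math.pow(num / Math.abs(kernel.kSum(point, vs)), exponent);
        }
        return 1.0 / sum;
    }

    /** MKWLI = WL_n / (2 * WL_D), Eq. (1). */
    double mkwli(double[][] points, double[][] centroids) {
        // Fuzzy compactness WL_n, Eq. (2).
        double wlN = 0.0;
        for (int m = 0; m < centroids.length; m++) {
            double num = 0.0, den = 0.0;
            for (double[] a : points) {
                double mu = membership(a, centroids, m);
                num += mu * kernel.kSum(a, centroids[m]);
                den += mu;
            }
            wlN += num / den;
        }
        // Separation WL_D, Eq. (7): half of (min + median) over centroid pairs.
        List<Double> pairDistances = new ArrayList<>();
        for (int r = 0; r < centroids.length; r++)
            for (int s = r + 1; s < centroids.length; s++)
                pairDistances.add(kernel.kSum(centroids[r], centroids[s]));
        Collections.sort(pairDistances);
        double min = pairDistances.get(0);
        double median = pairDistances.get(pairDistances.size() / 2); // upper median
        double wlD = 0.5 * (min + median);
        return wlN / (2.0 * wlD);
    }
}
```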

3.2 ADOFL Optimisation Algorithm

In this section, the proposed ADOFL optimisation algorithm is explained with solution encoding.

The ADOFL optimisation algorithm is proposed to rapidly estimate the centroids for the clusters. Optimisation is a current state-of-the-art research area used in many applications, such as data mining, because of its simplicity. The proposed work is developed by modifying the fractional lion optimisation algorithm [4], which is based on lion social behaviour [16]. Moreover, an adaptive modification is made in the fractional lion generation. The adaptive fractional lion generation improves the speed of optimal centroid selection for clustering, increasing the convergence rate. The female update using the directive operative searching algorithm increases the search space of the lion pride for attaining the best solution. In this work, an attempt is made to introduce a novel optimisation algorithm for data clustering with a modified directive operative search algorithm and adaptive fractional lion generation.

  1. Solution Encoding:

    The solution encoding of the proposed ADOFL optimisation algorithm is described in this section and depicted in Figure 2; a decoding sketch is given after the figure. The clustering is performed over the selected data of the database in the object vs. attribute format. Consider the input database D given by D = {D_1, D_2}, where D_1 and D_2 are the data points subjected to clustering. Each data object consists of five attributes, P_wx = {P_w1, P_w2, P_w3, P_w4, P_w5}, where w indexes the data objects, 1 ≤ w ≤ x. Let us consider two cluster centroids for clustering; the solution vector then consists of 10 elements, since the solution size is C_y × P_x ⇒ 2 × 5 = 10.

  2. ADOFL Algorithm:

    The detailed description of the proposed ADOFL optimisation algorithm is presented in this section. The fractional lion optimisation algorithm [4], itself a modification of the lion optimisation algorithm, is the basis of the proposed algorithm. In the proposed algorithm, the directive operative searching algorithm is used for the female update, and an adaptive function is integrated into the fractional lion generation; thereby, solution attainment properties such as speed, search space, and convergence rate are optimised, supporting more effective data clustering. The algorithmic steps of the proposed ADOFL optimisation algorithm are discussed below.

Figure 2: Solution Encoding.
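A minimal decoding sketch for this encoding is given below: the flat solution vector of C_y × P_x elements (here 2 × 5 = 10) is reshaped into C_y candidate centroids. The class and method names are hypothetical.

```java
// Sketch of the solution encoding in Figure 2: a flat vector of
// C_y x P_x elements reshaped into one centroid row per cluster.
public final class SolutionEncoding {
    /** Decodes a flat solution vector into C centroids of P attributes. */
    static double[][] decode(double[] solution, int clusters, int attributes) {
        double[][] centroids = new double[clusters][attributes];
        for (int y = 0; y < clusters; y++)
            for (int p = 0; p < attributes; p++)
                centroids[y][p] = solution[y * attributes + p];
        return centroids;
    }
}
```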

  1. Lion pride generation:

    Lion pride generation is the first step of the proposed ADOFL algorithm. The lion pride is the set of solution vectors comprising the male, female, and nomad lions. Let S^m be the male lion, S^f the female lion, and S^n the nomad lion of the pride. Each lion solution vector encodes a candidate set of cluster centroids. The elements of the solution vectors are given by S^m(d), S^f(d), and S^n(d), where d lies in the range (1, L). At algorithm initiation, only one nomad lion is initialised.

  2. Pride fitness evaluation:

    The generated lion pride containing the male, female, and nomad lion solution vectors is evaluated using the proposed MKWLI fitness function. The fitness values of the pride are represented by f(S^m), f(S^f), and f(S^n). The fitness of the male lion is considered the reference fitness in the optimisation algorithm, as per the social behaviour of the male lion. From the solution vectors, the best male and female lions are selected for further processing.

  3. Fertility evaluation:

    Fertility evaluation checks the fertility of the male and female lions. If a lion is found to be infertile, the solution is rejected and new solutions are generated. The fertility evaluation is performed based on the laggardness rate L_r for the male lion and the sterility rate S_r for the female lion. Initially, L_r and S_r are set to zero; the fertility evaluation continues until the predetermined values of L_r and S_r are reached.

    Male lion evaluation: In the male fertility evaluation, the fitness of the male lion is compared with the reference fitness selected at the fitness evaluation stage. The process is iterated by increasing the laggardness rate L_r; based on the evaluation, the best male lion with fitness greater than the reference fitness is chosen as the male lion for further stages.

    Female lion evaluation: The female fertility evaluation is performed based on the proposed directive operative searching strategy (a sketch of this search is given after the algorithm steps). The process of female fertility evaluation continues until the maximum number of female generations is reached. Here, U_cf and G_cf are the female update count and female generation count, respectively.

    As per the fractional lion algorithm [4], the female lion update equation is given by

    S_b^{f+} = \begin{cases} s_k^{f+}, & \text{if } b = k, \\ s_b^{f+}, & \text{otherwise}, \end{cases}

    where s_k^{f+} and s_b^{f+} are the kth and bth vector elements of S^{f+}, respectively. In our case, the value of s_k^{f+} is calculated based on the directive operative searching strategy. Using the searching strategy, three new solutions are generated, centred on S_b^{f+}. From the generated female lions, the solution with the best fitness is chosen as the female lion for further processing stages. The generated new solution vectors are characterised by the maximum pursuit angle, pursuit distance, and pursuit height, and are represented as

    S_k^{zf+} = S_i^{f+} + R \, d_{max} V_i(\varphi_i),
    S_k^{rf+} = S_i^{f+} + R \, d_{max} V_i\left( \varphi_i + R^* \frac{\theta_{max}}{2} \right),
    S_k^{lf+} = S_i^{f+} + R \, d_{max} V_i\left( \varphi_i - R^* \frac{\theta_{max}}{2} \right),

    where R is a normally distributed random number with zero mean and standard deviation 1, R^* is a uniformly distributed random sequence, d_max is the maximum pursuit distance, θ_max is the maximum pursuit angle, and i is the iteration number. In the search strategy, each solution vector searches for a new solution with a head-scanning mechanism. The search direction of a solution vector is given by

    V_i^j(\varphi_i^j) = (t_i^{j1}, t_i^{j2}, \ldots, t_i^{jn}) \in \mathbb{R}^n.

    The value of the direction vector is calculated using the polar-to-Cartesian coordinate transformation with the help of φ_i^j, where j indexes the solution vector, and it is expressed as

    t_i^{j1} = \prod_{a=1}^{n-1} \cos(\varphi_i^{ja}),
    t_i^{ju} = \sin(\varphi_i^{j(u-1)}) \prod_{a=u}^{n-1} \cos(\varphi_i^{ja}), \quad u = 2, 3, \ldots, n-1,
    t_i^{jn} = \sin(\varphi_i^{j(n-1)}).

  4. Mating:

    The mating stage is a significant operator in the ADOFL optimisation algorithm. Mating is performed in two stages, crossover and mutation, and produces lion cubs, which further improves the convergence rate of the proposed ADOFL optimisation algorithm.

    Crossover: The crossover operation results in four new solution vectors. It is mathematically represented as

    S_c(e) = (M_e) S^m + (\bar{M}_e) S^f,

    where M_e is the crossover mask, \bar{M}_e is its complement, and S_c denotes the generated cubs. The crossover operation is performed with a random crossover probability; in our work, single-point crossover is utilised.

    Mutation: The lion cubs generated by crossover are subjected to the mutation process, which is performed with a random mutation rate and generates four new cubs as solution vectors:

    S^{New} = \text{mutation}(S_c).

    After the mating operations, the generated cubs are subjected to a gender clustering process in which male and female lion cubs are clustered separately. Then, the fitness of all male and female lion cubs is calculated using Eq. (1). The male and female cubs with the highest fitness are selected, represented as S_c^m and S_c^f.

  5. Adaptive fractional calculus-based generation:

    In order to increase the speed of solution attainment, in addition to the solution vectors of the lion pride, a new lion solution vector is generated based on adaptive fractional calculus (a sketch of this step is given after the algorithm steps). According to the fractional lion algorithm [4], if the fitness values of the male lion in the present and next iterations are equal, f(S_{l+1}^m) = f(S_l^m), then the solution vectors of the present and next iterations are also equal:

    S_{l+1}^m = S_l^m.

    Rearranging the above equation gives

    S_{l+1}^m - S_l^m = 0.

    As per fractional theory, when the values of the present and next iterations are equal, the discrete version of the derivative of order χ equals zero:

    D^{\chi}[S_{l+1}^m] = 0,

    where D^χ denotes the discrete derivative of order χ. Retaining the first two terms of the discrete fractional derivative expansion, the above equation can be elaborated as

    D^{\chi}[S_{l+1}^m] = S_{l+1}^m - \chi S_l^m - \frac{1}{2} \chi S_{l-1}^m,
    S_{l+1}^m - \chi S_l^m - \frac{1}{2} \chi S_{l-1}^m = 0,
    S_{l+1}^m = \chi S_l^m + \frac{1}{2} \chi S_{l-1}^m.

    The above equations follow the fractional lion algorithm [4], in which χ is a constant. Here, we replace the constant with an adaptive value, given by

    \chi = S_{l,\min}^m + \left( S_{l,\max}^m - S_{l,\min}^m \right) \times \frac{S_l^m}{S_{l,\max}^m}.

    Using this adaptive value, a new fractional lion is generated, which further improves the speed of optimal centroid selection.

  6. Termination criteria:

    The ADOFL algorithm is iterated until the maximum number of iterations is reached. From the optimal solution vectors, the solution with the highest fitness is chosen as the optimal centroid set for clustering.
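The two modified operators above can be sketched in code. First, a hedged reading of the directive operative (head-scanning) search used in the female lion evaluation of step 3: three candidates are generated around the current solution at angles φ, φ + R*θ_max/2, and φ − R*θ_max/2. The class name, RNG setup, and parameter handling are illustrative assumptions, not the authors' implementation.

```java
import java.util.Random;

// Hedged sketch of the directive operative search of step 3. phi must
// hold (s.length - 1) head angles so the direction vector matches the
// solution length.
public final class DirectiveOperativeSearch {
    final Random rng = new Random();
    final double dMax;     // maximum pursuit distance (assumed parameter)
    final double thetaMax; // maximum pursuit angle (assumed parameter)

    DirectiveOperativeSearch(double dMax, double thetaMax) {
        this.dMax = dMax;
        this.thetaMax = thetaMax;
    }

    /** Polar-to-Cartesian head direction V(phi), as in the t-equations. */
    static double[] direction(double[] phi) {
        int n = phi.length + 1;
        double[] t = new double[n];
        for (int u = 0; u < n; u++) {
            double v = (u == 0) ? 1.0 : Math.sin(phi[u - 1]);
            for (int a = u; a < n - 1; a++) v *= Math.cos(phi[a]);
            t[u] = v;
        }
        return t;
    }

    /** Generates the three scanned candidates (zero, right, left). */
    double[][] scan(double[] s, double[] phi) {
        double r = rng.nextGaussian();   // R ~ N(0, 1)
        double rStar = rng.nextDouble(); // R* ~ U(0, 1)
        double[] offsets = {0.0, rStar * thetaMax / 2.0, -rStar * thetaMax / 2.0};
        double[][] candidates = new double[3][];
        for (int k = 0; k < 3; k++) {
            double[] shifted = phi.clone();
            for (int i = 0; i < phi.length; i++) shifted[i] += offsets[k];
            double[] dir = direction(shifted);
            double[] candidate = new double[s.length];
            for (int i = 0; i < s.length; i++) candidate[i] = s[i] + r * dMax * dir[i];
            candidates[k] = candidate;
        }
        return candidates;
    }
}
```

Second, a minimal sketch of the adaptive fractional generation of step 5. The adaptive χ here follows the reconstruction given above; the original equation is ambiguous, so the min/max scaling is an assumption.

```java
// Hedged sketch of step 5: S_{l+1} = chi*S_l + (chi/2)*S_{l-1}, with chi
// scaled adaptively between assumed bounds by the current fitness ratio.
public final class AdaptiveFractionalGeneration {

    /** Adaptive order chi; the min/max scaling is a reconstruction. */
    static double adaptiveChi(double chiMin, double chiMax,
                              double fitnessCurrent, double fitnessMax) {
        return chiMin + (chiMax - chiMin) * (fitnessCurrent / fitnessMax);
    }

    /** New male solution from the present and previous iterations. */
    static double[] generate(double[] current, double[] previous, double chi) {
        double[] next = new double[current.length];
        for (int i = 0; i < current.length; i++) {
            next[i] = chi * current[i] + 0.5 * chi * previous[i];
        }
        return next;
    }
}
```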

4 Results and Discussion

The results and discussion of the experimentation of the proposed ADOFL optimisation algorithm for data clustering are deliberated in this section, along with the performance validation of the proposed ADOFL algorithm against existing methods using different performance measures.

  1. Experimental Setup:

    The experimentation of the proposed ADOFL data clustering algorithm is performed on a personal computer with the following specifications: Windows 10 operating system, Intel Core i3 processor, and 2 GB RAM. The implementation tool used for the experimentation is Java version 8.

  2. Dataset:

    The experimentation of the proposed ADOFL optimisation algorithm for data clustering is performed over two datasets, Iris and Wine [10]. These are well-known datasets for pattern recognition, and each contains three classes of data. Each class in the Iris dataset consists of 50 sample data instances, whereas in the Wine dataset class I consists of 59 data instances, class II of 71 data instances, and class III of 48 data instances. The Iris dataset consists of four attributes in numeric form, whereas the Wine dataset consists of 13 attributes in continuous format. Compared to the Iris dataset, the Wine dataset is well organised.

  3. Comparative Methods:

    The performance validation of the proposed ADOFL algorithm is done over existing optimisation algorithm-based data clustering methodologies such as particle swarm clustering (PSC) [23], modified PSC (mPSC) [19], lion [16], fractional lion [4], and DOFL.

  4. Performance Measures:

    The evaluation metrics considered for the performance evaluation of the proposed clustering algorithm are presented in this section. The efficacy of the proposed clustering algorithm is evaluated against the comparative methods using performance measures such as clustering accuracy, Rand coefficient, and Jaccard coefficient.

4.1 Experimental Results

The experimental results of the proposed ADOFL data clustering algorithm are presented in this section.

4.2 Convergence Analysis

In this section, the convergence analysis of the proposed ADOFL clustering algorithm is deliberated. Figure 3 depicts the convergence analysis curves for the Iris and Wine datasets, plotted as fitness value versus number of iterations. The convergence analysis is performed by increasing the number of iterations for different cluster sizes; here, cluster sizes 2 and 3 are used. Figure 3A shows the convergence curve for the Iris dataset: with cluster size C = 2, the fitness value decreases as the iterations increase, and similarly for cluster size 3. At iteration 5, the fitness values attained by the proposed ADOFL algorithm for cluster sizes 2 and 3 are 4.237 and 3.624, respectively. The convergence curve for the Wine dataset is shown in Figure 3B. For the Wine dataset with cluster size 2, the fitness value attained by the proposed algorithm at iterations 1 and 2 is 4.617; for cluster size 3, the fitness value attained at iterations 7 and 8 is 3.456.

Figure 3: Convergence Analysis Curve. (A) Iris dataset. (B) Wine dataset.

4.3 Performance Analysis

The performance validation of the proposed ADOFL clustering algorithm over existing methods, based on the experimental results and using performance measures such as clustering accuracy, Rand coefficient, and Jaccard coefficient, is presented in this section.

4.3.1 Analysis Based on Clustering Accuracy

In this section, the analysis based on clustering accuracy is presented. The clustering accuracy analysis curves for the Iris and Wine datasets are shown in Figure 4A and B, respectively, plotted as clustering accuracy versus cluster size. On experimentation over the Iris dataset with cluster size 2, the proposed ADOFL algorithm attains a maximal clustering accuracy of 79.36, whereas the existing PSC, mPSC, lion, fractional lion, and DOFL clustering algorithms attain clustering accuracies of 69.5, 70, 75, 75, and 78.26, respectively. As the cluster size increases, the clustering accuracy of the proposed ADOFL algorithm remains better than that of the existing algorithms. For the Wine dataset, as shown in Figure 4B, with a cluster size of 3, the clustering accuracies attained by the existing methods PSC, mPSC, lion, fractional lion, and DOFL are 69.42, 71.65, 73.29, 74.15, and 76.18, respectively, while the proposed ADOFL algorithm attains 76.16. The best-case clustering accuracy of 77.04 is attained by the proposed ADOFL algorithm with a cluster size of 5, and the worst-case clustering accuracy of 68.14 is attained by the existing PSC algorithm with a cluster size of 2.

Figure 4: Clustering Accuracy Analysis Curve. (A) Iris dataset. (B) Wine dataset.

4.3.2 Analysis Based on the Rand Coefficient

The analysis based on the Rand coefficient is discussed in this section. Figure 5 depicts the Rand coefficient analysis curves, plotted as Rand coefficient value versus cluster size. For the Iris dataset, shown in Figure 5A, the Rand coefficient analysis is performed for nine cluster sizes, from 2 to 10. For the Wine dataset, shown in Figure 5B, the analysis is performed for four cluster sizes, from 2 to 5. The experimentation on the Wine dataset yields the best Rand coefficient value of 78.86 for the proposed ADOFL algorithm with cluster size 5, while the worst-case Rand coefficient value of 61.36 is attained by the PSC algorithm with cluster size 2.

Figure 5: Rand Coefficient Analysis Curve. (A) Iris dataset. (B) Wine dataset.

4.3.3 Analysis Based on the Jaccard Coefficient

In this section, the analysis based on the Jaccard coefficient is presented. The Jaccard coefficient analysis curves are shown in Figure 6, plotted as Jaccard value versus cluster size. Figure 6A depicts the Jaccard coefficient curve for the Iris dataset, for which the experimentation is performed with cluster sizes varying from 2 to 10, and Figure 6B depicts the curve for the Wine dataset, with cluster sizes varying from 2 to 5. For the Wine dataset, as shown in Figure 6B, the Jaccard values attained by the proposed ADOFL algorithm are better than those of the existing methods. The best-case Jaccard value of 65.07 is attained by the proposed ADOFL algorithm with cluster size 5, and the worst-case Jaccard value of 56.39 is attained by the PSC clustering algorithm at cluster sizes 3 and 4.

Figure 6: Jaccard Coefficient Analysis Curve. (A) Iris dataset. (B) Wine dataset.

5 Conclusion

In this paper, we have presented a novel optimisation algorithm to find the optimal centroid points for data clustering. The ADOFL algorithm was proposed by incorporating adaptive fractional lion generation and the directive operative search strategy. Moreover, a new fitness function, MKWLI, was proposed by integrating multiple kernel functions with the WLI cluster validity index. Based on the proposed fitness function, the optimal centroid points for clustering were chosen. The advantage of incorporating the multi-kernel WLI objective function is that the kernel space can identify the variance among the data points better than clustering performed in the data space alone. The experimentation of the proposed ADOFL clustering algorithm was performed over the Iris and Wine datasets. The performance of the proposed ADOFL algorithm was validated against existing methods using evaluation measures such as clustering accuracy, Jaccard coefficient, and Rand coefficient. The experimental results proved the efficacy of the proposed ADOFL algorithm in data clustering, with a clustering accuracy of 79.51.

Bibliography

[1] D. Binu, Cluster analysis using optimization algorithms with newly designed objective functions, Expert Syst. Appl. 42 (2015), 5848–5859. doi:10.1016/j.eswa.2015.03.031.

[2] D. Binu, M. Selvi and A. George, MKF-Cuckoo: hybridization of cuckoo search and multiple kernel-based fuzzy C-means algorithm, in: Proceedings of AASRI Conference on Intelligent Systems and Control, vol. 4, pp. 243–249, 2013. doi:10.1016/j.aasri.2013.10.037.

[3] J. A. Castellanos-Garzon and F. Diaz, An evolutionary computational model applied to cluster analysis of DNA microarray data, Expert Syst. Appl. 40 (2013), 2575–2591. doi:10.1016/j.eswa.2012.10.061.

[4] S. Chander, P. Vijaya and P. Dhyani, Fractional lion algorithm – an optimization algorithm for data clustering, J. Comput. Sci. 12 (2016), 323–340. doi:10.3844/jcssp.2016.323.340.

[5] G. Duan, W. Hu and Z. Zhang, A novel data clustering algorithm based on modified adaptive particle swarm optimization, Int. J. Signal Process. Image Process. Pattern Recogn. 9 (2016), 179–188. doi:10.14257/ijsip.2016.9.3.16.

[6] M. Ester, H. P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD), pp. 226–231, 1996.

[7] S. Fadel, S. Ghoniemy, M. Abdallah, H. A. Sorra, A. Ashour and A. Ansary, Investigating the effect of different kernel functions on the performance of SVM for recognizing Arabic characters, Int. J. Adv. Comput. Sci. Appl. 7 (2016), 446–450. doi:10.14569/IJACSA.2016.070160.

[8] X. Huang, Y. Ye and H. Zhang, Extensions of K-means-type algorithms: a new clustering framework by integrating intracluster compactness and intercluster separation, IEEE Trans. Neural Netw. Learn. Syst. 25 (2014), 1433–1446. doi:10.1109/TNNLS.2013.2293795.

[9] J. Ji, W. Pang, C. Zhou, X. Han and Z. Wang, A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data, Knowl.-Based Syst. 30 (2012), 129–135. doi:10.1016/j.knosys.2012.01.006.

[10] M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2013. Available at http://archive.ics.uci.edu/ml, accessed May 2015.

[11] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, 1967.

[12] M. Motamedi and H. Naderi, Data clustering using kernel based algorithm, Inform. Technol. Control Autom. 4 (2014), 1–6. doi:10.5121/ijitca.2014.4301.

[13] U. Maulik and S. Bandyopadhyay, Genetic algorithm-based clustering technique, Pattern Recogn. 33 (2000), 1455–1465. doi:10.1016/S0031-3203(99)00137-5.

[14] J. K. Parker and L. O. Hall, Accelerating fuzzy-c means using an estimated subsample size, IEEE Trans. Fuzzy Syst. 22 (2014), 1229–1244. doi:10.1109/TFUZZ.2013.2286993.

[15] K. Premalatha and A. M. Natarajan, A new approach for data clustering based on PSO with local search, Comput. Inform. Sci. 1 (2008), 139–145. doi:10.5539/cis.v1n4p139.

[16] B. Rajakumar, The Lion's Algorithm: a new nature-inspired search algorithm, in: Proceedings of International Conference on Communication, Computing and Security, vol. 6, pp. 126–135, 2012. doi:10.1016/j.protcy.2012.10.016.

[17] J. Senthilnath, S. N. Omkar and V. Mani, Clustering using firefly algorithm: performance study, Swarm Evol. Comput. 1 (2011), 164–171. doi:10.1016/j.swevo.2011.06.003.

[18] T. M. Silva Filho, B. A. Pimentel, R. M. C. R. Souza and A. L. I. Oliveira, Hybrid methods for fuzzy clustering based on fuzzy c-means and improved particle swarm optimization, Expert Syst. Appl. 42 (2015), 6315–6328. doi:10.1016/j.eswa.2015.04.032.

[19] M. Wan, L. Li, J. Xiao, C. Wang and Y. Yang, Data clustering using bacterial foraging optimization, J. Intell. Inf. Syst. 38 (2012), 321–341. doi:10.1007/s10844-011-0158-3.

[20] C. H. Wu, C. S. Ouyang, L. W. Chen and L. W. Lu, A new fuzzy clustering validity index with a median factor for centroid-based clustering, IEEE Trans. Fuzzy Syst. 23 (2015), 701–718. doi:10.1109/TFUZZ.2014.2322495.

[21] R. Xu and D. Wunsch, Survey of clustering algorithms, IEEE Trans. Neural Netw. 16 (2005), 645–677. doi:10.1109/TNN.2005.845141.

[22] X. Yin, S. Chen, E. Hu and D. Zhang, Semi-supervised clustering with metric learning: an adaptive kernel method, Pattern Recogn. 43 (2010), 1320–1333. doi:10.1016/j.patcog.2009.11.005.

[23] M. Yuwono, S. W. Su, B. D. Moulton and H. T. Nguyen, Data clustering using variants of rapid centroid estimation, IEEE Trans. Evol. Comput. 18 (2014), 366–377. doi:10.1109/TEVC.2013.2281545.

[24] C. Zhang, D. Ouyang and J. Ning, An artificial bee colony approach for clustering, Expert Syst. Appl. 37 (2010), 4761–4767. doi:10.1016/j.eswa.2009.11.003.

Received: 2016-8-27
Published Online: 2016-12-20
Published in Print: 2018-7-26

©2018 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
