IEEE Trans Fuzzy Syst. Author manuscript; available in PMC: 2015 Nov 27.

Accelerating Fuzzy-C Means Using an Estimated Subsample Size

Jonathon K. Parker (1), Lawrence O. Hall (1)
(1) Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620, USA

jkparker@mail.usf.edu, hall@csee.usf.edu

Roles

Jonathon K Parker:Student Member, IEEE
Lawrence O Hall:Fellow, IEEE

Issue date 2014 Oct.

PMCID: PMC4662382  NIHMSID: NIHMS672499  PMID:26617455
The publisher's version of this article is available at IEEE Trans Fuzzy Syst.

Abstract

Many algorithms designed to accelerate the Fuzzy c-Means (FCM) clustering algorithm randomly sample the data. Typically, no statistical method is used to estimate the subsample size, despite the impact subsample sizes have on speed and quality. This paper introduces two new accelerated algorithms, GOFCM and MSERFCM, that use a statistical method to estimate the subsample size. GOFCM, a variant of SPFCM, also leverages progressive sampling. MSERFCM, a variant of rseFCM, gains a speedup from improved initialization. A general, novel stopping criterion for accelerated clustering is introduced. The new algorithms are compared to FCM and four accelerated variants of FCM. GOFCM's speedup was 4-47 times that of FCM and faster than SPFCM on each of the six datasets used in experiments. For five of the datasets, partitions were within 1% of those of FCM. MSERFCM's speedup was 5-26 times that of FCM and produced partitions within 3% of those of FCM on all datasets. A unique dataset, consisting of plankton images, exposed the strengths and weaknesses of many of the algorithms tested. It is shown that the new stopping criterion is effective in speeding up algorithms such as SPFCM and the final partitions are very close to those of FCM.

Index Terms: fuzzy clustering, scalable, accelerated, sampling, progressive sampling, multinomial proportion, fuzzy c-means, fcm, spfcm, ofcm, effcm, rsefcm, gofcm, mserfcm, stopping criterion

I. Introduction

Clustering is the process of separating unlabeled objects into groups, where the members in each group have more similarity to each other than the members of other groups. Clustering is an important first step in data exploration and a primary approach for unsupervised learning. The K-means algorithm, also called the hard c-means algorithm (HCM) [1], and its variant fuzzy c-means (FCM) are widely used clustering algorithms [2].

One of the many difficulties with applying clustering algorithms is scalability. On large datasets, the length of time taken to cluster or partition the data can be prohibitive, taking so long that people will only cluster subsets. There are a number of FCM variants that accelerate the performance of FCM with varying degrees of tradeoff between speed and quality.

This paper introduces two new accelerated fuzzy clustering algorithms, geometric progressive fuzzy c-means (GOFCM) and minimum sample estimate random fuzzy c-means (MSERFCM). GOFCM has elements similar to single pass fuzzy c-means (SPFCM) [3] and progressive sampling [4]. MSERFCM has elements similar to rseFCM [5], [6] and psFCM [7].

Both new algorithms use a statistical method to estimate the subsample size. The subsample is the minimum size necessary to guarantee proportional representation from each different type of data (true cluster) within a specified range of error, at a specified confidence level. A subsample is used to quickly estimate cluster centers, which accelerates the base FCM algorithm by starting closer to a termination point. A novel stopping criterion was also developed that is shown to increase speed with minimal loss in quality.

Experiments showed GOFCM was 4 to 47 times faster than FCM and MSERFCM was 4 to 26 times faster than FCM. GOFCM consistently outperformed SPFCM with respect to speedup, but suffered a slight loss in fidelity to FCM when clustering with all the data. MSERFCM outperformed rseFCM with respect to speedup and quality in over two thirds of the experiments conducted. A total of six datasets were used; three of the datasets were also used in [5] and results are compared.

II. FUZZY C-means (FCM) based Algorithms

Fuzzy c-means (FCM) [2] and four additional FCM-based algorithms were selected and implemented in code to provide a baseline for comparison to the newly developed algorithms. Many FCM variants have been created [3], [5], [7]–[16].

FCM and its variants employ fuzzy sets as opposed to classical (i.e. crisp or hard) sets. FCM's advantage is that an object can have a grade of membership in multiple clusters. The degree to which an object k belongs to cluster i is u_ik. This relationship is subject to the following constraints [2]:

$u_{ik} \in [0,1], \quad 1 \le i \le c, \quad 1 \le k \le n$    (1)

$\sum_{i=1}^{c} u_{ik} = 1, \quad 1 \le k \le n$    (2)

$\sum_{k=1}^{n} u_{ik} > 0, \quad 1 \le i \le c$    (3)

where n is the number of data examples and c is the number of clusters.

The descriptions of FCM and its scalable variants follow in a slightly modified form from previous work [6].

A. Fuzzy c-means (FCM)

This algorithm minimizes an objective function that calculates the within-group sum of squared distances from each data example to each cluster center. FCM alternates between calculating cluster centers given the membership values of each data example and calculating the membership values given the cluster centers. If data examples are defined as feature vectors x_k in R^s, the objective function (J_m) is expressed as:

$J_m(U,V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} D_{ik}(x_k, v_i)$    (4)

The functions for determining membership values and cluster centers are:

$u_{ik} = \dfrac{D_{ik}(x_k, v_i)^{\frac{1}{1-m}}}{\sum_{j=1}^{c} D_{jk}(x_k, v_j)^{\frac{1}{1-m}}}$    (5)

$v_i = \dfrac{\sum_{j=1}^{n} (u_{ij})^m x_j}{\sum_{j=1}^{n} (u_{ij})^m}$    (6)

where

  • n: is the number of examples.

  • m > 1: is the 'fuzzifier'.

  • c: is the number of clusters.

  • U: is the membership matrix. u_ik refers to the membership value of the kth data example (x_k) for the ith cluster.

  • V: is the set of cluster centers. v_i is the ith cluster center.

  • D_ik(x_k, v_i): is the squared distance between the kth data example and the ith cluster center. Any inner product induced distance metric can be used. This research used Euclidean distance.

There are multiple ways to initialize and terminate the algorithm. Any valid set of values may be used to initialize the U or V matrices. The members of U (u_ik) may be initialized with a set of values adhering to (1) and (3) or each member of V (v_i) may be set to some position in R^s. Typically, cluster center (V) initialization is performed by randomly selecting c examples from the dataset. The algorithm terminates when the difference between the calculated matrix norms for successive membership matrices or for cluster center matrices does not exceed a user-provided parameter ε.
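To make the alternating optimization concrete, the following is a minimal Python/NumPy sketch of FCM under the choices described above (squared Euclidean distance, initialization by randomly selecting c examples, termination on the maximum change in V). The function name fcm and the optional warm start V_init are our own illustrative conventions, not part of the original implementation.

import numpy as np

def fcm(X, c, m=2.0, eps=1e-3, max_iter=300, V_init=None, rng=None):
    """Minimal FCM sketch: alternate the membership update (eq. 5) and center update (eq. 6)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # initialize V by randomly selecting c examples, or from a supplied warm start
    V = (X[rng.choice(n, size=c, replace=False)].astype(float)
         if V_init is None else np.asarray(V_init, dtype=float).copy())
    U = None
    for _ in range(max_iter):
        # squared Euclidean distances D_ik, kept away from zero to avoid division errors
        D = np.fmax(((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2), 1e-12)
        tmp = D ** (1.0 / (1.0 - m))                  # terms of eq. (5)
        U = tmp / tmp.sum(axis=1, keepdims=True)      # rows sum to 1, satisfying eq. (2)
        Um = U ** m
        V_new = (Um.T @ X) / Um.sum(axis=0)[:, None]  # eq. (6)
        if np.abs(V_new - V).max() < eps:             # terminate on the max change in V
            return V_new, U
        V = V_new
    return V, U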

Fuzzy c-means has been applied to many real world applications; some examples include magnetic resonance imaging (MRI) data [17] [16], DNA microarray data [18], and power quality management [19]. FCM is a generalization of HCM, and is equivalent to HCM if m = 1 [20]; it is also equivalent to Expectation Maximization (EM) [21]. These algorithms also have been successfully applied to many real world applications [22] [23] [24] [25]. Thus, one could apply FCM successfully to any domain that has had success with HCM or EM.

B. Single pass fuzzy c-means (SPFCM)

The SPFCM algorithm [3] sequentially processes the entire dataset. The data set is broken into equally sized "partial data accesses" (PDA). A user-provided parameter, "fractional PDA" (fPDA ≤ 0.5), is used to define the PDA size as fPDA × n where n equals the total number of data examples. Each PDA is processed by a weighted version of FCM, called Weighted FCM (WFCM). In the WFCM algorithm, each data example x_i has an associated weight w_i. The objective function and cluster center calculation are modified as follows [3], [16]:

$J_m^w(U,V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} w_k D_{ik}(x_k, v_i)$    (7)

$v_i = \dfrac{\sum_{j=1}^{n} w_j (u_{ij})^m x_j}{\sum_{j=1}^{n} w_j (u_{ij})^m}$    (8)

Data examples are initially given a weight of 1. After cluster centers V are calculated from the first PDA, the cluster centers are assigned weights using the following equation [16]:

$w_i = \sum_{j=1}^{n} u_{ij} w_j, \quad 1 \le i \le c$    (9)

These weighted cluster centers represent the partition information of the dataset from the first PDA. The c cluster centers are added as additional data examples to the second PDA, which is then processed by WFCM. The positions of the cluster centers calculated from the first PDA are used as the initial values for V in the second PDA. This is repeated until all PDAs are processed. The set of cluster centers from the final PDA are returned by SPFCM.

The SPFCM algorithm assumes the data in the dataset is randomly ordered. Datasets with some sort of inherent order in the data, which is typical in images, can result in subsets of data that are significantly different with respect to the overall distribution. Our implementation randomizes the data prior to processing.
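The following is a minimal sketch of how such a single pass could be organized, assuming a helper weighted_fcm(data, weights, c, V_init) that implements equations (7)-(9) and returns the cluster centers together with their weights; both the helper name and its signature are hypothetical.

import numpy as np

def spfcm(X, c, weighted_fcm, f_pda=0.01, rng=None):
    """SPFCM sketch: process equally sized PDAs, carrying weighted centers forward."""
    rng = np.random.default_rng(rng)
    X = X[rng.permutation(len(X))]          # randomize, since PDAs are taken sequentially
    pda_size = int(f_pda * len(X))
    V, w_prev = None, None
    for start in range(0, len(X), pda_size):
        chunk = X[start:start + pda_size]
        weights = np.ones(len(chunk))
        if V is not None:                   # add the c weighted centers from the previous PDA
            chunk = np.vstack([V, chunk])
            weights = np.concatenate([w_prev, weights])
        # weighted_fcm is assumed to return centers and the center weights from eq. (9)
        V, w_prev = weighted_fcm(chunk, weights, c, V_init=V)
    return V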

C. Online fuzzy c-means (OFCM)

The OFCM algorithm is similar to SPFCM, with one major difference [16]. The dataset is broken into PDAs in the same manner as SPFCM and each PDA is processed independently. Cluster centers from each PDA are created using FCM and their weights calculated using (9). Unlike SPFCM, the sets of weighted cluster centers are not added to the next PDA, but saved. After all PDAs are processed, the combined sets of weighted cluster centers are processed by WFCM as a single dataset and a final set of cluster centers returned.

A feature of OFCM is that the processing of a dataset can be separated over distance or time. This is similar in concept to parallel fuzzy clustering, reviewed in [26]. In these cases, the cluster initialization of each PDA would be performed locally by random selection. In our experiments, we used the cluster centers from the previous PDA as an initialization. While this matches the original implementation of the algorithm [27], a poor initialization will be produced by PDAs largely consisting of just one class. Another feature of OFCM is that the dataset is not assumed to be in random order. In our implementation, we did not randomize the datasets for OFCM.

D. Extensible fast fuzzy c-means (eFFCM)

This algorithm progressively increases the size of a sample of the dataset until a statistically significant sample, compared with the distribution of the full dataset, is obtained. Statistical significance is tested for with the Chi-square (χ²) statistic or Kullback-Leibler divergence. If the initial sample fails testing, additional data is progressively added to the sample and retested. This is repeated until a statistical test is passed [11], [14]. The size of this additional subsample is constant; the algorithm uses progression with an arithmetic schedule [4]. The final, statistically significant sample is then processed by FCM to obtain a set of cluster centers.

The use of the statistical tests implies that the distribution of the dataset is known. For most datasets the distribution must be calculated prior to running the algorithm. A successful implementation requires decisions concerning how to calculate the distribution, the statistical test to use, the rate of arithmetic progression and the termination criterion [13]. Details of the author's implementation can be found in [6].

E. Random sampling plus extension fuzzy c-means (rseFCM)

This algorithm obtains a random sample of the dataset and FCM is applied to the sample. Once cluster centers are returned from FCM, the cluster membership u_ik can be calculated for any example in the dataset.

This algorithm, initially developed in [5], was also reported in [6] as random fuzzy c-means (randFCM). The rseFCM algorithm provides a speedup by reducing the size of the dataset. In this work, the size of the random sample is equal to fPDA × n, where fPDA ≤ 0.5 is a parameter defined for SPFCM and OFCM and n equals the total number of data examples.
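A short sketch of this sample-then-extend idea follows, reusing the illustrative fcm routine from Section II-A; the extension step applies equation (5) to every example using the sample's cluster centers.

import numpy as np

def rse_fcm(X, c, f_pda=0.01, m=2.0, rng=None):
    """rseFCM sketch: cluster a random sample, then extend memberships to all of X."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=int(f_pda * len(X)), replace=False)
    V, _ = fcm(X[idx], c, m=m)                       # FCM (Section II-A) on the sample only
    # extension step: compute u_ik for every example via eq. (5)
    D = np.fmax(((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2), 1e-12)
    tmp = D ** (1.0 / (1.0 - m))
    U = tmp / tmp.sum(axis=1, keepdims=True)
    return V, U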

III. Accelerating FCM

A literal implementation of the FCM algorithm has an expected runtime complexity of O(nisc²) [3] where n is the number of data examples, c is the number of clusters, s is the dimension of the data and i is the number of iterations. It is possible to reduce the runtime to O(nisc) with an optimization proposed by Kolen and Hutcheson [10]. This and other minor optimizations were implemented. In some previous comparison work these optimizations were not implemented [6].

Techniques that accelerate FCM use the base FCM algorithm or WFCM as a step in a more complex algorithm. Thus, any optimization to the FCM algorithm reduces the runtime for all variants.

Given a dataset with c natural clusters, an FCM variant can be accelerated by reducing n, s or i. The SPFCM and OFCM algorithms have the same runtime complexity as FCM [3], [6] and process 100% of the original data. The initial PDAs processed by SPFCM and OFCM provide better cluster center estimates for the next PDAs. This process will continue with the initial cluster center points having higher weights at each step. Better, weighted, cluster center estimates reduce the number of iterations for subsequent PDAs to reach the termination criterion. In SPFCM, each of the n examples is involved in a smaller number of iterations in the subsample in which it is clustered than when the whole data set is clustered at once. If the average number of iterations i_avg for the subsamples is less than i for the full data, an example is used in calculations n × i_avg times (on average) for SPFCM versus n × i times for FCM, with i_avg < i. The time complexity analysis can be reviewed in Section 4.5 of [28]. The eFFCM and rseFCM algorithms provide acceleration by reducing n. In both cases a subsample less than the size of the dataset will be processed. The subsequent acceleration is more or less inversely proportional to the fraction of the data processed [5], [6], [11].

These effects were directly observed in earlier research by the authors [6]. Detailed logs were recorded for each algorithm's progress. These showed that for SPFCM the initial PDAs were responsible for the majority of the total iterations. After a certain point, PDA clustering terminated after one or two iterations, with little change in the estimated cluster centers.

The OFCM algorithm, despite its similarity to SPFCM, did not achieve the same acceleration. The logs suggest a major contributing factor. OFCM sequentially processes non-randomized data. In many datasets, the PDA has a non-proportional representation of clusters, which takes more iterations to terminate. This was demonstrated empirically by comparing the performance of OFCM on non-randomized vs. randomized datasets [6].

The assumption here is that OFCM needs fewer iterations when the data is more uniform in its distribution. A similar observation was made in previous work when comparing the performance of eFFCM and rseFCM. The eFFCM algorithm, which ensures the subsample passes a statistical test prior to processing, occasionally needed fewer iterations to terminate compared to rseFCM, despite processing a higher percentage of the data [6].

IV. Significant Work Related to Acceleration

The runtime for a hill-climbing algorithm, such as FCM, is reduced if the starting point is close to the final solution. This is due to a reduction in iterations before termination. Bradley and Fayyad [29] investigated the effects of an improved starting position for K-means. A better start position reduced the runtime, but their study was focused on quality, not speed.

Processing a small data sample to obtain an improved initial starting point for FCM has been partially investigated. Cheng, et al. describe an iterative process to develop a "good" starting point [8]. This method, multistage random sampling FCM (mrFCM), consists of two parts. The first part progressively samples the dataset, improving the starting clusters until a termination criterion is met. Then the starting clusters are used to initialize FCM on the full dataset.

Similarly, Altman processes a small sample of data examples with FCM to obtain a set of cluster centers. These are used to initialize the membership matrix U, prior to running FCM on the full dataset [9].

In partition simplification FCM (psFCM), Hung and Yang [7] partition the data using a k-d tree to obtain a simplified dataset, which in turn is used as a subsample to estimate the cluster centers. The resulting estimate is used to initialize FCM on the full dataset.

Provost presented an overview of the progressive sampling technique in the context of induction (aka classification) algorithms [4]. In progressive sampling, an initial subsample is used to form a classifier which is tested on labeled data. The size of the subsample is progressively increased arithmetically or geometrically, with a new classifier created each time the subsample grows in size. When the accuracy of the classifier ceases to improve significantly compared to the previous sample, the technique terminates.

Progressive sampling has been applied to clustering problems. Domingos and Hulten [30] used Hoeffding bounds in a progressive sampling technique to both estimate the initial sample size and estimate if the sample size at any point in the progression was sufficient. The technique was developed for K-means and depends on the fact that in K-means, data elements have membership in only one cluster.

Pal and Bezdek [11] and Wang et al. [15] used progressive sampling for the purpose of selecting a subsample that accurately represents the dataset. A divergence test was used to assess if the subsample matches the distribution of the dataset. If the test failed, progressively larger subsamples were taken until the test passed. Finally a clustering algorithm was run on the chosen subsample.

V. Estimating the Random Sample Size

The scalable clustering approaches discussed explicitly or implicitly sample the dataset to develop a starting point, but typically do not use a statistical method to determine the sample size. Domingos and Hulten's technique [30] estimates the initial sample size using Hoeffding bounds for K-means clustering, but does not directly generalize to FCM [14].

The SPFCM algorithm uses a parameter (fPDA) to define a static sample size [3]. The eFFCM algorithm uses a statistical test to validate a subsample, but not to determine the subsample size [11], [14]. The mrFCM algorithm and Altman's method also use a parameter to define the subsample size for cluster center initialization [8], [9]. The size of the subsample influences the performance of the algorithms [6], but the parameters in each case are determined empirically.

As suggested in [29], the subsample must proportionally represent each cluster to provide an improved starting point for a hill-climbing algorithm such as FCM. A known difficulty with using a subsample to generate starting points for a clustering algorithm occurs when the subsample fails to sufficiently represent all clusters [8], [29]. This results in skewed starting points. One solution to this problem is to ensure the subsample has proportional representation. This solution was suggested in [8] but not elaborated on.

Gu et al. studied the effects of an improper starting sample size using progressive sampling on supervised learning problems [31]. They implemented a divergence test on a subsample to ensure it represents the dataset distribution. Pal, Bezdek and Hathaway in [11] and [13] similarly test the subsample for the proportionality of the sample as a whole, but use the sample for calculating the clustering solution rather than estimating clusters as a starting point. Regardless, this sort of technique requires collecting information from the entire dataset to ensure proportionality. A larger subsample could be used for this purpose, but that leaves us with the same uncertainty about the validity of the size of the subsample.

Another approach is to select a probabilistically large enough subsample to sufficiently represent all clusters at a desired level of confidence. If one assumes that the clusters correspond to a set of currently unknown classes, selecting a subsample to sufficiently represent each cluster is analogous to selecting a subsample to estimate a multinomial proportion of classes. This is because if the sample provides an acceptable estimate of a proportion of classes, it will have proportional representation of the clusters in the data.

Thompson [32] came up with a technique to find the smallest sample size, λ, such that a random sample from a multinomial population will result in "class" proportions within a specified distance of the true population proportions with probability at least 1 − α. It was shown that the minimum sample size, λ, would be:

$\lambda = \max_{\mu} \dfrac{z^2 \frac{1}{\mu}\left(1 - \frac{1}{\mu}\right)}{d^2}$    (10)

where d is the maximum absolute difference from the true proportion that will be tolerated for any class. The value z is the upper (α/2μ) × 100th percentile of the standard normal distribution.

Thompson showed that μ, an integer, is the number of classes present in the population for which the calculated value of λ is a maximum. For α ≤ 0.10 (which we are interested in), the maximum values for λ occur when μ is between 2 and 3. As the clustering problems we are interested in have the number of classes c ≥ 3, accepting the maximum value for λ allows us to ignore the value of μ. For details, see [32].

Phoungphol and Zhang borrowed Thompson's definition for μ as part of a technique to estimate the sample size for HCM. They implemented a "hard" version of rseFCM where the sample size was estimated with their technique [33].

Solutions to Thompson's formula have been published in tabular form, which pairs desired significance levels (α) with a value for d²λ [32]. An example calculation follows. If the desired significance level is α = 0.05, this corresponds to a value of d²λ = 1.27359. If the desired maximum absolute difference is d = 0.02, the minimum sample size is λ = 1.27359 / 0.02² = 3184. Thus a sample size of 3184 is the minimum to ensure with a 95% probability (1 − α) that the maximum absolute difference in class representation is 0.02.

If this method is used to obtain samples for a clustering problem, the total number of classes present must be considered. Assume a full dataset, X, has 5 equally distributed classes. The true proportion, π, of each class, c, equals 0.20. Using the example above, with d = 0.02, a sample size of 3184 is calculated. At the desired significance level, α = 0.05, the method predicts with a 95% probability that the sample represents all clusters at the proportion p = 0.20 ± 0.02. This is a suitable proportion for many clustering problems.

If in the example above, X instead has 100 equally distributed classes, π = 0.01. The absolute difference is still d = 0.02, thus the tolerated difference exceeds the expected proportion of each class: p = 0.01 ± 0.02. In this case the average number of examples from each class will be 32 but will range from 0 to 96 (with a 95% probability). Thus d must be adjusted in order to ensure that each class is represented with enough examples to be clustered.
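The worked example above can be reproduced with a short script. The sketch below is only illustrative: it computes the worst case of equation (10) by maximizing over the number of classes μ and then divides by d²; with α = 0.05 it recovers Thompson's tabled value of approximately 1.27359 and a sample size of 3184 for d = 0.02.

from math import ceil
from scipy.stats import norm

def thompson_lambda(alpha=0.05, d=0.02, max_mu=50):
    """Worst-case minimum sample size from eq. (10), maximized over the number of classes mu."""
    v = 0.0
    for mu in range(2, max_mu + 1):
        z = norm.ppf(1.0 - alpha / (2.0 * mu))      # upper (alpha/2mu) x 100th percentile
        v = max(v, z * z * (1.0 / mu) * (1.0 - 1.0 / mu))
    return ceil(v / d ** 2)

print(thompson_lambda(alpha=0.05, d=0.02))          # 3184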

Assuming an equal distribution, each class will have a true proportion of π = 1/c. Thompson's formula assumes an absolute difference, d. At the level of significance desired, the expected proportion of each class in the sample is 1/c ± d. In the examples above, the value of d is kept fixed and the value of c is increased by a factor of 20. This causes the amount of absolute difference allowed in the sample to be greater than the true proportion of the classes: π << d.

This problem can be repaired by tying the absolute difference to c. Define a value, r, as the "relative difference". Set d = r/c. Now the proportion of each class in the sample is 1/c ± r/c, though rewriting it as (1 ± r)/c makes it clear why r is defined as the "relative difference".

Using the assumptions above, r/c can be substituted for d. Now the formula for the desired minimum sample size can be expressed as:

$d^2\lambda = v(\alpha), \quad \frac{r^2}{c^2}\lambda = v(\alpha), \quad \lambda = \dfrac{v(\alpha)\, c^2}{r^2}$    (11)

where v(α) is the calculated value (or from Thompson's published table) for a specified α value and the other variables are defined as above.

An example follows. Assume that the desired significance level is α = 0.05, which corresponds to v(α) = 1.27359, the number of clusters c = 5, and the desired relative difference r = 0.10. Using equation (11), the estimated sample size λ = 1.27359 × 5² / (0.1)² = 3184. Note that this is the same result from the example above.

Another example follows. Keeping the desired significance level at α = 0.05 and increasing the desired relative difference to r = 0.20, let us find the minimum sample size for c = 100. Using equation (11), the estimated sample size λ = 1.27359 × 100² / (0.2)² = 318,398.
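Both examples can be checked with the relative-difference form of the estimate. The snippet below is a minimal sketch of equation (11) with v(α) supplied directly (here the tabled value 1.27359 for α = 0.05); the function name is ours.

from math import ceil

def estimated_sample_size(c, r, v_alpha=1.27359):
    """Eq. (11): lambda = v(alpha) * c^2 / r^2, i.e. Thompson's bound with d replaced by r/c."""
    return ceil(v_alpha * c ** 2 / r ** 2)

print(estimated_sample_size(c=5, r=0.10))     # 3184
print(estimated_sample_size(c=100, r=0.20))   # 318398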

VI. Algorithms Based on Thompson's Method

Insight on how to best leverage Thompson's method for selecting a set of examples comes from understanding how related methods function. Recall that SPFCM functions by processing one PDA at a time. Analysis of the execution of SPFCM revealed that the number of iterations per PDA fell dramatically after the first few PDAs were processed, resulting in a lower average number of iterations per PDA compared with the total number of iterations for FCM on the full dataset. The majority of PDAs terminated after a single iteration and the changes in the cluster centers were very small. The SPFCM algorithm converged quickly to a solution close to FCM's [3], [6], [28].

The OFCM algorithm has significant similarity to SPFCM, but has poorer performance with regard to speedup and quality when run sequentially [6]. As mentioned above, one reason is that OFCM obtains PDAs by sequentially sampling the data rather than using random sampling. This makes it less likely that the PDA is proportional with respect to class. A PDA largely consisting of a single class will take longer to terminate [6]. While not required to, some implementations of OFCM pass on cluster center estimates from the previous PDA; in the case just described a poor initialization will extend the time before termination [29]. A second reason is that SPFCM stores partition data in the weighted cluster centers returned from each PDA. These in turn are added to the next PDA. The weighting of the cluster centers on average reduces the number of iterations per PDA before WFCM terminates. This was demonstrated in earlier work [6] by comparing OFCM and SPFCM with identical pre-randomized datasets and identical initializations for each PDA.

These insights and the availability of Thompson's formula led to the creation of two algorithms, geometric progressive fuzzy c-means (GOFCM) and minimum sample estimate random fuzzy c-means (MSERFCM). Both of these algorithms use Thompson's formula to estimate an initial sample size for an expected number of clusters. A base assumption, for clustering, is that a dataset processed by these algorithms has the expected number of clusters as reflected in the data attributes. If the data attributes do not provide any distinction between the clusters, the data will not have multinomial properties and Thompson's method is not valid.

A. The GOFCM Algorithm

The GOFCM algorithm is an improvement to SPFCM that leverages progressive sampling, Thompson's method (11) and a different stopping criterion. GOFCM operates like SPFCM except as follows. The initial subsample size is estimated using Thompson's method. The sizes of subsequent subsamples are calculated using a geometric schedule [4]. The calculated size of the subsample stops growing once it exceeds a user-provided value; once this occurs the subsample size is fixed to equal the limit. In our experimentation, the user-provided value was set to n × fPDA, which is the subsample size used by SPFCM (Section II).

As in SPFCM, each subsample is processed by WFCM. The information from previous subsamples is retained and compressed by weighting the cluster centers from each step of the progressive sampling. The stopping criterion, discussed in detail below, is based on the rate of change (slope σ) of cluster center positions in successive subsamples. The algorithm terminates when the slope rises above a user-defined value.

GOFCM has the same expected runtime complexity as SPFCM, O(nisc) (Section III). In practice, GOFCM will often have a shorter runtime than SPFCM due to a faster convergence that reduces i and the new stopping criterion that reduces n. See Algorithm 1 for a detailed description of GOFCM.

Algorithm 1.

GOFCM Algorithm

Provide dataset X with n examples. Select values for parameters: number of clusters c, fuzzifier m, termination criterion ε, geometric schedule factor a ≥ 1, maximum slope σ and maximum fraction of data to sample fPDA. Select the relative difference r and the significance level α for use with Thompson's method. Set t = 1.
Calculate the initial subsample size, n1, of dataset X using Thompson's method.
Create the initial subsample (x1) by randomly selecting n1 data examples from X without replacement.
Process x1 with WFCM.
Retain weighted clusters V1.
repeat
    t = t + 1
    Calculate new subsample size nt = a × nt−1
    if nt > n × fPDA then
        nt = n × fPDA
    end if
    Create subsample (xt) by randomly selecting nt data examples from X without replacement.
    Add the c weighted cluster centers Vt−1 to xt.
    Process xt with WFCM.
    Retain weighted clusters Vt.
    Calculate the change of cluster centers between subsamples, δ(Vt, Vt−1).
    Save ln(δ(Vt, Vt−1)) in a buffer.
    if t > 6 then
        Calculate the slope σt
    else
        σt = σ
    end if
until σt > σ
return Vt

One of the key principles of GOFCM is based on the observation that in SPFCM, after a certain amount of data is processed, the result does not improve in any appreciable way. This is similar to Provost's observation concerning induction algorithms in [4]. Thus, GOFCM may terminate early and without needing to process all the data.

The GOFCM algorithm follows a pattern similar to those in Gu et al. [31] and Provost [4] in that multiple subsamples are selected and processed by the base algorithm. GOFCM is also similar to methods that use estimation for a better set of starting cluster centers [7]–[9].

There are key differences between GOFCM and these similar methods. The first is that GOFCM uses Thompson's method to derive the initial subsample size. The second is that GOFCM reuses the information from each subsample. This is because the cluster centers obtained from a subsample are weighted, combined with the next subsample, and used as the starting cluster centers. These differences have the benefits of generating an initial cluster center estimate using the minimum amount of sampled data, representing all the data previously processed in weighted clusters and reducing the number of iterations for each subsample until termination [3].

In GOFCM, progressively larger subsamples are taken until the stopping criterion is met. The size of the subsamples is controlled by a parameter a ≥ 1, the geometric schedule factor. If a = 1, the subsample size remains constant and the algorithm is similar to SPFCM with a different stopping criterion (see below). As noted in [4], [34], the actual type and rate of scheduling is a tradeoff between cost (as measured in speedup) and benefit (as measured in fidelity to FCM).

The GOFCM algorithm is also similar to eFFCM and its variants in that it progressively samples the dataset while retaining the data already sampled. As discussed above, its method of "retaining" the data differs from eFFCM and mirrors SPFCM.

A difficult decision in the development of GOFCM was the design of a stopping criterion. Provost identifies detection of convergence, in the context of induction algorithms, as an important area of future research [4]. The same is true for clustering algorithms.

Unlike Provost's method for induction algorithms with labeled data, there is no objective criterion, such as model accuracy, with which to compare the quality of clustering algorithms. The typical alternative means of developing a stopping criterion is to identify when some metric associated with the algorithm fails to change more than some threshold.

The GOFCM stopping criterion is based on a comparison of the positions of the cluster centers (V's) in successive subsamples. This was studied with a large dataset (MRI017) known to cluster well with FCM and its variants (Figure 1). The mean distance between successive cluster centers was selected as the norm. While the difference between V's initially reduced as the amount of data increased, it did not converge to a particular value. Instead, the algorithm reached a steady state where there was significant variation of cluster center position between subsamples.

Figure 1. Changes in V for GOFCM and dataset MRI017.

A simple thought experiment reveals why this is so. Imagine that a subsample produces the ideal set of cluster centers, signifying an extremum for the objective function if all the data were present. Now another, possibly larger, subsample is drawn and the weighted cluster centers from the previous subsample are added. This is a subsample of the remaining data and is extremely likely to have a distribution that differs from the collection of the former subsamples. This will result in the cluster centers deviating between subsamples. The next subsample is drawn from the remaining data, which now has a distribution that differs slightly from the data already sampled.

The FCM algorithm seeks to minimize the objective function from the data that is present. As the distribution of the samples will always be slightly different, the cluster centers will always experience some random variation. Let us consider this random variation as noise.

Note in Figure 1 that the changes in cluster centers (δ(V)) between subsamples are asymptotic when noise is removed. The shape of this curve appears to be the inverse of the learning curves noted by Provost [4] and Meek [34], but the same challenge is present. At what point in the curve should the algorithm be stopped?

If the stopping criterion is defined to be when δ(V) falls below a user-defined value, the actual value of the change is strongly overwhelmed by the noise. In test experiments using this criterion, a large degree of variability in the final partitions was noticed.

Examination of the dataset in Figure 1 and other datasets showed that this metric generally obeys the Power Law after the first few subsamples. The legend in Figure 1 shows the best fit equation is y = 0.7324x^−0.845. In Figure 2, the logarithms of the x and y coordinates are plotted. Here the best fit equation is the straight line y = −0.845x − 0.3114. Note that ln(0.7324) = −0.3114.

Figure 2. Log of change in V for GOFCM and dataset MRI017.

Figure 2 provides a clear view of the noise generated by each subsample and suggests the stopping criterion selected for GOFCM.

After each subsample is processed, the logarithm of δ(V) is saved, and simple linear regression finds the best fit equation. Then the best fit equation is converted back to the original coordinates and the slope between the last two subsamples is found. The best fit line is of the form y = ax^−b, so the slope will have a range of (−∞, 0). If this slope rises above a user-defined value (σ), GOFCM terminates. Selection of σ is a tradeoff between speed and quality as measured by fidelity to FCM with the full dataset. A small value for σ provides more speed with less quality, while a large value of σ provides higher quality but less speed (more iterations). For experiments, the value of σ was determined empirically.
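As an illustration of this criterion, the following sketch fits the logged changes by least squares, converts the fit back to the original coordinates, and compares the slope between the last two subsamples to σ. Treating the subsample index as the x coordinate and requiring a minimum of six processed subsamples (as in Algorithm 1) are assumptions of this sketch, not necessarily the authors' exact implementation.

import numpy as np

def gofcm_should_stop(delta_v_history, sigma=-0.01, min_subsamples=6):
    """Fit ln(delta V) versus ln(subsample index), convert the power-law fit back to
    the original coordinates, and test the slope between the last two subsamples."""
    t = len(delta_v_history)
    if t <= min_subsamples:                      # guard against noisy early terminations
        return False
    x = np.log(np.arange(1, t + 1))
    y = np.log(np.asarray(delta_v_history, dtype=float))
    b, a = np.polyfit(x, y, 1)                   # ln(y) = b*ln(x) + a; b is expected to be negative
    fit = lambda xi: np.exp(a) * xi ** b         # back to original coordinates: y = e^a * x^b
    slope = fit(t) - fit(t - 1)                  # slope between the last two subsamples
    return slope > sigma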

For a new dataset, one could estimate a suitable value for σ by using GOFCM to cluster a small sample of the data using different values for σ and comparing the results to FCM on the same sample. As FCM scales linearly with n, the speedups obtained for different σ could be used to estimate runtimes for GOFCM on the full dataset.

Due to noise generated by the subsampling, the algorithm would occasionally terminate prematurely. In order to prevent this from occurring, GOFCM was not allowed to terminate until a minimum number of PDAs were processed. This is a similar concept to linear regression with local sampling (LRLS) used by Provost [4]. The minimum number of PDAs to process before calculating the slope is a value that could have been parameterized, but in our implementation the minimum number was set to a constant. We experimented with many values and the number 6 was the lowest minimum value that provided consistent results for the datasets tested.

The minimum number of PDAs can affect the runtime and quality of the algorithm. If set low, GOFCM has an early opportunity to terminate, possibly reducing the runtime. However, if the initial data examples do not accurately reflect the dataset, the quality of the results might be poor. If set high, this might increase the runtime, but also makes it more likely the sampled data examples accurately reflect the entire dataset, thus improving quality.

To see how the stopping criterion might work in another setting, we also applied it to SPFCM. We assumed that it would save time at the cost of leaving some of the data unused. Additional experiments were run to compare SPFCM modified with this stopping criterion (MODSPFCM) to SPFCM and GOFCM. The results are discussed in Section IX.

B. The MSERFCM Algorithm

Minimum Sample Estimate Random Fuzzy C-Means (MSERFCM) is designed as an improvement to rseFCM. It is also similar to methods that try to find a better set of starting cluster centers [7]–[9], but is much simpler.

The rseFCM algorithm uses c randomly selected data examples as initial cluster centers. In contrast, MSERFCM processes a subsample of the dataset to estimate initial cluster centers. This is the only major difference between rseFCM and MSERFCM unless one of the assumptions (see below) is violated.

For a dataset (X) of size n, a subsample size (n1) is estimated using Thompson's method. A subsample (x1) of size n1 is drawn without replacement from the dataset and is processed by FCM. The positions of the resultant cluster centers are retained. Then a second subsample (x2) is drawn from X. This second subsample size is specified by the user, which may reflect the amount of available RAM or some other practical concern. In our implementation, the specified (second) subsample size (n2) of x2 is defined as fPDA × n, which is the same subsample size used by rseFCM. The retained cluster center positions are used to initialize FCM for x2.

The MSERFCM algorithm assumes that the estimated subsample size, n1, is less than the specified subsample size, n2, and less than the dataset size, n: n1 < n2 < n. This assumption may not be correct, which can happen when Thompson's method estimates a large value for n1. In these cases, MSERFCM will be less efficient than rseFCM. To correct this, the following adjustments are made.

When n2 < n1 < n, the estimated subsample size exceeds the specified subsample size and MSERFCM degenerates to rseFCM with a subsample of size n1. When n2 < n < n1, the estimated subsample size exceeds the available data and MSERFCM degenerates to FCM and processes the entire dataset. In both of the latter cases, the subsample could exceed memory which would provide the actual limit. In experiments below this did not happen.

MSERFCM has an expected runtime complexity of O(n2·i2·s·c + n1·i1·s·c) due to two successive applications of FCM (Section III). The rseFCM algorithm has an expected runtime complexity of O(n2·i·s·c). In practice, MSERFCM will usually have a shorter runtime than rseFCM due to the improved set of starting clusters that reduces i2 enough to more than compensate for the additional O(n1·i1·s·c) time.
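A compact sketch of the two-stage idea follows, reusing the illustrative fcm and estimated_sample_size routines from the earlier sketches; the degenerate cases described above are handled before the warm-started second stage.

import numpy as np

def mserfcm(X, c, f_pda=0.01, m=2.0, r=0.10, rng=None):
    """MSERFCM sketch: estimate centers on a Thompson-sized sample, then run FCM on the
    user-specified sample with that warm start."""
    rng = np.random.default_rng(rng)
    n = len(X)
    n1 = estimated_sample_size(c, r)               # Thompson's estimate (eq. 11)
    n2 = int(f_pda * n)                            # user-specified sample size (as in rseFCM)
    if n1 >= n:                                    # estimate exceeds the data: degenerate to FCM
        return fcm(X, c, m=m)
    if n1 >= n2:                                   # estimate exceeds the budget: degenerate to rseFCM
        return fcm(X[rng.choice(n, n1, replace=False)], c, m=m)
    V0, _ = fcm(X[rng.choice(n, n1, replace=False)], c, m=m)
    x2 = X[rng.choice(n, n2, replace=False)]
    return fcm(x2, c, m=m, V_init=V0)              # warm-started second stage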

VII. Experiments

GOFCM and MSERFCM were compared in terms of speedup and quality to existing algorithms. The experiments applied FCM and the six accelerated variants described above to four large real-world data sets and two artificial data sets. Cluster centers (V) predicted by the algorithms can vary based on initialization. Each experiment consisted of 30 trials, each initialized by a random selection of c data examples as initial cluster centers. The final cluster centers, runtime, number of iterations and quality metrics were recorded for each trial. The reported experimental results are the averages across these 30 trials.

Our intent was to compare our new accelerated algorithms to the algorithm on which they were based, FCM, and other accelerated variants of FCM. Demonstrating a speedup with minimal differences to the base algorithm should be useful to the clustering community.

Earlier work by Havens et al. used the same three MRI datasets (see Section VII-B) and four of the same algorithms used in this research. An additional set of experiments, with identical parameters to those used in [5], was done to directly compare results. These experiments consisted of averages from 21 trials, to match what Havens et al. reported [5].

The three MRI data sets were also used to compare MODSPFCM to FCM, SPFCM and GOFCM.

A. Parameters

As described in Sections II and VI above, the algorithms tested have multiple parameters. The intent of the experiments was to explore accelerating algorithms. Thus for any given dataset, only the parameter affecting the sample size (fPDA) was varied; the other parameters were kept fixed. Experimental parameters are summarized in Table I. The series of experiments using MODSPFCM added two additional settings for the fPDA parameter. These settings, which are not listed in Table I, are fPDA in {0.02, 0.06}.

TABLE I.

Experiment Parameter Settings

Parameter                Value
trials                   30 (21)
m                        2.0 (1.7) (MRI); 1.7 (Art and PLK01)
termination criterion    max change in U (max change in V)
ε                        0.001
fPDA                     0.1, 0.0333333, 0.01, 0.00333333, 0.001
α (eFFCM)                0.200
dPDA (eFFCM)             20% of the value of fPDA
σ (GOFCM)                −0.01
α (GOFCM)                0.05
a (GOFCM)                2.0
r (GOFCM)                0.1

Parameter values modified to match [5] are in parentheses.

MRI, Art and PLK01 refer to particular datasets.

The value for m, the fuzzifier, is not consistent across the datasets. Initial experiments on the MRI datasets with m = 2.0 provided acceptable results, but this was not the case for the other datasets. Some tuning of m was necessary; setting it to a value of 1.7 vastly improved results with respect to runtime and improved fidelity to the cluster centers of the artificial and plankton datasets.

B. Data Sets

Table II contains the details about each of the six datasets used.

TABLE II.

Datasets

Dataset    examples (n)    attributes (s)    clusters (c)
MRI016     3,882,771       3                 3
MRI017     3,898,407       3                 3
MRI018     4,293,292       3                 3
PLK01      203,278         21                20
ART01      1,000,000       6                 5
ART02      1,000,000       10                7

Three datasets (MRI016, MRI017, MRI018) were magnetic resonance imaging (MRI) images of the normal human brain. The three features in these datasets are the intensities of the T1-weighted, T2-weighted and proton density weighted sequences. The values are integers ranging from 0 to 1951. These images were clustered into the three classes: cerebro-spinal fluid (CSF), gray matter (GM) and white matter (WM) [16].

The plankton dataset (PLK01) was a set of image features describing plankton collected by the Shadow Imaging Particle Profiler and Evaluation Recorder (SIPPER) imaging system [35]. This particular set of images was collected during three cruises in the Gulf of Mexico (2002, 2008 & 2010) and two in the Western Pacific (2007 & 2008) [36] [37]. The data was classified by experts at the College of Marine Science at the University of South Florida into a type of marine object (typically a class of plankton).

Initially, there were a total of 482,719 data examples with 88 attributes representing 168 classes. The number of examples in classes ranged from a single example (elongate phytoplankton: chaetoceros) to 99,414 (noise: air bubble). Attempting to cluster the entire dataset with c = 168 did not provide stable results. The data was stripped of examples that were noise, detritus and other small and non-homogeneous classes and randomized. The attributes were normalized, and feature selection was performed with Consistency Subset Evaluation and Linear Forward Selection [38] using the WEKA data mining tool [39]. These efforts left 203,278 data examples and 21 attributes. Analysis of the cleaned dataset showed 20 predominant classes. Details on the feature selection parameters, plankton attributes and classes are available at http://www.cse.usf.edu/~jkparker/plankton/.

Two datasets (ART01 and ART02) were artificial datasets. A simple program was written to construct these types of dataset. The program input consists of the locations of c cluster centers in R^s as well as the standard deviation and number of examples for each cluster center. Each data example deviates from its cluster center in each dimension by a random amount determined by a Gaussian distribution using the provided standard deviation. The generated examples were then output in random order.
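A generator of this kind is straightforward to sketch. The version below is illustrative only (it is not the authors' program), and the example centers, standard deviations and counts in the final call are hypothetical values chosen merely to show how it would be invoked.

import numpy as np

def make_artificial(centers, stddevs, counts, rng=None):
    """Generate Gaussian clusters around the given centers and output the examples in random order."""
    rng = np.random.default_rng(rng)
    chunks = [rng.normal(loc=ctr, scale=sd, size=(cnt, len(ctr)))
              for ctr, sd, cnt in zip(centers, stddevs, counts)]
    X = np.vstack(chunks)
    return X[rng.permutation(len(X))]

# hypothetical call: five clusters in R^6 with 200,000 examples each
X = make_artificial(centers=10.0 * np.eye(5, 6), stddevs=[1.0] * 5, counts=[200_000] * 5)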

The artificial datasets were designed to moderately challenge FCM and its variants. Figure 3 provides a 3-dimensional view of 3 attributes of ART01. When attributes missing from Figure 3 are considered, the actual Euclidean distance between the true cluster centers provides more separation than is straightforward to display. Despite the crowding, FCM did a reasonably accurate job of correctly predicting the cluster centers of ART01.

Figure 3. ART01 (3 of 6 attributes of 5 classes).

C. Metrics

Three metrics were calculated to assess the speed and quality of the algorithms and to serve as a basis of comparison between them.

1) Relative Speedup (SU)

This metric calculates the ratio between the runtimes of two algorithms. If t1 is the time for algorithm 1 and t2 is the time for the reference algorithm, the speedup (SU12) of algorithm 1 relative to algorithm 2 is:

$SU_{12} = \dfrac{t_2}{t_1}$    (12)

Thus if the runtime of algorithm 1 is 150 ms and algorithm 2 takes 750 ms, the speedup equals 5, i.e., five times as fast.

2) Adjusted Rand Index (ARI)

The ARI [40] returns a value of 1 when two partitions are in complete agreement, 0 when the partitions return the value expected by chance and a negative value when the partitions are in greater disagreement than would be expected from chance.

The ARI is typically used to compare a clustering method's results to the actual class membership of the data examples. In our research, ARI is used to compare the results of an accelerated FCM variant to the base FCM algorithm. The ARI also assumes that the clustering is discrete, i.e., hard [40], [41]. Before calculating the ARI for this research, partitions were hardened by assigning the membership of an example to the cluster for which it had the highest membership value [5] and corresponding clusters from the variant and base algorithms were identified using the Hungarian method [42].

3) Cluster Membership Change (CC)

This metric compares the cluster assignments of two soft clustering algorithms on the same dataset. Small changes in cluster memberships indicate two partitions are very close. If one is the baseline, a small amount of cluster change indicates a high quality partition when compared to the baseline. The CC% is used as a relative quality measure for comparing an accelerated algorithm to an FCM partition which is typically an average over t (30 here) trials.

The cluster assignments from each soft algorithm are hardened by setting the highest membership value in each column of U to equal 1. All other values are set to 0. If the cluster assignment for column i is not identical for both algorithms, the indicator variable δi is set to 1. Otherwise this value is set to 0. The CC value is then calculated as follows [43]:

$CC\% = \dfrac{\sum_{i=1}^{n} \delta_i}{n} \times 100$    (13)

This metric uses the Hungarian method [42] to identify corresponding clusters across trials and experiments.
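A sketch of one way to compute this metric is given below. It assumes the membership matrices are passed as n × c arrays (examples in rows, so hardening is done per row rather than per column of U), and it aligns the two hardened partitions with the Hungarian method via scipy.optimize.linear_sum_assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment

def cc_percent(U_a, U_b):
    """Cluster membership change (eq. 13): harden both partitions, align clusters with the
    Hungarian method, and report the percentage of examples whose assignment differs."""
    a = U_a.argmax(axis=1)                             # harden: highest membership wins
    b = U_b.argmax(axis=1)
    c = max(U_a.shape[1], U_b.shape[1])
    contingency = np.zeros((c, c), dtype=int)          # co-occurrence counts of hardened labels
    np.add.at(contingency, (a, b), 1)
    rows, cols = linear_sum_assignment(-contingency)   # maximize agreement
    relabel = dict(zip(cols, rows))                    # map the second partition's labels onto the first
    b_aligned = np.array([relabel[label] for label in b])
    return 100.0 * np.mean(a != b_aligned)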

VIII. Results

All algorithms except for FCM and OFCM assume that the data is in random order or the implementation performs random sampling. Here, the entire dataset is randomized prior to processing each trial (excepting FCM and OFCM). In addition to recording the runtime of the algorithm, the software implementation separately recorded the time to randomly sample the data and recorded the time to perform I/O. Unless otherwise noted, reported times and speedup comparisons below include the runtime for randomization and I/O.

It was observed that the results for all MRI datasets were similar, so results for the average of the three MRI datasets are presented in Tables III and IV and Figure 4. While no ground truth existed for the MRI images, the differences in cluster assignment by all methods studied (as measured by CC% with an FCM partition obtained from the average cluster centers over 30 trials as the baseline) never exceeded 1%. This is far more consistent than human experts, whose assignments have been observed to differ by 16% or more [44] on MRI images of the brain. FCM and its variants have successfully been used in image segmentation [45] [46]. It could be argued that FCM and its variants should be used in preference to human experts for consistency.

TABLE III.

MRI Speedup

fPDA       0.100   0.033   0.010   0.003   0.001
OFCM       1.39    1.55    1.32    1.04    1.05
SPFCM      3.01    3.70    4.12    4.16    4.17
eFFCM      2.08    2.12    1.93    1.55    2.51
GOFCM      4.46    6.56    8.60    9.11    9.51
MSERFCM    6.59    8.27    9.54    9.39    9.53
rseFCM     4.79    7.25    9.14    9.36    9.61

Bold type indicates fastest speedup for each fPDA.

TABLE IV.

MRI Speedup (Ignoring Randomization and I/O)

fPDA       0.100   0.033   0.010    0.003    0.001
OFCM       1.40    1.56    1.32     1.04     1.05
SPFCM      4.43    6.25    7.36     7.74     7.86
eFFCM      3.36    4.22    5.65     9.25     227.98
GOFCM      8.42    20.63   53.55    122.97   301.24
MSERFCM    19.15   50.94   119.70   313.54   599.68
rseFCM     9.13    28.24   85.79    241.59   575.05

Bold type indicates fastest speedup for each fPDA.

Figure 4. For these data sets MSERFCM typically had the highest speedup, rseFCM was often second and GOFCM was faster than all others.

Despite the difference in dimension and number of clusters, the results for ART01 and ART02 were similar. Thus, only results from ART01 are presented (Table V and Figure 5). The fidelity to FCM as measured by CC% was very good. The speed of the algorithms was in the same order as for the MRI datasets.

TABLE V.

Art01 Speedup

fPDA       0.100   0.033   0.010   0.003   0.001
OFCM       2.76    2.85    2.74    2.46    2.23
SPFCM      3.61    5.00    5.74    5.94    5.91
eFFCM      3.91    5.52    7.08    7.65    12.46
GOFCM      9.83    15.63   21.26   24.00   26.78
MSERFCM    12.50   20.21   25.02   25.32   26.16
rseFCM     6.87    14.88   23.35   26.07   27.64

Bold type indicates fastest speedup for each fPDA.

Figure 5.

The results for dataset PLK01 are shown in Table VI and Figure 6. The speedup was greatest for rseFCM, followed by GOFCM. For small sample sizes the fidelity of all algorithms to FCM was poor. For larger sample sizes, the fidelity was fair, with rseFCM and MSERFCM having the most fidelity, followed by GOFCM and SPFCM.

TABLE VI.

Plankton Speedup

fPDA       0.100   0.033   0.010   0.003   0.001
OFCM       2.84    2.75    1.75    1.47    1.11
SPFCM      6.93    13.58   18.77   21.73   22.83
eFFCM      2.74    4.08    5.45    7.29    6.28
GOFCM      8.19    18.29   33.04   41.12   47.95
MSERFCM    4.00    4.03    4.04    4.04    4.04
rseFCM     8.52    21.55   38.39   46.48   49.57

Bold type indicates fastest speedup for each fPDA.

Figure 6.

The quality of GOFCM and MSERFCM, as measured by CC%, is very high. The CC% is less than 0.2% in all experiments, except for those using the PLK01 dataset. The cluster centers returned by FCM itself will vary by a small percentage over 30 initializations, making such a small CC% nearly insignificant.

Assuming the data is pre-randomized increases the speedup considerably. Compare Tables III and IV to see the speedup difference with the MRI datasets. These two results can serve as an upper and lower bound for runtime respectively for FCM and its variants.

The GOFCM algorithm provides a consistent speedup over the base FCM algorithm. Depending on the size of the PDA and dataset, the speedup ranged from roughly 4 to 47 times (Table III). Designed as an improvement over SPFCM, it also provides a consistent speedup over SPFCM. On the MRI datasets, the speedup was an average of 2 times faster than SPFCM. On ART01 and ART02, the speedup ranged from 3 to 5 times. GOFCM was also consistently faster than SPFCM on PLK01.

If the time taken for randomization and I/O is not considered, GOFCM provides a much greater speedup. The speedup on the MRI datasets ranged from 8 to 300 times compared to FCM and a speedup of 2 to 40 times compared to SPFCM (Table IV). Speedups on ART01, ART02 and PLK01 were even greater, ranging from 10 to over 700 times compared to FCM. See http://www.cse.usf.edu/~jkparker/gofcm for additional results.

GOFCM's speedup, however, came with a loss in quality when compared to FCM. Looking at the CC% metric, GOFCM had a small loss in quality which ranged from 0.04% to 0.12% on the MRI datasets. SPFCM's quality loss over the same datasets ranged from 0.04% to 0.06%. On PLK01, GOFCM had a CC% loss in quality which ranged from 12% to 46%, compared to 11% to 48% for SPFCM.

The ARI metric was specifically calculated for comparison to work by Havens (Section VIII-A), but recorded for all experiments. GOFCM had a consistent but small loss (< 1%) in quality as measured by ARI when compared to SPFCM. The GOFCM algorithm's quality was close to or lower than SPFCM's for all datasets. However, the small difference in quality is generally going to result in partitions that are functionally the same.

MSERFCM, designed as an improvement to rseFCM, has slightly superior performance with respect to speed and quality on the MRI datasets. MSERFCM was faster than rseFCM in 80% of the experiments. With respect to quality, MSERFCM was equal to or better than rseFCM on 73% and 67% of the experiments examining the CC% and ARI metrics respectively. For all differences of speed and quality, both algorithms were very close - on a few occasions differing by only 0.0001% or less.

On ART01 and ART02, MSERFCM was faster than rseFCM on 60% of the experiments. With respect to quality, MSERFCM was equal to or better than rseFCM on 80% and 90% of the experiments for the CC% and ARI metrics respectively. Again, the results were extremely close in all experiments. Despite the differences between ART01 and ART02 with respect to number of clusters and dimensions, the differences between them in relative speedup and result quality across all metrics were negligible.

When randomization and I/O are not considered, MSERFCM was faster than rseFCM on 73% of the MRI experiments and 80% of the experiments with artificial data. Cases where rseFCM was faster than MSERFCM usually occurred when the PDA was small (fPDA = 0.001 or 0.00333333). When this occurred the difference in speed was not trivial and there was a noticeable quality loss for rseFCM.

Results were different with the plankton dataset. Here, rseFCM consistently outperformed MSERFCM with respect to speed, while MSERFCM consistently outperformed rseFCM with respect to quality. In fact, of all six FCM variants, MSERFCM had the best quality over all metrics and all experiments with a generally consistent speedup of 4 times over FCM. The eFFCM algorithm was the only consistent competitor to MSERFCM with respect to quality, with a speedup that ranged from 2.7 to 7.3 times over FCM. An explanation of the very different performance on the plankton data is given in Section IX below.

A. Comparison to work by Havens et al

In "Fuzzy c-Means Algorithms for Very Large Data", experiments were run with rseFCM, SPFCM and OFCM using the same MRI datasets as in our research [5]. Their experiments differed in several ways. Only 21 trials were performed, the fuzzifier was set to 1.7, and the termination criterion was changed to use the maximum change in V. Havens reports results for SU and ARI; these were rounded to the nearest whole number and two decimal digits respectively.

The algorithm implementation was done in MATLAB rather than a Linux/C implementation as in our work. Havens pre-randomized the files, and did not count that step in the algorithm execution time, but did consider sampling and I/O time [47]. Our experiments included the time to randomize the files and perform I/O in the algorithm execution time. It was not possible to make the runtime results perfectly comparable, as our reported results include more overhead. As a consideration to make the experimentation as close as possible, our code was modified to pre-randomize files for OFCM. Results from Havens' experiments and from our experiments are presented in a format as identical as possible to that presented by Havens [5]. Comparison results and additional results from GOFCM and MSERFCM are listed in Table VII.

TABLE VII.

Comparison with Havens' Results

(a) fPDA = 0.001

Dataset      MRI016                MRI017                MRI018
Researcher   Havens     Parker     Havens     Parker     Havens     Parker
Metric       SU   ARI   SU   ARI   SU   ARI   SU   ARI   SU   ARI   SU   ARI
OFCM         2    0.78  2    1.00  2    0.78  1    1.00  2    0.85  2    1.00
SPFCM        16   0.96  8    1.00  14   0.97  7    1.00  12   1.00  7    1.00
GOFCM        -    -     25   1.00  -    -     20   1.00  -    -     21   0.99
rseFCM       29   0.97  26   0.99  25   0.97  21   0.99  22   0.97  22   0.99
MSERFCM      -    -     26   0.99  -    -     21   0.99  -    -     23   1.00

(b) fPDA = 0.01

Dataset      MRI016                MRI017                MRI018
Researcher   Havens     Parker     Havens     Parker     Havens     Parker
Metric       SU   ARI   SU   ARI   SU   ARI   SU   ARI   SU   ARI   SU   ARI
OFCM         2    0.93  2    1.00  2    0.78  2    1.00  2    1.00  2    1.00
SPFCM        13   0.96  7    1.00  11   0.98  6    1.00  10   1.00  6    1.00
GOFCM        -    -     18   1.00  -    -     15   1.00  -    -     16   0.99
rseFCM       24   1.00  22   1.00  21   0.99  18   1.00  18   1.00  19   0.99
MSERFCM      -    -     24   1.00  -    -     20   1.00  -    -     20   0.99

(c) fPDA = 0.1

Dataset      MRI016                MRI017                MRI018
Researcher   Havens     Parker     Havens     Parker     Havens     Parker
Metric       SU   ARI   SU   ARI   SU   ARI   SU   ARI   SU   ARI   SU   ARI
OFCM         3    1.00  2    1.00  2    1.00  2    1.00  2    1.00  2    1.00
SPFCM        7    0.96  4    1.00  6    0.97  3    1.00  5    1.00  4    1.00
GOFCM        -    -     6    1.00  -    -     5    1.00  -    -     5    1.00
rseFCM       8    1.00  8    1.00  8    1.00  7    1.00  6    1.00  7    0.99
MSERFCM      -    -     10   1.00  -    -     9    1.00  -    -     9    0.99

For the same sets of parameters, the OFCM and rseFCM results are fairly similar for both groups. Judging from the speedup for SPFCM, however, there appears to be a significant difference in the implementations. The authors suspect it is due to the additional time taken to perform randomization. Regardless, the algorithms have the same order with respect to speed (slowest to fastest): OFCM, SPFCM, rseFCM. This suggests that a MATLAB implementation of GOFCM and MSERFCM similar to that in [5] would show a higher speedup for GOFCM than for SPFCM and, on average, a higher speedup for MSERFCM than for rseFCM.

B. Comparison of MODSPFCM to SPFCM and GOFCM

The speedup of MODSPFCM (Table VIII) and the quality as measured by CC% (Figure 7), when applied to the MRI datasets, fall between those of SPFCM and GOFCM. In the case of speedup, MODSPFCM approaches that of GOFCM as the sample rate, as measured by fPDA, becomes smaller. In the case of quality, GOFCM maintains a slight edge over MODSPFCM at all but the lowest sample rates. In addition to the five experiments with the fPDA values listed in Table I, two experiments were run with fPDA = 0.02 and fPDA = 0.06 to add evidence to the observed trends.

TABLE VIII.

MODSPFCM MRI Speedup

fPDA        0.100   0.060   0.033   0.020   0.010   0.003   0.001
SPFCM        3.01    3.38    3.70    3.85    4.12    4.24    4.13
MODSPFCM     3.40    4.51    5.88    6.84    8.26    9.41    9.41
GOFCM       *4.40   *5.44   *6.65   *7.32   *8.58   *9.52   *9.44

* marks the fastest speedup for each fPDA (shown in bold type in the original table).

Figure 7. Cluster center change comparison of MODSPFCM with SPFCM and GOFCM.

IX. Discussion

A. GOFCM vs. Related Methods

Why is GOFCM faster than SPFCM and other related methods? There are two main reasons: the estimated subsample size and the stopping criterion. Recall that the runtime complexity of GOFCM is affected by n, i, s and c. If we are comparing the performance of two algorithms on the same dataset, assuming s and c to be constant makes the comparative runtime complexity O(ni). At the beginning of the GOFCM algorithm, n is the (presumably small) estimated subsample size, but initialization of the cluster centers is random. Thus the number of iterations, i, is usually large. At the second subsample there are more data examples, but the cluster center initialization is improved, which requires fewer iterations. GOFCM is accelerated because when n is small, i is large, and as n increases, i decreases. This keeps the runtime more consistent across the subsamples processed, a demonstration of how subsample size and cluster center initialization impact speed.

The second reason GOFCM is faster is its stopping criterion. It stops processing data when the predicted cluster centers no longer show a high degree of change, whereas related methods process all available data.
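To make the interaction of these two mechanisms concrete, the loop below is a deliberately simplified sketch of progressive sampling with warm-started initialization and an early-stopping check. It is not the GOFCM algorithm itself (GOFCM is a variant of SPFCM, and its stopping criterion is more elaborate than a single threshold); the fcm routine, the schedule factor and the threshold sigma are stand-ins used only for illustration.

    import numpy as np

    def progressive_fcm(data, c, n0, f_pda, sigma, fcm, factor=2.0, rng=None):
        """Grow the subsample geometrically from the estimated starting size n0 up
        to the cap f_pda * len(data), warm-start each FCM run from the previous
        centers, and stop early once successive center estimates move less than sigma."""
        rng = rng if rng is not None else np.random.default_rng()
        n_max = int(f_pda * len(data))
        n, centers = n0, None
        while True:
            idx = rng.choice(len(data), size=min(n, n_max), replace=False)
            new_centers = fcm(data[idx], c, init=centers)  # warm start after the first pass
            if centers is not None and np.max(np.linalg.norm(new_centers - centers, axis=1)) < sigma:
                break  # centers have stabilized: stop early
            if n >= n_max:
                break  # the schedule is capped by fPDA
            centers, n = new_centers, int(n * factor)
        return new_centers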

There is a tradeoff between speed and quality. The effect of this tradeoff is evident when GOFCM and eFFCM are compared. The eFFCM algorithm often had better quality than GOFCM, but much lower speedups (Table III, Fig. 4). The GOFCM algorithm selects a starting subsample aiming to have the number of examples from each cluster within the specified range. This does not guarantee that the range of attribute values in each cluster is proportionally represented in the subsample. This is one difference between GOFCM and techniques that perform a divergence test on subsamples against the sample distribution [11], [15].

Performing a divergence test on a subsample is time-consuming. The entire sample must be analyzed for ranges of values, bins must be selected, and the subsample's values must be assigned to bins. The technique is also subjective, as there is no optimal way to select the bins or parameters. Analysis of an implementation of eFFCM found that 5%-42% of the dataset was sampled before a Chi-squared test was passed; those tests were performed on relatively simple datasets [11]. The authors' implementation found that 0.2%-34.6% of the test datasets were sampled before the Chi-squared test was passed.
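For concreteness, the following is a hedged sketch of the kind of Chi-squared representativeness check described above, applied to a single feature; the bin count and significance level are illustrative choices, not the settings used by eFFCM.

    import numpy as np
    from scipy.stats import chisquare

    def subsample_passes_chi2(full_feature, sub_feature, n_bins=20, alpha=0.05):
        """True if the subsample's bin counts are statistically indistinguishable
        from the full data for this feature."""
        edges = np.histogram_bin_edges(full_feature, bins=n_bins)  # bins chosen from the full data
        full_counts, _ = np.histogram(full_feature, bins=edges)
        sub_counts, _ = np.histogram(sub_feature, bins=edges)
        expected = full_counts / full_counts.sum() * sub_counts.sum()
        keep = expected > 0  # ignore bins the full data never populates
        _, p_value = chisquare(sub_counts[keep], expected[keep])
        return p_value > alpha

A scheme of this kind draws additional data and repeats the test until it passes, which is the step that consumed between 0.2% and 42% of the datasets in the figures quoted above.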

In contrast, GOFCM determines a starting subsample size via a lookup table and a simple equation. This step, while less precise, is much faster.
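The equation is based on Thompson's bound on the sample size needed to estimate all multinomial proportions simultaneously [32]. The sketch below follows that published bound and is not a reproduction of Equation 11; the tolerance d is an assumed input, and the two example values are illustrative (they happen to give sizes close to the 1147 and 3184 figures quoted later in this section).

    from math import ceil
    from statistics import NormalDist

    def thompson_sample_size(d, alpha=0.05, max_m=100):
        """Worst-case sample size so that every category proportion is within +/- d
        of its true value with probability 1 - alpha (Thompson, 1987).  For
        alpha = 0.05 the maximum occurs at m = 3 categories, giving d*d*n ~= 1.27."""
        z = NormalDist().inv_cdf
        worst = max(z(1 - alpha / (2 * m)) ** 2 * (1 / m) * (1 - 1 / m)
                    for m in range(2, max_m + 1))
        return ceil(worst / d ** 2)

    print(thompson_sample_size(d=1 / 30))  # ~1147 (illustrative tolerance)
    print(thompson_sample_size(d=0.02))    # ~3184 (illustrative tolerance)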

GOFCM's quality is controlled by the stopping criterion parameter (σ). The single setting for σ used in the experiments gave GOFCM a consistent speedup over SPFCM with a consistent loss in quality. While not explored, a stricter setting for σ should result in a smaller speedup and higher quality; the converse would also apply.

Another limitation on GOFCM's quality arises when the dataset requires a larger subsample than that allowed by the maximum subsample size (fPDA × n). In these cases, GOFCM is forced to predict cluster centers with a suboptimal sample. This is clearly demonstrated in Section IX-D below.

The experiments with MODSPFCM revealed the role the stopping criterion plays in GOFCM's speedup. MODSPFCM can be considered SPFCM with GOFCM's stopping criterion, or GOFCM without progressive sampling. Table VIII shows how MODSPFCM's speedup falls between that of SPFCM and GOFCM. When the fPDA is comparatively large (0.1, 0.06, 0.033333), the advantage of GOFCM's subsampling is clearly shown. As the fPDA becomes smaller, the speedups of GOFCM and MODSPFCM approach the same value. The example below explains why this is so.

Assume an experiment with GOFCM and MODSPFCM where fPDA = 0.001. On the MRI datasets, using Equation 11 with the parameters from Table I, the initial subsample size for GOFCM is about 1100 data examples. The MRI datasets have roughly 4 × 10^6 data examples each; when fPDA = 0.001 the initial subsample size for MODSPFCM is about 4000 data examples. Recall that the geometric scheduling parameter for GOFCM is set to 2.0 and that, in our experiments, the maximum subsample size is set by fPDA. Thus by the third PDA the scheduled subsample size exceeds the maximum, and the PDA sizes for both algorithms are the same. In this case, GOFCM only has the advantage of using a smaller n for the first two PDAs, after which both algorithms process the same amount of data with the same stopping criterion. Whenever the maximum subsample size falls below the size calculated by Equation 11, GOFCM “degenerates” to MODSPFCM.
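The capped geometric schedule from this example can be written out directly; the sizes below are the approximate values quoted above and are illustrative only.

    # Geometric subsample schedule for the fPDA = 0.001 MRI example above.
    n_full, f_pda, n0, factor = 4_000_000, 0.001, 1100, 2.0
    cap = int(f_pda * n_full)  # about 4000 examples per PDA
    schedule, n = [], n0
    while n < cap:
        schedule.append(n)
        n = int(n * factor)
    schedule.append(cap)
    print(schedule)  # [1100, 2200, 4000]: from the third PDA on, GOFCM and MODSPFCM coincide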

B. Artificial Datasets and OFCM

As mentioned in Section VIII, experiments with the artificial datasets (ART01, ART02) had very similar results, so only results for ART01 are displayed (Table V, Fig. 5). The results for GOFCM and MSERFCM on these datasets are not radically different from those for the MRI datasets. The only surprise is the performance of OFCM.

The speedup of OFCM on ART01 (Table V) is consistently the lowest, as it is on the MRI datasets (Table III). The quality of OFCM, as measured by CC%, is radically different on the artificial datasets (Fig. 5) compared with the MRI datasets (Fig. 4). In fact, OFCM has the highest quality of any algorithm on ART01.

Recall that OFCM processes data in order; the algorithm does not randomly sample the data. The difference in quality in this case is due to the fact that the artificial datasets were randomized prior to experimentation. Naturally occurring datasets, such as an MRI scan, do not have a random ordering with respect to cluster membership.

Pre-randomizing the artificial data made it more likely that each sample processed by OFCM was proportionally representative. In the worst case, with fPDA = 0.001, the subsample size for ART01 is 1000. Using the formula from [32], a sample size of 1000 corresponds to a maximum absolute difference of about 0.03 when α = 0.05. It so happens that the 5 true clusters in ART01 each account for 20% of the total. Thus a sample size of 1000 will result in each true cluster making up 17-23% of the total (with 95% probability).
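Inverting the worst-case relation behind the Thompson sketch given earlier (d^2 * n is roughly 1.27 at alpha = 0.05) recovers this tolerance from the sample size; the constant comes from [32] and is not a value stated in this section.

    from math import sqrt

    n = 1000
    d = sqrt(1.27359 / n)  # worst-case tolerance at alpha = 0.05
    print(round(d, 3))     # ~0.036, consistent with the roughly 3-point band (17-23%) above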

The 5 cluster centers calculated from each PDA do a fairly reliable job of representing the 1000 data examples, with very little skewing due to the “uniform effect” [48]. In the last step of OFCM, the collection of weighted cluster centers is processed by WFCM. The resultant cluster centers, as measured by CC%, had the best fidelity to those of the full FCM algorithm.

The difference in speedup of OFCM on non-randomized versus pre-randomized data was studied in [6]. The speedup of OFCM on the MRI datasets was reported to improve from 48% to 76% of SPFCM's speedup when the dataset was pre-randomized.

This suggests that both the speed [6] and the quality of OFCM can be improved by an efficient method of randomly sampling the data.
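One standard way to obtain a uniformly random sample in a single pass over data too large for memory is reservoir sampling; the sketch below is a generic illustration and is not the method used in this work or in [6].

    import random

    def reservoir_sample(stream, k, rng=None):
        """Return k items chosen uniformly at random from an iterable of unknown length."""
        rng = rng or random.Random()
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = rng.randint(0, i)  # later items replace a slot with probability k / (i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir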

C. MSERFCM

MSERFCM, designed as an improvement to rseFCM, was the fastest or second-fastest algorithm in all experiments on the MRI and artificial datasets (Tables III, IV, V).

The occasions when MSERFCM's speedup was less than rseFCM's were those where the subsample size was small. This occurred with the MRI datasets and fPDA = 0.001. MSERFCM initially used a sample of 1147 examples (Eq. 11 with c = 3) to calculate the starting cluster centers, then executed FCM on a subsequent subsample of about 4000 examples. So in these cases MSERFCM processed over 5000 examples, compared with about 4000 for rseFCM. The speedup gained from a better set of starting cluster centers was not enough to compensate for the increased amount of data processed.

When ART01 was processed with the parameter fPDA = 0.001, the estimated sample size to predict cluster centers was 3184. As this exceeded the subsample size, the MSERFCM algorithm degenerated to rseFCM with a subsample size of 3184. This explains the faster performance of rseFCM, which used a subsample size of 1000.

With respect to quality, MSERFCM was a better choice than rseFCM the majority of the time. This is expected in situations where MSERFCM processed a larger subsample than rseFCM. In the example above, MSERFCM processed a random subsample three times larger, making it far more likely that MSERFCM would have more fidelity to FCM. This situation occurred in 6 of the 25 experiments using the MRI and artificial datasets. For the remaining 19 experiments, one would expect no difference in quality: while MSERFCM draws a separate subsample to calculate the set of starting cluster centers, both MSERFCM and rseFCM use subsamples of exactly the same size to calculate the output set of cluster centers as they proceed.

This, however, was not the case. Of the remaining 12 MRI experiments (of 30 trials each), ignoring ties, MSERFCM had better average quality than rseFCM in 69% and 63% of the experiments for the CC% and ARI metrics, respectively. Of the remaining 7 artificial dataset experiments, ignoring ties, MSERFCM had better quality than rseFCM in 60% and 67% of the experiments for the CC% and ARI metrics, respectively.

FCM is guaranteed to converge to a local (or global) minimum or saddle point [49]. One possibility is that the calculated starting cluster centers for MSERFCM allow the discovery of better local minima than the randomly determined starting cluster centers for rseFCM. This explanation is supported by the results of related experiments involving K-means [29].

D. Plankton Dataset Challenges

The use of the plankton dataset (PLK01) exposes both the limitations and the capabilities of the algorithms tested. Note that the speedup compared to FCM for PLK01 (Table VI) is greater than the speedup for MRI (Table III) except in the case of MSERFCM. The MSERFCM algorithm has a consistent but small speedup, just barely greater than that of OFCM.

This is because of the subsample size rules for MSERFCM. Thompson's method (Eq. 11) calculates a minimum subsample size of 50,944 for PLK01. Thus in all experiments, MSERFCM degenerates to rseFCM with a subsample size of 50,944. In contrast, the largest subsample size for the other algorithms is 20,328, when fPDA = 0.1.

Note that the MSERFCM algorithm has the best quality (Fig. 6) and is the only one with consistent quality across the experiments. The eFFCM algorithm has fairly consistent quality, but eFFCM has a mechanism to dynamically increase its sample size if quality measures are not met; the average subsample sizes for eFFCM on PLK01 ranged from 10.7% to 34.6%. All other algorithms suffer a degradation of quality as the subsample size is reduced, which is attributable to a suboptimal sample size. With very small subsamples, the quality as measured by CC% makes the results unusable for many applications of clustering.

The quality of the MSERFCM algorithm's results demonstrates the utility of using a statistical method for determining subsample size.

The PLK01 dataset also provides a clear demonstration of GOFCM's operation. As with MSERFCM, the predicted minimum subsample size is greater than the maximum size calculated as fPDA × n. In this case GOFCM degenerates to using the same, consistent subsample size as SPFCM.

The only remaining difference between SPFCM and GOFCM is then the stopping criterion. SPFCM always processes 100% of the data, whereas GOFCM stops short of that if the stopping criterion is met. An examination of Table VI shows that SPFCM and GOFCM have a similar speedup when the fPDA (sample rate) is 0.1, with GOFCM's speedup increasing, and quality decreasing, relative to SPFCM as the fPDA decreases.

Observe the CC% of SPFCM, OFCM and GOFCM when fPDA = 0.001 (Fig. 6); the values are very similar. When fPDA = 0.001, the subsample size is only 204. Experimentation was kept consistent across all datasets; in practice there is no reason to use such a small sample. Recall that c = 20 for PLK01. The cluster representation in PLK01 is not uniform and the sample size is small, making it unlikely that all clusters are even represented in each subsample. The base FCM algorithm will still try to fit 20 clusters, which will likely result in predicted cluster centers unrepresentative of the full dataset.

Finally, we note that these approaches can be applied to FCM with different distance metrics [50], [51].

X. Conclusions

In this work we introduced two significant contributions to the study of accelerated clustering algorithms: (a) the use of an estimated subsample size and (b) a novel stopping criterion. These contributions were demonstrated with two accelerated versions of FCM that leverage a statistical method to estimate the subsample size. These algorithms, GOFCM and MSERFCM, also use sound principles from other accelerated algorithms, namely progressive sampling and improved cluster center initialization. The novel stopping criterion was applied to GOFCM and shown to be the most significant factor in producing a speedup over FCM when the subsample size was small. This led to the development of the MODSPFCM algorithm, which is faster than SPFCM.

The majority of the time, the new algorithms were improvements over similar existing algorithms in terms of speed and/or quality. GOFCM provided a clear advantage in processing time over SPFCM, and the loss of quality due to processing less data was kept to a minimum. On average, MSERFCM edged out rseFCM on both speedup and quality. Here, quality is defined by the percentage of cluster change when compared to an average partition produced by FCM; the changes were typically quite small, indicating high quality.

The plankton and artificial datasets demonstrated both the value and the limitations of statistical methods. The experimental parameters resulted in many algorithms using suboptimally sized subsamples, with a corresponding loss in quality. GOFCM was designed to conform to the user-provided parameters despite a suboptimal subsample, which yielded speedup at the expense of quality. MSERFCM was designed to reject user-provided parameters if a suboptimal subsample would result, which yielded quality at the expense of speedup.

Two techniques applied in this work are novel to accelerated clustering: Thompson's statistical method and the stopping criterion for GOFCM. The manner in which Thompson's statistical method was used is only one possible application. Given that there are numerous accelerated FCM variants, this statistical method could potentially be applied to improve the speedup and result quality of many of them. The stopping criterion is a simple solution to the general problem of estimating a changing, noisy value. It enables MODSPFCM, a very good, highly scalable clustering algorithm. The technique could be applied to other existing clustering algorithms as a means of improving speedup with a minimal loss in quality. Future work will involve the discovery of additional applications for these new techniques.

Acknowledgment

This research was partially supported by grant 1U01CA143062-01, Radiomics of NSCLC, from the National Institutes of Health. The authors would like to thank Tim Havens for providing details on his research in scalable algorithms. The authors would also like to thank Kurt Kramer for providing the plankton dataset and answering follow-up questions.

Original implementations of FCM by Steven Eschrich [12] and of SPFCM and OFCM by Prodip Hore [28] were reviewed. Some implementation techniques were adopted, but the code was entirely rewritten. Initialization parser code was obtained from an implementation by Hoyt [52]. The χ² calculation needed for eFFCM was performed with code from [53].

Biography

Jonathon K. Parker is a Doctoral Candidate in the Department of Computer Science and Engineering at the University of South Florida. He received an M.S. in Computer Science from Bowie State University in 1997 and a B.S. in Chemistry from the University of Wisconsin-Madison in 1988.

His interests include data mining, data clustering, machine learning, sentiment analysis and designing software applications.

Lawrence O. Hall is a Distinguished University Professor and the Chair of the Department of Computer Science and Engineering at the University of South Florida. He received his Ph.D. in Computer Science from Florida State University in 1986 and a B.S. in Applied Mathematics from the Florida Institute of Technology in 1980.

He is a fellow of the IEEE, the AAAS and the IAPR. He received the Norbert Wiener Award from the IEEE SMC Society in 2012. His research interests lie in distributed machine learning, extreme data mining, bioinformatics, pattern recognition and integrating AI into image processing. The exploitation of imprecision with the use of fuzzy logic in pattern recognition, AI and learning is a research theme.

He has authored or co-authored over 70 journal publications, as well as many conference papers and book chapters. Some recent publications appear in the IEEE Transactions on Fuzzy Systems, IEEE Transactions on Pattern Analysis and Machine Intelligence, Neural Computation, Information Fusion, IEEE Transactions on Systems, Man, and Cybernetics, Pattern Recognition, the International Conference on Pattern Recognition, the Multiple Classifier Systems Workshop, and the FUZZ-IEEE conference (http://isl.csee.usf.edu/ailab/hall.html).

He received the IEEE SMC Society Outstanding Contribution Award in 2008 and an Outstanding Research Achievement Award from the University of South Florida in 2004. He is a past president of NAFIPS and a former vice president for membership of the SMC Society. He was President of the IEEE Systems, Man and Cybernetics Society for 2006-07, and was Editor-in-Chief of the IEEE Transactions on Systems, Man and Cybernetics, Part B, from 2002 to 2005. He is the Vice President for Publications of the IEEE Biometrics Council. He is also an associate editor for the IEEE Transactions on Fuzzy Systems, the International Journal of Intelligent Data Analysis, the International Journal of Pattern Recognition and Artificial Intelligence, and the International Journal of Approximate Reasoning.

REFERENCES

  • 1.Lloyd SP. Least squares quantization in PCM. IEEE Transactions on Information Theory. 1982 Mar;28(2):129–137. [Google Scholar]
  • 2.Bezdek JC. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press; 1981. [Google Scholar]
  • 3.Hore P, Hall LO, Goldgof DB. Single pass fuzzy c means. IEEE International Conference on Fuzzy Systems. FUZZ-IEEE. 2007 Jul;:1–7. [Google Scholar]
  • 4.Provost F, Jensen D, Oates T. Efficient progressive sampling. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. 1999:23–32. [Google Scholar]
  • 5.Havens TC, Bezdek JC, Leckie C, Hall LO, Palaniswami M. Fuzzy c-means algorithms for very large data. IEEE Trans. Fuzzy Systems. 2012 Dec;20(6):1130–1146. [Google Scholar]
  • 6.Parker JK, Hall LO, Bezdek JC. Comparison of scalable fuzzy clustering methods. Fuzzy Systems (FUZZ), 2012 IEEE International Conference on. 2012 Jun;:1–9. [Google Scholar]
  • 7.Hung MC, Yang DL. An efficient fuzzy c-means clustering algorithm. Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM'01) 2001:225–232. [Google Scholar]
  • 8.Cheng T, Goldgof D, Hall L. Fast fuzzy clustering. Fuzzy Sets and Systems. 1998 Jan;93(1):49–56. [Google Scholar]
  • 9.Altman D. Efficient fuzzy clustering of multi-spectral images. Proc. IEEE IGARSS. 1999;3:1594–1596. [Google Scholar]
  • 10.Kolen J, Hutcheson T. Reducing the time complexity of the fuzzy c-means algorithm. Fuzzy Systems, IEEE Transactions on. 2002 Apr;10(2):263–267. [Google Scholar]
  • 11.Pal NR, Bezdek JC. Complexity reduction for “large image” processing. IEEE Trans. Syst. Man Cybern. 2002 Oct;32(5):598–611. doi: 10.1109/TSMCB.2002.1033179. [DOI] [PubMed] [Google Scholar]
  • 12.Eschrich S, Ke J, Hall LO, Goldgof DB. Fast accurate fuzzy clustering through data reduction. IEEE Trans. Fuzzy Systems. 2003;11:262–269. [Google Scholar]
  • 13.Bezdek JC, Hathaway RJ. Progressive sampling schemes for approximate clustering in very large data sets. IEEE International Conference on Fuzzy Systems. 2004 Jul;1:15–21. [Google Scholar]
  • 14.Hathaway RJ, Bezdek JC. Extending fuzzy and probabilistic clustering to very large datasets. Computational Statistics and Data Analysis. 2006 Mar;5:215–234. [Google Scholar]
  • 15.Wang L, Bezdek JC, Leckie C, Kotagiri R. Selective sampling for approximate clustering of very large datasets. International Journal of Intelligent Systems. 2008 Mar;23(3):313–331. [Google Scholar]
  • 16.Hore P, Hall LO, Goldgof DB, Gu Y, Maudsley AA, Darkazanli A. A scalable framework for segmenting magnetic resonance images. J. Sign. Process Syst. 2009;54:183–203. doi: 10.1007/s11265-008-0243-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Ahmed MN, Yamany SM, Mohamed N, Farag AA, Moriarty T. A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data. Medical Imaging, IEEE Transactions on. 2002;21(3):193–199. doi: 10.1109/42.996338. [DOI] [PubMed] [Google Scholar]
  • 18.Dembele D, Kastner P. Fuzzy c-means method for clustering microarray data. Bioinformatics. 2003;19(8):973–980. doi: 10.1093/bioinformatics/btg119. [DOI] [PubMed] [Google Scholar]
  • 19.Biswal B, Dash PK, Panigrahi BK. Power quality disturbance classification using fuzzy c-means algorithm and adaptive particle swarm optimization. Industrial Electronics, IEEE Transactions on. 2009;56(1):212–220. [Google Scholar]
  • 20.Pal NR, Bezdek JC. On cluster validity for the fuzzy c-means model. Fuzzy Systems, IEEE Transactions on. 1995;3(3):370–379. [Google Scholar]
  • 21.Alpaydın E. Soft vector quantization and the em algorithm. Neural Networks. 1998;11(3):467–477. doi: 10.1016/s0893-6080(97)00147-0. [DOI] [PubMed] [Google Scholar]
  • 22.Jain AK. Data clustering: 50 years beyond k-means. Pattern Recognition Letters. 2010 Jun;31(8):651–666. [Google Scholar]
  • 23.de Pasquale F, Cherubini A, Péran P, Caltagirone C, Sabatini U. Influence of white matter fiber orientation on R2* revealed by MRI segmentation. Journal of Magnetic Resonance Imaging. 2013;37(1):85–91. doi: 10.1002/jmri.23801. [DOI] [PubMed] [Google Scholar]
  • 24.Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. 1994 [PubMed] [Google Scholar]
  • 25.Carson C, Belongie S, Greenspan H, Malik J. Blobworld: Image segmentation using expectation-maximization and its application to image querying. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2002;24(8):1026–1038. [Google Scholar]
  • 26.Coletta LF, Vendramin L, Hruschka ER, Campello RJ, Pedrycz W. Collaborative fuzzy clustering algorithms: Some refinements and design guidelines. Fuzzy Systems, IEEE Transactions on. 2012;20(3):444–462. [Google Scholar]
  • 27.Hore P, Hall L, Goldgof D, Cheng W. Online fuzzy c means. Annual Meeting of the North American Fuzzy Information Processing Society. 2008 May;:1–5. [Google Scholar]
  • 28.Hore P. Ph.D. dissertation. University of South Florida; Jun, 2007. Scalable frameworks and algorithms for cluster ensembles and clustering data streams. [Google Scholar]
  • 29.Bradley PS, Fayyad UM. Refining initial points for k-means clustering. Microsoft Research, Tech. Rep. 1998 May; MSR-TR-98-36. [Google Scholar]
  • 30.Domingos P, Hulten G, Edu P, Edu C. A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the Eighteenth International Conference on Machine Learning. 2001 [Google Scholar]
  • 31.Gu B, Liu B, Hu F, Liu H. Efficiently determining the starting sample size for progressive sampling. Proceedings of the European Conference on Machine Learning. 2001:192–202. [Google Scholar]
  • 32.Thompson SK. Sample size for estimating multinomialproportions. The American Statistician. 1987 Feb;41(1):42–46. [Google Scholar]
  • 33.Phoungphol P, Zhang Y. Sample size estimation with high confidence for large scale clustering. Proceedings of the 3rd International Conference on Intelligent Computing and Intelligent Systems. 2011 [Google Scholar]
  • 34.Meek C, Thiesson B, Heckerman D. The learning curve sampling method applied to model-based clustering. The Journal of Machine Learning Research. 2002 Mar;2:397–418. [Google Scholar]
  • 35.Samson S, Hopkins T, Remsen A, Langebrake L, Sutton T, Patten J. A system for high-resolution zooplankton imaging. IEEE Journal of Oceanic Engineering. 2001 Oct;26(4):671–676. [Google Scholar]
  • 36.Remsen A, Samson S, Hopkins T. What you see is not what you catch: a comparison of concurrently collected net, optical plankton counter and shadowed image particle profiling and evaluation recorder data from the northeast Gulf of Mexico. Deep Sea Research Part I: Oceanographic Research Papers. 2004 Jan;51(1):129–151. [Google Scholar]
  • 37.Kramer K. Ph.D. dissertation. University of South Florida; Oct, 2010. System for identifying plankton from the SIPPER instrument platform. [Google Scholar]
  • 38.Liu H, Setiono R. A probabilistic approach to feature selection - a filter solution. Machine learning: proceedings of the Thirteenth International Conference (ICML’96) 1996 Jul;:319–327. [Google Scholar]
  • 39.Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The weka data mining software: an update. SIGKDD Explor. Newsl. 2009 Nov;11(1):10–18. [Online]. Available:http://doi.acm.org/10.1145/1656274.1656278. [Google Scholar]
  • 40.Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2:193–218. [Google Scholar]
  • 41.Rand WM. Objective criteria for the evaluation of clusteringmethods. Journal of the American Statistical Association. 1971 Dec;66(336):846–850. [Google Scholar]
  • 42.Kuhn HW. The hungarian method for the assignment problem. Naval Research Logistics Quarterly. 1955 Mar;2(1-2):83–97. [Google Scholar]
  • 43.Gu Y, Hall LO, Goldgof DB. Evaluating scalable fuzzy clustering. 2010 IEEE International Conference on Systems Man and Cybernetics(SMC) 2010 Oct;:873–880. [Google Scholar]
  • 44.Clarke LP, Velthuizen RP, Clark M, Gaviria J, Hall LO, Goldgof D, Murtagh R, Phuphanich S, Brem S. MRI measurement of brain tumor response: comparison of visual metric and automatic segmentation. Magnetic Resonance Imaging. 1998;16(3):271–279. doi: 10.1016/s0730-725x(97)00302-0. [DOI] [PubMed] [Google Scholar]
  • 45.Chuang K-S, Tzeng H-L, Chen S, Wu J, Chen T-J. Fuzzy c-means clustering with spatial information for image segmentation. Computerized Medical Imaging and Graphics. 2006;30(1):9–15. doi: 10.1016/j.compmedimag.2005.10.001. [DOI] [PubMed] [Google Scholar]
  • 46.Wang Z, Song Q, Soh YC, Sim K. An adaptive spatial information-theoretic fuzzy clustering algorithm for image segmentation. Computer Vision and Image Understanding. 2013 [Google Scholar]
  • 47.Havens TC. 2012 private communication. [Google Scholar]
  • 48.Liang J, Bai L, Dang C, Cao F. The k-means-type algorithms versus imbalanced data distributions. IEEE Trans. Fuzzy Systems. 2012 Aug;20(4):728–745. [Google Scholar]
  • 49.Bezdek JC, Hathaway RJ, Sabin MJ, Tucker WT. Convergence theory for fuzzy c-means: Counterexamples and repairs. IEEE Transactions on Systems, Man, and Cybernetics. 1987 Sep-Oct;SMC-17(5):873–877. [Google Scholar]
  • 50.Wu J, Xiong H, Liu C, Chen J. A generalization of distance functions for fuzzy c-means clustering with centroids of arithmetic means. IEEE Transactions on Fuzzy Systems. 2012;20(3):557–571. [Google Scholar]
  • 51.Huang H-C, Chuang Y-Y, Chen C-S. Multiple kernel fuzzy clustering. Fuzzy Systems, IEEE Transactions on. 2012;20(1):120–134. [Google Scholar]
  • 52.Hoyt B. inih: simple .ini parser in c. Open Source Project. [Online] 2011 Available:http://code.google.com/p/inih/
  • 53.Reiter B, Aquino J. Statist 1.4.2. Open Source Project. [Online] 2009 Available:http://wald.intevation.org/projects/statist/
