To do: |
A hierarchical clustering method consists of grouping data objects into a tree of clusters. There are two main types of techniques: a bottom-up and a top-down approach. The first one starts with small clusters composed by a single object and, at each step, merge the current clusters into greater ones, successively, until reach a cluster composed by all data objects. The second approach use the same logic, but to the opposite direction, starting with the greatest cluster, composed by all objects, and split it successively into smaller clusters until reach the singleton groups. Besides the strategies, other important issue is the metrics used to build (merge or split) clusters. Such metrics can be different distance measures, described next section.
Four widely used measures for distance between clusters are as follows. Wherep andp' are two different data object points,mi is the mean for clusterCi,ni is the number of objects in clusterCi, and|p - p'| is the distance betweenp andp'.
Minimum distance:
Maximum distance:
Mean distance:
Average distance:
One algorithm that implements the bottom-up approach is AGNES (AGglomerative NESting). The main idea of AGNES is, at its first step, create clusters composed by one single data object, and then, using a specified metric (such the ones mentioned previous section), merge such clusters into greater ones. The second step is repeated iteratively until only one cluster is obtained, composed by all data objects. Another example of algorithm implementing this approach is Cure, where instead dealing all the time with the whole set of entities, the clusters are shrank to its centers.
DIANA (DIvisive ANAlysis) is an example of an algorithm that implements top-down approach, starting with a single big cluster with all elements and iteratively spliting the current groups into smaller ones. As happens with AGNES and Cure, is necessary to define some metric to compute the distance inter-clusters, in order to decide how (where) they have to be splitted.
In order to use Hierarchical Clustering algorithms in R, one must installcluster package. This package includes a function that performs the AGNES process and the DIANA process, according to different algorithms.
The AGNES function provided bycluster package, might be used as follow:
agnes(x, diss = inherits(x, "dist"), metric = "euclidean",stand = FALSE, method = "average", par.method,keep.diss = n < 100, keep.data = !diss)
where the arguments are:
AGNES algorithm has the following features:
The function AGNES returns aagnes object representing the clustering. This agnes object is a list with the components listed below:
To visualize the AGNES function result might be used the functions:print andplot.
The first function simply print the components of objectagnes and the second one plot the object, creating a chart that represents the data.
Here, it is an example of use function plot:
plot(x, ask = FALSE, which.plots = NULL, main = NULL,sub = paste("Agglomerative Coefficient = ",round(x$ac, digits = 2)),adj = 0, nmax.lab = 35, max.strlen = 5, xax.pretty = TRUE, ...)
The arguments used are:
And here, it is the example of use function print:
print(x, ...)
The arguments are:
The DIANA function provided bycluster package, might be used as follow:
diana(x, diss = inherits(x, "dist"), metric = "euclidean", stand = FALSE, keep.diss = n < 100, keep.data = !diss)
where the arguments are:
result. Setting these to FALSE can give much smaller results and hence even save memory allocation time.
DIANA algorithm is probably unique in computing a divisive hierarchy, whereas most other software for hierarchical clustering is agglomerative. It has the same features as AGNES functions as follows:
The function DIANA returns adiana object representing the clustering. This agnes object is a list with the components listed below:
To visualize the DIANA function result might be used the functions:print andplot.
The first function simply print the components of objectagnes and the second one plot the object, creating a chart that represents the data.
Here, it is an example of use functionplot:
plot(x, ask = FALSE, which.plots = NULL, main = NULL, sub = paste("Divisive Coefficient = ", round(x$dc, digits = 2)),adj = 0, nmax.lab = 35, max.strlen = 5, xax.pretty = TRUE, ...)
The arguments used are:
And here, it is the example of use functionprint:
print(x, ...)
The arguments are:
In this section it is illustrate an example of using hierarchical clustering in some Italian cities. In the example below, it is used mean distance, or single-linkage method.
Given the distances in kilometers between Italian cities, the object is to group them into clusters in an agglomerative way.For examples in R is used the functionagnes.
The input data is a table containing the numeric values of distances between cities in Italy. The table consists of six columns (and rows), representing Italian cities. Each cell has a numeric value representing the distance in kilometers between them. The table can be loaded from a spreadsheet or from a text file.
Figure 1 illustres the cities used in this example.
Input distance matrix
| BA | FI | MI | VO | RM | TO | |
| BA | 0 | 662 | 877 | 255 | 412 | 996 |
| FI | 662 | 0 | 295 | 468 | 268 | 400 |
| MI | 877 | 295 | 0 | 754 | 564 | 138 |
| VO | 255 | 468 | 754 | 0 | 219 | 869 |
| RM | 412 | 268 | 564 | 219 | 0 | 669 |
| TO | 996 | 400 | 138 | 869 | 669 | 0 |
The functionagnes can be used to define the groups of countries as follow:
# import datax <- read.table("data.txt")# run AGNESag <- agnes (x, false, metric="euclidean", false, method ="single") # print components of agprint(ag)# plot clustersplot(ag, ask = FALSE, which.plots = NULL)
The value of the second parameter ofagnes was FALSE, becausex was treated as a matrix of observations by variables. The fourth parameter also is FALSE because isn't necessary standardizing each column.
The result of printing the components of the classagnes returned by the function application is shown below:
Call: agnes(x = x, diss = FALSE, metric = "euclidean", stand = FALSE, method = "single") Agglomerative coefficient: 0.3370960 Order of objects:[1] BA VO RM FI MI TOHeight (summary): Min. 1st Qu. Median Mean 3rd Qu. Max. 295.8 486.0 486.5 517.6 642.6 677.0 Available components:[1] "order" "height" "ac" "merge" "diss" "call" [7] "method" "order.lab" "data"
The result of plotting the class returned by the function application it is shown below:
The algorithm always cluster the nearest pair of cities so, in this case, the " neighborhood " cities are clusterized first. In single link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object
However, there are a few weaknesses in agglomerative clustering methods:
J. Han and M. Kamber. Data Mining: Concepts and Techniuqes. Morgan Kaufmann Publishers, San Francisco, CA, 2006.
Maechler, M. Package 'cluster'. Disponível em <http://cran.r-project.org/web/packages/cluster/cluster.pdf> Acesso em 12 dez 2009.
Meira Jr., W.; Zaki, M. Fundamentals of Data Mining Algorithms. Disponível em <http://www.dcc.ufmg.br/miningalgorithms/DokuWiki/doku.php> Acesso em 12 dez 2009.
A Tutorial on Clustering Algorithms. Disponível em <http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html>. Acesso em 12 dez 2009.