Correlation "Distances" and Hierarchical Clustering
Earl F. Glynn
A recent posting by Robin Hankin to the R-News mailing list provided some details. The CorrelatedXY function below creates N pairs of normal random values with the given means and variances and with a specified correlation between the two members of each pair:
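A minimal sketch of such a function is given here for readers who want something runnable. Only the name CorrelatedXY and its purpose come from the text; the argument names and the construction from two independent normal draws are assumptions, and a multivariate normal generator such as MASS::mvrnorm would be an equally reasonable choice.

```r
# Sketch: the function name comes from the text; the argument names and the
# construction from two independent normal draws are assumptions.
CorrelatedXY <- function(N, mean1, mean2, variance1, variance2, correlation) {
  stopifnot(N >= 1, variance1 > 0, variance2 > 0, abs(correlation) <= 1)
  z1 <- rnorm(N)
  z2 <- rnorm(N)
  x <- mean1 + sqrt(variance1) * z1
  # Mixing z1 and z2 gives y the requested variance and a population
  # correlation with x equal to `correlation`; it also works at +1 or -1.
  y <- mean2 + sqrt(variance2) *
    (correlation * z1 + sqrt(1 - correlation^2) * z2)
  cbind(x, y)
}

set.seed(17)  # arbitrary seed for this sketch
xy <- CorrelatedXY(1000, mean1 = 0, mean2 = 0,
                   variance1 = 1, variance2 = 1, correlation = 0.8)
cor(xy)[1, 2]  # sample correlation; close to, but not exactly, 0.8
```

The sample correlation only approximates the requested value, and the approximation improves as N grows.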
Let's now create six sets of correlated, normally-distributed values with these parameters:
Let's set this random number seed so this experiment is repeatable.
The following may seem a bit obscure but is not. The Raw matrix is initialized to N pairs of random values that have correlation 0.0 with each other. The for loop then appends five more sets of N pairs as columns of the matrix (using cbind). These added sets have the approximate correlations -0.2, +0.4, -0.6, +0.8, -1.0:
Column names are assigned to the Raw data matrix. The prefix "P" is for "Plus" and "M" is for "Minus" correlation. The value 00 means a correlation of 0.0, the value 02 means a correlation of 0.2, ..., the value 10 means a correlation of 1.0. The "A" and "B" suffixes are used to distinguish the pair of values in each set:
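The construction just described can be sketched as follows. The compact two-argument CorrelatedXY stand-in, the seed, and N = 1000 are assumptions made for illustration, not the original values; the column names are the ones given in the text.

```r
# Hypothetical compact stand-in for the CorrelatedXY function described
# earlier: standard normal pairs with the requested correlation.
CorrelatedXY <- function(N, correlation) {
  z1 <- rnorm(N)
  z2 <- rnorm(N)
  cbind(z1, correlation * z1 + sqrt(1 - correlation^2) * z2)
}

set.seed(19)  # arbitrary seed (assumption)
N <- 1000     # assumed number of pairs

# Start with one uncorrelated pair, then append five more pairs as columns.
Raw <- CorrelatedXY(N, 0.0)
for (i in 1:5) {
  # Alternating sign and growing magnitude: -0.2, +0.4, -0.6, +0.8, -1.0
  Raw <- cbind(Raw, CorrelatedXY(N, (-1)^i * i / 5))
}

colnames(Raw) <- c("P00A", "P00B", "M02A", "M02B", "P04A", "P04B",
                   "M06A", "M06B", "P08A", "P08B", "M10A", "M10B")

# Correlation matrix: entries within each A/B pair should be near the
# targets, and all other off-diagonal entries should be near zero.
round(cor(Raw), 2)
```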
Now look at the correlation matrix:
The values in bold above show the assigned correlation values are approximately correct and the rest of the data is not correlated:
A "map" of the correlation matrix can be created by assigning a color to each cell of the matrix:
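One way to draw such a map with base graphics is sketched here. The six-column toy matrix, the seed, and the blue-white-red palette are assumptions chosen for the sketch; they are not the original data or color scheme.

```r
set.seed(19)  # arbitrary seed for this sketch
N <- 500
x <- rnorm(N)
w <- rnorm(N)
# Toy stand-in for the Raw matrix: one uncorrelated pair, one pair with
# correlation near +0.8, and one pair with correlation exactly -1.
Raw <- cbind(P00A = rnorm(N), P00B = rnorm(N),
             P08A = x, P08B = 0.8 * x + 0.6 * rnorm(N),
             M10A = w, M10B = -w)

r <- cor(Raw)
n <- ncol(r)
palette <- colorRampPalette(c("blue", "white", "red"))(21)

# image() draws z[i, j] at (x[i], y[j]); reversing the rows and transposing
# puts the matrix diagonal on the top-left to bottom-right diagonal.
image(1:n, 1:n, t(r[n:1, ]), zlim = c(-1, 1), col = palette,
      axes = FALSE, xlab = "", ylab = "", main = "Correlation matrix map")
axis(1, at = 1:n, labels = colnames(r), las = 2)
axis(2, at = 1:n, labels = rev(rownames(r)), las = 2)
```

Fixing zlim at -1 to +1 keeps the color scale symmetric, so white always means "uncorrelated" regardless of the data.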
The 1st, 3rd and 4th methods above were suggested in a Dec 28 posting to R-Help. The 2nd method is suggested as part of R's ?dist help, which is likely used to rescale the 1st method to always have values from 0.0 to 1.0 instead of 0.0 to 2.0.
Here's the R code that creates a distance matrix from 1-Correlation as the dissimilarity index:
Note that the as.dist function is used here to treat the 1-Correlation values as "distances"; it converts the matrix to a dist object without computing any distances itself. (In some cases, you may want to use the dist function to compute distances using a variety of distance metrics instead.)
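The difference between as.dist and dist can be sketched on a small toy matrix. The data, seed, and the name euclidean are assumptions for illustration; this toy matrix has only six columns, so its printout is smaller than the twelve-variable matrix discussed in the text.

```r
set.seed(19)  # arbitrary seed for this sketch
N <- 500
x <- rnorm(N)
w <- rnorm(N)
Raw <- cbind(P00A = rnorm(N), P00B = rnorm(N),
             P08A = x, P08B = 0.8 * x + 0.6 * rnorm(N),
             M10A = w, M10B = -w)

# as.dist() performs no distance computation: it keeps the lower triangle
# of a square matrix and marks it as an object of class "dist".
distance <- as.dist(1 - cor(Raw))
round(distance, 4)

# dist(), by contrast, computes distances between the ROWS of its input,
# so the variables have to be transposed into rows first.
euclidean <- dist(t(Raw), method = "euclidean")
```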
> round(distance, 4)
       P00A   P00B   M02A   M02B   P04A   P04B   M06A   M06B   P08A   P08B   M10A
P00B 0.9678
M02A 1.0054 1.0349
M02B 1.0258 1.0052 1.2106
P04A 1.0247 0.9928 1.0145 0.9260
P04B 0.9898 0.9769 0.9875 0.9855 0.6075
M06A 1.0159 0.9893 1.0175 0.9521 0.9266 0.9660
M06B 0.9837 0.9912 1.0124 1.0402 1.0272 1.0367 1.5693
P08A 1.0279 1.0303 0.9865 0.9748 1.0184 1.0452 0.9799 1.0400
P08B 1.0248 1.0299 0.9717 0.9673 1.0048 1.0329 1.0280 0.9907 0.2158
M10A 0.9850 0.9603 1.0246 0.9708 1.0231 0.9771 0.9916 1.0168 0.9722 0.9525
M10B 1.0150 1.0397 0.9754 1.0292 0.9769 1.0229 1.0084 0.9832 1.0278 1.0475 2.0000
The hierarchical clustering function, hclust, expects a dissimilarity matrix. The plot function knows how to plot a dendrogram from hclust's result.
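A minimal sketch of this step, using an assumed six-column toy matrix rather than the original data:

```r
set.seed(19)  # arbitrary seed for this sketch
N <- 500
x <- rnorm(N)
w <- rnorm(N)
Raw <- cbind(P00A = rnorm(N), P00B = rnorm(N),
             P08A = x, P08B = 0.8 * x + 0.6 * rnorm(N),
             M10A = w, M10B = -w)

distance <- as.dist(1 - cor(Raw))
tree <- hclust(distance)  # default linkage method is "complete"
plot(tree, main = "Dissimilarity = 1 - Correlation", xlab = "", sub = "")

# Cutting the tree into two groups shows which variables end up together.
groups <- cutree(tree, k = 2)
groups
```

With this dissimilarity the positively correlated pair merges first, while the perfectly anticorrelated pair sits at the maximum distance of 2.0 and so cannot be joined before the final merge.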
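The results for this simple dissimilarity index are not good. Because 1-Correlation assigns the largest "distance" to the most strongly negative correlation (2.0000 for M10A and M10B in the matrix above), negatively correlated pairs are pushed apart rather than grouped. In particular, the values with a correlation of -1.0 are grouped incorrectly (M10B with M02A, and M10A with P00B).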
As expected, the second method only changed the scaling and did not affect the clusters:
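This invariance can be checked directly. The sketch assumes the second method is the first divided by two, which is what the 0.0 to 1.0 range implies, and it uses an assumed toy matrix rather than the original data.

```r
set.seed(19)  # arbitrary seed for this sketch
N <- 500
x <- rnorm(N)
w <- rnorm(N)
Raw <- cbind(P00A = rnorm(N), P00B = rnorm(N),
             P08A = x, P08B = 0.8 * x + 0.6 * rnorm(N),
             M10A = w, M10B = -w)

d1 <- as.dist(1 - cor(Raw))        # range 0.0 to 2.0
d2 <- as.dist((1 - cor(Raw)) / 2)  # assumed second method, range 0.0 to 1.0

h1 <- hclust(d1)
h2 <- hclust(d2)

# Same merge sequence and leaf order; only the merge heights are halved.
identical(h1$merge, h2$merge)
identical(h1$order, h2$order)
all.equal(h1$height / 2, h2$height)
```

Dividing every dissimilarity by the same positive constant preserves their ordering, and the merge sequence under the default complete linkage depends only on that ordering.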
The R cluster package provides an alternative "agnes" function (agglomerative nesting) for hierarchical clustering and for plotting dendrograms:
"agnes" can work with a dissimilarity matrix, like hclust, or can manipulate raw data directly if the parameter diss = FALSE.
The "banner" plot is explained under ?plot.agnes:
The agglomerative coefficient is explained under ?agnes.object:
The usefulness of the agglomerative coefficient is not clear from this brief exercise.
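The agnes usage described above can be sketched as follows, assuming the cluster package is installed (it is one of R's recommended packages). The toy data and the names fit_diss and fit_raw are assumptions for illustration.

```r
library(cluster)

set.seed(19)  # arbitrary seed for this sketch
N <- 500
x <- rnorm(N)
w <- rnorm(N)
Raw <- cbind(P00A = rnorm(N), P00B = rnorm(N),
             P08A = x, P08B = 0.8 * x + 0.6 * rnorm(N),
             M10A = w, M10B = -w)

# From a dissimilarity object, as with hclust. agnes defaults to average
# linkage, whereas hclust defaults to complete linkage.
fit_diss <- agnes(as.dist(1 - cor(Raw)), diss = TRUE)

# From raw data: agnes clusters the ROWS, so the variables are transposed
# into rows; dissimilarities are then Euclidean by default.
fit_raw <- agnes(t(Raw), diss = FALSE)

fit_diss$ac                      # agglomerative coefficient
plot(fit_diss, which.plots = 1)  # banner plot
plot(fit_diss, which.plots = 2)  # dendrogram
```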
The dendrograms from hclust and agnes look similar but are slightly different. There is no single "correct assignment" of the order of the clustered objects. While agnes put the two "random" variables together (P00A and P00B), hclust randomly associated them as being "closer" to other clusters. The "pvclust" package below tries to improve on this by assigning probabilities to the various clusters.
In addition to agnes, the R "cluster" package provides an alternative "pam" (partitioning around medoids) function for partitioning the data. For example:
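A minimal pam sketch follows. The toy data (three correlated pairs), the choice of k = 3, and the use of 1-Abs(Correlation) as the dissimilarity are assumptions made so the expected partition is easy to see; they are not taken from the original example.

```r
library(cluster)

set.seed(19)  # arbitrary seed for this sketch
N <- 500
u <- rnorm(N)
v <- rnorm(N)
w <- rnorm(N)
# Three pairs with correlations near -0.6 and +0.8, and exactly -1.0.
Raw <- cbind(M06A = u, M06B = -0.6 * u + 0.8 * rnorm(N),
             P08A = v, P08B =  0.8 * v + 0.6 * rnorm(N),
             M10A = w, M10B = -w)

distance <- as.dist(1 - abs(cor(Raw)))
fit <- pam(distance, k = 3, diss = TRUE)

fit$medoids     # one representative variable per cluster
fit$clustering  # cluster number assigned to each variable
```

Unlike hclust and agnes, pam needs the number of clusters k in advance and returns a flat partition rather than a tree.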
See ?pam.object for details.
Yet another hierarchical clustering alternative is provided by the Hmisc package:
The results of the Hmisc varclus function are roughly similar to the results from hclust with a dissimilarity measure 1-Abs(Correlation).
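The hclust side of that comparison can be sketched in base R. This is not a varclus example; it only illustrates the 1-Abs(Correlation) dissimilarity, on an assumed toy matrix.

```r
set.seed(19)  # arbitrary seed for this sketch
N <- 500
x <- rnorm(N)
w <- rnorm(N)
Raw <- cbind(P00A = rnorm(N), P00B = rnorm(N),
             P08A = x, P08B = 0.8 * x + 0.6 * rnorm(N),
             M10A = w, M10B = -w)

# Taking the absolute value makes strong negative correlation count as
# "close", just like strong positive correlation.
distance_abs <- as.dist(1 - abs(cor(Raw)))
tree_abs <- hclust(distance_abs)
plot(tree_abs, main = "Dissimilarity = 1 - Abs(Correlation)",
     xlab = "", sub = "")

groups_abs <- cutree(tree_abs, k = 4)
groups_abs
```

With this measure the perfectly anticorrelated pair has dissimilarity near zero and is the first to merge, instead of being the farthest apart as it is under 1-Correlation.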
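The pvclust site gives this information about how to interpret the AU and BP p values above: AU is the "approximately unbiased" p-value, computed by multiscale bootstrap resampling, and BP is the "bootstrap probability" value, computed by ordinary bootstrap resampling. The AU value is the better approximation to an unbiased p-value.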
The pvrect function can be used to automatically highlight significant clusters.
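A hedged sketch of a pvclust run with pvrect is given here. It assumes the pvclust package is installed (it is not part of a default R installation, hence the guard) and uses an assumed toy matrix, a small nboot for speed, and average linkage; none of these settings are taken from the original analysis.

```r
set.seed(19)  # arbitrary seed for this sketch
N <- 500
u <- rnorm(N)
v <- rnorm(N)
# Toy data: an uncorrelated pair plus pairs with correlation near -0.6
# and near +0.8.
Raw <- cbind(P00A = rnorm(N), P00B = rnorm(N),
             M06A = u, M06B = -0.6 * u + 0.8 * rnorm(N),
             P08A = v, P08B =  0.8 * v + 0.6 * rnorm(N))

if (requireNamespace("pvclust", quietly = TRUE)) {
  # pvclust clusters the COLUMNS of its data argument. "abscor" is the
  # 1-Abs(Correlation) dissimilarity; "correlation" is 1-Correlation.
  fit <- pvclust::pvclust(Raw, method.hclust = "average",
                          method.dist = "abscor", nboot = 100)
  plot(fit)                           # dendrogram with AU and BP values
  pvclust::pvrect(fit, alpha = 0.95)  # box clusters with AU >= 0.95
}
```

A larger nboot gives more stable p-values at the cost of run time.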
The dendrogram above was created using the "abscor" method, i.e., the dissimilarity measure 1-Abs(Correlation). If the simpler "correlation" method is used, only two clusters are found to be significant:
Dendrograms for the same data can be presented in a variety of ways. The computationally intensive bootstrap procedure is useful to assess the uncertainty in hierarchical clustering.
Download: R code to reproduce the above figures
29 Dec 2005