Distance and similarity measures
There is more than one way to measure a distance. There are distances that are euclidean (can be measured with a 'ruler') and there are other distances based on similarity. For example, in terms of road distance (a euclidean distance) York is closer to Manchester than it is Canterbury. However, if distance is measured in terms of the characteristics of a city York is closer to Canterbury.
Three general classes of distance measures can be recognised.
| Euclidean metrics | These measure true straight line
distances in Euclidean space. Note that the distance from
Manchester to Chicago would not be Euclidean unless you
bored a tunnel through the earth. The flight path of
plane would follow the earth's curvature and hence would
not be a straight line distance. In a univariate example the euclidean distance between two values is the arithmetic difference, i.e. value1 - value2. In the bivariate case the minimum distance is the hypotenuse of a triangle formed from the points. For three variables the hypotenuse will be extended through 3-dimensional space. Although difficult to visualise an extension of the Pythagoras theorem will give the distance between two points in n-dimensional space.
|
| Non-Euclidean metrics | Distances that are not
straight-line, but which obey certain rules. dij
is the distance between two cases, i & j.
The Manhattan or City Block metric is an example of this type. |
| Semi-metrics | Distance measures that obey the first three rules but may not obey the 'triangle' rule. The Cosine measure is an example of this type. |
You also need to take account of the type of data since each has its own set of distance measures. In general there are 3 broad classes:
It is not generally possible, or advisable, to mix data types in a cluster analysis.
Euclidean distances for interval variables
Several distance measures are available within SPSS
| SEUCLID | squared
euclidean distance. For any pair of cases the measure is Distance(x,y) = S(x - y)2 Therefore SEUCLID is the sum of the squared differences between scores for two cases on all variables, i.e. the squared length of the hypotenuse. |
||
| EUCLID | square root of SEUCLID. Note that for two variables this is the normal Pythagoras theorem. | ||
| CHEBYCHEV | This is a
distance measure. In this the absolute maximum
difference between variable scores is used. Distance(x,y) = MAX |x-y| |
||
| BLOCK | City Block
or Manhattan distance. So named because in most American
cities it is not possible to go directly between 2
points. A route which follows the regular grid of roads
is used.
|
Data Transformations
One problem with euclidean distances is that they can be greatly influenced by variables that have the largest values. Consider the following pair of points.

The euclidean distance between them is the square root of (6002 + 0.82), which is the square root of (360000 + 0.64) = 600.0007. Obviously this distance is almost entirely dominated by the variable on the x axis.
One way around this problem is to standardise the variables. For example, if both are forced to have a scale within the range 0 - 1 the two distances become 0.6 and 0.8, which are much more equitable. The value of d is then 1.0, and both variables have contributed significantly to this distance.
Non-euclidean distances for interval data
COSINE - Cosine of vectors of variables. This is a pattern similarity measure. The cosine of the angle between 2 vectors is identical to their correlation coefficient.
| Similarity(x,y) = | S (xy) |
| (Sx2)(Sy2) |
Count Data
Chi-square measure. This measure is based on the chi-square test of equality for two sets of frequencies.
Phi-square measure. This measure is equal to the chi-square measure normalized by the square root of the combined frequency.
Binary Data
There are a surprisingly large number of dissimilarity measures available for binary data, for example SPSS v8.0 lists 27 different ones. All of these binary-based measures use two or more of the values obtained from a simple 2 x 2 matrix of agreement. Note that n = a + b + c + d.
| + | - | |
| + | a | b |
| - | c | d |
Many of these measures differ with respect to the inclusion of double negatives (d). Is it valid to say to say that the shared absence of a character is indicative of similarity? The answer is that it depends on the context, soemtimes it is and sometimes it isn't. Ten example measures are listed below.
| Euclidean distance. | SQRT(b+c) | The square root of the sum of discordant cases, minimum value is 0, and it has no upper limit. |
| Squared Euclidean distance | b+c. | The sum of discordant cases, minimum value is 0, and it has no upper limit. |
| Size difference | An index of asymmetry. It ranges from 0 to 1. | |
| Pattern difference | bc/(n**2) | Dissimilarity measure for binary data that ranges from 0 to 1. |
| Variance. | (b+c)/4n | Ranges from 0 to 1 |
| Simple matching | (a+d)/n | This is the ratio of matches to the total number of values. Equal weight is given to matches and nonmatches. |
| Dice. | 2a/(2a + b + c) | This is an index in which joint absences are excluded from consideration, and matches are weighted double. Also known as the Czekanowski or Sorensen measure. |
| Lance and Williams | (b+c)/(2a+b+c) | Range of 0 to 1. (Also known as the Bray-Curtis nonmetric coefficient.) |
| Nei & Lei's genetic distance | 2a/[(a+b)+(a+c)] | |
| Yule coefficient | (ad-bc)/(ad+bc) | This index is a function of the cross-ratio for a 2 x 2 table and is independent of the marginal totals. It has a range of -1 to 1. Also known as the coefficient of colligation. |