Introduction to data similarity

Algorithms - Similarity

Written by Jan Schulz

Thursday, 15 May 2008 18:22

Similarity, dissimilarity and distances among data records

Objective

Investigators are often faced with numerical readings describing characteristic properties for a certain object. For several applications it is interesting to find out how alike two or more objects are. With n being the number of measured variables its location in such a space is defined by an n-dimensional vector. Generalised it means that each object represents a point in a space of the same dimension as the number of variables. Investigating variables individually disregards relationsships between them and does not respect their multivariate characteristic. Having more than one object distance and similarity indices can be calculated that include all variables. The derived values describe relative interspaces between objects by a single value and are a measurement for their alinkeness. In larger data sets two or more naturally occuring groups can be distinguished, when the variables included in the analysis discriminate the objects in such a space. They are the basis for a couple of data ordination techniques to identify clusters and groups of objects being similar or separate from others.

As a limitation the indices require values for every variable, respectively no missing values in the calculation. The indices are commonly used in scientific, industrial, engineering and other applications. Legendre & Legendre (1998) give more than 50 different measurements mainly used for ecological investigations. Here we apply Occam’s razor and restrict the number of algorithms to some of the most common.

Initial example

As an initial example for similarity of numerically represented objects one can imagine points in a plane. These points may e.g. be coordinates on a city map. The elements of each coordinate are the variables. The first variable describes the position along a given x-axis, the second along a perpendicular y-axis. Thus, a point on this plane is an object in a two-dimensional space. With increasing distances for example travel time increases and there is also a higher probability that not directly measured characteristics change (e.g. environmental or socio-economic).

To describe the relationship between two points it appears obvious to measure the shortest connection between them, the so called Euclidean distance. This is eqivalent to determining the diagonal of a rectangle with the two points giving the opposite corners. But this distance is not necessarily the best measurement for a given problem. Another option is to sum the absolute distances on the two axes. This is comparable to the roadway a person needs to walk when streets are perpendicular as in a chess board grid. It gives the shortest distance between two blocks along the streets and is referred to as the Manhattan distance. In this case the use of the Euclidean distance would be unsuitable, as blocks can not be directly passed, but may be useful when thinking for example about using a helicopter.

Example of the Euclidean distance (bold arrow) and Manhattan distance (dashed arrow) between points in a 2-dimensional space.

The location of the points in the above figure can be displayed in a table. Each point is listed in a new row, while the variables are given in the respective columns of the row. This is a common way to display readings for several objects. For the above example the tabulated x and y coordinates are the values of the two-dimensional row vectors giving the positions in the plane:

Object	x	y
P₁	2	6
P₂	5	2
P₃	6	5

To calculate the Euclidean distance one extracts the root of the summed squared differences along the two axes. Thus, the distance between P₁ and P₂ is:

The Manhattan distance is defined as the sum of the absolute differences of the coordinates along the axes. As these differences are the same as for the latter distance measurement we sum them and the result for the Manhattan distance is:

In both cases the unit of the result is the number of blocks to pass.

Metrics

An improtant property when measuring alikeness between objects is whether the used method satisfies the terms of triangle inequality. This mathematical theorem states that the distance between two objects in a metric space must be equal or greater than their summed distances to a third object. This is comparable to three points defining the vertexes of a triangle. In a metric space the summed lengths of two sides is always equal or greater than the length of the last side, regardless of the dimension. Equality occurs only when the points are on a straight line, synonymical with two angles of the triangle being 0°, while the last is 180°. This can be expressed by:

Furthermore metric measurements are non-negative d(P₁, P₂) ≥

0, show symmetry d(P₁, P₂) = d(P₂, P₁), have an identification mark d(P₁, P₁) = 0 and satisfy the terms of definiteness d(P₁, P₂) = 0, if and only if P₁ = P₂.

A measurement satisfying these properties is called metric. When d obeys triangle inequality objects can be ordinated with linear approaches in a way that the distances resemble relationships of the raw data. Thus, being metric is important for a couple of approaches investigating relationships based on d by multivariate data models. This is mainly due to the fact that these models often use linear approaches, like covariances or correlations between objects. Results may not be reliable when non metric measurements were used, as an initialy assumption was violated.

For the initial example it can be easily retraced that the distance between the points P₁ and P₂ and their distances to the third point P₃ satisfy the terms of triangle inequality for both Euclidean and Manhattan distance. Both are metric and special cases of the Minkowski distance.

Distance, similarity and dissimilarity

It can be easily imagined that the distance between two objects does not tell something about similarity. With increasing distances objects are more separated in their feature space. Thus, distance is a method to express an absolute value of dissimilarity. Dissimilarity itself is a relative value measuring the deviation between two objects.

Similarity in turn is a relative measurement for the quantity of relationship between two objects. The definitions for similarity functions are more loosely defined than for metrics. A similarity measurement is in most cases also non-negative d_s(P₁, P₂) ≥ 0, always symmetric d_s(P₁, P₂) = d_s(P₂, P₁) and its value increases the more similar the compared objects are. While distances range between 0 and ∞, the normal range of a similarity value is between –1 and +1, or when normalised between 0 and +1. A normalised similarity value d_s of +1 means a complete similarity, while 0 indicates a complete dissimilarity. This normalised similarity value d_s can be converted into a dissimilarity value d_d by subtracting it from 1:

d_d = 1- d_s

In the initial example it seems more appropriate to measure distances or dissimilarities than similarity. Similarity would be an unsatisfying information about the way and is just useful when comparing the relative distance between a couple of points. Using a metric measure to express dissimilarity between objects, the dissimilarities behave like relative distances (Pielou 1984). Thus, objects can be plotted in a space of desired dimension with the distance between them equal to the dissimilarity among them.

Application

There are a number of algorithms for specific tasks calculating distances, dissimilarities or similarities. Each has its own strength and limitation. Common to all of these methods is the capability to represent relationships between objects by a single value.

The calculation for a few objects might be interesting for some first estimations. The worth of calculating similarities, dissimilarities or distances becomes obvious when performed among all objects of a data set. Results can be summarised in a matrix with the number of rows and columns being equal to the number of different distances between objects. This means that the resulting table or matrix is square. Every object appears in the same order once in a column and in a row. In cell (i, j) the distance between object i and object j is tabulated. As there is no difference whether object i is compared with object j, or vice versa, the matrix is also symmetric and mirrored along the diagonal. The design is similar to distance tables found in almanacs, giving distances between cities (the respective objects).

Data	Var1	Var2	Var3
Case1	1	1	1
Case2	1	1	0
Case3	2	2	2
Case4	10	10	10
Case5	11	11	11
Case6	10	5	0

Euclidean Distance	Case1	Case2	Case3	Case4	Case5	Case6
Case1	0	1	1.732	15.588	17.321	9.899
Case2	1	0	2.449	16.186	17.916	9.849
Case3	1.732	2.449	0	13.856	15.588	8.775
Case4	15.588	16.186	13.856	0	1.732	11.180
Case5	17.321	17.916	15.588	1.732	0	12.570
Case6	9.899	9.849	8.775	11.180	12.570	0

It can be seen that in this Euclidean distance matrix values along the diagonal are always 0. This is common for distance matrices, as an object compared with itself shows a perfect congruence. The closer to 0 values are, the more similar these objects are. In the case of a similarity matrix the value 1 would be found along the diagonal, while the matrix itself remains symmetric.

As the matrix is symmetric the half above or below the diagonal is often omitted. This might also apply to the diagonal itself, as the inherent information can be given by the upper or lower triangular matrix. Such triangular matrices are a prerequisite for several data ordination techniques. As these methods, which revert on such triangular matrices, may arbitrarily access the upper or lower part, it is useful to create the complete matrices. Anyhow it is just necessary to calculate one side and to mirror the results.

Data ordination techniques endeavour to faithfully represent the charcteristics of an n-dimensional data set in lower dimensional spaces (Gauch 1982). From a random start configuration objects are arranged in a target space according to the measured relationships. This can be imagined as a model of spheres that represent the data objects. Every sphere is interconnected with all others by springs. The neutral length of a spring is that of the measured relationship between the respective objects. Due to random initialisation springs are under tension and not in their neutral length. The global tension, as a result of the overall displacement, is often referred to as the stress. During ordination it is tried to find a configuration in the dimension of the target space where overall stress among the springs is lowest. During this iterative process it is tried to maintain the relative distance between the objects and to reduce the stress stepwise. Except for perfect matches the overall stress is seldom zero. Especially in larger data sets the relative distances between all objects can hardly be reconciled. But it allows displaying higher dimensional and complex data sets in a 1-, 2- or 3-dimensional space that can be conceived by human mind. Objects, discriminable by the given variables, show a tendency to form clusters during the analysis. After charting clusters in such data sets can often be visually discriminated. Thus, it is possible to assign objects to a group based on their distance, dissimilarity or similarity and to identify higher order structures in large data sets. This may help to understand complex relationships and to identify similarity among groups. The probability of a new object to belong to a known group can be numerically determined by its distance from the group centroid. Anyhow, when objects are measured along a gradient it is possible that global investigations fail, although pairwise tests between objects from opposite ends of the gradient are significantly different (cf. Somerfield et al. 2002).

MDS of the famous Fisher-Anderson Iris data set

A metric Multi-Dimensional-Scaling (MDS) of the famous Fischer-Anderson Iris spp. data set (plant family Iridaceae). This data set has got 50 objects (specimens) for each of the three plant species Iris setosa, Iris versicolor and Iris virginica. Four variables were measured for each specimen. It can be seen that I. setosa separates well from I. versicolor and I. virginica. Although the numerical distances between objects of the latter two species is shorter, the two groups can be distinguished by the variables included in the analysis. To calculate the initial distance matrix for the MDS the Euclidean distance was chosen. Points in the plot were independently coloured by their known taxonomic classification.

Common ordination methods are e.g. Multi-Dimensional Scaling (MDS) or Sammon mapping. The above image shows an MDS ordination of the famous Iris data set used by Fisher in his article on numerical taxonomy (Fisher 1936). This data set contains 150 objects of the three plant species Iris setosa, Iris versicolor and Iris virginica and was collected by Anderson (Anderson 1935). Four variables were measured on 50 objects of each species. These variables were the widths and lengths of sepals and petals of each object. For the above MDS a Euclidean distance was chosen as measurement of the relationships between the single objects. It can be seen that two large groups are distinguishable. The first is formed by the species I. setosa, while the second group is composed of I. versicolor and I. virginica. This means that it is possible to distinguish I. setosa numerically from the other two species by the included variables. Although the distances are shorter, it is obvious that also I. versicolor and I. virginica show some characteristic differences in the size of the measured parameters. This analysis does not tell something about the contribution of the individual variables, but allows an investigation as the whole.

Data types

Before applying a method to a data set some preliminaries have to be controlled. Although Stevens ‘Theory of Scales of Measurement’ (Stevens 1946) is still contrarily discussed and often not sufficient (e.g. Velleman & Wilkinson 1993) the type of data is an important criterion for choosing the appropriate method.

Binary variables

Binary variables can just hold one of two possibilities. These possibilities might be e.g. True or False, Positive or Negative, Recent or Extinct, Abundant or Absent and so on. It is common to code the two possibilities by the values 1 for True and 0 for False. It is also possible to evaluate whether a value is different from a defined zero. Most similarity or dissimilarity methods then use frequencies of the combinations between the two states of the compared objects. Common methods for binary variables are:

Jaccard index
Hamming distance

Continuous variables

Continuous variables represent values of a continuous function that can be positive or negative real numbers and measured in different units. Commonly used methods are:

Bray-Curtis similarity
Canberra distance
Minkowski distance

Ordinal variables

Ordinal variables often represent the order of a selection or coded frequencies. When assigning e.g. arbitrary numbers to enumerated prey organisms we can express a predators preferred selection order as a vector, like (3, 4, 2, 1). This row vector represents an object in a 4-dimensional space. Alternatively they can represent semi-quantitative values (e.g. 0=absent, 1=common, 2=dominant, etc.). Commonly used methods for such variables are:

Hamming distance
Minkowski distance

Mixed variables

The use of variables of different types is a controversial issue, but often encountered in field obeservations. Several suggestions are made how to handle this problem. Depending on the variables it might be applicable to use a different weighting term for each variable (Gower 1971, Gower 1987). As no generalisations can be given, it is up to the investigator to find out the method that delivers trustful and robust results.

Anyhow, it is always recommended to carefully interpret results when using different units in an analysis. It is better to reduce this mixture, as far as possible.

Data properties

Additionally to the consideration of the data type it is necessary to consider some initial assumptions and conditions.

Joint zeros

Several measurements of distance, dissimilarity of similarity are sensitive for joint zeros. This addresses the problem of data records with zeroes appearing in the same variables among compared objects. Having abundance data of species on several stations it may appear that just few individuals of a single species are found. In most cases one is interested in the analysis of differences between the abundant species. Depending on the algorithm such common absences may gain a higher influence in the analysis instead of differences in the sparse species composition. Measures which are sensitive for joint zeros are often not robust enough to be generally applicable (Field et al. 1982). In this case even transformations, like logarithmic or square root, do not help.

Methods like the e.g. Minkowski metric can be sensitive for high joint zero ratios.

Negative values

Results of several measurments of distance, dissimilarity or similarity are only defined, when variables are entirely positive. Especially the in ecology commomly used Bray-Curtis dissimilarity does nomally not deliver reliable results when negative values are processed. This applies also to the Canberra metric (unless values are entirely negative).

Missing values

Measurements of distance, dissimilarity or similarity are established for complete data sets. Most analytical software packages do not allow missing data and terminate the analysis, although results could still be meaningful when one or more readings are missing. Thus, it is strongly recommended to preprocess such data sets. It might often be better to exclude such objects from larger analysis or even the respective variable when missing several times. Another option could be to estimate the values from other readings, although this may create artificial dependensies. This problem increases the more values are missing.

Transformation

Transformations like log or root normalise positively skewed data and reduce the influence of high values in procedures based on dissimilarity indices (Digby & Kempton 1987). Especially when using distance functions variables on different scales or wide ranges can bias analyses. In this case it is advisable to use data transformations or standardisations beforehand. Some authors suggest to commonly use the root-root transformation to reduce the effect of dominating variables (Clarke & Warwick 1994, Field et al. 1982). This approach is difficult as the effect of the transformation depends on the underlying distribution of the variable and may be inconsistent (Quinn & Keough 2004). Therefore Quinn & Keogh (2004) prefer data standardisations. Anyhow, to reduce the influence of variables of different ranges it can by useful to reduce data to binary presence-absence states.

References

Anderson E. (1935): The Irises of the Gaspe Peninsula. Bulletin of the American Iris Society, 59:2–5.

Bray J.R., Curtis J.T. (1957): An ordination of the upland forest communities of Southern Wisconsin. Ecological Monographies 27:325-349.

Clarke K.R., Warwick R.M. (1994): Change in marine communities: An approach to statistical analysis and interpretation. Natural Environment Research Council, UK.

Clarke K.R., Somerfield P.J., Chapman M.G. (2006): On resamblance measures for ecological studies, including taxonomic dissimilarities and a zero-adjusted Bray-Curtis coefficient for denuded assemblages. Journal of Experimental Marine Biology and Ecology 330:55-80

Digby P.G.N., Kempton R.A. (1987): Multivariate analysis of ecological communities. Chapman & Hall, London.

Field J.G., Clarke K.R., Warwick R.M. (1982): A practical strategy for analysing multispecies distribution patterns. Marine Ecology Progress Series 8:37-52.

Fisher R.A. (1936): The use of multiple measurements in taxonomic problems. Annals of Eugenics 7(II):179–188.

Gauch H.G. (1982): Multivariate analysis in community ecology. Cambridge University Press, New York.

Gower J.C. (1971): A general coefficient of similarity and some of its properties. Biometrics 27:857-871.

Gower J.C. (1987): Introduction to ordination techniques. In: Developments in Numerical Ecology. Edited by Legendre and Legendre. NATO ASI Series G 14.

Legendre P., Legendre L. (1998): Numerical ecology. 2^nd Edition, Elsevier, Amsterdam.

Pielou E.C. (1984): The interpretation of ecological data – A primer on classification and ordination. John Wiley & Sons, Inc.

Quinn G.P., Keough M.J. (2004): Experimental design and data analysis for biologists. Cambridge University Press, Reprinted 2004.

Stevens S.S. (1946): On the Theory of Scales of Measurements. Science 103(2684):677-680.

Somerfield P.J., Clarke K.R., Olsgard F. (2002): A comparison of the power of categorical and correlational tests applied to community ecology data from gradient studies. Journal of Animal Ecology 71:581-593.

Velleman P.F., Wilkinson L. (1993): Nominal, ordinal, interval and ratio typologies are misleading. The American Statistician 47:65-72.

Next >

Last Updated on Friday, 18 March 2011 18:13