|Algorithms - Similarity|
|Written by Jan Schulz|
|Thursday, 13 September 2007 08:21|
The Canberra distance is a metric function often used for data scattered around an origin. It was introduced in 1966 (Lance & Williams 1966) and is today mainly used in the form of 1967 (Lance & Williams 1967).
The Canberra metric is similar to the Manhattan distance (which itself is a special form of the Minkowski distance). The distinction is that the absolute difference between the variables of the two objects is divided by the sum of the absolute variable values prior to summing. The generalised equation is given in the form:
This is a slightly modified form compared to the original form given by Lance & Williams (1966) and was suggested by Adkins (reference in Lance & Williams 1967). In the equation dCAD is the Canberra distance between the two objects i and j, k is the index of a variable and n is the total number of variables y.
No common synonyms are established for the Canberra metric.
In the original form of the Canberra metric data must not be signed. The modified form according to Adkins (in Lance & Williams 1967) has the property that the result becomes unity when the variables are of opposite sign. It is useful in the special case where signs represent differences in kind rather than in degree (Lance & Williams 1967). Anyhow, it is mainly used for values > 0. This metric is easily biased for measures around the origin and very sensitive for values close to 0, where it is more sensitive to proportional than to absolute differences (Lance & Williams 1967). This feature becomes more apparent in higher dimensional space, respectively an increasing number of variables. It is in turn less influenced than the Manhattan distance by variables with high values (Krebs 1989). As a very sensitive measurement it is applicable to identify deviations from normal readings (e.g. Emran & Ye 2001).
The algorithm controls whether the data input matrix is rectangular or not. If not the function returns FALSE and a defined, but empty output matrix. When the matrix is rectangular the Canberra distance is calculated. Therefore the dimensions of the respective arrays of the output matrix and the titles for the rows and columns set. As the result is a square matrix, which is mirrored along the diagonal only values for one triangular part and the diagonal are computed. When errors occur during computation the function returns FALSE.
Function dist_Canberra (InputMatrix : t2dVariantArrayDouble; Var OutputMatrix : t2dVariantArrayDouble) : Boolean;
For a data matrix aInputMatrix of the type t2dVariantArrayDouble, populated with:
the call of:
aBooleanVar := dist_Canberra (aInputMatrix, aOutputMatrix);
returns the respective Canberra distance matrix in aOutputMatrix:
Although the Euclidean distance between the objects Case1 and Case3 is the same as between Case4 and Case5, the Canberra distance indicates a higher distance between the objects Case1 and Case3. This is due to the fact that the analysis is more sensitive for values closer to the origin. The distance between Case2 and Case6 is shorter, as the joint zeroes in variable 3 are omitted.
Emran S.M., Ye N. (2001): Robustness of Canberra metric in computer intrusion detection. Proceedings of the 2001 IEEE, Workshop on Information Assurance and Security, United States Military Academy, West Point, New York, 5-6 June, 2001.
Krebs C.J. (1989): Ecological Methodology. Harper-Collins, New-York.
Lance G.N., Williams W.T. (1966): Computer programs for hierarchical polythetic classification (“similarity analysis”). Computer Journal 9:60-64.
Lance G.N., Williams W.T. (1967): Mixed-data classificatory programs, I.) Agglomerative Systems. Australian Computer Journal 1:15-20.
|Last Updated on Friday, 18 March 2011 18:16|