Home 18 | 07 | 2019
Feed Display
 heise online News Nachrichten nicht nur aus der Welt der Computer Latest News
Popular
 Jaccard similarity   Algorithms - Similarity
Written by Jan Schulz
Thursday, 15 May 2008 19:26 # Jaccard similarity

## Objective

The Jaccard similarity (Jaccard 1902, Jaccard 1912) is a common index for binary variables. It is defined as the quotient between the intersection and the union of the pairwise compared variables among two objects.

## Equation In the equation dJAD is the Jaccard distance between the objects i and j. For two data records with n binary variables y the variable index k ranges from 0 to n-1. Four different combinations between yi,k and yj,k can be distinguished when comparing binary variables. These combinations are (0/0), (0/1), (1/0) and (1/1). The sums of these combinations can be grouped by:

• J01: the total number of variables being 0 in yi and 1 in yj.
• J10: the total number of variables being 1 in yi and 0 in yj.
• J11: the total number of variables being 1 in both yi and yj.
• J00: the total number of variables being 0 in both yi and yj.

As each paired variable belongs to one of these groups it can be easily seen that:

J00 + J01 + J10 + J11 = n

As the Jaccard similarity is based on joint presence, J00 is discarded.

The Jaccard dissimilarity is defined as dJAD = 1- dJAS.

In some cases the Jaccard similarity is computed as dJAS=2dBCD/(1+dBCD), where dBCD is the Bray–Curtis dissimilarity. This equation does not reduce values to binary states. Thus, results are different when using on the one hand a presence/absence matrix and on the other hand a count matrix. The results are the same, when the count matrix is converted to a binary matrix beforehand.

## Synonyms

The Jaccard similarity or Jaccard similarity coefficient is often called Jaccard index. Anyhow, the term Jaccard index is sometimes used for the Jaccard dissimilarity, while the Jaccard dissimilarity is sometimes called Jaccard distance. It can be observed that the terms Jaccard similarity and Jaccard dissimilarity are not precisely separated and sometimes seem to be used synonymical or confused, although results represent opposite meanings. Thus, one should carefully inspect the intention of the analysis.

## Usage

The Jaccard similarity can be used, when intersted in binary differences between two or more objects. Especially in ecological research investigations often focus on the presence/absence between several sites. When interested in characterising compared sites by the possibility of species to settle there abundances are often negligible.

## Algorithm

The algorithm controls whether the data input matrix is rectangular or not. If not the function returns FALSE and a defined, but empty output matrix. When the matrix is rectangular the Jaccard similarity will be calculated. Therefore the dimensions of the respective arrays of the output matrix are set, and the titles for the rows and columns set. As the result is a square matrix, which is mirrored along the diagonal only values for one triangular part and the diagonal are computed. When errors occur during computation the function returns FALSE.

For practical reasons the implementation of the algorithm does not necessarily need true binary data. It distinguishes whether a value is 0 or within a certain threshold close to it. In this case it will be interpreted as logical FALSE, e.g. absence. Values being larger than the given threshold are interpreted as logical TRUE, e.g. presence. Thus, it is possible without further preparation to pass a count matrix to the function. As the given threshold affects all values equally it does not alter its metric characteristic.

To calculate the Jaccard dissimilarity the Jaccard similarity matrix is computed first and thereafter transformed.

## Source

`Function dist_JaccardSimilarity (InputMatrix : t2dVariantArrayDouble; aZeroThresh: Double; Var OutputMatrix : t2dVariantArrayDouble) : Boolean;// The function dist_JaccardSimilarity calculates the Jaccard similarity between// several cases based on binary values (e.g. presences/absences or TRUE/FALSE).// The variable aZeroThresh defines the minumum value, that is required to// interpret a variable as being logically TRUE.// The cases are expected in the rows. The variables are expected// in the columns. Function returns FALSE if at least one cell can not be// calculated. The result matrix is returned in OutputMatrix.// (c) Dr. Jan Schulz, 05.May 2008; www.code10.infoVar  OutputMatrixSize : Integer;     InputCols        : Integer;     InputRows        : Integer;     RunnerY          : Integer;     RunnerX          : Integer;     J01              : Integer;     J10              : Integer;     J11              : Integer;     i                : Integer;     Denominator      : Integer;     Quotient         : Double;     FirstVal         : Boolean;     SecondVal        : Boolean;Begin  // if one dimension is zero or matrix not rectangular  If Not mtx_IsRectangular (InputMatrix, InputRows, InputCols) THen  Begin    //create an empty matrix, return FALSE and exit    mtx_Create (OutputMatrix, 1, 1, 0, 'Erroneous Jaccard similarity matrix');    dist_JaccardSimilarity := False;    Exit;  end;  // Threshold for binary level needs to be positive  aZeroThresh := Abs (aZeroThresh);  // let's expect the best case ...  dist_JaccardSimilarity := True;  // define and set the row dimension of the result matrix  OutputMatrixSize := High (InputMatrix.Cells) + 1;  SetLength (OutputMatrix.Cells, OutputMatrixSize);  // create the column dimension of the array  For RunnerY := Low (OutputMatrix.Cells) to High (Outputmatrix.Cells) do  Begin    SetLength (OutputMatrix.Cells [RunnerY], OutputMatrixSize);  end;  // define title of matrix  OutputMatrix.MatrixName := 'Jaccard similarity matrix';   // Set Row/Col-Title for the new matrix  SetLength (OutputMatrix.RowTitle, OutputMatrixSize);  SetLength (OutPutMatrix.ColTitle, OutPutMatrixSize);  For RunnerY := Low (InputMatrix.RowTitle) to High (InputMatrix.RowTitle) do  Begin    // names for rows and columns are the same in this triangualary matrix    OutputMatrix.RowTitle [RunnerY] := InputMatrix.RowTitle [RunnerY];    OutputMatrix.ColTitle [RunnerY] := InputMatrix.RowTitle [RunnerY];  end;  // compare every object  For RunnerY := Low (OutputMatrix.Cells) to High (OutputMatrix.Cells) do  Begin    //with every other object    For RunnerX := Low (OutputMatrix.Cells) to RunnerY do    Begin      J01 := 0;      J10 := 0;      J11 := 0;      //include all variables in analysis      For i := 0 to High (InputMatrix.Cells ) do      Begin        // directly convert to binary variables        FirstVal  := Abs (InputMatrix.Cells [RunnerX, i]) > aZeroThresh;        SecondVal := Abs (InputMatrix.Cells [RunnerY, i]) > aZeroThresh;        If FirstVal And SecondVal THen        Begin          J11 := J11 + 1;        end        Else        Begin          If FirstVal Then J10 := J10 + 1;          If SecondVal Then J01 := J01 + 1;        end;      end;      Denominator := J01 + J10 + J11;      If Denominator > 0 THen Quotient := J11 / Denominator                         Else                         Begin                           Quotient := NaN;                           dist_JaccardSimilarity := False;                         end;      // set the calculated value on both sides of diagonal and diagonal itself      OutputMatrix.Cells [RunnerX, RunnerY] := Quotient;      OutputMatrix.Cells [RunnerY, RunnerX] := Quotient;    end;  end;end;Function dist_JaccardDissimilarity (InputMatrix : t2dVariantArrayDouble; aZeroThresh: Double; Var OutputMatrix : t2dVariantArrayDouble) : Boolean;// The function dist_JaccardDissimilarity calculates the Jaccard dissimilarity// matrix between several cases, which are expected in the rows. The variables are// expected in the columns. Function returns FALSE if at least one cell can not be// calculated. The result matrix is returned in OutputMatrix. This function depends// on the function CalcCanberraDissimilarityMatrix// (c) Dr. Jan Schulz, 05.May 2008; www.code10.infoVar RunnerX : Integer;    RunnerY : Integer;Begin  // calculate the Canberra Dissimilarity matrix  Result := dist_JaccardSimilarity (InputMatrix, aZeroThresh, OutputMatrix);  // convert the similarity into a dissimilarity matrix  For RunnerY := Low (OutputMatrix.Cells) to High (OutputMatrix.Cells) do  Begin    For RunnerX := Low (OutputMatrix.Cells [RunnerY]) to High (OutputMatrix.Cells [RunnerY]) do    Begin      OutPutMatrix.Cells [RunnerY, RunnerX] := 1 - OutPutMatrix.Cells [RunnerY, RunnerX];    end;  end;  If Result THen OutputMatrix.MatrixName := 'Jaccard dissimilarity matrix'            Else OutputMatrix.MatrixName := 'Erroneous Jaccard dissimilarity matrix';end; `

## Example

For a data matrix aInputMatrix of the type t2dVariantArrayDouble, populated with:

 Data Var1 Var2 Var3 Case1 1 1 1 Case2 1 1 0 Case3 2 2 2 Case4 10 10 10 Case5 11 11 11 Case6 10 5 0

the call of:

aBooleanVar := dist_JaccardSimilarity (aInputMatrix, aOutputMatrix);

returns the respective Jaccard similarity matrix in aOutputMatrix:

 Jaccard similarity Case1 Case2 Case3 Case4 Case5 Case6 Case1 1 0.667 1 1 1 0.667 Case2 0.667 1 0.667 0.667 0.667 1 Case3 1 0.667 1 1 1 0.667 Case4 1 0.667 1 1 1 0.667 Case5 1 0.667 1 1 1 0.667 Case6 0.667 1 0.667 0.667 0.667 1

Based on the binary analysis it can be seen that only objects having a zero value among their variables show a lower similarity. This zero is interpreted as logical FALSE. All other objects are completely similar in this analysis, as their values are interpreted as logical TRUE.

## Literature

Jaccard P. (1902): Lois de distribution florale. Bulletin de la Socíeté Vaudoise des Sciences Naturelles 38:67-130.

Jaccard P. (1912): The distribution of the flora in the alpine zone. New Phytologist 11(2):37-50.

Last Updated on Friday, 18 March 2011 18:18 