Home Algorithms Similarity Jaccard similarity
17 | 10 | 2017
Jaccard similarity PDF Print E-mail
Algorithms - Similarity
Written by Jan Schulz   
Thursday, 15 May 2008 19:26

Jaccard similarity

 

Objective

The Jaccard similarity (Jaccard 1902, Jaccard 1912) is a common index for binary variables. It is defined as the quotient between the intersection and the union of the pairwise compared variables among two objects.

 

Equation

 

Equation of the Jaccard similarity

In the equation dJAD is the Jaccard distance between the objects i and j. For two data records with n binary variables y the variable index k ranges from 0 to n-1. Four different combinations between yi,k and yj,k can be distinguished when comparing binary variables. These combinations are (0/0), (0/1), (1/0) and (1/1). The sums of these combinations can be grouped by:

  • J01: the total number of variables being 0 in yi and 1 in yj.
  • J10: the total number of variables being 1 in yi and 0 in yj.
  • J11: the total number of variables being 1 in both yi and yj.
  • J00: the total number of variables being 0 in both yi and yj.

As each paired variable belongs to one of these groups it can be easily seen that:

J00 + J01 + J10 + J11 = n

As the Jaccard similarity is based on joint presence, J00 is discarded.

The Jaccard dissimilarity is defined as dJAD = 1- dJAS.

In some cases the Jaccard similarity is computed as dJAS=2dBCD/(1+dBCD), where dBCD is the Bray–Curtis dissimilarity. This equation does not reduce values to binary states. Thus, results are different when using on the one hand a presence/absence matrix and on the other hand a count matrix. The results are the same, when the count matrix is converted to a binary matrix beforehand.

 

Synonyms

The Jaccard similarity or Jaccard similarity coefficient is often called Jaccard index. Anyhow, the term Jaccard index is sometimes used for the Jaccard dissimilarity, while the Jaccard dissimilarity is sometimes called Jaccard distance. It can be observed that the terms Jaccard similarity and Jaccard dissimilarity are not precisely separated and sometimes seem to be used synonymical or confused, although results represent opposite meanings. Thus, one should carefully inspect the intention of the analysis.

 

Usage

The Jaccard similarity can be used, when intersted in binary differences between two or more objects. Especially in ecological research investigations often focus on the presence/absence between several sites. When interested in characterising compared sites by the possibility of species to settle there abundances are often negligible.

 

Algorithm

The algorithm controls whether the data input matrix is rectangular or not. If not the function returns FALSE and a defined, but empty output matrix. When the matrix is rectangular the Jaccard similarity will be calculated. Therefore the dimensions of the respective arrays of the output matrix are set, and the titles for the rows and columns set. As the result is a square matrix, which is mirrored along the diagonal only values for one triangular part and the diagonal are computed. When errors occur during computation the function returns FALSE.

For practical reasons the implementation of the algorithm does not necessarily need true binary data. It distinguishes whether a value is 0 or within a certain threshold close to it. In this case it will be interpreted as logical FALSE, e.g. absence. Values being larger than the given threshold are interpreted as logical TRUE, e.g. presence. Thus, it is possible without further preparation to pass a count matrix to the function. As the given threshold affects all values equally it does not alter its metric characteristic.

To calculate the Jaccard dissimilarity the Jaccard similarity matrix is computed first and thereafter transformed.

 

Source

Function dist_JaccardSimilarity (InputMatrix : t2dVariantArrayDouble; aZeroThresh: Double; Var OutputMatrix : t2dVariantArrayDouble) : Boolean;
// The function dist_JaccardSimilarity calculates the Jaccard similarity between
// several cases based on binary values (e.g. presences/absences or TRUE/FALSE).
// The variable aZeroThresh defines the minumum value, that is required to
// interpret a variable as being logically TRUE.
// The cases are expected in the rows. The variables are expected
// in the columns. Function returns FALSE if at least one cell can not be
// calculated. The result matrix is returned in OutputMatrix.
// (c) Dr. Jan Schulz, 05.May 2008; www.code10.info
Var OutputMatrixSize : Integer;
InputCols : Integer;
InputRows : Integer;
RunnerY : Integer;
RunnerX : Integer;
J01 : Integer;
J10 : Integer;
J11 : Integer;
i : Integer;
Denominator : Integer;
Quotient : Double;
FirstVal : Boolean;
SecondVal : Boolean;
Begin
// if one dimension is zero or matrix not rectangular
If Not mtx_IsRectangular (InputMatrix, InputRows, InputCols) THen
Begin
//create an empty matrix, return FALSE and exit
mtx_Create (OutputMatrix, 1, 1, 0, 'Erroneous Jaccard similarity matrix');
dist_JaccardSimilarity := False;
Exit;
end;

// Threshold for binary level needs to be positive
aZeroThresh := Abs (aZeroThresh);

// let's expect the best case ...
dist_JaccardSimilarity := True;

// define and set the row dimension of the result matrix
OutputMatrixSize := High (InputMatrix.Cells) + 1;
SetLength (OutputMatrix.Cells, OutputMatrixSize);

// create the column dimension of the array
For RunnerY := Low (OutputMatrix.Cells) to High (Outputmatrix.Cells) do
Begin
SetLength (OutputMatrix.Cells [RunnerY], OutputMatrixSize);
end;

// define title of matrix
OutputMatrix.MatrixName := 'Jaccard similarity matrix';

// Set Row/Col-Title for the new matrix
SetLength (OutputMatrix.RowTitle, OutputMatrixSize);
SetLength (OutPutMatrix.ColTitle, OutPutMatrixSize);

For RunnerY := Low (InputMatrix.RowTitle) to High (InputMatrix.RowTitle) do
Begin
// names for rows and columns are the same in this triangualary matrix
OutputMatrix.RowTitle [RunnerY] := InputMatrix.RowTitle [RunnerY];
OutputMatrix.ColTitle [RunnerY] := InputMatrix.RowTitle [RunnerY];
end;


// compare every object
For RunnerY := Low (OutputMatrix.Cells) to High (OutputMatrix.Cells) do
Begin
//with every other object
For RunnerX := Low (OutputMatrix.Cells) to RunnerY do
Begin
J01 := 0;
J10 := 0;
J11 := 0;

//include all variables in analysis
For i := 0 to High (InputMatrix.Cells [0]) do
Begin
// directly convert to binary variables
FirstVal := Abs (InputMatrix.Cells [RunnerX, i]) > aZeroThresh;
SecondVal := Abs (InputMatrix.Cells [RunnerY, i]) > aZeroThresh;

If FirstVal And SecondVal THen
Begin
J11 := J11 + 1;
end
Else
Begin
If FirstVal Then J10 := J10 + 1;
If SecondVal Then J01 := J01 + 1;
end;
end;

Denominator := J01 + J10 + J11;
If Denominator > 0 THen Quotient := J11 / Denominator
Else
Begin
Quotient := NaN;
dist_JaccardSimilarity := False;
end;

// set the calculated value on both sides of diagonal and diagonal itself
OutputMatrix.Cells [RunnerX, RunnerY] := Quotient;
OutputMatrix.Cells [RunnerY, RunnerX] := Quotient;
end;
end;
end;


Function dist_JaccardDissimilarity (InputMatrix : t2dVariantArrayDouble; aZeroThresh: Double; Var OutputMatrix : t2dVariantArrayDouble) : Boolean;
// The function dist_JaccardDissimilarity calculates the Jaccard dissimilarity
// matrix between several cases, which are expected in the rows. The variables are
// expected in the columns. Function returns FALSE if at least one cell can not be
// calculated. The result matrix is returned in OutputMatrix. This function depends
// on the function CalcCanberraDissimilarityMatrix
// (c) Dr. Jan Schulz, 05.May 2008; www.code10.info
Var RunnerX : Integer;
RunnerY : Integer;
Begin
// calculate the Canberra Dissimilarity matrix
Result := dist_JaccardSimilarity (InputMatrix, aZeroThresh, OutputMatrix);

// convert the similarity into a dissimilarity matrix
For RunnerY := Low (OutputMatrix.Cells) to High (OutputMatrix.Cells) do
Begin
For RunnerX := Low (OutputMatrix.Cells [RunnerY]) to High (OutputMatrix.Cells [RunnerY]) do
Begin
OutPutMatrix.Cells [RunnerY, RunnerX] := 1 - OutPutMatrix.Cells [RunnerY, RunnerX];
end;
end;

If Result THen OutputMatrix.MatrixName := 'Jaccard dissimilarity matrix'
Else OutputMatrix.MatrixName := 'Erroneous Jaccard dissimilarity matrix';
end;

Example

For a data matrix aInputMatrix of the type t2dVariantArrayDouble, populated with:

Data

Var1

Var2

Var3

Case1

1

1

1

Case2

1

1

0

Case3

2

2

2

Case4

10

10

10

Case5

11

11

11

Case6

10

5

0

the call of:

aBooleanVar := dist_JaccardSimilarity (aInputMatrix, aOutputMatrix);

returns the respective Jaccard similarity matrix in aOutputMatrix:

Jaccard

similarity

Case1

Case2

Case3

Case4

Case5

Case6

Case1

1

0.667

1

1

1

0.667

Case2

0.667

1

0.667

0.667

0.667

1

Case3

1

0.667

1

1

1

0.667

Case4

1

0.667

1

1

1

0.667

Case5

1

0.667

1

1

1

0.667

Case6

0.667

1

0.667

0.667

0.667

1

Based on the binary analysis it can be seen that only objects having a zero value among their variables show a lower similarity. This zero is interpreted as logical FALSE. All other objects are completely similar in this analysis, as their values are interpreted as logical TRUE.

 

Literature

Jaccard P. (1902): Lois de distribution florale. Bulletin de la Socíeté Vaudoise des Sciences Naturelles 38:67-130.

Jaccard P. (1912): The distribution of the flora in the alpine zone. New Phytologist 11(2):37-50.

 

Last Updated on Friday, 18 March 2011 18:18
 
Sponsored Links
Polls
Where did you find helpful information on this site?