Dice Coefficient

  The Dice coefficient is a similarity measure related to the Jaccard index.

  For sets X and Y of keywords used in information retrieval, the coefficient may be defined as:

s = \frac{2 | X \cap Y |}{| X | + | Y |}

  When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:

s = \frac{2 n_t}{n_x + n_y}

  where n_t is the number of character bigrams found in both strings, n_x is the number of bigrams in string x, and n_y is the number of bigrams in string y. For example, to calculate the similarity between:

  night

  nacht

  we would find the set of bigrams in each word:

  {ni,ig,gh,ht}

  {na,ac,ch,ht}

  Each set has four elements, and the intersection of these two sets has only one element: ht.

  Inserting these numbers into the formula, we calculate s = (2 · 1) / (4 + 4) = 0.25.
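
  The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `bigrams` and `dice_coefficient` are made up for this example, bigrams are treated as a set (so repeated bigrams count once, as in the worked example), and returning 0.0 when neither string has any bigrams is an assumption, since the formula is undefined in that case.

```python
def bigrams(s: str) -> set:
    """Return the set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_coefficient(x: str, y: str) -> float:
    """Dice coefficient of two strings, computed over their bigram sets."""
    bx, by = bigrams(x), bigrams(y)
    if not bx and not by:
        # Assumption: the formula divides by zero here, so return 0.0.
        return 0.0
    # s = 2 * n_t / (n_x + n_y)
    return 2 * len(bx & by) / (len(bx) + len(by))


print(bigrams("night"))                    # the four bigrams ni, ig, gh, ht
print(dice_coefficient("night", "nacht"))  # 0.25
```

  Because the bigrams are stored in sets, a string such as "aaa" contributes the single bigram "aa"; a multiset-based variant would count it twice and can give different scores.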
