FUJI: Fuzzy Jaccard Index: A robust comparison of ordered lists

This repository contains the code (and many feature rankings computed on over twenty real-life benchmark data sets) from the paper Fuzzy Jaccard Index: A robust comparison of ordered lists (also available on arXiv).

This code is distributed under the Creative Commons Attribution license (CC BY 4.0), so the authors would greatly appreciate it if you acknowledged its use by citing the paper above (the corresponding BibTeX entry is shown below).

@article{fuji,
    title = {Fuzzy Jaccard Index: A robust comparison of ordered lists},
    journal = {Applied Soft Computing},
    volume = {113},
    pages = {107849},
    year = {2021},
    issn = {1568-4946},
    doi = {https://doi.org/10.1016/j.asoc.2021.107849},
    url = {https://www.sciencedirect.com/science/article/pii/S1568494621007717},
    author = {Matej Petkovi\'{c} and Bla\v{z} \v{S}krlj and Dragi Kocev and Nikola Simidjievski},
    keywords = {Ordered lists, Fuzzy scores, Feature ranking, Information retrieval, Jaccard index}
}

Example

The code is easy to use and implements the FUJI score (fuzzy_jaccard), as well as all the baselines that we compare to (jaccard, hamming, pog, npog, kuncheva, wald, lustgarten, krizek, cwrel, pearson, correlation, fuzzy_gamma).

For example, once we obtain the rankings r and s, e.g.,

r = [1.0, 0.9, 0.3, 0.14, 0.1]
s = [0.8, 0.9, 0.3, 0.14, 0.1]

(where r[i] and s[i] give the importance of the i-th feature), FUJI can be computed as

curve, auc = compute_similarity(r, s, "fuzzy_jaccard")

Here, curve is a list containing the FUJI values at each point, and auc is the area under this curve. For other examples, see main.py.
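To give a feel for what such a similarity curve looks like, the following minimal sketch computes the crisp (non-fuzzy) Jaccard index between the top-k feature sets of the two rankings, for every k. It is not the repository's implementation and it is not FUJI itself: the helper names top_k_indices and jaccard_curve are hypothetical, ties are broken by feature index as an assumption, and the plain mean at the end is only a simple stand-in for an area under the curve, which the repository may compute differently.

```python
def top_k_indices(scores, k):
    """Indices of the k highest-scoring features (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def jaccard_curve(r, s):
    """Crisp Jaccard index of the top-k feature sets of r and s, for k = 1..n."""
    if len(r) != len(s):
        raise ValueError("both rankings must score the same features")
    curve = []
    for k in range(1, len(r) + 1):
        a = top_k_indices(r, k)
        b = top_k_indices(s, k)
        curve.append(len(a & b) / len(a | b))
    return curve


r = [1.0, 0.9, 0.3, 0.14, 0.1]
s = [0.8, 0.9, 0.3, 0.14, 0.1]

curve = jaccard_curve(r, s)
# Simple summary of the curve (an assumption, not the repository's auc).
mean_value = sum(curve) / len(curve)
print(curve, mean_value)
```

With these two rankings, the crisp curve starts at 0 for k = 1, because the two lists disagree on which feature is the single most important one, even though their scores for the top two features are close.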

Dependencies

The code implements many similarity scores, some of which require numpy or scipy. Optionally, tqdm can be used to show progress.

.fimp files

The structure of the files is the following:

<meta data (if available)>
<fimp table>

<fimp table> consists of four columns:

  • index of the feature in the dataset
  • name of the feature
  • rank of the feature (>= 1)
  • feature relevance score

The values are tab-separated.
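The layout above could be read with a short sketch like the one below. It rests on several assumptions that are not confirmed by this README: that each table row has exactly four tab-separated fields, that the index and rank are plain integers and the relevance score is a plain number, and that any line not matching this shape belongs to the metadata. The function name parse_fimp and the sample content are hypothetical and invented purely for illustration.

```python
def parse_fimp(text):
    """Split .fimp-style text into metadata lines and table rows.

    Assumption: a table row has four tab-separated fields
    (index, name, rank, relevance); every other non-empty line
    is treated as metadata.
    """
    meta, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        try:
            if len(fields) != 4:
                raise ValueError("not a table row")
            row = (int(fields[0]), fields[1], int(fields[2]), float(fields[3]))
        except ValueError:
            meta.append(line)
            continue
        rows.append(row)
    return meta, rows


# Invented sample content, for illustration only.
sample = "some metadata line\n1\tage\t2\t0.31\n2\theight\t1\t0.87\n"
meta, rows = parse_fimp(sample)

# Relevance scores in feature-index order, usable as a ranking r.
r = [score for _, _, _, score in sorted(rows)]
print(meta, rows, r)
```

Sorting the rows by feature index before extracting the scores keeps r[i] aligned with the i-th feature, which is the convention used in the example above.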