This repository contains the code (and many feature rankings computed on over twenty real-life benchmark data sets) from the paper Fuzzy Jaccard Index: A robust comparison of ordered lists (also available on arXiv).
This code is distributed under the Creative Commons Attribution license (CC BY 4.0), so the authors would greatly appreciate it if you acknowledged its use by citing the paper above (the corresponding BibTeX entry is shown below).
@article{fuji,
title = {Fuzzy Jaccard Index: A robust comparison of ordered lists},
journal = {Applied Soft Computing},
volume = {113},
pages = {107849},
year = {2021},
issn = {1568-4946},
doi = {https://doi.org/10.1016/j.asoc.2021.107849},
url = {https://www.sciencedirect.com/science/article/pii/S1568494621007717},
author = {Matej Petkovi\'{c} and Bla\v{z} \v{S}krlj and Dragi Kocev and Nikola Simidjievski},
keywords = {Ordered lists, Fuzzy scores, Feature ranking, Information retrieval, Jaccard index}
}
The code is easy to use and implements the FUJI score (fuzzy_jaccard), as well as all the baselines that we compare against (jaccard, hamming, pog, npog, kuncheva, wald, lustgarten, krizek, cwrel, pearson, correlation, fuzzy_gamma).
For example, once we obtain the rankings r and s, e.g.,
r = [1.0, 0.9, 0.3, 0.14, 0.1]
s = [0.8, 0.9, 0.3, 0.14, 0.1]
(where r[i] and s[i] give the importance of the i-th feature), FUJI can be computed as
curve, auc = compute_similarity(r, s, "fuzzy_jaccard")
Here, curve is a list containing the FUJI values at each point, and auc is the area under this curve. For more examples, see main.py.
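To make the idea of a similarity curve more concrete, here is a minimal sketch of a crisp (non-fuzzy) top-k Jaccard curve for the same two rankings. It is not the repository's implementation: the function name top_k_jaccard_curve is hypothetical, and both the choice of points (k = 1, ..., n) and the tie-breaking by feature index are assumptions made for illustration.

```python
def top_k_jaccard_curve(r, s):
    """Crisp Jaccard index between the top-k feature sets of r and s, for k = 1..n.

    Illustrative sketch only; not the code behind compute_similarity.
    """
    if len(r) != len(s):
        raise ValueError("both rankings must score the same features")
    n = len(r)
    # Feature indices sorted by decreasing importance.
    # Assumption: ties are broken by feature index (sorted is stable).
    order_r = sorted(range(n), key=lambda i: -r[i])
    order_s = sorted(range(n), key=lambda i: -s[i])

    curve = []
    top_r, top_s = set(), set()
    for k in range(n):
        # Grow both top-k sets by one feature and compare them.
        top_r.add(order_r[k])
        top_s.add(order_s[k])
        curve.append(len(top_r & top_s) / len(top_r | top_s))
    return curve


r = [1.0, 0.9, 0.3, 0.14, 0.1]
s = [0.8, 0.9, 0.3, 0.14, 0.1]
jaccard_curve = top_k_jaccard_curve(r, s)
print(jaccard_curve)  # [0.0, 1.0, 1.0, 1.0, 1.0]
```

In this sketch the first value is 0 because the two rankings disagree on the single most important feature, even though the scores of the two top features are close to each other.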
The code implements many similarity scores, some of which require numpy or scipy. To show progress, tqdm can be used.
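One common way to keep a progress bar optional is to fall back to a no-op wrapper when tqdm is not installed. The snippet below is a sketch of that general pattern and an assumption about how one might use it; it is not taken from the repository.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback: behave like tqdm(iterable) but without any progress output.
    def tqdm(iterable, **kwargs):
        return iterable


# Works the same way whether or not tqdm is installed.
squares = [k * k for k in tqdm(range(5))]
```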
The feature ranking files are structured as follows:
<meta data (if available)>
<fimp table>
The <fimp table> consists of four columns:
- index of the feature in the dataset
- name of the feature
- rank of the feature (>= 1)
- feature relevance score
The values are tab-separated.
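The format described above can be read with a few lines of Python. The following is a minimal sketch under stated assumptions: the function name parse_fimp_table and the sample data are hypothetical, the index and rank columns are assumed to be integers, and any line that does not consist of four tab-separated fields with numeric index, rank, and score is assumed to be metadata or a header and is skipped.

```python
def parse_fimp_table(lines):
    """Parse tab-separated (index, name, rank, score) rows from an iterable of lines."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # assumption: not a table row (e.g. metadata)
        try:
            index = int(parts[0])
            rank = int(parts[2])
            score = float(parts[3])
        except ValueError:
            continue  # assumption: a header or other non-numeric line
        rows.append((index, parts[1], rank, score))
    return rows


# Hypothetical file content, used only to illustrate the format.
sample_lines = [
    "hypothetical metadata line\n",
    "0\tfeature_a\t1\t1.0\n",
    "1\tfeature_b\t2\t0.9\n",
    "2\tfeature_c\t3\t0.3\n",
]
rows = parse_fimp_table(sample_lines)

# Relevance scores ordered by feature index, i.e. a list like r or s above.
scores = [score for _, _, _, score in sorted(rows)]
print(scores)  # [1.0, 0.9, 0.3]
```

For a real file, the same function can be applied to an open file object, e.g. parse_fimp_table(open(path, encoding="utf-8")), where path is whichever ranking file you want to load.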