This repo provides methods to (1) estimate the query cost and (2) measure the query hardness of a given query on graph-based ANN indexes (e.g., HNSW, NSG, KGraph). This hardness measure is what we propose as $Steiner$-hardness.
We've prepared several unbiased workloads for common public ANN datasets. They can be directly downloaded from the `workloads/` folder.
- Eigen == 3.4.0
  - Download the Eigen library from https://gitlab.com/libeigen/eigen/-/archive/3.4.0/eigen-3.4.0.tar.gz.
  - Unzip it and move the `Eigen` folder to `./src/`.
- OpenMP
- Ensure you have the following datasets and queries ready:
  - dataset named `{dataset}_base.fvecs`
  - queries named `{dataset}_query.fvecs`
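Both files use the standard `.fvecs` layout: each vector is stored as a little-endian `int32` dimension `d` followed by `d` `float32` values. A minimal reader sketch (the helper name `read_fvecs` is ours, not part of this repo):

```python
import numpy as np

def read_fvecs(path):
    """Read an .fvecs file into an (n, d) float32 array.

    Each record is an int32 dimension d followed by d float32 values.
    """
    raw = np.fromfile(path, dtype=np.int32)
    if raw.size == 0:
        return np.empty((0, 0), dtype=np.float32)
    d = int(raw[0])
    assert raw.size % (d + 1) == 0, "file size inconsistent with dimension"
    # Drop the leading dimension column, then reinterpret the payload as floats.
    return raw.reshape(-1, d + 1)[:, 1:].copy().view(np.float32)
```

For example, `base = read_fvecs("sift_base.fvecs")` yields the base vectors as an `(n, d)` NumPy matrix.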
- Clone the Efanna library. You can clone it to any location; then put the cloned path into the `efanna_directory` field in the configuration file `config.py`:

      git clone https://github.com/ZJULearning/efanna_graph
- Install g++ 9 or higher.
- Install the required Python libraries:

      cd python-script
      pip install -r requirements.txt
Edit the `python-script/config.py` file to set your configurations.

Navigate to the `python-script` folder and run the script to execute all steps in sequence:

    python run.py --command preprocess --config-file config.py
This step includes:
- Generating the kgraph using the `efanna` library
- Generating the graph of `mrng`
- Calculating the reverse graph of `mrng`
- Sampling and training the GMM model
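The GMM step above can be sketched with scikit-learn's `GaussianMixture` (the sample size and component count below are illustrative placeholders; the real settings come from `config.py`):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_on_sample(base, n_samples=10_000, n_components=16, seed=0):
    """Fit a GMM on a random sample of the base vectors (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Subsample without replacement so fitting stays cheap on large datasets.
    idx = rng.choice(len(base), size=min(n_samples, len(base)), replace=False)
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=seed)
    gmm.fit(base[idx])
    return gmm
```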
    python run.py --command hardness --config-file config.py

This step includes:
- Constructing the ground truth of `query.fvecs`
- Calculating hardness
    python run.py --command benchmark --config-file config.py

This step includes:
- Generating queries using the GMM model trained in the preprocess stage
- Constructing the ground truth for the benchmark
- Calculating the hardness of the benchmark queries
- Constructing the benchmark
We provide the code to build NSW, HNSW, MRNG, and SSG indexes; see `script/index_nsw.sh`, `script/index_hnsw.sh`, and `script/index_mrng.sh`.
We use `python-script/ground_truth.py` to achieve this. Other exact kNN methods also work.
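If you prefer not to use the provided script, exact ground truth can be computed by brute force; a minimal sketch assuming Euclidean distance (the helper name is ours):

```python
import numpy as np

def exact_knn(base, queries, k):
    """Exact k-nearest neighbours by brute force.

    Uses the expansion ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2 and drops the
    per-query constant ||q||^2, which does not change the ranking.
    Returns an (n_queries, k) array of base indices, nearest first.
    """
    scores = (base ** 2).sum(axis=1)[None, :] - 2.0 * queries @ base.T
    top = np.argpartition(scores, k - 1, axis=1)[:, :k]        # unordered top-k
    order = np.take_along_axis(scores, top, axis=1).argsort(axis=1)
    return np.take_along_axis(top, order, axis=1)              # sorted top-k
```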
To get the reversed graph, use `python-script/get_rev_graph.py`.
You can then use `script/get_me.sh` to get the ME (minimum effort) of queries.
If you want to compare the ME with the actual query effort (or reproduce the results of Figures 8 and 9 in the paper), you need to obtain the actual effort of queries at a given recall: run `script/search_hnsw.sh`, `script/search_mrng.sh`, etc. with `purpose=1`.
Use the Python script `python-script/minimum_effor_greedy_test.py` to plot the figure and get the correlation coefficient.
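The correlation itself is easy to recompute from the two effort vectors; a sketch with plain NumPy (which coefficient the script reports is not specified here, so treat these helpers as illustrative):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two effort vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman rank correlation: Pearson on the ranks (assumes no ties)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))
```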
Use `script/index_mrng.sh` to build the MRNG index, as in 1.1. Use `script/get_me.sh` to compute the ME of queries on the MRNG index, as in 1.2. These ME values are the $Steiner$-hardness of the queries.
If you want to compare the $Steiner$-hardness with the actual query effort, use `python-script/shuffle_dataset2.py` to get the shuffled datasets, and build indexes on these shuffled datasets.
The same as 1.3: use `script/search_hnsw.sh`, `script/search_mrng.sh`, etc. with `purpose=1` to compute the actual effort of queries.
Take the average of the actual effort of queries on different insertion orders of the same graph index. We view this average value as the ground truth of the hardness of queries on this index.
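The averaging step can be sketched as an element-wise mean over the per-order effort vectors:

```python
import numpy as np

def average_effort(effort_per_order):
    """Average actual query effort over several insertion orders.

    effort_per_order: list of 1-D arrays, one per shuffled insertion order,
    each holding the per-query effort on that index build. The per-query
    mean serves as the ground-truth hardness on this index.
    """
    return np.stack(effort_per_order).mean(axis=0)
```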
2.4 (optional) Plot the density-scatter figure of $Steiner$-hardness vs. average actual query effort

Use the Python script `python-script/hardness_test.py`.
Use `python-script/augment_GMM.py` to generate new data with a Gaussian Mixture Model.
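Conceptually, the augmentation fits a GMM to existing vectors and samples new ones from it; a sketch with scikit-learn (the function name and defaults are ours; the real settings live in `config.py`):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def augment_with_gmm(train_vectors, n_new, n_components=16, seed=0):
    """Generate synthetic vectors by fitting a GMM and sampling from it."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(train_vectors)
    new, _ = gmm.sample(n_new)   # sample() also returns component labels
    return new.astype(np.float32)
```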
See Section 2 to compute the hardness of the new queries.
Use `python-script/build_workloads.py` to build workloads.
Use `python-script/plot_box_plot_workload_hardness.py` to plot the distribution of the hardness of the new queries.
Use `script/search_hnsw.sh`, `script/search_mrng.sh`, etc. with `purpose=2` to evaluate indexes with the new unbiased workloads.
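Evaluating an index against a workload typically reports recall@k against the precomputed ground truth; a minimal sketch (the helper name is ours):

```python
import numpy as np

def recall_at_k(result_ids, gt_ids):
    """Average fraction of the true k nearest neighbours retrieved per query.

    result_ids, gt_ids: (n_queries, k) integer arrays of neighbour ids.
    """
    k = gt_ids.shape[1]
    hits = [len(set(res) & set(gt)) for res, gt in zip(result_ids, gt_ids)]
    return float(np.mean(hits)) / k
```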