Refactor tolerance settings for BEIR dense vector regressions (#2538)
+ tweak docs for flat indexes.
+ refactor tolerance values for HNSW indexes, calibrate with respect to flat index scores.
lintool authored and Panizghi committed Jul 9, 2024
1 parent 6d61c04 commit 670623c
Showing 596 changed files with 4,167 additions and 2,682 deletions.
3 changes: 1 addition & 2 deletions docs/experiments-msmarco-passage.md
@@ -498,5 +498,4 @@ The BM25 run with default parameters `k1=0.9`, `b=0.4` roughly corresponds to th
+ Results reproduced by [@alireza-taban](https://github.com/alireza-taban) on 2024-06-10 (commit [`59330e3`](https://github.com/castorini/anserini/commit/59330e355b4aaf6754622cb3a136259dea0d8d05))
+ Results reproduced by [@Feng-12138](https://github.com/Feng-12138) on 2024-06-16 (commit [`ad97377`](https://github.com/castorini/anserini/commit/ad97377e463e70ee8b2f501ac7c41134af53e976))
+ Results reproduced by [@hosnahoseini](https://github.com/hosnahoseini) on 2024-06-18 (commit [`ad97377`](https://github.com/castorini/anserini/commit/ad97377e463e70ee8b2f501ac7c41134af53e976))
-
-
++ Results reproduced by [@FaizanFaisal25](https://github.com/FaizanFaisal25) on 2024-06-29 (commit [`e92370a`](https://github.com/FaizanFaisal25/anserini/commit/e92370a06eaa3bbc5bacdba65cc9c3f125590071))
@@ -78,5 +78,6 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.9964 |

-The above figures are from running brute-force search with cached queries on non-quantized indexes.
-With quantized indexes, results may differ slightly, but the nDCG@10 score should generally be within 0.004 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With cached queries on quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.004 of the results reported above (with some outliers).
+Note that quantization is non-deterministic due to sampling (i.e., results may differ slightly between trials).
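The tolerance statement above can be checked mechanically. A minimal shell sketch of such a check; the score values and variable names are illustrative, not taken from any actual run:

```shell
# Sketch of the tolerance check described in the docs above; the scores
# here are made-up example values, not results from a real run.
reference=0.9964   # reported score on the non-quantized flat index
observed=0.9938    # hypothetical score from a quantized-index run
tolerance=0.004

# awk exits 0 when |observed - reference| <= tolerance
if awk -v o="$observed" -v r="$reference" -v t="$tolerance" \
    'BEGIN { d = o - r; if (d < 0) d = -d; exit !(d <= t) }'; then
  echo "within tolerance"
else
  echo "outside tolerance"
fi
```

With these illustrative values the check prints `within tolerance`, since |0.9938 - 0.9964| = 0.0026 <= 0.004.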
@@ -78,5 +78,6 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.9964 |

-The above figures are from running brute-force search with cached queries on non-quantized indexes.
-With quantized indexes and on-the-fly ONNX query encoding, results may differ slightly, but the nDCG@10 score should generally be within 0.005 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With ONNX query encoding on quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.004 of the results reported above (with some outliers).
+Note that quantization is non-deterministic due to sampling (i.e., results may differ slightly between trials).
@@ -78,4 +78,4 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.9964 |

-Note that since we're running brute-force search, the results should be reproducible _exactly_.
+Note that since we're running brute-force search with cached queries on non-quantized flat indexes, the results should be reproducible _exactly_.
@@ -78,5 +78,5 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.9964 |

-The above figures are from running brute-force search with cached queries.
-With ONNX query encoding, results may differ slightly, but the nDCG@10 score should generally be within 0.002 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With ONNX query encoding on non-quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.001 of the results reported above (with some outliers).
@@ -56,16 +56,16 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
  -index indexes/lucene-hnsw-int8.beir-v1.0.0-arguana.bge-base-en-v1.5/ \
  -topics tools/topics-and-qrels/topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.gz \
  -topicReader JsonStringVector \
-  -output runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt \
+  -output runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt \
  -generator VectorQueryGenerator -topicField vector -removeQuery -threads 16 -hits 1000 -efSearch 1000 &
```

Evaluation can be performed using `trec_eval`:

```
-bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
-bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
-bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
+bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
+bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
+bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
```
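When comparing runs against reference figures, the single aggregate score can be pulled out of `trec_eval` output, which prints one line per measure with the score in the third field of the `all` line. A minimal sketch; the echoed line below is a stand-in for real `trec_eval` output:

```shell
# Sketch: extract the aggregate score from trec_eval-style output.
# trec_eval prints lines like "ndcg_cut_10  all  0.6360"; the "all" row
# holds the aggregate. The echo here simulates that output.
echo "ndcg_cut_10             all     0.6360" \
  | awk '$2 == "all" { print $3 }'
```

This prints `0.6360`; piping actual `trec_eval` output through the same `awk` filter works identically.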

## Effectiveness
@@ -74,11 +74,12 @@ With the above commands, you should be able to reproduce the following results:

| **nDCG@10** | **BGE-base-en-v1.5**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| BEIR (v1.0.0): ArguAna | 0.635 |
+| BEIR (v1.0.0): ArguAna | 0.636 |
| **R@100** | **BGE-base-en-v1.5**|
-| BEIR (v1.0.0): ArguAna | 0.991 |
+| BEIR (v1.0.0): ArguAna | 0.992 |
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.996 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/beir-v1.0.0-arguana.bge-base-en-v1.5.hnsw-int8.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on quantized HNSW indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.005 of the results reported above (with some outliers).
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).
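For a non-deterministic run like this, the trial-to-trial spread can be summarized against a reference score. A minimal shell sketch; the trial scores below are invented for illustration:

```shell
# Sketch: report each trial's absolute deviation from a reference score.
# The reference and the trial scores are illustrative values only.
reference=0.636
for score in 0.634 0.635 0.636 0.637; do
  awk -v s="$score" -v r="$reference" \
    'BEGIN { d = s - r; if (d < 0) d = -d; printf "%.3f deviates by %.3f\n", s, d }'
done
```

Each printed deviation can then be compared against the documented tolerance.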
@@ -56,16 +56,16 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
  -index indexes/lucene-hnsw-int8.beir-v1.0.0-arguana.bge-base-en-v1.5/ \
  -topics tools/topics-and-qrels/topics.beir-v1.0.0-arguana.test.tsv.gz \
  -topicReader TsvString \
-  -output runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-arguana.test.txt \
+  -output runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-onnx.topics.beir-v1.0.0-arguana.test.txt \
  -generator VectorQueryGenerator -topicField title -removeQuery -threads 16 -hits 1000 -efSearch 1000 -encoder BgeBaseEn15 &
```

Evaluation can be performed using `trec_eval`:

```
-bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-arguana.test.txt
-bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-arguana.test.txt
-bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-arguana.test.txt
+bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-onnx.topics.beir-v1.0.0-arguana.test.txt
+bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-onnx.topics.beir-v1.0.0-arguana.test.txt
+bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-onnx.topics.beir-v1.0.0-arguana.test.txt
```

## Effectiveness
@@ -74,11 +74,12 @@ With the above commands, you should be able to reproduce the following results:

| **nDCG@10** | **BGE-base-en-v1.5**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| BEIR (v1.0.0): ArguAna | 0.621 |
+| BEIR (v1.0.0): ArguAna | 0.636 |
| **R@100** | **BGE-base-en-v1.5**|
-| BEIR (v1.0.0): ArguAna | 0.971 |
+| BEIR (v1.0.0): ArguAna | 0.992 |
| **R@1000** | **BGE-base-en-v1.5**|
-| BEIR (v1.0.0): ArguAna | 0.994 |
+| BEIR (v1.0.0): ArguAna | 0.996 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/beir-v1.0.0-arguana.bge-base-en-v1.5.hnsw-int8.onnx.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With ONNX query encoding on quantized HNSW indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.005 of the results reported above (with some outliers).
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).
@@ -80,5 +80,6 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.996 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/beir-v1.0.0-arguana.bge-base-en-v1.5.hnsw.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on non-quantized HNSW indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.003 of the results reported above (with some outliers).
+Note that HNSW indexing is non-deterministic (i.e., results may differ slightly between trials).
@@ -74,11 +74,12 @@ With the above commands, you should be able to reproduce the following results:

| **nDCG@10** | **BGE-base-en-v1.5**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| BEIR (v1.0.0): ArguAna | 0.623 |
+| BEIR (v1.0.0): ArguAna | 0.636 |
| **R@100** | **BGE-base-en-v1.5**|
-| BEIR (v1.0.0): ArguAna | 0.972 |
+| BEIR (v1.0.0): ArguAna | 0.992 |
| **R@1000** | **BGE-base-en-v1.5**|
-| BEIR (v1.0.0): ArguAna | 0.993 |
+| BEIR (v1.0.0): ArguAna | 0.996 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/beir-v1.0.0-arguana.bge-base-en-v1.5.hnsw.onnx.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With ONNX query encoding on non-quantized HNSW indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.003 of the results reported above (with some outliers).
+Note that HNSW indexing is non-deterministic (i.e., results may differ slightly between trials).
@@ -78,5 +78,6 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): BioASQ | 0.8059 |

-The above figures are from running brute-force search with cached queries on non-quantized indexes.
-With quantized indexes, results may differ slightly, but the nDCG@10 score should generally be within 0.004 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With cached queries on quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.004 of the results reported above (with some outliers).
+Note that quantization is non-deterministic due to sampling (i.e., results may differ slightly between trials).
@@ -78,5 +78,6 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): BioASQ | 0.8059 |

-The above figures are from running brute-force search with cached queries on non-quantized indexes.
-With quantized indexes and on-the-fly ONNX query encoding, results may differ slightly, but the nDCG@10 score should generally be within 0.005 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With ONNX query encoding on quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.004 of the results reported above (with some outliers).
+Note that quantization is non-deterministic due to sampling (i.e., results may differ slightly between trials).
@@ -78,4 +78,4 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): BioASQ | 0.8059 |

-Note that since we're running brute-force search, the results should be reproducible _exactly_.
+Note that since we're running brute-force search with cached queries on non-quantized flat indexes, the results should be reproducible _exactly_.
@@ -78,5 +78,5 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): BioASQ | 0.8059 |

-The above figures are from running brute-force search with cached queries.
-With ONNX query encoding, results may differ slightly, but the nDCG@10 score should generally be within 0.002 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With ONNX query encoding on non-quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.001 of the results reported above (with some outliers).