Refactor tolerance settings for BEIR dense vector regressions (#2538)
+ tweak docs for flat indexes.
+ refactor tolerance values for HNSW indexes, calibrate with respect to flat index scores.
lintool authored and Panizghi committed Jul 9, 2024
1 parent 6d61c04 commit 670623c
Showing 596 changed files with 4,167 additions and 2,682 deletions.
3 changes: 1 addition & 2 deletions docs/experiments-msmarco-passage.md
@@ -498,5 +498,4 @@ The BM25 run with default parameters `k1=0.9`, `b=0.4` roughly corresponds to th
+ Results reproduced by [@alireza-taban](https://github.com/alireza-taban) on 2024-06-10 (commit [`59330e3`](https://github.com/castorini/anserini/commit/59330e355b4aaf6754622cb3a136259dea0d8d05))
+ Results reproduced by [@Feng-12138](https://github.com/Feng-12138) on 2024-06-16 (commit [`ad97377`](https://github.com/castorini/anserini/commit/ad97377e463e70ee8b2f501ac7c41134af53e976))
+ Results reproduced by [@hosnahoseini](https://github.com/hosnahoseini) on 2024-06-18 (commit [`ad97377`](https://github.com/castorini/anserini/commit/ad97377e463e70ee8b2f501ac7c41134af53e976))
-
-
++ Results reproduced by [@FaizanFaisal25](https://github.com/FaizanFaisal25) on 2024-06-29 (commit [`e92370a`](https://github.com/FaizanFaisal25/anserini/commit/e92370a06eaa3bbc5bacdba65cc9c3f125590071))
@@ -78,5 +78,6 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.9964 |

-The above figures are from running brute-force search with cached queries on non-quantized indexes.
-With quantized indexes, results may differ slightly, but the nDCG@10 score should generally be within 0.004 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With cached queries on quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.004 of the results reported above (with some outliers).
+Note that quantization is non-deterministic due to sampling (i.e., results may differ slightly between trials).
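The tolerance statement above can be checked mechanically. A minimal shell sketch of such a check; the score values and variable names are illustrative, not taken from any actual run:

```shell
# Sketch of the tolerance check described in the docs above; the scores
# here are made-up example values, not results from a real run.
reference=0.9964   # reported score on the non-quantized flat index
observed=0.9938    # hypothetical score from a quantized-index run
tolerance=0.004

# awk exits 0 when |observed - reference| <= tolerance
if awk -v o="$observed" -v r="$reference" -v t="$tolerance" \
    'BEGIN { d = o - r; if (d < 0) d = -d; exit !(d <= t) }'; then
  echo "within tolerance"
else
  echo "outside tolerance"
fi
```

With these illustrative values the check prints `within tolerance`, since |0.9938 - 0.9964| = 0.0026 <= 0.004.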
@@ -78,5 +78,6 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.9964 |

-The above figures are from running brute-force search with cached queries on non-quantized indexes.
-With quantized indexes and on-the-fly ONNX query encoding, results may differ slightly, but the nDCG@10 score should generally be within 0.005 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With ONNX query encoding on quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.004 of the results reported above (with some outliers).
+Note that quantization is non-deterministic due to sampling (i.e., results may differ slightly between trials).
@@ -78,4 +78,4 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.9964 |

-Note that since we're running brute-force search, the results should be reproducible _exactly_.
+Note that since we're running brute-force search with cached queries on non-quantized flat indexes, the results should be reproducible _exactly_.
@@ -78,5 +78,5 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.9964 |

-The above figures are from running brute-force search with cached queries.
-With ONNX query encoding, results may differ slightly, but the nDCG@10 score should generally be within 0.002 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With ONNX query encoding on non-quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.001 of the results reported above (with some outliers).
@@ -56,16 +56,16 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
  -index indexes/lucene-hnsw-int8.beir-v1.0.0-arguana.bge-base-en-v1.5/ \
  -topics tools/topics-and-qrels/topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.gz \
  -topicReader JsonStringVector \
-  -output runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt \
+  -output runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt \
  -generator VectorQueryGenerator -topicField vector -removeQuery -threads 16 -hits 1000 -efSearch 1000 &
```

Evaluation can be performed using `trec_eval`:

```
-bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
-bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
-bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
+bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
+bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
+bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-cached.topics.beir-v1.0.0-arguana.test.bge-base-en-v1.5.jsonl.txt
```
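When comparing runs against reference figures, the single aggregate score can be pulled out of `trec_eval` output, which prints one line per measure with the score in the third field of the `all` line. A minimal sketch; the echoed line below is a stand-in for real `trec_eval` output:

```shell
# Sketch: extract the aggregate score from trec_eval-style output.
# trec_eval prints lines like "ndcg_cut_10  all  0.6360"; the "all" row
# holds the aggregate. The echo here simulates that output.
echo "ndcg_cut_10             all     0.6360" \
  | awk '$2 == "all" { print $3 }'
```

This prints `0.6360`; piping actual `trec_eval` output through the same `awk` filter works identically.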

## Effectiveness
@@ -74,11 +74,12 @@ With the above commands, you should be able to reproduce the following results:

| **nDCG@10** | **BGE-base-en-v1.5**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| BEIR (v1.0.0): ArguAna | 0.635 |
+| BEIR (v1.0.0): ArguAna | 0.636 |
| **R@100** | **BGE-base-en-v1.5**|
-| BEIR (v1.0.0): ArguAna | 0.991 |
+| BEIR (v1.0.0): ArguAna | 0.992 |
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.996 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/beir-v1.0.0-arguana.bge-base-en-v1.5.hnsw-int8.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on quantized HNSW indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.005 of the results reported above (with some outliers).
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).
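For a non-deterministic run like this, the trial-to-trial spread can be summarized against a reference score. A minimal shell sketch; the trial scores below are invented for illustration:

```shell
# Sketch: report each trial's absolute deviation from a reference score.
# The reference and the trial scores are illustrative values only.
reference=0.636
for score in 0.634 0.635 0.636 0.637; do
  awk -v s="$score" -v r="$reference" \
    'BEGIN { d = s - r; if (d < 0) d = -d; printf "%.3f deviates by %.3f\n", s, d }'
done
```

Each printed deviation can then be compared against the documented tolerance.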
@@ -56,16 +56,16 @@ bin/run.sh io.anserini.search.SearchHnswDenseVectors \
  -index indexes/lucene-hnsw-int8.beir-v1.0.0-arguana.bge-base-en-v1.5/ \
  -topics tools/topics-and-qrels/topics.beir-v1.0.0-arguana.test.tsv.gz \
  -topicReader TsvString \
-  -output runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-arguana.test.txt \
+  -output runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-onnx.topics.beir-v1.0.0-arguana.test.txt \
  -generator VectorQueryGenerator -topicField title -removeQuery -threads 16 -hits 1000 -efSearch 1000 -encoder BgeBaseEn15 &
```

Evaluation can be performed using `trec_eval`:

```
-bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-arguana.test.txt
-bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-arguana.test.txt
-bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-arguana.test.txt
+bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-onnx.topics.beir-v1.0.0-arguana.test.txt
+bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-onnx.topics.beir-v1.0.0-arguana.test.txt
+bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-arguana.test.txt runs/run.beir-v1.0.0-arguana.bge-base-en-v1.5.bge-hnsw-int8-onnx.topics.beir-v1.0.0-arguana.test.txt
```

## Effectiveness
@@ -74,11 +74,12 @@ With the above commands, you should be able to reproduce the following results:

| **nDCG@10** | **BGE-base-en-v1.5**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| BEIR (v1.0.0): ArguAna | 0.621 |
+| BEIR (v1.0.0): ArguAna | 0.636 |
| **R@100** | **BGE-base-en-v1.5**|
-| BEIR (v1.0.0): ArguAna | 0.971 |
+| BEIR (v1.0.0): ArguAna | 0.992 |
| **R@1000** | **BGE-base-en-v1.5**|
-| BEIR (v1.0.0): ArguAna | 0.994 |
+| BEIR (v1.0.0): ArguAna | 0.996 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/beir-v1.0.0-arguana.bge-base-en-v1.5.hnsw-int8.onnx.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With ONNX query encoding on quantized HNSW indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.005 of the results reported above (with some outliers).
+Note that both HNSW indexing and quantization are non-deterministic (i.e., results may differ slightly between trials).
@@ -80,5 +80,6 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): ArguAna | 0.996 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/beir-v1.0.0-arguana.bge-base-en-v1.5.hnsw.cached.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With cached queries on non-quantized HNSW indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.003 of the results reported above (with some outliers).
+Note that HNSW indexing is non-deterministic (i.e., results may differ slightly between trials).
@@ -74,11 +74,12 @@ With the above commands, you should be able to reproduce the following results:

| **nDCG@10** | **BGE-base-en-v1.5**|
|:-------------------------------------------------------------------------------------------------------------|-----------|
-| BEIR (v1.0.0): ArguAna | 0.623 |
+| BEIR (v1.0.0): ArguAna | 0.636 |
| **R@100** | **BGE-base-en-v1.5**|
-| BEIR (v1.0.0): ArguAna | 0.972 |
+| BEIR (v1.0.0): ArguAna | 0.992 |
| **R@1000** | **BGE-base-en-v1.5**|
-| BEIR (v1.0.0): ArguAna | 0.993 |
+| BEIR (v1.0.0): ArguAna | 0.996 |

-Note that due to the non-deterministic nature of HNSW indexing, results may differ slightly between each experimental run.
-Nevertheless, scores are generally within 0.005 of the reference values recorded in [our YAML configuration file](../../src/main/resources/regression/beir-v1.0.0-arguana.bge-base-en-v1.5.hnsw.onnx.yaml).
+The above figures are from running brute-force search with cached queries on non-quantized **flat** indexes.
+With ONNX query encoding on non-quantized HNSW indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.003 of the results reported above (with some outliers).
+Note that HNSW indexing is non-deterministic (i.e., results may differ slightly between trials).
@@ -78,5 +78,6 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): BioASQ | 0.8059 |

-The above figures are from running brute-force search with cached queries on non-quantized indexes.
-With quantized indexes, results may differ slightly, but the nDCG@10 score should generally be within 0.004 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With cached queries on quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.004 of the results reported above (with some outliers).
+Note that quantization is non-deterministic due to sampling (i.e., results may differ slightly between trials).
@@ -78,5 +78,6 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): BioASQ | 0.8059 |

-The above figures are from running brute-force search with cached queries on non-quantized indexes.
-With quantized indexes and on-the-fly ONNX query encoding, results may differ slightly, but the nDCG@10 score should generally be within 0.005 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With ONNX query encoding on quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.004 of the results reported above (with some outliers).
+Note that quantization is non-deterministic due to sampling (i.e., results may differ slightly between trials).
@@ -78,4 +78,4 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): BioASQ | 0.8059 |

-Note that since we're running brute-force search, the results should be reproducible _exactly_.
+Note that since we're running brute-force search with cached queries on non-quantized flat indexes, the results should be reproducible _exactly_.
@@ -78,5 +78,5 @@ With the above commands, you should be able to reproduce the following results:
| **R@1000** | **BGE-base-en-v1.5**|
| BEIR (v1.0.0): BioASQ | 0.8059 |

-The above figures are from running brute-force search with cached queries.
-With ONNX query encoding, results may differ slightly, but the nDCG@10 score should generally be within 0.002 of the result reported above (with a small number of outliers).
+The above figures are from running brute-force search with cached queries on non-quantized flat indexes.
+With ONNX query encoding on non-quantized flat indexes, observed results may differ slightly (typically, lower), but scores should generally be within 0.001 of the results reported above (with some outliers).