From ab9db9ba46c961bffb0a7d1bf2818b294df9bfcb Mon Sep 17 00:00:00 2001 From: DanielKohn1208 Date: Mon, 22 Apr 2024 22:11:58 -0400 Subject: [PATCH 1/6] Add to onboarding reproduction logs --- docs/conceptual-framework.md | 1 + docs/conceptual-framework2.md | 1 + docs/experiments-msmarco-passage.md | 1 + docs/experiments-nfcorpus.md | 1 + 4 files changed, 4 insertions(+) diff --git a/docs/conceptual-framework.md b/docs/conceptual-framework.md index 7a0be816e..dd44c81c6 100644 --- a/docs/conceptual-framework.md +++ b/docs/conceptual-framework.md @@ -331,3 +331,4 @@ Before you move on, however, add an entry in the "Reproduction Log" at the botto + Results reproduced by [@Lindaaa8](https://github.com/lindaaa8) on 2024-03-29 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) + Results reproduced by [@th13nd4n0](https://github.com/th13nd4n0) on 2024-04-05 (commit [`df3bc6c`](https://github.com/castorini/pyserini/commit/df3bc6c2c887d7e3a3a5ee40972600b9ab8cefc2)) + Results reproduced by [@a68lin](https://github.com/a68lin) on 2024-04-12 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) ++ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2 )) diff --git a/docs/conceptual-framework2.md b/docs/conceptual-framework2.md index eca933556..858708a17 100644 --- a/docs/conceptual-framework2.md +++ b/docs/conceptual-framework2.md @@ -572,3 +572,4 @@ Before you move on, however, add an entry in the "Reproduction Log" at the botto + Results reproduced by [@Lindaaa8](https://github.com/lindaaa8) on 2024-04-02 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) + Results reproduced by [@th13nd4n0](https://github.com/th13nd4n0) on 2024-04-05 (commit 
[`df3bc6c`](https://github.com/castorini/pyserini/commit/df3bc6c2c887d7e3a3a5ee40972600b9ab8cefc2)) + Results reproduced by [@a68lin](https://github.com/a68lin) on 2024-04-12 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) ++ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2 )) diff --git a/docs/experiments-msmarco-passage.md b/docs/experiments-msmarco-passage.md index 2f2163882..6949bc7b0 100644 --- a/docs/experiments-msmarco-passage.md +++ b/docs/experiments-msmarco-passage.md @@ -369,3 +369,4 @@ Before you move on, however, add an entry in the "Reproduction Log" at the botto + Results reproduced by [@Lindaaa8](https://github.com/lindaaa8) on 2024-03-29 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) + Results reproduced by [@th13nd4n0](https://github.com/th13nd4n0) on 2024-04-05 (commit [`df3bc6c`](https://github.com/castorini/pyserini/commit/df3bc6c2c887d7e3a3a5ee40972600b9ab8cefc2)) + Results reproduced by [@a68lin](https://github.com/a68lin) on 2024-04-12 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) ++ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2 )) diff --git a/docs/experiments-nfcorpus.md b/docs/experiments-nfcorpus.md index c7952e73a..7f7dff5cf 100644 --- a/docs/experiments-nfcorpus.md +++ b/docs/experiments-nfcorpus.md @@ -367,3 +367,4 @@ Before you move on, however, add an entry in the "Reproduction Log" at the botto + Results reproduced by [@Lindaaa8](https://github.com/lindaaa8) on 2024-03-29 (commit 
[`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) + Results reproduced by [@th13nd4n0](https://github.com/th13nd4n0) on 2024-04-05 (commit [`df3bc6c`](https://github.com/castorini/pyserini/commit/df3bc6c2c887d7e3a3a5ee40972600b9ab8cefc2)) + Results reproduced by [@a68lin](https://github.com/a68lin) on 2024-04-12 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) ++ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2 )) From 76d253d021b6f1a6910884a57abaa45ac8b81275 Mon Sep 17 00:00:00 2001 From: DanielKohn1208 Date: Tue, 23 Apr 2024 09:37:37 -0400 Subject: [PATCH 2/6] Removed space before parenthesis --- docs/conceptual-framework.md | 2 +- docs/conceptual-framework2.md | 2 +- docs/experiments-msmarco-passage.md | 2 +- docs/experiments-nfcorpus.md | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/conceptual-framework.md b/docs/conceptual-framework.md index dd44c81c6..601e0cfed 100644 --- a/docs/conceptual-framework.md +++ b/docs/conceptual-framework.md @@ -331,4 +331,4 @@ Before you move on, however, add an entry in the "Reproduction Log" at the botto + Results reproduced by [@Lindaaa8](https://github.com/lindaaa8) on 2024-03-29 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) + Results reproduced by [@th13nd4n0](https://github.com/th13nd4n0) on 2024-04-05 (commit [`df3bc6c`](https://github.com/castorini/pyserini/commit/df3bc6c2c887d7e3a3a5ee40972600b9ab8cefc2)) + Results reproduced by [@a68lin](https://github.com/a68lin) on 2024-04-12 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) -+ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit 
[`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2 )) ++ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2)) diff --git a/docs/conceptual-framework2.md b/docs/conceptual-framework2.md index 858708a17..a50366a5f 100644 --- a/docs/conceptual-framework2.md +++ b/docs/conceptual-framework2.md @@ -572,4 +572,4 @@ Before you move on, however, add an entry in the "Reproduction Log" at the botto + Results reproduced by [@Lindaaa8](https://github.com/lindaaa8) on 2024-04-02 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) + Results reproduced by [@th13nd4n0](https://github.com/th13nd4n0) on 2024-04-05 (commit [`df3bc6c`](https://github.com/castorini/pyserini/commit/df3bc6c2c887d7e3a3a5ee40972600b9ab8cefc2)) + Results reproduced by [@a68lin](https://github.com/a68lin) on 2024-04-12 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) -+ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2 )) ++ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2)) diff --git a/docs/experiments-msmarco-passage.md b/docs/experiments-msmarco-passage.md index 6949bc7b0..ddafe121b 100644 --- a/docs/experiments-msmarco-passage.md +++ b/docs/experiments-msmarco-passage.md @@ -369,4 +369,4 @@ Before you move on, however, add an entry in the "Reproduction Log" at the botto + Results reproduced by [@Lindaaa8](https://github.com/lindaaa8) on 2024-03-29 (commit 
[`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) + Results reproduced by [@th13nd4n0](https://github.com/th13nd4n0) on 2024-04-05 (commit [`df3bc6c`](https://github.com/castorini/pyserini/commit/df3bc6c2c887d7e3a3a5ee40972600b9ab8cefc2)) + Results reproduced by [@a68lin](https://github.com/a68lin) on 2024-04-12 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) -+ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2 )) ++ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2)) diff --git a/docs/experiments-nfcorpus.md b/docs/experiments-nfcorpus.md index 7f7dff5cf..fd8795837 100644 --- a/docs/experiments-nfcorpus.md +++ b/docs/experiments-nfcorpus.md @@ -367,4 +367,4 @@ Before you move on, however, add an entry in the "Reproduction Log" at the botto + Results reproduced by [@Lindaaa8](https://github.com/lindaaa8) on 2024-03-29 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) + Results reproduced by [@th13nd4n0](https://github.com/th13nd4n0) on 2024-04-05 (commit [`df3bc6c`](https://github.com/castorini/pyserini/commit/df3bc6c2c887d7e3a3a5ee40972600b9ab8cefc2)) + Results reproduced by [@a68lin](https://github.com/a68lin) on 2024-04-12 (commit [`7dda9f3`](https://github.com/castorini/pyserini/commit/7dda9f3246d791a52ebfcedb0c9c10ee01d4862d)) -+ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2 )) ++ Results reproduced by [@DanielKohn1208](https://github.com/DanielKohn1208) on 
2024-04-22 (commit [`184a212`](https://github.com/castorini/pyserini/commit/184a212e7d578fac453ead64f7f796bc2e44bcf2)) From 86bd67b6759a49fde6543f2fcaac1f01c2087977 Mon Sep 17 00:00:00 2001 From: DanielKohn1208 Date: Tue, 30 Apr 2024 11:34:44 -0400 Subject: [PATCH 3/6] added raw() --- pyserini/search/faiss/_searcher.py | 26 +++++++++++++++++++------- pyserini/search/hybrid/_searcher.py | 2 +- 2 files changed, 20 insertions(+), 8 deletions(-) diff --git a/pyserini/search/faiss/_searcher.py b/pyserini/search/faiss/_searcher.py index d5fc92fdd..b1a22859c 100644 --- a/pyserini/search/faiss/_searcher.py +++ b/pyserini/search/faiss/_searcher.py @@ -419,13 +419,24 @@ def encode(self, query: str): class DenseSearchResult: docid: str score: float + ssearcher: LuceneSearcher # only useful for prebuilt indexes, otherwise set to None + def raw(self): + if self.ssearcher is None: + return None + return self.ssearcher.doc(self.docid).raw() @dataclass class PRFDenseSearchResult: docid: str score: float vectors: [float] + ssearcher: LuceneSearcher # only useful for prebuilt indexes, otherwise set to None + + def raw(self): + if self.ssearcher is None: + return None + return self.ssearcher.doc(self.docid).raw() class FaissSearcher: @@ -449,6 +460,7 @@ def __init__(self, index_dir: str, query_encoder: Union[QueryEncoder, str], self.num_docs = self.index.ntotal assert self.docids is None or self.num_docs == len(self.docids) + self.ssearcher = None if prebuilt_index_name: sparse_index = get_sparse_index(prebuilt_index_name) self.ssearcher = LuceneSearcher.from_prebuilt_index(sparse_index) @@ -525,7 +537,7 @@ def search(self, query: Union[str, np.ndarray], k: int = 10, threads: int = 1, r vectors = vectors[0] distances = distances.flat indexes = indexes.flat - return emb_q, [PRFDenseSearchResult(self.docids[idx], score, vector) + return emb_q, [PRFDenseSearchResult(self.docids[idx], score, vector, self.ssearcher) for score, idx, vector in zip(distances, indexes, vectors) if idx != -1] 
else: distances, indexes = self.index.search(emb_q, k) @@ -537,9 +549,9 @@ def search(self, query: Union[str, np.ndarray], k: int = 10, threads: int = 1, r for score, idx in zip(distances, indexes): if idx not in unique_docs: unique_docs.add(idx) - results.append(DenseSearchResult(self.docids[idx],score)) + results.append(DenseSearchResult(self.docids[idx], score, self.ssearcher)) return results - return [DenseSearchResult(self.docids[idx], score) + return [DenseSearchResult(self.docids[idx], score, self.ssearcher) for score, idx in zip(distances, indexes) if idx != -1] def batch_search(self, queries: Union[List[str], np.ndarray], q_ids: List[str], k: int = 10, @@ -576,12 +588,12 @@ def batch_search(self, queries: Union[List[str], np.ndarray], q_ids: List[str], faiss.omp_set_num_threads(threads) if return_vector: D, I, V = self.index.search_and_reconstruct(q_embs, k) - return q_embs, {key: [PRFDenseSearchResult(self.docids[idx], score, vector) + return q_embs, {key: [PRFDenseSearchResult(self.docids[idx], score, vector, self.ssearcher) for score, idx, vector in zip(distances, indexes, vectors) if idx != -1] for key, distances, indexes, vectors in zip(q_ids, D, I, V)} else: D, I = self.index.search(q_embs, k) - return {key: [DenseSearchResult(self.docids[idx], score) + return {key: [DenseSearchResult(self.docids[idx], score, self.ssearcher) for score, idx in zip(distances, indexes) if idx != -1] for key, distances, indexes in zip(q_ids, D, I)} @@ -681,7 +693,7 @@ def search(self, query: str, k: int = 10, binary_k: int = 100, rerank: bool = Tr distances, indexes = self.binary_dense_search(k, binary_k, rerank, dense_emb_q, sparse_emb_q) distances = distances.flat indexes = indexes.flat - return [DenseSearchResult(str(idx), score) + return [DenseSearchResult(str(idx), score, self.ssearcher) for score, idx in zip(distances, indexes) if idx != -1] def batch_search(self, queries: List[str], q_ids: List[str], k: int = 10, binary_k: int = 100, @@ -721,7 +733,7 @@ def 
batch_search(self, queries: List[str], q_ids: List[str], k: int = 10, binary assert m == self.dimension faiss.omp_set_num_threads(threads) D, I = self.binary_dense_search(k, binary_k, rerank, dense_q_embs, sparse_q_embs) - return {key: [DenseSearchResult(str(idx), score) + return {key: [DenseSearchResult(str(idx), score, self.ssearcher) for score, idx in zip(distances, indexes) if idx != -1] for key, distances, indexes in zip(q_ids, D, I)} diff --git a/pyserini/search/hybrid/_searcher.py b/pyserini/search/hybrid/_searcher.py index 0817f6c85..ab53e49aa 100644 --- a/pyserini/search/hybrid/_searcher.py +++ b/pyserini/search/hybrid/_searcher.py @@ -77,5 +77,5 @@ def _hybrid_results(dense_results, sparse_results, alpha, k, normalization=False dense_score = (dense_score - (min_dense_score + max_dense_score) / 2) \ / (max_dense_score - min_dense_score) score = alpha * sparse_score + dense_score if not weight_on_dense else sparse_score + alpha * dense_score - hybrid_result.append(DenseSearchResult(doc, score)) + hybrid_result.append(DenseSearchResult(doc, score, None)) return sorted(hybrid_result, key=lambda x: x.score, reverse=True)[:k] From 341196fd5a662e3b9f94f449ab232273fa45e16d Mon Sep 17 00:00:00 2001 From: DanielKohn1208 Date: Tue, 30 Apr 2024 12:27:47 -0400 Subject: [PATCH 4/6] added test cases for raw() --- tests/test_search.py | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/tests/test_search.py b/tests/test_search.py index 278796899..e89f4373b 100644 --- a/tests/test_search.py +++ b/tests/test_search.py @@ -23,7 +23,9 @@ from urllib.request import urlretrieve from pyserini.search.lucene import LuceneSearcher, JScoredDoc +from pyserini.search.faiss import FaissSearcher, AutoQueryEncoder from pyserini.index.lucene import Document +from pyserini.util import get_sparse_index class TestSearch(unittest.TestCase): @@ -409,6 +411,20 @@ def test_doc_by_field(self): # Should return None if we request a docid that doesn't exist 
self.assertTrue(self.searcher.doc_by_field('foo', 'bar') is None) + def test_dense_search_result_raw(self): + DENSE_INDEX = 'beir-v1.0.0-nfcorpus.bge-base-en-v1.5' + + # Using a prebuilt index as this feature only works for this + encoder = AutoQueryEncoder('BAAI/bge-base-en-v1.5', device='cpu', pooling='mean', l2_norm=True) + faiss_searcher = FaissSearcher.from_prebuilt_index(DENSE_INDEX, encoder) + hits = faiss_searcher.search('How to Help Prevent Abdominal Aortic Aneurysms') + lucene_searcher= LuceneSearcher.from_prebuilt_index(get_sparse_index(DENSE_INDEX)) + + self.assertEqual(lucene_searcher.doc(hits[0].docid).raw(), hits[0].raw()) + self.assertEqual(lucene_searcher.doc(hits[1].docid).raw(), hits[1].raw()) + self.assertEqual(lucene_searcher.doc(hits[2].docid).raw(), hits[2].raw()) + self.assertEqual(lucene_searcher.doc(hits[3].docid).raw(), hits[3].raw()) + @classmethod def tearDownClass(cls): cls.searcher.close() From e1e9283343a63f38314ab673c84b0d83f952b2e0 Mon Sep 17 00:00:00 2001 From: DanielKohn1208 Date: Tue, 30 Apr 2024 21:48:33 -0400 Subject: [PATCH 5/6] updated docs to explain connection between FaissSearcher.doc() and LuceneSearcher.doc() --- docs/usage-fetch.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/docs/usage-fetch.md b/docs/usage-fetch.md index 113dff9e7..26c4f51f9 100644 --- a/docs/usage-fetch.md +++ b/docs/usage-fetch.md @@ -1,5 +1,6 @@ # Pyserini: Fetching Document Content +## Using a Sparse Representation Another commonly used feature in Pyserini is to fetch a document (i.e., its text) given its `docid`. 
A sparse (Lucene) index can be configured to include the raw document text, in which case the `doc()` method can be used to fetch the document: @@ -60,3 +61,17 @@ Thus, a simple way to iterate through all documents in the collection (and for e for i in range(searcher.num_docs): print(searcher.doc(i).docid()) ``` + +## Using a Dense Representation + +A similar operation can be performed using a dense (Faiss) index **for prebuilt indexes only**. +Note that internally, the corresponding sparse (Lucene) index is used to fetch document content. + +```python +from pyserini.search.faiss import FaissSearcher, AutoQueryEncoder + +encoder = AutoQueryEncoder('BAAI/bge-base-en-v1.5', device='cpu', pooling='mean', l2_norm=True) +searcher = FaissSearcher.from_prebuilt_index('beir-v1.0.0-nfcorpus.bge-base-en-v1.5', encoder) +doc = searcher.doc('MED-14') +``` + From c0dda9530794b08a194d4b2004bfd61f1a061cda Mon Sep 17 00:00:00 2001 From: DanielKohn1208 Date: Tue, 30 Apr 2024 21:51:37 -0400 Subject: [PATCH 6/6] added missing sentence --- docs/usage-fetch.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/usage-fetch.md b/docs/usage-fetch.md index 26c4f51f9..563e7d2db 100644 --- a/docs/usage-fetch.md +++ b/docs/usage-fetch.md @@ -75,3 +75,4 @@ searcher = FaissSearcher.from_prebuilt_index('beir-v1.0.0-nfcorpus.bge-base-en-v doc = searcher.doc('MED-14') ``` +Since a sparse index is used internally, all methods available on a document returned by a `LuceneSearcher` apply here as well (see the section above).
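
Reviewer note: the delegation pattern that PATCH 3/6 adds (a dense search result holding an optional reference to a sparse searcher and forwarding `raw()` to it) can be sketched in isolation as below. This is a minimal illustration, not the Pyserini implementation: `StubDoc`, `StubSparseSearcher`, and `SketchDenseResult` are hypothetical names standing in for the Lucene document, `LuceneSearcher`, and `DenseSearchResult`. Unlike the patch, the sketch gives `ssearcher` a default of `None` and also returns `None` when the docid is not found; both are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class StubDoc:
    """Hypothetical stand-in for the document object a sparse searcher returns."""

    def __init__(self, text: str) -> None:
        self._text = text

    def raw(self) -> str:
        return self._text


class StubSparseSearcher:
    """Hypothetical stand-in for LuceneSearcher: maps a docid to its raw text."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self._docs = docs

    def doc(self, docid: str) -> Optional[StubDoc]:
        text = self._docs.get(docid)
        return StubDoc(text) if text is not None else None


@dataclass
class SketchDenseResult:
    docid: str
    score: float
    # Only set when a matching sparse index exists (prebuilt indexes in the patch).
    ssearcher: Optional[StubSparseSearcher] = None

    def raw(self) -> Optional[str]:
        # Without a sparse searcher there is no stored text to fetch.
        if self.ssearcher is None:
            return None
        doc = self.ssearcher.doc(self.docid)
        # Assumption for this sketch: an unknown docid yields None, not an error.
        return doc.raw() if doc is not None else None
```

Defaulting the searcher field to `None` would let call sites such as the hybrid searcher omit the extra argument instead of passing `None` explicitly; whether that fits the project's conventions is for the maintainers to decide.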