This repo is a companion set of configs and links for the Haystack US 23 talk on Hybrid Search.
We use a combination of the original Amazon ESCI dataset and the community ESCI-S dataset.
The ESCI+ESCI-S small dataset in the Metarank format can be downloaded here: s3://esci-s/metarank-esci-small.jsonl.zst
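For example, assuming you have the AWS CLI and zstd installed, downloading and unpacking the file can look like this (add `--no-sign-request` if you access the bucket anonymously):

```bash
# fetch the compressed dataset from S3
aws s3 cp s3://esci-s/metarank-esci-small.jsonl.zst .

# decompress into the events file used by the commands below
zstd -d metarank-esci-small.jsonl.zst -o events.jsonl
```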
See the huggingface.co/metarank repo for all the models used in the final configuration:
- metarank/all-MiniLM-L6-v2
- metarank/all-mpnet-base-v2
- metarank/esci-MiniLM-L6-v2
- metarank/ce-msmarco-MiniLM-L6-v2
- metarank/ce-esci-MiniLM-L6-v2
You can always convert your own model to ONNX; see the conversion script in each model repo, for example: https://huggingface.co/metarank/all-MiniLM-L6-v2/blob/main/convert.py
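If you'd rather not use the per-repo script, one generic route (our suggestion, not something the repos above require) is the Hugging Face Optimum exporter; the model id below is just an example:

```bash
# pip install "optimum[exporters]" first
optimum-cli export onnx --model sentence-transformers/all-MiniLM-L6-v2 onnx-out/
```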
The Metarank config file is stored in this repo: config.yml
To speed up all the experiments, we used precomputed embeddings for all models:
- with no caching, bootstrapping over the cross-encoder (CE) models takes hours.
- the precomputed embeddings for all experiments are ~30 GB, so we're not sharing them to save bandwidth. If you need them, contact us on Slack.
All the BM25 features reference a term-frequencies file. You can create it with the following command:

```bash
java -jar metarank.jar termfreq --data events.jsonl --out tf-title.json --fields title
```

Term-freq files should be built per field to match the behavior of Lucene.
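For example, assuming your dataset has title and description fields (the field names here are illustrative, not necessarily the ones used in this repo's config), building one file per field can be scripted as:

```bash
# build one term-frequency dictionary per field,
# mirroring Lucene's per-field statistics
for field in title description; do
  java -jar metarank.jar termfreq --data events.jsonl --out tf-$field.json --fields $field
done
```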
- Download the dataset: s3://esci-s/metarank-esci-small.jsonl.zst
- Get the config file: config.yml
- [optional] Compute term-freqs over all fields with `metarank termfreq`, as shown above
Then take a look at the config.yml file: there is a section with feature definitions, and the actual feature layout across different models. In this example there's only a single model, which includes all the features:
- uncomment the features you want included in the ensemble (see the config sketch below)
- then run `metarank standalone -d events.jsonl -c config.yml` and write down the NDCG values
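For orientation, a feature definition plus model layout in config.yml looks roughly like the sketch below. This is a simplified illustration with made-up feature names, not the exact contents of the file in this repo; check the Metarank documentation for the authoritative option names:

```yaml
features:
  # BM25 match between the ranking query and the item title,
  # backed by the per-field term-frequency file built earlier
  - name: title_bm25
    type: field_match
    itemField: item.title
    rankingField: ranking.query
    method:
      type: bm25
      language: en
      termFreq: tf-title.json

models:
  default:
    type: lambdamart
    backend:
      type: xgboost
    features:
      - title_bm25
      # - title_minilm  # uncomment to add a bi-encoder similarity feature to the ensemble
```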
Licensed under the Apache License 2.0.