The goal here is to get a set of query-document pairs. How can we evaluate it, though?
General Datasets
- TREC
- MS MARCO
- BEIR
Ranking Metrics
- Recall @ K
- Precision @ K
- nDCG @ K
- Reciprocal Rank
- LGTM@10
- Engagement, revenue, add to chat
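As a rough sketch of how the offline metrics above can be computed for a single query, assuming a ranked list of document IDs and a dictionary of graded relevance labels. The function names, the example IDs, and the linear-gain form of nDCG are illustrative assumptions, not taken from any particular library or benchmark.

```python
import math
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """nDCG@k with linear gains (assumption) and a log2 position discount."""
    dcg = sum(
        gains.get(doc, 0.0) / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: system output and graded labels for one query.
ranked = ["d3", "d1", "d7", "d2"]
gains = {"d1": 3.0, "d2": 1.0, "d5": 2.0}
relevant = set(gains)

print("P@3   ", precision_at_k(ranked, relevant, 3))
print("R@3   ", recall_at_k(ranked, relevant, 3))
print("RR    ", reciprocal_rank(ranked, relevant))
print("nDCG@3", ndcg_at_k(ranked, gains, 3))
```

Averaging each value over all queries gives the dataset-level score; the mean of the reciprocal rank is the usual MRR.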
We can use an LLM as a judge by
- Having a nice