Recommend scoring hits with BM25(k1=0.9,b=0.4). #46

Open: wants to merge 2 commits into base: master
18 changes: 16 additions & 2 deletions CONTRIBUTE.md
@@ -1,6 +1,6 @@
# Adding another engine

-Currently only tantivy and lucene are supported, but you can add another search
+Currently only Tantivy and Lucene are supported, but you can add another search
engine by creating a directory in the engines directory and add a `Makefile`
implementing the following commands :

@@ -20,7 +20,7 @@ Stemming should be disabled. Tokenization should be something reasonably close t

Starts a program that will get `tests` from stdin, and output
a result hit count as fast as possible. *If this is not your language's default,
-be sure to flush stdout after writing your answer".
+be sure to flush stdout after writing your answer*.

The tests consist in a command followed by a query.

@@ -39,6 +39,20 @@ Queries are expressed in the Lucene query language.

If a command is not supported, just print to stdout "UNSUPPORTED".
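
As a rough illustration of the loop this part of CONTRIBUTE.md describes, the sketch below reads one test per line, prints a hit count, and flushes after every answer. The tab separator between command and query, the `QueryLoopSketch` class, and the `Engine` callback are assumptions made for this example; the text above does not specify them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.io.Reader;

// Hypothetical skeleton of a query loop; not taken from any engine in the repository.
class QueryLoopSketch {

    /** Hypothetical callback: returns the hit count, or a negative value if the command is unsupported. */
    interface Engine {
        long run(String command, String query);
    }

    static void serve(Reader in, PrintStream out, Engine engine) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            // Assumption: each test is "<command>\t<query>" on a single line.
            String[] fields = line.trim().split("\t", 2);
            long hits = fields.length == 2 ? engine.run(fields[0], fields[1]) : -1L;
            // Unsupported (or malformed) tests are answered with the literal UNSUPPORTED.
            out.println(hits < 0 ? "UNSUPPORTED" : Long.toString(hits));
            // Flush after every answer so the benchmark driver is never left waiting.
            out.flush();
        }
    }
}
```

An engine would call something like `QueryLoopSketch.serve(new InputStreamReader(System.in), System.out, engine)` from its `main` method.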

+# Recommendations for new engines
+
+Engines are recommended to follow the below guidelines:
+- Indexing is not measured and may be multi-threaded.
+- Engines may optimize for read-only access, e.g. by merging multiple segments
+down to a single one or performing document reordering.
+- Search operations must run in a single thread.
+- Hits must be ranked according to the
+[BM25](https://en.wikipedia.org/wiki/Okapi_BM25) ranking function with
+standard parameters `k1`=0.9 and `b`=0.4.
+- Phrase queries must be evaluated using indexed positions. They must not take
+advantage of indexing phrases at indexing time (e.g. Lucene's
+ShingleFilter).
+- Result caches must be disabled.
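
For reference, here is a minimal sketch of the per-term BM25 score with the recommended parameters, following the formula on the linked Wikipedia page. `Bm25Sketch` and its method names are hypothetical, and real engines may differ by constant factors or in how they encode document length, so treat this as an illustration rather than any engine's implementation.

```java
// Hypothetical helper illustrating BM25 with k1 = 0.9 and b = 0.4.
class Bm25Sketch {
    static final double K1 = 0.9;
    static final double B = 0.4;

    /** Inverse document frequency, in the always-positive "+1 inside the log" form. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Score contribution of one query term for one document.
     * termFreq:  occurrences of the term in the document
     * docLen:    length of the document in tokens
     * avgDocLen: average document length over the index
     */
    static double termScore(double termFreq, double docLen, double avgDocLen,
                            long docCount, long docFreq) {
        // Length normalization: b controls how strongly long documents are penalized.
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        // Saturating term-frequency component: k1 controls how quickly it flattens.
        double tf = termFreq * (K1 + 1.0) / (termFreq + norm);
        return idf(docCount, docFreq) * tf;
    }
}
```

The score of a multi-term query is the sum of `termScore` over its terms. With `b` = 0.4, document length matters less than with the commonly quoted `b` = 0.75.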

# Adding tests

9 changes: 4 additions & 5 deletions README.md
@@ -29,8 +29,8 @@ The corpus used is the English wikipedia. Stemming is disabled. Queries have bee
from the [AOL query dataset](https://en.wikipedia.org/wiki/AOL_search_data_leak)
(but do not contain any personal information).

-Out of a random sample of query, we filtered queries that had at least two terms and yield at least 1 hit when searches as
-a phrase query.
+Out of a random sample of query, we filtered queries that had at least two terms and yield at least 1 hit when searched
+as a phrase query.

For each of these query, we then run them as :
- `intersection`
@@ -49,15 +49,14 @@ All tests are run once in order to make sure that
- Java's JIT already kicked in.

Test are run in a single thread.
-Out of 5 runs, we only retain the best score, so Garbage Collection likely does not matter.

+Out of 10 runs, we only retain the best score, so Garbage Collection likely does not matter.
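
The best-of-N measurement described above could look roughly like this. `BestOfRunsSketch` and `bestNanos` are hypothetical names, not the benchmark's actual driver; keeping the minimum discards runs that were slowed by a GC pause or other noise.

```java
// Hypothetical timing helper: runs a task several times and keeps the fastest run.
class BestOfRunsSketch {

    /** Returns the shortest wall-clock duration, in nanoseconds, over `runs` executions. */
    static long bestNanos(int runs, Runnable task) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            // Keep only the minimum; slower runs are treated as noise.
            best = Math.min(best, elapsed);
        }
        return best;
    }
}
```

For example, `BestOfRunsSketch.bestNanos(10, () -> runQuery())` would time a hypothetical `runQuery()` ten times and report the fastest run.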

## Engine specific detail

### Lucene

- Query cache is disabled.
-- GC should not influence the results as we pick the best out of 5 runs.
+- GC should not influence the results as we pick the best out of 10 runs.
- JVM used was openjdk 10.0.1 2018-04-17

### Tantivy
2 changes: 2 additions & 0 deletions engines/lucene-7.2.1/src/main/java/DoQuery.java
@@ -5,6 +5,7 @@
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
+import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
@@ -19,6 +20,7 @@ public static void main(String[] args) throws IOException, ParseException {
try (IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
final IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCache(null);
+searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));
try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in))) {
final QueryParser queryParser = new QueryParser("text", new StandardAnalyzer(CharArraySet.EMPTY_SET));
String line;
2 changes: 2 additions & 0 deletions engines/lucene-8.0.0/src/main/java/DoQuery.java
@@ -5,6 +5,7 @@
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
+import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
@@ -19,6 +20,7 @@ public static void main(String[] args) throws IOException, ParseException {
try (IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
final IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCache(null);
+searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));
try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in))) {
final QueryParser queryParser = new QueryParser("text", new StandardAnalyzer(CharArraySet.EMPTY_SET));
String line;
2 changes: 2 additions & 0 deletions engines/lucene-8.10.1/src/main/java/DoQuery.java
@@ -5,6 +5,7 @@
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
+import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
@@ -19,6 +20,7 @@ public static void main(String[] args) throws IOException, ParseException {
try (IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
final IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCache(null);
+searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));
try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in))) {
final QueryParser queryParser = new QueryParser("text", new StandardAnalyzer(CharArraySet.EMPTY_SET));
String line;
2 changes: 2 additions & 0 deletions engines/lucene-9.6.0/src/main/java/DoQuery.java
@@ -5,6 +5,7 @@
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
+import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

import java.io.BufferedReader;
@@ -19,6 +20,7 @@ public static void main(String[] args) throws IOException, ParseException {
try (IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
final IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCache(null);
+searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));
try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in))) {
final QueryParser queryParser = new QueryParser("text", new StandardAnalyzer(CharArraySet.EMPTY_SET));
String line;