add range queries #42

Open · wants to merge 61 commits into `master`.

Commits:
- `0e18c8b` Rewrite of the drive part (fulmicoton, Jun 17, 2018)
- `3fa10b1` Bench working (fulmicoton, Jun 18, 2018)
- `cef3018` Update (fulmicoton, Jun 18, 2018)
- `1f4b868` Update bench (fulmicoton, Jun 26, 2018)
- `17be4eb` added output ui (fulmicoton, Aug 2, 2018)
- `53b2810` Added tantivy files (fulmicoton, Aug 7, 2018)
- `4f97097` Editing README.md (fulmicoton, Aug 8, 2018)
- `f41a581` Updated README (fulmicoton, Aug 8, 2018)
- `67e70a0` Update README (fulmicoton, Aug 9, 2018)
- `ccc5f9c` Edited the text for the benchmark (fulmicoton, Aug 10, 2018)
- `6d90680` Updated results.json (fulmicoton, Aug 11, 2018)
- `7bc233f` Added two phase critic query (fulmicoton, Aug 31, 2018)
- `b954b13` add .gitattributes (Sep 16, 2018)
- `6521580` Merge pull request #2 from vmchale/master (fulmicoton, Sep 16, 2018)
- `4dcf3ef` blop (fulmicoton, Nov 19, 2018)
- `cda9e22` Merge branch 'master' of github.com:fulmicoton/search-index-benchmark… (fulmicoton, Nov 19, 2018)
- `b975e8a` blop (fulmicoton, Nov 19, 2018)
- `15c7851` Using query.count (fulmicoton, Nov 20, 2018)
- `c6a1cdd` 0.6 0.7 and 0.8 (fulmicoton, Jan 13, 2019)
- `7af30ae` updated engines / update results.json (fulmicoton, Jan 13, 2019)
- `6c07b8b` Update the link to benchmark results webpage (#9) (petr-tik, Jan 25, 2019)
- `8204900` Added tantivy 0.9 (fulmicoton, Mar 21, 2019)
- `c55668f` Merge branch 'master' of github.com:fulmicoton/search-index-benchmark… (fulmicoton, Mar 21, 2019)
- `ce78af1` Added Lucene 8.0.0 (fulmicoton, Apr 12, 2019)
- `e9cd109` Updated bench (fulmicoton, Apr 14, 2019)
- `928286f` added regression script (fulmicoton, May 24, 2019)
- `05785e7` added most recent results (fulmicoton, Jun 16, 2019)
- `11a1ab0` Change in tantivy 0.9 build (fulmicoton, Jul 29, 2019)
- `9a7bffb` Add Bleve (#14) (mosuka, Sep 26, 2019)
- `edcd256` Added rucene. (fulmicoton, Dec 23, 2019)
- `4ea6ad9` Avoid storing term vectors. (fulmicoton, Dec 23, 2019)
- `2f389ba` Made COMMANDS, ENGINES, PORT setable from env (fulmicoton, Dec 24, 2019)
- `90a95e6` Reordered collectors (fulmicoton, Dec 24, 2019)
- `2bd1362` Simplify the corpus to make sure everyone tokenize it the same way. (fulmicoton, Dec 27, 2019)
- `959efcc` Bugfix first collector. (fulmicoton, Dec 27, 2019)
- `0063719` Added corpus_transform file (fulmicoton, Dec 27, 2019)
- `194a08f` ignore documents with empty url (#15) (mosuka, Dec 28, 2019)
- `639711c` Updated results (fulmicoton, Dec 28, 2019)
- `77fad70` update results,json (fulmicoton, Dec 28, 2019)
- `9fe0548` Added lucene 8.4.0 (fulmicoton, Dec 30, 2019)
- `0f965c9` Special phrase query (#17) (amallia, Apr 1, 2020)
- `ff8819d` Integrate PISA v0.8.2 (#16) (amallia, Apr 2, 2020)
- `3f19eda` Updated results.json (fulmicoton, Apr 2, 2020)
- `619ffb2` Added tantivy 0.13 (fulmicoton, Aug 20, 2020)
- `62b9bb7` Removed coffeescript and gulp. Now using webpack (fulmicoton, Oct 17, 2020)
- `1cbaa6f` Update Makefile (#24) (lengyijun, Mar 12, 2021)
- `0f18fb4` Update index.js (#25) (lengyijun, Mar 12, 2021)
- `825edaa` Update README.md (#26) (PSeitz, Apr 20, 2021)
- `3ed3d88` Added +nighty for lucene (fulmicoton, Jun 15, 2021)
- `4772771` Pointing to the main commit (fulmicoton, Jun 15, 2021)
- `fba4795` Updated rucene (fulmicoton, Jun 15, 2021)
- `c0e0a7a` Add command file (fulmicoton, Oct 15, 2021)
- `41a28e6` Added a Dockerfile to fix compilation. (fulmicoton, Oct 24, 2021)
- `b0a5e46` Added tantivy 0.16 (fulmicoton, Oct 25, 2021)
- `a595eaa` Updated Makefile (fulmicoton, Oct 25, 2021)
- `3d1d431` Updated rucene (fulmicoton, Oct 25, 2021)
- `1e8a752` Remove Max score comment in Readme (fulmicoton, Nov 19, 2021)
- `985601d` add tantivy versions 0.17, 0.18 and 0,19 (PSeitz, Dec 28, 2022)
- `7cee369` Merge pull request #40 from quickwit-oss/add_engines (PSeitz, Dec 28, 2022)
- `b956323` add range queries (PSeitz, Dec 28, 2022)
- `1398266` add range query compatible engines (PSeitz, Dec 28, 2022)
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
web/* linguist-documentation
17 changes: 16 additions & 1 deletion .gitignore
@@ -1,7 +1,22 @@
*.iml
.idea
outputs
results
benchmark/target
tantivy/target
lucene/.gradle
lucene/build
engines/bleve-*/bin/
engines/lucene/.gradle
**/idx
engines/**/build
**/target
**/node_modules
**/.readymade
**/.gradle
**/out/
**/.cargo
**/perf.data*
**/flamegraph.svg
wiki-articles.json.bz2
wiki-articles.json
corpus.json
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "engines/pisa-0.8.2/pisa"]
path = engines/pisa-0.8.2/pisa
url = https://github.com/pisa-engine/pisa.git
48 changes: 48 additions & 0 deletions CONTRIBUTE.md
@@ -0,0 +1,48 @@
# Adding another engine

Currently only tantivy and lucene are supported, but you can add another search
engine by creating a directory in the `engines` directory and adding a `Makefile`
implementing the following commands:

## clean

Removes all files, including the built index, and your compiled bench program.

## index

Starts a program that will receive documents from stdin and build a search
index. Check out the lucene implementation for reference.

Stemming should be disabled. Tokenization should be something reasonably close to Lucene's
[StandardTokenizer](https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/standard/StandardTokenizer.html). Discrepancies should be documented in `README.md`.
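
As a sketch of the expected shape, assuming the corpus arrives as one JSON document per line on stdin (`index_document` is a hypothetical hook into your engine, not part of the benchmark):

```python
import json
import sys

def index_document(doc: dict) -> None:
    # Hypothetical hook: replace with a real call into your engine.
    pass

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # Assumption: the corpus is newline-delimited JSON, one document per line.
    index_document(json.loads(line))
```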

## serve

Starts a program that will read tests from stdin, and output
a result hit count as fast as possible. *If flushing on newline is not your
language's default behavior, be sure to flush stdout after writing your answer.*

The tests consist of a command followed by a query.

The command describes the type of operation that should
be performed. Right now there are three commands:

- `COUNT`: outputs the count of matching documents.
- `TOP10`: computes the top 10 documents; just outputs "1".
- `TOP10_COUNT`: computes the top 10 documents and the overall count of matching documents; outputs the document count.

Scores for these commands should be as close as possible to Lucene's BM25.
If BM25 is not available, fall back to TF-IDF. If TF-IDF is not available,
just implement whatever is available to you. Make sure to document any differences in the `README.md` file.

Queries are expressed in the Lucene query language.

If a command is not supported, just print "UNSUPPORTED" to stdout.
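
For illustration, a minimal `serve` loop might look like the sketch below. It assumes each test arrives on its own line with the command separated from the query by a single space; `run_query` is a hypothetical wrapper around your engine that returns the string to print (a count, or "1" for `TOP10`):

```python
import sys

def run_query(command: str, query: str) -> str:
    # Hypothetical hook: run the query against your engine.
    raise NotImplementedError

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    command, _, query = line.partition(" ")
    if command not in ("COUNT", "TOP10", "TOP10_COUNT"):
        print("UNSUPPORTED")
    else:
        try:
            print(run_query(command, query))
        except NotImplementedError:
            print("UNSUPPORTED")
    # Flush after every answer, as the protocol requires.
    sys.stdout.flush()
```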


# Adding tests

If you would like a command to be added, please open an issue.
Wanting to show a specific case where your engine shines is a perfectly valid motivation.

`TOP10` should give some advantage to engines implementing variations of the `WAND` algorithm.
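
The advantage comes from pruning: since `TOP10` does not require counting every match, an engine can skip documents that provably cannot enter the top 10. Below is a rough, unoptimized illustration of the idea behind `WAND` over in-memory posting lists; it is a sketch of the technique, not any engine's actual implementation. Each term carries an upper bound on the score it can contribute, and any document whose combined upper bounds cannot beat the current top-k threshold is skipped without being scored.

```python
import heapq

def wand_top_k(postings, upper_bounds, k):
    # postings: term -> list of (doc_id, score) sorted by doc_id
    # upper_bounds: term -> maximum score that term can contribute to any doc
    cursors = {t: 0 for t in postings}
    top_k = []       # min-heap of (score, doc_id) holding the current top k
    threshold = 0.0  # score a doc must beat to enter the top k

    def doc(t):
        return postings[t][cursors[t]][0]

    while True:
        live = sorted(
            (t for t in cursors if cursors[t] < len(postings[t])), key=doc
        )
        if not live:
            break
        # Pivot selection: walk terms in doc order, summing upper bounds.
        # No doc before the pivot's current doc can possibly beat the threshold.
        acc, pivot = 0.0, None
        for t in live:
            acc += upper_bounds[t]
            if acc > threshold:
                pivot = t
                break
        if pivot is None:
            break  # even all remaining terms together cannot beat the threshold
        pivot_doc = doc(pivot)
        if doc(live[0]) == pivot_doc:
            # All cursors up to the pivot are aligned: fully score this doc.
            score = 0.0
            for t in live:
                if doc(t) != pivot_doc:
                    break
                score += postings[t][cursors[t]][1]
                cursors[t] += 1
            if len(top_k) < k:
                heapq.heappush(top_k, (score, pivot_doc))
            elif score > top_k[0][0]:
                heapq.heapreplace(top_k, (score, pivot_doc))
            if len(top_k) == k:
                threshold = top_k[0][0]
        else:
            # Skip the lagging cursor forward; docs in between cannot qualify.
            t = live[0]
            while cursors[t] < len(postings[t]) and doc(t) < pivot_doc:
                cursors[t] += 1
    return sorted(((d, s) for s, d in top_k), key=lambda x: x[1], reverse=True)
```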
2 changes: 0 additions & 2 deletions INDEX_TYPES.txt

This file was deleted.

47 changes: 47 additions & 0 deletions Makefile
@@ -0,0 +1,47 @@
CORPUS := $(shell pwd)/corpus.json
export

WIKI_SRC = "https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2"

COMMANDS ?= TOP_10 TOP_10_COUNT COUNT

# ENGINES ?= tantivy-0.13 lucene-8.4.0 pisa-0.8.2 rucene-0.1 bleve-0.8.0-scorch rucene-0.1 tantivy-0.11 tantivy-0.16 tantivy-0.17 tantivy-0.18 tantivy-0.19
# ENGINES ?= tantivy-0.16 lucene-8.10.1 pisa-0.8.2 bleve-0.8.0-scorch rucene-0.1
ENGINES ?= tantivy-0.18 tantivy-0.19 lucene-8.10.1

# Engines that have an `id_num` field (u64) indexed for range queries in the format `id_num:[48694410 TO 48694420]`
export RANGE_QUERY_ENABLED_ENGINES ?= tantivy-0.18 tantivy-0.19 lucene-8.10.1 lucene-8.0.0 lucene-7.2.1
PORT ?= 8080

help:
	@grep '^[^#[:space:]].*:' Makefile

all: index

corpus:
	@echo "--- Downloading $(WIKI_SRC) ---"
	@curl -# -L "$(WIKI_SRC)" | bunzip2 -c | python3 corpus_transform.py > $(CORPUS)

clean:
	@echo "--- Cleaning directories ---"
	@rm -fr results
	@for engine in $(ENGINES); do cd ${shell pwd}/engines/$$engine && make clean ; done

index:
	@echo "--- Indexing corpus ---"
	@for engine in $(ENGINES); do cd ${shell pwd}/engines/$$engine && make index ; done

bench:
	@echo "--- Benchmarking ---"
	@rm -fr results
	@mkdir results
	@python3 src/client.py queries.txt $(ENGINES)

compile:
	@echo "--- Compiling binaries ---"
	@for engine in $(ENGINES); do cd ${shell pwd}/engines/$$engine && make compile ; done

serve:
	@echo "--- Serving results ---"
	@cp results.json web/build/results.json
	@cd web/build && python3 -m http.server $(PORT)
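
The exported `RANGE_QUERY_ENABLED_ENGINES` variable tells the benchmark driver which engines can answer range queries. As a hedged sketch of how a driver could consume it (not necessarily how `src/client.py` actually does), assuming range queries are recognizable by their `id_num:[... TO ...]` syntax:

```python
import os

# The Makefile exports RANGE_QUERY_ENABLED_ENGINES; turn it into a set of names.
RANGE_ENGINES = set(os.environ.get("RANGE_QUERY_ENABLED_ENGINES", "").split())

def should_run(engine: str, query: str) -> bool:
    """Skip range queries (e.g. `id_num:[48694410 TO 48694420]`)
    for engines that do not index the id_num field."""
    is_range_query = "id_num:[" in query
    return not is_range_query or engine in RANGE_ENGINES
```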
134 changes: 88 additions & 46 deletions README.md
@@ -1,79 +1,121 @@
# Welcome to Search Benchmark, the Game!

This repository is a standardized benchmark for comparing the speed of various
aspects of search engine technologies.

The results are available [here](https://tantivy-search.github.io/bench/).

This benchmark is both
- **for users**, to make it easy to compare different libraries
- **for library developers**, to identify optimization opportunities by comparing
their implementation to other implementations.

Currently, the benchmark only includes Lucene and tantivy.
It is reasonably simple to add another engine.

You are free to communicate about the results of this benchmark **in
a reasonable manner**.
For instance, twisting this benchmark in marketing material to claim that your search engine is 31x faster than Lucene,
because your product was 31x faster on one of the tests, is not tolerated. If this happens, the benchmark will publicly
host a wall of shame.
Bullshit claims about performance are a plague in the database world.


## The benchmark

Different search engine implementations are benchmarked on several real-life tests.
The corpus used is the English Wikipedia. Stemming is disabled. Queries have been derived
from the [AOL query dataset](https://en.wikipedia.org/wiki/AOL_search_data_leak)
(but do not contain any personal information).

Out of a random sample of queries, we kept those that contain at least two terms and yield at least one hit when run as
a phrase query.

We then run each of these queries as:
- `intersection`
- `union`
- `phrase query`

with the following collection options:
- `COUNT`: only count documents; no need to score them
- `TOP 10`: identify the 10 documents with the best BM25 score.
- `TOP 10 + COUNT`: identify the 10 documents with the best BM25 score, and count the matching documents.

We also artificially reintroduced a couple of term queries with different term frequencies.
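
To make these shapes concrete, here is what a few tests could look like, using Lucene query syntax (`+` marks a required term): an intersection, a union, and a phrase query. The command spellings and the line layout are illustrative assumptions, not the literal content of `queries.txt`.

```
COUNT +barack +obama
TOP10 barack obama
TOP10_COUNT "barack obama"
```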

All tests are run once beforehand in order to make sure that
- all of the data is loaded and in the page cache
- Java's JIT has already kicked in.

Tests are run in a single thread.
Out of 5 runs, we only retain the best score, so garbage collection likely does not matter.


## Engine specific details

### Lucene

- The query cache is disabled.
- GC should not influence the results, as we pick the best out of 5 runs.
- The JVM used was OpenJDK 10.0.1 2018-04-17.

### Tantivy

- Tantivy returns slightly more results because its tokenizer handles apostrophes differently.
- Tantivy and Lucene both use BM25 and should return almost identical scores.


# Reproducing

These instructions will get you a copy of the project up and running on your local machine.

### Prerequisites

The Lucene benchmarks require Java and Gradle. Gradle can be installed from [the Gradle website](https://gradle.org/).
The tantivy benchmarks and the benchmark driver code require Cargo. It can be installed using [rustup](https://www.rustup.rs/).

### Installing

Clone this repo.

```
git clone git@github.com:tantivy-search/search-benchmark-game.git
```

And that's it!

## Running

Check out the [Makefile](Makefile) for all available commands. You can adjust the `ENGINES` parameter for a different set of engines.

Run `make corpus` to download and unzip the corpus used in the benchmark.

```
make corpus
```

Run `make index` to create the indices for the engines.

```
make index
```

Run `make bench` to build the different projects and run the benches.
This command may take more than 30 minutes.

```
make bench
```

The results are written to a `results.json` file.

You can then check your results out by running:

```
make serve
```

And open the following in your browser: [http://localhost:8000/](http://localhost:8000/)

# Adding another search engine

See `CONTRIBUTE.md`.