add range queries #42

Open · wants to merge 61 commits into `master`.

Commits:
- `0e18c8b` Rewrite of the drive part (fulmicoton, Jun 17, 2018)
- `3fa10b1` Bench working (fulmicoton, Jun 18, 2018)
- `cef3018` Update (fulmicoton, Jun 18, 2018)
- `1f4b868` Update bench (fulmicoton, Jun 26, 2018)
- `17be4eb` added output ui (fulmicoton, Aug 2, 2018)
- `53b2810` Added tantivy files (fulmicoton, Aug 7, 2018)
- `4f97097` Editing README.md (fulmicoton, Aug 8, 2018)
- `f41a581` Updated README (fulmicoton, Aug 8, 2018)
- `67e70a0` Update README (fulmicoton, Aug 9, 2018)
- `ccc5f9c` Edited the text for the benchmark (fulmicoton, Aug 10, 2018)
- `6d90680` Updated results.json (fulmicoton, Aug 11, 2018)
- `7bc233f` Added two phase critic query (fulmicoton, Aug 31, 2018)
- `b954b13` add .gitattributes (Sep 16, 2018)
- `6521580` Merge pull request #2 from vmchale/master (fulmicoton, Sep 16, 2018)
- `4dcf3ef` blop (fulmicoton, Nov 19, 2018)
- `cda9e22` Merge branch 'master' of github.com:fulmicoton/search-index-benchmark… (fulmicoton, Nov 19, 2018)
- `b975e8a` blop (fulmicoton, Nov 19, 2018)
- `15c7851` Using query.count (fulmicoton, Nov 20, 2018)
- `c6a1cdd` 0.6 0.7 and 0.8 (fulmicoton, Jan 13, 2019)
- `7af30ae` updated engines / update results.json (fulmicoton, Jan 13, 2019)
- `6c07b8b` Update the link to benchmark results webpage (#9) (petr-tik, Jan 25, 2019)
- `8204900` Added tantivy 0.9 (fulmicoton, Mar 21, 2019)
- `c55668f` Merge branch 'master' of github.com:fulmicoton/search-index-benchmark… (fulmicoton, Mar 21, 2019)
- `ce78af1` Added Lucene 8.0.0 (fulmicoton, Apr 12, 2019)
- `e9cd109` Updated bench (fulmicoton, Apr 14, 2019)
- `928286f` added regression script (fulmicoton, May 24, 2019)
- `05785e7` added most recent results (fulmicoton, Jun 16, 2019)
- `11a1ab0` Change in tantivy 0.9 build (fulmicoton, Jul 29, 2019)
- `9a7bffb` Add Bleve (#14) (mosuka, Sep 26, 2019)
- `edcd256` Added rucene. (fulmicoton, Dec 23, 2019)
- `4ea6ad9` Avoid storing term vectors. (fulmicoton, Dec 23, 2019)
- `2f389ba` Made COMMANDS, ENGINES, PORT setable from env (fulmicoton, Dec 24, 2019)
- `90a95e6` Reordered collectors (fulmicoton, Dec 24, 2019)
- `2bd1362` Simplify the corpus to make sure everyone tokenize it the same way. (fulmicoton, Dec 27, 2019)
- `959efcc` Bugfix first collector. (fulmicoton, Dec 27, 2019)
- `0063719` Added corpus_transform file (fulmicoton, Dec 27, 2019)
- `194a08f` ignore documents with empty url (#15) (mosuka, Dec 28, 2019)
- `639711c` Updated results (fulmicoton, Dec 28, 2019)
- `77fad70` update results,json (fulmicoton, Dec 28, 2019)
- `9fe0548` Added lucene 8.4.0 (fulmicoton, Dec 30, 2019)
- `0f965c9` Special phrase query (#17) (amallia, Apr 1, 2020)
- `ff8819d` Integrate PISA v0.8.2 (#16) (amallia, Apr 2, 2020)
- `3f19eda` Updated results.json (fulmicoton, Apr 2, 2020)
- `619ffb2` Added tantivy 0.13 (fulmicoton, Aug 20, 2020)
- `62b9bb7` Removed coffeescript and gulp. Now using webpack (fulmicoton, Oct 17, 2020)
- `1cbaa6f` Update Makefile (#24) (lengyijun, Mar 12, 2021)
- `0f18fb4` Update index.js (#25) (lengyijun, Mar 12, 2021)
- `825edaa` Update README.md (#26) (PSeitz, Apr 20, 2021)
- `3ed3d88` Added +nighty for lucene (fulmicoton, Jun 15, 2021)
- `4772771` Pointing to the main commit (fulmicoton, Jun 15, 2021)
- `fba4795` Updated rucene (fulmicoton, Jun 15, 2021)
- `c0e0a7a` Add command file (fulmicoton, Oct 15, 2021)
- `41a28e6` Added a Dockerfile to fix compilation. (fulmicoton, Oct 24, 2021)
- `b0a5e46` Added tantivy 0.16 (fulmicoton, Oct 25, 2021)
- `a595eaa` Updated Makefile (fulmicoton, Oct 25, 2021)
- `3d1d431` Updated rucene (fulmicoton, Oct 25, 2021)
- `1e8a752` Remove Max score comment in Readme (fulmicoton, Nov 19, 2021)
- `985601d` add tantivy versions 0.17, 0.18 and 0,19 (PSeitz, Dec 28, 2022)
- `7cee369` Merge pull request #40 from quickwit-oss/add_engines (PSeitz, Dec 28, 2022)
- `b956323` add range queries (PSeitz, Dec 28, 2022)
- `1398266` add range query compatible engines (PSeitz, Dec 28, 2022)
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
web/* linguist-documentation
17 changes: 16 additions & 1 deletion .gitignore
@@ -1,7 +1,22 @@
*.iml
.idea
outputs
results
benchmark/target
tantivy/target
lucene/.gradle
lucene/build
engines/bleve-*/bin/
engines/lucene/.gradle
**/idx
engines/**/build
**/target
**/node_modules
**/.readymade
**/.gradle
**/out/
**/.cargo
**/perf.data*
**/flamegraph.svg
wiki-articles.json.bz2
wiki-articles.json
corpus.json
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "engines/pisa-0.8.2/pisa"]
path = engines/pisa-0.8.2/pisa
url = https://github.com/pisa-engine/pisa.git
48 changes: 48 additions & 0 deletions CONTRIBUTE.md
@@ -0,0 +1,48 @@
# Adding another engine

Currently only tantivy and lucene are supported, but you can add another search
engine by creating a directory in the `engines` directory and adding a `Makefile`
implementing the following commands:

## clean

Removes all files, including the built index, and your compiled bench program.

## index

Starts a program that will receive documents from stdin and build a search
index. Check out the lucene implementation for reference.

Stemming should be disabled. Tokenization should be something reasonably close to Lucene's
[StandardTokenizer](https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/standard/StandardTokenizer.html). Discrepancies should be documented in `README.md`.
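
As a sketch of the expected shape, assuming the corpus arrives as one JSON document per line on stdin (`index_document` is a hypothetical hook into your engine, not part of the benchmark):

```python
import json
import sys

def index_document(doc: dict) -> None:
    # Hypothetical hook: replace with a real call into your engine.
    pass

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # Assumption: the corpus is newline-delimited JSON, one document per line.
    index_document(json.loads(line))
```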

## serve

Starts a program that will read tests from stdin, and output
a result hit count as fast as possible. *If flushing on newline is not your
language's default behavior, be sure to flush stdout after writing your answer.*

The tests consist of a command followed by a query.

The command describes the type of operation that should
be performed. Right now there are three commands:

- `COUNT`: outputs the count of matching documents.
- `TOP10`: computes the top 10 documents; just outputs "1".
- `TOP10_COUNT`: computes the top 10 documents and the overall count of matching documents; outputs the document count.

Scores for these commands should be as close as possible to Lucene's BM25.
If BM25 is not available, fall back to TF-IDF. If TF-IDF is not available,
just implement whatever is available to you. Make sure to document any differences in the `README.md` file.

Queries are expressed in the Lucene query language.

If a command is not supported, just print "UNSUPPORTED" to stdout.
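
For illustration, a minimal `serve` loop might look like the sketch below. It assumes each test arrives on its own line with the command separated from the query by a single space; `run_query` is a hypothetical wrapper around your engine that returns the string to print (a count, or "1" for `TOP10`):

```python
import sys

def run_query(command: str, query: str) -> str:
    # Hypothetical hook: run the query against your engine.
    raise NotImplementedError

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    command, _, query = line.partition(" ")
    if command not in ("COUNT", "TOP10", "TOP10_COUNT"):
        print("UNSUPPORTED")
    else:
        try:
            print(run_query(command, query))
        except NotImplementedError:
            print("UNSUPPORTED")
    # Flush after every answer, as the protocol requires.
    sys.stdout.flush()
```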


# Adding tests

If you would like a command to be added, please open an issue.
Wanting to show a specific case where your engine shines is a perfectly valid motivation.

`TOP10` should give some advantage to engines implementing variations of the `WAND` algorithm.
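
The advantage comes from pruning: since `TOP10` does not require counting every match, an engine can skip documents that provably cannot enter the top 10. Below is a rough, unoptimized illustration of the idea behind `WAND` over in-memory posting lists; it is a sketch of the technique, not any engine's actual implementation. Each term carries an upper bound on the score it can contribute, and any document whose combined upper bounds cannot beat the current top-k threshold is skipped without being scored.

```python
import heapq

def wand_top_k(postings, upper_bounds, k):
    # postings: term -> list of (doc_id, score) sorted by doc_id
    # upper_bounds: term -> maximum score that term can contribute to any doc
    cursors = {t: 0 for t in postings}
    top_k = []       # min-heap of (score, doc_id) holding the current top k
    threshold = 0.0  # score a doc must beat to enter the top k

    def doc(t):
        return postings[t][cursors[t]][0]

    while True:
        live = sorted(
            (t for t in cursors if cursors[t] < len(postings[t])), key=doc
        )
        if not live:
            break
        # Pivot selection: walk terms in doc order, summing upper bounds.
        # No doc before the pivot's current doc can possibly beat the threshold.
        acc, pivot = 0.0, None
        for t in live:
            acc += upper_bounds[t]
            if acc > threshold:
                pivot = t
                break
        if pivot is None:
            break  # even all remaining terms together cannot beat the threshold
        pivot_doc = doc(pivot)
        if doc(live[0]) == pivot_doc:
            # All cursors up to the pivot are aligned: fully score this doc.
            score = 0.0
            for t in live:
                if doc(t) != pivot_doc:
                    break
                score += postings[t][cursors[t]][1]
                cursors[t] += 1
            if len(top_k) < k:
                heapq.heappush(top_k, (score, pivot_doc))
            elif score > top_k[0][0]:
                heapq.heapreplace(top_k, (score, pivot_doc))
            if len(top_k) == k:
                threshold = top_k[0][0]
        else:
            # Skip the lagging cursor forward; docs in between cannot qualify.
            t = live[0]
            while cursors[t] < len(postings[t]) and doc(t) < pivot_doc:
                cursors[t] += 1
    return sorted(((d, s) for s, d in top_k), key=lambda x: x[1], reverse=True)
```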
2 changes: 0 additions & 2 deletions INDEX_TYPES.txt

This file was deleted.

47 changes: 47 additions & 0 deletions Makefile
@@ -0,0 +1,47 @@
CORPUS := $(shell pwd)/corpus.json
export

WIKI_SRC = "https://www.dropbox.com/s/wwnfnu441w1ec9p/wiki-articles.json.bz2"

COMMANDS ?= TOP_10 TOP_10_COUNT COUNT

# ENGINES ?= tantivy-0.13 lucene-8.4.0 pisa-0.8.2 rucene-0.1 bleve-0.8.0-scorch rucene-0.1 tantivy-0.11 tantivy-0.16 tantivy-0.17 tantivy-0.18 tantivy-0.19
# ENGINES ?= tantivy-0.16 lucene-8.10.1 pisa-0.8.2 bleve-0.8.0-scorch rucene-0.1
ENGINES ?= tantivy-0.18 tantivy-0.19 lucene-8.10.1

# Engines that have an `id_num` field (u64) indexed for range queries in the format `id_num:[48694410 TO 48694420]`
export RANGE_QUERY_ENABLED_ENGINES ?= tantivy-0.18 tantivy-0.19 lucene-8.10.1 lucene-8.0.0 lucene-7.2.1
PORT ?= 8080

help:
	@grep '^[^#[:space:]].*:' Makefile

all: index

corpus:
	@echo "--- Downloading $(WIKI_SRC) ---"
	@curl -# -L "$(WIKI_SRC)" | bunzip2 -c | python3 corpus_transform.py > $(CORPUS)

clean:
	@echo "--- Cleaning directories ---"
	@rm -fr results
	@for engine in $(ENGINES); do cd ${shell pwd}/engines/$$engine && make clean ; done

index:
	@echo "--- Indexing corpus ---"
	@for engine in $(ENGINES); do cd ${shell pwd}/engines/$$engine && make index ; done

bench:
	@echo "--- Benchmarking ---"
	@rm -fr results
	@mkdir results
	@python3 src/client.py queries.txt $(ENGINES)

compile:
	@echo "--- Compiling binaries ---"
	@for engine in $(ENGINES); do cd ${shell pwd}/engines/$$engine && make compile ; done

serve:
	@echo "--- Serving results ---"
	@cp results.json web/build/results.json
	@cd web/build && python3 -m http.server $(PORT)
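
The exported `RANGE_QUERY_ENABLED_ENGINES` variable tells the benchmark driver which engines can answer range queries. As a hedged sketch of how a driver could consume it (not necessarily how `src/client.py` actually does), assuming range queries are recognizable by their `id_num:[... TO ...]` syntax:

```python
import os

# The Makefile exports RANGE_QUERY_ENABLED_ENGINES; turn it into a set of names.
RANGE_ENGINES = set(os.environ.get("RANGE_QUERY_ENABLED_ENGINES", "").split())

def should_run(engine: str, query: str) -> bool:
    """Skip range queries (e.g. `id_num:[48694410 TO 48694420]`)
    for engines that do not index the id_num field."""
    is_range_query = "id_num:[" in query
    return not is_range_query or engine in RANGE_ENGINES
```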
134 changes: 88 additions & 46 deletions README.md
@@ -1,79 +1,121 @@
# Welcome to Search Benchmark, the Game!

This repository is a standardized benchmark for comparing the speed of various
aspects of search engine technologies.

The results are available [here](https://tantivy-search.github.io/bench/).

This benchmark is both
- **for users**, to make it easy to compare different libraries
- **for library developers**, to identify optimization opportunities by comparing
their implementation to other implementations.

Currently, the benchmark only includes Lucene and tantivy.
It is reasonably simple to add another engine.

You are free to communicate about the results of this benchmark **in
a reasonable manner**.
For instance, twisting this benchmark in marketing material to claim that your search engine is 31x faster than Lucene,
because your product was 31x faster on one of the tests, is not tolerated. If this happens, the benchmark will publicly
host a wall of shame.
Bullshit claims about performance are a plague in the database world.


## The benchmark

Different search engine implementations are benchmarked on several real-life tests.
The corpus used is the English Wikipedia. Stemming is disabled. Queries have been derived
from the [AOL query dataset](https://en.wikipedia.org/wiki/AOL_search_data_leak)
(but do not contain any personal information).

Out of a random sample of queries, we kept those that contain at least two terms and yield at least one hit when run as
a phrase query.

We then run each of these queries as:
- `intersection`
- `union`
- `phrase query`

with the following collection options:
- `COUNT`: only count documents; no need to score them
- `TOP 10`: identify the 10 documents with the best BM25 score.
- `TOP 10 + COUNT`: identify the 10 documents with the best BM25 score, and count the matching documents.

We also artificially reintroduced a couple of term queries with different term frequencies.
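
To make these shapes concrete, here is what a few tests could look like, using Lucene query syntax (`+` marks a required term): an intersection, a union, and a phrase query. The command spellings and the line layout are illustrative assumptions, not the literal content of `queries.txt`.

```
COUNT +barack +obama
TOP10 barack obama
TOP10_COUNT "barack obama"
```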

All tests are run once beforehand in order to make sure that
- all of the data is loaded and in the page cache
- Java's JIT has already kicked in.

Tests are run in a single thread.
Out of 5 runs, we only retain the best score, so garbage collection likely does not matter.


## Engine specific details

### Lucene

- The query cache is disabled.
- GC should not influence the results, as we pick the best out of 5 runs.
- The JVM used was OpenJDK 10.0.1 2018-04-17.

### Tantivy

- Tantivy returns slightly more results because its tokenizer handles apostrophes differently.
- Tantivy and Lucene both use BM25 and should return almost identical scores.


# Reproducing

These instructions will get you a copy of the project up and running on your local machine.

### Prerequisites

The Lucene benchmarks require Java and Gradle. Gradle can be installed from [the Gradle website](https://gradle.org/).
The tantivy benchmarks and the benchmark driver code require Cargo. It can be installed using [rustup](https://www.rustup.rs/).

### Installing

Clone this repo.

```
git clone git@github.com:tantivy-search/search-benchmark-game.git
```

And that's it!

## Running

Check out the [Makefile](Makefile) for all available commands. You can adjust the `ENGINES` parameter for a different set of engines.

Run `make corpus` to download and unzip the corpus used in the benchmark.

```
make corpus
```

Run `make index` to create the indices for the engines.

```
make index
```

Run `make bench` to build the different projects and run the benches.
This command may take more than 30 minutes.

```
make bench
```

The results are written to a `results.json` file.

You can then check your results out by running:

```
make serve
```

And open the following in your browser: [http://localhost:8000/](http://localhost:8000/)

# Adding another search engine

See `CONTRIBUTE.md`.