benchmark questions #816

Closed

mooreniemi opened this issue Apr 20, 2020 · 2 comments
@mooreniemi commented Apr 20, 2020

I love your benchmark site!

But since I noticed all phrase queries seem slower in Lucene, I wondered whether shingles had been used. I am not sure what would make the comparison more "fair", but it would help to know when making the choice.

In general, is there a page describing the index configuration, the machine used, etc.? And could latency percentiles be reported instead of just the average (along with queries per second)? The README says Tantivy is "usually faster" than Lucene, but the average latency listed is actually slower (9,963 μs vs 5,849 μs).

Please let me know if I can answer my questions myself somewhere in the code.

@fulmicoton (Collaborator) commented Apr 21, 2020

Hello!

The benchmark lives in a different repository, which has its own issue tracker.

Anyway, let me answer you here, inline with your questions.

> I love your benchmark site!

Thank you

> But since I noticed all phrase queries seem slower in Lucene, I wondered whether shingles had been used. I am not sure what would make the comparison more "fair", but it would help to know when making the choice.

The benchmark does not use shingles at all, and they would definitely have a huge impact on phrase queries. It is very difficult to decide where to draw the line on this kind of issue.
Maybe in the future there should be more than one Lucene configuration with different settings and the trade-offs clearly explained; for the moment, the benchmark will stay as it is.
The benchmark is mostly helpful for developers, to get an idea of how much headroom there is.
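(For concreteness: enabling shingles on the Lucene side would look roughly like the sketch below. This is not the benchmark's actual configuration; the index path and field name are made up for illustration.)

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ShingleIndexSketch {
    public static void main(String[] args) throws Exception {
        // Emit 2-word shingles ("new york") alongside unigrams.
        // Phrase queries get much cheaper, at the cost of a larger
        // index and slower indexing -- exactly the trade-off that
        // would have to be explained in the benchmark.
        Analyzer analyzer = new ShingleAnalyzerWrapper(new StandardAnalyzer(), 2, 2);

        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/shingle-index")), config)) {
            Document doc = new Document();
            doc.add(new TextField("body", "new york city", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}
```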

> In general, is there a page describing the index configuration, the machine used, etc.? And could latency percentiles be reported instead of just the average (along with queries per second)? The README says Tantivy is "usually faster" than Lucene, but the average latency listed is actually slower (9,963 μs vs 5,849 μs).

About the README: it is simply out of date. I wrote that before Lucene 8, but Lucene 8 introduced an optimization called block-max WAND, so the story is more complicated now.
Lucene is much faster for unions with top-K collection.
Tantivy is much faster for intersections and phrase queries.
I will update it (probably today).
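(To make the union case concrete, here is a minimal sketch of the kind of search that benefits, with a made-up index path and field name. In Lucene 8, a plain top-K search stops counting total hits exhaustively past a threshold, which is what lets the block-max WAND scorer skip documents that cannot enter the top K.)

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class TopKUnionSketch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/tmp/bench-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // A union (OR) of two terms.
            BooleanQuery union = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("body", "new")), BooleanClause.Occur.SHOULD)
                    .add(new TermQuery(new Term("body", "york")), BooleanClause.Occur.SHOULD)
                    .build();

            // search(query, k) does not track exact hit counts past a
            // threshold, enabling block-max WAND to skip blocks whose
            // maximum possible score is not competitive.
            TopDocs top10 = searcher.search(union, 10);
            System.out.println(top10.scoreDocs.length + " hits collected");
        }
    }
}
```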

Of course, for users, actual throughput (more or less equivalent to average timings under load) or latency figures (some percentile of those timings, also under load) would be more interesting.
But once again, my main goal with this benchmark is to find headroom for tantivy... From that perspective, I prefer to look at the min numbers.
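(For anyone who wants percentiles from their own runs in the meantime, the computation over raw per-query timings is straightforward. The sketch below uses the nearest-rank method and made-up sample latencies.)

```java
import java.util.Arrays;

public class LatencyPercentiles {
    // Nearest-rank percentile over a sorted sample of per-query latencies (μs).
    static long percentile(long[] sortedMicros, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedMicros.length) - 1;
        return sortedMicros[Math.max(0, rank)];
    }

    public static void main(String[] args) {
        // Illustrative sample values only.
        long[] micros = {5321, 5849, 6010, 7450, 9963, 12040, 15500};
        Arrays.sort(micros);
        System.out.printf("p50=%dμs p90=%dμs p99=%dμs%n",
                percentile(micros, 50), percentile(micros, 90), percentile(micros, 99));
    }
}
```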

Another thing: I was a bit worried that, in the case of Lucene, the GC might have a very strong effect on the latency figures. That could have led to endless comments on JVM settings, which I would prefer to avoid. In reality, the variance is higher for Lucene, but not bad at all.

For the hardware, I will open a ticket to capture and display relevant hardware information. Feel free to take that ticket if you want to contribute!

@fulmicoton (Collaborator)

Closing.
