benchmark questions #816

Closed

mooreniemi opened this issue Apr 20, 2020 · 2 comments
@mooreniemi commented Apr 20, 2020

I love your benchmark site!

But since I noticed all phrase queries seem slower in Lucene, I wondered whether shingles had been used. I am not sure what would make the comparison more "fair", but it would help to know when making the choice.

In general, is there a page describing the index configuration, the machine used, etc.? And could latency percentiles be reported instead of just the average (along with queries per second)? The README says Tantivy is "usually faster" than Lucene, but the average latency listed is actually slower (9,963 μs vs 5,849 μs).

Please let me know if I can answer my questions myself somewhere in the code.

@fulmicoton (Collaborator) commented Apr 21, 2020

Hello!

The benchmark lives in a different repository, which has its own issue tracker.

Anyway, let me answer you here, inline with your questions.

> I love your benchmark site!

Thank you

> But since I noticed all phrase queries seem slower in Lucene, I wondered whether shingles had been used. I am not sure what would make the comparison more "fair", but it would help to know when making the choice.

The benchmark does not use shingles at all, and they would definitely have a huge impact on phrase queries. It is very difficult to decide where to draw the line on this kind of issue.
Maybe in the future there should be more than one Lucene configuration with different settings and the trade-offs clearly explained; for the moment, the benchmark will stay as it is.
The benchmark is mostly helpful for developers, to get an idea of how much headroom there is.
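(For concreteness: enabling shingles on the Lucene side would look roughly like the sketch below. This is not the benchmark's actual configuration; the index path and field name are made up for illustration.)

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ShingleIndexSketch {
    public static void main(String[] args) throws Exception {
        // Emit 2-word shingles ("new york") alongside unigrams.
        // Phrase queries get much cheaper, at the cost of a larger
        // index and slower indexing -- exactly the trade-off that
        // would have to be explained in the benchmark.
        Analyzer analyzer = new ShingleAnalyzerWrapper(new StandardAnalyzer(), 2, 2);

        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/shingle-index")), config)) {
            Document doc = new Document();
            doc.add(new TextField("body", "new york city", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}
```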

> In general, is there a page describing the index configuration, the machine used, etc.? And could latency percentiles be reported instead of just the average (along with queries per second)? The README says Tantivy is "usually faster" than Lucene, but the average latency listed is actually slower (9,963 μs vs 5,849 μs).

About the README: it is simply out of date. I wrote that before Lucene 8, but Lucene 8 introduced an optimization called block-max WAND, so the story is more complicated now.
Lucene is much faster for unions with top-K collection.
Tantivy is much faster for intersections and phrase queries.
I will update it (probably today).
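(To make the union case concrete, here is a minimal sketch of the kind of search that benefits, with a made-up index path and field name. In Lucene 8, a plain top-K search stops counting total hits exhaustively past a threshold, which is what lets the block-max WAND scorer skip documents that cannot enter the top K.)

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class TopKUnionSketch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/tmp/bench-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // A union (OR) of two terms.
            BooleanQuery union = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("body", "new")), BooleanClause.Occur.SHOULD)
                    .add(new TermQuery(new Term("body", "york")), BooleanClause.Occur.SHOULD)
                    .build();

            // search(query, k) does not track exact hit counts past a
            // threshold, enabling block-max WAND to skip blocks whose
            // maximum possible score is not competitive.
            TopDocs top10 = searcher.search(union, 10);
            System.out.println(top10.scoreDocs.length + " hits collected");
        }
    }
}
```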

Of course, for users, actual throughput (more or less equivalent to average timings under load) or latency figures (some percentile of those timings, also under load) would be more interesting.
But once again, my main goal with this benchmark is to find headroom for tantivy... From that perspective, I prefer to look at the min numbers.
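(For anyone who wants percentiles from their own runs in the meantime, the computation over raw per-query timings is straightforward. The sketch below uses the nearest-rank method and made-up sample latencies.)

```java
import java.util.Arrays;

public class LatencyPercentiles {
    // Nearest-rank percentile over a sorted sample of per-query latencies (μs).
    static long percentile(long[] sortedMicros, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedMicros.length) - 1;
        return sortedMicros[Math.max(0, rank)];
    }

    public static void main(String[] args) {
        // Illustrative sample values only.
        long[] micros = {5321, 5849, 6010, 7450, 9963, 12040, 15500};
        Arrays.sort(micros);
        System.out.printf("p50=%dμs p90=%dμs p99=%dμs%n",
                percentile(micros, 50), percentile(micros, 90), percentile(micros, 99));
    }
}
```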

Another thing: I was a bit worried that, in the case of Lucene, the GC might have a very strong effect on the latency figures. That could have led to endless comments on JVM settings, which I would prefer to avoid. In reality, the variance is higher for Lucene, but not bad at all.

For the hardware, I will open a ticket to capture and display relevant hardware information. Feel free to take that ticket if you want to contribute!

@fulmicoton (Collaborator)

Closing.
