-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Highlighting Not Honoring Slop #20
Comments
Yes, sadly spans are not fully supported. When inspecting the query in the QueryFlattener we only extract term information, position info is not tracked. |
Thanks @nomoa. Just so I understand, you're saying the problem is basically when the Automata is generated it doesn't take into account SpanTermQuery.slop? More specifically that the slop gets lost in QueryFlattener.flattenQuery here: protected void flattenQuery(SpanNearQuery query, float pathBoost, Object sourceOverride,
IndexReader reader, Callback callback) {
pathBoost *= query.getBoost();
for (SpanQuery clause : query.getClauses()) {
flattenSpan(clause, pathBoost, sourceOverride, reader, callback);
}
} Where the SpanNearQuery knows the slop but the constituent clauses do not. Thanks for any help. I'm just trying to wrap my head around the high level algorithm so I can help wherever possible. |
Yes exactly, as you can see we only flatten the clauses but we should keep track of span info such as slop. The highlighter currently only support PhraseQuery by calling callback#Phrase methods to keep positional information. Unfortunately spans are more complex than phrase queries and supporting them does not seem trivial and it's unclear to me how we can do that yet, possibly by reusing WeightedSpanTermExtractor? |
We are thinking about modifying the flattenQuery(SpanNearQuery...) method to build an Automaton object directly. To handle nested span queries it could maybe pass around a current Automaton state node, defaulting to the init state. Once the span query automaton is built up BasicQueryWeigher.FlattenerCallback.flattened(Automaton...) could be called (instead of the flattened that takes a term byte reference) to add it to the list of automata. We're still working through the concept but I thought I'd bounce it off you to see if you had any feedback. |
Thanks for digging into this! I must confess that I don't have much time to explore possible solutions. At a glance I would change FlattenerCallback to allow capturing Spans states which then would allow you to create custom HitEnum subclasses to handle Spans. This would be very similar to what we do for Phrases but for Spans:
Handling nested spans will certainly be non trivial e.g. FlattenerCallback#startSpan may be called multiple times before endSpan is called. But if you have a solution that works I'd be happy to review it. Thanks! |
I wasn't expecting we'd have to extend Automaton as there's already an example in the unit tests for TermAutomatonQuery that handles slop. See: // "comes sun" or "comes * sun"
public void testBasicSlop() throws Exception {
Directory dir = newDirectory();
RandomIndexWriter w = new RandomIndexWriter(random(), dir);
Document doc = new Document();
doc.add(newTextField("field", "here comes the sun", Field.Store.NO));
w.addDocument(doc);
doc = new Document();
doc.add(newTextField("field", "here comes sun", Field.Store.NO));
w.addDocument(doc);
doc = new Document();
doc.add(newTextField("field", "here comes the other sun", Field.Store.NO));
w.addDocument(doc);
IndexReader r = w.getReader();
IndexSearcher s = newSearcher(r);
TermAutomatonQuery q = new TermAutomatonQuery("field");
int init = q.createState();
int s1 = q.createState();
q.addTransition(init, s1, "comes");
int s2 = q.createState();
q.addAnyTransition(s1, s2);
int s3 = q.createState();
q.setAccept(s3, true);
q.addTransition(s1, s3, "sun");
q.addTransition(s2, s3, "sun");
q.finish();
assertEquals(2, s.search(q, 1).totalHits);
w.close();
r.close();
dir.close();
} Maybe this wouldn't get us enough, but it's the road we were walking down. We'll take a peek at what you suggested today as well. Thanks for your feedback!! |
I'm confused; how does |
Could someone correct my understanding of how queryWeight is being used to filter terms matched outside a phrase? I see all the PostingsHitEnums for a field get grouped under a MergeHitEnumWrapper which is wrapped in nested PhraseHitEnumWrappers (one for each phrase for the field). These are then wrapped by a WeightFilteredHitEnumWrapper. When the hits come through if the query weight is not positive the PhraseHitEnumWrapper checks to see if it's a phrase hit. If so it bumps up the query weight. But I can't see what's making the queryWeight 0 to begin with. As I'm iterating through the PostingsHitEnum hits the queryWeight always seems to be 1. I was trying to use the same basic algorithm for a new SpanHitEnumWrapper class but if the next calls to wrapped always result in a queryWeight that's 1 I'm sunk. I'm missing something here but I can't see what. I'll keep digging but a gentle nudge in the right direction would certainly help the effort:) |
Alrighty, I think I figured it out. I was missing that in QueryFlattener we were passing a 0 for the weight of terms that were part of a phrase. I think I have a partial solution implemented, piggybacking on a lot of the concepts for phrase handling. So far I seem to have it working for SpanTerm and SpanNear queries and would like to run the concept past you experts before I spend time implementing the rest. The high level solution is to, while flattening the query, build up a list of root-level The Lastly, when @Override
public float queryWeight() {
float queryWeight = toReplay.queryWeight();
if (queryWeight <= 0) {
if (spanValidator.isSuccessful()) {
if (spanValidator.isSourceAtPositionValid(source(), position())) {
queryWeight = 1;
}
}
}
return queryWeight;
} A little more detail: I added functions to QueryFlattener.Callback to mark the start and end of the span types: void startSpanTermQuery(SpanTermQuery query);
void endSpanTermQuery(SpanTermQuery query, int source);
void startSpanNearQuery(SpanNearQuery query);
void endSpanNearQuery(SpanNearQuery query); These call back into Concerns:
|
@mkrauklis everything you describe sounds good to me. Concerning queryWeight if we want to do like phrases we need to use the top-level boost recorded when flattening and as you said you may have to forget nested span query boosts. So instead of returning 1 in SpanHitEnumWrapper#queryWeight() would it be possible to return the boost you could capture when you flatten the query? Concerning the problem with duplicated terms I have no idea... I'd need to spend some time in the code to tell. Would you mind opening a PR so we can discuss implementation details? This looks very promising, thanks! |
@nomoa I'm doing this on company time so I'm just waiting to get the approval to contribute back to this project. We've done this for other projects so it shouldn't be a problem. Just takes a few days for everything to get approved. Once that's all set I'll open a PR. |
Sorry for the delay. Got a little held up in red-tape but I was finally able to move forward with the pull request. In the mean time I implemented handling of span_or and span_multi. See: #21 |
I should first note this exact bug exists with the Plain Highlighter in 1.7.x-2.4.x and is supposedly fixed in 5.x as reported here: elastic/elasticsearch#18246
The root of the problem is that although the we only get hits on documents that meet the slop criteria for a span_near query, the highlighter highlights all instances of the matched terms regardless of proximity.
For example, the following request:
Returns the following highlight:
In this example the highlights for the first instance of the search terms are incorrect.
We've traced this into the PostingsHitEnum's iteration through the PostingsEnum which is returning the offsets for the incorrect term offsets.
The text was updated successfully, but these errors were encountered: