Highlighting Not Honoring Slop #20

Open
mkrauklis opened this issue Oct 25, 2016 · 12 comments

@mkrauklis

mkrauklis commented Oct 25, 2016

I should first note this exact bug exists with the Plain Highlighter in 1.7.x-2.4.x and is supposedly fixed in 5.x as reported here: elastic/elasticsearch#18246

The root of the problem is that although we only get hits on documents that meet the slop criteria for a span_near query, the highlighter highlights all instances of the matched terms regardless of proximity.

For example, the following request:

POST /idx/_search
{
   "highlight": {
      "pre_tags": [
         "|~|"
      ],
      "post_tags": [
         "|^|"
      ],
      "order": "score",
      "fields": {
         "fieldx": {
            "fragment_size": 1000,
            "number_of_fragments": 3,
            "type": "experimental",
            "matched_fields": [
               "fieldx",
               "fieldx.exact"
            ]
         }
      }
   },
   "query": {
      "span_near": {
         "clauses": [
            {
               "span_term": {
                  "fieldx.exact": {
                     "value": "term1"
                  }
               }
            },
            {
               "span_term": {
                  "fieldx.exact": {
                     "value": "term2"
                  }
               }
            }
         ],
         "slop": 2,
         "in_order": false
      }
   }
}

Returns the following highlight:

...
"fieldx": ["Shouldn't hit |~|term1|^| because it's not close enough to |~|term2|^|. However, the distance between the second instances of |~|term1|^| and |~|term2|^| are close enough and should get highlighted."]
...

In this example the highlights for the first instance of the search terms are incorrect.

We've traced this into the PostingsHitEnum's iteration through the PostingsEnum, which returns offsets for every instance of the matched terms, including instances that fall outside the span constraints.
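To make the expected behavior concrete, here's a self-contained sketch that checks which term pairs in the example text satisfy slop = 2. It uses a naive whitespace tokenizer, and reduces slop to "number of intervening tokens"; a real analyzer tokenizes differently, and Lucene's span_near semantics are more involved for more than two clauses, but the two-term case comes out the same:

```java
import java.util.ArrayList;
import java.util.List;

public class SlopCheck {
    // Collect the token positions at which a term occurs.
    public static List<Integer> positionsOf(String[] tokens, String term) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                positions.add(i);
            }
        }
        return positions;
    }

    // Simplified slop test for two terms: the number of tokens between the
    // two positions must not exceed the slop.
    public static boolean withinSlop(int p1, int p2, int slop) {
        return Math.abs(p1 - p2) - 1 <= slop;
    }

    public static void main(String[] args) {
        String text = "Shouldn't hit term1 because it's not close enough to term2. "
                + "However, the distance between the second instances of term1 and term2 "
                + "are close enough and should get highlighted.";
        // Naive whitespace tokenizer with punctuation stripped; illustrative only.
        String[] tokens = text.replaceAll("[.,]", "").split("\\s+");
        List<Integer> t1 = positionsOf(tokens, "term1");
        List<Integer> t2 = positionsOf(tokens, "term2");
        System.out.println(withinSlop(t1.get(0), t2.get(0), 2)); // first pair: false (6 tokens apart)
        System.out.println(withinSlop(t1.get(1), t2.get(1), 2)); // second pair: true (1 token apart)
    }
}
```

Only the second term1/term2 pair passes the check, so only it should be highlighted.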

@nomoa
Member

nomoa commented Oct 26, 2016

Yes, sadly spans are not fully supported. When inspecting the query in the QueryFlattener we only extract term information; position info is not tracked.
I'll try to find some time to have a closer look, thanks.

@mkrauklis
Author

Thanks @nomoa. Just so I understand, you're saying the problem is basically that when the automaton is generated it doesn't take into account the SpanNearQuery's slop?

More specifically that the slop gets lost in QueryFlattener.flattenQuery here:

    protected void flattenQuery(SpanNearQuery query, float pathBoost, Object sourceOverride,
            IndexReader reader, Callback callback) {
        pathBoost *= query.getBoost();
        for (SpanQuery clause : query.getClauses()) {
            flattenSpan(clause, pathBoost, sourceOverride, reader, callback);
        }
    }

Where the SpanNearQuery knows the slop but the constituent clauses do not.

Thanks for any help. I'm just trying to wrap my head around the high level algorithm so I can help wherever possible.

@nomoa
Member

nomoa commented Oct 26, 2016

Yes, exactly. As you can see, we only flatten the clauses, but we should keep track of span info such as slop. The highlighter currently only supports PhraseQuery, by calling callback#Phrase methods to keep positional information. Unfortunately spans are more complex than phrase queries, and supporting them does not seem trivial; it's unclear to me how we can do that yet, possibly by reusing WeightedSpanTermExtractor?

@mkrauklis
Author

We are thinking about modifying the flattenQuery(SpanNearQuery...) method to build an Automaton object directly. To handle nested span queries it could perhaps pass around the current Automaton state node, defaulting to the initial state. Once the span query automaton is built up, BasicQueryWeigher.FlattenerCallback.flattened(Automaton...) could be called (instead of the flattened overload that takes a term byte reference) to add it to the list of automata. We're still working through the concept, but I thought I'd bounce it off you to see if you had any feedback.

@nomoa
Member

nomoa commented Oct 26, 2016

Thanks for digging into this!

I must confess that I don't have much time to explore possible solutions.
Modifying flattenQuery(SpanNearQuery...) is certainly the first thing to do. But I'm not sure I follow your solution with a custom Automaton. You mean you would like to extend Automaton to store additional info?
Note that Automaton should be considered something very similar to BytesRef; it's just a way to handle MultiTermQueries (Prefix/Fuzzy/WildcardQuery) without calling Query#rewrite.
If you need to pass a more complex object to the callback, I'd suggest using a dedicated class rather than extending Automaton.

At a glance I would change FlattenerCallback to allow capturing span states, which would then allow you to create custom HitEnum subclasses to handle spans. This would be very similar to what we do for phrases:

  • FlattenerCallback would expose specific methods to capture the span states: startSpan(int slop, boolean ordered)/endSpan().
  • Then BasicQueryWeigher would generate SpanNearHitEnumWrappers to properly emit span matches.

Handling nested spans will certainly be non-trivial; e.g. FlattenerCallback#startSpan may be called multiple times before endSpan is called.

But if you have a solution that works I'd be happy to review it.

Thanks!

@mkrauklis
Author

I wasn't expecting we'd have to extend Automaton, as there's already an example in the unit tests for TermAutomatonQuery that handles slop. See TestTermAutomatonQuery.java:

  // "comes sun" or "comes * sun"
  public void testBasicSlop() throws Exception {
    Directory dir = newDirectory();
    RandomIndexWriter w = new RandomIndexWriter(random(), dir);
    Document doc = new Document();
    doc.add(newTextField("field", "here comes the sun", Field.Store.NO));
    w.addDocument(doc);

    doc = new Document();
    doc.add(newTextField("field", "here comes sun", Field.Store.NO));
    w.addDocument(doc);

    doc = new Document();
    doc.add(newTextField("field", "here comes the other sun", Field.Store.NO));
    w.addDocument(doc);
    IndexReader r = w.getReader();
    IndexSearcher s = newSearcher(r);

    TermAutomatonQuery q = new TermAutomatonQuery("field");
    int init = q.createState();
    int s1 = q.createState();
    q.addTransition(init, s1, "comes");
    int s2 = q.createState();
    q.addAnyTransition(s1, s2);
    int s3 = q.createState();
    q.setAccept(s3, true);
    q.addTransition(s1, s3, "sun");
    q.addTransition(s2, s3, "sun");
    q.finish();

    assertEquals(2, s.search(q, 1).totalHits);

    w.close();
    r.close();
    dir.close();
  }

Maybe this wouldn't get us far enough, but it's the road we were walking down. We'll take a peek at what you suggested today as well. Thanks for your feedback!

@dsmiley

dsmiley commented Oct 27, 2016

I'm confused; how does TermAutomatonQuery relate to highlighting? Unfortunately, TermAutomatonQuery isn't a SpanQuery derivative and/or doesn't expose the offsets of matching terms; although it could be modified to do so.

@mkrauklis
Author

Could someone correct my understanding of how queryWeight is being used to filter terms matched outside a phrase? I see all the PostingsHitEnums for a field get grouped under a MergeHitEnumWrapper, which is wrapped in nested PhraseHitEnumWrappers (one for each phrase for the field). These are then wrapped by a WeightFilteredHitEnumWrapper. When the hits come through, if the query weight is not positive, the PhraseHitEnumWrapper checks to see if it's a phrase hit; if so, it bumps up the query weight.

But I can't see what's making the queryWeight 0 to begin with. As I iterate through the PostingsHitEnum hits, the queryWeight always seems to be 1. I was trying to use the same basic algorithm for a new SpanHitEnumWrapper class, but if the next calls to wrapped always result in a queryWeight of 1, I'm sunk.

I'm missing something here but I can't see what. I'll keep digging, but a gentle nudge in the right direction would certainly help the effort. :)

@mkrauklis
Author

mkrauklis commented Oct 31, 2016

Alrighty, I think I figured it out. I was missing that in QueryFlattener we were passing a 0 for the weight of terms that were part of a phrase.
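For anyone else following along, the mechanism as I now understand it can be sketched like this (the class and method names below are illustrative stand-ins, not the plugin's actual API): terms inside a phrase are flattened with weight 0, the phrase wrapper bumps the weight for positions that were validated as part of a real match, and the weight filter drops anything still non-positive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class WeightFilterSketch {
    // Minimal stand-in for a hit emitted by a HitEnum:
    // a term at a token position, with a query weight.
    public static class Hit {
        public final String term;
        public final int position;
        public final float weight;
        public Hit(String term, int position, float weight) {
            this.term = term;
            this.position = position;
            this.weight = weight;
        }
    }

    // Stand-in for the phrase wrapper: bump the weight of zero-weight hits
    // whose positions were validated as part of a phrase (or span) match.
    public static List<Hit> applyPhraseWeights(List<Hit> hits, Set<Integer> validPositions) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            float w = (h.weight <= 0 && validPositions.contains(h.position)) ? 1f : h.weight;
            out.add(new Hit(h.term, h.position, w));
        }
        return out;
    }

    // Stand-in for the weight filter: only positive-weight hits
    // survive to be highlighted.
    public static List<Hit> filterPositive(List<Hit> hits) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            if (h.weight > 0) {
                out.add(h);
            }
        }
        return out;
    }
}
```

With the example document, all four term hits start at weight 0; only the two positions validated by the span check get bumped, so only they survive the filter and get highlighted.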

I think I have a partial solution implemented, piggybacking on a lot of the concepts for phrase handling. So far I seem to have it working for SpanTerm and SpanNear queries and would like to run the concept past you experts before I spend time implementing the rest.

The high level solution is to, while flattening the query, build up a list of root-level SpanValidator trees. Then, when wrapping (BasicQueryWeigher.wrap), for each root-level validator I wrap the current HitEnum with a SpanHitEnumWrapper.

The SpanHitEnumWrapper tries to validate the current hit. If there is a partial match (e.g. we have a hit on one of the SpanTerms in a list of span_near clauses) it looks forward (wrapped.next()) until the span is either fully matched or ruled out, for example when the positions of the current hits fall outside the bounds allowed by slop. During validation, each acceptable hit is stored by its validator node.
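As a rough illustration of the validation idea (names are hypothetical; the real SpanValidator would handle arbitrary clause trees, sources, and ordered spans), a validator for a two-clause unordered span_near could consume hits in position order and remember which positions belong to a valid match:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NearValidator {
    private final String termA;
    private final String termB;
    private final int slop;
    private final List<Integer> aPositions = new ArrayList<>();
    private final List<Integer> bPositions = new ArrayList<>();
    private final Set<Integer> validPositions = new HashSet<>();

    public NearValidator(String termA, String termB, int slop) {
        this.termA = termA;
        this.termB = termB;
        this.slop = slop;
    }

    // Feed each hit; a position becomes valid once it pairs with a hit of the
    // other term that has at most `slop` tokens in between.
    public void onHit(String term, int position) {
        List<Integer> mine;
        List<Integer> other;
        if (term.equals(termA)) {
            mine = aPositions;
            other = bPositions;
        } else if (term.equals(termB)) {
            mine = bPositions;
            other = aPositions;
        } else {
            return; // not one of our clauses
        }
        mine.add(position);
        for (int p : other) {
            if (Math.abs(position - p) - 1 <= slop) {
                validPositions.add(position);
                validPositions.add(p);
            }
        }
    }

    public boolean isSuccessful() {
        return !validPositions.isEmpty();
    }

    public boolean isPositionValid(int position) {
        return validPositions.contains(position);
    }
}
```

Feeding it the four hits from the example document (term1 at positions 2 and 18, term2 at 9 and 20, slop 2) marks only positions 18 and 20 as valid.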

Lastly, when SpanHitEnumWrapper.queryWeight() is called it checks with its validator to see if the given source/position combination was recorded as a successful hit during validation:

    @Override
    public float queryWeight() {
        float queryWeight = toReplay.queryWeight();

        if (queryWeight <= 0
                && spanValidator.isSuccessful()
                && spanValidator.isSourceAtPositionValid(source(), position())) {
            queryWeight = 1;
        }

        return queryWeight;
    }

A little more detail:

I added functions to QueryFlattener.Callback to mark the start and end of the span types:

        void startSpanTermQuery(SpanTermQuery query);
        void endSpanTermQuery(SpanTermQuery query, int source);
        void startSpanNearQuery(SpanNearQuery query);
        void endSpanNearQuery(SpanNearQuery query);

These call back into BasicQueryWeigher, which maintains a stack of Lists of SpanValidator instances. When the bottom of the stack is popped off, the validator is added to the list of root-level validators for the given field.
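A minimal sketch of that stack bookkeeping (hypothetical names, simplified to SpanNearQuery only; the real implementation tracks clause terms and sources as well):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class SpanCallbackSketch {
    // Hypothetical validator node; just enough structure to show nesting.
    public static class SpanNode {
        public final int slop;
        public final boolean inOrder;
        public final List<SpanNode> children = new ArrayList<>();
        public SpanNode(int slop, boolean inOrder) {
            this.slop = slop;
            this.inOrder = inOrder;
        }
    }

    private final Deque<SpanNode> stack = new ArrayDeque<>();
    public final List<SpanNode> roots = new ArrayList<>();

    // Called when the flattener enters a SpanNearQuery. Nested calls push
    // deeper; the parent-child links record the query structure.
    public void startSpanNearQuery(int slop, boolean inOrder) {
        SpanNode node = new SpanNode(slop, inOrder);
        if (!stack.isEmpty()) {
            stack.peek().children.add(node);
        }
        stack.push(node);
    }

    // Called when the flattener leaves the SpanNearQuery. Popping the last
    // frame promotes the node to a root-level validator.
    public void endSpanNearQuery() {
        SpanNode node = stack.pop();
        if (stack.isEmpty()) {
            roots.add(node);
        }
    }
}
```

A span_near nested inside another span_near produces one root with one child, which is why startSpan can fire several times before the matching endSpan.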

Concerns:

  • It seems to be following the pattern established by phrases but with this solution we're throwing away queryWeight/boost for nested span queries. This seems like it could be more problematic than it was for phrases.
  • I see that when we encounter a term multiple times we update the termInfo to have a new source. I followed the pattern established in phrase handling and performed a source check for SpanTerm to see if it's a hit. I'm not sure what impact changing the source on termInfo would have in this case (if any).

@nomoa
Member

nomoa commented Nov 2, 2016

@mkrauklis everything you describe sounds good to me.

Concerning queryWeight: if we want to do the same as phrases, we need to use the top-level boost recorded when flattening, and as you said, you may have to forget nested span query boosts. So instead of returning 1 in SpanHitEnumWrapper#queryWeight(), would it be possible to return the boost you capture when you flatten the query?

Concerning the problem with duplicated terms I have no idea... I'd need to spend some time in the code to tell. Would you mind opening a PR so we can discuss implementation details?

This looks very promising, thanks!

@mkrauklis
Author

@nomoa I'm doing this on company time so I'm just waiting to get the approval to contribute back to this project. We've done this for other projects so it shouldn't be a problem. Just takes a few days for everything to get approved. Once that's all set I'll open a PR.

@mkrauklis
Author

Sorry for the delay. I got a little held up in red tape, but I was finally able to move forward with the pull request. In the meantime I implemented handling of span_or and span_multi.

See: #21
