Larger hit count when adding quotes #1508
Comments
https://github.com/ualbertalib/blacklight_solr_conf/blob/bc89d514ca857214c57bc2cb023e0d6dfe4a79c6/solr/blacklight-core/conf/solrconfig.xml#L101 mm is Minimum Should Match. If there are 1 or 2 clauses, then all are required. So for these queries, british missions south pacific (319) has four clauses, of which three are required, and british missions "south pacific" (1,358) has three clauses, of which two are required. (You can see this in the debugged query, where the clause group ends in ~2 vs ~3; that suffix is the minimum-should-match count, not slop.) So I believe that british missions "south pacific" returns more because british missions (1,297) satisfies this condition on its own.
Interesting @pgwillia, thanks for investigating. The weird part is still that in libraries we would typically think of quotes as reducing the number of hits, because the expectation is that all terms in the one quoted clause are required. I'm still confused as to why other Blacklight institutions don't experience this problem, so I'm wondering if it actually relates to a data/mapping problem? Not sure if this is significant, but here is Stanford's mm parameter: 6<-1 6<90%
Stanford's example confuses me a little bit. Here's how I'd interpret it: If there are 1, 2, 3, 4, 5 or 6 clauses, then all are required. Playing with some values, this plays out
From the Solr documentation: "Defines multiple conditions, each one being valid only for numbers greater than the one before it." https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Themm_MinimumShouldMatch_Parameter
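To make the mm reading above concrete, here is a minimal Python sketch of how a conditional mm spec could be evaluated. It is a re-implementation of the rule as the Solr documentation describes it, not Solr's own code. The function name min_should_match is made up for illustration, and percentages are truncated toward zero, following the documentation's note that percentage results are rounded down.

```python
import re


def min_should_match(clause_count: int, spec: str) -> int:
    """Return how many optional clauses must match under an mm spec.

    Illustrative sketch only: the name and structure are assumptions,
    written from the documented behaviour of the mm parameter.
    """
    spec = re.sub(r"\s*<\s*", "<", spec.strip())  # tolerate "6 < -1"
    result = clause_count

    if "<" in spec:
        # Conditional spec(s): each "N<value" applies only when the
        # clause count is greater than N; the last applicable one wins.
        for condition in spec.split():
            bound, _, value = condition.partition("<")
            if clause_count <= int(bound):
                return result
            result = min_should_match(clause_count, value)
        return result

    if spec.endswith("%"):
        # Percentage of the clauses, truncated toward zero.
        calc = int(clause_count * int(spec[:-1]) / 100)
    else:
        calc = int(spec)

    # A negative value means "all clauses except this many".
    result = clause_count + calc if calc < 0 else calc
    return max(0, min(result, clause_count))


# Stanford's spec as quoted in this thread.
for n in range(1, 11):
    print(n, min_should_match(n, "6<-1 6<90%"))
```

With Stanford's spec, every count up to 6 requires all clauses, and above 6 the 90% rule is the last condition that applies. With a hypothetical spec of 2<-1 (an assumption; the value at the linked solrconfig.xml line may contain more conditions), the function gives 2 of 3 and 3 of 4, the same counts described in the first comment.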
No, I don't think it relates to a data/mapping problem. Increasing the minimum to match (like Stanford using 6) would hide this behaviour. If you're looking to require a clause, try something like british missions +"south pacific". The + will mark that clause as required.
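A small sketch of sending both forms of the query so the hit counts and debug output can be compared side by side. The host, port, and /select handler are assumptions (the core name is borrowed from the path in the linked config); q, rows, and debugQuery are standard Solr request parameters. One thing worth watching: a literal + in a URL query string is normally decoded as a space, so it has to be percent-encoded or the required marker is lost.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; adjust host, port and handler as needed.
SOLR_SELECT_URL = "http://localhost:8983/solr/blacklight-core/select"


def build_query_url(q: str) -> str:
    """Build a select URL that returns only the hit count plus debug info."""
    params = {
        "q": q,
        "rows": 0,             # only numFound is of interest here
        "debugQuery": "true",  # include the parsed query in the response
    }
    # urlencode percent-encodes "+" as %2B and quotes as %22.
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


optional_phrase_url = build_query_url('british missions "south pacific"')
required_phrase_url = build_query_url('british missions +"south pacific"')

print(optional_phrase_url)
print(required_phrase_url)
```

Comparing the parsed query in the two debug responses should show whether the phrase clause has moved from the optional group to a required one.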
@pgwillia it's interesting even to compare hit counts of a query for just british missions south pacific between UAL and other libraries using Solr. With the exception of Johns Hopkins, our results always tend to stand out. By comparison:
It's possible that we've got much more on this topic than other institutions, but it's unlikely that the discrepancy would be this high, considering these are very large libraries. It seems as if most libraries have increased the minimum to match parameter so that Solr behaves more like a library catalogue.
Thanks for testing this @pgwillia, it's good to see more precision when phrase searches are added to a query. Looking at these results, I wonder if phrase slop is also an issue in our Solr config? The terms in question are pretty far apart in these fields. I know that this can be specified either at the Solr config level or in the query, and it would be interesting to experiment with this.
Read some other interesting discussions about this problem, with one recommendation being to move to other indexing schemes like ngrams, analyzing after stopwords are removed.
@pgwillia I think I was talking about the results from the first query (British missions south pacific). For example, in the first record, terms are quite far apart in fields: https://search.library.ualberta.ca/catalog/3537384
No. I was completely wrong. This was addressed correctly by the qf fields.
Looking at this fresh today, I'm not sure why I had that (alpha?) order. Seeing the score lets us know that there are three that are very relevant and the other seven not as much. Maybe this makes more sense:
Describe the bug
Larger hit count when adding quotes, e.g.
british missions south pacific gives 318
british missions "south pacific" gives 1351
Expected behaviour: adding quotes to a query reduces the number of hits