-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add support for structured queries (opensearch only) #815
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall structure is fine. I've added some smaller comments regarding style as well as comments regarding index structure.
My main concern is the handling of mismatches or missing address parts in the structured query. The lenient 'should' matching produces a lot of false positive results. This would be rather unexpected.
private static final float StreetBoost = 5.0f; // we filter streets in the wrong city / district / ... so we can use a high boost value | ||
private static final float HouseNumberBoost = 10.0f; | ||
private static final float HouseNumberUnmatchedBoost = 5f; | ||
private static final float FactorForWrongLanguage = 0.1f; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nitpicking: can you make these constant upper-case.
.copyTo("collector.base", "collector." + lang))); | ||
} | ||
|
||
mappings.properties("name." + lang, | ||
b -> b.text(p -> p.index(false) | ||
b -> b.text(p -> p.index(supportStructuredQueries) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really need this index? Names really should be queried only through the .ngrams or .raw indexes.
|
||
public boolean hasStreet() { return this.street != null || this.houseNumber != null; } | ||
|
||
public boolean hasHouseNumber() { return this.houseNumber != null; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to extend all these to check for the empty string. (Or do the check in the set functions and set to null when the input string is empty.)
@@ -183,6 +184,16 @@ public DatabaseProperties recreateIndex(String[] languages, Date importDate) thr | |||
return dbProperties; | |||
} | |||
|
|||
private void createAndPutIndexMapping(String[] languages, boolean supportStructuredQueries) | |||
{ | |||
if(supportStructuredQueries) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nitpicking: please leave spaces between if
and the bracket. (Also in other parts of the code.)
} | ||
} | ||
|
||
private void AddQuery(Query clause) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style-nitpicking: functions should start with a lower-case letter.
.boost(boost) | ||
.build() | ||
.toQuery()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of creating a complex query over all languages here, you could also work with a collector field for each address part. Define a field 'state_collector' which is indexed but not stored. Then instead of setting the index for 'state.', you copy to 'state_collector'.
You'd loose the boost factor for wrong language. In my experiments it made little enough difference in accuracy for quite a bit of performance boost.
var query4QueryBuilder = new AddressQueryBuilder(lenient, language, languages) | ||
.addCountryCode(request.getCountryCode()) | ||
.addState(request.getState(), request.hasCounty() || request.hasCityOrPostcode()) | ||
.addCounty(request.getCounty(), request.hasCityOrPostcode()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those two are missing hasStreet()
in the 'hasMoreDetails' field. It might seem that it doesn't make much sense to have 'street' and 'county' in the query but no 'city' parameter, but that has never stopped users from trying anyway.
.build() | ||
.toQuery()) | ||
.boost(boost); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the logic here. I would expect that you'd always search in the address field when hasMoreDetails is true and in the name otherwise. Right now it looks like you can also return results where all searched fields are in the address only. That seems rather confusing. For example a query with /structured?city=vaduz
returns the city of Vaduz as expected but then also a lot of address results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the idea was that you still find the city if the street/housenumber is wrong (or not yet in openstreetmap) but I clearly messed something up. Returning houses or streets when no street is given is obviously a bad idea and I thought I handled that case (and failed) -> have to investigate.
.boost(HouseNumberBoost) | ||
.build() | ||
.toQuery()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: bad indent
@@ -32,12 +36,12 @@ public IndexMapping addLanguages(String[] languages) { | |||
for (var field: ADDRESS_FIELDS) { | |||
mappings.properties(String.format("%s.%s", field, lang), | |||
b -> b.text(p -> p | |||
.index(false) | |||
.index(shouldIndexAddressField(field)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the new indexes simply use the default analyser. I wonder if 'analyser("index_raw")' and searchAnalyser("search") wouldn't be better here. At a minimum, it gives you German Umlaut-Folding (ö vs. oe).
You can set analysers unconditionally. They simply have no effect when the index is disabled.
Sorry for creating the conflict. I found #816 while reviewing this PR. Please just rebase on master at some point. I don't mind force pushes to PR branches. |
ae6a9d7
to
d9bc19f
Compare
results = sendQuery(buildQuery(photonRequest, true).buildQuery(), extLimit); | ||
|
||
if (results.hits().total().value() == 0 && photonRequest.hasStreet()) { | ||
var street = photonRequest.getStreet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this turned out to be easier than having an unmaintainable complex query.
Of course this could also be done on the client side...
// match only name | ||
if (hasPostCode) { | ||
// post code can implicitly specify a district that has the city in the address field (instead of the name) | ||
combinedQuery = QueryBuilders.bool() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. /structured?state=Baden Württemberg&district=Plotzsägmühle&housenumber=2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The query works much better now.
You don't allow typos in the second lenient try. Is this intentional? I would have expected auto fuzziness to be enabled for it.
/structured/?city=vaduz&street=lettstrasse
still returns me the bus stop "Lettstrasse". We might need a layer filter here.
The other thing that stands out is that the usually abbreviations don't work ('strasse' -> 'str'). The freetext search mitigates that somewhat by always doing a prefix search so that some abbreviations just accidentally work (doesn't help with 'road' -> 'rd', though). We don't have this here. I would defer this topic to later. Just wanted to mention it, in case somebody tries this and doesn't get results.
On the nitpicking side: there are still some function names that start with an upper-case letter.
} | ||
|
||
mappings.properties(propertyName, | ||
b -> b.text(p -> p.copyTo(collectors))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the 'index(false)' is gone. I'm not entirely sure what the default is, so prefer explicit statements.
mappings.properties(field + ".default", b -> b.text(p -> p | ||
.index(false) | ||
.copyTo("collector.default", "collector.base"))); | ||
.index(shouldIndexAddressField(field)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you still need an index here?
Luckily I got rid of the bus stops without an overly complex layer filter. There was already a filter to filter out results with house numbers if the request does not contain one. Bus stops and a few other locations passed that filter because they have no house number. Therefore fixing the filter from "no house number" to "no house number and not a house" should be sufficient. |
You may even want to put a global filter on the query "when type is house then it must have a house number". And also add a global filter for "type != other". This should consistently filter out all object that are not "address-like". |
…not address-like.
Looks good as a first version. We can fine-tune in follow-up PRs, if necessary. |
Oh dear, now I forgot about documentation. Do you mind adding a little bit of documentation for the new call in a separate PR? I would suggest to do this in an extra file |
Yes, I can add some documentation. I'm off next week, not sure whether I get to it before the 20th. |
this adds a new function
Supported parameters are
Result format is equal to /api.
For those parameters that are also available for /api the meaning is the same.
CountryCode has to be the ISO2 code (e.g. DE instead of "Germany"). The country code must match exactly. Fuzzy matching does not make sense for 2 letter codes. In my experience the case that address is fine except for the country code is rare compared to nonsense hits in other countries.
In order to use structured queries, the new option "-structured" has to be used when importing from Nominatim. Expect 10-20% higher file size. If you run photon on an existing index folder, photon detects automatically if the data was imported with structured query support.
For data imported without "-structured" /structured?city=berlin returns a technical error as the route is not mapped.
Structured queries are only supported for OpenSearch based photon.
Known issues