Update utils.py - remove prefix wildcards #822
Prefix wildcards have a huge performance cost. Removing here as a precursor.
Is this the right place to add escaping for ":"? It requires a prefixed backslash to escape it properly: http://gobo.com
needs to be http\://gobo.com
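A minimal sketch of the escaping described here, assuming a hypothetical helper named `escape_colons` (the name and its placement in utils.py are illustrative, not taken from the PR):

```python
def escape_colons(value: str) -> str:
    """Backslash-escape every colon so a query parser treats it literally.

    Hypothetical helper for illustration only.
    """
    return value.replace(":", "\\:")


print(escape_colons("http://gobo.com"))  # http\://gobo.com
```

Note that this is not idempotent: calling it on an already-escaped string adds a second backslash, so it should run exactly once, at query-build time.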
@Evan-Leon Any comments on Rahul's question? I'm not sure what all the context is, but this does seem like a sane place to put some escaping.
I agree with Rahul that quoting doesn't belong in the
url_search_string field of the database: it's error prone, and (at
least in theory) search engine specific.
BUT, I'm QUITE wary of having "http:" or "https:" in the database; the
URL scheme _could_ vary (both contemporaneously and over time), and I
think we should investigate the cost of generating:
(url:http\://URL_SEARCH_STRING OR url:https\://URL_SEARCH_STRING)
As I've noted recently, the normalized URL, which is hashed to
generate the ES _id (unique key), is "flattened" to always be "http:".
It would have simplified our lives in this case, but it isn't stored
in ES (hindsight is 20:10).
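The scheme-independent form proposed above, (url:http\://URL_SEARCH_STRING OR url:https\://URL_SEARCH_STRING), could be generated roughly as follows. This is only a sketch: `build_url_clause` is a hypothetical name, and it assumes the stored url_search_string carries no scheme of its own.

```python
# Escaped schemes are hardcoded; the stored search string is assumed scheme-free.
ESCAPED_SCHEMES = ("http\\://", "https\\://")


def build_url_clause(url_search_string: str) -> str:
    """Return one OR-group covering both schemes for a single search string.

    Hypothetical helper for illustration only.
    """
    clauses = [f"url:{scheme}{url_search_string}" for scheme in ESCAPED_SCHEMES]
    return "(" + " OR ".join(clauses) + ")"


print(build_url_clause("foobar.do.ma.in/*"))
```

Only the colon is escaped here, as in the thread; whether other characters in the search string need escaping depends on the search engine's query syntax.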
I hear the concern about hardcoding the URL scheme. I had my test harness still up in Jupyter, so I ran a quick test and it looks like @philbudne's proposal (scheme-independent) is similar in speed to my original test.
I guess this touches on the question I posed in #823: what exactly is the standard form we want to impose on the url-search-string?
I guess the other question begged here is whether we want to bother with the postfix wildcard too, and just enforce that we want a wildcard in the database.
We have to balance flexibility and robustness here.
Getting into the weeds:
A peeve of mine that has likely become (more) moot with the banishment
of prefix wildcards is the subject of terminal slashes.
If the url_search_string is JUST the domain: "foobar.do.ma.in" then
"foobar.do.ma.in/*" is the "proper" search string as opposed to
"foobar.do.ma.in*" but given the current state of url_search_string
column, I think that's a subtlety unlikely to be handled correctly.
I suppose we could still be caught offguard by articles published by
"nytimes.com.ru" being pulled in by a "nytimes.com*" search string.
Similar, but less likely to be problematic is:
do.ma.in/thing* vs do.ma.in/thing/*
And while it's tempting to think about how the UI could be made
smarter, CSV imports are likely to inject badness without anyone to
interact with on a line by line basis...
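The terminal-slash concern can be illustrated with Python's `fnmatch`, used here only as a stand-in for the search engine's wildcard matching (an assumption: the real matching happens in the search engine, not in Python). The sample URLs are made up.

```python
from fnmatch import fnmatchcase

# Made-up, scheme-less URLs for illustration.
urls = ["nytimes.com/2024/story", "nytimes.com.ru/2024/story"]


def matching(pattern: str, candidates: list) -> list:
    """Return the candidates matched by a trailing-wildcard pattern."""
    return [u for u in candidates if fnmatchcase(u, pattern)]


print(matching("nytimes.com*", urls))   # both URLs match
print(matching("nytimes.com/*", urls))  # only the nytimes.com URL matches
```

The same reasoning applies to do.ma.in/thing* versus do.ma.in/thing/*: without the slash, the wildcard also matches paths such as do.ma.in/thingamajig.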
Just a sanity check: these are runtimes for star queries over a single day for 15 random sources. q1 is the current production approach, and q2 is the approach @phil described: (url:http://URL_SEARCH_STRING OR url:https://URL_SEARCH_STRING). Pretty clear improvement, and I think the robustness gained here against variability in the scheme is worth potential impacts from duplicating the search-string in the query. Agree that a UI filter is a good idea, and also agree that CSV imports raise a problem here. I wonder if this is just something we add to the list of source health checks; double-checking the correctness of the url-search-string occasionally seems easy enough.
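A periodic check of this kind might look like the sketch below. The function name and all three rules are assumptions drawn from points raised in this thread (no stored scheme, no prefix wildcard, a slash before a domain-only trailing wildcard); the actual standard form is still the open question from #823.

```python
def check_url_search_string(value: str) -> list:
    """Return a list of suspected problems; an empty list means none found.

    Hypothetical health check; the rules are illustrative assumptions.
    """
    problems = []
    if value.startswith(("http://", "https://")):
        problems.append("includes a URL scheme")
    if value.startswith("*"):
        problems.append("starts with a prefix wildcard")
    if value.endswith("*") and "/" not in value:
        problems.append("wildcard directly follows the domain; consider '/*'")
    return problems


for candidate in ["nytimes.com/*", "nytimes.com*", "*nytimes.com/*"]:
    print(candidate, check_url_search_string(candidate))
```

Returning a list of findings rather than raising lets a batch job, such as one run after a CSV import, report every suspect row in one pass.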
Did escaping
We determined that we weren't going to have schemes stored in the database, so there's only the hardcoded http:// OR https:// which do manually escape the colon, yes.
Prefix wildcards have a huge performance cost. Removing here as a precursor to more work fixing up and standardizing how the url_search_strings work in the directory itself.