Update utils.py - remove prefix wildcards #822

pgulley · 2024-10-15T19:29:47Z

Prefix wildcards have a huge performance cost. Removing here as a precursor to more work fixing up and standardizing how the url_search_strings work in the directory itself.

rahulbot

Is this the right place to add escaping for the : character? It needs a preceding backslash to be escaped properly: http://gobo.com needs to be http\://gobo.com

pgulley · 2024-10-17T17:12:05Z

@Evan-Leon Any comments on Rahul's question? I'm not sure what all the context is, but this does seem like a sane place to put some escaping.

philbudne · 2024-10-17T17:54:34Z

I agree with Rahul that quoting doesn't belong in the url_search_string field of the database: it's error prone, and (at least in theory) search engine specific. BUT, I'm QUITE wary of having "http:" or "https:" in the database; the URL scheme _could_ vary (both contemporaneously and over time), and I think we should investigate the cost of generating:

(url:http\://URL_SEARCH_STRING OR url:https\://URL_SEARCH_STRING)

As I've noted recently, the normalized URL, which is hashed to generate the ES _id (unique key) and is "flattened" to always be "http:", would have simplified our lives in this case, but it isn't stored in ES (hindsight is 20:10).
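For illustration, here is a minimal sketch of generating that clause. The function name `scheme_safe_url_clause` is hypothetical and is not taken from utils.py. The sketch assumes the stored string normally carries no scheme and strips one defensively if present. It only escapes the colon after the scheme and does not handle other reserved query characters.

```python
def scheme_safe_url_clause(url_search_string: str) -> str:
    """Build a clause that matches the string under both http and https.

    Hypothetical helper, not the code merged in this PR.
    """
    s = url_search_string.strip()
    # Assumption: the database value should be scheme-free; drop a scheme
    # defensively if one slipped in.
    for scheme in ("https://", "http://"):
        if s.startswith(scheme):
            s = s[len(scheme):]
            break
    # The colon after the scheme is escaped with a backslash, as discussed
    # earlier in this thread.
    return f"(url:http\\://{s} OR url:https\\://{s})"


print(scheme_safe_url_clause("gobo.com/*"))
```

Keeping the scheme and the escaping in the query builder means the database value stays plain and search-engine neutral.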

rahulbot · 2024-10-17T18:25:27Z

I hear the concern about hardcoding the URL scheme. I had my test harness still up in Jupyter, so I ran a quick test, and it looks like @philbudne's proposal (scheme-independent) runs at a similar speed to my original test.

[Screenshot: Media_Cloud_—_api-mc-news-…__14__-_JupyterLab]

pgulley · 2024-10-17T19:47:24Z

I guess this touches on the question I posed in #823 - What exactly is the standard form we want to impose on the url-search-string?
I also see the wisdom, from a UI perspective, of omitting the scheme from the database, and I think the spot-check @rahulbot posted is confidence building. I can double-check a few other cases so we know there's no caching interfering. I don't understand the Elasticsearch internals well enough to know how plausible it is that this would scale in general.

pgulley · 2024-10-17T19:57:09Z

I guess the other question begged here is whether we want to bother with the postfix wildcard too, and just enforce that we want a wildcard in the database.

rahulbot · 2024-10-17T20:01:31Z

We have to balance flexibility and robustness here.
In this case, I don't think hardcoding the scheme reduces the flexibility of the feature for us as users in a meaningful way.
However, I do think letting the postfix * be written in the field leaves open possibilities for future use that we haven't imagined, and it is a useful thing for our collection maintenance staff to be aware of (that the wildcard means match-anything). I'd leave that in the UI and add a front-end warning if * is not included in the field before save (as a field validation rule).
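A rough sketch of such a validation rule follows, written in Python for consistency with utils.py even though the real check would live in the front end. The function name and the warning text are assumptions made for illustration.

```python
def validate_url_search_string(value: str) -> list[str]:
    """Return warnings for a url_search_string field (hypothetical rule).

    An empty list means the value passes. The rule only warns; it does not
    block the save or rewrite the value.
    """
    warnings = []
    if not value.strip().endswith("*"):
        warnings.append(
            "url_search_string has no trailing wildcard (*); "
            "the wildcard means match-anything"
        )
    return warnings


print(validate_url_search_string("gobo.com"))
```

Returning warnings rather than raising keeps the wildcard optional, which preserves the flexibility argued for above.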

philbudne · 2024-10-17T21:15:53Z

Getting into the weeds: a peeve of mine that has likely become (more) moot with the banishment of prefix wildcards is the subject of terminal slashes. If the url_search_string is JUST the domain: "foobar.do.ma.in" then "foobar.do.ma.in/*" is the "proper" search string as opposed to "foobar.do.ma.in*", but given the current state of the url_search_string column, I think that's a subtlety unlikely to be handled correctly.

I suppose we could still be caught off guard by articles published by "nytimes.com.ru" being pulled in by a "nytimes.com*" search string. Similar, but less likely to be problematic, is: do.ma.in/thing* vs do.ma.in/thing/*

And while it's tempting to think about how the UI could be made smarter, CSV imports are likely to inject badness without anyone to interact with on a line-by-line basis...
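The over-matching described here can be seen in a small sketch. It uses Python's `fnmatch.fnmatchcase` as a stand-in for wildcard matching; that is an assumption made for illustration, not how Elasticsearch evaluates the query. The article paths are made up.

```python
from fnmatch import fnmatchcase

# Made-up URLs (scheme omitted) used only to show the matching behaviour.
urls = [
    "nytimes.com/2024/10/17/story.html",
    "nytimes.com.ru/2024/10/17/story.html",
    "do.ma.in/thing/page",
    "do.ma.in/things-to-do",
]


def matched(pattern: str) -> list[str]:
    """Return the URLs that a glob-style pattern matches."""
    return [u for u in urls if fnmatchcase(u, pattern)]


print(matched("nytimes.com*"))    # also pulls in nytimes.com.ru
print(matched("nytimes.com/*"))   # only nytimes.com
print(matched("do.ma.in/thing*"))   # also pulls in /things-to-do
print(matched("do.ma.in/thing/*"))  # only paths under /thing/
```

The terminal slash anchors the match at a domain or path boundary, which is why the "/*" form is the safer search string.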

pgulley · 2024-10-17T22:13:03Z

Just a sanity check: these are runtimes for star queries over a single day for 15 random sources. q1 is the current production approach, and q2 is the approach @phil described: (url:http\://URL_SEARCH_STRING OR url:https\://URL_SEARCH_STRING). Pretty clear improvement, and I think the robustness gained here against variability in the scheme is worth the potential impact of duplicating the search string in the query.

Agree that a UI filter is a good idea, and also agree that CSV imports raise a problem here. I wonder if this is just something we add to the list of source health checks. Double-checking the correctness of the url-search-string occasionally seems easy enough.
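As a sketch of what such a health check might look like, the following audits CSV rows for the problems raised in this thread: a stored scheme, a prefix wildcard, and a missing trailing wildcard. The `name` column and the function name are hypothetical; only `url_search_string` comes from the discussion.

```python
import csv
import io


def audit_url_search_strings(csv_text: str) -> list[tuple[str, str]]:
    """Return (source name, problem) pairs for suspicious search strings.

    Hypothetical health check; rows with an empty url_search_string are
    skipped rather than reported.
    """
    problems = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        s = (row.get("url_search_string") or "").strip()
        if not s:
            continue
        name = row.get("name") or ""
        if s.startswith(("http://", "https://")):
            problems.append((name, "includes a URL scheme"))
        if s.startswith("*"):
            problems.append((name, "starts with a prefix wildcard"))
        if not s.endswith("*"):
            problems.append((name, "has no trailing wildcard"))
    return problems


sample = (
    "name,url_search_string\n"
    "A,gobo.com/*\n"
    "B,http://gobo.com/*\n"
    "C,*gobo.com\n"
    "D,\n"
)
for name, problem in audit_url_search_strings(sample):
    print(name, problem)
```

Because it only reports and never rewrites, a check like this could run over imported data without anyone reviewing each line at import time.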

rahulbot · 2024-10-18T16:04:02Z

Did escaping of : get added to this? I think that is important.

pgulley · 2024-10-18T16:18:07Z

We determined that we weren't going to have schemes stored in the database, so there are only the hardcoded http:// OR https:// prefixes, which do manually escape the colon, yes.

Update utils.py - remove prefix wildcards

9991732

Prefix wildcards have a huge performance cost. Removing here as a precursor.

pgulley requested review from rahulbot and philbudne October 15, 2024 19:29

philbudne approved these changes Oct 15, 2024

rahulbot reviewed Oct 15, 2024

pgulley requested a review from Evan-Leon October 17, 2024 17:11

pgulley mentioned this pull request Oct 17, 2024

URL Search string fixes in the directory #823

Open

Update utils.py - scheme-safe url-search-string

6598cd9

Evan-Leon approved these changes Oct 18, 2024

rahulbot self-requested a review October 18, 2024 16:18

pgulley merged commit f42eabd into main Oct 18, 2024
