Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update utils.py - remove prefix wildcards #822

Merged
merged 2 commits into from
Oct 18, 2024
Merged

Conversation

pgulley
Copy link
Member

@pgulley pgulley commented Oct 15, 2024

Prefix wildcards have a huge performance cost. Removing here as a precursor to more work fixing up and standardizing how the url_search_strings work in the directory itself.

Prefix wildcards have a huge performance cost. Removing here as a precursor.
Copy link
Collaborator

@rahulbot rahulbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right place to add escaping :? It requires a prefixed backslash to escape it properly. http://gobo.com needs to be http\://gobo.com

@pgulley
Copy link
Member Author

pgulley commented Oct 17, 2024

@Evan-Leon Any comments on Rahul's question? I'm not sure what all the context is, but this does seem like a sane place to put some escaping.

@philbudne
Copy link
Contributor

philbudne commented Oct 17, 2024 via email

@rahulbot
Copy link
Collaborator

I hear the concern about hardcoding URL schema. I had my test harness still up in Jupyter, so I ran a quick test and it looks like @philbudne's proposal (schema-independant) is similar speed to my original test.

Media_Cloud_—_api-mc-news-…__14__-_JupyterLab

@pgulley
Copy link
Member Author

pgulley commented Oct 17, 2024

I guess this touches on the question I posed in #823 - What exactly is the standard form we want to impose on the url-search-string?
I see the wisdom from a UI perspective too of omitting the schema from the database, and I think that the spot-check @rahulbot posted is confidence building. I can double check on a few other cases so we know there's no caching interfering. I don't have enough understanding of the elastic internals to know how plausible it is that that should scale in general.

@pgulley
Copy link
Member Author

pgulley commented Oct 17, 2024

I guess the other question begged here is whether we want to bother with the postfix wildcard too- and just enforce that we want a wildcard in the database.

@rahulbot
Copy link
Collaborator

We have to balance flexibility and robustness here.
In this I don't think coding the schema reduces flexibility of the feature for us as users in a meaningful way.
However I do think letting the post-fixes * be written in the field leaves open possibilities of use in the future that we haven't imagined, and is a useful thing for our collection maintenance staff to be aware of (that wildcard means match-anything). I'd leave that in the UI and add a front-end warning if * it is not included in the field before save (as a field validation rule).

@philbudne
Copy link
Contributor

philbudne commented Oct 17, 2024 via email

@pgulley
Copy link
Member Author

pgulley commented Oct 17, 2024

Screenshot 2024-10-17 at 6 07 43 PM

Just a sainity check- these are runtimes for star queries over a single day for 15 random sources- q1 is the current production approach, and q2 is the approach @phil described: (url:http://URL_SEARCH_STRING OR url:https://URL_SEARCH_STRING). Pretty clear improvement- and I think the robustness gained here against variability in the scheme is worth potential impacts from duplicating the search-string in the query.

Agree that a UI filter is a good idea, and also agree that csv imports raise a problem here. I wonder if this is just something we add to the list of source health checks- double-checking the correctness of the url-search-string occasionally seems easy enough

@rahulbot
Copy link
Collaborator

Did escaping : get added to this? I think that is important.

@pgulley
Copy link
Member Author

pgulley commented Oct 18, 2024

We determined that we weren't going to have schemes stored in the database, so there's only the hardcoded http:// OR https:// which do manually escape the colon, yes.

@rahulbot rahulbot self-requested a review October 18, 2024 16:18
@pgulley pgulley merged commit f42eabd into main Oct 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants