BIG-3884 - Use PSL support from url-cpp. #45

Merged
merged 3 commits into from
Aug 26, 2016

@dlecocq dlecocq commented Aug 17, 2016

This adds support for:

  • PSL use in both punycode and unpunycode forms
  • an updated copy of the PSL
  • a ~9x speed improvement in PSL lookups (over the pure-python PSL implementation)

Relates to #39 and #29

I'd like to figure out a nice way to allow clients to provide their own lists. Any opinions on which is the preferred interface?

# Option 1
import url as URL
URL.use_psl_path('path/to/your/psl')
URL.use_psl_string('your PSL as a string')

URL.parse('http://foo.example.com/').pld

# Option 2
psl = PSL('path/to/your/psl')
psl.getPLD(URL.parse('http://foo.example.com/').hostname)

# Option 3
psl = PSL('path/to/your/psl')
URL.parse('http://foo.example.com/').pldFromPsl(psl)

@b4hand @lindseyreno @neilmb

neilmb commented Aug 17, 2016

I prefer option 1 where we have a (replaceable) Singleton PSL in the Url package. Otherwise I have to think too often about the dependence of parse results on the particular PSL being used. I'd prefer that hidden somewhere under the covers.
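A minimal sketch of what a replaceable, package-level PSL singleton could look like. Every name here (`SuffixList`, `use_psl`, `pld`) is hypothetical and is not the url package's actual API, and the toy matcher ignores the wildcard and exception rules of the real Public Suffix List.

```python
# Hypothetical sketch: none of these names come from the url package itself.
class SuffixList:
    """Toy stand-in for a parsed Public Suffix List (no wildcard or exception rules)."""

    def __init__(self, suffixes):
        self.suffixes = frozenset(suffixes)

    def pld(self, hostname):
        """Return the longest matching public suffix plus one more label, or None."""
        labels = hostname.lower().split('.')
        # Scanning from the left tries the longest candidate suffix first.
        for i in range(len(labels)):
            if '.'.join(labels[i:]) in self.suffixes:
                if i == 0:
                    return None  # the hostname is itself a public suffix
                return '.'.join(labels[i - 1:])
        return None  # no rule matched in this toy list


# The package-level default that callers never have to think about.
_default_psl = SuffixList(['com', 'uk', 'co.uk'])


def use_psl(psl):
    """Swap the package-wide suffix list for a caller-supplied one."""
    global _default_psl
    _default_psl = psl


def pld(hostname):
    """Look up the pay-level domain using whichever list is currently installed."""
    return _default_psl.pld(hostname)
```

With this shape, callers only touch `use_psl` when they need a custom list; every other lookup goes through the hidden default.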

install_requires = [
'publicsuffix'

Do we need to keep an empty install_requires list, or lose the whole thing until we need it again?

Author

We don't need to keep it.

neilmb commented Aug 17, 2016

We are just naming the PSL by date. Do we need to be more specific about its source or particular version on that date (does this thing change more than daily)?

@dlecocq dlecocq force-pushed the dan/url-cpp-psl branch 2 times, most recently from 011c63a to eaced36 Compare August 17, 2016 15:22

dlecocq commented Aug 17, 2016

To add some substantiation to the claim I made about speed:

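import timeit

# Setup runs once per timing run: parse the URLs up front so only the PLD lookup is timed.
# A list (rather than a bare map()) keeps the URLs reusable on every iteration under Python 3.
setup = '''
import url as URL
urls = [URL.parse(u) for u in [
    'http://moz.com',
    'http://amazon.co.uk',
    'http://a.b.domain.biz',
    'http://www.test.ac.jp',
    'http://xn--85x722f.xn--55qx5d.cn'
]]'''

# Statement under test: one PLD lookup per parsed URL.
stmt = '''
for url in urls:
    url.pld
'''

# Three runs of 10,000 iterations each; every entry is the total seconds for one run.
timeit.repeat(stmt, setup, repeat=3, number=10000)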

The results:

# On master
[0.1616840362548828, 0.17512893676757812, 0.161513090133667]

# On dan/url-cpp-psl
[0.017058134078979492, 0.017113924026489258, 0.01880812644958496]
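
As a rough check of the "~9x" figure, the ratio can be computed from the timings reported above. This sketch compares the fastest run on each branch (the `timeit` documentation recommends the minimum as the most reliable number) and assumes each run covers 10,000 iterations over the 5 URLs in the setup.

```python
# Timings copied from the results above (seconds per run of 10,000 iterations).
master = [0.1616840362548828, 0.17512893676757812, 0.161513090133667]
branch = [0.017058134078979492, 0.017113924026489258, 0.01880812644958496]

ITERATIONS = 10000
URLS_PER_ITERATION = 5  # the setup string parses five URLs


def speedup(before, after):
    """Ratio of the fastest 'before' run to the fastest 'after' run."""
    return min(before) / min(after)


def seconds_per_lookup(timings):
    """Cost of one .pld lookup, based on the fastest run."""
    return min(timings) / (ITERATIONS * URLS_PER_ITERATION)


ratio = speedup(master, branch)
print('speedup: %.1fx' % ratio)
print('master: %.2f us per lookup' % (seconds_per_lookup(master) * 1e6))
print('branch: %.2f us per lookup' % (seconds_per_lookup(branch) * 1e6))
```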

b4hand commented Aug 18, 2016

I too vote for Option 1, but I'd rather you pass a PSL object than a path or a string, since that lets you mock out the PSL lookup more easily (e.g. URL.use_psl(PSL('path_to_psl'))).
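
To illustrate the mocking argument, here is a sketch of a lookup wrapper that accepts any object exposing a `getPLD(hostname)` method (the method name is borrowed from Option 2 above). The `Resolver` class is hypothetical and exists only for this example; it is not part of the url package or url-cpp.

```python
from unittest import mock


class Resolver:
    """Hypothetical holder for an injected PSL object."""

    def __init__(self, psl):
        self._psl = psl

    def use_psl(self, psl):
        """Replace the PSL object; anything with a getPLD method will do."""
        self._psl = psl

    def pld(self, hostname):
        """Delegate the lookup to whichever PSL object is installed."""
        return self._psl.getPLD(hostname)


# In a test, a Mock stands in for the real list, so no PSL file is needed.
fake_psl = mock.Mock()
fake_psl.getPLD.return_value = 'example.com'

resolver = Resolver(fake_psl)
result = resolver.pld('foo.example.com')
```

Because the object is injected rather than built from a path or string inside the package, the test can assert both the returned value and the exact hostname that was passed to the lookup.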

('http://foo.გე' , 'foo.გე'),
('http://bar.foo.გე' , 'foo.გე'),
('http://foo.xn--node' , 'foo.xn--node'),
('http://bar.foo.xn--node', 'foo.xn--node')
Contributor

It doesn't look like any of your tests check for a non-trivial TLD like .co.uk.

dlecocq commented Aug 26, 2016

I ended up adding support for option 1, though not as Brandon suggested. Currently it accepts a string, and creates the corresponding url-cpp object. Otherwise we'd have to wrap the PSL class and treat the top-level psl as a generic python object. Not a bad thing -- I just didn't want to take the time to do it right now.

b4hand commented Aug 26, 2016

Can we make sure to create a ticket or issue, so that idea doesn't get lost?


# Grab it from the PSL site
import requests
url.set_psl(requests.get('https://publicsuffix.org/list/public_suffix_list.dat'))

Is the return value of requests.get string-y enough to work with PSL.fromString, or should this use requests.get(...).text?

Author

Whoops! I meant to tack content on there.

dlecocq commented Aug 26, 2016

@b4hand -- yes, I'll make a github issue (mostly because I'm not sure where the best place in Jira would be) - #46

@dlecocq dlecocq merged commit 216e0a9 into master Aug 26, 2016
@dlecocq dlecocq deleted the dan/url-cpp-psl branch August 26, 2016 16:47

dlecocq commented Aug 26, 2016

Sorry, I may have merged this a little prematurely. Let me know if there's any additional feedback, and I'll happily apply it retroactively.
