-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update ExtractDomain to extract apex domains. #520
Conversation
- Resolves #519 - Add scala-uri as a dependency - Replace getHost method of extracting domains with apexDomain from scala-uri - Update tests as needed - Removed unused source parameter from ExtractDomain
Well, this blows up in Spark 3.0.3 and 3.1.2 with:
|
No clue. Went back in the blame and looks like it's been there forever (maybe from when Alice was working on this five or six years ago). If I had to guess it was some janky error handling. I think yanking it makes sense. |
😞 (that sucks! and odd behaviour... ) |
I built the branch on
|
My memory is faint on this, but - per docs, "source an optional default url for urls with no valid domain host". I think we've run into cases before where the full URL that we're trying to extract from was mangled. Usually, we're doing extraction on an URL on some page - so we set "source" to that page. Rationale is that if the URL is mangled, we can back up to the page that URL was found on. That was the original rationale. As to whether it's a good one, that's a different matter... |
- Caused by: io.lemonlabs.uri.parsing.UriParsingException: Invalid URL could not be parsed. Error(4,NonEmptyList(EndOfString(4,146)))
Codecov Report
@@ Coverage Diff @@
## main #520 +/- ##
============================================
+ Coverage 88.83% 89.04% +0.20%
Complexity 57 57
============================================
Files 43 43
Lines 1012 1022 +10
Branches 85 83 -2
============================================
+ Hits 899 910 +11
Misses 74 74
+ Partials 39 38 -1 |
Tested locally on Spark 3.1.1:
I'm doing the same test on tuna now on the full geocities dataset, and everything seems to be running smoothly. I'm not 100% happy with my nest try and catch to swallow @ianmilligan1 @lintool any objects to me merging once the tuna job finishes, and cutting a dot release? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the delay (was just on a call).
Looks good! Builds locally and sample scripts running on my end. I am happy for this to be merged when you're good to go @ruebot.
Ran without issue on tuna!
|
GitHub issue(s): #519
What does this Pull Request do?
How should this be tested?
Additional Notes:
I removed the
source
parameter since I honestly had no idea what it was for, or what the use case was. Couldn't really see anything in the commit history. @ianmilligan1 @lintool do y'all recall what it was for?Once this is merged, I'll cut a new release since we'll need it for ARCH.