-
Notifications
You must be signed in to change notification settings - Fork 838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Parser for cast kernel (#4512) #4513
Conversation
No major regressions and a nice speed bump for date parsing
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making thing go faster by deleting code. What is not to love ❤️ -- really nice work @tustvold.
I think we need to fix the regression in parsing dates like 2022-2-2
but then it will be ready to go
BTW While reading the code I noticed that the comments on Parser
need to be updated (as they are now also called in the cast kernel)
/// Specialized parsing implementations
/// used by csv and json reader
pub trait Parser: ArrowPrimitiveType {
arrow-cast/src/cast.rs
Outdated
@@ -7823,7 +7420,7 @@ mod tests { | |||
|
|||
let a = StringArray::from(vec![ | |||
"2000-01-01", // valid date with leading 0s | |||
"2000-2-2", // valid date without leading 0s | |||
"2000-2-2", // invalid date without leading 0s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should continue to treat such strings as valid dates as I think it is very common in the real world
For example, postgres recognizes it as a date as well
postgres=# select '2000-2-2'::date;
date
------------
2000-02-02
(1 row)
Not that postgres is the be-all end/all data system but I think it is illustrative that that format is not some fringe case.
arrow-cast/src/cast.rs
Outdated
|
||
// test invalid inputs | ||
assert!(!c.is_valid(2)); // "2000-00-00" | ||
assert!(!c.is_valid(3)); // "2000-01-01T12:00:00" | ||
assert!(c.is_valid(3)); // "2000-01-01T12:00:00" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While surprising that this is a supported "Date" I understanding why
So I have code that does this but it represents a 10% performance regression and also seems a touch inconsistent, as we don't permit unpadded timestamps (as they aren't valid according to RFC3339). How strongly do you feel about this? |
I feel quite strongly -- I think this would represent a significant functionality regression |
I recommend viewing with the whitespace blind diff, for some reason parse.rs had CLRF line endings - c1aa19a?w=1
I ended up implementing a compromise that special cases dates without timestamps and supports missing whitespace. This avoids regressing the timestamp parsing performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tustvold
Not sure why this PR renderes so large on Github (maybe due to line ending changes)
Whitespace blind diff makes it look much better: https://github.com/apache/arrow-rs/pull/4513/files?w=1
@@ -7874,13 +7471,13 @@ mod tests { | |||
assert_eq!(946728000000, c.value(0)); | |||
assert!(c.is_valid(1)); // "2020-12-15T12:34:56" | |||
assert_eq!(1608035696000, c.value(1)); | |||
assert!(c.is_valid(2)); // "2020-2-2T12:34:56" | |||
assert_eq!(1580646896000, c.value(2)); | |||
assert!(!c.is_valid(2)); // "2020-2-2T12:34:56" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK, as I don't think that format is common for "dates"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it is also not correct according to RFC3339, chrono will refuse to parse it
|
||
#[test] | ||
fn parse_date32() { | ||
let cases = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend some error cases here as well
(empty string)
80-01-01
342
Foo
2020-09-08-03
2020--04-03
2020--
@@ -648,7 +648,7 @@ impl Parser for Date32Type { | |||
|
|||
impl Parser for Date64Type { | |||
fn parse(string: &str) -> Option<i64> { | |||
if string.len() < 10 { | |||
if string.len() <= 10 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My error cases maybe caught a bug!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afraid not, this just would impact perf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some day
I updated this PR description to cross out
Without that this PR may not have any actual API changes |
Which issue does this PR close?
Closes #4512
Rationale for this change
Duplicating the parsing logic leads to inconsistencies and is unnecessary.
I still need to benchmark this, but it should also be faster.
What changes are included in this PR?
Are there any user-facing changes?
Dates without leading zeros are not supported, as per #3969