Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Parser for cast kernel (#4512) #4513

Merged
merged 4 commits into from
Jul 14, 2023
Merged

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jul 12, 2023

Which issue does this PR close?

Closes #4512

Rationale for this change

Duplicating the parsing logic leads to inconsistencies and is unnecessary.

I still need to benchmark this, but it should also be faster.

What changes are included in this PR?

Are there any user-facing changes?

Dates without leading zeros are not supported, as per #3969

@tustvold tustvold added the api-change Changes to the arrow API label Jul 12, 2023
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 12, 2023
@tustvold
Copy link
Contributor Author

No major regressions and a nice speed bump for date parsing

cast utf8 to date32 512 time:   [9.4216 µs 9.4270 µs 9.4332 µs]
                        change: [-77.107% -77.045% -76.983%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

cast utf8 to date64 512 time:   [42.573 µs 42.581 µs 42.591 µs]
                        change: [-37.832% -37.753% -37.612%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making thing go faster by deleting code. What is not to love ❤️ -- really nice work @tustvold.

I think we need to fix the regression in parsing dates like 2022-2-2 but then it will be ready to go

BTW While reading the code I noticed that the comments on Parser need to be updated (as they are now also called in the cast kernel)

/// Specialized parsing implementations
/// used by csv and json reader
pub trait Parser: ArrowPrimitiveType {

@@ -7823,7 +7420,7 @@ mod tests {

let a = StringArray::from(vec![
"2000-01-01", // valid date with leading 0s
"2000-2-2", // valid date without leading 0s
"2000-2-2", // invalid date without leading 0s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should continue to treat such strings as valid dates as I think it is very common in the real world

For example, postgres recognizes it as a date as well

postgres=# select '2000-2-2'::date;
    date
------------
 2000-02-02
(1 row)

Not that postgres is the be-all end/all data system but I think it is illustrative that that format is not some fringe case.


// test invalid inputs
assert!(!c.is_valid(2)); // "2000-00-00"
assert!(!c.is_valid(3)); // "2000-01-01T12:00:00"
assert!(c.is_valid(3)); // "2000-01-01T12:00:00"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While surprising that this is a supported "Date" I understanding why

@tustvold
Copy link
Contributor Author

I think we need to fix the regression in parsing dates like 2022-2-2 but then it will be ready to go

So I have code that does this but it represents a 10% performance regression and also seems a touch inconsistent, as we don't permit unpadded timestamps (as they aren't valid according to RFC3339). How strongly do you feel about this?

@alamb
Copy link
Contributor

alamb commented Jul 14, 2023

So I have code that does this but it represents a 10% performance regression and also seems a touch inconsistent, as we don't permit unpadded timestamps (as they aren't valid according to RFC3339). How strongly do you feel about this?

I feel quite strongly -- I think this would represent a significant functionality regression

@tustvold
Copy link
Contributor Author

I recommend viewing with the whitespace blind diff, for some reason parse.rs had CLRF line endings - c1aa19a?w=1

I feel quite strongly

I ended up implementing a compromise that special cases dates without timestamps and supports missing whitespace. This avoids regressing the timestamp parsing performance

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold

Not sure why this PR renderes so large on Github (maybe due to line ending changes)

Whitespace blind diff makes it look much better: https://github.com/apache/arrow-rs/pull/4513/files?w=1

@@ -7874,13 +7471,13 @@ mod tests {
assert_eq!(946728000000, c.value(0));
assert!(c.is_valid(1)); // "2020-12-15T12:34:56"
assert_eq!(1608035696000, c.value(1));
assert!(c.is_valid(2)); // "2020-2-2T12:34:56"
assert_eq!(1580646896000, c.value(2));
assert!(!c.is_valid(2)); // "2020-2-2T12:34:56"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK, as I don't think that format is common for "dates"

Copy link
Contributor Author

@tustvold tustvold Jul 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is also not correct according to RFC3339, chrono will refuse to parse it


#[test]
fn parse_date32() {
let cases = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend some error cases here as well

(empty string)
80-01-01
342
Foo
2020-09-08-03
2020--04-03
2020--

@tustvold tustvold merged commit 6909db8 into apache:master Jul 14, 2023
@@ -648,7 +648,7 @@ impl Parser for Date32Type {

impl Parser for Date64Type {
fn parse(string: &str) -> Option<i64> {
if string.len() < 10 {
if string.len() <= 10 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My error cases maybe caught a bug!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afraid not, this just would impact perf

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some day

@alamb
Copy link
Contributor

alamb commented Jul 14, 2023

I updated this PR description to cross out

Dates without leading zeros are not supported, as per #3969

Without that this PR may not have any actual API changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot cast string '2021-01-02' to value of Date64 type
2 participants