Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REGRESSION] Parsing dates of the form 2022-1-1 that used to parse (now needs 2022-01-01) #3969

Closed
alamb opened this issue Mar 28, 2023 · 2 comments
Labels
arrow Changes to the arrow crate bug

Comments

@alamb
Copy link
Contributor

alamb commented Mar 28, 2023

Describe the bug
In arrow 34.0.0, values such as '2021-1-1T05:11:10.432' were parsed as a timestamp but now they error

To Reproduce

Try to parse this value as a timestamp: '2021-1-1T05:11:10.432'

Expected behavior

The value '2021-1-1T05:11:10.432' parses the same as '2021-01-01T05:11:10.432'

Workaround is to change it to (add leading zeros) '2021-01-01T05:11:10.432'

Additional context
We discovered this as part of the upgrade in DataFusion apache/datafusion#5685

Per @tustvold it appears that 2021-1-1T05:11:10.432 is supported by chrono even though it is not strictly valid and arrow was not documented to support this format: apache/datafusion#5685 (comment)

It may be that we simply choose not to fix this regression as the previous implementation was working "by accident" but I wanted to file this ticket to track the change in behavior

@alamb alamb added arrow Changes to the arrow crate bug labels Mar 28, 2023
@tustvold
Copy link
Contributor

tustvold commented Mar 28, 2023

It may be that we simply choose not to fix this regression as the previous implementation was working "by accident"

Indeed, chrono's implementation of strptime is incredibly permissive, to the point where some of the undocumented behaviour is actually considered a bug - chronotope/chrono#332. In particular the docs state that %m and similar expect 0 padded inputs, however, the implementation is more permissive.

We strive to support RFC3339 and reasonable variants thereupon, we therefore added additional cases based on chrono's strptime to supplement chrono's very rigid RFC3339 support, which in turn led to chrono's permissive strptime implementation unintentionally leaking through arrow's parsing abstraction.

Given RFC3339 requires that days and months should be two digits and years must be 4 digits, and we have never claimed support for such timestamps, I do not personally consider this a regression and do not plan to change it.

That being said if someone wants to contribute additional functionality for this and is able to do so in a way that doesn't regress parsing performance, I would be fine with it. The major motivation for not supporting it is that knowing ahead of time the character layout is what allows the parser to be more efficient than otherwise.

As per RFC3339 we should not support anything other than 4 digits for years, as this leads to ambiguity.

@tustvold
Copy link
Contributor

tustvold commented Jun 1, 2023

I'm closing this as 2 months have passed with no complaints

@tustvold tustvold closed this as not planned Won't fix, can't repro, duplicate, stale Jun 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

No branches or pull requests

2 participants