-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable partial date input support for from_iso8601_date() #9357
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@mbasmanova opened this PR to address your comment regarding partial date input handling. Turned out to be a simple fix, taken care of with a call to an alternate date parse function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@svm1 Thank you for looking into this. It is nice that the fix is so simple.
I'm seeing that util::castFromDateString has a boolean parameter named 'isIso8601' and you are passing 'false' for it. This is counter-intuitive since the formats being parsed are from the ISO 8601 format. What do you think?
It would be nice to update function documentation to explain what formats it supports. My impression is that most users would think that ISO 8601 format for date is YYYY-MM-DD, but turns out that standard allows other formats as well.
Thanks.
@mbasmanova Thanks for the review. I won't lie, I was initially thrown off by that name myself - I agree that it's a bit misleading. Passing |
I would be nice to rename this parameter to reflect its meaning more accurately. Just to make sure, would you double check that supported formats are exactly the same as in Presto (or at least a subset, but not a superset)?
That would be great. |
b0cac4b
to
88c4cea
Compare
@mbasmanova Your concern was valid - when testing the change, I noticed that along with allowing partial date input, the flag I enabled also allows ISO date strings to include the time component - this doesn't seem to be the case with Presto Java. The
|
@mbasmanova Now that I think about it, adding to the enum wouldn't quite be sufficient, due to the way the cast function is structured. The function itself needs to be modified or duplicated.
castFromDateString() is already being called with To address this issue, as well as any similar needs that may arise in the future - wouldn't it be better if castFromDateString() takes a ParseMode parameter directly? Abstracting ParseMode away from the caller just seems to add unnecessary complications. Though this would require moving the enum to a header file that each caller can access. |
a93e9f6
to
5265abb
Compare
@mbasmanova I've refactored |
@mbasmanova @svm1 Thanks for noticing. I created a test on partial input, and found the behavior was not aligned for input like
|
velox/type/TimestampConversion.h
Outdated
@@ -44,6 +44,26 @@ constexpr const int32_t kMaxYear{292278994}; | |||
constexpr const int32_t kYearInterval{400}; | |||
constexpr const int32_t kDaysPerYearInterval{146097}; | |||
|
|||
// Enum to dictate parsing modes for date strings. | |||
// | |||
// kStrict: For date string conversion, align with DuckDB's implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious, why do we need parsing mode aligned with DuckDB? Do you happen to know where this is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Looks like it's only used by the original date cast function that I replaced here in from_iso8601_date - fromDateString()
. And this cast function is only used by the Date class' utility function DateType::toDays()
.
As for the actual parsing, the only difference between kStrict
and kStandard
just seems that the former allows trailing whitespace at the end of the date string. Will need to test some more to see if this split is really necessary or if these modes can be consolidated. Maybe better to open that as a separate investigation, seems to be moving further away from the scope of this PR?
Parses the ISO 8601 formatted ``string`` into a ``date``. | ||
Parses the ISO 8601 formatted date ``string`` into a ``date``. | ||
ISO 8601 date ``string`` can be formatted as any of the following: | ||
``[+-]YYYY`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we allow number of digits in the year to be less or greater than 4? If so, would be nice to clarify. It would be great to add a few examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - the year must simply be at least one digit. In theory it can go up to 6-7 digits before it begins to overflow, and then errors out at 8 digits, though I suppose such values are out of the realm of real-world usage.
Updated the format syntax in the docs to better reflect this, and added a note about year values needing to be between 1 and 6 digits:
[+-][Y]Y*
[+-][Y]Y*-[M]M*
[+-][Y]Y*-[M]M*-[D]D*
[+-][Y]Y*-[M]M*-[D]D* *
// kNonStandardCast: Like standard but permits missing day/month and allows | ||
// trailing 'T' or spaces. Align with Spark SQL casting conventions. | ||
enum class ParseMode { | ||
kStrict, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These names are not very descriptive. I wonder if we could come up with names that are more self-explanatory. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree these names aren't great. Though perhaps it would be better to look into this separately, in conjunction with investigating the necessity of kStrict
(#9357 (comment))? That will require identifying the purpose of each mode anyway and seeing if some can be removed/merged, after which it may be easier to rename them appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@svm1 Thank you for iterating on this PR. Appreciate non-trivial amount of effort you are putting into this. Some comments.
3e74a16
to
18f9398
Compare
@mbasmanova Thanks for all your feedback. Cleaned up and elaborated upon the docs, please let me if it looks good to you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@svm1 Thank you for iterating on this PR. Looks great.
Would you rebase if you haven't already? |
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
if (mode == ParseMode::kStandardCast) { | ||
VELOX_ASSERT_THROW( | ||
castFromDateString(str, mode), | ||
fmt::format(kStandardCastErr, std::string(str.data(), str.size()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work as fmt::format requires constant string or fmt::runtime. I'm fixing that.
Also, noticed there is no space between sentences. Fixing that too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it seemed to be working when I ran the test locally. I think the error strings are declared as constants.
18f9398
to
a705db4
Compare
…r castFromDateString()
a705db4
to
dd2f22c
Compare
@mbasmanova merged this pull request in 115a240. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…cubator#9357) Summary: Also refactoring `castFromDateString()` function to take in `ParseMode` parameter rather than boolean, to improve flexibility and ease of use. Pull Request resolved: facebookincubator#9357 Reviewed By: xiaoxmeng Differential Revision: D56042071 Pulled By: mbasmanova fbshipit-source-id: 61cb5ef5d72862c62a435a7b13585d7f6013edb0
Summary:
from_iso8601_date()
in Presto allows for partial date strings, with the month and day values omitted:Whereas the function in Velox does not support such input:
Both the higher-level date cast functions in Velox,
fromDateString()
andcastFromDateString()
invoke the same parse function,tryParseDateString()
- the difference being that the former passesParseMode::kStrict
into the parse function, while the latter takes in a boolean used to dictate the parse mode passed down.Replaced the call to
fromDateString()
infrom_iso8601_date()
with a call tocastFromDateString()
. The boolean only allows two possible parse modes, neither of which were suitable for the purpose here (kStandard
expects complete ISO format, whilekNonStandard
does allow partial dates but also permits the inclusion of timestamps, which Presto'sfrom_iso8601_date()
does not) - therefore, I created a new ParseMode to allow partial dates while blocking timestamps, and refactoredcastFromDateString()
to directly take in a ParseMode.