Enable partial date input support for from_iso8601_date() #9357

Closed
svm1
Collaborator

@svm1 svm1 commented Apr 3, 2024

Summary:
from_iso8601_date() in Presto allows for partial date strings, with the month and day values omitted:

presto:di> select from_iso8601_date('123');
   _col0
------------
 0123-01-01
(1 row)

presto:di> select from_iso8601_date('2020-1');
   _col0
------------
 2020-01-01
(1 row)

Whereas the function in Velox does not support such input:

presto:tpch> select from_iso8601_date(dstr) from dateStrs;

Query 20240410_141433_00118_wr8dj failed:  Unable to parse date value: "2020-01", expected format is (YYYY-MM-DD) presto.default.from_iso8601_date(dstr)

Both of the higher-level date cast functions in Velox, fromDateString() and castFromDateString(), invoke the same parse function, tryParseDateString(). The difference is that the former passes ParseMode::kStrict to the parse function, while the latter takes a boolean that determines which parse mode is passed down.

I replaced the call to fromDateString() in from_iso8601_date() with a call to castFromDateString(). The boolean allows only two parse modes, neither of which was suitable here: kStandardCast expects the complete ISO format, while kNonStandardCast allows partial dates but also permits timestamps, which Presto's from_iso8601_date() does not. I therefore created a new ParseMode that allows partial dates while blocking timestamps, and refactored castFromDateString() to take a ParseMode directly.
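For readers who want a concrete picture of "partial dates allowed, timestamps blocked", here is a minimal standalone sketch. It is not the Velox implementation: SimpleDate, readNumber() and parsePartialIsoDate() are hypothetical names made up for illustration, the 6-digit year cap is an assumption, and day-of-month validation is deliberately simplified.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical result type for illustration only. Velox itself returns
// days since epoch rather than separate fields.
struct SimpleDate {
  int32_t year;
  int32_t month;
  int32_t day;
};

namespace {
// Reads 1..maxDigits decimal digits starting at pos and advances pos.
// Returns nullopt if there are no digits or too many digits.
std::optional<int32_t>
readNumber(std::string_view s, size_t& pos, size_t maxDigits) {
  int32_t value = 0;
  size_t digits = 0;
  while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
    if (++digits > maxDigits) {
      return std::nullopt;
    }
    value = value * 10 + (s[pos] - '0');
    ++pos;
  }
  if (digits == 0) {
    return std::nullopt;
  }
  return value;
}
} // namespace

// Accepts [+-]Y, [+-]Y-M and [+-]Y-M-D. A missing month or day defaults to 1.
// Anything after the day (a 'T', a time, or whitespace) is rejected here.
// Assumption: the day is only range-checked (1..31), not checked per month.
std::optional<SimpleDate> parsePartialIsoDate(std::string_view s) {
  size_t pos = 0;
  bool negative = false;
  if (pos < s.size() && (s[pos] == '+' || s[pos] == '-')) {
    negative = s[pos] == '-';
    ++pos;
  }
  auto year = readNumber(s, pos, 6);
  if (!year.has_value()) {
    return std::nullopt;
  }
  SimpleDate date{negative ? -*year : *year, 1, 1};
  if (pos == s.size()) {
    return date; // year only
  }
  if (s[pos] != '-') {
    return std::nullopt;
  }
  ++pos;
  auto month = readNumber(s, pos, 2);
  if (!month.has_value() || *month < 1 || *month > 12) {
    return std::nullopt;
  }
  date.month = *month;
  if (pos == s.size()) {
    return date; // year and month
  }
  if (s[pos] != '-') {
    return std::nullopt;
  }
  ++pos;
  auto day = readNumber(s, pos, 2);
  if (!day.has_value() || *day < 1 || *day > 31) {
    return std::nullopt;
  }
  date.day = *day;
  // Requiring end of input is what blocks timestamps in this sketch.
  if (pos != s.size()) {
    return std::nullopt;
  }
  return date;
}
```

With this sketch, '123' and '2020-1' parse with the missing fields set to 1, while '2020-01-01T00:00:00' is rejected because characters remain after the day.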

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 3, 2024

@svm1
Collaborator Author

svm1 commented Apr 3, 2024

@mbasmanova opened this PR to address your comment regarding partial date input handling. Turned out to be a simple fix, taken care of with a call to an alternate date parse function.

Contributor

@mbasmanova mbasmanova left a comment

@svm1 Thank you for looking into this. It is nice that the fix is so simple.

I'm seeing that util::castFromDateString has a boolean parameter named 'isIso8601' and you are passing 'false' for it. This is counter-intuitive since the formats being parsed are from the ISO 8601 format. What do you think?

It would be nice to update the function documentation to explain which formats it supports. My impression is that most users would think that the ISO 8601 date format is YYYY-MM-DD, but it turns out that the standard allows other formats as well.

Thanks.

@mbasmanova mbasmanova requested a review from amitkdutta April 3, 2024 18:28
@svm1
Collaborator Author

svm1 commented Apr 3, 2024

@mbasmanova Thanks for the review. I won't lie, I was initially thrown off by that name myself - I agree that it's a bit misleading.

Passing false for isIso8601 is what enables non-strict parsing of partial dates. I can definitely expand the documentation to make this clearer. Perhaps renaming that parameter would be helpful too?

@mbasmanova
Contributor

Passing false for isIso8601 is what enables non-strict parsing of partial dates. ..., perhaps renaming that parameter would be helpful too?

It would be nice to rename this parameter to reflect its meaning more accurately. Just to make sure, would you double-check that the supported formats are exactly the same as in Presto (or at least a subset, but not a superset)?

I can definitely expand on the documentation to make it more clear

That would be great.

@svm1 svm1 force-pushed the svm_from_iso_date_parse branch from b0cac4b to 88c4cea Compare April 3, 2024 22:53
@svm1
Collaborator Author

svm1 commented Apr 9, 2024

@mbasmanova Your concern was valid - when testing the change, I noticed that along with allowing partial date input, the flag I enabled also allows ISO date strings to include the time component - this doesn't seem to be the case with Presto Java.

The ParseMode::kNonStandardCast flag (definition below) which lets these values through is used elsewhere (particularly for Spark support), so modifying that wouldn't be a good idea - it may be worth creating a new field in the ParseMode enum to handle this specific case (to allow strings containing partial date inputs but no timestamps). What do you say?

// kNonStandardCast: Like standard but permits missing day/month and allows
// trailing 'T' or spaces. Align with Spark SQL casting conventions.

@mbasmanova
Contributor

@svm1 Thank you for taking a closer look.

it may be worth creating a new field in the ParseMode enum to handle this specific case (to allow strings containing partial date inputs but no timestamps)

This sounds reasonable to me. CC: @rui-mo @PHILO-HE @pedroerp

@svm1
Collaborator Author

svm1 commented Apr 10, 2024

@mbasmanova Now that I think about it, adding to the enum wouldn't quite be sufficient, due to the way the cast function is structured. The function itself needs to be modified or duplicated.

int32_t castFromDateString(const char* str, size_t len, bool strictParse) {
  int64_t daysSinceEpoch;
  size_t pos = 0;

  auto mode =
      strictParse ? ParseMode::kStandardCast : ParseMode::kNonStandardCast;
  if (!tryParseDateString(str, len, pos, daysSinceEpoch, mode)) {
      ...

castFromDateString() is already being called with strictParse = false elsewhere (SparkCastHooks) - so directing !strictParse to a different ParseMode would be just as problematic as modifying the existing kNonStandardCast mode itself.

To address this issue, as well as any similar needs that may arise in the future - wouldn't it be better if castFromDateString() takes a ParseMode parameter directly? Abstracting ParseMode away from the caller just seems to add unnecessary complications. Though this would require moving the enum to a header file that each caller can access.
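To make the proposal concrete, here is a small standalone sketch of why an enum parameter scales better than a boolean. It is not the Velox code: DateParseMode, ParseRules and rulesFor() are hypothetical, and kPartialDateNoTime is a placeholder name for the proposed new mode, not the name used in the PR. The rules for each mode follow the descriptions given in this thread.

```cpp
// Hypothetical enum mirroring the modes discussed in this thread.
// kPartialDateNoTime is an illustrative placeholder name.
enum class DateParseMode {
  kStandardCast,      // complete YYYY-MM-DD only
  kNonStandardCast,   // month/day may be missing; trailing 'T' or spaces allowed
  kPartialDateNoTime, // month/day may be missing; nothing may follow the date
};

// Hypothetical flags that a parser could consult while scanning the input.
struct ParseRules {
  bool allowMissingMonthOrDay;
  bool allowTrailingTimeMarker;
};

// A bool parameter can only ever select between two of these rows.
// An enum lets a third (or fourth) mode be added without touching
// existing callers such as the Spark cast path.
constexpr ParseRules rulesFor(DateParseMode mode) {
  switch (mode) {
    case DateParseMode::kStandardCast:
      return {false, false};
    case DateParseMode::kNonStandardCast:
      return {true, true};
    case DateParseMode::kPartialDateNoTime:
      return {true, false};
  }
  return {false, false}; // unreachable for valid enum values
}
```

A call site then reads as rulesFor(DateParseMode::kPartialDateNoTime) instead of an unexplained true or false, which is the readability gain discussed above.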

@mbasmanova
Contributor

wouldn't it be better if castFromDateString() takes a ParseMode parameter directly?

I think this is a good idea. An enum would make it easier to use the API because it will be clearer than 'true/false'.

CC: @rui-mo @PHILO-HE

@svm1
Collaborator Author

svm1 commented Apr 10, 2024

CC @majetideepak

@svm1 svm1 force-pushed the svm_from_iso_date_parse branch 3 times, most recently from a93e9f6 to 5265abb Compare April 11, 2024 01:41
@svm1
Collaborator Author

svm1 commented Apr 11, 2024

@mbasmanova I've refactored castFromDateString() to use the enum directly. Updated docs and tests accordingly as well, please take a look.

@rui-mo
Collaborator

rui-mo commented Apr 11, 2024

@mbasmanova @svm1 Thanks for noticing. I created a test on partial input and found that the behavior is not aligned for input like 123. Spark throws an error or returns null for 123, but 2020 works, while in Velox both 123 and 2020 work. We can follow up in a separate PR to fix the Spark cast. What do you think?

scala> spark.sql("select cast('123' as date)").show(false)
+-----------------+                                                             
|CAST(123 AS DATE)|
+-----------------+
|null             |
+-----------------+


scala> spark.sql("select cast('2020' as date)").show(false)
+------------------+                                                            
|CAST(2020 AS DATE)|
+------------------+
|2020-01-01        |
+------------------+

@@ -44,6 +44,26 @@ constexpr const int32_t kMaxYear{292278994};
constexpr const int32_t kYearInterval{400};
constexpr const int32_t kDaysPerYearInterval{146097};

// Enum to dictate parsing modes for date strings.
//
// kStrict: For date string conversion, align with DuckDB's implementation.
Contributor

Curious, why do we need parsing mode aligned with DuckDB? Do you happen to know where this is used?

Collaborator Author

Good question. Looks like it's only used by the original date cast function that I replaced here in from_iso8601_date - fromDateString(). And this cast function is only used by the Date class' utility function DateType::toDays().

As for the actual parsing, the only difference between kStrict and kStandardCast seems to be that the former allows trailing whitespace at the end of the date string. I will need to test some more to see whether this split is really necessary or whether these modes can be consolidated. It may be better to open that as a separate investigation, since it is moving away from the scope of this PR.

- Parses the ISO 8601 formatted ``string`` into a ``date``.
+ Parses the ISO 8601 formatted date ``string`` into a ``date``.
+ ISO 8601 date ``string`` can be formatted as any of the following:
+ ``[+-]YYYY``
Contributor

Do we allow the number of digits in the year to be less or greater than 4? If so, it would be nice to clarify. It would be great to add a few examples.

Collaborator Author

@svm1 svm1 Apr 11, 2024

Yes - the year must simply be at least one digit. In theory it can go up to 6-7 digits before it begins to overflow, and then errors out at 8 digits, though I suppose such values are out of the realm of real-world usage.

Updated the format syntax in the docs to better reflect this, and added a note about year values needing to be between 1 and 6 digits:

    [+-][Y]Y*
    [+-][Y]Y*-[M]M*
    [+-][Y]Y*-[M]M*-[D]D*
    [+-][Y]Y*-[M]M*-[D]D* *
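
As a rough way to check strings against the four documented shapes, the grammar can be written as a regular expression. This is only a sketch under assumptions: Velox uses a hand-written parser rather than std::regex, matchesDocumentedDateShape() is a hypothetical helper, each Y/M/D group is read as "one or more digits", the year is capped at 6 digits per the note above, and trailing spaces are accepted only after a complete year-month-day value, as in the fourth format. It checks shape only, not whether the month or day is in range.

```cpp
#include <regex>
#include <string>

// Returns true if s has one of the shapes:
//   [+-]Y..., [+-]Y...-M..., [+-]Y...-M...-D..., or the last one
//   followed by trailing spaces.
// Assumption: year is 1 to 6 digits; month and day are one or more digits.
bool matchesDocumentedDateShape(const std::string& s) {
  static const std::regex kShape(R"([+-]?\d{1,6}(-\d+(-\d+ *)?)?)");
  // regex_match requires the whole string to match, so no anchors are needed.
  return std::regex_match(s, kShape);
}
```

Under these assumptions, '123', '2020-1' and '2020-01-01  ' match, while '2020-01-01T00:00:00' and an 8-digit year do not.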

// kNonStandardCast: Like standard but permits missing day/month and allows
// trailing 'T' or spaces. Align with Spark SQL casting conventions.
enum class ParseMode {
  kStrict,
Contributor

These names are not very descriptive. I wonder if we could come up with names that are more self-explanatory. What do you think?

Collaborator Author

I do agree these names aren't great. Though perhaps it would be better to look into this separately, in conjunction with investigating the necessity of kStrict (#9357 (comment))? That will require identifying the purpose of each mode anyway and seeing if some can be removed/merged, after which it may be easier to rename them appropriately.

Contributor

@mbasmanova mbasmanova left a comment

@svm1 Thank you for iterating on this PR. I appreciate the non-trivial amount of effort you are putting into this. Some comments.

@svm1 svm1 force-pushed the svm_from_iso_date_parse branch from 3e74a16 to 18f9398 Compare April 11, 2024 22:05
@svm1
Collaborator Author

svm1 commented Apr 11, 2024

@mbasmanova Thanks for all your feedback. I cleaned up and elaborated on the docs; please let me know if it looks good to you!

Contributor

@mbasmanova mbasmanova left a comment

@svm1 Thank you for iterating on this PR. Looks great.

@mbasmanova
Contributor

Would you rebase if you haven't already?

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

if (mode == ParseMode::kStandardCast) {
  VELOX_ASSERT_THROW(
      castFromDateString(str, mode),
      fmt::format(kStandardCastErr, std::string(str.data(), str.size())));
Contributor

This doesn't work as fmt::format requires constant string or fmt::runtime. I'm fixing that.

Also, noticed there is no space between sentences. Fixing that too.

Collaborator Author

Hmm, it seemed to be working when I ran the test locally. I think the error strings are declared as constants.

@svm1 svm1 force-pushed the svm_from_iso_date_parse branch from 18f9398 to a705db4 Compare April 12, 2024 00:48
@svm1 svm1 force-pushed the svm_from_iso_date_parse branch from a705db4 to dd2f22c Compare April 12, 2024 01:47
@facebook-github-bot
Contributor

@mbasmanova merged this pull request in 115a240.


Conbench analyzed the 1 benchmark run on commit 115a240c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@svm1 svm1 deleted the svm_from_iso_date_parse branch April 15, 2024 22:53
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…cubator#9357)

Summary:
Also refactoring `castFromDateString()` function to take in `ParseMode` parameter rather than boolean, to improve flexibility and ease of use.

Pull Request resolved: facebookincubator#9357

Reviewed By: xiaoxmeng

Differential Revision: D56042071

Pulled By: mbasmanova

fbshipit-source-id: 61cb5ef5d72862c62a435a7b13585d7f6013edb0