-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add helper function to return the number of days since epoch for a week-of-month date #10604
Add helper function to return the number of days since epoch for a week-of-month date #10604
Conversation
✅ Deploy Preview for meta-velox ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for extracting this logic into a separate PR.
The PR title mentions "date format", but the code doesn't seem to be dealing with date format. Wondering if the PR title / description could be updated for clarity and accuracy.
CC: @rui-mo
velox/type/TimestampConversion.cpp
Outdated
int64_t daysSinceEpochOfFirstDayOfMonth; | ||
VELOX_RETURN_NOT_OK( | ||
daysSinceEpochFromDate(year, month, 1, daysSinceEpochOfFirstDayOfMonth)); | ||
int32_t firstDayOfWeek = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe I asked this earlier, but I don't think I saw an answer, hence, asking again. What happens if weekOfMonth is 5, but month has only 4 weeks (Feb)? Similarly, what happens if weekOfMonth is 5 and dayOfWeek is 7 (no month has 35 days)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Java's SimpleDateFormat, "invalid" weekOfMonth and dayOfWeek etc. are allowed. If the date falls outside the valid range, the week or day from the preceding or following month's will be used. For example, if weekOfMonth is 5 but the current month only has 4 weeks (such as February), the first week of March will be considered as the 5th week of February. Similarly, if weekOfMonth is 5 and dayOfWeek is 7, the corresponding day from the next month will be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for clarifying. Let's update method comments to explain this as it is not obvious. Do we already have tests for these cases? If not, let's make sure to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this comments for clarifying and two tests for cases that days in previous/next month.
We treat days of the previous or next months as a part of the specified WEEK_OF_MONTH. For example, if
weekOfMonth is 5 but the current month only has 4 weeks (such as February),the first week of March will be considered as the 5th week of February.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing that DATE type is based on INTEGER type, hence, allows only int32_t number of days since epoch. If that's the case, then, perhaps, daysSinceEpochFromDate should return int32_t, not int64_t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is intended for date parsing, and the DateTimeFormatter::parse method returns a Timestamp. A Timestamp is based on int64_t seconds and can represent up to 106,751,991,167 days, which is calculated as (2^63 - 1) / 1000 / 86400. This value is greater than the maximum value of an int32_t. Therefore, I think that daysSinceEpochFromDate should return an int64_t.
@mbasmanova I have updated the PR title and description to remove the words related to "date format" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Added several comments.
@@ -144,6 +144,26 @@ bool isValidWeekDate(int32_t weekYear, int32_t weekOfYear, int32_t dayOfWeek) { | |||
return true; | |||
} | |||
|
|||
bool isValidWeekOfMonthDate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the discussion at https://github.com/facebookincubator/velox/pull/10511/files#r1686566980, true result does not always indicate a valid date.
If the date falls outside the valid range, the week or day from the preceding or following month's will be used.
With current implementation, for some invalid date, we return user error status while for the others we return a int64_t result in the daysSinceEpochFromWeekOfMonthDate
function, which seems to be confusing.
I notice the method isValidDate
provides an accurate evaluation for valid date. Does it make sense to use that?
https://github.com/facebookincubator/velox/blob/main/velox/type/TimestampConversion.cpp#L518-L529
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. We should remove the validation for week of month. According to SimpleDateFormat's doc the range of day
is [1,7] so isValidDate
isn't suitable for this.
u | Day number of week (1 = Monday, ..., 7 = Sunday) | Number | 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. See e718074
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to SimpleDateFormat's doc the range of day is [1,7] so isValidDate isn't suitable for this.
Could you clarify the purpose of isValidWeekOfMonthDate
? Is it to follow certain rule defined in SimpleDateFormat to check if a date is valid, and what are the rules? Perhaps add some comments for this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to do validation follows SimpleDateFormat. However, SimpleDateFormat does not provide clear rules or relevant documentation. For example, document states WEEK_OF_MONTH
field range is from 0 to 6, , but in my tests, a value of 99 is considered valid.(https://github.com/facebookincubator/velox/pull/10511/files#r1686566980)
Values calculated for the WEEK_OF_MONTH field range from 0 to 6
Another example is the valid range for the year
field, which is not documented.
I tried to determine the valid range through testing, but it is too costly and not reliable enough. I wonder how to properly validate dates in Velox. Could you give me some advice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo I don't believe this functionality is needed for Presto functions. Hence, we are free to define semantics in any "sensible" way. Assuming we need these semantics to match Spark, we can use Spark semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointer. Understood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova @rui-mo According to this discussion, Spark date validation rules in document don't match with the test results. Which one should I follow? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NEUpanning I assume we need to follow the actual behavior of Spark. How do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo I agree with you. I will change the part of date validation for following the actual behavior of Spark.
return folly::makeUnexpected(status); | ||
} | ||
int32_t firstDayOfWeek = | ||
extractISODayOfTheWeek(daysSinceEpochOfFirstDayOfMonth); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extractISODayOfTheWeek
accepts an int32_t value while the input is of int64_t type. Shall we check for overflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice extractISODayOfTheWeek
casts input to int64_t. How about changing parameter to int64_t type instead of int32_t type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the cast is to avoid overflow in addition and negating. If we change it as int64_t, overflow needs to considered as well by casting to a larger type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your explanation. I see daysSinceEpochFromWeekDate
function also calls this funtion with a int64_t argument. I think we change extractISODayOfTheWeek
parameter to int64_t and cast it to int128_t when do calculate would be better. How do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is any limitation for year? If the days cannot exceed INT32_MAX, it would be safe to call it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The max year is 292278994 so the days can be (292278994-1970)*365=106,681,113,760 that is bigger than 2,147,483,647(INT32_MAX).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we change extractISODayOfTheWeek parameter to int64_t and cast it to int128_t when do calculate
Looks fine to me. @mbasmanova What are your thoughts? Thanks.
velox/type/TimestampConversion.h
Outdated
@@ -107,6 +107,24 @@ Status daysSinceEpochFromWeekDate( | |||
int32_t dayOfWeek, | |||
int64_t& out); | |||
|
|||
/// Computes the (signed) number of days since unix epoch (1970-01-01). | |||
/// Returns error status if the date is invalid. We treat days of the previous | |||
/// or next months as a part of the specified WEEK_OF_MONTH. For example, if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WEEK_OF_MONTH -> weekOfMonth for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Done.
velox/type/TimestampConversion.h
Outdated
/// Returns error status if the date is invalid. We treat days of the previous | ||
/// or next months as a part of the specified WEEK_OF_MONTH. For example, if | ||
/// weekOfMonth is 5 but the current month only has 4 weeks (such as February), | ||
/// the first week of March will be considered as the 5th week of February. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice clarification. Perhaps, clarify that the same applies to dayOfWeek.
BTW, I assume you are matching some behavior from Spark here? It would be nice to clarify that to explain the choice of handling "overflow" dates. Do you have a link to Spark's documentation that explains these rules? Let's make sure we match exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the implemention align with Spark's SimpleDateFormat. I've added docs. You can see the rules in this link.
Leniency
Calendar has two modes for interpreting the calendar fields, lenient and non-lenient. When a Calendar is in lenient mode, it accepts a wider range of calendar field values than it produces. When a Calendar recomputes calendar field values for return by get(), all of the calendar fields are normalized. For example, a lenient GregorianCalendar interprets MONTH == JANUARY, DAY_OF_MONTH == 32 as February 1.
When a Calendar is in non-lenient mode, it throws an exception if there is any inconsistency in its calendar fields. For example, a GregorianCalendar always produces DAY_OF_MONTH values between 1 and the length of the month. A non-lenient GregorianCalendar throws an exception upon calculating its time or calendar field values if any out-of-range field value has been set.
/// @param dayOfWeek Day number of week. A value in [1, 7] range. For example, 1 | ||
/// is Monday, 7 is Sunday. | ||
Expected<int64_t> daysSinceEpochFromWeekOfMonthDate( | ||
int32_t year, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation imposes limits on the 'year' value. Would you document these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Done.
velox/type/TimestampConversion.cpp
Outdated
int64_t daysSinceEpochOfFirstDayOfMonth; | ||
VELOX_RETURN_NOT_OK( | ||
daysSinceEpochFromDate(year, month, 1, daysSinceEpochOfFirstDayOfMonth)); | ||
int32_t firstDayOfWeek = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm seeing that DATE type is based on INTEGER type, hence, allows only int32_t number of days since epoch. If that's the case, then, perhaps, daysSinceEpochFromDate should return int32_t, not int64_t.
@@ -104,6 +104,39 @@ TEST(DateTimeUtilTest, fromDateInvalid) { | |||
1970, 6, 31, "Date out of range: 1970-6-31")); | |||
} | |||
|
|||
TEST(DateTimeUtilTest, fromWeekOfMonthDate) { | |||
auto daysSinceEpochFromWeekOfMonthDate = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: consider shortening to daysSinceEpoch for readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Done
// day in previous month | ||
EXPECT_EQ(19783, daysSinceEpochFromWeekOfMonthDate(2024, 2, 5, 5)); | ||
// day in next month | ||
EXPECT_EQ(19751, daysSinceEpochFromWeekOfMonthDate(2024, 2, 1, 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you add negative tests to verify logic in isValidWeekOfMonthDate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I added fromWeekOfMonthDateInvalid tests.
EXPECT_EQ(-719561, daysSinceEpochFromWeekOfMonthDate(-1, 12, 1, 1)); | ||
EXPECT_EQ(-719897, daysSinceEpochFromWeekOfMonthDate(-1, 1, 1, 1)); | ||
|
||
// day in previous month |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Make sure comments are full sentences. Start with a capital letter. End with a period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Done.
EXPECT_EQ(-719533, daysSinceEpochFromWeekOfMonthDate(0, 1, 1, 1)); | ||
|
||
// Negative year - BC. | ||
EXPECT_EQ(-719561, daysSinceEpochFromWeekOfMonthDate(-1, 12, 1, 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be easier to maintain these tests if we use date strings for 'expected' value.
EXPECT_EQ(parseDate("-1-12-01"), daysSinceEpoch(-1, 12, 1, 1)
EXPECT_EQ(parseDate("2024-03-05"), daysSinceEpoch(2024, 2, 5, 5));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion. In this way, if the test fails, it could be an issue with parseDate
. And I'm not sure the behavior of parseDate
is same as daysSinceEpochFromWeekOfMonthDate.
@mbasmanova @rui-mo I took some time to read the Java SimpleDateFormat code used by Spark to better understand its actual behavior. I've updated the branch accordingly. Looking forward to your review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Looks much better now. Some comments.
@rui-mo I've updated the code based on the review. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you also resolve the code format issue? Thanks.
@rui-mo I've resolved format issue and review comments. Thanks. |
cc: @mbasmanova Do you have any more comment? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NEUpanning Would you update description to provide more context. There was a lot of discussion and on that PR. It would be nice to summarize the outcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NEUpanning Overall looks great. Some comments.
velox/type/TimestampConversion.h
Outdated
/// modes: lenient and non-lenient. For non-lenient mode, it returns an error | ||
/// status if the date is invalid. For lenient mode, it accepts a wider range of | ||
/// arguments. | ||
/// @param year Year. For non-lenient mode, it should be in the range [1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should be in the range [1, 292278994]. e.g: 1996, -2000.
typo? The range doesn't seem to allow negative values or zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a typo. The min value of year in non-lenient mode is 1. Here is Java implementation:
static final int MIN_VALUES[] = {
BCE, // ERA
1, // YEAR
JANUARY, // MONTH
1, // WEEK_OF_YEAR
0, // WEEK_OF_MONTH
1, // DAY_OF_MONTH
1, // DAY_OF_YEAR
SUNDAY, // DAY_OF_WEEK
1, // DAY_OF_WEEK_IN_MONTH
AM, // AM_PM
0, // HOUR
0, // HOUR_OF_DAY
0, // MINUTE
0, // SECOND
0, // MILLISECOND
-13*ONE_HOUR, // ZONE_OFFSET (UNIX compatibility)
0 // DST_OFFSET
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NEUpanning Got it. In this case "e.g: 1996, -2000." needs updating since -2000 is not allowed.
It might be helpful to re-iterate that dates before Jan 1, 1970 are not supported in non-lenient mode.
I find it hard to say / write non-lenient. Would it make sense to flip the boolean to 'strict' mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova I assume using non-lenient
is clearer because it indicates we only have tow modes. If using strict
, readers might wonder if there is a balanced
or moderate
mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow. I was suggesting to consider replacing 'lenient' boolean with 'strict'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark uses the term 'lenient', and this function aligns with Spark. See link. Therefore, I think 'lenient' would be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NEUpanning I assume this suggestion is about replacing 'non-lenient' with 'strict' which makes it more straightforward to understand. The 'lenient' will stay unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume using non-lenient is clearer because it indicates we only have tow modes. If using strict, readers might wonder if there is a balanced or moderate mode.
This is my concern. If this is not a problem or rarely happens, I'd like to replace it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NEUpanning Since it has been clarified in the comment that there are only two modes, so I think it is fine. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion. I've replaced it.
// Out of range year. | ||
auto result = | ||
util::daysSinceEpochFromWeekOfMonthDate(292278995, 1, 1, 1, true); | ||
EXPECT_EQ(result.error().message(), "Date out of range: 292278995-1-1"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we do not expect errors in lenient mode. This behavior doesn't seem to match the doc. Would you double check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see this conversation #10604 (comment). Maybe I should change the documentation.
EXPECT_EQ(result.error().message(), "Date out of range: 292278995-1-1"); | ||
} | ||
|
||
TEST(DateTimeUtilTest, extractISODayOfTheWeek) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason this tests is added in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function extractISODayOfTheWeek is changed. See this discussion:#10604 (comment)
@mbasmanova I've updated description and code. Could you review again? Thanks. |
Have you published the changes? For some reason, I don't see these. |
@mbasmanova I didn't notice you have reviewed the changes in lastday. I've updated again. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova @rui-mo Could you help to merge this PR? Thanks. |
@bikramSingh91 Bikram, would you help merge this PR? |
@bikramSingh91 merged this pull request in fe7fdac. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
This helper function is only used by Spark. To align with Spark's SimpleDateFormat behavior, this function offers two modes: lenient and non-lenient. For non-lenient mode, it returns an error status if the date is invalid. For lenient mode, it accepts a wider range of arguments.
Part of #10511