Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add helper function to return the number of days since epoch for a week-of-month date #10604

Closed
Show file tree
Hide file tree
Changes from 13 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
107 changes: 104 additions & 3 deletions velox/type/TimestampConversion.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@

#include "velox/type/TimestampConversion.h"
#include <folly/Expected.h>
#include "HugeInt.h"
#include "velox/common/base/CheckedArithmetic.h"
#include "velox/common/base/Exceptions.h"
#include "velox/type/tz/TimeZoneMap.h"
Expand Down Expand Up @@ -144,6 +145,53 @@ bool isValidWeekDate(int32_t weekYear, int32_t weekOfYear, int32_t dayOfWeek) {
return true;
}

bool isValidWeekOfMonthDate(
Copy link
Collaborator

@rui-mo rui-mo Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the discussion at https://github.com/facebookincubator/velox/pull/10511/files#r1686566980, true result does not always indicate a valid date.

If the date falls outside the valid range, the week or day from the preceding or following month's will be used.

With current implementation, for some invalid date, we return user error status while for the others we return a int64_t result in the daysSinceEpochFromWeekOfMonthDate function, which seems to be confusing.

I notice the method isValidDate provides an accurate evaluation for valid date. Does it make sense to use that?
https://github.com/facebookincubator/velox/blob/main/velox/type/TimestampConversion.cpp#L518-L529

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. We should remove the validation for week of month. According to SimpleDateFormat's doc the range of day is [1,7] so isValidDate isn't suitable for this.

u | Day number of week (1 = Monday, ..., 7 = Sunday) | Number | 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. See e718074

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to SimpleDateFormat's doc the range of day is [1,7] so isValidDate isn't suitable for this.

Could you clarify the purpose of isValidWeekOfMonthDate? Is it to follow certain rule defined in SimpleDateFormat to check if a date is valid, and what are the rules? Perhaps add some comments for this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to do validation follows SimpleDateFormat. However, SimpleDateFormat does not provide clear rules or relevant documentation. For example, document states WEEK_OF_MONTH field range is from 0 to 6, , but in my tests, a value of 99 is considered valid.(https://github.com/facebookincubator/velox/pull/10511/files#r1686566980)

Values calculated for the WEEK_OF_MONTH field range from 0 to 6

Another example is the valid range for the year field, which is not documented.

I tried to determine the valid range through testing, but it is too costly and not reliable enough. I wonder how to properly validate dates in Velox. Could you give me some advice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo I don't believe this functionality is needed for Presto functions. Hence, we are free to define semantics in any "sensible" way. Assuming we need these semantics to match Spark, we can use Spark semantics.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer. Understood.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova @rui-mo According to this discussion, Spark date validation rules in document don't match with the test results. Which one should I follow? Thanks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NEUpanning I assume we need to follow the actual behavior of Spark. How do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo I agree with you. I will change the part of date validation for following the actual behavior of Spark.

int32_t year,
int32_t month,
int32_t weekOfMonth,
int32_t dayOfWeek) {
if (year < 1 || year > kMaxYear) {
return false;
}
if (month < 1 || month > 12) {
return false;
}

int64_t daysSinceEpochOfFirstDayOfMonth;
const Status status =
daysSinceEpochFromDate(year, month, 1, daysSinceEpochOfFirstDayOfMonth);
if (!status.ok()) {
return false;
}

// Calculates the max week of month and validates if it is in the valid range.
const int32_t firstDayOfWeek =
extractISODayOfTheWeek(daysSinceEpochOfFirstDayOfMonth);
const int32_t firstWeekLength = 7 - firstDayOfWeek + 1;
const int32_t monthLength =
isLeapYear(year) ? kLeapDays[month] : kNormalDays[month];
const int32_t maxWeekOfMonth =
1 + ceil((monthLength - firstWeekLength) / 7.0);
if (weekOfMonth < 1 || weekOfMonth > maxWeekOfMonth) {
return false;
}

// Validate day of week.
// If dayOfWeek is before the first day of week, it is considered invalid.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to match the documentation, which suggests the valid range is fixed [1, 7].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. The documentation is not clear. Thanks.

if (weekOfMonth == 1 && dayOfWeek < firstDayOfWeek) {
return false;
}
const int32_t lastWeekLength = (monthLength - firstWeekLength) % 7;
// If dayOfWeek is after the last day of the last week of the month, it is
// considered invalid.
if (weekOfMonth == maxWeekOfMonth && lastWeekLength != 0 &&
dayOfWeek > lastWeekLength) {
return false;
}

return true;
}

inline bool validDate(int64_t daysSinceEpoch) {
return daysSinceEpoch >= std::numeric_limits<int32_t>::min() &&
daysSinceEpoch <= std::numeric_limits<int32_t>::max();
Expand Down Expand Up @@ -593,6 +641,59 @@ Status daysSinceEpochFromWeekDate(
return Status::OK();
}

Expected<int64_t> daysSinceEpochFromWeekOfMonthDate(
int32_t year,
int32_t month,
int32_t weekOfMonth,
int32_t dayOfWeek,
bool lenient) {
if (!lenient &&
!isValidWeekOfMonthDate(year, month, weekOfMonth, dayOfWeek)) {
if (threadSkipErrorDetails()) {
return folly::makeUnexpected(Status::UserError());
} else {
return folly::makeUnexpected(Status::UserError(
"Date out of range: {}-{}-{}-{}",
year,
month,
weekOfMonth,
dayOfWeek));
}
}

// Adjusts the year and month to ensure month is within the range 1-12,
// accounting for overflow or underflow.
int32_t additionYears = 0;
if (month < 1) {
additionYears = month / 12 - 1;
month = 12 - abs(month) % 12;
} else if (month > 12) {
additionYears = (month - 1) / 12;
month = (month - 1) % 12 + 1;
}
year += additionYears;

int64_t daysSinceEpochOfFirstDayOfMonth;
Status status =
daysSinceEpochFromDate(year, month, 1, daysSinceEpochOfFirstDayOfMonth);
if (!status.ok()) {
return folly::makeUnexpected(status);
}
int32_t firstDayOfWeek =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I asked this earlier, but I don't think I saw an answer, hence, asking again. What happens if weekOfMonth is 5, but month has only 4 weeks (Feb)? Similarly, what happens if weekOfMonth is 5 and dayOfWeek is 7 (no month has 35 days)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Java's SimpleDateFormat, "invalid" weekOfMonth and dayOfWeek etc. are allowed. If the date falls outside the valid range, the week or day from the preceding or following month's will be used. For example, if weekOfMonth is 5 but the current month only has 4 weeks (such as February), the first week of March will be considered as the 5th week of February. Similarly, if weekOfMonth is 5 and dayOfWeek is 7, the corresponding day from the next month will be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for clarifying. Let's update method comments to explain this as it is not obvious. Do we already have tests for these cases? If not, let's make sure to add.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this comments for clarifying and two tests for cases that days in previous/next month.

We treat days of the previous or next months as a part of the specified WEEK_OF_MONTH. For example, if
weekOfMonth is 5 but the current month only has 4 weeks (such as February),the first week of March will be considered as the 5th week of February.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing that DATE type is based on INTEGER type, hence, allows only int32_t number of days since epoch. If that's the case, then, perhaps, daysSinceEpochFromDate should return int32_t, not int64_t.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is intended for date parsing, and the DateTimeFormatter::parse method returns a Timestamp. A Timestamp is based on int64_t seconds and can represent up to 106,751,991,167 days, which is calculated as (2^63 - 1) / 1000 / 86400. This value is greater than the maximum value of an int32_t. Therefore, I think that daysSinceEpochFromDate should return an int64_t.

extractISODayOfTheWeek(daysSinceEpochOfFirstDayOfMonth);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractISODayOfTheWeek accepts an int32_t value while the input is of int64_t type. Shall we check for overflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice extractISODayOfTheWeek casts input to int64_t. How about changing parameter to int64_t type instead of int32_t type

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the cast is to avoid overflow in addition and negating. If we change it as int64_t, overflow needs to considered as well by casting to a larger type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your explanation. I see daysSinceEpochFromWeekDate function also calls this funtion with a int64_t argument. I think we change extractISODayOfTheWeek parameter to int64_t and cast it to int128_t when do calculate would be better. How do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is any limitation for year? If the days cannot exceed INT32_MAX, it would be safe to call it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The max year is 292278994 so the days can be (292278994-1970)*365=106,681,113,760 that is bigger than 2,147,483,647(INT32_MAX).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we change extractISODayOfTheWeek parameter to int64_t and cast it to int128_t when do calculate

Looks fine to me. @mbasmanova What are your thoughts? Thanks.

int32_t days;
if (dayOfWeek < 1) {
days = 7 - abs(dayOfWeek - 1) % 7;
} else if (dayOfWeek > 7) {
days = (dayOfWeek - 1) % 7;
} else {
days = dayOfWeek % 7;
}
int64_t result = daysSinceEpochOfFirstDayOfMonth - (firstDayOfWeek - 1) +
7 * (weekOfMonth - 1) + days - 1;
return result;
}

Status
daysSinceEpochFromDayOfYear(int32_t year, int32_t dayOfYear, int64_t& out) {
if (!isValidDayOfYear(year, dayOfYear)) {
Expand Down Expand Up @@ -643,7 +744,7 @@ Expected<int32_t> fromDateString(const char* str, size_t len, ParseMode mode) {
return daysSinceEpoch;
}

int32_t extractISODayOfTheWeek(int32_t daysSinceEpoch) {
int32_t extractISODayOfTheWeek(int64_t daysSinceEpoch) {
// date of 0 is 1970-01-01, which was a Thursday (4)
// -7 = 4
// -6 = 5
Expand All @@ -662,10 +763,10 @@ int32_t extractISODayOfTheWeek(int32_t daysSinceEpoch) {
// 7 = 4
if (daysSinceEpoch < 0) {
// negative date: start off at 4 and cycle downwards
return (7 - ((-int64_t(daysSinceEpoch) + 3) % 7));
return (7 - ((-int128_t(daysSinceEpoch) + 3) % 7));
} else {
// positive date: start off at 4 and cycle upwards
return ((int64_t(daysSinceEpoch) + 3) % 7) + 1;
return ((int128_t(daysSinceEpoch) + 3) % 7) + 1;
}
}

Expand Down
33 changes: 32 additions & 1 deletion velox/type/TimestampConversion.h
Original file line number Diff line number Diff line change
Expand Up @@ -107,6 +107,37 @@ Status daysSinceEpochFromWeekDate(
int32_t dayOfWeek,
int64_t& out);

/// Computes the signed number of days since the Unix epoch (1970-01-01). To
/// align with Spark's SimpleDateFormat behavior, this function offers two
/// modes: lenient and non-lenient. If `lenient` is false, it returns an error
/// status if the date is invalid. If `lenient` is true, it accepts a wider
/// range of arguments.
/// @param year Year. For non-lenient mode, it should be in the range [1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be in the range [1, 292278994]. e.g: 1996, -2000.

typo? The range doesn't seem to allow negative values or zero.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a typo. The min value of year in non-lenient mode is 1. Here is Java implementation:

static final int MIN_VALUES[] = {
BCE, // ERA
1, // YEAR
JANUARY, // MONTH
1, // WEEK_OF_YEAR
0, // WEEK_OF_MONTH
1, // DAY_OF_MONTH
1, // DAY_OF_YEAR
SUNDAY, // DAY_OF_WEEK
1, // DAY_OF_WEEK_IN_MONTH
AM, // AM_PM
0, // HOUR
0, // HOUR_OF_DAY
0, // MINUTE
0, // SECOND
0, // MILLISECOND
-13*ONE_HOUR, // ZONE_OFFSET (UNIX compatibility)
0 // DST_OFFSET
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NEUpanning Got it. In this case "e.g: 1996, -2000." needs updating since -2000 is not allowed.

It might be helpful to re-iterate that dates before Jan 1, 1970 are not supported in non-lenient mode.

I find it hard to say / write non-lenient. Would it make sense to flip the boolean to 'strict' mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova I assume using non-lenient is clearer because it indicates we only have tow modes. If using strict, readers might wonder if there is a balanced or moderate mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow. I was suggesting to consider replacing 'lenient' boolean with 'strict'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark uses the term 'lenient', and this function aligns with Spark. See link. Therefore, I think 'lenient' would be more appropriate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NEUpanning I assume this suggestion is about replacing 'non-lenient' with 'strict' which makes it more straightforward to understand. The 'lenient' will stay unchanged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo

I assume using non-lenient is clearer because it indicates we only have tow modes. If using strict, readers might wonder if there is a balanced or moderate mode.

This is my concern. If this is not a problem or rarely happens, I'd like to replace it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NEUpanning Since it has been clarified in the comment that there are only two modes, so I think it is fine. Thanks.

Copy link
Contributor Author

@NEUpanning NEUpanning Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion. I've replaced it.

/// 292278994]. e.g: 1996, -2000
/// @param month Month of year. For non-lenient mode, it should be in the range
/// [1, 12]. For example, 1 is Jan, 7 is Jul. For lenient mode, values greater
/// than 12 wrap around to the start of the year, and values less than 1 count
/// backward from December. For example, 13 corresponds to January of the
/// following year and -1 corresponds to November of the previous year.
/// @param weekOfMonth Week of the month. For non-lenient mode, it should be in
/// the range [1, 6]. For example, 1 is 1st week, 3 is 3rd week. For lenient
/// mode, we consider days of the previous or next months as part of the
/// specified weekOfMonth. For example, if weekOfMonth is 5 but the current
/// month only has 4 weeks (such as February), the first week of March will be
/// considered as the 5th week of February.
/// @param dayOfWeek Day number of week. For non-lenient mode, it should be in
/// the range [1, 7]. For example, 1 is Monday, 7 is Sunday. For lenient mode,
/// we consider days of the previous or next months as part of the specified
/// dayOfWeek.For example, if weekOfMonth is 1 and dayOfWeek is 1 but the
/// month's first day is Saturday, the Monday of the last week of the previous
/// month will be used.
Expected<int64_t> daysSinceEpochFromWeekOfMonthDate(
int32_t year,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation imposes limits on the 'year' value. Would you document these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Done.

int32_t month,
int32_t weekOfMonth,
int32_t dayOfWeek,
bool lenient);

/// Computes the (signed) number of days since unix epoch (1970-01-01).
/// Returns UserError status if the date is invalid.
Status
Expand All @@ -123,7 +154,7 @@ inline Expected<int32_t> fromDateString(const StringView& str, ParseMode mode) {
}

// Extracts the day of the week from the number of days since epoch
int32_t extractISODayOfTheWeek(int32_t daysSinceEpoch);
int32_t extractISODayOfTheWeek(int64_t daysSinceEpoch);

/// Time conversions.

Expand Down
78 changes: 78 additions & 0 deletions velox/type/tests/TimestampConversionTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,84 @@ TEST(DateTimeUtilTest, fromDateInvalid) {
1970, 6, 31, "Date out of range: 1970-6-31"));
}

TEST(DateTimeUtilTest, fromWeekOfMonthDate) {
auto daysSinceEpoch =
[](int32_t year, int32_t month, int32_t weekOfMonth, int32_t dayOfWeek) {
auto result = util::daysSinceEpochFromWeekOfMonthDate(
year, month, weekOfMonth, dayOfWeek, true);
EXPECT_TRUE(!result.hasError());
return result.value();
};
EXPECT_EQ(4, daysSinceEpoch(1970, 1, 2, 1));
EXPECT_EQ(361, daysSinceEpoch(1971, 1, 1, 1));
EXPECT_EQ(396, daysSinceEpoch(1971, 2, 1, 1));

EXPECT_EQ(10952, daysSinceEpoch(2000, 1, 1, 1));
EXPECT_EQ(19905, daysSinceEpoch(2024, 7, 1, 1));

// Before unix epoch.
EXPECT_EQ(-3, daysSinceEpoch(1970, 1, 1, 1));
EXPECT_EQ(-2, daysSinceEpoch(1970, 1, 1, 2));
EXPECT_EQ(-31, daysSinceEpoch(1969, 12, 1, 1));
EXPECT_EQ(-367, daysSinceEpoch(1969, 1, 1, 1));
EXPECT_EQ(-724, daysSinceEpoch(1968, 1, 2, 1));
EXPECT_EQ(-719533, daysSinceEpoch(0, 1, 1, 1));

// Negative year - BC.
EXPECT_EQ(-719561, daysSinceEpoch(-1, 12, 1, 1));
EXPECT_EQ(-719897, daysSinceEpoch(-1, 1, 1, 1));

// Day in the previous month.
EXPECT_EQ(19783, daysSinceEpoch(2024, 2, 5, 5));
// Day in the next month.
EXPECT_EQ(19751, daysSinceEpoch(2024, 2, 1, 1));

// Out of range day of week.
EXPECT_EQ(338, daysSinceEpoch(1970, 12, 1, 0));
EXPECT_EQ(337, daysSinceEpoch(1970, 12, 1, -1));
EXPECT_EQ(337, daysSinceEpoch(1970, 12, 1, -8));

EXPECT_EQ(332, daysSinceEpoch(1970, 12, 1, 8));
EXPECT_EQ(333, daysSinceEpoch(1970, 12, 1, 9));
EXPECT_EQ(336, daysSinceEpoch(1970, 12, 1, 19));

// Out of range month.
EXPECT_EQ(-3, daysSinceEpoch(1970, 1, 1, 1));
EXPECT_EQ(207, daysSinceEpoch(1970, 8, 1, 1));
EXPECT_EQ(361, daysSinceEpoch(1970, 13, 1, 1));

EXPECT_EQ(-31, daysSinceEpoch(1970, 0, 1, 1));
EXPECT_EQ(-66, daysSinceEpoch(1970, -1, 1, 1));
EXPECT_EQ(-430, daysSinceEpoch(1970, -13, 1, 1));
}

TEST(DateTimeUtilTest, fromWeekOfMonthDateInvalid) {
auto daysSinceEpoch = [](int32_t year,
int32_t month,
int32_t weekOfMonth,
int32_t dayOfWeek,
const std::string& error) {
auto result = util::daysSinceEpochFromWeekOfMonthDate(
year, month, weekOfMonth, dayOfWeek, false);
EXPECT_TRUE(result.error().isUserError());
EXPECT_EQ(result.error().message(), error);
};

EXPECT_NO_THROW(daysSinceEpoch(-1, 1, 1, 1, "Date out of range: -1-1-1-1"));
EXPECT_NO_THROW(
daysSinceEpoch(292278995, 1, 1, 1, "Date out of range: 292278995-1-1-1"));
EXPECT_NO_THROW(
daysSinceEpoch(2024, 0, 1, 1, "Date out of range: 2024-0-1-1"));
EXPECT_NO_THROW(
daysSinceEpoch(2024, 13, 1, 1, "Date out of range: 2024-13-1-1"));
EXPECT_NO_THROW(
daysSinceEpoch(2024, 1, 6, 1, "Date out of range: 2024-1-6-1"));
EXPECT_NO_THROW(
daysSinceEpoch(2024, 2, 1, 1, "Date out of range: 2024-2-1-1"));
EXPECT_NO_THROW(
daysSinceEpoch(2024, 2, 5, 5, "Date out of range: 2024-2-5-5"));
}

TEST(DateTimeUtilTest, fromDateString) {
for (ParseMode mode : {ParseMode::kPrestoCast, ParseMode::kSparkCast}) {
EXPECT_EQ(0, parseDate("1970-01-01", mode));
Expand Down
Loading