-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Spark date_trunc function #11340
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. The key idea is that by moving some of the methods to the lib for reuse, we can reduce the code duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for refactoring.
@@ -123,4 +126,155 @@ struct InitSessionTimezone { | |||
timeZone_ = getTimeZoneFromConfig(config); | |||
} | |||
}; | |||
|
|||
FOLLY_ALWAYS_INLINE std::optional<DateTimeUnit> fromDateTimeUnitString( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please add some comments for the new API? If there are unit tests for them, please move them into the lib; if not, we will need to add them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And maybe mention it in the currently empty PR description/commit message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Some minors. Could you also clarify what you have changed in the PR description?
|
||
.. spark:function:: date_trunc(fmt, ts) -> timestamp | ||
|
||
Returns timestamp ts truncated to the unit specified by the format model fmt. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns timestamp ts -> Returns timestamp
that ts
, by the format model fmt
Returns null if ``fmt`` is invalid. | ||
``fmt`` is case insensitive and must be one of the following: | ||
* "YEAR", "YYYY", "YY" - truncate to the first date of the year that the ts falls in, the time part will be zero out | ||
* "QUARTER" - truncate to the first date of the quarter that the ts falls in, the time part will be zero out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that the ts
falls, please quote all the ts, thanks!
velox/functions/lib/TimeUtils.h
Outdated
bool allowAbbreviated = false) { | ||
static const StringView kMicrosecond("microsecond"); | ||
static const StringView kMillisecond("millisecond"); | ||
static const StringView kSecond("second"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this fields are used only once, do we need to set them? Or we could use the string directly in if condition? CC @rui-mo
For example, https://github.com/facebookincubator/velox/blob/main/velox/common/base/StatsReporter.h#L70
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the original presto function code. Now it is just extracted to lib for reuse. Do we need to change it? @jinchengchenghh @rui-mo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove those defined in DateTimeFormatter then? And +1 for using string directly because these variables are used only once.
velox/functions/lib/TimeUtils.h
Outdated
} | ||
|
||
// Calculate the correct day of the month based on the number of days | ||
// in the adjusted month |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add .
in the end.
velox/functions/lib/TimeUtils.h
Outdated
31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31}; | ||
int daysInPrevMonth = daysInMonth[dateTime.tm_mon]; | ||
|
||
// Adjust for leap year if February |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
velox/functions/lib/TimeUtils.h
Outdated
(dateTime.tm_year + 1900) % 400 == 0)) { | ||
daysInPrevMonth = 29; | ||
} | ||
// Set to the correct day in the previous month |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
velox/functions/lib/TimeUtils.h
Outdated
|
||
// Returns timestamp truncated to the unit specified by the format. | ||
FOLLY_ALWAYS_INLINE Timestamp dateTrunc( | ||
const DateTimeUnit unit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't add the const to enum class, you could consider it as integer like.
https://stackoverflow.com/questions/44749658/c-is-it-better-to-pass-an-enum-class-as-value-or-const-reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jinchengchenghh Thanks for review. It seems to be used this way in velox,
inline bool isTimeUnit(const DateTimeUnit unit) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to add const when passing by value.
struct DateTruncFunction { | ||
VELOX_DEFINE_FUNCTION_TYPES(T); | ||
|
||
const tz::TimeZone* timeZone_ = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move the class member to the last of the struct as FromUnixtimeFunction.
And move it to private scope
if (!unitOption.has_value()) { | ||
return false; | ||
} | ||
DateTimeUnit unit = unitOption.value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const DateTimeUnit unit. I would prefer to drop this initialization since it is used only once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
velox/functions/lib/TimeUtils.h
Outdated
bool allowAbbreviated = false) { | ||
static const StringView kMicrosecond("microsecond"); | ||
static const StringView kMillisecond("millisecond"); | ||
static const StringView kSecond("second"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove those defined in DateTimeFormatter then? And +1 for using string directly because these variables are used only once.
velox/functions/lib/TimeUtils.h
Outdated
|
||
// Returns timestamp truncated to the unit specified by the format. | ||
FOLLY_ALWAYS_INLINE Timestamp dateTrunc( | ||
const DateTimeUnit unit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to add const when passing by value.
525fdaa
to
8ab10ec
Compare
@@ -518,6 +518,8 @@ void registerFunctions(const std::string& prefix) { | |||
Varchar, | |||
Varchar, | |||
Varchar>({prefix + "mask"}); | |||
registerFunction<DateTruncFunction, Timestamp, Varchar, Timestamp>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move the function registration after the date time functions https://github.com/facebookincubator/velox/pull/11340/files#diff-19962809ab2250cbf3c99e92b7ba14ae89e2f26ae5e646db3c2b31e67af16e8dL450
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved, thanks.
* "MICROSECOND" - everything remains | ||
|
||
Returns timestamp ``ts`` truncated to the unit specified by the format model ``fmt``. | ||
Returns null if ``fmt`` is invalid. :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if ts
is null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* "MILLISECOND" - zero out the microseconds | ||
* "MICROSECOND" - everything remains | ||
|
||
Returns timestamp ``ts`` truncated to the unit specified by the format model ``fmt``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move the function description before argument description L56.
ASSERT_EQ(std::nullopt, fromDateTimeUnitString("yyyy", false)); | ||
ASSERT_EQ(std::nullopt, fromDateTimeUnitString("yy", false)); | ||
|
||
ASSERT_EQ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is valid too.
ASSERT_EQ(
DateTimeUnit::kMillisecond, fromDateTimeUnitString("millisecond", false));
And other public functions also need tests. I'm open for this point since the function test should have cover the function you moved, but it is better to add it.
ASSERT_EQ( | ||
std::optional(DateTimeUnit::kYear), | ||
fromDateTimeUnitString("yyyy", false, true, true)); | ||
ASSERT_EQ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test invalid fmt by VELOX_ASSERT_THROW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Added some comments.
* "MINUTE"- zero out the second with fraction part | ||
* "SECOND" - zero out the second fraction part | ||
* "MILLISECOND" - zero out the microseconds | ||
* "MICROSECOND" - everything remains. :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example section does not render correctly. You could add the '::' as below to fix it.
* "MICROSECOND" - everything remains.
::
SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359'); -- 2015-01-01 00:00:00
SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:00:00 | ||
SELECT date_trunc('MINUTE', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:00 | ||
SELECT date_trunc('SECOND', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:05 | ||
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456'); -- 2015-03-05 09:32:05.123456 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The microseconds are not zeroed out. Is this example correct?
SELECT date_trunc('MINUTE', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:00 | ||
SELECT date_trunc('SECOND', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:05 | ||
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456'); -- 2015-03-05 09:32:05.123456 | ||
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05..123456'); -- 2015-01-01 09:32:05.123456 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate for millisecond.
@@ -118,6 +118,19 @@ enum class DateTimeFormatSpecifier : uint8_t { | |||
WEEK_OF_MONTH = 24 | |||
}; | |||
|
|||
enum class DateTimeUnit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean remove class DateTimeUnit
?
velox/functions/lib/TimeUtils.h
Outdated
FOLLY_ALWAYS_INLINE std::optional<DateTimeUnit> fromDateTimeUnitString( | ||
const StringView& unitString, | ||
bool throwIfInvalid, | ||
bool allowMirco = false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: Mirco -> Micro
velox/functions/lib/TimeUtils.h
Outdated
if (seconds < 0 && seconds % intervalSeconds) { | ||
s = s - 1; | ||
} | ||
int64_t truncedSeconds = s * intervalSeconds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: trunced -> truncated
velox/functions/lib/TimeUtils.h
Outdated
|
||
/// For fixed interval like minute, hour and day, | ||
/// we can truncate date by a simple arithmetic expression: | ||
/// floor(seconds / intervalSeconds) * intervalSeconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid repeating the code in comments. Comment suggestion:
/// Returns timestamp with seconds adjusted to the nearest lower multiple of the
/// specified interval. If the given seconds is negative and not an exact
/// multiple of the interval, it adjusts further down.
/// we can truncate date by a simple arithmetic expression: | ||
/// floor(seconds / intervalSeconds) * intervalSeconds. | ||
FOLLY_ALWAYS_INLINE Timestamp | ||
adjustEpoch(int64_t seconds, int64_t intervalSeconds) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto. Do we have unit test for it?
} | ||
|
||
// Returns timestamp truncated to the specified unit. | ||
FOLLY_ALWAYS_INLINE Timestamp truncateTimestamp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto. Better to add unit test.
FOLLY_ALWAYS_INLINE void initialize( | ||
const std::vector<TypePtr>& /*inputTypes*/, | ||
const core::QueryConfig& config, | ||
const arg_type<Varchar>* format, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: format is not used either.
Add Spark date_trunc function.
Also refactor presto date_trunc function for reuse:
Move class
DateTimeUnit
toDateTimeFormatter.h
.Extend funciton
fromDateTimeUnitString
, which can be used with presto and spark, and move toTimeUtils.h
.Move
adjustDateTime
andadjustEpoch
toTimeUtils.h
.Extract new function
dateTrunc
.