Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Spark date_trunc function #11340

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

zml1206
Copy link
Contributor

@zml1206 zml1206 commented Oct 24, 2024

Add Spark date_trunc function.
Also refactor presto date_trunc function for reuse:
Move class DateTimeUnit to DateTimeFormatter.h.
Extend funciton fromDateTimeUnitString, which can be used with presto and spark, and move to TimeUtils.h.
Move adjustDateTime and adjustEpoch to TimeUtils.h.
Extract new function dateTrunc.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 24, 2024
Copy link

netlify bot commented Oct 24, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit d544dde
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/672c3cd0fe18ba000853d699

@rui-mo rui-mo changed the title add date_trunc spark function Add Spark date_trunc function Oct 28, 2024
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The key idea is that by moving some of the methods to the lib for reuse, we can reduce the code duplication.

velox/docs/functions/spark/datetime.rst Outdated Show resolved Hide resolved
velox/functions/sparksql/DateTimeFunctions.h Outdated Show resolved Hide resolved
velox/functions/sparksql/DateTimeFunctions.h Outdated Show resolved Hide resolved
velox/functions/sparksql/DateTimeFunctions.h Outdated Show resolved Hide resolved
velox/functions/sparksql/DateTimeFunctions.h Outdated Show resolved Hide resolved
@zml1206 zml1206 requested a review from rui-mo October 29, 2024 06:17
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for refactoring.

velox/functions/sparksql/DateTimeFunctions.h Outdated Show resolved Hide resolved
@@ -123,4 +126,155 @@ struct InitSessionTimezone {
timeZone_ = getTimeZoneFromConfig(config);
}
};

FOLLY_ALWAYS_INLINE std::optional<DateTimeUnit> fromDateTimeUnitString(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add some comments for the new API? If there are unit tests for them, please move them into the lib; if not, we will need to add them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe mention it in the currently empty PR description/commit message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Contributor

@jinchengchenghh jinchengchenghh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Some minors. Could you also clarify what you have changed in the PR description?


.. spark:function:: date_trunc(fmt, ts) -> timestamp

Returns timestamp ts truncated to the unit specified by the format model fmt.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns timestamp ts -> Returns timestamp that ts, by the format model fmt

Returns null if ``fmt`` is invalid.
``fmt`` is case insensitive and must be one of the following:
* "YEAR", "YYYY", "YY" - truncate to the first date of the year that the ts falls in, the time part will be zero out
* "QUARTER" - truncate to the first date of the quarter that the ts falls in, the time part will be zero out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that the ts falls, please quote all the ts, thanks!

bool allowAbbreviated = false) {
static const StringView kMicrosecond("microsecond");
static const StringView kMillisecond("millisecond");
static const StringView kSecond("second");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this fields are used only once, do we need to set them? Or we could use the string directly in if condition? CC @rui-mo
For example, https://github.com/facebookincubator/velox/blob/main/velox/common/base/StatsReporter.h#L70

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the original presto function code. Now it is just extracted to lib for reuse. Do we need to change it? @jinchengchenghh @rui-mo

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove those defined in DateTimeFormatter then? And +1 for using string directly because these variables are used only once.

}

// Calculate the correct day of the month based on the number of days
// in the adjusted month
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add . in the end.

31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
int daysInPrevMonth = daysInMonth[dateTime.tm_mon];

// Adjust for leap year if February
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

(dateTime.tm_year + 1900) % 400 == 0)) {
daysInPrevMonth = 29;
}
// Set to the correct day in the previous month
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


// Returns timestamp truncated to the unit specified by the format.
FOLLY_ALWAYS_INLINE Timestamp dateTrunc(
const DateTimeUnit unit,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add the const to enum class, you could consider it as integer like.
https://stackoverflow.com/questions/44749658/c-is-it-better-to-pass-an-enum-class-as-value-or-const-reference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jinchengchenghh Thanks for review. It seems to be used this way in velox,

inline bool isTimeUnit(const DateTimeUnit unit) {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add const when passing by value.

struct DateTruncFunction {
VELOX_DEFINE_FUNCTION_TYPES(T);

const tz::TimeZone* timeZone_ = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the class member to the last of the struct as FromUnixtimeFunction.
And move it to private scope

if (!unitOption.has_value()) {
return false;
}
DateTimeUnit unit = unitOption.value();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const DateTimeUnit unit. I would prefer to drop this initialization since it is used only once.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

velox/docs/functions/spark/datetime.rst Outdated Show resolved Hide resolved
velox/docs/functions/spark/datetime.rst Outdated Show resolved Hide resolved
velox/docs/functions/spark/datetime.rst Outdated Show resolved Hide resolved
bool allowAbbreviated = false) {
static const StringView kMicrosecond("microsecond");
static const StringView kMillisecond("millisecond");
static const StringView kSecond("second");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove those defined in DateTimeFormatter then? And +1 for using string directly because these variables are used only once.

velox/functions/lib/TimeUtils.h Outdated Show resolved Hide resolved
velox/functions/lib/TimeUtils.h Outdated Show resolved Hide resolved
velox/functions/lib/TimeUtils.h Outdated Show resolved Hide resolved

// Returns timestamp truncated to the unit specified by the format.
FOLLY_ALWAYS_INLINE Timestamp dateTrunc(
const DateTimeUnit unit,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add const when passing by value.

velox/functions/lib/TimeUtils.h Outdated Show resolved Hide resolved
velox/functions/lib/TimeUtils.h Show resolved Hide resolved
fix style

fix

fix

fix

fix

fix

fix
fix style

fix

fix

fix

fix
update

fix

fix
@@ -518,6 +518,8 @@ void registerFunctions(const std::string& prefix) {
Varchar,
Varchar,
Varchar>({prefix + "mask"});
registerFunction<DateTruncFunction, Timestamp, Varchar, Timestamp>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved, thanks.

* "MICROSECOND" - everything remains

Returns timestamp ``ts`` truncated to the unit specified by the format model ``fmt``.
Returns null if ``fmt`` is invalid. ::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if ts is null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* "MILLISECOND" - zero out the microseconds
* "MICROSECOND" - everything remains

Returns timestamp ``ts`` truncated to the unit specified by the format model ``fmt``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the function description before argument description L56.

ASSERT_EQ(std::nullopt, fromDateTimeUnitString("yyyy", false));
ASSERT_EQ(std::nullopt, fromDateTimeUnitString("yy", false));

ASSERT_EQ(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is valid too.

 ASSERT_EQ(
      DateTimeUnit::kMillisecond, fromDateTimeUnitString("millisecond", false));

And other public functions also need tests. I'm open for this point since the function test should have cover the function you moved, but it is better to add it.

ASSERT_EQ(
std::optional(DateTimeUnit::kYear),
fromDateTimeUnitString("yyyy", false, true, true));
ASSERT_EQ(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test invalid fmt by VELOX_ASSERT_THROW

update

update
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Added some comments.

* "MINUTE"- zero out the second with fraction part
* "SECOND" - zero out the second fraction part
* "MILLISECOND" - zero out the microseconds
* "MICROSECOND" - everything remains. ::
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example section does not render correctly. You could add the '::' as below to fix it.

       * "MICROSECOND" - everything remains.
        
    ::

        SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359'); -- 2015-01-01 00:00:00

SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:00:00
SELECT date_trunc('MINUTE', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:00
SELECT date_trunc('SECOND', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:05
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456'); -- 2015-03-05 09:32:05.123456
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The microseconds are not zeroed out. Is this example correct?

SELECT date_trunc('MINUTE', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:00
SELECT date_trunc('SECOND', '2015-03-05T09:32:05.359'); -- 2015-03-05 09:32:05
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456'); -- 2015-03-05 09:32:05.123456
SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05..123456'); -- 2015-01-01 09:32:05.123456
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate for millisecond.

@@ -118,6 +118,19 @@ enum class DateTimeFormatSpecifier : uint8_t {
WEEK_OF_MONTH = 24
};

enum class DateTimeUnit {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean remove class DateTimeUnit?

FOLLY_ALWAYS_INLINE std::optional<DateTimeUnit> fromDateTimeUnitString(
const StringView& unitString,
bool throwIfInvalid,
bool allowMirco = false,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Mirco -> Micro

if (seconds < 0 && seconds % intervalSeconds) {
s = s - 1;
}
int64_t truncedSeconds = s * intervalSeconds;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: trunced -> truncated


/// For fixed interval like minute, hour and day,
/// we can truncate date by a simple arithmetic expression:
/// floor(seconds / intervalSeconds) * intervalSeconds.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid repeating the code in comments. Comment suggestion:

/// Returns timestamp with seconds adjusted to the nearest lower multiple of the
/// specified interval. If the given seconds is negative and not an exact
/// multiple of the interval, it adjusts further down.

/// we can truncate date by a simple arithmetic expression:
/// floor(seconds / intervalSeconds) * intervalSeconds.
FOLLY_ALWAYS_INLINE Timestamp
adjustEpoch(int64_t seconds, int64_t intervalSeconds) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Do we have unit test for it?

}

// Returns timestamp truncated to the specified unit.
FOLLY_ALWAYS_INLINE Timestamp truncateTimestamp(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Better to add unit test.

FOLLY_ALWAYS_INLINE void initialize(
const std::vector<TypePtr>& /*inputTypes*/,
const core::QueryConfig& config,
const arg_type<Varchar>* format,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: format is not used either.

add ut
@zml1206 zml1206 requested a review from rui-mo November 15, 2024 01:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants