Add Spark CAST(timestamp as integral) #11468
base: main
Conversation
Thanks.
```cpp
      (int64_t)(nanos_ / 1'000));
} catch (const std::exception& e) {
  // We use int128_t to make sure the computation does not overflows since
  // there are cases such that seconds*1000000 does not fit in int64_t,
```
Spark uses an int64 to represent microseconds, so the max allowed seconds should be `INT64_MAX / 1000000`. For a valid timestamp from Spark, why would `seconds * 1000000` overflow?
The min seconds value could overflow; the min seconds value is -9223372036855.
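To make the boundary concrete, here is a minimal sketch of the overflow case. The sub-second value 224192 is chosen for illustration (it lands the total exactly on INT64_MIN) and is not taken from the actual `Timestamp::minMillis()` value:

```cpp
#include <cstdint>
#include <iostream>
#include <limits>

int main() {
  // INT64_MIN microseconds is -9223372036854775808, so the smallest whole
  // seconds value whose total microseconds can still fit is -9223372036855.
  const int64_t seconds = -9223372036855;
  const int64_t microsOfSecond = 224192; // illustrative sub-second part

  // seconds * 1'000'000 alone is -9223372036855000000, which is below
  // INT64_MIN; doing the multiply in __int128_t avoids the overflow.
  const __int128_t total =
      static_cast<__int128_t>(seconds) * 1'000'000 + microsOfSecond;

  // With this sub-second part the total lands exactly on INT64_MIN, so the
  // final result fits in int64_t even though the product alone did not.
  std::cout << (total >= std::numeric_limits<int64_t>::min()) << "\n"; // 1
  return 0;
}
```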
Would you like to extract this fix to a separate PR, like 671e126? We could add a test in `velox/type/tests/TimestampTest.cpp`.
Sure, let me extract this to a new PR.
Added here: #11774
Please also update the documentation, thanks!
```diff
@@ -41,6 +40,12 @@ Expected<Timestamp> SparkCastHooks::castIntToTimestamp(int64_t seconds) const {
   return Timestamp(seconds, 0);
 }

 Expected<int64_t> SparkCastHooks::castTimestampToInt(
     Timestamp timestamp) const {
   return std::floor(
```
Does timestamp.toMicros() / Timestamp::kMicrosecondsInSecond work?
If `timestamp.toMicros()` is negative, we need to round towards negative infinity instead of towards zero, so we need `std::floor` here, like the implementation in Spark, which uses `Math.floorDiv(ts, MICROS_PER_SECOND)`.
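For illustration, a minimal sketch of the difference between C++'s truncating integer division and floor division; the `floorDiv` helper below is hypothetical, mirroring Java's `Math.floorDiv` rather than any existing Velox utility:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical helper mirroring Java's Math.floorDiv: rounds the quotient
// towards negative infinity instead of towards zero.
int64_t floorDiv(int64_t x, int64_t y) {
  int64_t q = x / y; // C++ integer division truncates towards zero
  if ((x % y != 0) && ((x < 0) != (y < 0))) {
    --q; // adjust when signs differ and there is a remainder
  }
  return q;
}

int main() {
  const int64_t micros = -1'500'000; // 1.5 seconds before the epoch
  std::cout << micros / 1'000'000 << "\n";          // -1: truncation towards zero
  std::cout << floorDiv(micros, 1'000'000) << "\n"; // -2: what Spark's cast expects
  return 0;
}
```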
```cpp
} catch (const std::exception& e) {
  // We use int128_t to make sure the computation does not overflows since
  // there are cases such that seconds*1000000 does not fit in int64_t,
  // but seconds*1000000 + nanos does, an example is TimeStamp::minMillis().
```
Typo: `TimeStamp::minMillis()` -> `Timestamp::minMillis()`.
```cpp
  // there are cases such that seconds*1000000 does not fit in int64_t,
  // but seconds*1000000 + nanos does, an example is TimeStamp::minMillis().

  // If the final result does not fit in int64_tw we throw.
```
Typo: `int64_tw`.
```cpp
  // If the final result does not fit in int64_tw we throw.
  __int128_t result =
      (__int128_t)seconds_ * 1'000'000 + (int64_t)(nanos_ / 1'000);
  if (result < std::numeric_limits<int64_t>::min() ||
```
Nit: consider `INT64_MAX` and `INT64_MIN`.
```cpp
if (castResult.hasError()) {
  setError(castResult.error().message());
} else {
  result->set(row, static_cast<To>(castResult.value()));
```
Regarding `static_cast<To>(castResult.value())`: is overflow well handled when casting int64_t to a narrower type?
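For context, a minimal sketch of what a narrowing `static_cast` does with out-of-range values; since C++20 this wrap-around is guaranteed by the standard, while before that it was implementation-defined (though mainstream compilers wrap the same way):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // 300 does not fit in int8_t (range -128..127). The narrowing conversion
  // keeps only the low 8 bits: 300 = 0x12C, low byte 0x2C = 44.
  int64_t wide = 300;
  int8_t narrow = static_cast<int8_t>(wide);
  std::cout << static_cast<int>(narrow) << "\n"; // prints 44

  // Spark's non-ANSI cast to tinyint also truncates to the low bits, so the
  // behaviors should line up, but that is worth covering with a test.
  return 0;
}
```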
Add Spark CAST(timestamp as integral). Supported types are tinyint, smallint, integer, and bigint.
Spark's implementation: https://github.com/apache/spark/blob/fd86f85e181fc2dc0f50a096855acf83a6cc5d9c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L682
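Putting the pieces of this discussion together, here is a self-contained sketch of the cast semantics the PR targets: widen to `__int128_t` so the microsecond computation cannot overflow, floor-divide to get seconds, then narrow to the target integral type. The function and names are illustrative, not the PR's actual code:

```cpp
#include <cstdint>
#include <iostream>

// Illustrative only: compute a Spark-style CAST(timestamp AS <integral>)
// from a (seconds, nanos) pair.
template <typename To>
To castTimestampToIntegral(int64_t seconds, uint64_t nanos) {
  // Widen so that seconds * 1'000'000 cannot overflow the intermediate math.
  __int128_t micros =
      static_cast<__int128_t>(seconds) * 1'000'000 + nanos / 1'000;

  // Floor division by 1'000'000: round towards negative infinity, like
  // Math.floorDiv(ts, MICROS_PER_SECOND) in Spark.
  __int128_t q = micros / 1'000'000;
  if (micros % 1'000'000 != 0 && micros < 0) {
    --q;
  }
  // Narrowing keeps the low bits, matching Spark's non-ANSI behavior.
  return static_cast<To>(q);
}

int main() {
  // 0.5 s before the epoch floors to -1, not 0.
  std::cout << castTimestampToIntegral<int64_t>(-1, 500'000'000) << "\n"; // -1
  // 200 s wraps when narrowed to tinyint: low byte of 200 is -56.
  std::cout << static_cast<int>(castTimestampToIntegral<int8_t>(200, 0))
            << "\n"; // -56
  return 0;
}
```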