Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Spark CAST(timestamp as integral) #11468

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

boneanxs
Copy link
Contributor

@boneanxs boneanxs commented Nov 7, 2024

Add Spark CAST (timestamp as integral). Supported types are tinyint, smallint, integer and bigint.

Spark's implementation: https://github.com/apache/spark/blob/fd86f85e181fc2dc0f50a096855acf83a6cc5d9c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L682

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 7, 2024
Copy link

netlify bot commented Nov 7, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 8cc7aa6
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/672c59e71b10db0008347454

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

(int64_t)(nanos_ / 1'000));
} catch (const std::exception& e) {
// We use int128_t to make sure the computation does not overflows since
// there are cases such that seconds*1000000 does not fit in int64_t,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark is using int64 type to represent micro second, so the max allowed seconds should be INT64_MAX / 1000000. For a valid timestamp from Spark, why would seconds * 1000000 overflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The min second could overflow, the min second is -9223372036855

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like to extract this fix to a separate PR like 671e126? We could add test in 'velox/type/tests/TimestampTest.cpp'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, let me extract this to a new pr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added here: #11774

@boneanxs boneanxs requested a review from rui-mo November 11, 2024 06:53
Copy link
Contributor

@jinchengchenghh jinchengchenghh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also update the document, thanks!

@@ -41,6 +40,12 @@ Expected<Timestamp> SparkCastHooks::castIntToTimestamp(int64_t seconds) const {
return Timestamp(seconds, 0);
}

Expected<int64_t> SparkCastHooks::castTimestampToInt(
Timestamp timestamp) const {
return std::floor(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does timestamp.toMicros() / Timestamp::kMicrosecondsInSecond work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the timestamp.toMicros() is negative, we need to round towards negative infinity instead of 0, so here we need std::floor, like the implementation in Spark using Math.floorDiv(ts, MICROS_PER_SECOND)

} catch (const std::exception& e) {
// We use int128_t to make sure the computation does not overflows since
// there are cases such that seconds*1000000 does not fit in int64_t,
// but seconds*1000000 + nanos does, an example is TimeStamp::minMillis().
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo TimeStamp::minMillis(). -> Timestamp::minMillis().

// there are cases such that seconds*1000000 does not fit in int64_t,
// but seconds*1000000 + nanos does, an example is TimeStamp::minMillis().

// If the final result does not fit in int64_tw we throw.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo int64_tw

// If the final result does not fit in int64_tw we throw.
__int128_t result =
(__int128_t)seconds_ * 1'000'000 + (int64_t)(nanos_ / 1'000);
if (result < std::numeric_limits<int64_t>::min() ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INT64_MAX and INT64_MIN

(int64_t)(nanos_ / 1'000));
} catch (const std::exception& e) {
// We use int128_t to make sure the computation does not overflows since
// there are cases such that seconds*1000000 does not fit in int64_t,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like to extract this fix to a separate PR like 671e126? We could add test in 'velox/type/tests/TimestampTest.cpp'.

if (castResult.hasError()) {
setError(castResult.error().message());
} else {
result->set(row, static_cast<To>(castResult.value()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_cast(castResult.value())

Is overflow well handled if casting int64_t as a lower-byte type?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants