-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for UUID comparison functions #10791
base: main
Are you sure you want to change the base?
feat: Add support for UUID comparison functions #10791
Conversation
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @BryanCutler. One small comment.
Did you try running the queries in prestodb/presto#23301 with this logic ?
} | ||
}; | ||
|
||
#define VELOX_GEN_BINARY_EXPR_UUID(Name, uuidCompExpr, TResult) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TResult parameter is not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good catch! It's actually the same in VELOX_GEN_BINARY_EXPR_TIMESTAMP_WITH_TIME_ZONE
too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Yes, and the queries for <,>,<=,>= now pass. But it is a little strange, they only pass for me in debug mode, when I run in release mode I get this type of error:
where it looks like one UUID is randomly generated, and I'm not sure where it's coming from. I'll look into that a bit more. |
@aditi-pandit I didn't realize the temp table from the test in prestodb/presto#23301 included a randomly generated UUID with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @BryanCutler. Had some minor comments.
template <typename T> \ | ||
struct Name##Uuid : public UuidCompareBase { \ | ||
VELOX_DEFINE_FUNCTION_TYPES(T); \ | ||
FOLLY_ALWAYS_INLINE void call( \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit : Add newline in between.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aditi-pandit I don't see a consistent format for this, most of the time it looks how it is currently (when the arguments don't fit within a single line) but I do see some cases where the format is:
template <typename T>
struct MultiplyVoidOutputFunction {
template <typename TInput>
FOLLY_ALWAYS_INLINE void
call(TInput& result, const TInput& a, const TInput& b) {
result = functions::multiply(a, b);
}
};
Do you mean the latter, with a newline after void?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler : I meant newline in between
VELOX_DEFINE_FUNCTION_TYPES(T);
FOLLY_ALWAYS_INLINE void call(
You can stick with the formatting from make format-fix
otherwise.
@@ -108,5 +108,45 @@ TEST_F(UuidFunctionsTest, unsupportedCast) { | |||
evaluate("cast(123 as uuid())", input), "Cannot cast BIGINT to UUID."); | |||
} | |||
|
|||
TEST_F(UuidFunctionsTest, comparisons) { | |||
auto data0 = makeFlatVector<std::string>({ | |||
"33355449-2c7d-43d7-967a-f53cd23215ad", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to add more tests where the lower 8 bytes are the same for better coverage of the comparison logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, for the lower bits I have a check for <, = and I'll add one more for >
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some additional cases, one with equal lower bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @BryanCutler.
@Yuhta @kagamiori @mbasmanova : Please can you help with the review. |
Gentle ping @Yuhta @kagamiori @mbasmanova to please review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanCutler Thank you for adding this functionality. I wonder where is the semantics of comparing UUIDs come from? Do you have a pointer? Would be nice to update PR description and documentation (https://facebookincubator.github.io/velox/develop/types.html) to clarify.
#include "velox/functions/prestosql/types/UuidType.h" | ||
|
||
namespace facebook::velox::functions { | ||
|
||
struct UuidCompareBase { | ||
FOLLY_ALWAYS_INLINE int compare64(uint64_t x, uint64_t y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can be static functions or free functions.
Let's also put this code into 'details' namespace.
} | ||
|
||
int compare128(const int128_t& lhs, const int128_t& rhs) { | ||
int comp = compare64(HugeInt::upper(lhs), HugeInt::upper(rhs)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comp
Do not abbreviate: comp -> result or similar.
"33355449-2c7d-43d7-967a-f53cd23215ad", | ||
}); | ||
|
||
auto actual = evaluate("cast(c0 as uuid) < cast(c1 as uuid)", makeRowVector({data0, data1})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason not to test one value at a time using evaluateOnce? This makes tests easier to read and debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, just not familiar with the test environment. I will try with evaluateOnce
.
Thanks for reviewing @mbasmanova
I basically followed the comparison done in in Presto Java https://github.com/prestodb/presto/blob/97d88cfdebb553b5c7fb0e217b0211573df3d190/presto-common/src/main/java/com/facebook/presto/common/type/UuidType.java#L97 Since int128_t doesn't have support for atomic operations, we first check the higher bytes, then lower. Returning an int from the comparison allows it to be streamlined to do minimal checks. I will update the PR and documentation. |
@BryanCutler I wonder whether it makes sense to compare UUIDs at all? What does it mean that one UUID is greater than the other? Perhaps, this type should be non-orderable. |
@mbasmanova I think it makes sense, just as you would compare values for std::byte, etc. Most UUID implementations support such operations, see https://www.boost.org/doc/libs/1_85_0/libs/uuid/doc/uuid.html#Operators |
@mbasmanova : I found UUID comparisons in the Java spec also https://docs.oracle.com/javase/8/docs/api/java/util/UUID.html My understanding is that a source generating UUIDs uses a combination of timestamp with the DNS address of the generating device. So there is some notion of ordering of UUIDs as a result. |
@mbasmanova just wanted to confirm if you are ok with adding this functionality? I still need to update the tests per your request and will get to that soon, thanks! |
@BryanCutler Yes, please, move ahead with addressing the comments. Thank you for clarifying the use case. |
@mbasmanova @aditi-pandit after looking into prestodb/presto#23311 there are a couple issues with UUIDs in Velox:
For (1), since Velox stores UUID values with an int128_t, unlike boost::uuid which is a byte array, I think the value should be in little endian. This means instead of directly using memcpy bytes from the boost::uuid during casting, the bytes would need to swapped to get int128 in LE order. wdyt? To fix (2), the word order would need to be switched during serialization. Would you prefer I fix that in this PR or open another? |
I feel the first fix should be to address is the issue of the incorrect results in prestodb/presto#23311. Post that we can do this comparisons PR. From your comments seems that requires 2 changes: @mbasmanova : Using boost::uuid (instead of int128_t) to represent UUID in Velox could be cleaner given these issues. We could make UUID TypeKind::VARBINARY and use the boost::uuid bytes to represent it. wdyt ? @BryanCutler : If we did that, would we need the serialization changes from Native to Java side as well ? My intuition is yes, but I could be wrong. @Yuhta : Jimmy, would be great to hear your thoughts as well. |
Thanks @aditi-pandit , if we did change the storage from int128_t to a byte array in boost::uuid, then we would still need to adjust the serialization to Java, since Presto expects an Int128ArrayBlock. |
@aditi-pandit I don't think VARBINARY is the right physical type for UUID, it is meant for variable length data, while UUID is fixed length. I would say just keep using int128 and fix the endianness so the comparison between UUID is the same as comparison between int128. |
Thanks @Yuhta for your response. I do see a pattern of using VARBINARY for fixed lengths (in IPPREFIX type) but that has length 17 bytes which can't be captured with any integral type. For UUID we do have the option of using int128_t. @BryanCutler : Let's go with Jimmy's suggestion. Will make our implementation simpler as well. |
e4f984a
to
5aa91d4
Compare
Thanks @Yuhta and @aditi-pandit , I have fixed the serialization issue here and added a unit test, I can put that in a separate PR if preferred. I will also test these changes again with prestodb/presto#23311 |
Thanks @Yuhta , I can look into that but there is a bigger concern that this type of ordering will not match Presto Java, see above #10791 (comment). Presto Java I'm not sure if we should try to follow Presto Java with the comparisons or use the spec at RFC-4122 to order lexicographically, e.g. comparing uint128_t in little endian, and differ from Presto Java. Please share recommendations on this @Yuhta @aditi-pandit |
There is a recent related PR cherry-picked from trindodb, this might address some part of the Presto Java issue prestodb/presto#17939. |
@BryanCutler We should ask Presto Java community to decide the ordering of UUID and document them officially first. |
There is a PR up at prestodb/presto#23847 that corrects Presto Java to conform to the IETF RFC 4122 definition of ordering UUIDs. When that is resolved, this PR can be updated to match. |
5aa91d4
to
9fa9dde
Compare
Status: PR prestodb/presto#23847 has been merged, and I have updated the comparisons to use a cast to uint128_t, and this includes the serialization commit from #10791 which stores the UUID value as little endian. Once that PR is merged, it can be dropped from here. I have also checked these changes again to ensure the Presto Java tests from prestodb/presto#23301 |
a703f14
to
9d917dc
Compare
Finished rebasing after fixing serialzation issue at #10791 |
const auto uuidEval = [&](const std::optional<std::string>& lhs, | ||
const std::string& operation, | ||
const std::optional<std::string>& rhs) { | ||
return evaluateOnce<bool>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try this version of evaluateOnce https://github.com/facebookincubator/velox/blob/main/velox/functions/prestosql/tests/utils/FunctionBaseTest.h#L241
that can override the logical type. It might reduce the casting in your expression then.
So the call would be something like
evaluateOnce<bool>(fmt::format("c0 {} c1", operation), {UUID(), UUID()}, lhs, rhs)
But of course that assumes that you are not relying on the casting to reshuffle bytes in a way that the uint128_t comparisons hold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does work, but then I would need to construct the UUIDs with boost and then convert to an int128_t to pass to the eval. I think the end result is less clear, and this way also tests the UUID casting.
rhs); | ||
}; | ||
|
||
ASSERT_EQ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just confirming that you got the same results with Presto Java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've copied these values to the Presto Java UUID tests to run and get the same results.
9d917dc
to
86278a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @BryanCutler for working with the revisions.
@Yuhta : Please can you help further review and merge. |
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@BryanCutler Can you make sure the failed tests is related or not? |
8d88f1e
to
76dc4be
Compare
76dc4be
to
fa1b203
Compare
It seems like one of the problems is the biased expression fuzzer test tries to create a hive table with IPADDRESS or UUID as a type, and they are not supported so the reference query fails. Locally, this is one of the errors I see
The test might have to be adjusted for these types, I'll look into what we can do about it. |
Another PR has failed biased expression test, not quite the same error since testing different functions https://github.com/facebookincubator/velox/actions/runs/12279548356/job/34266985156?pr=11612 |
fa1b203
to
e624cb0
Compare
…Runner Summary: Mark UUID type as unsupported input type in Presto because Presto doesn't allow creating HIVE columns of this type and requires constant literals of these types to be valid string. This follows a similar fix for IPADDRESS and IPPREFIX from facebookincubator#11820.
This adds UUID comparison functions that were previously unsupported. Functions added are <, >, <=, >=. Equal was already supported. The ordering is done lexicographically and conforms to IETF [RFC 4122]. The ordering also matches Presto Java after #[23847]. [RFC 4122]: https://datatracker.ietf.org/doc/html/rfc4122.html [23847]: prestodb/presto#23847
e624cb0
to
bf7334e
Compare
This adds binary comparison functions <, >, <=, >= to the UUID custom data type. Equality functions are already present. Added unit tests for testing a query with comparisons between UUID values.
The ordering is done lexicographically to conforms with IETF RFC 4122 https://datatracker.ietf.org/doc/html/rfc4122.html, and also matches Presto Java after prestodb/presto#23311.
Related UUID serialization fix at #11197
From #10584