Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: add hashcode cache in StructType #11764

Merged
merged 1 commit into from
Jan 17, 2025
Merged

Conversation

wzx140
Copy link
Contributor

@wzx140 wzx140 commented Dec 12, 2024

close #11763

@github-actions github-actions bot added the API label Dec 12, 2024
@singhpk234
Copy link
Contributor

Q: does it completely mitigate the flatness observed ? can you please attach the flame graph now ?
Interesting find @wzx140

@wzx140
Copy link
Contributor Author

wzx140 commented Dec 13, 2024

Q: does it completely mitigate the flatness observed ? can you please attach the flame graph now ? Interesting find @wzx140

@singhpk234

Performance Comparison After Adding Cache

Metric Before Adding Cache After Adding Cache
Pre-execution Preparation Time 154s 18s
Scan Spec Time 142s 5s

Pre-execution Preparation Time: the time interval from the first table load to the start of the first stage execution
Scan Spec Time: added a timer to the method SparkPartitioningAwareScan#specs

Flame Graph
Before Adding Cache: https://drive.google.com/file/d/1o68Q6n1c-BD7xwfM7ETO6fjbC3jjrlOr/view?usp=drive_link
After Adding Cache: https://drive.google.com/file/d/1YnGnEZ06Es7xs4fGVIZgnjDFS3L4cSR2/view?usp=drive_link

@Fokko Fokko requested a review from aokolnychyi December 13, 2024 08:10
Comment on lines +830 to +833
if (hashCode == NO_HASHCODE) {
hashCode = Objects.hash(NestedField.class, Arrays.hashCode(fields));
}
return hashCode;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this have a multi-threaded access ? if yes can we have double check locking ? to avoid recompute

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this scenario, there is no multi-threaded access, but the method structType.hashCode might be accessed by multiple threads in other contexts.

I think the main purpose of this cache is to reduce a significant amount of redundant computation. Introducing additional complexity to completely avoid redundant computation might not be necessary, as even with multi-threaded access, the redundant computation would only occur a few times (up to the number of threads), which should be negligible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I don't think it's worth the complexity of double checked locking to avoid a little bit of redundant computation in the multi-threaded case.

@singhpk234
Copy link
Contributor

Pre-execution Preparation Time: the time interval from the first table load to the start of the first stage execution
Scan Spec Time: added a timer to the method SparkPartitioningAwareScan#specs

sounds really promising, thank you for sharing @wzx140

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wzx140 this is a great finding! I'll hold before merging in case @aokolnychyi had any feedback.

Comment on lines +830 to +833
if (hashCode == NO_HASHCODE) {
hashCode = Objects.hash(NestedField.class, Arrays.hashCode(fields));
}
return hashCode;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I don't think it's worth the complexity of double checked locking to avoid a little bit of redundant computation in the multi-threaded case.

@@ -723,6 +723,9 @@ public int hashCode() {

public static class StructType extends NestedType {
private static final Joiner FIELD_SEP = Joiner.on(", ");
private static final int NO_HASHCODE = Integer.MIN_VALUE;

private transient int hashCode = NO_HASHCODE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this transient, when I saw this PR that's the one aspect I wanted to make sure was the case. We really don't want any weird issues that will happen in a distributed setting where the cached hashcode on one struct type is different than the hashcode for the same struct type that's on a different node (which can easily happen since hashcodes across JVMS can be different)

Making it transient will avoid all those kinds of issues.

@amogh-jahagirdar amogh-jahagirdar dismissed their stale review December 16, 2024 15:54

Uninitialized hashcode value

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion, still looks good to me, I thought I spotted an issue in how the default hashCode is initialized but not really worth addressing imo.

@@ -723,6 +723,9 @@ public int hashCode() {

public static class StructType extends NestedType {
private static final Joiner FIELD_SEP = Joiner.on(", ");
private static final int NO_HASHCODE = Integer.MIN_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was originally going to suggest we follow the same pattern we do in CharSequenceWrapper where we have two fields, the hashCode and a boolean hashIsZero flag. This way in case the hashCode is actually zero we don't recompute it.

In the current implementation, we would recompute the hashCode if it's actually Integer.MIN_VALUE but arguably not worth the complexity to handle that for this case.

@amogh-jahagirdar
Copy link
Contributor

I'm going to go ahead and merge this, it's been ready for a while and didn't want to miss this change going in for the next release! Thank you @wzx140 and thank you @singhpk234 @Fokko for reviewing.

@amogh-jahagirdar amogh-jahagirdar merged commit bed7c33 into apache:main Jan 17, 2025
50 checks passed
Fokko added a commit to Fokko/iceberg that referenced this pull request Jan 19, 2025
amogh-jahagirdar pushed a commit that referenced this pull request Jan 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance Regression Caused by Schema Hash in Spark PartitionPruning with Wide Tables
4 participants