Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: add hashcode cache in StructType #11764

Merged
merged 1 commit into from
Jan 17, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 7 additions & 1 deletion api/src/main/java/org/apache/iceberg/types/Types.java
Original file line number Diff line number Diff line change
Expand Up @@ -723,6 +723,9 @@ public int hashCode() {

public static class StructType extends NestedType {
private static final Joiner FIELD_SEP = Joiner.on(", ");
private static final int NO_HASHCODE = Integer.MIN_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was originally going to suggest we follow the same pattern we do in CharSequenceWrapper where we have two fields, the hashCode and a boolean hashIsZero flag. This way in case the hashCode is actually zero we don't recompute it.

In the current implementation, we would recompute the hashCode if it's actually Integer.MIN_VALUE but arguably not worth the complexity to handle that for this case.


private transient int hashCode = NO_HASHCODE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this transient, when I saw this PR that's the one aspect I wanted to make sure was the case. We really don't want any weird issues that will happen in a distributed setting where the cached hashcode on one struct type is different than the hashcode for the same struct type that's on a different node (which can easily happen since hashcodes across JVMS can be different)

Making it transient will avoid all those kinds of issues.


public static StructType of(NestedField... fields) {
return of(Arrays.asList(fields));
Expand Down Expand Up @@ -824,7 +827,10 @@ public boolean equals(Object o) {

@Override
public int hashCode() {
return Objects.hash(NestedField.class, Arrays.hashCode(fields));
if (hashCode == NO_HASHCODE) {
hashCode = Objects.hash(NestedField.class, Arrays.hashCode(fields));
}
return hashCode;
Comment on lines +830 to +833
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this have a multi-threaded access ? if yes can we have double check locking ? to avoid recompute

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this scenario, there is no multi-threaded access, but the method structType.hashCode might be accessed by multiple threads in other contexts.

I think the main purpose of this cache is to reduce a significant amount of redundant computation. Introducing additional complexity to completely avoid redundant computation might not be necessary, as even with multi-threaded access, the redundant computation would only occur a few times (up to the number of threads), which should be negligible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I don't think it's worth the complexity of double checked locking to avoid a little bit of redundant computation in the multi-threaded case.

}

private List<NestedField> lazyFieldList() {
Expand Down