
[WIP] Use HashStringAllocator for raw_vector rows in HashLookup #10349

Draft
wants to merge 1 commit into main from cz_fix_hash_memory_accounting

Conversation

czentgr
Collaborator

@czentgr czentgr commented Jun 28, 2024

The rows in HashLookup are stored in a raw_vector. raw_vector acquires aligned memory with a malloc-style allocation, which was found to contribute a large amount of unaccounted memory, mainly when using HashProbe and other operators.

This change allows a custom allocator to be supplied to raw_vector. By default, the current allocation and deallocation scheme is kept.

For the raw_vectors representing rows in HashLookup, the allocator used is the HashStringAllocator, which makes the memory accountable.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2024

netlify bot commented Jun 28, 2024

Deploy Preview for meta-velox canceled.

🔨 Latest commit: 2ece662
🔍 Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/671ac4cfe8b82a00082ed753

@@ -74,7 +77,8 @@ struct HashLookup {
/// Input to groupProbe and joinProbe APIs.

/// Set of row numbers of row to probe.
raw_vector<vector_size_t> rows;
raw_vector<vector_size_t, AlignedStlAllocator<vector_size_t, simd::kPadding>>
rows;

/// Hashes or value IDs for rows in 'rows'. Not aligned with 'rows'. Index is
/// the row number.
Collaborator


There are a bunch more raw_vector fields below this in HashLookup (hashes, hits, newGroups, normalizedKeys). All of these are also proportional to the number of input rows in addInput(). We should use the HashStringAllocator for them as well.

Collaborator Author


Right, I started with one. More changes are needed to specify the templated type in various places. I started on those after pushing this PR and will continue making the changes and try them out.

@@ -118,6 +132,7 @@ class raw_vector {
}
reserve(size);
size_ = size;
allocatedSize_ = size;
Contributor


Just store the byte count from allocateData in this field.

@@ -55,8 +55,11 @@ struct TableInsertPartitionInfo {

/// Contains input and output parameters for groupProbe and joinProbe APIs.
struct HashLookup {
explicit HashLookup(const std::vector<std::unique_ptr<VectorHasher>>& h)
: hashers(h) {}
explicit HashLookup(
Contributor


No need for explicit anymore

@czentgr czentgr force-pushed the cz_fix_hash_memory_accounting branch 2 times, most recently from b413411 to 33a6c05 Compare July 15, 2024 17:10
@@ -152,6 +154,9 @@ class raw_vector {
}

private:
// Constructs a copy of 'other'. See operator=. 'data_' must be copied.
raw_vector(const raw_vector<T>& other) {}
Contributor


We can use raw_vector(const raw_vector<T>& other) = delete;
Also delete the copy assignment: void operator=(const raw_vector<T>& other) = delete;

@@ -177,7 +183,8 @@ class raw_vector {

void freeData(T* data) {
if (data_) {
::free(addBytes(data, -simd::kPadding));
auto bytes = paddedSize(sizeof(T) * allocatedSize_);
Contributor


allocatedSize_ is already the padded byte size; just pass it directly to deallocate.

Collaborator Author

@czentgr czentgr Jul 17, 2024


Yes, this was actually a bug. But it revealed a problem with the interfaces of the two allocators used.

The original implementation used a byte size, but the AlignedStlAllocator (HashStringAllocator) takes the number of elements as input and does its own padding/alignment. The problem is that the result isn't a memory block that is a neat multiple of sizeof(T), and it won't necessarily match the xsimd alignment used for the raw vectors.

So I tried to do something like this to get a number of element count (and equivalent changes in the StlAlignedAllocator to change from bytes to number of elements).

  T* allocateData(int32_t size, int32_t& capacity) {
    auto bytes = paddedSize(sizeof(T) * size);
    // Integer round-up; note that std::ceil(bytes / sizeof(T)) would not
    // work because the integer division truncates before ceil is applied.
    auto numElementsOfT = (bytes + sizeof(T) - 1) / sizeof(T);
    // The allocate function takes the number of elements for the
    // AlignedStlAllocator, not the number of bytes.
    auto ptr = allocator_.allocate(numElementsOfT);
    allocatedSize_ = numElementsOfT;
    ...
}

But I'm seeing SEGVs in the fuzzers because the raw vector for the hashes appears to read an incorrect number of rows during simd execution. My guess is that the memory layout, due to the incorrect simd alignment of the rows vector, causes this issue.

Contributor


There is also a bug in StdAlignedAllocator::allocate:

    return reinterpret_cast<T*>(aligned_alloc(kAlignment, n * sizeof(T)));

Collaborator Author


Right. I did fix that when I noticed it takes the number of elements.
There is some kind of memory corruption with the memory from the HSA. I can see that the rows vector contains sequential numbers (the row numbers), but it gets corrupted when it is freed or, occasionally, when it is read. The corruption can affect anything in the memory region allocated from the HSA: I've seen rows values corrupted; in some cases it is the delta used to calculate the actual allocated pointer, and when this corrupted value is used to find the hash in the probe it fails; the HSA header can also be corrupted, etc. The region also contains non-rows content, clearly non-numeric values (some look like ASCII strings, e.g. E_ARRAY and others).
I'm trying to figure out what is happening. The memory block previously allocated outside the HSA didn't have this kind of problem.

Contributor


Are you using the raw_vector outside the life cycle of the HSA?

Collaborator Author

@czentgr czentgr Jul 18, 2024


I don't think so. Take the RowNumber operator used in one of the failing fuzzers: the lookup_ is destructed first and the table_ with the HSA after that, so this should be ok. If the HSA were invalid we would get a SEGV in allocate/deallocate during deallocation of the raw_vector in the HashLookup.
The issue doesn't always happen.

Edit: it seems to be related to the execution of restoreNextSpillPartition. Every time it's a normal RowNumber operator allocate/deallocate, it's fine. But when it comes to the above function, which also prepares for the probe, it then fails in the actual group probe.

Collaborator Author

@czentgr czentgr Jul 25, 2024


I've found the root cause. The issue is that we pass the HSA as a regular pointer while it is a shared_ptr in the RowContainer with only a single reference. Later, when restoring the spill partition data, the hash table is cleared, which also clears the RowContainer (RowContainer::clear()). If the HSA is not shared, this also clears out the HSA, and with it the raw_vector data in the HashLookup, leading to subsequent lookup problems.

I see two possibilities:

  1. Let each operator using the HashLookup use its operator pool to create a new HSA for it. This affects the HashProbe, RowNumber, and TopNRowNumber operators. Also have the HashTable use its own memory pool when creating the HashLookup object whenever needed, so they are separate from each other.
  2. Keep a shared_ptr reference to the RowContainer HSA in the HashLookup to make sure the HSA is shared (use count will be 2) and, therefore, not cleared.

Is there a preference? Option 2 would likely involve much less allocation/deallocation of a new HSA if the HashLookup is frequently destroyed/created.

Contributor


Option 2 is probably not possible because we must clear out and release the memory when spilling is triggered. If we want to continue to use the same HSA as the row container, all these raw_vectors should also be spilled and unspilled at the same time as the row container. @xiaoxmeng Do you think this is doable?

Option 1 is possible but might have a performance issue. Maybe we can use a driver-level HSA here?

void operator=(raw_vector<T>&& other) noexcept {
data_ = other.data_;
size_ = other.size_;
capacity_ = other.capacity_;
allocator_ = std::move(other.allocator_);
Contributor


We could just do void operator=(raw_vector<T>&& other) noexcept = default

@@ -74,7 +77,8 @@ struct HashLookup {
/// Input to groupProbe and joinProbe APIs.

/// Set of row numbers of row to probe.
raw_vector<vector_size_t> rows;
raw_vector<vector_size_t, AlignedStlAllocator<vector_size_t, simd::kPadding>>
Contributor


Maybe give this thing an alias since you are going to create a lot of these. Something like HsaRawVector


stale bot commented Oct 24, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Oct 24, 2024
@czentgr czentgr force-pushed the cz_fix_hash_memory_accounting branch from 33a6c05 to b53bcd6 Compare October 24, 2024 22:00
@czentgr czentgr removed the stale label Oct 24, 2024
@czentgr czentgr changed the title Use HashStringAllocator for raw_vector rows in HashLookup [WIP] Use HashStringAllocator for raw_vector rows in HashLookup Oct 24, 2024

Co-authored-by: Jimmy Lu <[email protected]>

Resolves: facebookincubator#10200
@czentgr czentgr force-pushed the cz_fix_hash_memory_accounting branch from b53bcd6 to 2ece662 Compare October 24, 2024 22:06