Skip to content

Commit

Permalink
fix: Array_intersect with a single argument doesn't correctly handle …
Browse files Browse the repository at this point in the history
…nulls in dictionary encoded inner arrays (#11807)

Summary:
Pull Request resolved: #11807

The single argument flavor of array_intersect computes the intersection of the inner arrays in an array of arrays.
It currently does not correctly handle nulls in the dictionary if the inner arrays are dictionary encoded.

There are 2 bugs:
1) toElementRows requires that nulls be deselected in the SelectivityVector, array_intersect does not do this
when calling it on the inner arrays.  This means it will attempt to use the index from the DecodedVector's
indices() to get the offset and size of the null arrays, which can lead to accessing arbitrary memory as the index
may point off the end of the buffers getting the size and offset, which can further lead to writing to arbitrary
memory as we can write off the end of the SelectivityVector.
2) When actually doing the intersection we get the innerOffset and innerSize before checking if the inner array
is null. We don't end up using these values if the inner array is null, so this just results in an error when ASAN is
enabled.

To fix 1) I created separate APIs for toElementRows for use with DecodedVectors and otherwise. It forces the
user to pass in the nulls from the DecodedVector in addition to the indices to help avoid mistakes like this.

Note that this worked in PrestoHasher because it was correctly deselecting nulls in the SelectivityVector, and in
WidthBucketArray because it has default null behavior so the nulls would have been automatically deselected
by the expression eval code.

Reviewed By: spershin

Differential Revision: D66994602

fbshipit-source-id: 1c14bfaab886e692aae09b555cd6d2971e3deb72
  • Loading branch information
Kevin Wilfong authored and facebook-github-bot committed Dec 10, 2024
1 parent 1bd480e commit 4cffa4b
Show file tree
Hide file tree
Showing 5 changed files with 59 additions and 31 deletions.
33 changes: 23 additions & 10 deletions velox/functions/lib/RowsTranslationUtil.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,22 +23,20 @@
namespace facebook::velox::functions {

/// This function returns a SelectivityVector for ARRAY/MAP vectors that selects
/// all rows corresponding to the specified rows. If the base vector was
/// dictionary encoded, an optional rowMapping parameter can be used to pass
/// the dictionary indices. In this case, it is important to ensure that the
/// topLevelRows parameter has already filtered out any null rows added by the
/// dictionary encoding.
/// all rows corresponding to the specified rows. This flavor is intended for
/// use with the base vectors of decoded vectors. Use `nulls` and `rowMapping`
/// to pass in the null and index mappings from the DecodedVector.
template <typename T>
SelectivityVector toElementRows(
vector_size_t size,
const SelectivityVector& topLevelNonNullRows,
const SelectivityVector& topLevelRows,
const T* arrayBaseVector,
const vector_size_t* rowMapping = nullptr) {
const uint64_t* nulls,
const vector_size_t* rowMapping) {
VELOX_CHECK(
arrayBaseVector->encoding() == VectorEncoding::Simple::MAP ||
arrayBaseVector->encoding() == VectorEncoding::Simple::ARRAY);

auto rawNulls = arrayBaseVector->rawNulls();
auto rawSizes = arrayBaseVector->rawSizes();
auto rawOffsets = arrayBaseVector->rawOffsets();
VELOX_DEBUG_ONLY const auto sizeRange =
Expand All @@ -47,9 +45,9 @@ SelectivityVector toElementRows(
arrayBaseVector->offsets()->template asRange<vector_size_t>().end();

SelectivityVector elementRows(size, false);
topLevelNonNullRows.applyToSelected([&](vector_size_t row) {
topLevelRows.applyToSelected([&](vector_size_t row) {
auto index = rowMapping ? rowMapping[row] : row;
if (rawNulls && bits::isBitNull(rawNulls, index)) {
if (nulls && bits::isBitNull(nulls, row)) {
return;
}

Expand All @@ -64,6 +62,21 @@ SelectivityVector toElementRows(
return elementRows;
}

/// This function returns a SelectivityVector for ARRAY/MAP vectors that selects
/// all rows corresponding to the specified rows.
template <typename T>
SelectivityVector toElementRows(
vector_size_t size,
const SelectivityVector& topLevelRows,
const T* arrayBaseVector) {
return toElementRows(
size,
topLevelRows,
arrayBaseVector,
arrayBaseVector->rawNulls(),
nullptr);
}

/// Returns a buffer of vector_size_t that represents the mapping from
/// topLevelRows's element rows to its top-level rows. For example, suppose
/// `result` is the returned buffer, result[i] == j means the value at index i
Expand Down
14 changes: 9 additions & 5 deletions velox/functions/prestosql/ArrayIntersectExcept.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -146,7 +146,11 @@ DecodedVector* decodeArrayElements(
// Decode and acquire array elements vector.
auto elementsVector = baseArrayVector->elements();
*elementRows = toElementRows(
elementsVector->size(), rows, baseArrayVector, decodedVector->indices());
elementsVector->size(),
rows,
baseArrayVector,
arrayDecoder->nulls(&rows),
arrayDecoder->indices());
elementsDecoder.get()->decode(*elementsVector, *elementRows);
auto decodedElementsVector = elementsDecoder.get();
return decodedElementsVector;
Expand Down Expand Up @@ -217,10 +221,6 @@ class ArraysIntersectSingleParam : public exec::VectorFunction {
auto size = outerArray->sizeAt(idx);
bool setInitialized = false;
for (auto i = offset; i < (offset + size); ++i) {
auto innerIdx = decodedInnerArray->index(i);
auto innerOffset = innerArray->offsetAt(innerIdx);
auto innerSize = innerArray->sizeAt(innerIdx);

// 1. prepare for next iteration
indicesCursor = rawNewOffsets[row];
SetWithNull<T> intermediateSet;
Expand All @@ -234,6 +234,10 @@ class ArraysIntersectSingleParam : public exec::VectorFunction {
break;
}

auto innerIdx = decodedInnerArray->index(i);
auto innerOffset = innerArray->offsetAt(innerIdx);
auto innerSize = innerArray->sizeAt(innerIdx);

// 3. Regular array
for (auto j = innerOffset; j < (innerOffset + innerSize); ++j) {
// null element
Expand Down
8 changes: 6 additions & 2 deletions velox/functions/prestosql/WidthBucketArray.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -80,8 +80,12 @@ class WidthBucketArrayFunction : public exec::VectorFunction {
auto rawSizes = binsArray->rawSizes();
auto rawOffsets = binsArray->rawOffsets();
auto elementsVector = binsArray->elements();
auto elementsRows =
toElementRows(elementsVector->size(), rows, binsArray, bins->indices());
auto elementsRows = toElementRows(
elementsVector->size(),
rows,
binsArray,
bins->nulls(&rows),
bins->indices());
exec::LocalDecodedVector elementsHolder(
context, *elementsVector, elementsRows);

Expand Down
18 changes: 4 additions & 14 deletions velox/functions/prestosql/aggregates/PrestoHasher.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -245,14 +245,10 @@ void PrestoHasher::hash<TypeKind::ARRAY>(
BufferPtr& hashes) {
auto baseArray = vector_->base()->as<ArrayVector>();
auto indices = vector_->indices();

auto nonNullRows = SelectivityVector(rows);
if (vector_->nulls(&nonNullRows)) {
nonNullRows.deselectNulls(vector_->nulls(), 0, nonNullRows.end());
}
auto decodedNulls = vector_->nulls(&rows);

auto elementRows = functions::toElementRows(
baseArray->elements()->size(), nonNullRows, baseArray, indices);
baseArray->elements()->size(), rows, baseArray, decodedNulls, indices);

BufferPtr elementHashes =
AlignedBuffer::allocate<int64_t>(elementRows.end(), baseArray->pool());
Expand All @@ -263,7 +259,6 @@ void PrestoHasher::hash<TypeKind::ARRAY>(
auto rawOffsets = baseArray->rawOffsets();
auto rawElementHashes = elementHashes->as<int64_t>();
auto rawHashes = hashes->asMutable<int64_t>();
auto decodedNulls = vector_->nulls();

rows.applyToSelected([&](auto row) {
int64_t hash = 0;
Expand All @@ -285,15 +280,11 @@ void PrestoHasher::hash<TypeKind::MAP>(
BufferPtr& hashes) {
auto baseMap = vector_->base()->as<MapVector>();
auto indices = vector_->indices();
auto decodedNulls = vector_->nulls(&rows);
VELOX_CHECK_EQ(children_.size(), 2);

auto nonNullRows = SelectivityVector(rows);
if (vector_->nulls(&nonNullRows)) {
nonNullRows.deselectNulls(vector_->nulls(), 0, nonNullRows.end());
}

auto elementRows = functions::toElementRows(
baseMap->mapKeys()->size(), nonNullRows, baseMap, indices);
baseMap->mapKeys()->size(), rows, baseMap, decodedNulls, indices);
BufferPtr keyHashes =
AlignedBuffer::allocate<int64_t>(elementRows.end(), baseMap->pool());

Expand All @@ -309,7 +300,6 @@ void PrestoHasher::hash<TypeKind::MAP>(

auto rawSizes = baseMap->rawSizes();
auto rawOffsets = baseMap->rawOffsets();
auto decodedNulls = vector_->nulls();

rows.applyToSelected([&](auto row) {
int64_t hash = 0;
Expand Down
17 changes: 17 additions & 0 deletions velox/functions/prestosql/tests/ArrayIntersectTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -718,3 +718,20 @@ TEST_F(ArrayIntersectTest, timestampWithTimezone) {
{pack(1, 0), pack(8, 6)},
{pack(8, 10), pack(1, 11)});
}

TEST_F(ArrayIntersectTest, nullInDictionary) {
// Test that if the array elements are dictionary encoded, nulls in the
// dictionary are propagated correctly.
auto elements =
makeArrayVector<int64_t>({{1, 2, 3}, {1, 5, 6}, {7, 8, 9}, {7}});
auto indices = makeIndices({0, 1, 50, 2, 3});
auto nulls = makeNulls({false, false, true, false, false});
auto encodedElements =
BaseVector::wrapInDictionary(nulls, indices, 5, elements);

auto arrays = makeArrayVector({0, 2, 3}, encodedElements);

auto expected =
makeNullableArrayVector<int64_t>({{{1}}, std::nullopt, {{7}}});
testExpr(expected, "array_intersect(c0)", {arrays});
}

0 comments on commit 4cffa4b

Please sign in to comment.