Add support for testing peeling in expression fuzzer (#11379)
Summary:
Pull Request resolved: #11379

This change allows wrapping input columns with a shared dictionary
layer to exercise code paths that handle dictionary peeling across
multiple inputs. Previously, only a single input could be peeled,
because each column had a different dictionary wrap. The shared wrap
lets us replicate vectors that are produced by joins, unnest, or
filters. The fuzzer first picks columns that are not already
dictionary encoded (as multiple dictionary layers will soon be phased
out), then wraps them just before passing them to evaluation. This
change also adds the ability to serialize these common wraps so that
failures are easy to reproduce using ExpressionRunnerTest.

The 'common_dictionary_wraps_generation_ratio' startup flag enables
this feature. It defaults to 0.0, which leaves the feature off.
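
To illustrate why a shared wrap matters for peeling, here is a minimal standalone sketch. It does not use Velox: `WrappedColumn`, `addRowByRow`, `addPeeled`, and `demoSum` are hypothetical names, and a dictionary wrap is modelled as a plain index list over a base column. When two inputs point at the same index list, the expression can be evaluated once per base row and the wrap re-applied to the result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dictionary-wrapped column: row i of the column
// is base[indices[i]]. This is not Velox's vector API.
struct WrappedColumn {
  std::vector<int> base;
  const std::vector<std::size_t>* indices; // Shared wrap, not owned.
};

// Baseline: decode both inputs row by row through their own wraps.
std::vector<int> addRowByRow(const WrappedColumn& a, const WrappedColumn& b) {
  std::vector<int> out;
  out.reserve(a.indices->size());
  for (std::size_t row = 0; row < a.indices->size(); ++row) {
    out.push_back(a.base[(*a.indices)[row]] + b.base[(*b.indices)[row]]);
  }
  return out;
}

// Peeled: only possible when both inputs share the same wrap. Assumes both
// base columns have the same length.
std::vector<int> addPeeled(const WrappedColumn& a, const WrappedColumn& b) {
  if (a.indices != b.indices) {
    return addRowByRow(a, b); // Wraps differ, so nothing can be peeled.
  }
  // Evaluate once per base row.
  std::vector<int> baseSums(a.base.size());
  for (std::size_t i = 0; i < a.base.size(); ++i) {
    baseSums[i] = a.base[i] + b.base[i];
  }
  // Re-apply the shared wrap to the result.
  std::vector<int> out;
  out.reserve(a.indices->size());
  for (std::size_t index : *a.indices) {
    out.push_back(baseSums[index]);
  }
  return out;
}

// Small example: two columns wrapped with the same index list.
std::vector<int> demoSum(bool peel) {
  const std::vector<std::size_t> shared = {2, 2, 0, 1};
  const WrappedColumn a{{1, 2, 3}, &shared};
  const WrappedColumn b{{10, 20, 30}, &shared};
  return peel ? addPeeled(a, b) : addRowByRow(a, b);
}
```

Both paths should produce the same values; the peeled path simply touches each base row once, which is the kind of code path a fuzzer can only reach when several inputs carry an identical wrap.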

Reviewed By: kagamiori

Differential Revision: D64436877

fbshipit-source-id: 04980136423eaec5b7d0b185c8e007a73e9bac41
Bikramjeet Vig authored and facebook-github-bot committed Nov 9, 2024
1 parent a33e8d7 commit 4a79bc5
Showing 16 changed files with 413 additions and 190 deletions.
2 changes: 1 addition & 1 deletion velox/docs/develop/testing/fuzzer.rst
@@ -369,7 +369,7 @@ ExpressionRunner supports the following flags:

* ``--complex_constant_path`` optional path to complex constants that aren't accurately expressible in SQL (Array, Map, Structs, ...). This is used with SQL file to reproduce the exact expression, not needed when the expression doesn't contain complex constants.

* ``--lazy_column_list_path`` optional path for the file stored on-disk which contains a vector of column indices that specify which columns of the input row vector should be wrapped in lazy. This is used when the failing test included input columns that were lazy vector.
* ``--input_row_metadata_path`` optional path for the file stored on-disk which contains a struct containing input row metadata. This includes columns in the input row vector to be wrapped in a lazy vector and/or dictionary encoded. It may also contain a dictionary peel for columns requiring dictionary encoding. This is used when the failing test included input columns that were lazy vectors and/or had columns wrapped with a common dictionary wrap.

* ``--result_path`` optional path to result vector that was created by the Fuzzer. Result vector is used to reproduce cases where Fuzzer passes dirty vectors to expression evaluation as a result buffer. This ensures that functions are implemented correctly, taking into consideration dirty result buffer.

58 changes: 39 additions & 19 deletions velox/expression/fuzzer/ExpressionFuzzerVerifier.cpp
@@ -115,20 +115,39 @@ ExpressionFuzzerVerifier::ExpressionFuzzerVerifier(
}
}

std::vector<int> ExpressionFuzzerVerifier::generateLazyColumnIds(
InputRowMetadata ExpressionFuzzerVerifier::generateInputRowMetadata(
const RowVectorPtr& rowVector,
VectorFuzzer& vectorFuzzer) {
std::vector<int> columnsToWrapInLazy;
if (options_.lazyVectorGenerationRatio > 0) {
for (int idx = 0; idx < rowVector->childrenSize(); idx++) {
VELOX_CHECK_NOT_NULL(rowVector->childAt(idx));
if (vectorFuzzer.coinToss(options_.lazyVectorGenerationRatio)) {
columnsToWrapInLazy.push_back(
vectorFuzzer.coinToss(0.8) ? idx : -1 * idx);
}
InputRowMetadata inputRowMetadata;
if (options_.commonDictionaryWrapRatio <= 0 &&
options_.lazyVectorGenerationRatio <= 0) {
return inputRowMetadata;
}

bool wrapInCommonDictionary =
vectorFuzzer.coinToss(options_.commonDictionaryWrapRatio);
for (int idx = 0; idx < rowVector->childrenSize(); idx++) {
const auto& child = rowVector->childAt(idx);
VELOX_CHECK_NOT_NULL(child);
if (child->encoding() != VectorEncoding::Simple::DICTIONARY &&
wrapInCommonDictionary) {
inputRowMetadata.columnsToWrapInCommonDictionary.push_back(idx);
}
if (vectorFuzzer.coinToss(options_.lazyVectorGenerationRatio)) {
inputRowMetadata.columnsToWrapInLazy.push_back(
vectorFuzzer.coinToss(0.8) ? idx : -1 * idx);
}
}
// Skip wrapping in common dictionary if there is only one column.
if (inputRowMetadata.columnsToWrapInCommonDictionary.size() > 1) {
auto inputSize = rowVector->size();
inputRowMetadata.commonDictionaryIndices =
vectorFuzzer.fuzzIndices(inputSize, inputSize);
inputRowMetadata.commonDictionaryNulls = vectorFuzzer.fuzzNulls(inputSize);
} else {
inputRowMetadata.columnsToWrapInCommonDictionary.clear();
}
return columnsToWrapInLazy;
return inputRowMetadata;
}
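
The selection logic in generateInputRowMetadata can be sketched without Velox as follows. This is an illustration under assumptions, not the fuzzer's code: `ColumnSelection` and `selectColumns` are hypothetical names, `std::mt19937` stands in for VectorFuzzer's coin toss, each column is reduced to a flag saying whether it is already dictionary encoded, and the index and null buffers are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical, trimmed-down counterpart of InputRowMetadata (no buffers).
struct ColumnSelection {
  std::vector<int> columnsToWrapInLazy;
  std::vector<int> columnsToWrapInCommonDictionary;
};

ColumnSelection selectColumns(
    const std::vector<bool>& isDictionaryEncoded,
    double commonDictionaryWrapRatio,
    double lazyVectorGenerationRatio,
    unsigned seed) {
  ColumnSelection selection;
  // Both features disabled: nothing to select.
  if (commonDictionaryWrapRatio <= 0 && lazyVectorGenerationRatio <= 0) {
    return selection;
  }

  std::mt19937 rng(seed);
  auto coinToss = [&rng](double probability) {
    // Clamp so the distribution always receives a valid probability.
    const double p = std::min(1.0, std::max(0.0, probability));
    return std::bernoulli_distribution(p)(rng);
  };

  // A single toss decides whether this input row gets a common wrap at all.
  const bool wrapInCommonDictionary = coinToss(commonDictionaryWrapRatio);
  for (std::size_t idx = 0; idx < isDictionaryEncoded.size(); ++idx) {
    const int column = static_cast<int>(idx);
    // Only columns without an existing dictionary layer are eligible.
    if (!isDictionaryEncoded[idx] && wrapInCommonDictionary) {
      selection.columnsToWrapInCommonDictionary.push_back(column);
    }
    if (coinToss(lazyVectorGenerationRatio)) {
      // A negative entry marks a lazy column that is loaded ahead of time.
      selection.columnsToWrapInLazy.push_back(
          coinToss(0.8) ? column : -column);
    }
  }

  // A wrap on fewer than two columns is not "common", so drop it.
  if (selection.columnsToWrapInCommonDictionary.size() <= 1) {
    selection.columnsToWrapInCommonDictionary.clear();
  }
  return selection;
}
```

With a wrap ratio of 1.0, every column that is not yet dictionary encoded is selected, and the list is cleared again when only one such column exists, matching the "skip wrapping if there is only one column" rule in the change.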

void ExpressionFuzzerVerifier::reSeed() {
@@ -233,7 +252,7 @@ void ExpressionFuzzerVerifier::retryWithTry(
std::vector<core::TypedExprPtr> plans,
const RowVectorPtr& rowVector,
const VectorPtr& resultVector,
const std::vector<int>& columnsToWrapInLazy) {
const InputRowMetadata& inputRowMetadata) {
// Wrap each expression tree with 'try'.
std::vector<core::TypedExprPtr> tryPlans;
for (auto& plan : plans) {
@@ -252,7 +271,7 @@ void ExpressionFuzzerVerifier::retryWithTry(
std::nullopt,
resultVector ? BaseVector::copy(*resultVector) : nullptr,
false, // canThrow
columnsToWrapInLazy);
inputRowMetadata);
} catch (const std::exception&) {
if (options_.findMinimalSubexpression) {
test::computeMinimumSubExpression(
@@ -261,7 +280,7 @@ void ExpressionFuzzerVerifier::retryWithTry(
plans,
rowVector,
std::nullopt,
columnsToWrapInLazy);
inputRowMetadata);
}
throw;
}
@@ -286,7 +305,7 @@ void ExpressionFuzzerVerifier::retryWithTry(
noErrorRows,
resultVector ? BaseVector::copy(*resultVector) : nullptr,
false, // canThrow
columnsToWrapInLazy);
inputRowMetadata);
} catch (const std::exception&) {
if (options_.findMinimalSubexpression) {
test::computeMinimumSubExpression(
@@ -295,7 +314,7 @@ void ExpressionFuzzerVerifier::retryWithTry(
plans,
rowVector,
noErrorRows,
columnsToWrapInLazy);
inputRowMetadata);
}
throw;
}
@@ -358,7 +377,8 @@ void ExpressionFuzzerVerifier::go() {

auto rowVector = fuzzInputWithRowNumber(*vectorFuzzer_, inputType);

auto columnsToWrapInLazy = generateLazyColumnIds(rowVector, *vectorFuzzer_);
InputRowMetadata inputRowMetadata =
generateInputRowMetadata(rowVector, *vectorFuzzer_);

auto resultVectors = generateResultVectors(plans);
ResultOrError result;
@@ -370,7 +390,7 @@ void ExpressionFuzzerVerifier::go() {
std::nullopt,
resultVectors ? BaseVector::copy(*resultVectors) : nullptr,
true, // canThrow
columnsToWrapInLazy);
inputRowMetadata);
} catch (const std::exception&) {
if (options_.findMinimalSubexpression) {
test::computeMinimumSubExpression(
@@ -379,7 +399,7 @@ void ExpressionFuzzerVerifier::go() {
plans,
rowVector,
std::nullopt,
columnsToWrapInLazy);
inputRowMetadata);
}
throw;
}
@@ -396,7 +416,7 @@ void ExpressionFuzzerVerifier::go() {
!result.unsupportedInputUncatchableError) {
LOG(INFO)
<< "Both paths failed with compatible exceptions. Retrying expression using try().";
retryWithTry(plans, rowVector, resultVectors, columnsToWrapInLazy);
retryWithTry(plans, rowVector, resultVectors, inputRowMetadata);
}

LOG(INFO) << "==============================> Done with iteration " << i;
25 changes: 19 additions & 6 deletions velox/expression/fuzzer/ExpressionFuzzerVerifier.h
@@ -24,6 +24,7 @@
#include "velox/expression/fuzzer/FuzzerToolkit.h"
#include "velox/expression/tests/ExpressionVerifier.h"
#include "velox/functions/FunctionRegistry.h"
#include "velox/vector/VectorSaver.h"
#include "velox/vector/fuzzer/VectorFuzzer.h"
#include "velox/vector/tests/utils/VectorMaker.h"

@@ -95,6 +96,14 @@ class ExpressionFuzzerVerifier {
// 1).
double lazyVectorGenerationRatio = 0.0;

// Specifies the probability with which columns in the input row vector will
// be selected to be wrapped in a common dictionary layer (expressed as
// double from 0 to 1). Only columns that are not already dictionary encoded
// will be selected as eventually only one dictionary wrap will be allowed
// so additional wrap can be folded into the existing one. This is to
// replicate inputs coming from a filter, union, or join.
double commonDictionaryWrapRatio = 0.0;

// This sets an upper limit on the number of expression trees to generate
// per step. These trees would be executed in the same ExprSet and can
// re-use already generated columns and subexpressions (if re-use is
@@ -164,7 +173,7 @@ class ExpressionFuzzerVerifier {
std::vector<core::TypedExprPtr> plans,
const RowVectorPtr& rowVector,
const VectorPtr& resultVectors,
const std::vector<int>& columnsToWrapInLazy);
const InputRowMetadata& inputRowMetadata);

/// If --duration_sec > 0, check if we expired the time budget. Otherwise,
/// check if we expired the number of iterations (--steps).
@@ -180,11 +189,15 @@ class ExpressionFuzzerVerifier {
/// proportionOfTimesSelected numProcessedRows.
void logStats();

// Randomly pick columns from the input row vector to wrap in lazy.
// Negative column indices represent lazy vectors that have been preloaded
// before feeding them to the evaluator. This list is sorted on the absolute
// value of the entries.
std::vector<int> generateLazyColumnIds(
// Generates InputRowMetadata which contains the following:
// 1. Randomly picked columns from the input row vector to wrap
// in lazy. Negative column indices represent lazy vectors that have been
// preloaded before feeding them to the evaluator.
// 2. Randomly picked columns (2 or more) from the input row vector to
// wrap in a common dictionary layer. Only columns not already dictionary
// encoded are picked.
// Note: These lists are sorted on the absolute value of the entries.
InputRowMetadata generateInputRowMetadata(
const RowVectorPtr& rowVector,
VectorFuzzer& vectorFuzzer);
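
The signed encoding of lazy column entries described in the comment above can be decoded as in the sketch below. `LazyColumn` and `decodeLazyEntry` are hypothetical helpers for illustration only; how the verifier itself consumes these entries is not shown in this diff.

```cpp
#include <cstdlib>

// Hypothetical decoded form of one entry of columnsToWrapInLazy.
struct LazyColumn {
  int column; // Index into the input row vector.
  bool preloaded; // True if the lazy vector is loaded before evaluation.
};

// The absolute value is the column index; a negative sign marks "preloaded".
LazyColumn decodeLazyEntry(int entry) {
  return LazyColumn{std::abs(entry), entry < 0};
}
```

One consequence of this encoding is that -1 * 0 is still 0, so an entry for column 0 always decodes as not preloaded.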

10 changes: 10 additions & 0 deletions velox/expression/fuzzer/FuzzerRunner.cpp
@@ -65,6 +65,14 @@ DEFINE_double(
"vector will be selected to be wrapped in lazy encoding "
"(expressed as double from 0 to 1).");

DEFINE_double(
common_dictionary_wraps_generation_ratio,
0.0,
"Specifies the probability with which columns in the input row "
"vector will be selected to be wrapped in a common dictionary wrap "
"(expressed as double from 0 to 1). Only columns not already encoded "
"will be considered.");

DEFINE_int32(
max_expression_trees_per_step,
1,
@@ -207,6 +215,8 @@ ExpressionFuzzerVerifier::Options getExpressionFuzzerVerifierOptions(
opts.reproPersistPath = FLAGS_repro_persist_path;
opts.persistAndRunOnce = FLAGS_persist_and_run_once;
opts.lazyVectorGenerationRatio = FLAGS_lazy_vector_generation_ratio;
opts.commonDictionaryWrapRatio =
FLAGS_common_dictionary_wraps_generation_ratio;
opts.maxExpressionTreesPerStep = FLAGS_max_expression_trees_per_step;
opts.vectorFuzzerOptions = getVectorFuzzerOptions();
opts.expressionFuzzerOptions = getExpressionFuzzerOptions(
77 changes: 77 additions & 0 deletions velox/expression/fuzzer/FuzzerToolkit.cpp
@@ -14,9 +14,30 @@
* limitations under the License.
*/
#include "velox/expression/fuzzer/FuzzerToolkit.h"
#include "velox/vector/VectorSaver.h"

namespace facebook::velox::fuzzer {

namespace {
template <typename T>
void saveStdVector(const std::vector<T>& list, std::ostream& out) {
// Size of the vector
size_t size = list.size();
out.write((char*)&(size), sizeof(size));
out.write(
reinterpret_cast<const char*>(list.data()), list.size() * sizeof(T));
}

template <typename T>
std::vector<T> restoreStdVector(std::istream& in) {
size_t size;
in.read((char*)&size, sizeof(size));
std::vector<T> vec(size);
in.read(reinterpret_cast<char*>(vec.data()), size * sizeof(T));
return vec;
}
} // namespace

std::string CallableSignature::toString() const {
std::string buf = name;
buf.append("( ");
@@ -137,4 +158,60 @@ void compareVectors(
LOG(INFO) << "Two vectors match.";
}

RowVectorPtr applyCommonDictionaryLayer(
const RowVectorPtr& rowVector,
const InputRowMetadata& inputRowMetadata) {
if (inputRowMetadata.columnsToWrapInCommonDictionary.empty()) {
return rowVector;
}
auto size = rowVector->size();
auto& nulls = inputRowMetadata.commonDictionaryNulls;
auto& indices = inputRowMetadata.commonDictionaryIndices;
if (nulls) {
VELOX_CHECK_LE(bits::nbytes(size), nulls->size());
}
VELOX_CHECK_LE(size, indices->size() / sizeof(vector_size_t));
std::vector<VectorPtr> newInputs;
int listIndex = 0;
auto& columnsToWrap = inputRowMetadata.columnsToWrapInCommonDictionary;
for (int idx = 0; idx < rowVector->childrenSize(); idx++) {
auto& child = rowVector->childAt(idx);
VELOX_CHECK_NOT_NULL(child);
if (listIndex < columnsToWrap.size() && idx == columnsToWrap[listIndex]) {
newInputs.push_back(
BaseVector::wrapInDictionary(nulls, indices, size, child));
listIndex++;
} else {
newInputs.push_back(child);
}
}
return std::make_shared<RowVector>(
rowVector->pool(), rowVector->type(), nullptr, size, newInputs);
}
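
The cursor walk used by applyCommonDictionaryLayer can be shown on plain containers. The sketch below is an assumption-laden stand-in: `Column`, `applyCommonWrap`, and `demoWrap` are hypothetical, columns are vectors of optional ints, and the wrap is materialized eagerly, whereas the function above builds a dictionary vector over the child without copying values.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical flat column: std::nullopt represents a null row.
using Column = std::vector<std::optional<int>>;

// Applies one shared (indices, nulls) wrap to the listed columns.
// 'columnsToWrap' must be strictly increasing, which lets a single cursor
// ('listIndex') walk it in step with the column loop.
std::vector<Column> applyCommonWrap(
    const std::vector<Column>& columns,
    const std::vector<int>& columnsToWrap,
    const std::vector<std::size_t>& indices,
    const std::vector<bool>& nulls) {
  std::vector<Column> result;
  result.reserve(columns.size());
  std::size_t listIndex = 0;
  for (std::size_t idx = 0; idx < columns.size(); ++idx) {
    const bool wrapThis = listIndex < columnsToWrap.size() &&
        static_cast<int>(idx) == columnsToWrap[listIndex];
    if (!wrapThis) {
      result.push_back(columns[idx]); // Left untouched.
      continue;
    }
    Column wrapped;
    wrapped.reserve(indices.size());
    for (std::size_t row = 0; row < indices.size(); ++row) {
      if (nulls[row]) {
        wrapped.push_back(std::nullopt); // Null added by the wrap itself.
      } else {
        wrapped.push_back(columns[idx][indices[row]]);
      }
    }
    result.push_back(wrapped);
    ++listIndex;
  }
  return result;
}

// Small example: wrap columns 0 and 2 of a three-column input.
std::vector<Column> demoWrap() {
  const std::vector<Column> columns = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
  return applyCommonWrap(columns, {0, 2}, {2, 0, 1}, {false, true, false});
}
```

Because both wrapped columns receive the same indices and nulls, they line up row for row afterwards, which is what makes the wrap peelable as a unit.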

void InputRowMetadata::saveToFile(const char* filePath) const {
std::ofstream outputFile(filePath, std::ofstream::binary);
saveStdVector(columnsToWrapInLazy, outputFile);
saveStdVector(columnsToWrapInCommonDictionary, outputFile);
writeOptionalBuffer(commonDictionaryIndices, outputFile);
writeOptionalBuffer(commonDictionaryNulls, outputFile);
outputFile.close();
}

InputRowMetadata InputRowMetadata::restoreFromFile(
const char* filePath,
memory::MemoryPool* pool) {
InputRowMetadata ret;
std::ifstream in(filePath, std::ifstream::binary);
ret.columnsToWrapInLazy = restoreStdVector<int>(in);
if (in.peek() != EOF) {
// this allows reading old files that only saved columnsToWrapInLazy.
ret.columnsToWrapInCommonDictionary = restoreStdVector<int>(in);
ret.commonDictionaryIndices = readOptionalBuffer(in, pool);
ret.commonDictionaryNulls = readOptionalBuffer(in, pool);
}
in.close();
return ret;
}

} // namespace facebook::velox::fuzzer
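
The on-disk layout written by saveToFile, and the backward-compatibility check in restoreFromFile, can be tried out in isolation with string streams. This is a standalone sketch: `writeVector`, `readVector`, `SavedLists`, `saveLists`, and `restoreLists` are hypothetical names, only the two index lists are modelled, and the optional buffers are omitted.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Layout: element count as a raw size_t, followed by the raw elements.
template <typename T>
void writeVector(const std::vector<T>& list, std::ostream& out) {
  const std::size_t size = list.size();
  out.write(reinterpret_cast<const char*>(&size), sizeof(size));
  out.write(reinterpret_cast<const char*>(list.data()), size * sizeof(T));
}

template <typename T>
std::vector<T> readVector(std::istream& in) {
  std::size_t size = 0;
  in.read(reinterpret_cast<char*>(&size), sizeof(size));
  std::vector<T> list(size);
  in.read(reinterpret_cast<char*>(list.data()), size * sizeof(T));
  return list;
}

// Hypothetical subset of InputRowMetadata: the two index lists only.
struct SavedLists {
  std::vector<int> lazy;
  std::vector<int> common;
};

// 'legacyFormat' mimics an old file that only contains the lazy list.
std::string saveLists(const SavedLists& lists, bool legacyFormat) {
  std::ostringstream out;
  writeVector(lists.lazy, out);
  if (!legacyFormat) {
    writeVector(lists.common, out);
  }
  return out.str();
}

SavedLists restoreLists(const std::string& bytes) {
  std::istringstream in(bytes);
  SavedLists lists;
  lists.lazy = readVector<int>(in);
  // Old files end here; only read the second list if more bytes follow.
  if (in.peek() != std::char_traits<char>::eof()) {
    lists.common = readVector<int>(in);
  }
  return lists;
}
```

Since the raw bytes of size_t and int are written directly, a file in this layout depends on the writing platform's integer sizes and byte order, which is usually acceptable for a local repro file.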
31 changes: 31 additions & 0 deletions velox/expression/fuzzer/FuzzerToolkit.h
@@ -114,4 +114,35 @@ void compareVectors(
const std::string& leftName = "left",
const std::string& rightName = "right",
const std::optional<SelectivityVector>& rows = std::nullopt);

struct InputRowMetadata {
// Column indices to wrap in LazyVector, sorted by absolute value. A negative
// index marks a lazy vector that is preloaded before evaluation.
std::vector<int> columnsToWrapInLazy;

// Column indices to wrap in a common dictionary layer (in a strictly
// increasing order)
std::vector<int> columnsToWrapInCommonDictionary;

// Dictionary indices and nulls for the common dictionary layer. Buffers are
// null if no columns are specified in `columnsToWrapInCommonDictionary`.
BufferPtr commonDictionaryIndices;
BufferPtr commonDictionaryNulls;

bool empty() const {
return columnsToWrapInLazy.empty() &&
columnsToWrapInCommonDictionary.empty();
}

void saveToFile(const char* filePath) const;
static InputRowMetadata restoreFromFile(
const char* filePath,
memory::MemoryPool* pool);
};

// Wraps the columns in the row vector with a common dictionary layer. The
// column indices to wrap and the wrap itself is specified in
// `inputRowMetadata`.
RowVectorPtr applyCommonDictionaryLayer(
const RowVectorPtr& rowVector,
const InputRowMetadata& inputRowMetadata);
} // namespace facebook::velox::fuzzer