Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-5953][VL] Prevent pushdown filters with unsupported data types to scan node #5954

Closed
wants to merge 9 commits into from

Conversation

WangGuangxin
Copy link
Contributor

What changes were proposed in this pull request?

Prevent pushdown filters with unsupported data types to scan node in case the scan fallback to vanilla scan operator

(Fixes: #5953)

How was this patch tested?

UT

Copy link

github-actions bot commented Jun 2, 2024

#5953

@FelixYBW
Copy link
Contributor

FelixYBW commented Jun 3, 2024

Do you mean these fields are not supported in filter pushdown?

              StructField("short_decimal_field", DecimalType(5, 2), true),
              StructField("long_decimal_field", DecimalType(32, 8), true),
              StructField("binary_field", BinaryType, true),
              StructField("timestamp_field", TimestampType, true)

@WangGuangxin
Copy link
Contributor Author

Do you mean these fields are not supported in filter pushdown?

              StructField("short_decimal_field", DecimalType(5, 2), true),
              StructField("long_decimal_field", DecimalType(32, 8), true),
              StructField("binary_field", BinaryType, true),
              StructField("timestamp_field", TimestampType, true)

@FelixYBW Before this PR, Gluten will try to pushdown them, but throw Subfield filters creation not supported for input type **, which make FileSourceScan fallback to Vanilla Spark.
After this PR, Gluten will not try to pushdown them.

BTW, short_decimal_field is supported to pushdown, the other three fields don't.

Copy link

Run Gluten Clickhouse CI

1 similar comment
Copy link

Run Gluten Clickhouse CI

@WangGuangxin
Copy link
Contributor Author

cc @PHILO-HE @rui-mo can you help review this?

"""select * from t where binary_field = '3.14'""".stripMargin
)(checkGlutenOperatorMatch[FileSourceScanExecTransformer])
runQueryAndCompare(
"""select * from t where timestamp_field = current_timestamp()""".stripMargin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have made some updates on the timestamp support. Could you help check if the pushdown of timestamp is supported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it's already supported in velox side. But as for now, Gluten can still fallback since mapToFilters doesn't support yet. I think we can support it in next PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Would you like to open an issue to track its support? Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the issue #6642

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

cpp/velox/substrait/SubstraitToVeloxPlan.h Outdated Show resolved Hide resolved
Copy link

Run Gluten Clickhouse CI

@PHILO-HE
Copy link
Contributor

@kecookier, @rui-mo

@jiangjiangtian
Copy link
Contributor

I think we should also prevent pushdown filters from singularOrLists.
https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlan.cc#L1694

@WangGuangxin
Copy link
Contributor Author

I think we should also prevent pushdown filters from singularOrLists. https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlan.cc#L1694

@jiangjiangtian maybe yes, but since there is already type check, I'm not quite sure. https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlan.cc#L2469 . I think it can be in another PR if needed?

case TypeKind::BOOLEAN:
case TypeKind::VARCHAR:
case TypeKind::ARRAY:
case TypeKind::MAP:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like most type are supported. Would you list the unsupported types instead?

static bool isPushdownSupported(TypePtr inputType);

/// Check whether the scalar function contains data type that doesn't to pushdown.
static bool canPushdownScalarFunction(
Copy link
Contributor

@rui-mo rui-mo Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we merge it with canPushdownFunction which checks if a scalar function can be pushed down?

@jiangjiangtian
Copy link
Contributor

I think we should also prevent pushdown filters from singularOrLists. https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlan.cc#L1694

@jiangjiangtian maybe yes, but since there is already type check, I'm not quite sure. https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlan.cc#L2469 . I think it can be in another PR if needed?

Yes, I think you are right. There is already type check.

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Aug 2, 2024
Copy link

github-actions bot commented Aug 2, 2024

Run Gluten Clickhouse CI

1 similar comment
Copy link

github-actions bot commented Aug 2, 2024

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Aug 3, 2024

Run Gluten Clickhouse CI

Copy link

github-actions bot commented Aug 6, 2024

Run Gluten Clickhouse CI

1 similar comment
Copy link

github-actions bot commented Aug 6, 2024

Run Gluten Clickhouse CI

@WangGuangxin WangGuangxin requested a review from rui-mo August 6, 2024 14:45
@@ -1712,19 +1733,6 @@ void SubstraitToVeloxPlanConverter::separateFilters(
for (const auto& scalarFunction : scalarFunctions) {
auto filterNameSpec = SubstraitParser::findFunctionSpec(functionMap_, scalarFunction.function_reference());
auto filterName = SubstraitParser::getNameBeforeDelimiter(filterNameSpec);
// Add all decimal filters to remaining functions because their pushdown are not supported.
if (format == dwio::common::FileFormat::ORC && scalarFunction.arguments().size() > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like previous logic only excludes decimal type for ORC format, but now we are excluding HUGEINT for all file formats. Could you confirm if it is expected?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ORC format already supports short-decimal predicate pushdown. @jiangjiangtian Can you help test this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verify that short-decimal predicate can be pushdown, as follows:
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main idea of these three types (Timestamp/VarBinary/HugeInt) in this PR is that if we don't put them in remainingFilter, then when we call mapToFilters, a exception Subfield filters creation not supported for input type '{}' in mapToFilters will throw, which will cause scan fallback.

The reason why mapToFilters don't support these three types is because there is no MultiRangeType or related type traits defined, which has nothing to do with the fileformat. cc @rui-mo @jiangjiangtian @kecookier

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means actually we don't support the pushdown of long decimal in Parquet and ORC. Is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means actually we don't support the pushdown of long decimal in Parquet and ORC. Is that correct?

@rui-mo Currently yes. But to support a HugeintMultiRange in velox side is not a big issue, I can work on it later.

cpp/velox/substrait/SubstraitToVeloxPlan.cc Outdated Show resolved Hide resolved
Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

Copy link

Run Gluten Clickhouse CI

return canPushdown;

// Check whether data type is supported or not
if (!veloxTypeList.empty() && fieldIdx < veloxTypeList.size() && !isPushdownSupported(veloxTypeList.at(fieldIdx))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a bug if 'fieldIdx >= veloxTypeList.size()'? Shall we just add a check to ensure it does not happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale stale label Sep 30, 2024
Copy link

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CORE works for Gluten Core stale stale VELOX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[VL] Prevent pushdown filters with unsupported data types to scan node
6 participants