Introduce startsWith Predicate #327

Merged
merged 7 commits into from
Aug 12, 2019

Conversation

sujithjay
Contributor

sujithjay commented Jul 29, 2019

Implements #31.

@sujithjay
Contributor Author

@renato2099 & @Liorba had previously submitted PRs for this issue. I am picking up their work and trying to see if it can be shepherded into master.

@sujithjay
Contributor Author

@rdblue Could you please take a look? Thank you.

@@ -113,6 +116,7 @@ public String toString() {
            predicate.op(), name, apply(predicate.literal().value()));
      // case IN:
      //   return Expressions.predicate();
      case STARTS_WITH:
Contributor Author

I am not sure a startsWith predicate makes sense for Bucket transforms, so I am leaving it to be handled by the default case. WDYT?

Contributor

Agreed. Transformed values tell us nothing about whether the original predicate is true or not.
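
To make the point about bucket transforms concrete, here is a minimal sketch. It uses `String.hashCode()` modulo a bucket count as a stand-in hash, which is an assumption for illustration only and is not Iceberg's actual bucket function; `BucketSketch` and `bucket` are hypothetical names.

```java
// Hypothetical stand-in for a bucket transform; NOT Iceberg's real hash function.
final class BucketSketch {
  private BucketSketch() {}

  // Maps a value to one of numBuckets buckets using String.hashCode().
  static int bucket(String value, int numBuckets) {
    return Math.floorMod(value.hashCode(), numBuckets);
  }

  public static void main(String[] args) {
    // "aba", "abb" and "abc" all start with "ab" but need not share a bucket,
    // while "b" does not start with "ab" and may still share a bucket with one
    // of them. A bucket number therefore says nothing about startsWith("ab").
    for (String value : new String[] {"aba", "abb", "abc", "b"}) {
      System.out.println(value + " -> bucket " + bucket(value, 4));
    }
  }
}
```

Because hashing scatters values that share a prefix and can group values that do not, neither an inclusive nor a strict projection can be derived, which is consistent with leaving `STARTS_WITH` to the default case.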

sujithjay force-pushed the ICEBERG-31 branch 6 times, most recently from a3c38b7 to 3180cec on July 29, 2019 21:22
assertProjectionInclusive(spec, startsWith("someStringCol", "ababab"), "abab");

assertProjectionStrict(spec, startsWith("someStringCol", "ab"), "ab");
assertProjectionStrict(spec, startsWith("someStringCol", "abab"), "abab");
Contributor

I think this should also test strict projection for startsWith("someStringCol", "ababab") since that doesn't have a strict projection.
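
A small counterexample may help show why a prefix longer than the truncate width has no strict projection. The sketch below assumes a truncate width of 4 (suggested by the test values above, but still an assumption) and models truncation on plain Java strings; `StrictProjectionCounterexample` and `truncate` are hypothetical names, not Iceberg code.

```java
// Sketch: two rows that share a partition value but disagree on the row-level
// predicate. The width of 4 and all names here are assumptions.
final class StrictProjectionCounterexample {
  private StrictProjectionCounterexample() {}

  // Models a string truncate transform: keep at most `width` characters.
  static String truncate(String value, int width) {
    return value.length() <= width ? value : value.substring(0, width);
  }

  public static void main(String[] args) {
    String prefix = "ababab";
    String matching = "abababcd";   // satisfies startsWith(prefix)
    String nonMatching = "ababzz";  // does not satisfy startsWith(prefix)

    // Both rows are stored under the same partition value "abab".
    System.out.println(truncate(matching, 4).equals(truncate(nonMatching, 4))); // prints true
    // Only one of them matches, so no predicate on the partition value can
    // guarantee that every row in the partition matches.
    System.out.println(matching.startsWith(prefix));    // prints true
    System.out.println(nonMatching.startsWith(prefix)); // prints false
  }
}
```

An inclusive projection only has to hold whenever a row matches, so `startsWith` on the truncated prefix `"abab"` is still valid there; a strict projection has to imply that all rows match, which the partition value cannot promise.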

      return null;
    switch (predicate.op()) {
      case STARTS_WITH:
        if (predicate.literal().value().length() <= width()) {
Contributor

For the case where the length of the literal equals width, is there value in converting to an equality predicate?

Contributor Author

sujithjay commented Aug 7, 2019

I have converted it into an equality predicate.

@rdblue
Contributor

rdblue commented Aug 3, 2019

@sujithjay, nice work! I think this is about ready to go. I just had a few comments.

@moulimukherjee, could you take a look at this as well, since you're familiar with the projections?

renato2099 and others added 3 commits August 7, 2019 20:49
Co-authored-by: Renato Marroquin
Co-authored-by: Lior Baber
Co-authored-by: Sujith Jay Nair
        } else if (predicate.literal().value().length() == width()) {
          return Expressions.equal(name, predicate.literal().value());
        } else {
          return null;
Contributor Author

Does returning null here still make sense, in light of recent changes? Or should it be returning ProjectionUtil.truncateArrayStrict(name, predicate, this) ?

Contributor

Returning ProjectionUtil.truncateArrayStrict(name, predicate, this) would still return null anyway, since it can't handle the STARTS_WITH case. So I think it makes sense to return null here.
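
The three branches discussed in this thread can be summarized in a standalone sketch. It models the strict projection on plain strings and returns a textual predicate instead of an Iceberg expression; `StrictStartsWithSketch`, `projectStrict`, and the output format are hypothetical and only mirror the shape of the snippet above.

```java
// Sketch of the strict-projection rule for startsWith under a truncate
// transform, modelled on plain strings. All names here are hypothetical.
final class StrictStartsWithSketch {
  private StrictStartsWithSketch() {}

  // Returns a textual partition predicate, or null when no strict projection exists.
  static String projectStrict(String name, String prefix, int width) {
    if (prefix.length() < width) {
      // Truncation keeps the first `width` characters, so a shorter prefix
      // can be tested directly on the truncated value.
      return name + " startsWith \"" + prefix + "\"";
    } else if (prefix.length() == width) {
      // Truncated values are at most `width` characters long, so starting
      // with a prefix of exactly `width` characters means being equal to it.
      return name + " == \"" + prefix + "\"";
    } else {
      // A longer prefix depends on characters that truncation discarded.
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(projectStrict("someStringCol", "ab", 4));
    System.out.println(projectStrict("someStringCol", "abab", 4));
    System.out.println(projectStrict("someStringCol", "ababab", 4)); // prints null
  }
}
```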

}

@Test
public void assertTruncateString() {
Contributor

Can some STARTS_WITH tests also be added to TestTruncatesResiduals.testStringTruncateTransformResiduals()?

Contributor Author

ExpressionVisitors for startsWith currently throw an UnsupportedOperationException (cf. the discussion thread).

Given that, tests in TestTruncatesResiduals with STARTS_WITH would not add much value. What do you think?

Contributor

@sujithjay, I agree with @moulimukherjee. But you're right that currently an UnsupportedOperationException will be thrown. That's a problem because it will prevent creating splits for a scan when there is a startsWith predicate.

The fix is to implement startsWith in ResidualEvaluator to support residual calculations and to add a couple of tests.

Contributor

+1 for ^ (implement startsWith in ResidualEvaluator and add tests)

@moulimukherjee
Contributor

Some nits, but the changes look good to me 👍.

// starts with
assertResidualValue(spec, startsWith("value", "bcd"), "ab", Expression.Operation.FALSE);
assertResidualPredicate(spec, startsWith("value", "bcd"), "bc");
assertResidualValue(spec, startsWith("value", "bcd"), "cd", Expression.Operation.FALSE);
Contributor

👍
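
For readers following the residual assertions above, here is a rough model of what a residual for `startsWith` could look like for a single truncate partition value. It assumes a truncate width of 2 (implied by the two-character partition values, but not stated in the thread) and works on plain strings; `ResidualSketch`, `Residual`, and `residual` are hypothetical names and this is not the `ResidualEvaluator` implementation.

```java
// Sketch of residual evaluation for startsWith against one truncate
// partition value. All names and the width of 2 are assumptions.
final class ResidualSketch {
  private ResidualSketch() {}

  enum Residual { ALWAYS_TRUE, ALWAYS_FALSE, KEEP_PREDICATE }

  // partitionValue is assumed to hold at most `width` characters.
  static Residual residual(String prefix, String partitionValue, int width) {
    if (prefix.length() <= width || partitionValue.length() < width) {
      // Either the prefix fits inside the truncated value, or the value was
      // never shortened; the partition value alone decides the predicate.
      return partitionValue.startsWith(prefix) ? Residual.ALWAYS_TRUE : Residual.ALWAYS_FALSE;
    }
    // The prefix is longer than the truncated value: a mismatch on the first
    // `width` characters rules the partition out, a match leaves the
    // predicate to be checked against each row.
    return prefix.startsWith(partitionValue) ? Residual.KEEP_PREDICATE : Residual.ALWAYS_FALSE;
  }

  public static void main(String[] args) {
    for (String partitionValue : new String[] {"ab", "bc", "cd"}) {
      System.out.println(partitionValue + " -> " + residual("bcd", partitionValue, 2));
    }
  }
}
```

Under these assumptions the sketch agrees with the three assertions: `"ab"` and `"cd"` reduce to false, while `"bc"` keeps the original predicate as the residual.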

@sujithjay
Contributor Author

@rdblue, @moulimukherjee I have updated the PR. Please review.

@moulimukherjee
Contributor

Changes look good to me 👍

@@ -120,6 +124,9 @@ public R or(R leftResult, R rightResult) {
        return in(pred.ref(), pred.literal());
      case NOT_IN:
        return notIn(pred.ref(), pred.literal());
      case STARTS_WITH:
        /* startsWith accepts only Strings, hence type-casting */
        return startsWith((BoundReference<String>) pred.ref(), (Literal<String>) pred.literal());
Contributor

I didn't catch this the first time through, but if the startsWith method above is parameterized by T then there shouldn't be a need to cast these to literals and bound references parameterized by String.

Contributor Author

startsWith was parameterised with T, but I missed removing the casts here.
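
The point about the casts can be illustrated with a self-contained sketch. `Ref`, `Lit`, and `Pred` below are minimal hypothetical stand-ins for the bound reference, literal, and bound predicate types; they are not the real Iceberg classes and only show how a type parameter flows through the dispatch.

```java
// Sketch: when the visitor method is parameterized by T, the dispatch site
// needs no String-specific casts. All types here are hypothetical stand-ins.
final class GenericDispatchSketch {
  private GenericDispatchSketch() {}

  static final class Ref<T> {
    final String name;
    Ref(String name) { this.name = name; }
  }

  static final class Lit<T> {
    final T value;
    Lit(T value) { this.value = value; }
  }

  static final class Pred<T> {
    final Ref<T> ref;
    final Lit<T> literal;
    Pred(Ref<T> ref, Lit<T> literal) {
      this.ref = ref;
      this.literal = literal;
    }
  }

  // Parameterized by T, so it accepts whatever type the predicate carries.
  static <T> String startsWith(Ref<T> ref, Lit<T> literal) {
    return ref.name + " startsWith " + literal.value;
  }

  // The predicate's own T is passed straight through; no casts are required.
  static <T> String visit(Pred<T> pred) {
    return startsWith(pred.ref, pred.literal);
  }

  public static void main(String[] args) {
    Pred<String> pred = new Pred<String>(new Ref<String>("value"), new Lit<String>("ab"));
    System.out.println(visit(pred)); // prints value startsWith ab
  }
}
```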

@rdblue
Contributor

rdblue commented Aug 12, 2019

@sujithjay, I just found 2 minor issues and I'll merge this after those are fixed. Thanks for pushing this feature through!

@sujithjay
Contributor Author

Thank you for the help, @rdblue & @moulimukherjee. 🙏

rdblue merged commit 5cfc119 into apache:master on Aug 12, 2019
rdblue mentioned this pull request on Aug 12, 2019
sujithjay deleted the ICEBERG-31 branch on August 12, 2019 19:01
@aokolnychyi
Contributor

I would like to follow up with a PR to enable pushdown of startsWith in Spark. Are there any limitations that will block me?

Apart from changes in SparkFilters, we will need to extend InclusiveMetricsEvaluator, ParquetDictionaryRowGroupFilter, and ParquetMetricsRowGroupFilter with logic that checks lower/upper bounds in startsWith.

I think one way to implement this is to take the same number of bytes from prefix and bounds (e.g. slice) and then rely on UnsignedByteBufComparator for unsigned lexicographical comparison. Alternatively, we can convert bounds to strings and operate on them.

@sujithjay @rdblue @moulimukherjee Any thoughts?
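
A minimal sketch of the byte-slicing idea proposed above, assuming bounds and prefix are available as UTF-8 byte arrays. It uses a hand-written unsigned comparison loop rather than `UnsignedByteBufComparator`, and `StartsWithBoundsSketch`, `compareUnsigned`, and `mightMatch` are hypothetical names, not existing Iceberg APIs.

```java
// Sketch: decide from lower/upper bounds whether any value in a file could
// start with a given prefix. All names here are hypothetical.
final class StartsWithBoundsSketch {
  private StartsWithBoundsSketch() {}

  static byte[] utf8(String s) {
    return s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
  }

  // Unsigned lexicographic comparison of the first aLen bytes of a
  // with the first bLen bytes of b.
  static int compareUnsigned(byte[] a, int aLen, byte[] b, int bLen) {
    int common = Math.min(aLen, bLen);
    for (int i = 0; i < common; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(aLen, bLen);
  }

  static boolean mightMatch(byte[] lower, byte[] upper, byte[] prefix) {
    // Slice the lower bound to at most the prefix length; if that slice is
    // already greater than the prefix, every value sorts after all matches.
    int lowerLen = Math.min(lower.length, prefix.length);
    if (compareUnsigned(lower, lowerLen, prefix, prefix.length) > 0) {
      return false;
    }
    // Slice the upper bound the same way; if that slice is less than the
    // prefix, every value sorts before all matches.
    int upperLen = Math.min(upper.length, prefix.length);
    if (compareUnsigned(upper, upperLen, prefix, prefix.length) < 0) {
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(mightMatch(utf8("abc"), utf8("abz"), utf8("ab"))); // prints true
    System.out.println(mightMatch(utf8("abc"), utf8("abz"), utf8("b")));  // prints false
  }
}
```

Comparing raw bytes avoids decoding the bounds into strings, and unsigned byte order matches UTF-8 code point order; the alternative mentioned above, converting bounds to strings first, trades that for simpler code.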

@rdblue
Contributor

rdblue commented Aug 20, 2019

That plan sounds good to me!

xabriel mentioned this pull request on Oct 9, 2019
holdenk added a commit to holdenk/incubator-iceberg that referenced this pull request Jun 29, 2021
holdenk added a commit to holdenk/incubator-iceberg that referenced this pull request Jul 23, 2021

5 participants