Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce startsWith Predicate #327

Merged
merged 7 commits into from
Aug 12, 2019
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -142,5 +142,10 @@ public <T> Boolean in(BoundReference<T> ref, Literal<T> lit) {
public <T> Boolean notIn(BoundReference<T> ref, Literal<T> lit) {
return !in(ref, lit);
}

@Override
public Boolean startsWith(BoundReference<String> ref, Literal<String> lit) {
return ref.get(struct).startsWith(lit.value());
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,8 @@ enum Operation {
NOT_IN,
NOT,
AND,
OR;
OR,
STARTS_WITH;

/**
* @return the operation used when this is negated
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,10 @@ public <T> R notIn(BoundReference<T> ref, Literal<T> lit) {
return null;
}

public R startsWith(BoundReference<String> ref, Literal<String> lit) {
throw new UnsupportedOperationException("Unsupported operation.");
}

@Override
public <T> R predicate(BoundPredicate<T> pred) {
switch (pred.op()) {
Expand All @@ -120,6 +124,9 @@ public <T> R predicate(BoundPredicate<T> pred) {
return in(pred.ref(), pred.literal());
case NOT_IN:
return notIn(pred.ref(), pred.literal());
case STARTS_WITH:
// Expressions.startsWith accepts only strings, thus type casting is done here.
return startsWith((BoundReference<String>) pred.ref(), (Literal<String>) pred.literal());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't catch this the first time through, but if the startsWith method above is parameterized by T then there shouldn't be a need to cast these to literals and bound references parameterized by String.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startsWith was parameterised with T, but I missed removing the casts here.

default:
throw new UnsupportedOperationException(
"Unknown operation for predicate: " + pred.op());
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,10 @@ public static <T> UnboundPredicate<T> notEqual(String name, T value) {
return new UnboundPredicate<>(Expression.Operation.NOT_EQ, ref(name), value);
}

public static UnboundPredicate<String> startsWith(String name, String value) {
return new UnboundPredicate<>(Expression.Operation.STARTS_WITH, ref(name), value);
}

public static <T> UnboundPredicate<T> predicate(Operation op, String name, T value) {
Preconditions.checkArgument(op != Operation.IS_NULL && op != Operation.NOT_NULL,
"Cannot create %s predicate inclusive a value", op);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,8 @@ public String toString() {
return String.valueOf(ref()) + " == " + literal();
case NOT_EQ:
return String.valueOf(ref()) + " != " + literal();
case STARTS_WITH:
return ref() + " startsWith \"" + literal() + "\"";
// case IN:
// break;
// case NOT_IN:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,7 @@ public UnboundPredicate<Integer> project(String name, BoundPredicate<T> predicat
predicate.op(), name, apply(predicate.literal().value()));
// case IN:
// return Expressions.predicate();
case STARTS_WITH:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if a startsWith predicate makes sense in case of Bucket transforms. Hence, leaving it to be handled by the default case. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Transformed values tell us nothing about whether the original predicate is true or not.

default:
// comparison predicates can't be projected, notEq can't be projected
// TODO: small ranges can be projected.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -189,6 +189,8 @@ static <S, T> UnboundPredicate<T> truncateArray(
return predicate(Expression.Operation.GT_EQ, name, transform.apply(boundary));
case EQ:
return predicate(Expression.Operation.EQ, name, transform.apply(boundary));
case STARTS_WITH:
return predicate(Expression.Operation.STARTS_WITH, name, transform.apply(boundary));
// case IN: // TODO
// return Expressions.predicate(Operation.IN, name, transform.apply(boundary));
default:
Expand Down
31 changes: 23 additions & 8 deletions api/src/main/java/org/apache/iceberg/transforms/Truncate.java
Original file line number Diff line number Diff line change
Expand Up @@ -213,20 +213,35 @@ public boolean canTransform(Type type) {

@Override
public UnboundPredicate<CharSequence> project(String name,
BoundPredicate<CharSequence> pred) {
if (pred.op() == NOT_NULL || pred.op() == IS_NULL) {
return Expressions.predicate(pred.op(), name);
BoundPredicate<CharSequence> predicate) {
switch (predicate.op()) {
case NOT_NULL:
case IS_NULL:
return Expressions.predicate(predicate.op(), name);
case STARTS_WITH:
default:
return ProjectionUtil.truncateArray(name, predicate, this);
}
return ProjectionUtil.truncateArray(name, pred, this);
}

@Override
public UnboundPredicate<CharSequence> projectStrict(String name,
BoundPredicate<CharSequence> pred) {
if (pred.op() == NOT_NULL || pred.op() == IS_NULL) {
return Expressions.predicate(pred.op(), name);
BoundPredicate<CharSequence> predicate) {
switch (predicate.op()) {
case IS_NULL:
case NOT_NULL:
return Expressions.predicate(predicate.op(), name);
case STARTS_WITH:
if (predicate.literal().value().length() < width()) {
return Expressions.predicate(predicate.op(), name, predicate.literal().value());
} else if (predicate.literal().value().length() == width()) {
return Expressions.equal(name, predicate.literal().value());
} else {
return null;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does returning null here still make sense, in light of recent changes? Or should it be returning ProjectionUtil.truncateArrayStrict(name, predicate, this) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning ProjectionUtil.truncateArrayStrict(name, predicate, this) would still return null anyways, since it can't handle the STARTS_WITH case. So I think it makes sense to return null here.

}
default:
return ProjectionUtil.truncateArrayStrict(name, predicate, this);
}
return ProjectionUtil.truncateArrayStrict(name, pred, this);
}

@Override
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@
import static org.apache.iceberg.expressions.Expressions.lessThan;
import static org.apache.iceberg.expressions.Expressions.not;
import static org.apache.iceberg.expressions.Expressions.or;
import static org.apache.iceberg.expressions.Expressions.startsWith;
import static org.apache.iceberg.types.Types.NestedField.required;

public class TestExpressionBinding {
Expand Down Expand Up @@ -131,6 +132,18 @@ public void testNot() {
Assert.assertEquals("Should bind x correctly", 0, child.ref().fieldId());
}

@Test
public void testStartsWith() {
StructType struct = StructType.of(required(0, "s", Types.StringType.get()));
Expression expr = startsWith("s", "abc");
Expression boundExpr = Binder.bind(struct, expr);
TestHelpers.assertAllReferencesBound("StartsWith", boundExpr);
// make sure the expression is a StartsWith
BoundPredicate<?> pred = TestHelpers.assertAndUnwrap(boundExpr, BoundPredicate.class);
Assert.assertEquals("Should be right operation", Expression.Operation.STARTS_WITH, pred.op());
Assert.assertEquals("Should bind s correctly", 0, pred.ref().fieldId());
}

@Test
public void testAlwaysTrue() {
Assert.assertEquals("Should not change alwaysTrue",
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.apache.iceberg.transforms;

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.TestHelpers;
import org.apache.iceberg.expressions.Binder;
import org.apache.iceberg.expressions.BoundPredicate;
import org.apache.iceberg.expressions.Evaluator;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.False;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.expressions.Projections;
import org.apache.iceberg.expressions.UnboundPredicate;
import org.apache.iceberg.types.Types;
import org.junit.Assert;
import org.junit.Test;

import static org.apache.iceberg.TestHelpers.assertAndUnwrapUnbound;
import static org.apache.iceberg.expressions.Expressions.startsWith;
import static org.apache.iceberg.types.Types.NestedField.optional;

public class TestStartsWith {

private static final String COLUMN = "someStringCol";
private static final Schema SCHEMA = new Schema(optional(1, COLUMN, Types.StringType.get()));

@Test
public void assertTruncateProjections() {
sujithjay marked this conversation as resolved.
Show resolved Hide resolved
PartitionSpec spec = PartitionSpec.builderFor(SCHEMA).truncate(COLUMN, 4).build();

assertProjectionInclusive(spec, startsWith(COLUMN, "ab"), "ab", Expression.Operation.STARTS_WITH);
assertProjectionInclusive(spec, startsWith(COLUMN, "abab"), "abab", Expression.Operation.STARTS_WITH);
assertProjectionInclusive(spec, startsWith(COLUMN, "ababab"), "abab", Expression.Operation.STARTS_WITH);

assertProjectionStrict(spec, startsWith(COLUMN, "ab"), "ab", Expression.Operation.STARTS_WITH);
assertProjectionStrict(spec, startsWith(COLUMN, "abab"), "abab", Expression.Operation.EQ);
assertProjectionFalse(spec, startsWith(COLUMN, "ababab"));
sujithjay marked this conversation as resolved.
Show resolved Hide resolved
}

@Test
public void assertTruncateString() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can some tests be added in the TestTruncatesResiduals.testStringTruncateTransformResiduals() too with the STARTS_WITH?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExpressionVisitors for startsWith currently throw an UnsupportedOperationException (c.f. discussion-thread).

Given that, tests in TestTruncatesResiduals with STARTS_WITH would not add much value. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sujithjay, I agree with @moulimukherjee. But you're right that currently an UnsupportedOperationException will be thrown. That's a problem because it will prevent creating splits for a scan when there is a startsWith predicate.

The fix is to implement startsWith in ResidualEvaluator to support residual calculations and to add a couple of tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for ^ (implement startsWith in ResidualEvaluator and add tests)

Truncate<String> trunc = Truncate.get(Types.StringType.get(), 2);
Expression expr = startsWith(COLUMN, "abcde");
BoundPredicate<String> boundExpr = (BoundPredicate<String>) Binder.bind(SCHEMA.asStruct(), expr, false);

UnboundPredicate<String> projected = trunc.project(COLUMN, boundExpr);
Evaluator evaluator = new Evaluator(SCHEMA.asStruct(), projected);

Assert.assertTrue("startsWith(abcde, truncate(abcde,2)) => true",
evaluator.eval(TestHelpers.Row.of("abcde")));
}

private void assertProjectionInclusive(PartitionSpec spec, UnboundPredicate<?> filter,
String expectedLiteral, Expression.Operation expectedOp) {
Expression projection = Projections.inclusive(spec).project(filter);
assertProjection(spec, expectedLiteral, projection, expectedOp);
}

private void assertProjectionStrict(PartitionSpec spec, UnboundPredicate<?> filter,
String expectedLiteral, Expression.Operation expectedOp) {
Expression projection = Projections.strict(spec).project(filter);
assertProjection(spec, expectedLiteral, projection, expectedOp);
}

private void assertProjection(PartitionSpec spec, String expectedLiteral, Expression projection,
Expression.Operation expectedOp) {
UnboundPredicate<?> predicate = assertAndUnwrapUnbound(projection);
Literal literal = predicate.literal();
Truncate<CharSequence> transform = (Truncate<CharSequence>) spec.getFieldsBySourceId(1).get(0).transform();
String output = transform.toHumanString((String) literal.value());

Assert.assertEquals(expectedOp, predicate.op());
Assert.assertEquals(expectedLiteral, output);
}

private void assertProjectionFalse(PartitionSpec spec, UnboundPredicate<?> filter) {
Expression projection = Projections.strict(spec).project(filter);
Assert.assertTrue(projection instanceof False);
}
}