Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support regex delimiter and limit argument for Spark split function #10248

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
4e0f895
refactor spark split function
gaoyangxiaozhu Jun 18, 2024
eada804
Merge branch 'facebookincubator:main' into gayangya/split_refactor
gaoyangxiaozhu Jun 25, 2024
a9ec3de
address comments
gaoyangxiaozhu Jun 27, 2024
eddfb43
Merge branch 'gayangya/split_refactor' of https://github.com/gayangya…
gaoyangxiaozhu Jun 27, 2024
f7f0967
address comment
gaoyangxiaozhu Jun 28, 2024
0348181
Merge branch 'facebookincubator:main' into gayangya/split_refactor
gaoyangxiaozhu Jun 28, 2024
47542dd
address comment
gaoyangxiaozhu Jul 1, 2024
553187b
spark split only support integer type limit
gaoyangxiaozhu Jul 1, 2024
e69ea54
address comments
gaoyangxiaozhu Jul 2, 2024
f5c603a
address comments
gaoyangxiaozhu Jul 2, 2024
54a1bce
address comment
gaoyangxiaozhu Jul 3, 2024
dd3d805
fix split overflow issue if regex not empty but has empty size match
gaoyangxiaozhu Jul 4, 2024
cfc700b
address comment
gaoyangxiaozhu Jul 5, 2024
c42ec96
fix empty delimiter split issue for non-utf-8 characters
gaoyangxiaozhu Jul 5, 2024
ea214f9
address comment
gaoyangxiaozhu Jul 9, 2024
27fcfab
address comment
gaoyangxiaozhu Jul 10, 2024
d13e03e
address comment
gaoyangxiaozhu Jul 10, 2024
9e2dbe9
address comment
gaoyangxiaozhu Jul 10, 2024
99c8e87
Merge branch 'facebookincubator:main' into gayangya/split_refactor
gaoyangxiaozhu Jul 10, 2024
5da18db
address comment - rename splitfunctions.cpp to split.cpp
gaoyangxiaozhu Jul 11, 2024
64c2282
Merge branch 'gayangya/split_refactor' of https://github.com/gayangya…
gaoyangxiaozhu Jul 11, 2024
6363c73
Merge branch 'facebookincubator:main' into gayangya/split_refactor
gaoyangxiaozhu Jul 11, 2024
ce89291
address comment
gaoyangxiaozhu Jul 12, 2024
edb1e70
Merge branch 'gayangya/split_refactor' of https://github.com/gayangya…
gaoyangxiaozhu Jul 12, 2024
1a8e0b4
Merge branch 'facebookincubator:main' into gayangya/split_refactor
gaoyangxiaozhu Jul 12, 2024
9eb75d3
address comment
gaoyangxiaozhu Jul 12, 2024
4aff4aa
address comment
gaoyangxiaozhu Jul 15, 2024
9aff221
Merge branch 'facebookincubator:main' into gayangya/split_refactor
gaoyangxiaozhu Jul 23, 2024
3845cce
convert to simple function for spark split function
gaoyangxiaozhu Jul 23, 2024
2151c17
split the split function to seperate file and address comments
gaoyangxiaozhu Jul 25, 2024
526a2d6
address comments
gaoyangxiaozhu Jul 25, 2024
d7aa8b7
fix small format issue
gaoyangxiaozhu Jul 25, 2024
9622d96
address comments
gaoyangxiaozhu Jul 26, 2024
948e01f
small change
gaoyangxiaozhu Jul 26, 2024
355200f
address some comments
gaoyangxiaozhu Jul 29, 2024
ded360a
fix format issue
gaoyangxiaozhu Jul 29, 2024
fdc8c87
address non-well UTF-8 string
gaoyangxiaozhu Jul 30, 2024
dbb9f8a
Merge branch 'main' of https://github.com/gayangya/velox into gayangy…
gaoyangxiaozhu Jul 30, 2024
0ad32d0
fix build
gaoyangxiaozhu Jul 30, 2024
5197087
address comment
gaoyangxiaozhu Jul 31, 2024
0db194e
Merge branch 'facebookincubator:main' into gayangya/split_refactor
gaoyangxiaozhu Jul 31, 2024
a2fc545
Merge branch 'facebookincubator:main' into gayangya/split_refactor
gaoyangxiaozhu Jul 31, 2024
26e060f
Merge branch 'facebookincubator:main' into gayangya/split_refactor
gaoyangxiaozhu Aug 1, 2024
e43c558
address comments
gaoyangxiaozhu Aug 1, 2024
7682d2d
Merge branch 'gayangya/split_refactor' of https://github.com/gayangya…
gaoyangxiaozhu Aug 1, 2024
4ad8f15
add ut to cover wide-character as delimiter
gaoyangxiaozhu Aug 1, 2024
2a399d7
refactor to address comment
gaoyangxiaozhu Aug 7, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 17 additions & 13 deletions velox/docs/functions/spark/string.rst
Original file line number Diff line number Diff line change
Expand Up @@ -236,22 +236,26 @@ Unless specified otherwise, all functions return NULL if at least one of the arg

SELECT soundex('Miller'); -- "M460"

.. spark:function:: split(string, delimiter) -> array(string)

Splits ``string`` on ``delimiter`` and returns an array. ::
.. spark:function:: split(string, delimiter[, limit]) -> array(string)

Splits ``string`` around occurrences that match ``delimiter`` and returns an array with a length of
gaoyangxiaozhu marked this conversation as resolved.
Show resolved Hide resolved
at most ``limit``. ``delimiter`` is a string representing regular expression. ``limit`` is an integer
which controls the number of times the regex is applied. By default, ``limit`` is -1. When ``limit`` > 0,
the resulting array's length will not be more than ``limit``, and the resulting array's last entry will
contain all input beyond the last matched regex. When ``limit`` <= 0, ``regex`` will be applied as many
times as possible, and the resulting array can be of any size. When ``delimiter`` is empty, if ``limit``
is smaller than the size of ``string``, the resulting array only contains ``limit`` number of single characters
splitting from ``string``, if ``limit`` is not provided or is larger than the size of ``string``, the resulting
array contains all the single characters of ``string`` and does not include an empty tail character.
The split function align with vanilla spark 3.4+ split function. ::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The split function align with vanilla spark 3.4+ split function.

Wondering why do we need this comment? Isn't it true for all functions that they are expected to match Spark 3.4+ behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just in case people see different behavior happen for spark version below 3.4 as 3.2/3.3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other functions? Do they implement Spark 3.2/3.3 semantics? CC: @rui-mo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaoyangxiaozhu @rui-mo gentle ping

Copy link
Collaborator

@rui-mo rui-mo Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova For the split function, the behavior for empty delimiter was changed since Spark 3.4 with commit apache/spark@247306c.
When implementing functions in Velox, if there is semantic difference among Spark versions, I assume we need to follow the latest one (Spark 3.5). Does it makes sense? Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Rui, thank you for clarifying. I assume this is a general policy that applies to all functions. If so, it would be helpful to document it in https://facebookincubator.github.io/velox/spark_functions.html and clarify which Spark version we match and whether we match ANSI mode on or off.

Furthermore, assuming Spark doesn't guarantee backwards compatibility across versions, we would have to pick a specific version we match and cannot say "latest" or 3.4+. If Spark changes behavior in "latest" version and we change to match, existing users will start seeing different behavior. Is this the case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Thanks for the pointer. I'd like to document it.

If Spark changes behavior in "latest" version and we change to match, existing users will start seeing different behavior. Is this the case?

I assume it is the case, and it makes sense to me to pick a specific version.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #10677 for the documentation.

@gaoyangxiaozhu I assume below statement and Since Spark 3.4 for splitEmptyDelimiter function could be removed.

The split function align with vanilla spark 3.4+ split function.


SELECT split('oneAtwoBthreeC', '[ABC]'); -- ["one","two","three",""]
SELECT split('one', ''); -- ["o", "n", "e", ""]
SELECT split('one', '1'); -- ["one"]

.. spark:function:: split(string, delimiter, limit) -> array(string)
:noindex:

Splits ``string`` on ``delimiter`` and returns an array of size at most ``limit``. ::

SELECT split('oneAtwoBthreeC', '[ABC]', -1); -- ["one","two","three",""]
SELECT split('oneAtwoBthreeC', '[ABC]', 0); -- ["one", "two", "three", ""]
SELECT split('oneAtwoBthreeC', '[ABC]', 2); -- ["one","twoBthreeC"]
gaoyangxiaozhu marked this conversation as resolved.
Show resolved Hide resolved
SELECT split('oneAtwoBthreeC', '[ABC]', 5); -- ["one","two","three",""]
SELECT split('one', '1'); -- ["one"]
SELECT split('abcd', ''); -- ["a","b","c","d"]
SELECT split('abcd', '', 3); -- ["a","b","c"]
SELECT split('abcd', '', 5); -- ["a","b","c","d"]

.. spark:function:: startswith(left, right) -> boolean

Expand Down
3 changes: 2 additions & 1 deletion velox/functions/lib/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,8 @@ velox_add_library(
Repeat.cpp
StringEncodingUtils.cpp
SubscriptUtil.cpp
TimeUtils.cpp)
TimeUtils.cpp
Utf8Utils.cpp)

velox_link_libraries(
velox_functions_lib
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "velox/functions/prestosql/Utf8Utils.h"
#include "velox/functions/lib/Utf8Utils.h"
#include "velox/common/base/Exceptions.h"
#include "velox/external/utf8proc/utf8procImpl.h"

Expand Down
File renamed without changes.
1 change: 1 addition & 0 deletions velox/functions/lib/tests/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ add_executable(
MapConcatTest.cpp
Re2FunctionsTest.cpp
RepeatTest.cpp
Utf8Test.cpp
ZetaDistributionTest.cpp)

add_test(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@
*/

#include <gtest/gtest.h>
#include "velox/functions/prestosql/Utf8Utils.h"
#include "velox/functions/lib/Utf8Utils.h"

namespace facebook::velox::functions {
namespace {
Expand Down
1 change: 0 additions & 1 deletion velox/functions/prestosql/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,6 @@ velox_add_library(
TransformValues.cpp
TypeOf.cpp
URLFunctions.cpp
Utf8Utils.cpp
VectorArithmetic.cpp
WidthBucketArray.cpp
Zip.cpp
Expand Down
2 changes: 1 addition & 1 deletion velox/functions/prestosql/FromUtf8.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,8 @@
#include "velox/expression/DecodedArgs.h"
#include "velox/expression/StringWriter.h"
#include "velox/expression/VectorFunction.h"
#include "velox/functions/lib/Utf8Utils.h"
#include "velox/functions/lib/string/StringImpl.h"
#include "velox/functions/prestosql/Utf8Utils.h"

namespace facebook::velox::functions {
namespace {
Expand Down
1 change: 0 additions & 1 deletion velox/functions/prestosql/tests/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,6 @@ add_executable(
TrimFunctionsTest.cpp
TypeOfTest.cpp
URLFunctionsTest.cpp
Utf8Test.cpp
UuidFunctionsTest.cpp
WidthBucketArrayTest.cpp
WordStemTest.cpp
Expand Down
1 change: 0 additions & 1 deletion velox/functions/sparksql/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,6 @@ velox_add_library(
RegisterArithmetic.cpp
RegisterCompare.cpp
Size.cpp
SplitFunctions.cpp
String.cpp
UnscaledValueFunction.cpp)

Expand Down
2 changes: 1 addition & 1 deletion velox/functions/sparksql/MaskFunction.h
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@
*/
#pragma once

#include "velox/functions/prestosql/Utf8Utils.h"
#include "velox/functions/lib/Utf8Utils.h"

namespace facebook::velox::functions::sparksql {

Expand Down
6 changes: 5 additions & 1 deletion velox/functions/sparksql/Register.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@
#include "velox/functions/sparksql/RegisterCompare.h"
#include "velox/functions/sparksql/Size.h"
#include "velox/functions/sparksql/SparkPartitionId.h"
#include "velox/functions/sparksql/Split.h"
#include "velox/functions/sparksql/String.h"
#include "velox/functions/sparksql/StringToMap.h"
#include "velox/functions/sparksql/UnscaledValueFunction.h"
Expand Down Expand Up @@ -253,7 +254,6 @@ void registerFunctions(const std::string& prefix) {
prefix + "rlike", re2SearchSignatures(), makeRLike);
exec::registerStatefulVectorFunction(
prefix + "like", likeSignatures(), makeLike);
VELOX_REGISTER_VECTOR_FUNCTION(udf_regexp_split, prefix + "split");

exec::registerStatefulVectorFunction(
prefix + "least",
Expand Down Expand Up @@ -483,6 +483,10 @@ void registerFunctions(const std::string& prefix) {
registerFunction<LevenshteinDistanceFunction, int32_t, Varchar, Varchar>(
{prefix + "levenshtein"});

registerFunction<Split, Array<Varchar>, Varchar, Varchar>({prefix + "split"});
registerFunction<Split, Array<Varchar>, Varchar, Varchar, int32_t>(
{prefix + "split"});

registerFunction<MaskFunction, Varchar, Varchar>({prefix + "mask"});
registerFunction<MaskFunction, Varchar, Varchar, Varchar>({prefix + "mask"});
registerFunction<MaskFunction, Varchar, Varchar, Varchar, Varchar>(
Expand Down
170 changes: 170 additions & 0 deletions velox/functions/sparksql/Split.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,170 @@
/*
* Copyright (c) Facebook, Inc. and its affiliates.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include "velox/functions/lib/Utf8Utils.h"

namespace facebook::velox::functions::sparksql {

/// split(string, delimiter[, limit]) -> array(varchar)
///
/// Splits string on delimiter and returns an array of size at most limit.
/// delimiter is a string representing regular expression.
/// limit is an integer which controls the number of times the regex is applied.
/// By default, limit is -1, which means 'no limit', the delimiter will be
/// applied as many times as possible.
template <typename T>
struct Split {
VELOX_DEFINE_FUNCTION_TYPES(T);

// Results refer to strings in the first argument.
static constexpr int32_t reuse_strings_from_arg = 0;

FOLLY_ALWAYS_INLINE void call(
out_type<Array<Varchar>>& result,
const arg_type<Varchar>& input,
const arg_type<Varchar>& delimiter) {
doCall(result, input, delimiter, INT32_MAX);
}

FOLLY_ALWAYS_INLINE void call(
out_type<Array<Varchar>>& result,
const arg_type<Varchar>& input,
const arg_type<Varchar>& delimiter,
const arg_type<int32_t>& limit) {
doCall(result, input, delimiter, limit > 0 ? limit : INT32_MAX);
}

private:
void doCall(
out_type<Array<Varchar>>& result,
const arg_type<Varchar>& input,
const arg_type<Varchar>& delimiter,
int32_t limit) const {
if (delimiter.empty()) {
splitEmptyDelimiter(result, input, limit);
} else {
split(result, input, delimiter, limit);
}
}

// When pattern is empty, split each character out. Since Spark 3.4, when
gaoyangxiaozhu marked this conversation as resolved.
Show resolved Hide resolved
// delimiter is empty, the result does not include an empty tail string, e.g.
// split('abc', '') outputs ["a", "b", "c"] instead of ["a", "b", "c", ""].
// The result does not include remaining string when limit is smaller than the
// string size, e.g. split('abc', '', 2) outputs ["a", "b"] instead of ["a",
// "bc"].
void splitEmptyDelimiter(
out_type<Array<Varchar>>& result,
const arg_type<Varchar>& input,
int32_t limit) const {
if (input.size() == 0) {
result.add_item().setNoCopy(StringView());
return;
}

const size_t end = input.size();
const char* start = input.data();
size_t pos = 0;
int32_t count = 0;
while (pos < end && count < limit) {
auto charLength = tryGetCharLength(start + pos, end - pos);
if (charLength <= 0) {
// Invalid UTF-8 character, the length of the invalid
// character is the absolute value of result of `tryGetCharLength`.
charLength = -charLength;
}
result.add_item().setNoCopy(StringView(start + pos, charLength));
pos += charLength;
count += 1;
}
}

// Split with a non-empty delimiter. If limit > 0, The resulting array's
// length will not be more than limit and the resulting array's last entry
// will contain all input beyond the last matched regex. If limit <= 0,
// delimiter will be applied as many times as possible, and the resulting
// array can be of any size.
void split(
out_type<Array<Varchar>>& result,
const arg_type<Varchar>& input,
const arg_type<Varchar>& delimiter,
int32_t limit) const {
VELOX_DCHECK(!delimiter.empty(), "Non-empty delimiter is expected");

// Trivial case of converting string to array with 1 element.
if (limit == 1) {
result.add_item().setNoCopy(input);
return;
}

// Splits input string using the delimiter and adds the cutting-off pieces
// to elements vector until the string's end or the limit is reached.
int32_t addedElements{0};
auto* re = cache_.findOrCompile(delimiter);
const size_t end = input.size();
const char* start = input.data();
const auto re2String = re2::StringPiece(start, end);
size_t pos = 0;

re2::StringPiece subMatches[1];
// Matches a regular expression against a portion of the input string,
// starting from 'pos' to the end of the input string. The match is not
// anchored, which means it can start at any position in the string. If a
// match is found, the matched portion of the string is stored in
// 'subMatches'. The '1' indicates that we are only interested in the first
// match found from the current position 'pos' in each iteration of the
// loop.
while (re->Match(
re2String, pos, end, RE2::Anchor::UNANCHORED, subMatches, 1)) {
const auto fullMatch = subMatches[0];
auto offset = fullMatch.data() - start;
const auto size = fullMatch.size();
if (offset >= end) {
break;
}

// When hitting an empty match, split the character at the current 'pos'
// of the input string and put it into the result array, followed by an
// empty tail string at last, e.g., the result array for split('abc','d|')
// is ["a","b","c",""].
if (size == 0) {
auto charLength = tryGetCharLength(start + pos, end - pos);
if (charLength <= 0) {
// Invalid UTF-8 character, the length of the invalid
// character is the absolute value of result of `tryGetCharLength`.
charLength = -charLength;
}
offset += charLength;
}
result.add_item().setNoCopy(StringView(start + pos, offset - pos));
pos = offset + size;

++addedElements;
// If the next element should be the last, leave the loop.
if (addedElements + 1 == limit) {
break;
}
}

// Add the rest of the string and we are done.
// Note that the rest of the string can be empty - we still add it.
result.add_item().setNoCopy(StringView(start + pos, end - pos));
}

mutable detail::ReCache cache_;
};
} // namespace facebook::velox::functions::sparksql
Loading
Loading