Add cudf::strings::contains_multiple #16900

Merged
merged 86 commits into from
Nov 12, 2024
Commits
e446371
Add cudf::strings::contains_multiple
davidwendt Sep 24, 2024
779ca64
add proclaim_return_type
davidwendt Sep 24, 2024
90ab892
Merge branch 'branch-24.12' into contains-multiple
davidwendt Sep 24, 2024
f19cb1c
fix while-loop check
davidwendt Sep 24, 2024
a221ecb
Merge branch 'branch-24.12' into contains-multiple
galipremsagar Sep 24, 2024
ca3581e
Merge branch 'branch-24.12' into contains-multiple
davidwendt Sep 25, 2024
1394b62
Merge branch 'branch-24.12' into contains-multiple
davidwendt Sep 26, 2024
a0be912
Merge branch 'branch-24.12' into contains-multiple
davidwendt Sep 26, 2024
65e21d3
cleanup code
davidwendt Sep 26, 2024
b9da939
change shared memory layout
davidwendt Sep 26, 2024
e8cb6cf
use global memory if shared memory limit is reached
davidwendt Sep 26, 2024
0c6902d
Merge branch 'branch-24.12' into contains-multiple
davidwendt Sep 26, 2024
5cc9d54
change shared-memory threshold
davidwendt Sep 26, 2024
c38569e
Merge branch 'branch-24.12' into contains-multiple
davidwendt Sep 27, 2024
4acb531
factor out benchmarks for find/contains_multiple
davidwendt Sep 27, 2024
f3a3f24
refactor kernels into templated one
davidwendt Sep 27, 2024
d5fd320
Merge branch 'branch-24.12' into contains-multiple
davidwendt Sep 27, 2024
1ac8ecb
Merge branch 'branch-24.12' into contains-multiple
davidwendt Sep 30, 2024
33d7847
move tests to find_multiple_tests.cpp
davidwendt Sep 30, 2024
04717e3
use output directly in row-parallel kernel
davidwendt Oct 1, 2024
e15db54
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 1, 2024
bda7617
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 2, 2024
ae8ef54
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 2, 2024
94d64a9
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 4, 2024
ba6a5df
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 10, 2024
38dea3a
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 11, 2024
4684f88
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 14, 2024
b5e6d08
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 14, 2024
62a51c6
Update doxygen, exceptions
davidwendt Oct 15, 2024
f966d58
Merge branch 'contains-multiple' of github.com:davidwendt/cudf into c…
davidwendt Oct 16, 2024
0ac3c53
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 16, 2024
f8b0de2
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 16, 2024
88174ca
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 17, 2024
ede8a7b
cleanup code part I
davidwendt Oct 17, 2024
32accdb
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 17, 2024
041e6c2
Merge branch 'contains-multiple' of github.com:davidwendt/cudf into c…
davidwendt Oct 18, 2024
67c8e1e
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 18, 2024
4804f1e
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 18, 2024
dd94c4a
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 18, 2024
4051b12
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 21, 2024
a987665
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 21, 2024
b5493ec
fix threshold check
davidwendt Oct 22, 2024
c33ecea
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 22, 2024
b6ac3ff
fix threshold check for real
davidwendt Oct 22, 2024
448c0f2
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 22, 2024
819f59f
Merge branch 'contains-multiple' of github.com:davidwendt/cudf into c…
davidwendt Oct 23, 2024
2ca6b2a
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 23, 2024
b038e0c
remove commented out debug print
davidwendt Oct 23, 2024
df8cd90
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 23, 2024
1fc599a
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 24, 2024
1941b30
Merge branch 'branch-24.12' into contains-multiple
res-life Oct 25, 2024
3b83e29
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 25, 2024
8926db9
Merge branch 'contains-multiple' of github.com:davidwendt/cudf into c…
davidwendt Oct 25, 2024
5945278
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 25, 2024
d85db83
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 28, 2024
8e54dd7
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 28, 2024
6b74623
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 28, 2024
ca6371a
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 28, 2024
b17405e
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 31, 2024
0184732
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 31, 2024
3c5cd47
fix call to build benchmark input col
davidwendt Oct 31, 2024
2a5c021
Merge branch 'contains-multiple' of github.com:davidwendt/cudf into c…
davidwendt Oct 31, 2024
cd9d30d
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 31, 2024
5dc8b2f
Merge branch 'branch-24.12' into contains-multiple
davidwendt Oct 31, 2024
e9375f2
Merge branch 'branch-24.12' into contains-multiple
davidwendt Nov 5, 2024
9910bd3
fix copyright year
davidwendt Nov 5, 2024
31d5a4e
use cooperative groups for tile-size
davidwendt Nov 5, 2024
9a2a8af
Merge branch 'branch-24.12' into contains-multiple
davidwendt Nov 5, 2024
3b5576b
replace syncwarp with tile.sync
davidwendt Nov 5, 2024
c0e96dc
use meta_group_rank
davidwendt Nov 5, 2024
f5321d3
Merge branch 'contains-multiple' of github.com:davidwendt/cudf into c…
davidwendt Nov 5, 2024
80f4ba4
Merge branch 'branch-24.12' into contains-multiple
davidwendt Nov 5, 2024
64ff11a
Merge branch 'contains-multiple' of github.com:davidwendt/cudf into c…
davidwendt Nov 6, 2024
8218352
Merge branch 'branch-24.12' into contains-multiple
davidwendt Nov 6, 2024
529eec0
Merge branch 'branch-24.12' into contains-multiple
davidwendt Nov 7, 2024
ecadd12
fix merge conflict
davidwendt Nov 7, 2024
592ed8d
Merge branch 'branch-24.12' into contains-multiple
davidwendt Nov 7, 2024
7625963
Merge branch 'branch-24.12' into contains-multiple
davidwendt Nov 7, 2024
3a2c5dd
Merge branch 'branch-24.12' into contains-multiple
davidwendt Nov 8, 2024
7031dac
fix comment wording
davidwendt Nov 8, 2024
b8b023a
Merge branch 'branch-24.12' into contains-multiple
davidwendt Nov 8, 2024
a8bc038
Merge branch 'branch-24.12' into contains-multiple
res-life Nov 11, 2024
9a0c87f
Merge branch 'branch-24.12' into contains-multiple
res-life Nov 12, 2024
7bf652b
Merge branch 'branch-24.12' into contains-multiple
res-life Nov 12, 2024
4ef6d46
1st instead of 1th
bdice Nov 12, 2024
3de3f9e
Merge branch 'branch-24.12' into contains-multiple
bdice Nov 12, 2024
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -705,6 +705,7 @@ add_library(
src/strings/replace/replace_slice.cu
src/strings/reverse.cu
src/strings/scan/scan_inclusive.cu
src/strings/search/contains_multiple.cu
src/strings/search/findall.cu
src/strings/search/find.cu
src/strings/search/find_multiple.cu
1 change: 1 addition & 0 deletions cpp/benchmarks/CMakeLists.txt
@@ -375,6 +375,7 @@ ConfigureNVBench(
string/count.cpp
string/extract.cpp
string/find.cpp
string/find_multiple.cpp
string/join_strings.cpp
string/lengths.cpp
string/like.cpp
14 changes: 3 additions & 11 deletions cpp/benchmarks/string/find.cpp
@@ -20,9 +20,7 @@
#include <cudf_test/column_wrapper.hpp>

#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/combine.hpp>
#include <cudf/strings/find.hpp>
#include <cudf/strings/find_multiple.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/default_stream.hpp>

@@ -44,15 +42,13 @@ static void bench_find_string(nvbench::state& state)
auto const col = create_string_column(n_rows, row_width, hit_rate);
auto const input = cudf::strings_column_view(col->view());

std::vector<std::string> h_targets({"5W", "5W43", "0987 5W43"});
cudf::string_scalar target(h_targets[2]);
cudf::test::strings_column_wrapper targets(h_targets.begin(), h_targets.end());
cudf::string_scalar target("0987 5W43");

state.set_cuda_stream(nvbench::make_cuda_stream_view(stream.value()));
auto const chars_size = input.chars_size(stream);
state.add_element_count(chars_size, "chars_size");
state.add_global_memory_reads<nvbench::int8_t>(chars_size);
if (api.substr(0, 4) == "find") {
if (api == "find") {
state.add_global_memory_writes<nvbench::int32_t>(input.size());
} else {
state.add_global_memory_writes<nvbench::int8_t>(input.size());
@@ -61,10 +57,6 @@ static void bench_find_string(nvbench::state& state)
if (api == "find") {
state.exec(nvbench::exec_tag::sync,
[&](nvbench::launch& launch) { cudf::strings::find(input, target); });
} else if (api == "find_multi") {
state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
cudf::strings::find_multiple(input, cudf::strings_column_view(targets));
});
} else if (api == "contains") {
state.exec(nvbench::exec_tag::sync,
[&](nvbench::launch& launch) { cudf::strings::contains(input, target); });
@@ -79,7 +71,7 @@ static void bench_find_string(nvbench::state& state)

NVBENCH_BENCH(bench_find_string)
.set_name("find_string")
.add_string_axis("api", {"find", "find_multi", "contains", "starts_with", "ends_with"})
.add_string_axis("api", {"find", "contains", "starts_with", "ends_with"})
.add_int64_axis("row_width", {32, 64, 128, 256, 512, 1024})
.add_int64_axis("num_rows", {260'000, 1'953'000, 16'777'216})
.add_int64_axis("hit_rate", {20, 80}); // percentage
77 changes: 77 additions & 0 deletions cpp/benchmarks/string/find_multiple.cpp
@@ -0,0 +1,77 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <benchmarks/common/generate_input.hpp>
#include <benchmarks/fixture/benchmark_fixture.hpp>

#include <cudf_test/column_wrapper.hpp>

#include <cudf/strings/find.hpp>
#include <cudf/strings/find_multiple.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/default_stream.hpp>

#include <nvbench/nvbench.cuh>

static void bench_find_string(nvbench::state& state)
{
auto const n_rows = static_cast<cudf::size_type>(state.get_int64("num_rows"));
auto const row_width = static_cast<cudf::size_type>(state.get_int64("row_width"));
auto const hit_rate = static_cast<cudf::size_type>(state.get_int64("hit_rate"));
auto const target_count = static_cast<cudf::size_type>(state.get_int64("targets"));
auto const api = state.get_string("api");

auto const stream = cudf::get_default_stream();
auto const col = create_string_column(n_rows, row_width, hit_rate);
auto const input = cudf::strings_column_view(col->view());

// Note that these all match the first row of the raw_data in create_string_column.
// This is so the hit_rate can be properly accounted for.
std::vector<std::string> const target_data(
{" abc", "W43", "0987 5W43", "123 abc", "23 abc", "3 abc", "7 5W43", "87 5W43", "987 5W43"});
auto h_targets = std::vector<std::string>{};
for (cudf::size_type i = 0; i < target_count; i++) {
h_targets.emplace_back(target_data[i % target_data.size()]);
}
cudf::test::strings_column_wrapper targets(h_targets.begin(), h_targets.end());

state.set_cuda_stream(nvbench::make_cuda_stream_view(stream.value()));
auto const chars_size = input.chars_size(stream);
state.add_global_memory_reads<nvbench::int8_t>(chars_size);
if (api == "find") {
state.add_global_memory_writes<nvbench::int32_t>(input.size());
} else {
state.add_global_memory_writes<nvbench::int8_t>(input.size());
}

if (api == "find") {
state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
cudf::strings::find_multiple(input, cudf::strings_column_view(targets));
});
} else if (api == "contains") {
state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
cudf::strings::contains_multiple(input, cudf::strings_column_view(targets));
});
}
}

NVBENCH_BENCH(bench_find_string)
.set_name("find_multiple")
.add_string_axis("api", {"find", "contains"})
.add_int64_axis("targets", {10, 20, 40})
.add_int64_axis("row_width", {32, 64, 128, 256})
.add_int64_axis("num_rows", {32768, 262144, 2097152})
.add_int64_axis("hit_rate", {20, 80}); // percentage
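A note on the target list in this benchmark: `target_data` holds nine strings, while the `targets` axis requests 10, 20, or 40 of them, so the loop wraps around with a modulo and repeats entries. The host-only sketch below isolates that step. The helper name `build_targets` and the empty-input guard are assumptions added for illustration; they do not appear in the PR.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not in the PR): builds `target_count` targets by
// cycling through `target_data`, mirroring the benchmark's modulo loop.
std::vector<std::string> build_targets(std::vector<std::string> const& target_data,
                                       std::size_t target_count)
{
  std::vector<std::string> h_targets;
  // Guard added for this sketch only: avoid a modulo by zero.
  if (target_data.empty()) { return h_targets; }
  h_targets.reserve(target_count);
  for (std::size_t i = 0; i < target_count; ++i) {
    h_targets.emplace_back(target_data[i % target_data.size()]);
  }
  return h_targets;
}
```

With the nine strings above and a count of 10, the tenth target equals the first, so the larger axis values measure repeated targets rather than 40 distinct ones.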
40 changes: 37 additions & 3 deletions cpp/include/cudf/strings/find_multiple.hpp
@@ -28,8 +28,42 @@ namespace strings {
*/

/**
* @brief Returns a lists column with character position values where each
* of the target strings are found in each string.
* @brief Searches for the given target strings within each string in the provided column
*
* Each column in the result table corresponds to the result for the target string at the same
* ordinal. i.e. 0th column is the BOOL8 column result for the 0th target string, 1st for 1st,
* etc.
*
* If the target is not found for a string, false is returned for that entry in the output column.
* If the target is an empty string, true is returned for all non-null entries in the output column.
*
* Any null input strings return corresponding null entries in the output columns.
*
* @code{.pseudo}
* input = ["a", "b", "c"]
* targets = ["a", "c"]
* output is a table with two boolean columns:
* column 0: [true, false, false]
* column 1: [false, false, true]
* @endcode
*
* @throw std::invalid_argument if `targets` is empty or contains nulls
*
* @param input Strings instance for this operation
* @param targets UTF-8 encoded strings to search for in each string in `input`
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned column's device memory
* @return Table of BOOL8 columns
*/
std::unique_ptr<table> contains_multiple(
strings_column_view const& input,
strings_column_view const& targets,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
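To make the documented semantics concrete, here is a minimal host-side reference sketch. It is not the libcudf implementation, which runs on the GPU, and it is not part of this PR. The names `contains_multiple_reference` and `host_bool_column` are hypothetical, `std::nullopt` stands in for a null row, and targets are plain `std::string` values, so the "targets contains nulls" error case cannot arise here.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for one BOOL8 output column;
// std::nullopt represents a null entry.
using host_bool_column = std::vector<std::optional<bool>>;

// Reference sketch of the documented behavior: one output column per target,
// one entry per input row.
std::vector<host_bool_column> contains_multiple_reference(
  std::vector<std::optional<std::string>> const& input,
  std::vector<std::string> const& targets)
{
  // The doxygen states that an empty `targets` throws std::invalid_argument.
  if (targets.empty()) { throw std::invalid_argument("targets must not be empty"); }

  std::vector<host_bool_column> result;
  result.reserve(targets.size());
  for (auto const& target : targets) {
    host_bool_column column;
    column.reserve(input.size());
    for (auto const& row : input) {
      if (!row.has_value()) {
        // Null input rows produce null output entries.
        column.push_back(std::nullopt);
      } else {
        // An empty target is found at position 0, so it yields true.
        column.push_back(row->find(target) != std::string::npos);
      }
    }
    result.push_back(std::move(column));
  }
  return result;
}
```

For the input `["a", "b", "c"]` and targets `["a", "c"]` from the doxygen example, this sketch produces the same two columns listed there. A reference like this can be handy for checking expected values when writing tests, under the assumption that byte-wise substring matching agrees with the library for the data in question.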

/**
* @brief Searches for the given target strings within each string in the provided column
* and returns the positions where the targets were found
*
* The size of the output column is `input.size()`.
* Each row of the output column is of size `targets.size()`.
@@ -45,7 +79,7 @@ namespace strings {
* [-1,-1, 1 ]} // for "def": "a" and "b" not found, "e" at pos 1
* @endcode
*
* @throw cudf::logic_error if `targets` is empty or contains nulls
* @throw std::invalid_argument if `targets` is empty or contains nulls
*
* @param input Strings instance for this operation
* @param targets Strings to search for in each string