strings::contains() for multiple scalar search targets #16641

Closed
39 commits
f4924a9
strings::contains() for multiple search targets
mythrocks Apr 15, 2024
1022c83
string contains optimization
Aug 22, 2024
45170e9
Add benchmark test
Aug 22, 2024
7e2aa43
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Aug 23, 2024
32e1329
Fix comments
Aug 27, 2024
be6985b
Use new approach to improve perf: index the first chars in the targets
Aug 29, 2024
be7a1e2
Fix comments; Restore a test change
Aug 29, 2024
6b635f6
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Aug 29, 2024
479788c
Improve
Aug 29, 2024
543a1f6
Fix compile error
Aug 30, 2024
f1da8b0
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Aug 30, 2024
06ba14c
Update test cases; update benchmark tests
Aug 30, 2024
14418d7
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Aug 30, 2024
814e002
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 2, 2024
587ce34
Format code
Sep 2, 2024
470355f
Fix bug
Sep 2, 2024
4b41ead
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 4, 2024
e56a122
Fix comments
Sep 4, 2024
31f4822
Optimize warp parallel
Sep 5, 2024
88d351d
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 5, 2024
7836c33
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 6, 2024
6ae2c00
Split targets to small groups to save shared memory when num of targe…
Sep 6, 2024
542e1ff
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 9, 2024
ab5ef90
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 10, 2024
da1d92b
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 11, 2024
3324671
Fix bug when strings are long: returns all falses.
Sep 11, 2024
849c093
Format code
Sep 11, 2024
85e8b17
Refactor: refine code comments
Sep 11, 2024
ce4450d
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 14, 2024
9fc9398
Fix bug: illegal memory access
Sep 14, 2024
b33d692
Fix bug in split logic
Sep 14, 2024
6741bef
Optimize the perf for indexing first chars
Sep 14, 2024
330e828
Fix comments from code review
Sep 14, 2024
d216993
Fix compile error
Sep 14, 2024
eb6744f
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 18, 2024
a32c54d
Fix bugs; update tests
Sep 18, 2024
8391239
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 18, 2024
5caf782
Update
Sep 18, 2024
41fb9ae
Merge branch 'branch-24.10' into multi-string-contains-review
res-life Sep 19, 2024
24 changes: 23 additions & 1 deletion cpp/benchmarks/string/find.cpp
@@ -73,6 +73,27 @@ static void bench_find_string(nvbench::state& state)
} else if (api == "contains") {
state.exec(nvbench::exec_tag::sync,
[&](nvbench::launch& launch) { cudf::strings::contains(input, target); });
} else if (api == "multi-contains") {
constexpr int iters = 10;
std::vector<std::string> match_targets({" abc",
"W43",
"0987 5W43",
"123 abc",
"23 abc",
"3 abc",
"é",
"7 5W43",
"87 5W43",
"987 5W43"});
auto multi_targets = std::vector<std::string>{};
for (int i = 0; i < iters; i++) {
multi_targets.emplace_back(match_targets[i % match_targets.size()]);
}
state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
cudf::test::strings_column_wrapper multi_targets_column(multi_targets.begin(),
multi_targets.end());
cudf::strings::multi_contains(input, cudf::strings_column_view(multi_targets_column));
});
} else if (api == "starts_with") {
state.exec(nvbench::exec_tag::sync,
[&](nvbench::launch& launch) { cudf::strings::starts_with(input, target); });
@@ -84,7 +105,8 @@ static void bench_find_string(nvbench::state& state)

NVBENCH_BENCH(bench_find_string)
.set_name("find_string")
.add_string_axis("api", {"find", "find_multi", "contains", "starts_with", "ends_with"})
.add_string_axis("api",
{"find", "find_multi", "contains", "starts_with", "ends_with", "multi-contains"})
.add_int64_axis("row_width", {32, 64, 128, 256, 512, 1024})
.add_int64_axis("num_rows", {260'000, 1'953'000, 16'777'216})
.add_int64_axis("hit_rate", {20, 80}); // percentage
33 changes: 33 additions & 0 deletions cpp/include/cudf/strings/find.hpp
@@ -138,6 +138,39 @@ std::unique_ptr<column> contains(
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
* @brief Returns a table of columns of boolean values for each string where true indicates
* the target string was found within that string in the provided column.
*
* Each column in the result table corresponds to the result for the target string at the same
* ordinal, i.e. the 0th column is the boolean-column result for the 0th target string, the 1st
* column for the 1st target string, etc.
*
* If the target is not found for a string, false is returned for that entry in the output column.
* If the target is an empty string, true is returned for all non-null entries in the output column.
*
* Any null string entries return corresponding null entries in the output columns.
* e.g.:
* @code
* input: "a", "b", "c"
* targets: "a", "c"
* output is a table with two boolean columns:
* column_0: true, false, false
* column_1: false, false, true
* @endcode
*
* @param input Strings instance for this operation
* @param targets UTF-8 encoded strings to search for in each string in `input`
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned table's device memory
* @return Table of BOOL8 columns, one column per target string
*/
std::unique_ptr<table> multi_contains(
strings_column_view const& input,
strings_column_view const& targets,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a column of boolean values for each string where true indicates
* the corresponding target string was found within that string in the provided column.
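
The semantics documented for `multi_contains` in the hunk above (one boolean column per target, nulls propagated, an empty target matching every non-null row) can be illustrated with a small host-side reference. This is only a sketch under stated assumptions: `multi_contains_reference` and `bool_column` are hypothetical names that do not exist in cudf, `std::optional` stands in for cudf's null mask, and the loop is a plain CPU scan rather than the GPU kernel this PR adds.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// One output column: one entry per input row; std::nullopt models a null row.
// (Hypothetical type for illustration, not a cudf type.)
using bool_column = std::vector<std::optional<bool>>;

// Hypothetical host-side reference for the documented behavior.
// Returns one column per target; column t, row r is true when targets[t]
// occurs anywhere inside input[r]. Null input rows yield null output entries.
std::vector<bool_column> multi_contains_reference(
  std::vector<std::optional<std::string>> const& input,
  std::vector<std::string> const& targets)
{
  std::vector<bool_column> result;
  result.reserve(targets.size());
  for (auto const& target : targets) {
    bool_column column;
    column.reserve(input.size());
    for (auto const& row : input) {
      if (!row.has_value()) {
        // Null in, null out.
        column.push_back(std::nullopt);
      } else {
        // std::string::find with an empty needle returns 0, so an empty
        // target matches every non-null row, as the doc comment states.
        column.push_back(row->find(target) != std::string::npos);
      }
    }
    result.push_back(std::move(column));
  }
  return result;
}
```

Calling it with the input `"a", "b", "c"` and the targets `"a", "c"` from the doc comment yields two columns matching `column_0` and `column_1` there. A reference of this kind is mainly useful for cross-checking results in tests; it says nothing about how the device implementation organizes its search.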