Migrate hashing operations to pylibcudf #15418

Merged: 69 commits, Oct 31, 2024

Commits
968aef5  hashing - initial  (brandon-b-miller, Apr 1, 2024)
f4c953c  minor cleanup for now  (brandon-b-miller, Apr 1, 2024)
eeee5ee  add hash top level function  (brandon-b-miller, Apr 1, 2024)
0576423  begin tests  (brandon-b-miller, Apr 2, 2024)
306ad1d  some untested code worth saving  (brandon-b-miller, Apr 3, 2024)
a15dd45  tests run and fail  (brandon-b-miller, Apr 3, 2024)
30b0f2b  todo  (brandon-b-miller, Apr 3, 2024)
eeb4edb  Apply suggestions from code review  (brandon-b-miller, Apr 5, 2024)
6762279  remove hash_id  (brandon-b-miller, Apr 5, 2024)
8ab4afa  docs  (brandon-b-miller, Apr 5, 2024)
ccf64d4  small lint  (brandon-b-miller, Apr 5, 2024)
9ce384a  add DEFAULT_HASH_SEED from hpp  (brandon-b-miller, Apr 5, 2024)
bed5792  fix xxhash_64_string test  (brandon-b-miller, Apr 5, 2024)
992cba1  raise for unimplemented hash test functions on the python side for now  (brandon-b-miller, Apr 8, 2024)
b61d2cc  Merge branch 'branch-24.06' into pylibcudf-hashing  (brandon-b-miller, Apr 10, 2024)
2977b63  fix up some tests  (brandon-b-miller, Apr 10, 2024)
c0bb09a  separate md5 test  (brandon-b-miller, Apr 10, 2024)
874f317  cleanup  (brandon-b-miller, Apr 11, 2024)
edcde76  more cleanup  (brandon-b-miller, Apr 11, 2024)
3a442cb  begin hashing tests  (brandon-b-miller, Apr 12, 2024)
6841de9  Merge branch 'branch-24.06' into pylibcudf-hashing  (brandon-b-miller, Apr 15, 2024)
eef3616  fix up murmurhash3_x64_128 test, list struct error sha test  (brandon-b-miller, Apr 16, 2024)
ab5870d  add mmh3_x86_32 tests that currently fail  (brandon-b-miller, Apr 16, 2024)
41c0ae6  Apply suggestions from code review  (brandon-b-miller, May 2, 2024)
3fdd04f  Merge branch 'branch-24.06' into pylibcudf-hashing  (brandon-b-miller, May 2, 2024)
774b093  update cpp errors  (brandon-b-miller, May 2, 2024)
2f131b3  address some reviews  (brandon-b-miller, May 3, 2024)
af6e59d  uncomment xxhash_64  (brandon-b-miller, May 3, 2024)
ee56145  Merge branch 'branch-24.06' into pylibcudf-hashing  (brandon-b-miller, May 3, 2024)
2e6743e  add mmh3 to test_python_cudf  (brandon-b-miller, May 3, 2024)
8d8bef9  fix murmurhash3_x86_32  (brandon-b-miller, May 3, 2024)
5930c7a  add xxhash testing dependency  (brandon-b-miller, May 3, 2024)
cedc89f  depandasify  (brandon-b-miller, May 3, 2024)
840f3e4  Merge branch 'branch-24.06' into pylibcudf-hashing  (brandon-b-miller, May 15, 2024)
17e63eb  merge latest/resolve conflicts/fix  (brandon-b-miller, May 16, 2024)
751e5f3  fix pylibcudf tests  (brandon-b-miller, May 16, 2024)
642b444  update dependencies  (brandon-b-miller, May 16, 2024)
cbeb9f9  linting  (brandon-b-miller, May 16, 2024)
174af9d  refactor  (brandon-b-miller, May 16, 2024)
24c72a3  Merge branch 'branch-24.06' into pylibcudf-hashing  (brandon-b-miller, May 22, 2024)
9f5355f  merge latest/resolve conflicts  (brandon-b-miller, Jun 28, 2024)
3d10495  Merge branch 'branch-24.08' into pylibcudf-hashing  (brandon-b-miller, Jul 3, 2024)
37a91bf  debug commit  (brandon-b-miller, Jul 8, 2024)
a5ec407  merged but not building yet  (brandon-b-miller, Sep 30, 2024)
bab6bb5  merge/resolve  (brandon-b-miller, Oct 2, 2024)
b406b41  small updates  (brandon-b-miller, Oct 2, 2024)
d730b6f  refactor/pass, missing a few tests  (brandon-b-miller, Oct 4, 2024)
751a89c  extra test  (brandon-b-miller, Oct 4, 2024)
d99b5cf  Merge branch 'branch-24.12' into pylibcudf-hashing  (brandon-b-miller, Oct 14, 2024)
68c7a49  missing test  (brandon-b-miller, Oct 14, 2024)
370405b  Merge branch 'branch-24.12' into pylibcudf-hashing  (brandon-b-miller, Oct 17, 2024)
cdd41db  prune moves  (brandon-b-miller, Oct 17, 2024)
2dc49b2  Merge branch 'branch-24.12' into pylibcudf-hashing  (brandon-b-miller, Oct 18, 2024)
a6ded88  fixes  (brandon-b-miller, Oct 18, 2024)
382c2dc  Update docs/cudf/source/user_guide/api_docs/pylibcudf/hashing.rst  (brandon-b-miller, Oct 22, 2024)
a0f9d07  Merge branch 'branch-24.12' into pylibcudf-hashing  (brandon-b-miller, Oct 22, 2024)
a55048a  Apply suggestions from code review  (brandon-b-miller, Oct 22, 2024)
23cd5fe  combine sha/md5 tests  (brandon-b-miller, Oct 22, 2024)
1a4cfad  struct and list tests, struct still fails  (brandon-b-miller, Oct 27, 2024)
4c37de9  pass.  (brandon-b-miller, Oct 27, 2024)
5423074  clean  (brandon-b-miller, Oct 27, 2024)
f0ec39b  latest  (brandon-b-miller, Oct 27, 2024)
46b27a1  style  (brandon-b-miller, Oct 28, 2024)
60e5c4c  Update python/pylibcudf/pylibcudf/tests/conftest.py  (brandon-b-miller, Oct 28, 2024)
dcd38c6  update docstrings  (brandon-b-miller, Oct 28, 2024)
5d1c2f1  Merge branch 'branch-24.12' into pylibcudf-hashing  (brandon-b-miller, Oct 29, 2024)
7f3157b  enforce uint32  (brandon-b-miller, Oct 30, 2024)
08b8818  adjust doxygen tags  (brandon-b-miller, Oct 30, 2024)
d0234d4  doc fixes  (brandon-b-miller, Oct 31, 2024)
2 changes: 2 additions & 0 deletions conda/environments/all_cuda-118_arch-x86_64.yaml
@@ -46,6 +46,7 @@ dependencies:
- librdkafka>=2.5.0,<2.6.0a0
- librmm==24.12.*,>=0.0.0a0
- make
- mmh3
- moto>=4.0.8
- msgpack-python
- myst-nb
@@ -76,6 +77,7 @@ dependencies:
- pytest-xdist
- pytest<8
- python-confluent-kafka>=2.5.0,<2.6.0a0
- python-xxhash
- python>=3.10,<3.13
- pytorch>=2.1.0
- rapids-build-backend>=0.3.0,<0.4.0.dev0
2 changes: 2 additions & 0 deletions conda/environments/all_cuda-125_arch-x86_64.yaml
@@ -45,6 +45,7 @@ dependencies:
- librdkafka>=2.5.0,<2.6.0a0
- librmm==24.12.*,>=0.0.0a0
- make
- mmh3
- moto>=4.0.8
- msgpack-python
- myst-nb
@@ -74,6 +75,7 @@ dependencies:
- pytest-xdist
- pytest<8
- python-confluent-kafka>=2.5.0,<2.6.0a0
- python-xxhash
- python>=3.10,<3.13
- pytorch>=2.1.0
- rapids-build-backend>=0.3.0,<0.4.0.dev0
15 changes: 8 additions & 7 deletions cpp/include/cudf/hashing.hpp
@@ -22,12 +22,6 @@

namespace CUDF_EXPORT cudf {

/**
* @addtogroup column_hash
* @{
* @file
*/

/**
* @brief Type of hash value
*
@@ -42,6 +36,12 @@ static constexpr uint32_t DEFAULT_HASH_SEED = 0;
//! Hash APIs
namespace hashing {

/**
* @addtogroup column_hash
* @{
* @file
*/

/**
* @brief Computes the MurmurHash3 32-bit hash value of each row in the given table
*
@@ -183,7 +183,8 @@ std::unique_ptr<column> xxhash_64(
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/** @} */ // end of group

} // namespace hashing

/** @} */ // end of group
} // namespace CUDF_EXPORT cudf
3 changes: 2 additions & 1 deletion cpp/src/hash/md5_hash.cu
@@ -302,7 +302,8 @@ std::unique_ptr<column> md5(table_view const& input,
}
return md5_leaf_type_check(col.type());
}),
"Unsupported column type for hash function.");
"Unsupported column type for hash function.",
cudf::data_type_error);

// Digest size in bytes
auto constexpr digest_size = 32;
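The `digest_size = 32` in the context line above matches the width of an MD5 result written as a hexadecimal string (a raw MD5 digest is 16 bytes). The following is a minimal sketch using Python's standard `hashlib`, not libcudf, that prints the hex-string width for each digest algorithm this PR exposes. `ALGORITHMS` and `hex_digest_width` are hypothetical names, and reading the constant as a hex-character count is an inference from the value, not something this diff states.

```python
import hashlib

# Digest algorithms named in this PR; hashlib is used only as a CPU stand-in.
ALGORITHMS = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")


def hex_digest_width(algorithm: str) -> int:
    """Return the number of hex characters in one digest (two per raw byte)."""
    return 2 * hashlib.new(algorithm).digest_size


for name in ALGORITHMS:
    print(f"{name}: {hex_digest_width(name)} hex characters")
```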
3 changes: 2 additions & 1 deletion cpp/src/hash/sha_hash.cuh
@@ -513,7 +513,8 @@ std::unique_ptr<column> sha_hash(table_view const& input,
CUDF_EXPECTS(
std::all_of(
input.begin(), input.end(), [](auto const& col) { return sha_leaf_type_check(col.type()); }),
"Unsupported column type for hash function.");
"Unsupported column type for hash function.",
cudf::data_type_error);

// Result column allocation and creation
auto begin = thrust::make_constant_iterator(Hasher::digest_size);
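The `CUDF_EXPECTS(std::all_of(...))` guard in this hunk rejects the whole table if any column fails the leaf type check, and the change makes it raise `cudf::data_type_error`. A rough Python analogue of that guard is sketched below. `SUPPORTED_LEAF_TYPES` and `check_hashable` are hypothetical, the type names are placeholders rather than libcudf's actual list, and using `TypeError` as the Python-side counterpart is an assumption this diff does not confirm.

```python
# Hypothetical subset of leaf types; the real check lives in sha_leaf_type_check.
SUPPORTED_LEAF_TYPES = {"int32", "int64", "float64", "string"}


def check_hashable(column_types):
    """Raise if any column type is unsupported, mirroring the all_of guard."""
    if not all(t in SUPPORTED_LEAF_TYPES for t in column_types):
        # TypeError stands in for cudf::data_type_error (assumed mapping).
        raise TypeError("Unsupported column type for hash function.")


check_hashable(["int32", "string"])  # every column passes, nothing is raised
```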
4 changes: 2 additions & 2 deletions cpp/tests/hashing/sha1_test.cpp
@@ -137,7 +137,7 @@ TEST_F(SHA1HashTest, ListsUnsupported)

auto const input = cudf::table_view({strings_list_col});

EXPECT_THROW(cudf::hashing::sha1(input), cudf::logic_error);
EXPECT_THROW(cudf::hashing::sha1(input), cudf::data_type_error);
}

TEST_F(SHA1HashTest, StructsUnsupported)
@@ -146,7 +146,7 @@ TEST_F(SHA1HashTest, StructsUnsupported)
auto struct_col = cudf::test::structs_column_wrapper{{child_col}};
auto const input = cudf::table_view({struct_col});

EXPECT_THROW(cudf::hashing::sha1(input), cudf::logic_error);
EXPECT_THROW(cudf::hashing::sha1(input), cudf::data_type_error);
}

template <typename T>
4 changes: 2 additions & 2 deletions cpp/tests/hashing/sha224_test.cpp
@@ -137,7 +137,7 @@ TEST_F(SHA224HashTest, ListsUnsupported)

auto const input = cudf::table_view({strings_list_col});

EXPECT_THROW(cudf::hashing::sha224(input), cudf::logic_error);
EXPECT_THROW(cudf::hashing::sha224(input), cudf::data_type_error);
}

TEST_F(SHA224HashTest, StructsUnsupported)
@@ -146,7 +146,7 @@ TEST_F(SHA224HashTest, StructsUnsupported)
auto struct_col = cudf::test::structs_column_wrapper{{child_col}};
auto const input = cudf::table_view({struct_col});

EXPECT_THROW(cudf::hashing::sha224(input), cudf::logic_error);
EXPECT_THROW(cudf::hashing::sha224(input), cudf::data_type_error);
}

template <typename T>
4 changes: 2 additions & 2 deletions cpp/tests/hashing/sha256_test.cpp
@@ -136,7 +136,7 @@ TEST_F(SHA256HashTest, ListsUnsupported)

auto const input = cudf::table_view({strings_list_col});

EXPECT_THROW(cudf::hashing::sha256(input), cudf::logic_error);
EXPECT_THROW(cudf::hashing::sha256(input), cudf::data_type_error);
}

TEST_F(SHA256HashTest, StructsUnsupported)
@@ -145,7 +145,7 @@ TEST_F(SHA256HashTest, StructsUnsupported)
auto struct_col = cudf::test::structs_column_wrapper{{child_col}};
auto const input = cudf::table_view({struct_col});

EXPECT_THROW(cudf::hashing::sha256(input), cudf::logic_error);
EXPECT_THROW(cudf::hashing::sha256(input), cudf::data_type_error);
}

template <typename T>
4 changes: 2 additions & 2 deletions cpp/tests/hashing/sha384_test.cpp
@@ -155,7 +155,7 @@ TEST_F(SHA384HashTest, ListsUnsupported)

auto const input = cudf::table_view({strings_list_col});

EXPECT_THROW(cudf::hashing::sha384(input), cudf::logic_error);
EXPECT_THROW(cudf::hashing::sha384(input), cudf::data_type_error);
}

TEST_F(SHA384HashTest, StructsUnsupported)
@@ -164,7 +164,7 @@ TEST_F(SHA384HashTest, StructsUnsupported)
auto struct_col = cudf::test::structs_column_wrapper{{child_col}};
auto const input = cudf::table_view({struct_col});

EXPECT_THROW(cudf::hashing::sha384(input), cudf::logic_error);
EXPECT_THROW(cudf::hashing::sha384(input), cudf::data_type_error);
}

template <typename T>
4 changes: 2 additions & 2 deletions cpp/tests/hashing/sha512_test.cpp
@@ -155,7 +155,7 @@ TEST_F(SHA512HashTest, ListsUnsupported)

auto const input = cudf::table_view({strings_list_col});

EXPECT_THROW(cudf::hashing::sha512(input), cudf::logic_error);
EXPECT_THROW(cudf::hashing::sha512(input), cudf::data_type_error);
}

TEST_F(SHA512HashTest, StructsUnsupported)
@@ -164,7 +164,7 @@ TEST_F(SHA512HashTest, StructsUnsupported)
auto struct_col = cudf::test::structs_column_wrapper{{child_col}};
auto const input = cudf::table_view({struct_col});

EXPECT_THROW(cudf::hashing::sha512(input), cudf::logic_error);
EXPECT_THROW(cudf::hashing::sha512(input), cudf::data_type_error);
}

template <typename T>
5 changes: 4 additions & 1 deletion dependencies.yaml
@@ -828,6 +828,7 @@ dependencies:
- pytest-benchmark
- pytest-cases>=3.8.2
- scipy
- mmh3
- output_types: conda
packages:
- aiobotocore>=2.2.0
@@ -836,12 +837,14 @@ dependencies:
- msgpack-python
- moto>=4.0.8
- s3fs>=2022.3.0
- output_types: pyproject
- python-xxhash
- output_types: [pyproject, requirements]
packages:
- msgpack
- &tokenizers tokenizers==0.15.2
- &transformers transformers==4.39.3
- tzdata
- xxhash
specific:
- output_types: [conda, requirements]
matrices:
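The hunks above add `mmh3` and `xxhash` (`python-xxhash` on conda) as test dependencies. Presumably they serve as CPU reference implementations for the new hashing tests, though the test files are not part of this excerpt. A minimal sketch of how such reference values could be produced follows. `cpu_reference_hashes` is a hypothetical helper, the imports are optional so the snippet still runs where the packages are absent, and whether these values match libcudf's output for a given column type is not established here.

```python
def cpu_reference_hashes(text: str, seed: int = 0) -> dict:
    """Compute CPU-side hashes of one string with whichever libraries are installed."""
    data = text.encode("utf-8")
    results = {}
    try:
        import mmh3  # third-party; added as a test dependency in this PR

        # signed=False returns the 32-bit hash as an unsigned integer.
        results["murmurhash3_x86_32"] = mmh3.hash(data, seed, signed=False)
    except ImportError:
        pass
    try:
        import xxhash  # third-party; added as a test dependency in this PR

        results["xxhash_64"] = xxhash.xxh64(data, seed=seed).intdigest()
    except ImportError:
        pass
    return results


print(cpu_reference_hashes("abc"))
```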
6 changes: 6 additions & 0 deletions docs/cudf/source/user_guide/api_docs/pylibcudf/hashing.rst
@@ -0,0 +1,6 @@
=======
hashing
=======

.. automodule:: pylibcudf.hashing
:members:
1 change: 1 addition & 0 deletions docs/cudf/source/user_guide/api_docs/pylibcudf/index.rst
@@ -19,6 +19,7 @@ This page provides API documentation for pylibcudf.
filling
gpumemoryview
groupby
hashing
interop
join
json
57 changes: 18 additions & 39 deletions python/cudf/cudf/_lib/hash.pyx
@@ -1,27 +1,12 @@
# Copyright (c) 2020-2024, NVIDIA CORPORATION.

from cudf.core.buffer import acquire_spill_lock
import pylibcudf as plc

from libcpp.memory cimport unique_ptr
from libcpp.utility cimport move
from cudf.core.buffer import acquire_spill_lock

from pylibcudf.libcudf.column.column cimport column
from pylibcudf.libcudf.hash cimport (
md5,
murmurhash3_x86_32,
sha1,
sha224,
sha256,
sha384,
sha512,
xxhash_64,
)
from pylibcudf.libcudf.table.table_view cimport table_view
from pylibcudf.table cimport Table

from cudf._lib.column cimport Column
from cudf._lib.utils cimport table_view_from_columns

import pylibcudf as plc


@acquire_spill_lock()
@@ -37,32 +22,26 @@ def hash_partition(list source_columns, list columns_to_hash,

@acquire_spill_lock()
def hash(list source_columns, str method, int seed=0):
cdef table_view c_source_view = table_view_from_columns(source_columns)
cdef unique_ptr[column] c_result
cdef Table ctbl = Table(
[c.to_pylibcudf(mode="read") for c in source_columns]
)
if method == "murmur3":
with nogil:
c_result = move(murmurhash3_x86_32(c_source_view, seed))
return Column.from_pylibcudf(plc.hashing.murmurhash3_x86_32(ctbl, seed))
elif method == "xxhash64":
return Column.from_pylibcudf(plc.hashing.xxhash_64(ctbl, seed))
elif method == "md5":
with nogil:
c_result = move(md5(c_source_view))
return Column.from_pylibcudf(plc.hashing.md5(ctbl))
elif method == "sha1":
with nogil:
c_result = move(sha1(c_source_view))
return Column.from_pylibcudf(plc.hashing.sha1(ctbl))
elif method == "sha224":
with nogil:
c_result = move(sha224(c_source_view))
return Column.from_pylibcudf(plc.hashing.sha224(ctbl))
elif method == "sha256":
with nogil:
c_result = move(sha256(c_source_view))
return Column.from_pylibcudf(plc.hashing.sha256(ctbl))
elif method == "sha384":
with nogil:
c_result = move(sha384(c_source_view))
return Column.from_pylibcudf(plc.hashing.sha384(ctbl))
elif method == "sha512":
with nogil:
c_result = move(sha512(c_source_view))
elif method == "xxhash64":
with nogil:
c_result = move(xxhash_64(c_source_view, seed))
return Column.from_pylibcudf(plc.hashing.sha512(ctbl))
else:
raise ValueError(f"Unsupported hash function: {method}")
return Column.from_unique_ptr(move(c_result))
raise ValueError(
f"Unsupported hashing algorithm {method}."
)
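The rewritten `hash` function maps a method name to a `plc.hashing` call through an `if`/`elif` chain and raises `ValueError` for unknown names. The same dispatch can be sketched as a lookup table. The version below is an illustration only: it uses `hashlib` stand-ins so it runs without a GPU, covers only the digest algorithms, and `_HASHERS` and `hash_value` are hypothetical names that do not appear in the PR.

```python
import hashlib
from typing import Callable, Dict

# Stand-ins for the GPU functions: each entry hashes bytes with hashlib.
_HASHERS: Dict[str, Callable[[bytes], str]] = {
    name: (lambda data, _name=name: hashlib.new(_name, data).hexdigest())
    for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")
}


def hash_value(data: bytes, method: str) -> str:
    """Look up the hasher by name; unknown names raise ValueError as in the diff."""
    try:
        hasher = _HASHERS[method]
    except KeyError:
        raise ValueError(f"Unsupported hashing algorithm {method}.") from None
    return hasher(data)


print(hash_value(b"", "md5"))
```

A table keeps the supported names in one place, at the cost of hiding that the seeded functions (`murmurhash3_x86_32`, `xxhash_64`) take an extra argument, which the explicit chain in the diff makes visible.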
2 changes: 2 additions & 0 deletions python/cudf/pyproject.toml
@@ -53,6 +53,7 @@ test = [
"cramjam",
"fastavro>=0.22.9",
"hypothesis",
"mmh3",
"msgpack",
"pytest-benchmark",
"pytest-cases>=3.8.2",
@@ -63,6 +64,7 @@ test = [
"tokenizers==0.15.2",
"transformers==4.39.3",
"tzdata",
"xxhash",
] # This list was generated by `rapids-dependency-file-generator`. To make changes, edit ../../dependencies.yaml and run `rapids-dependency-file-generator`.
pandas-tests = [
"ipython",
1 change: 1 addition & 0 deletions python/pylibcudf/pylibcudf/CMakeLists.txt
@@ -26,6 +26,7 @@ set(cython_sources
filling.pyx
gpumemoryview.pyx
groupby.pyx
hashing.pyx
interop.pyx
join.pyx
json.pyx
2 changes: 2 additions & 0 deletions python/pylibcudf/pylibcudf/__init__.pxd
@@ -13,6 +13,7 @@ from . cimport (
expressions,
filling,
groupby,
hashing,
interop,
join,
json,
@@ -63,6 +64,7 @@ __all__ = [
"filling",
"gpumemoryview",
"groupby",
"hashing",
"interop",
"join",
"json",
2 changes: 2 additions & 0 deletions python/pylibcudf/pylibcudf/__init__.py
@@ -22,6 +22,7 @@
expressions,
filling,
groupby,
hashing,
interop,
io,
join,
@@ -73,6 +74,7 @@
"filling",
"gpumemoryview",
"groupby",
"hashing",
"interop",
"io",
"join",
30 changes: 30 additions & 0 deletions python/pylibcudf/pylibcudf/hashing.pxd
@@ -0,0 +1,30 @@
# Copyright (c) 2024, NVIDIA CORPORATION.

from libc.stdint cimport uint32_t, uint64_t

from .column cimport Column
from .table cimport Table


cpdef Column murmurhash3_x86_32(
Table input,
uint32_t seed=*
)

cpdef Table murmurhash3_x64_128(
Table input,
uint64_t seed=*
)


cpdef Column xxhash_64(
Table input,
uint64_t seed=*
)

cpdef Column md5(Table input)
cpdef Column sha1(Table input)
cpdef Column sha224(Table input)
cpdef Column sha256(Table input)
cpdef Column sha384(Table input)
cpdef Column sha512(Table input)
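The declarations above give `murmurhash3_x86_32` a `uint32_t` seed and give `murmurhash3_x64_128` and `xxhash_64` a `uint64_t` seed. A Python caller therefore has to pass a non-negative integer that fits the declared width; Cython normally raises `OverflowError` when converting an out-of-range Python `int` to a fixed-width unsigned C type. The pure-Python sketch below mimics that range check. `check_seed` is a hypothetical helper, not part of pylibcudf.

```python
def check_seed(seed: int, bits: int) -> int:
    """Return seed unchanged if it fits an unsigned integer of the given width."""
    limit = (1 << bits) - 1
    if not 0 <= seed <= limit:
        raise OverflowError(f"seed {seed} does not fit in an unsigned {bits}-bit integer")
    return seed


check_seed(2**32 - 1, 32)  # largest value a uint32_t seed can hold
check_seed(2**32, 64)      # too large for 32 bits, fine for a uint64_t seed
```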