Migrate hashing operations to pylibcudf #15418

Merged
merged 69 commits into from
Oct 31, 2024
Changes from 7 commits
Commits
968aef5
hashing - initial
brandon-b-miller Apr 1, 2024
f4c953c
minor cleanup for now
brandon-b-miller Apr 1, 2024
eeee5ee
add hash top level function
brandon-b-miller Apr 1, 2024
0576423
begin tests
brandon-b-miller Apr 2, 2024
306ad1d
some untested code worth saving
brandon-b-miller Apr 3, 2024
a15dd45
tests run and fail
brandon-b-miller Apr 3, 2024
30b0f2b
todo
brandon-b-miller Apr 3, 2024
eeb4edb
Apply suggestions from code review
brandon-b-miller Apr 5, 2024
6762279
remove hash_id
brandon-b-miller Apr 5, 2024
8ab4afa
docs
brandon-b-miller Apr 5, 2024
ccf64d4
small lint
brandon-b-miller Apr 5, 2024
9ce384a
add DEFAULT_HASH_SEED from hpp
brandon-b-miller Apr 5, 2024
bed5792
fix xxhash_64_string test
brandon-b-miller Apr 5, 2024
992cba1
raise for unimplemented hash test functions on the python side for now
brandon-b-miller Apr 8, 2024
b61d2cc
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller Apr 10, 2024
2977b63
fix up some tests
brandon-b-miller Apr 10, 2024
c0bb09a
separate md5 test
brandon-b-miller Apr 10, 2024
874f317
cleanup
brandon-b-miller Apr 11, 2024
edcde76
more cleanup
brandon-b-miller Apr 11, 2024
3a442cb
begin hashing tests
brandon-b-miller Apr 12, 2024
6841de9
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller Apr 15, 2024
eef3616
fix up murmurhash3_x64_128 test, list struct error sha test
brandon-b-miller Apr 16, 2024
ab5870d
add mmh3_x86_32 tests that currently fail
brandon-b-miller Apr 16, 2024
41c0ae6
Apply suggestions from code review
brandon-b-miller May 2, 2024
3fdd04f
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller May 2, 2024
774b093
update cpp errors
brandon-b-miller May 2, 2024
2f131b3
address some reviews
brandon-b-miller May 3, 2024
af6e59d
uncomment xxhash_64
brandon-b-miller May 3, 2024
ee56145
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller May 3, 2024
2e6743e
add mmh3 to test_python_cudf
brandon-b-miller May 3, 2024
8d8bef9
fix murmurhash3_x86_32
brandon-b-miller May 3, 2024
5930c7a
add xxhash testing dependency
brandon-b-miller May 3, 2024
cedc89f
depandasify
brandon-b-miller May 3, 2024
840f3e4
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller May 15, 2024
17e63eb
merge latest/resolve conflicts/fix
brandon-b-miller May 16, 2024
751e5f3
fix pylibcudf tests
brandon-b-miller May 16, 2024
642b444
update dependencies
brandon-b-miller May 16, 2024
cbeb9f9
linting
brandon-b-miller May 16, 2024
174af9d
refactor
brandon-b-miller May 16, 2024
24c72a3
Merge branch 'branch-24.06' into pylibcudf-hashing
brandon-b-miller May 22, 2024
9f5355f
merge latest/resolve conflicts
brandon-b-miller Jun 28, 2024
3d10495
Merge branch 'branch-24.08' into pylibcudf-hashing
brandon-b-miller Jul 3, 2024
37a91bf
debug commit
brandon-b-miller Jul 8, 2024
a5ec407
merged but not building yet
brandon-b-miller Sep 30, 2024
bab6bb5
merge/resolve
brandon-b-miller Oct 2, 2024
b406b41
small updates
brandon-b-miller Oct 2, 2024
d730b6f
refactor/pass, missing a few tests
brandon-b-miller Oct 4, 2024
751a89c
extra test
brandon-b-miller Oct 4, 2024
d99b5cf
Merge branch 'branch-24.12' into pylibcudf-hashing
brandon-b-miller Oct 14, 2024
68c7a49
missing test
brandon-b-miller Oct 14, 2024
370405b
Merge branch 'branch-24.12' into pylibcudf-hashing
brandon-b-miller Oct 17, 2024
cdd41db
prune moves
brandon-b-miller Oct 17, 2024
2dc49b2
Merge branch 'branch-24.12' into pylibcudf-hashing
brandon-b-miller Oct 18, 2024
a6ded88
fixes
brandon-b-miller Oct 18, 2024
382c2dc
Update docs/cudf/source/user_guide/api_docs/pylibcudf/hashing.rst
brandon-b-miller Oct 22, 2024
a0f9d07
Merge branch 'branch-24.12' into pylibcudf-hashing
brandon-b-miller Oct 22, 2024
a55048a
Apply suggestions from code review
brandon-b-miller Oct 22, 2024
23cd5fe
combine sha/md5 tests
brandon-b-miller Oct 22, 2024
1a4cfad
struct and list tests, struct still fails
brandon-b-miller Oct 27, 2024
4c37de9
pass.
brandon-b-miller Oct 27, 2024
5423074
clean
brandon-b-miller Oct 27, 2024
f0ec39b
latest
brandon-b-miller Oct 27, 2024
46b27a1
style
brandon-b-miller Oct 28, 2024
60e5c4c
Update python/pylibcudf/pylibcudf/tests/conftest.py
brandon-b-miller Oct 28, 2024
dcd38c6
update docstrings
brandon-b-miller Oct 28, 2024
5d1c2f1
Merge branch 'branch-24.12' into pylibcudf-hashing
brandon-b-miller Oct 29, 2024
7f3157b
enforce uint32
brandon-b-miller Oct 30, 2024
08b8818
adjust doxygen tags
brandon-b-miller Oct 30, 2024
d0234d4
doc fixes
brandon-b-miller Oct 31, 2024
5 changes: 1 addition & 4 deletions dependencies.yaml
@@ -838,16 +838,13 @@ dependencies:
           - moto>=4.0.8
           - s3fs>=2022.3.0
           - python-xxhash
-      - output_types: pyproject
+      - output_types: [pyproject, requirements]
         packages:
           - msgpack
           - &tokenizers tokenizers==0.15.2
           - &transformers transformers==4.39.3
           - tzdata
           - xxhash
-      - output_types: requirements
-        packages:
-          - xxhash
     specific:
       - output_types: [conda, requirements]
         matrices:
2 changes: 2 additions & 0 deletions python/pylibcudf/pylibcudf/__init__.pxd
@@ -14,6 +14,7 @@ from . cimport (
     filling,
     groupby,
     hashing,
+    interop,
     join,
     json,
     labeling,
@@ -64,6 +65,7 @@ __all__ = [
     "gpumemoryview",
     "groupby",
     "hashing",
+    "interop",
     "join",
     "json",
     "lists",
4 changes: 2 additions & 2 deletions python/pylibcudf/pylibcudf/hashing.pyx
@@ -64,7 +64,7 @@ cpdef Table murmurhash3_x64_128(
     ----------
     input : Table
         The table of columns to hash
-    seed : uint32_t
+    seed : uint64_t
         Optional seed value to use for the hash function

     Returns
@@ -94,7 +94,7 @@ cpdef Column xxhash_64(
     ----------
     input : Table
         The table of columns to hash
-    seed : uint32_t
+    seed : uint64_t
         Optional seed value to use for the hash function

     Returns
4 changes: 2 additions & 2 deletions python/pylibcudf/pylibcudf/tests/conftest.py
@@ -20,9 +20,9 @@

 def _type_to_str(typ):
     if isinstance(typ, pa.ListType):
-        return f"list-{_type_to_str(typ.value_type)}"
+        return f"list[{_type_to_str(typ.value_type)}]"
     elif isinstance(typ, pa.StructType):
-        return f"struct-{'-'.join([_type_to_str(typ.field(i).type) for i in range(typ.num_fields)])}"
+        return f"struct[{', '.join([_type_to_str(typ.field(i).type) for i in range(typ.num_fields)])}]"
     else:
         return str(typ)

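One visible property of the bracketed IDs is that nesting stays readable. The sketch below illustrates this with a hypothetical stand-in for Arrow types (plain strings and tuples rather than pyarrow objects); `old_id`, `new_id`, and the tuple encoding are assumptions made for this example and are not part of the PR. It builds two different nested struct types that the hyphen-joined form maps to the same string and that the bracketed form keeps apart.

```python
# Hypothetical stand-in for Arrow types (not pyarrow):
#   a str is a primitive type name,
#   ("list", child) is a list type,
#   ("struct", [children]) is a struct type.


def old_id(typ):
    """Hyphen-joined form, mirroring the removed lines."""
    if isinstance(typ, str):
        return typ
    kind, payload = typ
    if kind == "list":
        return f"list-{old_id(payload)}"
    return f"struct-{'-'.join(old_id(child) for child in payload)}"


def new_id(typ):
    """Bracketed form, mirroring the added lines."""
    if isinstance(typ, str):
        return typ
    kind, payload = typ
    if kind == "list":
        return f"list[{new_id(payload)}]"
    return f"struct[{', '.join(new_id(child) for child in payload)}]"


# Two distinct types:
#   struct<struct<int64>, int64>  and  struct<struct<int64, int64>>
outer_pair = ("struct", [("struct", ["int64"]), "int64"])
inner_pair = ("struct", [("struct", ["int64", "int64"])])

print(old_id(outer_pair), "|", old_id(inner_pair))
print(new_id(outer_pair), "|", new_id(inner_pair))
```

Whether this ambiguity motivated the review suggestion is not stated in the PR; the sketch only shows what the two formats produce.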
119 changes: 102 additions & 17 deletions python/pylibcudf/pylibcudf/tests/test_hashing.py
@@ -6,11 +6,12 @@
 import mmh3
 import numpy as np
 import pyarrow as pa
-import pylibcudf as plc
 import pytest
 import xxhash
 from utils import assert_column_eq, assert_table_eq

+import pylibcudf as plc
+
 SEED = 0
 METHODS = ["md5", "sha1", "sha224", "sha256", "sha384", "sha512"]

@@ -28,10 +29,24 @@ def scalar_to_binary(x):
         raise NotImplementedError


+def hash_single_uint32(val, seed=0):
+    return mmh3.hash(np.uint32(val).tobytes(), seed=seed, signed=False)
+
+
+def hash_combine_32(lhs, rhs):
+    return lhs ^ (rhs + 0x9E3779B9 + (lhs << 6) + (lhs >> 2))
+
+
+def uint_hash_combine_32(lhs, rhs):
+    lhs = np.uint32(lhs)
+    rhs = np.uint32(rhs)
+    return hash_combine_32(lhs, rhs)
+
+
 def libcudf_mmh3_x86_32(binary):
     seed = plc.hashing.LIBCUDF_DEFAULT_HASH_SEED
     hashval = mmh3.hash(binary, seed)
-    return seed ^ (hashval + 0x9E3779B9 + (seed << 6) + (seed >> 2))
+    return hash_combine_32(seed, hashval)


 @pytest.fixture(params=[pa.int64(), pa.float64(), pa.string(), pa.bool_()])
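The combine formula in this hunk is applied to `np.uint32` operands in `uint_hash_combine_32`. As a rough sketch of the same formula on plain Python integers, assuming the intended behaviour is unsigned 32-bit wraparound, the version below masks explicitly. The name `hash_combine_u32` and the `MASK32` constant are introduced here for illustration and do not appear in the PR.

```python
MASK32 = 0xFFFFFFFF  # keep every result in the unsigned 32-bit range


def hash_combine_u32(lhs: int, rhs: int) -> int:
    """lhs ^ (rhs + 0x9E3779B9 + (lhs << 6) + (lhs >> 2)), modulo 2**32."""
    lhs &= MASK32
    rhs &= MASK32
    # Python integers never overflow, so the sum is wrapped by hand.
    mixed = (rhs + 0x9E3779B9 + (lhs << 6) + (lhs >> 2)) & MASK32
    return lhs ^ mixed


print(hex(hash_combine_u32(0, 0)))
print(hex(hash_combine_u32(1, 2)))
```

Without the mask, `lhs << 6` can grow past 32 bits on Python integers, and the result would no longer be comparable to a `uint32` column.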
@@ -84,9 +99,11 @@ def python_hash_value(x, method):


 @pytest.mark.parametrize(
-    "method", ["sha1", "sha224", "sha256", "sha384", "sha512"]
+    "method", ["sha1", "sha224", "sha256", "sha384", "sha512", "md5"]
 )
-def test_hash_column_sha(pa_scalar_input_column, plc_scalar_input_tbl, method):
+def test_hash_column_sha_md5(
+    pa_scalar_input_column, plc_scalar_input_tbl, method
+):
     plc_hasher = getattr(plc.hashing, method)

     def py_hasher(val):
@@ -100,18 +117,6 @@ def py_hasher(val):
     assert_column_eq(got, expect)


-def test_hash_column_md5(pa_scalar_input_column, plc_scalar_input_tbl):
-    def py_hasher(val):
-        return hashlib.md5(scalar_to_binary(val)).hexdigest()
-
-    expect = pa.array(
-        [py_hasher(val) for val in pa_scalar_input_column.to_pylist()],
-        type=pa.string(),
-    )
-    got = plc.hashing.md5(plc_scalar_input_tbl)
-    assert_column_eq(got, expect)
-
-
 def test_hash_column_xxhash64(pa_scalar_input_column, plc_scalar_input_tbl):
     def py_hasher(val):
         return xxhash.xxh64(
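One way a single parametrized reference hasher can cover all six methods is to look the constructor up by name in `hashlib`, since md5 and the SHA family share the same interface there. The sketch below uses only the standard library; the `hex_digest` helper and the sample input `b"abc"` are illustrative and not taken from the PR, whose `py_hasher` body is not shown in this hunk.

```python
import hashlib

METHODS = ["md5", "sha1", "sha224", "sha256", "sha384", "sha512"]


def hex_digest(method: str, data: bytes) -> str:
    """Look up the hashlib constructor by name and return the hex digest."""
    return getattr(hashlib, method)(data).hexdigest()


for method in METHODS:
    # A hex digest has two characters per digest byte.
    print(method, len(hex_digest(method, b"abc")))
```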
@@ -161,7 +166,87 @@ def py_hasher(val):
     assert_column_eq(got, expect)


-# def test_murmurhash_x86_32_list_struct TODO
+@pytest.mark.filterwarnings("ignore::RuntimeWarning")
+def test_murmurhash3_x86_32_list():
+    pa_tbl = pa.Table.from_pydict(
+        {
+            "list": pa.array(
+                [[1, 2, 3], [4, 5, 6], [7, 8, 9]], type=pa.list_(pa.uint32())
+            )
+        }
+    )
+    plc_tbl = plc.interop.from_arrow(pa_tbl)
+
+    def hash_list(list_):
+        hash_value = uint_hash_combine_32(0, hash_single_uint32(len(list_)))
+
+        for element in list_:
+            hash_value = uint_hash_combine_32(
+                hash_value,
+                hash_single_uint32(
+                    element, seed=plc.hashing.LIBCUDF_DEFAULT_HASH_SEED
+                ),
+            )
+
+        final = uint_hash_combine_32(
+            plc.hashing.LIBCUDF_DEFAULT_HASH_SEED, hash_value
+        )
+        return final
+
+    expect = pa.array(
+        [hash_list(val) for val in pa_tbl["list"].to_pylist()],
+        type=pa.uint32(),
+    )
+    got = plc.hashing.murmurhash3_x86_32(
+        plc_tbl, plc.hashing.LIBCUDF_DEFAULT_HASH_SEED
+    )
+    assert_column_eq(got, expect)
+
+
+@pytest.mark.filterwarnings("ignore::RuntimeWarning")
+def test_murmurhash3_x86_32_struct():
+    pa_tbl = pa.table(
+        {
+            "struct": pa.array(
+                [
+                    {"a": 1, "b": 2, "c": 3},
+                    {"a": 4, "b": 5, "c": 6},
+                    {"a": 7, "b": 8, "c": 9},
+                ],
+                type=pa.struct(
+                    [
+                        pa.field("a", pa.uint32()),
+                        pa.field("b", pa.uint32()), pa.field("c", pa.uint32()),
+                    ]
+                ),
+            )
+        }
+    )
+    plc_tbl = plc.interop.from_arrow(pa_tbl)
+
+    def hash_struct(s):
+        seed = plc.hashing.LIBCUDF_DEFAULT_HASH_SEED
+        keys = list(s.keys())
+
+        combined_hash = hash_single_uint32(s[keys[0]], seed=seed)
+        combined_hash = uint_hash_combine_32(0, combined_hash)
+        combined_hash = uint_hash_combine_32(seed, combined_hash)
+
+        for key in keys[1:]:
+            current_hash = hash_single_uint32(s[key], seed=seed)
+            combined_hash = uint_hash_combine_32(combined_hash, current_hash)
+
+        return combined_hash
+
+    got = plc.hashing.murmurhash3_x86_32(
+        plc_tbl, plc.hashing.LIBCUDF_DEFAULT_HASH_SEED
+    )
+
+    expect = pa.array(
+        [hash_struct(val) for val in pa_tbl["struct"].to_pylist()],
+        type=pa.uint32(),
+    )
+    assert_column_eq(got, expect)


 def test_murmurhash3_x64_128(pa_scalar_input_column, plc_scalar_input_tbl):
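The two reference implementations in this hunk, `hash_list` and `hash_struct`, differ mainly in where the seed and the leading 0 enter the chain of combines. The sketch below restates that structure so it runs without mmh3, NumPy, or a GPU. It rests on several assumptions: `standin_hash` is a CRC32 placeholder and not MurmurHash3, so its numbers will not match libcudf; `fold_list`, `fold_struct`, `hash_combine_u32`, and the seed value 7 are names and values chosen for this example only.

```python
import zlib

MASK32 = 0xFFFFFFFF


def hash_combine_u32(lhs, rhs):
    """Same combine formula as the tests, wrapped to 32 bits."""
    lhs &= MASK32
    rhs &= MASK32
    return lhs ^ ((rhs + 0x9E3779B9 + (lhs << 6) + (lhs >> 2)) & MASK32)


def standin_hash(value, seed=0):
    """NOT MurmurHash3: a CRC32 placeholder so the sketch needs no extra packages."""
    return zlib.crc32(value.to_bytes(4, "little"), seed) & MASK32


def fold_list(values, seed, hash_element=standin_hash):
    """Mirror of hash_list: length first (hashed with seed 0), then elements, then the seed."""
    acc = hash_combine_u32(0, hash_element(len(values)))
    for value in values:
        acc = hash_combine_u32(acc, hash_element(value, seed))
    return hash_combine_u32(seed, acc)


def fold_struct(fields, seed, hash_element=standin_hash):
    """Mirror of hash_struct: first field combined with 0 and then the seed, the rest chained."""
    first, *rest = fields
    acc = hash_combine_u32(0, hash_element(first, seed))
    acc = hash_combine_u32(seed, acc)
    for value in rest:
        acc = hash_combine_u32(acc, hash_element(value, seed))
    return acc


print(fold_list([1, 2, 3], seed=7))
print(fold_struct([1, 2, 3], seed=7))
```

Passing the element hasher as a parameter keeps the fold separate from the choice of hash function; in the tests above that role is played by `hash_single_uint32`.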