SetSketches saved from different processes have jaccard estimation of 0 #74
Comments
The problem seems to be related to the hash function used by CSetSketch.addh(). Hashing strings separately and adding hashes helps to solve the problem.
The speed remains almost the same. Maybe CSetSketch.addh() can be fixed by using murmurhash3 with fixed seed?
Hi!
That's right, the default addh uses Python's hash function, which is seeded
anew at each interpreter invocation, so the same string hashes differently in
different processes. I should modify this to be consistent.
In the short term, you can convert to a numpy array and hash it (from_np
and from_shs should provide this). from_np hashes its input, while from_shs
expects already hashed data and does no hashing. I used Python's hash only so
it could work on any Python object.
I'll let you know when it's updated. Thanks for the issue!
Best,
Daniel
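To illustrate the per-process seeding described above, here is a minimal sketch using only the standard library. It launches fresh interpreters with different `PYTHONHASHSEED` values and compares the hash of the same string. The helper name `str_hash_in_fresh_process` is hypothetical and not part of the sketch library.

```python
import os
import subprocess
import sys


def str_hash_in_fresh_process(text, hash_seed):
    """Return hash(text) as computed by a new interpreter started with
    PYTHONHASHSEED=hash_seed (hypothetical helper, for illustration only)."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Same seed: the two processes agree on the hash of "42".
a = str_hash_in_fresh_process("42", 1)
c = str_hash_in_fresh_process("42", 1)

# Different seed: the hash of the same string changes.
b = str_hash_in_fresh_process("42", 2)

print("same seed agrees:", a == c)
print("different seed agrees:", a == b)
```

Without an explicit `PYTHONHASHSEED`, each interpreter picks a random seed, which would explain why sketches built from strings in separate processes share no hashed values.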
…On Sunday, September 24, 2023, Nikolay Arefyev ***@***.***> wrote:
The problem seems to be related to the hash function used by
CSetSketch.addh(). Hashing strings separately and adding hashes helps to
solve the problem.
%%timeit
import os
import sketch
import mmh3
m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)
step1, step2, maxval1, maxval2 = 2, 5, 100000, 100000
for i in range(step1, maxval1+1, step1):
    # o = str(i)
    o, _ = mmh3.hash64(str(i), seed=0, signed=False)
    hll.add(o)
for i in range(step2, maxval2+1, step2):
    # o = str(i)
    o, _ = mmh3.hash64(str(i), seed=0, signed=False)
    hll2.add(o)
hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')
The speed remains almost the same. Maybe CSetSketch.addh() can be fixed by
using murmurhash3 with fixed seed?
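As a rough sketch of the same workaround without the mmh3 dependency, a deterministic 64-bit hash can be derived from the standard library. Note the substitution: this uses BLAKE2b from `hashlib` rather than murmurhash3, and `stable_hash64` is a hypothetical helper, not something the sketch library provides or the fix the maintainer has in mind.

```python
import hashlib


def stable_hash64(obj):
    """Deterministic unsigned 64-bit hash of str(obj).

    Hypothetical helper: BLAKE2b stands in for murmurhash3 here. The value
    does not depend on PYTHONHASHSEED, so every process computes the same one.
    """
    digest = hashlib.blake2b(str(obj).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


# Pre-hash the first few even numbers, as the loop above does with mmh3.
hashes = [stable_hash64(i) for i in range(2, 11, 2)]
for value in hashes:
    print(value)
```

The resulting integers could then be passed to `add` in place of the mmh3 values in the snippet above. BLAKE2b is likely slower than murmurhash3, so this trades speed for having no extra dependency.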
Hi! I'm using CSetSketch from Python. When I create, fill, and save two sketches to disk in the same process, loading them from disk and computing the Jaccard estimate works well. But when the two sets are processed in different processes, the estimate is 0. Cardinality estimation works well in both cases.
Here is a minimal example showing this:
Run this code twice, in two different processes. Then run:
It will print this in my case:
tmp1_736949 tmp2_736949 0.16761398315429688
tmp1_736949 tmp2_736999 0.0
tmp1_736999 tmp2_736949 0.0
tmp1_736999 tmp2_736999 0.166900634765625
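For reference, the exact Jaccard index can be computed directly. This is a minimal check that assumes the two sets are the multiples of 2 and the multiples of 5 up to 100000, as in the snippet quoted in the reply; the original example code did not survive in this copy of the thread, so that parameter choice is an assumption.

```python
# Assumed sets: multiples of 2 and of 5 in [1, 100000], taken from the
# snippet quoted in the reply (step1=2, step2=5, maxval=100000).
set_a = set(range(2, 100001, 2))
set_b = set(range(5, 100001, 5))

intersection = len(set_a & set_b)  # multiples of 10
union = len(set_a | set_b)
exact_jaccard = intersection / union

print(len(set_a), len(set_b), intersection, union, exact_jaccard)
```

Under that assumption the exact value is 10000 / 60000 = 1/6, about 0.1667, which agrees with the two same-process estimates above, while the cross-process estimates of 0.0 do not.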