Skip to content
This repository has been archived by the owner on Jul 7, 2020. It is now read-only.

HLL++ offer() always returns true when format SPARSE #91

Open
xkrt opened this issue Jul 2, 2015 · 11 comments
Open

HLL++ offer() always returns true when format SPARSE #91

xkrt opened this issue Jul 2, 2015 · 11 comments

Comments

@xkrt
Copy link

xkrt commented Jul 2, 2015

Hi,

HyperLogLogPlus.offerHashed() constantly returns true in SPARSE format, even if hash already counted.

https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java#L309

Test:

@Test
public void testOffer() {
    HyperLogLogPlus hll = new HyperLogLogPlus(5,25);
    assertTrue(hll.offer("ABC"));
    assertFalse(hll.offer("ABC")); // boom!
}
@abramsm
Copy link
Contributor

abramsm commented Jul 3, 2015

Confirmed that it returns true. This behavior doesn't actually result in
incorrect cardinality results but it does violate the interface's contract.

The issue is that for efficiency purposes we do not merge the temp set
until it reaches a certain size or someone tries to get the cardinality
represented by the set.

In order to correctly return 'false' on each entry we would have to merge
the temp set after each offer and this would slow things down quite a bit.

What are peoples thoughts on optimal solution here? We should at least add
some javadoc to explain why this is and consider changing the interface
contract.

Matt

On Thu, Jul 2, 2015 at 4:32 AM Pavel Martynov [email protected]
wrote:

Hi,

HyperLogLogPlus.offerHashed() constantly returns true in SPARSE format,
even if hash already counted.

https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java#L309

Test:

@testpublic void testOffer() {
HyperLogLogPlus hll = new HyperLogLogPlus(5,25);
assertTrue(hll.offer("ABC"));
assertFalse(hll.offer("ABC")); // boom!
}


Reply to this email directly or view it on GitHub
#91.

@xkrt
Copy link
Author

xkrt commented Jul 3, 2015

IMHO you should measure how much will slow down contract support and if it really quite a bit - fix offer() to return correct value.

I want to explain why I need it: my business logic should perfom some notification when counter becomes larger than some predefined value. Pseudocode:

if (hll.offer(value) == true) {
    if (hll.cardinality() > notifyBound) {
        notify();
    }
}

If offer() will always return true (in Sparse format) I will be forced always call cardinality() that expensive as far I understand. Moreover stream of all values is very big but new values comes to counter relatively rarely.

And I think contract violation with explanation in docs in general bad idea leading to poor API.

Before stream-lib I used Redis (http://redis.io/commands/pfadd).

@abramsm
Copy link
Contributor

abramsm commented Jul 3, 2015

I've created a branch that provides the correct behavior. Basic
performance tests seem OK but more testing on that front is required before
merging into master. I've also updated language to Java 8 because, why
not...

Are you interested in helping with performance analysis?

f4c88af

Matt

On Fri, Jul 3, 2015 at 12:36 AM Pavel Martynov [email protected]
wrote:

IMHO you should measure how much will slow down contract support and if it
really quite a bit - fix offer() to return correct value.

I want to explain why I need it: my business logic should perfom some
notification when counter becomes larger than some predefined value.
Pseudocode:

if (hll.offer(value) == true) {
if (hll.cardinality() > notifyBound) {
notify();
}
}

If offer() will be always return true (in Sparse format) I will be forced
always call cardinality() that expensive as far I understand. Moreover
stream of all values is very big but new values comes to counter relatively
rarely.

And I think contract violation with explanation in docs in general bad
idea leading to poor API.

Before stream-lib I used Redis (http://redis.io/commands/pfadd).


Reply to this email directly or view it on GitHub
#91 (comment).

@xkrt
Copy link
Author

xkrt commented Jul 3, 2015

@abramsm I will try to compare perf on my dataset on next week. thanks!

@xkrt
Copy link
Author

xkrt commented Jul 6, 2015

@abramsm I run fixed offer() on one of my data sets (~10e6 items) and performance just sliiiightly changed. So I think your fix is production ready and may be merged.

xkrt added a commit to KL-WLCR/stream-lib.net that referenced this issue Jul 6, 2015
OfferHashed() now returns correct value for Sparse format
@abramsm
Copy link
Contributor

abramsm commented Jul 6, 2015

Thanks. Can you share any more details and results of your performance
tests? A few folks have expressed concerns about this approach so we will
need to give them time to propose/implement alternatives.

Matt

On Mon, Jul 6, 2015, 1:50 AM Pavel Martynov [email protected]
wrote:

@abramsm https://github.com/abramsm I run fixed offer() on one of my
data sets (~10e6 items) and performance just sliiiightly changed. So I
think your fix is production ready and may be merged.


Reply to this email directly or view it on GitHub
#91 (comment).

@mspiegel
Copy link

mspiegel commented Jul 6, 2015

I have two issues with the proposed changes: (1) I don't see the purpose of keeping around the tmpSet and the sparseSet if we introduce the transientSparseSet. (2) Presumably the tmpSet and the sparseSet were introduced because of some combination of (a) lower memory footprint and/or (b) amortized cost of insertions into those data structures is cheaper than using a hash set. Maybe we could use a specialized hash table for int primitives for the transientSparseSet and eliminate the tmpSet and the sparseSet. I'm not certain. It's likely that @tea-dragon has some objections too.

@abramsm
Copy link
Contributor

abramsm commented Jul 6, 2015

Yeah, those are good points and I think why we didn't do this initially.

Pavel - can you try your performance tests again using the current
release? And modify your code to always call 'cardinality' rather than
waiting for a 'true' response from offer. Lets see what the performance
looks like with that as compared to the hack in this branch.

Matt

On Mon, Jul 6, 2015 at 8:37 AM mspiegel [email protected] wrote:

I have two issues with the proposed changes: (1) I don't see the purpose
of keeping around the tmpSet and the sparseSet if we introduce the
transientSparseSet. (2) Presumably the tmpSet and the sparseSet were
introduced because of some combination of (a) lower memory footprint and/or
(b) amortized cost of insertions into those data structures is cheaper than
using a hash set. Maybe we could use a specialized hash table for int
primitives for the transientSparseSet and eliminate the tmpSet and the
sparseSet. I'm not certain. It's likely that @tea-dragon
https://github.com/tea-dragon has some objections too.


Reply to this email directly or view it on GitHub
#91 (comment).

@xkrt
Copy link
Author

xkrt commented Jul 8, 2015

While I am experementing with performance of fixed offer() I found a problem:

@Test
public void testOfferReturn() {
    HyperLogLogPlus hll = new HyperLogLogPlus(5, 25);
    int uniqOffers = 10000;
    int hllUpdates = 0;
    for (int i = 0; i < uniqOffers; ++i) {
        if (hll.offer(UUID.randomUUID().toString()))
            ++hllUpdates;
    }
    assertEquals(uniqOffers, hllUpdates);
}

Test fails with:

Expected :10000
Actual   :166

Test run on f4c88af

@abramsm
Copy link
Contributor

abramsm commented Jul 8, 2015

This has to do with the precision properties of the counter. In normal
mode the buckets do not update every time a new hash is offered to the
set. The higher the precision the more likely a bucket is to update. If
you change your test from:

HyperLogLogPlus(5,25)

to

HyperLogLogPlus(25,25)

You will get a result like:

Expected :10000
Actual :9998

Which is about as close as you can get.

Give this fact I think the proper course of action is to update the
interface documentation rather than changing the code. Michael Spiegel
proposed something like:

"returns true if it alters the internal state of the of the cardinality
estimator"

Matt

On Wed, Jul 8, 2015 at 2:23 AM Pavel Martynov [email protected]
wrote:

While I am experementing with performance of fixed offer() I found a
problem:

@testpublic void testOfferReturn() {
HyperLogLogPlus hll = new HyperLogLogPlus(5, 25);
int uniqOffers = 10000;
int hllUpdates = 0;
for (int i = 0; i < uniqOffers; ++i) {
if (hll.offer(UUID.randomUUID().toString()))
++hllUpdates;
}
assertEquals(uniqOffers, hllUpdates);
}

Test fails with:

Expected :10000
Actual :166

Test run on f4c88af
f4c88af


Reply to this email directly or view it on GitHub
#91 (comment).

@xkrt
Copy link
Author

xkrt commented Jul 8, 2015

@abramsm auh, I see, you are right. I have some additional questions about your HLL++ implementation, will ask it in mail list. Thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants