Implements reservoir sampler randomly sampling stream of features #33

Open
wants to merge 1 commit into base: master

daniel-j-h
Collaborator

For #7. Work in progress.

This changeset implements reservoir sampling, a randomized online algorithm for drawing k items uniformly at random from a stream of unknown length n. We can use it to sample, e.g., k building features in the osmium handlers without first storing all features or making two passes.
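For readers who want the whole idea in one place, here is a minimal, self-contained sketch of such a sampler. It is not the code in this changeset: the class name `ReservoirSampler` and the optional `rng` parameter are assumptions made for illustration, and the sketch increments the counter before computing the replacement probability so that the n-th item is kept with probability k / n.

```python
import random


class ReservoirSampler:
    """Keeps a uniform random sample of up to `capacity` items from a stream."""

    def __init__(self, capacity, rng=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.reservoir = []
        self.pushed = 0  # number of items seen so far
        # `rng` is a hypothetical parameter, added only to make runs reproducible.
        self.rng = rng if rng is not None else random.Random()

    def push(self, v):
        # Count the current item first, so `pushed` equals n for the n-th item.
        self.pushed += 1

        # Fill phase: the first `capacity` items are stored unconditionally.
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(v)
            return

        # Replace phase: keep the n-th item with probability capacity / n,
        # overwriting a slot chosen uniformly at random.
        if self.rng.random() < self.capacity / self.pushed:
            i = self.rng.randrange(self.capacity)
            self.reservoir[i] = v


# Usage: sample 5 items from a stream of 1000 without storing the stream.
sampler = ReservoirSampler(capacity=5, rng=random.Random(0))
for item in range(1000):
    sampler.push(item)
```

Memory use stays at `capacity` items no matter how long the stream is, which is what makes this a fit for a single pass over features.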

Tasks:

  • Hook up to osmium handlers
  • Let users pass number of samples for randomly sampling

Refs:

Contributor

@bkowshik bkowshik left a comment

Such a beautiful implementation! ❤️

i = random.randint(0, size - 1)
self.reservoir[i] = v

self.pushed += 1
Contributor

First we designate a counter, which will be incremented for every data point seen.

'''Randomly samples k items from a stream of unknown n items.
'''

def __init__(self, capacity):
Contributor

The reservoir is generally a list or array of predefined size.

self.reservoir = []
self.pushed = 0

def push(self, v):
Contributor

Now we can begin adding data.

size = len(self.reservoir)

if size < self.capacity:
self.reservoir.append(v)
Contributor

Until the reservoir holds capacity elements, incoming elements are appended to it directly.

assert size == self.capacity
assert size <= self.pushed

p = self.capacity / self.pushed
Contributor

Once the reservoir is full, each incoming data point has a capacity / pushed chance of replacing an existing sample point.
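A short derivation of why that replacement probability gives a uniform sample may help here. It assumes the counter already includes the current item, so the m-th item is accepted with probability k / m (with k the capacity), and that the slot to overwrite is chosen uniformly among the k slots. If the counter is incremented only after the check, the probabilities are shifted by one.

```latex
% Probability that item i (with k < i \le n) is in the reservoir after n items:
% it must be accepted on arrival, then survive every later item m.
% Item m evicts one specific slot with probability (k/m) \cdot (1/k) = 1/m.
\Pr[\text{item } i \text{ kept after } n \text{ items}]
  = \frac{k}{i} \prod_{m=i+1}^{n} \left(1 - \frac{k}{m}\cdot\frac{1}{k}\right)
  = \frac{k}{i} \prod_{m=i+1}^{n} \frac{m-1}{m}
  = \frac{k}{i} \cdot \frac{i}{n}
  = \frac{k}{n}.

% For the first k items (i \le k) the acceptance factor is 1 and the
% product starts at m = k+1:
\Pr[\text{item } i \text{ kept after } n \text{ items}]
  = \prod_{m=k+1}^{n} \frac{m-1}{m}
  = \frac{k}{n}.
```

Every item therefore ends up in the sample with the same probability k / n, regardless of where it appeared in the stream.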
