Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor binning #136

Open
gbordyugov opened this issue Jul 20, 2017 · 5 comments
Open

Refactor binning #136

gbordyugov opened this issue Jul 20, 2017 · 5 comments

Comments

@gbordyugov
Copy link
Contributor

gbordyugov commented Jul 20, 2017

  1. Do we really need classes for binning? Do we ever use inheritance from the base Binning class? I've got a feeling that a bunch of functions would be enough for all our purposes.

  2. Labelling: Instead of passing format_str to Binning.labels, user should pass a function that would return the appropriate label for given category, smth like
    labels = bins.labels(x, lambda i: 'label " + str(i)) thus a) getting rid of not very flexible labelling code and b) giving the user the flexibility of choosing any labelling scheme.

  3. Better greedy sorting (for instance, min-heap based).

@gbordyugov
Copy link
Contributor Author

Example of heap-based binning:

from heapq import heapify, heappop, heappush

def doTheMagic(items, nbins):
    categories = []
    for item in set(items):
        pair = (sum([i == item for i in items]), [item])
        categories.append(pair)

    heapify(categories)

    while len(categories) > nbins:
        smallest       = heappop(categories)
        secondSmallest = heappop(categories)
        count0, items0 =       smallest[0],       smallest[1]
        count1, items1 = secondSmallest[0], secondSmallest[1]
        heappush(categories, (count0 + count1, items0 + items1))

    return categories

doTheMagic([1, 1, 2, 2, 3, 3, 5], 4)

@mkolarek
Copy link
Contributor

Agree completely, we need to prioritize these changes so that we can fix sga() functionality ASAP.

@shansfolder
Copy link
Contributor

I've constructed an example where your greedy algorithm won't work. ;)

Given items=[10, 10, 20, 20], nbins=2, the algorithm will compute the following steps:
-> [10, 10, 20, 20]
-> [20, 20, 20]
-> [40, 20]

where the optimal binning is [30, 30].

@gbordyugov
Copy link
Contributor Author

Nice catch! However, the existing algorithm would fail here too.

Another reason to rethink it.

@gbordyugov
Copy link
Contributor Author

A nice paper on categorical binning: http://www.aaai.org/ocs/index.php/IJCAI/IJCAI-09/paper/viewFile/625/705

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants