Customize the Dict type for group #18

Open
yurivish opened this issue Feb 11, 2019 · 4 comments
Comments

@yurivish

Hello, I'm using this great package to do basic data operations, and just found myself wanting to get back an OrderedDict from DataStructures.jl, since the order in which values appear in the iterable is significant.

I see there's some type promotion optimization going on in the group code, and I wonder if it would be possible to support passing in or otherwise specifying the output type in a way that preserves good type information.

Thanks for your work on this package!

@andyferris
Member

Thank you :)

Yes, this is an important problem. I've thought of only two things so far:

  • Export a group! function, where you can initialize your own empty dictionary.
  • With AcceleratedArrays.jl I'd expect an array with a sort index (or an array that is sorted) to group to an order-based dictionary.

Note that we have the same general problem with functions like map, which only infer an output type from the input types rather than from anything else. (The solution offered by Base is obviously the first one above.)
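To make the first option concrete, here is a minimal sketch of what a caller-initialized grouping function could look like. The name group! and its signature are assumptions for illustration, not an existing export of this package; the sketch uses only Base and accepts any AbstractDict, so an OrderedDict from DataStructures.jl should slot in the same way as the plain Dict used here.

```julia
# Hypothetical helper (an assumption, not this package's API): the caller
# supplies the empty output dictionary and so controls its concrete type.
function group!(out::AbstractDict, by, f, iter)
    for x in iter
        # Create an empty container of the dictionary's value type on first
        # sight of a key, then append the mapped element to it.
        bucket = get!(() -> valtype(out)(), out, by(x))
        push!(bucket, f(x))
    end
    return out
end

rows = [(freq=250, rt60=0.8), (freq=125, rt60=1.1), (freq=250, rt60=0.7)]
out = group!(Dict{Int,Vector{Float64}}(), r -> r.freq, r -> r.rt60, rows)
```

Because the key and value types come from the dictionary the caller passes in, nothing has to be inferred from the inputs, which is the same trade-off map! makes relative to map.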

@ssfrr

ssfrr commented Feb 7, 2020

It's not clear to me why the output of group would be a dictionary, rather than a Vector{Tuple} or Vector{NamedTuple}. I get that the dictionary will be faster for random access of groups, but it seems like in the split-apply-combine workflow you end up iterating through all the groups anyways (caveat: I'm not very familiar with SAC so could be mistaken).

For instance, here's a data pipeline I just wrote (related to what I was trying to do in #22). In the 2nd line, the first thing I do after the group operation is to convert it into a Vector{Tuple}. The idea is to get something I can easily plot, so it needs to be sorted by the group key. I'm also using the @df macro from StatsPlots so I can refer to "columns" of my data. (ir_analysis is a Vector{NamedTuple} from a previous analysis step). My convention here is to use d for a whole dataset and r for each row.

rt60s = group(r->r.freq, r->r.rt60, ir_analysis) |>
    d -> map(tuple, collect(keys(d)), d) |>
    d -> sort(d; by=first) |>
    d -> map(d) do r
        n = count(!ismissing, r[2])
        m = n == 0 ? missing : mean(skipmissing(r[2]))
        (freq=r[1], count=n, mean=m)
    end

@df rt60s plot(:freq, :mean)

Maybe it's just my unfamiliarity with Dictionaries.jl preventing me from seeing the right way to do this - for me it's easier to work with familiar data types.

@andyferris
Member

andyferris commented May 19, 2020

Sorry for not responding to this earlier.

> It's not clear to me why the output of group would be a dictionary, rather than a Vector{Tuple} or Vector{NamedTuple}. I get that the dictionary will be faster for random access of groups, but it seems like in the split-apply-combine workflow you end up iterating through all the groups anyways (caveat: I'm not very familiar with SAC so could be mistaken).

As an intermediate step to create the groups, you need to create a dictionary to efficiently push the elements into the correct group. (The alternative is to first sort the elements by the grouping function and then use Iterators.partition, but a thin wrapper over a sorted collection is itself a dictionary of groups.) I recently moved group to return a Dictionaries.AbstractDictionary precisely because it iterates the same as a vector and is the intermediate value. Returning another data structure is more work, so my logic was that the user could be responsible for that.
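For comparison, the sort-based alternative might be sketched roughly as follows. This is an illustration under assumptions, not the package's algorithm: sorted_groups is a made-up name, and because Iterators.partition splits by a fixed chunk size rather than by key, the sketch scans for runs of equal keys instead.

```julia
# Hypothetical sketch of sort-then-split grouping using only Base.
function sorted_groups(by, iter)
    # MergeSort is stable, so elements keep their input order within a group.
    xs = sort(collect(iter); by=by, alg=MergeSort)
    out = []  # untyped for brevity; holds key => elements pairs in key order
    for x in xs
        k = by(x)
        if !isempty(out) && isequal(first(last(out)), k)
            push!(last(last(out)), x)   # same key as the previous run: extend it
        else
            push!(out, k => [x])        # new key: start a new run
        end
    end
    return out
end

rows = [(freq=250, rt60=0.8), (freq=125, rt60=1.1), (freq=250, rt60=0.7)]
g = sorted_groups(r -> r.freq, rows)
```

The result is a vector of pairs already ordered by key, which is close to the Vector{Tuple} shape asked about above, at the cost of an O(n log n) sort instead of hashing.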

As to your example, there are several enhancements I want that should make your life easier.

  1. A sort-based group algorithm that returns a dictionary sorted by key.
  2. An AbstractDictionary-based table (instead of an AbstractVector-based table) that can wrap any such AbstractDictionary (or columns of AbstractDictionarys). The dictionary keys would be a "primary key" column and the dictionary values the other column(s).

For now, ignoring sorting by :freq we can do something like:

groups = groupview(r.freq, r.rt60);
freq = keys(groups)
count = (length ∘ skipmissing).(groups);
mean = (mean ∘ skipmissing).(groups);

plot(collect(freq), collect(mean))

However later I hope we can get these pre-sorted and in a table-like structure. :)

@andyferris
Member

count = (length ∘ skipmissing).(groups)

Apparently that doesn't work (yet). I got carried away: JuliaLang/julia#35946, JuliaLang/julia#35947
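Until something like that works, one possible workaround is to spell the per-group reductions out with comprehensions. The sketch below is only an illustration: it uses a plain Base Dict with made-up values as a stand-in for the grouped data, and the names freqs, counts and means are chosen so that they do not shadow the count and mean functions.

```julia
using Statistics  # standard library; provides mean

# Stand-in for the grouped data (assumed values, not from the real dataset).
groups = Dict(125 => [1.1, missing], 250 => [0.8, 0.7], 500 => [missing])

freqs  = sort(collect(keys(groups)))                    # group keys in sorted order
counts = [count(!ismissing, groups[k]) for k in freqs]  # non-missing values per group
# mean of an empty skipmissing iterator throws, so guard the all-missing case.
means  = [c == 0 ? missing : mean(skipmissing(groups[k]))
          for (k, c) in zip(freqs, counts)]
```

Sorting the keys up front also gives the by-:freq ordering needed for plotting, so collect(freqs) and means can be passed to plot directly.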
