Customize the Dict type for `group` #18

yurivish · 2019-02-11T01:21:12Z

Hello, I'm using this great package to do basic data operations, and just found myself wanting to get back an OrderedDict from DataStructures.jl, since the order in which values appear in the iterable is significant.

I see there's some type promotion optimization going on in the group code, and I wonder if it would be possible to support passing in or otherwise specifying the output type in a way that preserves good type information.

Thanks for your work on this package!


andyferris · 2019-02-11T06:28:58Z

Thank you :)

Yes, this is an important problem. I've thought of only two things so far:

1. Export a group! function, where you can initialize your own empty dictionary.
2. With AcceleratedArrays.jl I'd expect an array with a sort index (or an array that is sorted) to group to an order-based dictionary.

Note that we have the same general problem with functions like map, which infer an output type only from the input types and nothing else. (The solution offered by Base, map!, is obviously the first one above.)
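A rough sketch of what the first option could look like, assuming a hypothetical `group_into!` helper (this is not an existing SplitApplyCombine.jl function). The caller constructs the output dictionary, so its concrete type is known up front, much as with `map!`. Only Base types are used here; passing an `OrderedDict` from DataStructures.jl instead of `Dict` should keep keys in order of first appearance, though that is untested in this sketch.

```julia
# Hypothetical helper, not part of SplitApplyCombine.jl.
# Pushes f(x) into the group keyed by by(x), mutating `out` in place.
function group_into!(out::AbstractDict{K,V}, by, f, iter) where {K,V<:AbstractVector}
    for x in iter
        # get! creates an empty group the first time a key is seen
        push!(get!(() -> V(), out, by(x)), f(x))
    end
    return out
end

# The caller picks the container, so the output type is explicit.
groups = group_into!(Dict{Bool,Vector{Int}}(), iseven, identity, 1:6)
```

Here `groups` is a `Dict{Bool,Vector{Int}}` regardless of what inference could work out from `iseven`, `identity`, and `1:6` alone.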

ssfrr · 2020-02-07T02:39:56Z

It's not clear to me why the output of group would be a dictionary, rather than a Vector{Tuple} or Vector{NamedTuple}. I get that the dictionary will be faster for random access of groups, but it seems like in the split-apply-combine workflow you end up iterating through all the groups anyways (caveat: I'm not very familiar with SAC so could be mistaken).

For instance, here's a data pipeline I just wrote (related to what I was trying to do in #22). In the 2nd line, the first thing I do after the group operation is to convert it into a Vector{Tuple}. The idea is to get something I can easily plot, so it needs to be sorted by the group key. I'm also using the @df macro from StatsPlots so I can refer to "columns" of my data. (ir_analysis is a Vector{NamedTuple} from a previous analysis step). My convention here is to use d for a whole dataset and r for each row.

rt60s = group(r->r.freq, r->r.rt60, ir_analysis) |>
    d -> map(tuple, collect(keys(d)), d) |>
    d -> sort(d; by=first) |>
    d -> map(d) do r
        n = count(!ismissing, r[2])
        m = n == 0 ? missing : mean(skipmissing(r[2]))
        (freq=r[1], count=n, mean=m)
    end

@df rt60s plot(:freq, :mean)

Maybe it's just my unfamiliarity with Dictionaries.jl preventing me from seeing the right way to do this - for me it's easier to work with familiar data types.

andyferris · 2020-05-19T23:26:17Z

Sorry for not responding to this earlier.

> It's not clear to me why the output of group would be a dictionary, rather than a Vector{Tuple} or Vector{NamedTuple}. I get that the dictionary will be faster for random access of groups, but it seems like in the split-apply-combine workflow you end up iterating through all the groups anyways (caveat: I'm not very familiar with SAC so could be mistaken).

As an intermediate step to create the groups, you need to create a dictionary to efficiently push the elements into the correct group. (The alternative is to first sort the elements by the grouping function and then use Iterators.partition, but a thin wrapper over a sorted collection is itself a dictionary of groups.) I recently moved group to return a Dictionaries.AbstractDictionary precisely because it iterates the same as a vector and is the intermediate value. Returning another data structure is more work, so my logic was that the user could be responsible for that.

As to your example, there are several enhancements I want that should make your life easier.

1. A sort-based group algorithm that returns a dictionary sorted by key.
2. An AbstractDictionary-based table (instead of an AbstractVector-based table) that can wrap any such AbstractDictionary (or columns of AbstractDictionarys). The dictionary keys would be a "primary key" column and the dictionary values the other column(s).
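To illustrate the sort-based idea, here is a minimal sketch using only Base. The function name `sorted_groups` and the sample rows are made up for illustration; this is not the package's algorithm, and it returns a plain vector of `key => group` pairs rather than a dictionary type.

```julia
# Hypothetical sketch of sort-based grouping; not SplitApplyCombine.jl code.
# Returns key => group pairs ordered by key.
function sorted_groups(by, f, iter)
    # MergeSort is stable, so elements keep their input order within a group
    items = sort!(collect(iter); by=by, alg=MergeSort)
    ks = map(by, items)
    vs = map(f, items)
    V = eltype(vs)
    out = Pair{eltype(ks),Vector{V}}[]
    for (k, v) in zip(ks, vs)
        if isempty(out) || !isequal(last(out).first, k)
            push!(out, k => V[v])          # a new key starts a new run
        else
            push!(last(out).second, v)     # same key: extend the current run
        end
    end
    return out
end

# Made-up sample data in the shape of the ir_analysis rows.
rows = [(freq=500, rt60=0.8), (freq=250, rt60=1.1), (freq=500, rt60=0.7)]
by_freq = sorted_groups(r -> r.freq, r -> r.rt60, rows)
```

Because equal keys are adjacent after sorting, a single linear pass is enough to split the data into groups, and the result comes out already ordered by key.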

For now, ignoring sorting by :freq we can do something like:

groups = groupview(r.freq, r.rt60);
freq = keys(groups)
count = (length ∘ skipmissing).(groups);
mean = (mean ∘ skipmissing).(groups);

plot(collect(freq), collect(mean))

However later I hope we can get these pre-sorted and in a table-like structure. :)

andyferris · 2020-05-20T02:39:53Z

count = (length ∘ skipmissing).(groups)

Apparently that doesn't work (yet). I got carried away: JuliaLang/julia#35946, JuliaLang/julia#35947
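In the meantime, one possible workaround is the `count(!ismissing, ...)` form already used earlier in this thread, which avoids calling `length` on a `skipmissing` iterator. The sketch below uses a plain Base `Dict` with made-up sample data standing in for the real groups; the names `n_present` and `group_mean` are illustrative.

```julia
using Statistics  # standard library, provides mean

# Made-up sample data standing in for the real groups.
groups = Dict(250 => [1.1, missing, 0.9], 500 => [missing, missing])

# Count non-missing entries without calling length on skipmissing(...)
n_present = Dict(k => count(!ismissing, g) for (k, g) in groups)

# Guard against groups with no non-missing values before taking the mean
group_mean = Dict(k => (all(ismissing, g) ? missing : mean(skipmissing(g)))
                  for (k, g) in groups)
```

Distinct result names are used so that the `count` and `mean` functions are not shadowed.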
