Not sure how best to handle missing
#22
Yes - you are completely right. This is difficult. The difficulty arises from the interaction of lots of different little aspects of the system: the types, inference, the standard library.
If you are looking for a quick solution, you can try:
julia> (((count, sum),) -> sum / count).(groupreduce(r->r.foo, r->coalesce(r.bar, 0.0), ((count, sum), bar) -> (count + 1, sum + bar), table, init = (0, 0.0)))
3-element Dictionaries.HashDictionary{Any,Any}
2 │ 0.0
3 │ 0.567196074236143
1 │ 0.04591996082297756
It's not particularly elegant though! And it's still slow due to type instability of
julia> table = NamedTuple{(:foo, :bar), Tuple{Int, Union{Missing, Float64}}}[(foo=1, bar=rand()),
(foo=2, bar=missing),
(foo=3, bar=rand()),
(foo=1, bar=missing),
(foo=2, bar=missing),
(foo=3, bar=rand())]
6-element Array{NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}},1}:
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((1, 0.07490415420165197))
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((2, missing))
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((3, 0.1651018454906743))
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((1, missing))
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((2, missing))
NamedTuple{(:foo, :bar),Tuple{Int64,Union{Missing, Float64}}}((3, 0.23729296064354855))
but that's not reasonable and it's very much not recommended for large named tuples with more than one or two
I'm not certain of the best way we can improve this on the library side. Well, I suppose that
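To make the count-and-sum idea concrete, here is a minimal sketch that uses only Base and a plain Dict rather than groupreduce. The name group_means and the sample table are made up for illustration. One difference is worth noting: the one-liner above replaces a missing with 0.0 and still increments the count, whereas this sketch skips missings entirely, so an all-missing group yields NaN instead of 0.0.

```julia
# Hypothetical helper: per-group mean of the non-missing values,
# carried as a (count, sum) pair per key.
function group_means(by, f, itr)
    acc = Dict{Any, Tuple{Int, Float64}}()
    for x in itr
        k = by(x)
        n, s = get(acc, k, (0, 0.0))
        v = f(x)
        # A missing value leaves the accumulator for this key unchanged.
        acc[k] = ismissing(v) ? (n, s) : (n + 1, s + v)
    end
    return Dict(k => s / n for (k, (n, s)) in acc)
end

# Sample data invented for this sketch.
table = [(foo = 1, bar = 1.0), (foo = 2, bar = missing), (foo = 3, bar = 2.0),
         (foo = 1, bar = missing), (foo = 2, bar = missing), (foo = 3, bar = 4.0)]

means = group_means(r -> r.foo, r -> r.bar, table)
means[1]         # 1.0
means[3]         # 3.0
isnan(means[2])  # true: no non-missing values in this group
```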
I think it should be easy once you have a mutate-or-widen interface for Dictionaries.jl. Here is an example with plain
using BangBang
using BangBang.Experimental: modify!!
using BangBang.NoBang: SingletonVector
using InitialValues
using InitialValues: InitialValue
function groupreduce_bb(by, f, op, itr; init = Init(op))
    # Fold over itr, updating (or widening) the Dict entry for each key.
    acc = foldl(itr; init = Dict{Union{},Union{}}()) do acc, x
        acc, _ = modify!!(acc, by(x)) do iacc
            Some(op(something(iacc, init), f(x)))
        end
        return acc
    end
    # Drop entries that still hold the placeholder initial value.
    if init isa InitialValue && InitialValue <: valtype(acc)
        return Dict(k => v for (k, v) in acc if v !== init)
    else
        return acc
    end
end
group_bb(by, f, itr) =
    groupreduce_bb(by, x -> SingletonVector(f(x)), append!!, itr; init = Init(append!!))
julia> group_bb(r->r.foo, r->r.bar, table)
Dict{Int64,AbstractArray{T,1} where T} with 3 entries:
2 => [missing, missing]
3 => [0.0756208, 0.745847]
1 => Union{Missing, Float64}[0.37734, missing]
(OK, "easy" in the sense it's possible now that I fixed a bug JuliaFolds/BangBang.jl#145)
Transducers.jl uses a similar strategy but in a more thread-friendly way.
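For readers who don't know the term, the mutate-or-widen idea can be sketched with Base alone. This is only an illustration under my own assumptions, not how BangBang.jl implements it; push_widen and group_widen are hypothetical names, and the table is sample data.

```julia
# Hypothetical helper: push x into xs in place when the element type allows it,
# otherwise copy into a new vector with a wider element type.
function push_widen(xs::Vector, x)
    T = eltype(xs)
    x isa T && return push!(xs, x)              # mutate
    ys = Vector{Union{T, typeof(x)}}(undef, length(xs) + 1)
    copyto!(ys, xs)                             # widen
    ys[end] = x
    return ys
end

# Group values by key; each group's element type is only as wide as its contents.
function group_widen(by, f, itr)
    groups = Dict{Any, Vector}()
    for x in itr
        k, v = by(x), f(x)
        groups[k] = haskey(groups, k) ? push_widen(groups[k], v) : [v]
    end
    return groups
end

# Sample data invented for this sketch.
table = [(foo = 1, bar = 1.0), (foo = 2, bar = missing), (foo = 3, bar = 2.0),
         (foo = 1, bar = missing), (foo = 2, bar = missing), (foo = 3, bar = 4.0)]

groups = group_widen(r -> r.foo, r -> r.bar, table)
eltype(groups[1])  # Union{Missing, Float64}
eltype(groups[2])  # Missing
eltype(groups[3])  # Float64
```

As in the output above, the all-missing group still ends up with an element type that says nothing about Float64.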
I don't think adding a flag is the best strategy in terms of composability. I think this is where you really need transducers.
julia> using SplitApplyCombine
using Transducers
using OnlineStats: Mean
julia> rf0 = reducingfunction(Mean())
rf = reducingfunction(NotA(Missing), rf0)
groupreduce(r->r.foo, r->r.bar, rf, table; init = Init(rf0))
3-element Dictionaries.HashDictionary{Any,Union{InitialValues.InitialValueOf{Transducers.OnlineStatReducingFunction{Mean{Float64,OnlineStatsBase.EqualWeight}}}, Mean{Float64,OnlineStatsBase.EqualWeight}}}
2 │ Init(::Transducers.OnlineStatReducingFunction{Mean{Float64,OnlineStatsBase.EqualWeight}})
3 │ Mean: n=2 | value=0.410734
1 │ Mean: n=1 | value=0.37734
I think it should be possible to make it work more easily, without adding Transducers.jl-aware code in SplitApplyCombine.jl, if
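The composability argument can be shown without any packages. The sketch below is a rough Base-only analogue of wrapping a reducing function with NotA(Missing); it is not the Transducers.jl API, and skip_missing and mean_step are names I made up for it.

```julia
# Hypothetical helper: wrap a reducing step so that missing inputs are ignored.
skip_missing(step) = (acc, x) -> ismissing(x) ? acc : step(acc, x)

# Running mean carried as a (count, mean) pair.
mean_step((n, m), x) = (n + 1, m + (x - m) / (n + 1))

rf = skip_missing(mean_step)
with_values = foldl(rf, [0.5, missing, 1.5]; init = (0, 0.0))  # (2, 1.0)
all_missing = foldl(rf, [missing, missing]; init = (0, 0.0))   # (0, 0.0)
```

The filter is applied to the reducing step rather than to the grouping function, so the grouping code needs no skip-missing flag. An all-missing group simply keeps its initial state, much like the Init(...) entry for key 2 in the output above.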
I'm not actually sure where the right place on the stack is to fix this, because it seems to cut across several layers.
Here's an example. Say I want to get the mean of each group, ignoring missings. Notice that for the foo=2 group, both elements are missing.
This throws the error
MethodError: no method matching zero(::Type{Any})
This is the result of a cascade of things, most of which seem pretty reasonable in isolation, which is why it's not clear (to me anyways) what the right fix is:
- mean doesn't know how to handle an empty array Any[]. I don't think there's anything more reasonable for mean to do here.
- table doesn't have useful type information (see "type promotion of missing inside tuples", JuliaLang/julia#31077).
- group seems to set the type of the dictionary elements based on the eltype of table.
I'm not sure if there's a good resolution to this. Even if group built up the groups iteratively rather than pre-allocating, for a group with only missings it would end up with an Array{Missing}, which still doesn't help mean figure out what a reasonable answer is.
My current workaround is to re-inject the type information, but it took some digging to figure out what the actual problem was, and it is not pretty:
Another workaround is setting the type of table explicitly:
But that gets pretty verbose.
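As a small illustration of the difference an explicit element type makes (again a sketch, not the snippet originally posted; Row is an alias introduced here and the rows are sample data):

```julia
# Without an annotation, the array literal's element type is not concrete.
untyped = [(foo = 1, bar = 0.5), (foo = 2, bar = missing)]

# Row is a hypothetical alias introduced for this sketch.
Row = NamedTuple{(:foo, :bar), Tuple{Int, Union{Missing, Float64}}}
typed = Row[(foo = 1, bar = 0.5), (foo = 2, bar = missing)]

isconcretetype(eltype(untyped))  # false
isconcretetype(eltype(typed))    # true
fieldtype(eltype(typed), :bar)   # Union{Missing, Float64}
```

With the typed version, code that reads bar can see that the non-missing values are Float64, even for rows where bar is missing.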
Any thoughts as to the best way to handle this?