Make `view(::AbstractWeights, ...)` return an `AbstractWeights` #723

nalimilan · 2021-10-05T18:00:18Z

This is necessary to preserve the information regarding the type of weights.

Fixes #719 and #561.

oxinabox

LGTM and makes sense.

I totally forgot that issue.

I assume the math for this kind of slicing was already checked when getindex for this kind of slicing was permitted?

Looking at https://github.com/JuliaStats/StatsBase.jl/blob/2faa6e80b7966b915086d8cd5a4a4d89a2126db5/src/moments.jl
I am not sure that this kind of slicing should be allowed for ProbabilityWeights.
Is it still a valid ProbabilityWeights if you slice it in this way?

But I am not at all an expert on this math, whereas I assume you are.
So if you've thought it through then this should be all good.

nalimilan · 2021-10-05T18:22:40Z

Yes it's fine for frequency weights and analytic weights. For probability weights, it's a complex matter, but it's better to allow people to use weights for a subsample than to completely disallow it (otherwise you wouldn't be able to use them in the presence of missing values at all).

bkamins · 2021-10-05T18:41:12Z

Which of the functions we provide are ProbabilityWeights-aware?

If I understand things correctly, a ProbabilityWeights vector just denotes the inverse probabilities of sampling the given values. If you subset such a vector, these inverse probabilities do not change (the population from which you have drawn the sample remains the same; you just have a smaller sample from it). @nalimilan, is there anything more to it? (I did not take part in the discussions when ProbabilityWeights were designed.)

nalimilan · 2021-10-05T20:36:50Z

> Which of the functions we provide are ProbabilityWeights-aware?

Essentially var with corrected=true. Ideally GLM.jl will also support them in addition to frequency weights (that's easy, just divide them by their mean).

> If I understand things correctly, a ProbabilityWeights vector just denotes the inverse probabilities of sampling the given values. If you subset such a vector, these inverse probabilities do not change (the population from which you have drawn the sample remains the same; you just have a smaller sample from it). @nalimilan, is there anything more to it? (I did not take part in the discussions when ProbabilityWeights were designed.)

Well, that's right if you take a random subsample, but if you select a nonrandom subsample (which is the most common case), the weights will not be exactly correct IIUC. But in practice it's better to use somewhat incorrect weights than no weights at all (often the difference won't be that large). Also, if you take a subsample based on a variable which was used as a stratum to construct the weights, then the result will be OK. The correct way to handle this in general is to use software designed to take into account complex survey designs (like design in R or svy in Stata). See for example https://www.restore.ac.uk/PEAS/subgroups.php.

bkamins · 2021-10-07T08:22:51Z

We also need to take care of the fact that an AbstractWeights must be an AbstractVector, so we need to make sure we handle the following correctly:

julia> w = Weights([1,2,3])
3-element Weights{Int64, Int64, Vector{Int64}}:
 1
 2
 3

julia> vw = @view w[1]
0-dimensional view(::Weights{Int64, Int64, Vector{Int64}}, 1) with eltype Int64:
1

This expression drops a dimension, so vw cannot be an AbstractWeights (maybe such views should be disallowed?).

bkamins · 2021-10-07T08:29:41Z

Relatedly, similar on Weights produces a Vector; not sure if this is intended.

bkamins · 2021-10-07T08:51:05Z

Maybe similar is OK, given that parent returns the underlying vector, so Weights is a kind of view already?

nalimilan · 2021-10-09T14:00:06Z

As discussed on Slack, it turns out that this PR in its current state has the problem that mutating the view will corrupt the parent, as its sum won't be updated. In addition to being dangerous, technically, this is breaking even though it's not very likely that people rely on it. The only solution I see to avoid breaking something is to have view return an AbstractWeights{..., SubArray{..., <: AbstractWeights}}. That's a mouthful, but it doesn't create any problems AFAICT.
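For readers who missed the Slack discussion, the stale-sum problem can be sketched with a toy type. This is only an illustration under stated assumptions: `ToyWeights` and `naive_view` are hypothetical names, not the StatsBase implementation; the sketch only assumes that a weights vector caches its sum next to its values and that the view wraps a view of the parent's storage.

```julia
# Hypothetical toy type (NOT the StatsBase implementation): a vector of
# weights that caches its sum alongside the values.
mutable struct ToyWeights{T<:Real, V<:AbstractVector{T}} <: AbstractVector{T}
    values::V
    sum::T
end
ToyWeights(v::AbstractVector{T}) where {T<:Real} =
    ToyWeights{T, typeof(v)}(v, sum(v))

Base.size(w::ToyWeights) = size(w.values)
Base.getindex(w::ToyWeights, i::Int) = w.values[i]

# Writing through the wrapper keeps that wrapper's own cached sum in sync.
function Base.setindex!(w::ToyWeights, x, i::Int)
    w.sum += x - w.values[i]
    w.values[i] = x
    return w
end

# Naive view (hypothetical helper): a new weights object wrapping a view
# of the parent's storage, with its own cached sum.
naive_view(w::ToyWeights, inds) = ToyWeights(view(w.values, inds))

w = ToyWeights([1, 2, 3])
vw = naive_view(w, 1:2)

# Mutating the view writes into the parent's storage, but only the view's
# cached sum is updated; the parent's cached sum is now stale.
vw[1] = 10
stale_parent_sum = w.sum             # still the sum cached at construction
actual_parent_sum = sum(w.values)    # reflects the write made through the view
```

After the write, `stale_parent_sum` and `actual_parent_sum` disagree, which is the corruption described above.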

We could consider making weight vectors immutable (again) in the next breaking release to simplify this.

nalimilan · 2021-11-07T15:04:37Z

I've pushed a commit to make view return an AbstractWeights view into the AbstractWeights. In addition to that change, to ensure that mutating the parent doesn't make the view's precomputed sum inconsistent with its contents, I added a sum method which always recomputes the sum from the actual contents. In theory we could avoid this by storing a state in the parent indicating that it was mutated since the last time sum was called on a view. Not sure whether it's worth it.
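As a rough sketch of the "always recompute for views" idea (hypothetical names throughout; `LazySumWeights`, `lazy_view` and `weights_sum` are illustrative and not the code in this PR), the cached sum can be stored as `missing` for views and recomputed on demand:

```julia
# Hypothetical toy type: `sum` is either a cached value or `missing`,
# where `missing` means "recompute from the contents on every call".
struct LazySumWeights{T<:Real, V<:AbstractVector{T}} <: AbstractVector{T}
    values::V
    sum::Union{T, Missing}
end
LazySumWeights(v::AbstractVector{T}) where {T<:Real} =
    LazySumWeights{T, typeof(v)}(v, sum(v))

Base.size(w::LazySumWeights) = size(w.values)
Base.getindex(w::LazySumWeights, i::Int) = w.values[i]

# A view never caches its sum, since the parent may be mutated later.
function lazy_view(w::LazySumWeights{T}, inds) where {T}
    v = view(w.values, inds)
    return LazySumWeights{T, typeof(v)}(v, missing)
end

# Use the cached sum when there is one, otherwise recompute it.
weights_sum(w::LazySumWeights) = ismissing(w.sum) ? sum(w.values) : w.sum

storage = [1.0, 2.0, 3.0]
w = LazySumWeights(storage)
vw = lazy_view(w, 2:3)

sum_before = weights_sum(vw)   # computed from the current contents
storage[3] = 10.0              # mutate the underlying storage
sum_after = weights_sum(vw)    # recomputed, so it sees the new value
```

The cost is one pass over the data per `weights_sum` call on a view, which is the trade-off mentioned above.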

I also added another commit making copy and wv[:] return an AbstractWeights object.

bkamins

I have left some comments.

nalimilan · 2021-11-11T16:40:00Z

I've adapted testsets so that they loop over views of weights for all tests that cover weights. This makes the code more complex but that's probably worth it if we want to fully support views of weights.

nalimilan · 2021-11-11T17:21:08Z

Ah, UnitWeights are not tested and don't work...

bkamins · 2021-11-11T19:25:21Z

> Ah, UnitWeights are not tested and don't work...

Do you want me to review now, or should I wait until you work on this?

nalimilan · 2021-11-12T16:42:08Z

As you prefer. The change to support UnitWeights will probably be distinct so you can start reviewing the current code if you want.

bkamins · 2021-11-13T06:43:28Z

src/weights.jl

+    {S <: Real, W <: AbstractWeights{S}}
+    @boundscheck checkbounds(wv, inds...)
+    @inbounds v = invoke(view, Tuple{AbstractArray, Vararg{Any}}, wv, inds...)
+    weightstype(W){S, eltype(wv), typeof(v)}(v, missing)


Suggested change

-    weightstype(W){S, eltype(wv), typeof(v)}(v, missing)
+    return weightstype(W){S, eltype(wv), typeof(v)}(v, missing)

bkamins · 2021-11-13T06:45:37Z

src/weights.jl

+    @inbounds invoke(view, Tuple{AbstractArray, Vararg{Any}}, wv, inds...)
+end
+
+# Always recompute the sum for views of AbstractWeights, as we cannot know whether


Maybe move the definitions of sum and copy into one place, so that the reader can see both definitions side by side?

bkamins · 2021-11-13T06:49:22Z

OK. I have commented. I think it is a good idea to use a separate missing value when sum is missing.

bkamins · 2021-11-13T06:58:34Z

Maybe the only comment is that it would be even safer and a bit faster to have s::S2, where S2 would be S for a non-view and Missing for a view, instead of having sum::Union{S, Missing} (but it would complicate the code, so I am not sure it is worth it).
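A minimal sketch of what such a type parameter could look like (hypothetical names; `ParamSumWeights`, `param_view` and `weights_sum` are illustrative only and not proposed code for this PR):

```julia
# Hypothetical toy type: the type of the stored sum is a parameter, so a
# non-view has S2 == T and a view has S2 == Missing.
struct ParamSumWeights{T<:Real, V<:AbstractVector{T}, S2<:Union{T, Missing}} <: AbstractVector{T}
    values::V
    s::S2
end
ParamSumWeights(v::AbstractVector{T}) where {T<:Real} =
    ParamSumWeights{T, typeof(v), T}(v, sum(v))

Base.size(w::ParamSumWeights) = size(w.values)
Base.getindex(w::ParamSumWeights, i::Int) = w.values[i]

# Views carry no cached sum at all; this is encoded in the type.
function param_view(w::ParamSumWeights{T}, inds) where {T}
    v = view(w.values, inds)
    return ParamSumWeights{T, typeof(v), Missing}(v, missing)
end

# Dispatch picks the right method, so no runtime check for `missing` is needed.
weights_sum(w::ParamSumWeights) = w.s
weights_sum(w::ParamSumWeights{T, V, Missing}) where {T<:Real, V<:AbstractVector{T}} =
    sum(w.values)

w = ParamSumWeights([1.0, 2.0, 3.0])
vw = param_view(w, 2:3)
full_sum = weights_sum(w)    # uses the cached field
view_sum = weights_sum(vw)   # recomputed from the contents
```

Compared with a `Union{S, Missing}` field, the field type is concrete for each instance, at the price of an extra type parameter in every signature.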

nalimilan · 2021-11-14T12:17:22Z

Actually, I wonder whether it wouldn't be better to keep returning SubArrays of AbstractWeights and change dispatch to allow Union{AbstractWeights, SubArray{..., <: AbstractWeights}} instead. This would avoid adding tricky code and AFAICT it would give the same result since sum has to recompute the sum on every call for views anyway. The only drawback would be that dispatching on AbstractWeights would not be enough (a bit like what we have with CatArrOrSub in CategoricalArrays).

bkamins · 2021-11-14T14:05:19Z

As you prefer. Still we would need to make sure the view is one dimensional, as currently one can do:

julia> x = Weights(1:3)
3-element Weights{Int64, Int64, UnitRange{Int64}}:
 1
 2
 3

julia> view(x, 1)
0-dimensional view(::Weights{Int64, Int64, UnitRange{Int64}}, 1) with eltype Int64:
1

julia> view(x, 1:1, 1:1)
1×1 view(reshape(::Weights{Int64, Int64, UnitRange{Int64}}, 3, 1), 1:1, 1:1) with eltype Int64:
 1

nalimilan · 2021-11-14T14:10:48Z

Yes, we would only change signatures to accept Union{AbstractWeights, SubArray{<:Real, 1, <:AbstractWeights}}.

bkamins · 2021-11-14T15:53:01Z

Yes, then Union{AbstractWeights, SubArray{<:Real, 1, <:AbstractWeights}} should probably be defined with some intuitive name and exported, as this should be the signature used by packages.
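To illustrate what such a named alias could look like, here is a sketch with a toy type. `ToyWeights`, `ToyWeightsOrSub` and `accepts` are hypothetical names, not anything defined in StatsBase. Note that the parent parameter is written `<:ToyWeights`: type parameters are invariant, so a `SubArray` whose parent is the concrete `ToyWeights{Int}` would not match a bare `ToyWeights` in that position.

```julia
# Hypothetical toy weights type, used only to illustrate the alias.
struct ToyWeights{T<:Real} <: AbstractVector{T}
    values::Vector{T}
end
Base.size(w::ToyWeights) = size(w.values)
Base.getindex(w::ToyWeights, i::Int) = w.values[i]

# One-dimensional weights, or a one-dimensional view into weights.
const ToyWeightsOrSub = Union{ToyWeights, SubArray{<:Real, 1, <:ToyWeights}}

accepts(::ToyWeightsOrSub) = true
accepts(::Any) = false

w = ToyWeights([1, 2, 3])

ok_plain = accepts(w)               # a weights vector itself
ok_view  = accepts(view(w, 1:2))    # a one-dimensional view
ok_0dim  = accepts(view(w, 1))      # a zero-dimensional view is rejected

# Without `<:` on the parent parameter the one-dimensional view does not match.
matches_invariant = view(w, 1:2) isa SubArray{<:Real, 1, ToyWeights}
```

Fixing the dimension parameter to 1 is what excludes the zero-dimensional `view(x, 1)` case shown earlier.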

Make view(::AbstractWeights, ...) return an AbstractWeights

a29e523

oxinabox approved these changes Oct 5, 2021

bkamins approved these changes Oct 5, 2021

nalimilan added 3 commits November 7, 2021 14:21

Make view return a SubArray of AbstractWeights

04dd238

Make copy preserve weights type

192ccca

Revert unrelated changes

91208f4

nalimilan added 2 commits November 7, 2021 16:19

Fixes

efa81b6

More fixes

95b3650