Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make view(::AbstractWeights, ...) return an AbstractWeights #723

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

nalimilan
Copy link
Member

This is necessary to preserve the information regarding the type of weights.

Fixes #719 and #561.

Cc: @bkamins, @oxinabox

This is necessary to preserve the information regarding the type of weights.
Copy link
Contributor

@oxinabox oxinabox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and makes sense.

I totally forgot that issue.

I assume the math for this kind of slicing already have been checked when getindex for this kind of slicing was permitted?

Looking at https://github.com/JuliaStats/StatsBase.jl/blob/2faa6e80b7966b915086d8cd5a4a4d89a2126db5/src/moments.jl
I am not sure that this kind of slicing should be allowed for ProbabilityWeights ?
Is it still a valid ProbabilityWeights if you slice it in this way?

But I am not at all an expert on this math; where as I assume you are.
So if you've thought it through then this should be all good.

@nalimilan
Copy link
Member Author

Yes it's fine for frequency weights and analytic weights. For probability weights, it's a complex matter, but it's better to allow people to use weights for a subsample than to completely disallow it (otherwise you wouldn't be able to use them in the presence of missing values at all).

@bkamins
Copy link
Contributor

bkamins commented Oct 5, 2021

What functions we provide are ProbabilityWeights aware?

If I understand things correctly ProbabilityWeights vector just denotes what were the inverse probabilities of sampling given values. If you subset such a vector these inverse probabilities do not change (the population from which you have drawn the sample remains the same - you just have a smaller sample from it). @nalimilan - is there anything more to it (I have not taken part in the discussions when ProbabilityWeights were designed)

@nalimilan
Copy link
Member Author

What functions we provide are ProbabilityWeights aware?

Essentially var with corrected=true. Ideally GLM.jl will also support them in addition to frequency weights (that's easy, just divide them by their mean).

If I understand things correctly ProbabilityWeights vector just denotes what were the inverse probabilities of sampling given values. If you subset such a vector these inverse probabilities do not change (the population from which you have drawn the sample remains the same - you just have a smaller sample from it). @nalimilan - is there anything more to it (I have not taken part in the discussions when ProbabilityWeights were designed)

Well that's right if you take a random subsample, but if you select a nonrandom subsample (which is the most common case) weights will not be exactly correct IIUC. But in practice it's better to use somewhat incorrect weights than no weights at all (often the difference won't be that large). Also, if you take a subsample based on a variable which was used as a strata to construct the weights then the result will be OK. The correct way to handle this in general is to use software designed to take into account complex survey designs (like design in R or svy in Stata). See for example https://www.restore.ac.uk/PEAS/subgroups.php.

@bkamins
Copy link
Contributor

bkamins commented Oct 7, 2021

We also need to take care of the fact that AbstractWeigths must be AbstractVector, so we need to make sure we handle the following correctly:

julia> w = Weights([1,2,3])
3-element Weights{Int64, Int64, Vector{Int64}}:
 1
 2
 3

julia> vw = @view w[1]
0-dimensional view(::Weights{Int64, Int64, Vector{Int64}}, 1) with eltype Int64:
1

as this expression drops a dimension, so vw cannot be AbstractWeights (maybe such views should be disallowed?).

@bkamins
Copy link
Contributor

bkamins commented Oct 7, 2021

Relatedly similar on Weights produces a Vector - not sure if this is intended.

@bkamins
Copy link
Contributor

bkamins commented Oct 7, 2021

Maybe similar is OK, given that parent returns the underlying vector, so Weights is a kind of view already?

@nalimilan
Copy link
Member Author

As discussed on Slack, it turns out that this PR in its current state has the problem that mutating the view will corrupt the parent, as its sum won't be updated. In addition to being dangerous, technically, this is breaking even though it's not very likely that people rely on it. The only solution I see to avoid breaking something is to have view return an AbstractWeights{..., SubArray{..., <: AbstractWeights}}. That's a mouthful, but it doesn't create any problems AFAICT.

We could consider making weight vectors immutable (again) in the next breaking release to simplify this.

@nalimilan
Copy link
Member Author

I've pushed a commit to make view return an AbstractWeights view into the AbstractWeights. In addition to that change, to ensure that mutating the parent doesn't make the view's precomputed sum inconsistent with its contents, I added a sum method which always recomputes the sum from the actual contents. In theory we could avoid this by storing a state in the parent indicating that it was mutated since the last time sum was called on a view. Not sure whether it's worth it.

I also added another commit making copy and wv[:] return an AbstractWeights object.

src/weights.jl Show resolved Hide resolved
src/weights.jl Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Outdated Show resolved Hide resolved
src/weights.jl Show resolved Hide resolved
Copy link
Contributor

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left some comments.

@nalimilan
Copy link
Member Author

I've adapted testsets so that they loop over views of weights for all tests that cover weights. This makes the code more complex but that's probably worth it if we want to fully support views of weights.

@nalimilan
Copy link
Member Author

Ah, UnitWeights are not tested and don't work...

@bkamins
Copy link
Contributor

bkamins commented Nov 11, 2021

Ah, UnitWeights are not tested and don't work...

Do you want me to review now, or should I wait until you work on this?

@nalimilan
Copy link
Member Author

As you prefer. The change to support UnitWeights will probably be distinct so you can start reviewing the current code if you want.

{S <: Real, W <: AbstractWeights{S}}
@boundscheck checkbounds(wv, inds...)
@inbounds v = invoke(view, Tuple{AbstractArray, Vararg{Any}}, wv, inds...)
weightstype(W){S, eltype(wv), typeof(v)}(v, missing)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
weightstype(W){S, eltype(wv), typeof(v)}(v, missing)
return weightstype(W){S, eltype(wv), typeof(v)}(v, missing)

@inbounds invoke(view, Tuple{AbstractArray, Vararg{Any}}, wv, inds...)
end

# Always recompute the sum for views of AbstractWeights, as we cannot know whether
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe move the definitions of sum and copy into one place to make sure the reader can see both definitions side by side?

@bkamins
Copy link
Contributor

bkamins commented Nov 13, 2021

OK. I have commented. I think it is a good idea to use a separate missing value when sum is missing.

@bkamins
Copy link
Contributor

bkamins commented Nov 13, 2021

Maybe the only comment is that it would be even safer and a bit faster (but it would complicate code so I am not sure it is worth it) to instead of having sum::Union{S, Missing} have s::S2 where S2 would be S for non-view and Missing for view.

@nalimilan
Copy link
Member Author

Actually, I wonder whether it wouldn't be better to keep returning SubArrays of AbstractWeights and change dispatch to allow Union{AbstractWeights, SubArray{..., <: AbstractWeights}} instead. This would avoid adding tricky code and AFAICT it would give the same result since sum has to recompute the sum on every call for views anyway. The only drawback would be that dispatching on AbstractWeights would not be enough (a bit like what we have with CatArrOrSub in CategoricalArrays).

@bkamins
Copy link
Contributor

bkamins commented Nov 14, 2021

As you prefer. Still we would need to make sure the view is one dimensional, as currently one can do:

julia> x = Weights(1:3)
3-element Weights{Int64, Int64, UnitRange{Int64}}:
 1
 2
 3

julia> view(x, 1)
0-dimensional view(::Weights{Int64, Int64, UnitRange{Int64}}, 1) with eltype Int64:
1

julia> view(x, 1:1, 1:1)
1×1 view(reshape(::Weights{Int64, Int64, UnitRange{Int64}}, 3, 1), 1:1, 1:1) with eltype Int64:
 1

@nalimilan
Copy link
Member Author

Yes, we would only change signatures to accept Union{AbstractWeights, SubArray{<:Real, 1, AbstractWeights}}.

@bkamins
Copy link
Contributor

bkamins commented Nov 14, 2021

Yes, then Union{AbstractWeights, SubArray{<:Real, 1, AbstractWeights}} should probably be defined with some intuitive name and exported as this should be the signature used by packages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Views of weight vectors should be weight vectors of views
3 participants