-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Empirical CDF enhancements #590
base: master
Are you sure you want to change the base?
Conversation
use the custom broadcast implementation of ecdf.(v)
precalculates the partial weigths sums
linearly interpolates cdf between adjacent values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long delay. This looks nice. I can't comment on the math but here are remarks about the code.
@@ -121,8 +121,8 @@ aweights(vs::RealArray) = AnalyticWeights(vec(vs)) | |||
s = w.sum | |||
|
|||
if corrected | |||
sum_sn = sum(x -> (x / s) ^ 2, w) | |||
1 / (s * (1 - sum_sn)) | |||
sum_w2 = sum(abs2, w) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't this here to avoid overflow?
|
||
### Deprecate August 2020 (v0.33) | ||
function (ecdf::ECDF)(v::RealArray) | ||
depwarn("(ecdf::ECDF)(v::RealArray) is deprecated, use `ecdf.(v)` broadcasting instead", "(ECDF::ecdf)(v::RealArray)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More common, and possibly faster:
depwarn("(ecdf::ECDF)(v::RealArray) is deprecated, use `ecdf.(v)` broadcasting instead", "(ECDF::ecdf)(v::RealArray)") | |
depwarn("(ecdf::ECDF)(v::RealArray) is deprecated, use `ecdf.(v)` broadcasting instead", :ecdf) |
@@ -1,68 +1,142 @@ | |||
# Empirical estimation of CDF and PDF | |||
|
|||
## Empirical CDF | |||
""" | |||
Empirical Cumulative Distribution Function (ECDF). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empirical Cumulative Distribution Function (ECDF). | |
ECDF{T <: Real, W <: Real, I} | |
Empirical Cumulative Distribution Function (ECDF). |
# - `ECDF(x[i])` | ||
# - `1/(x[i+1] - x[i])` | ||
# - `ECDF(x[i+1]) - ECDF(x[i])` (the weight of `x[i+1]`) | ||
sorted_values::Vector{Tuple{T, W, W, W}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it less efficient to store values as a vector of tuples instead of as several vectors? In terms of memory use AFAIK alignment may require padding between entries.
r[ord[i]] = weightsum | ||
i += 1 | ||
# broadcasts ecdf() over an array | ||
# caches the last calculated value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "caches"?
any(isnan, X) && throw(ArgumentError("ecdf can not include NaN values")) | ||
isempty(weights) || length(X) == length(weights) || throw(ArgumentError("data and weight vectors must be the same size," * | ||
"got $(length(X)) and $(length(weights))")) | ||
evenweights = isnothing(weights) || isempty(weights) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inference works better with ===
:
evenweights = isnothing(weights) || isempty(weights) | |
evenweights = weights === nothing || isempty(weights) |
"got $(length(X)) and $(length(weights))")) | ||
T = eltype(X) | ||
W0 = evenweights ? Int : eltype(weights) | ||
W = isnothing(weights) ? Float64 : eltype(one(W0)/sum(weights)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Store sum(weights)
to avoid calling it twice?
push_valprev!() = push!(sorted_vals, (valprev, min(wsumprev/wsum, one(W)), | ||
inv(val - valprev), valw/wsum)) | ||
|
||
@inbounds for i in ord | ||
valnew = X[i] | ||
if (val != valnew) || (i == last(ord)) | ||
(wsumprev > 0) && push_valprev!() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this anonymous function trigger boxing of these variables, making the loop quite slow?
@test_skip fnecdf.weights == fnecdfalt.weights | ||
@test_skip fnecdf.weights != w1 # check that w wasn't accidently modified in place | ||
@test_skip fnecdfalt.weights != w2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
show(iobuf, obj) | ||
return String(take!(iobuf)) | ||
end | ||
|
||
@testset "ECDF" begin | ||
x = randn(10000000) | ||
fnecdf = ecdf(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd change all of these to ecdf.(x)
, and copy this to deprecated.jl.
Is there anything here that's still worth it, or should I close? |
The PR improves the performance of ECDF and adds interpolation.
Base.Broadcast.broadcasted()
for providing enhanced support for vectors. The PR deprecatesecdf(v::AbstractVector)
in favor of standard dot notation. I've kept customized broadcasting, but now it just caches the CDF value of the last vector element.ecdf(..., interpolate=true)
), which is handy for continuous empirical distributions. (It just linearly interpolates between the CDFs of adjacent values).