Shapiro-Wilk normality test #124
Conversation
Thanks for contributing this. Your implementation looks good. One issue is the license: http://lib.stat.cmu.edu/apstat/ states that the routines are distributed under a restrictive clause which is not compatible with the MIT license, so either you should try to implement these directly from the paper, or try to contact the Royal Statistical Society and ask them if they have changed their mind regarding the clause. The best would be if they could use a real license.
Yep, I tried to do this, but failed obtaining the "true" values. Even obtaining the "exact" expected values of order statistics turns out to be tricky. A simple MC produces highly variable results:

function mMC(N::Int, samps=10^5)
    X = Vector{Float64}(undef, N)
    s = zeros(div(N, 2))
    for i in 1:samps
        for j in 1:N
            X[j] = randn()
        end
        sort!(X)
        for j in 1:div(N, 2)
            s[j] += X[j]
        end
    end
    return -s/samps   # expectations of the upper-half order statistics
end
d = mMC(101, 10^7) - mMC(101, 10^7);
norm(d,1)
0.0019123699000122493

More or less exact expected order statistics can be obtained by direct integration, e.g. following Royston.
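For reference, the quantity being approximated is the standard order-statistic expectation (Φ and φ denote the normal cdf and pdf):

E[X_{(r:n)}] = \frac{n!}{(r-1)!\,(n-r)!}\int_{-\infty}^{\infty} x\,\Phi(x)^{r-1}\,\bigl(1-\Phi(x)\bigr)^{n-r}\,\varphi(x)\,\mathrm{d}x.

The Royston function below integrates this with Φ and 1-Φ swapped, i.e. it returns the expectation of the r-th largest value (minus the expectation of the r-th smallest), matching the sign convention of mMC above.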
But I don't know how exact normcdf, normccdf, and quadgk are, i.e. how exact is the following method?

function Royston(n, r, logs=[log(i) for i in 1:n], o=7*sqrt(n))
    # log of the constant n!/((r-1)!(n-r)!)
    C = sum(logs) - sum(view(logs, 2:r-1)) - sum(view(logs, 2:n-r))
    # I(x) = log-density of the r-th largest normal order statistic at x
    I(x) = C + (r-1)*log(normccdf(x)) + (n-r)*log(normcdf(x)) - 1/2*(log(2π) + x^2)
    return quadgk(x -> x*exp(I(x)), -9.0 - log(n), 9.0 + log(n),
                  order=floor(Int, o),
                  maxevals=floor(Int, log(n)*10^7))
end
function mRoyston(n, o=7*sqrt(n))
    logs = [log(i) for i in 1:n]
    m = [Royston(n, r, logs, o) for r in 1:div(n, 2)]
    err = norm([i[2] for i in m], 1)   # total quadrature error estimate
    @show err
    return [i[1] for i in m]           # integral values only
end

Royston used Blom's approximation:

Blom(n, r) = -norminvcdf((r - 3/8)/(n + 1/4))
mBlom(n) = [Blom(n, r) for r in 1:div(n, 2)]

and Rahman (cited above) used a simpler one:

Rahman(n, r) = -norminvcdf(r/(n + 1))
mRahman(n) = [Rahman(n, r) for r in 1:div(n, 2)]

Comparing
I believe the direct integration to be the most accurate, so I'd lean towards regarding it as "true". Comparing the other methods against it gives:

const N = 5000
m_Ro = mRoyston(N);
m_mc = mMC(N, 10^6);
m_Bl = mBlom(N);
m_Rh = mRahman(N);
p = plot(yaxis=:log10, title="Relative to Royston",size=(800,600))
for (X, l) in [(m_mc, "MonteCarlo"),
               (m_Bl, "Blom approximation"),
               (m_Rh, "Rahman approximation")]
    plot!(abs.((X - m_Ro)./m_Ro), label=l)
end
p

Question: how accurate are normcdf/normccdf and quadgk here, i.e. can I trust the integration above?
Both should be high quality, so all-in-all I think this should be fine for computing the mean order statistics. Have you tried to compute the covariances yet? They are needed for fitting the polynomials, right?
I guess we could do that. We could also try to do this as part of a generated function that would compute everything on demand but it might make the first call pretty slow. It would be pretty cool on the other hand.
Given how inaccurately they computed the order statistics back then, I guess it would be good if we could recompute them. Another thing: do you have experience with the Weisberg-Bingham version? It is so much easier to compute, so do you know why people seem to prefer SW? Path dependency? I tried to look for power comparisons but couldn't find anything that suggested it to be significantly worse.
For covariances I plan to do the following: [DavisStephens1977] gives an algorithm to compute the covariance matrix of normal order statistics given the variance of the extreme order statistic.
But I may not have enough time to do this during this week.
We may have a look at speeds and decide when I finish the plan above; but I guess cached (or Royston's) coefficients will be faster anyway.
As a matter of fact, I'm not a statistician; I do pure math and code in Julia "after hours" ;-). But for most of my "empirical science" friends SW was the one to go to, and it seems to have the best power for small samples.
Ok, variances/covariances turned out harder than I thought.

(BTW, do You have, by any chance, access to a paper copy of Parrish's article (to scan and send me)? The digital one I have has no text layer and is of such poor quality that OCRing those tables produces errors in every row. Which, for an article to BORROW for 24h for $50, is OUTRAGEOUS.)

With these I can tell how good Royston's approximation is.

blom(n, i) = norminvcdf((i - 3/8)/(n + 1/4))
rahman(n,i) = norminvcdf(i/(n + 1))
struct Blom
    n::Int
end

struct Rahman
    n::Int
end
OrderStatistics.expectation(B::Blom) = [blom(B.n,i) for i in 1:B.n]
OrderStatistics.expectation(R::Rahman) = [rahman(R.n,i) for i in 1:R.n]
# M. Mahibbur Rahman & Z. Govindarajulu,
# A modification of the test of Shapiro and Wilk for normality
# *Journal of Applied Statistics* 24:2, 219-236, 1997
# DOI: [10.1080/02664769723828](http://dx.doi.org/10.1080/02664769723828)
function invV(N, M)
    pdfM = normpdf.(M)
    d = pdfM.^2
    dl = [-pdfM[i]*pdfM[i+1] for i in 1:N-1]
    return (N+1)*(N+2)*SymTridiagonal(d, dl)
end
function HypothesisTests.SWCoeffs(R::Rahman)
    M = expectation(R)
    A = invV(R.n, M)
    C = sqrt(M'*(A*A)*M)
    Astar = (M'*A)'/C
    return SWCoeffs(R.n, Astar[1:div(R.n, 2)])
end
function HypothesisTests.SWCoeffs(OS::NormOrderStatistic)
    m = expectation(OS)
    iV = inv(cov(OS))
    A = m'*iV./sqrt(m'*iV*iV*m)
    return SWCoeffs(OS.n, A[1, 1:div(OS.n, 2)])
end

Then:

N = 25
@time ExactA = SWCoeffs(NormOrderStatistic(N));
plot(SWCoeffs(N).A, label="Blom + 0-correlation OrdStats")
plot!(SWCoeffs(Rahman(N)).A, label="Rahman + limit inverse correlation")
plot!(-ExactA.A, label="Exact formulas, BigFloat")
Right now I cannot think of anything better than the approximation by Davis & Stephens mentioned above (using the Taylor expansion of the expectation of products of order statistics). Using BigFloats, ArbFloats, etc. will not help much, as the precision of e.g. the normal cdf routines is the limiting factor.
It took more time than I had predicted, but it turned out to be a fun project ;-)

Normal order statistics

For the implementation see https://github.com/kalmarek/HypothesisTests.jl/tree/normordstats

David-Johnson formula:

function diffG(x)
    # successive derivatives G′, G″, …, G⁽⁶⁾ of the normal quantile G = Φ⁻¹ with
    # respect to p, written in terms of x = G(p) and s = φ(x)
    s = normpdf(x)
    s² = s^2
    s⁴ = s²*s²
    x² = x^2
    x⁴ = x²*x²
    return (
        1/s,
        x/s²,
        (1 + 2x²)/(s*s²),
        x*(7 + 6x²)/s⁴,
        (7 + 46x² + 24x⁴)/(s*s⁴),
        x*(127 + 326x² + 120x⁴)/(s²*s⁴),
    )
end
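# For orientation: the leading term of the David-Johnson expansion (T1 in
# covDavidJohnson below) is, for i ≤ j,
#     Cov(X₍ᵢ₎, X₍ⱼ₎) ≈ pᵢ*qⱼ / ((n+2) * φ(Φ⁻¹(pᵢ)) * φ(Φ⁻¹(pⱼ))),
# where pᵢ = i/(n+1) and qⱼ = 1 - pⱼ; the terms T2 and T3 add corrections of
# order (n+2)⁻² and (n+2)⁻³ built from the higher derivatives returned by diffG.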
function covDavidJohnson(i::Int, j::Int, n::Int, order=3)
    if j < i
        return covDavidJohnson(j, i, n, order)
    end
    pᵢ, pⱼ = i/(n+1), j/(n+1)
    qᵢ, qⱼ = 1 - pᵢ, 1 - pⱼ
    Gᵢ = diffG(norminvcdf(pᵢ))
    Gⱼ = diffG(norminvcdf(pⱼ))
    X = 1/(n+2)
    T = Vector{eltype(Gᵢ)}()
    if order >= 1
        T1 = X*pᵢ*qⱼ*Gᵢ[1]*Gⱼ[1]
        push!(T, T1)
    end
    if order >= 2
        T2 = X^2*pᵢ*qⱼ*(
            (qᵢ-pᵢ)*Gᵢ[2]*Gⱼ[1] +
            (qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[2] +
            0.5(
                pᵢ*qᵢ*Gᵢ[3]*Gⱼ[1] +
                pⱼ*qⱼ*Gᵢ[1]*Gⱼ[3] +
                pᵢ*pⱼ*Gᵢ[2]*Gⱼ[2]
            )
        )
        push!(T, T2)
    end
    if order >= 3
        T3 = X^3*pᵢ*qⱼ*(
            -(qᵢ-pᵢ)*Gᵢ[2]*Gⱼ[1] - (qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[2] +
            ((qᵢ-pᵢ)^2 - pᵢ*qᵢ)*Gᵢ[3]*Gⱼ[1] +
            ((qⱼ-pⱼ)^2 - pⱼ*qⱼ)*Gᵢ[1]*Gⱼ[3] +
            1/2*(3(qᵢ-pᵢ)*(qⱼ-pⱼ) + pⱼ*qᵢ - 4pᵢ*qⱼ)*Gᵢ[2]*Gⱼ[2] +
            5/6*pᵢ*qᵢ*(qᵢ-pᵢ)*Gᵢ[4]*Gⱼ[1] +
            5/6*pⱼ*qⱼ*(qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[4] +
            (pᵢ*qⱼ*(qᵢ-pᵢ) + 0.5pᵢ*qᵢ*(qⱼ-pⱼ))*Gᵢ[3]*Gⱼ[2] +
            (pᵢ*qⱼ*(qⱼ-pⱼ) + 0.5pⱼ*qⱼ*(qᵢ-pᵢ))*Gᵢ[2]*Gⱼ[3] +
            1/8*(pᵢ^2*qᵢ^2*Gᵢ[5]*Gⱼ[1] + pⱼ^2*qⱼ^2*Gᵢ[1]*Gⱼ[5]) +
            1/4*(pᵢ^2*qᵢ*qⱼ*Gᵢ[4]*Gⱼ[2] + pᵢ*pⱼ*qⱼ^2*Gᵢ[2]*Gⱼ[4]) +
            1/12*(2pᵢ^2*qⱼ^2 + 3pᵢ*pⱼ*qᵢ*qⱼ)*Gᵢ[3]*Gⱼ[3]
        )
        push!(T, T3)
    end
    return sum(T)
end

I compared this to the high-precision (25d) values obtained by Parrish.

DavisStephens Covariance Matrix of Normal Order Statistics

This method uses the David-Johnson formula plus relations between correlations, etc.:

function normalizerel!(v::AbstractVector, relative::Union{Vector{Int}, UnitRange})
    Sr = sum(v[relative])
    c = (1 - Sr)/(sum(v) - Sr)
    for i in eachindex(v)
        if !(i in relative)
            v[i] *= c
        end
    end
    return v
end
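# The rescaling above exploits a known identity: for an i.i.d. N(0,1) sample the
# mean X̄ is independent of the deviations from it, hence Cov(X₍ᵢ₎, X̄) = Var(X̄) = 1/n
# and Σⱼ Cov(X₍ᵢ₎, X₍ⱼ₎) = Cov(X₍ᵢ₎, n*X̄) = 1 for every i.
# normalizerel! enforces this unit column sum while keeping the entries listed
# in `relative` (the ones already computed accurately) fixed.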
function fillcov!(V, j::Int)
    # copy the reversed j-th column to the (end-j)-th column
    @views V[j:end-j+1, end-j+1] .= reverse(V[j:end-j+1, j])
    # copy the j-th column to the j-th row
    @views V[j, j+1:end-j] .= V[j+1:end-j, j]
    # copy the (end-j)-th column to the (end-j)-th row
    @views V[end-j+1, j+1:end-j] .= V[j+1:end-j, end-j+1]
    return V
end
function covDavisStephens(OS::NormOrderStatistic{T}, V11::T=var(OS, 1)) where T
    V = zeros(T, OS.n, OS.n)
    V[1,1] = V11
    V[2,1] = moment(OS, 1, 2) - expectation(OS, 1)*expectation(OS, 2) - T(1)
    for i in 3:OS.n
        V[i, 1] = covDavidJohnson(i, 1, OS.n)
    end
    normalizerel!(view(V, :, 1), [1, 2])
    fillcov!(V, 1)
    for j in 2:div(OS.n, 2)
        V[j, j] = var(OS, j)
        for i in j+1:OS.n-j+1
            V[i, j] = covDavidJohnson(i, j, OS.n)
        end
        normalizerel!(view(V, :, j), [collect(1:j); collect(OS.n-j+1:OS.n)])
        fillcov!(V, j)
    end
    if isodd(OS.n) # the middle point
        j = div(OS.n, 2) + 1
        V[j, j] = var(OS, j)
        normalizerel!(view(V, :, j), [j])
        fillcov!(V, j)
    end
    return Symmetric(V, :U)
end
covDavisStephens(N::Int) = covDavisStephens(NormOrderStatistic(N))

Again I compared against the high-precision values of Parrish. The formula is pretty quick:

N = 2000
@time M = covDavisStephens(N)
@time invM = inv(M);
heatmap(abs.(invM./maximum(invM)), size=(800,700), yflip=true)
# 2.853262 seconds (12.26 M allocations: 365.602 MiB, 2.97% gc time)
# 0.552725 seconds (22 allocations: 62.043 MiB, 1.02% gc time)

Shapiro-Wilk coefficients

Finally the coefficients:

swcoeff(invcov, m) = ((m'*invcov)'/sqrt(m'*(invcov*invcov)*m))[1:div(length(m), 2)]

Relative to the exact (= using the high-precision covariance by Parrish) values of the SW coefficients:

N = 20
m = Float64.(expectation(NormOrderStatistic{BigFloat}(N)))
exact = -swcoeff(inv(Float64.(covd[N])), m)
p = plot(yaxis=:log10)
# plot!(exact, label="Exact (Parrish)")
plot!(abs.(SWCoeffs(N).A)./exact, label="Royston")
plot!(abs.(swcoeff(eye(N), expectation(Blom(N))))./exact, label="Weisberg-Bingham")
plot!(abs.(swcoeff(inv(covDavisStephens(N)), m))./exact, label="DavisStephens")

I still have no idea what Royston fitted his coefficients against, but that must have been pretty accurate!
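For reference, swcoeff above is the textbook definition of the Shapiro-Wilk coefficients, with m the vector of expected normal order statistics and V their covariance matrix:

a^{\top} = \frac{m^{\top} V^{-1}}{\sqrt{m^{\top} V^{-1} V^{-1} m}}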
@kalmarek Sorry for dropping the ball here. What is the current status?
I also had too much to do and couldn't finish this cleanly ;) Long story short, we have a couple of options here.
Happy to finish this, starting next week, actually (conferences, etc.)
This dedicated work on the Shapiro-Wilk test is exceptional, fantastic 🥇 Is there any news on completing the work on the SWT?
@Ph0non this effort (it was fun!) is an offspring of my frustration with the teaching I was forced to do. But if You are willing to provide documentation, I may look into finishing this, but only in December (conferences abound).
Do you need documentation like the other tests in the package, or something more (with more theoretical background or so)?
There are three things to be finished here:
I volunteer to finish 1 (and maybe 2)
Some small fixes: kalmarek#1
Hi, thanks for the work done so far. Have you checked the python implementation? I cannot remember right now, by heart, how the coefficients are computed in the python version, but I believe they are for double precision and are not covered by any copyright. It could be helpful to cross-check this info. I had also implemented a version in Fortran at some point; it could come in handy.
@albz could You point me to a python implementation? The one from scipy wraps the old fortran code, as far as I know. Right now copyright is not a problem, since I wrote everything from scratch following the paper only.
I see that people are generally interested, so here are some details. In the 1992 paper Royston fitted polynomial approximations both for the coefficients and for the normalizing transformation of the W-statistic.
I'm happy with both options here.
Note that my experiments with MC show that it cannot provide better precision than the direct computation. Then additional fitting must be provided for the distribution of the W-statistic.
You are right, the python-fortran code is in single precision. That is of little help.
@kalmarek It's really a shame that this was never completed given all the work you put into it. So hereby my encouragement to resume the work.
@andreasnoack Thanks for the encouragement ;) However unlikely it is, the SW test is still in the back of my head.
@andreasnoack in kalmarek/ShapiroWilk#1 I have the exact coefficients; to compute critical/p-values I'd need to find a transformation that brings the distribution of the W-statistic close to normal.
I used off-the-shelf Optim.jl methods to do so:

using Statistics
using Random
using Distributions
using LinearAlgebra
using Optim
using Revise
using ShapiroWilk
function sample_Wstatistic(sw::ShapiroWilk.SWCoeffs{T}, sample_size) where T
    Wstats = zeros(T, sample_size)
    tmp = similar(Wstats, length(sw))
    for i in 1:sample_size
        randn!(tmp)
        sort!(tmp)
        Wstats[i] = ShapiroWilk.Wstatistic(tmp, sw)
    end
    Wstats
end;
function normal_power_fit(
    t,
    sample,
    qs=0.005:0.005:0.995,
    normal_qs=quantile.(Normal(0, 1), qs),
)
    sample_t = similar(sample)
    for idx in 1:length(sample_t)
        sample_t[idx] = (one(eltype(sample)) - sample[idx])^t
    end
    μ, σ = mean(sample_t), std(sample_t)
    for idx in 1:length(sample_t)
        sample_t[idx] = (sample_t[idx] - μ)/σ
    end
    # @assert issorted(sample_t)
    sample_t_qs = quantile(sample_t, qs, sorted=true)
    for idx in 1:length(sample_t_qs)
        sample_t_qs[idx] = sample_t_qs[idx] - normal_qs[idx]
    end
    return norm(sample_t_qs, 2)
end

Optionally one can precompute the ψ functions first, e.g.

@time let n = 30, prec=128
    OS = ShapiroWilk.OrderStatisticsArblib.NormOrderStatistic(n, prec=prec)
    @time ShapiroWilk.OrderStatisticsArblib._precompute_ψ(n, prec=prec, R=18.0)
end

then

exponents = map(5:30) do n
    prec = 128
    qs = 0.005:0.005:0.995
    OS = ShapiroWilk.OrderStatisticsArblib.NormOrderStatistic(n, prec=prec)
    @info "Computing coefficients..."
    @time swfl = ShapiroWilk.SWCoeffs(n, Float64.(ShapiroWilk.SWCoeffs(OS)))
    @info "Sampling W-statistic..."
    @time W_sample = sort!(sample_Wstatistic(swfl, 10_000_000), rev=true)
    @info "Fitting power-transform for normality..."
    @time res = let qs = qs
        normal_qs = quantile.(Normal(0, 1), qs)
        optimize(
            x -> normal_power_fit(first(x), W_sample, qs, normal_qs),
            0.01,
            0.5,
            Brent(),
        )
    end
    @info "For n=$n optimization yielded" res
    Optim.minimizer(res)
end

yields the normalizing exponents, i.e.

┌ Info: For n=30 optimization yielded
│ res =
│ Results of Optimization Algorithm
│ * Algorithm: Brent's Method
│ * Search Interval: [0.010000, 0.500000]
│ * Minimizer: 5.386236e-02
│ * Minimum: 6.233191e-02
│ * Iterations: 12
│ * Convergence: max(|x - x_upper|, |x - x_lower|) <= 2*(1.5e-08*|x|+2.2e-16): true
└ * Objective Function Calls: 13

The whole table of these exponents is available.
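To illustrate how such a fitted exponent could be used (a rough sketch only, not part of the package; t, μ and σ denote the fitted exponent and the mean/std of the power-transformed W-sample for a given n):

using Distributions

# Approximate upper-tail p-value for an observed W, assuming (1 - W)^t is
# close to normal with mean μ and std σ under the hypothesis of normality.
function approx_pvalue(W, t, μ, σ)
    z = ((1 - W)^t - μ)/σ      # power-transform, then standardize
    return ccdf(Normal(), z)   # small W (evidence against normality) ⇒ large z ⇒ small p
end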
Was just looking into Shapiro-Wilk, am I reading it right that it's pretty close?
@BioTurboNick You can already use the code from this branch directly (well, the only thing that doesn't work is censored samples) for an implementation based on Royston's approximation of the S-W test (updating to modern HypothesisTests will require a bit of coding, but not much). What is (relatively) close is a new approximation that I'm working on in my spare time ;) Whether it will make a difference, I don't know, but it will be much closer to "mathematical correctness" (whatever that means :). EDIT: just to let you know, everybody else (R, scipy) is using the same old fortran implementation, which will be comparable to what is in this branch.
@andreasnoack or anyone else in this thread ;) I'd need your advice. I have the SW coefficients computed; what is left is the critical levels for the W-statistic.
After looking closely at MC samples I arrived at this:
(The transition Parrish → Royston is determined based on the MSE over ~200 uniformly distributed quantiles.) What do you think? I asked about the shape of the W-distributions on discourse: https://discourse.julialang.org/t/can-anyone-tell-me-what-is-this-distribution/53726 but got no input ;)
Ok, it seems that I was fitting the Normal distribution wrong ;) Box-Cox transform using the fitted λ values (attached):
λ_7:140.pdf
@andreasnoack As I really have very little knowledge about statistics/fitting models/parameter estimation, I need someone knowledgeable to have a look at this and provide a stamp of approval ;)
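For reference, the one-parameter Box-Cox transform mentioned above is

y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0,\\ \log y, & \lambda = 0, \end{cases}

and the attachment presumably lists the fitted λ for each sample size n in 7:140.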
@kalmarek @andreasnoack just curious if this is still being looked into/implemented
@jarredclloyd this kind of stalled because I was not able to produce meaningful fits (except eyeballing plots and saying: this looks seriously gooood...). If you're willing to help (and have more experience with this), be my guest.
Definitely willing to help, but I think the specific problem you are trying to overcome might be beyond my level of knowledge currently.
@nalimilan thanks for your review and patience. It seems ready to be merged. Let's celebrate 🎉 (and don't forget to squash ;P)
src/shapiro_wilk.jl
Outdated
struct ShapiroWilkCoefs <: AbstractVector{Float64}
    N::Int
    A::Vector{Float64}
end
Why is it necessary to construct a new array type? Isn't a function that constructs a Vector{Float64} with the desired coefficients A, and a function that performs the desired computation with them, simpler?
Generally, it's quite non-trivial to implement the AbstractArray interface correctly, and most operations with the resulting array type will fall back to the generic implementations in Base and LinearAlgebra instead of exploiting optimizations for Vector{Float64}.
It is not necessary; SW coeffs are antisymmetric and we're computing and storing just half of them. It'd be easy enough to collect them, store them in ShapiroWilkCoefs and use them internally (we might gain something performance-wise by this, but only when they are reused). Distinguishing them from Vector{Float64} makes the interface and signatures clearer imo.
@devmotion Was this just a question or rather a suggestion for change? If it is the latter, I'd appreciate a set of concrete suggestions ;)
It started as a question but now I'd actually suggest changing the implementation.
Note that I did not mean to change the efficiency of the algorithm; I don't think you necessarily have to store all coefficients. I was suggesting to store and use the A field directly in the test. More concretely, make the A coefficients a field of the test struct, and then reimplement unsafe_swstat in a way that takes the vector of coefficients directly (instead of providing its wrapper).
@devmotion What is the benefit of dropping this abstraction in favour of a plain Vector{Float64}? On the performance side, since ShapiroWilkCoefs is immutable, its creation will most probably be elided by the compiler; on the programmer side it makes clear what kind of vector must be passed, and that it is not an arbitrary Vector{Float64}. Users may pass the coefficients for re-use, and I find it very fitting to provide a thin (no-cost?) wrapper to store the coefficients.
After doing a bit of benchmarking, I think we gain very little by storing only half of the coefficients and we lose dispatch to BLAS.dot (the generic one isn't that much slower though). I'm fine changing the implementation to store the whole vector. I'd rather keep ShapiroWilkCoefs though.
If we store the whole vector, I agree with @devmotion that we may as well take a plain Vector{Float64}. Wrapping values in ShapiroWilkCoefs doesn't add anything.
I'd argue that the wrapper should be removed regardless of whether the full vector is stored or not. As you noticed in your benchmarks, the wrapper array type is often not performant since you miss optimizations/BLAS. I don't think there's a major benefit from a user or correctness perspective either: you can always provide nonsensical coefficients, and as a user you would always construct them with some documented and exported function without caring about their type.
Taken together, I think you should remove the wrapper and possibly store half of the coefficients. I wonder if you also benchmarked the version based on the Vector{Float64} with half of the coefficients (dot rewritten as a sum of two dot operations)?
For simplicity I went with the plain vector storing all of the coefficients.
Regarding "as a user you would always construct them with some documented and exported function": shapiro_wilk_coefs is not exported and neither was ShapiroWilkCoefs (suggestions for other names welcome).
I did not benchmark the two-dot product since SW coefs are antisymmetric and dot is memory-bound anyway.
m = mean(X)
S² = sum(x -> abs2(x - m), X)
Suggested change:

S² = moment(X, 2)
@devmotion isn't moment normalized (i.e. integrated w.r.t. the uniform measure on X)? Then this should be S² = moment(X, 2)*length(X).
But the real question is: did you benchmark this? On my machine this is 2-6× slower (N ∈ 2 .^ (7:15), increasing with N) than before. (Which, incidentally, makes it even slower than the non-BLAS dot for the old SWCoeffs: N=2^7, increasing to ~8× for N=2^15.)
If we really care about perf I'd rather revert this.
We should improve moment if it is slow. But regardless, you're right of course that the moment will be normalized. I'm fine with reverting it.
src/shapiro_wilk.jl
Outdated

m .= m ./ sqrt(ϕ) # A, but reusing m to save allocs
m[1], m[2] = a₁, a₂
end
resize!(m, N)
This is not needed, it seems?

Suggested change: delete the resize!(m, N) line.
Why do you think so, @devmotion? To me it seems very much needed.
Because you create m as a Vector{Float64}(undef, N) a few lines above.
Yes, but then I resize it to length n = N÷2 and compute only the first half (normalizing w.r.t. n). Only then do I fill the rest of the vector below line 62; to do so I need to bring back the old size.
I missed this. Can you remove all resize! calls? I think you should be able to work with a view instead of resizing the array.
Yes, @views are much better than resizes ;)
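A minimal sketch of the view-based approach under discussion (hypothetical helper, assuming the final coefficient vector is antisymmetric with a zero middle entry for odd N):

# Fill a full-length coefficient vector from its computed first half,
# using a view instead of resize!.
function fill_antisymmetric!(m::Vector{Float64}, firsthalf::AbstractVector{Float64})
    N, n = length(m), length(firsthalf)     # n == N ÷ 2
    copyto!(@view(m[1:n]), firsthalf)       # first half as computed
    for i in 1:n
        m[N - i + 1] = -m[i]                # antisymmetric counterpart
    end
    isodd(N) && (m[n + 1] = 0.0)            # middle coefficient is zero for odd N
    return m
end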
Co-authored-by: David Widmann <[email protected]>
Co-authored-by: David Widmann <[email protected]>
This reverts commit 135e43a.
now tests on 1.3 fail with an error caused by the suggested change.
This reverts commit 833a18e.
@devmotion I took the liberty to revert/modify the suggestion. To be honest I'd be very glad to have this merged as soon as possible so that I can cross it off my list... It's been sitting here for long enough.
You can do it that way.
That's what I did.
@devmotion I'd like to ask you again to have a look at the changes you requested.
Friendly bump, hoping this is still on the radar for review/merging @devmotion @nalimilan @andreasnoack, as it looked to be in a ready state, or only minor adjustments needed.
* Too much data is censored (`censored / N > 0.8`)

# Implementation notes
* The current implementation DOES NOT implement p-values for censored data.
What is the utility of allowing the user to compute the Shapiro-Wilk test statistic at all if they can't get its p-value nor view the object directly in the REPL (since show calls pvalue)? That's an honest question; I would guess that many users would try it and get frustrated without checking the docstring, but that is a naive assumption.
I'd probably recommend commenting out the censoring-related functionality for the time being; then someone can implement it fully in a follow-up PR. Thoughts on that?
@ararslan feel free to suggest (via gh suggestions) any edits that would suit your proposal
I guess it was half question, half proposal: do you envision a use case for this with censored data as it is currently implemented in this PR (no pvalue/show)? If not, I can make those suggestions.
Well, the W statistic can still be useful if somebody wants to use section 3.3 of Royston 1993 to calculate significance levels directly. I didn't implement those methods back then since the significance levels there are only empirical, and until I need to do it at work I don't see myself doing this now either ;)
If those suggestions push this PR over the final line, please make them!
Co-authored-by: Alex Arslan <[email protected]>
Co-authored-by: Alex Arslan <[email protected]>
Co-authored-by: Alex Arslan <[email protected]>
This already starts feeling silly ;)
Thanks for another reminder, but, first of all, thanks for spending so much time on this over so many years. Unfortunately, we are all busy with other tasks, which has delayed this review beyond any reasonable limits. I think we have all now spent enough time on this and any subsequent modifications should be delegated to other PRs.
Implements ShapiroWilkTest following Royston's ALGORITHM AS R94 approximation.
This is work in progress, all comments are welcome!
I tried to follow the paper closely, but copied e.g. computed constants from the original swilk.f. Please let me know if You are ok (license-wise) with this. These polynomial interpolations are computed at loadtime for speed.
Currently the results differ from those obtained by calling swilkfort directly. Still missing: a number of things, and probably more.
I tried to compute exact values of SWCoeffs (via Monte Carlo simulation), but the results I'm getting are off from the ones reported in Table 1 op. cit. Would be glad if anyone could help. approximate_A(10, 1000000) results in values that can be compared against swilk.f's and against SWCoeffs(10).A.

Rahman & Govindarajulu
A different implementation of SWCoeffs following Rahman & Govindarajulu (1997). But they don't provide critical values for their statistic (or I couldn't find them), so that would require further simulation. Also their SWCoeffs(10, Val{:Rahman}) are quite different. On the other hand, ALGORITHM AS R94 is a standard.
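For reference, the statistic computed by ShapiroWilkTest is

W = \frac{\left(\sum_{i=1}^{n} a_i\,x_{(i)}\right)^{2}}{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}},

where x_{(i)} are the ordered sample values and a_i are the coefficients discussed above.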