-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed linear model with perfectly collinear rhs variables and weights #432
Conversation
Please let me know if I should change anything in the PR. :) |
Looks right to me. Could you please add a test case as well? |
Update to latest GLM
I tried to add a teset but for me the earlier testset "linear model with weights" errors already. I feel like it is related to the PR but I am not sure how we even end up with a This would be my testset @testset "collinearity and weights" begin
rng = StableRNG(1234321)
x1 = randn(100)
x1_2 = 3 * x1
x2 = 10 * randn(100)
x2_2 = -2.4 * x2
y = 1 .+ randn() * x1 + randn() * x2 + 2 * randn(100)
df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
f = @formula(y ~ x1 + x2 + x3 + x4)
lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
X = [ones(length(y)) x1_2 x2_2]
W = Diagonal(df.weights)
coef_naive = (X'W*X)\X'W*y
@test lm_model.model.pp.chol isa CholeskyPivoted
@test rank(lm_model.model.pp.chol) == 3
@test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end This is what goes wrong earlier julia> lm_model = lm(f, df, wts = df.weights)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
FoodExp ~ 1 + Income
Coefficients:
──────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept) -2794.53 2.92818e5 -0.01 0.9924 -5.77698e5 5.72109e5
Income 3.63222e6 3.36375e8 0.01 0.9914 -6.56787e8 6.64051e8
────────────────────────────────────────────────────────────────────────────── |
We use |
@nalimilan you were of course correct. I fixed that and added tests (all test passing now). However, I made a mess with autoformatting in vscode. What would be the best way to get just the stuff I changed in without the unnecessary indentation changes? Should I just fork again? EDIT: I gave it a try let me know if that is ok! |
This reverts commit 519f3cd.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #432 +/- ##
==========================================
+ Coverage 87.08% 87.27% +0.19%
==========================================
Files 7 7
Lines 929 943 +14
==========================================
+ Hits 809 823 +14
Misses 120 120 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
Thanks. There are still some failures which may well be random given that they happen on different versions. Any idea what may be going on? Can you reproduce these locally? |
Looks like numeric instability to me. I'll investigate. Maybe I can write a more stable version of the test case. They do not happen on my machine (ubuntu with 1.6) |
Bump. Maybe try a different set of values? |
Bump on this PR. I am encountering the issue in #420 as well. On my local machine (1.8RC4, Mac M1) the tests pass. I agree with @danielw2904 that this appears to just be a numerical difference which stands out because the failing tests were printing All four failing test cases in the Screen.Recording.2022-08-13.at.11.53.45.PM.mov |
Sorry, I am unable to work on this anymore. |
Unfortunately it seems to me that something more serious is going on. Locally, tests always fail, even on Julia 1.7. Maybe this has to do with BLAS and the CPU? The following models give completely different coefficients here: julia> using GLM, StableRNGs
julia> rng = StableRNG(1234321);
julia> x1 = randn(rng, 100);
julia> x1_2 = 3 * x1;
julia> x2 = 10 * randn(rng, 100);
julia> x2_2 = -2.4 * x2;
julia> y = 1 .+ randn(rng) * x1 + randn(rng) * x2 + 2 * randn(rng, 100);
julia> df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50));
julia> lm_collinear = lm(@formula(y ~ x1 + x2 + x3 + x4), df, wts=df.weights)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
y ~ 1 + x1 + x2 + x3 + x4
Coefficients:
───────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept) 0.00389006 7.74397 0.00 0.9996 -15.4334 15.4412
x1 0.0 NaN NaN NaN NaN NaN
x2 0.594857 2.78844 0.21 0.8317 -4.96379 6.1535
x3 0.0 NaN NaN NaN NaN NaN
x4 4.11564 0.351608 11.71 <1e-17 3.41472 4.81656
───────────────────────────────────────────────────────────────────────────
julia> lm_reduced = lm(@formula(y ~ x2 + x4), df, wts=df.weights)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
y ~ 1 + x2 + x4
Coefficients:
────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept) 1.29678 0.265978 4.88 <1e-05 0.766565 1.827
x2 0.215149 0.0973285 2.21 0.0302 0.0211286 0.40917
x4 0.648579 0.0124034 52.29 <1e-58 0.623853 0.673305
────────────────────────────────────────────────────────────────────────
julia> deviance(lm_collinear)
410484.4884591147
julia> deviance(lm_reduced)
355.2246843374244 On current master, both models have the same deviance and the same predictions, so it seems that this PR introduces a regression at least in some cases. |
Though if #487 (comment) fixes it, then maybe we don't need this PR after all? |
@alecloudenback If you're interested in making a PR, you could extract the relevant part from #487 so that we can fix this bug without waiting for the more general improvements of weighting. |
It looks like this was fixed by another PR, see #420 (comment). |
This is my probably not very efficient suggestion to fix #420
Suggestions very welcome!!