GLM Model Difference when running with Standardization, Weights, and Beta Constraints #15519
Problem
When running with weights, the expectation is that the model will match the one obtained by up-sampling the training dataset. For example, if the class 1 rows are duplicated five times and added back into the original training dataset, the resulting model should match one trained on the original dataset with a weight of 6 on the class 1 response. This is exactly what we observe when standardization is turned on and no beta constraints are supplied.
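As a minimal sketch of this equivalence, the snippet below uses scikit-learn's LogisticRegression on a synthetic dataset rather than H2O; the frequency-weight argument is the same. Duplicating a class 1 row five extra times adds the same loss terms to the objective as giving that row a weight of 6.

```python
# Sketch: an up-sampled fit and a weighted fit of the same logistic
# regression agree (no beta constraints involved here).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Up-sampled run: each class 1 row appears 6 times in total.
reps = np.where(y == 1, 6, 1)
fit_up = LogisticRegression(tol=1e-10, max_iter=10_000).fit(
    np.repeat(X, reps, axis=0), np.repeat(y, reps))

# Weighted run: weight of 6 on the class 1 rows instead.
w = np.where(y == 1, 6.0, 1.0)
fit_w = LogisticRegression(tol=1e-10, max_iter=10_000).fit(
    X, y, sample_weight=w)

# Coefficients agree to tight numerical tolerance.
print(np.allclose(fit_up.coef_, fit_w.coef_, atol=1e-4))
```

Both runs minimize the same penalized objective, which is why the coefficients coincide when no constraints are involved.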
However, when beta_constraints are added, the up-sampled and weighted runs produce different GLM models with different coefficients.
Explanation
Priors are standardized, and therefore changed, when standardization is turned on for the model build
When standardization is turned on, the beta_given values in beta_constraints are standardized as well. In the code, each given beta is multiplied by a factor d: `_betaGiven *= d;`, where d = 1/sd. Be particularly careful when using previous coefficients as priors with standardization turned on, because the penalty is then taken against these rescaled priors rather than the ones you supplied.
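To make the rescaling concrete, here is a small numeric sketch (not H2O's actual code path) of what happens to a supplied prior when standardization applies d = 1/sd:

```python
# Sketch: the prior the penalty actually sees is beta_given * d,
# where d = 1 / sd of the predictor column.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # one predictor column
sd = x.std(ddof=1)                    # sample standard deviation
d = 1.0 / sd

beta_given = 0.8                      # prior supplied in beta_constraints
internal_prior = beta_given * d       # value used by the standardized build
print(sd, internal_prior)             # the prior is rescaled by 1/sd
```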
Weighted Variance
When weights are used, the variance will differ from the variance of an up-sampled dataset. The way variance is typically calculated is:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$
When using weights, you calculate the weighted variance as:

$$\sigma_w^2 = \frac{N}{N-1} \cdot \frac{\sum_{i=1}^{N} w_i\,(x_i - \bar{x}_w)^2}{\sum_{i=1}^{N} w_i}, \qquad \bar{x}_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$$
So when the sum of weights equals the number of observations, you get exactly the same variance; otherwise the two variances differ by a factor of approximately N/(N−1). This is a relatively small difference, but it is one you will observe in the resulting coefficients.
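A quick numeric check of this in NumPy (a sketch; the weighted formula is the N/(N−1)-corrected form given above):

```python
# Sketch: weighted variance on N rows vs. plain variance on the
# up-sampled rows; the two differ by roughly N/(N-1).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 1.0, 1.0, 6.0])   # weight of 6 on the last row
N = len(x)

# Up-sampled variance: replicate the last row six times, then use
# the usual (N' - 1) denominator over the N' = sum(w) rows.
var_up = np.repeat(x, w.astype(int)).var(ddof=1)

# Weighted variance computed on the original N rows.
xbar_w = np.average(x, weights=w)
var_w = (N / (N - 1)) * np.sum(w * (x - xbar_w) ** 2) / np.sum(w)

# Exact ratio: (N/(N-1)) * ((sum(w)-1)/sum(w)) -> N/(N-1) for large sum(w).
print(var_up, var_w, var_w / var_up)
```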
Solution
If you want to supply beta constraints for a standardized model build, scale the bounds and priors in beta_constraints by each predictor's standard deviation, i.e. multiply by 1/d. The internal standardization then computes (1/d) · betaGiven · d, which equals the original betaGiven.
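A sketch of this workaround with the h2o Python client. The file path, predictor names, prior values, and bounds below are hypothetical placeholders; the beta_constraints frame columns (names, beta_given, lower_bounds, upper_bounds, rho) follow H2O's documented schema.

```python
# Sketch: pre-scale beta_constraints by each predictor's standard
# deviation so H2O's internal *= d (d = 1/sd) cancels out.
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
train = h2o.import_file("train.csv")       # hypothetical path
predictors = ["x1", "x2"]                   # hypothetical predictors

# Per-predictor sample standard deviations from the training frame.
sds = {p: train[p].sd()[0] for p in predictors}

# Multiply each prior and bound by that predictor's sd (= 1/d).
bc = h2o.H2OFrame({
    "names":        predictors,
    "beta_given":   [0.5 * sds["x1"], -0.2 * sds["x2"]],
    "lower_bounds": [-1.0 * sds["x1"], -1.0 * sds["x2"]],
    "upper_bounds": [1.0 * sds["x1"], 1.0 * sds["x2"]],
    "rho":          [1.0, 1.0],
})

glm = H2OGeneralizedLinearEstimator(family="binomial",
                                    standardize=True,
                                    beta_constraints=bc)
glm.train(x=predictors, y="y", training_frame=train)
```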
JIRA Issue Migration Info
Jira Issue: TN-10
Assignee: Amy Wang
Reporter: Amy Wang
State: Closed