Skip to content

Commit

Permalink
update, examples pending
Browse files Browse the repository at this point in the history
  • Loading branch information
changwoo-lee committed Jan 11, 2024
1 parent 46f40a6 commit f8ede69
Show file tree
Hide file tree
Showing 6 changed files with 384 additions and 302 deletions.
109 changes: 71 additions & 38 deletions R/blogireg_me.R → R/bglm_me.R
Original file line number Diff line number Diff line change
@@ -1,31 +1,45 @@
#' Bayesian logistic regression models with correlated measurement errors
#' Bayesian generalized linear models with spatial exposure measurement error.
#'
#' This function implements the Bayesian logistic regression model with correlated measurement error of covariate(s).
#' Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate of \eqn{i}th observation that is subject to measurement error,
#' This function fits Bayesian generalized linear model in the presence of spatial exposure measurement error of covariate(s) \eqn{X}, represented as a multivariate normal prior.
#' One of the most important features of this function is that it allows sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation.
#' As of version 1.0.0, only Bayesian logistic regression model is supported among GLMs, and
#' function \code{bglm_me()} runs a Gibbs sampler to carry out posterior inference using Polya-Gamma augmentation (Polson et al., 2013).
#' See below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology.
#'
#' Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate that is subject to spatial exposure measurement error,
#' and \eqn{Z_i} be a \eqn{p\times 1} covariate without measurement error.
#' The likelihood model is Bayesian logistic regression,
#' \deqn{logit(Pr(Y_i = 1)) = \beta_0 + X_i^\top \beta_x + Z_i^\top \beta_z, \quad i=1,\dots,n, }
#' independently, and correlated measurement error of \eqn{X_i, i=1,\dots,n} is incorporated into the model as a multivariate normal prior.
#' For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior
#' Consider logistic regression model, independently for each \eqn{i=1,\dots,n,}
#' \deqn{\log(Pr(Y_i = 1)/Pr(Y_i = 0)) = \beta_0 + X_i^\top \beta_X + Z_i^\top \beta_Z,}
#' and spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model as a multivariate normal prior.
#' For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top},
#' \deqn{(X_1,\dots,X_n)\sim N_n(\mu_X, Q_X^{-1}).}
#' Also, we consider semiconjugate priors for regression coefficients;
#' \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{x,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{z,k} \stackrel{iid}{\sim} N(0, V_\beta)}
#' The function \code{blogireg_me()} implements the Gibbs sampler for posterior inference using Polya-Gamma augmentation.
#' Most importantly, it allows sparse matrix input for \eqn{Q_X} for scalable computation.
#'
#' @param Y n by 1 matrix, binary responses
#' @param X_mean n by 1 matrix of mean of X \eqn{\mu_X}, or list of n by 1 matrices of length q.
#' @param X_prec n by n matrix precision matrix of X \eqn{Q_X}, or list of n by n matrices of length q. Support sparse matrix format from Matrix package.
#' @param Z n by p matrix, covariates without measurement error
#' @param nburn integer, burn-in iteration
#' @param nthin integer, thin-in rate
#' @param nsave integer, number of posterior samples
#' @param prior list of prior parameters, default is var_beta = 100,a_Y = 0.01, b_Y = 0.01
#' @param saveX logical, save X or not
#'
#' @return list of (1) posterior, the (nsave)x(q+p) matrix of posterior samples as a coda object,
#' (2) cputime, cpu time taken in seconds,
#' and optionally (3) X_save, posterior samples of X
#' Most importantly, it allows sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation.
#'
#' We consider normal priors for regression coefficients,
#' \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{X,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{Z,k} \stackrel{iid}{\sim} N(0, V_\beta)}
#' where \code{var_beta} corresponds to \eqn{V_\beta}.
#'
#' @param Y *vector<int>*, n by 1 binary response vector.
#' @param X_mean *matrix<num>*, n by 1 prior mean matrix \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 matrices.
#' @param X_prec *matrix<num>*, n by 1 prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices.
#' @param Z *matrix<num>*, n by p covariates that are not subject to measurement error.
#' @param family *class \link[stats]{family}*, a description of the error distribution and link function to be used in the model. Currently only supports binomial(link = "logit").
#' @param nburn *integer*, burn-in iteration, default 1000.
#' @param nsave *integer*, number of posterior samples, default 1000. Total number of MCMC iteration is \code{nburn + nsave * nthin}.
#' @param nthin *integer*, thin-in rate, default 1.
#' @param prior *list*, list of prior parameters of regression model. Default is \code{list(var_beta = 100)}.
#' @param saveX *logical*, default FALSE, whether save posterior samples of X (exposure).
#'
#' @return List of the following:
#' \describe{
#' \item{posterior}{The \code{nsave} by (q+p) matrix of posterior samples of \eqn{\beta_x} and \eqn{\beta_z} as a coda::\link[coda]{mcmc} object.}
#' \item{time}{Time taken for running MCMC (in seconds)}
#' \item{X_save}{ (if \code{saveX = TRUE}) Posterior samples of X}
#' }
#'
#' @references Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American statistical Association, 108(504), 1339-1349.
#'
#' Lee, C. J., Symanski, E., Rammah, A., Kang, D. H., Hopke, P. K., & Park, E. S. (2024). A scalable two-stage Bayesian approach accounting for exposure measurement error in environmental epidemiology. arXiv preprint arXiv:2401.00634.
#' @export
#'
#' @examples
Expand All @@ -37,7 +51,7 @@
#' data(ozone)
#' data(health_sim)
#' library(fields)
#' # exposure mean and covariance at health subject locations with 06/18/1987 data (date id = 16)
#' # exposure mean and covariance at health participant locations with 06/18/1987 data (date id = 16)
#' # using Gaussian process with prior mean zero and exponential covariance kernel
#' # with fixed range 300 (in miles) and stdev 15 (in ppb)
#'
Expand Down Expand Up @@ -88,15 +102,27 @@
#' bayesplot::mcmc_trace(fit_me$posterior)
#' }
#'
blogireg_me <- function(Y,
bglm_me <- function(Y,
X_mean,
X_prec,
Z,
nburn = 2000,
nsave = 2000,
nthin = 5,
family = binomial(link = "logit"),
nburn = 1000,
nsave = 1000,
nthin = 1,
prior = NULL,
saveX = F){
saveX = FALSE){
#### input check ####
# check family input
family = "binomial"
if (is.character(family)) family <- get(family, mode = "function", envir = parent.frame())
if (is.function(family)) family <- family()
if (is.null(family$family)) {
print(family)
stop("'family' not recognized")
}
if (family$family != "binomial") stop("only binomial family is supported")
if (family$link != "logit") stop("only logit link is supported")

# prior input, default
if(is.null(prior)){
Expand All @@ -105,6 +131,9 @@ blogireg_me <- function(Y,
var_beta = prior$var_beta

n_y = length(Y)
# check z is binary (temporary for family = "binomial")
if(!all(Z %in% c(0,1))) stop("Z must be binary")

if(is.vector(Z)) Z = as.matrix(Z)
if(!is.list(X_mean) & !is.list(X_prec)){
q = 1
Expand Down Expand Up @@ -132,6 +161,9 @@ blogireg_me <- function(Y,
X_spamstruct[[qq]] = spam::chol(X_prec[[qq]])
}
}

#### initialize ####

X = matrix(0, n_y, q)
for(qq in 1:q) X[,qq] = X_mean[[q]]
if(is.null(names(X_mean))){
Expand All @@ -147,14 +179,13 @@ blogireg_me <- function(Y,
colnames(Z) = paste0("covariate.",colnames(Z))
}
df_temp = as.data.frame(cbind(X,Z))
D = model.matrix( ~ ., df_temp)

D = model.matrix( ~ ., df_temp) # design matrix

# prior
Sigma_beta = diag(var_beta, ncol(D))# 3 coefficients(beta0, beta1, betaz)
Sigma_betainv = solve(Sigma_beta)

# initialize
# parameter
beta = rep(0.1, ncol(D))
omega = rep(0.1, n_y) # aux variable
Dbeta = D%*%beta
Expand All @@ -167,8 +198,9 @@ blogireg_me <- function(Y,
}

YtY = crossprod(Y)
#browser()
# sampler starts

#### run MCMC ####

isave = 0
isnegative = numeric(n_y)
pb <- txtProgressBar(style=3)
Expand Down Expand Up @@ -199,15 +231,16 @@ blogireg_me <- function(Y,
D[,(qq+1)] = as.vector(Xstar)
}


if((imcmc > nburn)&(imcmc%%nthin==0)){
isave = isave + 1
beta_save[isave,] = beta
if(saveX) X_save[isave,,] = D[,2:(q+1)]
}
}
t_diff = difftime(Sys.time(), t_start, units = "secs")
print(paste0("Exposure components contains negative vaules total ",sum(isnegative)," times among (# exposures) x n_y x (MCMC iter after burnin) = ",q," x ",n_y," x ",nsave*nthin," instances"))
#print(paste0("Exposure components contains negative vaules total ",sum(isnegative)," times among (# exposures) x n_y x (MCMC iter after burnin) = ",q," x ",n_y," x ",nsave*nthin," instances"))

#### output ####
out = list()
out$posterior = beta_save
out$posterior = coda::mcmc(out$posterior)
Expand Down
82 changes: 50 additions & 32 deletions R/blinreg_me.R → R/blm_me.R
Original file line number Diff line number Diff line change
@@ -1,29 +1,42 @@
#' Bayesian normal linear regression models with correlated measurement errors
#' Bayesian linear regression models with spatial exposure measurement error.
#'
#' This function implements the Bayesian normal linear regression model with correlated measurement error of covariate(s).
#' Denote \eqn{Y_i} be a continuous response, \eqn{X_i} be a \eqn{q\times 1} covariate of \eqn{i}th observation that is subject to measurement error,
#' This function fits Bayesian linear regression model in the presence of spatial exposure measurement error of covariate(s) \eqn{X}, represented as a multivariate normal prior.
#' One of the most important features of this function is that it allows sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation.
#' Function \code{blm_me()} runs a Gibbs sampler to carry out posterior inference; see below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology.
#'
#' Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate that is subject to spatial exposure measurement error,
#' and \eqn{Z_i} be a \eqn{p\times 1} covariate without measurement error.
#' The likelihood model is Bayesian normal linear regression,
#' \deqn{Y_i = \beta_0 + X_i^\top \beta_x + Z_i^\top \beta_z + \epsilon_i,\quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2_Y), \quad i=1,\dots,n}
#' and correlated measurement error of \eqn{X_i, i=1,\dots,n} is incorporated into the model as a multivariate normal prior. For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior
#' Consider normal linear regression model,
#' \deqn{Y_i = \beta_0 + X_i^\top \beta_x + Z_i^\top \beta_z + \epsilon_i,\quad \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2_Y), \quad i=1,\dots,n,}
#' and spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model as a multivariate normal prior.
#' For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top},
#' \deqn{(X_1,\dots,X_n)\sim N_n(\mu_X, Q_X^{-1}).}
#' Also, we consider semiconjugate priors for regression coefficients and noise variance;
#' \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{x,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{z,k} \stackrel{iid}{\sim} N(0, V_\beta), \quad \sigma_Y^2 \sim IG(a_Y, b_Y).}
#' The function \code{blinreg_me()} implements the Gibbs sampler for posterior inference. Most importantly, it allows sparse matrix input for \eqn{Q_X} for scalable computation.
#'
#' @param Y n by 1 matrix, response
#' @param X_mean n by 1 matrix of mean of X \eqn{\mu_X}, or list of n by 1 matrices of length q.
#' @param X_prec n by n matrix precision matrix of X \eqn{Q_X}, or list of n by n matrices of length q. Support sparse matrix format from Matrix package.
#' @param Z n by p matrix, covariates without measurement error
#' @param nburn integer, burn-in iteration
#' @param nthin integer, thin-in rate
#' @param nsave integer, number of posterior samples
#' @param prior list of prior parameters, default is var_beta = 100,a_Y = 0.01, b_Y = 0.01
#' @param saveX logical, save X or not
#'
#' @return list of (1) posterior, the (nsave)x(q+p) matrix of posterior samples as a coda object,
#' (2) cputime, cpu time taken in seconds,
#' and optionally (3) X_save, posterior samples of X
#' Most importantly, it allows sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation.
#'
#' We consider semiconjugate priors for regression coefficients and error variance,
#' \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{X,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{Z,k} \stackrel{iid}{\sim} N(0, V_\beta), \quad \sigma_Y^2 \sim IG(a_Y, b_Y).}
#' where \code{var_beta} corresponds to \eqn{V_\beta} and \code{a_Y}, \code{b_Y} corresponds to hyperparameters for \eqn{\sigma^2_Y}.
#'
#' @param Y *vector&lt;int&gt;*, n by 1 continuous response vector.
#' @param X_mean *matrix&lt;num&gt;*, n by 1 prior mean matrix \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 matrices.
#' @param X_prec *matrix&lt;num&gt;*, n by 1 prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices.
#' @param Z *matrix&lt;num&gt;*, n by p covariates that are not subject to measurement error.
#' @param nburn *integer*, burn-in iteration, default 1000.
#' @param nsave *integer*, number of posterior samples, default 1000. Total number of MCMC iteration is \code{nburn + nsave * nthin}.
#' @param nthin *integer*, thin-in rate, default 1.
#' @param prior *list*, list of prior parameters of regression model. Default is \code{list(var_beta = 100, a_Y = 0.01, b_Y = 0.01)}.
#' @param saveX *logical*, default FALSE, whether save posterior samples of X (exposure).
#'
#' @return list of the following:
#' \describe{
#' \item{posterior}{The \code{nsave} by (q + p + 1) matrix of posterior samples of \eqn{\beta_X}, \eqn{\beta_Z}, \eqn{\sigma_Y^2} as a coda::\link[coda]{mcmc} object.}
#' \item{time}{Time taken for running MCMC (in seconds)}
#' \item{X_save}{ (if \code{saveX = TRUE}) Posterior samples of X}
#' }
#'
#' @references
#' Lee, C. J., Symanski, E., Rammah, A., Kang, D. H., Hopke, P. K., & Park, E. S. (2024). A scalable two-stage Bayesian approach accounting for exposure measurement error in environmental epidemiology. arXiv preprint arXiv:2401.00634.
#'
#' @export
#'
#' @examples
Expand Down Expand Up @@ -86,16 +99,16 @@
#' bayesplot::mcmc_trace(fit_me$posterior)
#' }
#'
blinreg_me <- function(Y,
blm_me <- function(Y,
X_mean,
X_prec,
Z,
nburn = 2000,
nsave = 2000,
nthin = 5,
nburn = 1000,
nsave = 1000,
nthin = 1,
prior = NULL,
saveX = F){

saveX = FALSE){
#### input check ####
# prior input, default
if(is.null(prior)){
prior = list(var_beta = 100,a_Y = 0.01, b_Y = 0.01)
Expand Down Expand Up @@ -134,6 +147,8 @@ blinreg_me <- function(Y,
}
}

#### initialize ####

X = matrix(0, n_y, q)
for(qq in 1:q) X[,qq] = X_mean[[q]]
if(is.null(names(X_mean))){
Expand All @@ -156,7 +171,7 @@ blinreg_me <- function(Y,
Sigma_beta = diag(var_beta, ncol(D))# 3 coefficients(beta0, beta1, betaz)
Sigma_betainv = solve(Sigma_beta)

# initialize
# parameter
sigma2_Y = 1
beta = rep(0.1, ncol(D))

Expand All @@ -170,8 +185,9 @@ blinreg_me <- function(Y,
}

YtY = crossprod(Y)
#browser()
# sampler starts

#### run MCMC ####

isave = 0
isnegative = numeric(n_y)
pb <- txtProgressBar(style=3)
Expand Down Expand Up @@ -212,6 +228,8 @@ blinreg_me <- function(Y,
}
t_diff = difftime(Sys.time(), t_start, units = "secs")
#print(paste0("Exposure components contains negative vaules total ",sum(isnegative)," times among (# exposures) x n_y x (MCMC iter after burnin) = ",q," x ",n_y," x ",nsave*nthin," instances"))

#### output ####
out = list()
out$posterior = cbind(beta_save, sigma2_save)
out$posterior = coda::mcmc(out$posterior)
Expand Down
Loading

0 comments on commit f8ede69

Please sign in to comment.