update, examples pending

changwoo-lee · Jan 11, 2024 · f8ede69 · f8ede69
1 parent 46f40a6
commit f8ede69
Show file tree

Hide file tree

Showing 6 changed files with 384 additions and 302 deletions.
diff --git a/R/blogireg_me.R → R/bglm_me.R b/R/blogireg_me.R → R/bglm_me.R
@@ -1,31 +1,45 @@
-#' Bayesian logistic regression models with correlated measurement errors
+#' Bayesian generalized linear models with spatial exposure measurement error.
 #'
-#' This function implements the Bayesian logistic regression model with correlated measurement error of covariate(s).
-#' Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate of \eqn{i}th observation that is subject to measurement error,
+#' This function fits Bayesian generalized linear model in the presence of spatial exposure measurement error of covariate(s) \eqn{X}, represented as a multivariate normal prior.
+#' One of the most important features of this function is that it allows sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation.
+#' As of version 1.0.0, only Bayesian logistic regression model is supported among GLMs, and
+#' function \code{bglm_me()} runs a Gibbs sampler to carry out posterior inference using Polya-Gamma augmentation (Polson et al., 2013).
+#' See below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology.
+#'
+#' Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate that is subject to spatial exposure measurement error,
 #' and \eqn{Z_i} be a \eqn{p\times 1} covariate without measurement error.
-#' The likelihood model is Bayesian logistic regression,
-#' \deqn{logit(Pr(Y_i = 1)) = \beta_0 + X_i^\top \beta_x +  Z_i^\top \beta_z, \quad i=1,\dots,n, }
-#' independently, and correlated measurement error of \eqn{X_i, i=1,\dots,n} is incorporated into the model as a multivariate normal prior.
-#' For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior
+#' Consider logistic regression model, independently for each \eqn{i=1,\dots,n,}
+#' \deqn{\log(Pr(Y_i = 1)/Pr(Y_i = 0)) = \beta_0 + X_i^\top \beta_X +  Z_i^\top \beta_Z,}
+#' and spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model as a multivariate normal prior.
+#' For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top},
 #' \deqn{(X_1,\dots,X_n)\sim N_n(\mu_X, Q_X^{-1}).}
-#' Also, we consider semiconjugate priors for regression coefficients;
-#' \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{x,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{z,k} \stackrel{iid}{\sim} N(0, V_\beta)}
-#' The function \code{blogireg_me()} implements the Gibbs sampler for posterior inference using Polya-Gamma augmentation.
-#' Most importantly, it allows sparse matrix input for \eqn{Q_X} for scalable computation.
-#'
-#' @param Y n by 1 matrix, binary responses
-#' @param X_mean n by 1 matrix of mean of X \eqn{\mu_X}, or list of n by 1 matrices of length q.
-#' @param X_prec n by n matrix precision matrix of X  \eqn{Q_X}, or list of n by n matrices of length q. Support sparse matrix format from Matrix package.
-#' @param Z n by p matrix, covariates without measurement error
-#' @param nburn integer, burn-in iteration
-#' @param nthin integer, thin-in rate
-#' @param nsave integer, number of posterior samples
-#' @param prior list of prior parameters, default is var_beta = 100,a_Y = 0.01, b_Y = 0.01
-#' @param saveX logical, save X or not
-#'
-#' @return list of (1) posterior, the (nsave)x(q+p) matrix of posterior samples as a coda object,
-#'  (2) cputime, cpu time taken in seconds,
-#'  and optionally (3) X_save, posterior samples of X
+#' Most importantly, it allows sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation.
+#'
+#' We consider normal priors for regression coefficients,
+#' \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{X,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{Z,k} \stackrel{iid}{\sim} N(0, V_\beta)}
+#' where \code{var_beta} corresponds to \eqn{V_\beta}.
+#'
+#' @param Y *vector&lt;int&gt;*, n by 1 binary response vector.
+#' @param X_mean *matrix&lt;num&gt;*, n by 1 prior mean matrix \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 matrices.
+#' @param X_prec *matrix&lt;num&gt;*, n by 1 prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices.
+#' @param Z *matrix&lt;num&gt;*, n by p covariates that are not subject to measurement error.
+#' @param family *class \link[stats]{family}*, a description of the error distribution and link function to be used in the model. Currently only supports binomial(link = "logit").
+#' @param nburn *integer*, burn-in iteration, default 1000.
+#' @param nsave *integer*, number of posterior samples, default 1000. Total number of MCMC iteration is \code{nburn + nsave * nthin}.
+#' @param nthin *integer*, thin-in rate, default 1.
+#' @param prior *list*, list of prior parameters of regression model. Default is \code{list(var_beta = 100)}.
+#' @param saveX *logical*, default FALSE, whether save posterior samples of X (exposure).
+#'
+#' @return List of the following:
+#' \describe{
+#'   \item{posterior}{The \code{nsave} by (q+p) matrix of posterior samples of \eqn{\beta_x} and \eqn{\beta_z} as a coda::\link[coda]{mcmc} object.}
+#'   \item{time}{Time taken for running MCMC (in seconds)}
+#'   \item{X_save}{ (if \code{saveX = TRUE}) Posterior samples of X}
+#' }
+#'
+#' @references Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American statistical Association, 108(504), 1339-1349.
+#'
+#' Lee, C. J., Symanski, E., Rammah, A., Kang, D. H., Hopke, P. K., & Park, E. S. (2024). A scalable two-stage Bayesian approach accounting for exposure measurement error in environmental epidemiology. arXiv preprint arXiv:2401.00634.
 #' @export
 #'
 #' @examples
@@ -37,7 +51,7 @@
 #' data(ozone)
 #' data(health_sim)
 #' library(fields)
-#' # exposure mean and covariance at health subject locations with 06/18/1987 data (date id = 16)
+#' # exposure mean and covariance at health participant locations with 06/18/1987 data (date id = 16)
 #' # using Gaussian process with prior mean zero and exponential covariance kernel
 #' # with fixed range 300 (in miles) and stdev 15 (in ppb)
 #'
@@ -88,15 +102,27 @@
 #' bayesplot::mcmc_trace(fit_me$posterior)
 #' }
 #'
-blogireg_me <- function(Y,
+bglm_me <- function(Y,
                        X_mean,
                        X_prec,
                        Z,
-                       nburn = 2000,
-                       nsave = 2000,
-                       nthin = 5,
+                       family = binomial(link = "logit"),
+                       nburn = 1000,
+                       nsave = 1000,
+                       nthin = 1,
                        prior = NULL,
-                       saveX = F){
+                       saveX = FALSE){
+  #### input check ####
+  # check family input
+  family = "binomial"
+  if (is.character(family)) family <- get(family, mode = "function", envir = parent.frame())
+  if (is.function(family)) family <- family()
+  if (is.null(family$family)) {
+    print(family)
+    stop("'family' not recognized")
+  }
+  if (family$family != "binomial") stop("only binomial family is supported")
+  if (family$link != "logit") stop("only logit link is supported")
 
   # prior input, default
   if(is.null(prior)){
@@ -105,6 +131,9 @@ blogireg_me <- function(Y,
   var_beta = prior$var_beta
 
   n_y = length(Y)
+  # check z is binary (temporary for family = "binomial")
+  if(!all(Z %in% c(0,1))) stop("Z must be binary")
+
   if(is.vector(Z)) Z = as.matrix(Z)
   if(!is.list(X_mean) & !is.list(X_prec)){
     q = 1
@@ -132,6 +161,9 @@ blogireg_me <- function(Y,
       X_spamstruct[[qq]] = spam::chol(X_prec[[qq]])
     }
   }
+
+  #### initialize ####
+
   X = matrix(0, n_y, q)
   for(qq in 1:q) X[,qq] = X_mean[[q]]
   if(is.null(names(X_mean))){
@@ -147,14 +179,13 @@ blogireg_me <- function(Y,
     colnames(Z) =  paste0("covariate.",colnames(Z))
   }
   df_temp = as.data.frame(cbind(X,Z))
-  D = model.matrix( ~ ., df_temp)
-
+  D = model.matrix( ~ ., df_temp) # design matrix
 
   # prior
   Sigma_beta = diag(var_beta, ncol(D))# 3 coefficients(beta0, beta1, betaz)
   Sigma_betainv = solve(Sigma_beta)
 
-  # initialize
+  # parameter
   beta = rep(0.1, ncol(D))
   omega = rep(0.1, n_y) # aux variable
   Dbeta =  D%*%beta
@@ -167,8 +198,9 @@ blogireg_me <- function(Y,
   }
 
   YtY = crossprod(Y)
-  #browser()
-  # sampler starts
+
+  #### run MCMC ####
+
   isave = 0
   isnegative = numeric(n_y)
   pb <- txtProgressBar(style=3)
@@ -199,15 +231,16 @@ blogireg_me <- function(Y,
       D[,(qq+1)] = as.vector(Xstar)
     }
 
-
     if((imcmc > nburn)&(imcmc%%nthin==0)){
       isave = isave + 1
       beta_save[isave,] = beta
       if(saveX) X_save[isave,,] = D[,2:(q+1)]
     }
   }
   t_diff = difftime(Sys.time(), t_start, units = "secs")
-  print(paste0("Exposure components contains negative vaules total ",sum(isnegative)," times among (# exposures) x n_y x (MCMC iter after burnin) = ",q," x ",n_y," x ",nsave*nthin," instances"))
+  #print(paste0("Exposure components contains negative vaules total ",sum(isnegative)," times among (# exposures) x n_y x (MCMC iter after burnin) = ",q," x ",n_y," x ",nsave*nthin," instances"))
+
+  #### output ####
   out = list()
   out$posterior = beta_save
   out$posterior = coda::mcmc(out$posterior)

diff --git a/R/blinreg_me.R → R/blm_me.R b/R/blinreg_me.R → R/blm_me.R
@@ -1,29 +1,42 @@
-#' Bayesian normal linear regression models with correlated measurement errors
+#' Bayesian linear regression models with spatial exposure measurement error.
 #'
-#' This function implements the Bayesian normal linear regression model with correlated measurement error of covariate(s).
-#' Denote \eqn{Y_i} be a continuous response, \eqn{X_i} be a \eqn{q\times 1} covariate of \eqn{i}th observation that is subject to measurement error,
+#' This function fits Bayesian linear regression model in the presence of spatial exposure measurement error of covariate(s) \eqn{X}, represented as a multivariate normal prior.
+#' One of the most important features of this function is that it allows sparse matrix input for the prior precision matrix of \eqn{X} for scalable computation.
+#' Function \code{blm_me()} runs a Gibbs sampler to carry out posterior inference; see below "Details" section for the model description, and Lee et al. (2024) for an application example in environmental epidemiology.
+#'
+#' Denote \eqn{Y_i} be a binary response, \eqn{X_i} be a \eqn{q\times 1} covariate that is subject to spatial exposure measurement error,
 #' and \eqn{Z_i} be a \eqn{p\times 1} covariate without measurement error.
-#' The likelihood model is Bayesian normal linear regression,
-#' \deqn{Y_i = \beta_0 + X_i^\top \beta_x +  Z_i^\top \beta_z + \epsilon_i,\quad  \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2_Y), \quad i=1,\dots,n}
-#' and correlated measurement error of \eqn{X_i, i=1,\dots,n} is incorporated into the model as a multivariate normal prior. For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior
+#' Consider normal linear regression model,
+#' \deqn{Y_i = \beta_0 + X_i^\top \beta_x +  Z_i^\top \beta_z + \epsilon_i,\quad  \epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2_Y), \quad i=1,\dots,n,}
+#' and spatial exposure measurement error of \eqn{X_i} for \eqn{i=1,\dots,n} is incorporated into the model as a multivariate normal prior.
+#' For example when \eqn{q=1}, we have \eqn{n-}dimensional multivariate normal prior of \eqn{X = (X_1,\dots,X_n)^\top},
 #' \deqn{(X_1,\dots,X_n)\sim N_n(\mu_X, Q_X^{-1}).}
-#' Also, we consider semiconjugate priors for regression coefficients and noise variance;
-#' \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{x,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{z,k} \stackrel{iid}{\sim} N(0, V_\beta), \quad \sigma_Y^2 \sim IG(a_Y, b_Y).}
-#' The function \code{blinreg_me()} implements the Gibbs sampler for posterior inference. Most importantly, it allows sparse matrix input for \eqn{Q_X} for scalable computation.
-#'
-#' @param Y n by 1 matrix, response
-#' @param X_mean n by 1 matrix of mean of X \eqn{\mu_X}, or list of n by 1 matrices of length q.
-#' @param X_prec n by n matrix precision matrix of X  \eqn{Q_X}, or list of n by n matrices of length q. Support sparse matrix format from Matrix package.
-#' @param Z n by p matrix, covariates without measurement error
-#' @param nburn integer, burn-in iteration
-#' @param nthin integer, thin-in rate
-#' @param nsave integer, number of posterior samples
-#' @param prior list of prior parameters, default is var_beta = 100,a_Y = 0.01, b_Y = 0.01
-#' @param saveX logical, save X or not
-#'
-#' @return list of (1) posterior, the (nsave)x(q+p) matrix of posterior samples as a coda object,
-#'  (2) cputime, cpu time taken in seconds,
-#'  and optionally (3) X_save, posterior samples of X
+#' Most importantly, it allows sparse matrix input for the prior precision matrix \eqn{Q_X} for scalable computation, which could be obtained by Vecchia approximation.
+#'
+#' We consider semiconjugate priors for regression coefficients and error variance,
+#' \deqn{\beta_0 \sim N(0, V_\beta), \quad \beta_{X,j} \stackrel{iid}{\sim} N(0, V_\beta), \quad \beta_{Z,k} \stackrel{iid}{\sim} N(0, V_\beta), \quad \sigma_Y^2 \sim IG(a_Y, b_Y).}
+#' where \code{var_beta} corresponds to \eqn{V_\beta} and \code{a_Y}, \code{b_Y} corresponds to hyperparameters for \eqn{\sigma^2_Y}.
+#'
+#' @param Y *vector&lt;int&gt;*, n by 1 continuous response vector.
+#' @param X_mean *matrix&lt;num&gt;*, n by 1 prior mean matrix \eqn{\mu_X}. When there are q multiple exposures subject to measurement error, it can be a length q list of n by 1 matrices.
+#' @param X_prec *matrix&lt;num&gt;*, n by 1 prior precision matrix \eqn{Q_X}, which allows sparse format from \link{Matrix} package. When there are q multiple exposures subject to measurement error, it can be a length q list of n by n matrices.
+#' @param Z *matrix&lt;num&gt;*, n by p covariates that are not subject to measurement error.
+#' @param nburn *integer*, burn-in iteration, default 1000.
+#' @param nsave *integer*, number of posterior samples, default 1000. Total number of MCMC iteration is \code{nburn + nsave * nthin}.
+#' @param nthin *integer*, thin-in rate, default 1.
+#' @param prior *list*, list of prior parameters of regression model. Default is \code{list(var_beta = 100, a_Y = 0.01, b_Y = 0.01)}.
+#' @param saveX *logical*, default FALSE, whether save posterior samples of X (exposure).
+#'
+#' @return list of the following:
+#' \describe{
+#'   \item{posterior}{The \code{nsave} by (q + p + 1) matrix of posterior samples of \eqn{\beta_X}, \eqn{\beta_Z}, \eqn{\sigma_Y^2} as a coda::\link[coda]{mcmc} object.}
+#'   \item{time}{Time taken for running MCMC (in seconds)}
+#'   \item{X_save}{ (if \code{saveX = TRUE}) Posterior samples of X}
+#' }
+#'
+#' @references
+#' Lee, C. J., Symanski, E., Rammah, A., Kang, D. H., Hopke, P. K., & Park, E. S. (2024). A scalable two-stage Bayesian approach accounting for exposure measurement error in environmental epidemiology. arXiv preprint arXiv:2401.00634.
+#'
 #' @export
 #'
 #' @examples
@@ -86,16 +99,16 @@
 #' bayesplot::mcmc_trace(fit_me$posterior)
 #' }
 #'
-blinreg_me <- function(Y,
+blm_me <- function(Y,
                       X_mean,
                       X_prec,
                       Z,
-                      nburn = 2000,
-                      nsave = 2000,
-                      nthin = 5,
+                      nburn = 1000,
+                      nsave = 1000,
+                      nthin = 1,
                       prior = NULL,
-                      saveX = F){
-
+                      saveX = FALSE){
+  #### input check ####
   # prior input, default
   if(is.null(prior)){
     prior = list(var_beta = 100,a_Y = 0.01, b_Y = 0.01)
@@ -134,6 +147,8 @@ blinreg_me <- function(Y,
     }
   }
 
+  #### initialize ####
+
   X = matrix(0, n_y, q)
   for(qq in 1:q) X[,qq] = X_mean[[q]]
   if(is.null(names(X_mean))){
@@ -156,7 +171,7 @@ blinreg_me <- function(Y,
   Sigma_beta = diag(var_beta, ncol(D))# 3 coefficients(beta0, beta1, betaz)
   Sigma_betainv = solve(Sigma_beta)
 
-  # initialize
+  # parameter
   sigma2_Y = 1
   beta = rep(0.1, ncol(D))
 
@@ -170,8 +185,9 @@ blinreg_me <- function(Y,
   }
 
   YtY = crossprod(Y)
-  #browser()
-  # sampler starts
+
+  #### run MCMC ####
+
   isave = 0
   isnegative = numeric(n_y)
   pb <- txtProgressBar(style=3)
@@ -212,6 +228,8 @@ blinreg_me <- function(Y,
   }
   t_diff = difftime(Sys.time(), t_start, units = "secs")
   #print(paste0("Exposure components contains negative vaules total ",sum(isnegative)," times among (# exposures) x n_y x (MCMC iter after burnin) = ",q," x ",n_y," x ",nsave*nthin," instances"))
+
+  #### output ####
   out = list()
   out$posterior = cbind(beta_save, sigma2_save)
   out$posterior = coda::mcmc(out$posterior)