Commit

updating synthetics data generation
SoniaDem committed Dec 19, 2024
1 parent a141f8d commit 288d41c
Showing 1 changed file with 111 additions and 2 deletions.
113 changes: 111 additions & 2 deletions docs/notebooks/time_varying.ipynb
@@ -118,15 +118,117 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simulating a dataset\n",
"## Simulating a dataset: first approach to test dimensions but doesn't guarantee meaningful results\n",
"\n",
"We will simulate a dataset of 100 subjects with 10 follow up times where a covariate is observed. The covariates will follow a trigonometric function over time and will be dependant on a random variable to differentiate between subjects.\n",
"\n",
"For each $i$ the covariate follows the function:\n",
"\n",
"$$ Z_i(t) = a_i \\cos(2 \\pi t) $$\n",
"\n",
"where $a_i \\sim N(5, 2.5)$."
"where $a_i \\sim N(5, 2.5)$.\n",
"\n",
"## Proper simulation guidance: data that can be interpreted\n",
"\n",
"A good approach for simulating data is described in detail by [Ngwa et al 2020](https://pmc.ncbi.nlm.nih.gov/articles/PMC7731987/). If this is not yet implemented, it would be a good way of starting to ensure that both methods work as expected. There are tow parts in simulating such a dataset. First, simulating the longitudina lobservational data and then the survival data. Below we describe methodologies for both.\n",
"\n",
"### Longitudinal data (covariates)\n",
"\n",
"We use $i \\in \\{1, \\dots, n\\}$ to index subjects and $j \\in \\{1, \\dots, m_i\\}$ to index time points where $m_i$ is the final time point for subject $i$.\n",
"We simulate covariates independantly:\n",
"- age at baseline $Age_i \\sim N(35,5)$\n",
"- sex $\\sim Bernoulli(p=0.54)$\n",
"\n",
"Generate expected longitudinal trajectories $\\varphi_{\\beta}(t_{ij})$:\n",
"\n",
"$$ \\varphi_{\\beta}(t_{ij}) = b_{i1} + b_{i2} \\cdot t_{ij} + \\alpha Age_i, $$\n",
"\n",
"where $b_{i1}, b_{i2}$ are random effects\n",
"\n",
"We will generate $b_{i1}, b_{i2}$ from multivariate normal distribution with a covariance matrix $G = [[0.29, -0.00465],[-0.00465, 0.000320]]$. Sample from this multivariate normal distribution (with mean zero) to get the random intercept and slope.\n",
"\n",
"The observed longitudinal measures measures $Y_{ij}(t_{ij})$ from a multivariate normal distribution with mean $ \\varphi_{\\beta}(t_{ij})$ and variance $V$:\n",
"\n",
"$$ V = Z_i GZ_i ^T + R_i, \\text{ where }Z_i = [[1,1,1,1,1,1]^T, [0,5,10,15,20,25]^T]$$\n",
"\n",
"and $R_i = diag(\\sigma^2)$ and $\\sigma^2$ is set to $0.1161$."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[33.7853, 34.1568, 33.7249, 33.9724, 34.4417, 34.4528],\n",
" [33.1087, 33.4781, 32.5054, 33.1090, 32.9212, 33.4908],\n",
" [31.8224, 31.8031, 32.1202, 32.3814, 31.4848, 31.9074],\n",
" [36.1902, 35.9910, 36.4153, 36.2511, 35.8788, 36.4300]])\n"
]
}
],
"source": [
"import torch.distributions as dist\n",
"\n",
"# Set random seed for reproducibility\n",
"torch.manual_seed(123)\n",
"\n",
"n = 100 # Number of subjects\n",
"T = 6 # Number of time points\n",
"\n",
"# Simulation parameters\n",
"age_mean = 35\n",
"age_std = 5\n",
"sex_prob = 0.54\n",
"G = torch.tensor([[0.29, -0.00465],[-0.00465, 0.000320]])\n",
"Z = torch.tensor([[1, 1, 1, 1, 1, 1], [0, 5, 10, 15, 20, 25]], dtype=torch.float32).T\n",
"sigma = torch.tensor([0.1161])\n",
"alpha = 1\n",
"\n",
"# Simulate age at baseline\n",
"age_dist = dist.Normal(age_mean, age_std)\n",
"age = age_dist.sample((n,))\n",
"\n",
"# Simulate sex\n",
"sex_dist = dist.Bernoulli(probs=sex_prob)\n",
"sex = sex_dist.sample((n,))\n",
"\n",
"# Simulate random effects\n",
"random_effects_dist = dist.MultivariateNormal(torch.zeros(2), G)\n",
"random_effects = random_effects_dist.sample((n,))\n",
"\n",
"# Generate expected longitudinal trajectories\n",
"# quite frakly this is useless now - it was based on my bad understanding of the algorithm\n",
"trajectories = random_effects[:, 0].unsqueeze(1) + random_effects[:, 1].unsqueeze(1) * Z[:,1] + alpha * age.unsqueeze(1)\n",
"\n",
"# Simulate observed longitudinal measures\n",
"R = torch.diag_embed(sigma.repeat(T))\n",
"V = torch.matmul(torch.matmul(Z, G), Z.T) + R\n",
"\n",
"#get a mean trajectory\n",
"b1 = torch.tensor([4.250])\n",
"b2 = torch.tensor([0.250])\n",
"mean_trajectory = b1.item() + b2.item() * Z[:,1] + alpha * age_mean\n",
"\n",
"#define the distribution to sample the trajectories from\n",
"observed_data_dist = dist.MultivariateNormal(trajectories, V)\n",
"\n",
"#sample from the distribution to get an n x T matrix of observations/covariates\n",
"observed_data = observed_data_dist.sample((1,)).squeeze()\n",
"\n",
"print(observed_data[1:5, :])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Survival data (outcomes)\n",
"\n",
"here I will describe how to get the survival and censoring for all the subjects from above. then I will code it up in python."
]
},
{
@@ -160,6 +262,13 @@
"covars = matrix * random_vars[:, None]\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
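
Note on the longitudinal simulation cell in this diff. Under the usual linear mixed model identity, if $b_i \sim N(0, G)$ and the noise is $\varepsilon_i \sim N(0, R_i)$, then $Y_i = \text{fixed part} + Z_i b_i + \varepsilon_i$ has marginal covariance $V = Z_i G Z_i^T + R_i$. So there are two equivalent ways to draw $Y_i$: sample the random effects and add independent noise, or use a fixed mean with covariance $V$. The cell above puts the sampled random effects into the mean and also uses $V$ as the covariance, which appears to count the random-effect variability twice. The following is a minimal sketch of the first route, written in plain Python so that it has no dependencies; it is not the author's code and not necessarily the exact procedure of Ngwa et al. The function names `marginal_covariance` and `sample_subject` are hypothetical, and the parameter values are copied from the cell above.

```python
import math
import random

# Parameters copied from the notebook cell; names are illustrative.
G = [[0.29, -0.00465], [-0.00465, 0.000320]]  # covariance of (intercept, slope)
times = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0]     # second column of Z_i
sigma2 = 0.1161                                # residual variance sigma^2
alpha = 1.0                                    # coefficient on age


def marginal_covariance(G, times, sigma2):
    """Return V = Z G Z^T + sigma2 * I, where row j of Z is [1, t_j]."""
    V = []
    for j, tj in enumerate(times):
        row = []
        for k, tk in enumerate(times):
            # [1, tj] G [1, tk]^T for a symmetric 2x2 matrix G
            v = G[0][0] + G[0][1] * (tj + tk) + G[1][1] * tj * tk
            if j == k:
                v += sigma2  # R contributes only to the diagonal
            row.append(v)
        V.append(row)
    return V


def sample_subject(rng, age):
    """Draw one subject's measurements: random effects plus independent noise."""
    # Cholesky factor of the 2x2 matrix G, written out by hand.
    l11 = math.sqrt(G[0][0])
    l21 = G[1][0] / l11
    l22 = math.sqrt(G[1][1] - l21 ** 2)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    b1 = l11 * z1              # random intercept
    b2 = l21 * z1 + l22 * z2   # random slope, correlated with the intercept
    sd = math.sqrt(sigma2)
    return [b1 + b2 * t + alpha * age + rng.gauss(0.0, sd) for t in times]


rng = random.Random(123)
V = marginal_covariance(G, times, sigma2)
ages = [rng.gauss(35.0, 5.0) for _ in range(20000)]
Y = [sample_subject(rng, a) for a in ages]

# Empirical variance at t = 0 after removing the age term; it should be
# close to V[0][0] if the two formulations agree.
resid0 = [y[0] - alpha * a for y, a in zip(Y, ages)]
mean0 = sum(resid0) / len(resid0)
var0 = sum((r - mean0) ** 2 for r in resid0) / (len(resid0) - 1)
print("V[0][0] =", round(V[0][0], 4), "empirical =", round(var0, 3))
```

In the notebook itself the same idea would presumably be written with `torch`: either add independent noise to `trajectories`, or sample from `dist.MultivariateNormal` with a mean that contains only the fixed part and covariance `V`.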
