updating synthetics data generation

Novartis · Dec 19, 2024 · 288d41c · 288d41c
1 parent a141f8d
commit 288d41c
Showing 1 changed file with 111 additions and 2 deletions.
diff --git a/docs/notebooks/time_varying.ipynb b/docs/notebooks/time_varying.ipynb
@@ -118,15 +118,117 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Simulating a dataset\n",
+    "## Simulating a dataset: first approach to test dimensions but doesn't guarantee meaningful results\n",
     "\n",
     "We will simulate a dataset of 100 subjects with 10 follow up times where a covariate is observed. The covariates will follow a trigonometric function over time and will be dependant on a random variable to differentiate between subjects.\n",
     "\n",
     "For each $i$ the covariate follows the function:\n",
     "\n",
     "$$ Z_i(t) = a_i \\cos(2 \\pi t) $$\n",
     "\n",
-    "where $a_i \\sim N(5, 2.5)$."
+    "where $a_i \\sim N(5, 2.5)$.\n",
+    "\n",
+    "## Proper simulation guidance: data that can be interpreted\n",
+    "\n",
+    "A good approach for simulating data is described in detail by [Ngwa et al 2020](https://pmc.ncbi.nlm.nih.gov/articles/PMC7731987/). If this is not yet implemented, it would be a good way of starting to ensure that both methods work as expected. There are tow parts in simulating such a dataset. First, simulating the longitudina lobservational data and then the survival data. Below we describe methodologies for both.\n",
+    "\n",
+    "### Longitudinal data (covariates)\n",
+    "\n",
+    "We use $i \\in \\{1, \\dots, n\\}$ to index subjects and $j \\in \\{1, \\dots, m_i\\}$ to index time points where $m_i$ is the final time point for subject $i$.\n",
+    "We simulate covariates independantly:\n",
+    "- age at baseline $Age_i \\sim N(35,5)$\n",
+    "- sex $\\sim Bernoulli(p=0.54)$\n",
+    "\n",
+    "Generate expected longitudinal trajectories $\\varphi_{\\beta}(t_{ij})$:\n",
+    "\n",
+    "$$ \\varphi_{\\beta}(t_{ij}) = b_{i1} + b_{i2} \\cdot t_{ij} + \\alpha Age_i, $$\n",
+    "\n",
+    "where $b_{i1}, b_{i2}$ are random effects\n",
+    "\n",
+    "We will generate $b_{i1}, b_{i2}$ from multivariate normal distribution with a covariance matrix $G = [[0.29, -0.00465],[-0.00465, 0.000320]]$. Sample from this multivariate normal distribution (with mean zero) to get the random intercept and slope.\n",
+    "\n",
+    "The observed longitudinal measures measures $Y_{ij}(t_{ij})$ from a multivariate normal distribution with mean $ \\varphi_{\\beta}(t_{ij})$ and variance $V$:\n",
+    "\n",
+    "$$ V = Z_i GZ_i ^T + R_i, \\text{ where }Z_i = [[1,1,1,1,1,1]^T, [0,5,10,15,20,25]^T]$$\n",
+    "\n",
+    "and $R_i = diag(\\sigma^2)$ and $\\sigma^2$ is set to $0.1161$."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "tensor([[33.7853, 34.1568, 33.7249, 33.9724, 34.4417, 34.4528],\n",
+      "        [33.1087, 33.4781, 32.5054, 33.1090, 32.9212, 33.4908],\n",
+      "        [31.8224, 31.8031, 32.1202, 32.3814, 31.4848, 31.9074],\n",
+      "        [36.1902, 35.9910, 36.4153, 36.2511, 35.8788, 36.4300]])\n"
+     ]
+    }
+   ],
+   "source": [
+    "import torch.distributions as dist\n",
+    "\n",
+    "# Set random seed for reproducibility\n",
+    "torch.manual_seed(123)\n",
+    "\n",
+    "n = 100  # Number of subjects\n",
+    "T = 6    # Number of time points\n",
+    "\n",
+    "# Simulation parameters\n",
+    "age_mean = 35\n",
+    "age_std = 5\n",
+    "sex_prob = 0.54\n",
+    "G = torch.tensor([[0.29, -0.00465],[-0.00465, 0.000320]])\n",
+    "Z = torch.tensor([[1, 1, 1, 1, 1, 1], [0, 5, 10, 15, 20, 25]], dtype=torch.float32).T\n",
+    "sigma = torch.tensor([0.1161])\n",
+    "alpha = 1\n",
+    "\n",
+    "# Simulate age at baseline\n",
+    "age_dist = dist.Normal(age_mean, age_std)\n",
+    "age = age_dist.sample((n,))\n",
+    "\n",
+    "# Simulate sex\n",
+    "sex_dist = dist.Bernoulli(probs=sex_prob)\n",
+    "sex = sex_dist.sample((n,))\n",
+    "\n",
+    "# Simulate random effects\n",
+    "random_effects_dist = dist.MultivariateNormal(torch.zeros(2), G)\n",
+    "random_effects = random_effects_dist.sample((n,))\n",
+    "\n",
+    "# Generate expected longitudinal trajectories\n",
+    "# quite frakly this is useless now - it was based on my bad understanding of the algorithm\n",
+    "trajectories = random_effects[:, 0].unsqueeze(1) + random_effects[:, 1].unsqueeze(1) * Z[:,1] + alpha * age.unsqueeze(1)\n",
+    "\n",
+    "# Simulate observed longitudinal measures\n",
+    "R = torch.diag_embed(sigma.repeat(T))\n",
+    "V = torch.matmul(torch.matmul(Z, G), Z.T) + R\n",
+    "\n",
+    "#get a mean trajectory\n",
+    "b1 = torch.tensor([4.250])\n",
+    "b2 = torch.tensor([0.250])\n",
+    "mean_trajectory =  b1.item() + b2.item() * Z[:,1] + alpha * age_mean\n",
+    "\n",
+    "#define the distribution to sample the trajectories from\n",
+    "observed_data_dist = dist.MultivariateNormal(trajectories, V)\n",
+    "\n",
+    "#sample from the distribution to get an n x T matrix of observations/covariates\n",
+    "observed_data = observed_data_dist.sample((1,)).squeeze()\n",
+    "\n",
+    "print(observed_data[1:5, :])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Survival data (outcomes)\n",
+    "\n",
+    "here I will describe how to get the survival and censoring for all the subjects from above. then I will code it up in python."
    ]
   },
   {
@@ -160,6 +262,13 @@
     "covars = matrix * random_vars[:, None]\n"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "markdown",
    "metadata": {},