-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpedLikelihood_code.Rmd
805 lines (706 loc) · 31.6 KB
/
pedLikelihood_code.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
---
title: "A pedigree-transmission likelihood for multiplex families"
date: "`r Sys.Date()`"
output:
pdf_document: default
pdf_notebook: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```
This document presents a tutorial describing Tianyu's work in summer 2022 to implement a pedigree-transmission likelihood. We are interested in transmission-parameter
likelihoods to evaluate the evidence that a rare variant (RV) is causally linked to the disease in an extended family, or pedigree. To implement the likelihood we use the `gRain` R package [1] to propagate probability in Bayesian networks. Bayesian networks are a special case of $gRa$phical $i$ndependence $n$etworks. In graphical independence networks,
random variables are nodes in a graph
and conditional dependencies are edges between nodes,
allowing a compact representation of
the joint probability distribution of the random variables.
The graphical independence networks in the gRain package are restricted to comprising
discrete variables, each with a finite state space.
The project's goal is to calculate the joint
probability of an RV configuration in the diseased individuals with DNA available in the pedigree.
A pedigree can be viewed as a **directed** graphical
model of the RV status of its members, with the RV
status of a child depending causally
on the RV status of its
parents. Thus a pedigree can be
cast as a Bayesian network,
a directed acyclic graph.
We use `gRain` because it allows us to cast a pedigree as a Bayesian network
and more easily calculate transmission-parameter likelihoods.
To start, we load the required packages: `gRain` and the pedigree drawing and manipulation R package `kinship2` [2].
```{r}
library(gRain)
library(kinship2)
```
# 1. Getting to know the `gRain` package
We work through a simple example
showing how to use `gRain` functions to calculate the
probability that the affected individuals in a pedigree
received a rare variant (RV) from a carrier founder,
given that the RV is transmitted with probability 3/4.
The example pedigree is shown below, drawn with
the `pedigree()` function of the `kinship2` package.
The first step for plotting is to label each family
member with an ID number and indicate their father and mother.
```{r}
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
# Label each family member with id number
father = c(0, 0, 1, 0, 1, 0, 3, 5, 5)
# Use id number to indicate the father of each family member
mother = c(0, 0, 2, 0, 2, 0, 4, 6, 6)
# Use id number to indicate the mother of each family member
```
Next indicate the biological sex, affection status
and DNA availability of each individual. In our applications,
DNA is typically available only from living disease-affected members
of the pedigree. The living disease-affected members
tend to be in the more recent generations of a pedigree, and are typically non-founders.
```{r}
sex = c(1, 2, 1, 2, 1, 2, 2, 2, 2)
# 1 represents male, 2 represents female.
affected = c(0, 0, 0, 0, 0, 0, 1, 1, 1)
# affected specifies which family members are affected by the disease.
# 0 and 1 represent individuals who don't and do have the disease, respectively.
avail = c(0, 0, 0, 0, 0, 0, 1, 1, 1)
# avail specifies which family members have DNA available.
# 0 and 1 represent individuals who don't and do have DNA available, respectively.
```
Assemble the information into a data frame for the pedigree and then convert
the data frame into a pedigree object.
```{r}
toyPed = as.data.frame(cbind(id,father,mother,sex,avail, affected))
toyPed_ob = pedigree(toyPed$id, toyPed$father, toyPed$mother, toyPed$sex,
affected=cbind(toyPed$affected, toyPed$avail))
# Use pedigree() function from the kinship2 package to construct the pedigree object.
```
Finally, plot the resulting pedigree with affection status marked on the left and DNA availability marked on the right of nodes for individuals.
```{r}
plot(toyPed_ob)
```
From the output above, we can see that ID's 7, 8 and 9 are the only affected members and the only ones with DNA available.
We would like to cast this pedigree as a Bayesian network.
A Bayesian network models the joint distribution of a set of variables in terms of their conditional dependencies
in a directed graph. A network consists of nodes (variables)
and components known as conditional probability tables (CPTs). CPTs specify the local conditional dependencies
between variables in the network.
When the dependencies among variables are
local in nature, it is a lot easier to understand
them in terms of CPTs than
in terms of the joint distributions of all the variables.
In our context, the variables are the RV status of
individuals in a pedigree and the pedigree is the directed
graph. The conditional probability tables specify
the local dependency of a child's RV genotype given
the RV genotypes of their two parents. For pedigrees
with no inbreeding,
the local dependency structure is straightforward,
with children depending only on their parents.
To set up the Bayesian network, we first specify
the levels for the RV status of pedigree members
as 0, 1 or 2 copies of the RV.
```{r}
copy = c("0", "1", "2")
# number of variants levels: 0, 1, 2
```
Next we consider all $3^3=27$
possible configurations of the RV status
for the child and its two parents. For each
configuration, we specify the conditional probability of the child's
RV status given the RV status of the two parents assuming
that the transmission probability of the RV from a parent
to the child is $\tau=3/4$. We will use these conditional probabilities
to set up our CPTs.
```{r}
# conditional probability of child in the first entry of the
# triplet marked below, given parents in the second and third
# entries of triplet.
child_1 = c(1, 0, 0, 1/4, 3/4, 0, 0, 3/4, 0)
# 000 100 200 010 110 210 020 120 220
child_2 = c(1/4, 3/4, 0, 1/16, 3/8, 9/16, 0, 1/4, 3/4)
# 001 101 201 011 111 211 021 121 221
child_3 = c(0, 1, 0, 0, 1/4, 3/4, 0, 0, 1)
# 002 102 202 012 112 212 022 122 222
child = c(child_1, child_2, child_3)
```
We may now specify the CPTs with the values given in the `child` vector. We start with the pedigree non-founders. These
are the individuals who have information on both parents.
According to the pedigree diagram above, the non-founders are
IDs 3, 5, 7, 8 and 9. The non-founders with IDs 3 and 5
are siblings and have parents with IDs 1 and 2.
The non-founder with ID 7 has parents with IDs 3 and 4.
Finally, the non-founders with IDs 8 and 9 are siblings and have
parents with IDs 5 and 6. In the CPT `formula` argument, the child node comes first after the `~`, followed by the parent nodes,
because the child's RV status is conditionally
dependent on the parents' RV status.
```{r}
# CPTs for non-founders
ID_3.12 = cptable(~3 + 1 + 2, values = child, levels = copy)
ID_5.12 = cptable(~5 + 1 + 2, values = child, levels = copy)
ID_7.34 = cptable(~7 + 3 + 4, values = child, levels = copy)
ID_8.56 = cptable(~8 + 5 + 6, values = child, levels = copy)
ID_9.56 = cptable(~9 + 5 + 6, values = child, levels = copy)
# Note the order of the variables after the + operator matters.
# The child nodes come first after ~ followed by the parents.
```
We then set up the conditional probability tables for the
founders of the pedigree.
According to the pedigree diagram above, the
founders are
IDs 1, 2, 4 and 6.
Since founders have no parents, we specify the
marginal probability of their RV genotypes in their CPT
tables. Our application does not require
the marginal RV probabilities for the founders
but the `gRain` package does.
We therefore set the three RV
genotypes in founders to be equiprobable.
```{r}
# Conditional probability tables for founders.
ID_1 = cptable(~1, values = rep(1/3, 3), levels = copy)
ID_2 = cptable(~2, values = rep(1/3, 3), levels = copy)
ID_4 = cptable(~4, values = rep(1/3, 3), levels = copy)
ID_6 = cptable(~6, values = rep(1/3, 3), levels = copy)
# We don't know the values for founders since they don't
# have parents in our pedigree.
```
Next we combine all the CPTs into a list for our Bayesian network:
```{r}
plist_pedigree <-
compileCPT(list(ID_1, ID_2, ID_3.12, ID_4, ID_5.12,
ID_6, ID_7.34, ID_8.56, ID_9.56))
```
From this combined CPT list we can extract the individual
CPTs of individuals in the pedigree. For example, we can
extract the CPT for the individual with ID 3 as follows.
```{r}
plist_pedigree$`3`
```
The output gives the conditional probabilities that ID 3
has 0, 1, or 2 copies of the RV, given the RV genotype
status of its parents ID 1 and ID 2.
We are ready to create the network from the list of CPTs.
```{r}
gin1 = grain(plist_pedigree)
summary(gin1)
```
The `summary()` of the network tells us it has
five "cliques", corresponding to the CPT parent-child trios for the five children
in the pedigree. The maximal state-space in these cliques
is $3^3=27$, corresponding to possible RV-status
outcomes for the child-parent trios.
We now have a Bayesian network for efficiently
calculating joint RV probabilities. Being rare, the RV is assumed to enter the pedigree through a single founder. For illustration, let's assume the founder who introduced the RV into the pedigree is ID 1.
The next code chunk calculates the conditional joint probability
that all three of the affected individuals carry the RV
given that the founder with ID 1 introduced it into the pedigree. The RV transmission probability is taken
to be $\tau=3/4$.
We focus on the affected individuals because in medical-genetic studies
they are often the only pedigree members who are genotyped
for RV status.
```{r}
#Condition on founder id1 introducing the RV into the pedigree.
bn1 = setEvidence(gin1, nodes = c("1", "2", "4", "6"),
states = c("1", "0", "0", "0") )
```
In Bayesian networks, "evidence" is what is being conditioned on to get the probability. We use the
`setEvidence()` function to condition on ID 1
being the founder who introduced the single copy of
the RV into the pedigree. The other founders IDs 2, 4, and 6 introduce no copies of the
RV into the pedigree.
Let's start by getting the marginal
distributions of RV status for each of the three affected
individuals.
```{r}
affs=c("7", "8", "9")
querygrain(bn1, nodes = affs, type = "marginal")
```
Next, we move on to get the joint probability distribution for
the three affected individuals.
```{r}
config=c("1","1", "1") #RV configuration for affected members
names(config)=affs #link config to the affected IDs
querygrain(bn1, nodes=affs, type="joint")
```
The output lists the conditional joint distribution
of the RV status of the affected individuals given
that the founder with ID 1 introduced the
RV into the pedigree. This probability is calculated
assuming that the transmission
probability of the RV is $\tau=3/4$. We see that
the probability that all three of the affected individuals
carry a single copy of the RV given that the RV was introduced
into the pedigree by the founder with ID 1 is 0.2373. We may
also get this result by repeated conditioning, also known as the "chain rule" for Bayesian networks.
```{r}
p = unlist(querygrain(bn1, nodes = "7", type = "marginal"))
names(p) = copy
p[config["7"]]
```
```{r}
prob=1
for(n in affs){
# calculate the marginal probability for the node, n
p=unlist(querygrain(bn1, nodes=n))
names(p)=copy
prob=prob*p[config[n]]
# condition on node n's state
bn1 = setEvidence(bn1, nodes=n, states=config[n])
}
prob
```
The output confirms that the
probability that all three of the affected individuals
carry a single copy of the RV introduced
by ID 1 is 0.2373.
# 2. Function to create a Bayesian network
In this section, we present a function to create a Bayesian network to be used in our application. The function takes two arguments:
1. A `kinship2` pedigree dataframe, `pedfile`, and
2. A transmission probability of the RV, `tau`.
This function `BNcreate()` is defined in
the code chunk below.
```{r}
BNcreate = function(pedfile, tau){
## conditional prob of child in the first entry of the triplet
## given parents in the second and third entries of triplet with transmission
## probability tau
geno_prob = c(
# 000 100 200 010 110 210 020 120 220
1, 0, 0, 1-tau, tau, 0, 0, 1, 0,
# 001 101 201 011 111 211 021 121 221
1-tau, tau, 0, (1-tau)^2, 2*tau*(1-tau), tau^2, 0, 1-tau, tau,
# 002 102 202 012 112 212 022 122 222
0, 1, 0, 0, 1-tau, tau, 0, 0, 1
)
# Construct the CPTs for founders
# Founders: columns of father or mother are labelled as 0.
founders = which(pedfile[, "father"] == 0) # the row numbers of founders
founders_vec = rep(0, length(founders))
# store the id number of founders (for user configuration)
founders_cpt = list()
# store the CPTs of founders in a list, which is denoted as [[i]]
for (i in 1:length(founders)) {
id = pedfile[founders[i], "id"]
founders_vec[i] = id
node = cptable(c(id), values = rep(1/3, 3), levels = copy)
# CPTs of founders
founders_cpt[i] = list(node)
}
# Construct the CPTs for non-founders
pedfile_c = pedfile[-c(founders), ]
# Pedigree data after removing founders
nonfounders_cpt = list()
# Store the CPTs for non-founders
for (i in 1:nrow(pedfile_c)) {
c = pedfile_c[i, 'id'] # id numbers of child
f = pedfile_c[i, "father"] # id numbers of father
m = pedfile_c[i, "mother"] # id numbers of mother
node_nf = cptable(c(c, f, m), values = geno_prob, levels = copy)
# CPTs for non-founders
nonfounders_cpt[i] = list(node_nf)
}
plist = compileCPT(founders_cpt, nonfounders_cpt)
gin = grain(plist)
# Create the Bayesian network from the CPTs of both founders and non-founders
return(gin)
}
```
We apply the function to the pedigree dataframe from the previous example.
```{r}
gin = BNcreate(toyPed, 3/4)
# Create the Bayesian network from the toy pedigree dataframe using a transmission probability of 3/4.
```
Let's check that we get the same summary as in the previous part
of this document, when
we created the Bayesian network for the toy pedigree interactively.
```{r}
summary(gin) #gives same summary as above
```
# 3. Construct a function for the transmission likelihood
This next section describes how to calculate the likelihood
of a transmission probability in a multiplex
pedigree, given data on the RV status of the
affected members with available DNA and
assuming that a single founder introduced the
RV into the pedigree. A multiplex pedigree
is a pedigree that has been sampled or "ascertained"
to contain several
members (e.g. 3 or more) who
are affected with the disease of interest.
Many family studies in medical genetics rely on this sampling
design.
Our goal is to calculate the likelihood of a value
of the RV-transmission probability. For example,
above we used a value of $\tau=3/4$ for the transmission
probability in the toy pedigree.
As before, we assume that, being rare,
the RV enters the pedigree through a single founder only.
Therefore, in a pedigree with no inbreeding, the possible values for RV status are 0 (no copies) and 1 (a single copy).
We take the prior probability that any founder is the one that
introduced the RV to be uniform over all the pedigree founders.
The likelihood is obtained from
the conditional probability of the RV configuration
for affected pedigree members with available DNA,
given that exactly one of the founders introduced
a single copy of the RV into the pedigree.
The event that exactly one founder introduced the RV is a union of the mutually exclusive events, $F_i$, that each of the founders,
$i$, introduced the RV. These events are mutually exclusive because the chance of more than one founder carrying a RV is essentially zero. The probability of
the RV configuration $C$ in the affected pedigree members with
available DNA is thus
\begin{eqnarray*}
P(C | \cup_i \! F_i) & =& \frac{P(C, \cup_i F_i)}{P(\cup_i F_i)} \\
&=& \frac{\sum_i P(C, F_i)}{\sum_i P(F_i)}\\
&=& \sum_i P(C|F_i) \frac{P(F_i)}{\sum_j P(F_j)}.
\end{eqnarray*}
Founders are assumed to have the same prior probability of introducing the RV; i.e., $P(F_1)=P(F_2)=P(F_3)=\cdots$. Thus $P(F_i)/\sum_j P(F_j)$ on the right-hand side of the above equation is equal to one over the number of founders. The conditional
probability, $P(C|F_i)$, of the RV configuration given that founder $i$ introduced the RV is obtained by "path counting" the number
of independent transmissions connecting founder $i$ to the affected
members. For example, in the toy pedigree above, an RV from the founder with ID 1 can reach the affected pedigree member with ID 7
through two independent transmissions from ID 1 to ID 3 and then
from ID 3 to ID 7, and can reach IDs 8 and 9 through one independent
transmission to ID 5 and then two independent transmissions
from ID 5 to IDs 8 and 9. Therefore, $P(C=(1,1,1)| F_1) = \tau^{2+1+2}$ and $P(C=(1,1,0)|F_1) = \tau^{2+1+1}(1-\tau)$,
for example. In the calculation for $P(C=(1,1,0)|F_1)$, we
count two transmissions of the RV from ID 1 to ID 7, one
from ID 1 to ID 5, one from ID 5 to ID 8, and
one transmission of the other variant in ID 5
(i.e. the non-RV) to ID 9.
The first step in the likelihood function is to call `BNcreate()`
to set up a Bayesian network for
a given value of the transmission probability.
The pedigree-transmission likelihood function takes 3 arguments:
`pedfile`, `tau` and `config` described below.
The first argument,`pedfile`,
is a `kinship2` pedigree object specifying the pedigree structure.
The second argument, `tau`, is the transmission probability of the RV.
The third argument, `config`, is a numeric vector giving the RV status of affected individuals with DNA available.
Each element of the
vector corresponds to an affected pedigree member with DNA. The elements are ordered by their ID numbering. For example, if the affected individuals with available DNA
have ID numbers 1, 3, 6, 7, 14 and `config =c(1, 0, 1, 1, 1)`,
IDs 1, 6, 7, 14 carry one copy of the RV and ID 3 carries no copies.
We loop through all the founders, and for each
founder $i$ we calculate $P(\mbox{ config } | \mbox { founder } i \mbox{ introduced the RV})$,
the conditional probability
of the RV configuration given that the RV enters the pedigree on founder $i$. In the loop, we sum out over
the product of
this conditional probability and the probability that
founder $i$ is the founder who introduced the RV
to get the desired likelihood from $P(\mbox{ config } | \mbox { exactly one founder introduced the RV})$.
```{r}
likehd = function(pedfile, tau, config){
gin = BNcreate(pedfile, tau)
# Bayesian networks of the specified pedigree with transmission prob tau
likelihood = 0
founders = which(pedfile[, "father"] == 0)
# find the row numbers of founders
founders = as.character(pedfile$id[founders])
# find the ID numbers of founders
affs = which(pedfile[, "affected"] == 1&pedfile[,"avail"]==1)
# find the row numbers of affected individuals with DNA available
affs=as.character(pedfile$id[affs])
# find the ID numbers of affected individuals with DNA available
# Assume the RV enters pedigree on exactly one founder and
# loop through all the founders
for (i in 1:length(founders)) {
state = rep(0, length(founders))
# vector state specifies which founder introduces the RV
state[i] = "1" #1 means carrying the RV.
bn = setEvidence(gin, nodes = founders, states = state)
# Set founder i to introduce the rare variant
# calculate P(config | founder i introduced the rv)
prob=1
for(n in affs){
if(prob > 0) # prevents conditioning on zero prob events
{
# calculate probability for this node, n
p=unlist(querygrain(bn,nodes=n,exclude=FALSE)) #get marginal prob
names(p)=c("0","1","2")
prob=prob*p[config[n]]
# condition on node n's state
bn = setEvidence(bn, nodes=n, states=config[n])
}
}
likelihood = prob*1/length(founders)+likelihood
}
return(likelihood)
}
```
# 4. Testing the likelihood function
## 4.1 Toy pedigree
We start by testing the likelihood function
on the toy pedigree above. We evaluate
the likelihood of $\tau=3/4$ when all three
of the affected pedigree
members with available DNA carry a single
copy of the RV.
```{r}
config=c("1","1","1")
names(config)=c("7","8","9")
likehd(toyPed, 3/4, config)
```
From the output above, we see that, when
the RV configuration is $C=(1,1,1)$, the likelihood of $\tau = 3/4$ is about $.1187$.
This value of the likelihood complies with our hand calculation from
the "path counting" approach:
\begin{eqnarray*}
P(C=(1,1,1) | \cup_i F_i)
&=& \frac{1}{4} \left \{ [\tau^2 \times \tau (\tau^2)] \; +
\; [\tau^2 \times \tau (\tau^2)] \; +
\; [ \tau \times 0 ] \; +
\; [ \tau^2 \times 0 ] \right \} \\
& = & \frac{1}{4} \left \{ [\tau^5] \; + \; [\tau^5] \; + \; [0] \; + \; [0] \right \}\\
&=& \frac{1}{4}\left \{ 2 \times \left ( \frac{3}{4} \right )^5 \right \}\\
&=&.1187
\end{eqnarray*}
Next, we evaluate the likelihood of
$\tau=3/4$ when the RV configuration for
IDs 7, 8, and 9 is $C=(1,1,0)$. In this configuration,
IDs 7 and 8 both carry the RV but ID 9 does not.
```{r}
config=c("1","1","0")
names(config)=c("7","8","9")
likehd(toyPed, 3/4, config)
```
This value of the likelihood complies with our hand calculation of $\frac{1}{2}\tau^4 (1-\tau)=\frac{1}{2} \left ( \frac{3}{4} \right )^4 \left ( 1 - \frac{3}{4} \right )=0.0396$ from the path-counting approach.
## 4.2 Modified `kinship2` pedigrees
Consider a modified version
of the second of the two
sample pedigrees in the `kinship2` R package.
```{r}
# load kinship2 pedigree data
data("sample.ped") #two pedigrees
testPed=sample.ped[sample.ped$ped==2,] #2nd pedigree
# modify the kinship2 pedigree data
testPed[testPed$id==203,"affected"]=0
testPed[testPed$id==211,"affected"]=1
testPed[testPed$id==202,"affected"]=0
testPed[testPed$id==201,"avail"]=0
testPed[testPed$id==203,"avail"]=0
testPed[testPed$id==204,"avail"]=0
testPed[testPed$id==212,"avail"]=0
testPed # Unknown affection status has NA
```
From the above data-frame, we see that ID
205 has unknown affection status.
This person is marked as "?" in the
pedigree plot below.
```{r}
testPed_ob=pedigree(testPed$id, testPed$father,testPed$mother, testPed$sex, affected=cbind(testPed$affected,testPed$avail))
plot(testPed_ob)
```
The affected family members with DNA available have IDs 206, 207, 211 and 214.
Let's evaluate the likelihood of $\tau=1/2$
for various configurations of the RV status
in these individuals. First,
suppose that all four of them share
a copy of the RV; i.e. $C=(1,1,1,1)$.
```{r}
config1 = c("1", "1", "1", "1")
names(config1)=c("206","207","211", "214")
likehd(testPed, 1/2, config1)
```
Similarly, from the output above, we observe that, when the RV configuration is
$C = (1,1,1,1)$, the likelihood of $\tau = 1/2$ is about $.0078125$. This likelihood
value also complies with our hand calculation from the "path counting" approach:
\begin{eqnarray*}
P(C=(1,1,1,1) | \cup_i F_i)
&=& \frac{1}{4} \left \{ [ \tau^2 \times \tau \times \tau \times \tau^2] \; +
\; [ \tau^2 \times \tau \times \tau \times \tau^2] \; +
\; [1 \times \tau \times 0 ] \; +
\; [1 \times 0] \right \} \\
&=& \frac{1}{4} \left \{ [\tau^6] \; + \; [\tau^6] \; + \; [0] \; + \; [0] \right \} \\
&=& \frac{1}{4} \left \{ 2 \times \left( \frac{1}{2} \right )^6 \right \} \\
&=& .0078125
\end{eqnarray*}
Next, we evaluate the likelihood of $\tau = \frac{1}{2}$ when the RV configuration
for IDs 206, 207, 211, 214 is $C = (1,0,0,0)$, which indicates that only ID 206
carries the RV.
```{r}
config4 = c("1", "0", "0", "0")
names(config4)=c("206","207","211", "214")
likehd(testPed, 1/2, config4)
```
From the output above, we see that when the RV configuration is C = (1,0,0,0),
the likelihood of $\tau = 1/2$ is about $.0703125$. The likelihood value also complies
with our hand calculation from the "path counting" approach:
\begin{eqnarray*}
P(C = (1,0,0,0) | \cup_i F_i)
&=& \frac{1}{4} \left \{ [ \tau \times \left ( 1 - \tau \right ) \times
\left ( \left (1 - \tau \right ) + \tau \times \left( 1 - \tau \right ) \right )^2] \;\;\;\; + \right. \\
& & \;\;\;\;\;\;\;\; \left.
\; [ \tau \times \left ( 1 - \tau \right ) \times
\left ( \left (1 - \tau \right ) + \tau \times \left( 1 - \tau \right ) \right )^2] \; +
\; [0] \; + \; [0] \right \} \\
&=& \frac{1}{4} \left \{ \frac{1}{2} \times \frac{1}{2} \times
\left ( \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} \right )^2 \; +
\; \frac{1}{2} \times \frac{1}{2} \times
\left(\frac{1}{2} + \frac{1}{2} \times \frac{1}{2}\right)^2 \right \} \\
&=& .0703125
\end{eqnarray*}
Next consider `testPed2`, a version of the pedigree above in which the founder with ID 201 has DNA available.
```{r}
testPed2=testPed
testPed2[testPed2$id==201,"avail"]=1 #Change ID 201
testPed2
```
From the output above, we see that the disease-affected founder with ID 201 now has DNA available.
Let's work through the likelihood calculation when all
the affected pedigree members with DNA in `testPed2`
carry the RV. In this case, the RV configuration is
$C=(1,1,1,1,1)$ for IDs 201, 206, 207, 211 and 214.
This pedigree has four founders with IDs 201, 202, 203
and 209.
If ID 201 is the founder introducing the RV, the
conditional probability of the configuration is
$$P(C = (1,1,1,1,1) \, | F_{201}) = 1\times \tau \times
\tau \times \tau^2 \times \tau^2 = \tau^6.$$
If ID 202 is the founder introducing the RV,
the conditional probability of the configuration is
$$P(C = (1,1,1,1,1) \, | F_{202}) = 0 \times \tau \times
\tau \times \tau^2 \times \tau^2 = 0.$$
Similarly, if IDs 203 or 209 are the founders
introducing
the RV, the conditional probability of the
configuration is zero. Therefore, the likelihood
function for the transmission probability
is $P(C = (1,1,1,1,1) \, | \cup_i \!{F_i}) =
\frac{1}{4} \tau^6.$ This
likelihood evaluates to $0.00390625$ for $\tau=1/2$,
as confirmed by the code chunk below:
```{r}
config5 = c("1", "1", "1", "1", "1")
names(config5)=c("201","206","207","211", "214")
likehd(testPed2, 1/2, config5)
```
Now suppose the RV configuration is $C=(0,1,1,1,1)$.
If ID 201 is the founder introducing the RV, the
conditional probability of the configuration
is $P(C = (0,1,1,1,1) \, | F_{201})=0$ because
ID 201 not carrying the RV
conflicts with the conditioning event that
ID 201 introduced the RV into the pedigree.
If ID 202 is the founder introducing the RV,
the conditional probability of the configuration is
$P(C = (0,1,1,1,1) \, | F_{202})= \tau \times \tau \times \tau^2 \times \tau^2 = \tau^6$.
If IDs 203 or 209 introduce the RV, the conditional probability of the configuration is
$P(C = (0,1,1,1,1) \, | F_{203})=P(C = (0,1,1,1,1) \, | F_{209})=0$. The likelihood function for $\tau$ is
therefore, again,
$P(C = (0,1,1,1,1) \, | \cup_i \! F_i) =\frac{1}{4} \tau^6$ and evaluates to $0.00390625$ for $\tau=1/2$,
as confirmed by the code chunk below:
```{r}
config6 = c("0", "1", "1", "1", "1")
names(config6)=c("201","206","207","211", "214")
likehd(testPed2, 1/2, config6)
```
# 5. Likelihood curves
This section plots likelihood curves for the
toy pedigree created in section 1 and the
modified `kinship2` pedigree created in section 4.2.
## 5.1 Toy pedigree
We start by considering the toy pedigree. First,
we plot likelihood values versus a sequence of transmission
probabilities, $\tau$, when the RV configuration for IDs 7, 8, and 9 is $C = (1,1,1)$.
```{r}
# Create a sequence of tau values
tau.seq = seq(from = 0, to = 1, by = 0.01)
#first configuration example in the toy pedigree
config1=c("1","1","1")
names(config1)=c("7","8","9")
likehd.vec = c()
for (p in tau.seq) {
likelihood = likehd(toyPed, p, config1)
likehd.vec = append(likehd.vec, likelihood)
}
plot(tau.seq, likehd.vec, type = "l", xlab = "transmission probabilities", ylab = "likelihood values"
,main = "Likelihood curve when the RV configuration is C = (1,1,1)")
```
From the above plot, we observe that, for configuration
$C=(1,1,1)$, the likelihood values increase rapidly once the transmission probability is greater than 0.6, up to a likelihood
value of 0.5. The maximum-likelihood estimate of the
transmission probability is $\hat \tau =1$.
Next, we plot the likelihood curve for
RV configuration $C = (1,1,0)$.
```{r}
config2=c("1","1","0")
names(config2)=c("7","8","9")
likehd.vec = c()
for (p in tau.seq) {
likelihood = likehd(toyPed, p, config2)
likehd.vec = append(likehd.vec, likelihood)
}
plot(tau.seq, likehd.vec, type = "l", xlab = "transmission probabilities", ylab = "likelihood values",
main = "Likelihood curve when the RV configuration is C = (1,1,0)")
```
From the above plot, we see that for $C=(1,1,0)$
the likelihood curve
reaches a peak at a transmission probability of approximately 0.8.
Likelihood values increase with the
transmission probability when the transmission probability is less than 0.8,
and decrease when the transmission probability is greater than 0.8.
The maximum-likelihood estimate of the
transmission probability is thus $\hat \tau \approx 0.8$
## Modified `kinship2` pedigree
Next, we plot two likelihood curves for the
first modified `kinship2` pedigree in section 4.2.
We start by considering the configuration $C=(1,1,1,1)$ when all four of the affected pedigree members
with available DNA carry a single copy of the RV.
```{r}
config3 = c("1", "1", "1", "1")
names(config3)=c("206","207","211", "214")
likehd.vec = c()
for (p in tau.seq) {
likelihood = likehd(testPed, p, config3)
likehd.vec = append(likehd.vec, likelihood)
}
plot(tau.seq, likehd.vec, type = "l", xlab = "transmission probabilities", ylab = "likelihood values",
main = "Likelihood curve when the RV configuration is C = (1,1,1,1)")
```
From the above plot, we see that the likelihood values increase rapidly when
transmission probability is greater than 0.6.
The maximum likelihood estimate of the
transmission probability in this case is
$\hat \tau =1$.
Next suppose that we observe the configuration
$C=(1,1,0,1)$.
```{r}
config4 = c("1", "1", "0", "1")
names(config4)=c("206","207","211", "214")
likehd.vec = c()
for (p in tau.seq) {
likelihood = likehd(testPed, p, config4)
likehd.vec = append(likehd.vec, likelihood)
}
plot(tau.seq, likehd.vec, type = "l", xlab = "transmission probabilities", ylab = "likelihood values",
main = "Likelihood curve when the RV configuration is C = (1,1,0,1)")
```
From the above plot, we observe that the likelihood
curve reaches its peak value
when the transmission probability is approximately 0.8. Likelihood values increase with the
transmission probability when the transmission
probability is less than 0.8, and decrease
when the transmission probability is greater than 0.8.
The maximum likelihood estimate of the transmission
probability is therefore $\hat \tau \approx 0.8$.
Finally, let us conclude by considering
the configuration $C=(1,0,0,0)$; that is
ID 206 carries a single copy of the RV
and IDs 207, 211 and 214 carry no copies.
```{r}
config5 = c("1", "0", "0", "0")
names(config5)=c("206","207","211", "214")
likehd.vec = c()
for (p in tau.seq) {
likelihood = likehd(testPed, p, config5)
likehd.vec = append(likehd.vec, likelihood)
}
plot(tau.seq, likehd.vec, type = "l", xlab = "transmission probabilities", ylab = "likelihood values",
main = "Likelihood curve when the RV configuration is C = (1,0,0,0)")
```
From the above plot, we see that the likelihood
curve reaches a peak value when the
transmission probability is approximately 0.4.
The maximum likelihood estimate of the
transmission probability in this case
is $\hat \tau \approx 0.4$.
# References
[1] Højsgaard, S. . (2012). Graphical independence networks with the `gRain` package for R. Journal of Statistical Software, 46(10), 1–26.
[2] Sinnwell JP, Therneau TM, Schaid DJ. The `kinship2` R package for pedigree data. Hum Hered. 2014;78(2):91-3. doi: 10.1159/000363105. Epub 2014 Jul 29. PMID: 25074474; PMCID: PMC4154601.