new vignettes

veltenlab · Dec 17, 2020 · 4d71496 · 4d71496
1 parent c7a7d70
commit 4d71496
Show file tree

Hide file tree

Showing 27 changed files with 246 additions and 613 deletions.
diff --git a/inst/doc/calling.Rmd b/inst/doc/calling.Rmd
@@ -27,6 +27,8 @@ The function `mutationCallsFromBlacklist` performs coverage-based filtering and
 
 *   `min.num.samples` specifies the minimum number of mutant cells required for a variant to be kept.
 
+As in the case of cohort mutation calling, the choice of these parameters is quite critical. Please see the funal part of the [Variant calling and blacklist creation (using cohort data)](callingCohort.html) vignette for instructions how to set the parameters and/or perform sensibility checks on the results.
+
 ```{r example,fig.width = 6, fig.height=4}
 library(mitoClone)
 library(pheatmap)
@@ -64,7 +66,7 @@ orig.anno <- data.frame(row.names = names(cutree(clustered.hand$tree_col, k=12))
 ```
 
 
-Then, we call mitochondrial variants de novo and cluster using the `muta_cluster` function, as described in more detail in the [Computation of phylogenetic trees and clustering of mutations](clustering.html) vignette.
+Then, we call mitochondrial variants de novo and cluster using the `muta_cluster` function, as described in more detail in the [Computation of clonal hierarchies and clustering of mutations](clustering.html) vignette.
 
 Of note, in this dataset there is quite a large number of variants present in all cells but at low heteroplasmy, justifying the use of a different `universal.var.cells` parameter.
 

diff --git a/inst/doc/calling.html b/inst/doc/calling.html
diff --git a/inst/doc/callingCohort.R b/inst/doc/callingCohort.R
@@ -16,3 +16,34 @@ result <- mutationCallsFromCohort(countTables, patient)
 print(colnames(result$P342@M))
 print(colnames(result$HRK@M))
 
+## ----get_association, fig.width=6,fig.height=4--------------------------------
+#retrieve the binary mutation status for nuclear and mitochondrial genomic mutation from read count tables (corresponds to result$P342 , but additionally contains nuclear mutation calls)
+mutations <- mutationCallsFromMatrix(as.matrix(M_P1), as.matrix(N_P1))
+
+
+getasso <- function(m1,m2) {
+      testm <- matrix( data = c(
+        sum(m1 =="1" &m2 == "1"),
+        sum(m1 =="1" &m2 == "0"),
+        sum(m1 == "0" & m2 == "1"),
+        sum(m1 == "0" & m2 == "0")
+      ), ncol = 2  )
+      fisher.test(testm)
+}
+
+
+#run association tests between binary mitochondrial mutation status and binary SRSF2 mutation status
+mitochondrial <- colnames(mutations@ternary)[grep("^X\\d+",colnames(mutations@ternary))]
+tests <- apply(mutations@ternary[,mitochondrial], 2, getasso, m2 = mutations@ternary[,"SRSF2"])
+
+p.val <- sapply(tests, "[[", "p.value")
+est <- sapply(tests, "[[", "estimate")
+plf <- data.frame(name = mitochondrial,
+                  pval = -log10(p.val),
+                  or = est)
+
+library(ggplot2)
+qplot(x = name , y = pval, color = paste0(or <1, pval > 1), data=na.omit(plf)) + coord_flip() + theme_bw()+ theme(panel.grid = element_blank(), axis.title.x = element_text(color="black"), axis.text = element_text(color="black"), axis.text.y = element_text(size = 6)) + xlab("") + ylab("Association w/ SRSF2 mutation (-log10 p)")+ scale_color_manual(values = c("TRUETRUE" = "blue","FALSETRUE" = "red", "FALSEFALSE" = "grey", "TRUEFALSE" = "grey"), guide=F)
+
+
+
diff --git a/inst/doc/callingCohort.Rmd b/inst/doc/callingCohort.Rmd
@@ -44,4 +44,48 @@ print(colnames(result$P342@M))
 print(colnames(result$HRK@M))
 ```
 
-The result for patient P1 (`result$P342`) an P2 (`result$HRK`) serves as input into the vignette [Computation of phylogenetic trees and clustering of mutations](clustering.html). The blacklist (`result$blacklist`) serves as input into the vignette [Variant calling and validation](calling.html). We have added regions based on a conservative set of false-positive variants from our datsets and further trinucleotide repeats and softmased regions to this blacklist, since we observed that those tend to contain false positives as well. See the `blacklist` object provided as part of the package.
+The result for patient P1 (`result$P342`) an P2 (`result$HRK`) serves as input into the vignette [Computation of clonal hierarchies and clustering of mutations](clustering.html). The blacklist (`result$blacklist`) serves as input into the vignette [Variant calling and validation](calling.html). We have added regions based on a conservative set of false-positive variants from our datsets and further trinucleotide repeats and softmased regions to this blacklist, since we observed that those tend to contain false positives as well. See the `blacklist` object provided as part of the package.
+
+## Notes on parameter choice
+
+Some notes about possible parameters in the `mutationCallsFromCohort`: By default, we select coordinates in the mitochondrial genome containing at least `MINREADS=5` reads each in at least `MINCELL=20` cells. To distinguish RNA editing events and true mitochondrial mutations, we then identify mitochondrial variants that occur in several individuals. For this purpose only, individual cells are therefore called as 'mutant' in a given genomic site if at least `MINFRAC=10%` of the reads from that cell were from a minor allele (i.e. distinct from the reference). Mutations present in at least `MINRELATIVE.PATIENT=1%` of cells in a given patient, but no more than `MINCELLS.PATIENT=10` cells in any other individual, are then included into the final dataset and counts supporting the reference and mutant alleles are computed as for sites of interest in the nuclear genome. 
+
+Depending on the choice of parameters here, sites with low frequencies of RNA editing or low levels of heteroplasmy in all cells can be included and make the dataset more noisy, in particular if the `MINFRAC` parameter is set too low. On the other hand, a too stringent filtering can cause a loss of potentially informative sites. Hence, we recommend to rather begin with parameters that yield a larger list of sites and then perform the following checks to, manually or through the use of altered filtering parameters, remove erroneous sites:
+
+* Use the `quick_cluster` and the `muta_cluster` functions to check if any potential mutations are randomly distributed and do not follow a hierarchy.
+
+* If data on genomic mutations is available, perform association tests and prioritize mitochondrial variants that are significantly associated with nuclear variants, for example for patient P1 (P342):
+
+```{r get_association, fig.width=6,fig.height=4}
+#retrieve the binary mutation status for nuclear and mitochondrial genomic mutation from read count tables (corresponds to result$P342 , but additionally contains nuclear mutation calls)
+mutations <- mutationCallsFromMatrix(as.matrix(M_P1), as.matrix(N_P1))
+
+
+getasso <- function(m1,m2) {
+      testm <- matrix( data = c(
+        sum(m1 =="1" &m2 == "1"),
+        sum(m1 =="1" &m2 == "0"),
+        sum(m1 == "0" & m2 == "1"),
+        sum(m1 == "0" & m2 == "0")
+      ), ncol = 2  )
+      fisher.test(testm)
+}
+
+
+#run association tests between binary mitochondrial mutation status and binary SRSF2 mutation status
+mitochondrial <- colnames(mutations@ternary)[grep("^X\\d+",colnames(mutations@ternary))]
+tests <- apply(mutations@ternary[,mitochondrial], 2, getasso, m2 = mutations@ternary[,"SRSF2"])
+
+p.val <- sapply(tests, "[[", "p.value")
+est <- sapply(tests, "[[", "estimate")
+plf <- data.frame(name = mitochondrial,
+                  pval = -log10(p.val),
+                  or = est)
+
+library(ggplot2)
+qplot(x = name , y = pval, color = paste0(or <1, pval > 1), data=na.omit(plf)) + coord_flip() + theme_bw()+ theme(panel.grid = element_blank(), axis.title.x = element_text(color="black"), axis.text = element_text(color="black"), axis.text.y = element_text(size = 6)) + xlab("") + ylab("Association w/ SRSF2 mutation (-log10 p)")+ scale_color_manual(values = c("TRUETRUE" = "blue","FALSETRUE" = "red", "FALSEFALSE" = "grey", "TRUEFALSE" = "grey"), guide=F)
+
+
+```
+
+* Finally, some datasets contain cell types that are known to be clonally distinct; in leukemia datasets for example, CD3+ T cells are usually not from the same clone as leukemic cells. Hence, it may make sense to perform association tests between mutational status and cell type to prioritize variants that follow the expected pattern.
diff --git a/inst/doc/callingCohort.html b/inst/doc/callingCohort.html
diff --git a/inst/doc/callingNoBin.html b/inst/doc/callingNoBin.html
diff --git a/inst/doc/clustering.R b/inst/doc/clustering.R
@@ -22,14 +22,23 @@ plotTree(P1, file = "P1.ps")
 plotTree(P2, file = "P2.ps")
 
 ## ----clusterClonesP1, fig.width=8,fig.height=6--------------------------------
-P1 <- clusterMetaclones(P1)
+P1 <- clusterMetaclones(P1, min.lik =1)
 
 ## ----clusterClonesP2, fig.width=8,fig.height=6--------------------------------
-P2 <- clusterMetaclones(P2)
+P2 <- clusterMetaclones(P2, min.lik =1)
 
 ## ----plotClonesP1, fig.width=8,fig.height=6-----------------------------------
 plotClones(P1)
 
 ## ----plotClonesP2, fig.width=8,fig.height=6-----------------------------------
 plotClones(P2)
 
+## ----getmut2clone-------------------------------------------------------------
+m2c <- getMut2Clone(P1)
+print(m2c)
+
+## ----getmut2clone2, fig.width=8,fig.height=6----------------------------------
+m2c[c("X2537GA","X14462GA")] <- as.integer(6)
+P1.new <- overwriteMetaclones(P1, m2c)
+plotClones(P1.new)
+
diff --git a/inst/doc/clustering.Rmd b/inst/doc/clustering.Rmd
@@ -1,5 +1,5 @@
 ---
-title: "Computation of phylogenetic trees and clustering of mutations"
+title: "Computation of clonal hierarchies and clustering of mutations"
 output: rmarkdown::html_vignette
 vignette: >
   %\VignetteIndexEntry{Computation of phylogenetic trees and clustering of mutations}
@@ -49,17 +49,20 @@ plotTree(P2, file = "P2.ps")
 
 ## Identify clones and assign cells to clones
 
-In many cases, the order of the leaves on these trees is arbitrary, because mutations systematically co-occur. We  therefore cluster the mutations into clones. In detail, we take every every branch on the tree and then shuffle the order of mutations in that branch while re-calculating the likelihood. If swapping nodes leads to small changes in the likelihood, these nodes are then merged into a "clone". The parameter `min.lik` controls the merging and can be adjusted after inspection of the dendrograms. This step also assigns each cell to the most likely clone, and provides an estimate of the likelihood. The `help(mutationCalls)` for more info on how these results are stored.
+In many cases, the order of the leaves on these trees is arbitrary, because mutations systematically co-occur. We  therefore cluster the mutations into clones. In detail, we take every every branch on the tree and then shuffle the order of mutations in that branch while re-calculating the likelihood. If swapping nodes leads to small changes in the likelihood, these nodes are then merged into a "clone". The parameter `min.lik` that controls the merging is set arbitrarily, see below for more information. 
 
 ```{r clusterClonesP1, fig.width=8,fig.height=6}
-P1 <- clusterMetaclones(P1)
+P1 <- clusterMetaclones(P1, min.lik =1)
 ```
 
 
 ```{r clusterClonesP2, fig.width=8,fig.height=6}
-P2 <- clusterMetaclones(P2)
+P2 <- clusterMetaclones(P2, min.lik =1)
 ```
 
+
+This step also assigns each cell to the most likely clone, and provides an estimate of the likelihood. The `help(mutationCalls)` for more info on how these results are stored.
+
 Finally, the clustering can be plotted.
 
 ```{r plotClonesP1, fig.width=8,fig.height=6}
@@ -71,3 +74,23 @@ plotClones(P1)
 plotClones(P2)
 ```
 
+## Parameter choice
+
+The parameter `min.lik` that controls the merging is set arbitrarily. In practice, the goal of these analyses is to group mutations into clones for subsequent analyses (such as differential expression analyses) and it may make sense to overwrite the result of `clusterMetaclones` manually; for example, if a subclone defned on a mitochondrial mutation only should be treated as part of a more clearly defined upstream clone for differential expression analysis.
+
+To overwrite the result of `clusterMetaclones`, first retrieve the assignment of mutations to clones:
+
+```{r getmut2clone}
+m2c <- getMut2Clone(P1)
+print(m2c)
+```
+
+To e.g. treat the mt:2537G>A and mt:14462:G>A mutations as a subclone distinct from CEBPA, we can assign a new clonal identity to them while respecting the hierarchy:
+
+```{r getmut2clone2, fig.width=8,fig.height=6}
+m2c[c("X2537GA","X14462GA")] <- as.integer(6)
+P1.new <- overwriteMetaclones(P1, m2c)
+plotClones(P1.new)
+```
+
+
diff --git a/inst/doc/clustering.html b/inst/doc/clustering.html
diff --git a/inst/doc/denovo.Rmd b/inst/doc/denovo.Rmd
@@ -11,7 +11,7 @@ vignette: >
 knitr::opts_chunk$set(echo = TRUE)
 ```
 
-Once you have identified clones from mitochondrial and known nuclear somatic variants by following the vignette for [Computation of phylogenetic trees and clustering of mutation](clustering.html), you can attempt to identify new mutations from expressed nuclear genes by following this vignette.
+Once you have identified clones from mitochondrial and known nuclear somatic variants by following the vignette for [Computation of clonal hierarchies and clustering of mutations](clustering.html), you can attempt to identify new mutations from expressed nuclear genes by following this vignette.
 
 ## Before you begin
 
@@ -75,7 +75,7 @@ likeli.clas <- function(params, data, clones) {
 }
 ```
 
-The clonal identity of each cell is obtained by following the [Computation of phylogenetic trees and clustering of mutation](clustering.html) vignette for P1.
+The clonal identity of each cell is obtained by following the [Computation of clonal hierarchies and clustering of mutation](clustering.html) vignette for P1.
 
 ```{r getclone, eval=F}
 clone <- apply(P1@mainClone,1,which.max)

diff --git a/inst/doc/denovo.html b/inst/doc/denovo.html
@@ -192,7 +192,7 @@ <h1 class="title toc-ignore">De novo identification of nuclear variants</h1>
 </div>
 
 
-<p>Once you have identified clones from mitochondrial and known nuclear somatic variants by following the vignette for <a href="clustering.html">Computation of phylogenetic trees and clustering of mutation</a>, you can attempt to identify new mutations from expressed nuclear genes by following this vignette.</p>
+<p>Once you have identified clones from mitochondrial and known nuclear somatic variants by following the vignette for <a href="clustering.html">Computation of clonal hierarchies and clustering of mutations</a>, you can attempt to identify new mutations from expressed nuclear genes by following this vignette.</p>
 <div id="before-you-begin" class="section level2">
 <h2>Before you begin</h2>
 <p>COSMIC is used to restrict the hypothesis space. The first step is therefore to download COSMIC following the instructions provided on <a href="https://cancer.sanger.ac.uk/cosmic/download" class="uri">https://cancer.sanger.ac.uk/cosmic/download</a></p>
@@ -238,7 +238,7 @@ <h2>Hypothesis testing</h2>
   out
 
 }</code></pre>
-<p>The clonal identity of each cell is obtained by following the <a href="clustering.html">Computation of phylogenetic trees and clustering of mutation</a> vignette for P1.</p>
+<p>The clonal identity of each cell is obtained by following the <a href="clustering.html">Computation of clonal hierarchies and clustering of mutation</a> vignette for P1.</p>
 <pre class="r"><code>clone &lt;- apply(P1@mainClone,1,which.max)</code></pre>
 <p>Finally we define a function that takes the allele count tables as input, performs some further filtering, and then computes the maximum likelihoods of both models by numerical optimization. Again, this takes time and could be parallelized.</p>
 <pre class="r"><code>getAIC &lt;- function(counts, clones) {

diff --git a/man/baseCountsFromBamList.Rd b/man/baseCountsFromBamList.Rd
diff --git a/man/baseCountsFromSingleBam.Rd b/man/baseCountsFromSingleBam.Rd
diff --git a/man/blacklists.Rd b/man/blacklists.Rd
diff --git a/man/clusterMetaclones.Rd b/man/clusterMetaclones.Rd
diff --git a/man/getCloneLikelihood.Rd b/man/getCloneLikelihood.Rd