-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
144 lines (90 loc) · 8.16 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/"
)
set.seed(0)
```
# Transmission fitness polymorphism scanner
This package provides an analytical pipeline for rapid variant scanning based on integration of scalable phylogenetic analysis with non-genetic epidemiological data streams. The scanning tool, `tfpscan()`, computes a test statistic for every possible bipartition of the provided phylogeny to identify relative growth rates and relative evolutionary rates of clades with the phylogeny. For every clade in a partitioned tree, matched comparison clades are selected based on time (and if specified, space), for further statistical analyses.
## tfpscanner analyses
The main analyses conducted for each clade, or node with included descendants, are as follows:
* Molecular clock outlier statistic using root to tip regression
* Simple logisitic growth rate estimate
* Generalised additive model (GAM) combined with a Gaussian process model to estimate growth over time
###### Optional analyses
Additional options in the scanning tool may be specified, including:
* (GAM) combined wtih a model of spatial correlation between neighbouring regions to estimate frequency of a clade over time and space
* Other covariate analyses, specified by the user, to run a conditional logistic regression on a specified covariate of interest
* Identification of cluster defining mutations, identified as mutations present in > x% of sequences in the specified cluster and less than x% of sequences in the comparison sample
* MLEsky analyses to estime effective population size (Ne) over time, using `get_mlesky_node()` function
## tfpscanner outputs
The output from the scanning tool includes a new output directory will be created called `"tfpscan-{Sys.Date()}"`, unless otherwise specified, along with a RDS file containing output descriptives (size, lineage, sample date) and statistics (clock outlier, logistic growth rate, GAM) for every node included and a RDS file containing the environment from the scanning tool run. For each node, logistic growth rate p-values as well as suport for logistic model vs. GAM are also reported.
Within the output directory, the scanning tool creates a further folder directory for each node included in analyses. Node specific outputs within individual node directories include seperate CSV files with the following:
* Summary statistics (clock outlier, logstic growth rate, GAM logistic growth rate, support values)
* Sequences with specific sample times, region, and further descriptors reported if included (lineage, mutations, covariates)
* Regional composition of sequences within individual nodes
* Lineage composition of sequences within individual nodes
* Co-circulating lineage composition from comparison matched sample nodes
###### Optional outputs
Additional options for node specific output directories may be specified, including:
* GAM figure over time, plotting the logistics odds of a sequences being from the specified node
* Map of GAM + neighbour joining model over space and time (in epi weeks)
* Tree figure for the node of interest, colored by lineage
* CSV file with defining mutations for cluster of interest specified
These outputs are computational expensive, substantially increasing time and space required, so are recommened for running only as required within the main `tfpscan()` run. These options can alternatively be specified with the `tfpscan_report()` function, for a single node of interest. The `tfpscan_report()` function outputs a summary report for a selected node, including primary outputs for a node of interest, as well as specified additional outputs and mlesky analyses if specified in the function options.
###### Online tree viewer
The tfpscanner package additional offers an optional online tree viewer for the whole phylogeny with linked hover function for statistics associated with each node, using the function `treeview()`. A user may specify particular mutations or lineages of interest for the tree. Mutations will be illustrated wtih a heatmap; lineages will be used to subdivide outputs in scatter plots. An example tree can be viewed at the link below, where mutations specified include S:A222V, S:N:Q9L, and S:E484K, and lineages specified include a selection of Delta lineages (AY.9, AY.43, AY.4.2).
https://www.biorxiv.org/content/10.1101/2021.01.18.427056v1.full
## Methodology
For a complete description of the statistical methodology underpinning this package, see our preprint:
`preprint-link`
## Installation
In R, install the `devtools` package and run
```
devtools::install_github('mrc-ide/tfpscanner')
```
## Input requirements
###### Phylogeny
To run the scanning tool, the user requires a phylogeny, in ape::phylo or treeio::treedata format, and associated metadata. If the phylogeny is not rooted, the user must provide an outgroup to root on wtih paramter `root_on_tip`, and outgroop sample time with paramter `root_on_tip_sample_time`.
###### Metadata
The associated metadata should be provided in CSV format, with at minimum `sequence_name`, `sample_date` (date format), and `region` included for each sample (if `NA` for any of the three variables, sample with `NA` values will be excluded from further analyses). Optional metadata variables include `sample_time` (numeric format), `mutations`, and other `covariates` of interest (e.g. `age_group` or `vaccine_status`).
###### Additional covariates
If additional covariates are included, a character vector for all variable names must be specified as paramter `test_cluster_odds`, and a vector of same length as character vector to `test_cluster_odds_value`. For example if vaccine_status, with values `c("yes", "no") was included as a covariate, the user would specify in the scanner tool as follows:
```
tfpscan(...,
test_cluster_odds = c(vaccine_status),
test_cluster_odds_value = c(1,0),
...
)
```
If no values are provided to `test_cluster_odds_value`, the covariate is assumes to be continuous (e.g. `age`). For each included covariate, the odds of a sample belonging to each cluster given this variable will be estimated using conditional logistic regression and adjusting for time.
## Documentation
See the vignettes in the R package for examples of how to use `tfpscan()`, `treeview()`, `tfpscan_report()`, and `get_mlesky_node()` functions. An example phylogeny, `"tree_2021-12-30.nwk"`, and linked metadata, `"amd_2021-12-30.nwk"`, are provided for the user to trial the scanning tool functions and outputs prior to running on their own phylogeny and metdata. A further covariate example is included with `vaccine_breakthrough` and `age_group`.
**N.B.** The more options are included in `tfpscan()`, the more computational power and time is required to run the scanner tool. In particlar, outputting tree figures and geo figures for every node is computational expensive and not recommended unless required. Alternative options include outputting tree figures and geo figures within the `tfpscan_report()` function for a selected node, rather than all nodes within a tree in the `tfpscan()` function.
###### Priority
- [x] README for git package draft
- [x] Repo for paper results figures, paper RMD
- [x] GAM speed testing: bs or cs highest speed, cs smoother figures
- [ ] Organise stored trees/outputs on google drive for easy access
- [x] Add node clock figure to final package
- [x] Add node geo figure to final package
- [x] Add node cluster_muts speed up to final package and cluster node match error fix
- [x] Add tfps_report() function to final package
- [x] Detailed output update to include geo/tree/other fig if included in run
- [x] Add metadata NA fix and root in tree fix
- [x] Add fix for CSV outputs for lineage, co-circulating, regional, mutations
- [x] Data.table update to script finalised
- [x] Defining mutations figure (node specific)
###### Final updates
- [x] Vignettes for tfpscan() generic usage
- [ ] Vignettes for covariate in tfpscan()
- [x] Vignettes for get_mlesky_node() usage
- [x] Vignettes for tfpscan_report()
- [x] Vignettes for treeview()
- [ ] Final README draft