NOTE: Both [cite:@li2017quantifying] and [cite:@yakovenko2014outbreak] talk about a linelist of cases, but I haven’t been able to find this dataset. For the time being it may be necessary to use values obtained from the figure.
Week (2010) | Confirmed cases
---|---
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 1 |
6 | 1 |
7 | 0 |
8 | 2 |
9 | 3 |
10 | 4 |
11 | 4 |
12 | 5 |
13 | 20 |
14 | 26 |
15 | 59 |
16 | 67 |
17 | 59 |
18 | 77 |
19 | 73 |
20 | 50 |
21 | 11 |
22 | 8 |
23 | 1 |
24 | 1 |
25 | 0 |
26 | 1 |
27 | 0 |
28 | 0 |
29 | 0 |
30 | 0 |
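For convenience, the weekly counts above (themselves read off the figure) can be transcribed directly into a data frame so they are available for later analysis.
weekly_cases <- data.frame(
  week = 1:30,
  confirmed_cases = c(0, 0, 0, 0, 1, 1, 0, 2, 3, 4, 4, 5, 20, 26, 59, 67,
                      59, 77, 73, 50, 11, 8, 1, 1, 0, 1, 0, 0, 0, 0)
)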
NOTE: it looks like the following code for downloading the sequences and putting the calendar dates on them may not be necessary. The supplementary materials of [cite:@li2017quantifying] have a file with the sequences already date stamped and aligned.
NOTE: [cite:@yakovenko2006antigenic] says “Multiple alignments of these sequences were carried out with the program CLUSTAL W, version 1.83.”
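If that supplementary file is used instead, something along the following lines should be enough. This is only a sketch: the file name and the assumption that each tip label ends in a calendar date are hypothetical and need to be adjusted to the actual supplementary file.
library(ape)
library(stringr)

## Hypothetical file name; replace with the actual alignment distributed in
## the supplementary materials of li2017quantifying.
supp_alignment_file <- "li2017-supplementary-alignment.fasta"
supp_alignment <- read.dna(supp_alignment_file, format = "fasta")

## Assumes each tip label carries its date after the final underscore,
## e.g. "KC880400_13-05-2010"; adjust the pattern to the real labels.
tip_dates <- str_extract(labels(supp_alignment), "[^_]+$")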
Load some libraries that facilitate downloading and munging the data.
library(rentrez)
library(ape)
library(xml2)
library(purrr)
library(stringr)
library(lubridate)
library(ggplot2)  # used below to plot the root-to-tip regression
## Fetch records from Entrez and return them as an xml2 document. With
## parsed = TRUE, entrez_fetch() returns an object parsed by the XML
## package, so we write it to a temporary file and read it back with xml2.
fetch_as_xml <- function(...) {
  temp_file <- tempfile()
  on.exit(file.remove(temp_file))
  entrez_obj <- entrez_fetch(
    ...,
    rettype = "xml",
    parsed = TRUE
  )
  XML::saveXML(entrez_obj, file = temp_file)
  return(xml2::read_xml(temp_file))
}
We define a couple of file names for the results so that everything is in one place and easy to manage. The accession numbers are given in the paper, but since they need to be hard-coded, we record them here as well.
fasta_file <- "tajikistan-poliomyelitis.fasta"
entrez_xml <- "tajikistan-poliomyelitis.xml"

tajikistan_accession_numbers <-
  seq.int(from = 880365, to = 880521)

accession_numbers <- c(
  seq.int(from = 800662, to = 800683),
  seq.int(from = 812248, to = 812257),
  tajikistan_accession_numbers
)
Download all of the data as an XML object and save this.
seqs_xml <- fetch_as_xml(
  db = "nucleotide",
  id = sprintf("KC%d", accession_numbers)
)
write_xml(x = seqs_xml, file = entrez_xml)
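Because the XML is saved to disk, later runs can read the saved file back in instead of querying Entrez again.
seqs_xml <- read_xml(entrez_xml)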
We can then extract the identifying information for each sequence, along with its collection date, directly from the XML.
## Pull out the collection-date qualifier for each record, along with the
## primary accession and the sequence itself.
qualifiers <- xml_find_all(
  seqs_xml, "//GBQualifier[GBQualifier_name[text()='collection_date']]"
)
collection_dates <-
  xml_find_first(qualifiers, "./GBQualifier_value") |>
  xml_text()

accession_texts <-
  xml_find_all(seqs_xml, "//GBSeq/GBSeq_primary-accession") |>
  xml_text()

sequences <-
  xml_find_all(seqs_xml, "//GBSeq_sequence") |>
  xml_text()
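As an optional sanity check, confirm that the dates, accessions, and sequences line up one-to-one before masking.
stopifnot(
  length(collection_dates) == length(accession_texts),
  length(accession_texts) == length(sequences)
)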
## Convert each sequence string into a character vector of bases and then
## into a DNAbin object.
seqs_dnabin <-
  sequences |>
  map(str_split, "") |>
  map(unlist) |>
  as.DNAbin()

## Keep only the sequences with a complete day-month-year collection date
## and label them as <accession>_<dd-mm-YYYY> so the dates travel with the
## sequences.
date_mask <- str_detect(collection_dates, "[0-9]+-[A-Za-z]+-[0-9]+")
seqs_dnabin <- seqs_dnabin[date_mask]
new_names <- format(
  dmy(collection_dates[date_mask]),
  "%d-%m-%Y"
)
names(seqs_dnabin) <- str_c(accession_texts[date_mask], new_names, sep = "_")
write.dna(seqs_dnabin, file = fasta_file, format = "fasta")
After aligning the sequences (for example with CLUSTAL W, as in [cite:@yakovenko2006antigenic]) and saving the alignment as test-alignment.fasta, we read it back in and build a neighbour-joining tree from pairwise K80 distances.
alignment_dna <- read.dna("test-alignment.fasta", format = "fasta")
dist_matrix <- dist.dna(alignment_dna, model = "K80", pairwise.deletion = TRUE)
nj_phylo <- nj(dist_matrix)
write.nexus(nj_phylo, file = "test-nj-tree.nexus")
Load the tree into TempEst and use it to root the tree, then export the data to tempest-data.txt. Warning: when parsing the tip dates in TempEst, use an upper-case 'M' for the month, because the lower-case 'm' is interpreted as minutes.
tempest_df <- read.table("tempest-data.txt", header = TRUE, sep = "\t")

tempest_gg <-
  ggplot(data = tempest_df,
         mapping = aes(x = date, y = distance)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  scale_x_continuous(limits = c(2009.7, 2011)) +
  scale_y_continuous(limits = c(0, 0.015)) +
  theme_bw()
print(tempest_gg)
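The slope of this root-to-tip regression is a rough estimate of the substitution rate, and its x-intercept a rough estimate of the date of the most recent common ancestor. A minimal sketch of that calculation, assuming the exported table uses the date and distance columns as above:
## Regression of root-to-tip distance on sampling date (decimal years),
## mirroring the line drawn by geom_smooth() above.
tempest_lm <- lm(distance ~ date, data = tempest_df)

## Slope: substitutions per site per year; x-intercept: approximate date of
## the most recent common ancestor.
rate_estimate <- coef(tempest_lm)[["date"]]
tmrca_estimate <- -coef(tempest_lm)[["(Intercept)"]] / rate_estimate
print(c(rate = rate_estimate, tmrca = tmrca_estimate))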