NOTE: Both [cite:@li2017quantifying] and [cite:@yakovenko2014outbreak] talk about a linelist of cases, but I haven’t been able to find this dataset. For the time being it may be necessary to use values obtained from the figure.
Week (2010) | Confirmed cases
---|---
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 1 |
6 | 1 |
7 | 0 |
8 | 2 |
9 | 3 |
10 | 4 |
11 | 4 |
12 | 5 |
13 | 20 |
14 | 26 |
15 | 59 |
16 | 67 |
17 | 59 |
18 | 77 |
19 | 73 |
20 | 50 |
21 | 11 |
22 | 8 |
23 | 1 |
24 | 1 |
25 | 0 |
26 | 1 |
27 | 0 |
28 | 0 |
29 | 0 |
30 | 0 |
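For convenience, the weekly counts above (themselves read off the figure) can be transcribed directly into a data frame so they are available for later analysis.
weekly_cases <- data.frame(
  week = 1:30,
  confirmed_cases = c(0, 0, 0, 0, 1, 1, 0, 2, 3, 4, 4, 5, 20, 26, 59, 67,
                      59, 77, 73, 50, 11, 8, 1, 1, 0, 1, 0, 0, 0, 0)
)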
NOTE: it looks like the following code for downloading the sequences and putting the calendar dates on them may not be necessary. The supplementary materials of [cite:@li2017quantifying] have a file with the sequences already date stamped and aligned.
NOTE: [cite:@yakovenko2006antigenic] says “Multiple alignments of these sequences were carried out with the program CLUSTAL W, version 1.83.”
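If that supplementary file is used instead, something along the following lines should be enough. This is only a sketch: the file name and the assumption that each tip label ends in a calendar date are hypothetical and need to be adjusted to the actual supplementary file.
library(ape)
library(stringr)

## Hypothetical file name; replace with the actual alignment distributed in
## the supplementary materials of li2017quantifying.
supp_alignment_file <- "li2017-supplementary-alignment.fasta"
supp_alignment <- read.dna(supp_alignment_file, format = "fasta")

## Assumes each tip label carries its date after the final underscore,
## e.g. "KC880400_13-05-2010"; adjust the pattern to the real labels.
tip_dates <- str_extract(labels(supp_alignment), "[^_]+$")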
Load some libraries that facilitate downloading and munging the data.
library(rentrez)
library(ape)
library(xml2)
library(purrr)
library(stringr)
library(lubridate)
library(ggplot2)  # used below to plot the root-to-tip regression
## Fetch records from Entrez and return them as an xml2 document. With
## parsed = TRUE, entrez_fetch() returns an object parsed by the XML
## package, so we write it to a temporary file and read it back with xml2.
fetch_as_xml <- function(...) {
  temp_file <- tempfile()
  on.exit(file.remove(temp_file))
  entrez_obj <- entrez_fetch(
    ...,
    rettype = "xml",
    parsed = TRUE
  )
  XML::saveXML(entrez_obj, file = temp_file)
  return(xml2::read_xml(temp_file))
}
We define a couple of file names for the results so that everything is in one place and easy to manage. The accession numbers are given in the paper, but since they need to be hard-coded, we record them here as well.
fasta_file <- "tajikistan-poliomyelitis.fasta"
entrez_xml <- "tajikistan-poliomyelitis.xml"

tajikistan_accession_numbers <-
  seq.int(from = 880365, to = 880521)

accession_numbers <- c(
  seq.int(from = 800662, to = 800683),
  seq.int(from = 812248, to = 812257),
  tajikistan_accession_numbers
)
Download all of the data as an XML object and save this.
seqs_xml <- fetch_as_xml(
  db = "nucleotide",
  id = sprintf("KC%d", accession_numbers)
)
write_xml(x = seqs_xml, file = entrez_xml)
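Because the XML is saved to disk, later runs can read the saved file back in instead of querying Entrez again.
seqs_xml <- read_xml(entrez_xml)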
We can then extract the identifying information for each sequence, along with its collection date, directly from the XML.
## Pull out the collection-date qualifier for each record, along with the
## primary accession and the sequence itself.
qualifiers <- xml_find_all(
  seqs_xml, "//GBQualifier[GBQualifier_name[text()='collection_date']]"
)
collection_dates <-
  xml_find_first(qualifiers, "./GBQualifier_value") |>
  xml_text()

accession_texts <-
  xml_find_all(seqs_xml, "//GBSeq/GBSeq_primary-accession") |>
  xml_text()

sequences <-
  xml_find_all(seqs_xml, "//GBSeq_sequence") |>
  xml_text()
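As an optional sanity check, confirm that the dates, accessions, and sequences line up one-to-one before masking.
stopifnot(
  length(collection_dates) == length(accession_texts),
  length(accession_texts) == length(sequences)
)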
## Convert each sequence string into a character vector of bases and then
## into a DNAbin object.
seqs_dnabin <-
  sequences |>
  map(str_split, "") |>
  map(unlist) |>
  as.DNAbin()

## Keep only the sequences with a complete day-month-year collection date
## and label them as <accession>_<dd-mm-YYYY> so the dates travel with the
## sequences.
date_mask <- str_detect(collection_dates, "[0-9]+-[A-Za-z]+-[0-9]+")
seqs_dnabin <- seqs_dnabin[date_mask]
new_names <- format(
  dmy(collection_dates[date_mask]),
  "%d-%m-%Y"
)
names(seqs_dnabin) <- str_c(accession_texts[date_mask], new_names, sep = "_")
write.dna(seqs_dnabin, file = fasta_file, format = "fasta")
After aligning the sequences (for example with CLUSTAL W, as in [cite:@yakovenko2006antigenic]) and saving the alignment as test-alignment.fasta, we read it back in and build a neighbour-joining tree from pairwise K80 distances.
alignment_dna <- read.dna("test-alignment.fasta", format = "fasta")
dist_matrix <- dist.dna(alignment_dna, model = "K80", pairwise.deletion = TRUE)
nj_phylo <- nj(dist_matrix)
write.nexus(nj_phylo, file = "test-nj-tree.nexus")
Load the tree into TempEst and use it to root the tree, then export the data to tempest-data.txt. Warning: when parsing the tip dates in TempEst, use an upper-case 'M' for the month, because the lower-case 'm' is interpreted as minutes.
tempest_df <- read.table("tempest-data.txt", header = TRUE, sep = "\t")

tempest_gg <-
  ggplot(data = tempest_df,
         mapping = aes(x = date, y = distance)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  scale_x_continuous(limits = c(2009.7, 2011)) +
  scale_y_continuous(limits = c(0, 0.015)) +
  theme_bw()
print(tempest_gg)
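The slope of this root-to-tip regression is a rough estimate of the substitution rate, and its x-intercept a rough estimate of the date of the most recent common ancestor. A minimal sketch of that calculation, assuming the exported table uses the date and distance columns as above:
## Regression of root-to-tip distance on sampling date (decimal years),
## mirroring the line drawn by geom_smooth() above.
tempest_lm <- lm(distance ~ date, data = tempest_df)

## Slope: substitutions per site per year; x-intercept: approximate date of
## the most recent common ancestor.
rate_estimate <- coef(tempest_lm)[["date"]]
tmrca_estimate <- -coef(tempest_lm)[["(Intercept)"]] / rate_estimate
print(c(rate = rate_estimate, tmrca = tmrca_estimate))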