-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
372 lines (288 loc) · 22.4 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
---
title: "README"
author: "Rasmus Kirkegaard"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(scipen = 10)
```
# R10.4.1 Zymo HMW basecalling
With the release of R10.4.1 I wanted to check the quality of the raw reads and the consensus accuracy to ensure that we had the right approach for continuing our nanopore only assemblies from metagenomic samples ([Mantas et al. 2022](https://www.nature.com/articles/s41592-022-01539-7)). So we sequenced the Zymo HMW mock DNA to evaluate the quality of the raw reads but more importantly the assembled genomes. With the introduction of [dorado](https://github.com/nanoporetech/dorado) as a basecaller that should replace guppy it was also interesting to see how well that performed.
## Aim
1. Check if nanopore R10.4.1 is also capable of generating nanopore only assemblies.
2. Evaluate the impact of the different basecalling models on raw read accuracy
3. Evaluate the impact of the different basecalling models on consensus accuracy
4. Evaluate if the introduction of 5khz sampling would allow for a move from SUP basecalling to HAC basecalling to speed up basecalling
5. Check out duplex basecalling
## Conclusion
1. Yes!!! The obtained indel and mismatch rates are very low already around 25X coverage. At ~0.1 /100 Kbp a 5 Mbp genome will have 5 "errors" (Could still be the references at this point but likely does not matter) so it is unlikely to affect gene calling and most downstream analysis.
2. There is a huge difference between fast, hac and sup for raw read accuracy. However, the time needed for compute is also much much higher for the super accuracy model.
3. There seems to be a clear improvement in consensus accuracy by using super accuracy reads. Even providing higher coverage for fast or hac accuracy reads do not seem to fix this indicating that some of the errors fixed by sup(er accuracy) are indeed systematic errors in fast and hac. So super accuracy mode is the way to go if the aim is to generate reference quality genomes despite the additional need for GPU compute.
4. HAC data shows a serious improvement on raw read accuracy compared to 4khz sampling. However, it is still inferior to the 4khz SUP data on this metric. For assemblies the consensus is on par with 4 khz sup for some organisms but generally slightly worse. For mismatches there is a similar pattern with some organisms being equally good and others where HAC is simply worse.
5. Duplex reads really represent a step change in single read accuracy
I think it is work taking a moment to think about how amazing it is that we now have several independent methods (sequencing tech+assembler) that can reproduce a bacterial genome of e.g. P aeruginosa (6832199 bp) with error rates of less than 0.01 per 100 kbp <2 total errors.
Ending 2023 with nanopore data improving from their single strand data (simplex) barely hitting their Q20 goal to going well beyond it and when reading both stands (duplex) even racing beyond Q30 is quite amazing. Lots of innovation on run conditions and basecalling models. The promise of [Q30 single strand reads](https://x.com/iiSeymour/status/1732542416320643164?s=20) in 2024 does not seem too far fetched.
## Data availability
### 4 khz data
The data has been added to the NCBI-SRA [PRJNA934154](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA934154). I have managed to upload the fastq and the fast5 files. The fastq and fast5 files should be available through the "cloud delivery service".
Fastq data (fast,hac & Sup):
[SRR23563655](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR23563655&display=data-access)
Fast5 data:
[SRR23437037](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR23437037&display=data-access)
### 5 khz data
Fastq data:
- fast: [SRR24893246](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR24893246&display=data-access)
- hac: [SRR24893245](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR24893245&display=data-access)
- sup: [SRR24893244](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR24893244&display=data-access)
Pod5 data:
The data has been added to the ENA [PRJEB64570](https://www.ebi.ac.uk/ena/browser/view/PRJEB64570). (NCBI did not accept pod5 and told me they do not want to do that)
### 5 khz high duplex data
The data has been added to the ENA [PRJEB65462](https://www.ebi.ac.uk/ena/browser/view/PRJEB65462)
Fastq data:
- sup: [ERR11901474](https://www.ebi.ac.uk/ena/browser/view/ERR11901474)
- sup duplex: [ERR11901475](https://www.ebi.ac.uk/ena/browser/view/ERR11901475)
Pod5 data:
- [ERR11924124](https://www.ebi.ac.uk/ena/browser/view/ERR11924124)
```{r load_libraries_and_data,echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(gridExtra)
tt2 <- ttheme_default(core=list(fg_params=list(hjust=1, x = 0.95, fontsize = 10)),
colhead=list(fg_params=list(fontsize = 12)))
# duplex read QC new ref
get_read_QC<-function(model="[email protected]") {
files<-dir(pattern = paste0(model,"_ref_E.coli_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
NP_QC_ref_Escherichia_coli <- files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Escherichia_coli")
files<-dir(pattern = paste0(model,"_ref_L.monocytogenes_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
NP_QC_ref_Listeria_monocytogenes<-files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Listeria_monocytogenes")
files<-dir(pattern = paste0(model,"_ref_P.aeruginosa_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
NP_QC_ref_Pseudomonas_aeruginosa<- files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Pseudomonas_aeruginosa")
files<-dir(pattern = paste0(model,"_ref_B.subtilis_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
NP_QC_ref_Bacillus_subtilis<- files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Bacillus_subtilis")
files<-dir(pattern = paste0(model,"_ref_S.enterica_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
NP_QC_ref_Salmonella_enterica<-files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Salmonella_enterica")
files<-dir(pattern = paste0(model,"_ref_E.faecalis_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
NP_QC_ref_Enterococcus_faecalis<-files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Enterococcus_faecalis")
files<-dir(pattern = paste0(model,"_ref_S.aureus_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
NP_QC_ref_Staphylococcus_aureus<-files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Staphylococcus_aureus")
NP_QC_combined<-bind_rows(NP_QC_ref_Bacillus_subtilis,
NP_QC_ref_Enterococcus_faecalis,
NP_QC_ref_Escherichia_coli,
NP_QC_ref_Listeria_monocytogenes,
NP_QC_ref_Pseudomonas_aeruginosa,
NP_QC_ref_Salmonella_enterica,
NP_QC_ref_Staphylococcus_aureus) %>%
mutate(model=model)
return(NP_QC_combined)
}
NP_QC_fast_4.0.0<-get_read_QC(model = "[email protected]")
NP_QC_hac_4.0.0<-get_read_QC(model = "[email protected]")
NP_QC_sup_4.0.0<-get_read_QC(model = "[email protected]")
NP_QC_fast_4.1.0<-get_read_QC(model = "[email protected]")
NP_QC_hac_4.1.0<-get_read_QC(model = "[email protected]")
NP_QC_sup_4.1.0<-get_read_QC(model = "[email protected]")
NP_QC_supdup_4.1.0<-get_read_QC(model = "[email protected]")
NP_QC_fast_4.2.0<-get_read_QC(model = "[email protected]")
NP_QC_hac_4.2.0<-get_read_QC(model = "[email protected]")
NP_QC_sup_4.2.0<-get_read_QC(model = "[email protected]")
NP_QC_supdup_4.2.0<-get_read_QC(model = "[email protected]")
NP_QC_fast_4.3.0<-get_read_QC(model = "[email protected]")
NP_QC_hac_4.3.0<-get_read_QC(model = "[email protected]")
NP_QC_hacdup_4.3.0<-get_read_QC(model = "[email protected]")
NP_QC_sup_4.3.0<-get_read_QC(model = "[email protected]")
NP_QC_supdup_4.3.0<-get_read_QC(model = "[email protected]")
NP_QC_combined<-bind_rows(NP_QC_fast_4.0.0,
NP_QC_hac_4.0.0,
NP_QC_sup_4.0.0,
NP_QC_fast_4.1.0,
NP_QC_hac_4.1.0,
NP_QC_sup_4.1.0,
NP_QC_supdup_4.1.0,
NP_QC_fast_4.2.0,
NP_QC_hac_4.2.0,
NP_QC_sup_4.2.0,
NP_QC_supdup_4.2.0,
NP_QC_fast_4.3.0,
NP_QC_hac_4.3.0,
NP_QC_hacdup_4.3.0,
NP_QC_sup_4.3.0,
NP_QC_supdup_4.3.0) %>%
mutate(model_version=gsub(x = model,pattern=".*@v(.*)",replacement = "\\1"),
basecalling_model=gsub(x = model,pattern=".*bps_(.*)@v.*",replacement = "\\1"))
### read coverage
draft_assemblies<-list.files(path = "results/",pattern = "*.assembly_info.txt",full.names = T)
cov_tab<-tibble()
for (f in draft_assemblies) {
cov_temp<-read_tsv(file = f) %>% mutate(draft=f)
cov_tab<-bind_rows(cov_tab,cov_temp)
}
cov_tab<-cov_tab %>% mutate(contigName=`#seq_name`,coverage=cov.,circular=circ.,
ReadSet=str_replace(string =draft,pattern = ".*(dna.*).assembly_info.txt",replacement = "\\1")) %>%
select(ReadSet,contigName,coverage,circular)
### Assembly based stats
fastani<-read_tsv(file = "results/fastani.tsv",col_names = c("query","ref","ANI","aligned_segments","total_segments")) %>%
mutate(Assembly=str_replace(string = query,pattern = "results/bins/(.*).fa.*",replacement = "\\1"),
ref=str_replace(string = ref,pattern = ".*/(.*).fasta",replacement = "\\1")) %>%
select(Assembly,ref,ANI)
ref_genomes=list.files(path = "results/",pattern = "quast_.*.tsv",full.names = T)
# Load quast data
quast_tab<-tibble()
for (f in ref_genomes) {
quast_temp<-read_tsv(file = f) %>% mutate(ref=f) %>% filter(`Total aligned length`!="-")
quast_tab<-bind_rows(quast_tab,quast_temp)
}
quast_tab<-quast_tab %>% mutate(ref=str_replace(string = ref,pattern = ".*quast_(.*).tsv",replacement = "\\1"),
Assembly=str_trim(Assembly)) %>% filter(as.numeric(`Total aligned length`)>10^6) %>%
select(Assembly,ref,`Total length`,`Reference length`,`Largest alignment`,`Total aligned length`,`# indels per 100 kbp`,`# mismatches per 100 kbp`) %>%
mutate(`Total aligned length`=as.numeric(`Total aligned length`),
`# indels per 100 kbp`=as.numeric(`# indels per 100 kbp`),
`# mismatches per 100 kbp`=as.numeric(`# mismatches per 100 kbp`)) %>%
filter(Assembly!=ref)
###
genome_stats<-fastani %>% right_join(quast_tab,by = c("Assembly", "ref")) %>%
mutate(contigName=str_replace(string = Assembly,pattern = ".*(contig.*)",replacement = "\\1"),
ReadSet=str_replace(string = Assembly,pattern = "(.*).flye.*",replacement = "\\1")) %>%
filter(ANI>95) %>%
left_join(cov_tab,by = c("contigName", "ReadSet")) %>%
mutate(asmtype=str_replace(string = Assembly,pattern = ".*(fly.*).contig.*",replacement = "\\1")) %>%
mutate(model=str_replace(ReadSet,pattern = "-.*",replacement = "")) %>%
mutate(ref=str_replace(ref,pattern = "_hifiasm",replacement = "")) %>%
filter(ref!="S.cerevisiae") %>%
mutate(model_version=gsub(x = model,pattern=".*@v(.*)",replacement = "\\1"),
basecalling_model=gsub(x = model,pattern=".*bps_(.*)@v.*",replacement = "\\1"))
```
## NP reads mapped to the refs overall (percent identity)
```{r fig.width=10,fig.height=7,echo=FALSE,warning=FALSE,message=FALSE}
NP_QC_combined %>% ggplot(aes(x = percentIdentity,col=basecalling_model))+geom_density()+geom_vline(xintercept = 99,col="red",linetype="dashed")+scale_x_continuous(breaks = 85:100)+coord_cartesian(xlim=c(85, 100))+xlab("Percent identity")+facet_wrap(~model_version,ncol = 1,scales = "free_y")
```
## Longest perfect read
```{r fig.width=10,fig.height=30,echo=FALSE,warning=FALSE,message=FALSE}
NP_QC_combined %>% group_by(basecalling_model,model_version,REF) %>% filter(percentIdentity==100) %>% filter(aligned_lengths==max(aligned_lengths)) %>%
ggplot(aes(x=lengths,y=model))+geom_point()+facet_wrap(~REF,ncol = 1)
```
## NP reads mapped to the refs overall (phred scale)
Phred scores for perfect matching reads are calculated as recommended by [Armin Topfer](https://twitter.com/kirk3gaard/status/1397457000217423873) which takes length into account by adjusting the percent identity value 100*(1-1/(length+1)). This implies that a perfect read of 100 bp will achieve a phred score of ~20, while 1000 bp caps out at ~30 etc. So forget about your shor reads being Q40.
```{r fig.width=10,fig.height=7,echo=FALSE,warning=FALSE,message=FALSE}
PHREDNORMPERFECT=function(length=10) {
adj_percentidentity<-100*(1-1/(length+1))
phred<--10*log10((100-adj_percentidentity)/100)
return(phred)
}
NP_QC_combined %>%
mutate(Phred=case_when(percentIdentity==100~PHREDNORMPERFECT(length=lengths),
TRUE~-10*log10((100-percentIdentity)/100))) %>%
ggplot(aes(x = Phred,col=basecalling_model))+geom_density()+geom_vline(xintercept = 20,col="red",linetype="dashed")+xlab("Phred score (Q)")+facet_wrap(~model_version,ncol = 1)
```
## Fast mode
### Indel rate vs coverage
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model=="fast") %>%
ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
### Mismatch rate vs coverage
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model=="fast") %>%
ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
## HAC mode
### Indel rate vs coverage
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model=="hac") %>%
ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
### Mismatch rate vs coverage
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model=="hac") %>%
ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
## SUP mode
### Indel rate vs coverage
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model=="sup") %>%
ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
### Mismatch rate vs coverage
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model=="sup") %>%
ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
## Duplex mode
### Indel rate vs coverage
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model %in% c("hacdup","supdup")) %>%
ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
### Mismatch rate vs coverage
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model %in% c("hacdup","supdup")) %>%
ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
## 4khz SUP vs 5 khz HAC & SUP
With the launch of 5 khz sampling rate around London Calling 2023 ONT was hoping that the GPU power needed for basecalling could be decreased dramatically as this should allow for the use of the faster HAC model to replace the compute intensive SUP model at 4khz. To test this we here compare the consensus assemblies with the 4 khz SUP model and the new 5khz HAC and SUP models.
### Indels
The indel rate seems to be higher with 5khz HAC than both SUP regardless of sample rate. Interestingly the 5khz sample rate with SUP performs much better than 4 khz SUP for some organisms but not S enterica.
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model!="fast") %>%
filter(model_version %in% c("4.1.0","4.2.0")) %>%
filter(!(model_version == "4.1.0" & basecalling_model=="hac")) %>%
mutate(Sample_rate_model=case_when(model_version=="4.1.0"~paste0("4khz_",basecalling_model),
model_version=="4.2.0"~paste0("5khz_",basecalling_model))) %>%
ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=Sample_rate_model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
### Mismatches
While HAC is on par with SUP for some organisms it is never the best option for mismatches.
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model!="fast") %>%
filter(model_version %in% c("4.1.0","4.2.0")) %>%
filter(!(model_version == "4.1.0" & basecalling_model=="hac")) %>%
mutate(Sample_rate_model=case_when(model_version=="4.1.0"~paste0("4khz_",basecalling_model),
model_version=="4.2.0"~paste0("5khz_",basecalling_model))) %>%
ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=Sample_rate_model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
## Super accuracy comparison model 4.2 vs 4.3
The 4.3 model includes some modifications found in bacteria and it seems to pay off. The model gives a clear improvement for the consensus accuracy across many of the genomes. Indel rates without any polishing ranging from below 0.01 to 1 per 100 kbp translates to 1-100 errors for a 10 Mbp genome. Similar levels were achieved regarding mismatches. The Bacillus showed notably poorer consensus accuracy than the others which I assume could be due to some real differences or errors in the Pacbio HiFi based references.When we are getting in the 0.01-0.1 errors per 100 kbp I assume that it gets difficult to tell whether they are actual errors without hand curating reference genomes with multiple independent technologies as per [Ryan Wicks "Perfect bacterial genome tutorial"](https://github.com/rrwick/Perfect-bacterial-genome-tutorial/wiki).
### Indels
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model %in% c("sup","supdup")) %>%
filter(asmtype=="flye") %>%
filter(model_version %in% c("4.2.0","4.3.0")) %>%
ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
### Mismatches
```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>%
filter(basecalling_model %in% c("sup","supdup")) %>%
filter(asmtype=="flye") %>%
filter(model_version %in% c("4.2.0","4.3.0")) %>%
filter(!(model_version == "4.1.0" & basecalling_model=="hac")) %>%
ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```
## Materials and methods
Here is a brief description of the tools used. For the exact commands check out the **Snakefile** in this repository ([Snakemake](https://snakemake.readthedocs.io/en/stable/) v. 7.18.2).
### DNA sequencing
DNA sample was the [Zymo Mock HMW standard](https://zymoresearch.eu/products/zymobiomics-hmw-dna-standard). The DNA was prepared for sequencing using the nanopore ligation sequencing kit (SQK-LSK114) and sequenced on a R10.4.1 nanopore promethion flowcell (FLO-PRO114M) with the "400 bp/s" mode (4khz sampling). The DNA was prepared for sequencing using the nanopore ligation sequencing kit (SQK-LSK114) and sequenced on a R10.4.1 nanopore MinION flowcell (FLO-MIN114) with the "400 bp/s" mode (5khz sampling). The DNA was prepared for sequencing using the nanopore ligation sequencing kit (SQK-LSK114) and sequenced on a R10.4.1 nanopore PromethION flowcell (FLO-PRO114HD) with the "400 bp/s" mode (5khz sampling).
### Basecalling
#### 4 khz PromethION data
The reads were basecalled using [dorado](https://github.com/nanoporetech/dorado) (v. 0.1.1) with fast, hac and sup accuracy mode using the 4.0.0 and 4.1.0 models.
#### 5 khz MinION data
The reads were basecalled using [dorado](https://github.com/nanoporetech/dorado) (v. 0.3.0) with fast, hac and sup accuracy mode using the 4.2.0 models.
#### 5 khz High Duplex PromethION data
The reads were basecalled using [dorado](https://github.com/nanoporetech/dorado) (v. 0.3.4) with fast, hac and sup accuracy mode using the 4.2.0 models. The reads were basecalled using [dorado](https://github.com/nanoporetech/dorado) (v. 0.5.0) with fast, hac and sup accuracy mode using the 4.3.0 models.
### Read QC
Reads were mapped to the updated zymo reference genomes (hopefully goes public soon) using [minimap2](https://github.com/lh3/minimap2) (v. 2.24), and QC information was obtained using [NanoPlot](https://github.com/wdecoster/NanoPlot) (v. 1.41.0).
### Assembly
The reads were subsampled using [seqtk](https://github.com/lh3/seqtk) (v. 1.3) and assembled using [flye](https://github.com/fenderglass/Flye) (v. 2.9.1). The metagenome assemblies were then polished using [medaka](https://github.com/nanoporetech/medaka) (v. 1.11.3).
### Genome quality assessment
The assembled contigs were compared to the reference contigs using [QUAST](https://github.com/ablab/quast) (v. 5.2.0) and [fastANI](https://github.com/ParBLiSS/FastANI) (v. 1.33).