README.Rmd

---
title: "README"
author: "Rasmus Kirkegaard"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(scipen = 10)
```

# R10.4.1 Zymo HMW basecalling

With the release of R10.4.1 I wanted to check the quality of the raw reads and the consensus accuracy to ensure that we had the right approach for continuing our nanopore only assemblies from metagenomic samples ([Mantas et al. 2022](https://www.nature.com/articles/s41592-022-01539-7)). So we sequenced the Zymo HMW mock DNA to evaluate the quality of the raw reads but more importantly the assembled genomes. With the introduction of [dorado](https://github.com/nanoporetech/dorado) as a basecaller that should replace guppy it was also interesting to see how well that performed.

## Aim

1. Check if nanopore R10.4.1 is also capable of generating nanopore only assemblies. 
2. Evaluate the impact of the different basecalling models on raw read accuracy 
3. Evaluate the impact of the different basecalling models on consensus accuracy
4. Evaluate if the introduction of 5khz sampling would allow for a move from SUP basecalling to HAC basecalling to speed up basecalling
5. Check out duplex basecalling

## Conclusion

1. Yes!!! The obtained indel and mismatch rates are very low already around 25X coverage. At ~0.1 /100 Kbp a 5 Mbp genome will have 5 "errors" (Could still be the references at this point but likely does not matter) so it is unlikely to affect gene calling and most downstream analysis.
2. There is a huge difference between fast, hac and sup for raw read accuracy. However, the time needed for compute is also much much higher for the super accuracy model.
3. There seems to be a clear improvement in consensus accuracy by using super accuracy reads. Even providing higher coverage for fast or hac accuracy reads do not seem to fix this indicating that some of the errors fixed by sup(er accuracy) are indeed systematic errors in fast and hac. So super accuracy mode is the way to go if the aim is to generate reference quality genomes despite the additional need for GPU compute.
4. HAC data shows a serious improvement on raw read accuracy compared to 4khz sampling. However, it is still inferior to the 4khz SUP data on this metric. For assemblies the consensus is on par with 4 khz sup for some organisms but generally slightly worse. For mismatches there is a similar pattern with some organisms being equally good and others where HAC is simply worse.
5. Duplex reads really represent a step change in single read accuracy

I think it is work taking a moment to think about how amazing it is that we now have several independent methods (sequencing tech+assembler) that can reproduce a bacterial genome of e.g. P aeruginosa (6832199 bp) with error rates of less than 0.01 per 100 kbp <2 total errors. 

Ending 2023 with nanopore data improving from their single strand data (simplex) barely hitting their Q20 goal to going well beyond it and when reading both stands (duplex) even racing beyond Q30 is quite amazing. Lots of innovation on run conditions and basecalling models. The promise of [Q30 single strand reads](https://x.com/iiSeymour/status/1732542416320643164?s=20) in 2024 does not seem too far fetched. 

## Data availability

### 4 khz data
The data has been added to the NCBI-SRA [PRJNA934154](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA934154). I have managed to upload the fastq and the fast5 files. The fastq and fast5 files should be available through the "cloud delivery service".

Fastq data (fast,hac & Sup): 
[SRR23563655](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR23563655&display=data-access)

Fast5 data:
[SRR23437037](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR23437037&display=data-access)

### 5 khz data

Fastq data:
- fast: [SRR24893246](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR24893246&display=data-access)
- hac: [SRR24893245](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR24893245&display=data-access)
- sup: [SRR24893244](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR24893244&display=data-access)


Pod5 data:
The data has been added to the ENA [PRJEB64570](https://www.ebi.ac.uk/ena/browser/view/PRJEB64570). (NCBI did not accept pod5 and told me they do not want to do that)

### 5 khz high duplex data
The data has been added to the ENA [PRJEB65462](https://www.ebi.ac.uk/ena/browser/view/PRJEB65462)

Fastq data:
- sup: [ERR11901474](https://www.ebi.ac.uk/ena/browser/view/ERR11901474)
- sup duplex: [ERR11901475](https://www.ebi.ac.uk/ena/browser/view/ERR11901475)


Pod5 data:
- [ERR11924124](https://www.ebi.ac.uk/ena/browser/view/ERR11924124)

```{r load_libraries_and_data,echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(gridExtra)
tt2 <- ttheme_default(core=list(fg_params=list(hjust=1, x = 0.95, fontsize = 10)),
                      colhead=list(fg_params=list(fontsize = 12)))

# duplex read QC new ref
get_read_QC<-function(model="dna_r10.4.1_e8.2_400bps_hac@v4.0.0") {
  files<-dir(pattern = paste0(model,"_ref_E.coli_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
  NP_QC_ref_Escherichia_coli <- files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Escherichia_coli")
  files<-dir(pattern = paste0(model,"_ref_L.monocytogenes_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
  NP_QC_ref_Listeria_monocytogenes<-files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Listeria_monocytogenes")
  files<-dir(pattern = paste0(model,"_ref_P.aeruginosa_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
  NP_QC_ref_Pseudomonas_aeruginosa<- files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Pseudomonas_aeruginosa")
  files<-dir(pattern = paste0(model,"_ref_B.subtilis_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
  NP_QC_ref_Bacillus_subtilis<- files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Bacillus_subtilis")
  files<-dir(pattern = paste0(model,"_ref_S.enterica_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
  NP_QC_ref_Salmonella_enterica<-files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Salmonella_enterica")
  files<-dir(pattern = paste0(model,"_ref_E.faecalis_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
  NP_QC_ref_Enterococcus_faecalis<-files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Enterococcus_faecalis")
  files<-dir(pattern = paste0(model,"_ref_S.aureus_.*.NanoPlot-data.tsv.gz"),path = "temp/",full.names = T)
  NP_QC_ref_Staphylococcus_aureus<-files %>% map(read_tsv) %>% reduce(rbind) %>% mutate(REF="Staphylococcus_aureus")
  NP_QC_combined<-bind_rows(NP_QC_ref_Bacillus_subtilis,
                            NP_QC_ref_Enterococcus_faecalis,
                            NP_QC_ref_Escherichia_coli,
                            NP_QC_ref_Listeria_monocytogenes,
                            NP_QC_ref_Pseudomonas_aeruginosa,
                            NP_QC_ref_Salmonella_enterica,
                            NP_QC_ref_Staphylococcus_aureus) %>% 
    mutate(model=model)
return(NP_QC_combined)
}

NP_QC_fast_4.0.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_fast@v4.0.0")
NP_QC_hac_4.0.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_hac@v4.0.0")
NP_QC_sup_4.0.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_sup@v4.0.0")
NP_QC_fast_4.1.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_fast@v4.1.0")
NP_QC_hac_4.1.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_hac@v4.1.0")
NP_QC_sup_4.1.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_sup@v4.1.0")
NP_QC_supdup_4.1.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_supdup@v4.1.0")
NP_QC_fast_4.2.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_fast@v4.2.0")
NP_QC_hac_4.2.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_hac@v4.2.0")
NP_QC_sup_4.2.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_sup@v4.2.0")
NP_QC_supdup_4.2.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_supdup@v4.2.0")
NP_QC_fast_4.3.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_fast@v4.3.0")
NP_QC_hac_4.3.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_hac@v4.3.0")
NP_QC_hacdup_4.3.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_hacdup@v4.3.0")
NP_QC_sup_4.3.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_sup@v4.3.0")
NP_QC_supdup_4.3.0<-get_read_QC(model = "dna_r10.4.1_e8.2_400bps_supdup@v4.3.0")
NP_QC_combined<-bind_rows(NP_QC_fast_4.0.0,
                          NP_QC_hac_4.0.0,
                          NP_QC_sup_4.0.0,
                          NP_QC_fast_4.1.0,
                          NP_QC_hac_4.1.0,
                          NP_QC_sup_4.1.0,
                          NP_QC_supdup_4.1.0,
                          NP_QC_fast_4.2.0,
                          NP_QC_hac_4.2.0,
                          NP_QC_sup_4.2.0,
                          NP_QC_supdup_4.2.0,
                          NP_QC_fast_4.3.0,
                          NP_QC_hac_4.3.0,
                          NP_QC_hacdup_4.3.0,
                          NP_QC_sup_4.3.0,
                          NP_QC_supdup_4.3.0) %>%
  mutate(model_version=gsub(x = model,pattern=".*@v(.*)",replacement = "\\1"),
         basecalling_model=gsub(x = model,pattern=".*bps_(.*)@v.*",replacement = "\\1"))

### read coverage
draft_assemblies<-list.files(path = "results/",pattern = "*.assembly_info.txt",full.names = T)
cov_tab<-tibble()
for (f in draft_assemblies) {
  cov_temp<-read_tsv(file = f) %>% mutate(draft=f)
  cov_tab<-bind_rows(cov_tab,cov_temp)
}
cov_tab<-cov_tab %>% mutate(contigName=`#seq_name`,coverage=cov.,circular=circ.,
                            ReadSet=str_replace(string =draft,pattern = ".*(dna.*).assembly_info.txt",replacement = "\\1")) %>% 
  select(ReadSet,contigName,coverage,circular)

### Assembly based stats
fastani<-read_tsv(file = "results/fastani.tsv",col_names = c("query","ref","ANI","aligned_segments","total_segments")) %>% 
  mutate(Assembly=str_replace(string = query,pattern = "results/bins/(.*).fa.*",replacement = "\\1"),
         ref=str_replace(string = ref,pattern = ".*/(.*).fasta",replacement = "\\1")) %>% 
  select(Assembly,ref,ANI)

ref_genomes=list.files(path = "results/",pattern = "quast_.*.tsv",full.names = T)

# Load quast data
quast_tab<-tibble()
for (f in ref_genomes) {
  quast_temp<-read_tsv(file = f) %>% mutate(ref=f) %>% filter(`Total aligned length`!="-")
  quast_tab<-bind_rows(quast_tab,quast_temp)
}
quast_tab<-quast_tab %>% mutate(ref=str_replace(string = ref,pattern = ".*quast_(.*).tsv",replacement = "\\1"),
                                Assembly=str_trim(Assembly)) %>% filter(as.numeric(`Total aligned length`)>10^6) %>%
  select(Assembly,ref,`Total length`,`Reference length`,`Largest alignment`,`Total aligned length`,`# indels per 100 kbp`,`# mismatches per 100 kbp`) %>%
  mutate(`Total aligned length`=as.numeric(`Total aligned length`),
         `# indels per 100 kbp`=as.numeric(`# indels per 100 kbp`),
         `# mismatches per 100 kbp`=as.numeric(`# mismatches per 100 kbp`)) %>%
  filter(Assembly!=ref)

###
genome_stats<-fastani %>% right_join(quast_tab,by = c("Assembly", "ref")) %>%
  mutate(contigName=str_replace(string = Assembly,pattern = ".*(contig.*)",replacement = "\\1"),
         ReadSet=str_replace(string = Assembly,pattern = "(.*).flye.*",replacement = "\\1")) %>%
  filter(ANI>95) %>%
  left_join(cov_tab,by = c("contigName", "ReadSet")) %>%
  mutate(asmtype=str_replace(string = Assembly,pattern = ".*(fly.*).contig.*",replacement = "\\1")) %>%
  mutate(model=str_replace(ReadSet,pattern = "-.*",replacement = "")) %>%
  mutate(ref=str_replace(ref,pattern = "_hifiasm",replacement = "")) %>%
  filter(ref!="S.cerevisiae") %>%
  mutate(model_version=gsub(x = model,pattern=".*@v(.*)",replacement = "\\1"),
         basecalling_model=gsub(x = model,pattern=".*bps_(.*)@v.*",replacement = "\\1"))
```

## NP reads mapped to the refs overall (percent identity)

```{r fig.width=10,fig.height=7,echo=FALSE,warning=FALSE,message=FALSE}
NP_QC_combined %>% ggplot(aes(x = percentIdentity,col=basecalling_model))+geom_density()+geom_vline(xintercept = 99,col="red",linetype="dashed")+scale_x_continuous(breaks = 85:100)+coord_cartesian(xlim=c(85, 100))+xlab("Percent identity")+facet_wrap(~model_version,ncol = 1,scales = "free_y")
```

## Longest perfect read

```{r fig.width=10,fig.height=30,echo=FALSE,warning=FALSE,message=FALSE}
NP_QC_combined %>% group_by(basecalling_model,model_version,REF) %>% filter(percentIdentity==100) %>% filter(aligned_lengths==max(aligned_lengths)) %>% 
  ggplot(aes(x=lengths,y=model))+geom_point()+facet_wrap(~REF,ncol = 1)
```

## NP reads mapped to the refs overall (phred scale)

Phred scores for perfect matching reads are calculated as recommended by [Armin Topfer](https://twitter.com/kirk3gaard/status/1397457000217423873) which takes length into account by adjusting the percent identity value 100*(1-1/(length+1)). This implies that a perfect read of 100 bp will achieve a phred score of ~20, while 1000 bp caps out at ~30 etc. So forget about your shor reads being Q40. 

```{r fig.width=10,fig.height=7,echo=FALSE,warning=FALSE,message=FALSE}
PHREDNORMPERFECT=function(length=10) {
  adj_percentidentity<-100*(1-1/(length+1))
  phred<--10*log10((100-adj_percentidentity)/100)
  return(phred)
}
NP_QC_combined %>% 
  mutate(Phred=case_when(percentIdentity==100~PHREDNORMPERFECT(length=lengths),
                         TRUE~-10*log10((100-percentIdentity)/100))) %>%
  ggplot(aes(x = Phred,col=basecalling_model))+geom_density()+geom_vline(xintercept = 20,col="red",linetype="dashed")+xlab("Phred score (Q)")+facet_wrap(~model_version,ncol = 1)
```

## Fast mode

### Indel rate vs coverage

```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model=="fast") %>%
  ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

### Mismatch rate vs coverage

```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model=="fast") %>%
  ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

## HAC mode

### Indel rate vs coverage

```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model=="hac") %>%
  ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

### Mismatch rate vs coverage

```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model=="hac") %>%
  ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

## SUP mode

### Indel rate vs coverage

```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model=="sup") %>%
  ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

### Mismatch rate vs coverage

```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model=="sup") %>%
  ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

## Duplex mode

### Indel rate vs coverage

```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model %in% c("hacdup","supdup")) %>%
  ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

### Mismatch rate vs coverage

```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model %in% c("hacdup","supdup")) %>%
  ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

## 4khz SUP vs 5 khz HAC & SUP

With the launch of 5 khz sampling rate around London Calling 2023 ONT was hoping that the GPU power needed for basecalling could be decreased dramatically as this should allow for the use of the faster HAC model to replace the compute intensive SUP model at 4khz. To test this we here compare the consensus assemblies with the 4 khz SUP model and the new 5khz HAC and SUP models.

### Indels

The indel rate seems to be higher with 5khz HAC than both SUP regardless of sample rate. Interestingly the 5khz sample rate with SUP performs much better than 4 khz SUP for some organisms but not S enterica. 

```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model!="fast") %>%
  filter(model_version %in% c("4.1.0","4.2.0")) %>%
  filter(!(model_version == "4.1.0" & basecalling_model=="hac")) %>%
  mutate(Sample_rate_model=case_when(model_version=="4.1.0"~paste0("4khz_",basecalling_model),
                                     model_version=="4.2.0"~paste0("5khz_",basecalling_model))) %>%
  ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=Sample_rate_model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

### Mismatches

While HAC is on par with SUP for some organisms it is never the best option for mismatches.

```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model!="fast") %>%
  filter(model_version %in% c("4.1.0","4.2.0")) %>%
  filter(!(model_version == "4.1.0" & basecalling_model=="hac")) %>%
  mutate(Sample_rate_model=case_when(model_version=="4.1.0"~paste0("4khz_",basecalling_model),
                                     model_version=="4.2.0"~paste0("5khz_",basecalling_model))) %>%
  ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=Sample_rate_model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```


## Super accuracy comparison model 4.2 vs 4.3

The 4.3 model includes some modifications found in bacteria and it seems to pay off. The model gives a clear improvement for the consensus accuracy across many of the genomes. Indel rates without any polishing ranging from below 0.01 to 1 per 100 kbp translates to 1-100 errors for a 10 Mbp genome. Similar levels were achieved regarding mismatches. The Bacillus showed notably poorer consensus accuracy than the others which I assume could be due to some real differences or errors in the Pacbio HiFi based references.When we are getting in the 0.01-0.1 errors per 100 kbp I assume that it gets difficult to tell whether they are actual errors without hand curating reference genomes with multiple independent technologies as per [Ryan Wicks "Perfect bacterial genome tutorial"](https://github.com/rrwick/Perfect-bacterial-genome-tutorial/wiki). 


### Indels


```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model %in% c("sup","supdup")) %>%
  filter(asmtype=="flye") %>%
  filter(model_version %in% c("4.2.0","4.3.0")) %>%
  ggplot(aes(x=coverage,y = `# indels per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

### Mismatches



```{r fig.width=10, echo=FALSE,warning=FALSE,message=FALSE}
genome_stats %>% 
  filter(basecalling_model %in% c("sup","supdup")) %>%
  filter(asmtype=="flye") %>%
  filter(model_version %in% c("4.2.0","4.3.0")) %>%
  filter(!(model_version == "4.1.0" & basecalling_model=="hac")) %>%
  ggplot(aes(x=coverage,y = `# mismatches per 100 kbp`,col=model,shape=asmtype))+geom_point()+geom_line()+facet_wrap(~ref)+scale_y_log10()+coord_cartesian(xlim=c(0, 100))
```

## Materials and methods

Here is a brief description of the tools used. For the exact commands check out the **Snakefile** in this repository ([Snakemake](https://snakemake.readthedocs.io/en/stable/) v. 7.18.2).

### DNA sequencing
DNA sample was the [Zymo Mock HMW standard](https://zymoresearch.eu/products/zymobiomics-hmw-dna-standard). The DNA was prepared for sequencing using the nanopore ligation sequencing kit (SQK-LSK114) and sequenced on a R10.4.1 nanopore promethion flowcell (FLO-PRO114M) with the "400 bp/s" mode (4khz sampling). The DNA was prepared for sequencing using the nanopore ligation sequencing kit (SQK-LSK114) and sequenced on a R10.4.1 nanopore MinION flowcell (FLO-MIN114) with the "400 bp/s" mode (5khz sampling). The DNA was prepared for sequencing using the nanopore ligation sequencing kit (SQK-LSK114) and sequenced on a R10.4.1 nanopore PromethION flowcell (FLO-PRO114HD) with the "400 bp/s" mode (5khz sampling).

### Basecalling
#### 4 khz PromethION data
The reads were basecalled using [dorado](https://github.com/nanoporetech/dorado) (v. 0.1.1) with fast, hac and sup accuracy mode using the 4.0.0 and 4.1.0 models. 

#### 5 khz MinION data
The reads were basecalled using [dorado](https://github.com/nanoporetech/dorado) (v. 0.3.0) with fast, hac and sup accuracy mode using the 4.2.0 models. 

#### 5 khz High Duplex PromethION data
The reads were basecalled using [dorado](https://github.com/nanoporetech/dorado) (v. 0.3.4) with fast, hac and sup accuracy mode using the 4.2.0 models. The reads were basecalled using [dorado](https://github.com/nanoporetech/dorado) (v. 0.5.0) with fast, hac and sup accuracy mode using the 4.3.0 models. 

### Read QC
Reads were mapped to the updated zymo reference genomes (hopefully goes public soon) using [minimap2](https://github.com/lh3/minimap2) (v. 2.24), and QC information was obtained using [NanoPlot](https://github.com/wdecoster/NanoPlot) (v. 1.41.0).

### Assembly
The reads were subsampled using [seqtk](https://github.com/lh3/seqtk) (v. 1.3) and assembled using [flye](https://github.com/fenderglass/Flye) (v. 2.9.1). The metagenome assemblies were then polished using [medaka](https://github.com/nanoporetech/medaka) (v. 1.11.3).

### Genome quality assessment
The assembled contigs were compared to the reference contigs using [QUAST](https://github.com/ablab/quast) (v. 5.2.0) and [fastANI](https://github.com/ParBLiSS/FastANI) (v. 1.33).