Stinging nettle genome assembly

Codes/pipeline used in genome assembly report on Urtica dioica ssp. dioica (stinging nettle): https://www.mdpi.com/2223-7747/14/1/124

This assembly report consists of:

De novo genome assembly with PacBio HiFi reads + Hi-C
Gene annotation with BRAKER3
Repeat annotation with EDTA (TEs) and RepeatOBserver (telomeric sequences, centromeric repeats)
Describe polycentric behaviour of nettle chromosomes
Describe genes/proteins that are potentially responsible for stinging hairs
Describe fibre synthesis genes presence
Describe basic phylogenetics with related taxa in Urticaceae family.

De novo assembly pipeline (exact steps I've taken)

Filter HiFi reads by Q20 fastq.filter
Run Hififasm with HiFi + Hi-C
Run Juicer to produce .hic and .assembly file - works as a matrix for Hi-C reads to then map onto.
Use 3d-dna_pipeline to map Hi-C reads to assembly
Open the output in Juicebox, manually fix contigs & assign chromosomes - see https://aidenlab.org/assembly/manual_180322.pdf for details
Repeat 3-5 on each haplotype separately - Round_1_H1.fa, Round_1_H2.fa
Combine H1 and H2 and repeat 3-5 - H1_H2_v1.fa
Repeat 3-5 on each haplotype separately again on the original contig-level assembly file - Round_2_H1.fa, Round_2_H2.fa
Combine H1 and H2 and repeat 3-5 - H1_H2_v2.fa
Perform a draft gene annotation with the Transdecoder pipeline used in lingonberry paper, ensuring that the annotated genes covered most of the chromosomes (BUSCO: 61% but genes were homogeneously distributed across the genome so I used it)
Error correct ONT reads using Canu, filter the output by Q10 and length 30kbp, then map in a further effort to close gaps in the assembly using TGS_gap_closer - did nothing.
Use Anchorwave to align H1 and H2, verify any SVs that are present and whether they are real on Hi-C map
Use minimap2 to align H1 and H2 also, similarly do the same to check SVs --> looks like minimap2 identifies a lot more dubious organization of contigs in chromosomes
Use RepeatOBserver to check if some of the "difficult to resolve" regions are overlaping with centromeres or other repetitive regions - in that case, there is not much you can do to resolve the assembly... Also double check that homeologous chromosomes have reasonably similar centromeric positions.
Perform final polishing with Juicebox on H1_H2_v2.fa file and ensure all SVs are real - Round_3_H1.fa, Round_3_H2.fa
Assign chromosome numbers that are homeologous between two haplotypes (13 chr each) - Nettle_female_H1_chr_renamed.fa, Nettle_female_H2_chr_renamed.fa
Based on SyRI output in step 13, manually check for INV/TRANS regions if they are real or not
First identify exact coordinates of such variations. Second change these in H1 / H2 assembly file in Juicebox. Third, if these are unclear, check in H1+H2 asesmbly file in Juicebox.
To check the uncertainties remain, I mapped HiFi reads and >30kb ONT reads to the assembly using Winnowmap, visualized the alignment in IGV - the point is to see if any of those long-reads span the breakpoints of such SVs to support the orientation/position of contigs.
Finalize the best guesses --> Round_5_H1.fa, Round_5_H2.fa
Align to the published U. dioica genome from Darwin Tree of Life - if any variations are found, we make notes of it. I changed my chromosome numbering based on homologous relationship to this published genome (Udio_DToL). --> Nettle_female_Round_5_H1_chrname_renamed.fa, Nettle_female_Round_5_H2_chrname_renamed.fa
Due to some regions still showing messy Hi-C heatmap, I chopped off the obviously misplaced regions within a contig into debris, suspecting that some of these regions might be switch errors. --> Round_6_H1.fa, Round_6_H2.fa
Combined the H1+H2 --> H1_H2_v4.fa and run 3d-dna pipeline again, but with -q 0 setting to allow Hi-C reads to be mapped at repetitive regions. On the Hi-C heatmap with H1+H2 assembly, modify misassemblies as necessary. A lot of small contigs that were moved to debris in the previous steps could be placed back in chromosomes, in the correct haplotype (haplotype with more interaction than the other). --> Round_7_H1.fa, Round_7_H2.fa
Run minimap2+SyRI, check coordinates of INV and visually inspect in H1+H2 map. --> Round_8_H1.fa, Round_8_H2.fa (FINAL)
Run final BBmap(N50, etc.), BUSCO score on the assemblies.
Also check what the non-chromosomal regions contain using RED and BUSCO.

For list of intermediate files generated during the above process in a table, see my_juicebox.assembly_files.txt. also on the manuscript, I am referring to the different round number designations because I did not perform quality metric checks at every step on the way. I divided the Rounds of curation by distinctive steps with QC vakues available: Round 1 - visual Hi-C scaffolding with H1 and H2 and H1+H2 maps, Round 2 - attemp to incorporate ONT reads to fill the gaps, further curation with H1+H2 map and visually inspect, Round 3 - extensive manual curation using minimap2+SyRI coordinates to check INVs specificially, as well as checking if long-reads span breakpoints, Round 4 - final manual curation; using -q 0 on 3d-dna pipeline command, allowing mapping quality 0 reads to be visualized on the H1+H2 genome, corrected switch errors mainly to make the Hi-C map look smoother. Used minimap2+SyRI again to check exact INV coordinates only and made all change in H1+H2 map.

De novo annotation pipeline

Softmask repetitive regions with RepeatDetector
Align published RNAseq to the softmasked genome using Hisat2 and export as sorted BAM file
Annotate genes with BRAKER3 using softmasked genome, RNAseq BAM, and Viridiplantae (OrthoDB) protein database
Annotate TEs with EDTA on the pre-masked genonme

Describe polycentric behaviour of chromosomes

We used RepeatOBserver to visualize general patterns of repeat structure and found some unique centromeric repeats distributions
Search for specific centromeric/telomeric repeat sequences with RepeatOBserver results
Compare syntenic relationship with Morus genome, that has previously shown polycentric behaviour in an experiment
Compare synteny in Urticaceae family with GENESPACE

Describe genes/proteins that are potentially responsible for stinging hairs

try search the protein sequence against full genome with miniprot

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
0_raw_data_pre-filtering		0_raw_data_pre-filtering
0_stats_on_assemblies		0_stats_on_assemblies
1_genome_assembly_with_PBHiFi_HiC		1_genome_assembly_with_PBHiFi_HiC
2_annotation		2_annotation
3_comparative_genomics		3_comparative_genomics
4_sting_gene_discovery		4_sting_gene_discovery
README.md		README.md
my_juicebox.assembly_files.txt		my_juicebox.assembly_files.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Stinging nettle genome assembly

De novo assembly pipeline (exact steps I've taken)

De novo annotation pipeline

Describe polycentric behaviour of chromosomes

Describe genes/proteins that are potentially responsible for stinging hairs

About

Releases

Packages

Languages

kaede0e/stinging_nettle_genome_assembly

Folders and files

Latest commit

History

Repository files navigation

Stinging nettle genome assembly

De novo assembly pipeline (exact steps I've taken)

De novo annotation pipeline

Describe polycentric behaviour of chromosomes

Describe genes/proteins that are potentially responsible for stinging hairs

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages