This repository contains all the scripts and data used to build a de novo assembly of the Shahdara accession (Sha) of Arabidopsis thaliana. The resulting Sha assembly and those of four other accessions (An-1, C24, Cvi-0 and Ler-3) were used for a comparative analysis of the corresponding gene and TE annotations. The project is part of a collaborative effort with four other teams, each working on the de novo assembly of one of these accessions.
This project was carried out for two courses of the Master of Bioinformatics: "Genome and transcriptome assembly", organised by the University of Bern, and "Organisation and annotation of eukaryote genomes", organised by the University of Fribourg.
- Quality control and k-mer analysis with FastQC and Jellyfish
- De novo assembly of the long-read genomic data with Canu and Flye, and of the transcriptomic data with Trinity
- Assembly polishing with Pilon after short-read mapping with Bowtie2
- Assembly evaluation with BUSCO, QUAST and Merqury
- Dot plots between the de novo assemblies and the reference genome with MUMmer
- TE annotation and classification with EDTA and TEsorter
- TE dynamics analysis: TE insertion dating, plotting of the genomic distribution of TEs, and TE clade phylogeny
- Gene annotation with MAKER
- Gene annotation evaluation with BUSCO (protein-level) and alignment to UniProt protein sequences with BLAST
- Comparative genomic analysis between accessions with GENESPACE
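To illustrate what the k-mer analysis step in the list above computes, here is a minimal pure-Python sketch of canonical k-mer counting and of the resulting k-mer spectrum. It is not the code used in this project, where Jellyfish does this work far more efficiently; the function names (`canonical`, `count_kmers`, `kmer_spectrum`) and the toy read are assumptions made for illustration only.

```python
from collections import Counter

# Translation table used to build the reverse complement of a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = set("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k: int) -> Counter:
    """Count canonical k-mers over an iterable of read sequences.

    K-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _VALID_BASES:
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(counts: Counter) -> Counter:
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


if __name__ == "__main__":
    toy_counts = count_kmers(["ACGTAC"], k=3)  # hypothetical toy read
    print(dict(toy_counts))
    print(dict(kmer_spectrum(toy_counts)))
```

Counting canonical k-mers treats a k-mer and its reverse complement as the same sequence, which is the usual choice for unstranded sequencing reads. The histogram returned by `kmer_spectrum` is the kind of summary from which genome size and heterozygosity are typically estimated.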
The repository is organised into three main directories:
- scripts directory: all scripts that were used throughout the workflow of the project
- data directory: all the data of the project, from raw reads to intermediate data produced during the steps of the project
- analysis directory: all the results from any analysis that was performed
More information can be found in the README of each directory.
Path to the repository of this project on the IBU cluster: /data/users/grochat/genome_assembly_course/
Link to the repository of this project on GitHub: https://github.com/girochat/genome_assembly_course/
Note: all the data is available on the IBU cluster, but the GitHub repository only contains files of reasonable size (less than 100 MB) due to repository size limits.
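As a rough sketch of how files above this threshold could be identified before pushing to GitHub, the snippet below walks a directory and lists every file at or above a given size. It is not part of the project scripts: the helper name `oversized_files` is hypothetical, and the default limit of 100 × 10^6 bytes is an assumption based on the note above, so adjust it if a different definition of the limit applies.

```python
from pathlib import Path

# Assumption: "100 MB" is interpreted as 100 * 10**6 bytes.
SIZE_LIMIT_BYTES = 100 * 10**6


def oversized_files(root, limit: int = SIZE_LIMIT_BYTES):
    """Return the paths (relative to root) of all files whose size is >= limit bytes."""
    root = Path(root)
    return sorted(
        path.relative_to(root)
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_size >= limit
    )


if __name__ == "__main__":
    # Scan the current directory; each printed path is a candidate for .gitignore.
    for path in oversized_files("."):
        print(path)
```

The listed paths can then be added to `.gitignore` so that the large files stay on the cluster only.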