Repeat and reanalysis of COG-UK data

Description

This folder archives the scripts for downloading data from EBI, randomly selecting a Medaka model for Artic, running the Artic and Artex pipelines, and post-analysis.

Execution environments

The conda environments used are listed in ./envs. Use conda env create to recreate each environment:

  1. Minimap2
conda env create -f ./envs/minimap2.yaml;
  2. bcftools
conda env create -f ./envs/bcftools.yaml;
  3. Artic
conda env create -f ./envs/artic.yaml;
  4. Artex
conda env create -f ./envs/artex.yaml;
  5. hap.py
conda env create -f ./envs/hap.yaml;
  6. bioawk
conda env create -f ./envs/bioawk.yaml;

Pipelines

  1. Preprocess COG-UK EBI metadata. Purpose: identify the COG-UK sample ids that have both NGS and ONT sequencing, and retrieve the download links for the raw ONT sequences and the analysis files (NGS-assembled consensus sequence, ONT-assembled consensus sequence).
bash ./ebi_preprocess_scripts/preprocess.sh;
  2. Download data. Download the raw ONT sequencing data, the NGS-assembled consensus sequences, the ONT-assembled consensus sequences, the COVID-19 reference files, and the Artic primer scheme V3 files.
bash ./download_scripts/download_all.sh;
  3. LongBow config prediction. Run LongBow on all raw ONT sequencing data and retrieve the predicted basecalling configuration.
bash ./longbow_pred_script/run_longbow.sh;
  4. Run Artic pipeline. Run the Artic pipeline with three different Medaka model settings: (1) the LongBow-predicted Medaka model, (2) a random Medaka model chosen with the Python random package, (3) the default Medaka model.
bash ./artic_scripts/run_artic.sh;
  5. Run Artex pipeline. Run the Artex (Artic extension) pipeline, which adds an extra round of variant calling with Clair3.
bash ./artex_scripts/run_artex.sh;
  6. Post analysis. Evaluate the F1-score of variant calling for each scheme, and find the extra variants called by the Artex pipeline compared to the traditional Artic pipeline.
bash ./post_analysis/post_analysis.sh;
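
The random Medaka model in the Artic step is described as being chosen with the Python random package. The sketch below shows one way a reproducible, seeded choice could look. It is not the script in ./artic_scripts: the candidate list, the function names, the ERR ids, and the "base seed plus offset" seeding scheme are all assumptions made for illustration.

```python
import random

# Illustrative candidates only (assumption): the real set of Medaka
# models depends on the installed Medaka version.
CANDIDATE_MODELS = [
    "r941_min_fast_g303",
    "r941_min_high_g360",
    "r941_min_sup_g507",
    "r103_min_high_g360",
]


def pick_random_model(seed, models=CANDIDATE_MODELS):
    """Return one model name, chosen reproducibly from the given seed."""
    rng = random.Random(seed)  # private generator, leaves global state alone
    return rng.choice(sorted(models))  # sort so input order does not matter


def assign_models(err_ids, base_seed=0):
    """Build (ERR id, seed, model) rows, one per run.

    The seeding scheme (base_seed + position) is hypothetical.
    """
    rows = []
    for offset, err_id in enumerate(err_ids):
        seed = base_seed + offset
        rows.append((err_id, seed, pick_random_model(seed)))
    return rows


if __name__ == "__main__":
    # Placeholder ERR ids, not real accessions from this study.
    for row in assign_models(["ERR0000001", "ERR0000002"], base_seed=42):
        print("\t".join(str(field) for field in row))
```

Recording the seed next to each ERR id means the same model can be regenerated later from the seed alone.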

Results description

  1. ./download_list.txt

| Column | Content |
| --- | --- |
| 1st column | ERR id |
| 2nd column | ONT raw reads FASTQ download link |
| 3rd column | COG-UK ONT consensus file download link |
| 4th column | COG-UK NGS consensus file download link |

  2. ../results/longbow.log Contains the LongBow prediction results for each of the 269 FASTQ files; in our case, all are R9 Guppy 3/4 HAC.

  3. ./ERR_list_models.txt Contains the ERR id, random seed, and random Medaka model for each ERR dataset.

  4. ../results/ERR* Each ERR* directory contains 4 subdirectories: longbow (LongBow-predicted Medaka model), default (default Medaka model), random (random Medaka model listed in ./ERR_list_models.txt), and Artex (results of the Artex pipeline).

  5. ../results/F1_score_file Contains the F1 score details for each mode, with SNPs and INDELs reported separately.
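
For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive, and false-negative counts of the kind a variant comparison tool reports. The function name and the example counts are hypothetical, and this is not the code in ./post_analysis.

```python
def f1_score(tp, fp, fn):
    """Return the F1 score from true-positive, false-positive and
    false-negative counts; 0.0 when the score is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts for illustration, scored separately per variant type.
    example_counts = {"SNP": (8, 2, 2), "INDEL": (3, 1, 0)}
    for variant_type, (tp, fp, fn) in example_counts.items():
        print(variant_type, round(f1_score(tp, fp, fn), 4))
```

Scoring SNPs and INDELs separately keeps a large number of easy SNP calls from masking errors in the much rarer INDEL calls.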

Repeat our results

To repeat our results, create the conda environments listed above and run

bash ./run_all.sh;

Possible error

  1. Artic pipeline. If you encounter medaka: error: argument command: invalid choice: 'consensus', conda has installed a newer Medaka version instead of a compatible one. Try running the following command:
conda install -c bioconda -c conda-forge artic medaka=1.11.3
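
To tell in advance whether an environment is affected, the installed Medaka version can be compared against the pinned one. The helper below is a minimal sketch that works on a version string; it assumes the string contains a three-part version number such as "medaka 1.11.3". The function names are hypothetical, and how the string is obtained (for example from the tool's version output) is left to the reader.

```python
import re

# Version pinned in the install command above.
PINNED_VERSION = (1, 11, 3)


def parse_version(text):
    """Extract the first 'X.Y.Z' version found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_newer_than_pinned(text, pinned=PINNED_VERSION):
    """Return True when the reported version is newer than the pinned one."""
    return parse_version(text) > pinned


if __name__ == "__main__":
    for reported in ("medaka 1.11.3", "medaka 2.0.1"):
        print(reported, "-> newer than pinned:", is_newer_than_pinned(reported))
```

Comparing integer tuples rather than raw strings avoids the trap where "1.9.0" sorts after "1.11.3" alphabetically.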