Updated parameter section labels

Plant-Food-Research-Open · Oct 4, 2024 · 6ec662c
1 parent 0c0269c

Showing 7 changed files with 155 additions and 116 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,7 +3,7 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## 0.4.0+dev - [30-Sep-2024]
+## 0.4.0+dev - [04-Oct-2024]
 
 ### `Added`
 
@@ -30,7 +30,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 21. Now `REPEATMASKER` GFF output is saved via `CUSTOM_RMOUTTOGFF3` [#54](https://github.com/plant-food-research-open/genepal/issues/54)
 22. Added `benchmark` column to the input sheet and used `GFFCOMPARE` to perform benchmarking [#63](https://github.com/plant-food-research-open/genepal/issues/63)
 23. Added `SEQKIT_RMDUP` to detect duplicate sequences and wrap the fasta to 80 characters
-24. Updated modules and sub-workflows
+24. Updated parameter section labels for annotation and post-annotation filtering [#64](https://github.com/plant-food-research-open/genepal/issues/64)
+25. Updated modules and sub-workflows
 
 ### `Fixed`
 

diff --git a/README.md b/README.md
@@ -20,7 +20,7 @@
 - [REPEATMODELER](https://github.com/Dfam-consortium/RepeatModeler) or [EDTA](https://github.com/oushujun/EDTA): Create TE library
 - [REPEATMASKER](https://github.com/rmhubley/RepeatMasker): Soft mask the genome fasta
 - [FASTQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc), [FASTP](https://github.com/OpenGene/fastp), [SORTMERNA](https://github.com/sortmerna/sortmerna): QC, trim and filter RNASeq evidence
-- [STAR](https://github.com/alexdobin/STAR): RNAseq alignment
+- [STAR](https://github.com/alexdobin/STAR): RNASeq alignment
 - [BRAKER](https://github.com/Gaius-Augustus/BRAKER): Annotate the genome fasta
 - [LIFTOFF](https://github.com/agshumate/Liftoff): Liftoff annotations from reference genome fasta/gff
 - [TSEBRA](https://github.com/Gaius-Augustus/TSEBRA), [AGAT](https://github.com/NBISweden/AGAT): Merge BRAKER and Liftoff annotations

diff --git a/docs/output.md b/docs/output.md
@@ -14,8 +14,8 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 
 - [Repeat annotation](#repeat-annotation)
 - [Repeat masking](#repeat-masking)
-- [RNAseq trimming, filtering and QC](#rnaseq-trimming-filtering-and-qc)
-- [RNAseq alignment](#rnaseq-alignment)
+- [RNASeq trimming, filtering and QC](#rnaseq-trimming-filtering-and-qc)
+- [RNASeq alignment](#rnaseq-alignment)
 - [Annotation with BRAKER](#annotation-with-braker)
 - [Annotation with Liftoff](#annotation-with-liftoff)
 - [Annotation filtering and merging](#annotation-filtering-and-merging)
@@ -50,7 +50,7 @@ A repeat library is created with either [REPEATMODELER](https://github.com/Dfam-
 
 Soft masking of the repeats is performed with [REPEATMASKER](https://github.com/rmhubley/RepeatMasker) using the repeat library prepared in the previous step. Masking outputs are saved to the output directory only if the `repeatmasker_save_outputs` parameter is set to `true` (default: `false`).
 
-### RNAseq trimming, filtering and QC
+### RNASeq trimming, filtering and QC
 
 <details markdown="1">
 <summary>Output files</summary>
@@ -79,7 +79,7 @@ Soft masking of the repeats is performed with [REPEATMASKER](https://github.com/
 
 RNASeq reads are trimmed with [FASTP](https://github.com/OpenGene/fastp) and are QC'ed with [FASTQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc). Ribosomal reads are filtered out using [SORTMERNA](https://github.com/sortmerna/sortmerna). Trimmed reads are only stored to the output directory if the `save_trimmed` parameter is set to `true` (default: `false`). Reads filtered by [SORTMERNA](https://github.com/sortmerna/sortmerna) are stored to the output directory if the `save_non_ribo_reads` parameter is set to `true` (default: `false`).
 
-### RNAseq alignment
+### RNASeq alignment
 
 <details markdown="1">
 <summary>Output files</summary>
@@ -93,7 +93,7 @@ RNASeq reads are trimmed with [FASTP](https://github.com/OpenGene/fastp) and are
 
 </details>
 
-RNAseq alignment is performed with [STAR](https://github.com/alexdobin/STAR). Alignment files are only stored to the output directory if the `star_save_outputs` parameter is set to `true` (default: `false`). Concatenated bam files are stored to the output directory if the `save_cat_bam` parameter is set to `true` (default: `false`).
+RNASeq alignment is performed with [STAR](https://github.com/alexdobin/STAR). Alignment files are only stored to the output directory if the `star_save_outputs` parameter is set to `true` (default: `false`). Concatenated bam files are stored to the output directory if the `save_cat_bam` parameter is set to `true` (default: `false`).
 
 ### Annotation with BRAKER
 
@@ -112,7 +112,7 @@ RNAseq alignment is performed with [STAR](https://github.com/alexdobin/STAR). Al
 
 </details>
 
-[BRAKER](https://github.com/Gaius-Augustus/BRAKER) is used to annotate each genome assembly using the provided protein and RNAseq evidence. Outputs from BRAKER are stored to the output directory if the `braker_save_outputs` parameter is set to `true` (default: `false`).
+[BRAKER](https://github.com/Gaius-Augustus/BRAKER) is used to annotate each genome assembly using the provided protein and RNASeq evidence. Outputs from BRAKER are stored to the output directory if the `braker_save_outputs` parameter is set to `true` (default: `false`).
 
 > [!CAUTION]
 >

diff --git a/docs/parameters.md b/docs/parameters.md
@@ -37,7 +37,7 @@ A Nextflow pipeline for single genome, multiple genomes and pan-genome annotatio
 | `save_non_ribo_reads`    | Save FASTQ files after Ribosomal RNA removal or not?               | `boolean` |                                           |          |        |
 | `ribo_database_manifest` | Ribosomal RNA fastas listed in a text sheet                        | `string`  | ${projectDir}/assets/rrna-db-defaults.txt |          |        |
 
-## RNAseq alignment options
+## RNASeq alignment options
 
 | Parameter                | Description                                       | Type      | Default | Required | Hidden |
 | ------------------------ | ------------------------------------------------- | --------- | ------- | -------- | ------ |
@@ -48,19 +48,29 @@ A Nextflow pipeline for single genome, multiple genomes and pan-genome annotatio
 
 ## Annotation options
 
-| Parameter                     | Description                                                                       | Type      | Default | Required | Hidden |
-| ----------------------------- | --------------------------------------------------------------------------------- | --------- | ------- | -------- | ------ |
-| `braker_extra_args`           | Extra arguments for BRAKER                                                        | `string`  |         |          |        |
-| `braker_save_outputs`         | Save BRAKER files                                                                 | `boolean` |         |          |        |
-| `liftoff_coverage`            | Liftoff coverage parameter                                                        | `number`  | 0.9     |          |        |
-| `liftoff_identity`            | Liftoff identity parameter                                                        | `number`  | 0.9     |          |        |
-| `allow_isoforms`              | Allow multiple isoforms for gene models                                           | `boolean` | True    |          |        |
-| `enforce_full_intron_support` | Require every model to have external evidence for all its introns                 | `boolean` | True    |          |        |
-| `filter_liftoff_by_hints`     | Use BRAKER hints to filter Liftoff models                                         | `boolean` | True    |          |        |
-| `eggnogmapper_evalue`         | Only report alignments below or equal the e-value threshold                       | `number`  | 1e-05   |          |        |
-| `eggnogmapper_pident`         | Only report alignments above or equal to the given percentage of identity (0-100) | `integer` | 35      |          |        |
-| `eggnogmapper_purge_nohits`   | Purge transcripts which do not have a hit against eggnog                          | `boolean` |         |          |        |
-| `add_attrs_to_proteins_fasta` | Add gff attributes to proteins fasta                                              | `boolean` |         |          |        |
+| Parameter             | Description                                                                       | Type      | Default | Required | Hidden |
+| --------------------- | --------------------------------------------------------------------------------- | --------- | ------- | -------- | ------ |
+| `braker_extra_args`   | Extra arguments for BRAKER                                                        | `string`  |         |          |        |
+| `liftoff_coverage`    | Liftoff coverage parameter                                                        | `number`  | 0.9     |          |        |
+| `liftoff_identity`    | Liftoff identity parameter                                                        | `number`  | 0.9     |          |        |
+| `eggnogmapper_evalue` | Only report alignments below or equal the e-value threshold                       | `number`  | 1e-05   |          |        |
+| `eggnogmapper_pident` | Only report alignments above or equal to the given percentage of identity (0-100) | `integer` | 35      |          |        |
+
+## Post-annotation filtering options
+
+| Parameter                     | Description                                                       | Type      | Default | Required | Hidden |
+| ----------------------------- | ----------------------------------------------------------------- | --------- | ------- | -------- | ------ |
+| `allow_isoforms`              | Allow multiple isoforms for gene models                           | `boolean` | True    |          |        |
+| `enforce_full_intron_support` | Require every model to have external evidence for all its introns | `boolean` | True    |          |        |
+| `filter_liftoff_by_hints`     | Use BRAKER hints to filter Liftoff models                         | `boolean` | True    |          |        |
+| `eggnogmapper_purge_nohits`   | Purge transcripts which do not have a hit against eggnog          | `boolean` |         |          |        |
+
+## Annotation output options
+
+| Parameter                     | Description                          | Type      | Default | Required | Hidden |
+| ----------------------------- | ------------------------------------ | --------- | ------- | -------- | ------ |
+| `braker_save_outputs`         | Save BRAKER files                    | `boolean` |         |          |        |
+| `add_attrs_to_proteins_fasta` | Add gff attributes to proteins fasta | `boolean` |         |          |        |
 
 ## Evaluation options
 

diff --git a/docs/usage.md b/docs/usage.md
@@ -7,7 +7,7 @@
 - [Assemblysheet input](#assemblysheet-input)
 - [Protein evidence](#protein-evidence)
   - [BRAKER workflow](#braker-workflow)
-- [RNAseq evidence](#rnaseq-evidence)
+- [RNASeq evidence](#rnaseq-evidence)
   - [BRAKER workflow](#braker-workflow-1)
   - [Preprocessing](#preprocessing)
   - [Alignment](#alignment)
@@ -52,11 +52,11 @@ Protein evidence can be provided in two ways. First, a single FASTA file. Second
 
 With these two parameters, the pipeline has sufficient inputs to execute the [BRAKER workflow C](https://github.com/Gaius-Augustus/BRAKER/tree/f58479fe5bb13a9e51c3ca09cb9e137cab3b8471?tab=readme-ov-file#overview-of-modes-for-running-braker) (see Figure 4) in which GeneMark-EP+ is trained on protein spliced alignments, then GeneMark-EP+ generates training data for AUGUSTUS which then performs the final gene prediction.
 
-## RNAseq evidence
+## RNASeq evidence
 
 > ❔ Optional `--rna_evidence`
 
-RNAseq evidence must be provided through a samplesheet in CSV format which has the following columns,
+RNASeq evidence must be provided through a samplesheet in CSV format which has the following columns,
 
 - `sample:` A sample identifier. The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis.
 - `file_1:` A FASTQ or BAM file
@@ -65,17 +65,17 @@ RNAseq evidence must be provided through a samplesheet in CSV format which has t
 
 ### BRAKER workflow
 
-If RNAseq evidence is provided, the pipeline executes the [BRAKER workflow D](https://github.com/Gaius-Augustus/BRAKER/tree/f58479fe5bb13a9e51c3ca09cb9e137cab3b8471?tab=readme-ov-file#overview-of-modes-for-running-braker) (see Figure 4) in which GeneMark-ETP is trained with both protein and RNASeq evidence and the training data generated by GeneMark-ETP is used to optimise AUGUSTUS for final gene predictions.
+If RNASeq evidence is provided, the pipeline executes the [BRAKER workflow D](https://github.com/Gaius-Augustus/BRAKER/tree/f58479fe5bb13a9e51c3ca09cb9e137cab3b8471?tab=readme-ov-file#overview-of-modes-for-running-braker) (see Figure 4) in which GeneMark-ETP is trained with both protein and RNASeq evidence and the training data generated by GeneMark-ETP is used to optimise AUGUSTUS for final gene predictions.
 
 ### Preprocessing
 
-RNAseq reads provided in FASTQ files are by default trimmed with [FASTP](https://github.com/OpenGene/fastp). No extra parameters are passed to FASTP by default, although additional parameters can be provided with the `--extra_fastp_args` parameter. After trimming, any sample which does not have `10000` reads left is dropped. This threshold can be specified with the `--min_trimmed_reads` parameter. If trimming was already performed or is not desirable, it can be skipped by setting the `--skip_fastp` flag to `true`.
+RNASeq reads provided in FASTQ files are by default trimmed with [FASTP](https://github.com/OpenGene/fastp). No extra parameters are passed to FASTP by default, although additional parameters can be provided with the `--extra_fastp_args` parameter. After trimming, any sample which does not have `10000` reads left is dropped. This threshold can be specified with the `--min_trimmed_reads` parameter. If trimming was already performed or is not desirable, it can be skipped by setting the `--skip_fastp` flag to `true`.
 
 Optionally, [SORTMERNA](https://github.com/sortmerna/sortmerna) can be activated by setting the `--remove_ribo_rna` flag to `true`. A default list of rRNA databases is pre-configured and can be seen in the [assets/rrna-db-defaults.txt](../assets/rrna-db-defaults.txt) file. A path to a custom list of databases can be specified by the `--ribo_database_manifest` parameter.
 
 ### Alignment
 
-RNAseq evidence provided as FASTQ files is aligned using [STAR](https://github.com/alexdobin/STAR). The default alignment parameters are,
+RNASeq evidence provided as FASTQ files is aligned using [STAR](https://github.com/alexdobin/STAR). The default alignment parameters are,
 
 ```bash
 --outSAMstrandField intronMotif \
@@ -88,7 +88,7 @@ where `--star_max_intron_length` is a pipeline parameter and its default value i
 
 > [!WARNING]
 >
-> If pre-aligned RNAseq data is provided as a BAM file and the alignment was not performed with the `--outSAMstrandField intronMotif` parameter, the pipeline might throw an error.
+> If pre-aligned RNASeq data is provided as a BAM file and the alignment was not performed with the `--outSAMstrandField intronMotif` parameter, the pipeline might throw an error.
 
 ## Liftoff annotations
 
@@ -156,7 +156,7 @@ If there are more than one target assemblies, an orthology inference is performe
 
 ## Iso-forms and full intron support
 
-By default, the pipeline allows multiple isoforms from BRAKER. This behavior can be changed by setting the `--allow_isoforms` flag to `false`. Moreover, every intron from every model from BRAKER and LIFTOFF must have support from protein or RNAseq evidence. This is enforced with [TSEBRA](https://github.com/Gaius-Augustus/TSEBRA). This requirement can be removed by setting the `--enforce_full_intron_support` flag to `false`. Alternatively, the criterion can be applied only to BRAKER models by setting the `--filter_liftoff_by_hints` flag to `false`.
+By default, the pipeline allows multiple isoforms from BRAKER. This behavior can be changed by setting the `--allow_isoforms` flag to `false`. Moreover, every intron from every model from BRAKER and LIFTOFF must have support from protein or RNASeq evidence. This is enforced with [TSEBRA](https://github.com/Gaius-Augustus/TSEBRA). This requirement can be removed by setting the `--enforce_full_intron_support` flag to `false`. Alternatively, the criterion can be applied only to BRAKER models by setting the `--filter_liftoff_by_hints` flag to `false`.
 
 ## Running the pipeline
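
As a reading aid for the parameters regrouped under "Post-annotation filtering options", the following is a minimal dry-run sketch of how those flags might be combined on the command line. It only prints the command and does not run the pipeline. The helper name `build_genepal_cmd` and the file name `rnasheet.csv` are hypothetical, the repository name is taken from the issue links in the changelog, and the other required inputs (such as the assembly sheet and protein evidence) are deliberately left out because their parameter names do not appear in this diff.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints a candidate command line, does not execute it.
# Only flags named in the changed docs are used; required inputs not shown in this
# diff are omitted, so the printed command is incomplete on purpose.
build_genepal_cmd() {
  local rna_sheet="$1"  # path to an RNASeq samplesheet (placeholder name)
  printf '%s ' nextflow run plant-food-research-open/genepal \
    --rna_evidence "$rna_sheet" \
    --allow_isoforms false \
    --enforce_full_intron_support true \
    --filter_liftoff_by_hints false
  printf '\n'
}

build_genepal_cmd rnasheet.csv
```

With this combination, multiple isoforms are disabled and the full-intron-support requirement stays on, but, per the usage text, Liftoff models are not filtered by BRAKER hints, so the criterion applies to BRAKER models only.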