Added parameter filter_genes_by_aa_length #127

Draft: wants to merge 5 commits into base: dev
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

1. Added cDNA and CDS outputs to <OUTPUT_DIR>/annotations/<SAMPLE> directory [#118](https://github.com/Plant-Food-Research-Open/genepal/issues/118)
2. Added parameter `add_attrs_to_proteins_cds_fastas`
+ 3. Added parameter `filter_genes_by_aa_length` (default `24`), which removes genes with ORFs shorter than the specified number of amino acids [#125](https://github.com/Plant-Food-Research-Open/genepal/issues/125)

### `Fixed`

4 changes: 3 additions & 1 deletion README.md
@@ -42,7 +42,9 @@
- Optionally, remove models without any EggNOG-mapper hits
- [EggNOG-mapper](https://github.com/eggnogdb/eggnog-mapper): Add functional annotation to gff
- [GenomeTools](https://github.com/genometools/genometools): GFF format validation
- - [GffRead](https://github.com/gpertea/gffread): Extraction of protein sequences
+ - [GffRead](https://github.com/gpertea/gffread)
+   - Extraction of protein sequences
+   - Optionally, remove models with ORFs shorter than `N` amino acids
- [OrthoFinder](https://github.com/davidemms/OrthoFinder): Perform phylogenetic orthology inference across genomes
- [GffCompare](https://github.com/gpertea/gffcompare): Compare and benchmark against an existing annotation
- [BUSCO](https://gitlab.com/ezlab/busco): Completeness statistics for genome and annotation through proteins
4 changes: 4 additions & 0 deletions conf/modules.config
@@ -240,6 +240,10 @@ process { // SUBWORKFLOW: GFF_MERGE_CLEANUP
          ext.prefix = { "${meta.id}.liftoff.braker" }
      }

+     withName: '.*:GFF_MERGE_CLEANUP:FILTER_BY_ORF_SIZE' {
+         ext.args = params.filter_genes_by_aa_length ? "--no-pseudo --keep-genes -C -l ${params.filter_genes_by_aa_length * 3}" : ''
+     }
+
      withName: '.*:GFF_MERGE_CLEANUP:GT_GFF3' {
          ext.args = '-tidy -retainids -sort'
      }

Review comment on the `ext.args` line (Member Author): @rosscrowhurst `params.filter_genes_by_aa_length` is multiplied by 3 to get the CDS length in nucleotides for the `-l` parameter.
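The `ext.args` expression in this hunk can be mirrored in a few lines of Python to make the unit conversion explicit. This is a minimal sketch, not part of the PR: the function name `build_gffread_args` is hypothetical, and it assumes, as the review comment states, that the amino-acid threshold is converted to nucleotides by multiplying by 3. Like the Groovy ternary, it returns an empty string when the parameter is unset.

```python
def build_gffread_args(aa_length):
    """Mirror the Groovy ternary: empty args when the threshold is unset (or 0)."""
    if not aa_length:  # Groovy truthiness: null and 0 are both falsy
        return ""
    # One amino acid corresponds to one codon, i.e. three nucleotides
    return f"--no-pseudo --keep-genes -C -l {aa_length * 3}"


print(build_gffread_args(24))    # default threshold from nextflow.config
print(build_gffread_args(None))  # filter disabled
```

With the default of 24 this yields `-l 72`, the same value that is hard-coded in the module test's `nextflow.config`.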
4 changes: 2 additions & 2 deletions docs/output.md
@@ -169,8 +169,8 @@ If more than one genome is included in the pipeline, [ORTHOFINDER](https://githu
- `Y/`
- `Y.gt.gff3`: Final annotation file for genome `Y` which contains gene models and their functional annotations
- `Y.pep.fasta`: Protein sequences for the gene models
- - 'Y.cdna.fasta': cDNA sequences for the gene models
- - 'Y.cds.fasta': Coding sequences for the gene models
+ - `Y.cdna.fasta`: cDNA sequences for the gene models
+ - `Y.cds.fasta`: Coding sequences for the gene models

</details>

13 changes: 7 additions & 6 deletions docs/parameters.md
@@ -59,12 +59,13 @@ A Nextflow pipeline for consensus, phased and pan-genome annotation.

## Post-annotation filtering options

- | Parameter | Description | Type | Default | Required | Hidden |
- | ----------------------------- | ----------------------------------------------------------------- | --------- | ------- | -------- | ------ |
- | `allow_isoforms` | Allow multiple isoforms for gene models | `boolean` | True | | |
- | `enforce_full_intron_support` | Require every model to have external evidence for all its introns | `boolean` | True | | |
- | `filter_liftoff_by_hints` | Use BRAKER hints to filter Liftoff models | `boolean` | True | | |
- | `eggnogmapper_purge_nohits` | Purge transcripts which do not have a hit against eggnog | `boolean` | | | |
+ | Parameter | Description | Type | Default | Required | Hidden |
+ | ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | --------- | ------- | -------- | ------ |
+ | `allow_isoforms` | Allow multiple isoforms for gene models | `boolean` | True | | |
+ | `enforce_full_intron_support` | Require every model to have external evidence for all its introns | `boolean` | True | | |
+ | `filter_liftoff_by_hints` | Use BRAKER hints to filter Liftoff models | `boolean` | True | | |
+ | `eggnogmapper_purge_nohits` | Purge transcripts which do not have a hit against eggnog | `boolean` | | | |
+ | `filter_genes_by_aa_length` | Filter genes with open reading frames shorter than the specified number of amino acids. If set to `null`, this filter step is skipped. | `integer` | 24 | | |

## Annotation output options

38 changes: 38 additions & 0 deletions modules/local/tests/gffread/main.nf.test
@@ -0,0 +1,38 @@
nextflow_process {

    name "Test Process GFFREAD"
    script "../../../nf-core/gffread/main.nf"
    config "./nextflow.config"
    process "GFFREAD"

    tag "gffread"
    tag "modules_nfcore"
    tag "modules"

    test("filter by length") {

        when {
            process {
                """
                input[0] = [
                    [id: 'test'],
                    file("$baseDir" + '/modules/local/tests/gffread/testdata/t.gff', checkIfExists: true)
                ]
                input[1] = []
                """
            }
        }

        then {
            assertAll (
                { assert process.success },
                { assert snapshot(process.out).match() },
                { assert file(process.out.gffread_gff[0][1]).text.contains('gene19851') },
                { assert file(process.out.gffread_gff[0][1]).text.contains('gene19851.t1') },
                { assert ! file(process.out.gffread_gff[0][1]).text.contains('gene19851.t2') } // This is the only transcript which is being knocked out
            )
        }

    }

}
47 changes: 47 additions & 0 deletions modules/local/tests/gffread/main.nf.test.snap
@@ -0,0 +1,47 @@
{
"filter by length": {
"content": [
{
"0": [

],
"1": [
[
{
"id": "test"
},
"test.gff3:md5,59a7d6ff7123589ef2b90b20043a347c"
]
],
"2": [

],
"3": [
"versions.yml:md5,05f671c6c6e530acedad0af0a5948dbd"
],
"gffread_fasta": [

],
"gffread_gff": [
[
{
"id": "test"
},
"test.gff3:md5,59a7d6ff7123589ef2b90b20043a347c"
]
],
"gtf": [

],
"versions": [
"versions.yml:md5,05f671c6c6e530acedad0af0a5948dbd"
]
}
],
"meta": {
"nf-test": "0.9.2",
"nextflow": "24.04.4"
},
"timestamp": "2024-12-11T21:11:59.953464"
}
}
5 changes: 5 additions & 0 deletions modules/local/tests/gffread/nextflow.config
@@ -0,0 +1,5 @@
process {
    withName: GFFREAD {
        ext.args = '--no-pseudo --keep-genes -C -l 72'
    }
}
47 changes: 47 additions & 0 deletions modules/local/tests/gffread/testdata/t.gff
@@ -0,0 +1,47 @@
##gff-version 3
###
chr23 AUGUSTUS gene 16515075 16516672 . - . ID=gene19849;description=Protein%20of%20unknown%20function%20%28DUF1635%29
chr23 AUGUSTUS mRNA 16515075 16516597 1 - . ID=gene19849.t1;Parent=gene19849;description=Protein%20of%20unknown%20function%20%28DUF1635%29
chr23 AUGUSTUS exon 16515075 16515794 . - . ID=gene19849.t1.exon1;Parent=gene19849.t1
chr23 AUGUSTUS CDS 16515075 16515794 1 - 0 ID=gene19849.t1.cds1;Parent=gene19849.t1
chr23 AUGUSTUS exon 16516562 16516597 . - . ID=gene19849.t1.exon2;Parent=gene19849.t1
chr23 AUGUSTUS CDS 16516562 16516597 1 - 0 ID=gene19849.t1.cds2;Parent=gene19849.t1
chr23 gmst mRNA 16515075 16516672 . - . ID=gene19849.t2;Parent=gene19849;description=Protein%20of%20unknown%20function%20%28DUF1635%29
chr23 gmst exon 16515075 16515794 50.2 - 0 ID=gene19849.t2.exon1;Parent=gene19849.t2
chr23 gmst CDS 16515075 16515794 50.2 - 0 ID=gene19849.t2.cds1;Parent=gene19849.t2
chr23 gmst exon 16516562 16516672 50.2 - 0 ID=gene19849.t2.exon2;Parent=gene19849.t2
chr23 gmst CDS 16516562 16516672 50.2 - 0 ID=gene19849.t2.cds2;Parent=gene19849.t2
###
chr23 gmst gene 16530414 16531453 . - . ID=gene19850;description=Myb-like%20DNA-binding%20domain
chr23 gmst mRNA 16530414 16531453 . - . ID=gene19850.t1;Parent=gene19850;description=Myb-like%20DNA-binding%20domain
chr23 gmst exon 16530414 16531041 42.7 - 1 ID=gene19850.t1.exon1;Parent=gene19850.t1
chr23 gmst CDS 16530414 16531041 42.7 - 1 ID=gene19850.t1.cds1;Parent=gene19850.t1
chr23 gmst exon 16531197 16531453 42.7 - 0 ID=gene19850.t1.exon2;Parent=gene19850.t1
chr23 gmst CDS 16531197 16531453 42.7 - 0 ID=gene19850.t1.cds2;Parent=gene19850.t1
###
chr23 AUGUSTUS gene 16530414 16531542 . - . ID=gene19851;description=Differing%20isoform%20descriptions
chr23 AUGUSTUS mRNA 16530414 16531542 1 - . ID=gene19851.t1;Parent=gene19851;description=Myb-like%20DNA-binding%20domain
chr23 AUGUSTUS exon 16530414 16530721 . - . ID=gene19851.t1.exon1;Parent=gene19851.t1
chr23 AUGUSTUS CDS 16530414 16530721 1 - 2 ID=gene19851.t1.cds1;Parent=gene19851.t1
chr23 AUGUSTUS exon 16530824 16531041 . - . ID=gene19851.t1.exon2;Parent=gene19851.t1
chr23 AUGUSTUS CDS 16530824 16531041 1 - 1 ID=gene19851.t1.cds2;Parent=gene19851.t1
chr23 AUGUSTUS exon 16531197 16531326 . - . ID=gene19851.t1.exon3;Parent=gene19851.t1
chr23 AUGUSTUS CDS 16531197 16531326 1 - 2 ID=gene19851.t1.cds3;Parent=gene19851.t1
chr23 AUGUSTUS exon 16531428 16531542 . - . ID=gene19851.t1.exon4;Parent=gene19851.t1
chr23 AUGUSTUS CDS 16531428 16531542 1 - 0 ID=gene19851.t1.cds4;Parent=gene19851.t1
chr23 GeneMark.hmm3 mRNA 16531514 16531542 . - . ID=gene19851.t2;Parent=gene19851;description=Hypothetical%20protein%20%7C%20no%20eggnog%20hit
chr23 GeneMark.hmm3 exon 16531514 16531542 . - 0 ID=gene19851.t2.exon1;Parent=gene19851.t2
chr23 GeneMark.hmm3 CDS 16531514 16531542 . - 0 ID=gene19851.t2.cds1;Parent=gene19851.t2
###
chr23 AUGUSTUS gene 16539401 16545431 . + . ID=gene19852;description=nuclease%20HARBI1
chr23 AUGUSTUS mRNA 16539401 16545431 1 + . ID=gene19852.t1;Parent=gene19852;description=nuclease%20HARBI1
chr23 AUGUSTUS exon 16539401 16539509 . + . ID=gene19852.t1.exon1;Parent=gene19852.t1
chr23 AUGUSTUS CDS 16539401 16539509 1 + 0 ID=gene19852.t1.cds1;Parent=gene19852.t1
chr23 AUGUSTUS exon 16544386 16545431 . + . ID=gene19852.t1.exon2;Parent=gene19852.t1
chr23 AUGUSTUS CDS 16544386 16545431 1 + 2 ID=gene19852.t1.cds2;Parent=gene19852.t1
###
chr23 AUGUSTUS gene 16556338 16556796 . + . ID=gene19853;description=Zinc%20finger%20protein
chr23 AUGUSTUS mRNA 16556338 16556796 1 + . ID=gene19853.t1;Parent=gene19853;description=Zinc%20finger%20protein
chr23 AUGUSTUS exon 16556338 16556796 . + . ID=gene19853.t1.exon1;Parent=gene19853.t1
chr23 AUGUSTUS CDS 16556338 16556796 1 + 0 ID=gene19853.t1.cds1;Parent=gene19853.t1
###
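To see why only `gene19851.t2` is expected to drop out, the CDS lengths in the test data can be summed per transcript and compared with the 72-nucleotide cutoff used in the test config. The sketch below is illustrative Python, not part of the PR. It only reproduces the length arithmetic on a subset of the records from `t.gff` and makes no claim about how gffread applies `-l` internally. The helper name `cds_length_by_transcript` is hypothetical.

```python
from collections import defaultdict

# CDS records for gene19851, copied from t.gff (whitespace-separated here)
GFF_LINES = """\
chr23 AUGUSTUS CDS 16530414 16530721 1 - 2 ID=gene19851.t1.cds1;Parent=gene19851.t1
chr23 AUGUSTUS CDS 16530824 16531041 1 - 1 ID=gene19851.t1.cds2;Parent=gene19851.t1
chr23 AUGUSTUS CDS 16531197 16531326 1 - 2 ID=gene19851.t1.cds3;Parent=gene19851.t1
chr23 AUGUSTUS CDS 16531428 16531542 1 - 0 ID=gene19851.t1.cds4;Parent=gene19851.t1
chr23 GeneMark.hmm3 CDS 16531514 16531542 . - 0 ID=gene19851.t2.cds1;Parent=gene19851.t2
""".splitlines()


def cds_length_by_transcript(lines):
    """Sum CDS segment lengths (GFF coordinates are 1-based, inclusive) per parent."""
    totals = defaultdict(int)
    for line in lines:
        cols = line.split()
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";"))
        totals[attrs["Parent"]] += end - start + 1
    return dict(totals)


MIN_NT = 24 * 3  # 72, the value passed to -l in the test config
lengths = cds_length_by_transcript(GFF_LINES)
short = sorted(t for t, n in lengths.items() if n < MIN_NT)
print(lengths)
print(short)
```

`gene19851.t2` has a 29-nucleotide CDS, well under 72, while `gene19851.t1` totals 771 nucleotides. This is consistent with the test's assertions that `t1` is retained and `t2` is removed.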
1 change: 1 addition & 0 deletions nextflow.config
@@ -54,6 +54,7 @@ params {
enforce_full_intron_support = true
filter_liftoff_by_hints = true
eggnogmapper_purge_nohits = false
+ filter_genes_by_aa_length = 24

// Annotation output options
braker_save_outputs = false
7 changes: 7 additions & 0 deletions nextflow_schema.json
@@ -272,6 +272,13 @@
"type": "boolean",
"description": "Purge transcripts which do not have a hit against eggnog",
"fa_icon": "fas fa-question-circle"
+ },
+ "filter_genes_by_aa_length": {
+ "type": "integer",
+ "default": 24,
+ "fa_icon": "fas fa-hashtag",
+ "description": "Filter genes with open reading frames shorter than the specified number of amino acids. If set to `null`, this filter step is skipped.",
+ "minimum": 3
}
}
},
3 changes: 2 additions & 1 deletion pfr/params.json
@@ -32,8 +32,9 @@
"enforce_full_intron_support": true,
"filter_liftoff_by_hints": true,
"eggnogmapper_purge_nohits": false,
+ "filter_genes_by_aa_length": 24,
"braker_save_outputs": false,
- "add_attrs_to_proteins_fasta": false,
+ "add_attrs_to_proteins_cds_fastas": false,
"busco_skip": false,
"busco_lineage_datasets": "embryophyta_odb10"
}
8 changes: 4 additions & 4 deletions subworkflows/local/gff_eggnogmapper.nf
@@ -16,8 +16,8 @@ workflow GFF_EGGNOGMAPPER {
| join(ch_fasta)

GFF2FASTA_FOR_EGGNOGMAPPER(
- ch_gffread_inputs.map { meta, gff, fasta -> [ meta, gff ] },
- ch_gffread_inputs.map { meta, gff, fasta -> fasta }
+ ch_gffread_inputs.map { meta, gff, _fasta -> [ meta, gff ] },
+ ch_gffread_inputs.map { _meta, _gff, fasta -> fasta }
)

ch_gffread_fasta = GFF2FASTA_FOR_EGGNOGMAPPER.out.gffread_fasta
@@ -30,9 +30,9 @@ workflow GFF_EGGNOGMAPPER {
| combine(Channel.fromPath(db_folder))

EGGNOGMAPPER(
- ch_eggnogmapper_inputs.map { meta, fasta, db -> [ meta, fasta ] },
+ ch_eggnogmapper_inputs.map { meta, fasta, _db -> [ meta, fasta ] },
[],
- ch_eggnogmapper_inputs.map { meta, fasta, db -> db },
+ ch_eggnogmapper_inputs.map { _meta, _fasta, db -> db },
[ [], [] ]
)

23 changes: 19 additions & 4 deletions subworkflows/local/gff_merge_cleanup.nf
@@ -1,18 +1,20 @@
  include { AGAT_SPMERGEANNOTATIONS } from '../../modules/nf-core/agat/spmergeannotations/main'
  include { GT_GFF3 } from '../../modules/nf-core/gt/gff3/main'
+ include { GFFREAD as FILTER_BY_ORF_SIZE } from '../../modules/nf-core/gffread/main'
  include { AGAT_CONVERTSPGXF2GXF } from '../../modules/nf-core/agat/convertspgxf2gxf/main'

  workflow GFF_MERGE_CLEANUP {
      take:
      ch_braker_gff // Channel: [ meta, gff ]
      ch_liftoff_gff // Channel: [ meta, gff ]
+     val_filter_by_aa_length // val(null|Integer)

      main:
      ch_versions = Channel.empty()

      ch_gff_branch = ch_braker_gff
          | join(ch_liftoff_gff, remainder:true)
-         | branch { meta, braker_gff, liftoff_gff ->
+         | branch { _meta, braker_gff, liftoff_gff ->
              both : ( braker_gff && liftoff_gff )
              braker_only : ( braker_gff && ( ! liftoff_gff ) )
              liftoff_only: ( ( ! braker_gff ) && liftoff_gff )
@@ -25,12 +27,25 @@ workflow GFF_MERGE_CLEANUP {
      )

      ch_merged_gff = AGAT_SPMERGEANNOTATIONS.out.gff
-         | mix ( ch_gff_branch.liftoff_only.map { meta, braker_gff, liftoff_gff -> [ meta, liftoff_gff ] } )
-         | mix ( ch_gff_branch.braker_only.map { meta, braker_gff, liftoff_gff -> [ meta, braker_gff ] } )
+         | mix ( ch_gff_branch.liftoff_only.map { meta, _braker_gff, liftoff_gff -> [ meta, liftoff_gff ] } )
+         | mix ( ch_gff_branch.braker_only.map { meta, braker_gff, _liftoff_gff -> [ meta, braker_gff ] } )
      ch_versions = ch_versions.mix(AGAT_SPMERGEANNOTATIONS.out.versions.first())

+     // MODULE: GFFREAD as FILTER_BY_ORF_SIZE
+     ch_filter_input = ch_merged_gff
+         | branch {
+             filter: val_filter_by_aa_length != null
+             pass: val_filter_by_aa_length == null
+         }
+
+     FILTER_BY_ORF_SIZE ( ch_filter_input.filter, [] )
+
+     ch_filtered_gff = FILTER_BY_ORF_SIZE.out.gffread_gff
+         | mix ( ch_filter_input.pass )
+     ch_versions = ch_versions.mix(FILTER_BY_ORF_SIZE.out.versions.first())
+
      // MODULE: GT_GFF3
-     GT_GFF3 ( ch_merged_gff )
+     GT_GFF3 ( ch_filtered_gff )

      ch_gt_gff = GT_GFF3.out.gt_gff3
      ch_versions = ch_versions.mix(GT_GFF3.out.versions.first())
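The branch-and-mix pattern in this hunk routes every merged GFF either through `FILTER_BY_ORF_SIZE` or straight on to `GT_GFF3`, depending on whether the threshold is `null`. A rough Python analogue is sketched below. It is not part of the PR: the names `route_gffs`, `filter_fn` and `tag` are hypothetical, and plain lists stand in for Nextflow channels.

```python
def route_gffs(merged_gffs, aa_length, filter_fn):
    """Send items through filter_fn only when a threshold is configured."""
    # Equivalent of the `branch`: the condition does not depend on the item,
    # so all items land in the same branch for a given run.
    to_filter = merged_gffs if aa_length is not None else []
    passthrough = merged_gffs if aa_length is None else []

    filtered = [filter_fn(item) for item in to_filter]

    # Equivalent of the `mix`: recombine both routes for the next step
    return filtered + passthrough


def tag(item):
    # Stand-in for the FILTER_BY_ORF_SIZE process
    meta, gff = item
    return (meta, gff + ".filtered")


gffs = [({"id": "a"}, "a.gff3"), ({"id": "b"}, "b.gff3")]
print(route_gffs(gffs, 24, tag))
print(route_gffs(gffs, None, tag))
```

One detail worth noting: the workflow tests `!= null`, whereas `conf/modules.config` relies on Groovy truthiness. Assuming the caller passes `params.filter_genes_by_aa_length` through unchanged, a value of `0` would run the process with empty arguments. The schema's `minimum` of 3 rules that value out wherever schema validation is applied.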
3 changes: 3 additions & 0 deletions tests/minimal/main.nf.test
@@ -38,6 +38,8 @@ nextflow_pipeline {
['**']
)

+ def summary_stats = (Map) new groovy.json.JsonSlurper().parseText(file("$outputDir/genepal_data/summary_stats.json").text)

assertAll(
{ assert workflow.success},
{ assert snapshot(
@@ -46,6 +48,7 @@ nextflow_pipeline {
'versions': removeNextflowVersion("$outputDir/pipeline_info/genepal_software_mqc_versions.yml"),
'stable paths': stable_path,
'stable names': getRelativePath(stable_name, outputDir),
+ 'summary_stats': summary_stats
]
).match() }
)
30 changes: 23 additions & 7 deletions tests/minimal/main.nf.test.snap
@@ -2,7 +2,7 @@
"profile - test": {
"content": [
{
- "successful tasks": 20,
+ "successful tasks": 21,
"versions": {
"AGAT_CONVERTSPGFF2GTF": {
"agat": "v1.4.0"
@@ -37,6 +37,9 @@
"FASTAVALIDATOR": {
"py_fasta_validator": 0.6
},
+ "FILTER_BY_ORF_SIZE": {
+ "gffread": "0.12.7"
+ },
"FINAL_GFF_CHECK": {
"genometools": "1.6.5"
},
@@ -67,9 +70,9 @@
"stable paths": [
"a_thaliana.cdna.fasta:md5,12b9bef973e488640aec8c04ba3882fe",
"a_thaliana.cds.fasta:md5,b81060419355a590560f92aec8536281",
- "a_thaliana.gt.gff3:md5,8ab16549095f605ff8715ac4a3de58ed",
+ "a_thaliana.gt.gff3:md5,528459cf9596523bf66de99d24c37e20",
"a_thaliana.pep.fasta:md5,4994c0393ca0245a1c57966d846d101e",
- "a_thaliana.gff3:md5,d23d16cd86499d48a30ffb981ed27891",
+ "a_thaliana.gff3:md5,30adac1b21d7aaed6ca7fb71ab33f32d",
"summary_stats.json:md5,007ba5cf2b7a2fd395a27d9458ca2d2e"
],
"stable names": [
@@ -87,13 +90,26 @@
"genepal_report.html",
"multiqc_report.html",
"pipeline_info"
- ]
+ ],
+ "summary_stats": {
+ "stats": [
+ {
+ "ID": "a_thaliana",
+ "Genes": 252,
+ "mRNA": 265,
+ "CDS": 1340,
+ "Exons": 1340,
+ "Intron": 1075,
+ "Non canon splice sites": 18
+ }
+ ]
+ }
}
],
"meta": {
"nf-test": "0.9.2",
- "nextflow": "24.04.2"
+ "nextflow": "24.04.4"
},
- "timestamp": "2024-12-05T07:51:43.818374"
+ "timestamp": "2024-12-12T09:36:52.952048"
}
}
}