Added liftoff, integrated with pfr/nxf-modules, and some minor improvements #2

Merged 59 commits on Jan 11, 2024

Commits
ee9d0b9
Added updated info about nf-core modules
GallVp Nov 6, 2023
7f03697
Turned off SortMeRNA by default
GallVp Nov 7, 2023
232493a
Decouple target assemblies and read qc/align
GallVp Nov 8, 2023
4f2ed8e
A bit of reformatiing
GallVp Nov 9, 2023
8b83c61
Started implementing liftoff
GallVp Nov 9, 2023
37ce74e
Checkpoint before major reshuffle
GallVp Nov 10, 2023
d148f18
Reformatted local modules
GallVp Nov 10, 2023
e63e22f
Now using galaxy containers
GallVp Nov 10, 2023
1978308
Extracted some subworkflows
GallVp Nov 10, 2023
10a0158
Extracted a few subworkflows
GallVp Nov 12, 2023
2031790
Extracted subworkflows uptill BRAKER3
GallVp Nov 13, 2023
f315467
Inc liftoff
GallVp Nov 13, 2023
f10ae94
Added polished out channel to liftoff
GallVp Nov 13, 2023
80db270
Added gffread before liftoff
GallVp Nov 13, 2023
bb9b8b0
Updated flowchart
GallVp Nov 13, 2023
c982946
Added liftoff options
GallVp Nov 15, 2023
c8ff8da
Started moving to nf-core/tools
GallVp Nov 21, 2023
f285461
Reimported modules using nf-core/tools
GallVp Nov 21, 2023
bea59ef
Reimported kherronism modules with nf-core/tools
GallVp Nov 22, 2023
2dda752
Updated braker3
GallVp Nov 22, 2023
8efa34d
Updated repeatmasker
GallVp Nov 22, 2023
8d31976
Updated modules
GallVp Nov 22, 2023
5eaa87b
Imported fastavalidate and liftoff from pfr/nxf-modules
GallVp Nov 22, 2023
0a96c7f
Updated modules and subworkflows
GallVp Dec 12, 2023
8a6c5fe
Added EDTA from pfr/nxf-modules
GallVp Dec 12, 2023
26c33fa
Updated config
GallVp Dec 18, 2023
9e58314
Integrated fastavalidator
GallVp Dec 18, 2023
4534684
Added patch for star/genomegenerate
GallVp Dec 18, 2023
ec7ffc1
Incorporated fasta_edta_lai
GallVp Dec 19, 2023
2de0d22
Trying to add FASTQ_FASTQC_UMITOOLS_FASTP
GallVp Dec 20, 2023
48e7271
Updated modules and applied prettier
GallVp Dec 20, 2023
c526933
FASTP now has stub
GallVp Dec 20, 2023
ac45bbb
SORTMERNA now has stub
GallVp Dec 20, 2023
ed6aa33
Cleaned up prepare_assembly
GallVp Dec 21, 2023
144edb2
Cleaned up preprocess_rnaseq
GallVp Dec 21, 2023
a79c187
Reformatted and inc ALIGN_RNASEQ
GallVp Dec 21, 2023
6f61f5d
Cleaned up and inc PREPARE_EXT_PROTS
GallVp Dec 21, 2023
1184795
Cleanedup BRAKER3
GallVp Dec 21, 2023
a9f1fc6
Updated fastp, sortmerna and liftoff
GallVp Dec 21, 2023
d0faf8d
Cleaned up fasta_liftoff
GallVp Dec 21, 2023
27a1293
Updated fasta_edta_lai
GallVp Jan 7, 2024
3867ed5
Added script for local stub run
GallVp Jan 7, 2024
4761491
Samplesheet now accepts relative paths
GallVp Jan 7, 2024
8ced027
Updated modules
GallVp Jan 7, 2024
fce65a0
Removed -exclude_partial and updated flowchart
GallVp Jan 7, 2024
582f3fa
Separated test config for local and pfr
GallVp Jan 8, 2024
e213bd3
Fixed local script typo
GallVp Jan 8, 2024
c961ab0
Fixed apptainer scope bug in base config
GallVp Jan 8, 2024
260c706
Updated README
GallVp Jan 8, 2024
565e7d7
Readded -exclude_partial and now using teambraker container
GallVp Jan 8, 2024
47b7c40
Added config for test data and quay.io container for braker3
GallVp Jan 9, 2024
664fba1
Now using repeatmodeler by default
GallVp Jan 9, 2024
dea4bb5
BRAKER3 now runnable with test data
GallVp Jan 9, 2024
0535b1a
Added editor config
GallVp Jan 10, 2024
457a643
Disabled sortmerna by default added option to save cat bam
GallVp Jan 10, 2024
784bb54
Added pre-commit
GallVp Jan 10, 2024
a4ada59
Updated manifest
GallVp Jan 10, 2024
4de72ab
Fixed linting errors
GallVp Jan 11, 2024
12968f4
Updated base config for docker
GallVp Jan 11, 2024
15 changes: 15 additions & 0 deletions .editorconfig
@@ -0,0 +1,15 @@
root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true
indent_size = 4
indent_style = space

[*.{md,yml,yaml,cff}]
indent_size = 2

[*.nf.test]
insert_final_newline = false
26 changes: 8 additions & 18 deletions .gitignore
@@ -1,26 +1,16 @@
.DS_Store

*.pyc
__pycahce__

nextflow
.nextflow*
work/
*.dot

Results/
results/
report/
Report/

*.log
.nfs*

*.sif
.DS_Store
*.code-workspace
.screenrc
.*.sw?
__pycache__
*.pyo
*.pyc

pan_gene_slurm.sh
*.stdout
*.stderr

.literature
.test
pangene-test/
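
Note: the updated ignore list relies on glob patterns such as `.*.sw?` and `*.pyo`. As a rough illustration only, the sketch below uses Python's `fnmatch` to show which base names those patterns would catch. `fnmatch` only approximates gitignore matching (directory rules such as `work/` are left out), and the file names are made up for the example.

```python
from fnmatch import fnmatchcase

# File-name patterns taken from the updated .gitignore (directory rules omitted).
PATTERNS = [
    ".nextflow*",
    ".DS_Store",
    "*.code-workspace",
    ".*.sw?",
    "*.pyo",
    "*.pyc",
    "*.stdout",
    "*.stderr",
]


def is_ignored(basename: str) -> bool:
    """Return True when any pattern matches the given base name."""
    return any(fnmatchcase(basename, pattern) for pattern in PATTERNS)


# Hypothetical file names, not taken from the repository.
for name in [".main.nf.swp", ".nextflow.log", "job.stderr", "main.nf"]:
    print(name, is_ignored(name))
```

The `.*.sw?` pattern is the one most likely to surprise: it targets hidden editor swap files (a leading dot, any stem, then `.sw` plus one character), so ordinary source files are unaffected.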
1 change: 1 addition & 0 deletions .nf-core.yml
@@ -0,0 +1 @@
repository_type: pipeline
5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,5 @@
repos:
- repo: https://github.com/pre-commit/mirrors-prettier
rev: "v3.1.0"
hooks:
- id: prettier
19 changes: 19 additions & 0 deletions .prettierignore
@@ -0,0 +1,19 @@
includes/Maven_Pro/

# gitignore
.nextflow*
work/
results/
.DS_Store
*.code-workspace
.screenrc
.*.sw?
__pycache__
*.pyo
*.pyc

*.stdout
*.stderr

.literature
pangene-test/
1 change: 1 addition & 0 deletions .prettierrc.yml
@@ -0,0 +1 @@
printWidth: 120
143 changes: 67 additions & 76 deletions README.md
@@ -1,105 +1,95 @@
# PAN-GENE
# PANGENE

A NextFlow pipeline for pan-genome annotation.

## Pipeline Flowchart

```mermaid
flowchart TD
ribo_db((ribo_db))
SAMPLESHEET((samples))
TE_LIBRARIES(("[te_libs]"))
TARGET_ASSEMBLIES(("[assemblies]"))
EXTERNAL_PROTEIN_SEQS(("[ext_prots]"))

GUNZIP_PROT[GUNZIP]
GUNZIP_TE[GUNZIP]
SKIP_EDTA{Skip EDTA}
pend((dev))

TE_LIBRARIES --> GUNZIP_TE
GUNZIP_TE --> SKIP_EDTA

TARGET_ASSEMBLIES --> GUNZIP
GUNZIP --> FASTA_VALIDATE
FASTA_VALIDATE --> FASTA_PERFORM_EDTA
FASTA_VALIDATE --> SKIP_EDTA

SKIP_EDTA --> REPEATMASKER
FASTA_PERFORM_EDTA --> REPEATMASKER
REPEATMASKER --> STAR_GENOMEGENERATE

SAMPLESHEET --> SAMPLESHEET_CHECK
SAMPLESHEET_CHECK --> |Technical replicates|CAT_FASTQ
CAT_FASTQ --> FASTQC
SAMPLESHEET_CHECK --> FASTQC
FASTQC --> FASTP

ribo_db --> SORTMERNA
FASTP --> SORTMERNA
SORTMERNA --> STAR_ALIGN
STAR_GENOMEGENERATE --> STAR_ALIGN
STAR_ALIGN --> GROUP_BY_ASSEMBLY([Group by assembly])
GROUP_BY_ASSEMBLY --> SAMTOOLS_CAT
SAMTOOLS_CAT --> |RNASeq bam|BRAKER3

REPEATMASKER --> BRAKER3

EXTERNAL_PROTEIN_SEQS --> GUNZIP_PROT
GUNZIP_PROT --> CAT
CAT --> BRAKER3

BRAKER3 --> pend

subgraph Params
subgraph PrepareAssembly [ ]
TARGET_ASSEMBLIES
TE_LIBRARIES
SAMPLESHEET
ribo_db
EXTERNAL_PROTEIN_SEQS
end

subgraph GenomePrep
GUNZIP
FASTA_VALIDATE
GUNZIP_TE
FASTA_PERFORM_EDTA
SKIP_EDTA
fasta_file_from_fasta_validate
EDTA
REPEATMODELER
te_lib_absent_node
REPEATMASKER
STAR_GENOMEGENERATE
end

subgraph Braker
CAT
GUNZIP_PROT
BRAKER3
end

subgraph SamplePrep
SAMPLESHEET_CHECK
TARGET_ASSEMBLIES(["[target_assemblies]"])
TE_LIBRARIES(["[te_libs]"])
TARGET_ASSEMBLIES --> FASTA_VALIDATE
FASTA_VALIDATE --- |Fasta|fasta_file_from_fasta_validate(( ))
fasta_file_from_fasta_validate --> |or|EDTA
fasta_file_from_fasta_validate --> |default|REPEATMODELER
REPEATMODELER --- te_lib_absent_node(( ))
EDTA --- te_lib_absent_node
TE_LIBRARIES --> REPEATMASKER
te_lib_absent_node --> REPEATMASKER

subgraph Samplesheet [ ]
SAMPLESHEET
CAT_FASTQ
FASTQC
FASTP
FASTP_FASTQC
SORTMERNA
STAR_ALIGN
GROUP_BY_ASSEMBLY
fasta_file_for_star
STAR
SAMTOOLS_CAT
end

style Params fill:#00FFFF21,stroke:#00FFFF21
style GenomePrep fill:#00FFFF21,stroke:#00FFFF21
style SamplePrep fill:#00FFFF21,stroke:#00FFFF21
style Braker fill:#00FFFF21,stroke:#00FFFF21
SAMPLESHEET([samplesheet])
SAMPLESHEET --> |Tech. reps|CAT_FASTQ
CAT_FASTQ --> FASTQC
SAMPLESHEET --> FASTQC
FASTQC --> FASTP
FASTP --> FASTP_FASTQC[FASTQC]
FASTP_FASTQC --> SORTMERNA
fasta_file_for_star(( ))
fasta_file_for_star --> |Fasta|STAR
SORTMERNA --> STAR
STAR --> SAMTOOLS_CAT

subgraph Annotation [ ]
anno_fasta(( ))
anno_masked_fasta(( ))
anno_bam(( ))
EXTERNAL_PROTEIN_SEQS(["[ext_prots]"])
XREF_ANNOTATIONS(["[xref_annotations]"])
CAT
BRAKER3
GFFREAD
LIFTOFF
end

PrepareAssembly --> |Fasta, Masked fasta|Annotation
Samplesheet --> |RNASeq bam|Annotation

XREF_ANNOTATIONS --> |xref_gff|GFFREAD
XREF_ANNOTATIONS --> |xref_fasta|LIFTOFF
GFFREAD --> LIFTOFF
anno_fasta --> |Fasta|LIFTOFF

EXTERNAL_PROTEIN_SEQS --> CAT
anno_masked_fasta --> |Masked fasta|BRAKER3
anno_bam --> |RNASeq bam|BRAKER3
CAT --> BRAKER3

style Samplesheet fill:#00FFFF21,stroke:#00FFFF21
style PrepareAssembly fill:#00FFFF21,stroke:#00FFFF21
style Annotation fill:#00FFFF21,stroke:#00FFFF21
```

## Plant&Food Users

Configure the pipeline by modifying `nextflow.config` and submit to SLURM for execution.

```bash
sbatch ./pan_gene_pfr.sh
sbatch ./pangene_pfr
```


## Third-party Sources

Some software components of this pipeline have been adopted from following third-party sources:
@@ -112,5 +102,6 @@ Some software components of this pipeline have been adopted from following third
>
> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x).

2. rewarewaannotation [MIT](https://github.com/kherronism/rewarewaannotation/blob/master/LICENSE): https://github.com/kherronism/rewarewaannotation
3. assembly_qc [GPL-3.0](https://github.com/Plant-Food-Research-Open/assembly_qc/blob/main/LICENSE): https://github.com/Plant-Food-Research-Open/assembly_qc
2. nf-core/rnaseq [MIT](https://github.com/nf-core/rnaseq/blob/master/LICENSE): https://github.com/nf-core/rnaseq
3. rewarewaannotation [MIT](https://github.com/kherronism/rewarewaannotation/blob/master/LICENSE): https://github.com/kherronism/rewarewaannotation
4. assembly_qc [GPL-3.0](https://github.com/Plant-Food-Research-Open/assembly_qc/blob/main/LICENSE): https://github.com/Plant-Food-Research-Open/assembly_qc
25 changes: 21 additions & 4 deletions TODO.md
@@ -1,4 +1,21 @@
- [ ] Rename perform_edta_annotation to FASTA_PERFORM_EDTA
- [ ] Extract subworkflows
- [ ] STAR ignores softmasking and, thus, should be fed the unmasked genome so that masking and mapping can run in parallel.
- [ ] Add --eval=reference.gtf
- [ ] Add --eval=reference.gtf
- [ ] From Ross regarding post-processing:

> [9:49 am] Ross Crowhurst
> Here is an easy one: BLATSp vs swissprot & Arabidpsis and check query is with set thresholds of reference - if so accept; If not move to BLASTp vs Uniref90 or Refeq (or some other predetermined model species) - same deal accept if within threshold limits. Else BLASTn of cds vs NCBI nt (really scrapping the bottom of the barrel here). If not a hit to anything then chances are its garbage and should be removed. Some ppl might try to claim its a unique protein to the genotype but in 20 years I have never seen one of those be supported - mostly this category is garbage. The screen agains NCBI nt also assists to classify "bits" as well retroposonss etc. Idea being you want to remove garbage predictions - as this does take time you can see why some papers just filter out by size.

- [ ] From Cecilia:

> https://github.com/zhaotao1987/SynNet-Pipeline

- [ ] From Ross:

> https://www.biorxiv.org/content/10.1101/096529v2.full.pdf

- [ ] Sort out EDTA testing

- Mib finder, eggnog, blastp against TAIR and uniprot (Wait)
- entap to merge (Wait)
- trinity and PASA + StringTie2 -> Evigene (Do)
- othrofinder paper
- gffcompre on braker and liftoff
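
Note: the post-processing idea quoted from Ross in this TODO is a tiered screen. Each predicted gene is searched against progressively broader databases, it is accepted at the first tier where the hit falls within the thresholds, and it is dropped if no tier supports it. A minimal Python sketch of that control flow follows. The `Tier` class, the `passes` callback, and the tier names are hypothetical stand-ins; no BLAST calls or threshold values are implied by this PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Tier:
    """One search stage, for example BLASTp against Swiss-Prot."""

    name: str
    # Hypothetical callback: True when the gene's hit in this database
    # falls within the chosen thresholds.
    passes: Callable[[str], bool]


def classify(gene_id: str, tiers: Sequence[Tier]) -> Optional[str]:
    """Return the name of the first tier that accepts the gene, else None."""
    for tier in tiers:
        if tier.passes(gene_id):
            return tier.name
    return None


def filter_predictions(
    gene_ids: Iterable[str], tiers: Sequence[Tier]
) -> Tuple[Dict[str, str], List[str]]:
    """Split predictions into kept (with supporting tier) and dropped."""
    kept: Dict[str, str] = {}
    dropped: List[str] = []
    for gene_id in gene_ids:
        tier_name = classify(gene_id, tiers)
        if tier_name is None:
            dropped.append(gene_id)
        else:
            kept[gene_id] = tier_name
    return kept, dropped
```

Recording which tier supported each gene, rather than returning a bare keep/drop flag, would make it easier to review the weakly supported predictions later.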
2 changes: 1 addition & 1 deletion assets/rrna-db-defaults.txt
@@ -5,4 +5,4 @@ https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/s
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-16s-id90.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-bac-23s-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-18s-id95.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-28s-id98.fasta
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-28s-id98.fasta
1 change: 1 addition & 0 deletions assets/rrna-db-test.txt
@@ -0,0 +1 @@
https://raw.githubusercontent.com/biocore/sortmerna/v4.3.4/data/rRNA_databases/silva-euk-28s-id98.fasta
2 changes: 1 addition & 1 deletion bin/make-samplesheet.py
@@ -282,4 +282,4 @@ def main():
make_samplesheet_from_command(input_path_or_command, exp_name)

if __name__ == "__main__":
main()
main()
2 changes: 1 addition & 1 deletion cleanNXF.sh
@@ -8,4 +8,4 @@ for i in $(ls work | grep -v "conda");
do
rm -rf "work/$i"
done
echo "Cleaned work..."
echo "Cleaned work..."
40 changes: 21 additions & 19 deletions conf/base.config
@@ -1,19 +1,33 @@
profiles {
slurm {
pfr {
process {
executor = 'slurm'
}

apptainer {
envWhitelist = 'APPTAINER_BINDPATH,APPTAINER_BIND'
}
}

local {
process {
executor = 'local'
}
}

apptainer {
apptainer.enabled = true
apptainer.autoMounts= true
apptainer.registry = 'quay.io'
}

docker {
docker.enabled = true
docker.runOptions = '-u $(id -u):$(id -g) --platform=linux/amd64'
docker.registry = 'quay.io'
}
}

// Source: https://github.com/nf-core/rnaseq
// License: https://github.com/nf-core/rnaseq/blob/master/LICENSE
process {

cpus = { check_max( 1 * task.attempt, 'cpus' ) }
@@ -24,12 +38,6 @@ process {
maxRetries = 1
maxErrors = '-1'

// Process-specific resource requirements
// NOTE - Please try and re-use the labels below as much as possible.
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
// If possible, it would be nice to keep the same label naming convention when
// adding in your local modules too.
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
withLabel:process_single {
cpus = { check_max( 1 , 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
@@ -53,17 +61,13 @@ process {
withLabel:process_long {
time = { check_max( 20.h * task.attempt, 'time' ) }
}
withLabel:process_week_long {
time = { check_max( 7.days * task.attempt, 'time' ) }
}
withLabel:process_high_memory {
memory = { check_max( 200.GB * task.attempt, 'memory' ) }
}
}

singularity {
enabled = true
autoMounts = true
withName:CUSTOM_DUMPSOFTWAREVERSIONS {
cache = false
}
}

nextflow {
@@ -72,8 +76,6 @@ nextflow {
}
}

// Source: https://github.com/nf-core/rnaseq
// License: https://github.com/nf-core/rnaseq/blob/master/LICENSE
def check_max(obj, type) {
if (type == 'memory') {
try {
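
Note: the `process` block in `conf/base.config` requests resources with calls such as `check_max( 6.GB * task.attempt, 'memory' )`, and the body of `check_max` is cut off in this view. The sketch below is a Python illustration of what the call sites suggest: scale the request with the retry attempt, then cap it at a configured ceiling. The 72 GB ceiling and the `request_for_attempt` helper are assumptions made for the example, not values or names from this PR.

```python
def check_max(requested: float, ceiling: float) -> float:
    """Cap a resource request at the configured ceiling."""
    return min(requested, ceiling)


def request_for_attempt(base: float, attempt: int, ceiling: float) -> float:
    """Scale the base request by the attempt number, then cap it."""
    return check_max(base * attempt, ceiling)


MAX_MEMORY_GB = 72  # assumed ceiling, for illustration only

# The config sets maxRetries = 1, so a task runs as attempt 1 and,
# after one failure, as attempt 2.
for attempt in (1, 2):
    small = request_for_attempt(6, attempt, MAX_MEMORY_GB)
    large = request_for_attempt(200, attempt, MAX_MEMORY_GB)
    print(f"attempt {attempt}: 6 GB base -> {small} GB, 200 GB base -> {large} GB")
```

Under these assumptions a label with a small base request doubles on retry, while a label whose base already exceeds the ceiling is held at the ceiling on every attempt.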