To do
xonq edited this page Aug 19, 2024
2 revisions
- Conserved log class
- Must be capable of determining if the run is congruent or new
- Annotate code (all)
- add type constraints to functions
- Conform old scripts to PEP8 (all)
- [o] Build all-in-one stable conda package
- Transition code-base to Rust
- implement kon_log class, paying attention to verbosity control
- kon_log class output to local configuration directory
- support compressed databases
- register mycotools with NCBI
- configuration to output mtdb name + date in output directory
- discuss multigene phylo pipeline
- discuss db2hgs
- uniform argument parser
- allow for inputting multiple DBs with full paths
- robust conda update checks
- Outgroup manager for clusters that fit within min and maximum sequences
- Percent positives filter
- Integrate agglomerative clustering
- Allow for inputting a specific run order
- Log-based resume
- Do not reiterate running a gene in the same homology group
- Allow converting HG runs' names
- Better root inference
- Assembly query method, i.e. through tblastn
- Allow changing the clustering variable
- locus output using percent similarity
- pseudogenes can have RNAs, and CDSs from NCBI may reference those pseudogene parents or their RNAs (GCA_900074715.1_LAME0)
- some pseudogenes fail because they are given an "Alias" without being completed (GCA_004920355.1)
- make universal interface to remove need for source column
- allow including entries that cannot be hierarchically assimilated into genes or transcripts
- build to universally include genes and transcript-assimilated types
- implement db2search to identify NSCHGs best-hits
- implement an automated NSCHG extraction based on minimum gene #
- Allow log removal
- Distinguish between nt and aa mmseqs dbs
- Allow for blastdb construction
- Streamline mmseqs parsing
- mmseqs save db option
- profile mmseqs search
- concatenate mmseqs query dbs
- optional fail upon any failures
- Log hmmer runs
- nhmmer option
- create all outputs as temp files and move when complete
- extract covered portion of hits
- max hits post blast compilation
- hsp option
- Vectorize MTDB class
- make mtdb compiled class
- remove Entrez email login, simplify API access
- Get taxonomy of non-genus names
- get taxonomy XML - if it exists - instead of independent queries
- allow for a lineage list from command line (may already be integrated)
- stdin argument input
- Fix when lineages have multiple ranks, e.g. Tremellales sp. will be extracted from Tremellales input, when the order is likely what's requested
- sort log by default, and only unique run parameters
- percent positive mode
- integrate MCL
- rerun aggclus on new data
- Move from extracthmm to simplified output parsing
- Implement fa2clus
- ignore non-fasta inputs
- take to phylogenomic tree from db2hgs
- find a prettier way to create SVGs
- parse for in-gene coordinates and annotations
- create a single file output option for multiple inputs
- remove gff v gff3 option
- delete database feature
- fix local password encryption
- overwrite old password
- move database feature
- archive and unarchive genomes
- remove logfiles as part of clearing the cache
- add combine DB option
- add a log option of connected MTDBs
- remove standalone scripts from PATH
- look for old ome versions in query
- add a version querying option
- add an option to query log of analyses
- db check to ensure log is relevant to input
- convert downloading to NCBI datasets
- add strain parsing from within GenBank records for entries that don't have an obvious strain entry
- source to reference the annotation source/project name
- integrate prokka/bakta
- error check FAA
- allow for just assembly accession in known sources
- allow inputting GBK
- optimize dereplication, currently too slow
- initial JGI predb2mtdb fails because assemblyPath doesn't exist as a column, but restarts are fine
- allow updating from Predb.tsv immediately
- update introduction output
- need a verbose option
- reversion option
- reference a manually curated duplicate check
- prohibit specific IDs implementation
- finish --save
- singular strain download option
- pull failed JGI downloads from NCBI
- remove overlap when rerunning failed genomes
- central MTDB repository and reference option
- Improve MD5 check efficiency (update_mtdb)
- print organism name with genome accession
- don't remove files until after predb2mtdb (requires update_mtdb specific function)
- Need a manually curated file to correct errors in naming, e.g. Vararia v Vavraia, Fibularhizoctonia v Fibulorhizoctonia
- initialize from a predb
- option to remove entries that have been removed from genbank
- option to not dereplicate by genus and species alone
- main MTDB files for prokaryotes and fungi uploaded and that can be parsed
- add option to update taxonomy of existing entries
- sp. will also not dereplicate
- make add option check for overwriting entries (indicating incorrect PREDB linkage)
- ensure non-JGI/NCBI sources are overlooked: assembly accessions from non-JGI/NCBI sources must not be included in download
- still use a redundancy check when --ncbi_only is specified, or prevent changing between NCBI only and non-ncbi database
- acquire strain metadata and submitter organization from ncbi datasets
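
One item above, "create all outputs as temp files and move when complete", could be approached along the following lines. This is only a minimal sketch under assumptions: `write_atomically` is a hypothetical helper name, not an existing mycotools function, and it handles a single text output using only the Python standard library.

```python
import os
import tempfile


def write_atomically(final_path, text):
    """Write text to a temp file next to final_path, then move it into place.

    Hypothetical helper for illustration; not part of mycotools.
    """
    out_dir = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file in the destination directory so the final
    # rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # os.replace overwrites any existing file at final_path
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the partial temp file so a failed run leaves no debris
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern an interrupted run never leaves a half-written file under the final name, which may also simplify any log-based resume check, since a present output can be treated as a complete one.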