TODO.txt

Brainstorming to improve/update GET_HOMOLOGUES:

* Explore hmmpgmd to speed up hmmscan (see hmm_server.py & hmm_mapper.py from eggnog-mapper)

* Speed up/parallelize marfil_homology::blast_parse. This can be done by:
  a) porting the sub to C++
  b) parsing in parallel, splitting the (sorted) infile in chunks without breaking consecutive HSPs
  Read: http://blogs.perl.org/users/kirk_kimmel/2012/09/text-processing-divide-and-conquer.html
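
A minimal Python sketch of option b), only to illustrate the chunking rule. It assumes
tab-separated BLAST lines already sorted so that HSPs of the same query/subject pair
are consecutive, and that columns 1 and 2 hold the query and subject ids. The names
hsp_key and split_in_chunks are hypothetical and not part of marfil_homology.

```python
from itertools import groupby


def hsp_key(line):
    """Return (query, subject) from a tab-separated BLAST line.

    Assumption: query and subject ids are in columns 1 and 2.
    """
    cols = line.rstrip("\n").split("\t")
    return cols[0], cols[1]


def split_in_chunks(lines, target_size):
    """Yield lists of lines of roughly target_size lines each.

    A run of consecutive lines sharing the same (query, subject) key is
    never split, so a chunk may exceed target_size when a run is long.
    """
    chunk = []
    for _, run in groupby(lines, key=hsp_key):
        run = list(run)
        # close the current chunk before it would grow past the target
        if chunk and len(chunk) + len(run) > target_size:
            yield chunk
            chunk = []
        chunk.extend(run)
    if chunk:
        yield chunk
```

Each chunk could then be handed to a separate worker (for instance with
multiprocessing.Pool.map) and the per-chunk results concatenated in order.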

* Print haplotypes of clusters (ref or !ref) with parse_pangenome_clusters.pl 

* Script to produce VCF/MFA after parsing sorted DNA clusters

* Compute tree determinants of PGM

* Add taxon names to tmp/blast.bpo file 

* Improve graphical output and possibilities
  + https://bmcgenomics.biomedcentral.com/track/pdf/10.1186/1471-2164-15-8
  + http://higlass.io
  + https://legumeinfo.org/gcv2/instructions


The ideas below relate to synteny/gene collinearity; they were instead considered for the
development of get_pangenes.pl at https://github.com/Ensembl/plant-scripts

  * Add minimap2 as alternative to BLASTN https://bioinfoperl.blogspot.com/2018/08/minimap2-vs-blastn.html

  * Print clusters in pangenome matrix (PGM) sorted by their (consensus) physical/genetic position
    + compare_clusters.pl -O gene_order (can be subset of genes) to produce sorted submatrix

  * Add flag to take into account synteny/gene order in input FASTA files for EST
    + read in .gff/.gtf file (3) matching ids in .fna (1) and .faa (2), save contig name and order
     * must all input species have GFF, or should badly broken assemblies be allowed?
     * >Bradi1g55470.1 pacid=43943876 polypeptide=Bradi1g55470.1.p locus=Bradi1g55470 ID=Bradi1g55470.1.v3.2 annot-version=v3.2
     * Bd1      phytozomev13    gene    10581   11638   .       +       .       ID=Bradi1g00200.v3.2;Name=Bradi1g00200;ancestorIdentifier=Bradi1g00200.v3.1
     * Bd1      phytozomev13    mRNA    10581   11638   .       +       .       ID=Bradi1g00200.1.v3.2;Name=Bradi1g00200.1;pacid=43953538;....
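
A minimal Python sketch of the "save contig name and order" step, assuming a
tab-separated GFF3 file in which the mRNA ID attribute matches the ID= field of the
FASTA headers, as in the example lines above. The function name gene_order_from_gff
is hypothetical.

```python
def gene_order_from_gff(gff_lines, feature="mRNA"):
    """Map each feature ID to (contig, rank).

    rank is the 0-based position of the feature along its contig after
    sorting by start coordinate. Only lines whose third column equals
    `feature` are considered.
    """
    per_contig = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        # attributes look like ID=...;Name=...; fields without '=' are skipped
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs:
            per_contig.setdefault(cols[0], []).append((int(cols[3]), attrs["ID"]))

    order = {}
    for contig, feats in per_contig.items():
        for rank, (_, feat_id) in enumerate(sorted(feats)):
            order[feat_id] = (contig, rank)
    return order
```

Sequences in the .fna/.faa files whose id is missing from the returned dictionary
would have no positional information, which is where the question about broken
assemblies comes in.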

    + for pairs i,j of sequence sets with GFF, remove non-syntenic BLAST hits
     * bin/DAGchainer_r020608
     * take raw, compressed BLAST output and reformat for DAGchainer, eliminating redundant hits (filter_repetitive_matches.pl)
     * DP match ids_i to ids_j considering N neighbors and tolerating missing or duplicated genes
     * produce synt-filtered raw, compressed BLAST output

    + Search for syntenic segments with BLASTN from sorted FASTA files; benchmark with Brassicas?

    + min quality of assembly N50 1Mbp (C.elegans):
     * https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5791376/
     * https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0092621

    + MCScanX
     * https://academic.oup.com/nar/article/40/7/e49/1202057
     * https://github.com/wyp1125/MCScanX
     * Panicum (GENESPACE, synteny-constrained BDBH)
       - https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-07669-x/MediaObjects/41467_2018_7669_MOESM1_ESM.docx
       - We find collinear blocks of genes from culled BLASTp hits, via MCScanX using the following parameters:
         -s 5 (min block size) -m 10 (maximum number of gaps) -a (only output collinearity file).
         We then merge overlapping MCScanX blocks, so that no block exists entirely (or partially)
         within the xy gene-position rank space of another block; blocks overlapping by > 1 ordered hit are merged.
     * Angiosperms microsynteny
       - https://www.nature.com/articles/s41467-021-23665-0
       - MCScanX which was used for pairwise synteny block detection. Parameter settings of MCScanX for angiosperms have been
         tested and compared before (https://github.com/zhaotao1987/SynNet-Pipeline); here we adopt 'b5s5m25'
         (b: number of top homologous pairs, s: number of minimum matched syntenic anchors, m: number of max gene gaps),
         which has proven to be appropriate by various studies for the evolutionary distances among angiosperm genomes.
         To avoid large numbers of local collinear gene pairs due to tandem arrays, if consecutive homologs (up to five genes apart)
         share a common gene, homologs are collapsed to one representative pair (with the smallest E-value).
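
The tandem-array collapsing described in the last quote could be sketched in Python as
follows. This is only one reading of it and not the SynNet-Pipeline code: it assumes each
hit carries the subject's contig and gene rank, it treats "up to five genes apart" as a
rank difference of at most 5 between consecutive subjects, and it collapses only hits
that share the same query gene. The function name collapse_tandem_hits is hypothetical.

```python
def collapse_tandem_hits(hits, max_gap=5):
    """Collapse tandem hits that share a query gene.

    hits: iterable of (query, subject, subject_contig, subject_rank, evalue).
    For each query, subjects on the same contig whose ranks are at most
    max_gap apart from the previous subject are treated as one tandem
    array, and only the hit with the smallest E-value is kept.
    """
    by_query = {}
    for hit in hits:
        by_query.setdefault((hit[0], hit[2]), []).append(hit)

    kept = []
    for group in by_query.values():
        group.sort(key=lambda h: h[3])  # sort by subject rank
        best = group[0]
        prev_rank = group[0][3]
        for hit in group[1:]:
            if hit[3] - prev_rank <= max_gap:
                # same tandem array: keep the better E-value
                if hit[4] < best[4]:
                    best = hit
            else:
                # gap too large: close the array and start a new one
                kept.append(best)
                best = hit
            prev_rank = hit[3]
        kept.append(best)
    return kept
```

A symmetric pass with query and subject swapped would be needed to also collapse
arrays on the query side.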

    + liftoff https://github.com/agshumate/Liftoff