Generate ML features with Galaxy #30

paulzierep · 2025-01-16T11:07:57Z

Check the paper: https://www.biorxiv.org/content/10.1101/2024.10.22.619569v2.full
Add kmer tool to Galaxy (https://github.com/raw-lab/mercat2)
Analyze MGnify Genomes
Correlate features with peak memory

paulzierep · 2025-02-19T13:18:21Z

@SantaMcCloud can you wrap: https://github.com/raw-lab/mercat2 ?

SantaMcCloud · 2025-02-19T13:21:07Z

Yes, I will do it quickly this weekend!

SantaMcCloud · 2025-02-24T00:18:10Z

galaxyproject/tools-iuc#6791

SantaMcCloud · 2025-02-24T12:13:01Z

@paulzierep it seems that mercat2 may have some bugs when run via Docker, so it might take a while to finish the wrapper. I opened an issue to check whether the errors are genuine or whether there is anything that needs to be fixed: raw-lab/mercat2#14

paulzierep · 2025-02-26T06:59:33Z

Maybe we can use this tool instead: https://github.com/refresh-bio/KMC
The main point is that we need a feature table based on the k-mers as input.

paulzierep · 2025-02-26T07:00:44Z

But we need to add it to bioconda

paulzierep · 2025-02-26T07:02:06Z

Correction, it is already there: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/kmc/meta.yaml
Let's use that one. Sorry for the extra work @SantaMcCloud, we can still include the effort in the report! Thanks!

paulzierep · 2025-02-26T07:04:01Z

Maybe we could apply the diversity estimation of mercat2 to the k-mers produced by KMC; that should be doable.
Basically, we would need to apply this function (from skbio.diversity import alpha as skbio_alpha) to the KMC counts.
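
To make the idea concrete, here is a minimal sketch of computing alpha diversity from a table of k-mer counts. It is written in plain Python so that it runs without extra dependencies; scikit-bio's skbio.diversity.alpha module offers comparable metrics (for example shannon and simpson). Which metrics mercat2 actually reports is an assumption here, and the counts below are made-up illustration values, not real KMC output.

```python
import math


def shannon(counts, base=2.0):
    """Shannon entropy H = -sum(p_i * log(p_i)) over the non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * (math.log(p) / math.log(base))
    return entropy


def simpson(counts):
    """Simpson's index 1 - sum(p_i^2)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical k-mer counts (illustration only, not real KMC output).
kmer_counts = {"AAAAA": 4, "AAAAC": 4, "AAAAG": 4, "AAAAT": 4}
values = list(kmer_counts.values())
print("shannon:", shannon(values))
print("simpson:", simpson(values))
```

With four equally frequent k-mers the Shannon entropy in base 2 is 2 bits and Simpson's index is 0.75, which is a quick sanity check for the formulas.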

SantaMcCloud · 2025-02-26T07:48:01Z

@paulzierep I will add this tool this weekend then. The mercat2 developers did respond to the issue I created, so the bugs can be fixed in the next few weeks!

paulzierep · 2025-02-26T09:20:09Z

Well, if the new tool works, I think we do not need the other one; maybe just a small script to compute diversity. But could you first check whether the KMC dump works on a small dataset locally and add a snippet here?

SantaMcCloud · 2025-02-26T13:12:59Z

Okay, I checked the tool and how it works.

These are the options to run it:

(base) sf373@LAPTOP-7RMLPR2D:~/sf373$ kmc 
K-Mer Counter (KMC) ver. 3.2.4 (2024-02-09)
Usage:
 kmc [options] <input_file_name> <output_file_name> <working_directory>
 kmc [options] <@input_file_names> <output_file_name> <working_directory>
Parameters:
  input_file_name - single file in specified (-f switch) format (gziped or not)
  @input_file_names - file name with list of input files in specified (-f switch) format (gziped or not)
Options:
  -v - verbose mode (shows all parameter settings); default: false
  -k<len> - k-mer length (k from 1 to 256; default: 25)
  -m<size> - max amount of RAM in GB (from 1 to 1024); default: 12
  -sm - use strict memory mode (memory limit from -m<n> switch will not be exceeded)
  -hc - count homopolymer compressed k-mers (approximate and experimental)
  -p<par> - signature length (5, 6, 7, 8, 9, 10, 11); default: 9
  -f<a/q/m/bam/kmc> - input in FASTA format (-fa), FASTQ format (-fq), multi FASTA (-fm) or BAM (-fbam) or KMC(-fkmc); default: FASTQ
  -ci<value> - exclude k-mers occurring less than <value> times (default: 2)
  -cs<value> - maximal value of a counter (default: 255)
  -cx<value> - exclude k-mers occurring more of than <value> times (default: 1e9)
  -b - turn off transformation of k-mers into canonical form
  -r - turn on RAM-only mode 
  -n<value> - number of bins 
  -t<value> - total number of threads (default: no. of CPU cores)
  -sf<value> - number of FASTQ reading threads
  -sp<value> - number of splitting threads
  -sr<value> - number of threads for 2nd stage
  -j<file_name> - file name with execution summary in JSON format
  -w - without output
  -o<kmc/kff> - output in KMC of KFF format; default: KMC
  -hp - hide percentage progress (default: false)
  -e - only estimate histogram of k-mers occurrences instead of exact k-mer counting
  --opt-out-size - optimize output database size (may increase running time)
Example:
kmc -k27 -m24 NA19238.fastq NA.res /data/kmc_tmp_dir/
kmc -k27 -m24 @files.lst NA.res /data/kmc_tmp_dir/

This tool can take either a single file per run or a list file that states the path of each input file. @paulzierep would you like both options, or do you prefer single-file or multi-file input only?
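
The two input modes can be sketched as follows. This is only an illustration that assembles the command lines from the flags shown in the help output above (-k, -f, -ci); it does not run kmc. The file names, the output prefixes and the helper names write_kmc_list and build_kmc_command are hypothetical.

```python
from pathlib import Path
import tempfile


def write_kmc_list(input_paths, list_path):
    """Write one input path per line, as expected for the @input_file_names form."""
    list_path = Path(list_path)
    list_path.write_text("".join(f"{p}\n" for p in input_paths))
    return list_path


def build_kmc_command(input_arg, output_prefix, work_dir, k=5, fmt="m", min_count=1):
    """Assemble a kmc call: kmc [options] <input> <output> <working_directory>."""
    return [
        "kmc",
        f"-k{k}",           # k-mer length
        f"-f{fmt}",         # input format, e.g. -fm for multi FASTA
        f"-ci{min_count}",  # exclude k-mers occurring fewer than min_count times
        str(input_arg),
        str(output_prefix),
        str(work_dir),
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical input files; nothing is executed here.
    lst = write_kmc_list(["genome_a.fasta", "genome_b.fasta"], Path(tmp) / "files.lst")
    single = build_kmc_command("genome_a.fasta", "genome_a.res", tmp)
    multi = build_kmc_command(f"@{lst}", "all_genomes.res", tmp)
    print(" ".join(single))
    print(" ".join(multi))
```

Going by the usage line, the list form still takes a single <output_file_name>, so a per-genome feature table would probably need one run per genome rather than one run over a list.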

The output consists of two binary files (a .kmc_pre and a .kmc_suf file).

These two files can then be used by the other functions, which are:

(base) sf373@LAPTOP-7RMLPR2D:~/sf373$ kmc_tools 
kmc_tools ver. 3.2.4 (2024-02-09)
Usage:
 kmc_tools [global parameters] <operation> [operation parameters]
Available operations:
  transform            - transforms single KMC's database
  simple               - performs set operation on two KMC's databases
  complex              - performs set operation on multiple KMC's databases
  filter               - filter out reads with too small number of k-mers
 global parameters:
  -t<value>            - total number of threads (default: no. of CPU cores)
  -v                   - enable verbose mode (shows some information) (default: false)
  -hp                  - hide percentage progress (default: false)
Example:
kmc_tools simple db1 -ci3 db2 -ci5 -cx300 union db1_union_db2 -ci10
For detailed help of concrete operation type operation name without parameters:
kmc_tools simple

To create a list in which each k-mer is listed with the number of times it appears, we need the transform operation. The options that can be used here are:

(base) sf373@LAPTOP-7RMLPR2D:~/sf373$ kmc_tools transform
transform operation transforms single input database to output (text file or KMC database)
General syntax:
kmc_tools transform <input> [input_params] <oper1 [oper_params1] output1 [output_params1]> [<oper2 [oper_params2] output2 [output_params2]>...]
input - path to database generated by KMC 
oper1, oper2, ..., operN          - transform operation name
output1, output2, ..., outputN    - paths to output
 Available operations:
  sort                       - converts database produced by KMC2.x to KMC1.x database format (which contains k-mers in sorted order)
  reduce                     - exclude too rare and too frequent k-mers
  compact                    - remove counters of k-mers
  histogram                  - produce histogram of k-mers occurrences
  dump                       - produce text dump of kmc database
  set_counts <value>         - set all k-mer counts to specific value
 For input there are additional parameters:
  -ci<value> - exclude k-mers occurring less than <value> times 
  -cx<value> - exclude k-mers occurring more of than <value> times
 For sort and reduce operations there are additional output_params:
  -ci<value> - exclude k-mers occurring less than <value> times 
  -cx<value> - exclude k-mers occurring more of than <value> times
  -cs<value> - maximal value of a counter
 For compact, reduce, set_counts and sort operations is an additional output_params:
  -o<kmc|kff> - output in KMC or KFF format (default: kmc) 
 For histogram operation there are additional output_params:
  -ci<value> - minimum value of counter to be stored in the otput file
  -cx<value> - maximum value of counter to be stored in the otput file
 For dump operation there are additional oper_params:
  -s - sorted output
Example:
kmc_tools transform db reduce err_kmers -cx10 reduce valid_kmers -ci11 histogram histo.txt dump dump.txt

With this we get, in this example, the dump file in which every k-mer is listed. Now my question to you @paulzierep: should I include everything, or just the basics we are interested in for the workflow? In this case that would be the dump file and maybe the histogram file.

Example dump file (snippet):

AAAAA	255
AAAAC	255
AAAAG	255
AAAAT	255
AAACA	255
AAACC	255
AAACG	255
AAACT	255
AAAGA	255
AAAGC	255
AAAGG	255
AAAGT	255
AAATA	255
AAATC	255
AAATG	255
AAATT	255
AACAA	255
AACAC	255

Example histogram file (complete file):

paulzierep · 2025-02-26T13:14:24Z

I am currently testing https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Ffastk_fastk%2Ffastk_fastk%2F1.1.0%2Bgalaxy2&version=latest which we already have in Galaxy. It seems super fast and has worked so far ...

paulzierep · 2025-02-26T13:15:31Z

For https://github.com/refresh-bio/KMC I do not understand why the dump file shows 255 for every k-mer. Any idea?

SantaMcCloud · 2025-02-26T13:17:37Z

> I am currently testing https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Ffastk_fastk%2Ffastk_fastk%2F1.1.0%2Bgalaxy2&version=latest which we already have in Galaxy. It seems super fast and has worked so far ...

Okay, if this does not work, let me know and I will start wrapping https://github.com/refresh-bio/KMC

> For https://github.com/refresh-bio/KMC I do not understand why the dump file shows 255 for every k-mer. Any idea?

Currently no, but I can look into the issue or ask them.

SantaMcCloud · 2025-02-26T13:20:33Z

> For https://github.com/refresh-bio/KMC I do not understand why the dump file shows 255 for every k-mer. Any idea?
>
> Currently no, but I can look into the issue or ask them.

It seems that there is no information about this, so the only way to find out why is to open an issue.
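
One possible reading, based only on the help text quoted earlier in this thread: -cs<value> is described as the "maximal value of a counter" with a default of 255. If counters are capped at that value, then very short k-mers (the dump shows k=5) in a large input could all reach the cap. Whether this is what happened here is unconfirmed. The sketch below only simulates capped counting on a toy sequence; it is not KMC's implementation, and count_kmers is a hypothetical helper.

```python
from collections import Counter


def count_kmers(seq, k, max_count=None):
    """Count overlapping k-mers; optionally cap each counter at max_count."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if max_count is not None:
        counts = Counter({kmer: min(c, max_count) for kmer, c in counts.items()})
    return counts


# Toy sequence in which every 2-mer occurs far more than 255 times.
seq = "ACGT" * 1000
uncapped = count_kmers(seq, 2)
capped = count_kmers(seq, 2, max_count=255)

for kmer in sorted(uncapped):
    print(kmer, uncapped[kmer], capped[kmer])
```

If this is the cause, rerunning kmc with a larger -cs value should yield differing counts in the dump, which would be a cheap check before opening an issue.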

paulzierep added the T3.1 label Jan 16, 2025
