Generate ML features with Galaxy #30

Open
4 tasks
paulzierep opened this issue Jan 16, 2025 · 15 comments
@paulzierep
Contributor

paulzierep commented Jan 16, 2025

@SantaMcCloud can you wrap: https://github.com/raw-lab/mercat2 ?

@SantaMcCloud
Collaborator

Yes, I will do it quickly at the weekend!

@SantaMcCloud
Collaborator

@paulzierep it seems that mercat2 has some bugs when run via Docker, so the wrapper might take a while to finish. I opened an issue to check whether the errors are genuine or whether something needs to be fixed: raw-lab/mercat2#14

@paulzierep
Contributor Author

Maybe we can use this tool instead: https://github.com/refresh-bio/KMC
The main issue is that we need a feature table, based on the k-mers, as input.
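
To make the target concrete, here is a minimal sketch of what such a feature table could look like, assuming it means one row per sample and one column per k-mer. It counts k-mers in plain Python rather than with KMC, and the function names and toy sequences are made up for illustration.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in one sequence (no canonical collapsing)."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def build_feature_table(samples: dict, k: int):
    """Return (sorted k-mer columns, {sample name: counts in column order})."""
    per_sample = {name: count_kmers(seq, k) for name, seq in samples.items()}
    # The column set is the union of all k-mers seen in any sample.
    columns = sorted(set().union(*per_sample.values()))
    table = {
        name: [counts[kmer] for kmer in columns]  # Counter returns 0 if absent
        for name, counts in per_sample.items()
    }
    return columns, table


# Hypothetical toy input, not data from this project.
samples = {"sample_a": "ACGTAC", "sample_b": "ACGACG"}
columns, table = build_feature_table(samples, k=3)
print(columns)
for name, row in table.items():
    print(name, row)
```

With real data the per-sample counts would come from the k-mer counter's text output instead of `count_kmers`; only the table-building step would stay the same.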

@paulzierep
Contributor Author

But we need to add it to bioconda

@paulzierep
Contributor Author

Wrong, it is already there: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/kmc/meta.yaml
Let's use that one. Sorry for the extra work @SantaMcCloud, we can still mention the effort in the report! Thanks!

@paulzierep
Contributor Author

paulzierep commented Feb 26, 2025

Maybe we could apply the diversity estimation of mercat2 to the k-mers produced by KMC; that should be doable.
Basically, we would need to apply the alpha-diversity functions (`from skbio.diversity import alpha as skbio_alpha`) to the KMC counts.
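
As a rough idea of what "diversity on k-mer counts" would compute, here is a dependency-free sketch of two common alpha-diversity indices. It is not mercat2's code and does not call scikit-bio; `skbio.diversity.alpha` offers comparable functions, and whether their defaults (for example the logarithm base) match this sketch is an assumption to check. The function names below are hypothetical.

```python
import math


def shannon_index(counts, base=2.0):
    """Shannon entropy of a count vector; zero counts are ignored."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p, base)
    return entropy


def simpson_index(counts):
    """Simpson's index in the form 1 - sum(p_i^2)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in counts)


# Hypothetical k-mer counts for one sample (four equally abundant k-mers).
kmer_counts = [10, 10, 10, 10]
print(shannon_index(kmer_counts))
print(simpson_index(kmer_counts))
```

Note that if every count has the same value, as in the dump further down, both indices only reflect the number of distinct k-mers and carry no abundance information.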

@SantaMcCloud
Collaborator

@paulzierep I will add this tool this weekend then. The mercat2 developers did respond to the issue I created, so the bugs may be fixed in the next few weeks!

@paulzierep
Contributor Author

Well, if the new tool works, I think we do not need the other one; maybe a small script instead to compute diversity. But maybe you could first check locally whether KMC dump works on a small dataset and add a snippet here?

@SantaMcCloud
Collaborator

Okay, I checked the tool and how it works.

These are the options to run it:

(base) sf373@LAPTOP-7RMLPR2D:~/sf373$ kmc 
K-Mer Counter (KMC) ver. 3.2.4 (2024-02-09)
Usage:
 kmc [options] <input_file_name> <output_file_name> <working_directory>
 kmc [options] <@input_file_names> <output_file_name> <working_directory>
Parameters:
  input_file_name - single file in specified (-f switch) format (gziped or not)
  @input_file_names - file name with list of input files in specified (-f switch) format (gziped or not)
Options:
  -v - verbose mode (shows all parameter settings); default: false
  -k<len> - k-mer length (k from 1 to 256; default: 25)
  -m<size> - max amount of RAM in GB (from 1 to 1024); default: 12
  -sm - use strict memory mode (memory limit from -m<n> switch will not be exceeded)
  -hc - count homopolymer compressed k-mers (approximate and experimental)
  -p<par> - signature length (5, 6, 7, 8, 9, 10, 11); default: 9
  -f<a/q/m/bam/kmc> - input in FASTA format (-fa), FASTQ format (-fq), multi FASTA (-fm) or BAM (-fbam) or KMC(-fkmc); default: FASTQ
  -ci<value> - exclude k-mers occurring less than <value> times (default: 2)
  -cs<value> - maximal value of a counter (default: 255)
  -cx<value> - exclude k-mers occurring more of than <value> times (default: 1e9)
  -b - turn off transformation of k-mers into canonical form
  -r - turn on RAM-only mode 
  -n<value> - number of bins 
  -t<value> - total number of threads (default: no. of CPU cores)
  -sf<value> - number of FASTQ reading threads
  -sp<value> - number of splitting threads
  -sr<value> - number of threads for 2nd stage
  -j<file_name> - file name with execution summary in JSON format
  -w - without output
  -o<kmc/kff> - output in KMC of KFF format; default: KMC
  -hp - hide percentage progress (default: false)
  -e - only estimate histogram of k-mers occurrences instead of exact k-mer counting
  --opt-out-size - optimize output database size (may increase running time)
Example:
kmc -k27 -m24 NA19238.fastq NA.res /data/kmc_tmp_dir/
kmc -k27 -m24 @files.lst NA.res /data/kmc_tmp_dir/

For this tool, either a single file can be used per run, or you can give it a list file that states the path of each input file. @paulzierep, would you like both options, or do you prefer single-only or multiple-only?
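
The two input modes could be handled in a wrapper roughly as follows. This is only a sketch that builds the argument list from the usage text above (`-k`, `-f`, and the `@list` form); it does not run `kmc`, and the helper names and file names are hypothetical.

```python
import tempfile
from pathlib import Path


def write_input_list(paths, list_file):
    """Write one input path per line and return the '@list' argument for kmc."""
    Path(list_file).write_text("".join(f"{p}\n" for p in paths))
    return f"@{list_file}"


def build_kmc_command(input_arg, output_prefix, work_dir, k=25, input_format="q"):
    """Build (but do not execute) a kmc call as an argument list."""
    if input_format not in {"a", "q", "m", "bam", "kmc"}:
        raise ValueError(f"unsupported input format: {input_format}")
    if not 1 <= k <= 256:
        raise ValueError("k must be between 1 and 256")
    return ["kmc", f"-k{k}", f"-f{input_format}",
            str(input_arg), str(output_prefix), str(work_dir)]


# Single-file mode, mirroring the example from the help text.
single_cmd = build_kmc_command("NA19238.fastq", "NA.res", "/data/kmc_tmp_dir/", k=27)

# Multi-file mode: the list file is written first, then passed with a leading '@'.
with tempfile.TemporaryDirectory() as tmp:
    list_arg = write_input_list(["a.fastq", "b.fastq"], Path(tmp) / "files.lst")
    multi_cmd = build_kmc_command(list_arg, "NA.res", tmp, k=27)

print(single_cmd)
print(multi_cmd)
```

Supporting both modes costs little here, since only the first positional argument differs.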

The output consists of 2 binary files:

Image

These 2 files can then be used by the other functions, which are:

(base) sf373@LAPTOP-7RMLPR2D:~/sf373$ kmc_tools 
kmc_tools ver. 3.2.4 (2024-02-09)
Usage:
 kmc_tools [global parameters] <operation> [operation parameters]
Available operations:
  transform            - transforms single KMC's database
  simple               - performs set operation on two KMC's databases
  complex              - performs set operation on multiple KMC's databases
  filter               - filter out reads with too small number of k-mers
 global parameters:
  -t<value>            - total number of threads (default: no. of CPU cores)
  -v                   - enable verbose mode (shows some information) (default: false)
  -hp                  - hide percentage progress (default: false)
Example:
kmc_tools simple db1 -ci3 db2 -ci5 -cx300 union db1_union_db2 -ci10
For detailed help of concrete operation type operation name without parameters:
kmc_tools simple

To create a list in which each k-mer appears with the number of times it occurs, we need the transform operation. The options that can be used here are:

(base) sf373@LAPTOP-7RMLPR2D:~/sf373$ kmc_tools transform
transform operation transforms single input database to output (text file or KMC database)
General syntax:
kmc_tools transform <input> [input_params] <oper1 [oper_params1] output1 [output_params1]> [<oper2 [oper_params2] output2 [output_params2]>...]
input - path to database generated by KMC 
oper1, oper2, ..., operN          - transform operation name
output1, output2, ..., outputN    - paths to output
 Available operations:
  sort                       - converts database produced by KMC2.x to KMC1.x database format (which contains k-mers in sorted order)
  reduce                     - exclude too rare and too frequent k-mers
  compact                    - remove counters of k-mers
  histogram                  - produce histogram of k-mers occurrences
  dump                       - produce text dump of kmc database
  set_counts <value>         - set all k-mer counts to specific value
 For input there are additional parameters:
  -ci<value> - exclude k-mers occurring less than <value> times 
  -cx<value> - exclude k-mers occurring more of than <value> times
 For sort and reduce operations there are additional output_params:
  -ci<value> - exclude k-mers occurring less than <value> times 
  -cx<value> - exclude k-mers occurring more of than <value> times
  -cs<value> - maximal value of a counter
 For compact, reduce, set_counts and sort operations is an additional output_params:
  -o<kmc|kff> - output in KMC or KFF format (default: kmc) 
 For histogram operation there are additional output_params:
  -ci<value> - minimum value of counter to be stored in the otput file
  -cx<value> - maximum value of counter to be stored in the otput file
 For dump operation there are additional oper_params:
  -s - sorted output
Example:
kmc_tools transform db reduce err_kmers -cx10 reduce valid_kmers -ci11 histogram histo.txt dump dump.txt

With this we can get, in this example, the dump file in which every k-mer is listed. Now my question to you, @paulzierep: should I include everything, or just the basics we are interested in for the workflow? In this case, the dump file and maybe the histogram file?
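
If only the dump and histogram outputs are wrapped, the command could be assembled like this. The sketch follows the general syntax printed above (`<oper [oper_params] output>` chained after the input database, with `-s` as a dump parameter); it only builds the argument list, and the function name is hypothetical.

```python
def build_transform_command(db_prefix, dump_path=None, histogram_path=None,
                            sorted_dump=False):
    """Build (but do not execute) a kmc_tools transform call."""
    if dump_path is None and histogram_path is None:
        raise ValueError("request at least one of dump_path or histogram_path")
    command = ["kmc_tools", "transform", db_prefix]
    if histogram_path is not None:
        command += ["histogram", histogram_path]
    if dump_path is not None:
        command.append("dump")
        if sorted_dump:
            command.append("-s")  # operation parameter goes before the output path
        command.append(dump_path)
    return command


both = build_transform_command("db", dump_path="dump.txt", histogram_path="histo.txt")
sorted_only = build_transform_command("db", dump_path="dump.txt", sorted_dump=True)
print(" ".join(both))
print(" ".join(sorted_only))
```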

Example dump file (snippet):

AAAAA	255
AAAAC	255
AAAAG	255
AAAAT	255
AAACA	255
AAACC	255
AAACG	255
AAACT	255
AAAGA	255
AAAGC	255
AAAGG	255
AAAGT	255
AAATA	255
AAATC	255
AAATG	255
AAATT	255
AACAA	255
AACAC	255
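
A dump in this layout (k-mer, whitespace, count) can be read into a dictionary with a few lines. This is a sketch assuming exactly two fields per line, as in the snippet above; the function name is hypothetical.

```python
def parse_kmc_dump(text):
    """Parse 'KMER<whitespace>COUNT' lines into a {kmer: count} dictionary."""
    counts = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 fields, got {len(fields)}")
        kmer, count = fields
        counts[kmer] = int(count)
    return counts


# First three lines of the dump snippet shown above.
example = "AAAAA\t255\nAAAAC\t255\nAAAAG\t255\n"
counts = parse_kmc_dump(example)
print(counts)
```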

Example histogram file (complete file):

2	0
3	0
4	0
5	0
6	0
7	0
8	0
9	0
10	0
11	0
12	0
13	0
14	0
15	0
16	0
17	0
18	0
19	0
20	0
21	0
22	0
23	0
24	0
25	0
26	0
27	0
28	0
29	0
30	0
31	0
32	0
33	0
34	0
35	0
36	0
37	0
38	0
39	0
40	0
41	0
42	0
43	0
44	0
45	0
46	0
47	0
48	0
49	0
50	0
51	0
52	0
53	0
54	0
55	0
56	0
57	0
58	0
59	0
60	0
61	0
62	0
63	0
64	0
65	0
66	0
67	0
68	0
69	0
70	0
71	0
72	0
73	0
74	0
75	0
76	0
77	0
78	0
79	0
80	0
81	0
82	0
83	0
84	0
85	0
86	0
87	0
88	0
89	0
90	0
91	0
92	0
93	0
94	0
95	0
96	0
97	0
98	0
99	0
100	0
101	0
102	0
103	0
104	0
105	0
106	0
107	0
108	0
109	0
110	0
111	0
112	0
113	0
114	0
115	0
116	0
117	0
118	0
119	0
120	0
121	0
122	0
123	0
124	0
125	0
126	0
127	0
128	0
129	0
130	0
131	0
132	0
133	0
134	0
135	0
136	0
137	0
138	0
139	0
140	0
141	0
142	0
143	0
144	0
145	0
146	0
147	0
148	0
149	0
150	0
151	0
152	0
153	0
154	0
155	0
156	0
157	0
158	0
159	0
160	0
161	0
162	0
163	0
164	0
165	0
166	0
167	0
168	0
169	0
170	0
171	0
172	0
173	0
174	0
175	0
176	0
177	0
178	0
179	0
180	0
181	0
182	0
183	0
184	0
185	0
186	0
187	0
188	0
189	0
190	0
191	0
192	0
193	0
194	0
195	0
196	0
197	0
198	0
199	0
200	0
201	0
202	0
203	0
204	0
205	0
206	0
207	0
208	0
209	0
210	0
211	0
212	0
213	0
214	0
215	0
216	0
217	0
218	0
219	0
220	0
221	0
222	0
223	0
224	0
225	0
226	0
227	0
228	0
229	0
230	0
231	0
232	0
233	0
234	0
235	0
236	0
237	0
238	0
239	0
240	0
241	0
242	0
243	0
244	0
245	0
246	0
247	0
248	0
249	0
250	0
251	0
252	0
253	0
254	0
255	512
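
The histogram is easier to read when summarized. The sketch below assumes each line holds an occurrence count followed by the number of distinct k-mers with that count, and it rebuilds the file shown above instead of reading it from disk. Function names are hypothetical.

```python
def parse_histogram(text):
    """Parse 'OCCURRENCES<whitespace>NUM_KMERS' lines into a dictionary."""
    histogram = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        occurrences, num_kmers = line.split()
        histogram[int(occurrences)] = int(num_kmers)
    return histogram


def summarize(histogram):
    """Report the number of distinct k-mers and the range of non-empty bins."""
    non_empty = [occ for occ, n in histogram.items() if n > 0]
    return {
        "distinct_kmers": sum(histogram.values()),
        "min_count": min(non_empty) if non_empty else None,
        "max_count": max(non_empty) if non_empty else None,
    }


# Same content as the file above: zeros for 2..254, then 512 k-mers at 255.
text = "".join(f"{i}\t0\n" for i in range(2, 255)) + "255\t512\n"
summary = summarize(parse_histogram(text))
print(summary)
```

For this file, all 512 distinct k-mers fall into the single bin at 255.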

@paulzierep
Contributor Author

I am currently testing https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Ffastk_fastk%2Ffastk_fastk%2F1.1.0%2Bgalaxy2&version=latest which we already have in Galaxy; it seems super fast and has worked so far ...

@paulzierep
Contributor Author

For https://github.com/refresh-bio/KMC I do not get why every count in the dump file is 255. Any idea?

@SantaMcCloud
Collaborator

> I am currently testing https://usegalaxy.eu/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fiuc%2Ffastk_fastk%2Ffastk_fastk%2F1.1.0%2Bgalaxy2&version=latest which we already have in Galaxy; it seems super fast and has worked so far ...

Okay, if this does not work, let me know and I will start to wrap https://github.com/refresh-bio/KMC

> For https://github.com/refresh-bio/KMC I do not get why every count in the dump file is 255. Any idea?

Currently no, but I can have a look into the issue or ask them.

@SantaMcCloud
Collaborator

> For https://github.com/refresh-bio/KMC I do not get why every count in the dump file is 255. Any idea?
>
> Currently no, but I can have a look into the issue or ask them.

It seems that there is no information about this, so the only way to find out why is to open an issue.
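
One possible explanation, based only on the help text quoted earlier and not confirmed by the KMC developers: `-cs` is described as the "maximal value of a counter (default: 255)", so counts may simply be capped at 255. The histogram lists 512 k-mers at 255, which is also the number of canonical 5-mers, so every possible 5-mer may have reached the cap. The sketch below checks that arithmetic and measures how many counts sit at the cap. The helper names are hypothetical, and treating the lexicographically smaller strand as the canonical form is an assumption.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(k):
    """Enumerate all 4^k k-mers and count the distinct canonical forms."""
    return len({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def saturated_fraction(counts, cap=255):
    """Fraction of counts that are at (or above) the assumed counter cap."""
    counts = list(counts)
    if not counts:
        return 0.0
    return sum(count >= cap for count in counts) / len(counts)


print(count_canonical_kmers(5))                  # compare with the 512 in the histogram
print(saturated_fraction([255, 255, 255, 255]))  # counts taken from the dump snippet
```

If this reading is right, re-running `kmc` with a larger `-cs` value on the same input should produce varied counts; if the dump still shows 255 everywhere, the cause lies elsewhere.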


2 participants