Seed for minhash algorithm #11

CBorreda · 2020-02-10T11:06:06Z

I've read the kmer-db paper and wasn't able to find whether kmer-db uses a seed for minhashing during the build step. I've run kmer-db build twice (from KMC-counted k-mers) and it seems to use the same seed every time, since the results are identical. Is there a way to alter this seed? I'd like to generate ~100-200 distance matrices and use them as support values for the distance estimates, but I would need to minhash with a different seed each time.

Best

Carles

agudys · 2020-02-10T11:12:31Z

Dear Carles,

At the moment it is not possible to use different seeds; we will add this feature in the next release. In the meantime, you can try generating a distance matrix without minhashing (no -f parameter specified) to obtain more stable results. How large is your dataset?

Regards,
Adam

CBorreda · 2020-02-10T11:53:48Z

Yes, I know I could use the full set of k-mers, but that is too much to run on my machine.

I am analyzing 75 samples resequenced with Illumina. I used -ci5 in KMC to get rid of erroneous k-mers (those with a count lower than 5) since, as far as I understand, they might inflate the RAM usage of kmer-db build. I checked, and 5 is the highest abundance cutoff I can apply to my samples, mainly because of some low-coverage samples I need to keep in the dataset.

I have run the whole pipeline (build, all2all and distance) on 3% of the k-mers, and it took about 40% of my RAM. I could try to increase the fraction to 5 or 10%, but I don't think I'll be able to use the whole dataset. Still, the tree looks good so far; I just want to give it some bootstrap support. Since I have other projects to work on, I could switch to one of them for a while and come back later to check whether the feature has been implemented. I see this project is under constant development.

Best
Carles

agudys · 2020-02-10T12:08:35Z

Actually, there is something you could use. There is an undocumented option, -f-start, that was designed to process all k-mers in portions. It represents the relative minimum threshold of the minhash filter (while -f is the filter width). You can therefore, for instance, run kmer-db 10 times, with each run analyzing a different 10% of the k-mers:

-f 0.1 
-f 0.1 -f-start 0.1
-f 0.1 -f-start 0.2
...
-f 0.1 -f-start 0.9

It's not exactly bootstrapping (no replacement in sampling), but maybe you can find it useful.
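To make the windowing idea above concrete, here is a minimal Python sketch. It is not kmer-db's implementation: the hash (BLAKE2b), the helper names `hash_fraction`, `in_window` and `window_starts`, and the toy 6-mers are all assumptions chosen for illustration. It only shows how a filter defined by a minimum threshold (-f-start) and a width (-f) would split the k-mers into ten non-overlapping portions.

```python
import hashlib
import itertools


def hash_fraction(kmer: str) -> float:
    """Map a k-mer to a fixed pseudo-random value in [0, 1).

    Illustrative stand-in for a minhash function, not kmer-db's own hash.
    """
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    # Keep the top 53 bits so the result is exactly representable and < 1.
    return (int.from_bytes(digest, "big") >> 11) / 2**53


def in_window(kmer: str, start: float, width: float) -> bool:
    """Keep a k-mer when its hash value falls in [start, start + width)."""
    return start <= hash_fraction(kmer) < start + width


def window_starts(width: float) -> list[float]:
    """Start values of consecutive, non-overlapping windows covering [0, 1)."""
    n = round(1 / width)
    return [round(i * width, 10) for i in range(n)]


# Toy data: all 4096 possible 6-mers (real runs would use KMC output).
kmers = ["".join(p) for p in itertools.product("ACGT", repeat=6)]

for start in window_starts(0.1):
    kept = [k for k in kmers if in_window(k, start, 0.1)]
    print(f"-f 0.1 -f-start {start:g}: {len(kept)} of {len(kmers)} k-mers")
```

Because the hash of a k-mer never changes, each k-mer lands in exactly one of the ten windows, which is why this is sampling without replacement rather than bootstrapping.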

agudys · 2020-02-10T12:13:26Z

I accidentally sent you only half of the comment, but it's been edited now :)

CBorreda · 2020-02-10T14:24:57Z

Very nice! You're right, this is not exactly what I was looking for (due to the lack of replacement in sampling), but it will for sure allow me to do some testing of the robustness of the tree. Still, I'll check for updates on the main request about the seeding.

I was wondering how this option would handle overlapping windows, say

-f 0.1 -f-start 0
-f 0.1 -f-start 0.01
-f 0.1 -f-start 0.02

I guess it would resample (not randomly though) part of the kmers?

Best
Carles

agudys · 2020-02-10T18:39:52Z

Exactly, you'll have overlapping k-mer spectra used in distance calculation. To have real bootstrapping, different seeds are needed. We'll work on that.
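How strongly overlapping windows resemble each other can be estimated with simple interval arithmetic. The sketch below assumes hash values are spread uniformly over [0, 1), so the shared part of two windows approximates the shared fraction of k-mers; the function name `overlap_fraction` is hypothetical and nothing here calls kmer-db.

```python
from fractions import Fraction


def overlap_fraction(start_a: Fraction, start_b: Fraction, width: Fraction) -> Fraction:
    """Fraction of a window of the given width shared by two windows
    starting at start_a and start_b (0 when they do not overlap)."""
    lo = max(start_a, start_b)
    hi = min(start_a + width, start_b + width)
    return max(Fraction(0), hi - lo) / width


width = Fraction("0.1")  # corresponds to -f 0.1
for start in ("0.01", "0.02", "0.1"):
    shared = overlap_fraction(Fraction("0"), Fraction(start), width)
    print(f"-f-start 0 vs -f-start {start}: {float(shared):.0%} of the window shared")
```

With a width of 0.1, shifting the start by 0.01 leaves nine tenths of the window in common, so distance matrices from neighbouring windows are far from independent replicates.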

CBorreda · 2020-06-26T16:20:27Z

Hi there,

Have you managed to implement a way to specify a seed for the minhash algorithm, as we discussed? I have even tried digging into your source code, but without C knowledge I can't really understand what's going on there.

Best,

Carles

agudys · 2020-06-29T18:53:36Z

Hello!

We had some ideas about it, but didn't want to provide a solution without testing whether it's properly random. We'll dig into that again soon and let you know.

Adam
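For readers wondering what a seeded filter could look like in principle, here is a minimal Python sketch. It is purely illustrative and says nothing about how kmer-db does or will implement seeding: the keyed BLAKE2b hash, the names `seeded_fraction` and `subsample`, and the toy 6-mers are all assumptions. The seed is mixed into the hash as a key, so each seed selects a different pseudo-random fraction of the k-mers while the same seed always reproduces the same subset.

```python
import hashlib
import itertools


def seeded_fraction(kmer: str, seed: int) -> float:
    """Map a k-mer to a value in [0, 1) that depends on the seed.

    The seed (0 <= seed < 2**64) is used as the BLAKE2b key.
    """
    key = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8, key=key).digest()
    # Keep the top 53 bits so the result is exactly representable and < 1.
    return (int.from_bytes(digest, "big") >> 11) / 2**53


def subsample(kmers, fraction: float, seed: int) -> set[str]:
    """Keep the k-mers whose seeded hash value is below `fraction`."""
    return {k for k in kmers if seeded_fraction(k, seed) < fraction}


# Toy data: all 4096 possible 6-mers.
kmers = ["".join(p) for p in itertools.product("ACGT", repeat=6)]

for seed in (1, 2, 3):
    subset = subsample(kmers, 0.1, seed)
    print(f"seed {seed}: kept {len(subset)} of {len(kmers)} k-mers")
```

Note that each replicate is still a subsample without replacement, so this is closer to repeated random subsampling than to a strict bootstrap, and checking that the subsets behave as properly random draws is exactly the kind of testing mentioned above.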

agudys self-assigned this Feb 10, 2020
