Seed for minhash algorithm #11
Comments
Dear Carles, at the moment it is not possible to use different seeds; we will add this feature in the next release. In the meantime, you can try generating a distance matrix without minhashing (no Regards,
Yes, I know I could use the full set of k-mers, but that is too much to run on my machine. I am analyzing 75 samples resequenced with Illumina. I used -ci5 in KMC to get rid of erroneous k-mers (those with a count lower than 5) since, as far as I understand, they might inflate the RAM usage of kmer-db build. I checked, and 5 is the highest abundance cutoff I can apply to my samples, mainly because of some low-coverage samples I need to keep in the dataset. I have run the whole pipeline (build, all2all and distance) on 3% of the k-mers, and it took about 40% of my RAM. I could try to increase the fraction to 5 or 10%, but I don't think I will be able to use the whole dataset. Still, the tree looks good so far; I just want to give it some bootstrap support. Since I have other projects to work on, I could switch to one of them for a while and come back later to check whether the feature has been implemented. I see this project is under constant development. Best
Actually, there is something you could use. There is an undocumented option
It's not exactly bootstrapping (no replacement in sampling), but maybe you will find it useful.
I accidentally sent you only half of the comment, but it's been edited now :)
Very nice! You're right, this is not exactly what I was looking for (due to the lack of replacement in sampling), but it will certainly let me test the robustness of the tree. Still, I'll watch for updates on the main request about seeding. I was wondering how this option would handle overlapping windows, say
I guess it would resample (not randomly, though) part of the k-mers? Best
Exactly, you'll have overlapping k-mer spectra used in the distance calculation. To have real bootstrapping, different seeds are needed. We'll work on that.
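To illustrate the point about overlapping windows, here is a minimal Python sketch. It is not kmer-db's code: it assumes that fractional sampling keeps the k-mers whose normalized hash falls inside a window of the hash space, and the names `unit_hash` and `window` are hypothetical helpers made up for this example. Under that assumption, two overlapping windows share exactly the k-mers that hash into the overlap, so the replicates are correlated rather than independently resampled.

```python
import hashlib


def unit_hash(kmer: str) -> float:
    """Map a k-mer to a pseudo-random value in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def window(kmers, start: float, fraction: float) -> set:
    """Keep the k-mers whose hash lies in [start, start + fraction)."""
    return {k for k in kmers if start <= unit_hash(k) < start + fraction}


# Stand-in k-mer names; real input would come from a k-mer counter such as KMC.
kmers = [f"kmer{i}" for i in range(10000)]

a = window(kmers, 0.00, 0.10)  # hash window [0.00, 0.10)
b = window(kmers, 0.05, 0.10)  # hash window [0.05, 0.15), overlaps a
c = window(kmers, 0.10, 0.10)  # hash window [0.10, 0.20), disjoint from a

print(len(a & b), "k-mers shared by the overlapping windows")
print(len(a & c), "k-mers shared by the disjoint windows")
```

Disjoint windows give replicates that share no k-mers, but each k-mer can be used at most once across them, which is why this is subsampling without replacement rather than a bootstrap.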
Hi there, have you managed to implement a way to specify a seed for the minhash algorithm, as we discussed? I have even tried to dig into your source code, but without C knowledge I can't really understand what's going on there. Best, Carles
Hello! We had some ideas about it, but we didn't want to provide a solution without testing whether it's properly random. We'll dig into that again soon and let you know. Adam
I've read the kmer-db paper and I wasn't able to find anywhere whether kmer-db uses a seed for minhashing during the build step. I've run kmer-db build twice (from KMC-counted k-mers) and it seems to use the same seed every time, since the results are identical. Is there a way to alter this seed? I'd like to generate ~100-200 distance matrices and use them as support values for the distance estimates, but I would need to minhash with a different seed each time.
Best
Carles
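For readers following the request, the idea can be sketched in Python as follows. This is only an illustration under stated assumptions, not kmer-db's implementation or API: it assumes a minhash variant that keeps the k-mers whose hash falls below a fraction threshold, and `seeded_hash`, `minhash_subset` and `jaccard_distance` are hypothetical names. The seed is passed as the key of `hashlib.blake2b`, so each seed yields a different but reproducible subsample.

```python
import hashlib


def seeded_hash(kmer: str, seed: int) -> float:
    """Hash a k-mer to [0, 1); the seed is used as the BLAKE2b key."""
    key = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "big") / 2**64


def minhash_subset(kmers, fraction: float, seed: int) -> set:
    """Keep the k-mers whose seeded hash is below the fraction threshold."""
    return {k for k in kmers if seeded_hash(k, seed) < fraction}


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two stand-in samples sharing half of their k-mers (true distance is 2/3).
sample_a = {f"kmer{i}" for i in range(0, 6000)}
sample_b = {f"kmer{i}" for i in range(3000, 9000)}

# One distance estimate per seed; the same seed is applied to both samples
# so that shared k-mers are either kept in both subsets or dropped from both.
replicates = []
for seed in range(5):
    sub_a = minhash_subset(sample_a, 0.1, seed)
    sub_b = minhash_subset(sample_b, 0.1, seed)
    replicates.append(jaccard_distance(sub_a, sub_b))

print(replicates)
```

Running it twice with the same seed gives identical subsets, which matches the behaviour described above for repeated builds; changing the seed changes which k-mers are kept and therefore the distance estimate. Whether replicates produced this way are suitable as support values is a separate statistical question that this sketch does not answer.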