Usage of score calibration #126
Score calibration should work fine in cases like this. I recommend using it. The performance of the calibration drops a bit when the sample composition is extreme (e.g., 99% plasmids), but geNomad automatically detects cases like this and deals with them properly (see these lines, in case you're curious).
Hi @apcamargo, thanks a lot for your answer. After running genomad with score calibration, I'm a little confused about some aspects of the output. For example, in the file *calibrated_aggregated_classification.tsv, some contigs have plasmid scores very close to 1 (which, if I'm correct, means their probability of being a plasmid is ~100%), but they are not in the file *plasmid_summary.tsv. How can this happen?
My guess is that the empirical plasmid fraction within the sample was very low, but the algorithm should be robust to cases like this. Can you share the contents of These contigs were excluded from If you want, you can see the calibrated scores of every single sequence (not just the ones classified as plasmid) in
Hi, this is the output of *compositions.tsv:
I'm confused because I was already looking at the file *calibrated_aggregated_classification.tsv, where these contigs had plasmid scores of almost 1, but they weren't included in plasmid_summary.tsv in the *summary directory. So the scores must already have been calibrated at this point, and something must have excluded them from being identified as plasmids despite an almost 100% probability of being a plasmid (according to *calibrated_aggregated_classification.tsv).
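The comparison described above (high calibrated plasmid score, yet absent from the plasmid summary) can be checked with a few lines of Python. This is only a minimal sketch: the column names `seq_name` and `plasmid_score`, the contig names, and the 0.9 threshold are assumptions for illustration, and the two small in-memory tables stand in for the real *calibrated_aggregated_classification.tsv and *plasmid_summary.tsv files.

```python
import csv
import io

# Hypothetical stand-ins for the two output files. The column names
# (seq_name, plasmid_score) and all values are assumptions for illustration.
CALIBRATED_TSV = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "contig_1\t0.0100\t0.9800\t0.0100\n"
    "contig_2\t0.9000\t0.0500\t0.0500\n"
    "contig_3\t0.0050\t0.9900\t0.0050\n"
)
SUMMARY_TSV = (
    "seq_name\tplasmid_score\n"
    "contig_1\t0.9800\n"
)


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def high_score_but_unlisted(calibrated_rows, summary_rows, min_score=0.9):
    """Return names of sequences scoring >= min_score that the summary omits."""
    listed = {row["seq_name"] for row in summary_rows}
    return [
        row["seq_name"]
        for row in calibrated_rows
        if float(row["plasmid_score"]) >= min_score
        and row["seq_name"] not in listed
    ]


calibrated = read_tsv(io.StringIO(CALIBRATED_TSV))
summary = read_tsv(io.StringIO(SUMMARY_TSV))
missing = high_score_but_unlisted(calibrated, summary)
print(missing)
```

For real output, the `io.StringIO(...)` objects could be replaced with `open(path, newline="")` on the corresponding files, after checking that the header names match.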
Ohh, that's probably because they have negative marker enrichment (which is one of the post-classification filters). Just use
Hi, I have a question regarding the --enable-score-calibration flag. I understand that it considers the composition of the sample to output a 'real' probability, e.g. a plasmid score of 0.4 would mean that there is a 40% chance that this sequence is a plasmid, and that without score calibration the scores are not actual probabilities.
What I'm wondering is: if I run genomad on sequences that I have already preselected based on some criteria, e.g. a collection of short, circular contigs extracted from multiple metagenomes, would it be recommended to use the score calibration or not? In that collection I probably have a higher chance of finding plasmids than in a 'natural' assembly of an environmental metagenome. So is the score calibration recommended only for 'natural' samples, or also for samples where a pre-selection of sequences (that are more likely viral or plasmid) has already taken place?
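To make the intuition behind the question concrete, here is a generic prior-shift correction: class scores are reweighted by the ratio of the sample's class proportions to the proportions assumed during training, then renormalized. This is a textbook illustration only, not geNomad's calibration procedure; the function name, the uniform training prior, and the enriched sample composition are all hypothetical.

```python
def adjust_for_prior(scores, train_prior, sample_prior):
    """Reweight class scores from train_prior to sample_prior and renormalize.

    Generic prior-shift correction for illustration; NOT geNomad's method.
    """
    weighted = {
        cls: scores[cls] * sample_prior[cls] / train_prior[cls]
        for cls in scores
    }
    total = sum(weighted.values())
    return {cls: value / total for cls, value in weighted.items()}


classes = ("chromosome", "plasmid", "virus")

# Hypothetical uncalibrated scores for one contig.
scores = {"chromosome": 0.5, "plasmid": 0.4, "virus": 0.1}

# Assumed priors: uniform at training time, plasmid-enriched in a
# preselected collection of short, circular contigs.
uniform = {cls: 1 / 3 for cls in classes}
enriched = {"chromosome": 0.2, "plasmid": 0.7, "virus": 0.1}

adjusted = adjust_for_prior(scores, uniform, enriched)
print(adjusted)
```

Under these made-up numbers, the plasmid score rises because plasmids are more common in the preselected collection than the uniform prior assumes, which is the effect the question is getting at.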