Usage of score calibration #126

Open
zehanna opened this issue Sep 27, 2024 · 5 comments

Comments

@zehanna

zehanna commented Sep 27, 2024

Hi, I have a question regarding the --enable-score-calibration flag. I understand that it takes the composition of the sample into account to output a 'real' probability, e.g. a plasmid score of 0.4 would mean that there is a 40% chance that the sequence is a plasmid, and that without score calibration the scores are not actual probabilities.
What I'm wondering is: if I run geNomad on sequences that I have already preselected on some criteria, e.g. a collection of short, circular contigs extracted from multiple metagenomes, would it be recommended to use score calibration or not? In that collection I probably have a higher chance of finding plasmids than in a 'natural' assembly of an environmental metagenome. So is score calibration recommended only for 'natural' samples, or also for samples where a pre-selection of sequences (that are more likely viral or plasmid) has already taken place?

@apcamargo
Owner

Score calibration should work fine in cases like this. I recommend using it.

The performance of the calibration drops a bit when the sample composition is extreme (e.g., 99% plasmids), but geNomad automatically detects cases like this and deals with them properly (see these lines, in case you're curious).

@zehanna
Author

zehanna commented Oct 29, 2024

Hi @apcamargo, thanks a lot for your answer. After running geNomad with score calibration, I'm a little confused about some aspects of the output. E.g. in the file *calibrated_aggregated_classification.tsv, some contigs have plasmid scores very close to 1 (which, if I'm correct, means their probability of being a plasmid is ~100%), but they are not in the file *plasmid_summary.tsv. How can this happen?

@apcamargo
Owner

My guess is that the empirical plasmid fraction within the sample was very low, but the algorithm should be robust to cases like this. Can you share the contents of <prefix>_score_calibration/<prefix>_compositions.tsv?

These contigs were excluded from <prefix>_plasmid_summary.tsv because their calibrated scores were lower than 0.7 (default cutoff). You can try to use --relaxed to get more contigs in the summary file (see here).

If you want, you can see the calibrated scores of every single sequence (not just the ones classified as plasmid) in <prefix>_score_calibration/<prefix>_calibrated_aggregated_classification.tsv. This way you can compare the pre- and post-calibration scores for all contigs.
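
The comparison suggested above could be sketched roughly as follows. This is only an illustration: the column names `seq_name` and `plasmid_score` are assumptions about the table layout, the two inline tables are made-up toy values rather than real geNomad output, and `read_scores` and `compare_scores` are hypothetical helper names. With real data you would pass open file handles for the uncalibrated and calibrated classification tables instead of the inline strings.

```python
import csv
import io


def read_scores(handle, score_column="plasmid_score", name_column="seq_name"):
    """Map sequence name -> score from a tab-separated classification table.

    The column names are assumptions; adjust them to match your files.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[name_column]: float(row[score_column]) for row in reader}


def compare_scores(raw, calibrated):
    """Return (name, raw, calibrated, delta) for names present in both tables."""
    shared = sorted(set(raw) & set(calibrated))
    return [(n, raw[n], calibrated[n], calibrated[n] - raw[n]) for n in shared]


# Toy tables with made-up values, only to show the mechanics.
RAW_TSV = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "contig_1\t0.15\t0.80\t0.05\n"
    "contig_2\t0.85\t0.10\t0.05\n"
)
CALIBRATED_TSV = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "contig_1\t0.40\t0.55\t0.05\n"
    "contig_2\t0.95\t0.02\t0.03\n"
)

raw = read_scores(io.StringIO(RAW_TSV))
calibrated = read_scores(io.StringIO(CALIBRATED_TSV))
for name, before, after, delta in compare_scores(raw, calibrated):
    print(f"{name}\t{before:.2f}\t{after:.2f}\t{delta:+.2f}")
```

Sorting the result by the absolute value of the delta is one way to spot the contigs whose scores moved the most during calibration.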

@zehanna
Author

zehanna commented Oct 30, 2024

Hi, this is the output of *compositions.tsv:

model       chromosome  plasmid  virus
marker      0.9112      0.0145   0.0743
nn          0.5764      0.3175   0.1061
aggregated  0.8863      0.0394   0.0743

My confusion is because I was already looking at the file *calibrated_aggregated_classification.tsv, and there these contigs had plasmid scores of almost 1, but they weren't included in *plasmid_summary.tsv in the *summary directory. So the scores must already have been calibrated at this point, and something else must have excluded them from being identified as plasmids despite an almost 100% probability of being a plasmid (according to *calibrated_aggregated_classification.tsv).
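
For readers who want to inspect a compositions table like the one above programmatically, here is a minimal sketch. It embeds the values posted in this comment and assumes the file is tab-separated with the header shown; `read_compositions` is a hypothetical helper name, not part of geNomad.

```python
import csv
import io

# Values copied from the compositions table posted above (tab-separated by assumption).
COMPOSITIONS_TSV = (
    "model\tchromosome\tplasmid\tvirus\n"
    "marker\t0.9112\t0.0145\t0.0743\n"
    "nn\t0.5764\t0.3175\t0.1061\n"
    "aggregated\t0.8863\t0.0394\t0.0743\n"
)


def read_compositions(handle):
    """Return {model: {class_name: fraction}} from a compositions table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["model"]: {k: float(v) for k, v in row.items() if k != "model"}
        for row in reader
    }


compositions = read_compositions(io.StringIO(COMPOSITIONS_TSV))
for model, fractions in compositions.items():
    total = sum(fractions.values())
    print(f"{model}: plasmid fraction = {fractions['plasmid']:.4f}, row total = {total:.4f}")
```

Each row sums to 1, so the values read as estimated class fractions for the sample. In this table the aggregated plasmid fraction is 0.0394, i.e. about 4% of the sequences.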

@apcamargo
Owner

Ohh, that's probably because they have negative marker enrichment (which is one of the post-classification filters). Just use --relaxed and these plasmids should appear in your summary file.
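
To list which contigs were dropped by the post-classification filters, one option is to compare the calibrated classification table against the summary table. The sketch below is an illustration under assumptions: the column names `seq_name` and `plasmid_score` are assumed, the inline tables hold made-up toy values, and `high_scoring_but_missing` is a hypothetical helper. The 0.7 cutoff is the default mentioned earlier in this thread.

```python
import csv
import io


def high_scoring_but_missing(classification, summary, cutoff=0.7,
                             score_column="plasmid_score", name_column="seq_name"):
    """Return (name, score) for sequences scoring >= cutoff that are absent from the summary.

    Column names are assumptions; adjust them to match your files.
    """
    summary_names = {row[name_column] for row in csv.DictReader(summary, delimiter="\t")}
    missing = []
    for row in csv.DictReader(classification, delimiter="\t"):
        score = float(row[score_column])
        if score >= cutoff and row[name_column] not in summary_names:
            missing.append((row[name_column], score))
    return missing


# Toy tables with made-up values, only to show the mechanics.
CLASSIFICATION_TSV = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "contig_1\t0.02\t0.97\t0.01\n"
    "contig_2\t0.90\t0.05\t0.05\n"
    "contig_3\t0.01\t0.99\t0.00\n"
)
SUMMARY_TSV = (
    "seq_name\tplasmid_score\n"
    "contig_1\t0.97\n"
)

dropped = high_scoring_but_missing(
    io.StringIO(CLASSIFICATION_TSV), io.StringIO(SUMMARY_TSV)
)
for name, score in dropped:
    print(f"{name}\t{score:.2f}")
```

Any sequence this reports passed the score cutoff but is missing from the summary, which points to another filter (such as the marker enrichment filter mentioned above) as the reason for its exclusion.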
