Usage of score calibration #126

zehanna · 2024-09-27T08:58:30Z

Hi, I have a question regarding the --enable-score-calibration flag. I understand that it considers the composition of the sample to output a 'real' probability, e.g. plasmid score of 0.4 would mean that there is a 40% chance that this sequence is a plasmid. And that without using score calibration, the scores are not actual probabilities.
What I'm wondering is, if I run genomad on sequences which I already preselected on some criteria, e.g. a collection of short, circular contigs extracted from multiple metagenomes, would it be recommended to use the score calibration or not? Because in that collection I probably have a higher chance of finding plasmids than in a 'natural' assembly of an environmental metagenome. So what I'm wondering is, is the score calibration recommended only for 'natural' samples, or also samples where already a pre-selection of sequences (that are more likely viral or plasmid) has taken place?

apcamargo · 2024-10-01T08:54:47Z

Score calibration should work fine in cases like this. I recommend using it.

The performance of the calibration drops a bit when the sample composition is extreme (e.g., 99% plasmids), but geNomad automatically detects cases like this and deals with them properly (see these lines, in case you're curious).

zehanna · 2024-10-29T12:30:51Z

Hi @apcamargo, thanks a lot for your answer. After running geNomad with score calibration, I'm a little confused about some aspects of the output. E.g. in the file *calibrated_aggregated_classification.tsv, some contigs have plasmid scores very close to 1 (which, if I understand correctly, means their probability of being a plasmid is ~100%), but they are not in the file *plasmid_summary.tsv. How can this happen?

apcamargo · 2024-10-30T07:41:42Z

My guess is that the empirical plasmid fraction within the sample was very low, but the algorithm should be robust to cases like this. Can you share the contents of <prefix>_score_calibration/<prefix>_compositions.tsv?

These contigs were excluded from <prefix>_plasmid_summary.tsv because their calibrated scores were lower than 0.7 (default cutoff). You can try to use --relaxed to get more contigs in the summary file (see here).

If you want, you can see the calibrated scores of every single sequence (not just the ones classified as plasmid) in <prefix>_score_calibration/<prefix>_calibrated_aggregated_classification.tsv. This way you can compare the pre- and post-calibration scores for all contigs.
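For anyone who wants to do that comparison programmatically, here is a minimal sketch. It assumes both tables are tab-separated and have `seq_name` and `plasmid_score` columns (an assumption about the file layout, so check the header of your own files). The helper names `read_scores` and `score_shift` are made up for this example, and the data is a toy in-memory stand-in for the real files.

```python
import csv
import io


def read_scores(handle, column="plasmid_score"):
    """Map seq_name -> float score from a tab-separated classification table.

    The column names are assumptions; adjust them to match your files.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["seq_name"]: float(row[column]) for row in reader}


def score_shift(pre, post):
    """Return {seq_name: (pre_score, post_score)} for sequences in both tables."""
    return {name: (pre[name], post[name]) for name in pre if name in post}


# Toy data standing in for the uncalibrated and calibrated tables.
pre_tsv = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "ctg1\t0.2\t0.7\t0.1\n"
    "ctg2\t0.9\t0.05\t0.05\n"
)
post_tsv = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "ctg1\t0.03\t0.95\t0.02\n"
    "ctg2\t0.98\t0.01\t0.01\n"
)

pre = read_scores(io.StringIO(pre_tsv))
post = read_scores(io.StringIO(post_tsv))
shift = score_shift(pre, post)

for name, (before, after) in sorted(shift.items()):
    print(f"{name}\t{before:.2f}\t{after:.2f}")
```

With real output you would replace the `io.StringIO(...)` objects with `open(path, newline="")` handles for the uncalibrated and calibrated tables.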

zehanna · 2024-10-30T10:54:30Z

Hi, this is the output of *compositions.tsv:

model       chromosome  plasmid  virus
marker      0.9112      0.0145   0.0743
nn          0.5764      0.3175   0.1061
aggregated  0.8863      0.0394   0.0743

My confusion is because I was already looking into the file *calibrated_aggregated_classification.tsv, and there these contigs had plasmid scores of almost 1, but then they weren't included in the plasmid_summary.tsv in the *summary directory. So the scores must have been already calibrated at this point, and something must have excluded them from being identified as plasmid despite an almost 100% probability of being a plasmid (according to *calibrated_aggregated_classification.tsv)

apcamargo · 2024-11-02T01:46:25Z

Ohh, that's probably because they have negative marker enrichment (which is one of the post-classification filters). Just use --relaxed and these plasmids should appear in your summary file.
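To see which contigs are affected, one option is to list the sequences whose calibrated plasmid score passes the cutoff but which are missing from the summary file. The sketch below is only an illustration: it assumes both tables are tab-separated with a `seq_name` column and that the calibrated table has a `plasmid_score` column, the function name `high_score_but_unlisted` is hypothetical, and the rows are toy data rather than real geNomad output. The 0.7 default is the cutoff mentioned earlier in this thread.

```python
import csv
import io


def high_score_but_unlisted(calibrated_handle, summary_handle, cutoff=0.7):
    """List seq_names scoring >= cutoff that do not appear in the summary table.

    Column names (seq_name, plasmid_score) are assumptions about the file layout.
    """
    summary_names = {
        row["seq_name"] for row in csv.DictReader(summary_handle, delimiter="\t")
    }
    return [
        row["seq_name"]
        for row in csv.DictReader(calibrated_handle, delimiter="\t")
        if float(row["plasmid_score"]) >= cutoff
        and row["seq_name"] not in summary_names
    ]


# Toy data: ctg1 and ctg3 pass the cutoff, but only ctg1 made it into the summary.
calibrated_tsv = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "ctg1\t0.01\t0.98\t0.01\n"
    "ctg2\t0.95\t0.03\t0.02\n"
    "ctg3\t0.00\t0.99\t0.01\n"
)
summary_tsv = "seq_name\tplasmid_score\nctg1\t0.98\n"

missing = high_score_but_unlisted(
    io.StringIO(calibrated_tsv), io.StringIO(summary_tsv)
)
print(missing)
```

Contigs reported this way are candidates for having been removed by a post-classification filter, and rerunning with --relaxed should show whether they then appear in the summary.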
