Support for the first Mascot submission #71

Closed
ypriverol opened this issue Jul 24, 2024 · 11 comments

@ypriverol

ypriverol commented Jul 24, 2024

@sureshhewabi is working on the first Mascot submission with some data from the Mascot team. An issue was found while parsing the MGF; a pyteomics issue has already been created for it: levitsky/pyteomics#153.

This issue is related to support for the main search engines, #63.

@colin-combe

colin-combe commented Jul 24, 2024

> first Mascot submission with some data from the Mascot team

Great!

Could you share the mzIdentML file with me, please? There was something I wanted to check in it. (Something I thought I saw in another Mascot-generated mzIdentML file recently, to do with repetition of the same peptide.)

@sureshhewabi
Collaborator

@colin-combe I copied the files to Dropbox and will share the FTP details with you.

@colin-combe

Thanks.

I think there's something not right in these mzIdentML files, but not something that will stop them working in our system.

The mzid specification states:

> The combination of Peptide sequence and modifications MUST be unique in the file.
(Section 6.48.)

There is a complication regarding peptide uniqueness when it comes to the crosslinked peptides. Setting that aside and just looking at the 'linear' (uncrosslinked) peptides, it seems that in the Mascot output they are not unique but instead repeated every time they are identified.

This is OK for us, it works, but it's suboptimal. It bloats the files unnecessarily, then our database, and then the xiview web page takes longer to load because it is being sent duplicates of all the peptides.

I think it's worth taking this up with them to see what they say. (@vrkosk ?)
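
The uniqueness rule quoted above can be checked on a file before loading it. The following is a minimal sketch using only the Python standard library; the function names (`local`, `peptide_key`, `count_duplicate_peptides`) and the sample document are hypothetical, not part of Mascot, pyteomics, or xiview. It assumes linear peptides only and treats the peptide sequence plus the attributes and cvParams of any modification children as the uniqueness key; crosslinked peptides would need a pair-level key instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip an XML namespace prefix such as '{namespace-uri}Peptide'."""
    return tag.rsplit("}", 1)[-1]


def peptide_key(peptide):
    """Build a (sequence, modifications) key for one <Peptide> element."""
    seq = ""
    mods = []
    for child in peptide:
        name = local(child.tag)
        if name == "PeptideSequence":
            seq = (child.text or "").strip()
        elif name in ("Modification", "SubstitutionModification"):
            # The element's attributes plus its cvParams describe the modification.
            params = tuple(sorted(
                (p.get("accession", ""), p.get("value", ""))
                for p in child if local(p.tag) == "cvParam"
            ))
            mods.append((name, tuple(sorted(child.attrib.items())), params))
    return seq, tuple(sorted(mods))


def count_duplicate_peptides(xml_text):
    """Return {key: count} for every peptide key that occurs more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        peptide_key(el) for el in root.iter() if local(el.tag) == "Peptide"
    )
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical miniature document, not taken from the submitted files.
SAMPLE = """
<SequenceCollection>
  <Peptide id="p1"><PeptideSequence>SPDKPGK</PeptideSequence></Peptide>
  <Peptide id="p2"><PeptideSequence>SPDKPGK</PeptideSequence></Peptide>
  <Peptide id="p3"><PeptideSequence>SPDKPGK</PeptideSequence></Peptide>
  <Peptide id="p4"><PeptideSequence>AAK</PeptideSequence></Peptide>
</SequenceCollection>
"""

duplicates = count_duplicate_peptides(SAMPLE)
print(duplicates)
```

An empty result means no linear peptide is repeated under this key; any entry lists a sequence/modification combination together with how many `<Peptide>` elements carry it.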

@vrkosk

vrkosk commented Jul 24, 2024

@colin-combe Do you mean cases like:

    <Peptide id="peptide_162_1">
      <PeptideSequence>SPDKPGK</PeptideSequence>
    </Peptide>
    <Peptide id="peptide_163_1">
      <PeptideSequence>SPDKPGK</PeptideSequence>
    </Peptide>
    <Peptide id="peptide_164_1">
      <PeptideSequence>SPDKPGK</PeptideSequence>
    </Peptide>

I see what you mean. Mascot is currently taking a very PSM-centric view. The above are duplicate identifications of the same peptide in sequential Mascot queries. I agree it would be better if Mascot collated them into something like:

    <Peptide id="peptide_SPDKPGK">
      <PeptideSequence>SPDKPGK</PeptideSequence>
    </Peptide>

And wherever peptide_ref="peptide_162_1" is used, replace it with peptide_ref="peptide_SPDKPGK". This would reduce duplication in <PeptideEvidence> elements as well, which currently repeat the start and end position and pre and post residues needlessly:

    <PeptideEvidence id="PE_162_1_1_EWas03_0_236_242" start="236" end="242" pre="R" post="G" peptide_ref="peptide_162_1" isDecoy="false" dBSequence_ref="DBSeq_1_EWas03" />
    <PeptideEvidence id="PE_163_1_1_EWas03_0_236_242" start="236" end="242" pre="R" post="G" peptide_ref="peptide_163_1" isDecoy="false" dBSequence_ref="DBSeq_1_EWas03" />
    <PeptideEvidence id="PE_164_1_1_EWas03_0_236_242" start="236" end="242" pre="R" post="G" peptide_ref="peptide_164_1" isDecoy="false" dBSequence_ref="DBSeq_1_EWas03" />

I'll add a change request.
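
The collation described above can be sketched as a post-processing step. This is only an illustration under stated assumptions, not how Mascot does or will do it: `collate_peptides` is a hypothetical helper, it merges only unmodified peptides (a `<Peptide>` whose sole child is `<PeptideSequence>`), and it keeps the id of the first occurrence rather than generating a new id such as `peptide_SPDKPGK`. It also leaves the now-redundant `<PeptideEvidence>` elements in place, since collapsing those would additionally require rewriting every reference to them.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{namespace-uri}Peptide'."""
    return tag.rsplit("}", 1)[-1]


def collate_peptides(root):
    """Merge duplicate unmodified <Peptide> elements and repoint peptide_ref.

    Returns a dict mapping each removed peptide id to the id that replaces it.
    """
    canonical = {}  # sequence -> id of the first <Peptide> seen with it
    remap = {}      # removed duplicate id -> canonical id

    # Snapshot the elements first so removals do not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if local(child.tag) != "Peptide":
                continue
            seq_el = next(
                (c for c in child if local(c.tag) == "PeptideSequence"), None
            )
            # Assumption: skip anything carrying modifications or other children.
            if seq_el is None or len(child) != 1:
                continue
            seq = (seq_el.text or "").strip()
            pid = child.get("id")
            if seq in canonical:
                remap[pid] = canonical[seq]
                parent.remove(child)
            else:
                canonical[seq] = pid

    # Repoint every peptide_ref that targeted a removed duplicate.
    for el in root.iter():
        ref = el.get("peptide_ref")
        if ref in remap:
            el.set("peptide_ref", remap[ref])
    return remap
```

Applied to the three `SPDKPGK` entries shown earlier, this would leave one `<Peptide>` and point all three `peptide_ref` attributes at it.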

@colin-combe

yes, cases like that.

@colin-combe

it's a little more complicated with the crosslinked peptides, where it's the crosslinked pair of peptides that is meant to be unique

@colin-combe

Is it weird that in these files there are things like:

    <SpectrumIdentificationItem id="SII_21990_3" calculatedMassToCharge="1142.5857215" chargeState="2" experimentalMassToCharge="1142.5827" peptide_ref="peptide_21990_3" rank="3" passThreshold="true">

So the rank is 3, but it has passThreshold = true? @vrkosk ?

@colin-combe

...I guess it's probably meant to be like this; I guess there's no reason why not.

@vrkosk

vrkosk commented Jul 31, 2024

A Mascot PSM is significant if expect value < sigthreshold. This is encoded as passThreshold = true in the mzIdentML export. It's perfectly possible for the rank 1, 2 and 3 matches to have a similar score and, thus, similar expect values, all of which are statistically significant. Because the ranks are ordered by score, if rank 3 has passThreshold = true, then ranks 1 and 2 must also have passThreshold = true. (I don't think this is a rule that needs to be coded anywhere, just pointing it out here.)
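
Although the rule above does not need to be coded anywhere, it can serve as an optional sanity check when reading a file. The sketch below is a hypothetical helper, not part of any existing tool: `inconsistent_ranks` and `check_results` are invented names, and it assumes each `<SpectrumIdentificationItem>` carries `rank` and `passThreshold` attributes and sits directly inside a `<SpectrumIdentificationResult>`.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{namespace-uri}Peptide'."""
    return tag.rsplit("}", 1)[-1]


def inconsistent_ranks(items):
    """Given (rank, passed) pairs for one spectrum, return the ranks that
    fail the threshold even though a worse (higher-numbered) rank passes."""
    items = list(items)
    passing = [rank for rank, passed in items if passed]
    if not passing:
        return []
    worst_passing = max(passing)
    return sorted(
        {rank for rank, passed in items if not passed and rank < worst_passing}
    )


def check_results(xml_text):
    """Map each SpectrumIdentificationResult id to its inconsistent ranks."""
    root = ET.fromstring(xml_text)
    problems = {}
    for result in root.iter():
        if local(result.tag) != "SpectrumIdentificationResult":
            continue
        items = [
            # xsd:boolean allows both "true" and "1" for a true value.
            (int(sii.get("rank")), sii.get("passThreshold") in ("true", "1"))
            for sii in result
            if local(sii.tag) == "SpectrumIdentificationItem"
        ]
        bad = inconsistent_ranks(items)
        if bad:
            problems[result.get("id")] = bad
    return problems
```

A result in which ranks 1, 2 and 3 all pass is reported as consistent; only a failing rank sitting above a passing one is flagged.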

@colin-combe

ok, thanks.
I didn't forget about this btw - Rappsilber-Laboratory/build-xiview#87

@ypriverol
Author

@colin-combe @sureshhewabi, as soon as we are sure these files will work, let me know so we can prepare the submission for the PRIDE Archive. Excellent work. Thanks, @vrkosk, for your support; the Mascot team has always been responsive and helpful.
