Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update percolator to pepxml rewriting #917

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

chhh
Copy link
Collaborator

@chhh chhh commented Dec 7, 2022

I would greatly appreciate if you merged this change, it would make maintaining my fork a lot easier.

  • The main thing it does is add some structure to the spaghetti code in PercolatorOutputToPepXML.
  • Adds stox pepxml parsing and a test for what the rewriting actually does.

@chhh chhh requested review from guoci and fcyu December 7, 2022 23:15
@chhh chhh changed the base branch from master to develop December 7, 2022 23:15
@fcyu fcyu force-pushed the develop branch 3 times, most recently from 120e1d3 to 019ccd4 Compare December 8, 2022 01:01
final String basename = remove_rank_suffix(nameWithoutExt);
if(!basenames.add(basename))
//final String nameWithoutExt = FilenameUtils.removeExtension(pepxmlPath.getFileName().toString());
final String nameWithoutExt = PathUtils.removeExtension(pepxmlPath.getFileName().toString(), 2, 10);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chhh Why create a new PathUtils.removeExtension() to replace the existing FilenameUtils.removeExtension?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one has additional parameters - how many times to remove an extension (in case of files like file.raw.pep.xml) and a limit on the length of the extension, this catches cases when somebody puts a dot in the file name.
I didn't just replace it in order to replace it, there was a real life example where the original one from apache commons failed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why I forced MSFragger to always generate <file name>.pepXML. If you want to support <file name>.pep.xml, there will be a lot of places to change and test. I think many places, including the other tools used by FragPipe, assume that the extension is everything after the last dot.

Since your case will never happen if use MSFragger, I don't think it is necessary to implement this new function to make the things more complicated.

@fcyu
Copy link
Member

fcyu commented Dec 8, 2022

@chhh , there are a lot of changes. I need more time to review and test them. Will revisit this pull request when I have time later.

Thanks,

Fengchao

@chhh
Copy link
Collaborator Author

chhh commented Dec 8, 2022

@fcyu sure, the main changes are in PercolatorOutputToPepXML.percolatorToPepXML() method. The logic is kept the same, but everything is split into smaller functions.

There's a "test" PercolatorOutputToPepXMLTest with which you can run the function to try it out. Comment out Files.deleteIfExists(path) and you will get an interact- file in resources/percolator-to-pepxml/fragpipe-search-16_PXD022287_msfragger/interact-Human-Protein-Training_Trypsin.pep.xml

@chhh
Copy link
Collaborator Author

chhh commented Dec 20, 2022

@guoci @fcyu Is there any hope that you will merge this as a Christmas present?

@fcyu
Copy link
Member

fcyu commented Dec 20, 2022

Hi @chhh ,

I started to review the code but got interrupted by other things. I could merge it before next year, but I think we can do it better since you are almost re-writing the whole module.

I always think reading the pepXML file to a class and writing the modified file using SAX is better than manipulating the strings in an ad-hoc way. I left some comments in

for (final String e : search_hit_line.split("\\s")) { // fixme: the code assumes that all attributes are in one line, which makes it not robust

and
while (iterator.hasNext()) { // fixme: the code assumes that there are always <search_hit, massdiff=, and calc_neutral_pep_mass=, which makes it not robust

Batmass-io has the module to read pepXML file, so I think we could use that. I discussed with Guo Ci but no one had the time to do it. Do you think it would be a better idea, especially when you want to make the pepxml rewriting robust and support other flavor?

Best,

Fengchao

@chhh
Copy link
Collaborator Author

chhh commented Dec 20, 2022

  1. Goal number one is to make it at least a little more readable and less error prone.
    Look at current percolatorToPepxml() vs the proposed percolatorToPepxml()
    Instead of one huge inline function 230 lines long which is impossible to really understand what it's doing the new version is only 70 lines despite being able to handle 2 different search engines.

This conversion clearly splits this process into smaller function each related to reading information from files or writing to files.
I suggest to first verify that the results of this conversion are correct. And merge it as is.

  1. I wanted to remove all the string matching for xml stuff, but this was too much to do in one go. I did start though, some parameters from pepxml files are now being parsed using a stox parser, look at StoxParserPepxml (should have been Stax, it's a typo) and its usage in PercolatorOutputToPepXML.

I wouldn't use JAXB here because it has to parse the whole file into memory and is relatively slow. Also very memory intensive, the in-memory representation generated by JAXB parser is often larger than the original file on disk (because of all the lists it creates internally). And combined interact files can actually be really huge in some cases, so I'd suggest to only resort to stax parsing.

  1. In the new code the whole "rewriting" part is contained in a single function createInteractFile(), actually just in one 80-line block inside it. So all of that "use sax/stax/jaxb etc to rewrite file" is just about updating these 80 lines to use stax instead of reading files line by line. But it still requires step 1 to be done.

@fcyu fcyu requested review from fcyu and removed request for guoci December 24, 2022 03:17
@fcyu fcyu self-assigned this Dec 24, 2022
@fcyu
Copy link
Member

fcyu commented Dec 24, 2022

Hi @chhh ,

Thank you for the explanation and effort. I have briefly reviewed the code. I think the changes can be classified into:

  1. Variable renaming and minor refactor
  2. Adding Comet support
  3. Code refactor to make it more robust (not so sure) and easy to maintain.

I think for change type 1, need to undo them to make the other changes easy to track. For change type 2, also need to undo since FragPipe don't support Comet. Leave those code in FragPipe would confuse others, including me ;). I also don't think I have the bandwidth to support both search engines in the future.

After reverting type 1 and 2, I will review the code and merge them if they pass the tests.

Merry Christmas,

Fengchao

@asalt
Copy link

asalt commented Oct 14, 2024

I recently got an error that seems related to this:

Cannot find output_report_topN parameter from .... pepXML
Process 'Percolator: Convert to pepxml' finished, exit code: 1
Process returned non-zero exit code, stopping.

This was after many previous files successfully ran through percolator, making me think it is a bug.

log file attached-
log_2024-10-14_15-08-30.txt

@fcyu
Copy link
Member

fcyu commented Oct 14, 2024

Hi @asalt, I don't think your error is related to this pull request because it hasn't been merged.

Please do not send questions or issues to pull requests. Please re-submit it to https://github.com/Nesvilab/FragPipe/issues

Best,

Fengchao

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants