Data generation for PDBDev reports #79

Open
sureshhewabi opened this issue Sep 19, 2024 · 11 comments
Labels
CrossLinkingValidationLib Changes related with Crosslinking validations

Comments

@sureshhewabi
Collaborator

No description provided.

@sureshhewabi added the CrossLinkingValidationLib label on Sep 19, 2024
@aozalevsky

@sureshhewabi what's the plan? I'm happy to help/test.

@sureshhewabi
Collaborator Author

@colin-combe is the expert on the MzIdentML parser, and I hope he can help you with this. @aozalevsky, could you please describe your requirements here? I can also help @colin-combe with this.

@colin-combe

I have a half-working way of doing this: creating an SQLite DB and then querying it using the queries from the API endpoints.

So, as long as people are fine with this temporary SQLite file being created, this is fairly straightforward, and I'll have something for you to look at and test next week.

@aozalevsky

Sure, I'm happy to start testing ASAP. The SQLite implementation sounds OK to me. Plus, we can use an in-memory SQLite database to avoid dealing with additional files and OS locks.

@aozalevsky

My requirements (from the previous issue):

  • Ideally, I'd like to get output similar to the current API output. Basically, we need sequences (some ID + sequence) plus residue pairs. Keeping the JSON-formatted output would be nice, too.

  • Calling (import + call) as a library would be ideal, but making a subprocess CLI call is also acceptable.
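
A rough sketch of the output shape described in these requirements. Everything here is an assumption made for illustration: the table names (`sequence`, `residue_pair`), the column names, the JSON keys, and the helper `sequences_and_residue_pairs` are hypothetical and are not taken from the converter or the API.

```python
import json
import sqlite3


def sequences_and_residue_pairs(conn: sqlite3.Connection) -> str:
    """Return sequences and residue pairs as a JSON string (hypothetical layout)."""
    sequences = [
        {"id": seq_id, "sequence": seq}
        for seq_id, seq in conn.execute(
            "SELECT id, sequence FROM sequence ORDER BY id"
        )
    ]
    residue_pairs = [
        {"prot1": p1, "pos1": pos1, "prot2": p2, "pos2": pos2}
        for p1, pos1, p2, pos2 in conn.execute(
            "SELECT prot1, pos1, prot2, pos2 FROM residue_pair "
            "ORDER BY prot1, pos1, prot2, pos2"
        )
    ]
    return json.dumps({"sequences": sequences, "residue_pairs": residue_pairs})


# Tiny made-up dataset, only to show the call as a library function.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sequence (id TEXT PRIMARY KEY, sequence TEXT);
    CREATE TABLE residue_pair (prot1 TEXT, pos1 INTEGER, prot2 TEXT, pos2 INTEGER);
    INSERT INTO sequence VALUES ('P1', 'MKTAYIAK'), ('P2', 'GSHMKLV');
    INSERT INTO residue_pair VALUES ('P1', 2, 'P2', 5);
    """
)
result = json.loads(sequences_and_residue_pairs(conn))
```

Returning a JSON string from an importable function would cover both cases listed above: a caller can import it directly, and a CLI wrapper could simply print the same string.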

@colin-combe

The version in #84 does have something for this, but the query for the residue pairs takes so long that it is kind of broken, I think.

It's currently trying to get all the residue pairs at once.

hmm. i'll look into it a bit more.

The entry point is json_sequences_and_residue_pairs: https://github.com/Rappsilber-Laboratory/xi-mzidentml-converter/blob/pride/parser/process_dataset.py#L119C5-L234
(though it actually returns a Python object, not JSON, at the moment). There's also a CLI option for it.
Any comments on how to improve things welcome.
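
One possible direction for the "all the residue pairs at once" problem is to read the result in batches rather than with a single `fetchall()`. The sketch below uses only the standard `sqlite3` module; the `residue_pair` table, its columns, and the helper `iter_residue_pairs` are hypothetical, and this is not necessarily what #84 does. Batching only limits how many rows are held in memory at a time; it does not make a slow query itself any faster.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_residue_pairs(
    conn: sqlite3.Connection, batch_size: int = 1000
) -> Iterator[List[Tuple[int, int]]]:
    """Yield residue pairs in lists of at most batch_size rows."""
    cursor = conn.execute(
        "SELECT pos1, pos2 FROM residue_pair ORDER BY pos1, pos2"
    )
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# Made-up data: 2500 rows, read back 1000 at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE residue_pair (pos1 INTEGER, pos2 INTEGER)")
conn.executemany(
    "INSERT INTO residue_pair VALUES (?, ?)",
    [(i, i + 1) for i in range(2500)],
)
batch_sizes = [len(batch) for batch in iter_residue_pairs(conn, batch_size=1000)]
```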

@colin-combe

it seems it does eventually work

@colin-combe

I'm testing it with PXD036833 (your main test dataset, right, @aozalevsky?), which takes a long time to parse anyway.
If it works, then maybe all that's needed is to add the JSON encoding to have a usable version of this.

@aozalevsky

great! i'll check it out and see if i can help/profile the query and/or the code. Now we also have PXD035508, PXD035519, and PXD035362 if that helps.

@colin-combe

I added JSON encoding for it to #84.

It takes a long time. I think SQLite doesn't like all those joins; it seems like it was less of a problem with Postgres.

@colin-combe

OK, added another commit to #84.
That issue with the query for residue pairs taking ages should be fixed: 793cb33

The time to get the summary of sequences and residue pairs shouldn't be much more than the time to parse ('convert') the file.

The thing that's not working is the in-memory SQLite DB. I'm trying to share the same in-memory SQLite DB between parts of the code that make separate connections to it (using the connection string defined at https://github.com/Rappsilber-Laboratory/xi-mzidentml-converter/blob/pride/parser/process_dataset.py#L169).

But I think they're not getting the same in-memory DB, and it isn't working. Maybe someone can make some suggestions or help with this.
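
For reference, a minimal sketch of how separate connections can share one in-memory SQLite database, using only Python's standard `sqlite3` module. It assumes the SQLite build has shared-cache mode available; the database name `xi_converter_shared` and the `residue_pair` table are hypothetical, and the converter's own connection string may need a different form. The first half shows the likely cause of the symptom described above: each plain `:memory:` connection gets its own private database.

```python
import sqlite3

# Hypothetical name; every connection must use exactly the same URI.
SHARED_URI = "file:xi_converter_shared?mode=memory&cache=shared"

# Two plain ":memory:" connections each get a private, empty database.
private_a = sqlite3.connect(":memory:")
private_b = sqlite3.connect(":memory:")
private_a.execute("CREATE TABLE residue_pair (pos1 INTEGER, pos2 INTEGER)")
tables_seen_by_b = private_b.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

# With a named shared-cache URI, separate connections in the same process
# see the same data. The database exists only while at least one
# connection to it stays open.
writer = sqlite3.connect(SHARED_URI, uri=True)
writer.execute("CREATE TABLE residue_pair (pos1 INTEGER, pos2 INTEGER)")
writer.execute("INSERT INTO residue_pair VALUES (10, 42)")
writer.commit()

reader = sqlite3.connect(SHARED_URI, uri=True)
shared_rows = reader.execute("SELECT pos1, pos2 FROM residue_pair").fetchall()
```

Two things to watch with this approach: the sharing only works within a single process, and the data disappears as soon as the last connection closes, so one connection has to be kept open for the whole parse-then-query run.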


3 participants