HDExaminer data support #349

Open
Jhsmit opened this issue Jun 13, 2024 · 3 comments
@Jhsmit
Owner

Jhsmit commented Jun 13, 2024

PyHDX currently only directly accepts data formatted as 'state data' output from DynamX.

This issue continues the discussion opened by @tuttlelm in #348:

Related to coming from HDExaminer data (and I can open a separate issue for that topic if that would be more appropriate), pyHDX currently does not allow duplicate measurements when creating the HDXMeasurement object. As far as I can tell, having replicates isn't an issue for any of the downstream calculations, but I wondered if you had thoughts on that. I was able to make some simple modifications to models.py so that I can leave replicates in my data and not have to average the replicates first (basically just data.reset_index() in the __init__() function and adding "index" as a column wherever the code sorts or pivots on the columns).
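For readers following along, here is a minimal pandas sketch of the idea in the quoted comment. It is not the actual models.py code: the table and its column names (start, end, exposure, uptake) are made up for illustration. It only shows why duplicate measurements break a pivot and how reset_index() plus an extra "index" key avoids that.

```python
import pandas as pd

# Hypothetical peptide table: two replicates per peptide/exposure combination.
data = pd.DataFrame(
    {
        "start": [1, 1, 1, 1],
        "end": [10, 10, 10, 10],
        "exposure": [30.0, 30.0, 60.0, 60.0],
        "uptake": [2.0, 2.2, 3.0, 3.4],
    }
)

# Pivoting on the peptide columns alone fails, because each
# (start, end) / exposure combination now occurs more than once.
try:
    data.pivot(index=["start", "end"], columns="exposure", values="uptake")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# reset_index() adds an "index" column holding the row number. Using it as an
# extra key makes every row unique, so each replicate keeps its own row.
indexed = data.reset_index()
wide = indexed.pivot(
    index=["index", "start", "end"], columns="exposure", values="uptake"
)
```

In this sketch each replicate ends up on its own row of `wide`, with NaN in the columns for the other exposures, which is also why replicates show up as separate entries in anything drawn from such a table.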

It would be great to add support for other file formats such as HDExaminer data.

A couple of questions:
Why would you prefer to leave the replicates in the data rather than averaging them before creating the HDXMeasurement object? Do you want to perform downstream calculations on each replicate individually?

In the latter case, would it make sense to make one HDXMeasurement object per replicate?

Perhaps you could share your input script or make a pull request with your changes to models.py?
To be honest, I think the current HDXMeasurement object has become rather clumsy to work with. I'm planning to change it in the future (probably in the form of a different project altogether).

There is also the hdxms-datasets package, which is still in a beta phase. Maybe you can also share your thoughts on this. The idea there is a dataset format with a .yaml specification example containing all required metadata, so that downstream packages like PyHDX can load data from it directly. Ultimately, it would be nice to add support there for 1) cluster data (replicates), 2) HDExaminer output, and 3) other formats.

Again, only DynamX state data is currently supported there as well, simply because that's the only example data I have at the moment.
Do you have any example datasets of HDExaminer data you can share, and/or example scripts showing how you load the data?

@Arthanis58

Hello,
I would also welcome HDExaminer support. However, as I am not good with Python, I am sticking to the web GUI, and it would be very helpful if I could input the exposure times in seconds. Is there any way to do that now with the batch .yaml file definitions?

@tuttlelm

I have created a pull request that includes the models.py changes and an additional script convert_data.py for the HDExaminer conversion.

One main reason for keeping replicates is that we use that type of data for other HDX-MS statistical analysis packages, so it is nice to be able to work with the same original data file for different applications. Leaving the data as replicates does tend to inflate the coverage plots, but I appreciate being able to see any replicate-to-replicate variability there (obviously there are other ways to do this as well). My preference is to keep the replicates within a single HDXMeasurement object. The per-residue calculations take care of the replicate averaging.

Is the hdxms-datasets project for the raw data or just the analysis outputs? I'd be very interested in something that can translate between different raw data formats and metadata specifications. In my pyHXExpress project I have the opposite problem from you: I have access only to HDExaminer outputs and not so much to DynamX-type data and outputs.

Currently all of the HDExaminer outputs I am working with are for unpublished projects, but I'll see if I can track down something I am able to share.

@Jhsmit
Owner Author

Jhsmit commented Jun 19, 2024

With respect to hdxms-datasets, the scope at the moment is output in the form of peptide d-uptake tables. Currently this is a format where replicates are averaged together, but preferably the format would support keeping the replicates and let downstream software decide how to treat them. This way statistical testing can still be done on the datasets.
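As a rough illustration of "let downstream software decide", the sketch below keeps one row per replicate and collapses them only when asked. The table and its column names are hypothetical and not part of hdxms-datasets; only standard pandas calls are used.

```python
import pandas as pd

# Hypothetical long-format table: one row per peptide, exposure and replicate.
peptides = pd.DataFrame(
    {
        "start": [1, 1, 1, 1],
        "end": [10, 10, 10, 10],
        "exposure": [30.0, 30.0, 60.0, 60.0],
        "replicate": [1, 2, 1, 2],
        "uptake": [2.0, 2.5, 3.0, 4.0],
    }
)

# One possible downstream choice: collapse replicates into the mean, the
# sample standard deviation and the number of replicates per peptide/exposure.
summary = (
    peptides.groupby(["start", "end", "exposure"])["uptake"]
    .agg(["mean", "std", "count"])
    .reset_index()
)
```

Because the unaveraged table is still available, a statistics package can work from `peptides` directly, while a tool that expects averaged input can use `summary`.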

The format doesn't have to be the same for every file, so there can be DynamX-formatted peptide output files or HDExaminer-formatted output files, as long as the metadata specifies which format it is. A reader function can then take that metadata and read the tables according to the format used.
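A minimal sketch of such a metadata-driven reader could look like the following. Everything in it is an assumption made for illustration: the format identifiers ("DynamX_state", "HDExaminer"), the "format" metadata key, and the function names are hypothetical, and the two readers are placeholders that simply parse CSV rather than handling the real column layouts.

```python
import csv
import io
from typing import Callable, Dict, List

Table = List[Dict[str, str]]


def read_dynamx_state(text: str) -> Table:
    """Placeholder: a real reader would handle DynamX-specific columns."""
    return list(csv.DictReader(io.StringIO(text)))


def read_hdexaminer(text: str) -> Table:
    """Placeholder: a real reader would handle the HDExaminer layout."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical format identifiers, as they might appear in dataset metadata.
READERS: Dict[str, Callable[[str], Table]] = {
    "DynamX_state": read_dynamx_state,
    "HDExaminer": read_hdexaminer,
}


def read_peptide_table(text: str, metadata: dict) -> Table:
    """Pick the reader named by metadata["format"] and apply it to the text."""
    fmt = metadata["format"]
    try:
        reader = READERS[fmt]
    except KeyError:
        raise ValueError(f"Unsupported peptide table format: {fmt!r}") from None
    return reader(text)


example = "start,end,exposure,uptake\n1,10,30,2.0\n"
rows = read_peptide_table(example, {"format": "HDExaminer"})
```

Supporting another format would then mean adding one reader function and one entry in `READERS`, without touching the calling code.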

Ideally there should also be some agreement between users on which fields the returned dataframes contain: e.g. is it 'time', 'exposure' or 'exposure_time' (and in which units); 'd-uptake' or 'uptake'; should there be an 'm0' field; etc.
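To make the naming question concrete, here is a small sketch of how agreed field names and units could be enforced once a convention exists. The choice of "exposure" and "uptake" as canonical names, the alias list, and seconds as the unit are all assumptions for the sake of the example, not a decision made in this thread.

```python
# Hypothetical aliases mapped onto one agreed name per field.
ALIASES = {
    "time": "exposure",
    "exposure_time": "exposure",
    "d-uptake": "uptake",
    "d_uptake": "uptake",
}

# Conversion factors to seconds (assumed here as the agreed exposure unit).
TO_SECONDS = {"s": 1.0, "min": 60.0, "h": 3600.0}


def canonical_name(name: str) -> str:
    """Return the agreed field name, leaving unknown names unchanged."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def exposure_in_seconds(value: float, unit: str) -> float:
    """Convert an exposure time given in s, min or h to seconds."""
    return value * TO_SECONDS[unit]


columns = ["Start", "End", "Exposure_Time", "d-uptake"]
renamed = [canonical_name(c) for c in columns]
```

A mapping built this way could also be passed to pandas `DataFrame.rename(columns=...)` when the tables are loaded as dataframes.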
