feat: mass spectrometry data support #18
Merged
This is the first part of adding support for proteomics/metabolomics metadata into the schema.
This PR focuses on adding required fields into the metadata schema.
In a second part, we will implement parsers in modos-api to auto-populate the zarr metadata by extracting from metabolomics data files.
Important changes are in src/modos_schema/schema/modos_schema.yaml.
Context.
We want to support mass spectrometry results from proteomics / metabolomics while keeping the schema as simple as possible.
The starting point is the mzTab format (specs here), a tabular format consisting of one metadata section followed by several tables.
The data is essentially a table quantifying the molecules present in a collection of samples.
Here is an example of fields extracted from a metadata section (MTD):
We can already represent the following fields:
Note: even if we don't include some fields in the metadata schema (e.g. mzTab-version), they would still be retrievable from the file itself through the API.
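To make the MTD layout concrete, here is a minimal Python sketch of pulling fields out of a metadata section. It is not the modos-api parser planned for the second part; it only assumes that mzTab lines are tab-separated and that metadata lines start with the MTD prefix. The sample text, its values, and the helper names `parse_mtd` and `group_indexed` are made up for illustration.

```python
import re

# Illustrative mzTab fragment; every value here is a placeholder, not real data.
MZTAB_TEXT = "\n".join([
    "COM\tillustrative fragment only",
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tsample_processing[2]\tplaceholder step B",
    "MTD\tsample_processing[1]\tplaceholder step A",
    "SMH\tidentifier\tdescription",
])

# Matches keys of the form name[n], e.g. sample_processing[1].
INDEXED_KEY = re.compile(r"^(?P<name>[^\[\]]+)\[(?P<index>\d+)\]$")


def parse_mtd(text):
    """Collect key/value pairs from MTD lines; all other line types are skipped."""
    fields = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 3 or parts[0] != "MTD":
            continue
        fields[parts[1]] = parts[2]
    return fields


def group_indexed(fields, name):
    """Return the values of name[1], name[2], ... ordered by their index."""
    found = []
    for key, value in fields.items():
        match = INDEXED_KEY.match(key)
        if match and match.group("name") == name:
            found.append((int(match.group("index")), value))
    return [value for _, value in sorted(found)]


fields = parse_mtd(MZTAB_TEXT)
print(fields["mzTab-version"])
print(group_indexed(fields, "sample_processing"))
```

Fields that are left out of the schema, such as mzTab-version, remain reachable this way because they are still present in the file itself.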
Changes
Based on @htmonkey's suggestions (sdsc-ordes/modos-api#91 (comment)), it seems we need at least these changes:
- MassSpectrometryResults, a subclass of DataEntity for mass spectrometry quantification results.
- has_sample_processing on Assay (MTD sample_processing[n])
Challenges / questions
1. In MODOS, we traditionally have MODOS -(has)-> Assay -(has)-> DataEntity, and samples can be attached to Assay and/or DataEntity.
A single mzTab file can contain hundreds of samples and assays, each of which is a single line in the table.
I am not sure if this is an issue, but this will bloat the metadata with a lot of redundancy.
2. We will likely need to add properties to the schema to represent mzTab files in a meaningful way.
It is sometimes unclear which class a property should be added to; in this case, the MTD sample_processing[n] property seems tied to the entire mzTab file and not to an individual sample. Does that mean that all assays and samples in the file must have the same sample processing? @htmonkey
An alternative would be to attach the sample_processing directly to MassSpectrometryResults to reduce redundancy.
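The redundancy trade-off between the two placements can be sketched in Python. This is a rough illustration, not the schema itself: the dataclasses below are hypothetical stand-ins for the schema classes, only `has_sample_processing` comes from this PR, and the count of 200 assays and the step names are arbitrary placeholders.

```python
import json
from dataclasses import asdict, dataclass, field

# Placeholder processing steps shared by every assay in one hypothetical file.
STEPS = ["placeholder step A", "placeholder step B"]
N_ASSAYS = 200  # arbitrary size chosen for the illustration


@dataclass
class Assay:
    """Hypothetical stand-in for the schema's Assay class."""
    id: str
    has_sample_processing: list[str] = field(default_factory=list)


@dataclass
class MassSpectrometryResults:
    """Hypothetical stand-in; in the schema this subclasses DataEntity."""
    id: str
    has_sample_processing: list[str] = field(default_factory=list)


def count_mentions(entities, needle):
    """Count how often a string appears in the serialized metadata."""
    return json.dumps([asdict(e) for e in entities]).count(needle)


# Placement 1: every Assay repeats the file-level processing steps.
per_assay = [
    Assay(id=f"assay_{i}", has_sample_processing=list(STEPS))
    for i in range(N_ASSAYS)
]

# Placement 2: the steps are stored once, on the results entity.
per_file = [Assay(id=f"assay_{i}") for i in range(N_ASSAYS)]
per_file.append(
    MassSpectrometryResults(id="results", has_sample_processing=list(STEPS))
)

print(count_mentions(per_assay, STEPS[0]))
print(count_mentions(per_file, STEPS[0]))
```

With the per-assay placement, each step is serialized once per assay; with the per-file placement it is serialized once, at the cost of no longer being able to express per-assay differences in processing.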