Transcriptomics metadata template #75

Open
JolandaS opened this issue Mar 4, 2020 · 8 comments

@JolandaS
Collaborator

JolandaS commented Mar 4, 2020

Determine which ontologies to use for transcriptomics data (metadata templates).

@PeterWoollard
Collaborator

PeterWoollard commented Mar 11, 2020

Key transcriptomics-related entities for FAIR, and some ontologies for them, include:

Key searching ontologies

  • Species - NCBI taxonomy Scientific name + ID
  • Tissue - Uberon term and ID
  • Cell type - CL term and ID
  • Disease - no single solution? Mondo? DO? MeSH
  • Phenotype/Trait - in humans typically HPO, other mammals: MPO, beyond mammals?
  • Experiment Type e.g. RNASeq, CITESeq etc. - EFO

Key searching entities (not ontologies)

  • gene/protein - one of ENSEMBL/ENTREZ_GENE/UNIPROT/HGNC ID + HGN
  • compound - UniChem (ChEBI etc.) + SMILES?
  • metabolites -?
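
As a rough illustration of the list above, the sketch below pairs each searchable field with the ontology prefixes suggested for it and checks that an annotation carries both a label and an ID from an allowed ontology. The field names, the `Annotation` class and `check_annotations` are hypothetical, and the CURIE prefixes are the ones these ontologies commonly use; treat them as assumptions to verify, not as an agreed template.

```python
from dataclasses import dataclass

# Hypothetical mapping from template field to allowed ontology prefixes,
# based on the suggestions in this thread (not an agreed standard).
FIELD_PREFIXES = {
    "species": {"NCBITaxon"},
    "tissue": {"UBERON"},
    "cell_type": {"CL"},
    "disease": {"MONDO", "DOID"},
    "phenotype": {"HP", "MP"},
    "experiment_type": {"EFO"},
}


@dataclass(frozen=True)
class Annotation:
    label: str  # human-readable term, e.g. "liver"
    curie: str  # compact identifier, e.g. "UBERON:0002107"


def check_annotations(record: dict[str, Annotation]) -> list[str]:
    """Return a message for every annotation whose ID is not from an allowed ontology."""
    problems = []
    for field_name, allowed in FIELD_PREFIXES.items():
        annotation = record.get(field_name)
        if annotation is None:
            continue  # whether a field is mandatory is a separate (cardinality) question
        prefix, separator, local_id = annotation.curie.partition(":")
        if not separator or not local_id or prefix not in allowed:
            problems.append(
                f"{field_name}: '{annotation.curie}' is not from {sorted(allowed)}"
            )
    return problems


record = {
    "species": Annotation("Homo sapiens", "NCBITaxon:9606"),
    "tissue": Annotation("liver", "UBERON:0002107"),
    # Placeholder MeSH-style ID, used only to show a rejected prefix.
    "disease": Annotation("example disease", "MESH:D000000"),
}
for problem in check_annotations(record):
    print(problem)
```

Keeping label and ID together means a record stays readable for people and resolvable for machines, which is what the "term and ID" pairs in the list ask for.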

@JolandaS
Collaborator Author

Define own minimal set of metadata, recommendations. Selection criteria for ontologies used.

@daniwelter

For disease, I would use MONDO (possibly supplemented with NCIt for cancers) as it is currently the most actively developed, so most likely to respond quickly to any change requests. I definitely wouldn't use MeSH. Agreed on all the other ontologies. I'd also add

  • Cell location/cycle - GO
  • Developmental stage - HSAPDV/Uberon
  • chemical compounds - ChEBI

Searching entities
Again, agreed on most of the suggestions.
Metabolites - MetaboLights compound accession, ChEBI

@AlasdairGray
Collaborator

Define own minimal set of metadata, recommendations. Selection criteria for ontologies used.

Bioschemas may be an appropriate approach here to define a minimal metadata record that would be searchable on the web.
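
To make the idea concrete, here is a minimal sketch of such a record as schema.org `Dataset` JSON-LD, built in Python. Bioschemas profiles build on schema.org types, but the choice of properties below, the `dataset_jsonld` helper and all example values are assumptions for illustration; the current Bioschemas Dataset profile should be checked for what is actually required.

```python
import json


def dataset_jsonld(name, description, identifier, url, keywords):
    """Build a schema.org Dataset record as a plain dict (hypothetical helper)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": identifier,
        "url": url,
        "keywords": keywords,
    }


# All values are made-up placeholders.
record = dataset_jsonld(
    name="Example bulk RNA-seq study",
    description="Placeholder description of a transcriptomics dataset.",
    identifier="EXAMPLE-0001",
    url="https://example.org/studies/EXAMPLE-0001",
    keywords=["transcriptomics", "RNA-seq"],
)
print(json.dumps(record, indent=2))
```

The serialised JSON would typically be embedded in the dataset's landing page so that web crawlers can pick it up.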

@karsten-quast

karsten-quast commented Apr 1, 2020

I tried to compile a potential starting point for a recipe. Hope it makes sense to you. Really looking forward to your thoughts. Maybe we can flesh this out.

Task

  • Generate metadata template for bulk NGS data generated at different sources following different standards

Define competency questions

  • What are the questions you would like to address with the template?

Defining Minimal Set Of Metadata (MSOM) according to these questions

  • Compile metadata from different sources
  • Generate consolidated view on metadata by merging attributes as far as possible
  • Differentiate metadata available for most of the studies from metadata occurring rarely (sparse matrix)
  • Identify gaps in the metadata available for most of the studies, i.e. data that is considered important but has not been captured in the past
  • Define a MSOM to be captured in the future from the metadata that is available for most of the studies and the metadata considered to be important
  • Identify available community standards regarding minimal sets of metadata
  • Add metadata attributes from those community standards to the MSOM, if they are not yet included
  • Assign cardinality to the MSOM (identify mandatory metadata and how many times the attributes may be reported. Some metadata might not be mandatory but are still important to capture, if available)
  • Identify appropriate ontologies representing your data and establish an application ontology (see recipe 4 of UC3)
  • Assign, as far as possible, ontologies to the MSOM and the sparse matrix
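
The first steps above (consolidate, separate common from rare attributes, then add what is considered important) could be sketched as follows. The 80% threshold, the function names and the example attributes are assumptions for illustration only; what counts as "most of the studies" is a decision for the group.

```python
from collections import Counter


def split_attributes(studies, threshold=0.8):
    """Split attributes into those present in most studies and the sparse rest.

    `studies` is a list of dicts (attribute name -> value), one per study,
    already merged to consolidated attribute names.
    """
    counts = Counter(attr for study in studies for attr in study)
    total = len(studies)
    common = sorted(a for a, c in counts.items() if c / total >= threshold)
    sparse = sorted(a for a, c in counts.items() if c / total < threshold)
    return common, sparse


def define_msom(common, important):
    """MSOM = attributes available for most studies plus those considered important."""
    return sorted(set(common) | set(important))


studies = [
    {"species": "Homo sapiens", "tissue": "liver", "age": "54"},
    {"species": "Homo sapiens", "tissue": "kidney"},
    {"species": "Mus musculus", "strain": "C57BL/6"},
]
common, sparse = split_attributes(studies, threshold=0.6)
msom = define_msom(common, important=["library_preparation"])
print(common, sparse, msom)
```

Attributes that land in `sparse` are not discarded; they form the sparse matrix that is integrated into the model later.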

Introducing semantics into the template

  • Identify most important objects to be represented in the model (e.g. study, sample, treatment, result, etc.)
  • Make sure to have an appropriate naming for the objects (e.g. an NGSstudy is an OMICSstudy is a Study; do not call an NGSstudy a Study; make sure the granularity fits your purposes)
  • Assign MSOM and sparse matrix attributes to the respective objects
  • Identify and introduce relationships among the identified objects (e.g. “an NGSstudy contains samples”, “a result is derived from a sample”)
  • Identify dependencies to data not represented as objects at this point in time, but, e.g. as termlists
  • Make sure that your model can be expanded subsequently to represent those data as objects, as well
  • Integrate the sparse matrix of metadata not contained in the MSOM in the model
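
One way to picture the modelling steps above is a small class hierarchy. This is only a sketch: the class and field names are hypothetical, `platform` stands in for a value that is still a termlist, and `extra` stands in for the sparse matrix of attributes outside the MSOM.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    sample_id: str
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class Result:
    result_id: str
    derived_from: Sample  # "a result is derived from a sample"


@dataclass
class Study:
    study_id: str
    samples: list[Sample] = field(default_factory=list)  # "a study contains samples"
    extra: dict[str, str] = field(default_factory=dict)  # sparse, non-MSOM attributes


@dataclass
class OmicsStudy(Study):
    pass


@dataclass
class NGSStudy(OmicsStudy):
    # Kept as a plain string (termlist value) for now; could become an object later.
    platform: str = ""


study = NGSStudy(study_id="STUDY-1", platform="example sequencing platform")
sample = Sample("SAMPLE-1", {"tissue": "liver"})
study.samples.append(sample)
result = Result("RESULT-1", derived_from=sample)
print(isinstance(study, OmicsStudy), isinstance(study, Study))
```

Because `NGSStudy` is a subclass rather than a renamed `Study`, other study types can be added later without changing the code that works on the general `Study`.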

Reality check

  • Introduce measures that allow errors in reported data to be identified against your model
  • Expose your model to actual data delivered by independent colleagues and capture the errors and gaps that occurred
  • Identify errors and gaps that are related to the model and not occurring due to errors in the data
  • Adjust the model according to these errors and gaps
  • Re-iterate the reality check until no more severe errors and gaps are occurring that are relevant for the previously defined competency questions
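
A minimal sketch of such a check, assuming the model is expressed as cardinality rules per attribute. The rules, attribute names and issue labels are hypothetical. Missing or repeated values usually point to errors in the data, while attributes the model does not know may point to gaps in the model itself.

```python
# Hypothetical cardinality rules: attribute -> (minimum, maximum); None = unbounded.
RULES = {
    "species": (1, 1),
    "tissue": (1, None),
    "treatment": (0, None),
}


def validate(record):
    """Check one record (attribute -> list of values) against RULES."""
    issues = []
    for attr, (minimum, maximum) in RULES.items():
        count = len(record.get(attr, []))
        if count < minimum:
            issues.append(("missing", attr))
        elif maximum is not None and count > maximum:
            issues.append(("too_many", attr))
    for attr in record:
        if attr not in RULES:
            # Candidate model gap rather than a data error.
            issues.append(("unknown_attribute", attr))
    return issues


delivered = {"species": ["Homo sapiens", "Mus musculus"], "strain": ["C57BL/6"]}
for issue in validate(delivered):
    print(issue)
```

Collecting the `unknown_attribute` issues across many delivered records gives a concrete list to review when adjusting the model in the next iteration.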

@FuqiX
Collaborator

FuqiX commented Apr 28, 2020

@JolandaS JolandaS removed this from the Virtual meeting End of April 2020 milestone May 13, 2020
@Chris-Evelo

Chris-Evelo commented May 20, 2020

I think this would benefit from some structure for an actual study that involves transcriptomics data. Apart from general metadata (who did it, where, where it was stored and so on), this should have a description of the study (including what other measurements were done in the same study), following the ISA principles: how samples were created and how the actual measurements were performed. Next, it should also link to (and have an ontological description of) 1) parallel measurements (for example, did you also do proteomics, and where do I find that info) and 2) phenotypic outcome data (for example, under the treatment in the study the measured data was blood pressure and so on), and again where you would store that. Note that, ideally, in a public study, the ISA types of data would go into BioSamples, and the other measurements would be in BioStudies or (for other omics data) be linked from there. So our choices should ideally align with how these repositories (and of course ArrayExpress and GEO) work. (Sorry if all that was already in the cookbook.)

@Chris-Evelo

We had some discussion about whether this would not be better as part of the catalogue model. Of course, the catalogue needs to align with how data is collected. But we also need to make sure our recipes align with a "FAIR at source" approach, where people can start to collect the relevant data when they design, perform and evaluate the actual study.


7 participants