From 2cbaf219aad38c4f4ca6b2d1569f055ee2be1aca Mon Sep 17 00:00:00 2001
From: Cyril Matthey-Doret
Date: Wed, 13 Dec 2023 09:31:01 +0000
Subject: [PATCH 1/3] docs: add initial README.md

---
 README.md | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 188e8d76..bf5c0bfa 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,45 @@
-# smoc-poc
-Initial system for creating and serving multi-omics digital objects
+# SMOC-PoC
+
+Initial system for creating and serving multi-omics digital objects.
+
+## Motivation
+
+Provide a digital object and system to process, store and serve multi-omics data with their metadata such that:
+* Traceability and reproducibility is ensured by rich metadata
+* The different omics layers are processed and distributed together
+* Common operations such as liftover can be automated easily and ensure that omics layers are kept in sync
+
+## Architecture
+
+The digital object is composed of multiple files:
+* CRAM files for alignment data, Zarr
+* HDF5 files for array data
+* RDF for metadata (either separate, or embedded in the array file.
+
+A webserver is required to list available objects and serve them over the network.
+
+The basic structure is as follows:
+
+```mermaid
+
+flowchart LR;
+
+subgraph smoc[SMOC server]
+    OBJ[Digital object metadata]
+    CRAMG[Genomics CRAM]
+    CRAMT[Transcriptomics CRAM]
+    MATP[Proteomics matrix]
+    MATM[Metabolomics matrix]
+end;
+subgraph UI[User interface]
+    CAT[Catalogue]
+    INS[Inspector]
+end;
+
+    OBJ -.-> CRAMG;
+    OBJ -.-> CRAMT;
+    OBJ -.-> MATP;
+    OBJ -.-> MATM;
+    OBJ -->|list objects| CAT
+    OBJ -->|display metadata| INS
+```

From 72c61b2f556e3dfcea32fd53152afad9d81660bf Mon Sep 17 00:00:00 2001
From: Cyril Matthey-Doret
Date: Wed, 13 Dec 2023 09:52:37 +0000
Subject: [PATCH 2/3] docs: add implementation details to README.md

---
 README.md | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index bf5c0bfa..c42dd73d 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ Provide a digital object and system to process, store and serve multi-omics data
 The digital object is composed of multiple files:
 * CRAM files for alignment data, Zarr
 * HDF5 files for array data
-* RDF for metadata (either separate, or embedded in the array file.
+* RDF for metadata (either separate, or embedded in the array file).
 
 A webserver is required to list available objects and serve them over the network.
 
@@ -43,3 +43,12 @@ end;
     OBJ -->|list objects| CAT
     OBJ -->|display metadata| INS
 ```
+
+## Implementation details
+
+* To allow horizontal traversal of digital objects in the database (e.g. for listing), the metadata would need to be exported in a central database/knowledge-graph on the server side.
+* Metadata can be either embedded in the array file, or stored in a separate file
+* Each digital object needs a unique identifier
+* The paths of individual files in the digital object must be referenced in a consistent way.
+  + Absolute paths are a no-go (machine/system dependent)
+  + Relative paths in the digital object could work, but need to be OS-independent

From 04fd9912dba6d37242f9c91516a8af6b91b0cf65 Mon Sep 17 00:00:00 2001
From: Cyril Matthey-Doret
Date: Fri, 15 Dec 2023 15:51:09 +0000
Subject: [PATCH 3/3] docs: add status to readme

---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index c42dd73d..034f55fb 100644
--- a/README.md
+++ b/README.md
@@ -52,3 +52,9 @@ end;
 * The paths of individual files in the digital object must be referenced in a consistent way.
   + Absolute paths are a no-go (machine/system dependent)
   + Relative paths in the digital object could work, but need to be OS-independent
+
+
+# Status and limitations
+
+* Focusing on data retrieval, object creation not yet implemented
+* The htsget protocol supports streaming CRAM files, but it is currently only implemented for BAM in major genome browsers (igv.js, jbrowse)
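
Note on the path constraints listed under "Implementation details" (no absolute paths, OS-independent relative paths): one possible way to meet them is to store every member path as a POSIX-style relative string and resolve it against the object's root directory at read time. The sketch below is only an illustration under that assumption. The function names `resolve_member` and `to_stored`, and the `genomics/sample.cram` layout, are hypothetical and not part of this patch series.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def resolve_member(object_root: Path, stored: str) -> Path:
    """Resolve a stored POSIX-style relative path against the object root.

    Hypothetical helper: rejects anything that is machine dependent
    (absolute paths, drive letters, backslashes) or escapes the root.
    """
    if "\\" in stored:
        raise ValueError(f"backslash separators are not portable: {stored!r}")
    # Absolute in POSIX form ("/data/x") or anchored in Windows form ("C:/x").
    if PurePosixPath(stored).is_absolute() or PureWindowsPath(stored).anchor:
        raise ValueError(f"absolute paths are not allowed: {stored!r}")
    rel = PurePosixPath(stored)
    if ".." in rel.parts:
        raise ValueError(f"path escapes the digital object: {stored!r}")
    # Join part by part so the result uses the local OS separator.
    return object_root.joinpath(*rel.parts)


def to_stored(object_root: Path, member: Path) -> str:
    """Convert a local member path to its stored, OS-independent form."""
    # relative_to raises ValueError if member is not under object_root.
    return member.relative_to(object_root).as_posix()


if __name__ == "__main__":
    root = Path("objects") / "obj-001"  # hypothetical object directory
    local = resolve_member(root, "genomics/sample.cram")
    print(local)
    print(to_stored(root, local))
```

Storing forward-slash strings and converting only at the boundary keeps the metadata identical on every platform, while `pathlib` handles the local separator.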