diff --git a/.github/ISSUE_TEMPLATE/new-question.md b/.github/ISSUE_TEMPLATE/new-question.md index b6a3714..8568e4e 100644 --- a/.github/ISSUE_TEMPLATE/new-question.md +++ b/.github/ISSUE_TEMPLATE/new-question.md @@ -1,6 +1,6 @@ --- name: Support Issue -about: Ask for support on running and/or developing LocalEGA +about: Ask for support on running and/or developing FederatedEGA labels: Support --- diff --git a/README.md b/README.md index 9138c88..8d2d9bb 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,7 @@ Source code for core components is available at: https://github.com/neicnordic/s | Component | Role | |---------------|------| -| inbox | SFTP, S3 or HTTPS server, acting as a dropbox, where user credentials are fetched from CentralEGA or via LifeScience AAI. [s3inbox](https://github.com/neicnordic/sensitive-data-archive/tree/main/sda/cmd/s3inbox/s3inbox.md) or [sftp-inbox](https://github.com/neicnordic/sensitive-data-archive/tree/main/sda-sftp-inbox/README.md) | +| inbox | SFTP, S3 or HTTPS server, acting as a dropbox, where user credentials are fetched from CentralEGA or via [Life Science AAI](https://lifescience-ri.eu/). [s3inbox](https://github.com/neicnordic/sensitive-data-archive/tree/main/sda/cmd/s3inbox/s3inbox.md) or [sftp-inbox](https://github.com/neicnordic/sensitive-data-archive/tree/main/sda-sftp-inbox/README.md) | | intercept | The intercept service relays messages between the queue provided from the federated service and local queues. **(Required for Federated EGA use case)** | | ingest | Split the Crypt4GH header and move the remainder to the storage backend. No cryptographic task, nor access to the decryption keys. | | verify | Decrypt the stored files and checksum them against their embedded checksum. | diff --git a/docs/connection.md b/docs/connection.md index 6810e0b..c46bcce 100644 --- a/docs/connection.md +++ b/docs/connection.md @@ -1,17 +1,17 @@ Interfacing with CEGA ⇌ SDA =========================== -All Local EGA instances are connected to Central EGA using +All `FederatedEGA` instances are connected to `CentralEGA` using [RabbitMQ](http://www.rabbitmq.com), a Message Broker, that allows the components to send and receive messages, which are queued, not lost, and resent on network failure or connection problems. The RabbitMQ message brokers of each SDA instance are the **only** -components with the necessary credentials to connect to Central EGA +components with the necessary credentials to connect to the `CentralEGA` message broker. We call `CEGAMQ` and `LocalMQ` (Local Message Broker, sometimes known as `sda-mq`), -the RabbitMQ message brokers of, respectively, `Central EGA` and `SDA`/`LocalEGA`. +the RabbitMQ message brokers of, respectively, `CentralEGA` and `SDA`/`FederatedEGA`. Local Message Broker -------------------- @@ -49,7 +49,7 @@ The following environment variables can be used to configure the broker: > would need to be set up to send and receive messages between other > services. -Central EGA connection +CentralEGA connection ---------------------- `CEGAMQ` declares a `vhost` for each SDA instance. It also creates the @@ -102,7 +102,7 @@ Service will wait for messages to arrive. > NOTE: > More information can be found also at -> [localEGA](https://localega.readthedocs.io/en/latest/amqp.html#message-interface-api-cega-connect-lega). +> [localEGA repository](https://localega.readthedocs.io/en/latest/amqp.html#message-interface-api-cega-connect-lega), the repository that provides functionality for the `FederatedEGA` use case.
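As an illustration of the "Service will wait for messages to arrive" step in `docs/connection.md`, here is a minimal sketch of a consumer attached to the local broker (`LocalMQ`). It assumes the `pika` Python client; the host, vhost, credentials, queue name and message field names are placeholders, not the actual SDA service implementation.

```python
# Illustrative only: a service-style consumer waiting for CEGA-originated
# messages on the local broker. Host, vhost, credentials and queue name are
# placeholders; real deployments take these from configuration.
import json

import pika  # pip install pika

params = pika.ConnectionParameters(
    host="localmq.example.org",          # hypothetical LocalMQ host
    virtual_host="sda",                  # hypothetical vhost
    credentials=pika.PlainCredentials("sda-user", "sda-password"),
)
connection = pika.BlockingConnection(params)
channel = connection.channel()

def on_message(ch, method, properties, body):
    # Messages from CentralEGA are JSON-formatted (see Message Format below);
    # the field names used here are illustrative, not the official schema.
    payload = json.loads(body)
    print("received", payload.get("type"), "for", payload.get("filepath"))
    ch.basic_ack(delivery_tag=method.delivery_tag)

# Block and wait for messages to arrive on a local queue.
channel.basic_consume(queue="files", on_message_callback=on_message)
channel.start_consuming()
```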
`CEGAMQ` receives notifications from `LocalMQ` using a *shovel*. Everything that is published to its `to_cega` exchange gets forwarded to @@ -118,30 +118,30 @@ workflow to CentralEGA, using the following routing keys: | files.verified | For files ready to request accessionID | Note that we do not need at the moment a queue to store the completed -message, nor the errors, as we forward them to Central EGA. +message, nor the errors, as we forward them to `CentralEGA`. ![RabbitMQ setup](./static/CEGA-LEGA.png) -Connecting SDA to Central EGA +Connecting SDA to CentralEGA ----------------------------- -Central EGA only has to prepare a user/password pair along with a +`CentralEGA` only has to prepare a user/password pair along with a `vhost` in their RabbitMQ. -When Central EGA has communicated these details to the given Local EGA -instance, the latter can contact Central EGA using the federated queue +When `CentralEGA` has communicated these details to the given `FederatedEGA` +instance, the latter can contact `CentralEGA` using the federated queue and the shovel mechanism in their local broker. -CentralEGA should then see 2 incoming connections from that new LocalEGA +`CentralEGA` should then see 2 incoming connections from that new `FederatedEGA` instance, on the given `vhost`. The exchanges and routing keys will be the same as all the other -LocalEGA instances, since the clustering is done per `vhost`. +`FederatedEGA` instances, since the clustering is done per `vhost`. ### Message Format It is necessary to agree on the format of the messages exchanged between -Central EGA and any Local EGAs. Central EGA's messages are +`CentralEGA` and any `FederatedEGA`s. `CentralEGA`'s messages are JSON-formatted. The JSON schemas can be found in: @@ -200,14 +200,14 @@ of messages: - `type=cancel`: an ingestion cancellation - `type=accession`: contains an accession id - `type=mapping`: contains a dataset to accession ids mapping -- `type=heartbeat`: A mean to check if the Local EGA instance is +- `type=heartbeat`: A means to check if the `FederatedEGA` instance is "alive" > IMPORTANT: > The `encrypted_checksums` key is optional. If the key is not present the > sha256 checksum will be calculated by `Ingest` service. -The message received from Central EGA to start ingestion at a Federated EGA node. +The message received from `CentralEGA` to start ingestion at a Federated EGA node. Processed by the `ingest` service. ```javascript diff --git a/docs/dataout.md b/docs/dataout.md index b4b09da..b324647 100644 --- a/docs/dataout.md +++ b/docs/dataout.md @@ -84,9 +84,7 @@ and can't expose REST API (but still can receive RabbitMQ messages). Handling Permissions -------------------- -Data Out API can be run with connection to an AAI or without. In the -case connection to an AAI provider is not possible the -`PASSPORT_PUBLIC_KEY_PATH` and `CRYPT4GH_PRIVATE_KEY_PATH` need to be +Data Out API can be run with or without a connection to an AAI. If a connection to an AAI provider is not possible, the `PASSPORT_PUBLIC_KEY_PATH` and `CRYPT4GH_PRIVATE_KEY_PATH` need to be set. > NOTE: diff --git a/docs/db.md b/docs/db.md index 274d2aa..9b7eabf 100644 --- a/docs/db.md +++ b/docs/db.md @@ -10,8 +10,7 @@ documented below. > The database container will initialize and create the necessary database -structure and functions if started with an empty area. Procedures for -*backing up the database* are important but considered out of scope for +structure and functions if started with an empty area.
Procedures for *backing up the database* are important; however, they are considered out of scope for the secure data archive project. Look at [the SQL @@ -65,7 +64,7 @@ changes are required that risk being time consuming on large databases, it may be best to split that work in small chunks. Doing so helps in both demonstrating progress as well as avoiding -rollbacks of the entire process (and thus working needing to be done) if +rollbacks of the entire process if something fails. Each schema migration is done in a transaction. Schema versions are integers. There is no strong coupling between diff --git a/docs/dictionary/wordlist.txt b/docs/dictionary/wordlist.txt index 47b5eeb..83d4ba8 100644 --- a/docs/dictionary/wordlist.txt +++ b/docs/dictionary/wordlist.txt @@ -43,7 +43,6 @@ confpath controlledaccessgrants copyheader creds -cryptograhy cryptographic cscfi dac @@ -58,6 +57,7 @@ decrypt decryptable decrypted decryptedchecksums +decrypting decryptor dev discoverable @@ -74,6 +74,7 @@ egas endcoordinate envs exportrequests +federatedega fega fileid filepath diff --git a/docs/encryption.md b/docs/encryption.md index bd37de5..6773ea2 100644 --- a/docs/encryption.md +++ b/docs/encryption.md @@ -5,18 +5,16 @@ The secure data archive uses public key cryptography extensively for maintaining data privacy throughout the various stages. - Files uploaded to the secure data archive are pre-encrypted (on the - user side) with public key based cryptograhy (this is in addition to + user side) with public key based cryptography (this is in addition to any transport encryption provided for the connection, e.g. TLS or - the encryption provided by ssh for the sftp inbox service). -- During the ingestion process, the files are decrypted and - re-encrypted with another key to provide for the archiving. + the encryption provided by SSH for the SFTP inbox service). +- During the ingestion process, the files are decrypted and a checksum is computed; - Finally, if the data is requested, it is again decrypted and - possibly reencrypted with a suitable key for the user (again, in - addition to any transport encryption). + possibly re-encrypted with a suitable key for the user (again, in addition to any transport encryption). -Files submitted are in the `Crypt4GH` file format, which provides the +Files submitted should be in the `Crypt4GH` file format, which provides the ability to decrypt parts of encrypted files without having to start -decrypt all data up to the desired area (useful for e.g. streaming). +decrypting all data up to the desired area (useful for e.g. streaming). The details of the file format used are provided at [Crypt4GH file format](http://samtools.github.io/hts-specs/crypt4gh.pdf), and summarized below. @@ -24,8 +22,7 @@ A random session key (of 256 bits) is generated to seed a ChaCha20 engine, with Poly1305 authentication mode. For each segment of at most 64kB of data, a nonce is randomly generated and prepended to the -segment. Using the two latters, the original file is segmented and each -segment is encrypted. +segment. Using the latter two, the original file is segmented and each segment is encrypted.
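The segmentation scheme just described can be sketched as follows. This is a conceptual illustration only, built on the `cryptography` package with a random 256-bit session key, ChaCha20-Poly1305 and 64 kB segments; it is not the real `crypt4gh` implementation and produces no Crypt4GH header.

```python
# Conceptual sketch of the body encryption described above: a random
# 256-bit session key, ChaCha20-Poly1305, segments of at most 64 kB,
# and a fresh random nonce prepended to every encrypted segment.
# NOT the real Crypt4GH implementation (no header is produced).
import os

from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

SEGMENT_SIZE = 65536  # 64 kB of plaintext per segment

def encrypt_body(plaintext: bytes, session_key: bytes) -> bytes:
    aead = ChaCha20Poly1305(session_key)
    out = bytearray()
    for offset in range(0, len(plaintext), SEGMENT_SIZE):
        segment = plaintext[offset:offset + SEGMENT_SIZE]
        nonce = os.urandom(12)        # random nonce, prepended to the segment
        out += nonce + aead.encrypt(nonce, segment, None)
    return bytes(out)

session_key = os.urandom(32)          # random 256-bit session key
ciphertext = encrypt_body(b"example payload" * 4096, session_key)
```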
The header is prepended to the encrypted data; it also contains the word `crypt4gh`, the format version, the number of header packets, and @@ -47,7 +44,7 @@ The advantages of the format are, among others: - Re-arranging the file to chunk a portion requires only to decrypt the header, re-encrypt with an edit list, and select the cipher segments surrounding the portion. The file itself is not decrypted - and reencrypted. + and re-encrypted. In order to encrypt files using this standard we recommend the following tools: diff --git a/docs/guides/deploy-k8s.md b/docs/guides/deploy-k8s.md index 1faa143..d6ea15e 100644 --- a/docs/guides/deploy-k8s.md +++ b/docs/guides/deploy-k8s.md @@ -38,7 +38,7 @@ This chart deploys a pre-configured database ([PostgreSQL](https://www.postgresq ### sda-mq - RabbitMQ component for Sensitive Data Archive (SDA) installation -This chart deploys a pre-configured message broker ([RabbitMQ](https://www.rabbitmq.com/)) designed to work [European Genome-Phenome Archive](https://ega-archive.org/) federated messaging interface between Central EGA and Local/Federated EGAs. +This chart deploys a pre-configured message broker ([RabbitMQ](https://www.rabbitmq.com/)) designed to work with the [European Genome-Phenome Archive](https://ega-archive.org/) federated messaging interface between `CentralEGA` and Local/Federated EGAs. ### sda-svc - Components for Sensitive Data Archive (SDA) installation diff --git a/docs/guides/deployment.md b/docs/guides/deployment.md index a68227c..0e086b2 100644 --- a/docs/guides/deployment.md +++ b/docs/guides/deployment.md @@ -5,7 +5,7 @@ > If you have feedback to give on the content, please contact us on > [github](https://github.com/neicnordic/neic-sda)! -Different nodes of the Federated EGA network, and projects using the standalone SDA +Different nodes of the Federated EGA network, and projects using the stand-alone SDA have made different decisions in how to deploy the system. Adaptations need to be made depending on the system to deploy on, as well as the requirements of your deployment. diff --git a/docs/guides/federated-or-standalone.md b/docs/guides/federated-or-standalone.md index 6aadc6e..3619f7d 100644 --- a/docs/guides/federated-or-standalone.md +++ b/docs/guides/federated-or-standalone.md @@ -1,4 +1,4 @@ -# Federated or Standalone Archive +# Federated or Stand-alone Archive > TODO: > This guide is a stub and has yet to be finished. diff --git a/docs/guides/troubleshooting.md b/docs/guides/troubleshooting.md index 7410b3a..85d5b58 100644 --- a/docs/guides/troubleshooting.md +++ b/docs/guides/troubleshooting.md @@ -25,7 +25,7 @@ Next step is to make sure that the remote connections (CEGA RabbitMQ) are workin ## End-to-end testing -NOTE: This guide assumes that there exists a test instance account with Central EGA. Make sure that the account is approved and added to the submitters group. +NOTE: This guide assumes that there exists a test instance account with `CentralEGA`. Make sure that the account is approved and added to the submitters group. ### Upload file(s) diff --git a/docs/index.md b/docs/index.md index 7b49228..c6c3efc 100644 --- a/docs/index.md +++ b/docs/index.md @@ -2,47 +2,47 @@ NeIC Sensitive Data Archive =========================== -The NeIC Sensitive Data Archive (SDA) is an encrypted data archive, originally implemented for storage of sensitive biological data. It is implemented as a modular microservice system that can be deployed in different configurations depending on the service needs.
+The NeIC Sensitive Data Archive (SDA) is an encrypted data archive, implemented for storage of sensitive data. It is designed as a modular microservice system that can be deployed in different configurations depending on the service needs. The modular architecture of SDA supports both stand-alone deployment of an archive and the use case of deploying a Federated node in the [Federated European Genome-phenome Archive network (FEGA)](https://ega-archive.org/about/projects-and-funders/federated-ega/), serving discoverable sensitive datasets in the main [EGA web portal](https://ega-archive.org). > NOTE: > Throughout this documentation, we can refer to [Central > EGA](https://ega-archive.org/) as `CEGA`, or `CentralEGA`, and *any* -> Local EGA (also known as Federated EGA) instance as `LEGA`, or -> `LocalEGA`. In the context of NeIC we will refer to the LocalEGA as the +> `FederatedEGA` instance, also known as `FEGA`, `LEGA` or +> `LocalEGA`. In the context of NeIC we will refer to the Federated EGA as the > `Sensitive Data Archive` or `SDA`. Overall architecture -------------------- -The main components and interaction partners of the NeIC Sensitive Data Archive deployment in a Federated EGA setup, are illustrated in the figure below. The different colored backgrounds represent different zones of separation in the federated deployment. +The main components and the interactions between them in the NeIC Sensitive Data Archive deployment in a Federated EGA setup are illustrated in the figure below. The different colored backgrounds represent different zones of separation in the federated deployment. ![](https://docs.google.com/drawings/d/e/2PACX-1vSCqC49WJkBduQ5AJ1VdwFq-FJDDcMRVLaWQmvRBLy7YihKQImTi41WyeNruMyH1DdFqevQ9cgKtXEg/pub?w=960&h=540) The components illustrated can be classified by which archive sub-process they take part in: -- Submission - the process of submitting sensitive data and meta-data to the inbox staging area -- Ingestion - the process of verifying uploaded data and securely storing it in archive storage, while synchronizing state and identifier information with CEGA -- Data Retrieval - the process of re-encrypting and staging data for retrieval/download. +- `Submission` - the process of submitting sensitive data and meta-data to the inbox staging area +- `Ingestion` - the process of verifying uploaded data and securely storing it in archive storage, while synchronizing state and identifier information with CEGA +- `Data Retrieval` - the process of re-encrypting and staging data for retrieval/download. +| Service/component | Description | Archive sub-process | +|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------| +| Database | A Postgres database with an appropriate schema that stores the file header, the accession id, file path and checksums, as well as other relevant information. | Submission, Ingestion and Data Retrieval | +| MQ | A RabbitMQ message broker with appropriate accounts, exchanges, queues and bindings. We use a federated queue to get messages from CentralEGA's broker and shovels to send answers back. | Submission and Ingestion | +| Inbox | Upload service for incoming data, acting as a dropbox. Uses credentials from `CentralEGA`. | Submission | +| Intercept | Relays messages between the queue provided from the federated service and local queues.
| Submission and Ingestion | +| [Ingest](services/ingest.md) | Splits the Crypt4GH header and moves it to the database. The remainder of the file is sent to the storage backend (archive). No cryptographic tasks are done. | Ingestion | +| [Verify](services/verify.md) | Using the archive crypt4gh secret key, this service can decrypt the stored files and checksum them against the embedded checksum for the unencrypted file. | Ingestion | +| [Finalize](services/finalize.md) | Handles the so-called Accession ID (stable ID) to filename mappings from CentralEGA. | Ingestion | +| [Mapper](services/mapper.md) | The mapper service registers the mapping of accessionIDs (stable IDs for files) to datasetIDs. | Ingestion | +| Archive (Storage) | Storage backend: can be a regular (POSIX) file system or an S3 object store. | Ingestion and Data Retrieval | +| [Data Retrieval API](dataout.md) | Provides a download/data access API for streaming archived data either in encrypted or decrypted format. | Data Retrieval | +| Inbox (Storage) | Storage backend: can be a regular (POSIX) file system or an S3 object store. | Ingestion | +| Backup (Storage) | Storage backend: can be a regular (POSIX) file system or an S3 object store. | Ingestion | -| Service/component | Description | Archive sub-process | -|---------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------| -| db | A Postgres database with appropriate schema, stores the file header, the accession id, file path and checksums as well as other relevant information. | Submission, Ingestion and Data Retrieval | -| mq (broker) | A RabbitMQ message broker with appropriate accounts, exchanges, queues and bindings. We use a federated queue to get messages from CentralEGA's broker and shovels to send answers back. | Submission and Ingestion | -| Inbox | Upload service for incoming data, acting as a dropbox. Uses credentials from Central EGA. | Submission | -| Intercept | Relays messages between the queue provided from the federated service and local queues. | Submission and Ingestion | -| [Ingest](services/ingest.md) | Splits the Crypt4GH header and moves it to the database. The remainder of the file is sent to the storage backend (archive). No cryptographic tasks are done. | Ingestion | -| [Verify](services/verify.md) | Using the archive crypt4gh secret key, this service can decrypt the stored files and checksum them against the embedded checksum for the unencrypted file. | Ingestion | -| [Finalize](services/finalize.md) | Handles the so-called Accession ID (stable ID) to filename mappings from CentralEGA. | Ingestion | -| [Mapper](services/mapper.md) | The mapper service register mapping of accessionIDs (stable ids for files) to datasetIDs. | Ingestion | -| Archive | Storage backend: can be a regular (POSIX) file system or a S3 object store. | Ingestion and Data Retrieval | -| Data Out API | Provides a download/data access API for streaming archived data either in encrypted or decrypted format. | Data Retrieval | -| Metadata | Component used in standalone version of SDA. Provides an interface and backend to submit Metadata and associated with a file in the Archive. | Submission, Ingestion and Data Retrieval | -| Orchestrator | Component used in standalone version of SDA. Provides an automated ingestion and dataset ID and file ID mapping.
| Submission, Ingestion and Data Retrieval | Organisation of the NeIC SDA Operations Handbook ------------------------------------------------ @@ -51,44 +51,43 @@ This operations handbook is organized in four main parts, that each has it's ow 1. **Structure**: Provides overview material for how the services can be deployed in different constellations and highlights communication paths. -1. **Communication**: Provides more detailed communication focused documentation, such as OpenAPI-specs for APIs, rabbit-mq message flow, and database information flow details. +2. **Communication**: Provides more detailed communication focused documentation, such as OpenAPI-specs for APIs, rabbit-mq message flow, and database information flow details. -1. **Services**: Per service detailed specifications and documentation. - -1. **Guides**: Topic-guides for topics like "Deployment", "Federated vs. Standalone", "Troubleshooting services", etc. +3. **Services**: Per service detailed specifications and documentation. +4. **Guides**: Topic-guides for topics like "Deployment", "Federated vs. Stand-alone", "Troubleshooting services", etc. > NOTE: > NB!!! Content below to be considered moved into introductory pages of STRUCTURE and COMMUNICATION sections: The overall data workflow consists of three parts: -- The users logs onto the Local EGA's inbox and uploads the encrypted - files. They then go to the Central EGA's interface to prepare a +- The user logs onto the `FederatedEGA` inbox and uploads the encrypted + files. They then go to the `CentralEGA` interface to prepare a submission; - Upon submission completion, the files are ingested into the archive - and become searchable by the Central EGA's engine; + and become searchable by the `CentralEGA` engine; - Once the file has been successfully archived, it can be accessed by researchers in accordance with permissions given by the corresponding Data Access Committee. ------------------------------------------------------------------------ -Central EGA contains a database of users with permissions to upload to a -specific Sensitive Data Archive. The Central EGA ID is used to +`CentralEGA` contains a database of users with permissions to upload to a +specific Sensitive Data Archive. The `CentralEGA` ID is used to authenticate the user against either their EGA password or a private key. -For every uploaded file, Central EGA receives a notification that the +For every uploaded file, `CentralEGA` receives a notification that the file is present in an SDA's inbox. The uploaded file must be encrypted -in the [Crypt4GH file format](http://samtools.github.io/hts-specs/crypt4gh.pdf) using that SDA public Crypt4gh key. The file is -checksumed and presented in the Central EGA's interface in order for +in the [Crypt4GH file format](https://samtools.github.io/hts-specs/crypt4gh.pdf) using that SDA's public Crypt4GH key. The file is +checksummed and presented in the `CentralEGA` interface in order for the user to double-check that it was properly uploaded. More details about the process can be found in [Data Submission](submission.md#data-submission). -When a submission is ready, Central EGA triggers an ingestion process on -the user-chosen SDA instance. Central EGA's interface is updated with +When a submission is ready, `CentralEGA` triggers an ingestion process on +the user-chosen SDA instance. `CentralEGA`'s interface is updated with progress notifications on whether the ingestion was successful, or whether there was an error.
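For the submission flow described above in `docs/index.md` (encrypt with the node's public Crypt4GH key, then checksum the encrypted file), a hedged sketch using the `crypt4gh` Python package could look like the following; the key and file paths are placeholders.

```python
# Illustrative pre-submission step: encrypt a file for the SDA node's public
# Crypt4GH key and compute the sha256 checksum of the encrypted file.
# Assumes the `crypt4gh` Python package; paths and key names are placeholders.
import hashlib

from crypt4gh.keys import get_private_key, get_public_key
from crypt4gh.lib import encrypt

my_seckey = get_private_key("my_key.sec", lambda: "my-passphrase")  # submitter's key
sda_pubkey = get_public_key("sda_node.pub")                         # node's public key

with open("data.bam", "rb") as f_in, open("data.bam.c4gh", "wb") as f_out:
    # One recipient tuple: (method 0 = X25519, our secret key, the node's public key)
    encrypt([(0, my_seckey, sda_pubkey)], f_in, f_out)

digest = hashlib.sha256()
with open("data.bam.c4gh", "rb") as f_enc:
    for chunk in iter(lambda: f_enc.read(1 << 20), b""):
        digest.update(chunk)
print("sha256:", digest.hexdigest())
```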
diff --git a/docs/structure.md b/docs/structure.md index 154ca97..7005b6e 100644 --- a/docs/structure.md +++ b/docs/structure.md @@ -7,18 +7,18 @@ This section provides overview material for how the services can be deployed in Deployment related choices -------------------------- -### Federated vs standalone +### Federated vs stand-alone -In a Federated setup, the Local EGA archive node setup locally need to exchange status updates with the Central EGA in a synchronized manner to basically orchestrate two parallel processes +In a Federated setup, the locally deployed `FederatedEGA` archive node needs to exchange status updates with `CentralEGA` in a synchronized manner to orchestrate two parallel processes: 1. The multi-step process of uploading and safely archiving encrypted files holding both sensitive phenome and genome data. -2. The process of the Submitter annotating the archived data in an online portal at Central EGA, resulting in assigned accession numbers for items such as DataSet, Study, Files etc. +2. The process of the Submitter annotating the archived data in an online portal at `CentralEGA`, resulting in assigned accession numbers for items such as DataSet, Study, Files etc. -In a stand-alone setup, the deployed service has less remote synchronisation to worry about, but on the other hand need more components to also handle annotations/meta-data locally, as well as to deal with identifiers etc. +In a stand-alone setup, the deployed service has less remote synchronisation to worry about, but on the other hand more components might be required (e.g. the [orchestrator](services/orchestrator)) to also handle annotations/meta-data locally, as well as to deal with identifiers etc. -The NeIC SDA is targeting both use cases in several projects in the Nordics. +The NeIC SDA targets both types of setup, and also aims to allow components to be re-used in more use cases than initially envisioned. ### Container deployment options @@ -39,11 +39,10 @@ To support different needs of different deployment locations, SDA is heavily con For other storage dependent functionality, such as upload areas (aka inbox) and download areas (aka outbox), there are different choices of microservices (using different storage technology and transfer protocols) that can be orchestrated together with the main SDA microservices to meet local needs and requirements. - Inter-communication between services ------------------------------------ -There are 3 main ways that the system is passing on information and persist state in the system: +There are three main ways that the system passes on information and persists state: 1. through AMQP messages sent from and to micro services; 2. changes in the database of the status of a file being processed via the `sda-pipeline`; @@ -54,11 +53,12 @@ There are 3 main ways that the system is passing on information and persist stat The orchestration of any action to be performed by the micro services is managed through the appropriate AMQP message being posted at the RabbitMQ broker service, from where microservices will pick up messages about work to be performed. Each microservice will thus normally only process one type of messages/jobs from a specific AMQP queue, and have a predefined type of next message to post once the current task is completed, for the next microservice in the pipeline to carry on the next action needed.
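The pipeline pattern described in the paragraph above (consume one kind of message from one queue, persist the state change, post the predefined next message) might be sketched as below. The queue, exchange, routing-key, table and column names are hypothetical, and the real services read this from configuration rather than hard-coding it.

```python
# Illustrative pipeline-step pattern: a worker consumes one queue, records a
# status change in the database, and posts the next message for the following
# service. All queue/exchange/routing-key and SQL names are hypothetical.
import json

import pika
import psycopg2

db = psycopg2.connect("host=db dbname=sda user=worker password=secret")  # placeholder DSN
mq = pika.BlockingConnection(pika.ConnectionParameters(host="localmq.example.org"))
channel = mq.channel()

def on_message(ch, method, properties, body):
    work = json.loads(body)

    # 1. Persist the status change for this file (hypothetical table/columns).
    with db, db.cursor() as cur:
        cur.execute(
            "UPDATE files SET status = %s WHERE inbox_path = %s",
            ("verified", work["filepath"]),
        )

    # 2. Post the predefined "next" message for the next service in the pipeline.
    ch.basic_publish(
        exchange="sda",                    # hypothetical local exchange
        routing_key="verified",            # hypothetical next-step routing key
        body=json.dumps(work),
        properties=pika.BasicProperties(content_type="application/json", delivery_mode=2),
    )
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="ingested", on_message_callback=on_message)  # hypothetical queue
channel.start_consuming()
```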
### Database + The state of files being ingested to the SDA is recorded in a PostgreSQL database, and the different microservices will often update records in the database as part of their processing step in the pipeline. ### Inbox - Archive - Outbox areas -The SDA operates with three file areas (excluding additional backup mechanisms for data redundancy). The Inbox area is were users will be allowed to upload their encrypted files temporarily, before they get further processed into the archive. The uploaded files are then securely transferred into the Archive area with the header split off and stored in the database, after a validation of content integrity. If someone is later granted access to retrieve a file from the archive, the header is re-encrypted for the requester and merged back with the main content and stored in the Outbox area for the requester to retrieve it from there. +The SDA operates with three file areas (excluding additional backup mechanisms for data redundancy). The Inbox area is where users will be allowed to upload their encrypted files temporarily, before they get further processed into the archive. The uploaded files are then securely transferred into the Archive area with the header split off and stored in the database, after a validation of content integrity. If someone is later granted access to retrieve a file from the archive, the header is re-encrypted for the requester and merged back with the main content and stored in the Outbox area for the requester to retrieve it from there. Additional components --------------------- @@ -66,7 +66,7 @@ Additional components ### Authentication of users -In a Federated setup, a data submitter will usually be required to have a user profile with the Central EGA services as well as a user identity trusted by the Federated EGA node services. Many use the Life Science login identity (a.k.a. ELIXIR AAI identity) for the latter. Integration towards both authentication services will likely need to be incorporated into a Federated EGA nodes upload mechanism and download mechanism. +In a Federated setup, a data submitter will usually be required to have a user profile with the `CentralEGA` services as well as a user identity trusted by the Federated EGA node services. The [Life Science AAI](https://lifescience-ri.eu/) login identity (a.k.a. ELIXIR AAI identity) is primarily used for the latter. Integration with both authentication services will likely need to be incorporated into a Federated EGA node's upload mechanism and download mechanism. ### Authorizing access to datasets diff --git a/docs/submission.md b/docs/submission.md index e6109c1..adb831c 100644 --- a/docs/submission.md +++ b/docs/submission.md @@ -4,7 +4,7 @@ Data Submission Ingestion Procedure ------------------- -For a given LocalEGA, Central EGA selects the associated `vhost` and +For a given `FederatedEGA` node, `CentralEGA` selects the associated `vhost` and drops, in the `files` queue, one message per file to ingest. Structure of the message and its contents are described in @@ -31,8 +31,7 @@ Structure of the message and its contents are described in > services/actuators match those used for the events initiated by the > respective services, except for the interactions in case of errors, > which are highlighted with red. The optional fragments are only executed -> if errors occur during ingestion, verify or finalize. Note that time in -> this diagram is all about ordering, not duration. +> if errors occur during ingestion, verify or finalize.
**Note that the time axis in this diagram is all about the sequence of events, not duration.** ### Ingestion Steps @@ -55,22 +54,21 @@ that the integrated checksum is valid. At this stage, the associated decryption key is retrieved. If decryption completes and the checksum is valid, a message of completion is sent to -Central EGA: Ingestion completed. +`CentralEGA`: Ingestion completed. ->Important -> If a file disappears or is overwritten in the inbox before ingestion is -> completed, ingestion may not be possible. +> **Important** +> If a file disappears or is overwritten in the inbox before ingestion is completed, ingestion may not be possible. If any of the above steps generates an error, we exit the workflow and log the error. In case the error is related to a misuse from the user, -such as submitting the wrong checksum or tempering with the encrypted -file, the error is forwarded to Central EGA in order to be displayed in +such as submitting the wrong checksum or tampering with the encrypted +file, the error is forwarded to `CentralEGA` in order to be displayed in the Submission Interface. Submission Inbox ---------------- -Central EGA contains a database of users, with IDs and passwords. We +`CentralEGA` contains a database of users, with IDs and passwords. We have developed several solutions allowing user authentication against the CentralEGA user database: