Update from neicnordic/sensitive-data-archive at 09:22 on 2023-11-28
Github aggregate action committed Nov 28, 2023
1 parent 25f557f commit db55c04
Showing 5 changed files with 141 additions and 140 deletions.
49 changes: 25 additions & 24 deletions docs/services/finalize.md
@@ -4,6 +4,31 @@ Handles the so-called _Accession ID (stable ID)_ to filename mappings from Central EGA.
At the same time the service fulfills the replication requirement of having distinct backup copies.
For more information see [Federated EGA Node Operations v2](https://ega-archive.org/assets/files/EGA-Node-Operations-v2.pdf) document.

## Service Description

`Finalize` adds stable, shareable _Accession IDs_ to archived files.
If a backup location is configured, it also creates a backup copy of each file.
When running, `finalize` reads messages from the configured RabbitMQ queue (commonly: `accession`).
For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):

1. The message is validated as valid JSON that matches the `ingestion-accession` schema. If the message can’t be validated, it is discarded and an error message is written to the logs.
2. If the service is configured to perform backups, i.e. both the `ARCHIVE_` and `BACKUP_` storage backends are set, the archived file is copied to the backup location:
    1. The file size on disk is requested from the storage system.
    2. The database file size is compared against the disk file size.
    3. A file reader is created for the archive storage file, and a file writer is created for the backup storage file.
3. The file data is copied from the archive file reader to the backup file writer (a minimal sketch of this copy follows the list).
4. If the type of the `DecryptedChecksums` field in the message is `sha256`, the value is stored.
5. A new RabbitMQ `complete` message is created and validated against the `ingestion-completion` schema. If the validation fails, an error message is written to the logs.
6. The file accession ID in the message is marked as *ready* in the database. On error, the service sleeps for up to 5 minutes to allow for database recovery; after 5 minutes the message is Nacked and re-queued, and an error message is written to the logs.
7. The complete message is sent to RabbitMQ. On error, a message is written to the logs.
8. The original RabbitMQ message is Ack'ed.
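
Steps 2–3 amount to streaming the archived file to the backup location after checking that the size on disk matches the size recorded in the database. The following is a minimal sketch assuming plain POSIX paths for both backends; the paths and the expected size are illustrative, and the real service works through its own storage and database abstractions.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
)

// backupFile copies one archived file to the backup location, first checking
// that the on-disk size matches the size recorded in the database.
// archivePath, backupPath and dbSize are illustrative inputs.
func backupFile(archivePath, backupPath string, dbSize int64) error {
	info, err := os.Stat(archivePath)
	if err != nil {
		return fmt.Errorf("stat archive file: %w", err)
	}
	if info.Size() != dbSize {
		return fmt.Errorf("size mismatch: %d on disk, %d in database", info.Size(), dbSize)
	}

	src, err := os.Open(archivePath)
	if err != nil {
		return fmt.Errorf("open archive file: %w", err)
	}
	defer src.Close()

	dst, err := os.Create(backupPath)
	if err != nil {
		return fmt.Errorf("create backup file: %w", err)
	}

	if _, err := io.Copy(dst, src); err != nil {
		dst.Close()
		return fmt.Errorf("copy to backup: %w", err)
	}
	return dst.Close()
}

func main() {
	// Hypothetical paths and size, for illustration only.
	if err := backupFile("/archive/uuid-1234", "/backup/uuid-1234", 42); err != nil {
		log.Fatal(err)
	}
}
```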

## Communication

- `Finalize` reads messages from one RabbitMQ queue (commonly: `accession`).
- `Finalize` publishes messages with one routing key (commonly: `completed`); a publishing sketch follows this list.
- `Finalize` assigns the accession ID to a file in the database using the `SetAccessionID` function.
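
The completion message from step 7 is published with the `completed` routing key. The sketch below uses the `github.com/rabbitmq/amqp091-go` client; the broker URL, exchange name and message body are placeholders, and the real service validates the payload against the `ingestion-completion` schema before sending.

```go
package main

import (
	"context"
	"log"
	"time"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	// Placeholder broker URL and exchange; real values come from configuration.
	conn, err := amqp.Dial("amqp://user:password@localhost:5672/sda")
	if err != nil {
		log.Fatalf("connect to broker: %v", err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatalf("open channel: %v", err)
	}
	defer ch.Close()

	// Illustrative completion payload; the real message follows the
	// ingestion-completion schema.
	body := []byte(`{"type":"accession","user":"jane.doe","accession_id":"EGAF00000000001"}`)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Publish with the "completed" routing key on the configured exchange.
	err = ch.PublishWithContext(ctx, "sda", "completed", false, false, amqp.Publishing{
		ContentType:  "application/json",
		DeliveryMode: amqp.Persistent,
		Body:         body,
	})
	if err != nil {
		log.Fatalf("publish completed message: %v", err)
	}
}
```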

## Configuration

There are a number of options that can be set for the `finalize` service.
@@ -98,27 +123,3 @@ and if `*_TYPE` is `POSIX`:

- `*_LOCATION`: POSIX path to use as storage root

78 changes: 39 additions & 39 deletions docs/services/ingest.md
@@ -3,6 +3,45 @@
Splits the Crypt4GH header and moves it to database. The remainder of the file
is sent to the storage backend (archive). No cryptographic tasks are done.

## Service Description

The `ingest` service copies files from the file inbox to the archive, and registers them in the database.

When running, `ingest` reads messages from the configured RabbitMQ queue (commonly: `ingest`).
For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):

1. The message is validated as valid JSON that matches the `ingestion-trigger` schema.
If the message can’t be validated, it is discarded and an error message is written to the logs.
2. If the message is of type `cancel`, the file will be marked as `disabled` and the next message in the queue will be read.
3. A file reader is created for the filepath in the message.
If the file reader can’t be created, an error is written to the logs and the message is Nacked and forwarded to the error queue.
4. The file size is read from the file reader.
On error, the error is written to the logs and the message is Nacked and forwarded to the error queue.
5. A UUID is generated, and a file writer is created in the archive using the UUID as the filename.
On error, the error is written to the logs and the message is Nacked and re-queued.
6. The filename is inserted into the database along with the ID of the uploading user. If the file already exists in the database, its status is updated instead.
Errors are written to the error log but do not halt ingestion.
7. The header is read from the file, and decrypted to ensure that it’s encrypted with the correct key.
If the decryption fails, an error is written to the error log, the message is Nacked, and the message is forwarded to the error queue.
8. The header is written to the database.
Errors are written to the error log.
9. The header is stripped from the file data, and the remaining file data is written to the archive (a minimal sketch of this step follows the list).
Errors are written to the error log.
10. The size of the archived file is read.
Errors are written to the error log.
11. The database is updated with the file size, archive path, and archive checksum, and the file is set as *archived*.
Errors are written to the error log; this error does not halt ingestion.
12. A message is sent back to the original RabbitMQ broker containing the upload user, upload file path, database file id, archive file path and checksum of the archived file.
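
One way to implement steps 9–11 is to stream the remaining file data (after the header has been read off) into the archive while computing its checksum in the same pass. This is a minimal sketch using only the standard library; the function name, the assumption that the header has already been consumed from `inbox`, and the POSIX-style archive writer are all illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// archiveRemainder writes everything left in inbox (the header is assumed to
// have been read off already) to the archive writer, returning the number of
// bytes written and their SHA-256 checksum.
func archiveRemainder(inbox io.Reader, archive io.Writer) (int64, string, error) {
	hash := sha256.New()
	// TeeReader lets the checksum be computed while the data streams to the archive.
	n, err := io.Copy(archive, io.TeeReader(inbox, hash))
	if err != nil {
		return 0, "", fmt.Errorf("write to archive: %w", err)
	}
	return n, hex.EncodeToString(hash.Sum(nil)), nil
}

func main() {
	// Stand-ins for the inbox reader (header already consumed) and archive writer.
	inbox := strings.NewReader("encrypted file payload without header")
	archive, err := os.CreateTemp("", "archived-*")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(archive.Name())
	defer archive.Close()

	size, checksum, err := archiveRemainder(inbox, archive)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("archived %d bytes, sha256 %s", size, checksum)
}
```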

## Communication

- `Ingest` reads messages from one RabbitMQ queue (commonly: `ingest`).
- `Ingest` publishes messages to one RabbitMQ queue (commonly: `archived`); an illustrative message payload is sketched after this list.
- `Ingest` inserts file information in the database using three database functions, `InsertFile`, `StoreHeader`, and `SetArchived`.
- `Ingest` reads file data from inbox storage and writes data to archive storage.
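
The message published to `archived` carries the information listed in step 12 above. The struct below is only an illustration of such a payload; the exact field names and types are defined by the repository's JSON schemas, not by this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// archivedMessage mirrors the information listed in step 12; the field names
// here are illustrative, the authoritative definition is the JSON schema.
type archivedMessage struct {
	User              string `json:"user"`
	FilePath          string `json:"filepath"`
	FileID            int64  `json:"file_id"`
	ArchivePath       string `json:"archive_path"`
	EncryptedChecksum string `json:"encrypted_checksum"`
}

func main() {
	msg := archivedMessage{
		User:              "jane.doe",
		FilePath:          "/inbox/jane.doe/sample.c4gh",
		FileID:            1234,
		ArchivePath:       "a1b2c3d4-archive-uuid",
		EncryptedChecksum: "sha256 checksum of the archived file",
	}
	body, err := json.Marshal(msg)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // body would then be published to the archived queue
}
```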

## Configuration

There are a number of options that can be set for the `ingest` service.
@@ -99,42 +138,3 @@ and if `*_TYPE` is `POSIX`:
- `error`
- `fatal`
- `panic`

32 changes: 16 additions & 16 deletions docs/services/intercept.md
@@ -2,6 +2,22 @@

The `intercept` service relays messages between Central EGA and Federated EGA nodes.

## Service Description

When running, `intercept` reads messages from the configured RabbitMQ queue (commonly: `from_cega`).
For each message, these steps are taken:

1. The message type is read from the message `type` field.
    1. If the message `type` is not known, an error is logged and the message is Ack'ed.
2. The destination queue for the message is decided based on the message type (see the sketch after this list).
3. The message is sent to that queue. This step has no error handling, as the resend mechanism isn't finished yet.
4. The message is Ack'ed.
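
The routing decision in step 2 boils down to reading the `type` field from the JSON payload and picking a matching queue. This is a minimal sketch; the type-to-queue mapping shown here is an assumption based on the queue names in the Communication section below, not the service's actual routing table.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// queueForMessage picks a destination queue from the message "type" field.
// The type-to-queue mapping below is illustrative only.
func queueForMessage(body []byte) (string, error) {
	var envelope struct {
		Type string `json:"type"`
	}
	if err := json.Unmarshal(body, &envelope); err != nil {
		return "", fmt.Errorf("parse message: %w", err)
	}

	switch envelope.Type {
	case "ingest":
		return "ingest", nil
	case "accession":
		return "accession", nil
	case "mapping":
		return "mappings", nil
	default:
		return "", fmt.Errorf("unknown message type %q", envelope.Type)
	}
}

func main() {
	queue, err := queueForMessage([]byte(`{"type":"accession","user":"jane.doe"}`))
	if err != nil {
		// In the real service an unknown type is logged and the message is Ack'ed.
		log.Fatal(err)
	}
	log.Printf("forward message to queue %q", queue)
}
```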

## Communication

- `Intercept` reads messages from one queue (commonly: `from_cega`).
- `Intercept` publishes messages to three queues, `accession`, `ingest`, and `mappings`.

## Configuration

There are a number of options that can be set for the `intercept` service.
@@ -43,19 +59,3 @@ These settings control how `intercept` connects to the RabbitMQ message broker.
- `error`
- `fatal`
- `panic`

46 changes: 23 additions & 23 deletions docs/services/mapper.md
@@ -3,6 +3,29 @@
The mapper service registers the mapping of accession IDs (stable IDs for files) to dataset IDs.
Once the file accession ID has been mapped to a dataset ID, the file is removed from the inbox.

## Service Description

The `mapper` service maps file `accessionIDs` to `datasetIDs`.

When running, `mapper` reads messages from the configured RabbitMQ queue (commonly: `mappings`).
For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):

1. The message is validated as valid JSON that matches the `dataset-mapping` schema.
If the message can’t be validated, it is discarded and an error message is logged.
2. Accession IDs from the message are mapped to a dataset ID (also in the message) in the database.
On error, the service sleeps for up to 5 minutes to allow for database recovery; after 5 minutes the message is Nacked and re-queued, and an error message is written to the logs (a sketch of this retry pattern follows the list).
3. The uploaded file related to each accession ID is removed from the inbox.
If this fails, an error is written to the logs.
4. The RabbitMQ message is Ack'ed.
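
The retry behaviour in step 2 (keep trying for a while so a briefly unavailable database can recover, then give up and Nack the message) can be expressed as a simple loop. The sketch below is illustrative: `mapFilesToDataset` is a hypothetical stand-in for the database call, and the interval and deadline are shortened here for demonstration (the service description above says up to 5 minutes).

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// mapFilesToDataset is a hypothetical stand-in for the database call that maps
// accession IDs to a dataset ID; here it always fails to demonstrate the retry loop.
func mapFilesToDataset(datasetID string, accessionIDs []string) error {
	return errors.New("database unavailable")
}

// mapWithRetry retries the mapping for up to maxWait, sleeping between attempts,
// so a briefly unavailable database can recover before the message is Nacked.
func mapWithRetry(datasetID string, accessionIDs []string, interval, maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	for {
		err := mapFilesToDataset(datasetID, accessionIDs)
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("mapping failed after %s: %w", maxWait, err)
		}
		log.Printf("mapping failed, retrying in %s: %v", interval, err)
		time.Sleep(interval)
	}
}

func main() {
	// Illustrative IDs and a short deadline for demonstration purposes.
	err := mapWithRetry("EGAD00000000001", []string{"EGAF00000000001"}, 2*time.Second, 10*time.Second)
	if err != nil {
		// At this point the real service would Nack and re-queue the message.
		log.Fatal(err)
	}
}
```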

## Communication

- `Mapper` reads messages from one RabbitMQ queue (commonly: `mappings`).
- `Mapper` maps files to datasets in the database using the `MapFilesToDataset` function.
- `Mapper` retrieves the inbox filepath from the database for each file using the `GetInboxPath` function.
- `Mapper` sets the status of a dataset in the database using the `UpdateDatasetEvent` function.
- `Mapper` removes data from inbox storage.

## Configuration

There are a number of options that can be set for the `mapper` service.
@@ -93,26 +116,3 @@ and if `*_TYPE` is `POSIX`:
- `error`
- `fatal`
- `panic`
