diff --git a/docs/services/finalize.md b/docs/services/finalize.md
index feca10e..7491343 100644
--- a/docs/services/finalize.md
+++ b/docs/services/finalize.md
@@ -4,6 +4,31 @@ Handles the so-called _Accession ID (stable ID)_ to filename mappings from Centr
 At the same time the service fulfills the replication requirement of having distinct backup copies.
 For more information see [Federated EGA Node Operations v2](https://ega-archive.org/assets/files/EGA-Node-Operations-v2.pdf) document.
+## Service Description
+
+`Finalize` adds stable, shareable _Accession IDs_ to archive files.
+If a backup location is configured, it also performs a backup of the file.
+When running, `finalize` reads messages from the configured RabbitMQ queue (commonly: `accession`).
+For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):
+
+1. The message is validated as valid JSON that matches the `ingestion-accession` schema. If the message can’t be validated it is discarded with an error message in the logs.
+2. If the service is configured to perform backups, i.e. the `ARCHIVE_` and `BACKUP_` storage backends are set, archived files will be copied to the backup location.
+    1. The file size on disk is requested from the storage system.
+    2. The database file size is compared against the disk file size.
+    3. A file reader is created for the archive storage file, and a file writer is created for the backup storage file.
+3. The file data is copied from the archive file reader to the backup file writer.
+4. If the type of the `DecryptedChecksums` field in the message is `sha256`, the value is stored.
+5. A new RabbitMQ `complete` message is created and validated against the `ingestion-completion` schema. If the validation fails, an error message is written to the logs.
+6. The file accession ID in the message is marked as *ready* in the database. On error the service sleeps for up to 5 minutes to allow for database recovery; after 5 minutes the message is Nacked, re-queued and an error message is written to the logs.
+7. The complete message is sent to RabbitMQ. On error, a message is written to the logs.
+8. The original RabbitMQ message is Ack'ed.
+
+## Communication
+
+- `Finalize` reads messages from one RabbitMQ queue (commonly: `accession`).
+- `Finalize` publishes messages with one routing key (commonly: `completed`).
+- `Finalize` assigns the accession ID to a file in the database using the `SetAccessionID` function.
+
 ## Configuration
 There are a number of options that can be set for the `finalize` service.
@@ -98,27 +123,3 @@ and if `*_TYPE` is `POSIX`:
 - `*_LOCATION`: POSIX path to use as storage root
-## Service Description
-
-`Finalize` adds stable, shareable _Accession ID_'s to archive files.
-If a backup location is configured it will perform backup of a file.
-When running, `finalize` reads messages from the configured RabbitMQ queue (commonly: `accession`).
-For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):
-
-1. The message is validated as valid JSON that matches the `ingestion-accession` schema. If the message can’t be validated it is discarded with an error message in the logs.
-2. If the service is configured to perform backups i.e. the `ARCHIVE_` and `BACKUP_` storage backend are set. Archived files will be copied to the backup location.
-    1. The file size on disk is requested from the storage system.
-    2. The database file size is compared against the disk file size.
-    3. A file reader is created for the archive storage file, and a file writer is created for the backup storage file.
-3. The file data is copied from the archive file reader to the backup file writer.
-4. If the type of the `DecryptedChecksums` field in the message is `sha256`, the value is stored.
-5. A new RabbitMQ `complete` message is created and validated against the `ingestion-completion` schema. If the validation fails, an error message is written to the logs.
-6. The file accession ID in the message is marked as *ready* in the database. On error the service sleeps for up to 5 minutes to allow for database recovery, after 5 minutes the message is Nacked, re-queued and an error message is written to the logs.
-7. The complete message is sent to RabbitMQ. On error, a message is written to the logs.
-8. The original RabbitMQ message is Ack'ed.
-
-## Communication
-
-- `Finalize` reads messages from one RabbitMQ queue (commonly: `accession`).
-- `Finalize` publishes messages with one routing key (commonly: `completed`).
-- `Finalize` assigns the accession ID to a file in the database using the `SetAccessionID` function.
diff --git a/docs/services/ingest.md b/docs/services/ingest.md
index f7cbd68..df5b37b 100644
--- a/docs/services/ingest.md
+++ b/docs/services/ingest.md
@@ -3,6 +3,45 @@
 Splits the Crypt4GH header and moves it to database. The remainder of the file is sent to the storage backend (archive). No cryptographic tasks are done.
+## Service Description
+
+The `ingest` service copies files from the file inbox to the archive, and registers them in the database.
+
+When running, `ingest` reads messages from the configured RabbitMQ queue (commonly: `ingest`).
+For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):
+
+1. The message is validated as valid JSON that matches the `ingestion-trigger` schema.
+If the message can’t be validated it is discarded with an error message in the logs.
+2. If the message is of type `cancel`, the file will be marked as `disabled` and the next message in the queue will be read.
+3. A file reader is created for the filepath in the message.
+If the file reader can’t be created an error is written to the logs, the message is Nacked and forwarded to the error queue.
+4. The file size is read from the file reader.
+On error, the error is written to the logs, the message is Nacked and forwarded to the error queue.
+5. A UUID is generated, and a file writer is created in the archive using the UUID as filename.
+On error, the error is written to the logs and the message is Nacked and then re-queued.
+6. The filename is inserted into the database along with the user id of the uploading user. If the file already exists in the database, the status is updated.
+Errors are written to the error log.
+Errors writing the filename to the database do not halt ingestion progress.
+7. The header is read from the file, and decrypted to ensure that it’s encrypted with the correct key.
+If the decryption fails, an error is written to the error log, the message is Nacked, and the message is forwarded to the error queue.
+8. The header is written to the database.
+Errors are written to the error log.
+9. The header is stripped from the file data, and the remaining file data is written to the archive.
+Errors are written to the error log.
+10. The size of the archived file is read.
+Errors are written to the error log.
+11. The database is updated with the file size, archive path, and archive checksum, and the file is set as *archived*.
+Errors are written to the error log.
+This error does not halt ingestion.
+12. A message is sent back to the original RabbitMQ broker containing the upload user, upload file path, database file id, archive file path and checksum of the archived file.
+
+## Communication
+
+- `Ingest` reads messages from one RabbitMQ queue (commonly: `ingest`).
+- `Ingest` publishes messages to one RabbitMQ queue (commonly: `archived`).
+- `Ingest` inserts file information in the database using three database functions, `InsertFile`, `StoreHeader`, and `SetArchived`.
+- `Ingest` reads file data from inbox storage and writes data to archive storage.
+
 ## Configuration
 There are a number of options that can be set for the `ingest` service.
@@ -99,42 +138,3 @@ and if `*_TYPE` is `POSIX`:
 - `error`
 - `fatal`
 - `panic`
-
-## Service Description
-
-The `ingest` service copies files from the file inbox to the archive, and registers them in the database.
-
-When running, `ingest` reads messages from the configured RabbitMQ queue (commonly: `ingest`).
-For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):
-
-1. The message is validated as valid JSON that matches the `ingestion-trigger` schema.
-If the message can’t be validated it is discarded with an error message in the logs.
-2. If the message is of type `cancel`, the file will be marked as `disabled` and the next message in the queue will be read.
-3. A file reader is created for the filepath in the message.
-If the file reader can’t be created an error is written to the logs, the message is Nacked and forwarded to the error queue.
-4. The file size is read from the file reader.
-On error, the error is written to the logs, the message is Nacked and forwarded to the error queue.
-5. A uuid is generated, and a file writer is created in the archive using the uuid as filename.
-On error the error is written to the logs and the message is Nacked and then re-queued.
-6. The filename is inserted into the database along with the user id of the uploading user. In case the file is already existing in the database, the status is updated.
-Errors are written to the error log.
-Errors writing the filename to the database do not halt ingestion progress.
-7. The header is read from the file, and decrypted to ensure that it’s encrypted with the correct key.
-If the decryption fails, an error is written to the error log, the message is Nacked, and the message is forwarded to the error queue.
-8. The header is written to the database.
-Errors are written to the error log.
-9. The header is stripped from the file data, and the remaining file data is written to the archive.
-Errors are written to the error log.
-10. The size of the archived file is read.
-Errors are written to the error log.
-11. The database is updated with the file size, archive path, and archive checksum, and the file is set as *archived*.
-Errors are written to the error log.
-This error does not halt ingestion.
-12. A message is sent back to the original RabbitMQ broker containing the upload user, upload file path, database file id, archive file path and checksum of the archived file.
-
-## Communication
-
-- `Ingest` reads messages from one RabbitMQ queue (commonly: `ingest`).
-- `Ingest` publishes messages to one RabbitMQ queue (commonly: `archived`).
-- `Ingest` inserts file information in the database using three database functions, `InsertFile`, `StoreHeader`, and `SetArchived`.
-- `Ingest` reads file data from inbox storage and writes data to archive storage.
diff --git a/docs/services/intercept.md b/docs/services/intercept.md
index ea4b8e2..d42190e 100644
--- a/docs/services/intercept.md
+++ b/docs/services/intercept.md
@@ -2,6 +2,22 @@
 The `intercept` service relays messages between Central EGA and Federated EGA nodes.
+## Service Description
+
+When running, `intercept` reads messages from the configured RabbitMQ queue (commonly: `from_cega`).
+For each message, these steps are taken:
+
+1. The message type is read from the message `type` field.
+    1. If the message `type` is not known, an error is logged and the message is Ack'ed.
+2. The correct queue for the message is decided based on message type.
+3. The message is sent to the queue. This has no error handling as the resend-mechanism hasn't been finished.
+4. The message is Ack'ed.
+
+## Communication
+
+- `Intercept` reads messages from one queue (commonly: `from_cega`).
+- `Intercept` publishes messages to three queues, `accession`, `ingest`, and `mappings`.
+
 ## Configuration
 There are a number of options that can be set for the `intercept` service.
@@ -43,19 +59,3 @@ These settings control how `intercept` connects to the RabbitMQ message broker.
 - `error`
 - `fatal`
 - `panic`
-
-## Service Description
-
-When running, `intercept` reads messages from the configured RabbitMQ queue (commonly: `from_cega`).
-For each message, these steps are taken:
-
-1. The message type is read from the message `type` field.
-    1. If the message `type` is not known, an error is logged and the message is Ack'ed.
-2. The correct queue for the message is decided based on message type.
-3. The message is sent to the queue. This has no error handling as the resend-mechanism hasn't been finished.
-4. The message is Ack'ed.
-
-## Communication
-
-- `Intercept` reads messages from one queue (commonly: `from_cega`).
-- `Intercept` publishes messages to three queues, `accession`, `ingest`, and `mappings`.
diff --git a/docs/services/mapper.md b/docs/services/mapper.md
index 040ebea..abc0d18 100644
--- a/docs/services/mapper.md
+++ b/docs/services/mapper.md
@@ -3,6 +3,29 @@
 The mapper service registers mapping of accessionIDs (stable ids for files) to datasetIDs. Once the file accession ID has been mapped to a dataset ID, the file is removed from the inbox.
+## Service Description
+
+The `mapper` service maps file `accessionIDs` to `datasetIDs`.
+
+When running, `mapper` reads messages from the configured RabbitMQ queue (commonly: `mappings`).
+For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):
+
+1. The message is validated as valid JSON that matches the `dataset-mapping` schema.
+If the message can’t be validated it is discarded and an error message is logged.
+2. AccessionIDs from the message are mapped to a datasetID (also in the message) in the database.
+On error the service sleeps for up to 5 minutes to allow for database recovery; after 5 minutes the message is Nacked, re-queued and an error message is written to the logs.
+3. The uploaded files related to each AccessionID are removed from the inbox.
+If this fails an error will be written to the logs.
+4. The RabbitMQ message is Ack'ed.
+
+## Communication
+
+- `Mapper` reads messages from one RabbitMQ queue (commonly: `mappings`).
+- `Mapper` maps files to datasets in the database using the `MapFilesToDataset` function.
+- `Mapper` retrieves the inbox filepath from the database for each file using the `GetInboxPath` function.
+- `Mapper` sets the status of a dataset in the database using the `UpdateDatasetEvent` function.
+- `Mapper` removes data from inbox storage.
+
 ## Configuration
 There are a number of options that can be set for the `mapper` service.
@@ -93,26 +116,3 @@ and if `*_TYPE` is `POSIX`:
 - `error`
 - `fatal`
 - `panic`
-
-## Service Description
-
-The `mapper` service maps file `accessionIDs` to `datasetIDs`.
-
-When running, `mapper` reads messages from the configured RabbitMQ queue (commonly: `mappings`).
-For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message):
-
-1. The message is validated as valid JSON that matches the `dataset-mapping` schema.
-If the message can’t be validated it is discarded with an error message is logged.
-2. AccessionIDs from the message are mapped to a datasetID (also in the message) in the database.
-On error the service sleeps for up to 5 minutes to allow for database recovery, after 5 minutes the message is Nacked, re-queued and an error message is written to the logs.
-3. The uploaded files related to each AccessionID is removed from the inbox
-If this fails an error will be written to the logs.
-4. The RabbitMQ message is Ack'ed.
-
-## Communication
-
-- `Mapper` reads messages from one RabbitMQ queue (commonly: `mappings`).
-- `Mapper` maps files to datasets in the database using the `MapFilesToDataset` function.
-- `Mapper` retrieves the inbox filepath from the database for each file using the `GetInboxPath` function.
-- `Mapper` sets the status of a dataset in the database using the `UpdateDatasetEvent` function.
-- `Mapper` removes data from inbox storage.
diff --git a/docs/services/verify.md b/docs/services/verify.md
index b085969..a0f18fa 100644
--- a/docs/services/verify.md
+++ b/docs/services/verify.md
@@ -2,6 +2,44 @@
 Uses a crypt4gh secret key, this service can decrypt the stored files and checksum them against the embedded checksum for the unencrypted file.
+## Service Description
+
+The `verify` service ensures that ingested files are encrypted with the correct key, and that the provided checksums match those of the ingested files.
+
+When running, `verify` reads messages from the configured RabbitMQ queue (commonly: `archived`).
+For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message.
+Unless explicitly stated, error messages are *not* written to the RabbitMQ error queue, and messages are neither NACKed nor ACKed.):
+
+1. The message is validated as valid JSON that matches the `ingestion-verification` schema.
+If the message can’t be validated it is discarded with an error message in the logs.
+2. The service attempts to fetch the header for the file id in the message from the database.
+If this fails a NACK will be sent for the RabbitMQ message, the error will be written to the logs, and sent to the RabbitMQ error queue.
+3. The file size of the encrypted file is fetched from the archive storage system.
+If this fails an error will be written to the logs.
+4. The archive file is then opened for reading.
+If this fails an error will be written to the logs and to the RabbitMQ error queue.
+5. A decryptor is opened with the archive file.
+If this fails an error will be written to the logs.
+6. The file size, md5 and sha256 checksum will be read from the decryptor.
+If this fails an error will be written to the logs.
+7. If the `re_verify` boolean is not set in the RabbitMQ message, the message processing ends here, and continues with the next message.
+Otherwise the processing continues with verification:
+    1. A verification message is created, and validated against the `ingestion-accession-request` schema.
+    If this fails an error will be written to the logs.
+    2. The file is marked as *verified* in the database (*COMPLETED* if you are using database schema <= `3`).
+    If this fails an error will be written to the logs.
+    3. The verification message created in step 7.1 is sent to the `verified` queue.
+    If this fails an error will be written to the logs.
+    4. The original RabbitMQ message is ACKed.
+    If this fails an error is written to the logs, but processing continues to the next step.
+
+## Communication
+
+- `Verify` reads messages from one RabbitMQ queue (commonly: `archived`).
+- `Verify` publishes messages to one RabbitMQ queue (commonly: `verified`).
+- `Verify` gets the file encryption header from the database using `GetHeader`,
+and marks the files as `verified` (`COMPLETED` in db version <= `2.0`) using `MarkCompleted`.
+
 ## Configuration
 There are a number of options that can be set for the `verify` service.
@@ -103,41 +141,3 @@ and if `*_TYPE` is `POSIX`:
 - `error`
 - `fatal`
 - `panic`
-
-## Service Description
-
-The `verify` service ensures that ingested files are encrypted with the correct key, and that the provided checksums match those of the ingested files.
-
-When running, `verify` reads messages from the configured RabbitMQ queue (commonly: `archived`).
-For each message, these steps are taken (if not otherwise noted, errors halt progress and the service moves on to the next message.
-Unless explicitly stated, error messages are *not* written to the RabbitMQ error queue, and messages are not NACK or ACKed.):
-
-1. The message is validated as valid JSON that matches the `ingestion-verification` schema.
-If the message can’t be validated it is discarded with an error message in the logs.
-2. The service attempts to fetch the header for the file id in the message from the database.
-If this fails a NACK will be sent for the RabbitMQ message, the error will be written to the logs, and sent to the RabbitMQ error queue.
-3. The file size of the encrypted file is fetched from the archive storage system.
-If this fails an error will be written to the logs.
-4. The archive file is then opened for reading.
-If this fails an error will be written to the logs and to the RabbitMQ error queue.
-5. A decryptor is opened with the archive file.
-If this fails an error will be written to the logs.
-6. The file size, md5 and sha256 checksum will be read from the decryptor.
-If this fails an error will be written to the logs.
-7. If the `re_verify` boolean is not set in the RabbitMQ message, the message processing ends here, and continues with the next message.
-Otherwise the processing continues with verification:
-    1. A verification message is created, and validated against the `ingestion-accession-request` schema.
-    If this fails an error will be written to the logs.
-    2. The file is marked as *verified* in the database (*COMPLETED* if you are using database schema <= `3`).
-    If this fails an error will be written to the logs.
-    3. The verification message created in step 7.1 is sent to the `verified` queue.
-    If this fails an error will be written to the logs.
-    4. The original RabbitMQ message is ACKed.
-    If this fails an error is written to the logs, but processing continues to the next step.
-
-## Communication
-
-- `Verify` reads messages from one RabbitMQ queue (commonly: `archived`).
-- `Verify` publishes messages to one RabbitMQ queue (commonly: `verified`).
-- `Verify` gets the file encryption header from the database using `GetHeader`,
-and marks the files as `verified` (`COMPLETED` in db version <= `2.0`) using `MarkCompleted`.
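
Step 6 of the `verify` description in this diff (reading the file size, md5 and sha256 checksum from the decryptor) can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not taken from the service's code. The names `digest_stream` and `StreamDigest` are hypothetical, and an in-memory `io.BytesIO` stands in for the Crypt4GH decryptor, which is assumed here to behave like any readable byte stream.

```python
import hashlib
import io
from typing import BinaryIO, NamedTuple


class StreamDigest(NamedTuple):
    """Hypothetical result type: byte count plus hex digests of a stream."""
    size: int
    md5: str
    sha256: str


def digest_stream(stream: BinaryIO, chunk_size: int = 64 * 1024) -> StreamDigest:
    """Read `stream` to the end once, updating the size and both hashes per chunk."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read signals end of stream
            break
        size += len(chunk)
        md5.update(chunk)
        sha256.update(chunk)
    return StreamDigest(size, md5.hexdigest(), sha256.hexdigest())


if __name__ == "__main__":
    # Assumption: the decrypted payload is available as a readable byte stream.
    decrypted = io.BytesIO(b"hello world")
    result = digest_stream(decrypted)
    print(result.size, result.md5, result.sha256)
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, and a single pass yields all three values that the documented step reports.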