This repository has been archived by the owner on Nov 6, 2023. It is now read-only.
Background:
CEGA generates file stable IDs from the unencrypted file checksum. This means that files with identical content get the same stable_id. This causes problems, because it prevents finalize from assigning the stable_id and continuing to mapper.
For CEGA, using the same ID works because they only access files by stable_id. For Bigpicture, however, there have been requests to download files using the upload filename, so each file needs a correct stable_id and submission_file_path.
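The collision can be reproduced in a minimal sketch. Everything here is an assumption for illustration: the ID derivation in `stable_id_from_content`, the simplified `files` table, and the uniqueness constraint on `stable_id` are stand-ins, not the actual CEGA scheme or the real sda.files definition.

```python
import hashlib
import sqlite3


def stable_id_from_content(content: bytes) -> str:
    # Hypothetical derivation for illustration; the real CEGA scheme may differ.
    return "EGAF-" + hashlib.sha256(content).hexdigest()[:12]


def assign_stable_id(conn: sqlite3.Connection, file_id: int, stable_id: str) -> bool:
    """Try to set stable_id on a file; return False if the ID is already taken."""
    try:
        conn.execute(
            "UPDATE files SET stable_id = ? WHERE id = ?", (stable_id, file_id)
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    " id INTEGER PRIMARY KEY,"
    " submission_file_path TEXT NOT NULL,"
    " stable_id TEXT UNIQUE)"  # assumed uniqueness constraint
)
conn.execute("INSERT INTO files (id, submission_file_path) VALUES (1, 'a/sample.c4gh')")
conn.execute("INSERT INTO files (id, submission_file_path) VALUES (2, 'b/copy.c4gh')")

# Two uploads with different paths but identical unencrypted content.
content = b"identical payload"
first_ok = assign_stable_id(conn, 1, stable_id_from_content(content))
second_ok = assign_stable_id(conn, 2, stable_id_from_content(content))
print(first_ok, second_ok)
```

Under these assumptions the first assignment succeeds and the second is rejected, so the second upload never gets a stable_id.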
Possible Solution
One of the core issues is whether multiple uploads should share the same sda.files entry. Storage deduplication can be solved by pointing to the same archive path regardless of the number of items, but there is also a case to be made for having all files that contain the same data use the same database entry. One option is to move the storage information into a separate table, so that multiple "upload files" can reference a single "storage file".
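A rough sketch of the separate-table option, using SQLite in place of the real database. The table names `storage_files` and `upload_files`, their columns, and the helper `register_upload` are hypothetical and only illustrate the shape of the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical table: one row per distinct archived payload.
    CREATE TABLE storage_files (
        id INTEGER PRIMARY KEY,
        checksum TEXT NOT NULL UNIQUE,
        archive_path TEXT NOT NULL
    );
    -- Hypothetical table: one row per upload, pointing at shared storage.
    CREATE TABLE upload_files (
        id INTEGER PRIMARY KEY,
        submission_file_path TEXT NOT NULL,
        storage_file_id INTEGER NOT NULL REFERENCES storage_files(id)
    );
    """
)


def register_upload(conn, submission_file_path, checksum, archive_path):
    """Record an upload, reusing the storage row if the checksum is known."""
    row = conn.execute(
        "SELECT id FROM storage_files WHERE checksum = ?", (checksum,)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO storage_files (checksum, archive_path) VALUES (?, ?)",
            (checksum, archive_path),
        )
        storage_id = cur.lastrowid
    else:
        storage_id = row[0]
    cur = conn.execute(
        "INSERT INTO upload_files (submission_file_path, storage_file_id)"
        " VALUES (?, ?)",
        (submission_file_path, storage_id),
    )
    return cur.lastrowid


# Two uploads with the same content share one storage row.
upload_a = register_upload(conn, "user1/sample.c4gh", "abc123", "archive/0001")
upload_b = register_upload(conn, "user2/copy.c4gh", "abc123", "archive/0002")
storage_count = conn.execute("SELECT COUNT(*) FROM storage_files").fetchone()[0]
upload_count = conn.execute("SELECT COUNT(*) FROM upload_files").fetchone()[0]
```

With this layout each upload keeps its own path (and could carry its own ID), while the archived data is stored and tracked once.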
It will likely be necessary to remove the stable_id field from the sda.files table and instead use the file_references table to store the ID. This is partly a matter of simplifying the schema so that all potential stable IDs are handled in the same manner. If multiple sda.files entries are used, this also allows multiple files to have the same ID when needed for FEGA.
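As a sketch of storing IDs in a reference table instead of on the file row: the column names below (`file_id`, `reference_id`, `reference_scheme`) and the scheme labels are assumptions, not the confirmed layout of file_references, and SQLite stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (
        id INTEGER PRIMARY KEY,
        submission_file_path TEXT NOT NULL
    );
    -- Assumed layout: at most one ID per scheme per file, but the same
    -- reference_id may be attached to several files.
    CREATE TABLE file_references (
        file_id INTEGER NOT NULL REFERENCES files(id),
        reference_id TEXT NOT NULL,
        reference_scheme TEXT NOT NULL,
        PRIMARY KEY (file_id, reference_scheme)
    );
    """
)


def set_reference(conn, file_id, reference_id, scheme):
    """Attach an ID of the given scheme to a file."""
    conn.execute(
        "INSERT INTO file_references (file_id, reference_id, reference_scheme)"
        " VALUES (?, ?, ?)",
        (file_id, reference_id, scheme),
    )


def files_for_reference(conn, reference_id):
    """Return the upload paths of every file that carries this ID."""
    rows = conn.execute(
        "SELECT f.submission_file_path FROM files f"
        " JOIN file_references r ON r.file_id = f.id"
        " WHERE r.reference_id = ? ORDER BY f.id",
        (reference_id,),
    )
    return [row[0] for row in rows]


conn.execute("INSERT INTO files (id, submission_file_path) VALUES (1, 'a/sample.c4gh')")
conn.execute("INSERT INTO files (id, submission_file_path) VALUES (2, 'b/copy.c4gh')")

# Identical content, so a checksum-derived ID is shared by both files.
set_reference(conn, 1, "EGAF-example", "fega")
set_reference(conn, 2, "EGAF-example", "fega")
shared = files_for_reference(conn, "EGAF-example")
```

Because uniqueness is enforced per file and scheme rather than per ID, both files can carry the same ID without blocking each other.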
To solve the submission_file_path problem if a single sda.files entry is used, one solution is to add a file_path field to the file_dataset table, so that a file can have a unique path in each dataset it is part of while still referencing only one sda.files entry.
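The per-dataset path idea can be sketched as follows. The tables are simplified stand-ins in SQLite, and every column other than the proposed `file_path` is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE files (
        id INTEGER PRIMARY KEY,
        archive_path TEXT NOT NULL
    );
    CREATE TABLE datasets (
        id INTEGER PRIMARY KEY,
        stable_id TEXT NOT NULL UNIQUE
    );
    -- file_path is the proposed new field: the path a file is known by
    -- within one particular dataset.
    CREATE TABLE file_dataset (
        file_id INTEGER NOT NULL REFERENCES files(id),
        dataset_id INTEGER NOT NULL REFERENCES datasets(id),
        file_path TEXT NOT NULL,
        PRIMARY KEY (file_id, dataset_id)
    );
    """
)


def path_in_dataset(conn, file_id, dataset_stable_id):
    """Look up the path a file has inside a given dataset, or None."""
    row = conn.execute(
        "SELECT fd.file_path FROM file_dataset fd"
        " JOIN datasets d ON d.id = fd.dataset_id"
        " WHERE fd.file_id = ? AND d.stable_id = ?",
        (file_id, dataset_stable_id),
    ).fetchone()
    return row[0] if row else None


# One stored file that belongs to two datasets under different names.
conn.execute("INSERT INTO files (id, archive_path) VALUES (1, 'archive/0001')")
conn.execute("INSERT INTO datasets (id, stable_id) VALUES (1, 'DATASET-A')")
conn.execute("INSERT INTO datasets (id, stable_id) VALUES (2, 'DATASET-B')")
conn.execute("INSERT INTO file_dataset VALUES (1, 1, 'slides/image_001.c4gh')")
conn.execute("INSERT INTO file_dataset VALUES (1, 2, 'batch7/scan.c4gh')")

path_a = path_in_dataset(conn, 1, "DATASET-A")
path_b = path_in_dataset(conn, 1, "DATASET-B")
```

A download by upload filename would then resolve the path through the dataset mapping rather than through the file row itself.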