Merge pull request #280 from UParmaksiz/patch-3
Update 25-digital-preservation.md
konrad authored Oct 14, 2024
2 parents 223056a + 35094ed commit e84dbf6
Showing 1 changed file with 88 additions and 51 deletions.
139 changes: 88 additions & 51 deletions docs/_RDM-Preserve/25-digital-preservation.md
@@ -5,75 +5,112 @@ layout: default
docs_css: markdown
---

## Definition
Digital preservation means taking certain measures to ensure that digital material can be found and can be accessed in the long term ("long-term accessibility of data"). It aims to preserve information in a way that is understandable and reusable for a specific community and to prove its authenticity.

## Digital preservation for researchers
The sustainable handling of data by researchers naturally facilitates the long-term accessibility of data. Best practice methods are:
* Cleaning data / data structures - see also: [Data Organisation](https://knowledgebase.nfdi4microbiota.de/RDM-Process/14-data-organization.html)
* Validating data - see also: [Data Quality Control](https://knowledgebase.nfdi4microbiota.de/RDM-Collect/13-data-qc.html)
* Documenting data with metadata and context information to ensure reusability: commenting, adding descriptive, administrative and technical metadata, asigning user license.
* Using well-known open file formats during the project phase - see below - or transfering data into reusable file formats (needs documenting: original file or derivative)
* Storing data following the 3-2-1 rule:
* Keeping 3 copies of any important file
* Storing files on 2 different media types
* Keeping at least 1 copy off site.

### Data selection
To decide well-founded on data selection we recommend reading the how-to guide of the Edinburgh Digital Curation Centre {% cite dcc_five_2014 %}. The suggested steps are:
* **Step 1:** Identify purposes that the data could fulfill: consider the purpose or ‘reuse case’ of your data, including reuse outside your research group.
* **Step 2:** Identify data that **must** be kept: consider legal or policy compliance risks, as well as funder requirements.
* **Step 3:** Identify data that **should** be kept: as it may have long-term value.
* **Step 4:** Weigh up the costs and identify any need for external advice in case of shortfall in the budget.
* **Step 5:** Complete the data appraisal, i.e. list what data must, should or could be kept to fulfill which potential reuse purposes. Summarize any actions needed to prepare the data for deposit - or justification for not keeping it.


### Recommended file formats for preservation
Making your research available in recommended file formats additional to the original software format supports highly the reusability and long-term accessibility of your data.
# Definition
Digital preservation (DP) means taking certain measures to ensure that digital material can be found and accessed in the long term (“long-term accessibility of data”). It aims to preserve information in a way that is understandable and reusable for a specific community and to prove its authenticity. [Ref. ISO, OAIS, dpc].

The long-term timeframe starts now and lasts as long as necessary. It may extend into an indefinite future, where there are usually concerns about changing technologies, storage media, obsolete data formats or standards [OAIS, Definition of “Long Term”].

# DP during research

The sustainable handling of data naturally facilitates the long-term accessibility of the data.
Best practice methods are:
* Naming, versioning and data structures, etc. - see also: [Data Organisation](https://knowledgebase.nfdi4microbiota.de/RDM-Process/14-data-organization.html)
* Documenting data with metadata and context information to ensure reusability: commenting, adding descriptive, administrative and technical metadata, assigning a user license - see also: [Metadata and Metadata standards](https://knowledgebase.nfdi4microbiota.de/Research-Data-Management/03-md.html)
* Using well-known open file formats during the project phase - see below - or converting data into reusable file formats (needs documenting: original file or converted file)
* Storing data in compliance with the 3-2-1 rule:
* Keep 3 copies of any important file
* Store files on 2 different types of data carriers
* Keep at least 1 copy off-site.
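
The 3-2-1 rule above can be expressed as a simple check. The following is only a minimal sketch in Python: the `Copy` record and its field names are hypothetical and chosen for illustration, not part of any standard tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One stored copy of a file (hypothetical record for illustration)."""
    location: str   # free-text label, e.g. "lab NAS"
    carrier: str    # type of data carrier, e.g. "hdd", "tape", "cloud"
    off_site: bool  # True if the copy is kept at a different site


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule."""
    enough_copies = len(copies) >= 3
    enough_carriers = len({c.carrier for c in copies}) >= 2
    has_off_site = any(c.off_site for c in copies)
    return enough_copies and enough_carriers and has_off_site
```

Such a check could be run over an inventory of storage locations as part of regular housekeeping; how the inventory is maintained is left open here.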

## Data selection
To make a well-founded decision on data selection, the suggested steps are:
* **Step 1:** Identify data that **must** and **can** be kept: consider legal or policy compliance risks, as well as funder requirements.
* **Step 2:** Identify data that **should** be kept: data which was expensive to generate, which is impossible to reproduce, particularly curated data and/or data which supports research findings in papers.
* **Step 3:** Weigh up the **costs** and identify any need for external advice in case of shortfall in the budget.
* **Step 4:** Identify **purposes** that the data could fulfill: consider the purpose or ‘reuse case’ of your data, including reuse outside your research group.
* **Step 5:** Complete the data **appraisal**, i.e. list what data must, should or could be kept to fulfill which potential reuse purposes. Summarize any actions needed to prepare the data for deposit - or justification for not keeping it. - see also: Digital preservation in [Data Management Plans (DMPs)](https://knowledgebase.nfdi4microbiota.de/RDM-Plan/01-dmp.html#digital-preservation-in-dmps)


See also: How-to guide of the Edinburgh Digital Curation Centre (DCC, 2014).

## File formats recommendation
Providing your research in recommended file formats in addition to the original format supports the reusability and long-term accessibility of your data.

Attributes of those file formats are:
* Open rather than proprietary (examples of [open file formats](https://en.wikipedia.org/wiki/List_of_open_file_formats))
* Well-documented
* In widespread use
* Simple (e.g. csv rather than xlsx)
* Text-based (i.e. any file you can open with a text editor and read) rather than binary (e.g. txt files rather than doc files)
* Exportable to / unpackable into an open format (e.g. xlsx, docx, etc. can be unpacked into folders of xml files)
* Can be exported into an open format (e.g. xlsx, docx, etc. can be unpacked into folders of xml files)
* Machine-readable

For biomaterial data, recommended formats are CSV, TXT and XML.

## Digital preservation for repository operators
# DP in labs
Depending on the project, institution and funding guidelines, the time at which the data of a research project is transferred to a local data centre for final documentation or to a repository, e.g. for publication, varies. Until then, preparing and setting up long-term archiving in your lab or on-site facility is an effort that will undoubtedly contribute to the sustainability of your research.
The following section is intended to provide a basic understanding of the possible measures:
* Determine responsibilities
* Define who will be responsible for the data of your organization/research project in the long term. Determine handover scenarios in case that person leaves.
* Define possible risks of data loss and which follow-up measures should be taken - even after the project has been completed.
* Decide which technical support, hardware and software are required and who will provide the resources. - see also: [Data Management Plans (DMP)](https://nfdi4microbiota.github.io/nfdi4microbiota-knowledge-base/RDM-Plan/01-dmp.html#content-of-dmps)
* Determine who should be able to find and access the data.
* How will these persons be made aware of the existence of relevant data in this location? How can they search for specific projects or files?
* Set up a website, an index or a database with required metadata and ensure that necessary metadata is entered.
* Formulate your own criteria for selecting the data that you want to and can preserve.
* The criteria should be available to all researchers involved. - see above: “Data Selection”
* Determine how long data should be kept.
* Set up a system that informs the person(s) responsible when the data can be deleted, unless an indefinite retention period has been specified.
* Define the necessary rights for data storage, active data preservation and, if applicable, deletion.
* In the event that the rights need to be transferred to another person at some point, the necessary procedures and documentation must be set up.
* If the rights for different folders, files, database entries or similar differ among your preserved data, document these with the data if necessary and make this documentation machine-readable.
* Determine the level of digital preservation
* **DP at Bitstream level:** This is the basis for being able to preserve digital objects and control changes at all.
* Check that files are not encrypted, password-protected or protected against printing or copying of content.
* Check that files are virus-free
* Generate checksums of files, check them on transfer or receipt, document them and conduct regular fixity checks so that it becomes apparent when files are no longer intact
* Store data redundantly - see above: [3-2-1 rule](https://knowledgebase.nfdi4microbiota.de/RDM-Preserve/25-digital-preservation.html#digital-preservation-for-researchers)
* Develop strategies for monitoring and updating storage media (e.g. according to the technical lifespan)
* Perform control, logging and versioning of any changes
* **DP at Content preservation level:** This is generally understood to cover the combination of technical-logical and semantic preservation, so that it remains clear what the data is intended for and how it is organized technically (ref. NFDI paper, DOI: 10.5281/zenodo.11109480).
* Check whether manuals (readme, codebook, data dictionary, ...) are available, e.g. to describe the software used or the structure of the data
* Describe your data with sufficient metadata (incl. information about versions, other publications, relationships between files) that support the FAIR principles and store it with your data, e.g. in a database
* Check whether codes and scripts are prepared according to coding best practices.
* Decide in favour of granting sufficient rights to enable technical maintenance measures (file repairs, file format migrations, ...)
* Check whether the digital object can be opened in software and reproduced correctly
* Make sure the file format matches the file extension – even if software can render the file with the mismatch. Try tools - see also: [COPTR file format identification tools](https://coptr.digipres.org/index.php/File_Format_Identification), specifically [DROID](https://coptr.digipres.org/index.php/DROID)
* Check if the files conform to their format specifications (e.g. a well-formed and valid XML file). Try tools - see also: [COPTR validators](https://coptr.digipres.org/index.php/Validation)
* Replace files with problems and document all changes made to the digital object as part of the curative process.
* Document all versions of files with the option to revert to previous versions if required
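
The checksum and fixity measures named under the bitstream level above can be sketched as follows. This is a minimal illustration under stated assumptions: SHA-256 is chosen here as the checksum algorithm, and the function names and the in-memory manifest format are hypothetical. In practice the manifest would be stored and documented alongside the data.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file below root (as a relative POSIX path) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def fixity_check(root, manifest):
    """Compare the current state of root against a stored manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

A regular fixity check would call `fixity_check` on a schedule and log the result; any non-empty entry in the report indicates that a file should be restored from one of the redundant copies.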

# DP for repository operators
Repositories usually contain publications in the form of files and are dependent on the quality of submission. Similar to digital preservation in laboratories, digital preservation in repositories depends on sustainable organisational and technical capabilities to prevent technical and semantic obsolescence. In addition, repositories have a particular responsibility to provide published research data and their metadata in a machine-readable and usable way that meets the needs of their user community. - see also: [FAIR-principles](https://nfdi4microbiota.github.io/nfdi4microbiota-knowledge-base/Research-Data-Management/04-fair.html). However, the deliberate design of the metadata schema and the repository's publication policy may impose specific requirements for the acceptance of the digital object. Digital Preservation may add additional data quality processes.


Specific preservation measures depend on the digital objects, needs of the user community, and various other conditions. Repositories usually contain publications as files, making file format identification and validation relevant.
Most measures mentioned in the section “DP in labs” are just as essential for repositories (see above).
Additionally, some workflows and tools beneficial for repository operators are listed below:

* Check all mandatory metadata and encourage further comprehensive description - also for files
* Extract technical metadata from files automatically (e.g. via [FIDO](https://openpreservation.org/tools/fido/), for more - see also: [COPTR metadata extraction tools](https://coptr.digipres.org/index.php/Metadata_Extraction))
* Specify who is responsible if files or file content are at risk, including for follow-up measures if problems occur at a later date. It can be very helpful not to lose contact with the depositor, e. g. researcher.
* Encourage your users to inform you about publications in your repository that cannot be used
* Replace files with problems (e.g. invalid files) as early as possible, e.g. migrate obsolete file formats to sustainable formats. - see also: [COPTR migration tools](https://coptr.digipres.org/index.php/File_Format_Migration).
* Document all technical changes made to the digital object and ensure that versioning is activated so that you can revert to previous versions if necessary.
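
To illustrate what file format identification and validation involve, here is a deliberately small sketch. It is not how DROID or the COPTR-listed validators are implemented: the signature table below is a tiny hand-picked assumption (DROID relies on the far larger PRONOM signature registry), the function names are hypothetical, and the XML check covers well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumption: a tiny, hand-picked table of leading byte signatures.
SIGNATURES = {
    ".pdf": b"%PDF-",
    ".png": b"\x89PNG\r\n\x1a\n",
    ".zip": b"PK\x03\x04",   # non-empty ZIP archives
    ".xlsx": b"PK\x03\x04",  # ZIP-based container
    ".docx": b"PK\x03\x04",  # ZIP-based container
}


def extension_matches_signature(path):
    """Check whether a file starts with the bytes expected for its extension.

    Returns True or False for extensions in SIGNATURES, and None when the
    extension is not covered by this small table.
    """
    path = Path(path)
    expected = SIGNATURES.get(path.suffix.lower())
    if expected is None:
        return None
    with open(path, "rb") as handle:
        return handle.read(len(expected)) == expected


def is_well_formed_xml(path):
    """Return True if the file parses as XML (well-formed, not schema-valid)."""
    try:
        ET.parse(str(path))
    except ET.ParseError:
        return False
    return True
```

Results of such checks are worth recording with the object's technical metadata, so that files with mismatches or parse errors can be followed up with the depositor.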

### Bitstream preservation
Preservation on the bitstream level is the basis for digital preservation. It covers e. g.
* Checking checksums of transferred files upon receiving them (or generating file checksums) and conducting regular fixity checks
* Redundant storage of data
* Generating backups (e. g. offline backups of the underlying repository database)
* Strategies for updating storage media (according to e. g. server lifetime)

### Preservation beyond bitstream
Preservation of file content, being able to open and render it correctly in a software is part of logical {% cite lindlar_2020_3672773 %} or technical preservation, also called digital curation. Semantic preservation is concerned with e. g. semantic drift impacting metadata.
* Obtaining sufficient rights allowing e. g. format migrations, file repairs and re-use over the long-term like re-publication in other infrastructures
* File format identification, based format-specific bit patterns, e. g. via [DROID](https://coptr.digipres.org/index.php/DROID) during publication process
* File format validation, based on format specifications, e. g. using XML validators during publication process (for validators see also [COPTR](https://coptr.digipres.org/index.php/Validation))
* Automated extraction of technical metadata from files (see also [COPTR](https://coptr.digipres.org/index.php/Metadata_Extraction))
* Virus scans
* Replacing files with problems (e. g. invalid files) as early as possible
* Obtaining sufficient metadata
* storing unique file identifiers, machine-readable version information and relations between files
* indexing rights information in a machine usable way
* indexing of identification-, validation-, metadata extraction- and virus check-output
* preserving and updating of descriptive metadata, according to user community needs
* Migrating at-risk files, e. g. files with obsolete formats (see also [migration tools]( https://coptr.digipres.org/index.php/File_Format_Migration))
* Providing versioning of files and publications and possibility to rollback to earlier versions

Many digital preservation criteria applying to repositories are also present in the certification criteria of the CoreTrustSeal and the nestor seal {% cite coretrustseal_standards_and_certificatio_2022_7051012 harmsen_henk_explanatory_2013 %}.
Many of the criteria for digital preservation that apply to repositories can also be found in the certification [criteria of the CoreTrustSeal](https://zenodo.org/records/7051096) and the nestor seal (Standards & Board, 2022; Harmsen et al., 2013). Taking certification material into account is a good idea in any case, even if certification is not or not yet an issue. Research funders increasingly recommend publication of research data in certified repositories.


{% cite coretrustseal_standards_and_certificatio_2022_7051012 harmsen_henk_explanatory_2013 %}.

## Get Help
If you have any further questions about the management and analysis of your microbial research data, please contact us: [[email protected]](mailto:[email protected]) (by emailing us you agree to the privacy policy on our website: [Contact](https://nfdi4microbiota.de/contact-form/))

## References
{% bibliography --cited_in_order %}
1. DCC. (2014). Five steps to decide what data to keep: a checklist for appraising research data v.1. In Edinburgh: Digital Curation Centre. https://www.dcc.ac.uk/guidance/how-guides/five-steps-decide-what-data-keep
2. Lindlar, M., Rudnik, P., Horton, L., & Jones, S. (2020). "You say potato, I say potato" - Mapping Digital Preservation and Research Data Management Concepts towards Collective Curation and Preservation Strategies. 15(1). https://doi.org/10.2218/ijdc.v15i1.728
3. Standards, C. T. S., & Board, C. (2022). CoreTrustSeal Requirements 2023-2025 (Version V01.00). Zenodo. https://doi.org/10.5281/zenodo.7051012
4. Harmsen, H., Keitel, C., Schmidt, C., Schoger, A., Schrimpf, S., Stürzlinger, M., & Wolf, S. (2013). Explanatory notes on the nestor Seal for Trustworthy Digital Archives. nestor Certification Working Group (Vol. 17). https://nbn-resolving.de/urn:nbn:de:0008-2013100901

