DAIS - Digital Archive of the Serbian Academy of Sciences and Arts: Preservation plan

This public wiki is about the DAIS – Digital Archive of the Serbian Academy of Sciences and Arts

The Preservation Plan (henceforward “Plan”) outlines the principles guiding the main activities of the Serbian Academy of Sciences and Arts (SASA), SASA institutes (repository owners or participating institutions) and the University of Belgrade Computer Centre (RCUB, service provider) regarding sustainable preservation of and access to the digital content of DAIS - Digital Archive of the Serbian Academy of Sciences and Arts (henceforward “Repository”).

This Plan follows guidelines and standards for digital preservation, such as the FAIR principles, the OAIS Reference Model and CoreTrustSeal. It has been developed by the repository development team at RCUB in collaboration with the representatives of SASA and its institutes. The Plan is subject to revision at five-year intervals.

The Plan aims at ensuring that the digital content in the Repository remains accessible, understandable, and sufficiently usable over the long term. Efforts are also made to mitigate the risk of deterioration, damage, data loss and corruption, as well as the obsolescence of file formats, storage or dissemination means.

The Plan contains:

  • Preservation Strategy, as a general framework, and
  • Preservation Policy, detailing the implementation of the principles defined in the Preservation Strategy.

Preservation Strategy

Objectives

The primary objective of the Repository is to provide infrastructure for archiving, long-term preservation, wide dissemination and Open Access, when possible, to publications and other research outputs resulting from the projects and other activities implemented by the Serbian Academy of Sciences and Arts (SASA) and SASA institutes.

Preservation efforts primarily focus on publications created by SASA and its institutes since the 1840s. As the initial steps towards archiving research data are being made, the Repository should also ensure that these data are available in the long term for reuse, replication and verification purposes.

The national Open Science policy (Open Science Platform, 2018) assumes an indefinitely long retention period for publications. According to the Code of Conduct in Research (2018), the recommended retention period for research data is at least 5 years (preferably 10 years). In the context of the Repository, the long term is certainly longer than 10 years.

The Repository also aims to:

  • ensure data authenticity;
  • maintain data integrity and quality;
  • ensure appropriate management of digital resources throughout their lifecycle;
  • ensure appropriate levels of information security;
  • serve as a trustworthy digital repository.

Mission

In line with their mission and the role of publicly funded institutions, SASA, the institutes, and RCUB seek to provide reliable and secure archiving for diverse outputs of SASA and its institutes, while ensuring easy access, long-term preservation, and the widest possible dissemination of the Repository content. To ensure continued access to and use of these resources, the Repository follows a policy of active preservation.

Designated community

The designated community of the Repository includes:

  • Internal users: members of the Serbian Academy of Sciences and Arts, SASA research support staff, and the researchers and research support staff of SASA institutes.
  • Associates, i.e. local and international professional researchers who have joint projects with internal users.

Internal users and Associates belong to various disciplines (linguistics, history, musicology, ethnology, social and cultural anthropology, archaeology, art history, geography, sociology, mathematics, natural sciences, engineering, medical and biomedical sciences, etc.).

The Repository has multiple functions for the Designated Community:

  • enabling access to the content;
  • ensuring long-term preservation;
  • serving as a dissemination tool;
  • serving as a data source for research and education;
  • ensuring compliance with national, institutional and funder Open Access policies and FAIR principles (for Internal users).

Content

The content of the Repository includes:

  • publications (books, journal articles, conference proceedings; ca. 90%) – PDF;
  • PhD and MA theses (ca. 1%) – PDF;
  • research data – PDF (supplementary information published in journals), CSV, TIFF, OPJ;
  • posters, leaflets, CD covers, musical scores – PDF;
  • digital photographs (4%) – JPG (camera output format).

There are more than 11,000 metadata records in the Repository (as of August 2021), 99.5% of which are accompanied by data files.

All disciplines are covered, though Social Sciences and Humanities prevail (ca. 80% of the content). The content is multilingual: about 15 languages with Serbian and English prevailing.

Requirements

The Repository strives to ensure compliance with the following requirements:

  • submitted data fit into the scope of the Content policy;
  • legal issues are resolved before submission;
  • submissions are validated according to data ingest procedures;
  • data are described in line with appropriate metadata standards;
  • data, metadata and other representation information are preserved for the long-term;
  • the authenticity, integrity and reliability of data are retained.

Legal and regulatory framework

The following regulations are relevant for the operation of the Repository:


The relationship between SASA and SASA institutes (repository owners) and RCUB (service provider) is based on:


The relationship between the depositor (Producer) and the Repository is based on:

Before depositing research data and similar materials, Producers are required to clear possible copyright, privacy, ethical, data protection and other issues with their institutions, collaborators, funders and all relevant stakeholders. Any violations or malpractice are entirely the responsibility of Producers.


The relationship between Consumers and the Repository is based on:

  • legally-binding Terms of Service and
  • content licence (indicated in metadata on landing pages).

Roles and responsibilities

Access to the Repository’s Administration function is strictly limited to authorized staff. All staff involved in Repository maintenance and daily operations have well-defined roles and are familiar with the relevant policies and with their own part in implementing the preservation policy.

The participating institutions appoint a number of repository managers, who are a contact point for depositors and the team responsible for software development and technical support (at RCUB; see Organization). The work of repository managers is funded by the participating institutions (through salaries).

RCUB is responsible for hosting, regular back-up, software upgrades and development, additional features, user support and training, and the implementation of interoperability standards. RCUB has appointed a dedicated team (TRAP-RCUB) responsible for repository development. The team also serves as a steering body and defines development and preservation plans in collaboration with the representatives of SASA and SASA institutes.

Content formats

Considerable effort is invested in ensuring digital continuity of archived data, i.e. data usability over time. The needs of the Designated community, the development of technology and organizational changes are monitored in order to be able to plan and apply required actions in a timely manner.

In order to mitigate the risk of format obsolescence, file formats that have reasonable chances of remaining usable over a considerably long period of time are selected: these are the preferred formats and a limited number of accepted formats. The list of preferred and acceptable formats will be periodically revised to include new formats and remove the ones that are at risk of becoming obsolete.

The preferred formats are those file formats which are expected to warrant the best long-term usability, accessibility and sustainability. The Repository has the right to convert file formats if this is necessary to ensure permanent access to a resource.

Efforts are made to accept only file formats suitable for long-term preservation (preferably those based on open standards). However, this is not always possible, and in some cases a compromise is made in the best interest of the Designated community. Currently, priority is given to data collection and ingestion into the Repository in order to mitigate the risk of data loss. It is also important to make data available to the Designated community as soon as possible. A number of PDF files obtained from publishers or by scanning do not conform to the PDF/A standard, but they are still accepted because converting them to a preferred format before ingestion would delay both ingestion into the Repository (increasing the risk of data loss) and access to data. These files will be converted to a preferred format (PDF/A) by TRAP-RCUB as part of format normalization.

Insufficiently documented proprietary file formats may be accepted only exceptionally, when it is not possible to convert files to a preferred format without compromising data integrity, or in cases when it is necessary to capture and archive data that have already been published elsewhere. Efforts are made to limit the number of such cases as much as possible. Collaboration with the original data creators is necessary to convert these files to preferred formats. In such cases, the Repository cannot warrant long-term preservation.
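
The three acceptance tiers described above (preferred, accepted, exceptional) can be illustrated with a simple lookup. This is only a sketch: the extension sets below are hypothetical placeholders built from formats named in the Content section, not the Repository's official lists, and PDF/A conformance cannot be judged from a file extension alone.

```python
from pathlib import PurePath

# Hypothetical lists for illustration only; the official lists of
# preferred and accepted formats are maintained by the Repository.
PREFERRED = {".csv", ".tif", ".tiff"}
ACCEPTED = {".pdf", ".jpg", ".jpeg"}


def classify_format(filename: str) -> str:
    """Return 'preferred', 'accepted' or 'exceptional' for a file name."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in PREFERRED:
        return "preferred"
    if suffix in ACCEPTED:
        return "accepted"
    # Anything else (e.g. an insufficiently documented proprietary format)
    # would need a case-by-case decision by a repository manager.
    return "exceptional"
```

In practice such a check would only flag files for review; the decision to accept an exceptional format remains with the repository managers.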

In order to be deposited, content must be in a digital format and in a completed state (not a dynamic document). If changes need to be made, then the changed file shall be deposited as a new document.

The Repository does not acquire or preserve the software used to create and open the deposited data. Software tools necessary to open and view the greatest part of the deposited data are widely available to the Designated community (PDF readers, web browsers). In case of less common formats (those used by specific parts of the Designated community), the information about the software required to view data is provided in the metadata.

Properties to be preserved

In order to define what should be preserved, for each data type, it is necessary to establish the "intended use" by the Designated community. For the greatest part of the Repository content, the "intended use" is currently limited to reading the publication text in much the same way as print text is read. Due to this, even scanned non-OCRed files are acceptable, and it is important to preserve the publication’s visual appearance and enable human reading of the content.

It is reasonable to expect that intended uses will change, e.g. towards text mining and data extraction. The emerging needs of the Designated community are monitored through feedback information (Internal users, Associates, External users), training (Internal users), and communication between repository managers and Internal users and Associates. The Plan will be dynamically revised to address identified changes.

Data integrity and authenticity

Only Internal users and Associates can submit data to the Repository. SASA and SASA institutes are responsible for verifying user identities. Provenance information is saved for each item. Once an item is approved, only repository managers are able to change the metadata and data. Submissions are reviewed by qualified staff to ensure metadata quality and completeness, compliance with data format, best-practice and preservation requirements, and data integrity and quality, and to resolve potential legal issues. Changes to submitted items (Submission Information Package – SIP) and approved items (Archival Information Package – AIP) by Producers and Consumers are not supported. If necessary, Producers may deposit a new version. Each version is assigned a unique and persistent identifier (Handle). Relations between versions are established in the metadata.

DSpace ensures the integrity of both data and metadata over time regardless of possible changes in the physical storage media. To verify that a digital object has not been altered or corrupted, the Repository periodically checks the integrity of the data. The checks include the verification of MD5 checksums and metadata integrity, and testing that URLs are working.
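
A checksum-based fixity check of the kind mentioned above can be sketched as follows. This is a minimal illustration using Python's standard library, not the checker that DSpace itself runs; the function names are hypothetical, and the recorded checksum is assumed to be the hexadecimal MD5 digest stored when the file was ingested.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, recorded_md5: str) -> bool:
    """True if the file's current checksum matches the recorded one."""
    return md5_of_file(path) == recorded_md5.lower()
```

A mismatch indicates that the bitstream has changed since the checksum was recorded, which would typically trigger restoration from a backup copy.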

Security

In the context of the Repository, the security and preservation measures apply to:

  • Metadata;
  • Bitstreams (data files, thumbnails, TXT files with extracted text, Distribution license text);
  • Repository software (DSpace) and its configuration;
  • Custom-made applications (Ellena, APP, NomadLite, ReportMaker) developed by RCUB;
  • Operating system, configuration, etc.;
  • Backups.

The Repository is committed to taking all necessary precautions to ensure the physical safety and security of the data it preserves. Based on an SLA, these responsibilities are entrusted to RCUB (outline of the SLA).

Sustainability plans and funding

According to the law, SASA is the national academy and the most prominent scholarly institution in Serbia. The institutes are independent legal entities, but their work is closely tied to the mission and activities of SASA (e.g. joint projects, co-publishing projects, joint conferences, etc.). Reliable and secure archiving, long-term preservation and wide dissemination are in line with the mission and the role of SASA, SASA institutes (participating institutions), and RCUB (service provider). The current level of funding is sufficient to maintain and develop DAIS. Development and maintenance, as well as data security, are ensured through an SLA with RCUB.

SASA and SASA institutes are able to preserve data access in case of unexpected budget cuts. The Repository is easy to keep running and service costs are not high. All repository managers are employed under regular contracts at participating institutions and their activities related to repository management do not incur any additional cost. The SLA with RCUB foresees a Post-Cancellation Service Time, i.e. a period after the termination of the SLA during which the Repository will remain available with minimum maintenance services provided. Accordingly, even in case of funding disruption, the services will be kept running, providing sufficient time to find a sustainable solution. Even if funding ceased, it would be possible to keep the Repository running for at least five years.

Preservation Policy

Implementing the Preservation Strategy

The Preservation policy relies on the main functional concepts of the Open Archival Information System (OAIS) reference model for digital preservation environments and the FAIR principles. Preservation decisions are made taking into consideration the Repository’s mission, Content policy, legal constraints and available human, technical and financial resources.

Figure: OAIS Reference Model, source: Wikimedia Commons, CC BY-SA 4.0 International

Ingest function

Ingest is the first functional component of the OAIS reference model. According to the model, at this stage, information (Submission Information Package – SIP) is received from the Producer and checked to validate that it is complete (see Submission). To make the Ingest phase more efficient, detailed guidelines and training are provided.

Figure: DSpace ingest process, source: DSpace 5.x Documentation, CC BY 4.0 International

Repository managers perform quality control in line with data processing protocols. Although in most cases the submission will be stored in the original format, the final content of the SIP may be negotiated between the Producer and the Repository. If necessary, the repository manager may convert files to a preferred format to ensure long-term preservation and accessibility. In such cases, the original data will usually be preserved but will not be available to Consumers. The only exceptions are poor-quality PDFs obtained by scanning publications, which will be entirely removed and replaced with files of better quality (if available). Format migration may also occur at a later stage, during the Data Management phase.

Only the approved data and metadata (subject to at least basic curation) will be published. After publishing, a set of automated actions is launched: a PID (Handle) is assigned, readable text is extracted from data files (for PDFs) into a TXT file and included in the search index, and a thumbnail for the landing page is generated (for PDFs and image files). The version resulting from the ingest process is an Archival Information Package (AIP).

Data may also be ingested via a semi-automated procedure, using the custom-made application Ellena. DAIS does not use the native DSpace batch item importer. It is not disabled, but its use is discouraged because Ellena offers better functionality.

Archival storage function

The purpose of archival storage is to ensure that the package resulting from the ingest phase remains unchanged and accessible. The AIPs resulting from the ingest phase are added to the permanent storage facility, and the storage process is monitored.

In DSpace, AIPs are only generated for objects which are currently in the "in archive" state. Uncompleted submissions are not described in AIPs, which means that they cannot be restored after a disaster. Permanently removed objects will no longer be available as AIPs after removal. Withdrawn objects will still be available as AIPs.

Based on an SLA, the responsibilities related to this function are entrusted to RCUB.

Data management function

The Data management function involves the maintenance of the databases of descriptive metadata and the management of administrative metadata. At this stage, the following alterations are allowed:

  • changes to metadata: primarily metadata normalization, enrichment and minor metadata corrections – the change is documented in the administrative metadata; data remain unchanged; PID is retained. Metadata normalization and enrichment is supported by a number of custom-made tools (Ellena, NomadLite, ReportMaker);
  • format conversion with the aim of ensuring long-term preservation – the original file is retained in the repository, but it is not available to Consumers;
  • changes to data – the changed data are deposited as a new version and are assigned a new PID; links between various versions are established in descriptive metadata.

Content may be removed from the Repository only in exceptional circumstances by:

  • withdrawing an AIP (retained in the Repository but removed from public view) in case of:
    • proven copyright violation;
    • plagiarism;
    • falsified research;
    • research containing major errors;
    • threat to national security;
  • deleting AIPs, in case of technical errors and unintended duplicates; permanently removed objects will no longer be available as AIPs after removal.

In both cases, PIDs and URLs will be retained permanently.
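
The difference between withdrawal and deletion can be summarized in a small state model. The sketch below is illustrative only: the class and function names are hypothetical and do not correspond to the DSpace API, and the handle used in examples is a placeholder.

```python
from dataclasses import dataclass
from enum import Enum


class ItemState(Enum):
    ARCHIVED = "archived"
    WITHDRAWN = "withdrawn"  # kept in the Repository, hidden from public view
    DELETED = "deleted"      # permanently removed (technical errors, duplicates)


@dataclass
class Item:
    handle: str  # PID; retained permanently in every state
    state: ItemState = ItemState.ARCHIVED

    def has_aip(self) -> bool:
        """AIPs exist for archived and withdrawn items, not for deleted ones."""
        return self.state is not ItemState.DELETED

    def publicly_visible(self) -> bool:
        """Only archived items are shown to Consumers."""
        return self.state is ItemState.ARCHIVED


def withdraw(item: Item) -> None:
    """Hide the item from public view; the AIP and the PID are kept."""
    item.state = ItemState.WITHDRAWN


def delete(item: Item) -> None:
    """Remove the item's content; only the PID (and its URL) is kept."""
    item.state = ItemState.DELETED
```

For example, after `withdraw(item)` the item still has an AIP but is no longer publicly visible, while after `delete(item)` only the handle remains.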

Access function

The Access function ensures that Dissemination Information Packages (DIPs) are visible to Consumers. DIPs are derived from AIPs: the data are the same as in AIPs, but the Deposition licence and the TXT file with extracted text are not included in DIPs.
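
Deriving a DIP from an AIP amounts to filtering out the components that are not exposed to Consumers. The sketch below assumes that bitstreams are grouped under DSpace-style bundle names (ORIGINAL, TEXT, THUMBNAIL, LICENSE); the type and function names are hypothetical, and the actual bundle configuration of the Repository may differ.

```python
from typing import NamedTuple


class Bitstream(NamedTuple):
    name: str
    bundle: str  # assumed DSpace-style bundle name, e.g. ORIGINAL, TEXT


# Bundles left out of the DIP: the licence text and the extracted-text file.
EXCLUDED_FROM_DIP = {"LICENSE", "TEXT"}


def derive_dip(aip: list[Bitstream]) -> list[Bitstream]:
    """Return the bitstreams of an AIP that are exposed to Consumers."""
    return [b for b in aip if b.bundle not in EXCLUDED_FROM_DIP]
```

The data files themselves pass through unchanged, which matches the statement that the data in a DIP are the same as in the AIP.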

Consumers interact with the Repository to find and receive data, or to request access to data, in case the content is not Open Access. These processes are web-based and are facilitated by a bilingual interface (Serbian and English). Apart from the Repository’s browse and search functionalities, the process of finding data is additionally supported by a custom-made application APP.

The Access function also ensures the security related to access.

Administration function

The Administration function manages the day-to-day operations of the Repository and coordinates all the other functions. It covers the application of fundamental rules and policies and the operation of the technical infrastructure: maintaining, changing and securing the software and hardware, and documenting preservation steps. Actions in the Administration phase include automated checks of data integrity, namely the verification of MD5 checksums and metadata integrity, and testing that URLs are working. The generated reports help identify possible issues relating to long-term preservation.

Preservation principles: FAIR

The Repository aims to operate according to the FAIR principles. All preservation actions are focused on ensuring that data are Findable, Accessible, Interoperable and Reusable.

Findable

  • Upon approval, items are assigned PIDs (Handles);
  • Metadata conform to the Qualified Dublin Core Schema and can be exported through the user interface in BibTeX and RIS formats;
  • Metadata are embedded in HTML meta tags, which improves discoverability through search engines;
  • The sitemap feature in DSpace is enabled;
  • Metadata records can be harvested through the OAI-PMH interface, which makes the Repository content visible in various aggregators and discovery services.
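
To illustrate how an aggregator might harvest metadata over OAI-PMH, the sketch below builds a `ListRecords` request and extracts Dublin Core titles from a response. The base URL in the usage note is a placeholder rather than the Repository's actual endpoint, the function names are hypothetical, and resumption tokens and error handling are left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the simple Dublin Core elements used in oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_titles(response_xml: str) -> list[str]:
    """Collect dc:title values from an OAI-PMH ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text or "" for el in root.iter(f"{DC_NS}title")]
```

A harvester would fetch `list_records_url("https://repository.example.org/oai/request")` (placeholder address), pass the response body to `extract_titles`, and repeat the request with a resumption token until the full set has been retrieved.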

Accessible

  • Metadata are openly accessible, without authentication, through the user interface or by using the open OAI-PMH protocol. Metadata are distributed under the CC0 license.
  • Data in Open Access are available to users without authentication.
  • Restricted data are usually accessible to Internal users after logging in to the system. Associates and External users may request access; if granted, access is provided outside the Repository.

Interoperable

  • The metadata are mapped to Dublin Core, which is a formal, accessible, and broadly applicable language for knowledge representation.
  • The Dublin Core “Relation” metadata element is used to show various types of relations between data (e.g. IsVersionOf, IsPartOf, HasPart, IsReferencedBy, IsReplacedBy, Replaces, IsRequiredBy, Requires, etc.).
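
Many of the relation types listed above come in inverse pairs in the DCMI vocabulary, so a link recorded on one item implies a mirrored link on the other. The sketch below shows that pairing; whether the Repository records both directions is an assumption, and the function name and handles are hypothetical.

```python
# Inverse pairs among the DCMI relation refinements.
_PAIRS = [
    ("isVersionOf", "hasVersion"),
    ("isPartOf", "hasPart"),
    ("isReferencedBy", "references"),
    ("isReplacedBy", "replaces"),
    ("isRequiredBy", "requires"),
]
# Lookup table that works in both directions.
INVERSE = {a: b for a, b in _PAIRS} | {b: a for a, b in _PAIRS}


def reciprocal(subject: str, relation: str, target: str) -> tuple[str, str, str]:
    """Return the statement that mirrors (subject, relation, target)."""
    return (target, INVERSE[relation], subject)
```

For instance, if a new version `00000/2` is described as `isVersionOf` `00000/1`, the mirrored statement says that `00000/1` `hasVersion` `00000/2`.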

Reusable

  • Depositors are encouraged to describe their data in great detail. Metadata may be enriched during the Ingest and Data Management phases.
  • To enable proper reuse, data are always released with clear conditions of use (a license); license information is provided in item metadata.
  • Once data are published, only repository managers can make changes, keeping control of the authenticity.
  • Fixity checks (MD5 checksums) are performed to verify that data have not been altered or corrupted.


Version 1, August 2021