Tuesday, 17 December 2013

Updating my requirements

Last week I published my digital preservation Christmas wishlist. A bit tongue-in-cheek really, but I saw it as my homework in advance of the latest Digital Preservation Coalition (DPC) day on Friday, which was specifically about articulating requirements for digital preservation systems.

This turned out to be a very timely and incredibly useful event. Along with many other digital preservation practitioners, I am currently thinking about what I really need a digital preservation system to do and which systems and software might be able to help.

Angela Dappert from the DPC started off the day with a very useful summary of requirements gathering methodology. I have since returned to my list and tidied it up a bit to get my requirements in line with her SMART framework – specific, measurable, attainable, relevant and time-bound. I also realised that by focusing on the framework of the OAIS model I have omitted some of the ‘non-functional’ requirements that are essential to having a working system – requirements related to the quality of a service, its reliability and performance for example.

As Carl Wilson of the Open Planets Foundation (OPF) mentioned, it can be quite hard to create sensible measurable requirements for digital preservation when we are talking about time frames which are so far in the future. How do we measure the fact that a particular digital object will still be readable in 50 years' time? In digital preservation we regularly use phrases such as ‘always’, ‘forever’ and ‘in perpetuity’. Use of these terms in a requirements document inevitably leads us to requirements that cannot be tested and this can be problematic.

I was interested to hear Carl describing requirements as being primarily about communication: communication with your colleagues and communication with the software vendors or developers. This idea tallies well with the thoughts I voiced last week. One of my main drivers for getting the requirements down in writing was to communicate my ideas to colleagues and stakeholders.

The Service Providers Forum at the end of the morning with representatives from Ex Libris, Tessella, Arkivum, Archivematica, Keep Solutions and the OPF was incredibly useful. Just hearing a little bit about each of the products and services on offer and some of the history behind their creation was interesting. There was lots of talk about community and the benefits of adopting a solution that other people are also using. Individual digital preservation tools have communities that grow around them and feed into their development. Ed Fay (soon to be of the OPF) made an important point that the wider digital preservation community is as important as the digital preservation solution that you adopt. Digital preservation is still not a solved problem. The community is where standards and best practice come from and these are still evolving outside of the arena of digital preservation vendors and service providers.

Following on from this discussion about community there was further talk about how useful it is for organisations to share their requirements. Are one organisation's needs going to differ wildly from another's? There is likely to be a core set of digital preservation requirements that will be relevant for most organisations.

Also discussed was how we best compare the range of digital preservation software and solutions that are available. This can be hard to do when each vendor markets themselves or describes their product in a different way. Having a grid in which we can compare products against a baseline of requirements would be incredibly useful. Something like the excellent tool grid provided by POWRR, but with a higher level of detail in the criteria used, would be good.

I am not surprised that after spending a day learning about requirements gathering I now feel the need to go back and review my previous list. I was comforted by the fact that Maite Braud from Tessella stated that “requirements are never right first time round” and Susan Corrigall from the National Records of Scotland informed us that requirements gathering exercises can take months and will often go through many iterations before they are complete. Going back to the drawing board is not such a bad thing.

Wednesday, 11 December 2013

My digital preservation Christmas wish list

All I want for Christmas is a digital archive.

Image by paparutzi on Flickr, CC BY 2.0

Since I started at the Borthwick Institute for Archives I have been keen to adopt a digital preservation solution. Up until this point, exploratory work on the digital archive has been overtaken by other priorities, perhaps the most important of these being an audit of digital data held at the Borthwick and an audit of research data management practices across the University. The outcome is clear to me – we hold a lot of data and if we are to manage this data effectively over time, a digital archiving system is required.

In a talk at the SPRUCE end of project workshop a couple of weeks ago both Ed Fay and Chris Fryer spoke about the importance of the language that we use when we talk about digital archiving. This is a known problem for the digital preservation community and one I have myself come up against on a number of different levels.

In an institution relatively new to digital preservation the term ‘digital archiving’ can mean a variety of different things and on the most basic IT level it implies static storage, a conceptual box we can put data in, a place where we put data when we have finished using it, a place where data will be stored but no longer maintained.

Those of us who work in digital preservation have a different understanding of digital archiving. We see digital archiving as the continuous active management of our digital assets, the curation of data over its whole life cycle, the systems that ensure data remains not only preserved, but fit for reuse over the long term. Digital archiving is more than just storage and needs to encompass activities as described within the Open Archival Information System reference model such as preservation planning and data management. Storage should be seen as just one part of a digital preservation solution.

To this end, and to inform discussions about what digital preservation really is, I pulled together a list of digital preservation requirements which any digital preservation system or software should be assessed against. This became my wish list for a digital preservation system. I do not really expect to have a system such as this unwrapped and ready to go on Christmas morning this year, but maybe some time in the future!

In order to create this list of requirements I looked at the OAIS reference model and the main functional entities within this model. The list below is structured around these entities. 

I also bravely revisited ISO 16363: Audit and Certification of Trustworthy Digital Repositories. This is the key (and most rigorous) certification route for those organisations that would like to become Trusted Digital Repositories. It goes into great detail about some of the activities which should be taking place within a digital archive, and many of these are processes which would be most effectively carried out by an automated system built into the software or system on which the digital archive runs.

This list of requirements I have come up with has a slightly different emphasis from other lists of this nature due to the omission of the OAIS entity for Access. 

Access should be a key part of any digital archive. What is the point of preserving information if we are not going to allow others to access it at some point down the line? However, at York we already have an established system for providing access to digital data in the shape of York Digital Library. Any digital preservation system we adopt would need to build on and work alongside this existing repository, not replace it.

Functional requirements for access have also been well articulated by colleagues at Leeds University as part of their RoaDMaP project and I was keen not to duplicate effort here.

As well as helping to articulate what I actually mean when I talk about my hypothetical ‘digital archive’, one of the purposes of this is to provide a grid for comparing the functionality of different digital preservation systems and software.
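
A minimal sketch of how such a comparison grid might be held and summarised is shown here. The requirement wording is shortened from the list in this post; the system names and every score are invented purely for illustration and do not describe any real product.

```python
# Hypothetical comparison grid. The requirement texts paraphrase items from
# the requirements list; "System A"/"System B" and their scores are made up.
REQUIREMENTS = [
    "Generates persistent, unique internal identifiers",
    "Supports the PREMIS metadata schema",
    "Monitors integrity of digital objects using checksums",
]

# One entry per requirement, in the same order as REQUIREMENTS.
# True = met, False = not met, None = not yet assessed.
GRID = {
    "System A": [True, True, None],
    "System B": [True, False, True],
}


def summarise(grid):
    """Return {system: (met, not_met, not_assessed)} counts."""
    summary = {}
    for system, results in grid.items():
        met = sum(1 for r in results if r is True)
        not_met = sum(1 for r in results if r is False)
        not_assessed = sum(1 for r in results if r is None)
        summary[system] = (met, not_met, not_assessed)
    return summary


def gaps(requirements, grid, system):
    """List the requirements a system fails or has not been assessed against."""
    return [
        req for req, result in zip(requirements, grid[system])
        if result is not True
    ]


if __name__ == "__main__":
    for system, counts in summarise(GRID).items():
        print(system, counts, gaps(REQUIREMENTS, GRID, system))
```

Keeping "not yet assessed" separate from "not met" matters in a grid like this, because a gap in our knowledge of a product is not the same as a gap in the product.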

Thanks to Julie Allinson and Chris Fryer for providing comment thus far. Chris's excellent case study for the SPRUCE project helped inform this exercise.

My requirements are listed below. Feedback is most welcome.


The digital archive will enable us to record/store administrative information relating to the Submission Information Package (information and correspondence relating to receipt of the SIP)
The digital archive will include a means for recording decisions regarding selection/retention/disposal of material from the Submission Information Package
The digital archive will be able to identify and characterise data objects (where appropriate tools exist)
The digital archive will be able to validate files (where appropriate tools exist)
The digital archive will support automated extraction of metadata from files
The digital archive will incorporate virus checking as part of the ingest process
The digital archive will be able to record the presence and location of related physical material

The digital archive will generate persistent, unique internal identifiers
The digital archive will ensure that preservation description information (PDI) is persistently associated with the relevant content information. The relationship between a file and its metadata/documentation should be permanent
The digital archive will support the PREMIS metadata schema and use it to store preservation metadata
The digital archive will enable us to describe data at different levels of granularity – for example metadata could be attached to a collection, a group of files or an individual file
The digital archive will accurately record and maintain relationships between different representations of a file (for example, from submitted originals to dissemination and preservation versions and subsequent migrations)
The digital archive must store technical metadata extracted from files (for example that created as part of the ingest process)
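
To make the requirements about identifiers, granularity and relationships a little more concrete, here is a rough sketch. It assumes random UUIDs as the internal identifiers and a simple `derived_from` link between representations; both are illustrative choices rather than anything the requirements prescribe, and this is not an implementation of PREMIS.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArchiveObject:
    """Hypothetical record for anything the archive describes."""
    label: str
    level: str  # "collection", "group" or "file": metadata can attach at any level
    metadata: dict = field(default_factory=dict)
    # Assumption: a random UUID serves as the persistent internal identifier.
    identifier: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Identifier of the representation this one was created from, if any.
    derived_from: Optional[str] = None


def derive(original, label, **metadata):
    """Create a new representation (e.g. a migration) linked to its source."""
    return ArchiveObject(
        label=label,
        level=original.level,
        metadata=dict(metadata),
        derived_from=original.identifier,
    )


if __name__ == "__main__":
    submitted = ArchiveObject("report.doc", "file", {"role": "submitted original"})
    preserved = derive(submitted, "report.pdf", role="preservation version")
    print(preserved.identifier, "derived from", preserved.derived_from)
```

The point of the sketch is only that each representation keeps its own identifier and its own metadata, while the link back to the original is stored with the object rather than inferred from file names.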

The digital archive will allow preservation plans (such as file migration or refreshment) to be enacted on individual files or groups of files
Automated checking of significant properties of files will be carried out post-migration to ensure they are preserved (where tools exist)
The digital archive will record actions, migrations and administrative processes that occur whilst the digital objects are contained within the digital archive

The digital archive will allow for disposal of data where appropriate. A record must be kept of this data and when disposal occurred
The digital archive will have reporting capabilities so statistics can be gathered on numbers of files, types of files etc.

The digital archive will actively monitor the integrity of digital objects with the use of checksums
Where problems of data loss or corruption occur, the digital archive will have a reporting/notification system to prompt appropriate action
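
The two integrity requirements above could be sketched along the following lines. This assumes SHA-256 checksums recorded at ingest in a simple path-to-checksum manifest; the function names and the manifest structure are hypothetical, and a real system would also need scheduling and a proper notification mechanism.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest):
    """Compare files against the checksums recorded for them.

    `manifest` maps a file path to the checksum stored at ingest.
    Returns a list of (path, problem) tuples; an empty list means no
    loss or corruption was detected.
    """
    problems = []
    for path, expected in manifest.items():
        if not Path(path).is_file():
            problems.append((path, "missing"))
        elif sha256_of(path) != expected:
            problems.append((path, "checksum mismatch"))
    return problems


def report(problems):
    """Stand-in for a notification system: just print what needs attention."""
    for path, problem in problems:
        print(f"ACTION NEEDED: {path}: {problem}")
```

Run on a regular schedule, a check like this is what turns static storage into something actively monitored: the checksum itself does nothing unless something recomputes it and someone is told when it no longer matches.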

The digital archive will be able to connect to, and support a range of storage systems

The digital archive will be compliant with the Open Archival Information System (OAIS) reference model
The digital archive will integrate with the access system/repository
The digital archive will have APIs or other services for integrating with other systems
The digital archive will be able to incorporate new digital preservation tools (for migration, file validation, characterisation etc.) as they become available
The digital archive will include functionality for extracting and exporting the data and associated metadata in standards-compliant formats
The software or system chosen for the digital archive will be supported and technical help will be available
The software or system chosen for the digital archive will be under active development