Friday, 25 April 2014

How does Archivematica meet my requirements?

It seems a long time ago that I first blogged about my failed attempts to install Archivematica. This is probably because it *was* quite a long time ago... other priorities had a habit of getting in the way!

With the help of a colleague (more technically able than I) I've now had a chance to see the new version of Archivematica. I have been assured that Archivematica version 1.0 is easier to install than its predecessors, so that is good news!

Any decent digital preservation system is going to have to be pretty complex in order to carry out the required tasks and workflows, so assessing products such as this one is not something that can be done in one sitting.

As well as playing with the software itself, I've watched the video, I've signed up to the mailing list and I'm talking to others who are using it. A recent 'Technology Bytes' webinar hosted by the DPC (Digital Preservation Coalition) also helped me find out more. Artefactual Systems (who support and develop the software) have been really helpful in answering all of my many awkward questions.

In a more recent blog post I talked about my digital preservation requirements, so one of the things I've been trying to do as I've been looking at Archivematica is to see whether it could meet them.

Below is a list of my requirements again (possibly slightly altered since the last time I published them) and an assessment of Archivematica against them.

It does seem to be a pretty good match. It is worth noting that any digital preservation system we implement will be just one part of a wider technical infrastructure for data management (one that will also include a deposit workflow, data storage and an access system). Some of the functionality within my requirements could doubtless be fulfilled elsewhere within that infrastructure, so I am not too concerned that we do not have a clear 'Yes' on all of these requirements. Where there are bits of functionality that we really do need Archivematica to perform, we have the option of either building it ourselves or sponsoring Artefactual Systems to develop it for us and for the wider user community.

It is encouraging to see just how many developments are being sponsored at the moment and how many organisations are involved in this process.

It is also worth noting that while Archivematica is free and open source, Artefactual Systems are always keen to state that it is free as in 'free kittens' - time and money need to go into looking after it, feeding it and taking it to the vet. There will clearly always be some element of cost involved in implementing an open source system that needs to be configured and integrated with existing systems.

Just to end with one very interesting piece of information that was mentioned in the Technology Bytes webinar:

Archivematica runs lots of microservices as part of the ingest and preservation workflow. You can configure it in various ways but there are a couple of points where the system waits for instructions from an administrator before proceeding with an operation. I was very interested to learn that one Archivematica user has configured his system to bypass these prompts for human interaction and has it set up as a fully automated workflow for a particular set of content.

Am I scared that this development might put digital archivists such as me out of a job? ...only a little bit.

Am I excited by the opportunities to automate many of the repetitive and previously manual processes that digital archivists can spend a lot of time doing? ...very much so!

For each requirement below, I note whether Archivematica meets it.

The digital archive will enable us to store administrative information relating to the Submission Information Package (information and correspondence relating to receipt of the SIP)
Yes – a transfer can be made with submission documentation and this will be preserved within the AIP. Note that submission information as described in the Archivematica wiki can be “donor agreements, transfer forms, copyright agreements and any correspondence or other documentation relating to the transfer”. Any SIPs generated will automatically include copies of this information too. We do need to establish where the best place to store supporting information is within our technical architecture.
The digital archive will include a means for recording appraisal decisions relating to the Submission Information Package and individual elements within it
No – appears to be out of scope for Archivematica but as we are not considering using this system in isolation, this information may be best stored elsewhere within the technical infrastructure.
The digital archive will be able to identify and characterise data objects (where appropriate tools exist)
Yes – this is an automated process. Uses FITS (Bundles file utility, ffident, DROID, JHOVE, FIDO, Tika, mediainfo). Output is stored in the METS and PREMIS XML within the AIP. New tools for identification will be included in future releases of Archivematica, and there is also the option for users of the system to add their own tools via the Format Policy Registry.
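Tools such as DROID and FIDO identify formats by matching byte-level signatures against the PRONOM registry. As a much-simplified illustration of what the identification step produces, here is an extension-based guess using only the Python standard library (a real tool inspects file contents, not names - treat this purely as a sketch of the step's inputs and outputs):

```python
import mimetypes

def identify(path):
    """Very rough format identification by file extension.

    DROID/FIDO match signature bytes against PRONOM; this stdlib
    extension lookup is only a stand-in for illustration.
    """
    mime, _ = mimetypes.guess_type(path)
    # Fall back to the generic binary type when the extension is unknown.
    return mime or "application/octet-stream"
```

A signature-based identifier would still recognise a PDF that had been renamed `report.dat`, which is exactly why extension guessing is not good enough for preservation work.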
The digital archive will be able to validate files (where appropriate tools exist)
Yes – JHOVE is part of the package and output from JHOVE is stored in the METS and PREMIS XML within the AIP
The digital archive will support automated extraction of metadata from files
Yes – Tika is part of the package and output is stored in the METS and PREMIS XML within the AIP
The digital archive will virus check files on ingest
Yes – ClamAV is part of the package and information about virus checking is included within the PREMIS and METS XML. If a virus is detected within a file, it will be sent to the ‘failed’ directory and all processing on that SIP will stop until the problem is resolved by an administrator
The digital archive will be able to record the presence and location of related physical material
No – this is out of scope for Archivematica but we would be able to store this metadata within Fedora

The digital archive will generate persistent, unique internal identifiers
Yes – a unique internal identifier (a UUID) is generated, incorporated into filenames and stored in the METS.xml for both packages and digital objects.
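Archivematica's identifiers are UUIDs. A minimal sketch of generating one and embedding it in a stored filename (the exact naming convention below is my own illustration, not Archivematica's actual scheme):

```python
import uuid

def assign_identifier(original_name):
    # Generate a random (version 4) UUID as the persistent identifier.
    file_uuid = str(uuid.uuid4())
    # Embed it in the stored filename so the object and its identifier
    # travel together; the same UUID would also be recorded in the METS.
    stored_name = f"{original_name}-{file_uuid}"
    return file_uuid, stored_name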
The digital archive will ensure that preservation description information (PDI) is persistently associated with the relevant content information. The relationship between a file and its metadata/documentation must be permanent
Yes – any documentation that is included in the SIP will be included in the AIP. All technical and preservation metadata generated by Archivematica will also be wrapped up in the AIP.
The digital archive will support the PREMIS metadata schema and use it to store preservation metadata
Yes – creates and stores PREMIS/METS as part of the ingest process and as preservation actions are carried out. This XML is stored within the AIP
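To give a flavour of what gets written, here is a bare-bones PREMIS 2 event built with the Python standard library. Archivematica's real records are far fuller (event identifiers, linked agents, outcomes) and sit inside a METS wrapper, so treat this as a simplified sketch of the kind of XML involved:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

PREMIS = "info:lc/xmlns/premis-v2"  # PREMIS version 2 namespace
ET.register_namespace("premis", PREMIS)

def premis_event(event_type, detail):
    """Build a minimal PREMIS event element (eventType, eventDateTime,
    eventDetail). A full record would carry identifiers, agents and
    outcome information as well."""
    ev = ET.Element(f"{{{PREMIS}}}event")
    ET.SubElement(ev, f"{{{PREMIS}}}eventType").text = event_type
    ET.SubElement(ev, f"{{{PREMIS}}}eventDateTime").text = (
        datetime.now(timezone.utc).isoformat()
    )
    ET.SubElement(ev, f"{{{PREMIS}}}eventDetail").text = detail
    return ev
```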
The digital archive will enable us to describe data at different levels of granularity – for example metadata may be attached to a collection, a group of files or an individual file
Partial – Preservation and technical metadata are generated at file level. Descriptive (Dublin Core) metadata appears to be only at project/collection level. If we require more detailed or granular metadata this will be stored elsewhere within the technical architecture.
The digital archive will accurately record and maintain relationships between different representations of a file (for example, from submitted originals to dissemination and preservation versions that will be created over time)
Yes – this is very much a part of the system. This is achieved using a unique identifier which is allocated to a submitted file, and included in any subsequent representations that are created
The digital archive will store technical metadata extracted from files (for example that is created as part of the ingest process)
Yes – very comprehensive technical metadata including details of all of the tools used are stored as part of the AIP

The digital archive will allow preservation plans (such as file migration or refreshment) to be enacted on individual or groups of files.
Partial(?) – on ingest, rules are in place to normalise files (migrate them) to different formats as appropriate for preservation/dissemination. These rules can be updated to meet local needs.

Need to explore how these rules can be run on all files of a certain type within the archive. Artefactual Systems report that a new AIP re-ingest feature will fulfil this need.
Automated checking of significant properties of files will be carried out post-migration to ensure these properties are adequately preserved (where appropriate tools exist).
Partial – the default format policy choices are based on a comprehensive analysis of the significant properties of sample files, as well as tests of many tools. The results of these tests are publicly available on the wiki. Archivematica users are able to run their own tests using other migration tools and, if these are thought to adequately preserve significant properties, add them to the system to serve local needs.
The digital archive will record actions, migrations and administrative processes that occur whilst the digital objects are contained within the digital archive
Yes – detailed information (in PREMIS and METS format) is stored within the AIP. The AIP keeps various logs which are gathered throughout the ingest process. Where migrations are carried out manually, PREMIS metadata can be added; this is a new feature in the 1.1 release. Note that it does assume a one-to-one relationship between original and migrated file, which may not always be the case.

The digital archive will allow for disposal of data where appropriate.
Partial – it is possible to delete an AIP and record a reason, but file-level deletions within an AIP are not supported. The system deliberately makes deletions difficult; they can only be carried out by administrative users
A record must be kept of data disposal including what was disposed of, when it was disposed of and reasons for disposal.
Yes – It is possible to set a reason for deletion in Archivematica and this will be visible to the storage service administrator. Disposal decisions may be best recorded elsewhere within the infrastructure (Fedora/AtoM)
The digital archive will have reporting capabilities so statistics can be collated. For example it would be useful to be able to report on numbers of files, types of files, size of files, preservation actions carried out
No – This may be something we have to set up ourselves using the MySQL data that sits behind the system.

Artefactual Systems are keen that better reporting capabilities are sponsored in future releases of the software.

The digital archive will actively monitor the integrity of digital objects on a regular and automated schedule with the use of checksums
No – Checksums are generated by Archivematica and stored as part of the AIP but integrity checking is not performed. There is a plan to include active fixity checking in a future release of Archivematica, but in the meantime this could be carried out somewhere else within the technical infrastructure.
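In the meantime, scheduled fixity checking is straightforward to script against the checksums Archivematica records. A minimal sketch, assuming the stored checksums have been pulled out into a plain path-to-checksum mapping (not Archivematica's actual storage layout):

```python
import hashlib
import os

def sha256(path):
    """Stream a file through SHA-256 so large objects don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def check_fixity(manifest):
    """Compare checksums recorded at ingest against freshly computed ones.

    `manifest` maps file path -> recorded checksum. Returns the paths
    that have gone missing or whose current checksum no longer matches.
    """
    failures = []
    for path, recorded in manifest.items():
        if not os.path.exists(path) or sha256(path) != recorded:
            failures.append(path)
    return failures
```

Run from a scheduler (cron, say) against the archival storage area, anything returned by `check_fixity` would feed the notification and restore-from-backup processes the next requirement asks for.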
Where problems of data loss or corruption occur, The digital archive will have a reporting/notification system to prompt appropriate action
No – this is out of scope for Archivematica. The archival storage module will need to carry out integrity checking and a notification system (or automatic restore from backup) will need to be in place to guard against data loss.
The digital archive will be able to connect to, and support a range of storage systems
Yes – a number of different storage options can be configured within Archivematica and it is possible to have several different options depending on the nature of the data.

The digital archive will be compliant with the Open Archival Information System (OAIS) reference model
Yes – the design of Archivematica was created with OAIS in mind. The GUI leads you through the relevant OAIS functional entities and the language used throughout the application is consistent with that used within the OAIS reference model
The digital archive will integrate with our Fedora repository
Partial – Fedora is not directly supported but this may be something we can configure ourselves. Artefactual Systems are working with related systems (Islandora) which will go a little way towards Fedora integration.
The digital archive will integrate with our archival management system (AtoM)
Yes – Archivematica and AtoM are both supported by Artefactual systems and are designed to complement each other. AtoM is the recommended access front end to Archivematica
The digital archive will have APIs or other services for integrating with other systems
Yes – it has a REST API, and a SWORD API is planned
The digital archive will be able to incorporate new digital preservation tools (for migration, file validation, characterisation etc) as they become available
Yes – In terms of migration tools there is a handy interface for adding tools or commands and setting up new rules. The Roadmap includes plans for updating the tools that are internal to the system. Archivematica developers contribute to the development of tools such as FITS to make them better and more scalable.
The digital archive will include functionality for extracting and exporting the data and associated metadata in standards compliant formats
Yes – Archivematica uses open standards where possible. Metadata is in XML format, uses recognised standards and is packaged with the AIP. Archivematica packages its AIPs using BagIt which is an open standard for storage and transfer of files and metadata. Archival storage is separate so extracting the information from here needs to be a feature of the storage system.
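The BagIt layout itself is simple enough to sketch: a bag declaration (bagit.txt), a data/ payload directory and a checksum manifest. The sketch below covers only those required elements; Archivematica (and the bagit-python library) produce richer bags with additional tag files, so this is just to show the shape of the package:

```python
import hashlib
import os
import shutil

def make_bag(src_dir, bag_dir):
    """Write a minimal BagIt bag from a payload directory.

    Creates the required bag declaration, copies the payload into data/,
    and writes a manifest-sha256.txt of payload checksums.
    """
    data_dir = os.path.join(bag_dir, "data")
    shutil.copytree(src_dir, data_dir)  # payload lives under data/
    with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    with open(os.path.join(bag_dir, "manifest-sha256.txt"), "w") as f:
        for root, _, files in os.walk(data_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                with open(path, "rb") as payload:
                    digest = hashlib.sha256(payload.read()).hexdigest()
                # Manifest paths are relative to the bag root, with
                # forward slashes, per the BagIt specification.
                rel = os.path.relpath(path, bag_dir).replace(os.sep, "/")
                f.write(f"{digest}  {rel}\n")
```

Because the manifest carries checksums for every payload file, a bag can be re-validated after any storage transfer, which is exactly why BagIt suits archival storage and exchange.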
The software or system chosen for the digital archive will be supported and technical help should be available
Yes – Open Source but supported by Artefactual Systems. An active mailing list exists for technical support and Artefactual Systems seem to be quick to respond to any queries
The software or system chosen for the digital archive will be under active development
Yes – Archivematica is very much in development. Wish lists are published online. Specific developments happen quicker if we are able to sponsor them. Alternatively, our own developers could help develop the system to meet our needs.

Thursday, 10 April 2014

Hydra: multiple heads are better than one

Trinity College, Dublin
I spent a couple of days this week in Dublin at the Hydra Europe Symposium. Hydra has been on my radar for a little while but these two days really gave me an opportunity to focus on what it is and what it can do for us. This is timely for us as we are currently looking at systems for fulfilling our repository and digital archiving functions. 

At York we currently use Fedora for our digital library so developments within the Hydra community are of particular interest because of its relationship to Fedora.

Chris Awre from the University of Hull stated that the fundamental assumptions on which Hydra was built were that:

1. No single system can provide the full range of repository based solutions for an institution's needs
2. No single institution can resource development of a full range of solutions on its own

This chimes well with our recent work at York trying to propose a technical architecture that could support deposit, storage, curation and access to research data (among other things). There is no one solution for this and building our own bespoke system from scratch or based purely on Fedora would clearly not be the best use of our resources.

The solution that Hydra provides is a technical framework that multiple institutions can share and that adopting institutions can build upon, developing custom elements tailored to local workflows. Hydra has one body but many heads, supporting many different workflows.

We were told pretty early on within the proceedings that for Hydra, the community is key. Hydra is as much about knowledge sharing as sharing bits of code.

“If you want to go fast go alone, if you want to go far, go together” – This African proverb was used to help explain the Hydra concept of community. In working together you can achieve more and go further. However, some of the case studies that were presented during the Symposium clearly showed that for some, it is possible to go both far and fast using Hydra and with very little development required. Both Trinity College Dublin and the Royal Library of Denmark commented on the speed with which a repository solution based on Hydra could be up and running. Speed is of course largely dependent on the complexity or uniqueness of the workflows you need to put in place. Hydra does not provide a one-size-fits-all solution but should be seen more as a toolkit with building blocks that can be put together in different ways.

Dermot Frost from Trinity College Dublin summed up their reasons for joining the Hydra community, saying that they had had experience with both Fedora and DSpace and neither suited their needs. Fedora is highly configurable and in theory does everything you need, but you need a team of rocket scientists to work it out. DSpace is a more out-of-the-box solution, but you cannot configure it to conform to local needs. Hydra sits between the two, providing a solution that is highly configurable but easier to work with than Fedora.

Anders Conrad from the Royal Library of Denmark told us that for their repository solution, 10-20% of material is deemed worthy of proper long term preservation and is pushed to the national repository. The important thing here is that Hydra can support these different workflows and allows an organisation to put one repository in place that could support different types of material with different values placed on the content and thus different workflows going on within it. The 'one repository - multiple workflows' model is very much the approach that the University of Hull have taken with their Hydra implementation. Richard Green described how data comes in to the repository through different routes and different types of data are treated and displayed in different ways depending on the content type.

And what about digital preservation? This is of course my main interest in all of this. One thing that is worth watching is Archivesphere, a Hydra head that is being created by Penn State designed to "create services for preserving, managing, and providing access to digital objects, in a way that is informed by archival thinking and practices" and including support for both PREMIS and EAD metadata. This is currently being tested by Hydra partners and it will be interesting to see how it develops.

Another thing to think about is how Hydra could meet the digital preservation requirements that I published last year (note they have changed a little bit since then). I think the answer is that it probably could meet most of them if we wanted to develop the solutions on top of existing Hydra components. Archivesphere is already starting to introduce some of the required functionality to Hydra, for example file characterisation, normalisation and fixity checking. I guess the bigger question for me is whether this is the best approach for us, or whether it would be preferable to make use of existing digital archiving software (Archivematica for example) and ensure the systems can talk to each other effectively.