Tuesday, 8 December 2015

Addressing digital preservation challenges through Research Data Spring

With the short time scales at play in the Jisc Research Data Spring initiative it is very easy to find yourself so focussed on your own project that you don’t have time to look around and see what everyone else is doing. As phase 2 of Research Data Spring comes to an end we are taking time to reflect, to think about digital preservation for research data management, to look at the other projects and think about how all the different pieces of the puzzle fit together.

Our “Filling the Digital Preservation Gap” project is very specifically about digital preservation and we are focusing primarily on what happens once the researchers have handed over their data to us for long term safekeeping. However, ‘digital preservation’ is not a thing that exists in isolation. It is very much a part of the wider ecosystem for managing data. Different projects within Research Data Spring are working on specific elements of this infrastructure and this blog post will try and unpick who is doing what and how this work contributes to helping the community address the bigger challenges of digital preservation.

The series of podcast interviews that Jisc produced for each project was a great starting point for finding out about the projects and this has been complemented by some follow up questions and discussions with project teams. Any errors or misinterpretations are my own. A follow up session on digital preservation is planned for the next Research Data Spring sandpit later this week so an update may follow next week in the light of that.

So here is a bit of a synthesis of the projects and how they relate to digital preservation and more specifically the Open Archival Information System (OAIS) reference model. If you are new to OAIS, this DPC technology watch report is a great introduction.

OAIS Functional Model (taken from the DPC Technology Watch report)

So, starting at the left of the diagram, at the point at which researchers (producers) are creating their data and preparing it for submission to a digital archive, the CREAM project (or “Collaboration for Research Enhancement by Active Metadata”) led by the University of Southampton hopes to change the way researchers use metadata. It is looking at how different disciplines capture metadata and how this enhances the data in the long run. They are encouraging dynamic capture of metadata at the point of data creation which is the point at which researchers know most about their data. The project is investigating the use of lab notebooks (not just for scientists) and also looking at templates for metadata to help streamline the research process and enable future reuse of data.

Whilst the key aims of this project do fall within the active data creation phase and thus outside of the OAIS model, they are still fundamental to the success of a digital archive and the value of working in this area is clear. One of the mandatory responsibilities of an OAIS is to ensure the independent utility of the data that it holds. In simple terms this means that the digital archive should ensure that as well as preserving the data itself, it also preserves enough contextual information and documentation to make that data re-usable for its designated community. This sounds simple enough but speaking from experience, as a digital archivist, this is the area that often causes frustration - going back to ask a data producer for documentation after the point of submission and at a time when they have moved on to a new project can be a less than fruitful exercise. A methodology for encouraging metadata generation at the point of data creation and to enable this to be seamlessly submitted to the archive along with the data itself would be most welcome.

Another project that sits cleanly outside of the OAIS model but impacts on it in a similar way is “Artivity” from the University of the Arts London. This is again about capturing metadata but with a slightly different angle. This project is looking at metadata to capture the creative process as an artist or designer creates a piece of digital art. They are looking at tools to capture both the context and the methodology so that in the future we can ask questions such as ‘how were the software tools actually used to create this artwork?’. As above, this project is enabling an institution to fulfil the OAIS responsibility of ensuring the independent utility of the data, but the documentation and metadata it captures is quite specific to the artistic process.

For both of these projects we would need to ensure that this rich metadata and documentation was deposited in the digital archive or repository alongside the data itself in a format that could be re-used in the future. As well as thinking about the longevity of file formats used for research data we clearly also need to think about file formats for documentation and metadata. Of course, when incorporating this data and metadata into a data archive or repository, finding a way of ensuring the link between the data and associated documentation is retained is also a key consideration.

The Clipper project (“Clipper: Enhancing Time-based Media for Research”) from City of Glasgow College provides another way of generating metadata - this time specifically aimed at time-based media (audio and video). Clipper is a simple tool that allows a researcher to cite a specific clip of a digital audio or video file. This solves a real problem in the citation and re-use of time-based media. The project doesn't relate directly to digital preservation but it could interface with the OAIS model at either end. Data produced from Clipper could be deposited in a digital archive (either alongside the audio or video file itself, or referencing a file held elsewhere). This scenario could occur when a researcher needs to reference or highlight a particular section to back up their research. At the other end of the spectrum, Clipper could also be a tool that the OAIS Access system encourages data consumers to utilise, for example, by highlighting it as a way of citing a particular section of a video that they are enabling access to. The good news is that the Clipper team have already been thinking about how metadata from Clipper could be stored for the long term within a digital archive alongside the media itself. The choice of HTML as the native file format for metadata should ensure that this data can be fairly easily managed into the future.

Still on the edges of the OAIS model (and perhaps most comfortably sitting within the Producer-Archive Interface) is a project called “Streamlining deposit: OJS to Repository Plugin” from City University London which intends to make the process of submission of papers to journals and associated datasets to repositories more streamlined for researchers. They are developing a plugin to send data direct from a journal to a data repository. They want to streamline the submission process for authors who need to make additional data available alongside their publications. This will ensure that the appropriate data gets deposited and linked to a publication in order to ultimately enable access to others.

Along a similar theme is “Giving Researchers Credit for their Data” from the University of Oxford. This project is also looking at more streamlined ways of linking data in repositories with publisher platforms and avoiding retyping of metadata by researchers. They are working on practical prototypes with Ubiquity, Elsevier and Figshare and looking specifically at the communication between the repository platform and publication platform.

Ultimately these 2 projects are all about giving researchers the tools to make depositing data easier and, in doing so, ensuring that the repository also gets the information it needs to manage the data in the long term. This impacts on digital preservation in 2 ways. Firstly, the easier processes for deposit will encourage more data to be deposited in repositories where it can then be preserved. Secondly, data submitted in this way should include better metadata (with a direct link to a related publication) which will make the job of the repository in providing access to this data easier and ultimately encourage re-use.

Other projects explore the stages of the research lifecycle that occur once the active research phase is over, addressing what happens when data is handed over to an archive or repository for longer term storage.

The “DataVault” project at the Universities of Edinburgh and Manchester is primarily addressing the Archival Storage entity of the OAIS model. They are establishing a DataVault - a safe place to store research data arising from research that has been completed. This facility will ensure that the data is stored unchanged for an appropriate period of time. Researchers will be encouraged to use this facility for data that isn’t suitable for deposit via the repository but that they wish to keep copies of. This will enable them to fulfil funder requirements around retention periods. The DataVault, whilst primarily being a storage facility, will also carry out other digital preservation functions. Data will be packaged using the BagIt specification, an initial stab at file identification will be carried out using Apache Tika and fixity checks will be run periodically to monitor the file store and ensure files remain unchanged. The project team have highlighted the fact that file identification is problematic in the sphere of research data as you work with so many data types across disciplines. This is certainly a concern that the “Filling the Digital Preservation Gap” project has shared.
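
To make the idea of a periodic fixity check a little more concrete, here is a minimal sketch in Python. It is not the DataVault code and it does not implement the BagIt specification; it only illustrates the underlying principle of recording a checksum for each file at the point of storage and comparing against it later. The function names (sha256_of, build_manifest, check_fixity) and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Record a checksum for every file under data_dir (hypothetical helper).

    Keys are paths relative to data_dir, values are hex checksums.
    """
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def check_fixity(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Compare the files on disk with a stored manifest.

    Returns a list of problems; an empty list means nothing has changed.
    """
    problems = []
    for rel_path, expected in manifest.items():
        target = data_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {rel_path}")
    return problems
```

In practice the manifest would be written to disk alongside the data when it is first stored, and check_fixity would be run on a schedule, with any non-empty result flagged for investigation.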

Our own “Filling the Digital Preservation Gap” project focuses on some of the more hidden elements of a digital preservation system. We are not looking at digital preservation software or tools that a researcher will interact with, but with the help of Archivematica are looking at, among other things, the OAIS Ingest entity (how we process the data as it arrives in the digital archive) and the Preservation Planning entity (how we monitor preservation risks and react to them). In phase 3 we plan to address OAIS more holistically with our proofs of concept. I won’t go into any further detail here as our project already gets so much air space on this blog!

Another project looking more holistically at OAIS is “A Consortial Approach to Building an Integrated RDM System - Small and Specialist” led by the University for the Creative Arts. This project is looking at the whole technical infrastructure for RDM and in particular looking at how this infrastructure can be achievable for small and specialist research institutes with limited resources. In a phase 1 project report by Matthew Addis from Arkivum a full range of workflows is described, covering many of the different elements of an OAIS. To give a few examples, there are workflows around data deposit (Producer-Archive Interface), research data archiving using Arkivum (Archival Storage), access using EPrints (Access), gathering and reporting usage metrics (Data Management) and last but not least a workflow for research data preservation using Archivematica which has parallels with some of the work we are doing in “Filling the Digital Preservation Gap”.

“DMAOnline” sits firmly within the Data Management entity of the OAIS, running queries on the functions of the other entities and producing reports. This tool, being created by Lancaster University, will report on the administrative data around research data management systems (including statistics around access, storage and the preservation of that data). Using a tool like this, institutions will be able to monitor their RDM activities at a high level, drill down to see some of the detail and use this information to monitor the uptake of their RDM services or to make an assessment of their level of compliance with funder mandates. From the perspective of the “Filling the Digital Preservation Gap” project we are pleased that the DMAOnline team have agreed to include reporting on the statistics from Archivematica in their phase 3 plans. One of the limitations of Archivematica that was highlighted in the requirements section of our own phase 1 report was the lack of reporting options within the system. A development we have been sponsoring during phase 2 of our project will enable third party systems such as DMAOnline to extract information from Archivematica for reporting purposes.

Much focus in RDM activities typically goes into the Access functional entity, which naturally follows on from viewing a summary of activity through DMAOnline. This is one of the more visible parts of the model - the end product if you like of all the work that goes on behind the scenes. A project with a key focus on access is “Software Reuse, Repurposing and Reproducibility” from the University of St Andrews. However, as is the case for many of these projects, it also touches on other areas of the model. At the end of the day, access isn't sustainable without preservation so the project team are also thinking more broadly about these issues.

This project is looking at software that is created through research (the software that much research data actually depends on). What happens to software written by researchers, or created through projects when the person who was maintaining it leaves? How do people who want to reuse the data get hold of the right software? The project team have been looking at how you assign identifiers to software, how you capture software in such a way to make it usable in the future and how you then make that software accessible. Versioning is also a key concern in this area - different versions of software may need to be maintained with their own unique identifiers in order to allow future users of the data to replicate the results of a particular study. Issues around the preservation of and access to software are a bit of a hot topic in the digital preservation world so it is great to see an RDS project looking specifically at this.

The Administration entity of an OAIS coordinates the other high level functional entities, oversees the operation of them and serves as a central hub for internal and external interactions. The “Extending OPD to cover RDM” project from the University of Edinburgh could be one of these external interactions. It has put in place a framework for recording what facilities and services your institution has in place for managing research data, covering technical infrastructure, policy and training. It allows an institution to make visible the information about their infrastructure and facilities and to compare it or benchmark it against others. The level of detail in this profile goes far above and beyond OAIS but allows an organisation to report on how it is meeting the ‘Data repository for longer term access and preservation’ component for example.

In summary it has been a useful exercise thinking about the OAIS model and how the different RDS projects in phase 2 fit within this framework. It is good to see how they all impact on and address digital preservation in some way - some by helping get the necessary metadata into the system, or enabling a more streamlined deposit process, others helping monitor or assess the performance of the systems in place and some projects more directly addressing key entities within the model. The outputs from these projects complement each other - designed to solve different problems and addressing discrete elements of the complex puzzle that is research data management.

Wednesday, 2 December 2015

Research Data Spring - a case study for collaboration

Digital preservation is not a problem that any single institution can realistically find a solution to on their own. Collaboration with others is a great way of working towards sustainable solutions in a more effective way. This post is a case study about how we have benefited from collaboration whilst working on the "Filling the Digital Preservation Gap" project.

In late 2014 Jisc announced a new collaborative initiative called Research Data Spring. The project model specifically aimed to create innovative partnerships and collaborations between stakeholders at different HE institutions working within the field of Research Data Management. Project teams were asked to work in short sprints of between three and six months and were funded for a maximum of three phases of work. One of the projects lucky enough to be funded as part of this initiative was the “Filling the Digital Preservation Gap” project, a collaboration between the Universities of Hull and York. This was a valuable opportunity for teams at the two universities to work together on a shared solution to a shared problem and come up with a solution that might be beneficial to others.

The project team from Hull and York
The aim of the project was to address a perceived gap in existing research data management infrastructures around the active preservation of the data. Both Hull and York had existing digital repositories and sufficient storage provision but were lacking systems and workflows for fully addressing preservation. The project aimed to investigate the open source tool Archivematica and establish whether this would be a suitable solution to fill this gap.

As well as the collaboration between Hull and York, further collaborations emerged as the project progressed. 

Artefactual Systems are the organisation who support and develop Archivematica and the project team worked closely with them throughout the project. Having concluded that Archivematica has great potential for helping to preserve research data, the project team highlighted several areas where they felt additional development was required in order to enhance existing functionality. Artefactual Systems were consulted in detail as the project team scoped out priorities for further work. They were able to offer many useful insights about the best way of tackling the problems we described. Their extensive knowledge of the system put them in a good place to look at the issues from various angles to find a solution which would meet our needs as well as the needs of the wider community of users. Artefactual Systems were also able to help us with one of our outreach activities, joining us (virtually) to give a presentation about our work.

The UK Archivematica group was kept informed about the project and invited to help shape the priorities for development (you can read a bit about this in a previous blog post). Experienced and established Archivematica users from the international community were also consulted to discuss the new features and to review how the proposed features would impact on their workflows. Ultimately, none of us wanted to create bespoke developments that were only going to be of use to Hull and York.

Collaboration with another Research Data Spring project being carried out at Lancaster University was also necessary to enable future join up of these two initiatives. One of the areas highlighted for further work was improved reporting within Archivematica. The project team sponsored a development to enable data to be more easily exposed to third party applications, and worked closely with the DMAOnline project team at Lancaster to ensure the data would be made available in a manner that was suitable for their tool to work with.

Another area of work that called for additional collaboration was in the area of file format identification. This is very much an area that the digital preservation community as a whole needs to work together on. For research data in particular, there are many types of file that are not identified by current identification tools and are not present within the Pronom registry of file types. We wanted to get greater representation of research data file formats within Pronom and also enhance Archivematica to enable better workflows for non-identified files (see my previous post for more about file identification workflows). This is why we have also been collaborating with the team at The National Archives who develop new file signatures for Pronom.
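
For readers less familiar with how signature-based identification works, the sketch below shows the general principle in Python: a file is matched against a table of known byte sequences, and anything that matches nothing is set aside for follow up. The three-entry signature table and the function names (identify, triage) are assumptions made for illustration only; this is not the Pronom signature format, nor how Archivematica or any real identification tool is implemented.

```python
from typing import Optional

# Illustrative signature table: (format name, byte offset, magic bytes).
# Real registries hold far more entries and far richer patterns.
SIGNATURES = [
    ("PNG image", 0, b"\x89PNG\r\n\x1a\n"),
    ("PDF document", 0, b"%PDF-"),
    ("ZIP container", 0, b"PK\x03\x04"),
]


def identify(data: bytes) -> Optional[str]:
    """Return the first format whose signature matches, or None."""
    for name, offset, magic in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return name
    return None


def triage(files: dict[str, bytes]) -> tuple[dict[str, str], list[str]]:
    """Split files into identified ones and those needing manual follow up."""
    identified: dict[str, str] = {}
    unidentified: list[str] = []
    for filename, data in files.items():
        fmt = identify(data)
        if fmt is None:
            unidentified.append(filename)
        else:
            identified[filename] = fmt
    return identified, unidentified
```

The unidentified list is the interesting part for research data: those are the files for which a new signature may need to be developed and submitted to a registry, which is exactly where a shared community effort pays off.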

The collaborative nature of this project brought several benefits. Despite the short time scales at play (or perhaps because of them) there was a strength in working together on a new and innovative solution to preserve research data.

The universities of Hull and York were similar enough to share the same problem and see the need to fill the digital preservation gap, but different enough to introduce interesting variations in workflows and implementation strategies. This demonstrated that there is often more than one way to implement a solution depending on institutional differences.  

By collaborating and consulting widely, the project hoped to create a better final outcome and produce a set of enhancements and case studies that would benefit a wide community of users.