Tuesday, 30 August 2016

Filling the Digital Preservation Gap - a brief update

As we near the end of the active phase of Filling the Digital Preservation Gap* here is a brief update about where we are with the main strands of work we highlighted in our phase 3 kick off blog post.

Archivematica implementation

Work at York

Work is ongoing at York to get our proof of concept implementation of Archivematica up and running. The purpose of this work was not to get a production service in place but to demonstrate that the implementation plan we published in our phase 2 report was feasible. The implementation we are developing pulls metadata (about deposited research datasets) from PURE and provides a method for capturing additional information for managing datasets (filling some of the gaps in the information collected through the PURE datasets module). It also includes an automated process to ingest deposited datasets (along with their metadata) into Archivematica, package them up for longer term preservation and provide a dissemination copy of the dataset to our repository.

We have been doing this work in consultation with the staff at York who actually work with datasets that are deposited through our Research Data York service to ensure that the workflows and processes we are putting in place will make their lives easier rather than harder! We are keen to ensure that those processes that can be automated are automated and those areas where human input is required trigger e-mail notifications to relevant staff and a pause in the workflow to enable the relevant checks to be made.

Work at Hull

Like York, Hull is looking to produce a proof-of-concept system within the timeframe of the project. Whilst concentrating on research data for this Jisc funded work, we also have an eye on later using our approach for other forms of repository content that deserve long-term preservation. To that end, we are taking as our starting point the institutional Box folder that each of our staff has access to; we will be asking depositors to assemble their material for the repository in a folder within their Box account. As well as the content itself they will be asked for basic metadata and processing instructions in a very simple format. When the folder is ready they share it with another Box account “owned” by Archivematica.

Hull has developed a “Box watcher” which detects the new share and instigates processing of the contents, keeping the depositor aware of progress along the way.  The contents of the folder are examined and, depending on what is found and how it is configured, one or more Bags (as in the BagIt standard) are created and handed off to Archivematica.
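To give a feel for what "creating a Bag" involves, here is a minimal Python sketch that wraps a folder of deposited files in the basic BagIt layout: a data/ directory for the payload, a bagit.txt declaration and a SHA-256 payload manifest. It is not Hull's Box watcher code, which is not described here; the function name make_bag, the choice of SHA-256 and the version string are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy source_dir into a new BagIt-style bag at bag_dir (hypothetical helper)."""
    data_dir = bag_dir / "data"
    # The payload lives under data/; copytree also creates bag_dir itself.
    shutil.copytree(source_dir, data_dir)

    # Bag declaration file required by the BagIt layout.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path relative to the bag>" line per file.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # read_bytes() loads the whole file; fine for a sketch, not for huge files.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return bag_dir
```

Calling make_bag(Path("deposit"), Path("deposit_bag")) would leave the original folder untouched and produce a separate bag directory. A real workflow would more likely use an established BagIt library and add bag-info.txt metadata as well.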

Like York we are then looking to have a fully automated Archivematica workflow which produces Archival Information Packages corresponding to each of the bags.  In addition, Hull will have Archivematica create Dissemination Information Package(s) which, once created, will automatically be processed to produce objects in the quality assurance queue of our Hydra repository.

Unidentified file formats

It has been clear from our project work during phase 3 that research data is much harder to identify in an automated fashion than other types of born digital data that an archive would typically hold. If you don’t believe us, read these 2 blog posts that show contrasting results when trying to identify two different types of born digital data: 

So, how are we working towards a solution? As well as directly sponsoring the development of signatures for a small selection of research data file formats by the PRONOM team at The National Archives, we also had a go at creating our own. York’s new signature will be incorporated into PRONOM in due course. Hull’s signature has been submitted and is just being tested by the PRONOM team. There have also been positive discussions with colleagues at The National Archives about wider public engagement around file format signature development and how we work towards increasing the coverage of PRONOM for research data file formats.

Dissemination and outreach

The project team have been keen to continue their focus on dissemination during phase 3 of the project. This has included presentations or posters at the following conferences and events:

  • International Digital Curation Conference (IDCC16), Amsterdam
  • 'Digital Preservation: Strategic Issues' - National Library of Wales
  • UK Archives Discovery Forum, Kew
  • UK Archivematica meeting, York
  • Research Data, Records and Archives: Breaking the Boundaries, Edinburgh
  • Open Repositories, Dublin
  • Jisc CNI conference, Oxford
  • Hydra Virtual Connect
  • TNA Digital Transformation day, Kew

...and our outreach work continues. Watch out for us at the Jisc Research Data Network event in Cambridge next week, the next UK Archivematica meeting in Lancaster the week after, the iPRES conference and Hydra Connect in October and of course the final Jisc Research Data Spring showcase event which will be later on in October.

And of course we have been blogging as usual throughout this phase of the project so do read back to see our previous posts for more information and watch out for our phase 3 final report in mid-October.

* we formally complete the project work on 14th Sept and will focus on writing up our final report over the following month

Friday, 19 August 2016

My first file format signature

As part of Filling the Digital Preservation Gap we've been doing a lot of talking about the importance of accurate file format identification and the challenges of doing so for research data.

Now we are thinking about how we can help solve the problem.

As promised in a post last month, I wanted to have a go at file format signature creation for PRONOM to see whether it is something that an average digital archivist could get their head around. Never before had I created my own signature. In the past I had considered this to be work that only technical people could carry out and it would be foolhardy to attempt it myself.

However, given the extent of the file format identification challenge for research data, wouldn't it be great if the community could engage more directly? Also, shouldn't file signature development be something every digital archivist has a good understanding of?

Encouraged by Ross Spencer's blog post Five Star File Format Signature Development and a meeting with the PRONOM team at The National Archives in which the tricks and challenges of signature creation were explained, I decided to give it a go.

Where to begin?

  • First I read TNA's How to research and develop signatures for file format identification. This is an accessible and readable guide which tells you how to get started with signature development - from gathering samples and doing internet research on the format to using a Hex Editor to spot patterns. You don't need to be very technically minded to get your head around it.
  • Then I downloaded and installed a Hex editor. Though it is possible to view files as hexadecimal within Quick View Plus, I followed TNA's advice and used HxD Hex Editor as this allows you to compare files thus partially automating the process of spotting sequences.
  • Once I'd spotted a pattern which could be used to create a signature, I planned to use PRONOM's Signature Development Utility to create it. 
  • Once the signature is created I'm told it is possible to test this using DROID. Within DROID, go to Tools... Install signature file and replace the current signature with your new one (but remember to put it back again once you are done otherwise you may wonder why DROID isn't working properly!). Run this over your directory of sample files to see if they are all correctly identified using the signature you have developed. 

I decided to start off with something that had been at the back of my mind for a while...those Wordstar 4.0 files from the Marks and Gran archive that I blogged about some time ago and had struggled to identify. When I wrote that post three years ago, Wordstar 4.0 files were not represented in PRONOM. They have more recently been added and the files can be identified but this is by extension only - not by the more accurate file signature. I thought it would be fun to try and create a file signature for them.

I was very wrong.

My attempts to see a pattern within the files using the Hex Editor were unsuccessful. I decided to send the sample files to the experts at TNA to see if I was missing something. It was quickly confirmed that this was a rather awkward file type and not one that lent itself well to being automatically identified. Disappointing but at least it confirmed that my own investigations were not lacking.

For my next attempt I decided to tackle some of the unidentified research data that I had highlighted in my previous post Research data - what does it *really* look like?

I looked through the top ten most frequent unidentified file extensions in my sample and started to dig out the files themselves and assess whether they were a good candidate for me to work on. Ross Spencer suggests that PRONOM lends itself best to the creation of signatures for binary formats so this is what I wanted to focus on. No point in trying to make it hard for myself!

  • dat - 286
  • crl - 259
  • sd - 246
  • jdf - 130 (a signature for these JEOL NMR Spectroscopy files is now available)
  • out - 50
  • mrc - 47
  • inp - 46
  • xyz - 37
  • prefs - 32
  • spa - 32
Unfortunately, looking through the list (and digging out some samples) I discovered that many of these are ASCII formats rather than binary. It is possible to create signatures to identify ASCII files but it can be challenging (involving quite complex regular expressions) and not a great place for a first timer to start. I certainly did not want to start to tackle the confusing landscape of .dat files either!
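A quick way to sort candidates into "probably ASCII" and "probably binary" before opening each one in a Hex Editor is to look at the first few kilobytes of the file. The Python sketch below is a rough heuristic rather than a definitive test: the function name looks_like_ascii and the 8192-byte sample size are assumptions, and it will wrongly call a binary file "ASCII" if its non-text bytes only appear later in the file.

```python
from pathlib import Path


def looks_like_ascii(path: Path, sample_size: int = 8192) -> bool:
    """Return True if the first sample_size bytes are printable ASCII or whitespace."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    # Printable ASCII plus tab, line feed and carriage return.
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    # Note: an empty file has no disallowed bytes, so it also returns True.
    return all(byte in allowed for byte in sample)
```

Running this over a folder of unidentified files gives a short list of binary candidates to examine more closely.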

After a little bit of investigation I discovered that the .spa files were something I could work with. I knew nothing about this format but found the relevant files and started doing some internet research looking for more information and perhaps some additional samples. I soon discovered they were one of many formats for optical spectroscopy and are known as Thermo Fisher’s OMNIC file format or Thermo Scientific OMNIC or Nicolet/Thermo OMNIC.

Looking at some of the files using a Hex Editor it was immediately apparent that there was a consistent pattern of bytes at the start of each file: a string which read 'Spectral Data File', represented by 53 70 65 63 74 72 61 6C 20 44 61 74 61 20 46 69 6C 65 in hexadecimal. Note that I actually thought the pattern was longer but advice received from the PRONOM team suggested that it was better to cut it down.
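The same check can be scripted. This short Python sketch tests whether a file begins with the byte sequence quoted above; the names SPA_SIGNATURE and starts_with_signature are made up for illustration, and the check only looks at offset 0, so it is a simplification of what a full PRONOM signature can express (and the signature as finally released may differ).

```python
# The byte sequence quoted above, which spells 'Spectral Data File' in ASCII.
SPA_SIGNATURE = bytes.fromhex(
    "53 70 65 63 74 72 61 6C 20 44 61 74 61 20 46 69 6C 65"
)


def starts_with_signature(path, signature: bytes = SPA_SIGNATURE) -> bool:
    """Return True if the file at path begins with the given byte sequence."""
    with open(path, "rb") as handle:
        # Read only as many bytes as the signature is long.
        return handle.read(len(signature)) == signature
```

Looping this over a directory of sample files is a rough stand-in for the DROID test described earlier, useful for a sanity check before building the signature itself.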

I also looked at the end of each file and at first sight there appeared to be consistency here too with each file ending with the same few bytes. This hypothesis was blown out of the water when I looked at a sample file that I had discovered online which did not display this pattern (but luckily did have the same bytes at the start of the file).

This is why it is so important to have sample data that comes from more than one source. A set of files from a single researcher may have misleading patterns that have occurred just because of the consistent way in which they work, rather than this being a true feature of the format itself.
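To illustrate the point, the Python sketch below finds the longest run of bytes shared at the start and at the end of a set of samples. It is a rough equivalent of comparing files side by side in a Hex Editor, not a PRONOM tool; the function names are hypothetical and the samples are assumed to have been read into memory as bytes.

```python
def common_prefix(blobs: list) -> bytes:
    """Longest run of bytes shared at the start of every blob."""
    if not blobs:
        return b""
    shortest = min(blobs, key=len)
    for i, byte in enumerate(shortest):
        # Stop at the first position where any sample disagrees.
        if any(blob[i] != byte for blob in blobs):
            return shortest[:i]
    return shortest


def common_suffix(blobs: list) -> bytes:
    """Longest run of bytes shared at the end of every blob."""
    # Reverse each blob, reuse the prefix logic, then reverse the result.
    return common_prefix([blob[::-1] for blob in blobs])[::-1]
```

If the shared suffix shrinks to nothing as soon as a sample from a different source is added while the prefix stays the same, the ending was probably an artefact of one researcher's setup rather than a feature of the format.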

So, once I'd looked at all 33 files and had convinced myself the hypothesis was solid, I went to the online signature development tool provided by The National Archives and created my signature.

PRONOM signature development tool

This was relatively easy to use but there were areas where more guidance was needed (so questions were fired off to the PRONOM team and a speedy response was received). I'm hoping that in the future there will be more documentation to help guide the completion of this form - so that people know how best to name the signature, where to find a definitive list of Mimetypes (this is the list they suggested I looked at), and what the Version field should contain (it is for the version of the file format if this is apparent/relevant - not the version of the signature you are creating).

Once I was done, I clicked on the 'Save Signature File' button and I was presented with the finished XML file:

Ta daaaaa!

I briefly admired my handiwork before sending it off to The National Archives for feedback.*

How long did it take me to do all of this? I would say one full day is a fair estimate (that would include reading the guidance, downloading the Hex Editor and a few false starts as I tried to find a format that I thought I could handle). The next signature would be much quicker.

The biggest challenges:

  1. It took me a while to find a binary format that I could work with. Much of the research data we hold appears to be in ASCII formats, which has benefits from a digital preservation perspective, but wasn't what I was looking for with regard to this exercise.
  2. I did not really understand the file format I was working with. I am not a chemist. I had never heard of the .spa format. I struggle to even say 'Spectroscopy' let alone understand it. When I started to research it online I found the results quite confusing. If I had known more about the format in the first place, life would have been much easier.
  3. There are limitations with the metadata we get from researchers when they deposit data with Research Data York. Reading the brief descriptions of the dataset that are provided did not really help me work out what the individual files are or what software and hardware was used to create them.
  4. I could not locate the file format specification online - I think next time I try this I may approach the software vendor direct and ask them for help. 
  5. Available documentation for creating and testing signatures could be enhanced. I had several questions as I went along and these were answered promptly by the PRONOM team, but if the information was all online then this would certainly help other newbies.

Despite the challenges this exercise has been both enjoyable and useful. The thing I like about being a digital archivist is being able to get hands on with the data and solve problems. Over the last few years I've done very little of this type of work so it was great to get stuck in. On top of the obvious benefit that after the next signature release these .spa files will be recognised by DROID and other PRONOM-based file identification tools, I have also increased my knowledge and understanding of the process and this is a positive result.

I would definitely encourage other digital archivists, repository managers and research data managers to try this out for themselves.

* Feedback from the PRONOM team was positive. With a couple of modifications they were happy to include the signature in the next PRONOM signature release

Friday, 5 August 2016

Research data is different

This is a guest post from Simon Wilson who has been profiling the born-digital data at the Hull History Centre to provide another point of comparison with the research data at York reported on in this blog back in May.

Inspired by Jen’s blog Research data - what does it *really* look like? about the profile of the  research data at York and the responses it generated including that from the Bentley Historical Library, I decided to take a look at some of the born-digital archives we have at Hull. This data is not research data from academics, it is data that has been donated to or deposited with the Hull History Centre and it comes from a variety of different sources.

Whilst I had previously created a DROID report for each distinct accession I had never really looked into the detail, so for each accession I did the following:

  1. Run the DROID software and export the results into csv format with one row per file 
  2. Open the file in MS Excel and copy the data to a second tab for the subsequent actions
  3. Sort the data by Type field into A-Z order and then delete all of the records relating to folders 
  4. Sort the data on the PUID field into A-Z order
  5. For large datasets highlight the data and then select the subtotal tool and use it to count each time the PUID field changes and record the sub-total
  6. Once the subtotal tool has completed its calculations, select the entire dataset and select Hide Detail (adjacent to Subtotal in the Outline tools box) to leave you with just a row for each distinct PUID and the total count value

I then created a simple spreadsheet with a column for each distinct accession and added a row for each unique PUID, copying the MIME type, software and version details from the DROID report results.  I also noted the number of files that were not identified. There may be quicker ways to get the same results and I would love to hear other suggestions or shortcuts.
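One possible shortcut is to let a short script do the sorting and subtotalling. The Python sketch below counts files per PUID straight from a DROID CSV export. It assumes the export has TYPE and PUID columns with folders marked as "Folder" in TYPE (check the header row of your own export), and the function name count_puids and the "unidentified" label are made up for illustration.

```python
import csv
from collections import Counter


def count_puids(csv_path) -> Counter:
    """Count files per PUID in a DROID CSV export, skipping folder rows."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumed column names: TYPE and PUID.
            if row.get("TYPE") == "Folder":
                continue
            # Rows with an empty PUID are files DROID could not identify.
            counts[row.get("PUID") or "unidentified"] += 1
    return counts
```

Calling count_puids("accession.csv").most_common(10) would give the ten most frequent PUIDs for one accession, and adding the Counter objects from several accessions together gives the overall totals.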

Having completed this for 24 accessions (totalling 270,867 files), what have I discovered?

  • An impressive 97.96% of files were identified by DROID (compared with only 37% in Jen's smaller sample of research data)
  • So far 228 different PUIDs have been identified (compared with 34 formats in Jen’s sample)
  • The most common format is fmt/40 (MS Word 97-2003) with 120,595 files (44.5%). See the top ten identified formats in the table below...

File format (version)
Total No files
% of total identified files
Microsoft Word Document (97-2003)
Microsoft Word for Windows (2007 onwards)
Microsoft Excel 97 Workbook
Graphics Interchange Format
Acrobat PDF 1.4 - Portable Document Format
JPEG File Interchange Format (1.01)
Microsoft Word Document (6.0 / 95)
Acrobat PDF 1.3 - Portable Document Format
JPEG File Interchange Format (1.02)
Hypertext Markup Language (v4)

I can now quickly look-up whether an individual archive has a particular file type, and see how frequently it occurs.  Once I have processed a few more accessions it may be possible to create a "profile" for an individual literary collection or a small business and use this to inform discussions with depositors.  I can also start to look at the identified file formats and determine whether there is a strategy in place to migrate that format. Where this isn’t the case, knowing the number and frequency of the format amongst the collections will allow me to prioritise my efforts.  I will also look to aggregate the data – for example merging all of the different versions of Adobe Acrobat or MS Word.

I haven’t forgotten the 5520 unidentified files. By noting the PRONOM signature file number used to profile each archive, it is easy to repeat the process with a later signature file.  This could validate the previous results or enable previously unidentified files to be identified (particularly if I use the results of this exercise to feed information back to the PRONOM team). Knowing which accessions have the largest number of unidentified files will allow me to focus my effort as appropriate.

Whilst this has certainly been a useful exercise in its own right, it is also interesting to note the similarities between this and the born-digital archives profile published by the Bentley Historical Library and the contrast with the research data profile Jen reported on.

The top ten identified formats from Hull and Bentley are quite similar. Both have a good success rate for identifying file formats with 90% identified at Bentley and 98% at Hull. Though the formats do not appear in the same order in the top ten, they do contain similar types of file (MS Word, PDF, JPEGs, GIFs and HTML).

In contrast, only 37% of files were identified in York's research data sample and the top ten file formats that were identified look very different. The only area of overlap is MS Excel files, which appear high up in the York research dataset as well as being in the top ten for the Hull History Centre.

Research data is different.