Pages

Monday, 13 February 2017

What have we got in our digital archive?

Do other digital archivists find that the work of a digital archivist rarely involves doing hands on stuff with digital archives? When you have to think about establishing your infrastructure, writing policies and plans and attending meetings it leaves little time for activities at the coal face. This makes it all the more satisfying when we do actually get the opportunity to work with our digital holdings.

In the past I've called for more open sharing of profiles of digital archive collections but I am aware that I had not yet done this for the contents of our born digital collections here at the Borthwick Institute for Archives. So here I try to redress that gap.

I ran DROID (v 6.1.5, signature file v 88, container signature 20160927) over the deposited files in our digital archive and have spent a couple of days crunching the results. Note that this just covers the original files as they have been given to us. It does not include administrative files that I have added, or dissemination or preservation versions of files that have subsequently been created.

I was keen to see:
  • How many files could be automatically identified by DROID
  • What the current distribution of file formats looks like
  • Which collections contain the most unidentified files
...and also use these results to:
  • Inform future preservation planning and priorities
  • Feed further information to the PRONOM team at The National Archives
  • Get us to Level 2 of the NDSA Levels of Digital Preservation which asks for "an inventory of file formats in use" and which until now I haven't been collating!

Digital data has been deposited with us since before I started at the Borthwick in 2012 and continues to be deposited with us today. We do not have huge quantities of digital archives here as yet (about 100GB) and digital deposits are still the exception rather than the norm. We will be looking to chase digital archives more proactively once we have a Archivematica in place and appropriate workflows established.

Last modified dates (as recorded by DROID) appear to range from 1984 to 2017 with a peak at 2008. This distribution is illustrated below. Note however, that this data is not always to be trusted (that could be another whole blog post in itself...). One thing that it is fair to say though is that the archive stretches back right to the early days of personal computers and up to the present day.

Last modified dates on files in the Borthwick digital archive

Here are some of the findings of this profiling exercise:

Summary statistics

  • Droid reported that 10005 individual files were present
  • 9431 (94%) of the files were given a file format identification by Droid. This is a really good result ...or at least it seems it in comparison to my previous data profiling efforts which have focused on research data. This result is also comparable with those found within other digital archives, for example 90% at Bentley Historical Library, 96% at Norfolk Record Office and 98% at Hull University Archives
  • 9326 (99%) of those files that were identified were given just one possible identification. 1 file was given 2 different identifications (an xlsx file) and 104 files (with a .DOC extension) were given 8 identifications. In all these cases of multiple identifications, identification was done by file extension rather than signature - which perhaps explains the uncertainty

Files that were identified

  • Of the 9431 files that were identified:
    • 6441 (68%) were identified by signature (which suggests a fairly accurate identification - if a file is identified by signature it means that Droid has looked inside the file and seen something that it recognises. Last year I was inducted into the magic ways this happens - see My First File Format Signature!)
    • 2546 (27%) were identified by container (which again suggests a high level of accuracy). The vast majority of these were Microsoft Office files 
    • 444 (5%) were identified by extension alone (which implies a less accurate identification)


  • Only 86 (1%) of the identified files had a file extension mismatch - this means that the file extension was not what you would expect given the identification by signature. There are all sorts of different examples here including:
    • files with a tmp or dot extension which are identified as Microsoft Word
    • files with a doc extension which are identified as Rich Text Format
    • files with an hmt extension identifying as JPEG files
    • and as in my previous research data example, a bunch of Extensible Markup Language files which had extensions other than XML
So perhaps these are things I'll look into in a bit more detail if I have time in the future.

  • 90 different file formats were identified within this collection of data

  • Of the identified files 1764 (19%) were identified as Microsoft Word Document 97-2003. This was followed very closely by JPEG File Interchange Format version 1.01 with 1675 (18%) occurrences. The top 10 identified files are illustrated below:

  • This top 10 is in many ways comparable to other similar profiles that have been published recently from Bentley Historical Library, Hull University Archive and Norfolk Records Office with high occurrences of Microsoft Word, PDF and JPEG images. In contrast. what it is not so common in this profile are HTML files and GIF image files - these only just make it into the top 50. 

  • Also notable in our top ten are the Sibelius files which haven't appeared in other recently published profiles. Sibelius is musical notation software and these files appear frequently in one of our archives.


Files that weren't identified

  • Of the 574 files that weren't identified by DROID, 125 different file extensions were represented. For most of these there was just a single example of each.

  • 160 (28%) of the unidentified files had no file extension at all. Perhaps not surprisingly it is the earlier files in our born digital collection (files from the mid 80's), that are most likely to fall into this category. These were created at a time when operating systems seemed to be a little less rigorous about enforcing the use of file extensions! Approximately 80 of these files are believed to be WordStar 4.0 (PUID:  x-fmt/260) which DROID would only be able to recognise by file extension. Of course if no extension is included. DROID has little chance of being able to identify them!

  • The most common file extensions of those files that weren't identified are visible in the graph below. I need to do some more investigation into these but most come from 2 of our archives that relate to electronic music composition:


I'm really pleased to see that the vast majority of the files that we hold can be identified using current tools. This is a much better result than for our research data. Obviously there is still room for improvement so I hope to find some time to do further investigations and provide information to help extend PRONOM.

Other follow on work involves looking at system files that have been highlighted in this exercise. See for example the AppleDouble Resource Fork files that appear in the top ten identified formats. Also appearing quite high up (at number 12) were Thumbs.db files but perhaps that is the topic of another blog post. In the meantime I'd be really interested to hear from anyone who thinks that system files such as these should be retained.


Friday, 10 February 2017

Harvesting EAD from AtoM: a collaborative approach

In a previous blog post AtoM harvesting (part 1) - it works! I described how archival descriptions within AtoM are being harvested as Dublin Core for inclusion within our University Library Catalogue.* I also hinted that this wouldn’t be the last you would hear from me on AtoM harvesting and that plans were afoot to enable much richer metadata in EAD 2002 XML (Encoded Archival Description) format to be harvested via OAI-PMH.

I’m pleased to be able to report that this work is now underway.

The University of York along with five other organisations in the UK have clubbed together to sponsor Artefactual Systems to carry out the necessary development work to make EAD harvesting possible. This work is scheduled for release in AtoM version 2.4 (due out in the Spring).

The work is being jointly sponsored by:



We are also receiving much needed support in this project from The Archives Hub who are providing advice on the AtoM EAD and will be helping us test the EAD harvesting when it is ready. While the sponsoring institutions are all producers of AtoM EAD, The Archives Hub is a consumer of that EAD. We are keen to ensure that the archival descriptions that we enter into AtoM can move smoothly to The Archives Hub (and potentially to other data aggregators in the future), allowing the richness of our collections to be signposted as widely as possible.

Adding this harvesting functionality to AtoM will enable The Archives Hub to gather data direct from us on a regular schedule or as and when updates occur, ensuring that:


  • Our data within the Archives Hub doesn’t stagnate
  • We manage our own master copy of the data and only need to edit this in one place
  • A minimum of human interaction is needed to incorporate our data into the Hub
  • It is easier for researchers to find information about the archives that we hold without having to search all of our individual catalogues


So, what are we doing at the moment?


  • Developers at Artefactual Systems are beavering away working on the initial development and getting the test site ready for us to play with.
  • The sponsoring institutions have been getting samples of their own AtoM data ready for loading up into the test deployment. It is always better when testing something to have some of your own data to mess around with.
  • The Borthwick have been having discussions with The Archives Hub for some time about AtoM EAD (from version 2.2) but we’ve picked up these discussions again and other institutions have joined in by supplying their own EAD samples. This allows staff at the Hub to see how EAD has changed in version 2.3 of AtoM (it hasn’t very much) and also to see how consistent the EAD from AtoM is from different institutions. We have been having some pretty detailed discussions about how we can make the EAD better, cleaner, fuller - either by data entry at the institutions, automated data cleaning at The Hub prior to display online or by further developments in AtoM.


What we are doing at the moment is good and a huge step in the right direction, but perhaps not perfect. As we work together on this project we are coming across areas where future work would be beneficial in order to improve the quality of the EAD that AtoM produces or to expand the scope of what can be harvested from AtoM. I hope to report on this in more detail at the end of the project, but in the meantime, do get in touch if you are interested in finding out more.







* It is great to see that this is working well and our Library Catalogue is now appearing in the referrer reports for the Borthwick Catalogue on Google Analytics. People are clearly following these new signposts to our archives!