Tuesday, 31 May 2016

Research data - what does it *really* look like?

Work continues on our Filling the Digital Preservation Gap project and I thought it was about time we updated you on some of the things we have been doing.

While my colleague Julie has been focusing on the more technical issues of implementing Archivematica for research data, I have been looking at some real research data and exploring in more detail some of the issues we discussed in our phase 1 report.

For the past year, we have been accepting research data for longer term curation. Though the systems for preservation and access to this data are still in development, we can in the meantime allocate a DOI for each dataset, manage access, and store the data safely (ensuring it isn't altered), with a view to ingesting it into our data curation systems once they are ready.

Having this data in one place on our filestore does give me the opportunity to test the hypothesis in our first report about the wide range of file formats that will be present in a research dataset and also the assertion that many of these will not be identified by the tools and registries in use for the creation of technical metadata.

So, I have done a fairly quick piece of analysis on the research data, running Droid - a tool developed by The National Archives - over the data to get an indication of whether the files can be recognised and identified in an automated fashion.

All the data in our research data sample has been deposited with us since May 2015. The majority of the data is scientific in nature, much of it coming from the departments of Chemistry and Physics (this may be a direct result of expectations from the EPSRC around data management). The data is mostly fairly recent, as suggested by the last modified dates on these files, which range from 2006 to 2016, with the vast majority having been modified in the last five years. The distribution of dates is illustrated below.

Here are some of the findings of this exercise:

Summary statistics

  • Droid reported that 3752 individual files were present*

  • 1382 (37%) of the files were given a file format identification by Droid

  • 1368 (99%) of those files that were identified were given just one possible identification. 12 files were given two possible identifications and a further two were given 18 possible identifications. In all these cases, the identification was done by file extension rather than signature - which perhaps explains the uncertainty

Files that were identified

  • Of the 1382 files that were identified: 
    • 668 (48%) were identified by signature (which suggests a fairly accurate identification - if a file is identified by signature it means that Droid has looked inside the file and seen something that it recognises: the 'magic' byte patterns characteristic of a format)
    • 648 (47%) were identified by extension alone (which implies a less accurate identification)
    • 65 (5%) were identified by container. These were all Microsoft Office files - xlsx and docx - as these are in effect zip files (which suggests a high level of accuracy)

  • 111 (8%) of the identified files had a file extension mismatch - this means that the file extension was not what you would expect given the identification by signature. 
    • All but 16 of these files were XML files that didn't have the .xml file extension (there were a range of extensions for these files including .orig, .plot, .xpr, .sc, .svg, .xci, .hwh, .bxml, .history). This isn't a very surprising finding given the ubiquity of XML files and the fact that applications often give their own XML output different extensions.

  • 34 different file formats were identified within the collection of research data

  • Of the identified files, 360 (26%) were XML files. This was by far the most common file format identified within the research dataset. The top ten identified formats are as follows:
    • Extensible Markup Language - 360
    • Log File - 212
    • Plain Text File - 186
    • AppleDouble Resource Fork - 133
    • Comma Separated Values - 109
    • Microsoft Print File - 77
    • Portable Network Graphics - 73
    • Microsoft Excel for Windows - 57
    • ZIP Format - 23
    • XYWrite Document - 21

Files that weren't identified

  • Of the 2370 files that weren't identified by Droid, 107 different file extensions were represented

  • 614 (26%) of the unidentified files had no file extension at all. This does rather limit the chance of identification, given that Droid and other similar tools rely quite heavily on identification by file extension. Of course it also limits our ability to actively curate this data unless we can identify it by another means.

  • The most common file extensions for the files that were not identified are as follows:
    • dat - 286
    • crl - 259
    • sd - 246
    • jdf - 130
    • out - 50
    • mrc - 47
    • inp - 46
    • xyz - 37
    • prefs - 32
    • spa - 32

Some thoughts

  • This is all very interesting and does back up our assertions about the long tail of file formats within a collection of research data and the challenges of identifying this data using current tools. I'd be interested to know whether a higher success rate would be expected for other collections of born digital data (not research data). Is identification of 37% of files a particularly bad result or is it similar to what others have experienced?

  • As mentioned in a previous blog post, one area of work for us is to get some more research data file formats into PRONOM (a key database of file formats that is utilised by digital preservation tools). Alongside previous work on the top software applications used by researchers (a little of this is reported here) this has been helpful in informing our priorities when considering which formats we would like The National Archives to develop signatures for in phase 3 of the project.

  • Given the point made above, it could be suggested that one of our priorities for file format research should be .dat files. This would make sense being that we have 286 of these files and they are not identified by any means. However, here lies a problem. This is actually a fairly common file extension. There are many different types of .dat files produced by many different applications. PRONOM already holds information on two varieties of .dat file and the .dat files that we hold appear to come from several different software applications. In short, solving the .dat problem is not a trivial exercise!

  • It strikes me that we are really just scratching the surface here. Though it is good we are getting a few file signatures developed as an output of this project, this is clearly not going to make a big impact given the size of the problem. We will need to think about how the community should continue this work going forward.

  • It has been really helpful having some genuine research data to investigate when thinking about preservation workflows - particularly those workflows for unidentified files that we were considering in some detail during phase 2 of our project. The unidentified file report that has been developed within Archivematica as a result of this project helpfully organises the files by extension and I had not envisaged at the time that so many files would have no file extension at all. We have discussed previously how useful it is to fall back on identification by file extension if identification by signature is unsuccessful but clearly for so many of our files this will not be worth attempting.

* Note that this turns out not to be entirely accurate given that eight of the datasets were zipped up into a .rar archive. Though Droid looks inside several types of archive files (including .zip, .gzip, .tar, and the web archival formats .arc and .warc) it does not yet look inside .rar, .7z, .bz, and .iso files. I didn't realise this until after I had carried out some analysis on the results. Consequently there are another 1291 files which I have not reported on here (though a quick run of Droid after unzipping them manually identified 33% of the files - a similar ratio to the rest of the data). Note that this functionality is something that the team at The National Archives intend to develop in the future.


  1. Do you know how long it takes for TNA to identify requested formats? We're also looking at contributing more formats to PRONOM, and were curious how long it might take to do that. Have you looked into any tools for creating format signatures?

  2. Hi Nick - I think this very much depends on TNA's resource and priorities at the time and perhaps also the quality of the information and sample data that is sent. If you submit information via their online form I think they promise to get back to you within 10 days. I haven't tried creating any signatures myself but it would be great if more people could do so.

  3. I should also add that we are working with TNA as part of this project and giving them some additional resource to carry out this work because we have a short time frame in which we would like the research data signatures to be created in order to meet our own project deadlines, but I know this scenario isn't typical!

  4. It'd be interesting to see how DROID compares to other tools. I'm a big fan of FILE for file identification, which provides better results than DROID. It's widely supported in the Linux/Unix world, has been around for decades and can be set to perform a header scan for improved accuracy. It's part of the FITS toolkit.

  5. This was a very interesting blog post to read, thank you Jenny. I've been doing similar work with DROID on one dataset (353GB - ~36'000 files) we've been looking after for 20 years so it was good to be able to compare notes. We have traditionally very heterogeneous research data at the NGDC, but this particular archive covers the period between 1979-2003 (biggest bulk from 1995) and is probably even more varied than newer datasets. There are 340 different file extensions, of which DROID identified the PUID for 95. 78 had no extension and 57 were not recognised. The data creators renamed some files which makes the identification even harder, and there is some modelling data which can be excluded. We are looking into providing TNA with information on some of the (mainly) geospatial data types we can recognise but as usual it is a matter of resources and priorities.

  6. Another helpful and interesting comment from Johan van der Knijff here:

  7. Hi Jenny! We just tried to respond to some of your questions in this excellent post with a post of our own!

  8. And another helpful response from Andy Jackson here:
    Thanks everyone - some great discussion going on around this topic!

  9. Ross Spencer blogs about 'Five Star File Signature Development' over here on the OPF blog:

  10. A further comparison from Simon Wilson based on the born digital data held at the Hull History Centre is also made in a follow up blog post:

  11. Jenny, I am curious what Signature version you used and if you had DROID set to scan the whole file. This can lead to more accurate results. I also noticed JDF in your list of not-identified files, but they are XML like the others you mentioned. Great test, thank you for sharing.

  12. Hi Thorsted - it would have been the most up to date version at the time (which is V84). Droid was configured to scan 65536 bytes at the start and end of each file, rather than scanning the whole file - apologies, I should have mentioned this information in the blog post itself.

    Regarding JDF files - these are actually JEOL NMR Spectroscopy files and are a binary format. You can read a bit more about this here, as this format is one of the formats we asked TNA to develop a signature for as part of this project:

    I guess this highlights why identification by file extension is inaccurate - perhaps you are referring to another JDF format?

    Thanks for the helpful feedback!