Digital Archiving at the University of York: July 2013

The Cloisters were a good place to shelter from the heat!

Last week I was lucky enough to attend the first iteration of the Digital Preservation Coalition's Advanced Practitioner Course. This was a week-long course organised by the APARSEN project and based at the University of Glasgow during the warmest week of 2013 so far. On the first day Ingrid Dillo began by telling us that 'data is hot' - by the end of the week it was not only data that was hot.

It would be a very long blog post if I were to try to do justice to each and every presentation over the course of the week (I took 30 pages of notes), so here is the abridged version:

A list of twelve interesting things:

These are my main 'take home' snippets of information. Some are things I already knew that were reinforced at some point over the week, and others are things that were totally new to me or gave me different ways of looking at things. Some of these things are facts, some are tools and some are rather open-ended challenges.

A novel way to present a cake.
Photo credit: Jenny Mitcham

1) We can think about data and interpretations of data using the analogy of a cake (Ingrid Dillo). The raw ingredients of the cake (eggs, flour, sugar etc) represent the raw data. The cake as it comes out of the oven is the information that is held within that raw data. The iced and decorated cake is the presentation of the data - data can be presented in lots of different ways just as the same cake could be decorated in many different ways. The leftover crumbs of the cake on the plate after it is eaten represent the intangible knowledge that we have gained. This really reinforces for me the reason behind our digital preservation actions - curating and preserving the raw data so that others can create alternative interpretations of that data and we can all benefit from the knowledge that this will bring.

2) Quantifying the costs of digital curation is hard (Kirnn Kaur). No surprise really, considering there have been so many different projects looking at cost models for digital preservation. If it were easy, perhaps it would have been solved by a single project. The main problem seems to be that many of the cost models that are currently out there are quite specific to the organisation or domain that produced them. Some interesting work on comparing and testing cost models has been carried out by the APARSEN project and a report on all of this work is available here.

3) In 20 years' time we won't be talking about 'digital preservation' anymore, we will just be talking about 'preservation' (Ruben Riestra). There will be nothing special about preserving digital material; it will just be business as usual. I eagerly await 2033 when all my headaches will be over!

4) I am right in my feeling that uncompressed TIFF is the best preservation file format for images. The main contender, JPEG2000, is complicated and not widely supported (Photoshop Elements dropped support for it due to lack of interest from its users) (Tomasz Parkola).

5) There is a tool created by the SCAPE project called Matchbox. I haven't had a chance to try it out yet but it sounds like one of those things that could be worth its weight in gold. The tool lets you compare scanned pages of text to try and find duplicates and corresponding images. It uses visual keypoints and looks for structural similarity (Rainer Schmidt and Roman Graf). More tools like this that automate some of the more tedious and time-consuming jobs in digital curation are welcome!

6) Digital preservation policy needs to be on several levels (Catherine Jones). I was already aware of the concept of high level guidance (we tend to call this 'the policy') and then preservation procedures (which we would call a 'method statement' or 'strategy'), but Catherine suggests we go one step further with our digital preservation policy and create a control policy - a set of specific, measurable objectives. These could also be written in a machine-readable form so that they could be understood by preservation monitoring and planning tools, and the process of decision making based on these controls could be further automated. I really like this idea but think we may be a long, long way off achieving this particular policy nirvana!

7) The SCAPE project has produced a tool that can help with preservation and technology watch ...that elusive digital preservation function that we all say we do ...but actually do in such an ad hoc way that it feels like we may have missed something. Scout is designed to automate this process (Luis Faria). Obviously it needs a certain amount of work to give it the information that it needs in order for it to know what exactly it is meant to be monitoring (see comment on machine-readable preservation policies above), but this is one of those important things the digital preservation community as a whole should be investing its efforts in and collaborating on.

8) There is a new version of the Open Planets Foundation preservation planning tool Plato that sounds like it is definitely worth a look (Kresimir Duretec). I came across Plato when it was first released several years ago but, like many, decided it was too complex and time-consuming to bring into our preservation workflows in earnest. The project team have now developed the tool further and many of the more time-consuming elements of the workflow have been automated. I really need to re-visit this.

9) The beautifully named C3PO (Clever, crafty content profiling for objects) is another tool we can use to characterise and identify the digital files that we hold in our archives (Artur Kulmukhametov). We were lucky enough to have a chance to play with this during the training course. C3PO gives a really nice visual and interactive interface for viewing data from the FITS characterisation tool. I must admit I haven't really used FITS in earnest because of the complexity of managing conflicts between different tools, and the fact that the individual tools currently wrapped in FITS are not kept up-to-date. C3PO helps with the first of these problems but not so much with the latter. The SCAPE project team suggested that if more people start to use FITS then the issue might be resolved more quickly. It does become one of those difficult situations where people may not use the tool until the elements are updated but the tool developers may not update the elements until they see the tool being more widely used! I think I will wait and see.

Of the handful of example files I threw at FITS, my plain text file was identified as an HTML file by JHOVE, which is not ideal. We need greater accuracy in all of these characterisation tools so that we can have greater trust in the output they give us. How can we base our preservation planning actions on these tools unless we have this trust?

10) Different persistent identifier schemes offer different levels of persistence (Maurizio Lunghi). For a Digital Object Identifier (DOI) you only need the identifier and metadata to remain persistent, whereas the NBN namespace scheme requires that the accessibility of the resource itself is also persistent. Another reason why I think DOIs may be the way to go. We do not want a situation where we are not allowed to de-accession or remove access to any of the digital data within our care.

11) I now know the difference between a URN, a URL and a URI (Maurizio Lunghi) - this is progress!

12) In the future data is likely to be more active more of the time, making it harder to curate. Data in a repository or archive could be annotated, tagged and linked. This is particularly the case with research data as it is re-used over time. Can our current best practices of archiving be maintained on live datasets? Repositories and archives may need to rethink how they support researchers in this to allow for different and more dynamic research methodologies, particularly with regard to 'big data' which cannot simply be downloaded (Adam Carter).
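
To make the machine-readable control policy from point 6 a little more concrete, here is a minimal sketch in Python. It is not the SCAPE format or anything that was presented on the course: the JSON layout, the measure names (`compressed_image_ratio`, `identified_ratio`) and the thresholds are all invented for illustration. The idea is simply that each objective is held as data, so a monitoring tool could check it against a profile of the collection without a human having to read the policy document.

```python
import json
import operator

# Comparison operators a control is allowed to use (an assumption for this sketch).
OPERATORS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}

# Hypothetical control policy: each entry is one specific, measurable objective.
POLICY_JSON = """
[
  {"objective": "No compressed master images",
   "measure": "compressed_image_ratio", "operator": "==", "threshold": 0.0},
  {"objective": "At least 95% of files identified",
   "measure": "identified_ratio", "operator": ">=", "threshold": 0.95}
]
"""


def failed_objectives(policy, profile):
    """Return the objectives that the collection profile does not meet.

    A measure that is missing from the profile counts as a failure,
    because the objective cannot be shown to hold.
    """
    failures = []
    for control in policy:
        value = profile.get(control["measure"])
        compare = OPERATORS[control["operator"]]
        if value is None or not compare(value, control["threshold"]):
            failures.append(control["objective"])
    return failures


policy = json.loads(POLICY_JSON)

# Hypothetical profile, of the kind a characterisation run might produce.
profile = {"compressed_image_ratio": 0.0, "identified_ratio": 0.90}

failures = failed_objectives(policy, profile)
print(failures)
```

With the made-up profile above, only the identification objective fails, which is the sort of signal a watch or planning tool could then act on.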
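
On points 10 and 11, as a rough guide: URI is the umbrella term, a URL is a URI that says where a resource is and how to get to it, and a URN is a URI in the `urn:` scheme that names a resource without saying where it lives. The Python sketch below classifies an identifier by its scheme alone. The short list of 'locator' schemes is a simplification chosen for this example, and the identifiers are invented rather than real records.

```python
from urllib.parse import urlsplit

# Schemes treated as locators here; a deliberately short, simplified list.
LOCATOR_SCHEMES = {"http", "https", "ftp"}


def classify(identifier: str) -> str:
    """Classify an identifier as URL, URN, other URI, or not an absolute URI."""
    scheme = urlsplit(identifier).scheme.lower()
    if not scheme:
        # No scheme at all, e.g. a bare DOI name such as "10.1000/182".
        return "not an absolute URI"
    if scheme == "urn":
        return "URN"
    if scheme in LOCATOR_SCHEMES:
        return "URL"
    return "other URI"


# Invented examples for illustration only.
examples = [
    "https://example.org/dataset/42",
    "urn:nbn:example-0001",
    "mailto:someone@example.org",
    "10.1000/182",
]

for identifier in examples:
    print(identifier, "->", classify(identifier))
```

Everything that gets a label other than "not an absolute URI" is a URI; URLs and URNs are just two kinds of URI.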

So, there is much food for thought here and a lot more besides.

I also learnt that Glasgow too has heat waves, the Botanic Gardens are beautiful and that there is a very good ice cream parlour on Byres Road!

Jenny Mitcham, Digital Archivist

Tuesday 23 July 2013

Twelve interesting things I learnt last week in Glasgow
