Friday, 7 November 2014

A non-archivist's perspective on cataloguing born digital archives

As blogged in my previous post, earlier this week I attended an ARA Section for Archives and Technology event on Born Digital Cataloguing and also had the opportunity to talk about some of the Borthwick's current work in this area.

I gave a non-archivist's perspective on born digital cataloguing. These were the main points I tried to put across, though some of the points below were also informed by discussions on the day:

  • Born digital cataloguing within a purely digital archive is reasonably straightforward. The real complexity comes when working with hybrid archives where content is both physical and digital
  • The Archaeology Data Service are good at born digital cataloguing. This is partly because they only have digital material to worry about, but also down to the fact that they have many years of experience and the necessary systems in place. Their new ADS Easy system allows depositors to submit data for archiving along with the required metadata (which they can enter both at project level and individual file level). A web interface for disseminating this data can then be created in a largely automated fashion. It makes sense to ask the person who knows the most about the data to catalogue it, freeing up the digital archivists' time to focus on checking the received data and metadata and more specialist digital preservation work.
  • Communication can be a problem between traditional archivists and digital archivists. We may use different metadata standards and we may not always know what the other is talking about. I was at the Borthwick Institute for approximately a year before I worked out that when my colleagues talked about describing archives at file level (which may cover multiple physical documents within the same physical file), they didn't mean the same as my perception of 'file level metadata' (which would apply to a single digital item). It is important to recognise these differences and try and work around them so that we understand each other better when working with hybrid archives.
A digital archivist may speak a different language to traditional archivists, but we can work around this

  • At the Borthwick we are in the process of implementing a new system for accessioning and cataloguing archives (both physical and digital archives). We have installed a version of AtoM (Access to Memory) and have imported one of our more complex catalogues into it. We now need to build on this proof of concept and fully establish and populate this system. As well as holding information about our physical holdings, this system will provide a means of cataloguing born digital data and also the foundations on which a digital archiving system can be built. It will also provide the means by which we can disseminate digital objects to our users.
  • There are other types of metadata that are required for digital material and these are outside the scope of AtoM which is primarily for resource discovery metadata. More technical metadata relating to digital objects and any transformations they undergo needs to reside within a digital archiving system. This is where Archivematica comes in. We are currently testing this digital preservation system to establish whether it meets our digital archiving needs.
  • I worry about the identifiers we use within archival catalogues. The traditional archival identifier is performing two jobs – firstly acting as a unique identifier or reference for an item or group of items, and secondly showing where those items sit within the archival hierarchy. This can lead to problems...
    • ...if the arrangement of the archive changes – the identifier may then change too, which is never a good thing once it has been published and made available, or if that identifier is being used to link between different systems.
    • ...if we want to start describing objects before we know where they sit within the hierarchy. This may be the case in particular for digital material where we may want to start working with it with greater urgency than the physical element of the archive.*
  • We can argue that digital isn't different, but with digital we do tend to think more at item level. Digital preservation activities and the technical and preservation metadata that this generates are all at file (item) level, so perhaps it makes sense for the resource discovery metadata to follow this pattern. Unlike physical archives, for digital archives we can pretty easily generate a title (or file name) for every item. If we are to deal with digital archives at file level would this cause confusion when cataloguing a hybrid archive?
  • Before we incorporate digital material into a digital archive, some selection and appraisal needs to be carried out. Depending on the digital archiving system in use, it can be non-trivial to remove files from an AIP (archival information package) once they have been transferred, so we really do need to have a good idea of what is and isn't included before we carry out our preservation activities. In order to carry out this selection we may wish to start putting together a skeletal description of each item. Wouldn't it be nice if we could start to do this in a way which could be easily transferred into an archival management system? At the moment I am doing this in a separate spreadsheet but we need strategies that are more sustainable and scalable.
  • Workflows are crucially important. Who does the born digital cataloguing where hybrid archives are concerned? The place of the digital material within the archive as a whole is key, so it should be catalogued in tandem with the physical, but if we want to archive the digital material more rapidly than the physical, how do we ensure we have the right workflows and procedures in place? Much of this will come down to institutional policies and procedures and the capabilities of the technologies you are using. These are still issues we are grappling with here at the Borthwick as we try and establish a framework for carrying out born digital cataloguing.
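
To make the idea of a skeletal description a little more concrete, here is a minimal sketch of how such a listing could be generated automatically rather than typed into a spreadsheet by hand. It is an illustration only, not a description of any system in use: the column names (`original_path`, `title`, `retain`, `notes`) and both function names are assumptions made for this example, and any real import into an archival management system would need its own field mapping.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical column names, chosen for illustration only.
FIELDS = ["original_path", "title", "size_bytes", "last_modified", "retain", "notes"]


def skeletal_listing(root):
    """Yield one minimal description per file found under `root`."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-directories in a predictable order
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            yield {
                # Path relative to the deposit, always written with "/"
                "original_path": os.path.relpath(full_path, root).replace(os.sep, "/"),
                # The file name stands in as a first-pass title
                "title": name,
                "size_bytes": info.st_size,
                "last_modified": modified.date().isoformat(),
                # Left blank for the person doing selection and appraisal
                "retain": "",
                "notes": "",
            }


def write_listing(root, csv_path):
    """Write the listing to a CSV file and return the number of rows written."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for row in skeletal_listing(root):
            writer.writerow(row)
            count += 1
    return count
```

The CSV should be written somewhere outside the folder being listed, otherwise it will appear in its own listing. The `retain` and `notes` columns are deliberately left empty so that appraisal decisions can be recorded against each item before anything is transferred.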

* as an aside (and a bit off-topic), my other bugbear with archival identifiers is that they contain slashes (which means we can’t use them in directory or file names) and that they don’t order correctly in a spreadsheet or database as they are a mixture of numeric and alphabetical characters.
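
Both of these bugbears can be worked around in code. The sketch below is a minimal illustration in Python: the reference codes such as `ABC/1/10` are made up, the function names are hypothetical, and the choice of an underscore as the replacement for the slash is an assumption that only holds if underscores never appear in the identifiers themselves.

```python
import re


def natural_key(reference):
    """Sort key that compares runs of digits as numbers rather than as text."""
    # re.split with a capturing group keeps the digit runs, and always
    # places them at the odd positions of the resulting list.
    parts = re.split(r"([0-9]+)", reference)
    return [int(part) if index % 2 else part.lower()
            for index, part in enumerate(parts)]


def to_filename(reference, separator="_"):
    """Return a form of the identifier that is safe to use in a file name."""
    if separator in reference:
        # Refuse rather than produce a name that cannot be converted back.
        raise ValueError(f"{reference!r} already contains {separator!r}")
    return reference.replace("/", separator)


def from_filename(name, separator="_"):
    """Reverse to_filename, restoring the slashes."""
    return name.replace(separator, "/")


codes = ["ABC/1/10", "ABC/1/2", "ABC/1/1"]
print(sorted(codes))                   # ['ABC/1/1', 'ABC/1/10', 'ABC/1/2']
print(sorted(codes, key=natural_key))  # ['ABC/1/1', 'ABC/1/2', 'ABC/1/10']
print(to_filename("ABC/1/10"))         # ABC_1_10
```

This only helps where you control the sorting yourself. A spreadsheet would still need a helper column (for example, the numeric parts padded with leading zeros) to order the codes correctly.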

Wednesday, 5 November 2014

Born Digital Cataloguing (some thoughts from the ARA SAT event)

First day back to work after my holiday and I am straight back into the fray – no quiet day catching up on e-mails and getting my head slowly back into work mode for me! On Monday I attended an ARA Section for Archives and Technology event on Born Digital Cataloguing and also had the opportunity to talk about some of our current work in this area.

It was great to see the event so well attended (the organisers had to find a bigger room due to the huge amount of interest!). This is clearly an important and interesting subject for many archives professionals and it was clear throughout the day that many of us are grappling with very similar issues. Here are some of the main points that I latched on to from the morning’s presentations:

  • It is important to preserve the directory structure of digital files as submitted into the archive – even if you subsequently move the files into a different structure. This is the equivalent of original order and can give context to the files. End users of the digital archive should also have access to this information so it is included in the description within the Discovery interface (Anthea Seles, TNA).
  • Users don’t really know what they want or need with regard to born digital material – it is too early to say and too new a field. We need to try and predict what they will require and also need to learn from our experiences as we go along (Anthea Seles, TNA).
  • “It’s all just stuff” – born digital archives should be treated the same as paper as far as possible (Chris Hilton, Wellcome Library).
  • Interesting case study from the Wellcome Library about how an archival management system (Calm) and a digital preservation system (Preservica) can work together. It is important to establish which data is duplicated between the 2 systems (there may be some overlap) and if this is the case, which is the master data and how the information is synchronised between systems. In this case study, digital data starts off in Preservica and overnight catalogue records are copied over into Calm. Calm then becomes the master for resource discovery metadata and any subsequent edits need to be made in Calm before syncing back to the digital archive (Chris Hilton, Wellcome Library).
  • Original order – in the Wellcome Library’s case study, the method the creator or donor used to store and order his digital files was different to the system of arrangement used for paper. Digital files were arranged chronologically but the paper archive was arranged according to themes. This results in a hybrid archive that is ordered or arranged inconsistently depending on the media and leaves the archivists with a decision to make (Victoria Sloyan, Wellcome Library).
  • Workflow is crucially important. It matters what happens when. Once data is ingested into a digital archive such as Preservica (I believe Archivematica is the same), it becomes difficult to remove individual items from the Archival Information Package. This becomes more of a problem when that information has also been replicated into an archival management system. Selection and appraisal therefore need to happen at an early stage in the workflow, and we also need to accept that our digital archives may not be perfect – we are unlikely to be able to weed out all redundant files on a first pass so we may end up with items in the digital archive that are not needed (Victoria Sloyan, Wellcome Library).
  • Should we stop using the word cataloguing and instead talk about ‘enabling discovery’? This is really what we are trying to do. We may end up moving away from the traditional archival catalogue (particularly for digital data) but we still need to ensure that we can enable our users to find the information they require. Digital collections may lead to alternative (less labour intensive) ways of enabling resource discovery (Jessica Womack and Rebecca Webster, Institute of Education).
  • We should be working with donors and depositors to get them to structure and label their data appropriately (and thus help with born digital cataloguing). It is very hard for archivists to deal with large quantities of digital data that has been created with little order or structure (Jessica Womack and Rebecca Webster, Institute of Education).
  • Digital is different to paper in that it requires more immediate action once it has been accepted into an archive and we need to ensure our processes, procedures and workflows can cope with this (Jessica Womack and Rebecca Webster, Institute of Education).
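
The first point above, about keeping a record of the directory structure as submitted, lends itself to a small worked example. The following is a minimal Python sketch and not a description of how TNA or anyone else at the event does this: the function names are hypothetical, and the approach of keying the record on a SHA-256 checksum is an assumption chosen so that a file can still be matched to its original location after it has been moved or renamed.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_original_order(root):
    """Map each file's checksum to the relative path(s) it was deposited under."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            relative = os.path.relpath(full_path, root).replace(os.sep, "/")
            # Identical files share a checksum, so keep a list of paths.
            manifest.setdefault(sha256_of(full_path), []).append(relative)
    for paths in manifest.values():
        paths.sort()
    return manifest


def original_paths_for(path, manifest):
    """Look up where a (possibly moved or renamed) file originally sat."""
    return manifest.get(sha256_of(path), [])
```

The manifest would be taken once, at the point of deposit, and stored alongside the files. Because the lookup relies on the content being unchanged, it will not match a file that has been edited or migrated to another format since deposit.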

The last scheduled presentation of the morning was from me, in which I gave a non-archivist's perspective on born digital cataloguing. I'll try and summarise some of my points in a separate post later this week.

And here are some of the main messages I took away from the day as a whole:

  • Try things out – it is better to do something now than to wait until we have a perfect solution. This is the best way of learning what works and what doesn't.
  • Accept that the solutions you put in place may be temporary. We are all learning, and born digital cataloguing is not a solved problem (particularly with regard to hybrid archives).
  • Be honest about failures as well as successes – others can learn as much from finding out what didn't work and why as they can from finding out what did.
  • Think about which approaches are scalable in the longer term. Digital archives are going to increase in size and volume and we need to explore different ways of enabling discovery.

Despite the fact that there were more problems than solutions highlighted during the course of the day, it was comforting (as always) to discover that we are not alone!