Pages

Monday, 21 November 2016

Every little bit helps: File format identification at Lancaster University

This is a guest post from Rachel MacGregor, Digital Archivist at Lancaster University. Her work on identifying research data follows on from the work of Filling the Digital Preservation Gap and provides a interesting comparison with the statistics reported in a previous blog post and our final project report.

Here at Lancaster University I have been very inspired by the work at York on file format identification and we thought it was high time I did my own analysis of the one hundred or so datasets held here.  The aim is to aid understanding of the nature of research data as well as to inform our approaches to preservation.  Our results are comparable to York's in that the data is characterised as research data (as yet we don't have any born digital archives or digitised image files).  I used DROID (version 6.2.1) as the tool for file identification - there are others and it would be interesting to compare results at some stage with results from using other software such as FILE (FITS), Apache Tika etc.

The exercise was carried out using the following signature files: DROID_SignatureFile_V88 and container-signature-file-20160927.  The maximum number of bytes DROID was set to scan at the start and end of each file was 65536 (which is the default setting when you install DROID).

Summary of the statistics:

There were a total of 24,705 files (so a substantially larger sample than in the comparable study at York)

Of these: 
  • 11008 (44.5%) were identified by DROID and 13697 (55.5%) not.
  • 99.3% were given one file identification and 76 files had multiple identifications.  
    • 59 files had two possible identifications
    • 13 had 3 identifications
    • 4 had 4 possible identifications.  
  • 50 of these files were asc files identified (by extension) as either 8-bit or 7-bit ASCII text files.  The remaining 26 were identified by container as various types of Microsoft files. 

Files that were identified

Of the 11008 identified files:
  • 89.34% were identified by signature: this is the overwhelming majority, far more than in Jen's survey
  • 9.2% were identified by extension, a much smaller proportion than at York
  • 1.46% identified by container

However there was one large dataset containing over 7,000 gzip files, all identified by signature which did skew the results rather.  With those files removed, the percentages identified by different methods were as follows:

  • 68% (2505) by signature
  • 27.5% (1013) by extension
  • 4.5% (161) by container
This was still different from York's results but not so dramatically.

Only 38 were identified as having a file extension mismatch (0.3%) but closer inspection may reveal more.  Of these most were Microsoft files with multiple id's (see above) but also a set of lsm files identified as TIFFs.  This is not a format I'm familiar with although it seems as if lsm is a form of TIFF file but how do I know if this is a "correct" id or not?

59 different file formats were identified, the most frequently occurring being the GZIP format (as mentioned above) with 7331 instances.  The next most popular was, unsurprisingly xml (similar to results at York) with 1456 files spread across the datasets.  The top 11 were:

Top formats identified by DROID for Lancaster University's research data


Files that weren't identified

There were 13697 files not identified by DROID of which 4947 (36%) had file extensions.  This means there was a substantial proportion of files with no file extension (64%). This is much higher than the result at York which was 26%. As at York there were 107 different extensions in the unidentified files of which the top ten were:

Top counts of unidentified file extensions


Top extensions of unidentified files


This top ten are quite different to York's results, though in both institutions dat files topped the list by some margin! We also found 20 inp and 32 out files which also occur in York's analysis. 

Like Jen at York I will be looking for a format to analyse further to create a signature - this will be a big step for me but will help my understanding of the work I am trying to do as well as contribute towards our overall understanding of file format types.

Every little bit helps.

Tuesday, 15 November 2016

AtoM harvesting (part 1) - it works!

When we first started using Access to Memory (AtoM) to create the Borthwick Catalogue we were keen to enable our data to be harvested via OAI-PMH (more about this feature of AtoM is available in the documentation). Indeed the ability to do this was one of our requirements when we were looking to select a new Archival Management System (read about our system requirements here).

Look! Archives now available in Library Catalogue search
So it is with great pleasure that I can announce that we are now exposing some of our data from AtoM through our University Library catalogue YorSearch. Dublin Core metadata is automatically harvested nightly from our production AtoM instance - so we don't need to worry about manual updates or old versions of our data hanging around.

Our hope is that doing this will allow users of the Library Catalogue (primarily staff and students at the University of York) to happen upon relevant information about the archives that we hold here at the Borthwick whilst they are carrying out searches for other information resources.

We believe that enabling serendipitous discovery in this way will benefit those users of the Library Catalogue who may have no idea of the extent and breadth of our holdings and who may not know that we hold archives of relevance to their research interests. Increasing the visibility of the archives within the University of York is an useful way of signposting our holdings and we think this should bring benefits both to us and our potential user base.

A fair bit of thought (and a certain amount of tweaking within YorSearch) went into getting this set up. From the archives perspective, the main decision was around exactly what should be harvested. It was agreed that only top level records from the Borthwick Catalogue should be made available in this way. If we had enabled the harvesting of all levels of records, there was a risk that search results would have been swamped by hundreds of lower level records from those archives that have been fully catalogued. This would have made the search results difficult to understand, particularly given the fact that these results could not have been displayed in a hierarchical way so the relationships between the different levels would be unclear. We would still encourage users to go direct to the Borthwick Catalogue itself to search and browse lower levels of description.

It should also be noted that only a subset of the metadata within the Borthwick Catalogue will be available through the Library Catalogue. The metadata we create within AtoM is compliant with ISAD(G): General International Standard Archival Description which contains 26 different data elements. In order to facilitate harvesting using OAI-PMH, data within AtoM is mapped to simple Dublin Core and this information is available for search and retrieval via YorSearch. As you can see from the screen shot below, Dublin Core does allow a useful level of information to be harvested, but it is not as detailed as the original record.

An example of one of our archival descriptions converted to Dublin Core within YorSearch

Further work was necessary to change the default behaviour within Primo (the software that YorSearch runs on) which displayed results from the Borthwick Catalogue with the label Electronic resource. This is what it calls anything that is harvested as Dublin Core. We didn't think this would be helpful to users because even though the finding aid itself (within AtoM) is indeed an electronic resource, the actual archive that it refers to isn't. We were keen that users didn't come to us expecting everything to be digitised! Fortunately it was possible to change this label to Borthwick Finding Aid, a term that we think will be more helpful to users.
Searches within our library catalogue (YorSearch) now surface Borthwick finding aids, harvested from AtoM.
These are clearly labelled as Borthwick Finding Aids.


Click through to a Borthwick Finding Aid and you can see the full archival description in AtoM in an iFrame

Now this development has gone live we will be able to monitor the impact. It will be interesting to see whether traffic to the Borthwick Catalogue increases and whether a greater number of University of York staff and students engage with the archives as a result.

However, note that I called this blog post AtoM harvesting (part 1).

Of course that means we would like to do more.

Specifically we would like to move beyond just harvesting our top level records as Dublin Core and enable harvesting of all of our archival descriptions in full in Encoded Archival Description (EAD) - an XML standard that is closely modelled on ISAD(G).  This is currently not possible within AtoM but we are hoping to change this in the future.

Part 2 of this blog post will follow once we get further along with this aim...




Monday, 14 November 2016

Automating transfers with Automation Tools

This is a guest post by Julie Allinson, Technology Development Manager for Library & Archives at York. Julie has been working on York's implementation for the 'Filling the Digital Preservation Gap' project. This post describes how we have used Artefactual Systems' Automation Tools at York.

For Phase three of our 'Filling the Digital Preservation Gap' we have delivered a proof-of-concept implementation to to illustrate how PURE and Archivematica can be used as part of a Research Data management lifecycle.

One of the requirements for this work was the ability to fully automate a transfer in Archivematica. Automation Tools is a set of python scripts from Artefactual Systems that are designed to help.

The way Automation Tools works is that a script (transfer.py) runs regularly at a set interval (as cron task). The script is fed a set of parameters and, based on these, checks for new transfers in the given transfer source directory. On finding something, a transfer in Archivematica is initiated and approved.

One of the neat features of Automation Tools is that if you need custom behaviour, there are hooks in the transfer.py script that can run other scripts within specified directories. The 'pre-transfer' scripts are run before the transfer starts and 'user input' scripts can be used to act when manual steps in the processing are reached. A processing configuration can be supplied and this can fully automate all steps, or leave some manual as desired.

The best way to use Automation Tools is to fork the github repository and then add local scripts into the pre-transfer and/or user-input directories.

So, how have we used Automation Tools at York?

When a user deposits data through our Research Data York (RDYork) application, the data is written into a folder within the transfer source directory named with the id of our local Fedora resource for the data package. The directory sits on filestore that is shared between the Archivematica and RDYork servers. On seeing a new transfer, three scripts run:

1_datasets_config.py - this script copies the dedicated datasets processing config into the directory where the new data resides.

2_arrange_transfer.py - this script simply makes sure the correct file permissions are in place so that Archivematica can access the data.

3_create_metadata_csv.py - this script looks for a file called 'metadata.json' which contains metadata from PURE and if it finds it, processes the contents and writes out a metadata.csv file in a format that Archivematica will understand. These scripts are all fairly rudimentary, but could be extended for other use cases, for example to process metadata files from different sources or to select a processing config for different types of deposit.

Our processing configuration for datasets is fully automated so by using automation tools we never have to look at the Archivematica interface.

With transfer.py as inspiration I have added a second script called status.py. This one speaks directly to APIs in our researchdatayork application and updates our repository objects with information from Archiveamtica, such as the UUID for the AIP and the location of the package itself. In this way our two 'automation' scripts keep researchdatayork and Archivematica in sync. Archivematica is alerted when new transfers appear and automates the ingest, and researchdatayork is updated with the status once Archivematica has finished processing.

The good news is, the documentation for Automation Tools is very clear and that makes it pretty easy to get started. Read more at https://github.com/artefactual/automation-tools

Friday, 4 November 2016

From old York to New York: PASIG 2016

My walk to the conference on the first day
Last week I was lucky enough to attend PASIG 2016 (Preservation and Archiving Special Interest Group) at the Museum of Modern Art in New York. A big thanks to Jisc who generously funded my conference fee and travel expenses. This was the first time I have attended PASIG but I had heard excellent reports from previous conferences and knew I would be in for a treat.

On the conference website PASIG is described as "a place to learn from each other's practical experiences, success stories, and challenges in practising digital preservation." This sounded right up my street and I was not disappointed. The practical focus proved to be a real strength.

The conference was three days long and I took pages of notes (and lots of photographs!). As always, it would be impossible to cover everything in one blog post so here is a round up of some of my highlights. Apologies to all of those speakers who I haven't mentioned.

Bootcamp!
 The first day was Bootcamp - all about finding your feet and getting started with digital preservation. However, this session had value not just for beginners but for those of us who have been working in this area for some time. There are always new things to learn in this field and a sometimes a benefit in being walked through some of the basics.

The highlight of the first day for me was an excellent talk by Bert Lyons from AVPreserve called "The Anatomy of Digital Files". This talk was a bit of a whirlwind (I couldn't type my notes fast enough) but it was so informative and hugely valuable. Bert talked us through the binary and hexadecimal notation systems and how they relate to content within a file. This information backed up some of the things I had learnt when investigating how file format signatures are created and really should be essential learning for all digital archivists. If we don't really understand what digital files are made up of then it is hard to preserve them.

Bert also went on to talk about the file system information - which is additional to the bytes within the file - and how crucial it is to also preserve this information alongside the file itself. If you want to know more, there is a great blog post by Bert that I read earlier this year - What is the chemistry of digital preservation?. It includes a comparison about the need to understand the materials you are working with whether you are working in physical conservation or digital preservation. One of the best blog posts I've read this year so pleased to get the chance to shout about it here!

Hands up if you love ISO 16363!
Kara Van Malssen, also from AVPreserve gave another good presentation called "How I learned to stop worrying and love ISO16363". Although specifically intended for formal certification, she talked about its value outside the certification process - for self assessment, to identify gaps and to prioritise further work. She concluded by saying that ISO16363 is one of the most valuable digital preservation tools we have.

Jon Tilbury from Preservica gave a thought provoking talk entitled "Preservation Architectures - Now and in the Future". He talked about how tool provision has evolved, from individual tools (like PRONOM and DROID) to integrated tools designed for an institution, to out of the box solutions. He suggested that the fourth age of digital preservation will be embedded tools - with digital preservation being seamless and invisible and very much business as usual. This will take digital preservation from the libraries and archives sector to the business world. Users will be expecting systems to be intuitive and highly automated - they won't want to think in OAIS terms. He went on to suggest that the fifth age will be when every day consumers (specifically his mum!) are using the tools without even thinking about it! This is a great vision - I wonder how long it will take us to get there?

Erin O'Meara from University of Arizona Libraries gave an interesting talk entitled "Digital Storage: Choose your own adventure". She discussed how we select suitable preservation storage and how we can get a seat at the table for storage discussions and decisions within our institutions. She suggested that often we are just getting what we are given rather than what we actually need. She referenced the excellent NDSA Levels of Digital Preservation which are a good starting point when trying to articulate preservation storage needs (and one which I have used myself). Further discussions on Twitter following on from this presentation highlighted the work on preservation storage requirements being carried out as a result of a workshop at iPRES 2016, so this is well worth following up on.

A talk from Amy Rushing and Julianna Barrera-Gomez from the University of Texas at San Antonio entitled "Jumping in and Staying Afloat: Creating Digital Preservation Capacity as a Balancing Act" really highlighted for me one of the key messages that has come out of our recent project work for Filling the Digital Preservation Gap. This is that, choosing a digital preservation system is relatively easy but actually deciding how to use it is the harder! After ArchivesDirect (a combination of Archivematica and DuraSpace) was selected as their preservation system (which included 6TB of storage), Amy and Julianna had a lot of decisions to make in order to balance the needs of their collections with the available resources. It was a really interesting case study and valuable to hear how they approached the problem and prioritised their collections.

The Museum of Modern Art in New York
Andrew French from Ex Libris Solutions gave an interesting insight into a more open future for their digital preservation system Rosetta. He pointed out that institutions when selected digital preservation systems focus on best practice and what is known. They tend to have key requirements relating to known standards such as OAIS, Dublin Core, PREMIS and METS as well as a need for automated workflows and a scalable infrastructure. However, once they start using the tool, they find they want other things too - they want to plug in different tools that suit their own needs.

In order to meet these needs, Rosetta is moving towards greater openness, enabling institutions to swap out any of the tools for ingest, preservation, deposit or publication. This flexibility allows the system to be better suited for a greater range of use cases. They are also being more open with their documentation and this is a very encouraging sign. The Rosetta Developer Network documentation is open to all and includes information, case studies and workflows from Rosetta users that help describe how Rosetta can be used in practice. We can all learn a lot from other people even if we are not using the same DP system so this kind of sharing is really great to see.

MOMA in the rain on day 2!
Day two of PASIG was a practitioners knowledge exchange. The morning sessions around reproducibility of research were of particular interest to me given my work on research data preservation and it was great to see two of the presentations referencing the work of the Filling the Digital Preservation Gap project. I'm really pleased to see our work has been of interest to others working in this area.

One of the most valuable talks of the day for me was from Fernando Chirigati from New York University. He introduced us to a useful new tool called ReproZip. He made the point that the computational environment is as important as the data itself for the reproducibility of research data. This could include information about libraries used, environment variables and options. You can not expect your depositors to find or document all of the dependencies (or your future users to install them). What ReproZip does is package up all the necessary dependencies along with the data itself. This package can then be archived and re-used in the future. ReproZip can also be used to unpack and re-use the data in the future. I can see a very real use case for this for researchers within our institution.

Another engaging talk from Joanna Phillips from the Guggenheim Museum and and Deena Engel of New York University described a really productive collaboration between the two institutions. Computer Science students from NYU have been working closely with the time-based media conservator at the museum on the digital artworks in their care. This symbiotic relationship enables the students to earn credit towards their academic studies whilst the museum receives valuable help towards understanding and preserving some of their complex digital objects. Work that the students carry out includes source code analysis and the creation of full documentation of the code so that is can be understood by others. Some also engage with the unique preservation challenges within the artwork, considering how it could be migrated or exhibited again. It was clear from the speakers that both institutions get a huge amount of benefit from this collaboration. A great case study!

Karen Cariani from WGBH Educational Foundation talked about their work (with Indiana University Libraries) to build HydraDAM2. This presentation was of real interest to me given our recent Filling the Digital Preservation Gap project in which we introduced digital preservation functionality to Hydra by integrating it with Archivematica.  HydraDAM2 was a different approach, building a preservation head for audio-visual material within Hydra itself. Interesting to see a contrasting solution and to note the commonalities between their project and ours (particularly around the data modelling work and difficulties recruiting skilled developers).

More rain at the end of day 2
Ben Fino Radin from the Museum of Modern Art in "More Data, More Problems: Designing Efficient Workflows at Petabyte Scale" highlighted the challenges of digitising their time-based media holdings and shared some calculations around how much digital storage space would be required if they were to digitise all of their analogue holdings. This again really highlighted some big issues and questions around digital preservation. When working with large collections, organisations need to prioritise and compromise and these decisions can not be taken lightly. This theme was picked up again on day 3 in the session around environmental sustainability.

The lightning talks on the afternoon of the second day were also of interest. Great to hear from such a range of practitioners.... though I did feel guilty that I didn't volunteer to give one myself! Next time!

On the morning of day 3 we were treated to an excellent presentation by Dragan Espenschied from Rhizome who showed us Webrecorder. Webrecorder is a new open source tool for creating web archives. It uses a single system both for initial capture and subsequent access. One of its many strengths appears to be the ability to capture dynamic websites as you browse them and it looks like it will be particularly useful for websites that are also digital artworks. This is definitely one to watch!

MOMA again!
Also on day 3 was a really interesting session on environmental responsibility and sustainability. This was one of the reasons that PASIG made me think...this is not the sort of stuff we normally talk about so it was really refreshing to see a whole session dedicated to it.

Eira Tansey from the University of Cincinnati gave a very thought provoking talk with a key question for us to think about - why do we continue to buy more storage rather than appraise? This is particularly important considering the environmental costs of continuing to store more and more data of unknown value.

Ben Goldman of Penn State University also picked up this theme, looking at the carbon footprint of digital preservation. He pointed out the paradox in the fact we are preserving data for future generations but we are powering this work with fossil fuels. Is preserving the environment not going to be more important to future generations than our digital data? He suggested that we consider the long term impacts of our decision making and look at our own professional assumptions. Are there things that we do currently that we could do with less impact? Are we saving too many copies of things? Are we running too many integrity checks? Is capturing a full disk image wasteful? He ended his talk by suggesting that we should engage in a debate about the impacts of what we do.

Amelia Acker from the University of Texas at Austin presented another interesting perspective on digital preservation in mobile networks, asking how our collections will change as we move from an information society to a networked era and how mobile phones change the ways we read, write and create the cultural record. The atomic level of the file is no longer there on mobile devices.  Most people don't really know where the actual data is on their phones or tablets, they can't show you the file structure. Data is typically tied up with an app and stored in the cloud and apps come and go rapidly. There are obvious preservation challenges here! She also mentioned the concept of the legacy contact on Facebook...something which had passed me by, but which will be of interest to many of us who care about our own personal digital legacy.

Yes, there really is steam coming out of the pavements in NYC
The stand out presentation of the conference for me was "Invisible Defaults and Percieved Limitations: Processing the Juan Gelman Files" from Eliva Arroyo-Ramirez from Princeton University. She described the archive of Juan Gelman, an Argentinian poet and human rights activist. Much of the archive was received on floppy disks and included documents relating to his human rights work and campaigns for the return of his missing son and daughter-in-law. The area she focused on within her talk was about how we preserve files with accented characters in the file names.

Diacritics can cause problems when trying to open the files or use our preservation tools (for example Bagger). When she encountered problems like these she put a question out to the digital preservation community asking how to solve the problem and she was grateful to receive so many responses but at the same time was concerned about the language used. It was suggested that she 'scrub', 'clean' or 'detox' the file names in order to remove the 'illegal characters' but she was concerned that our attitudes towards accented characters further marginalises those who do not fit into our western ideals.

She also explored how removing or replacing these accented characters would impact on the files themselves and it was clear that meaning would change significantly.  'Campaign' (a word included in so many of the filenames) would change to 'bell'. She decided not to change the file names but to try and find a work around and she was eventually successful in finding a way to keep the filenames as they were (using the command line to turn the latin characters to UTF8). The message that she ended on was that we as archivists should do no harm whether we are dealing with physical or digital archives. We must juggle our priorities but think hard about where we compromise and what is important to preserve. It is possible to work through problems rather than work around them and we need to be conscious of the needs of collections that fall outside our defaults. This was real food for thought and prompted an interesting conversation on twitter afterwards.

Times Square selfie!
Not only did I have a fantastic week in New York (its not every day you can pop out in your lunch break to take a selfie in Times Square!), but I also came away with lots to think about. PASIG is a bit closer to home next year (in Oxford) so I am hoping I'll be there!