

KeepIt course 3: Introducing preservation workflow and format risks

KeepIt course module 3, London, 2 March 2010
Principal topic this module: significant properties
Tools this module: PREMIS, Open Provenance Model (OPM)
Presentations and tutorial exercises course 3 (source files)

After the simple game designed to highlight significant properties of data, we are ready to begin on preservation workflow and format risks. First the slideshow introduces some terms we will encounter later in the module: representation information, and the Open Archival Information System (OAIS). Unusually, some brief pre-course background reading was recommended to familiarise participants with the role of file formats: Preservation & Storage Formats for Repositories.

[slideshare id=3364088&doc=keepit-course3-workflow-100308062001-phpapp02]

Combining data with representation, e.g. markup (or encoding, in machine terms), produces what can be seen, by analogy, as a ‘performance’ of the digital object (slide 3). While this is easy to understand for audio-visual content, the analogy holds for text and other forms of content too: an element of machine performance is required to present content to an end user. We can already see that a number of components are required for a performance of a digital object, and our concept of performance introduces expectations, perhaps different ones, on the part of creator and user.

Our framework for preservation is described by OAIS (slide 4). From one view this model looks much like our repository model. A strength of OAIS is that you can use it to model processes in almost any system, but once you subscribe to its strictures you are bound to its detailed specifications. OAIS is not perfect, but it is a standard and is universally adopted in formal preservation practice. As you build preservation into your repository you will want to demonstrate that others can trust your procedures and your ability to manage data, and to do this you will need to show compliance with OAIS.

Broadly we can reduce OAIS to our three-stage repository model (slide 5), and within our ‘manage content’ activities we can identify the three stages of our format preservation workflow: check, analyse, action. When we stretch these out chronologically (slide 6) we can see there are tools for each stage, but the hard part is connecting format check with action. In other words, once we have identified the format of a digital object, we can migrate it to another format if needed. Connecting the identification to the decision to migrate is harder, because we don’t yet know how to evaluate whether or when to convert, or what to convert to. Again we can see there are tools to help us with this. Most of our work in course modules 3 and 4 will concern this ‘analyse’ bit in the middle of our workflow.
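As a rough illustration of the check, analyse, action stages described above, here is a minimal Python sketch. It is not a tool from the course. The check stage looks only at a handful of well-known leading byte signatures, and the `policy` argument (a mapping from an at-risk format to a preferred target) is a hypothetical stand-in for the analysis that modules 3 and 4 are about.

```python
from typing import Dict, Optional, Tuple

# A few well-known leading byte signatures. A real identification tool
# covers far more formats and also reports versions.
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\xff\xd8\xff": "JPEG",
    b"II*\x00": "TIFF",   # little-endian TIFF
    b"MM\x00*": "TIFF",   # big-endian TIFF
}


def check(data: bytes) -> str:
    """Check stage: identify the format from the first bytes of the file."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


def analyse(fmt: str, policy: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Analyse stage: turn an identified format into a decision.

    `policy` is a hypothetical local mapping of at-risk format -> target
    format. Deciding what belongs in it is the hard part of the workflow.
    """
    if fmt == "unknown":
        return ("investigate", None)
    if fmt in policy:
        return ("migrate", policy[fmt])
    return ("keep", None)


def workflow(data: bytes, policy: Dict[str, str]) -> Tuple[str, str, Optional[str]]:
    """Run check then analyse, and return (format, action, target)."""
    fmt = check(data)
    action, target = analyse(fmt, policy)
    return (fmt, action, target)


# With an empty policy nothing is migrated: the relaxed approach.
print(workflow(b"%PDF-1.4 example bytes", {}))
# An illustrative policy only, not a recommendation.
print(workflow(b"II*\x00 example bytes", {"TIFF": "JPEG"}))
```

The check and action stages are mechanical here. All of the judgement sits in what goes into `policy`, which is the gap the text describes.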

Rosenthal has argued against preemptive format migration, favouring a relaxed rather than an aggressive approach, and has referred to format obsolescence as the ‘prostate cancer’ of preservation: “It is a serious and ultimately fatal problem. But it is highly likely that something else will kill you first.” (slides 7-9)

Let’s look at what types of file format we might find in a typical institutional repository. We have two example format profiles from the Registry of Open Access Repositories (ROAR), past and present (slides 10, 11). Both show the dominance of PDF, not just the generic format but different versions of PDF. A recent summary of responses to a mailing list enquiry about repository file formats partly explains why PDF dominates (slide 12), but is this justified?
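A format profile of this kind is essentially a tally of identified formats, counted once by format and once by format and version. The sketch below assumes the identifications are already available as `(format, version)` pairs. The sample list is invented for illustration and is not taken from ROAR or from the slides.

```python
from collections import Counter


def format_profile(identifications):
    """Tally (format, version) pairs by format and by format+version."""
    items = list(identifications)  # allow generators to be counted twice
    by_format = Counter(fmt for fmt, _version in items)
    by_version = Counter(items)
    return by_format, by_version


# Invented sample data, for illustration only.
sample = [
    ("PDF", "1.4"),
    ("PDF", "1.4"),
    ("PDF", "1.3"),
    ("HTML", "4.01"),
    ("MS Word", None),  # version not identified
]

by_format, by_version = format_profile(sample)
for fmt, count in by_format.most_common():
    print(fmt, count)
```

Keeping the per-version counts alongside the per-format counts matters because, as the profiles show, one dominant format can hide several versions with different characteristics.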

We now know there are many factors that affect the risk associated with any file format, and 10 principal risk classifications have been identified by the National Archives (slide 13). We will use this classification as the basis of our first exercise in this module. The factors were not explained in any more detail than is given in the slide, and two that might require prior knowledge or documentation were struck out (slide 14). Our groups were then asked to select a pair of formats, to score the formats against each other for each risk factor, and to produce a final score and a recommendation in favour of one format or the other. In addition, groups were asked to suggest a reason why the ‘winning’ format might not be the best choice for their repository (slide 15). The results were reported in an earlier blog post. This exercise is described in a handout sheet used on the day, as well as in the course slides.
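The arithmetic of the exercise can be sketched in a few lines of Python. The scoring convention is an assumption made for this sketch: for each factor, one point goes to the format judged lower risk, and neither format scores on a tie. The factor names are placeholders, since the actual classifications are on the slide rather than in this post, and the scores are invented.

```python
def compare_formats(name_a, name_b, scores):
    """Total per-factor scores for two formats and name the preferred one.

    `scores` maps a risk factor to a (score_a, score_b) pair, where a
    higher score means lower risk on that factor (an assumed convention).
    Returns (total_a, total_b, preferred); preferred is None on a tie.
    """
    total_a = sum(a for a, _b in scores.values())
    total_b = sum(b for _a, b in scores.values())
    if total_a > total_b:
        preferred = name_a
    elif total_b > total_a:
        preferred = name_b
    else:
        preferred = None
    return total_a, total_b, preferred


# Placeholder factor names and invented scores for the eight factors
# left after two were struck out.
example_scores = {
    "factor 1": (1, 0),
    "factor 2": (0, 1),
    "factor 3": (1, 0),
    "factor 4": (1, 0),
    "factor 5": (0, 1),
    "factor 6": (0, 0),  # judged equal
    "factor 7": (1, 0),
    "factor 8": (0, 1),
}

print(compare_formats("Format A", "Format B", example_scores))
```

An unweighted total treats every factor as equally important, which is one reason a ‘winning’ format might still not be the best choice for a particular repository.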

We used to think that the best formats were open: open standards and open source (slide 16). Well, that tended to differentiate formats from Microsoft applications from most of the rest, and perhaps explains the recent history of PDF preference by repositories, but more recently Microsoft has lobbied successfully for its Office formats to be released as open standards. So now which popular text and document formats are best for preservation? MS Word, which can be converted to Open XML, or PDF, which can’t (refer again to the results of our group exercise)? It shows there is no single answer, and even if there were, it might have to be reconsidered at another time. Rosenthal provides some reassurance about the impact of format risk, even for proprietary formats (slide 17).

Repositories know they must work with their authors and avoid imposing unnecessary constraints. Authors no doubt create digital objects with more regard for functionality than for format, so repositories should neither impose format restrictions nor require authors to convert formats (slide 18). For now repositories should require source files from authors and make their own decisions, and that is what modules 3 and 4 will help them to do.

Before we end this section, let’s return to the exercise and consider the comparison of image formats (slide 19) – TIFF vs JPEG – rather than document formats. Again, there are cases for each. This might seem unlikely after three major institutions chose JPEG as an archival format over TIFF (slide 20), but an independent evaluation of both formats using the Plato preservation tool, which we will cover comprehensively in module 4, was less decisive, leaving a final decision to a later time (slide 21). Whichever formats we consider, format migration is not an open and shut case, and could use more sophistication than has been available to repositories thus far.

In a sense file formats are nice and well defined, if sometimes highly technical and complex. Less well defined are the other characteristics of a digital object that may affect its performance, or the way it is perceived, used and understood by different users. Our next session in this module will introduce and provide working examples of what are referred to as significant characteristics.
