KeepIt course module 3, London, 2 March 2010
Principal topic this module: significant properties
Tools this module: PREMIS, Open Provenance Model (OPM)
Presentations and tutorial exercises course 3 (source files)
In the original notice for this course the focus of module 3 was on description. That emphasis changed a little in the interim towards significant characteristics, as we saw in the previous session. Now that we understand which characteristics of a digital object might be significant in affecting its use both now and in the future, and how to identify those characteristics, we need to have a means to describe these characteristics and record any actions taken on the object that might alter them. Our earlier game revealed the importance of recording this information.
In this final section of module 3 we look at two means of recording information for preservation: one a well-established metadata dictionary, PREMIS, the other an emerging standard, the Open Provenance Model (OPM). A participant’s view of this session can be found
On the surface there is little to connect these metadata models in terms of the people involved and the starting point for each initiative. PREMIS comes from the digital library community, while OPM could be said to have roots in the computer science and Semantic Web community. One describes over 100 elements, the other barely more than a dozen. In terms of purpose, however, there is perhaps more common ground. So might the two approaches eventually collaborate, or be in competition? This is not clear yet, although I can report that my KeepIt colleague Dave Tarrant, a programmer and developer, was immediately drawn to the elegant and concise OPM, less so to the unwieldy PREMIS. I’ve always been unstinting in my praise for PREMIS.
Preservation Metadata Implementation Strategies (PREMIS)
When I first began looking at preservation metadata in 2004, there were many schemes. It was clear that someone would have to try to unite these disparate approaches, and thankfully PREMIS filled that gap. Yes, the PREMIS Data Dictionary for preservation metadata is extensive and initially daunting, but once the structure is understood, the interested and informed reader cannot but be struck by its coherent and comprehensive treatment. Each time I give this presentation (and I have done so previously for events by the Repositories Support Project, such as in 2008 in London and Bournemouth) I want to give those new to PREMIS a sense of this contribution.
The other key point from this presentation, invariably aimed at repository managers and administrators, is that however daunting PREMIS may appear, they are very likely already recording PREMIS metadata. One aim of the exercise attached to the presentation (slides 8, 9), and of the associated preservation metadata worksheet, is to select elements from the Dictionary that will be familiar to repository managers, such as object identifier, file size and format. Having established the basic familiarity of PREMIS, the exercise then asks participants to identify the sources of information for 20 PREMIS elements listed on the worksheet.
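To make the point about familiarity concrete, here is a minimal sketch in Python of how the three elements mentioned above (object identifier, file size and format) might be written out in a PREMIS-like structure. The element names follow the semantic units of the PREMIS Data Dictionary, but the function name premis_object and the sample values are hypothetical, and the output is not claimed to validate against the PREMIS XML schema, which requires further elements and a namespace.

```python
import xml.etree.ElementTree as ET


def premis_object(identifier_type, identifier_value, size_bytes, format_name):
    """Build a small PREMIS-like <object> element (illustrative, not schema-valid)."""
    obj = ET.Element("object")

    # objectIdentifier: something every repository already assigns.
    ident = ET.SubElement(obj, "objectIdentifier")
    ET.SubElement(ident, "objectIdentifierType").text = identifier_type
    ET.SubElement(ident, "objectIdentifierValue").text = identifier_value

    # objectCharacteristics: file size and format, usually captured on deposit.
    chars = ET.SubElement(obj, "objectCharacteristics")
    ET.SubElement(chars, "size").text = str(size_bytes)
    fmt = ET.SubElement(chars, "format")
    designation = ET.SubElement(fmt, "formatDesignation")
    ET.SubElement(designation, "formatName").text = format_name

    return obj


# Hypothetical values of the kind a repository records for any deposit.
record = premis_object("repository-id", "eprint-1234", 482113, "application/pdf")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Nothing here is information a repository would need to create specially; the sketch only rearranges what is typically already held in the deposit record.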
On this day, such was the focus on significant characteristics that we did not have time to run the full exercise in groups, and instead treated it as a discussion exercise.
One of the PREMIS elements included in the exercise is labelled significantProperties, but in terms of what we have learned about this topic in this module, this element alone is clearly not very substantial. PREMIS is doing more work to accommodate significant characteristics, and the final slide (11) here reveals some likely directions.
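As a rough illustration of how little structure this element offers, the sketch below records two significant properties as type and value pairs, using the PREMIS subunit names significantPropertiesType and significantPropertiesValue. The helper add_significant_property and the example property types and values are invented for illustration; they are not drawn from the course materials or from any controlled vocabulary.

```python
import xml.etree.ElementTree as ET


def add_significant_property(obj, prop_type, prop_value):
    """Append one significantProperties block (a type/value pair) to an object."""
    sp = ET.SubElement(obj, "significantProperties")
    ET.SubElement(sp, "significantPropertiesType").text = prop_type
    ET.SubElement(sp, "significantPropertiesValue").text = prop_value
    return sp


obj = ET.Element("object")
# Hypothetical properties; both type and value are free text in this sketch.
add_significant_property(obj, "behaviour", "embedded hyperlinks remain clickable")
add_significant_property(obj, "appearance", "page layout and fonts preserved")

types = [
    sp.findtext("significantPropertiesType")
    for sp in obj.findall("significantProperties")
]
print(types)
```

Each property is simply a label and a free-text description, which suggests why the element alone cannot carry the kind of analysis covered earlier in this module.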
Open Provenance Model
“As data becomes plentiful, verifiable truth becomes scarce”. Since provenance is about records to verify the past history of an object, be it an antique or any physical or digital object, this great quote from Eric Hellman instantly reveals why provenance will become increasingly critical for digital information.
OPM is driven by an international community, and its lead author is our colleague Professor Luc Moreau at the University of Southampton. We attended a seminar by Luc on OPM, read the documentation, and met with him. Apart from the opening slide, the slides used here are selected from one of his presentations on OPM, so aside from the editing and this commentary this is all bona fide original OPM, and we are grateful to Luc for permission to reproduce this material.
Given the time available and, as we noted earlier, the origins in another community, this presentation is quite pared back, so please try to find out more about OPM from the sources provided. What we wanted to do here is introduce the general concept of provenance, put it in the context of a familiar information lifecycle, in this case the science lifecycle, and then identify the main elements of OPM.
From a tools perspective, a critical feature of OPM is its support for ‘serialisation’ (slide 8). This allows information describing an object to be passed to different services. RSS and Atom are well-known serialisation formats. OPM supports serialisation formats in XML, RDF and others, and supports the Semantic Web ontology language OWL. These formats can be used by repositories, and EPrints repository software, for example, now supports provenance information based on OPM.
OPM uses ‘nodes’ (slide 9); immediately we can start to see some parallels with PREMIS. Nodes are connected by ‘edges’ denoting the relation between nodes (slide 10); this begins to look like the graph-based approach of RDF, such as is used by OAI-ORE. Vitally, OPM graphs represent relations in time (slide 12). Finally, we see how Dublin Core, which will be familiar to repository administrators, might be mapped to OPM to represent the process of publication and versions of the published item (slides 13, 14).
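The node and edge idea can be sketched in a few lines of Python. This is an illustration only, not the OPM reference implementation: the class names Edge and ProvenanceGraph and the node identifiers are hypothetical. The three node kinds (artifact, process, agent) and the edge names (used, wasGeneratedBy, wasControlledBy, wasDerivedFrom) are taken from the OPM specification rather than from the slides, and edges run from effect to cause, which is how the graph captures relations in time.

```python
from dataclasses import dataclass, field

NODE_KINDS = {"artifact", "process", "agent"}


@dataclass(frozen=True)
class Edge:
    relation: str  # e.g. "used", "wasGeneratedBy", "wasDerivedFrom"
    effect: str    # id of the later node (where the edge starts)
    cause: str     # id of the earlier node (where the edge points)


@dataclass
class ProvenanceGraph:
    nodes: dict = field(default_factory=dict)  # node id -> node kind
    edges: list = field(default_factory=list)

    def add_node(self, node_id, kind):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, relation, effect, cause):
        if effect not in self.nodes or cause not in self.nodes:
            raise KeyError("both ends of an edge must be known nodes")
        self.edges.append(Edge(relation, effect, cause))

    def derived_from(self, node_id):
        """Follow wasDerivedFrom edges back in time, nearest ancestor first."""
        found, frontier = [], [node_id]
        while frontier:
            current = frontier.pop(0)
            for e in self.edges:
                if (e.relation == "wasDerivedFrom" and e.effect == current
                        and e.cause not in found):
                    found.append(e.cause)
                    frontier.append(e.cause)
        return found


# A hypothetical publication history: a draft is revised into two versions.
g = ProvenanceGraph()
for node_id, kind in [("draft", "artifact"), ("v1", "artifact"),
                      ("v2", "artifact"), ("publish", "process"),
                      ("author", "agent")]:
    g.add_node(node_id, kind)
g.add_edge("used", "publish", "draft")
g.add_edge("wasGeneratedBy", "v1", "publish")
g.add_edge("wasControlledBy", "publish", "author")
g.add_edge("wasDerivedFrom", "v1", "draft")
g.add_edge("wasDerivedFrom", "v2", "v1")

print(g.derived_from("v2"))  # ['v1', 'draft']
```

Walking the wasDerivedFrom edges backwards from the latest version recovers the chain of earlier versions, which is the kind of question a repository might ask of publication metadata once it has been mapped to a provenance graph.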
By working with an open standard model that can pass information as XML and in standard serialisation formats (e.g. RDF), it is possible to build provenance services into repository environments.
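What passing provenance between services might look like can be sketched as a round trip: one service writes a small graph out as XML, and another reads it back. The element and attribute names below (provenanceGraph, node, edge, kind, relation, effect, cause) are made up for this sketch and are not the official OPM XML schema; the same graph is also written as JSON simply to show that the model is independent of any one format.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical provenance graph held as plain data.
graph = {
    "nodes": [
        {"id": "draft", "kind": "artifact"},
        {"id": "publish", "kind": "process"},
        {"id": "published", "kind": "artifact"},
    ],
    "edges": [
        {"relation": "used", "effect": "publish", "cause": "draft"},
        {"relation": "wasGeneratedBy", "effect": "published", "cause": "publish"},
    ],
}


def to_xml(graph):
    """Serialise the graph to an XML string (illustrative element names)."""
    root = ET.Element("provenanceGraph")
    for n in graph["nodes"]:
        ET.SubElement(root, "node", id=n["id"], kind=n["kind"])
    for e in graph["edges"]:
        ET.SubElement(root, "edge", relation=e["relation"],
                      effect=e["effect"], cause=e["cause"])
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Rebuild the plain-data graph from the XML produced by to_xml."""
    root = ET.fromstring(text)
    return {
        "nodes": [dict(n.attrib) for n in root.findall("node")],
        "edges": [dict(e.attrib) for e in root.findall("edge")],
    }


xml_text = to_xml(graph)    # what one service would send
received = from_xml(xml_text)  # what another service would reconstruct
json_text = json.dumps(graph, sort_keys=True)  # the same graph in another format

print(received == graph)
```

The receiving service ends up with the same nodes and edges the sender started with, which is all that serialisation needs to guarantee for provenance services to interoperate.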
Summary of module 3
Module 3 of this course has demonstrated that actively used digital documents have critical features that affect productive use in different circumstances, and will change perceptibly and imperceptibly over time. We have recognised that we have to record this information, and have briefly used some tools to help do this.
Our arsenal of preservation tools is growing by the module. Our participant reporter noted that ‘for practical purposes there really needs to be some substantial work undertaken to integrate resources as well as applications to support content creators as well as repository managers in developing policies and practises for preservation.’
In module 4 we will see how these approaches can be built into the systems we are already using, our digital repositories.