
KeepIt course 4: Putting storage, format management and preservation planning in the repository

KeepIt course module 4, Southampton, 18-19 March 2010
Tools this module: Plato, EPrints preservation apps
Tags: find out more about this module (KeepIt course 4), or the full KeepIt course
Presentations and tutorial exercises course 4 (source files)

We continue our journey through the KeepIt course modules, adding commentary and context to the core materials for those who want to follow and apply the course at their own speed.

Module 4 is the penultimate module and in some ways the culmination of the course – certainly it is the culmination of modules 3 and 4 and of our approach of putting preservation in the repository interface.

[slideshare id=3561165&doc=keepit-course4-intro-100326045654-phpapp01]

We’ve seen in courses 1 and 2 that digital preservation should be built into general information management – institutional policy and costings – but there is one area that differentiates digital preservation: it is somewhat technical and concerns file format characteristics. We learned about those characteristics in the previous module, KeepIt course 3.

In this module we will be using another tool, Plato, to create a preservation plan to manage file format characteristics, and applying this to files in a repository based on EPrints. The PLANETS project provided opportunities to learn about and practise with Plato, and EPrints offers regular training courses, some of which will feature preservation, but this course module, and its final live incarnation, is the only place where these pieces are put together. Once again we have top-class presenters, who designed and built the tools you will be using.

Putting it in the interface: the importance of a familiar workspace

Steve Jobs shows off the iPad. Picture by curiouslee

“75 million people already own iPod Touches and iPhones. That's all people who already know how to use the iPad.”

Apple recently launched a new machine – the iPad. It provoked differing opinions, from ‘can’t wait to use’ to ‘not revolutionary, just a bigger iPhone’. Why expect another revolution so soon? The key is the familiarity of the interface: “That’s all people who already know how to use the iPad”, said Steve Jobs, in perhaps the line that epitomised Apple’s design approach to the interface of the new device.

EPrints is not Apple, but it is attempting to put preservation in the repository interface, to enable tasks to be accomplished in a familiar working environment.

First, the previous module was a primer for this one, so a quick recap might help. We began with preservation workflow (slide 7). This will underpin our work in this module. We looked at the two ends of the workflow – finding out what we have (identification), and considering the type of preservation actions we may have to take (if any). We recognised risk as an area we will wish to control and moderate.

As an example we looked at the broad risks (slide 8) associated with some formats we identified in a typical repository format profile. Our teams compared and rated some familiar formats (slide 9). But note, we also came up with reasons why our ‘winning’ formats might not be the best choice in all situations.

We next looked at significant properties using the F-B-S engineering design framework (slide 13):

  • Function: the purpose of the design
  • Structure: of the designated digital object
  • Behaviour: what a user wants from the object

In this way we aim to identify and understand the critical characteristics of an object that we seek to preserve, recognising that it’s not just what you preserve but how you preserve it that can affect its use.

We did an exercise to analyse the structure of emails (slide 14), where we were seeking to identify the behaviours required of the structural elements of an email (slide 15), and then classified these in five high-level categories (slide 16). A second exercise revealed that the expected behaviours of our email can be different depending on the user (slide 17).

In two rapid sessions we looked at two means to document these characteristics (slide 19):

  • PREMIS
  • Open Provenance Model

Not forgetting that we began module 3 with a simple numbers exercise to show why it matters to document the significant characteristics and history of objects: numbers can change, features can change, information can change whenever it is transmitted and transcribed. Unless you are careful, you will not end up with the information you started with (slide 21).

There we have it, module 3 in a single slide (slide 22).

Digital preservation fundamentals: a recap

Now we will put it all to work in a repository. First, Andreas Rauber and Hannes Kulovits from the Vienna University of Technology will answer the question: why do we need digital preservation? This broad recap of digital preservation fundamentals will be familiar to many, but Rauber and Kulovits are skilled presenters and held KeepIt course participants rapt.

[slideshare id=3561267&doc=plato-eprints-intro-100326051510-phpapp02]

What follows in this module, KeepIt course 4, builds on these fundamentals.

Although we are using training repositories based on EPrints, we expect similar approaches to become available for other repository software.



Adding chemistry to a file format registry

Adding chemistry formats to a DROID profile

Everybody knows DROID. Well, everybody working in digital preservation. And those being introduced to digital preservation will likely be shown a tool or two, because tools help us do practical preservation. Among those tools, the one most likely to be shown is DROID (for example, here, 11.45 am). This is because DROID comes from the National Archives, is open source, and does something fundamental to digital preservation: it identifies file formats. DROID (Digital Record Object Identification) is an automatic file format identification tool.

We’ve been using DROID for years, in KeepIt and before that Preserv, to produce repository format profiles. It tells us what we have to preserve, and we can use that information to begin to judge risk, build a preservation plan and, where necessary, take some evasive action by converting a digital item into a format we may believe to be less risky.

One thing we’ve never done with DROID is add a new format. This part is carefully curated by the National Archives. (In case you are wondering what PRONOM, the subject of the link, is: it is a registry of file formats that informs DROID, which in turn is the software used to scan your content.) PRONOM is not a wiki or other social medium, so you can’t just add stuff without moderation (not yet, anyhow, although that may change).

PRONOM the registry currently contains information on over 700 file formats, but there are many thousands more formats in existence. In other words, PRONOM covers most major, popular formats and then some, but not all formats you can think of. When it comes to more specialised data, it’s likely your format is not represented.

In this case our exemplar KeepIt repository eCrystals stores data from crystallography experiments in the laboratory. The formats it uses to describe these data are not likely to be available in DROID, so a format profile of this repository will reveal a large number of unknown files and will not be of much use.

What PRONOM-DROID requires for format identification is a signature for the file format you want to identify, that is, something distinctive that will reliably differentiate it from any other format type. We imagined this would require quite a detailed knowledge of the formats. We were concerned that, since we are not the originators or sponsors of the formats in question, we might not have the requisite knowledge, and that we might require the cooperation of such people in some way. We needn’t have worried on either count. Creating a file format signature for DROID is simpler than we had anticipated.
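In outline, a signature of this kind is just a distinctive byte pattern expected at, or near, a fixed position in the file. A minimal sketch of the idea in Python (this is not DROID’s implementation; the PDF and PNG patterns are real published magic numbers, but the matching logic here is deliberately simplified):

```python
# Illustrative sketch of byte-signature identification (not DROID's code).
# A signature pairs a distinctive byte pattern with the maximum offset
# from the beginning of the file at which it may start.
ILLUSTRATIVE_SIGNATURES = {
    "PDF": (b"%PDF-", 0),               # real PDF magic number
    "PNG": (b"\x89PNG\r\n\x1a\n", 0),   # real PNG magic number
}

def identify(data: bytes) -> str:
    """Return the first format whose pattern occurs within its offset window."""
    for name, (pattern, max_offset) in ILLUSTRATIVE_SIGNATURES.items():
        window = data[: max_offset + len(pattern)]
        if pattern in window:
            return name
    return "unknown"

print(identify(b"%PDF-1.4 sample"))   # PDF
print(identify(b"hello world"))       # unknown
```

Writing a signature for a new format such as CIF then amounts to finding a pattern like this that is reliably present in CIF files and absent from everything else.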

What follows is a Twitter record (@jisckeepit, from about 9:45 AM Sep 16th) of what happened as we set about creating format signatures for our crystallography and chemical files. We were able to do this at a small test level using a version of DROID we run locally.

  • We just worked out how to write a signature file for a format in PRONOM-DROID. It’s ID, not validation. You bet!
  • Need to check uniqueness of signature for CIF file format. “You boy at the back, DROID, do you recognise this file?” “No sir”. Good start
  • For the others at the back of the class, our CIF is a Crystallographic Information File, produced from experiments in the lab
  • The boy DROID is a fast learner. He now knows CIF and has passed the initial class test, but how will he cope with the big school exam?
  • We have boy DROID’s exam results, and are tearing open the envelope now. FAIL. Retake
  • Great, boy DROID has now passed his CIF exam. It’s a high grade, no absolute mark, but probably as good as earlier passes 😉
  • Since boy DROID is such a good student we’re now teaching him CML, Chemical Markup Language, based on XML, which he already knows
  • See, it’s easy when you know how. Boy DROID passed the CML exam first time
  • Next we will commend Boy DROID to his parents (at iPres next week) and suggest they enter the new CIF, CML format sigs in the DROID academy
  • Also next, we will run boy DROID on eCrystals (a big repository!) to produce its first format profile. That may take some days
  • Big thanks to boy DROID’s class teachers, from Chemistry Philip Adler and KeepIt’s Dave Tarrant

The result of all this is the profile shown in the figure at the top of this blog post. Essentially we deposited a few example CIF and CML files in our test repository, inspected the source files in detail to write the signature, and used our EPrints preservation apps, including our test version of DROID, to produce this simple profile of our test files. What we haven’t shown here are all the unsuccessful test profiles, which would show as an alarming red bar labelled ‘unknown files’ when DROID could not recognise a CIF, instead of blue bars with the correctly identified file names.

If you are interested you can find out more about the CIF and CML formats.

A helpful lesson and a step towards another KeepIt exemplar repository.

Addendum (22 September 2010)

Philip Adler, a specialist in the chemistry file types being looked at here, provides some additional insights into writing a new signature profile to be added to the DROID format identification tool.

“Creating a new profile for DROID to search a database was a novel problem; and one for which the documentation, whilst plentiful, failed to characterise the behaviours of the XML form in which one specifies what signature formats DROID is looking for. As such, for someone with little to no computing experience, it is my opinion that it would be very challenging and time consuming, on a first run through, to establish a new signature file. Indeed, for an experienced computer scientist and a specialist in the file type being looked at, it took a considerable length of time.

“That said, once the format of the signatures, and the layout of the signature files has been deduced, installing a new signature was relatively straightforward the second time around.

“There is one ‘bug’. Whilst attempting to define a term which would be at a non-fixed distance from the beginning or end of a file, we established that the DROID framework does not permit this. For instance, we managed to ‘break’ the HTML definition within DROID by submitting a perfectly valid (although unorthodox) HTML file with an absurdly long comment at the top. Whilst in HTML this is odd, in a .CIF file it is not, and so could serve to break the file in the future. The back-door for this is to set the maximum distance from the beginning or end of the file to an absurdly great length. This solution, however, is far from obvious, and is located as a standard method within the documentation, which implies that DROID can cope with arbitrary distances from BOF (beginning of file) or EOF (end of file).”
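The offset problem Philip describes can be sketched as follows (an illustrative simplification, assuming signatures are only matched within a fixed window from the beginning of the file; the HTML-style pattern and window sizes are hypothetical):

```python
# Sketch of the pitfall described above: if a pattern may only begin
# within a fixed distance of the beginning of file (BOF), a long leading
# comment pushes it out of range and identification fails.

def match_within(data: bytes, pattern: bytes, max_offset: int) -> bool:
    """Match pattern only if it starts within max_offset bytes of BOF."""
    pos = data.find(pattern)
    return 0 <= pos <= max_offset

sig = b"<html"                                        # hypothetical signature
ordinary = b"<!-- hi -->\n<html>"
padded = b"<!-- " + b"x" * 10_000 + b" -->\n<html>"   # absurdly long comment

print(match_within(ordinary, sig, 1024))   # True
print(match_within(padded, sig, 1024))     # False: pushed past the window
print(match_within(padded, sig, 10**6))    # True: the 'absurdly great' workaround
```

The last line shows the workaround mentioned in the quote: setting the maximum distance from BOF to a very large value.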



Unravelling the Open Access-Preservation layer cake

Digital preservation presents its own complexity. Add the complexity of the target content, and it’s easy to get mired. This is particularly so for the case of open access (OA) content where that content is made available through institutional repositories (IRs), so-called ‘green’ open access. This is the layer cake presented by journalist Richard Poynder in one of his blog posts. Richard has long been interested in open access, and as a journalist he has a probing curiosity that leads to incisive questions directed at those in the field. In this case he wanted to know why green open access does not appear to be obviously compatible with the approach of digital preservation.

My colleague Stevan Harnad has long had a straightforward answer: OA through IRs does not need preserving (yet), for two reasons:

  1. Most IRs don’t have enough content to bother with preservation;
  2. For the target content of OA IRs, authors’ copies of papers published in journals and other peer reviewed sources, preservation is required for the primary content and is the responsibility of the content owner, the publisher.

It turns out that Stevan has been largely correct, if not necessarily for the right reasons. This all boils down to putting in place the economic drivers for digital preservation. But repositories have changed in the last 5 years or so, and continue to change, so it is necessary to reappraise the situation:

  1. IRs were originally (around 2000) synonymous with OA, but contain wider types of content now, such as data, arts and teaching materials. KeepIt was founded on this recognition.
  2. The emergence of institutional open access mandates is welcome, if belated, recognition of the significance of the institutional component of the term IR; institutions cannot be seen to be casual with content they may have required through policy.
  3. There is wider recognition that digital preservation is not simply a post-content-acquisition, some-time-later activity, but is integral to the planning, policy and development of the repository.
  4. An array of digital preservation tools has emerged to enable digital preservation to be practised and applied by non-specialists outside the traditional preservation organisations.

For this reason it seemed appropriate, even necessary, to respond to Richard’s questions about OA and preservation, posted to a number of email lists within the notice of his blog post. Inevitably, since nobody has a faster response rate than Stevan, this reply is also in the context of his response. The mail is available in full; a slightly shorter version is reproduced below. Also see Stevan’s blog take on this, or take a moment just to absorb his title – Authors’ Drafts, Publishers’ Versions-of-Record, Digital Preservation, Open Access and Institutional Repositories – and ponder why this might present some complexity.

On Fri, 13 Aug 2010, Richard Poynder wrote:
>> [1] Should institutional repositories [IRs] be viewed as preservation tools?

On 14 Aug 2010, at 20:16, Stevan Harnad wrote:
> Not primarily. IRs’ primary function should be to provide open access [OA] to
> institutional research article output.

On 18 Aug 2010, at 10:50, Steve Hitchcock wrote:
Yes. We may have witnessed a golden age of digital preservation tools, and some of these have been built into repository software interfaces. To explore the practical application for repositories, see our structured and fully documented KeepIt course on digital preservation tools for repository managers:

Source materials http://bit.ly/afof8g
Blog http://blog.soton.ac.uk/keepit/tag/keepit-course/

The underlying philosophy of the course is to enable users to evaluate the appropriate degree of commitment, responsibility and resource for preservation that is consistent with the aims and objectives of the institution and repository at a given time and looking forward. It follows that answers can range from high to low, even to nothing, providing the analysis has been thorough, the results documented and the decisions and consequences are fully understood.

Without commenting on priorities here, IRs are much wider than OA papers. For IR preservation it’s this broad scope that matters, then how policy deals with the specifics, rather than simply OA concerns.

>> [2] Should self-archiving mandates always be accompanied by a “preservation
>> mandate”?

> Definitely not. (But IRs can, will, should and do preserve their
> contents.) For journal articles, the real digital preservation problem
> concerns the publisher’s version-of-record. Self-archiving mandates
> pertain to the author’s-draft.

Not an additional mandate, agreed, and it’s important that institutional and repository policy, such as OA mandates, precede preservation policy and provide the basis for it. But it’s interesting to ask whether OA mandates, since at the moment these are the most prominent form of repository policy, should make some reference to preservation. It’s notable that research funder OA policies are more likely to make some brief reference to preservation than institutional policies.

To Stevan the answer may seem obvious in the particular case of OA, but the question is whether such policies would benefit from such a reference. Or more broadly, whether repository policies need to demonstrate some degree of reciprocity, not just preservation, for the demands they appear to make of authors. Given the weight of an institution’s repository policy, it will have to address this at some stage, and omission, even from an OA mandate, since IRs are wider than OA, could begin to look curious and raise questions. The wider context is what repositories can offer in terms of responsible content management for access now and longer-term access. It will do no harm to sprinkle policies with features that will appeal to authors, where repositories can take practical steps to implement these. Stevan says IRs should and do preserve their contents; in which case, IRs simply need to specify and demonstrate what this means in practical terms, where possible, and policy is one prominent place to do this.

In this case return to [1] above, but first see conditions in [3] below.

>> [3] Should Gold OA funds be used to enable preservation in institutional repositories?

> Funds committed to Gold OA should be used any way the university or
> research funder that can afford them elects to use them (though does
> seem a bit random to spend money designated to pay for publishing in
> Gold OA journals instead to preserve articles published in
> subscription journals).
>
> But on no account should commitment to fund either Gold OA or digital
> preservation of the version-of-record be a condition for mandating
> Green OA self-archiving.

>> More, including an interview with digital preservation specialist Neal
>> Beagrie, here: http://bit.ly/dur5EP

Stevan has long been concerned about costs and distractions, including preservation, from the core OA aim. Economics are the primary driver here. As Neil Beagrie said in the interview: “digital preservation is ‘a means to an end’: the benefit and goal of digital preservation is access for as long as we require it”. This can work for open access too. My experience is that repositories are not wasting time and effort on preservation where it may be unnecessary, e.g. empty repositories. On this basis it is too stark for Richard Poynder to say: “Nevertheless it is hard not to conclude that there is a potential conflict between OA and preservation.”

For others the problem may be the opposite: turning concern into action. There is emerging evidence that repositories will take the necessary actions on preservation where the tools are available and when the circumstances support this.

To try and gauge what circumstances might convert concern over preservation into action by repositories I recently proposed this rough metric
http://blog.soton.ac.uk/keepit/2010/07/22/conditions-for-digital-preservation/

When these conditions apply, again, return to [1] above.

I’ve made the case before that the issue between support for green and gold OA, from an institutional perspective, is one of chronology, and it’s the same for IRs and preservation.

Steve



KeepIt course 3: describing significant characteristics and recording provenance

KeepIt course module 3, London, 2 March 2010
Principal topic this module: significant properties
Tools this module: PREMIS, Open Provenance Model (OPM)
Tags: find out more about this module (KeepIt course 3), or the full KeepIt course
Presentations and tutorial exercises course 3 (source files)

In the original notice for this course the focus of module 3 was on description. That emphasis changed a little in the interim towards significant characteristics, as we saw in the previous session. Now that we understand which characteristics of a digital object might be significant in affecting its use both now and in the future, and how to identify those characteristics, we need to have a means to describe these characteristics and record any actions taken on the object that might alter them. Our earlier game revealed the importance of recording this information.

In this final section of module 3 we look at two means of recording information for preservation: one a well-established metadata dictionary, PREMIS; the other an emerging standard, the Open Provenance Model (OPM). A participant’s view of this session has been reported.

On the surface there is little to connect these metadata models in terms of the people involved and the starting point for each initiative. PREMIS comes from the digital library community, while OPM could be said to have roots in the computer science and Semantic Web community. One describes over 100 elements, the other barely more than a dozen. In terms of purpose, however, there is perhaps more common ground. So might the two approaches eventually collaborate, or be in competition? This is not clear yet, although I could report my KeepIt colleague Dave Tarrant, a programmer and developer, was immediately drawn to the elegant and concise OPM, less so to the unwieldy PREMIS. I’ve always been unstinting in my praise for PREMIS.

Preservation Metadata Implementation Strategies (PREMIS)

[slideshare id=3365403&doc=keepit-course3-presmeta-100308093709-phpapp01]

When I first began looking at preservation metadata in 2004, there were many schemes. It was clear that someone would have to try and unite these disparate approaches, and thankfully PREMIS filled that gap. Yes, the PREMIS Data Dictionary for preservation metadata is extensive and initially daunting, but once you understand the structure, the interested and informed reader cannot but be struck by its coherent and comprehensive treatment. Each time I give this presentation – and I have done so previously for events by the Repositories Support Project, such as in 2008 in London and Bournemouth – I want to give those new to PREMIS a sense of this contribution.

The other key point from this presentation, invariably aimed at repository managers and administrators, is that however daunting PREMIS may appear, it is very likely they are already recording PREMIS metadata. One aim of the exercise attached to the presentation (slides 8, 9), and its associated preservation metadata worksheet, is to select elements from the Dictionary that will be familiar to repository managers, such as object identifier, file size and format. Having established the basic familiarity of PREMIS, the exercise then asks participants to identify the sources of information for 20 PREMIS elements listed on the worksheet.
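As an illustration of that point, here is a minimal sketch of the kind of record a repository is in effect already keeping for each file. The element names loosely follow the PREMIS Data Dictionary, but the mapping and the helper function are illustrative, not normative:

```python
# Illustrative sketch: assembling PREMIS-style metadata a repository
# typically already has (identifier, size, fixity, format).
import hashlib
import os
import tempfile

def premis_like_record(path: str, format_name: str) -> dict:
    """Build a small PREMIS-flavoured record for a stored file (illustrative)."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return {
        "objectIdentifierValue": path,       # e.g. the repository item location
        "size": os.path.getsize(path),       # object characteristics
        "fixity": {"messageDigestAlgorithm": "MD5",
                   "messageDigest": digest},
        "formatName": format_name,           # e.g. from a DROID scan
    }

# Demo with a throwaway file standing in for a repository object
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".cif")
tmp.write(b"data_example\n")
tmp.close()
rec = premis_like_record(tmp.name, "CIF")
print(rec["size"])   # 13
```

Nothing here is exotic: identifier, size, checksum and format name are routinely captured at deposit time, which is why the exercise feels familiar.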

On this day, such was the focus on significant characteristics that we did not have time for the full exercise, working in groups, and instead treated it as a discussion exercise.

One of the PREMIS elements included in the exercise is labelled significantProperties, but in terms of what we have learned about this topic in this module, this element alone is clearly not very substantial. PREMIS is doing more work to accommodate significant characteristics, and the final slide (11) here reveals some likely directions.

Open Provenance Model

“As data becomes plentiful, verifiable truth becomes scarce”. Since provenance is about records to verify the past history of an object, be it an antique or any physical or digital object, this great quote from Eric Hellman instantly reveals why provenance will become increasingly critical for digital information.

OPM is driven by an international community, and its lead author is our colleague Professor Luc Moreau at the University of Southampton. We attended a seminar by Luc on OPM, read the documentation, and met with him. Apart from the opening slide, the slides used here are selected from one of his presentations on OPM, so aside from the editing and this commentary this is all bona fide original OPM, and we are grateful to Luc for permission to reproduce this material.

Given the time available and, as we noted earlier, the origins in another community, this presentation is quite pared back, so please try to find out more about OPM from the sources provided. What we wanted to do here is introduce the general concept of provenance, put it in the context of a familiar information lifecycle, in this case the science lifecycle, and then identify the main elements of OPM.

[slideshare id=3365483&doc=keepit-course3-provenance-100308094733-phpapp01]

From a tools perspective, a critical feature of OPM is its support for ‘serialisation’ (slide 8). This allows information describing an object to be passed to different services. RSS and Atom are well known serialisation formats. OPM supports serialisation formats in XML, RDF and others, and supports the Semantic Web ontology language OWL. These formats can be used by repositories, and EPrints repository software, for example, now supports provenance information based on OPM.

OPM uses ‘nodes’ (slide 9); immediately we can start to see some parallels with PREMIS. Nodes are connected by ‘edges’ denoting the relation between nodes (slide 10); this begins to look like the graph-based approach of RDF, such as is used by OAI-ORE. Vitally, OPM graphs represent relations in time (slide 12). Finally, we see how Dublin Core, which will be familiar to repository administrators, might be mapped to OPM to represent the process of publication and versions of the published item (slides 13, 14).
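The node-and-edge model just described can be sketched with plain data structures. This is an illustrative simplification, not an official OPM serialisation: `used`, `wasGeneratedBy`, `wasControlledBy` and `wasDerivedFrom` are genuine OPM relation names, but the nodes and the publication scenario are hypothetical:

```python
# Illustrative OPM-style provenance graph: artifacts, processes and
# agents as nodes, OPM relation names on the edges.
nodes = {
    "draft":   ("Artifact", "author's draft"),
    "publish": ("Process",  "journal publication"),
    "vor":     ("Artifact", "version of record"),
    "editor":  ("Agent",    "journal editor"),
}

edges = [
    ("publish", "used",            "draft"),    # process used artifact
    ("vor",     "wasGeneratedBy",  "publish"),  # artifact generated by process
    ("publish", "wasControlledBy", "editor"),   # process controlled by agent
]

# A provenance query: an artifact generated by a process that used
# another artifact was derived from it (vor wasDerivedFrom draft).
derived = [(g, "wasDerivedFrom", u)
           for g, r1, p1 in edges if r1 == "wasGeneratedBy"
           for p2, r2, u in edges if r2 == "used" and p1 == p2]
print(derived)   # [('vor', 'wasDerivedFrom', 'draft')]
```

The derived triple shows the versioning relation the Dublin Core mapping in slides 13 and 14 is getting at: the published item traced back to the draft it came from, via the publication process.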

By working with an open standard model, that can pass information as XML and in standard serialisation formats (e.g. RDF), it is possible to build provenance services into repository environments.

Summary of module 3

Module 3 of this course has demonstrated that actively used digital documents have critical features that affect productive use in different circumstances, and will change perceptibly and imperceptibly over time. We have recognised that we have to record this information, and have briefly used some tools to help do this.

Our arsenal of preservation tools is growing by the module. Our participant reporter noted that ‘for practical purposes there really needs to be some substantial work undertaken to integrate resources as well as applications to support content creators as well as repository managers in developing policies and practices for preservation.’

In module 4 we will see how these approaches can be built into the systems we are already using, our digital repositories.



KeepIt course 3: significant characteristics

KeepIt course module 3, London, 2 March 2010
Principal topic this module: significant properties
Tools this module: PREMIS, Open Provenance Model (OPM)
Tags: find out more about this module (KeepIt course 3), or the full KeepIt course
Presentations and tutorial exercises course 3 (source files)

Significant characteristics are “the characteristics of digital objects that must be preserved over time in order to ensure the continued accessibility, usability, and meaning of the objects”.

To continue introducing the more technical aspects of preservation workflow in course module 3, we welcome Stephen Grace and Gareth Knight from the Centre for e-Research at King’s College London (KCL), to explore the practical implications and impact of significant characteristics. Alternating presentation duties, Steve and Gareth split their coverage into six self-contained presentations interspersed with practicals and discussion.

[slideshare id=3364267&doc=1sp-introducingsignificantproperties-d2-100308064856-phpapp02]

Presentation 1

One of the difficulties of digital preservation is that what we as users hope to see in a digital object may not be the same as what another user, or the author or creator, sees. In other words, there is scope for interpretation. In addition, machines interpret and present the object depending on the technology framework available at a given time. As we know, that framework can change. So if we allow the possibility of creating a new version of an object – e.g. by format migration, which by definition must change something in the object, however small – to enable presentation in a contemporary technical framework, how do we decide what changes to the object are to be permitted? Such an analysis is the focus of what has become known variously as significant properties or significant characteristics (although Dappert and Farquhar argued for ‘characteristics’ and against the interchangeability of these terms).

Perhaps surprisingly, given its wide scope and multi-faceted nature, the topic of significant characteristics has received far less attention than, say, file formats, as the timeline in presentation 1 recognises.

[slideshare id=3364378&doc=2sp-inspectframework-100308070838-phpapp01]

Presentation 2

A recent substantive investigation of significant characteristics was carried out by the InSPECT project, led by KCL. This project provided a framework for determining significance based on Function-Behaviour-Structure (FBS), originally developed to assist engineers and designers to create and redesign artefacts. The FBS approach is used throughout subsequent presentations in this set and in group practicals, and presentation 2 introduces the method and adapts it for data objects:

  1. Analyse structure for technical properties
  2. Identify the purpose of properties and categorise them
  3. Determine expected behaviours
  4. Classify these into functions

However, the behaviour of an object may vary depending on the user, or ‘stakeholder’ in the analysis presented. Which activities might the user wish to perform on the object? In this presentation we learn how to cross-match object functions with stakeholder functions.
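The cross-matching step can be sketched as a simple set intersection. The email elements, behaviours and stakeholder needs below are illustrative examples, not the exercise’s official answers:

```python
# Illustrative cross-match of object behaviours (from a structural
# analysis of an email) against what each stakeholder needs.
object_behaviours = {
    "From header": {"identify sender"},
    "Date header": {"establish chronology"},
    "Body":        {"read content"},
    "Attachment":  {"read content", "render attachment"},
}

stakeholder_needs = {
    "recipient": {"read content", "identify sender"},
    "archivist": {"identify sender", "establish chronology", "read content"},
}

# An element is significant for a stakeholder if any of its behaviours
# meets one of that stakeholder's needs.
significant = {who: [el for el, b in object_behaviours.items() if b & needs]
               for who, needs in stakeholder_needs.items()}

print(significant["recipient"])   # ['From header', 'Body', 'Attachment']
```

Note how the Date header matters to the archivist but not to the recipient: the same object yields different significant characteristics for different stakeholders, which is the point of the exercise.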

Group exercises

[slideshare id=3364808&doc=3sp-practicalobjectanalysis-100308081920-phpapp02]

Presentation 3

Now we are ready to test our understanding of this framework with our first group exercise, designed to analyse the content of a familiar type of digital object, an email, and consider how it will be used. Presentation 3 invites participants to start (slide 3) with a blank FBS structure (or rather an SBF structure, since we are performing the analysis in this direction), and shows a generic example (slide 4) of what this might look like at the finish. (Note, some of the detail of these slides may be better viewed on the originals – use the View on Slideshare button, lower right-hand corner of the slide viewer – or the source MS Powerpoint, see link at the top of this blog entry.) A handout sheet to support the exercise tabulates and defines the elements of an email, also providing a functional description and examples. Users were also given a sheet with a blank SBF framework to complete. Among the source materials you will find a version of the framework filled out with some possible answers.

[slideshare id=3364939&doc=4sp-practicalstakeholderanalysis-1-100308083910-phpapp02]

Presentation 4

The second part of the exercise, presentation 4, links the needs of different stakeholders with the actual behaviours required of an email. The stakeholders considered in this exercise are the creator, a recipient, and a custodian (e.g. an archivist). Groups performing the exercise are asked to identify 2-5 behaviours for each stakeholder. You will notice some parallels with the first exercise. In the final slide participants are asked to review the correlation between object and stakeholder properties.

A participant’s perspective of these exercises and of this course module has been reported.

[slideshare id=3365063&doc=5sps-in-archive-100308085155-phpapp02]

Presentation 5

By this stage we are becoming aware that it is not feasible to maintain a large number of digital objects manually. Presentation 5 introduces a selection of tools available to perform analysis of digital objects, including DROID for format identification, JHOVE for format validation, and XCL (eXtensible Characterisation Language) to extract and document object properties.
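To give a flavour of what format identification involves – the technique behind DROID, which matches files against signatures from the PRONOM registry – here is a deliberately tiny stand-in that matches a few well-known 'magic' byte signatures. The signature table is far from complete, and real tools are considerably more sophisticated:

```python
# Toy signature-based format identification, a stand-in for what DROID does
# against the full PRONOM registry. Only three common signatures are listed.

SIGNATURES = {
    b"%PDF": "application/pdf",                 # PDF files begin '%PDF'
    b"\x89PNG\r\n\x1a\n": "image/png",          # PNG 8-byte signature
    b"\xff\xd8\xff": "image/jpeg",              # JPEG start-of-image marker
}

def identify(data: bytes) -> str:
    """Return a MIME type by matching leading 'magic' bytes, or 'unknown'."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return "unknown"

print(identify(b"%PDF-1.4 ..."))  # application/pdf
```

Identification only tells you what a file claims to be; validation against the format specification is the separate job that JHOVE performs.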

[slideshare id=3365130&doc=6sp-summary-100308085925-phpapp02]

Presentation 6

Concluding, presentation 6 considers the role of significant characteristics in repository workflows.

Next, we consider how to describe these characteristics, and how to record actions taken on the digital object that might alter them.

Posted in Uncategorized.



KeepIt course 3: Introducing preservation workflow and format risks

KeepIt course module 3, London, 2 March 2010
Principal topic this module: significant properties
Tools this module: PREMIS, Open Provenance Model (OPM)
Tags Find out more about: this module KeepIt course 3, the full KeepIt course
Presentations and tutorial exercises course 3 (source files)

After the simple game designed to highlight significant properties of data, we are ready to begin on preservation workflow and format risks. First the slideshow introduces some terms we will encounter later in the module: representation information, and the Open Archival Information System (OAIS). Unusually, some brief pre-course background reading was recommended to familiarise participants with the role of file formats: Preservation & Storage Formats for Repositories.

[slideshare id=3364088&doc=keepit-course3-workflow-100308062001-phpapp02]

Combining data with its representation, e.g. markup (or encoding, in machine terms), can by analogy be seen as a ‘performance’ of the digital object (slide 3). While this is easy to understand for audio-visual content, the analogy holds for text and other forms of content too: an element of machine performance is required to present content to an end user. We can see already that a number of components are required for a performance of a digital object, and the concept of performance introduces expectations, perhaps differing, on the parts of creator and user.

Our framework for preservation is described by OAIS (slide 4). From one view this model looks much like our repository model. A strength of OAIS is you can use it to model processes in almost any system, but once you subscribe to its strictures you are bound to its detailed specifications. OAIS is not perfect but is a standard and is universally adopted in formal preservation practice. As you build preservation into your repository you will want to demonstrate that others can trust your procedures and ability to manage data, and to do this you will need to show compliance with OAIS.

Broadly we can reduce OAIS to our three-stage repository model (slide 5), and within our ‘manage content’ activities we can identify the three stages of our format preservation workflow: check, analyse, action. When we stretch these out chronologically (slide 6) we can see there are tools for each stage, but the hard part is connecting format check with action. In other words, when we have identified the format of a digital object, we can migrate that to another format if needed, but connecting this information and the action decision to migrate is harder because we don’t yet know how to evaluate whether or when to convert, and what to convert to. Again we can see there are tools to help us with this. Most of our work in course modules 3 and 4 will concern this ‘analyse’ bit in the middle of our workflow.
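The check-analyse-action workflow can be caricatured in a few lines of code. The decision logic here is a placeholder – the format names and the 'at risk' set are invented – and the analyse step is precisely where real tools such as Plato come in:

```python
# The format preservation workflow reduced to three toy functions. The
# 'at risk' set and format names are invented for illustration; real
# analysis is the hard part that module 4 addresses with Plato.

AT_RISK = {"pdf-1.0"}  # hypothetical: formats we have decided to migrate

def check(obj: dict) -> str:
    """Identify the format (in reality, a job for a tool such as DROID)."""
    return obj["format"]

def analyse(fmt: str) -> bool:
    """Decide whether migration is needed -- the hard middle step."""
    return fmt in AT_RISK

def action(obj: dict) -> dict:
    """Migrate to a (hypothetical) safer format if analysis says so."""
    return {**obj, "format": "pdf/a"} if analyse(check(obj)) else obj

print(action({"format": "pdf-1.0"}))  # {'format': 'pdf/a'}
print(action({"format": "png"}))      # {'format': 'png'}
```

Connecting the first function to the last is trivial in code; deciding what belongs in `AT_RISK`, and when, is the open problem the course describes.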

Rosenthal has argued against preemptive format migration, arguing for the relaxed rather than aggressive approach, and referring to format obsolescence as the ‘prostate cancer’ of preservation: “It is a serious and ultimately fatal problem. But it is highly likely that something else will kill you first.” (slides 7-9)

Let’s look at what types of file format we might find in a typical institutional repository. We have two example format profiles from the Registry of Open Access Repositories (ROAR), past and present (slides 10, 11). Both show the dominance of PDF, not just the generic format but different versions of PDF. A recent summary of responses to a mailing list enquiry about repository file formats partly explains why PDF dominates (slide 12), but is this justified?

We now know there are many factors that affect the risk associated with any file format, and 10 principal risk classifications have been identified by the National Archives (slide 13). We will use this classification as the basis of our first exercise in this module. Without explaining these factors in any more detail than is given in the slide, but after striking out two factors that might require prior knowledge or documentation (slide 14), our groups were asked to select and compare a pair of formats, to score the formats against each other for each risk factor, and produce a final score and recommendation in favour of one or other format. In addition, groups were asked to suggest a reason why the ‘winning’ format might not be the best choice for their repository (slide 15). The results were reported in an earlier blog post. This exercise is described in a handout sheet used on the day, as well as in the course slides.
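The arithmetic of the exercise is simple enough to sketch. The factor names below are abridged and all scores are invented for illustration; the real exercise used The National Archives' risk categories as given in the slides:

```python
# Hypothetical sketch of the scoring exercise: rate two formats against
# each risk factor and total the scores. Factor names are abridged and
# every score below is invented for illustration.

RISK_FACTORS = ["disclosure", "adoption", "complexity",
                "external dependencies", "rights", "stability"]

def total(scores: dict) -> int:
    """Sum a format's scores (higher = lower risk) over the factors."""
    return sum(scores.get(factor, 0) for factor in RISK_FACTORS)

pdf  = {"disclosure": 2, "adoption": 2, "complexity": 1, "rights": 1, "stability": 2}
docx = {"disclosure": 2, "adoption": 2, "complexity": 0, "rights": 1, "stability": 1}

winner = "PDF" if total(pdf) >= total(docx) else "DOCX"
print(winner, total(pdf), total(docx))  # PDF 8 6
```

As the groups found, the value of the exercise is less in the final number than in being forced to argue each factor, and then to name a reason the 'winner' might still be the wrong choice.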

We used to think that the best formats were open: open standards and open source (slide 16). Well, that tended to differentiate formats from Microsoft applications from most of the rest, and perhaps explains the recent history of PDF preference by repositories, but more recently Microsoft has lobbied successfully for its Office formats to be released as open standards. So now which popular text and document formats are best for preservation? MS Word, which can be converted to Open XML, or PDF, which can’t (refer again to the results of our group exercise)? It shows there is no single answer, and even if there were, it might have to be reconsidered at another time. Rosenthal provides some reassurance about the impact of format risk, even for proprietary formats (slide 17).

Repositories know they must work with their authors and avoid imposing unnecessary constraints. Authors create digital objects with, no doubt, little regard for format over functionality, so repositories should neither impose format restrictions nor require authors to convert formats (slide 18). For now repositories should require source files from authors and make their own decisions, and that is what modules 3 and 4 will help them to do.

Before we end this section, let’s return to the exercise and consider the comparison of image formats (slide 19) – TIFF vs JPEG – rather than document formats. Again, there are cases for each. This might seem unlikely after three major institutions chose JPEG as an archival format over TIFF (slide 20), but an independent evaluation of both formats using the Plato preservation tool, which we will cover comprehensively in module 4, was less decisive, leaving a final decision to a later time (slide 21). Whichever formats we consider, format migration is not an open-and-shut case, and could use more sophistication than has been available to repositories thus far.

In a sense file formats are nice and well defined, if sometimes highly technical and complex. Less well defined are other characteristics of a digital object that may affect its performance, or the way it is perceived, used and understood by different users. Our next session in this module will introduce and provide working examples of what are referred to as significant characteristics.



KeepIt course 3: game playing characterisation, transmission, metadata and provenance

KeepIt course module 3, London, 2 March 2010
Principal topic this module: significant properties
Tools this module: PREMIS, Open Provenance Model (OPM)
Tags Find out more about: this module KeepIt course 3, the full KeepIt course
Presentations and tutorial exercises course 3 (source files)

So far in the course we have covered institutions and costs. When do we reach ‘preservation’? We begin that journey into some of the more technical aspects of preservation here, and it will continue into KeepIt course 4, where we will discover tools that can help with repository preservation workflow from within your repository software.

There is a slightly different structure to this module compared with the first two modules: more tools, more and shorter presentations, and also practicals mixed in, but this module is less about practicals and more a primer for the tools that follow.

[slideshare id=3363963&doc=keepit-course3-intro-100308055942-phpapp02]

The emphasis of the day will be on understanding the significant characteristics of digital objects that might affect preservation decisions. Later we’ll look at tools for recording metadata about these characteristics and preservation actions using PREMIS, an authoritative community-based standard, and the Open Provenance Model, an emerging and perhaps a more concise approach.

First we played a short game. Each group of four people was handed some simple information: three groups received randomly generated numbers on a small piece of paper; one group was handed four playing cards. The groups were asked to transmit the information in various ways, either written or spoken. In each case, the information was to be transferred sequentially between adjacent members and not shared openly with others in the group. The results from two groups, using written and spoken transmission, respectively, are shown below.

Provenance numbers game: written and spoken transmission

We can see that each group was given a PIN-like four-digit number that was colour coded. The whispering group (right) transmitted both the numbers and the colour correctly. They had clearly decided the colouring had significance and retained this information. The writing group chose not to retain the colours. (This evidence was retrieved after an obvious and desperate attempt at disposal!) Had they been handed information identified as a PIN, for example, this would have been a reasonable omission, because colour has no significance in a PIN. This was not in the instructions, however. In addition, an error was introduced in the first transcription (top), retained in a second transmission, and somehow corrected in the final result (bottom). This last transcription must have broken the rule about group conferral, but we can forgive them because it demonstrates how important recording can be in a transmission chain, enabling errors to be identified and corrected.
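The game's lesson – recording information alongside the message so that transmission errors can be detected – is the same idea as fixity checking in digital preservation. A toy sketch, with invented message values:

```python
# Fixity checking as a toy analogy for the transmission game: record a
# checksum with the message so errors introduced along the chain can be
# detected. The message values are invented.

import hashlib

def send(message: str) -> tuple[str, str]:
    """Package a message with its recorded fixity value (SHA-256)."""
    return message, hashlib.sha256(message.encode()).hexdigest()

def receive(message: str, digest: str) -> bool:
    """Check the received message against the recorded fixity value."""
    return hashlib.sha256(message.encode()).hexdigest() == digest

msg, digest = send("7414 green")
print(receive(msg, digest))             # True: intact transmission
print(receive("7419 green", digest))    # False: transcription error detected
```

A checksum only detects that something changed; correcting the error, as the writing group managed, still requires a record of the earlier states, which is what provenance metadata provides.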

In the case of the group receiving the playing cards, the information they were expected to transmit, without being told so, was number (or picture) and suit, as these are the two familiar significant characteristics of playing cards, although they may not commonly be referred to in this way.

A simple 10-minute game had revealed all the key issues to be covered in module 3:

  • the importance of recognising significant characteristics in deciding which aspects of information are to be retained,
  • the effect of transmission on information, and
  • the need to record data and metadata to identify and correct errors, to record changes and provenance, decisions made and actions taken on data.

Next we will introduce a format preservation workflow and format risks.



KeepIt course 2: LIFE lifecycle model and costs calculator

KeepIt course module 2, Southampton, 5 February 2010
Tools this module: KRDS, LIFE
Tags Find out more about: this module KeepIt course 2, the full KeepIt course
Presentations and tutorial exercises course 2 (source files)

Digital preservation begins with the author. This sounds good, but depends on the relation between author and content service. In the case of repositories, which work with large numbers of authors who use a wide variety of content creation tools, the relation may not be close enough to influence or impose good authoring practices that can assist and improve preservation outcomes. Yet the alternative, dismal prospect is that preservation sits at the end of the food chain, both literally and chronologically. You certainly don’t want to be at the end of the food chain when the money is being handed out.

So the trick is to make sure preservation and data management are embedded in the institution’s policies and procedures. Hence KeepIt course 1. It’s also to place preservation within the whole lifecycle of digital objects. This is what the DCC Curation Lifecycle Model does at a conceptual level; it’s also what the LIFE (Life Cycle Information for E-Literature) project seeks to achieve in terms of modelling the costs of preserving digital information from creation through longer-term management over many years.

[slideshare id=3209056&doc=holekeepit2010v02-final-100217101020-phpapp01]

This presentation by LIFE3 project manager Brian Hole introduces LIFE and leads towards a practical exercise to explore using the costing tool. The version of the tool used at the time of the course was described by Brian as a ‘work in progress’ where the initial input and digitisation sections were sufficiently complete ‘to yield meaningful results’. The tool used an Excel spreadsheet, and that continues to be the case for the recently announced and more widely available LIFE3 beta evaluation version, though the tool is eventually expected to be made available through a Web interface. Whichever platform you run Excel on, the LIFE tool requires the use of macros, which course participants found a limitation on non-Windows machines.

Costing results from the group work were in some cases quite startling to participants, producing some eye-watering numbers for those used to more frugal repository budgets. This mirrors the kinds of results reported for repository applications – see the documentary evidence provided for the SHERPA DP and SHERPA-LEAP case studies from LIFE2. It’s possible such costs are over-estimated for the specific case of repositories today – it may require access to the final version of the tool and some more dedicated use to discover this; recall that this was not a final version and could not be customised for the repository experience. The results were nevertheless instructive, and it’s plausible that if and when repositories experience typically explosive digital content growth, their costs will converge with those from the tool. At that point, let’s hope the tool is sufficiently fine-grained to enable repository managers to control and fine-tune the costs, to provide preservation within budget.
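In the spirit of the LIFE approach – though emphatically not the LIFE3 tool itself – a lifecycle cost is a sum of one-off and recurring per-object costs over the years an object is held. Every stage name and figure below is invented for illustration:

```python
# Illustrative-only lifecycle costing in the spirit of the LIFE model.
# All stage names and figures are invented; the real LIFE3 tool is far
# more detailed and its cost elements are derived from surveyed data.

STAGE_COSTS = {"creation": 20.0, "ingest": 5.0}        # one-off, per object
ANNUAL_COSTS = {"storage": 0.5, "preservation": 1.5}   # recurring, per object per year

def lifecycle_cost(objects: int, years: int) -> float:
    """Total cost: one-off stages plus recurring stages over the period."""
    one_off = objects * sum(STAGE_COSTS.values())
    recurring = objects * years * sum(ANNUAL_COSTS.values())
    return one_off + recurring

print(lifecycle_cost(10_000, 10))  # 450000.0
```

Even this crude shape shows why the numbers startled participants: the recurring terms scale with both collection size and time, so modest per-object figures compound quickly.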



KeepIt course 2: Keeping Research Data Safe

KeepIt course module 2, Southampton, 5 February 2010
Tools this module: KRDS, LIFE
Tags Find out more about: this module KeepIt course 2, the full KeepIt course
Presentations and tutorial exercises course 2 (source files)

Keeping Research Data Safe (KRDS) is an approach to institutional costings and budgets for managing research data. It involves a model, a form, a survey of archiving organisations – some represented on this course – and, ultimately, a report.

First there was KRDS1, now expanded in KRDS2. How the two reports relate is helpfully explained by the project Web page. This course module preceded KRDS2, but in the week of the course the pre-report survey results from KRDS2 were published, so it was a timely opportunity to consider this work. BTW, should you notice mention of rat and rabbit heart, it’s not a reference to the course lunch, but a dataset found in one of the data survey responses.


[slideshare id=3197577&doc=keepit-0110cb-final-100216111747-phpapp02]

KRDS is presented by Neil Beagrie, who coordinated and led the project from the outset. Neil has a consultancy, Charles Beagrie Ltd, and many will know him for his work and many influential reports for JISC. A short aside in this presentation (slides 14-18) makes reference to one of these reports, the JISC Digital Preservation Policies Study.

Each tool covered in the KeepIt course involves a short group exercise designed by the presenter to give participants some practice in applying the tool. Here the groups used the KRDS2 benefits taxonomy to answer three set questions (slide 28, final slide). A version of the taxonomy, adapted to educational interests, and the exercise are described in Debra Morris’ blog on this tool.



KeepIt course 2: institutional and lifecycle preservation costs

KeepIt course module 2, Southampton, 5 February 2010
Tools this module: KRDS, LIFE
Tags Find out more about: this module KeepIt course 2, the full KeepIt course
Presentations and tutorial exercises course 2 (source files)

Welcome to our time capsule version of the KeepIt course, which is posted, for completeness, some time after the course has ended and some time after the resources were posted freely for all to use. In these blogs we will try to add a little more context to these resources, perhaps to make them slightly more accessible from an intellectual viewpoint. This blog introduces course 2 on institutional and lifecycle preservation costs.

[slideshare id=3197432&doc=keepit-course2-100216110746-phpapp02]

Digital preservation aims to be perfect; at least, it’s not politic to admit to less as long as theory remains more prevalent than practice. This can lead to idealisation, and to idealised approaches being built into tools. But as we discovered in KeepIt course 1, constraints can be placed on idealised preservation by your institutional context.

One of the most obvious and critical constraints is cost. You can do little more or less than you are funded to do, preferably within the framework of institutional objectives and policy that includes preservation.

As an example of how preservation costs can escalate, see Rosenthal’s blog on a Petabyte for a Century (start here and work back to the original).

How can we estimate, plan and calculate the costs for preservation of a repository? As any accountant will tell you, there is more than one way to look at the figures, and you need a number of tools to help you.

Later we will look at the LIFE model of costing the management of digital objects throughout their lifecycle. First we consider KRDS, an approach to institutional costings and budgets.

