Measuring progress of digital preservation

How can we measure the extent of preservation activity? One way to do this for libraries is through costs. ARL produces annual preservation statistics, and it appears that one helpful blogger has collated the data in a more useful way.

ARL produces a lot of detail in its reports, but these aren't helped by unclear and imprecise dating of versions. So, for example, the most recent full report (pdf) (i.e. with commentary) covers 2006–2007 and is simply dated 2009, while the latest figures are for 2007–2008 (an Excel spreadsheet, no commentary, undated). That's my understanding, anyway, not being a regular reader of ARL statistics, although I have noticed quite a few announcements for these recently.

A blogger at Preservation & Conservation Administration News has now ‘reconsidered’ the most recent figures (07-08): “After years of thinking about this and occasionally calculating the comparison stats for the institutions where I’ve worked, I’ve finally taken the time to put together an Excel spreadsheet that calculates the following comparisons for all reporting ARL Libraries”. Go to the blog entry to download the comparison spreadsheet.

Is this new and useful, and does it help the case for preservation? As the blog says: “Well, most libraries are competitive and recognize certain other ARL libraries as peers to which they aspire or like to compare themselves.”

Two things to note. First, this is at the level of libraries rather than repositories, although if you go back to the most recent report (2006–2007) you will find some context for repositories and institutional repositories. Second, this is about preservation and so not wholly about digital preservation.



Sustainability must precede preservation for IRs

The terms preservation and sustainability are often bound together, but it is important to understand the relationship. When we recognise the scope of sustainability for institutional repositories (IRs) we can see that it must typically precede preservation because the conditions for preservation depend on the sustainability of the target source.

That this is the case for IRs, the focus of the KeepIt project on digital preservation, became evident in a short aside in a new paper by Armbruster and Romary: Comparing Repository Types: Challenges and Barriers for Subject-Based Repositories, Research Repositories, National Repository Systems and Institutional Repositories in Serving Scholarly Communication.

They classify four different types of repository, one of which is the IR, as the basis for their argument for the type of repository and infrastructure they believe could succeed going forward. At the root of this analysis is the issue of sustainability.

There is a short passage in the paper (p 14) that connects sustainability and preservation: “Long-term preservation and access is an unsolved issue for repositories, but not an urgent one.” I would say that in so far as repositories are managing content deposit and access then they are already supporting preservation, but it’s not a complete preservation solution. It’s that completeness that we are working towards in KeepIt, and to which the authors’ point attaches.

This hinges on the next point, which is more significant: “Repositories must attend to the issue of sustainability as a priority, this being primarily about service, usage and cost.” Preservation is a cost on a repository, and will therefore depend on the degree to which the repository can first demonstrate its sustainability. A repository will not be able to attract the additional funds needed for preservation, in other words for longer-term development and maintenance, unless it first demonstrates the value of the service with regard to its defined purpose and constituency. With growing content and usage comes value.

If I recall, the JISC espida project (“making it happen by getting real”) made the connection between preservation and economics, showing why it can be hard to make the case for preservation when it comes to economic decisions: “digital preservation is a selective preservation of an intangible asset that has a reasonable probability of producing benefits at some future time. In that sense, digital preservation decisions are investment decisions, incurring present costs in expectation of returns in periods beyond the current accounting period” (Hunter 2005). It’s likely, therefore, that decisions concerning preservation will relate to factors that might be more measurable at the time the decision is made, such as whether the content provider or service, in this case an IR, is delivering value for its users within the context of its institutional commitment.

Are there examples of IRs where we could apply this type of thinking? Coincidentally, an appendix to a new JISC project report features two IRs, at Cambridge University and Glasgow University (Birrell et al. 2009). The paper is not about sustainability, and refers to the term only once. Nor is it just about IRs but their integration within digital libraries. The paper is, however, concerned with institutional mission, strategy, integration, resources, staffing, etc., all of which can be considered contributory factors in IR sustainability, and has sufficient detail on these two IRs for a case to be made. As an interesting exercise, based on these case studies, which repository do you believe is more sustainable?

Armbruster and Romary clearly believe that IRs, among the forms of repository they considered, have done least to demonstrate sustainability in the terms they outline. That may be contentious and is not fully justified with reference to other repository types in the paper, nor does it recognise the distinct mission of IRs. For those running IRs, or repositories that might become part of IRs, however, a platform for sustainability, particularly with regard to growing content and usage within defined institutional requirements, is an area that requires attention ahead of a preservation plan.



Preservation planning depends on repository context

Digital preservation planning may be technical, but you have to get to grips with your organisational policy and planning first, as is revealed in a new paper by our KeepIt project advisor Andreas Rauber and coauthors in the latest issue of D-Lib Magazine:

From TIFF to JPEG 2000? Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings

Although ostensibly about two particular image file formats, the real point of interest for us is how this acts as an example of preservation planning. This is a topic that we will encounter later in our project training, to be presented by Andi, and based on the preservation planning tool Plato, also described in the paper.

Preservation planning connects the processes of managing file formats within your collection. Computer applications change, so some file formats are at risk of becoming obsolete, and when this happens the content may become inaccessible. To prevent this, pre-emptive action might be taken, but you have to know where (for which files or content) and when to act.

The first step in this process is to identify the formats of all content files within a collection. Then you have to know the preservation implications of each format, and decide on appropriate actions, if any. This is where it gets tricky, because the number of risk factors to take into account for each format is large – the status of applications and viewers, etc. – and so making decisions on action becomes more complex. Bear in mind as well that every preservation action incurs a cost, and costs can rise rapidly for large, diverse digital collections. So good judgement is critical, and by connecting these steps preservation planning can help produce it.
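
To make the sequence concrete, here is a minimal sketch (in Python) of the first two steps: surveying the formats in a collection and attaching a risk and candidate action to each. The extension-based identification and the risk table are purely illustrative stand-ins; in practice a tool such as DROID and a registry such as PRONOM, or a planning tool such as Plato, would supply this knowledge.

```python
from pathlib import Path

# Purely illustrative: map file extensions to a (made-up) risk level and a
# suggested action. A real workflow would use a format identification tool
# such as DROID and a registry such as PRONOM, rather than file extensions.
RISK_REGISTRY = {
    ".tif":  ("low", "keep and monitor"),
    ".tiff": ("low", "keep and monitor"),
    ".doc":  ("high", "consider migration to an open format"),
    ".pdf":  ("medium", "check conformance; consider PDF/A"),
}

def survey_collection(root: str) -> dict:
    """Step 1: count the file formats present in a collection."""
    counts = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower()
            counts[ext] = counts.get(ext, 0) + 1
    return counts

def recommend_actions(counts: dict) -> None:
    """Step 2: attach a (hypothetical) risk and candidate action to each format."""
    for ext, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        risk, action = RISK_REGISTRY.get(ext, ("unknown", "investigate"))
        print(f"{ext or '(no extension)'}: {n} files, risk={risk}, action={action}")

if __name__ == "__main__":
    recommend_actions(survey_collection("/path/to/repository/files"))
```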

The D-Lib paper provides a practical example, but note that where it evaluates migration, a preservation action, between two formats for a large-volume archive, the exemplar repositories in KeepIt and other institutional repositories typically reverse these factors: smaller collections, but with many more formats and consequently multiple possible conversions. I’ll leave you to judge the effect on complexity caused by each factor.

Now the really sobering part.  The result of a preservation planning process, say Rauber et al., is dependent on the institutional and repository context: “the design of one plan can differ considerably from the plan of another institution, even one with a similar collection.” Considerations include “the institution’s preservation policies, legal obligations, organizational and technical constraints, requirements, and preservation goals, as well as the capabilities of the tested tools. Preservation Planning is a process that depends very much on an institution’s individual policies and requirements in its day-to-day work.” In reality, there is no magic wand.

This is where Plato comes in, and this is why we are trying to integrate it with EPrints, so that at least you might be able to invoke the planning process within a familiar repository environment. It’s also why we are not starting the training programme with preservation planning.

Finally, the conclusion of the paper highlights another factor, time dependence. The current recommendation for the image example, despite positive reports of the target format JPEG 2000 and all the accompanying analysis, is not to migrate. But this could change: “in one year we’ll look at this plan again to see if there are more tools available and whether or not the ones we considered in this year’s evaluation have been improved.” In other words, planning is an ongoing process. There is no single result, and even that can change over time, all depending on your repository, of course.



More on data curation for repositories

Previously we have considered the case for data curation in KeepIt, and here is more grist for that mill. Dorothea Salo has posted her recent presentation at the Access 2009 Canadian library technology conference. You have a selection of formats:

  1. Video (with slides)
  2. Slide show (with commentary)

[slideshare id=2061191&doc=accessdatamanage-key-090924120942-phpapp01]

Note. This embedded slide show is the pre-conference version. Follow the link above for the full conference slides. The full version will be embedded here when we can get it working.

Unless you have an hour to spare I recommend option 2, while dipping into 1 for flavour and colour. You may still want to spend some time on this, because the range of issues is large, and Salo makes it feel like time well spent thanks to the quality of the narrative and the slides. It’s an excellent if provocative presentation, even if you don’t agree with all of it (which I don’t).

To whet your appetite, there are lots of data examples, a good teaching example, a positive reference to Kultur, and of course strong opinions on the place of IRs in all this.

The presentation effectively has a section on the problems and challenges in adapting IRs for data curation (my comments in brackets):

  • on making repositories less institutional (I think the opposite is the case)
  • on managing many different content types using IR software
  • on static and final content (on the content angle, for OA content IRs are secondary to publications, and that is proving to be problematic for content collection for IRs)
  • on the inadequacy of manual and ‘one file at a time’ deposit
  • on an Archive It! button
  • on creating more and better repository interfaces, improving the look-and-feel
  • on the limitations of key-value pair metadata; “use XML or RDF” (see the sketch after this list)
  • on APIs for data and data interactions, data relations
  • on content modelling, and sharing (and reusing) content models
  • on Fedora
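
To illustrate the key-value point referred to above (a sketch of the general idea only, not Salo’s example or any repository’s actual schema): flat key-value metadata struggles to say how two items are related, whereas a triple-style description, of the kind RDF encourages, makes relationships explicit and queryable.

```python
# Flat key-value metadata: the relationship between a dataset and the article
# that analyses it has to be squeezed into an opaque string value.
flat_record = {
    "title": "Crystal structure measurements, batch 7",
    "creator": "person-0042",
    "relation": "see eprint-1234",   # meaning invisible to software
}

# Triple-style metadata (the shape RDF encourages): relationships are explicit
# and typed, so software can follow and query them. The identifiers and the
# 'ex:isAnalysedBy' predicate are made up for illustration; dc: terms are Dublin Core.
triples = [
    ("dataset-0007", "dc:title",        "Crystal structure measurements, batch 7"),
    ("dataset-0007", "dc:creator",      "person-0042"),
    ("dataset-0007", "ex:isAnalysedBy", "eprint-1234"),
    ("eprint-1234",  "dc:creator",      "person-0042"),
]

# e.g. "which other items share a creator with this dataset?"
related = [s for (s, p, o) in triples
           if p == "dc:creator" and o == "person-0042" and s != "dataset-0007"]
print(related)  # ['eprint-1234']
```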

There are some technical issues in achieving these requirements. Perhaps the demo examples in the EPrints 10 year review will answer some of these points.

Salo makes the case that all of these need to be resolved if IRs are to become platforms for data curation.

I think you will see from this presentation that Salo believes libraries and digital libraries are to be transformed. The future of IRs is left open, but the bar is set high for IRs to have a future. In a follow-up blog answering what libraries should do about the ‘laundry-list of data-curation challenges presented’, Salo includes:

“Do you have an institutional repository? Are you getting value out of it? Really? If not, have the courage to migrate the content and shut it down, re-assigning its manager to something more useful—data services, perchance.”

So why more of this in KeepIt? This approach places the onus of taking an institutional perspective onto the repository. Fundamentally, there is not much point in thinking about the preservation of repositories unless those repositories have a view of what they will look like, and what role they will play within the institution, in the future. Preservation is about planning for that future, for the whole repository, and not just the content it has now. After this, should it include data curation?

None of this prescribes any solutions. None of this says that what the KeepIt exemplar repositories are doing now is wrong. Just that the wider picture has, at least, to be considered. If Salo is correct, the future for institutional repositories is up for grabs.



An architecture for preservation?

Do we need an architecture for digital preservation? If so, what might it cover? Perhaps content types (e.g. data, teaching materials), management types (e.g. repositories), institutions, infrastructure (networks, storage) and services. Something more concrete than OAIS (references below), for example, which is after all a reference model, not an implementation.

Working on the Preserv projects and now KeepIt, whenever it might have seemed possible to map a preservation architecture for institutional repositories, things have suddenly changed. There’s one advantage already of a reference model over an implementation. For example, we had a powerful new Sun Honeycomb storage server for Preserv, and not long after, Sun announced it was being withdrawn from the market. Think again. Now we have cloud storage. How long until the fatal flaw in that strategy emerges, if it hasn’t already? Colleagues are, though, working on a promising way of applying the Honeycomb principles in an institutionally controlled cloud (see these slides, but note this blog comment on the live presentation: “extraordinary presentation … The slides don’t give the flavour; you just had to be there”).

A preservation architecture is a nice idea in principle, for example being able to say: here is a repository responsible for content (a), which it can predict and plan for; it can assess the risk and manage this content using tools (b) and outsource some technical functions to service (c), keeping copies of the content at locations (d) and (e) for access and archiving. Such a prescription, if it were possible, might appeal to repository managers, and might act as a template for repository preservation. In turn, it would become easier to plan and develop the sort of services on which such an architecture depends.
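
If it helps to picture such a prescription, here is a hypothetical sketch of what one repository’s ‘architecture record’ might look like, using the labels (a) to (e) from the paragraph above. None of the tools, services or locations named are recommendations; they are placeholders.

```python
from dataclasses import dataclass

@dataclass
class PreservationArchitecture:
    """One repository's preservation set-up, mirroring roles (a)-(e) in the text.
    Everything here is hypothetical and for illustration only."""
    content_profile: list       # (a) content the repository can predict and plan for
    risk_tools: list            # (b) tools used to assess risk and manage that content
    outsourced_services: list   # (c) technical functions handed to an external service
    access_copy: str            # (d) location serving day-to-day access
    archive_copy: str           # (e) location holding the archival copy

example = PreservationArchitecture(
    content_profile=["journal articles (PDF)", "theses (PDF)", "images (TIFF)"],
    risk_tools=["format identification", "preservation planning (e.g. Plato)"],
    outsourced_services=["format registry lookups", "migration on request"],
    access_copy="institutional repository storage",
    archive_copy="external archival store",
)
print(example)
```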

Admittedly, it all sounds a bit Soviet-style and rests on the assumption that once framed nothing will change this architecture, the antithesis of the digital environment where everything changes all the time. Would we want it any other way?

I was prompted to revisit these architectural thoughts by a curious story about a joint project – on Collection Development, Acquisitions, Preservation – between the university libraries of Cornell and Columbia. Actually I wasn’t that curious until this part: “Cornell and Columbia assert that the project—not a merger—could be …” Hold it there. It hadn’t seemed remotely like a merger story, but now you mention it, there is an interesting idea here. Are digital libraries enhanced by mergers? Certainly there is a scale problem with digital preservation that might be helped by a joint approach.

I don’t know to what extent digital academic libraries will be transformed by the open Web, that is, to become sources of content from the institution rather than acquirers of content from elsewhere. That’s my institutional repository viewpoint again. In the IR scenario, of the titular functions from the story above, preservation remains a concern.

You see, what they have done in this project, by collaborating and suggesting a merger, is to alter another of the factors in our nascent architecture: this time it’s the institutions and the infrastructure that have changed, rather than the storage service.

How would OAIS handle this? OAIS allows us to model a preservation architecture and build an implementation. If we are rigorous, we are encouraged to become ‘OAIS-compliant’, that is, we don’t simply treat OAIS as helpful advice, but we fulfil all the necessary requirements. This is important because one day we may want to be seen as a ‘trusted repository’, and that is likely to be measured against OAIS-related criteria. If we take library 1 and library 2, both of which are OAIS-compliant, and put them together as at Cornell and Columbia, is the result OAIS-compliant, or are we breaking compliance? Probably the latter, at least initially, until a full analysis has been performed on the new organisational framework. If we add in services, what needs to be OAIS-compliant? Without practice and experience, we don’t know. When it comes to institutional repositories we don’t have this experience, and I suspect not many others have it either, likely excepting the established preservation centres such as national libraries, and their experience may not be directly applicable.

There are certain places where digital preservation should begin, but these should not be the same for everyone. A manager of an institutional repository, for example, need not be a preservation specialist. At the moment, however, we seem to be expecting everyone to begin with OAIS (formal, intermediate (tutorial) or less formal), formats, etc. Instead there should be tools, services, interfaces, and perhaps a preservation architecture, that embed specialist knowledge and practice, and provide a better starting point for non-specialists. We have to be bold to move forward from the abstract model.



Digital preservation bibliographies updated

I recently updated the Preserv bibliography for the first time in nearly a year. For those following KeepIt but not the earlier Preserv project, you may not have come across it before.

It may currently be the most extensive bibliography on digital preservation. Yet it isn’t comprehensive, because it sees DP mostly through the prism of what might be relevant to repositories, particularly institutional repositories, and covers technical and other factors relevant to creating digital preservation services.

You might also argue that it can’t be comprehensive because it doesn’t include the papers in the KeepIt wiki bibliography, which follows the data structure – arts, science, teaching – of the project’s exemplar repositories. The two bibliographies are largely mutually exclusive, and complementary.



Preservation, DP 2.0 and the happy shopper

Twitter: Rory McNicholl (from #dptp slides): Is there a tool that covers the entire OAIS model? (next slide shows full diagram) “What do you think?” https://twitter.com/jisckeepit (2:33 PM Oct 22nd)

"Fridge Meme" 0547 by marymactavish So there goes another weekend, of relaxation, reflection and little event, interrupted as usual by the weekly supermarket shopping trip. We don’t have a car at the moment, so the list has to be planned to avoid pedestrian overload. I managed to restrict my selection to just four full bags this week (not including the case of wine, but we won’t dwell on that). On returning home the goods had to be distributed to various storage destinations: cupboards, vegetable rack, fridge and freezer. This takes a little time and effort, but not a lot of intellectual exertion. Most stuff has its place. I did wonder whether the minced beef would be needed for the week, and if it should go to the fridge or the freezer (I chose fridge, and later changed my mind). I noticed that the fridge was unusually full, and found myself removing some green cheese, out-of-date yoghurt (I’m more sensitive about ‘use by’ dates than I was), and something that I decided was a less of good idea than when I had bought it (clearly my selection and appraisal skills could be better).

Those reading this blog in search of enlightenment on digital preservation rather than shopping will have seen where this is heading. This may not be digital preservation, or even information preservation, but it is preservation. There are, however, a number of lessons from this simple process:

  1. None of us are specially trained in this form of supply and storage (much to the concern of food safety technologists)
  2. It’s mostly straightforward but occasionally we have to think a little harder about certain decisions
  3. It’s not an infallible process, as stuff gets wasted for different reasons
  4. It’s not a highlight of this or any other weekend

It’s striking how most of these lessons are reversed when it comes to digital preservation. Last week I attended part of the Digital Preservation Training Programme. People were being specially trained. The content was extensive, comprehensive and, often, complex, particularly in the inter-relations between the different modules (which I had the chance to scan after attending). We are aiming to be infallible, because we can’t really admit otherwise (not yet, anyway). And it is very much a highlight, in this case aimed at actual or prospective information and digital preservation professionals. We are regularly reminded elsewhere that digital preservation must be an urgent priority.

But should we expect this to apply to everyone involved with information or repository management? After all, the food storage example is founded on experience, both real and received, conveyed in food labelling, packaging and date information, and on storage devices with simple controls, explicitly developed and refined for purpose and simplicity. If this were a digital environment, we might call it food storage 2.0.

Where is DP 2.0? According to my reading of DPTP, it’s not here yet. Let’s be clear, DPTP is mostly reflecting the state of the art. Simply, the art is not where it should be for wider uptake. So who should be held to account for this? Well, for one I’ll hold my hand up for the Preserv projects. We had hoped to get further, and the groundwork has been laid, so perhaps we will make it in this project. When I hear about the software tools available to help with DP, and invariably PRONOM is mentioned again, I realise we were saying this five years ago. However good PRONOM is, and whatever refinements TNA has made to it in that time, notably adding DROID, it is still short of what is needed. But this is a community responsibility, not that of one agency or tool.

What’s needed for DP 2.0, especially at the more complex, technical end such as file format management and metadata, are embedded knowledge and decision bases, accessed by interfaces tied to information management and storage environments.

Widely used open source software for creating repositories has been variously criticised for creating a limited framework, e.g. in terms of ingesting content, not being “of the Web”, etc., but in fact the strength and flexibility of these services have always been based on providing a series of interfaces (deposit, manage, access) to a series of data management processes. This is all integrated and is only limited by the use to which it is put. In DP we have a single interface to a single process, or a series of tools and interfaces that are not integrated.

The good news is that we are beginning to see open knowledge registries that are accessible through interfaces that build in preservation planning support, are tied to widely used content management services such as IR software, and offer built-in storage selection. The complexity is hidden. The content manager should not have to be an expert on file formats to manage preservation, any more than the happy shopper needs to be a food technologist.
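
In code terms, that kind of integration might amount to no more than this from the depositor’s side: a single deposit call behind which the system identifies the format, consults a registry for planning advice and selects storage. Everything here, the function names, the registry entries and the policies, is hypothetical; it is a sketch of the shape of hidden complexity, not of any existing EPrints or PRONOM interface.

```python
from pathlib import Path

# A deliberately tiny sketch of "hidden complexity": the depositor calls
# deposit(); format knowledge, planning advice and storage policy live behind
# it. All names and rules here are hypothetical, not an existing API.
FORMAT_ADVICE = {                         # stand-in for an open knowledge registry
    ".tif":  {"risk": "low",    "store": "archive"},
    ".docx": {"risk": "medium", "store": "archive, plan migration"},
}

def identify_format(path: str) -> str:
    return Path(path).suffix.lower()      # a real system would ask DROID/PRONOM

def deposit(path: str, metadata: dict) -> dict:
    """What the depositor sees: one call, no preservation questions asked."""
    fmt = identify_format(path)
    plan = FORMAT_ADVICE.get(fmt, {"risk": "unknown", "store": "quarantine and review"})
    return {"file": path, "metadata": metadata, "preservation_plan": plan}

print(deposit("thesis.docx", {"title": "An example deposit"}))
```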



eCrystals – Repository preservation objectives


A little bit about eCrystals:
eCrystals Southampton (http://ecrystals.chem.soton.ac.uk/) is a data repository based on the EPrints platform, but heavily reconfigured to manage data files from chemical crystallography structure determination experiments. The repository evolved out of 2 rounds of JISC funding of the eBank-UK project (http://www.ukoln.ac.uk/projects/ebank-uk/). The repository has a schema that represents the data files generated during an experiment (a crude representation of the workflow) so that a record can be managed and presented in an understandable fashion. The next two phases of eCrystals evolution were concerned with developing a ‘Federation’ of such repositories and analysing the requirements, problems & pitfalls of a distributed network of data repositories (http://wiki.ecrystals.chem.soton.ac.uk). One workpackage of this project was concerned with preservation issues and looked into preservation planning, metadata and representation information concerning the network and individual repositories (WP 4).
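
As a rough illustration of the idea (not the actual eBank/eCrystals schema), a record of this kind groups the data files produced at each stage of a structure determination so that the whole experiment can be managed and presented as one object; the stage names and file types below are only indicative.

```python
from dataclasses import dataclass

@dataclass
class StageFiles:
    stage: str     # step of the structure determination workflow
    files: list    # data files produced at that step

@dataclass
class CrystalStructureRecord:
    """Illustrative only: one repository record spanning a whole experiment."""
    identifier: str
    compound: str
    stages: list

record = CrystalStructureRecord(
    identifier="ecrystals-0001",
    compound="example compound",
    stages=[
        StageFiles("data collection", ["raw frame images"]),
        StageFiles("data reduction", ["reflections.hkl"]),
        StageFiles("refinement and validation", ["structure.cif", "checkcif report"]),
    ],
)
print(record.identifier, [s.stage for s in record.stages])
```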

A number of matters arose from this project and in general they were not technical, but mostly stemmed from the fact that these repositories are not based on the Institutional model, but are operated, maintained, administered and populated by individuals or research groups – often a single person takes on all these roles, in addition to their ‘day job’ of doing the scientific research! It is our belief that the future repository landscape will be very heterogeneous indeed and there will be a large number of archives & resources that do not operate under the Institutional model – the preservation requirements of these will be very different from those currently recognised by the community.

eCrystals preservation objectives therefore generally go beyond the technical and are concerned with understanding the issues around everyday researchers performing preservation activities:

Objective 1

to explore the preservation training and actions required for small groups and non-archivists. If all crystallographically active institutions (virtually every Chemistry Department has a crystallography centre) signed up to the concept of eCrystals, 95% of the repositories would be user-administered within research groups run by 1, 2 or 3 people. These people are researchers trained in their art and would have very little concept of any preservation issues whatsoever.

Objective 2

to investigate how performing preservation actions can be made easy! This means learning the minimum requirements for the maximum return (the 80% rule), and establishing what can be automated and what technologies can be implemented, both unseen within the repository software and as ancillary tools.

Objective 3

to develop an exemplar non-onerous preservation regime for the researcher administering a repository. Following on from objectives 1 & 2, a schedule of preservation actions can be derived; the objective is to understand how this can be ‘coded’ into the repository software (e.g. an EPrints plug-in) so that the repository can automatically perform these actions or alert the researcher at the appropriate time as to which actions need executing (e.g. accession, appraisal, deletion, etc.).
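
As a rough sketch of what such a ‘coded’ schedule might amount to (an illustration of the idea, not an actual EPrints plug-in API): a list of dated actions per record, checked periodically, with anything due turned into an alert for the researcher administering the repository.

```python
from datetime import date

# Hypothetical schedule of preservation actions for repository records. In a
# real system this would be derived from a preservation plan and run from the
# repository software (as a plug-in or scheduled job), not hard-coded.
SCHEDULE = [
    {"record": "ecrystals-0001", "action": "appraisal",           "due": date(2010, 1, 15)},
    {"record": "ecrystals-0002", "action": "format check",        "due": date(2009, 12, 1)},
    {"record": "ecrystals-0003", "action": "review for deletion", "due": date(2011, 6, 30)},
]

def actions_due(today: date) -> list:
    """Return the actions that should be brought to the researcher's attention."""
    return [task for task in SCHEDULE if task["due"] <= today]

for task in actions_due(date.today()):
    print(f"ALERT: {task['action']} due for {task['record']}")
```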

Objective 4

to develop example costings for researchers and administrators. Exemplar costs at the level of a crystal structure record would enable individuals to quickly and easily include the appropriate amount for the preservation and administration of a repository in FEC grant applications or in departmental / research group budgets.
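
As a purely arithmetical illustration of what a record-level costing makes possible (every figure below is invented for the example), the preservation line of a grant bid becomes a simple multiplication once a cost per deposited structure is known.

```python
# Every figure here is invented purely to show the arithmetic a record-level
# costing would enable; real costs would come from the project's own analysis.
cost_per_record = 2.50     # hypothetical preservation + administration cost per structure
records_per_year = 400     # hypothetical output of one crystallography centre
project_years = 3

preservation_budget = cost_per_record * records_per_year * project_years
print(f"Preservation line for the grant bid: {preservation_budget:.2f} (currency units)")
```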



Scholarships, repositories and digital preservation training

The Digital Preservation Training Programme (DPTP) is hosting another course in London next week. This is becoming a well-established feature of the digital preservation landscape in the UK and further afield. It’s not cheap, but then good quality training never is. One innovation to have resulted from this cost is the funding, by the Digital Preservation Coalition (DPC), of a number of scholarships for its members to attend the course. Three scholarships were offered, and six were awarded, showing the growing popularity of the course.

The process of awarding these scholarships has shone some light on the target community for this course, and has implications for repository managers. One of the repository managers involved in the KeepIt project submitted a personal application for a scholarship but was not selected. In retrospect this may seem unsurprising given the criteria outlined in the DPC’s announcement of the awards:

“Applicants were judged against three main criteria: the role that DPTP would play in career development; the benefits to their organisation from attendance and the extent to which the applicant’s job profile within the organisation pertains to digital preservation. Applications were open to DPC members and associates.”

This does not describe the current organisational view of the repository or of the repository manager. There simply is not yet this focus on preservation by repositories, nor on preservation as a career development path for repository managers. What seems to define the lone repository manager is that they have to be a jack of all trades, with too little time for any of them, preservation included.

It’s notable that none of the selected scholars appears to be managing a repository. That could simply be a reflection of DPC membership, another requirement for application.

So it falls to KeepIt to try and bring preservation training to the repository. I shall be attending part of the DPTP course next week as an observer, to get a clearer picture of the course coverage and presentation. I am grateful to Kevin Ashley, the course leader, for this chance.

The first thing that struck me about the programme is that it begins with OAIS and ends with organisational issues. In our training plan for KeepIt it is almost the reverse. The priority for a repository is to embed itself within the institution and to respond to its needs. Preservation follows from this analysis rather than precedes it. Again, I think this highlights the different target audiences, with DPTP aiming at organisations with preservation already established as a core mission. I’ve had some brief correspondence with Kevin about this and we hope to talk some more next week about how we can work with DPTP to bring its training to repository managers. I’ll report back on my impressions of the course and any developments that might follow.



UAL Research Online – repository preservation objectives

I suppose the first thing I should comment upon is ‘UAL Research Online’. While I worked on the Kultur project and am still employed as Kultur Project Officer, that project has actually finished and I am now working on UAL Research Online. This is of course one of the outcomes of the Kultur project, and while what I discover, learn and attempt as part of the KeepIt project can be shared among the Kultur partners, it will be specifically applied to UAL Research Online. Unfortunately I don’t have a lovely logo yet (and I’m at an arts institution!) but this will happen soon.

So UAL Research Online is the soon-to-be-launched (honestly) institutional repository for the University of the Arts London. We are an arts institution comprising six colleges across London and covering the widest range of arts and creative practice in the UK. We do produce a lot of text, probably more than most people would think, but of course we also produce a huge amount of research in multimedia formats. We will also be collecting teaching and learning materials at some point in the near future, so this again will have an impact upon our digital preservation needs.

Surprisingly or unsurprisingly, producing these ‘objectives’ was harder than I thought. I would also add that I am sure I will want to change them by the end of the project! However, the main point for me is that this is an entire learning process. It is about learning about digital preservation. My learning process will mirror my own institution’s and our own researchers’ processes. So cue drumroll…

  1. To produce a set of information guides detailing the issues, possible solutions and problems for the digital preservation of the wide range of formats that UAL Research Online holds. This should include a way to evaluate what to preserve and what to leave. These guides should be both for a general research audience and for repository staff.
  2. To gain enough knowledge and experience of the issues around digital preservation in order to be able to advise staff and researchers about digital preservation and to also advocate its importance at senior levels.
  3. To produce a costing for digital preservation at our institution. This will help to develop a realistic business plan that can be presented at senior management level.
  4. To provide training on identifying the needs and the tools to conduct digital preservation for the repository and to plan and implement a preservation programme that can be then integrated within the management of the repository.
