Prior project reveals state of repository preservation

The KeepIt project builds, through its partners, on substantial prior experience in digital repositories and digital preservation. In particular, the approach to preservation is informed by an earlier project, Preserv (spanning two phases from 2005). There are two ways of looking at this, as I said in an opening address to project partners:

  1. We bring great authority and expertise
  2. Haven’t you solved that problem (preserving digital repositories) yet?

Although progress has been made, the second point has a serious side too. Repositories and preservation, even taken separately, present tricky issues that resist instant solutions.

The final report from the Preserv 2 project recently became available and shines some light on the reasons. According to the JISC programme manager responsible for funding the project, Neil Grindley, the report “is candid and realistic about the current state of preservation thinking vis-a-vis repositories but also points to some ways forward in terms of how we (the preservation community) might more productively push what we do.”

Project reports such as this are intended both to record the work of a project, providing a legacy, and to offer a step-off point for others to build on. Grindley says: “The section on outputs and results contains useful practical illustrative information. The recommendations seem sensible; there is evidence the project has had impact, especially the creation of the storage controller and its implementation into EPrints version 3.2.”

This promising development work is now moving forward in KeepIt so it can be applied to the exemplar repositories. To complement the format management and storage tools due in the next version of EPrints software, we are already developing new services with major national and international partners:

  • an open format risk registry
  • a means of integrating the Plato formats planning tools from the PLANETS project
  • a resilient storage approach, adapted from a Sun Honeycomb server, that can be applied in a new ‘institutional’ cloud service

These approaches go beyond format identification: they support risk assessment based on the most up-to-date registry data, apply actions based on generalised inputs through the planning tools linked to the format risk results, and select storage services based on ongoing preservation management, cost and judgements on content value.
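To make the chain concrete, here is a minimal sketch of the identification-to-action step: a format is looked up in a registry, given a risk score, and mapped to a preservation action. All registry entries, identifiers and thresholds below are invented for illustration; a real service would draw on live registry data and the planning tools described above.

```python
# Hypothetical sketch of the format-management chain: registry lookup,
# risk score, recommended action. Entries and thresholds are invented.

FORMAT_REGISTRY = {
    # format id: (description, risk score 0.0-1.0)
    "fmt/18": ("PDF 1.4", 0.1),
    "fmt/40": ("Word 97-2003", 0.4),
    "x-fmt/92": ("WordStar 5.5", 0.9),
}

def assess(format_id):
    """Return (description, risk, recommended action) for a format."""
    description, risk = FORMAT_REGISTRY[format_id]
    if risk < 0.3:
        action = "monitor"
    elif risk < 0.7:
        action = "plan migration"
    else:
        action = "migrate now"
    return description, risk, action

if __name__ == "__main__":
    for fid in FORMAT_REGISTRY:
        print(fid, assess(fid))
```

The interesting design question, which the project tools address, is where the thresholds and scores come from: here they are hard-coded, whereas the services above derive them from registry data and planning inputs.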

These new services have been the subject of recent tutorial workshops and presentations, and will be explored in a subsequent blog.

Posted in Uncategorized.

Tagged with .


Twitterstream: using linked data to manage format risk

The Twitterstream flows fast and isn’t always there for review later. So here is the stream, copied live from Twitter #ipres09, the Sixth International Conference on Preservation of Digital Objects (San Francisco, 5-6 October 2009), on Dave Tarrant’s talk: Where the Semantic Web and Web 2.0 meet format risk management: P2 registry.

From the opening slides: “This paper/talk is not actually about a new registry for preservation data. The P2-Registry is simply a demonstration of what can be done with machine readable data which is published openly on the web.”

This innovative work began as a short-term response to the problem of format risk assessment in the Preserv 2 project. It is being promoted as an open solution to a component of the digital preservation and format management service for the exemplar repositories in KeepIt, for other EPrints-based repositories, and for repositories more widely. Format management and risk assessment are one part of the chain. Tbc.
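The point made in the talk is that the registry matters less than the openly published, machine-readable data behind it. A toy sketch of the linked-data idea (all URIs and triples below are invented, standing in for real PRONOM and DBpedia data): statements published independently about the same URI can be merged and queried together without any coordination between publishers.

```python
# Toy illustration of linked-data merging. Invented triples stand in for
# real PRONOM and DBpedia data. Because both sources use the same URI
# for the PNG format, their statements combine without coordination.

PRONOM = {
    ("http://ex.org/fmt/png", "riskScore", "low"),
    ("http://ex.org/fmt/png", "puid", "fmt/11"),
}

DBPEDIA = {
    ("http://ex.org/fmt/png", "openFormat", "true"),
    ("http://ex.org/fmt/png", "label", "Portable Network Graphics"),
}

def query(graph, subject):
    """Return all (predicate, object) pairs for a subject URI."""
    return {(p, o) for (s, p, o) in graph if s == subject}

# Merging is just set union over triples.
merged = PRONOM | DBPEDIA
facts = dict(query(merged, "http://ex.org/fmt/png"))

# One query now sees risk data from one source and descriptive data
# from the other.
print(facts["riskScore"], facts["openFormat"])
```

This is the “reduces redundancy, increases re-use” point from the tweets below: neither publisher had to know about the other, yet a consumer gets the combined picture.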

This is the first presentation of this work. Here is what others thought of it.

23.34 davetaz #ipres09 Here we go. Web 2.0, Semantic Web meets digital preservation – Presentation up @ http://bit.ly/UYTQV

cardcc Stand by: the Tarrant torrent about to start

mopennock David Tarrant now up in final session – P2 Registry

WilliamKilbride last #ipres09 session under way with discussion of registry services and linked data

mopennock Tarrant talking about dbpedia – I heard about this over lunch for first time (where have I been?)

edsu @davetaz is promoting #linkeddata at #ipres09: reduces redundancy, increases re-use, and maximizes discovery

mbklein Cool! http://p2-registry.ecs.soton.ac.uk/ – Preserv2 Semantically Enhanced File Format Registry

figoblog RT @mbklein Cool!

JMarkOckerbloom @cardcc has q on reliability of linked data. I wonder what will happen if linked data gets traction enough to attract spammers.

cardcc The Tarrant Torrent, or bits of it, in a blog post on the P2 registry & other linked data stuff at #ipres09 http://bit.ly/CoV5I

24.00 davetaz #ipres09, thanks for the comments all, feedback welcome, URIs of data even more welcome, if anyone has some with RDF, i’ll import them now!

edsu @davetaz nice response to the question of trust and linked data #ipres09 – dns has it pluses

edsu @davetaz might be fun to talk to martha anderson about linkeddata-ifying the arms/flesichauer formats website http://is.gd/41dJa

davetaz @edsu Some good info here, just need to move to URIs and then provide rdf. For most moving to #linkeddata should be strait forward



This is now a real-time blog

Exciting news. This is now a real-time blog, which means you can receive new posts the instant they are published, thanks to a plugin (rssCloud) that has been enabled for WordPress blogs, of which this is one.

All you have to do is install a feed reader, such as River2, or sign up to LazyFeed, both of which can apparently read from the rssCloud. As a test, I’m sending the link for this blog to Twitter at the same time as publishing this entry. So what came first, the blog or the tweet? Feel free to let us know.

All we have to do is publish blogs that justify instant attention. Easy!

Despite the immediate drawbacks, I believe the real-time Web is likely to be transformational. You can find out more about the impact in an update summary from ReadWriteWeb.

Twitter might be considered the leading realtime service. It is currently the source of choice for many realtime news search services, but other social Web services offer instant communication. What’s new is the effort to apply this intrinsically to the Web rather than to particular services. Then we will see the full effect.

Google is also developing a protocol to deliver real-time feeds, PubSubHubbub, but the key question is when Google Reader will be enabled for such feeds. Let’s hope this is not used as a wedge between Google’s Blogger and WordPress that balkanises blog access.
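For the technically curious, rssCloud works by adding a `cloud` element to a feed’s channel, telling subscribers where to register for instant notification when the feed updates. A minimal sketch of reading that element follows; the sample feed and endpoint values are made up for illustration.

```python
# Parse the rssCloud <cloud> element from an RSS 2.0 feed. The feed
# below is a made-up sample; a real subscriber would register with the
# named endpoint to be pinged when the feed updates.
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>KeepIt blog</title>
    <cloud domain="rpc.example.org" port="80" path="/rsscloud/pleaseNotify"
           registerProcedure="" protocol="http-post"/>
  </channel>
</rss>"""

def cloud_endpoint(feed_xml):
    """Return (host, port, path) of the feed's rssCloud hub, or None."""
    cloud = ET.fromstring(feed_xml).find("channel/cloud")
    if cloud is None:
        return None
    return (cloud.get("domain"), int(cloud.get("port")), cloud.get("path"))

print(cloud_endpoint(SAMPLE_FEED))
```

A reader like River2 does essentially this, then registers with the endpoint; the hub pushes a notification on each new post instead of the reader polling.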

Finally, you may have spotted an irony that a blog concerned with digital preservation over extended periods of time should be interested in the first second after publication. It’s all part of the same digital lifecycle. It could be argued that the start of the lifecycle is the critical moment and we need to understand that as much as anything that happens later.



EdShare – Repository preservation objectives

Background to EdShare
EdShare is a resource for collaboration and sharing of materials used in teaching and learning at the University of Southampton.  EdShare is based on the successful EPrints software and was developed in the EdSpace Project at the University of Southampton.

The resource has been created for teachers and other staff who are involved in supporting students in their learning, to organise, manage, share and collaborate on the everyday resources that they use in their teaching.

EdShare enables the entire University community to make resources visible to colleagues and students across the University and the World.  These resources may be very simple in format – Powerpoint presentations, Word documents and PDF files – or they may be more complex learning objects combining a range of file formats.

EdShare is at: www.edshare.soton.ac.uk

EdShare provides a safe, secure and persistent URL for the learning and teaching resources of the whole University community deposited in its shares.   This EdShare approach requires that we engage actively with the issues of digital preservation which are particularly relevant to the growing community for learning and teaching repositories.

EdShare objectives in KeepIt:

Objective 1: to define the preservation needs of the most prevalent file types and formats for learning and teaching.

These resources typically use a very broad range of file formats, some potentially very complex; the “significant properties” aspect of specific software applications may override all other considerations for some teachers – what might the impact of this be for EdShare? In practical terms, it is difficult to anticipate anything approaching completeness in preserving all file formats, but we should aim to address the most common and most deposited formats.

Objective 2: to explore and understand the concerns, in preservation terms, of people who add content to EdShare.

What are the lessons and implications for the growing community of learning and teaching repositories?

Objective 3: to explore and understand institutional concerns and policies in the realm of preservation of learning and teaching resources.

The institutionally-focused aspect of EdShare leads us to consider significant institutional concerns: document retention policies; legal and contractual obligations and responsibilities; and consistency with other repository policies across the institution.

Objective 4: to understand the relationship between the responsibilities of EdShare and those of the creators of content added to EdShare.

In many ways, placing the burden of preservation at the door of the sharing, organising and managing tool is a distraction from the appropriate and rightful responsibility that all teachers and content creators have to ensure the preservation of their own materials.

Debra Morris
September 2009



NECTAR – repository preservation objectives

Following our last project group meeting we were all asked to come up with a set of objectives as preservation exemplar repositories. Mine are listed below.

About NECTAR:

NECTAR is a repository of research outputs; content comprises a range of item types (e.g. journal articles, book chapters, performances, conference papers, compositions) and formats (text-based, multimedia, images). Items may be simple or complex. Some NECTAR content may not be preserved elsewhere.

NECTAR is typical of an IR in a smaller institution. The IR team is small (at least in terms of staff time dedicated to the IR) and NECTAR responsibilities are slotted in around other roles. The research community is diverse, and NECTAR is competing with other calls on researcher time. Researchers are often sympathetic to the idea of NECTAR but fail to come up with the goods.

NECTAR still doesn’t have a critical mass of full content – but guaranteed preservation could be a valuable service and therefore a further selling point for the repository.

Objectives:

Objective 1: to define the preservation needs of all file types and formats held in NECTAR (now and in the foreseeable future).

Both repository staff and the research community need to be aware of what needs to be done to preserve digital items. This requires an understanding of both technical and procedural requirements – throughout the digital lifecycle.

Objective 2: to have procedures and tools to support the preservation needs identified in objective 1.

Ideally the procedures will be embedded into the ‘normal’ research workflow, so that preservation awareness is an everyday part of research output creation. Similarly, the tools to support preservation will be accommodated in the repository software – this may require an extra step in the IR workflow, but it will be central to the repository, not something that has to be thought about as an optional extra.

Objective 3: to have documentation to inform and support NECTAR stakeholders.

Useful documentation would include the following:

  • The need for preservation – to support the business case
  • How to do it – technical and non-technical documentation for different audiences (content creators, repository administrators, library staff, technical staff, IR manager etc). This should include advice for new stakeholders on preservation-readiness.
  • Advocacy materials – for promoting the value of preservation to stakeholders (as above, but also institutional senior management and research managers)

Ideally, there will be two versions of documentation – brief guides and more comprehensive documents. Short guides will give summaries of advice and good practice, comprehensive versions will justify and explain the reasoning behind the summaries.

Objective 4: repository staff and others with collection management responsibility for the IR to receive training and ongoing support.

Face to face training should focus on the practical elements of preservation in the context of repositories. Although theoretical and contextual matters are important, the most useful training will be that which can immediately be translated into action. Thus, training should cover the need for preservation, practical steps and advocacy (selling the preservation message to managers and colleagues as well as content creators). It should not be overly technical (ideally, technical elements will be catered for within repository software and will be largely invisible to IR managers). Training should be available to anyone involved in repository maintenance.

Ongoing support should be provided virtually so that users can dip in and out as their needs arise – a ‘discussion forum’ may be a suitable format. (In my view this would be a good end of project output because it would have ongoing value to the wider repository community.)



Data repositories: the next new wave

Should institutional repositories do data curation? Underlying this question is another: what is a repository, and is that changing?

Without going into full detail, there are some straws in the wind. First, back when I was working for the Repositories Support Project, Tony Hey, VP Microsoft and formerly head of our school (Electronics and Computer Science) at Southampton University, gave the keynote at the RSP Repository Softwares Day: Repositories – Past, Present and Future (slides). Look in particular at the part on the future.

Unprompted, I dashed off a report on the meeting for the RSP Web site. It wasn’t used – clearly I wasn’t effusive enough about the event – but my main point was this: If repositories are the new wave of scholarly communication, the next new wave was glimpsed in a keynote presentation by Tony Hey. He pointed to ‘cloud computing’ as the way forward, with the potential for recasting the way repositories are structured and managed.

Hey’s view was reinforced by Dave Flanders of JISC in a closing panel session. So how does the cloud transform repositories? The cloud provides the technical infrastructure, so that repositories don’t have to, and it offers flexibility, leaving repositories to focus on what they want to do, whether this is to restructure repositories to reflect institutional priorities, to be led within institutional structures such as schools and departments, or to manage different digital types such as research materials and learning objects. Put the repository in the cloud and then ask the questions again, is how Flanders saw it.

More recently, at the end of July, I went to the Edinburgh Repository Fringe to join the DataShare and DCC data workshops. Instead of hearing about the latest exciting repository developments in other sessions, we ploughed into data management documentation. While I didn’t find the open forum of the DataShare workshop to be particularly enlightening, the underlying support document, the policy-making guide produced by the project, is helpful. What’s good is this guide recognises that everyone produces data; it’s not just for specialists, and the multilayered presentation makes it more approachable.

Next morning it was time for Digital Curation 101 ‘Lite’. Despite lasting three hours, this was clearly a skim, yet somehow it gave a sense of the full 101 course that the Digital Curation Centre has compiled. I was impressed. The short team exercises revealed that there are others in institutions who are approaching these issues from a completely different angle to repositories. There’s the clue.

Nevertheless, I came away thinking these data management issues, which are institution-wide and transformational in scale, are not going to happen in the next year, the remaining timeframe of the KeepIt digital preservation project. Our exemplar repositories are not going to be transformed in that time. Perhaps I should drop it.

Then Dorothea Salo unwittingly opened my mind to the prospect again. A major theme of Salo’s latest blog incarnation is data curation and it connects well, rather unusually, with institutions and their repositories. If not now, when? (27 Aug 2009): “who’s going to do data curation … we can have a pretty good idea who’s not going to do it: anybody who isn’t right this very minute planning to do it. This is no time for analysis paralysis.”

So when we convened earlier this month to redraft our project training plan I resolved to put the case for including data management to our exemplar repositories. I noted that each of these repositories exemplifies a different aspect of data management. I suggested that JISC and the DCC, as well as UKRDS, research funders and, eventually, institutions will be the drivers for this. Coincidentally, immediately after the meeting a Nature editorial came to light saying essentially the same thing. How can we not go forward after this admonishment?

So, finally, here is my take on how repositories may be changing. We have to separate people and content or data from systems and infrastructure. At the moment we tend to take a systems-based approach (e.g. is it EPrints or DSpace) to managing a thinly-defined type of content, and the focus is the repository. Yet as these repositories grow institutionally, that is, to represent and present all the substantive activities and outputs of the institution, we can see the expansion and transformation of the system in the ‘cloud’, and the emergence of intermediate services to manage repository systems within this flexible infrastructure.

There are already many people supporting systems and IT infrastructure in institutions; there are fewer people designated to manage data and support data creators. We can already see in our exemplar repositories the types of data that might be managed: arts, science (crystallography), teaching materials, research papers, etc., and probably within disciplines and sub-disciplines for some data types. The people responsible for these repositories tend to be called repository managers, but they are not systems experts; they are data experts. We need many more data experts across the institution.

As repositories grow they will essentially become teams of data experts working with data producers. There will be repository managers, but they will be team leaders, coordinating data and systems teams, rather than the repository managers we know today. What kind of people will they be? Salo has an idea (The accidental informaticist, 17 Aug 2009): “can-do souls comfortable with a lot of uncertainty and able to learn fast.”

It is my expectation that we need to allow the repository managers of our exemplars to develop as people rather than simply as fronts for repository systems. This project is unlikely to see that process complete, but if we have the vision we can at least make a start. It will be a major topic and a big challenge to embrace it, but at least we know who to turn to for help.



Data deletion: it happens

Data deletion happens within institutions, but the institutional repository can help prevent it.

The previous post considered the prospect of single-click deletion of data, i.e. inadvertent or unexpected sudden loss of data. From a formal digital preservation perspective this is a rather simplistic example, but as Brian Kelly recently noted (“Why you never should leave it to the University”), it happens.

The case concerns a business researcher in Sweden who lost 10 years worth of data apparently when an institutional Web site was redesigned. Brian asks if this could happen elsewhere, say in the UK, and of course, it could. It is likely that we at least get close to similar data losses all the time.

The reason that personal data collections managed on institutional sites are at risk in this way is that users fail to recognise a crucial distinction between a site that has systems management and one that has data management. There are many more people supporting systems and IT infrastructure in institutions than managing data. To secure data against loss in this way requires designation: a data management approach backed by policy. Ed Pinsent provides an institutional Web archiving example.

A systems manager seeks to build a framework to enable users to create and look after data as efficiently as possible. What happens to that data over time is typically the responsibility of the creator. In maintaining the efficiency and security of the system, the systems manager’s priority is the system, and sometimes changes will be necessary that could endanger data.

In my experience, working within a computer science department that likes to stay ahead of the field, there are regular upgrades to the systems infrastructure. When changes affect data or require data to be moved, the data creator is given the chance to manage the transition, to avoid any unintended loss. This might not always be straightforward. I have authorised data movement to new infrastructure machines, but I have also had to take data offline where a proprietary server application could no longer be supported cost-effectively.

Making the necessary decision and provision for data is not always painless in such circumstances for data creators with little time or experience in data management, so the easiest decision is often deferral. Such inaction will usually be followed by a deadline and a warning from the systems manager, that the system change takes priority over data loss if the creator does not act.

This is where the institutional repository has a vital role to play. It should be designated not as a system but as a managed data environment, where data is the priority and expert data managers can work with creators to support data throughout its lifecycle. IRs are not the only such sites that can perform this role within institutions, as the archiving example above shows, but it is an opportunity for IRs to forge a needed role and identity that users can understand.

Without designation as a data management site, data can be at risk of loss, just as the researcher in Sweden discovered.



New Mac: that syncing feeling

Image by flickrich

Having used Windows since Win95, I recently switched to a new Mac laptop. Perhaps I should have waited until Win 7, but this Mac has a better screen than any earlier machine I’ve used, so I went for it. Getting used to the Mac – different shortcuts, lack of a right mouse button, etc. – has been a bit of a drag on productivity, even for me. But this isn’t about Mac vs Win.

I also have an iPod, so clearly there would be no concerns about using that with a Mac. I plugged in the USB cable and the Mac duly fired up iTunes. What happened next will not surprise anyone who has had an iPod long enough to switch computers, but it is rather shocking from a preservation perspective. It seems you can only synchronise an iPod with one iTunes library, and in this case the library was on another machine. I had a choice: do nothing and disconnect, or sync with the library on the new machine. Since that library was empty, that option would effectively erase everything on the iPod. As with the floppy disc formatting option of old, I was one click away from deleting everything.

There are ways around this. Retrieve the original data sources, e.g. music CDs or purchased MP3s, and reload the library, this time on the new machine. Alternatively, a number of software tools will manage the transfer process for you. According to reviews, these differ in their effectiveness at transferring the data associated with the files (in this case mainly music), such as cover artwork and other related metadata. Now we are beginning to see a connection with repositories.

The reason for the constraint on iTunes libraries isn’t explained. There may be some good practical reason; at the moment from my perspective it simply looks inconvenient. Most likely it is due to rights, or IPR, and fears of piracy. So there’s another connection with repositories: rights constraints.

Now, this project is about preservation of digital repositories. The iTunes library issue is not strictly a preservation problem. Since my iPod does not contain original (music) data, I can delete and rebuild. The data is not lost for good. Yet I am looking at an inconvenience, time and cost. And that’s for a few dozen music albums.

Scale that up to a fully functional and established institutional repository. If it’s an open access repository, the contents are author copies, or supplements, of papers published elsewhere, in order to make them freely accessible to everyone even if the published versions require payment to access. So for that content, it has been argued, preservation is a non-issue: the authoritative copies are (should be?) preserved elsewhere. On the basis of the iPod problem, that is correct. Also on the basis of the iPod experience, it’s not a good position to be in when you need to replace content on any significant scale.

Take a very simple (and unlikely) case, comparable with the iTunes choice: deleting the repository with a single click, having taken no preservation precautions whatever. The repository may be theoretically replaceable, but at what time, effort and cost? And few, if any, repositories hold purely OA content, since they have diversified to include other types of content, much of which will not have managed copies elsewhere. The iPod example, and in particular the software to transfer content, should also alert us to the value of metadata. In the case of repositories, that is likely to be value-added content quite specific to the repository and distinct from any other archive.

Unfortunately preservation is not simply about taking steps to avoid inadvertent disc erasure. Yet this simple example should begin to reveal the value of well managed digital repositories, even OA repositories where the content may not be the original or ultimate copies, and justify the effort that goes into presenting and maintaining that content. Exactly how far repositories should go beyond basic precautions is not yet clear, and in time the question is likely to become more complex. Caught between justifying the effort and not scaring repositories away, we are working on it.



Linking JISC Inf11 projects to KeepIt

The focus of the KeepIt project is preservation of repositories of broad scope, and our aim is to make a difference. This is challenging, but we are not alone, and our work overlaps with many others in various ways, whether concerned with digital repositories or preservation; ideally repositories AND preservation. We have to be clever in rethinking, adapting and structuring our work to build on and use the findings of others in this area.

The recent JISC Inf11 programme meeting for new projects prompted me to elaborate four principles to guide our approach:

  1. We are concerned with the whole information lifecycle, not just the ‘preservation’ bit
  2. The institutional context matters as much as the repository context
  3. Given the breadth of scope we have to use all available resources and use external expertise in the JISC programme and elsewhere in the community wherever possible


To begin to fulfil the third point, here is a watchlist of relevant current projects from the JISC Inf11 programme:

  • allAboutMeprints eprints Southampton Web JISC
  • ArchivePress preservation (blogs) blog Web
  • Biophysical Repositories in the Lab (BRIL) research data, science (biophysics), data capture Web
  • CAVA (human communication Audio-Visual Archive) research data (A-V) Web JISC
  • CLARION research data, science (chemistry), data capture blog
  • CLIF (Content Lifecycle Integration Framework) lifecycle Web JISC
  • DIASER archive, backup, Southampton Web
  • EIDCSR (Embedding Institutional Data Curation Services in Research) research data, curation, workflows blog
  • I-WIRE (Integrated Workflow for Institutional Repository Enhancement) workflow blog JISC
  • Lifespan RADAR research data (A-V), preservation blog JISC
  • OneShare eprints, Southampton blog JISC
  • PAXS eprints, Southampton wiki JISC
  • PEKin (Preservation Exemplar at King’s) preservation Web

Please let me know if I have missed any relevant projects.

The fourth principle? That’s about designing your own repository preservation training course, and is aimed at our repository exemplars. Blame it on Dave De Roure’s presentation at the JISC meeting, with reference to the myexperiment project’s approach to empowering scientists by enabling them to design online experiments. For another post.



Let’s all talk about digital preservation

When we talk about digital preservation or institutional repositories, not everyone has the same perspective, or the same terminology, for the work we are engaged in. For example, the institutional repository will have to work with people from all parts of the institution, including IT, data centre and legal staff. A recent report might help to build a common understanding.

Building a Terminology Bridge: Guidelines for Digital Information Retention and Preservation Practices in the Datacenter, SNIA, May 2009 http://twurl.nl/6i382m (it’s a 48pp pdf)

SNIA is the Storage Networking Industry Association. If this all sounds dry and uninteresting, it is actually quite revealing.

SNIA has a vested interest in selling storage systems, but it has put this aside in the storage growth example in the executive summary: “The current process is broken, does not scale, is pure overhead, and paradoxically not where any business organization wants to be spending its time and money.”

Next we learn that not everyone is a digital librarian: “In 2008, recognition of the role and importance of metadata was elevated. Now, not only is all ESI (electronically stored information) discoverable, but its metadata may also be required. IT systems and business applications were not designed to fully utilize and preserve metadata. Changes in IT and storage practices are needed.”

When we get to the terminology, it’s not short dictionary definitions, but well researched, and often anecdotal, collections of views on a term. Recommended.
