Data repositories: the next new wave

Should institutional repositories do data curation? Underlying this question is: what is a repository, and is that changing?

Without going into full detail, there are some straws in the wind. First, back when I was working for the Repositories Support Project, Tony Hey, VP Microsoft and formerly head of our school (Electronics and Computer Science) at Southampton University, gave the keynote at the RSP Repository Softwares Day: Repositories – Past, Present and Future (slides). Look in particular at the part on the future.

Unprompted, I dashed off a report on the meeting for the RSP Web site. It wasn’t used – clearly I wasn’t effusive enough about the event – but my main point was this: If repositories are the new wave of scholarly communication, the next new wave was glimpsed in a keynote presentation by Tony Hey. He pointed to ‘cloud computing’ as the way forward, with the potential for recasting the way repositories are structured and managed.

Hey’s view was reinforced by Dave Flanders of JISC in a closing panel session. So how does the cloud transform repositories? The cloud provides the technical infrastructure, so that repositories don’t have to, and it offers flexibility, leaving repositories to focus on what they want to do, whether this is to restructure repositories to reflect institutional priorities, to be led within institutional structures such as schools and departments, or to manage different digital types such as research materials and learning objects. Put the repository in the cloud and then ask the questions again, is how Flanders saw it.

More recently, at the end of July, I went to the Edinburgh Repository Fringe to join the DataShare and DCC data workshops. Instead of hearing about the latest exciting repository developments in other sessions, we ploughed into data management documentation. While I didn’t find the open forum of the DataShare workshop to be particularly enlightening, the underlying support document, the policy-making guide produced by the project, is helpful. What’s good is that this guide recognises that everyone produces data; it’s not just for specialists, and the multilayered presentation makes it more approachable.

Next morning it was time for Digital Curation 101 ‘Lite’. Despite lasting three hours, this was clearly a skim, yet somehow it gave a sense of the full 101 course that the Digital Curation Centre has compiled. I was impressed. The short team exercises revealed that there are others in the institutions who are approaching these issues from a completely different angle to repositories. There’s the clue.

Nevertheless, I came away thinking these data management issues, which are institution-wide and transformational in scale, are not going to happen in the next year, the remaining timeframe of the KeepIt digital preservation project. Our exemplar repositories are not going to be transformed in that time. Perhaps I should drop it.

Then Dorothea Salo unwittingly opened my mind to the prospect again. A major theme of Salo’s latest blog incarnation is data curation and it connects well, rather unusually, with institutions and their repositories. If not now, when? (27 Aug 2009): “who’s going to do data curation … we can have a pretty good idea who’s not going to do it: anybody who isn’t right this very minute planning to do it. This is no time for analysis paralysis.”

So when we convened earlier this month to redraft our project training plan I resolved to put the case for including data management to our exemplar repositories. I noted that each of these repositories exemplifies a different aspect of data management. I suggested that JISC and DCC, as well as UKRDS, research funders and, eventually, institutions will be the drivers for this. Coincidentally, immediately after the meeting a Nature editorial came to light saying essentially the same thing. How can we not go forward after this admonishment?

So, finally, here is my take on how repositories may be changing. We have to separate people and content or data from systems and infrastructure. At the moment we tend to take a systems-based approach (e.g. is it EPrints or DSpace) to managing a thinly-defined type of content, and the focus is the repository. Yet as these repositories grow institutionally, that is, to represent and present all the substantive activities and outputs of the institution, we can see the expansion and transformation of the system in the ‘cloud’, and the emergence of intermediate services to manage repository systems within this flexible infrastructure.

There are already many people supporting systems and IT infrastructure in institutions; there are fewer people designated to manage data and support data creators. We can already see in our exemplar repositories the types of data that might be managed: arts, science (crystallography), teaching materials, research papers, etc., and probably within disciplines and sub-disciplines for some data types. The people responsible for these repositories tend to be called repository managers, but they are not systems experts; they are data experts. We need many more data experts across the institution.

As repositories grow they will essentially become teams of data experts working with data producers. There will be repository managers, but they will be team leaders, coordinating data and systems teams, rather than the repository managers we know today. What kind of people will they be? Salo has an idea (The accidental informaticist, 17 Aug 2009): “can-do souls comfortable with a lot of uncertainty and able to learn fast.”

It is my expectation that we need to allow the repository managers of our exemplars to develop as people rather than simply as fronts for repository systems. This project is unlikely to see that process complete, but if we have the vision we can at least make a start. It will be a major topic and a big challenge to embrace it, but at least we know who to turn to for help.


2 Responses

  1. Miggie Pickton says

    You have distinguished between the roles of ‘people and content or data’ and ‘systems and infrastructure’. How do you think that repository policy and procedure fits with these? I suspect that the solution to repository data curation lies in a combination of all of these. Moreover, if we don’t manage to effect a full transformation within our institutions by the end of the KeepIt project, we should still aim to put the policy and procedure in place so that subsequent transformation will be inevitable. At the very least we need to know what we should be doing even if we haven’t the resource (yet) to do it. By all means develop me – as a repository manager first, as a person next (but that may take longer 🙂 )

Continuing the Discussion

  1. More on data curation for repositories – Diary of a Repository Preservation Project linked to this post on November 12, 2009

    […] we have considered the case for data curation in KeepIt, and here is more grist for that mill. Dorothea Salo has posted her recent presentation […]
