your computation goes to where your data is, not the other way around.
Alistair Croll, Big business for big data, O’Reilly radar, 21 September 2010
They’re now called apps, after a certain other initiative; they were called plugins. As the KeepIt project moves towards its close, this is the story of one of its products, the EPrints preservation apps, and it charts a path through digital preservation and changes in repository and computer architectures that are still playing out today.
Online preservation services, format identification and bandwidth limitations
In fact, at the outset they weren’t even apps or plugins, but online services. The story begins with our predecessor, the JISC Preserv repository preservation project. At the time, in 2005, EPrints repository software didn’t support modular applications. Preserv was working with two relatively nascent institutional repositories at the universities of Oxford and Southampton, and was focussing on a preservation approach concerned with managing the formats of the digital files in these repositories. This approach, referred to as ‘seamless flow’, was conceived by another project partner, The National Archives, and proposed a preservation workflow in which a digital file format was first identified, the risks assessed, and then the file was converted, if necessary. TNA had tools for this as well: PRONOM, effectively its knowledge base of file formats, and DROID, a software tool that would scan and identify the formats of stored files.
Using these tools Preserv profiled its two partner repositories; it revealed the now classic open access institutional repository format profile dominated by versions of PDF and a long tail of other formats. That apart, it didn’t seem like a great step forward, and there were limitations. We weren’t able to integrate the repository and preservation tools, and we didn’t have a testbed for large-scale content. Beyond the two repositories, it seemed we didn’t have much further content to work with anyway.
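To make the idea of a format profile concrete, here is a minimal sketch in Python. It assumes the format of each file has already been identified (by a tool such as DROID) and simply tallies the results; the function name and the sample data are invented for illustration and are not taken from the Preserv code.

```python
from collections import Counter


def format_profile(identified_formats):
    """Return (format, count, share) tuples, most common format first."""
    counts = Counter(identified_formats)
    total = sum(counts.values())
    return [(fmt, n, n / total) for fmt, n in counts.most_common()]


# Hypothetical identification results, one entry per stored file.
sample = ["PDF 1.4"] * 6 + ["PDF 1.3"] * 3 + ["HTML", "JPEG", "MS Word"]

profile = format_profile(sample)
for fmt, n, share in profile:
    print(f"{fmt:10} {n:3d} {share:6.1%}")
```

Even in this toy data the shape described above appears: a couple of PDF versions account for most files, followed by a tail of formats with one file each.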
What we did have, however, was ROAR, the Registry of Open Access Repositories, created and maintained by then-project developer Tim Brody. ROAR is a comprehensive machine index of repositories, one of the early OAI services that continues today. Brody had the idea to extend ROAR to generate format profiles of the repositories it indexed. The approach he devised did not work with all repository types, but it worked with those based on the main repository software platforms, DSpace and EPrints. Profiles were generated for over 200 repositories, and those PRONOM-ROAR profiles, as they were to be called, suitably updated, can still be found on ROAR today.
There are problems with this approach, bandwidth, scalability and cost being the primary ones. As Croll noted in his O’Reilly blog post, “a paper by the late Jim Gray of Microsoft says that, compared to the cost of moving bytes around, everything else is free.”
As a result, large files over 2MB (multimedia objects, for example) were not included in the profiles. Further, we had only established the first part of the ‘seamless flow’ preservation workflow, format identification. What to do with these files once we knew what they were was the next question, and the answer was likely to put more pressure on our computing infrastructure, which at that time separated repositories from our online services.
Repository plugins and large storage systems
At the start of Preserv 2 in 2007 things began to change. EPrints version 3 had launched earlier in the year and now supported plugins (or apps) developed independently of the core code. Oxford and Southampton became strong supporters of the Sun PASIG (Preservation and Archiving Special Interest Group) forum, and both took delivery of the new Sun Honeycomb, an innovative large-scale storage machine being re-positioned for digital library applications. Early in 2008 Dave Tarrant joined as Preserv 2 project developer at Southampton, and was introduced to Ben O’Steen, project developer at Oxford.
At this first meeting O’Steen proposed using the then new OAI-ORE approach to demonstrate how files could be moved between different repository platforms, Fedora and EPrints, which were used at Oxford and Southampton, respectively. After all, if one approach to digital preservation is the ability and flexibility to copy and move content around more easily, then this seemed like a possible step forward for repository preservation. Working with Tim Brody, they went on to win the repository challenge prize at the Open Repositories 2008 international conference, demonstrating that digital data can be moved between storage sites running different software.
We speculated that the approach might have important implications for the evolution of repository software and architectures: “Binding objects in this manner would allow the construction of a layered repository where the core is the storage and binding and all other software and services sit on top of this layer. In this scenario, if a repository wanted to change its software, instead of migrating the objects from one software to another, we could simply swap the software.” It hasn’t happened in a real repository yet.
With the delivery of the new Honeycomb machines, what we had in mind was a large data store on which we could run various repository software and preservation applications, closely tying storage and computation to overcome the limitations of PRONOM-ROAR.
The corporate world runs on a different dynamic to research organisations, and Sun withdrew its Honeycomb later in 2008, more for marketing reasons than technical capability. Although the machines were perfectly serviceable and had the promise of support for a further five years, they could no longer be part of a long-term preservation strategy and eventual migration to alternative systems would need to be planned.
By now the emerging phenomenon in computing storage was the ‘cloud’, where storage and computation services can be accessed using network services and the Web. It has been suggested that network storage and computing will grow to rival the scale of other essential utilities such as electricity and power. Another parallel, it is predicted, will be the widespread switch from local to network computation in the same way that local electricity generation mushroomed into national power networks 100 years ago.
Investment in cloud services by Amazon, Google and other major Internet companies reinforces those predictions. As Croll added: “Amazon’s S3 large-object store, not its EC2 compute service, is core to the company’s strategy: your computation goes to where your data is, not the other way around.”
With the demise of the Honeycomb, Tarrant and O’Steen, working with Sun and other partners, devised a network storage solution that would adopt the fault-tolerant architecture of the Honeycomb. A bigger question, however, would be uptake of network services by institutions and research organisations such as universities. Concerns of such organisations centre on reliability, trust and, crucially, control of their data. The concept of an institutionally-managed private cloud storage network was proposed. Again, these ideas have yet to play out in the larger scheme.
It remains the case that most institutional repositories run on local computer boxes, typically managed by information services support within the institution. Apart from hosted repository services, cloud support appears not to have been extensively utilised yet, although it would not necessarily be obvious if it were, and no doubt some of the concerns raised above remain.
Separately, Dave Tarrant went to a seminal PLANETS project tutorial in Vienna and met the team behind the Plato preservation planning tool. Word was this was a promising tool from this major EU-funded project, and Dave confirmed this. This meeting would seed an idea to fill a critical element of the EPrints preservation workflow.
Tarrant was still working with the Honeycomb. Unlike Brody earlier, he could harvest data to a large-scale storage facility and run experiments with the preservation tools, updating and completing the format profiles, and extending the number of profiles.
Controlling storage and format risks within EPrints
Prompted by Adrian Brown, then in charge of digital preservation at the National Archives, we began to conceive the idea of smart storage, where we could run storage and computation in close proximity: “Since at the core of any preservation approach is storage, we call this approach ‘smart storage’ because it combines an underlying passive storage approach with the intelligence provided through the respective services.”
Taking advantage of the EPrints architecture for running plugins and using his experience of cloud services, Tarrant began to produce applications that would manage storage selection and run format risk tools, including DROID, within an EPrints repository. He demonstrated interfaces within EPrints designed to manage these tasks. One particular feature of the emerging interface for format management was its traffic-light based identification of risk based on high, medium and low risk objects. This seemed like a good way of managing risk and alerting repository managers to files considered likely to pose preservation problems.
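The traffic-light idea can be illustrated with a short Python sketch. It is not the EPrints plugin code: the numeric risk scores, the thresholds and the function names are all assumptions made for this example, since the scores in the real interface were, at this stage, hypothetical too.

```python
def traffic_light(risk_score, medium_threshold=0.3, high_threshold=0.7):
    """Map a risk score in the range 0..1 to a traffic-light colour.

    The thresholds are illustrative defaults, not values from EPrints.
    """
    if risk_score >= high_threshold:
        return "red"      # high risk
    if risk_score >= medium_threshold:
        return "amber"    # medium risk
    return "green"        # low risk


def group_by_light(format_scores):
    """Group formats under the colour their risk score maps to."""
    groups = {"red": [], "amber": [], "green": []}
    for fmt, score in format_scores.items():
        groups[traffic_light(score)].append(fmt)
    return groups


# Hypothetical scores for three formats found in a repository.
scores = {"PDF 1.4": 0.1, "MS Word 97": 0.5, "Obscure CAD format": 0.9}
lights = group_by_light(scores)
print(lights)
```

A repository manager would then look first at whatever lands in the red group.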
There remained a gap in the format preservation workflow. We could identify and profile our content more effectively than before, but we still had no formal basis on which to evaluate risk for our known format types. Our traffic light risk scores were hypothetical. Although there were plenty of rules of thumb – such as a preference for open source, open standard formats – these did not give the whole picture.
In a rapid demonstration, two sources of data on file format risks, from PRONOM and DBpedia, were combined in a Semantic Web, or linked data, fashion, revealing more risk factors than either data source alone. The aim was to prompt an open data approach to file format risks and sharing among organisations that produce this information, and it is an approach that still has promise of adoption as the preservation community seeks to build a general format registry.
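The benefit of combining sources can be shown with a small Python sketch. The demonstration itself used linked data; here plain dictionaries stand in for the two sources, and the formats and risk factors listed are invented placeholders, not actual PRONOM or DBpedia content.

```python
def combine_risk_factors(*sources):
    """Merge per-format risk factors from any number of sources.

    Each source maps a format name to a set of risk factor labels.
    """
    combined = {}
    for source in sources:
        for fmt, factors in source.items():
            combined.setdefault(fmt, set()).update(factors)
    return combined


# Hypothetical extracts from two registries (placeholder values only).
source_a = {"FormatX": {"proprietary", "no open specification"}}
source_b = {"FormatX": {"proprietary", "few rendering tools"},
            "FormatY": {"superseded version"}}

merged = combine_risk_factors(source_a, source_b)
for fmt in sorted(merged):
    print(fmt, sorted(merged[fmt]))
```

In this toy case the merged view of FormatX carries three distinct factors where each source alone listed two, which is the effect the demonstration set out to show.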
The missing link: Plato and preservation planning
While a registry can provide a factual basis for evaluating format risks, this is not the only angle in implementing a format-based preservation workflow for digital repositories. Other factors may include institutional priorities, policies, costs, and local technical factors. What’s needed is a more flexible system for assessing risk for digital preservation, a process known as preservation planning. Plato is such a system. In EPrints we now had tools to identify formats, using DROID, and an interface categorising (hypothetical) risks. We wanted to connect the two. The output from Plato is a preservation plan in the form of an XML document. So a button was added to the EPrints format risk management interface to upload and act on this plan. Creating a preservation plan in Plato may not be easy, but once uploaded to act on a large body of repository content the plan is powerful, and will continue to monitor and act on new content as the repository grows.
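As a rough sketch of what "upload and act on a plan" might involve, the Python below reads a plan from XML and looks up an action for each file by its identified format. The XML layout here is invented purely for illustration and does not reflect Plato's actual schema; the function names and sample files are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, simplified plan format; NOT the real Plato XML schema.
PLAN_XML = """
<plan>
  <rule format="GIF" action="migrate" target="PNG"/>
  <rule format="PDF 1.4" action="keep"/>
</plan>
"""


def load_rules(plan_xml):
    """Parse the plan into {format: (action, target)}."""
    root = ET.fromstring(plan_xml.strip())
    return {rule.get("format"): (rule.get("action"), rule.get("target"))
            for rule in root.iter("rule")}


def actions_for(files, rules):
    """Pair each (filename, format) with the action the plan prescribes.

    Formats the plan does not mention are flagged for manual review.
    """
    results = []
    for name, fmt in files:
        action, target = rules.get(fmt, ("review", None))
        results.append((name, action, target))
    return results


rules = load_rules(PLAN_XML)
deposits = [("logo.gif", "GIF"), ("paper.pdf", "PDF 1.4"), ("data.xyz", "Unknown")]
for row in actions_for(deposits, rules):
    print(row)
```

Because the rules are held separately from the files, the same lookup can be run again whenever new deposits arrive, which is the sense in which a plan keeps acting as the repository grows.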
“I will be honest and say that for a while I was quite mystified by Plato. It seemed like a worthy cause but at first inspection the tools seemed so complicated that, rather than putting preservation within reach, they would have the contrary result of making it more difficult. Only now it’s finished has the penny finally dropped.”
William Kilbride, Editorial: Preservation Planning on a Spin Cycle, What’s New – Issue 28, Digital Preservation Coalition, August 2010
Refocussing on repositories: making the tools fit for preservation exemplars
With the start of KeepIt in 2009, by design and prompting by JISC we focussed again on repositories, in this case our preservation exemplars. The stories of these exemplars are told separately. These repositories were carefully chosen to exemplify different data collection policies, across research, teaching, science and arts. What they had in common as repositories, not entirely by design but not entirely by coincidence either, was that they all run on EPrints.
The aim of the project was not to impose big, broad preservation strategies on these repositories, or do preservation for them, but to introduce a range of approaches from which each repository could choose to suit the needs of its institution and its content. We already knew that imposing the approaches specified by preservation specialists would be difficult for generalist repository managers and their often small or part-time support teams. Even if preservation services were available to repositories, their managers would need to know enough about preservation to identify their own institutional needs and how to specify these to the providers. That was the rationale for our extensive KeepIt course, and we were able to draw on a range of available preservation tools, over 70% produced by JISC projects.
Given the base of EPrints shared by exemplars, Dave Tarrant was able to build on the ‘smart storage’ approach and EPrints plugins and interfaces we had elaborated earlier. To optimise this approach required some additions to the core EPrints code, which Tarrant contributed, and which became available at the beginning of this year as the latest version of the software, EPrints v3.2.
Repositories are increasingly large, complex and critical systems requiring a formally managed systems approach, so as with better known computing platforms and operating systems, upgrades to new versions can take time and require consultation. Some six months after release, three of our exemplars now run v3.2, have their own preservation apps installed and are producing, analysing and acting on their own format profiles. (The fourth exemplar runs a customised version of EPrints that it claims is unsuited to this upgrade; we believe it’s less of a problem than they do.)
We’ll find out what these format profiles look like in the next blog post, but we can say they are each quite distinctive, and most are distinct from the classic open access format profile illustrated above, with consequent implications for different preservation strategies.
It doesn’t stop at EPrints 3.2. Easier, one-click installation is promised next year in version 3.3, which will bring the EPrints Bazaar, modelled on the Apple App Store, and will include the updated preservation apps (plugins).
Learning about and implementing EPrints preservation apps
There is now a remarkable range of resources to support preservation for EPrints repositories, not just apps. Dedicated courses have been run across Europe to introduce participants jointly to EPrints preservation apps and Plato, and all these events are documented. The most complete documentation is probably from the KeepIt course 4 held over two days earlier this year in Southampton. Want a shorter introduction? Try this version run over 1.5h in Madrid during July. Or there are single-day versions of this EPrints-Plato course from Corfu (ECDL, Sept. 2009) and the final presentation in Vienna (iPres, Sept. 2010). All are suitable for independent study and include practical work for users to follow.
To use EPrints preservation apps you need to be running a repository based on v3.2 or later, or try a test repository in the cloud. If running 3.2, download the preservation plugins. There is a readme file in the download which explains how to install the plugins ready for the training exercises. For others not currently running EPrints, there are two test repositories running on the Amazon cloud services providing ‘machine images’ (AMIs), instances of EPrints, which people can launch for training purposes.
Thanks to our brilliant developer team who have energised both Preserv and KeepIt projects with their innovative ideas. Thanks as well to the developers of the tools that have been integrated into the services described here (notably DROID and Plato), highlighting the real benefit of open sharing and open source software that is characteristic of this developer community. Every idea that has been built, worked through, adapted and embedded into this story has played a role in the continuing evolution of digital repository preservation.
- Preserv Final report (03/2007), Laying the Foundations for Repository Preservation Services
- Preserv 2 Final report (07/2009), Towards repository preservation services
- KeepIt Final report (10/2010), in preparation