KeepIt course module 4, Southampton, 18-19 March 2010
Tools this module: Plato, EPrints preservation apps
Presentations and tutorial exercises course 4 (source files)
In this session of KeepIt course 4 we will use EPrints preservation apps to manage large-scale storage and to display file format risks. In the course, participants were provided with test repositories. Other users will need to download the apps (plugins) and install them to run with an EPrints 3.2 repository.
Our presenters for this session are EPrints experts (don’t omit the space or everyone sniggers) Adam Field from EPrints Services and Dave Tarrant from the KeepIt project, both members of the EPrints core code developer team.
EPrints is software designed to build digital repositories, in particular, repositories of research papers produced within institutions such as universities. Since that original goal the capability of the software has expanded to enable different types of digital content to be managed within a repository. In essence, it provides a series of interfaces for content management tasks for users and administrators, for example, depositing content, managing content, and finding content.
An important development stage was reached in 2007 with the release of EPrints version 3.0, which enabled applications created independently of EPrints to be used with it, without needing the authorisation of the code manager. Among the applications developed in this way are some intended to manage the preservation workflow, applications that have evolved from the JISC Preserv and KeepIt projects. Finally, with the release this year and ongoing upgrades of existing repositories to EPrints v3.2, these preservation tools can be distributed and used more widely, and with the next iteration (v3.3 in 2011) they will be available with simpler one-click installation from the EPrints Bazaar.
Managing storage in the ‘cloud’
First we consider storage with Adam Field. The range of services providing digital storage is changing and expanding. Disc storage attached to machines and local network storage have been staples, but for large content volumes they are increasingly being supplanted by distributed storage on the Internet, or ‘cloud’ storage as it is often called. There are advantages and disadvantages to each (slides 7-10).
Another approach is to combine all storage services optimally, or ‘hybrid’ storage (slide 11). To support this EPrints has introduced a storage controller. In this way different types of content or different versions of content can be stored in different places depending on cost, value, how critical the content is, etc. Storage policies can be written to manage storage automatically based on the selected criteria (slide 14). For example, a document is to be stored locally if it is a volatile version, or in a cloud service if not (slide 15).
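To make the idea of a storage policy concrete, here is a minimal sketch in Python of the rule described above. It is not the EPrints storage controller or its XML policy syntax; the class, the `is_volatile` flag and the store names "local" and "cloud" are all hypothetical, chosen only to illustrate the decision.

```python
from dataclasses import dataclass


@dataclass
class StoredFile:
    """A file awaiting a storage decision (hypothetical model, not an EPrints class)."""
    name: str
    is_volatile: bool  # assumed flag marking a volatile version of a document


def choose_store(f: StoredFile) -> str:
    """Apply the example policy: volatile versions stay local, everything else goes to the cloud."""
    return "local" if f.is_volatile else "cloud"


if __name__ == "__main__":
    for f in (StoredFile("draft.pdf", True), StoredFile("final.pdf", False)):
        print(f.name, "->", choose_store(f))
```

A real policy would probably weigh further criteria such as cost or how critical the content is, but the shape is the same: a test on the content, followed by a choice of storage location.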
The storage controller provides an interface to manage and move content from the locations selected initially (slide 17).
In our first exercise in this session we use the storage interfaces and learn how to modify storage policies. First, within our repository (or test repository; must be EPrints v3.2 or later) we need to find the familiar EPrints admin screen, and then open the tab for Config. Tools. Here we will find a button for the Storage Manager. This displays the different storage options available to the repository manager, and indicates the number and volume of files stored in each location, with buttons to move and delete content simply from each location.
Next we will modify the storage policy. This involves some simple code editing. To access the file we return to the Config. Tools screen and open the View Configuration button to reveal an XML file (storage/default.xml). There follow three very short exercises designed to modify this XML code and change the storage policy. After each exercise you can return to the Storage Manager screen to review the changes you have made. If there are any problems, an example solution is provided on the final page of the exercise sheets.
File Formats and Risk Analysis
Next Dave Tarrant begins to explore support for managing file format risks using the EPrints preservation apps, and it will become apparent why we spent time introducing the preservation workflow in KeepIt course 3 (and recapped in this course module).
Immediately we can see elements of the preservation workflow in this presentation. The key feature here, however, is to show how a format risks management interface is built into EPrints. First we have a format classification screen (slide 4) but without any risk scores. By slide 6 we have classified these formats, hypothetically, into three broad risk categories: high, medium and low risk objects.
Actual risk scores are problematic at this stage. Although we can identify risk factors, as we saw in KeepIt course 3, we don’t yet have databases that are sufficiently complete to quantify the risks. Although a community-based linked data way forward has been proposed, such an approach still has to be adopted and developed further.
Nevertheless, we can see in principle how our interface might be used were real scores available. Slide 7 identifies some ‘medium’ risk objects in our small test repository. Should we wish to migrate these objects to a lower-risk format – and note at this stage we have not decided whether this is really necessary; we will come to that in our next session on preservation planning with Plato – the right-hand side of our screen evaluates some migration options in terms of the tools and target formats available.
For our second exercise using EPrints preservation apps we will import some test files, then identify and classify the file formats and apply risk scores.
Once again we begin with the EPrints admin screen in our (test) repository. This time we go to the Editorial Tools tab, where we find a button marked Formats/Risks. But there are no files to classify, so we import a test set (of 20 PDF files, in this case). Now when we use the Formats/Risks button we see 20 files classified as high risk, because their formats have not yet been identified. After using the Classify Object button as shown in the handout sheets, we can confirm the files are PDFs, but we don’t have a risk score. Next, for variety we deposit some GIF images, provided in a zip file, in our repository.
Now we want to add some risk scores. Again we have to access and edit code in an EPrints Config. file, this time to reset a file that mimics a format risk analysis tool, PRONOM, from the UK National Archives. We have worked with TNA for many years and this is a way of enhancing its publicly available tool in anticipation of new services such as risk scoring. We can see the effect of this code editing by reloading our Formats/Risks screen.
Note, again, these risk scores are hypothetical and are used here to illustrate the process. To emphasise this point, the next part of the exercise shows how to alter the boundaries of the risk classification. In other words, just by altering this file we can affect the risk classification.
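The effect of moving the boundaries can be sketched in a few lines of Python. This is an illustration only, not the code in the EPrints configuration file: the 0-100 scale, the boundary values and the assumption that a higher score means higher risk are all hypothetical. The one behaviour taken from the exercise is that a file with no identified format is treated as high risk.

```python
from typing import Optional

# Hypothetical boundaries on a hypothetical 0-100 scale.
# Assumption for this sketch: a higher score means higher risk.
DEFAULT_BOUNDARIES = {"high": 70, "medium": 40}


def classify_risk(score: Optional[float], boundaries=DEFAULT_BOUNDARIES) -> str:
    """Map a risk score to 'high', 'medium' or 'low' using the given boundaries."""
    if score is None:
        # Unidentified formats have no score and are treated as high risk,
        # as with the freshly imported test files in the exercise.
        return "high"
    if score >= boundaries["high"]:
        return "high"
    if score >= boundaries["medium"]:
        return "medium"
    return "low"


if __name__ == "__main__":
    gif_score = 50  # made-up score for illustration
    print(classify_risk(gif_score))                               # medium
    print(classify_risk(gif_score, {"high": 45, "medium": 20}))   # high
    print(classify_risk(None))                                    # high
```

The same score lands in a different category once the boundaries move, which is why the classifications shown here should be read as illustrative rather than as real assessments of the formats.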
By this stage we should have deposited six GIF images from the original zip file, now reclassified as high risk files. We can examine the metadata for these files, and on the same screen we have a form to select a number of these files for download ready for the next session on preservation planning with Plato.