

KeepIt preservation exemplar repositories: the final countdown

Repositories Are Doin’ It For Themselves (with due deference to Annie Lennox)

“Up to a point, Lord Copper” (with due deference to Evelyn Waugh)

When faced with the challenge of doing digital preservation, what will institutional repository managers choose to do? The KeepIt project aimed to find out, and now we have some answers.

Digital preservation is not a straightforward choice. The obvious answer, often provided by those without such responsibility, is to preserve everything forever. The practicality is different.

Before they even begin, some repository managers in higher education institutions face contradictory voices arguing that it is too soon for preservation to be a priority. This can be attributed to the early stage many institutional repositories have reached in their anticipated lifecycle: many are less than five years old. For many, growing content volumes is a bigger concern.

Of more direct impact for repositories are economic constraints: limited resources, staffing and access to specialist expertise. The primary drivers of resources for digital preservation are national heritage and legislation, neither of which has much bearing on institutional repositories, yet.

Faced with these views, what would repositories actually choose for themselves? How can we tie these apparently disparate issues together? Working with four selected exemplar repositories in the UK for 18 months, KeepIt set out to support and inform, but not prescribe. We would develop some repository preservation tools – based on EPrints, since the exemplars all used this repository software – and run a structured course based on practical preservation tools. The exemplars themselves helped to design and shape the course.

“The lens of digital preservation can provide the vision for shaping the whole repository context.”

We surveyed each exemplar initially to identify where preservation may have a role to play. We then asked each exemplar to specify their preservation objectives.

What we found by the end of the project was that different repositories make different preservation choices, and that they are entirely pragmatic in these choices, not seeking to over-extend or over-emphasise preservation among their active plans, but to phase different tools and approaches into their strategies. There is no single approach.

You can discover what each exemplar chose, in their own voices:

So that we might extend these results beyond these specific exemplars to other repositories, it is helpful to relate these choices to factors that might characterise the exemplars.

Institutional context

First, what these repositories have in common is that they are, or aspire to become, institutional in scope, or part of the institutional infrastructure for disseminating the digital outputs of the academic enterprise. In this respect they have to understand the institutional context in which they operate, which resolves to scope, policy, resources and, ultimately, risk, as well as the technical infrastructure. In other words, this is nothing short of how a repository manager fulfils the objectives set for the repository by both the senior management of the institution and the producers of the academic content the repository chooses to target, taking both a top-down and a bottom-up view of the institution.

Tools such as DAF and AIDA show us this context is wider than we might have expected. As well as auditing the scope of academic outputs currently being produced within the institution, we have to anticipate the changing profiles of these outputs: the nature of that content, its impact and usage, and the volumes of the different types of content within the profile. One of the curious and less appreciated effects of digital preservation is that while it tends to be applied to current content, it also involves anticipating future content.

When it comes to the real institutional context, we find it is rarely as well defined as we would like or expect. This means that repositories have in some cases sought to become the drivers of, for example, policy. AIDA helps in a more subtle way, allowing the repository to identify institutional context that is either explicit or latent.

“Our exemplars show there is a ready buy-in for preservation tools, providing these are set within the repository.”

Limited resources, most obviously financial but also staffing and access to specialist preservation expertise, will define the extent of what can be done. There is no point raising expectations if resources don’t permit. Translating expectations into quantified resources therefore matters, and lifecycle costing and benefits analysis tools, such as LIFE and KRDS, are needed.

The final part of the institutional context is that once users buy into the repository, by providing content in response to various prompts such as open access advocacy or policy, they will begin to look for trust and reassurance that their content is well looked after. A similar view will permeate the institutional management who invest in the repository, either tangibly or in terms of goodwill, but because they are managers rather than content producers their perspective will centre on risk and how that risk could affect the institution; that is, what could go wrong? This is where a tool like DRAMBORA can help, by highlighting the risks that can impact the repository.

With this in mind we can now turn the question around, from which preservation tools did the exemplars choose, to what do the tools chosen tell us about the repositories?

First, a repository that applies DAF is likely to be seeking to expand or confirm its scope. It may be a relatively new repository, or one that is responding to new management prompts or is seeking to influence or inform future management decisions affecting the repository.

A repository that is concerned with tools to assess the impact of costs on preservation probably has quite a large or rapidly growing volume of content and/or an uncertain or cyclical financial income.

A repository that chooses a risk analysis tool such as DRAMBORA probably has a substantial body of content, is confident in its targeting of content providers and the type of content it presents, and is beginning to think of this content more in terms of required or implied responsibilities. This might be in response to management concerns. As we have also seen, this approach can pre-empt such concern and can instead be turned to advantage. Highlighting risks can engage management in supporting the necessary actions and providing the necessary resources to minimise the impact of the identified risks.

Technical infrastructure

Technical approaches tend to dominate preconceived ideas of what digital preservation involves. We have just seen how this must be balanced with the substantial element of institutional context. Nevertheless, our exemplars show there is a ready buy-in for preservation tools for storage, format identification, preservation planning and format transformation, providing these are set within the repository. That is what we did with the EPrints preservation apps and, because the exemplars all use EPrints, we were able to set them up with the facilities offered by the apps. In effect, the buy-in involved allowing us to accelerate the natural upgrade cycle of their repositories, installing the latest version of EPrints (v3.2), which the apps require, within the timescale of the project.

As a first-stage use of these apps, format profiles were produced for each exemplar. Admittedly, the extent of each exemplar’s involvement in producing the profiles varied: some produced them themselves, others were produced by the KeepIt developer, Dave Tarrant. The exemplars, and many others at a series of international workshops, were introduced to the full format management workflow supported by the apps, but none has yet gone beyond identifying the format profiles. It is reasonable that they should get used to producing, interpreting, updating and acting on the profiles before using these as the basis for preservation planning.

“What we found was different repositories make different preservation choices, and that they are entirely pragmatic in these choices. There is no single approach.”

Format profiles are not just internal tools for preservation, however. They reveal what we might now recognise as distinctive fingerprints of different types of repository. They tell us much more about the repository than just file types. Among the exemplars we have an emerging category of repository – focussed on collecting teaching and learning materials, or OERs (open educational resources) – but not yet a formal consensus on how these should be organised, institutionally or nationally, by content types, etc.

A repository that uses format profiling not just to examine its own files but those of other similar repositories may be seeking to be at the forefront of defining the genre of that type of repository, or it may be reflecting uncertainty and seeking to identify a community consensus on how such repositories should be organised. This is a real application of the fingerprint characterisation offered by format profiles. Unfortunately, what was found initially was that other repositories don’t have the facility to produce such profiles, because they don’t have tools installed, so full comparison will have to wait. Without the right tools to run with your repository it’s not easy.


This is what our exemplars show us about how different types of repository at different stages of development might tackle preservation when provided with appropriate tools. Don’t expect a revolution, nor expect preservation suddenly to become a top priority for repositories. They will engage with preservation at a pace and in a way that is appropriate for their needs and helps them make progress with their current problems. Would we expect any different? What we can learn from these exemplars, because they represent real repositories and real cases, not prescriptions, is where and how the approaches they have chosen might be adopted by other repositories, depending on the stage of development reached, scope, institutional (and perhaps national) context, and technical platform (repository software).

As we can see, digital preservation is not just about assuring future access to the content a repository has already acquired. Content growth and preservation, risk and resources: each pair is often two sides of the same issue. The lens of digital preservation can provide the vision for shaping the whole repository context.


This is likely to be the last substantive Diary entry from the KeepIt project. Our time is already up. Thanks for reading and commenting on this blog. To all digital repositories, keep up the good work, and do all you need to ensure your content remains accessible.


Exemplar preservation repositories: comparison by format profile

How format profiles can reveal potentially characteristic fingerprints for emerging types of repository.

Format profile long tailNot all digital repositories are the same. Nor are all institutional repositories the same. In fact, the differences between the types of repositories emerging recently can be surprisingly large. In KeepIt we’ve been investigating how these differences might affect digital preservation practices. We now have results and we shall compare some of those here.

There are various ways in which these differences can be seen. One is in the tools that have been adopted by our exemplar repositories – covering arts and sciences, education and research – as they begin to explore the application of digital preservation. In the first instance these tools were selected from those covered in the KeepIt course. It was expected that most would choose to explore the use of one or two tools initially, depending on their priorities. We can see from their reports they have chosen different paths – from data scoping, to costs, and risk assessment – and each can be seen to be an appropriate and revealing choice when you examine the issues faced by the respective exemplars.

“Evidently, like an arts repository (perhaps predictably) and like a science data repository (less predictably), it seems the emphasis of a repository of teaching resources may be visual rather than textual.”

A more direct way to view and understand the differences between repositories with different content remits is to look at the range of file types they manage. This format profile is the starting point for preservation plans and actions. Why formats matter for preservation was covered in the KeepIt course. One approach that has been adopted by three of the four exemplars, because we designed and implemented this for them, is the EPrints preservation ‘apps’. This tool bundles a range of apps, including the DROID file format identification tool from The National Archives, to present a format, or preservation, profile within a repository interface. Here we will reveal and compare the profiles of the four exemplars.

Format profiles past and present

First, we should recall that we have been producing format profiles in previous projects, using earlier variants on the tools. What we have now are more complete and distinctive profiles.

“One major similarity between the exemplars and earlier format profiles is the ‘long tail'”

One major similarity we can note, however, between the exemplars and earlier profiles is the dominance in each profile of one format, measured as the total number of files stored in the repository using that format, followed by a steep power-law decline in the number of files per format – the ‘long tail’. For open access research repositories the typical profile is dominated by PDF and its variants and versions. This has been known for some time from our previous work. In the case of our KeepIt exemplars only one, the research papers repository, has this classic PDF-led profile. We can now reveal how the others – a science repository, an arts repository, and an educational and teaching repository – differ, and thus begin to understand what preservation challenges they each face.
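The shape of such a profile is easy to sketch: count the files per identified format and sort descending. A minimal sketch in Python, with invented format labels standing in for real DROID output:

```python
# Minimal sketch of how a format profile is assembled: count files per
# identified format, then sort descending. The format labels are invented
# stand-ins; real profiles come from a tool such as DROID.
from collections import Counter

def format_profile(identified_files):
    """identified_files: iterable of (filename, format_label) pairs."""
    counts = Counter(fmt for _, fmt in identified_files)
    # Sorted by descending count, the profile shows one dominant format
    # followed by a rapid fall-off: the 'long tail'.
    return counts.most_common()
```

Plotting the resulting counts in order gives the characteristic head-and-tail curve seen in the exemplar figures below.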

Producing format profiles

Before we do this, bear in mind these general background notes on how the profiles were produced. For repositories of the scale we have been working with here, this is now a substantial processing task that can take hours or even days to complete.

For three repositories the counts include only accepted objects and do not include ‘volatile’ objects. The fourth (University of the Arts London) includes all objects, including those in the editorial buffer and volatiles. Repositories use editorial buffers to moderate submissions; depending on repository policy there may be a delay between submission, acceptance and public availability. Volatiles are objects that are generated on demand by the repository – an example would be thumbnail previews, which provide an instant but much-reduced view of the object.

These are growing repositories, so the profiles must be viewed as temporary snapshots for the dates specified. They are provided here for illustration. For those repositories that have installed the EPrints preservation apps, the repository manager is provided with regular internal reports including an updated profile, and will need to track the changes between profiles as well as review each subsequent profile.

Understanding and responding to format profiles

We also need to understand some features of the tools when reviewing the results. In these results we have ‘unknown’ formats and ‘unclassified’ formats. Unclassified files are those still awaiting classification: they may be new files added since a profile scan began (scans can take some time) or since the last full scan.

“The number of unknown file formats is likely to be a major factor in assessing the preservation risk faced by a repository”

More critical for preservation purposes are files with unknown formats. To identify a file format a tool such as DROID – an open source tool integrated within the EPrints preservation apps – looks for a specified signature within the object. If it can’t match a file with a signature in its database it is classified as unknown. In such cases it may be possible to identify the format simply by examining the file extension (.pdf .htm .gif, etc.). In most cases a file format will be exactly what it purports to be according to this extension. The merits of each approach, by format signature or filename extension, can be debated; neither is infallible, nor has the degree of error been rigorously quantified. It is up to the individual repositories how they interpret and resolve these results.
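The two identification routes can be contrasted in a simplified sketch. The signature table below is a toy stand-in for PRONOM/DROID, which match far richer byte patterns, and the extension map is equally abbreviated:

```python
# Contrast of the two identification routes discussed above: internal byte
# signature first, filename extension as a less trustworthy fallback.
# Both lookup tables are toy stand-ins, not real PRONOM data.
import os

SIGNATURES = {
    b"%PDF": "Portable Document Format",
    b"\x89PNG": "Portable Network Graphics",
    b"GIF8": "Graphics Interchange Format",
}
EXTENSIONS = {
    ".pdf": "Portable Document Format (by extension)",
    ".htm": "Hypertext Markup Language (by extension)",
}

def identify(filename, head):
    """head: the leading bytes of the file; filename is used for fallback."""
    for magic, fmt in SIGNATURES.items():
        if head.startswith(magic):
            return fmt
    # No signature matched: fall back to the extension; failing that,
    # the file joins the 'unknown' category.
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext, "unknown")
```

Neither route is infallible, as noted above: a malformed file may defeat the signature test yet carry an accurate extension, or vice versa.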

The number of unknowns will be a major factor in assessing the preservation risk faced by a repository and is likely to be the area requiring most attention by its manager, at least initially until the risk has been assessed. We believe that in future it will be possible to quantify the risk of known formats, and to build preservation plans to act on these risks within repositories.

For formats known to specialists but not to the general preservation tools, it will be important to enable these to be added to the tools. When this happens it will be possible for the community to begin to accumulate the factors that might contribute to the risk scores for these formats. As long as formats remain outside this general domain, it will be for specialists to assess the risk for themselves. We will see examples of this in the cases below.

Producing format profiles is becoming an intensive process, and subsequent analysis is likely to be no less intensive.

Science data repository (eCrystals, University of Southampton)

A specialised science data repository is likely to have file types that a general format tool will fail to recognise. For this repository of crystal structures we anticipated two such formats – CIF and CML – and we reported how signatures for these formats were added to the identification tool. What we can see in this profile is how successful, or not, these signatures were. That is, successful for CIF, but only partially successful for CML.

For this repository, which uses a customised version of EPrints and therefore has not so far installed the apps, we ran the tool for them over a copy of their content temporarily stored in the cloud. Figure 1 shows the full profile for this repository, including unknowns (in red), those formats not identified by DROID but known to EPrints (showing both the total and the breakdown in yellow), as well as the long tail of identified formats. All but two CIF files were identified by DROID. Had all the instances of CML been recognised it would have been the largest format with most files (adding the yellow and blue CML bars), but almost half were not recognised by DROID.


Figure 1. eCrystals: full format profile including formats 'unknown' to DROID and the repository (in red), the breakdown of those classified by the repository (yellow bars), as well as the long tail of formats classified by DROID. Chart generated from spreadsheet of results (profile date 1 Oct. 2010)

As it stands, the format with the largest number of files known to DROID is, interestingly, an image format (JPEG 1.01). We will see this is a recurring theme of the emerging repository types exemplified by our project repositories.

Also with reference to the other exemplar profiles to follow, it is noticeable that this profile appears to have a shorter tail than the others. However, in this case we can see that ‘unknown’ (to DROID and EPrints) is the largest single category, and when this is broken down it too presents a long tail (Figure 2) that is effectively additive to the tail in Figure 1. These include more specialised formats, which might be recognised by file extension.


Figure 2. eCrystals 'unknown' formats by file extension (profile date 1 Oct. 2010)

As explained, these unknowns will clearly need to be a focus for the repository managers, although in preliminary feedback they say that many of these files are “all very familiar, standard crystallography files of varying extent of data handling that often get uploaded to ecrystals for completeness.” This is reassuring, because file formats unknown to the system, the manager or the scientists could be a serious problem for the repository. Even so, as long as such formats remain outside the scope of the general format identification tools, the managers will need to use their own assessment and judgement to assure the longer-term viability and accessibility of these files.

Arts repository (University of the Arts London)

What’s the largest file type in an arts repository? Perhaps unsurprisingly it’s an image format, in this case led by JPEGs of different versions. As can be seen in Figure 3, the number of unknowns, highlighted among the High Risk Objects, is the fourth largest single category in this profile and so requires further investigation.


Figure 3. Format/Risks screen (top level) for UAL Research Online repository. This screenshot of the profile was generated by the repository staff from the live repository using the installed tools. Date 13 September 2010. (Click on image for larger version with more legible format labels)

Once again there is a long tail (Figure 4).


Figure 4. Format/Risks screen (long tail) for UAL Research Online repository. This profile was generated by the repository staff from the live repository using the installed tools. Date 13 September 2010. (Click on image for larger version)

First indications in Figure 5, which expands the high risk category, suggest many of these will turn out to be known formats that DROID has not recognised. It may be possible to resolve and classify many of these by manual inspection, the last resort of the repository manager to ensure that files can be opened and used effectively.


Figure 5. High risk objects (top level examples) in UAL Research Online repository. This profile was generated by the repository staff from the live repository using the installed tools. Date 13 September 2010.

Teaching repository (EdShare, University of Southampton)

EdShare repository manager Debra Morris has already reflected on this profile. The first notable feature of the profile (Figure 6) is that the largest format is, again, an image format. Evidently, like an arts repository (perhaps predictably) and like a science data repository (less predictably), it seems the emphasis of a repository of teaching resources may be visual rather than textual.


Figure 6. EdShare: largest formats by file count (top half of long tail). Chart generated from spreadsheet of results (profile date 14 Oct. 2010)

Another feature of the profile is the classification of LaTEX (Master File), the second largest format in this profile. Until now this format was unknown to DROID, but a new signature was created and added to our project version of DROID, in the same way as described for CIF/CML (and was submitted to the TNA for inclusion in the official format registry). The effect was to reduce the number of unknowns from nearly 2500 to c. 550, instantly clarifying and reducing the scale of the challenge.
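The arithmetic of that reduction is easy to sketch: add a signature, re-scan the unknowns, and count what moves out of the pile. The byte pattern below is an illustrative guess, not the actual signature submitted to TNA:

```python
# Sketch of re-scanning 'unknown' files after new signatures are added.
# The '\documentclass' pattern is a plausible marker for a LaTeX master
# file, used here for illustration only.
from collections import Counter

def rescan(unknown_heads, new_signatures):
    """unknown_heads: leading bytes of each previously unknown file.
    new_signatures: dict mapping byte pattern -> format name."""
    identified, still_unknown = Counter(), []
    for head in unknown_heads:
        for magic, fmt in new_signatures.items():
            if head.startswith(magic):
                identified[fmt] += 1
                break
        else:  # no new signature matched this file
            still_unknown.append(head)
    return identified, still_unknown
```

Run over the EdShare unknowns, one new signature accounted for some 2000 of the 2500 files in a single pass.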

As usual with the long tail, preservation planning decisions have to be made about the impact and viability of even infrequent formats. For reference, Table 1 shows the formats not included among the largest formats by file count in Figure 6.

Table 1. Long tail of formats by file count in EdShare

Plain Text File 20
Rich Text Format (Version 1.7) 17
Windows Bitmap (Version 3.0) 15
Acrobat PDF 1.6 – Portable Document Format (Version 1.6) 15
Waveform Audio 12
Document Type Definition 11
Macromedia FLV (Version 1) 11
MPEG-1 Video Format 9
Icon file format 9
LaTEX (Sub File) 9
XML Schema Definition 8
Macromedia Flash (Version 7) 8
Windows Media Video 6
Extensible Hypertext Markup Language (Version 1.1) 5
Rich Text Format (Version 1.5) 3
TeX Binary File 3
Acrobat PDF 1.1 – Portable Document Format (Version 1.1) 3
Java Compiled Object Code 2
Encapsulated PostScript File Format (Version 3.0) 2
Exchangeable Image File Format (Compressed) (Version 2.2) 2
Scalable Vector Graphics (Version 1.0) 2
Adobe Photoshop 2
Microsoft Web Archive 2
Audio/Video Interleaved Format 2
JTIP (JPEG Tiled Image Pyramid) 1
Comma Separated Values 1
OS/2 Bitmap (Version 1.0) 1
Acrobat PDF 1.7 – Portable Document Format (Version 1.7) 1
PHP Script Page 1
Microsoft Word for Windows Document (Version 6.0/95) 1
Quicktime 1
Hypertext Markup Language (Version 2.0) 1
Applixware Spreadsheet 1
PostScript (Version 3.0) 1

Figure 7. EdShare: unknown formats (profile date 14 Oct. 2010)

Similarly, the unknowns present a potent challenge. Figure 7 is a breakdown of what we think we can tell from file extensions. Unlike the case of unknowns in eCrystals, here there was less anticipation of specialised formats that were unlikely to be found in a general format registry, so this list is a revelation. In many of these cases an error in the file may be preventing recognition of an otherwise familiar format. Here we can see extensions such as Flash files, various text-based formats such as HTML, CSS, etc., which may be malformed, and possibly some images. In such cases the relevant file should be identified and an attempt made to open it with a suitable application. In this way it may be possible to begin to assess the reasons for non-recognition, to confirm the likely format, and take any action to repair or convert the file if necessary. Files with unfamiliar extensions (.m) will need particular attention.

Another feature of this repository’s format profile, not illustrated here, is the difference when the profile is based on file size rather than file count. In this case the largest format by file size remains JPEG 1.01, but the next largest file types are all MPEG video formats, which is not so evident from Figure 6 and Table 1. It is not hard to understand why: video files tend to be larger than text files or files of other format types. At first sight the profile by file count might have the stronger influence on preservation plans, but this further evidence on file size can be used strategically as well.
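The two views can be sketched side by side: the same file list aggregated once by count and once by total bytes. The figures below are invented, purely to show how a few large video files can top the size view while barely registering by count:

```python
# Aggregate the same files by count and by total size. A format with few
# but large files (video) can lead the size view without leading the
# count view. All numbers here are invented for illustration.
from collections import defaultdict

def profile_by_count_and_size(files):
    """files: iterable of (format_label, size_in_bytes) pairs."""
    count, size = defaultdict(int), defaultdict(int)
    for fmt, nbytes in files:
        count[fmt] += 1
        size[fmt] += nbytes
    by_count = sorted(count.items(), key=lambda kv: kv[1], reverse=True)
    by_size = sorted(size.items(), key=lambda kv: kv[1], reverse=True)
    return by_count, by_size

files = [("JPEG 1.01", 200_000)] * 50 + [("MPEG-1 Video", 80_000_000)] * 3
by_count, by_size = profile_by_count_and_size(files)
# 50 JPEGs lead by count; 3 MPEGs (240 MB total vs 10 MB) lead by size.
```

A size-led view can inform storage and bandwidth planning even where the count-led view drives format-risk decisions.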

Research papers repository (University of Northampton)

Figure 8. NECTAR format profile from live repository screenshot (profile date 14 Oct. 2010)

Figure 8 is a classic, PDF-dominated profile of a repository of research papers. Although apparently a small repository, what this shows is a repository that has so far focussed on record-keeping rather than on collecting full digital objects. We have seen already how investigations to expand the scope of this repository have begun. There is nothing in this profile that would surprise the repository manager. It acts as confirmation, a snapshot of the repository, and can be viewed as a platform for deciding where the repository should head in future. The tools will help the repository manager to monitor growth and the implementation of future plans.


Digital and institutional repositories are changing, and the established research papers repository is now complemented by rapidly growing repositories targeting new types of digital content. For the first time we have been able to compare and contrast these different repository types, using tools designed to assist digital preservation analysis by identifying file formats and producing profiles of the distribution of formats in each repository.

While past format profiles of repositories collecting open access research papers tended to produce uniform results differing in scale rather than range, the new profiles reveal potentially characteristic fingerprints for the emerging repository types. What this also reveals more clearly, by emphasising the differences, are the real preservation implications for these repositories based on these profiles, which could be masked when all profiles looked the same. Each exemplar profile gives the respective managers a new insight into their repositories and careful reading will lead them to an agenda for managing the repository content effectively and ensuring continued access, an agenda that will be the more clearly marked for recognising how the same process produced different results for other types of repository.

This agenda will initially be led by the need to investigate digital objects for which the format could not be identified by the general tools, the ‘unknowns’. We have seen that specialised science data repositories, and even less obviously specialised examples, can produce large numbers of unknowns. These are high-risk objects in any repository by virtue of their internal format being unknown, even though on inspection many may turn out to be easily identified and/or corrected.

For the known formats, especially the largest formats by file count, these profiles show where effort is worth expending on producing preservation plans that will automate the maintenance of these files. Based on these exemplars, all repositories with substantial content are likely to produce format profiles displaying a long tail.

An intriguing finding of this work is that the emerging repository types, rather than open access institutional repositories founded on research papers, are dominated by visual rather than textual formats.

All these exemplars either are, or plan to become, institutional in scope, even though each is limited to a specified type of content. One original idea that motivated the KeepIt project was that truly institutional repositories are likely to come to collect and store digital outputs from all research and academic activities, such as those represented by these exemplars. Thus, combined, the exemplars might represent the institutional repository of the future. It’s worth bearing in mind how the combined format profiles might look, and the consequent implications for preservation, when contemplating the prospect.

We are grateful to all the exemplar repositories for allowing us to reproduce these profiles.


EPrints preservation apps: from PRONOM-ROAR to Amazon and a Bazaar

your computation goes to where your data is, not the other way around.

Alistair Croll, Big business for big data, O’Reilly radar, 21 September 2010

They’re now called apps, after a certain other initiative; previously they were called plugins. As the KeepIt project moves towards its close, this is the story of one of its products, the EPrints preservation apps, and it charts a path through digital preservation and changes in repository and computer architectures that are still playing out today.

Online preservation services, format identification and bandwidth limitations

In fact, at the outset they weren’t even apps or plugins, but online services. The story begins with our predecessor, the JISC Preserv repository preservation project. At the time, in 2005, EPrints repository software didn’t support modular applications. Preserv was working with two relatively nascent institutional repositories at the universities of Oxford and Southampton, and was focussing on a preservation approach concerned with managing the formats of the digital files in these repositories. This approach, referred to as ‘seamless flow’, was conceived by another project partner, The National Archives, and proposed a preservation workflow in which a digital file’s format was first identified, the risks assessed, and then the file converted, if necessary. TNA had tools for this as well: PRONOM, effectively its knowledge base of file formats, and DROID, a software tool that would scan and identify the formats of stored files.

Typical research repository format profile, from Registry of Open Access Repositories (ROAR)

Using these tools Preserv profiled its two partner repositories; it revealed the now classic open access institutional repository format profile dominated by versions of PDF and a long tail of other formats. That apart, it didn’t seem like a great step forward, and there were limitations. We weren’t able to integrate the repository and preservation tools, and we didn’t have a testbed for large-scale content. Beyond the two repositories, it seemed we didn’t have much further content to work with anyway.

What we did have, however, was ROAR, the Registry of Open Access Repositories, created and maintained by then-project developer Tim Brody. ROAR is a comprehensive machine index of repositories, one of the early OAI services that continues today. Brody had the idea of extending ROAR to generate format profiles of the repositories it indexed. The approach he devised did not work with all repository types, but it worked with those based on the main repository software platforms, DSpace and EPrints. Profiles were generated for over 200 repositories, and those PRONOM-ROAR profiles, as they came to be called, suitably updated, can still be found on ROAR today.
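At its core, a format profile is just a tally of identified formats across a repository’s holdings. A minimal sketch in the PRONOM-ROAR spirit (the file list and format labels below are invented; the real identifications came from DROID):

```python
# Build a simple format profile: count identified formats and report
# the distribution, most common first.
from collections import Counter

# Stand-in for per-file DROID identifications across a repository.
identified = [
    "PDF 1.4", "PDF 1.4", "PDF 1.6", "PDF 1.4",
    "MS Word 97", "JPEG", "PDF 1.6", "PDF 1.4",
]

def format_profile(identifications):
    """Return (format, count) pairs, most common first."""
    return Counter(identifications).most_common()

profile = format_profile(identified)
for fmt, count in profile:
    print(f"{fmt}: {count}")
```

Even this toy distribution shows the classic shape: PDF variants dominate, with a long tail of other formats.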

There are problems with this approach, bandwidth scalability and cost being the primary ones. As Croll noted in his O’Reilly blog post, “a paper by the late Jim Gray of Microsoft says that, compared to the cost of moving bytes around, everything else is free.”

As a result, large files over 2MB (multimedia objects, for example) were not included in the profiles. Further, we had established only the first part of the ‘seamless flow’ preservation workflow: format identification. What to do with these files once we knew what they were was the next question, and the answer was likely to put more pressure on our computing infrastructure, which at that time separated repositories from our online services.

Repository plugins and large storage systems

At the start of Preserv 2 in 2007 things began to change. EPrints version 3 had launched earlier in the year and now supported plugins (or apps) developed independently of the core code. Oxford and Southampton became strong supporters of the Sun PASIG (Preservation and Archiving Special Interest Group) forum, and both took delivery of the new Sun Honeycomb, an innovative large-scale storage machine being re-positioned for digital library applications. Early in 2008 Dave Tarrant joined as Preserv 2 project developer at Southampton, and was introduced to Ben O’Steen, project developer at Oxford.

At this first meeting O’Steen proposed using the then new OAI-ORE approach to demonstrate how files could be moved between different repository platforms, Fedora and EPrints, which were used at Oxford and Southampton, respectively. After all, if one approach to digital preservation is the ability and flexibility to copy and move content around more easily, then this seemed like a possible step forward for repository preservation. Working with Tim Brody, they went on to win the repository challenge prize at the Open Repositories 2008 international conference, demonstrating that digital data can be moved between storage sites running different software.

Tarrant, Brody and O'Steen (L-R) celebrate winning the repository challenge at OR08

We speculated that the approach might have important implications for the evolution of repository software and architectures: “Binding objects in this manner would allow the construction of a layered repository where the core is the storage and binding and all other software and services sit on top of this layer. In this scenario, if a repository wanted to change its software, instead of migrating the objects from one software to another, we could simply swap the software.” It hasn’t happened in a real repository yet.

With the delivery of the new Honeycomb machines, the idea was a large data store on which we could run various repository software and preservation applications, closely tying storage to computation in order to overcome the limitations of PRONOM-ROAR.

The corporate world runs on a different dynamic to research organisations, and Sun withdrew its Honeycomb later in 2008, more for marketing reasons than technical capability. Although the machines were perfectly serviceable and had the promise of support for a further five years, they could no longer be part of a long-term preservation strategy and eventual migration to alternative systems would need to be planned.

‘Cloud’ storage

By now the emerging phenomenon in computing storage was the ‘cloud’, where storage and computation services can be accessed over the network and the Web. It has been suggested that network storage and computing will grow to rival the scale of essential utilities such as electricity. Another parallel, it is predicted, will be the widespread switch from local to network computation, just as local electricity generation was consolidated into national power networks 100 years ago.

Investment in cloud services by Amazon, Google and other major Internet companies reinforces those predictions. As Croll added: “Amazon’s S3 large-object store, not its EC2 compute service, is core to the company’s strategy: your computation goes to where your data is, not the other way around.”

With the demise of the Honeycomb, Tarrant and O’Steen, working with Sun and other partners, devised a network storage solution that would adopt the fault-tolerant architecture of the Honeycomb. A bigger question, however, would be uptake of network services by institutions and research organisations such as universities. Concerns of such organisations centre on reliability, trust and, crucially, control of their data. The concept of an institutionally-managed private cloud storage network was proposed. Again, these ideas have yet to play out in the larger scheme.

It remains the case that most institutional repositories run on local machines, typically managed by information services support within the institution. Apart from hosted repository services, cloud support appears not to have been used extensively yet (nor would it necessarily be obvious if it had been), and no doubt some of the concerns raised above remain.

Separately, Dave Tarrant went to a seminal PLANETS project tutorial in Vienna and met the team behind the Plato preservation planning tool. Word was that Plato was a promising tool from this major EU-funded project, and Tarrant confirmed it. This meeting would seed an idea to fill a critical element of the EPrints preservation workflow.

Tarrant was still working with the Honeycomb. Unlike Brody earlier, he could harvest data to a large-scale storage facility and run experiments with the preservation tools, updating and completing the format profiles and extending their number.

Controlling storage and format risks within EPrints

Prompted by Adrian Brown, then in charge of digital preservation at the National Archives, we began to conceive the idea of smart storage, where we could run storage and computation in close proximity: “Since at the core of any preservation approach is storage, we call this approach ‘smart storage’ because it combines an underlying passive storage approach with the intelligence provided through the respective services.”

Taking advantage of the EPrints architecture for running plugins and using his experience of cloud services, Tarrant began to produce applications that would manage storage selection and run format risk tools, including DROID, within an EPrints repository. He demonstrated interfaces within EPrints designed to manage these tasks. One particular feature of the emerging interface for format management was its traffic-light identification of high-, medium- and low-risk objects. This seemed like a good way of managing risk and of alerting repository managers to files considered likely to pose preservation problems.

Traffic-light categories of format risks presented in an EPrints repository interface

There remained a gap in the format preservation workflow. We could identify and profile our content more effectively than before, but we still had no formal basis on which to evaluate risk for our known format types. Our traffic-light risk scores were hypothetical. Although there were plenty of arbitrary rules (for example, a preference for open-source, open-standard formats), these are not the whole picture.
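The traffic-light presentation itself is simple to sketch: bucket a numeric risk score into bands. The scores and thresholds below are assumptions for demonstration only, not values from the EPrints interface.

```python
# Illustrative traffic-light bucketing of format risk scores.
# Thresholds are invented for this sketch.

def traffic_light(score):
    """Map a 0-100 risk score to a traffic-light band."""
    if score >= 70:
        return "red"     # high risk: likely preservation problems
    if score >= 30:
        return "amber"   # medium risk: monitor
    return "green"       # low risk

# Hypothetical per-object risk scores in a repository.
objects = {"report.pdf": 10, "old-deck.ppt": 75, "dataset.csv": 40}
summary = {name: traffic_light(score) for name, score in objects.items()}
print(summary)
```

The hard part, as the paragraph above says, is not the bucketing but finding a defensible basis for the scores in the first place.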

In a rapid demonstration, two sources of data on file format risks, from PRONOM and DBpedia, were combined in a Semantic Web, or linked data, fashion, revealing more risk factors than either data source alone. The aim was to prompt an open data approach to file format risks and sharing among organisations that produce this information, and it is an approach that still has promise of adoption as the preservation community seeks to build a general format registry.
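The gain from combining sources is easy to see in miniature: a union of facts about a format is richer than either source alone. The records below are invented stand-ins for PRONOM and DBpedia data, not real entries.

```python
# Hedged sketch of the linked-data idea: merge risk-relevant facts about
# a format from two sources. All properties here are illustrative.

pronom_facts = {
    "PDF 1.4": {"has_open_spec": True, "software_support": "wide"},
}
dbpedia_facts = {
    "PDF 1.4": {"superseded_by": "PDF 1.7", "first_released": 2001},
}

def combined_record(fmt, *sources):
    """Union of all known facts about a format across the given sources."""
    merged = {}
    for source in sources:
        merged.update(source.get(fmt, {}))
    return merged

record = combined_record("PDF 1.4", pronom_facts, dbpedia_facts)
print(sorted(record))  # more risk factors than either source alone
```

In the real demonstration the merge was done over linked data rather than Python dictionaries, but the principle, open sharing of format facts across organisations, is the same.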

The missing link: Plato and preservation planning

While a registry can provide a factual basis for evaluating format risks, this is not the only angle in implementing a format-based preservation workflow for digital repositories. Other factors may include institutional priorities, policies, costs, and local technical factors. What’s needed is a more flexible system for assessing risk for digital preservation, a process known as preservation planning. Plato is such a system. In EPrints we now had tools to identify formats, using DROID, and an interface categorising (hypothetical) risks. We wanted to connect the two. The output from Plato is a preservation plan in the form of an XML document. So a button was added to the EPrints format risk management interface to upload and act on this plan. Creating a preservation plan in Plato may not be easy, but once uploaded and applied to a large body of repository content it is powerful, and it will continue to monitor and act on new content as the repository grows.
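Since the plan arrives as XML, the repository side of the connection amounts to parsing the document and acting on its recommendation. As a hedged illustration only (this is not the real Plato schema, which is considerably richer), a plugin might extract the recommended action like this:

```python
# Parse a simplified, invented preservation-plan document and pull out
# the recommended action. NOT the actual Plato XML format.
import xml.etree.ElementTree as ET

PLAN_XML = """
<plan>
  <format>MS Word 97</format>
  <recommendation action="migrate" target="PDF/A-1b"/>
</plan>
"""

def read_plan(xml_text):
    """Return the format, action and target named in the plan."""
    root = ET.fromstring(xml_text)
    rec = root.find("recommendation")
    return {
        "format": root.findtext("format"),
        "action": rec.get("action"),
        "target": rec.get("target"),
    }

plan = read_plan(PLAN_XML)
print(plan)
```

Once parsed, the repository can apply the action to every matching object it holds, now and as new content arrives.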

I will be honest and say that for a while I was quite mystified by Plato. It seemed like a worthy cause but at first inspection the tools seemed so complicated that, rather than putting preservation within reach, they would have the contrary result of making it more difficult. Only now it’s finished has the penny finally dropped.

William Kilbride, Editorial: Preservation Planning on a Spin Cycle, What’s New – Issue 28, Digital Preservation Coalition, August 2010

Refocussing on repositories: making the tools fit for preservation exemplars

With the start of KeepIt in 2009, by design and with prompting from JISC, we focussed again on repositories, in this case our preservation exemplars. The stories of these exemplars are told separately. These repositories were carefully chosen to exemplify different data collection policies, across research, teaching, science and the arts. What they had in common as repositories, not entirely by design but not entirely by coincidence either, was that they all run on EPrints.

The aim of the project was not to impose big, broad preservation strategies on these repositories, or to do preservation for them, but to introduce a range of approaches from which each repository could choose to suit the needs of their institutions and their content. We already knew that imposing the approaches specified by preservation specialists would be difficult for generalist repository managers and their often small or part-time support teams. Even if preservation services were available to repositories, repository staff would need to know enough about preservation to identify their own institutional needs and to specify these to the providers. That was the rationale for our extensive KeepIt course, and we were able to draw on a range of available preservation tools, over 70% of them produced by JISC projects.

Given the EPrints base shared by the exemplars, Dave Tarrant was able to build on the ‘smart storage’ approach and the EPrints plugins and interfaces we had elaborated earlier. Optimising this approach required some additions to the core EPrints code, which Tarrant contributed and which became available at the beginning of this year in the latest version of the software, EPrints v3.2.

Repositories are increasingly large, complex and critical systems requiring a formally managed systems approach, so as with better known computing platforms and operating systems, upgrades to new versions can take time and require consultation. Some six months after release, three of our exemplars now run v3.2, have their own preservation apps installed and are producing, analysing and acting on their own format profiles. (The fourth exemplar runs a customised version of EPrints that it claims is unsuited to this upgrade; we believe it’s less of a problem than they do.)

We’ll find out what these format profiles look like in the next blog post, but we can say they are each quite distinctive, and most are distinct from the classic open access format profile illustrated above, with consequent implications for different preservation strategies.

It doesn’t stop at EPrints 3.2. Next year, easier one-click installation is promised in version 3.3, which will bring the EPrints Bazaar, modelled on the Apple App Store and which will include the updated preservation apps (plugins).

EPrints Bazaar

Learning about and implementing EPrints preservation apps

There is now a remarkable range of resources to support preservation for EPrints repositories, not just apps. Dedicated courses have been run across Europe to introduce participants jointly to EPrints preservation apps and Plato, and all these events are documented. The most complete documentation is probably from the KeepIt course 4 held over two days earlier this year in Southampton. Want a shorter introduction? Try this version run over 1.5h in Madrid during July. Or there are single-day versions of this EPrints-Plato course from Corfu (ECDL, Sept. 2009) and the final presentation in Vienna (iPres, Sept. 2010). All are suitable for independent study and include practical work for users to follow.

To use EPrints preservation apps you need to be running a repository based on v3.2 or later, or try a test repository in the cloud. If running 3.2, download the preservation plugins. There is a readme file in the download which explains how to install the plugins ready for the training exercises. For others not currently running EPrints, there are two test repositories running on the Amazon cloud services providing ‘machine images’ (AMIs), instances of EPrints, which people can launch for training purposes.

Thanks to our brilliant developer team who have energised both Preserv and KeepIt projects with their innovative ideas. Thanks as well to the developers of the tools that have been integrated into the services described here (notably DROID and Plato), highlighting the real benefit of open sharing and open source software that is characteristic of this developer community. Every idea that has been built, worked through, adapted and embedded into this story has played a role in the continuing evolution of digital repository preservation.


Costs, formats and iPad apps: past-future preservation lessons for a science repository

As an institutionally-based digital repository, eCrystals is somewhat different, both as an exemplar in the KeepIt project and in the institutional repository landscape as a whole. It is operated by the National Crystallography Service (NCS), which is funded on a 5-year grant basis. This brings preservation implications and requirements rather different from those faced by repositories set up by institutions as a component of their research infrastructure: when grant funding ceases, so does support for the repository, and its future hangs in the balance.

National Crystallography Service logo

It costs money to do preservation. This recognition and the periodically precarious funding position meant that much of our work on eCrystals as an exemplar was focused on preservation costs. There is plenty of (wildly contradictory) anecdotal talk and urban myth in the practising research community around how much it costs to preserve data. My perspective draws on personal experience and other reported work in the area. What is clear is that the community needs to know how much it costs to set up a repository, and then what the financial implications are of migrating all the old data into it. It has been particularly insightful thinking about how much all this costs, and the main lesson ought to be blindingly obvious: setting up and maintaining a data repository is relatively cheap and easy (provided you are not the innovator or primary mover in the area). It’s populating it with all your old data that really costs.

eCrystals holds the results (in the form of multiple, small data files) of crystallographic experiments performed at the NCS, and is operated by the NCS as an independent mid-range facility funded to serve UK academics in the chemistry (and related subjects) sector. An important part of our interaction with the KeepIt project was the registration of file formats so that digital preservation services can automatically recognise and understand our repository content. The authoritative PRONOM registry recognises several hundred file formats, but these are the popular ones and domain-specific formats such as our Crystallographic Information File (CIF) and Chemical Markup Language (CML) – which are ubiquitous in crystallography and chemistry, respectively – were not included. Work was done to create signature files for these two formats for the DROID format identification tool, which applies data from PRONOM. These signatures will be submitted for inclusion in the formal PRONOM registry.
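DROID signatures identify formats from internal patterns in the files themselves. As a rough, hedged illustration of what such identification might key on for these domain formats (these are not the actual signatures created for PRONOM): CIF files carry a `data_` block header, while CML is XML that declares the CML namespace.

```python
# Toy content-based checks for CIF and CML, standing in for the byte-level
# signatures a real DROID signature file would define.

def looks_like_cif(text):
    """CIF is line-oriented text containing one or more data_ blocks."""
    return any(line.startswith("data_") for line in text.splitlines())

def looks_like_cml(text):
    """CML is XML carrying the CML namespace (assumed check)."""
    return text.lstrip().startswith("<") and "www.xml-cml.org" in text

cif_sample = "# comment\ndata_crystal01\n_cell_length_a 5.43\n"
cml_sample = '<cml xmlns="http://www.xml-cml.org/schema"><molecule/></cml>'
print(looks_like_cif(cif_sample), looks_like_cml(cml_sample))
```

Real signatures are more precise than this, but the principle is the same: a community that documents its formats in machine-readable form makes them automatically recognisable to preservation tools.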

Working with KeepIt and other projects has given momentum to the preservation of crystallography data in the eCrystals repository and in related repositories. Looking ahead, we intend to maintain this momentum. Through the project we recently invested in an Apple iPad, and we are developing an app as a front-end to an electronic laboratory notebook / blog service. As we have reported, we recognised that the best possible moment to begin preservation is at the time the experiment is performed, as it is prohibitively expensive to recreate the data at a later stage. The idea for the app is that the contextual information that underpins publication and preservation is built up as the experiment progresses – not as is done now, where a bunch of files are uploaded some time after the event and some (arbitrary) metadata assigned.

This means capturing data in the laboratory, which is not easy (even in a conventional lab notebook), and we are spinning out a project to address this problem: the smart laboratory, with pervasive data and metadata recording. A primary problem here is that drawing or ‘scribble’ software is poor, and chemists draw; they don’t generally type. Our app is being specified to resolve such issues by enabling the chemist to sketch reactions, note observations, and make and test hypotheses – this is the valuable chemical metadata that gives our data meaning in the long term. Tablet PCs tested in the past proved too cumbersome, but iPad-type technology could be a winner in terms of portability and ease of interaction, making data capture in the lab instant and efficient. We are also investigating the use of portable devices (mobile phones as well as the iPad) to record audio and video in the laboratory, to act as anything from the primary observation record to contextual or supporting metadata.

iPhone app running a chemical reaction. Mobile smartphones might be used to capture all sorts of contextual data: they are devices many people already carry, and so add few extra requirements in terms of data capture technology.

In summary, the most striking lessons learned for the NCS by working with the KeepIt project are:

  • Preservation isn’t hard – you just need to think about it and then generate a preservation plan.
  • The hard part is following the preservation plan and getting those involved into the right mindset.
  • It is acceptable to ‘just do nothing’, but this must be the conclusion of thinking about preservation.
  • Even when storage is kept live (on spinning disks), unknown or unmanaged file formats remain a major risk of information loss.
  • Subject domains or communities should therefore be encouraged to supply descriptions of their specific formats (e.g. DROID signatures) to make sure these don’t suffer from file format rot.
  • Repository software (like EPrints) is making preservation easier by incorporating tools that help identify risks of information loss.
  • It’s relatively cheap to set up a repository that will, among other functions, preserve your data.
  • Retrospective preservation (migrating data and populating repositories) is where the real costs lie.
  • The best possible moment to begin preservation is at the time the experiment is performed and the data are generated.
  • New portable computing devices and apps will help capture data in the lab and enrich it immediately with metadata.


Preserving crystallographic data in a digital repository: a costs based analysis

eCrystals has been presented within KeepIt as an exemplar of a scientific data repository, but more particularly it exemplifies the ‘one man band’ scientific repository.

No research scientist will take on the responsibility of setting up and administering a scientific data repository without an indication of the financial implications and savings involved. Until now these costs have largely been anecdotal. Here we provide documentary evidence based on our experience gained in developing and deploying the eCrystals repository at the UK National Crystallography Service (NCS) and the eCrystals Federation Project, which engaged the broader crystallographic community.

“The major cost components do not lie in implementing and maintaining a scientific data repository. Populating the repository is the expensive aspect of operating such a resource”

An ‘at a glance’ summary of this information is of vital importance for the practising researcher evaluating whether they wish to implement a data management and digital preservation solution. These data are not new – a more substantive version originally appeared in the Keeping Research Data Safe 2 project report – but here we present a summary couched in a manner more familiar to the practising scientist and covering the costs of a) the storage of data over the lifetime of a facility and b) migrating data from one era, technology or regime to another.

The actual costs (full economic costs, based on a 250-working-day year) of setting up and persistently operating established software for an average-size crystallography laboratory (i.e. one person, one instrument) can be estimated and outlined as:

  1. Installation of the repository software by an untrained operative (i.e. a research scientist) will take 2.5 days – £800.
  2. Training (sys admin) takes a day – £340.
  3. Performing sys admin – the cost of supporting the repository component of the NCS electronic infrastructure is based on the fact that this aspect takes about 10% of the time of the 0.1FTE systems administrator – £800 per year.
  4. Training (deposit) takes one day – £340.
  5. Deposit, management, appraisal and publication takes 5% of the time of a single researcher – £4000 per year.
  6. DOI registration depends, to an extent, on the number of records but averages £250 per annum.
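Rolling the figures above into first-year and recurring annual totals is simple arithmetic; this sketch just does the sums on the numbers quoted in the list.

```python
# One-off and recurring costs taken directly from the list above (GBP).
one_off = {
    "install (2.5 days)": 800,
    "sys admin training": 340,
    "deposit training": 340,
}
recurring = {
    "sys admin": 800,
    "deposit/management/appraisal/publication": 4000,
    "DOI registration": 250,
}

first_year = sum(one_off.values()) + sum(recurring.values())
subsequent_years = sum(recurring.values())
print(f"first year: £{first_year}, each year after: £{subsequent_years}")
```

So on these estimates a one-person laboratory is looking at roughly £6,530 in the first year and about £5,050 a year thereafter, before any legacy data migration.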

You can’t just set up a repository and then expect that the data will get into it for free. All researchers carry a lot of historical data from throughout their careers (crystallographers perhaps more so than others!) and therefore it is crucial to get an idea of what it will cost to migrate your data to the repository.

In coming to the following conclusions, I am drawing on data and facility costs over a period of NCS operation from 1970 to 2009. During this timeframe experimental instrumentation, computational capability and data storage media have all changed radically. There has also been a change in the raw and results data that we collect – these days raw data can be a couple of hundred binary image files (as opposed to the lists of observed reflections from serial counter days) in the gigabyte size range, whilst results data can be as little as a single text-based CIF file of the order of a few kilobytes. When considering these elements of change, one can roughly group transitions between technologies – e.g. the introduction of personal computers, a new generation of instrumentation or the advent of online storage – into three roughly similar periods (1970-1990, 1990-2000 and 2000-present). For crystallographers, these eras relate loosely to the serial detector age (1970-1990; data stored on paper), the early days of area detectors (1990-2000; data stored on magnetic disks) and the modern age in which large volumes of data are being generated (2000-present; data stored on CDs/DVDs and, more recently, online).

“The best possible job of preservation must be done at the time an experiment is performed, as it is prohibitively expensive to recreate the data at a later stage”

One vital point often overlooked is that we can store and migrate data over many decades, but we can’t do this with the samples that we measure! The cost of a crystal structure with current equipment is £328; the cost to regenerate a structure from the past can be anything up to 60 times this amount (ca £20k) if the raw data (and appropriate correction files) are not available or the sample has to be re-made (includes all the expertise and laboratory infrastructure from an entire research project). It is not simply a matter of doing the experiment (or analysis) again if you don’t have the sample.

The reason a sample might need to be resynthesised is that it has not been possible in the past to store and preserve raw data efficiently (some data have been kept on paper and magnetic media, but the cost of migrating to online media is prohibitive and the older media are prone to corruption). In more recent times raw data could be preserved, in which case the cost of recreating the data is that of (re)interpreting the raw data, and is therefore considerably lower. The most obvious points regarding the costs of data storage are that:

  • The cost of storing data has dramatically reduced.
  • The cost of migrating data from recent eras when computing has been more prevalent is significantly less.
  • The cost of storing raw data is around 70% of the total data storage cost.

For those scientists who still want to do something with all the structures in their filing cabinets or data on floppy disks, the following cautionary points regarding data migration should be noted (again, the full costs can be found in the KRDS2 report):

  • Long term storage of data in its native format becomes less worthwhile with time (this is because these formats cannot be migrated, i.e. an instrument manufacturer’s proprietary binary format cannot be read by newer generations of integration software). The most cost-effective approach would be to transform the raw data into a common interchangeable format and subsequently migrate it.
  • It is considerably more costly to migrate results data than to preserve them – this is due to the variety of formats and storage media used over the years.
  • There is considerable fluctuation in the relative cost of migration against storage in different eras and it does not necessarily follow that modern approaches make it cheaper to migrate as opposed to store with respect to other eras.

Migrating raw and results data highlighted some important points regarding data loss:

  • During migration of raw data from CDs/DVDs to online storage there was a 7% loss of data.
  • Migration of results from floppy disks resulted in a 5% data loss.
  • Less than 1% of results were lost in the migration from paper, although the cost of performing the migration itself was extremely variable due to the differing quality of records.
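Loss figures like those above argue for fixity checking during any migration. A minimal sketch, with invented file contents: compare checksums before and after copying, and count anything unreadable or changed as a loss.

```python
# Verify a migration with checksums: any file that is missing, unreadable,
# or whose checksum differs counts as lost. File contents are invented.
import hashlib

def sha256(data):
    return hashlib.sha256(data).hexdigest()

# 'Source media' and 'migrated copy'; None models an unreadable file.
source = {"a.cif": b"data_a", "b.cif": b"data_b", "c.cif": b"data_c"}
migrated = {"a.cif": b"data_a", "b.cif": b"data_X", "c.cif": None}

def verify_migration(src, dst):
    """Return the list of lost files and the percentage loss."""
    lost = [name for name, data in src.items()
            if dst.get(name) is None or sha256(dst[name]) != sha256(data)]
    return lost, 100 * len(lost) / len(src)

lost, pct = verify_migration(source, migrated)
print(lost, f"{pct:.0f}% loss")
```

A check of this kind does not prevent loss, but it turns silent loss into a measured figure, which is how percentages like those above can be reported at all.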

In summary, it is clear that the major cost components do not lie in implementing and maintaining a scientific data repository. Populating the repository is the expensive aspect of operating such a resource – this involves changing working practice and importing legacy data. It is also clear that the best possible job of preservation must be done at the time an experiment is performed, as it is prohibitively expensive to recreate the data at a later stage when the sample has ceased to exist.

Storage is a relatively cheap and well-understood process these days and there is now no reason why all raw and derived data cannot be kept long into the future. The migration of legacy data is a time-consuming and costly process and it is almost certainly not worth trying to migrate historic raw data. Great consideration should be given on a case-by-case basis as to whether it is worth migrating results data. One must bear in mind, however, that the migration of results data is a one-off cost going forward if this process is performed correctly and the data stored in a form that makes it easy to migrate in the future.


Shaping the culture and practice of digital preservation

Saturday morning, and I picked up an unread copy of the week’s Guardian g2 section, where I chanced on a piece by Mark Lawson about some lost British television drama from the 1950s and 1960s that had resurfaced in the USA at the Library of Congress (LoC). I quickly tweeted four extracts, enough presumably to exasperate Twitter followers. Yet in spite of Twitter’s length prejudices, there is much more to this article than I tweeted. I finished by suggesting, even though this is a ‘mere’ newspaper article, that it might serve as a set text for digital preservation. Here’s why.

Sean Connery and Dorothy Tutin in the 1960 BBC adaptation of Colombe, one of the television dramas recovered by the Library of Congress

You won’t learn anything about technical preservation, clearly, but you will learn about what shapes culture, content and, crucially for preservation, selection, all within the context of a timespan – 50 years – and evolution of a medium – television – that most people today can understand from experience.

One short section I didn’t tweet an extract from was towards the end of the article, looking forward, and this part is perhaps even more important because that is where we have to project our role today.

First, one myth to be demolished is that there was a golden age of preservation in the pre-digital era. To be fair, the purpose of such spin is fundraising: to emphasise, by comparison, a digital ‘black hole’ that may be looming unless we invest more in preserving digital content. Recently I tweeted a quote from the Economist: “the world has a better record of beginning of the 20C than of beginning of the 21C”. How is that measured? Chris Rusbridge’s tweeted reply used a blunt, part-asterisked word but made the point clear.

Just as we learn that luck has determined the fate of the TV drama recovered from the LoC (the missing final part of the Romeo and Juliet emphasises how dependent on luck our record of past television is), so any visit to a local museum of social history, for example, will reveal a mostly random collection of artefacts that have, hopefully, been collated by a skilled curator to reflect stories of the area but which mostly owe their continuing existence to chance.

Where we do better in terms of more comprehensive and systematic preservation is for objects long established as having cultural significance, largely because of the processes that have created them: printed literature, books and, in the academic world, journals fall into this category. These processes feed directly into preservation in our major libraries, through mechanisms such as legal deposit. We don’t have to think about it: if it’s published and in print, it is worth preserving. It’s self-selecting and funded.

That isn’t where we are now with born-digital content and preservation, and Lawson shows us that is not where we were with television in its early years. Television today remains a medium infected with cultural elitism, in the UK ranging from Saturday night ITV to BBC4 and Sky television. So we can understand how this might have been worse in the 1960s, when TV was new and little understood in media terms.

What might concern us here is how this affects selection for preservation. This is where I tweeted most extensively from Lawson. Here is the full paragraph:

“in terms of content and scheduling, these plays reflect a lost time. But the fact that these were the examples of British TV chosen to be stored in an American library reflects another bias. Several were screened in the US by the National Educational TV network, while the Romeo and Juliet was part of schools’ programming. Their admission to the Library of Congress may have been helped by their categorisation as theatre, literature or education, rather than as mere TV. As a result, this trove is limited to genres that executives regarded as good for viewers, rather than those viewers regarded as good: key sitcoms, quiz and chat shows of the same era will have vanished for ever, because no Washington librarian would have thought them worth keeping.”

Phrases that leap out here – categorisation, ‘mere TV’, executive selection, regarded as good, worth keeping – begin to explain what happened to this and other TV content of the time, and why.

Now apply that to digital content on the Web today. It is not hard to anticipate repeating these mistakes. Except, things have changed already. This is the critical and perhaps most revealing part of the article. Lawson starts to project forward to 2060, but his main point applies now:

“The final thought prompted by this discovery is that there is no risk of viewers in 2060 being invited to view lost treasures from today’s schedules. Detractors will jibe that this is because there is little worth conserving (even though these tapes do little to advance the case that the 1950s and 60s were a golden age of drama). No, the real difference between then and now lies in technology. Contemporary TV is indestructible. In a digital age, storage is not an issue: most transmissions are kept – even embarrassments that broadcasters might prefer to disappear are archived on file-sharing sites.”

Two things have happened here. First, whereas the visual medium, especially television, may once have been viewed by cultural archivists as largely ephemeral, visual content now dominates on the Web, much of it created for or derived from television, or at least for small-screen presentation. Technology, creativity and imagination are sweeping aside legal and cultural barriers. Second, there is also the idea that this choice – of what to watch and therefore what gets copied and hence ‘archived’, however legitimately – has been democratised by access to digital content through the Web, overturning the received wisdom of the past, recalling the content ‘regarded as good for viewers, rather than those viewers regarded as good’.

YouTube catch-up TV: a more recent example of television selection and ‘archiving’

Except there is still a distinction between professional and personal archiving, and selection criteria and preservation processes will reflect that. The critical factor that differentiates digital from analogue content is volume, of production and consumption (Rosenthal expands the defining factors of digital preservation to include costs and rights as well as scale). That requires new ways of archiving, especially selection. Lawson points to the wisdom of the crowd becoming the mechanism of archiving on file-sharing sites.

It’s not clear that the timorous professional archiving community has yet recognised or understood the impact of the crowdsourced approach that massive digital content growth demands, or more particularly how it can apply this, because crowdsourcing requires that professionals relinquish at least a degree of control over selection. That’s the path of discovery that we are on now, and it starts with a recognition of the cultural shifts that are forcing the pace of change and shaping the selection criteria for preservation. That’s why Lawson’s text is a better starting point for this journey than many other formal texts on digital preservation.

Posted in Uncategorized.

Final thoughts on digital preservation work undertaken at UAL

As part of our participation in the KeepIt project, the EPrints Formats/Risks plugin was installed on our newly-upgraded repository to allow us to identify format risks based on DROID (Digital Record Object Identification) [this blog post includes an explanation of how the DROID file format tool works]. We performed the identification task in late September 2010. The process was very quick (about 13 seconds for our repository of around 2,000 items), did not affect repository function and, requiring only a single click of a button, could not have been easier.
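
In outline, signature-based identification of the kind DROID performs matches a file’s leading bytes against a registry of known ‘magic numbers’. A minimal sketch, with a tiny illustrative signature table rather than DROID’s real registry:

```python
# Minimal sketch of signature-based format identification in the
# spirit of DROID. The signature table is a tiny illustrative
# subset; DROID itself uses a far richer registry (PRONOM), with
# internal and positional signatures as well as leading bytes.
SIGNATURES = {
    b"%PDF": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"PK\x03\x04": "ZIP container (incl. DOCX/PPTX)",
}

def identify(path):
    """Return a format name for the file at `path`, or 'unknown'."""
    with open(path, "rb") as f:
        head = f.read(16)  # enough for the longest signature above
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown"
```

Real signature matching is more involved (offsets, byte ranges, priorities), but this is the basic reason a file with an absent or unreadable signature comes back as ‘unknown’.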

“The EPrints Formats/Risks plugin provided the most tangible value for us – it was quick and efficient and identified areas we can start investigating and improving right away.”

A comparison of the different project members’ results from this task will be discussed in an upcoming blog post. For the UAL Research Online repository, we immediately identified a potential preservation risk: over 200 objects were unidentifiable and returned as ‘unknown’ file formats. The extensions of the unknown formats are generally recognisable, such as .swf or .mov. Possibly the objects do not contain digital signatures that DROID is able to recognise; if so, we will need to investigate why these signatures are absent or unreadable. We will also check each of the unknown objects to make sure they are valid files that have not become corrupted (a good chance to do a little general housekeeping).

In response to our feedback about these results, the developer of the EPrints plugin is looking at providing a way for files unrecognised by DROID, but with identifiable extensions, to be manually classified. Using the Formats/Risks plugin on our diverse collection of research outputs in arts, design and media has therefore been immediately useful for learning about and managing our collection.
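
Such a manual-classification fallback might look like this in outline; the mapping and function names below are hypothetical, not the plugin’s actual API:

```python
import os

# Hypothetical extension hints for files DROID could not identify;
# each tentative classification would be confirmed manually before
# being trusted.
EXTENSION_HINTS = {
    ".swf": "Shockwave Flash (tentative)",
    ".mov": "QuickTime video (tentative)",
}

def classify(path, signature_result):
    """Fall back to the file extension when signature matching fails."""
    if signature_result != "unknown":
        return signature_result
    ext = os.path.splitext(path)[1].lower()
    return EXTENSION_HINTS.get(ext, "unknown")
```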

We also registered for a self-audit of our risks using the DRAMBORA tool. A previous blog post provided a general introduction to DRAMBORA, and our own approach to using this tool has also been explained.

“What I have learned about DRAMBORA is that it isn’t realistic to expect a small repository team to be able to complete the full process in and around their daily activities.”

As mentioned in that earlier blog, I was unsure that I would find the time to follow the whole DRAMBORA programme (see this schematic of the DRAMBORA method) – the user manual suggests that four to five days of six hours each would be required to carry out the full self-audit. My wariness was justified; I estimate that I have found no more than four hours to spend with DRAMBORA since registering UAL Research Online for the audit many weeks ago. The preliminary stage (the ‘Preparation Centre’) and the first elements of the next stage (the ‘Assessment Centre’) that I have encountered so far are still concerned with copying and pasting general descriptions and policies, such as the wording of our mandate, and with listing our objectives. Currently I am detailing our constraints, glossed as “any factor that compels or influences the repository to operate in a particular fashion”; the list is becoming rather lengthy. I’m not sure that these larger policies will directly affect our preservation activities, though I can see that copying them into one place is a thorough approach to documenting a repository.

Perhaps what I have learned about the DRAMBORA process is that it isn’t realistic to expect a small repository team to be able to complete the full process in and around their daily management activities. With one full-time manager and one part-time administrator, our staffing arrangements are typical for UK repositories, and I know of several repositories managed by single part-time librarians.

Miggie Pickton at the University of Northampton, another KeepIt exemplar repository, completed her scoping project with the DAF tool, structuring the work as a separate eight-week project and employing two graduate interns to carry out the research. The interns were found through the Graduate Boost programme, which receives funds from HEFCE and the European Social Fund. A DRAMBORA self-audit was also completed for the repository at LSE, where it was found to be beneficial; unfortunately, no estimate was provided of the time and staff LSE required to complete the process.

Our initial scoping of DRAMBORA did not indicate that we needed to set up a separate project similar to that at Northampton; however, it seems clear that undertaking DRAMBORA needs to be planned and funded, with additional staff required either to carry out the audit or to release existing staff to do so. The latter option is probably preferable, as DRAMBORA requires the auditor to have an in-depth understanding of high-level university policies as well as the institution’s IT procedures, and to have access to policy documents across university departments, including library, legal, research management and IT. The repository manager, or this person’s line manager, is best placed to be able to access and interpret these documents.


Steve Hitchcock, KeepIt project manager, suggested that a ‘DRAMBORA Light’ might be the solution; a scaled-down version, sacrificing some thoroughness in favour of rapid results, would be more realistic for institutional repository use. Rather than building up a comprehensive profile of the repository, the scaled-down version would aim directly at the most common and relevant risks encountered in repository management, and ideally could be completed in no more than half a day. If the repository manager judged it necessary, the risk register generated by the Light version could be used as the basis for a request to senior management for time or funds to carry out the full audit.

In sum, the EPrints Formats/Risks plugin provided the most tangible value for us. Although it only dealt with one aspect of digital preservation risks, it was quick and efficient and identified areas we can start investigating and improving right away.

Posted in Uncategorized.

Tagged with , , , , , , .

Recognition for educational repositories

At the beginning of October 2010, a message was sent out across the usual mailing lists notifying the community that:

“JISC infoNet have launched a learning and teaching upgrade to the Digital Repositories infoKit”.

Lou McGill’s work has provided another useful and timely resource for the growing number of educational repositories, and for people interested in developing their own ways of managing, presenting and sharing learning and teaching resources.

The new infoKit provides useful summaries of significant milestones, drivers and links to other resources in Lou’s usual lucid and readable style. In addition, this publication supports recognition for the distinctive concerns of educational repositories. We hope it will also contribute to growing awareness of what these services have to offer in times of financial constraint with its focus on ensuring efficient return on investment across disciplines and institutions.

Posted in Uncategorized.

Tagged with , .

Repository file type analysis for educational repositories

EdShare logo

One of the characteristics that we have recognised in working with educational repositories is the diversity of the file types content creators work with in their development of educational resources.

“It would be unrealistic, even positively unhelpful for an educational repository to limit by file types the kind of resources they are prepared to accept.”

While the dominant institutional research repositories may content themselves with accommodating PDFs, Word documents and, increasingly, LaTeX files, and other subject-specialist repositories may accept only .jpg or .cif files, it would be unrealistic, even positively unhelpful, for an educational repository to limit, by file type, the kinds of resources it is prepared to accept. The reality that we have accepted in EdShare is that we need to be responsive to and supportive of our whole user community, and that there may well be significant preservation challenges in accepting this situation.

Following a recent upgrade of the repository software to EPrints v3.2, a version that includes preservation and file format identification tools, we were able to produce an accurate profile of the file formats stored in EdShare:

EdShare format profile by file count, top 16 formats, file count > 150 files, 49 'other' formats aggregated for clarity of chart presentation (14 Oct. 2010)

Although not shown here, a breakdown of the repository contents by file size would show a different picture. The largest format by file count among the ‘others’ is MPEG 1/2 Audio Layer 3 (MPEG-4 Media File and MPEG-1 Video Format are also in this aggregate). It is anticipated these formats could each account for 10-20% of file storage space, quite a different picture to that presented above.
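
The contrast between a count-based and a size-based profile is easy to compute once each file’s format and size are known; the figures below are invented purely to illustrate the shape of the difference:

```python
from collections import Counter

def profiles(files):
    """files: iterable of (format_name, size_in_bytes) pairs.
    Returns (profile_by_count, profile_by_total_bytes)."""
    by_count = Counter(fmt for fmt, _ in files)
    by_size = Counter()
    for fmt, size in files:
        by_size[fmt] += size
    return by_count, by_size

# Invented example: many small PDFs, a few large MP3s
files = [("PDF", 200_000)] * 50 + [("MP3", 5_000_000)] * 5
counts, sizes = profiles(files)
# PDFs dominate the count profile; MP3s dominate the storage profile
```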

The ‘unknown’ files have a format that the identification tool either does not know or fails to recognise. Until they can be identified, these are critical files in a preservation risk analysis. The issue of unknown files will be considered further in a forthcoming blog post comparing format profiles across all the KeepIt exemplar repositories.

EdShare format profile: what does it mean?

From this data, we are left with lots of questions:

  • Is there likely to be any change over time in the composition of these proportions and ratios?
  • What is the relationship between the deposit level for specific file types and the usage levels of file types – do these functions map simply one to the other, or is the highest usage associated with a relatively small number of file types?

We will undertake to generate time-slice data for EdShare content over the next period so that we can develop trend analysis and explore more fully what the statistics can tell us. This work will be made much more straightforward with the upgrade to version 3.2 of the EPrints repository software.

We will also undertake some work exploring institutional and individual expectations of EdShare and what our service can or should offer. Do all users (and the University) expect and require unbroken access at all times to all resources? If so, this would be a level of service provision unprecedented across the portfolio of systems that support and deliver education across the institution. We will examine the levels of service the central IT service provides for comparable University systems and chart our own provision against them. This may mean unbroken access for certain groups of staff (teachers and educational administrators), or unbroken access for resources with certain characteristics. In the new areas that EdShare is charting, this may be a difficult thing to clarify and articulate in the first instance.

Having identified the different format types in EdShare, both by number of files and by file size, we need to understand what exactly these resources are. We will do some work to explore their storage costs over the short, medium and longer term, as well as the bandwidth costs of delivering these resources to the end user. This data, together with information about the risk levels associated with the file types themselves, will support a more detailed understanding of the materials we are working with and the expectations and requirements of our user community.

Comparing other sources of educational materials

In addition, as a complementary piece of work, we interrogated the database of the institutional VLE (Blackboard) for statistics on the range of file types it holds, accumulated over 10 years’ use. The data shows a distribution like this:

DOCS – 20%
GIF – 20%
PDF – 11%
JPEG – 8%
PPT – 8%
HTM – 5.3%
XML – 4.4%
DAT – 4%
SWF – 2.7%
MP3 – 1%
Others – 15.6%

This data has an extremely long tail, of a very large number of file types.
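
One way to put a number on that tail is to ask how few file types account for a given share of the collection. A sketch using the percentages reported above (treating ‘Others’ as the tail):

```python
def types_for_share(dist, target):
    """dist: {file_type: percent}. Return the fewest top types whose
    combined share reaches `target` percent (all types if it never does)."""
    total, n = 0.0, 0
    for _, pct in sorted(dist.items(), key=lambda kv: -kv[1]):
        total += pct
        n += 1
        if total >= target:
            return n
    return n

# Percentages from the Blackboard VLE distribution above
vle = {"DOCS": 20, "GIF": 20, "PDF": 11, "JPEG": 8, "PPT": 8,
       "HTM": 5.3, "XML": 4.4, "DAT": 4, "SWF": 2.7, "MP3": 1}
# Three types already cover over half the collection; the named ten
# cover ~84%, with the remaining 15.6% spread across the long tail.
```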

As I stated in a blog post here one year ago: “EdShare provides a safe, secure and persistent URL for the learning and teaching resources of the whole University community deposited in its shares. This EdShare approach requires that we engage actively with the issues of digital preservation which are particularly relevant to the growing community for learning and teaching repositories.” To deliver on these words, we may need to clarify exactly what we are committing the repository to and define our “offer” more precisely.

“Most repositories were simply not able to provide the format information that we requested – they lacked the functionality required”

In preparation for the next stage of this work, we decided to consult more widely across the educational repository community in order to establish how representative (or otherwise) EdShare is in terms of the diversity of file types we currently host. I contacted known repository managers from a cross-section of different types of repositories – institutional repositories, subject repositories and national services. We received a range of responses: most repositories were simply not in a position to provide the information that we requested – they lacked the required functionality, or had no report they could easily run to discover the range of file types hosted in their repository. Where we did manage to obtain some data, it looks like this:

Repository A – national, Open Educational Resources Collection
JPEG – 39%
Links – 39%
HTML – 2%
PPT – 2%
Unknown – 1.5%
Others – 16.5%

Repository B – small, teaching-focused institution, collection of educational resources
JPEG – 67%
PDF – 14%
FLV – 10%
MP3 – 7%
Others – 2%

Interestingly, although they have quite different rationales, both of these cases share the same predominant file type as EdShare. They also share a characteristic observed in other repositories: in any community, for any kind of repository, there will typically be a single predominant file type that people favour, accompanied by a long tail of mixed and varied file types that pose all kinds of challenges.

“there is some consistency and indications of general trends”

We acknowledge that we have managed to gather only a very limited amount of data here. However, from what data we do have, there is some consistency and there are indications of general trends. Certainly, we have some basis on which to focus on the preservation issues relating to a relatively small number of file types, and some indications of where it would be useful to investigate further – in terms of the highest-use file types rather than necessarily the largest number of files or the highest level of storage space. Over the coming months we intend to work on this area further and understand how representative EdShare is of the small but growing community of educational repositories.

Posted in Uncategorized.

Tagged with , , , , , , .

NECTAR and the Data Asset Framework – final thoughts

“Our DAF project has provided an evidence base for the development of a future research data policy and of services to support research data management.”

Some months ago I blogged about our plans to undertake a DAF project at The University of Northampton. DAF is the Data Asset Framework, which enables universities to audit departmental data collections, awareness, policies and practice for data curation and preservation. There had not previously been any systematic audit of data management practices at the university and it was felt that the DAF methodology would allow us to gather evidence to inform not only preservation planning but also the development of new policy and services to support researchers.

At that stage we had won approval for the project to go ahead and were investigating how we might implement the methodology.

This post describes some of the key challenges and findings of the project we eventually undertook. We conducted both an online survey and a series of interviews. We found that researchers were generally confident in managing their data but held a number of misconceptions regarding the services available to support them in this.


The biggest obstacle to starting the project was the lack of a project researcher. We had no spare capacity in Information Services but we were aware of several university and local initiatives which aimed to provide graduates and PhD students with the opportunity to gain relevant work experience and new skills. We also had contacts in Information Science departments at other universities and considered inviting a Masters student to undertake the project.

In the end, we were fortunate enough to get two graduate interns from the local Graduate Boost [1] programme.

Lesson #1: Making use of graduates from a local scheme was advantageous to both of us. We gained free research assistants, they gained valuable work experience.

Each intern was available for four weeks and we wanted to offer both of them the chance to be involved in a range of project activities, including design, data collection, analysis and reporting. Our implementation of DAF therefore had to be designed in two halves and completed within eight weeks.

Project plan

Actions to be completed by the end of each week:

Week 1 (14/5/10) – Researcher 1
  • Induction/familiarisation with The University of Northampton (UoN).
  • Introduction to DAF (Stage 1): What is it? Why are we doing it? What do we hope to achieve? Who has used DAF before – what can we learn from them? (Contact other institutions if appropriate.) Establish justification and need for this project.
  • Who are the project stakeholders? What do they want from the project? (Conduct interviews with Project Board; Director of Information Services; Deputy Director (Academic Services) and Director of Knowledge Exchange.)
  • Define scope of project, aims and objectives.
  • Write up introduction to project.

Week 2 (21/5/10) – Researcher 1
  • DAF Stage 2: find out about research data at UoN.
  • Conduct interviews with research leaders.
  • Review methodologies used by other institutions (especially content of surveys); draft survey questionnaire for UoN (based on feedback from stakeholders and experience of others); create survey using Bristol Online Surveys.
  • Write up methodology, including justification for decisions taken.

Week 3 (28/5/10) – Researcher 1
  • Conduct pilot survey; evaluate pilot and amend survey as required.
  • Present progress report to Project Board.
  • Online survey goes live; promote to research community.
  • Write up pilot.

Week 4 (4/6/10) – Researcher 1
  • Monitor survey results; produce data analysis plan; produce draft interview plan (incorporating researcher workflow and data management practices).
  • Promote survey to research community and identify possible candidates for interview.
  • Produce briefing sheet for Researcher 2.

Week 5 (11/6/10) – Researcher 2
  • Induction/familiarisation with UoN.
  • Introduction to DAF – familiarisation with DAF methodology and experience so far at Northampton (see Researcher 1’s reports).
  • Monitor ongoing survey results; evaluate plan for data analysis and amend if appropriate; continue promoting survey to research community.
  • Finalise interview schedule; pilot this and make amendments as necessary.
  • Write up plan for analysing survey data.
  • Meet with Project Board.

Week 6 (18/6/10) – Researcher 2
  • DAF Stage 3: identify candidates for interview; conduct interviews; transcribe interviews.
  • Continue monitoring survey.
  • Online survey closes.
  • Write short report of key themes arising in interviews.

Week 7 (25/6/10) – Researcher 2
  • Finish identifying candidates for interview; conducting interviews; transcribing interviews.
  • Analyse survey results and write up.

Week 8 (2/7/10) – Researcher 2
  • DAF Stage 4: final reporting and recommendations.
  • Write up results from interviews; collate with survey results; make recommendations for policy and practice in research data management.
  • Present final report to Project Board, Director of Information Services, Deputy Director (Academic Services) and Director of the Knowledge Exchange.

In DAF terms, Researcher 1 focussed on Stages 1 (planning) and 2 (identification and classification of data assets) while Researcher 2 took on Stages 3 (assessment of data asset management) and 4 (reporting and recommendations). The project was overseen by a Project Board comprising myself (NECTAR repository manager), the university’s Records Manager and Information Services’ Collections and Learning Resources Manager.

Fortunately, the project went according to plan. The first project researcher had no difficulty completing the desk research, interviewing several research leaders, designing and launching the survey, and writing up progress thus far. The second had the harder task: conducting in-depth interviews with researchers, analysing the data from both the survey and the interviews, and producing the first draft of the project report. To his credit, he succeeded in all of this.

Lesson #2: The limited availability of project staff and the tight deadlines for the project forced everyone to work quickly and effectively in order to complete the project in time.

Project findings

The online survey attracted 80 responses – more than we initially expected. No doubt the incentive of a £50 prize draw helped. The survey covered a wide range of issues including the types, sizes and formats of research data held; its ownership; means of storage; security arrangements; sharing and access over the short and long term; and the requirements of funders. The (sixteen) interviews enabled the project team to follow up key findings from the survey and gather additional technical information on specific data objects.

Lesson #3: The timing of the project, toward the end of the summer term, worked well for us. Researchers had time to respond to the survey and a good number volunteered for the follow-up interviews.

A number of themes emerged. Three generic types of researcher were identified, based on their different needs and behaviours with respect to research data: the research student, the independent researcher and the group researcher/collaborator.

Some common behaviours were identified, for example, researchers overwhelmingly use Microsoft software for creating documents and spreadsheets and so habitually create .doc and .xls file types; similarly, .jpeg is the preferred format for image files. In contrast, there is much greater variation in the file types used for databases, audio and video files. This of course has significant implications for preservation planning.

Data storage needs and behaviours vary throughout the research lifecycle, with different storage devices being prominent at the data collection, analysis and project completion stages. For those that need to share data, a shared server has proved to be effective, but where this is not available, email is most frequently used.

Relatively few Northampton researchers have applied for funding from a body that mandates open access to research data and just over half are interested in a university repository for data (either for their exclusive use or for wider access). Of those that are interested in using a repository, more are interested in storing data for a finite period (say, until the end of the project or for a specified period thereafter) than indefinitely. Although a significant minority of researchers would consider allowing academic colleagues, either at Northampton or elsewhere, to access their data, the majority do not want their research data to be publicly available.

The project highlighted some common problems and concerns:

  • researchers’ data management practices are guided by intuition rather than informed by good practice;
  • data are sometimes neglected once a project is complete;
  • there is uncertainty surrounding the ownership of data;
  • in some cases data are still being collected or stored in out-dated formats;
  • the university’s shared server space is under-exploited;
  • researchers are unaware of, or misinformed about, the full range of services available to them.

It is impossible to know whether these issues are common to the research community as a whole, but they certainly provide a useful starting point for a dialogue with researchers.


The DAF project gave us a much greater understanding of the needs and wants of researchers with respect to their research data. It has provided an evidence base for the development of a future research data policy and of services to support research data management. Project recommendations additionally include clarification of data ownership, provision of data management training and guidelines, and promotion of digital preservation.

Although researchers were generally not interested in storing their datasets in the university’s open access institutional repository (NECTAR), the DAF project has been helpful in confirming that the file types which are stored in NECTAR are representative (albeit as a subset) of the types generally created by researchers. Future preservation planning for the repository will therefore also inform preservation plans for research data.

UPDATE (18 October 2010) The full report on this work is now available: E. Alexogiannopoulos, S. McKenney, and M. Pickton, Research Data Management Project: a DAF investigation of research data management practices at The University of Northampton, University of Northampton, September 2010

[1] Graduate Boost is designed to provide unemployed graduates with work experience and postgraduate credits towards an MBA qualification. It is managed by 3e, a social enterprise employment agency, on behalf of the Northampton Business School, and is funded by the Higher Education Funding Council for England (HEFCE) and the European Union European Social Fund.


Posted in Uncategorized.

Tagged with , , , .