As an institutionally-based digital repository, eCrystals is somewhat different – both as an exemplar in the KeepIt project and in the institutional repository landscape as a whole. It is operated by the National Crystallography Service (NCS), which is funded on a five-year grant basis. This brings preservation implications and requirements rather different from those faced by repositories that institutions set up as a component of their research infrastructure: when grant funding ceases, so does support for the repository, and its future hangs in the balance.
It costs money to do preservation. This recognition, together with our periodically precarious funding position, meant that much of our work on eCrystals as an exemplar focused on preservation costs. There is plenty of (wildly contradictory) anecdotal talk and urban myth in the practising research community about how much it costs to preserve data. My perspective draws on personal experience and other reported work in the area. What is clear is that the community needs to know how much it costs to set up a repository and what the financial implications are of migrating all the old data into it. Thinking about these costs has been particularly instructive, and the main lesson ought to be blindingly obvious – setting up and maintaining a data repository is relatively cheap and easy (provided you are not the innovator or prime mover in the area). It’s populating it with all your old data that really costs.
eCrystals holds the results (in the form of multiple, small data files) of crystallographic experiments performed at the NCS, and is operated by the NCS as an independent mid-range facility funded to serve UK academics in the chemistry (and related subjects) sector. An important part of our interaction with the KeepIt project was the registration of file formats so that digital preservation services can automatically recognise and understand our repository content. The authoritative PRONOM registry recognises several hundred file formats, but these are the popular ones; domain-specific formats such as our Crystallographic Information File (CIF) and Chemical Markup Language (CML) – which are ubiquitous in crystallography and chemistry, respectively – were not included. During the project, signature files for these two formats were created for the DROID format identification tool, which applies data from PRONOM. These signatures will be submitted for inclusion in the formal PRONOM registry.
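To give a flavour of what signature-based identification involves, here is a minimal sketch in Python. It is not the signature files we wrote for DROID, and it does not use the PRONOM or DROID formats; the function names and the exact patterns are illustrative assumptions. It relies only on two well-known markers: CIF files may start with a `#\#CIF_` version comment and contain `data_` block headers, and CML documents are XML that usually declares the `http://www.xml-cml.org/schema` namespace or uses a `cml` root element.

```python
import re

# Hypothetical heuristics for illustration only; these are NOT the
# signatures submitted to PRONOM or used by DROID.
CIF_MAGIC = b"#\\#CIF_"  # optional version comment, e.g. "#\#CIF_1.1"
CIF_DATA_BLOCK = re.compile(rb"^data_\S+", re.MULTILINE | re.IGNORECASE)
CML_NAMESPACE = b"http://www.xml-cml.org/schema"
CML_ROOT = re.compile(rb"<(?:\w+:)?cml[\s>]")


def identify_format(head: bytes) -> str:
    """Guess the format from the first bytes of a file."""
    if head.startswith(CIF_MAGIC) or CIF_DATA_BLOCK.search(head):
        return "CIF"
    if CML_NAMESPACE in head or CML_ROOT.search(head):
        return "CML"
    return "unknown"


def identify_file(path: str, nbytes: int = 4096) -> str:
    """Read only the start of the file, as signature tools typically do."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(nbytes))
```

A real signature also has to avoid false positives against every other format in the registry, which is why the work belongs with the domain community that knows the format best.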
Working with KeepIt and other projects has given momentum to the preservation of crystallography data in the eCrystals repository and in related repositories. Looking ahead, we intend to maintain this momentum. Through the project we recently invested in an Apple iPad, and we are developing an app as a front-end to an electronic laboratory notebook / blog service. As we have reported, we recognised that the best possible moment to begin preservation is at the time the experiment is performed, as it is prohibitively expensive to recreate the data at a later stage. The idea for the app is that the contextual information that underpins publication and preservation is built up as the experiment progresses – not as is done now, where a bunch of files are uploaded some time after the event and some (arbitrary) metadata assigned.
This means capturing data in the laboratory, which is not easy (even in a conventional lab notebook), and we are spinning out a project to address this problem – the smart laboratory with pervasive data and metadata recording. A primary problem here is that drawing or ‘scribble’ software is poor, and chemists draw; they don’t generally type. Our app is being specified to resolve such issues by enabling the chemist to sketch reactions, note observations, and make and test hypotheses – this is the valuable chemical metadata that gives our data meaning in the long term. Tablet PCs tested in the past proved too cumbersome, but iPad-type technology could be a winner in terms of portability and ease of interaction, solving these problems and making data capture in the lab instant and efficient. We are also investigating the use of portable devices (mobile phone as well as iPad) to record audio and video in the laboratory, to act as anything from the primary observation record to contextual or supporting metadata.
In summary, the most striking lessons learned for the NCS by working with the KeepIt project are:
- Preservation isn’t hard – you just need to think about it and then generate a preservation plan.
- The hard part is following the preservation plan and getting those involved in the right mindset.
- It is acceptable to ‘just do nothing’, but this must be the conclusion of thinking about preservation.
- As long as storage is kept live (on spinning disks), unknown or unmanaged file formats pose a major risk of information loss.
- Subject domains or communities should therefore be encouraged to supply descriptions of their specific formats (e.g. DROID signatures) to make sure they don’t suffer from file format rot.
- Repository software (like EPrints) is making preservation easier, by incorporating tools to help identify risks leading to information loss.
- It’s relatively cheap to set up a repository that will, among other functions, preserve your data.
- Retrospective preservation, migrating data and populating repositories are where the real costs lie.
- The best possible moment to begin preservation is at the time the experiment is performed and data is generated.
- New portable computing devices and apps will help capture data in the lab and embellish it immediately with metadata.