
Preserving crystallographic data in a digital repository: a costs based analysis

eCrystals has been presented within KeepIt as an exemplar of a scientific data repository, but more particularly it exemplifies the ‘one man band’ scientific repository.

No research scientist will take on the responsibility of setting up and administering a scientific data repository without an indication of the financial implications and savings involved. Until now these costs have largely been anecdotal. Here we provide documentary evidence based on our experience gained in developing and deploying the eCrystals repository at the UK National Crystallography Service (NCS) and the eCrystals Federation Project, which engaged the broader crystallographic community.

“The major cost components do not lie in implementing and maintaining a scientific data repository. Populating the repository is the expensive aspect of operating such a resource”

An ‘at a glance’ summary of this information is of vital importance for the practising researcher evaluating whether they wish to implement a data management and digital preservation solution. These data are not new – a more substantive version originally appeared in the Keeping Research Data Safe 2 project report – but here we present a summary couched in a manner more familiar to the practising scientist and covering the costs of a) the storage of data over the lifetime of a facility and b) migrating data from one era, technology or regime to another.

The actual costs (full economic costs based on a 250-working-day year) of setting up and persistently operating established software for an average-size crystallography laboratory (i.e. 1 person, 1 instrument) can be estimated as follows:

  1. Installation of the repository software by an untrained operative (i.e. a research scientist) will take 2.5 days – £800.
  2. Training (sys admin) takes a day – £340.
  3. Performing sys admin – the cost of supporting the repository component of the NCS electronic infrastructure is based on the fact that this aspect takes about 10% of the time of the 0.1FTE systems administrator – £800 per year.
  4. Training (deposit) takes one day – £340.
  5. Deposit, management, appraisal and publication takes 5% of the time of a single researcher – £4000 per year.
  6. DOI registration depends, to an extent, on the number of records but averages £250 per annum.
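As a rough illustration, the items above can be totalled in a few lines of Python. The figures are the ones listed; splitting them into one-off and recurring items, ignoring inflation, and all of the names used are assumptions made for this sketch rather than part of the original costing.

```python
# GBP figures are taken from the list above. The split into one-off and
# recurring items, and every name here, is an assumption for illustration.
ONE_OFF_COSTS = {
    "installation": 800,        # item 1: 2.5 days by an untrained operative
    "sysadmin_training": 340,   # item 2
    "deposit_training": 340,    # item 4
}
ANNUAL_COSTS = {
    "sysadmin": 800,                # item 3
    "deposit_and_curation": 4000,   # item 5
    "doi_registration": 250,        # item 6 (an average)
}


def cumulative_cost(years: int) -> int:
    """Total cost in GBP after `years` of operation, ignoring inflation."""
    return sum(ONE_OFF_COSTS.values()) + years * sum(ANNUAL_COSTS.values())


def deposit_share_of_annual() -> float:
    """Fraction of the recurring cost spent on deposit and curation."""
    return ANNUAL_COSTS["deposit_and_curation"] / sum(ANNUAL_COSTS.values())


if __name__ == "__main__":
    print("One-off set-up cost:", sum(ONE_OFF_COSTS.values()))
    print("Recurring cost per year:", sum(ANNUAL_COSTS.values()))
    print("Cumulative cost after 5 years:", cumulative_cost(5))
    print(f"Deposit share of recurring cost: {deposit_share_of_annual():.0%}")
```

On these figures the researcher's time spent on deposit and curation accounts for most of the recurring cost, which is the point made in the rest of this post.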

You can’t just set up a repository and then expect that the data will get into it for free. All researchers carry a lot of historical data from throughout their careers (crystallographers perhaps more so than others!) and therefore it is crucial to get an idea of what it will cost to migrate your data to the repository.

In coming to the following conclusions, I am drawing on data and facility costs over a period of NCS operation from 1970 to 2009. During this timeframe experimental instrumentation, computational capability and data storage media have all changed radically. There has also been a change in the raw and results data that we collect – these days raw data can be a couple of hundred binary image files (as opposed to the lists of observed reflections from serial counter days) in the gigabyte size range, whilst results data can be as little as a single text-based CIF file of the order of a few kilobytes. When considering these elements of change, one can group transitions between technologies – e.g. the introduction of personal computers, a new generation of instrumentation or the advent of online storage – into three broad periods (1970-1990, 1990-2000 and 2000-present). For crystallographers, these eras correspond loosely to the serial detector age (1970-1990; data stored on paper), the early days of area detectors (1990-2000; data stored on magnetic disks) and the modern age where large volumes of data are being generated (2000-; data stored on CDs/DVDs and, more recently, online).

“The best possible job of preservation must be done at the time an experiment is performed, as it is prohibitively expensive to recreate the data at a later stage”

One vital point often overlooked is that we can store and migrate data over many decades, but we can’t do this with the samples that we measure! The cost of a crystal structure with current equipment is £328; the cost to regenerate a structure from the past can be anything up to 60 times this amount (ca £20k) if the raw data (and appropriate correction files) are not available or the sample has to be re-made (includes all the expertise and laboratory infrastructure from an entire research project). It is not simply a matter of doing the experiment (or analysis) again if you don’t have the sample.
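The arithmetic behind the "ca £20k" figure can be checked with a short sketch. The £328 cost and the factor of 60 come from the paragraph above; the function name and the idea of scaling to several structures are assumptions made purely for illustration, and the factor is an upper bound rather than a typical value.

```python
# Figures from the text: GBP 328 per structure today, and up to 60 times
# that to regenerate a structure whose raw data or sample is gone.
CURRENT_COST_GBP = 328
MAX_REGENERATION_FACTOR = 60  # upper bound quoted in the text


def worst_case_regeneration_cost(n_structures: int) -> int:
    """Upper-bound cost in GBP of regenerating `n_structures` structures."""
    return n_structures * CURRENT_COST_GBP * MAX_REGENERATION_FACTOR


if __name__ == "__main__":
    # A single structure, which the text rounds to about GBP 20k.
    print(worst_case_regeneration_cost(1))
```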

The reason a sample might need to be resynthesised is that in the past it was not possible to store and preserve raw data efficiently (some data have been kept on paper and magnetic media, but migrating them to online media is prohibitively expensive and prone to data corruption). In more recent times raw data could be preserved, in which case the cost of recreating the data is only that of (re)interpreting the raw data and is therefore considerably lower. The most obvious points regarding the costs of data storage are that:

  • The cost of storing data has dramatically reduced.
  • The cost of migrating data from recent eras, when computing has been more prevalent, is significantly lower than for earlier eras.
  • The cost of storing raw data is around 70% of the total data storage cost.

For those scientists who still want to do something with all the structures in their filing cabinets or data on floppy disks, the following cautionary points regarding data migration should be noted (again, the full costs can be found in the KRDS2 report):

  • Long term storage of data in its native format becomes less worthwhile with time (this is because these formats cannot be migrated, i.e. an instrument manufacturer’s proprietary binary format cannot be read by newer generations of integration software). The most cost-effective approach would be to transform the raw data into a common interchangeable format and subsequently migrate it.
  • It is considerably more costly to migrate results data than to preserve them – this is due to the variety of formats and storage media used over the years.
  • The relative cost of migration against storage fluctuates considerably between eras, and it does not necessarily follow that modern approaches make migration cheaper, relative to storage, than it was in earlier eras.

Migrating raw and results data highlighted some important points regarding data loss:

  • During migration of raw data from CDs/DVDs to online storage there was a 7% loss of data.
  • Migration of results from floppy disks resulted in a 5% data loss.
  • Less than 1% of results were lost in the migration from paper, although the cost of performing the migration itself was extremely variable due to the differing quality of records.
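To get a feel for what these loss rates could mean for a legacy collection, the sketch below applies them to a given number of records. The percentages are those reported above (with "less than 1%" treated as an upper bound of 1%). The dictionary keys, the function name and the collection sizes are hypothetical, and the rates observed in one migration exercise will not necessarily carry over to another.

```python
# Loss rates (percent) as reported above; the paper figure is an upper
# bound because the text says "less than 1%". Keys are illustrative names.
LOSS_RATE_PERCENT = {
    "cd_dvd_raw": 7,       # raw data, CDs/DVDs to online storage
    "floppy_results": 5,   # results data from floppy disks
    "paper_results": 1,    # results data from paper (upper bound)
}


def expected_lost(source: str, n_records: int) -> float:
    """Expected number of records lost when migrating from `source`."""
    return n_records * LOSS_RATE_PERCENT[source] / 100


if __name__ == "__main__":
    # Hypothetical collection sizes, chosen only to show the calculation.
    collection = {"cd_dvd_raw": 1000, "floppy_results": 200, "paper_results": 500}
    for source, n_records in collection.items():
        print(source, expected_lost(source, n_records))
```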

In summary, it is clear that the major cost components do not lie in implementing and maintaining a scientific data repository. Populating the repository is the expensive aspect of operating such a resource – this involves changing working practice and importing legacy data. It is also clear that the best possible job of preservation must be done at the time an experiment is performed, as it is prohibitively expensive to recreate the data at a later stage when the sample has ceased to exist.

Storage is a relatively cheap and well-understood process these days and there is now no reason why all raw and derived data cannot be kept long into the future. The migration of legacy data is a time-consuming and costly process and it is almost certainly not worth trying to migrate historic raw data. Careful consideration should be given, on a case-by-case basis, to whether it is worth migrating results data. One must bear in mind, however, that the migration of results data is a one-off cost if the process is performed correctly and the data are stored in a form that makes them easy to migrate in the future.



One Response


  1. Steve Hitchcock says

    Confused by raw and results (or derived) data in Simon’s blog post? There is a good illustrated explanation of the distinction in ars technica, part of an ongoing series on preserving science
    Preserving science: what data do we keep? What do we discard?
    http://arstechnica.com/science/news/2010/11/preserving-science-choosing-what-data-to-discard.ars


