Our third repository exemplar is eCrystals, which manages scientific data, specifically crystallography data, of the kind that might be referred to broadly as e-data or e-science.
To recap the purpose of these initial surveys of the four exemplar repositories in the KeepIt project, we are seeking to characterise the repositories not in terms of their preservation activity but in terms of factors that will influence possible preservation strategies.
Since this repository operates somewhat differently from the others, some brief background information is needed. The repository is operated by the National Crystallography Service (NCS) based at the University of Southampton, so it is a national service. It manages two types of data: the ‘raw’ data generated directly by crystal analysis, and the results data ‘derived’ from the raw data.
NCS offers two types of experimental service to its users:
- Full determination, where NCS generates raw data and works up derived data into results. This is deposited in eCrystals (generally initially embargoed);
- Data Collection only, where NCS collects the raw data and turns it into the first stage of derived data. This derived data is then sent to users and they work it up into results. None of the user-derived or results data is deposited in eCrystals.
The future plan is to use eCrystals for the second service (Data Collection only) as well: NCS deposits first-stage derived data, the user picks it up, turns it into a result and deposits the result into eCrystals.
Current status of repository
Funded by JISC through the eCrystals open-access archive project until the end of March 2009.
Archive service provided as part of NCS at Southampton University, funded by EPSRC. This funding is subject to periodic review in the forthcoming research cycle.
eCrystals Southampton is the archive for Crystal Structures generated by the Southampton Chemical Crystallography Group and the EPSRC UK National Crystallography Service (NCS).
Management structure and decision-making, reporting tree
Management structure of the repository is headed by the director of the national service.
Staffing (no. FTE)
0.5 FTE systems administrator
Policy (documentation, e.g. mandate, format policy, retention policy, take down policy?)
Embedded into NCS Publication Policy: “We have created an archival method for crystal structure data which is designed to reside on Institutional Repositories.
“At the present time, we are operating two archives. One is a private resource, visible only within the Southampton firewall, which is used as a comprehensive laboratory management and data archival system, to which we now routinely upload all completed and validated crystal structure determination outputs. The other archive is an open access resource, visible externally, which we are now using as a direct route to dissemination of structural data. Each entry is assigned a Digital Object Identifier (DOI) so that the entry may be referenced in any future publication.”
For journal publications that report and link to crystal structure determinations presented in the repository, the policy recognises the importance of satisfying publishers and the public that the repository will have the same stability and longevity as journal publications.
The “two” archives referred to here are concerned with just the derived and results data, not raw data. The difference between them today is effectively that between embargoed and non-embargoed data. The “raw” data is stored at the Atlas Data Store at the Rutherford Appleton Laboratory, essentially a closed repository that is used as an internal store, but this data is available on request (by email, post or Dropbox-type solutions).
Planning the repository (formal planning approach?)
Repository founded on JISC project planning and design
Data architecture carefully matched to crystallography requirements
Investigated preservation needs and options: A study of Curation and Preservation Issues in the eCrystals Data Repository and Proposed Federation
- Representation Information for Crystallography Data;
- Preservation Planning for Crystallography Data;
- Preservation Metadata for Crystallography Data
Budget (contingency for preservation?)
Budget covers storage of raw data.
Results (derived) data not formally budgeted, but this is to be reviewed.
Infrastructure (institutional, network, etc.)
eCrystals server managed by the systems administrator.
Archive is backed up nightly. No offsite backups of server. The backup is within the chemistry department, to a building connected by corridor.
Personal curation culture – analysis of crystal structures performed on a series of Linux boxes
Tools, services and support (which v. EPrints?)
Reconfigured repository software, core code modified, bespoke standalone code and third-party Web services used.
Storage (current, strategy?)
Record of the raw data back to about 2002, including frame images, at the Atlas Data Store at the Rutherford Appleton Laboratory.
Testing storage of raw data (500 GB) from just the last couple of years on a Honeycomb server at the School of ECS, Southampton (the Honeycomb hardware platform has been discontinued by Sun, but support continues to 2013).
Data from 1998-2002 is on USB disks stored in our lab, migrated from CDs written at the time of generation.
Institutional solution preferred.
Content profile – volume, types, formats (content control?)
The information contained within each entry of this archive is all the fundamental and derived data resulting from a single crystal X-ray structure determination, but excluding the raw images.
archive: 480, buffer: 26, inbox: 65, deletion: 7, eprint: 578
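As a quick consistency check on these counts, the four status figures sum to the eprint figure. This suggests, though the survey does not state it, that "eprint" is the total number of records across all statuses. A minimal Python sketch of that arithmetic follows; the variable names are illustrative only.

```python
# Record counts as listed above. Treating "eprint" as the overall total
# is an assumption; the survey does not say so explicitly.
status_counts = {"archive": 480, "buffer": 26, "inbox": 65, "deletion": 7}
eprint_total = 578

# Sum of the four status counts, compared against the eprint figure.
status_sum = sum(status_counts.values())
print(status_sum == eprint_total)

# Share of all records that sit in the live archive, as a percentage.
archive_share = round(100 * status_counts["archive"] / eprint_total, 1)
print(archive_share)
```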
Preserv format profile (large number of files ‘unknown’ to profiling tool)
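The Preserv profile itself is produced by a dedicated format-identification tool, which is not reproduced here. As a rough illustration of what a format profile is, and of why many files can come out as ‘unknown’, the following Python sketch tallies files by extension against a small lookup table. The function name, the lookup table and its entries are assumptions made for illustration; dedicated profiling tools typically identify formats from file signatures rather than from extensions alone.

```python
from collections import Counter
from pathlib import Path

# Hypothetical lookup table; a real profile would draw on a format registry.
KNOWN_FORMATS = {
    ".cif": "Crystallographic Information File",
    ".html": "HTML",
    ".pdf": "PDF",
}


def profile_formats(root):
    """Count files under root by format, labelling unmatched files 'unknown'."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Extensions are compared case-insensitively.
            label = KNOWN_FORMATS.get(path.suffix.lower(), "unknown")
            counts[label] += 1
    return counts
```

Calling `profile_formats` on a directory of deposited files returns a `Counter` keyed by format name. A large ‘unknown’ count would indicate that the lookup table (or, for a real tool, its format registry) lacks entries for domain-specific crystallography formats.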
Growth projections (scaling up?)
Plan to expand remit of repository to cover user-derived data (see above).
Scientific instrumentation has a typical lifespan of 10 years. As equipment is renewed there is likely to be an order of magnitude increase in data volumes.
Future plans for the repository (any major changes planned?)
Change storage model – cloud?
Target more learned society involvement.
- Part of national service provision
- Detailed repository data architecture design developed over several project iterations
- Highly customised (EPrints) repository software
- Well informed and proactive on preservation
- Funding uncertainties pending review
- Review storage provision
- Examine infrastructure options and prospects, strengthen current arrangements
- Assess scope for policy provision beyond publishing policy
- Consider how to cultivate and embed personal curation practices endemic in this field of science
- Produce more complete profile of deposit formats
- Consult on upgrade to EPrints v3.2 when available, or assess how to integrate preservation-support tools from this version in the customised repository software
Thanks to Simon Coles, Manjula Patel and Richard Stephenson for sharing and clarifying this information.