<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Diary of a Repository Preservation Project &#187; eCrystals</title>
	<atom:link href="http://blog.soton.ac.uk/keepit/tag/ecrystals/feed/?withoutcomments=1" rel="self" type="application/rss+xml" />
	<link>http://blog.soton.ac.uk/keepit</link>
	<description>How digital repositories can plan for the future - by four exemplars in science, arts, research and teaching</description>
	<lastBuildDate>Fri, 08 Apr 2011 14:38:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.1</generator>
	<atom:link rel='hub' href='http://blog.soton.ac.uk/keepit/?pushpress=hub'/>
<cloud domain='blog.soton.ac.uk' port='80' path='/keepit/?rsscloud=notify' registerProcedure='' protocol='http-post' />
		<item>
		<title>Exemplar preservation repositories: comparison by format profile</title>
		<link>http://blog.soton.ac.uk/keepit/2010/11/18/exemplar-preservation-repositories-comparison-by-format-profile/</link>
		<comments>http://blog.soton.ac.uk/keepit/2010/11/18/exemplar-preservation-repositories-comparison-by-format-profile/#comments</comments>
		<pubDate>Wed, 17 Nov 2010 23:06:18 +0000</pubDate>
		<dc:creator>Steve Hitchcock</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[eCrystals]]></category>
		<category><![CDATA[EdShare]]></category>
		<category><![CDATA[exemplar profiles]]></category>
		<category><![CDATA[file formats]]></category>
		<category><![CDATA[NECTAR]]></category>
		<category><![CDATA[UAL Research Online]]></category>

		<guid isPermaLink="false">http://blog.soton.ac.uk/keepit/?p=2111</guid>
		<description><![CDATA[How format profiles can reveal potentially characteristic fingerprints for emerging types of repository. Not all digital repositories are the same. Nor are all institutional repositories the same. In fact, the differences between the types of repositories emerging recently can be surprisingly large. In KeepIt we&#8217;ve been investigating how these differences might affect digital preservation practices. [...]]]></description>
			<content:encoded><![CDATA[<p><strong>How format profiles can reveal potentially characteristic fingerprints for emerging types of repository.</strong></p>
<p><a href="http://blog.soton.ac.uk/keepit/files/2010/11/format-tail-slim.png"><img class="alignright size-medium wp-image-2448" src="http://blog.soton.ac.uk/keepit/files/2010/11/format-tail-slim-210x300.png" alt="Format profile long tail" width="210" height="300" /></a>Not all digital repositories are the same. Nor are all institutional repositories the same. In fact, the differences between the types of repositories emerging recently can be surprisingly large. In KeepIt we&#8217;ve been investigating how these differences might affect digital preservation practices. We now have results and we shall compare some of those here.</p>
<p>There are various ways in which these differences can be seen. One is in the tools that have been adopted by our <a title="Tag: exemplar profiles, Diary, various entries" href="http://blog.soton.ac.uk/keepit/tag/exemplar-profiles/" target="_self">exemplar repositories</a> &#8211; covering arts and sciences, education and research &#8211; as they begin to explore the application of digital preservation. In the first instance these tools were selected from those covered in the <a title="Tag: KeepIt course, Diary, various entries" href="http://blog.soton.ac.uk/keepit/tag/keepit-course/" target="_self">KeepIt course</a>. It was expected that most would choose to explore the use of one or two tools initially, depending on their priorities. We can see from their reports they have chosen different paths &#8211; from data scoping, to costs, and risk assessment &#8211; and each can be seen to be an appropriate and revealing choice when you examine the issues faced by the respective exemplars.</p>
<blockquote><p><strong>&#8220;Evidently, like an arts repository (perhaps predictably) and like a science data repository (less predictably), it seems the emphasis of a repository of teaching resources may be visual rather than textual.&#8221;</strong></p></blockquote>
<p>A more direct way to view and understand the differences between repositories with different content remits is to look at the range of file types they manage. This format profile is the starting point for preservation plans and actions. <a title="KeepIt course 3: Introducing preservation workflow and format risks, Diary, August 23, 2010 " href="http://blog.soton.ac.uk/keepit/2010/08/23/keepit-course-3-introducing-preservation-workflow-and-format-risks/" target="_self">Why formats matter</a> for preservation was covered in the KeepIt course. One approach that has been adopted by three of the four exemplars, because we designed and implemented this for them, is the <a title="EPrints preservation apps: from PRONOM-ROAR to Amazon and a Bazaar, Diary, November 15, 2010 " href="http://blog.soton.ac.uk/keepit/2010/11/15/eprints-preservation-apps-from-pronom-roar-to-amazon-and-a-bazaar/" target="_self">EPrints preservation &#8216;apps&#8217;</a>. This tool bundles a range of apps, including the DROID file format identification tool from The National Archives, to present a format, or preservation, profile within a repository interface. Here we will reveal and compare the profiles of the four exemplars.</p>
<h3>Format profiles past and present</h3>
<p>First, we should recall that we have been producing <a title="Preservation Services, Testing format classification, Preserv 2 project" href="http://www.preserv.org.uk/guide/preservation/?slide=1" target="_self">format profiles in previous projects</a>, using earlier variants on the tools. What we have now are more complete and distinctive profiles.</p>
<blockquote><p><strong>&#8220;One major similarity between the exemplars and earlier format profiles is the &#8216;long tail&#8217;&#8221;</strong></p></blockquote>
<p>One major similarity we can note, however, between the exemplars and earlier profiles is that each profile is dominated by one format, measured by the total number of files stored in the repository in that format, followed by a steep, power-law-like decline in the number of files per format &#8211; the &#8216;long tail&#8217;. For open access research repositories the typical profile is dominated by PDF and its variants and versions. This has been known for some time from our previous work. In the case of our KeepIt exemplars only one, the research papers repository, has this classic PDF-led profile. We can now reveal how the others &#8211; a science repository, an arts repository, and an educational and teaching repository &#8211; differ, and thus begin to understand what preservation challenges they each face.</p>
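<p>As a rough illustration of what a &#8216;long tail&#8217; means numerically, the sketch below ranks formats by file count and fits a straight line to log(count) against log(rank). A slope that is roughly constant and negative is one informal sign of a power-law-like decline. The function name and the counts are invented for illustration; they are not taken from any of the exemplar profiles, and this is not part of the EPrints tools.</p>

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Formats are ranked from most to least frequent. Zero counts are
    ignored because their logarithm is undefined.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two non-zero counts")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented counts for illustration only (roughly 4000 / rank squared).
example_counts = [4000, 1000, 444, 250, 160, 111, 82, 63]
slope = loglog_slope(example_counts)
```

<p>A single fitted slope is only a quick check, not a statistical test; a real profile would need closer inspection before claiming any particular distribution.</p>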
<h3>Producing format profiles</h3>
<p>Before we do this, bear in mind these general background notes on how the profiles were produced. For the scale of repositories we have been working with here this is now a substantial processing task that can take hours and days to complete.</p>
<p>For three repositories the counts include only accepted objects and do not include &#8216;volatile&#8217; objects. The fourth (University of the Arts London) includes all objects, including those in the editorial buffer and volatiles. Repositories use editorial buffers to moderate submissions. Depending on the repository policy there may be a delay between submission, acceptance and public availability. Volatiles are objects that are generated when required by the repository &#8211; an example would be thumbnail previews used to provide an instant but sizeably reduced view of the object.</p>
<p>These are growing repositories, so the profiles must be viewed as temporary snapshots for the dates specified. They are provided here for illustration. For those repositories that have installed the EPrints preservation apps, the repository manager is provided with regular internal reports including an updated profile, and will need to track the changes between profiles as well as review each subsequent profile.</p>
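<p>Tracking the changes between two profile snapshots can be pictured as a simple comparison of counts per format. The following is a minimal sketch under our own assumptions: the function name, the dictionary layout and the numbers are hypothetical, and it does not describe how the EPrints preservation apps produce their reports.</p>

```python
def profile_changes(previous, current):
    """Return {format: change in file count} for formats whose count changed.

    Both arguments map a format name to a file count. A format missing
    from one snapshot is treated as having a count of zero.
    """
    changes = {}
    for fmt in set(previous) | set(current):
        delta = current.get(fmt, 0) - previous.get(fmt, 0)
        if delta != 0:
            changes[fmt] = delta
    return changes


# Invented snapshots for illustration only.
earlier = {"JPEG 1.01": 900, "PDF 1.4": 300, "unknown": 120}
later = {"JPEG 1.01": 950, "PDF 1.4": 300, "unknown": 90, "PNG 1.0": 12}
changes = profile_changes(earlier, later)
```

<p>In this invented example the number of unknowns falls while a new format appears, which is the kind of movement a manager would want to notice between reports.</p>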
<h3>Understanding and responding to format profiles</h3>
<p>We also need to understand some features of the tools when reviewing the results. In these results we have &#8216;unknown&#8217; formats and &#8216;unclassified&#8217; formats. Unclassified are files that are still to be classified. These may be new files that have been added since a profile scan began (scans can take some time) or since the last full scan.</p>
<blockquote><p><strong>&#8220;The number of unknown file formats is likely to be a major factor in assessing the preservation risk faced by a repository&#8221;</strong></p></blockquote>
<p>More critical for preservation purposes are files with unknown formats. To identify a file format a tool such as DROID &#8211; an open source tool integrated within the EPrints preservation apps &#8211; looks for a specified signature within the object. If it can&#8217;t match a file with a signature in its database it is classified as unknown. In such cases it may be possible to identify the format simply by examining the file extension (.pdf .htm .gif, etc.). In most cases a file format will be exactly what it purports to be according to this extension. The merits of each approach, by format signature or filename extension, can be debated; neither is infallible, nor has the degree of error been rigorously quantified. It is up to the individual repositories how they interpret and resolve these results.</p>
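<p>The two approaches can be sketched side by side. The example below checks the first bytes of a file against a handful of widely documented signatures and, only if none matches, falls back to the filename extension. It is a deliberately simplified illustration: DROID&#8217;s actual signatures and matching rules are more elaborate, and the function name, the signature list and the return labels here are our own assumptions.</p>

```python
import os

# A few well-known leading byte sequences ("magic numbers").
# This short list is illustrative and far from a full registry.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
]


def identify(data, filename):
    """Return (label, method) for a file's bytes and name.

    method is "signature" when the leading bytes match, "extension"
    when only the filename suffix is available, and "unknown" otherwise.
    """
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label, "signature"
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    if ext:
        return ext, "extension"
    return "unknown", "unknown"
```

<p>For example, <code>identify(b"GIF89a", "photo.jpg")</code> reports a GIF by signature even though the extension says otherwise, while a file with no matching signature, such as a CIF file, is labelled by its extension alone. Both outcomes show why neither method should be trusted without review.</p>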
<p>The number of unknowns will be a major factor in assessing the preservation risk faced by a repository and is likely to be the area requiring most attention by its manager, at least initially until the risk has been assessed. We believe that in future it will be possible to <a title="Tarrant, et al., Where the Semantic Web and Web 2.0 meet format risk management: P2 registry. In: iPres 2009 International Conference, San Francisco" href="http://eprints.ecs.soton.ac.uk/17556/" target="_self">quantify the risk of known formats</a>, and to build <a title="KeepIt course 4: putting a preservation plan into EPrints, Diary, September 21, 2010 " href="http://blog.soton.ac.uk/keepit/2010/09/21/keepit-course-4-putting-a-preservation-plan-into-eprints/" target="_self">preservation plans to act on these risks within repositories</a>.</p>
<p>For formats known to specialists but not to the general preservation tools, it will be important to enable these to be added to the tools. When this happens it will be possible for the community to begin to accumulate the factors that might contribute to the risk scores for these formats. As long as formats remain outside this general domain, it will be for specialists to assess the risk for themselves. We will see examples of this in the cases below.</p>
<p>Producing format profiles is becoming an intensive process, and subsequent analysis is likely to be no less intensive.</p>
<h3>Science data repository (eCrystals, University of Southampton)</h3>
<p>A specialised science data repository is likely to have file types that a general format tool will fail to recognise. For this repository of crystal structures we anticipated two such formats &#8211; CIF and CML &#8211; and we <a title="Adding chemistry to a file format registry, Diary, September 16, 2010" href="http://blog.soton.ac.uk/keepit/2010/09/16/adding-chemistry-to-a-file-format-registry/" target="_self">reported</a> how signatures for these formats were added to the identification tool. What we can see in this profile is how successful, or not, these signatures were. That is, successful for CIF, but only partially successful for CML.</p>
<p>For this repository, which uses a customised version of EPrints and therefore has not so far installed the apps, we ran the tool for them over a copy of their content temporarily stored in the cloud. Figure 1 shows the full profile for this repository, including unknowns (in red), those formats not identified by DROID but known to EPrints (showing both the total and the breakdown in yellow), as well as the long tail of identified formats. All but two CIF files were identified by DROID. Had all the instances of CML been recognised it would have been the largest format with most files (adding the yellow and blue CML bars), but almost half were not recognised by DROID.</p>
<div id="attachment_2176" class="wp-caption aligncenter" style="width: 510px"><a href="http://blog.soton.ac.uk/keepit/files/2010/10/ecrystals-formats-all-bar.png"><img class="size-full wp-image-2176" src="http://blog.soton.ac.uk/keepit/files/2010/10/ecrystals-formats-all-bar.png" alt="eCrystals full format profile including formats 'unknown' to DROID and the repository (in red), the breakdown of those classified by the repository (yellow bars), as well as the long tail of formats classified by DROID " width="500" height="640" /></a><p class="wp-caption-text">Figure 1. eCrystals: full format profile including formats &#39;unknown&#39; to DROID and the repository (in red), the breakdown of those classified by the repository (yellow bars), as well as the long tail of formats classified by DROID. Chart generated from spreadsheet of results (profile date 1 Oct. 2010)</p></div>
<p>As it stands the format with the largest number of files known to DROID was, interestingly, an image format (JPEG 1.01). We will see this is a recurring theme of emerging repository types exemplified by our project repositories.</p>
<p>Compared with the other exemplar profiles to follow, this profile appears to have a shorter tail. However, in this case we can see that &#8216;unknown&#8217; (to DROID and EPrints) is the largest single category, and when this is broken down it too presents a long tail (Figure 2) that is effectively additive to the tail in Figure 1. The unknowns include more specialised formats, which might be recognised by file extension.</p>
<div id="attachment_2180" class="wp-caption aligncenter" style="width: 581px"><a href="http://blog.soton.ac.uk/keepit/files/2010/10/ecrystals-formats-unknown-red.png"><img class="size-full wp-image-2180" src="http://blog.soton.ac.uk/keepit/files/2010/10/ecrystals-formats-unknown-red.png" alt="eCrystals 'unknown' formats 'recognised' by file extension" width="571" height="356" /></a><p class="wp-caption-text">Figure 2. eCrystals &#39;unknown&#39; formats by file extension (profile date 1 Oct. 2010)</p></div>
<p>As explained, clearly these unknowns will need to be a focus for the repository managers, although in preliminary feedback they say that many of these files are &#8220;all very familiar, standard crystallography files of varying extent of data handling that often get uploaded to ecrystals for completeness.&#8221; This is reassuring because file formats unknown to system or manager or scientists could be a serious problem for the repository. Even so, as long as such formats remain outside the scope of the general format identification tools the managers will need to use their own assessments and judgement to assure the longer-term viability and accessibility of these files.</p>
<h3>Arts repository (University of the Arts London)</h3>
<p style="text-align: left">What&#8217;s the largest file type in an arts repository? Perhaps unsurprisingly it&#8217;s an image format, in this case led by JPEGs of different versions. As can be seen in Figure 3, the number of unknowns, highlighted among the High Risk Objects, is the fourth largest single category in this profile and so requires further investigation.</p>
<div id="attachment_2131" class="wp-caption aligncenter" style="width: 563px"><a href="http://blog.soton.ac.uk/keepit/files/2010/10/UAL-Formats-and-Risks-Screenshot-12.JPG"><img class="size-large wp-image-2131  " src="http://blog.soton.ac.uk/keepit/files/2010/10/UAL-Formats-and-Risks-Screenshot-12-1024x819.jpg" alt="UAL Formats and Risks Screenshot 1" width="553" height="442" /></a><p class="wp-caption-text">Figure 3. Format/Risks screen (top level) for UAL Research Online repository. This screenshot of the profile was generated by the repository staff from the live repository using the installed tools. Date 13 September 2010. (Click on image for larger version with more legible format labels)</p></div>
<p>Once again there is a long tail (Figure 4).</p>
<div id="attachment_2136" class="wp-caption aligncenter" style="width: 563px"><a href="http://blog.soton.ac.uk/keepit/files/2010/10/UAL-Formats-and-Risks-Screenshot-2.JPG"><img class="size-large wp-image-2136 " src="http://blog.soton.ac.uk/keepit/files/2010/10/UAL-Formats-and-Risks-Screenshot-2-1024x819.jpg" alt="UAL Formats and Risks Screenshot 2" width="553" height="442" /></a><p class="wp-caption-text">Figure 4. Format/Risks screen (long tail) for UAL Research Online repository. This profile was generated by the repository staff from the live repository using the installed tools. Date 13 September 2010. (Click on image for larger version)</p></div>
<p>First indications in Figure 5, which expands the high risk category, suggest that many of these will turn out to be known formats that have not been recognised by DROID. It may be possible to resolve and classify many of these by manual inspection, the last resort of the repository manager to ensure that files can be opened and used effectively.</p>
<div id="attachment_2142" class="wp-caption aligncenter" style="width: 563px"><a href="http://blog.soton.ac.uk/keepit/files/2010/10/UAL-Formats-and-Risks-Screenshot-3.JPG"><img class="size-large wp-image-2142" src="http://blog.soton.ac.uk/keepit/files/2010/10/UAL-Formats-and-Risks-Screenshot-3-1024x819.jpg" alt="UAL Formats and Risks Screenshot 3" width="553" height="442" /></a><p class="wp-caption-text">Figure 5. High risk objects (top level examples) in UAL Research Online repository. This profile was generated by the repository staff from the live repository using the installed tools. Date 13 September 2010.</p></div>
<h3>Teaching repository (EdShare, University of Southampton)</h3>
<p>EdShare repository manager Debra Morris has already <a title="Repository file type analysis for educational repositories, Diary, October 25, 2010" href="http://blog.soton.ac.uk/keepit/2010/10/25/repository-file-type-analysis/" target="_self">reflected</a> on this profile. The first notable feature of the profile (Figure 6) is that the largest format is, again, an image format. Evidently, like an arts repository (perhaps predictably) and like a science data repository (less predictably), it seems the emphasis of a repository of teaching resources may be visual rather than textual.</p>
<div id="attachment_2166" class="wp-caption aligncenter" style="width: 540px"><a href="http://blog.soton.ac.uk/keepit/files/2010/10/edshare-formats-largest.png"><img class="size-full wp-image-2166" src="http://blog.soton.ac.uk/keepit/files/2010/10/edshare-formats-largest.png" alt="EdShare, largest formats (top half of long tail). Chart generated from spreadsheet" width="530" height="626" /></a><p class="wp-caption-text">Figure 6. EdShare: largest formats by file count (top half of long tail). Chart generated from spreadsheet of results (profile date 14 Oct. 2010)</p></div>
<p>Another feature of the profile is the classification of LaTEX (Master File), the second largest format in this profile. Until now this format was unknown to DROID, but a new signature was created and added to our project version of DROID, in the same way as described for CIF/CML (and was submitted to TNA for inclusion in the official format registry). The effect of this was to reduce the number of unknowns from nearly 2500 to about 550, and thus instantly to both clarify and reduce the scale of the challenge.</p>
<p>As usual with the long tail, preservation planning decisions have to be made about the impact and viability of even infrequent formats. For reference, Table 1 shows the formats not included among the largest formats by file count in Figure 6.</p>
<p><strong>Table 1. Long tail of formats by file count in EdShare</strong></p>
<table border="0" cellspacing="0" cellpadding="0" width="392">
<col width="339"></col>
<col width="53"></col>
<tbody>
<tr>
<td width="339" height="14">Plain Text File</td>
<td width="53" align="right">20</td>
</tr>
<tr>
<td height="14">Rich Text Format (Version 1.7)</td>
<td align="right">17</td>
</tr>
<tr>
<td height="14">Windows Bitmap (Version 3.0)</td>
<td align="right">15</td>
</tr>
<tr>
<td height="14">Acrobat PDF 1.6 &#8211; Portable Document Format (Version 1.6)</td>
<td align="right">15</td>
</tr>
<tr>
<td height="14">Waveform Audio</td>
<td align="right">12</td>
</tr>
<tr>
<td height="14">Document Type Definition</td>
<td align="right">11</td>
</tr>
<tr>
<td height="14">Macromedia FLV (Version 1)</td>
<td align="right">11</td>
</tr>
<tr>
<td height="14">MPEG-1 Video Format</td>
<td align="right">9</td>
</tr>
<tr>
<td height="14">Icon file format</td>
<td align="right">9</td>
</tr>
<tr>
<td height="14">LaTEX (Sub File)</td>
<td align="right">9</td>
</tr>
<tr>
<td height="14">XML Schema Definition</td>
<td align="right">8</td>
</tr>
<tr>
<td height="14">Macromedia Flash (Version 7)</td>
<td align="right">8</td>
</tr>
<tr>
<td height="14">Windows Media Video</td>
<td align="right">6</td>
</tr>
<tr>
<td height="14">Extensible Hypertext Markup Language (Version 1.1)</td>
<td align="right">5</td>
</tr>
<tr>
<td height="14">Rich Text Format (Version 1.5)</td>
<td align="right">3</td>
</tr>
<tr>
<td height="14">TeX Binary File</td>
<td align="right">3</td>
</tr>
<tr>
<td height="14">Acrobat PDF 1.1 &#8211; Portable Document Format (Version 1.1)</td>
<td align="right">3</td>
</tr>
<tr>
<td height="14">Java Compiled Object Code</td>
<td align="right">2</td>
</tr>
<tr>
<td height="14">Encapsulated PostScript File Format (Version 3.0)</td>
<td align="right">2</td>
</tr>
<tr>
<td height="14">Exchangeable Image File Format (Compressed) (Version 2.2)</td>
<td align="right">2</td>
</tr>
<tr>
<td height="14">Scalable Vector Graphics (Version 1.0)</td>
<td align="right">2</td>
</tr>
<tr>
<td height="14">Adobe Photoshop</td>
<td align="right">2</td>
</tr>
<tr>
<td height="14">Microsoft Web Archive</td>
<td align="right">2</td>
</tr>
<tr>
<td height="14">Audio/Video Interleaved Format</td>
<td align="right">2</td>
</tr>
<tr>
<td height="14">JTIP (JPEG Tiled Image Pyramid)</td>
<td align="right">1</td>
</tr>
<tr>
<td height="14">Comma Separated Values</td>
<td align="right">1</td>
</tr>
<tr>
<td height="14">OS/2 Bitmap (Version 1.0)</td>
<td align="right">1</td>
</tr>
<tr>
<td height="14">Acrobat PDF 1.7 &#8211; Portable Document Format (Version 1.7)</td>
<td align="right">1</td>
</tr>
<tr>
<td height="14">PHP Script Page</td>
<td align="right">1</td>
</tr>
<tr>
<td height="14">Microsoft Word for Windows Document (Version 6.0/95)</td>
<td align="right">1</td>
</tr>
<tr>
<td height="14">Quicktime</td>
<td align="right">1</td>
</tr>
<tr>
<td height="14">Hypertext Markup Language (Version 2.0)</td>
<td align="right">1</td>
</tr>
<tr>
<td height="14">Applixware Spreadsheet</td>
<td align="right">1</td>
</tr>
<tr>
<td height="14">PostScript (Version 3.0)</td>
<td align="right">1</td>
</tr>
</tbody>
</table>
<div id="attachment_2171" class="wp-caption aligncenter" style="width: 662px"><a href="http://blog.soton.ac.uk/keepit/files/2010/10/edshare-formats-unknown.png"><img class="size-full wp-image-2171" src="http://blog.soton.ac.uk/keepit/files/2010/10/edshare-formats-unknown.png" alt="EdShare unknown formats" width="652" height="348" /></a><p class="wp-caption-text">Figure 7. EdShare: unknown formats (profile date 14 Oct. 2010)</p></div>
<p>Similarly, the unknowns present a potent challenge. Figure 7 is a breakdown of what we think we can tell from file extensions. Unlike the case of unknowns in eCrystals, here we had not anticipated specialised formats that were unlikely to be found in a general format registry, so this list is a revelation. In many of these cases an error in the file may be preventing recognition of an otherwise familiar format. Here we can see extensions for Flash files, for various text-based formats such as HTML and CSS, which may be malformed, and possibly for some images. In such cases the relevant file should be identified and an attempt made to open it with a suitable application. In this way it may be possible to begin to assess the reasons for non-recognition, to confirm the likely format, and to take any action needed to repair or convert the file. Files with unfamiliar extensions (such as .m) will need particular attention.</p>
<p>Another feature of this repository&#8217;s format profile not illustrated here is the difference when the profile is based on file size rather than file count. In this case the largest format by file size remains JPEG 1.01, but the next largest file types are all MPEG video formats, which is not so evident from Figure 6 and Table 1. It is not hard to understand why this might be: video files tend to be larger than text files or other format types. At first sight the profile by file count might have the stronger influence on preservation plans, but this further evidence on file size can be used strategically as well.</p>
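<p>The difference between a profile by file count and a profile by file size can be shown with a small sketch. The function name, the format labels and the sizes below are invented for illustration and are not EdShare figures; the point is only that the same set of files can rank differently under the two measures.</p>

```python
from collections import Counter


def profiles(files):
    """Build two profiles from (format_name, size_in_bytes) pairs.

    Returns (by_count, by_size): the number of files per format and
    the total bytes stored per format.
    """
    by_count = Counter()
    by_size = Counter()
    for fmt, size in files:
        by_count[fmt] += 1
        by_size[fmt] += size
    return by_count, by_size


# Invented sample: many images, a few small text files, one large video.
sample = (
    [("JPEG", 2_000_000)] * 50
    + [("Plain text", 2_000)] * 3
    + [("MPEG video", 50_000_000)]
)
by_count, by_size = profiles(sample)
rank_by_count = [fmt for fmt, _ in by_count.most_common()]
rank_by_size = [fmt for fmt, _ in by_size.most_common()]
```

<p>In this invented sample the image format leads on both measures, but the single video file moves from last place by count to second place by size.</p>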
<h3>Research papers repository (University of Northampton)</h3>
<div id="attachment_2185" class="wp-caption aligncenter" style="width: 624px"><a href="http://blog.soton.ac.uk/keepit/files/2010/10/nectar2.png"><img class="size-large wp-image-2185 " src="http://blog.soton.ac.uk/keepit/files/2010/10/nectar2-1024x380.png" alt="NECTAR format profile from live repository screenshot (profile date 14 Oct. 2010)" width="614" height="228" /></a><p class="wp-caption-text">Figure 8. NECTAR format profile from live repository screenshot</p></div>
<p>Figure 8 is a classic, PDF-dominated profile of a repository of research papers. Although an apparently small repository, what this shows is a repository that has so far focussed on record-keeping rather than on collecting full digital objects. We have seen already how <a title="NECTAR and the Data Asset Framework – final thoughts, Diary, September 30, 2010 " href="http://blog.soton.ac.uk/keepit/2010/09/30/nectar-and-the-data-asset-framework-–-final-thoughts/" target="_self">investigations to expand the scope of this repository</a> have begun. There is nothing in this profile that would surprise the repository manager. It acts as confirmation, a snapshot of the repository, and can be viewed as a platform for deciding where the repository should head in future. The tools will help the repository manager to monitor growth and the implementation of future plans.</p>
<h3>Summary</h3>
<p>Digital and institutional repositories are changing, and the established research papers repository is now complemented by rapidly growing repositories targeting new types of digital content. For the first time we have been able to compare and contrast these different repository types using tools designed to assist digital preservation analysis by identifying file formats and producing profiles of the distribution of formats in each repository.</p>
<p>While past format profiles of repositories collecting open access research papers tended to produce uniform results differing in scale rather than range, the new profiles reveal potentially characteristic fingerprints for the emerging repository types. What this also reveals more clearly, by emphasising the differences, are the real preservation implications for these repositories based on these profiles, which could be masked when all profiles looked the same. Each exemplar profile gives the respective managers a new insight into their repositories and careful reading will lead them to an agenda for managing the repository content effectively and ensuring continued access, an agenda that will be the more clearly marked for recognising how the same process produced different results for other types of repository.</p>
<p>This agenda will initially be led by the need to investigate digital objects for which the format could not be identified by the general tools, the &#8216;unknowns&#8217;. We have seen that specialised science data repositories, and even less obviously specialised examples, can produce large numbers of unknowns. These are high-risk objects in any repository by virtue of their internal format being unknown, even though on inspection many may turn out to be easily identified and/or corrected.</p>
<p>For the known formats, especially the largest formats by file count, these profiles show where effort is worth expending on producing preservation plans that will automate the maintenance of these files. Based on these exemplars, all repositories with substantial content are likely to produce format profiles displaying a long tail.</p>
<p>An intriguing finding of this work is that the emerging repository types, rather than open access institutional repositories founded on research papers, are dominated by visual rather than textual formats.</p>
<p>All these exemplars either are, or plan to become, institutional in scope even though limited to a specified type of content. One original idea that motivated the KeepIt project was that truly institutional repositories are likely to come to collect and store digital outputs from all research and academic activities, such as those represented by these exemplars. Thus, combined, the exemplars might represent the institutional repository of the future. It&#8217;s worth bearing in mind how the combined format profiles might look, and the consequent implications for preservation, when contemplating the prospect.</p>
<p>We are grateful to all the exemplar repositories for allowing us to reproduce these profiles.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.soton.ac.uk/keepit/2010/11/18/exemplar-preservation-repositories-comparison-by-format-profile/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Costs, formats and iPad apps: past-future preservation lessons for a science repository</title>
		<link>http://blog.soton.ac.uk/keepit/2010/11/11/costs-formats-and-ipad-apps-past-future-preservation-lessons-for-a-science-repository/</link>
		<comments>http://blog.soton.ac.uk/keepit/2010/11/11/costs-formats-and-ipad-apps-past-future-preservation-lessons-for-a-science-repository/#comments</comments>
		<pubDate>Thu, 11 Nov 2010 22:47:50 +0000</pubDate>
		<dc:creator>Simon Coles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data repositories]]></category>
		<category><![CDATA[eCrystals]]></category>
		<category><![CDATA[exemplar profiles]]></category>
		<category><![CDATA[science repositories]]></category>

		<guid isPermaLink="false">http://blog.soton.ac.uk/keepit/?p=2343</guid>
		<description><![CDATA[As an institutionally-based digital repository, eCrystals is somewhat different – both as an exemplar in the KeepIt project and in the institutional repository landscape as a whole. It is operated by the National Crystallography Service (NCS), which is funded on a 5 year grant basis. This brings preservation implications and requirements that are rather different from [...]]]></description>
			<content:encoded><![CDATA[<p>As an institutionally-based digital repository, eCrystals is somewhat different – both as an exemplar in the KeepIt project and in the institutional repository landscape as a whole. It is operated by the National Crystallography Service (NCS), which is funded on a five-year grant basis. This brings preservation implications and requirements that are rather different from those faced by repositories set up by institutions as a component of their research infrastructure, because when grant funding ceases, so does support for the repository, and its future hangs in the balance.</p>
<p><img class="size-medium wp-image-2373 alignleft" style="border: 0px initial initial" src="http://blog.soton.ac.uk/keepit/files/2010/11/ncs-logo-medium4-300x29.png" alt="National Crystallography Service logo" width="300" height="29" /></p>
<p>It costs money to do preservation. This recognition and the periodically precarious funding position meant that much of our work on eCrystals as an exemplar was focused on <a title="Preserving crystallographic data in a digital repository: a costs based analysis, Diary, November 9, 2010" href="http://blog.soton.ac.uk/keepit/2010/11/09/preserving-crystallographic-data-in-a-digital-repository-a-costs-based-analysis/" target="_self">preservation costs</a>. There is plenty of (wildly contradictory) anecdotal talk and urban myth in the practising research community around how much it costs to preserve data. My perspective draws on personal experience and other reported work in the area. However, what is clear is that the community needs to know how much it costs to set up a repository and then what the financial implications are for migrating all the old data into it. Thinking about how much all this costs has been particularly instructive, and the main lesson ought to be blindingly obvious – setting up and maintaining a data repository is relatively cheap and easy (providing you are not the innovator or primary mover in the area). It’s populating it with all your old data that really costs.</p>
<p>eCrystals holds the results (in the form of multiple, small data files) of crystallographic experiments performed at the NCS, and is operated by the NCS as an independent mid-range facility funded to serve UK academics in the chemistry (and related subjects) sector. An important part of our interaction with the KeepIt project was the <a title="Adding chemistry to a file format registry, Diary, September 16, 2010" href="http://blog.soton.ac.uk/keepit/2010/09/16/adding-chemistry-to-a-file-format-registry/" target="_self">registration of file formats</a> so that digital preservation services can automatically recognise and understand our repository content. The authoritative PRONOM registry recognises several hundred file formats, but these are the popular ones and domain-specific formats such as our Crystallographic Information File (CIF) and Chemical Markup Language (CML) – which are ubiquitous in crystallography and chemistry, respectively – were not included. Work was done to create signature files for these two formats for the DROID format identification tool, which applies data from PRONOM. These signatures will be submitted for inclusion in the formal PRONOM registry.</p>
<p><a href="http://blog.soton.ac.uk/keepit/files/2010/11/ipad_periodic_table-s.jpg"><img class="alignright size-full wp-image-2381" src="http://blog.soton.ac.uk/keepit/files/2010/11/ipad_periodic_table-s.jpg" alt="ipad-periodic table" width="280" height="231" /></a>Working with KeepIt and other projects has given momentum to the preservation of crystallography data in the eCrystals repository and in related repositories. Looking ahead, we intend to maintain this momentum. Through the project we recently invested in an Apple iPad, and we are developing an app as a front-end to an electronic laboratory notebook / blog service. As we have <a title="Preserving crystallographic data in a digital repository: a costs based analysis, Diary, November 9, 2010" href="http://blog.soton.ac.uk/keepit/2010/11/09/preserving-crystallographic-data-in-a-digital-repository-a-costs-based-analysis/" target="_self">reported</a>, we recognised that the best possible moment to begin preservation is at the time the experiment is performed, as it is prohibitively expensive to recreate the data at a later stage. The idea for the app is that the contextual information that underpins publication and preservation is built up as the experiment progresses &#8211; not as is done now, where a bunch of files are uploaded some time after the event and some (arbitrary) metadata assigned.</p>
<p>This means capturing data in the laboratory – not easy (even in a conventional lab notebook) and we are spinning out a project to address this problem &#8211; the smart laboratory with pervasive data and metadata recording. A primary problem here is that drawing or &#8216;scribble&#8217; software is poor and chemists draw, they don&#8217;t generally type. Our app is being specified to resolve such issues by enabling the chemist to sketch reactions, note observations, make and test hypotheses &#8211; this is the valuable chemical metadata that gives our data meaning in the long term. Tablet PCs that have been tested in the past proved too cumbersome but iPad-type technology could be a winner in terms of portability and ease of interaction in solving these problems and making data capture in the lab instant and efficient. We are also investigating the use of portable devices (mobile phone as well as iPad) to record audio and video in the laboratory to act as anything from the primary observation record to contextual or supporting metadata.</p>
<div id="attachment_2384" class="wp-caption aligncenter" style="width: 490px"><a href="http://blog.soton.ac.uk/keepit/files/2010/11/chemistry-app1.jpg"><img class="size-full wp-image-2384" src="http://blog.soton.ac.uk/keepit/files/2010/11/chemistry-app1.jpg" alt="chemistry app" width="480" height="320" /></a><p class="wp-caption-text">iPhone app running a chemical reaction. Mobile smartphones might be used to capture all sorts of contextual data, as they become devices that many people carry and therefore add few extra requirements in terms of data capture technology</p></div>
<p style="margin-left: 0cm">In summary, the most striking lessons learned for the NCS by working with the KeepIt project are:</p>
<ul>
<li>Preservation isn&#8217;t hard – you just need to think about it and then generate a preservation plan.</li>
<li>The hard part is following the preservation plan and getting those involved in the right mindset.</li>
<li>It is acceptable to ‘just do nothing’, but this must be the conclusion of thinking about preservation.</li>
<li>As long as storage is kept live (on spinning disks), unknown or unmanaged file formats are a major risk for information loss.</li>
<li>Subject domains or communities should therefore be encouraged to supply descriptions of their specific formats (e.g. DROID signatures) to make sure they don&#8217;t suffer from file format rot.</li>
<li>Repository software (like EPrints) is making preservation easier, by incorporating tools to help identify risks leading to information loss.</li>
<li>It&#8217;s relatively cheap to set up a repository that will, among other functions, preserve your data.</li>
<li>Retrospective preservation, migrating data and populating repositories are where the real costs lie.</li>
<li>The best possible moment to begin preservation is at the time the experiment is performed and data is generated.</li>
<li>New portable computing devices and apps will help capture data and embellish immediately with metadata in the lab.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.soton.ac.uk/keepit/2010/11/11/costs-formats-and-ipad-apps-past-future-preservation-lessons-for-a-science-repository/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Preserving crystallographic data in a digital repository: a costs based analysis</title>
		<link>http://blog.soton.ac.uk/keepit/2010/11/09/preserving-crystallographic-data-in-a-digital-repository-a-costs-based-analysis/</link>
		<comments>http://blog.soton.ac.uk/keepit/2010/11/09/preserving-crystallographic-data-in-a-digital-repository-a-costs-based-analysis/#comments</comments>
		<pubDate>Tue, 09 Nov 2010 15:50:48 +0000</pubDate>
		<dc:creator>Simon Coles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data repositories]]></category>
		<category><![CDATA[eCrystals]]></category>
		<category><![CDATA[exemplar profiles]]></category>
		<category><![CDATA[KRDS]]></category>
		<category><![CDATA[preservation costs]]></category>
		<category><![CDATA[science repositories]]></category>

		<guid isPermaLink="false">http://blog.soton.ac.uk/keepit/?p=2338</guid>
		<description><![CDATA[eCrystals has been presented within KeepIt as an exemplar of a scientific data repository, but more particularly it exemplifies the ‘one man band’ scientific repository. No research scientist will take on the responsibility of setting up and administering a scientific data repository without an indication of the financial implications and savings involved. Until now these [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.soton.ac.uk/keepit/files/2009/06/logo_ec.jpg"><img class="alignright size-full wp-image-81" src="http://blog.soton.ac.uk/keepit/files/2009/06/logo_ec.jpg" alt="eCrystals logo" width="142" height="55" /></a>eCrystals has been presented within KeepIt as an exemplar of a scientific data repository, but more particularly it exemplifies the ‘one man band’ scientific repository.</p>
<p>No research scientist will take on the responsibility of setting up and administering a scientific data repository without an indication of the financial implications and savings involved. Until now these costs have largely been anecdotal. Here we provide documentary evidence based on our experience gained in developing and deploying the eCrystals repository at the UK National Crystallography Service (NCS) and the <a href="http://wiki.ecrystals.chem.soton.ac.uk/index.php/Main_Page">eCrystals Federation Project</a>, which engaged the broader crystallographic community.</p>
<blockquote><p><strong>&#8220;The major cost components do not lie in implementing and maintaining a scientific data repository. Populating the repository is the expensive aspect of operating such a resource&#8221;</strong></p></blockquote>
<p>An ‘at a glance’ summary of this information is of vital importance for the practising researcher evaluating whether they wish to implement a data management and digital preservation solution. These data are not new &#8211; a more substantive version originally appeared in the <a href="http://www.jisc.ac.uk/media/documents/publications/reports/2010/keepingresearchdatasafe2.pdf">Keeping Research Data Safe 2 project</a> report &#8211; but here we present a summary couched in a manner more familiar to the practising scientist and covering the costs of a) the storage of data over the lifetime of a facility and b) migrating data from one era, technology or regime to another.</p>
<p>The actual costs involved (full economic costs based on a 250 working day year) with the set up and persistent operation of established software for an average size crystallography laboratory (i.e. 1 man, 1 instrument) can be estimated and outlined as:</p>
<ol>
<li>Installation of the repository software by an untrained operative (i.e. a research scientist) will take 2.5 days &#8211; £800.</li>
<li>Training (sys admin) takes a day &#8211; £340.</li>
<li>Performing sys admin – the cost of supporting the repository component of the NCS electronic infrastructure is based on the fact that this aspect takes about 10% of the time of the 0.1FTE systems administrator &#8211; £800 per year.</li>
<li>Training (deposit) takes one day &#8211; £340.</li>
<li>Deposit, management, appraisal and publication takes 5% of the time of a single researcher &#8211; £4000 per year.</li>
<li>DOI registration depends, to an extent, on the number of records but averages £250 per annum.</li>
</ol>
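<p>As a rough illustration, the figures in the list above can be totalled as follows. This is only a sketch: the split into one-off and recurring items is our reading of the list, and the number of years passed to <code>total_cost</code> is a hypothetical input, not a figure from the KRDS2 report.</p>

```python
# One-off set-up costs in GBP, taken from items 1, 2 and 4 of the list above.
SETUP_COSTS_GBP = {
    "installation": 800,
    "sys admin training": 340,
    "deposit training": 340,
}

# Recurring costs in GBP per year, taken from items 3, 5 and 6 of the list above.
ANNUAL_COSTS_GBP = {
    "sys admin": 800,
    "deposit, management, appraisal and publication": 4000,
    "DOI registration": 250,
}


def total_cost(years: int) -> int:
    """Return set-up plus running costs in GBP over a number of years.

    Assumes the annual figures stay flat, which is a simplification.
    """
    setup = sum(SETUP_COSTS_GBP.values())
    annual = sum(ANNUAL_COSTS_GBP.values())
    return setup + years * annual


if __name__ == "__main__":
    print("One-off set-up cost:", sum(SETUP_COSTS_GBP.values()))  # 1480
    print("Recurring cost per year:", sum(ANNUAL_COSTS_GBP.values()))  # 5050
    print("Total over 5 years:", total_cost(5))
```

<p>On these figures the recurring cost of depositing and managing data is by far the largest item, well ahead of anything involved in installing or administering the software.</p>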
<p>You can’t just set up a repository and then expect that the data will get into it for free. All researchers carry a lot of historical data from throughout their careers (crystallographers perhaps more so than others!) and therefore it is crucial to get an idea of what it will cost to migrate your data to the repository.</p>
<p>In coming to the following conclusions, I am drawing on data and facility costs over a period of NCS operation from 1970 to 2009. During this timeframe experimental instrumentation, computational capability and data storage media have all changed radically. There has also been a change in the raw and results data that we collect &#8211; these days raw data can be a couple of hundred binary image files (as opposed to the lists of observed reflections from serial counter days) in the gigabyte size range, whilst results data can be as little as a single text-based CIF file of the order of a few kilobytes. When considering these elements of change, one can group transitions between technologies &#8211; e.g. the introduction of personal computers, a new generation of instrumentation or the advent of online storage &#8211; into three broad periods (1970-1990, 1990-2000 and 2000-present). For crystallographers, these eras relate loosely to the serial detector age (1970-1990; data stored on paper), the early days of area detectors (1990-2000; data stored on magnetic disks) and the modern age where large volumes of data are being generated (2000-; data stored on CDs/DVDs and, more recently, online).</p>
<blockquote><p><strong>&#8220;The best possible job of preservation must be done at the time an experiment is performed, as it is prohibitively expensive to recreate the data at a later stage&#8221;</strong></p></blockquote>
<p>One vital point often overlooked is that we can store and migrate data over many decades, but we can&#8217;t do this with the samples that we measure! The cost of a crystal structure with current equipment is £328; the cost to regenerate a structure from the past can be anything up to <strong>60</strong> times this amount (ca £20k) if the raw data (and appropriate correction files) are not available or the sample has to be re-made (a cost that includes all the expertise and laboratory infrastructure from an entire research project). It is not simply a matter of doing the experiment (or analysis) again if you don’t have the sample.</p>
<p>The reason a sample might need to be resynthesised is that it has not been possible in the past to efficiently store and preserve raw data (some data has been kept on paper and magnetic media, but migrating it to online media is prohibitively expensive and prone to corruption). In more recent times raw data could be preserved, in which case the cost of recreating the data is that involved with the (re)interpretation of the raw data and is therefore <em>considerably</em> cheaper. The most obvious points regarding the costs of data storage are that:</p>
<ul>
<li>The cost of storing data has dramatically reduced.</li>
<li>The cost of migrating data from recent eras, when computing has been more prevalent, is significantly lower.</li>
<li>The cost of storing raw data is around 70% of the total data storage cost.</li>
</ul>
<p>For those scientists who still want to do something with all the structures in their filing cabinets or data on floppy disks, the following cautionary points regarding data migration should be noted (again, the full costs can be found in the <a href="http://www.jisc.ac.uk/media/documents/publications/reports/2010/keepingresearchdatasafe2.pdf">KRDS2</a> report):</p>
<ul>
<li>Long term storage of data in its native format becomes less worthwhile with time (this is because these formats cannot be migrated, e.g. an instrument manufacturer&#8217;s proprietary binary format cannot be read by newer generations of integration software). The most cost-effective approach would be to transform the raw data into a common interchangeable format and subsequently migrate it.</li>
<li>It is considerably more costly to migrate results data than preserve them &#8211; this is due to the variety of formats and the storage media used over the years.</li>
<li>There is considerable fluctuation in the relative cost of migration against storage in different eras and it does not necessarily follow that modern approaches make it cheaper to migrate as opposed to store with respect to other eras.</li>
</ul>
<p>Migrating raw and results data highlighted some important points regarding data loss:</p>
<ul>
<li>During migration of raw data from CDs/DVDs to online storage there was a 7% loss of data.</li>
<li>Migration of results from floppy disks resulted in a 5% data loss.</li>
<li>Less than 1% of results were lost in the migration from paper, although the cost of performing the migration itself was extremely variable due to the differing quality of records.</li>
</ul>
<p>In summary, it is clear that the major cost components do not lie in implementing and maintaining a scientific data repository. Populating the repository is the expensive aspect of operating such a resource – this involves changing working practice and importing legacy data. It is also clear that the best possible job of preservation must be done at the time an experiment is performed, as it is prohibitively expensive to recreate the data at a later stage when the sample has ceased to exist.</p>
<p>Storage is a relatively cheap and well-understood process these days and there is now no reason why all raw and derived data cannot be kept long into the future. The migration of legacy data is a time-consuming and costly process and it is almost certainly not worth trying to migrate historic raw data. Great consideration should be given on a case-by-case basis as to whether it is worth migrating results data. One must bear in mind, however, that the migration of results data is a one-off cost going forward if this process is performed correctly and the data stored in a form that makes it easy to migrate in the future.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.soton.ac.uk/keepit/2010/11/09/preserving-crystallographic-data-in-a-digital-repository-a-costs-based-analysis/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Adding chemistry to a file format registry</title>
		<link>http://blog.soton.ac.uk/keepit/2010/09/16/adding-chemistry-to-a-file-format-registry/</link>
		<comments>http://blog.soton.ac.uk/keepit/2010/09/16/adding-chemistry-to-a-file-format-registry/#comments</comments>
		<pubDate>Thu, 16 Sep 2010 16:06:25 +0000</pubDate>
		<dc:creator>Steve Hitchcock</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[CIF]]></category>
		<category><![CDATA[CML]]></category>
		<category><![CDATA[data repositories]]></category>
		<category><![CDATA[DROID]]></category>
		<category><![CDATA[eCrystals]]></category>
		<category><![CDATA[file formats]]></category>
		<category><![CDATA[PRONOM]]></category>
		<category><![CDATA[science repositories]]></category>

		<guid isPermaLink="false">http://blog.soton.ac.uk/keepit/?p=1526</guid>
		<description><![CDATA[Everybody knows DROID. Well, everybody working in digital preservation. And for those being introduced to digital preservation, it&#8217;s likely they will be shown a tool or two, because tools help us do practical preservation. And among those tools the one most likely to be shown will be DROID (for example, here 11.45 am). This is [...]]]></description>
			<content:encoded><![CDATA[
<p style="text-align: center"><a href="http://blog.soton.ac.uk/keepit/files/2010/09/chem_droid_proof.png"><img class="aligncenter size-full wp-image-1537" src="http://blog.soton.ac.uk/keepit/files/2010/09/chem_droid_proof.png" alt="Adding chemistry formats to a DROID profile" width="547" height="246" /></a></p>
<p>Everybody knows DROID. Well, everybody working in digital preservation. And for those being introduced to digital preservation, it&#8217;s likely they will be shown a tool or two, because tools help us do practical preservation. And among those tools the one most likely to be shown will be DROID (for example, <a title="Digital Preservation Roadshows 2009-2010" href="http://www.dpconline.org/training/roadshows-0910" target="_self">here</a> 11.45 am). This is because DROID is from the National Archives, is open source and does something that is fairly fundamental and basic to digital preservation, that is, identify file formats. <a title="Download DROID at SourceForge" href="http://sourceforge.net/projects/droid/">DROID</a> (Digital Record Object Identification) is an automatic file format identification tool.</p>
<p>We&#8217;ve been using DROID for years, in <a title="EPrints Digital Preservation" href="http://preservation.eprints.org/" target="_self">KeepIt and before that Preserv</a>, to produce <a title="KeepIt course 3: Primer on preservation workflow, formats and characterisation, Diary, March 31, 2010" href="http://blog.soton.ac.uk/keepit/2010/03/31/digital-preservation-tools-for-repository-managers-primer-on-preservation-workflow-formats-and-characterisation/" target="_self">repository format profiles</a>. It tells us what we have to preserve, and we can use that information to begin to judge risk, build a preservation plan and, where necessary, take some evasive action by converting a digital item into a format we may believe to be less risky.</p>
<p>One thing we&#8217;ve never done with DROID is add a new format. This part is <a title="PRONOM: add an entry" href="http://www.nationalarchives.gov.uk/PRONOM/submitinfo.htm" target="_self">carefully curated</a> by the National Archives. (Note, in case you are wondering what PRONOM, the subject of the link, is: it is a registry of file formats that informs DROID, which in turn is the software used to scan your content.) PRONOM is not a wiki or other social media so you can&#8217;t just add stuff without moderation (not yet anyhow, although that may <a title="Tarrant, D., Hitchcock, S. and Carr, L. (2009) Where the Semantic Web and Web 2.0 meet format risk management: P2 registry. In: iPres2009: The Sixth International Conference on Preservation of Digital Objects, October 5th and 6th, 2009., San Francisco" href="http://eprints.ecs.soton.ac.uk/17556/" target="_self">change</a>).</p>
<p>PRONOM the registry currently contains information on over 700 file formats, but there are many thousands more formats in existence. In other words, PRONOM covers most major, popular formats and then some, but not all formats you can think of. When it comes to more specialised data, it&#8217;s likely your format is not represented.</p>
<p>In this case our exemplar KeepIt repository <a title="Welcome to eCrystals - University of Southampton" href="http://ecrystals.chem.soton.ac.uk/" target="_self">eCrystals</a> stores data from crystallography experiments in the laboratory. The formats it uses to describe these data are not likely to be available in DROID, so a format profile of this repository will reveal a large number of unknown files and will not be of much use.</p>
<p>What&#8217;s required by PRONOM-DROID for format identification is a signature for the file format you want to identify, that is, something distinctive that will reliably differentiate it from any other format type. We imagined this would require quite a detailed knowledge of the formats. We were concerned, since we are not the originators or sponsors of the formats in question, that we might not have the requisite knowledge, also that we might require the cooperation of such people in some way. We needn&#8217;t have worried about either. Creating a file format signature for DROID is simpler than we had anticipated.</p>
<p>What follows is a Twitter record (<a title="JISC KeepIt project Twitter profile" href="https://twitter.com/jisckeepit" target="_self">@jisckeepit</a>, from about 9:45 AM Sep 16th) of what happened as we set about creating format signatures for our crystallography and chemical files. We were able to do this at a small test level using a version of DROID we run locally.</p>
<ul>
<li>We just worked out how to write a signature file for a format in PRONOM-DROID. It&#8217;s ID, not validation. You bet!</li>
<li>Need to check uniqueness of signature for CIF file format. &#8220;You boy at the back, DROID, do you recognise this file?&#8221; &#8220;No sir&#8221;. Good start</li>
<li>For the others at the back of the class, our CIF is a Crystallographic Information File, produced from experiments in the lab</li>
<li>The boy DROID is a fast learner. He now knows CIF and has passed the initial class test, but how will he cope with the big school exam?</li>
<li>We have boy DROID&#8217;s exam results, and are tearing open the envelope now. FAIL. Retake</li>
<li>Great, boy DROID has now passed his CIF exam. It&#8217;s a high grade, no absolute mark, but probably as good as earlier passes <img src='http://blog.soton.ac.uk/keepit/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </li>
<li>Since boy DROID is such a good student we&#8217;re now teaching him CML, Chemical Markup Language, based on XML, which he already knows</li>
<li>See, it&#8217;s easy when you know how. Boy DROID passed the CML exam first time</li>
<li>Next we will commend Boy DROID to his parents (at iPres next week) and suggest they enter the new CIF, CML format sigs in the DROID academy</li>
<li>Also next, we will run boy DROID on eCrystals-a big repository!-to produce its first format profile. That may take some days</li>
<li>Big thanks to boy DROID&#8217;s class teachers, from Chemistry Philip Adler and KeepIt&#8217;s Dave Tarrant</li>
</ul>
<p>The result of all this is the profile shown in the figure at the top of this blog post. Essentially we deposited a few example CIF and CML files in our test repository, inspected the source files in detail to write the signature, and used our EPrints preservation apps, including our test version of DROID, to produce this simple profile of our test files. What we haven&#8217;t shown here are all the unsuccessful test profiles, which would show as an alarming red bar labelled &#8216;unknown files&#8217; when DROID could not recognise a CIF, instead of blue bars with the correctly identified file names.</p>
<p>If you are interested you can find out more about the <a title="International Union of Crystallography, Crystallographic Information Framework" href="http://www.iucr.org/resources/cif" target="_self">CIF</a> and <a title="cml.sourceforge.net - OpenSource Site for CML" href="http://cml.sourceforge.net/" target="_self">CML</a> formats.</p>
<p>A helpful lesson and a step towards another KeepIt exemplar repository.</p>
<p><strong>Addendum</strong> (22 September 2010)</p>
<p>Philip Adler, a specialist in the chemistry file types being looked at here, provides some additional insights into writing a new signature profile to be added to the DROID format identification tool.</p>
<p>&#8220;Creating a new profile for DROID to search a database was a novel problem; and one for which the documentation, whilst plentiful, failed to characterise the behaviours of the XML form in which one specifies what signature formats DROID is looking for. As such, for someone with little to no computing experience, it is my opinion that it would be very challenging and time consuming, on a first run through, to establish a new signature file. Indeed, for an experienced computer scientist and a specialist in the file type being looked at, it took a considerable length of time.</p>
<p>&#8220;That said, once the format of the signatures, and the layout of the signature files has been deduced, installing a new signature was relatively straightforward the second time around.</p>
<p>&#8220;There is one &#8216;bug&#8217;. Whilst attempting to define a term which would be at a non-fixed distance from the beginning or end of a file, we established that the DROID framework does not permit this. For instance, we managed to &#8216;break&#8217; the HTML definition within DROID by submitting a perfectly valid (although unorthodox) HTML file with an absurdly long comment at the top. Whilst in HTML this is odd, in a .CIF file it is not, and so could serve to break the file in the future. The back-door for this is to set the maximum distance from the beginning or end of the file to an absurdly great length. This solution, however, is far from obvious, and is located as a standard method within the documentation, which implies that DROID can cope with arbitrary distances from BOF (beginning of file) or EOF (end of file).&#8221;</p>
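<p>The offset problem Philip describes can be illustrated with a short sketch. This is not DROID code and does not use the DROID signature file syntax; it is a hypothetical Python stand-in, assuming for illustration that the distinctive token is the <code>data_</code> block header that opens a CIF data block, and the function name and offset values are our own.</p>

```python
def matches_near_bof(content: bytes, token: bytes, max_offset: int) -> bool:
    """Return True if token starts within max_offset bytes of the beginning of file.

    Hypothetical illustration of a signature anchored to BOF with a bounded
    offset; DROID itself is configured through its XML signature file.
    """
    # Only the first max_offset + len(token) bytes can hold a qualifying match.
    window = content[: max_offset + len(token)]
    return token in window


# A CIF-like file whose data block is preceded by a very long comment line.
long_comment = b"#" + b" comment" * 1000 + b"\n"
cif_like = long_comment + b"data_example\n_cell_length_a 10.0\n"

# A modest maximum offset misses the token because the comment pushes it too far down.
print(matches_near_bof(cif_like, b"data_", 1024))

# Setting the maximum offset to a very large value, the workaround described above, finds it.
print(matches_near_bof(cif_like, b"data_", 65536))
```

<p>The same file is missed or identified depending only on the maximum offset chosen, which is why a long leading comment can defeat a signature that otherwise looks sound.</p>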
]]></content:encoded>
			<wfw:commentRss>http://blog.soton.ac.uk/keepit/2010/09/16/adding-chemistry-to-a-file-format-registry/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>eCrystals &#8211; Repository preservation objectives</title>
		<link>http://blog.soton.ac.uk/keepit/2009/10/23/ecrystals-repository-preservation-objectives/</link>
		<comments>http://blog.soton.ac.uk/keepit/2009/10/23/ecrystals-repository-preservation-objectives/#comments</comments>
		<pubDate>Fri, 23 Oct 2009 09:28:10 +0000</pubDate>
		<dc:creator>Simon Coles</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data repositories]]></category>
		<category><![CDATA[eCrystals]]></category>
		<category><![CDATA[exemplar objectives]]></category>
		<category><![CDATA[exemplar profiles]]></category>
		<category><![CDATA[science repositories]]></category>

		<guid isPermaLink="false">http://blog.soton.ac.uk/keepit/?p=547</guid>
		<description><![CDATA[A little bit about eCrystals: eCrystals Southampton (http://ecrystals.chem.soton.ac.uk/) is a data repository based on the EPrints platform, but heavily reconfigured to manage data files from chemical crystallography structure determination experiments. The repository evolved out of 2 rounds of JISC funding of the eBank-UK project (http://www.ukoln.ac.uk/projects/ebank-uk/). The repository has a schema that represents the data files [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignnone size-medium wp-image-546" src="http://blog.soton.ac.uk/keepit/files/2009/10/logo3.0l-300x115.jpg" alt="logo3.0l" width="300" height="115" /></p>
<p><strong>A little bit about eCrystals:</strong><br />
eCrystals Southampton (<a href="http://ecrystals.chem.soton.ac.uk/">http://ecrystals.chem.soton.ac.uk/</a>) is a data repository based on the EPrints platform, but heavily reconfigured to manage data files from chemical crystallography structure determination experiments. The repository evolved out of 2 rounds of JISC funding of the eBank-UK project (<a href="http://www.ukoln.ac.uk/projects/ebank-uk/">http://www.ukoln.ac.uk/projects/ebank-uk/</a>). The repository has a schema that represents the data files generated during an experiment (a crude representation of the workflow) so that a record can be managed and presented in an understandable fashion. The next two phases of eCrystals evolution were concerned with developing a &#8216;Federation&#8217; of such repositories and analysing the requirements, problems &amp; pitfalls of a distributed network of data repositories (<a href="http://wiki.ecrystals.chem.soton.ac.uk">http://wiki.ecrystals.chem.soton.ac.uk</a>). One workpackage of this project was concerned with preservation issues and looked into preservation planning, metadata and representation information concerning the network and individual repositories (<a href="http://wiki.ecrystals.chem.soton.ac.uk/index.php/Work_Package_4:_Repositories%2C_Preservation_and_Sustainability#Deliverables">WP 4</a>).</p>
<p>A number of matters arose from this project and in general they were not technical, but mostly stemmed from the fact that these repositories are not based on the Institutional model, but are operated, maintained, administered and populated by individuals or research groups&#8230;often a single person takes on all these roles, in addition to their &#8216;day job&#8217; of doing the scientific research! It is our belief that the future repository landscape will be very heterogeneous indeed and there will be a large number of archives &amp; resources that do not operate under the Institutional model &#8211; the preservation requirements of these will be very different from those currently recognised by the community.</p>
<p>eCrystals preservation objectives therefore generally go beyond the technical and are concerned with understanding the issues around everyday researchers performing preservation activities:</p>
<p><strong>Objective 1</strong></p>
<p><strong><em>to explore the preservation training and actions required for small groups and non-archivists</em>. </strong>If all crystallographically active institutions (virtually every Chemistry Department has a crystallography centre) signed up to the concept of eCrystals, 95% of the repositories would be user-administered within research groups run by 1, 2 or 3 people. These people are researchers trained in their art and would have very little concept of any preservation issues whatsoever.</p>
<p><strong>Objective 2</strong></p>
<p><em><strong>to investigate how performing preservation actions can be made easy!</strong></em> This means learning the minimum requirements for the maximum return (the 80% rule): what can be automated, and what technologies can be implemented, both unseen within the repository software and as ancillary tools.</p>
<p><strong>Objective 3</strong></p>
<p><em><strong>to develop an exemplar non-onerous preservation regime for the researcher administering a repository.</strong></em> Following on from objectives 1 &amp; 2, a schedule of preservation actions can be derived and an objective would be to understand how this can be &#8216;coded&#8217; into the repository software (e.g. an EPrints plug-in) so that the repository can automatically perform these actions or alert the researcher at the appropriate time as to which actions need executing (e.g. accession, appraisal, deletion, etc).</p>
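<p>As a rough illustration of what &#8216;coding&#8217; such a schedule might look like, here is a minimal Python sketch. It is not an EPrints plug-in and does not use any EPrints API: the <code>Record</code> class, the <code>SCHEDULE</code> table and the intervals in it are all hypothetical, chosen only to show the idea of alerting a researcher when an action falls due. The action names are the examples given above.</p>

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical schedule: action name -> days after deposit when it falls due.
# The intervals are invented for illustration only.
SCHEDULE = {
    "accession": 0,
    "appraisal": 365,
    "deletion": 3650,
}


@dataclass
class Record:
    record_id: str
    deposited: date
    done: frozenset = frozenset()  # actions already carried out


def due_actions(record, today):
    """Return scheduled actions that are due and not yet done, in schedule order."""
    due = []
    for action, days in SCHEDULE.items():
        if action in record.done:
            continue
        if record.deposited + timedelta(days=days) <= today:
            due.append(action)
    return due


# A record deposited at the start of 2009 that has already been accessioned.
record = Record("example-1", date(2009, 1, 1), done=frozenset({"accession"}))
print(due_actions(record, date(2010, 6, 1)))  # prints ['appraisal']
```

<p>A repository could run a check like this nightly and email the list to whoever administers it, which keeps the regime non-onerous: the researcher only sees actions that actually need attention.</p>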
<p><strong>Objective 4</strong></p>
<p><em><strong>to develop example costings for researchers and administrators.</strong></em> Exemplar costs on a crystal structure record level would enable individuals to quickly and easily include the appropriate amount for preservation and administration of a repository in FEC grant applications or at departmental / research group level.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.soton.ac.uk/keepit/2009/10/23/ecrystals-repository-preservation-objectives/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>First project meeting: presentations</title>
		<link>http://blog.soton.ac.uk/keepit/2009/06/19/first-project-meeting-presentations/</link>
		<comments>http://blog.soton.ac.uk/keepit/2009/06/19/first-project-meeting-presentations/#comments</comments>
		<pubDate>Fri, 19 Jun 2009 11:56:56 +0000</pubDate>
		<dc:creator>Steve Hitchcock</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[eCrystals]]></category>

		<guid isPermaLink="false">http://blog.soton.ac.uk/keepit/?p=164</guid>
		<description><![CDATA[The first KeepIt project meeting for partners took place on 2 June and led with three presentations: to prompt discussion on why we should be wary of the big issues in digital preservation, on new tools to support repository preservation, and on the experience of one exemplar repository that has already investigated digital preservation. First, [...]]]></description>
			<content:encoded><![CDATA[<p>The first KeepIt project meeting for partners took place on 2 June and led with three presentations: to prompt discussion on why we should be wary of the big issues in digital preservation, on new tools to support repository preservation, and on the experience of one exemplar repository that has already investigated digital preservation.</p>
<p>First, project manager Steve Hitchcock on how to spot typical digital preservation propaganda.</p>
<p><object type="application/x-shockwave-flash" data="http://static.slideshare.net/swf/ssplayer2.swf?doc=id=1598183&amp;doc=hitchcock-firstpartnersmeetingv2-090617111943-phpapp01" width="425" height="348"><param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=id=1598183&amp;doc=hitchcock-firstpartnersmeetingv2-090617111943-phpapp01" ></object></p>
<p>Dave Tarrant, the project developer, reprised a presentation on the project from the recent <em><a href="https://or09.library.gatech.edu/eprints.php">Open Repositories OR09</a></em> international conference in Atlanta. He describes how the KeepIt project is providing training, development and deployment in three key areas of digital preservation for repositories: storage, risk analysis and preservation action.</p>
<p><object type="application/x-shockwave-flash" data="http://static.slideshare.net/swf/ssplayer2.swf?doc=id=1607722&amp;doc=keepitor09-090619041223-phpapp02" width="425" height="348"><param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=id=1607722&amp;doc=keepitor09-090619041223-phpapp02" ></object></p>
<p>Not all repositories are new to digital preservation. Chemist Simon Coles has been working on research data repositories for many years, and describes the experience and challenges of data repository preservation, including planning, metadata and other issues affecting dissemination.</p>
<p><object type="application/x-shockwave-flash" data="http://static.slideshare.net/swf/ssplayer2.swf?doc=id=1608282&amp;doc=coles-keepit-kickoff-090619063135-phpapp02" width="425" height="348"><param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=id=1608282&amp;doc=coles-keepit-kickoff-090619063135-phpapp02" ></object></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.soton.ac.uk/keepit/2009/06/19/first-project-meeting-presentations/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>KeepIt repositories initial survey: eCrystals</title>
		<link>http://blog.soton.ac.uk/keepit/2009/06/02/keepit-repositories-initial-survey-ecrystals/</link>
		<comments>http://blog.soton.ac.uk/keepit/2009/06/02/keepit-repositories-initial-survey-ecrystals/#comments</comments>
		<pubDate>Tue, 02 Jun 2009 10:10:28 +0000</pubDate>
		<dc:creator>Steve Hitchcock</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[eCrystals]]></category>
		<category><![CDATA[exemplar profiles]]></category>
		<category><![CDATA[exemplar surveys]]></category>
		<category><![CDATA[science repositories]]></category>

		<guid isPermaLink="false">http://blog.soton.ac.uk/keepit/?p=80</guid>
		<description><![CDATA[Our third repository exemplar is eCrystals, which manages scientific, specifically crystallography, data that might be referred to broadly as e-data or e-science. To recap the purpose of these initial surveys of the four exemplar repositories in the KeepIt project, we are seeking to characterise the repositories not in terms of their preservation activity but in terms of factors [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://ecrystals.chem.soton.ac.uk/"><img class="alignright size-medium wp-image-81" src="http://blog.soton.ac.uk/keepit/files/2009/06/logo_ec.jpg" alt="eCrystals logo" width="142" height="55" /></a></p>
<p>Our third repository exemplar is eCrystals, which manages scientific, specifically crystallography, data that might be referred to broadly as e-data or e-science.</p>
<p>To recap the purpose of these initial surveys of the four exemplar repositories in the KeepIt project, we are seeking to characterise the repositories not in terms of their preservation activity but in terms of factors that will influence possible preservation strategies.</p>
<p>Since this repository operates somewhat differently from the others, some brief background information is needed. The repository is operated by the National Crystallography Service (NCS) based at the University of Southampton, so is a national service. It manages two types of data: the &#8216;raw&#8217; data generated directly by crystal analysis, and the results data &#8216;derived&#8217; from the raw data.</p>
<p>NCS offers two types of experimental service to its users:</p>
<ol>
<li>Full determination, where NCS generates raw data and works up derived data into results. This is deposited in eCrystals (generally initially embargoed);</li>
<li>Data Collection only, where NCS collects the raw data and turns it into the first stage of derived data. This derived data is then sent to users and they work it up into results. None of the user-derived or results data is deposited in eCrystals.</li>
</ol>
<p>The future plan is to use eCrystals for service 2 as well: NCS deposits first-stage derived data, the user picks it up, turns it into a result and deposits the result into eCrystals.</p>
<p><a href="http://ecrystals.chem.soton.ac.uk/" target="_self">http://ecrystals.chem.soton.ac.uk/</a></p>
<p>ROAR: <a href="http://roar.eprints.org/index.php?url=http://ecrystals.chem.soton.ac.uk/" target="_self">http://roar.eprints.org/index.php?url=http://ecrystals.chem.soton.ac.uk/</a></p>
<p><em>Current status of repository</em></p>
<p>Was funded by JISC, eCrystals open-access archive project, to end March 2009.</p>
<p>Archive service provided as part of NCS at Southampton University, funded by EPSRC. This funding is subject to periodic review in the forthcoming research cycle.</p>
<p><em>Mission</em></p>
<p>eCrystals Southampton is the archive for Crystal Structures generated by the Southampton Chemical Crystallography Group and the EPSRC UK National Crystallography Service (NCS).</p>
<p><em>Management structure and decision-making, reporting tree</em></p>
<p>Management structure of the repository is headed by the director of the national service.</p>
<p><em>Staffing (no. FTE)</em></p>
<p>0.5 FTE systems administrator</p>
<p><em>Policy</em> (documentation, e.g. mandate, format policy, retention policy, take down policy?)</p>
<p>Embedded into <a href="http://www.ncs.chem.soton.ac.uk/pub_pol.htm" target="_self">NCS Publication Policy</a>: &#8220;We have created an archival method for crystal structure data which is designed to reside on Institutional Repositories.</p>
<p>&#8220;At the present time, we are operating two archives. One is a private resource, visible only within the Southampton firewall, which is used as a comprehensive laboratory management and data archival system, to which we now routinely upload all completed and validated crystal structure determination outputs. The other archive is an open access resource, visible <a title="eCrystals repository" href="http://www.ecrystals.chem.soton.ac.uk" target="_self">externally</a>, which we are now using as a direct route to dissemination of structural data. Each entry is assigned a Digital Object Identifier (DOI) so that the entry may be referenced in any future publication.&#8221;</p>
<p>For journal publications that report and link to crystal structure determinations presented in the repository, the policy recognises that it is important to satisfy publishers and the public that the repository will have the same stability and longevity as journal publications.</p>
<p>The &#8220;two&#8221; archives referred to here are concerned with just the derived and results data, not raw data. The difference between them today is effectively embargoed versus not embargoed. The &#8220;raw&#8221; data is stored at the Atlas Data Store at the Rutherford Appleton Laboratory, essentially a closed repository that is used as an internal store, but this data is available on request (by email / post / dropbox type solutions).</p>
<p><em>Planning the repository (formal planning approach?)</em></p>
<p>Repository founded on JISC project planning and design</p>
<p>Data architecture carefully matched to crystallography requirements</p>
<p>Investigated preservation needs and options:  A study of <a href="http://wiki.ecrystals.chem.soton.ac.uk/images/a/a7/ECrystalsCuration.pdf" target="_self">Curation and Preservation Issues in the eCrystals Data Repository and Proposed Federation</a></p>
<p>Further preservation reports due:</p>
<ol>
<li>Representation Information for Crystallography Data;</li>
<li>Preservation Planning for Crystallography Data;</li>
<li>Preservation Metadata for Crystallography Data</li>
</ol>
<p><em>Budget (contingency for preservation?)</em></p>
<p>Budget covers storage of raw data.</p>
<p>Results (derived) data not formally budgeted but this is to be reviewed.</p>
<p><em> Infrastructure (institutional, network, etc.)</em></p>
<p>eCrystals server managed by sys. admin.</p>
<p>Archive is backed up nightly. No offsite backups of server. The backup is within the chemistry department, to a building connected by corridor.</p>
<p>Personal curation culture &#8211; analysis of crystal structures performed on a series of Linux boxes</p>
<p><em>Tools, services and support (which v. EPrints?) </em></p>
<pre>version: eprints-3.0.3-rc-1</pre>
<p>Reconfigured repository software, core code modified,  bespoke standalone code and third-party Web services used.</p>
<p><em> Storage (current, strategy?)</em></p>
<p>Record of the raw data back to about 2002, including frame images, at the Atlas Data Store at the Rutherford Appleton Laboratory.</p>
<p>Testing storage of raw data (500 GB) from just the last couple of years on a Honeycomb server at the School of ECS, Southampton (Honeycomb hardware platform discontinued by Sun, support continues to 2013).</p>
<p>Data from 1998-2002 is on USB disks stored in our lab, migrated from CDs written at the time of generation.</p>
<p>Institutional solution preferred.</p>
<p><em>Content profile &#8211; volume, types, formats (content control?) </em></p>
<p>The information contained within each entry of this archive is all the fundamental and derived data resulting from a single crystal X-ray structure determination, but excluding the raw images.</p>
<pre>21/05/09</pre>
<pre>archive: 480, buffer: 26, inbox: 65, deletion: 7, eprint: 578</pre>
<pre>document: 565</pre>
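<p>The figures above read as record counts by status. A quick Python check, assuming that <code>eprint</code> is the overall total and that the other four values are the status areas it is split into (with <code>archive</code> being the live archive), shows that the numbers are consistent:</p>

```python
# Status counts copied from the 21/05/09 snapshot above.
snapshot = "archive: 480, buffer: 26, inbox: 65, deletion: 7, eprint: 578"

counts = {}
for part in snapshot.split(","):
    name, value = part.split(":")
    counts[name.strip()] = int(value)

# Assumption: 'eprint' is the total across the four status areas.
total = counts.pop("eprint")
assert sum(counts.values()) == total  # 480 + 26 + 65 + 7 == 578

live_share = counts["archive"] / total
print(f"{live_share:.0%} of records are in the live archive")  # prints 83% ...
```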
<p><a href="http://roar.eprints.org/index.php?action=profile&amp;url=http://ecrystals.chem.soton.ac.uk/" target="_self"> Preserv format profile</a> (large number of files &#8216;unknown&#8217; to profiling tool)</p>
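<p>Where a profiling tool leaves many files &#8216;unknown&#8217;, a crude first pass is simply to count files by extension. The Python sketch below does only that. It is not how the Preserv profile is produced and it performs no real format identification; the function name and the sample file names are hypothetical, chosen to resemble the kinds of file a crystallography record might hold.</p>

```python
from collections import Counter
from pathlib import PurePosixPath


def format_profile(filenames):
    """Count files by lower-cased extension; files without one count as 'unknown'."""
    profile = Counter()
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        profile[suffix or "unknown"] += 1
    return profile


# Hypothetical file names for illustration only.
files = ["a.cif", "a.hkl", "b.CIF", "notes", "b.res"]
profile = format_profile(files)
for ext, n in profile.most_common():
    print(ext, n)
```

<p>A tally like this will not say what a format is, but it does show which extensions dominate the &#8216;unknown&#8217; category and so where a fuller profile should start.</p>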
<p><em> Growth projections (scaling up?) </em></p>
<p>Plan to expand remit of repository to cover user-derived data (see above).</p>
<p>Scientific instrumentation has a typical lifespan of 10 years. As equipment is renewed there is likely to be an order of magnitude increase in data volumes.</p>
<p><em> Future plans for the repository (any major changes planned?) </em></p>
<p>Change storage model &#8211; cloud?</p>
<p>Target more learned society involvement.</p>
<p><strong> Summary </strong></p>
<ul>
<li>Part of national service provision</li>
<li>Detailed repository data architecture design developed over several project iterations</li>
<li>Highly customised (EPrints) repository software</li>
<li>Well informed and proactive on preservation</li>
<li>Funding uncertainties pending review</li>
</ul>
<p><strong> Proposed actions </strong></p>
<ul>
<li>Review storage provision</li>
<li>Examine infrastructure options and prospects, strengthen current arrangements</li>
<li>Assess scope for policy provision beyond publishing policy</li>
<li>Consider how to cultivate and embed personal curation practices endemic in this field of science</li>
<li>Produce more complete profile of deposit formats</li>
<li>Consult on upgrade to EPrints v3.2 when available, or assess how to integrate preservation-support tools from this version in the customised repository software</li>
</ul>
<p>Thanks to Simon Coles, Manjula Patel and Richard Stephenson for sharing and clarifying this information.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.soton.ac.uk/keepit/2009/06/02/keepit-repositories-initial-survey-ecrystals/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
