How format profiles can reveal potentially characteristic fingerprints for emerging types of repository.
Not all digital repositories are the same. Nor are all institutional repositories the same. In fact, the differences between the types of repositories emerging recently can be surprisingly large. In KeepIt we’ve been investigating how these differences might affect digital preservation practices. We now have results and we shall compare some of those here.
There are various ways in which these differences can be seen. One is in the tools that have been adopted by our exemplar repositories – covering arts and sciences, education and research – as they begin to explore the application of digital preservation. In the first instance these tools were selected from those covered in the KeepIt course. It was expected that most would choose to explore the use of one or two tools initially, depending on their priorities. We can see from their reports they have chosen different paths – from data scoping, to costs, and risk assessment – and each can be seen to be an appropriate and revealing choice when you examine the issues faced by the respective exemplars.
A more direct way to view and understand the differences between repositories with different content remits is to look at the range of file types they manage. This format profile is the starting point for preservation plans and actions. Why formats matter for preservation was covered in the KeepIt course. Three of the four exemplars have adopted the EPrints preservation ‘apps’, which we designed and implemented for them. This bundle of apps includes the DROID file format identification tool from The National Archives and presents a format, or preservation, profile within the repository interface. Here we reveal and compare the profiles of the four exemplars.
Format profiles past and present
First, we should recall that we have been producing format profiles in previous projects, using earlier variants on the tools. What we have now are more complete and distinctive profiles.
One major similarity we can note, however, between the exemplars and earlier profiles is that each profile is dominated by one format, measured by the total number of files stored in the repository in that format, followed by a steep, roughly power-law decline in the number of files per format: the ‘long tail’. For open access research repositories the typical profile is dominated by PDF and its variants and versions. This has been known for some time from our previous work. Among our KeepIt exemplars only one, the research papers repository, has this classic PDF-led profile. We can now reveal how the others (a science repository, an arts repository, and an educational and teaching repository) differ, and thus begin to understand what preservation challenges they each face.
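To make the idea of a profile concrete, the following is a minimal sketch of how a ranked format profile and the share of the dominant format might be computed from a list of identified formats. The function names and the sample counts are hypothetical and are not taken from any of the exemplar repositories or from the EPrints preservation apps.

```python
from collections import Counter


def format_profile(formats):
    """Return (format, file_count) pairs sorted from most to least common."""
    return Counter(formats).most_common()


def dominant_share(profile):
    """Fraction of all files held in the single most common format."""
    total = sum(count for _, count in profile)
    return profile[0][1] / total if total else 0.0


# Hypothetical identification results: one format label per stored file.
sample = (
    ["PDF 1.4"] * 60
    + ["PDF 1.3"] * 20
    + ["JPEG 1.01"] * 10
    + ["HTML"] * 6
    + ["RTF"] * 3
    + ["TIFF"]
)

profile = format_profile(sample)
for fmt, count in profile:
    print(f"{fmt:12} {count}")
print("dominant share:", dominant_share(profile))
```

Sorting by file count is what makes the head of the profile and its long tail visible; the same ranked list could be fed to any charting tool.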
Producing format profiles
Before we do this, bear in mind these general background notes on how the profiles were produced. For repositories of the scale we have been working with here, producing a profile is now a substantial processing task that can take hours or even days to complete.
For three repositories the counts include only accepted objects and do not include ‘volatile’ objects. The fourth (University of the Arts London) includes all objects, including those in the editorial buffer and volatiles. Repositories use editorial buffers to moderate submissions. Depending on the repository policy there may be a delay between submission, acceptance and public availability. Volatiles are objects that are generated when required by the repository – an example would be thumbnail previews used to provide an instant but sizeably reduced view of the object.
These are growing repositories, so the profiles must be viewed as temporary snapshots for the dates specified. They are provided here for illustration. For those repositories that have installed the EPrints preservation apps, the repository manager is provided with regular internal reports including an updated profile, and will need to track the changes between profiles as well as review each subsequent profile.
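Tracking changes between successive snapshots can be sketched as a simple comparison of two profiles. This is an illustrative sketch only: it assumes each profile is available as a mapping from format name to file count, which is not necessarily how the EPrints preservation apps store or report it, and the numbers are invented.

```python
def profile_changes(previous, current):
    """Return {format: change in file count} for formats whose count changed."""
    changes = {}
    for fmt in set(previous) | set(current):
        delta = current.get(fmt, 0) - previous.get(fmt, 0)
        if delta != 0:
            changes[fmt] = delta
    return changes


# Hypothetical snapshots taken on two different dates.
earlier = {"JPEG 1.01": 1200, "PDF 1.4": 300, "unknown": 550}
later = {
    "JPEG 1.01": 1350,
    "PDF 1.4": 300,
    "unknown": 480,
    "MPEG-1 Video Format": 9,
}

for fmt, delta in sorted(profile_changes(earlier, later).items()):
    print(f"{fmt}: {delta:+d}")
```

A report of this kind highlights newly appearing formats and movement in the number of unknowns, the two changes a manager is most likely to want to review.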
Understanding and responding to format profiles
We also need to understand some features of the tools when reviewing the results. In these results we have ‘unknown’ formats and ‘unclassified’ formats. Unclassified are files that are still to be classified. These may be new files that have been added since a profile scan began (scans can take some time) or since the last full scan.
More critical for preservation purposes are files with unknown formats. To identify a file format a tool such as DROID – an open source tool integrated within the EPrints preservation apps – looks for a specified signature within the object. If it can’t match a file with a signature in its database it is classified as unknown. In such cases it may be possible to identify the format simply by examining the file extension (.pdf .htm .gif, etc.). In most cases a file format will be exactly what it purports to be according to this extension. The merits of each approach, by format signature or filename extension, can be debated; neither is infallible, nor has the degree of error been rigorously quantified. It is up to the individual repositories how they interpret and resolve these results.
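The two approaches, signature first and file extension as a fallback, can be illustrated with a much simplified sketch. This is not how DROID is implemented: DROID matches against signatures from a format registry, whereas the table below is a tiny hand-made assumption containing only a few well-known leading byte sequences. The names `SIGNATURES`, `EXTENSION_HINTS` and `identify` are hypothetical.

```python
import os

# Illustrative signature table: well-known leading bytes for a few formats.
# A real tool uses a far larger registry and more flexible matching rules.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]

# Illustrative extension hints, used only when no signature matches.
EXTENSION_HINTS = {
    ".pdf": "PDF",
    ".png": "PNG",
    ".jpg": "JPEG",
    ".jpeg": "JPEG",
    ".gif": "GIF",
    ".htm": "HTML",
    ".html": "HTML",
}


def identify(filename, head):
    """Return (format, method) given a file name and the file's first bytes."""
    for magic, fmt in SIGNATURES:
        if head.startswith(magic):
            return fmt, "signature"
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_HINTS:
        return EXTENSION_HINTS[ext], "extension"
    return "unknown", "none"


print(identify("paper.pdf", b"%PDF-1.4\n"))
print(identify("page.htm", b"<html>"))
print(identify("mislabelled.gif", b"\xff\xd8\xff\xe0"))
print(identify("data.m", b"\x00\x01"))
```

Recording which method produced each answer keeps the weaker, extension-based identifications visible, so a manager can decide how far to trust them.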
The number of unknowns will be a major factor in assessing the preservation risk faced by a repository and is likely to be the area requiring most attention by its manager, at least initially until the risk has been assessed. We believe that in future it will be possible to quantify the risk of known formats, and to build preservation plans to act on these risks within repositories.
For formats known to specialists but not to the general preservation tools, it will be important to enable these to be added to the tools. When this happens it will be possible for the community to begin to accumulate the factors that might contribute to the risk scores for these formats. As long as formats remain outside this general domain, it will be for specialists to assess the risk for themselves. We will see examples of this in the cases below.
Producing format profiles is becoming an intensive process, and subsequent analysis is likely to be no less intensive.
Science data repository (eCrystals, University of Southampton)
A specialised science data repository is likely to have file types that a general format tool will fail to recognise. For this repository of crystal structures we anticipated two such formats, CIF and CML, and we reported how signatures for these formats were added to the identification tool. What we can see in this profile is how successful these signatures were: successful for CIF, but only partially successful for CML.
For this repository, which uses a customised version of EPrints and therefore has not so far installed the apps, we ran the tool for them over a copy of their content temporarily stored in the cloud. Figure 1 shows the full profile for this repository, including unknowns (in red), those formats not identified by DROID but known to EPrints (showing both the total and the breakdown in yellow), as well as the long tail of identified formats. All but two CIF files were identified by DROID. Had all the instances of CML been recognised, it would have been the format with the most files (adding the yellow and blue CML bars), but almost half were not recognised by DROID.
As it stands the format with the largest number of files known to DROID was, interestingly, an image format (JPEG 1.01). We will see this is a recurring theme of emerging repository types exemplified by our project repositories.
Also with reference to the other exemplar profiles to follow, it will be noticeable that this profile appears to have a shorter tail than the others. However, in this case we can see that ‘unknown’ (to DROID and EPrints) is the largest single category, and when this is broken down it too presents a long tail (Figure 2) that is effectively additive to the tail in Figure 1. These unknowns include more specialised formats, which might be recognised by file extension.
As explained, clearly these unknowns will need to be a focus for the repository managers, although in preliminary feedback they say that many of these files are “all very familiar, standard crystallography files of varying extent of data handling that often get uploaded to ecrystals for completeness.” This is reassuring because file formats unknown to system or manager or scientists could be a serious problem for the repository. Even so, as long as such formats remain outside the scope of the general format identification tools the managers will need to use their own assessments and judgement to assure the longer-term viability and accessibility of these files.
Arts repository (University of the Arts London)
What’s the largest file type in an arts repository? Perhaps unsurprisingly it’s an image format, in this case led by JPEGs of different versions. As can be seen in Figure 3, the number of unknowns, highlighted among the High Risk Objects, is the fourth largest single category in this profile and so requires further investigation.
Once again there is a long tail (Figure 4).
First indications in Figure 5, which expands the high risk category, suggest that many of these will turn out to be known formats that have not been recognised by DROID. It may be possible to resolve and classify many of these by manual inspection, the last resort of the repository manager to ensure that files can be opened and used effectively.
Teaching repository (EdShare, University of Southampton)
EdShare repository manager Debra Morris has already reflected on this profile. The first notable feature of the profile (Figure 6) is that the largest format is, again, an image format. Evidently, like an arts repository (perhaps predictably) and like a science data repository (less predictably), it seems the emphasis of a repository of teaching resources may be visual rather than textual.
Another feature of the profile is the classification of LaTEX (Master File), the second largest format in this profile. Until now this format was unknown to DROID, but a new signature was created and added to our project version of DROID, in the same way as described for CIF/CML (and was submitted to the TNA for inclusion in the official format registry). The effect of this was to reduce the number of unknowns from nearly 2500 to c. 550, and thus instantly both to clarify and to reduce the scale of the challenge.
As usual with the long tail, preservation planning decisions have to be made about the impact and viability of even infrequent formats. For reference, Table 1 shows the formats not included among the largest formats by file count in Figure 6.
Table 1. Long tail of formats by file count in EdShare
|Format|File count|
|---|---|
|Plain Text File|20|
|Rich Text Format (Version 1.7)|17|
|Windows Bitmap (Version 3.0)|15|
|Acrobat PDF 1.6 – Portable Document Format (Version 1.6)|15|
|Document Type Definition|11|
|Macromedia FLV (Version 1)|11|
|MPEG-1 Video Format|9|
|Icon file format|9|
|LaTEX (Sub File)|9|
|XML Schema Definition|8|
|Macromedia Flash (Version 7)|8|
|Windows Media Video|6|
|Extensible Hypertext Markup Language (Version 1.1)|5|
|Rich Text Format (Version 1.5)|3|
|TeX Binary File|3|
|Acrobat PDF 1.1 – Portable Document Format (Version 1.1)|3|
|Java Compiled Object Code|2|
|Encapsulated PostScript File Format (Version 3.0)|2|
|Exchangeable Image File Format (Compressed) (Version 2.2)|2|
|Scalable Vector Graphics (Version 1.0)|2|
|Microsoft Web Archive|2|
|Audio/Video Interleaved Format|2|
|JTIP (JPEG Tiled Image Pyramid)|1|
|Comma Separated Values|1|
|OS/2 Bitmap (Version 1.0)|1|
|Acrobat PDF 1.7 – Portable Document Format (Version 1.7)|1|
|PHP Script Page|1|
|Microsoft Word for Windows Document (Version 6.0/95)|1|
|Hypertext Markup Language (Version 2.0)|1|
|PostScript (Version 3.0)|1|
Similarly, the unknowns present a potent challenge. Figure 7 is a breakdown of what we think we can tell from file extensions. Unlike the case of unknowns in eCrystals, here there was less anticipation of specialised formats that were unlikely to be found in a general format registry, so this list is a revelation. In many of these cases an error in the file may be preventing recognition of an otherwise familiar format. Here we can see extensions for Flash files, for various text-based formats such as HTML and CSS, which may be malformed, and possibly for some images. In such cases the relevant file should be identified and an attempt made to open it with a suitable application. In this way it may be possible to begin to assess the reasons for non-recognition, to confirm the likely format, and to take any action needed to repair or convert the file. Files with unfamiliar extensions (.m) will need particular attention.
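A breakdown of unknowns by file extension, of the kind described for Figure 7, could be produced along the following lines. This is a sketch under stated assumptions: the input is assumed to be a plain list of file names for objects whose format was not identified, and the file names shown are invented for illustration.

```python
import os
from collections import Counter


def unknowns_by_extension(filenames):
    """Tally unidentified files by lower-cased extension, most common first."""
    counts = Counter()
    for name in filenames:
        ext = os.path.splitext(name)[1].lower() or "(no extension)"
        counts[ext] += 1
    return counts.most_common()


# Hypothetical names of files that the identification tool could not classify.
unknown_files = [
    "intro.swf",
    "quiz.SWF",
    "style.css",
    "index.html",
    "solve.m",
    "README",
]

for ext, count in unknowns_by_extension(unknown_files):
    print(f"{ext:16} {count}")
```

The tally only suggests a likely format; each file still needs to be opened with a suitable application to confirm it.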
Another feature of this repository’s format profile, not illustrated here, is the difference when the profile is based on file size rather than file count. In this case the largest format by file size remains JPEG 1.01, but the next largest file types are all MPEG video formats, which is not so evident from Figure 6 and Table 1. It is not hard to understand why this might be: video files tend to be larger than text files or other format types. At first sight the profile by file count might have the stronger influence on preservation plans, but this further evidence on file size can be used strategically as well.
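The difference between a profile by file count and a profile by file size can be shown with a short sketch. The input is assumed to be a list of (format, size in bytes) pairs, and the sizes are hypothetical values chosen only to echo the pattern described above; they are not EdShare data.

```python
from collections import defaultdict


def profiles_by_count_and_size(files):
    """Given (format, size_in_bytes) pairs, rank formats by count and by size."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for fmt, size in files:
        counts[fmt] += 1
        sizes[fmt] += size
    by_count = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    by_size = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    return by_count, by_size


# Hypothetical files: many images, a few small text files, one large video.
files = (
    [("JPEG 1.01", 2_000_000)] * 5
    + [("Plain Text File", 4_000)] * 3
    + [("MPEG-1 Video Format", 8_000_000)]
)

by_count, by_size = profiles_by_count_and_size(files)
print("by count:", by_count)
print("by size: ", by_size)
```

In this invented sample the video format is last by count but second by total size, which is why both views are worth keeping when planning storage and preservation actions.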
Research papers repository (University of Northampton)
Figure 8 is a classic, PDF-dominated profile of a repository of research papers. Although this is an apparently small repository, what the profile shows is a repository that has so far focussed on record-keeping rather than on collecting full digital objects. We have seen already how investigations to expand the scope of this repository have begun. There is nothing in this profile that would surprise the repository manager. It acts as confirmation, a snapshot of the repository, and can be viewed as a platform for deciding where the repository should head in future. The tools will help the repository manager to monitor growth and the implementation of future plans.
Digital and institutional repositories are changing, and the established research papers repository is now complemented by rapidly growing repositories targeting new types of digital content. For the first time we have been able to compare and contrast these different repository types using tools designed to assist digital preservation analysis by identifying file formats and producing profiles of the distribution of formats in each repository.
While past format profiles of repositories collecting open access research papers tended to produce uniform results differing in scale rather than range, the new profiles reveal potentially characteristic fingerprints for the emerging repository types. What this also reveals more clearly, by emphasising the differences, are the real preservation implications for these repositories based on these profiles, which could be masked when all profiles looked the same. Each exemplar profile gives the respective managers a new insight into their repositories and careful reading will lead them to an agenda for managing the repository content effectively and ensuring continued access, an agenda that will be the more clearly marked for recognising how the same process produced different results for other types of repository.
This agenda will initially be led by the need to investigate digital objects for which the format could not be identified by the general tools, the ‘unknowns’. We have seen that specialised science data repositories, and even less obviously specialised examples, can produce large numbers of unknowns. These are high-risk objects in any repository by virtue of their internal format being unknown, even though on inspection many may turn out to be easily identified and/or corrected.
For the known formats, especially the largest formats by file count, these profiles show where effort is worth expending on producing preservation plans that will automate the maintenance of these files. Based on these exemplars, all repositories with substantial content are likely to produce format profiles displaying a long tail.
An intriguing finding of this work is that the emerging repository types, rather than open access institutional repositories founded on research papers, are dominated by visual rather than textual formats.
All these exemplars either are, or plan to become, institutional in scope even though limited to a specified type of content. One original idea that motivated the KeepIt project was that truly institutional repositories are likely to come to collect and store digital outputs from all research and academic activities, such as those represented by these exemplars. Thus, combined, the exemplars might represent the institutional repository of the future. It’s worth bearing in mind how the combined format profiles might look, and the consequent implications for preservation, when contemplating the prospect.
We are grateful to all the exemplar repositories for allowing us to reproduce these profiles.