
Adding chemistry to a file format registry

Adding chemistry formats to a DROID profile

Everybody knows DROID. Well, everybody working in digital preservation. Those being introduced to digital preservation will likely be shown a tool or two, because tools help us do practical preservation, and the tool most likely to be shown is DROID (for example, here 11.45 am). This is because DROID comes from the National Archives, is open source, and does something fundamental to digital preservation: it identifies file formats. DROID (Digital Record Object Identification) is an automatic file format identification tool.

We’ve been using DROID for years, in KeepIt and before that Preserv, to produce repository format profiles. It tells us what we have to preserve, and we can use that information to begin to judge risk, build a preservation plan and, where necessary, take some evasive action by converting a digital item into a format we may believe to be less risky.

One thing we’ve never done with DROID is add a new format. This part, PRONOM, is carefully curated by the National Archives. (Note, in case you are wondering what PRONOM is: it is the registry of file formats that informs DROID, which in turn is the software used to scan your content.) PRONOM is not a wiki or other social media, so you can’t just add stuff without moderation (not yet anyhow, although that may change).

PRONOM the registry currently contains information on over 700 file formats, but there are many thousands more formats in existence. In other words, PRONOM covers most major, popular formats and then some, but not all formats you can think of. When it comes to more specialised data, it’s likely your format is not represented.

In this case our exemplar KeepIt repository eCrystals stores data from crystallography experiments in the laboratory. The formats it uses to describe these data are not likely to be available in DROID, so a format profile of this repository will reveal a large number of unknown files and will not be of much use.

What PRONOM-DROID requires for format identification is a signature for the file format you want to identify, that is, something distinctive that will reliably differentiate it from any other format. We imagined this would require quite a detailed knowledge of the formats. Since we are not the originators or sponsors of the formats in question, we were concerned that we might not have the requisite knowledge, and that we might need the cooperation of such people in some way. We needn’t have worried about either. Creating a file format signature for DROID is simpler than we had anticipated.
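To make the idea of a signature concrete, here is a minimal Python sketch. It is not how DROID works internally and it does not use the PRONOM signature file format; the `identify` function, the `TOY_SIGNATURES` table and the two byte markers are assumptions made up for illustration. It only shows the principle: a format is recognised by a distinctive byte sequence, without validating the rest of the file.

```python
# Illustrative only: a toy byte-signature check, not DROID's matching engine.
# The markers below are assumptions for this sketch, not PRONOM signatures.

TOY_SIGNATURES = {
    "CIF (toy)": b"data_",  # assumed marker: a CIF data block header
    "CML (toy)": b"<cml",   # assumed marker: a <cml> root element
}


def identify(content: bytes) -> str:
    """Return the first toy format whose marker appears in the content."""
    for name, marker in TOY_SIGNATURES.items():
        if marker in content:
            return name
    return "unknown"


cif_sample = b"data_example\n_cell_length_a 5.43\n"
cml_sample = b'<?xml version="1.0"?>\n<cml>\n</cml>\n'
text_sample = b"Just some plain text.\n"

for sample in (cif_sample, cml_sample, text_sample):
    print(identify(sample))
```

A marker this short would also match unrelated files that happen to contain the same bytes, which is why a candidate signature has to be tested for uniqueness against other content before it can be trusted.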

What follows is a Twitter record (@jisckeepit, from about 9:45 AM Sep 16th) of what happened as we set about creating format signatures for our crystallography and chemical files. We were able to do this at a small test level using a version of DROID we run locally.

  • We just worked out how to write a signature file for a format in PRONOM-DROID. It’s ID, not validation. You bet!
  • Need to check uniqueness of signature for CIF file format. “You boy at the back, DROID, do you recognise this file?” “No sir”. Good start
  • For the others at the back of the class, our CIF is a Crystallographic Information File, produced from experiments in the lab
  • The boy DROID is a fast learner. He now knows CIF and has passed the initial class test, but how will he cope with the big school exam?
  • We have boy DROID’s exam results, and are tearing open the envelope now. FAIL. Retake
  • Great, boy DROID has now passed his CIF exam. It’s a high grade, no absolute mark, but probably as good as earlier passes 😉
  • Since boy DROID is such a good student we’re now teaching him CML, Chemical Markup Language, based on XML, which he already knows
  • See, it’s easy when you know how. Boy DROID passed the CML exam first time
  • Next we will commend Boy DROID to his parents (at iPres next week) and suggest they enter the new CIF, CML format sigs in the DROID academy
  • Also next, we will run boy DROID on eCrystals-a big repository!-to produce its first format profile. That may take some days
  • Big thanks to boy DROID’s class teachers, from Chemistry Philip Adler and KeepIt’s Dave Tarrant

The result of all this is the profile shown in the figure at the top of this blog post. Essentially we deposited a few example CIF and CML files in our test repository, inspected the source files in detail to write the signature, and used our EPrints preservation apps, including our test version of DROID, to produce this simple profile of our test files. What we haven’t shown here are all the unsuccessful test profiles, which would show as an alarming red bar labelled ‘unknown files’ when DROID could not recognise a CIF, instead of blue bars with the correctly identified file names.
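A format profile of this kind is essentially a tally of identification results. The sketch below is a rough illustration under stated assumptions: the file names and labels are invented, `format_profile` is a hypothetical helper, and none of it reflects how the EPrints preservation apps or DROID produce their output.

```python
from collections import Counter

# Hypothetical identification results: file name -> format label.
# None stands for a file the identification tool could not recognise.
results = {
    "sample1.cif": "CIF",
    "sample2.cif": "CIF",
    "sample3.cml": "CML",
    "sample4.dat": None,
}


def format_profile(identified: dict) -> Counter:
    """Count files per format, grouping unrecognised files as 'unknown'."""
    return Counter(
        label if label is not None else "unknown"
        for label in identified.values()
    )


profile = format_profile(results)

# Print a crude text bar chart, most common format first.
for fmt, count in profile.most_common():
    print(f"{fmt:8} {'#' * count} {count}")
```

Before a signature exists for a format, every file of that format lands in the 'unknown' bucket, which is what the unsuccessful test profiles showed.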

If you are interested you can find out more about the CIF and CML formats.

A helpful lesson and a step towards another KeepIt exemplar repository.

Addendum (22 September 2010)

Philip Adler, a specialist in the chemistry file types being looked at here, provides some additional insights into writing a new signature profile to be added to the DROID format identification tool.

“Creating a new profile for DROID to search a database was a novel problem; and one for which the documentation, whilst plentiful, failed to characterise the behaviours of the XML form in which one specifies what signature formats DROID is looking for. As such, for someone with little to no computing experience, it is my opinion that it would be very challenging and time consuming, on a first run through, to establish a new signature file. Indeed, for an experienced computer scientist and a specialist in the file type being looked at, it took a considerable length of time.

“That said, once the format of the signatures, and the layout of the signature files has been deduced, installing a new signature was relatively straightforward the second time around.

“There is one ‘bug’. Whilst attempting to define a term which would be at a non-fixed distance from the beginning or end of a file, we established that the DROID framework does not permit this. For instance, we managed to ‘break’ the HTML definition within DROID by submitting a perfectly valid (although unorthodox) HTML file with an absurdly long comment at the top. Whilst in HTML this is odd, in a .CIF file it is not, and so could serve to break the file in the future. The back-door for this is to set the maximum distance from the beginning or end of the file to an absurdly great length. This solution, however, is far from obvious, and is located as a standard method within the documentation, which implies that DROID can cope with arbitrary distances from BOF (beginning of file) or EOF (end of file).”
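The offset problem Philip describes can be illustrated with a small Python sketch. This is not DROID code: `matches_near_bof`, the `data_` marker and the offset values are assumptions chosen for the example. It shows a marker that must start within a maximum distance of the beginning of the file, how a long leading comment pushes it out of range, and how a very large maximum offset works around that.

```python
# Illustrative only: a toy model of a signature anchored near the
# beginning of file (BOF). Names and numbers are assumptions.


def matches_near_bof(content: bytes, marker: bytes, max_offset: int) -> bool:
    """True if marker starts at an offset between 0 and max_offset inclusive."""
    # bytes.find only reports matches that lie wholly before the end index,
    # so the window must extend len(marker) bytes past max_offset.
    return content.find(marker, 0, max_offset + len(marker)) != -1


marker = b"data_"
plain = b"data_example\n_cell_length_a 5.43\n"

# The same content preceded by a very long comment line.
long_comment = b"#" + b" comment" * 1000 + b"\n"
commented = long_comment + plain

print(matches_near_bof(plain, marker, 0))              # marker at offset 0
print(matches_near_bof(commented, marker, 1024))       # marker beyond the window
print(matches_near_bof(commented, marker, 1_000_000))  # very large window
```

The trade-off in this toy model is that a larger window means scanning more of each file, and it still fails for any file whose leading comment is longer than the chosen limit.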
