Skip to content


Getting Machine-readable Spreadsheets

Paddling in the shallow end of open data we have the PDF file, we have Excel and we have CSV.

For those of you not yet familiar with TBL’s linked data stars, Ed Summers has a great summary, but here’s the first three.

make your stuff available on the web (whatever format)
★★ make it available as structured data (e.g. excel instead of image scan of a table)
★★★ non-proprietary format (e.g. csv instead of excel)

Give .xls some love!

It’s not clear that Excel is still a propriatary format. From http://en.wikipedia.org/wiki/Microsoft_Excel#Binary:

OpenOffice.org has created documentation of the Excel format.[52] Since then Microsoft made the Excel binary format specification available to freely download.[53]

I think we need to use a better example. If it contains no equations it’s dead easy to work with. Hell, there’s the Perl module Spreadsheet::ParseExcel, and if you’re too lazy to use that, I’ve made you a REST web service using it.

Being in CSV does not mean your data doesn’t suck!

However that’s not my beef for this article. My rant is about the assumption that something being in CSV makes it machine readable data. data.gov.uk is a Good Thing, but there’s lots of Excel & CSV files in it which are not noticeably more machine readable than an HTML document. It deserves a Gold Star for being online with an Open License but not 3 for being machine-readable.

So here’s my idea… Define a subset of CSV to be easy for programmers to work with and also easy for office staff to maintain. This is not idle whimsy, I’m talking to our various university services about how to both get their data open but also how to make it easy for them to maintain in a way machines can work with.

Ideally, of course, we’d just build a database with a pretty web interface, but I’m on a budget so want something cheap and cheerful. Also, I want to minimise the impact on their normal working practices and make sure they hold the authoritative version of their data.

Here’s the plan, although I need a name for it. It applies to any tabular data, but mainly aimed at csv & spreadsheets.

Easy-Parse Spreadsheet Recommendation

The first row to contain a non-blank value in Col-1 is the ‘title row’. Each column in this row is treated as the title of column and used to identify it. Two columns should not have the same title. Titles should not be changed by the person maintaining the table, although columns may be added with new titles, and columns may be reordered.

Every row after the title row, which is not entirely blank, is a data row. Each row is considered a data object with the values properties named by the title row. Any data without a title in the title row is ignored. Whitespace before or after a value is always ignored.

If a cell has multiple values then these values should be separated with a single agreed character. Usually a comma “,” but if the field contains textual descriptions where a comma may be useful then use something else like tilda “~” or semi-colon “;”. Whitespace either side of the separating character is removed. Which fields may be split by what characters is not specified in the data.

No computed fields should be used. Or at least they should be used for human convenience and are not considered part of the data.

Use of colour & style can and should be used to make it easier for humans to work with the document. They are not part of the data.

Dates and times should be machine readable. Don’t use “am” & “pm”! Values should be written consistently.

I think that’s all pretty sensible, but this last one is a bit more controversial (suggested refinements welcome!)

Some metadata about the document may be useful to embed in it. Anything vital to the mechanical interpretation of this specific file. Most notably the spatial and temporal coverage. In this case, metadata may be specified by rows in the format “*SET”, “Key”,”Value”. The “*SET” should be handled separately, in fact any row where the first character of the first cell is a star “*” should not be interpreted as data or titles.

So, I guess, we might as well define “*COMMENT” to allow section headings and comments.

OrgGrinder

My long term plan for this is to produce a simple template system which can process these files into RDF. It’s an imperfect system, but I think it’s much more sustainable for a small office.

My long term goal is to produce some sample spreadsheets + Grinder Templates so that smaller organisations could produce adequate open linked data with minimal hassle and staff training.

Here’s a scenario; you create your organisations public phonebook on Google Docs spreadsheet (using my template) and publish it as CSV. Bam!: 3 Stars already. Then you go to my web service and give it the URL of the CSV and some bits and bobs like your base URI & license and Bamo!: 4 Stars.

BizTalk

This is a system used by some large organisations (including the University of Southampton) to push around data between diverse systems. It’s got some clever stuff I don’t understand, but ultimately it appears to shove around Pipe “|”-separated data files. That would work like a charm with this system as it’s already conforming to most of the restrictions.

EPrints-Data

It seems fairly clear to me that EPrints will soon have to start coping with being a repository of “data documents” (in fact there’s already some in the wild).

There’s several kinds of data we may want a repository for:

  • One-shot datasets – the results of an experiment.
  • Cumulative data without temporal/spatial scope – the results of a really long experiment. They may have a ‘temporal’ element, but it’s relevant to the stage of the experiment not the calendar date. Once the project ends this data may be treated just like a one-shot dataset.
  • Cumulative data with a temporal and/or spatial scope – This is more the field of infrastructure data. The price of coffee at the university is only true from and to a certain date, but it may remain interesting to find out values from the past. Spatial scope is more aimed at things like data.gov.uk where you often have one dataset supplied per council per year (or hope to).
  • Live Data – This is data which is only considered interesting and useful in the here-and-now. For example the current university phonebook or the list of carparks. There may be a handful of uses for the old data, but this is the kind of data which you edit in-situ rather than published revisions of.

I’m thinking for the more infrastructure-type data we could add some nice functionality to EPrints to allow it to

  1. Have a ‘grinder template’ and configuration options associated with an EPrint so that you can just upload an excel document and have it turned into RDF triples, just like it makes preview images.
  2. Make documents have a temporal coverage from and through fields, then make it possible to query the document for “now” or for a given date.
  3. For the datasets where only the current is deemed important, add a /latest/ URL to get the last document attached to a record. This is already coded and will be a package in the new EPrints Bazaar, coming next year!

Posted in Best Practice, Data, RDF, Repositories, Templates.


3 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.

  1. Neil Crookes says

    There’s also Google Spreadsheets, in front of which you can build a pretty web form, and there’s an API for querying it.

    But it is proprietary of course :-(

    • Christopher Gutteridge says

      What I’m trying to do is find something which can be realistically maintained by office staff, rather than IT staff. In an ideal world we’d build a custom database & web interface for everything but I’m looking at something slightly less IT-staff expensive to get running.

      My point is that CSV/excel/tabular data isn’t automatically a useful database without some effort being made to either post-process it or restrict the way it is used. My idea is to produce some spreadsheets *designed* to be converted into RDF, but easily maintainable in the office by staff who don’t know or care what a URI is.

  2. Jakob says

    Sorry, but I just don’t get it. What’s wrong with CSV? Sure “there’s lots of Excel & CSV files in it which are not noticeably more machine readable than an HTML document”. That’s not a problem of CSV, but a problem with the particular data. You can create CSV, XML, JSON, and even RDF data, that is far from machine readable.

    Converting CSV to RDF should be similar to converting relational data to RDF. I’d keep the tabular data as it is and introduce a mapping from rows to graph patterns in turtle syntax with ‘$foo’ as replacement for the table cell in column $foo. For more complex mappings (conditionals, splitting field values etc.) you need a full programming language and a customized script for each tabular format anyway.



Some HTML is OK

or, reply to this post via trackback.