Data Catalogue Interoperability Meeting
June 23, 2011
by Christopher Gutteridge
I’ve just come back from a two-day meeting on making data catalogs interoperable.
There are some interesting things going on, and CKAN is by far the most established data catalog tool, but it frustrates me that they are basically re-implementing EPrints starting from a different goal. On the flip side, maybe parallel evolution is a natural phenomenon. Nonetheless, the CKAN community should pay more attention to the, now mature, repository community.
Data catalogs have a few quirks. One is that they are really not aimed at the general public. Only a small number of people can actually work with data, and this should inform design decisions.
The meeting had a notable split in methodologies, but not an acrimonious one. In one camp we have the URIs+RDF approach (which is now my comfort zone) and in the other GUID plus JSON. The consensus was that JSON and RDF are both useful, for different purposes. Expressing lossless RDF via JSON just removes the benefits people get from using JSON (it’s easy to parse and understand at a glance).
A Dataset by any other Name
A key issue is that dataset and dataset catalogue are very loaded terms. We agreed, for the purposes of interoperability, that a dataset record is something which describes a single set of data, not an aggregation. Each distribution of a dcat:dataset should give access to the whole of the dataset(ish). Specifically, this means that a dataset (lower-case d) which is described as the sum of several datasets is a slightly different kind of catalog record, and may be described as a list of simple dcat:datasets.
Roughly speaking, the model of an (abstract, interoperable) data catalog is:
- Dataset (simple kind)
  - Distributions (endpoints, download URLs, indirect links to the pages you can get the data from, or instructions for how to get the data)
- Dataset (simple kind)
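Sketched in Turtle (one of the serialisations discussed at the meeting), a minimal catalog record along these lines might look like the following. The URIs and titles are invented for illustration, and the property names reflect my reading of the DCAT draft rather than a settled spec:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dc:   <http://purl.org/dc/terms/> .

# A catalog listing one simple (non-aggregate) dataset.
<http://example.org/catalog> a dcat:Catalog ;
    dcat:dataset <http://example.org/dataset/rainfall> .

<http://example.org/dataset/rainfall> a dcat:Dataset ;
    dc:title "Monthly rainfall readings" ;
    dc:identifier "rainfall-2011" ;
    # Each distribution should give access to the whole dataset(ish):
    dcat:distribution [
        a dcat:Distribution ;
        dcat:accessURL <http://example.org/data/rainfall>
    ] .
```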
We agreed that DCAT was pretty close to what was needed, but with a few tweaks. The CKAN guys come from the Open Knowledge Foundation, so handling distributions of data which require other kinds of access, such as a password, a licence agreement, or even “show up to a room with a filing cabinet”, were outside their usual scope, but will be important for research data catalogues.
We discussed ‘abuse’ of dcat:accessURL – it sometimes gets used very ambiguously when people don’t have better information. The plan is to add dcat:directURL, which points to the actual resource from which a serialisation or endpoint is available.
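A sketch of the distinction in Turtle. Note that dcat:directURL was only a plan at the time of writing, not part of DCAT, and the URLs here are invented:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .

<http://example.org/dist/rainfall-csv> a dcat:Distribution ;
    # Today's ambiguous usage: a landing page a human can navigate from.
    dcat:accessURL <http://example.org/data/rainfall> ;
    # The proposed property: the actual file or endpoint itself.
    dcat:directURL <http://example.org/data/rainfall.csv> .
```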
Services vs Apps: we agreed that services which give machine-friendly access to a dataset, such as a SPARQL endpoint or an API, are distributions of a dataset, but applications giving humans access are not.
We agreed that, in addition to dc:identifier, dcat should support a globally unique ID (a string which can be a GUID, a URI, or something else) which can be used for de-duplication.
Provenance is an issue we skirted around but didn’t come up with a solid recommendation for. It’s important – we agreed on that!
Very Simple Protocol
At one point we nearly reinvented OAI-PMH, which would be rather pointless. The final draft of the day defines the method for providing interoperable data, and the information to pass, but deliberately not the exact encoding, as some people want Turtle and some JSON. It should be easy to map from Turtle to JSON, though in a lossy way.
A nice design feature is that it takes a single URL with an optional parameter which the data catalog can ignore. In other words, the degenerate case is that you create the entire response as a catalog.ttl file and put it in a directory! The possible endpoint formats are initially .json, .ttl and (my ideal right now) maybe .ttl.gz
The request returns a description of the catalog and all records. It can be limited to catalog records changed since a date using ?from=DATE but obviously if you do that on a flat file you’ll still get the whole thing.
It can also optionally, for huge sites, include a continuation URL to get the next page of records.
The information returned for each catalog record is:
- the URL to get the metadata for the catalog record (be it licence, collection or dataset) in .ttl or .json, depending on the endpoint format
- the last-modified time for the catalog record (not the dataset contents)
- the globally unique ID (or IDs…) of the dataset it describes
- an indication of whether the record has been removed from the catalog — possibly the removal time
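As a sketch of how a harvester might consume the JSON flavour of this protocol, here is a minimal example in Python. The field names (metadata_url, modified, ids, removed, next) are my own invention, since the meeting deliberately left the exact encoding open; the URLs are likewise illustrative:

```python
import json
from urllib.parse import urlencode

def harvest_url(endpoint, since=None):
    """Build the harvest request URL. The ?from= parameter is optional,
    and a flat-file catalogue may simply ignore it."""
    if since is None:
        return endpoint
    return endpoint + "?" + urlencode({"from": since})

# A hypothetical response, inlined here instead of fetched over HTTP.
# "next" would hold a continuation URL for huge sites; null means done.
sample_response = json.loads("""
{
  "records": [
    {"metadata_url": "http://example.org/dataset/1.json",
     "modified": "2011-06-01T12:00:00Z",
     "ids": ["urn:uuid:1234"],
     "removed": false},
    {"metadata_url": "http://example.org/dataset/2.json",
     "modified": "2011-05-20T09:30:00Z",
     "ids": ["urn:uuid:5678"],
     "removed": true}
  ],
  "next": null
}
""")

def live_records(response):
    """Keep only the records still present in the catalogue,
    dropping those flagged as removed."""
    return [r for r in response["records"] if not r["removed"]]
```

The point of the design is visible here: a harvester that ignores ?from= and pagination still works against a single static file, it just does more work per visit.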
Harvesters should obey directives from robots.txt.
All in all, I’m pleased with where this is going. It means you can easily implement this with a fill-in-the-blanks approach for smaller catalogs. A validator will be essential, of course, but it’ll be much less painful to implement than OAI-PMH (though less versatile).
I learned some interesting stuff from John Erickson (from Jim Hendler’s lot). They are following very similar patterns to what I’ve been doing with Grinder (CSV –grinder–> XML –XSLT–> RDF/XML –> Triples )
One idea I’m going to nick is that they capture the event of downloading data from URLs as part of the provenance they store.
One Catalog to Rule them All
The final part of the discussion was about a catalog of all the world’s data catalogues. This is a tool aimed at an even smaller audience than data catalogues themselves, but it could be key in decision making, and I suggested the people working on it have a look at ROAR, the Registry of Open Access Repositories, which is a catalog of 2200+ repositories. It has been redesigned from the first attempt and captures useful information for understanding the community, like software used, activity of each repository (update frequency), country, purpose, etc. Much the same will be useful to the data-cat-cat.
Something like http://data-ac-uk.ecs.soton.ac.uk/ (maybe becoming http://data.ac.uk at some point) would be one of the things which would feed this monster.
All in all a great trip, except for the flight back, where the pilot wasn’t sure if the landing flaps were working, so we circled for about an hour and at one point he came out with a torch to have a look at the wings! All was fine, and the poor ambulance drivers and firemen had a wasted trip to the airport. Still, better to have them there and not need them!
Jonathan Gray has transferred the notes from the meeting to a wiki.