Skip to content


Data Catalogue Interoperability Meeting

I’ve just come back from a two day meeting in Edinburgh on making data catalogs interoperable.

There’s some interesting things going on and CKAN is by far the most established data catalog tool, but it frustrates me that they are basically re-implementing EPrints starting from a different goal. On the flip side, maybe parallel evolution is a natural pheonomena. None the less the CKAN community should pay more attention to the, now mature, repository community.

Data Catalogs have a few quirks. One is that they are really not aimed at the general public. Only a small number of people can actually work with data and this should inform descisions.

The meeting had a notable split in methodologies, but not an acrimonious one. In one camp we have the URIs+RDF approach (which is now my comfort zone) and on the other GUID plus JSON. The concensus was that JSON & RDF are both useful for different approaches. Expressing lossless RDF via JSON just removes the benefits people get from using JSON (easy to parse & understand at a glance).

My interests here are twofold. One is that I represent data.southampton.ac.uk which is one of the first organisation-level data catalogs. These will be numerous and small. The other is that I hope that repositories (EPrints) will begin to be used to store research data. A dataset repository is almost certainly also a dataset catalogue.

Catalog or Catalogue?

The dataset catalog community has settled on the en-us spelling of the word, but I tend to swing between the two. Apologies for that.

A Dataset by any other Name

A key issue is that dataset and dataset catalogue are very loaded terms. We agreed, for the purposes of interoperability that a dataset record is something which describes a single set of data, not an aggregation. Each distribution of a dcat:dataset should give access to the whole of the dataset(ish). Specifically this means that a dataset (lower-case d) which is described as the sum of serveral datasets is a slightly different catalog record and may be described as a list of simple dcat:datasets.

Roughly speaking the model of a (abstract, interoperable) data-catalog is

  • Catalog
    • Dataset (simple kind)
      • Distributions (Endpoints, Download URLS, indirect links to the pages you can get the data from or instructions of how to get the data)
    • Collections
    • Licenses

 

We agreed that DCAT was pretty close to what was needed, but with a few tweaks. The CKAN guys come from the Open Knowledge Foundation so handling distributions of data  which required other kinds of access such as password, license agreement or even “show up to a room with a filing cabinet” where outside their usual scope but will be important for research data catalogues.

We discussed ‘abuse’ of dcat:accessURL  – it sometimes gets used very ambiguously when people don’t have better information. The plan is to add dcat:directURL which is the actual resource from which a serialisation or endpoint is available.

Services vs Apps: Services which give machine-friendly access to a dataset, such as SPARQL or an API we agreed were distributions of a dataset, but Applications giving humans access are not.

We agreed that, in addition to dc:identifier. dcat should support a globally unique ID (a string which can by a GUID or a URI or other) which can be used for de-duplication.

Provenance is any issue we skirted around but didn’t come up with a solid recommendation for. It’s important – we agreed that!

Very Simple Protocol

At one point we nearly reinvented OAI-PMH which would be rather pointless. The final draft of the day defines the method to provide interoperable data, and the information to pass but deliberately not the exact encoding as some people want Turtle and some JSON. It should be easy to map from Turtle to JSON but in a lossy way.

A nice design is that it takes a single URL with an optional parameter which the data-catalog can ignore. In other words, the degenerate case is you create the entire response as a catalog.ttl file and put it in a directory! The possible endpoint formats are initially .json, .ttl and (my ideal right now) maybe .ttl.gz

The request returns a description of the catalog and all records. It can be limited to catalog records changed since a date using ?from=DATE but obviously if you do that on a flat file you’ll still get the whole thing.

It can also optionally, for huge sites, include a continuation URL to get the next page of records.

The information returned is the URL to get the metadata for the catalog record (be it license,collection or dataset) in .ttl or .json depending on the endpoint format, last modified time for the catalog record (not the dataset contents), the globally unique ID (or IDs…) of the dataset it describes, and an indication if the record has been removed from the catalog — possibly the removal time.

Harvesters should obey directives from robots.txt

All in all I’m pleased where this is going. It means you can easily implement this with a fill-in-the-blanks approach for smaller catalogs. A validator will be essential, of course, but it’ll be much less painful to implement than OAI-PMH (but less versatile).

csv2rdf4lod

I learned some interesting stuff from John Erickson (from Jim Hendler’s lot). They are following very similar patterns to what I’ve been doing with Grinder (CSV –grinder–> XML –XSLT–> RDF/XML –> Triples )

One idea I’m going to nick is that they capture the event of downloading data from URLs as part of the provenance they store.

One Catalog to Rule them All

The final part of the discussion was about a catalog of all the world’s data catalogues. This is a tool aimed at a smaller group than even data catalogues, but it could be key in decision making and I suggested the people working on it have a look at ROAR: Repository of Open Access Archives which is a catalog of 2200+ repositories. It has been redesigned from the first attempt and captures useful information for understanding the community; like software used, activity of each repository (update frequency), counrty, purpose etc. Much the same will be useful to the data-cat-cat.

Something like http://data-ac-uk.ecs.soton.ac.uk/ (maybe becoming http://data.ac.uk at some point) would be one of the things which would feed this monster.

Conclusion

All in all a great trip, except for the flight back where pilot wasn’t sure if the landing flaps were working so we circled for about an hour and at one point he came out with a torch to have a look at the wings! All was fine and the poor ambulance drivers and firemen had a wasted trip to the airport. Still, better to have them there and not need them!

Jonathan Gray has transfered the notes from the meeting to a wiki.

Posted in Repositories.

Tagged with .


3 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.

  1. Ed Summers says

    I’m kind of surprised there was no mention of using Atom? The Feed Paging and Archiving extension provides a simple resumption-token-like mechanism for breaking up the ordered result set. And if you order the feed (which is typical) what more do you need?

  2. C.Gutteridge says

    Sorry, ATOM did come up. It may be that ATOM would do the job (I didn’t know it handled resumption tokens). We were more focussed on the process required and the information required in each step of the exchange. The problem with ATOM is that one of the CKAN guys is very keen to be able to access the information via JSONP to do stuff in-browser. I personally think this is an overloading of the problem and not a good idea, but I don’t fully understand his use case.

    Either way, we agreed that the jsonp method was for lightweight hacks and the .ttl was for full-blown harvesting. It may be that the jsonp thing is an almost unlrelated tool. JSON and TTL/ATOM have very different uses and if you shoehorn them into JSON it loses its elegance.

    Hmm. We probably could do what we need with ATOM indexing .TTL files, and moreover, ATOM allows publishing too, I think.

  3. Ed Summers says

    The desire for JSON and JSONP understandable–especially for JavaScript integration.

    For the JSON it might be nice to consider something like what Google provide in JSONC, which is their JSON flavored Atom.



Some HTML is OK

or, reply to this post via trackback.