{"id":725,"date":"2011-05-05T09:17:59","date_gmt":"2011-05-05T09:17:59","guid":{"rendered":"http:\/\/blog.soton.ac.uk\/webteam\/?p=725"},"modified":"2011-05-05T09:25:52","modified_gmt":"2011-05-05T09:25:52","slug":"data-catalogue-interoperability-meeting","status":"publish","type":"post","link":"https:\/\/blog.soton.ac.uk\/webteam\/2011\/05\/05\/data-catalogue-interoperability-meeting\/","title":{"rendered":"Data Catalogue Interoperability Meeting"},"content":{"rendered":"<p>I&#8217;ve just come back from a two-day meeting in Edinburgh on making data catalogs interoperable.<\/p>\n<p>There are some interesting things going on and CKAN is by far the most established data catalog tool, but it frustrates me that they are basically re-implementing EPrints starting from a different goal. On the flip side, maybe parallel evolution is a natural phenomenon. Nonetheless, the CKAN community should pay more attention to the now mature repository community.<\/p>\n<p>Data catalogs have a few quirks. One is that they are really not aimed at the general public. Only a small number of people can actually work with data and this should inform decisions.<\/p>\n<p>The meeting had a notable split in methodologies, but not an acrimonious one. In one camp we have the URIs+RDF approach (which is now my comfort zone) and in the other GUIDs plus JSON. The consensus was that JSON &amp; RDF are both useful for different approaches. Expressing lossless RDF via JSON just removes the benefits people get from using JSON (easy to parse &amp; understand at a glance).<\/p>\n<p>My interests here are twofold. One is that I represent <a href=\"http:\/\/data.southampton.ac.uk\">data.southampton.ac.uk<\/a>, which is one of the first organisation-level data catalogs. These will be numerous and small. The other is that I hope that repositories (EPrints) will begin to be used to store research data. 
A dataset repository is almost certainly also a dataset catalogue.<\/p>\n<h2>Catalog or Catalogue?<\/h2>\n<p>The dataset catalog community has settled on the en-US spelling of the word, but I tend to swing between the two. Apologies for that.<\/p>\n<h2>A Dataset by any other Name<\/h2>\n<p>A key issue is that dataset and dataset catalogue are very loaded terms. We agreed, for the purposes of interoperability, that a dataset record is something which describes a single set of data, not an aggregation. Each distribution of a dcat:Dataset should give access to the whole of the dataset(ish). Specifically this means that a dataset (lower-case d) which is described as the sum of several datasets is a slightly different catalog record and may be described as a list of simple dcat:Datasets.<\/p>\n<p>Roughly speaking, the model of an (abstract, interoperable) data catalog is:<\/p>\n<ul>\n<li>Catalog\n<ul>\n<li>Dataset (simple kind)\n<ul>\n<li>Distributions (endpoints, download URLs, indirect links to the pages you can get the data from, or instructions on how to get the data)<\/li>\n<\/ul>\n<\/li>\n<li>Collections<\/li>\n<li>Licenses<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>We agreed that <a href=\"http:\/\/vocab.deri.ie\/dcat#\">DCAT<\/a> was pretty close to what was needed, but with a few tweaks. The CKAN guys come from the Open Knowledge Foundation, so handling distributions of data which require other kinds of access, such as a password, a license agreement or even &#8220;show up to a room with a filing cabinet&#8221;, was outside their usual scope but will be important for research data catalogues.<\/p>\n<p>We discussed &#8216;abuse&#8217; of dcat:accessURL &#8211; it sometimes gets used very ambiguously when people don&#8217;t have better information. 
The plan is to add dcat:directURL, which is the actual resource from which a serialisation or endpoint is available.<\/p>\n<p>Services vs Apps: we agreed that services which give machine-friendly access to a dataset, such as SPARQL or an API, are distributions of a dataset, but applications giving humans access are not.<\/p>\n<p>We agreed that, in addition to dc:identifier, DCAT should support a globally unique ID (a string which can be a GUID or a URI or other) which can be used for de-duplication.<\/p>\n<p>Provenance is an issue we skirted around but didn&#8217;t come up with a solid recommendation for. It&#8217;s important &#8211; we agreed that!<\/p>\n<h2>Very Simple Protocol<\/h2>\n<p>At one point we nearly reinvented OAI-PMH, which would be rather pointless. The final draft of the day defines the method to provide interoperable data, and the information to pass, but deliberately not the exact encoding, as some people want Turtle and some JSON. It should be easy to map from Turtle to JSON, but in a lossy way.<\/p>\n<p>A nice design feature is that it takes a single URL with an optional parameter which the data catalog can ignore. In other words, the degenerate case is you create the entire response as a catalog.ttl file and put it in a directory! The possible endpoint formats are initially .json, .ttl and (my ideal right now) maybe .ttl.gz.<\/p>\n<p>The request returns a description of the catalog and all records. 
It can be limited to catalog records changed since a date using ?from=DATE, but obviously if you do that on a flat file you&#8217;ll still get the whole thing.<\/p>\n<p>For huge sites, it can also optionally include a continuation URL to get the next page of records.<\/p>\n<p>The information returned is the URL to get the metadata for the catalog record (be it license, collection or dataset) in .ttl or .json depending on the endpoint format, the last modified time for the catalog record (not the dataset contents), the globally unique ID (or IDs&#8230;) of the dataset it describes, and an indication of whether the record has been removed from the catalog, possibly with the removal time.<\/p>\n<p>Harvesters should obey directives from robots.txt.<\/p>\n<p>All in all I&#8217;m pleased with where this is going. It means you can easily implement this with a fill-in-the-blanks approach for smaller catalogs. A validator will be essential, of course, but it&#8217;ll be much less painful to implement than OAI-PMH (though less versatile).<\/p>\n<h2>csv2rdf4lod<\/h2>\n<p>I learned some interesting stuff from John Erickson (from Jim Hendler&#8217;s lot). They are following very similar patterns to what I&#8217;ve been doing with Grinder (CSV --grinder--&gt; XML --XSLT--&gt; RDF\/XML --&gt; Triples).<\/p>\n<p>One idea I&#8217;m going to nick is that they capture the event of downloading data from URLs as part of the provenance they store.<\/p>\n<h2>One Catalog to Rule them All<\/h2>\n<p>The final part of the discussion was about a catalog of all the world&#8217;s data catalogues. This is a tool aimed at a smaller group than even data catalogues, but it could be key in decision making and I suggested the people working on it have a look at <a href=\"http:\/\/roar.eprints.org\/\">ROAR: Registry of Open Access Repositories<\/a>, which is a catalog of 2200+ repositories. 
It has been redesigned from the first attempt and captures useful information for understanding the community, such as software used, activity of each repository (update frequency), country, purpose etc. Much the same will be useful to the data-cat-cat.<\/p>\n<p>Something like <a href=\"http:\/\/data-ac-uk.ecs.soton.ac.uk\/\">http:\/\/data-ac-uk.ecs.soton.ac.uk\/<\/a> (maybe becoming http:\/\/data.ac.uk at some point) would be one of the things which would feed this monster.<\/p>\n<h2>Conclusion<\/h2>\n<p>All in all a great trip, except for the flight back, where the pilot wasn&#8217;t sure if the landing flaps were working so we circled for about an hour and at one point he came out with a torch to have a look at the wings! All was fine and the poor ambulance drivers and firemen had a wasted trip to the airport. Still, better to have them there and not need them!<\/p>\n<p>Jonathan Gray has transferred the <a href=\"http:\/\/wiki.okfn.org\/OpenDataCatalogues\/2\">notes from the meeting<\/a> to a wiki.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve just come back from a two-day meeting in Edinburgh on making data catalogs interoperable. There are some interesting things going on and CKAN is by far the most established data catalog tool, but it frustrates me that they are basically re-implementing EPrints starting from a different goal. 
On the flip side, maybe parallel evolution [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[352,313],"tags":[],"class_list":["post-725","post","type-post","status-publish","format-standard","hentry","category-data","category-repositories"],"_links":{"self":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/725","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/comments?post=725"}],"version-history":[{"count":4,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/725\/revisions"}],"predecessor-version":[{"id":729,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/725\/revisions\/729"}],"wp:attachment":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/media?parent=725"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/categories?post=725"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/tags?post=725"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}