
# Combining and republishing datasets with different licenses

We’ll soon be launching data.ac.uk! Right now it’s all a bit of a work in progress. The plan is for us to start with a few useful subdomains, then have other subdomains run by other organisations. Southampton neither can nor should be the sole proprietor.

The goal of the domain is to provide a permanent home for URIs, datasets and services. The problem with the .ac.uk level scheme is that sites are named either after an organisation or after a project. But a good service should outlive the project which creates it, and if you’re trying to create a linked data resource for the ages then using http://www.myuni.ac.uk/~chris/project/2008/schema/ as your namespace is a ticking timebomb of breakiness.

There are several different projects to create sub-sites right now. These are all focused on “infrastructure” rather than “research” data, but that should not be seen as a firm precedent. That said, UK-level services for research data are artificial: it shouldn’t matter where good data comes from, but from a practical point of view the UK is a funder of research, so there may be times when national aggregation and services are created.

For projects like Gateway to Research to create good linked data they’ll need good URIs. Obviously some of their data structures are going to be complex and specialised, but we want solid URIs for institutions, funding bodies, projects, researchers, publications, patents etc.

### hub.data.ac.uk

OK, this is the bit this post was supposed to actually be about.

One of the sub-domains which already exists is http://hub.data.ac.uk/ which is intended as a hub for UK academia open-data services. It has a hand-maintained list of the current open data services and their contacts. We also set it up so that it periodically resolves the self-assigned URI for each university and combines the triples it finds there into a big document which you can query in one go.

The first problem we encountered with this was that Oxford and Southampton have chosen to make their “self assigned” URIs resolve to short RDF documents describing the organisation [Oxford] [Southampton]. However, the Open University made a different assumption about what should happen when you resolve their URI. Their service generates a document describing every triple referencing their university. This isn’t wrong; it’s just large and answers a different question.

To address this we’ve hit on the idea of asking each open data service to produce a “Profile Document” which may be what their self assigned URI redirects to, but will also be auto discoverable from their main website. This we can (more) safely download knowing more or less what to expect, and we can provide standard ways to describe elements which may be useful to list on hub.data.ac.uk.
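As a rough illustration of what “auto discoverable” could mean in practice, here is a minimal Python sketch. It assumes the profile is advertised on the homepage with a `<link rel="alternate">` element carrying an RDF media type; that mechanism, the `ProfileLinkFinder` class and the `find_profiles` helper are all assumptions made for this example, not something hub.data.ac.uk has settled on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProfileLinkFinder(HTMLParser):
    """Collect hrefs of <link> elements that advertise an RDF document."""

    # Assumption: the profile is advertised as an "alternate" link with one
    # of these media types. The real discovery mechanism may differ.
    RDF_TYPES = {"application/rdf+xml", "text/turtle"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "alternate" in rel and attributes.get("type") in self.RDF_TYPES and href:
            # Resolve relative links against the page they were found on.
            self.profiles.append(urljoin(self.base_url, href))


def find_profiles(html_text, base_url):
    """Return the absolute URLs of any advertised profile documents."""
    finder = ProfileLinkFinder(base_url)
    finder.feed(html_text)
    return finder.profiles


# Hypothetical homepage markup, used only to exercise the sketch.
page = (
    '<html><head>'
    '<link rel="alternate" type="application/rdf+xml" href="/profile.rdf">'
    '</head></html>'
)
profiles = find_profiles(page, "http://www.example.ac.uk/")
print(profiles)
```

The point of the sketch is only that the hub would fetch one small, predictable document per site rather than whatever the organisation’s URI happens to resolve to.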

### Combining Datasets

The problem I’m facing this week is how to handle combining datasets with multiple licenses.

Right now I’m thinking:

For every source dataset, include a “provenance event” describing where it was downloaded from, and the license on the document that was used as the source.

nb. this is not proper RDF, I’m just explaining my thoughts:

    <#event27> a ProvenanceEvent ;
        source <http://www.example.ac.uk/profile.rdf> ;
        result <#source27> .

    <http://www.example.ac.uk/profile.rdf>
        attribution "University of Examples" .

    <#event27> a ProvenanceEvent ;
        source <#source20>,<#source21>,<#source22>,<#source27> ;
        action <merge> ;
        result <> .
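The same bookkeeping can be sketched in plain Python. This is only an illustration: the `SourceDocument` and `ProvenanceEvent` records, their field names, the `record_merge` helper and the second example URL are hypothetical, and nothing here corresponds to a real provenance vocabulary.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceDocument:
    """A downloaded source document (field names are illustrative)."""
    url: str
    attribution: str
    license: str


@dataclass
class ProvenanceEvent:
    """One step in the history of the combined dataset."""
    action: str                                  # "download" or "merge"
    sources: list = field(default_factory=list)  # what the step consumed
    result: str = ""                             # what the step produced


def record_merge(documents):
    """Return one download event per source, then a single merge event."""
    events = []
    for number, doc in enumerate(documents, start=1):
        events.append(ProvenanceEvent("download", [doc.url], f"#source{number}"))
    merged_from = [event.result for event in events]
    # "<>" stands for the combined document itself, as in the notes above.
    events.append(ProvenanceEvent("merge", merged_from, "<>"))
    return events


# Placeholder sources, used only to exercise the sketch.
docs = [
    SourceDocument("http://www.example.ac.uk/profile.rdf", "University of Examples", "CC0"),
    SourceDocument("http://www.example.org/profile.rdf", "Example Institute", "OGL"),
]
events = record_merge(docs)
for event in events:
    print(event.action, event.sources, "->", event.result)
```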

OK. So the above is true but I’m not sure how useful it is. If I’m using a dataset, all I really want to know is:

• Can I use it for the purpose I have in mind?
• What restrictions does it place on me?
• What obligations (attribution) does it place on me?

So far as I can see, combining datasets with different licenses results in a dataset which is licensed by all at the same time. This isn’t the same as when software is “dual licensed” and you can pick which license; this dataset is simultaneously under several licenses (like wiring them in series, rather than in parallel). Even a “must attribute” license gets out of hand with data from 180 sources (BSD was modified for a reason!).

The licenses we’re planning to accept (or at least recommend) are, in order of increasing restrictions, CC0, ODCA and OGL.

1. CC0 data only: under a CC0 license
2. CC0 and ODCA data only: under an ODCA license (with a long attribution list)
3. CC0, ODCA & OGL data: under the OGL (with a longer attribution list)

I’m not a lawyer, but this seems to go with the intent of the original publishers’ licences.
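The rule above amounts to “take the most restrictive license present, and collect attributions from everything that isn’t CC0”. A minimal Python sketch of that rule follows; the `combine_licenses` function and the sample source names are made up for illustration, and it assumes the three licenses really can be treated as a simple ordering.

```python
# Order of increasing restriction, as listed above.
LICENSE_ORDER = ["CC0", "ODCA", "OGL"]


def combine_licenses(sources):
    """Work out the license and attribution list for a merged dataset.

    sources: list of (attribution, license) pairs, one per source dataset.
    Returns (license, attributions): the most restrictive license found and
    the sorted attribution list for every source that is not CC0.
    """
    if not sources:
        raise ValueError("need at least one source")
    ranks = []
    for _, license_name in sources:
        if license_name not in LICENSE_ORDER:
            # Anything outside the accepted list is refused, not guessed at.
            raise ValueError(f"unrecognised license: {license_name}")
        ranks.append(LICENSE_ORDER.index(license_name))
    combined = LICENSE_ORDER[max(ranks)]
    attributions = sorted({name for name, lic in sources if lic != "CC0"})
    return combined, attributions


# Placeholder sources, used only to exercise the sketch.
combined, credits = combine_licenses([
    ("University of Examples", "ODCA"),
    ("Example Institute", "CC0"),
    ("Example Council", "OGL"),
])
print(combined, credits)
```

Refusing unknown licenses outright is a design choice in this sketch: a hub that silently merged data under a license it doesn’t understand would undermine the whole exercise.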

There’s also the issue of the ODCA phrase “keep intact any notices on the original database”, which would be easy to do if combining datasets by hand but is going to be very difficult to automate. What if their notice turned out to be in the XML comments in an RDF/XML file?
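For the XML comment case at least, the comments can be pulled out mechanically before the triples are merged. The following is a minimal Python sketch using the standard library’s `xml.dom.minidom`; the `collect_comments` helper and the sample document are invented for illustration, and deciding which comments actually count as a “notice” is still a human judgement.

```python
from xml.dom import minidom


def collect_comments(rdf_xml_text):
    """Return the text of every XML comment in the document, in order."""
    document = minidom.parseString(rdf_xml_text)
    found = []

    def walk(node):
        # Comments can sit before the root element as well as inside it.
        if node.nodeType == node.COMMENT_NODE:
            found.append(node.data.strip())
        for child in node.childNodes:
            walk(child)

    walk(document)
    return found


# Hypothetical RDF/XML file with notices hidden in comments.
sample = """<?xml version="1.0"?>
<!-- Licensed under ODCA. Please keep this notice intact. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <!-- Maintained by the University of Examples -->
  <rdf:Description rdf:about="http://www.example.ac.uk/"/>
</rdf:RDF>
"""
notices = collect_comments(sample)
print(notices)
```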

I came quite late to the Semantic Web and suspect many of these issues were discussed a decade ago, so any tips or leads from the community would be most welcome.

All in all, my favorite license remains the “please attribute” rather than “must attribute”. It’s legally the same as CC0 and makes no additional requirements for reuse, but just asks nicely if you could credit the source when and if convenient.

Posted in Data, RDF.

## 3 Responses


1. I’ve heard people say that attribution is built in with Linked Data, as any use of a URI implies that you are linking back to the original source (if the owner has their system set up correctly, so that the URI resolves). I’m not a lawyer either, though.

2. I love the concept in theory, but it’s not true. For example I can create a dataset mapping dbpedia items to some other taxonomy. None of the URIs involved in the data belong to a site I control, but the combined set is a document I’ve created, under my license.

3. Some more ideas to throw into the pot:
* Describing separately which license applies and what that entails. This allows lawyers to say (for example) “only use OGL” and allows you to then query for that.
* Listing the common concerns about licenses as properties of the license so that you can also retrieve any data that, for example, permits commercial re-use.
Something a bit like this:

.
.