Southampton Web and Data Innovation Team

Ideas and Tips from the Team

Combining and republishing datasets with different licenses

We’ll soon be launching data.ac.uk! Right now it’s all a bit of a work in progress. The plan is for us to start with a few useful subdomains then have other subdomains run by other organisations. Southampton neither can nor should be the sole proprietors.

The goal of the domain is to provide a permanent home for URIs, datasets and services. The problem with the .ac.uk level scheme is that sites are named either after an organisation, or after a project. But a good service should outlive the project which creates it, and if you’re trying to create a linked data resource for the ages then using http://www.myuni.ac.uk/~chris/project/2008/schema/ as your namespace is a ticking timebomb of breakiness.

There are several different projects to create sub-sites right now. These are all focused on “infrastructure” rather than “research” data, but that should not be seen as a firm precedent. That said, UK-level services for research data are artificial: it shouldn’t matter where good data comes from. From a practical point of view, though, the UK is a funder of research, so there may be times when national aggregation and services are created.

For projects like Gateway to Research to create good linked data they’ll need good URIs. Obviously some of their data structures are going to be complex and specialised, but we want solid URIs for institutions, funding bodies, projects, researchers, publications, patents etc.

hub.data.ac.uk

OK, this is the bit this post was supposed to actually be about.

One of the sub-domains which already exists is http://hub.data.ac.uk/ which is intended as a hub for UK academia open-data services. It has a hand-maintained list of the current open data services and their contacts. We also set it up so that it would periodically resolve the self-assigned URI for each university, and combine the triples it found there into a big document which you could query in one go.

The first problem we encountered for this was that Oxford and Southampton have chosen to make their “self assigned” URIs resolve to short RDF documents describing the organisation [Oxford] [Southampton]. However, the Open University made a different assumption about what should happen when you resolve their URI. Their service generates a document describing every triple referencing their university. This isn’t wrong; it’s just large and answers a different question.

To address this we’ve hit on the idea of asking each open data service to produce a “Profile Document” which may be what their self-assigned URI redirects to, but will also be auto-discoverable from their main website. We can (more) safely download this, knowing more or less what to expect, and we can provide standard ways to describe elements which may be useful to list on hub.data.ac.uk.

Combining Datasets

The problem I’m facing this week is how to handle combining datasets with multiple licenses.

Right now I’m thinking:

For every source dataset, include a “provenance event” describing where it was downloaded from, and the license on the document that was used as the source.

nb. this is not proper RDF, I’m just explaining my thoughts:

 <#event27> a ProvenanceEvent ;
     source <http://www.example.ac.uk/profile.rdf> ;
     action <downloaded> ;
     result <#source27> .

 <http://www.example.ac.uk/profile.rdf> 
     license <Open government License> ;
     attribution "University of Examples" .

 <#event28> a ProvenanceEvent ;
     source <#source20>,<#source21>,<#source22>,<#source27> ;
     action <merge> ;
     result <> .
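
The same bookkeeping can be sketched outside RDF. This is only a rough Python illustration of the idea, not how hub.data.ac.uk does it: the class names, the `record_merge` helper, the `#combined` identifier and the second source are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A downloaded document plus the license and attribution it declared."""
    url: str
    license: str
    attribution: str


@dataclass
class ProvenanceEvent:
    """One step in the history of the combined dataset."""
    action: str    # "downloaded" or "merge"
    sources: list  # identifiers of what the step consumed
    result: str    # identifier of what the step produced


def record_merge(downloads):
    """Return one download event per source, followed by a single merge event."""
    events = []
    merged_inputs = []
    for number, source in enumerate(downloads, start=1):
        result_id = f"#source{number}"
        events.append(ProvenanceEvent("downloaded", [source.url], result_id))
        merged_inputs.append(result_id)
    # "#combined" is a hypothetical identifier for the merged document.
    events.append(ProvenanceEvent("merge", merged_inputs, "#combined"))
    return events


downloads = [
    Source("http://www.example.ac.uk/profile.rdf", "OGL", "University of Examples"),
    # Hypothetical second source, included only to make the merge non-trivial.
    Source("http://www.example.org/profile.rdf", "CC0", "Example Institute"),
]
events = record_merge(downloads)
licenses_in_play = sorted({source.license for source in downloads})
```

Keeping the license on the source record rather than on the event means the set of licenses in play for the merged document can be read straight off the inputs.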

OK. So the above is true but I’m not sure how useful it is. If I’m using a dataset, all I really want to know is:

Can I use it for the purpose I have in mind?
What restrictions does it place on me?
What obligations (attribution) does it place on me?

So far as I can see, combining datasets with different licenses results in a dataset which is licensed under all of them at the same time. This isn’t the same as when software is “dual licensed” and you can pick which license applies; this dataset is simultaneously under several licenses (like wiring them in series, rather than in parallel). Even a “must attribute” license gets out of hand with data from 180 sources (BSD was modified for a reason!).
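
To make the series-versus-parallel point concrete, here is a small Python sketch. The obligation labels are my own shorthand and an assumption, not a legal reading of any of these licenses, and the function names are invented for the example.

```python
# Illustrative obligation labels only; these are assumptions, not legal readings.
LICENSE_OBLIGATIONS = {
    "CC0": set(),
    "ODCA": {"attribute", "keep notices intact"},
    "OGL": {"attribute", "link to licence"},
}


def combined_obligations(licenses):
    """Licenses "in series": the result carries every obligation of every input."""
    obligations = set()
    for name in licenses:
        obligations |= LICENSE_OBLIGATIONS[name]
    return obligations


def dual_license_choices(licenses):
    """Contrast, "in parallel": dual licensing lets the user pick one set of terms."""
    return {name: LICENSE_OBLIGATIONS[name] for name in licenses}


merged = combined_obligations(["CC0", "ODCA", "OGL"])
```

With combined data the user gets the union of the obligations; with dual licensing they would get to choose one entry from the dictionary instead.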

The licenses we’re planning to accept (or at least recommend) are, in order of increasing restrictions, CC0, ODCA and OGL.

One option we’re considering is to provide several downloads:

CC0 data only, under a CC0 license
CC0 and ODCA data only, under an ODCA license (with a long attribution list)
CC0, ODCA and OGL data, under the OGL (with a longer attribution list)
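
The three downloads above could be built along these lines. This is a sketch under my own assumptions: each source is reduced to an (attribution, license) pair, the function name is made up, and treating CC0 sources as needing no attribution entry is my reading rather than legal advice.

```python
# Licenses in the post's order of increasing restriction.
LICENSE_ORDER = ["CC0", "ODCA", "OGL"]


def build_downloads(sources):
    """Group (attribution, license) pairs into one bundle per license tier.

    Each tier contains every source whose license is no more restrictive
    than the tier's own license, and is published under that tier's license.
    """
    downloads = {}
    for tier, ceiling in enumerate(LICENSE_ORDER):
        allowed = set(LICENSE_ORDER[: tier + 1])
        included = [(who, lic) for who, lic in sources if lic in allowed]
        downloads[ceiling] = {
            "sources": [who for who, _ in included],
            # Assumption: CC0 sources do not need to appear in the attribution list.
            "attribution": sorted(who for who, lic in included if lic != "CC0"),
        }
    return downloads


# Hypothetical sources, one per license.
sources = [("Uni A", "CC0"), ("Uni B", "ODCA"), ("Uni C", "OGL")]
downloads = build_downloads(sources)
```

The attribution list grows with each tier, which is exactly the part that gets unwieldy with many sources.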

I’m not a lawyer, but this seems to go with the intent of the original publishers’ licences.

There’s also the issue of the ODCA phrase “keep intact any notices on the original database” which would be easy to do if combining datasets by hand, but is going to be very difficult to automate. What if their notice turned out to be in the XML comments in an RDF/XML file?
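
Finding the comments is the easy half. A minimal Python sketch using the standard library’s `xml.dom.minidom`, which keeps comment nodes, might look like this; the sample document and the function name are invented, and deciding which comments count as a legal “notice” is the part that stays hard.

```python
from xml.dom import minidom


def find_comment_notices(rdf_xml):
    """Return the text of every XML comment in the document, in document order."""
    document = minidom.parseString(rdf_xml)
    notices = []

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == child.COMMENT_NODE:
                notices.append(child.data.strip())
            else:
                walk(child)

    walk(document)
    return notices


# Hypothetical RDF/XML file with one notice-like comment and one incidental one.
sample = """<?xml version="1.0"?>
<!-- Copyright University of Examples. Keep this notice intact. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <!-- generated nightly -->
  <rdf:Description rdf:about="http://www.example.ac.uk/"/>
</rdf:RDF>
"""
notices = find_comment_notices(sample)
```

An aggregator could carry every extracted comment along with the source’s provenance record, but that still does not tell it which ones the publisher meant as a notice.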

I came quite late to the Semantic Web, so I suspect many of these issues were discussed a decade ago; any tips or leads from the community would be most welcome.

All in all, my favorite license remains the “please attribute” rather than “must attribute”. It’s legally the same as CC0, and makes no additional requirements for reuse, but just asks nicely if you could credit the source when and if convenient.

Posted in RDF.

Tagged with Data.

By Christopher Gutteridge – November 29, 2012

3 Responses

Carsten Keßler says

I’ve heard people say that attribution is built in with Linked Data, as any use of a URI implies that you are linking back to the original source (if the owner has their system set up correctly, so that the URI resolves). I’m not a lawyer either, though.

November 29, 2012, 12:17 pm
- Christopher Gutteridge says
  
  I love the concept in theory, but it’s not true. For example, I can create a dataset mapping dbpedia items to some other taxonomy. None of the URIs involved in the data belong to a site I control, but the combined set is a document I’ve created, under my license.
  
  November 29, 2012, 1:43 pm
Ronan Klyne says

Some more ideas to throw into the pot:
* Describing separately which license applies and what that entails. This allows lawyers to say (for example) “only use OGL” and allows you to then query for that.
* Listing the common concerns about licenses as properties of the license so that you can also retrieve any data that, for example, permits commercial re-use.
Something a bit like this:

licenses:license1 a license:License
; license:forbids license:resale
; license:mandates license:attribution
.
licenses:license2 a license:License
; license:forbids license:removingCopyright
; license:mandates license:sourceCodePublication
.
license:releasedUnder
license:releasedUnder

license:partsReleasedUnder
license:partsReleasedUnder

In this model the dual license thing would be the special case, using some kind of “your choice of these licenses” construct – again easily queryable.

Also, IMHO it’s fair to ignore anything in an RDF comment and still claim due diligence – it’s semantically insignificant data in a document made for machine reading. It’s on a par with deliberately microscopic small print. But I’m not a lawyer either, nor would I make a good one.

December 19, 2012, 9:15 am
