Licenses in data.Southampton

June 14, 2011
by Christopher Gutteridge

I got the following enquiry a few days ago (reproduced with permission), and figured the response would make a good blog post (which also saves me answering people individually):

While developing our site about Tsinghua University OpenData, we met some question about licence & copyright.

Some data we got are crawled from public homepages of our university’s organizations and faculties. And we are not sure if it’s proper to release these data.

In your project of Southampton Open Data, I noticed that most of the datasets are published under CreativeCommons, and I found Open Government Licence on your homepage.

Do your have any data source that may have copyright issue while collecting data? How do you deal with that?

Thanks a lot! Look forward to your reply!

I’m going to be honest in the response as that will help people see where we are now. I am not a lawyer and can’t offer legal advice. We are doing our best to get it right, while not slowing down the progress we’re making.

We apply licenses per dataset. In some ways that helps define the scope of a dataset: a dataset is a bunch of data with shared metadata, and the license is part of that metadata.
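As a rough illustration of "license per dataset", here is a minimal Python sketch. It is not the code behind data.southampton; the `Dataset` class, its field names and the example record identifiers are all assumptions made up for this example. The only point it makes is that individual records carry no license of their own and simply inherit whatever the dataset says.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The OGL URL quoted later in this post, used here as a license identifier.
OGL = "http://www.nationalarchives.gov.uk/doc/open-government-licence/"


@dataclass
class Dataset:
    """A bunch of records that share one set of metadata (hypothetical model)."""
    title: str
    license: Optional[str] = None  # license URI, or None if no license is set
    records: List[str] = field(default_factory=list)

    def license_for(self, record: str) -> Optional[str]:
        # Records have no license of their own; they inherit the dataset's.
        if record not in self.records:
            raise ValueError("record does not belong to this dataset")
        return self.license


# Illustrative values only, not real data.southampton metadata.
licensed = Dataset("Example licensed dataset", license=OGL, records=["record/1"])
unlicensed = Dataset("Example unlicensed dataset", records=["record/2"])
```

Asking for `licensed.license_for("record/1")` gives back the OGL URL, while `unlicensed.license_for("record/2")` gives `None`, because the dataset as a whole has no license set.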

Open Government License

In general, we use the UK Government’s Open Government License (OGL, http://www.nationalarchives.gov.uk/doc/open-government-licence/), which really is a lovely bit of work. At first glance it’s very like the Creative Commons “cc-by” license, which is sometimes called “must attribute”.

However, it’s got some clever little restrictions which make it easier for your management to feel comfortable releasing the data, as they address some of the key concerns:

ensure that you do not use the Information in a way that suggests any official status or that the Information Provider endorses you or your use of the Information;
ensure that you do not mislead others or misrepresent the Information or its source;
ensure that your use of the Information does not breach the Data Protection Act 1998 or the Privacy and Electronic Communications (EC Directive) Regulations 2003.

So, suppose a railway used this for timetables. If someone took a train timetable under this license and published the train times on a porn site, that’s OK. But if they deliberately gave out slightly incorrect times to make the trains look bad, that’s not OK. If they claimed to be the train company in order to sell tickets on commission, that’s not OK either. The DPA bit doesn’t mean anything outside the UK, of course.

It gives people lots of freedom but stops them doing the obvious malicious things that are not actually illegal.

NULL License

Another license we use is a lack of a license. Maybe I should add a URI for the deliberate rather than accidental omission of the license?
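To show why such a URI might be useful, here is a small Python sketch that tells "we decided not to license this" apart from "nobody recorded a license". Everything in it is hypothetical: the marker URI is an invented example.org address, not something data.southampton defines, and the function and enum names are made up for this illustration.

```python
from enum import Enum
from typing import Optional

# Hypothetical marker URI, invented for this sketch only.
NO_LICENSE_URI = "http://example.org/license/deliberately-unlicensed"


class LicenseStatus(Enum):
    LICENSED = "licensed"
    DELIBERATELY_UNLICENSED = "deliberately unlicensed"
    NOT_RECORDED = "not recorded"


def license_status(license_uri: Optional[str]) -> LicenseStatus:
    """Classify a dataset's license metadata value."""
    if license_uri is None or not license_uri.strip():
        # Nothing recorded at all: possibly an accidental omission.
        return LicenseStatus.NOT_RECORDED
    if license_uri == NO_LICENSE_URI:
        # Someone explicitly stated that no license has been granted.
        return LicenseStatus.DELIBERATELY_UNLICENSED
    return LicenseStatus.LICENSED
```

A consumer of the metadata could then treat `NOT_RECORDED` as "ask the publisher" and `DELIBERATELY_UNLICENSED` as "do not reuse without permission".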

I have to be very careful about slapping a license on things. Without permission of the data owner, I don’t do it.

A couple of examples of datasets which at the time of writing have no license:

EPrints.soton: people are still looking into the issues with this. The problem is that the database may at some point have imported some abstracts from a service without an explicit license to republish. It’s a small issue, but we are trying to be squeaky clean because it would be very counterproductive to have any embarrassing cock-ups in the first year of the open data service. All the data’s been available via OAI-PMH for years, so it’s a low risk, but until I get the all clear from the data owner I won’t do anything. The OGL has the lovely restriction of not covering “third party rights the Information Provider is not authorised to license;” but we shouldn’t knowingly put out such data. My ideal result here is guidance from the government that publishing academic bibliographic metadata is always OK, but I’ve not had that instruction yet.
Southampton Bus Routes & Stops: I’ve been told over the phone by the person running the system that he considers it public domain, but until I’ve got that in writing I’m not putting a license on it. Even if he says public domain, I’m inclined towards OGL as it prevents the kinds of malicious use I outlined earlier.

CC-BY

We may use this in a couple of places. Its only win over OGL is that it’s more widely understood, but I think the extra restrictions of OGL are a good thing.

CC-Zero

This is pretty much saying “public domain”. It’s giving an unlimited license on the data. We use this for the Electronics and Computer Science Open Data, which acted as a prototype for data.southampton (boy, we made some mistakes, read the webteam blog and this blog for more details).

We’ve never yet had anybody do anything upsetting with the ECS RDF, but I’m inclined to relicense future copies as OGL, as it adds the protection against malicious but non-illegal uses.

Creative Evil

Out of interest, I challenge the readers to suggest in the comments harmful, or embarrassing, things they could do with the data.southampton data if it was placed in the public domain, rather than having an OGL license. It’s useful to get some ideas of what we need to protect ourselves against.

If you have some evil ideas of what you could do within the restrictions of the OGL, or with data that has no license, please send them to me privately. I don’t want to actually bring my project into disrepute, just to get some ideas of what spammers, and people after lulz, might do. Better to think about what bolts the stable door needs well in advance.

3rd Party Data

I’ve got a lovely dataset which I’ve added but not yet written metadata for. It maps the disability information hosted by a group called “disabledgo” to the URIs for buildings, sites and points of service. For example, http://www.disabledgo.com/en/access-guide/zepler-building/university-of-southampton is mapped to the URI for that building, and gets a neat little link in http://data.southampton.ac.uk/building/59.html
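The shape of that dataset is easy to sketch in Python. This is an illustration only, not the real dataset or the code that builds the building pages: the building URI below is an assumed form (the post only gives the .html page address), and the function names are invented for the example.

```python
import html
from typing import Optional

# Hand-made mapping from a building URI to its DisabledGo access guide URL.
# ASSUMPTION: the URI form "http://id.southampton.ac.uk/building/59" is a guess
# for illustration; only the DisabledGo URL is taken from the post itself.
ACCESS_GUIDES = {
    "http://id.southampton.ac.uk/building/59":
        "http://www.disabledgo.com/en/access-guide/zepler-building/university-of-southampton",
}


def access_guide_for(building_uri: str) -> Optional[str]:
    """Return the access guide URL mapped to a building, or None if unmapped."""
    return ACCESS_GUIDES.get(building_uri)


def access_guide_link(building_uri: str) -> str:
    """Render the link a building page could show; empty string if unmapped."""
    url = access_guide_for(building_uri)
    if url is None:
        return ""
    return '<a href="{}">Access guide</a>'.format(html.escape(url, quote=True))
```

Note that the mapping holds only links, which is the part created by hand; none of the survey content from the access guides themselves is copied into it.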

I created this dataset by hand by finding every URL and mapping it myself, so I have the right to place any license on it I choose. I also added in some data I screen scraped from their site (flags indicating disabled parking, good disabled toilets etc.). I checked with disabledgo and they asked me not to republish that data, so I can’t.

We pay them to conduct these surveys, and our contract does not specify the owner of the data. I’m hoping we might actually renegotiate next year to be allowed to republish the data, but it would be far better if *they* published under an open license and we just used their open data. Probably that’s still a few years off.

Either way, it’s a nice demo of the issues facing us. They are friendly and helpful; they just don’t want anyone diluting the meaning of their icons, which they define strictly.

Screen Scraping

Very little data in data.southampton is screen scraped. Exceptions are the trivia about buildings (year of construction, architect etc.) and some of the information about teaching locations, including their photos, and the site which lists experts who can talk to the press on various subjects.

I have a clear remit from the management of the university to publish under an open license anything which would be subject to a “Freedom of Information” (FOI) request. In the long run we can save a fair bit of hassle and money by pointing people at the public website.

The advantage I have over most other Open Data projects is that I’m operating under direct instructions from the heads of Finance, Communications, iSolutions (the silly name we give our I.T. team, which I’m part of) and the VC’s office. This means that I can reasonably work with anything owned by the organisation.

Another rule of thumb I was given is that if it’s already on the web as HTML or PDF then it might as well be data too! It’s not a strict rule, as obviously there are some things which might not be appropriate, but I’ve not had much to screen scrape yet.

Categories: Licenses and Policy.

News and Ideas from the Southampton Open Data Team

Southampton Open Data Blog