Policy
Getting Real
December 18, 2012
by Christopher Gutteridge
Up until now the open data service has been run on a pretty much seat-of-our-pants approach. We’re actually at the point where one of our services, the events calendar, really needs to graduate into a normal university service. It requires a little regular TLC to deal with broken feeds. There are 74 feeds, so some break now and then. They were always breaking, but now at least someone notices. I (Chris) recently attended a course on the University “Change management” process (which is basically getting sign-off to modify live services, to reduce the impact and manage risk). I was pleasantly surprised to hear that the change management team actually use the events calendar to check if a change to live IT services might cause extra issues (eg. don’t mess with the wifi the weekend we’re hosting an international conference).
I always said that the success criterion for data.soton.ac.uk was that it becomes too important to trust me with (tongue in cheek, but not actually a joke). And, lo and behold, management has asked me to start looking at how to begin the (long) journey to having it be a normal university service.
I feel some fear, but not panic.
I’ve been trying to think about how to divide the service into logical sections and consider them separately.
I’ve discussed the workflow for the system before, but here’s a quick overview again.
Publishing System: This downloads data from various sources and turns it into RDF, publishes it to a web-enabled directory, then tells the SPARQL database to re-import it. This has just been entirely re-written by Ash Smith in command-line PHP. An odd choice, you might think, but it’s a language which many people in the university web systems team can deal with, so it beats Perl/Python/Ruby on those grounds. We’ve put it on GitHub. The working title is Hedgehog (I forget why), but we’ve decided that each dataset workflow is a quill, which sounds nice.
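To give a feel for the shape of a quill, here’s a minimal sketch (illustrative only, not Hedgehog’s actual code or API; the source URL, file paths and predicate are placeholders):

<?php
// Illustrative sketch only, not Hedgehog's actual code or API: the general
// shape of a "quill" is to fetch some source data, turn it into RDF and
// write the resulting document into a web-served directory. The source URL,
// file paths and predicate below are placeholders.

$rows = array_map('str_getcsv', file('http://example.org/buildings.csv', FILE_IGNORE_NEW_LINES));

$triples = '';
foreach ($rows as $row) {
    list($number, $name) = $row;
    $uri = 'http://id.southampton.ac.uk/building/' . rawurlencode($number);
    // Very naive escaping; a real workflow would use a proper RDF library.
    $triples .= '<' . $uri . '> <http://www.w3.org/2000/01/rdf-schema#label> "'
              . addslashes($name) . "\" .\n";
}

// Publish the document where the web server (and the SPARQL store) can see it.
file_put_contents('/var/www/data/buildings.nt', $triples);

A real quill does rather more than this (provenance, error handling, multiple output formats), but the fetch/convert/publish pattern is the core of it.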
SPARQL Database: This is 4store. It effectively just runs as a cache of the RDF documents the publishing system spits out; it contains nothing that can’t be recreated from those files.
SPARQL Front End: This is a hacked version of ARC2’s SPARQL interface, but it dispatches the requests to the 4store. It’s much friendlier than the blunt, minimal 4store interface. It also lets us provide some formats that 4store doesn’t, such as CSV.
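Roughly speaking, the front end just forwards the query and re-serialises the results. Here’s a sketch, with the endpoint URL and parameter handling assumed rather than taken from our code:

<?php
// Rough sketch of a thin front end: forward a SPARQL query to the 4store
// endpoint and add a CSV serialisation. The endpoint URL is an assumption,
// and error handling is left out.

$query = $_REQUEST['query'];

$ch = curl_init('http://localhost:8000/sparql/');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('query' => $query)));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/sparql-results+json'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$json = curl_exec($ch);
curl_close($ch);

$data = json_decode($json, true);

if (isset($_REQUEST['format']) && $_REQUEST['format'] === 'csv') {
    // Flatten the standard SPARQL JSON results structure into CSV rows.
    header('Content-Type: text/csv');
    $out  = fopen('php://output', 'w');
    $vars = $data['head']['vars'];
    fputcsv($out, $vars);
    foreach ($data['results']['bindings'] as $binding) {
        $row = array();
        foreach ($vars as $v) {
            $row[] = isset($binding[$v]) ? $binding[$v]['value'] : '';
        }
        fputcsv($out, $row);
    }
    fclose($out);
} else {
    header('Content-Type: application/sparql-results+json');
    echo $json;
}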
URI Resolver: This is pretty minimal. It does little more than look at the URI and redirect you to the same path on data.soton. It currently does some content negotiation (decides if /building/23 should go to /building/23.rdf or /building/23.html), but we’re thinking of making that a separate step. Yeah, it’s a bit more bandwidth, but meh.
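The negotiation step itself is nothing clever. A simplified sketch (the rules and paths here are illustrative):

<?php
// Sketch of the content-negotiation step: look at the Accept header and send
// /building/23 off to /building/23.rdf or /building/23.html on data.soton.
// The rules are deliberately naive and the hostnames illustrative.

$path   = $_SERVER['REQUEST_URI'];   // e.g. "/building/23"
$accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';

// Naive rule: clients asking for RDF/XML or Turtle get data, everyone else HTML.
if (preg_match('!application/rdf\+xml|text/turtle!', $accept)) {
    $target = 'http://data.soton.ac.uk' . $path . '.rdf';
} else {
    $target = 'http://data.soton.ac.uk' . $path . '.html';
}

header('HTTP/1.1 303 See Other');
header('Location: ' . $target);
exit;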
Resource Viewers: A bunch of PHP scripts which handle all the different types of resources, like buildings, products, bus stops etc. These are a bit hacky, and the Apache configuration under them isn’t something I’m proud of. Each viewer handles all the formats a resource can be presented in (RDF, HTML, KML etc.).
Website: The rest of the data.soton.ac.uk website is just PHP pages, some of which do some SPARQL to get information.
So here’s my thinking on getting some of this managed appropriately by business processes.
As a first step, create a clone of the publishing system on a university server and move some of the most stable and core datasets there. Specifically the organisation structure: codes, names, and parent groups in the org-chart, and also the buildings data — just the name, number and what site they are on. These are simple but critical. They also happen to be the two datasets that the events calendar depends on and so would have to be properly managed dependencies before the calendar could follow the same route.
The idea of this second data service, let’s call it reliable.data.soton.ac.uk, is that it would only provide documents for each dataset; all the fun stuff would stay (for now) on the dev server, and I really don’t want to get iSolutions monkeying around with SPARQL until they’ve got at least a little comfortable with RDF. The Hedgehog instance on reliable.data would still trigger the normal “beta” SPARQL endpoint to re-import the data documents when they change.
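The trigger itself can be very dumb. Something along these lines would do it, assuming 4store’s HTTP data interface; the host, port and graph URI here are illustrative, not our real endpoint:

<?php
// Very dumb re-import trigger, assuming 4store's HTTP data interface: PUT an
// RDF document at /data/<graph-uri> and the store replaces that graph. The
// host, port and graph URI are illustrative.

$graph    = 'http://id.southampton.ac.uk/dataset/organisation';
$document = file_get_contents('/var/www/data/organisation.rdf');

$ch = curl_init('http://localhost:8000/data/' . $graph);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, $document);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/rdf+xml'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);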
We could make sure that the schema for these documents was very well documented and that changes were properly managed, and could be tested prior to execution. I’m not sure how, but maybe university members could register an interest so that they could be notified of plans to change these. That would be getting value out of the process. For the buildings dataset, which is updated a few times a year, maybe even the republishing should have a prior warning.
The next step would be to move the events calendar into change management, and ensure that it only depended on the ‘reliable’ documents. This service is pretty static now in terms of functionality; although we’ve got some ideas for enhancements, these could be minor tweaks to the site, with the heavy lifting done on the ‘un-managed’ main data server.
Don’t get me wrong, I don’t love all this bureaucracy, but if open data services are to succeed they need to be embedded in business processes, not quick hacks.
Licenses in data.Southampton
June 14, 2011
by Christopher Gutteridge
We received this question about our approach to licensing:

“While developing our site about Tsinghua University OpenData, we met some questions about licence & copyright. Some data we got are crawled from public homepages of our university’s organizations and faculties, and we are not sure if it’s proper to release these data. In your project of Southampton Open Data, I noticed that most of the datasets are published under Creative Commons, and I found the Open Government Licence on your homepage. Do you have any data sources that may have copyright issues while collecting data? How do you deal with that? Thanks a lot! Look forward to your reply!”
I’m going to be honest in the response as that will help people see where we are now. I am not a lawyer and can’t offer legal advice. We are doing our best to get it right, while not slowing down the progress we’re making.
We apply licenses per dataset. In some ways that helps define the scope of a dataset; a dataset is a bunch of data with shared metadata.
Open Government License
In general, we use the UK Government’s Open Government License (OGL, http://www.nationalarchives.gov.uk/doc/open-government-licence/), which really is a lovely bit of work. At first glance it’s very like the Creative Commons “CC-BY” license, which is sometimes called “must attribute”.
However, it’s got some clever little restrictions which make it easier for your management to feel comfortable releasing the data, as they address some of the key concerns:
- ensure that you do not use the Information in a way that suggests any official status or that the Information Provider endorses you or your use of the Information;
- ensure that you do not mislead others or misrepresent the Information or its source;
- ensure that your use of the Information does not breach the Data Protection Act 1998 or the Privacy and Electronic Communications (EC Directive) Regulations 2003.
So, imagine a railway used this license for its timetables. If someone took the timetable and published the train times on a porn site, that’s OK. But if they deliberately gave out slightly incorrect times to make the trains look bad, that’s not OK. If they claimed to be the train company to sell tickets on commission, that’s not OK. The DPA bit doesn’t mean anything outside the UK, of course.
It gives people lots of freedom but restricts them from the obvious malicious exploits that are not actually illegal.
NULL License
Another license we use is a lack of a license. Maybe I should add a URI for the deliberate rather than accidental omission of the license?
I have to be very careful about slapping a license on things. Without permission of the data owner, I don’t do it.
A couple of examples of datasets which at the time of writing have no licence:
- EPrints.soton — people are still looking into the issues with this. The problem is that the database may at some point have imported some abstracts from services without an explicit licence to republish. It’s a small issue, but we are trying to be squeaky clean because it would be very counterproductive to have any embarrassing cock-ups in the first year of the open data service. All the data’s been available via OAI-PMH for years, so it’s a low risk, but until I get the all-clear from the data owner I won’t do anything. The OGL has the lovely restriction of not covering “third party rights the Information Provider is not authorised to license”, but we shouldn’t knowingly put out such data. My ideal result here is that the guidance from the government ends up being that publishing academic bibliographic metadata is always OK, but I’ve not had that instruction yet.
- Southampton Bus Routes & Stops — I’ve been told over the phone by the person running the system that he considers it public domain, but until I’ve got that in writing I’m not putting a license on it. Even if he says public domain, I’m inclined towards OGL as it prevents those kinds of malicious use I outlined earlier.
CC-BY
We may use this in a couple of places. Its only win over OGL is that it’s more widely understood, but I think the extra restrictions of OGL are a good thing.
CC-Zero
This is pretty much saying “public domain”. It’s giving an unlimited license on the data. We use this for the Electronics and Computer Science Open Data, which acted as a prototype for data.southampton (boy, we made some mistakes, read the webteam blog and this blog for more details).
We’ve never yet had anybody do anything upsetting with the ECS RDF, but I’m inclined to relicense future copies as OGL, as it adds the protection against malicious but non-illegal uses.
Creative Evil
Out of interest, I challenge readers to suggest in the comments harmful or embarrassing things they could do with the data.southampton data if it were placed in the public domain rather than having an OGL license. It’s useful to get some ideas of what we need to protect ourselves against.
If you have evil ideas of what you could do even under the restrictions of the OGL, or with the unlicensed data, please send them to me privately, as I don’t want to actually bring the project into disrepute, just to get some ideas of what spammers and people after lulz might do. Better to think about what bolts the stable door needs well in advance.
3rd Party Data
I’ve got a lovely dataset I’ve added but not yet added metadata for; it maps the disability information hosted by a group called “disabledgo” to the URIs for buildings, sites and points of service. eg. http://www.disabledgo.com/en/access-guide/zepler-building/university-of-southampton is mapped to the URI for that building, and gets a neat little link in http://data.southampton.ac.uk/building/59.html
I created this dataset by hand by finding every URL and mapping it myself, so I have the right to place any license on it I choose. I also added in some data I screen scraped from their site (flags indicating disabled parking, good disabled toilets etc.). I checked with disabledgo and they asked me not to republish that data, so I can’t.
We pay them to conduct these surveys, and our contract does not specify the owner of the data. I’m hoping we might actually renegotiate next year to be allowed to republish the data, but it would be far better if *they* published under an open license and we just used their open data. Probably that’s still a few years off.
Either way, it’s a nice demo of the issues facing us. They are friendly and helpful, just don’t want anyone diluting the meaning of their icons. They give them a strict meaning.
Screen Scraping
Very little data in data.southampton is screen scraped. Exceptions are the trivia about buildings (year of construction, architect etc.) and some of the information about teaching locations, including their photos, and the site which lists experts who can talk to the press on various subjects.
I have a clear remit from the management of the university to publish under an open license anything which would be subject to a “Freedom of Information” (FOI) request. In the long run we can save a fair bit of hassle and money by pointing people at the public website.
The advantage I have over most other Open Data projects is that I’m operating under direct instructions from the heads of Finance, Communications, iSolutions (the silly name we give our I.T. team, which I’m part of) and the VC’s office. This means that I can reasonably work with anything owned by the organisation.
Another rule of thumb I was given is that if it’s already on the web as HTML or PDF then it might as well be data too! It’s not a strict rule, as obviously there’s some things which might not be appropriate, but I’ve not had much to screen scrape yet.
Grasping the nettle and changing some URIs
March 24, 2011
by Christopher Gutteridge
We’ve realised that using UPPER CASE in some URIs looked fine in a spreadsheet but makes for ugly URLs, and if we’re stuck with them, we want them to look nice.
Hence I’ve taken an executive decision and renamed the URIs for all the Points of Service from looking like this
http://id.southampton.ac.uk/point-of-service/38-LATTES
to this
http://id.southampton.ac.uk/point-of-service/38-lattes
meaning the URL is now
http://data.southampton.ac.uk/point-of-service/38-lattes.html
This actually matters, as these are going to become the long term web pages for the catering points of service, so aesthetics are important, and “If t’were to be done, t’were best done quickly”.
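Purely as a sketch (I’m not promising this is exactly what we’ll run), a permanent redirect like this would keep any links to the old UPPER CASE form working while canonicalising on lowercase:

<?php
// Sketch only: redirect old UPPER CASE point-of-service URIs to the new
// lowercase form with a permanent redirect. Paths and hostname illustrative.

$path = $_SERVER['REQUEST_URI'];   // e.g. "/point-of-service/38-LATTES"

if (preg_match('!^/point-of-service/([^/]+)$!', $path, $m) && $m[1] !== strtolower($m[1])) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://data.southampton.ac.uk/point-of-service/' . strtolower($m[1]));
    exit;
}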
We’ve seen lots of visitors as a result of the Register article (roughly a 10x increase), which is nice.
I’ve just added in the lunchtime menu for the Nuffield. They are not yet quite taking ownership of their data, but that’s just a case of getting them some training. I’ve also talked today to the manager of the on-campus book shop to see if they want to list some prices and products. I’m thinking they could do well to list the oddball stuff they sell, like memory sticks & backpacks.
Mostly I’m preparing to tidy up the back-end code — it needs to be a bit more slick and logical; more on this later.
Also today our very own Nigel Shadbolt is featured in the first ever edition of the Google Magazine. (It’s a PDF!)
A question of policy
March 18, 2011
by Christopher Gutteridge
To make this site sustainable we’re going to have to work out some policies about scope. The student-run Southampton Open Wireless Network Group (SOWN) have produced a dataset about their wireless nodes, and the council has more data sources we could wrap into the site (eg. number of spaces in carparks).
This leads to a number of interesting policy questions which I’ve not got an easy answer for.
1. What data should we host on data.southampton.ac.uk (ie. allow it to be the primary source of the data and host a copy of the data dump)?
2. What data should we allow (or insist) to use id.southampton.ac.uk URIs?
3. Is data about the council a special case?
4. What data should we list as part of the data catalog?
5. What data should we import into the triple store?
6. What data should we recommend (via links)?
Right now it’s easy to say yes to lots of things, but we need to think about the future maintenance too.
I’m currently thinking that what we should do is, for now, say yes to council and other useful local data such as SOWN, but only under items 5 and 6 above, with the intention later of having a 2nd ‘authoritative’ triple store which only imports our authoritative datasets.
SOWN is a good test case as it’s a grey area. It’s a university society run by university members, but certainly not part of the university administration. As it’s coming from the owners of the data it *is* authoritative, but it’s not authoritative AND published by the University of Southampton.
Best dataset for the job
I’m also running into the question of how to divide data between datasets. For example, I’ve got:
- points of service & opening hours for SUSU and catering, provided by the catering manager
- menus for catering points of service, provided by the catering manager
- I’m hoping to get daily menus for a few catering points of service provided by the catering manager
- I’ve got opening hours for the theatre bar provided by their manager
- I’ve got menus for the theatre bar (from their menu!)
- Opening hours for local amenities (provided by a small group of postgrad volunteers)
- Student services points of service and hours, provided by the university student services and therefore authoritative
- Waste & recycle points (currently run by the student volunteers but we hope to hand that over to the authoritative source)
- Transport points such as the travel office, bike racks, parking etc. which were created by the student volunteers, but now are being curated by the data owner (the transport office).
- List of vending machines, sourced from our contractors, via catering, and then annotated with building numbers by me.
- Bus stops, taken from a list provided by the council.
It’s really hard to work out if these should be one dataset each, or if not, how to deal with them. Do I move the data out of the amenities (student-sourced) dataset when rows of data are taken over by the data owner? Should I have an ‘authoritative University of Southampton’ dataset including everything that qualifies, and a non-authoritative amenities dataset? Also, the bigger the dataset, the more often it’ll need to be republished.
I am almost certainly going to make the ‘today’s menu’ dataset separate due to it having to be updated daily.
A key reason to use separate datasets has been to filter things. I think it makes more sense to include this in the data itself than rely on the dataset. My current thinking is that we should rearrange the data to be based around provenance, so:
- Authoritative services, including buildings & estates, catering, menus and vending machines.
- Today’s menus (because they change so fast); it’s a daily addendum to the previous set.
- Nuffield Theatre Bar times & menus (authoritative, but not from the University)
- Non-authoritative (Colin-sourced) amenities
- Bus Stops
Menus for the local coffee shop and the nearest pubs (Brewed Awakening, Crown, Stile) can be included in the non-authoritative datasets.
It leads to a change in some underlying technology for me, as currently each dataset only contains one “type” of record, eg. a set of prices OR a set of points-of-service.
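To show what I mean, here’s a sketch where the filtering moves into the query rather than the choice of dataset; the graph URI is invented for illustration and the endpoint URL is an assumption:

<?php
// Sketch of filtering by type inside a mixed, provenance-based dataset rather
// than relying on one-type-per-dataset. The graph URI is invented for
// illustration and the endpoint URL is an assumption.

$graph = 'http://id.southampton.ac.uk/dataset/authoritative-services';
$type  = 'http://purl.org/goodrelations/v1#LocationOfSalesOrServiceProvisioning';

$query = "
SELECT ?pos ?label WHERE {
  GRAPH <$graph> {
    ?pos a <$type> ;
         <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  }
}";

// Hand the query to the SPARQL endpoint (URL assumed for the example).
$results = file_get_contents(
    'http://sparql.data.southampton.ac.uk/?' . http_build_query(array('query' => $query))
);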
Hopefully once we settle on a workable pattern for this it’ll save other people making the same false starts we have.