December 18, 2012
by Christopher Gutteridge
Up until now the open data service has been run on a pretty much seat-of-our-pants approach. We’re actually at the point where one of our services, the events calendar, really needs to graduate into a normal university service. It requires a little regular TLC to deal with broken feeds. There are 74 feeds, so some break now and then. They were always breaking, but now at least someone notices. I (Chris) recently attended a course on the University “Change management” process (which is basically getting sign-off to modify live services to reduce the impact and manage risk). I was pleasantly surprised to hear that the change management team actually use the events calendar to check if a change to live IT services might cause extra issues (e.g. don’t mess with the wifi the weekend we’re hosting an international conference).
I always said that the success criteria for data.soton.ac.uk was that it becomes too important to trust me with (tongue in cheek, but not actually a joke). And, lo and behold, management has asked me to start looking at how to start the (long) journey to having it be a normal university service.
I feel some fear, but not panic.
I’ve been trying to think about how to divide the service into logical sections and consider them separately.
I’ve discussed the workflow for the system before, but here’s a quick overview again.
Publishing System: This downloads source data from various sources and turns it into RDF, publishes it to a web enabled directory then tells the SPARQL database to re-import it. This has just been entirely re-written by Ash Smith in command line PHP. An odd choice you might think, but it’s a language which many people in the university web systems team can deal with, so beats perl/python/ruby on those grounds. We’ve put it on github. The working title is Hedgehog (I forget why) but we’ve decided that each dataset workflow is a quill, which sounds nice.
SPARQL Database: This is 4store. It effectively just runs as a cache of the RDF documents the publishing system spits out; it contains nothing that can’t be recreated from those files.
SPARQL Front End: This is a hacked version of ARC2’s SPARQL interface, but it dispatches the requests to the 4store. It’s much friendlier than the blunt minimal 4store interface. It also lets us provide some formats that 4store doesn’t, such as CSV.
URI Resolver: This is pretty minimal. It does little more than look at the URI and redirect you to the same path on data.soton. It currently does some content negotiation (decides if /building/23 should go to /building/23.rdf or /building/23.html) but we’re thinking of making that a separate step. Yeah, it’s a bit more bandwidth, but meh.
Resource Viewers: A bunch of PHP scripts which handle all the different types of resources, like buildings, products, bus-stops etc. These are a bit hacky and the apache configuration under them isn’t something I’m proud of. Each viewer handles all the formats a resource can be presented in (RDF, HTML, KML etc.)
Website: The rest of the data.soton.ac.uk website is just PHP pages, some of which do some SPARQL to get information.
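To make the URI resolver's content negotiation step a bit more concrete, here's a minimal sketch in Python (the real thing is PHP, and this is not its actual code). The media type table, the function names and the "default to HTML" rule are all assumptions for illustration; only the /building/23 to /building/23.rdf or /building/23.html behaviour comes from the description above.

```python
def choose_extension(accept_header, default="html"):
    """Pick a document extension from an HTTP Accept header."""
    # Assumed mapping of media types to extensions; the real list may differ.
    known = {
        "application/rdf+xml": "rdf",
        "text/html": "html",
        "application/vnd.google-earth.kml+xml": "kml",
    }
    best_ext, best_q = default, 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        q = 1.0  # a media type with no q parameter has quality 1
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        # Keep the highest-quality type we know about; earlier entries win ties.
        if media_type in known and q > best_q:
            best_ext, best_q = known[media_type], q
    return best_ext


def redirect_target(path, accept_header):
    """Return the path on the data site that a URI should redirect to."""
    return f"{path}.{choose_extension(accept_header)}"
```

So redirect_target("/building/23", "application/rdf+xml") gives /building/23.rdf, and anything the table doesn't recognise falls back to the HTML page.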
So here’s what I’m thinking about how to get some of this managed appropriately by business processes.
As a first step, create a clone of the publishing system on a university server and move some of the most stable and core datasets there. Specifically the organisation structure: codes, names, and parent groups in the org-chart, and also the buildings data — just the name, number and what site they are on. These are simple but critical. They also happen to be the two datasets that the events calendar depends on and so would have to be properly managed dependencies before the calendar could follow the same route.
The idea of this second data service, let’s call it reliable.data.soton.ac.uk, is that it would only provide documents for each dataset. All the fun stuff would stay (for now) on the dev server, and I really don’t want to get iSolutions monkeying around with SPARQL until they’ve got at least a little comfortable with RDF. The Hedgehog instance on reliable.data would still trigger the normal “beta” SPARQL endpoint to re-import the data documents when they change.
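The "re-import when they change" part could work something like the sketch below. This is Python rather than the PHP Hedgehog is written in, and it is not how Hedgehog actually does it: comparing SHA-256 hashes between runs is my assumption, and request_reimport is a hypothetical callback standing in for whatever tells the SPARQL endpoint to reload a document.

```python
import hashlib


def document_hash(content: bytes) -> str:
    """Fingerprint a published document by its bytes."""
    return hashlib.sha256(content).hexdigest()


def changed_documents(published, seen_hashes):
    """Return the names of documents that differ from the previous run.

    published maps document name to bytes; seen_hashes maps document name
    to the hash recorded last time, and is updated in place.
    """
    changed = []
    for name, content in sorted(published.items()):
        digest = document_hash(content)
        if seen_hashes.get(name) != digest:
            seen_hashes[name] = digest
            changed.append(name)
    return changed


def publish_run(published, seen_hashes, request_reimport):
    """Ask for a re-import of each changed document, and nothing else."""
    changed = changed_documents(published, seen_hashes)
    for name in changed:
        # Hypothetical hook: in practice this would poke the SPARQL endpoint.
        request_reimport(name)
    return changed
```

The point of doing it this way is that the managed server only needs to know how to publish files and send a notification; it never has to speak SPARQL itself.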
We could make sure that the schema for these documents was very well documented and that changes were properly managed, and could be tested prior to execution. I’m not sure how, but maybe university members could register an interest so that they could be notified of plans to change these. That would be getting value out of the process. For the buildings dataset, which is updated a few times a year, maybe even the republishing should have a prior warning.
The next step would be to move the event calendar into change management, and ensure that it only depended on the ‘reliable’ documents. This service is pretty static now in terms of functionality. We’ve got some ideas for enhancements, but these could be minor tweaks to the site, with the heavy lifting done on the ‘un-managed’ main data server.
Don’t get me wrong, I don’t love all this bureaucracy, but if open data services are to succeed they need to be embedded in business processes, not quick hacks.