(or One big fat “before” picture)
Over the next year we plan to turn http://data.southampton.ac.uk/ into a lean, mean open data publishing machine. Right now it’s a bunch of scripts on a Linux VM. It works OK, but it’s hardly a poster-child for “Enterprise Systems”. A key part of the new design for I.T. at Southampton is that when my team innovates a new system, such as the Open Data Service, and it is successful, we need to get it into a state where it can become part of the normal IT infrastructure of the organisation.
This is all a bit new to me, and I don’t know exactly what I’ll need to learn, but I’ve been talking to the people at the Software Sustainability Institute, and I think it’s useful to document the process in public. To be really useful, I’m going to have to be a bit honest about some of my own poor practices.
So here’s where we stand at the beginning of 2012:
The open data service is run on a single Linux Virtual Machine. The machine is backed up. The non-standard bits I’ve installed are 4store, rapper, graphite & grinder.
The main publishing config directory contains a sub-directory for each dataset. A script called publish_dataset republishes a named dataset. This script is in GitHub (under the Grinder project), which seemed like a good idea at the time. It’s scruffy, but it was written with an eye to the future, so it’s quite modular. The configuration files are not in GitHub, but a snapshot is created when a dataset is published. Example. The goal of that was to let other people see what we were doing. It’s crazy for me not to have version control on these configuration files!
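To make the snapshot idea concrete, here is a minimal sketch of that one step. The function name mirrors the real publish_dataset script, but the directory layout and behaviour shown are assumptions for illustration; the real script does far more than this.

```python
import shutil
from pathlib import Path

def publish_dataset(name, config_root, snapshot_root):
    """Republish one dataset. This sketch models only the snapshot step:
    copy the dataset's config directory to a public location so anyone
    can see exactly which configuration produced the published data."""
    src = Path(config_root) / name
    if not src.is_dir():
        raise ValueError(f"no such dataset: {name}")
    dest = Path(snapshot_root) / name
    if dest.exists():
        shutil.rmtree(dest)  # replace the previous snapshot wholesale
    shutil.copytree(src, dest)
    return dest
```

Putting the same config directories under version control would make the snapshots a by-product of the history rather than the only record.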
This is a single dataset describing all the other datasets. It was a mistake. I now believe I should take the entry for each dataset and save it as configuration with the
Of course, it’s a bit more complicated than that. The main data.soton SPARQL endpoint has an ARC2 PHP endpoint, which is what the public sees. This passes requests to the 4store endpoint, but adds some nice features such as CSV & SQLite output formats. The catch is that it doesn’t spot the <!-- comment --> in the 4store results that warns you it’s hit the ‘soft limit’ (which prevents big searches killing the machine). Annoyingly, SPARQL doesn’t have a way to pass a warning message with the results.
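The proxy could at least scan the raw results for that embedded comment before reformatting them. A sketch of the check, assuming the warning mentions the limit; the exact wording of the comment varies by 4store version, so treat the matched phrases as assumptions:

```python
import re

def hit_soft_limit(sparql_results: str) -> bool:
    """Scan raw 4store output for the embedded XML comment warning that
    the soft limit was hit, so a proxy endpoint can flag truncated
    results instead of silently passing them on."""
    comments = re.findall(r"<!--(.*?)-->", sparql_results, re.DOTALL)
    return any("soft limit" in c.lower() or "complexity limit" in c.lower()
               for c in comments)
```

A proxy that strips the comment anyway (as reformatting to CSV does) could then set an HTTP header or log entry so the truncation isn’t lost entirely.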
There’s one more SPARQL endpoint, which also sits on top of the real data.soton 4store endpoint, but caches results for a few minutes. I think this is used for most of the public website, but I’ve got in a bit of a muddle about it.
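The caching layer is conceptually simple, which is worth writing down while untangling the muddle. A sketch, assuming a callable backend and a few-minute TTL; none of these names come from the real setup:

```python
import time

class CachingEndpoint:
    """Sit in front of a slower SPARQL endpoint and serve repeated
    queries from a short-lived cache. `backend` is any callable taking
    a query string; `clock` is injectable so the behaviour is testable."""

    def __init__(self, backend, ttl=300, clock=time.monotonic):
        self.backend = backend
        self.ttl = ttl
        self.clock = clock
        self._cache = {}  # query -> (expires_at, results)

    def query(self, q):
        now = self.clock()
        entry = self._cache.get(q)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        results = self.backend(q)
        self._cache[q] = (now + self.ttl, results)
        return results
```

Documenting which pages go through this layer (and which hit 4store directly) would be a cheap way out of the muddle.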
Having 4 SPARQL endpoints on the machine is why it got named “Edward”: it’s so SPARQLy.
The website uses a scary .htaccess file to make /building/****.**** redirect to a PHP file that handles rendering buildings (and many other types of thing). The oldest ones use lots of SPARQL queries and have multiple scripts for different output formats (HTML, RDF etc.). The newer ones use the new Graphite “Resource Description” feature to get the data and are much easier to read. There’s a PHP library which renders the sidebar and captures any SPARQL queries used to build the page (it doesn’t capture queries from Graphite, but those are not as interesting, as they are auto-generated and almost unreadable). It also has a register of datasets used to generate the page, but this is hand-created in the PHP viewer so is often a bit inaccurate. I had a go at capturing the graphs used to make a page, but it’s tricky and puts strain on the database.
This is critical. This virtualhost does nothing but resolve URIs to redirect them to an RDF or HTML page. It could stand to be smarter. I am a big advocate of keeping the URI resolution entirely separate from the data site.
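Keeping the resolver separate works because the job is tiny: a 303 redirect from the identifier to a document about it, choosing RDF or HTML from the Accept header. A minimal sketch; the target URL pattern and path handling here are illustrative, not the live resolver’s config:

```python
def resolve_uri(path: str, accept: str):
    """Turn an identifier path like /building/32 into a 303 redirect
    to a document about it, picking a format from the Accept header."""
    rdf_types = ("application/rdf+xml", "text/turtle", "text/n3")
    wants_rdf = any(t in accept for t in rdf_types)
    ext = ".rdf" if wants_rdf else ".html"
    return 303, "http://data.southampton.ac.uk" + path + ext
```

Because the logic has no dependency on the data site’s internals, the resolver can stay up even when the data site is being rebuilt, which is exactly why it’s worth keeping separate.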
Southampton Events Diary
This is a work in progress which will replace the university public events diary. It aggregates feeds from all over and combines them into a single RDF document, then makes a pretty web interface on top. This will be quite high profile, but is currently a little fragile and if a single feed stops working it’s hard to spot. It’s being developed by a final-year postgrad, which is great as I don’t have time, but means I understand it a little less.
It is loosely coupled to the main data site. It could run on a stand-alone box, but it enhances its data with the university place and division hierarchy, and data.southampton pulls the dataset in once an hour.
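The “if a single feed stops working it’s hard to spot” problem has a cheap fix: aggregate with explicit failure tracking rather than letting a dead source just go quiet. A sketch, assuming a `fetch` callable that raises on error; all names here are illustrative, not the student’s actual code:

```python
def aggregate_feeds(feed_urls, fetch):
    """Combine many event feeds into one list, and record which feeds
    failed, so a silently-dead source is reported rather than just
    quietly missing from the combined output."""
    events, failed = [], []
    for url in feed_urls:
        try:
            events.extend(fetch(url))
        except Exception:
            failed.append(url)  # surface this, e.g. on a status page
    return events, failed
```

Even just emailing the `failed` list once a day would turn an invisible breakage into a visible one.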
This is a little tool written using the Google Maps API. It has some issues now; for one, it doesn’t know to filter out demolished buildings!
The site also hosts a generic map search app which is a very powerful demo of data reuse.
We should get a spreadsheet with energy readings from a bunch of meters, but after working for a few weeks it broke at their end, and it’s not high-enough priority to have been sorted. My impression is that the underlying software is a bit old and flaky, but there’s no other option that I can see. Their underlying database contains raw data which needs quite a bit of processing to turn into a simple kWh value. I don’t know if I should give up and remove the graphs from the building pages now it’s been broken for over a month.
bus times and routes
This has been the highest-profile dataset, and inspired the most apps, but it is sort of odd to host it on the university server. The bus stops and bus routes were screen-scraped a while ago and might be out of date, but nobody’s complained. I suspect I’ll have to look into that at some point, however.
I still haven’t been able to get timetables out of anybody, just the live feed for a stop, which is done by screen-scraping and caching for 30 seconds. The government have made noises about this kind of data being available nationally, but we’ll see.
There’s a bunch of cron jobs which import and publish some datasets. The EPrints one doesn’t seem to work properly, and I’ve not finished unpicking why.
There’s a monitoring script which Dave Challis wrote which keeps an eye on the 4store instances and does something if they look peaky. I don’t fully understand what, yet.
I’m often asked what vocabularies datasets use, and I really have not documented it well. Each dataset should have at the very least a list of predicates, classes and some example entities — most of that is probably automatable.
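Most of that documentation can indeed be derived straight from the data. A sketch of the idea over an in-memory triple list; against 4store the same summary would come from a few SELECT DISTINCT queries, and nothing here reflects an existing script:

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def summarise_dataset(triples, max_examples=3):
    """Generate minimal per-dataset documentation from the data itself:
    distinct predicates, distinct classes, and a few example entities.
    `triples` is any iterable of (subject, predicate, object) strings."""
    predicates, classes, examples = set(), set(), []
    for s, p, o in triples:
        predicates.add(p)
        if p == RDF_TYPE:
            classes.add(o)
        if s not in examples and len(examples) < max_examples:
            examples.append(s)
    return sorted(predicates), sorted(classes), examples
```

Run on each publish, this would keep the vocabulary documentation accurate by construction instead of by memory.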
Continuity of Data
I’ve still not got everything flowing smoothly, but it’s improving. The “organisation” dataset now has a weekly feed direct from HR, but many things are not automated which should be. These include: the phonebook, the list of courses & modules, international links, buildings occupied by faculties and academic units… the list is long. Some lists require hand-intervention before publishing so can’t be automated (the list of “Buildings” from the Planon system run by Buildings & Estates lists the stream as a building). I figure in time more web pages will be built from this data and the data owners will want to take ownership. This is already happening, e.g. the transport office has been updating our list of Bike Racks.
Most of these things either change slowly or already exist in a database elsewhere, so my goal is to hook up the database feeds where I can, and just make the by-hand corrections to the slow-changing data. It’s certain that some is going out of date. I need to make it easy for people to feed back corrections, done right that’s a benefit to the organisation.
The university has many schemes which identify people (username, email, staff/student ID, EPrints.soton code etc.), but there’s no nice easy way to produce an identifier for a person. I’m currently using a hash of their email, but it’s unsatisfactory.
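For reference, the email-hash approach looks something like this. The URI base and normalisation are assumptions for illustration, not the live scheme, and the sketch also shows one reason it’s unsatisfactory: the identifier breaks the moment someone’s address changes.

```python
import hashlib

def person_uri(email: str) -> str:
    """Mint one opaque identifier per person from their email address.
    Normalising first matters: Fred.Bloggs@... and fred.bloggs@...
    must not mint two different people."""
    digest = hashlib.sha1(email.strip().lower().encode("utf-8")).hexdigest()
    return "http://id.southampton.ac.uk/person/" + digest
```

A stable internal ID would be a better hash input, but that runs straight into the many-schemes problem above.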
Odds and Sods
There’s a directory on the website which contains little works-in-progress and proofs of concept. It doesn’t really belong on the production server.
Feedback System & Bugtracker
When we set up the system we were very rushed and installed a system for giving feedback, but it’s not a public discussion, so we’d have been better off with email. It’s yet another thing to check, so I often forget it for big chunks of time, then find 50 spams and a couple of good ideas or impossible dreams. I think this system needs turning into something more functional, but I’m not sure what.
Politics & People
Currently only I really know the guts of how this system works. Dave Challis knows some, and knows far more about the 4store set-up than me, but he left at the start of the year to work for the company that makes 4store. In a dire emergency I can get in touch, as we’re still good friends.
The university has made it very clear to me that this is an ongoing concern, and will be part of the future of the university, but that it can’t be the top priority, which I think is entirely fair.
The big concern right now is that if I go under a bus, the system would be at great risk of just decaying and dying because the costs of someone learning enough to support and extend it are too high. The ultimate goal should be to clean the system up enough to make it a configurable platform of which data.southampton is an instance.
Sharepoint vs Google Docs
The university is now using SharePoint for lots of things, and it’s actually not a bad option for collecting tabular data from people to turn into RDF; it allows pick-lists etc., which Google Docs does not. The problem is that we don’t have a very strong policy on where to put such things yet, as our SharePoint experts are very busy working on the existing todo list.
Publishing on Demand
It would be very useful if certain datasets could be republished by the data owner when they’ve updated the data in SharePoint or Google, e.g. the daily menus from the catering dept. I’ve been thinking of creating role-based passwords, so I just give them a URL and password to see a big ‘republish dataset-X’ button. I’ve got as far as registering admin.data.southampton.ac.uk and thinking about HTTPS.
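The role-based password idea could be sketched as follows. All the names and the example role/password are made up, and the scheme is only sensible over HTTPS, which is part of why that matters above:

```python
import hashlib
import hmac

# role -> (sha256 of its password, datasets it may republish); illustrative data
ROLES = {
    "catering": (hashlib.sha256(b"example-password").hexdigest(), {"menus"}),
}

def may_republish(role, password, dataset, roles=ROLES):
    """Check a role-based password before showing the big 'republish
    dataset-X' button. compare_digest avoids leaking information
    through comparison timing."""
    entry = roles.get(role)
    if entry is None:
        return False
    pw_hash, datasets = entry
    supplied = hashlib.sha256(password.encode("utf-8")).hexdigest()
    return hmac.compare_digest(supplied, pw_hash) and dataset in datasets
```

Scoping each role to a set of datasets means the catering password can never republish, say, the buildings dataset.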
- Version Control of all non-confidential scripts and configuration files.
- More documentation!
- Move data-set metadata into the dataset itself, not a custom dataset catalogue-dataset
- publication script needs to be much better engineered
- More automated publication and a republish-now button
- Testing Server, and a way to transfer changes to/from live server.
- Remove cruft from live server
- Better feedback system
- monitoring of systems and processes to detect load and failures
- Learn more about what the normal iSolutions platform policy is.