Data.Soton Enterprise Edition

(or One big fat “before” picture)

Over the next year we plan to turn http://data.southampton.ac.uk/ into a lean, mean open data publishing machine. Right now it’s a bunch of scripts on a Linux VM. It works OK, but it’s hardly a poster-child for “Enterprise Systems”. A key part of the new design for IT at Southampton is that when my team innovates a new system, such as the Open Data Service, and it proves successful, we need to get it into a state where it can become part of the normal IT infrastructure of the organisation.

This is all a bit new to me, and I don’t know exactly what I’ll need to learn, but I’ve been talking to the people at the Software Sustainability Institute, and I think it’s useful to document the process in public. To be really useful, I’m going to have to be a bit honest about some of my own poor practices.

So here’s where we stand at the beginning of 2012:

Platform

The open data service runs on a single Linux virtual machine. The machine is backed up. The non-standard bits I’ve installed are 4store, rapper, Graphite & Grinder.

Publishing Process

The main publishing config directory contains a subdirectory for each dataset. A script called publish_dataset republishes a named dataset. This script is in GitHub (under the Grinder project), which seemed like a good idea at the time. It’s scruffy, but it’s been written with an eye to the future so it’s quite modular. The configuration files are not in GitHub, but a snapshot is created when a dataset is published. Example. The goal of that was to let other people see what we were doing. It’s crazy for me not to have version control on these configuration files!
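To give a flavour of what that means in practice, the layout is roughly this (the dataset and file names below are made up for illustration, not the real ones):

    publishing/
        bus-routes/            <- one subdirectory per dataset
            config             <- where the source data lives and how to convert it
        catering-menus/
            config
        ...
    publish_dataset            <- the script that rebuilds a named dataset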

Data Catalogue

This is a single dataset describing all the other datasets. It was a mistake. I now believe I should take the entry for each dataset and save it as configuration alongside the dataset itself, rather than maintaining a separate catalogue dataset.

SPARQL Databases

There are two instances of 4store running on the box: one for http://sparql.data.southampton.ac.uk/, and a second which powers http://rdf.ecs.soton.ac.uk/sparql/

Of course, it’s a bit more complicated than that. The main data.soton SPARQL endpoint has an ARC2 PHP endpoint which is what the public sees. This passes requests to the 4store endpoint, but adds some nice features such as CSV & SQLite output formats. The catch with this is that it doesn’t spot the <!-- comment --> in the 4store results that warns you that it’s hit the ‘soft limit’ (which prevents big searches killing the machine). Annoyingly, SPARQL doesn’t have a way to pass a warning message with the results.
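For the curious, the check the proxy ought to do is roughly the following. This is a simplified sketch, not the real ARC2 wrapper; the local 4store URL, the improvised warning header and the exact wording of 4store’s comment are all assumptions on my part.

    <?php
    // Sketch: pass a SPARQL query through to a local 4store instance and
    // look for the XML comment it adds when the soft limit was hit.
    // Endpoint URL and comment wording are illustrative, not checked.

    $endpoint = "http://localhost:8000/sparql/";
    $query    = isset( $_GET["query"] ) ? $_GET["query"] : "";

    $results = file_get_contents( $endpoint . "?" . http_build_query( array( "query" => $query ) ) );

    if( preg_match( '/<!--(.*?)-->/s', $results, $m )
     && stripos( $m[1], "soft limit" ) !== false )
    {
        // SPARQL has no standard way to carry a warning, so improvise a header.
        header( "X-SPARQL-Warning: soft limit reached, results may be truncated" );
    }

    header( "Content-Type: application/sparql-results+xml" );
    print $results;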

There’s one more SPARQL endpoint, which also sits on top of the real data.soton 4store endpoint, but caches results for a few minutes. I think this is used for most of the public website, but I’ve got in a bit of a muddle about it.

Four SPARQL endpoints on the machine is why it got named “Edward”, as it’s so SPARQLy.

Public Website

The website uses a scary .htaccess file to make /building/****.**** redirect to a PHP file that handles rendering buildings (and many other types of thing). The oldest ones use lots of SPARQL queries, and have multiple scripts for different output formats (HTML, RDF etc.). The newer ones use the new Graphite “Resource Description” to get the data and are much easier to read. There’s a PHP library which renders the sidebar and captures any SPARQL queries used to build the page (it doesn’t capture queries from Graphite, but those are less interesting as they’re auto-generated and almost unreadable). It also has a register of datasets used to generate the page, but this is hand-created in the PHP viewer so is often a bit inaccurate. I had a go at capturing the graphs used to make a page, but it’s tricky and puts strain on the database.
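For the newer pages, the core Graphite pattern looks something like this. It’s a minimal sketch using the library’s basic documented calls (load, resource, label) rather than the actual viewer code or the “Resource Description” feature, and the building URI is just an example:

    <?php
    // Minimal Graphite sketch: load the RDF document behind a URI and
    // render its label. Graphite sits on top of ARC2.
    require_once "arc/ARC2.php";
    require_once "Graphite.php";

    $uri = "http://id.southampton.ac.uk/building/59";   // illustrative URI

    $graph = new Graphite();
    $graph->load( $uri );                   // fetch and parse the RDF for this URI

    $building = $graph->resource( $uri );
    print "<h1>" . $building->label() . "</h1>";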

id.southampton.ac.uk

This is critical. This virtualhost does nothing but resolve URIs to redirect them to an RDF or HTML page. It could stand to be smarter. I am a big advocate of keeping the URI resolution entirely separate from the data site.
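In essence the resolver does something like this; a sketch only, with simplified target URL patterns rather than the exact production rules:

    <?php
    // Sketch of the id.southampton.ac.uk resolver: it holds no data itself,
    // it just 303-redirects each URI to the matching document on the data site.
    // The ".rdf" / ".html" URL patterns are illustrative.

    $path   = $_SERVER["REQUEST_URI"];                               // e.g. /building/59
    $accept = isset( $_SERVER["HTTP_ACCEPT"] ) ? $_SERVER["HTTP_ACCEPT"] : "";

    $suffix = ( strpos( $accept, "application/rdf+xml" ) !== false ) ? ".rdf" : ".html";

    header( "HTTP/1.1 303 See Other" );
    header( "Location: http://data.southampton.ac.uk" . $path . $suffix );
    exit;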

southampton events diary

This is a work in progress which will replace the university public events diary. It aggregates feeds from all over and combines them into a single RDF document, then makes a pretty web interface on top. This will be quite high profile, but is currently a little fragile and if a single feed stops working it’s hard to spot. It’s being developed by a final-year postgrad, which is great as I don’t have time, but means I understand it a little less.

It is loosely coupled to the main data site. It could run on a standalone box, but it enhances its data with the university place and division hierarchy, and data.southampton pulls the dataset in once an hour.

maps.southampton.ac.uk

This is a little tool written using the Google Maps API. It has some issues now; for one, it doesn’t know to filter out demolished buildings!

The site also hosts a generic map search app, which is a very powerful demo of data reuse.

energy data

We’re supposed to get a spreadsheet of energy readings from a bunch of meters, but after working for a few weeks it broke at their end, and it hasn’t been high-enough priority to get sorted. My impression is that the underlying software is a bit old and flaky, but there’s no other option that I can see. Their underlying database contains raw data which needs quite a bit of processing to turn into a simple kWh value. I don’t know if I should give up and remove the graphs from the building pages now it’s been broken for over a month.

bus times and routes

These have been the highest-profile dataset, and inspired the most apps, but it is sort of odd to host them on the university server. The bus stops and bus routes were screen-scraped a while ago and might be out of date, but nobody’s complained. I suspect I’ll have to look into that at some point, however.

I still have been unable to get timetables out of anybody, just the live feed for a stop, which is done by screen-scraping and caching for 30 seconds. The government have made noises about this kind of data being available nationally, but we’ll see.
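The 30-second cache is about as simple as it sounds; something along these lines, where the feed URL and cache location are placeholders rather than the real ones:

    <?php
    // Sketch of the live bus-stop lookup: scrape the operator's page, but keep
    // a 30-second file cache so their server isn't hit on every request.
    // The feed URL and cache path are made up for the example.

    function live_departures( $stop_id )
    {
        $cache_file = "/tmp/busstop-" . md5( $stop_id ) . ".html";

        // Serve from cache if it's less than 30 seconds old.
        if( file_exists( $cache_file ) && time() - filemtime( $cache_file ) < 30 )
        {
            return file_get_contents( $cache_file );
        }

        $html = file_get_contents( "http://example.org/live?stop=" . urlencode( $stop_id ) );
        if( $html !== false ) { file_put_contents( $cache_file, $html ); }
        return $html;
    }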

Cron Jobs

There’s a bunch of cron jobs which import and publish some datasets. The EPrints one doesn’t seem to work properly and I’ve not finished unpicking why.

There’s a monitoring script which Dave Challis wrote which keeps an eye on the 4store instances and does something if they look peaky. I don’t fully understand what, yet.

Vocabularies

I’m often asked what vocabularies datasets use, and I really have not documented it well. Each dataset should have at the very least a list of predicates, classes and some example entities — most of that is probably automatable.

Continuity of Data

I’ve still not got everything flowing smoothly, but it’s improving. The “organisation” dataset now has a weekly feed direct from HR, but many things are not automated which should be. These include: the phonebook, the list of courses & modules, international links, buildings occupied by faculties and academic units… this list is long. Some lists require hand-intervention before publishing so can’t be fully automated — the list of “Buildings” from the Planon system run by Buildings & Estates lists the stream as a building. I figure in time, more web pages will be built from this data and the data owners will want to take ownership. This is already happening, e.g. the transport office has been updating our list of Bike Racks.

Most of these things either change slowly or already exist in a database elsewhere, so my goal is to hook up the database feeds where I can, and just make the by-hand corrections to the slow-changing data. It’s certain that some is going out of date. I need to make it easy for people to feed back corrections, done right that’s a benefit to the organisation.

Identifying People

The university has many schemes which identify people: Username, Email, Staff/Student ID, EPrints.soton Code etc., but there’s not a nice easy way to produce an identifier for a person. I’m currently using a hash of their email, but it’s unsatisfactory.
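For what it’s worth, the current approach amounts to something like this; the hash function and the URI pattern here are illustrative, not necessarily exactly what’s in production:

    <?php
    // Sketch: mint a person URI from an email address.
    // sha1 and the /person/ URI pattern are assumptions for illustration.

    function person_uri( $email )
    {
        return "http://id.southampton.ac.uk/person/"
             . sha1( strtolower( trim( $email ) ) );
    }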

Odds and Sods

There’s a directory on the website which contains little works-in-progress and proofs of concept. It doesn’t really belong on the production server.

Feedback System & Bugtracker

When we set up the system we were very rushed and installed a system for giving feedback, but it’s not a public discussion, so we’d have been better off with email. It’s yet-another thing to check so I often forget it for big chunks of time, then find 50 spams and a couple of good ideas or impossible dreams. I think this system needs turning into something more functional but I’m not sure what.

Politics & People

Currently only I really know the guts of how this system works. Dave Challis knows some, and knows far more about the 4store setup than me, but he left at the start of the year to work for the company that makes 4store. In a dire emergency I can still get in touch, as we’re good friends.

The university has made it very clear to me that this is an ongoing concern, and will be part of the future of the university, but that it can’t be the top priority, which I think is entirely fair.

The big concern right now is that if I go under a bus, the system would be at great risk of just decaying and dying because the costs of someone learning enough to support and extend it are too high. The ultimate goal should be to clean the system up enough to make it a configurable platform of which data.southampton is an instance.

Sharepoint vs Google Docs

The university is now using Sharepoint for lots of things, and it’s actually not a bad option for collecting tabular data from people to turn into RDF; it allows pick-lists etc., which Google Docs does not. The problem is that we don’t have a very strong policy on where to put such things yet, as our Sharepoint experts are very busy working on the existing todo list.

Publishing on Demand

It would be very useful if certain datasets could be republished by the data owner when they’ve updated the data in Sharepoint or Google, e.g. the daily menus from the catering dept. I’ve been thinking of creating role-based passwords so I can just give them a URL and password to see a big ‘republish dataset-X’ button. I’ve got as far as registering admin.data.southampton.ac.uk and thinking about HTTPS.
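Nothing is built yet, but the sort of thing I have in mind is roughly this (served over HTTPS); the password list and the call out to publish_dataset are sketched for illustration, not real code:

    <?php
    // Sketch of a 'republish dataset X' page protected by a per-dataset password.
    // The password table and the shell-out to publish_dataset are illustrative.

    $passwords = array( "catering-menus" => "not-a-real-password" );

    $dataset  = isset( $_POST["dataset"] )  ? $_POST["dataset"]  : "";
    $password = isset( $_POST["password"] ) ? $_POST["password"] : "";

    if( isset( $passwords[$dataset] ) && $password === $passwords[$dataset] )
    {
        // Only a known dataset ID ever reaches the shell, never raw user input.
        system( "publish_dataset " . escapeshellarg( $dataset ), $status );
        if( $status === 0 ) { print "Republished " . $dataset; }
        else { print "Publish failed"; }
    }
    else
    {
        header( "HTTP/1.1 403 Forbidden" );
        print "Unknown dataset or wrong password";
    }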

Where next?

  • Version Control of all non-confidential scripts and configuration files.
  • More documentation!
  • Move data-set metadata into the dataset itself, not a custom dataset catalogue-dataset
  • publication script needs to be much better engineered
  • More automated publication and a republish-now button
  • Testing Server, and a way to transfer changes to/from live server.
  • Remove cruft from live server
  • Better feedback system
  • monitoring of systems and processes to detect load and failures
  • Learn more about what the normal iSolutions platform policy is.

Posted in Best Practice.



4 Responses


  1. Alex Bilbie says

    Great stuff guys.

    I’ve been thinking about doing a similar post about the future of the data.lincoln.ac.uk platform.

    Keep up the great work!

    Alex

  2. Mike Jackson says

    Chris asked The Software Sustainability Institute for a chat about these plans for “data.southampton.ac.uk enterprise edition”. I looked at both data.southampton.ac.uk and Chris’s plans above and we then discussed these in a phone call. Chris asked me to add my comments…so here they are…

    The blog gave the impression you have a very good understanding of your main issues and how these can be resolved 🙂 Nothing seemed to be missing, bar how important the various tasks might be. These are my comments in order of perceived importance, which I formed as I read so, yes, there’s a lot of overlap with your “where next?” list (which is good!) 🙂

    Document everything about the infrastructure

    Currently a lot of information is in your head relating to the location, deployment and configuration of various components on your VM (e.g. Graphite, Grinder, Rapper etc), the id.southampton.ac.uk URI redirection components, .htaccess configuration and modification, dataset vocabularies, the location of the register of datasets and how to update it. This is in addition to information about data from third-parties e.g. what this data is, who to contact if a data feed goes down (as for the energy data), data for which hand-tweaking is required, the nature of this hand-tweaking etc.

    In addition there is the information held by Dave Challis e.g. setting up the 4stores and their monitoring scripts. And, for the Southampton events diary, a final year undergrad.

    Documenting all this helps ensure the infrastructure is not just dependent upon you. This doesn’t have to be step-by-step tutorials, or product-quality user documentation, just notes to a sufficient level of detail that any able developer could, with the notes and a search engine, understand everything you do. These can be evolved into more usable documentation later.

    Get another body

    Another developer to work with you would ensure that the infrastructure is not just dependent upon you. They could also validate your documentation and processes.

    Expand the scope of version control

    It would be useful to set up version control to hold any scripts or configuration files that do not, at present, seem to be under version control. For example, dataset configuration files, .htaccess files, CRON jobs, 4store monitoring scripts.

    It is worth considering whether any of these make sense as stand-alone tools and, if so, creating new repositories for these (augmenting your existing set of Graphite, Grinder and the like).

    If these contain confidential information (e.g. usernames and passwords) then these could be held in a private version control system. A public version control system could hold “template” copies with documentation as to how to customise them.

    Revise scripts and configuration

    In parallel with the above, revise your scripts and configuration files to make them more generally useful and applicable to others.

    Formalise processes

    It would be useful to create, and set down, lightweight processes for how to handle various interactions with others e.g. data providers and users. For example:

    -Third-party data providers: who to contact if the feed stops (e.g. the energy data), how to report such issues, at what point is the feed removed from the site if it remains problematic.
    -When and how to update register of datasets
    -Under what circumstances, how and how often to scrape data (where applicable) e.g. the bus times and routes.
    -How users can report bugs or request features, how these are managed.

    Set up an issue tracker

    To record your bugs and TODOs, to allow users to report bugs, request features, ask to register new data sets etc. It would be useful if this were publicly browsable and asking users to register could deter spammers.

    Usability evaluation

    Once you have documented how your infrastructure is set up and configured and have a reasonably stable set of scripts, carry out a usability evaluation. Get someone representative of a target user to try to set up a server to expose a few data sets. This would indicate how easy it would be for others to set up this infrastructure in their home institutions, identify scope for improvements and iron out any wrinkles in your scripts, configuration files or documentation.

    University of Southampton Open Data Comments

    These are my comments on http://data.southampton.ac.uk spanning everything from making certain information more prominent to typos and broken links 🙂

    Front page – http://data.southampton.ac.uk

    “Linked Open Data” section
    -Change “data.gov.uk” and “Coalition Government’s Transparency Board” to hyperlinks.
    -Typo: “eaiser”

    Keep informed
    -Ideally, just link to the public mailing list only.
    -Also provide a link to your blog and re-mention the RSS feeds and Twitter.

    Credits section
    -Broken link: “Dave Challis”

    Add link to Open Government Licence and Open Definition in footer alongside the Terms and Conditions and Freedom of Information.

    FAQ – http://data.southampton.ac.uk/faq.html

    Answer to “What technologies are you using?” would be more useful pulled out into a new page, linked from the left-hand menu. Providing a table of the technologies, their web sites and what they’re used for would be more informative.

    Add Geo2KML to this list.

    Answer to “What is a URI?” says “click the grey, italic URI” which looks like a bug.

    E-mails and feedback

    It would be useful if the public majordomo@lists.soton.ac.uk list was archived and the archive browsable, searchable and linked.

    Certain pages e.g. requests to add apps (http://data.southampton.ac.uk/apps.html) or FAQs (http://data.southampton.ac.uk/faq.html) have mails to cjg@ecs.soton.ac.uk. These should be directed to one of the lists.

    A new “Contact Us”/”Get in Touch” page would be useful which would list your public and private e-mail lists and what to use these for. All pages with e-mail addresses should instead link to this. This reduces the risk of redundancy and inconsistency. A link to this page should be prominently displayed from your front-page.

    The feedback page (http://data.southampton.ac.uk/feedback/) provides forms for suggestions (datasets, features, apps), bug reports and app registration. Who receives these? Perhaps better just to request they use the public e-mail list? Or adopt a publicly-browsable ticketing system for managing these?

    Graphite (http://graphite.ecs.soton.ac.uk/) page lists a number of “quick and dirty” and other tools. Are these all under GitHub somewhere?

  3. Alex Bilbie says

    I’ve written a post about the future of data.lincoln at httpster.org/the-future-of-data-lincoln-ac-uk/

Continuing the Discussion

  1. data.ac.uk and some things to read – Southampton Data Blog linked to this post on April 24, 2012

    […] data.southampton: Enterprise Edition – this post goes into the technical detail and is about the underlying system, not the data itself. […]


