Oooo, data

May 27, 2011
by Christopher Gutteridge

On Wednesday I gave a well-received talk to the university ‘Digital Economy’ research group (a virtual group containing people from all over the university).

Yesterday I had the fun problem of lots of people getting in touch with ideas! For the next couple of months I still can’t put my full focus on the Open Data, but here are some of the interesting things going on behind the scenes:

Facilities / Equipment dataset to describe our cool toys. I’ve got people interested in contributing to this from all over the university. You can see a preview here. The idea is to help the left hand know what resources the right hand has, and who’s allowed to use them. I’ve had provisional interest in this from medical imaging, the high voltage lab, the nano cleanrooms, archaeology, civil engineering and chemistry.
Disabled Go reports – someone pointed me at this site, which has detailed reports on disabled access for 98 of our buildings. Most of the data is too detailed to map into RDF, but what I was hoping to do is (1) just provide a link to the reports for each building from our data and /building/ pages, which alone gets far more value out of it, and maybe (2) pull out the headline data, eg. “has disabled loo”, “allows guide dogs”. We’ve been in touch with them and it sounds like they are pretty positive about the idea. I still need their permission to provide that information under OGL or another open license.
Catering have updated all the menus to include coffee & other hot drinks (they were missing before), after noticing that the opendatamap didn’t have any results when searching for ‘coffee’ (the horror). Problem is, the menu says “Filter (Large)” now, so there’s still no match for coffee! We’ll either rename it to “Filter Coffee (Large)” or consider adding a “Hidden Labels” field to help searches.
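
For the curious, here is a rough Python sketch of how a “Hidden Labels” field might help the coffee search. The `hidden_labels` field name, the `matches` and `search` helpers and the sample menu rows are all made up for illustration; this is not how the opendatamap search is actually implemented.

```python
# Hypothetical menu items. "hidden_labels" is an assumed extra field
# holding search terms that are never shown to the user.
MENU = [
    {"label": "Filter (Large)", "hidden_labels": ["coffee", "filter coffee"]},
    {"label": "Tea", "hidden_labels": []},
]


def matches(item, query):
    """Return True if the query appears in the visible or hidden labels."""
    needle = query.strip().lower()
    haystack = [item["label"]] + item.get("hidden_labels", [])
    return any(needle in text.lower() for text in haystack)


def search(items, query):
    """Return the visible labels of every item that matches the query."""
    return [item["label"] for item in items if matches(item, query)]
```

With this, `search(MENU, "coffee")` finds “Filter (Large)” without the visible menu text having to change.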

I got asked what the success criteria for the Open Data project were. This is very difficult to define, but for me it will be when the open-data-service is so much part of business-as-usual that people no longer want an enthusiastic hacker running it! I’m looking forward to talking about the good ol’ days when open data was a new frontier and nobody even had an ontology for coffee types or bus timetables yet.

The Open Data is starting to get put to use, too:

People are using the bus times pages (I need to make the interface better, I know!)
Our upcoming campus mobile phone app will use some of the location data
I’ve been asked how the service could aid with student induction, eg. helping people find what’s available, and where it is.

The other thing ticking along is getting live hookups to databases. Right now it’s all done with one-off dumps; we want to be showing the living data. The dump-and-email approach is fine for getting started but now it’s time to do the far less glamorous job of making the back-end more automated. I’m still working on getting energy use data per building, and I’ve a lead on recycling data!

Good times.

One final thing: you may notice that the Open Data Map is now not quite as pretty. There’s a good reason for this. We noticed that we may not own data traced using Google Maps, so Colin has re-created all the data from the Ordnance Survey instead. There is slightly less detail, but the functionality is all still there.

The slides from my talk are available on EdShare. I’ve never uploaded to EdShare before — they’ve done a really great job at making a streamlined submit process. It’s far better than anything I’ve used in EPrints before, and I say this as the person who designed the EPrints 3.0 submit workflow!

Interview with Christopher Gutteridge

April 14, 2011
by Christopher Gutteridge

There’s an interview with Christopher Gutteridge (me!) on this week’s Ubuntu UK Podcast.

(If you’re wondering, data.southampton.ac.uk runs on a virtual machine running Ubuntu)

Actually, it’s worth giving a shout out to the technologies we use, but I’ll save that for a future post.

[April 1st Gag] PDF selected as Interchange Format

April 1, 2011
by Christopher Gutteridge

The following article is our prank for April 1st.

Just to be clear, PDF is a dreadful format to exchange data in. The prank was inspired, in part, by The Register website running the following picture and quote. Yes, I did say that, but I was talking about research and data communication.

It was fun working out how to make our site output PDF versions of the data, and we’ll leave those as available, but no longer the default. Also, I’ve now linked in the “.svg” format which is basically the same as the PDF.

Hopefully this gave a few people a chuckle.

*** *** ***

We have had many complaints that RDF is complicated, unsupported and makes it difficult to control how people will reuse your data.

With this in mind, we have taken a big decision: PDF (Portable Document Format) has been selected as our preferred format for exchanging data on the data.southampton.ac.uk site.

Many of the data.southampton team felt we should listen to the pro-PDF comments on the forum for the recent Register Article about Open Data in Southampton.

PDF is widely recognised as one of the most accessible document formats available today, and is ideally suited to both the publication and importing of data because of its ability to accurately maintain the layout of complex data sets in the browser on the desktop, and via printed hard copy. The immaturity of the Linked Data community means that there are still considerable technical overheads involved in the publication and use of data represented in less well supported formats, such as RDF or XML.

When we compared the number of search results for PDF with the number for RDF, the decision became far easier to justify.

Henceforth, the preferred method for both importing and exporting data from the site will be PDF. We will continue to provide other formats such as CSV & XML for the time being, but with a clear goal of removing these options as soon as is practical.

From May 1st onward we will only accept and export data in PDF and HTML formats. This allows us much more control and flexibility over how our data is presented. Data providers will be able to supply the Southampton OpenData team with data via PDF documents, or as printouts that we can scan and convert to PDF, and we will know exactly how to deal with it. To make things even easier, people will even be able to use the networked scanners anywhere on campus to directly upload data. Data providers at remote sites will be able to fax their data in.

As well as PDF, we are also working with owners of very large databases on an application that will allow them to dump their data into a view resembling a spreadsheet view; we will then republish this data via an interface a little like Google Maps. This will allow users to cast their eye over very large datasets and then zoom in to data values that look particularly interesting. We hope this will particularly enthuse library staff, as it is bringing a familiar micro-fiche style user interface to the web of open data.

Extending 4store

For now, we will be continuing to use 4store as our database server, but we have significantly improved on the default interface by adding a “PDF” output mode which users will find familiar.

Examples:

Our extension will be made available, on request, under an open source license.

PDF Descriptions of Resources

Many of the resources in the site will now be available to download as PDF in addition to HTML, just by changing “.html” to “.pdf”. Look out for the “Get the data!” box on many pages which will offer a link to the PDF format.

Module described in PDF
Where to buy booze (popular with some students!)

Real-time PDF data!

The most valuable data of all is accurate and up to date, and we are now able to do this in a way you’ve never seen before! We’ve already created an HTML page for every bus-stop in the city, but that’s only in HTML format, which is well known to be inferior to PDF.

http://data.southampton.ac.uk/bus-stop/SNA19777.html

Imagine you’re at a bus-stop and want to know when the next bus is. Now all you need to do is download the following link into your phone and view it in the mobile PDF viewer of your choice, and hey-presto! – realtime bus data direct to you on your handset!

http://data.southampton.ac.uk/bus-stop/SNA19777.pdf

Positive Reactions

So far all the feedback we have had has been massively positive. One user of data.southampton said

“I’m so glad they have done this, and it’s easy to switch too, all I needed to do was change a “R” to a “P” – simples!”

Professor Nigel Shadbolt and Professor Sir Tim Berners-Lee were unavailable to comment as they are currently at the WWW2011 Conference, but we are confident they will have a very strong reaction when they hear about the decision.

New Formats

March 25, 2011
by Christopher Gutteridge

New ways to enjoy our data.

We’ve added some links to the “Get the Data” box which let you see what formats are available. Some pages let you download RDF, others you can get back as tabular data, suitable for loading into Excel, amongst other things. Roughly speaking, pages about things have RDF versions, pages about lists of things (places, buildings etc) have a tabular download available.

eg.

Grasping the nettle and changing some URIs

March 24, 2011
by Christopher Gutteridge

We’ve realised that using UPPER CASE in some URIs looked fine in a spreadsheet but makes for ugly URLs, and if we’re stuck with them, we want them to look nice.

Hence I’ve taken an executive decision and renamed the URIs for all the Points of Service from looking like this

http://id.southampton.ac.uk/point-of-service/38-LATTES

to this

http://id.southampton.ac.uk/point-of-service/38-lattes

meaning the URL is now

http://data.southampton.ac.uk/point-of-service/38-lattes.html
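
As a rough Python sketch, the rename and the URI-to-page mapping for this example might look like the following. The helper names `lowercase_id` and `page_url` are made up, and treating “swap the host and append .html” as a general rule is an assumption based on this one example, not a description of how the site actually does it.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_id(uri):
    """Lower-case only the final path segment of a URI, e.g. '38-LATTES'."""
    parts = urlsplit(uri)
    head, _, tail = parts.path.rpartition("/")
    return urlunsplit(parts._replace(path=f"{head}/{tail.lower()}"))


def page_url(uri):
    """Map an id.southampton.ac.uk URI to its assumed HTML page URL."""
    parts = urlsplit(uri)
    if parts.netloc != "id.southampton.ac.uk":
        raise ValueError(f"not an id.southampton.ac.uk URI: {uri}")
    return urlunsplit(
        parts._replace(netloc="data.southampton.ac.uk", path=parts.path + ".html")
    )
```

Calling `page_url(lowercase_id(...))` on the old upper-case URI gives the new page address shown above.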

This actually matters, as these are going to become the long term web pages for the catering points of service, so aesthetics are important, and “If ’twere to be done, ’twere best done quickly”.

We’ve seen lots of visitors as a result of the Register Article: a 10x increase, which is nice.

I’ve just added in the lunchtime menu for the Nuffield. They are not yet quite taking ownership of their data, but that’s just a case of getting them some training. I’ve also talked today to the manager of the on-campus book shop to see if they want to list some prices and products. I’m thinking they could do well to list the oddball stuff they sell like memory sticks & backpacks.

Mostly I’m preparing to tidy up the back-end code — it needs to be a bit more slick and logical, more on this later.

Also today our very own Nigel Shadbolt is featured in the first ever edition of the Google Magazine. (It’s a PDF!)

We are featured in The Register

March 22, 2011
by Christopher Gutteridge

I recently had the slightly scary experience of giving an interview to the Register, along with my old friend John Goodwin. I appear to have made it onto the front page of the site, along with my comment about how much I hate to see people still using PDF to simulate A4 paper in documents never destined to be printed.

Knowing that The Register tends to quickly puncture pretentiousness, I did my best to be as straight-talking as I could. The article has come out well, but with slightly more colourful language than I’d have used talking to the BBC!

The Register: Southampton Uni shows way to a truly open web.

A question of policy

March 18, 2011
by Christopher Gutteridge

To make this site sustainable we’re going to have to work out some policies about scope. The student-run Southampton Open Wireless Network Group (SOWN) have produced a dataset about their wireless nodes, and the council has more data sources we could wrap into the site (eg. number of spaces in carparks).

This leads to a number of interesting policy questions which I’ve not got an easy answer for.

What data should we host on data.southampton.ac.uk (ie. allow it to be the primary source of the data and host a copy of the data dump)?
What data should we allow (or insist) to use id.southampton.ac.uk URIs?
Is data about the council a special case?
What data should we list as part of the data catalog?
What data should we import into the triple store?
What data should we recommend (via links)?

Right now it’s easy to say yes to lots of things, but we need to think about the future maintenance too.

I’m currently thinking that what we should do is, for now, say yes to council and other useful local data such as SOWN under questions ‘6’ and ‘5’ above only (recommending it via links and importing it into the triple store), with the intention later of having a 2nd ‘authoritative’ triple store which only imports our authoritative datasets.

SOWN is a good test case as it’s a grey area. It’s a university society run by university members, but certainly not part of the university administration. As it’s coming from the owners of the data it *is* authoritative, but it’s not authoritative AND published by University of Southampton.

Best dataset for the job

I’m also running into the question of how to divide data between datasets, for example I’ve got

points of service & opening hours for SUSU and catering provided from the catering manager
menus for catering points of service, provided by the catering manager
I’m hoping to get daily menus for a few catering points of service provided by the catering manager
I’ve got opening hours for the theatre bar provided by their manager
I’ve got menus for the theatre bar (from their menu!)
Opening hours for local amenities (provided by a small group of postgrad volunteers)
Student services points of service and hours, provided by the university student services and therefore authoritative
Waste & recycle points (currently run by the student volunteers but we hope to hand that over to the authoritative source)
Transport points such as the travel office, bike racks, parking etc. which were created by the student volunteers, but now are being curated by the data owner (the transport office).
List of vending machines, sourced from our contractors, via catering, and then annotated with building numbers by me.
Bus stops, taken from a list provided by the council.

It’s really hard to work out if these should be one dataset each, or if not how to deal with them. Do I move the data out of the amenities (student sourced) dataset when rows of data are taken over by the data owner? Should I have an ‘authoritative university of southampton’ dataset including everything that is thus, and a non-authoritative amenities dataset? Also, the bigger the dataset, the more often it’ll need to be republished.

I am almost certainly going to make the ‘today’s menu’ dataset separate due to it having to be updated daily.

A key reason to use separate datasets has been to filter things. I think it makes more sense to include this in the data itself than rely on the dataset. My current thinking is that we should rearrange the data to be based around provenance, so:

Authoritative Services including buildings & estates & catering and menus and vending machines.
Today’s Menus (because they change so fast), it’s a daily addendum to the previous set.
Nuffield Theatre Bar times & menus (authoritative, but not from the University)
Non-authoritative (Colin-sourced) amenities
Bus Stops
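
To illustrate what “include this in the data itself” could mean, here is a small Python sketch in which every record carries its own provenance, so filtering no longer depends on which dataset a row lives in. The `Record` fields, the `select` helper and the sample rows are hypothetical and are not the site’s real data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    label: str           # human-readable name (made up for this sketch)
    kind: str            # e.g. "point-of-service" or "price"
    source: str          # who supplied the row
    authoritative: bool  # True if supplied by the owner of the data


RECORDS = [
    Record("Vending machine", "point-of-service", "catering", True),
    Record("Theatre bar menu item", "price", "theatre bar", True),
    Record("Local amenity", "point-of-service", "student volunteers", False),
]


def select(records, *, kind=None, authoritative=None):
    """Filter on the provenance stored in each row; None means 'any'."""
    return [
        r for r in records
        if (kind is None or r.kind == kind)
        and (authoritative is None or r.authoritative == authoritative)
    ]
```

One consequence of this shape is that a single dataset can mix record types (prices and points of service) and still be filtered cleanly.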

Menus for the local coffee shop and the nearest pubs (Brewed Awakening, Crown, Stile) can be included in the non-authoritative datasets.

It leads to a change in some underlying technology for me as currently each dataset only contains one “type” of record, eg. a set of prices OR a set of points-of-service.

Hopefully once we settle on a workable pattern for this it’ll save other people making the same false starts we have.

Jargon File

March 15, 2011
by Christopher Gutteridge

I’ve added a new dataset:

University of Southampton Jargon Dictionary

It’s semi-crowd sourced; I’ll give any member of iSolutions, or other professional services, the ability to edit it. It could use a search tool similar to the phonebook, but we’ll get to that at some point.

Improvements to the Embeddable Map Tool

March 13, 2011
by Christopher Gutteridge

I’ve added an option for ‘terrain’ instead of map/satellite. This only works when a bit more zoomed out than the other views.

More importantly, I’ve added numbered placemarkers. This only works for buildings with a simple one or two digit number. If it ever becomes massively popular we’ll build a custom placemark generator.

Embeddable University of Southampton Map Designer

View an example: Full Screen

Where does the Money Go?

March 12, 2011
by Christopher Gutteridge

After many battles with Excel, pivot tables and the IBM “Many Eyes” site, I’ve had a go at visualising our Payments Dataset. I’m now an armchair auditor!

Please note that I am far from an expert in working with such data so the below graphs should not be considered “official” data from the university as I may have made mistakes in my processing. The data is not entirely complete as it contains no payments to individuals, and nothing commercially sensitive.

Here’s who we’ve paid money to in that dataset… I had to trim the data down to payments of £10K+ as otherwise it seemed to crash their Java!
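
If you’d rather do the trimming in code than in a spreadsheet, a minimal Python sketch might look like this. The payee names and amounts are invented, and reading “payments of £10K+” as a threshold on each individual payment (not on a payee’s total) is my assumption here.

```python
from collections import defaultdict

THRESHOLD = 10_000  # pounds; assumed to apply to each individual payment


def trim(payments, threshold=THRESHOLD):
    """Keep only (payee, amount) rows at or above the threshold."""
    return [(payee, amount) for payee, amount in payments if amount >= threshold]


def totals_by_payee(payments):
    """Sum the amounts paid to each payee."""
    totals = defaultdict(int)
    for payee, amount in payments:
        totals[payee] += amount
    return dict(totals)


# Invented sample rows, whole pounds.
SAMPLE = [
    ("Supplier A", 12_000),
    ("Supplier B", 500),
    ("Supplier A", 30_000),
    ("Supplier C", 10_000),
]
```

Running `totals_by_payee(trim(SAMPLE))` gives one total per payee, which is the shape a chart tool typically wants.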

This shows a break down of the broad categories and sub categories of what we paid money for.

I hope that we’ve got some budding statisticians, accountants or data visualisers who can do something better than me!

One cool idea; find out what payees we have in common with the local hospitals and council:

News and Ideas from the Southampton Open Data Team

Southampton Open Data Blog