

Data.Soton Enterprise Edition

(or One big fat “before” picture)

Over the next year we plan to turn http://data.southampton.ac.uk/ into a lean, mean open data publishing machine. Right now it’s a bunch of scripts on a Linux VM. It works OK, but it’s hardly a poster-child for “Enterprise Systems”. A key part of the new design for IT at Southampton is that when my team innovates a new system, such as the Open Data Service, and it is successful, we need to get it into a state where it can become part of the normal IT infrastructure of the organisation.

This is all a bit new to me, and I don’t know exactly what I’ll need to learn, but I’ve been talking to the people at the Software Sustainability Institute, and I think it’s useful to document the process in public. To be really useful, I’m going to have to be a bit honest about some of my own poor practices.

So here’s where we stand at the beginning of 2012:

Platform

The open data service is run on a single Linux Virtual Machine. The machine is backed up. The non-standard bits I’ve installed are 4store, rapper, Graphite & Grinder.

Publishing Process

The main publishing config directory contains a sub-directory for each dataset. A script called publish_dataset republishes a named dataset. This script is in GitHub (under the Grinder project), which seemed like a good idea at the time. It’s scruffy, but it’s been written with an eye to the future, so is quite modular. The configuration files are not in GitHub, but a snapshot is created when a dataset is published. Example. The goal of that was to let other people see what we were doing. It’s crazy for me not to have version control on these configuration files!

Data Catalogue

This is a single dataset describing all the other datasets. It was a mistake. I now believe I should take the entry for each dataset and save it as configuration alongside the rest of that dataset’s configuration.

SPARQL Databases

There are two instances of 4store running on the box: one for http://sparql.data.southampton.ac.uk/, and a second which powers http://rdf.ecs.soton.ac.uk/sparql/

Of course, it’s a bit more complicated than that. The main data.soton SPARQL endpoint has an ARC2 PHP endpoint which is what the public sees. This passes requests to the 4store endpoint, but adds some nice features such as CSV & SQLite output formats. The catch with this is that it doesn’t spot the <!-- comment --> in the 4store results that warns you that it’s hit the ‘soft limit’ (which prevents big searches killing the machine). Annoyingly, SPARQL doesn’t have a way to pass a warning message with the results.
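
Spotting the comment isn’t hard at the HTTP level, mind; the problem is having nowhere standard to put the warning. Here’s a minimal sketch of the kind of check the wrapper could do, assuming an internal endpoint URL and that the warning turns up as an XML comment in the body (both are illustrative, not our real config):

#!/usr/bin/perl
# Hypothetical sketch: run a query against the internal 4store endpoint
# and flag any XML comment in the results (4store uses one to warn that
# the soft limit was hit). The URL and the comment format are
# assumptions for illustration.
use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape;

my $query = 'SELECT * WHERE { ?s ?p ?o } LIMIT 10';
my $url = 'http://localhost:8000/sparql/?query=' . uri_escape( $query );

my $response = LWP::UserAgent->new->get( $url );
die "SPARQL request failed: " . $response->status_line if !$response->is_success;

my $xml = $response->decoded_content;
if( $xml =~ /<!--\s*(.*?)\s*-->/s )
{
    # Nowhere standard to put this in the SPARQL results, but the wrapper
    # could at least log it or add an X- header.
    warn "4store warning: $1\n";
}
print $xml;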

There’s one more SPARQL endpoint, which also sits on top of the real data.soton 4store endpoint, but caches results for a few minutes. I think this is used for most of the public website, but I’ve got in a bit of a muddle about it.

Four SPARQL endpoints on one machine is why it got named “Edward”, as it’s so SPARKLy.

Public Website

The website uses a scary .htaccess file to make /building/****.**** redirect to a PHP file which handles rendering buildings (and many other types of thing). The oldest ones use lots of SPARQL queries, and have multiple scripts for different output formats (HTML, RDF etc.). The newer ones use the new Graphite “Resource Description” feature to get the data and are much easier to read. There’s a PHP library which renders the sidebar and captures any SPARQL queries used to build the page (it doesn’t capture queries from Graphite, but those are not as interesting as they are auto-generated and almost unreadable). It also has a register of datasets used to generate the page, but this is hand-created in the PHP viewer so is often a bit inaccurate. I had a go at capturing the graphs used to make a page, but it’s tricky and puts strain on the database.

id.southampton.ac.uk

This is critical. This virtualhost does nothing but resolve URIs, redirecting them to an RDF or HTML page. It could stand to be smarter. I am a big advocate of keeping URI resolution entirely separate from the data site.
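
To illustrate (this is a hypothetical sketch, not the real virtualhost config), the whole job boils down to something like:

#!/usr/bin/perl
# Hypothetical CGI-style sketch of what the resolver boils down to: 303
# a URI to either the HTML page or the RDF document, depending on the
# Accept header. The target URL patterns are invented for illustration.
use strict;
use warnings;

my $path   = $ENV{REQUEST_URI} || '/building/32';
my $accept = $ENV{HTTP_ACCEPT} || '';

my $target = ( $accept =~ m{application/rdf\+xml} )
    ? "http://data.southampton.ac.uk$path.rdf"
    : "http://data.southampton.ac.uk$path.html";

print "Status: 303 See Other\n";
print "Location: $target\n\n";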

southampton events diary

This is a work in progress which will replace the university’s public events diary. It aggregates feeds from all over and combines them into a single RDF document, then puts a pretty web interface on top. This will be quite high profile, but it’s currently a little fragile, and if a single feed stops working it’s hard to spot. It’s being developed by a final-year postgrad, which is great as I don’t have time, but means I understand it a little less.

It is loosely coupled to the main data site. It could run on a stand-alone box, but it enhances its data with the university place and division hierarchy, and data.southampton pulls the dataset in once an hour.

maps.southampton.ac.uk

This is a little tool written using the Google Maps API. It has some issues now, such as not knowing to filter out demolished buildings!

The site also hosts a generic map search app which is a very powerful demo of data reuse.

energy data

We should be getting a spreadsheet with energy readings from a bunch of meters, but after working for a few weeks it broke at their end, and it’s not been high-enough priority to get sorted. My impression is that the underlying software is a bit old and flaky, but there’s no other option that I can see. Their underlying database contains raw data which needs quite a bit of processing to turn into a simple kWh value. I don’t know if I should give up and remove the graphs from the building pages now it’s been broken for over a month.

bus times and routes

This has been the highest-profile dataset, and inspired the most apps, but it is sort of odd to host it on the university server. The bus stops and bus routes were screen-scraped a while ago and might be out of date, but nobody’s complained. I suspect I’ll have to look into that at some point, however.

I still haven’t been able to get timetables out of anybody, just the live feed for a stop, which is done by screen-scraping and caching for 30 seconds. The government have made noises about this kind of data being available nationally, but we’ll see.
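
The caching part is nothing clever; a sketch of the sort of wrapper involved, with a made-up feed URL and cache path:

#!/usr/bin/perl
# Hypothetical sketch of the 30-second cache around the live departures
# screen-scrape. The feed URL and cache path are invented for illustration.
use strict;
use warnings;
use LWP::Simple qw( get );

my $stop_id = shift @ARGV or die "Usage: $0 <stop-id>\n";
my $cache   = "/tmp/busstop-$stop_id.html";
my $source  = "http://example.org/live-departures?stop=$stop_id";

# Re-scrape only if the cached copy is missing or older than 30 seconds
if( !-e $cache || ( time - ( stat $cache )[9] ) > 30 )
{
    my $html = get( $source );
    die "Failed to fetch $source\n" if !defined $html;
    open( my $out, '>', $cache ) or die "Can't write $cache: $!";
    print {$out} $html;
    close $out;
}

open( my $in, '<', $cache ) or die "Can't read $cache: $!";
print do { local $/; <$in> };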

Cron Jobs

There’s a bunch of cron jobs which import and publish some datasets. The eprints one doesn’t seem to work properly and I’ve not finished unpicking why.

There’s a monitoring script which Dave Challis wrote which keeps an eye on the 4store instances and does something if they look peaky. I don’t fully understand what, yet.

Vocabularies

I’m often asked what vocabularies datasets use, and I really have not documented it well. Each dataset should have at the very least a list of predicates, classes and some example entities — most of that is probably automatable.
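
Most of that really is automatable; a sketch of the sort of thing I mean, shelling out to curl from Perl against the public endpoint (the graph URI is just an example, and it assumes the endpoint accepts a POSTed query parameter):

#!/usr/bin/perl
# Hypothetical sketch: list the distinct predicates and classes used in
# one dataset's graph. The graph URI is an example, and it assumes the
# endpoint accepts a POSTed 'query' parameter.
use strict;
use warnings;

my $endpoint = 'http://sparql.data.southampton.ac.uk/';
my $graph    = 'http://id.southampton.ac.uk/dataset/places/latest';

for my $query (
    "SELECT DISTINCT ?p WHERE { GRAPH <$graph> { ?s ?p ?o } }",
    "SELECT DISTINCT ?class WHERE { GRAPH <$graph> { ?s a ?class } }" )
{
    print `curl -s --data-urlencode 'query=$query' '$endpoint'`;
}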

Continuity of Data

I’ve still not got everything flowing smoothly, but it’s improving. The “organisation” dataset now has a weekly feed direct from HR, but many things which should be automated are not. These include: the phonebook, the list of courses & modules, international links, buildings occupied by faculties and academic units… the list is long. Some lists require hand intervention before publishing so can’t be fully automated; for example, the list of “Buildings” from the Planon system run by Buildings & Estates lists the stream as a building. I figure in time more web pages will be built from this data and the data owners will want to take ownership. This is already happening, e.g. the transport office has been updating our list of Bike Racks.

Most of these things either change slowly or already exist in a database elsewhere, so my goal is to hook up the database feeds where I can, and just make the by-hand corrections to the slow-changing data. It’s certain that some of it is going out of date. I need to make it easy for people to feed back corrections; done right, that’s a benefit to the organisation.

Identifying People

The university has many schemes which identify people: Username, Email, Staff/Student ID, EPrints.soton Code etc., but there’s no nice, easy way to produce an identifier for a person. I’m currently using a hash of their email, but it’s unsatisfactory.
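
For the record, the current scheme amounts to something like the sketch below (SHA-1 and the URI pattern are illustrative assumptions, the real details may differ):

#!/usr/bin/perl
# Roughly what the current (unsatisfactory) scheme looks like: derive an
# opaque person identifier from an email address. SHA-1 and the URI
# pattern are illustrative assumptions, not the documented scheme.
use strict;
use warnings;
use Digest::SHA qw( sha1_hex );

sub person_uri
{
    my( $email ) = @_;
    $email = lc $email;
    $email =~ s/^\s+|\s+$//g;  # normalise before hashing
    return 'http://id.southampton.ac.uk/person/' . sha1_hex( $email );
}

print person_uri( 'someone@soton.ac.uk' ), "\n";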

Odds and Sods

There’s a directory on the website which contains little works-in-progress and proofs of concept. It doesn’t really belong on the production server.

Feedback System & Bugtracker

When we set up the system we were very rushed, and installed a tool for giving feedback, but it’s not a public discussion, so we’d have been better off with email. It’s yet another thing to check, so I often forget it for big chunks of time, then find 50 spams and a couple of good ideas or impossible dreams. I think this system needs turning into something more functional, but I’m not sure what.

Politics & People

Currently only I really know the guts of how this system works. Dave Challis knows some, and knows far more about the 4store set-up than me, but he left at the start of the year to work for the company that makes 4store; in a dire emergency I can still get in touch, as we’re good friends.

The university has made it very clear to me that this is an ongoing concern, and will be part of the future of the university, but that it can’t be the top priority, which I think is entirely fair.

The big concern right now is that if I go under a bus, the system would be at great risk of just decaying and dying because the costs of someone learning enough to support and extend it are too high. The ultimate goal should be to clean the system up enough to make it a configurable platform of which data.southampton is an instance.

Sharepoint vs Google Docs

The university is now using SharePoint for lots of things, and it’s actually not a bad option for collecting tabular data from people to turn into RDF; it allows pick-lists etc., which Google Docs does not. The problem is that we don’t have a very strong policy on where to put such things yet, as our SharePoint experts are very busy working on the existing to-do list.

Publishing on Demand

It would be very useful if certain datasets could be republished by the data owner when they’ve updated the data in SharePoint or Google, e.g. the daily menus from the catering dept. I’ve been thinking of creating role-based passwords so I can just give them a URL and password to see a big ‘republish dataset-X’ button. I’ve got as far as registering admin.data.southampton.ac.uk and thinking about HTTPS.
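
A sketch of what the back end of that button might look like, with invented file locations and parameter names, and the existing publish_dataset script doing the real work:

#!/usr/bin/perl
# Hypothetical sketch of the 'republish dataset-X' button's back end:
# check a per-dataset role password, then kick off the existing publish
# script. File locations and field names are invented for illustration.
use strict;
use warnings;
use CGI;

my $cgi      = CGI->new;
my $dataset  = $cgi->param( 'dataset' )  || '';
my $password = $cgi->param( 'password' ) || '';

die "Bad dataset name" if $dataset !~ m/^[a-z0-9_-]+$/;

# One "role password" per dataset, kept out of the web root
open( my $fh, '<', "/etc/opendata/republish-passwords/$dataset" )
    or die "No republish password configured for $dataset";
chomp( my $expected = <$fh> );
close $fh;

print $cgi->header( 'text/plain' );
if( $password eq $expected )
{
    system( '/usr/local/bin/publish_dataset', $dataset );
    print "Republished $dataset\n";
}
else
{
    print "Wrong password for $dataset\n";
}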

Where next?

  • Version Control of all non-confidential scripts and configuration files.
  • More documentation!
  • Move data-set metadata into the dataset itself, not a custom dataset catalogue-dataset
  • publication script needs to be much better engineered
  • More automated publication and a republish-now button
  • Testing Server, and a way to transfer changes to/from live server.
  • Remove cruft from live server
  • Better feedback system
  • monitoring of systems and processes to detect load and failures
  • Learn more about what the normal iSolutions platform policy is.

Posted in Best Practice.



Graphite 1.5 Released

There’s some exciting new features in v1.5 of Graphite:

  • Added “resource-description” which allows you to create subgraphs, JSON trees and extract graphs from a SPARQL endpoint without needing to learn SPARQL!
  • Added dumpText() for command-line debugging/resource inspecting.
  • Added prettyLink() and link() for easier HTML creation, and supporting functions for setting mailto: and tel: icons.
  • Added functions to set additional relations to be considered for label()
  • Added datatype() and language() to get those values from a literal.
  • addTriple() and addCompressedTriple() allow individual triples to be added to the graph.
  • freeze() and thaw() make it possible to cache an indexed graph to disk for faster load times.

That first one is the really cool one; it makes it really easy to build HTML pages from SPARQL, and expose the data as RDF and JSON documents.

And I’ve spent my Sunday morning getting it all documented!

Posted in Graphite.


HTTP Content Negotiation could do better

I’ve long been frustrated with HTTP content negotiation. It doesn’t do what I need.

If you’ve never encountered this, when a web request is made there’s an optional header, something like

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Which says what formats the client (you) is able to accept, and its preferences. The start of this blog post gives a fuller explanation. There’s also a way to negotiate what languages you prefer your content in, and a proposed system for asking for versions from a specific date or time.

My annoyance is specifically with the fact that what you ask for is very human-web-browser centric. You ask for formats your system is capable of accepting, not what you actually want. Why does this annoy me? Because I have to give variations on the server a custom MIME type if I want to content-negotiate for them specifically. All formats I work with are viewable as text, and most are XML, but if they have some wacky MIME type, like application/x-southampton23, then nothing else will understand it as XML, which is just annoying. For example, an RSS 1.0 feed is application/rss+xml, according to StackOverflow. However it’s also valid application/rdf+xml, text/xml and text/plain.

Servers should handle MIME inheritance

I feel like content negotiation is missing a bit of inheritance. i.e. If I ask for text/xml or text/html and the server has only application/x-southampton-xml-3 available for that resource, then it should give me that document but tell me it’s text/xml, which it is, and my web browser would display it as XML.

Imagine the web browser walking into a fancy restaurant and ordering soup. The waiter brings over a dish and says, “here you are sir, Consommé”. The browser refuses to eat it because it doesn’t know what “Consommé” is.

Now let’s run that again, with a less pretentious waiter (in this analogy, the waiter is the web server). The soup is ordered and the waiter says “Here you are sir, Soup”. Which is not only true, but it’s certain to be understood by the customer, who eats their Consommé saying “mmm, nice soup”.
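
In code terms, the less pretentious waiter might do a lookup something like this sketch; the parent relationships in it are my own assertions, not anything standardised:

#!/usr/bin/perl
# A sketch of the lookup a server could do if MIME inheritance existed.
# The parent relationships below are my own assertions for illustration,
# not anything standardised.
use strict;
use warnings;

my %parent = (
    'application/x-southampton-xml-3' => 'text/xml',
    'application/rss+xml'             => 'application/rdf+xml',
    'application/rdf+xml'             => 'text/xml',
    'text/xml'                        => 'text/plain',
);

# Walk up from the most specific type the server has until we find one
# the client's Accept header admits.
sub negotiate
{
    my( $available, $accept ) = @_;
    my %accepts = map { ( split /;/ )[0] => 1 } split /\s*,\s*/, $accept;
    my $type = $available;
    while( defined $type )
    {
        return $type if $accepts{$type};
        $type = $parent{$type};
    }
    return undef;  # nothing acceptable, even generalised
}

print negotiate( 'application/x-southampton-xml-3',
    'text/html,application/xml;q=0.9,text/xml;q=0.9' ), "\n";

That prints text/xml: the client gets the wacky document, but labelled as something it already understands.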

Servers should handle ‘abstract’ MIME types

The other very useful thing would be to expect browsers to understand abstract MIME types, which have no specific serialisation, but a number of sub-variants. For example:

Accept:application/rdf+xml,application/rdf;q=0.9

Where application/rdf is a super-class for all RDF serialisations. The above header line *should* say that I want RDF+XML, but failing that any RDF serialisation will do.

Research Data File Formats

Describing the various properties of a file containing data output from a research activity will also require some richer definitions, but maybe not in MIME. I’m still thinking about this, but I think it would be best to describe files, and sets of files, by things which they conform to. MIME types could be part of this, but also what the data describes. For example, one record might require the following ‘tags’ to be usefully discovered:

Single File, XML File, CML File, Describes Single Molecule, Describes Crystal, Describes Organic Crystal, Reuse License allows Attribution-Only reuse

Admittedly chemists are already doing pretty well in this field, and maybe I’m trying to solve too general a case…

Posted in HTTP, Research Data.


All-Things-Of-Type-X an Anti-pattern?

When I was developing the Graphite PHP Library I added a simple function called $graph->allOfType( $type ) which would return a list of all the things of a given type in the current graph. For example the list of all foaf:Person or all Buildings.

It’s also very tempting to do this when presented with a SPARQL endpoint, and a totally legitimate thing to do when exploring the data.

However…

Leaving applications lying around which use this, either as SPARQL or otherwise, is a ticking time bomb. Here’s an example:

At the university, I’ve got a list of all the buildings in our endpoint. So I can get all our buildings by doing this query:

(Note that rooms: is just a prefix for a vocabulary to describe rooms and buildings, sorry if that’s confusing)

SELECT ?thing WHERE { ?thing a rooms:Building . }

OK, that’s great. But as our system has grown we’ve now got some buildings which are in the data but no longer part of the university estate, and I’d like to add some buildings which are in the city, but I can’t, because my stupid naive coding will assume all buildings in the store are our buildings.

Easily Solved

The solution is to add some semantics to say what I really mean, which is to have a list of buildings which are occupied by the University of Southampton. I guess I just need to add:

  <http://id.southampton.ac.uk/building/32> <http://vocab.deri.ie/rooms#occupant> <http://id.southampton.ac.uk/> .

(That last URI is the identifier for the university). Then the query becomes:

SELECT ?thing WHERE { 
   ?thing a rooms:Building . 
   ?thing rooms:occupant <http://id.southampton.ac.uk/> .
}

Which is a tiny bit more work, but much more future-proof.

App Builders are Lazy

Well, I am. So people will only do just enough to make things work. The first version of an app may well use a naive all-things-of-type-X query, as it’ll solve their immediate problem. When it starts to break, they’ll look for a new pattern, so it’d be good if data providers made sure there are solid triples giving these simple facts.

There’s a really cheap-and-cheerful alternative solution, which is to solve the problem at the ‘graph’ level, i.e. state that all the buildings in triples you get from <http://id.southampton.ac.uk/dataset/places/latest> have certain properties. This is handy for hacking, but lousy for data aggregation.

This came about as I was thinking about making a version of my building finder web-app which would aggregate together buildings from Southampton, Oxford & Lincoln. I realised I could easily do that, but it doesn’t give me a way to indicate who each building belongs to as the assumption breaks down when we merge the data.

In the short term, there’s some value in agreeing that an app accepts a target RDF URL, and should show everything it understands. If there’s a SPARQL endpoint then maybe the owners need to write a CONSTRUCT query to give that app what it needs. This isn’t an ideal solution, but it works.

I think right now it’s just important for us to notice what assumptions we make. RDF & SPARQL are not like Tables & SQL. There’s some new techniques to learn…

Posted in Best Practice, RDF.


Firing Range-14

So there’s a call for suggestions to fix/replace httpRange-14.

If you’re not familiar with this: the basic situation is that we use URIs to represent real-world things, and if you resolve them they give a “303 See Other” to redirect you to a document of interest, presumably about the subject you asked about.

eg.

I, the person, am identified by URI: http://id.ecs.soton.ac.uk/person/1248

My profile page, an HTML document, is identified (& located) by URL: http://www.ecs.soton.ac.uk/people/cjg

My FOAF Profile, an RDF+XML document containing machine-readable facts about me is identified (& located) by URL: http://rdf.ecs.soton.ac.uk/person/1248

If you resolve my URI in a web browser, it’ll pop up my profile page. You can see how by typing this on the UNIX or OSX terminal:

curl -I http://id.ecs.soton.ac.uk/person/1248

The response will be more or less like this:

HTTP/1.1 303 See Other
Date: Wed, 29 Feb 2012 23:52:16 GMT
Server: Apache
X-Powered-By: PHP/5.3.3
Location: http://www.ecs.soton.ac.uk/people/cjg
Connection: close
Content-Type: text/html; charset=utf-8

Which means go look at the URL in the “Location:” bit. The 303 indicates “See Other” rather than the normal “Moved”, which implies it might not actually be the same thing.

For added complexity, if you tell the web server you prefer to get RDF+XML documents, by typing

curl -H'accept:application/rdf+xml' -I http://id.ecs.soton.ac.uk/person/1248

You get back

HTTP/1.1 303 See Other
Date: Wed, 29 Feb 2012 23:54:39 GMT
Server: Apache
X-Powered-By: PHP/5.3.3
Location: http://rdf.ecs.soton.ac.uk/person/1248
Connection: close
Content-Type: text/html; charset=utf-8

This is bloody hard for people to get their heads around, and not obvious unless you really grok how the web was designed. However, for me, the real screw-up in all of this is using “http:” to represent something which isn’t a document… It’s not like we weren’t already using http: https: gopher: ftp: urn: mailto: tel: etc. (OK, nobody remembers gopher.)

I think it’s daft to use the same protocol to uniquely identify real-world objects AND documents on the web. I have to explain this again and again to each person learning RDF, and it won’t take off if people can’t figure it out for themselves, like HTML, JSON, XML etc.

Schema.org

If you’ve not yet seen schema.org: it’s a website which presents a schema for information friendly to search engines. It mostly doesn’t identify ‘things’ at all, just defines a structure and literal properties of items in that structure (e.g. start time of an event, name of a person). I hear it uses URLs to identify things, which isn’t as crazy as it sounds if you define the relationships correctly. e.g.

<http://www.soton.ac.uk> *hasMember* <http://www.ecs.soton.ac.uk/person/cjg> .

That’s an utterly reasonable statement if *hasMember* is defined as meaning “the group or organization which is the primary topic of the first document has a member which is the primary topic of the second document”. It’s ugly, but entirely semantically sane. In slightly more formal terms:

?X *hasMember* ?Y

implies

?X foaf:primaryTopic ?X-topic .
?Y foaf:primaryTopic ?Y-topic .
?X-topic foaf:member ?Y-topic .

My proposal: infra:

UPDATE 3: So it turns out that just like my ‘primaryTopic.net’ namespace idea, this is also an idea that’s been suggested before, in far more careful detail: tools.ietf.org/html/draft-masinter-dated-uri-10

So my analysis stands, but read it as applying equally to the tdb: (thing-described-by) scheme described in the above link.

I admit I’ve not got the 10 years of literature review that some of the community have, but can’t we just do:

infra:http://www.ecs.soton.ac.uk/person/cjg

and specify that http://www.ecs.soton.ac.uk/person/cjg is assumed to be a document about that thing, and it could optionally content-negotiate if it wants.

Effectively, there’s a standing definition that <XYZ> foaf:primaryTopic <infra:XYZ> .

NOTE: My first draft used “resource:” not “infra:” but that was very muddling to type in an RDF+XML document. I don’t really care about the choice of name, just the approach.

Pros:

  • Visible distinction between Document & Non-Information URIs
  • Does not invalidate http: URIs, just provides a better method
  • Allows URIs to be created from popular websites without formal buy-in; e.g. infra:http://www.imdb.com/title/tt0133093/ or infra:http://xkcd.com/327/
  • Should not break existing software, such as triple stores.
  • Allows a bridge to the schema.org approach (refer to things by a URL which describes them)
  • You can still use content negotiation on the URL to give back HTML or RDF.
  • Provides similar functionality to “&” and “*” operators in C
  • Allows existing URLs to be cleanly used as identifiers in a semantically correct way.
  • Works with # elements in documents, eg. infra:http://en.wikipedia.org/wiki/University_of_Southampton#Malaysia_Campus

Cons:

  • Will require some trivial changes to existing systems to allow them to resolve these URIs into additional data.
  • Current URIs may still confuse new users as they start with http://
  • It is entirely reasonable to have infra:infra:infra:http://totl.net/ but that’s going to traumatise anybody who didn’t absorb C pointer de-referencing through the skin in their formative years.
  • Obviously, my ability to identify cons is limited by proximity.
  • People might just slap “infra:” on the front of everything, even standard URIs.
  • “it’s not great for sites with high traffic; tends to encourage conflation with REST. Be nice if could message intent.” – from @derivadow

I doubt I’ve got the whole picture, but in that statement lies much of the problem. I’m now definitely an expert, and I still don’t ‘get’ the subtle issues. If the linked-data web is going to work, we’ve got to make it workable by the hacky pragmatists who didn’t make their RSS feeds valid XML, just made sure they worked in a few major readers. They aren’t jerks, they just have different priorities to us university types!

UPDATE 1:

I’ve created an example FOAF profile using this approach. It uses a mixture of ‘traditional’ URIs and infra: URIs, and ARC2 and Graphite seem to be fine with it. A stricter test is the W3C validator, and it passes that too, so it won’t break existing software, except for requiring a quick fiddle to make the URIs resolvable, which should be simple enough.

I’ve also changed the protocol name from “resource:” to “infra:”.

The code to create the implied triples from infra: URIs is trivial; running the previous FOAF example through a scrap of PHP produces this version with primaryTopic relations injected.
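
The original scrap was PHP and isn’t reproduced here, but a rough Perl equivalent, assuming the data is in N-Triples, is about this much:

#!/usr/bin/perl
# Not the original PHP, but a rough equivalent: read N-Triples on STDIN
# and, for every infra: URI seen, emit the implied foaf:primaryTopic
# triple from the document to the thing.
use strict;
use warnings;

my %seen;
while( my $line = <STDIN> )
{
    print $line;
    while( $line =~ m/<(infra:(http[^>]*))>/g )
    {
        my( $thing, $doc ) = ( $1, $2 );
        next if $seen{$thing}++;
        print "<$doc> <http://xmlns.com/foaf/0.1/primaryTopic> <$thing> .\n";
    }
}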

UPDATE 2:

On reflection ‘primary topic’ might be too loaded and a different predicate may be more appropriate. It doesn’t really matter to the basic idea.

Posted in RDF.


With apologies to Faith Lawrence

I’ve just added the new events-diary dataset to http://data.southampton.ac.uk/ and clearly went to sleep thinking about events RDF data…

As discussed in the last post, my friend Faith is doing real research using linked Shakespeare data. This post is just me getting something out of my system; it’s just an exercise and not intended as a serious example.

SELECT ?meet_time ?meet_weather_conditions WHERE {
  ?event a event:Event .
  ?event event:agent
    <http://id.macbeth.org/person/WeirdSister1> ,
    <http://id.macbeth.org/person/WeirdSister2> ,
    <http://id.macbeth.org/person/WeirdSister3> .
  ?event event:time ?t .
  ?t tl:begin ?meet_time .
  OPTIONAL {
    ?event2 a event:Event .
    ?event2 event:factor <http://id.macbeth.org/themes/hurlyBurly> .
    ?event2 event:time ?t2 .
    ?t2 tl:end ?hb_event_end .
    FILTER ( ?hb_event_end > ?meet_time )
  }
  FILTER( !bound( ?event2 )  )

  ?battle a dbpedia-owl:MilitaryConflict .
  ?battle event:time ?t3 .
  ?t3 tl:end ?battle_event_end .
  FILTER ( ?battle_event_end < ?meet_time ) 

  ?event event:place <http://id.macbeth.org/place/heath> .
  ?event event:agent <http://id.macbeth.org/person/macbeth> .
  <http://id.macbeth.org/days/1> <http://id.macbeth.org/ns/sunsetTime> ?sunset_time .
  FILTER ( ?meet_time < ?sunset_time ) 

  ?event <http://id.macbeth.org/ns/weather> ?meet_weather_conditions .
}

I’m actually surprised not to easily find a predicate to map a day to a time of sunset. It would have to be day-in-region, to be meaningful, of course.

Posted in SPARQL.


Two Girls, One Conversation

or “Excluding results from SPARQL by their relation to a tainted set”.

So, in the bar last night my old friend Faith was telling me about a SPARQL problem she had. She has a dataset describing the statements of a Shakespeare play, and she wanted to test it against the Bechdel Test.

The challenge she had was that she wanted to exclude results if they had a single dcterms:subject relationship to an entity of type “Man”. It’s really easy to do the opposite, to include items which have such a relation, but excluding is a more confusing pattern.

What she had (roughly):

?conversation ns:hasParticipant ?char1 .
?conversation ns:hasParticipant ?char2 .
?char1 a ns:Woman .
?char2 a ns:Woman .
FILTER( ?char1 != ?char2 ) . 

?conversation ns:hasTopic ?topic .
?topic a ?type .
FILTER( ?type != ns:Man )

Now this last bit wasn’t doing the right thing. What it actually does is

1. Get a list of all combinations of ?char1, ?char2, ?conversation, ?topic, ?type

2. Filter out the unwanted ones. But if a conversation has two topics, and one of them is about a man, then only the row with that topic gets filtered out; the rows for the conversation’s other topics survive, so the conversation still shows up.

It took me and Dave Challis (who was at the next table in the same bar, luckily) a while to work out the correct syntax to exclude all results when a relationship exists to X, even if there are other relations of the same type to different things. That last bit should have been:

OPTIONAL {
  ?conversation ns:hasTopic ?topic .
  ?topic a ns:Man .
}
FILTER( !bound( ?topic ) )

This feels a bit backwards, but what it does is return all the ?char1, ?char2 and ?conversation combinations as normal, and for each row also return a topic about a man, if one is available. Then, if one was available, we don’t want that row, so we filter it out.

It’s a useful pattern and was not obvious to a bunch of smart, motivated people, so I figured it was worth writing about.

Are there any other useful patterns like this in SPARQL?

Posted in RDF, SPARQL, Uncategorized.



Getting Live Google Spreadsheets

So, I’ve been using Google Spreadsheets as a way to let staff easily maintain data which gets passed through the Grinder to make RDF data. The problem is that I do it using the “Publish to web” option, which makes a public version of the document available at a URL. All fun and dandy, but that URL doesn’t update the moment you modify a cell, it updates every 5 minutes or so. ish. more-or-less.

This isn’t ideal as we want a situation where a staff member can finish editing some data and hit a ‘republish on datasite’ button (on data.southampton.ac.uk somewhere behind a password) which immediately downloads the latest version and converts that.

It seems that if you download the document with the API using a username/password you get a live copy — yay!

A bit of fiddling and I can do it with curl & Perl.

So, gotcha one: you have to do it in two steps; the first gets an authentication token, the second asks for the file.

The next gotcha is you have to specify the service, and Google uses codes for these which are… not intuitive.

The last gotcha was that downloading a spreadsheet needs a different service key (“wise”) than getting the information about documents (“writely”). See what I mean about the daft codes?

Anyhow, the following code is placed in the Public Domain without warranty in the hope it’ll be useful.

I’ll probably rewrite it properly one day, but just knowing this is possible gets me a step closer to the Chef publishing the staff restaurant menu each day!

#!/usr/bin/perl
use strict;
use warnings;

my $doc = '0AqodCQwjuWZXdDZqcm0tYmFGMVpDOG1obnctUXdhb0E';
my $format = 'tsv';
my $username = 'open.data.southampton@gmail.com';
my $password = 'XXXXXXXXX';
my $gid = 0;

# Step 1: get an authentication token from Google's ClientLogin service
my @result = `curl -s https://www.google.com/accounts/ClientLogin --data 'Email=$username&Passwd=$password&scope=https://spreadsheets.google.com/feeds/&service=wise&session=1'`; 
my $auth;
foreach my $line ( @result )
{    
   if( $line =~ s/^Auth=// )
   {
     chomp $line;
     $auth = $line;
   }
}
die "Failed to authenticate $username" if( !defined $auth );
$auth =~ s/[^A-Za-z0-9_-]//g; # sigh, better safe than sorry.

# Step 2: use the token to download a live export of the spreadsheet
print `curl -s -H 'Authorization: GoogleLogin auth=$auth' 'https://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=$doc&exportFormat=$format&gid=$gid' `;

Posted in Command Line, Perl.



Linux, SharePoint & Perl

Sharetoperl: a half-shark, half-octopus creature created for the University creates a whole lot of terror in Southampton, while a computer scientist who helped create it tries to capture/kill it.

(alternative title: The Good, The Bad and the Ugly)

At the University of Southampton we’ve jumped into SharePoint in a big way. It’s a grumpy beast, and doesn’t think like me, but it’s got some things going for it. Currently I’m working with SP2007 but I’m told SP2010 is less insanely IE/Windows centric. Which wouldn’t be hard. There’s options in SP2007 which only work in Internet Explorer on a Windows box – yuck!

That said, it’s a good way to create sets of documents with rich controls over which people & groups can read and write them.

It does Calendars, but they’re a bit of a joke. Utterly Microsoft-centric so I’m just ignoring them. There’s probably ways to make them better, but I’ll worry about it another day.

The other thing SharePoint does is things called “Lists”, which is a kind of catch-all term for, well, lists of stuff, including the contents of a document library, calendars (I think), and data-ish spreadsheet-ish things.

Now if I could control those from a nice command line (Perl) script it would let me do lots of interesting clever things without ending up in treatment for .NET induced substance abuse.

So I’ve been around the Web, and there are a few helpful posts, but none of them quite worked; a combination of hacks got me to something which did (around 5am this morning).

Let’s just place that in the public domain… It’s got scraps from all over the blogosphere, but I’ve done a fair bit of work tarting it up into something easy to use.
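
The script itself is linked above rather than pasted in, but for the curious it boils down to requests like this sketch (hostname, site path and list name are placeholders; the endpoint is SP2007’s standard Lists.asmx SOAP service):

#!/usr/bin/perl
# Not the script itself, but a minimal sketch of the kind of request it
# boils down to: asking SP2007's Lists.asmx SOAP service for the items
# in a list, authenticating with NTLM via curl. Hostname, site path and
# list name are placeholders.
use strict;
use warnings;

my $site = 'https://sharepoint.example.soton.ac.uk/sites/opendata';
my $user = 'SOTON\\username:password';
my $list = 'Bike Racks';

my $envelope = <<"END";
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
 <soap:Body>
  <GetListItems xmlns="http://schemas.microsoft.com/sharepoint/soap/">
   <listName>$list</listName>
  </GetListItems>
 </soap:Body>
</soap:Envelope>
END

open( my $fh, '>', '/tmp/envelope.xml' ) or die $!;
print {$fh} $envelope;
close $fh;

print `curl -s --ntlm -u '$user' -H 'Content-Type: text/xml; charset=utf-8' -H 'SOAPAction: "http://schemas.microsoft.com/sharepoint/soap/GetListItems"' --data \@/tmp/envelope.xml '$site/_vti_bin/Lists.asmx'`;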

Next stop, uploading files into SharePoint and setting their metadata…

Posted in Command Line, Perl, Sharepoint.



INBOX d/dt

So I’ve been looking at my inbox. I felt like I’ve been getting more email than usual so I’ve put that theory to the test:

I’ve been collecting my INBOX size every hour for the past year, which makes it easier.

Data as CSV [CC-BY]

First a basic graph of time vs INBOX size is a start… you can see the peak where I went to the sea-side and didn’t answer my email!

But to work out how loaded I am, a differential is more useful: Hourly change in INBOX, which is unreadable, so let’s add a 168 hour (one week) rolling average: Hourly change in INBOX, Weekly rolling average.

Nearly there, but it’s only increases that I’m interested in, so let’s make an Hourly increases in INBOX, weekly rolling average.
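
For the curious, the sums behind those graphs are roughly this (assuming the CSV is one timestamp,size pair per hourly sample):

#!/usr/bin/perl
# A sketch of the sums behind those graphs: hourly increases in INBOX
# size, smoothed with a 168-hour (one week) rolling average. Assumes the
# CSV is one "timestamp,size" pair per hourly sample.
use strict;
use warnings;

my @sizes;
while( <STDIN> )
{
    chomp;
    my( $time, $size ) = split /,/;
    push @sizes, $size;
}

# Hourly change, keeping only the increases
my @increases;
for my $i ( 1 .. $#sizes )
{
    my $diff = $sizes[$i] - $sizes[ $i - 1 ];
    push @increases, ( $diff > 0 ? $diff : 0 );
}

# 168-hour rolling average of those increases
my $window = 168;
for my $i ( $window - 1 .. $#increases )
{
    my $sum = 0;
    $sum += $increases[$_] for ( $i - $window + 1 .. $i );
    printf "%d,%.2f\n", $i, $sum / $window;
}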

So I can clearly see from this that my email inbox increases remain at an average of 2-3 per hour. Obviously any hour where I answer email will be ignored, so it’s imperfect, and I get far more email on weekdays so the actual number is probably higher, but it shows that the steady rate isn’t unusually high right now.

As a final note: the level of decreases is very different from that of increases: Hourly decreases in INBOX, weekly rolling average, which you’d expect, as it’s when I’ve been clearing more/less email.

OK… I should probably get back to my INBOX.

Posted in Uncategorized.
