

A Bit Quiet

The blog has been a bit quiet for a few weeks as we’ve been busy with our birthdays (Dave C. and I have them about 6 days apart), Dev8D and getting ready for the public launch of the University of Southampton Linked Open Data site.

Internal people can check it out now at http://data.southampton.ac.uk/ — feedback, ideas and questions are good, producing demo apps is better!

Oh, spotting any errors is really helpful too!

Things still to do include more HTML visualisations for people who prefer their data a little more cooked than we nerds do, a clever browser for our spending data, and working out how to extract the data from our energy metering system (which is awesome, but the main job is filtering out the weird, unhelpful data; it was never planned to be public, so there’s harmless but confusing cruft).

The biggy is getting to grips with the linked data API, which I hope to have ready for the launch date. Basically this means you can get useful, bite-size bits of data about something, rather than loading the entire dataset just to find out the location of a bus-stop. I think it’s a great idea for data, but I want to try and provide custom HTML views for things rather than just a data dump in HTML form.

I’ve got a very hacky hookup to the council bus-stop live data (i.e. how long until the next bus). It’s a bit of a scary route involving screen scraping and 3rd-year project code, but it demonstrates the massive value of getting this data properly.

I want to try and get some data out of a few other sources — I figure that the locations and opening hours of services would be really helpful. I don’t actually know much about student services, so I’m learning as I go, but they have a surprising amount in common with coffee shops, just providing advice rather than coffee.

I’m looking into linking up our list of expertise for the media with wikipedia, perhaps having 3 columns, each of which can either be free text or a wikipedia (so secretly dbpedia!) link. The columns are narrow subject, subject and broad subject. The goal with this dataset is not perfection, but to hook up journalists and academics who would benefit from talking to each other. Also to provide cool tools to journalists so they think well of us, which can’t hurt! I reckon in a few years’ time it will become standard to provide a feed of all your experts so media organisations can maintain their own databases (or compile central ones).

I made a bold statement last year on Twitter, and it looks like I’m going to be able to make good on that.

OK, bit of a ramble but you get the idea of how busy I am…

This project is going great (although I’m working very hard), but the real reason it’s going well is that it’s got amazing backing from an unholy alliance of central IT (iSolutions), Computer Science (ECS), the Communications department and Finance.

And not to forget the head of catering, who got hold of a list of all our vending machines, moving me closer to one of my key use-cases: where is the nearest place (that’s open) to obtain ANYTHING with caffeine in it? I’m also tagging all points-of-service which sell alcohol, so choose your poison.

Also I should thank the ECS people, as they’ve been loads of help, especially the ones with the difficult questions. And especially Colin, François-Xavier, Chris and the photography club, who’ve been contributing useful one-time data: locations, outlines and depictions of our buildings and sites.

Sometime later I’ll comment on the technologies we’re using. They are, of course, all free & open source, including the ones I have built for the project. I think the approach I’ve come up with for publishing datasets is pretty simple and re-usable.

Posted in Uncategorized.


Data Stores for Southampton and ECS Open Data

I’ve been working on getting a SPARQL endpoint and triplestore up and running to support the University’s forthcoming open data releases.

At the same time, I’ve been migrating our data at ECS from its current store (ARC2’s native store, built on top of MySQL), due to several problems which cropped up.

Experiences with ARC2’s store

There were a few reasons I went with this store in the first place:

For the number of triples we were working with (around a million), imports took an acceptable amount of time (roughly 6 minutes to import 1 million triples), and queries were more than fast enough (under 0.02s for the majority of queries). Longer term though, we ran into two problems with the store:

  1. Each triple is assigned an auto-incremented ID in the corresponding MySQL table. As large numbers of triples are removed and reinserted (e.g. reimporting a dataset), the maximum value of this ID creeps ever upwards, until MySQL runs out of values for it (since it won't re-use keys by default).
  2. SPARQL queries are converted to MySQL queries internally. As we started using more complex queries, we ran into sets of queries which couldn't be converted to SQL JOINs, and failed.

Moving to 4store

For both the University’s and ECS’s store, we chose to go with 4store. We plan to make our tools as store agnostic as possible though, and will carry on evaluating its suitability as we go along.

One key factor in this choice is that it’s a big dumb triple store (their words, not mine!). We just wanted a place to dump triples, and a way of getting them back again. It comes with its own HTTP server, and allows imports/deletes to be performed through that.
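As a rough illustration (the port, endpoint path and content type here are assumptions based on a default 4store HTTP server setup, so check them against your own install), pushing a file of triples into a named graph can be as simple as an HTTP PUT:

<?php
// Sketch: import a Turtle file into a 4store graph over HTTP.
// Assumes a 4store httpd on localhost:8000 and its /data/<graph-uri>
// convention; verify both against your 4store version.
$graph  = 'http://data.example.org/dataset/buildings'; // hypothetical graph URI
$turtle = file_get_contents('buildings.ttl');

$ch = curl_init('http://localhost:8000/data/' . $graph);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_POSTFIELDS, $turtle);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/x-turtle'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch), "\n"; // the server replies with a short status message
curl_close($ch);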

Quite a lot of the stores out there come as part of a larger RDF framework (many of which are Java based). We’re not really a Java based shop here, so having something standalone (as opposed to running under Tomcat, for example) was a consideration.

Installation was pretty straightforward, and allowed us to run multiple stores on the same machine, with a SPARQL endpoint for each running on different ports. It’s also been used for various research projects within the University, so there’s some existing knowledge of it to call on for support.

Getting data into the store(s)

As the number of datasets around the University grows, we’ll need a way of managing updates into the store (or a variety of stores).

While we plan to maintain a central catalogue of datasets, these will all be updated at different frequencies. For some, I envisage that we’ll just perform a weekly/monthly/yearly import. For others, we’d want to update the store whenever they changed.

To achieve this, I’ve set up a simple web application that allows anything to ask the store to update a dataset: an application sends an HTTP POST to a page, containing the name of the store to update and a list of dataset URLs the store should update from.

This means that for frequently changing datasets (e.g. a news feed), a client can trigger an update whenever it’s needed, rather than waiting for the server to next schedule an update.
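For illustration, a client-side trigger might look something like this (the endpoint URL and field names below are made up; the real update page has its own):

<?php
// Sketch: ask the update service to (re)import some datasets.
// The URL and parameter names are hypothetical placeholders.
$fields = array(
    'store'    => 'southampton',   // which store to update
    'datasets' => "http://data.example.org/news.rdf\nhttp://data.example.org/phonebook.rdf",
);
$ch = curl_init('http://updates.example.org/trigger'); // hypothetical endpoint
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);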

These requests are added to an ordered set (works as a FIFO queue, which discards duplicates), so that they can be imported in the order the requests come in.
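In case it helps picture it, here’s a minimal sketch of that kind of structure (the real queue lives in the web application’s own storage; this is just the idea in PHP):

<?php
// Sketch: an ordered set behaving as a FIFO queue which discards duplicates.
class UpdateQueue
{
    private $items = array(); // dataset URL => true, in insertion order

    public function push($datasetUrl)
    {
        if (!isset($this->items[$datasetUrl])) {
            $this->items[$datasetUrl] = true; // duplicate requests are silently dropped
        }
    }

    public function pop()
    {
        if (empty($this->items)) {
            return null;
        }
        reset($this->items);
        $url = key($this->items); // oldest request first
        unset($this->items[$url]);
        return $url;
    }
}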

The server hosting the store runs a small python script as a daemon which watches the queue, and pulls dataset URLs to import from it.

It then starts up a number of worker processes to deal with each dataset URL it receives. One set downloads data from each source, another set imports data into the store, another set deals with deletions from the store.

The multiprocess route works to avoid the bottlenecks in downloading large datasets (the largest dataset we import takes ~40 minutes to download), and allows smaller downloads and imports to take place while this is happening.

Abstracting away the download and import of data (as opposed to using cron for example) means we can also log frequency of updates and errors, track size of datasets over time, and retry failed downloads.

We’ll be having an invite-only open data hack day here soon, which will hopefully serve as a good initial test of things.

Posted in 4store, RDF, SPARQL, Triplestore.


Internal Only Open-Data-Preview Hack Day

Before we launch data.southampton.ac.uk to the public we will be having a day where members of the university get to have a look at the data, see what they can build, and make suggestions.

11th of February. Details to follow.

Posted in Uncategorized.


Using SPARQL to help with the next reorganisation

If there’s one thing that really causes extra work for webmasters and database admins it’s a university reorganisation… and the current one at Southampton is a doozy! We are basically restructuring everything: academic and professional services. Most of this isn’t actually a problem for the ECS web team, but now ECS is part of a faculty (along with what was the School of Physics and the Optoelectronics Research Centre).

One of the reorganisations is to reduce the number of research groups in the faculty to a manageable number. Thankfully, the decisions about all of this happen well above our pay grade, but it does mean that all our research groups will be changing: merging, splitting, renaming.

Our current list looks like this:

Some of these have very custom websites, but most have a very standard basic architecture.

The new world order will see these groups roughly halved, so that’s a pretty major change! Since we’ve had group websites, until now we’ve only had one group change at a time.

Architecture of a Research Group Website

The current pages are built from our SQL database which already contains the required info. PHP libraries abstract much of the detail. Research group webmasters may edit the PHP directly or just edit the content in pages via a web interface (skill levels and available time vary wildly). The current set up adds a slightly annoying requirement that the webserver serving the group websites must be able to connect to our core database server (not ideal, security wise). Also, we use the PHP libraries to manage what data is shown. A rogue member of staff (or, more realistically, a postgrad who quickly fixes something on the site without understanding the bigger picture) could easily expose information on staff and students who’ve not given permission to appear on the website. We’ve never had any really serious incidents, but the set-up could be better.

Publications info is not via SQL but grabbed via HTTP requests to eprints.ecs.soton.ac.uk

So anyway, the layout of a research group site is generally something like this. I’m using IAM as an example, but many of the groups are very similar.

  • Homepage – some plain text about the group with dynamic content showing news, recent publications, etc.
    • About Group X – plain HTML
    • News – News pages with content pulled from main ECS news database, but only items tagged as about Group X
    • Research Themes – A list of the research themes selected by this group.
      • Theme Page – A dynamically generated page for each research theme. The content comes from the local CMS, but the list of projects come from the SQL.
    • Current Projects – A list of all current projects in the group, with info such as funders, dates etc.
      • Project Page – A dynamically generated page for each project in the projects DB
    • Seminars – This section is also entirely built from the SQL database.
    • Publications – These pages are built almost entirely by data grabbed via HTTP from the EPrints server.
    • People – Various lists of members of the group, filtered in different ways, but never showing to the public anyone who didn’t give permission to appear in the online phonebook.
    • Join – Plain HTML page
    • Contact – Also a plain HTML page
    • Intranet – beyond scope for today.

We like this design as almost all the content comes from databases which are used to build other sites and also perform other functions for ECS. (I nearly wrote “the School”, but ECS isn’t a school anymore; we’re a conceptual slice of the new Faculty.)

Mapping our existing records to new groups

This is really the painful bit. All our databases need to be updated.

People: Our People database can cope with the idea of people having a primary group and being added to other groups too, so we could create the new groups and transition slowly. The mapping of people is probably going to be a by-hand exercise.

Groups: The list of groups in the database can just grow, and eventually some groups get flagged as ‘deprecated’ but that won’t break much.

Seminars: This is an interesting one as we’ve got seminar series associated with groups. I’m tempted to suggest that we rename each of these to the most appropriate group, merging two series in some cases, but don’t lose too much sleep over it.

Projects: Ended projects can remain attached to their old groups. To sort out the current projects, I guess we’ll need to produce a page which lists all projects still associated with a deprecated group and get staff to sort it slowly. Any member of ECS staff can edit any project record, wiki style, so that’s not impractical.

Themes: These probably need rethinking by every group, or maybe even retiring temporarily or permanently. Themes are associated with research groups and projects.

News: We’ll just have to go and retag the last 6 months or so of news to add the new groups, just so they have some news to start with. With only 5 groups to choose from it shouldn’t be that painful and we can always link them to multiple groups if in doubt.

Publications: This one is an utter nightmare! Research papers are directly linked to research groups in addition to being linked to the authors (who are members of groups). This way, if someone changes group their papers do not follow them. Right now I’m tempted to either (a) not have publications on the new research group sites or (b) just list them on the assumption that the papers of a group are the publications of its members. (b) causes problems as our normal staff database only lists current members, and just because Professor Awesome retired, it doesn’t mean the group won’t want his papers to linger long after he’s gone to the great conference dinner in the sky. I think the best thing to do here is to add the newly mapped research groups of the authors to each paper, and tweak anything later if people don’t agree.

Getting ready for the New Group Websites

It seems that there’s some planning going into getting people who can write copy for the new websites, but I suspect the actual database mapping will just be done quietly between the cracks as usual. Ah well. The real thing we need to do is get a nice template ready for the sites so the new groups can hit the ground running, but one they can customise too, as they’ll each have their own quirks that they will want to show off. One of the new groups is going to have a Web focus, so they no doubt will want the moon on a stick in HTML5.

So, here’s the plan: we create a basic PHP website, using a version of the template we’ve been working on which is entirely on-brand in the new University style, but still quite nice to work with, standards-compliant etc. But here’s the funky bit…

We use the ECS SPARQL endpoint, which we recently launched, to build the site. (Almost) all the information we need to build a research group website is now available from our SPARQL endpoint, which is a way to query all our public data.
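To give a flavour of what that means (the endpoint URL, group URI and predicates below are illustrative placeholders rather than our real schema), a page can just fire a query over HTTP and render the bindings:

<?php
// Sketch: build a staff list for a group straight from a SPARQL endpoint.
// Endpoint, graph and predicates are placeholders, not the real ECS ones.
$endpoint = 'http://sparql.example.org/';
$query = '
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name WHERE {
    <http://id.example.org/group/IAM> foaf:member ?person .
    ?person foaf:name ?name .
  } ORDER BY ?name';

$url  = $endpoint . '?query=' . urlencode($query) . '&output=json'; // results-format param varies by store
$data = json_decode(file_get_contents($url), true);

echo "<ul>\n";
foreach ($data['results']['bindings'] as $row) {
    echo '<li>' . htmlspecialchars($row['name']['value']) . "</li>\n";
}
echo "</ul>\n";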

Pro: Demonstrating new technology. ECS is very involved with open, linked data and it’s a good move to use it to solve our own problems.

Pro: Does not require special firewall rules. It just runs over HTTP, so the website does not have to be on a webserver with special firewall rules, or even inside ECS. It does ideally want a pretty quick network connection to the SPARQL server, but no more so than with an SQL server.

Con: Extra point of failure. The old way was: website generated from database. The new way is: SPARQL store built from the database, website built from the SPARQL store. So that’s one extra point of failure for these sites. However, the risks are reasonable. An outage on a research group website is annoying, but less serious than on a site providing a service, or for an upcoming or current event such as a conference.

Con: Untried technology. Stuff that; we’re a university, it’s our job to try new things. If it all goes horribly wrong we just fall back to the old system, so that’s not a good excuse not to try.

Con: Can’t show extra information to internal people. The SPARQL endpoint only knows about public things. The current pages show all staff if you view them from within the university IP range (even those who don’t want to appear to the public). This is actually kind of a bad idea; as discussed in an earlier article, if you trust an IP range with random research webservers in it, then it’s pretty likely someone will accidentally build some kind of proxy which will allow Google to see every page that proxy can see, which practically leads into the next pro…

Pro: Bad Group Webmaster code can’t expose confidential data. Because the data source just doesn’t have any confidential info in it, there’s no risk of anything leaking, no matter what wacky queries they write.

So there are plenty of pros and cons, but as a leading webby research kind of place we should really be trying this stuff. Our next step is to start building a generic group website on top of the SPARQL endpoint and see how many issues we run into.

Posted in Best Practice, PHP, RDF, SPARQL, web management.


Publishing CSV to RDF part 2

OK. Having realised I was dangerously on the path to writing my own language, I’ve revised my plan. My current thinking is that I’ll make it a two-step process.

Step one converts the CSV/Excel/Whatever into an XML file, with optional extras included to allow local values to be set.

Then this will be processed using an XSLT file, which is already well supported. I’m not a huge fan of XSLT but you need to think carefully before embarking on inventing an arbitrary new language. The goal of this system is to make it easier to maintain for non-research organisations (like our central IT) so using an established technology makes it easier to ensure you can find someone to maintain the system.
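A rough sketch of the shape of the pipeline (the intermediate element names are invented for illustration; the real tool’s format may well end up different):

<?php
// Sketch: step one turns a CSV file into a simple XML document,
// step two runs a hand-written XSLT over it to produce RDF/XML.
// Element names (<rows>, <row>, <cell>) are illustrative only.
$doc  = new DOMDocument('1.0', 'UTF-8');
$rows = $doc->appendChild($doc->createElement('rows'));

$fh = fopen('dataset.csv', 'r');
$titles = fgetcsv($fh); // first row: column titles
while (($values = fgetcsv($fh)) !== false) {
    $row = $rows->appendChild($doc->createElement('row'));
    foreach ($values as $i => $value) {
        $cell = $row->appendChild($doc->createElement('cell'));
        $cell->setAttribute('title', trim($titles[$i]));
        $cell->appendChild($doc->createTextNode(trim($value)));
    }
}
fclose($fh);

// Step two: apply an XSLT stylesheet (maintained per-dataset) mapping cells to RDF/XML.
$xsl = new DOMDocument();
$xsl->load('dataset-to-rdf.xsl');
$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
echo $proc->transformToXML($doc); // RDF/XML out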

That said, I’m not sure what to make the XSLT output. It really needs to be XML (although I *think* you can do other stuff, it’s more fiddly). So, assuming the XML restriction, my options are:

  • Any RDF/XML
  • Subset of RDF/XML
  • XML format I’ve not yet heard of
  • XML format of my own invention

This last one had me tempted for a while… something like:

 <triple>
   <subject>http://....</subject>
   <predicate>http://....</predicate>
   <object>http://....</object>
   <datatype>http://....</datatype>
 </triple>

… but I think I’m suffering from an attack of over-engineering. So what it should output is valid RDF/XML, which my tool can then validate & process into the triple serialisation of your choice.

Posted in RDF.



Getting Machine-readable Spreadsheets

Paddling in the shallow end of open data we have the PDF file, we have Excel and we have CSV.

For those of you not yet familiar with TBL’s linked data stars, Ed Summers has a great summary, but here are the first three:

★ make your stuff available on the web (whatever format)
★★ make it available as structured data (e.g. excel instead of image scan of a table)
★★★ non-proprietary format (e.g. csv instead of excel)

Give .xls some love!

It’s not clear that Excel is still a proprietary format. From http://en.wikipedia.org/wiki/Microsoft_Excel#Binary:

OpenOffice.org has created documentation of the Excel format.[52] Since then Microsoft made the Excel binary format specification available to freely download.[53]

I think we need to use a better example. If it contains no equations it’s dead easy to work with. Hell, there’s the Perl module Spreadsheet::ParseExcel, and if you’re too lazy to use that, I’ve made you a REST web service using it.
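For instance, reading an .xls straight from PHP via that web service goes something like this (the spreadsheet URL is just a placeholder, and the naive line splitting assumes no multi-line cells):

<?php
// Sketch: read an .xls file as CSV via the excel2csv web service.
$xls = 'http://www.example.org/bridge-weight-limits.xls'; // placeholder URL
$csv = file_get_contents('http://graphite.ecs.soton.ac.uk/excel2csv?src=' . urlencode($xls));
foreach (explode("\n", trim($csv)) as $line) {
    $row = str_getcsv($line); // one array per spreadsheet row
    print_r($row);
}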

Being in CSV does not mean your data doesn’t suck!

However, that’s not my beef for this article. My rant is about the assumption that something being in CSV makes it machine-readable data. data.gov.uk is a Good Thing, but there are lots of Excel & CSV files in it which are not noticeably more machine readable than an HTML document. It deserves a gold star for being online with an open license, but not 3 stars for being machine-readable.

So here’s my idea… define a subset of CSV which is easy for programmers to work with and also easy for office staff to maintain. This is not idle whimsy; I’m talking to our various university services about how to get their data open, but also how to make it easy for them to maintain in a way machines can work with.

Ideally, of course, we’d just build a database with a pretty web interface, but I’m on a budget so want something cheap and cheerful. Also, I want to minimise the impact on their normal working practices and make sure they hold the authoritative version of their data.

Here’s the plan, although I need a name for it. It applies to any tabular data, but is mainly aimed at CSV & spreadsheets.

Easy-Parse Spreadsheet Recommendation

The first row to contain a non-blank value in Col-1 is the ‘title row’. Each column in this row is treated as the title of that column and used to identify it. Two columns should not have the same title. Titles should not be changed by the person maintaining the table, although columns may be added with new titles, and columns may be reordered.

Every row after the title row which is not entirely blank is a data row. Each row is considered a data object whose values are the properties named by the title row. Any data without a title in the title row is ignored. Whitespace before or after a value is always ignored.

If a cell has multiple values then these values should be separated with a single agreed character. Usually a comma “,” but if the field contains textual descriptions where a comma may be useful then use something else like a tilde “~” or a semi-colon “;”. Whitespace either side of the separating character is removed. Which fields may be split by which characters is not specified in the data.

No computed fields should be used; or at least, they should only be used for human convenience and are not considered part of the data.

Colour & style can and should be used to make it easier for humans to work with the document, but they are not part of the data.

Dates and times should be machine readable. Don’t use “am” & “pm”! Values should be written consistently.

I think that’s all pretty sensible, but this last one is a bit more controversial (suggested refinements welcome!)

Some metadata about the document may be useful to embed in it: anything vital to the mechanical interpretation of this specific file, most notably the spatial and temporal coverage. In this case, metadata may be specified by rows in the format “*SET”, “Key”, “Value”. The “*SET” rows should be handled separately; in fact, any row where the first character of the first cell is a star “*” should not be interpreted as data or titles.

So, I guess, we might as well define “*COMMENT” to allow section headings and comments.
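To show the rules really are easy to implement, here’s a minimal parser sketch (nothing official, just the recommendation above translated into PHP; multi-value splitting is left to the consumer, as the spec says):

<?php
// Sketch: parse a CSV file following the recommendation above.
// Returns array('meta' => key/value pairs from *SET rows,
//               'rows' => one title => value map per data row).
function easyparse_csv($filename)
{
    $meta = array();
    $titles = null;
    $rows = array();

    $fh = fopen($filename, 'r');
    while (($cells = fgetcsv($fh)) !== false) {
        $cells = array_map('trim', $cells);           // whitespace never matters
        $first = isset($cells[0]) ? $cells[0] : '';

        if ($first !== '' && $first[0] === '*') {      // *SET, *COMMENT etc.
            if (strtoupper($first) === '*SET' && isset($cells[1])) {
                $meta[$cells[1]] = isset($cells[2]) ? $cells[2] : '';
            }
            continue;                                  // never data or titles
        }
        if ($titles === null) {
            if ($first === '') { continue; }           // still above the title row
            $titles = $cells;                          // first row with non-blank col 1
            continue;
        }
        if (implode('', $cells) === '') { continue; }  // skip entirely blank rows

        $row = array();
        foreach ($titles as $i => $title) {
            if ($title === '') { continue; }           // untitled columns are ignored
            $row[$title] = isset($cells[$i]) ? $cells[$i] : '';
        }
        $rows[] = $row;
    }
    fclose($fh);

    return array('meta' => $meta, 'rows' => $rows);
}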

OrgGrinder

My long term plan for this is to produce a simple template system which can process these files into RDF. It’s an imperfect system, but I think it’s much more sustainable for a small office.

My long term goal is to produce some sample spreadsheets + Grinder Templates so that smaller organisations could produce adequate open linked data with minimal hassle and staff training.

Here’s a scenario: you create your organisation’s public phonebook as a Google Docs spreadsheet (using my template) and publish it as CSV. Bam!: 3 stars already. Then you go to my web service and give it the URL of the CSV and some bits and bobs like your base URI & license and Bamo!: 4 stars.

BizTalk

This is a system used by some large organisations (including the University of Southampton) to push around data between diverse systems. It’s got some clever stuff I don’t understand, but ultimately it appears to shove around Pipe “|”-separated data files. That would work like a charm with this system as it’s already conforming to most of the restrictions.

EPrints-Data

It seems fairly clear to me that EPrints will soon have to start coping with being a repository of “data documents” (in fact there are already some in the wild).

There’s several kinds of data we may want a repository for:

  • One-shot datasets – the results of an experiment.
  • Cumulative data without temporal/spatial scope – the results of a really long experiment. They may have a ‘temporal’ element, but it’s relevant to the stage of the experiment not the calendar date. Once the project ends this data may be treated just like a one-shot dataset.
  • Cumulative data with a temporal and/or spatial scope – This is more the field of infrastructure data. The price of coffee at the university is only true from and to a certain date, but it may remain interesting to find out values from the past. Spatial scope is more aimed at things like data.gov.uk where you often have one dataset supplied per council per year (or hope to).
  • Live Data – This is data which is only considered interesting and useful in the here-and-now. For example, the current university phonebook or the list of car parks. There may be a handful of uses for the old data, but this is the kind of data which you edit in-situ rather than publish revisions of.

I’m thinking that for the more infrastructure-type data we could add some nice functionality to EPrints to allow it to:

  1. Have a ‘grinder template’ and configuration options associated with an EPrint so that you can just upload an excel document and have it turned into RDF triples, just like it makes preview images.
  2. Make documents have temporal coverage ‘from’ and ‘through’ fields, then make it possible to query the document for “now” or for a given date.
  3. For the datasets where only the current is deemed important, add a /latest/ URL to get the last document attached to a record. This is already coded and will be a package in the new EPrints Bazaar, coming next year!

Posted in Best Practice, RDF, Repositories, Templates.



Studying the MPs

Last night Tony Hirst was trying to work out the birth-places and universities of the current UK MPs.

Here’s what I managed to produce while hacking with a glass of wine & TV on:

I did some seat-of-the-pants data munging, so I thought it was worth explaining the process I used:

List of MPs

For the graph of MPs’ birth-decades I did earlier in the year, I used a dbpedia relationship which gave me a nice little subject, ‘member of the 2005 UK Parliament’. This time around I can’t find anything so easy for the 2010 Parliament.

Solution: I used http://en.wikipedia.org/wiki/MPs_elected_in_the_UK_general_election,_2010 and a dirty little script…

my @tables = split( /<table/, join( '', <> ) );
foreach my $tr ( split( /<tr/, $tables[3] ) )
{
    my @td = split( /<td/, $tr );
    if( $td[5] =~ m!"/wiki/([^"]+)"! ) { print "$1\n"; }
}
(sorry for the godawful formatting, having hassles pasting code into wordpress since our upgrade.)

This then got munged in a text editor to create a .ttl file (a much nicer way to express RDF than XML, especially when doing hacky scripts).

This gives me this: data, [Browse].

Later I did something almost identical to produce a file adding party affiliations, both as a party label and as a red/blue/yellow/other icon.

This gives me this: data, [Browse].

In retrospect I could have done this in one go, but it was late. Note that the raw data of this file is just in N-Triples format, which is really easy to create and easy to import as RDF.
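For anyone who’s not met it, N-Triples is just one triple per line, so the generated file is full of lines roughly like these (the party and icon predicates here are invented for the hack, not standard vocabulary, and the MP URI is a made-up example):

<http://dbpedia.org/resource/Example_MP> <http://example.org/vocab/party> "Labour" .
<http://dbpedia.org/resource/Example_MP> <http://example.org/vocab/icon> <http://example.org/icons/red.png> .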

Making the Map Data

I then wrote a quick PHP script using my own Graphite library to turn this data into geocoded RDF, e.g. each resource has an rdfs:label, geo:lat, geo:long and also an icon predicate I made up for the day.

View code: http://graphite.ecs.soton.ac.uk/experiments/parlibirth/mpmunge.php

View output: http://graphite.ecs.soton.ac.uk/experiments/parlibirth/mpmunge.php

As it’s a one-shot, I’ve just hard-wired the relationship as “http://dbpedia.org/ontology/birthPlace” and then I just grab the results using curl, e.g.

curl http://graphite.ecs.soton.ac.uk/experiments/parlibirth/mpmunge.php > born.ttl

…and then modify to “almaMater” and repeat.

Making Maps

This gives me files which can be loaded into my GeoRDF2KML tool. This forwards directly to Google Maps as that will accept the URL of a KML file as an input parameter.

Full disclosure: I added a dirty late-night hack to geo2kml to accept my icon predicate, to allow you to change what icon appears, so I can get the by-party colour coding. If anyone has a ‘proper’ predicate to relate a geolocation to a map icon, let me know and I’ll support it.

To make things simple, I used ‘curl’ again to save the KML files to the same website.

Final Maps:

Note that the data is patchy. It only shows MPs with a geocoded birthplace/university listed on dbpedia.

Posted in Geo, Graphite, Perl, PHP, RDF.



New Tools

As well as the more experimental stuff, I’ve also produced several more useful tools:

sparql2kml

http://graphite.ecs.soton.ac.uk/sparql2kml/ – This takes a SPARQL query which returns ?lat,?long (or ?georss) and ?title and maybe ?desc and ?placename and produces a KML file so you can see it on Google Maps or Earth!

As an experiment I used the following to find the birth place of Southampton football players.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
SELECT DISTINCT ?georss ?title ?placename WHERE {
?person dbo:team <http://dbpedia.org/resource/Southampton_F.C.> .
?person dbo:birthPlace ?place .
?place <http://www.georss.org/georss/point> ?georss .
?person rdfs:label ?title . FILTER langMatches( lang(?title), "EN" ) .
OPTIONAL { ?place rdfs:label ?placename . FILTER langMatches( lang(?placename), "EN" ) }
OPTIONAL { ?x <http://dbpedia.org/property/county> ?place }
FILTER ( !bound(?x) && ?place != <http://dbpedia.org/resource/England> && ?place != <http://dbpedia.org/resource/Wales> )
}

 

View it: Google Maps or KML for Google Earth.

excel2csv

This one is dead simple. It converts an excel file into comma separated values.

http://graphite.ecs.soton.ac.uk/excel2csv?src=http://opendata.s3.amazonaws.com/bridge-weight-limits-2010.xls

sparqllib.php

http://graphite.ecs.soton.ac.uk/sparqllib/

Nice and simple library to let you use SPARQL from PHP. The function names are deliberately copied from the mysql ones so you have sparql_connect, sparql_fetchrow etc.
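A quick usage sketch (the library names sparql_connect and sparql_fetchrow; the other call and the exact signatures here are my guess at the mysql-style pattern, so check the library’s own docs before copying this):

<?php
// Sketch of the mysql-style pattern; exact function names beyond
// sparql_connect/sparql_fetchrow are assumptions.
require_once 'sparqllib.php';

$db = sparql_connect('http://dbpedia.org/sparql');
$result = sparql_query('
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?name WHERE {
    <http://dbpedia.org/resource/Southampton> rdfs:label ?name .
  } LIMIT 5', $db);

while ($row = sparql_fetchrow($result)) {
    echo $row['name'] . "\n";
}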

 

Posted in Uncategorized.


Linked Data Experiments

So I’ve been doing lots of little experiments with consuming open data…

 

 

Posted in Uncategorized.


Blue Plaque

So I’ve just been to the Open Data Hack Day in Oxford which was good fun. Met some cool people, wrote a lot of code and drank some brandy.

My team was playing around with using dbpedia‘s data mixed with geo-location to find you an interesting fact about where you currently are. We had a lot of fun with it — the final results are here:

It does some neat things. It uses javascript to ask your browser where you are or, failing that, to use the wikipedia name of a city, a lat/long or a postcode. http://data.ordnancesurvey.co.uk/id/postcodeunit/SO171BJ will give you the lat & long thanks to @gothwin.

It then attempts to find nearby places on wikipedia which are the hometown of something. It does this by searching for things within plus or minus 0.2 of a latitude and longitude (I know that’s not going to be a perfect square, but meh). If it finds nothing it doubles the search range and tries again until it does.

It then gets all the things that have the city as a hometown, picks one and renders a blue plaque.
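The search loop is roughly this shape (sketched here in PHP rather than the mix of javascript and server-side bits the hack actually used, and the property names and format parameter are from memory, so treat them as approximate):

<?php
// Sketch: widen a lat/long bounding box until dbpedia returns something
// which is somebody's or something's hometown. Predicates are approximate.
function find_hometown_things($lat, $long)
{
    $range = 0.2;
    while ($range < 10) {
        $query = sprintf('
          SELECT DISTINCT ?thing ?place WHERE {
            ?place <http://www.w3.org/2003/01/geo/wgs84_pos#lat> ?plat .
            ?place <http://www.w3.org/2003/01/geo/wgs84_pos#long> ?plong .
            ?thing <http://dbpedia.org/ontology/hometown> ?place .
            FILTER ( ?plat > %f && ?plat < %f && ?plong > %f && ?plong < %f )
          } LIMIT 100',
          $lat - $range, $lat + $range, $long - $range, $long + $range);

        $url = 'http://dbpedia.org/sparql?format=json&query=' . urlencode($query);
        $data = json_decode(file_get_contents($url), true);
        if (!empty($data['results']['bindings'])) {
            return $data['results']['bindings']; // pick one of these for the plaque
        }
        $range *= 2; // nothing nearby; double the search area and try again
    }
    return array();
}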

For added silliness, if there’s an image available, it has a little proxy which downloads the image, shrinks it to no more than 300×300 to be phone-friendly, and makes it white-on-blue to match the plaque.

I stole the style of the buttons at the bottom of the page from m.ox.ac.uk, which is an excellent example of how to make a website work on a phone, rather than bothering to make a specific phone app.

We won ‘most creative use of data’. Some of the other groups did more worthy things like visualising arts-funding data and making useful bus timetables and so forth. One group had a great idea but didn’t get very far: linked-data top trumps. Each site in the linked data cloud has quite a few stats, so you could probably do something cool with that. Most triples, most links to other datasets, most open license… Actually, I wonder if there’s a tool out there which you can feed a CSV and it’ll produce nice PDFs of top-trump cards to print out.

Posted in Uncategorized.