

Geo RDF to KML

I’ve been looking at our RDF building data: http://rdf.ecs.soton.ac.uk/location/UoS/building/ and it’s pretty cool, but most mapping tools want KML, so I knocked up a quick tool to convert geo-tagged RDF into KML.

It has a very simple URI so you can use it in pipelines and so forth. It includes the dc:description of each element, if available, as the <description>, and uses any of rdfs:label, dc:title, foaf:name or skos:prefLabel to get the title. It’s also got a handy option to view the results in Google Maps rather than downloading them.
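For the curious, the heart of the conversion is pretty simple. Here’s a rough sketch of the approach using ARC2 — to be clear, this is not the actual code of the tool, and the predicates are just the ones mentioned above:

include_once("arc/ARC2.php");

$geo = "http://www.w3.org/2003/01/geo/wgs84_pos#";
$dc  = "http://purl.org/dc/elements/1.1/";
// predicates tried, in order, for the <name> of each placemark
$title_preds = array(
  "http://www.w3.org/2000/01/rdf-schema#label",
  $dc."title",
  "http://xmlns.com/foaf/0.1/name",
  "http://www.w3.org/2004/02/skos/core#prefLabel" );

$parser = ARC2::getRDFParser();
$parser->parse( "http://rdf.ecs.soton.ac.uk/location/UoS/building/" );

// index the triples by subject so we can look things up easily
$index = array();
foreach( $parser->getTriples() as $t )
{
  $index[$t["s"]][$t["p"]][] = $t["o"];
}

print '<?xml version="1.0" encoding="UTF-8"?>'."\n";
print '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'."\n";
foreach( $index as $uri => $props )
{
  // only geo-tagged resources become placemarks
  if( !isset( $props[$geo."lat"] ) || !isset( $props[$geo."long"] ) ) { continue; }

  $name = $uri;
  foreach( $title_preds as $pred )
  {
    if( isset( $props[$pred] ) ) { $name = $props[$pred][0]; break; }
  }

  print "<Placemark>\n";
  print "  <name>".htmlspecialchars( $name )."</name>\n";
  if( isset( $props[$dc."description"] ) )
  {
    print "  <description>".htmlspecialchars( $props[$dc."description"][0] )."</description>\n";
  }
  // KML wants longitude,latitude order
  print "  <Point><coordinates>".$props[$geo."long"][0].",".$props[$geo."lat"][0]."</coordinates></Point>\n";
  print "</Placemark>\n";
}
print "</Document></kml>\n";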

You can include more than one RDF URL, separated with a comma, and it’ll process objects from all of them! Which is kind of funky!

Ideas for other conversion tools…

This also got me thinking about how useful it might be to provide some more web services like this to convert between KML/KMZ, RDF+XML, RDF+N3, RDF Triples, Atom, RSS, SIOC, Memento, iCal, events, JSON and JSONP.

For example, I could very easily provide a URL to which you give a webpage as a parameter, and it would read the RDFa in that page and return it via a JSONP callback, or something that can read an RSS file and convert it to Memento or vice versa.

Also just making it able to merge several KML, RSS, iCal etc. into one file for passing onto the next stage of a pipeline would be funky!

If I get some positive responses I’ll start work on it as a hobby/side project. Possibly datatools.ecs.soton.ac.uk so that I can move it around separately from Graphite if it becomes high-traffic.


Posted in RDF.



Ask Not

I’ve been trying to think of a more succinct way to explain my thinking about linked data.

I was told long ago that a good business deal is one that is beneficial to both parties. You should walk away from any deal which isn’t beneficial to you. I think we can say something similar about Linked Data…

Good Linked Data benefits both the producer and consumer.

If it doesn’t benefit the person or organisation which produces it then it is not sustainable; it is certain to suffer bitrot. Benefits can be as simple as “it gains us sales” or “we lose funding if we don’t do it”, or much more subtle: “it enhances our reputation to provide this service”.

“Benefits the consumer” is even more straightforward: if nobody wants to consume your data, why the hell are you bothering? “We’re semantic web researchers” works as an excuse for some of the data we put out which I don’t think anybody has ever actually used. Much of our RDF is just a pretty model of our internal data-model and nobody has a use for it in its current form (homespun ontology). As a result much of it has suffered bitrot, as nobody noticed issues. (Since we realised this, we’ve been working through it slowly, trying to replace our local assumptions with more generic things like DC, FOAF etc.)

What I’m trying to say, for the next round of Linked Data/RDF adopters as we shift from exploring pioneers to early homesteaders, is this:

“Ask not what you can do for Linked Data, ask what Linked Data can do for you!”

Posted in Uncategorized.



The Modeller

I’ve invented a new Batman villain. His name is “The Modeller” and his scheme is to model Gotham City entirely accurately, in a way that is of no practical value to anybody. He has an OWL which sits on his shoulder and has the power to absorb huge amounts of time and energy.

[Drawing: The Modeller]

In the first issue, “Batman vs the Modeller”, the Modeller gets away by confusing Batman as to exactly which incarnation he currently is (Frank Miller, Golden Age or Batman Begins), which forces Batman into an identity crisis where he registers different URIs and FOAF profiles for Batman and Bruce Wayne.

Over the three issues there’s a running subplot about the Modeller’s master weapon, the FRBR, which everyone knows is very, very powerful but, when the citizens of Gotham talk about it, none of them can quite agree on exactly what it does. All over the city citizens are becoming trapped in the Modeller’s logical rabbit holes and rat’s-nest traps, while The Batman does nothing but fiddle on the Bat-computer, having now decided he must require a different URI for every possible version of himself that isn’t entirely identical. So far he’s at the level of one URI for every version in every medium with a specific author, artist, director and/or actor. Any slight variation demands a new URI.

While unpopular with the fans, issue two, “Batman vs the Protégé”, will later be hailed as a Kafkaesque masterpiece. Batman descends further into madness as he realises that every moment he’s the Batman of that second in time, and each requires a URI, and every time he considers a plan of action, the theoretical Batmen in his imagination also require unique distinct identifiers which he must assign before continuing. Gotham Police are unable to do anything as they have not yet finished their OWL ontology of Gotham crime, which fails to map onto the normal crime ontology. Commissioner Gordon can’t work out what rdf:type the Penguin’s last caper should be modelled as. It closes with Batman realising that time is continuous, not discrete, and he needs an uncountable infinity of URIs…

The final issue, “B-nodes and Broomsticks”, is a much more light-hearted affair as Batman gains enlightenment and realises that nothing can ever be perfectly modelled and that any model should serve him, not he serve the model. In the final showdown Batman gives a speech about how, if we try to hold knowledge too tight, it slips through our grasp. He then quickly and satisfyingly captures the Modeller and delivers him to the cops, who charge him with Aggravated Wasting of Police Time (they are still bitter about reading all the W3C OWL documents).

In a twist which splits the fanbase (some love it, others hate it), we never actually find out what the FRBR was capable of doing and whether it really would have lived up to the hype.

*** * ***

OK, maybe I have a chip on my shoulder about the fact that OWL appears to be the hardest part of the whole RDF thing, and to be for the benefit of semantic web researchers only.

I’m interested to see if I get any outraged comments. I love modelling stuff and have written OWL ontologies for fun in my own time; my issue with the whole thing is that I’m not convinced it’s a useful exercise. Everyone hates 404s, but making one-way links was one of the things which made the web possible. I suspect that linked data will have some similar sacrifices to make on the altar of pragmatism. The picture isn’t supposed to look like anyone in particular; my 9am drawing ability is limited.

I currently think you have to accept that URIs may be sameAs or not, depending on the task you are attempting rather than an absolute truth. Semantic relativism, baby!

Posted in Uncategorized.


First thoughts about FOAF+SSL

My last blog post caused @kidehen (Kingsley Uyi Idehen) to ask if I’d looked at FOAF+SSL for the purpose of letting people grant third parties selective access to their own personal data held on our systems.

I still don’t think it’s the right tool for that job, but it is pretty cool. Here’s a quick summary.

This system allows you to authenticate that you are the person represented by a URI. To do this, you generate a key pair. The private key is installed in your browser, and the public key is added to your FOAF page as a handful of extra triples.

When you attempt to view a page requiring you to authenticate with a certificate, your browser asks which of the installed certificates you want to use. You pick the one for the identity you care about. The remote site then resolves the URI (stored as part of the cert) to get the public key for your claimed identity, then does the usual library stuff to check your client really has the private counterpart to the public key in the FOAF for your claimed identity.
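To make that a bit more concrete, here’s a very rough sketch (using Graphite) of the check a server might do once it has already pulled the WebID URI, modulus and exponent out of the certificate the browser presented. It assumes the rsa:/cert: vocabulary FOAF+SSL used at the time of writing; the exact terms have shifted over the years, so treat this as an illustration rather than a recipe.

include_once("arc/ARC2.php");
include_once("Graphite.php");

// Rough sketch only. $webid, $client_modulus and $client_exponent are assumed
// to have already been extracted from the presented client certificate.
function webid_key_matches_foaf( $webid, $client_modulus, $client_exponent )
{
  $graph = new Graphite();
  $graph->ns( "cert", "http://www.w3.org/ns/auth/cert#" );
  $graph->ns( "rsa",  "http://www.w3.org/ns/auth/rsa#" );
  $graph->load( $webid );

  // each public key in the FOAF document points back at the person via cert:identity
  foreach( $graph->resource( $webid )->all( "-cert:identity" ) as $key )
  {
    $modulus  = strtolower( preg_replace( '/[^0-9a-f]/i', "", (string)$key->get( "rsa:modulus" ) ) );
    $exponent = (string)$key->get( "rsa:public_exponent" );
    if( $modulus == strtolower( $client_modulus ) && $exponent == (string)$client_exponent )
    {
      return true; // the browser really does hold the private half of a published key
    }
  }
  return false;
}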

Here’s an example I knocked up using foaf.me: https://foaf.me/marvin

It can then optionally do funky stuff with RDF to decide if your identity should have view/edit rights on the resource you requested, and can also do funky stuff by resolving your depiction, name, friends etc. from your FOAF to enhance whatever it’s up to.

Applications for my closed linked data ideas

For web-based services this is a bit of a non-starter. The only way I can see it working is that you give the ECS profile system a list of URIs of services allowed to get at your data, and at what level of detail. The second you expect a user to understand URIs vs URLs you lose 99% of your audience, but this might work if the system was very slick and hid all that from you.

It might work differently on phones, where the phone app could more easily have access to your certificates (you wouldn’t allow a 3rd party website providing a service to have access to your browser certs!).

Applications for ECS (Universities in general…)

So it occurred to me that it would be pretty easy to provide a service to allow our users to generate one or more certs and install the public part in their FOAF profile. I am not clear how well this system handles multiple certs on one FOAF profile, but it seems to me that’s going to be needed. Scenario 1 is that I can quickly jump through some hoops to get a new cert each time I’m using a different browser etc. Scenario 2 is that I have to learn to copy my cert(s) around, and I don’t think anyone cares enough to deal with the hassle.

It might be quite a cute service to provide a way for our staff to easily authenticate that they are really http://id.ecs.soton.ac.uk/person/1248, although I’m not actually going to build it unless I hear some requests from members of my school (well, faculty as of the latest re-org).

The other thing this sounded useful for, at first glance, would be dealing with the huge pain of single sign-on over so many different systems at the uni. This system provides secure authentication without having to strongly couple the systems.

Problems with FOAF+SSL

…but…

Having installed my first certificate, I’m not that comfortable. This cert is possibly my password to some sites, but what it really says is “someone with the credentials of this user once used this browser”. The browser (Firefox) didn’t give me any user-friendly explanation of what the hell was going on, and I don’t know what to do if I want to let someone else use a FOAF+SSL service from my laptop. It is non-trivial to remove certs, and I’m not really clear on whether my friend using my laptop needs to get a key from a USB stick, or to generate a new one there and then, then remember to dive deep into the settings menu to erase it afterwards.

In many ways SSL+password seems more secure, as at least I don’t leave my passwords lying around on my machine; but I can’t keep the cert in my human memory, it requires a digital copy. Maybe it can be password protected, but I wasn’t offered the option.

All in all, an interesting tech, but I wouldn’t use it for our Intranet yet…

Further Reading

Here are the links Kingsley gave in his comment. In my opinion the FOAF+SSL explanation on the first link would be clarified by making it clear that the last two bits (6 and 7) are an interesting but entirely optional extension.

1. http://esw.w3.org/Foaf%2Bssl
2. http://www.mail-archive.com/public-lod@w3.org/msg05665.html — old post about a WebID (née FOAF+SSL) ACL example

Posted in Intranet.



Privacy Controls for Linked Data

I’ve been considering how we might allow our users to provide selected third parties with access to data we hold about them. This includes timetables, module selections and handin deadlines. I’m very wary of anything more sensitive such as grades and feedback, but more about that in a minute.

The only URIs impacted are our /person/ URIs, as these will be the only source of personal information. The idea is that our users may wish to grant limited access to some of their information, or even make it public.

Why allow users to make their information public?

I’ve got a few usecases for this. A key one comes from a student-built iPhone app called iSoton which takes your university username and password and uses it to navigate several webpages to obtain your timetable and present it in a friendly format. There is no good reason for you to allow that app access to anything more than it actually needs. With the username and password it could also read and send your email! I don’t currently have access to student timetable data, but it remains a usecase for people following the patterns we’re working out here.

Another is letting students see their coursework handin timetables in a more useful format, and load them into Google Calendar or whatnot. Anything which helps the students hand coursework in on time is a bonus, in my book. This means maybe adding a new iCal export mode to our system.

Annoyingly we adopted the URI format FILEFORMAT.ecs.soton.ac.uk/person/123, so that maybe means one new HTTPS cert per format. Fortunately they no longer cost a fortune. If we are accepting passwords or passing out password-protected data there’s no excuse not to use HTTPS.

Initial Design of our Closed Linked Data

My idea is that we’ll keep the same URIs for people, but we’ll shift the RDF to https://rdf.ecs.soton.ac.uk/ just for people, and redirect http://rdf.ecs.soton.ac.uk to it.

We’ll also add https://ical.ecs.soton.ac.uk/ (or maybe ics.ecs — whichever)

Students will then be able to set their module selections and coursework information to be either private, public, or available via a username/password. If they choose to make it available via a password, then WE will generate a username:password pair for them (cjg-1:9wernhi3ewrfjio) and they will be able to set what that pair can access and give it an expiry date. (Maybe we’ll insist on an expiry of 12 months or less.) We won’t let them change the random password, as it’s intended for one-shot cut-and-paste to a phone or web app, not for remembering.
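To be concrete about the sort of thing I mean, here’s a sketch; the function and field names below are made up for illustration, not a design:

// Illustration only: mint a throwaway username:password pair with a scope
// and an expiry, for one-shot pasting into a phone or web app.
// next_pair_number() and store_access_pair() are hypothetical helpers.
function make_access_pair( $username, $scope, $months = 12 )
{
  $pair_id = $username."-".next_pair_number( $username );        // e.g. "cjg-1"
  $secret  = substr( sha1( uniqid( mt_rand(), true ) ), 0, 16 ); // random, not memorable
  $expires = strtotime( "+".min( $months, 12 )." months" );      // cap at 12 months

  // stored server-side and checked on every HTTPS request for /person/ data
  store_access_pair( $pair_id, $secret, $scope, $expires );

  return $pair_id.":".$secret;   // what the student cuts-and-pastes
}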

In this way a student can provide access to some of their personal data to calendar and semantic web tools while retaining control, and without compromising their real password. This should also let them build some toys on top of it for 3rd-year projects.

As I mentioned in a previous post, it’s not acceptable to give out personal information about OTHERS to a 3rd party site or phone app, so we won’t provide a facility for that.

ePortfolios

I had a stimulating discussion the other day with an MSc student (Niha Shaikh) looking into ePortfolios. The idea is that a student gets a transcript of their university involvement as a digital document (could be data and human-readable), this file being signed using the private key of the University of Southampton. This way they could prove to prospective employers that they got certain marks in certain modules, or show feedback from courseworks in a way that can be easily verified as sourced from the University of Southampton. They could even upload these documents to recruitment sites to help them get a job.

This all ties in with the closed linked data ideas I’ve been kicking around — you could provide live access to the same data, and limit who can access it, or choose to make it public.

ePortfolio Risks

I could see an immediate danger with this, were it to become standard practice: students with external parties very interested in their ongoing performance, be it parents, a sponsoring organisation or a home government, might be required to hand over access to this information. Can you imagine the pressure if someone was checking up on you for handing in a coursework a day late? Part of the value of university, to many students, is the chance to start becoming independent adults, and this level of monitoring would rather kill that.

A passing colleague had another interesting take on this, which even impacts the digitally signed document the student theoretically gets at the end of the course: if every piece of coursework feedback and every late-handin penalty becomes part of what you show to a prospective employer, this suddenly becomes worth appealing, as it now matters, even for small penalties on modules which don’t impact your final mark. The costs to the appeals system would be astronomical.

Posted in RDF, web management.



Wanted: Data Shamen

I’ve spoken to a number of people who seem frustrated that the information available from http://data.gov.uk/ and other providers of RDF and SPARQL isn’t really usable by the layman yet. There seems to be a misplaced belief that as tools get better, providing raw RDF will be useful to your mum.

I’m sorry, but that’s not going to happen.

You will always need skilled people to understand the subtleties and create the mash-ups for the general public to use. Almost everybody in the UK has access to a computer. Using a computer is far more effective with even a small amount of programming skill, but I’m guessing that fewer than 1% of the population attempt to acquire any. Maybe some people learn a little in school. Understanding how to get an answer out of a big complicated dataset is similar. With linked data these datasets may have errors, be patchy, and you may be trying to overlap two not-entirely-identical sets of identifiers. These require special skills which must be learned.

Information Shamen

The shaman was, in popular belief (I am hardly an expert on these things), the member of the tribe tasked with going and getting knowledge and then passing it on to the tribe. OK, maybe their approach was to chew on some dodgy fungus, but bear with me on the analogy. In our culture there are many things on which we have to trust an expert. The expert often has no more access to information than us, but they are trained to navigate it and we are not: medical doctors and archaeologists, financial advisers and sporting commentators.

With the advent of linked data we see the need for a new breed who can venture into the complexities of the web of data and bring back an answer. Better still, bring back a recipe, or a nice PHP website, which can help us get answers within a certain scope. However, any person willing to become informed and skilled enough to do this for themselves will be rare, and in doing so they become such a specialist. They do not need to be protective of their skills, as most people will be too lazy, busy or plain unable to follow where they explore.

For example, one of the pioneers of this new exploration is Tony Hirst. You should really read his blog. He has written up a number of his forays into the data, including how he did it. His blog was a key inspiration in our keeping this one.

Data Journalism

While I like the term InfoShamen, it is a wee bit pretentious. It harks back to my days of wanting to be a cyberpunk computer programmer. These days my friends glare at me if I refer to the laptop as my “deck”.

The more acceptable term is “Data Journalist”. We’re all journalists now, so long as we have a code of integrity for our blogs. But leading the field in this is the Guardian Data Blog. Check it out.

Caveat Datanaut

This morning, over breakfast, I was chatting with my housemate and cleaner about cabbages and kings and which side of the road various places drive on. This led to a discussion of this very cool bridge between China and Hong Kong (irrelevant to this article, but so cool it’s worth including). I then wrote a quick bit of SPARQL, between mouthfuls of bacon sandwich, to see a list of places which drive on the left and right, according to Wikipedia.

I did this by first looking up a country, Japan, in Wikipedia, to see if it had the data I wanted in the infobox… it did [View Japan on Wikipedia]. As it’s in the infobox there was a good chance it would be available from DBpedia. I then went to DBpedia to find out what the predicate was [View DBpedia Data]. This gave me the predicate, dbpprop:drivesOn.

Armed with this I went to http://dbpedia.org/sparql and wrote out my query:

SELECT DISTINCT ?name ?side WHERE {
   ?place dbpprop:drivesOn ?side .
   ?place foaf:name ?name .
}

Which failed, as I forgot the namespaces, so a quick trip to prefix.cc got me those. I use prefix.cc often enough that I can just write http://prefix.cc/foaf,dbpprop.sparql into a browser to get the cut-and-paste bit I need. So I added this to the top to make it work.

PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX dbpprop: <http://dbpedia.org/property/>

All of this took about 4 minutes, compared with the 30 or so it’s taken to document.

This data gave an interesting quick view showing that lots more places than we thought drive on the left. However it’s flawed in a number of ways. It lists countries once for each variation of their foaf:name, when all I want is any one name, not all of them. Maybe there’s a predicate in DBpedia for a preferred name, but it’s not worth the time to find out over breakfast. Maybe I should have just listed the URI and driveysideness, but I wanted to present it to people not used to DBpedia URIs.
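If I were tidying it up (and assuming the endpoint supports SPARQL 1.1-style aggregates, which I haven’t checked over breakfast), one fix would be to ask for a single sample name per place rather than every variation:

PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX dbpprop: <http://dbpedia.org/property/>

SELECT ?place (SAMPLE(?name) AS ?anyName) ?side WHERE {
   ?place dbpprop:drivesOn ?side .
   ?place foaf:name ?name .
}
GROUP BY ?place ?side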

My housemate asked about South Africa, and it’s not in the list. I just checked and the Wikipedia page does have a “drives on” element in the infobox, but even when I remove the requirement for a foaf:name, it still does not show up in the data.

I think that this illustrates the current level of skill required to work with such data, and also the risks of attempting to interpret it. We can improve on this, but it will be like improvements to library science: it’ll still require a skilled operator to get meaningful results.

1966

I’m hardly the first to comment on this. The following quote has been on my homepage for years, and it’s exciting seeing it beginning to come into its own 44 years after it was written.

“Today the forerunners of these synthesists are already at work in many places. Their titles may be anything; their degrees may be in anything – or they may have no degrees.

Today they are called `operations researchers’, or sometimes `systems development engineers’, or other interim tags. But they are all interdisciplinary people, generalists, not specialists – the new Renaissance Man. The very explosion of data which forced most scholars to specialise very narrowly created the necessity which evoked this new non-specialist. So far, this `unspeciality’ is in its infancy; its methodology is inchoate, the results are sometimes trivial, and no one knows how to train to become such a man. But the results are often spectacularly brilliant, too — this new man may yet save all of us.” – Robert A. Heinlein, 1966.

That last bit isn’t a bit of an exaggeration. Once a respectable number of science datasets become available online in usefully open and interoperable formats, we will have to rethink science. Some people are skilled at generating data and some at interpreting it. Currently prestige is given for collecting and interpreting some data (i.e. publishing in a high-impact journal), but it seems likely that these will, should and must become decoupled. Tim Berners-Lee, speaking at the Royal Society this year, said something along the lines of “Do your research in the mashuposphere but give all due credit to the datasphere”.

The world has changed in some very exciting ways over the last few years, and few people can see it yet!…

* One caveat: we may get to the point where AI can take a human question and query the web of linked data for the results, and report back with an explanation of any gaps or uncertainties in the data. Once we get to that level of AI, it becomes hard to justify the purpose of the human in the process.

Posted in Uncategorized.


Keeping our RDF updated and secure

A brief post on how we’re keeping our RDF accurate and up to date:

We’ve had our infrastructure data available as RDF for a while now (thanks to Nick Gibbins et al.), and have recently started pushing our datasets into a triplestore.

As time goes on, we’re consuming and re-using more and more of our RDF, with a plan to make our SPARQL endpoint public. This means that it’s increasingly important that the RDF we make available is up to date and an accurate representation of the data it’s generated from.

Keeping our RDF updated

This is the easy part. Our underlying data is mostly in MSSQL/MySQL databases. Each RDF document (e.g. http://rdf.ecs.soton.ac.uk/person/60) is built dynamically each time the URL is requested (with some caching).

Keeping our Triplestore updated

All triples from each RDF document are stored in a named graph. We’re using the URL of the RDF document as the graph name (so triples read from http://rdf.ecs.soton.ac.uk/person/60 would be stored in the named graph http://rdf.ecs.soton.ac.uk/person/60).

Every time an RDF document is requested (and once per night), a hash of the data is calculated, and stored in a database, along with the time the hash was calculated.

A set of scripts running on the same host as the triplestore queries the hash database periodically, and re-imports triples from any RDF documents which have changed since it last checked.
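In outline, the refresh step amounts to something like this (a sketch only; all the helper function names here are made up, and the real scripts also deal with batching and error handling):

// Sketch of the refresh loop: re-import any named graph whose source RDF
// document has changed since the triplestore last imported it.
foreach( get_hash_records() as $rec )   // one record per RDF document URL
{
  if( $rec["hash"] == last_imported_hash( $rec["url"] ) ) { continue; } // unchanged

  // graph name == document URL, so we can simply replace the whole graph
  $rdf = file_get_contents( $rec["url"] );
  replace_named_graph( $rec["url"], $rdf );
  record_imported_hash( $rec["url"], $rec["hash"] );
}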

We’re also slowly moving towards live updates. Many of our systems have a single point of change (e.g. a person’s profile pages are edited from a single form), so we’re planning to insert hooks into these that trigger a refresh of the data into the triplestore.

Keeping our Data Safe

An ongoing worry is that private information could get into our triplestore, or be exposed on our RDF pages. Once incorrect/private data gets out, it would be difficult to remove should it ever get into external datasets.

As well as having plenty of checks throughout the code, we’ve also added monitoring for this to our Nagios IT monitoring system.

At regular intervals, we build a list of anyone in the School who shouldn’t have information about themselves visible. This is then used to build a number of SPARQL queries, which are executed to ensure the data hasn’t made it into our triplestore.

If something is found, we log the date this was first noticed. If it isn’t corrected within a reasonable timeframe (currently 24 hours or so), our Nagios system sends warning emails out. (The 24-hour delay is to allow for the lag between the data changing at source and the triplestore picking up on that change.)
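The check itself boils down to very little code. Roughly speaking (the function names here are made up, and the real version also records when a problem was first seen so the grace period works):

// For each person who should not appear, ask the triplestore whether any
// triple mentions their URI, as subject or object.
foreach( people_who_should_be_hidden() as $uri )
{
  $ask = "ASK WHERE { { <".$uri."> ?p ?o } UNION { ?s ?p <".$uri."> } }";
  if( sparql_ask( $endpoint, $ask ) )
  {
    note_problem( $uri );  // Nagios starts warning if this isn't fixed within ~24 hours
  }
}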

Further Checks

As we deal with more (and larger) datasets, I can see a use for setting up additional checks like this – monitoring our data is just as important as monitoring our systems infrastructure in a lot of cases.

SPARQL (along with multiple datasets kept in a triplestore) provides a convenient (and consistent) way of effectively querying across multiple data sources.

A few ideas for future checks include:

  • Data integrity checks (e.g. check each module has 1+ teachers)
  • Spelling checks (e.g. check any literals for common misspellings)
  • Link checks (e.g. check any URL/URI used actually resolves)

Unit tests for data?
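Something like the first check on that list might look like this. The ecs: class and predicate names are illustrative rather than our actual ontology terms, and sparql_select() and nagios_warn() are made-up helpers:

// "Unit test for data": every module should have at least one teacher.
$q = "
  PREFIX ecs: <http://rdf.ecs.soton.ac.uk/ontology/ecs#>
  SELECT ?module WHERE {
    ?module a ecs:Module .
    OPTIONAL { ?module ecs:hasTeacher ?teacher }
    FILTER( !bound( ?teacher ) )
  }";
$untaught = sparql_select( $endpoint, $q );
if( count( $untaught ) > 0 )
{
  nagios_warn( "Modules with no teacher", $untaught );
}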

Posted in Uncategorized.


SEO Trackbacks

I had a very bizarre comment on this blog…:

Cyber Cauldron
cybercauldron.co.uk/transmogrification/
174.136.32.19
Submitted on 2010/08/17 at 11:20am

Transmogrification…

I found your entry interesting thus I’ve added a Trackback to it on my weblog 🙂 …

-snip-

It was obviously machine-generated spam, so I just hit spam. Then, out of curiosity, I checked the site it came from, which was a pagan blog post about transmogrification (meditating until you see yourself as an animal).

This made no sense; they aren’t the spammy kind. So I gave them a ring, and the chap who answered was baffled too, but he worked it out!

He’d installed a WordPress plugin for SEO (Search Engine Optimisation) which was looking for similar posts, and my “data visualisation” post got paired with his about visualising being an animal.

The nice thing is he turned off the plugin when he discovered it was causing a bother. So if you see these (88,000 matches for the string on Google), they may just be from someone who didn’t realise the SEO WordPress plugin would be annoying other users.

Posted in Uncategorized.



End of the Java era?

Apparently Oracle, who now own Java, think it’s smart to sue Google for using Java stuff. Even if they have a case, we all lose. Java is an important part of the ecosystem, but it ain’t vital. If working with Java entails legal risks, the net community will perceive this as damage and route around it.

…which would be a shame as Java is a pretty good tool. Or was last time I worked with it.

http://arstechnica.com/tech-policy/news/2010/08/oracle-sues-google-over-use-of-java-in-android-sdk.ars

Posted in Uncategorized.



Graphite 1.4

I’ve just released Graphite 1.4

Graphite is a PHP library to allow quick scripting with RDF, designed to work on one, or a small number of, linked RDF documents and produce a page.

Most of the 1.4 features were added to help Les Carr, who was trying to use it to solve a problem. It now has a bunch more methods you can call on a list of resources: append, union, except, intersection and sort.

This means you can now do:


include_once("arc/ARC2.php");
include_once("Graphite.php");


$graph = new Graphite();
$graph->ns( "ecs","http://rdf.ecs.soton.ac.uk/ontology/ecs#" );

$base_interest_uri = "http://id.ecs.soton.ac.uk/interest/";
$graph->load( $base_interest_uri."rdf" );
$graph->load( $base_interest_uri."web_science" );

$rdf_people = $people = $graph->resource( $base_interest_uri."rdf" )->all( "-ecs:hasInterest" );

$wsc_people = $people = $graph->resource( $base_interest_uri."web_science" )->all( "-ecs:hasInterest" );

print "<div>Either RDF or Web Science: ".$rdf_people->union( $wsc_people )->sort( "foaf:name")->label()->join( ", " )."</div>";

print "<div>Both RDF and Web Science: ".$rdf_people->intersection( $wsc_people )->sort( "foaf:name")->label()->join( ", " ).".</div>;

Which I think is beginning to get quite powerful, and remains reasonably easy to read.

It can also now cache the RDF locally, so you can save repeated HTTP requests when doing your linked-data thing.

Posted in Graphite, PHP, Uncategorized.