

Wanted: Data Shamen

I’ve spoken to a number of people who seem frustrated that the information available from http://data.gov.uk/ and other providers of RDF and SPARQL isn’t really usable by the layman yet. There seems to be a misplaced belief that as tools get better, providing raw RDF will be useful to your mum.

I’m sorry, but that’s not going to happen.

You will always need skilled people to understand the subtleties and create the mash-ups for the general public to use. Almost everybody in the UK has access to a computer. Using a computer is far more effective with even a small amount of programming skill, but I’m guessing that fewer than 1% of the population attempt to acquire any. Maybe some people learn a little in school. Understanding how to get an answer out of a big, complicated dataset is similar. With linked data the datasets may have errors, be patchy, and you may be trying to overlap two not-quite-identical sets of identifiers. Handling all of that requires special skills which must be learned.

Information Shamen

The shaman was, in popular belief (I am hardly an expert on these things), the member of the tribe tasked with going out to get knowledge and then passing it on to the tribe. OK, maybe their approach was to chew on some dodgy fungus, but bear with me on the analogy. In our culture there are many things we have to trust an expert on. The expert often has no more access to information than us, but they are trained to navigate it and we are not: medical doctors and archaeologists, financial advisers and sporting commentators.

With the advent of linked data we see the need for a new breed who can venture into the complexities of the web of data and bring back an answer. Better still, bring back a recipe, or a nice PHP website, which can help us get answers within a certain scope. However, anyone willing to become informed and skilled enough to do this for themselves will be rare, and in doing so they become just such a specialist. They do not need to be protective of their skills, as most people will be too lazy, too busy or plain unable to follow where they explore.

For example, one of the pioneers of this new exploration is Tony Hirst. You should really read his blog. He has written up a number of his forays into the data, including how he did it. His blog was a key inspiration for our keeping this one.

Data Journalism

While I like the term InfoShamen, it is a wee bit pretentious. It harks back to my days of wanting to be a cyberpunk computer programmer. These days my friends glare at me if I refer to the laptop as my “deck”.

The more acceptable term is “Data Journalist”. We’re all journalists now, so long as we have a code of integrity for our blogs. But leading the field in this is the Guardian Data Blog. Check it out.

Caveat Datanaut

This morning, over breakfast, I was chatting with my housemate and cleaner about cabbages and kings and which side of the road various places drive on. This led to a discussion of this very cool bridge between China and Hong Kong (irrelevant to this article but so cool it’s worth including). I then wrote a quick bit of SPARQL, between mouthfuls of bacon sandwich, to see a list of places which drive on the left and right, according to Wikipedia.

I did this by first looking up a country, Japan, on Wikipedia, to see if it had the data I wanted in the infobox. It did [View Japan on Wikipedia], and as it’s in the infobox there was a good chance it would be available from DBpedia. I then went to DBpedia to find out what the predicate was [View DBPedia Data]. This gave me the predicate, dbpprop:drivesOn.

Armed with this I went to http://dbpedia.org/sparql and wrote out my query:

SELECT DISTINCT ?name ?side WHERE {
   ?place dbpprop:drivesOn ?side .
   ?place foaf:name ?name .
}

That failed, as I’d forgotten the namespaces, so a quick trip to prefix.cc got me those. I use prefix.cc often enough that I can just write http://prefix.cc/foaf,dbpprop.sparql into a browser to get the cut-and-paste bit I need. So I added this to the top to make it work:

PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX dbpprop: <http://dbpedia.org/property/>
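
For completeness, the assembled query was just those two prefixes sitting on top of the SELECT above:

PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX dbpprop: <http://dbpedia.org/property/>

# every place with a drivesOn value, one row per distinct name/side pair
SELECT DISTINCT ?name ?side WHERE {
   ?place dbpprop:drivesOn ?side .
   ?place foaf:name ?name .
}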

All of this took about 4 minutes, compared with the 30 or so it’s taken to document.

This data was interesting as a quick view: lots more places than we thought drive on the left. However, it’s flawed in a number of ways. It lists countries once for each variation of their foaf:name, when all I want is any one name, not all of them. Maybe there’s a predicate in DBpedia for preferred name, but it’s not worth the time to find out over breakfast. Maybe I should have just listed the URI and driveysideness, but I wanted to present it to people not used to DBpedia URIs.
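
If I’d had another mouthful of sandwich to spare, one way to collapse those duplicates would be to group by place and pick an arbitrary name with SAMPLE — a sketch, assuming the endpoint supports SPARQL 1.1 aggregates:

PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX dbpprop: <http://dbpedia.org/property/>

# one row per place and side, with an arbitrary name rather than every variant
SELECT ?place (SAMPLE(?name) AS ?anyName) ?side WHERE {
   ?place dbpprop:drivesOn ?side .
   ?place foaf:name ?name .
}
GROUP BY ?place ?side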

My housemate asked about South Africa, and it’s not in the list. I just checked, and the Wikipedia page does have a “drives on” element in the infobox, but even when I remove the requirement for a foaf:name, it still does not show up in the data.
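
The quickest sanity check is to ask about the resource directly — a minimal sketch, assuming the usual DBpedia resource URI for South Africa:

PREFIX dbpprop: <http://dbpedia.org/property/>

# does the South Africa resource carry a drivesOn value at all?
SELECT ?side WHERE {
   <http://dbpedia.org/resource/South_Africa> dbpprop:drivesOn ?side .
}

If that comes back empty, the infobox value simply hasn’t made it into the DBpedia extraction, which would explain the gap.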

I think this illustrates the current level of skill required to work with such data, and also the risks of attempting to interpret it. We can improve on this, but it will be like improvements to library science: it’ll still require a skilled operator to get meaningful results.

1966

I’m hardly the first to comment on this. The following quote has been on my homepage for years, and it’s exciting to see it beginning to come into its own 44 years after it was said.

“Today the forerunners of these synthesists are already at work in many places. Their titles may be anything; their degrees may be in anything – or they may have no degrees.

Today they are called `operations researchers’, or sometimes `systems development engineers’, or other interim tags. But they are all interdisciplinary people, generalists, not specialists – the new Renaissance Man. The very explosion of data which forced most scholars to specialise very narrowly created the necessity which evoked this new non-specialist. So far, this `unspeciality’ is in its infancy; its methodology is inchoate, the results are sometimes trivial, and no one knows how to train to become such a man. But the results are often spectacularly brilliant, too — this new man may yet save all of us.” – Robert A. Heinlein, 1966.

That last bit isn’t really an exaggeration. Once a respectable number of science datasets become available online in usefully open and interoperable formats, we will have to rethink science. Some people are skilled at generating data and some at interpreting it. Currently prestige is given for collecting and interpreting some data (i.e. publishing in a high-impact journal), but it seems likely that these will, should and must become decoupled. Tim Berners-Lee, speaking at the Royal Society this year, said something along the lines of “Do your research in the mashuposphere but give all due credit to the datasphere”.

The world has changed in some very exciting ways over the last few years, and few people can see it yet!

* One caveat: we may get to the point where AI can take a human question, query the web of linked data for the results, and report back with an explanation of any gaps or uncertainties in the data. Once we get to that level of AI, it becomes hard to justify the purpose of the human in the process.
