

So long and thanks for all the triples…

After just under a decade of being with the ECS systems team, I’ll be moving on to something new.

I’ve thoroughly enjoyed the last 10 years of working with some incredibly smart individuals, both within academia and the systems team.

ECS has never ceased to be an exciting and innovative place to work. I’m very sad to be leaving, especially at a time when [open|big|semantic|private|public|leaked] data is really starting to take off. It’s one of the few institutions where open data has been valued and used both in research and as part of the core systems infrastructure.

I’m proud to have been part of the open data initiative and data.southampton.ac.uk, and look forward to seeing where it goes next, especially as more and more academic institutions over here follow suit.

As of next week, I’ll be joining the team at Garlik, working alongside several other ECS graduates, using semantic web technologies (many of which were pioneered here) in the Real World™.

They were recently acquired by Experian, so there should be plenty of interesting times and opportunities ahead, and I’m sure I can handle just one more re-org…

I’ll still be based in Southampton, so expect to see most of you around the place. All the best,

Dave Challis

Posted in Uncategorized.


Introducing primaryTopic.net

A while ago I took the now traditional RDF-newbie step of getting het up about httpRange-14.

Anyhow, I’m older and wiser (or at least more tired). One thing that did stick with me from the above-linked ramble is that it would be really useful to have a neat and simple way to refer to the topic of a page. Many normal humans don’t get data issues like URIs, unique primary keys and jokes about “This is not a pipe”. I’m interested in ways to enable stuff for the 99% of web users who are not data nerds.

Part of the thinking on this came from a good friend who suggested that we should just give up on distinguishing between the URI for a thing and the URL of a document, and let common sense figure out what we’re talking about. I kicked around some really dumb ideas of extending the URI scheme in some way to do this. Today I hit on a much, much simpler solution.

Here’s the URI for a music video I like:

http://primarytopic.net/?url=http%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D4nigRT2KmCE

The species Womp Rat, which doesn’t rate a page to itself in Wikipedia so doesn’t get a DBpedia URI:

http://primarytopic.net/?url=http%3A%2F%2Fstarwars.wikia.com%2Fwiki%2FWomp_rat

me:

And, more usefully, the SVG tutorial at the 2002 WWW Conference. This one actually indicates the primary topic of a fragment of a document. That works fine.

OK, you could mint your own URIs instead, but this way means less URI proliferation than every service minting URIs which don’t really need to resolve to their own triples. I reckon this will work very nicely with http://sameas.org/

My expectation is that when asking people for data about, say, an event or organisation, it’s useful to ask for the URL of the thing and then use that to construct a URI. If someone else is building data using the same system then your data links up. Which is what we’re about, right?
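If you want to script it, the construction is just percent-encoding the page URL into the ?url= parameter. Here’s a rough sketch using curl’s --data-urlencode to do the encoding (the --write-out bit just prints the URI that gets built; the YouTube URL is the same one as above):

curl --get --data-urlencode "url=http://www.youtube.com/watch?v=4nigRT2KmCE" \
     --silent --output /dev/null --write-out '%{url_effective}\n' http://primarytopic.net/

which should print something very like the music video URI above.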

Suggestions very welcome. I’ve kept this dead simple for now as it doesn’t need to be complicated. To save everyone license headaches I’ve made the whole thing CC0.

Posted in RDF.


Conference Spam 2

I am seriously sick of the junk mail I get every day from conferences. Some is vaguely related to my professional field. Most is not even slightly related. After my last post on the subject, I’ve had several positive comments from people for raising the issue.

Tonight’s is the “International Technology, Education and Development Conference”, inted2012@inted2012.org

My most basic demand is not just to be removed from their mailing list, but rather to be removed from the source list which they presumably purchased.

Anybody know more law than me? I really want to ruin the day of the people who bulk mail me any conference spam not related to a paper I recently published or to a registered request to get this crap.

I believe that in the UK, under the Data Protection Act, I could send £10 and ask for the data on where they got my name. Has anybody tried this? I don’t give a darn about the money, I want to alter this sucky, lame culture.

My other idea is to start compiling:

(a) a reasonable guide to what action it is reasonable to take in the UK. I don’t want to be removed from each conference’s list, I want to be removed from their source list, so methods for actually achieving this would be valuable.

(b) a list of conferences which spam me when I’ve not registered an interest in getting emails in this field. Plus a list of their sponsors. I want to know who’s paying these people to spam me.

I know it’s hardly the biggest issue facing academia right now, but I react very badly to being told I just “have to accept the way the world works”. Sod that.

Posted in Conference Spam.


The Open Data Mullet

There are two open data communities: the quick and dirty JSON & REST crowd, who do most of the actual consuming of data, and the holier-than-thou RDF crowd (of which I myself am a card-carrying member).

What’s good about RDF? It’s easy to extend indefinitely, you can trivially combine data into one big pile, it has the locations of the data sources built into it (sometimes!?) as the identifiers themselves, it’s self-describing without the risk of ambiguity you get in CSV or JSON etc., and there are simple ways to synthesise new data based on rules (semantics), like all things in the set “Members of Parliament” also being in the set “Person” (allegedly).

So why do people keep asking for JSON? Because it’s so much less intimidating. It is basically just a serialised data structure any programmer is already familiar with. It’s like XML with all the unnecessary bits cut away. Any first-year programmer can do a quick web request and have data they can get their head around.

For hard-core data nerds (like me) the power of RDF is awesome, and fun to play with, but there’s a learning curve. This is why I was inspired to write Graphite, a PHP library for doing quick stuff with small (<10,000 triples) graphs. I then used it to create the Graphite Quick & Dirty RDF browser, because I needed it to debug the stuff I was producing and consuming and the existing viewers didn’t do what I wanted.

I think one of the reasons data.southampton has been very successful is that for almost every dataset we’ve made at least one HTML viewer: for buildings, points of sale (or service), rooms, products (and services), organisation elements, bus stops etc. These mean the data in the site is instantly valuable to people who can’t read raw data files. And sorry guys, but I don’t believe that just converting the results of a SPARQL DESCRIBE query on the URI is good enough for anyone who’s not already an RDF nerd, e.g. http://ns.nature.com/records/pmid-20671719 (admittedly, this is the fall-back mode for things on data.southampton we’ve not yet made pretty viewers for…).

So here, in brief, is my recommendation for what you should plan for a good organisational open data service (and I confess, we’re not there yet, but you can learn from our mistakes). I’m not going to comment much on licenses or dataset update policy etc.; this post is about formats and URIs.

data.* best practice

RDF should be the native format; although some data may be stored in relational databases rather than a triple store, it should still have an RDF representation available. Licenses should be specified. If you don’t have a license (due to vagueness from decision makers), use http://purl.org/openorg/NoLicenseDefined to state this explicitly. Give information on who to contact about each dataset, and how to send corrections to incorrect data, possibly using the oo:contact and oo:corrections predicates we defined for this purpose.
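For example, a dataset description might look something like this in Turtle (just a sketch; the dataset URI and mailto: addresses are made up, and I’ve assumed dcterms:license for the licence statement):

@prefix oo:      <http://purl.org/openorg/> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://data.myorg.org/dataset/buildings>
    dcterms:title   "Buildings" ;
    dcterms:license <http://purl.org/openorg/NoLicenseDefined> ;
    oo:contact      <mailto:open-data@myorg.org> ;
    oo:corrections  <mailto:corrections@myorg.org> .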

URIs

Everything is given a well-specified URI, and if you’re minting the URI you do it on a website reserved for nothing but identifiers. Ideally, if your organisation homepage is www.myorg.org then you define official URIs on id.myorg.org. I know the convention is data.myorg.org/resource/foo, but I think it narrows the cognitive gap to use one domain for URIs and one for documents.

RDF for individual resources

Every URI should have an RDF format available (I like Turtle, RDF/XML is the most common, and N-Triples is easiest to show to new people learning RDF). I suggest supporting all three, as it costs nothing given a good library.
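With content negotiation set up, the same URI can serve all three; something like this should work (a sketch; the exact MIME types a given site honours may vary, and text/plain is the old-school type for N-Triples):

curl -L -H "Accept: text/turtle"         http://id.southampton.ac.uk/building/59
curl -L -H "Accept: application/rdf+xml" http://id.southampton.ac.uk/building/59
curl -L -H "Accept: text/plain"          http://id.southampton.ac.uk/building/59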

HTML for individual resources

Every URI should have an HTML page which presents the information in a way useful to normal people interested in the resource, not data nerds. The HTML page should advertise the other formats available. In some cases the HTML page may be on the main website, because it’s already got a home in the normal website structure; in this case it should still mention that there is data available. The HTML may not always show all the data available in other formats, but should show as much as makes sense. It should use graphs & maps to communicate information, if that’s more appropriate.

JSON for individual resources

In addition to the RDF you should aim to make each resource available as a simple JSON file to make it easy for people to consume your data. Data.southampton.ac.uk does not do this yet, and I feel it’s a mistake on my part.
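Even something as dumb as this would do (a sketch; these key names are made up for illustration, not what anyone actually serves):

{
  "uri":   "http://id.southampton.ac.uk/building/59",
  "label": "New Zepler",
  "lat":   50.937412
}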

KML for individual resources

If the resource contains the spatial locations or shapes of things, please also make a KML file available. I’ve got a utility that converts RDF to KML (please ask for a copy to host on your site) and I’m working to make it do two-way conversions between KML and WKT in RDF.

Lists of Things

It’s certain you’ll have pages which present lists of resources of a given type (buildings, people etc.). These pages should be laid out to be useful for normal people, not nerds. At the time of writing, our Products and Services page has got a bit out of hand and needs some kind of interface or design work. All HTML pages containing lists should link to a few things. First of all, if the list was created with a SPARQL query, link to your SPARQL endpoint so people can see and edit the query. This really promotes learning. This is, in effect, a report from your system. I recommend putting in any additional columns which are easy, even if they’re not needed for the HTML page, as this makes life much easier for people trying to consume your data; it’s easier for a beginner to remove stuff from a SPARQL query than to add it.
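A list-of-buildings report might be a query along these lines (a sketch; rooms:Building, rdfs:label and the geo predicates are the ones our building data uses, and lat/long are the sort of easy extra columns I mean):

PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rooms: <http://vocab.deri.ie/rooms#>
PREFIX geo:   <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT ?building ?name ?lat ?long
WHERE {
  ?building a rooms:Building ; rdfs:label ?name .
  OPTIONAL { ?building geo:lat ?lat ; geo:long ?long }
}
ORDER BY ?name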

I also recommend making the results available as CSV (we had to build a gateway to the 4store endpoint as it doesn’t do this natively). Adding CSV means the list can load directly into Microsoft Excel, which will make admin staff very happy. Our hacked PHP ARC2 SPARQL endpoint sits between the public web and the real 4store endpoint. It can do some funky extra outputs including CSV, SQLite and PDF (put in for April 1st). Code is available on request, but we can’t offer much free support time as we’ve got lots to do ourselves.

You might also want to add RSS, Atom or other formats which make sense.

APIs

Once you’ve got all that you’ve got a very powerful & flexible system, but there’s no reason not to create APIs as well. My concern with an organisation creating an API to very standard things (like buildings or parts of the org chart) is that you lose the amazing power of Linked Data. An API tends to lock people into consuming from one site, so an App built for Lincoln data won’t work at Southampton.

Making APIs takes time, of course, as does providing multiple formats. That said, once you’ve got a SPARQL backend, it should be bloody easy to make APIs. In fact, if it’s an open endpoint, you can make APIs on other people’s data!
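As a sketch of what I mean (the endpoint URL below is a placeholder, so substitute whichever open endpoint you’re interested in; the query parameter is just the standard SPARQL protocol):

curl --get "http://example.org/sparql" \
     --data-urlencode 'query=SELECT ?building ?name WHERE { ?building a <http://vocab.deri.ie/rooms#Building> ; <http://www.w3.org/2000/01/rdf-schema#label> ?name } LIMIT 20'

Wrap that in a few lines of whatever language you like and you have an “API” which is really just a canned query on somebody’s open endpoint.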

RDF Business Up-Front, JSON Party in-the-Back

So that’s my Open Data Service Mullet: provide RDF, SPARQL and Cool URIs for the awesome things they can do that other formats can’t, but still provide JSON, CSV and APIs everywhere you can, because that makes it easier for keen people to do quick, cool stuff.

Posted in Best Practice, RDF.


When Conference Spam goes Too Far

(These are my personal opinions, and not those of the University of Southampton.)

I am angry with a conference.

I don’t know about you but I get conference spam daily. Phrases like “CFP” or “Deadline extended” in a subject line will consign it to the delete button before my brain really absorbs it. About 50% of the conference spam is in my field or close enough to be forgivable. However the other half isn’t. I get spam for things entirely outside my academic field, and the conferences won’t tell me where they got my email address so I can request removal.

All of this is nothing compared with the message I got a little over a month ago, the subject line of which was “LOSS & BEREAVEMENT”, for http://www.bmehealth.org/. You can tell it must be important because they used capital letters! The introduction read:

This one day conference will critically explore the ways in which loss and bereavement is understood and experienced by individuals and groups from various cultures and backgrounds. All of the great world religions provide us with solutions to the universal problem of death.

I am recently bereaved and my father is a man of strong religious convictions so I took some time working out that this wasn’t something he’d subscribed me to, but rather just irrelevant junk in my inbox.

I wrote to the organisers and to the listed speakers because I feel spamming people about bereavement is utterly unacceptable. It’s not like spam about physics or user interface design. This is personal and painful. If they’ve mailed me (a member of web support staff in an Electronics & Computer Science department), it’s certain they’ve mailed hundreds, maybe thousands or more, of people who have no interest in this subject, many of whom will have suffered recent bereavement (it’s a rather common affliction).

I eventually got a response:

Thank you for your recent email to us and to the our list of speakers for our Loss and Bereavement event.

We are sorry that you have been angered by receiving our mailshot of the event. We can ensure that we have found your email address through our own efforts on the web search and not through any illegal means.

We have informed our webmaster to ensure that your email is removed with immediate effect from our mailing list and can assure you that you will not receive any further email from us.

We are sorry for the inconvenience this may have caused you. In the future if you find unsolicited mails from us, please reply with “Remove” in the subject line and we will take care that you do not receive any further promotional mail.

Kind regards,
Ahmed Qureshi
ETHNIC HEALTH INITIATIVE

Which was progress, but didn’t exactly fill me with cheer. “It’s not illegal” is a pretty inflammatory response; however, they made absolutely no changes to the message they were sending out. How do I know? Because they sent it to me again a month later!

This second email was identical to the first. I sent them a pretty stroppy email (which could be summarised as “you’re incompetent and you suck”, which I feel is fair under the circumstances).

It took them nearly a week to respond, and their response was:

Dear Christopher,

Your complaint was dealt with on 27th August.

However I have since spoken to the web designer and he has apologised that somehow a duplicate was produced and it was sent out, which should not have happened. He has now manually taken you off the list. This should do it. On behalf of EHI our apologies to you also.

Best
Ahmed Qureshi

That’s really not good enough. I will not be fobbed off; they are clearly not concerned with who they upset in order to promote their conference. I wrote to them again to ask what steps they have taken to ensure that (a) they don’t spam completely inappropriate people, (b) people actually get properly removed and kept off their lists, and (c) they have less impact on the recently bereaved.

This conference is sponsored mostly by charities which exist to help the bereaved, hence I would expect excellent efforts not to be jerks about how it’s promoted. I sent the last email last week with a note to the effect that I’d be contacting their sponsors if they didn’t tell me they had made an effort to reduce the impact of their spamming. They have not responded.

Their event is sponsored by:

Two of these sponsors have links in the page, but not icons, which is pretty sloppy work.

My biggest issue is the two sponsors who are very actively soliciting donations from the public. They should ensure that their money is not spent on spamming people about the subject.

I plan to draw this blog post to the attention of all the sponsors. This isn’t OK behaviour.

Posted in Best Practice.


WTF is the Semantic Web?

I’ve had a request to write a post on “What is the Semantic Web?”… so here goes. This is a personal perspective, but if people point out glaring or dangerous errors, I may update the article. I’m not going to allow comments which will inflame the usual debates; I’m trying to write a friendly summary for people who are interested.

Executive Summary

The Semantic Web allows software to find out facts about the structure of data. A data file on the web says “Chris Gutteridge” is an employee of the University of Southampton. A computer uses Linked Data to discover facts about the identifier for the relationship “is employee of” and uses these facts to reason that Chris Gutteridge must be a person, and that the University of Southampton must be an organisation, even though these facts were not in the initial data document. Now imagine that scaled up to the whole Web!

URIs are Awesome

URIs are globally unique identifiers which identify, er, something. A subset of URIs are URLs, which locate something on the web. There are also things called IRIs, which allow non-ASCII characters, but don’t worry about them for the purposes of this explanation.

Why are URIs awesome?

Well, the instant value is that you can confidently generate unique identifiers, by generating them as web-like addresses in a domain that you own. This is just a convention, but a damn elegant one.

However, if that’s all that URIs were, then UUIDs would do very nearly as well, or better in some cases.

So URIs can do something UUIDs can’t?

In my previous post I explained that Linked Data is when you have links from one dataset to another. There’s a sort of degenerate version of this, which makes linked data people sad, but is still dead useful. That’s when you just use URIs as globally-unique identifiers, but don’t make them resolvable on the web to more data. If you define them in a website you own, then you always have the option of making them resolvable at some later date.

The idea of discovering more useful data by resolving a URI which identifies something is really neat, but where it gets confusing is that you can’t have an identifier which identifies both “Building 59 at the University of Southampton” and some “30K HTML document on the web”. One of these has a size measured in cubic meters, the other in bytes. So what we do is we make the identifier for the thing different from the one for the document. The really simple way to do it is like this:

http://users.ecs.soton.ac.uk/foaf.rdf#me - the URI for me
http://users.ecs.soton.ac.uk/foaf.rdf - the document describing me

When you resolve a URI with a # in it (a fragment identifier, if you want to sound clever), you get the content from the web address without the ‘#’ bit. This is good as far as it goes, but it can only return a single file format for the concept and is kind of ugly. There’s a much more neato way to do this, but it requires a bit more webmastering…

HTTP 303 See Other Redirects

(This is sometimes called “httpRange-14”, generally when people are arguing about it, which seems to be an ancient and holy tradition of semantic web mailing lists.)

The clever way to get from the real-world-thing URI to the data-document URI is to use the HTTP return code 303. Now, you might have already run into 301 and 302, which mean a resource has moved permanently or temporarily. You can see them in action when you type in a web address and your browser changes it to the ‘official’ location. For example, if you visit http://data.soton.ac.uk you’ll be redirected to http://data.southampton.ac.uk/. You can look under the hood on a Linux or OS X machine by going to the command line and typing:

curl -I http://data.soton.ac.uk

The -I means just show me the HTTP headers. You should get something like:

HTTP/1.1 302 Found
Date: Fri, 22 Jul 2011 08:39:49 GMT
Server: Apache/2.2.14 (Ubuntu)
Location: http://data.southampton.ac.uk/
Vary: Accept-Encoding
Content-Type: text/html; charset=iso-8859-1

This tells the web browser to redirect to the URL in the Location: line to find the thing it’s looking for.

But when you resolve the URI for a thing which can’t be expressed as a document, we can’t just say the document has moved because it hasn’t. 303 is the code for “See Other”, or in other words, “I can’t or won’t give you what you asked for, but hey, this URL may be of interest…”. To see it in action, try:

curl -H"Accept: application/rdf+xml" -I http://id.southampton.ac.uk/building/59

You should get something like:

HTTP/1.0 303 See Other
Date: Fri, 22 Jul 2011 08:43:14 GMT
Server: Apache/2.2.14 (Ubuntu)
X-Powered-By: PHP/5.3.2-1ubuntu4.9
Location: http://data.southampton.ac.uk/building/59.rdf
Vary: Accept-Encoding
Connection: close
Content-Type: text/html

Wait a minute; why the -H bit?

Thanks for asking. The -H adds a header to the request; in this case an Accept header, which tells the web server what our preferences for document formats are. If you type http://id.southampton.ac.uk/building/59 into a web browser, or type:

curl -H"Accept: text/html" -I http://id.southampton.ac.uk/building/59

then you’ll get sent to http://data.southampton.ac.uk/building/59.html, which is a useful page for human beings. This is a bit fiddly to set up, but really neat, as the one identifier is now of use to both humans and machines. Following the identifier to find out more facts is sometimes called “follow your nose”. Humans can do it by hand, but so can software.

But Chris, why would a URI be useful to machines?

Aha, well, that’s getting round to the main point of this article. You’ll notice that in RDF data you have two or three URIs per fact (we call these facts triples). Here are some representative examples. The three values are named subject, predicate and object. The predicate identifies the relationship between the subject and the object.

 <http://id.southampton.ac.uk/building/59> 
 <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within>
 <http://id.southampton.ac.uk/site/1> .

In this next one, the object is a “literal”, i.e. a value rather than a URI.

 <http://id.southampton.ac.uk/building/59> 
 <http://www.w3.org/2000/01/rdf-schema#label> 
 "New Zepler" .

In this next triple we actually have a 4th value, which is a data type. You can have any data type you want, but the most useful ones are those defined in the xsd namespace.

 <http://id.southampton.ac.uk/building/59> 
 <http://www.w3.org/2003/01/geo/wgs84_pos#lat>
 "50.937412"^^<http://www.w3.org/2001/XMLSchema#float"> .

This final example uses “rdf:type”, which indicates that the subject is in the set of things (or class) represented by the object identifier.

 <http://id.southampton.ac.uk/building/59> 
 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
 <http://vocab.deri.ie/rooms#Building> .

I just used an abbreviation, back there, for rdf:type. Generally, in examples (and in real documents) the URIs for predicates and classes are abbreviated using a namespace prefix. The most common namespaces have well-established prefixes, so we use them as a shortcut, but never forget that the class and predicate identifiers are full web addresses in their own right. Tip: to find the common prefix for a namespace, or the namespace for a prefix, use the dead useful http://prefix.cc/
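To show what that looks like, here are the building 59 triples again, this time in Turtle with the predicates, class and datatype abbreviated using prefixes:

@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix geo:   <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rooms: <http://vocab.deri.ie/rooms#> .

<http://id.southampton.ac.uk/building/59>
    rdfs:label "New Zepler" ;
    geo:lat    "50.937412"^^xsd:float ;
    rdf:type   rooms:Building .

(Turtle also lets you abbreviate rdf:type to just “a”, which you’ll see all over real documents.)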

Yes, but why would a machine want to resolve a URI?

There are a few good examples, but today I’m sticking to what it gets if it resolves the URI for a class or predicate. Often it will get back some triples describing that class or relationship.

This allows a computer to synthesise new triples, using simple logical steps.

Here’s an example…

Your software resolves the URI for the relationship

 <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within>

defined by the UK Ordnance Survey. If you put it into a web browser you just get HTML back; however, you can either use curl (the -L means follow any redirections; this time we have also dropped the -I, so we get the document body rather than just the headers):

curl -L -H"Accept: application/rdf+xml" http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within 

…or you can use a web-based tool to view it as RDF.

Either way, it states that this relationship is transitive and the inverse of

 <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/contains>

From that our software can deduce that if room http://id.southampton.ac.uk/room/59-1257 (our seminar room) is “within” http://id.southampton.ac.uk/building/59 and that building 59 is within the Highfield Campus (http://id.southampton.ac.uk/site/1) then… drum roll please… Seminar Room 1 is “within” Highfield Campus and therefore Highfield Campus “contains” our Seminar Room 1. One small loop of code for a machine, one giant leap for machine kind.
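In triples, the whole thing looks roughly like this (a sketch with made-up prefix abbreviations, using the owl: terms for “transitive” and “inverse of”):

# what the data and the ontology give us
room:59-1257     spacerel:within  building:59 .
building:59      spacerel:within  site:1 .
spacerel:within  rdf:type         owl:TransitiveProperty ;
                 owl:inverseOf    spacerel:contains .

# what the software can now add for itself
room:59-1257     spacerel:within    site:1 .
site:1           spacerel:contains  room:59-1257 .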

This might not seem like much, but it’s a huge step. It does not require custom code for each dataset.

Another bit of semantics you can get back is that a class is a subclass of another class. Example: see the bit about buildings in the document you get back if you resolve the Building class described in the triples above. So, as a different example to keep it interesting, say your data states that David Beckham is a Professional Footballer. Your software can then resolve the URI which represents “Professional Footballer” and learn that it is a subclass of “Professional Athlete”, which it can resolve to discover that all professional athletes are of class Person. It can then deduce that David Beckham is also of type Person… who knew?
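As triples, that reasoning chain looks something like this (a sketch; the ex: URIs are made up):

ex:DavidBeckham            rdf:type         ex:ProfessionalFootballer .
ex:ProfessionalFootballer  rdfs:subClassOf  ex:ProfessionalAthlete .
ex:ProfessionalAthlete     rdfs:subClassOf  foaf:Person .

# and therefore the software can add:
ex:DavidBeckham            rdf:type         foaf:Person .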

A more complex case is to restrict the domain and range of a predicate; that is, the class, or set of classes, which are “legal” before and after the predicate. For example, the

Posted in RDF.



Data Pages

I don’t like the Linked Data API, not because it’s a bad design, but because you end up with scary, confusing HTML webpages which people land on by accident and have no clue as to what to do next.

It’s my strong opinion that when producing a system that will mass-produce HTML pages with raw data dumps in them, you will alienate people who arrive there by mistake from search engines.

Currently we don’t have a viewer for individual payments, so you can see how our data site defaults to show a raw data object:
http://data.southampton.ac.uk/payments/201106/payment-50350344.html

“This page shows raw data about this resource. We have not yet had the time to build a pretty viewer for this type of item. Don’t worry if it seems a bit arcane — you are looking under the hood of our service into the data inside! You can download this data in the following formats: rdf, ttl, json, xml.”

I think just telling people that this page isn’t the droids they are looking for will help them move along without distress, and should be considered good practice.

PS. The Linked Data API does one thing I have a real issue with: it can only be configured to follow “forward” relationships from a resource. So you can show a person with their name & phone number, follow the link to their address, and follow the link from there to their city. What you can’t do is follow an inverse link like “foaf:member”, which links from a group to a person, not a person to a group. It could easily be fixed by a tiny tweak where you define an inverse relationship for the property and give that a label, and the system would look for both it and its reverse. This is much cleaner (and more semantic) than having people create inverse triples for every membership in their database to appease the whims of the Linked Data API. I know I should join the community and argue for it there, but I don’t have time. Sorry.
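To make the problem concrete (a sketch; the URIs are made up, and ex:memberOf is exactly the sort of artificial inverse triple I’d rather not have to assert):

# the real data: the link goes group -> person
grp:web-team  foaf:member  person:cjg .

# what you're forced to add just so the viewer can show it from the person's page
person:cjg  ex:memberOf  grp:web-team .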

Posted in Best Practice, RDF.


Linked Data vs Open Data vs RDF Data

I often notice people get confused between Open Data, Linked Data and RDF. Here’s a quick overview to get you straight:

Open / Linked / RDF Data

These three things are all slightly different: “Open Data” is a policy, “Linked Data” is an approach, and “RDF” is a data structure.

Open Data: Open data is data which you can use more or less freely. It’s generally available on the web, and uses non-proprietary formats like XML or CSV. An extremist definition is data with a clear copyright & an open license (which allows commercial reuse), available from a URL or a well-documented API without any restrictions, in formats which are completely open (i.e. no patent concerns etc.). A milder definition is “available as data on the web in a form people can do stuff with”. Some Open Data is also Linked Data and RDF, but probably less than half.

Linked Data: Linked data is data which contains links to other datasets. Generally these will use URIs which are resolvable to discover more facts. It’s not essential for the URIs to be resolvable; it’s still really useful to have two different datasets which have used the same identifiers, because URIs are unambiguous. However, some data doesn’t make much sense to link up, or the costs are too high and put people off. Linked data is often open, but doesn’t have to be; for example, you can have internal confidential data which links up with other data sources. A good example is a lecture timetable, which is confidential to the student, but links to data about rooms & modules which are open. Almost all Linked Data is currently expressed in RDF, but you could have links in XML, KML, CSV etc.; it’s just that RDF is designed with linking in mind.

RDF: RDF is a useful data structure for creating interoperable data. It has a number of file formats for exchanging this data. The most common is RDF/XML. The nicest (in my opinion) is Turtle. The simplest is N-Triples, where you just write out the data one fact per line. You can also express RDF data embedded in HTML as “RDFa”. The structure of RDF makes it trivial to merge data from multiple sources: it’s all triples. It also assumes that either you will want to link the data yourself, or other people will want to link into your data. You can publish RDF data which becomes linked data as other people link to it, just like publishing pages on the web. RDF is just a way of structuring data, and as such is not always open and not always linked.
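For example, here is a single fact about a building, written as one N-Triples line:

<http://id.southampton.ac.uk/building/59> <http://www.w3.org/2000/01/rdf-schema#label> "New Zepler" .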

Linked Open Data: (aka LOD) is a common term, and as you can see is usually going to be in RDF too. The key thing is not to get put off by the linking. Add links when they provide value to your data and will help people using your data (yourself included) do more with it.

Posted in RDF.



Concerns about competitive metrics for Repositories

I’m deeply concerned about the power lying in the Webometrics league table: http://repositories.webometrics.info/toprep_inst.asp

They give a ranking bonus for your number of “Rich Files”, which basically means “number of PDFs”. This means that if we were to push for using “scholarly HTML” rather than PDF then our rank would drop.

Currently eprints.ecs.soton.ac.uk is at 22 and eprints.soton.ac.uk is at 60. I couldn’t tell you why, but stats isn’t my strong suit.

My real concern is that this league table will stifle innovation by only measuring common quality factors, rather than promoting new ones. Also, I think the ‘delta’ is more important than the size, and always have. The success criterion for the TARDIS project, which launched eprints.soton, was that it should have a certain number (2000, I think) of records by a certain date. I opposed that at the time, and still think it was wrong. A better criterion would have been a sustained deposit rate and (in the first two years) a continuously increasing number of contributors.

http://roar.eprints.org/ is run by one of my colleagues, but I’m very happy to see that they show graphs of ‘deposit activity’ rather than size. This shows that eprints.soton is in very robust health (http://roar.eprints.org/1423/), with a sustained level of daily deposits over the past few years.

What’s unhealthy is that a drop in the ranking for eprints.soton caused the board which oversees the site to discuss how to improve our rankings, and there was no really obvious way I could see to do it without generating unnecessary additional PDF files. Of course this was rejected as a silly idea, but my fear is that other sites may feel pressured to improve their ranking and make bad decisions. The community should be calling the shots on what metrics make a good repository. I’m not sure what those metrics should be, but they should be as careful as they can to avoid a situation where I can inflate my score by making my repository worse, e.g. by encouraging bad formats like PDF.

If you’ve not heard the PDF rant, then in short it’s that people write and read papers primarily on computers. In most cases they write in a format with some markup (LaTeX or Word) and then convert it to simulated sheets of A4 paper (PDF). Computers rarely have displays where an A4 page is useful. I don’t see how it’s acceptable to produce papers (gah, even the name is inappropriate) which can’t be comfortably viewed on my landscape laptop screen, on my phone, or on the iPad I might justify buying one day. Reading papers is one of the key things an academic does for a living, and it’s still easier to read them by printing them out first.

There are some people moving in the right direction, at least (http://scholarlyhtml.org/), but the repository and research-publication community needs to be goaded in this direction, out of its PDF comfort zone.

Posted in Repositories.


Why I’m looking forward to SPARQL 1.1 (and a ramble about bad and malicious data)

There are lots of features coming in SPARQL 1.1.

However, there’s one little feature I’m really looking forward to: SAMPLE. It’s right at the bottom of the new aggregates (Harris corrected me on this in the comments).

UPDATE:

I misunderstood how SAMPLE works. Dave got it working on our local endpoint and at first it appeared to just be the equivalent of doing LIMIT 1. That’s bloody useless.

SELECT (SAMPLE(?s) AS ?an_s) ?p ?o WHERE { ?s ?p ?o } LIMIT 10

I would have expected the above to return 10 rows, but with only a single row for each distinct ?s, with the first ?p and ?o to go with it.

I was hoping that this was a bug and not the intended implementation of SAMPLE, otherwise it would be utterly useless; why bother if it’s semantically the same as LIMIT 1?… It turns out I didn’t RTFM, and the answer is GROUP BY…

It seems I was missing a GROUP BY in my examples.
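With a GROUP BY added, the query above does something sensible: one sample subject for each distinct ?p and ?o pair. Something like:

SELECT (SAMPLE(?s) AS ?an_s) ?p ?o
WHERE { ?s ?p ?o }
GROUP BY ?p ?o
LIMIT 10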

Lat & Long & Lat & Long

When we were overhauling the way I create data for buildings in data.southampton, I accidentally left in the old lat & long list as well as the new one. That meant that some buildings had two lats and two longs. When I asked for a list of buildings with their lat and long, I got four results per building, as it gave every possible variant. So if a building had the following data:

building59 rdfs:label "My Building" ;
  geo:lat "0.100" ; geo:long "0.200" ;
  geo:lat "0.101" ; geo:long "0.202" ;

The results of

SELECT ?label ?lat ?long WHERE { ?b a Building; rdfs:label ?label; geo:lat ?lat; geo:long ?long }

end up multiplying it out

?label      | ?lat  | ?long |
My Building | 0.100 | 0.200 |
My Building | 0.101 | 0.200 |
My Building | 0.100 | 0.202 |
My Building | 0.101 | 0.202 |

All of which are, of course, true, but it’s not really what I wanted. The new SAMPLE feature will limit a field to only one result.

SELECT ?label (SAMPLE(?lat) AS ?a_lat) (SAMPLE(?long) AS ?a_long)
WHERE { ?b a Building; rdfs:label ?label; geo:lat ?lat; geo:long ?long }
GROUP BY ?b ?label

What I’m hoping is that it takes one sample per group of the other fields, not one sample from all the rows returned, so that there will still be one valid ?lat for each building row returned.

Bad URIs

A bigger cock-up I made recently was generating URIs for our phonebook by hashing the email address of each person. That seemed to work fine, but I didn’t notice that some people didn’t have an email address, so they all ended up with the same URI. This resulted in one URI having many (nearly 100) given names, family names, names and phone numbers. So if you just request all people with their given name, family name and phone number, then this one rogue URI (generated from an empty ("") email address) has 100 of each, and the query returns every variation, which is 100 x 100 x 100: a million rows, which isn’t very helpful. I know that someone working with the data was getting out-of-memory problems!

Would SAMPLE fix this? Yes.

Isn’t that masking an error? Well, yes, but I’m a Perl programmer at heart. It’s more important to have a system which keeps working than a system you let keep breaking in order to spot issues. Build a better way to spot issues that doesn’t inconvenience the users!

If the query had been for (SAMPLE(?given) AS ?a_given) (SAMPLE(?family) AS ?a_family) {….} GROUP BY ?person then it would still have had one weird record, but day-to-day operation wouldn’t have broken. It would have taken longer to fix the problem, but the system wouldn’t have been overloaded and breaky while we were unaware of the problem.
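Written out in full, the defensive version would look something like this (a sketch; I’ve assumed the usual foaf predicates for names and phone numbers):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person (SAMPLE(?given) AS ?a_given) (SAMPLE(?family) AS ?a_family) (SAMPLE(?phone) AS ?a_phone)
WHERE {
  ?person foaf:givenName  ?given ;
          foaf:familyName ?family ;
          foaf:phone      ?phone .
}
GROUP BY ?person

The rogue URI still turns up, but as one row rather than a million.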

But bad data is, er, bad!

If the semantic web/linked data/open data concepts are to work, then you’ve got to deal with the fact that there’s going to be bad data. If it’s so fragile that the services fall over every time someone gets URIs wrong, then it ain’t going to work, as people make mistakes all the time. Plan for it.

Anything consuming open data should be considering how to deal with all kinds of broken data:

  • Typos in literals
  • Factual errors (not the same as typos)
  • Structural semantic errors, like many people incorrectly having the same URI. If I did it by accident, it’s likely to happen to other data sources now and then.
  • “Impossible” semantics. It’s likely you’ll have some facts in a big dataset that the ontology says are mutually exclusive. Beware RDF versions of the Bible.
  • Simple malicious data such as incorrect literals, or false predicates
  • Malicious semantics, where someone creates innocuous-seeming triples which do something unexpected when combined with certain other datasets.

Malicious Semantics

…or, at the very least, sneaky semantics.

Hugh Glaser pointed this out to me; I’ll see if I can explain it.

Imagine a judge has declared it illegal to make a certain fact public: specifically, that two people know each other.

Site-A defines:

person:678AF foaf:knows person:67D32 .

Then Site-B defines:

person:678AF foaf:name "Timogen Stohmas" .

Then Site C defines:

person:67D32 foaf:name "Byron Briggs" .

You can find out, from combining three sources, that Byron apparently knows Timogen, but who let that cat out of the bag? Site-A did, assuming that they meant the URIs to mean what B & C claim, but could you prove that?

See Hugh’s examples at http://who.isthat.org/

Support for SAMPLE in 4store

I’m told that 4store has implemented the SAMPLE feature from 1.1 along with the other aggregates, most of the Update operations, and most of the FILTER functions.

It didn’t work for me when I just tried it on our local copy, but that may be because we are on a stable rather than a dev version, or possibly due to the PHP that sits between the public SPARQL endpoint and the 4store back-end-point.

UPDATE: it does work, I just haven’t had enough coffee.

Live Example

The following selects all buildings from our endpoint, and an example of something that they contain. Most buildings contain many things, but this only lists one thing per building.

SELECT ?building ?building_name (SAMPLE(?inside_thing) AS ?an_inside_thing) (SAMPLE(?inside_thing_name) AS ?an_inside_thing_name)
WHERE {
 ?building a <http://vocab.deri.ie/rooms#Building> ; rdfs:label ?building_name .
 OPTIONAL { ?inside_thing spacerel:within ?building ; rdfs:label ?inside_thing_name }
}
GROUP BY ?building ?building_name

Posted in 4store, RDF.
