Why I’m looking forward to SPARQL 1.1 (and a ramble about bad and malicious data)

There are lots of features coming in SPARQL 1.1.

However, there’s one little one I’m really looking forward to: SAMPLE. It’s right at the bottom of the new aggregates (not “set functions”; Harris corrected me on this in the comments).

UPDATE:

I misunderstood how SAMPLE works. Dave got it working on our local endpoint and it appears to just be the equivalent of doing LIMIT 1. That’s bloody useless.

SELECT (SAMPLE(?s) AS ?an_s) ?p ?o WHERE { ?s ?p ?o  } LIMIT 10

I would have expected the above to return 10 lines, but only a single line for each ?s with the first ?p and ?o to go with each ?s.

I was hoping that this was a bug and not the intended implementation of SAMPLE, because it would be utterly useless if it were semantically the same as LIMIT 1. It turns out I didn’t RTFM: I was missing a GROUP BY in my examples. Without one, the aggregate triggers implicit grouping (as Steve Harris notes in the comments), which is why only one line came back.

Lat & Long & Lat & Long

When we were overhauling the way I create data for buildings in data.southampton, I accidentally left in the old lat & long list as well as the new one. That meant that some buildings had two lats and two longs. When I asked for a list of buildings with their lat and long, I got four results per building, as it gave every possible combination. So if a building had the following data:

building59 rdfs:label "My Building" ;
  geo:lat "0.100" ; geo:long "0.200" ;
  geo:lat "0.101" ; geo:long "0.202" .

The results of

SELECT ?label ?lat ?long WHERE { ?b a Building; rdfs:label ?label; geo:lat ?lat; geo:long ?long }

end up multiplying it out

?label      | ?lat  | ?long |
My Building | 0.100 | 0.200 |
My Building | 0.101 | 0.200 |
My Building | 0.100 | 0.202 |
My Building | 0.101 | 0.202 |

All of which are, of course, true, but it’s not really what I wanted. The new SAMPLE feature will limit a field to only one result.

SELECT ?label (SAMPLE(?lat) AS ?a_lat) (SAMPLE(?long) AS ?a_long)
WHERE { ?b a Building; rdfs:label ?label; geo:lat ?lat; geo:long ?long }
GROUP BY ?b ?label

What I’m hoping is that it’s one sample per group of the other fields, not one sample from all the rows returned, so that there will still be one valid ?lat for each building row returned.
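To make the difference concrete, here is a rough Python sketch of what the two queries do with the building59 data. It is only an in-memory imitation, not how any SPARQL engine works internally; the helper names (values, all_rows, sampled_row) are made up for illustration, and picking the first value is this sketch's own choice, since SAMPLE does not promise which value you get.

```python
from itertools import product

# Hypothetical in-memory copy of the building59 triples from the example above.
triples = [
    ("building59", "rdfs:label", "My Building"),
    ("building59", "geo:lat", "0.100"),
    ("building59", "geo:long", "0.200"),
    ("building59", "geo:lat", "0.101"),
    ("building59", "geo:long", "0.202"),
]

def values(subject, predicate):
    """All objects for a subject/predicate pair, in insertion order."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def all_rows(subject):
    """Imitate the plain SELECT: every label x lat x long combination."""
    return list(product(values(subject, "rdfs:label"),
                        values(subject, "geo:lat"),
                        values(subject, "geo:long")))

def sampled_row(subject):
    """Imitate SAMPLE with GROUP BY: one value per column for the group.
    Taking the first value is an assumption of this sketch only."""
    columns = ("rdfs:label", "geo:lat", "geo:long")
    return tuple(values(subject, p)[0] for p in columns)

rows = all_rows("building59")
one_row = sampled_row("building59")
print(len(rows))  # 1 label x 2 lats x 2 longs
print(one_row)
```

Note that each column is sampled on its own, so the lat and the long that come back are not guaranteed to have been entered as a pair.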

Bad URIs

A bigger cock-up I made recently was generating URIs for our phonebook by hashing the email address of the person. That seemed to work fine, but I didn’t notice that some people didn’t have an email address, so they all ended up with the same URI. This resulted in one URI having many (nearly 100) given names, family names, names and phone numbers. So if you just request all people with their given name, family name and phone number, then this one rogue URI (generated from an empty (“”) email address) has 100 of each, and the query returns every combination: 100 x 100 x 100, which is a million rows, which isn’t very helpful. I know that someone working with the data was getting out-of-memory problems!

Would SAMPLE fix this? Yes.

Isn’t that masking an error? Well, yes, but I’m a Perl programmer at heart. It’s more important to have a system which is working than a system that you let keep breaking to spot issues. Build a better way to spot issues that doesn’t inconvenience the users!

If the query had been for (SAMPLE(?given) AS ?a_given ) (SAMPLE(?family) as ?a_family) {….} GROUP BY ?person then it would still have had one weird record, but the day-to-day operation wouldn’t have broken. It would have taken longer to fix the problem, but the system wouldn’t have been overloaded and breaky while we were unaware of the problem.
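As for a better way to spot issues, one possible approach is to count how many rows each subject would multiply out to, and flag the outliers. The Python below is a sketch under assumptions, not what we actually run: the predicate names, the person: URIs, the limit of 10 and the function names are all invented for the example, and the data is made up.

```python
from collections import defaultdict
from math import prod

def values_per_property(triples):
    """Map subject -> predicate -> set of distinct values."""
    index = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        index[s][p].add(o)
    return index

def row_count(props):
    """Rows a plain SELECT over these predicates would return for one subject."""
    return prod(len(vals) for vals in props.values())

def suspicious(triples, limit=10):
    """Subjects whose values multiply out to more than `limit` rows."""
    index = values_per_property(triples)
    return {s: row_count(props)
            for s, props in index.items()
            if row_count(props) > limit}

# Made-up data: one normal person, and one URI shared by 100 people.
triples = [
    ("person:ok", "foaf:givenName", "Ann"),
    ("person:ok", "foaf:familyName", "Example"),
    ("person:ok", "foaf:phone", "tel:1"),
]
for i in range(100):
    triples += [
        ("person:empty", "foaf:givenName", f"Given{i}"),
        ("person:empty", "foaf:familyName", f"Family{i}"),
        ("person:empty", "foaf:phone", f"tel:{i}"),
    ]

flagged = suspicious(triples)
print(flagged)
```

Run as a nightly check, something like this would flag the weird record without anyone having to wait for a query to fall over first.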

But bad data is, er, bad!

If the semantic web/linked data/open data concepts are to work, then you’ve got to deal with the fact that there’s going to be bad data. If it’s so fragile that the services fall over every time someone gets URIs wrong then it ain’t going to work, as people make mistakes all the time. Plan for it.

Anything consuming open data should be considering how to deal with all kinds of broken data:

  • Typos in literals
  • Factual errors (not the same as typos)
  • Structural semantic errors, like many people incorrectly having the same URI. If I did it by accident, it’s likely to happen to other data sources now and then.
  • “Impossible” semantics.  It’s likely you’ll have some facts in a big data-set that the ontology says are mutually exclusive. Beware RDF versions of the Bible.
  • Simple malicious data such as incorrect literals, or false predicates
  • Malicious semantics, where someone creates innocuous seeming triples which do something unexpected when combined with certain other datasets.

Malicious Semantics

…or, at the very least, sneaky semantics.

Hugh Glaser pointed this out to me. I’ll see if I can explain it:

Imagine a Judge has declared it illegal to make a certain fact public, specifically that two people know each other.

Site-A defines:

person:678AF foaf:knows person:67D32

Then Site-B defines:

person:678AF foaf:name "Timogen Stohmas" .

Then Site-C defines:

person:67D32 foaf:name "Byron Briggs" .

You can find out, by combining the three sources, that Timogen apparently knows Byron, but who let that cat out of the bag? Site-A did, assuming that they meant the URIs to mean what B & C claim, but could you prove that?

See Hugh’s examples at http://who.isthat.org/
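The effect of combining the three sites above can be sketched in a few lines of Python. This is only an illustration: the set-of-tuples representation and the function names are invented here, and the join imitates what a three-pattern SPARQL query over the merged data would do.

```python
# The triples each site publishes, copied from the example above.
site_a = {("person:678AF", "foaf:knows", "person:67D32")}
site_b = {("person:678AF", "foaf:name", "Timogen Stohmas")}
site_c = {("person:67D32", "foaf:name", "Byron Briggs")}

def names(graph, uri):
    """All foaf:name values a graph holds for a URI."""
    return sorted(o for s, p, o in graph if s == uri and p == "foaf:name")

def who_knows_whom(graph):
    """Join foaf:knows with foaf:name at both ends of the relationship."""
    found = []
    for s, p, o in graph:
        if p == "foaf:knows":
            for knower in names(graph, s):
                for known in names(graph, o):
                    found.append((knower, known))
    return sorted(found)

merged = site_a | site_b | site_c

print(who_knows_whom(site_a))  # no names on its own
print(who_knows_whom(merged))  # the named fact only appears after merging
```

No single site yields a named pair; the fact only exists once someone merges all three.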

Support for SAMPLE in 4store

I’m told that 4store has implemented the SAMPLE feature from 1.1 along with the other aggregates, most of the Update operations, and most of the FILTER functions.

It didn’t work for me when I just tried it on our local copy, but that may be because we are on a stable rather than dev version, or possibly due to the PHP that sits between the public SPARQL endpoint and the 4store back-end-point.

UPDATE: it does work, I just haven’t had enough coffee.

Live Example

The following selects all buildings from our endpoint, and an example of something that each one contains. Most buildings contain many things, but this only lists one thing per building.

SELECT ?building ?building_name (SAMPLE(?inside_thing) AS ?an_inside_thing) ?inside_thing_name
WHERE {
 ?building a <http://vocab.deri.ie/rooms#Building> ; rdfs:label ?building_name .
 OPTIONAL { ?inside_thing spacerel:within ?building ; rdfs:label ?inside_thing_name }
}
GROUP BY ?building

Posted in 4store, RDF.



8 Responses


  1. Jakob says

    I bet there are better use cases of the “SAMPLE” function than hiding errors in your data. SAMPLE does not circumvent the garbage-in-garbage-out principle. It’s more useful when you query other people’s endpoints and if you prefer bad data over no data at all (which is often the right choice). You should also take care of other nasty stuff like blank nodes when you expect URI references, resources that turn out to be of another type than expected, etc. Most ontologies include some additional constraints (for instance SKOS) that in practice will be violated. A set of rules for writing safer SPARQL queries would be helpful. I guess that SAMPLE, FILTER, and LIMIT should be included in every query.

  2. Alexander Dutton says

    It’s not useless, honest guv! If you combine it with a GROUP BY you get your desired behaviour, though probably with a performance hit:

    SELECT ?person (SAMPLE(?n) as ?name) WHERE {
    ?person foaf:name ?n
    } GROUP BY ?person

  3. Alexander Dutton says

    To clarify, it’s an aggregation thing, much like SUM, COUNT and MAX. You could even return the number of names for each person so that you know when data has been omitted:

    SELECT ?person (SAMPLE(?n) as ?name) (COUNT(?n) as ?number_of_names) WHERE {
    ?person foaf:name ?n
    } GROUP BY ?person

    As an aside, COALESCE is another awesome function that returns the first of its bound arguments. You can use it with OPTIONAL patterns when you know that the data you want is either on foaf:name or rdfs:label, you just don’t know which.

  4. Christopher Gutteridge says

    COALESCE: sounds rather cool! I suspect people will use it rather than semantic reasoning…

    Jakob: in the case of two (valid) reference points for one building, getting either lat and either long is good enough. It’s not garbage data, it’s just about asking the question you meant to ask: “for each foo give me a single bar” rather than “for each foo give me all the bars”.

  5. Steve Harris says

    It’s a bit pedantic, but I should point out that SAMPLE (and co.) are Aggregates, not Functions. That is why they have the different behaviour, including triggering implicit grouping if you don’t use GROUP BY.

    Another neat aggregate is GROUP_CONCAT, e.g.

    SELECT ?person (GROUP_CONCAT(?name) AS ?names) WHERE { ?person :name ?name } GROUP BY ?person

    It lets you roll up multiple values into a single solution.

  6. Steve Harris says

    Also, I should point out the bug in 4store which lets you project non-GROUP BY variables out of a GROUP without using an aggregate. In effect, there’s an implicit SAMPLE on all the columns other than ?building in your example.

    A valid, equivalent SPARQL 1.1 query should be something like:

    SELECT ?building (SAMPLE(?building_name) AS ?a_building_name) (SAMPLE(?inside_thing) AS ?an_inside_thing) (SAMPLE(?inside_thing_name) AS ?an_inside_thing_name)
    WHERE {
    ?building a <http://vocab.deri.ie/rooms#Building> ; rdfs:label ?building_name .
    OPTIONAL { ?inside_thing spacerel:within ?building ; rdfs:label ?inside_thing_name }
    }
    GROUP BY ?building

    [untested]

  7. Jakob says

    Chris: Your data does not contain two lon/lat points per building but one point that turns out to have two lon and two lat values. Welcome to the open world assumption! Furthermore, geo:long and geo:lat are a simplification. If you want to be precise you need a blank “position” object:

    ?building has:position [ geo:long ?long; geo:lat ?lat ]

    But precision is not the only goal in practice. Simplification just has its price. The more examples like this I see (thanks for sharing them!), the more it looks like RDF/SPARQL gives you little advantage over plain RDBMS/SQL beyond the use of URIs and Unicode. And I bet that SPARQL implementations will soon be as diverse as SQL implementations.

  8. Christopher Gutteridge says

    If I added the concept of a point and said the building was near that point the results would be slightly less broken;
    building near point_A .
    point_A lat ‘0.1’ ; long ‘0.2’ .
    building near point_B .
    point_B lat ‘0.101’ ; long ‘0.201’ .

    querying for
    ?building near ?point .
    ?point lat ?lat ; long ?long .

    Would return only 2 rows, one per point, rather than multiplying out all the lats with all the longs.

    Ultimately, the best solution would be to have a geo-aware system and treat lat,long as a single literal.


