There are lots of features coming in SPARQL 1.1.
However, there’s one little one I’m really looking forward to: SAMPLE. It’s right at the bottom of the new aggregates (or “set functions”; Harris corrected me on the name in the comments).
UPDATE:
I misunderstood how SAMPLE works. Dave got it working on our local endpoint and it appears to just be the equivalent of doing LIMIT 1. That’s bloody useless.
SELECT (SAMPLE(?s) AS ?an_s) ?p ?o WHERE { ?s ?p ?o } LIMIT 10
I would have expected the above to return 10 rows, but with only a single row for each ?s, paired with the first ?p and ?o found for that ?s.
I’m hoping that this is a bug and not the intended implementation of SAMPLE, otherwise it’s utterly useless. Why bother if it’s semantically the same as LIMIT 1… turns out I didn’t RTFM, so GROUP BY….
It seems I was missing a GROUP BY in my examples.
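For the record, something along these lines is what I should have written. This is a sketch of the corrected query rather than one I’ve actually run, but grouping by ?s and sampling the other variables is the idea:

SELECT ?s (SAMPLE(?p) AS ?a_p) (SAMPLE(?o) AS ?an_o)
WHERE { ?s ?p ?o }
GROUP BY ?s
LIMIT 10

Note that the sampled ?p and ?o aren’t guaranteed to come from the same triple; SAMPLE just picks an arbitrary value from each group.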
Lat & Long & Lat & Long
When we were overhauling the way I create data for buildings in data.southampton, I accidentally left in the old lat & long list as well as the new one. That meant that some buildings had two lats and two longs. When I asked for a list of buildings with their lat and long, I got four results per building, as it gave every possible combination. So if a building had the following data:
building59 rdfs:label "My Building" ;
    geo:lat "0.100" ;
    geo:long "0.200" ;
    geo:lat "0.101" ;
    geo:long "0.202" .
The results of
SELECT ?label ?lat ?long
WHERE { ?b a <http://vocab.deri.ie/rooms#Building> ; rdfs:label ?label ; geo:lat ?lat ; geo:long ?long }
end up multiplied out:
?label      | ?lat  | ?long
------------|-------|------
My Building | 0.100 | 0.200
My Building | 0.101 | 0.200
My Building | 0.100 | 0.202
My Building | 0.101 | 0.202
All of which are, of course, true, but it’s not really what I wanted. The new SAMPLE feature will limit a field to a single value per group.
SELECT ?label (SAMPLE(?lat) AS ?a_lat) (SAMPLE(?long) AS ?a_long)
WHERE { ?b a <http://vocab.deri.ie/rooms#Building> ; rdfs:label ?label ; geo:lat ?lat ; geo:long ?long }
GROUP BY ?b ?label
What I’m hoping is that it’s one sample per group, not one sample from all the rows returned, so that there will still be one valid ?lat for each building row returned.
Bad URIs
A bigger cock-up I made recently was generating URIs for our phonebook by hashing the email address of the person. That seemed to work fine, but I didn’t notice that some people didn’t have an email address, so they all ended up with the same URI. This resulted in one URI having many (nearly 100) given names, family names, names and phone numbers. So if you just request all people with their given name, family name and phone number, then this one rogue URI (generated from an empty (“”) email address) has 100 of each, which returns every combination: 100 × 100 × 100, a million rows, which isn’t very helpful. I know that someone working with the data was getting out-of-memory problems!
Would SAMPLE fix this? Yes.
Isn’t that masking an error? Well, yes, but I’m a Perl programmer at heart. It’s more important to have a system that works than one you let keep breaking just so you can spot issues. Build a better way to spot issues, one that doesn’t inconvenience the users!
If the query had been for (SAMPLE(?given) AS ?a_given) (SAMPLE(?family) AS ?a_family) {….} GROUP BY ?person then it would still have had one weird record, but the day-to-day operation wouldn’t have broken. It would have taken longer to spot and fix the problem, but the system wouldn’t have been overloaded and breaky while we were unaware of it.
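To spell that out, here’s a sketch of the kind of query I mean. The foaf: predicates are just assumptions for illustration; the real phonebook data may well use different properties:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person (SAMPLE(?given) AS ?a_given) (SAMPLE(?family) AS ?a_family) (SAMPLE(?phone) AS ?a_phone)
WHERE {
  ?person foaf:givenName ?given ;
          foaf:familyName ?family ;
          foaf:phone ?phone .
}
GROUP BY ?person

The rogue URI then contributes one row with an arbitrary given name, family name and phone number, rather than a million combinations.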
But bad data is, er, bad!
If the semantic web/linked data/open data concepts are to work, then you’ve got to deal with the fact that there’s going to be bad data. If it’s so fragile that the services fall over every time someone gets URIs wrong, then it ain’t going to work, as people make mistakes all the time. Plan for it.
Anything consuming open data should be considering how to deal with all kinds of broken data:
- Typos in literals
- Factual errors (not the same as typos)
- Structural semantic errors, like many people incorrectly having the same URI. If I did it by accident, it’s likely to happen to other data sources now and then (see the sketch after this list).
- “Impossible” semantics. It’s likely you’ll have some facts in a big data-set that the ontology says are mutually exclusive. Beware RDF versions of the Bible.
- Simple malicious data such as incorrect literals, or false predicates
- Malicious semantics, where someone creates innocuous seeming triples which do something unexpected when combined with certain other datasets.
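On the structural-errors point above, and on building ways to spot issues that don’t inconvenience users, something like the following report would have caught our phonebook problem. Again, the foaf: predicate is an assumption; swap in whatever your data actually uses:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person (COUNT(DISTINCT ?given) AS ?given_names)
WHERE { ?person foaf:givenName ?given }
GROUP BY ?person
HAVING (COUNT(DISTINCT ?given) > 1)
ORDER BY DESC(?given_names)

Run now and then, that flags any URI which has picked up an implausible number of names, without breaking anything for people just querying the data.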
Malicious Semantics
…or, at the very least, sneaky semantics.
Hugh Glaser pointed this out to me; I’ll see if I can explain it:
Imagine a Judge has declared it illegal to make a certain fact public, specifically that two people know each other.
Site-A defines:
person:678AF foaf:knows person:67D32 .
Then Site-B defines:
person:678AF foaf:name "Timogen Stohmas" .
Then Site-C defines:
person:67D32 foaf:name "Byron Briggs" .
You can find out, from combining the three sources, that Timogen apparently knows Byron, but who let that cat out of the bag? Site-A did, assuming that they meant the URIs to mean what B & C claim, but could you prove that?
See Hugh’s examples at http://who.isthat.org/
Support for SAMPLE in 4store
I’m told that 4store has implemented the SAMPLE feature from 1.1 along with the other aggregates, most of the Update operations, and most of the FILTER functions.
It didn’t work for me when I just tried it on our local copy, but that may be because we are on a stable rather than dev version, or possibly due to the PHP that sits between the public SPARQL endpoint and the 4store back-end-point.
UPDATE: it does work, I just haven’t had enough coffee.
Live Example
The following selects all buildings from our endpoint, and an example of something that they contain. Most buildings contain many things, but this only lists one thing per building.
SELECT ?building ?building_name (SAMPLE(?inside_thing) AS ?an_inside_thing) (SAMPLE(?inside_thing_name) AS ?an_inside_thing_name)
WHERE {
  ?building a <http://vocab.deri.ie/rooms#Building> ; rdfs:label ?building_name .
  OPTIONAL { ?inside_thing spacerel:within ?building ; rdfs:label ?inside_thing_name }
}
GROUP BY ?building ?building_name
