# RDF / SPARQL debugging

While working on data.southampton.ac.uk, we inevitably run into problems with data failing to load into our triplestore, or with SPARQL queries failing to return in a reasonable time (or at all…).

I thought I’d share a few of the basic steps I take to find the root of these issues.

## RDF Import Problems

The main tool I use for debugging raw RDF when it fails to import into a store is rapper.

This is a small command line utility which uses the Raptor library to parse RDF in a number of formats (currently: rdfxml, ntriples, turtle, trig, rss-tag-soup, grddl, rdfa).

It’s extremely fast at parsing, so it’s a quick way of finding errors in a file.

I’ll usually run something like:

```shell
/usr/bin/rapper -i rdfxml input.rdf > /dev/null
```

This attempts to read input.rdf as RDF/XML, convert it to ntriples, then throw away the output.

Any syntax errors it comes across are reported on the command line. It usually takes seconds to run, even on large (well, ~4m RDF/XML triples large) datasets, and seems much more accurate at spotting mistakes than the web-based validators I’ve used.

## SPARQL Query Problems

Slow queries have accounted for the majority of the problems we’ve had so far.

The thing that isn’t immediately obvious is that a query which returns only a few items can still generate thousands of intermediate results under the hood.
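As a sketch of why (using the usual FOAF predicates for illustration): even a query with a LIMIT can force the store to compute a large join before trimming the result set.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Only 10 rows come back, but if a subject has many foaf:name and
# foaf:phone values, the store may still materialise every
# name x phone combination before the LIMIT is applied.
SELECT ?person ?name ?phone
WHERE {
  ?person foaf:name  ?name ;
          foaf:phone ?phone .
}
LIMIT 10
```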

So far, there have been two underlying causes for this:

1. Poorly optimised queries
2. Incorrect source data imported

Triplestore query optimisation isn’t perfect yet, so it’s always worth experimenting with the order of the parts of a query, rather than assuming an optimiser will do it for you.
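As a rough sketch of the kind of reordering that can help (the mailbox URI here is made up for illustration, and whether this matters at all depends on the store’s optimiser), putting the most selective pattern first means later patterns only join against a handful of bindings:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Can be slow on some stores: starts from every foaf:Person
# SELECT ?name WHERE {
#   ?person a foaf:Person .
#   ?person foaf:name ?name .
#   ?person foaf:mbox <mailto:someone@example.org> .
# }

# Often faster: the mbox pattern matches few subjects,
# so the remaining patterns join against a small set.
SELECT ?name WHERE {
  ?person foaf:mbox <mailto:someone@example.org> .
  ?person a foaf:Person .
  ?person foaf:name ?name .
}
```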

Incorrect (but valid) source data being imported into the store has caused us problems a few times. The main culprit has been large numbers of values mapped to a single predicate on a single subject.

One example was in our phonebook data, where a single person URI had several hundred foaf:name, foaf:mbox and foaf:phone values assigned to it.

The number of intermediate results for queries involving this URI exploded with each predicate added: with, say, 300 values for each of those three predicates, a query joining all three produces 300 × 300 × 300 = 27,000,000 rows for that one person.

The easiest way to debug these problems has been to:

1. Break each query down into atomic parts
2. Run a SPARQL COUNT on the first part
3. Add another part to the query
4. Run a SPARQL COUNT on the combined parts
5. Look for significant jumps between the counts from steps 2 and 4

So in the example of our phonebook data error, the initial query was something like:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT * WHERE {
  ?person a foaf:Person ;
          foaf:familyName ?family_name ;
          foaf:givenName ?given_name ;
          foaf:name ?name ;
          foaf:mbox ?email ;
          foaf:phone ?phone .
}
```

Which looked reasonable, until the query took up 10+ GB of RAM…

Debugging then started with:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT (COUNT(*) AS ?count) WHERE { ?person a foaf:Person . }
```

Which gave ~2000 results.

Looking just at foaf:name:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT (COUNT(*) AS ?count) WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
}
```

This gave ~70000 results rather than the expected ~2000. A quick look through the source data then showed the issue.
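A query along these lines can locate this kind of bad data directly (a sketch using SPARQL 1.1 aggregates, which not every store supported at the time; the threshold of 10 is arbitrary):

```sparql
# Find subjects with suspiciously many values for any one predicate
SELECT ?s ?p (COUNT(?o) AS ?values)
WHERE { ?s ?p ?o . }
GROUP BY ?s ?p
HAVING (COUNT(?o) > 10)
ORDER BY DESC(?values)
```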

It’s pretty simple really, but it took us a while to get a feel for how to debug query problems, and what to look out for.

I’ve also started monitoring 4store’s query log to look for slow queries, as this has been a great indicator of poor data or query structure.

