# RDF / SPARQL debugging

While working on data.southampton.ac.uk, we inevitably run into problems with data failing to load into our triplestore, or with SPARQL queries failing to return in a reasonable time (or at all…).

I thought I’d share a few of the basic steps I take to find the root of these issues.

## RDF Import Problems

The main tool I use for debugging raw RDF when it fails to import into a store is rapper.

This is a small command line utility which uses the Raptor library to parse RDF in a number of formats (currently: rdfxml, ntriples, turtle, trig, rss-tag-soup, grddl, rdfa).

It’s extremely fast at parsing, so it’s a quick way of finding errors in a file.

I’ll usually run something like:

```shell
/usr/bin/rapper -i rdfxml input.rdf > /dev/null
```

This attempts to read input.rdf as RDF/XML, convert it to ntriples, then throw away the output.

Any syntax errors it comes across are reported on the command line. It usually takes seconds to run, even on large (well, ~4m RDF/XML triples large) datasets, and seems much more accurate at spotting mistakes than the web-based validators I’ve used.

## SPARQL Query Problems

Slow queries have accounted for the majority of the problems we’ve had so far.

The thing that isn’t immediately obvious is that a query which returns only a few items can still generate thousands of intermediate results under the hood.
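As a sketch of why (using the usual FOAF predicates for illustration): even a query with a LIMIT can force the store to compute a large join before trimming the result set.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Only 10 rows come back, but if a subject has many foaf:name and
# foaf:phone values, the store may still materialise every
# name x phone combination before the LIMIT is applied.
SELECT ?person ?name ?phone
WHERE {
  ?person foaf:name  ?name ;
          foaf:phone ?phone .
}
LIMIT 10
```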

So far, there have been two underlying causes for this:

1. Poorly optimised queries
2. Incorrect source data imported

Triplestore query optimisation isn’t perfect yet, so it’s always worth experimenting with the order of the parts of a query, rather than assuming an optimiser will do it for you.
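As a rough sketch of the kind of reordering that can help (the mailbox URI here is made up for illustration, and whether this matters at all depends on the store’s optimiser), putting the most selective pattern first means later patterns only join against a handful of bindings:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Can be slow on some stores: starts from every foaf:Person
# SELECT ?name WHERE {
#   ?person a foaf:Person .
#   ?person foaf:name ?name .
#   ?person foaf:mbox <mailto:someone@example.org> .
# }

# Often faster: the mbox pattern matches few subjects,
# so the remaining patterns join against a small set.
SELECT ?name WHERE {
  ?person foaf:mbox <mailto:someone@example.org> .
  ?person a foaf:Person .
  ?person foaf:name ?name .
}
```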

Incorrect (but valid) source data being imported into the store has caused us problems a few times. The main culprit has been large numbers of values mapped to a single predicate on a single subject.

One example was in our phonebook data, where a single person URI had several hundred foaf:name, foaf:mbox and foaf:phone values assigned to it.

The number of intermediate results for queries involving this URI exploded with each predicate added: with, say, 300 values for each of those three predicates, a query joining all three produces 300 × 300 × 300 = 27,000,000 rows for that one person.

The easiest way to debug these problems has been to:

1. Break each query down into atomic parts
2. Run a SPARQL COUNT on the first part
3. Add another part to the query
4. Run a SPARQL COUNT on the combined parts
5. Look for significant jumps between the counts from steps 2 and 4

So in the example of our phonebook data error, the initial query was something like:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT * WHERE {
  ?person a foaf:Person ;
          foaf:familyName ?family_name ;
          foaf:givenName ?given_name ;
          foaf:name ?name ;
          foaf:mbox ?email ;
          foaf:phone ?phone .
}
```

Which looked reasonable, until the query took up 10+ GB of RAM…

Debugging then started with:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT (COUNT(*) AS ?count) WHERE { ?person a foaf:Person . }
```

Which gave ~2000 results.

Looking just at foaf:name:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT (COUNT(*) AS ?count) WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
}
```

This gave ~70000 results rather than the expected ~2000. A quick look through the source data then showed the issue.
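A query along these lines can locate this kind of bad data directly (a sketch using SPARQL 1.1 aggregates, which not every store supported at the time; the threshold of 10 is arbitrary):

```sparql
# Find subjects with suspiciously many values for any one predicate
SELECT ?s ?p (COUNT(?o) AS ?values)
WHERE { ?s ?p ?o . }
GROUP BY ?s ?p
HAVING (COUNT(?o) > 10)
ORDER BY DESC(?values)
```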

It’s pretty simple really, but it took us a while to get a feel for how to debug query problems, and what to look out for.

I’ve also started monitoring 4store’s query log to look for slow queries, as this has been a great indicator of poor data or query structure.

