{"id":658,"date":"2011-03-02T17:12:20","date_gmt":"2011-03-02T17:12:20","guid":{"rendered":"http:\/\/blog.soton.ac.uk\/webteam\/?p=658"},"modified":"2011-03-02T17:13:11","modified_gmt":"2011-03-02T17:13:11","slug":"rdf-sparql-debugging","status":"publish","type":"post","link":"https:\/\/blog.soton.ac.uk\/webteam\/2011\/03\/02\/rdf-sparql-debugging\/","title":{"rendered":"RDF \/ SPARQL debugging"},"content":{"rendered":"<p>While working on <a href=\"http:\/\/data.southampton.ac.uk\/\">data.southampton.ac.uk<\/a>, we run into inevitable problems with data failing to load into our triplestore, or with SPARQL queries failing to return in a reasonable time (or at all&#8230;).<\/p>\n<p>I thought I&#8217;d share a few of the basic steps I take to find the root of these issues.<\/p>\n<h2>RDF Import Problems<\/h2>\n<p>The main tool I use for debugging raw RDF when it fails to import into a store is <a href=\"http:\/\/librdf.org\/raptor\/rapper.html\">rapper<\/a>.<\/p>\n<p>This is a small command line utility which uses the <a href=\"http:\/\/librdf.org\/raptor\/\">Raptor<\/a> library to parse RDF in a number of formats (currently: rdfxml, ntriples, turtle, trig, rss-tag-soup, grddl, rdfa).<\/p>\n<p>It&#8217;s extremely fast at parsing, so it&#8217;s a quick way of finding errors in a file.<\/p>\n<p>I&#8217;ll usually run something like:<br \/>\n<code>\/usr\/bin\/rapper -i rdfxml input.rdf &gt; \/dev\/null<\/code><\/p>\n<p>This attempts to read input.rdf as RDF\/XML, convert it to ntriples, then throw away the output.<\/p>\n<p>Any syntax errors it comes across are reported on the command line.  
It usually takes seconds to run, even on large (well, ~4m RDF\/XML triples large) datasets, and seems much more accurate in spotting mistakes than web-based validators I&#8217;ve used.<\/p>\n<h2>SPARQL Query Problems<\/h2>\n<p>Slow queries have accounted for the majority of the problems we&#8217;ve had so far.<\/p>\n<p>The thing that isn&#8217;t immediately obvious is that a query which returns a few items can still be generating thousands of intermediate results under the hood.<\/p>\n<p>So far, there have been two underlying causes for this:<\/p>\n<ol>\n<li>Poorly optimised queries<\/li>\n<li>Incorrect source data imported<\/li>\n<\/ol>\n<p>Triplestore query optimisation isn&#8217;t perfect yet, so it&#8217;s always worth playing around with changing the order of parts of queries, rather than assuming an optimiser will do it.<\/p>\n<p>Incorrect (but valid) source data being imported into the store has caused us problems a few times.  The main culprit has been where large numbers of values have been mapped to a single predicate for a single subject.<\/p>\n<p>One example was in our <a href=\"http:\/\/data.southampton.ac.uk\/dataset\/phonebook.html\">phonebook data<\/a>, where a single person URI had several hundred foaf:name, foaf:mbox and foaf:phone values assigned to it.<\/p>\n<p>The number of results for queries involving this URI quickly exploded for each predicate added to a query.<\/p>\n<p>The easiest way to debug these problems has been to:<\/p>\n<ol>\n<li>Break each query down into atomic parts<\/li>\n<li>Run a SPARQL COUNT on the first part<\/li>\n<li>Add another part to the query<\/li>\n<li>Run a SPARQL COUNT on the combined parts<\/li>\n<li>Look for significant changes between 2. 
and 4.<\/li>\n<\/ol>\n<p>So in the example of our phonebook data error, the initial query was something like:<br \/>\n<code>SELECT * WHERE {<br \/>\n  ?person a foaf:Person ;<br \/>\n          foaf:familyName ?family_name ;<br \/>\n          foaf:givenName ?given_name ;<br \/>\n          foaf:name ?name ;<br \/>\n          foaf:mbox ?email ;<br \/>\n          foaf:phone ?phone .<br \/>\n}<\/code><\/p>\n<p>This looked reasonable, until the query took up 10+ GB of RAM&#8230;<\/p>\n<p>Debugging then started with:<br \/>\n<code>SELECT (COUNT(*) AS ?count) WHERE {<br \/>\n    ?person a foaf:Person .<br \/>\n}<\/code><\/p>\n<p>This gave a count of ~2000.<\/p>\n<p>Looking just at foaf:name:<br \/>\n<code>SELECT (COUNT(*) AS ?count) WHERE {<br \/>\n    ?person a foaf:Person .<br \/>\n    ?person foaf:name ?name .<br \/>\n}<\/code><\/p>\n<p>This gave a count of ~70000 rather than the expected ~2000.  A quick look through the source data then showed the issue.<\/p>\n<p>It&#8217;s pretty simple really, but it took us a while to get a feel for how to debug query problems, and what to look out for.<\/p>\n<p>I&#8217;ve also started monitoring <a href=\"http:\/\/4store.org\/\">4store<\/a>&#8217;s query log to look for slow queries, as this has been a great indicator of poor data or query structure.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While working on data.southampton.ac.uk, we run into inevitable problems with data failing to load into our triplestore, or with SPARQL queries failing to return in a reasonable time (or at all&#8230;). I thought I&#8217;d share a few of the basic steps I take to find the root of these issues. 
RDF Import Problems The main [&hellip;]<\/p>\n","protected":false},"author":6,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[136,411,4227],"tags":[],"class_list":["post-658","post","type-post","status-publish","format-standard","hentry","category-rdf","category-sparql","category-triplestore"],"_links":{"self":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/658","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/comments?post=658"}],"version-history":[{"count":9,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/658\/revisions"}],"predecessor-version":[{"id":667,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/658\/revisions\/667"}],"wp:attachment":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/media?parent=658"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/categories?post=658"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/tags?post=658"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}