

Filtering and Preprocessing 3rd Party Triples

The data in data.southampton.ac.uk comes from a number of places, but here's the gist:

  • Data maintained by me (some I hope to hand over to other people in time)
  • Tabular data maintained by a professional service (e.g. a Google spreadsheet maintained by Catering)
  • Data dumped from a university database (phonebook, eprints, courses etc.)
  • Screen Scraped data (the bus data is scraped from council pages, with permission)
  • Photographs
  • Tabular data maintained by volunteers (e.g. amenities data not yet 'owned' by one of the professional services)

While I trust the volunteers not to do anything too silly, they are also constrained by the tool I use to import their data. They can only make certain kinds of statement about certain kinds of thing. There’s still scope for malicious activity, but it’s constrained.

Raw RDF

However, the university Open Wireless club (SOWN) has asked me to import their own RDF feed. This is a difficult decision: their feed describes wifi nodes of real interest to our members, but if I were to import their data as-is then a hypothetical malicious member, or a 3rd party exploiting lax security on their website, could add arbitrary triples.

  1. One option is to refuse to accept the RDF data and ask that they provide tabular data (in many ways this is a good option, and may be the default)
  2. Just allow it, keep an eye on it, and revoke the privilege if it's abused (I'm not sure this approach would scale to 50 student clubs)
  3. Have a second SPARQL store to contain less trusted data, or clearly mark the graphs to indicate how much control we have over the content (a good idea, but the problem is that we want to display some of this data via our own pages)
  4. Manually check the data on import (does not scale!)
  5. Automatically filter their data on import to only allow certain “shapes” of graph.

I really like (5) but (1) is the cheapest and most sustainable.

Filtering Triples

A quick discussion on how to filter a graph suggested that a SPARQL "CONSTRUCT" query does exactly this. We would take their triples, run an agreed CONSTRUCT query over them, and import the resulting triples. This can restrict URIs to certain namespaces and allow only the predicates we choose. It can also construct and alter triples if needed.

So what I need is a command-line tool which takes RDF data as input, one or more SPARQL "CONSTRUCT" queries as configuration, and outputs RDF data.

Ideally a stand-alone command-line script; realistically, I'm thinking it may need a spare triple store running.
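Even a crude stand-in conveys the shape of such a tool: read N-Triples on one side, keep only statements whose predicate is on an agreed whitelist, and emit N-Triples on the other. The sketch below is deliberately much weaker than CONSTRUCT (it can't match graph shapes or rewrite triples), its parsing is naive line-splitting rather than a real N-Triples parser, and the URIs are invented for illustration.

```python
# Crude stand-in for the filtering tool: keep only N-Triples statements
# whose predicate is on a whitelist. Naive parsing -- assumes one
# well-formed triple per line. A real tool would apply CONSTRUCT queries.

ALLOWED_PREDICATES = {
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>",
    "<http://www.w3.org/2003/01/geo/wgs84_pos#lat>",
    "<http://www.w3.org/2003/01/geo/wgs84_pos#long>",
}

def filter_ntriples(lines):
    """Yield only lines whose predicate (the second term) is whitelisted."""
    for line in lines:
        parts = line.split(None, 2)  # subject, predicate, rest of statement
        if len(parts) == 3 and parts[1] in ALLOWED_PREDICATES:
            yield line

sample = [
    '<http://id.sown.org.uk/node/1> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "50.93" .\n',
    '<http://id.sown.org.uk/node/1> <http://xmlns.com/foaf/0.1/maker> <http://example.org/mallory> .\n',
]
for kept in filter_ntriples(sample):
    print(kept, end="")  # only the geo:lat statement survives
```

Wired to stdin/stdout, this would slot into a pipeline between fetching a feed and loading the store, which is roughly the shape the real CONSTRUCT-based tool would take.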

Ideas?

Posted in Best Practice, Command Line, Data, RDF, Triplestore.


2 Responses


  1. Carsten says

    Sounds like a use case for tSPARQL – have you thought about using that?

    • Christopher Gutteridge says

      I’ve just played with the bin/sparql command on Jena which seems to do exactly what I want. (I’m not really very up on Jena yet).

      Actually modelling trust is rather beyond the scope of what I want to do — but would make sense as a drop-in replacement for pure SPARQL as these systems get more complex.


