Southampton Web and Data Innovation Team

Ideas and Tips from the Team

Filtering and Preprocessing 3rd Party Triples

The data in data.southampton.ac.uk comes from a number of places, but here’s the gist:

Data maintained by me (some I hope to hand over to other people in time)
Tabular data maintained by a professional service (e.g. a Google spreadsheet maintained by catering)
Data dumped from a university database (phonebook, eprints, courses etc.)
Screen Scraped data (the bus data is scraped from council pages, with permission)
Photographs
Tabular data maintained by volunteers (e.g. amenities data not yet ‘owned’ by one of the professional services)

While I trust the volunteers not to do anything too silly, they are also constrained by the tool I use to import their data. They can only make certain kinds of statement about certain kinds of thing. There’s still scope for malicious activity, but it’s constrained.

Raw RDF

However, the university Open Wireless club (SOWN) has asked me to import their own RDF feed. This is a difficult decision: the feed describes wifi nodes of much interest to our members, but if I were to import their data as-is then a hypothetical malicious member, or a 3rd party exploiting lax security on their website, could add arbitrary triples.

1. Refuse to accept the RDF data and ask that they provide tabular data instead. (In many ways this is a good option, and may be the default.)
2. Just allow it, keep an eye on it, and revoke the privilege if necessary. (I’m not sure this approach would scale to 50 student clubs.)
3. Have a second SPARQL store to contain less trusted data, or clearly mark the graphs to indicate how much control we have over the content. (A good idea, but the problem is that we want to display some of this data via our pages.)
4. Manually check the data on import. (Does not scale!)
5. Automatically filter their data on import to only allow certain “shapes” of graph.

I really like (5) but (1) is the cheapest and most sustainable.

Filtering Triples

A quick discussion on how to filter a graph suggested that a SPARQL “CONSTRUCT” statement does exactly this. We should take their triples, run an agreed CONSTRUCT query on this data and import the resulting triples. This can filter URIs to be in certain namespaces, and only allow predicates we choose. It can also construct and alter triples if needed.

So what I need is a command-line tool which takes RDF data as an input, one or more SPARQL “CONSTRUCT” queries as a configuration and outputs RDF data.

Ideally a stand-alone command line script. Realistically I’m thinking it may need a spare triple store running.

Ideas?

Posted in Best Practice, Command Line, RDF, Triplestore.

Tagged with Data.

By Christopher Gutteridge – March 29, 2011

2 Responses

Carsten says

Sounds like a use case for tSPARQL – have you thought about using that?

March 29, 2011, 2:38 pm
- Christopher Gutteridge says
  
  I’ve just played with the bin/sparql command on Jena which seems to do exactly what I want. (I’m not really very up on Jena yet).
  
  Actually modelling trust is rather beyond the scope of what I want to do, but tSPARQL would make sense as a drop-in replacement for pure SPARQL as these systems get more complex.
  
  March 29, 2011, 4:14 pm
