Keeping our RDF updated and secure

A brief post on how we’re keeping our RDF accurate and up to date:

We’ve had our infrastructure data available as RDF for a while now (thanks to Nick Gibbins et al.), and have recently started pushing our datasets into a triplestore.

As time goes on, we’re consuming and re-using more and more of our RDF, with a plan to make our SPARQL endpoint public. This means that it’s increasingly important that the RDF we make available is up to date and an accurate representation of the data it’s generated from.

Keeping our RDF updated

This is the easy part. Our underlying data is mostly in MSSQL/MySQL databases. Each RDF document (e.g. http://rdf.ecs.soton.ac.uk/person/60) is built dynamically each time the URL is requested (with some caching).

Keeping our Triplestore updated

All triples from each RDF document are stored in a named graph. We’re using the URL of the RDF document as the graph name (so triples read from http://rdf.ecs.soton.ac.uk/person/60 would be stored in the named graph http://rdf.ecs.soton.ac.uk/person/60).

Every time an RDF document is requested (and once per night), a hash of the data is calculated, and stored in a database, along with the time the hash was calculated.
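The hashing step could be sketched roughly as follows. This is a minimal illustration, not the actual code: SHA-256, the SQLite database, the `doc_hash` table and the `record_hash` function are all assumptions made for the example.

```python
import hashlib
import sqlite3
import time


def record_hash(db, url, body, now=None):
    """Hash an RDF document body and store the hash with its calculation time.

    Hypothetical helper: the table layout and choice of SHA-256 are
    assumptions for illustration only.
    """
    now = time.time() if now is None else now
    digest = hashlib.sha256(body).hexdigest()
    db.execute(
        "CREATE TABLE IF NOT EXISTS doc_hash "
        "(url TEXT PRIMARY KEY, hash TEXT NOT NULL, calculated_at REAL NOT NULL)"
    )
    # One row per document URL; a newer hash overwrites the older one.
    db.execute(
        "INSERT OR REPLACE INTO doc_hash (url, hash, calculated_at) VALUES (?, ?, ?)",
        (url, digest, now),
    )
    db.commit()
    return digest
```

Calling `record_hash(db, url, body)` from both the request handler and the nightly job would keep one current hash per document URL.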

A set of scripts running on the same host as the triplestore queries the hash database periodically, and re-imports the triples from any RDF documents which have changed since the last check.
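One way to decide what needs re-importing is to compare the current hashes against the hashes seen at the last import. The sketch below assumes both are available as plain dictionaries keyed by document URL. `reimport` is a hypothetical callback that would replace the named graph for that URL in the triplestore.

```python
def changed_documents(current, imported):
    """Return URLs whose current hash differs from the hash last imported."""
    return sorted(
        url for url, digest in current.items() if imported.get(url) != digest
    )


def sync(current, imported, reimport):
    """Re-import every changed document and remember its new hash.

    `reimport` is a hypothetical callable taking the document URL; in
    practice it would reload the named graph with that name.
    """
    refreshed = []
    for url in changed_documents(current, imported):
        reimport(url)
        imported[url] = current[url]
        refreshed.append(url)
    return refreshed
```

Documents that have never been imported have no entry in `imported`, so they are picked up on the first run.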

We’re also slowly moving towards live updates. Many of our systems have a single point of change (e.g. a person’s profile pages are edited from a single form), so we’re planning to insert hooks into these that trigger a refresh of the data into the triplestore.

Keeping our Data Safe

An ongoing worry is that private information could get into our triplestore, or be exposed on our RDF pages. Once incorrect or private data gets out, it would be difficult to remove, especially if it has already been copied into external datasets.

As well as having plenty of checks throughout code, we’ve also added monitoring for this to our Nagios IT monitoring system.

At regular intervals, we build a list of anyone in the School who shouldn’t have information about themselves visible. This is then used to build a number of SPARQL queries, which are executed to ensure the data hasn’t made it into our triplestore.
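As an illustration of the query-building step, the sketch below generates one SPARQL ASK query per person. It assumes each person is identified by a URI and that "no information visible" means that URI appears in no triple in any named graph. The function name and the example URI are hypothetical, and the real checks may well be more specific.

```python
# Characters that may not appear inside a SPARQL IRI reference (<...>).
_FORBIDDEN = set('<>"{}|^`\\')


def build_ask_query(uri):
    """Build an ASK query that is true if `uri` appears in any named graph.

    Hypothetical helper: it checks for the URI as subject or object only.
    """
    if any(ch in _FORBIDDEN or ord(ch) <= 0x20 for ch in uri):
        raise ValueError("not safe to embed as a SPARQL IRI: %r" % uri)
    return (
        "ASK { GRAPH ?g { "
        "{ <%(u)s> ?p ?o } UNION { ?s ?p <%(u)s> } "
        "} }" % {"u": uri}
    )


# Example with a made-up URI:
# build_ask_query("http://example.org/person/1")
```

Validating the URI before embedding it keeps a malformed entry in the list from silently changing the meaning of the query.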

If something is found, we log the date it was first noticed. If it isn’t corrected within a reasonable timeframe (currently 24 hours or so), our Nagios system sends out warning emails. (The 24-hour delay allows for the lag between the data changing at source and the triplestore picking up that change.)
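The grace-period logic can be sketched as a small state update. This assumes the only state kept per check is the time the problem was first noticed. The names `GRACE` and `update_leak_state` are hypothetical, and the 24-hour value comes from the figure quoted above.

```python
from datetime import datetime, timedelta

# Assumed grace period, matching the "24 hours or so" mentioned above.
GRACE = timedelta(hours=24)


def update_leak_state(first_seen, leaked_now, now):
    """Return (new_first_seen, should_alert) for one monitored check.

    first_seen: datetime the problem was first noticed, or None.
    leaked_now: True if the latest query found private data.
    """
    if not leaked_now:
        # Problem fixed (or never present): clear the recorded date.
        return None, False
    if first_seen is None:
        # First sighting: record the date, but do not alert yet.
        return now, False
    return first_seen, (now - first_seen) >= GRACE
```

In a Nagios plugin, `should_alert` would typically be mapped to a non-zero exit status such as WARNING.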

Further Checks

As we deal with more (and larger) datasets, I can see a use for setting up additional checks like this – monitoring our data is just as important as monitoring our systems infrastructure in a lot of cases.

SPARQL (along with multiple datasets kept in a triplestore) provides a convenient (and consistent) way of effectively querying across multiple data sources.

A few ideas for future checks include:

  • Data integrity checks (e.g. check each module has 1+ teachers)
  • Spelling checks (e.g. check any literals for common misspellings)
  • Link checks (e.g. check any URL/URI used actually resolves)

Unit tests for data?
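To make the first idea concrete, here is a rough sketch of a data integrity check over a set of triples, written as plain Python so it runs without a triplestore. The vocabulary (`MODULE`, `TEACHES`) is entirely hypothetical. A real check would more likely be a SPARQL query against the store.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical vocabulary terms, for illustration only.
MODULE = "http://example.org/vocab#Module"
TEACHES = "http://example.org/vocab#teaches"


def modules_without_teachers(triples):
    """Return module URIs that nobody is recorded as teaching.

    `triples` is an iterable of (subject, predicate, object) strings.
    """
    triples = list(triples)
    modules = {s for s, p, o in triples if p == RDF_TYPE and o == MODULE}
    taught = {o for s, p, o in triples if p == TEACHES}
    return sorted(modules - taught)
```

An empty result means the check passes; anything else is a list of modules to investigate.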
