I’ve been working on getting a SPARQL endpoint and triplestore up and running to support the University’s forthcoming open data releases.
At the same time, I’ve been migrating our data at ECS from its current store (ARC2’s native store, built on top of MySQL), due to several problems which cropped up.
Experiences with ARC2’s store
There were a few reasons I went with this store in the first place:
- Incredibly easy to set up (drop the files in a directory on the web server, point the config file at MySQL)
- Support for aggregates (e.g. COUNT)
- Support for SPARQL UPDATE/DELETE
- Transactional updates (using InnoDB tables)
- Full-text indexing on literals (provided by MySQL)
For the number of triples we were working with (around a million), imports took an acceptable amount of time (roughly 6 minutes for the full million), and queries were more than fast enough (under 0.02s for the majority of queries). Longer term though, we ran into two problems with the store:
1) Each triple is assigned an auto-incremented ID in the corresponding MySQL table. As large numbers of triples are removed and reinserted (e.g. when reimporting a dataset), the maximum value of this ID creeps ever upwards, until MySQL runs out of values for it (since it won't re-use keys by default).
2) SPARQL queries are converted to MySQL queries internally. As we started using more complex queries, we ran into query patterns which couldn't be converted to SQL JOINs, and simply failed.
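For anyone in a similar position, here's a rough way of keeping an eye on problem 1) before it bites. The database and table names below ('triplestore', 'arc_triple') are guesses at a default ARC2 setup rather than anything canonical, and MySQLdb is just one way to run the query:

```python
# Rough sketch: check how close the triple table's AUTO_INCREMENT counter
# is to its ceiling. Database/table names are assumptions about a default
# ARC2 install; adjust for your own setup.
import MySQLdb

UNSIGNED_INT_MAX = 2 ** 32 - 1  # ceiling for an unsigned INT column

conn = MySQLdb.connect(host="localhost", user="rdf", passwd="secret", db="triplestore")
cursor = conn.cursor()
cursor.execute(
    "SELECT AUTO_INCREMENT FROM information_schema.TABLES "
    "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s",
    ("triplestore", "arc_triple"),
)
next_id = cursor.fetchone()[0]
print("Next ID: %d (%.1f%% of the key space used)"
      % (next_id, 100.0 * next_id / UNSIGNED_INT_MAX))
conn.close()
```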
Moving to 4store
For both the University’s and ECS’s stores, we chose to go with 4store. We plan to make our tools as store-agnostic as possible though, and will carry on evaluating its suitability as we go along.
One key factor in this choice is that it’s a big dumb triple store (their words, not mine!). We just wanted a place to dump triples, and a way of getting them back again. It comes with its own HTTP server, and allows imports/deletes to be performed through that.
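As a rough illustration of what that looks like (the port, graph URI and filename here are just placeholders), loading and removing a named graph through 4store's HTTP server is a single request each:

```python
# Minimal sketch of pushing a Turtle file into 4store over HTTP.
# A PUT to /data/<graph-uri> replaces that named graph; a DELETE
# against the same URL removes it. Port, graph URI and filename
# are placeholders for illustration.
import requests

ENDPOINT = "http://localhost:8000"
GRAPH = "http://data.example.org/dataset/news"

with open("news.ttl", "rb") as f:
    response = requests.put(
        "%s/data/%s" % (ENDPOINT, GRAPH),
        data=f.read(),
        headers={"Content-Type": "application/x-turtle"},
    )
print(response.status_code)

# To drop the graph again:
# requests.delete("%s/data/%s" % (ENDPOINT, GRAPH))
```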
Quite a lot of the stores out there come as part of a larger RDF framework (many of which are Java-based). We’re not really a Java-based shop here, so having something standalone (as opposed to, say, running under Tomcat) was a consideration.
Installation was pretty straightforward, and allowed us to run multiple stores on the same machine, with a SPARQL endpoint for each running on a different port. It’s also been used for various research projects within the University, so there’s some existing knowledge of it to call on for support.
Getting data into the store(s)
As the number of datasets around the University grows, we’ll need a way of managing updates into the store (or a variety of stores).
While we plan to maintain a central catalogue of datasets, these will all be updated at different frequencies. For some, I envisage that we’ll just perform a weekly/monthly/yearly import. For others, we’d want to update the store whenever they changed.
To achieve this, I’ve set up a simple web application that allows anything to ask the store to update a dataset: an application sends an HTTP POST to a page, containing the name of the store to update and a list of dataset URLs the store should update from.
This means that for frequently changing datasets (e.g. a news feed), a client can trigger an update whenever it’s needed, rather than waiting for the server to next schedule an update.
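As a sketch of the kind of request involved (the service URL and field names here are made up for illustration, not the actual application's API):

```python
# Hypothetical client-side trigger: ask the update service to refresh
# a couple of datasets in a named store. URL and field names are
# illustrative assumptions only.
import requests

response = requests.post(
    "http://data.example.org/update-queue",
    data={
        "store": "ecs",
        "datasets": [
            "http://example.org/datasets/news.rdf",
            "http://example.org/datasets/people.rdf",
        ],
    },
)
print(response.status_code)
```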
These requests are added to an ordered set (which works as a FIFO queue that discards duplicates), so that they can be imported in the order they come in.
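A minimal sketch of those semantics (the real queue lives outside the web app so the importer can see it, but the behaviour it needs is just this):

```python
# Ordered set acting as a FIFO queue: items come out in the order they
# were first added, and re-adding an item already queued is a no-op.
from collections import OrderedDict

class OrderedSetQueue(object):
    def __init__(self):
        self._items = OrderedDict()

    def push(self, item):
        # Duplicates are discarded; the original position is kept.
        self._items.setdefault(item, None)

    def pop(self):
        # FIFO: remove and return the oldest item.
        item, _ = self._items.popitem(last=False)
        return item

    def __len__(self):
        return len(self._items)

queue = OrderedSetQueue()
queue.push("http://example.org/datasets/news.rdf")
queue.push("http://example.org/datasets/people.rdf")
queue.push("http://example.org/datasets/news.rdf")  # duplicate, ignored
print(len(queue))   # 2
print(queue.pop())  # http://example.org/datasets/news.rdf
```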
The server hosting the store runs a small Python script as a daemon, which watches the queue and pulls dataset URLs to import from it.
It then starts up a number of worker processes to deal with each dataset URL it receives: one set downloads data from each source, another imports data into the store, and another deals with deletions from the store.
The multiprocess route avoids the bottleneck of downloading large datasets (the largest dataset we import takes around 40 minutes to download), and allows smaller downloads and imports to take place while this is happening.
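A stripped-down sketch of that layout (pool sizes, helper names and the handling of deletions and retries are all simplified here; the real daemon does rather more):

```python
# One pool fetches dataset URLs to local files while another pushes
# finished downloads into the store, so a slow multi-hour download
# doesn't hold up the smaller imports. Helper bodies are placeholders.
import multiprocessing
import os

def download(url):
    """Fetch one dataset URL to a local file; return the file's path."""
    path = os.path.join("/tmp", os.path.basename(url))
    # ... fetch url into path (urllib, requests, etc.) ...
    return path

def import_into_store(path):
    """Load one downloaded file into the triplestore."""
    # ... e.g. an HTTP PUT to the store's /data/ endpoint, as above ...
    pass

if __name__ == "__main__":
    urls = [
        "http://example.org/datasets/news.rdf",
        "http://example.org/datasets/buildings.rdf",
    ]
    downloads = multiprocessing.Pool(processes=4)
    imports = multiprocessing.Pool(processes=2)

    # Hand each file to an import worker as soon as its download
    # finishes, rather than waiting for the whole batch.
    for path in downloads.imap_unordered(download, urls):
        imports.apply_async(import_into_store, (path,))

    downloads.close()
    downloads.join()
    imports.close()
    imports.join()
```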
Abstracting away the download and import of data (as opposed to using cron, for example) means we can also log the frequency of updates and errors, track the size of datasets over time, and retry failed downloads.
We’ll be having an invite-only open data hackday here soon, which will hopefully serve as a good initial test of things.