Some academics have urged me to consider using a triple store as a back-end for some of our websites, as opposed to our normal MySQL approach. I’m not convinced, but it’s an interesting challenge. I started by looking into the common “patterns” in which we use SQL, which we would either need to replicate in RDF or replace by changing how we approach the problems in the first place.
Our usual MySQL “patterns”
- Create a record from <object>
We create a record, which is in effect a serialisation of an object. Most often it represents a human, an account, an event, an organisation or an article (text + metadata). We use the database to generate a unique key for the item in the current context, generally an integer. In MySQL we use AUTO_INCREMENT for this, but every SQL DB has a variant (a minimal sketch follows this list).
- Delete record with <ID>
- Update record with <ID> to match <object>
- Retrieve record with <ID>
- Find/retrieve records matching <criteria>
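For reference, the “create” pattern in MySQL looks something like this (a minimal sketch; the person table and its columns are made up for illustration):

CREATE TABLE person (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL
);

INSERT INTO person (name) VALUES ('Alice');
SELECT LAST_INSERT_ID(); -- the unique integer key the database just generated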
Update can reasonably be abstracted to “delete then create”, so let’s ignore it.
“Find” and “retrieve” require some new techniques, but are not a big concern.
My current understanding is that when adding a set of triples you can say that they are all part of a “graph” with URI <x>, and later you can remove or replace all the triples in graph <x>.
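In SPARQL 1.1 Update terms, replacing a record would look something like this (a sketch; the graph URI and the triple are hypothetical):

CLEAR SILENT GRAPH <http://webscience.org/graph/person/6> ;  # drop the old serialisation, if any
INSERT DATA {
  GRAPH <http://webscience.org/graph/person/6> {
    <http://webscience.org/person/6> <http://xmlns.com/foaf/0.1/name> "Alice" .
  }
}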
The one thing entirely missing is the ability to generate new integers in a sequence.
I’ve been given two suggested solutions by the experts…
UUID
Suggestion one: use UUIDs (universally unique identifiers) or hashes. The problem is that I want to use these in URLs and URIs, and I want http://webscience.org/person/6.html, not http://webscience.org/person/e530d212-0ff1-11df-8660-003048bd79d6.html
Flock a file
A second suggestion was to flock a local file containing the next value (lock file, read file, update file, unlock file). This would work, but I want the current position in each sequence to be stored with my other data, and accessed/modified using the same rights as those that can read/write the triple store. That’s what I’m used to with MySQL.
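For what it’s worth, that suggestion is only a few lines of code (a sketch in Python; the file path is made up):

import fcntl

def next_in_sequence(path="/var/data/people_sequence"):
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # lock file (blocks until the lock is held)
        try:
            value = int(f.read().strip() or "0") + 1  # read file
            f.seek(0)
            f.truncate()
            f.write(str(value))  # update file
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)  # unlock file
    return value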
My Idea 1: Sequence Service
My first idea was to create a stand-alone service which could run on the same server as the triple store, and which you could query, via HTTP or the command line, for the next integer in a sequence. Sequences could be identified via a URI.
http://dbserver:12345/next_in_sequence?seq=http://webscience.org/people_sequence
Which would return “23”, then “24”, and so on. The locking code could be handled in the sequence server, and the assumption would be that only trusted parties could connect (as with SPARQL). This service could work by the following steps (a rough code sketch follows the list):
- Locking (all requests processed sequentially)
- Querying the triple store for <http://webscience.org/people_sequence> <currentValue> ?value
- Replacing the triple with ?value+1
- Unlocking
- Returning ?value+1
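A rough sketch of that core logic in Python, assuming the store speaks the standard SPARQL 1.1 protocol with JSON results (the endpoint URLs are hypothetical, and the <currentValue> predicate is the placeholder from the steps above):

import json, threading, urllib.parse, urllib.request

QUERY_URL = "http://dbserver:8080/sparql"   # hypothetical query endpoint
UPDATE_URL = "http://dbserver:8080/update"  # hypothetical update endpoint
lock = threading.Lock()                     # all requests processed sequentially

def _post(url, key, text, accept=None):
    # POST a form-encoded SPARQL query/update, per the SPARQL protocol
    req = urllib.request.Request(url, urllib.parse.urlencode({key: text}).encode())
    if accept:
        req.add_header("Accept", accept)
    return urllib.request.urlopen(req).read().decode()

def next_in_sequence(seq):
    with lock:  # 1. lock
        # 2. query the triple store for the sequence's current value
        result = json.loads(_post(QUERY_URL, "query",
            "SELECT ?value WHERE { <%s> <currentValue> ?value }" % seq,
            accept="application/sparql-results+json"))
        value = int(result["results"]["bindings"][0]["value"]["value"])
        # 3. replace the triple with ?value+1
        _post(UPDATE_URL, "update",
            "DELETE DATA { <%s> <currentValue> %d } ; " % (seq, value)
            + "INSERT DATA { <%s> <currentValue> %d }" % (seq, value + 1))
    # 4. unlock (by leaving the with-block) and 5. return ?value+1
    return value + 1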
While this is a bit icky, it does mean that my data remains stored in one place, including the state of each sequence.
What this doesn’t do is provide one access point. All SQL implementations solve this inside the database itself, and I suspect that, long term, triple stores will too. But I can’t see the purists liking it going through the same access interface, as it’s clearly a hack.
Non technical concerns with RDF back-ends
On a non-technical note, I’m also concerned that an RDF+PHP solution is not very maintainable. You can’t easily hire someone with these skills yet.
We’ve not had any problems with the maintainability or skills, for lots of reasons:
You work in a University, hire graduates! SPARQL is way easier than SQL, so I bet most graduates come out about as good in either, even though they do more SQL.
There’s a problem with people who know SPARQL being snapped up by companies, but I think the number of people who know their stuff is growing faster than the demand for those skills.
In my experience, people who’ve “done a bit of SQL” often haven’t a clue what they’re really doing anyway. SQL has too much rope.
Alternatively, hire smart people and train them. We don’t require SPARQL or RDF skills (just nice-to-have), and we use it all over the place – it’s really not that hard to learn.
People who describe SPARQL as easier than SQL are generally of rather above-average intelligence, with brains well suited to complex abstraction.
It is my (current) belief that the very limitations of SQL make it easier for people to start working with it. Sure they’ll make mistakes, but understanding triples requires effectively working in N dimensions of information. I don’t think as many people can cope with that as with the limitedness of SQL.
Either way, we need to establish standard ways of solving the most common tasks. You guys are, I assume, a bit ahead of most of us!
What limitations of SQL? It’s waaay more complex than SPARQL, it’s just that people tend to learn it by SQL osmosis, and have no idea what they’re doing, really. How many web monkeys could honestly say that they understand relational algebra? Or more concretely, how are n-tuples simpler than triples?
Don’t make the classic mistake of confusing familiarity with ease. I know people who learned SPARQL first, and they don’t think SQL is easier.
OTOH I don’t want to make out that ditching all your SQL stores and replacing them with SPARQL is the Right Thing.
For one thing, there are some types of data which are just more appropriate to represent in SQL: anything which is very “rectangular”, i.e. innately tabular data like call records:
(from, to, start-time, length)
RDF is not going to do a better job of storing that, it will be less efficient, and harder to use.
You only really get wins when the data is straggly, partial, or has lots of relations in it.
The other area where SPARQL stores tend to suck is value optimisation. In the example above you might expect to be able to do
SELECT * FROM Calls ORDER BY `start-time` DESC LIMIT 1
And expect it to come back in O(1) time, and it will unless your indexes are stuffed, but not many SPARQL stores will answer
SELECT * WHERE {
  ?call :start-time ?start
} ORDER BY DESC(?start) LIMIT 1
in O(1) time. That also demonstrates the point about boring tabular data being harder to query in SPARQL.
The Talis solution…
http://n2.talis.com/wiki/Primary_Key_Pattern
And I thought the SQL solutions were scary.
It’s a hack like that or an implicit transaction, and neither is reassuring if you know what’s really going on 🙂
Couple of things.
1) I think if you try to view SPARQL the way you program SQL, you will always find it harder/stranger. For SPARQL you just specify a pattern to match, with some logical variables; in my experience, people who haven’t seen SQL find it really intuitive, even (especially?) when they aren’t computer scientists.
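For example, a query for the names of everyone a given person knows is just the shape of the data you want back (a hypothetical pattern, using the FOAF vocabulary):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
  <http://webscience.org/person/6> foaf:knows ?friend .
  ?friend foaf:name ?name .
}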
2) Auto-increment? I am a bit surprised that anyone would rely on this for anything that gets exposed to the world. If I want identifiers for things I will decide on the policy and assign them (so your sequence service might do this). To rely on the order in which transactions arrive at the DB to generate public identifiers seems a bit fragile to me. It is part of the application layer to generate identifiers, whatever they may be.
In my world things are read-write. Lots of typos, corrections and updates. And short URLs are better than long ones. I have no worries at all about consuming the RDF I create; it’s creating it that I find the challenge: how to make a nice form for a user to enter and update information. I can do it, but with the amount of stuff we build, we need ways to do it quickly and painlessly. I’m getting closer…
Until you come up with a more elegant solution, maybe Flickr’s Ticket server can be used:
http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
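If I remember the linked post correctly, the trick is a dedicated MySQL table whose only job is to hand out IDs, roughly like this (a sketch following Flickr’s write-up; details may differ):

CREATE TABLE Tickets64 (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  stub CHAR(1) NOT NULL,
  UNIQUE KEY (stub)
);

REPLACE INTO Tickets64 (stub) VALUES ('a'); -- delete-and-reinsert bumps the AUTO_INCREMENT
SELECT LAST_INSERT_ID(); -- the new ticket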
for more discussion see:
http://www.mysqlperformanceblog.com/2010/10/04/sharing-an-auto_increment-value-across-multiple-mysql-tables/