Southampton Web and Data Innovation Team

Ideas and Tips from the Team

Categories:

Advertising
AI
Apache
Best Practice
Bitcoin
Command Line
Community
Conference Spam
Conference Website
Data
- Research Data
Database
dev8d
Doug Englebart
Drupal
Events
Gateway to Research
GDPR
Geo
HESA
HTTP
Internet Archive
Intranet
Javascript
Jisc
Management
- Recruitment
Minecraft
Open Data
Open Source
ORCID
OSX
Outreach
Perl
PHP
Programming
python
RDF
- 4store
- Graphite
- SPARQL
- Triplestore
Repositories
Sharepoint
SQL
Team
Templates
Terms and Conditions
testing
Tips
Training
Tutorial
twitter
Uncategorized
web management
Wordpress

Replacing MySQL with a triple-store

Some academics have urged me to consider using an triple-store as a back-end for some of our websites, as oppose to our normal MySQL approach. I’m not convinced, but it’s an interesting challenge. I started by looking into what common “patterns” we use SQL which we would need to replicate in RDF. Or change how we approach problems in the first place.

Our usual MySQL “patterns”

Create a record from <object>
We create record, which is in effect a serialisation of an object. Most often it represents a human, an account, an event, an organisation or an article (text + metadata). We use the database to generate a unique key for the item, in the current context. Generally an integer. In MySQL we use AUTO_INCREMENT for this, but every SQL DB has a varient.
Delete record with <ID>
Update record with <ID> to match <object>
Retrieve record with <ID>
Find/retrieve records matching <criteria>

Update can reasonably be abstracted to “delete then create” so lets ignore it.

“Find” and “retrieve” require some new techniques, but are not a big concern.

My current understanding is that when adding a set of triples you can say that they are all part of a “graph” with URI <x>, and later you can remove or replace all URIs from graph <x>.

The one thing entirely missing is the ability to generate new integers in a sequence.

I’ve been given two suggested solutions by the experts…

UUID

Suggestion one, to use UUID (universal unique IDs) or hashes. But the problem is, I want to use these in URLs and URIs and I want to use http://webscience.org/person/6.html not http://webscience.org/person/e530d212-0ff1-11df-8660-003048bd79d6.html

Flock a file

A second suggestion, was to flock a local file containing the next value. (lock file, read file, update file, unlock file). This would work, but I want the current position in each sequence to be stored with my other data, and accessed/modified using the same rights as can read/write the triple store. That’s what I’m used to with MySQL.

My Idea 1: Sequence Service

My first idea, was to create a stand alone service which could run on the same server as the triple store, and you could query it, via HTTP or command line, for a new integer in an sequence. Sequences could be identified via a URI.

http://dbserver:12345/next_in_sequence?seq=http://webscience.org/people_sequence

Which would return “23” then “24” etc. The locking code could be handled in the sequence server, and the assumption would be that only trusted parties could connect (like SPARQL). This service could work by:

Locking (all requests processed sequentially)
Querying the triple store for <http://webscience.org/people_sequence> <currentValue> ?value
Replacing the triple with ?value+1
Unlocking
Returning ?value+1

While this is a bit icky, it does mean that my data remains stored in one place, including the state of each sequence.

What this doesn’t do is provide one access point. All SQL implementations provide a solution for this, and I suspect that, long term, so will triple stores. But I can’t see the purists liking it going through the same access interface as it’s clearly a hack.

Non technical concerns with RDF back-ends

On a non-technical note, I’m also concerned that an RDF+PHP solution is not very maintainable. You can’t easily hire someone with these skills yet.

Posted in Best Practice, Database, RDF, SQL.

rev="post-154" 9 comments

By Christopher Gutteridge – February 2, 2010

9 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.

Steve Harris says

We’ve not had any problems with the maintainability or skills, for lots of reasons:

You work in a University, hire graduates! SPARQL is way easier than SQL, so I bet most graduates come out about as good in either, even though they do more SQL.

There’s a problem with people who know SPARQL being snapped up by companies, but I think the number of people who know their stuff is growing faster than the demand for those skills.

In my experience, people who’ve “done a bit of SQL” often haven’t a clue what their really doing anyway. SQL has too much rope.

Alternatively, hire smart people and train them. We don’t require SPARQL or RDF skills (just nice-to-have), and we use it all over the place – it’s really not that hard to learn.

February 2, 2010, 6:30 pm Reply
- Christopher Gutteridge says
  
  People who describe SPARQL as easier than SQL generally are rather above average intelligence with a brain very suited to complex abstraction.
  
  It is my (current) belief that the very limitations of SQL make it easier for people to start working on it. Sure they’ll make mistakes, but understanding triples, requires effectively working in N-dimensions of information. I don’t think as many people can cope with that as with the limitedness of SQL.
  
  Either way, we need to establish standard ways of solving the most common tasks. You guys are, I assume, a bit ahead of most of us!
  
  February 3, 2010, 1:33 am Reply
Steve Harris says

What limitations of SQL? It’s waaay more complex than SPARQL, it’s just that people tend to learn it by SQL osmosis, and have no idea what they’re doing, really. How many web monkeys could honestly say that they understand relational algebra? Or more concretely, how are n-tuples simpler than triples?

Don’t make the classic mistake of confusing familiarity with ease. I know people who learned SPARQL first, and they don’t think SQL is easier.

OTOH I don’t want to make out that ditching all your SQL stores and replacing them with SPARQL is the Right Thing.

For one thing, there are some types of data which are just more appropriate to represent in SQL, anything which is very “rectangular”, i.e. innately tabular data like call records:

RDF is not going to do a better job of storing that, it will be less efficient, and harder to use.

You only really get wins when the data is straggly, partial, or has lots of relations in it.

The other area SPARQL stores tend to suck is value optimisations. In the example above you might expect to be able to do

SELECT * from Calls ORDER BY start-time DESC LIMIT 1

And expect it to come back on O(1) time, and it will unless your indexes are stuffed, but not many SPARQL stores will answer

SELECT * WHERE {
?call :start-time ?start
} ORDER BY DESC(?start) LIMIT 1

in O(1) time. That also demonstrates the point about boring tabular data being harder to query in SPARQL.

February 3, 2010, 10:08 am Reply
Steve Harris says

You comment system is b0rken, it ate my example tuple when I used angle brackets, but it was:

(from, to, start-time, length)

February 3, 2010, 10:11 am Reply
Christopher Gutteridge says

The Talis solution…

http://n2.talis.com/wiki/Primary_Key_Pattern

And I thought the SQL solutions were scary.

February 3, 2010, 2:06 pm Reply
Steve Harris says

It’s a hack like that or an implicit transaction, neither is reassuring if you know what’s really going on 🙂

February 3, 2010, 2:22 pm Reply
Hugh says

Couple of things.
1) I think if you try and view SPARQL the way you programme SQL, you will always find it harder/strange. For SPARQL you just specify a pattern to match, with some logical variables; people who haven’t seen SQL find it really intuitive, I find, even (especially?) when they aren’t computer scientists.
2) Auto-increment? I am a bit surprised that anyone would rely on this for anything that gets exposed to the world. If I want identifiers for things I will decide on the policy and assign them (so your sequence service might do this). To rely on the order in which transactions arrive at the DB to generate public identifiers seems a bit fragile to me. It is part of the application layer to generate identifiers, whatever they may be.

March 4, 2010, 11:51 am Reply
- Christopher Gutteridge says
  
  In my world things are read-write. Lots of typos, corrections and updates. And short URLs are better than long ones. I have no worries at all about consuming the RDF I create, it’s creating it that I find the challenge. How to make a nice form for a user to edit to enter & update information. I can do it, but with the amount of stuff we build, we need ways to do it quickly and painlessly. I’m getting closer…
  
  March 4, 2010, 12:32 pm Reply
Joe Devon says

Until you come up with a more elegant solution, maybe Flickr’s Ticket server can be used:

http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/

for more discussion see:
http://www.mysqlperformanceblog.com/2010/10/04/sharing-an-auto_increment-value-across-multiple-mysql-tables/

October 8, 2010, 6:40 am Reply

« iPad Too much mail! »

Proudly powered by WordPress and Carrington.

Carrington Theme by Crowd Favorite

Replacing MySQL with a triple-store

Our usual MySQL “patterns”

UUID

Flock a file

My Idea 1: Sequence Service

Non technical concerns with RDF back-ends

9 Responses

Authors

Recent Posts

Meta

Blogroll

Tags

Replacing MySQL with a triple-store

Our usual MySQL “patterns”

UUID

Flock a file

My Idea 1: Sequence Service

Non technical concerns with RDF back-ends

9 Responses

Subscribe

Authors

Recent Posts

Meta

Blogroll

Tags