
Dialects, Jargon and RDF

There’s a problem I encountered some time ago, and then more or less forgot about, but other people are having similar challenges so I thought I’d try to articulate it.

A bit of background about RDF literals

(if you know RDF well you can just skip this section)

The RDF way of structuring data allows you to say several things about a string. The simplest version says nothing; it’s just a list of characters:


Then you can assign one of the common XML style datatypes:

"Hello"^^xsd:string .
"23"^^xsd:positiveInteger .
"1969-05-23"^^xsd:date .

The bit after the ^^ can actually be any URI, so you can have

    <> .

(nb. a lot of identifiers get called a “number” when they are really just a string of characters).

The final variation is a bit weird. You can indicate that a string is text in a given language. eg.

"Hello!"@en .
"Bonjour!"@fr .

And also specific variations of languages, such as

"Hi, parner!"@en-us .
"Wotchamate!"@en-gb .

You are not allowed to set both a language and a datatype on a single literal, so:

“XYZ” or “XYZ”@en or “XYZ”^^<> are all legal, but “XYZ”@en^^xsd:string is not.
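
The rule above can be sketched as a tiny data model. This is a minimal illustration in Python, not how any particular RDF library stores literals; the `Literal` class and its field names are made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    """Hypothetical model of an RDF literal, for illustration only."""
    text: str
    datatype: Optional[str] = None  # a datatype URI, e.g. "xsd:string"
    lang: Optional[str] = None      # a language tag, e.g. "en" or "en-gb"

    def __post_init__(self):
        # A literal may carry a language or a datatype, but never both.
        if self.datatype is not None and self.lang is not None:
            raise ValueError("a literal cannot have both a language and a datatype")


plain = Literal("XYZ")                         # just a list of characters
tagged = Literal("XYZ", lang="en")             # text in a given language
typed = Literal("XYZ", datatype="xsd:string")  # text with a datatype

try:
    Literal("XYZ", datatype="xsd:string", lang="en")
    both_allowed = True
except ValueError:
    both_allowed = False
```

Here `both_allowed` ends up `False`, mirroring the one illegal combination.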

I’ve never really understood why the designers didn’t use defined datatypes for languages, eg.

"Hello"^^<> .

I suspect most RDF systems optimise datatype and language into a single variable internally.

Other dialects

The problem with this very simple attitude to language is that it misses how subdivided dialects can become. For example:

University X has a thing they do which we’ll describe as “a unit of education for which a student may enroll, for a fee, and may receive an award”. They call it a “course”.

University Y doesn’t have courses, it has “presentations”, however semantically it’s the same thing.

We can easily define a URI for this class, say <> but I want a way to describe the label appropriate for university X users and university Y users.

Option 0: Ignore the problem or enforce a national standard

Included, but not really an option, because THIS IS NOT THE WEBBY WAY! The web works because it can cope with the fact that different systems don’t all work exactly the same way, but can still link up.

Option 1: Separate label datasets

I could provide each university with a local terms file to include, but that’s a bit of a disaster as they can’t safely merge their data.

eg. University Y gets a dataset with data like:

<> rdfs:label "Presentation" .
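
A rough sketch of why merging goes wrong, using Python sets of (subject, predicate, object) tuples as stand-ins for the two datasets. The name `ex:Thing` is a placeholder invented for this example, not the real class URI.

```python
# Each university's local terms file, reduced to a single triple.
# "ex:Thing" is a made-up placeholder for the shared class.
uni_x = {("ex:Thing", "rdfs:label", "Course")}
uni_y = {("ex:Thing", "rdfs:label", "Presentation")}

# Merging RDF data is just a union of the triples.
merged = uni_x | uni_y

# After the merge there are two labels and nothing says whose is whose.
labels = sorted(o for s, p, o in merged
                if s == "ex:Thing" and p == "rdfs:label")
```

Once merged, a consumer sees both "Course" and "Presentation" as plain labels with no hint of which audience each was meant for.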

Option 2: Invent datatypes for these dialects

<> rdfs:label 

<> rdfs:label 

I guess this isn’t too bad, but it’s not very intuitive.

Option 3: Invent our own language codes

<> rdfs:label 
    "Course"@en-uni-x .

<> rdfs:label 
    "Presentation"@en-uni-y .

This is going to break things somewhere. I wouldn’t recommend it.
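
One place this could break is language matching. The sketch below assumes a consumer that matches language tags by prefix, in the style of RFC 4647 basic filtering (which is what SPARQL's langMatches is defined in terms of). `lang_matches` is a hypothetical helper written for this example.

```python
def lang_matches(tag: str, lang_range: str) -> bool:
    """Prefix-style match of a language tag against a range (case-insensitive)."""
    tag, lang_range = tag.lower(), lang_range.lower()
    if lang_range == "*":
        # The wildcard matches any non-empty tag.
        return tag != ""
    return tag == lang_range or tag.startswith(lang_range + "-")


labels = [("Course", "en-uni-x"), ("Presentation", "en-uni-y")]

# A consumer asking for any English label gets both labels
# and cannot tell which one to show.
english = [text for text, tag in labels if lang_matches(tag, "en")]

# A consumer comparing tags for exact equality with "en" gets nothing.
exact = [text for text, tag in labels if tag.lower() == "en"]
```

Either way the consumer has to know about the invented codes in advance to get the right label.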

Option 4: Model it in RDF

We could actually assign a URI (or blank node) to the concept of the label and then use the RDF structure to explain the difference.

<> dialect:label 
    <> .
<> a 
    dialect:DialectSpecificText .

    dialect:text "Presentation"@en .
    dialect:inDialect <> .

This is sorta elegant until anybody tries to actually use your data.
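
To give a feel for what a consumer has to do, here is a minimal Python sketch that walks this kind of structure. The triples are plain tuples; the `ex:` names and the `_:label` nodes are placeholders invented for the example, and only the `dialect:` predicate names come from the snippet above.

```python
# Triples as (subject, predicate, object) tuples. The ex: names and the
# _:label* nodes are placeholders invented for this sketch.
triples = [
    ("ex:Thing", "dialect:label", "_:label1"),
    ("_:label1", "dialect:text", "Course"),
    ("_:label1", "dialect:inDialect", "ex:dialectX"),
    ("ex:Thing", "dialect:label", "_:label2"),
    ("_:label2", "dialect:text", "Presentation"),
    ("_:label2", "dialect:inDialect", "ex:dialectY"),
]


def objects(subject, predicate):
    """Return every object for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def label_for(resource, dialect):
    """Follow resource -> label node -> text, keeping only the wanted dialect."""
    for node in objects(resource, "dialect:label"):
        if dialect in objects(node, "dialect:inDialect"):
            texts = objects(node, "dialect:text")
            if texts:
                return texts[0]
    return None  # no label recorded for this dialect


label_y = label_for("ex:Thing", "ex:dialectY")
```

Getting one label takes two hops and a filter, which is the extra work every consumer of the data would inherit.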

Option 5: Use a predicate for each dialect

<> dialects:labelForX
   "Course" .
<> dialects:labelForY

This would certainly work, but it’s ugly and would make consuming the data fiddly.
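
A sketch of the fiddliness, again in Python with tuples standing in for triples. `ex:Thing` is a made-up placeholder, and falling back to `rdfs:label` is an assumption of this example rather than part of the option as described.

```python
# "ex:Thing" is a placeholder; the dialects: predicates are the ones above.
triples = [
    ("ex:Thing", "dialects:labelForX", "Course"),
    ("ex:Thing", "dialects:labelForY", "Presentation"),
]


def label_for(resource, preferred_predicates):
    """Try each predicate in order and return the first label found."""
    for predicate in preferred_predicates:
        for s, p, o in triples:
            if s == resource and p == predicate:
                return o
    return None  # none of the predicates had a value


# A University Y consumer has to know its own predicate by name,
# plus whatever it wants to fall back to.
label_y = label_for("ex:Thing", ["dialects:labelForY", "rdfs:label"])
```

Every consumer needs its own ordered list of predicates, and a generic tool that only knows `rdfs:label` finds nothing.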

Which option?

I have no clue. That’s why I’m writing this blog post. Labels (and descriptions) aimed at different audiences are not something I’ve yet seen done nicely in RDF.

This problem isn’t going to go away any time soon. At Southampton, what our students call “a degree” or “course” (eg. “3 Year BSc Computer Science”), the student admin are more likely to call a “programme theme”, and the underlying database is US-made so calls it “MAJOR”.

As a community, we need to solve this at some point, as there really is a good reason for audience-specific labels and descriptions beyond simple national language variations.

Posted in RDF.

4 Responses


  1. Graham Klyne says

    timbl once wrote a designissues piece on “interpretation properties”, which I think comes closest to your option 5. In hindsight, and per suggestion by Dan Connolly, I think that’s how RDF should have handled datatypes.

  2. Keith Alexander says

    (To add another option) you might see it as a provenance issue.

    X and Y talk about the same resource, but assign different labels to it.

    Sometimes you want to use the data from source X, sometimes from source Y.

    An application could choose which document/graph to fetch data from according to context.

  3. Wilbert Kraan says

    Why not model the entities from different sources as different subclasses of your own class? In my experience, if data comes from entirely different contexts, there will always be a myriad of little differences: labels are one, but I’ll bet that the uni where they use ‘presentations’ sees those as different from a ‘course’, and so on.
    Yes, you’ll have to have a wee indirection when querying, but I think that usually works out as less onerous than trying to cram entities from different data sets into the same class that almost, but not entirely, fits.
