Southampton Web and Data Innovation Team

Ideas and Tips from the Team

Dialects, Jargon and RDF

There’s a problem I encountered some time ago, and then more or less forgot about, but other people are having similar challenges so I thought I’d try to articulate it.

A bit of background about RDF literals

(if you know RDF well you can just skip this section)

The RDF way of structuring data allows you to say several things about a string. The simplest version says nothing; it’s just a list of characters:

"Hello"

Then you can assign one of the common XML style datatypes:

"Hello"^^xsd:string .

"23"^^xsd:positiveInteger .

"1969-05-23"^^xsd:date .

The bit after the ^^ can actually be any URI, so you can have

"2342A-1.3"^^
    <http://example.org/vocab/vendtechtron-product-serial-number> .

(NB: a lot of identifiers get called a “number” when they are really just strings of characters.)

The final variation is a bit weird. You can indicate that a string is text in a given language. eg.

"Hello!"@en .

"Bonjour!"@fr .

And also specific variations of languages, such as

"Hi, pardner!"@en-us .

"Wotchamate!"@en-gb .

You are not allowed to set both a language and a datatype on a single literal, so "XYZ", "XYZ"@en and "XYZ"^^<http://foo.com/bar> are all legal, but "XYZ"@en^^xsd:string is not.

I’ve never really understood why the designers didn’t use defined datatypes for languages, eg.

"Hello"^^<http://w3c.org/ns/lang/en> .

I suspect that most RDF systems optimise datatype and language into a single variable internally.
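To make the “language or datatype, but not both” rule concrete, here is a minimal Python sketch. The `Literal` class is a hypothetical stand-in written for this post, not the API of any real RDF library, and it does no escaping of the string itself.

```python
from dataclasses import dataclass
from typing import Optional

XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal (illustration only)."""
    lexical: str
    datatype: Optional[str] = None  # a datatype URI, e.g. XSD + "date"
    lang: Optional[str] = None      # a language tag, e.g. "en-gb"

    def __post_init__(self):
        # The rule from the text: a language or a datatype, never both.
        if self.datatype is not None and self.lang is not None:
            raise ValueError("a literal cannot have both a language and a datatype")

    def to_turtle(self) -> str:
        # Naive serialisation: quotes inside the string are not escaped.
        if self.lang is not None:
            return f'"{self.lexical}"@{self.lang}'
        if self.datatype is not None:
            return f'"{self.lexical}"^^<{self.datatype}>'
        return f'"{self.lexical}"'


plain = Literal("XYZ")
english = Literal("XYZ", lang="en")
typed = Literal("XYZ", datatype="http://foo.com/bar")

for literal in (plain, english, typed):
    print(literal.to_turtle())

try:
    Literal("XYZ", datatype=XSD + "string", lang="en")
except ValueError as err:
    print("rejected:", err)
```

The two optional fields are kept separate here for readability; a real store could just as well fold them into one slot, since at most one of them is ever set.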

Other dialects

The problem with this very simple attitude to language is that it misses how subdivided dialects can become. For example:

University X has a thing they do which we’ll describe as “a unit of education for which a student may enroll, for a fee, and may receive an award”. They call it a “course”.

University Y doesn’t have courses; it has “presentations”. Semantically, however, it’s the same thing.

We can easily define a URI for this class, say <http://example.com/vocab/EnrollableLearningUnit> but I want a way to describe the label appropriate for university X users and university Y users.

Option 0: Ignore the problem or enforce a national standard

Included but not really an option because THIS IS NOT THE WEBBY WAY! The web works because it can cope with the fact that different systems don’t all work exactly the same way, but can still link up.

Option 1: Separate label datasets

I could provide each university with a local terms file to include, but that’s a bit of a disaster as they can’t safely merge their data.

eg. University Y gets a dataset with data like:

<http://example.com/vocab/EnrollableLearningUnit> rdfs:label "Presentation" .
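A rough Python sketch of why the merge is unsafe. It treats each terms file as a plain set of triples (no RDF library involved), and assumes University X's file holds the matching "Course" label.

```python
UNIT = "http://example.com/vocab/EnrollableLearningUnit"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Each university's local terms file, as (subject, predicate, object) triples.
terms_x = {(UNIT, RDFS_LABEL, "Course")}        # assumed content for University X
terms_y = {(UNIT, RDFS_LABEL, "Presentation")}  # the University Y example above

# Merging two RDF graphs is just the union of their triples.
merged = terms_x | terms_y

labels = sorted(o for s, p, o in merged if s == UNIT and p == RDFS_LABEL)
print(labels)  # ['Course', 'Presentation']
```

After the merge the class has two plain labels, and nothing in the data records which university each one was meant for.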

Option 2: Invent datatypes for these dialects

<http://example.com/vocab/EnrollableLearningUnit> rdfs:label 
   "Course"^^<http://id.uni-x.ac.uk/vocab/our-way-of-describing-stuff>.

<http://example.com/vocab/EnrollableLearningUnit> rdfs:label 
 "Presentation"^^<http://id.y.ac.uk/vocab/term-in-our-dialect>.

I guess this isn’t too bad, but it’s not very intuitive.
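To give a feel for what consuming this looks like, here is a minimal Python sketch. It models each typed literal as a (text, datatype URI) pair rather than using a real RDF library, and the `label_for` helper is hypothetical.

```python
UNIT = "http://example.com/vocab/EnrollableLearningUnit"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
X_DIALECT = "http://id.uni-x.ac.uk/vocab/our-way-of-describing-stuff"
Y_DIALECT = "http://id.y.ac.uk/vocab/term-in-our-dialect"

# Each object is a (lexical form, datatype URI) pair standing in for a typed literal.
triples = {
    (UNIT, RDFS_LABEL, ("Course", X_DIALECT)),
    (UNIT, RDFS_LABEL, ("Presentation", Y_DIALECT)),
}


def label_for(resource, dialect_datatype):
    """Return the label whose datatype matches the wanted dialect, or None."""
    for s, p, (lexical, datatype) in triples:
        if s == resource and p == RDFS_LABEL and datatype == dialect_datatype:
            return lexical
    return None


print(label_for(UNIT, X_DIALECT))  # Course
print(label_for(UNIT, Y_DIALECT))  # Presentation
```

The lookup itself is easy enough. The unintuitive part is that a consumer has to know in advance that these datatype URIs mean “dialect” rather than describing how to parse the string.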

Option 3: Invent our own language codes

<http://example.com/vocab/EnrollableLearningUnit> rdfs:label 
    "Course"@en-uni-x .

<http://example.com/vocab/EnrollableLearningUnit> rdfs:label 
    "Presentation"@en-uni-y .

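This is going to break things somewhere: RDF language tags are meant to be well-formed BCP 47 tags, and software that validates or matches on them has no idea what “en-uni-x” means. I wouldn’t recommend it.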

Option 4: Model it in RDF

We could actually assign a URI (or blank node) to the concept of the label and then use the RDF structure to explain the difference.

<http://example.com/vocab/EnrollableLearningUnit> dialect:label 
    <http://example.com/vocab/EnrollableLearningUnit#label> .

<http://example.com/vocab/EnrollableLearningUnit#label> a 
    dialect:DialectSpecificText .

<http://example.com/vocab/EnrollableLearningUnit#label> 
    dialect:text "Presentation"@en .

<http://example.com/vocab/EnrollableLearningUnit#label> 
    dialect:inDialect <http://id.y.ac.uk/vocab/our-dialect> .

This is sorta elegant until anybody tries to actually use your data.
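Here is roughly what “actually using” it involves, as a Python sketch over plain tuples. The namespace URI behind the dialect: prefix is made up for the example (the post never defines it), and `objects` and `label_in_dialect` are hypothetical helpers.

```python
UNIT = "http://example.com/vocab/EnrollableLearningUnit"
LABEL_NODE = UNIT + "#label"
Y_DIALECT = "http://id.y.ac.uk/vocab/our-dialect"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DIALECT = "http://example.com/vocab/dialect#"  # assumed namespace for dialect:

# The four triples from the example above; ("Presentation", "en") stands in
# for the language-tagged literal "Presentation"@en.
triples = {
    (UNIT, DIALECT + "label", LABEL_NODE),
    (LABEL_NODE, RDF_TYPE, DIALECT + "DialectSpecificText"),
    (LABEL_NODE, DIALECT + "text", ("Presentation", "en")),
    (LABEL_NODE, DIALECT + "inDialect", Y_DIALECT),
}


def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def label_in_dialect(resource, dialect):
    """Follow resource -> label node -> text, keeping only the wanted dialect."""
    for node in objects(resource, DIALECT + "label"):
        if dialect in objects(node, DIALECT + "inDialect"):
            for text, lang in objects(node, DIALECT + "text"):
                return text
    return None


print(label_in_dialect(UNIT, Y_DIALECT))  # Presentation
```

Every label lookup becomes a two-hop walk through an intermediate node, and a tool that only looks for rdfs:label finds nothing at all.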

Option 5: Use a predicate for each dialect

<http://example.com/vocab/EnrollableLearningUnit> dialects:labelForX
    "Course" .

<http://example.com/vocab/EnrollableLearningUnit> dialects:labelForY
    "Presentation" .

This would certainly work, but it’s ugly and would make consuming the data fiddly.
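A small Python sketch of the fiddliness, again over plain tuples. The namespace URI for dialects: and the audience keys ("uni-x", "uni-y") are assumptions made for the example.

```python
UNIT = "http://example.com/vocab/EnrollableLearningUnit"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DIALECTS = "http://example.com/vocab/dialects#"  # assumed namespace for dialects:

triples = {
    (UNIT, DIALECTS + "labelForX", "Course"),
    (UNIT, DIALECTS + "labelForY", "Presentation"),
}

# The consumer has to carry a table mapping each audience to its own predicate.
LABEL_PREDICATE = {
    "uni-x": DIALECTS + "labelForX",
    "uni-y": DIALECTS + "labelForY",
}


def label_for(resource, audience):
    """Look up the label via the audience-specific predicate, or None."""
    predicate = LABEL_PREDICATE.get(audience)
    for s, p, o in triples:
        if s == resource and p == predicate:
            return o
    return None


print(label_for(UNIT, "uni-x"))  # Course
print(label_for(UNIT, "uni-y"))  # Presentation

# Generic tools that only know rdfs:label see no label at all.
generic = [o for s, p, o in triples if p == RDFS_LABEL]
print(generic)  # []
```

Each new dialect means a new predicate and a new entry in every consumer's lookup table.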

Which option?

I have no clue. That’s why I’m writing this blog post. Labels (and descriptions) aimed at different audiences are not something I’ve yet seen done nicely in RDF.

This problem isn’t going to go away any time soon. At Southampton, what our students call “a degree” or “course” (eg. “3 Year BSc Computer Science”), the student admin are more likely to call a “programme theme”, and the underlying database is US-made so calls it “MAJOR”.

As a community we need to solve this at some point, as there really is a good reason for audience-specific labels and descriptions beyond simple national language variations.

Posted in RDF.

By Christopher Gutteridge – February 14, 2014

4 Responses

Graham Klyne says

timbl once wrote a designissues piece on “interpretation properties”, which I think comes closest to your option 5. In hindsight, and per suggestion by Dan Connolly, I think that’s how RDF should have handled datatypes.

February 17, 2014, 1:41 pm
- Christopher Gutteridge says
  
  I wasn’t involved in early RDF, but lang & datatype seem to be a holdover from the XML schema world.
  
  February 17, 2014, 2:48 pm
Keith Alexander says

(To add another option) you might see it as a provenance issue.

X and Y talk about the same resource, but assign different labels to it.

Sometimes you want to use the data from source X, sometimes from source Y.

An application could choose which document/graph to fetch data from according to context.

February 17, 2014, 3:05 pm
Wilbert Kraan says

Why not model the entities from different sources as different subclasses of your own class? In my experience, if data comes from entirely different contexts, there will always be a myriad of little differences: labels are one, but I’ll bet that the uni where they use ‘presentations’ see those as different from a ‘course’, etc, and so on.
Yes, you’ll have to have a wee indirection when querying, but I think that usually works out as less onerous than trying to cram entities from different data sets in the same class that almost, but not entirely fits.

February 17, 2014, 11:42 pm
