

UNIT4 and Linked Data

I was recently in London for a meeting with some folks from UNIT4 about their recent forays into linked data with their Agresso Business World software.

They have the advantage of having a large installed base (>90 local councils, and >250 Further and Higher Education institutions in the UK), so can hopefully provide a mass of data without customers having to set up or install additional systems/infrastructure.

They’ve initially been looking at local council data, with a view to widening this later (Universities are an obvious choice, especially with the growing interest and deployment of institutional open data).

Local Councils

Local councils will have to comply with the Prime Minister’s call to publish financial transactions over £500 from January 2011. Being able to do this simply, with an existing system (which already holds their financial information), makes a lot of sense to the council, while providing the data to the community in an open way.

The Guardian‘s Data Blog has a great summary of how this has been done to date: Local council spending over £500: full list of who has published what so far.

The whole list ranges from one to three stars of the Linked Open Data star scheme. Being one of the first councils to move up to four or five stars certainly couldn’t hurt…

UNIT4 are currently running a pilot with the borough of Windsor and Maidenhead, who already make a lot of their data open (1-3* Excel/CSV/PDF mostly). UNIT4’s plan is to take them up to 5* data, with a view to using the same techniques, software and lessons learned to do the same for other councils.

From 3* to 5* Data

They’ve been looking at workflows for converting from existing financial data to RDF using the Payments Ontology, aiming to generalise the process so that the same software and techniques can be applied to any non-financial data an organisation might have.

Other ontologies used include VoiD and RDF Data Cube.

Redaction is obviously an important feature here, which it seems Agresso supports natively. The Payments Ontology also supports redaction (and I think there’s also an extension to it which supports redaction in a more fully featured way). This is something which can’t easily be automated though, and will still require human effort to clean up data before it gets opened.

This is a great way to get a foot in the door – having one successful workflow from CSV/XLS to RDF means that an organisation can easily apply it to other datasets, with the same software and input formats. Though this is an area that I’m guessing a lot of software providers will want to be at the centre of…

Work done so far for Windsor and Maidenhead can be seen here: Local Government Spend Explorer

The hard parts…

The meeting also raised some familiar concerns/questions about the publishing and maintenance of open linked data:

How do organisations agree on identifiers to use for suppliers?

This is pretty hard without a central registry or lookup service. Companies House data would be a great starting point, but is not open or free.

UNIT4 are going down the route suggested by Tim BL – minting their own URIs for things, then using the owl:sameAs predicate to link them to definitive versions later.
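In Turtle, that pattern might look something like this (both URIs below are made-up placeholders rather than real identifiers):

@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://spending.example-council.gov.uk/id/supplier/acme-widgets-ltd>
    owl:sameAs <http://example.org/id/company/01234567> .

The locally minted URI can be used straight away, with the sameAs link added whenever a definitive identifier becomes available.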

How should an individual entering data find out a supplier’s URI given its name?

Auto-completion? Drop down lists? Even though this is more of a user interface issue, it raises the important point of getting people who don’t care about linked data to be accurate about the data they’re entering.

Which URIs should we use to describe currency?

While ISO maintains a list of currency codes, it isn’t published in an open form, and the full data set isn’t available without paying.

How should data from separate councils be aggregated?

There are hundreds of local councils in the UK, and collecting data from all of them, or querying 100+ triplestores to get at data for comparison just isn’t feasible.

I’m guessing this is something we’ll eventually have to face in the Higher Education linked data world (e.g. someone wanting to query Universities for course data won’t want to have to download data from dozens of institutions separately).

Should there be a central registry? Should data.gov.uk pull local council data into a central triplestore? Should UNIT4 be pulling in the data as a service to their customers? Are any/all of these methods sustainable?

What next?

I’m sure we’ll be hearing more about UNIT4 and linked data in the near future (assuming the Windsor and Maidenhead pilot goes well!). If the strategy and data produced is successful, we may well see a number of councils adopt it.

If this happens, this would be a great starting point for producing institutional open financial data – choices of identifiers and ontologies to use will be much clearer if there’s a large body of homogeneous data out there.

Posted in Uncategorized.


Barcamp Southampton

Last weekend I helped run the first ever Barcamp Southampton.

As this isn’t actually part of my day job, the write-up is on the SoTech blog instead.

Posted in Events.


Notes on SITS – the Scholarly Infrastructure Technical Summit

I was recently sent to attend the Scholarly Infrastructure Technical Summit (SITS) by JISC, along with Ian Mulvany from Mendeley.

The goal of each SITS meeting (as I see it) is to get technical experts (a mix of developers, and project managers who understand tech) together to talk about their experiences/problems/successes with various scholarly infrastructure tools or components.

Something which worked extremely well (which was new to me) was having the meeting run according to an Open Agenda. Topics of interest were brought up by participants, and then voted upon to ensure that there was sufficient interest in a specific subject area.

The meeting started with a run through of topics that were raised at the previous SITS. These included:
SWORD, reverse SWORD, common tooling for workflows, storage abstraction layers for repositories and author identification, as well as discussions around appropriate citations for digital objects.

The topics which were raised and eventually discussed this time around were:

  1. Authentication/Authorisation
  2. RDF/Linked Data
  3. Web Archiving
  4. People/Author Identifiers
  5. Microservices
  6. Curriculum/Training Development
  7. Search Engine Optimisation (SEO)
  8. Lightweight Languages

Authentication/Authorisation

Discussions here were focused around authorisation and authentication, both across services within an institution, and between institutions.

Shibboleth was (as expected) the primary technology talked about, followed by OAuth, and then OpenID.

I was hoping to hear of some success stories here, but mostly it was the problems and questions about these systems which came through:

  • How can we combine Shib with IP or key based auth?
  • How can we provision temporary/guest accounts in such a system?
  • How can we trust remote credentials?
  • How can services authenticate between each other?

The issues with service to service auth were mostly based around the fact that they’d require extra client libraries to be used (especially for web services, where basic access authentication remains the easiest method to use).

Another major issue was that of management of access to resources. If centrally managed, how can a data/service provider be confident that their institution is keeping their groups/users/access levels up to date and correct? And more importantly, how can we be confident that an external organisation who has access to our systems is doing the same?

Related to this issue is that of spreading access control of data around (Chris Gutteridge also has a blog post related to this in a linked data context). Most major auth systems seem to centralise control – but what happens when an individual wishes to share some of their personal data with another person/group/service? What happens if a department wishes to have their own policy controls? What models are there for delegating control, whilst ensuring overall stability and security of a system as a whole?

I think the most interesting outcome from this topic was discussion about a shim or meta-auth layer that could sit behind several auth systems. There certainly seemed to be a lot of interest in something which could authenticate against Shib/OAuth/OpenID/anything else, and then provide a single set of auth details to an institutional system.

It would mean that institutional software could interface with this one layer, and have additional auth mechanisms added to it through extensions/plugins, rather than having to plug multiple auth systems directly in, and have to update code each time a new auth system comes along.

Mendeley do this internally, but if there’s an open source solution for this out there we don’t know about, I’m sure it would be very popular…

RDF/Linked Data

The discussions around linked data were very similar to several I’d had with people over here in the UK – the barriers to adoption were the same at least:

  • Which vocabularies should we be using?
  • Should we be creating our own?
  • Where can we see examples of best practice?
  • Whose identifiers should we use?
  • What content should we be making available first?

Seems it would be good to include some US institutions in the discussions that are happening around the UK academic community at the moment, at the very least to prevent us from going off in entirely different directions!

Lack of tooling was also perceived as a significant barrier to entry. There was a lot of interest in access to linked data using RESTful APIs (e.g. the linked-data-api), and using javascript and JSON to consume the data. These allow experienced developers to consume RDF using methods/technologies they’re already familiar with.

During this discussion (and during a couple of others), quite a few people expressed dissatisfaction with Dublin Core as a means of describing repository data. It seems that some (including people at Google) were interested in looking at HighWire as an alternative. I know next to nothing about this though (a search doesn’t reveal much either…), but will update should I find out more.

Web Archiving

Next up was the topic of archiving web resources, which followed on nicely from a presentation on recent developments in Memento at the DLF Fall Forum.

A key point here was the issue of when/what to archive. Some web resources (a paper in a repository for example) have fairly well defined versions – but do we want to archive every single one?

Other resources (a page aggregating 3rd party content for example) won’t have such well defined versions, and will need to either be archived periodically, or by constantly watching them for changes.

Assuming we’ve taken care of the above, the next point of discussion was about searching. What sort of interface will be needed to search historical resources? Obviously a user won’t want to be presented with a dozen search results containing almost identical content. They’re also likely to want to browse back and forth through important versions of an item once they’ve found a document they’re after.

Someone also raised the point of the difference between browsing/searching historic documents vs. browsing/searching as if you were on a historical system. Some users will want to search using historical indexes, others will just want historical results.

There was certainly interest in Memento as an easy to implement strategy for archiving some web content now (adding it to repositories would be an easy win in this regard), and then worrying about what to do with the data later. It also seems to have the advantage of working below the normal web application level, meaning that the same technology can be used for archiving video/images/RDF/html, without requiring an application specific setup each time.

It was also mentioned that JISC would be commissioning some large scale work on the preservation of fast moving resources.

People/Author Identifiers

The focus of this topic touched on a lot of things, but mostly revolved around ORCID (Open Researcher & Contributor ID).

The ORCID initiative aims to provide a registry of authors/contributors (to aid in communication and author disambiguation), which can then be linked to other ID schemes, to publications, or to each other.

ORCID is (I believe) a follow-on from Thomson Reuters’ ResearcherID. ResearcherID required self registration though, which is where it’s believed to have failed (they had <20000 individuals register). ORCID’s aim is to get author information from institutions, rather than individuals.

I was surprised to see there had been so much interest in this already: 300-400 institutions have already registered their interest in it. It seems that some journals may start requiring ORCID IDs before publishing, which could well be a driver in this.

This, along with the fact that it should work nicely with other ID systems makes it look like something worth keeping an eye on.

It’s not without its issues and potential problems though.

The first of these is that the information kept by ORCID hasn’t been finalised yet. What should they store along with an author’s ID and name(s)? Publication list? Grants? If so, who’s going to be responsible for maintaining it?

This also raises the issue of control of personal data. If an institution makes a statement about you in ORCID, do you have the right to retract it? What about if an individual starts making statements an institution knows are untrue?

Storing the provenance of each fact about an individual in ORCID seemed to be the accepted solution for this – it would leave it up to the data consumer to trust individually/institutionally submitted facts about a person.

By far the biggest obstacle seems to be the lack of an ongoing business model for ORCID though. Once it’s up and running, and as it’s a centralised service, how should it be funded? The identifiers it mints will need to be permanently resolvable in order for anyone to trust it and use it as a service, so how can the community guarantee this?

The project is still very much looking for contributors to develop things from an institutional side though; it looks like there may be several potential projects here.

Microservices

This is a term I hadn’t heard before this month, but came up a lot at the DLF Forum, as well as being a topic with significant interest at the SITS meeting.

Luckily it wasn’t just me who was unsure of exactly what it meant – it seems to be a bit of a buzzword, taken to mean different things by different people. A rough summary can be found on the iRODS micro-services page. The CDL (California Digital Library) also have their own take on microservices on their Curation Micro-Services pages.

The basic gist of microservices seems to be this: Repository software is made up of collections of services. So why not separate them out, and make each one available for reuse, either as a web service (SOAP/REST), or on the command line?

This mirrors the Unix philosophy of making programs do one thing well, and making others by combining several of these.

They allow for easily interchangeable components with a narrow focus, allowing for complex services to be built up without reinventing the wheel each time (especially when moving between languages or platforms).

Some examples of microservices might include:

  • image resizing
  • file checksum calculation
  • file hashing service
  • send email
  • object storage

I’m not convinced the specs for these are well defined enough for general purpose use yet, but I can see the technique in general being very useful. If the same microservices can be called through multiple interfaces (command line, REST, etc.), then it should in theory make them language agnostic.
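As a rough illustration of that idea (purely a hypothetical sketch, not any of the implementations discussed at the meeting), here’s a trivial checksum “microservice” in PHP that exposes the same function on the command line and over HTTP:

<?php
// Hypothetical sketch: one tiny service (file checksumming), two interfaces.
// No input validation at all, it's just a sketch.
function checksum($path) {
    return sha1_file($path);
}

if (php_sapi_name() === 'cli') {
    // Command line use: php checksum.php somefile.pdf
    echo checksum($argv[1]) . "\n";
} else {
    // Web service use: GET /checksum.php?file=somefile.pdf
    header('Content-Type: text/plain');
    echo checksum($_GET['file']);
}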

Curriculum/Training Development

Things moved to a slightly less technical theme here, the focus being on how to get new staff/project members quickly up to speed in a development or project environment.

The key point was to work out how best to ensure that new team/project members gain the skills they need to get their work done.

This includes a mix of technical and non-technical skills, the exact nature of which will vary depending on the project:

  • Source code management (git/svn) and committing guidelines
  • Documentation guides (how to document code/software)
  • Code style guides (what should my functions be named? should I indent with spaces or tabs?)
  • Unit testing (which framework should I use? when should I write them?)

There are also plenty of project/platform specific things an individual will need to know:

  • How do I add plugins to this repository?
  • How do I code X in language Y?

A key point raised here was in the difference between people who were primarily librarians and those who were primarily developers. How do we get each up to speed in the areas they lack? A single curriculum for everyone probably wouldn’t suffice.

Additionally, how should this be taught, and who by? Online notes? Or taught as part of a library/computer science course?

Making sure that the right people attend the right training courses/workshops was also a key point mentioned. Ensuring that an individual has the prerequisites necessary to participate is essential in order not to waste time and money.

Some suggestions about starting points for developers/managers looking into this included the following:

Search Engine Optimisation

This topic should really have been called “Google Scholar Optimisation”, such is the perceived importance of Scholar in the library/repository world (Scholar is apparently way ahead of Web of Science as a student portal to research for example).

There was a great deal of dissatisfaction expressed with the way Scholar works, primarily concerning the fact that it’s the Scholar team who are dictating the metadata that institutions produce in order to be included in their index (more info in Google Scholar’s Inclusion Guidelines).

It was largely felt throughout the room that it should be the academic community who is responsible for agreeing upon a standard for exposing metadata (RDFa was mentioned here), rather than being forced to adopt a 3rd party’s schema which doesn’t fit their data very well.

One thing I learnt from this was that the Scholar team is separate from the regular Google search team, and the technologies they use to index/harvest differ. This means that institutions have to produce metadata multiple times: for Scholar, for regular Google search (and probably more for additional harvesters).

So what are the solutions/workarounds? Some ideas raised were to:

  • Agree on RDF(a) standards for presenting metadata
  • Contact NISO about developing a standard
  • Approach a rival to Scholar (e.g. Microsoft Academic Search)
  • Use the collective bargaining power of institutions to effect a change at Scholar

I’ll finish by saying that quite a lot of people in the room felt very strongly about this; it didn’t sound like Scholar was making anyone very happy!

Lightweight Languages

The main gist of this topic was discussion about barriers to introducing a new language into a project or environment.

Ruby was the main language being discussed, but the discussion could just as easily apply to any language/technology being considered for use.

The key word here was “misconceptions”, seemingly from all sides where introduction of a new technology is concerned.

One barrier to adoption was seen to come from developers themselves. Many are reluctant to learn a new language (especially one perceived as too hard/different to the ones they know). This could be especially relevant to non-CS developers – they’re more likely to have language specific experience, rather than abstract programming knowledge, making it harder to switch.

The next was from a sysadmin point of view. The introduction of new languages can be seen as a security risk, and as yet another set of software that needs patching/updating/configuring. Different languages also have very different security models that need looking at, PHP’s now deprecated safe mode and Java’s Security Technology are a couple of examples which are very different indeed.

There’s also a certain level of suspicion (speaking from my own experiences here too…) about whether new languages/technology are really required for a project. If a project manager has a good reason for picking a certain language (e.g. they have to use a specific software package, libraries/plugins in a certain language are good, a language is needed to easily interface with other tools, etc.), then that’s one thing. There are just as many less valid requests to use a language though (e.g. it’s all I know, it’s a current buzzword, etc.).

So how do we get around these issues?

Virtualisation or bundling of the language was one solution mooted. Using virtual machines is one example of how this could be done (though it still raises many questions about security and trust). The other example given was in bundling up the language and libraries in a single package that could run in a more sandboxed environment (the Ruby language bundled in a WAR file running on a JVM was given as a successful application of this technique).

More important than this though, was the idea of getting sysadmins involved early. Rather than going to them with your requirements, it seems that teams had much more success by involving them with developers from the start, getting them on board to discuss issues and solutions, rather than dictating them at a later date.

Closing Thoughts

Overall, I thought the meeting was a big success, and opened my eyes to quite a few big things that I wasn’t aware of before (ORCID, microservices and Google Scholar issues being foremost among them).

More importantly, each topic mentioned above was finished with some action items, so I’m hoping that we hear some progress on these from various SITS attendees in the near future (I’ll add links to this post as I hear about them).

Future SITS meetings will take place in different locations, and with different attendees, so I’m hoping we’ll see a good cross-section of issues and experiences coming from the meetings. I’m sure we’ll see a lot of common threads come up that lead to less repetition of software development, and more importantly, less repetition of mistakes!

Dave Challis
dsc@ecs.soton.ac.uk

Posted in Uncategorized.



Auto Discovery of voID via SPARQL

I wanted to know more about the data in a SPARQL endpoint. I had a good idea: search the endpoint for triples that mention the URL of the endpoint… no results.

I tried a few endpoints and most returned nothing, but trying one of the RKB Explorer endpoints I got a single triple. But it was the right triple!

From this I can discover everything I need. I suggest that this should be best practice; SPARQL endpoints should contain voID for the datasets they contain, and relate themselves to the datasets using void:sparqlEndpoint.
Even if the voID just has a human readable title and description, that’s infinitely better than nothing.
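A query for the pattern I’m suggesting would look something like this (the endpoint URL is a placeholder, substitute in the endpoint’s own URL):

PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?dataset WHERE {
  ?dataset void:sparqlEndpoint <http://example.org/sparql>
}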

Posted in Uncategorized.


Everybody needs a 303

So there’s been a lot of debate over the past few days about the issue of 303’s being one of the two accepted ways to get from a URI for a non-document (eg. the City of London) to a document about that thing (london.rdf or london.html etc.)

One of the key points Ian Davis made is that it must be practical for people on crappy ISP setups (or whose computing services don’t let them touch their server configuration, etc.). Ian Davis and Leigh Dodds have done some useful work suggesting workable patterns for people to follow, and I think what’s needed next is a ‘cookbook’ of how to achieve these patterns using available tech.

With this in mind I’ve been trying out solutions and would like to present a simple solution for how (nearly) everybody can have a 303. I’ve tried this on apache on local Redhat Enterprise and Ubuntu servers, and also on the data.totl.net server on Dreamhost — it works on those. It nearly worked on the apache setup which comes with OSX, except for the fact it redirected with a 200 rather than a 303.

What you do is put your RDF files in a directory; eg. /project/people/marvin.rdf and define the URI for marvin as /project/people/marvin

Put as many in as you like then add this .htaccess file:

RedirectMatch 303 (.*)/([^\./]+)$ $1/$2.rdf
<Files ~ ".rdf">
 ForceType application/rdf+xml
</Files>

The first line just redirects *anything* without a dot “.” in it to the same address with .rdf on the end. This means that /foo will redirect to /foo.rdf and then 404, but who cares? The second bit just makes sure that the mimetype is correct if the server doesn’t do the right thing with .rdf
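To check it’s working (example.org standing in for wherever your files live), a HEAD request should come back with a 303 See Other and a Location header pointing at the .rdf version:

curl -I http://example.org/project/people/marvin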

There are better ways to do this, but this works on a pretty vanilla, and locked down, apache set up.

So there you go, one less excuse. Link that data!

Posted in Uncategorized.


Searching a SPARQL endpoint

Recently, OUseful Blog has been talking about how to get started hacking SPARQL queries. So here’s a simple one. It looks for things with a search string in their name, title or label:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?thing ?name ?type {
 { ?thing foaf:name ?name }
 UNION { ?thing rdfs:label ?name }
 UNION { ?thing dct:title ?name  }
 OPTIONAL { ?thing a ?type }
 FILTER (REGEX(?name, "YOUR-QUERY-STRING-HERE","i"))
}

For example, searching for Ventnor in the Ordnance Survey. I suspect it’s not that fast because it’s actually having to work through filtering a huge pile of data. I thought searching for “^Ventnor” might be faster (It would in SQL as the indexes can do string-starts-with quickly), but it doesn’t seem to be. Advice on optimising?

If people are interested, I could add this as an option to the Graphite SPARQL Browser.

SPARQL/SQL Translation

For SQL users, UNION is in effect an “OR”, OPTIONAL can be thought of as “LEFT JOIN” and FILTER as a WHERE.

If the SPARQL endpoint were an SQL database, it would be a single table containing three columns: subject, predicate and object. (Yes I’m skipping some stuff here to keep it simple). I’m going to remove the UNION for now as that’s basically like running several SELECTs and merging the results. Note that “a” is an alias for “rdf:type”.

SELECT DISTINCT ?thing ?name ?type {
 { ?thing foaf:name ?name }
 OPTIONAL { ?thing rdf:type ?type }
 FILTER (REGEX(?name, "YOUR-QUERY-STRING-HERE","i"))
}
SELECT DISTINCT t1.subject, t1.object, t2.object FROM
triples AS t1
LEFT JOIN triples AS t2
ON t1.subject = t2.subject
AND t2.predicate = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
WHERE t1.predicate = 'http://xmlns.com/foaf/0.1/name'
AND ( t1.object LIKE '%your string here%' )

Although I’ve changed the regexp to a LIKE. I’m not 100% sure I’ve got this entirely correct, but it should give an SQL hacker a feel for what’s going on. Every triple in the SPARQL select is effectively an inner join where the named parts ?foo are joined to the columns they were associated with in the previous triples. You can do some very funky things in SPARQL, but you need to get joins from lesson one. Even a trivial query on a property of a field will probably require you to add a { ?item a foaf:Person } or you’ll get all things of all types, which isn’t what you’re going to want.

I think that as RDF and the semantic web achieves escape velocity [PDF], we’ll need to make some tutorials for people who just want to get the job done. Right now we’re still working with almost entirely early adopters. We need to make getting data out of SPARQL achievable for people who don’t really care. I found a PHP library for working with SPARQL, but it seems to be from more than 5 years ago. Perhaps I should write a SPARQL library which looks like an SQL library? sparql_query() sparql_connect() etc? (comment if it’s worth my time…)

Redirect to SPARQL

Dave Challis had an interesting suggestion yesterday… Making a URL which accepts a ?q=XXX query and redirects to the SPARQL query that searches relevant labels in our endpoint. That way we select which predicates we consider labels, and it gently cues people into SPARQL without forcing the initial learning curve.
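A minimal sketch of how that might work in PHP (the endpoint URL is a placeholder, and the choice of label predicates is just an example, not our actual setup):

<?php
// Hypothetical sketch: redirect ?q=... to a label search on a SPARQL endpoint.
$endpoint = 'http://example.org/sparql';  // placeholder endpoint URL
$q = isset($_GET['q']) ? $_GET['q'] : '';
$sparql = '
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?thing ?label WHERE {
  { ?thing rdfs:label ?label } UNION { ?thing foaf:name ?label }
  FILTER( REGEX( ?label, "' . addslashes($q) . '", "i" ) )
}';
// Crude escaping above, fine for a sketch; redirect to the endpoint with the query.
header('Location: ' . $endpoint . '?query=' . urlencode($sparql), true, 303);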

Update:

@ldodds points out that the O.S. endpoint has some funky Talis features, so that there is a simple search API, which gives pretty useful results. I’ve seen, in passing, searches which return RSS, but what I’d not realised until today was that the RSS contains lots of useful triples, so in effect it’s just a structured list of RDF descriptions. This approach looks very useful for some usecases I’ve been thinking of. Specifically how to make it easy to search an organisation’s datasets. For example, how to find a building at Southampton University when all you know is “Zepler”.

Posted in Uncategorized.


What you need to know about RDF+XML

RDF+XML is a much loathed format.

It is a way of writing RDF data (triples of subject,predicate,object) in XML.

RDF+XML is not RDF. It’s a way of encoding RDF. There are better ones, such as n3, but it’s the one everyone expects you to provide, so you better learn the basics.

RDF+XML is way too big. You can do everything lots of ways. That makes things confusing, so I figured I’d write a guide to the bare minimum you need to know to create valid RDF+XML.

The basics

The subject & predicate are always URIs.

The object is a URI /or/ a literal value. If it’s a literal it may have an associated data type URI or a language code, but not both.

predicate is just a fancy word for “relation”. It relates the subject to the object, eg. Bob hasFriend Jill. (Note that you can’t assume Jill has a friend Bob, it’s a one way thing. Sorry Bob)

The correct mimetype is “application/rdf+xml”

How to write RDF+XML

This is going to cover the smallest learning curve approach. There’s lots more to RDF+XML but it’s all optional sugar. Don’t worry about it.

I’m assuming you already know what actual triples you want to write. If not this isn’t the correct tutorial for you yet.

An RDF document is an XML document and so always starts with

<?xml version="1.0" encoding="utf-8" ?>

If in doubt, always encode it as utf-8.

Here is a minimum RDF document defining no actual data:

  <?xml version="1.0" encoding="utf-8" ?>
  <rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  >
  </rdf:RDF>

See that bit which says “xmlns:rdf”? That defines that any tag starting with “rdf:” is in the “namespace” http://www.w3.org/1999/02/22-rdf-syntax-ns#

That means that the unique identifier for that element is http://www.w3.org/1999/02/22-rdf-syntax-ns#RDF

If you want to use any predicates, and you do, you’ll need to define namespaces for them in the opening tag. Most common namespaces have a widely accepted prefix.

To find the standard prefix for a namespace you can look it up on prefix.cc which is handy. Don’t use a different prefix without a good reason. If you can’t find the namespace on prefix.cc then pick something sensible. If you are writing a document with a bunch of namespaces, prefix.cc has a very funky shortcut… try this link:

Neat, huh? You can just cut and paste it. This saves time and typos. You might not notice missing a “#” from the end, but a computer will treat it as a completely different namespace!

OK. Now to encode some data. Here’s my data. I’m going to use the prefixes to keep it readable:

  • My name is Marvin
    • http://example.com/marvin#me
    • foaf:name
    • “Marvin Fenderson”
  • I am a Person
    • http://example.com/marvin#me
    • rdf:type
    • foaf:Person
  • My hat size is 10
    • http://example.com/marvin#me
    • myprefix:hatSize
    • 10 ( type is http://www.w3.org/2001/XMLSchema#int )
  • the big head club is an organization
    • http://example.com/bigheadsclub#org
    • rdf:type
    • foaf:Organization
  • The big head club has a member who is me!
    • http://example.com/bigheadsclub#org
    • foaf:member
    • http://example.com/marvin#me
  • The big head club is called “The Big Head Club” in English.
    • http://example.com/bigheadsclub#org
    • foaf:name
    • “The Big Head Club” (in English)

OK, that’s enough data. Note that because predicates are one way sometimes you say things backwards. I wanted to say “I’m a member of the club”, but because I’m using a predicate that relates organizations to members, I have to do it that way around.

Note that many things (like Organization in FOAF) have the US spelling. Don’t correct it, computers want an exact string. If you feel annoyed, add a label to stuff with an en-gb language tag!

Here’s how to encode the above: For each distinct “subject” (the #me and the #org are the “subjects” in the above data, 3 triples start with each), create a sub-element of the top level rdf:RDF element. Call these sub-elements <rdf:Description> and give them an rdf:about attribute which is the URI of the subject:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://example.com/marvin#me">
  </rdf:Description>
  <rdf:Description rdf:about="http://example.com/bigheadsclub#org">
  </rdf:Description>
</rdf:RDF>

OK! That’s still valid RDF (assuming it’s inside the <rdf:RDF> element), but it still contains no data. We need to relate Marvin to the Big Head Club.

For triples where the object is a URI (which indicates they relate the subject resource to another resource, not just a number or string), add them as a tag matching the predicate. The namespace must have been correctly aliased in an xmlns:xxxx="yyyy" declaration. The element should close itself at once and contain the attribute rdf:resource="URI" where URI is the object of the triple. Note that you don’t use the short version of the namespace in rdf:resource or rdf:about, just in the predicates relating subjects to objects.

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:hats="http://example.com/hats/ns/"
>
  <rdf:Description rdf:about="http://example.com/marvin#me">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person" />
  </rdf:Description>
  <rdf:Description rdf:about="http://example.com/bigheadsclub#org">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Organization" />
    <foaf:member rdf:resource="http://example.com/marvin#me" />
  </rdf:Description>
</rdf:RDF>

OK. The last bit is to add in the literals; the strings and the number. Create a tag of the same name as you would for linking to a resource but this time don’t close it at once, but wrap it around the value. If there is a language to express for a string, add an xml:lang="xx" attribute, where xx is the language code. Alternatively, if you need to express a datatype, use rdf:datatype="xxx".

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:hats="http://example.com/hats/ns/"
>
  <rdf:Description rdf:about="http://example.com/marvin#me">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person" />
    <foaf:name>Marvin Fenderson</foaf:name>
    <hats:hatSize rdf:datatype="http://www.w3.org/2001/XMLSchema#int">10</hats:hatSize>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.com/bigheadsclub#org">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Organization" />
    <foaf:name xml:lang="en">The Big Head Club</foaf:name>
    <foaf:member rdf:resource="http://example.com/marvin#me" />
  </rdf:Description>
</rdf:RDF>

The order of relations inside a description, and the order of the descriptions does not matter. I think it’s nice to put ‘types’ and ‘labels’ near the top of each description. Relations can be repeated.

At this point you could add an additional rdf:Description, the about of which is the URL of the RDF document. This allows you to make statements about the document as a whole, such as who wrote it, what license it is under, what it’s called etc. There’s still no agreement on what is useful, but a title and license are handy. Use rdfs:label to label it.
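For example, something like this (assuming you’ve also added xmlns:rdfs and xmlns:dct declarations to the rdf:RDF element, and that the document lives at the URL shown):

  <rdf:Description rdf:about="http://example.com/marvin.rdf">
    <rdfs:label>RDF describing Marvin Fenderson and the Big Head Club</rdfs:label>
    <dct:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
  </rdf:Description>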

While it’s not strictly required, it’s helpful to add a rdfs:label and rdf:type to describe every URI that is a subject or object in the document, not counting the objects of rdf:type. Some people say this is overkill, but it does help debugging.

Checking your RDF

Don’t skip checking it. I keep running into broken RDF produced by people who never sanity checked it.

The best way to check your RDF is to put it on a URL and poke things at it. The first thing I usually do is load it in Firefox and check that it’s valid XML. If it’s not, that’s a dealbreaker before we start. Here’s a link to an online copy of our file.

If you load it in Firefox it’ll tell you about any XML errors; other browsers are not so helpful.

Once you’ve done that, you should load it into an RDF aware viewer. I use the Graphite Quick & Dirty RDF Browser which I wrote. Here’s what the data looks like if you view it in the browser.

The rdf:type’s have been spotted and are shown on the right hand top corner of each box. The foaf:names have also been highlighted. This helps you spot obvious mistakes. Also, because we’ve got a valid label for Marvin, the list showing members of the organisation is showing his name rather than the URI (hover the mouse to see the URI). This is also handy in spotting obvious mistakes.

If the Graphite Browser can’t parse your RDF it’ll link you to the W3C RDF Validator which is sometimes helpful. Also double check your xmlns definitions. Missing a character off the end will cause lots of problems!
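If you prefer the command line and have the Raptor toolkit installed, its rapper utility will also parse the file and report a triple count or any errors (the URL below is a placeholder for wherever you’ve put your file):

rapper --input rdfxml --count http://example.com/marvin.rdf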

Better than using my generic RDF viewer, if available also check your data in one that is designed to understand the namespaces you’re working with. There’s not many around yet, but that will change. Personally I find the existing ones quite confusing.

If you don’t check your RDF+XML it’s bound to be buggy.

What I’ve skipped

Almost everything! But the only *useful* thing I’ve skipped is how to write bNodes (resources without an associated URI), that’s to keep this simple and because I have a dislike for them.

RDF+XML offers huge numbers of short cuts, but you don’t need any of them. They just make it easier to make mistakes. Sod ’em.

How to read RDF+XML

Simple. Use a library. Most programming languages now have a good library for parsing all the crazy crap in RDF+XML. Don’t bother trying to do it yourself, it’s a waste of time and there’s more important work to be done!

In PHP, I use ARC2. It handles lots of other RDF formats as well, and so I have one less problem to worry about. I just point it at web addresses and it sucks down triples. Who cares how they were encoded?
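From memory, the basic ARC2 pattern is something along these lines (the include path and URL are placeholders):

<?php
include_once('arc2/ARC2.php');   // path to wherever ARC2 is installed

$parser = ARC2::getRDFParser();
$parser->parse('http://example.com/marvin.rdf');  // fetch and parse, whatever the format
$triples = $parser->getTriples();                 // array of subject/predicate/object triples
print_r($triples);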

N3

This is another way to encode the same data. It also has some shortcuts, but is much more elegant. Check out the same data:

     @prefix foaf: <http://xmlns.com/foaf/0.1/> .
     @prefix hats: <http://example.com/hats/ns/> .
     @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

    <http://example.com/bigheadsclub#org>     a foaf:Organization;
         foaf:member <http://example.com/marvin#me>;
         foaf:name "The Big Head Club"@en .
    <http://example.com/marvin#me>     a foaf:Person;
         hats:hatSize "10"^^<http://www.w3.org/2001/XMLSchema#int>;
         foaf:name "Marvin Fenderson" .

End

I’m sure that I’ve made a mistake or two myself. Suggestions on how to improve the above would be welcome.

This article is copyright 2010 Christopher Gutteridge and is released as CC-by, you are free to reuse it and modify it so long as you attribute me (my name and a link to this blog).

Posted in RDF.



TLD changing under your feet

Here’s a fun issue people should be aware of. I got the following email today about someone forgetting their password and their email having been changed outside their control!

However, since I am resident of Serbia, my country changed old .yu domain with the .rs domain. Thus, my e-mail changed from pmilin@ff.ns.ac.yu to pmilin@ff.uns.ac.rs

The darn country code for their university was changed!

What are the implications if they had been creating URIs at http://data.ns.ac.yu/id/

Posted in Uncategorized.


Nobody needs a 303

Ian Davis (@iand) has just written this rather challenging blog post about the future of the 303 redirect. He’s onto something, but the idea needs work and I have an idea…

Background

A key point of the semantic web is that you can use URIs (which look exactly like URLs) to represent concepts which are not resolvable into a sequence of 1’s & 0’s. A URI represents a single thing, be it an HTML document about Rice Pudding, an RDF document about Rice Pudding, or the concept “Rice Pudding” itself.

Concept: http://dbpedia.org/resource/Rice_Pudding
RDF: http://dbpedia.org/data/Rice_Pudding.xml
HTML: http://dbpedia.org/page/Rice_Pudding

If you use an HTTP request to ask for the concept, it can’t serve you rice pudding over a TCP/IP stream so rather than tell you “200 OK” (and give you pudding) or “404 Not Found” it tells you “303 See Other” and gives a redirection to the RDF document. If it’s being extra clever, it listens to what format your client prefers (your HTTP client expresses a preference when it makes a request) and redirects you to a URL with the most palatable data format for you.

(Side note, in many ways HTTP response 418 might make more sense in this case if there was no document available).

To watch this in action try (on a Linux command line, or Terminal on OSX):

curl -v http://dbpedia.org/resource/Rice_Pudding

then

curl -v -H'Accept: application/rdf+xml' http://dbpedia.org/resource/Rice_Pudding

The problem is that this is a pain to configure on a webserver, and makes things complicated in general. Also when you ask a person “what’s your URI?” they stare at you blankly. It’s non-trivial to get URIs out of linked data experts; if we want Linked Data to take off, it must be achievable by people who don’t really care.

Enter Ian Davis

Ian has just written this blog post: http://iand.posterous.com/is-303-really-necessary. I really want to disagree, just to make the Fatboy Slim reference, but I think he’s onto something. He is, if I’ve understood correctly, suggesting that when you resolve a URI you should expect to get a “200 OK” and a document about that subject. This does make things more simple, but means that the URI for a document is now different from the URL of that document.

It’s going in the right direction, and really helps solve the problem of how to ask a layman for a URI, but I’ve got some ideas of how to make it work; either or both could be used.

Making it clear what’s going on in the HTTP response

Add a new HTTP return code “208 Metadata” which indicates what you are getting is data about the resource you requested rather than the resource itself. This could also be achieved by putting specific triples in the returned document, but this feels much cleaner. However it still has the issue of requiring special server configuration.

The thing Ian is going for (correct me if I’m wrong) is to allow someone to just place an RDF document in a directory and have it served over the web and you’re done. That’s OKish, but to make apache serve it with the correct mimetype it will need a ‘.rdf’ suffix, which means making a URI for ricepudding like http://data.totl.net/puddings/ricepudding.rdf which feels wrong to me. You can make it work pretty easily with PHP;
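Something like this, for example (a hypothetical sketch of what such a script might contain, not the actual contents of 208.php):

<?php
// Hypothetical sketch: serve an RDF document with a made-up "208 Metadata" status.
header('HTTP/1.1 208 Metadata');
header('Content-Type: application/rdf+xml');
readfile('ricepudding.rdf');   // placeholder filename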

curl -v http://graphite.ecs.soton.ac.uk/experiments/208.php

… but now I’ve got a .php suffix! I can’t find any handy apache .htaccess config that lets me set the HTTP response code for a directory; the best I can find quickly is

ForceType text/n3
Header set Semantic Metadata 

Which would at least mean you could create a directory of suffixless files, and indicate to a semantic-header-aware tool that this was not the requested resource, but rather metadata about it.

curl -v http://graphite.ecs.soton.ac.uk/experiments/setmime/ricePudding

Extending the URI syntax to indicate “Subject of URI”

I really quite like this one, which is to add something to the way you write the URIs. Put a symbol, let’s say “!”, at the start of the URI to indicate it represents the subject of the document at the given URL. This feels a little like the use of & in C code.

<!http://users.ecs.soton.ac.uk/cjg/> foaf:name "Christopher Gutteridge" .

<!http://users.ecs.soton.ac.uk/cjg/> foaf:homepage <http://users.ecs.soton.ac.uk/cjg/> .

I really think this could work! Most semantic systems just treat URIs as strings so this in no way breaks their ability to reason and process data. The only time it would matter is when they come to resolve the URI. Resolving the URI would not work for clients that didn’t understand the syntax, but that’s not a big deal, they’ll be easy to fix and just won’t get extra data — their loss.

I’ve done a couple of experiments to see how the ARC2 parser copes with this;

The result isn’t great. It is valid, but it’s treated the URI as relative to the current document. So we can’t really put anything at the front, but we could put something on the end… What’s a character which is in basic ASCII but explicitly not legal at the end of a URI? (we want one which can never conflict with a real URL, so things like “#topic” or “!” are out as they could be legal URLs)…

% as the suffix?

That’s really really not legal to put by itself in a valid URI as it must be followed by two hex digits. So let’s try using <http://users.ecs.soton.ac.uk/cjg/%> to represent the subject of that page (ie. me)

That works much better, but I can no longer put it into a namespace definition as it’s appended rather than prepended. If a naive client tries to resolve it, it will get back a 400 HTTP response, a smart client can understand to strip the trailing %. A nice webserver might add a plugin which spots requests ending in % and if there is a valid URL without the % send a 303 See Other, so that would enable most existing RDF libraries to keep working, unless they were super touchy about the URI being valid before requesting it.

One thing that doesn’t work is using the % suffix in predicates in RDF+XML as that requires you to write <test:foo%>Testing</test:foo%> which is not valid XML. You also can’t use it in the shortcut for class names, eg. <test:Bar% rdf:about="…"> but that’s not a problem as you just describe it using <rdf:type …>.

In this way, a URI for the Web Science Trust is just <http://www.webscience.org/%>

Using ‘#’ as a suffix?

One other option would be to describe a URI ending in # as referring to the subject of the document without the ‘#’. This has some big advantages over using ‘%’ as it is a legal URI, still, and like any # URI, it will resolve to the source document.

I know that foo#bar indicates a sub-part of foo; in XML/HTML it’s the element with id='bar'. However, what does foo# mean? The element with an empty id? Or can we safely set a standard that this is the _subject_ of the URL without the #?

You still won’t be able to use it in RDF+XML predicates as <foo:bar#> isn’t legal.

Using this idea, the URI for the Web Science Trust is <http://www.webscience.org/#> which is rather elegant.

UPDATE: Dammit, Steve Harris has pointed out that URIs of the format <http://example.com/foo#> are used to identify namespaces. Dang.

Posted in RDF.


What is our URI?

Canonical URI is already a bit of a loaded term, but what I really mean is what URI should I use to refer to Southampton University when writing linked data about it. Or, for that matter, how about The WebScience Trust?

Here’s the rule I think we should follow:

  1. If the organisation who grants your charter assigns you a URI then use that.
  2. Failing that, mint one for yourself in your own domain.

I don’t think it makes much sense to use your dbpedia URI — they are too volatile.

In both cases you should mint your own URIs for any entities which are within your scope, such as sub-organisations.

The problem is that (1) won’t resolve to your open data about your organisation, but rather to your parent organisation’s data about your organisation. In this case I suggest the following pattern is added to your ‘boilerplate’ which you add to most or all RDF documents:

<http://data.example.ac.uk/docs/exampleacuk.rdf> 
  foaf:primaryTopic 
  <http://education.data.gov.uk/id/school/666666> .
<http://data.example.ac.uk/docs/exampleacuk.rdf> 
  rdf:type 
  oocore:OpenOrgDocument .

What’s oocore?

oocore is the (still in development) core namespace for a bunch of namespaces for tools to help “Open Organisations” provide useful information about themselves and make it discoverable. The focus is not on perfect models (beware the Modeller) but rather on making the data easy to use and reuse.

The idea of an OpenOrgDocument is that it would obey certain conventions, and would be a little like a foaf:personalProfileDocument for an organisation. It will have some strong guidelines on what is useful to include, and link to additional OpenOrg documents for common facets of organisational data, such as buildings and amenities, organisational structure, news, publications, membership, financial information etc.

What if our parent organisation creates a URI for us in the future?

Well, that’s an issue. You’ll have the choice of using that in future, or just adding a sameAs link. It’s a pain, but I suspect most places will just continue to use the URI they picked early. The key thing is not to mint a URI if there’s already one out there.

Discovering the OpenOrg Document

If you request “/” from the organisation’s main domain, eg. www.example.ac.uk, with an HTTP Accept header that prefers ‘application/rdf+xml’ then you should be redirected to the open org document. In addition, the homepage should have a

<link rel="alternate" type="application/rdf+xml" href='..path to openorg document..' />

This will mean that people can discover standard data about your organisation without jumping through any complicated hoops, or having to try 10 different fiddly approaches.
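Checking the content negotiation route from the command line is then as simple as (www.example.ac.uk being a placeholder for the organisation’s homepage):

curl -v -H 'Accept: application/rdf+xml' http://www.example.ac.uk/

…and looking for a redirect to the OpenOrg document in the response.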

What should go in an OpenOrg Document?

Well, we’ll work that out as we go, but I’d go with some of:

  • Basic name of the organisation
  • contact details; main email, main homepage, main phone number
  • based_near to the nearest population centre
  • based_near also to a geo:point for simple navigation purposes.
  • links to additional openorg documents which (and this is a neat bit) can be the current document. If it’s a small organisation, you might as well put all the data in one big document which is rdf:type several types of openorg document.
  • links to additional datasets, with enough data to let a system know if it’s helpful to resolve the URI or not.

That last bit is important. By saying that a URL is of type “OpenOrgBuildingsDocument” that tells a consumer that the resulting data will not only be in RDF but will follow a known pattern, which should help it provide a user interface to it, especially for mobile applications.
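To make that concrete, here’s a rough sketch of the sort of minimal document I have in mind, written in N3 (the values are placeholders, and only standard FOAF and WGS84 geo terms are used since the oocore namespace isn’t finalised yet):

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .

    <http://education.data.gov.uk/id/school/666666>
        foaf:name "Example University" ;
        foaf:homepage <http://www.example.ac.uk/> ;
        foaf:mbox <mailto:enquiries@example.ac.uk> ;
        foaf:phone <tel:+44-0000-000000> ;
        foaf:based_near [ geo:lat "50.93" ; geo:long "-1.40" ] .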

Posted in Uncategorized.