

When Linked Data is not Open Data

I made a mistake! Potentially one which could have exposed information to the Internet which should never have left the intranet. It’s unlikely that anything leaked out, and the hole is now closed for good.

IP-range restricted pages subverted by proxies

Here’s what happened: A few years back we set up our first stab at an RDF service for ECS. This only contained information on members who had agreed to appear in our public directory, and never contained information on people’s offices. However, we wanted to play with that data in RDF, so we decided to be clever and also create intra.rdf.ecs.soton.ac.uk, which would serve such data, but only to our IP range. All was then fine, and many 3rd year projects (well, 3 or 4) used the intranet data for interesting demos.

Where things went wrong was when I recently launched my RDF browser, which allows you to view RDF documents in a more human-friendly way. All well and good, until I was playing with it later that week and noticed I was able to browse our intra.rdf server from my home machine. The RDF browser had access to the confidential data because it runs inside our network. As soon as I found out I added a rule to block my RDF browser. Then it occurred to me that anyone in ECS could write a web proxy, and any intranet information restricted only by IP address could then become visible to the world, including Google!
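To make the risk concrete, here is roughly what such a proxy amounts to. This is a deliberately naive sketch in Python, not code anyone actually wrote; the hostname comes from above, and the path and port are made up. Run on any machine inside the trusted range, it relays IP-restricted pages to anyone who asks:

# Minimal relay: it sits inside the trusted IP range and fetches whatever
# URL it is asked for, so anything restricted "by IP only" becomes
# world-readable through it.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class NaiveProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /?url=http://intra.rdf.ecs.soton.ac.uk/person/cjg
        target = self.path.split("url=", 1)[-1]
        data = urlopen(target).read()      # fetched from inside the IP range
        self.send_response(200)
        self.send_header("Content-Type", "application/rdf+xml")
        self.end_headers()
        self.wfile.write(data)             # handed straight to anyone outside

HTTPServer(("", 8080), NaiveProxy).serve_forever()

A dozen lines, and every “IP-only” restriction behind it means nothing.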

For this reason we’ve moved to secure all our intranet information by username/password rather than by IP range. This is a bit annoying, but necessary for data protection: we’re a research department, and we shouldn’t be preventing postgrads from building web proxies for fun and experimentation.
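For what it’s worth, the difference between the two checks is tiny at the code level. A sketch, assuming a WSGI-style environ and some account-database lookup; check_password and the exact IP range are stand-ins, and in practice ours now sits behind the single sign-on mentioned below:

# Old rule vs. new rule for a request to the intranet RDF service.
import base64
from ipaddress import ip_address, ip_network

ECS_RANGE = ip_network("152.78.0.0/16")    # illustrative range, not the real ACL

def allowed_by_ip(environ):
    # The old check: trust anything coming from inside the departmental range;
    # a proxy running inside that range defeats it completely.
    return ip_address(environ["REMOTE_ADDR"]) in ECS_RANGE

def allowed_by_credentials(environ, check_password):
    # The new check: the request must carry credentials that can actually be
    # verified against the account database, whoever relayed it.
    header = environ.get("HTTP_AUTHORIZATION", "")
    if not header.startswith("Basic "):
        return False
    user, _, password = base64.b64decode(header[6:]).decode().partition(":")
    return check_password(user, password)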

However, our cookie-based single sign-on is a very ugly way to access an RDF document. So it got me thinking about whether we should even have closed linked data and, if so, how it should be handled.

Closed Linked Data

After a bit of a think I’ve decided that there are two very distinct types of closed linked data:

  1. Data about me. For example: my contact details, office location, calendar/schedule/lecture timetable, what modules I am studying.
  2. Data I am authorised to view. For example: the grades of my tutees, the list of servers in a server room, the communications budget expenditure details for 2009.

What I should be allowed to do with type (1) closed data is very different to type (2). If I choose, it’s perfectly reasonable for me to give a smartphone application access to read data about me. I can make my own call about trusting the 3rd party developer. However, there’s no way in hell I should be uploading student marks or confidential budgets to such an application. Whether they are to be trusted should be a decision made by my organisation, and they should then be granted access that way.

One of our students wrote an iPhone app called “iSoton” to which you give your username & password; it logs into the main university intranet portal and navigates through a couple of pages to get your timetable out as CSV. It’s so popular it’s not been blocked, even though the developer could be harvesting the username/password pairs.

The thing is, there’s no need to use your main username/password to grant access to this data. What I propose should happen for type (1) data is that if you request such a URL/URI without a (valid) username and password, it will provide some minimal triples describing how to create a username/password pair. The app can then show you these instructions. Basically, you log into your university account with your real username and password and ask to create a username/password pair for this app to use to access the data you approve of it seeing. Much safer.

Your username: [cjg.............]
Your password: [*********.......]
ID of service: [isoton..........]
Allow Access   [x] Contact Information
           to: [.] Location Information
               [x] Calendar and Timetable
               [.] Allow app to pass your information to 3rd parties?
               [.] Allow app to place any of this information on the public web?
  Expiry date: [2011-07-12] (optional)

Thank you. The app may access your contact and calendar information at:
http://cjg+isoton:ybBiebYB3@data.soton.ac.uk/person/cjg
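An app holding one of these disposable username/password pairs would then fetch the document with plain HTTP Basic auth, never touching the real account password. A sketch in Python; the URL and credentials are the made-up ones from the form above:

# Fetch the RDF document using the per-app credentials issued above.
import base64
from urllib.request import Request, urlopen

URL = "http://data.soton.ac.uk/person/cjg"
USER, PASSWORD = "cjg+isoton", "ybBiebYB3"   # per-app, revocable, expiring

token = base64.b64encode((USER + ":" + PASSWORD).encode()).decode()
request = Request(URL, headers={
    "Authorization": "Basic " + token,
    "Accept": "application/rdf+xml",         # or text/turtle
})
print(urlopen(request).read().decode())

The same request without (valid) credentials would just get back the minimal “here’s how to get your app an account” triples described above.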

This would be entirely inappropriate for type (2) data, but for type (1) it allows all the cool mashups to be done without compromising the password used for email. The “allow app to” options would control what license information was included in the RDF boilerplate. This should also contain info on when the data was generated and for which disposable username, so if it does get released into the wild there is some kind of audit trail.

Desktop Applications for type (2) Closed Linked Data

While you wouldn’t pass your type (2) RDF (stuff you don’t have a personal right to republish) on to a third-party service, you may well want to use it with a desktop application, in much the same way you might download an Excel file from your intranet and open it on your laptop.

In this case it’s perfectly reasonable to use your main username/password to authenticate (unless the application is malicious, but that’s a known problem and much easier to cope with on the desktop than with phone apps, cloudy websites, etc.).

However, as with the type (1) data, if it is provided in RDF it should contain some boilerplate saying when it was generated, for whom, and from which IP address it was requested. That way, if it leaks accidentally, it can be traced. Obviously this is no proof against malicious leaking, but it should be considered best practice to include such a header, plus a clearly NOT OPEN license, in any non-open linked data document.

Boilerplate Triples for Closed Linked Data

Here’s a sketch of what I’m thinking. I’m not sure about the exact predicates to use.

<> a foaf:Document ;
   dc:license <http://data.soton.ac.uk/licence/our-bloody-closed-eyes-only-license> ;
   xxx:generatedFor <http://data.soton.ac.uk/person/cjg> ;
   xxx:requestedBy "152.78.71.23" ;
   xxx:generatedOn "2010-07-23T12:32:01Z" ;
   rdfs:license """This file contains confidential data and should not be redistributed.
     If you receive or discover it by accident please notify yikes@soton.ac.uk and
     delete all copies.""" .

You get the gist. xxx: is for predicates I’m not sure about. There may already be useful ones I’ve forgotten somewhere in the bowels of dcterms.
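For anyone generating such documents, stamping the header on takes only a few lines; here’s a sketch using rdflib (assumed to be available). The xxx: namespace URI and the document URI are placeholders, as above:

# Attach the audit-trail boilerplate to an outgoing closed-data document.
from datetime import datetime, timezone
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

DC   = Namespace("http://purl.org/dc/elements/1.1/")
RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
XXX  = Namespace("http://data.soton.ac.uk/ns/xxx#")   # placeholder namespace

def add_closed_data_boilerplate(graph, doc_uri, person_uri, client_ip):
    doc = URIRef(doc_uri)
    graph.add((doc, RDF.type, FOAF.Document))
    graph.add((doc, DC.license, URIRef(
        "http://data.soton.ac.uk/licence/our-bloody-closed-eyes-only-license")))
    graph.add((doc, XXX.generatedFor, URIRef(person_uri)))
    graph.add((doc, XXX.requestedBy, Literal(client_ip)))
    graph.add((doc, XXX.generatedOn, Literal(
        datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"))))
    graph.add((doc, RDFS.license, Literal(
        "This file contains confidential data and should not be redistributed. "
        "If you receive or discover it by accident please notify "
        "yikes@soton.ac.uk and delete all copies.")))
    return graph

g = add_closed_data_boilerplate(
    Graph(),
    "http://data.soton.ac.uk/person/cjg.rdf",   # illustrative document URI
    "http://data.soton.ac.uk/person/cjg",
    "152.78.71.23")
print(g.serialize(format="turtle"))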

Posted in Best Practice, Intranet, RDF.


3 Responses


  1. Mischa Tuffield says

    FWIW:

    xxx:generatedFor could possibly be “http://xmlns.com/foaf/0.1/primaryTopic”

    and xxx:generatedOn could be a “http://purl.org/dc/elements/1.1/date”

    — it *could* be but I think those are too woolly. Also the primary topic might well be different to the person the data is being generated for. – cjg

  2. Kingsley Idehen says

    Chris,

    Yet another example of the kind of problems resolved by the WebID Protocol via its ACL dimension.

    On the FOAF+SSL mailing list I kicked off some resource-oriented ACL demos. Naturally, you can also secure Named Graphs using the WebID Protocol.

    Links:

    1. http://esw.w3.org/Foaf%2Bssl – WebID (née FOAF+SSL)
    2. http://www.openlinksw.com/dataspace/kidehen@openlinksw.com/weblog/kidehen@openlinksw.com%27s%20BLOG%20%5B127%5D/1625 – a recent post with demo links re. the WebID Protocol.

  3. Olaf Hartig says

    Hey,

    You could use the Provenance Vocabulary to represent xxx:requestedBy and xxx:generatedOn. Just write something like:

    prv:retrievedBy [
        rdf:type prv:DataAccess ;
        prv:performedBy _:x ;
        prv:performedAt "2010-07-23T12:32:01Z"^^xsd:dateTime ] .

    _:x rdf:type prvTypes:DataAccessor .

    Now, you only need some property that states the IP address of the data accessor represented by the blank node _:x.


