

Open Data Metrics

Happy new year.

First of all, welcome to a new team member, Andrew Milsted. Andrew’s post is funded by HEFCE to work on developing equipment.data.ac.uk. He’s busy right now turning my 0.1 code into something a bit more suited for the long term, and as a side effect we hope to release some libraries which will be reusable in future projects.

So right now our little team is:

Christopher Gutteridge & Patrick McSweeney – General Web Meddling for ECS, the Faculty of Physical and Applied Sciences, and in general.
Ash Smith – Linked Open Data Specialist for http://data.southampton.ac.uk/
Andrew Milsted – Open Data Specialist for http://equipment.data.ac.uk/

We are all situated in the Level 4 labs in B32, and are part of the iSolutions “Technical Innovation & Development Team”.

A big problem with open data…

…usage metrics.

The problem of usage metrics came up with data.southampton.ac.uk, but is a far bigger concern for equipment.data.ac.uk as we need to be able to show an external funder that we’re having an impact.

We conform to (some) good Linked Open Data principles: our URIs resolve (mostly) and our data can be downloaded as a single big file, so no noodling with APIs is required.

Local caches of the data dump

However, this means that 3rd parties can snapshot our entire database and use it on things like their internal intranet. In fact we do this ourselves: our university intranet, SUSSED, has a search box which searches a mash-up of the open equipment list from equipment.data.ac.uk and some additional (non-open) data we get from some strategic partners. The catch is that we’ve got no way to know what’s been searched for there, or how often, so we’re losing all that juicy business intelligence.

This is going to be a sticking point in future. Organisations may decide that their statistics gathering is more important than being fully open, and so may require all searches to go via an API to discourage 3rd parties from taking the whole dataset and searching it themselves. That sucks, but I can see why a manager would make that decision.

Google Analytics can’t help you here

Most non-technical web managers seem to use Google Analytics as their go-to stats system. I can see the cookies on many .ac.uk sites I visit. The problem is that it doesn’t work once you leave the HTML world. Most of the hits on our data files will be from automated scripts downloading them, or from semantic tools resolving the data for an entity, such as http://equipment.data.ac.uk/item/378b5a86ab130959dd62a68b9b7110a1.ttl (NB: .ttl files don’t open in the browser).

Those requests can only be tracked by the software running on the server. And they are, but it’s hard to know what to do with the resulting logs. It would be handy if someone could make a semantic-web-aware way to analyse an Apache log.
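
As a rough illustration of what a first pass at such a tool might look like, here is a minimal Python sketch that counts successful requests by file extension. It assumes the server writes Apache's standard "combined" log format; the sample log lines, IP addresses, user agent names and the .html variant of the item URL are invented for the example and are not taken from our real logs.

```python
import re
from collections import Counter

# Apache "combined" log format (assumed; adjust if the server uses a custom LogFormat).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def count_by_format(lines):
    """Count successful (200) requests per file extension, e.g. ttl, rdf, html."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None or match.group("status") != "200":
            continue  # skip unparseable lines, redirects and errors
        path = match.group("path").split("?", 1)[0]
        last_segment = path.rsplit("/", 1)[-1]
        if "." in last_segment:
            extension = last_segment.rsplit(".", 1)[-1]
        else:
            extension = "(none)"
        counts[extension] += 1
    return counts

# Invented sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Jan/2014:10:00:00 +0000] "GET /item/378b5a86ab130959dd62a68b9b7110a1.ttl HTTP/1.1" 200 2048 "-" "ExampleRDFClient/0.1"',
    '192.0.2.2 - - [10/Jan/2014:10:00:05 +0000] "GET /item/378b5a86ab130959dd62a68b9b7110a1.html HTTP/1.1" 200 5120 "http://equipment.data.ac.uk/" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Jan/2014:10:00:09 +0000] "GET /item/378b5a86ab130959dd62a68b9b7110a1.ttl HTTP/1.1" 200 2048 "-" "ExampleRDFClient/0.1"',
]

format_counts = count_by_format(SAMPLE_LOG)
```

Splitting hits by format like this at least separates "a human looked at a page" from "something fetched the data", which is the distinction Google Analytics can't see.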

Sidebar: Should semantic tools honour robots.txt? Are they robots or browsers?

You can’t even use cookies, because most of these requests will come from naive robots which don’t store and return cookies between requests.

URI resolution patterns

Right now many open data sites use Tomcat (personally, I’m still on Apache 2 and very happy, thank-you-very-much). We usually use a standalone domain to resolve 303s for URIs. I prefer this as it is more obviously a distinct service. For example, compare our open data service with DBpedia. In my opinion, ours is a better model:

  • DBpedia
    • Identity: http://dbpedia.org/resource/Southampton
      (303 redirect to)

      • HTML Page: http://dbpedia.org/page/Southampton
        (or)
      • RDF Document: http://dbpedia.org/data/Southampton.rdf
  • data.southampton.ac.uk
    • Identity: http://id.southampton.ac.uk/building/59
      (303 redirect to)

      • Document URI: http://data.southampton.ac.uk/building/59
        (302 redirect to actual format)

        • HTML Page: http://data.southampton.ac.uk/building/59.html
          (or)
        • RDF Page: http://data.southampton.ac.uk/building/59.rdf

Admittedly we could skip the document URI for speed, but this keeps things very clean logically. And it’s easy to tell at a glance what’s a URL and what’s a URI.
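
The data.southampton.ac.uk chain above can be sketched in a few lines of Python. This is a toy model of the redirect logic, not our server configuration: the function names are made up, and choosing between .html and .rdf by looking at the Accept header is an assumption about how the 302 step picks a format.

```python
from urllib.parse import urlsplit

ID_HOST = "id.southampton.ac.uk"
DATA_HOST = "data.southampton.ac.uk"

def next_hop(url, accept="text/html"):
    """Return (status, location) for one step; (200, url) means we have arrived."""
    parts = urlsplit(url)
    if parts.netloc == ID_HOST:
        # Identity URI: 303 See Other, pointing at the document URI.
        return 303, f"http://{DATA_HOST}{parts.path}"
    last_segment = parts.path.rsplit("/", 1)[-1]
    if parts.netloc == DATA_HOST and "." not in last_segment:
        # Document URI: 302 to a concrete format.
        # Assumption: the format is picked from the Accept header.
        suffix = ".rdf" if "application/rdf+xml" in accept else ".html"
        return 302, url + suffix
    # Anything else is treated as a concrete file that is served directly.
    return 200, url

def resolve(url, accept="text/html"):
    """Follow the chain from an identity URI and return every hop."""
    hops = []
    status, location = next_hop(url, accept)
    hops.append((status, location))
    while status != 200:
        status, location = next_hop(location, accept)
        hops.append((status, location))
    return hops

rdf_hops = resolve("http://id.southampton.ac.uk/building/59", "application/rdf+xml")
html_hops = resolve("http://id.southampton.ac.uk/building/59")
```

Each entity lookup therefore leaves up to three log entries across two hosts, which matters when you come to count "hits".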

The problem is that I don’t know of a nice tool which could study the Apache logs from id.southampton.ac.uk and data.southampton.ac.uk and make some nice graphs. The interpretation would need to be very different from the assumptions you make about a human browsing web pages, and you don’t get handy things like “referrer” in the headers when a cheap semantic tool is poking your URIs.
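
To give a feel for the kind of interpretation such a tool would need, here is a hedged Python sketch that groups hits from both hosts by entity and notes how many arrived without a referrer. It assumes the log lines have already been parsed into dictionaries with "host", "path" and "referrer" keys; that record shape, the function name and the sample hits are all invented for illustration.

```python
from collections import defaultdict

ID_HOST = "id.southampton.ac.uk"

def summarise(hits):
    """Group hits by entity path, counting identity lookups, document fetches
    and requests that carried no referrer (a hint the client was not a browser)."""
    summary = defaultdict(lambda: {"identity": 0, "document": 0, "no_referrer": 0})
    for hit in hits:
        path = hit["path"]
        last_segment = path.rsplit("/", 1)[-1]
        # Strip a format suffix so /building/59.rdf and /building/59 group together.
        entity = path.rsplit(".", 1)[0] if "." in last_segment else path
        row = summary[entity]
        kind = "identity" if hit["host"] == ID_HOST else "document"
        row[kind] += 1
        if hit["referrer"] in ("", "-"):
            row["no_referrer"] += 1
    return dict(summary)

# Invented sample hits, for illustration only.
SAMPLE_HITS = [
    {"host": "id.southampton.ac.uk", "path": "/building/59", "referrer": "-"},
    {"host": "data.southampton.ac.uk", "path": "/building/59", "referrer": "-"},
    {"host": "data.southampton.ac.uk", "path": "/building/59.rdf", "referrer": "-"},
    {"host": "data.southampton.ac.uk", "path": "/building/59.html",
     "referrer": "http://www.southampton.ac.uk/"},
]

report = summarise(SAMPLE_HITS)
```

A missing referrer is only a weak signal, since browsers can omit it too, so a real tool would want to combine it with the user agent and the requested format before drawing any graphs.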

Quantity vs quality

The other problem for equipment.data.ac.uk is that not all hits, and not all searches, are created equal. Our goal is to improve the value the UK gets out of the money it has spent on research equipment, whether by enabling fuller use of items, avoiding buying unneeded kit (oh, turns out there’s already one in the building next to ours!), or encouraging new collaborations.

The problem is that these “wins” will come days, or months, after the visit to equipment.data. All the obvious methods for capturing feedback, such as collecting visitors’ contact details so we can email them follow-up surveys, will make the site less user friendly.

The thing is, if once a year, a single search saves the country buying a single £500K item of equipment, it’ll have been a resounding success and paid for itself many times over. However, I have no clue how to measure that so we can report it back to the funders.

Uncertainty

In marketing, you talk about “conversions”: how many of the people who visited the website actually went on to buy something. The problem is that we don’t have a way to know when we’ve had a success, and with people caching the open data we might not even know when and how people are using our data!

I guess the frustration I have is that to properly measure these services, we would in some way have to compromise something that makes them good.

Are there solutions? Can web science help? Should we compromise linked+open data principles in order to be able to better quantify our success?

 

Posted in Best Practice, Data, RDF, web management.


4 Responses


  1. Ash Smith says

    Perhaps we’re thinking backwards. In the pre-web days advertisers only had return clients as metrics; the clients were the ones who asked all their customers where they had heard of them. We’re effectively advertising equipment on behalf of the “client” (the equipment owner), and I doubt the people whose kit we’re advertising will be slow in letting us know if we’re not gaining them any benefits.

    Obviously the analogy isn’t perfect – nobody pays us to advertise their kit, and people are far less likely to actually complain about a free service – but maybe a good first step is to occasionally ask equipment owners if we’re helping them and if there’s anything else we can do. With any luck the question will get around, and we may even end up with some testimonials. As you said, we only need a small number of wins to make it worthwhile, we don’t need to ensure we track every case. It’s not quite an exact (web) science 🙂

  2. Daniel A. Smith says

    In a post-PRISM world where we know that Google Analytics IDs are used to surveil us ( http://www.washingtonpost.com/blogs/the-switch/wp/2013/12/10/nsa-uses-google-cookies-to-pinpoint-targets-for-hacking/ ), I wouldn’t advocate for adding MORE tracking to a publicly-funded system.

    But I do understand wanting that key “impact” slide for the mid- and end- project reports – how else can you prove value for money?

    I’d probably try to get some “user stories” of real decisions that might have been made/changed based on using this system – phone up procurement managers at big institutions maybe. Still hard to get, but can have more impact than some usage numbers. I think your story is already well written in the blog post.

    • Christopher Gutteridge says

      I think there’s probably some interesting work looking at how agents access RDF data, to better optimise services, and to understand what tools are being used on the data, and how.

  3. Markus Luczak Roesch says

    Very interesting and timely discussion about what I would call “purposeful usage mining”. Since 2011 we have been promoting research on this by organising the USEWOD workshop series (Usage Analysis and the Web of Data – http://data.semanticweb.org/usewod/). The call for papers for the 2014 edition of USEWOD was released very recently (https://research.cs.wisc.edu/dbworld/messages/2014-01/1389038016.html).

    I also wrote an entire PhD thesis on this topic. The work basically discusses how the analysis of SPARQL query logs can help data set providers to assess the quality of their data set and to plan maintenance activities that improve the local data set but also contribute to the evolutionary integration of the Web of Data as a whole. Feel free to get in touch for further information, or check other work that has been published on this topic such as [1, 2, 3] or any of the papers at the USEWOD workshop series.

    [1] – Knud Möller, Michael Hausenblas, Richard Cyganiak, Gunnar Aastrand Grimnes (2010). Learning from Linked Open Data Usage: Patterns & Metrics. In Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, Web Science Conference (WebSci), April 26-27, Raleigh, North Carolina, USA, http://richard.cyganiak.de/2008/papers/lod-usage-websci2010.pdf
    [2] – Markus Luczak-Rösch, Markus Bischoff (2011). Statistical Analysis of Web of Data Usage. In Joint Workshop on Knowledge Evolution and Ontology Dynamics, http://ceur-ws.org/Vol-784/evodyn1.pdf
    [3] – Khadija Elbedweihy, Suvodeep Mazumdar, Amparo E. Cano, Stuart N. Wrigley, Fabio Ciravegna. Identifying Information Needs by Modelling Collective Query Patterns, http://ceur-ws.org/Vol-782/ElbedweihyEtAl_COLD2011.pdf


