{"id":1039,"date":"2014-01-07T11:40:59","date_gmt":"2014-01-07T11:40:59","guid":{"rendered":"http:\/\/blog.soton.ac.uk\/webteam\/?p=1039"},"modified":"2014-01-07T11:40:59","modified_gmt":"2014-01-07T11:40:59","slug":"open-data-metrics","status":"publish","type":"post","link":"https:\/\/blog.soton.ac.uk\/webteam\/2014\/01\/07\/open-data-metrics\/","title":{"rendered":"Open Data Metrics"},"content":{"rendered":"<p>Happy new year.<\/p>\n<p>First of all, welcome to a new team member, Andrew Milsted. Andrew&#8217;s post is funded by HEFCE to work on developing equipment.data.ac.uk. He&#8217;s busy right now turning my 0.1 code into something a bit more suited for the long term, and as a side effect we hope to release some libraries which will be reusable in future projects.<\/p>\n<p>So right now our little team is:<\/p>\n<p>Christopher Gutteridge &amp; Patrick McSweeney &#8211; General Web Meddling for ECS, the Faculty of Physical and Applied Sciences, and in general.<br \/>\nAsh Smith &#8211; Linked Open Data Specialist for <a href=\"http:\/\/data.southampton.ac.uk\/\">http:\/\/data.southampton.ac.uk\/<\/a><br \/>\nAndrew Milsted &#8211; Open Data Specialist for <a href=\"http:\/\/equipment.data.ac.uk\/\">http:\/\/equipment.data.ac.uk\/<\/a><\/p>\n<p>We are all situated in the Level 4 labs in B32, and are part of the iSolutions &#8220;Technical Innovation &amp; Development Team&#8221;.<\/p>\n<h2>A big problem with open data&#8230;<\/h2>\n<p>&#8230;usage metrics.<\/p>\n<p>The problem of usage metrics came up with data.southampton.ac.uk, but is a far bigger concern for equipment.data.ac.uk as we need to be able to show an external funder that we&#8217;re having an impact.<\/p>\n<p>We conform to (some) good Linked Open Data principles: our URIs resolve (mostly) and our data can be downloaded as a single big file; no noodling with APIs is required.<\/p>\n<h3>Local caches of the data dump<\/h3>\n<p>However, this means that 3rd parties can snapshot our entire database and use 
it on things like their own intranet. In fact we do this ourselves; our university intranet, <a href=\"http:\/\/sussed.soton.ac.uk\/\">SUSSED<\/a>, has a search box which searches a mash-up of the open data list of equipment from equipment.data.ac.uk and some additional (non-open) data we get from some strategic partners. The catch is that we&#8217;ve got no way to know what&#8217;s been searched for there, or how often, so we&#8217;re losing all that juicy business intelligence.<\/p>\n<p>This is going to be a sticking point in future. Organisations may decide that their statistics gathering is more important than being fully open, and so may require all searches to go via an API to discourage 3rd parties from being able to search the whole dataset. That sucks, but I can see why a manager would make that decision.<\/p>\n<h3>Google Analytics can&#8217;t help you here<\/h3>\n<p>Most non-technical web managers seem to use Google Analytics as their go-to stats system. I can see the cookies on many .ac.uk sites I visit. The problem is that it doesn&#8217;t work if you leave the HTML world. Many or most of the hits on our data files will be by automated scripts downloading them, or by semantic tools resolving the data for an entity, such as <a href=\"http:\/\/equipment.data.ac.uk\/item\/378b5a86ab130959dd62a68b9b7110a1.ttl\">http:\/\/equipment.data.ac.uk\/item\/378b5a86ab130959dd62a68b9b7110a1.ttl<\/a> (nb, .ttl files don&#8217;t open in the browser).<\/p>\n<p>Those requests can only be tracked by the software running on the server. And they are, but it&#8217;s hard to know what to do with the resulting logs. It would be handy if someone could make a semantic-web-aware way to analyse an Apache log.<\/p>\n<div>Sidebar: Should semantic tools honour robots.txt? 
Are they robots or browsers?<\/div>\n<p>You can&#8217;t even use cookies, because most of these requests will come from naive robots which don&#8217;t store and return cookies between requests.<\/p>\n<h3>URI resolution patterns<\/h3>\n<p>Right now many open data sites use Tomcat (personally, I&#8217;m still on Apache 2 and very happy, thank-you-very-much). We usually use a standalone domain to resolve 303s for URIs. I prefer this as it is more obviously a distinct service. e.g. compare our open data service with DBpedia. In my opinion, ours is a better model.<\/p>\n<ul>\n<li>DBpedia\n<ul>\n<li>Identity: http:\/\/dbpedia.org\/resource\/Southampton<br \/>\n(303 redirect to)\n<ul>\n<li>HTML Page: http:\/\/dbpedia.org\/page\/Southampton<br \/>\n(or)<\/li>\n<li>RDF Document: http:\/\/dbpedia.org\/data\/Southampton.rdf<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>data.southampton.ac.uk\n<ul>\n<li>Identity: http:\/\/id.southampton.ac.uk\/building\/59<br \/>\n(303 redirect to)\n<ul>\n<li>Document URI: http:\/\/data.southampton.ac.uk\/building\/59<br \/>\n(302 redirect to actual format)\n<ul>\n<li>HTML Page: http:\/\/data.southampton.ac.uk\/building\/59.html<br \/>\n(or)<\/li>\n<li>RDF Page: http:\/\/data.southampton.ac.uk\/building\/59.rdf<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Admittedly we could skip the document URI for speed, but this keeps things very clean logically. And it&#8217;s easy to tell at a glance what&#8217;s a URL and what&#8217;s a URI.<\/p>\n<p>The problem is that I don&#8217;t know a nice tool which could study the Apache logs from id.southampton.ac.uk and data.southampton.ac.uk and make some nice graphs. 
Interpretation would need to be very different to the assumptions you make about a human browsing web pages, and you don&#8217;t get handy things like &#8220;referrer&#8221; in the headers when a cheap semantic tool is poking your URIs.<\/p>\n<h3>Quantity vs quality<\/h3>\n<p>The other problem for equipment.data.ac.uk is that not all hits, or all searches, are created equal. Our goal is to improve the value the UK gets out of money it has spent on research equipment, either by enabling fuller use of items, by avoiding buying unneeded kit (oh, turns out there&#8217;s already one in the building next to ours!), or by encouraging new collaborations.<\/p>\n<p>The problem is that these &#8220;wins&#8221; will come days, or months, after the visit to equipment.data. All the obvious methods for capturing feedback, such as collecting visitors&#8217; contact details to email them follow-up surveys, would make the site less user-friendly.<\/p>\n<p>The thing is, if once a year, a single search saves the country buying a single \u00a3500K item of equipment, it&#8217;ll have been a resounding success and paid for itself many times over. However, I have no clue how to measure that so we can report it back to the funders.<\/p>\n<h3>Uncertainty<\/h3>\n<p>In marketing, you talk about &#8220;conversions&#8221;; how many people who visited the website actually made a purchase. The problem is that we don&#8217;t have a way to know when we&#8217;ve had a success, and with people caching the open data we might not even know when and how people are using our data!<\/p>\n<p>I guess the frustration I have is that to properly measure these services, we would in some way have to compromise something that makes them good.<\/p>\n<p>Are there solutions? Can web science help? Should we compromise linked+open data principles in order to be able to better quantify our success?<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Happy new year. First of all, welcome to a new team member, Andrew Milsted. 
Andrew&#8217;s post is funded by HEFCE to work on developing equipment.data.ac.uk. He&#8217;s busy right now turning my 0.1 code into something a bit more suited for the long term, and as a side effect we hope to release some libraries which [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[198,352,136,80],"tags":[],"class_list":["post-1039","post","type-post","status-publish","format-standard","hentry","category-best-practice","category-data","category-rdf","category-web-management"],"_links":{"self":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/1039","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/comments?post=1039"}],"version-history":[{"count":1,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/1039\/revisions"}],"predecessor-version":[{"id":1040,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/posts\/1039\/revisions\/1040"}],"wp:attachment":[{"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/media?parent=1039"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/categories?post=1039"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.soton.ac.uk\/webteam\/wp-json\/wp\/v2\/tags?post=1039"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}