June 23, 2011
by Colin R Williams
Let me introduce Colin Williams, a postgraduate student who has been doing lots of interesting stuff to help the Open Data service. I’ve asked him to contribute to this blog. Over to you Colin…
Recently, I have been assisting the data.southampton.ac.uk team in gathering geographic data for their site. By geographic data, I am referring to the latitude and longitude of campus buildings and services.
This can be a simple point reference, as in:
<http://id.southampton.ac.uk/point-of-service/sculpture-hepworth-1968> geo:lat 50.935436
<http://id.southampton.ac.uk/point-of-service/sculpture-hepworth-1968> geo:long -1.398055
or it could be an outline of a building (or site) footprint, as in:
<http://id.southampton.ac.uk/building/32> dct:spatial “POLYGON((-1.3960952 50.9368069,-1.3958352 50.9368250,-1.3956962 50.9360329,-1.3959562 50.9360148,-1.3960952 50.9368069))”
One of the surprising discoveries made by the data.southampton.ac.uk team during their data gathering was the lack of any geographic data held by Estates and Facilities. So, I set out to gather this data… [Editors Note: Our Estates & Facilities service do have all the geo data they need, but it’s not very useful to the open data project as they just don’t need a reference lat/long point.]
First stop, Google Maps. Google allows users to create their own maps, by overlaying points and polygons on their maps (or their satellite imagery). Their tool is easy to use, using a web interface to add points (and polygons) to the map. This data can then be exported, as a .kml file, which we can easily convert to a form that can be imported into data.southampton.ac.uk.
This started off fine, until I started to think more about the licencing of the data. I had read in the past that, due to the copyright held by Google (or their mapping providers) over their map data, contributors to OpenStreetMap aren’t allowed to use Google’s data to determine the location of entities.
2. Restrictions on Use. Unless you have received prior written authorization from Google (or, as applicable, from the provider of particular Content), you must not:
(e) use the Products in a manner that gives you or any other person access to mass downloads or bulk feeds of any Content, including but not limited to numerical latitude or longitude coordinates, imagery, and visible map data;
So, that rules out the use of Google as a data source.
As its name suggests, OpenStreetMap is an open data street map, with its data being available under the CC BY-SA licence. OpenStreetMap is a great example of a collaborative, wiki-style geographic application. We could re-use their data, however, we wanted to generate authorative data, without making huge, possibly unnecessary changes to the OpenStreetMap data simply in order to achieve our goal. So, let’s look somewhere else. (I should probably contribute some of our building outlines back to OpenStreetMap when I find some time.)
The Ordnance Survey is Great Britain’s national mapping agency, which, in recent years, has released some open products. Confusingly, they seem to have two ‘Open’ products which could be relevant to our task.
Well, it seems that this the data used on OS OpenSpace is licensed under the ‘OS OpenData terms’, which ‘include’ the Open Government Licence.
However, the OpenSpace FAQs include this entry:
2.1 I am using OS OpenSpace to create a database of location based information. Does Ordnance Survey own this?
When you use OS OpenSpace to geocode data by adding locations or attributes to it that have been directly accessed from and/or made available by Ordnance Survey mapping data, then the resulting data is ‘derived data’, because it is derived from Ordnance Survey data.
Ordnance Survey would own such ‘derived data’, but we grant you a non-exclusive, personal licence to use it within your web application. Please refer to the definition of ‘Derived Data’ and Clause 5.4 of the OS OpenSpace Developer Agreement.
Well that’s not what we want. But, how about the data, that is under the Open Government Licence?
The OS OpenData site holds a variety of geographical datasets. For example, Code-Point Open is a dataset containing the latitude and longitude of 1.7 million postcodes, whilst OS VectorMap District is a vector based map of Great Britain. Unfortunately it’s not quite detailed enough to show individual buildings, which is what we’re really after.
So, the product we’re after is OS Street View (not to be confused by a similarly named, but completely different product offered by Google).
Can we use this data? The FAQ (which is in PDF format) has this to say:
11 Am I able to reuse “derived data” created from the OS OpenData products?
The licence allows the reuse of the derived data created from any OS OpenData products for commercial and non-commercial use. For more information on terms and conditions, read the OS OpenData Licence at www.ordnancesurvey.co.uk/opendata/licence.
OK, so we have found some mapping data that we are allowed to use. Is it in an easy-to-use form? Of course not, its in raster format. In other words, it’s a bitmap image (or rather, a series of images, each covering a 5km by 5km patch of Great Britain). How can we easily extract the information we need from these images?
Merkaartor describes itself as an OpenStreetMap editor for Unix, Windows and Mac OS X. It turns out that we can use it to export data rather than uploading that data to OpenStreetMap.
By default, Merkaartor has a number of data sources installed. In order to use the OS OpenData maps, we add http://os.openstreetmap.org/ as a data source, which uses the OS Street View data mentioned earlier.
All that remains to be done is to trace the shapes on the map and then export the data, as KML, which we then convert into a simple CSV file to be imported into data.southampton.ac.uk.
The data that has been generated as part of this process is available in the buildings and places dataset, and you can see it in use on the University’s open data map (which I have also been developing).
Thanks, Colin. I’ll just wrap this up by saying that University of Southampton Buildings & Estates will one day probably take over curation of this data, and they are aware of this work. They are happy to let us worry about it for the time being. This is fine with me as buildings don’t move much. Colin has done all of this for fun in his own time. I hope the other data.xxx.ac.uk projects are lucky enough to get some helpers like this. Be ready with a plan of how to let people help if they offer!
June 23, 2011
by Christopher Gutteridge
I’ve just come back from a two day meeting one making data catalogs interoperable.
There’s some interesting things going on and CKAN is by far the most established data catalog tool, but it frustrates me that they are basically re-implementing EPrints starting from a different goal. On the flip side, maybe parallel evolution is a natural pheonomena. None the less the CKAN community should pay more attention to the, now mature, repository community.
Data Catalogs have a few quirks. One is that they are really not aimed at the general public. Only a small number of people can actually work with data and this should inform descisions.
The meeting had a notable split in methodologies, but not an acrimonious one. In one camp we have the URIs+RDF approach (which is now my comfort zone) and on the other GUID plus JSON. The concensus was that JSON & RDF are both useful for different approaches. Expressing lossless RDF via JSON just removes the benefits people get from using JSON (easy to parse & understand at a glance).
A Dataset by any other Name
A key issue is that dataset and dataset catalogue are very loaded terms. We agreed, for the purposes of interoperability that a dataset record is something which describes a single set of data, not an aggregation. Each distribution of a dcat:dataset should give access to the whole of the dataset(ish). Specifically this means that a dataset (lower-case d) which is described as the sum of serveral datasets is a slightly different catalog record and may be described as a list of simple dcat:datasets.
Roughly speaking the model of a (abstract, interoperable) data-catalog is
- Dataset (simple kind)
- Distributions (Endpoints, Download URLS, indirect links to the pages you can get the data from or instructions of how to get the data)
- Dataset (simple kind)
We agreed that DCAT was pretty close to what was needed, but with a few tweaks. The CKAN guys come from the Open Knowledge Foundation so handling distributions of data which required other kinds of access such as password, license agreement or even “show up to a room with a filing cabinet” where outside their usual scope but will be important for research data catalogues.
We discussed ‘abuse’ of dcat:accessURL – it sometimes gets used very ambiguously when people don’t have better information. The plan is to add dcat:directURL which is the actual resource from which a serialisation or endpoint is available.
Services vs Apps: Services which give machine-friendly access to a dataset, such as SPARQL or an API we agreed were distributions of a dataset, but Applications giving humans access are not.
We agreed that, in addition to dc:identifier. dcat should support a globally unique ID (a string which can by a GUID or a URI or other) which can be used for de-duplication.
Provenance is any issue we skirted around but didn’t come up with a solid recommendation for. It’s important – we agreed that!
Very Simple Protocol
At one point we nearly reinvented OAI-PMH which would be rather pointless. The final draft of the day defines the method to provide interoperable data, and the information to pass but deliberately not the exact encoding as some people want Turtle and some JSON. It should be easy to map from Turtle to JSON but in a lossy way.
A nice design is that it takes a single URL with an optional parameter which the data-catalog can ignore. In other words, the degenerate case is you create the entire response as a catalog.ttl file and put it in a directory! The possible endpoint formats are initially .json, .ttl and (my ideal right now) maybe .ttl.gz
The request returns a description of the catalog and all records. It can be limited to catalog records changed since a date using ?from=DATE but obviously if you do that on a flat file you’ll still get the whole thing.
It can also optionally, for huge sites, include a continuation URL to get the next page of records.
The information returned is the URL to get the metadata for the catalog record (be it license,collection or dataset) in .ttl or .json depending on the endpoint format, last modified time for the catalog record (not the dataset contents), the globally unique ID (or IDs…) of the dataset it describes, and an indication if the record has been removed from the catalog — possibly the removal time.
Harvesters should obey directives from robots.txt
All in all I’m pleased where this is going. It means you can easily implement this with a fill-in-the-blanks approach for smaller catalogs. A validator will be essential, of course, but it’ll be much less painful to implement than OAI-PMH (but less versatile).
I learned some interesting stuff from John Erickson (from Jim Hendler’s lot). They are following very similar patterns to what I’ve been doing with Grinder (CSV –grinder–> XML –XSLT–> RDF/XML –> Triples )
One idea I’m going to nick is that they capture the event of downloading data from URLs as part of the provenance they store.
One Catalog to Rule them All
The final part of the discussion was about a catalog of all the world’s data catalogues. This is a tool aimed at a smaller group than even data catalogues, but it could be key in decision making and I suggested the people working on it have a look at ROAR: Repository of Open Access Archives which is a catalog of 2200+ repositories. It has been redesigned from the first attempt and captures useful information for understanding the community; like software used, activity of each repository (update frequency), counrty, purpose etc. Much the same will be useful to the data-cat-cat.
Something like http://data-ac-uk.ecs.soton.ac.uk/ (maybe becoming http://data.ac.uk at some point) would be one of the things which would feed this monster.
All in all a great trip, except for the flight back where pilot wasn’t sure if the landing flaps were working so we circled for about an hour and at one point he came out with a torch to have a look at the wings! All was fine and the poor ambulance drivers and firemen had a wasted trip to the airport. Still, better to have them there and not need them!
Jonathan Gray has transfered the notes from the meeting to a wiki.
June 14, 2011
by Christopher Gutteridge
While developing our site about Tsinghua University OpenData, we met some question about licence & copyright.Some data we got are crawled from public homepages of our university’s organizations and faculties. And we are not sure if it’s proper to release these data.In your project of Southampton Open Data, I noticed that most of the datasets are published under CreativeCommons, and I found Open Government Licence on your homepage.Do your have any data source that may have copyright issue while collecting data? How do you deal with that?
Thanks a lot! Look forward to your reply!
I’m going to be honest in the response as that will help people see where we are now. I am not a lawyer and can’t offer legal advice. We are doing our best to get it right, while not slowing down the progress we’re making.
We apply licenses per dataset. In someways that helps define the scope of a dataset, a dataset is a bunch of data with shared metadata.
Open Government License
In general, we use the UK Governments http://www.nationalarchives.gov.uk/doc/open-government-licence/ Open Government License (OGL), which really is a lovely bit of work. At first glance it’s very like the creative commons “cc-by” license, which is sometimes called “must attribute”.
However, it’s got some clever little restrictions, which make it easier for your management to feel comfortable releasing the data as they address some of the key concerns;
- ensure that you do not use the Information in a way that suggests any official status or that the Information Provider endorses you or your use of the Information;
- ensure that you do not mislead others or misrepresent the Information or its source;
- ensure that your use of the Information does not breach the Data Protection Act 1998 or the Privacy and Electronic Communications (EC Directive) Regulations 2003.
So, if a railway used this for timetables; if someone took a train timetable under this license and publish train times on a porn site, that’s OK. But if they deliberately gave out slightly incorrect times to make the trains look bad, that’s not OK. If they claim to be the train company to sell tickets, on commission, that’s not OK. The DPA bit doesn’t mean anything outside the UK, of course.
It gives people lots of freedom but restricts them doing the obvious malicious exploits that are not actually illegal.
Another license we use is a lack of a license. Maybe I should add a URI for the deliberate rather than accidental omission of the license?
I have to be very careful about slapping a license on things. Without permission of the data owner, I don’t do it.
A couple of examples of datasets which at the time of writing have no licence:
- EPrints.soton — people are still looking into the issues with this. The problem is that the database may at some point have imported some abstracts from service without an explicit license to republish. It’s a small issue, but we are trying to be squeaky clean because it would be very counter productive to have any embarrassing cock ups in the first year of the open data service. All the data’s been around via OAI-PMH for years, so it’s a low risk, but until I get the all clear from the data owner I won’t do anything. The OGL has the lovely restriction of not covering “third party rights the Information Provider is not authorised to license;” but we shouldn’t knowingly put out such data. My ideal result here is that the guidance from the government is that publishing academic bibliographic metadata is always OK, but I’ve not had that instruction, yet.
- Southampton Bus Routes & Stops — I’ve been told over the phone by the person running the system that he considers it public domain, but until I’ve got that in writing I’m not putting a license on it. Even if he says public domain, I’m inclined towards OGL as it prevents those kinds of malicious use I outlined earlier.
We may use this in a couple of places. It’s only win over OGL is that it’s more widely understood, but I think the extra restrictions of OGL are a good thing.
This is pretty much saying “public domain”. It’s giving an unlimited license on the data. We use this for the Electronics and Computer Science Open Data, which acted as a prototype for data.southampton (boy, we made some mistakes, read the webteam blog and this blog for more details).
We’ve never yet had anybody do anything upsetting with the ECS RDF, but I’m inclined to relicense future copies as OGL, as it adds the protection against malicious but non-illegal uses.
Out of interest, I challenge the readers to suggest in the comments harmful, or embarrassing, things they could do with the data.southampton data if it was placed in the public domain, rather than having an OGL license. It’s useful to get some ideas of what we need to protect ourselves against.
If there’s some evil ideas of what you could do under the restrictions of the OGL or no license, please send them to me privately, as I don’t want to actually get my project into disrepute, just get some ideas of what spammers, and people after lulz, might do. Better to think about what bolts the stable door needs well in advance.
3rd Party Data
I’ve got a lovely dataset I’ve added but not yet added metadata for, it maps the disibility information hosted by a group called “disabledgo” to the URI for buildings, sites and points of service. eg. http://www.disabledgo.com/en/access-guide/zepler-building/university-of-southampton is mapped to the URI for that building, and gets a neat little link in http://data.southampton.ac.uk/building/59.html
I created this dataset by hand by finding every URL and mapping it myself, so I have the right to place any license on it I choose. I also added in some data I screen scraped from their site (flags indicating disabled parking, good disabled toilets etc.). I checked with disabledgo and they asked me not to republish that data, so I can’t.
We pay them to conduct these surveys, and our contract does not specify the owner of the data. I’m hoping we might actually renegotiate next year to be allowed to republish the data, but it would be far better if *they* published under an open license and we just used their open data. Probably that’s still a few years off.
Either way, it’s a nice demo of the issues facing us. They are friendly and helpful, just don’t want anyone diluting the meaning of their icons. They give them a strict meaning.
Very little data in data.southampton is screen scraped. Exceptions are the trivia about buildings (year of construction, architect etc.) and some of the information about teaching locations, including their photos, and the site which lists experts who can talk to the press on various subjects.
I have a clear remit from the management of the university to publish under an open license anything which would be subject to a “Freedom of Information” (FOI) request. In the long run we can save a fair bit of hassle and money by pointing people at the public website.
The advantage I have over most other Open Data projects is that I’m operating under direct instructions from the heads of Finance, Communications, iSolutions (the silly name we give our I.T. team, which I’m part of) and the VC’s office. This means that I can reasonably work with anything owned by the organisation.
Another rule of thumb I was given is that if it’s already on the web as HTML or PDF then it might as well be data too! It’s not a strict rule, as obviously there’s some things which might not be appropriate, but I’ve not had much to screen scrape yet.
June 1, 2011
by Christopher Gutteridge
I realised that there wasn’t a big list of all the points of service in our database, so now there is.
This kind of information may be very useful to freshers!