# Open Data Internship: What is OpenGather?

Our tale begins a month and a half ago, in a lab at the University of Southampton….

I was at the beginning of my internship, and we had decided one of the key jobs was fleshing out our data. The Open Data Service previously gathered data using pen and whiteboard. The issue with these archaic tools is time. Not just the time spent gathering data, but processing it all by hand at the other end.

As such, I set out on a journey: a journey to create one tool that would facilitate the easy gathering of data. It’s still far from complete, but here’s what it does so far.

Overview
DISCLAIMER: None of this tool is considered “stable” at this point in time. Data formats can and will change; use at your own risk!

The aims of the tool are:

• To speed up the gathering and processing of data. More specifically, data gathered on-location.
• To enable and encourage non-technical people to contribute open data.

The result is a responsive website, written in PHP. PHP was chosen as it’s trivially integrated into the Open Data Service.

User Interface

The current interface is extremely simple, but it gets the job done. It shows a set of object types (schemas) people can submit. Changing the schema changes the form fields shown. These can then be filled in and submitted. Any required fields that weren’t filled in are highlighted in red.

The tool currently supports text fields, dropdown fields and geolocation fields. For geolocation fields, the initial latitude and longitude values are taken from the phone’s GPS. It’s possible to click on the map to select a more precise location. This is especially useful when recording location-sensitive objects such as doors.

The Database

A small subset of the tables.

The tool uses MySQL as its default backend. The details are configurable in config.php. There’s a single central table that records each data item entered. It stores an id, the time and the schema id.

Each schema has its own table of details. Each entry’s id acts as a foreign key, relating an entry to the details about it in the schema’s table.
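
The relationship can be sketched with a throwaway SQLite database (the real tool uses MySQL, and the table and column names here are illustrative assumptions, not the tool’s actual schema):

```python
import sqlite3

# Illustrative sketch of the two-level layout: one central entries table,
# plus one details table per schema, joined on the entry id.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE entries (
        id INTEGER PRIMARY KEY,
        submitted_at TEXT,
        schema_id TEXT
    );
    CREATE TABLE building_entrance (
        entry_id INTEGER REFERENCES entries(id),
        buildingId TEXT,
        description TEXT
    );
""")
db.execute("INSERT INTO entries VALUES (1, '2016-07-14T10:00:00', 'building_entrance')")
db.execute("INSERT INTO building_entrance VALUES (1, '32', 'Main entrance')")

# Joining recovers a full record: the central row plus its schema-specific details.
row = db.execute("""
    SELECT e.schema_id, d.buildingId, d.description
    FROM entries e JOIN building_entrance d ON d.entry_id = e.id
""").fetchone()
print(row)  # ('building_entrance', '32', 'Main entrance')
```
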

Exporting Data

Currently, the data is exportable as JSON. This format allows several schemas to exist together seamlessly. JSON is also human-editable, making it easy to correct long-term data. The tool makes the export publicly available at http://yourwebsite/path/to/tool/dumpjson.php. There’s no issue with making the data public, as the tool is designed to gather open data.
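
To illustrate why JSON suits this, here is a sketch of what a multi-schema export might look like (the actual dumpjson.php output format may differ; the keys below are my own, not the tool’s):

```python
import json

# Several schemas coexist in one document, one list of records per schema.
export = {
    "Building Entrance": [
        {"id": 1, "buildingId": "32", "description": "Main entrance"},
    ],
    "Drinking Water Source": [
        {"id": 2, "buildingId": "32", "floor": "1"},
    ],
}
dump = json.dumps(export, indent=2)

# Round-tripping cleanly is what makes the format practical for hand-editing
# long-term data: edit the text, parse it back, nothing is lost.
assert json.loads(dump) == export
```
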

The data is also available through the MySQL instance. This method of access isn’t recommended.

The Schema Generator

Personally, I think this is the coolest part of the system. It allows you to quickly specify a schema using PHP. This schema is then transformed into HTML for the web forms and SQL for the database. The web interface updates the schema list when the page loads. Dynamically loading these schemas allows submissions to go straight to the database.
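
To make the transform concrete, here is a rough Python sketch of the idea (the real generator is written in PHP, and the class and method names below are illustrative assumptions, not the tool’s API):

```python
# Each field knows how to render itself as an HTML form input and as an SQL
# column; a schema aggregates its fields into a form and a CREATE TABLE.
class TextField:
    def __init__(self, name, field_id, required=True):
        self.name, self.id, self.required = name, field_id, required

    def to_html(self):
        req = " required" if self.required else ""
        return f'<label>{self.name}<input name="{self.id}"{req}></label>'

    def to_sql(self):
        null = " NOT NULL" if self.required else ""
        return f"{self.id} TEXT{null}"

class ObjectSchema:
    def __init__(self, name, fields):
        self.name, self.fields = name, fields

    def form_html(self):
        return "\n".join(f.to_html() for f in self.fields)

    def create_table_sql(self):
        cols = ", ".join(f.to_sql() for f in self.fields)
        return f"CREATE TABLE {self.name.lower().replace(' ', '_')} ({cols});"

schema = ObjectSchema("Building Entrance", [TextField("Building Number", "buildingId")])
print(schema.create_table_sql())
# CREATE TABLE building_entrance (buildingId TEXT NOT NULL);
```
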

The upshot is that defining a new schema is incredibly easy. There’s no need to mess around with HTML or SQL. Just a few PHP objects get the job done!

The following is a sample schema I use to gather data from around the University:

//Format:
//    new ObjectSchema($name, $fieldsArray);
//    new TextField($name, $id, $required);
//    new DropdownField($name, $id, $optionsArray, $required=true);

$schemas = array(
    new ObjectSchema("Building Entrance", array(
        new TextField("Building Number", "buildingId", true),
        new TextField("Entrance Label", "entranceId", false),
        new TextField("Description", "description", true),
        new GeoField("Latitude", "lat", true),
        new GeoField("Longitude", "long", true),
        new DropdownField("Access Method Daytime", "accessDaytime", $accessOptions),
        new DropdownField("Access Method Evening", "accessEvening", $accessOptions),
        new DropdownField("Opening Method", "openingMethod", $openingOptions)
    )),
    new ObjectSchema("Drinking Water Source", array(
        new TextField("Building Number", "buildingId", true),
        new TextField("Floor", "floor", true),
        new GeoField("Latitude", "lat", true),
        new GeoField("Longitude", "long", true)
    ))
);

In conclusion….

The tool is still very much in the early stages of development. Feel free to use it, but be wary that things may break between versions! If you’re feeling particularly adventurous, merge requests are more than welcome…

Improvements currently on the roadmap include:

• An updated, friendlier user interface.
• Versioning for the schema, including a tool to move data between schema versions. This should help with the long-term preservation of data.
• Support for image uploading. One major weakness is the need to take images separately and link them later.
• A README to explain installation and usage.

Next week, I’ll be talking about taming QGIS, building tilesets and designing GeoJSON maps!

Posted in Uncategorized.

# Week 7 – Choices Hospitalised Me

Skills Demonstrated
– User Consultation
– Specification Acquisition
– Adding Functionality to Legacy Code

This was another successful week working on choices, just like the previous six weeks. Due to my manager’s spontaneous expedition to Iceland, Docpot was put on hold. I’ve been to Iceland before; it didn’t take me a whole week to get my groceries though… (Sorry.)

The medical department uses a convoluted method to allocate students to their courses. This requires a separate allocation algorithm due to the uniqueness of their demands. Last year they split the students into three cohorts depending on their background. Some students were able to pick a language in both slots; some could not, as one slot was pre-allocated. This needs changing now. It turns out that choices has finally forced me to make an appointment in the hospital.
Meetings are much more productive when all parties are in the same room, even if that room is in a dark corner of a large city hospital. The medical department was a pleasure to work with. It’s refreshing when an agreed solution doesn’t need a complete reworking of existing code.

Kev and I split the work between ourselves. The endgame was to ask questions about the language options so students were aware of the requirements. There were three difficulties of language: beginner, intermediate and advanced. Each required different qualifications. We needed to implement questions which are only asked when a student picks specific options. If the student does not pick the option, the answer should be set to a default false value.

Kev worked on making actions to add choosers to groups depending on their options. I made the changes to the add and edit question pages to allow an admin to add this functionality to their form. Using what I learnt last week, I refactored the questions controller so it was easier to work with. I cannot stress the usefulness of testing enough! We got a working demo ready by Friday. Kev was showcasing it to the medical department this morning. There are a few minor bugs to fix, but they seemed happy with our progress.

I am now over halfway through my internship. Choices is starting to look healthier now. I know I say it every week, but I am looking forward to when I can move on from choices. This week TIDT are doing WAISFest, a hackathon hosted by the university. It should give me a good enough reason to escape choices. We shall see…

Posted in Uncategorized.

# Weeks 5 & 6 – The big refactor

Skills Demonstrated
– Integration Testing
– Refactoring Legacy Code

So week 5 was the tail end of front-end fortnight. What is that, you ask? Well, I can tell you in layman’s terms, as I was the only one in the whole team who had no involvement with it. Our team had two weeks to figure out which front-end framework they would use for the foreseeable future.
That left me pretty much to my own devices for the duration. That meant choices for me – there is always something to fix in choices. The more astute of my readers will have noticed that this blog post covers two weeks. These few weeks have left my blogging ability rather diminished. But I digress.

It is now that I would like to link you to my manager’s recent blog post, “Zen and the Art of Legacy Camp Site Cleaning”. I paraphrase: “Bolting unit tests onto legacy code is pretty much impossible. Untested legacy code which has been chugging along without issue should not be rewritten. The trick is to leave the campsite cleaner than you found it”.

There were two incomprehensible behemoth functions in the options controller totalling about 400 lines. Much like choices, 70% of the universe consists of dark energy. No physicist can explain what dark energy is, or how it got there; only what it does. This is where integration tests come in handy. Unlike in unit testing, the controller communicates with fake databases called fixtures. This allowed me to give this pile of code a data set and then examine how it was digested at the other end. I assumed the add and edit functions worked before I touched them, and proceeded to make many integration tests using this method. It was important that I had good code coverage in the tests before starting the cleaning process. Good testing allowed for quick refactoring thanks to immediate feedback.

Since pushing choices to live, we’ve had even more requests from users. And so it goes on. A user found a bug where the ‘all pages’ button on the paginator was dysfunctional. The code for the all button was commented out; it appeared as though somebody couldn’t get it to work but then forgot to remove the button from the view. The fix was actually simple. They had tried setting the options-per-page limit to 1000 to make the all button work.
This would have worked, but there is a sneaky CakePHP setting called ‘maxLimit’ which is set to 100 by default, and this overrode the 1000 limit. There are a few more things which I fixed and implemented on choices this week, but this blog post is already getting rather long for my liking, and most likely yours, so this is it for now.

Posted in Uncategorized.

# Zen and the Art of Legacy Camp Site Cleaning

This post combines a few ideas from the following books:

Zen and the Art of Motorcycle Maintenance – Disregard the title; this book is about the philosophy of value and quality. I enjoyed reading the meandering tale of a man on a motorcycle tour thinking about what it means to do things well. You will need chapter 25: things which shake your quality focus during a maintenance task.

Working Effectively with Legacy Code – This is a practitioner’s guide to coping with a problem we have all created for ourselves at some point or other. For me, this period was about the first 10 years of my career. Legacy code is code without tests, almost always accompanied by no documentation. This code has merrily chugged along, doing what it does since its birth, but now you must add a new feature or fix a bug. We have all had a system with 9 bugs where you fix one and now you have 12 bugs.

Clean Code – I read this book at the tail end of last year and it completely changed the way I programmed. If you only read one book about programming, read this. It also gave me an excellent gauge for what Pirsig describes as classical quality. There are a lot of things I intuited were good but did not know why, and this book made those things explicit. The important message for this post is “the boy scout rule”: always leave the camp site cleaner than you found it.

Combining the lessons from these books has helped me take a beastly code base and make it elegant. Begin by getting your “Zen” right: this is not a smash and grab, you are going to turn this mess into a quality product.
The gumption (quality focus) traps to avoid include:

• Wishing you had started from scratch – Don’t spend your maintenance time wishing you had re-written the code from the ground up. The grass always looks greener in that empty document root, but trust me, it is not. If this was so easy to do from scratch, how did you make such a hash of it the first time? The problems of your current code are hard; start-over code will have hard problems too, but they are further away. You felt as positive when you started this code base as you do about starting over. You will feel just as bad again; the fact is good software is hard to write.

• Thinking you don’t need tests – You need tests. You would be a fool not to have a good click around your application after you make a proper change, but clicking around your software is slow and inconsistent. Even with a testing team you won’t click the full set of possibilities. Write tests for the existing code before you refactor and you can be sure you won’t introduce bugs.

• Fear of big changes – It is surprising how you stop worrying about making changes when you have tests. It is reassuring to know you aren’t breaking things.

• The big rush – This code base is hard to change; it makes everything slow, and writing tests is even more time consuming. Writing it from scratch would be slower still, and once you have written a few tests you will get the rhythm. As you test more, the speed at which you can make changes will improve until you can actually go quite fast. What’s the hurry, though? You are trying to build something quality, and you shouldn’t rush quality. Also, you have the perfect excuse for going slow: “sorry, this code is old and hard to work on”. You won’t have that excuse if you start from scratch.

The trick of working with legacy code is to determine suitable chunks to test. From my experience, adding a full set of unit tests to a legacy code base of any size is almost impossible, so do not try.
User interface tests like Selenium can be good, but it is hard to get good coverage with them. They are slow to write, slow to run, and they can be quite brittle. Have a light covering of these tests over a broad set of features to avoid embarrassing bungles.

The real name of the game here is integration tests. If your code wasn’t written with unit tests it is certain to be un-unit-testable; there will be tight coupling all over the shop. Use integration tests to test natural lumps of functionality. You are likely to need a database of test data. Your integration tests will soon end up taking tens of seconds to run, so your tool chain must let you run a subset of tests. This will let you iterate on the particular area of the code you are testing. Once you have a nasty lump of code under test, you can start refactoring the components and adding better tests as you go. Before long the lump of code will be a set of quite well encapsulated classes with unit tests. If you repeat this pattern for every lump of code in the project, you will have a quality code base.

The hardest test to write is the first. It’s psychologically hard because it is a change of mindset. It is a big learning curve because you will have to work out how to get the test framework into the existing project. It will be boring because you need to create a bunch of test data. It is a technical challenge because you have to work out what the existing code does in order to test it. In spite of all that, you have to do it. It is the only way you will ever achieve quality, and if you have read this far it’s because you want to deliver quality.

So now you have a code-cleaning approach, how do you choose what to clean? The boy scout rule! The whole code base is a mess, so start with the bit where you want to add your feature. To add the feature you will have to learn how the surrounding code works, and you can specify that knowledge in tests. You add some integration tests and then clean the camp site for a while.
Once you are happy that it is clean enough to work with, you can add your new feature and its associated unit tests. Now you have a clean place to put it, your new code can have proper unit tests.

It will take a long time to get your code base clean using this method. Adding new features will give you good test coverage and make the code easier to work with. You will have well defined methods with good naming and simple logic. You might choose to spend your Friday afternoon doing a bit of camp site cleaning before you clock off. Remember, there is no rush. There are always new features to add, so you will have plenty of opportunities to clean.

Remember, well written tests specify what the code should do. If another programmer wants to change your code, they should read your tests. If they change the code’s behaviour, a test will tell them the impact of that change.

Posted in Best Practice, Programming, testing, Tips.

# Open Data Internship: How to Gather Data – Mark II

This is a quick(ish) instructional post on how to gather open data in Southampton. This is written assuming you’re using the Open Gather tool. It covers the sort of data we’re looking for and how to gather it.

Objects we’re looking for:

• Drinking water dispensers (fountains, coolers, etc.)
• Gender neutral toilets (toilets a person of any gender could use, e.g. most disabled toilets)
• Portals into buildings and between them (e.g. doors)
• Public showers a cyclist could use
• Reception desks (and any other Points of Service)
• Images of University buildings (that we don’t already have)

It’s a bit like a game of eye-spy (or Pokemon Go). The aim is to hunt round, find each of the items above and record info about it. For some of them (portals, building images) we know the data we’re missing. For the others, we have no idea, so a little urban exploration may be required!

For all of these, we’re interested in where they are. This involves a building number, floor and geo-location for most of the above.
Open Gather has a clickable map to allow for precise geo-location. Otherwise, http://lemur.ecs.soton.ac.uk/~cjg/clickymap/ is available.

### How to Gather Data

Preparation

• If you’re hosting your own copy of OpenGather, make sure to clear any testing data out first!
• If you’re recording portals and building images, it can be helpful to plot the things you’re looking for on a map. If you’ve retrieved a list of items using SPARQL, you can use the following steps to plot the items on a map:
  • Run a SPARQL query to generate the list of missing items, complete with latitude and longitude (examples to come soon!)
  • Generate a KML/CSV/GeoJSON file from the data produced by the SPARQL endpoint.
  • Host the KML/CSV/GeoJSON file in a publicly accessible location. I prefer Git, but Google Drive or an online pasting tool like Pastey also works.
  • Using umap (http://umap.openstreetmap.fr/en/) as a mapping tool, add a layer, then either import the data directly or add it as a remote data source.
  • (When using umap, tick “Use Proxy” to ensure the icons load correctly.)
• Print off a copy of the map. umap doesn’t work very well in mobile browsers.
• Print off University of Southampton photograph consent forms. These are needed to use photos with people’s faces in them.
• Make sure your phone and camera have adequate amounts of battery (ideally full).
• If your camera and phone are separate devices, check the two clocks show the exact same time. This makes matching images and data far easier.

Overall Process

1. Pick a location on the map and decide which buildings to gather data from.
2. For each building, gather the data needed, using the instructions below.

General Method

This is the quick-and-easy summary of how to record data. More specific information is available below.

1. Open the OpenGather tool in your browser.
2. Select the type of object to record.
3. Fill in the appropriate fields.
4. Submit the data.
5. Take a picture of the object.

Taking a Building Image

1. Using the open data tool, select the category “Building Image”.
   • Fill in the “Building Number” field.
   • Wait for the GPS to update to the current location.
   • If the accuracy is low (say, less precise [higher] than 6m), click/touch the map to mark a more accurate position.
2. Take a picture of the building, attempting to get as much of the building in frame as possible.
   • A good photo will make the building easily identifiable as you walk past it.

The geo-location data isn’t necessary for buildings that are already marked on the map, but it helps automatically match images to names later on.

Gathering Portal Data

Walk around the building looking for entrances. Try to identify all entrances that aren’t fire escapes (which we aren’t permitted to gather as of 14/07/2016). For each entrance that you find:

1. Select the category “Building Entrance” in the OpenGather tool.
2. Record the geo-location of the entrance. This can be done by tapping on the map in the OpenGather tool.
3. Fill in the fields using the tool.
   • “Building Number” – Number of the building the entrance is attached to.
   • “Entrance Label” – An arbitrary letter to identify the entrance, typically starting from ‘A’.
   • “Description” – A brief description of the entrance, such as “Staff”, “Main”, “Side”, “North-east”.
   • “Access Method” – Is a card or key needed to get in?
   • “Opening Method” – How do you physically open the door? This is used to determine disabled accessibility.
4. Submit the data.
5. Take a picture identifying the entrance. A good photo will make the entrance easily identifiable as you walk past.
6. Follow the procedure for getting consent if any people are in your photo (an ideal photo has no people).

Recording a Drinking Water Source

Should one of these rare and majestic creatures be spotted:

1. Throw a greatball at it.
2. Select the category “Drinking Water Source” in the OpenGather tool.
3. Fill in the fields using the tool.
   • “Building Number” – Number of the building the water source is in.
   • “Floor” – The floor the water source is on; level 1 is usually the ground floor.
4. Record the geo-location using the map. Zoom in on the building you’re currently in, and try to mark your position in the building by clicking on the map.
5. Submit the data.
6. Take a picture of the water source. Ideally, this will make it clear where the water source is located in that part of the building.
7. Follow the procedure for getting consent if any people are in your photo (an ideal photo has no people).

Recording the Location of Public Showers

1. Select the category “Public Showers” in the OpenGather tool.
2. Fill in the fields using the tool.
   • “Building Number” – Number of the building the shower is in.
   • “Floor” – The floor the shower is on; level 1 is usually the ground floor.
   • “Room Number” – Room number of the shower, if it has one.
3. Record the geo-location using the map. Zoom in on the building you’re currently in, and try to mark your position in the building by clicking on the map.
4. Submit the data.

Reception Desk (Point of Service)

1. Select the category “Point of Service” in the OpenGather tool.
2. Fill in the fields using the tool.
   • “Description” – What the Point of Service is. For example, “Library Reception Desk” or “Student Services Information Desk”.
   • “Building Number” – Number of the building the point of service is in (assuming it isn’t a standalone service).
   • “Phone” – A phone number for contacting that point of service, if available.
   • “Email” – An email address for contacting that point of service, if available.
   • “Opening Hours: Mon… etc.” – Opening times for the point of service for that day, e.g. “9:00-18:00”.
3. Record the geo-location using the map. Zoom in on the building you’re currently in, and try to mark your position in the building by clicking on the map.
4. Submit the data.
5. Take a picture of the desk. It’s nice to have a friendly receptionist in the photo if possible, but don’t force anyone!
6. If anyone (including any member of staff) is in the picture, follow the procedure for gaining consent.

Requesting Consent

Attempt to get nobody in the shot, unless you’re taking pictures of a reception or Point of Service stand, where behind-the-counter staff can make it look friendlier. If people need to be in the shot:

1. Verbally ask permission before taking the picture, explaining that you represent the Open Data Service and what that is. Ensure they’re okay with signing a consent form.
2. Take the photo.
3. Ask them to fill in an entry on the consent form.

Cross buildings off as you go, to mark them as completed.

Posted in Uncategorized.

# Open Data Internship: Task Lists, Links and a rant on PHP

Week 5 already? It feels like the time has flown by. This is a particularly interesting week for me, as Chris is away story-ing. I have plenty to do, but it all needs doing under my own steam. That’s why this is a fantastic time to talk about time management. More specifically, how woeful I am at it and how to improve.

Thoughts on Time

To begin, I want to talk about task lists. An easy-to-use task list has single-handedly made the biggest difference to my time management. It’s done that by ensuring I never forget a task. I can always see the tasks I have to do, prioritise them and set deadlines. By prioritising them, I can always work on the most important thing.

The trick to using a task list effectively is to put everything on it. Every little job, no matter how small, needs to go on it. It needs to be an authoritative list of everything that needs doing. When the list contains everything, you can work purely off the contents of the list. You just pick the most important task you have time to work on and get to work.

My two biggest time management issues are tunnel vision and getting distracted. The former is where you get caught up working on one thing.
The mistake is not stepping back and periodically asking “What’s the right thing to be working on?”. As for getting distracted… it’s hard not to, in a lab filled with shiny objects and fantastic people. Hopefully, these will be solved in a later blog post.

Work for the Week

This week, I’ll be mainly working on creating a SPARQL URL checker. This will build a database of all the URLs a SPARQL endpoint knows about (URLs in this context meaning website addresses, rather than RDF URIs). It will then launch requests to each of those URLs, reporting the status of each. The aim is to identify any broken links that need repairing. Chris and I spent Friday planning the system, which should look something like this:

A high-level plan of SPARQL-Detective

The code will be available at https://www.github.com/Spoffy/SPARQL-Detective in the near future. I’ll also be working to implement some of the changes to OpenGather I mentioned in my previous blog post. The main focus is to implement a schema for each type of data to be gathered.

PHP – Or as I now call it, “Interpreted C”

As promised, a short rant. This internship has been my first time working with PHP. I used it for the OpenGather tool, and up until now it hasn’t been so bad. However, I have since been introduced to the joys of the cURL library. Here’s a short snippet to query a URL in cURL.

$curlHandle = curl_init($url);

//Force it to use GET requests
curl_setopt($curlHandle, CURLOPT_HTTPGET, true);
//Force a fresh connection for each request. Not sure if this is needed...
curl_setopt($curlHandle, CURLOPT_FRESH_CONNECT, true);

//Get headers in case we need Location or others.
curl_setopt($curlHandle, CURLOPT_HEADER, true);
curl_setopt($curlHandle, CURLOPT_FOLLOWLOCATION, true);

//Do we care about SSL certificates when checking a link is broken?
//...Possibly if there's SSL errors. V2.
curl_setopt($curlHandle, CURLOPT_SSL_VERIFYPEER, false);

//Don't actually care about the output...
ob_start();
$result = curl_exec($curlHandle);
ob_end_clean();

$link_status[$url] = curl_getinfo($curlHandle, CURLINFO_HTTP_CODE);
curl_close($curlHandle);


Oh, sorry, did I say short? I lied. This bit of code summarises my thoughts on PHP perfectly. The code is unnecessarily verbose. It’s missing high-level abstractions (you have to manually parse the returned string for header data). And it all somehow feels… clunky. Thankfully, at least memory management isn’t a problem… right?

For comparison, here’s the same snippet in Python.

from urllib.request import urlopen

link_status[url] = urlopen(url).getcode()


….The joys of cURL!
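
For the curious, the checking loop a link checker like SPARQL-Detective needs could be sketched in the same Python style (the function name and return shape below are my own, not the tool’s):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

# Fetch each URL and record its HTTP status; broken links surface as error
# codes (e.g. 404) and unreachable ones as None.
def check_links(urls):
    statuses = {}
    for url in urls:
        try:
            statuses[url] = urlopen(url, timeout=10).getcode()
        except HTTPError as err:
            statuses[url] = err.code   # server answered with an error status
        except URLError:
            statuses[url] = None       # unreachable host, bad scheme, etc.
    return statuses
```

A report of broken links then falls out of one dictionary comprehension over the result, e.g. keeping entries whose status is None or at least 400.
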

Posted in Uncategorized.

# Week 4 – Choices goes live once more!

Skills demonstrated
– Unit testing
– Manual testing
– Bash scripting
– User Consultation
– Crazy Golf

My hope to have a choices-free week was delusional. Choices had to go live last week, which meant – more testing.

These tests were different to my usual unit testing; these were full system tests. Usually a developer would use a program called Selenium to test a system as a whole. This program simulates keystrokes and mouse input; it can be creepy to witness, or so I hear. I wouldn’t know, though; there was no time to implement Selenium tests. Instead I spent Monday, Tuesday and part of Wednesday doing the tests by hand. Most likely, I only spent a few hours doing the tests; it took two-and-a-half days of my time to fix the endless torrent of errors. Who would’ve thought that an innocent action such as opening a form could blow up in my face? I accept that none of the allocation controllers worked first time; they are magical. But opening a form? That is just unfair. As is the way with choices.

Having managed to corral choices into its holding pen once more we celebrated with a round of TIDT golf. I didn’t want to show up my colleagues too much, so I kept it close and managed to pull away with a 4-point lead. We will return for the last 9 holes; the suspense is killing me.

Next on the agenda was Docpot and its migration. Docpot is a vestigial file-share which needs operating on. It’s only used by a handful of ECS staff, and it’s my job to merge the remnants into a newer, preexisting share. The implementation of this is simple enough; keeping the users of Docpot happy is the tricky part. By using some shell scripting magic, I was able to list all the files modified this year along with who modified them. Then more shell wizardry converted their names into emails to copy into Outlook. You can see where this is going. It turns out that only one directory needs preserving. More developments on Docpot in next week’s blog post.
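
As a rough sketch of the same idea (the actual work used shell one-liners; this Python version, its function name and the throwaway directory are my own):

```python
import tempfile
from datetime import datetime
from pathlib import Path

# List files modified this year, together with the uid of whoever owns them.
def modified_this_year(root):
    year_start = datetime(datetime.now().year, 1, 1).timestamp()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_mtime >= year_start:
            yield str(path), path.stat().st_uid

# Demo on a throwaway directory; on a real share, each uid would then be
# mapped to a username/email (e.g. via the pwd module) before mailing users.
share = tempfile.mkdtemp()
(Path(share) / "report.txt").write_text("modified just now")
recent = dict(modified_this_year(share))
print(recent)
```
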

I’m now a third of the way through my internship. Having made it this far, I’m sure that I’ve enjoyed every day so far, despite the steep difficulty curve. I curse more at my work monitor than I do at a game of Rocket League. But I’ve learnt more in these four weeks than in any other four weeks in a long time.

Posted in Apache, Data, Database, HTTP, Javascript, PHP.

# HESA Open Data Consultation: University of Southampton response

This response has been submitted to HESA in response to their consultation on open data.

This is a temporary location; the text will be moved to our consultation responses site in the next few days, and I’ll put a link here instead.

Question 1

1. Do you support HESA’s aim to make as much of our core data as possible available as open data?

In general we support the aim to make data as open as possible providing the data is drawn only from the University’s signed off HESA returns—i.e., the data has been through rigorous quality control.

Question 2

1. Do you agree with HESA’s assessment of its data sources regarding suitability to publish as open data (Annex A)?
2. If not please elaborate on any areas in which you disagree

Yes, we agree with HESA’s assessment of the suitability of the data sources to publish as open data. It must not be possible to identify individual students or staff members from information published. It is vital that there is consultation and collaboration with HE institutions on the content of the data subsets released.

Question 3

Do you feel that the list of open data resources to be published in Annex B is comprehensive, or do you feel there are any other types of open data publication HESA should be planning?

Yes, we feel the list published in Annex B is comprehensive.

For linked-data purposes it would be very useful to provide mapping tables (“linksets”) from ID schemes used by HESA data to other common ID schemes. http://learning-provider.data.ac.uk/ currently provides a number of these “linksets” which could be polished and expanded. These reduce the costs to data-consumers wanting to join HESA data with other datasets.

Question 4

1. Do you agree that it is important for HESA to publish meta-data as open data in addition to the data sets?
2. What benefits will this deliver for users?

Yes, we feel the meta-data should be published—without this, data users will struggle to understand and analyse the data.  We feel that the HESA data model meta-data should be released in the same time frame as the data sets—without this information it is likely that data users who are not familiar with HE may misinterpret the data and reach incorrect conclusions, generating unnecessary queries and work for HE institutions.

Question 5

1. Do you feel that HESA’s aims on ODI certification are pitched at an appropriate level of ambition?

A cautious ‘yes’. There is a risk of spending limited budget on less important aspects in order to achieve certification.

Question 6

1. Do you agree that Creative Commons Attribution 4.0 is the most appropriate open data licence for HESA to use?

Yes, but for some datasets it may be desirable to make them even more open—e.g. not to require attribution at all. This should be considered for any data which is likely to be combined with data from dozens of sources, or for the most simple (non-statistical) parts of datasets, where attribution could impede use in some cases—for example, simple lists of IDs, labels, or linksets.

Question 7

Do you have any advice for HESA in establishing communications channels to open data communities and users?

HESA have very strong ties with HE institution administration teams, but less so with academics, students and developers. Attempts should be made to promote the existence and value of the newly opened data directly to researchers and developers (who may be able to provide surprising new uses), to make it easy for such people to report back new applications to HESA, and always to be asking “why aren’t you using our data”. In our experience, it’s often hard for an open data producer to see where small amounts of investment could remove an impediment of which they are not aware.

Question 8

1. Do you think the list of proposed actions is appropriate and comprehensive?
2. If not, are there other elements which should be considered?

In addition, we would encourage HESA to consider reviewing http://learning-provider.data.ac.uk/ and http://opd.data.ac.uk/ — these are very low-cost sites that we have produced for the benefit of the UK HE community. It may be appropriate for HESA to lift some of these ideas or build upon the work.

Question 9

Do you have any other general or specific comments about HESA’s proposed approach to open data?

The area where we have the most concern is the availability of data without any supporting context to aid interpretation. Raw data can result in the wrong assumptions being drawn and comparability between an institution from one year to the next, and between institutions within the sector, can be difficult. For example, the introduction of new accounting standards (FRS 102) for the 2015/16 year onwards has introduced significantly more volatility in a university’s financial results, which makes comparisons between years and with other universities difficult without a supporting narrative that explains the key factors impacting on those figures.  This applies to all data sets. We think that HESA needs to ensure that the project addresses how this data can lead users to draw appropriate conclusions.

We are also concerned about how the data may be used commercially—at the moment consultants who market to us do not have access to our data; providing open access may generate a flood of firms writing to us to sell services and systems. A concern has also been raised that if data is going to be open it may impact the completion of the optional elements of returns; not completing these aspects of the returns may become a way to prevent the data becoming widely available.

We encourage HESA to make a clear plan for preservation and long-term access.

The University of Southampton has been very active in exploring the benefits of open data approaches for the UK HE sector. We have a comprehensive open data service covering many aspects of the university infrastructure: http://data.southampton.ac.uk. We founded data.ac.uk as a deliberately generic place to host data and linked data identifiers (URIs) so that they could be unchanged even if the hosting or sponsoring organisation changed, renamed or rebranded.

This policy could engage new and less-experienced software developers and other consumers but it shouldn’t be assumed that they have the same cultural background and training as those consuming HESA data historically. Clear guidance and easily noticed warnings will be required.

One unusual aspect for the Open Data community is that staff at universities may be subject to additional professional restrictions about how they can publish data from HESA, even if it has a CC-BY licence. This will need to be communicated clearly so nobody unwittingly breaks rules.

It is desirable to make datasets from multiple years compatible so that an investment in tools and services based on one year’s data gives value for several years. Changes to the data structure of datasets are inevitable but effort should be made to design dataset structures that can be extended in future years while still being compatible with tools developed to work with earlier years.

HESA is one of the pillars of data in UK HE. As such, HESA should work with the other data services in the sector to align identifiers as much as possible. This provides two important benefits. The first is the ability to join other datasets to the HESA data without expensive mapping exercises. The other is to provide an information infrastructure that other organisations can use for their own datasets.

All field definitions, terms and identity schemes used in the datasets and metadata should be available for other people to view in their entirety and reuse under a licence at least as open as, and compatible with, that of the dataset. Where possible international schemes should be used or mapping provided to reduce the costs of comparing and combining datasets from other providers both domestic and international. Where possible, and relevant, data terms and classes should use established data vocabularies such as the Organization Ontology  https://www.w3.org/TR/vocab-org/. A listing of vocabularies in popular use in open data can be found at http://prefix.cc/popular/all.

Much of the data HESA publishes needs to go through strict quality control. This data will probably only be published annually. There is other data which can be “self-certified” by an organisation, for example their current undergraduate admissions page or email address. This information may change out of sync with the annual publication process. We have successfully built a system to harvest such self-certified information and datasets (http://opd.data.ac.uk/) and would encourage HESA to consider this as a route for keeping up to date with self-certified data. This route is also lower cost as it doesn’t require a constant formal relationship with every data provider (although HESA will have one anyhow).

A simple but powerful example is this dataset: http://opd.data.ac.uk/dataset/linkingyou, which is built nightly from open data from 32 HE organisations. This data may be valuable to HESA and its users, but can be “self-certified”—unlike statistical information, which requires quality control before publication.

Information on the process for correcting issues and errors should be included in the metadata for a dataset.

We would be interested in working further with HESA on this project.

Posted in Open Data, Uncategorized.

# Open Data Internship: Open Data Pipelines

I’ve spent this week reorganising the folders in My Documents. “But Callum!” you might say, “Isn’t that a complete waste of a week?”. Perhaps for some. In reality, I’ve been working towards creating an Open Data Pipeline.

Open Data Pipeline is a term I just created, and I think it refers to something like this:

In this post, I’m going to outline the pipeline I’ve created so far and the lessons I’ve learned in making it.

The pipeline so far

So far, the pipeline has four key stages:

Gathering – This is gathering the raw data using a system. This could just be pen and paper. In my case, I’m using a homegrown tool called OpenGather. It’s a web application designed for gathering categorised open data. You can input data, record GPS locations and send the data for remote storage. The database can then be exported as a CSV file.

Storage – This is the long-term storage for data. Currently, I’m using an Excel (or Calc) workbook for text and numeric data. It has one sheet for data gathered by the tool and another for data I’ve had to enter by hand. A folder of images is kept for each category, for example “Buildings” or “Portals”. Longer term, the folder hierarchy and exact data format still need work.

Processing – This is taking the stored data and converting it into a format ready for publishing. To process the output from OpenGather, I use another homegrown tool. Yet unnamed, it reformats the data as a CSV file, marking any missing data for completion. Optionally, it attempts to use timestamps to match up data with the images taken using the camera.
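To make the processing stage concrete, here’s a minimal sketch of the missing-data marking step. The field names and the “MISSING” marker are my own assumptions for illustration, not the tool’s actual format:

```python
import csv
import io

# Fields assumed for every OpenGather entry (hypothetical names).
FIELDS = ["timestamp", "tag", "category", "latitude", "longitude"]

def mark_missing(rows):
    """Copy each row, replacing empty fields with a marker so they
    are easy to spot and fill in by hand later."""
    for row in rows:
        yield {f: (row.get(f) or "MISSING") for f in FIELDS}

def process(raw_csv_text):
    """Reformat raw CSV output, marking any gaps for completion."""
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(mark_missing(reader))
    return out.getvalue()
```

The real tool also matches entries to photos by timestamp, but the core idea is the same: a single deterministic pass from raw export to publishable CSV.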

Publishing – This is the act of making the data available to the public. To do this, I hand a USB stick with the data on it to Ash. Occasionally I have to copy CSV data to a Google Document. The rest of the Open Data Service takes over from here!

One of the Excel sheets used for long term storage

With that said, here are some of the things to do and avoid when building a pipeline:

Things to do

• Use an appropriate set of input fields for each category – I originally had the fields “Timestamp, Tag, Category, Latitude, Longitude, Accuracy” for everything. I found that for some objects, such as Images, I could throw away the geo data. For others, I needed extra data I later had to enter by hand.
• Be consistent in your data gathering process – For example, knowing that an image is always taken after the data is entered is extremely useful. It can be used to later infer information you’ve forgotten, lost or never had.
• Keep a backup of your data – You never know when your tools will delete or corrupt an important file. Excel did this to me more than once!
• Be thorough in gathering your data – Gathering too much data and throwing it away is far easier than needing it and not having it.
• Challenge your assumptions and provide for corner cases – I guarantee that making assumptions about the properties of a data type will come back to bite you.
• Get some data up – Even if it’s just one entry, it’s a great feeling getting it hosted for the world to see.
• Process the data as a single, large batch – Removing the need to repeatedly process different chunks of data will save time in the long run.
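The second point above—relying on the convention that a photo is always taken just after its entry is submitted—can be sketched as a simple timestamp match. This is an illustration of the idea, not the actual matching code; the tuple format is my own assumption:

```python
from datetime import datetime

def match_images(entries, photos):
    """Pair each photo with the most recent data entry before it.

    `entries` and `photos` are lists of (timestamp, name) tuples,
    where timestamps are datetime objects. Relies on the convention
    that a photo is always taken just *after* its entry is submitted.
    """
    entries = sorted(entries)
    photos = sorted(photos)
    matches = {}
    i = 0
    for taken, photo in photos:
        # Advance to the last entry submitted before this photo.
        while i + 1 < len(entries) and entries[i + 1][0] <= taken:
            i += 1
        if entries and entries[i][0] <= taken:
            matches[photo] = entries[i][1]
    return matches
```

A consistent gathering order is what makes this inference safe; if photos were sometimes taken first, every match would be suspect.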

Things to avoid

• Taking photos in many formats – Some cameras take both RAW and JPG. This makes storing the files and matching images to data entries that much harder. Use a single image format and convert it as you need it.
• Directly using geolocation data – For most types of object, I’ve found GPS accuracy to be too low (a 6m radius at best). I used a clickable map to get accurate data, using the GPS to roughly centre it. If nothing else, precise data adds a level of professionalism.
• Using Excel for CSV files – If you do, format it all as “text”. Otherwise, Excel is fond of re-formatting your data to be less accurate when you save it back to CSV.
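On that last point: a scripting language’s CSV support sidesteps the Excel problem entirely, because every cell stays a plain string. As a hedged sketch (the correction format here is invented for illustration), small fixes can be applied without a spreadsheet round trip:

```python
import csv
import io

def tidy_csv(text, fixes):
    """Apply corrections to CSV data without re-typing any values.

    Unlike a spreadsheet round trip, csv.reader/csv.writer leave
    every cell as a plain string, so IDs with leading zeros, dates
    and long coordinates survive untouched.
    `fixes` maps (row_index, column_index) -> new value.
    """
    rows = list(csv.reader(io.StringIO(text)))
    for (r, c), value in fixes.items():
        rows[r][c] = value
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

If Excel is unavoidable, formatting every column as “text” before saving is the next best defence.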

So where am I going from here?

My upcoming ideas for the OpenGather tool involve:

• Using different input fields for each data type. This should make processing the data more accurate.
• Providing the option to submit data to iSolutions via Serviceline. This should allow each contribution from students to be reviewed.

In the longer term, I’m looking to:

• Start work on an Open Data link validator. This tool will detect broken URIs and URLs, flagging them for correction.
• Start building maps of Union facilities ready for use during the bunfight.

Posted in Data, Geo, Open Source.

# Minecraft Archaeology day.

This weekend I helped out at the University of Southampton Archaeology department family day. There were lots of hands-on activities, such as reconstructing a horse skeleton and melting copper in a firepit. I was, of course, up in the (air conditioned) computer room playing Minecraft.

More accurately, I was running a row of computers with a variety of archaeology-related Minecraft maps.

All of the maps used for the event can be freely downloaded online:

### Dig Site

Not a very professional excavation.

By far the most engaging thing was something I created at the last minute. I used a nice model of a Roman villa and replaced all the “air” blocks with dirt up to a height just above the roof. I then made the top look like a normal Minecraft field with flowers and trees, but with a tell-tale ridge in the dirt showing the top of the building.

The player then gets a box of spades to dig out the site. It’s very naive, but in Minecraft creations are almost always sitting there on the surface; having to uncover one interactively was quite a novel idea, and it really engaged the younger visitors. It made my day when the first child to play it seemed to get a real thrill from discovering the edge of something just under the field.

### Portus

Portus

We also provided a fairly out-of-date model of the Roman port of Portus for the visitors to explore. This was much less engaging as there wasn’t enough interaction. It really needs a bit more to pull you into this model. Maybe a printed worksheet to put into context the interesting bits you are exploring. Something that makes it more interactive rather than just a thing to walk around.

### Contemporary LIDAR models

I’ve done a lot of work creating a tool to combine LIDAR and OpenStreetMap data into Minecraft maps. These are good at showing some of the techniques used in archaeology, but showing modern cities like London and Southampton wasn’t really in the spirit, so I created two more appropriate maps: Stonehenge and Avebury.

Avebury from open data sources.

The problem with Stonehenge is that it’s a pretty boring landscape for the most part, but the LIDAR does show some of the nearby earthworks quite well. The child who interacted the most was a girl who switched to creative mode and had fun “improving” the world heritage site.

The Avebury model is a bit more interesting but it only really clicked with the parents, not the children. I got a nice result by running the model through Chunky, the 3D renderer for Minecraft.

### Further work

Some possible ideas came to me as a result of this event.

1. Create a Minecraft mod that adds interesting archaeological features. There are already some that add “ruins”, but this one would be intended to leave things mostly buried, so that they could sometimes be found by accident when digging.
2. Make a Minecraft map that can be used by a school as part of teaching archaeology. The map would seem like a vanilla Minecraft world but would have interesting and reasonably accurate things to excavate. We could simulate “ground penetrating radar” by making a website where you submit your world coordinates and it gives you back a fuzzy picture. I think running this as an open server would be a mistake, as only one person gets each ‘discovery’. Running a server for a small group makes more sense.
3. Look into using TerraFirmaCraft in outreach. This is a mod with far more realistic crafting. Ores are found in veins, and small amounts can be collected from the ground or in streams. Tools are made by chipping stones, and there are fire pits and so forth. The problem is that it’s quite hard for younger children, and I suspect that anything that requires leaving “vanilla” Minecraft will put off most teachers, who don’t have the slack to learn to install mods for a one-off thing.
4. Write a tool which takes a Minecraft world and ages it:
   1. Remove plant life.
   2. Repeat a few times:
      1. Randomly collapse unsupported structures.
      2. Add a (random?) layer of dirt or sand, but make it “fall” off edges so that large things stick out of mounds and small things are buried inside mounds.
      3. Re-add grass and trees to the new top layer.
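The aging loop in idea 4 could be prototyped on a toy in-memory world before touching real Minecraft files. Everything here—the dict representation, block names and collapse probability—is my own simplification for the sketch:

```python
import random

# A toy voxel world: (x, y, z) -> block name. Assumed block names.
PLANTS = {"leaves", "flower", "tall_grass", "sapling"}

def column_tops(world):
    """Highest occupied y for each (x, z) column."""
    tops = {}
    for (x, y, z) in world:
        tops[(x, z)] = max(tops.get((x, z), -1), y)
    return tops

def age_world(world, passes=3, seed=0):
    """Crudely 'age' a voxel world, following the steps above."""
    rng = random.Random(seed)
    # Step 1: remove plant life.
    world = {p: b for p, b in world.items() if b not in PLANTS}
    for _ in range(passes):
        # Step 2.1: randomly drop unsupported blocks (nothing below).
        for (x, y, z) in sorted(world, key=lambda p: p[1]):
            if y > 0 and (x, y - 1, z) not in world and rng.random() < 0.5:
                block = world.pop((x, y, z))
                ny = y
                while ny > 0 and (x, ny - 1, z) not in world:
                    ny -= 1
                world[(x, ny, z)] = block
        # Step 2.2: bury everything under one more layer of dirt,
        # so large structures slowly become mounds.
        for (x, z), y in column_tops(world).items():
            world[(x, y + 1, z)] = "dirt"
    # Step 3: re-grass the new top layer.
    for (x, z), y in column_tops(world).items():
        if world[(x, y, z)] == "dirt":
            world[(x, y, z)] = "grass"
    return world
```

Reading and writing real region files would need a library on top of this, but the loop itself is the whole algorithm.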

### Overall

I really enjoyed my day as an honorary archaeologist. Compared to my experiences at computer science events, they were less well organised, but with significantly better taste in the beer provided for the wind-down afterwards. When it comes to pitching in and love of the subject, both archaeology and computer science are about equal, but if I ever have to choose who makes the camp fire, it’s going to be the archaeologist.

Posted in Minecraft, Outreach.