Open Data Internship: The Winchester Expedition

Friday saw the first real-world test of the data gathering tool. Surprisingly, it didn’t crash, erase any of the gathered data or otherwise combust spectacularly. I’m almost disappointed.

Instead, Chris and I spent a few hours wandering campus in the glorious sunshine. We took photos of buildings and doors, using the tool to log their locations. In theory, the timestamps produced by the tool will help match tags to photos. We’ll have to see about that one; check back next week!
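
For the curious, the timestamp idea could look something like the Python sketch below. It is only an illustration: it assumes each tag records when it was logged and each photo records when it was taken, and the function name, the tuple layout and the two-minute tolerance are all made up for the example rather than taken from the tool.

```python
from datetime import datetime, timedelta


def match_photos_to_tags(photos, tags, tolerance=timedelta(minutes=2)):
    """Pair each photo with the tag logged closest to it in time.

    photos: list of (filename, datetime) pairs (hypothetical layout).
    tags:   list of (tag_text, datetime) pairs (hypothetical layout).
    Photos with no tag within `tolerance` are mapped to None.
    """
    matches = {}
    for filename, taken_at in photos:
        best_tag, best_gap = None, None
        for tag_text, logged_at in tags:
            gap = abs(taken_at - logged_at)
            if gap <= tolerance and (best_gap is None or gap < best_gap):
                best_tag, best_gap = tag_text, gap
        matches[filename] = best_tag
    return matches


# Made-up sample data, purely for illustration.
photos = [
    ("IMG_0001.jpg", datetime(2016, 7, 15, 11, 0, 5)),
    ("IMG_0002.jpg", datetime(2016, 7, 15, 11, 4, 40)),
    ("IMG_0003.jpg", datetime(2016, 7, 15, 12, 30, 0)),
]
tags = [
    ("32 Main", datetime(2016, 7, 15, 11, 0, 30)),
    ("32 Rear", datetime(2016, 7, 15, 11, 5, 0)),
]
result = match_photos_to_tags(photos, tags)
```

A photo with no tag nearby in time is left unmatched rather than guessed at, so it can be checked by hand.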

Tomorrow marks the start of the serious data gathering. I’m mounting a one-man expedition to the wilds of Winchester, to pay a visit to the School of Arts. Whilst there, I’m hoping to discover:

  • The names they call their buildings. Supposedly, they’re different to the names on record. Mysterious!
  • The locations of any water fountains. A rare catch, these need documenting for further study.
  • Lecture rooms available, and which buildings they’re in. After all, nobody knows what’s really inside the Winchester School of Arts buildings.
  • Showers available for use by cyclists. There’s a surprising number of these hidden away in the ECS buildings in Highfield.

This is in addition to the usual building and doors data.

The plan is to start on one side of the campus and do a first pass through the building interiors. This is because my guide only has a limited amount of time. I need his valuable door access to delve deep into the bowels of WSA.

The outside of buildings is next, recording portal data and building images. I’ll follow this up with an excursion to Erasmus Park, the local halls.

Who knows, maybe I’ll grab an ice cream along the way too?



Warning: Technical description ahead.

This is the full process I’m using to gather data, as of this post:


  • Clear any testing data from the data gathering tool database.
  • Plot the locations of the items to be investigated on a map.
    • Generate a KML/CSV/GeoJSON file of the items.
    • Host the file in a publicly accessible location. I prefer Git, but Google Drive or an online paste tool like Pastey also works.
    • Using a mapping tool such as umap, add a layer, then either import the data from the remote location or, in umap, add it as a remote data source.
    • (When using umap, tick “Use Proxy” to ensure the icons load correctly.)
  • Screenshot and print off map
  • Print off open data consent forms
  • Make sure your phone and camera have adequate amounts of battery (ideally full).
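
As a rough sketch of the "generate a file of the items" step, the Python below turns a list of named points into GeoJSON text that a mapping tool can import. The item names and coordinates are placeholders, and `items_to_geojson` is a hypothetical helper, not part of the data gathering tool. The one detail worth remembering is that GeoJSON stores coordinates as longitude first, then latitude.

```python
import json


def items_to_geojson(items):
    """Build a GeoJSON FeatureCollection from (name, lat, lon) tuples."""
    features = []
    for name, lat, lon in items:
        features.append({
            "type": "Feature",
            # GeoJSON positions are [longitude, latitude], in that order.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name},
        })
    return {"type": "FeatureCollection", "features": features}


# Placeholder items; the coordinates are illustrative only.
items = [
    ("Building 32", 50.9366, -1.3958),
    ("Building 59", 50.9372, -1.3977),
]
geojson_text = json.dumps(items_to_geojson(items), indent=2)
```

The resulting text can be saved to a file and hosted wherever the mapping tool can reach it.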

Overall Process

  1. Pick a location on the map and decide which buildings to gather data from.
  2. For each building, gather the data needed, using the instructions below.

Taking a Building Image

  1. Take a picture of the building, attempting to get as much of the building in frame as possible.
    • A good photo will make the building easily identifiable as you walk past it.
  2. Using the open data tool, select the category “Image” and write a tag.
    • Wait for the GPS to update to the current location.
    • If the accuracy is poor (say, a reported error radius larger than 6m), click/touch the map to mark a more accurate position.

The geo-location data isn’t necessary for buildings that are already marked on the map, but it helps automatically match images to names later on.
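
To show how a logged position might later be matched to a building name, here is a small Python sketch that picks the nearest known building using the haversine (great-circle) distance. It is an assumption about how the matching could work, not a description of the real pipeline; the building coordinates and the 50m cut-off are invented for the example.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    radius_m = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def nearest_building(lat, lon, buildings, max_distance_m=50.0):
    """Return the ID of the closest building, or None if none is near enough.

    buildings: dict of building ID -> (lat, lon); hypothetical layout.
    max_distance_m: illustrative cut-off, not a value from the tool.
    """
    best_id, best_dist = None, None
    for building_id, (b_lat, b_lon) in buildings.items():
        dist = haversine_m(lat, lon, b_lat, b_lon)
        if best_dist is None or dist < best_dist:
            best_id, best_dist = building_id, dist
    if best_dist is not None and best_dist <= max_distance_m:
        return best_id
    return None


# Placeholder coordinates, for illustration only.
buildings = {"32": (50.9366, -1.3958), "59": (50.9372, -1.3977)}
```

The cut-off stops a photo taken far from any known building being pinned to the wrong one.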

Gathering Portal Data

  1. Walk around the building and try to identify all entrances that aren’t fire escapes (which we aren’t permitted to gather as of 14/07/2016).
  2. For each entrance, take a picture identifying it. A good photo will make the entrance easily identifiable as you walk past.
  3. Follow the procedure for getting consent, if any people are in your photo (an ideal photo has no people).
  4. Use the data gathering tool to mark the location of the entrance on the map. Try to get as close as possible to where you think the entrance is on the map.
  5. Select the “Portal” category in the tool.
  6. Add a tag, starting with the building ID, followed by the type of entrance. For example, “32 Main” or “32 Main North” or “32 Rear”.
  7. Submit the data.
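
The tag convention in step 6 is simple enough to check automatically. The Python sketch below splits a portal tag into its building ID and entrance type, assuming the building ID is everything before the first space. `parse_portal_tag` is a hypothetical helper written for this post, not something the tool provides.

```python
import re

# Assumption: the building ID contains no spaces, and everything after
# the first run of whitespace is the entrance type.
TAG_PATTERN = re.compile(r"^(?P<building>\S+)\s+(?P<entrance>.+)$")


def parse_portal_tag(tag):
    """Split a tag such as '32 Main North' into ('32', 'Main North')."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"tag {tag!r} is not '<building ID> <entrance type>'")
    return match.group("building"), match.group("entrance")
```

Rejecting malformed tags at submission time is cheaper than untangling them back at the office.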

Requesting Consent

Attempt to get nobody in the shot, unless you’re taking pictures of a reception or Point of Service stand, where behind-the-counter staff can make it look friendlier.

If people need to be in the shot:

  1. Verbally ask permission before taking the picture, explaining that you represent the Open Data Service, and what that is. Ensure they’re okay signing a consent form.
  2. Take the photo.
  3. Ask them to fill in an entry on the consent form.

Cross buildings off as you go, to mark them as completed.

Posted in Community, Data.


Joe’s Intern Blog – Week 2

So it’s Friday of week two and it is time for another blog post. Okay, it’s not actually Friday, it’s Monday morning. I spent Friday writing PHPUnit tests. For the sake of consistency, I will explain Friday after I’ve explained the other four days.

So on Monday I began by drinking some coffee. Then returned to the Kanban board for more tasks. Unfortunately for me, only difficult tasks remained. I completed most of the easy ones. After a second early morning coffee I was ready to commit to a task. In week 1 when working on the failed user feedback, I discovered the lookup system. The idea is that when you type in usernames in the list, you can get auto complete to speed things up. The previous implementation did work; but it was not an adequate solution. The issue is that the lookup functionality was not in a controller of its own.

Lookup was implemented in three separate source files, each bolted on where necessary. It was sloppy. I had to make a new controller which communicated with the alpha table. The alpha table is a nightly update of all the users in the university. It’s faster to communicate with this table as it is non-volatile, which makes it ideal for autocomplete. The three files became the functions of the new controller: ‘lookup.php’, which looked up users from an identifier; ‘lookup2.php’, which looked up users from a course code; and ‘lookup3.php’, which looked up courses from an identifier.
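
To give a feel for the shape of the refactor, here is a sketch in Python (the real system is PHP, so this is an illustration, not the actual code). It gathers the three lookups into one class that reads from an in-memory snapshot standing in for the alpha table. The class name, method names, data layout and the choice of prefix matching are all assumptions made for the example.

```python
class LookupController:
    """One home for the three lookups, backed by a read-only snapshot."""

    def __init__(self, users, courses, enrolments):
        self._users = dict(users)            # username -> display name
        self._courses = dict(courses)        # course code -> course title
        self._enrolments = dict(enrolments)  # course code -> list of usernames

    def users_by_identifier(self, prefix, limit=10):
        """Usernames starting with `prefix`, for autocomplete."""
        prefix = prefix.lower()
        hits = sorted(u for u in self._users if u.lower().startswith(prefix))
        return hits[:limit]

    def users_by_course(self, course_code):
        """Usernames enrolled on the given course."""
        return sorted(self._enrolments.get(course_code, []))

    def courses_by_identifier(self, prefix, limit=10):
        """Course codes starting with `prefix`, for autocomplete."""
        prefix = prefix.lower()
        hits = sorted(c for c in self._courses if c.lower().startswith(prefix))
        return hits[:limit]


# Fictional snapshot data, for illustration only.
lookup = LookupController(
    users={"ab1g14": "A. Baker", "ab2g15": "A. Brown", "cd3g13": "C. Dale"},
    courses={"COMP1001": "Programming I", "COMP2002": "Databases"},
    enrolments={"COMP1001": ["cd3g13", "ab1g14"]},
)
```

Because the snapshot only changes overnight, each lookup is a plain read with no locking or cache invalidation to worry about.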

After a lot of headache, and help, the lookup controller worked. The new controller needed testing, and this is where PHPUnit comes in. PHPUnit does what it says on the tin: it runs unit tests for PHP. This was where the real fun began. Getting PHPUnit to behave on my workstation was a real son of a glitch (see what I did there?). Bit of advice: make sure your version of the testing software is compatible with the tests already on the system.

Tuesday and Wednesday occurred along the same sort of vein. Coffee -> Confusion -> Coffee -> Lunch -> Testing -> Coffee -> Coffee -> More Testing -> Coffee -> Home.

This brings me onto Thursday. I started my first high priority task. By using a combination of browser history and luck, users could pass the selection page with an invalid number of choices. As users added to their choices, the table containing their choices updated on the fly. This meant an invalid number of choices could exist in the database. My initial thought was to make a temporary table to store choices as a user makes them, then validate the whole basket at once. When I consulted Pat, he recommended I change how stage validation worked instead. We pair programmed a solution so that when a user goes onto a new stage, the server validates all previous stages and redirects the user to any invalid stage. The state of the existing stages implementation shocked us, so we decided to add “fixing the mess” to the Kanban board. We implemented a temporary working solution using a refactored version of the existing code. Then we made a joint decision to conclude our evening at the pub.
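
The stage validation idea is easier to see in code. The Python sketch below is a simplified stand-in for what we did in PHP: before letting a user into a stage, it checks every earlier stage in order and sends them back to the first one that fails. The stage names, the required number of choices and both function names are invented for the example.

```python
def first_invalid_stage(stages, target_index):
    """Index of the first invalid stage before `target_index`, else None.

    stages: ordered list of (name, is_valid) pairs, where is_valid is a
    zero-argument callable returning True or False (hypothetical layout).
    """
    for index in range(target_index):
        _name, is_valid = stages[index]
        if not is_valid():
            return index
    return None


def resolve_stage(stages, target_index):
    """Stage the user should actually land on when asking for `target_index`."""
    invalid = first_invalid_stage(stages, target_index)
    return target_index if invalid is None else invalid


# Illustrative flow: the user must make exactly three choices.
choices = ["Project A", "Project B"]
stages = [
    ("details", lambda: True),
    ("selection", lambda: len(choices) == 3),
    ("confirm", lambda: True),
]
landed_before = resolve_stage(stages, 2)  # selection is invalid
choices.append("Project C")
landed_after = resolve_stage(stages, 2)   # selection is now valid
```

Because the check runs on the server for every request, it no longer matters how the user arrived at the page, whether by the buttons or by the browser history.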

I spent Friday working on testing the event actions. For example, they can send email notifications to users when they complete their choices. It started off easy enough but then choices threw a screwball my way when I was only expecting a curveball. The version of PHPUnit that choices uses cannot use the new function to mock static methods. I spent a lot of time trying to make a full unit test for email notifications. It turned out that a semi-integrated test was actually the way to go for now. My tests found a few logical errors in the email code. The errors would otherwise have gone unnoticed. There was a lot of functionality hidden in separate source files. The code was not written by me, which made it more challenging to test.

All in all, last week was quite a hectic week, hence me writing the blog on Monday morning. There’s not much work I can do at the minute as half of our team are not here currently. This week may be less frantic, though this may just be wishful thinking.

Posted in Apache, Database, HTTP, Javascript, PHP, Programming, testing.

Joe’s iSolutions Internship Blog

Having previously worked for the University I thought I had a general idea of what my first week would be like. However, this is iSolutions, things are a bit different here. They had me working on the choices system which is implemented with cakePHP.

A frustrating morning installing apache2 on a VM was immediately followed by a mystifying afternoon trying to sacrifice the correct sequence of farmyard animals required by the cakePHP gods. They returned the favour by granting me the power to add a small line of text underneath an image upload box stating that the maximum size of an image upload is 2MB. This triumph, although small, is still a worthwhile success. It basically meant that I could move a piece of paper from the backlog part of the Kanban board to the resolved part. Not bad for the first day – did I mention that I’ve never had any previous experience with PHP before this internship?


Choices Kanban board.

PHP is easy enough to pick up. I think that my previous experience with C/C++, HTML/CSS and JavaScript helped a lot, but nothing had prepared me for choices. Choices is the system for the university in which students are allocated supervisors and vice versa. It’s based on cakePHP, a framework which ““makes building both small and complex systems simpler, easier and, of course, tastier”” – official cakePHP homepage (I used double quotes: one set for the quote itself, and one for the sarcastic delivery. Let me know if there is a special piece of punctuation for that exact purpose).

I soon moved onto the next piece of paper. This was more complex than the first. A problem was brought to our attention: a user could enter a long list of choosers to the list. A flash message would appear stating that ‘x’ choosers were added but with no feedback about which choosers were not added. I implemented a method to retrieve the error associated and list the un-added users. This forced me to delve deeper into the inner workings of cakePHP. With each new piece of paper passed my way I slowly (and painfully) understood cakePHP slightly more.
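
The gist of the fix, sketched in Python rather than the cakePHP it was actually written in: keep a record of every chooser that could not be added, along with the reason, and include that in the flash message. The function names, the two rejection reasons and the message wording are all invented for this illustration.

```python
def add_choosers(usernames, known_users, existing):
    """Try to add each username; return (added, rejected).

    rejected maps each username that was not added to a short reason.
    `existing` is the current set of choosers and is updated in place.
    """
    added, rejected = [], {}
    for username in usernames:
        if username not in known_users:
            rejected[username] = "unknown user"
        elif username in existing:
            rejected[username] = "already a chooser"
        else:
            existing.add(username)
            added.append(username)
    return added, rejected


def flash_message(added, rejected):
    """Summarise the outcome, naming every user that was not added."""
    message = f"{len(added)} chooser(s) added."
    if rejected:
        details = ", ".join(f"{user} ({why})" for user, why in rejected.items())
        message += f" Not added: {details}."
    return message


# Fictional usernames, for illustration only.
current = {"cd2"}
added, rejected = add_choosers(
    ["ab1", "zz9", "cd2"], known_users={"ab1", "cd2"}, existing=current
)
message = flash_message(added, rejected)
```

With the reasons attached, whoever pasted in the long list can fix the bad entries without guessing which ones they were.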

I have learnt a lot in this first week of my internship and I truly enjoy being part of the iSolutions team. Despite the headache of cakePHP, the work is engaging. The satisfaction of moving a piece of paper from one side of the board to the other is akin to completing a challenging boss fight in a video game. I look forward to when I can start my own project and also the obligatory round of mini-golf with the iSolutions crew.

Posted in Apache, Database, HTTP, Javascript, PHP, Programming.

Open Data Internship: The First Week

Last Friday marked the end of the first week of my Open Data Internship. It’s the first time the University of Southampton has had an open data intern. Or, to my knowledge, any University for that matter. This puts me in the interesting position of figuring out what it is an open data intern does.

A Little Backstory

I’m Callum, a graduate going-on PhD at the University of Southampton. I started as a developer about 5 years ago, around 2011, scripting for a game called Garrysmod. I’d never touched a programming language before. For the most part, I learned by studying the works of other people.

I didn’t realise it then, but that was my first introduction to open source. Since then, I’ve come to understand how important open software is to driving innovation. It aids learning and provides a platform for other projects to grow on. All that said, the most important thing it does is bring the community together. It brings together total strangers to work towards a common goal. I think that’s vital in an increasingly insular society. In many ways, I believe open data can do the same thing.

With all that covered, what has my first week been like? Well, the open data service here has been running for around 6 years. Over that time, it’s amassed quite a large amount of data. I feel like a new-age explorer, navigating twisting jungles of information. I’ve come across some spectacular datasets I didn’t know existed. My favorite moment so far was discovering building 185. In the middle of the Indian Ocean. In other words, it’s pretty fun.

What have I been up to?

The team has had quite a few ideas on the back-burner for a while, generally things that they haven’t had time to do. These ranged from straightforward data gathering to projects in their own right.

One of the more straightforward tasks was to fill in some of the bigger holes in the data about the University. One of which was in the building data.

The first phase was to identify which buildings were missing data. The tool to do this, without a shadow of a doubt, was the University’s SPARQL endpoint. SPARQL is a language that allows querying of the open data service, much like how databases use SQL. In fact, the syntax is similar. I spent the first part of my week learning SPARQL and creating queries to hunt for the missing data.

Next up, I imported the locations of the buildings missing data into a mapping tool. At that point, I realised the magnitude of the task.

A map showing the spread of buildings missing portal data. Several pins are in Malaysia, while many are in Southampton.

Unfortunately, the department was unable to sanction a photo-gathering trip to Malaysia. The sad result being I booked my holiday in Cornwall, instead.

The difficulty of the task assessed, I developed a cunning plan to gather the data. Naturally, being a Computer Scientist, I dislike the notion of writing on paper. Even more so, I dislike the idea of then copying that data to a database by hand. I thus proposed creating a tool that would allow quick gathering and labelling of data.

This aligned well with the aims of the team. They had been looking for the time to create a tool to crowdsource data from students on campus.

Thus, I set out on a quest to kill two rats with one high-precision stone.

Update: The SPARQL Code

PREFIX soton: <>
PREFIX foaf: <>
PREFIX geo: <>
PREFIX rdfs: <>

SELECT ?building ?label ?lat ?long WHERE {
    ?building a soton:UoSBuilding ;
              rdfs:label ?label ;
              geo:lat ?lat ;
              geo:long ?long .
    # Try to find an image depicting the building...
    OPTIONAL {
        ?image a foaf:Image ;
               foaf:depicts ?building .
    }
    # ...and keep only the buildings that have none.
    FILTER (!BOUND(?image))
}


Posted in Data, Open Source, Team.

3D Model Escrow

This is a quick note to record an idea I had in response to a conversation over lunch at “Connected Data London”.

The question was around how to enable 3D printing for the construction industry, or more specifically making 3D replacement parts, but this idea works for a lot more use cases.

There are a couple of problems here. First of all, taking a 3D scan of something and using it to make a copy is probably copyright infringement (I am not a lawyer). It also may infringe patents. The other problem is that for more complex shapes, there is no incentive for the copyright owner to help you — no business will help you make a cheaper copy of their own product, with the possible exception of celebrity chefs.

So here’s a possible future where everybody wins.

One or more trusted organisations set up a data escrow service. This service will hold the 3D print information required to make a copy of the part, with a set of conditions under which it will be made public. It can also contain other design documents, manufacturing instructions etc.

A manufacturing company pays a small fee to the escrow service to hold the data for each product and to host the data on the web in the event that certain conditions are met. My first thought is that the data should become public if the part is no longer available on the market and isn’t going to be back on the market in short order. This rule applies even if the rights to the part are sold on or the company is purchased etc., because the copyright holder has given the escrow service the data under a license which allows them to share it under a public-domain-like license if the product is no longer available.

This shouldn’t cost very much and gives the first mover a unique selling point. It has the advantage that the company can guarantee that replacement parts will be possible either from them or from anybody given the specifications should they go bust or stop that product line.

As more companies offer this service, it can be added as a preference or even a requirement on deciding suppliers. While the initial question was about the construction industry it could work for any manufacturing industry.

I like this idea because everybody wins — it doesn’t require anybody to work against their own interests.

Never appeal to a man’s “better nature.” He may not have one. Invoking his “self-interest” gives you more leverage.

– Robert Heinlein, “The Notebooks of Lazarus Long”

Posted in Repositories.

Institutional Web Management Workshop: Overall thoughts (IWMW2016)

This month we went to IWMW16 in Liverpool at Liverpool John Moores University. We took an unusually large group, including:

  • 2 people from the web & data innovation & development team
  • 1 person from the central IT web development
  • 2 people from the comms and marketing web team

We’ve a lot of notes and we may write some more detailed write ups, but here’s some initial thoughts.


John Moores is a much smaller university than I’m used to, but seemed welcoming and easy to find. Eduroam coverage was good but dropped out regularly. The audio mics in the large lecture theatre seemed a bit hit and miss, but this may be because we didn’t pay for premium AV support from the venue?

I also was a bit frustrated that the machine used for the plenary talks didn’t do video playback and that reduced the impact at a couple of points, but otherwise things went smoothly.

Liverpool was very warm and cheerful. Would visit again.

A large group

Normally the University of Southampton sends only 1 or 2 people to this event. I (Chris Gutteridge) have been several times and given talks and workshops in the past. The other staff were all coming for the first time and it was really great to get to know the web-staff from other parts of the university. The only regret is that we all sat together at the dinner and I think we’d have been better off deciding to spread ourselves out more, but it’s a small thing.

What’s new? What’s not?

Since I last went to IWMW I can see a few changes to what’s being talked about.

Agile methodology for university web teams is growing in popularity with teams working in two week “sprints”. Other teams are still concerned that all their work is still around big-bang projects and that continuous improvement isn’t an option.

There is still a big cultural gap between IT, comms/marketing and academics. As someone who’s worked closely with all 3 I find this unsurprising. I think as a community we could do some work in explaining each culture to the other two. There’s a lot of disrespect and frustration that comes from differing training. For example, many comms and IT staff don’t know much about the peer review process or academic writing. Many academics don’t know much about formal IT processes and data protection. I think there’s some real opportunities for the community in finding good ways to bridge this divide.

What seems to be off the menu is responsive design. I think that is now moving towards business as usual and the arguments for it and best practice are well understood.

What was also a little painful is that ‘continuous transition’ is more a punchline than a fear at this point. Many of the teams have been restructured again and again.

University websites and Computer Science

I work in what is currently titled something like the “University of Southampton Web & Data Innovation & Development Team”. The idea of having both a formal central team and a more lightweight and experimental team seems to be very unusual. We are either an expensive white elephant or a unique selling point for Southampton. I think it’s important to see how our team can learn to complement the central teams rather than duplicate and irritate them. I’d really like to keep working on projects small enough to fail: trying things out and learning from them to feed into the central team and general best practice. However, with the disappearance of Jisc funding for innovation in information infrastructure, I think our team is increasingly rare in the UK.

Hopefully we’ll have one or two more posts about IWMW to follow. Also, why do we never host it in Southampton?

Posted in Uncategorized, web management.


As a team manager, one of my key responsibilities is recruitment. It’s also one of the most rewarding. I get to meet some really smart people and hopefully find one who’s a great fit for our team. I’ve sat on a lot of interview panels over the years, for lots of different roles, and I’ve gained a fair amount of experience along the way.

For my team’s most recent round of recruitment, we went the extra mile and I invested a lot of time and effort into learning how to do a recruitment exercise well. This highlighted some glaring errors in our previous endeavours and resulted in me re-writing the entire suite of recruitment documents we use. For reference, the resources I referred to are listed at the end of this post in the Further Reading section.

As such, I thought I would share some of my experiences and insights into the recruitment process, highlighting the priorities and pitfalls as I see them. This post focuses primarily on the perspectives of a recruiter, but should be of value to people on either side of the table.

Job advertisements

This is where the recruitment process begins – in the production and publishing of job adverts. It’s essential that you make a good job of these. They represent your ‘first impression’ that you’re giving to prospective candidates; if you want the best applicants, your literature needs to make it clear why you’re worth working for. It’s a common misconception that there’s a glut of talent out there that needs to be filtered. Unless you’re a mighty tech giant like Amazon, Facebook, or Google, you need to be doing everything in your power to attract good candidates.

Job advertisements tend to be the place where you have the most freedom. Job descriptions and person specifications often need to fulfil HR requirements, so are often a little more structured and less flexible. Still, don’t underestimate the value of having a clear and well-written job description.

The most crucial piece of advert-writing advice I can offer is this: make sure your advert focuses primarily on what the job is and why it will challenge the candidate. Minimise descriptions of what the “ideal candidate” should look like – that’s what the person specification is for. Besides, person specification is very much about filtering, and that’s not what you should be doing. You’ll likely want to say a little bit about the sort of person you’re looking for, but it’s not the most critical part – save it for the person specification.

Instead, spend your time and effort talking about what challenges and opportunities are on offer in the role. Talk about the team and how it fits into the bigger organisation; talk about what facilities and amenities are available nearby. All of these things incentivise people to consider your vacancy seriously.

My final piece of advice for job adverts is: always include a section about equal opportunities and diversity. Omitting this will put many candidates off before they even apply, and you definitely don’t want to be doing this. You should be encouraging applications from as far and wide as possible.

Job descriptions

Job descriptions are usually mandated by your HR department, and likely have to fit a standard template. It’s easy to dismiss the JD as a mere formality and simply throw something together. In my time, I’ve seen countless job descriptions cribbed from previous examples, with constantly growing requirements that render them ever more unrealistic.

In an ideal world, you’d keep your job descriptions updated and review them at least once or twice a year, to ensure that they accurately reflect the activities of your team. As time goes by, roles inevitably evolve, and the job description is meant to be a living document that reflects those changes.

In reality, JDs far too often end up gathering dust, only to see the light of day when it’s time to recruit. Even if you don’t review your JDs regularly, at least take the opportunity when you’re recruiting to examine them and ensure they accurately reflect the roles your team are expected to fulfil.

Job descriptions are invariably comprised of a job purpose and key accountabilities. Depending on your organisation, they may also include a person specification (I’ll discuss those separately in the next section). These are all quite distinct aspects, and it’s important that you consider that when deciding what content to put where.

The job purpose should explain what the job is for: like the advert before it, this is a golden opportunity to sell the role. It should be inspiring and convince potential candidates that it’s something they could really make their own. A dull job purpose will just turn candidates off. The key accountabilities define the primary functions required to fulfil the job purpose. Remember, these are key accountabilities, so you shouldn’t go into excruciating detail – four or five areas are probably about right.

If it’s been a while since you looked at your job descriptions, it’s almost certainly worth starting afresh from the blank template instead of modifying a long-outdated one. Put some thought into what role you need to fill – it’s rarely the same as it was the last time you hired. What will they be doing? If they’re joining a larger, fairly homogenous team, what does that team currently do? Once you know roughly what the job description should look like, then it’s time to review the old JD. If parts are still relevant, by all means crib them – there’s no need to re-invent the wheel.

Person Specifications

If job descriptions have a reputation for being merely copied and pasted, person specifications are even worse. It’s tempting to put every little thing into one, but it’s essential that you keep in mind that you’re looking for real people, not some mythical superhero who can do everything. I can’t state this enough – talent has to be attracted! Don’t build your recruitment process around filtering. I recently undertook an exercise with my team where I printed off the person specification and asked them to mark each requirement as either fully-met, partially-met or unmet. The results were enlightening, and some requirements were clearly unnecessary.

It’s likely that you’ll have sections for essential and desirable criteria. For a person specification to provide maximum value, the essential criteria should be restrained. Remember that every essential criterion is another reason for a potential candidate not to apply.

If a candidate meets all the essential criteria, but few to none of the desirables, that should indicate that they’re capable of doing the job after your standard suite of training. Indeed, I recommend basing your desirable criteria on your training objectives. Don’t be afraid to make your desirables section larger than your essentials one, but don’t go overboard – make sure that any skills you’re asking for will bring clear benefits.

Finally, a word about experience: it’s commonplace to see jobs advertised with time-based requirements – at least 5 years of X, at least 18 months of Y – and these are often very poor metrics for measuring skills. Would you rather have someone with ten years of bad habits, or someone with the aptitude to become proficient quickly and learn good habits from the start? Talented people with long experience in a post are probably looking for a more senior role anyway, not the same job somewhere else. In fact, many of the good candidates you can hire will be people from less-senior roles looking for something more challenging – they won’t have the experience, but might well be perfectly suitable.


In summary, there are two key points that I consider to be most important:
1. You want to encourage applicants, not filter them out. Keep your requirements realistic and remember that just because a candidate hasn’t done a role before, it doesn’t imply they aren’t capable.
2. Your literature is an advert. It should be engaging and exciting, not a dry collection of buzzwords. Don’t mindlessly recycle old literature – candidates are always told to tailor their applications and CVs, and we should show them the same courtesy.

That about wraps up my first post on recruitment – hopefully you found it worthwhile. Of course, there’s more to it than the literature, so expect to see future posts covering some of the other aspects.

Further reading

If you found this post insightful and are interested in more detail, I recommend watching Lou Adler’s videos on “Performance-Based Hiring” available on (membership required)

I also strongly encourage you to look at the writings of Liz Ryan, founder of Human Workplace. Liz has written some excellent articles about how to recruit and retain great staff. You can read some of her works on LinkedIn and Forbes.

Posted in Best Practice, Management, Recruitment.

Being realistic about large IT projects

This is well outside my normal area but I’ve been asked to think about the true costs of rolling out a major new (or replacement) system for a university, using VLE (virtual learning environment) as an example, and I’m assuming the system will have a projected lifetime of 10 years.

To start thinking about this I’m going to:

  • try to divide it into more manageable tasks
  • ask the community (Hi, that’s you!)


What seems clear is that we have some very distinct phases and that the costs, staff time and skills required will not be consistent.

Project management skills are required over the whole project until it’s fully deployed. As probably are user-experience skills in an ideal world.

  1. Scoping;
    • What do we need?
    • Do we need to do anything at all?
    • Could we just invest this cost+effort in the current system?
    • Identifying stakeholders.
      • Need to make more effort to get typical end users represented not just “power users”.
  2. Scouting
    • What’s out there?
    • What software and services could we use for this?
    • Could we build it in house?
  3. Decision point.
    • Decide on a plan
    • Cost it
    • Get the costs agreed
    • (at this stage I think the IT dept usually lowballs the estimates and then ends up borrowing from normal operations resources as a result)
  4. Learning, prototypes and recruitment
    • We’re going to need staff who grok this new system
    • They’ll need training and practice
    • …or we could try to hire them, but universities don’t have much wiggle room to pay for experienced specialists.
    • …if we’re training our own staff we should give them a chance to practice and play. That will cost time and money and probably software licenses.
    • We don’t do enough of this
  5. Minimum viable product
    • Once we start building for real we’ll need more people involved
    • We generally seem to skip MVP and go straight to 100% deployed and if your feature isn’t in that version, tough luck.
    • Experience has taught people that “phase 2 or later” features are often never done so it causes way too many things to go into “phase 1”.
    • Information architecture within the system and integrating with other systems
    • User experience oversight both for the system and its integration into our overall user experience (we don’t do enough of this)
    • At this point the project also starts to use lots of resources from existing IT teams including
      • Data integration
      • Testing
      • Resource deployment (VMs, etc)
      • Interface design
      • Copywriting (documentation)
    • Moving information from the legacy system is a big job here and involves the data owners as well as techies.
    • Also need management of communication between the various teams and various stakeholders
      • I’m not convinced “putting documents on sharepoint” counts as genuine effective communication.
      • We shouldn’t be afraid of getting stakeholders and techies in the same room, or even of running “jam sessions” where ideas can be explored rather than communicating via Word documents. Also, done right, this makes the stakeholders more tolerant and flexible and the techies feel more appreciated and proud of what they are doing.
  6. Content creation
    • This is a separate job to branding and templating and linking
  7. Phases 2,3 etc.
    • This is something which should be budgeted for properly. Nothing is ever right first time and it takes 6 months of using it to understand how it could really be better.
    • We should ensure that small win deployments happen now and then so users and stakeholders see there is progressive improvement. This will encourage constructive suggestions.
  8. Business as usual
    • The system will still require maintenance, security patches and whatnot
    • Minor new features will be needed throughout its life
    • Other systems will want data out of this for reporting and integration
    • At this point we sometimes have the number of dedicated IT staff for a system drop to 1. It happens. And then they leave and the system gets dumped on someone who just caretakes it from then on. I’ve only seen that happen on smaller systems, not Enterprise Apps.
  9. Major upgrades
    • These are effectively a project in their own right, but in addition to the obvious technical cost they will almost certainly cause knock on resource requirements if they alter the object-model, APIs etc. They can also require reworking the documentation.
  10. End of life
    • One day this service will be replaced with something new and shiny. At that time people will need a really good grasp of what processes it performs and what data it holds. Chances are the people who knew will all have retired by then!

What actually happens is that the project phase-one pretty much always goes over time, usually by months or years. The reason for this is tricky to pin down for sure but is probably partly that really big IT projects are expensive to do right so we under budget them and then “borrow” from existing teams to resource the difference, hiding the true cost. I recently heard the comparison of a new university enterprise IT system to a new university building. The projected lifespan is a little different but I suspect the analogy holds up well, including the fact that you can move into a building even if not all the rooms are decorated yet, and that someone will need to find the budget to insure it and pay the heating bill.

This post was just a quick brainstorm to start thinking about the problem. What did I forget? Where am I naive or just plain wrong?

Posted in Uncategorized.

Generating Test Datasets (Part 1)


Obtaining an appropriate test dataset forms an integral part of the development and testing of any software system.  It is not uncommon for the test dataset to be extracted from a live environment (or simply to be a clone of it).  There are several reasons for taking this approach; however, there are also potential security and regulatory/legal issues that may arise from it, and other approaches should be investigated.  The following considers two primary reasons for testing: system correctness, and performance testing/system sizing.

Data Generation Approaches

There are several approaches to generating or otherwise obtaining test datasets – using “real” data from production systems, using data from production systems which has been “anonymised” or otherwise “cleansed” of identifying data, using randomly-generated data, and using data that follows a model of “real” data, each of which has its own set of advantages and disadvantages.

Using Real Data

Using “real” data from production environments to populate the test environment is an understandable approach, and the reasons for adopting it generally revolve around a perception of convenience:

  • Restoring (or otherwise synchronising) a copy of the production database onto the testing server is normally a trivial task in terms of effort (if not always time, in the case of large databases). Given an established system, this will give a large set of data for little developer effort.
  • Given a sufficient volume of data, it is expected that the values present will vary across the available domain
  • Real-world data is not normally constrained by the assumptions of the programmers who wrote the system being tested, and thus is a good source of “odd” values (or combinations thereof).
  • If the system is to be connected to other related systems, then using “real” data makes testing the integration of these systems appear to be easier, as managing a test dataset between multiple systems or environments may not (appear to be) required.
  • If those conducting the testing are familiar with the data in the production environment, they may feel more comfortable having this data in the testing environment, or may feel that it makes their task easier, as they can identify unexpected outcomes based on their pre-existing knowledge of the dataset.

However, there are also several significant downsides to this approach:

  • Assuming that the production environment data contains data related to individuals, and depending on the wording of the agreement the individuals entered into when their data was originally obtained, then using their data for testing is likely to constitute a breach of the Data Protection Act, as it may be being used for a purpose other than that which it was obtained for. Further, if the data is modified as part of the testing process, this may constitute an issue with the requirement for all data to be correct.  According to the Information Commissioner, “The ICO advises that the use of personal data for system testing should be avoided. Where there is no practical alternative to using live data for this purpose, systems administrators should develop alternative methods of system testing. Should the Information Commissioner receive a complaint about the use of personal data for system testing, their first question to the data controller would be to ask why no alternative to the use of live data had been found”.  Further, there may also be other regulatory requirements related to the specific type of data being stored (e.g. the FCA when dealing with financial data).
  • This approach increases the attack surface, when considering sensitive data. This means that the testing environment would need to be secured to the same degree as the production environment, including the monitoring of the system for suspicious activity.  Further, it is possible that people who do not have access to the production environment have access to the testing environment, increasing the number of people with access to sensitive data.
  • It is possible that the release of code being tested contains new and/or otherwise undiscovered coding errors which result in a security vulnerability, e.g. which results in the leaking of data.
  • Dependent upon the maturity of the system, it is possible that the data it contains is not an accurate representation of the data it will contain in the future, for example, it may have a bias in its distribution (for example, older data may not follow the same trends as newer data due to altered processes, etc), which may misinform performance optimisations based upon data analysis. Alternatively, the data may not cover a large portion of the available domain, leaving edge cases untested.
  • The system may simply not contain enough data to supply a large-enough dataset, which would then require supplementing.

Using “Cleansed” Data

Given that there are advantages to using “real” data, a reasonable alternative is to try to “defuse” the potential privacy and regulatory problems by anonymising or removing the sensitive data (e.g. scrambling data by combining fields from different rows, replacing sensitive fields with fixed or random strings or null values, etc).  When done correctly, this would retain the advantages of using “real” data, whilst allaying the privacy concerns.  However, it does bring with it a distinct set of disadvantages:

  • Correctly anonymising a dataset so that it cannot be converted back into its original form, nor individuals otherwise be identified from the processed data, is a time-consuming and non-trivial task, which should be manually verified before proceeding. Care must be taken when deciding how to anonymise data, which fields are involved, and that all occurrences of the data are identified (it is likely that it will in reality be a combination or set of fields, which may vary based upon the data context).
  • Dependent upon the method used for anonymising the dataset, patterns existing in the data may be removed or obfuscated, or data which breaches logic rules (or that is otherwise of interest) may be removed. This may result in decisions made based upon data analysis being invalid (e.g. performance optimisations being informed by data distribution analysis).
  • Care needs to be taken with the management of the dataset in order to ensure that new sensitive data doesn’t accidentally flow into it (e.g. via a feed from another system), and that it is not inadvertently lost or damaged during testing.
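
As a rough illustration of the scrambling techniques mentioned above (combining fields from different rows, and replacing sensitive fields with null values), a minimal Python sketch might look like the following. The field names and sample rows are hypothetical, and shuffling alone does not guarantee anonymity: by chance a row can keep its original values, which is one reason the result needs manual verification.

```python
import random

def scramble(rows, shuffle_fields, mask_fields, seed=None):
    """Return cleansed copies of rows; the input rows are left untouched."""
    rng = random.Random(seed)
    cleansed = [dict(row) for row in rows]
    # Shuffle each chosen column independently, so that values from
    # different original rows end up combined in a single row.
    for field in shuffle_fields:
        values = [row[field] for row in cleansed]
        rng.shuffle(values)
        for row, value in zip(cleansed, values):
            row[field] = value
    # Replace the most sensitive fields outright with null values.
    for row in cleansed:
        for field in mask_fields:
            row[field] = None
    return cleansed

# Hypothetical sample rows, invented for this sketch.
students = [
    {"id": 1, "forename": "Ada", "surname": "Lovelace", "email": "ada@example.org"},
    {"id": 2, "forename": "Alan", "surname": "Turing", "email": "alan@example.org"},
    {"id": 3, "forename": "Grace", "surname": "Hopper", "email": "grace@example.org"},
]
cleansed = scramble(students, ["forename", "surname"], ["email"], seed=1)
```

Note that this also flattens any relationship between the shuffled columns, which is exactly the loss of patterns described in the second point above.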

Using Random Data

A further approach is to programmatically create random data, and populate the test dataset with that.  This may be either through generating the entire dataset as random data, or hybridised by using “real” ancillary data (essentially look-ups) and randomised data for sensitive entities.  This has several advantages:

  • Given that the data is literally random, there are no privacy concerns related to using it, as it doesn’t relate to any real entities.
  • It should be trivial to size the dataset generated to the amount of data required (i.e. there isn’t a problem if there is insufficient real data to generate a test dataset of the required size, as you can simply generate more). This is particularly relevant to performance forecasting.
  • When properly generated, it should be possible to have a dataset which covers a large portion of the data domain, and is free of assumptions made by the original programmer.
  • Tools of varying quality and expense are available, which can generate random data based on a defined schema and data rules (to ensure values “look” correct). These can reduce the amount of work required to produce the dataset to minimal.

There are, however, some disadvantages with this method:

  • Configuring and generating the dataset takes time and effort (varying by tool and dataset complexity).
  • Whilst it may “look like” “real data” at first glance, it is not – this façade of reality can lead to confusion. For example, if a dataset of 20 years of students is generated, whilst it might be valid in terms of validation rules for Student A to have a record in the first and last year, it probably would never happen – this can be jarring, and lead to people trying to find out why data looks “odd”, rather than examining the actual test cases.
  • Depending upon the sophistication of the tool being used, some data generated may violate complex data validation rules, or it may take some time to enter said validation into the tool.
  • The ability to regenerate the dataset exactly will vary by tool. Therefore, it is likely necessary to manage the dataset, to ensure that tests can be consistently run against it without needing to check or regenerate the data.
  • Given that the data is randomly generated, it may tend toward a uniform distribution, and not reflect the density, frequency, or range of real data. This may lead to decisions which are made based on data analysis (typically, index optimisation) being invalid.
  • The settings entered may mirror some assumptions made by the programmer regarding data distribution and/or domain, leading to edge cases not being explored, as they are effectively removed from the set of generated data (e.g. a field is populated with values “up to” instead of “up to and including” a specified value).
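
Two of the points above (regenerating a dataset exactly, and the “up to” versus “up to and including” trap) can be sketched with Python’s standard `random` module. This is only an illustration under assumed settings: the function name, the 0 to 100 range and the seed are made up for the example, and a dedicated tool will have its own equivalents.

```python
import random

def generate_marks(count, low, high, seed, inclusive=True):
    """Generate `count` integer marks; the same seed always gives the same list."""
    rng = random.Random(seed)
    if inclusive:
        # randint includes both end points: "up to and including" high.
        return [rng.randint(low, high) for _ in range(count)]
    # randrange excludes the upper bound: "up to" high, so high never appears.
    return [rng.randrange(low, high) for _ in range(count)]

first = generate_marks(1000, 0, 100, seed=42)
second = generate_marks(1000, 0, 100, seed=42)
truncated = generate_marks(1000, 0, 100, seed=42, inclusive=False)
```

Here `first` and `second` are identical because they share a seed, whereas `truncated` can never contain the boundary value 100, so any logic that only triggers on full marks would go untested.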

Using Modelled Random Data

The approach of using modelled random data takes using generic random data one step further – the data found in the production environment is analysed for patterns and characteristics, which are then adopted into the data generation algorithm.  In the previous example of student records, it would be reasonable for the data generated for a given student to be constrained to a rough time range.  This gives some advantages over generic random data generation:

  • The data correlates with patterns in the live data, and thus “looks like” real data when examined, meaning that people are typically more comfortable when viewing it, and are less likely to question items found in it “on gut”.
  • Given that it models the “real” data, the distribution of the records should more closely match that of the data found in the production environment. This means that decisions based on data analysis (e.g. performance tuning) are more likely to be valid (this is more important if generating large volumes of data to simulate dataset growth).

There are, however, some disadvantages:

  • Analysing the data in the production environment, and producing generation logic, is a time-consuming and non-trivial task. If it is not done correctly, most of the advantages of this approach are lost.
  • Generating the data is a more complex task, and may result in more expensive tools being required (or written).
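
To make the idea of modelling more concrete, here is a small hedged sketch in Python. It assumes we have counted how often each value of a non-sensitive field occurs in production (the “mode of study” values and their 80/15/5 split are invented for the example), and it constrains a student’s records to a contiguous run of years, as in the student example above.

```python
import random
from collections import Counter

def fit_weights(observed):
    """Count how often each value occurs in the observed (non-sensitive) data."""
    counts = Counter(observed)
    values = sorted(counts)
    return values, [counts[v] for v in values]

def enrolment_years(rng, first_year, last_year, max_duration):
    """Pick a start year, then keep all records within max_duration years of it."""
    start = rng.randint(first_year, last_year)
    duration = rng.randint(1, max_duration)
    end = min(start + duration - 1, last_year)
    return list(range(start, end + 1))

rng = random.Random(7)
# Invented stand-in for values observed in a production column.
observed_modes = ["full-time"] * 80 + ["part-time"] * 15 + ["distance"] * 5
values, weights = fit_weights(observed_modes)
# Generated values follow the observed frequencies rather than a uniform spread.
modes = rng.choices(values, weights=weights, k=10_000)
years = enrolment_years(rng, 2000, 2019, max_duration=4)
```

Even a model this crude addresses the two complaints made about purely random data: the frequencies resemble the real ones, and no student spans the whole 20 years.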

Determining Data Volume

Whilst on first inspection it may appear that test datasets should be as large as possible in order to capture as many data combinations as possible, this may not be the correct approach, and may even be counter-productive.

Full Dataset

A “full” dataset is a dataset of a similar size to that of the data in the production environment, and is typically used when a large dataset is needed, e.g. for performance testing.  Alternatively, it can be used as a “superset” from which candidate test rows can be identified and processed (this can be advantageous if multiple data-modifying tests need similar data at the same time).  There are some disadvantages associated with this approach:

  • There is an assumption that the dataset is large enough to contain the required volume of data, as well as the necessary individual entries, which may not always be the case when dealing with “young” systems. If the dataset is not large enough, then this should be noted, and extra data generated.
  • Complete datasets can be quite large, especially when dealing with mature systems. This has an obvious cost in terms of disk space, etc, needed to support the dataset, along with processing time during testing and restore/revert times (if needed).
  • If being used as a superset from which candidate records are being selected, sometimes the volume of data can be counter-productive (“can’t see the wood for the trees”).

Sampled Dataset

A sampled dataset is simply a smaller dataset generated from a full dataset, using an algorithm to select part of the dataset – typically this will be “every nth record”, or “select n% at random”, although other more complex selection methods exist (e.g. “select the first record for every combination of the following”).  This has the advantage of reducing the volume of data that needs to be held.  However, it does have several disadvantages:

  • The success of this method is dependent upon the effectiveness of the sampling method. If the sampling is carried out incorrectly, it is possible that the distribution of data in the dataset produced is skewed (when compared to the full dataset), the dataset is missing sets of values that it should contain, or that patterns/trends that are apparent in the full dataset are obfuscated or removed in the reduced dataset.
  • Correctly extracting records requires care, and can be non-trivial dependent upon the complexity of the data model, and requires knowledge of the data storage schema. For example, referential integrity needs to be maintained, which can involve many objects in a complex schema.
  • Determining the correct sampling method can be non-trivial.
  • If sampling is not carried out correctly, the resultant dataset may be too large (in which case many of the drawbacks of using a full dataset apply) or too small (in which case, there’s the chance required records do not appear).
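
The two simple sampling methods mentioned above (“every nth record” and “select n% at random”) can be sketched in a few lines of Python, along with the extra step needed to keep referential integrity when a child table hangs off the sampled one. The table and column names here are hypothetical.

```python
import random

def every_nth(records, n):
    """Keep every nth record, starting with the first."""
    return records[::n]

def random_fraction(records, fraction, seed=None):
    """Keep roughly `fraction` of the records, chosen at random without repeats."""
    rng = random.Random(seed)
    return rng.sample(records, round(len(records) * fraction))

def children_of(children, parents, foreign_key, parent_key="id"):
    """Keep only child rows whose parent survived the sampling."""
    kept = {parent[parent_key] for parent in parents}
    return [child for child in children if child[foreign_key] in kept]

# Hypothetical data: 100 students with exactly four placements each.
students = [{"id": i} for i in range(1, 101)]
placements = [{"id": i, "student_id": (i % 100) + 1} for i in range(1, 401)]

sampled_students = random_fraction(students, 0.10, seed=3)
sampled_placements = children_of(placements, sampled_students, "student_id")
systematic = every_nth(students, 10)
```

In a real schema the `children_of` step has to be repeated for every table that references a sampled table, which is where the effort described above tends to go.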

Hand-Picked Records

This approach involves extracting just the records needed to verify a given test.  This has the advantage that there are no extraneous records in the system to distract from the result of the test, and is most appropriate for individual tests.  However, it has some drawbacks:

  • Identifying the records to be extracted requires knowledge of the data in the system, at both the application and data storage level.
  • Correctly extracting records requires care, and can be non-trivial dependent upon the complexity of the data model. For example, referential integrity needs to be maintained, which can involve many objects in a complex schema.
  • It is possible that a routine has a side-effect which affects other records, but due to the reduced volume of data present, the required conditions for this side-effect to fire are not met, and it goes undetected (e.g. the test dataset consists of a single record, and the function alters the current- and next record).
  • It is likely that different data needs identifying for each test, making this approach quite labour intensive.
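
The side-effect problem described above is easy to reproduce. In the following sketch, `close_placement` is a made-up routine that (deliberately) also touches the next record; with a hand-picked dataset of one record the side-effect has nothing to act on, so the test passes without ever exercising it.

```python
def close_placement(records, index):
    """Hypothetical routine: closes one record, but also flags the next one."""
    records[index]["status"] = "closed"
    # Side-effect: only fires when a following record exists.
    if index + 1 < len(records):
        records[index + 1]["needs_review"] = True

# Hand-picked dataset of a single record: the side-effect cannot fire.
single = [{"id": 1, "status": "open", "needs_review": False}]
close_placement(single, 0)

# With a second record present, the side-effect becomes visible.
pair = [
    {"id": 1, "status": "open", "needs_review": False},
    {"id": 2, "status": "open", "needs_review": False},
]
close_placement(pair, 0)
```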

Proposed Approach

Having considered the above approaches, it is proposed that we generate test datasets using a hybrid of the randomised, and modelled randomised approaches, which, where possible, shall be of a size representative of the full (or planned) dataset in terms of record count and data volume.  It is intended that the generated data may be supplemented with real data in some cases.  The reasoning for this is as follows:

  1. In terms of benefit received for effort expended, random data generation offers the best “payoff”, as there are tools readily available which can perform the task automatically or near-automatically, e.g. automatically setting the datatype based on the column’s type, detecting the contents of a column by its name and generating appropriate data (e.g. if the column is called “TelephoneNumber”, the tool will automatically generate data that looks like phone numbers).
  2. Whilst there are possible benefits from going with fully-modelled data, the increase in expended effort to do this correctly is generally not worth it – a sufficient volume of data will give an indication of performance, and performance testing will likely be done separately. Where randomly-generated data is obviously not “realistic enough” in its distribution, then we will examine modelling this, e.g. if 10% of files are marked as “sensitive”, this is trivial to reflect, and will likely make the dataset more acceptable to those using it (either from a performance aspect, or people “eye-balling” the data).
  3. By generating fully-modelled data, it is possible to inject flaws into the data that are a result of programmers’ assumptions – using mostly random “nonsense” data helps to avoid this. Some element of modelling, programmatic intervention, or the use of real data (where logic is dependent upon certain values being present) may be necessary in order for the data generated to comply with application logic rules that are in place.
  4. Whilst there is an (entirely reasonable) argument that testing data is obviously testing data and thus does not have to have meaning, it is not uncommon for the data to be examined by humans, who may question unrelated aspects (this would be something akin to cognitive dissonance). Therefore, tweaking certain prominent aspects of the generated dataset will likely increase its acceptability if it is being shown to end-users for user-acceptance testing.  For certain prominent non-sensitive aspects of datasets (e.g. names of programmes of study), we will look to use real data, or data generated from real data, in order to increase its realism and acceptance.
  5. Generating a dataset of similar size to the (proposed or envisaged) production dataset is largely trivial given appropriate tools, and will allow developers to identify performance issues that are being created, where they may otherwise not be obvious with small datasets, hopefully removing a potential downstream problem. Obviously, this may not be possible due to space constraints, in which case the scale of the data will be reduced.

Coming Up

In the next part, we will look at putting the above into practice by generating a test dataset for one of our existing systems.

Posted in Data, Database, testing.

Generating Test Datasets (Part 2)

A Brief Recap

In our previous post, we covered the various approaches to obtaining datasets for system testing, from using production data to modelling said data, along with their advantages and disadvantages.  We proposed an approach using software tools to automate the generation of fake data, which resembled real data, mixed with “real” (non-personal) data where necessary.


Case Study – Practice Placements

For our team’s first attempt at generating a large set of test data, the University’s Practice Placements system was selected.  This system is used to manage the allocation of placements to nursing students, and was chosen as it has a relatively complex model (c. 40 tables), which should quickly highlight difficulties both in generating the data and in the generated data itself.  For the purposes of this study, we elected to use Red Gate’s Data Generator (RGDG), due to the simple expedient of it already being installed.

To generate the dataset, we used the following process:

  1. For datasets that are referenced by application logic, the sets of allowable values were extracted from the production system, saved into text files, and set to be loaded back into the relevant table(s). This step is not strictly necessary, as it’s possible to tell RGDG not to process the table (RGDG’s default behaviour is to wipe the table and re-generate the data), however, this approach removes the question, “why isn’t this table being processed?  Has it been missed?”.
  2. For datasets that are prominent to the end-user and therefore need to look “real”, we extracted datasets from production environments or other sources, saved them into text files, and set RGDG to pick either items from the dataset at random, or to generate entries by combining multiple rows at random. Examples of where this approach was employed include forenames (the in-built dataset is too short and too western, so a list of forenames was obtained), surnames (for the same reason), gender (the real-world data is not as “simple” as you might expect), and programmes of study and their associated codes (for end-user comfort with the data).  Additionally, we extracted a list of organisation names, and used the “text shuffler” generator to create randomised combinations of words found within the list, given min/max length parameters, to generate something which “appeared familiar”.
  3. Where the formatting of an item is important, and the type of data is particular to the University, or where it appears in multiple places, we wrote XML files which contain the definition of generators for these items (typically through the use of regular expressions), which are then referenced in order to limit the duplication (and accidental variance) of work. Examples include formatting of the University’s staff/student ID, usernames, and UCAS identifiers.
  4. For each table, we set the volume of data to be generated. We did this through either setting an absolute number of rows to generate, or indicating a proportion of the number of rows in another table (e.g. generating a number of student placement allocations equal to 400% of the student table generates on average 4 allocated placements per student).
  5. For each field, we checked the data generator that had been assigned, and where automatic matching was inappropriate, we manually set it. For the most part, the automatic matching worked well (or there was no obvious available match), although the most obvious failures were forename and surname being matched to nicknames (which manifests itself in the format of a forename and number) instead of first name and last name, and item names being interpreted as person names.
  6. For fields where no appropriate generator exists, we selected one. Generally, we used the regular expression generator (e.g. to generate names of learning groups conforming to a particular format), although numeric ranges (e.g. to generate longitude and latitude co-ordinates roughly within Hampshire), and weighted lists (to probabilistically model some items of data) were also used.
  7. Once the above steps had been completed, we hit the “big red button”, and after approximately 30s, we had 300,000 shiny new rows of data.
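
For readers without access to RGDG, the ideas behind steps 3, 4 and 6 can be approximated in plain Python. The sketch below is a stand-in, not a description of how RGDG works internally: the eight-digit ID format, the bounding box and the placement types are all placeholders invented for the example (the box in particular is not a surveyed outline of Hampshire).

```python
import random

rng = random.Random(2024)

def unique_ids(rng, count, digits=8):
    """Formatted identifiers (step 3). Hypothetical format: a fixed run of digits."""
    ids = set()
    while len(ids) < count:
        ids.add("".join(rng.choice("0123456789") for _ in range(digits)))
    return sorted(ids)

# Placeholder bounding box for a numeric range generator (step 6).
BOX = {"lat": (50.7, 51.4), "lon": (-1.9, -0.7)}
# Placeholder weighted list (step 6): most placements are in hospitals.
TYPES, TYPE_WEIGHTS = ["hospital", "community", "other"], [70, 25, 5]

students = [{"id": sid} for sid in unique_ids(rng, 250)]

# Row count as a proportion of another table (step 4): 400% of students,
# which averages out at four allocations per student.
allocations = [
    {
        "student_id": rng.choice(students)["id"],
        "lat": rng.uniform(*BOX["lat"]),
        "lon": rng.uniform(*BOX["lon"]),
        "type": rng.choices(TYPES, weights=TYPE_WEIGHTS)[0],
    }
    for _ in range(len(students) * 4)
]
```

A rectangular box like this is exactly how coordinates end up in the sea, so anything location-sensitive would need a better model than a pair of numeric ranges.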

Having followed the above process, the following observations become fairly evident:

  1. At a very basic level, generating large randomised datasets is simple. This is at least partially due to the tool that was used intelligently dealing with foreign key relationships, constraints, data type mapping, etc.  However, attention must be paid to data that is required by the application, to ensure it is not missing or being randomised, and working through complex data models checking settings can be tiresome.
  2. In some circumstances, it is necessary to add code to facilitate “correct” data generation (e.g. where values in one field logically depend upon those in another). This is a slight stumbling point, as extending RGDG v2 via .NET assemblies is powerful (e.g. the ability to define custom UIs) but time-consuming, and the abilities of the Python module to reference other data are a bit limited in this respect (v3 is better, but was not installed, and the matter is not helped by the fact I don’t know Python…).
  3. If there is application logic which checks the data it is “being fed”, or there are business rules surrounding the content of the data which must be followed (either to appease the application or users who are testing it), then it is likely that additional work will be needed to generate data following these rules.  It is likely that this may be non-trivial.
  4. It became apparent that the application-level logic does not always match the database structure, resulting in application exceptions being generated when navigating around the system. For example, some fields are marked as required in the user interface, but are nullable in the database.  Tickets have been generated to fix these.
  5. As a general rule, the data looks at least semi-plausible (houses in the English Channel due to my bad longitude/latitude guessing excepted…), and provides a sufficient volume of data from which test cases could be selected whilst also giving an indication of likely performance.
  6. Unless working with a small dataset, or being utterly fastidious, it is likely that when specifying the data generation settings, some things won’t be set correctly, and that multiple iterations of data generation may be required. Thankfully this is a painless process.
  7. Resizing the generated dataset to any given size is largely trivial, and mostly involves clicking a button and waiting, although there is the obvious issue of disk space (this is likely more of an issue when developing/testing locally, or on a shared server). This can obviously be useful when investigating how a system may behave with significantly larger data loads.
  8. Having a pre-written tool that can do most of the work for you makes the process of generating the data far easier, more enjoyable, and significantly quicker than having to do this yourself.
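
The mismatch noted in observation 4 (fields required in the user interface but nullable in the database) can also be hunted for directly, rather than waiting for generated data to trip over it. The following is a minimal sketch using SQLite purely as a convenient stand-in for the real database; the table, its columns and the list of UI-required fields are hypothetical.

```python
import sqlite3

def nullable_columns(conn, table):
    """Return the names of columns that the database allows to be NULL."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is: cid, name, type, notnull, dflt_value, pk
    return {name for _, name, _, notnull, _, pk in rows if not notnull and not pk}

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE placement (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        contact_email TEXT,
        start_date TEXT
    )
    """
)

# Hypothetical list of fields the user interface marks as required.
ui_required = {"title", "contact_email"}
mismatches = ui_required & nullable_columns(conn, "placement")
```

Here `mismatches` contains `contact_email`: the interface insists on it, but nothing stops a data generator (or any other system) from inserting a NULL.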


The above was written from a basis of personal experience, combined with the following sources:

Posted in Data, Database, testing.