3D Model Escrow

This is a quick note to record an idea I had in response to a conversation over lunch at “Connected Data London“.

The question was around how to enable 3D printing for the construction industry, or more specifically making 3D replacement parts, but this idea works for a lot more usecases.

There are a couple of problems here. First of all, taking a 3D scan of something and using it to make a copy is probably copyright infringement (I am not a lawyer). It also may infringe patents. The other problem is that for more complex shapes, there is no incentive for the copyright owner to help you — no business will help you make a cheaper copy of their own product, with the possible exception of celebrity chefs.

So here’s a possible future where everybody wins.

One or more trusted organisations set up a data escrow service. This service will hold the 3D print information required to make a copy of the part, with a set of conditions under which it will be made public. It also contain other design documents, manufacturing instructions etc.

A manufacturing company pays a small fee to the escrow service to hold the data for each product and to host the data on the web in the event that certain conditions are met. My first thought is that the data should become public if the part is no longer available on the market and isn’t going to be back on the market in short order. These rule applies even if the rights to the part are sold on or the campany is purchased etc. because the copyright holder has given the escrow service the data under a license which allows them to share it under a public-domain-like-license if the product is no longer available.

This shouldn’t cost very much and gives the first mover a unique selling point. It has the advantage that the company can guarantee that replacement parts will be possible either from them or from anybody given the specifications should they go bust or stop that product line.

As more companies offer this service, it can be added as a preference or even a requirement on deciding suppliers. While the initial question was about the construction industry it could work for any manufacturing industry.

I like this idea because everybody wins — it doesn’t require anybody to work against their own interests.

Never appeal to a man’s “better nature.” He may not have one. Invoking his “self—interest” gives you more leverage.

– Robert Heinlein, “The Notebooks of Lazarus Long”

Posted in Repositories.

rev="post-1381" No comments

By Christopher Gutteridge – July 12, 2016

Institutional Web Management Workshop: Overall thoughts (IWMW2016)

This month we went to IWMW16 in Liverpool at Liverpool John Moores University. We took an unusually large group including

2 people from the web & data innovation & development team
1 person from the central IT web development
2 people from the comms and marketing web team

We’ve a lot of notes and we may write some more detailed write ups, but here’s some initial thoughts.

Venue

John Moores is a much smaller university than I’m used to, but seemed welcoming and easy to find. Eduroam coverage was good but dropped out regularly. The audio mics in the large lecture theatre seemed a bit hit and miss, but this may be that we didn’t pay for premium AV support from the venue?

I also was a bit frustrated that the machine used for the plenary talks didn’t do video playback and that reduced the impact at a couple of points, but otherwise things went smoothly.

Liverpool was very warm and cheerful. Would visit again.

A large group

Normally the University of Southampton sends only 1 or 2 people to this event. I (Chris Gutteridge) have been several times and given talks and workshops in the past. The other staff were all coming for the first time and it was really great to get to know the web-staff from other parts of the university. The only regret is that we all sat together at the dinner and I think we’d have been better off deciding to spread outselves out more, but it’s a small thing.

What’s new? What’s not?

Since I last went to IWMW I can see a few changes to what’s being talked about.

Agile methodology for university web teams is growing in popularity with teams working in two week “sprints”. Other teams are still concerned that all their work is still around big-bang projects and that continuous improvement isn’t an option.

There is still a big cultural gap between IT, comms/marketing and academics. As someone who’s worked closely with all 3 I find this unsurprising. I think as a community we could do some work in explaining each culture to the other two. There’s a lot of disrespect and frustration that comes from differing training. For example, many comms and IT staff don’t know much about the peer review process or academic writing. Many academics don’t know much about formal IT processes and data protection. I think there’s some real opportunities for the community in finding good ways to bridge this divide.

What seems to be off the menu is responsive design. I think that is now moving towards business as usual and the arguments for it and best practice is well understood.

What was also a little painful is that ‘continuous transition’ is more a punchline than a fear at this point. Many of the teams have been restructured again and again.

University websites and Computer Science

I work in what is currently titled something like the “University of Southampton Web & Data Innovation & Development Team”. The idea of having both a formal central team and a more lightweight and experimental team seems to be very unsual. We are either an expensive white elephant or a unique selling point for Southampton. I think it’s important to see how our team can learn more how to compliment the central teams rather than duplicate and irritate them. I’d really like to keep working on projects small enough to fail. Trying out things and learning from them to feed into the central team and general best practice. However with the disappearance of Jisc funding for innovation in information infrastructure I think our team are increasingly unique in the UK.

Hopefully we’ll have one or two more posts about IWMW to follow. Also, why do we never host it in Southampton?

Posted in Uncategorized, web management.

rev="post-1361" No comments

By Christopher Gutteridge – June 27, 2016

Recruiting

As a team manager, one of my key responsibilities is recruitment. It’s also one of the most rewarding. I get to meet some really smart people and hopefully find one who’s a great fit for our team. I’ve sat on a lot of interview panels over the years, for lots of different roles, and I’ve gained a fair amount of experience along the way.

For my team’s most recent round of recruitment, we went the extra mile and I invested a lot of time and effort into learning how to do a recruitment exercise well. This highlighted some glaring errors in our previous endeavours and resulted in me re-writing the entire suite of recruitment documents we use. For reference, the resources I referred to are listed at the end of this post in the Further Reading section.

As such, I thought I would share some of my experiences and insights into the recruitment process, highlighting the priorities and pitfalls as I see them. This post focuses primarily on the perspectives of a recruiter, but should be of value to people on either side of the table.

Job advertisements

This is where the recruitment process begins – in the production and publishing of job adverts. It’s essential that you make a good job of these. They represent your ‘first impression’ that you’re giving to prospective candidates; if you want the best applicants, your literature needs to make it clear why you’re worth working for. It’s a common misconception that there’s a glut of talent out there that needs to be filtered. Unless you’re a mighty tech giant like Amazon, Facebook, or Google, you need to be doing everything in your power to attract good candidates.

Job advertisements tend to be the place where you have the most freedom. Job descriptions and person specifications often need to fulfil HR requirements, so are often a little more structured and less flexible. Still, don’t underestimate the value of having a clear and well-written job description.

The most crucial piece of advert-writing advice I can offer is this: make sure your advert focuses primarily on what the job is and why it will challenge the candidate. Minimise descriptions of what the “ideal candidate” should look like – that’s what the person specification is for. Besides, person specification is very much about filtering, and that’s not what you should be doing. You’ll likely want to say a little bit about the sort of person you’re looking for, but it’s not the most critical part – save it for the person specification.

Instead, spend your time and effort talking about what challenges and opportunities are on offer in the role. Talk about the team and how it fits into the bigger organisation; talk about what facilities and amenities are available nearby. All of these things incentivise people to consider your vacancy seriously.

My final piece of advice for job adverts is: always include a section about equal opportunities and diversity. Omitting this will put many candidates off before they even apply, and you definitely don’t want to be doing this. You should be encouraging applications from as far and wide as possible.

Job descriptions

Job descriptions are usually mandated by your HR department, and likely have to fit a standard template. It’s easy to dismiss the JD as a mere formality and simply throw something together. In my time, I’ve seen countless job descriptions cribbed from previous examples, with constantly growing requirements that render them ever more unrealistic.

In an ideal world, you’d keep your job descriptions updated and review them at least once or twice a year, to ensure that they accurately reflect the activities of your team. As time goes by, roles inevitably evolve, and the job description is meant to be a living document that reflects those changes.

In reality, JDs far too often end up gathering dust, only to see the light of day when it’s time to recruit. Even if you don’t review your JDs regularly, at least take the opportunity when you’re recruiting to examine them and ensure they accurately reflect the roles your team are expected to fulfil.

Job descriptions are invariably comprised of a job purpose and key accountabilities. Depending on your organisation, they may also include a person specification (I’ll discuss those separately in the next section). These are all quite distinct aspects, and it’s important that you consider that when deciding what content to put where.

The job purpose should explain what the job is for: like the advert before it, this is a golden opportunity to sell the role. It should be inspiring and convince potential candidates that it’s something they could really make their own. A dull job purpose will just turn candidates off. The key accountabilities define the primary functions required to fulfil the job purpose. Remember, these are key accountabilities, so you shouldn’t go into excruciating detail – four or five areas are probably about right.

If it’s been a while since you looked at your job descriptions, it’s almost certainly worth starting afresh from the blank template instead of modifying a long-outdated one. Put some thought into what role you need to fill – it’s rarely the same as it was the last time you hired. What will they be doing? If they’re joining a larger, fairly homogenous team, what does that team currently do? Once you know roughly what the job description should look like, then it’s time to review the old JD. If parts are still relevant, by all means crib them – there’s no need to re-invent the wheel.

Person Specifications

If job descriptions have a reputation for being merely copied and pasted, person specifications are even worse. It’s tempting to put every little thing into one, but it’s essential that you keep in mind that you’re looking for real people, not some mythical superhero who can do everything. I can’t state this enough – talent has to be attracted! Don’t build your recruitment process around filtering. I recently undertook an exercise with my team where I printed off the person specification and asked them to mark them as either fully-met, partially-met or unmet. The results were enlightening, and some requirements were clearly unnecessary.

It’s likely that you’ll have sections for essential and desirable criteria. For a person specification to provide maximum value, the essential criteria should be restrained. Remember that every essential criterion is another reason for a potential candidate not to apply.

If a candidate meets all the essential criteria, but few to none of the desirables, that should indicate that they’re capable of doing the job after your standard suite of training. Indeed, I recommend basing your desirable criteria on your training objectives. Don’t be afraid to make your desirables section larger than your essentials one, but don’t go overboard – make sure that any skills you’re asking for will bring clear benefits.

Finally, a word about experience: it’s commonplace to see jobs advertised with time-based requirements – at least 5 years of X, at least 18 months of Y – and these are often very poor metrics for measuring skills. Would you rather have someone with ten years of bad habits, or someone with the aptitude to become proficient quickly and learn good habits from the start? Talented people with long experience in a post are probably looking for a more senior role anyway, not the same job somewhere else. In fact, many of the good candidates you can hire will be people from less-senior roles looking for something more challenging – they won’t have the experience, but might well be perfectly suitable.

Conclusion

In summary, there are two key points that I consider to be most important:
1. You want to encourage applicants, not filter them out. Keep your requirements realistic and remember that just because a candidate hasn’t done a role before, it doesn’t imply they aren’t capable.
2. Your literature is an advert. It should be engaging and exciting, not a dry collection of buzzwords. Don’t mindlessly recycle old literature – candidates are always told to tailor their applications and CVs, and we should show them the same courtesy.

That about wraps up my first post on recruitment – hopefully you found it worthwhile. Of course, there’s more to it than the literature, so expect to see future posts covering some of the other aspects.

Being realistic about large IT projects

This is well outside my normal area but I’ve been asked to think about the true costs of rolling out a major new (or replacement) system for a university, using VLE (virtual learning environment) as an example, and I’m assuming the system will have a projected lifetime of 10 years.

To start thinking about this I’m going to start with

try to divide it into more managable tasks
ask the community (Hi, that’s you!)

Phases

What seems clear is that we have some very distinct phases and that the costs, staff time and skills required will not be consistant.

Project management skills are required over the whole project until it’s fully deployed. As probably are user-experience skills in an ideal world.

Scoping;
- What do we need?
- Do we need to do anything at all?
- Could we just invest this cost+effort in the current system?
- Identifying stakeholders.
  - Need to make more effort to get typical end users represented not just “power users”.
Scouting
- What’s out there?
- What software and services could we use for this?
- Could we build it in house?
Decision point.
- Decide on a plan
- Cost it
- Get the costs agreed
- (at this stage I think the IT dept usually lowballs the estimates and then ends up borrowing from normal operations resources as a result)
Learning, prototypes and recruitment
- We’re going to need staff who grok this new system
- They’ll need training and practice
- …or we could try to hire them, but universities don’t have much wiggle room to pay for experienced specialists.
- …if we’re training our own staff we should give them a chance to practice and play. That will cost time and money and probably software licenses.
- We don’t do enough of this
Minimum viable product
- Once we start building for real we’ll need more people involved
- We generally seem to skip MVP and go straight to 100% deployed and if your feature isn’t in that version, tough luck.
- Experience has taught people that “phase 2 or later” features are often never done so it causes way too many things to go into “phase 1”.
- Information architecture within the system and integrating with other systems
- User experience oversight both for the system and it’s integration into our overall user experience (we don’t do enough of this)
- At this point the project also starts to use lots of resources from existing IT teams including
  - Data integration
  - Testing
  - Resource deployment (VMs, etc)
  - Interface design
  - Copywriting (documentation)
- Moving information from the legacy system is a big job here and involves the data owners as well as techies.
- Also need management of communication between the various teams and various stakeholders
  - I’m not convinced “putting documents on sharepoint” counts as genuine effective communication.
  - We shouldn’t be afraid of getting stakeholders and techies in the same room, or even from running “jam sessions” where ideas can be explored rather than comunicating via word documents. Also, done right, this makes the shareholders more tolerant and flexible and the techies feel more appreciated and proud of what they are doing
Content creation
- This is a separate job to branding and templating and linking
Phases 2,3 etc.
- This is something which should be budgeted for properly. Nothing is ever right first time and it takes 6 months of using it to understand how it could really be better.
- We should ensure that small win deployments happen now and then so users and stakeholders see there is progressive improvement. This will encourage constructive suggestions.
Business as usual
- The system will still require maintenance, security patches and whatnot
- Minor new features will be needed throughout it’s life
- Other systems will want data out of this for reporting and integration
- At this point we sometimes have the number of dedicated IT staff for a system drop to 1. It happens. And then they leave and the system gets dumped on someone who just caretakes it from then on. I’ve only see that happen on smaller systems, not Enterprise Apps.
Major upgrades
- These are effectively a project in their own right, but in addition to the obvious technical cost they will almost certainly cause knock on resource requirements if they alter the object-model, APIs etc. They can also require reworking the documentation.
End of life
- One day this service will be replaced with something new and shiny. At that time people will need a really good grasp of what processes it performs and what data it holds. Chances are those people all retired by now!

What actually happens is that the project phase-one pretty much always goes over time, usually by months or years. The reason for this is tricky to pin down for sure but is probably partly that really big IT projects are expensive to do right so we under budget them and then “borrow” from existing teams to resource the difference, hiding the true cost. I recently heard the comparison of a new university enterprise IT system to a new university building. The projected lifespan is a little different but I suspect the analogy holds up well, including the fact that you can move into a building even if not all the rooms are decorated yet, and that someone will need to find the budget to insure it and pay the heating bill.

This post was just a quick brainstorm to start thinking about the problem. What did I forget? Where am I naive or just plain wrong?

Posted in Uncategorized.

rev="post-1347" No comments

By Christopher Gutteridge – April 27, 2016

Generating Test Datasets (Part 1)

Overview

Obtaining an appropriate test dataset forms an integral part of the development and testing of any software system. It is not uncommon for the test dataset to be extracted from a live environment (or simply a clone of it). There are several reasons for taking this approach, however, there are also potential security and regulatory/legal issues that may arise from it, and other approaches should be investigated. The following considers two primary reasons for testing: system correctness, and performance testing/system sizing.

Data Generation Approaches

There are several approaches to generating or otherwise obtaining test datasets – using “real” data from production systems, using data from production systems which has been “anonymised” or otherwise “cleansed” of identifying data, using randomly-generated data, and using data that follows a model of “real” data, each of which have their own set of advantages and disadvantages.

Using Real Data

Using “real” data from production environments to populate the test environment is an understandable approach, and the reasons for adopting it generally revolve around a perception of convenience:

Restoring (or otherwise synchronising) a copy of the production database onto the testing server is normally a trivial task in terms of effort (if not always time, in the case of large databases). Given an established system, this will give a large set of data for little developer effort.
Given a sufficient volume of data, it is expected that the values present will vary across the available domain
Real-world data is not normally constrained by the assumptions of the programmers who wrote the system being tested, and thus is a good source of “odd” values (or combinations thereof).
If the system is to be connected to other related systems, then using “real” data makes testing the integration of these systems appear to be easier, as managing a test dataset between multiple systems or environments may not (appear to be) required.
If those conducting the testing are familiar with the data in the production environment, they may feel more comfortable with having this data in the testing environment as it is familiar to them, or that they feel it makes their task easier, as they can identify unexpected outcomes based on their pre-existing knowledge of the dataset.

However, there are also several significant downsides to this approach:

Assuming that the production environment data contains data related to individuals, and depending on the wording of the agreement the individuals entered into when their data was originally obtained, then using their data for testing is likely to constitute a breach of the Data Protection Act, as it may be being used for a purpose other than that which it was obtained for. Further, if the data is modified as part of the testing process, this may constitute an issue with the requirement for all data to be correct. According to the Information Commissioner, “The ICO advises that the use of personal data for system testing should be avoided. Where there is no practical alternative to using live data for this purpose, systems administrators should develop alternative methods of system testing. Should the Information Commissioner receive a complaint about the use of personal data for system testing, their first question to the data controller would be to ask why no alternative to the use of live data had been found”. Further, there may also be other regulatory requirements related to the specific type of data being stored (e.g. the FCA when dealing with financial data).
This approach increases the attack surface, when considering sensitive data. This means that the testing environment would need to be secured to the same degree as the production environment, including the monitoring of the system for suspicious activity. Further, it is possible that people who do not have access to the production environment have access to the testing environment, increasing the number of people with access to sensitive data.
It is possible that the release of code being tested contains new and/or otherwise undiscovered coding errors which result in a security vulnerability, e.g. which results in the leaking of data.
Dependent upon the maturity of the system, it is possible that the data it contains is not an accurate representation of the data it will contain in the future, for example, it may have a bias in its distribution (for example, older data may not follow the same trends as newer data due to altered processes, etc), which may misinform performance optimisations based upon data analysis. Alternatively, the data may not cover a large portion of the available domain, leaving edge cases untested.
The system may simply not contain enough data to supply a large-enough dataset, which would then require supplementing.

Using “Cleansed” Data

Given that there are advantages to using “real” data, a reasonable alternative is to try to “defuse” the potential privacy and regulatory problems by anonymising or removing the sensitive data (e.g. scrambling data by combining fields from different rows, replacing sensitive fields with fixed or random strings or null values, etc). When done correctly, this would retain the advantages of using “real” data, whilst allaying the privacy concerns. However, it does bring with it a distinct set of disadvantages:

Correctly anonymising a dataset so that it cannot be converted back into its original form, nor individuals otherwise be identified from the processed data, is a time-consuming and non-trivial task, which should be manually verified before proceeding. Care must be taken when deciding how to anonymise data, which fields are involved, and that all occurrences of the data are identified (it is likely that it will in reality be a combination or set of fields, which may vary based upon the data context).
Dependent upon the method used for anonymising the dataset, patterns existing in the data may be removed or obfuscated, or data which breaches logic rules (or that is otherwise of interest) may be removed. This may result in decisions made based upon data analysis being invalid (e.g. performance optimisations being informed by data distribution analysis).
Care needs to be taken with the management of the dataset in order to ensure new sensitive data doesn’t accidentally flow into it (e.g. via a feed from another system), nor that it is neither inadvertently lost nor damaged during testing.

Using Random Data

A further approach is to programmatically create random data, and populate the test dataset with that. This may be either through generating the entire dataset as random data, or hybridised by using “real” ancillary data (essentially look-ups) and randomised data for sensitive entities. This has several advantages:

Given that the data is literally random, there are no privacy concerns related to using it, as it doesn’t relate to any real entities.
It should be trivial to size the dataset generated to the amount of data required (i.e. there isn’t a problem if there is insufficient real data to generate a test dataset of the required size, as you can simply generate more). This is particularly relevant to performance forecasting.
When properly generated, it should be possible to have a dataset which covers a large portion of the data domain, and is free of assumptions made by the original programmer.
Tools of varying quality and expense are available, which can generate random data based on a defined schema and data rules (to ensure values “look” correct). These can reduce the amount of work required to produce the dataset to minimal.

There are, however, some disadvantages with this method:

Configuring and generating the dataset takes time and effort (varying by tool and dataset complexity).
Whilst it may “look like” “real data” on first glance, it is not – this façade of reality can lead to confusion. For example, if a dataset of 20 years’ of students is generated, whilst it might be valid in terms of validation rules for Student A to have a record in the first and last year, it probably would never happen – this can be jarring, and lead to people trying to find out why data looks “odd”, rather than examining the actual test cases.
Depending upon the sophistication of the tool being used, some data generated may violate complex data validation rules, or it may take some time to enter said validation into the tool.
The ability to regenerate the dataset exactly will vary by tool. Therefore, it is likely necessary to manage the dataset, to ensure that tests can be consistently run against it without needing to check or regenerate the data.
Given that the data is randomly generated, it may tend toward a uniform distribution, and not reflect the density, frequency, or range of real data. This may lead to decisions which are made based on data analysis (typically, index optimisation) being invalid.
The settings entered may mirror some assumptions made by the programmer regarding data distribution and/or domain, leading to edge cases not being explored, as they are effectively removed from the set of generated data (e.g. a field is populate with values “up to” instead of “up to and including” a specified value).

Using Modelled Random Data

The approach of using modelled random data takes using generic random data one step further – the data found in the production environment is analysed for patterns and characteristics, which are then adopted into the data generation algorithm. In the previous example of student records, it would be reasonable for the data generated for a given student to be constrained to rough time range. Still, there are some advantages over generic random data generation:

The data correlates with patterns in the live data, and thus “looks like” real data when examined, meaning that people are typically more comfortable when viewing it, and are less likely to question items found in it “on gut”.
Given that it models the “real” data, the distribution of the records should more closely match that of the data found in the production environment. This means that decisions based on data analysis (e.g. performance tuning) is more likely to be valid (this is more important if generating large volumes of data to simulate dataset growth).

There are, however, some disadvantages:

Analysing the data in the production data, and producing generation logic, is a time-consuming and non-trivial task. If it is not done correctly, most of the advantages of this approach are lost.
Generating the data is a more complex task, and may result in more expensive tools being required (or written).

Determining Data Volume

Whilst on first inspection it may appear that test datasets should be as large as possible in order to capture as many data combinations as possible, this may not be the correct approach, and may even be counter-productive.

Full Dataset

A “full” dataset is a dataset of a similar size to that of the data in the production environment, and is typically used when a large dataset is needed, e.g. for performance testing. Alternatively, it can be used as a “superset” from which candidate test rows can be identified and processed (this can be advantageous is multiple data-modifying tests need similar data at the same time). There are some disadvantages associated with this approach:

There is an assumption that the dataset is large enough to contain the required volume of data, as well as the necessary individual entries, which may not always be the case when dealing with “young” systems. If the dataset is not large enough, then this should be noted, and extra data generated.
Complete datasets can be quite large, especially when dealing with mature systems. This has an obvious cost in terms of disk space, etc, needed to support the dataset, along with processing time during testing and restore/revert times (if needed).
If being used as a superset from which candidate records are being selected, sometimes the volume of data can be counter-productive (“can’t see the wood for the trees”).

Sampled Dataset

A sampled dataset is simply a smaller dataset generated from a full dataset, using an algorithm to select part of the dataset – typically this will be “every n^th record”, or “select n% at random”, although other more complex selection methods exist (e.g. “select the first record for every combination of the following”). This has the advantage of reducing the volume of data that needs to be held. However, it does have several disadvantages:

The success of this method is dependent upon the effectiveness of the sampling method. If the sampling is carried out incorrectly, it is possible that the distribution of data in the dataset produced is skewed (when compared to the full dataset), the dataset is missing sets of values that it should contain, or that patterns/trends that are apparent in the full dataset are obfuscated or removed in the reduced dataset.
Correctly extracting records requires care, and can be non-trivial dependent upon the complexity of the data model, and requires knowledge of the data storage schema. For example, referential integrity needs to be maintained, which can involve many objects in a complex schema.
Determining the correct sampling method can be non-trivial.
If sampling is not carried out correctly, the resultant dataset may be too large (in which case many of the drawbacks of using a full dataset apply) or too small (in which case, there’s the chance required records do not appear).

Hand-Picked Records

This approach involves extracting just the records needed to verify a given test. This has the advantage that there are no extraneous records in the system to distract from the result of the test, and is most appropriate for individual tests. However, it has some drawbacks:

Identifying the records to be extracted requires knowledge of the data in the system, at both the application and data storage level.
Correctly extracting records requires care, and can be non-trivial dependent upon the complexity of the data model. For example, referential integrity needs to be maintained, which can involve many objects in a complex schema.
It is possible that a routine has a side-effect which affects other records, but due to the reduced volume of data present, the required conditions for this side-effect to fire are not met, and it goes undetected (e.g. the test dataset consists of a single record, and the function alters the current- and next record).
It is likely that different data needs identifying for each test, making this approach quite labour intensive.

Proposed Approach

Having considered the above approaches, it is proposed that we generate test datasets using a hybrid of the randomised, and modelled randomised approaches, which, where possible, shall be of a size representative of the full (or planned) dataset in terms of record count and data volume. It is intended that the generated data may be supplemented with real data in some cases. The reasoning for this is as follows:

In terms of benefit received for effort expended, random data generation offers the best “payoff”, as there are tools readily available which can perform the task automatically or near-automatically, e.g. automatically setting the datatype based on the column’s type, detecting the contents of a column by its name and generating appropriate data (e.g. if the column is called “TelephoneNumber”, the tool will automatically generate data that looks like phone numbers).
Whilst there are possible benefits from going with fully-modelled data, the increase in expended effort to do this correctly is generally not worth it – a sufficient volume of data will give an indication of performance, and performance testing will likely be done separately. Where randomly-generated is obviously not “realistic enough” in its distribution, then we will examine modelling this, e.g. if 10% of files are marked as “sensitive”, this is trivial to reflect, and will likely make the dataset more acceptable to those using it (either from a performance aspect, or people “eye-balling” the data).
By generating fully-modelled data, it is possible to inject flaws into the data that are a result of programmers’ assumptions – using mostly random “nonsense” data helps to avoid this. Some element of modelling, programmatic intervention, or the use of real data (where logic is dependent upon certain values being present) may be necessary in order for the data generated to comply with application logic rules that are in place.
Whilst there is an (entirely reasonable) argument that testing data is obviously testing data and thus does not have to have meaning, it is not uncommon for the data to be examined by humans, who may question unrelated aspects (this would be something akin to cognitive dissonance). Therefore, tweaking certain prominent aspects of the generated dataset will likely increase its acceptability if it is being shown to end-users for user-acceptance testing. For certain prominent non-sensitive aspects of datasets (e.g. names of programmes of study), we will look to use real data, or data generated from real data, in order to increase its realism and acceptance.
Generating a dataset of similar size to the (proposed or envisaged) production dataset is largely trivial given appropriate tools, and will allow developers to identify performance issues that are being created, where they may otherwise not be obvious with small datasets, hopefully removing a potential downstream problem. Obviously, this may not be possible due to space constraints, in which case the scale of the data will be reduced.

Coming Up

In the next part, we will look at putting the above into practice by generating a test dataset for one of our existing systems.

Posted in Database.

Tagged with Data, testing.

rev="post-1332" 1 comment

By Martin Chivers – April 26, 2016

Generating Test Datasets (Part 2)

A Brief Recap

In our previous post, we covered the various approaches to obtaining datasets for system testing, from using production data to modelling said data, along with their advantages and disadvantages. We proposed an approach using software tools to automate the generation of fake data, which resembled real data, mixed with “real” (non-personal) data where necessary.

Case Study – Practice Placements

For our team’s first attempt at generating a large set of test data, the University’s Practice Placements has been selected. This system is used to manage the allocation of placements to nursing students, and was chosen as it has a relatively complex model (c. 40 tables), which should highlight difficulties in generating data, and also the generated data, quickly. For the purposes of this study, we elected to use Red Gate’s Data Generator (RGDG), due to the simple expedient of it already being installed.

To generate the dataset, we followed the following process:

For datasets that are referenced by application logic these set of allowable values were extracted from the production system, saved into text files, and set to be loaded back into the relevant table(s). This step is not strictly necessary, as it’s possible to tell RGDG not to process the table (RGDG’s default behaviour is to wipe the table and re-generate the data), however, this approach removes the question, “why isn’t this table being processed? Has it been missed?”.
For datasets that are prominent to the end-user and therefore need to look “real”, we extracted datasets from production environments or other sources, saved them into text files, and set RGDG to pick either items from the dataset at random, or to generate entries by combining multiple rows at random. Examples of where this approach was employed include forenames (the in-built dataset is too short and too western, so a list of forenames was obtained), surnames (for the same reason), gender (the real-world data is not as “simple” as you might expect), and programmes of study and their associated codes (for end-user comfort with the data). Additionally, we extracted a list of organisation names, and used the “text shuffler” generator to create randomised combinations of words found within the list, given min/max length parameters, to generate something which “appeared familiar”.
Where the formatting of an item is important, and the type of data is particular to the University, or where it appears in multiple places, we wrote XML files which contain the definition of generators for these items (typically through the use of regular expressions), which are then referenced in order to limit the duplication (and accidental variance) of work. Examples include formatting of the University’s staff/student ID, usernames, and UCAS identifiers.
For each table, we set the volume of data to be generated. We did this through either setting an absolute number of rows to generate, or indicating a proportion of the number of rows in another table (e.g. generating a number of student placement allocations equal to 400% of the student table generates on average 4 allocated placements per student).
For each field, we checked the data generator that had been assigned, and where automatic matching was inappropriate, we manually set it. For the most part, the automatic matching worked well (or there was no obvious available match), although the most obvious example of failure of this were forename and surname being matched to nicknames (which manifests itself in the format of a forename and number), instead of first name and last name, and item names being interpreted as person names.
For fields where no appropriate generator exists, we selected one. Generally, we used the regular expression generator (e.g. to generate names of learning groups conforming to a particular format), although numeric ranges (e.g. to generate longitude and latitude co-ordinates roughly within Hampshire), and weighted lists (to probabilistically model some items of data) were also used.
Once the above steps had been completed, we hit the “big red button”, and after approximately 30s, we had 300,000 shiny new rows of data.

Having followed the above process, the following observations become fairly evident:

At a very basic level, generating large randomised datasets is simple. This is at least partially due to the tool that was used intelligently dealing with foreign key relationships, constraints, data type mapping, etc. However, attention must be paid to data that is required by the application, to ensure it is not missing or being randomised, and working through complex data models checking settings can be tiresome.
In some circumstances, it is necessary to add code to facilitate “correct” data generation (e.g. where values in one field logically depend upon those in another). This is a slight stumbling point, as extending RGDG v2 via .NET assemblies is powerful (e.g. the ability to define custom UIs) but time-consuming, and the abilities of the Python module to reference other data are a bit limited in this respect (v3 is better, but was not installed, and the matter is not helped by the fact I don’t know Python…).
If there is application logic which checks the data it is “being fed”, or there are business rules surrounding the content of the data which must be followed (either to appease the application or users who are testing it), then it is likely that additional work will be needed to generate data following these rules. It is likely that this may be non-trivial.
It became apparent that the application-level logic does not always match the database structure, resulting in application exceptions being generated when navigating around the system. For example, some fields are marked as required in the user interface, but are nullable in the database. Tickets have been generated to fix these.
As a general rule, the data looks at least semi-plausible (houses in the English Channel due to my bad longitude/latitude guessing excepted…), and provides a sufficient volume of data from which test cases could be selected whilst also giving an indication of likely performance.
Unless working with a small dataset, or being utterly fastidious, it is likely that when specifying the data generation settings, some things won’t be set correctly, and that multiple iterations of data generation may be required. Thankfully this is a painless process.
Resizing the generated dataset to any given size is largely trivial, and mostly involves clicking a button and waiting, although there is the obvious issue of disk space (this is likely more of an issue when developing/testing locally, or on a shared server). This can obviously be useful when investigating how a system may behave with significantly larger data loads.
Having a pre-written tool that can do most of the work for you makes the process of generating the data far easier, more enjoyable, and significantly quicker than having to do this yourself.

Acknowledgements

The above was written from a basis of personal experience, combined with the following sources:

Posted in Database.

Tagged with Data, testing.

rev="post-1333" No comments

By Martin Chivers – April 26, 2016

Introducing “Rube”

Meet Rube!

Rube is an abstraction to help us get over some of the dated and increasingly inaccurate cliches. Specifically using “your mom” when talking about naive web users. It was inspired by a tweet by Thomas Steiner after a great web-security keynote at WWW2016 by MEZ (known to her parents as Mary Ellen Zurko). It seems like there’s a need for a new hypothetical persona to discuss security questions without causing offence to our increasingly technically-literate mothers.

Rube.

Rube works for your company. Rube is a stand-up guy. He helps anybody move house if they just ask. He is the best at his job and the director of the company worries what would happen if Rube ever retires or leaves. A member of staff who is mean to Rube has made a grave career mistake. Rube runs the secret santa and convinced the boss to get a pool table for your office.

Sadly Rube has one fault… and it’s a doozy. Rube believes everything he is told and everything he reads on the Internet. He believes the boss when he says this is the best place to work in the whold world, but he also believs he going to be rich soon thanks to a Nigerian widow who just needed a little money to unlock her husband’s bank account.

Rube clicks on every link. He opens every email attachment. He also carefully reads and then ignores any security advice you send out. Rube will give his password to anybody who “seems trustworthy”. Rube is still waiting for the one unsolicited email that will make him rich.

He’ll never quit and never be fired but One day Rube might retire. On that day the director will promote his loathsome deputy Edward Zachery Markus Jones. A man with all the same gullibility and none of the talent. People call him E.Z.Mark (pron. Easy Mark)

Every Friday after work, Rube buys the entire IT department a round of drinks in the bar and listens humbly and nods when they explain basic security practice and every Monday he installs another browser toolbar he saw in a popup ad.

Your corporate security policy must take Rube into consideration. He can’t be fired and he can’t be trained.

Rube is an entirely public-domain concept, but please feel free to link back to this article. For bouus points, please leave comments with links to your Rube and E.Z.Mark stories and artwork!

Posted in Uncategorized.

rev="post-1327" No comments

By Christopher Gutteridge – April 15, 2016

Engineering Minecraft

[Post updated for 2017]

This Saturday I’m running an activity around Minecraft at the university Science and Engineering Family Day. This post will have a link to some of the resources we’re using if families want to try them out at home.

Handouts:

Minecraft Sci & Engineering Day Booklet

Maps:

Projects

Rail Model

Art

Ventnor Seafront, Isle of Wight (my Minecraft art project)

Useful tools:

Chunky – Renders raytraced pictures with reflections and shadows

Modpack loaders:

Posted in Uncategorized.

rev="post-1321" 5 comments

By Christopher Gutteridge – March 10, 2016

Feedback on Election Vocabulary

A peer on twitter asked for feedback on this vocabulary of terms to describe an election. I thought it might be helpful to make as a public post. No disrespect is intended and my views about vocab. design may not be in line with everybody elses. If I’m wrong; leave a comment!

[update: It seems this wasn’t the final version but a preview for me to review, so apologies if I gave the impression this was the official release]

http://linkedopendatang.com/schemas/election/v2/linkedopenelection.ttl

First of all, before I get nit-picky, this isn’t a disaster. It will solve an immediate need but there’s plenty of ways to make it more useful and reusable.

Validation

First of all it’s not a valid file! Always use a validator to check your RDF, and I generally recommend using a library to produce it. In fact a full validation chain would be:

Check the file can parse using a basic validator
Check for silly mistakes and typos using triple checker
Check for logical errors using an appropriate tool, eg. make sure it’s not got logical impossibilities.

rdfs:label

This vocabulary has very lasy choices for rdfs:label — these should be human-readable terms that make sense when said out loud. It almost never makes things better to prefix all the predicate labels with “has”. I get the impression that this vocab. is intended to describe primarily African elections and would be inclined to make the labels clearly specified as being in @en to make it easier to add translations later.

Too specific definitions

I would start by thinking about the most generic concepts for what an election is; it’s a vote where some or all of a pool of eligible voters register their choice(s) in a timeframe between two or more options. The options may be people to gain a political office, but that’s a special case. Multiple things may be decided during the same process, eg. voting for mayor and chief of police. In some voting systems people may vote for multiple candidates or rank their preferences. I see that the vocab. seems to also have the idea of candidates voting for bills and this should be the same basic concepts. Also, there’s no reason to restrict candidates to be foaf:Person– I believe that it’s quite common for inflatable bananas and bottles of beer to stand for election in student unions… possibly candidates could be considered foaf:Agent… and voters definitely can, but a set of internet-of-things devices or intelligent-agent-softwares can hold elections — no humans invovled.

Again, PollingBooth is very specific. I would suggest something more generic such as polling location or even polling node (not all polling booths will have meaningful real-world locations as things become more digital, but they still represent a collection of eligable voters and their resulting votes.)

A slight other nitpick is that I would have an Election be a subclass of a poll and have candidates be candidates for election to a role. A poll could be a broader question such as should we allow shops to trade on Sunday, or the recent vote on if Scotland should remain part of the UK.

Inconsistent approaches

The vocabulary contains both these terms election:hasConstituencyLat_Long and election:hasGeoLatitude. This is pretty muddled. It’s increasing the tasks for anybody parsing the data as they now have to cope with both a combined lat long type and a comma-separated one. I’d go with the separated one for both cases and just use geo:lat.

Reinventing wheels

This vocabulary has a number of very appropriate specific terms like Political Party and Candidate. Those belong as they add real value and indicate what terms you might reasonably expect associated with those entities.

However there’s terms which are just duplicates of existing well established terms and I would prefer to just use those existing terms. That way data tools which understand those terms will understand the data without any extra work.

The social media section terms can be easily replaced with FOAF equivalents.

election:hasDate is a subproperty of dc:date, and this one I’m a bit torn on as it does give a more specific meaning so I’ll allow it, but it should be datePromised not just hasDate to gain that value.

election:hasGeoLatitude and the other lat/long terms would be better replaced with geo:lat and geo:long.

hasURL might as well just be foaf:page unless there’s value to add and “hasURL” adds no more specific meaning.

election:image is nearly appropriate because it’s more specific than foaf:depiction as it’s a depiction of someone relating to a specific election, not any time in their life… However that still requires an explicit link between the image and the election. Imagine you compile RDF from three elections and the same candidate stood in all three. How would you detect which depiction is associated with that candidate in a specific election? This needs a bit of work.

Multiple elections

This leads me into a core issue. There’s a statement that the candidate is a person, and that they have a name, get votes etc. I think this is a mistake and a small tweak is required. I suggest that the person and their candidacy are different entities. One person may *be* a candidate, but their candidancy relates them to a single election with a result. Things like their image and their name may be more useful to link to their candidacy. For example Miss Gemma Smith may have run in the 2001 election, and lost and then after marriage Mrs Gemma Jones (same person) won in 2005. Both names are true for the foaf:Person but the candidacy has a speicific name for that elelction!

3 State Logic

There are some terms which capture if an official voted ‘yes’ or ‘no’ on an issue or bill. I would be tempted to rename these to “for” and “against. I would also suggest adding in “Formally Abstained” and “Did not vote”. Without logging abstentions there’s no way to tell the difference between “no data” and “abstained”. My suggestion of the “formally abstained” is to record when someone actively abstained, for example in the UK parliament walking through both lobbies in a division is a way to record that you were present but chose not to take a side.

Time sensitive terms

election:hasCounted makes me a bit nervous. It’s a boolean based on has an event occurred. I would be much happier with a timestamp recording when the counting completed. Without it you need to know the time the datafile was generated, and the RDF way of thinking is not to rely on document metadata, everything should be explicit.

The same applies to some of the other boolean terms which indicate an event or process is complete.

Promises…

The section on campaign promises seems out of place. It’s far more subjective than the other parts and likely to spark disagreement. I’d suggest adding some kind of citation to back up these statements, eg. a video, news article or press release.

The fraud and violence terms also are controversial and should require some kind of external justification using sources.

Conclusion

I think this effort is well worth continuing but would like to see it able to be used for all levels of democratic decisions not just first-past-the-post elections. I’ve tried to brain dump everything that might be useful and it’s not meant to detract from the hard work that’s gone into this project.

Posted in Best Practice, RDF.

rev="post-1315" No comments

By Christopher Gutteridge – February 16, 2016

One from the Vaults

When hunting for something unrelated I discoved a very old MSWord file containing the documentation for “Jerome” which was the ECS publication database that I worked on before EPrints.

1999

In 1999 I was 23 and only graduated a couple of years. There was no Wikipedia and I didn’t know any librarians. Dublin Core existed but I didn’t hear about it, and it would have been far harder to discover its existance.

At the same time, Rob Tansley was working on EPrints v1. I didn’t like EPrints, mostly because it was a rival to Jerome, which I’d worked hard on, and there was a lot of pressure to replace Jerome with EPrints.

Then Rob got a new job and was off. I found out later the new job was starting D-Space, the main rival to EPrints over the years. I had the EPrints code foisted upon me to do minor tweaks to get v1.0 out of the door, specifically adding support for OAI 1.0 when it was finalised. It also used CSV version control which I’d learned about but never really used. 16 years later, and I feel nervous if my shopping list isn’t in git.

I ended up deciding to make a few more tweaks to the codebase of EPrints to make it more configurable and internationalisable. I made some great choices, and some poor ones ($session as a god object was one of my worst ever). Some of those were based on what I learned from creating and running “Jerome”. A very bright intern worked for a summer holiday to help refactor much of the code to be more configuable. He’s now Dr Mike Jewell and sits about 10m from my current desk. Our work became EPrints 2.0, which was far more configurable assuming you were happy to edit XML & Perl configuration files. Much later came EPrints 3 with many more contributors, and which cleaned up much of the internals and 3.1 which introduced far more friendly configuration tools for the admin.

Over it’s lifetime EPrints has enable many researchers easier access to research and that was the mission. I have no pride in successfully helping people gather metrics, no joy in the embargo feature, and don’t even like the ability to restrict downloads, but I have immense pride that in some way I’ve contributed to science and research and that all started for me with Jerome.

Here’s the documentation I dug up, I’ve left the spelling mistakes as they were at the time.

Venue

A large group

What’s new? What’s not?

University websites and Computer Science

Job advertisements

Job descriptions

Person Specifications

Conclusion

Further reading

Phases

Overview

Data Generation Approaches

Using Real Data

Using “Cleansed” Data

Using Random Data

Using Modelled Random Data

Determining Data Volume

Full Dataset

Sampled Dataset

Hand-Picked Records

Proposed Approach

Coming Up

A Brief Recap

Case Study – Practice Placements

Acknowledgements

Validation

rdfs:label

Too specific definitions

Inconsistent approaches

Reinventing wheels

Multiple elections

3 State Logic

Time sensitive terms

Promises…

Conclusion

1999

Subscribe

Authors

Recent Posts

Meta

Blogroll

Tags