Skip to content

Generating Test Datasets (Part 2)

A Brief Recap

In our previous post, we covered the various approaches to obtaining datasets for system testing, from using production data to modelling said data, along with their advantages and disadvantages.  We proposed an approach using software tools to automate the generation of fake data, which resembled real data, mixed with “real” (non-personal) data where necessary.


Case Study – Practice Placements

For our team’s first attempt at generating a large set of test data, the University’s Practice Placements has been selected.  This system is used to manage the allocation of placements to nursing students, and was chosen as it has a relatively complex model (c. 40 tables), which should highlight difficulties in generating data, and also the generated data, quickly.  For the purposes of this study, we elected to use Red Gate’s Data Generator (RGDG), due to the simple expedient of it already being installed.

To generate the dataset, we followed the following process:

  1. For datasets that are referenced by application logic these set of allowable values were extracted from the production system, saved into text files, and set to be loaded back into the relevant table(s). This step is not strictly necessary, as it’s possible to tell RGDG not to process the table (RGDG’s default behaviour is to wipe the table and re-generate the data), however, this approach removes the question, “why isn’t this table being processed?  Has it been missed?”.
  2. For datasets that are prominent to the end-user and therefore need to look “real”, we extracted datasets from production environments or other sources, saved them into text files, and set RGDG to pick either items from the dataset at random, or to generate entries by combining multiple rows at random. Examples of where this approach was employed include forenames (the in-built dataset is too short and too western, so a list of forenames was obtained), surnames (for the same reason), gender (the real-world data is not as “simple” as you might expect), and programmes of study and their associated codes (for end-user comfort with the data).  Additionally, we extracted a list of organisation names, and used the “text shuffler” generator to create randomised combinations of words found within the list, given min/max length parameters, to generate something which “appeared familiar”.
  3. Where the formatting of an item is important, and the type of data is particular to the University, or where it appears in multiple places, we wrote XML files which contain the definition of generators for these items (typically through the use of regular expressions), which are then referenced in order to limit the duplication (and accidental variance) of work. Examples include formatting of the University’s staff/student ID, usernames, and UCAS identifiers.
  4. For each table, we set the volume of data to be generated. We did this through either setting an absolute number of rows to generate, or indicating a proportion of the number of rows in another table (e.g. generating a number of student placement allocations equal to 400% of the student table generates on average 4 allocated placements per student).
  5. For each field, we checked the data generator that had been assigned, and where automatic matching was inappropriate, we manually set it. For the most part, the automatic matching worked well (or there was no obvious available match), although the most obvious example of failure of this were forename and surname being matched to nicknames (which manifests itself in the format of a forename and number), instead of first name and last name, and item names being interpreted as person names.
  6. For fields where no appropriate generator exists, we selected one. Generally, we used the regular expression generator (e.g. to generate names of learning groups conforming to a particular format), although numeric ranges (e.g. to generate longitude and latitude co-ordinates roughly within Hampshire), and weighted lists (to probabilistically model some items of data) were also used.
  7. Once the above steps had been completed, we hit the “big red button”, and after approximately 30s, we had 300,000 shiny new rows of data.

Having followed the above process, the following observations become fairly evident:

  1. At a very basic level, generating large randomised datasets is simple. This is at least partially due to the tool that was used intelligently dealing with foreign key relationships, constraints, data type mapping, etc.  However, attention must be paid to data that is required by the application, to ensure it is not missing or being randomised, and working through complex data models checking settings can be tiresome.
  2. In some circumstances, it is necessary to add code to facilitate “correct” data generation (e.g. where values in one field logically depend upon those in another). This is a slight stumbling point, as extending RGDG v2 via .NET assemblies is powerful (e.g. the ability to define custom UIs) but time-consuming, and the abilities of the Python module to reference other data are a bit limited in this respect (v3 is better, but was not installed, and the matter is not helped by the fact I don’t know Python…).
  3. If there is application logic which checks the data it is “being fed”, or there are business rules surrounding the content of the data which must be followed (either to appease the application or users who are testing it), then it is likely that additional work will be needed to generate data following these rules.  It is likely that this may be non-trivial.
  4. It became apparent that the application-level logic does not always match the database structure, resulting in application exceptions being generated when navigating around the system. For example, some fields are marked as required in the user interface, but are nullable in the database.  Tickets have been generated to fix these.
  5. As a general rule, the data looks at least semi-plausible (houses in the English Channel due to my bad longitude/latitude guessing excepted…), and provides a sufficient volume of data from which test cases could be selected whilst also giving an indication of likely performance.
  6. Unless working with a small dataset, or being utterly fastidious, it is likely that when specifying the data generation settings, some things won’t be set correctly, and that multiple iterations of data generation may be required. Thankfully this is a painless process.
  7. Resizing the generated dataset to any given size is largely trivial, and mostly involves clicking a button and waiting, although there is the obvious issue of disk space (this is likely more of an issue when developing/testing locally, or on a shared server). This can obviously be useful when investigating how a system may behave with significantly larger data loads.
  8. Having a pre-written tool that can do most of the work for you makes the process of generating the data far easier, more enjoyable, and significantly quicker than having to do this yourself.


The above was written from a basis of personal experience, combined with the following sources:

Posted in Data, Database, testing.

0 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.

Some HTML is OK

or, reply to this post via trackback.