Skip to content

Generating Test Datasets (Part 1)


Obtaining an appropriate test dataset forms an integral part of the development and testing of any software system.  It is not uncommon for the test dataset to be extracted from a live environment (or simply a clone of it).  There are several reasons for taking this approach, however, there are also potential security and regulatory/legal issues that may arise from it, and other approaches should be investigated.  The following considers two primary reasons for testing: system correctness, and performance testing/system sizing.

Data Generation Approaches

There are several approaches to generating or otherwise obtaining test datasets – using “real” data from production systems, using data from production systems which has been “anonymised” or otherwise “cleansed” of identifying data, using randomly-generated data, and using data that follows a model of “real” data, each of which have their own set of advantages and disadvantages.

Using Real Data

Using “real” data from production environments to populate the test environment is an understandable approach, and the reasons for adopting it generally revolve around a perception of convenience:

  • Restoring (or otherwise synchronising) a copy of the production database onto the testing server is normally a trivial task in terms of effort (if not always time, in the case of large databases). Given an established system, this will give a large set of data for little developer effort.
  • Given a sufficient volume of data, it is expected that the values present will vary across the available domain
  • Real-world data is not normally constrained by the assumptions of the programmers who wrote the system being tested, and thus is a good source of “odd” values (or combinations thereof).
  • If the system is to be connected to other related systems, then using “real” data makes testing the integration of these systems appear to be easier, as managing a test dataset between multiple systems or environments may not (appear to be) required.
  • If those conducting the testing are familiar with the data in the production environment, they may feel more comfortable with having this data in the testing environment as it is familiar to them, or that they feel it makes their task easier, as they can identify unexpected outcomes based on their pre-existing knowledge of the dataset.

However, there are also several significant downsides to this approach:

  • Assuming that the production environment data contains data related to individuals, and depending on the wording of the agreement the individuals entered into when their data was originally obtained, then using their data for testing is likely to constitute a breach of the Data Protection Act, as it may be being used for a purpose other than that which it was obtained for. Further, if the data is modified as part of the testing process, this may constitute an issue with the requirement for all data to be correct.  According to the Information Commissioner, “The ICO advises that the use of personal data for system testing should be avoided. Where there is no practical alternative to using live data for this purpose, systems administrators should develop alternative methods of system testing. Should the Information Commissioner receive a complaint about the use of personal data for system testing, their first question to the data controller would be to ask why no alternative to the use of live data had been found”.  Further, there may also be other regulatory requirements related to the specific type of data being stored (e.g. the FCA when dealing with financial data).
  • This approach increases the attack surface, when considering sensitive data. This means that the testing environment would need to be secured to the same degree as the production environment, including the monitoring of the system for suspicious activity.  Further, it is possible that people who do not have access to the production environment have access to the testing environment, increasing the number of people with access to sensitive data.
  • It is possible that the release of code being tested contains new and/or otherwise undiscovered coding errors which result in a security vulnerability, e.g. which results in the leaking of data.
  • Dependent upon the maturity of the system, it is possible that the data it contains is not an accurate representation of the data it will contain in the future, for example, it may have a bias in its distribution (for example, older data may not follow the same trends as newer data due to altered processes, etc), which may misinform performance optimisations based upon data analysis. Alternatively, the data may not cover a large portion of the available domain, leaving edge cases untested.
  • The system may simply not contain enough data to supply a large-enough dataset, which would then require supplementing.

Using “Cleansed” Data

Given that there are advantages to using “real” data, a reasonable alternative is to try to “defuse” the potential privacy and regulatory problems by anonymising or removing the sensitive data (e.g. scrambling data by combining fields from different rows, replacing sensitive fields with fixed or random strings or null values, etc).  When done correctly, this would retain the advantages of using “real” data, whilst allaying the privacy concerns.  However, it does bring with it a distinct set of disadvantages:

  • Correctly anonymising a dataset so that it cannot be converted back into its original form, nor individuals otherwise be identified from the processed data, is a time-consuming and non-trivial task, which should be manually verified before proceeding. Care must be taken when deciding how to anonymise data, which fields are involved, and that all occurrences of the data are identified (it is likely that it will in reality be a combination or set of fields, which may vary based upon the data context).
  • Dependent upon the method used for anonymising the dataset, patterns existing in the data may be removed or obfuscated, or data which breaches logic rules (or that is otherwise of interest) may be removed. This may result in decisions made based upon data analysis being invalid (e.g. performance optimisations being informed by data distribution analysis).
  • Care needs to be taken with the management of the dataset in order to ensure new sensitive data doesn’t accidentally flow into it (e.g. via a feed from another system), nor that it is neither inadvertently lost nor damaged during testing.

Using Random Data

A further approach is to programmatically create random data, and populate the test dataset with that.  This may be either through generating the entire dataset as random data, or hybridised by using “real” ancillary data (essentially look-ups) and randomised data for sensitive entities.  This has several advantages:

  • Given that the data is literally random, there are no privacy concerns related to using it, as it doesn’t relate to any real entities.
  • It should be trivial to size the dataset generated to the amount of data required (i.e. there isn’t a problem if there is insufficient real data to generate a test dataset of the required size, as you can simply generate more). This is particularly relevant to performance forecasting.
  • When properly generated, it should be possible to have a dataset which covers a large portion of the data domain, and is free of assumptions made by the original programmer.
  • Tools of varying quality and expense are available, which can generate random data based on a defined schema and data rules (to ensure values “look” correct). These can reduce the amount of work required to produce the dataset to minimal.

There are, however, some disadvantages with this method:

  • Configuring and generating the dataset takes time and effort (varying by tool and dataset complexity).
  • Whilst it may “look like” “real data” on first glance, it is not – this façade of reality can lead to confusion. For example, if a dataset of 20 years’ of students is generated, whilst it might be valid in terms of validation rules for Student A to have a record in the first and last year, it probably would never happen – this can be jarring, and lead to people trying to find out why data looks “odd”, rather than examining the actual test cases.
  • Depending upon the sophistication of the tool being used, some data generated may violate complex data validation rules, or it may take some time to enter said validation into the tool.
  • The ability to regenerate the dataset exactly will vary by tool. Therefore, it is likely necessary to manage the dataset, to ensure that tests can be consistently run against it without needing to check or regenerate the data.
  • Given that the data is randomly generated, it may tend toward a uniform distribution, and not reflect the density, frequency, or range of real data. This may lead to decisions which are made based on data analysis (typically, index optimisation) being invalid.
  • The settings entered may mirror some assumptions made by the programmer regarding data distribution and/or domain, leading to edge cases not being explored, as they are effectively removed from the set of generated data (e.g. a field is populate with values “up to” instead of “up to and including” a specified value).

Using Modelled Random Data

The approach of using modelled random data takes using generic random data one step further – the data found in the production environment is analysed for patterns and characteristics, which are then adopted into the data generation algorithm.  In the previous example of student records, it would be reasonable for the data generated for a given student to be constrained to rough time range.  Still, there are some advantages over generic random data generation:

  • The data correlates with patterns in the live data, and thus “looks like” real data when examined, meaning that people are typically more comfortable when viewing it, and are less likely to question items found in it “on gut”.
  • Given that it models the “real” data, the distribution of the records should more closely match that of the data found in the production environment. This means that decisions based on data analysis (e.g. performance tuning) is more likely to be valid (this is more important if generating large volumes of data to simulate dataset growth).

There are, however, some disadvantages:

  • Analysing the data in the production data, and producing generation logic, is a time-consuming and non-trivial task. If it is not done correctly, most of the advantages of this approach are lost.
  • Generating the data is a more complex task, and may result in more expensive tools being required (or written).

Determining Data Volume

Whilst on first inspection it may appear that test datasets should be as large as possible in order to capture as many data combinations as possible, this may not be the correct approach, and may even be counter-productive.

Full Dataset

A “full” dataset is a dataset of a similar size to that of the data in the production environment, and is typically used when a large dataset is needed, e.g. for performance testing.  Alternatively, it can be used as a “superset” from which candidate test rows can be identified and processed (this can be advantageous is multiple data-modifying tests need similar data at the same time).  There are some disadvantages associated with this approach:

  • There is an assumption that the dataset is large enough to contain the required volume of data, as well as the necessary individual entries, which may not always be the case when dealing with “young” systems. If the dataset is not large enough, then this should be noted, and extra data generated.
  • Complete datasets can be quite large, especially when dealing with mature systems. This has an obvious cost in terms of disk space, etc, needed to support the dataset, along with processing time during testing and restore/revert times (if needed).
  • If being used as a superset from which candidate records are being selected, sometimes the volume of data can be counter-productive (“can’t see the wood for the trees”).

Sampled Dataset

A sampled dataset is simply a smaller dataset generated from a full dataset, using an algorithm to select part of the dataset – typically this will be “every nth record”, or “select n% at random”, although other more complex selection methods exist (e.g. “select the first record for every combination of the following”).  This has the advantage of reducing the volume of data that needs to be held.  However, it does have several disadvantages:

  • The success of this method is dependent upon the effectiveness of the sampling method. If the sampling is carried out incorrectly, it is possible that the distribution of data in the dataset produced is skewed (when compared to the full dataset), the dataset is missing sets of values that it should contain, or that patterns/trends that are apparent in the full dataset are obfuscated or removed in the reduced dataset.
  • Correctly extracting records requires care, and can be non-trivial dependent upon the complexity of the data model, and requires knowledge of the data storage schema. For example, referential integrity needs to be maintained, which can involve many objects in a complex schema.
  • Determining the correct sampling method can be non-trivial.
  • If sampling is not carried out correctly, the resultant dataset may be too large (in which case many of the drawbacks of using a full dataset apply) or too small (in which case, there’s the chance required records do not appear).

Hand-Picked Records

This approach involves extracting just the records needed to verify a given test.  This has the advantage that there are no extraneous records in the system to distract from the result of the test, and is most appropriate for individual tests.  However, it has some drawbacks:

  • Identifying the records to be extracted requires knowledge of the data in the system, at both the application and data storage level.
  • Correctly extracting records requires care, and can be non-trivial dependent upon the complexity of the data model. For example, referential integrity needs to be maintained, which can involve many objects in a complex schema.
  • It is possible that a routine has a side-effect which affects other records, but due to the reduced volume of data present, the required conditions for this side-effect to fire are not met, and it goes undetected (e.g. the test dataset consists of a single record, and the function alters the current- and next record).
  • It is likely that different data needs identifying for each test, making this approach quite labour intensive.

Proposed Approach

Having considered the above approaches, it is proposed that we generate test datasets using a hybrid of the randomised, and modelled randomised approaches, which, where possible, shall be of a size representative of the full (or planned) dataset in terms of record count and data volume.  It is intended that the generated data may be supplemented with real data in some cases.  The reasoning for this is as follows:

  1. In terms of benefit received for effort expended, random data generation offers the best “payoff”, as there are tools readily available which can perform the task automatically or near-automatically, e.g. automatically setting the datatype based on the column’s type, detecting the contents of a column by its name and generating appropriate data (e.g. if the column is called “TelephoneNumber”, the tool will automatically generate data that looks like phone numbers).
  2. Whilst there are possible benefits from going with fully-modelled data, the increase in expended effort to do this correctly is generally not worth it – a sufficient volume of data will give an indication of performance, and performance testing will likely be done separately. Where randomly-generated is obviously not “realistic enough” in its distribution, then we will examine modelling this, e.g. if 10% of files are marked as “sensitive”, this is trivial to reflect, and will likely make the dataset more acceptable to those using it (either from a performance aspect, or people “eye-balling” the data).
  3. By generating fully-modelled data, it is possible to inject flaws into the data that are a result of programmers’ assumptions – using mostly random “nonsense” data helps to avoid this. Some element of modelling, programmatic intervention, or the use of real data (where logic is dependent upon certain values being present) may be necessary in order for the data generated to comply with application logic rules that are in place.
  4. Whilst there is an (entirely reasonable) argument that testing data is obviously testing data and thus does not have to have meaning, it is not uncommon for the data to be examined by humans, who may question unrelated aspects (this would be something akin to cognitive dissonance). Therefore, tweaking certain prominent aspects of the generated dataset will likely increase its acceptability if it is being shown to end-users for user-acceptance testing.  For certain prominent non-sensitive aspects of datasets (e.g. names of programmes of study), we will look to use real data, or data generated from real data, in order to increase its realism and acceptance.
  5. Generating a dataset of similar size to the (proposed or envisaged) production dataset is largely trivial given appropriate tools, and will allow developers to identify performance issues that are being created, where they may otherwise not be obvious with small datasets, hopefully removing a potential downstream problem. Obviously, this may not be possible due to space constraints, in which case the scale of the data will be reduced.

Coming Up

In the next part, we will look at putting the above into practice by generating a test dataset for one of our existing systems.

Posted in Data, Database, testing.

One Response

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.

  1. Christopher Gutteridge says

    Hmm. It seems like the main concern here is around data about *people*. That’s entirely resaonable, but it makes less sense for parts of systems that deal with say courses, or carparks.

    Random phone numbers and email addresses are fine if you’re sure you won’t call/email them. For email I’d be tempted to generate them all going to a testing SMTP server so you could see them actually getting sent.

    When generating random or modelled data it would be helpful to be able to generate IDs and usernames that would never clash with real ones.

    For data integration this is going to be a pain unless all pre-production systems can use the same fake data. If we started with a good fake users/accounts/people database everything could flow from that. However every test I’ve ever done with end users involves them looking up some record (theirs or a tutuee) they are familiar with to check how it’s represented in the new system

Some HTML is OK

or, reply to this post via trackback.