Deliberately Tainted Data

Do you see tainted data? Call ServiceLine now on...

In iSolutions, we run a fairly common setup for services: a live copy, a pre-production copy and one or more development copies. It took me a while to come around to this approach, having mostly done seat-of-the-pants hacking on live services in the past, but come around to it I have. One problem we’ve encountered a number of times is that the pre-production service contained more or less the same data as the live service, so either people used it in error, or it sent real emails to real people based on test data.

A long time ago I came up with a way to massively reduce such incidents. Not stop, but reduce. The idea was inspired by the smell of natural gas. Natural gas doesn’t actually have much of a smell, but the distinctive smell of unburned gas is added artificially and makes it very easy to notice if there’s a leak. While this approach doesn’t directly stop explosions, it means that 99% of incidents are caught long before anything bad can happen.

ChristopheX GutteridgX

My idea is to add a “taint” to some columns in the dev and pre-production databases to make it obvious to a human that the data is tainted, without affecting testing. To do this I pick some free-text columns which are going to be frequently viewed in any user interface, for example Person_Forename, Person_Surname, Event_Title, Document_title. If a value has 3 or more characters, I replace its last character with a capital X. That way it doesn’t change the length of any data or notably change the indexing. So I would appear as “ChristopheX GutteridgX” and John Wu would be “JohX Wu”. It’s immediately obvious that something is off, but the system can be tested as usual. If preprod or dev data ever accidentally ends up in a live system, it’s immediately obvious. This can happen if a database hostname is accidentally included in the version-controlled part of the application, rather than in a config file outside the normal version control.
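
As a rough illustration of the rule above, here is a minimal Python sketch. The function names, the `TAINTED_COLUMNS` list and the dictionary-per-row representation are assumptions made for the example; the post doesn’t say how the taint is actually applied (it could just as well be done in SQL when the copy is refreshed).

```python
def taint(value: str) -> str:
    """Return value with its last character replaced by 'X' if it has 3+ characters."""
    if len(value) < 3:
        return value  # short values such as "Wu" are left alone
    return value[:-1] + "X"  # same length as the input


# Hypothetical column list for illustration, taken from the examples in the post.
TAINTED_COLUMNS = ("Person_Forename", "Person_Surname", "Event_Title", "Document_title")


def taint_row(row: dict) -> dict:
    """Return a copy of row with the chosen free-text columns tainted."""
    return {
        column: taint(value) if column in TAINTED_COLUMNS and isinstance(value, str) else value
        for column, value in row.items()
    }


# Hypothetical row; Person_ID is not in the column list, so it is untouched.
row = {"Person_Forename": "John", "Person_Surname": "Wu", "Person_ID": 42}
tainted = taint_row(row)
print(tainted["Person_Forename"], tainted["Person_Surname"])
```

Because only the final character changes, lengths stay the same and sort order is barely affected, which is the property the post relies on.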

This is no substitute for proper checks and processes, but it makes an excellent extra line of defence for no significant cost.

It works! (sample size: 1)

Today someone told me that their live database is showing tainted data. I’ve checked the database tables, and they have the correct untainted data, so I can deduce he’s still using an ODBC connection to the pre-prod database. A small victory, but it’s the first time this approach has paid off, so I wrote this blog post to celebrate.

I’m sure I can’t be the only person who’s thought of this. Does the technique have a name? Is it a good idea or an antipattern?

Posted in Best Practice, Data, Database.


5 Responses

  1. Alex Bilbie says

    What you’re describing is called “seeding” non-production databases.

    Generally a library like Faker is used (variations are available in most languages), which generates fake but “real-looking” data. Often you’ll go for a half-and-half approach, using Faker to generate personal data but otherwise using real data for inanimate “public” objects (e.g. a list of buildings).

    Python – https://github.com/joke2k/faker
    PHP – https://github.com/fzaninotto/Faker
    JS – https://github.com/marak/Faker.js/
    Ruby – https://github.com/stympy/faker

    • Christopher Gutteridge says

      Hmm. “Real looking” can be a problem. I want to use *real* data, but make it obvious that it’s not for real use.

  2. Rikki says

    By putting the taint on the end, do you risk bad UIs (especially on mobile) cutting off the end of the field and users not noticing the taint?

    • Christopher Gutteridge says

      It’s a line of defence, not the only line. It might fail in some circumstances but that’s OK. Smell-in-gas fails if someone has no sense of smell, or if there’s nobody there to smell it, but it sometimes works and that’s enough.

  3. Christopher Gutteridge says

    Follow-up thought from a discussion we just had: it might be clever to rewrite all the email addresses in the dev system to look like the real ones but at a test domain, e.g. cjg@ecs.soton.ac.uk becomes cjg-ecs-soton-ac-uk@sotondebug.org, so we could test emails sent to sotondebug.org without ever accidentally emailing a real person.
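
The address rewrite in that last comment could be sketched in Python roughly as follows. This is only an illustration: the function name is hypothetical, and turning dots in the local part into hyphens is an assumption, since the comment only gives the single cjg@ecs.soton.ac.uk example.

```python
DEBUG_DOMAIN = "sotondebug.org"  # test domain named in the comment


def to_debug_address(address: str, debug_domain: str = DEBUG_DOMAIN) -> str:
    """Fold a real address into the local part of an address at the test domain."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return address  # not something we recognise as an address; leave it alone
    # Assumption: dots on both sides of the "@" become hyphens.
    flattened = f"{local}-{domain}".replace(".", "-")
    return f"{flattened}@{debug_domain}"


print(to_debug_address("cjg@ecs.soton.ac.uk"))
```

One caveat with this simple mapping: distinct real addresses can collide (a dot and a hyphen in the original both end up as a hyphen), which is probably fine for test mail but worth knowing.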


