Southampton Web and Data Innovation Team

Ideas and Tips from the Team

Deliberately Tainted Data

Do you see tainted data? Call now…

In iSolutions, we run a fairly common setup for our services: a live copy, a pre-production copy and one or more development copies. It took me a while to come around to this approach, having mostly done seat-of-the-pants hacking on live services in the past, but come around to it I have. One problem we’ve encountered a number of times is that the pre-production service contains more or less the same data as the live service, so either people use it in error, or it sends real emails to real people based on test data.

A long time ago I came up with a way to massively reduce such incidents. Not stop, but reduce. The idea was inspired by the smell of natural gas. Natural gas doesn’t actually have much of a smell, but the distinctive smell of unburned gas is added artificially and makes it very easy to notice if there’s a leak. While this approach doesn’t directly stop explosions, it means that 99% of incidents are caught long before anything bad can happen.

ChristopheX GutteridgX

My idea is to add a “taint” to some columns in the dev and pre-production databases to make it obvious to a human that the data is tainted, without affecting testing. To do this I pick some free-text columns which are going to be frequently viewed in any user interface, for example Person_Forename, Person_Surname, Event_Title, Document_title. If a value has 3 or more characters, I replace the last one with a capital X. That way it doesn’t change the length of any data or notably change the indexing. So I would appear as “ChristopheX GutteridgX” and John Wu would be “JohX Wu”. It’s immediately obvious that something is off, but the system can be tested as usual, and if preprod or dev data ever ends up in a live system by accident, it gets spotted quickly. This can happen if a database hostname is accidentally included in the version-controlled part of the application, rather than in a config file outside the normal version control.
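A minimal sketch of the tainting rule described above, in Python against an in-memory SQLite database. The `person` table, the helper names and the choice of SQLite are assumptions for illustration only; nothing here is the actual script or database used in iSolutions.

```python
import sqlite3
from typing import Optional

TAINT_MIN_LENGTH = 3  # values shorter than this are left alone


def taint(value: Optional[str]) -> Optional[str]:
    """Replace the last character with 'X' if the value has 3+ characters."""
    if value is None or len(value) < TAINT_MIN_LENGTH:
        return value
    return value[:-1] + "X"


def taint_column(conn: sqlite3.Connection, table: str, column: str) -> None:
    """Apply the same rule to a whole column in SQL.

    Table and column names cannot be bound as parameters, so only pass
    trusted, hard-coded identifiers here.
    """
    conn.execute(
        f"UPDATE {table} SET {column} = "
        f"substr({column}, 1, length({column}) - 1) || 'X' "
        f"WHERE length({column}) >= ?",
        (TAINT_MIN_LENGTH,),
    )


def demo() -> list:
    """Build a hypothetical person table, taint it, and return the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person (Person_Forename TEXT, Person_Surname TEXT)"
    )
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [("Christopher", "Gutteridge"), ("John", "Wu")],
    )
    for column in ("Person_Forename", "Person_Surname"):
        taint_column(conn, "person", column)
    return conn.execute(
        "SELECT Person_Forename, Person_Surname FROM person ORDER BY rowid"
    ).fetchall()


if __name__ == "__main__":
    for forename, surname in demo():
        print(forename, surname)
```

Because the rule only overwrites the final character, running it again on already tainted data changes nothing, so it should be safe to apply after every refresh from live.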

This is no substitute for proper checks and processes, but it makes an excellent extra line of defence for no significant cost.

It works! (sample size: 1)

Today someone told me that their live database was showing tainted data. I’ve checked the database tables, and they have the correct untainted data, so I can deduce they’re still using an ODBC connection to the pre-prod database. A small victory, but it’s the first time this approach has paid off, so I wrote this blog post to celebrate.

I’m sure I can’t be the only person who’s thought of this. Does the technique have a name? Is it a good idea or an antipattern?

Posted in Best Practice, Database.

Tagged with Data.

By Christopher Gutteridge – August 27, 2015

5 Responses

Alex Bilbie says

What you’re describing is called “seeding” non-production databases.

Generally a library like Faker is used (variations are available in most languages), which generates fake but “real-looking” data. Generally you’ll go for a half-and-half approach whereby you use Faker to generate personal data but otherwise use real data for inanimate “public” objects (e.g. a list of buildings).

Python – https://github.com/joke2k/faker
PHP – https://github.com/fzaninotto/Faker
JS – https://github.com/marak/Faker.js/
Ruby – https://github.com/stympy/faker

August 27, 2015, 1:03 pm
- Christopher Gutteridge says
  
  Hmm. “Real looking” can be a problem. I want to use *real* data, but make it obvious that it’s not for real use.
  
  August 27, 2015, 1:25 pm
Rikki says

By putting the taint on the end, do you risk bad UIs (especially on mobile) cutting off the end of the field and users not noticing the taint?

August 28, 2015, 1:02 pm
- Christopher Gutteridge says
  
  It’s a line of defence, not the only line. It might fail in some circumstances but that’s OK. Smell-in-gas fails if someone has no sense of smell, or if there’s nobody there to smell it, but it sometimes works and that’s enough.
  
  August 28, 2015, 1:47 pm
Christopher Gutteridge says

Follow-up thought from a discussion we just had: it might be clever to rewrite all the email addresses in the dev system to be like real ones but at a test domain, e.g. so cjg@ecs.soton.ac.uk becomes cjg-ecs-soton-ac-uk@sotondebug.org, and we could test emails sent to sotondebug.org without ever accidentally emailing a real person.

August 28, 2015, 1:49 pm
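The email rewrite suggested in the last comment could be sketched roughly as follows. The function name is hypothetical, and two details are assumptions the comment doesn’t settle: dots in the local part are left untouched, and addresses already at the test domain are passed through unchanged so the rewrite can be run repeatedly.

```python
DEBUG_DOMAIN = "sotondebug.org"  # test domain named in the comment above


def to_debug_address(address: str, debug_domain: str = DEBUG_DOMAIN) -> str:
    """Fold a real address into the local part of a test-domain address."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    if domain.lower() == debug_domain:
        # Assumption: already rewritten, so leave it alone.
        return address
    # Dots in the original domain become hyphens, as in the example.
    folded_domain = domain.lower().replace(".", "-")
    return f"{local}-{folded_domain}@{debug_domain}"


if __name__ == "__main__":
    print(to_debug_address("cjg@ecs.soton.ac.uk"))
```

Keeping the original domain visible in the local part means a tester can still tell which real person a message would have gone to.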
