I am not a lawyer. This blog post represents my own understanding and not an official view of my employer.
The key thing with GDPR is that we’re not going to be perfect, but we can certainly improve. I’ve been working on my team’s GDPR activities since January, and as our team deals with many small, innovative and legacy databases, this was… interesting. I’ve audited 88 “information assets”, most of which contain information about people in some way or another.
The single most useful thing we’ve done so far is turn off (and erase) a few datasets and services that were not actually required. I’ve also identified about 4 datasets that I was pulling from source and pushing into systems that no longer exist or now get their data elsewhere.
I’ve been trying to boil things down to a few areas for people to be thinking about. This isn’t a complete list, it’s more about how to spend limited resources in the most appropriate way.
Why bother?
The obvious answer is “to avoid fines”, but the better answer is “to do right by our users”. Data breaches should be minimised, but we shouldn’t just concentrate on avoiding them; we should also minimise the impact if they do happen. Not holding data inappropriately, and not keeping it in inappropriate places or giving inappropriate access to it, can go a long way towards reducing the actual harm caused. Keep this in mind and you’ll be on the right track.
Audit all the things!
We need to be aware of what systems we are responsible for that have personal data, what data it contains and why it’s used.
This chore is largely done for our team.
Start with high risks
A common problem I have with addressing GDPR is “what about… <obscure system that has a users table>?”
Many of our systems have records of usernames or other IDs that caused actions, eg. approved something, updated a wiki page, updated a webpage. While these are technically in scope of GDPR, they are at the bottom of the TODO list.
For example: we’ve a few dozen sites using the Drupal CMS. Each site has under ten users. The user records may contain a person’s name, and they do contain their email address. They contain a list of what that person has edited on the site. There’s also the implicit metadata that this list of users are the editors of the website. However, the risk of breach is low to medium. Drupal is a target for hackers, but generally not to steal this kind of data[citation required] (this is an admitted assumption on my part). The damage done by such a breach is also relatively low compared to leaking a larger list or more detailed information.
I find making these calculations difficult, because it feels like I’m saying such a breach doesn’t matter. It does, but other risks matter more and should receive more investment of resources, both to prevent them happening and to mitigate the consequences if they do happen. Which brings me nicely to:
Shut off anything unnecessary
This is work we should be doing right now!
Any system which is no longer needed should be shut off.
Systems that have been shut off for over a year should probably have their data erased entirely. It’s tempting to keep things for ever “just in case”. We have to stop doing that. If in doubt, agree it with management and/or the person doing the role that owns that data.
Some systems have unused tables or fields with data about people. These should just be removed.
Some systems have data feeds to/from other systems which provide more information about people than is required. Any fields that aren’t required should be identified and removed from the feed, rather than just ignored by the target system (which is what we sometimes lazily do now).
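As a sketch of the sort of thing I mean (the source table and field names here are invented), a feed export that only ever sends the agreed fields might look like:

```python
# Sketch of a feed export that only sends the fields the target system has
# agreed it needs. The "staff" table and the field names are made up.
import csv
import sqlite3

ALLOWED_FIELDS = ["username", "display_name", "department"]

def export_feed(db_path, out_path):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT * FROM staff")
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=ALLOWED_FIELDS)
        writer.writeheader()
        for row in rows:
            # Drop everything except the agreed fields before the data leaves
            # this system, rather than relying on the target to ignore them.
            writer.writerow({field: row[field] for field in ALLOWED_FIELDS})
    conn.close()
```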
Remove unused cached copies of personal data.
A more subtle thing is also who has access to data about other people. It’s easiest to remove data entirely, but if that’s not possible then consider how to restrict access to only people who need it. Does everyone need to see the fields with more sensitive information?
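For example, a report or API layer could strip the sensitive fields for anyone who doesn’t need them. A minimal sketch, with made-up field names and a single “HR” role standing in for whatever the real system uses:

```python
# Minimal sketch of filtering sensitive fields by role. The field names and
# the single HR role are placeholders for whatever the real system uses.
SENSITIVE_FIELDS = {"reason_for_leave", "home_address"}

def visible_record(record, requester_is_hr):
    # People who genuinely need the sensitive fields see the whole record;
    # everyone else only sees the non-sensitive fields.
    if requester_is_hr:
        return dict(record)
    return {key: value for key, value in record.items()
            if key not in SENSITIVE_FIELDS}
```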
It’s worth telling our GDPR team when work like this is done, so they can note that the work was done, or note that the system is now off/erased.
Reminder; sensitive data
Breaches regarding data which can cause additional harm to the subject are treated as more serious. The official list is as follows:
- the racial or ethnic origin of the data subject,
- their political opinions,
- their religious beliefs or other beliefs of a similar nature,
- whether they are a member of a trade union (within the meaning of the Trade Union and Labour Relations (Consolidation) Act 1992),
- their physical or mental health or condition,
- their sexual life,
- the commission or alleged commission by them of any offence, or
- any proceedings for any offence committed or alleged to have been committed by them, the disposal of such proceedings or the sentence of any court in such proceedings.
I’ve been checking for any unexpected issues and found a few surprises:
- membership of certain mailing lists could indicate trade union membership, sexuality, ethnic origin, religion.
- “reason for leave”, ie. why someone was off work, can include physical and mental health info.
- I also read through every reason a student has had an extension to a coursework deadline, as this is recorded in the ECS Handin system. It’s a free-text field, but thankfully people have used it responsibly and just list where the authority for such an extension came from. Although there’s a batch that just say “volcano”, which is the coolest excuse for handing your coursework in late!
Data protection statements
This is another thing we should already be doing.
Any service which collects information from people who are not current members of the university should have a statement clearly saying how that data will be used. If you think you might want to use it for another purpose (eg. analysis) later, say so, but don’t be too vague. eg. If someone signs up for a university open day, are we going to add them to our database to send them more information on related topics, or keep this request on file for analysis? We probably are, so we should say that.
EPrints lets people request an unavailable paper, and that logs the request and their contact info and passes it on to the creators of the paper. You know what? We probably do want to do some analysis on how that service is used, so we should say so up front. I’m thinking something like
“We may also use this information to help us understand and improve how this service is used. Other than passing your information to the creators of this work, we won’t share individual details outside the University of Southampton, but data aggregated by internet domain, country or organisation might be shared or published in the future.”
While most of the information our staff and students submit into our systems probably doesn’t need additional data protection sign-off, it may still be required if we’re going to use that data for something unexpected or not to do with their relationship to the university. eg. If we collected data on how email was used by our own members for service improvement, that probably doesn’t need a specific statement. If we were using it for a research project, then consent would be required. If in doubt, ask the GDPR office.
Data retention periods
For all retention, it’s a trade-off. It may harm people to keep the data too long. It may harm people not to keep it long enough.
The university has several key sets of people we have data about:
- Students
- Staff (and “Visitors” who are treated like staff on our IT systems)
- People who are neither staff nor students but who interact with us.
- Research subjects (eg. people a research project collected data on)
Research projects’ data retention is usually very clear, and handled as part of ethics. As GDPR beds in, the GDPR principles should be incorporated, but consent is generally already given.
Data on interactions with the public (eg. open days, logged IP addresses, conference delegates) will all have an appropriate retention period but it’s not yet clear what these will be.
For data about staff and students the retention period will either be years since the data was created, or years since the student graduated or the person left the university. Possibly it could be a period after they leave the job post.
What we should be doing right now is having a plan for how either of these retention policies could be implemented. I think it’s more likely that the years-since-data-creation method will be used for most things as it’s so much simpler.
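A years-since-creation policy could be as simple as a scheduled clean-up job along these lines (the table name and the seven-year period are placeholders until the real policy is agreed):

```python
# Sketch of a years-since-creation retention job. It assumes records carry a
# created_at timestamp; the table name and retention period are placeholders.
import sqlite3
from datetime import datetime, timedelta

RETENTION_YEARS = 7  # placeholder until the actual period is decided

def purge_expired(db_path):
    cutoff = datetime.utcnow() - timedelta(days=365 * RETENTION_YEARS)
    conn = sqlite3.connect(db_path)
    with conn:
        # Anything older than the cutoff is deleted outright.
        conn.execute(
            "DELETE FROM extension_requests WHERE created_at < ?",
            (cutoff.isoformat(),),
        )
    conn.close()
```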
SARs: Subject Access Requests
It’s likely to become more common for people to ask for what the organisation knows about them. Not all information is covered by this, but we should be ready for it.
What we should be doing right now:
Document the primary way people are identified in each system you are responsible for:
- Staff/Student number
- Email – the university provides everyone with several aliases to make this more complex
- Username
- ECS username – ECS had a different accounts system and 180 staff who’ve been here forever have a different username to their main uni one
- UNIX UID
- ORCID
- An ID local to this system (if so, is that linked to any of the above? If not, how are we going to identify that it’s the right person?)
Think about how the person, or an authorised person, could extract all their information from the system in a reasonable form (XML, JSON, ZIP, HTML, PDF…). For many of our systems this would currently be a set of manual SQL queries, but where possible these should be available as tools, both for the person to get their own data and for admins to get anybody’s.
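As a rough sketch of the sort of tool I mean, for a system keyed on username (the table and column names are invented and would come from the audit of the real system):

```python
# Sketch of a subject access export for one system, keyed on username.
# The tables and identifier columns below are invented; the real mapping
# comes from the audit of that system.
import json
import sqlite3

TABLES = {
    # table name -> column that holds the person's identifier
    "profiles": "username",
    "edit_history": "edited_by",
    "access_log": "username",
}

def export_subject_data(db_path, username):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    result = {}
    for table, id_column in TABLES.items():
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE {id_column} = ?", (username,)
        )
        result[table] = [dict(row) for row in rows]
    conn.close()
    # JSON is one "reasonable form"; the same data could be wrapped up as
    # XML, a ZIP of CSVs, or a PDF depending on who is asking.
    return json.dumps(result, indent=2, default=str)
```

Even if we never expose a self-service version, having the queries wrapped up as a script keeps the cost of each request down.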
These requests are coming. We don’t know exactly how, but it’s probable that some people will make them just out of curiosity and ask for everything they possibly can. We need to keep the costs of these down.
Obviously, if responding to such a formal request, ensure that you only pass the data on to the appropriate person in legal who is handling the request. More formal methods are likely to evolve.
The right to be forgotten
Under the old DPA, people have always been able to demand that incorrect information held about them is corrected. Under the new rules they can also request to have information about them removed.
It seems very unlikely that current staff or students will make such a request about their current job or course, and unclear if that would be a reasonable thing to request. However someone could ask for information about a past course or post to be purged. It’s impossible to find every file or bit of paper, but quite likely we might be asked to remove them from a given system.
What we should be doing right now: considering how we would do this on systems we run and, if it’s a likely request, starting to implement features to enable this.
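For a typical system that might boil down to something like this sketch; the tables, and the choice of what to anonymise rather than delete, are purely illustrative:

```python
# Sketch of honouring a removal request in one system: anonymise the rows we
# still need for audit, delete the rest. Table and column names are invented.
import sqlite3

def forget_user(db_path, username):
    conn = sqlite3.connect(db_path)
    with conn:
        # Keep the fact that an edit happened, but no longer record who did it.
        conn.execute(
            "UPDATE edit_history SET edited_by = 'removed-user' WHERE edited_by = ?",
            (username,),
        )
        # Everything that is purely about the person goes entirely.
        conn.execute("DELETE FROM profiles WHERE username = ?", (username,))
    conn.close()
```

The awkward part is deciding which records genuinely need to be kept for audit; that’s a conversation with the data owner, not a coding problem.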
Finally, email mailto: links
This is a novel but simple way to reduce data breaches caused by people picking the wrong email alias from their address book. When someone clicks on a mailto: link in a webpage, it’s usually just in the format “acronymsoup@example.org”. However, you can write email addresses with the display names included, so that when people mail to them, it’ll save the human-readable name into their local address book and make them less muddled about who they are sending to. Sometimes the address of a big list of people is only a few characters different from the address of a single role or office. This can cause data breaches.
Compare these two mailto links (the address here is made up as an example):
- mailto:groupname@example.org
- mailto:Group Name <groupname@example.org>
If you try clicking each you’ll see the results are similar, but the second link has a “display name” and that name will be there to hopefully help people spot if they’re sending to the wrong place. As we are about to restructure the university, lots of addresses will be changing, so this is a good time to implement this approach.
It won’t stop breaches, but it should avoid some. As it results in a more user friendly system too, what’s not to like?
Summary; getting our ducks in a row
We should have already completed our basic audit of systems that deal with data about people. If we’ve missed any we should get those added to the audit ASAP.
Right now, as devops of information systems, we should be at the very least working on these things:
- Removing any systems, tables, datasets, or fields that we don’t really need
- Preparing to cope with requests to see or remove data about a person
- Preparing to comply with data retention policies for systems which don’t already have one defined.
- Identifying any remaining places we don’t, but should, have a data-protection statement. And either adding one, or asking for help creating one.
- Keeping records of the work done.
When in doubt, the priority systems and issues are those where a data breach would cause actual harm to people.
What else are teams like ours putting on your immediate GDPR TODO list?
Nice.
By the way, I interacted with one (other) University system that assumed that students graduated. 🙂
So http://blog.soton.ac.uk/webteam/2018/05/10/gdpr-preperations/#post-content-1793;char=8028-8064 could leave a lot of records lying around forever.
Checking assumptions of “normality” is always useful – they are only small numbers, but these can also be the very cases that have data that causes significant harm if breached.
You’re right Hugh, it should be “years since the student left their studies”, however it’s a tricky one if a student takes a long break and wants to resume. My understanding is that to give them their degree we should have evidence of all work contributing to the final grade, so it can be assessed if needed.
(the gold standard would be to ask a study-breaking student if they want it preserved just-in-case or removed when no longer required for current audit purposes. Lots of work to manage that though)