Test Data Manager


Masking: it’s technically legal, but is it really ever that safe or simple in practice?

By Anon Anon posted May 23, 2016 07:12 AM


The EU General Data Protection Regulation (GDPR) is set to introduce sweeping reforms which will likely impact the sorts of data many organizations use in test environments. This includes the widespread use of masked production data for testing.

 

Organizations outside the EU will still be accountable if they fall within the extended territorial scope of the GDPR, and so test teams worldwide might need to reconsider their use of masked production data by the enforcement date of May 25th 2018. Otherwise, they might find themselves facing maximum fines of €20 million or 4% of annual global turnover.

 

The need for consent: Can you use that data in testing?

We’ve written before about how the greater degree of consent needed for data processing might impact testing. In short, the need for unambiguous, affirmative consent to process data will mean that organizations will need to have demonstrable consent to use data for the exact testing process being performed.

 

Masking production data has proven a popular way to soften or bypass the existing requirement for consent when testing using production data, and the GDPR does in fact encourage what it calls “pseudonymization” as a way to alter data so that it can be used for “reasons other than that for which it was given.” By masking data, these other uses, presumably including testing, are considered “compatible” with the reasons for which original consent was given. [1]

 

Masking data can therefore soften a number of the requirements placed on personally identifiable information, including the Rights to Erasure, Rectification and Portability, and the GDPR is generally only applicable to data from which individuals can be identified.

 

More stringent masking requirements

However, the way in which the GDPR defines “pseudonymized” data might mean that masking is neither the safest, easiest, nor most cost-effective solution in many cases.

The GDPR broadly defines “pseudonymized” data as data from which it is no longer possible to identify individuals. This means that all direct identifiers must be removed from production data, while any external data which could be used to identify an individual must also then be kept separately from the masked data. [1]

 

Masking direct identifiers for testing is old news, but then the GDPR also addresses the risk of re-identification. In essence, if it is possible to identify individuals using “reasonable effort” from masked data, then it is considered personal and is still subject to the full weight of the GDPR. [1]

 

This will often mean that both direct and indirect identifiers will need masking. Otherwise, it might be possible to combine indirect identifiers like geographical details and educational institutions with readily available information like a phone book or an online profile. If individuals can be identified by doing so, you would need consent to use that masked data for the exact testing task being performed.
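To make the risk concrete, here is a minimal sketch of a re-identification attack, assuming entirely fictitious records and a hypothetical public directory. Names have been stripped from the “masked” data, but joining on the remaining indirect identifiers (postcode and university) still singles an individual out:

```python
# Hypothetical sketch: names removed, but indirect identifiers remain.
masked_records = [
    {"id": "X1", "postcode": "CF10", "university": "Cardiff", "balance": 9200},
    {"id": "X2", "postcode": "SW1A", "university": "UCL", "balance": 310},
]

# Publicly available information, e.g. an online profile or directory.
public_profiles = [
    {"name": "A. Jones", "postcode": "CF10", "university": "Cardiff"},
]

# Join the two datasets on the indirect identifiers.
matches = []
for rec in masked_records:
    for prof in public_profiles:
        if (rec["postcode"], rec["university"]) == (prof["postcode"], prof["university"]):
            matches.append((rec["id"], prof["name"]))
            print(f"{rec['id']} is probably {prof['name']}")
```

Because record X1 shares its postcode and university with exactly one public profile, the “masked” record is re-identifiable with very little effort, which is precisely the GDPR’s “reasonable effort” test.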

 

The increased complexity of masking data for testing

In practice, a lot of the content of production data might need masking. However, for it to be usable in testing, the inter-column relationships will also need to remain intact. Masking therefore becomes a highly complex task.

 

Take the example of a bank’s transaction log. Masking the amounts, totals, temporal information and the sensitive content while retaining the relationships between them can be highly complicated. Faced with the difficulty of masking, we find that teams often leave some content in as a form of compromise [2]. However, this increases the possibility of re-identification, risking non-compliance.

 

To take a simpler example, your biggest customer will always be your biggest customer, and masking data won’t hide if one account constitutes 60% of transactions. Similarly, if you have the Lord Privy Seal of the United Kingdom in your data, then they will be readily identifiable unless indirect titles are also removed.
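The “biggest customer” problem can be shown in a few lines. In this sketch (with invented tokens and amounts), the account IDs are fully masked, yet a simple aggregation still exposes the dominant account:

```python
from collections import Counter

# Account IDs fully replaced by opaque tokens; amounts left intact.
masked_txns = [("tok_a", 60), ("tok_a", 55), ("tok_a", 70),
               ("tok_b", 5), ("tok_c", 8)]

volume = Counter()
for token, amount in masked_txns:
    volume[token] += amount

total = sum(volume.values())
top_token, top_amount = volume.most_common(1)[0]
share = top_amount / total
print(f"{top_token} accounts for {share:.0%} of volume")  # prints "tok_a accounts for 93% of volume"
```

If there is only one plausible customer of that size, the token is as good as a name, and the data remains personal data under the GDPR.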

 

So, what’s the alternative?

Our view is that if it is possible to use data with no potential for re-identification, then using this data as much as possible should be a priority. Synthetic data generation offers a way to create completely fictitious data, which would then be exempt from the requirements which the GDPR places on data processing.
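A minimal sketch of synthetic generation, assuming invented name lists and a hypothetical customer schema: every record is fabricated from scratch, so there is no real individual to re-identify.

```python
import random

random.seed(42)  # seeded so the generated test data is reproducible

FIRST = ["Ada", "Brynn", "Carys"]
LAST = ["Hughes", "Owen", "Price"]

def synthetic_customer(i: int) -> dict:
    # Entirely fabricated record: no production data is involved,
    # so there is no individual who could be re-identified from it.
    return {
        "customer_id": f"CUST-{i:05d}",
        "name": f"{random.choice(FIRST)} {random.choice(LAST)}",
        "balance": round(random.uniform(0, 10_000), 2),
    }

customers = [synthetic_customer(i) for i in range(3)]
```

In practice a generator would be driven by the schema and business rules of the system under test, so that the fictitious data exercises the same code paths as production data would.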

 

It will not be possible to generate all test data from scratch in one go, and we therefore advocate a hybrid approach to managing test data. A tool like CA Test Data Manager, for example, offers both masking and the ability to generate synthetic data and gradually replace masked data with it.

 

To find out more about the GDPR and its impact on Test Data Management, please sign up for “Are You GDPR Ready? Get the Vanson Bourne Readiness Survey Results Right Here,” on June 9th at 11am ET / 4pm BST.

 

References:

[1] https://iapp.org/news/a/top-10-operational-impacts-of-the-gdpr-part-8-pseudonymization/

[2] Llyr Wyn Jones, “Use synthetic data to avoid data breaches,” SQ Magazin, pp. 34-5.
