Data Masking
Data masking is a way to create a fake but realistic version of your organizational data. The goal is to protect sensitive data while providing a functional alternative when real data is not needed, for example in user training, sales demos, or software testing.
Data masking processes change the values of the data while using the same format. The goal is to create a version that cannot be deciphered or reverse-engineered. There are several ways to alter the data, including character shuffling, word or character substitution, and encryption.[1]
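As a rough illustration of changing values while keeping the same format, the sketch below replaces each digit with a random digit and each letter with a random letter of the same case, leaving separators untouched. The function name `mask_preserving_format` is hypothetical, and Python's `random` module is used only for brevity; it is not a cryptographically secure source of randomness.

```python
import random
import string


def mask_preserving_format(value: str, rng: random.Random) -> str:
    """Replace each ASCII letter or digit with a random one of the same kind."""
    masked = []
    for ch in value:
        if ch in string.digits:
            masked.append(rng.choice(string.digits))
        elif ch in string.ascii_uppercase:
            masked.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            masked.append(rng.choice(string.ascii_lowercase))
        else:
            masked.append(ch)  # keep separators such as '-' or '@' as they are
    return "".join(masked)


rng = random.Random()  # unseeded, so every run produces a different masked value
print(mask_preserving_format("AB-1234-cd", rng))
```

Because each character is drawn independently, a masked character can occasionally equal the original one; a stricter variant would redraw in that case.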
Background of Data Masking[2]
Data involved in any data masking or obfuscation must remain meaningful at several levels:
- The data must remain meaningful for the application logic. For example, suppose elements of addresses are to be obfuscated, and cities and suburbs are replaced with substitute cities or suburbs. If the application has a feature that validates postcodes or performs postcode lookups, that feature must still operate without error and behave as expected. The same applies to credit-card algorithm validation checks and Social Security Number validations.
- The data must undergo enough changes that it is not obvious the masked data comes from a production data source. For example, it may be common knowledge in an organization that there are 10 senior managers, all earning in excess of $300K. If a test environment of the organization's HR system also includes 10 identities in the same earning bracket, then other information could be pieced together to reverse-engineer a real-life identity. In theory, someone intending a data breach who sees that the data is obviously masked, and who has some knowledge of the identities in the production data set, could reasonably expect to reverse-engineer identity data. Accordingly, masking or obfuscation must be applied so that identity and sensitive data records as a whole are protected, not just the individual data elements in discrete fields and tables.
- The masked values may need to be consistent across multiple organizational databases when each database contains the data element being masked. An application may first access one database and later access another to retrieve related information where the foreign key has been masked (e.g., a call center application first brings up data from a customer master database and, depending on the situation, subsequently accesses one of several other databases with very different financial products). This requires that the masking be repeatable (the same input value to the masking algorithm always yields the same output value) yet impossible to reverse-engineer back to the original value. Additional constraints may apply depending on the data element(s) involved. Where the databases that need to connect in this scenario use different character sets, the original values must be converted to a common representation, either by the masking algorithm itself or before invoking it.
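One possible way to meet the first and third requirements together is a keyed hash. The sketch below is an assumption-laden illustration, not a reference implementation: `MASKING_KEY`, `mask_identifier`, and `mask_card_number` are hypothetical names, Unicode NFKC normalization stands in for the "common representation" step, and card numbers are assumed to be plain digit strings of at most 33 digits. The masked card number keeps its length and gets a freshly computed Luhn check digit so that validation checks still pass.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical secret; it would have to be kept outside the masked environment.
MASKING_KEY = b"example-masking-key"


def mask_identifier(value: str, key: bytes = MASKING_KEY) -> str:
    """Map an identifier to a stable token: same input, same output."""
    # Convert to a common representation before hashing.
    canonical = unicodedata.normalize("NFKC", value).strip().casefold()
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def luhn_check_digit(payload: str) -> str:
    """Return the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_is_valid(number: str) -> bool:
    """Check that the last digit matches the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(number: str, key: bytes = MASKING_KEY) -> str:
    """Deterministically map a card number to a fake one of the same length."""
    digest = hmac.new(key, number.encode("utf-8"), hashlib.sha256).digest()
    # Derive decimal digits from the digest (slight modulo bias, accepted here).
    payload = "".join(str(b % 10) for b in digest)[: len(number) - 1]
    return payload + luhn_check_digit(payload)


masked = mask_card_number("4111111111111111")  # a widely used test card number
print(masked, luhn_is_valid(masked))
print(mask_identifier("Cust-001") == mask_identifier(" cust-001 "))
```

Since the output depends only on the key and the input, two databases masked with the same key still join on the masked value. Without the key, the mapping cannot be reproduced directly, although low-entropy inputs remain open to guessing if the key leaks.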
Basics of Masking Sensitive Data[3]
Data masking, sometimes called data sanitization, data protection, or data obfuscation, is a term for the technology and processes used to anonymize or pseudonymize personal, private, or sensitive data.
Pseudonymization keeps part of the data while consistently anonymizing the directly identifying data elements, so that the data can be used for meaningful reporting or analysis without revealing individual data. For example, pseudonymizing a database will alter the names and other identifiers of individuals but leave the rest of the data, such as shopping history or medical interventions, intact.
Anonymized data is simply data from which individuals (the ‘data subjects’) can no longer be identified. An anonymization process makes it impossible, or at least extremely impractical, to identify the data subject while keeping the data realistic overall. This means it isn’t enough just to mask the directly identifying data elements such as a person’s name or ID. Additional measures are needed to prevent identification, and these vary depending on the data and why you need to anonymize it. They generally involve shuffling, scrambling, or otherwise changing the relationships between the people and the rest of the data. For example, the shopping habits will be changed as well as the names, or the medical interventions will be assigned to other individuals.
The advantage of pseudonymization is that all the relational links remain intact, and the data distribution is guaranteed to be like that of the real data. The disadvantage is that it has obvious security implications that limit its use outside environments with security protocols matching those in place for the live data. With anonymization, the resulting data is safe for use, but if you need to alter FOREIGN KEY references to prevent an inference attack, it becomes harder to maintain the correct distribution.
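The difference between the two approaches can be sketched on a few made-up records. Everything here is illustrative: the sample rows, the `Person-N` alias scheme, and the function names are assumptions, and a real anonymization process would need more than a single column shuffle.

```python
import random

# Made-up sample rows, used only for illustration.
records = [
    {"name": "Alice Smith", "purchase": "garden tools"},
    {"name": "Bob Jones", "purchase": "baby clothes"},
    {"name": "Carol White", "purchase": "hiking boots"},
]


def pseudonymize(rows):
    """Replace each name with a consistent alias; other fields stay linked."""
    aliases = {}
    out = []
    for row in rows:
        # A name seen before gets the alias it was given the first time.
        alias = aliases.setdefault(row["name"], f"Person-{len(aliases) + 1}")
        out.append({**row, "name": alias})
    return out


def anonymize(rows, rng):
    """Pseudonymize, then shuffle purchases to break person-to-data links."""
    out = pseudonymize(rows)
    purchases = [row["purchase"] for row in out]
    rng.shuffle(purchases)
    for row, purchase in zip(out, purchases):
        row["purchase"] = purchase
    return out


print(pseudonymize(records))
print(anonymize(records, random.Random()))
```

After `pseudonymize`, each alias still carries its real purchase history, which is why the result needs the same protection as live data. After `anonymize`, the set of purchases is unchanged but no longer tied to the original person, although a random shuffle can leave some rows in place by chance.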
Data Masking Techniques[4]
There are a number of techniques that IT professionals can use when masking data. Here’s a list of data masking techniques and how they apply to your business:
- Encryption: Data is masked via an encryption algorithm, and authorized users must access it with a key. This is the most complex and secure type of data masking.
- Character Scrambling: A very basic masking technique in which characters are jumbled randomly so the original content is not revealed. For instance, an employee whose badge number is #458912 in a production data set may appear as #298514 in the test environment.
- Nulling Out or Deletion: As the name suggests, when this approach is applied, data becomes null to anyone who isn’t authorized to access it.
- Number and Date Variance: When properly executed, number and date variance can provide a useful data set without giving up important financial information or transaction dates. For instance, a masked data set of employee salaries can still give you the range between the highest-paid and lowest-paid employee. You can ensure this by applying the same variance, such as one common offset, to all salaries in the set; that way, the range doesn’t change.
- Substitution: Substitution effectively mimics the look and feel of real data without compromising anyone’s personal information. With this approach, a value that looks authentic is substituted for the actual value. This effectively hides authentic data, protecting it from breach threats.
- Shuffling: Similar to substitution, shuffling replaces each value with another authentic-looking one. In shuffling, however, the replacements come from the data set itself: the data in an individual column is shuffled in a randomized fashion. The output set looks like authentic data but doesn’t reveal any real personal information.
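Several of the techniques above can be sketched in a few lines each. The functions below are hypothetical illustrations rather than a standard API: `FAKE_NAMES` is a made-up substitution table, "the same variance" is assumed to mean one common additive offset, and encryption is left out because it typically relies on a dedicated cryptography library.

```python
import random
from datetime import date, timedelta


def scramble(value: str, rng) -> str:
    """Character scrambling: permute letters and digits, keep other characters in place."""
    chars = [c for c in value if c.isalnum()]
    rng.shuffle(chars)
    it = iter(chars)
    return "".join(next(it) if c.isalnum() else c for c in value)


def null_out(row: dict, fields) -> dict:
    """Nulling out: blank the listed fields of a record."""
    return {k: (None if k in fields else v) for k, v in row.items()}


def shift_salaries(salaries, offset):
    """Number variance: one common offset, so max minus min is unchanged."""
    return [s + offset for s in salaries]


def shift_date(d: date, rng, max_days: int = 30) -> date:
    """Date variance: move a date by a random number of days within +/- max_days."""
    return d + timedelta(days=rng.randint(-max_days, max_days))


FAKE_NAMES = ["Jordan Lee", "Sam Patel", "Alex Kim"]  # hypothetical lookup table


def substitute(_value: str, rng) -> str:
    """Substitution: ignore the real value and pick an authentic-looking one."""
    return rng.choice(FAKE_NAMES)


def shuffle_column(rows, column, rng):
    """Shuffling: redistribute one column's values among the rows."""
    values = [r[column] for r in rows]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(rows, values)]


rng = random.Random()  # unseeded; pass a seed only for reproducible test data
print(scramble("#458912", rng))
print(shift_salaries([50000, 72000, 300000], 1500))
print(shift_date(date(2020, 1, 15), rng))
```

Passing the random generator in explicitly keeps each function testable with a seeded generator while production runs stay unpredictable.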
What Data Needs to Be Masked[5]
- Personally Identifiable Information (PII): Any data that could potentially be used to identify a particular person. For example, full name, social security number, driver’s license number, and passport number.
- Protected Health Information (PHI): PHI includes demographic information, medical histories, test and laboratory results, mental health conditions, insurance information, and other data that a healthcare professional collects to identify appropriate care.
- Payment card information (PCI DSS): Cardholder data is covered by the Payment Card Industry Data Security Standard (PCI DSS), an information security standard for organizations that handle branded credit cards from the major card schemes.
- Intellectual property (IP): IP refers to creations of the mind, such as inventions; literary and artistic works; designs; symbols, names, and images used in commerce.
Benefits and Challenges of Data Masking[6]
Data masking is essential in many regulated industries where personally identifiable information must be protected from overexposure. By masking data, an organization can expose it as needed to test teams or database administrators without compromising the data or falling out of compliance. The primary benefit is reduced security risk.
Data masking is difficult because the changed data must retain any characteristics of the original data that would require specific processing. Yet it must be sufficiently transformed so that no one viewing the replica can reverse-engineer it. Commercial software solutions are available to automate masking and provide confidence in the obfuscation quality.
See Also