Data Masking

Data masking is a way to create a fake but realistic version of your organizational data. The goal is to protect sensitive data while providing a functional alternative when real data is not needed, for example in user training, sales demos, or software testing.

Data masking processes change the values of the data while using the same format. The goal is to create a version that cannot be deciphered or reverse engineered. There are several ways to alter the data, including character shuffling, word or character substitution, and encryption.[1]
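
As a minimal sketch of what "changing the values while using the same format" can look like, the Python function below replaces every digit with a random digit and every ASCII letter with a random letter of the same case, and leaves separators untouched. The function name `mask_preserving_format` and the optional `rng` parameter are illustrative assumptions, not part of any cited tool.

```python
import random
import string


def mask_preserving_format(value, rng=None):
    """Return a masked copy of `value` with the same length and layout.

    Hypothetical helper: digits stay digits, upper-case letters stay
    upper-case, lower-case letters stay lower-case, and every other
    character (spaces, dashes, '@', ...) is copied unchanged.
    """
    rng = rng or random.SystemRandom()  # OS-backed randomness by default
    masked = []
    for ch in value:
        if ch in string.digits:
            masked.append(rng.choice(string.digits))
        elif ch in string.ascii_uppercase:
            masked.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            masked.append(rng.choice(string.ascii_lowercase))
        else:
            masked.append(ch)
    return "".join(masked)


if __name__ == "__main__":
    print(mask_preserving_format("AB-1234 xy"))
```

Because each character is drawn independently, a masked character can occasionally equal the original one, and the output is not repeatable between runs. Whether that is acceptable depends on the use case.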

Figure: How Data Masking Works (source: Imperva)

Background of Data Masking[2]
Data involved in any data masking or obfuscation must remain meaningful at several levels:

  • The data must remain meaningful for the application logic. For example, if address elements are obfuscated and cities and suburbs are replaced with substitute cities or suburbs, then any application feature that validates postcodes or performs postcode lookups must still operate without error and as expected. The same is true for credit-card algorithm validation checks and Social Security Number validations.
  • The data must undergo enough changes that it is not obvious the masked data comes from a production source. For example, it may be common knowledge in an organization that there are 10 senior managers all earning in excess of $300K. If a test environment of the organization's HR system also includes 10 identities in the same earning bracket, then other information could be pieced together to reverse-engineer a real-life identity. Theoretically, if the data is obviously masked or obfuscated, someone intending a data breach could reasonably assume that they can reverse-engineer identity data if they have some knowledge of the identities in the production data set. Accordingly, masking must be applied to a data set in a way that protects identity and sensitive data records as a whole, not just the individual data elements in discrete fields and tables.
  • The masked values may be required to be consistent across multiple databases within an organization when the databases each contain the specific data element being masked. Applications may initially access one database and later access another to retrieve related information where the foreign key has been masked (e.g. a call center application first brings up data from a customer master database and, depending on the situation, subsequently accesses one of several other databases with very different financial products). This requires that the masking applied is repeatable (the same input value to the masking algorithm always yields the same output value) but cannot be reverse engineered to recover the original value. The constraints described in the first point above may also apply, depending on the data elements involved. Where the databases that need to connect in this scenario use different character sets, the original values must be converted to a common representation, either by the masking algorithm itself or before it is invoked.
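
The first and third requirements above can be sketched together in Python. This is an illustration under stated assumptions, not a production design: the function names are hypothetical, a keyed hash (HMAC-SHA256) is one possible way to get repeatable output that is hard to reverse without the secret, and Unicode NFKC normalization plus case folding stands in for "a common representation". The card-number helper replaces all digits, so it does not preserve the issuer prefix.

```python
import hashlib
import hmac
import unicodedata


def normalize(value):
    """Bring a value to a common representation before masking (assumed scheme)."""
    return unicodedata.normalize("NFKC", value).strip().casefold()


def mask_key(value, secret, length=16):
    """Repeatable masking: the same value and secret always give the same token."""
    digest = hmac.new(secret, normalize(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


def luhn_check_digit(payload):
    """Compute the Luhn check digit for a string of ASCII digits."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:  # positions adjacent to the check digit are doubled
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number):
    """Return True if the full number passes the Luhn check."""
    return luhn_check_digit(number[:-1]) == number[-1]


def mask_card_number(card_number, secret):
    """Derive a repeatable substitute that still passes Luhn validation."""
    digits = "".join(ch for ch in card_number if "0" <= ch <= "9")
    digest = hmac.new(secret, digits.encode("ascii"), hashlib.sha256).digest()
    body_len = len(digits) - 1
    body = str(int.from_bytes(digest, "big") % 10 ** body_len).zfill(body_len)
    return body + luhn_check_digit(body)


if __name__ == "__main__":
    secret = b"example-secret"  # placeholder; a real key would be managed securely
    print(mask_key("CUST-001", secret), mask_card_number("4111 1111 1111 1111", secret))
```

Using the same secret in every database yields the same masked key for the same customer, so joins on the masked foreign key still work. If the secret leaks, low-entropy inputs could be recovered by brute force, so the secret needs the same care as the original data.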

Basics of Masking Sensitive Data[3]
Data masking, sometimes called data sanitization, data protection, or data obfuscation, is a term for the technology and processes used to anonymize or pseudonymize personal, private, or sensitive data.

Pseudonymization attempts to maintain a part of the data while anonymizing the directly identifying data elements in a consistent way, so that the data can be used for meaningful reporting or analysis without revealing individual data. For example, pseudonymization of a database will alter the names and other identifiers of individuals but will leave the rest of the data, such as shopping history or medical interventions, intact.

Anonymized data is simply data from which individuals, the ‘data subjects’, can no longer be identified. An anonymization process makes it impossible, or at least extremely impractical, to identify the data subject, while also attempting to maintain the overall verisimilitude of the data so that it looks real. This means that it isn’t enough to mask only the directly identifying data elements such as a person’s name or ID. Additional measures are needed to prevent identification. They vary depending on the data and on why you need to anonymize it, but generally involve shuffling, scrambling, or otherwise changing the relationships between the people and the rest of the data. For example, the shopping habits will be changed as well as the names, or the medical interventions will be assigned to other individuals.

The advantage of pseudonymization is that all the relational links remain intact, and the distribution of data is guaranteed to be like the real data. The disadvantage is that it comes with obvious security implications that limit its use to environments with security protocols matching those in place for the live data. With anonymization, the resulting data is safe for use, but if you need to start changing the FOREIGN KEY references to prevent inference attacks, it becomes harder to maintain the correct distribution.
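
The difference can be made concrete with a small Python sketch. The record layout (a `name` field and a `purchases` field) and both function names are assumptions made for illustration. Shuffling one column is only one of the additional measures mentioned above and does not by itself guarantee anonymity.

```python
import random


def pseudonymize(records, id_field="name"):
    """Replace identifiers consistently; every other field is left intact."""
    aliases = {}
    result = []
    for rec in records:
        # The same original identifier always maps to the same alias.
        alias = aliases.setdefault(rec[id_field], f"person-{len(aliases) + 1:04d}")
        result.append({**rec, id_field: alias})
    return result


def anonymize(records, id_field="name", sensitive_field="purchases", seed=None):
    """Pseudonymize, then also break the link between people and their data."""
    rng = random.Random(seed)
    result = pseudonymize(records, id_field)
    values = [rec[sensitive_field] for rec in result]
    rng.shuffle(values)  # reassign the sensitive values to other rows
    for rec, value in zip(result, values):
        rec[sensitive_field] = value
    return result


if __name__ == "__main__":
    data = [
        {"name": "Ann", "purchases": "books"},
        {"name": "Bob", "purchases": "tools"},
        {"name": "Ann", "purchases": "garden"},
    ]
    print(pseudonymize(data))
    print(anonymize(data, seed=1))
```

In the pseudonymized output each row keeps its own purchases, so per-person analysis still works but re-identification from the remaining fields is possible. In the anonymized output the overall set of purchases is unchanged, while the per-person relationships are no longer reliable.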

Common Data Masking Techniques[4]
There are a number of techniques that IT professionals can use when masking data. Here’s a list of data masking techniques and how they apply to your business:

  • Encryption: Data is masked via an encryption algorithm, and authorized users must access it with a key. This is the most complex and secure type of data masking.
  • Character Scrambling: A very basic masking technique is character scrambling. Using this approach, characters are jumbled into a random order so the original content is not revealed. For instance, using character scrambling, an employee whose badge number is #458912 in a production set of data may read #298514 in the test environment.
  • Nulling Out or Deletion: As the name suggests, when this approach is applied, data becomes null to anyone who isn’t authorized to access it.
  • Number and Date Variance: When properly executed, number and date variance can provide you with a useful set of data without giving up important financial information or transaction dates. For instance, a masked data set of employee salaries can still give you the range between the highest and lowest paid employees. You can ensure accuracy by applying the same variance to all salaries in the set, so that the range doesn’t change.
  • Substitution: Substitution effectively mimics the look and feel of real data without compromising anyone’s personal information. With this approach a value that looks to be authentic is substituted for the actual value. This effectively hides authentic data, protecting it from breach threats.
  • Shuffling: Similar to substitution, shuffling uses one data set in place of another. But in shuffling, the data in an individual column is shuffled in a randomized fashion. The output set looks like authentic data but doesn’t reveal any real personal information.
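
Several of the techniques above can be sketched in a few lines of Python. The function names and record layouts are illustrative assumptions. Encryption is left out because the Python standard library does not include a symmetric cipher, and the variance helpers apply one fixed offset to every value, which is the simplest way to keep the range unchanged.

```python
import random
from datetime import date, timedelta


def scramble_characters(value, rng):
    """Character scrambling: jumble the characters into a random order."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


def null_out(record, fields):
    """Nulling out: replace the listed fields with None."""
    return {k: (None if k in fields else v) for k, v in record.items()}


def shift_numbers(values, offset):
    """Number variance: the same offset for every value keeps max - min intact."""
    return [v + offset for v in values]


def shift_dates(dates, days):
    """Date variance: move every date by the same number of days."""
    return [d + timedelta(days=days) for d in dates]


def substitute(values, replacements, rng):
    """Substitution: swap each value for an authentic-looking stand-in."""
    return [rng.choice(replacements) for _ in values]


def shuffle_column(rows, column, rng):
    """Shuffling: permute one column's values across the rows."""
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    return [{**row, column: value} for row, value in zip(rows, shuffled)]


if __name__ == "__main__":
    rng = random.Random()  # unseeded here; pass a seed for repeatable demos
    print(scramble_characters("458912", rng))
    print(shift_numbers([50000, 72000, 310000], 1234))
    print(shift_dates([date(2024, 1, 31)], 1))
```

A short random scramble can return the original string unchanged, and a fixed offset is trivially reversible if anyone learns it, so these helpers illustrate the mechanics rather than a recommended level of protection.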

Which Data Requires Masking[5]

  • Personally Identifiable Information (PII): Any data that could potentially be used to identify a particular person. For example, full name, social security number, driver’s license number, and passport number.
  • Protected Health Information (PHI): PHI includes demographic information, medical histories, test and laboratory results, mental health conditions, insurance information, and other data that a healthcare professional collects to identify appropriate care.
  • Payment card information (PCI-DSS): Cardholder data covered by the Payment Card Industry Data Security Standard, an information security standard that organizations must follow when handling branded credit cards from the major card schemes.
  • Intellectual property (IP): IP refers to creations of the mind, such as inventions; literary and artistic works; designs; and symbols, names and images used in commerce.

Benefits and Challenges of Data Masking[6]
Data masking is essential in many regulated industries where personally identifiable information must be protected from overexposure. By masking data, the organization can expose the data as needed to test teams or database administrators without compromising it or falling out of compliance. The primary benefit is reduced security risk.

Data masking is difficult because the changed data must retain any characteristics of the original data that would require specific processing. Yet it must be sufficiently transformed so that no one viewing the replica would be able to reverse-engineer it. Commercial software solutions are available to automate masking and provide confidence in the obfuscation quality.


  1. Definition - What is Data Masking (Imperva)
  2. Background of Data Masking (Wikipedia)
  3. Basics of Masking Sensitive Data (Red Gate)
  4. Common Data Masking Techniques (BMC)
  5. Which types of data require data masking? (AI Multiple)
  6. What are the benefits and challenges of data masking? (Informatica)