
Data Munging


What is Data Munging?[1]

Data munging is the process of manually cleansing data prior to analysis. Without the right tools, it can be a laborious task. The interface most commonly used for munging is Excel, which lacks the collaboration and automation capabilities needed to make the process efficient. An estimated 80% of the time spent on data analytics is allocated to data munging, with IT manually cleaning data before passing it to business users who perform the analytics. Data munging is a time-consuming and disjointed process that gets in the way of extracting true value and potential from data.


Stages of Data Munging[2]

  • Data Discovery: Everything begins with a defined goal, and the data analysis journey is no exception. Data discovery is the first stage of data munging, where data analysts define the data’s purpose and how to achieve it through data analytics. The goal is to identify the potential uses and requirements of data.
    • At the discovery stage, the focus is more on business requirements related to data rather than technical specifications. For instance, data analysts focus on what key performance indicators or metrics will be helpful to improve the sales cycle instead of how to get the relevant numbers for analytics.
  • Data Structuring: Once the requirements are identified and outlined, the next stage is structuring raw data to make it machine-readable. Structured data has a well-defined schema and follows a consistent layout. Think of data neatly organized in rows and columns available in spreadsheets and relational databases.
    • The process involves carefully extracting data from various sources, including structured and unstructured business documents. The captured data sets are organized into a formatted repository so that they are machine-readable and can be manipulated in the subsequent phases.
  • Data Cleansing: Once the data is organized into a standardized format, the next step is data cleansing. This stage addresses data quality issues ranging from missing values to duplicate datasets. The process involves detecting and correcting erroneous data to avoid information gaps.
    • Data cleansing lays the foundation for accurate and efficient data analysis. Several transformations - like Remove, Replace, Find and Replace, etc. - are applied to eliminate redundant text and null values, as well as to identify missing fields, misplaced entries, and typing errors that can distort the analysis (a combined sketch of the cleansing, enrichment, and validation stages follows this list).
  • Data Enrichment: The structured and cleansed data is now ready for enrichment. It’s a process that involves appending one or multiple data sets from different sources to generate a holistic view of information. As a result, the data becomes more useful for reporting and analytics.
    • It typically involves aggregating multiple data sources. For instance, if an order ID is found within a system, a user can match that order ID against a different database to obtain further details like the account name, account balance, buying history, credit limit, etc. This additional data “enriches” the original ID with greater context.
  • Data Validation: Validating the accuracy, completeness, and reliability of data is imperative to the data munging process. There’s always a risk of data inaccuracies during the data transformation and enrichment process; hence a final check is necessary to validate the output information is accurate and reliable.
    • Data validation contrasts with data cleansing in that it rejects any data that does not comply with pre-defined rules or constraints. It also checks the correctness and meaningfulness of the information. There are different types of validation checks; here are some examples:
      • Consistency check: the date of an invoice can be restricted from preceding its order date.
      • Data-type validation: the day and month fields can only contain integers from 1 to 31 and 1 to 12, respectively.
      • Range and constraint validation: the password field must comprise at least eight characters, including uppercase letters, lowercase letters, and numeric digits.
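
The cleansing, enrichment, and validation stages map naturally onto a few lines of dataframe code. The following is a minimal sketch in Python with pandas; the order and account tables are invented for illustration, and a real pipeline would read them from source systems.

import pandas as pd

# Hypothetical raw order data with typical quality issues:
# a duplicate row, a missing date, and a negative amount.
orders = pd.DataFrame({
    "order_id":   [101, 102, 102, 103, 104],
    "order_date": ["2022-01-05", "2022-01-07", "2022-01-07", None, "2022-01-09"],
    "amount":     [250.0, 99.5, 99.5, 120.0, -15.0],
})

# Data cleansing: remove duplicates, coerce types, reject unusable rows.
clean = orders.drop_duplicates(subset="order_id").copy()
clean["order_date"] = pd.to_datetime(clean["order_date"], errors="coerce")
clean = clean.dropna(subset=["order_date"])   # missing-field check
clean = clean[clean["amount"] > 0]            # range/constraint check

# Data enrichment: append account details from a second (invented) source.
accounts = pd.DataFrame({
    "order_id":     [101, 102, 104],
    "account_name": ["Acme Corp", "Globex", "Initech"],
    "credit_limit": [10000, 5000, 7500],
})
enriched = clean.merge(accounts, on="order_id", how="left")

# Data validation: assert that the pre-defined rules actually hold.
assert enriched["order_id"].is_unique, "duplicate order IDs survived cleansing"
assert (enriched["amount"] > 0).all(), "non-positive amounts survived cleansing"

print(enriched)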


The Importance of Data Munging[3]

Most organizations have multiple, disparate sources of incoming data. These sources will all have different standards for validating data and catching errors. Some may simply output the data “as-is.”

Data munging is an important process whenever the data source does not perform its own form of data preparation.

Data consumers need to have clean, organized, high-quality data. These consumers can include:

  • People: Data scientists and analytics teams require a steady stream of data. To provide them with this, the business needs to implement a munging process. This guarantees a supply of high-quality information, which they can then use for detailed analysis. The organization can also make munged data available to business users through data marts.
  • Processes: Automated processes might require data from other systems. For instance, an order fulfillment system might require different pieces of customer data from across the network. Munging helps to remove any data inconsistencies, allowing these processes to run smoothly in the background.
  • Repositories: Organizations often store vast quantities of information in a data lake or data warehouse. There’s no point in storing low-quality data, and a munging process eliminates issues and ensures that everything stored is of value. Munging can also help standardize data, which makes it easier to store in a data warehouse.


Data Munging Process[4]

With the wide variety of verticals, use cases, types of users, and systems utilizing enterprise data today, the specifics of munging can take on myriad forms.

  • Data exploration: Munging usually begins with data exploration. Whether an analyst is merely peeking at completely new data in initial data analysis (IDA), or a data scientist begins the search for novel associations in existing records in exploratory data analysis (EDA), munging always begins with some degree of data discovery.
  • Data transformation: Once a sense of the raw data’s contents and structure has been established, the data must be transformed into new formats appropriate for downstream processing. This is where much of the hands-on transformation work happens: for example, un-nesting hierarchical JSON data, denormalizing disparate tables so relevant information can be accessed from one place, or reshaping and aggregating time-series data to the dimensions and spans of interest.
  • Data enrichment: Optionally, once data is ready for consumption, data mungers might choose to perform additional enrichment steps. This involves finding external sources of information to expand the scope or content of existing records. For example, an open-source weather data set could be used to add daily temperature to an ice cream shop’s sales figures (both steps are sketched after this list).
  • Data validation: The final, perhaps most important, munging step is validation. At this point, the data is ready to be used, but certain common-sense or sanity checks are critical if one wishes to trust the processed data. This step allows users to discover typos, incorrect mappings, problems with transformation steps, and even the rare corruption caused by computational failure or error.
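
To make the transformation and enrichment steps concrete, here is a minimal sketch in Python with pandas; the nested point-of-sale records and the weather table are invented for the example.

import pandas as pd

# Hypothetical nested JSON records, e.g. exported from a point-of-sale API.
raw = [
    {"date": "2022-07-01", "shop": {"city": "Austin"},
     "sales": {"units": 140, "revenue": 560.0}},
    {"date": "2022-07-02", "shop": {"city": "Austin"},
     "sales": {"units": 210, "revenue": 840.0}},
]

# Transformation: un-nest the hierarchy into one flat row per record.
# json_normalize produces dotted column names: shop.city, sales.units, ...
sales = pd.json_normalize(raw)

# Enrichment: join an external (here invented) daily-temperature table.
weather = pd.DataFrame({
    "date":   ["2022-07-01", "2022-07-02"],
    "temp_c": [33.1, 36.4],
})
enriched = sales.merge(weather, on="date", how="left")

print(enriched)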


Example of Data Munging[5]

A specific example of data munging arises in machine learning, where raw data must be restructured into a form that a learning algorithm can consume.

A common example of deliberately munged data is email addresses. Typically, to prevent spam, a user will break the valid format of an email address by writing it in a way that humans understand but computers do not, such as:

JohnDOTdoeATJohnDoeDOTcom or John(dot)doe(at)John(dot)doe(dot)com
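
A minimal sketch of how such an address might be restored to machine-readable form (Python; the patterns below handle only the two obfuscation styles shown above):

import re

def demunge_email(text: str) -> str:
    # Replace the "(at)"/"AT" and "(dot)"/"DOT" obfuscations with @ and .
    text = re.sub(r"\(at\)|AT", "@", text)
    text = re.sub(r"\(dot\)|DOT", ".", text)
    return text

print(demunge_email("JohnDOTdoeATJohnDoeDOTcom"))             # John.doe@JohnDoe.com
print(demunge_email("John(dot)doe(at)John(dot)doe(dot)com"))  # John.doe@John.doe.com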


See Also

Data Compatibility
Data Access
Data Analysis
Data Analytics
Data Architecture
Data Asset Framework (DAF)
Data Buffer
Data Center
Data Center Infrastructure
Data Center Infrastructure Management (DCIM)
Data Cleansing
Big Data
Big Data Integration
Big Data Maturity Model (BDMM)
Metadata
Data Collection
Data Consolidation
Data Deduplication
Data Delivery Platform (DDP)
Data Description (Definition) Language (DDL)
Data Dictionary
Data Discovery
Data Driven Organization
Data Element
Data Enrichment
Data Entry
Data Federation
Data Flow Diagram
Data Governance
Data Health Check
Data Hierarchy
Data Independence
Data Integration
Data Integration Framework (DIF)
Data Integrity
Data Island
Data Item
Data Lake
Data Life Cycle
Data Lineage
Data Loss Prevention (DLP)
Data Management
Data Migration
Data Minimization
Data Mining
Data Model
Data Modeling
Data Monitoring
Data
Data Portability
Data Preparation
Data Presentation Architecture
Data Processing
Data Profiling
Data Proliferation
Data Propagation
Data Protection Act
Data Prototyping
Data Quality
Data Quality Assessment (DQA)
Data Quality Dimension
Data Quality Standard
Data Reconciliation
Data Reference Model (DRM)
Data Science
Data Security
Data Stewardship
Data Structure
Data Structure Diagram
Data Suppression
Data Transformation
Data Validation
Data Value Chain
Data Vault Modeling
Data Virtualization
Data Visualization
Data Warehouse
Data Wrangling
Data and Information Reference Model (DRM)
Data as a Service (DaaS)
Database (DB)
Database Design
Database Design Methodology
Database Management System (DBMS)
Database Marketing
Database Schema
Database System


References