Difference between revisions of "Data Munging"

Revision as of 14:58, 17 October 2022

What is Data Munging?[1]

Data munging is the process of manually cleansing data prior to analysis. Without the right tools, it can be a laborious task. The interface most often used for data munging is Excel, which lacks the collaboration and automation features needed to make the process efficient. According to the cited source, 80% of the time spent on data analytics goes to data munging, with IT manually cleaning the data before passing it to the business users who perform the analysis. Done this way, data munging is a time-consuming and disjointed process that gets in the way of extracting value from data.


The Importance of Data Munging[2]

Most organizations have multiple, disparate sources of incoming data. These sources often have different standards for validating data and catching errors, and some may simply output the data “as-is.”

Data munging is an important process whenever the data source does not perform its own form of data preparation.

Data consumers need clean, organized, high-quality data. These consumers can include:

  • People: Data scientists and analytics teams require a steady stream of data. A munging process helps the business supply them with high-quality information that they can use for detailed analysis. The organization can also make munged data available to business users through data marts.
  • Processes: Automated processes might require data from other systems. For instance, an order fulfillment system might need different pieces of customer data from across the network. Munging helps remove data inconsistencies so that these processes can run smoothly in the background.
  • Repositories: Organizations often store vast quantities of information in a data lake or data warehouse. Storing low-quality data adds little value, so a munging process is used to resolve issues before the data is stored. Munging can also help standardize data, which makes it easier to store in a data warehouse.


Data Munging Process[3]

Given the wide variety of verticals, use cases, types of users, and systems that use enterprise data today, the specifics of munging can take many forms. The process generally includes the following steps:

  • Data exploration: Munging usually begins with data exploration. Whether an analyst is taking a first look at completely new data in initial data analysis (IDA), or a data scientist is searching for novel associations in existing records in exploratory data analysis (EDA), munging begins with some degree of data discovery.
  • Data transformation: Once a sense of the raw data’s contents and structure has been established, the data must be transformed into new formats appropriate for downstream processing. This step can involve, for example, un-nesting hierarchical JSON data, denormalizing disparate tables so relevant information can be accessed from one place, or reshaping and aggregating time series data to the dimensions and spans of interest.
  • Data enrichment: Optionally, once data is ready for consumption, data mungers might choose to perform additional enrichment steps. This involves finding external sources of information to expand the scope or content of existing records. For example, an open-source weather data set could be used to add daily temperature to an ice cream shop’s sales figures.
  • Data validation: The final, and perhaps most important, munging step is validation. At this point, the data is ready to be used, but certain common-sense or sanity checks are critical if one wishes to trust the processed data. This step allows users to discover typos, incorrect mappings, problems with transformation steps, and even the rare corruption caused by computational failure or error.
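
The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed implementation: the record layout, the field names, the temperature values, and the helper functions `explore`, `flatten`, `enrich`, and `validate` are all hypothetical and chosen to mirror the ice cream shop example.

```python
# Hypothetical raw records, nested as they might arrive from a JSON source.
raw_sales = [
    {"date": "2022-07-01", "shop": {"id": "A1", "city": "Leeds"},
     "sales": {"units": 120, "revenue": 360.0}},
    {"date": "2022-07-02", "shop": {"id": "A1", "city": "Leeds"},
     "sales": {"units": 95, "revenue": 285.0}},
]

# Hypothetical external data set used for enrichment (degrees Celsius).
daily_temperature_c = {"2022-07-01": 24.5, "2022-07-02": 19.0}


def explore(records):
    """Exploration: report which top-level fields occur, and how often."""
    counts = {}
    for record in records:
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts


def flatten(record, prefix=""):
    """Transformation: un-nest a hierarchical record into a flat dict."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def enrich(row, temperatures):
    """Enrichment: add the day's temperature, or None if it is unknown."""
    return {**row, "temperature_c": temperatures.get(row["date"])}


def validate(rows):
    """Validation: return a list of problems found (empty if none)."""
    problems = []
    for i, row in enumerate(rows):
        if row["sales_units"] < 0:
            problems.append(f"row {i}: negative units")
        if row["temperature_c"] is None:
            problems.append(f"row {i}: no temperature for {row['date']}")
    return problems


field_counts = explore(raw_sales)
rows = [enrich(flatten(record), daily_temperature_c) for record in raw_sales]
problems = validate(rows)
```

Returning a list of problems from `validate`, rather than raising on the first one, is a design choice that lets a reviewer see every failed sanity check in a single pass.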


Example of Data Munging[4]

One specific use of data munging is in machine learning, where data is restructured into a form that a learning algorithm can use.

A common example of munged data is email addresses. Typically, to prevent spam, a user will deliberately break the valid format of an email address by writing it in a way that humans understand but computers do not, such as:

JohnDOTdoeATJohnDoeDOTcom or John(dot)doe(at)John(dot)doe(dot)com
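
Reversing this kind of obfuscation is itself a small cleaning task. The sketch below shows one possible approach for the two forms shown above. The function name `demunge_email` is hypothetical, and the rules are assumptions that cover only these two patterns: parenthesized `(dot)` and `(at)` in any case, and bare upper-case `DOT` and `AT`. A name that legitimately contains upper-case "AT" or "DOT" would be mangled, so this is an illustration rather than a general solution.

```python
import re


def demunge_email(text):
    """Restore '.' and '@' in an address obfuscated with dot/at markers."""
    # Parenthesized markers are unambiguous, so match them in any case.
    text = re.sub(r"\(dot\)", ".", text, flags=re.IGNORECASE)
    text = re.sub(r"\(at\)", "@", text, flags=re.IGNORECASE)
    # Bare markers are matched only in upper case, so that ordinary
    # lower-case letters such as the "at" in "nathan" are left alone.
    return text.replace("DOT", ".").replace("AT", "@")


first = demunge_email("JohnDOTdoeATJohnDoeDOTcom")
second = demunge_email("John(dot)doe(at)John(dot)doe(dot)com")
```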


See Also

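References

  1. Defining Data Munging. https://www.trifacta.com/data-munging/
  2. Why Use Data Munging? https://www.integrate.io/glossary/what-is-data-munging/
  3. The data munging process: An overview. https://www.talend.com/resources/what-is-data-munging/
  4. What is an example of Data Munging? https://www.experian.co.uk/business/glossary/data-munging/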