Difference between revisions of "Exploratory Data Analysis (EDA)"

Revision as of 15:50, 15 March 2023

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis is the process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. The objectives of EDA are to:

  • Enable unexpected discoveries in the data
  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical tools and techniques
  • Provide a basis for further data collection through surveys or experiments

Many EDA techniques have been adopted in data mining. They are also being taught to young students as a way to introduce them to statistical thinking.

EDA is an approach to analyzing datasets that summarizes their main characteristics, often with visual methods, and identifies general patterns in the data, including outliers and features that might be unexpected. It is an important first step in any data analysis: understanding where outliers occur and how variables are related helps one design statistical analyses that yield meaningful results. In data mining, EDA is used to see what the data can tell us before the modeling task. It is hard to determine the important characteristics of data by looking at a column of numbers or a whole spreadsheet, and deriving insights from plain numbers can be tedious or overwhelming. EDA techniques were devised as an aid in this situation.
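As a rough illustration of why a raw column of numbers is hard to read, the sketch below summarizes a small column and flags outliers. It assumes Tukey's 1.5 × IQR rule, which is one common convention and not something this article prescribes; the function name `tukey_outliers` and the sample data are hypothetical.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical column of measurements; 95 is a deliberately planted outlier.
column = [12, 15, 14, 13, 16, 15, 14, 95, 13, 14]

print("mean:    ", statistics.mean(column))
print("median:  ", statistics.median(column))
print("outliers:", tukey_outliers(column))
```

In this made-up column the single extreme value pulls the mean well above the median, which is the kind of feature that is easy to miss when scanning the numbers by eye.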

Exploratory data analysis methods are generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate).

EDA is a crucial step to take before machine learning or statistical modeling because it provides the context needed to develop an appropriate model for the problem at hand and to correctly interpret its results. EDA helps data scientists make certain that the results they produce are valid, correctly interpreted, and applicable to the desired business contexts.[1]


Types of Exploratory Data Analysis[2]

There are four primary types of EDA:

  • Univariate non-graphical. This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Because only a single variable is involved, it does not deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
  • Univariate graphical. Non-graphical methods do not provide a full picture of the data, so graphical methods are also required. Common types of univariate graphics include:
    • Stem-and-leaf plots, which show all data values and the shape of the distribution.
    • Histograms, which are bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
    • Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
  • Multivariate non-graphical. Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
  • Multivariate graphical. Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most commonly used graphic is a grouped bar plot or bar chart, with each group representing one level of one of the variables and each bar within a group representing a level of the other variable.
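The non-graphical side of this classification can be sketched with the Python standard library alone. The example below computes the five-number summary that a box plot draws, builds the stem/leaf groups behind a stem-and-leaf plot, and counts a simple cross-tabulation. All function names and the sample data are hypothetical, and the stem-and-leaf helper assumes non-negative integers with the tens digit as the stem.

```python
import statistics
from collections import Counter, defaultdict

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the numbers a box plot depicts."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def stem_and_leaf(values):
    """Map each stem (tens) to its sorted leaves (units digits).

    Assumes non-negative integers.
    """
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

def crosstab(pairs):
    """Count how often each (level_a, level_b) combination occurs."""
    return Counter(pairs)

# Hypothetical univariate data: ten test scores.
scores = [52, 67, 61, 75, 78, 83, 64, 71, 90, 68]
print(five_number_summary(scores))
for stem, leaves in stem_and_leaf(scores).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))

# Hypothetical bivariate categorical data: (region, answer) per respondent.
survey = [("urban", "yes"), ("urban", "no"), ("rural", "yes"),
          ("urban", "yes"), ("rural", "no"), ("rural", "no")]
for (region, answer), count in sorted(crosstab(survey).items()):
    print(region, answer, count)
```

The counts produced by `crosstab` are also the bar heights a grouped bar chart would show, with one group per region and one bar per answer.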

Other common types of multivariate graphics include:

  • Scatter plot, which plots data points on a horizontal and a vertical axis to show how two variables are related. An association in a scatter plot does not by itself establish that one variable affects the other.
  • Multivariate chart, which is a graphical representation of the relationships between factors and a response.
  • Run chart, which is a line graph of data plotted over time.
  • Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot, with the size of each bubble encoding an additional variable.
  • Heat map, which is a graphical representation of data where values are depicted by color.
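A scatter plot is often read alongside a numeric measure of association, and a heat map is frequently drawn from a matrix of such measures. The sketch below assumes Pearson's correlation coefficient as that measure, which is a choice made here for illustration and not one the article names; `pearson_r`, `correlation_matrix`, and the sample columns are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of cross-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)

def correlation_matrix(columns):
    """Pairwise correlations for a dict of named columns.

    The resulting grid of values is what a correlation heat map colors.
    """
    return {a: {b: pearson_r(columns[a], columns[b]) for b in columns}
            for a in columns}

# Hypothetical dataset with three numeric columns.
data = {
    "hours":  [1, 2, 3, 4, 5],
    "score":  [52, 58, 61, 70, 74],
    "errors": [9, 7, 6, 4, 2],
}
for name, row in correlation_matrix(data).items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

A coefficient near 1 or -1 indicates a strong linear association, but, as with the scatter plot itself, it says nothing about which variable, if either, drives the other.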


The Importance of EDA[3]

Effective EDA provides insights that an algorithm alone cannot. You can think of this a bit like running a document through a spellchecker versus reading it yourself. While the software is useful for spotting typos and grammatical errors, only a critical human eye can detect the nuance. EDA is similar in this respect: tools can help, but it takes human intuition to make sense of the data. This personal, in-depth insight will support detailed data analysis further down the line.


EDA Vs. Statistical Graphics[4]

EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and all focusing on one data characterization aspect. EDA has a broader scope: it is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favor of the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques; it is a philosophy of how we dissect a data set, what we look for, how we look, and how we interpret. EDA does rely heavily on the collection of techniques called "statistical graphics", but it is not identical to statistical graphics per se.