Exploratory Data Analysis (EDA)

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations. The objectives of EDA are to:

  • Enable unexpected discoveries in the data
  • Suggest hypotheses about the causes of observed phenomena
  • Assess assumptions on which statistical inference will be based
  • Support the selection of appropriate statistical tools and techniques
  • Provide a basis for further data collection through surveys or experiments

Many EDA techniques have been adopted in data mining. They are also being taught to young students as a way to introduce them to statistical thinking.

Exploratory Data Analysis (EDA) identifies general patterns in the data, including outliers and features that might be unexpected. It is an important first step in any data analysis: understanding where outliers occur and how variables are related helps you design statistical analyses that yield meaningful results. In data mining, EDA is an approach to analyzing datasets to summarize their main characteristics, often with visual methods, and is used to see what the data can tell us before the modeling task. It is not easy to look at a column of numbers or a whole spreadsheet and determine important characteristics of the data; deriving insights from plain numbers can be tedious, boring, and overwhelming. Exploratory data analysis techniques were devised as an aid in this situation.

Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical or graphical. And second, each method is either univariate or multivariate (usually just bivariate).

EDA is a crucial step to take before diving into machine learning or statistical modeling because it provides the context needed to develop an appropriate model for the problem at hand and to correctly interpret its results. EDA is valuable to data scientists to make certain that the results they produce are valid, correctly interpreted, and applicable to the desired business contexts.[1]

With EDA, you can find anomalies in your data, such as outliers or unusual observations, uncover patterns, understand potential relationships among variables, and generate interesting questions or hypotheses that you can test later using more formal statistical methods.

Exploratory data analysis is like detective work: you're searching for clues and insights that can lead to the identification of potential root causes of the problem you are trying to solve. You explore one variable at a time, then two variables at a time, and then many variables at a time.

Although EDA encompasses tables of summary statistics such as the mean and standard deviation, most people focus on graphs. You use a variety of graphs and exploratory tools, and you go where your data take you. If one graph or analysis isn't informative, you look at the data from another perspective.

Because EDA involves exploring, it is iterative. You are likely to learn different aspects of your data from different graphs. Typical goals are understanding:

  • The distribution of variables in your data set. That is, what is the shape of your data? Is the distribution skewed? Mound-shaped? Bimodal?
  • The relationships between variables.
  • Whether or not your data have outliers or unusual points that may indicate data quality issues or lead to interesting insights.
  • Whether or not your data have patterns over time.[2]
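The first of these goals can be sketched with a few summary statistics. The following is a minimal illustration using only the Python standard library; the data values are made up for the example.

```python
import statistics

# Illustrative sample with one unusually large value (12.7).
data = [4.1, 4.8, 5.0, 5.2, 5.3, 5.5, 5.6, 5.9, 6.2, 12.7]

mean = statistics.mean(data)
median = statistics.median(data)
stdev = statistics.stdev(data)

print(f"mean={mean:.2f}, median={median:.2f}, stdev={stdev:.2f}")

# A mean noticeably above the median is one quick hint of right skew,
# here caused by the single large value.
if mean > median:
    print("distribution may be right-skewed")
```

A histogram of the same values would show the shape directly; the statistics only hint at it, which is why EDA pairs both.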

Types of Exploratory Data Analysis[3]

There are four primary types of EDA:

  • Univariate non-graphical. This is the simplest form of data analysis, where the data being analyzed consist of just one variable. Because only one variable is involved, this type of analysis does not deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
  • Univariate graphical. Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore required. Common types of univariate graphics include:
    • Stem-and-leaf plots, which show all data values and the shape of the distribution.
    • Histograms, bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
    • Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
  • Multivariate non-graphical. Multivariate data arise from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
  • Multivariate graphical. Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most commonly used graphic is the grouped bar plot, in which each group represents one level of one variable and each bar within a group represents a level of the other variable.
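On the univariate side, the five-number summary that a box plot depicts can be computed directly. This is a sketch using the Python standard library, with illustrative values:

```python
import statistics

values = [2, 3, 3, 4, 5, 6, 7, 8, 9, 15]

# Quartiles via the "inclusive" method (linear interpolation on the sample).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Minimum, first quartile, median, third quartile, maximum.
five_number = (min(values), q1, q2, q3, max(values))
print(five_number)
```

Note that quantile conventions vary, so other tools may report slightly different quartiles for the same data.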

Other common types of multivariate graphics include:

  • Scatter plot, which plots data points on horizontal and vertical axes to show how two variables relate to each other.
  • Multivariate chart, which is a graphical representation of the relationships between factors and response.
  • Run chart, which is a line graph of data plotted over time.
  • Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
  • Heat map, which is a graphical representation of data where values are depicted by color.
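Several of these graphics, for example a heat map of a correlation matrix, are built from pairwise correlation coefficients. As a sketch, a single Pearson coefficient can be computed by hand; the data here are illustrative:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Pearson's r: co-deviation divided by the product of deviation magnitudes.
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
r = cov / (sd_x * sd_y)

print(f"r = {r:.3f}")
```

A scatter plot of x against y would show the same positive association that r ≈ 0.77 summarizes as a single number.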

Overview and Development of Exploratory Data Analysis[4]

John W. Tukey wrote the book Exploratory Data Analysis in 1977. Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.

Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."

Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends, and patterns in data that merited further study.

Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data (the two extremes, minimum and maximum, the median, and the quartiles) because the median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than those traditional summaries. The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems).
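The robustness Tukey valued is easy to demonstrate: a single extreme value drags the mean far more than the median. A minimal sketch with made-up numbers:

```python
import statistics

clean = [10, 11, 12, 13, 14]
with_outlier = clean + [200]

# The single outlier shifts the mean from 12 to about 43...
print(statistics.mean(clean), statistics.mean(with_outlier))
# ...while the median barely moves, from 12 to 12.5.
print(statistics.median(clean), statistics.median(with_outlier))
```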

Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.

Conducting Exploratory Data Analysis[5]

It can be easier to conduct exploratory data analysis if you break the process down into steps. Here are six key steps that you can follow to conduct EDA:

  1. Observe your dataset: The first step to conducting exploratory data analysis is to observe your dataset at a high level. Start by determining the size of your dataset, including how many rows and columns it has. This can help you predict any future issues you might have with your data.
  2. Find any missing values: Once you've observed your dataset, you can start looking for any missing values. When you find missing values, think about what could cause them to be missing. If you can spot a trend in your data, you might be able to replace some missing values with estimates.
  3. Categorize your values: After you find any missing values, you can categorize your values to help determine what statistical and visualization methods can work with your dataset. You can place your values into these categories:
    • Categorical: Categorical variables take values from a fixed set of categories, which need not be numeric.
    • Continuous: Continuous variables can take any value within a range, so they have infinitely many possible values.
    • Discrete: Discrete variables take a countable set of values that must be numeric, such as counts.
  4. Find the shape of your dataset: Finding the shape of your dataset is another important step in the EDA process, because the shape shows your data's distribution. By observing it, you can notice features like skewness and gaps, identify trends, and learn more about the dataset.
  5. Identify relationships in your dataset: As you continue to understand your dataset, you can begin to pick out relationships in your dataset. Try to spot any correlations between values. Using scatter plots can make it easier to identify correlations and relationships between values. Be sure to take notes and pick out as many correlations as you can find. As you notice correlations, you can start thinking about why certain values might have correlations.
  6. Locate any outliers in your dataset: Locating outliers in your dataset is another important step in conducting EDA. Outliers are values in your dataset that are significantly different from the rest of the values. Outliers can be much higher or lower than the other values in a dataset. It's important to identify outliers because they can skew the mean, median, mode, or range of a dataset and alter the appearance of a visual representation. You can locate outliers by observing your graphs or sorting your data in numerical order during your EDA.
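Steps 1, 2, and 6 above can be sketched with the Python standard library alone. The column values, and the common 1.5 × IQR outlier rule used here, are illustrative choices:

```python
import statistics

# A single column with missing entries represented as None.
column = [12, 15, 14, None, 16, 13, 15, None, 98, 14]

# Step 1: observe the dataset's size.
print("rows:", len(column))

# Step 2: find missing values.
missing = sum(1 for v in column if v is None)
present = [v for v in column if v is not None]
print("missing:", missing)

# Step 6: locate outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in present if v < low or v > high]
print("outliers:", outliers)
```

On this made-up column, the rule flags the value 98 as the lone outlier; whether such a value is a data-entry error or a genuine observation is a judgment call EDA raises but does not settle.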

EDA vs. Statistical Graphics[6]

EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and each focusing on one aspect of data characterization. EDA encompasses a larger venue: it is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favor of the more direct approach of allowing the data themselves to reveal their underlying structure and model. EDA is not a mere collection of techniques; it is a philosophy as to how we dissect a data set, what we look for, how we look, and how we interpret. It is true that EDA makes heavy use of the collection of techniques we call "statistical graphics", but it is not identical to statistical graphics per se.

Exploratory Data Analysis Tools

  1. Python: Python is used for different tasks in EDA, such as finding missing values in data collection, data description, handling outliers, obtaining insights through charts, etc. The syntax for EDA libraries like Matplotlib, Pandas, Seaborn, NumPy, Altair, and more in Python is fairly simple and easy to use for beginners. You can find many open-source packages in Python, such as D-Tale, AutoViz, PandasProfiling, etc., that can automate the entire exploratory data analysis process and save time.
  2. R: The R programming language is regularly used by data scientists and statisticians to make statistical observations and perform detailed EDA. Like Python, R is an open-source language suitable for statistical computing and graphics. Apart from commonly used libraries like ggplot, Leaflet, and Lattice, there are several powerful R libraries for automated EDA, such as DataExplorer, SmartEDA, GGally, etc.
  3. MATLAB: MATLAB is a well-known commercial tool among engineers since it has a very strong mathematical calculation ability. Due to this, it is possible to use MATLAB for EDA but it requires some basic knowledge of the MATLAB programming language.[7]

Specific statistical functions and techniques you can perform with EDA tools include:

  • Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
  • Univariate visualization of each field in the raw dataset, with summary statistics.
  • Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
  • Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
  • K-means clustering, a method from unsupervised learning in which data points are assigned to one of K clusters based on their distance from each cluster's centroid. The data points closest to a particular centroid are grouped under the same cluster. K-means clustering is commonly used in market segmentation, pattern recognition, and image compression.
  • Predictive models, such as linear regression, use statistics and data to predict outcomes.
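To make the clustering idea concrete, here is a toy one-dimensional k-means in plain Python. It is a sketch of the assign-then-update loop only; real analyses would use a library implementation, and the points and starting centroids below are made up:

```python
def kmeans_1d(points, centroids, iterations=10):
    """Toy 1-D k-means: assign points to the nearest centroid, recompute means."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 8.0, 8.2, 7.8], [0.0, 10.0])
print(centroids)  # the two centroids settle near 1.0 and 8.0
```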

The Importance and Benefits of EDA[8]

Effective EDA provides invaluable insights that an algorithm cannot. You can think of this a bit like running a document through a spellchecker versus reading it yourself: while the software is useful for spotting typos and grammatical errors, only a critical human eye can detect nuance. EDA is similar in this respect; tools can help you, but it takes your own intuition to make sense of the data. This personal, in-depth insight will support detailed data analysis further down the line.

Specifically, some key benefits of an EDA include:

  • Spotting missing and incorrect data: As part of the data cleaning process, an initial data analysis (IDA) can help you spot any structural issues with your dataset. You may be able to fix these, or you might find that you need to reprocess the data or collect new data entirely. While this can be a nuisance, it’s better to know upfront, before you dive in with a deeper analysis.
  • Understanding the underlying structure of your data: Properly mapping your data ensures that you maintain high data quality when transferring it from its source to your database, spreadsheet, data warehouse, etc. Understanding how your data is structured means you can avoid mistakes from creeping in.
  • Testing your hypothesis and checking assumptions: Before diving in with a full analysis, it’s important to make sure any assumptions or hypotheses you’re working on stand up to scrutiny. While an EDA won’t give you all the details, it will help you spot if you’re inferring the right outcomes based on your understanding of the data. If not, then you know that your assumptions are wrong, or that you are asking the wrong questions about the dataset.
  • Calculating the most important variables: When carrying out any data analysis, it’s necessary to identify the importance of different variables. This includes how they relate to each other. For example, which independent variables affect which dependent variables? Determining this early on will help you extract the most useful information later on.
  • Creating the most efficient model: When carrying out your full analysis, you’ll need to remove any extraneous information, because needless additional data can either skew your results or obscure key insights with unnecessary noise. In pursuit of your goal, aim to include the fewest necessary variables. EDA helps you identify which variables carry useful information and which you can leave out.
  • Determining error margins: EDA isn’t just about finding helpful information. It’s also about determining which data might lead to unavoidable errors in your later analysis. Knowing which data will impact your results helps you to avoid wrongly accepting false conclusions or incorrectly labeling an outcome as statistically significant when it isn’t.
  • Identifying the most appropriate statistical tools to help you: Perhaps the most practical outcome of your EDA is that it will help you determine which techniques and statistical models will help you get what you need from your dataset. For instance, do you need to carry out a predictive analysis or a sentiment analysis? An EDA will help you decide.

Intuition and reflection are key skills for carrying out exploratory data analysis. While EDA can involve executing defined tasks, interpreting the results of these tasks is where the real skill lies.

See Also