Being The Explorer Of Your Data — Exploratory Data Analysis (part 1 of 2)

Amit Batra
5 min read · Aug 7, 2022

Exploratory data analysis (EDA) is conducted to explore the data and get to know it better. It is a stage in the broader data analysis endeavor where you work out what data you are dealing with. EDA is key to getting a firm grip on the kind, quantity and quality of data in the data set(s), and on how they should be analyzed.

EDA ≠ Making inferences
EDA ≠ Predicting
EDA = Identifying how we should analyze the data

Only after EDA can one formulate the right set of next-level questions to ask of the data set.

As in any good expedition, it is important not to form impressions or biases early on, but rather to stay open and observant of what we see. Data science projects have a high failure rate, and one of the primary reasons for failure can be attributed to forming early impressions about the data during the EDA phase.

EDA is about listening to the data, not about hearing what you want the data to tell you

Exploring is good, but what are we exploring for?

Imagine someone were to blindfold you and drop you into a new country. It is a weird thing to imagine, but stay with me. When you take off the blindfold, what would you do? Well, to answer that question, you would first need to find out where you are, what you can do there, what your options are to return, and so on.

Data exploration is similar. Essentially, you would try to get your 5W1H-style questions (what, why, who, where, how) answered first:

  • What is the data set about? What are the inherent relationships between the data?
  • Why is the data structured in that way?
  • Who is creating, reviewing, and using this data?
  • Where can we use this data?
  • How can we understand this data set better?

The above set of questions will give you a good initial view of what the data is about and what it can perhaps be leveraged for.

Let’s build on this and divide our exploration journey into 3 stages. For each stage below, I highlight some of the main questions one should be asking of the data set. Collectively, after these 3 stages, we will have enough information about the data set to ask more pointed, second-order questions about how it can be used and what kinds of value it can yield.

Stage 1: Summarizing the data:

The intent of this stage is purely to measure the data: its length and breadth, the number of observations and dimensions, and so on. At this stage, answering the following set of questions can help us start building our understanding of the data set:

  • How many dimensions does the data have? What do they mean?
  • How many observations do we have in the data set?
  • What are the kinds of data that make up the dataset? Which ones are qualitative and which ones are quantitative data dimensions? What are their scales?
  • Are there missing data points? How can we handle them? Do we need imputation, substitution or can we ignore them?
  • Are there outliers? How can we handle them?

As you would agree, the above set of questions looks to summarize the data across different areas. Summarization is an important building block of data analysis that helps us get a better contextual understanding of the data at hand.
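To make the Stage 1 questions concrete, here is a minimal Python sketch that uses only the standard library. The toy rows, the column names and the helper names `summarize` and `iqr_outliers` are my own assumptions for illustration; they do not come from any particular data set or library. The cutoff of 1.5 times the interquartile range is a common rule of thumb for flagging outliers, not the only option.

```python
import statistics

# Hypothetical toy data set: each dict is one observation (row).
# None marks a missing value. Column names are invented for illustration.
rows = [
    {"age": 34, "income": 52000.0, "city": "Pune"},
    {"age": 29, "income": None, "city": "Delhi"},
    {"age": 41, "income": 61000.0, "city": "Pune"},
    {"age": 38, "income": 58000.0, "city": None},
    {"age": 25, "income": 47000.0, "city": "Mumbai"},
    {"age": 52, "income": 250000.0, "city": "Delhi"},
    {"age": 31, "income": 55000.0, "city": "Pune"},
    {"age": 45, "income": 63000.0, "city": "Mumbai"},
]


def summarize(rows):
    """Count observations and dimensions; report kind and missing count per column."""
    columns = list(rows[0].keys())
    summary = {
        "n_observations": len(rows),
        "n_dimensions": len(columns),
        "columns": {},
    }
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        # Treat a column as quantitative only if every non-missing value is a number.
        is_numeric = all(isinstance(v, (int, float)) for v in present)
        summary["columns"][col] = {
            "kind": "quantitative" if is_numeric else "qualitative",
            "missing": len(rows) - len(present),
        }
    return summary


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


summary = summarize(rows)
incomes = [r["income"] for r in rows if r["income"] is not None]
outliers = iqr_outliers(incomes)

print(summary)
print("Income outliers:", outliers)
```

Whether a flagged value or a missing entry should be dropped, imputed or kept is still a judgment call that depends on what the dimension means; the sketch only surfaces the candidates.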

Stage 2: Testing the data:

In this stage, we try to find out how closely the underlying data resembles established probability distributions. Why? The main reason is to borrow the mathematical and statistical results already established for those distributions. So if our data matches, say, a normal distribution or a binomial distribution, then the toolkit and inferences applicable to normal or binomial distributions can be used for our variables. Answering the questions below would help further build our understanding during the EDA phase.

  • What are the probability distributions of the data?
  • Is there sufficient data available for each variable?
  • Do we need to transform the data?
  • What kind of hypothesis tests can we conduct on the data?
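As one illustration of the Stage 2 questions, the sketch below checks how far a variable is from a normal distribution and whether a log transform helps. It uses the Jarque-Bera statistic, which combines sample skewness and excess kurtosis and is compared against a chi-square distribution with 2 degrees of freedom. This is only one of many possible normality checks, the data is simulated, and the function names are my own assumptions. The chi-square approximation is also known to be rough for small samples, so treat the result as a hint rather than a verdict.

```python
import math
import random


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    return skew, excess_kurt


def jarque_bera(values):
    """Jarque-Bera statistic: n/6 * (S^2 + K^2/4); larger means less normal-looking."""
    n = len(values)
    skew, kurt = skewness_and_excess_kurtosis(values)
    return n / 6.0 * (skew ** 2 + kurt ** 2 / 4.0)


# 5% critical value of the chi-square distribution with 2 degrees of freedom.
CHI2_CRIT_5PCT_DF2 = 5.991

# Simulated, right-skewed data (log-normal), standing in for a real variable.
rng = random.Random(42)
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]

# A log transform is one common way to pull in a long right tail.
logged = [math.log(v) for v in skewed]

for name, data in [("raw", skewed), ("log-transformed", logged)]:
    skew, kurt = skewness_and_excess_kurtosis(data)
    jb = jarque_bera(data)
    verdict = "departs from normal" if jb > CHI2_CRIT_5PCT_DF2 else "no strong evidence against normal"
    print(f"{name}: skew={skew:.2f}, excess kurtosis={kurt:.2f}, JB={jb:.1f} ({verdict})")
```

If the transformed variable looks close to normal, the tools built for normal data become candidates for that variable; if not, a different transform or a different reference distribution may be worth trying.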

Stage 3: Visualizing the data:

The third and final stage of the EDA phase is to visualize the data using different visualization techniques. Visualization should be done even when summary statistics and hypothesis tests are already in hand. It helps bring out variation and gives us a better picture of the nuances present in the data set. The questions we should explore at this stage:

  • What is the shape of the data?
  • Is it skewed? Is it unimodal or multi-modal?
  • How concentrated is the data? Is it concentrated within certain bins, or spread across them?
  • Do we have over-representation of particular categories within any categorical variable? Will it impact the overall analysis?
  • What are the underlying relationships between the variables? Between numeric variables, among categorical variables, and between numeric and categorical variables?
  • What is the strength and what is the direction of these relationships?
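A few of these questions can be sketched without any plotting library. The Python example below prints a rough text histogram to show shape and concentration, computes category shares to spot over-representation, and computes Pearson's r for the strength and direction of a linear relationship between two numeric variables. All data and helper names (`histogram_bins`, `category_shares`, `pearson_r`) are hypothetical and chosen only for illustration; in practice you would likely use a proper charting tool for the visual part.

```python
import math
from collections import Counter


def histogram_bins(values, n_bins=5):
    """Return (bin_start, bin_end, count) tuples for equal-width bins."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; nothing to bin")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return [(low + i * width, low + (i + 1) * width, c) for i, c in enumerate(counts)]


def category_shares(labels):
    """Return each category's share of the observations."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def pearson_r(xs, ys):
    """Pearson correlation: the sign gives direction, the magnitude gives strength."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical values with a long right tail.
values = [1, 2, 2, 3, 3, 3, 4, 4, 10]
bins = histogram_bins(values, n_bins=3)
for start, end, count in bins:
    print(f"{start:5.1f} to {end:5.1f} | {'#' * count}")

# Hypothetical categorical variable dominated by one category.
shares = category_shares(["a", "a", "a", "b"])
print(shares)

# Hypothetical paired numeric variables.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 71, 75, 80]
r = pearson_r(hours, scores)
print(f"Pearson r = {r:.3f}")
```

Keep in mind that Pearson's r only captures linear relationships between numeric variables. Relationships among categorical variables, or between numeric and categorical ones, need other tools such as contingency tables or grouped summaries.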

The above set of stages, and the questions within them, should be explored for the data set. Of course, this is not an exhaustive list, but it is a good one to begin with. In my experience, no two data sets are alike, and hence, depending upon the data set, you may decide to dive deeper or pull back on certain elements within each stage.

An important point that I would like to share is about concluding EDA. So when does one know to stop? It is common to get consumed by data analysis and to keep digging into deeper and more layered nuances. But that does not help. With EDA in particular, the objective is to listen to what the underlying data is saying. To me, EDA is complete once we have fairly assessed the data sets across the above three stages and **can describe the data set well**.

In the next part, I’ll cover a case study on applying EDA to a data set using R.

References: https://www.youtube.com/watch?v=5rTb6AkKhds&t=2724s

Picture: Unsplash — https://unsplash.com/photos/gcsNOsPEXfs
