Exploratory Data Analysis (EDA) is the foundation of any successful data science project. It’s the process of investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions through statistical summaries and graphical representations. This comprehensive cheat sheet covers all the essential Python tools and techniques you need to master EDA.

Why EDA Matters
Before diving into complex modeling, EDA helps you:
- Understand your data’s structure and quality
- Identify missing values and outliers
- Discover relationships between variables
- Generate hypotheses for further analysis
- Make informed decisions about data preprocessing
Essential Python Libraries for EDA
Core Libraries Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')  # suppress warnings for cleaner notebook output (use with care)
1. Data Loading and Initial Inspection
Loading Data
# CSV files
df = pd.read_csv('data.csv')
# Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
Description: Start by loading your dataset using pandas’ versatile reading functions. Always specify encoding and handle potential parsing issues.
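A more defensive load might look like this (the encoding, NA markers, and option values here are assumptions to adapt to your file):
# Explicit encoding and parsing options (values are placeholders to adapt)
df = pd.read_csv(
    'data.csv',
    encoding='utf-8',            # be explicit about the file encoding
    na_values=['NA', '?', ''],   # strings to treat as missing
    on_bad_lines='skip'          # skip malformed rows (pandas >= 1.3)
)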
First Look at Data
# Dataset shape
print(f"Dataset shape: {df.shape}")
# First/last few rows
df.head()
df.tail()
# Random sample
df.sample(5)
Description: Get an immediate sense of your dataset’s size and structure. Use head() and tail() to see actual data examples, and sample() for random observations.
Data Types and Info
# Data types and memory usage
df.info()
# Data types only
df.dtypes
# Memory usage
df.memory_usage(deep=True)
Description: Understanding data types is crucial for choosing appropriate analysis methods. Incorrect data types can lead to memory issues and analysis errors.
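If a type is wrong, convert it explicitly; a minimal sketch with placeholder column names:
# Explicit type conversions (column names are placeholders)
df['numeric_col'] = pd.to_numeric(df['numeric_col'], errors='coerce')  # bad values become NaN
df['date_col'] = pd.to_datetime(df['date_col'], errors='coerce')
df['categorical_col'] = df['categorical_col'].astype('category')       # often saves memory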
2. Data Quality Assessment
Missing Values Analysis
# Missing values count
df.isnull().sum()
# Missing values percentage
(df.isnull().sum() / len(df)) * 100
# Visualize missing data
import missingno as msno
msno.matrix(df)
msno.bar(df)
Description: Missing data can significantly impact your analysis. Identify patterns in missing values to determine appropriate handling strategies (imputation, deletion, or special encoding).
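Common handling strategies, sketched with placeholder column names:
# Impute a numerical column with its median (robust to outliers)
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())
# Give missing categories an explicit label (object-dtype columns)
df['categorical_col'] = df['categorical_col'].fillna('Unknown')
# Or drop rows that are missing critical fields
df = df.dropna(subset=['critical_col'])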
Duplicate Detection
# Check for duplicates
df.duplicated().sum()
# View duplicate rows
df[df.duplicated()]
# Remove duplicates
df.drop_duplicates(inplace=True)
Description: Duplicates can skew your analysis and model performance. Always check for and handle duplicate records appropriately.
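When only certain columns define a record’s identity, check duplicates on a subset (the key column here is a placeholder):
# Duplicates judged on a key column only, keeping the first occurrence
df.duplicated(subset=['id_column'], keep='first').sum()
df = df.drop_duplicates(subset=['id_column'], keep='first')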
3. Descriptive Statistics
Summary Statistics
# Basic statistics for numerical columns
df.describe()
# Include all columns (including categorical)
df.describe(include='all')
# Custom percentiles
df.describe(percentiles=[.1, .25, .5, .75, .9, .99])
Description: The describe() function provides essential statistical measures including mean, standard deviation, quartiles, and min/max values. It’s your first insight into data distribution.
Individual Column Statistics
# For numerical columns
df['column_name'].mean()
df['column_name'].median()
df['column_name'].std()
df['column_name'].var()
df['column_name'].skew()
df['column_name'].kurt()
# For categorical columns
df['categorical_column'].value_counts()
df['categorical_column'].mode()
Description: Dive deeper into individual columns to understand their specific characteristics. Skewness and kurtosis help identify distribution shapes.
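As a rough rule of thumb, |skew| < 0.5 suggests a fairly symmetric distribution and |skew| > 1 a highly skewed one. A quick scan across numeric columns:
# Flag heavily skewed numeric columns (the threshold is a rule of thumb)
for col in df.select_dtypes(include=[np.number]).columns:
    s = df[col].skew()
    if abs(s) > 1:
        print(f"{col}: skew={s:.2f} (consider a log or power transform)")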
4. Data Visualization Essentials
Distribution Analysis
Histograms
# Single variable histogram
plt.figure(figsize=(10, 6))
df['column_name'].hist(bins=30, alpha=0.7)
plt.title('Distribution of Column Name')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
# Multiple histograms
df.hist(figsize=(15, 10), bins=20)
plt.tight_layout()
plt.show()
Description: Histograms reveal the distribution shape, helping identify normal distributions, skewness, multimodality, and potential outliers.
Box Plots
# Single box plot
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, y='column_name')
plt.title('Box Plot of Column Name')
plt.show()
# Multiple box plots
plt.figure(figsize=(12, 8))
df.boxplot()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Description: Box plots excel at showing data spread, quartiles, and outliers. They’re particularly useful for comparing distributions across different groups.
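To compare distributions across groups, pass a categorical column to the x-axis (column names are placeholders):
# Box plots of a numerical column split by category
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='category_column', y='column_name')
plt.title('Column Name by Category')
plt.xticks(rotation=45)
plt.show()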
Density Plots
# Single density plot
plt.figure(figsize=(10, 6))
df['column_name'].plot(kind='density')
plt.title('Density Plot of Column Name')
plt.show()
# Multiple density plots
plt.figure(figsize=(12, 8))
for column in df.select_dtypes(include=[np.number]).columns:
    df[column].plot(kind='density', alpha=0.7, label=column)
plt.legend()
plt.show()
Description: Density plots provide smooth distribution curves, making it easier to compare multiple variables’ distributions on the same plot.
Relationship Analysis
Scatter Plots
# Basic scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df['x_column'], df['y_column'], alpha=0.6)
plt.xlabel('X Column')
plt.ylabel('Y Column')
plt.title('Scatter Plot: X vs Y')
plt.show()
# Enhanced scatter plot with Seaborn
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='x_column', y='y_column', hue='category_column')
plt.title('Enhanced Scatter Plot')
plt.show()
Description: Scatter plots reveal relationships between continuous variables, helping identify correlations, clusters, and outliers in bivariate data.
Correlation Matrix
# Calculate correlation matrix (numeric columns only)
corr_matrix = df.corr(numeric_only=True)
# Visualize with heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
Description: Correlation matrices quantify linear relationships between numerical variables. Heatmaps make it easy to spot strong correlations at a glance.
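To list the strongest pairs rather than eyeballing the heatmap, one possible sketch:
# Rank variable pairs by absolute correlation (each pair appears twice)
pairs = corr_matrix.unstack()
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
print(pairs.abs().sort_values(ascending=False).head(10))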
Pair Plots
# Pair plot for multiple variables
sns.pairplot(df, hue='target_column')
plt.show()
# Pair plot for selected columns
selected_columns = ['col1', 'col2', 'col3', 'target']
sns.pairplot(df[selected_columns], hue='target')
plt.show()
Description: Pair plots create scatter plots for every combination of numerical variables, providing a comprehensive view of all pairwise relationships.
5. Categorical Data Analysis
Value Counts and Frequencies
# Value counts
df['categorical_column'].value_counts()
# Relative frequencies
df['categorical_column'].value_counts(normalize=True)
# Bar plot of categories
plt.figure(figsize=(10, 6))
df['categorical_column'].value_counts().plot(kind='bar')
plt.title('Distribution of Categorical Column')
plt.xticks(rotation=45)
plt.show()
Description: Understanding categorical variable distributions helps identify class imbalances and dominant categories that might affect your analysis.
Cross-tabulation
# Cross-tabulation
pd.crosstab(df['category1'], df['category2'])
# With percentages
pd.crosstab(df['category1'], df['category2'], normalize='columns')
# Visualize with heatmap
plt.figure(figsize=(10, 6))
cross_tab = pd.crosstab(df['category1'], df['category2'])
sns.heatmap(cross_tab, annot=True, fmt='d', cmap='Blues')
plt.title('Cross-tabulation Heatmap')
plt.show()
Description: Cross-tabulation reveals relationships between categorical variables, showing how different categories interact and overlap.
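To test whether the two categorical variables are statistically independent, run a chi-square test on the same table:
# Chi-square test of independence on the contingency table
from scipy.stats import chi2_contingency
chi2, p_value, dof, expected = chi2_contingency(cross_tab)
print(f"Chi-square: {chi2:.2f}, p-value: {p_value:.4f}")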
6. Outlier Detection
Statistical Methods
# Using IQR method
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers
outliers = df[(df['column_name'] < lower_bound) |
              (df['column_name'] > upper_bound)]
print(f"Number of outliers: {len(outliers)}")
Description: The IQR method is robust for identifying outliers in numerical data. Values beyond 1.5 times the IQR from quartiles are considered outliers.
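One common treatment, if it suits your use case, is to cap values at the IQR bounds rather than drop rows:
# Cap (winsorize) values at the IQR bounds instead of removing them
df['column_capped'] = df['column_name'].clip(lower=lower_bound, upper=upper_bound)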
Z-Score Method
# Calculate Z-scores
from scipy import stats
col = df['column_name'].dropna()  # z-scores propagate NaN, so drop missing values first
z_scores = np.abs(stats.zscore(col))
# Identify outliers (Z-score > 3)
outliers = col[z_scores > 3]
print(f"Number of outliers (Z-score > 3): {len(outliers)}")
Description: Z-scores measure how many standard deviations a value is from the mean. Values with Z-scores > 3 are typically considered outliers.
7. Time Series EDA (if applicable)
Time-based Analysis
# Convert to datetime
df['date_column'] = pd.to_datetime(df['date_column'])
# Set as index
df.set_index('date_column', inplace=True)
# Plot time series
plt.figure(figsize=(12, 6))
df['value_column'].plot()
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
# Seasonal decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df['value_column'], model='additive', period=12)  # period=12 assumes monthly data
fig, axes = plt.subplots(4, 1, figsize=(12, 10))
decomposition.observed.plot(ax=axes[0], title='Original')
decomposition.trend.plot(ax=axes[1], title='Trend')
decomposition.seasonal.plot(ax=axes[2], title='Seasonal')
decomposition.resid.plot(ax=axes[3], title='Residual')
plt.tight_layout()
plt.show()
Description: Time series analysis reveals trends, seasonality, and patterns over time. Decomposition helps separate different components of time-based data.
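Rolling statistics are another quick way to expose trend; a sketch assuming a datetime index (the window size is an assumption):
# Overlay a rolling mean to smooth out noise (window of 12 is an assumption)
plt.figure(figsize=(12, 6))
df['value_column'].plot(alpha=0.5, label='Original')
df['value_column'].rolling(window=12).mean().plot(label='12-period rolling mean')
plt.legend()
plt.show()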
8. Advanced EDA Techniques
Feature Engineering Insights
# Create new features for analysis
df['feature_ratio'] = df['feature1'] / df['feature2']  # watch for division by zero
df['feature_sum'] = df['feature1'] + df['feature2']
df['is_high_value'] = (df['value_column'] > df['value_column'].median()).astype(int)
# Binning continuous variables
df['value_bins'] = pd.cut(df['value_column'], bins=5, labels=['Low', 'Med-Low', 'Medium', 'Med-High', 'High'])
Description: Creating new features during EDA can reveal hidden patterns and relationships that aren’t apparent in the original variables.
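Once a variable is binned, group-level summaries often expose the pattern directly (the target column here is a placeholder):
# Mean of a target variable within each bin
print(df.groupby('value_bins', observed=True)['target_column'].mean())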
Distribution Testing
# Test for normality
from scipy.stats import shapiro, normaltest
# Shapiro-Wilk test (best suited to samples under ~5,000 rows)
stat, p_value = shapiro(df['column_name'].dropna())
print(f"Shapiro-Wilk test: statistic={stat:.4f}, p-value={p_value:.4f}")
# D'Agostino's normality test
stat, p_value = normaltest(df['column_name'].dropna())
print(f"D'Agostino test: statistic={stat:.4f}, p-value={p_value:.4f}")
Description: Statistical tests help determine if your data follows specific distributions, which is crucial for choosing appropriate analysis methods and models.
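A Q-Q plot complements the tests visually:
# Q-Q plot against a normal distribution; points on the line suggest normality
plt.figure(figsize=(8, 6))
stats.probplot(df['column_name'].dropna(), dist='norm', plot=plt)
plt.title('Q-Q Plot')
plt.show()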
Quick Reference Commands
Essential One-Liners
# Quick dataset overview
df.shape
df.info()
df.describe()
# Missing data overview
df.isnull().sum().sort_values(ascending=False)
# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
# Distribution plots for all numerical columns
df.hist(figsize=(15, 10))
# Value counts for categorical columns
for col in df.select_dtypes(include=['object']).columns:
    print(f"{col}:\n{df[col].value_counts()}\n")
This EDA cheat sheet provides a comprehensive toolkit for exploring and understanding your data using Python. Remember that EDA is an iterative process – start with broad overviews and progressively dive deeper into specific aspects of your data. The insights gained from thorough EDA will guide your feature engineering, model selection, and overall data science strategy.
The key to effective EDA is asking the right questions about your data and using these tools to find answers. Always document your findings and be prepared to iterate as new patterns emerge. With practice, these techniques will become second nature, making you a more effective data scientist.
Tips for Effective EDA
- Ask Questions: Start with questions about your data. What are you trying to find out?
- Iterate: EDA is not a linear process. You’ll often jump back and forth between steps as you uncover new insights.
- Document: Keep notes of your findings, observations, and any transformations you make.
- Comment Your Code: Use comments in your notebooks extensively. You won’t remember later what you were thinking in the moment.
- Understand the Data: Apply domain knowledge to relate the input variables to the expected output, and ask whether they are actually relevant for the model.
- Visualize, Visualize, Visualize: Humans are visual creatures. Plots often reveal patterns that summary statistics alone might miss.
- Don’t Be Afraid to Dive Deeper: If something looks interesting or unusual, investigate it further.