Hassan Ijaz

Ai, Web & Design


Outlier detection

Interactive scatter plot where users click to add points and see various outlier detection methods highlight anomalies in real-time

Concept Overview

Outliers are data points that differ significantly from other observations. Detecting them is crucial as they can indicate data quality issues, interesting anomalies, or rare but important events.

Statistical Methods

Z-Score Method

z = (x - μ) / σ

Points with |z| > 3 are often considered outliers

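Assumes approximately normally distributed data; the mean and standard deviation are themselves pulled by the outliers being tested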
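As a minimal sketch of the z-score rule above, the following uses only Python's standard library. The function name, the sample data, and the choice of the population standard deviation for σ are assumptions made for illustration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed choice)
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [x for x in values if abs((x - mu) / sigma) > threshold]

# Hypothetical data: 19 ordinary readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 12, 10, 13, 95]
print(zscore_outliers(data))  # prints [95]
```

One caveat worth keeping in mind: with the population standard deviation, no |z| can exceed sqrt(n - 1), so a sample of 10 or fewer points can never produce |z| > 3.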

IQR Method (Tukey Fences)

Lower fence: Q1 - 1.5 × IQR

Upper fence: Q3 + 1.5 × IQR

Does not assume a normal distribution; quartiles are barely affected by a few extreme values
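The fences above can be sketched with the standard library as follows. The function names and sample data are illustrative, and the "inclusive" quartile convention is an assumption: other quartile definitions give slightly different fences.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" is one of several quartile conventions (assumed here).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [x for x in values if x < lower or x > upper]

# Hypothetical data with one extreme value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
print(tukey_fences(data))  # prints (-2.0, 14.0)
print(iqr_outliers(data))  # prints [30]
```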

Modified Z-Score (MAD)

M = 0.6745 × (x - median) / MAD

MAD = median(|x - median|)

More robust than the standard z-score, because the median and MAD are barely affected by extreme values
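A minimal sketch of the modified z-score, using the two formulas above. The cutoff of 3.5 is not stated in this section; it is a commonly cited convention and is an assumption here, as are the function names and sample data.

```python
import statistics

def modified_zscores(values):
    """Return M = 0.6745 * (x - median) / MAD for each value."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        # More than half the values equal the median, so M is undefined.
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (x - med) / mad for x in values]

def mad_outliers(values, threshold=3.5):
    """Return values with |M| above the threshold (3.5 is an assumed cutoff)."""
    scores = modified_zscores(values)
    return [x for x, m in zip(values, scores) if abs(m) > threshold]

# Hypothetical data: median is 6 and MAD is 2.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
print(mad_outliers(data))  # prints [30]
```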

Machine Learning Methods

Isolation Forest

Isolates anomalies using random trees; anomalous points tend to be separated in fewer splits

Local Outlier Factor

Compares each point's local density to that of its neighbors

DBSCAN

Density-based clustering approach; points that belong to no cluster (noise) are treated as outliers

One-Class SVM

Learns boundary of normal data
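Of the methods above, DBSCAN is compact enough to sketch without a library. The following is a simplified, quadratic-time illustration, not the implementation found in any particular package; the values of eps and min_samples and the sample points are assumptions chosen for the example.

```python
import math

NOISE = -1  # label for points that belong to no cluster

def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks outliers."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels

# Hypothetical data: two tight groups and one isolated point.
points = [(1, 1), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (5, 5), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
          (9, 1)]
labels = dbscan(points, eps=0.5, min_samples=3)
outliers = [p for p, label in zip(points, labels) if label == NOISE]
print(labels)    # prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
print(outliers)  # prints [(9, 1)]
```

The result depends heavily on eps and min_samples: a larger eps would merge clusters or absorb the isolated point, so these parameters usually need tuning per dataset.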

Types of Outliers

Point Outliers

Individual data points far from others

Contextual Outliers

Normal globally but abnormal in a specific context

Collective Outliers

Groups of data points that together are anomalous
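To make the contextual case concrete, one possible sketch applies Tukey fences separately within each context. The month labels, temperature readings, and function name are hypothetical, and per-group fences are only one of several ways to model context.

```python
import statistics
from collections import defaultdict

def contextual_outliers(records, k=1.5):
    """records: (context, value) pairs. Flag values outside the Tukey
    fences of their own context. Each context needs at least 2 values."""
    groups = defaultdict(list)
    for context, value in records:
        groups[context].append(value)

    fences = {}
    for context, values in groups.items():
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        iqr = q3 - q1
        fences[context] = (q1 - k * iqr, q3 + k * iqr)

    return [(c, v) for c, v in records
            if not fences[c][0] <= v <= fences[c][1]]

# Hypothetical temperature readings (degrees Celsius) by month.
january = [-2, 0, 1, -1, 2, 0, 24]
july = [22, 24, 25, 23, 26, 24, 25]
records = [("January", t) for t in january] + [("July", t) for t in july]

print(contextual_outliers(records))  # prints [('January', 24)]

# Ignoring the context (one shared group) flags nothing.
pooled = [("all", t) for _, t in records]
print(contextual_outliers(pooled))   # prints []
```

A reading of 24 is unremarkable in July and within the overall range, but it stands out once it is compared only against other January readings.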

Handling Outliers

Investigate: Understand why they exist
Keep: If they represent valid extreme cases
Remove: If they're errors or irrelevant
Transform: Use robust methods or transformations
Cap/Winsorize: Replace with less extreme values
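The capping option can be sketched as follows. This version clamps values to the Tukey fences, which is one possible choice of limits; classic winsorizing clamps to chosen percentiles instead. The function name and sample data are illustrative assumptions.

```python
import statistics

def cap_at_fences(values, k=1.5):
    """Clamp each value into [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values inside the fences are unchanged; extremes move to the nearest fence.
    return [min(max(x, lower), upper) for x in values]

# Hypothetical data: the fences are -2.0 and 14.0, so 30 is capped at 14.0.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
print(cap_at_fences(data))  # prints [2, 4, 4, 5, 6, 7, 8, 9, 14.0]
```

Capping keeps the sample size intact while limiting the influence of extreme values, which can be preferable to removal when the points are valid but distort a mean or a regression fit.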

Important: Outliers aren't always bad! They might represent:

Fraud in financial transactions
Rare diseases in medical data
Equipment failures in sensor data
Breakthrough performances in sports

Click on the scatter plot below to add data points. Watch as different outlier detection methods highlight anomalies in real-time, each using different criteria.

Interactive Visualization

