Scenario
A SaaS platform stores customer addresses with city and state columns. A report query filters on both (WHERE city = 'Austin' AND state = 'TX') and returns 50,000 rows, but EXPLAIN shows the planner estimated only 12 rows. The plan uses a Nested Loop that scans millions of rows unnecessarily. No one has ever run CREATE STATISTICS on this table.
How to Identify
Conditions:
- Queries filtering on two or more correlated columns produce severely wrong row estimates
EXPLAIN shows rows= while EXPLAIN ANALYZE shows actual rows=
- Under-estimation leads to the planner choosing Nested Loop or wrong join strategy
SELECT * FROM pg_statistic_ext returns no rows for the table in question
- Per-column
pg_stats entries for each column individually look reasonable, but the planner multiplies their selectivities independently (independence assumption)
Analysis Steps
-- 1. Run EXPLAIN ANALYZE to compare estimated vs actual rows
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM customers WHERE city = 'Austin' AND state = 'TX';
-- 2. Check whether any extended statistics objects exist
SELECT stxname, stxrelid::regclass AS table_name,
stxkeys, stxkind
FROM pg_statistic_ext
ORDER BY stxrelid::regclass::text;
-- 3. Inspect per-column ndistinct values — planner multiplies them independently
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'customers'
AND attname IN ('city', 'state');
-- 4. Verify the independence assumption error
-- Planner estimate ≈ (rows_for_city / total) * (rows_for_state / total) * total
-- If city and state are correlated, this massively underestimates
-- 5. Check most common values for each column separately
SELECT attname, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'customers'
AND attname IN ('city', 'state');
Pitfalls
- Extended statistics require PostgreSQL 10+;
CREATE STATISTICS syntax for MCV requires PG 12+
- Creating too many extended statistics objects slows
ANALYZE — focus on column pairs that appear together in WHERE clauses
- Functional dependencies (
dependencies) only help when one column functionally determines the other (e.g., zip → city); use ndistinct or mcv for non-deterministic correlations
CREATE STATISTICS collects stats but does NOT automatically trigger ANALYZE — you must run ANALYZE after
- Extended stats do not apply to expressions, only to plain column references
Resolution Approach
The planner assumes column selectivities are statistically independent. When columns are correlated (city implies state), the independence assumption causes severe under-estimation. The fix is to create extended statistics that capture the joint distribution: