Cookbook recipe

Correlated Column Statistics Missing

Applies to PostgreSQL 13–17 Last reviewed May 2026 Grounded in source
Estimated investigation4 min

Scenario

Scenario A SaaS platform stores customer addresses with city and state columns. A report query filters on both (WHERE city = 'Austin' AND state = 'TX') and returns 50,000 rows, but EXPLAIN shows the planner estimated…

Investigation Path

Scenario

A SaaS platform stores customer addresses with city and state columns. A report query filters on both (WHERE city = 'Austin' AND state = 'TX') and returns 50,000 rows, but EXPLAIN shows the planner estimated only 12 rows. The plan uses a Nested Loop that scans millions of rows unnecessarily. No one has ever run CREATE STATISTICS on this table.

How to Identify

Conditions:

  • Queries filtering on two or more correlated columns produce severely wrong row estimates
  • EXPLAIN shows rows= while EXPLAIN ANALYZE shows actual rows=
  • Under-estimation leads to the planner choosing Nested Loop or wrong join strategy
  • SELECT * FROM pg_statistic_ext returns no rows for the table in question
  • Per-column pg_stats entries for each column individually look reasonable, but the planner multiplies their selectivities independently (independence assumption)

Analysis Steps

-- 1. Run EXPLAIN ANALYZE to compare estimated vs actual rows
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM customers WHERE city = 'Austin' AND state = 'TX';

-- 2. Check whether any extended statistics objects exist
SELECT stxname, stxrelid::regclass AS table_name,
       stxkeys, stxkind
FROM   pg_statistic_ext
ORDER  BY stxrelid::regclass::text;

-- 3. Inspect per-column ndistinct values — planner multiplies them independently
SELECT attname, n_distinct, correlation
FROM   pg_stats
WHERE  tablename = 'customers'
  AND  attname IN ('city', 'state');

-- 4. Verify the independence assumption error
--    Planner estimate ≈ (rows_for_city / total) * (rows_for_state / total) * total
--    If city and state are correlated, this massively underestimates

-- 5. Check most common values for each column separately
SELECT attname, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'customers'
  AND  attname IN ('city', 'state');

Pitfalls

  • Extended statistics require PostgreSQL 10+; CREATE STATISTICS syntax for MCV requires PG 12+
  • Creating too many extended statistics objects slows ANALYZE — focus on column pairs that appear together in WHERE clauses
  • Functional dependencies (dependencies) only help when one column functionally determines the other (e.g., zip → city); use ndistinct or mcv for non-deterministic correlations
  • CREATE STATISTICS collects stats but does NOT automatically trigger ANALYZE — you must run ANALYZE after
  • Extended stats do not apply to expressions, only to plain column references

Resolution Approach

The planner assumes column selectivities are statistically independent. When columns are correlated (city implies state), the independence assumption causes severe under-estimation. The fix is to create extended statistics that capture the joint distribution:

This is a Pro lesson

Get every Learning Pathway and cookbook recipe — grounded in PostgreSQL source code, with diagnostics, fixes, and prevention for each topic.

Continue this lesson to learn:

  • Mitigation Actions
  • All 36 Learning Pathway lessons
  • 170+ cookbook recipes
  • Source-grounded diagnostics & fixes

Secure checkout Cancel anytime Source-grounded

Career Impact

This scenario builds production judgment and operational confidence under pressure.

Open Career Dashboard →

Keep going

Related & next steps

Was this helpful?

← All cookbook recipes