Query Processing & the Query Optimizer

When you write a SQL query, the database doesn't execute it literally. It runs the SQL through a multi-stage pipeline — parsing, rewriting, planning, and only then executing. The query optimizer is the component that transforms your SQL into the most efficient execution strategy it can find. Understanding this pipeline helps you write faster queries and diagnose slow ones.

The Query Lifecycle#

Stage 1: Query Parsing#

The parser converts raw SQL text into a structured parse tree.

sql

Loading editor…

What the Parser Does#

Tokenisation: splits the text into keywords, identifiers, operators, literals
- SELECT, u, ., name, ,, COUNT, (, o.id, ), ...
Syntax tree: validates grammar and builds an AST (Abstract Syntax Tree)

text
Loading editor…
Semantic analysis (binding phase): replaces table/column names with internal object IDs from the system catalog. Validates that users.name exists, that o.user_id is the right type for u.id, and that the user has SELECT permission.

Errors at this stage: relation "userz" does not exist, column "naem" does not exist, operator does not exist: text > integer.

Stage 2: Query Rewriting#

Before planning, the rewriter applies a set of algebraic transformations that simplify the query or open up more optimisation opportunities.

Subquery Unnesting#

Correlated subqueries are expensive — they run once per outer row. The rewriter often transforms them into joins.

sql

Loading editor…

Predicate Pushdown#

Move WHERE conditions as close to the data source as possible, filtering out rows early.

sql

Loading editor…

View Expansion#

sql

Loading editor…

Stage 3: Query Planning and Optimization#

The planner generates multiple candidate execution plans and estimates the cost of each. It then picks the one with the lowest estimated cost.

Rule-Based vs Cost-Based Optimization#

Approach	How it works	Weakness
Rule-based (RBO)	Apply fixed transformation rules (e.g., always use index if one exists)	Ignores data distribution, can choose a bad plan
Cost-based (CBO)	Use statistics to estimate row counts and I/O, pick minimum cost plan	Only as good as the statistics

Modern databases (PostgreSQL, MySQL 8+, SQL Server, Oracle) use cost-based optimization with some rule-based heuristics.

Statistics: How the Planner Estimates Row Counts#

The planner needs to estimate how many rows each operation will return. It uses statistics collected by ANALYZE (PostgreSQL) or ANALYZE TABLE (MySQL).

sql

Loading editor…

Example output:

text

Loading editor…

The planner uses this to estimate: WHERE status = 'pending' will return approximately 18% of rows.

Histograms#

For columns with non-uniform distributions, PostgreSQL builds histograms:

sql

Loading editor…

Access Paths#

Once the planner knows approximately how many rows to expect, it chooses how to physically access the data.

Sequential Scan#

Read every page of the table from disk. Unavoidable when:

No suitable index exists
The query returns a large fraction of the table (>5-20%)

text

Loading editor…

Index Scan#

Traverse the B-tree index, then fetch each matching heap row.

text

Loading editor…

Good for low-selectivity queries (few matching rows). Degrades if many rows match because random I/O is expensive.

Index-Only Scan#

All columns needed by the query are in the index. No heap access required.

sql

Loading editor…

Fastest access path when applicable. Requires the index to cover all columns referenced in SELECT and WHERE.

Bitmap Index Scan#

For queries returning a moderate number of rows (more than index scan is efficient for, less than seqscan):

text

Loading editor…

Can combine multiple bitmaps: BitmapAnd, BitmapOr for multi-condition queries.

Join Strategies#

When joining two tables, the planner chooses one of three algorithms based on table sizes, available indexes, and sort order.

Nested Loop Join#

For each row in the outer table, scan the inner table.

text

Loading editor…

text

Loading editor…

Hash Join#

Build a hash table from the smaller table, then probe it with every row from the larger table.

text

Loading editor…

text

Loading editor…

Merge Join#

Sort both tables on the join key, then merge them like merge sort.

text

Loading editor…

Join Type	Best For	Memory	Requires
Nested Loop	Small outer, indexed inner	Low	Index on inner join key
Hash Join	Large tables, no index	High (hash table)	Enough work_mem
Merge Join	Pre-sorted inputs	Low	Sorted input or indexes

Reading EXPLAIN ANALYZE Output#

EXPLAIN ANALYZE runs the query and shows the actual execution plan with real row counts and timing.

sql

Loading editor…

Example output:

text

Loading editor…

How to Read It#

cost=X..Y: estimated cost. X = startup cost (before first row), Y = total cost. Units are arbitrary but consistent within one EXPLAIN.
rows=N: estimated rows (planner's guess). Compare to actual rows=N to see if estimates were accurate.
actual time=X..Y: actual milliseconds. X = time to first row, Y = total time.
loops=N: how many times this node ran. actual time is per loop; total = actual time * loops.
Filter: ... Rows Removed by Filter: shows how many rows were discarded at this node.

Common Plan Problems#

Problem 1: Wrong row estimate (cardinality error)

text

Loading editor…

The planner thought 10 rows would match, picked nested loop join. Actually 85,000 rows — should have been a hash join. Fix: ANALYZE the table to refresh statistics.

Problem 2: Seq scan on a large table

text

Loading editor…

Only 60 rows matched out of 100,000 scanned. An index on user_id would change this to an index scan. Fix: CREATE INDEX ON orders(user_id).

Problem 3: Index exists but not used

sql

Loading editor…

Fix: create a functional index: CREATE INDEX ON orders(LOWER(status));

VACUUM ANALYZE: Keeping Statistics Fresh#

PostgreSQL's MVCC (multi-version concurrency control) leaves dead row versions after updates and deletes. VACUUM reclaims this space. ANALYZE refreshes the planner statistics.

sql

Loading editor…

pg_hint_plan#

When the planner consistently makes bad choices, you can override with hints (requires the pg_hint_plan extension):

sql

Loading editor…

Prepared Statements#

A prepared statement is parsed, bound, and planned once, then executed many times with different parameters.

sql

Loading editor…

PostgreSQL uses generic plans after 5 executions: a plan that works for any parameter value. This avoids per-parameter planning overhead but may not be optimal for skewed data.

Python: Running EXPLAIN ANALYZE#

python

Loading editor…

Summary#

Query lifecycle: parse → bind → rewrite → plan → execute. The optimizer operates at the plan phase.
Statistics: the planner's accuracy depends on up-to-date pg_stats. Run ANALYZE after bulk changes.
Access paths: sequential scan for large fractions of a table; index scan for selective predicates; index-only scan when all columns are in the index.
Join strategies: nested loop (small + indexed), hash join (large, no index), merge join (pre-sorted). The planner chooses based on estimated row counts.
EXPLAIN ANALYZE: always read estimated vs actual rows. A large discrepancy is a sign of stale statistics or missing statistics.
Prepared statements: parse once, execute many times — essential for high-throughput OLTP.
VACUUM ANALYZE: dead rows skew statistics and waste I/O. Autovacuum handles routine cases; run manually after bulk operations.

Query Processing & the Query Optimizer

The Query Lifecycle#

Stage 1: Query Parsing#

The parser converts raw SQL text into a structured parse tree.

sql

Loading editor…

What the Parser Does#

Tokenisation: splits the text into keywords, identifiers, operators, literals
- SELECT, u, ., name, ,, COUNT, (, o.id, ), ...
Syntax tree: validates grammar and builds an AST (Abstract Syntax Tree)

text
Loading editor…
Semantic analysis (binding phase): replaces table/column names with internal object IDs from the system catalog. Validates that users.name exists, that o.user_id is the right type for u.id, and that the user has SELECT permission.

Errors at this stage: relation "userz" does not exist, column "naem" does not exist, operator does not exist: text > integer.

Stage 2: Query Rewriting#

Before planning, the rewriter applies a set of algebraic transformations that simplify the query or open up more optimisation opportunities.

Subquery Unnesting#

Correlated subqueries are expensive — they run once per outer row. The rewriter often transforms them into joins.

sql

Loading editor…

Predicate Pushdown#

Move WHERE conditions as close to the data source as possible, filtering out rows early.

sql

Loading editor…

View Expansion#

sql

Loading editor…

Stage 3: Query Planning and Optimization#

The planner generates multiple candidate execution plans and estimates the cost of each. It then picks the one with the lowest estimated cost.

Rule-Based vs Cost-Based Optimization#

Approach	How it works	Weakness
Rule-based (RBO)	Apply fixed transformation rules (e.g., always use index if one exists)	Ignores data distribution, can choose a bad plan
Cost-based (CBO)	Use statistics to estimate row counts and I/O, pick minimum cost plan	Only as good as the statistics

Modern databases (PostgreSQL, MySQL 8+, SQL Server, Oracle) use cost-based optimization with some rule-based heuristics.

Statistics: How the Planner Estimates Row Counts#

The planner needs to estimate how many rows each operation will return. It uses statistics collected by ANALYZE (PostgreSQL) or ANALYZE TABLE (MySQL).

sql

Loading editor…

Example output:

text

Loading editor…

The planner uses this to estimate: WHERE status = 'pending' will return approximately 18% of rows.

Histograms#

For columns with non-uniform distributions, PostgreSQL builds histograms:

sql

Loading editor…

Access Paths#

Once the planner knows approximately how many rows to expect, it chooses how to physically access the data.

Sequential Scan#

Read every page of the table from disk. Unavoidable when:

No suitable index exists
The query returns a large fraction of the table (>5-20%)

text

Loading editor…

Index Scan#

Traverse the B-tree index, then fetch each matching heap row.

text

Loading editor…

Good for low-selectivity queries (few matching rows). Degrades if many rows match because random I/O is expensive.

Index-Only Scan#

All columns needed by the query are in the index. No heap access required.

sql

Loading editor…

Fastest access path when applicable. Requires the index to cover all columns referenced in SELECT and WHERE.

Bitmap Index Scan#

For queries returning a moderate number of rows (more than index scan is efficient for, less than seqscan):

text

Loading editor…

Can combine multiple bitmaps: BitmapAnd, BitmapOr for multi-condition queries.

Join Strategies#

When joining two tables, the planner chooses one of three algorithms based on table sizes, available indexes, and sort order.

Nested Loop Join#

For each row in the outer table, scan the inner table.

text

Loading editor…

text

Loading editor…

Hash Join#

Build a hash table from the smaller table, then probe it with every row from the larger table.

text

Loading editor…

text

Loading editor…

Merge Join#

Sort both tables on the join key, then merge them like merge sort.

text

Loading editor…

Join Type	Best For	Memory	Requires
Nested Loop	Small outer, indexed inner	Low	Index on inner join key
Hash Join	Large tables, no index	High (hash table)	Enough work_mem
Merge Join	Pre-sorted inputs	Low	Sorted input or indexes

Reading EXPLAIN ANALYZE Output#

EXPLAIN ANALYZE runs the query and shows the actual execution plan with real row counts and timing.

sql

Loading editor…

Example output:

text

Loading editor…

How to Read It#

cost=X..Y: estimated cost. X = startup cost (before first row), Y = total cost. Units are arbitrary but consistent within one EXPLAIN.
rows=N: estimated rows (planner's guess). Compare to actual rows=N to see if estimates were accurate.
actual time=X..Y: actual milliseconds. X = time to first row, Y = total time.
loops=N: how many times this node ran. actual time is per loop; total = actual time * loops.
Filter: ... Rows Removed by Filter: shows how many rows were discarded at this node.

Common Plan Problems#

Problem 1: Wrong row estimate (cardinality error)

text

Loading editor…

The planner thought 10 rows would match, picked nested loop join. Actually 85,000 rows — should have been a hash join. Fix: ANALYZE the table to refresh statistics.

Problem 2: Seq scan on a large table

text

Loading editor…

Only 60 rows matched out of 100,000 scanned. An index on user_id would change this to an index scan. Fix: CREATE INDEX ON orders(user_id).

Problem 3: Index exists but not used

sql

Loading editor…

Fix: create a functional index: CREATE INDEX ON orders(LOWER(status));

VACUUM ANALYZE: Keeping Statistics Fresh#

PostgreSQL's MVCC (multi-version concurrency control) leaves dead row versions after updates and deletes. VACUUM reclaims this space. ANALYZE refreshes the planner statistics.

sql

Loading editor…

pg_hint_plan#

When the planner consistently makes bad choices, you can override with hints (requires the pg_hint_plan extension):

sql

Loading editor…

Prepared Statements#

A prepared statement is parsed, bound, and planned once, then executed many times with different parameters.

sql

Loading editor…

PostgreSQL uses generic plans after 5 executions: a plan that works for any parameter value. This avoids per-parameter planning overhead but may not be optimal for skewed data.

Python: Running EXPLAIN ANALYZE#

python

Loading editor…

Summary#

Query lifecycle: parse → bind → rewrite → plan → execute. The optimizer operates at the plan phase.
Statistics: the planner's accuracy depends on up-to-date pg_stats. Run ANALYZE after bulk changes.
Access paths: sequential scan for large fractions of a table; index scan for selective predicates; index-only scan when all columns are in the index.
Join strategies: nested loop (small + indexed), hash join (large, no index), merge join (pre-sorted). The planner chooses based on estimated row counts.
EXPLAIN ANALYZE: always read estimated vs actual rows. A large discrepancy is a sign of stale statistics or missing statistics.
Prepared statements: parse once, execute many times — essential for high-throughput OLTP.
VACUUM ANALYZE: dead rows skew statistics and waste I/O. Autovacuum handles routine cases; run manually after bulk operations.

Query Processing & Optimisation — Parse, Plan & Execute

Query Processing & the Query Optimizer

The Query Lifecycle#

Stage 1: Query Parsing#

What the Parser Does#

Stage 2: Query Rewriting#

Subquery Unnesting#

Predicate Pushdown#

View Expansion#

Stage 3: Query Planning and Optimization#

Rule-Based vs Cost-Based Optimization#

Statistics: How the Planner Estimates Row Counts#

Histograms#

Access Paths#

Sequential Scan#

Index Scan#

Index-Only Scan#

Bitmap Index Scan#

Join Strategies#

Nested Loop Join#

Hash Join#

Merge Join#

Reading EXPLAIN ANALYZE Output#

How to Read It#

Common Plan Problems#

VACUUM ANALYZE: Keeping Statistics Fresh#

pg_hint_plan#

Prepared Statements#

Python: Running EXPLAIN ANALYZE#

Summary#

Query Processing & Optimisation — Parse, Plan & Execute

Query Processing & the Query Optimizer

The Query Lifecycle#

Stage 1: Query Parsing#

What the Parser Does#

Stage 2: Query Rewriting#

Subquery Unnesting#

Predicate Pushdown#

View Expansion#

Stage 3: Query Planning and Optimization#

Rule-Based vs Cost-Based Optimization#

Statistics: How the Planner Estimates Row Counts#

Histograms#

Access Paths#

Sequential Scan#

Index Scan#

Index-Only Scan#

Bitmap Index Scan#

Join Strategies#

Nested Loop Join#

Hash Join#

Merge Join#

Reading EXPLAIN ANALYZE Output#

How to Read It#

Common Plan Problems#

VACUUM ANALYZE: Keeping Statistics Fresh#

pg_hint_plan#

Prepared Statements#

Python: Running EXPLAIN ANALYZE#

Summary#