databases · level 3

Indexes

From full scan to seek: how engines find rows fast.

175 XP

Indexes

Without an index, a database scans every row in a table to find the ones that match your filter. That is fine for a few thousand rows. At a million rows, it is a second. At a billion, it is several minutes. Indexes are the solution: a separate data structure that lets the engine jump directly to the rows you want.

Analogy

An index is like the alphabet tabs on the side of a paper address book. The address book itself holds the real entries — names, phone numbers, birthdays — written in whatever order you jotted them down over the years. Without the tabs, finding "Novak" means flipping every page. With tabs, your thumb lands on the "N" section in one motion and you scan a dozen names instead of three hundred. The tabs take up space along the edge and have to be updated when you add a new contact, but looking someone up goes from "read the whole book" to "open straight to the page."

How a B-tree index works

PostgreSQL's default index type is a B-tree (balanced tree). A B-tree is a sorted tree where:

  • Every leaf node holds a slice of sorted key values and pointers to heap rows.
  • Interior nodes act as a directory — each holds the minimum key of its subtree.
  • A lookup traverses from root to the appropriate leaf in O(log n) steps.

Because the keys are sorted, a B-tree supports:

  • Equality: WHERE email = 'alice@example.com'
  • Range: WHERE created_at BETWEEN '2024-01-01' AND '2025-01-01'
  • Prefix LIKE: WHERE name LIKE 'Al%'
  • IS NULL / IS NOT NULL

Hash indexes

A hash index stores a hash of each key value. It is faster for equality queries on high-cardinality columns but supports nothing else — no ranges, no LIKE, no sorting. In PostgreSQL, hash indexes are not WAL-logged pre-v10, making them risky under crashes. B-tree is almost always the better default.

The leftmost-prefix rule

Composite indexes are ordered. A composite index on (last_name, first_name, age) is useful when your query filters on:

  • last_name alone
  • last_name and first_name
  • last_name, first_name, and age

But not on first_name alone, because the tree is sorted by last_name first. Skipping a leading column breaks the sort order.

Covering indexes

An index "covers" a query when every column the query needs — both for filtering and for output — is stored in the index. The engine never has to visit the heap.

-- This index covers: SELECT id FROM users WHERE email = ?
CREATE INDEX idx_users_email_id ON users(email) INCLUDE (id);

Covering indexes eliminate a second I/O round-trip. They are particularly valuable for read-heavy, high-volume queries.

When indexes hurt

  • Low-selectivity columns: a boolean column with 95% true values has low selectivity. Filtering WHERE is_active = true touches nearly every row. The planner prefers a full scan.
  • Small tables: for a 500-row table, a scan is cheaper than traversing the index tree.
  • High write volume: every INSERT, UPDATE, DELETE must maintain every index on the table. Ten indexes = ten maintenance operations per write.

EXPLAIN / EXPLAIN ANALYZE

EXPLAIN shows the query plan the planner chose. EXPLAIN ANALYZE runs the query and shows actual execution statistics.

Seq Scan on posts  (cost=0.00..5.07 rows=107 width=60)
  Filter: (published = true)

Index Scan using idx_posts_user_id on posts  (cost=0.28..8.30 rows=2 width=60)
  Index Cond: (user_id = 3)

Seq Scan means full scan. Index Scan means the engine is using an index. Index Only Scan means it is a covering index — no heap visit.

The playground

Toggle indexes on and off against a simulated 10,000-row table. Watch the cost estimate drop from a full scan (O(n)) to a seek (O(log n)). The B-tree visualizer shows keys being inserted and split as the tree grows.

Tools in the wild

6 tools
  • Run a query and report the actual plan + row counts — the first stop for any slow query.

    library
  • Browser visualizer for EXPLAIN ANALYZE output; highlights misestimates and seq scans.

    service
  • Postgres extension that aggregates per-query timing — find the queries worth indexing.

    library
  • pgHerofree tier

    Web dashboard surfacing missing/unused indexes, slow queries, and bloat for Postgres.

    library
  • Per-query latency + row stats for hosted MySQL; suggests indexes from real workloads.

    service
  • Aggregate MySQL slow query logs — the canonical way to find indexable hot queries.

    cli