PostgreSQL Optimization (Part 2): Reading EXPLAIN to see where Seq Scan hides
If you haven’t read the LIKE/Regex vs I/O/CPU story, start there. If you’d rather jump straight to the implementation, head to Part 3:
- Part 1: LIKE vs Regex - what’s actually different?
- Part 3: Turning tags into an index with Array + GIN
Problem framing: CPU improved, but the query was still too slow
Part 1 showed that Regex can lower per-row CPU compared to many LIKEs. At tens of millions of rows the bottleneck shifts to I/O and access path, and per-row CPU savings stop mattering.
This part has one job: use EXPLAIN ANALYZE to find exactly where LIKE/Regex pin the query to full-table I/O. Once we see the bottleneck, Part 3 fixes the access path.
Core analysis: read EXPLAIN to find the cost dominator
Note: the EXPLAIN snippets and SQL below are anonymized — schema, table, and column names (e.g. `customers`, `tag_dictionary`, `balance`) are stand-ins. Execution stats (`rows`, `loops`, `Rows Removed`, `Buffers`) come from a small test dataset and are shown for structural recognition only: focus on what each node looks like, where conditions sit, and whether PostgreSQL finds an index entry point. The absolute numbers do not represent production scale.
Open the LIKE plan with EXPLAIN ANALYZE and the first thing that jumps out: the same query touches the main table with three separate Seq Scan nodes. To understand why, we have to look at the SQL itself.
Trace each Seq Scan back to the SQL that triggered it
Here is the simplified LIKE-version SQL with the table-touch points marked:
WITH product_dict AS ( -- ← trigger 1
SELECT DISTINCT product_id, product
FROM customers
WHERE product <> ''
)
SELECT ...
FROM customers AS main -- ← trigger 2
INNER JOIN product_dict ...
WHERE main.segment = 'core'
AND lower(main.levels) LIKE '%lv1%'
AND lower(main.excluded_levels) NOT LIKE '%ex001%'
AND lower(main.excluded_levels) NOT LIKE '%ex022%'
AND date_trunc('MONTH', main.etl_date)
= date_trunc('MONTH', (SELECT MAX(etl_date) FROM customers)); -- ← trigger 3
Three innocent-looking constructs at the SQL level: a CTE, a main FROM, a MAX() subquery. To PostgreSQL, every one of them is “open the table again.” Their EXPLAIN counterparts are three Seq Scan nodes:
-> Aggregate -- trigger 3: InitPlan 1
-> Seq Scan on customers
Buffers: shared hit=159
-> Seq Scan on customers main -- trigger 2: main FROM
Filter: customer_id <> ''
AND segment = 'core'
AND lower(levels) ~~ '%lv1%'
AND lower(excluded_levels) !~~ '%ex001%'
AND lower(excluded_levels) !~~ '%ex022%'
AND date_trunc('MONTH', etl_date) = (InitPlan 1).col1
Rows Removed by Filter: 3806
Buffers: shared hit=318
-> Seq Scan on customers -- trigger 1: inside the CTE
Filter: product <> ''
Rows Removed by Filter: 3835
Buffers: shared hit=159
Four signals worth reading carefully:
- Three `Seq Scan` nodes: each one is "walk the entire table again."
- `Rows Removed by Filter` close to the node's `rows`: most of what we read is just thrown away.
- No condition becomes an `Index Cond`: every WHERE clause lands inside `Filter`, meaning PostgreSQL could not find any index entry point.
- The three `Buffers: shared hit` counts together approach the table's full page count: the table is effectively read three times.
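For reference, plans in this shape come from running the query under `EXPLAIN (ANALYZE, BUFFERS)` — a sketch, with the query body elided:

```sql
-- ANALYZE executes the query and reports actual row counts and timing;
-- BUFFERS adds the "shared hit" page counts that expose the triple read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;  -- the LIKE query from above
```

Plain `EXPLAIN ANALYZE` without `BUFFERS` would still show the three Seq Scan nodes, but not the page counts that prove the table is read in full each time.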
Why LIKE locks every scan to Seq Scan
PostgreSQL only considers B-tree indexes by default (unless you’ve explicitly built GIN, GiST, or another type for that column). A B-tree is an ordered tree. Conditions it can accelerate share one property: a starting key has to be pinnable in the tree — for example name = 'Luke' or name LIKE 'Lu%' both let it descend from a specific node. Once that property breaks, the index has nowhere to start. Look back at the three triggers:
- `lower(levels) LIKE '%lv1%'`: leading `%`, no entry point in the tree.
- `lower(excluded_levels) NOT LIKE '%ex001%'`: `NOT LIKE` is "verify a non-match for every row" — a B-tree is poor at negative matching.
- `date_trunc('MONTH', etl_date) = ...`: the column is wrapped in a function, so the B-tree can't find a matching key (unless you build an expression index).
The three Seq Scans come from one structural fact: the LIKE-plus-function-wrapping style gives PostgreSQL no index entry point at all.
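As an aside, trigger 3 alone has a classic fix. The sketch below uses a hypothetical index name and assumes `etl_date` is a plain `timestamp` — with `timestamptz`, `date_trunc('MONTH', ...)` is only stable, not immutable, and the index build fails:

```sql
-- Option A: an expression index whose key matches the wrapped column exactly,
-- so date_trunc('MONTH', etl_date) = ... becomes an Index Cond
CREATE INDEX idx_customers_etl_month
    ON customers (date_trunc('MONTH', etl_date));

-- Option B: rewrite the predicate as a half-open range, so a plain B-tree
-- on etl_date gets an entry point and no expression index is needed:
--   WHERE etl_date >= date_trunc('MONTH', :max_date)
--     AND etl_date <  date_trunc('MONTH', :max_date) + interval '1 month'
```

Neither option helps the two LIKE triggers, which is why Part 3 changes the data model instead of adding more B-trees.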
What the Regex rewrite actually changes
Replacing the two NOT LIKEs with a Regex compiled from a lookup table changes the SQL shape entirely:
WITH remove_list AS ( -- ← trigger 1
SELECT '(^|;|,)('
|| REPLACE(string_agg(TRIM(code), '|'), ' ', '')
|| ')($|;|,)' AS regex_pattern
FROM tag_dictionary
WHERE kind = 'exclude'
AND category IN ('low_marketing', 'private_bank')
)
SELECT
cluster_list.display_name,
main.etl_date,
product_list.product,
SUM(main.balance)::numeric / COUNT(DISTINCT main.customer_id) AS avg_balance
FROM (
SELECT *
FROM customers -- ← trigger 2
WHERE customer_id <> ''
AND segment = 'core'
AND etl_date = (SELECT MAX(etl_date) FROM customers) -- ← trigger 3
) AS main
LEFT JOIN (
SELECT DISTINCT '%' || lower(code) || '%' AS level_code, display_name
FROM tag_dictionary -- ← trigger 4
WHERE display_name = 'HVIP'
) AS cluster_list
ON main.levels LIKE cluster_list.level_code
LEFT JOIN (
SELECT DISTINCT product_id, product
FROM customers -- ← trigger 5
WHERE customer_id = ''
AND product <> ''
) AS product_list
ON main.product_id = product_list.product_id
CROSS JOIN remove_list
WHERE COALESCE(main.excluded_levels, '') !~ remove_list.regex_pattern
AND cluster_list.display_name <> ''
AND main.product_id <> 0;
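Before trusting the plan, it's worth sanity-checking the pattern `remove_list` assembles. With two hypothetical codes the CTE would emit `(^|;|,)(ex001|ex022)($|;|,)`, and the delimiter groups are what prevent prefix false-positives:

```sql
-- Hypothetical codes: 'ex001' between delimiters must match,
-- but the longer code 'ex0010' must NOT match the 'ex001' alternative
SELECT 'lv1;ex001;lv3' ~ '(^|;|,)(ex001|ex022)($|;|,)' AS hit,       -- true
       'lv1;ex0010'    ~ '(^|;|,)(ex001|ex022)($|;|,)' AS near_miss; -- false
```

Without the `(^|;|,)` and `($|;|,)` anchors the pattern would degrade to a plain substring check, inheriting the same false-positive risk as `LIKE '%ex001%'`.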
Five table-touch points map to five Seq Scan nodes:
-> Aggregate -- trigger 3: InitPlan 1 (same as LIKE)
-> Seq Scan on customers
Buffers: shared hit=159
-> Aggregate -- trigger 1: remove_list CTE
-> Seq Scan on tag_dictionary
Filter: kind = 'exclude'
AND category IN ('low_marketing','private_bank')
Buffers: shared hit=4
-> Seq Scan on tag_dictionary -- trigger 4: cluster_list subquery
Filter: display_name = 'HVIP'
Buffers: shared hit=4
-> Seq Scan on customers main -- trigger 2: main FROM
Filter: customer_id <> ''
AND product_id <> 0
AND segment = 'core'
AND etl_date = (InitPlan 1).col1
Rows Removed by Filter: 3308
Buffers: shared hit=318
-> Seq Scan on customers -- trigger 5: product_list subquery
Filter: product <> '' AND customer_id = ''
Rows Removed by Filter: 3835
Buffers: shared hit=159
The main customers table is still scanned three times (triggers 2, 3, 5). The two extra Seq Scans are on the dictionary table (triggers 1, 4).
More importantly, the LIKE/Regex predicates that used to live in the main Filter: are now split across two Join Filter: nodes:
-> Nested Loop -- applies the Regex
Join Filter: COALESCE(excluded_levels, '') !~ regex_pattern
-> Nested Loop -- applies LIKE on levels
Join Filter: levels ~~ ('%' || lower(code) || '%')
LIKE vs Regex: a structural comparison
| Trait | LIKE version | Regex version |
|---|---|---|
| Seq Scans on main table | 3 | 3 |
| Seq Scans on dictionary table | 0 | 2 |
| Where the LIKE/Regex predicates live | main Filter: (evaluated during Seq Scan) | Join Filter: (evaluated during the join) |
| What rows the main Seq Scan emits | already pruned by LIKE conditions | only basic conditions applied — string checks happen later in the join |
Two structural takeaways:
- The number of Seq Scans on the main table doesn’t change. Three is still three. Regex doesn’t address “how many times we scan.”
- LIKE/Regex are promoted from `Filter` to `Join Filter`. The string check moves to the join phase, so the main table has to emit every row that passed the basic conditions before LIKE/Regex prunes it. Regex saves per-row CPU in theory (more on that below), but this structural rewrite gives that saving back.
In other words: SQL structure decides how many times we touch the table; LIKE or Regex decides only what to check on each row.
Why a Regex can theoretically push CPU from O(M × N) to O(N)
Quick concept first: a finite automaton is a small machine that walks a string character by character, jumping between states, and reports “matched or not” at the end. A regex engine compiles the pattern into one of these structures before running it. There are two flavors:
- DFA (Deterministic Finite Automaton): at any state, a given character has exactly one next state. Fast to run, but the state count can blow up.
- NFA (Nondeterministic Finite Automaton): a state can transition into multiple successors on the same character. Fewer states, but execution may have to backtrack. PostgreSQL uses an NFA-based engine.
For an alternation like (ex001|ex022|ex045), the compiled state machine looks roughly like this:
┌─ 0 → 0 → 1 ✓ ex001
start ─→ e ─→ x ─→ 0 ──────┼─ 2 → 2 ✓ ex022
└─ 4 → 5 ✓ ex045
The shared prefix ex0 is processed once. As input is read character by character, the entire machine advances all branches simultaneously — a single pass evaluates every alternative.
Apply this to the query:
- Multi-LIKE: M independent patterns, no shared work — each row goes through M full string scans → CPU ~ O(M × N).
- Single Regex: M alternatives merged into one state machine — each row only needs one pass → CPU ~ O(N).
The savings are strictly per-row CPU. The number of times the table has to be scanned does not change.
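The per-row difference can be seen in miniature on a single value — a sketch with hypothetical codes; at this scale EXPLAIN shows no CPU gap, only the shape of the check changes:

```sql
-- Three independent scans of s (!~~ is the operator form of NOT LIKE) ...
SELECT s !~~ '%ex001%' AND s !~~ '%ex022%' AND s !~~ '%ex045%' AS multi_like,
-- ... versus one pass through a single compiled state machine
       s !~ '(^|;|,)(ex001|ex022|ex045)($|;|,)'                AS single_regex
FROM (VALUES ('lv1;lv9')) AS t(s);  -- no excluded code present in either check
```

With M patterns, the left-hand side walks `s` up to M times; the right-hand side walks it once, whatever M is.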
Cost dominators across the three approaches
Symbols: N = main-table rows, M = predicate count, K = matching rows.
| Approach | Per-scan CPU | Per-scan I/O |
|---|---|---|
| Multiple LIKE | ~ O(M × N) | ~ full scan |
| Single Regex | ~ O(N) | ~ full scan |
| Array + GIN | ~ O(K) | index lookup + fetch matching pages |
LIKE and Regex only move the CPU column; the I/O column doesn’t budge. Since SQL structure dictates how many times the main table is touched, the total cost is whichever column you sit in, multiplied by that count. To actually change the shape of each scan, you have to retire Seq Scan and bring in Index Scan.
In plain terms: this only gets fast when the path changes from “check every row” to “find candidates first, then fetch hits.”
Wrap-up of Part 2
Three things are now clear:
- SQL structure decides how many times the main table is touched. CTEs, subqueries, and `MAX()` each count as a separate trigger.
- LIKE plus function-wrapped columns lock every scan to Seq Scan. A B-tree only accelerates conditions that pin down a starting key, and our three triggers all break that property.
- Rewriting to Regex doesn’t reduce scan count. The state machine only saves per-row CPU; it doesn’t change the access path.
To actually pull ahead, we need to switch to an index-driven access path — that’s what Part 3 builds:
PostgreSQL Optimization (Part 3): Turning tags into an index with Array + GIN