PostgreSQL Optimization (Part 1): LIKE vs Regex, which is actually faster?


This post comes from a real performance issue I hit at work. I took a few wrong turns before I understood what was actually slow. This is Part 1: LIKE vs Regex and the I/O vs CPU story. The next two parts go deeper:

Part 2: Reading EXPLAIN to see where Seq Scan hides
Part 3: Turning tags into an index with Array + GIN

Part 1: LIKE vs Regex in plain language

1. The setup

I inherited ETL-shaped data where many tags were packed into one text field, separated by ;:

```
tag1;tag2;tag3;tag_VIP;tag_inactive
```

So I did the obvious thing: filter with many LIKE conditions.

2. One comment that hit the real issue

A senior teammate said: "Merge those LIKE clauses into one Regex. It should run faster."

My first reaction: "If there is no useful index, don't both approaches still read a lot of rows?"

3. We were not disagreeing, just entering from different angles

We were solving the same problem, but from different starting points: I focused on read cost first, while my teammate focused on compute cost first.

- My focus: I/O cost. Without an index, the database often still scans a large part of the table.
- Teammate's focus: CPU cost. Many LIKEs can repeat string checks on the same row. One Regex often reduces that repeated per-row work.

A simple analogy:

- Many LIKEs = reading the same page many times, each time looking for one word.
- One Regex = reading the page once and checking many words in that pass.

4. What I want to pass on

If your dataset is small, Regex alone may already feel much better. At larger scale, syntax tweaks are not enough. You need to change how the database reaches the data.

Conclusion

Regex can reduce CPU work, but if the query still scans most rows, I/O remains the real bottleneck. Part 2 dissects LIKE and Regex with EXPLAIN ANALYZE; Part 3 builds the Array + GIN solution:

PostgreSQL Optimization (Part 2): Reading EXPLAIN to see where Seq Scan hides
PostgreSQL Optimization (Part 3): Turning tags into an index with Array + GIN
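To make the two styles concrete, here is a minimal sketch of the before and after. The table and tag names (customers, tags, tag_VIP, and so on) are placeholders for illustration, not the real schema:

```sql
-- Style 1: many LIKEs. Each condition is a separate substring scan
-- over the same row's text.
SELECT *
FROM customers
WHERE tags LIKE '%tag_VIP%'
   OR tags LIKE '%tag_gold%'
   OR tags LIKE '%tag_new%';

-- Style 2: one Regex. The alternatives are merged into a single
-- pattern, so each row's text is walked once.
SELECT *
FROM customers
WHERE tags ~ '(tag_VIP|tag_gold|tag_new)';
```

Without an index, both versions still read every row; they differ only in how much checking happens per row.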

PostgreSQL Optimization (Part 2): Reading EXPLAIN to see where Seq Scan hides


If you haven't read the LIKE/Regex vs I/O/CPU story, start there. If you'd rather jump straight to the implementation, head to Part 3:

Part 1: LIKE vs Regex - what's actually different?
Part 3: Turning tags into an index with Array + GIN

Problem framing: CPU improved, but the query was still too slow

Part 1 showed that Regex can lower per-row CPU compared to many LIKEs. At tens of millions of rows the bottleneck shifts to I/O and the access path, and per-row CPU savings stop mattering. This part has one job: use EXPLAIN ANALYZE to find exactly where LIKE/Regex pin the query to full-table I/O. Once we see the bottleneck, Part 3 fixes the access path.

Core analysis: read EXPLAIN to find the cost dominator

Note: the EXPLAIN snippets and SQL below are anonymized — schema, table, and column names (e.g. customers, tag_dictionary, balance) are stand-ins. Execution stats (rows, loops, Rows Removed, Buffers) come from a small test dataset and are shown for structural recognition only: focus on what each node looks like, where conditions sit, and whether PostgreSQL finds an index entry. The absolute numbers do not represent production scale.

Open the LIKE plan with EXPLAIN ANALYZE and the first thing that jumps out: the same query touches the main table with three separate Seq Scan nodes. To understand why, we have to look at the SQL itself.

Trace each Seq Scan back to the SQL that triggered it

Here is the simplified LIKE-version SQL with the table-touch points marked:

```sql
WITH product_dict AS (                       -- ← trigger 1
    SELECT DISTINCT product_id, product
    FROM customers
    WHERE product <> ''
)
SELECT ...
FROM customers AS main                       -- ← trigger 2
INNER JOIN product_dict ...
WHERE main.segment = 'core'
  AND lower(main.levels) LIKE '%lv1%'
  AND lower(main.excluded_levels) NOT LIKE '%ex001%'
  AND lower(main.excluded_levels) NOT LIKE '%ex022%'
  AND date_trunc('MONTH', main.etl_date) =
      date_trunc('MONTH', (SELECT MAX(etl_date) FROM customers));  -- ← trigger 3
```

Three innocent-looking constructs at the SQL level: a CTE, a main FROM, a MAX() subquery. To PostgreSQL, every one of them is "open the table again." Their EXPLAIN counterparts are three Seq Scan nodes:

```
-> Aggregate                     -- trigger 3: InitPlan 1
   -> Seq Scan on customers
      Buffers: shared hit=159

-> Seq Scan on customers main    -- trigger 2: main FROM
   Filter: customer_id <> '' AND segment = 'core'
           AND lower(levels) ~~ '%lv1%'
           AND lower(excluded_levels) !~~ '%ex001%'
           AND lower(excluded_levels) !~~ '%ex022%'
           AND date_trunc('MONTH', etl_date) = (InitPlan 1).col1
   Rows Removed by Filter: 3806
   Buffers: shared hit=318

-> Seq Scan on customers         -- trigger 1: inside the CTE
   Filter: product <> ''
   Rows Removed by Filter: 3835
   Buffers: shared hit=159
```

Four signals worth reading carefully:

- Three Seq Scan nodes: each one is "walk the entire table again."
- Rows Removed by Filter close to the node's rows: most of what we read is just thrown away.
- No condition becomes an Index Cond: every WHERE clause lands inside Filter, meaning PostgreSQL could not find any index entry point.
- The three Buffers: shared hit values together approach the table's full page count: the table is effectively read three times.

Why LIKE locks every scan to Seq Scan

PostgreSQL only considers B-tree indexes by default (unless you've explicitly built GIN, GiST, or another type for that column). A B-tree is an ordered tree. Conditions it can accelerate share one property: a starting key has to be pinnable in the tree — for example name = 'Luke' or name LIKE 'Lu%' both let it descend from a specific node. Once that property breaks, the index has nowhere to start. Look back at the three triggers:

- lower(levels) LIKE '%lv1%': leading %, no entry point in the tree.
- lower(excluded_levels) NOT LIKE '%ex001%': NOT LIKE is "verify a non-match for every row" — a B-tree is poor at negative matching.
- date_trunc('MONTH', etl_date) = ...: the column is wrapped in a function, so the B-tree can't find a matching key (unless you build an expression index).

The three Seq Scans come from one structural fact: the LIKE-plus-function-wrapping style gives PostgreSQL no index entry point at all.

What the Regex rewrite actually changes

Replacing the two NOT LIKEs with a Regex compiled from a lookup table changes the SQL shape entirely:

```sql
WITH remove_list AS (                                        -- ← trigger 1
    SELECT '(^|;|,)(' || REPLACE(string_agg(TRIM(code), '|'), ' ', '') || ')($|;|,)' AS regex_pattern
    FROM tag_dictionary
    WHERE kind = 'exclude'
      AND category IN ('low_marketing', 'private_bank')
)
SELECT cluster_list.display_name,
       main.etl_date,
       product_list.product,
       SUM(main.balance)::numeric / COUNT(DISTINCT main.customer_id) AS avg_balance
FROM (
    SELECT * FROM customers                                  -- ← trigger 2
    WHERE customer_id <> ''
      AND segment = 'core'
      AND etl_date = (SELECT MAX(etl_date) FROM customers)   -- ← trigger 3
) AS main
LEFT JOIN (
    SELECT DISTINCT '%' || lower(code) || '%' AS level_code, display_name
    FROM tag_dictionary                                      -- ← trigger 4
    WHERE display_name = 'HVIP'
) AS cluster_list ON main.levels LIKE cluster_list.level_code
LEFT JOIN (
    SELECT DISTINCT product_id, product
    FROM customers                                           -- ← trigger 5
    WHERE customer_id = '' AND product <> ''
) AS product_list ON main.product_id = product_list.product_id
CROSS JOIN remove_list
WHERE COALESCE(main.excluded_levels, '') !~ remove_list.regex_pattern
  AND cluster_list.display_name <> ''
  AND main.product_id <> 0;
```

Five table-touch points map to five Seq Scan nodes:

```
-> Aggregate                     -- trigger 3: InitPlan 1 (same as LIKE)
   -> Seq Scan on customers
      Buffers: shared hit=159

-> Aggregate                     -- trigger 1: remove_list CTE
   -> Seq Scan on tag_dictionary
      Filter: kind = 'exclude' AND category IN ('low_marketing','private_bank')
      Buffers: shared hit=4

-> Seq Scan on tag_dictionary    -- trigger 4: cluster_list subquery
   Filter: display_name = 'HVIP'
   Buffers: shared hit=4

-> Seq Scan on customers main    -- trigger 2: main FROM
   Filter: customer_id <> '' AND product_id <> 0 AND segment = 'core'
           AND etl_date = (InitPlan 1).col1
   Rows Removed by Filter: 3308
   Buffers: shared hit=318

-> Seq Scan on customers         -- trigger 5: product_list subquery
   Filter: product <> '' AND customer_id = ''
   Rows Removed by Filter: 3835
   Buffers: shared hit=159
```

The main customers table is still scanned three times (triggers 2, 3, 5). The two extra Seq Scans are on the dictionary table (triggers 1, 4). More importantly, the LIKE/Regex predicates that used to live in the main Filter: are now split across two Join Filter: nodes:

```
-> Nested Loop                   -- applies the Regex
   Join Filter: COALESCE(excluded_levels, '') !~ regex_pattern

-> Nested Loop                   -- applies LIKE on levels
   Join Filter: levels ~~ ('%' || lower(code) || '%')
```

LIKE vs Regex: a structural comparison

| Trait | LIKE version | Regex version |
| --- | --- | --- |
| Seq Scans on main table | 3 | 3 |
| Seq Scans on dictionary table | 0 | 2 |
| Where the LIKE/Regex predicates live | main Filter: (evaluated during Seq Scan) | Join Filter: (evaluated during the join) |
| What rows the main Seq Scan emits | already pruned by LIKE conditions | only basic conditions applied — string checks happen later in the join |

Two structural takeaways:

- The number of Seq Scans on the main table doesn't change. Three is still three. Regex doesn't address "how many times we scan."
- LIKE/Regex are promoted from Filter to Join Filter. The string check moves to the join phase, so the main table has to emit every row that passed the basic conditions before LIKE/Regex prunes it. Regex saves per-row CPU in theory (more on that below), but this structural rewrite gives that saving back.

In other words: SQL structure decides how many times we touch the table; LIKE or Regex decides only what to check on each row.
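To reproduce this kind of plan reading on your own tables, the invocation is the standard one. The query below is a made-up minimal example, not the anonymized production query:

```sql
-- ANALYZE actually runs the query and reports real row counts and timings;
-- BUFFERS adds the "shared hit/read" numbers used throughout this part.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM customers
WHERE lower(levels) LIKE '%lv1%';
```

Because ANALYZE executes the statement for real, wrap any data-modifying query in a transaction you can roll back before profiling it this way.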
Why a Regex can theoretically push CPU from $O(N \times M)$ to $O(N)$

Quick concept first: a finite automaton is a small machine that walks a string character by character, jumping between states, and reports "matched or not" at the end. A regex engine compiles the pattern into one of these structures before running it. There are two flavors:

- DFA (Deterministic Finite Automaton): at any state, a given character has exactly one next state. Fast to run, but the state count can blow up.
- NFA (Nondeterministic Finite Automaton): a state can transition into multiple successors on the same character. Fewer states, but execution may have to backtrack.

PostgreSQL uses an NFA-based engine.

For an alternation like (ex001|ex022|ex045), the compiled state machine looks roughly like this:

```
                       ┌─ 0 → 0 → 1  ✓ ex001
start ─→ e ─→ x ─→ 0 ──┼─ 2 → 2      ✓ ex022
                       └─ 4 → 5      ✓ ex045
```

The shared prefix ex0 is processed once. As input is read character by character, the entire machine advances all branches simultaneously — a single pass evaluates every alternative. Apply this to the query:

- Multi-LIKE: M independent patterns, no shared work — each row goes through M full string scans → CPU ~ $O(N \times M)$.
- Single Regex: M alternatives merged into one state machine — each row only needs one pass → CPU ~ $O(N)$.

The savings are strictly per-row CPU. The number of times the table has to be scanned does not change.

Cost dominators across the three approaches

Symbols: N = main-table rows, M = predicate count, K = matching rows.

| Approach | Per-scan CPU | Per-scan I/O |
| --- | --- | --- |
| Multiple LIKE | ~ $O(N \times M)$ | ~ $O(N)$ full scan |
| Single Regex | ~ $O(N)$ | ~ $O(N)$ full scan |
| Array + GIN | ~ $O(\log N + K)$ | index lookup + fetch matching pages |

LIKE and Regex only move the CPU column; the I/O column doesn't budge. Since SQL structure dictates how many times the main table is touched, the total cost is whichever column you sit in, multiplied by that count.
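A quick way to convince yourself that the single alternation behaves like the separate checks is to run the pattern against hand-made tag strings. The sample values here are invented; the pattern shape mirrors the one built by remove_list:

```sql
-- Pairs each sample tag string with whether the exclude pattern hits it.
SELECT tags,
       tags ~ '(^|;|,)(ex001|ex022|ex045)($|;|,)' AS excluded
FROM (VALUES ('tag1;ex001;tag2'),
             ('tag1;tag2'),
             ('ex045')) AS t(tags);
```

The delimiter groups in the pattern matter: without the (^|;|,) and ($|;|,) anchors, a tag like ex0011 would also match the ex001 alternative.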
To actually change the shape of each scan, you have to retire Seq Scan and bring in Index Scan. In plain terms: this only gets fast when the path changes from "check every row" to "find candidates first, then fetch hits."

Wrap-up of Part 2

Three things are now clear:

- SQL structure decides how many times the main table is touched. CTEs, subqueries, and MAX() each count as a separate trigger.
- LIKE plus function-wrapped columns lock every scan to Seq Scan. B-tree only accelerates conditions that pin down a starting key, and our three triggers all break that property.
- Rewriting to Regex doesn't reduce scan count. The state machine only saves per-row CPU; it doesn't change the access path.

To actually pull ahead, we need to switch to an index-driven access path — that's what Part 3 builds:

PostgreSQL Optimization (Part 3): Turning tags into an index with Array + GIN
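One follow-up on the function-wrapping point: of the three triggers, the wrapped date is the one a plain B-tree could still rescue, via an expression index whose key matches the wrapped column exactly. A sketch, assuming etl_date is a plain date or timestamp (with timestamptz, date_trunc('MONTH', ...) is not immutable and this index cannot be built as written); the index name is made up:

```sql
-- Stores precomputed date_trunc values, so a predicate of the form
-- date_trunc('MONTH', etl_date) = ... can become an Index Cond.
CREATE INDEX idx_customers_etl_month
    ON customers (date_trunc('MONTH', etl_date));
```

The leading-% LIKEs and the NOT LIKEs stay out of reach either way, which is why Part 3 changes the data model instead.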

PostgreSQL Optimization (Part 3): Turning tags into an index with Array + GIN


If you haven't read the earlier parts:

Part 1: LIKE vs Regex - what's actually different?
Part 2: Reading EXPLAIN to see where Seq Scan hides

From observation to action: the access path is what we need to change

Part 2 made the situation concrete: LIKE and Regex both hand PostgreSQL a full-table Seq Scan, because B-tree indexes can't anchor on '%xxx%' conditions. Two things have to happen to actually change the cost: change the data model so the query can express itself in operators an index can serve, and build the right kind of index so PostgreSQL has somewhere to go.

Practical solution: make the query index-friendly

Two simple steps: change the data type, then rewrite the predicates so they hit the index.

A. Reshape the data (string → array)

Convert 'tag1;tag2' into PostgreSQL's native text[], so query semantics map directly onto array operators:

```sql
ALTER TABLE customers ADD COLUMN tags_array text[];

UPDATE customers
SET tags_array = string_to_array(nullif(trim(tags), ''), ';');
```

B. Build a GIN inverted index

```sql
CREATE INDEX CONCURRENTLY idx_customers_tags_gin
    ON customers USING gin (tags_array);
```

Rewrite predicates with array operators (&&, @>) so PostgreSQL can pick up the GIN index:

- tags_array && ARRAY['tag_VIP']::text[] — has any overlap.
- tags_array @> ARRAY['tag_VIP']::text[] — contains the value.

Key takeaways: database fundamentals worth carrying forward

These database fundamentals will resurface on every future performance problem:

- Measurement beats intuition. Read the plan with EXPLAIN ANALYZE. Years of experience still lose to a printed plan tree.
- Describe cost in complexity language. Naming N, M, K upfront is what lets you compare two approaches honestly.
- The data model caps performance. SQL syntax can only optimize within the current shape — fundamental wins usually require rethinking the column design.
- Verify the optimization in the plan. A Seq Scan becoming an Index Scan is the proof; faster wall-clock alone doesn't confirm the access path changed.

Conclusion

Regex helps reduce CPU cost first, but at large scale the bigger win comes from Array + GIN, because it changes the access path itself.

Recap of the earlier parts:

Part 1: LIKE vs Regex - what's actually different?
Part 2: Reading EXPLAIN to see where Seq Scan hides
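For completeness, here is what steps A and B buy at query time. A filter in the spirit of the earlier tag conditions might look like this; the tag values are placeholders carried over from the earlier sketches:

```sql
-- Overlap (&&) replaces the positive LIKEs; NOT overlap replaces the
-- NOT LIKEs. The positive condition can be served by the GIN index.
SELECT customer_id
FROM customers
WHERE tags_array && ARRAY['tag_VIP']::text[]
  AND NOT (tags_array && ARRAY['ex001', 'ex022']::text[]);
```

Note that the negated branch cannot drive the index by itself (GIN finds matches, not non-matches); in this shape the positive && condition supplies the index candidates and the negation filters them afterwards.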

