Postgres & pgvector

Summary

PostgreSQL 16+ is the most feature-rich open-source RDBMS and a cornerstone of modern data engineering stacks. Its relevance extends beyond OLTP: JSONB, window functions, table partitioning, Foreign Data Wrappers (FDW), and logical replication make it a versatile operational analytics store. The pgvector extension (v0.7+) adds native vector similarity search, turning Postgres into a combined relational + vector database — critical for Retrieval-Augmented Generation (RAG) workloads. For DE interviews, expect deep questions on MVCC internals, EXPLAIN ANALYZE, index selection, VACUUM behaviour, and pgvector index types (HNSW vs IVFFlat).

This guide focuses on Postgres 16 and pgvector 0.7+. It assumes you know basic SQL; the focus is on DE-level performance tuning, partitioning, and vector search patterns relevant to modern AI-augmented pipelines.

Core Concepts
Industry Use Cases
Code Examples
Comparison / When to Use
Gotchas & Anti-patterns
Exercises
Quiz
Further Reading

Core Concepts

MVCC & VACUUM

Postgres uses Multi-Version Concurrency Control (MVCC): every row update writes a new version (tuple) with a new xmin transaction ID, while the old version is marked with xmax but not immediately removed. This enables non-blocking reads — readers never wait for writers. However, dead tuples accumulate over time, bloating tables and index pages.

VACUUM: Marks dead tuples as reusable space without returning it to the OS. Runs automatically via autovacuum.
VACUUM FULL: Rewrites the entire table (requires exclusive lock); reclaims disk space back to OS. Use sparingly — consider pg_repack for online bloat removal.
VACUUM ANALYZE: Also updates planner statistics in pg_statistic. Run after bulk loads before querying.
Transaction ID Wraparound: Postgres uses 32-bit XID; at ~2.1 billion transactions it wraps. The age(datfrozenxid) metric must stay below ~2 billion. Monitor with SELECT max(age(datfrozenxid)) FROM pg_database.

Signs of VACUUM neglect: high n_dead_tup in pg_stat_user_tables, last_autovacuum timestamps far in the past, table file sizes growing despite stable row counts.

EXPLAIN ANALYZE & Query Planning

The query planner chooses join strategies, index use, and scan types based on statistics. EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) runs the query and shows actual vs estimated rows, execution time, and buffer hits/misses.

Seq Scan: Reads entire table. Expected for small tables or low-selectivity predicates. Unexpected on large tables with a valid index = stale statistics or wrong index type.
Index Scan: Random I/O via index pointer. Efficient for high-selectivity (<~5% of rows).
Index Only Scan: Data served from the index alone (all selected columns in index). Check Heap Fetches: 0 to confirm — requires recent VACUUM for the visibility map.
Bitmap Heap Scan: Collects row locations from index, then fetches heap pages sorted by block. Efficient for medium selectivity (5–20%).
Hash Join / Merge Join / Nested Loop: Join algorithms. Hash join for unordered sets, merge for pre-sorted inputs, nested loop for small inner tables.

Key planner settings: work_mem (per-sort memory), effective_cache_size (hint for OS cache), random_page_cost (set to 1.1 for SSDs vs default 4.0).

Index Types

Index	Use Case	Notes
B-tree	Equality, range, ORDER BY	Default. Handles `=`, `<`, `>`, `BETWEEN`, `LIKE 'prefix%'`
Hash	Equality only	Faster than B-tree for pure `=`; not WAL-logged before PG10
GIN	Array containment, JSONB keys, full-text	Fast lookup, slow write; ideal for `@>`, `?` JSONB operators
GiST	Geometric, full-text, range types	Lossy (heap rechecks); used for `tsvector`, PostGIS
BRIN	Monotonic data (timestamps, sequential IDs)	Tiny index stores min/max per block range. Useless for random data.
Partial	Index a subset of rows	`CREATE INDEX ON orders(status) WHERE status = 'pending'`
Covering	Enable index-only scans	`CREATE INDEX ON orders(cust_id) INCLUDE (total, created_at)`

Table Partitioning

Postgres supports declarative partitioning (PG10+) with three strategies:

RANGE: By date/timestamp. Most common for time-series. Queries with WHERE created_at BETWEEN ... prune to relevant partitions only.
LIST: By discrete values (region, status). Useful for archiving specific segments.
HASH: Evenly distribute by column hash. Good for load balancing; no pruning for range queries.

Drop old data by DETACH PARTITION + DROP TABLE — far faster than DELETE and avoids bloat. Use pg_partman to automate partition creation and retention. Attach/detach without locking: ALTER TABLE parent ATTACH PARTITION child FOR VALUES FROM (...) TO (...).

pgvector: HNSW vs IVFFlat

The pgvector extension adds a vector(n) type and distance operators: <-> (L2), <#> (negative inner product), <=> (cosine). Two approximate nearest-neighbour (ANN) index types:

Feature	IVFFlat	HNSW
Build time	Fast	Slow (10–20× IVFFlat)
Build memory	Low	High (full graph in RAM during build)
Query speed	Medium	Fast (lower per-query latency)
Recall quality	Good at high `probes`	Excellent at high `ef_search`
Dynamic inserts	Degrades until rebuild	✅ Handles inserts without full rebuild
Best for	Static bulk-loaded datasets	Production RAG with continuous writes

Key parameters — IVFFlat: lists (centroids; ~sqrt(rows) is a good start), probes (centroids searched at query time). HNSW: m (max connections per layer; default 16), ef_construction (build quality), ef_search (query recall, set at session level with SET hnsw.ef_search = 100).

JSONB for Semi-Structured Data

Always use jsonb (binary decomposed) over json (stored as text, re-parsed on every access). JSONB supports indexing and is 10–50× faster for queries.

Operators: -> (field as JSON), ->> (as text), #> (path), @> (containment), ? (key exists)
GIN index: CREATE INDEX ON events USING GIN (payload) — accelerates @> and ?.
Generated columns: Extract frequently-queried JSONB fields for B-tree indexing: ALTER TABLE ev ADD COLUMN ev_type text GENERATED ALWAYS AS (payload->>'type') STORED.
jsonb_to_recordset: Unnest JSON arrays to rows for relational joins: SELECT * FROM jsonb_to_recordset('[{...}]') AS t(id int, name text).

↑ Back to top

Industry Use Cases

RAG (Retrieval-Augmented Generation) Pipeline

A document Q&A system chunks text into 512-token passages, embeds them with OpenAI text-embedding-3-small (1536 dims), and stores vectors in a documents table alongside the original text. At query time, the user's question is embedded and top-K nearest passages are retrieved via ORDER BY embedding <=> query_vec LIMIT 5 and injected as context into the LLM prompt. For datasets under ~10M rows, pgvector eliminates the need for a separate vector database, reducing operational complexity significantly.

Time-Series Sensor Data with Range Partitioning

An industrial IoT platform ingests 50M sensor readings per day into a measurements table, partitioned by month. New partitions are pre-created by pg_partman. Queries aggregating last 7 days hit at most 2 partitions. Old partitions are detached and archived to S3 via pg_dump + gzip. This avoids the expensive DELETE FROM ... WHERE timestamp < ... bloat and provides sub-second aggregation on recent data with a BRIN index on the timestamp column.

Hybrid Search: BM25 + pgvector for Product Discovery

An e-commerce platform combines full-text search (tsvector/tsquery + GiST index) with semantic vector similarity (pgvector cosine) for product search. A Reciprocal Rank Fusion (RRF) query fuses both result lists: WITH bm25 AS (...), vec AS (...) SELECT ... ORDER BY (1.0/(60+bm25.rank) + 1.0/(60+vec.rank)) DESC. Hybrid search outperforms either method alone on recall@10 benchmarks by 15–30%.

Operational Data Store for dbt Models

A dbt project materializes intermediate models into a Postgres analytics schema used by microservices. Logical replication from the OLTP primary to a read replica serves both dbt transformations and BI tools without impacting write performance. Foreign Data Wrappers (postgres_fdw) pull reference data from remote Postgres instances without an ETL pipeline.

↑ Back to top

Code Examples

Example 1 — pgvector: Table, HNSW Index, and Similarity Search

-- Install the extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Documents table with 1536-dim OpenAI embeddings
CREATE TABLE documents (
    id          bigserial PRIMARY KEY,
    source      text          NOT NULL,
    chunk_text  text          NOT NULL,
    embedding   vector(1536) NOT NULL,
    created_at  timestamptz   DEFAULT now()
);

-- HNSW index for cosine similarity (best for production RAG)
CREATE INDEX idx_documents_embedding_hnsw
    ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Set recall quality at query time (higher = better recall, slower)
SET hnsw.ef_search = 100;

-- Top-5 semantically similar documents via cosine distance
SELECT
    id,
    source,
    chunk_text,
    1 - (embedding <=> '[0.02, -0.15, ...]'::vector) AS cosine_similarity
FROM  documents
ORDER BY embedding <=> '[0.02, -0.15, ...]'::vector
LIMIT 5;

Example 2 — Hybrid Search with Reciprocal Rank Fusion

-- Full-text search column (generated, stored)
ALTER TABLE documents
    ADD COLUMN ts_content tsvector
    GENERATED ALWAYS AS (to_tsvector('english', chunk_text)) STORED;

CREATE INDEX idx_documents_fts ON documents USING GIN(ts_content);

-- Reciprocal Rank Fusion (RRF) hybrid query
SET hnsw.ef_search = 100;

WITH
bm25_ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY ts_rank_cd(ts_content, q) DESC) AS bm25_rank
    FROM   documents, to_tsquery('english', 'data & engineering') q
    WHERE  ts_content @@ q
    LIMIT  60
),
vec_ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY embedding <=> '[0.02,-0.15,...]'::vector) AS vec_rank
    FROM   documents
    ORDER BY embedding <=> '[0.02,-0.15,...]'::vector
    LIMIT  60
),
rrf AS (
    SELECT
        COALESCE(b.id, v.id) AS id,
        COALESCE(1.0 / (60 + b.bm25_rank), 0) +
        COALESCE(1.0 / (60 + v.vec_rank),  0) AS rrf_score
    FROM  bm25_ranked b
    FULL OUTER JOIN vec_ranked v ON b.id = v.id
)
SELECT d.source, d.chunk_text, r.rrf_score
FROM  rrf r
JOIN  documents d ON d.id = r.id
ORDER BY rrf_score DESC
LIMIT 5;

Example 3 — Inserting Embeddings via Python (psycopg3)

import psycopg
from openai import OpenAI
from pgvector.psycopg import register_vector

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return resp.data[0].embedding

with psycopg.connect("dbname=mydb") as conn:
    register_vector(conn)  # Enables native vector type serialization

    chunks = [
        ("doc_1.pdf", "Delta Lake provides ACID transactions on S3."),
        ("doc_1.pdf", "Photon is a vectorized C++ execution engine in Databricks."),
        ("doc_2.pdf", "pgvector supports HNSW and IVFFlat index types."),
    ]

    with conn.cursor() as cur:
        for source, text in chunks:
            vec = embed(text)
            cur.execute(
                """
                INSERT INTO documents (source, chunk_text, embedding)
                VALUES (%s, %s, %s)
                ON CONFLICT DO NOTHING
                """,
                (source, text, vec),
            )
    conn.commit()

Example 4 — Range Partitioning & Archival

-- Parent partitioned table
CREATE TABLE measurements (
    sensor_id   text             NOT NULL,
    measured_at timestamptz     NOT NULL,
    value       double precision NOT NULL
) PARTITION BY RANGE (measured_at);

-- Monthly partitions (pg_partman automates this in production)
CREATE TABLE measurements_2026_04
    PARTITION OF measurements
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

-- BRIN index: tiny footprint for monotonic timestamps
CREATE INDEX idx_m_2026_04_brin
    ON measurements_2026_04 USING BRIN (measured_at);

-- Query prunes to 1 partition automatically
EXPLAIN (ANALYZE, BUFFERS)
SELECT AVG(value)
FROM  measurements
WHERE measured_at >= '2026-04-01'
  AND measured_at <  '2026-04-08';

-- Detach old partition for archival (non-blocking after PG14)
ALTER TABLE measurements
    DETACH PARTITION measurements_2026_03 CONCURRENTLY;

-- pg_dump measurements_2026_03 | gzip | aws s3 cp - s3://archive/...
DROP TABLE measurements_2026_03;

↑ Back to top

Comparison / When to Use

Dimension	Postgres + pgvector	Pinecone	Weaviate	Milvus
Max vectors (practical)	~10M (single node)	Billions (managed)	Hundreds of millions	Billions (distributed)
Relational joins	✅ Full SQL	❌	Partial	❌
Hybrid search (filter + vector)	✅ Native SQL WHERE	✅ Metadata filters	✅ GraphQL filters	✅ Scalar filters
ACID transactions	✅	❌	Partial	✅
Operational complexity	Low (existing Postgres)	None (SaaS)	Medium	High
Cost	Infra only	High ($0.096/1M vectors/mo+)	Medium	Infra only
Best for	≤10M vectors, RAG + relational joins, existing Postgres stack	Large scale, zero-ops	Knowledge graph + semantic search	Very large scale, ML teams

Rule of thumb: Start with pgvector if you already run Postgres and your vector count is under 5–10M. The operational simplicity of one database for relational + vector data often outweighs the query-speed advantage of dedicated vector DBs at small-to-medium scale.

↑ Back to top

Gotchas & Anti-patterns

No VACUUM ANALYZE after bulk loads. Bulk-inserting millions of rows leaves the planner with zero statistics. It defaults to estimating 10,000 rows, causing sequential scans. Always run ANALYZE table_name (or VACUUM ANALYZE) after bulk loads and before querying.
NOT IN with NULLs returns zero rows. WHERE id NOT IN (SELECT id FROM other) returns nothing if any subquery row is NULL — because x != NULL is always NULL. Use NOT EXISTS or add WHERE id IS NOT NULL inside the subquery. This is one of the most common SQL correctness bugs in DE interviews.
pgvector index not used due to type mismatch. If the query embedding is a Python list cast differently (e.g., ARRAY[...] instead of '[...]'::vector), Postgres won't use the vector index. Always cast to ::vector explicitly in queries, or use the pgvector Python library's native type adapter via register_vector(conn).
Storing embeddings as real[] instead of vector. A plain array column prevents pgvector index creation and disables SIMD-accelerated distance computation. Always use the vector(n) type from the pgvector extension.
Exhausting max_connections under load. Postgres creates one OS process per connection. Default max_connections = 100. Without connection pooling (PgBouncer in transaction mode), a typical web app with many threads will exhaust connections or saturate the server. A practical heuristic: max_connections ≈ (4 × CPU cores) + disk count; use PgBouncer for everything above that.

↑ Back to top

Exercises

EXPLAIN ANALYZE lab: On a million-row table, run the same query with SET enable_seqscan = off vs on. Compare outputs. Then insert 500K rows without running ANALYZE and observe how estimated rows diverge from actual rows. Run ANALYZE and confirm the planner improves. Document the planner's node changes.
pgvector RAG pipeline: Create a table with vector(384) embeddings using sentence-transformers/all-MiniLM-L6-v2 from Python. Load 500 text chunks. Create an HNSW index. Implement query_similar(question: str, k: int) -> list[str]. Measure recall vs exact scan (disable index with SET enable_indexscan = off). Tune ef_search to find the sweet spot between recall and latency.
Partitioning & archival: Create a log_events table partitioned by month. Write a Python script to create the next 3 monthly partitions, insert 10,000 rows into each, then detach and drop the oldest partition. Confirm queries against the parent table still scan only the relevant partitions using EXPLAIN and the Subplans Removed field.

↑ Back to top

Quiz

Q1: Why might an Index Only Scan show "Heap Fetches: 42000" even though all selected columns are in the index?

Index Only Scans avoid heap access only for pages marked as "all-visible" in the visibility map (VM). After inserts, the VM marks pages as not-all-visible until VACUUM processes them. For those pages, Postgres must fetch the heap to check row visibility — resulting in Heap Fetches > 0. Fix: run VACUUM table_name to update the visibility map. After a cold re-run, Heap Fetches should approach 0.

Q2: What is transaction ID wraparound and why is it dangerous?

Postgres uses 32-bit transaction IDs (XIDs). After ~2.1 billion transactions, XIDs wrap around to zero. Without prevention, Postgres would consider all old rows as "future transactions" — making them invisible (effectively data loss). Prevention: Postgres freezes old tuples (replaces their XID with FrozenXID) during aggressive VACUUM. Monitor SELECT max(age(datfrozenxid)) FROM pg_database. When age approaches 2 billion, Postgres enters read-only emergency mode. Autovacuum with proper autovacuum_freeze_max_age prevents this.

Q3: What is the difference between HNSW and IVFFlat in pgvector, and when would you choose each?

IVFFlat clusters vectors into centroids at build time (fast build, small memory, degrades with inserts). At query time it probes N clusters — higher probes = better recall but slower. HNSW builds a hierarchical proximity graph (slow build, high memory) but achieves excellent recall at lower query latency and handles dynamic inserts gracefully. Choose IVFFlat for static, bulk-loaded datasets. Choose HNSW for production RAG where documents are continuously added and query recall/latency quality matter most.

Q4: A query using NOT IN (subquery) returns 0 rows unexpectedly. What's wrong?

The subquery almost certainly returns at least one NULL. In SQL, x NOT IN (1, 2, NULL) evaluates to NULL (unknown) for every value of x, because x != NULL is always NULL. The fix: rewrite as NOT EXISTS (SELECT 1 FROM ... WHERE id = outer.id AND id IS NOT NULL) or add WHERE id IS NOT NULL inside the subquery. This is one of the most common SQL correctness bugs.

Q5: You see "Rows Removed by Filter: 950000" on an Index Scan in EXPLAIN ANALYZE. What does this mean and how do you fix it?

The index returned ~950,000 rows (poor selectivity), which were then re-checked and discarded by a filter condition. The index is not selective enough — essentially doing an expensive random-I/O scan of most of the table. Fixes: (1) Create a more selective composite index that includes the filter column. (2) Use a partial index if the filter is on a fixed low-cardinality value, e.g., CREATE INDEX ON orders(created_at) WHERE status = 'pending'. (3) Check if a Bitmap Heap Scan would be cheaper by disabling index scans temporarily.

↑ Back to top

Postgres & pgvector

Summary

Table of Contents

Core Concepts

MVCC & VACUUM

EXPLAIN ANALYZE & Query Planning

Index Types

Table Partitioning

pgvector: HNSW vs IVFFlat

JSONB for Semi-Structured Data

Industry Use Cases

RAG (Retrieval-Augmented Generation) Pipeline

Time-Series Sensor Data with Range Partitioning

Hybrid Search: BM25 + pgvector for Product Discovery

Operational Data Store for dbt Models

Code Examples

Example 1 — pgvector: Table, HNSW Index, and Similarity Search

Example 2 — Hybrid Search with Reciprocal Rank Fusion

Example 3 — Inserting Embeddings via Python (psycopg3)

Example 4 — Range Partitioning & Archival

Comparison / When to Use

Gotchas & Anti-patterns

Exercises

Quiz

Further Reading