PostgreSQL 16+ is the most feature-rich open-source RDBMS and a cornerstone of modern data engineering stacks. Its relevance extends beyond OLTP: JSONB, window functions, table partitioning, Foreign Data Wrappers (FDW), and logical replication make it a versatile operational analytics store. The pgvector extension (v0.7+) adds native vector similarity search, turning Postgres into a combined relational + vector database — critical for Retrieval-Augmented Generation (RAG) workloads. For DE interviews, expect deep questions on MVCC internals, EXPLAIN ANALYZE, index selection, VACUUM behaviour, and pgvector index types (HNSW vs IVFFlat).
This guide focuses on Postgres 16 and pgvector 0.7+. It assumes you know basic SQL; the focus is on DE-level performance tuning, partitioning, and vector search patterns relevant to modern AI-augmented pipelines.
Postgres uses Multi-Version Concurrency Control (MVCC): every row update writes a new version (tuple) with a new xmin transaction ID, while the old version is marked with xmax but not immediately removed. This enables non-blocking reads — readers never wait for writers. However, dead tuples accumulate over time, bloating tables and index pages.
- **autovacuum** reclaims dead tuples in the background; use `pg_repack` for online bloat removal when autovacuum falls behind.
- **ANALYZE** refreshes planner statistics in `pg_statistic`. Run after bulk loads before querying.
- **Transaction ID wraparound:** the `age(datfrozenxid)` metric must stay below ~2 billion. Monitor with `SELECT max(age(datfrozenxid)) FROM pg_database`.
- Signs of VACUUM neglect: high `n_dead_tup` in `pg_stat_user_tables`, `last_autovacuum` timestamps far in the past, table file sizes growing despite stable row counts.
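A quick way to spot these symptoms, sketched against the standard `pg_stat_user_tables` statistics view:

```sql
-- Tables with the most dead tuples and their last (auto)vacuum/analyze times
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A table with a large `n_dead_tup` count and a stale `last_autovacuum` is a prime candidate for a manual `VACUUM` or an autovacuum tuning pass.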
The query planner chooses join strategies, index use, and scan types based on statistics. EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) runs the query and shows actual vs estimated rows, execution time, and buffer hits/misses.
- For index-only scans, look for `Heap Fetches: 0` to confirm the heap was skipped — this requires a recent VACUUM so the visibility map is up to date.
- Key planner settings: `work_mem` (per-sort/hash memory), `effective_cache_size` (hint about the OS page cache), `random_page_cost` (set to 1.1 for SSDs vs the 4.0 default).
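A minimal sketch, assuming a hypothetical `orders` table — run the query under `EXPLAIN (ANALYZE, BUFFERS)` and experiment with SSD-friendly settings at session level:

```sql
-- Compare estimated vs actual rows and buffer hits/reads per plan node
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, SUM(total)
FROM orders
WHERE created_at >= now() - interval '7 days'
GROUP BY customer_id;

-- Session-level experiments (illustrative values)
SET random_page_cost = 1.1;   -- SSDs: random I/O is barely costlier than sequential
SET work_mem = '64MB';        -- larger sorts/hashes stay in memory
```

Large gaps between estimated and actual row counts usually point to stale statistics — rerun `ANALYZE` and compare plans.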
| Index | Use Case | Notes |
|---|---|---|
| B-tree | Equality, range, ORDER BY | Default. Handles =, <, >, BETWEEN, LIKE 'prefix%' |
| Hash | Equality only | Faster than B-tree for pure =; not WAL-logged before PG10 |
| GIN | Array containment, JSONB keys, full-text | Fast lookup, slow write; ideal for @>, ? JSONB operators |
| GiST | Geometric, full-text, range types | Lossy (heap rechecks); used for tsvector, PostGIS |
| BRIN | Monotonic data (timestamps, sequential IDs) | Tiny index stores min/max per block range. Useless for random data. |
| Partial | Index a subset of rows | CREATE INDEX ON orders(status) WHERE status = 'pending' |
| Covering | Enable index-only scans | CREATE INDEX ON orders(cust_id) INCLUDE (total, created_at) |
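As a sketch of the covering-index pattern from the table above (hypothetical `orders` table and column names):

```sql
-- INCLUDE stores extra columns in the leaf pages so the heap is never touched
CREATE INDEX idx_orders_cust_cover
    ON orders (cust_id) INCLUDE (total, created_at);

EXPLAIN (ANALYZE)
SELECT total, created_at
FROM orders
WHERE cust_id = 42;
-- Look for "Index Only Scan ... Heap Fetches: 0" (requires a recent VACUUM)
```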
Postgres supports declarative partitioning (PG10+) with three strategies:

- **RANGE** — partition by continuous values such as timestamps. Queries with `WHERE created_at BETWEEN ...` prune to the relevant partitions only.
- **LIST** — partition by discrete values (`region`, `status`). Useful for archiving specific segments.
- **HASH** — spread rows evenly across a fixed number of partitions when no natural range or list key exists.

Drop old data with `DETACH PARTITION` + `DROP TABLE` — far faster than `DELETE` and avoids bloat. Use `pg_partman` to automate partition creation and retention. Attach/detach without long locks: `ALTER TABLE parent ATTACH PARTITION child FOR VALUES FROM (...) TO (...)`.
The pgvector extension adds a vector(n) type and distance operators: <-> (L2), <#> (negative inner product), <=> (cosine). Two approximate nearest-neighbour (ANN) index types:
| Feature | IVFFlat | HNSW |
|---|---|---|
| Build time | Fast | Slow (10–20× IVFFlat) |
| Build memory | Low | High (full graph in RAM during build) |
| Query speed | Medium | Fast (lower per-query latency) |
| Recall quality | Good at high probes | Excellent at high ef_search |
| Dynamic inserts | Degrades until rebuild | ✅ Handles inserts without full rebuild |
| Best for | Static bulk-loaded datasets | Production RAG with continuous writes |
Key parameters — IVFFlat: lists (centroids; ~sqrt(rows) is a good start), probes (centroids searched at query time). HNSW: m (max connections per layer; default 16), ef_construction (build quality), ef_search (query recall, set at session level with SET hnsw.ef_search = 100).
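For comparison with the HNSW setup shown later, an IVFFlat index might look like this (the `documents` table and the `lists = 100` value are illustrative — tune `lists` toward `sqrt(rows)` for your dataset):

```sql
-- IVFFlat: fast build, low memory; build AFTER bulk-loading the data
CREATE INDEX idx_documents_embedding_ivf
    ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- More probes = better recall, slower queries (session level)
SET ivfflat.probes = 10;
```

Because IVFFlat chooses its centroids at build time, rebuild the index after large ingests to restore recall.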
Always use jsonb (binary decomposed) over json (stored as text, re-parsed on every access). JSONB supports indexing and is 10–50× faster for queries.
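A sketch of the core querying and indexing pattern, assuming a hypothetical `events` table with a `payload jsonb` column:

```sql
-- GIN index accelerates containment (@>) and key-existence (?) operators
CREATE INDEX idx_events_payload ON events USING GIN (payload);

-- Containment query: all purchase events
SELECT id, payload->>'user_id' AS user_id
FROM events
WHERE payload @> '{"type": "purchase"}';
```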
Key `jsonb` operators and patterns:

- Operators: `->` (field as JSON), `->>` (field as text), `#>` (path), `@>` (containment), `?` (key exists).
- GIN indexing: `CREATE INDEX ON events USING GIN (payload)` — accelerates `@>` and `?`.
- Generated columns for hot fields: `ALTER TABLE ev ADD COLUMN ev_type text GENERATED ALWAYS AS (payload->>'type') STORED`.
- Expand JSON arrays into rows: `SELECT * FROM jsonb_to_recordset('[{...}]') AS t(id int, name text)`.

A document Q&A system chunks text into 512-token passages, embeds them with OpenAI text-embedding-3-small (1536 dims), and stores vectors in a `documents` table alongside the original text. At query time, the user's question is embedded, the top-K nearest passages are retrieved via `ORDER BY embedding <=> query_vec LIMIT 5`, and the results are injected as context into the LLM prompt. For datasets under ~10M rows, pgvector eliminates the need for a separate vector database, significantly reducing operational complexity.
An industrial IoT platform ingests 50M sensor readings per day into a measurements table, partitioned by month. New partitions are pre-created by pg_partman. Queries aggregating the last 7 days hit at most 2 partitions. Old partitions are detached and archived to S3 via pg_dump + gzip. This avoids the table bloat an expensive `DELETE FROM ... WHERE timestamp < ...` would cause, and provides sub-second aggregation on recent data with a BRIN index on the timestamp column.
An e-commerce platform combines full-text search (tsvector/tsquery + GIN index) with semantic vector similarity (pgvector cosine) for product search. A Reciprocal Rank Fusion (RRF) query fuses both result lists: `WITH bm25 AS (...), vec AS (...) SELECT ... ORDER BY (1.0/(60+bm25.rank) + 1.0/(60+vec.rank)) DESC`. Hybrid search outperforms either method alone on recall@10 benchmarks by 15–30%.
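The RRF scoring can be sketched in plain Python (a hypothetical helper, using the same k = 60 constant as the SQL version):

```python
def rrf_fuse(lexical_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked ID lists with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document; documents present in
    both lists accumulate both contributions, mirroring the SQL
    FULL OUTER JOIN + COALESCE pattern.
    """
    scores: dict[str, float] = {}
    for ranked in (lexical_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)


# "a" ranks highly in both lists, so it wins; "b" and "d" appear in only one
print(rrf_fuse(["a", "b", "c"], ["a", "c", "d"]))  # → ['a', 'c', 'b', 'd']
```

The constant k = 60 dampens the influence of top ranks so that agreement between the two rankers matters more than either ranker's absolute ordering.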
A dbt project materializes intermediate models into a Postgres analytics schema used by microservices. Logical replication from the OLTP primary to a read replica serves both dbt transformations and BI tools without impacting write performance. Foreign Data Wrappers (postgres_fdw) pull reference data from remote Postgres instances without an ETL pipeline.
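A minimal `postgres_fdw` setup might look like this (server, host, credentials, schema, and table names are all illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Remote Postgres instance holding reference data
CREATE SERVER ref_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'ref-db.internal', dbname 'reference');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER ref_server
    OPTIONS (user 'readonly', password 'secret');

-- Expose selected remote tables as local foreign tables
CREATE SCHEMA IF NOT EXISTS ref;
IMPORT FOREIGN SCHEMA public
    LIMIT TO (currencies, countries)
    FROM SERVER ref_server INTO ref;

-- Queried like any local table; joins against local data work transparently
SELECT * FROM ref.currencies LIMIT 5;
```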
```sql
-- Install the extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Documents table with 1536-dim OpenAI embeddings
CREATE TABLE documents (
    id          bigserial PRIMARY KEY,
    source      text NOT NULL,
    chunk_text  text NOT NULL,
    embedding   vector(1536) NOT NULL,
    created_at  timestamptz DEFAULT now()
);

-- HNSW index for cosine similarity (best for production RAG)
CREATE INDEX idx_documents_embedding_hnsw
    ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Set recall quality at query time (higher = better recall, slower)
SET hnsw.ef_search = 100;

-- Top-5 semantically similar documents via cosine distance
SELECT
    id,
    source,
    chunk_text,
    1 - (embedding <=> '[0.02, -0.15, ...]'::vector) AS cosine_similarity
FROM documents
ORDER BY embedding <=> '[0.02, -0.15, ...]'::vector
LIMIT 5;

-- Full-text search column (generated, stored)
ALTER TABLE documents
    ADD COLUMN ts_content tsvector
    GENERATED ALWAYS AS (to_tsvector('english', chunk_text)) STORED;

CREATE INDEX idx_documents_fts ON documents USING GIN (ts_content);

-- Reciprocal Rank Fusion (RRF) hybrid query
SET hnsw.ef_search = 100;

WITH
bm25_ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY ts_rank_cd(ts_content, q) DESC) AS bm25_rank
    FROM documents, to_tsquery('english', 'data & engineering') q
    WHERE ts_content @@ q
    ORDER BY bm25_rank          -- without this, LIMIT keeps arbitrary rows
    LIMIT 60
),
vec_ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY embedding <=> '[0.02,-0.15,...]'::vector) AS vec_rank
    FROM documents
    ORDER BY embedding <=> '[0.02,-0.15,...]'::vector
    LIMIT 60
),
rrf AS (
    SELECT
        COALESCE(b.id, v.id) AS id,
        COALESCE(1.0 / (60 + b.bm25_rank), 0) +
        COALESCE(1.0 / (60 + v.vec_rank), 0) AS rrf_score
    FROM bm25_ranked b
    FULL OUTER JOIN vec_ranked v ON b.id = v.id
)
SELECT d.source, d.chunk_text, r.rrf_score
FROM rrf r
JOIN documents d ON d.id = r.id
ORDER BY rrf_score DESC
LIMIT 5;
```
```python
import numpy as np
import psycopg
from openai import OpenAI
from pgvector.psycopg import register_vector

client = OpenAI()


def embed(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return resp.data[0].embedding


with psycopg.connect("dbname=mydb") as conn:
    register_vector(conn)  # Enables native vector type (de)serialization

    chunks = [
        ("doc_1.pdf", "Delta Lake provides ACID transactions on S3."),
        ("doc_1.pdf", "Photon is a vectorized C++ execution engine in Databricks."),
        ("doc_2.pdf", "pgvector supports HNSW and IVFFlat index types."),
    ]

    with conn.cursor() as cur:
        for source, text in chunks:
            vec = embed(text)
            cur.execute(
                """
                INSERT INTO documents (source, chunk_text, embedding)
                VALUES (%s, %s, %s)
                ON CONFLICT DO NOTHING
                """,
                (source, text, np.array(vec)),  # np.array adapts to the vector type
            )
    conn.commit()
```
```sql
-- Parent partitioned table
CREATE TABLE measurements (
    sensor_id   text NOT NULL,
    measured_at timestamptz NOT NULL,
    value       double precision NOT NULL
) PARTITION BY RANGE (measured_at);

-- Monthly partitions (pg_partman automates this in production)
CREATE TABLE measurements_2026_04
    PARTITION OF measurements
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

-- BRIN index: tiny footprint for monotonic timestamps
CREATE INDEX idx_m_2026_04_brin
    ON measurements_2026_04 USING brin (measured_at);

-- Query prunes to 1 partition automatically
EXPLAIN (ANALYZE, BUFFERS)
SELECT AVG(value)
FROM measurements
WHERE measured_at >= '2026-04-01'
  AND measured_at < '2026-04-08';

-- Detach old partition for archival (non-blocking in PG14+)
ALTER TABLE measurements
    DETACH PARTITION measurements_2026_03 CONCURRENTLY;

-- pg_dump measurements_2026_03 | gzip | aws s3 cp - s3://archive/...
DROP TABLE measurements_2026_03;
```
| Dimension | Postgres + pgvector | Pinecone | Weaviate | Milvus |
|---|---|---|---|---|
| Max vectors (practical) | ~10M (single node) | Billions (managed) | Hundreds of millions | Billions (distributed) |
| Relational joins | ✅ Full SQL | ❌ | Partial | ❌ |
| Hybrid search (filter + vector) | ✅ Native SQL WHERE | ✅ Metadata filters | ✅ GraphQL filters | ✅ Scalar filters |
| ACID transactions | ✅ | ❌ | Partial | ✅ |
| Operational complexity | Low (existing Postgres) | None (SaaS) | Medium | High |
| Cost | Infra only | High ($0.096/1M vectors/mo+) | Medium | Infra only |
| Best for | ≤10M vectors, RAG + relational joins, existing Postgres stack | Large scale, zero-ops | Knowledge graph + semantic search | Very large scale, ML teams |
Rule of thumb: Start with pgvector if you already run Postgres and your vector count is under 5–10M. The operational simplicity of one database for relational + vector data often outweighs the query-speed advantage of dedicated vector DBs at small-to-medium scale.
Common gotchas:

- **Stale statistics.** Run `ANALYZE table_name` (or `VACUUM ANALYZE`) after bulk loads and before querying.
- **`NOT IN` with NULLs.** `WHERE id NOT IN (SELECT id FROM other)` returns nothing if any subquery row is NULL — because `x != NULL` is always NULL. Use `NOT EXISTS` or add `WHERE id IS NOT NULL` inside the subquery. This is one of the most common SQL correctness bugs in DE interviews.
- **Uncast vector literals.** If an embedding is passed as a plain array (`ARRAY[...]` instead of `'[...]'::vector`), Postgres won't use the vector index. Always cast to `::vector` explicitly in queries, or use the pgvector Python library's native type adapter via `register_vector(conn)`.
- **Storing embeddings as `real[]` instead of `vector`.** A plain array column prevents pgvector index creation and disables SIMD-accelerated distance computation. Always use the `vector(n)` type from the pgvector extension.
- **Exhausting `max_connections` under load.** Postgres creates one OS process per connection, and the default `max_connections` is 100. Without connection pooling (PgBouncer in transaction mode), a typical web app with many threads will exhaust connections or saturate the server. A practical heuristic: `max_connections ≈ (4 × CPU cores) + disk count`; use PgBouncer for everything above that.

Hands-on exercises:

1. Run a query under `EXPLAIN (ANALYZE, BUFFERS)` with `SET enable_seqscan = off` vs `on`. Compare outputs. Then insert 500K rows without running `ANALYZE` and observe how estimated rows diverge from actual rows. Run `ANALYZE` and confirm the planner improves. Document the planner's node changes.
2. Build a table with `vector(384)` embeddings using `sentence-transformers/all-MiniLM-L6-v2` from Python. Load 500 text chunks. Create an HNSW index. Implement `query_similar(question: str, k: int) -> list[str]`. Measure recall vs an exact scan (disable the index with `SET enable_indexscan = off`). Tune `ef_search` to find the sweet spot between recall and latency.
3. Create a `log_events` table partitioned by month. Write a Python script to create the next 3 monthly partitions, insert 10,000 rows into each, then detach and drop the oldest partition. Confirm queries against the parent table still scan only the relevant partitions using `EXPLAIN` and the `Subplans Removed` field.

**Q: Your Index Only Scan shows `Heap Fetches > 0` — why?** Index Only Scans avoid heap access only for pages marked as "all-visible" in the visibility map (VM). After inserts, the VM marks pages as not-all-visible until VACUUM processes them. For those pages, Postgres must fetch the heap to check row visibility — resulting in `Heap Fetches > 0`. Fix: run `VACUUM table_name` to update the visibility map. After a cold re-run, `Heap Fetches` should approach 0.
**Q: What is transaction ID wraparound and how do you prevent it?** Postgres uses 32-bit transaction IDs (XIDs). After ~2.1 billion transactions, XIDs wrap around to zero. Without prevention, Postgres would consider all old rows as "future transactions" — making them invisible (effectively data loss). Prevention: Postgres freezes old tuples (replaces their XID with FrozenXID) during aggressive VACUUM. Monitor `SELECT max(age(datfrozenxid)) FROM pg_database`. When the age approaches 2 billion, Postgres enters read-only emergency mode. Autovacuum with a proper `autovacuum_freeze_max_age` prevents this.
**Q: How do IVFFlat and HNSW work, and when do you choose each?** IVFFlat clusters vectors into centroids at build time (fast build, small memory, degrades with inserts). At query time it probes N clusters — higher `probes` = better recall but slower. HNSW builds a hierarchical proximity graph (slow build, high memory) but achieves excellent recall at lower query latency and handles dynamic inserts gracefully. Choose IVFFlat for static, bulk-loaded datasets. Choose HNSW for production RAG where documents are continuously added and query recall/latency matter most.
**Q: A `NOT IN (subquery)` filter suddenly returns zero rows — why?** The subquery almost certainly returns at least one NULL. In SQL, `x NOT IN (1, 2, NULL)` evaluates to NULL (unknown) for every value of x, because `x != NULL` is always NULL. The fix: rewrite as `NOT EXISTS (SELECT 1 FROM ... WHERE id = outer.id)` or add `WHERE id IS NOT NULL` inside the subquery. This is one of the most common SQL correctness bugs.
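A NULL-safe rewrite, sketched with hypothetical `orders`/`refunds` tables:

```sql
-- BUGGY: returns zero rows if refunds.order_id contains any NULL
SELECT o.id FROM orders o
WHERE o.id NOT IN (SELECT r.order_id FROM refunds r);

-- SAFE: NOT EXISTS treats the NULL rows as simple non-matches
SELECT o.id FROM orders o
WHERE NOT EXISTS (
    SELECT 1 FROM refunds r WHERE r.order_id = o.id
);
```

`NOT EXISTS` also tends to plan as an efficient anti-join, so it is the preferred form even when NULLs are impossible.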
**Q: EXPLAIN shows an Index Scan that returned ~950,000 rows, most of which a filter then discarded — what's wrong?** The index returned ~950,000 rows (poor selectivity), which were then re-checked and discarded by a filter condition. The index is not selective enough — essentially doing an expensive random-I/O scan of most of the table. Fixes: (1) Create a more selective composite index that includes the filter column. (2) Use a partial index if the filter is on a fixed low-cardinality value, e.g., `CREATE INDEX ON orders(created_at) WHERE status = 'pending'`. (3) Check whether a Bitmap Heap Scan would be cheaper by disabling index scans temporarily.