This guide covers the foundational data modeling and architecture patterns that underpin every data engineering interview: Kimball dimensional modeling, Data Vault 2.0, and modern lakehouse patterns. These are not tied to a single tool — they inform decisions you make in Spark, dbt, Snowflake, Databricks, or any warehouse. Interviewers test whether you can choose the right model for a given workload, articulate trade-offs, and avoid common anti-patterns. Assume familiarity with SQL and basic relational concepts; we focus on DE-specific design patterns, not introductory normalization.
Ralph Kimball's methodology organizes analytical data into star schemas: a central fact table (measures + foreign keys) surrounded by dimension tables (descriptive attributes). The goals are query simplicity, predictable performance, and business-user accessibility.
Fact tables record business events at a specific grain — the most atomic level of detail (e.g., one row per order line item). Types of facts:
- Transaction facts: one row per discrete business event (e.g., `fact_order_lines`). Additive measures (quantity, revenue) can be summed across any dimension. A DDL sketch follows below.
- Periodic snapshot facts: one row per entity per period (e.g., `fact_account_monthly_balance`). Semi-additive: balances cannot be summed across time, only across other dimensions.
- Accumulating snapshot facts: one row per process instance, updated as milestones complete (e.g., `fact_loan_application` with columns: `submitted_date`, `approved_date`, `funded_date`). Contains multiple date foreign keys.
- Factless facts: rows that record an event with no numeric measure (e.g., `fact_student_attendance` recording that student X attended class Y on date Z). Useful for coverage analysis.

Dimensions provide the "who, what, where, when, why" context for facts. They are typically wide (50-100+ columns), denormalized, and change slowly. The date dimension is universal — a pre-built table with fiscal calendars, holidays, and day-of-week flags.
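To make grain and key wiring concrete, here is a minimal DDL sketch of a transaction fact and its date dimension; all table and column names are illustrative rather than taken from any specific system.

```sql
-- Hypothetical star schema fragment: one row per order line item
CREATE TABLE dim_date (
    date_sk       INT,            -- surrogate key, e.g. 20240615
    calendar_date DATE,
    day_of_week   STRING,
    fiscal_period STRING,
    is_holiday    BOOLEAN
);

CREATE TABLE fact_order_lines (
    order_date_sk INT,            -- FK -> dim_date
    product_sk    BIGINT,         -- FK -> dim_product
    customer_sk   BIGINT,         -- FK -> dim_customer
    order_id      STRING,         -- degenerate dimension (no dim table)
    quantity      INT,            -- fully additive measure
    revenue       DECIMAL(18, 2)  -- fully additive measure
);

-- Declaring the grain first ("one row per order line item") dictates the
-- rest: every measure must be true at that grain, and every foreign key
-- points at a dimension that describes the event.
```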
Slowly changing dimension (SCD) strategies determine how attribute changes are recorded:

- SCD Type 1: overwrite the attribute in place. No history is preserved; appropriate when old values have no analytical value (e.g., fixing a typo).
- SCD Type 2: add a new row per change, tracked with `effective_date`, `end_date`, and an `is_current` flag. Full history preserved. Most common for dimensions where history matters (customer address, product category). An as-of query sketch follows below.
- SCD Type 3: add a column per tracked attribute (e.g., `current_city`, `previous_city`). Limited history (only one prior value). Rarely used alone.

A star schema has fully denormalized dimensions — one join from fact to dimension. A snowflake schema normalizes dimensions into sub-tables (e.g., `dim_product` → `dim_category` → `dim_department`). Kimball strongly prefers star schemas: fewer joins, simpler queries, better BI tool compatibility. Snowflake schemas save storage but complicate queries and rarely improve performance with modern columnar engines.
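Interviewers often follow up with how to query a Type 2 dimension historically. A hedged sketch of the current-state and point-in-time join patterns, with hypothetical table and column names:

```sql
-- Current-state rollup: keep only the live dimension row per business key
SELECT d.category, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product d ON f.product_sk = d.product_sk
WHERE d.is_current = true
GROUP BY d.category;

-- Point-in-time rollup: pick the dimension version whose validity window
-- contained the fact's order date (used when facts store business keys)
SELECT d.category, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product d
  ON f.product_bk = d.product_bk
 AND f.order_date >= d.effective_date
 AND f.order_date <  d.end_date
GROUP BY d.category;
```

In a strict Kimball design the fact row already stores the surrogate key of the dimension version in effect at load time, so the plain surrogate-key join is itself point-in-time; the range join is the fallback when only business keys are available.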
Data Vault is a modeling methodology for enterprise data warehouses designed to handle change, auditability, and parallel loading. It separates structure (business keys), relationships, and context into three entity types:
A Hub represents a core business concept identified by its business key (natural key). Columns: hash key (surrogate), business key, load timestamp, record source. Hubs never change — once a business key is loaded, it persists forever. Example: hub_customer with customer_id as the business key.
A Link records a relationship between two or more Hubs. Columns: hash key, foreign hash keys to each hub, load timestamp, record source. Links are insert-only. They model many-to-many relationships naturally. Example: link_order connecting hub_customer and hub_product.
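The loading examples later in this guide cover hubs and satellites; a link load follows the same insert-only, hash-key pattern. A sketch with hypothetical staging and column names:

```sql
-- Insert-only link load (illustrative names): one row per distinct
-- customer/product relationship observed in staging
INSERT INTO raw_vault.link_order
    (link_order_hk, hub_customer_hk, hub_product_hk, load_ts, record_source)
SELECT DISTINCT
    SHA2(CONCAT_WS('||',
        UPPER(TRIM(stg.customer_id)),
        UPPER(TRIM(stg.product_id))), 256)   AS link_order_hk,
    SHA2(UPPER(TRIM(stg.customer_id)), 256)  AS hub_customer_hk,
    SHA2(UPPER(TRIM(stg.product_id)), 256)   AS hub_product_hk,
    CURRENT_TIMESTAMP()                      AS load_ts,
    'order_system'                           AS record_source
FROM staging.stg_orders stg
LEFT JOIN raw_vault.link_order lnk
    ON lnk.link_order_hk = SHA2(CONCAT_WS('||',
        UPPER(TRIM(stg.customer_id)),
        UPPER(TRIM(stg.product_id))), 256)
WHERE lnk.link_order_hk IS NULL;  -- only never-seen relationships
```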
A Satellite stores descriptive attributes and their history for a specific Hub or Link. Columns: parent hash key, load timestamp, end timestamp (or hashdiff for change detection), attributes, record source. Satellites capture every change — each attribute change creates a new row. A single Hub can have multiple Satellites (e.g., sat_customer_personal, sat_customer_demographics) sourced from different systems.
Data Vault excels at: (1) Agility — adding a new source only requires new Hubs/Links/Satellites without altering existing structures. (2) Parallel loading — Hubs, Links, and Satellites load independently. (3) Full auditability — every record tracks its source and load time. (4) Handling multiple sources of truth — satellites from different systems coexist without conflict. The downside: querying Data Vault raw tables is complex (many joins), so a Kimball-style star schema is typically built on top as a presentation/business layer.
The medallion architecture (popularized by Databricks) organizes lakehouse data into three quality tiers:
- Bronze (raw): an exact copy of source data plus ingestion metadata (e.g., `_ingestion_ts`, `_source_file`). Schema-on-read, append-only. Purpose: full data lineage and replayability. Sketched below.
- Silver (cleaned): typed, deduplicated, validated, and conformed data with business keys resolved. Purpose: a trustworthy integration layer for engineers and analysts.
- Gold (business): star schemas, aggregates, and KPI tables, denormalized for consumption. Purpose: fast, curated data for BI dashboards and ML.

Key principle: each layer is a Delta/Iceberg table — fully queryable, version-controlled, and independently updatable. If silver logic changes, you replay from bronze without re-ingesting from source.
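A minimal sketch of the bronze contract (Spark SQL; paths and table names are hypothetical): land the payload untouched and attach lineage columns.

```sql
-- Append-only bronze load: store rows as-is, plus ingestion metadata
INSERT INTO bronze.orders_raw
SELECT
    *,
    current_timestamp() AS _ingestion_ts,  -- when the row landed
    input_file_name()   AS _source_file    -- which file it came from
FROM json.`s3://landing/orders/`;          -- schema-on-read over raw JSON
```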
Kimball dimensional modeling remains the gold standard for the gold layer. In a lakehouse, the star schema is implemented as Delta/Iceberg tables. SCD Type 2 uses Delta MERGE operations. Date dimensions are static reference tables. The change: in a lakehouse, you keep the raw and intermediate layers (bronze/silver) alongside the gold star schema — something a traditional Kimball-only warehouse didn't do. This gives you both analytical performance (star schema) and data lineage (medallion layers).
Open table formats bring ACID transactions, time travel, and schema evolution to files on data lakes:
- Delta Lake: a transaction log (`_delta_log`) layered over Parquet; the deepest integration with Spark and Databricks; `MERGE`, time travel, and change data feed.
- Apache Iceberg: hidden partitioning (partition transforms like `month(ts)` are metadata — queries don't need to know partition layout). Industry momentum: adopted by AWS, Snowflake, and several major vendors.
- Apache Hudi: built around record-level upserts and incremental pulls; a strong fit for streaming ingestion.

Convergence: Delta's UniForm writes Iceberg-compatible metadata alongside Delta logs, enabling cross-engine reads. All three formats are converging toward similar feature sets. Choose based on your engine ecosystem.
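On the convergence point, enabling UniForm is a table-property change. A sketch using the property names from Databricks documentation at the time of writing; verify against your runtime version:

```sql
-- Let Iceberg-compatible engines read an existing Delta table
ALTER TABLE gold.dim_customer SET TBLPROPERTIES (
    'delta.enableIcebergCompatV2'          = 'true',
    'delta.universalFormat.enabledFormats' = 'iceberg'
);
```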
Data quality is a first-class pillar of data engineering — not an afterthought. Two patterns dominate, and both appear in the code examples later in this guide: declarative schema tests colocated with dbt models (unique, not_null, accepted_values, expression checks) and expectation suites run as pipeline checkpoints in Great Expectations, failing the job when validations break.
Change Data Capture is essential for keeping lakehouse tables in sync with operational databases without full table scans. Key patterns:
- Log-based CDC: read the database's transaction log (e.g., Debezium tailing the PostgreSQL WAL or MySQL binlog) and emit every insert, update, and delete as an event, with near-zero load on the source.
- Query-based CDC: poll with `WHERE updated_at > last_watermark`. Simpler to set up but misses deletes, cannot capture intermediate changes between polls, and requires a reliable watermark column. Suitable when log-based CDC is unavailable (e.g., legacy systems without accessible logs).
- Applying changes downstream: use Delta's `MERGE` or Iceberg's `MERGE INTO` to apply CDC events: INSERT new rows, UPDATE changed rows, DELETE removed rows. For SCD Type 2, the MERGE expires old rows and inserts new versions. Key challenge: out-of-order events — in distributed systems, a delete event may arrive before the corresponding insert. Use watermarking and transaction ordering to handle this.
- Streaming end to end: Debezium → Kafka → Spark Structured Streaming with `foreachBatch` and `MERGE`. This is the most common real-time CDC pipeline in modern data engineering.

The semantic layer sits between raw warehouse tables and BI consumers, providing a business-friendly abstraction over physical data structures. Key concepts:
- Metrics as code: each metric is defined once, declaratively, in version-controlled YAML (e.g., `metric: revenue, type: simple, measure: sum(amount), filter: WHERE status='completed'`). Consumers query metrics via API — the semantic layer generates correct SQL regardless of which BI tool is asking. This is the direction dbt Labs is pushing for the "Single Source of Truth" pattern.

A mid-size retailer builds a star schema with `fact_sales` (grain: one row per POS transaction line item) surrounded by `dim_product` (SCD Type 2 for category changes), `dim_store`, `dim_date`, and `dim_customer` (SCD Type 2 for loyalty tier changes). Gold-layer tables live in Delta Lake. Analysts write simple SQL: `SELECT dim_product.category, SUM(fact_sales.revenue) ... GROUP BY 1`. BI dashboards (Tableau, Power BI) connect directly to the star schema with predictable performance — no multi-hop joins.
An insurance company integrates data from 12 source systems (policy admin, claims, billing, CRM, external rating agencies). The raw vault models core entities as Hubs (hub_policy, hub_claimant, hub_provider), relationships as Links (link_policy_claimant, link_claim_provider), and source-specific attributes as Satellites. Parallel ELT loads from each source without contention. A business vault layer adds computed Satellites (e.g., sat_policy_risk_score). Presentation layer: Kimball star schemas for actuarial reporting and claims analytics.
A logistics company builds a lakehouse on AWS with Apache Iceberg tables. Bronze: Flink jobs consume GPS events from Kafka and append to bronze_vehicle_events (Iceberg, partitioned by hour(event_ts) with hidden partitioning). Silver: Spark Structured Streaming deduplicates, enriches with geofence data, and writes silver_vehicle_positions. Gold: hourly batch jobs build gold_delivery_metrics (avg delivery time by region, on-time rate). Trino connects to the same Iceberg tables for ad-hoc analyst queries — no data movement between engines.
A SaaS company needs near-real-time analytics on their PostgreSQL OLTP database without impacting production performance. Debezium reads the PostgreSQL WAL and publishes change events to Kafka topics (one per table). PySpark Structured Streaming reads these topics, parses the Debezium JSON envelope (extracting before/after images and op type), and applies changes to Delta Lake tables using foreachBatch with MERGE. Inserts and updates become UPSERT operations; deletes set a soft-delete flag. The silver layer resolves foreign keys and builds conformed entities. Gold-layer star schemas refresh every 15 minutes, giving analysts near-real-time dashboards with only a 15-minute lag from OLTP — down from the previous 24-hour batch ETL lag.
A fintech company with 50+ dashboards discovers that "monthly active users (MAU)" is calculated differently across teams — engineering counts API callers, marketing counts login events, finance counts paying users. They implement a dbt Semantic Layer: a single YAML definition specifies metric: mau with the canonical calculation (unique users with at least one qualifying event per month). All BI tools (Tableau, Hex, internal APIs) query the semantic layer API instead of writing raw SQL. Within 3 months, metric discrepancies are eliminated, finance reporting time drops by 40%, and the "whose number is right" debates disappear.
from delta.tables import DeltaTable
from pyspark.sql import functions as F
# SCD Type 2: dim_customer with effective_date / end_date / is_current
target = DeltaTable.forName(spark, "gold.dim_customer")
# Source: new/changed customer records from silver layer
source = spark.table("silver.customers_latest")
# Merge matches on the business key's current row; the change condition
# below ensures we only expire rows whose attributes actually differ
merge_condition = """
t.customer_bk = s.customer_bk
AND t.is_current = true
"""
change_condition = """
s.name != t.name
OR s.email != t.email
OR s.city != t.city
"""
# Step 1: Expire old rows (set end_date, is_current = false)
target.alias("t").merge(
source.alias("s"), merge_condition
).whenMatchedUpdate(
condition=change_condition,
set={
"end_date": F.current_date(),
"is_current": F.lit(False),
}
).execute()
# Step 2: Insert new current rows for changed + new records
current = spark.table("gold.dim_customer").where("is_current = true")
new_rows = source.alias("s").join(
    current.alias("c"),
    # eqNullSafe mirrors the <=> change detection above for the anti-join
    (F.col("s.customer_bk") == F.col("c.customer_bk"))
    & F.col("s.name").eqNullSafe(F.col("c.name"))
    & F.col("s.email").eqNullSafe(F.col("c.email"))
    & F.col("s.city").eqNullSafe(F.col("c.city")),
"left_anti" # Rows that changed or are brand-new
).select(
    # NOTE: monotonically_increasing_id() is only unique within this write;
    # production SCD2 pipelines use an identity column or a key-lookup table
    F.monotonically_increasing_id().alias("customer_sk"),
F.col("s.customer_bk"),
F.col("s.name"), F.col("s.email"), F.col("s.city"),
F.current_date().alias("effective_date"),
F.lit("9999-12-31").cast("date").alias("end_date"),
F.lit(True).alias("is_current"),
)
new_rows.write.format("delta").mode("append").saveAsTable("gold.dim_customer")
-- Load hub_customer from staging
INSERT INTO raw_vault.hub_customer (hub_customer_hk, customer_bk, load_ts, record_source)
SELECT
SHA2(UPPER(TRIM(stg.customer_id)), 256) AS hub_customer_hk,
UPPER(TRIM(stg.customer_id)) AS customer_bk,
CURRENT_TIMESTAMP() AS load_ts,
'crm_system' AS record_source
FROM staging.stg_crm_customers stg
LEFT JOIN raw_vault.hub_customer hub
ON hub.customer_bk = UPPER(TRIM(stg.customer_id))
WHERE hub.hub_customer_hk IS NULL; -- Only insert new business keys
-- Load satellite with change detection via hashdiff
INSERT INTO raw_vault.sat_customer_personal
(hub_customer_hk, load_ts, hashdiff, name, email, phone, record_source)
SELECT
SHA2(UPPER(TRIM(stg.customer_id)), 256) AS hub_customer_hk,
CURRENT_TIMESTAMP() AS load_ts,
SHA2(CONCAT_WS('||',
COALESCE(stg.name, ''),
COALESCE(stg.email, ''),
COALESCE(stg.phone, '')
), 256) AS hashdiff,
stg.name, stg.email, stg.phone,
'crm_system' AS record_source
FROM staging.stg_crm_customers stg
INNER JOIN raw_vault.hub_customer hub
ON hub.customer_bk = UPPER(TRIM(stg.customer_id))
LEFT JOIN (
-- Get the latest satellite record per hub key
SELECT *, ROW_NUMBER() OVER (
PARTITION BY hub_customer_hk ORDER BY load_ts DESC
) AS rn
FROM raw_vault.sat_customer_personal
) latest ON latest.hub_customer_hk = hub.hub_customer_hk AND latest.rn = 1
WHERE latest.hashdiff IS NULL -- Brand-new
OR latest.hashdiff != SHA2(CONCAT_WS('||',
COALESCE(stg.name, ''),
COALESCE(stg.email, ''),
COALESCE(stg.phone, '')
), 256); -- Attributes changed
-- Create an Iceberg table with hidden (transform-based) partitioning
CREATE TABLE lakehouse.silver.events (
event_id STRING,
user_id STRING,
event_type STRING,
payload STRING,
event_ts TIMESTAMP
)
USING ICEBERG
PARTITIONED BY (days(event_ts), bucket(16, user_id));
-- Queries filter on event_ts directly; Iceberg prunes partitions
-- automatically — no need to reference partition columns
SELECT event_type, COUNT(*) AS cnt
FROM lakehouse.silver.events
WHERE event_ts BETWEEN '2024-06-01' AND '2024-06-30'
GROUP BY 1;
-- Time travel: query a specific snapshot
SELECT * FROM lakehouse.silver.events
TIMESTAMP AS OF '2024-06-15T10:00:00';
-- Partition evolution: change strategy without rewriting data
ALTER TABLE lakehouse.silver.events
REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts);
-- models/gold/dim_product.sql
-- dbt model: conformed product dimension (SCD Type 1)
{{ config(
    materialized='table',
    tags=['gold', 'dimensional'],
) }}
WITH source AS (
SELECT *
FROM {{ ref('stg_products') }}
),
enriched AS (
SELECT
product_id,
product_name,
category,
subcategory,
brand,
unit_cost,
unit_price,
unit_price - unit_cost AS unit_margin,
CASE WHEN is_active THEN 'Active'
ELSE 'Discontinued'
END AS product_status,
CURRENT_TIMESTAMP() AS _loaded_at
FROM source
)
SELECT * FROM enriched
# models/gold/dim_product.yml — dbt schema tests
version: 2
models:
- name: dim_product
description: "Conformed product dimension"
columns:
- name: product_id
tests:
- unique
- not_null
- name: category
tests:
- not_null
- accepted_values:
values: ['Electronics', 'Clothing', 'Home', 'Food', 'Sports']
- name: unit_price
tests:
- not_null
- dbt_utils.expression_is_true:
expression: ">= 0"
import great_expectations as gx
# Initialize context and connect to Delta table
context = gx.get_context()
datasource = context.sources.add_spark("lakehouse")
data_asset = datasource.add_dataframe_asset(
name="silver_orders",
dataframe=spark.table("silver.orders"),
)
batch_request = data_asset.build_batch_request()
# Define expectations
suite = context.add_or_update_expectation_suite("silver_orders_suite")
validator = context.get_validator(
batch_request=batch_request,
expectation_suite_name="silver_orders_suite",
)
# Column-level checks
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0)
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_match_regex(
"email", regex=r"^[\w.+-]+@[\w-]+\.[\w.]+$"
)
# Table-level checks
validator.expect_table_row_count_to_be_between(
min_value=1000, max_value=10_000_000
)
# Run checkpoint and fail pipeline on violations
checkpoint = context.add_or_update_checkpoint(
name="silver_orders_checkpoint",
validations=[{"batch_request": batch_request,
"expectation_suite_name": "silver_orders_suite"}],
)
result = checkpoint.run()
if not result.success:
raise RuntimeError("Data quality check failed — see GX report")
from pyspark.sql import functions as F
from pyspark.sql import Window  # used for per-key dedup inside foreachBatch
from delta.tables import DeltaTable
# Read Debezium CDC events from Kafka
cdc_stream = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "broker:9092") \
.option("subscribe", "dbserver.public.customers") \
.option("startingOffsets", "earliest") \
.load()
# Parse the Debezium JSON envelope
parsed = cdc_stream.select(
F.from_json(
F.col("value").cast("string"),
"payload STRUCT<op:STRING, before:STRUCT<customer_id:STRING, name:STRING, email:STRING>, after:STRUCT<customer_id:STRING, name:STRING, email:STRING>, ts_ms:LONG>"
).alias("cdc")
).select(
F.col("cdc.payload.op").alias("operation"), # c=create, u=update, d=delete
F.col("cdc.payload.after.customer_id").alias("customer_id"),
F.col("cdc.payload.after.name").alias("name"),
F.col("cdc.payload.after.email").alias("email"),
F.col("cdc.payload.ts_ms").alias("cdc_timestamp"),
)
# Apply CDC events to Delta Lake using foreachBatch + MERGE
def apply_cdc_to_delta(batch_df, batch_id):
if batch_df.isEmpty():
return
# Deduplicate: keep last event per key within micro-batch
deduped = batch_df.withColumn(
"rn",
F.row_number().over(
Window.partitionBy("customer_id").orderBy(F.desc("cdc_timestamp"))
)
).where("rn = 1").drop("rn")
target = DeltaTable.forName(spark, "silver.customers")
target.alias("t").merge(
deduped.alias("s"),
"t.customer_id = s.customer_id"
).whenMatchedUpdate(
condition="s.operation = 'u'",
set={"name": "s.name", "email": "s.email",
"_updated_at": F.current_timestamp()}
).whenMatchedDelete(
condition="s.operation = 'd'"
).whenNotMatchedInsert(
condition="s.operation != 'd'",
values={"customer_id": "s.customer_id", "name": "s.name",
"email": "s.email", "_updated_at": F.current_timestamp()}
).execute()
# Write stream with foreachBatch
query = parsed.writeStream \
.foreachBatch(apply_cdc_to_delta) \
.option("checkpointLocation", "s3a://checkpoints/cdc/customers/") \
.trigger(availableNow=True) \
.start()
query.awaitTermination()
# models/metrics/revenue.yml — dbt Semantic Layer metric definition
semantic_models:
- name: orders
description: "Order fact table for revenue metrics"
model: ref('fact_orders')
defaults:
agg_time_dimension: order_date
entities:
- name: order_id
type: primary
- name: customer_id
type: foreign
measures:
- name: total_revenue
expr: amount
agg: sum
description: "Sum of order amounts (completed orders only)"
- name: order_count
expr: "1"
agg: sum
dimensions:
- name: order_date
type: time
type_params:
time_granularity: day
- name: order_status
type: categorical
metrics:
- name: revenue
type: simple
label: "Revenue"
description: "Total revenue from completed orders"
    type_params:
      measure: total_revenue
    # filter lives at the metric level (not inside type_params); dimensions
    # are referenced with their primary-entity prefix
    filter: |
      {{ Dimension('order_id__order_status') }} = 'completed'
- name: revenue_per_customer
type: derived
label: "Revenue per Customer"
type_params:
expr: revenue / unique_customers
metrics:
- name: revenue
- name: unique_customers
| Aspect | Kimball Star Schema | Data Vault 2.0 | One Big Table (OBT) | 3NF / Normalized |
|---|---|---|---|---|
| Primary use | Analytical / BI consumption | Enterprise raw integration layer | Single-use-case BI explore | Operational / OLTP systems |
| Schema design | Denormalized star (fact + dims) | Hub / Link / Satellite | Fully flattened wide table | Fully normalized (3NF+) |
| Query complexity | Low (few joins) | High (many joins from Hubs → Links → Sats) | Minimal (no joins) | High (many joins) |
| Change handling | SCD Types 1-6 | Insert-only satellites (full history) | Rebuild/overwrite | In-place updates |
| Source agility | Moderate (schema changes require redesign) | High (add Hub/Link/Sat independently) | Low (must add column to wide table) | Moderate |
| Storage efficiency | Moderate (denormalized dims) | Higher (normalized, but many tables) | Low (massive duplication) | High (no redundancy) |
| BI tool compatibility | Excellent (designed for it) | Poor (needs presentation layer) | Good (single table) | Poor (too many joins) |
| Best for | Gold layer, dashboards, ad-hoc SQL | Silver/raw vault, regulated industries, multi-source integration | Single Looker explore, denormalized export | Source systems, OLTP databases |
In practice, these are complementary, not competing. A mature lakehouse often uses 3NF at the source, Data Vault in the silver layer for integration and auditability, and Kimball star schemas in the gold layer for consumption. OBT is a tactical choice for specific performance-critical views, not a general architecture.
Anti-patterns to avoid:

- Non-conformed dimensions: when the marketing team's `dim_customer` has different keys or attributes than the finance team's, cross-domain queries (e.g., "revenue by customer segment") require complex reconciliation. Define conformed dimensions centrally — even if each team adds their own attributes, the business key and core attributes must be consistent.
- Defaulting to query-based CDC: polling with `WHERE updated_at > watermark` misses deletes and cannot capture intra-batch changes. For accurate synchronization with OLTP sources, invest in log-based CDC (Debezium, AWS DMS). The operational overhead pays for itself in data correctness and reduced source system load.

Exercise: given a normalized OLTP schema with tables `orders`, `order_items`, `customers`, `products`, `categories`, and `payments`, design: (1) a fact table with a grain declaration, (2) at least 4 dimension tables (including date), (3) SCD type decisions for each dimension (justify Type 1 vs. 2). Write the DDL in SQL. Then write 3 sample analytical queries that demonstrate why a star schema is simpler than querying the normalized source directly.

The three fact table types, recapped (the semi-additive case is sketched in SQL below):

- Transaction fact: one row per discrete event. Example: `fact_order_lines` — one row per line item on a purchase order. Measures are fully additive (quantity, revenue).
- Periodic snapshot: one row per entity per time period. Example: `fact_account_daily_balance` — one row per bank account per day. Balances are semi-additive: you can sum across accounts but not across days (averaging is correct for time).
- Accumulating snapshot: one row per process instance, updated as milestones occur. Example: `fact_insurance_claim` with columns for `filed_date`, `assessed_date`, `approved_date`, `paid_date`. Multiple date foreign keys track the lifecycle. These rows are updated (not appended) as events occur.
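Why semi-additivity matters in practice, as a short SQL sketch (illustrative names):

```sql
-- Valid: summing balances across accounts for a single snapshot date
SELECT snapshot_date, SUM(balance) AS total_balance
FROM fact_account_daily_balance
WHERE snapshot_date = DATE '2024-06-30'
GROUP BY snapshot_date;

-- Summing the same balance across days double-counts the same money;
-- average (or period-end) is the correct aggregation over time
SELECT account_id, AVG(balance) AS avg_daily_balance
FROM fact_account_daily_balance
WHERE snapshot_date BETWEEN DATE '2024-06-01' AND DATE '2024-06-30'
GROUP BY account_id;
```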
A hashdiff is a hash (SHA-256) of all descriptive columns in a satellite record. During loading, you compare the hashdiff of the incoming record against the hashdiff of the latest satellite record for the same Hub key. If the hashes differ, a change occurred, and a new satellite row is inserted. If the hashes match, the record is skipped (no change). This avoids comparing each column individually, which is verbose and error-prone for wide tables. The hashdiff must be computed deterministically — NULLs should be coalesced to a consistent placeholder, and column order must be fixed.
In Hive-style partitioning, the user must specify partition columns explicitly in queries (WHERE year=2024 AND month=6) and the physical directory layout is visible. Iceberg's hidden partitioning applies partition transforms (like days(event_ts), months(event_ts), bucket(16, user_id)) as metadata. Queries filter on the original column (WHERE event_ts BETWEEN ...), and Iceberg's planner automatically prunes partitions. Benefits: (1) Users don't need to know the partition scheme. (2) Partition evolution — you can change the partition strategy (e.g., from days() to hours()) without rewriting data; old data keeps its old layout, new data uses the new layout. (3) No partition column in the data itself — avoids redundant columns.
Star schemas denormalize dimensions into wide single tables, so a fact-to-dimension query requires exactly one join per dimension. Snowflake schemas normalize dimensions into chains of sub-tables (dim_product → dim_subcategory → dim_category → dim_department), adding 2-3 joins per dimension. Drawbacks of snowflake: (1) More complex queries — analysts must understand the normalized structure. (2) BI tools work best with star schemas (automatic join path detection). (3) In modern columnar engines (Redshift, BigQuery, Databricks SQL), the storage savings from normalization are negligible because columnar compression already handles repeated values efficiently. (4) Join overhead grows with snowflake depth. The only case for snowflaking is when a dimension has a very large, rarely-queried sub-entity that would bloat the main dimension significantly.
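To ground the join-count argument, here is the same question asked against a star versus a snowflaked product dimension (hypothetical schema):

```sql
-- Star: one join, because department lives on the wide dimension
SELECT d.department_name, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product d ON f.product_sk = d.product_sk
GROUP BY d.department_name;

-- Snowflake: the same rollup walks the normalized chain
SELECT dept.department_name, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product p       ON f.product_sk = p.product_sk
JOIN dim_subcategory sub ON p.subcategory_sk = sub.subcategory_sk
JOIN dim_category cat    ON sub.category_sk = cat.category_sk
JOIN dim_department dept ON cat.department_sk = dept.department_sk
GROUP BY dept.department_name;
```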
Bronze (raw): Exact copy of source data. Schema-on-read — store as-is (JSON, CSV, Avro). Minimal transformation: add ingestion metadata. Consumer: data engineers debugging pipelines. Quality: none enforced (garbage in, garbage stored). Silver (cleaned): Schema-on-write. Data is typed, deduplicated, validated, and conformed. Business keys are resolved. This is where you apply Data Vault or 3NF modeling for integration. Consumer: data engineers and analysts for exploration. Quality: enforced (null checks, uniqueness, referential integrity). Gold (business): Star schemas, aggregates, KPI tables, ML feature tables. Highly denormalized for fast queries. Consumer: BI tools, dashboards, executives, ML models. Quality: highest — data is curated, tested, and SLA-backed. Each layer is independently queryable and version-controlled (Delta/Iceberg), so you can always replay from bronze if silver/gold logic changes.
Log-based CDC reads the database's transaction log (binlog, WAL, redo log) to capture every change: inserts, updates, and deletes. Tools like Debezium emit each change as an event with before/after images. Advantages: (1) Captures deletes — query-based CDC misses them because the row no longer exists to query. (2) Zero source impact — reading the log doesn't add load to the database. (3) Captures all intermediate states — if a row is updated 5 times between polls, log-based CDC captures all 5 changes. (4) Low latency — changes are available in seconds, not on a polling schedule. Query-based CDC (WHERE updated_at > watermark) is simpler but misses deletes, misses intermediate changes, requires a reliable watermark column, and adds query load to the source. Use log-based CDC for any production pipeline where data accuracy matters.
A semantic layer is an abstraction between the physical warehouse tables and BI consumers that provides centralized metric definitions. Instead of each dashboard calculating "revenue" differently (gross vs. net vs. with-refunds), metrics are defined once in code (e.g., dbt Semantic Layer YAML) and served via API to all consumers (Tableau, Looker, notebooks). Benefits: (1) Metric consistency — every team sees the same numbers. (2) Version control — metric definitions are in Git with review processes. (3) Testability — metric logic can be unit-tested. (4) Decoupling — physical schema changes don't break dashboards. The dbt Semantic Layer (MetricFlow), Cube.dev, and Looker's LookML are the main implementations. For interviews, articulating this pattern demonstrates understanding of organizational data challenges beyond just pipelines.