Data Engineering Concepts

Summary

This guide covers the foundational data modeling and architecture patterns that underpin every data engineering interview: Kimball dimensional modeling, Data Vault 2.0, and modern lakehouse patterns. These are not tied to a single tool — they inform decisions you make in Spark, dbt, Snowflake, Databricks, or any warehouse. Interviewers test whether you can choose the right model for a given workload, articulate trade-offs, and avoid common anti-patterns. Assume familiarity with SQL and basic relational concepts; we focus on DE-specific design patterns, not introductory normalization.

Table of Contents

- Core Concepts
- Industry Use Cases
- Code Examples
- Comparison / When to Use
- Gotchas & Anti-patterns
- Exercises
- Quiz
- Further Reading

Core Concepts

1. Kimball Dimensional Modeling

Ralph Kimball's methodology organizes analytical data into star schemas: a central fact table (measures + foreign keys) surrounded by dimension tables (descriptive attributes). The goals are query simplicity, predictable performance, and business-user accessibility.

Fact Tables

Fact tables record business events at a specific grain — the most atomic level of detail (e.g., one row per order line item). Types of facts:

- Transaction facts: one row per discrete event (e.g., an order line item); measures are fully additive.
- Periodic snapshot facts: one row per entity per time period (e.g., daily account balance); measures are often semi-additive.
- Accumulating snapshot facts: one row per process instance, updated in place as milestones occur (e.g., a claim moving from filed to paid).

Dimension Tables

Dimensions provide the "who, what, where, when, why" context for facts. They are typically wide (50-100+ columns), denormalized, and change slowly. The date dimension is universal — a pre-built table with fiscal calendars, holidays, and day-of-week flags.
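The date dimension described above can be generated once and reused everywhere. A minimal plain-Python sketch (column names here are illustrative conventions, not a standard; a real pipeline would materialize this as a warehouse table):

```python
# Sketch: build a date dimension row-by-row with common calendar flags.
from datetime import date, timedelta

def build_dim_date(start: date, end: date) -> list[dict]:
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),   # smart surrogate key, e.g. 20240615
            "full_date": d.isoformat(),
            "day_of_week": d.strftime("%A"),
            "is_weekend": d.weekday() >= 5,          # Sat=5, Sun=6
            "month": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
        })
        d += timedelta(days=1)
    return rows

dim_date = build_dim_date(date(2024, 1, 1), date(2024, 12, 31))
```

In practice you would extend the same loop with fiscal-calendar and holiday columns, which is exactly why the table is pre-built rather than computed per query.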

Slowly Changing Dimensions (SCD)

SCD patterns define how a dimension handles attribute changes over time. Type 1 overwrites the old value (no history). Type 2 expires the current row and inserts a new versioned row with effective/end dates and an is_current flag — the default choice when history must be auditable. Type 3 keeps a "previous value" column for a single prior state. Hybrid types (4-6) combine these techniques, but Types 1 and 2 cover the vast majority of real dimensions.
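The Type 2 expire-and-insert mechanics can be sketched on in-memory rows (column names are illustrative assumptions; a real pipeline would do this with a warehouse MERGE, as in the Delta Lake code example later in this guide):

```python
# Minimal sketch of SCD Type 2: expire the current version if tracked
# attributes changed, then append a new current version.
from datetime import date

OPEN_END = date(9999, 12, 31)

def apply_scd2(dim_rows: list[dict], incoming: dict, today: date) -> list[dict]:
    tracked = ("name", "city")
    for row in dim_rows:
        if row["customer_bk"] == incoming["customer_bk"] and row["is_current"]:
            if all(row[c] == incoming[c] for c in tracked):
                return dim_rows                      # no change: nothing to do
            row["end_date"] = today                  # expire the old version
            row["is_current"] = False
            break
    dim_rows.append({
        "customer_bk": incoming["customer_bk"],
        "name": incoming["name"],
        "city": incoming["city"],
        "effective_date": today,
        "end_date": OPEN_END,
        "is_current": True,
    })
    return dim_rows

dim = []
apply_scd2(dim, {"customer_bk": "C1", "name": "Ada", "city": "London"}, date(2024, 1, 1))
apply_scd2(dim, {"customer_bk": "C1", "name": "Ada", "city": "Paris"}, date(2024, 6, 1))
# dim now holds two versions: the expired London row and the current Paris row.
```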

Star vs. Snowflake

A star schema has fully denormalized dimensions — one join from fact to dimension. A snowflake schema normalizes dimensions into sub-tables (e.g., dim_product → dim_category → dim_department). Kimball strongly prefers star schemas: fewer joins, simpler queries, better BI tool compatibility. Snowflake schemas save storage but complicate queries and rarely improve performance with modern columnar engines.

2. Data Vault 2.0

Data Vault is a modeling methodology for enterprise data warehouses designed to handle change, auditability, and parallel loading. It separates structure (business keys), relationships, and context into three entity types:

Hubs

A Hub represents a core business concept identified by its business key (natural key). Columns: hash key (surrogate), business key, load timestamp, record source. Hubs never change — once a business key is loaded, it persists forever. Example: hub_customer with customer_id as the business key.
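The hash-key convention can be sketched in a few lines. SHA-256 over a cleaned business key mirrors the SQL loading example later in this guide; the exact cleaning rules (upper-case, trim) are a project convention, not a standard:

```python
# Sketch: deterministic Hub hash key from a cleaned business key.
import hashlib

def hub_hash_key(business_key: str) -> str:
    cleaned = business_key.strip().upper()
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()

# The same business key yields the same hash key regardless of
# source-system whitespace or casing — so parallel loads stay consistent.
assert hub_hash_key(" cust-42 ") == hub_hash_key("CUST-42")
```

Because the key is derived deterministically from the business key, Hubs, Links, and Satellites can compute their keys independently and load in parallel without lookups.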

Links

A Link records a relationship between two or more Hubs. Columns: hash key, foreign hash keys to each hub, load timestamp, record source. Links are insert-only. They model many-to-many relationships naturally. Example: link_order connecting hub_customer and hub_product.

Satellites

A Satellite stores descriptive attributes and their history for a specific Hub or Link. Columns: parent hash key, load timestamp, end timestamp (or hashdiff for change detection), attributes, record source. Satellites capture every change — each attribute change creates a new row. A single Hub can have multiple Satellites (e.g., sat_customer_personal, sat_customer_demographics) sourced from different systems.
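The hashdiff change-detection described above can be sketched in plain Python. NULL coalescing and the '||' delimiter mirror the SQL example later in this guide; the column set is illustrative:

```python
# Sketch: hashdiff = hash of all descriptive columns in fixed order,
# with NULLs coalesced to '' so the hash is deterministic.
import hashlib

def hashdiff(attrs: dict, columns: tuple[str, ...]) -> str:
    parts = ["" if attrs.get(c) is None else str(attrs.get(c)) for c in columns]
    return hashlib.sha256("||".join(parts).encode("utf-8")).hexdigest()

COLS = ("name", "email", "phone")
stored = hashdiff({"name": "Ada", "email": "ada@x.io", "phone": None}, COLS)
incoming = {"name": "Ada", "email": "ada@new.io", "phone": None}

# Compare one hash instead of every column individually.
needs_new_row = hashdiff(incoming, COLS) != stored
```

Comparing a single hash replaces a column-by-column comparison, which is the whole point for wide satellites.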

Why Data Vault?

Data Vault excels at: (1) Agility — adding a new source only requires new Hubs/Links/Satellites without altering existing structures. (2) Parallel loading — Hubs, Links, and Satellites load independently. (3) Full auditability — every record tracks its source and load time. (4) Handling multiple sources of truth — satellites from different systems coexist without conflict. The downside: querying Data Vault raw tables is complex (many joins), so a Kimball-style star schema is typically built on top as a presentation/business layer.

3. Modern Lakehouse Patterns

Medallion Architecture (Bronze → Silver → Gold)

The medallion architecture (popularized by Databricks) organizes lakehouse data into three quality tiers:

- Bronze: raw data exactly as ingested from source, stored as-is with ingestion metadata; no quality enforcement.
- Silver: cleaned, typed, deduplicated, and conformed data with quality checks enforced; the integration layer.
- Gold: business-level star schemas, aggregates, and KPI tables, curated for BI tools and ML consumers.

Key principle: each layer is a Delta/Iceberg table — fully queryable, version-controlled, and independently updatable. If silver logic changes, you replay from bronze without re-ingesting from source.

Kimball in the Lakehouse

Kimball dimensional modeling remains the gold standard for the gold layer. In a lakehouse, the star schema is implemented as Delta/Iceberg tables. SCD Type 2 uses Delta MERGE operations. Date dimensions are static reference tables. The change: in a lakehouse, you keep the raw and intermediate layers (bronze/silver) alongside the gold star schema — something a traditional Kimball-only warehouse didn't do. This gives you both analytical performance (star schema) and data lineage (medallion layers).

4. Open Table Formats — Delta Lake, Apache Iceberg & Apache Hudi

Open table formats bring ACID transactions, time travel, and schema evolution to files on data lakes:

- Delta Lake: a transaction log (_delta_log) alongside Parquet files; deepest integration with Spark and Databricks; MERGE, time travel, and change data feed.
- Apache Iceberg: an engine-neutral specification with hidden partitioning, partition evolution, and snapshot-based time travel; broad support across Spark, Trino, Flink, and Snowflake.
- Apache Hudi: built for upsert-heavy and incremental workloads; offers copy-on-write and merge-on-read table types.

Convergence: Delta's UniForm writes Iceberg-compatible metadata alongside Delta logs, enabling cross-engine reads. All three formats are converging toward similar feature sets. Choose based on your engine ecosystem.

5. Data Quality & Observability

Data quality is a first-class pillar of data engineering — not an afterthought. Key patterns:

- Quality gates between layers: validate each batch (nulls, uniqueness, ranges, formats) before promoting it from bronze to silver or silver to gold, and fail the pipeline on violations.
- Declarative checks in tooling: Great Expectations suites, dbt schema tests, or Delta Live Tables expectations, version-controlled alongside pipeline code.
- Observability: monitor freshness, volume, and schema drift so broken upstream data is caught before consumers see it.
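The quality-gate idea reduces to a small amount of logic: run declarative checks over a batch and block promotion on any failure. A framework-free sketch (check names and the row shape are assumptions; the Great Expectations example later in this guide shows the same pattern with a real framework):

```python
# Sketch: a quality gate that returns the list of violated checks.
def run_checks(rows: list[dict]) -> list[str]:
    failures = []
    ids = [r.get("order_id") for r in rows]
    if any(i is None for i in ids):
        failures.append("order_id contains NULLs")
    if len(ids) != len(set(ids)):
        failures.append("order_id is not unique")
    if any(r.get("amount", 0) < 0 for r in rows):
        failures.append("amount has negative values")
    return failures

batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": -5.0},   # violates the amount >= 0 check
]
failures = run_checks(batch)
if failures:
    # In a real pipeline, raising here stops promotion to the next layer.
    print("quality gate failed:", failures)
```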

6. Common Modeling Mistakes

The most frequent mistakes — skipping grain declaration, unconformed dimensions, querying the raw Data Vault directly, and defaulting to One Big Table — are covered in depth in the Gotchas & Anti-patterns section below.

7. Change Data Capture (CDC) Patterns

Change Data Capture is essential for keeping lakehouse tables in sync with operational databases without full table scans. Key patterns:

- Log-based CDC: read the database's transaction log (binlog, WAL) via tools like Debezium; captures inserts, updates, and deletes with near-zero source impact.
- Query-based CDC: poll with a watermark (WHERE updated_at > last_run); simple, but it misses deletes and intermediate changes and adds query load to the source.
- Apply pattern: land change events in Kafka, then MERGE them into lakehouse tables in micro-batches (see the Debezium code example below).
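The key limitation of query-based CDC can be shown in a few lines. This toy model (table shape and event format are illustrative, loosely echoing Debezium's op codes) contrasts polling a watermark against reading a change log:

```python
# Sketch: a watermark poll never sees a deleted row; a change log does.
source = {
    1: {"id": 1, "updated_at": 105},
    # id=2 was deleted at t=110 — it no longer exists to be queried
}
change_log = [
    {"op": "u", "id": 1, "ts": 105},
    {"op": "d", "id": 2, "ts": 110},
]

watermark = 100
polled = [row for row in source.values() if row["updated_at"] > watermark]
logged = [ev for ev in change_log if ev["ts"] > watermark]

# Query-based CDC sees only the update; the delete is silently lost.
assert [r["id"] for r in polled] == [1]
# Log-based CDC captures both the update and the delete.
assert [e["op"] for e in logged] == ["u", "d"]
```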

8. Semantic Layer & Metrics Layer

The semantic layer sits between raw warehouse tables and BI consumers, providing a business-friendly abstraction over physical data structures. Key concepts:

- Metrics are defined once, in version-controlled code (dbt Semantic Layer / MetricFlow, Cube.dev, LookML), not per-dashboard.
- Consumers (BI tools, notebooks, internal APIs) query the semantic layer rather than writing raw SQL against physical tables.
- Decoupling: physical schema changes don't break dashboards, and metric logic can be tested like any other code.
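The core idea — one canonical metric definition, many consumers — can be sketched without any framework. The registry shape and names below are assumptions for illustration, not a real semantic-layer API:

```python
# Sketch: a central metric registry; every consumer queries the same definition.
METRICS = {
    "revenue": {
        "agg": lambda rows: sum(r["amount"] for r in rows),
        "filter": lambda r: r["status"] == "completed",
    },
}

def query_metric(name: str, rows: list[dict]) -> float:
    m = METRICS[name]
    return m["agg"]([r for r in rows if m["filter"](r)])

orders = [
    {"amount": 100.0, "status": "completed"},
    {"amount": 40.0, "status": "cancelled"},
    {"amount": 60.0, "status": "completed"},
]
# Every dashboard calling query_metric("revenue", ...) gets the same number,
# instead of each re-implementing the filter and aggregation in its own SQL.
```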

↑ Back to top

Industry Use Cases

1. Retail Analytics — Kimball Star Schema

A mid-size retailer builds a star schema with fact_sales (grain: one row per POS transaction line item) surrounded by dim_product (SCD Type 2 for category changes), dim_store, dim_date, and dim_customer (SCD Type 2 for loyalty tier changes). Gold-layer tables in Delta Lake. Analysts write simple SQL: SELECT p.category, SUM(f.revenue) FROM fact_sales f JOIN dim_product p ON f.product_sk = p.product_sk GROUP BY 1. BI dashboards (Tableau, Power BI) connect directly to the star schema with predictable performance — no multi-hop joins.

2. Insurance Data Hub — Data Vault 2.0

An insurance company integrates data from 12 source systems (policy admin, claims, billing, CRM, external rating agencies). The raw vault models core entities as Hubs (hub_policy, hub_claimant, hub_provider), relationships as Links (link_policy_claimant, link_claim_provider), and source-specific attributes as Satellites. Parallel ELT loads from each source without contention. A business vault layer adds computed Satellites (e.g., sat_policy_risk_score). Presentation layer: Kimball star schemas for actuarial reporting and claims analytics.

3. Streaming Lakehouse — Medallion + Iceberg

A logistics company builds a lakehouse on AWS with Apache Iceberg tables. Bronze: Flink jobs consume GPS events from Kafka and append to bronze_vehicle_events (Iceberg, partitioned by hour(event_ts) with hidden partitioning). Silver: Spark Structured Streaming deduplicates, enriches with geofence data, and writes silver_vehicle_positions. Gold: hourly batch jobs build gold_delivery_metrics (avg delivery time by region, on-time rate). Trino connects to the same Iceberg tables for ad-hoc analyst queries — no data movement between engines.

4. Real-Time CDC Pipeline — Debezium + Lakehouse

A SaaS company needs near-real-time analytics on their PostgreSQL OLTP database without impacting production performance. Debezium reads the PostgreSQL WAL and publishes change events to Kafka topics (one per table). PySpark Structured Streaming reads these topics, parses the Debezium JSON envelope (extracting before/after images and op type), and applies changes to Delta Lake tables using foreachBatch with MERGE. Inserts and updates become UPSERT operations; deletes set a soft-delete flag. The silver layer resolves foreign keys and builds conformed entities. Gold-layer star schemas refresh every 15 minutes, giving analysts near-real-time dashboards with only a 15-minute lag from OLTP — down from the previous 24-hour batch ETL lag.

5. Metrics Standardization with Semantic Layer

A fintech company with 50+ dashboards discovers that "monthly active users (MAU)" is calculated differently across teams — engineering counts API callers, marketing counts login events, finance counts paying users. They implement a dbt Semantic Layer: a single YAML definition specifies metric: mau with the canonical calculation (unique users with at least one qualifying event per month). All BI tools (Tableau, Hex, internal APIs) query the semantic layer API instead of writing raw SQL. Within 3 months, metric discrepancies are eliminated, finance reporting time drops by 40%, and the "whose number is right" debates disappear.

↑ Back to top

Code Examples

1. Kimball SCD Type 2 with Delta Lake MERGE

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# SCD Type 2: dim_customer with effective_date / end_date / is_current
target = DeltaTable.forName(spark, "gold.dim_customer")

# Source: new/changed customer records from silver layer
source = spark.table("silver.customers_latest")

# Build a change-detection join key
# Match on business key AND only where attributes differ
merge_condition = """
    t.customer_bk = s.customer_bk
    AND t.is_current = true
"""

change_condition = """
    s.name != t.name
    OR s.email != t.email
    OR s.city != t.city
"""

# Step 1: Expire old rows (set end_date, is_current = false)
target.alias("t").merge(
    source.alias("s"), merge_condition
).whenMatchedUpdate(
    condition=change_condition,
    set={
        "end_date": F.current_date(),
        "is_current": F.lit(False),
    }
).execute()

# Step 2: Insert new current rows for changed + new records
current = spark.table("gold.dim_customer").where("is_current = true")

# Surrogate keys: offset by the current max so keys assigned in this load
# never collide with keys from earlier loads
max_sk = (
    spark.table("gold.dim_customer")
    .agg(F.coalesce(F.max("customer_sk"), F.lit(0)).alias("m"))
    .first()["m"]
)

new_rows = source.alias("s").join(
    current.alias("c"),
    (F.col("s.customer_bk") == F.col("c.customer_bk"))
    & F.col("s.name").eqNullSafe(F.col("c.name"))
    & F.col("s.email").eqNullSafe(F.col("c.email"))
    & F.col("s.city").eqNullSafe(F.col("c.city")),
    "left_anti"  # Rows that changed or are brand-new
).select(
    (F.monotonically_increasing_id() + max_sk + 1).alias("customer_sk"),
    F.col("s.customer_bk"),
    F.col("s.name"), F.col("s.email"), F.col("s.city"),
    F.current_date().alias("effective_date"),
    F.lit("9999-12-31").cast("date").alias("end_date"),
    F.lit(True).alias("is_current"),
)

new_rows.write.format("delta").mode("append").saveAsTable("gold.dim_customer")

2. Data Vault Hub + Satellite Loading (SQL)

-- Load hub_customer from staging
INSERT INTO raw_vault.hub_customer (hub_customer_hk, customer_bk, load_ts, record_source)
SELECT
    SHA2(UPPER(TRIM(stg.customer_id)), 256)  AS hub_customer_hk,
    UPPER(TRIM(stg.customer_id))                AS customer_bk,
    CURRENT_TIMESTAMP()                          AS load_ts,
    'crm_system'                                  AS record_source
FROM staging.stg_crm_customers stg
LEFT JOIN raw_vault.hub_customer hub
    ON hub.customer_bk = UPPER(TRIM(stg.customer_id))
WHERE hub.hub_customer_hk IS NULL;   -- Only insert new business keys

-- Load satellite with change detection via hashdiff
INSERT INTO raw_vault.sat_customer_personal
    (hub_customer_hk, load_ts, hashdiff, name, email, phone, record_source)
SELECT
    SHA2(UPPER(TRIM(stg.customer_id)), 256)    AS hub_customer_hk,
    CURRENT_TIMESTAMP()                            AS load_ts,
    SHA2(CONCAT_WS('||',
        COALESCE(stg.name, ''),
        COALESCE(stg.email, ''),
        COALESCE(stg.phone, '')
    ), 256)                                        AS hashdiff,
    stg.name, stg.email, stg.phone,
    'crm_system'                                    AS record_source
FROM staging.stg_crm_customers stg
INNER JOIN raw_vault.hub_customer hub
    ON hub.customer_bk = UPPER(TRIM(stg.customer_id))
LEFT JOIN (
    -- Get the latest satellite record per hub key
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY hub_customer_hk ORDER BY load_ts DESC
    ) AS rn
    FROM raw_vault.sat_customer_personal
) latest ON latest.hub_customer_hk = hub.hub_customer_hk AND latest.rn = 1
WHERE latest.hashdiff IS NULL  -- Brand-new
   OR latest.hashdiff != SHA2(CONCAT_WS('||',
        COALESCE(stg.name, ''),
        COALESCE(stg.email, ''),
        COALESCE(stg.phone, '')
   ), 256);  -- Attributes changed

3. Apache Iceberg — Hidden Partitioning & Time Travel

-- Create an Iceberg table with hidden (transform-based) partitioning
CREATE TABLE lakehouse.silver.events (
    event_id     STRING,
    user_id      STRING,
    event_type   STRING,
    payload      STRING,
    event_ts     TIMESTAMP
)
USING ICEBERG
PARTITIONED BY (days(event_ts), bucket(16, user_id));

-- Queries filter on event_ts directly; Iceberg prunes partitions
-- automatically — no need to reference partition columns
SELECT event_type, COUNT(*) AS cnt
FROM lakehouse.silver.events
WHERE event_ts BETWEEN '2024-06-01' AND '2024-06-30'
GROUP BY 1;

-- Time travel: query a specific snapshot
SELECT * FROM lakehouse.silver.events
    TIMESTAMP AS OF '2024-06-15T10:00:00';

-- Partition evolution: change strategy without rewriting data
ALTER TABLE lakehouse.silver.events
    REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts);

4. dbt Dimensional Model — Star Schema with Tests

-- models/gold/dim_product.sql
-- dbt model: conformed product dimension (SCD Type 1)

{{ config(
    materialized='table',
    tags=['gold', 'dimensional'],
) }}

WITH source AS (
    SELECT *
    FROM {{ ref('stg_products') }}
),

enriched AS (
    SELECT
        product_id,
        product_name,
        category,
        subcategory,
        brand,
        unit_cost,
        unit_price,
        unit_price - unit_cost           AS unit_margin,
        CASE WHEN is_active THEN 'Active'
             ELSE 'Discontinued'
        END                              AS product_status,
        CURRENT_TIMESTAMP()              AS _loaded_at
    FROM source
)

SELECT * FROM enriched

# models/gold/dim_product.yml — dbt schema tests (separate file)
version: 2
models:
  - name: dim_product
    description: "Conformed product dimension"
    columns:
      - name: product_id
        tests:
          - unique
          - not_null
      - name: category
        tests:
          - not_null
          - accepted_values:
              values: ['Electronics', 'Clothing', 'Home', 'Food', 'Sports']
      - name: unit_price
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

5. Great Expectations — Data Quality Checkpoint

import great_expectations as gx

# Initialize context and connect to Delta table
context = gx.get_context()

datasource = context.sources.add_spark("lakehouse")
data_asset = datasource.add_dataframe_asset(
    name="silver_orders",
    dataframe=spark.table("silver.orders"),
)
batch_request = data_asset.build_batch_request()

# Define expectations
suite = context.add_or_update_expectation_suite("silver_orders_suite")

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="silver_orders_suite",
)

# Column-level checks
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0)
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_match_regex(
    "email", regex=r"^[\w.+-]+@[\w-]+\.[\w.]+$"
)

# Table-level checks
validator.expect_table_row_count_to_be_between(
    min_value=1000, max_value=10_000_000
)

# Run checkpoint and fail pipeline on violations
checkpoint = context.add_or_update_checkpoint(
    name="silver_orders_checkpoint",
    validations=[{"batch_request": batch_request,
                  "expectation_suite_name": "silver_orders_suite"}],
)
result = checkpoint.run()

if not result.success:
    raise RuntimeError("Data quality check failed — see GX report")

6. Debezium CDC — Apply Changes to Delta Lake

from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

# Read Debezium CDC events from Kafka
cdc_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "dbserver.public.customers") \
    .option("startingOffsets", "earliest") \
    .load()

# Parse the Debezium JSON envelope
parsed = cdc_stream.select(
    F.from_json(
        F.col("value").cast("string"),
        "payload STRUCT<op:STRING, before:STRUCT<customer_id:STRING, name:STRING, email:STRING>, after:STRUCT<customer_id:STRING, name:STRING, email:STRING>, ts_ms:LONG>"
    ).alias("cdc")
).select(
    F.col("cdc.payload.op").alias("operation"),       # c=create, u=update, d=delete
    # For deletes the 'after' image is null, so fall back to 'before' for the key
    F.coalesce(
        F.col("cdc.payload.after.customer_id"),
        F.col("cdc.payload.before.customer_id"),
    ).alias("customer_id"),
    F.col("cdc.payload.after.name").alias("name"),
    F.col("cdc.payload.after.email").alias("email"),
    F.col("cdc.payload.ts_ms").alias("cdc_timestamp"),
)

# Apply CDC events to Delta Lake using foreachBatch + MERGE
def apply_cdc_to_delta(batch_df, batch_id):
    if batch_df.isEmpty():
        return

    # Deduplicate: keep last event per key within micro-batch
    deduped = batch_df.withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("customer_id").orderBy(F.desc("cdc_timestamp"))
        )
    ).where("rn = 1").drop("rn")

    target = DeltaTable.forName(spark, "silver.customers")

    target.alias("t").merge(
        deduped.alias("s"),
        "t.customer_id = s.customer_id"
    ).whenMatchedUpdate(
        condition="s.operation = 'u'",
        set={"name": "s.name", "email": "s.email",
             "_updated_at": F.current_timestamp()}
    ).whenMatchedDelete(
        condition="s.operation = 'd'"
    ).whenNotMatchedInsert(
        condition="s.operation != 'd'",
        values={"customer_id": "s.customer_id", "name": "s.name",
                "email": "s.email", "_updated_at": F.current_timestamp()}
    ).execute()

# Write stream with foreachBatch
from pyspark.sql import Window
query = parsed.writeStream \
    .foreachBatch(apply_cdc_to_delta) \
    .option("checkpointLocation", "s3a://checkpoints/cdc/customers/") \
    .trigger(availableNow=True) \
    .start()

query.awaitTermination()

7. dbt Semantic Layer — Metric Definition (YAML)

# models/metrics/revenue.yml — dbt Semantic Layer metric definition
semantic_models:
  - name: orders
    description: "Order fact table for revenue metrics"
    model: ref('fact_orders')
    defaults:
      agg_time_dimension: order_date

    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign

    measures:
      - name: total_revenue
        expr: amount
        agg: sum
        description: "Sum of order amounts (completed orders only)"
      - name: order_count
        expr: "1"
        agg: sum
      - name: unique_customers
        expr: customer_id
        agg: count_distinct

    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
      - name: order_status
        type: categorical

metrics:
  - name: revenue
    type: simple
    label: "Revenue"
    description: "Total revenue from completed orders"
    type_params:
      measure: total_revenue
    filter: |
      {{ Dimension('order_id__order_status') }} = 'completed'

  - name: unique_customers
    type: simple
    label: "Unique Customers"
    type_params:
      measure: unique_customers

  - name: revenue_per_customer
    type: derived
    label: "Revenue per Customer"
    type_params:
      expr: revenue / unique_customers
      metrics:
        - name: revenue
        - name: unique_customers
↑ Back to top

Comparison / When to Use

| Aspect | Kimball Star Schema | Data Vault 2.0 | One Big Table (OBT) | 3NF / Normalized |
|---|---|---|---|---|
| Primary use | Analytical / BI consumption | Enterprise raw integration layer | Single-use-case BI explore | Operational / OLTP systems |
| Schema design | Denormalized star (fact + dims) | Hub / Link / Satellite | Fully flattened wide table | Fully normalized (3NF+) |
| Query complexity | Low (few joins) | High (many joins from Hubs → Links → Sats) | Minimal (no joins) | High (many joins) |
| Change handling | SCD Types 1-6 | Insert-only satellites (full history) | Rebuild/overwrite | In-place updates |
| Source agility | Moderate (schema changes require redesign) | High (add Hub/Link/Sat independently) | Low (must add column to wide table) | Moderate |
| Storage efficiency | Moderate (denormalized dims) | Higher (normalized, but many tables) | Low (massive duplication) | High (no redundancy) |
| BI tool compatibility | Excellent (designed for it) | Poor (needs presentation layer) | Good (single table) | Poor (too many joins) |
| Best for | Gold layer, dashboards, ad-hoc SQL | Silver/raw vault, regulated industries, multi-source integration | Single Looker explore, denormalized export | Source systems, OLTP databases |

In practice, these are complementary, not competing. A mature lakehouse often uses 3NF at the source, Data Vault in the silver layer for integration and auditability, and Kimball star schemas in the gold layer for consumption. OBT is a tactical choice for specific performance-critical views, not a general architecture.

↑ Back to top

Gotchas & Anti-patterns

  1. Skipping grain declaration. Starting a fact table design with "what columns do I need?" instead of "what does one row represent?" leads to mixed-grain tables. Always state the grain in words first: "one row per order line item per day." Every column must be true at that grain.
  2. Building star schemas without conformed dimensions. If the marketing team's dim_customer has different keys or attributes than the finance team's, cross-domain queries (e.g., "revenue by customer segment") require complex reconciliation. Define conformed dimensions centrally — even if each team adds their own attributes, the business key and core attributes must be consistent.
  3. Using Data Vault for the presentation layer. Data Vault's normalized structure requires 5-10 joins for a simple business query. It's designed as an integration/staging layer, not a query-facing layer. Always build a Kimball star schema or flat views on top for BI consumption.
  4. Choosing OBT to avoid modeling work. One Big Table avoids upfront design but creates downstream pain: (1) schema changes ripple to every consumer, (2) repeated dimension values waste storage (10x+ blowup), (3) updates require rewriting the entire table. Use OBT only when you have one specific consumer with known access patterns.
  5. Ignoring incremental processing. Rebuilding silver/gold tables from scratch on every run is safe but expensive. For large tables (100M+ rows), use Delta MERGE, Iceberg upsert, or dbt incremental models. Key: design your pipeline so you can always fall back to a full refresh if incremental logic drifts.
  6. No CDC strategy — relying solely on batch watermarks. Query-based CDC (polling with WHERE updated_at > watermark) misses deletes and cannot capture intra-batch changes. For accurate synchronization with OLTP sources, invest in log-based CDC (Debezium, AWS DMS). The operational overhead pays for itself in data correctness and reduced source system load.
  7. Metrics defined in BI tools instead of code. When "revenue" is a Tableau calculated field in 20 dashboards, changing the definition requires editing 20 workbooks. Define metrics in a version-controlled semantic layer (dbt metrics, Cube.dev) and have BI tools consume them as the single source of truth. This also enables automated testing of metric definitions.
↑ Back to top

Exercises

  1. Design a star schema for an e-commerce company. Given these source tables: orders, order_items, customers, products, categories, payments. Design: (1) fact table with grain declaration, (2) at least 4 dimension tables (including date), (3) SCD type decisions for each dimension (justify Type 1 vs. 2). Write the DDL in SQL. Then write 3 sample analytical queries that demonstrate why a star schema is simpler than querying the normalized source directly.
  2. Model a Data Vault for a healthcare system. Given sources: patient registration system, lab results system, appointment booking system. Design: (1) Hubs for core business concepts, (2) Links for relationships, (3) Satellites for descriptive attributes. Draw the entity diagram. Then write SQL to load one Hub and its Satellite from staging. Explain how you'd add a new source system (pharmacy) without modifying existing structures.
  3. Implement a medallion pipeline with data quality gates. Build a 3-layer pipeline (bronze → silver → gold) using PySpark + Delta Lake or dbt. Bronze: ingest raw CSV/JSON files. Silver: clean, deduplicate, validate. Gold: build a star schema fact table with at least 2 dimensions. Add data quality checks at the silver layer (use Great Expectations, dbt tests, or DLT expectations) that reject null primary keys and out-of-range values.
↑ Back to top

Quiz

Q1: Explain the difference between a transaction fact, a periodic snapshot fact, and an accumulating snapshot fact. Give an example of each.

Transaction fact: One row per discrete event. Example: fact_order_lines — one row per line item on a purchase order. Measures are fully additive (quantity, revenue). Periodic snapshot: One row per entity per time period. Example: fact_account_daily_balance — one row per bank account per day. Balances are semi-additive: you can sum across accounts but not across days (averaging is correct for time). Accumulating snapshot: One row per process instance, updated as milestones occur. Example: fact_insurance_claim with columns for filed_date, assessed_date, approved_date, paid_date. Multiple date foreign keys track the lifecycle. These are updated (not appended) as events occur.
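Semi-additivity in the periodic snapshot case is easy to demonstrate with toy balances (row shape is illustrative):

```python
# Sketch: daily account balances are additive across accounts for one day,
# but summing the same account across days is meaningless — average (or
# the last value) is the correct roll-up over time.
balances = [
    {"account": "A", "day": 1, "balance": 100},
    {"account": "B", "day": 1, "balance": 50},
    {"account": "A", "day": 2, "balance": 100},
    {"account": "B", "day": 2, "balance": 50},
]

total_day1 = sum(r["balance"] for r in balances if r["day"] == 1)           # valid: 150
wrong_total_A = sum(r["balance"] for r in balances if r["account"] == "A")  # 200 — not a real balance
days_A = [r["balance"] for r in balances if r["account"] == "A"]
avg_A = sum(days_A) / len(days_A)                                           # valid roll-up: 100.0
```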

Q2: What is Data Vault's "hashdiff" pattern and why is it used in satellite loading?

A hashdiff is a hash (SHA-256) of all descriptive columns in a satellite record. During loading, you compare the hashdiff of the incoming record against the hashdiff of the latest satellite record for the same Hub key. If the hashes differ, a change occurred, and a new satellite row is inserted. If the hashes match, the record is skipped (no change). This avoids comparing each column individually, which is verbose and error-prone for wide tables. The hashdiff must be computed deterministically — NULLs should be coalesced to a consistent placeholder, and column order must be fixed.

Q3: What is Iceberg's "hidden partitioning" and why is it better than Hive-style partitioning?

In Hive-style partitioning, the user must specify partition columns explicitly in queries (WHERE year=2024 AND month=6) and the physical directory layout is visible. Iceberg's hidden partitioning applies partition transforms (like days(event_ts), months(event_ts), bucket(16, user_id)) as metadata. Queries filter on the original column (WHERE event_ts BETWEEN ...), and Iceberg's planner automatically prunes partitions. Benefits: (1) Users don't need to know the partition scheme. (2) Partition evolution — you can change the partition strategy (e.g., from days() to hours()) without rewriting data; old data keeps its old layout, new data uses the new layout. (3) No partition column in the data itself — avoids redundant columns.
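The days() transform can be illustrated in plain Python: the partition value is days since the Unix epoch, derived from the timestamp, so a predicate on event_ts maps directly to a partition range. (This only mirrors the idea of Iceberg's days() transform; the authoritative definitions live in the Iceberg table spec.)

```python
# Sketch: days() partition transform — predicate on the timestamp column
# becomes a range of partition values the planner can prune against.
from datetime import datetime, timezone

def days_transform(ts: datetime) -> int:
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

lo = days_transform(datetime(2024, 6, 1, tzinfo=timezone.utc))
hi = days_transform(datetime(2024, 6, 30, tzinfo=timezone.utc))
# Any data file whose partition value falls outside [lo, hi] is skipped
# without being read — and the user never references a partition column.
```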

Q4: Why does Kimball recommend star schemas over snowflake schemas for analytical workloads?

Star schemas denormalize dimensions into wide single tables, so a fact-to-dimension query requires exactly one join per dimension. Snowflake schemas normalize dimensions into chains of sub-tables (dim_product → dim_subcategory → dim_category → dim_department), adding 2-3 joins per dimension. Drawbacks of snowflake: (1) More complex queries — analysts must understand the normalized structure. (2) BI tools work best with star schemas (automatic join path detection). (3) In modern columnar engines (Redshift, BigQuery, Databricks SQL), the storage savings from normalization are negligible because columnar compression already handles repeated values efficiently. (4) Join overhead grows with snowflake depth. The only case for snowflaking is when a dimension has a very large, rarely-queried sub-entity that would bloat the main dimension significantly.

Q5: Describe the medallion architecture. How do bronze, silver, and gold layers differ in schema strategy, data quality, and intended consumers?

Bronze (raw): Exact copy of source data. Schema-on-read — store as-is (JSON, CSV, Avro). Minimal transformation: add ingestion metadata. Consumer: data engineers debugging pipelines. Quality: none enforced (garbage in, garbage stored). Silver (cleaned): Schema-on-write. Data is typed, deduplicated, validated, and conformed. Business keys are resolved. This is where you apply Data Vault or 3NF modeling for integration. Consumer: data engineers and analysts for exploration. Quality: enforced (null checks, uniqueness, referential integrity). Gold (business): Star schemas, aggregates, KPI tables, ML feature tables. Highly denormalized for fast queries. Consumer: BI tools, dashboards, executives, ML models. Quality: highest — data is curated, tested, and SLA-backed. Each layer is independently queryable and version-controlled (Delta/Iceberg), so you can always replay from bronze if silver/gold logic changes.

Q6: What is log-based CDC and why is it preferred over query-based CDC?

Log-based CDC reads the database's transaction log (binlog, WAL, redo log) to capture every change: inserts, updates, and deletes. Tools like Debezium emit each change as an event with before/after images. Advantages: (1) Captures deletes — query-based CDC misses them because the row no longer exists to query. (2) Zero source impact — reading the log doesn't add load to the database. (3) Captures all intermediate states — if a row is updated 5 times between polls, log-based CDC captures all 5 changes. (4) Low latency — changes are available in seconds, not on a polling schedule. Query-based CDC (WHERE updated_at > watermark) is simpler but misses deletes, misses intermediate changes, requires a reliable watermark column, and adds query load to the source. Use log-based CDC for any production pipeline where data accuracy matters.

Q7: What is a semantic layer and why do modern data teams implement one?

A semantic layer is an abstraction between the physical warehouse tables and BI consumers that provides centralized metric definitions. Instead of each dashboard calculating "revenue" differently (gross vs. net vs. with-refunds), metrics are defined once in code (e.g., dbt Semantic Layer YAML) and served via API to all consumers (Tableau, Looker, notebooks). Benefits: (1) Metric consistency — every team sees the same numbers. (2) Version control — metric definitions are in Git with review processes. (3) Testability — metric logic can be unit-tested. (4) Decoupling — physical schema changes don't break dashboards. The dbt Semantic Layer (MetricFlow), Cube.dev, and Looker's LookML are the main implementations. For interviews, articulating this pattern demonstrates understanding of organizational data challenges beyond just pipelines.

↑ Back to top

Further Reading

↑ Back to top