Data Engineering Concepts

Summary

This guide covers the foundational data modeling and architecture patterns that underpin every data engineering interview: Kimball dimensional modeling, Data Vault 2.0, and modern lakehouse patterns. These are not tied to a single tool — they inform decisions you make in Spark, dbt, Snowflake, Databricks, or any warehouse. Interviewers test whether you can choose the right model for a given workload, articulate trade-offs, and avoid common anti-patterns. Assume familiarity with SQL and basic relational concepts; we focus on DE-specific design patterns, not introductory normalization.

Table of Contents

- Core Concepts
- Industry Use Cases
- Code Examples
- Comparison / When to Use
- Gotchas & Anti-patterns
- Exercises
- Quiz
- Further Reading

Core Concepts

1. Kimball Dimensional Modeling

Ralph Kimball's methodology organizes analytical data into star schemas: a central fact table (measures + foreign keys) surrounded by dimension tables (descriptive attributes). The goals are query simplicity, predictable performance, and business-user accessibility.

Fact Tables

Fact tables record business events at a specific grain — the most atomic level of detail (e.g., one row per order line item). Types of facts:

- Transaction facts: one row per discrete event (e.g., an order line item); measures are fully additive.
- Periodic snapshot facts: one row per entity per time period (e.g., daily account balance); measures are often semi-additive.
- Accumulating snapshot facts: one row per process instance, updated in place as milestones occur (e.g., a claim moving from filed to paid).

Dimension Tables

Dimensions provide the "who, what, where, when, why" context for facts. They are typically wide (50-100+ columns), denormalized, and change slowly. The date dimension is universal — a pre-built table with fiscal calendars, holidays, and day-of-week flags.
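The date dimension described above can be generated once and reused everywhere. A minimal plain-Python sketch (column names here are illustrative conventions, not a standard; a real pipeline would materialize this as a warehouse table):

```python
# Sketch: build a date dimension row-by-row with common calendar flags.
from datetime import date, timedelta

def build_dim_date(start: date, end: date) -> list[dict]:
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),   # smart surrogate key, e.g. 20240615
            "full_date": d.isoformat(),
            "day_of_week": d.strftime("%A"),
            "is_weekend": d.weekday() >= 5,          # Sat=5, Sun=6
            "month": d.month,
            "quarter": (d.month - 1) // 3 + 1,
            "year": d.year,
        })
        d += timedelta(days=1)
    return rows

dim_date = build_dim_date(date(2024, 1, 1), date(2024, 12, 31))
```

In practice you would extend the same loop with fiscal-calendar and holiday columns, which is exactly why the table is pre-built rather than computed per query.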

Slowly Changing Dimensions (SCD)

SCD patterns define how a dimension handles attribute changes over time. Type 1 overwrites the old value (no history). Type 2 expires the current row and inserts a new versioned row with effective/end dates and an is_current flag — the default choice when history must be auditable. Type 3 keeps a "previous value" column for a single prior state. Hybrid types (4-6) combine these techniques, but Types 1 and 2 cover the vast majority of real dimensions.
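The Type 2 expire-and-insert mechanics can be sketched on in-memory rows (column names are illustrative assumptions; a real pipeline would do this with a warehouse MERGE, as in the Delta Lake code example later in this guide):

```python
# Minimal sketch of SCD Type 2: expire the current version if tracked
# attributes changed, then append a new current version.
from datetime import date

OPEN_END = date(9999, 12, 31)

def apply_scd2(dim_rows: list[dict], incoming: dict, today: date) -> list[dict]:
    tracked = ("name", "city")
    for row in dim_rows:
        if row["customer_bk"] == incoming["customer_bk"] and row["is_current"]:
            if all(row[c] == incoming[c] for c in tracked):
                return dim_rows                      # no change: nothing to do
            row["end_date"] = today                  # expire the old version
            row["is_current"] = False
            break
    dim_rows.append({
        "customer_bk": incoming["customer_bk"],
        "name": incoming["name"],
        "city": incoming["city"],
        "effective_date": today,
        "end_date": OPEN_END,
        "is_current": True,
    })
    return dim_rows

dim = []
apply_scd2(dim, {"customer_bk": "C1", "name": "Ada", "city": "London"}, date(2024, 1, 1))
apply_scd2(dim, {"customer_bk": "C1", "name": "Ada", "city": "Paris"}, date(2024, 6, 1))
# dim now holds two versions: the expired London row and the current Paris row.
```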

Star vs. Snowflake

A star schema has fully denormalized dimensions — one join from fact to dimension. A snowflake schema normalizes dimensions into sub-tables (e.g., dim_product → dim_category → dim_department). Kimball strongly prefers star schemas: fewer joins, simpler queries, better BI tool compatibility. Snowflake schemas save storage but complicate queries and rarely improve performance with modern columnar engines.

2. Data Vault 2.0

Data Vault is a modeling methodology for enterprise data warehouses designed to handle change, auditability, and parallel loading. It separates structure (business keys), relationships, and context into three entity types:

Hubs

A Hub represents a core business concept identified by its business key (natural key). Columns: hash key (surrogate), business key, load timestamp, record source. Hubs never change — once a business key is loaded, it persists forever. Example: hub_customer with customer_id as the business key.
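The hash-key convention can be sketched in a few lines. SHA-256 over a cleaned business key mirrors the SQL loading example later in this guide; the exact cleaning rules (upper-case, trim) are a project convention, not a standard:

```python
# Sketch: deterministic Hub hash key from a cleaned business key.
import hashlib

def hub_hash_key(business_key: str) -> str:
    cleaned = business_key.strip().upper()
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()

# The same business key yields the same hash key regardless of
# source-system whitespace or casing — so parallel loads stay consistent.
assert hub_hash_key(" cust-42 ") == hub_hash_key("CUST-42")
```

Because the key is derived deterministically from the business key, Hubs, Links, and Satellites can compute their keys independently and load in parallel without lookups.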

Links

A Link records a relationship between two or more Hubs. Columns: hash key, foreign hash keys to each hub, load timestamp, record source. Links are insert-only. They model many-to-many relationships naturally. Example: link_order connecting hub_customer and hub_product.

Satellites

A Satellite stores descriptive attributes and their history for a specific Hub or Link. Columns: parent hash key, load timestamp, end timestamp (or hashdiff for change detection), attributes, record source. Satellites capture every change — each attribute change creates a new row. A single Hub can have multiple Satellites (e.g., sat_customer_personal, sat_customer_demographics) sourced from different systems.
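The hashdiff change-detection described above can be sketched in plain Python. NULL coalescing and the '||' delimiter mirror the SQL example later in this guide; the column set is illustrative:

```python
# Sketch: hashdiff = hash of all descriptive columns in fixed order,
# with NULLs coalesced to '' so the hash is deterministic.
import hashlib

def hashdiff(attrs: dict, columns: tuple[str, ...]) -> str:
    parts = ["" if attrs.get(c) is None else str(attrs.get(c)) for c in columns]
    return hashlib.sha256("||".join(parts).encode("utf-8")).hexdigest()

COLS = ("name", "email", "phone")
stored = hashdiff({"name": "Ada", "email": "ada@x.io", "phone": None}, COLS)
incoming = {"name": "Ada", "email": "ada@new.io", "phone": None}

# Compare one hash instead of every column individually.
needs_new_row = hashdiff(incoming, COLS) != stored
```

Comparing a single hash replaces a column-by-column comparison, which is the whole point for wide satellites.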

Why Data Vault?

Data Vault excels at: (1) Agility — adding a new source only requires new Hubs/Links/Satellites without altering existing structures. (2) Parallel loading — Hubs, Links, and Satellites load independently. (3) Full auditability — every record tracks its source and load time. (4) Handling multiple sources of truth — satellites from different systems coexist without conflict. The downside: querying Data Vault raw tables is complex (many joins), so a Kimball-style star schema is typically built on top as a presentation/business layer.

3. Modern Lakehouse Patterns

Medallion Architecture (Bronze → Silver → Gold)

The medallion architecture (popularized by Databricks) organizes lakehouse data into three quality tiers:

- Bronze: raw data exactly as ingested from source, stored as-is with ingestion metadata; no quality enforcement.
- Silver: cleaned, typed, deduplicated, and conformed data with quality checks enforced; the integration layer.
- Gold: business-level star schemas, aggregates, and KPI tables, curated for BI tools and ML consumers.

Key principle: each layer is a Delta/Iceberg table — fully queryable, version-controlled, and independently updatable. If silver logic changes, you replay from bronze without re-ingesting from source.

Kimball in the Lakehouse

Kimball dimensional modeling remains the gold standard for the gold layer. In a lakehouse, the star schema is implemented as Delta/Iceberg tables. SCD Type 2 uses Delta MERGE operations. Date dimensions are static reference tables. The change: in a lakehouse, you keep the raw and intermediate layers (bronze/silver) alongside the gold star schema — something a traditional Kimball-only warehouse didn't do. This gives you both analytical performance (star schema) and data lineage (medallion layers).

4. Open Table Formats — Delta Lake, Apache Iceberg & Apache Hudi

Open table formats bring ACID transactions, time travel, and schema evolution to files on data lakes:

- Delta Lake: a transaction log (_delta_log) alongside Parquet files; deepest integration with Spark and Databricks; MERGE, time travel, and change data feed.
- Apache Iceberg: an engine-neutral specification with hidden partitioning, partition evolution, and snapshot-based time travel; broad support across Spark, Trino, Flink, and Snowflake.
- Apache Hudi: built for upsert-heavy and incremental workloads; offers copy-on-write and merge-on-read table types.

Convergence: Delta's UniForm writes Iceberg-compatible metadata alongside Delta logs, enabling cross-engine reads. All three formats are converging toward similar feature sets. Choose based on your engine ecosystem.

5. Data Quality & Observability

Data quality is a first-class pillar of data engineering — not an afterthought. Key patterns:

- Quality gates between layers: validate each batch (nulls, uniqueness, ranges, formats) before promoting it from bronze to silver or silver to gold, and fail the pipeline on violations.
- Declarative checks in tooling: Great Expectations suites, dbt schema tests, or Delta Live Tables expectations, version-controlled alongside pipeline code.
- Observability: monitor freshness, volume, and schema drift so broken upstream data is caught before consumers see it.
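The quality-gate idea reduces to a small amount of logic: run declarative checks over a batch and block promotion on any failure. A framework-free sketch (check names and the row shape are assumptions; the Great Expectations example later in this guide shows the same pattern with a real framework):

```python
# Sketch: a quality gate that returns the list of violated checks.
def run_checks(rows: list[dict]) -> list[str]:
    failures = []
    ids = [r.get("order_id") for r in rows]
    if any(i is None for i in ids):
        failures.append("order_id contains NULLs")
    if len(ids) != len(set(ids)):
        failures.append("order_id is not unique")
    if any(r.get("amount", 0) < 0 for r in rows):
        failures.append("amount has negative values")
    return failures

batch = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": -5.0},   # violates the amount >= 0 check
]
failures = run_checks(batch)
if failures:
    # In a real pipeline, raising here stops promotion to the next layer.
    print("quality gate failed:", failures)
```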

6. Common Modeling Mistakes

The most frequent mistakes — skipping grain declaration, unconformed dimensions, querying the raw Data Vault directly, and defaulting to One Big Table — are covered in depth in the Gotchas & Anti-patterns section below.

7. Change Data Capture (CDC) Patterns

Change Data Capture is essential for keeping lakehouse tables in sync with operational databases without full table scans. Key patterns:

- Log-based CDC: read the database's transaction log (binlog, WAL) via tools like Debezium; captures inserts, updates, and deletes with near-zero source impact.
- Query-based CDC: poll with a watermark (WHERE updated_at > last_run); simple, but it misses deletes and intermediate changes and adds query load to the source.
- Apply pattern: land change events in Kafka, then MERGE them into lakehouse tables in micro-batches (see the Debezium code example below).
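The key limitation of query-based CDC can be shown in a few lines. This toy model (table shape and event format are illustrative, loosely echoing Debezium's op codes) contrasts polling a watermark against reading a change log:

```python
# Sketch: a watermark poll never sees a deleted row; a change log does.
source = {
    1: {"id": 1, "updated_at": 105},
    # id=2 was deleted at t=110 — it no longer exists to be queried
}
change_log = [
    {"op": "u", "id": 1, "ts": 105},
    {"op": "d", "id": 2, "ts": 110},
]

watermark = 100
polled = [row for row in source.values() if row["updated_at"] > watermark]
logged = [ev for ev in change_log if ev["ts"] > watermark]

# Query-based CDC sees only the update; the delete is silently lost.
assert [r["id"] for r in polled] == [1]
# Log-based CDC captures both the update and the delete.
assert [e["op"] for e in logged] == ["u", "d"]
```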

8. Semantic Layer & Metrics Layer

The semantic layer sits between raw warehouse tables and BI consumers, providing a business-friendly abstraction over physical data structures. Key concepts:

- Metrics are defined once, in version-controlled code (dbt Semantic Layer / MetricFlow, Cube.dev, LookML), not per-dashboard.
- Consumers (BI tools, notebooks, internal APIs) query the semantic layer rather than writing raw SQL against physical tables.
- Decoupling: physical schema changes don't break dashboards, and metric logic can be tested like any other code.
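The core idea — one canonical metric definition, many consumers — can be sketched without any framework. The registry shape and names below are assumptions for illustration, not a real semantic-layer API:

```python
# Sketch: a central metric registry; every consumer queries the same definition.
METRICS = {
    "revenue": {
        "agg": lambda rows: sum(r["amount"] for r in rows),
        "filter": lambda r: r["status"] == "completed",
    },
}

def query_metric(name: str, rows: list[dict]) -> float:
    m = METRICS[name]
    return m["agg"]([r for r in rows if m["filter"](r)])

orders = [
    {"amount": 100.0, "status": "completed"},
    {"amount": 40.0, "status": "cancelled"},
    {"amount": 60.0, "status": "completed"},
]
# Every dashboard calling query_metric("revenue", ...) gets the same number,
# instead of each re-implementing the filter and aggregation in its own SQL.
```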

↑ Back to top

Industry Use Cases

1. Retail Analytics — Kimball Star Schema

A mid-size retailer builds a star schema with fact_sales (grain: one row per POS transaction line item) surrounded by dim_product (SCD Type 2 for category changes), dim_store, dim_date, and dim_customer (SCD Type 2 for loyalty tier changes). Gold-layer tables in Delta Lake. Analysts write simple SQL: SELECT p.category, SUM(f.revenue) FROM fact_sales f JOIN dim_product p ON f.product_sk = p.product_sk GROUP BY 1. BI dashboards (Tableau, Power BI) connect directly to the star schema with predictable performance — no multi-hop joins.

2. Insurance Data Hub — Data Vault 2.0

An insurance company integrates data from 12 source systems (policy admin, claims, billing, CRM, external rating agencies). The raw vault models core entities as Hubs (hub_policy, hub_claimant, hub_provider), relationships as Links (link_policy_claimant, link_claim_provider), and source-specific attributes as Satellites. Parallel ELT loads from each source without contention. A business vault layer adds computed Satellites (e.g., sat_policy_risk_score). Presentation layer: Kimball star schemas for actuarial reporting and claims analytics.

3. Streaming Lakehouse — Medallion + Iceberg

A logistics company builds a lakehouse on AWS with Apache Iceberg tables. Bronze: Flink jobs consume GPS events from Kafka and append to bronze_vehicle_events (Iceberg, partitioned by hour(event_ts) with hidden partitioning). Silver: Spark Structured Streaming deduplicates, enriches with geofence data, and writes silver_vehicle_positions. Gold: hourly batch jobs build gold_delivery_metrics (avg delivery time by region, on-time rate). Trino connects to the same Iceberg tables for ad-hoc analyst queries — no data movement between engines.

4. Real-Time CDC Pipeline — Debezium + Lakehouse

A SaaS company needs near-real-time analytics on their PostgreSQL OLTP database without impacting production performance. Debezium reads the PostgreSQL WAL and publishes change events to Kafka topics (one per table). PySpark Structured Streaming reads these topics, parses the Debezium JSON envelope (extracting before/after images and op type), and applies changes to Delta Lake tables using foreachBatch with MERGE. Inserts and updates become UPSERT operations; deletes set a soft-delete flag. The silver layer resolves foreign keys and builds conformed entities. Gold-layer star schemas refresh every 15 minutes, giving analysts near-real-time dashboards with only a 15-minute lag from OLTP — down from the previous 24-hour batch ETL lag.

5. Metrics Standardization with Semantic Layer

A fintech company with 50+ dashboards discovers that "monthly active users (MAU)" is calculated differently across teams — engineering counts API callers, marketing counts login events, finance counts paying users. They implement a dbt Semantic Layer: a single YAML definition specifies metric: mau with the canonical calculation (unique users with at least one qualifying event per month). All BI tools (Tableau, Hex, internal APIs) query the semantic layer API instead of writing raw SQL. Within 3 months, metric discrepancies are eliminated, finance reporting time drops by 40%, and the "whose number is right" debates disappear.

↑ Back to top

Code Examples

1. Kimball SCD Type 2 with Delta Lake MERGE

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# SCD Type 2: dim_customer with effective_date / end_date / is_current
target = DeltaTable.forName(spark, "gold.dim_customer")

# Source: new/changed customer records from silver layer
source = spark.table("silver.customers_latest")

# Build a change-detection join key
# Match on business key AND only where attributes differ
merge_condition = """
    t.customer_bk = s.customer_bk
    AND t.is_current = true
"""

change_condition = """
    s.name != t.name
    OR s.email != t.email
    OR s.city != t.city
"""

# Step 1: Expire old rows (set end_date, is_current = false)
target.alias("t").merge(
    source.alias("s"), merge_condition
).whenMatchedUpdate(
    condition=change_condition,
    set={
        "end_date": F.current_date(),
        "is_current": F.lit(False),
    }
).execute()

# Step 2: Insert new current rows for changed + new records
current = spark.table("gold.dim_customer").where("is_current = true")

# Surrogate keys: offset by the current max so keys assigned in this load
# never collide with keys from earlier loads
max_sk = (
    spark.table("gold.dim_customer")
    .agg(F.coalesce(F.max("customer_sk"), F.lit(0)).alias("m"))
    .first()["m"]
)

new_rows = source.alias("s").join(
    current.alias("c"),
    (F.col("s.customer_bk") == F.col("c.customer_bk"))
    & F.col("s.name").eqNullSafe(F.col("c.name"))
    & F.col("s.email").eqNullSafe(F.col("c.email"))
    & F.col("s.city").eqNullSafe(F.col("c.city")),
    "left_anti"  # Rows that changed or are brand-new
).select(
    (F.monotonically_increasing_id() + max_sk + 1).alias("customer_sk"),
    F.col("s.customer_bk"),
    F.col("s.name"), F.col("s.email"), F.col("s.city"),
    F.current_date().alias("effective_date"),
    F.lit("9999-12-31").cast("date").alias("end_date"),
    F.lit(True).alias("is_current"),
)

new_rows.write.format("delta").mode("append").saveAsTable("gold.dim_customer")

2. Data Vault Hub + Satellite Loading (SQL)

-- Load hub_customer from staging
INSERT INTO raw_vault.hub_customer (hub_customer_hk, customer_bk, load_ts, record_source)
SELECT
    SHA2(UPPER(TRIM(stg.customer_id)), 256)  AS hub_customer_hk,
    UPPER(TRIM(stg.customer_id))                AS customer_bk,
    CURRENT_TIMESTAMP()                          AS load_ts,
    'crm_system'                                  AS record_source
FROM staging.stg_crm_customers stg
LEFT JOIN raw_vault.hub_customer hub
    ON hub.customer_bk = UPPER(TRIM(stg.customer_id))
WHERE hub.hub_customer_hk IS NULL;   -- Only insert new business keys

-- Load satellite with change detection via hashdiff
INSERT INTO raw_vault.sat_customer_personal
    (hub_customer_hk, load_ts, hashdiff, name, email, phone, record_source)
SELECT
    SHA2(UPPER(TRIM(stg.customer_id)), 256)    AS hub_customer_hk,
    CURRENT_TIMESTAMP()                            AS load_ts,
    SHA2(CONCAT_WS('||',
        COALESCE(stg.name, ''),
        COALESCE(stg.email, ''),
        COALESCE(stg.phone, '')
    ), 256)                                        AS hashdiff,
    stg.name, stg.email, stg.phone,
    'crm_system'                                    AS record_source
FROM staging.stg_crm_customers stg
INNER JOIN raw_vault.hub_customer hub
    ON hub.customer_bk = UPPER(TRIM(stg.customer_id))
LEFT JOIN (
    -- Get the latest satellite record per hub key
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY hub_customer_hk ORDER BY load_ts DESC
    ) AS rn
    FROM raw_vault.sat_customer_personal
) latest ON latest.hub_customer_hk = hub.hub_customer_hk AND latest.rn = 1
WHERE latest.hashdiff IS NULL  -- Brand-new
   OR latest.hashdiff != SHA2(CONCAT_WS('||',
        COALESCE(stg.name, ''),
        COALESCE(stg.email, ''),
        COALESCE(stg.phone, '')
   ), 256);  -- Attributes changed

3. Apache Iceberg — Hidden Partitioning & Time Travel

-- Create an Iceberg table with hidden (transform-based) partitioning
CREATE TABLE lakehouse.silver.events (
    event_id     STRING,
    user_id      STRING,
    event_type   STRING,
    payload      STRING,
    event_ts     TIMESTAMP
)
USING ICEBERG
PARTITIONED BY (days(event_ts), bucket(16, user_id));

-- Queries filter on event_ts directly; Iceberg prunes partitions
-- automatically — no need to reference partition columns
SELECT event_type, COUNT(*) AS cnt
FROM lakehouse.silver.events
WHERE event_ts BETWEEN '2024-06-01' AND '2024-06-30'
GROUP BY 1;

-- Time travel: query a specific snapshot
SELECT * FROM lakehouse.silver.events
    TIMESTAMP AS OF '2024-06-15T10:00:00';

-- Partition evolution: change strategy without rewriting data
ALTER TABLE lakehouse.silver.events
    REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts);

4. dbt Dimensional Model — Star Schema with Tests

-- models/gold/dim_product.sql
-- dbt model: conformed product dimension (SCD Type 1)

{{ config(
    materialized='table',
    tags=['gold', 'dimensional'],
) }}

WITH source AS (
    SELECT *
    FROM {{ ref('stg_products') }}
),

enriched AS (
    SELECT
        product_id,
        product_name,
        category,
        subcategory,
        brand,
        unit_cost,
        unit_price,
        unit_price - unit_cost           AS unit_margin,
        CASE WHEN is_active THEN 'Active'
             ELSE 'Discontinued'
        END                              AS product_status,
        CURRENT_TIMESTAMP()              AS _loaded_at
    FROM source
)

SELECT * FROM enriched

# models/gold/dim_product.yml — dbt schema tests (separate file)
version: 2
models:
  - name: dim_product
    description: "Conformed product dimension"
    columns:
      - name: product_id
        tests:
          - unique
          - not_null
      - name: category
        tests:
          - not_null
          - accepted_values:
              values: ['Electronics', 'Clothing', 'Home', 'Food', 'Sports']
      - name: unit_price
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

5. Great Expectations — Data Quality Checkpoint

import great_expectations as gx

# Initialize context and connect to Delta table
context = gx.get_context()

datasource = context.sources.add_spark("lakehouse")
data_asset = datasource.add_dataframe_asset(
    name="silver_orders",
    dataframe=spark.table("silver.orders"),
)
batch_request = data_asset.build_batch_request()

# Define expectations
suite = context.add_or_update_expectation_suite("silver_orders_suite")

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="silver_orders_suite",
)

# Column-level checks
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0)
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_match_regex(
    "email", regex=r"^[\w.+-]+@[\w-]+\.[\w.]+$"
)

# Table-level checks
validator.expect_table_row_count_to_be_between(
    min_value=1000, max_value=10_000_000
)

# Run checkpoint and fail pipeline on violations
checkpoint = context.add_or_update_checkpoint(
    name="silver_orders_checkpoint",
    validations=[{"batch_request": batch_request,
                  "expectation_suite_name": "silver_orders_suite"}],
)
result = checkpoint.run()

if not result.success:
    raise RuntimeError("Data quality check failed — see GX report")

6. Debezium CDC — Apply Changes to Delta Lake

from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

# Read Debezium CDC events from Kafka
cdc_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "dbserver.public.customers") \
    .option("startingOffsets", "earliest") \
    .load()

# Parse the Debezium JSON envelope
parsed = cdc_stream.select(
    F.from_json(
        F.col("value").cast("string"),
        "payload STRUCT<op:STRING, before:STRUCT<customer_id:STRING, name:STRING, email:STRING>, after:STRUCT<customer_id:STRING, name:STRING, email:STRING>, ts_ms:LONG>"
    ).alias("cdc")
).select(
    F.col("cdc.payload.op").alias("operation"),       # c=create, u=update, d=delete
    # For deletes the 'after' image is null, so fall back to 'before' for the key
    F.coalesce(
        F.col("cdc.payload.after.customer_id"),
        F.col("cdc.payload.before.customer_id"),
    ).alias("customer_id"),
    F.col("cdc.payload.after.name").alias("name"),
    F.col("cdc.payload.after.email").alias("email"),
    F.col("cdc.payload.ts_ms").alias("cdc_timestamp"),
)

# Apply CDC events to Delta Lake using foreachBatch + MERGE
def apply_cdc_to_delta(batch_df, batch_id):
    if batch_df.isEmpty():
        return

    # Deduplicate: keep last event per key within micro-batch
    deduped = batch_df.withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("customer_id").orderBy(F.desc("cdc_timestamp"))
        )
    ).where("rn = 1").drop("rn")

    target = DeltaTable.forName(spark, "silver.customers")

    target.alias("t").merge(
        deduped.alias("s"),
        "t.customer_id = s.customer_id"
    ).whenMatchedUpdate(
        condition="s.operation = 'u'",
        set={"name": "s.name", "email": "s.email",
             "_updated_at": F.current_timestamp()}
    ).whenMatchedDelete(
        condition="s.operation = 'd'"
    ).whenNotMatchedInsert(
        condition="s.operation != 'd'",
        values={"customer_id": "s.customer_id", "name": "s.name",
                "email": "s.email", "_updated_at": F.current_timestamp()}
    ).execute()

# Write stream with foreachBatch
from pyspark.sql import Window
query = parsed.writeStream \
    .foreachBatch(apply_cdc_to_delta) \
    .option("checkpointLocation", "s3a://checkpoints/cdc/customers/") \
    .trigger(availableNow=True) \
    .start()

query.awaitTermination()

7. dbt Semantic Layer — Metric Definition (YAML)

# models/metrics/revenue.yml — dbt Semantic Layer metric definition
semantic_models:
  - name: orders
    description: "Order fact table for revenue metrics"
    model: ref('fact_orders')
    defaults:
      agg_time_dimension: order_date

    entities:
      - name: order_id
        type: primary
      - name: customer_id
        type: foreign

    measures:
      - name: total_revenue
        expr: amount
        agg: sum
        description: "Sum of order amounts (completed orders only)"
      - name: order_count
        expr: "1"
        agg: sum
      - name: unique_customers
        expr: customer_id
        agg: count_distinct

    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
      - name: order_status
        type: categorical

metrics:
  - name: revenue
    type: simple
    label: "Revenue"
    description: "Total revenue from completed orders"
    type_params:
      measure: total_revenue
    filter: |
      {{ Dimension('order_id__order_status') }} = 'completed'

  - name: unique_customers
    type: simple
    label: "Unique Customers"
    type_params:
      measure: unique_customers

  - name: revenue_per_customer
    type: derived
    label: "Revenue per Customer"
    type_params:
      expr: revenue / unique_customers
      metrics:
        - name: revenue
        - name: unique_customers
↑ Back to top

Comparison / When to Use

| Aspect | Kimball Star Schema | Data Vault 2.0 | One Big Table (OBT) | 3NF / Normalized |
|---|---|---|---|---|
| Primary use | Analytical / BI consumption | Enterprise raw integration layer | Single-use-case BI explore | Operational / OLTP systems |
| Schema design | Denormalized star (fact + dims) | Hub / Link / Satellite | Fully flattened wide table | Fully normalized (3NF+) |
| Query complexity | Low (few joins) | High (many joins from Hubs → Links → Sats) | Minimal (no joins) | High (many joins) |
| Change handling | SCD Types 1-6 | Insert-only satellites (full history) | Rebuild/overwrite | In-place updates |
| Source agility | Moderate (schema changes require redesign) | High (add Hub/Link/Sat independently) | Low (must add column to wide table) | Moderate |
| Storage efficiency | Moderate (denormalized dims) | Higher (normalized, but many tables) | Low (massive duplication) | High (no redundancy) |
| BI tool compatibility | Excellent (designed for it) | Poor (needs presentation layer) | Good (single table) | Poor (too many joins) |
| Best for | Gold layer, dashboards, ad-hoc SQL | Silver/raw vault, regulated industries, multi-source integration | Single Looker explore, denormalized export | Source systems, OLTP databases |

In practice, these are complementary, not competing. A mature lakehouse often uses 3NF at the source, Data Vault in the silver layer for integration and auditability, and Kimball star schemas in the gold layer for consumption. OBT is a tactical choice for specific performance-critical views, not a general architecture.

↑ Back to top

Gotchas & Anti-patterns

  1. Skipping grain declaration. Starting a fact table design with "what columns do I need?" instead of "what does one row represent?" leads to mixed-grain tables. Always state the grain in words first: "one row per order line item per day." Every column must be true at that grain.
  2. Building star schemas without conformed dimensions. If the marketing team's dim_customer has different keys or attributes than the finance team's, cross-domain queries (e.g., "revenue by customer segment") require complex reconciliation. Define conformed dimensions centrally — even if each team adds their own attributes, the business key and core attributes must be consistent.
  3. Using Data Vault for the presentation layer. Data Vault's normalized structure requires 5-10 joins for a simple business query. It's designed as an integration/staging layer, not a query-facing layer. Always build a Kimball star schema or flat views on top for BI consumption.
  4. Choosing OBT to avoid modeling work. One Big Table avoids upfront design but creates downstream pain: (1) schema changes ripple to every consumer, (2) repeated dimension values waste storage (10x+ blowup), (3) updates require rewriting the entire table. Use OBT only when you have one specific consumer with known access patterns.
  5. Ignoring incremental processing. Rebuilding silver/gold tables from scratch on every run is safe but expensive. For large tables (100M+ rows), use Delta MERGE, Iceberg upsert, or dbt incremental models. Key: design your pipeline so you can always fall back to a full refresh if incremental logic drifts.
  6. No CDC strategy — relying solely on batch watermarks. Query-based CDC (polling with WHERE updated_at > watermark) misses deletes and cannot capture intra-batch changes. For accurate synchronization with OLTP sources, invest in log-based CDC (Debezium, AWS DMS). The operational overhead pays for itself in data correctness and reduced source system load.
  7. Metrics defined in BI tools instead of code. When "revenue" is a Tableau calculated field in 20 dashboards, changing the definition requires editing 20 workbooks. Define metrics in a version-controlled semantic layer (dbt metrics, Cube.dev) and have BI tools consume them as the single source of truth. This also enables automated testing of metric definitions.
↑ Back to top

Exercises

  1. Design a star schema for an e-commerce company. Given these source tables: orders, order_items, customers, products, categories, payments. Design: (1) fact table with grain declaration, (2) at least 4 dimension tables (including date), (3) SCD type decisions for each dimension (justify Type 1 vs. 2). Write the DDL in SQL. Then write 3 sample analytical queries that demonstrate why a star schema is simpler than querying the normalized source directly.
  2. Model a Data Vault for a healthcare system. Given sources: patient registration system, lab results system, appointment booking system. Design: (1) Hubs for core business concepts, (2) Links for relationships, (3) Satellites for descriptive attributes. Draw the entity diagram. Then write SQL to load one Hub and its Satellite from staging. Explain how you'd add a new source system (pharmacy) without modifying existing structures.
  3. Implement a medallion pipeline with data quality gates. Build a 3-layer pipeline (bronze → silver → gold) using PySpark + Delta Lake or dbt. Bronze: ingest raw CSV/JSON files. Silver: clean, deduplicate, validate. Gold: build a star schema fact table with at least 2 dimensions. Add data quality checks at the silver layer (use Great Expectations, dbt tests, or DLT expectations) that reject null primary keys and out-of-range values.
↑ Back to top

Quiz

Q1: Explain the difference between a transaction fact, a periodic snapshot fact, and an accumulating snapshot fact. Give an example of each.

Transaction fact: One row per discrete event. Example: fact_order_lines — one row per line item on a purchase order. Measures are fully additive (quantity, revenue). Periodic snapshot: One row per entity per time period. Example: fact_account_daily_balance — one row per bank account per day. Balances are semi-additive: you can sum across accounts but not across days (averaging is correct for time). Accumulating snapshot: One row per process instance, updated as milestones occur. Example: fact_insurance_claim with columns for filed_date, assessed_date, approved_date, paid_date. Multiple date foreign keys track the lifecycle. These are updated (not appended) as events occur.
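Semi-additivity in the periodic snapshot case is easy to demonstrate with toy balances (row shape is illustrative):

```python
# Sketch: daily account balances are additive across accounts for one day,
# but summing the same account across days is meaningless — average (or
# the last value) is the correct roll-up over time.
balances = [
    {"account": "A", "day": 1, "balance": 100},
    {"account": "B", "day": 1, "balance": 50},
    {"account": "A", "day": 2, "balance": 100},
    {"account": "B", "day": 2, "balance": 50},
]

total_day1 = sum(r["balance"] for r in balances if r["day"] == 1)           # valid: 150
wrong_total_A = sum(r["balance"] for r in balances if r["account"] == "A")  # 200 — not a real balance
days_A = [r["balance"] for r in balances if r["account"] == "A"]
avg_A = sum(days_A) / len(days_A)                                           # valid roll-up: 100.0
```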

Q2: What is Data Vault's "hashdiff" pattern and why is it used in satellite loading?

A hashdiff is a hash (SHA-256) of all descriptive columns in a satellite record. During loading, you compare the hashdiff of the incoming record against the hashdiff of the latest satellite record for the same Hub key. If the hashes differ, a change occurred, and a new satellite row is inserted. If the hashes match, the record is skipped (no change). This avoids comparing each column individually, which is verbose and error-prone for wide tables. The hashdiff must be computed deterministically — NULLs should be coalesced to a consistent placeholder, and column order must be fixed.

Q3: What is Iceberg's "hidden partitioning" and why is it better than Hive-style partitioning?

In Hive-style partitioning, the user must specify partition columns explicitly in queries (WHERE year=2024 AND month=6) and the physical directory layout is visible. Iceberg's hidden partitioning applies partition transforms (like days(event_ts), months(event_ts), bucket(16, user_id)) as metadata. Queries filter on the original column (WHERE event_ts BETWEEN ...), and Iceberg's planner automatically prunes partitions. Benefits: (1) Users don't need to know the partition scheme. (2) Partition evolution — you can change the partition strategy (e.g., from days() to hours()) without rewriting data; old data keeps its old layout, new data uses the new layout. (3) No partition column in the data itself — avoids redundant columns.
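The days() transform can be illustrated in plain Python: the partition value is days since the Unix epoch, derived from the timestamp, so a predicate on event_ts maps directly to a partition range. (This only mirrors the idea of Iceberg's days() transform; the authoritative definitions live in the Iceberg table spec.)

```python
# Sketch: days() partition transform — predicate on the timestamp column
# becomes a range of partition values the planner can prune against.
from datetime import datetime, timezone

def days_transform(ts: datetime) -> int:
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

lo = days_transform(datetime(2024, 6, 1, tzinfo=timezone.utc))
hi = days_transform(datetime(2024, 6, 30, tzinfo=timezone.utc))
# Any data file whose partition value falls outside [lo, hi] is skipped
# without being read — and the user never references a partition column.
```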

Q4: Why does Kimball recommend star schemas over snowflake schemas for analytical workloads?

Star schemas denormalize dimensions into wide single tables, so a fact-to-dimension query requires exactly one join per dimension. Snowflake schemas normalize dimensions into chains of sub-tables (dim_product → dim_subcategory → dim_category → dim_department), adding 2-3 joins per dimension. Drawbacks of snowflake: (1) More complex queries — analysts must understand the normalized structure. (2) BI tools work best with star schemas (automatic join path detection). (3) In modern columnar engines (Redshift, BigQuery, Databricks SQL), the storage savings from normalization are negligible because columnar compression already handles repeated values efficiently. (4) Join overhead grows with snowflake depth. The only case for snowflaking is when a dimension has a very large, rarely-queried sub-entity that would bloat the main dimension significantly.

Q5: Describe the medallion architecture. How do bronze, silver, and gold layers differ in schema strategy, data quality, and intended consumers?

Bronze (raw): Exact copy of source data. Schema-on-read — store as-is (JSON, CSV, Avro). Minimal transformation: add ingestion metadata. Consumer: data engineers debugging pipelines. Quality: none enforced (garbage in, garbage stored). Silver (cleaned): Schema-on-write. Data is typed, deduplicated, validated, and conformed. Business keys are resolved. This is where you apply Data Vault or 3NF modeling for integration. Consumer: data engineers and analysts for exploration. Quality: enforced (null checks, uniqueness, referential integrity). Gold (business): Star schemas, aggregates, KPI tables, ML feature tables. Highly denormalized for fast queries. Consumer: BI tools, dashboards, executives, ML models. Quality: highest — data is curated, tested, and SLA-backed. Each layer is independently queryable and version-controlled (Delta/Iceberg), so you can always replay from bronze if silver/gold logic changes.

Q6: What is log-based CDC and why is it preferred over query-based CDC?

Log-based CDC reads the database's transaction log (binlog, WAL, redo log) to capture every change: inserts, updates, and deletes. Tools like Debezium emit each change as an event with before/after images. Advantages: (1) Captures deletes — query-based CDC misses them because the row no longer exists to query. (2) Zero source impact — reading the log doesn't add load to the database. (3) Captures all intermediate states — if a row is updated 5 times between polls, log-based CDC captures all 5 changes. (4) Low latency — changes are available in seconds, not on a polling schedule. Query-based CDC (WHERE updated_at > watermark) is simpler but misses deletes, misses intermediate changes, requires a reliable watermark column, and adds query load to the source. Use log-based CDC for any production pipeline where data accuracy matters.

Q7: What is a semantic layer and why do modern data teams implement one?

A semantic layer is an abstraction between the physical warehouse tables and BI consumers that provides centralized metric definitions. Instead of each dashboard calculating "revenue" differently (gross vs. net vs. with-refunds), metrics are defined once in code (e.g., dbt Semantic Layer YAML) and served via API to all consumers (Tableau, Looker, notebooks). Benefits: (1) Metric consistency — every team sees the same numbers. (2) Version control — metric definitions are in Git with review processes. (3) Testability — metric logic can be unit-tested. (4) Decoupling — physical schema changes don't break dashboards. The dbt Semantic Layer (MetricFlow), Cube.dev, and Looker's LookML are the main implementations. For interviews, articulating this pattern demonstrates understanding of organizational data challenges beyond just pipelines.

↑ Back to top

Further Reading

↑ Back to top