dbt is the standard transformation layer for modern ELT pipelines. It lets data engineers and analytics engineers write SELECT statements as models, while dbt handles dependency resolution, DDL generation, testing, documentation, and deployment. dbt compiles SQL + Jinja templates into warehouse-native SQL and executes it against your data platform (Snowflake, BigQuery, Databricks, Redshift, PostgreSQL, and more), bringing software engineering practices — version control, code review, CI/CD, testing — to SQL transformations. This guide covers dbt Core 1.8+ / dbt Cloud features including model contracts, the semantic layer (MetricFlow), and unit testing. It assumes you already know SQL and basic data warehousing concepts.
A dbt project is a directory with a standard layout:
my_dbt_project/
├── dbt_project.yml # Project config: name, version, paths, vars
├── profiles.yml # Connection targets (dev, staging, prod); often kept in ~/.dbt/ instead of the repo
├── models/
│ ├── staging/ # 1:1 mappings from raw sources
│ │ ├── _stg_models.yml # Source definitions + tests
│ │ ├── stg_orders.sql
│ │ └── stg_customers.sql
│ ├── intermediate/ # Business logic building blocks
│ │ └── int_order_items_enriched.sql
│ └── marts/ # Final business-facing models
│ ├── finance/
│ │ └── fct_revenue.sql
│ └── marketing/
│ └── dim_customers.sql
├── macros/ # Reusable Jinja macros
├── seeds/ # Static CSV lookup tables
├── snapshots/ # SCD Type 2 captures
└── tests/ # Custom singular tests
The cornerstone is {{ ref('model_name') }} — a function that resolves to the correct schema.table for the current environment and declares a dependency. dbt builds a DAG from all ref() calls and executes models in topological order. {{ source('source_name', 'table_name') }} similarly references raw tables declared in YAML, enabling freshness checks and lineage tracking from source to mart.
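For intuition, here's a minimal sketch of how a ref() call compiles — the database, schema, and model names below are illustrative, not fixed by dbt:

-- models/marts/fct_orders.sql (sketch)
SELECT order_id, customer_id, amount
FROM {{ ref('stg_orders') }}

-- Compiled against a hypothetical dev target, this becomes roughly:
-- SELECT order_id, customer_id, amount
-- FROM analytics.dbt_jane.stg_orders
-- (database and schema come from the active target in profiles.yml)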
Each layer has a conventional materialization and purpose:
- Staging — view (cheap, always up-to-date). Purpose: isolate raw source quirks so downstream models don't deal with them.
- Intermediate — ephemeral (CTEs injected into downstream queries) or view. Purpose: DRY — avoid repeating complex joins in multiple marts.
- Marts — table or incremental. Purpose: the "gold" layer that BI tools, analysts, and applications consume.

A materialization defines how dbt persists a model in the warehouse:
| Materialization | DDL Generated | When to Use | Trade-offs |
|---|---|---|---|
| view | CREATE VIEW AS | Staging models, lightweight transforms, always-fresh needs | No storage cost; query-time compute; cascading performance impact if slow |
| table | CREATE TABLE AS (full refresh each run) | Marts, gold tables, moderate-size datasets | Fast reads; full rebuild each run; storage cost |
| incremental | MERGE / INSERT / DELETE+INSERT | Large fact tables, append-heavy event data | Fast runs; complex logic; requires careful is_incremental() handling |
| ephemeral | CTE injected into downstream SQL | Helper models used only once, avoiding warehouse objects | No table created; can't query directly; harder to debug |
| snapshot | SCD Type 2 table with dbt_valid_from, dbt_valid_to | Tracking historical changes to dimensions | Automatic SCD; requires a reliable updated_at or check strategy |
Custom materializations can be defined as macros — some teams create insert_by_period or microbatch materializations for specialized patterns.
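A heavily simplified sketch of the shape of a custom materialization (the name my_table_v2 is illustrative; real materializations also run hooks, manage transactions, and handle pre-existing relations):

-- macros/my_table_v2.sql
{% materialization my_table_v2, default %}
  {# Build the target relation from the current model's location #}
  {%- set target_relation = this.incorporate(type='table') -%}
  {# 'main' is the statement dbt reports timing/status for #}
  {% call statement('main') -%}
    {{ create_table_as(False, target_relation, sql) }}
  {%- endcall %}
  {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}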
Incremental models are dbt's answer to "don't rebuild the entire table every run." The pattern:
-- models/marts/fct_events.sql
{{ config(
materialized='incremental',
unique_key='event_id',
incremental_strategy='merge', -- or 'delete+insert', 'append', 'microbatch'
on_schema_change='sync_all_columns',
) }}
SELECT
event_id,
user_id,
event_type,
event_timestamp,
payload
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
-- Only process new rows since the last run
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
Key incremental strategies:
- merge — generates a MERGE statement; upserts on unique_key. Best for: dimension tables, tables with updates. Supported on Snowflake, BigQuery, Databricks, Redshift.
- delete+insert — deletes rows matching the incoming keys or partition, then inserts the new batch; avoids MERGE overhead on some warehouses. Best for: partitioned fact tables where you overwrite entire time ranges (sketch below).
- append — insert-only; best for immutable event logs where duplicates are impossible.
- microbatch (dbt 1.9+) — processes time-based batches with automatic batch-level retries.

Full refresh escape hatch: Run dbt run --full-refresh to rebuild incremental models from scratch. This is your safety net when incremental logic drifts or schemas change.
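For contrast with the merge example above, a delete+insert sketch that overwrites whole date partitions (model and column names are assumptions; DATEADD is Snowflake syntax):

-- models/marts/fct_daily_page_views.sql (sketch)
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='event_date'  -- target rows with matching dates are deleted, then reinserted
) }}
SELECT
    event_date,
    user_id,
    COUNT(*) AS page_views
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
-- Reprocess a 3-day window to absorb late-arriving events
WHERE event_date >= DATEADD(day, -3, CURRENT_DATE)
{% endif %}
GROUP BY event_date, user_id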
dbt's testing framework runs assertions against your data after (or during) model builds:
Generic tests (built-in): unique, not_null, accepted_values, relationships (referential integrity). Applied per-column in schema YAML files:
# models/marts/_schema.yml
models:
- name: fct_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: order_status
tests:
- accepted_values:
values: ['pending', 'shipped', 'delivered', 'cancelled']
Singular tests: custom SQL queries in tests/ that return failing rows. If the query returns any rows, the test fails:
-- tests/assert_no_orphaned_orders.sql
-- Fail if any order references a non-existent customer
SELECT o.order_id, o.customer_id
FROM {{ ref('fct_orders') }} o
LEFT JOIN {{ ref('dim_customers') }} c
ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
Unit tests (dbt 1.8+): test transformation logic with mocked inputs — no warehouse data needed:
# models/marts/_schema.yml
unit_tests:
- name: test_revenue_calculation
model: fct_orders
given:
- input: ref('stg_order_items')
rows:
- {order_id: 1, quantity: 3, unit_price: 10.00, discount: 0.10}
- {order_id: 2, quantity: 1, unit_price: 50.00, discount: 0.00}
expect:
rows:
- {order_id: 1, total_revenue: 27.00}
- {order_id: 2, total_revenue: 50.00}
dbt models are SQL files processed through the Jinja2 template engine before compilation. This gives you variables, loops, conditionals, and reusable macros — making SQL DRY and environment-aware.
Jinja has three delimiters:
- {{ }} — expression (outputs a value): {{ ref('orders') }}, {{ var('start_date') }}
- {% %} — statement (control flow): {% if is_incremental() %}, {% for col in columns %}
- {# #} — comment (not compiled)

Macros are reusable Jinja functions stored in macros/:
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
(CAST({{ column_name }} AS DECIMAL(12,2)) / 100)
{% endmacro %}
-- Usage in a model:
SELECT
order_id,
{{ cents_to_dollars('amount_cents') }} AS amount_dollars
FROM {{ ref('stg_payments') }}
Power patterns: macros for generate_schema_name (custom schema routing), grant_select (post-hook permissions), and dynamic column generation via {% for %} loops over schema metadata.
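For example, the commonly overridden generate_schema_name macro — this follows the pattern from dbt's docs; the prod target name is an assumption:

-- macros/generate_schema_name.sql
{# In prod, write to the model's declared custom schema; everywhere else,
   send all models to the target's schema so dev/CI builds stay sandboxed #}
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if target.name == 'prod' and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ target.schema }}
    {%- endif -%}
{%- endmacro %}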
dbt packages are reusable modules installed via packages.yml and dbt deps:
- dbt_utils: generate_surrogate_key(), pivot(), unpivot(), union_relations(), date_spine(), get_column_values(), generic tests like expression_is_true.
- dbt_expectations: expect_column_values_to_be_between, expect_table_row_count_to_be_between, expect_column_distinct_count_to_equal. Comprehensive data quality without leaving dbt.
- elementary: data observability — monitors test results, source freshness, and volume anomalies across projects.

# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: [">=1.1.0", "<2.0.0"]
- package: calogica/dbt_expectations
version: [">=0.10.0", "<0.11.0"]
- package: elementary-data/elementary
version: [">=0.13.0", "<0.14.0"]
The dbt Semantic Layer (powered by MetricFlow) lets you define metrics as code alongside your models. Instead of every BI tool calculating "revenue" differently, metrics are defined once in YAML and served via API.
The building blocks:
- Entities — join keys connecting semantic models; types: primary, foreign, unique, natural.
- Measures — aggregations over columns: sum, count, count_distinct, average, min, max.
- Dimensions — attributes for slicing metrics, typed categorical or time.
- Metrics — simple (one measure), derived (expression over other metrics), cumulative (running totals), conversion (funnel rates).

Consumers query the semantic layer via the dbt Cloud Semantic Layer API, which generates optimized SQL for any target warehouse. Integrations: Tableau, Hex, Mode, Google Sheets, and any tool supporting the JDBC/ADBC interface.
dbt's governance features (introduced in 1.5+) bring API-style contracts to data models:
- Model access: controls who can ref() a model: public (anyone), protected (same project/group), private (same group). Prevents unintended dependencies.
- Groups: assign models to an owning team; paired with access levels, a marketing model can't ref() an engineering-internal model.
- Contracts and versions: enforce a declared output schema and evolve it deliberately, as in the example below.

# models/marts/finance/_schema.yml
models:
- name: fct_revenue
access: public # Any project can ref() this
group: finance_team
latest_version: 2
config:
contract:
enforced: true # Build fails if SQL doesn't match
columns:
- name: revenue_id
data_type: string
constraints:
- type: not_null
- type: primary_key
- name: amount
data_type: numeric
constraints:
- type: not_null
- name: order_date
data_type: date
versions:
- v: 1
defined_in: fct_revenue_v1
- v: 2
defined_in: fct_revenue_v2
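Downstream consumers can then pin a version explicitly (a short sketch):

-- In a consuming model:
SELECT * FROM {{ ref('fct_revenue', v=1) }}  -- stay on v1 during migration
-- A plain {{ ref('fct_revenue') }} resolves to latest_version (v2)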
A DTC (direct-to-consumer) brand builds its entire analytics warehouse in dbt on Snowflake. Staging models map 12 source tables (Shopify, Stripe, Klaviyo, Google Analytics) via Fivetran ingestion. Intermediate models resolve customer identity across systems (email matching + device fingerprinting). Marts deliver: fct_orders (grain: order line item), dim_customers (SCD Type 2 via dbt snapshots tracking loyalty tier changes), dim_products, dim_date. dbt tests enforce referential integrity between facts and dimensions. The semantic layer defines "revenue," "AOV," and "customer LTV" metrics used by both Looker dashboards and the growth team's Python notebooks.
A bank uses dbt on Databricks SQL to produce Basel III regulatory reports. Staging models normalize data from 6 core banking systems. Intermediate models calculate risk-weighted assets using complex business rules encoded in Jinja macros (parameterized by regulation version). Marts produce the exact table structures required by regulators. Model contracts enforce that output schemas exactly match the regulatory specification — any column drift fails the CI build. Unit tests validate calculation logic against known regulatory test cases. The entire pipeline runs in CI on every pull request, and auditors can review dbt's auto-generated documentation to understand data lineage from source to report.
A B2B SaaS company with 200+ employees builds a self-serve analytics platform on BigQuery + dbt Cloud. Product, marketing, and finance teams each own a dbt "group" with specific models. The semantic layer exposes 50+ metrics (MRR, churn rate, feature adoption, NPS) via the dbt Cloud API. Tableau dashboards, Hex notebooks, and a customer-facing analytics embed all query the same metric definitions. dbt Cloud CI runs on every PR: builds modified models in a temporary schema, runs tests, and posts a comment on the PR with row count diffs. Deploy-on-merge promotes changes to production automatically.
A large enterprise (10,000+ employees) implements a data mesh pattern using multiple dbt projects. Each domain team (orders, inventory, customers) owns a dbt project that publishes public models as data products. Cross-project references use {{ ref('project_name', 'model_name') }}. Model contracts ensure domain interfaces are stable — an upstream team can't accidentally break a downstream consumer. A central data platform team maintains shared macros (via a private dbt package) for standard patterns: SCD Type 2 handling, PII hashing, audit columns. Elementary monitors all projects centrally, alerting on test failures, freshness violations, and volume anomalies.
-- models/staging/stripe/stg_stripe_payments.sql
-- One staging model per source table. Light cleaning only.
WITH source AS (
SELECT * FROM {{ source('stripe', 'payments') }}
),
renamed AS (
SELECT
-- Primary key
id AS payment_id,
-- Foreign keys
order_id,
customer_id,
-- Measures (convert cents → dollars)
CAST(amount AS DECIMAL(12,2)) / 100 AS amount_dollars,
-- Attributes
LOWER(payment_method) AS payment_method,
status AS payment_status,
-- Timestamps
CAST(created_at AS TIMESTAMP) AS created_at,
CAST(updated_at AS TIMESTAMP) AS updated_at
FROM source
WHERE _fivetran_deleted = FALSE -- Exclude soft-deleted rows
)
SELECT * FROM renamed
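The source('stripe', 'payments') call above assumes a YAML source definition along these lines (database/schema names and freshness thresholds are illustrative):

# models/staging/stripe/_stripe_sources.yml
sources:
  - name: stripe
    database: raw
    schema: stripe
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payments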
-- models/marts/fct_page_views.sql
-- Incremental model: processes only new events each run
{{ config(
materialized='incremental',
unique_key='page_view_id',
incremental_strategy='merge',
cluster_by=['event_date'], -- Databricks: liquid clustering
on_schema_change='sync_all_columns',
tags=['daily', 'product'],
) }}
WITH events AS (
SELECT
event_id AS page_view_id,
user_id,
page_url,
referrer_url,
session_id,
CAST(event_timestamp AS TIMESTAMP) AS viewed_at,
CAST(event_timestamp AS DATE) AS event_date,
-- Extract UTM parameters from URL
{{ dbt_utils.get_url_parameter('page_url', 'utm_source') }} AS utm_source,
{{ dbt_utils.get_url_parameter('page_url', 'utm_campaign') }} AS utm_campaign,
-- Duration on page (seconds)
COALESCE(duration_ms / 1000, 0) AS duration_seconds
FROM {{ ref('stg_segment_page_views') }}
{% if is_incremental() %}
-- Process only events newer than the latest in our table
WHERE event_timestamp > (
SELECT MAX(viewed_at) FROM {{ this }}
)
{% endif %}
)
SELECT * FROM events
-- snapshots/snap_customers.sql
-- Automatically tracks changes to customer attributes over time
{% snapshot snap_customers %}
{{ config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp', -- or 'check' for hash-based detection
updated_at='updated_at',
invalidate_hard_deletes=True, -- Track deletions too
) }}
SELECT
customer_id,
name,
email,
city,
loyalty_tier,
updated_at
FROM {{ source('app_db', 'customers') }}
{% endsnapshot %}
{# Output table has columns:
customer_id, name, email, city, loyalty_tier, updated_at,
dbt_valid_from, dbt_valid_to, dbt_scd_id, dbt_updated_at
Current rows: dbt_valid_to IS NULL
Historical rows: dbt_valid_to IS NOT NULL
#}
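A point-in-time query against the snapshot might look like this (the date is illustrative):

-- Loyalty tier of each customer as of 2024-01-01
SELECT customer_id, loyalty_tier
FROM {{ ref('snap_customers') }}
WHERE dbt_valid_from <= '2024-01-01'
  AND (dbt_valid_to > '2024-01-01' OR dbt_valid_to IS NULL)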
-- macros/pivot_metrics.sql
-- Generic pivot macro: turns rows into columns dynamically
{% macro pivot_metrics(ref_model, key_col, pivot_col, value_col, agg_func='sum') %}
{% set pivot_values = dbt_utils.get_column_values(
table=ref(ref_model),
column=pivot_col
) %}
SELECT
{{ key_col }},
{% for val in pivot_values %}
{{ agg_func }}(
CASE WHEN {{ pivot_col }} = '{{ val }}'
THEN {{ value_col }}
END
) AS {{ val | lower | replace(' ', '_') }}_{{ agg_func }}
{%- if not loop.last %},{% endif %}
{% endfor %}
FROM {{ ref(ref_model) }}
GROUP BY {{ key_col }}
{% endmacro %}
-- Usage: models/marts/fct_sales_by_category.sql
-- {{ pivot_metrics('int_sales_long', 'sale_date', 'category', 'revenue') }}
-- Compiles to: SELECT sale_date,
-- SUM(CASE WHEN category = 'Electronics' THEN revenue END) AS electronics_sum,
-- SUM(CASE WHEN category = 'Clothing' THEN revenue END) AS clothing_sum, ...
# .github/workflows/dbt-ci.yml
# Runs dbt build on every PR — tests modified models in a temp schema
name: dbt CI
on:
pull_request:
branches: [main]
paths: ['models/**', 'macros/**', 'tests/**', 'snapshots/**']
jobs:
dbt-build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dbt
run: pip install dbt-snowflake==1.8.*
- name: Run dbt deps
run: dbt deps
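      # Assumes ./prod-manifest/ was populated by an earlier step (e.g., downloaded
      # from object storage or a previous run's artifact) so --state can diff vs. prod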
- name: dbt build (modified models + downstream)
run: |
dbt build \
--select state:modified+ \
--defer \
--state ./prod-manifest/ \
--target ci \
--profiles-dir .
env:
DBT_SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
DBT_SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
DBT_SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
DBT_CI_SCHEMA: PR_${{ github.event.number }}
      - name: Report run results
        if: always()
        # Placeholder: parse target/run_results.json and post a PR comment
        # with your preferred comment action or a small script
        run: cat target/run_results.json
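The ci target used above would live in a committed profiles.yml that reads all secrets from environment variables (database, warehouse, and role names are assumptions; env var names match the workflow):

# profiles.yml (ci profile; no secrets in the file itself)
my_dbt_project:
  target: ci
  outputs:
    ci:
      type: snowflake
      account: "{{ env_var('DBT_SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('DBT_SNOWFLAKE_USER') }}"
      password: "{{ env_var('DBT_SNOWFLAKE_PASSWORD') }}"
      database: ANALYTICS_CI
      schema: "{{ env_var('DBT_CI_SCHEMA') }}"
      warehouse: TRANSFORM_WH
      role: TRANSFORMER
      threads: 8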
# models/semantic/_semantic_models.yml
# MetricFlow semantic model: defines measures, dimensions, entities
semantic_models:
- name: orders
description: "Order fact table"
model: ref('fct_orders')
defaults:
agg_time_dimension: ordered_at
entities:
- name: order_id
type: primary
- name: customer_id
type: foreign
measures:
- name: order_total
expr: amount
agg: sum
- name: order_count
expr: "1"
agg: sum
- name: unique_customers
expr: customer_id
agg: count_distinct
dimensions:
- name: ordered_at
type: time
type_params:
time_granularity: day
- name: order_status
type: categorical
- name: country
type: categorical
metrics:
- name: revenue
type: simple
label: "Total Revenue"
description: "Gross revenue from all completed orders"
type_params:
measure: order_total
filter: |
{{ Dimension('order_status') }} = 'completed'
- name: average_order_value
type: derived
label: "Average Order Value"
type_params:
expr: revenue / order_count
metrics:
- name: revenue
- name: order_count
filter: |
{{ Dimension('order_status') }} = 'completed'
- name: cumulative_revenue
type: cumulative
label: "Cumulative Revenue"
type_params:
measure: order_total
window: 365 days # Rolling 365-day sum (window and grain_to_date are mutually exclusive; use grain_to_date: month instead for month-to-date)
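With these definitions in place, you can sanity-check metrics locally with the MetricFlow CLI (command shapes per MetricFlow's docs; output will vary by warehouse):

# Validate semantic configs, then query revenue by month
mf validate-configs
mf query --metrics revenue --group-by metric_time__month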
| Feature | dbt (SQL + Jinja) | PySpark / Spark SQL | Stored Procedures | Dataform (Google) |
|---|---|---|---|---|
| Language | SQL + Jinja + YAML | Python + SQL | SQL (vendor-specific) | SQLX (SQL + JS) |
| Execution | Pushdown to warehouse | Distributed Spark engine | In-database | Pushdown to BigQuery |
| Testing | Built-in (generic, singular, unit) | Manual (Great Expectations, custom) | Manual scripts | Assertions (basic) |
| Version control | Native (SQL files in Git) | Native (Python in Git) | Difficult (state in DB) | Native (Git-integrated) |
| Lineage | Automatic DAG from ref() | Manual or via OpenLineage | None | Automatic DAG from ref() |
| Documentation | Auto-generated from YAML | Manual | Manual | Auto-generated |
| Semantic layer | MetricFlow (native) | None | None | None |
| Scale | Depends on warehouse | TB-PB (distributed) | Depends on DB engine | BigQuery scale |
| Best for | SQL-centric warehouse transforms | Complex Python logic, large-scale ETL | Legacy systems, simple transforms | BigQuery-native teams |
Choose dbt when: Your transformations are expressible in SQL and you want testing, documentation, lineage, and CI/CD out of the box. Choose PySpark when: You need Python logic (ML features, complex algorithms), multi-TB distributed joins, or streaming. Choose stored procedures when: You have simple transforms in a legacy database and no appetite for modern tooling (but plan to migrate). Choose Dataform when: You're a BigQuery-only shop and want Google-integrated transformation. Many teams use dbt + PySpark together — dbt for SQL transforms in the warehouse, PySpark for heavy pre-processing in the lakehouse.
Best practices:
- Every model should have at least unique + not_null on its primary key, and relationships tests on foreign keys. Add tests alongside model development, not after.
- Start with the table materialization. Switch to incremental only when full refreshes become too slow or expensive. Always ensure dbt run --full-refresh produces identical results as incremental runs.
- Govern cross-team dependencies with groups and access controls (dbt 1.5+).
- Use --select and state:modified in CI. Running dbt build (all models) on every pull request is wasteful and slow. Use dbt build --select state:modified+ with --defer to only build changed models and their downstream dependencies, reading upstream from the production manifest. This cuts CI time from 30+ minutes to 2-5 minutes.
- Consider materialized_view on warehouses that support it (Snowflake dynamic tables, BigQuery materialized views, Redshift MV). For slowly-changing intermediate models that are queried frequently, materialized views give you table-like performance with automatic incremental refresh — no is_incremental() logic needed. Check if your warehouse supports it before writing complex incremental models.
- Don't put profiles.yml in Git. profiles.yml contains database credentials and should never be committed to version control. Use environment variables ({{ env_var('DBT_SNOWFLAKE_PASSWORD') }}) in profiles, and store secrets in your CI/CD system or a secret manager. dbt Cloud handles this automatically.

Exercises:
1. Scaffold a project end to end: define sources, staging models, and marts with tests; run dbt build and verify all tests pass; then generate docs with dbt docs generate and explore the DAG.
2. Build an incremental model that handles late-arriving data: (1) a merge strategy on a unique key, (2) a lookback window (e.g., WHERE event_date >= DATEADD(day, -3, current_date) in the incremental block) to catch late arrivals, (3) on_schema_change='sync_all_columns'. Manually insert a "late" record into the source and verify the incremental model picks it up. Then run dbt run --full-refresh and confirm results match.
3. Simulate slim CI: (1) Capture production state (dbt compile --target prod → save target/manifest.json). (2) Modify one staging model. (3) Run dbt build --select state:modified+ --defer --state ./prod-manifest/. Verify that only the modified model and its downstream dependents are built, while unrelated models use the production manifest. Measure the time difference vs. running dbt build on all models.

Q: Explain the main dbt materializations and when to use each.
A: (1) view — creates a SQL view; no data stored. Use for staging models, lightweight transforms, and when you always want fresh results. Trade-off: query-time compute. (2) table — creates a full table, rebuilt from scratch on each run. Use for mart models of moderate size. Trade-off: storage cost, but fast reads. (3) incremental — inserts/updates only new or changed rows using MERGE, DELETE+INSERT, or APPEND. Use for large fact tables where full rebuild is too slow. Trade-off: complex logic, must handle late-arriving data. (4) ephemeral — compiled as a CTE injected into downstream queries; no database object created. Use for intermediate helper logic that's referenced once. Trade-off: can't query directly, harder to debug. Additionally, snapshot is for SCD Type 2 tracking and materialized_view (1.6+) leverages warehouse-native incremental refresh.
Q: What's the difference between ref() and source() in dbt?
A: ref('model_name') references another dbt model. It resolves to the correct schema.table for the current target environment, and it registers a dependency in the DAG — dbt knows to build the referenced model first. source('source_name', 'table_name') references a raw table declared in a YAML source definition. It does NOT create a dependency on a dbt model — instead, it tracks the external table for freshness checks (dbt source freshness) and documentation. Sources are the entry points of the DAG; refs connect dbt models within the DAG. Best practice: staging models use source(), all other models use ref(). Never reference raw tables directly with hardcoded schema.table names — always go through source().
Q: How do dbt snapshots implement SCD Type 2?
A: They work by: (1) Taking a snapshot of a source query on each run. (2) Comparing current source rows against existing snapshot rows using a unique_key. (3) If a row has changed (detected by strategy: timestamp using updated_at, or strategy: check which hashes specified columns), the old row gets its dbt_valid_to set to the current timestamp, and a new row is inserted with dbt_valid_from = current_timestamp and dbt_valid_to = NULL. (4) New rows are inserted with dbt_valid_to = NULL. (5) With invalidate_hard_deletes=True, rows missing from the source get their dbt_valid_to set. Current records always have dbt_valid_to IS NULL. The timestamp strategy is preferred when a reliable updated_at column exists; the check strategy works when no such column exists but is more expensive (it compares all specified columns).
Q: What is state:modified in dbt and how does it enable slim CI builds?
A: state:modified is a node selector that identifies dbt resources (models, tests, snapshots) that have changed since a reference state (usually the production manifest). In CI, you save the production manifest.json, then run dbt build --select state:modified+ --defer --state ./prod-manifest/. The + suffix means "and all downstream dependents." The --defer flag tells dbt to read unmodified upstream models from the production database instead of rebuilding them. This means a PR that changes one staging model rebuilds only that model and its downstream chain — not the entire project. Result: CI runs drop from 30+ minutes to 2-5 minutes, and warehouse costs for CI drop proportionally. This is the single most impactful CI optimization for large dbt projects.
Q: What are model contracts and what problem do they solve?
A: Model contracts (dbt 1.5+) enforce a model's schema at build time: column names, data types, and constraints (not_null, primary_key). When contract: enforced: true is set, dbt validates that the model's SQL output matches the declared columns. If a developer changes the SQL in a way that drops a column, renames it, or changes its type, the build fails immediately — before the change reaches production. This solves the "silent breaking change" problem: without contracts, an upstream model could change its output schema, and downstream models/dashboards would break at query time (or worse, silently produce wrong results). Contracts establish a formal API between data producers and consumers, essential for data mesh and multi-team dbt projects.
Q: When would you choose incremental_strategy='delete+insert' over 'merge'?
A: delete+insert first deletes rows matching a condition (typically a time partition), then inserts the new data. It's better than merge when: (1) Your warehouse's MERGE is slow or expensive (some Redshift workloads). (2) You're overwriting entire partitions and don't need row-level upsert logic. (3) You want simpler semantics — delete a date partition, insert the full day's data. (4) Your data doesn't have a reliable unique key for matching. Use merge when: (1) You need row-level upserts (update matching rows, insert new ones). (2) Your data has a reliable unique key. (3) Only a small percentage of rows change each run (merge touches fewer rows). append is best for immutable event logs where duplicates are impossible. The new microbatch strategy (dbt 1.9+) combines the benefits of delete+insert with automatic batch-level retries.
Q: What is the dbt Semantic Layer and how does it work?
A: The dbt Semantic Layer (powered by MetricFlow) defines business metrics as code in YAML: semantic models (mapping dbt models to entities/measures/dimensions) and metrics (business calculations over measures). When a BI tool queries a metric (e.g., "revenue by country, monthly"), MetricFlow generates optimized SQL that joins the necessary tables, applies the correct aggregation, and filters correctly. The metric definition is the single source of truth — Tableau, Hex, Looker, and Python notebooks all get the same "revenue" number. This eliminates the "whose number is right?" problem where every team calculates metrics differently. It's served via the dbt Cloud Semantic Layer API (JDBC/ADBC). The semantic layer is increasingly important in interviews as it represents the next evolution of the analytics stack beyond just building tables.