dbt is the standard transformation layer for modern ELT pipelines. It lets data engineers and analytics engineers write SELECT statements as models, while dbt handles dependency resolution, DDL generation, testing, documentation, and deployment. dbt compiles SQL + Jinja templates into warehouse-native SQL and executes it against your data platform (Snowflake, BigQuery, Databricks, Redshift, PostgreSQL, and more), bringing software engineering practices — version control, code review, CI/CD, testing — to SQL transformations. This guide covers dbt Core 1.8+ / dbt Cloud features including model contracts, the semantic layer (MetricFlow), and unit testing. It assumes you already know SQL and basic data warehousing concepts.
A dbt project is a directory with a standard layout:
my_dbt_project/
├── dbt_project.yml # Project config: name, version, paths, vars
├── profiles.yml # Connection targets (dev, staging, prod); often kept in ~/.dbt/ instead of the repo
├── models/
│ ├── staging/ # 1:1 mappings from raw sources
│ │ ├── _stg_models.yml # Source definitions + tests
│ │ ├── stg_orders.sql
│ │ └── stg_customers.sql
│ ├── intermediate/ # Business logic building blocks
│ │ └── int_order_items_enriched.sql
│ └── marts/ # Final business-facing models
│ ├── finance/
│ │ └── fct_revenue.sql
│ └── marketing/
│ └── dim_customers.sql
├── macros/ # Reusable Jinja macros
├── seeds/ # Static CSV lookup tables
├── snapshots/ # SCD Type 2 captures
└── tests/ # Custom singular tests
The cornerstone is {{ ref('model_name') }} — a function that resolves to the correct schema.table for the current environment and declares a dependency. dbt builds a DAG from all ref() calls and executes models in topological order. {{ source('source_name', 'table_name') }} similarly references raw tables declared in YAML, enabling freshness checks and lineage tracking from source to mart.
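For intuition, here's a minimal sketch of how a ref() call compiles — the database, schema, and model names below are illustrative, not fixed by dbt:

-- models/marts/fct_orders.sql (sketch)
SELECT order_id, customer_id, amount
FROM {{ ref('stg_orders') }}

-- Compiled against a hypothetical dev target, this becomes roughly:
-- SELECT order_id, customer_id, amount
-- FROM analytics.dbt_jane.stg_orders
-- (database and schema come from the active target in profiles.yml)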
Each layer has a conventional materialization and purpose:
- Staging — view (cheap, always up-to-date). Purpose: isolate raw source quirks so downstream models don't deal with them.
- Intermediate — ephemeral (CTEs injected into downstream queries) or view. Purpose: DRY — avoid repeating complex joins in multiple marts.
- Marts — table or incremental. Purpose: the "gold" layer that BI tools, analysts, and applications consume.

A materialization defines how dbt persists a model in the warehouse:
| Materialization | DDL Generated | When to Use | Trade-offs |
|---|---|---|---|
| view | CREATE VIEW AS | Staging models, lightweight transforms, always-fresh needs | No storage cost; query-time compute; cascading performance impact if slow |
| table | CREATE TABLE AS (full refresh each run) | Marts, gold tables, moderate-size datasets | Fast reads; full rebuild each run; storage cost |
| incremental | MERGE / INSERT / DELETE+INSERT | Large fact tables, append-heavy event data | Fast runs; complex logic; requires careful is_incremental() handling |
| ephemeral | CTE injected into downstream SQL | Helper models used only once, avoiding warehouse objects | No table created; can't query directly; harder to debug |
| snapshot | SCD Type 2 table with dbt_valid_from, dbt_valid_to | Tracking historical changes to dimensions | Automatic SCD; requires a reliable updated_at or check strategy |
Custom materializations can be defined as macros — some teams create insert_by_period or microbatch materializations for specialized patterns.
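A heavily simplified sketch of the shape of a custom materialization (the name my_table_v2 is illustrative; real materializations also run hooks, manage transactions, and handle pre-existing relations):

-- macros/my_table_v2.sql
{% materialization my_table_v2, default %}
  {# Build the target relation from the current model's location #}
  {%- set target_relation = this.incorporate(type='table') -%}
  {# 'main' is the statement dbt reports timing/status for #}
  {% call statement('main') -%}
    {{ create_table_as(False, target_relation, sql) }}
  {%- endcall %}
  {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}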
Incremental models are dbt's answer to "don't rebuild the entire table every run." The pattern:
-- models/marts/fct_events.sql
{{ config(
materialized='incremental',
unique_key='event_id',
incremental_strategy='merge', -- or 'delete+insert', 'append', 'microbatch'
on_schema_change='sync_all_columns',
) }}
SELECT
event_id,
user_id,
event_type,
event_timestamp,
payload
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
-- Only process new rows since the last run
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
Key incremental strategies:
- merge — generates a MERGE statement; upserts on unique_key. Best for: dimension tables, tables with updates. Supported on Snowflake, BigQuery, Databricks, Redshift.
- delete+insert — deletes rows matching the incoming keys or partition, then inserts the new batch; avoids MERGE overhead on some warehouses. Best for: partitioned fact tables where you overwrite entire time ranges (sketch below).
- append — insert-only; best for immutable event logs where duplicates are impossible.
- microbatch (dbt 1.9+) — processes time-based batches with automatic batch-level retries.

Full refresh escape hatch: Run dbt run --full-refresh to rebuild incremental models from scratch. This is your safety net when incremental logic drifts or schemas change.
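For contrast with the merge example above, a delete+insert sketch that overwrites whole date partitions (model and column names are assumptions; DATEADD is Snowflake syntax):

-- models/marts/fct_daily_page_views.sql (sketch)
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='event_date'  -- target rows with matching dates are deleted, then reinserted
) }}
SELECT
    event_date,
    user_id,
    COUNT(*) AS page_views
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
-- Reprocess a 3-day window to absorb late-arriving events
WHERE event_date >= DATEADD(day, -3, CURRENT_DATE)
{% endif %}
GROUP BY event_date, user_id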
dbt's testing framework runs assertions against your data after (or during) model builds:
Generic tests (built-in): unique, not_null, accepted_values, relationships (referential integrity). Applied per-column in schema YAML files:
# models/marts/_schema.yml
models:
- name: fct_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: customer_id
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
- name: order_status
tests:
- accepted_values:
values: ['pending', 'shipped', 'delivered', 'cancelled']
Singular tests: custom SQL queries in tests/ that return failing rows. If the query returns any rows, the test fails:
-- tests/assert_no_orphaned_orders.sql
-- Fail if any order references a non-existent customer
SELECT o.order_id, o.customer_id
FROM {{ ref('fct_orders') }} o
LEFT JOIN {{ ref('dim_customers') }} c
ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
Unit tests (dbt 1.8+): test transformation logic with mocked inputs — no warehouse data needed:
# models/marts/_schema.yml
unit_tests:
- name: test_revenue_calculation
model: fct_orders
given:
- input: ref('stg_order_items')
rows:
- {order_id: 1, quantity: 3, unit_price: 10.00, discount: 0.10}
- {order_id: 2, quantity: 1, unit_price: 50.00, discount: 0.00}
expect:
rows:
- {order_id: 1, total_revenue: 27.00}
- {order_id: 2, total_revenue: 50.00}
dbt models are SQL files processed through the Jinja2 template engine before compilation. This gives you variables, loops, conditionals, and reusable macros — making SQL DRY and environment-aware.
Jinja has three delimiters:
- {{ }} — expression (outputs a value): {{ ref('orders') }}, {{ var('start_date') }}
- {% %} — statement (control flow): {% if is_incremental() %}, {% for col in columns %}
- {# #} — comment (not compiled)

Macros are reusable Jinja functions stored in macros/:
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
(CAST({{ column_name }} AS DECIMAL(12,2)) / 100)
{% endmacro %}
-- Usage in a model:
SELECT
order_id,
{{ cents_to_dollars('amount_cents') }} AS amount_dollars
FROM {{ ref('stg_payments') }}
Power patterns: macros for generate_schema_name (custom schema routing), grant_select (post-hook permissions), and dynamic column generation via {% for %} loops over schema metadata.
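For example, the commonly overridden generate_schema_name macro — this follows the pattern from dbt's docs; the prod target name is an assumption:

-- macros/generate_schema_name.sql
{# In prod, write to the model's declared custom schema; everywhere else,
   send all models to the target's schema so dev/CI builds stay sandboxed #}
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if target.name == 'prod' and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}
    {%- else -%}
        {{ target.schema }}
    {%- endif -%}
{%- endmacro %}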
dbt packages are reusable modules installed via packages.yml and dbt deps:
- dbt_utils: generate_surrogate_key(), pivot(), unpivot(), union_relations(), date_spine(), get_column_values(), generic tests like expression_is_true.
- dbt_expectations: expect_column_values_to_be_between, expect_table_row_count_to_be_between, expect_column_distinct_count_to_equal. Comprehensive data quality without leaving dbt.
- elementary: data observability — monitors test results, source freshness, and volume anomalies across projects.

# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: [">=1.1.0", "<2.0.0"]
- package: calogica/dbt_expectations
version: [">=0.10.0", "<0.11.0"]
- package: elementary-data/elementary
version: [">=0.13.0", "<0.14.0"]
The dbt Semantic Layer (powered by MetricFlow) lets you define metrics as code alongside your models. Instead of every BI tool calculating "revenue" differently, metrics are defined once in YAML and served via API.
The building blocks:
- Entities — join keys connecting semantic models; types: primary, foreign, unique, natural.
- Measures — aggregations over columns: sum, count, count_distinct, average, min, max.
- Dimensions — attributes for slicing metrics, typed categorical or time.
- Metrics — simple (one measure), derived (expression over other metrics), cumulative (running totals), conversion (funnel rates).

Consumers query the semantic layer via the dbt Cloud Semantic Layer API, which generates optimized SQL for any target warehouse. Integrations: Tableau, Hex, Mode, Google Sheets, and any tool supporting the JDBC/ADBC interface.
dbt's governance features (introduced in 1.5+) bring API-style contracts to data models:
- Model access: controls who can ref() a model: public (anyone), protected (same project/group), private (same group). Prevents unintended dependencies.
- Groups: assign models to an owning team; paired with access levels, a marketing model can't ref() an engineering-internal model.
- Contracts and versions: enforce a declared output schema and evolve it deliberately, as in the example below.

# models/marts/finance/_schema.yml
models:
- name: fct_revenue
access: public # Any project can ref() this
group: finance_team
latest_version: 2
config:
contract:
enforced: true # Build fails if SQL doesn't match
columns:
- name: revenue_id
data_type: string
constraints:
- type: not_null
- type: primary_key
- name: amount
data_type: numeric
constraints:
- type: not_null
- name: order_date
data_type: date
versions:
- v: 1
defined_in: fct_revenue_v1
- v: 2
defined_in: fct_revenue_v2
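Downstream consumers can then pin a version explicitly (a short sketch):

-- In a consuming model:
SELECT * FROM {{ ref('fct_revenue', v=1) }}  -- stay on v1 during migration
-- A plain {{ ref('fct_revenue') }} resolves to latest_version (v2)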
A DTC (direct-to-consumer) brand builds its entire analytics warehouse in dbt on Snowflake. Staging models map 12 source tables (Shopify, Stripe, Klaviyo, Google Analytics) via Fivetran ingestion. Intermediate models resolve customer identity across systems (email matching + device fingerprinting). Marts deliver: fct_orders (grain: order line item), dim_customers (SCD Type 2 via dbt snapshots tracking loyalty tier changes), dim_products, dim_date. dbt tests enforce referential integrity between facts and dimensions. The semantic layer defines "revenue," "AOV," and "customer LTV" metrics used by both Looker dashboards and the growth team's Python notebooks.
A bank uses dbt on Databricks SQL to produce Basel III regulatory reports. Staging models normalize data from 6 core banking systems. Intermediate models calculate risk-weighted assets using complex business rules encoded in Jinja macros (parameterized by regulation version). Marts produce the exact table structures required by regulators. Model contracts enforce that output schemas exactly match the regulatory specification — any column drift fails the CI build. Unit tests validate calculation logic against known regulatory test cases. The entire pipeline runs in CI on every pull request, and auditors can review dbt's auto-generated documentation to understand data lineage from source to report.
A B2B SaaS company with 200+ employees builds a self-serve analytics platform on BigQuery + dbt Cloud. Product, marketing, and finance teams each own a dbt "group" with specific models. The semantic layer exposes 50+ metrics (MRR, churn rate, feature adoption, NPS) via the dbt Cloud API. Tableau dashboards, Hex notebooks, and a customer-facing analytics embed all query the same metric definitions. dbt Cloud CI runs on every PR: builds modified models in a temporary schema, runs tests, and posts a comment on the PR with row count diffs. Deploy-on-merge promotes changes to production automatically.
A large enterprise (10,000+ employees) implements a data mesh pattern using multiple dbt projects. Each domain team (orders, inventory, customers) owns a dbt project that publishes public models as data products. Cross-project references use {{ ref('project_name', 'model_name') }}. Model contracts ensure domain interfaces are stable — an upstream team can't accidentally break a downstream consumer. A central data platform team maintains shared macros (via a private dbt package) for standard patterns: SCD Type 2 handling, PII hashing, audit columns. Elementary monitors all projects centrally, alerting on test failures, freshness violations, and volume anomalies.
-- models/staging/stripe/stg_stripe_payments.sql
-- One staging model per source table. Light cleaning only.
WITH source AS (
SELECT * FROM {{ source('stripe', 'payments') }}
),
renamed AS (
SELECT
-- Primary key
id AS payment_id,
-- Foreign keys
order_id,
customer_id,
-- Measures (convert cents → dollars)
CAST(amount AS DECIMAL(12,2)) / 100 AS amount_dollars,
-- Attributes
LOWER(payment_method) AS payment_method,
status AS payment_status,
-- Timestamps
CAST(created_at AS TIMESTAMP) AS created_at,
CAST(updated_at AS TIMESTAMP) AS updated_at
FROM source
WHERE _fivetran_deleted = FALSE -- Exclude soft-deleted rows
)
SELECT * FROM renamed
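The source('stripe', 'payments') call above assumes a YAML source definition along these lines (database/schema names and freshness thresholds are illustrative):

# models/staging/stripe/_stripe_sources.yml
sources:
  - name: stripe
    database: raw
    schema: stripe
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payments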
-- models/marts/fct_page_views.sql
-- Incremental model: processes only new events each run
{{ config(
materialized='incremental',
unique_key='page_view_id',
incremental_strategy='merge',
cluster_by=['event_date'], -- Databricks: liquid clustering
on_schema_change='sync_all_columns',
tags=['daily', 'product'],
) }}
WITH events AS (
SELECT
event_id AS page_view_id,
user_id,
page_url,
referrer_url,
session_id,
CAST(event_timestamp AS TIMESTAMP) AS viewed_at,
CAST(event_timestamp AS DATE) AS event_date,
-- Extract UTM parameters from URL
{{ dbt_utils.get_url_parameter('page_url', 'utm_source') }} AS utm_source,
{{ dbt_utils.get_url_parameter('page_url', 'utm_campaign') }} AS utm_campaign,
-- Duration on page (seconds)
COALESCE(duration_ms / 1000, 0) AS duration_seconds
FROM {{ ref('stg_segment_page_views') }}
{% if is_incremental() %}
-- Process only events newer than the latest in our table
WHERE event_timestamp > (
SELECT MAX(viewed_at) FROM {{ this }}
)
{% endif %}
)
SELECT * FROM events
-- snapshots/snap_customers.sql
-- Automatically tracks changes to customer attributes over time
{% snapshot snap_customers %}
{{ config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp', -- or 'check' for hash-based detection
updated_at='updated_at',
invalidate_hard_deletes=True, -- Track deletions too
) }}
SELECT
customer_id,
name,
email,
city,
loyalty_tier,
updated_at
FROM {{ source('app_db', 'customers') }}
{% endsnapshot %}
{# Output table has columns:
customer_id, name, email, city, loyalty_tier, updated_at,
dbt_valid_from, dbt_valid_to, dbt_scd_id, dbt_updated_at
Current rows: dbt_valid_to IS NULL
Historical rows: dbt_valid_to IS NOT NULL
#}
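A point-in-time query against the snapshot might look like this (the date is illustrative):

-- Loyalty tier of each customer as of 2024-01-01
SELECT customer_id, loyalty_tier
FROM {{ ref('snap_customers') }}
WHERE dbt_valid_from <= '2024-01-01'
  AND (dbt_valid_to > '2024-01-01' OR dbt_valid_to IS NULL)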
-- macros/pivot_metrics.sql
-- Generic pivot macro: turns rows into columns dynamically
{% macro pivot_metrics(ref_model, key_col, pivot_col, value_col, agg_func='sum') %}
{% set pivot_values = dbt_utils.get_column_values(
table=ref(ref_model),
column=pivot_col
) %}
SELECT
{{ key_col }},
{% for val in pivot_values %}
{{ agg_func }}(
CASE WHEN {{ pivot_col }} = '{{ val }}'
THEN {{ value_col }}
END
) AS {{ val | lower | replace(' ', '_') }}_{{ agg_func }}
{%- if not loop.last %},{% endif %}
{% endfor %}
FROM {{ ref(ref_model) }}
GROUP BY {{ key_col }}
{% endmacro %}
-- Usage: models/marts/fct_sales_by_category.sql
-- {{ pivot_metrics('int_sales_long', 'sale_date', 'category', 'revenue') }}
-- Compiles to: SELECT sale_date,
-- SUM(CASE WHEN category = 'Electronics' THEN revenue END) AS electronics_sum,
-- SUM(CASE WHEN category = 'Clothing' THEN revenue END) AS clothing_sum, ...
# .github/workflows/dbt-ci.yml
# Runs dbt build on every PR — tests modified models in a temp schema
name: dbt CI
on:
pull_request:
branches: [main]
paths: ['models/**', 'macros/**', 'tests/**', 'snapshots/**']
jobs:
dbt-build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dbt
run: pip install dbt-snowflake==1.8.*
- name: Run dbt deps
run: dbt deps
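      # Assumes ./prod-manifest/ was populated by an earlier step (e.g., downloaded
      # from object storage or a previous run's artifact) so --state can diff vs. prod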
- name: dbt build (modified models + downstream)
run: |
dbt build \
--select state:modified+ \
--defer \
--state ./prod-manifest/ \
--target ci \
--profiles-dir .
env:
DBT_SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
DBT_SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
DBT_SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
DBT_CI_SCHEMA: PR_${{ github.event.number }}
      - name: Report run results
        if: always()
        # Placeholder: parse target/run_results.json and post a PR comment
        # with your preferred comment action or a small script
        run: cat target/run_results.json
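The ci target used above would live in a committed profiles.yml that reads all secrets from environment variables (database, warehouse, and role names are assumptions; env var names match the workflow):

# profiles.yml (ci profile; no secrets in the file itself)
my_dbt_project:
  target: ci
  outputs:
    ci:
      type: snowflake
      account: "{{ env_var('DBT_SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('DBT_SNOWFLAKE_USER') }}"
      password: "{{ env_var('DBT_SNOWFLAKE_PASSWORD') }}"
      database: ANALYTICS_CI
      schema: "{{ env_var('DBT_CI_SCHEMA') }}"
      warehouse: TRANSFORM_WH
      role: TRANSFORMER
      threads: 8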
# models/semantic/_semantic_models.yml
# MetricFlow semantic model: defines measures, dimensions, entities
semantic_models:
- name: orders
description: "Order fact table"
model: ref('fct_orders')
defaults:
agg_time_dimension: ordered_at
entities:
- name: order_id
type: primary
- name: customer_id
type: foreign
measures:
- name: order_total
expr: amount
agg: sum
- name: order_count
expr: "1"
agg: sum
- name: unique_customers
expr: customer_id
agg: count_distinct
dimensions:
- name: ordered_at
type: time
type_params:
time_granularity: day
- name: order_status
type: categorical
- name: country
type: categorical
metrics:
- name: revenue
type: simple
label: "Total Revenue"
description: "Gross revenue from all completed orders"
type_params:
measure: order_total
filter: |
{{ Dimension('order_status') }} = 'completed'
- name: average_order_value
type: derived
label: "Average Order Value"
type_params:
expr: revenue / order_count
metrics:
- name: revenue
- name: order_count
filter: |
{{ Dimension('order_status') }} = 'completed'
- name: cumulative_revenue
type: cumulative
label: "Cumulative Revenue"
type_params:
measure: order_total
window: 365 days # Rolling 365-day sum (window and grain_to_date are mutually exclusive; use grain_to_date: month instead for month-to-date)
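With these definitions in place, you can sanity-check metrics locally with the MetricFlow CLI (command shapes per MetricFlow's docs; output will vary by warehouse):

# Validate semantic configs, then query revenue by month
mf validate-configs
mf query --metrics revenue --group-by metric_time__month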
| Feature | dbt (SQL + Jinja) | PySpark / Spark SQL | Stored Procedures | Dataform (Google) |
|---|---|---|---|---|
| Language | SQL + Jinja + YAML | Python + SQL | SQL (vendor-specific) | SQLX (SQL + JS) |
| Execution | Pushdown to warehouse | Distributed Spark engine | In-database | Pushdown to BigQuery |
| Testing | Built-in (generic, singular, unit) | Manual (Great Expectations, custom) | Manual scripts | Assertions (basic) |
| Version control | Native (SQL files in Git) | Native (Python in Git) | Difficult (state in DB) | Native (Git-integrated) |
| Lineage | Automatic DAG from ref() | Manual or via OpenLineage | None | Automatic DAG from ref() |
| Documentation | Auto-generated from YAML | Manual | Manual | Auto-generated |
| Semantic layer | MetricFlow (native) | None | None | None |
| Scale | Depends on warehouse | TB-PB (distributed) | Depends on DB engine | BigQuery scale |
| Best for | SQL-centric warehouse transforms | Complex Python logic, large-scale ETL | Legacy systems, simple transforms | BigQuery-native teams |
Choose dbt when: Your transformations are expressible in SQL and you want testing, documentation, lineage, and CI/CD out of the box. Choose PySpark when: You need Python logic (ML features, complex algorithms), multi-TB distributed joins, or streaming. Choose stored procedures when: You have simple transforms in a legacy database and no appetite for modern tooling (but plan to migrate). Choose Dataform when: You're a BigQuery-only shop and want Google-integrated transformation. Many teams use dbt + PySpark together — dbt for SQL transforms in the warehouse, PySpark for heavy pre-processing in the lakehouse.
Best practices:
- Every model should have at least unique + not_null on its primary key, and relationships tests on foreign keys. Add tests alongside model development, not after.
- Start with the table materialization. Switch to incremental only when full refreshes become too slow or expensive. Always ensure dbt run --full-refresh produces identical results as incremental runs.
- Govern cross-team dependencies with groups and access controls (dbt 1.5+).
- Use --select and state:modified in CI. Running dbt build (all models) on every pull request is wasteful and slow. Use dbt build --select state:modified+ with --defer to only build changed models and their downstream dependencies, reading upstream from the production manifest. This cuts CI time from 30+ minutes to 2-5 minutes.
- Consider materialized_view on warehouses that support it (Snowflake dynamic tables, BigQuery materialized views, Redshift MV). For slowly-changing intermediate models that are queried frequently, materialized views give you table-like performance with automatic incremental refresh — no is_incremental() logic needed. Check if your warehouse supports it before writing complex incremental models.
- Don't put profiles.yml in Git. profiles.yml contains database credentials and should never be committed to version control. Use environment variables ({{ env_var('DBT_SNOWFLAKE_PASSWORD') }}) in profiles, and store secrets in your CI/CD system or a secret manager. dbt Cloud handles this automatically.

Exercises:
1. Scaffold a project end to end: define sources, staging models, and marts with tests; run dbt build and verify all tests pass; then generate docs with dbt docs generate and explore the DAG.
2. Build an incremental model that handles late-arriving data: (1) a merge strategy on a unique key, (2) a lookback window (e.g., WHERE event_date >= DATEADD(day, -3, current_date) in the incremental block) to catch late arrivals, (3) on_schema_change='sync_all_columns'. Manually insert a "late" record into the source and verify the incremental model picks it up. Then run dbt run --full-refresh and confirm results match.
3. Simulate slim CI: (1) Capture production state (dbt compile --target prod → save target/manifest.json). (2) Modify one staging model. (3) Run dbt build --select state:modified+ --defer --state ./prod-manifest/. Verify that only the modified model and its downstream dependents are built, while unrelated models use the production manifest. Measure the time difference vs. running dbt build on all models.

Q: Explain the main dbt materializations and when to use each.
A: (1) view — creates a SQL view; no data stored. Use for staging models, lightweight transforms, and when you always want fresh results. Trade-off: query-time compute. (2) table — creates a full table, rebuilt from scratch on each run. Use for mart models of moderate size. Trade-off: storage cost, but fast reads. (3) incremental — inserts/updates only new or changed rows using MERGE, DELETE+INSERT, or APPEND. Use for large fact tables where full rebuild is too slow. Trade-off: complex logic, must handle late-arriving data. (4) ephemeral — compiled as a CTE injected into downstream queries; no database object created. Use for intermediate helper logic that's referenced once. Trade-off: can't query directly, harder to debug. Additionally, snapshot is for SCD Type 2 tracking and materialized_view (1.6+) leverages warehouse-native incremental refresh.
Q: What's the difference between ref() and source() in dbt?
A: ref('model_name') references another dbt model. It resolves to the correct schema.table for the current target environment, and it registers a dependency in the DAG — dbt knows to build the referenced model first. source('source_name', 'table_name') references a raw table declared in a YAML source definition. It does NOT create a dependency on a dbt model — instead, it tracks the external table for freshness checks (dbt source freshness) and documentation. Sources are the entry points of the DAG; refs connect dbt models within the DAG. Best practice: staging models use source(), all other models use ref(). Never reference raw tables directly with hardcoded schema.table names — always go through source().
Q: How do dbt snapshots implement SCD Type 2?
A: They work by: (1) Taking a snapshot of a source query on each run. (2) Comparing current source rows against existing snapshot rows using a unique_key. (3) If a row has changed (detected by strategy: timestamp using updated_at, or strategy: check which hashes specified columns), the old row gets its dbt_valid_to set to the current timestamp, and a new row is inserted with dbt_valid_from = current_timestamp and dbt_valid_to = NULL. (4) New rows are inserted with dbt_valid_to = NULL. (5) With invalidate_hard_deletes=True, rows missing from the source get their dbt_valid_to set. Current records always have dbt_valid_to IS NULL. The timestamp strategy is preferred when a reliable updated_at column exists; the check strategy works when no such column exists but is more expensive (it compares all specified columns).
Q: What is state:modified in dbt and how does it enable slim CI builds?
A: state:modified is a node selector that identifies dbt resources (models, tests, snapshots) that have changed since a reference state (usually the production manifest). In CI, you save the production manifest.json, then run dbt build --select state:modified+ --defer --state ./prod-manifest/. The + suffix means "and all downstream dependents." The --defer flag tells dbt to read unmodified upstream models from the production database instead of rebuilding them. This means a PR that changes one staging model rebuilds only that model and its downstream chain — not the entire project. Result: CI runs drop from 30+ minutes to 2-5 minutes, and warehouse costs for CI drop proportionally. This is the single most impactful CI optimization for large dbt projects.
Q: What are model contracts and what problem do they solve?
A: Model contracts (dbt 1.5+) enforce a model's schema at build time: column names, data types, and constraints (not_null, primary_key). When contract: enforced: true is set, dbt validates that the model's SQL output matches the declared columns. If a developer changes the SQL in a way that drops a column, renames it, or changes its type, the build fails immediately — before the change reaches production. This solves the "silent breaking change" problem: without contracts, an upstream model could change its output schema, and downstream models/dashboards would break at query time (or worse, silently produce wrong results). Contracts establish a formal API between data producers and consumers, essential for data mesh and multi-team dbt projects.
Q: When would you choose incremental_strategy='delete+insert' over 'merge'?
A: delete+insert first deletes rows matching a condition (typically a time partition), then inserts the new data. It's better than merge when: (1) Your warehouse's MERGE is slow or expensive (some Redshift workloads). (2) You're overwriting entire partitions and don't need row-level upsert logic. (3) You want simpler semantics — delete a date partition, insert the full day's data. (4) Your data doesn't have a reliable unique key for matching. Use merge when: (1) You need row-level upserts (update matching rows, insert new ones). (2) Your data has a reliable unique key. (3) Only a small percentage of rows change each run (merge touches fewer rows). append is best for immutable event logs where duplicates are impossible. The new microbatch strategy (dbt 1.9+) combines the benefits of delete+insert with automatic batch-level retries.
Q: What is the dbt Semantic Layer and how does it work?
A: The dbt Semantic Layer (powered by MetricFlow) defines business metrics as code in YAML: semantic models (mapping dbt models to entities/measures/dimensions) and metrics (business calculations over measures). When a BI tool queries a metric (e.g., "revenue by country, monthly"), MetricFlow generates optimized SQL that joins the necessary tables, applies the correct aggregation, and filters correctly. The metric definition is the single source of truth — Tableau, Hex, Looker, and Python notebooks all get the same "revenue" number. This eliminates the "whose number is right?" problem where every team calculates metrics differently. It's served via the dbt Cloud Semantic Layer API (JDBC/ADBC). The semantic layer is increasingly important in interviews as it represents the next evolution of the analytics stack beyond just building tables.