Databricks is a unified analytics platform built on Apache Spark that implements the lakehouse paradigm — combining the low-cost storage of data lakes with the reliability and performance features of data warehouses. Its core differentiators are Delta Lake (ACID transactions on object storage), Unity Catalog (unified governance across workspaces), Photon (a C++ vectorized query engine), and Databricks Workflows (native orchestration). For data engineering interviews, expect deep questions on Delta Lake internals, medallion architecture, cluster configuration, and how Databricks compares to self-managed Spark, Snowflake, or cloud-native ELT stacks. This guide covers Databricks Runtime 14+ features.
Delta Lake stores data as Parquet files on object storage (S3, ADLS, GCS) with a transaction log (_delta_log/) that records every change as a series of JSON commits. Each commit is an atomic unit that references which Parquet files to add/remove. This log enables ACID transactions, schema enforcement, and time travel (querying historical table states by version or timestamp). Every 10 commits, a checkpoint (Parquet-format summary) is written to speed up log replay. The VACUUM command deletes data files no longer referenced by the log (older than the retention threshold, default 7 days). Key: Delta Lake is an open protocol — you can read/write Delta tables from Spark, Flink, Trino, Polars, or plain Python (delta-rs).
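To make time travel and retention concrete, here is a minimal PySpark sketch (the table path is illustrative):
# Query an older snapshot by version or by timestamp (time travel)
v5 = spark.read.format("delta").option("versionAsOf", 5).load("s3://lakehouse/silver/dim_customers")
as_of_june = spark.read.format("delta").option("timestampAsOf", "2024-06-01").load("s3://lakehouse/silver/dim_customers")
# Inspect the commit history recorded in _delta_log/
spark.sql("DESCRIBE HISTORY delta.`s3://lakehouse/silver/dim_customers`").show(truncate=False)
# Delete data files no longer referenced by the log and older than the retention threshold (default 7 days)
spark.sql("VACUUM delta.`s3://lakehouse/silver/dim_customers`")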
OPTIMIZE compacts small files into larger ones (~1 GB target), solving the small-file problem. Z-Ordering (OPTIMIZE table ZORDER BY (col1, col2)) colocates related data by interleaving the sort order of multiple columns using a Z-order curve, improving predicate pushdown effectiveness for multi-column filters. Limitation: Z-ordering is applied at OPTIMIZE time and degrades as new data arrives — requiring periodic re-optimization. Liquid Clustering (available from Databricks Runtime 13.3 LTS, GA in 15.2+) replaces both partitioning and Z-ordering with an incremental, write-time clustering strategy using Hilbert curves. It automatically rebalances data without full table rewrites. For new tables, Databricks recommends liquid clustering over traditional PARTITION BY + ZORDER.
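A short sketch of both layout strategies on an existing Delta table, run from a notebook via spark.sql (the table name analytics.events.page_views is illustrative):
# Legacy layout: compact small files and Z-order by common filter columns.
# Z-ordering degrades as new data arrives, so this is re-run periodically.
spark.sql("OPTIMIZE analytics.events.page_views ZORDER BY (event_date, user_id)")
# Liquid clustering alternative: set clustering keys once; subsequent writes
# are clustered incrementally (the table must not already be partitioned).
spark.sql("ALTER TABLE analytics.events.page_views CLUSTER BY (event_date, user_id)")
spark.sql("OPTIMIZE analytics.events.page_views")  # incremental clustering, no full rewrite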
Unity Catalog is Databricks' unified governance layer. It provides a three-level namespace: catalog.schema.table. Key features: (1) Centralized access control — GRANT/REVOKE on catalogs, schemas, tables, columns, and row filters. (2) Data lineage — automatic column-level lineage tracking across notebooks, jobs, and SQL queries. (3) Audit logging. (4) External locations & storage credentials — govern access to external cloud storage. (5) Delta Sharing — open protocol for cross-organization data sharing (read-only, no data duplication). Unity Catalog replaces the legacy workspace-level Hive metastore with an account-level metastore, enabling cross-workspace access and consistent governance.
Workflows (formerly Jobs) orchestrate multi-task pipelines within Databricks. Tasks can be notebooks, Python scripts, SQL queries, dbt models, JAR jobs, or DLT pipelines. Tasks support dependencies, conditional execution (IF/ELSE tasks), for-each loops, and retry policies. Workflows use job clusters (ephemeral, cost-optimized — shut down after the job) or shared job clusters (reused across tasks in the same run). Delta Live Tables (DLT) is a declarative ETL framework where you define tables as SQL/Python queries and DLT handles dependency resolution, incremental processing, error handling (expectations for data quality), and automatic infrastructure management. DLT pipelines can operate in triggered (batch) or continuous (streaming) mode.
Choosing the right cluster is critical for cost and performance. Key decisions: (1) All-Purpose clusters (interactive, long-running — for development) vs. Job clusters (ephemeral, cost-optimized — for production). (2) Worker type — memory-optimized for wide joins/caches, compute-optimized for CPU-heavy transforms. (3) Autoscaling — set min/max workers; scaling up typically takes ~2-5 minutes while new cloud instances provision. (4) Photon — Databricks' native C++ vectorized engine that accelerates SQL and DataFrame operations 2-8x over standard Spark. Photon is GA, is most effective for scan-heavy, aggregation-heavy workloads, and is used automatically when enabled on a cluster — no code changes needed. (5) Spot instances — use for workers (not the driver) to reduce costs by 60-80%, with auto-fallback to on-demand.
Auto Loader (cloudFiles) is Databricks' recommended way to incrementally ingest files from cloud storage. It uses cloud event notifications (S3 SQS/SNS, ADLS Event Grid, GCS Pub/Sub) or directory listing to efficiently discover new files. Key advantages over a manual readStream.format("parquet"): (1) handles millions of files without listing overhead, (2) schema inference and evolution (cloudFiles.schemaEvolutionMode) with a _rescued_data column for unexpected fields, (3) exactly-once file tracking via checkpoints. Combined with DLT, Auto Loader powers the bronze layer of medallion architectures. For streaming, Databricks extends Structured Streaming with trigger(availableNow=True) for incremental batch processing, rate limiting, and state store improvements (a RocksDB-based state store for large state).
Databricks Asset Bundles (GA in 2024) are the recommended way to define, test, and deploy Databricks resources as code. A bundle is a project directory containing a databricks.yml manifest that declares jobs, DLT pipelines, ML experiments, and notebooks alongside the source code. Key concepts: (1) Targets — environment-specific configurations (dev, staging, prod) with variable overrides (cluster sizes, catalog names, permissions). (2) Deployment — databricks bundle deploy creates or updates all resources atomically using the Databricks API. (3) CI/CD integration — bundles work natively with GitHub Actions, Azure DevOps, and GitLab CI. Unlike workspace-level notebooks, bundles enforce version control (all code in Git) and reproducible deployments. (4) Templates — pre-built project templates (databricks bundle init) for common patterns: default Python, DLT, ML. Asset Bundles replace the older dbx tool and Terraform-only approaches with a Databricks-native, developer-friendly workflow.
Databricks offers serverless compute across two surfaces: (1) Serverless SQL Warehouses — instantly-available SQL endpoints that auto-scale and start in seconds (vs. minutes for classic warehouses). No cluster configuration needed. Ideal for BI dashboards (Tableau, Power BI), ad-hoc SQL, and dbt SQL models. Pricing is premium per-DBU but eliminates idle cluster costs. (2) Serverless Jobs Compute — run notebook and Python jobs without provisioning clusters. The platform manages infrastructure, auto-scales, and charges only for active computation time. Best practices: use serverless SQL warehouses as the default for all SQL workloads; reserve classic clusters only for long-running streaming jobs or custom library dependencies that serverless doesn't support. Cost comparison: serverless is cost-effective when utilization is bursty (e.g., dashboard queries); classic clusters are cheaper for sustained 24/7 compute.
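A sketch of provisioning a serverless SQL warehouse with the Databricks Python SDK (the name and sizing are illustrative; exact fields can vary by SDK version):
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# Small serverless warehouse for BI and ad-hoc SQL: starts in seconds,
# auto-stops when idle, scales out for concurrent dashboard queries.
wh = w.warehouses.create(
    name="bi_serverless",
    cluster_size="Small",
    enable_serverless_compute=True,
    auto_stop_mins=10,
    max_num_clusters=2,
).result()
print(wh.id, wh.state)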
A bank migrates from a Teradata data warehouse to a Databricks lakehouse on Azure. Bronze tables ingest CDC streams from Oracle (via Debezium → Kafka → Auto Loader) and flat files from SFTP. Silver tables implement SCD Type 2 dimensions using Delta Lake MERGE. Gold tables provide aggregate views for regulatory reporting. Unity Catalog governs column-level access (PII masking via dynamic views for analysts, full access for compliance team). Delta Sharing exposes risk metrics to external auditors without data duplication.
An ad-tech company uses Databricks as its feature computation engine. Structured Streaming jobs consume from Kafka topics (ad impressions, clicks, bids), compute 15-minute rolling aggregates per campaign, and write features to a Delta Lake feature store. A Databricks Workflow (scheduled hourly) triggers model retraining using these features. Model serving uses Databricks Model Serving (MLflow-backed). Feature tables are governed by Unity Catalog, with lineage from raw Kafka topics to model predictions.
A large retail organization implements a data mesh pattern using Unity Catalog's three-level namespace. Each domain team (supply chain, marketing, finance) owns a catalog: supply_chain.inventory.stock_levels. Data products are published as Delta tables with enforced schemas and SLAs tracked via DLT expectations. Cross-domain consumption uses GRANT statements at the catalog or schema level. Lineage in Unity Catalog shows how the marketing.campaigns.roi table depends on finance.transactions.revenue and supply_chain.inventory.costs.
A Series A startup processes 2 TB/day of clickstream data with a lean team (2 data engineers). They use Databricks Workflows with serverless jobs compute to avoid cluster management entirely. Bronze ingestion runs Auto Loader every 15 minutes. Silver/gold transformations run as scheduled DLT pipelines on serverless compute. BI analysts connect directly to serverless SQL warehouses for Tableau dashboards — no classic clusters needed. Cost controls: (1) account-level budget alerts flag unexpected spend, (2) query tagging tracks consumption by team, (3) serverless auto-scaling eliminates over-provisioning. Total monthly cost: ~$3,000 vs. ~$8,000 with always-on classic clusters.
A pharmaceutical company uses Databricks for clinical trial data processing under FDA 21 CFR Part 11 requirements. Unity Catalog provides audit logs for every data access. Delta Lake's time travel enables reproducible analysis — regulators can query the exact data used in a historical submission. Access control: column-level masking hides patient PII from non-authorized analysts. Databricks Asset Bundles enforce deployment through validated CI/CD pipelines — no manual notebook edits in production. Data is stored on ADLS with customer-managed encryption keys (CMEK). DLT expectations serve as automated validation checks replacing manual QC spreadsheets.
from delta.tables import DeltaTable
from pyspark.sql import functions as F
# Target: existing dimension table in Delta
target = DeltaTable.forPath(spark, "s3://lakehouse/silver/dim_customers")
# Source: new/updated records from the bronze layer
source = spark.read.format("delta").load("s3://lakehouse/bronze/customers") \
.where(F.col("load_date") == "2024-06-15")
# MERGE: update if exists, insert if new
target.alias("t").merge(
source.alias("s"),
"t.customer_id = s.customer_id"
).whenMatchedUpdate(
condition="s.updated_at > t.updated_at",
set={
"name": "s.name",
"email": "s.email",
"updated_at": "s.updated_at",
}
).whenNotMatchedInsertAll() \
.execute()
# Verify with time travel — compare before and after
prev_version = spark.read.format("delta") \
.option("versionAsOf", target.history(1).select("version").first()[0] - 1) \
.load("s3://lakehouse/silver/dim_customers")
# Auto Loader: incrementally ingest JSON files from cloud storage
bronze_df = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("cloudFiles.schemaLocation", "s3://checkpoints/bronze/orders/_schema") \
.option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
.option("cloudFiles.inferColumnTypes", "true") \
.load("s3://raw-landing/orders/")
# Add ingestion metadata
bronze_with_meta = bronze_df \
.withColumn("_ingestion_ts", F.current_timestamp()) \
.withColumn("_source_file", F.input_file_name())
# Write to Delta Bronze table
query = bronze_with_meta.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "s3://checkpoints/bronze/orders/") \
.option("mergeSchema", "true") \
.trigger(availableNow=True) \
.toTable("bronze.raw_orders")
query.awaitTermination()
import dlt
from pyspark.sql import functions as F
# Bronze: raw ingestion with Auto Loader
@dlt.table(
name="bronze_orders",
comment="Raw order events from landing zone",
)
def bronze_orders():
return (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.inferColumnTypes", "true")
.load("/mnt/landing/orders/")
)
# Silver: cleaned, deduplicated, validated
@dlt.table(
name="silver_orders",
comment="Cleaned and validated orders",
)
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect_or_fail("valid_order_id", "order_id IS NOT NULL")
def silver_orders():
return (
dlt.read_stream("bronze_orders")
.dropDuplicates(["order_id"])
.withColumn("order_date", F.to_date("order_timestamp"))
.select(
"order_id", "customer_id", "product_id",
"amount", "order_date", "order_timestamp"
)
)
# Gold: business-level aggregate
@dlt.table(
name="gold_daily_revenue",
comment="Daily revenue by product",
)
def gold_daily_revenue():
return (
dlt.read("silver_orders")
.groupBy("product_id", "order_date")
.agg(
F.sum("amount").alias("total_revenue"),
F.count("order_id").alias("order_count"),
)
)
-- Create catalog and schema hierarchy
CREATE CATALOG IF NOT EXISTS analytics;
CREATE SCHEMA IF NOT EXISTS analytics.sales;
-- Create managed Delta table in Unity Catalog
CREATE TABLE analytics.sales.fact_orders (
order_id STRING NOT NULL,
customer_id STRING NOT NULL,
product_id STRING,
amount DECIMAL(12,2),
order_date DATE,
_loaded_at TIMESTAMP DEFAULT current_timestamp()
)
USING DELTA
CLUSTER BY (order_date, customer_id) -- Liquid clustering
TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported') -- required for the DEFAULT column value
COMMENT 'Fact table: customer orders';
-- Grant access to analyst group
GRANT SELECT ON TABLE analytics.sales.fact_orders TO data_analysts;
-- Row-level and column-level security via dynamic view
CREATE VIEW analytics.sales.v_orders_masked AS
SELECT
order_id,
CASE WHEN is_member('pii_access') THEN customer_id
ELSE sha2(customer_id, 256)
END AS customer_id,
amount,
order_date
FROM analytics.sales.fact_orders;
-- Create table with liquid clustering (replaces PARTITION BY + ZORDER)
CREATE TABLE analytics.events.clickstream (
event_id STRING,
user_id STRING,
page STRING,
event_ts TIMESTAMP,
event_date DATE GENERATED ALWAYS AS (CAST(event_ts AS DATE))
)
USING DELTA
CLUSTER BY (event_date, user_id); -- Liquid clustering columns
-- OPTIMIZE runs incremental clustering (no full rewrite)
OPTIMIZE analytics.events.clickstream;
-- Check clustering effectiveness
DESCRIBE DETAIL analytics.events.clickstream;
-- Vacuum old files (default retention: 7 days)
VACUUM analytics.events.clickstream RETAIN 168 HOURS;
# databricks.yml — Asset Bundle manifest
bundle:
name: daily_etl_pipeline
workspace:
  host: https://adb-1234567890.12.azuredatabricks.net

variables:
  node_type:
    description: Worker node type for job clusters
  sql_warehouse_id:
    description: SQL warehouse used by the dbt task
resources:
jobs:
daily_pipeline:
name: "Daily ETL — Bronze to Gold"
schedule:
quartz_cron_expression: "0 0 6 * * ?" # 6 AM UTC daily
timezone_id: UTC
tasks:
- task_key: bronze_ingestion
notebook_task:
notebook_path: ./notebooks/bronze_ingest.py
job_cluster_key: etl_cluster
- task_key: silver_transform
depends_on:
- task_key: bronze_ingestion
notebook_task:
notebook_path: ./notebooks/silver_transform.py
job_cluster_key: etl_cluster
- task_key: gold_aggregate
depends_on:
- task_key: silver_transform
notebook_task:
notebook_path: ./notebooks/gold_aggregate.py
job_cluster_key: etl_cluster
- task_key: dbt_run
depends_on:
- task_key: gold_aggregate
dbt_task:
project_directory: ./dbt_project
commands:
- "dbt run --select tag:daily"
- "dbt test --select tag:daily"
warehouse_id: ${var.sql_warehouse_id}
job_clusters:
- job_cluster_key: etl_cluster
new_cluster:
spark_version: 14.3.x-scala2.12
node_type_id: ${var.node_type}
autoscale:
min_workers: 2
max_workers: 8
targets:
dev:
mode: development
default: true
variables:
node_type: Standard_DS3_v2
sql_warehouse_id: abc123dev
prod:
mode: production
variables:
node_type: Standard_DS4_v2
sql_warehouse_id: xyz789prod
permissions:
- group_name: data-engineers
        level: CAN_MANAGE
- group_name: data-analysts
level: CAN_VIEW
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
# Create a cost-optimized job cluster config
# - Photon enabled for SQL-heavy workloads (2-8x speedup)
# - Spot instances for workers (60-80% cost reduction)
# - Auto-termination to prevent idle costs
cluster_config = {
"spark_version": "14.3.x-photon-scala2.12", # Photon-enabled runtime
"node_type_id": "Standard_E8ds_v5", # Memory-optimized for joins
"driver_node_type_id": "Standard_E8ds_v5", # On-demand driver (stable)
"autoscale": {"min_workers": 2, "max_workers": 16},
"azure_attributes": {
"first_on_demand": 1, # Driver on-demand, workers spot
"availability": "SPOT_WITH_FALLBACK_AZURE",
"spot_bid_max_price": -1, # Pay up to on-demand price
},
"spark_conf": {
"spark.sql.adaptive.enabled": "true",
"spark.databricks.delta.optimizeWrite.enabled": "true",
"spark.databricks.delta.autoCompact.enabled": "true",
},
"custom_tags": {
"team": "data-engineering",
"cost_center": "DE-2024-Q3",
"environment": "production",
},
}
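# Configuration only: pass this dict as the new_cluster block of a job task
# (for example via w.jobs.create) to actually provision the cluster.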
# List SQL Warehouses and their utilization
for wh in w.warehouses.list():
print(f"Warehouse: {wh.name}, Size: {wh.cluster_size}, "
f"Type: {'Serverless' if wh.enable_serverless_compute else 'Classic'}, "
f"State: {wh.state}")
| Feature | Databricks Lakehouse | Snowflake | Self-Managed Spark + Delta | Google BigQuery |
|---|---|---|---|---|
| Storage | Your cloud storage (S3/ADLS/GCS) + Delta format | Snowflake-managed storage | Your cloud storage + Delta | Google-managed storage |
| Compute | Managed Spark + Photon, SQL Warehouse | Virtual warehouses (auto-scaled) | EMR / Dataproc / Kubernetes | Serverless slots |
| Streaming | Structured Streaming + DLT (continuous) | Snowpipe (micro-batch) | Structured Streaming | Streaming inserts |
| ML | MLflow, Feature Store, Model Serving | Snowpark ML | MLlib + external tools | Vertex AI integration |
| Governance | Unity Catalog (lineage, RBAC, sharing) | Access control + RBAC | Custom (Apache Ranger, etc.) | IAM + column security |
| Open format | Delta Lake (open protocol) | Proprietary (Iceberg support added) | Delta / Iceberg / Hudi | Proprietary (export to Parquet) |
| Cost model | DBU-based (compute) + storage | Credit-based (compute) + storage | Cloud infra cost (EC2, etc.) | On-demand / slot-based |
| Best for | Lakehouse + ML + multicloud | SQL analytics + data sharing | Budget-conscious teams with Spark expertise | GCP-native serverless analytics |
Choose Databricks when: You need a lakehouse with unified batch + streaming + ML on open formats, and want managed Spark without the operational burden. Choose Snowflake when: Your workload is primarily SQL analytics, you value simplicity, and you don't need low-latency streaming or deep ML integration. Choose self-managed Spark when: You have strong platform engineering capabilities and want to minimize vendor lock-in at the expense of operational complexity. Choose BigQuery when: You're GCP-native and want serverless, pay-per-query analytics.
Forgetting to run VACUUM means storage costs grow unboundedly: a table written with 100 small MERGE operations accumulates hundreds of unreferenced Parquet files. Schedule VACUUM as part of your pipeline (or use Delta Lake's auto-VACUUM feature in newer runtimes), but never set retention below your streaming checkpoint interval or you'll corrupt streaming queries.

Partitioning by a high-cardinality column (e.g., user_id) creates millions of directories and tiny files — worse performance than no partitioning. In Databricks, prefer liquid clustering (CLUSTER BY), which handles multi-column data layout without directory explosion. If using traditional partitioning, only partition on low-cardinality columns (date, region) with at least 1 GB per partition.

When DLT's APPLY CHANGES maintains an SCD target, query the published target table rather than the internal __apply_changes_storage table that backs it. To diagnose file-size and clustering problems, start with DESCRIBE DETAIL and the Spark UI.

The transaction log is a sequence of JSON files in the _delta_log/ directory (e.g., 000000.json, 000001.json). Each commit atomically records: (1) files added (new Parquet files written), (2) files removed (old files logically deleted), (3) metadata changes (schema, properties). Reads reconstruct the table state by replaying the log from the last checkpoint. Writes use optimistic concurrency — the writer attempts to commit a new JSON file, and if another writer committed first, it retries (conflict resolution checks whether the same files were modified). Every 10 commits, a checkpoint file (Parquet format) summarizes the current state for faster reads. This design gives ACID guarantees without requiring a database server — just a cloud storage system with atomic rename/put operations.
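To see the log on disk, a small notebook sketch (the path and commit filename are illustrative):
# List commit JSON files and checkpoint files in the transaction log
for f in dbutils.fs.ls("s3://lakehouse/silver/dim_customers/_delta_log/"):
    print(f.name, f.size)
# Each commit is newline-delimited JSON of actions (add, remove, metaData, commitInfo)
commit = spark.read.json("s3://lakehouse/silver/dim_customers/_delta_log/00000000000000000000.json")
commit.printSchema()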
Liquid clustering uses Hilbert space-filling curves to incrementally cluster data at write time, avoiding the need for separate OPTIMIZE + ZORDER jobs. Advantages: (1) No partition directory explosion — data is organized by clustering keys without creating hundreds of directories. (2) Incremental — only newly written data is clustered, not the entire table. (3) No need to choose between partitioning and Z-ordering — liquid clustering handles both. (4) You can change clustering keys without rewriting the entire table (ALTER TABLE ... CLUSTER BY). Traditional PARTITION BY creates physical directories that are hard to change and don't compose well with Z-ordering (ZORDER only sorts within each partition).
readStream.format("parquet") on a directory?Auto Loader (cloudFiles) provides: (1) Efficient file discovery — uses cloud event notifications (S3 SQS, ADLS Event Grid) instead of directory listing, scaling to millions of files. A plain readStream must list the directory repeatedly. (2) Schema evolution — Auto Loader can infer, track, and evolve the schema as new columns appear in source files (schemaEvolutionMode). (3) Rescue data column — columns that don't match the schema are captured in a _rescued_data column instead of being silently dropped. (4) Exactly-once guarantees — checkpoints track which files have been processed. Plain readStream also tracks files but lacks the schema evolution and rescue features.
DLT expectations are declarative data quality constraints applied to table definitions. The three modes: (1) @dlt.expect("name", "condition") — records violating the condition are included in the output, but the violation is tracked in metrics (alert-only). (2) @dlt.expect_or_drop("name", "condition") — violating records are silently dropped; they're excluded from the output. (3) @dlt.expect_or_fail("name", "condition") — the pipeline fails immediately on violation. Use expect for soft rules (log warnings), expect_or_drop for data cleansing (quarantine bad rows), and expect_or_fail for critical invariants (e.g., primary keys must not be null).
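A compact sketch showing all three modes on one DLT table (table and rule names are illustrative):
import dlt
@dlt.table(name="silver_payments")
@dlt.expect("reasonable_amount", "amount < 100000")                   # track violations, keep rows
@dlt.expect_or_drop("non_negative_amount", "amount >= 0")             # drop violating rows
@dlt.expect_or_fail("payment_id_present", "payment_id IS NOT NULL")   # fail the pipeline
def silver_payments():
    return dlt.read_stream("bronze_payments")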
Unity Catalog uses a three-level namespace: catalog.schema.table. The catalog is the top-level container — typically maps to a domain, environment, or business unit (e.g., finance_prod, marketing_dev). The schema (database) groups related tables within a catalog. The table/view is the actual data object. This replaces the two-level namespace (database.table) of the legacy Hive metastore. Benefits for multi-team organizations: (1) Access control at any level — grant a team access to an entire catalog or specific schema. (2) Cross-workspace visibility — Unity Catalog operates at the account level, so teams in different workspaces can share data without copies. (3) Clear ownership — each catalog has an owner responsible for governance. (4) Name isolation — different teams can use the same schema/table names without conflict (e.g., finance.core.transactions vs marketing.core.transactions).
Databricks Asset Bundles (DABs) are a declarative, project-based deployment framework. A databricks.yml manifest defines jobs, DLT pipelines, permissions, and cluster configs alongside source code in Git. Key benefits: (1) Reproducibility — every deployment is derived from a specific Git commit; no manual workspace edits. (2) Environment parity — targets (dev, staging, prod) override variables (cluster sizes, catalog names) without code changes. (3) CI/CD native — databricks bundle validate and databricks bundle deploy can run in GitHub Actions or Azure DevOps pipelines. (4) Atomic deployment — jobs and DLT pipelines are created/updated together, avoiding partial states. DABs replace ad-hoc approaches like manually configuring jobs in the UI or using Terraform for Databricks resources (though Terraform remains valid for infrastructure). They are the recommended approach for all new Databricks projects.
Serverless SQL Warehouses are best for: BI dashboard queries (instant start, no idle cost), ad-hoc SQL exploration, dbt runs, and bursty workloads where utilization is unpredictable. They start in seconds, auto-scale based on query concurrency, and charge only for active compute. Classic clusters are better for: 24/7 streaming jobs (sustained compute is cheaper than serverless pricing), workloads requiring custom libraries or init scripts not supported on serverless, large-scale PySpark ETL with custom UDFs, and when spot instances can reduce costs 60-80%. Rule of thumb: if the workload is SQL-centric and bursty, use serverless; if it's Spark/Python-centric and sustained, use classic job clusters. Many organizations use both: serverless for analysts, classic for engineering pipelines.