Databricks

Summary

Databricks is a unified analytics platform built on Apache Spark that implements the lakehouse paradigm — combining the low-cost storage of data lakes with the reliability and performance features of data warehouses. Its core differentiators are Delta Lake (ACID transactions on object storage), Unity Catalog (unified governance across workspaces), Photon (a C++ vectorized query engine), and Databricks Workflows (native orchestration). For data engineering interviews, expect deep questions on Delta Lake internals, medallion architecture, cluster configuration, and how Databricks compares to self-managed Spark, Snowflake, or cloud-native ELT stacks. This guide covers Databricks Runtime 14+ features.

Table of Contents

Core Concepts

1. Delta Lake Internals — Transaction Log & Time Travel

Delta Lake stores data as Parquet files on object storage (S3, ADLS, GCS) with a transaction log (_delta_log/) that records every change as a series of JSON commits. Each commit is an atomic unit that references which Parquet files to add/remove. This log enables ACID transactions, schema enforcement, and time travel (querying historical table states by version or timestamp). Every 10 commits, a checkpoint (Parquet-format summary) is written to speed up log replay. The VACUUM command deletes data files no longer referenced by the log (older than the retention threshold, default 7 days). Key: Delta Lake is an open protocol — you can read/write Delta tables from Spark, Flink, Trino, Polars, or plain Python (delta-rs).
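The log-replay mechanics can be sketched in plain Python (a toy model, not the real Delta implementation; actual commits are zero-padded 20-digit JSON files carrying richer action metadata):

```python
# Toy simulation of Delta transaction-log replay (illustrative only).
# Each commit is an ordered list of add/remove actions, as in _delta_log/.
commits = [
    [{"add": {"path": "part-000.parquet"}}],                  # version 0
    [{"add": {"path": "part-001.parquet"}}],                  # version 1
    [{"remove": {"path": "part-000.parquet"}},                # version 2: compaction
     {"add": {"path": "part-002.parquet"}}],
]

def replay(commits):
    """Reconstruct the live file set by replaying add/remove actions in order."""
    live = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

# Time travel is just replaying a prefix of the log:
print(replay(commits[:2]))  # state as of version 1
print(replay(commits))      # current state
```

Checkpoints exist precisely because this replay gets expensive: a checkpoint materializes the accumulated file set every 10 commits so readers replay only the tail.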

2. OPTIMIZE, Z-Ordering & Liquid Clustering

OPTIMIZE compacts small files into larger ones (~1 GB target), solving the small-file problem. Z-Ordering (OPTIMIZE table ZORDER BY (col1, col2)) colocates related data by interleaving the bits of multiple columns along a Z-order curve, improving data skipping for multi-column filters. Limitation: Z-ordering is applied at OPTIMIZE time and degrades as new data arrives, requiring periodic re-optimization. Liquid Clustering (available from Databricks Runtime 13.3, GA in 15.2) replaces both partitioning and Z-ordering with an incremental, write-time clustering strategy based on Hilbert curves. It rebalances data automatically without full table rewrites. For new tables, Databricks recommends liquid clustering over traditional PARTITION BY + ZORDER.
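The bit-interleaving idea behind Z-ordering can be shown in a few lines of pure Python (a sketch of the Morton/Z-order curve, not Databricks' implementation):

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into a Z-order (Morton) value.
    Points close in (x, y) tend to be close in z, so sorting rows by z
    colocates data for filters on either or both columns."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x occupies even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y occupies odd bit positions
    return z

points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (3, 3)]
for p in sorted(points, key=lambda p: interleave_bits(*p)):
    print(p, interleave_bits(*p))
```

Sorting file contents by this value is what lets min/max statistics per file prune effectively on both columns, rather than only on the leading sort column.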

3. Unity Catalog — Unified Governance

Unity Catalog is Databricks' unified governance layer. It provides a three-level namespace: catalog.schema.table. Key features: (1) Centralized access control — GRANT/REVOKE on catalogs, schemas, tables, columns, and row filters. (2) Data lineage — automatic column-level lineage tracking across notebooks, jobs, and SQL queries. (3) Audit logging. (4) External locations & storage credentials — govern access to external cloud storage. (5) Delta Sharing — open protocol for cross-organization data sharing (read-only, no data duplication). Unity Catalog replaces the legacy workspace-level Hive metastore with an account-level metastore, enabling cross-workspace access and consistent governance.

4. Databricks Workflows & Delta Live Tables (DLT)

Workflows (formerly Jobs) orchestrate multi-task pipelines within Databricks. Tasks can be notebooks, Python scripts, SQL queries, dbt models, JAR jobs, or DLT pipelines. Tasks support dependencies, conditional execution (IF/ELSE tasks), for-each loops, and retry policies. Workflows use job clusters (ephemeral, cost-optimized — shut down after the job) or shared job clusters (reused across tasks in the same run). Delta Live Tables (DLT) is a declarative ETL framework where you define tables as SQL/Python queries and DLT handles dependency resolution, incremental processing, error handling (expectations for data quality), and automatic infrastructure management. DLT pipelines can operate in triggered (batch) or continuous (streaming) mode.

5. Cluster Configuration & Photon

Choosing the right cluster is critical for cost and performance. Key decisions: (1) All-Purpose clusters (interactive, long-running — for development) vs. Job clusters (ephemeral, cost-optimized — for production). (2) Worker type — memory-optimized for wide joins/caches, compute-optimized for CPU-heavy transforms. (3) Autoscaling — set min/max workers; scaling typically reacts within ~2-5 minutes. (4) Photon — Databricks' native C++ vectorized engine that accelerates SQL and DataFrame operations 2-8x over standard Spark. Photon is GA on Databricks Runtime 14+ and is most effective for scan-heavy, aggregation-heavy workloads. It is used automatically when enabled on a cluster — no code changes needed. (5) Spot instances — use for workers (not the driver) to reduce costs by 60-80%, with auto-fallback to on-demand.
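A back-of-envelope cost model makes decisions (1) and (5) concrete. All rates below are hypothetical placeholders, not actual Databricks or cloud pricing; substitute your own DBU rates and VM prices:

```python
# Back-of-envelope: all-purpose cluster vs. job cluster with spot workers.
# All rates are HYPOTHETICAL placeholders.
DBU_RATE_ALL_PURPOSE = 0.55   # $/DBU (hypothetical)
DBU_RATE_JOBS = 0.15          # $/DBU (hypothetical)
VM_ON_DEMAND = 0.50           # $/hr per node (hypothetical)
SPOT_DISCOUNT = 0.70          # 70% off on-demand for spot workers (hypothetical)

def hourly_cost(nodes, dbus_per_node_hr, dbu_rate, spot_workers=False):
    """Compute $/hr as DBU charges plus VM charges (driver stays on-demand)."""
    dbu_cost = nodes * dbus_per_node_hr * dbu_rate
    workers = nodes - 1
    vm_rate = VM_ON_DEMAND * (1 - SPOT_DISCOUNT) if spot_workers else VM_ON_DEMAND
    vm_cost = VM_ON_DEMAND + workers * vm_rate   # on-demand driver + workers
    return dbu_cost + vm_cost

all_purpose = hourly_cost(8, 2, DBU_RATE_ALL_PURPOSE)
job_spot = hourly_cost(8, 2, DBU_RATE_JOBS, spot_workers=True)
print(f"all-purpose: ${all_purpose:.2f}/hr, job+spot: ${job_spot:.2f}/hr")
```

Even with made-up numbers, the structure of the saving is visible: the lower jobs DBU rate and spot workers compound, which is where the "40-60% cheaper" rule of thumb for production job clusters comes from.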

6. Auto Loader & Structured Streaming on Databricks

Auto Loader (cloudFiles) is Databricks' recommended way to incrementally ingest files from cloud storage. It uses cloud event notifications (S3 SQS/SNS, ADLS Event Grid, GCS Pub/Sub) or directory listing to efficiently discover new files. Key advantages over a manual readStream.format("parquet"): (1) handles millions of files without listing overhead, (2) schema inference and evolution (cloudFiles.schemaEvolutionMode), with non-conforming fields captured in a _rescued_data column, (3) automatic checkpointing. Combined with DLT, Auto Loader powers the bronze layer of medallion architectures. For streaming, Databricks extends Structured Streaming with trigger(availableNow=True) (process all available data in micro-batches, then stop), rate limiting, and state store improvements (a RocksDB-based state store for large state).

7. Databricks Asset Bundles (DABs)

Databricks Asset Bundles (GA in 2024) are the recommended way to define, test, and deploy Databricks resources as code. A bundle is a project directory containing a databricks.yml manifest that declares jobs, DLT pipelines, ML experiments, and notebooks alongside the source code. Key concepts: (1) Targets — environment-specific configurations (dev, staging, prod) with variable overrides (cluster sizes, catalog names, permissions). (2) Deployment — databricks bundle deploy creates or updates all resources atomically using the Databricks API. (3) CI/CD integration — bundles work natively with GitHub Actions, Azure DevOps, and GitLab CI. Unlike workspace-level notebooks, bundles enforce version control (all code in Git) and reproducible deployments. (4) Templates — pre-built project templates (databricks bundle init) for common patterns: default Python, DLT, ML. Asset Bundles replace the older dbx tool and Terraform-only approaches with a Databricks-native, developer-friendly workflow.

8. Serverless Compute & SQL Warehouses

Databricks offers serverless compute across two surfaces: (1) Serverless SQL Warehouses — instantly available SQL endpoints that auto-scale and start in seconds (vs. minutes for classic warehouses). No cluster configuration needed. Ideal for BI dashboards (Tableau, Power BI), ad-hoc SQL, and dbt SQL models. Pricing is premium per-DBU but eliminates idle cluster costs. (2) Serverless Jobs Compute — run notebook and Python jobs without provisioning clusters. The platform manages infrastructure, auto-scales, and charges only for active computation time. Best practices: use serverless SQL warehouses as the default for all SQL workloads; reserve classic clusters only for long-running streaming jobs or custom library dependencies that serverless doesn't support. Cost comparison: serverless is cost-effective when utilization is bursty (e.g., dashboard queries); classic clusters are cheaper for sustained 24/7 compute.
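The bursty-vs-sustained tradeoff reduces to a simple breakeven calculation. The rates below are hypothetical, not real DBU prices:

```python
# Breakeven sketch: serverless SQL warehouse vs. always-on classic cluster.
# Rates are HYPOTHETICAL; plug in your actual DBU prices.
SERVERLESS_RATE = 0.70   # $/hr while actively serving queries (hypothetical)
CLASSIC_RATE = 0.22      # $/hr, but billed for every provisioned hour (hypothetical)

def monthly_cost(active_hours_per_day, serverless=True, days=30):
    """Serverless bills only active hours; classic bills around the clock."""
    if serverless:
        return SERVERLESS_RATE * active_hours_per_day * days
    return CLASSIC_RATE * 24 * days

for active in (2, 8, 16):
    s = monthly_cost(active)
    c = monthly_cost(active, serverless=False)
    winner = "serverless" if s < c else "classic"
    print(f"{active}h/day active: serverless ${s:.0f} vs classic ${c:.0f} -> {winner}")
```

With these illustrative rates, serverless wins at low utilization and classic wins once the warehouse is busy most of the day, which matches the guidance above: serverless for bursty dashboards, classic for sustained 24/7 pipelines.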

↑ Back to top

Industry Use Cases

1. Enterprise Lakehouse for a Financial Institution

A bank migrates from a Teradata data warehouse to a Databricks lakehouse on Azure. Bronze tables ingest CDC streams from Oracle (via Debezium → Kafka → Auto Loader) and flat files from SFTP. Silver tables implement SCD Type 2 dimensions using Delta Lake MERGE. Gold tables provide aggregate views for regulatory reporting. Unity Catalog governs column-level access (PII masking via dynamic views for analysts, full access for compliance team). Delta Sharing exposes risk metrics to external auditors without data duplication.

2. Real-Time ML Feature Platform

An ad-tech company uses Databricks as its feature computation engine. Structured Streaming jobs consume from Kafka topics (ad impressions, clicks, bids), compute 15-minute rolling aggregates per campaign, and write features to a Delta Lake feature store. A Databricks Workflow (scheduled hourly) triggers model retraining using these features. Model serving uses Databricks Model Serving (MLflow-backed). Feature tables are governed by Unity Catalog, with lineage from raw Kafka topics to model predictions.

3. Multi-Team Data Mesh on Databricks

A large retail organization implements a data mesh pattern using Unity Catalog's three-level namespace. Each domain team (supply chain, marketing, finance) owns a catalog: supply_chain.inventory.stock_levels. Data products are published as Delta tables with enforced schemas and SLAs tracked via DLT expectations. Cross-domain consumption uses GRANT statements at the catalog or schema level. Lineage in Unity Catalog shows how the marketing.campaigns.roi table depends on finance.transactions.revenue and supply_chain.inventory.costs.

4. Cost-Optimized ETL for a Startup

A Series A startup processes 2 TB/day of clickstream data with a lean team (2 data engineers). They use Databricks Workflows with serverless jobs compute to avoid cluster management entirely. Bronze ingestion runs Auto Loader every 15 minutes. Silver/gold transformations run as scheduled DLT pipelines on serverless compute. BI analysts connect directly to serverless SQL warehouses for Tableau dashboards — no classic clusters needed. Cost controls: (1) budgets in Unity Catalog limit per-catalog storage, (2) query tagging tracks consumption by team, (3) serverless auto-scaling eliminates over-provisioning. Total monthly cost: ~$3,000 vs. ~$8,000 with always-on classic clusters.

5. Regulatory Compliance in Pharma (GxP Validation)

A pharmaceutical company uses Databricks for clinical trial data processing under FDA 21 CFR Part 11 requirements. Unity Catalog provides audit logs for every data access. Delta Lake's time travel enables reproducible analysis — regulators can query the exact data used in a historical submission. Access control: column-level masking hides patient PII from non-authorized analysts. Databricks Asset Bundles enforce deployment through validated CI/CD pipelines — no manual notebook edits in production. Data is stored on ADLS with customer-managed encryption keys (CMEK). DLT expectations serve as automated validation checks replacing manual QC spreadsheets.

↑ Back to top

Code Examples

1. Delta Lake MERGE (Upsert / SCD Type 1)

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Target: existing dimension table in Delta
target = DeltaTable.forPath(spark, "s3://lakehouse/silver/dim_customers")

# Source: new/updated records from the bronze layer
source = spark.read.format("delta").load("s3://lakehouse/bronze/customers") \
    .where(F.col("load_date") == "2024-06-15")

# MERGE: update if exists, insert if new
target.alias("t").merge(
    source.alias("s"),
    "t.customer_id = s.customer_id"
).whenMatchedUpdate(
    condition="s.updated_at > t.updated_at",
    set={
        "name": "s.name",
        "email": "s.email",
        "updated_at": "s.updated_at",
    }
).whenNotMatchedInsertAll() \
 .execute()

# Verify with time travel — compare before and after
prev_version = spark.read.format("delta") \
    .option("versionAsOf", target.history(1).select("version").first()[0] - 1) \
    .load("s3://lakehouse/silver/dim_customers")

2. Auto Loader — Bronze Ingestion with Schema Evolution

# Auto Loader: incrementally ingest JSON files from cloud storage
bronze_df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "s3://checkpoints/bronze/orders/_schema") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .load("s3://raw-landing/orders/")

# Add ingestion metadata
bronze_with_meta = bronze_df \
    .withColumn("_ingestion_ts", F.current_timestamp()) \
    .withColumn("_source_file", F.input_file_name())

# Write to Delta Bronze table
query = bronze_with_meta.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://checkpoints/bronze/orders/") \
    .option("mergeSchema", "true") \
    .trigger(availableNow=True) \
    .toTable("bronze.raw_orders")

query.awaitTermination()

3. Delta Live Tables (DLT) — Medallion Pipeline

import dlt
from pyspark.sql import functions as F

# Bronze: raw ingestion with Auto Loader
@dlt.table(
    name="bronze_orders",
    comment="Raw order events from landing zone",
)
def bronze_orders():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.inferColumnTypes", "true")
            .load("/mnt/landing/orders/")
    )

# Silver: cleaned, deduplicated, validated
@dlt.table(
    name="silver_orders",
    comment="Cleaned and validated orders",
)
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect_or_fail("valid_order_id", "order_id IS NOT NULL")
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
            .dropDuplicates(["order_id"])
            .withColumn("order_date", F.to_date("order_timestamp"))
            .select(
                "order_id", "customer_id", "product_id",
                "amount", "order_date", "order_timestamp"
            )
    )

# Gold: business-level aggregate
@dlt.table(
    name="gold_daily_revenue",
    comment="Daily revenue by product",
)
def gold_daily_revenue():
    return (
        dlt.read("silver_orders")
            .groupBy("product_id", "order_date")
            .agg(
                F.sum("amount").alias("total_revenue"),
                F.count("order_id").alias("order_count"),
            )
    )

4. Unity Catalog — Table Governance

-- Create catalog and schema hierarchy
CREATE CATALOG IF NOT EXISTS analytics;
CREATE SCHEMA IF NOT EXISTS analytics.sales;

-- Create managed Delta table in Unity Catalog
CREATE TABLE analytics.sales.fact_orders (
    order_id     STRING NOT NULL,
    customer_id  STRING NOT NULL,
    product_id   STRING,
    amount       DECIMAL(12,2),
    order_date   DATE,
    _loaded_at   TIMESTAMP DEFAULT current_timestamp()
)
USING DELTA
CLUSTER BY (order_date, customer_id)  -- Liquid clustering
TBLPROPERTIES ('delta.feature.allowColumnDefaults' = 'supported')  -- required for DEFAULT
COMMENT 'Fact table: customer orders';

-- Grant access to analyst group
GRANT SELECT ON TABLE analytics.sales.fact_orders TO data_analysts;

-- Row-level and column-level security via dynamic view
CREATE VIEW analytics.sales.v_orders_masked AS
SELECT
    order_id,
    CASE WHEN is_member('pii_access') THEN customer_id
         ELSE sha2(customer_id, 256)
    END AS customer_id,
    amount,
    order_date
FROM analytics.sales.fact_orders;

5. Liquid Clustering & OPTIMIZE

-- Create table with liquid clustering (replaces PARTITION BY + ZORDER)
CREATE TABLE analytics.events.clickstream (
    event_id    STRING,
    user_id     STRING,
    page        STRING,
    event_ts    TIMESTAMP,
    event_date  DATE GENERATED ALWAYS AS (CAST(event_ts AS DATE))
)
USING DELTA
CLUSTER BY (event_date, user_id);  -- Liquid clustering columns

-- OPTIMIZE runs incremental clustering (no full rewrite)
OPTIMIZE analytics.events.clickstream;

-- Check clustering effectiveness
DESCRIBE DETAIL analytics.events.clickstream;

-- Vacuum old files (default retention: 7 days)
VACUUM analytics.events.clickstream RETAIN 168 HOURS;

6. Databricks Asset Bundles — Project Configuration

# databricks.yml — Asset Bundle manifest
bundle:
  name: daily_etl_pipeline

workspace:
  host: https://adb-1234567890.12.azuredatabricks.net

resources:
  jobs:
    daily_pipeline:
      name: "Daily ETL — Bronze to Gold"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"   # 6 AM UTC daily
        timezone_id: UTC
      tasks:
        - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./notebooks/bronze_ingest.py
          job_cluster_key: etl_cluster

        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingestion
          notebook_task:
            notebook_path: ./notebooks/silver_transform.py
          job_cluster_key: etl_cluster

        - task_key: gold_aggregate
          depends_on:
            - task_key: silver_transform
          notebook_task:
            notebook_path: ./notebooks/gold_aggregate.py
          job_cluster_key: etl_cluster

        - task_key: dbt_run
          depends_on:
            - task_key: gold_aggregate
          dbt_task:
            project_directory: ./dbt_project
            commands:
              - "dbt run --select tag:daily"
              - "dbt test --select tag:daily"
            warehouse_id: ${var.sql_warehouse_id}

      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: ${var.node_type}
            autoscale:
              min_workers: 2
              max_workers: 8

targets:
  dev:
    mode: development
    default: true
    variables:
      node_type: Standard_DS3_v2
      sql_warehouse_id: abc123dev

  prod:
    mode: production
    variables:
      node_type: Standard_DS4_v2
      sql_warehouse_id: xyz789prod
    permissions:
      - group_name: data-engineers
        level: CAN_MANAGE_RUN
      - group_name: data-analysts
        level: CAN_VIEW

7. Cost Optimization — Cluster Selection & Spot Instances

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Create a cost-optimized job cluster config
# - Photon enabled for SQL-heavy workloads (2-8x speedup)
# - Spot instances for workers (60-80% cost reduction)
# - Auto-termination to prevent idle costs

cluster_config = {
    "spark_version": "14.3.x-photon-scala2.12",  # Photon-enabled runtime
    "node_type_id": "Standard_E8ds_v5",          # Memory-optimized for joins
    "driver_node_type_id": "Standard_E8ds_v5",   # On-demand driver (stable)
    "autoscale": {"min_workers": 2, "max_workers": 16},
    "azure_attributes": {
        "first_on_demand": 1,              # Driver on-demand, workers spot
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1,          # Pay up to on-demand price
    },
    "spark_conf": {
        "spark.sql.adaptive.enabled": "true",
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true",
    },
    "custom_tags": {
        "team": "data-engineering",
        "cost_center": "DE-2024-Q3",
        "environment": "production",
    },
}

# List SQL Warehouses and their utilization
for wh in w.warehouses.list():
    print(f"Warehouse: {wh.name}, Size: {wh.cluster_size}, "
          f"Type: {'Serverless' if wh.enable_serverless_compute else 'Classic'}, "
          f"State: {wh.state}")
↑ Back to top

Comparison / When to Use

| Feature | Databricks Lakehouse | Snowflake | Self-Managed Spark + Delta | Google BigQuery |
|---|---|---|---|---|
| Storage | Your cloud storage (S3/ADLS/GCS) + Delta format | Snowflake-managed storage | Your cloud storage + Delta | Google-managed storage |
| Compute | Managed Spark + Photon, SQL Warehouses | Virtual warehouses (auto-scaled) | EMR / Dataproc / Kubernetes | Serverless slots |
| Streaming | Structured Streaming + DLT (continuous) | Snowpipe (micro-batch) | Structured Streaming | Streaming inserts |
| ML | MLflow, Feature Store, Model Serving | Snowpark ML | MLlib + external tools | Vertex AI integration |
| Governance | Unity Catalog (lineage, RBAC, sharing) | Access control + RBAC | Custom (Apache Ranger, etc.) | IAM + column security |
| Open format | Delta Lake (open protocol) | Proprietary (Iceberg support added) | Delta / Iceberg / Hudi | Proprietary (export to Parquet) |
| Cost model | DBU-based (compute) + storage | Credit-based (compute) + storage | Cloud infra cost (EC2, etc.) | On-demand / slot-based |
| Best for | Lakehouse + ML + multicloud | SQL analytics + data sharing | Budget-conscious teams with Spark expertise | GCP-native serverless analytics |

Choose Databricks when: You need a lakehouse with unified batch + streaming + ML on open formats, and want managed Spark without the operational burden. Choose Snowflake when: Your workload is primarily SQL analytics, you value simplicity, and you don't need low-latency streaming or deep ML integration. Choose self-managed Spark when: You have strong platform engineering capabilities and want to minimize vendor lock-in at the expense of operational complexity. Choose BigQuery when: You're GCP-native and want serverless, pay-per-query analytics.

↑ Back to top

Gotchas & Anti-patterns

  1. Using all-purpose clusters for production jobs. All-purpose clusters are expensive (higher DBU rate) and long-running. Production workloads should use job clusters that spin up for the job and terminate immediately after, reducing costs by 40-60%. Reserve all-purpose clusters for interactive development only.
  2. Forgetting to VACUUM. Delta Lake's time travel retains historical data files. Without regular VACUUM, storage costs grow without bound: a table written with 100 small MERGE operations accumulates hundreds of unreferenced Parquet files. Schedule VACUUM as part of your pipeline (or enable predictive optimization, which runs maintenance like VACUUM automatically on newer runtimes). But never set retention shorter than your longest-running query or time-travel window, or readers may fail when files they still reference are deleted.
  3. Over-partitioning Delta tables. Partitioning by high-cardinality columns (e.g., user_id) creates millions of directories and tiny files — worse performance than no partitioning. In Databricks, prefer liquid clustering (CLUSTER BY) which handles multi-column data layout without directory explosion. If using traditional partitioning, only partition on low-cardinality columns (date, region) with at least 1 GB per partition.
  4. Not enabling Photon for SQL workloads. Photon accelerates SQL and DataFrame operations 2-8x. Photon-enabled compute bills DBUs at a higher rate, but for scan- and aggregation-heavy workloads the speedup usually more than offsets the premium, lowering total cost. Teams that leave Photon disabled on job clusters miss significant performance gains. Check the query plan or Spark UI to verify Photon operators are actually being used — Photon falls back to standard Spark for unsupported operations (e.g., many Python UDFs).
  5. Ignoring Unity Catalog for governance. Legacy workspace-level Hive metastore has no cross-workspace visibility, no lineage, and limited access control. Organizations that delay Unity Catalog adoption accumulate ungoverned data assets that become costly to retrofit. Migrate early — Unity Catalog is now the default for new workspaces and supports in-place migration from legacy metastore.
  6. Not using Asset Bundles for production deployments. Deploying production jobs by manually editing notebooks in the workspace breaks reproducibility. When a job fails, there's no Git history to review or revert. Databricks Asset Bundles enforce code-in-Git, environment-specific configurations, and atomic deployments. Combined with CI/CD (GitHub Actions, Azure DevOps), they prevent configuration drift between environments.
  7. Over-sizing SQL Warehouses. Starting with a 2X-Large SQL Warehouse "to be safe" wastes significant DBUs. Serverless SQL Warehouses auto-scale instantly — start with Small and let concurrency-based scaling add capacity as needed. For classic warehouses, enable auto-stop after 10-15 minutes of inactivity. Profile actual query patterns before sizing: most dashboard queries need Small; only heavy ETL or concurrent BI workloads justify Medium+.
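The arithmetic behind gotcha 3 (over-partitioning) is worth making explicit. The numbers below are illustrative assumptions, not measurements:

```python
# Why high-cardinality partitioning explodes: partition counts vs. file sizes.
# Numbers are illustrative assumptions.
TABLE_BYTES = 2 * 1024**4          # a 2 TiB table
DAYS = 365

def avg_file_mb(partition_count, files_per_partition=1):
    """Average file size in MB if data is spread evenly across partitions."""
    return TABLE_BYTES / (partition_count * files_per_partition) / 1024**2

date_only = DAYS                       # PARTITION BY (event_date): healthy
date_and_user = DAYS * 1_000_000       # PARTITION BY (event_date, user_id): anti-pattern

print(f"date only:   {date_only:>12,} partitions, ~{avg_file_mb(date_only):,.0f} MB/file")
print(f"date + user: {date_and_user:>12,} partitions, "
      f"~{avg_file_mb(date_and_user) * 1024:,.3f} KB/file")
```

Partitioning by date alone yields multi-GB partitions (well above the 1 GB guideline); adding a million-user dimension drives average file size down to kilobytes, where per-file open and listing overhead dominates scan time.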
↑ Back to top

Exercises

  1. Build a full medallion pipeline with DLT. Create a DLT pipeline where: (1) the bronze layer uses Auto Loader to ingest JSON/CSV files from a landing zone, (2) the silver layer deduplicates, validates (add 3+ expectations), and cleans the data, (3) the gold layer creates two aggregate tables (e.g., daily revenue by product, customer lifetime value). Deploy as a Databricks Workflow on a schedule. Verify in the DLT event log that each expectation tracks or drops bad records as configured.
  2. Implement SCD Type 2 with Delta Lake MERGE. Create a customer dimension with columns: customer_id, name, email, effective_date, end_date, is_current. Load an initial dataset. Then simulate 3 updates (email changes) and verify: old rows get end_date set and is_current=false, new rows are inserted with is_current=true. Query the table as of the previous version to show time travel works.
  3. Compare liquid clustering vs. partitioning + ZORDER. Create two copies of a large dataset (~10M rows with event_date, user_id, event_type). Table A: PARTITION BY (event_date) + ZORDER BY (user_id). Table B: CLUSTER BY (event_date, user_id). Run the same 5 analytical queries on both (filter by date range, filter by user_id, filter by both). Compare query times, files scanned, and bytes read using DESCRIBE DETAIL and the Spark UI.
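For exercise 2, the expected SCD Type 2 behavior can be prototyped in plain Python before writing the Delta MERGE (an in-memory sketch where a list of dicts stands in for the dimension table):

```python
from datetime import date

def scd2_upsert(dim, customer_id, email, effective_date):
    """SCD Type 2: close out the current version on change, insert a new one."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["email"] == email:
                return dim                     # no change, no new version
            row["end_date"] = effective_date   # close out the old version
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "email": email,
                "effective_date": effective_date, "end_date": None,
                "is_current": True})
    return dim

dim = []
scd2_upsert(dim, "c1", "a@x.com", date(2024, 1, 1))
scd2_upsert(dim, "c1", "b@x.com", date(2024, 3, 1))   # email change -> new version
current = [r for r in dim if r["is_current"]]
print(len(dim), current[0]["email"])   # 2 b@x.com
```

The Delta version expresses the same logic as one MERGE plus an insert of the changed rows: matched-and-changed rows get end_date set and is_current=false, and every changed or new key is inserted as a fresh current row.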
↑ Back to top

Quiz

Q1: Explain how Delta Lake's transaction log works and why it enables ACID transactions on object storage.

The transaction log is a sequence of JSON files in the _delta_log/ directory (zero-padded names like 00000000000000000000.json, 00000000000000000001.json). Each commit atomically records: (1) files added (new Parquet files written), (2) files removed (old files logically deleted), (3) metadata changes (schema, properties). Reads reconstruct the table state by replaying the log from the last checkpoint. Writes use optimistic concurrency: the writer attempts to commit the next numbered JSON file, and if another writer committed first, it retries after checking for conflicts (e.g., whether the same files were modified). Every 10 commits, a checkpoint file (Parquet format) summarizes the current state for faster reads. This design gives ACID guarantees without a database server; it needs only a storage system with atomic put-if-absent semantics (object stores lacking them historically required a commit coordination service, as with older S3).

Q2: What is liquid clustering and why does Databricks recommend it over PARTITION BY + ZORDER?

Liquid clustering uses Hilbert space-filling curves to incrementally cluster data at write time, avoiding the need for separate OPTIMIZE + ZORDER jobs. Advantages: (1) No partition directory explosion — data is organized by clustering keys without creating hundreds of directories. (2) Incremental — only newly written data is clustered, not the entire table. (3) No need to choose between partitioning and Z-ordering — liquid clustering handles both. (4) You can change clustering keys without rewriting the entire table (ALTER TABLE ... CLUSTER BY). Traditional PARTITION BY creates physical directories that are hard to change and don't compose well with Z-ordering (ZORDER only sorts within each partition).

Q3: How does Auto Loader differ from a plain readStream.format("parquet") on a directory?

Auto Loader (cloudFiles) provides: (1) Efficient file discovery — uses cloud event notifications (S3 SQS, ADLS Event Grid) instead of directory listing, scaling to millions of files. A plain readStream must list the directory repeatedly. (2) Schema evolution — Auto Loader can infer, track, and evolve the schema as new columns appear in source files (schemaEvolutionMode). (3) Rescue data column — columns that don't match the schema are captured in a _rescued_data column instead of being silently dropped. (4) Exactly-once guarantees — checkpoints track which files have been processed. Plain readStream also tracks files but lacks the schema evolution and rescue features.

Q4: What are DLT expectations and how do the three modes (expect, expect_or_drop, expect_or_fail) differ?

DLT expectations are declarative data quality constraints applied to table definitions. The three modes: (1) @dlt.expect("name", "condition") — records violating the condition are kept in the output, but violations are tracked in metrics (alert-only). (2) @dlt.expect_or_drop("name", "condition") — violating records are dropped from the output; the drop counts still appear in pipeline metrics. (3) @dlt.expect_or_fail("name", "condition") — the pipeline update fails on the first violation. Use expect for soft rules (log warnings), expect_or_drop for data cleansing (discard bad rows), and expect_or_fail for critical invariants (e.g., primary keys must not be null).
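The three modes can be modeled in a few lines of plain Python (a toy model of the semantics, not the dlt API itself):

```python
# Toy model of DLT expectation modes:
# expect = keep and count, expect_or_drop = filter out, expect_or_fail = raise.
def apply_expectation(rows, predicate, mode="expect"):
    violations = [r for r in rows if not predicate(r)]
    if mode == "expect":
        return rows, len(violations)          # pass through, record the metric
    if mode == "expect_or_drop":
        kept = [r for r in rows if predicate(r)]
        return kept, len(violations)          # cleanse by dropping bad rows
    if mode == "expect_or_fail":
        if violations:
            raise ValueError(f"{len(violations)} rows violated the expectation")
        return rows, 0
    raise ValueError(f"unknown mode: {mode}")

orders = [{"order_id": 1, "amount": 50}, {"order_id": 2, "amount": -5}]
valid_amount = lambda r: r["amount"] > 0

print(apply_expectation(orders, valid_amount, "expect")[1])               # 1
print(len(apply_expectation(orders, valid_amount, "expect_or_drop")[0]))  # 1
```

In real DLT, the violation counts land in the pipeline event log rather than a return value, but the keep/drop/fail decision tree is the same.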

Q5: Explain the three-level namespace in Unity Catalog and why it matters for multi-team organizations.

Unity Catalog uses: catalog.schema.table. The catalog is the top-level container — typically maps to a domain, environment, or business unit (e.g., finance_prod, marketing_dev). The schema (database) groups related tables within a catalog. The table/view is the actual data object. This replaces the two-level namespace (database.table) of the legacy Hive metastore. Benefits for multi-team organizations: (1) Access control at any level — grant a team access to an entire catalog or specific schema. (2) Cross-workspace visibility — Unity Catalog operates at the account level, so teams in different workspaces can share data without copies. (3) Clear ownership — each catalog has an owner responsible for governance. (4) Name isolation — different teams can use the same schema/table names without conflict (e.g., finance.core.transactions vs marketing.core.transactions).

Q6: What are Databricks Asset Bundles and how do they improve production workflows?

Databricks Asset Bundles (DABs) are a declarative, project-based deployment framework. A databricks.yml manifest defines jobs, DLT pipelines, permissions, and cluster configs alongside source code in Git. Key benefits: (1) Reproducibility — every deployment is derived from a specific Git commit; no manual workspace edits. (2) Environment parity — targets (dev, staging, prod) override variables (cluster sizes, catalog names) without code changes. (3) CI/CD native — databricks bundle validate and databricks bundle deploy can run in GitHub Actions or Azure DevOps pipelines. (4) Atomic deployment — jobs and DLT pipelines are created/updated together, avoiding partial states. DABs replace ad-hoc approaches like manually configuring jobs in the UI or using Terraform for Databricks resources (though Terraform remains valid for infrastructure). They are the recommended approach for all new Databricks projects.

Q7: When would you choose serverless SQL warehouses over classic clusters, and vice versa?

Serverless SQL Warehouses are best for: BI dashboard queries (instant start, no idle cost), ad-hoc SQL exploration, dbt runs, and bursty workloads where utilization is unpredictable. They start in seconds, auto-scale based on query concurrency, and charge only for active compute. Classic clusters are better for: 24/7 streaming jobs (sustained compute is cheaper than serverless pricing), workloads requiring custom libraries or init scripts not supported on serverless, large-scale PySpark ETL with custom UDFs, and when spot instances can reduce costs 60-80%. Rule of thumb: if the workload is SQL-centric and bursty, use serverless; if it's Spark/Python-centric and sustained, use classic job clusters. Many organizations use both: serverless for analysts, classic for engineering pipelines.

↑ Back to top

Further Reading

↑ Back to top