Enterprise data engineering is fundamentally different from startup or small-team work. You deal with organizational complexity as much as technical complexity: data governance committees, change advisory boards, security reviews, SOX compliance, multi-team dependencies, legacy system migrations, and vendor management. This guide covers the non-technical and semi-technical skills that separate a senior enterprise data engineer from a mid-level one. These topics appear frequently in behavioral interviews and system design rounds at large companies. You won't find these in a PySpark tutorial — but they determine whether your data platform actually delivers value or becomes shelfware. Assumes you already have the technical foundations covered in the other guides.
How a company organizes its data engineers determines what you'll build and how you'll operate:
**Centralized**: One team owns all data infrastructure and pipelines. Business units submit requests via tickets. Good for: small-to-medium companies, ensuring consistency. Risks: becomes a bottleneck; the team lacks domain expertise. You spend more time in requirements meetings than writing code.
**Embedded**: Data engineers sit within business units (e.g., one DE on the marketing team, two on finance). Good for: deep domain knowledge, fast iteration within a team. Risks: inconsistent tooling, duplicated infrastructure, data silos. Without a central platform, teams build things in incompatible ways.
**Hub-and-Spoke**: A central Data Platform Team owns shared infrastructure (Spark clusters, warehouse, orchestration, CI/CD, data catalog) while domain teams (sometimes called "squads") own domain-specific pipelines and data models. The platform team provides tools and guardrails; domain teams have autonomy within those guardrails. This is the model most large enterprises converge on, and it aligns with Data Mesh principles (domain ownership, data as a product, self-serve platform, federated governance).
In enterprises, governance isn't optional — it's often the first thing mentioned in interviews. Key pillars:
| Regulation | Scope | Key Requirements for Data Engineers |
|---|---|---|
| GDPR (EU) | EU residents' personal data | Right to erasure (delete pipelines), data portability (export in machine-readable format), legal basis for processing, DPIAs for high-risk processing, 72-hour breach notification |
| CCPA/CPRA (California) | California residents' personal data | Right to opt-out of data sale/sharing, right to deletion, data retention limits, data inventory requirements |
| HIPAA (US Healthcare) | Protected Health Information | Encryption at rest and in transit, audit logs of all access, minimum necessary access, BAA with all vendors, de-identification standards |
| SOX (US Finance) | Financial reporting data | Immutable audit trails, change management controls, separation of duties, data lineage from source to report |
| LGPD (Brazil) | Brazilian residents' personal data | Similar to GDPR: consent management, DPO appointment, data minimization, 72-hour breach notification |
Role-Based Access Control (RBAC): Assign permissions to roles, assign roles to groups, assign users to groups. Example: data_engineer_prod role → read/write on warehouse, analyst_marketing role → read-only on marketing schema. Attribute-Based Access Control (ABAC): Dynamic policies based on attributes (user department, data classification level, purpose). Example: "Analysts in EMEA can see EU customer PII; analysts in APAC see masked PII." In practice, enterprises use RBAC as the foundation with ABAC-style policies for fine-grained access (Unity Catalog, Apache Ranger, Purview).
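To make the contrast concrete, here is a minimal PySpark sketch of an ABAC-style policy applied in code. In practice the policy would live in the catalog or policy engine (Unity Catalog, Ranger) rather than in pipeline code; the UserContext attributes, column name, and salt below are hypothetical.
# access_control/abac_sketch.py (illustrative only; real enforcement belongs in the catalog/policy engine)
from dataclasses import dataclass
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, concat, lit, sha2

@dataclass
class UserContext:
    department: str       # e.g. "analytics"
    region: str           # e.g. "EMEA"
    pii_authorized: bool  # clearance attribute granted by security

def read_customers(df: DataFrame, user: UserContext) -> DataFrame:
    """ABAC-style policy: unmask email only for EMEA users with PII clearance."""
    if user.region == "EMEA" and user.pii_authorized:
        return df
    # Everyone else sees a salted hash of the email (salt is a placeholder value)
    return df.withColumn("email", sha2(concat(col("email"), lit("demo_salt")), 256))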
Enterprise CI/CD for data is different from application CI/CD:
| Environment | Purpose | Data | Access |
|---|---|---|---|
| Dev | Individual development, experimentation | Sample data or anonymized prod subset | Individual developer |
| CI | Automated testing on PRs | Test fixtures or ephemeral prod sample | CI service account only |
| Staging | Pre-prod validation, integration testing | Production-like (anonymized copy) | Data team |
| Production | Live data processing | Real data | Service accounts + on-call engineers |
Cloud data platforms can become extremely expensive without discipline. FinOps (Financial Operations) is the practice of managing cloud costs collaboratively:
Every data platform should have a cost dashboard tracking: daily/weekly spend by team, cost per pipeline/job, compute utilization rate, storage growth rate, and cost anomaly alerts (>20% week-over-week increase).
Enterprise data platforms need the same operational maturity as production applications:
Define SLAs (Service Level Agreements) for critical data products:
| Data Product | Freshness SLA | Quality SLA | Availability SLA |
|---|---|---|---|
| fct_orders (Gold) | By 06:00 UTC daily | 99.9% test pass rate | 99.5% uptime |
| Real-time events | < 5 min latency | 99.5% schema compliance | 99.9% uptime |
| ML feature store | By 04:00 UTC daily | 100% not-null on required features | 99.5% uptime |
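A minimal sketch of how a freshness SLA like the ones above could be checked programmatically. The table name, timestamp column, and the assumption that the Spark session timezone is UTC are all placeholders.
# sla_checks/freshness_check.py (illustrative sketch)
from datetime import datetime, timedelta, timezone
from pyspark.sql import SparkSession
from pyspark.sql.functions import max as spark_max

def meets_freshness_sla(spark: SparkSession, table: str, ts_col: str,
                        max_staleness: timedelta) -> bool:
    """True if the newest row in `table` is within the allowed staleness window."""
    latest = spark.table(table).agg(spark_max(ts_col).alias("latest")).collect()[0]["latest"]
    if latest is None:
        return False  # an empty table always violates the SLA
    now = datetime.now(timezone.utc).replace(tzinfo=None)  # Spark returns naive timestamps
    return (now - latest) <= max_staleness

# e.g. at the 06:00 UTC check: meets_freshness_sla(spark, "gold.fct_orders", "loaded_at", timedelta(hours=24))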
In an enterprise, undocumented pipelines are liabilities. When the original developer leaves, tribal knowledge evaporates. Effective documentation:
The best documentation lives alongside the code: dbt model descriptions in YAML (auto-generates a documentation site), Airflow DAG docstrings, Terraform resource descriptions, README.md files in each pipeline directory. Generated documentation (dbt docs, Swagger/OpenAPI for data APIs) is always up-to-date because it's derived from the code itself.
Enterprise data engineers frequently lead migrations: on-prem to cloud, legacy to modern stack, one cloud to another. This is where careers are made (or broken).
The Strangler Fig pattern is named after fig trees that grow around and eventually replace their host. Instead of a big-bang migration: (1) Build new capabilities in the new system. (2) Route traffic incrementally from old to new. (3) Run both systems in parallel during the transition. (4) Decommission the old system when the new system handles all workloads. This is the safest pattern because the old system still works if the new system hits problems.
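During the parallel-run phase, a reconciliation job comparing legacy and new outputs is what builds the confidence to cut over. A minimal sketch, assuming both systems publish tables Spark can read and that a row count plus an amount checksum is enough to detect divergence (table and column names are hypothetical):
# migration/parallel_run_reconciliation.py (illustrative sketch)
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, round as sround, sum as ssum

def reconcile(spark: SparkSession, legacy_table: str, new_table: str,
              amount_col: str = "total_amount") -> dict:
    """Compare row count and an amount checksum between the old and new pipelines."""
    def summarize(table: str) -> tuple:
        row = (spark.table(table)
                    .agg(count("*").alias("rows"),
                         sround(ssum(amount_col), 2).alias("amount_sum"))
                    .collect()[0])
        return (row["rows"], row["amount_sum"])

    legacy, new = summarize(legacy_table), summarize(new_table)
    # Cut over only after the outputs match for several consecutive runs
    return {"legacy": legacy, "new": new, "match": legacy == new}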
Technical skill gets you in the door; stakeholder management determines your impact and career trajectory in an enterprise.
| Audience | What They Care About | How to Communicate |
|---|---|---|
| Executive leadership | Business impact, risk, cost, timeline | 1-page summary, no jargon, focus on outcomes: "Reduces data delivery time from 24h to 1h, enabling same-day marketing decisions" |
| Product managers | Feature delivery, data availability, reliability | Clear ETAs, honest trade-off discussions: "We can deliver in 2 weeks with daily freshness or 4 weeks with real-time, which do you need?" |
| Analysts / Data scientists | Data quality, discoverability, documentation | Technical details welcome; focus on "how do I use this?" not "how did we build it?" |
| Security / Compliance | Access controls, PII handling, audit trails | Be proactive: show them your controls before they ask. Provide documentation, architecture diagrams, DPIAs |
| Other engineering teams | API contracts, latency, reliability, breaking changes | Treat data products like APIs: versioned contracts, deprecation notices, migration guides |
A hospital network (15 hospitals, 50,000 employees) migrates from on-prem SQL Server SSIS pipelines to Azure cloud. Key challenges: (1) HIPAA requires PHI encryption at rest/transit, access logging, minimum necessary access, and a Business Associate Agreement with Microsoft. (2) 300+ SSIS packages must be inventoried, prioritized, and migrated to ADF + Databricks. (3) Clinical staff are accustomed to same-day reports from legacy systems — downtime SLAs are measured in minutes. The data platform team uses the Strangler Fig pattern: new workloads go to cloud-native ADF/Databricks, existing SSIS packages are migrated in waves over 18 months. A dedicated "migration squad" of 3 DEs handles the conversion. Data is classified into 4 tiers, with PHI data routed through private endpoints and column-level masking. The migration is tracked via a portfolio dashboard showing: % pipelines migrated, cost comparison (on-prem vs. cloud), incidents per month, data quality KPIs.
A bank's data warehouse (built on Teradata in 2008) needs modernization — Teradata licensing costs $5M/year. The data engineering team proposes Snowflake + dbt. Challenge: SOX compliance requires immutable audit trails for all financial reporting data, separation of duties (the person who writes pipeline code can't approve it to production), and documented data lineage from source system to regulatory report. Solution: (1) dbt model contracts enforce output schemas match regulatory specifications. (2) GitHub branch protection rules enforce separation of duties — the author can't merge their own PR. (3) dbt's auto-generated documentation + Snowflake's ACCESS_HISTORY provide full lineage. (4) All Snowflake data is on Time Travel (90 days) + Fail-safe (7 days) for audit recovery. (5) A Change Advisory Board reviews all production releases weekly. The migration takes 2 years and saves $3.5M/year in licensing alone.
A retail chain (500 stores) needs real-time inventory visibility and demand forecasting. The data platform team runs a hub-and-spoke model: a central platform team manages Kafka, Databricks, and Airflow; domain teams (supply chain, merchandising, e-commerce) own their pipelines. Key decisions: (1) Kafka ingests POS (point-of-sale) transactions from all 500 stores in real-time (~50,000 events/sec). (2) A Spark Structured Streaming job maintains a real-time inventory view in Delta Lake. (3) A nightly Airflow DAG trains demand forecasting ML models on historical data. (4) Cost management: the platform team implements chargeback based on Databricks DBU consumption per team — supply chain sees their $X/month and optimizes their Spark jobs accordingly. (5) On-call rotation: the platform team handles infrastructure alerts; domain teams handle data quality alerts for their pipelines.
A 10,000-person enterprise implements Data Mesh across 8 domains (orders, customers, inventory, finance, HR, marketing, logistics, product). Each domain team owns: their source systems, ingestion pipelines, transformation logic (dbt), and published data products. The central data platform team provides: a self-serve data infrastructure (Databricks, dbt Cloud, Airflow), a federated governance framework (shared naming conventions, quality standards, PII handling policies), data discovery (Purview/DataHub catalog), and cross-domain metrics (dbt Semantic Layer). Challenges that arise: (1) Domain teams have uneven maturity — the orders team has 3 experienced DEs, the HR team has 1 junior DE. Solution: platform team provides "starter kits" (template dbt projects, CI/CD pipelines) and office hours. (2) Cross-domain data quality — when the orders domain's data is wrong, it cascades to finance. Solution: model contracts at domain boundaries, and Elementary monitoring across all domains.
# terraform/modules/databricks_workspace/main.tf
# Enterprise-grade Databricks workspace
resource "azurerm_databricks_workspace" "this" {
name = "dbw-${var.project}-${var.environment}"
resource_group_name = var.resource_group_name
location = var.location
sku = "premium" # Required for Unity Catalog
managed_resource_group_name = "rg-dbw-managed-${var.project}"
# Private networking — no public access
public_network_access_enabled = false
network_security_group_rules_required = "NoAzureDatabricksRules"
custom_parameters {
no_public_ip = true
virtual_network_id = var.vnet_id
private_subnet_name = var.private_subnet_name
private_subnet_network_security_group_association_id = var.private_nsg_id
public_subnet_name = var.public_subnet_name
public_subnet_network_security_group_association_id = var.public_nsg_id
}
tags = {
Environment = var.environment
Team = var.team
CostCenter = var.cost_center
DataClass = "Confidential"
}
}
# Unity Catalog metastore assignment
resource "databricks_metastore_assignment" "this" {
metastore_id = var.metastore_id
workspace_id = azurerm_databricks_workspace.this.workspace_id
}
# RBAC: Create groups and grant permissions
resource "databricks_group" "data_engineers" {
display_name = "data-engineers-${var.project}"
}
resource "databricks_group" "analysts" {
display_name = "analysts-${var.project}"
}
resource "databricks_grants" "catalog_grants" {
catalog = "${var.project}_${var.environment}"
grant {
principal = databricks_group.data_engineers.display_name
privileges = ["USE_CATALOG", "USE_SCHEMA", "CREATE_SCHEMA", "CREATE_TABLE"]
}
grant {
principal = databricks_group.analysts.display_name
privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
}
}
# pii_detection/scanner.py
# Automated PII detection for new tables added to the catalog
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2, concat, lit, regexp_replace
# Patterns for common PII types
PII_PATTERNS = {
"email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
"ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
"phone_us": re.compile(r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
"credit_card": re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"),
"ip_address": re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
}
# Column name heuristics (often faster than sampling data)
PII_COLUMN_NAMES = {
"email": ["email", "e_mail", "email_address"],
"name": ["first_name", "last_name", "full_name", "customer_name"],
"phone": ["phone", "phone_number", "mobile", "cell"],
"ssn": ["ssn", "social_security", "tax_id"],
"address": ["address", "street", "city", "zip_code", "postal_code"],
}
def scan_table_for_pii(spark, catalog, schema, table, sample_size=1000):
"""Scan a table and return PII findings."""
df = spark.table(f"{catalog}.{schema}.{table}").limit(sample_size)
findings = []
for col_name in df.columns:
# Check column name heuristics
for pii_type, names in PII_COLUMN_NAMES.items():
if col_name.lower() in names:
findings.append({
"column": col_name,
"pii_type": pii_type,
"detection": "column_name",
"confidence": "high",
})
# Sample data and check patterns
if df.schema[col_name].dataType.simpleString() == "string":
sample = [row[col_name] for row in df.select(col_name).collect()
if row[col_name] is not None]
for pii_type, pattern in PII_PATTERNS.items():
match_count = sum(1 for val in sample if pattern.search(val))
if match_count > len(sample) * 0.3: # 30%+ match rate
findings.append({
"column": col_name,
"pii_type": pii_type,
"detection": "pattern_match",
"confidence": "medium",
"match_rate": f"{match_count}/{len(sample)}",
})
return findings
def apply_masking(df, column, pii_type, salt="enterprise_salt_2025"):
"""Apply appropriate masking based on PII type."""
if pii_type == "email":
# Hash the email but preserve domain for analytics
return df.withColumn(column,
concat(
sha2(concat(col(column), lit(salt)), 256).substr(1, 12),
lit("@"),
regexp_replace(col(column), r"^.*@", "")
))
elif pii_type == "ssn":
# Full hash — no partial preservation
return df.withColumn(column,
sha2(concat(col(column), lit(salt)), 256))
elif pii_type == "name":
# Replace with "***"
return df.withColumn(column, lit("***"))
else:
# Default: full SHA-256 hash
return df.withColumn(column,
sha2(concat(col(column), lit(salt)), 256))
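One possible way to wire the scanner and masking helpers together; the catalog, schema, and table names are placeholders.
# pii_detection/example_usage.py (hypothetical wiring of the helpers above)
from pyspark.sql import SparkSession
from pii_detection.scanner import scan_table_for_pii, apply_masking

spark = SparkSession.builder.getOrCreate()
findings = scan_table_for_pii(spark, "main", "bronze", "customers")
df = spark.table("main.bronze.customers")
for f in findings:
    df = apply_masking(df, f["column"], f["pii_type"])
df.write.mode("overwrite").saveAsTable("main.silver.customers_masked")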
# dags/enterprise_etl_pattern.py
# Production-grade DAG with enterprise patterns
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.utils.trigger_rule import TriggerRule
def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Alert when SLA is missed."""
    # Airflow passes task_list/blocking_task_list as pre-formatted strings,
    # slas as SlaMiss records, and blocking_tis as TaskInstance objects.
    slack = SlackWebhookHook(slack_webhook_conn_id="slack_data_alerts")
    slack.send(
        text=(f":rotating_light: *SLA MISS* — DAG `{dag.dag_id}` missed SLA\n"
              f"Tasks:\n{task_list}\n"
              f"Blocking: {[t.task_id for t in blocking_tis]}\n"
              f"Missed SLAs: {[(m.task_id, m.execution_date.isoformat()) for m in slas]}\n"
              f"Runbook: https://wiki.company.com/runbooks/etl-sla-miss")
    )
def on_failure_callback(context):
"""Alert on task failure with actionable context."""
task = context["task_instance"]
slack = SlackWebhookHook(slack_webhook_conn_id="slack_data_alerts")
slack.send(
text=(f":x: *TASK FAILED* — `{task.dag_id}.{task.task_id}`\n"
f"Attempt: {task.try_number}/{task.max_tries + 1}\n"
f"Error: ```{context.get('exception', 'N/A')}```\n"
f"Log: {task.log_url}\n"
f"On-call: @data-oncall")
)
default_args = {
"owner": "data-platform-team",
"depends_on_past": False,
"retries": 3,
"retry_delay": timedelta(minutes=5),
"retry_exponential_backoff": True,
"max_retry_delay": timedelta(minutes=30),
"execution_timeout": timedelta(hours=2),
"on_failure_callback": on_failure_callback,
"sla": timedelta(hours=3), # Each task must complete within 3h of DAG start
}
with DAG(
dag_id="enterprise_orders_etl",
default_args=default_args,
description="Daily orders ETL: ingest → transform → quality → publish",
schedule_interval="0 4 * * *", # 04:00 UTC daily
start_date=datetime(2024, 1, 1),
catchup=False,
max_active_runs=1,
sla_miss_callback=sla_miss_callback,
tags=["production", "orders", "sla:06:00-utc"],
doc_md="""
## Orders ETL Pipeline
**Owner**: data-platform-team | **SLA**: 06:00 UTC | **Oncall**: @data-oncall
**Runbook**: https://wiki.company.com/runbooks/orders-etl
**Dependencies**: Source API (orders-service), Databricks workspace (prod)
""",
) as dag:
ingest = DatabricksRunNowOperator(
task_id="ingest_orders",
databricks_conn_id="databricks_prod",
job_id=12345,
notebook_params={"date": "{{ ds }}", "mode": "incremental"},
)
transform = DatabricksRunNowOperator(
task_id="transform_dbt",
databricks_conn_id="databricks_prod",
job_id=12346,
notebook_params={"date": "{{ ds }}"},
)
quality_check = DatabricksRunNowOperator(
task_id="data_quality_checks",
databricks_conn_id="databricks_prod",
job_id=12347,
)
notify_success = PythonOperator(
task_id="notify_success",
python_callable=lambda **ctx: SlackWebhookHook(
slack_webhook_conn_id="slack_data_alerts"
).send(text=f":white_check_mark: Orders ETL completed for {ctx['ds']}"),
trigger_rule=TriggerRule.ALL_SUCCESS,
)
notify_failure = PythonOperator(
task_id="notify_failure_summary",
python_callable=lambda **ctx: SlackWebhookHook(
slack_webhook_conn_id="slack_data_alerts"
).send(text=f":fire: Orders ETL FAILED for {ctx['ds']} — check logs"),
trigger_rule=TriggerRule.ONE_FAILED,
)
ingest >> transform >> quality_check >> [notify_success, notify_failure]
# great_expectations/expectations/orders_suite.json
# Enterprise data quality suite: freshness, volume, schema, values
{
"expectation_suite_name": "orders_gold_suite",
"expectations": [
{
"expectation_type": "expect_table_row_count_to_be_between",
"kwargs": {"min_value": 10000, "max_value": 500000},
"meta": {"notes": "Daily volume: 10K-500K orders. Alert on anomaly."}
},
{
"expectation_type": "expect_column_values_to_be_unique",
"kwargs": {"column": "order_id"}
},
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {"column": "order_id"}
},
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {"column": "customer_id"}
},
{
"expectation_type": "expect_column_values_to_be_in_set",
"kwargs": {
"column": "order_status",
"value_set": ["pending", "processing", "shipped", "delivered", "cancelled"]
}
},
{
"expectation_type": "expect_column_values_to_be_between",
"kwargs": {
"column": "total_amount",
"min_value": 0.01,
"max_value": 100000
},
"meta": {"notes": "Flag orders over $100K for fraud review"}
},
{
"expectation_type": "expect_column_max_to_be_between",
"kwargs": {
"column": "order_date",
"min_value": "{{ (execution_date - timedelta(days=1)).isoformat() }}",
"max_value": "{{ execution_date.isoformat() }}"
},
"meta": {"notes": "Freshness check: most recent order should be within 24h"}
}
]
}
# finops/cost_monitor.py
# Weekly cost anomaly detection and alerting
import pandas as pd
from datetime import datetime, timedelta
def detect_cost_anomalies(
cost_df: pd.DataFrame,
group_col: str = "team",
threshold_pct: float = 0.25, # Alert on 25%+ week-over-week increase
) -> pd.DataFrame:
"""Detect week-over-week cost anomalies by team/project."""
# Calculate weekly totals by team
current_week = cost_df[cost_df["date"] >= datetime.now() - timedelta(days=7)]
prev_week = cost_df[
(cost_df["date"] >= datetime.now() - timedelta(days=14)) &
(cost_df["date"] < datetime.now() - timedelta(days=7))
]
current_totals = current_week.groupby(group_col)["cost_usd"].sum()
prev_totals = prev_week.groupby(group_col)["cost_usd"].sum()
# Calculate change percentage
comparison = pd.DataFrame({
"current_week": current_totals,
"previous_week": prev_totals,
}).fillna(0)
comparison["change_pct"] = (
(comparison["current_week"] - comparison["previous_week"])
/ comparison["previous_week"].replace(0, 1)
)
# Filter anomalies
anomalies = comparison[comparison["change_pct"] > threshold_pct]
return anomalies.sort_values("change_pct", ascending=False)
def format_cost_alert(anomalies: pd.DataFrame) -> str:
"""Format anomalies into a Slack-ready message."""
if anomalies.empty:
return ":white_check_mark: No cost anomalies detected this week."
lines = [":moneybag: *Weekly Cost Anomaly Report*\n"]
for team, row in anomalies.iterrows():
lines.append(
f":warning: *{team}*: ${row['current_week']:,.0f} "
f"(+{row['change_pct']:.0%} vs last week: ${row['previous_week']:,.0f})"
)
lines.append(f"\n_Review: https://finops.company.com/dashboard_")
return "\n".join(lines)
# docs/adr/ADR-007-lakehouse-format.md
# Architecture Decision Record — standard enterprise documentation
# ADR-007: Choose Delta Lake as the Lakehouse Table Format
## Status
Accepted (2024-11-15)
## Context
We need a table format for our lakehouse on Azure. Options evaluated:
- **Delta Lake** — Native to Databricks (our compute platform), mature, ACID transactions
- **Apache Iceberg** — Vendor-neutral, strong community momentum, good for multi-engine
- **Apache Hudi** — Strong CDC/upsert support, less mature ecosystem
## Decision
We will use **Delta Lake** as our primary table format.
## Rationale
1. **Native integration**: Our primary compute is Databricks — Delta Lake has zero-configuration
support, Photon optimized reads, liquid clustering, and Unity Catalog integration.
2. **UniForm**: Delta Lake UniForm (GA in Databricks 2024) generates Iceberg metadata alongside
Delta metadata — we get Iceberg compatibility for free if we ever need multi-engine access.
3. **Team expertise**: 4/5 data engineers have Delta experience; 1/5 has Iceberg experience.
4. **Time to production**: Estimated 2 weeks with Delta vs. 6 weeks with Iceberg (tooling setup,
catalog integration, debugging).
## Consequences
- (+) Faster delivery, lower risk, leverages existing team skills
- (+) UniForm provides Iceberg escape hatch
- (-) Tighter coupling to Databricks ecosystem
- (-) If we move to a non-Databricks compute engine, we'll rely on UniForm or re-evaluate
## Alternatives Rejected
- **Iceberg**: Better long-term portability, but our 18-month compute contract with Databricks
makes this benefit theoretical. Revisit at contract renewal.
- **Hudi**: Smallest community, least tooling integration with our stack.
## Review Date
2026-Q1 (align with Databricks contract renewal)
| Model | Best For | Team Size | Trade-offs |
|---|---|---|---|
| Centralized | Small companies, early data teams | < 10 DEs | Low duplication, but becomes bottleneck at scale |
| Embedded | Companies needing deep domain expertise | 10-30 DEs | Fast domain iteration, but tooling fragmentation and silos |
| Hub-and-Spoke | Large enterprises, data mesh adoption | 30+ DEs | Balanced autonomy + consistency; requires a strong platform team |
| Tool | Type | Strengths | Weaknesses |
|---|---|---|---|
| Unity Catalog (Databricks) | Integrated catalog + governance | Native to Databricks, fine-grained access, lineage | Databricks-only, no cross-platform |
| Microsoft Purview | Cloud governance platform | Azure-native, auto-classification, lineage across Azure | Focused on Azure; limited multi-cloud |
| Apache Atlas | Open-source metadata | Free, extensible, Hadoop ecosystem | Complex setup, aging UI, limited cloud support |
| Atlan / Alation | Commercial data catalog | Modern UX, rich collaboration, multi-platform | Expensive; another vendor to manage |
| OpenMetadata | Open-source catalog | Modern, API-first, growing community | Less mature than commercial options |
| Tool | Type | Best For |
|---|---|---|
| Elementary | Open-source (dbt native) | dbt-centric teams; anomaly detection + test dashboards built into dbt |
| Monte Carlo | Commercial SaaS | Enterprise: multi-platform, automated monitoring, incident management |
| Great Expectations | Open-source framework | Teams wanting programmatic data validation with full control |
| Soda | Open-source + commercial | Simple YAML-based data quality checks, multi-warehouse |
Use `dbt ls --select +model_name+` or data lineage tools to identify downstream dependencies before modifying shared models. For significant changes, write a brief design doc even if it takes 30 minutes.

The Strangler Fig pattern involves incrementally building new capabilities in the target system while the legacy system continues to operate. New workloads go to the new system; existing workloads are migrated one at a time, with each migration validated via parallel running (comparing old vs. new outputs). The old system is decommissioned only after all workloads have been migrated and validated. It's preferred because: (1) Risk is contained — if the new system has problems, the old system still works. (2) Value is delivered incrementally — you don't wait 18 months for the big bang. (3) The team learns on early migrations and applies lessons to later ones. (4) Rollback is simple — just route traffic back to the old system. The alternative, big-bang migration, carries extreme risk: if anything goes wrong on cutover day, you're rolling back months of work with no fallback.
RBAC (Role-Based Access Control) assigns permissions to roles, and users are assigned to roles. Example: "data_engineer_prod" role has read/write on the warehouse; "analyst_marketing" has read-only on the marketing schema. It's simple to understand and audit, but can lead to role explosion in large organizations (hundreds of fine-grained roles). ABAC (Attribute-Based Access Control) evaluates policies based on attributes of the user (department, location, clearance level), the resource (data classification, owner), and the context (time of day, source IP). Example: "Users in the EU region with 'PII-authorized' clearance can see unmasked email addresses." ABAC is more flexible and expressive but harder to reason about and audit. In practice, enterprises use RBAC as the foundation (easy to audit for compliance) with ABAC-style dynamic policies for fine-grained controls (e.g., column masking based on user department).
This is a complex operational process: (1) Identify: Query the data catalog/lineage tool to find every table/file containing that customer's data — raw, staging, gold, snapshots, ML features, backups. (2) Classify: Determine which data must be deleted vs. anonymized vs. retained (some data may have legal retention requirements). (3) Execute: For Delta Lake tables, use DELETE FROM table WHERE customer_id = ? followed by VACUUM to remove physical files. For snapshots, invalidate the customer's rows. For raw data (Parquet files), either rewrite the files without the customer or overwrite the PII columns with nulls. For external systems (Kafka, third-party tools), trigger deletion via their APIs. (4) Verify: Run a scan confirming the customer's data no longer exists in any table. (5) Document: Log the deletion request, actions taken, and verification result for audit. The key complication: data may exist in file-based formats (Parquet, ORC) where you can't just "delete a row" — you must rewrite the entire file. Delta Lake's DELETE + VACUUM handles this, but raw landing zones may require batch rewriting. Having a data catalog with PII tagging makes step 1 feasible; without it, you're searching blindly.
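A minimal sketch of step (3) for Delta Lake tables, assuming the affected tables were already identified from the catalog. Table names are placeholders, VACUUM only removes physical files after the table's retention window elapses, and a production version would parameterize the SQL instead of interpolating values.
# gdpr/erasure_sketch.py (illustrative only)
from pyspark.sql import SparkSession

def erase_customer(spark: SparkSession, customer_id: str, tables: list) -> dict:
    """Delete one customer's rows from Delta tables, then verify nothing remains."""
    results = {}
    for table in tables:
        spark.sql(f"DELETE FROM {table} WHERE customer_id = '{customer_id}'")
        # Physical files are removed only after the table's retention window (default 7 days)
        spark.sql(f"VACUUM {table}")
        remaining = spark.sql(
            f"SELECT COUNT(*) AS n FROM {table} WHERE customer_id = '{customer_id}'"
        ).collect()[0]["n"]
        results[table] = "erased" if remaining == 0 else f"{remaining} rows remain"
    return results

# e.g. erase_customer(spark, "cust_42", ["silver.customers", "gold.fct_orders"])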
Hub-and-spoke has a central Data Platform Team (the hub) providing shared infrastructure, tooling, and governance, while domain teams (the spokes) own domain-specific data pipelines and models. The platform team maintains: compute infrastructure, CI/CD pipelines, data catalog, monitoring, and shared libraries. Domain teams have autonomy to build within the platform's guardrails. This maps directly to Data Mesh's four principles: (1) Domain ownership — domain teams own their data products (spokes). (2) Data as a product — domain teams treat their published data as a product with SLAs, documentation, and quality guarantees. (3) Self-serve data infrastructure — the platform team provides tools so domains can operate independently (hub). (4) Federated computational governance — global policies (PII handling, naming conventions) enforced via automation, with local policies managed by domains. The hub-and-spoke model is the organizational pattern that makes Data Mesh practical at scale.
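One way federated governance gets automated is a CI check that every published data product meets the global standards (naming convention, required ownership and PII metadata) before a change can merge. A hedged sketch; the manifest format and rules below are assumptions.
# governance/check_data_products.py (illustrative CI check; metadata format is hypothetical)
import re
import sys

REQUIRED_KEYS = {"owner", "sla", "pii_columns"}           # global governance standard
NAME_PATTERN = re.compile(r"^(dim|fct|stg)_[a-z0-9_]+$")  # shared naming convention

def validate(products: dict) -> list:
    """Return a list of violations for the domain's published data products."""
    violations = []
    for name, meta in products.items():
        if not NAME_PATTERN.match(name):
            violations.append(f"{name}: does not follow naming convention")
        missing = REQUIRED_KEYS - set(meta)
        if missing:
            violations.append(f"{name}: missing required metadata {sorted(missing)}")
    return violations

if __name__ == "__main__":
    # In CI, this dict would be loaded from each domain's data product manifest
    sample = {"fct_orders": {"owner": "orders-team", "sla": "06:00 UTC", "pii_columns": []}}
    problems = validate(sample)
    print("\n".join(problems) if problems else "All data products compliant.")
    sys.exit(1 if problems else 0)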
ADRs document the why behind technical decisions: what options were evaluated, why one was chosen, what trade-offs were accepted, and when to revisit. In enterprises, this matters because: (1) Team turnover — the person who made the decision will eventually leave. Without an ADR, the next engineer doesn't know why Airflow was chosen over Prefect and may waste weeks re-evaluating. (2) Audit and compliance — SOX, HIPAA, and other regulations require documented decision trails. "We chose this architecture because..." is an auditable artifact. (3) Prevents re-litigation — without ADRs, the same decisions get debated every 6 months. An ADR closes the discussion until the stated review date. (4) Onboarding — new team members read the ADR backlog and understand the system's evolution in hours instead of months. (5) Interview signal — discussing how you document and communicate decisions demonstrates senior-level engineering maturity. An ADR takes 30 minutes to write and saves hundreds of hours over its lifetime.
Pipeline monitoring tracks whether jobs run successfully: Did the Airflow DAG complete? What was the runtime? Were there any errors? This is necessary but insufficient — a pipeline can succeed while producing bad data (empty source, wrong schema, stale data). Data observability monitors the data itself across five dimensions: (1) Freshness — is the latest data within the expected time window? (2) Volume — are row counts within expected bounds? (3) Quality — do values pass validation rules (uniqueness, nullability, ranges)? (4) Schema — have columns been added, removed, or changed type unexpectedly? (5) Distribution — have column value distributions shifted anomalously? You need both: pipeline monitoring catches infrastructure failures (job crashed, cluster timeout), while data observability catches data failures (source sent empty payload, upstream schema changed, business logic drift). Tools: Airflow/Datadog for pipeline monitoring; Elementary/Monte Carlo/Great Expectations for data observability.
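A sketch of what two of those dimensions (volume and schema) can look like as lightweight in-house checks before reaching for a dedicated tool; the expected-schema snapshot and row-count band are assumptions.
# observability/basic_checks.py (illustrative sketch)
from pyspark.sql import SparkSession

def check_volume(spark: SparkSession, table: str, min_rows: int, max_rows: int) -> bool:
    """Volume: is the table's row count inside the expected band?"""
    n = spark.table(table).count()
    return min_rows <= n <= max_rows

def check_schema(spark: SparkSession, table: str, expected: dict) -> list:
    """Schema: report added, dropped, or retyped columns vs. an expected snapshot."""
    actual = {f.name: f.dataType.simpleString() for f in spark.table(table).schema.fields}
    issues = [f"missing column: {c}" for c in expected if c not in actual]
    issues += [f"unexpected column: {c}" for c in actual if c not in expected]
    issues += [f"type change on {c}: {expected[c]} -> {actual[c]}"
               for c in expected if c in actual and actual[c] != expected[c]]
    return issues

# e.g. check_schema(spark, "gold.fct_orders", {"order_id": "string", "total_amount": "decimal(18,2)"})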
Effective FinOps for data platforms requires: (1) Visibility — tag every resource (team, project, environment), build cost dashboards by team/pipeline, send weekly cost reports to team leads. You can't optimize what you can't see. (2) Attribution — implement showback (showing teams their costs) or chargeback (billing teams internally). This creates accountability. (3) Right-sizing — audit compute monthly: check executor memory utilization, spill-to-disk metrics, and idle time. Most clusters are 2-3x over-provisioned. (4) Scheduling — auto-terminate dev clusters after 15 min idle, auto-suspend Snowflake warehouses after 1-5 min. Stop dev clusters on weekends/holidays. (5) Reserved capacity — for predictable baseline workloads, commit to 1-year or 3-year reserved pricing (30-60% savings). Use spot/preemptible for fault-tolerant batch jobs. (6) Data lifecycle — auto-tier cold storage, enforce retention policies, clean orphaned data. (7) Query optimization — partition pruning, clustering, avoiding SELECT *, using incremental models instead of full rebuilds. (8) Anomaly detection — automated alerts when any team's spend increases >25% week-over-week.