Enterprise data engineering is fundamentally different from startup or small-team work. You deal with organizational complexity as much as technical complexity: data governance committees, change advisory boards, security reviews, SOX compliance, multi-team dependencies, legacy system migrations, and vendor management. This guide covers the non-technical and semi-technical skills that separate a senior enterprise data engineer from a mid-level one. These topics appear frequently in behavioral interviews and system design rounds at large companies. You won't find these in a PySpark tutorial — but they determine whether your data platform actually delivers value or becomes shelfware. Assumes you already have the technical foundations covered in the other guides.
How a company organizes its data engineers determines what you'll build and how you'll operate:
**Centralized**: One team owns all data infrastructure and pipelines. Business units submit requests via tickets. Good for: small-to-medium companies, ensuring consistency. Risks: becomes a bottleneck; the team lacks domain expertise. You spend more time in requirements meetings than writing code.
**Embedded**: Data engineers sit within business units (e.g., one DE on the marketing team, two on finance). Good for: deep domain knowledge, fast iteration within a team. Risks: inconsistent tooling, duplicated infrastructure, data silos. Without a central platform, teams build things in incompatible ways.
**Hub-and-Spoke**: A central Data Platform Team owns shared infrastructure (Spark clusters, warehouse, orchestration, CI/CD, data catalog) while domain teams (sometimes called "squads") own domain-specific pipelines and data models. The platform team provides tools and guardrails; domain teams have autonomy within those guardrails. This is the model most large enterprises converge on, and it aligns with Data Mesh principles (domain ownership, data as a product, self-serve platform, federated governance).
In enterprises, governance isn't optional — it's often the first thing mentioned in interviews. Key pillars:
| Regulation | Scope | Key Requirements for Data Engineers |
|---|---|---|
| GDPR (EU) | EU residents' personal data | Right to erasure (delete pipelines), data portability (export in machine-readable format), legal basis for processing, DPIAs for high-risk processing, 72-hour breach notification |
| CCPA/CPRA (California) | California residents' personal data | Right to opt-out of data sale/sharing, right to deletion, data retention limits, data inventory requirements |
| HIPAA (US Healthcare) | Protected Health Information | Encryption at rest and in transit, audit logs of all access, minimum necessary access, BAA with all vendors, de-identification standards |
| SOX (US Finance) | Financial reporting data | Immutable audit trails, change management controls, separation of duties, data lineage from source to report |
| LGPD (Brazil) | Brazilian residents' personal data | Similar to GDPR: consent management, DPO appointment, data minimization, 72-hour breach notification |
Role-Based Access Control (RBAC): Assign permissions to roles, assign roles to groups, assign users to groups. Example: data_engineer_prod role → read/write on warehouse, analyst_marketing role → read-only on marketing schema. Attribute-Based Access Control (ABAC): Dynamic policies based on attributes (user department, data classification level, purpose). Example: "Analysts in EMEA can see EU customer PII; analysts in APAC see masked PII." In practice, enterprises use RBAC as the foundation with ABAC-style policies for fine-grained access (Unity Catalog, Apache Ranger, Purview).
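To make the contrast concrete, here is a minimal PySpark sketch of an ABAC-style policy applied in code. In practice the policy would live in the catalog or policy engine (Unity Catalog, Ranger) rather than in pipeline code; the UserContext attributes, column name, and salt below are hypothetical.
# access_control/abac_sketch.py (illustrative only; real enforcement belongs in the catalog/policy engine)
from dataclasses import dataclass
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, concat, lit, sha2

@dataclass
class UserContext:
    department: str       # e.g. "analytics"
    region: str           # e.g. "EMEA"
    pii_authorized: bool  # clearance attribute granted by security

def read_customers(df: DataFrame, user: UserContext) -> DataFrame:
    """ABAC-style policy: unmask email only for EMEA users with PII clearance."""
    if user.region == "EMEA" and user.pii_authorized:
        return df
    # Everyone else sees a salted hash of the email (salt is a placeholder value)
    return df.withColumn("email", sha2(concat(col("email"), lit("demo_salt")), 256))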
Enterprise CI/CD for data is different from application CI/CD:
| Environment | Purpose | Data | Access |
|---|---|---|---|
| Dev | Individual development, experimentation | Sample data or anonymized prod subset | Individual developer |
| CI | Automated testing on PRs | Test fixtures or ephemeral prod sample | CI service account only |
| Staging | Pre-prod validation, integration testing | Production-like (anonymized copy) | Data team |
| Production | Live data processing | Real data | Service accounts + on-call engineers |
Cloud data platforms can become extremely expensive without discipline. FinOps (Financial Operations) is the practice of managing cloud costs collaboratively:
Every data platform should have a cost dashboard tracking: daily/weekly spend by team, cost per pipeline/job, compute utilization rate, storage growth rate, and cost anomaly alerts (>20% week-over-week increase).
Enterprise data platforms need the same operational maturity as production applications:
Define SLAs (Service Level Agreements) for critical data products:
| Data Product | Freshness SLA | Quality SLA | Availability SLA |
|---|---|---|---|
| fct_orders (Gold) | By 06:00 UTC daily | 99.9% test pass rate | 99.5% uptime |
| Real-time events | < 5 min latency | 99.5% schema compliance | 99.9% uptime |
| ML feature store | By 04:00 UTC daily | 100% not-null on required features | 99.5% uptime |
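A minimal sketch of how a freshness SLA like the ones above could be checked programmatically. The table name, timestamp column, and the assumption that the Spark session timezone is UTC are all placeholders.
# sla_checks/freshness_check.py (illustrative sketch)
from datetime import datetime, timedelta, timezone
from pyspark.sql import SparkSession
from pyspark.sql.functions import max as spark_max

def meets_freshness_sla(spark: SparkSession, table: str, ts_col: str,
                        max_staleness: timedelta) -> bool:
    """True if the newest row in `table` is within the allowed staleness window."""
    latest = spark.table(table).agg(spark_max(ts_col).alias("latest")).collect()[0]["latest"]
    if latest is None:
        return False  # an empty table always violates the SLA
    now = datetime.now(timezone.utc).replace(tzinfo=None)  # Spark returns naive timestamps
    return (now - latest) <= max_staleness

# e.g. at the 06:00 UTC check: meets_freshness_sla(spark, "gold.fct_orders", "loaded_at", timedelta(hours=24))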
In an enterprise, undocumented pipelines are liabilities. When the original developer leaves, tribal knowledge evaporates. Effective documentation:
The best documentation lives alongside the code: dbt model descriptions in YAML (auto-generates a documentation site), Airflow DAG docstrings, Terraform resource descriptions, README.md files in each pipeline directory. Generated documentation (dbt docs, Swagger/OpenAPI for data APIs) is always up-to-date because it's derived from the code itself.
Enterprise data engineers frequently lead migrations: on-prem to cloud, legacy to modern stack, one cloud to another. This is where careers are made (or broken).
The Strangler Fig pattern is named after fig trees that grow around and eventually replace their host. Instead of a big-bang migration: (1) Build new capabilities in the new system. (2) Route traffic incrementally from old to new. (3) Run both systems in parallel during the transition. (4) Decommission the old system when the new system handles all workloads. This is the safest pattern because the old system still works if the new system hits problems.
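During the parallel-run phase, a reconciliation job comparing legacy and new outputs is what builds the confidence to cut over. A minimal sketch, assuming both systems publish tables Spark can read and that a row count plus an amount checksum is enough to detect divergence (table and column names are hypothetical):
# migration/parallel_run_reconciliation.py (illustrative sketch)
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, round as sround, sum as ssum

def reconcile(spark: SparkSession, legacy_table: str, new_table: str,
              amount_col: str = "total_amount") -> dict:
    """Compare row count and an amount checksum between the old and new pipelines."""
    def summarize(table: str) -> tuple:
        row = (spark.table(table)
                    .agg(count("*").alias("rows"),
                         sround(ssum(amount_col), 2).alias("amount_sum"))
                    .collect()[0])
        return (row["rows"], row["amount_sum"])

    legacy, new = summarize(legacy_table), summarize(new_table)
    # Cut over only after the outputs match for several consecutive runs
    return {"legacy": legacy, "new": new, "match": legacy == new}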
Technical skill gets you in the door; stakeholder management determines your impact and career trajectory in an enterprise.
| Audience | What They Care About | How to Communicate |
|---|---|---|
| Executive leadership | Business impact, risk, cost, timeline | 1-page summary, no jargon, focus on outcomes: "Reduces data delivery time from 24h to 1h, enabling same-day marketing decisions" |
| Product managers | Feature delivery, data availability, reliability | Clear ETAs, honest trade-off discussions: "We can deliver in 2 weeks with daily freshness or 4 weeks with real-time, which do you need?" |
| Analysts / Data scientists | Data quality, discoverability, documentation | Technical details welcome; focus on "how do I use this?" not "how did we build it?" |
| Security / Compliance | Access controls, PII handling, audit trails | Be proactive: show them your controls before they ask. Provide documentation, architecture diagrams, DPIAs |
| Other engineering teams | API contracts, latency, reliability, breaking changes | Treat data products like APIs: versioned contracts, deprecation notices, migration guides |
A hospital network (15 hospitals, 50,000 employees) migrates from on-prem SQL Server SSIS pipelines to Azure cloud. Key challenges: (1) HIPAA requires PHI encryption at rest/transit, access logging, minimum necessary access, and a Business Associate Agreement with Microsoft. (2) 300+ SSIS packages must be inventoried, prioritized, and migrated to ADF + Databricks. (3) Clinical staff are accustomed to same-day reports from legacy systems — downtime SLAs are measured in minutes. The data platform team uses the Strangler Fig pattern: new workloads go to cloud-native ADF/Databricks, existing SSIS packages are migrated in waves over 18 months. A dedicated "migration squad" of 3 DEs handles the conversion. Data is classified into 4 tiers, with PHI data routed through private endpoints and column-level masking. The migration is tracked via a portfolio dashboard showing: % pipelines migrated, cost comparison (on-prem vs. cloud), incidents per month, data quality KPIs.
A bank's data warehouse (built on Teradata in 2008) needs modernization — Teradata licensing costs $5M/year. The data engineering team proposes Snowflake + dbt. Challenge: SOX compliance requires immutable audit trails for all financial reporting data, separation of duties (the person who writes pipeline code can't approve it to production), and documented data lineage from source system to regulatory report. Solution: (1) dbt model contracts enforce output schemas match regulatory specifications. (2) GitHub branch protection rules enforce separation of duties — the author can't merge their own PR. (3) dbt's auto-generated documentation + Snowflake's ACCESS_HISTORY provide full lineage. (4) All Snowflake data is on Time Travel (90 days) + Fail-safe (7 days) for audit recovery. (5) A Change Advisory Board reviews all production releases weekly. The migration takes 2 years and saves $3.5M/year in licensing alone.
A retail chain (500 stores) needs real-time inventory visibility and demand forecasting. The data platform team runs a hub-and-spoke model: a central platform team manages Kafka, Databricks, and Airflow; domain teams (supply chain, merchandising, e-commerce) own their pipelines. Key decisions: (1) Kafka ingests POS (point-of-sale) transactions from all 500 stores in real-time (~50,000 events/sec). (2) A Spark Structured Streaming job maintains a real-time inventory view in Delta Lake. (3) A nightly Airflow DAG trains demand forecasting ML models on historical data. (4) Cost management: the platform team implements chargeback based on Databricks DBU consumption per team — supply chain sees their $X/month and optimizes their Spark jobs accordingly. (5) On-call rotation: the platform team handles infrastructure alerts; domain teams handle data quality alerts for their pipelines.
A 10,000-person enterprise implements Data Mesh across 8 domains (orders, customers, inventory, finance, HR, marketing, logistics, product). Each domain team owns: their source systems, ingestion pipelines, transformation logic (dbt), and published data products. The central data platform team provides: a self-serve data infrastructure (Databricks, dbt Cloud, Airflow), a federated governance framework (shared naming conventions, quality standards, PII handling policies), data discovery (Purview/DataHub catalog), and cross-domain metrics (dbt Semantic Layer). Challenges that arise: (1) Domain teams have uneven maturity — the orders team has 3 experienced DEs, the HR team has 1 junior DE. Solution: platform team provides "starter kits" (template dbt projects, CI/CD pipelines) and office hours. (2) Cross-domain data quality — when the orders domain's data is wrong, it cascades to finance. Solution: model contracts at domain boundaries, and Elementary monitoring across all domains.
# terraform/modules/databricks_workspace/main.tf
# Enterprise-grade Databricks workspace
resource "azurerm_databricks_workspace" "this" {
name = "dbw-${var.project}-${var.environment}"
resource_group_name = var.resource_group_name
location = var.location
sku = "premium" # Required for Unity Catalog
managed_resource_group_name = "rg-dbw-managed-${var.project}"
# Private networking — no public access
public_network_access_enabled = false
network_security_group_rules_required = "NoAzureDatabricksRules"
custom_parameters {
no_public_ip = true
virtual_network_id = var.vnet_id
private_subnet_name = var.private_subnet_name
private_subnet_network_security_group_association_id = var.private_nsg_id
public_subnet_name = var.public_subnet_name
public_subnet_network_security_group_association_id = var.public_nsg_id
}
tags = {
Environment = var.environment
Team = var.team
CostCenter = var.cost_center
DataClass = "Confidential"
}
}
# Unity Catalog metastore assignment
resource "databricks_metastore_assignment" "this" {
metastore_id = var.metastore_id
workspace_id = azurerm_databricks_workspace.this.workspace_id
}
# RBAC: Create groups and grant permissions
resource "databricks_group" "data_engineers" {
display_name = "data-engineers-${var.project}"
}
resource "databricks_group" "analysts" {
display_name = "analysts-${var.project}"
}
resource "databricks_grants" "catalog_grants" {
catalog = "${var.project}_${var.environment}"
grant {
principal = databricks_group.data_engineers.display_name
privileges = ["USE_CATALOG", "USE_SCHEMA", "CREATE_SCHEMA", "CREATE_TABLE"]
}
grant {
principal = databricks_group.analysts.display_name
privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
}
}
# pii_detection/scanner.py
# Automated PII detection for new tables added to the catalog
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2, concat, lit, regexp_replace
# Patterns for common PII types
PII_PATTERNS = {
"email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
"ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
"phone_us": re.compile(r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
"credit_card": re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"),
"ip_address": re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
}
# Column name heuristics (often faster than sampling data)
PII_COLUMN_NAMES = {
"email": ["email", "e_mail", "email_address"],
"name": ["first_name", "last_name", "full_name", "customer_name"],
"phone": ["phone", "phone_number", "mobile", "cell"],
"ssn": ["ssn", "social_security", "tax_id"],
"address": ["address", "street", "city", "zip_code", "postal_code"],
}
def scan_table_for_pii(spark, catalog, schema, table, sample_size=1000):
"""Scan a table and return PII findings."""
df = spark.table(f"{catalog}.{schema}.{table}").limit(sample_size)
findings = []
for col_name in df.columns:
# Check column name heuristics
for pii_type, names in PII_COLUMN_NAMES.items():
if col_name.lower() in names:
findings.append({
"column": col_name,
"pii_type": pii_type,
"detection": "column_name",
"confidence": "high",
})
# Sample data and check patterns
if df.schema[col_name].dataType.simpleString() == "string":
sample = [row[col_name] for row in df.select(col_name).collect()
if row[col_name] is not None]
for pii_type, pattern in PII_PATTERNS.items():
match_count = sum(1 for val in sample if pattern.search(val))
if match_count > len(sample) * 0.3: # 30%+ match rate
findings.append({
"column": col_name,
"pii_type": pii_type,
"detection": "pattern_match",
"confidence": "medium",
"match_rate": f"{match_count}/{len(sample)}",
})
return findings
def apply_masking(df, column, pii_type, salt="enterprise_salt_2025"):
"""Apply appropriate masking based on PII type."""
if pii_type == "email":
# Hash the email but preserve domain for analytics
return df.withColumn(column,
concat(
sha2(concat(col(column), lit(salt)), 256).substr(1, 12),
lit("@"),
regexp_replace(col(column), r"^.*@", "")
))
elif pii_type == "ssn":
# Full hash — no partial preservation
return df.withColumn(column,
sha2(concat(col(column), lit(salt)), 256))
elif pii_type == "name":
# Replace with "***"
return df.withColumn(column, lit("***"))
else:
# Default: full SHA-256 hash
return df.withColumn(column,
sha2(concat(col(column), lit(salt)), 256))
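One possible way to wire the scanner and masking helpers together; the catalog, schema, and table names are placeholders.
# pii_detection/example_usage.py (hypothetical wiring of the helpers above)
from pyspark.sql import SparkSession
from pii_detection.scanner import scan_table_for_pii, apply_masking

spark = SparkSession.builder.getOrCreate()
findings = scan_table_for_pii(spark, "main", "bronze", "customers")
df = spark.table("main.bronze.customers")
for f in findings:
    df = apply_masking(df, f["column"], f["pii_type"])
df.write.mode("overwrite").saveAsTable("main.silver.customers_masked")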
# dags/enterprise_etl_pattern.py
# Production-grade DAG with enterprise patterns
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator
from airflow.utils.trigger_rule import TriggerRule
def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Alert when SLA is missed."""
    # Airflow passes task_list/blocking_task_list as pre-formatted strings,
    # slas as SlaMiss records, and blocking_tis as TaskInstance objects.
    slack = SlackWebhookHook(slack_webhook_conn_id="slack_data_alerts")
    slack.send(
        text=(f":rotating_light: *SLA MISS* — DAG `{dag.dag_id}` missed SLA\n"
              f"Tasks:\n{task_list}\n"
              f"Blocking: {[t.task_id for t in blocking_tis]}\n"
              f"Missed SLAs: {[(m.task_id, m.execution_date.isoformat()) for m in slas]}\n"
              f"Runbook: https://wiki.company.com/runbooks/etl-sla-miss")
    )
def on_failure_callback(context):
"""Alert on task failure with actionable context."""
task = context["task_instance"]
slack = SlackWebhookHook(slack_webhook_conn_id="slack_data_alerts")
slack.send(
text=(f":x: *TASK FAILED* — `{task.dag_id}.{task.task_id}`\n"
f"Attempt: {task.try_number}/{task.max_tries + 1}\n"
f"Error: ```{context.get('exception', 'N/A')}```\n"
f"Log: {task.log_url}\n"
f"On-call: @data-oncall")
)
default_args = {
"owner": "data-platform-team",
"depends_on_past": False,
"retries": 3,
"retry_delay": timedelta(minutes=5),
"retry_exponential_backoff": True,
"max_retry_delay": timedelta(minutes=30),
"execution_timeout": timedelta(hours=2),
"on_failure_callback": on_failure_callback,
"sla": timedelta(hours=3), # Each task must complete within 3h of DAG start
}
with DAG(
dag_id="enterprise_orders_etl",
default_args=default_args,
description="Daily orders ETL: ingest → transform → quality → publish",
schedule_interval="0 4 * * *", # 04:00 UTC daily
start_date=datetime(2024, 1, 1),
catchup=False,
max_active_runs=1,
sla_miss_callback=sla_miss_callback,
tags=["production", "orders", "sla:06:00-utc"],
doc_md="""
## Orders ETL Pipeline
**Owner**: data-platform-team | **SLA**: 06:00 UTC | **Oncall**: @data-oncall
**Runbook**: https://wiki.company.com/runbooks/orders-etl
**Dependencies**: Source API (orders-service), Databricks workspace (prod)
""",
) as dag:
ingest = DatabricksRunNowOperator(
task_id="ingest_orders",
databricks_conn_id="databricks_prod",
job_id=12345,
notebook_params={"date": "{{ ds }}", "mode": "incremental"},
)
transform = DatabricksRunNowOperator(
task_id="transform_dbt",
databricks_conn_id="databricks_prod",
job_id=12346,
notebook_params={"date": "{{ ds }}"},
)
quality_check = DatabricksRunNowOperator(
task_id="data_quality_checks",
databricks_conn_id="databricks_prod",
job_id=12347,
)
notify_success = PythonOperator(
task_id="notify_success",
python_callable=lambda **ctx: SlackWebhookHook(
slack_webhook_conn_id="slack_data_alerts"
).send(text=f":white_check_mark: Orders ETL completed for {ctx['ds']}"),
trigger_rule=TriggerRule.ALL_SUCCESS,
)
notify_failure = PythonOperator(
task_id="notify_failure_summary",
python_callable=lambda **ctx: SlackWebhookHook(
slack_webhook_conn_id="slack_data_alerts"
).send(text=f":fire: Orders ETL FAILED for {ctx['ds']} — check logs"),
trigger_rule=TriggerRule.ONE_FAILED,
)
ingest >> transform >> quality_check >> [notify_success, notify_failure]
# great_expectations/expectations/orders_suite.json
# Enterprise data quality suite: freshness, volume, schema, values
{
"expectation_suite_name": "orders_gold_suite",
"expectations": [
{
"expectation_type": "expect_table_row_count_to_be_between",
"kwargs": {"min_value": 10000, "max_value": 500000},
"meta": {"notes": "Daily volume: 10K-500K orders. Alert on anomaly."}
},
{
"expectation_type": "expect_column_values_to_be_unique",
"kwargs": {"column": "order_id"}
},
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {"column": "order_id"}
},
{
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {"column": "customer_id"}
},
{
"expectation_type": "expect_column_values_to_be_in_set",
"kwargs": {
"column": "order_status",
"value_set": ["pending", "processing", "shipped", "delivered", "cancelled"]
}
},
{
"expectation_type": "expect_column_values_to_be_between",
"kwargs": {
"column": "total_amount",
"min_value": 0.01,
"max_value": 100000
},
"meta": {"notes": "Flag orders over $100K for fraud review"}
},
{
"expectation_type": "expect_column_max_to_be_between",
"kwargs": {
"column": "order_date",
"min_value": "{{ (execution_date - timedelta(days=1)).isoformat() }}",
"max_value": "{{ execution_date.isoformat() }}"
},
"meta": {"notes": "Freshness check: most recent order should be within 24h"}
}
]
}
# finops/cost_monitor.py
# Weekly cost anomaly detection and alerting
import pandas as pd
from datetime import datetime, timedelta
def detect_cost_anomalies(
cost_df: pd.DataFrame,
group_col: str = "team",
threshold_pct: float = 0.25, # Alert on 25%+ week-over-week increase
) -> pd.DataFrame:
"""Detect week-over-week cost anomalies by team/project."""
# Calculate weekly totals by team
current_week = cost_df[cost_df["date"] >= datetime.now() - timedelta(days=7)]
prev_week = cost_df[
(cost_df["date"] >= datetime.now() - timedelta(days=14)) &
(cost_df["date"] < datetime.now() - timedelta(days=7))
]
current_totals = current_week.groupby(group_col)["cost_usd"].sum()
prev_totals = prev_week.groupby(group_col)["cost_usd"].sum()
# Calculate change percentage
comparison = pd.DataFrame({
"current_week": current_totals,
"previous_week": prev_totals,
}).fillna(0)
comparison["change_pct"] = (
(comparison["current_week"] - comparison["previous_week"])
/ comparison["previous_week"].replace(0, 1)
)
# Filter anomalies
anomalies = comparison[comparison["change_pct"] > threshold_pct]
return anomalies.sort_values("change_pct", ascending=False)
def format_cost_alert(anomalies: pd.DataFrame) -> str:
"""Format anomalies into a Slack-ready message."""
if anomalies.empty:
return ":white_check_mark: No cost anomalies detected this week."
lines = [":moneybag: *Weekly Cost Anomaly Report*\n"]
for team, row in anomalies.iterrows():
lines.append(
f":warning: *{team}*: ${row['current_week']:,.0f} "
f"(+{row['change_pct']:.0%} vs last week: ${row['previous_week']:,.0f})"
)
lines.append(f"\n_Review: https://finops.company.com/dashboard_")
return "\n".join(lines)
# docs/adr/ADR-007-lakehouse-format.md
# Architecture Decision Record — standard enterprise documentation
# ADR-007: Choose Delta Lake as the Lakehouse Table Format
## Status
Accepted (2024-11-15)
## Context
We need a table format for our lakehouse on Azure. Options evaluated:
- **Delta Lake** — Native to Databricks (our compute platform), mature, ACID transactions
- **Apache Iceberg** — Vendor-neutral, strong community momentum, good for multi-engine
- **Apache Hudi** — Strong CDC/upsert support, less mature ecosystem
## Decision
We will use **Delta Lake** as our primary table format.
## Rationale
1. **Native integration**: Our primary compute is Databricks — Delta Lake has zero-configuration
support, Photon optimized reads, liquid clustering, and Unity Catalog integration.
2. **UniForm**: Delta Lake UniForm (GA in Databricks 2024) generates Iceberg metadata alongside
Delta metadata — we get Iceberg compatibility for free if we ever need multi-engine access.
3. **Team expertise**: 4/5 data engineers have Delta experience; 1/5 has Iceberg experience.
4. **Time to production**: Estimated 2 weeks with Delta vs. 6 weeks with Iceberg (tooling setup,
catalog integration, debugging).
## Consequences
- (+) Faster delivery, lower risk, leverages existing team skills
- (+) UniForm provides Iceberg escape hatch
- (-) Tighter coupling to Databricks ecosystem
- (-) If we move to a non-Databricks compute engine, we'll rely on UniForm or re-evaluate
## Alternatives Rejected
- **Iceberg**: Better long-term portability, but our 18-month compute contract with Databricks
makes this benefit theoretical. Revisit at contract renewal.
- **Hudi**: Smallest community, least tooling integration with our stack.
## Review Date
2026-Q1 (align with Databricks contract renewal)
| Model | Best For | Team Size | Trade-offs |
|---|---|---|---|
| Centralized | Small companies, early data teams | < 10 DEs | Low duplication, but becomes bottleneck at scale |
| Embedded | Companies needing deep domain expertise | 10-30 DEs | Fast domain iteration, but tooling fragmentation and silos |
| Hub-and-Spoke | Large enterprises, data mesh adoption | 30+ DEs | Balanced autonomy + consistency; requires a strong platform team |
| Tool | Type | Strengths | Weaknesses |
|---|---|---|---|
| Unity Catalog (Databricks) | Integrated catalog + governance | Native to Databricks, fine-grained access, lineage | Databricks-only, no cross-platform |
| Microsoft Purview | Cloud governance platform | Azure-native, auto-classification, lineage across Azure | Focused on Azure; limited multi-cloud |
| Apache Atlas | Open-source metadata | Free, extensible, Hadoop ecosystem | Complex setup, aging UI, limited cloud support |
| Atlan / Alation | Commercial data catalog | Modern UX, rich collaboration, multi-platform | Expensive; another vendor to manage |
| OpenMetadata | Open-source catalog | Modern, API-first, growing community | Less mature than commercial options |
| Tool | Type | Best For |
|---|---|---|
| Elementary | Open-source (dbt native) | dbt-centric teams; anomaly detection + test dashboards built into dbt |
| Monte Carlo | Commercial SaaS | Enterprise: multi-platform, automated monitoring, incident management |
| Great Expectations | Open-source framework | Teams wanting programmatic data validation with full control |
| Soda | Open-source + commercial | Simple YAML-based data quality checks, multi-warehouse |
Use `dbt ls --select +model_name+` or data lineage tools to identify downstream dependencies before modifying shared models. For significant changes, write a brief design doc even if it takes 30 minutes.

The Strangler Fig pattern involves incrementally building new capabilities in the target system while the legacy system continues to operate. New workloads go to the new system; existing workloads are migrated one at a time, with each migration validated via parallel running (comparing old vs. new outputs). The old system is decommissioned only after all workloads have been migrated and validated. It's preferred because: (1) Risk is contained — if the new system has problems, the old system still works. (2) Value is delivered incrementally — you don't wait 18 months for the big bang. (3) The team learns on early migrations and applies lessons to later ones. (4) Rollback is simple — just route traffic back to the old system. The alternative, big-bang migration, carries extreme risk: if anything goes wrong on cutover day, you're rolling back months of work with no fallback.
RBAC (Role-Based Access Control) assigns permissions to roles, and users are assigned to roles. Example: "data_engineer_prod" role has read/write on the warehouse; "analyst_marketing" has read-only on the marketing schema. It's simple to understand and audit, but can lead to role explosion in large organizations (hundreds of fine-grained roles). ABAC (Attribute-Based Access Control) evaluates policies based on attributes of the user (department, location, clearance level), the resource (data classification, owner), and the context (time of day, source IP). Example: "Users in the EU region with 'PII-authorized' clearance can see unmasked email addresses." ABAC is more flexible and expressive but harder to reason about and audit. In practice, enterprises use RBAC as the foundation (easy to audit for compliance) with ABAC-style dynamic policies for fine-grained controls (e.g., column masking based on user department).
This is a complex operational process: (1) Identify: Query the data catalog/lineage tool to find every table/file containing that customer's data — raw, staging, gold, snapshots, ML features, backups. (2) Classify: Determine which data must be deleted vs. anonymized vs. retained (some data may have legal retention requirements). (3) Execute: For Delta Lake tables, use DELETE FROM table WHERE customer_id = ? followed by VACUUM to remove physical files. For snapshots, invalidate the customer's rows. For raw data (Parquet files), either rewrite the files without the customer or overwrite the PII columns with nulls. For external systems (Kafka, third-party tools), trigger deletion via their APIs. (4) Verify: Run a scan confirming the customer's data no longer exists in any table. (5) Document: Log the deletion request, actions taken, and verification result for audit. The key complication: data may exist in file-based formats (Parquet, ORC) where you can't just "delete a row" — you must rewrite the entire file. Delta Lake's DELETE + VACUUM handles this, but raw landing zones may require batch rewriting. Having a data catalog with PII tagging makes step 1 feasible; without it, you're searching blindly.
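A minimal sketch of step (3) for Delta Lake tables, assuming the affected tables were already identified from the catalog. Table names are placeholders, VACUUM only removes physical files after the table's retention window elapses, and a production version would parameterize the SQL instead of interpolating values.
# gdpr/erasure_sketch.py (illustrative only)
from pyspark.sql import SparkSession

def erase_customer(spark: SparkSession, customer_id: str, tables: list) -> dict:
    """Delete one customer's rows from Delta tables, then verify nothing remains."""
    results = {}
    for table in tables:
        spark.sql(f"DELETE FROM {table} WHERE customer_id = '{customer_id}'")
        # Physical files are removed only after the table's retention window (default 7 days)
        spark.sql(f"VACUUM {table}")
        remaining = spark.sql(
            f"SELECT COUNT(*) AS n FROM {table} WHERE customer_id = '{customer_id}'"
        ).collect()[0]["n"]
        results[table] = "erased" if remaining == 0 else f"{remaining} rows remain"
    return results

# e.g. erase_customer(spark, "cust_42", ["silver.customers", "gold.fct_orders"])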
Hub-and-spoke has a central Data Platform Team (the hub) providing shared infrastructure, tooling, and governance, while domain teams (the spokes) own domain-specific data pipelines and models. The platform team maintains: compute infrastructure, CI/CD pipelines, data catalog, monitoring, and shared libraries. Domain teams have autonomy to build within the platform's guardrails. This maps directly to Data Mesh's four principles: (1) Domain ownership — domain teams own their data products (spokes). (2) Data as a product — domain teams treat their published data as a product with SLAs, documentation, and quality guarantees. (3) Self-serve data infrastructure — the platform team provides tools so domains can operate independently (hub). (4) Federated computational governance — global policies (PII handling, naming conventions) enforced via automation, with local policies managed by domains. The hub-and-spoke model is the organizational pattern that makes Data Mesh practical at scale.
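One way federated governance gets automated is a CI check that every published data product meets the global standards (naming convention, required ownership and PII metadata) before a change can merge. A hedged sketch; the manifest format and rules below are assumptions.
# governance/check_data_products.py (illustrative CI check; metadata format is hypothetical)
import re
import sys

REQUIRED_KEYS = {"owner", "sla", "pii_columns"}           # global governance standard
NAME_PATTERN = re.compile(r"^(dim|fct|stg)_[a-z0-9_]+$")  # shared naming convention

def validate(products: dict) -> list:
    """Return a list of violations for the domain's published data products."""
    violations = []
    for name, meta in products.items():
        if not NAME_PATTERN.match(name):
            violations.append(f"{name}: does not follow naming convention")
        missing = REQUIRED_KEYS - set(meta)
        if missing:
            violations.append(f"{name}: missing required metadata {sorted(missing)}")
    return violations

if __name__ == "__main__":
    # In CI, this dict would be loaded from each domain's data product manifest
    sample = {"fct_orders": {"owner": "orders-team", "sla": "06:00 UTC", "pii_columns": []}}
    problems = validate(sample)
    print("\n".join(problems) if problems else "All data products compliant.")
    sys.exit(1 if problems else 0)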
ADRs document the why behind technical decisions: what options were evaluated, why one was chosen, what trade-offs were accepted, and when to revisit. In enterprises, this matters because: (1) Team turnover — the person who made the decision will eventually leave. Without an ADR, the next engineer doesn't know why Airflow was chosen over Prefect and may waste weeks re-evaluating. (2) Audit and compliance — SOX, HIPAA, and other regulations require documented decision trails. "We chose this architecture because..." is an auditable artifact. (3) Prevents re-litigation — without ADRs, the same decisions get debated every 6 months. An ADR closes the discussion until the stated review date. (4) Onboarding — new team members read the ADR backlog and understand the system's evolution in hours instead of months. (5) Interview signal — discussing how you document and communicate decisions demonstrates senior-level engineering maturity. An ADR takes 30 minutes to write and saves hundreds of hours over its lifetime.
Pipeline monitoring tracks whether jobs run successfully: Did the Airflow DAG complete? What was the runtime? Were there any errors? This is necessary but insufficient — a pipeline can succeed while producing bad data (empty source, wrong schema, stale data). Data observability monitors the data itself across five dimensions: (1) Freshness — is the latest data within the expected time window? (2) Volume — are row counts within expected bounds? (3) Quality — do values pass validation rules (uniqueness, nullability, ranges)? (4) Schema — have columns been added, removed, or changed type unexpectedly? (5) Distribution — have column value distributions shifted anomalously? You need both: pipeline monitoring catches infrastructure failures (job crashed, cluster timeout), while data observability catches data failures (source sent empty payload, upstream schema changed, business logic drift). Tools: Airflow/Datadog for pipeline monitoring; Elementary/Monte Carlo/Great Expectations for data observability.
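A sketch of what two of those dimensions (volume and schema) can look like as lightweight in-house checks before reaching for a dedicated tool; the expected-schema snapshot and row-count band are assumptions.
# observability/basic_checks.py (illustrative sketch)
from pyspark.sql import SparkSession

def check_volume(spark: SparkSession, table: str, min_rows: int, max_rows: int) -> bool:
    """Volume: is the table's row count inside the expected band?"""
    n = spark.table(table).count()
    return min_rows <= n <= max_rows

def check_schema(spark: SparkSession, table: str, expected: dict) -> list:
    """Schema: report added, dropped, or retyped columns vs. an expected snapshot."""
    actual = {f.name: f.dataType.simpleString() for f in spark.table(table).schema.fields}
    issues = [f"missing column: {c}" for c in expected if c not in actual]
    issues += [f"unexpected column: {c}" for c in actual if c not in expected]
    issues += [f"type change on {c}: {expected[c]} -> {actual[c]}"
               for c in expected if c in actual and actual[c] != expected[c]]
    return issues

# e.g. check_schema(spark, "gold.fct_orders", {"order_id": "string", "total_amount": "decimal(18,2)"})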
Effective FinOps for data platforms requires: (1) Visibility — tag every resource (team, project, environment), build cost dashboards by team/pipeline, send weekly cost reports to team leads. You can't optimize what you can't see. (2) Attribution — implement showback (showing teams their costs) or chargeback (billing teams internally). This creates accountability. (3) Right-sizing — audit compute monthly: check executor memory utilization, spill-to-disk metrics, and idle time. Most clusters are 2-3x over-provisioned. (4) Scheduling — auto-terminate dev clusters after 15 min idle, auto-suspend Snowflake warehouses after 1-5 min. Stop dev clusters on weekends/holidays. (5) Reserved capacity — for predictable baseline workloads, commit to 1-year or 3-year reserved pricing (30-60% savings). Use spot/preemptible for fault-tolerant batch jobs. (6) Data lifecycle — auto-tier cold storage, enforce retention policies, clean orphaned data. (7) Query optimization — partition pruning, clustering, avoiding SELECT *, using incremental models instead of full rebuilds. (8) Anomaly detection — automated alerts when any team's spend increases >25% week-over-week.