Terraform

Summary

Terraform (HashiCorp, now BSL-licensed) is the dominant Infrastructure as Code (IaC) tool in data engineering. It manages the full lifecycle — create, update, destroy — of cloud resources using declarative HCL (HashiCorp Configuration Language). For a DE, Terraform is how you provision the infrastructure your pipelines run on: Databricks clusters, Azure Data Factory pipelines, Snowflake warehouses, S3 buckets / ADLS Gen2 storage accounts, IAM roles, VPCs, and more. The key capabilities to master for interviews are state management, modules, remote backends, provider version pinning, and import/drift detection.

This guide targets Terraform 1.7+ (OpenTofu is largely compatible). Azure and Databricks examples are used throughout, reflecting common DE interview contexts, but all concepts apply equally to AWS/GCP.

Table of Contents

Core Concepts
Industry Use Cases
Code Examples
Comparison / When to Use
Gotchas & Anti-patterns
Exercises
Quiz
Further Reading

Core Concepts

HCL Structure: Blocks and Expressions

A Terraform configuration is a directory of .tf files. Terraform merges them automatically. Three key block types form 90% of configs:
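
A minimal sketch, assuming the three are variable, resource, and output blocks (names and values here are illustrative, not from the original):

variable "location" {
  type    = string
  default = "westeurope"
}

resource "azurerm_resource_group" "main" {
  name     = "rg-data-platform-dev"
  location = var.location                      # expression referencing a variable
}

output "resource_group_id" {
  value = azurerm_resource_group.main.id       # expression referencing a resource attribute
}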

Expressions reference other blocks via interpolation: azurerm_resource_group.main.name or var.location. Terraform builds an implicit dependency graph from references and plans changes in topological order. Use depends_on only for hidden dependencies that can't be expressed through references.

Providers & Registry

Providers are plugins that translate HCL resources into API calls. They are downloaded from the Terraform Registry during terraform init into .terraform/providers/. The required_providers block inside the top-level terraform {} block pins the source and version:
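
A minimal pin, mirroring the fuller setup in Example 1 below:

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"   # registry namespace/name
      version = "~> 3.90"             # any 3.x release at or above 3.90
    }
  }
}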

State Management

Terraform state (terraform.tfstate) maps your HCL configuration to real infrastructure. Without state, Terraform cannot detect drift, plan updates, or destroy resources correctly. Key concepts:
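
A sketch of the day-to-day state workflow (resource addresses are illustrative; the commands are standard Terraform CLI):

# Inspect what Terraform currently tracks
terraform state list
terraform state show azurerm_storage_account.datalake

# Rename an address after refactoring (e.g., moving a resource into a module) without destroy/recreate
terraform state mv azurerm_storage_account.datalake module.datalake.azurerm_storage_account.this

# Stop managing a resource without deleting it in the cloud
terraform state rm azurerm_storage_account.datalake

# Re-read real infrastructure and report drift without proposing config changes
terraform plan -refresh-only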

Modules

A module is any directory of .tf files. The root module is the directory where you run terraform apply. Child modules are called via module blocks and can come from local paths, the Terraform Registry, or Git URLs.
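
A sketch of the three source styles (module name, paths, and inputs are hypothetical):

module "datalake" {
  source = "./modules/datalake"                                                    # local path
  # source  = "my-org/datalake/azurerm"                                            # Terraform Registry (hypothetical)
  # version = "~> 1.0"                                                             # version pinning only works for registry sources
  # source = "git::https://github.com/my-org/tf-modules.git//datalake?ref=v1.4.0"  # Git URL pinned to a tag

  project     = var.project
  environment = var.environment
}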

Variables, Outputs & Locals

| Construct | Purpose | Notes |
|---|---|---|
| variable "x" {} | Parameterise configs (env-specific) | Set via -var, .tfvars, env vars (TF_VAR_x) |
| locals {} | Intermediate expressions, avoid repetition | Not user-configurable; not exposed as inputs/outputs |
| output "y" {} | Expose values to callers or operators | Shown after apply; readable via terraform output |
| data "x" "y" {} | Read existing infra not managed here | Fetched during plan; use for VPCs, DNS zones, ACR images |

Variable precedence (highest to lowest): CLI -var → *.auto.tfvars → terraform.tfvars → environment variables (TF_VAR_*) → default in the variable {} block.
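
The four constructs side by side in one sketch (names are illustrative):

variable "environment" {
  type        = string
  description = "dev, staging or prod"
}

locals {
  name_prefix = "dp-${var.environment}"       # intermediate expression, not settable from outside
}

data "azurerm_resource_group" "shared" {
  name = "rg-shared-networking"               # existing infrastructure, read during plan
}

output "name_prefix" {
  value = local.name_prefix                   # readable via terraform output
}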

Resource Lifecycle & Meta-Arguments

Meta-arguments modify how Terraform manages resources:
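
A sketch of the most common meta-arguments on a single resource (the role assignment referenced in depends_on is hypothetical):

resource "azurerm_storage_container" "zones" {
  for_each             = toset(["bronze", "silver", "gold"])    # one instance per key; count is the index-based alternative
  name                 = each.key
  storage_account_name = azurerm_storage_account.datalake.name

  depends_on = [azurerm_role_assignment.etl]   # hidden dependency that no attribute reference expresses

  lifecycle {
    prevent_destroy = true                     # plan errors if a destroy of this resource is ever proposed
  }
}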

↑ Back to top

Industry Use Cases

Provisioning a Databricks Workspace + Cluster

A data platform team uses Terraform to provision Azure Databricks workspaces per environment (dev/staging/prod) from a shared module. Each workspace gets: a managed resource group, a VNet with workspace subnet injection, a Premium SKU, a service principal for CI/CD, a single-node dev cluster (auto-terminating in 30 min), and an all-purpose prod cluster. Environment-specific .tfvars control node types and cluster sizes. The same Terraform config is applied in a GitHub Actions workflow on PR merge to the main branch — no manual portal clicks.

Data Lake Storage with RBAC

An ADLS Gen2 storage account with hierarchical namespace is provisioned along with lifecycle management rules (move blobs to Cool after 30 days, Archive after 90 days, delete after 365 days). Storage containers for bronze/silver/gold zones are created. Azure AD groups are looked up via data blocks and assigned appropriate RBAC roles (Storage Blob Data Contributor for ETL service principals, Storage Blob Data Reader for BI tools). All of this lives in a reusable module with input variables for retention periods and principal IDs.

Multi-Environment Snowflake Warehouse Management

The Snowflake provider manages warehouses, databases, schemas, roles, and grants across dev/prod environments. The same Terraform module provisions an X-Small warehouse for dev and an X-Large (auto-suspend 60s, auto-resume) for prod. Role hierarchies and grants are codified, so there are no ad-hoc GRANT statements in runbooks. Drift detection (terraform plan in CI) alerts the team when someone has manually changed a warehouse size in the Snowflake console.

Infrastructure Teardown in Feature Branch Environments

Each feature branch in a data product repo provisions a short-lived Postgres + Airflow environment using Terraform + GitHub Actions. On PR close, a terraform destroy job cleans up all resources. The remote state is stored in an S3 bucket keyed by branch name. prevent_destroy is intentionally omitted on ephemeral environments. This pattern dramatically reduces cloud costs from long-lived development environments.
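
One way to implement the branch-keyed state, assuming GitHub Actions supplies the branch name via GITHUB_REF_NAME (bucket name and key prefix are illustrative). Backend blocks cannot interpolate variables, so the key is injected at init time:

# backend.tf (bucket and region only; the key is supplied at init)
terraform {
  backend "s3" {
    bucket = "my-company-tf-state"
    region = "eu-west-1"
  }
}

# CI step: inject the branch-specific state key
terraform init -backend-config="key=feature-envs/${GITHUB_REF_NAME}.tfstate"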

↑ Back to top

Code Examples

Example 1 — Provider Setup with Remote State (Azure Blob Backend)

# versions.tf
terraform {
  required_version = "~> 1.7"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.90"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.40"
    }
  }

  # Remote backend — state stored in Azure Blob Storage
  backend "azurerm" {
    resource_group_name  = "tf-state-rg"
    storage_account_name = "mycompanytfstate"
    container_name       = "tfstate"
    key                  = "data-platform/prod.tfstate"
  }
}

provider "azurerm" {
  features {}
  # Uses ARM_CLIENT_ID / ARM_CLIENT_SECRET / ARM_TENANT_ID / ARM_SUBSCRIPTION_ID env vars,
  # or OIDC workload identity federation when running in Azure DevOps / GitHub Actions
}

provider "databricks" {
  host = azurerm_databricks_workspace.main.workspace_url
  # azure_client_id / azure_client_secret set via env vars
}

Example 2 — Azure Resource Group, ADLS Gen2, and Databricks Workspace

# main.tf
resource "azurerm_resource_group" "main" {
  name     = "rg-${var.project}-${var.environment}-${var.location_short}"
  location = var.location

  tags = local.common_tags
}

resource "azurerm_storage_account" "datalake" {
  name                     = "sa${var.project}${var.environment}"  # 3-24 chars, lowercase
  resource_group_name      = azurerm_resource_group.main.name
  location                 = azurerm_resource_group.main.location
  account_tier             = "Standard"
  account_replication_type = "GRS"
  is_hns_enabled           = true   # Hierarchical namespace = ADLS Gen2
  min_tls_version          = "TLS1_2"

  blob_properties {
    delete_retention_policy { days = 30 }
  }

  tags = local.common_tags
}

resource "azurerm_storage_data_lake_gen2_filesystem" "bronze" {
  name               = "bronze"
  storage_account_id = azurerm_storage_account.datalake.id
}

resource "azurerm_databricks_workspace" "main" {
  name                = "dbw-${var.project}-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "premium"

  lifecycle {
    prevent_destroy = true  # Never destroy prod workspace accidentally
  }

  tags = local.common_tags
}

# locals.tf
locals {
  common_tags = {
    project     = var.project
    environment = var.environment
    managed_by  = "terraform"
  }
  location_short = {
    "eastus"        = "eus"
    "westeurope"    = "weu"
    "northeurope"   = "neu"
  }[var.location]
}

Example 3 — Databricks Cluster with for_each Environments

# clusters.tf
variable "cluster_configs" {
  type = map(object({
    node_type     = string
    num_workers   = number
    autoterminate = number
  }))
  default = {
    dev = {
      node_type     = "Standard_DS3_v2"
      num_workers   = 1
      autoterminate = 30
    }
    prod = {
      node_type     = "Standard_DS5_v2"
      num_workers   = 8
      autoterminate = 120
    }
  }
}

resource "databricks_cluster" "env_clusters" {
  for_each = var.cluster_configs

  cluster_name            = "apc-${each.key}"
  spark_version           = "14.3.x-scala2.12"
  node_type_id            = each.value.node_type
  num_workers             = each.value.num_workers
  autotermination_minutes = each.value.autoterminate

  spark_conf = {
    "spark.databricks.delta.preview.enabled" = "true"
  }

  library {
    pypi { package = "dbt-databricks==1.8.0" }
  }
}

output "cluster_ids" {
  value = { for k, v in databricks_cluster.env_clusters : k => v.id }
}

Example 4 — Importing Existing Resources + Drift Detection in CI

# Import an existing storage container that was created manually
# (Terraform 1.5+ declarative import block)
import {
  to = azurerm_storage_data_lake_gen2_filesystem.silver
  id  = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-myproject-prod-eus/providers/Microsoft.Storage/storageAccounts/samyprojectprod/blobServices/default/containers/silver"
}

resource "azurerm_storage_data_lake_gen2_filesystem" "silver" {
  name               = "silver"
  storage_account_id = azurerm_storage_account.datalake.id
}

---
# .github/workflows/tf-drift.yml (simplified)
# Runs on a schedule to detect configuration drift

# name: Terraform Drift Detection
# on:
#   schedule:
#     - cron: '0 6 * * 1-5'  # 6am UTC weekdays
#
# jobs:
#   drift:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       - uses: hashicorp/setup-terraform@v3
#         with: { terraform_version: 1.7.5 }
#       - run: terraform init
#       - run: terraform plan -detailed-exitcode
#         id: plan
#         continue-on-error: true  # exit code 2 (changes pending) must not fail the job before the alert step
#       - if: steps.plan.outputs.exitcode == '2'  # 2 = changes detected
#         uses: slackapi/slack-github-action@v1
#         with:
#           payload: '{"text":"⚠️ Terraform drift detected in prod!"}'
↑ Back to top

Comparison / When to Use

| Feature | Terraform / OpenTofu | Pulumi | Bicep / ARM | AWS CDK | Ansible |
|---|---|---|---|---|---|
| Language | HCL (declarative) | Python / TypeScript / Go (imperative) | Bicep (Azure DSL) | TypeScript / Python | YAML + Jinja |
| Multi-cloud | ✅ 3000+ providers | ✅ Same providers | ❌ Azure only | ❌ AWS only | ✅ Limited |
| State management | Built-in .tfstate | Pulumi Cloud / S3 | ARM API (no local state) | CloudFormation stacks | No state |
| Learning curve | Low (HCL simple) | Medium (requires coding) | Low (Azure-native) | Medium | Low |
| IDE / test tooling | Good (Terraform Test, TFLint) | Excellent (type safety) | Good (VS Code extension) | Excellent (Jest/Pytest) | Molecule |
| Best for | Multi-cloud IaC, DE platform teams | Complex logic, developer-friendly IaC | Azure-only orgs with Microsoft support | AWS-native CDK teams | Config management, not infra provisioning |

Rule of thumb: Use Terraform (or OpenTofu) unless your team is Azure-only (Bicep is simpler) or has strong Python/TypeScript skills and needs loops/conditionals that HCL makes awkward (Pulumi). Avoid Ansible for idempotent cloud resource management — it lacks reliable state tracking.

↑ Back to top

Gotchas & Anti-patterns

  1. Using count instead of for_each for maps. If you manage a list of resources with count = length(var.buckets) and remove item at index 1, Terraform destroys and recreates all resources from index 1 onward because their numeric addresses change. Always use for_each with string keys for resources that may have items removed from the middle of the collection (see the sketch after this list).
  2. Storing sensitive values in variables and echoing them in outputs. Even without sensitive = true, any output value is stored in plaintext in terraform.tfstate. Mark sensitive variables with sensitive = true (suppresses CLI display), but remember the value is still in state. Prefer retrieving secrets from Key Vault / Secrets Manager at runtime rather than passing them through Terraform.
  3. Forgetting to commit .terraform.lock.hcl. The lock file pins exact provider versions and checksums. Without it, terraform init in CI may pull a newer provider version that breaks your configuration or changes resource behaviour silently. Always commit this file. Never .gitignore it.
  4. Running terraform apply without reviewing plan. Always run terraform plan -out=tfplan first and review the output — especially lines with # forces replacement. Replacement means destroy + create, which causes downtime for live resources like RDS instances, Kafka clusters, or Databricks workspaces. Use lifecycle { create_before_destroy = true } where replacement is unavoidable.
  5. Shared state for multiple environments in one root module. Running terraform apply for dev and prod from the same directory (using terraform workspace) can lead to accidental prod applies. Use separate root modules (or directories with separate backends) for each environment rather than workspaces when isolation is critical. Terraform Cloud/Enterprise workspaces have stronger guardrails; OSS workspaces are just state-file namespaces.
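
A minimal sketch of gotcha 1 (container names are illustrative):

# Fragile: removing "silver" shifts "gold" from index 2 to 1 and forces a destroy/recreate
resource "azurerm_storage_container" "by_index" {
  count                = length(["bronze", "silver", "gold"])
  name                 = ["bronze", "silver", "gold"][count.index]
  storage_account_name = azurerm_storage_account.datalake.name
}

# Stable: each instance is addressed by key, e.g. azurerm_storage_container.by_key["gold"]
resource "azurerm_storage_container" "by_key" {
  for_each             = toset(["bronze", "silver", "gold"])
  name                 = each.key
  storage_account_name = azurerm_storage_account.datalake.name
}
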
↑ Back to top

Exercises

  1. Module authoring: Write a reusable module that provisions an Azure Storage Account (ADLS Gen2 enabled) with three containers (bronze/silver/gold), configurable retention lifecycle rules, and outputs the storage account name and primary endpoint. Call the module from a root config with separate dev.tfvars and prod.tfvars files. Validate with terraform validate and terraform plan.
  2. Drift detection simulation: Apply a Terraform config that creates an Azure resource group. Then manually add a tag to it in the Azure portal. Run terraform plan and observe the detected drift. Restore the state using terraform apply. Discuss when ignore_changes would be appropriate vs resolving the drift.
  3. Import existing resources: Create an Azure storage container manually (simulate "already existing infra"). Write a Terraform config with an import {} block (Terraform 1.5+) for it. Run terraform plan to confirm no changes are needed after import. Compare this approach with the older terraform import CLI command and explain the workflow difference.
↑ Back to top

Quiz

Q1: Why should you use for_each instead of count when creating multiple resource instances from a list?

With count, resources are addressed by index (e.g., azurerm_storage_account.buckets[0]). If you remove an item from the middle of the list, all resources with a higher index are destroyed and recreated because their addresses shift. With for_each, resources are addressed by a unique key (e.g., azurerm_storage_account.buckets["bronze"]). Removing one key only affects that specific resource, leaving all others untouched. Use toset(var.names) or a map for the for_each value.

Q2: What happens to sensitive values passed as Terraform variables (sensitive = true)?

The sensitive = true flag suppresses the value in CLI output (terraform plan / apply shows (sensitive value)). However, the plaintext value is still written to the state file (terraform.tfstate). The flag does NOT encrypt the state. You must separately encrypt the remote state backend (S3 SSE, Azure Blob at-rest encryption) and restrict access via IAM. For high-sensitivity secrets, prefer fetching them from Key Vault at runtime (in the application) rather than piping them through Terraform resources.
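
A sketch of the recommended split: Terraform only marks the variable sensitive and grants the application's identity read access, while the application fetches the secret itself at runtime (the Key Vault resource and principal variable here are hypothetical):

variable "db_password" {
  type      = string
  sensitive = true   # hidden in CLI output, but still written in plaintext to terraform.tfstate
}

# Grant the app's identity read access; the application reads the secret at runtime,
# so the secret value never passes through Terraform or its state.
resource "azurerm_role_assignment" "app_kv_reader" {
  scope                = azurerm_key_vault.platform.id   # hypothetical Key Vault resource
  role_definition_name = "Key Vault Secrets User"
  principal_id         = var.app_principal_id            # hypothetical input variable
}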

Q3: What is terraform plan exit code 2 and how is it used in CI?

terraform plan -detailed-exitcode returns exit code 0 (success, no changes), 1 (error), or 2 (success, changes are pending). In CI pipelines, exit code 2 is used to detect configuration drift: run terraform plan -detailed-exitcode on a schedule; if it exits with 2, alert the team (Slack, PagerDuty) that real infrastructure has diverged from the Terraform config. This is the foundation of drift detection workflows in mature IaC practices.

Q4: Explain the difference between terraform destroy and using prevent_destroy. When do you use each?

terraform destroy is a CLI command that destroys all resources managed by the current root module. It's useful for tearing down ephemeral environments (feature branches, staging). lifecycle { prevent_destroy = true } is a safeguard in HCL that causes terraform plan to error with "Instance cannot be destroyed" if Terraform's plan includes destroying that resource — protecting production databases, Databricks workspaces, or S3 buckets from accidental deletion via apply or destroy. Use prevent_destroy on any resource where accidental destruction would cause data loss or outages.

Q5: What is state drift and what causes it? How do you detect and remediate it?

State drift occurs when real cloud infrastructure diverges from what Terraform's state file records. Common causes: manual changes in the cloud console, other automation tools (runbooks, scripts) modifying resources, cloud provider changes on managed attributes (auto-assigned IPs, tags added by policies). Detection: terraform plan compares state vs live API call results and reports differences. Remediation options: (1) terraform apply to bring infra back to the Terraform-defined state; (2) terraform state rm + re-import if the resource should remain as-is; (3) Update HCL to match the drifted state if the manual change was intentional; (4) Add ignore_changes for attributes managed out-of-band (e.g., tags managed by a cloud governance policy).
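
A sketch of option (4), tolerating tags that a governance policy manages out-of-band (the attribute choice is illustrative):

resource "azurerm_resource_group" "main" {
  name     = "rg-data-platform-prod"
  location = "westeurope"

  lifecycle {
    # Azure Policy appends cost-centre tags after creation; don't report them as drift
    ignore_changes = [tags]
  }
}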

↑ Back to top

Further Reading

↑ Back to top