Terraform

Summary

Terraform (HashiCorp, now BSL-licensed) is the dominant Infrastructure as Code (IaC) tool in data engineering. It manages the full lifecycle — create, update, destroy — of cloud resources using declarative HCL (HashiCorp Configuration Language). For a DE, Terraform is how you provision the infrastructure your pipelines run on: Databricks clusters, Azure Data Factory pipelines, Snowflake warehouses, S3 buckets / ADLS Gen2 storage accounts, IAM roles, VPCs, and more. The key capabilities to master for interviews are state management, modules, remote backends, provider version pinning, and import/drift detection.

This guide targets Terraform 1.7+ (OpenTofu is largely compatible). Azure and Databricks examples are used throughout, reflecting common DE interview contexts, but all concepts apply equally to AWS/GCP.

Table of Contents

Core Concepts
Industry Use Cases
Code Examples
Comparison / When to Use
Gotchas & Anti-patterns
Exercises
Quiz
Further Reading

Core Concepts

HCL Structure: Blocks and Expressions

A Terraform configuration is a directory of .tf files. Terraform merges them automatically. Three key block types form 90% of configs:
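
A minimal sketch, assuming the three are variable, resource, and output blocks (names and values here are illustrative, not from the original):

variable "location" {
  type    = string
  default = "westeurope"
}

resource "azurerm_resource_group" "main" {
  name     = "rg-data-platform-dev"
  location = var.location                      # expression referencing a variable
}

output "resource_group_id" {
  value = azurerm_resource_group.main.id       # expression referencing a resource attribute
}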

Expressions reference other blocks via interpolation: azurerm_resource_group.main.name or var.location. Terraform builds an implicit dependency graph from references and plans changes in topological order. Use depends_on only for hidden dependencies that can't be expressed through references.

Providers & Registry

Providers are plugins that translate HCL resources into API calls. They are downloaded from the Terraform Registry during terraform init into .terraform/providers/. The required_providers block inside the top-level terraform {} block pins the source and version:
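
A minimal pin, mirroring the fuller setup in Example 1 below:

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"   # registry namespace/name
      version = "~> 3.90"             # any 3.x release at or above 3.90
    }
  }
}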

State Management

Terraform state (terraform.tfstate) maps your HCL configuration to real infrastructure. Without state, Terraform cannot detect drift, plan updates, or destroy resources correctly. Key concepts:
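
A sketch of the day-to-day state workflow (resource addresses are illustrative; the commands are standard Terraform CLI):

# Inspect what Terraform currently tracks
terraform state list
terraform state show azurerm_storage_account.datalake

# Rename an address after refactoring (e.g., moving a resource into a module) without destroy/recreate
terraform state mv azurerm_storage_account.datalake module.datalake.azurerm_storage_account.this

# Stop managing a resource without deleting it in the cloud
terraform state rm azurerm_storage_account.datalake

# Re-read real infrastructure and report drift without proposing config changes
terraform plan -refresh-only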

Modules

A module is any directory of .tf files. The root module is the directory where you run terraform apply. Child modules are called via module blocks and can come from local paths, the Terraform Registry, or Git URLs.
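
A sketch of the three source styles (module name, paths, and inputs are hypothetical):

module "datalake" {
  source = "./modules/datalake"                                                    # local path
  # source  = "my-org/datalake/azurerm"                                            # Terraform Registry (hypothetical)
  # version = "~> 1.0"                                                             # version pinning only works for registry sources
  # source = "git::https://github.com/my-org/tf-modules.git//datalake?ref=v1.4.0"  # Git URL pinned to a tag

  project     = var.project
  environment = var.environment
}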

Variables, Outputs & Locals

| Construct | Purpose | Notes |
|---|---|---|
| variable "x" {} | Parameterise configs (env-specific) | Set via -var, .tfvars, env vars (TF_VAR_x) |
| locals {} | Intermediate expressions, avoid repetition | Not user-configurable; not exposed as inputs/outputs |
| output "y" {} | Expose values to callers or operators | Shown after apply; readable via terraform output |
| data "x" "y" {} | Read existing infra not managed here | Fetched during plan; use for VPCs, DNS zones, ACR images |

Variable precedence (highest to lowest): CLI -var → *.auto.tfvars → terraform.tfvars → environment variables (TF_VAR_*) → default in the variable {} block.
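
The four constructs side by side in one sketch (names are illustrative):

variable "environment" {
  type        = string
  description = "dev, staging or prod"
}

locals {
  name_prefix = "dp-${var.environment}"       # intermediate expression, not settable from outside
}

data "azurerm_resource_group" "shared" {
  name = "rg-shared-networking"               # existing infrastructure, read during plan
}

output "name_prefix" {
  value = local.name_prefix                   # readable via terraform output
}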

Resource Lifecycle & Meta-Arguments

Meta-arguments modify how Terraform manages resources:
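
A sketch of the most common meta-arguments on a single resource (the role assignment referenced in depends_on is hypothetical):

resource "azurerm_storage_container" "zones" {
  for_each             = toset(["bronze", "silver", "gold"])    # one instance per key; count is the index-based alternative
  name                 = each.key
  storage_account_name = azurerm_storage_account.datalake.name

  depends_on = [azurerm_role_assignment.etl]   # hidden dependency that no attribute reference expresses

  lifecycle {
    prevent_destroy = true                     # plan errors if a destroy of this resource is ever proposed
  }
}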

↑ Back to top

Industry Use Cases

Provisioning a Databricks Workspace + Cluster

A data platform team uses Terraform to provision Azure Databricks workspaces per environment (dev/staging/prod) from a shared module. Each workspace gets: a managed resource group, a VNet with workspace subnet injection, a Premium SKU, a service principal for CI/CD, a single-node dev cluster (auto-terminating in 30 min), and an all-purpose prod cluster. Environment-specific .tfvars control node types and cluster sizes. The same Terraform config is applied in a GitHub Actions workflow on PR merge to the main branch — no manual portal clicks.

Data Lake Storage with RBAC

An ADLS Gen2 storage account with hierarchical namespace is provisioned along with lifecycle management rules (move blobs to Cool after 30 days, Archive after 90 days, delete after 365 days). Storage containers for bronze/silver/gold zones are created. Azure AD groups are looked up via data blocks and assigned appropriate RBAC roles (Storage Blob Data Contributor for ETL service principals, Storage Blob Data Reader for BI tools). All of this lives in a reusable module with input variables for retention periods and principal IDs.

Multi-Environment Snowflake Warehouse Management

The Snowflake provider manages warehouses, databases, schemas, roles, and grants across dev/prod environments. The same Terraform module provisions an X-Small warehouse for dev and an X-Large (auto-suspend 60s, auto-resume) for prod. Role hierarchies and grants are codified, so there are no ad-hoc GRANT statements in runbooks. Drift detection (terraform plan in CI) alerts the team when someone has manually changed a warehouse size in the Snowflake console.

Infrastructure Teardown in Feature Branch Environments

Each feature branch in a data product repo provisions a short-lived Postgres + Airflow environment using Terraform + GitHub Actions. On PR close, a terraform destroy job cleans up all resources. The remote state is stored in an S3 bucket keyed by branch name. prevent_destroy is intentionally omitted on ephemeral environments. This pattern dramatically reduces cloud costs from long-lived development environments.
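
One way to implement the branch-keyed state, assuming GitHub Actions supplies the branch name via GITHUB_REF_NAME (bucket name and key prefix are illustrative). Backend blocks cannot interpolate variables, so the key is injected at init time:

# backend.tf (bucket and region only; the key is supplied at init)
terraform {
  backend "s3" {
    bucket = "my-company-tf-state"
    region = "eu-west-1"
  }
}

# CI step: inject the branch-specific state key
terraform init -backend-config="key=feature-envs/${GITHUB_REF_NAME}.tfstate"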

↑ Back to top

Code Examples

Example 1 — Provider Setup with Remote State (Azure Blob Backend)

# versions.tf
terraform {
  required_version = "~> 1.7"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.90"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.40"
    }
  }

  # Remote backend — state stored in Azure Blob Storage
  backend "azurerm" {
    resource_group_name  = "tf-state-rg"
    storage_account_name = "mycompanytfstate"
    container_name       = "tfstate"
    key                  = "data-platform/prod.tfstate"
  }
}

provider "azurerm" {
  features {}
  # Uses ARM_CLIENT_ID / ARM_CLIENT_SECRET / ARM_TENANT_ID / ARM_SUBSCRIPTION_ID env vars,
  # or OIDC workload identity federation when running in Azure DevOps / GitHub Actions
}

provider "databricks" {
  host = azurerm_databricks_workspace.main.workspace_url
  # azure_client_id / azure_client_secret set via env vars
}

Example 2 — Azure Resource Group, ADLS Gen2, and Databricks Workspace

# main.tf
resource "azurerm_resource_group" "main" {
  name     = "rg-${var.project}-${var.environment}-${var.location_short}"
  location = var.location

  tags = local.common_tags
}

resource "azurerm_storage_account" "datalake" {
  name                     = "sa${var.project}${var.environment}"  # 3-24 chars, lowercase
  resource_group_name      = azurerm_resource_group.main.name
  location                 = azurerm_resource_group.main.location
  account_tier             = "Standard"
  account_replication_type = "GRS"
  is_hns_enabled           = true   # Hierarchical namespace = ADLS Gen2
  min_tls_version          = "TLS1_2"

  blob_properties {
    delete_retention_policy { days = 30 }
  }

  tags = local.common_tags
}

resource "azurerm_storage_data_lake_gen2_filesystem" "bronze" {
  name               = "bronze"
  storage_account_id = azurerm_storage_account.datalake.id
}

resource "azurerm_databricks_workspace" "main" {
  name                = "dbw-${var.project}-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "premium"

  lifecycle {
    prevent_destroy = true  # Never destroy prod workspace accidentally
  }

  tags = local.common_tags
}

# locals.tf
locals {
  common_tags = {
    project     = var.project
    environment = var.environment
    managed_by  = "terraform"
  }
  location_short = {
    "eastus"        = "eus"
    "westeurope"    = "weu"
    "northeurope"   = "neu"
  }[var.location]
}

Example 3 — Databricks Cluster with for_each Environments

# clusters.tf
variable "cluster_configs" {
  type = map(object({
    node_type     = string
    num_workers   = number
    autoterminate = number
  }))
  default = {
    dev = {
      node_type     = "Standard_DS3_v2"
      num_workers   = 1
      autoterminate = 30
    }
    prod = {
      node_type     = "Standard_DS5_v2"
      num_workers   = 8
      autoterminate = 120
    }
  }
}

resource "databricks_cluster" "env_clusters" {
  for_each = var.cluster_configs

  cluster_name            = "apc-${each.key}"
  spark_version           = "14.3.x-scala2.12"
  node_type_id            = each.value.node_type
  num_workers             = each.value.num_workers
  autotermination_minutes = each.value.autoterminate

  spark_conf = {
    "spark.databricks.delta.preview.enabled" = "true"
  }

  library {
    pypi { package = "dbt-databricks==1.8.0" }
  }
}

output "cluster_ids" {
  value = { for k, v in databricks_cluster.env_clusters : k => v.id }
}

Example 4 — Importing Existing Resources + Drift Detection in CI

# Import an existing storage container that was created manually
# (Terraform 1.5+ declarative import block)
import {
  to = azurerm_storage_data_lake_gen2_filesystem.silver
  id  = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-myproject-prod-eus/providers/Microsoft.Storage/storageAccounts/samyprojectprod/blobServices/default/containers/silver"
}

resource "azurerm_storage_data_lake_gen2_filesystem" "silver" {
  name               = "silver"
  storage_account_id = azurerm_storage_account.datalake.id
}

---
# .github/workflows/tf-drift.yml (simplified)
# Runs on a schedule to detect configuration drift

# name: Terraform Drift Detection
# on:
#   schedule:
#     - cron: '0 6 * * 1-5'  # 6am UTC weekdays
#
# jobs:
#   drift:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       - uses: hashicorp/setup-terraform@v3
#         with: { terraform_version: 1.7.5 }
#       - run: terraform init
#       - run: terraform plan -detailed-exitcode
#         id: plan
#         continue-on-error: true  # exit code 2 (changes pending) must not fail the job before the alert step
#       - if: steps.plan.outputs.exitcode == '2'  # 2 = changes detected
#         uses: slackapi/slack-github-action@v1
#         with:
#           payload: '{"text":"⚠️ Terraform drift detected in prod!"}'
↑ Back to top

Comparison / When to Use

| Feature | Terraform / OpenTofu | Pulumi | Bicep / ARM | AWS CDK | Ansible |
|---|---|---|---|---|---|
| Language | HCL (declarative) | Python / TypeScript / Go (imperative) | Bicep (Azure DSL) | TypeScript / Python | YAML + Jinja |
| Multi-cloud | ✅ 3000+ providers | ✅ Same providers | ❌ Azure only | ❌ AWS only | ✅ Limited |
| State management | Built-in .tfstate | Pulumi Cloud / S3 | ARM API (no local state) | CloudFormation stacks | No state |
| Learning curve | Low (HCL simple) | Medium (requires coding) | Low (Azure-native) | Medium | Low |
| IDE / test tooling | Good (Terraform Test, TFLint) | Excellent (type safety) | Good (VS Code extension) | Excellent (Jest/Pytest) | Molecule |
| Best for | Multi-cloud IaC, DE platform teams | Complex logic, developer-friendly IaC | Azure-only orgs with Microsoft support | AWS-native CDK teams | Config management, not infra provisioning |

Rule of thumb: Use Terraform (or OpenTofu) unless your team is Azure-only (Bicep is simpler) or has strong Python/TypeScript skills and needs loops/conditionals that HCL makes awkward (Pulumi). Avoid Ansible for idempotent cloud resource management — it lacks reliable state tracking.

↑ Back to top

Gotchas & Anti-patterns

  1. Using count instead of for_each for maps. If you manage a list of resources with count = length(var.buckets) and remove item at index 1, Terraform destroys and recreates all resources from index 1 onward because their numeric addresses change. Always use for_each with string keys for resources that may have items removed from the middle of the collection (see the sketch after this list).
  2. Storing sensitive values in variables and echoing them in outputs. Even without sensitive = true, any output value is stored in plaintext in terraform.tfstate. Mark sensitive variables with sensitive = true (suppresses CLI display), but remember the value is still in state. Prefer retrieving secrets from Key Vault / Secrets Manager at runtime rather than passing them through Terraform.
  3. Forgetting to commit .terraform.lock.hcl. The lock file pins exact provider versions and checksums. Without it, terraform init in CI may pull a newer provider version that breaks your configuration or changes resource behaviour silently. Always commit this file. Never .gitignore it.
  4. Running terraform apply without reviewing plan. Always run terraform plan -out=tfplan first and review the output — especially lines with # forces replacement. Replacement means destroy + create, which causes downtime for live resources like RDS instances, Kafka clusters, or Databricks workspaces. Use lifecycle { create_before_destroy = true } where replacement is unavoidable.
  5. Shared state for multiple environments in one root module. Running terraform apply for dev and prod from the same directory (using terraform workspace) can lead to accidental prod applies. Use separate root modules (or directories with separate backends) for each environment rather than workspaces when isolation is critical. Terraform Cloud/Enterprise workspaces have stronger guardrails; OSS workspaces are just state-file namespaces.
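
A minimal sketch of gotcha 1 (container names are illustrative):

# Fragile: removing "silver" shifts "gold" from index 2 to 1 and forces a destroy/recreate
resource "azurerm_storage_container" "by_index" {
  count                = length(["bronze", "silver", "gold"])
  name                 = ["bronze", "silver", "gold"][count.index]
  storage_account_name = azurerm_storage_account.datalake.name
}

# Stable: each instance is addressed by key, e.g. azurerm_storage_container.by_key["gold"]
resource "azurerm_storage_container" "by_key" {
  for_each             = toset(["bronze", "silver", "gold"])
  name                 = each.key
  storage_account_name = azurerm_storage_account.datalake.name
}
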
↑ Back to top

Exercises

  1. Module authoring: Write a reusable module that provisions an Azure Storage Account (ADLS Gen2 enabled) with three containers (bronze/silver/gold), configurable retention lifecycle rules, and outputs the storage account name and primary endpoint. Call the module from a root config with separate dev.tfvars and prod.tfvars files. Validate with terraform validate and terraform plan.
  2. Drift detection simulation: Apply a Terraform config that creates an Azure resource group. Then manually add a tag to it in the Azure portal. Run terraform plan and observe the detected drift. Restore the state using terraform apply. Discuss when ignore_changes would be appropriate vs resolving the drift.
  3. Import existing resources: Create an Azure storage container manually (simulate "already existing infra"). Write a Terraform config with an import {} block (Terraform 1.5+) for it. Run terraform plan to confirm no changes are needed after import. Compare this approach with the older terraform import CLI command and explain the workflow difference.
↑ Back to top

Quiz

Q1: Why should you use for_each instead of count when creating multiple resource instances from a list?

With count, resources are addressed by index (e.g., azurerm_storage_account.buckets[0]). If you remove an item from the middle of the list, all resources with a higher index are destroyed and recreated because their addresses shift. With for_each, resources are addressed by a unique key (e.g., azurerm_storage_account.buckets["bronze"]). Removing one key only affects that specific resource, leaving all others untouched. Use toset(var.names) or a map for the for_each value.

Q2: What happens to sensitive values passed as Terraform variables (sensitive = true)?

The sensitive = true flag suppresses the value in CLI output (terraform plan / apply shows (sensitive value)). However, the plaintext value is still written to the state file (terraform.tfstate). The flag does NOT encrypt the state. You must separately encrypt the remote state backend (S3 SSE, Azure Blob at-rest encryption) and restrict access via IAM. For high-sensitivity secrets, prefer fetching them from Key Vault at runtime (in the application) rather than piping them through Terraform resources.
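
A sketch of the recommended split: Terraform only marks the variable sensitive and grants the application's identity read access, while the application fetches the secret itself at runtime (the Key Vault resource and principal variable here are hypothetical):

variable "db_password" {
  type      = string
  sensitive = true   # hidden in CLI output, but still written in plaintext to terraform.tfstate
}

# Grant the app's identity read access; the application reads the secret at runtime,
# so the secret value never passes through Terraform or its state.
resource "azurerm_role_assignment" "app_kv_reader" {
  scope                = azurerm_key_vault.platform.id   # hypothetical Key Vault resource
  role_definition_name = "Key Vault Secrets User"
  principal_id         = var.app_principal_id            # hypothetical input variable
}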

Q3: What is terraform plan exit code 2 and how is it used in CI?

terraform plan -detailed-exitcode returns exit code 0 (success, no changes), 1 (error), or 2 (success, changes are pending). In CI pipelines, exit code 2 is used to detect configuration drift: run terraform plan -detailed-exitcode on a schedule; if it exits with 2, alert the team (Slack, PagerDuty) that real infrastructure has diverged from the Terraform config. This is the foundation of drift detection workflows in mature IaC practices.

Q4: Explain the difference between terraform destroy and using prevent_destroy. When do you use each?

terraform destroy is a CLI command that destroys all resources managed by the current root module. It's useful for tearing down ephemeral environments (feature branches, staging). lifecycle { prevent_destroy = true } is a safeguard in HCL that causes terraform plan to error with "Instance cannot be destroyed" if Terraform's plan includes destroying that resource — protecting production databases, Databricks workspaces, or S3 buckets from accidental deletion via apply or destroy. Use prevent_destroy on any resource where accidental destruction would cause data loss or outages.

Q5: What is state drift and what causes it? How do you detect and remediate it?

State drift occurs when real cloud infrastructure diverges from what Terraform's state file records. Common causes: manual changes in the cloud console, other automation tools (runbooks, scripts) modifying resources, cloud provider changes on managed attributes (auto-assigned IPs, tags added by policies). Detection: terraform plan compares state vs live API call results and reports differences. Remediation options: (1) terraform apply to bring infra back to the Terraform-defined state; (2) terraform state rm + re-import if the resource should remain as-is; (3) Update HCL to match the drifted state if the manual change was intentional; (4) Add ignore_changes for attributes managed out-of-band (e.g., tags managed by a cloud governance policy).
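
A sketch of option (4), tolerating tags that a governance policy manages out-of-band (the attribute choice is illustrative):

resource "azurerm_resource_group" "main" {
  name     = "rg-data-platform-prod"
  location = "westeurope"

  lifecycle {
    # Azure Policy appends cost-centre tags after creation; don't report them as drift
    ignore_changes = [tags]
  }
}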

↑ Back to top

Further Reading

↑ Back to top