Terraform (HashiCorp, now BSL-licensed) is the dominant Infrastructure as Code (IaC) tool in data engineering. It manages the full lifecycle — create, update, destroy — of cloud resources using declarative HCL (HashiCorp Configuration Language). For a DE, Terraform is how you provision the infrastructure your pipelines run on: Databricks clusters, Azure Data Factory pipelines, Snowflake warehouses, S3 / ADLS Gen2 storage accounts, IAM roles, VPCs, and more. The key capabilities to master for interviews are state management, modules, remote backends, provider version pinning, and import/drift detection.
This guide targets Terraform 1.7+ (OpenTofu is largely compatible). Azure and Databricks examples are used throughout, reflecting common DE interview contexts, but all concepts apply equally to AWS/GCP.
A Terraform configuration is a directory of .tf files. Terraform merges them automatically. Three key block types form 90% of configs:
- `terraform {}` — backend, required providers, required version constraints.
- `resource "<type>" "<name>" {}` — declarative description of one cloud resource. Address: `resource_type.name`.
- `data "<type>" "<name>" {}` — read-only lookup of existing infrastructure not managed by this config.

Expressions reference other blocks via interpolation: `azurerm_resource_group.main.name` or `var.location`. Terraform builds an implicit dependency graph from these references and plans changes in topological order. Use `depends_on` only for hidden dependencies that can't be expressed through references.
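A minimal sketch of the three block types and an implicit dependency; the resource group and storage account names here are hypothetical:

```hcl
terraform {
  required_version = ">= 1.7"
  required_providers {
    azurerm = { source = "hashicorp/azurerm", version = "~> 3.90" }
  }
}

provider "azurerm" {
  features {}
}

# Read-only lookup of a resource group that already exists (not managed here)
data "azurerm_resource_group" "shared" {
  name = "rg-shared-network" # hypothetical pre-existing group
}

# Implicit dependency: referencing the data source tells Terraform to resolve
# the lookup before creating this storage account — no depends_on needed
resource "azurerm_storage_account" "example" {
  name                     = "stexampledev001" # hypothetical; must be globally unique
  resource_group_name      = data.azurerm_resource_group.shared.name
  location                 = data.azurerm_resource_group.shared.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}
```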
Providers are plugins that translate HCL resources into API calls. They are downloaded from the Terraform Registry during `terraform init` into `.terraform/providers/`. The `required_providers` block inside `terraform {}` pins the source and version:
- Version constraints: `~> 3.90` allows any 3.x release from 3.90 upwards (equivalent to `>= 3.90, < 4.0`); to permit only patch updates within 3.90.x, pin `~> 3.90.0`. Explicit ranges such as `>= 3.90, < 4.0` also work. Always pin in production — provider major versions often include breaking changes.
- `alias` — useful for multi-region or multi-account deployments.
- Lock file (`.terraform.lock.hcl`): records exact provider versions and hashes. Always commit this file to version control for reproducible builds.
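A short sketch of the constraint and `alias` syntax (the version pin and subscription ID are illustrative):

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      # "~> 3.90" = any 3.x from 3.90 up; "~> 3.90.0" would lock to 3.90.x patches
      version = "~> 3.90"
    }
  }
}

# Default provider configuration
provider "azurerm" {
  features {}
}

# Aliased second configuration, e.g. pointing at a separate subscription
provider "azurerm" {
  alias           = "sandbox"
  subscription_id = "11111111-1111-1111-1111-111111111111" # illustrative
  features {}
}

# Resources opt in to the aliased configuration explicitly
resource "azurerm_resource_group" "sandbox" {
  provider = azurerm.sandbox
  name     = "rg-sandbox"
  location = "westeurope"
}
```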
Terraform state (`terraform.tfstate`) maps your HCL configuration to real infrastructure. Without state, Terraform cannot detect drift, plan updates, or destroy resources correctly. Key concepts:

- Local state: `terraform.tfstate` in the working directory. Never use it in teams — not shared and has no locking.
- Remote backends (Azure Blob Storage, S3, GCS): store state centrally so the whole team works against one copy, with locking to prevent concurrent `apply` conflicts.
- Locking: during `plan` and `apply`, Terraform acquires a lock. With the Azure Blob Storage backend, locking uses blob leases. With S3, it requires a DynamoDB table.
- Secrets: even for `sensitive = true` variables (redacted in CLI output), the plain value appears in the state file. Encrypt state at rest and restrict access with IAM/RBAC.
- `terraform import`: bring existing cloud resources under Terraform management without recreating them. In Terraform 1.5+, use `import {}` blocks declaratively. Run `terraform plan` after importing to confirm no unintended changes.
- `terraform state mv`: rename or move resources in state (e.g., after refactoring a module). Prevents destroy+recreate during refactors. Since Terraform 1.1, the declarative `moved {}` block does the same job in plain HCL (see the sketch below).
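A minimal sketch of the declarative alternative to `terraform state mv`; both addresses are hypothetical:

```hcl
# After moving a cluster definition into a child module, record the rename so
# Terraform rewrites the state address instead of destroying and recreating.
moved {
  from = databricks_cluster.dev
  to   = module.clusters.databricks_cluster.env_clusters["dev"]
}
```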
A module is any directory of `.tf` files. The root module is the directory where you run `terraform apply`. Child modules are called via `module` blocks and can come from local paths, the Terraform Registry, or Git URLs:

- Git sources support pinning a tag or commit via `?ref=v1.3.0`. Always pin.
- Input variables (`variable {}`) define the module's interface. Output values (`output {}`) expose attributes for the caller. Never access child module internals directly.
- `for_each` on a module block creates multiple instances — e.g., one Databricks cluster per environment map entry (see the sketch after the variables table below).

| Construct | Purpose | Notes |
|---|---|---|
variable "x" {} | Parameterise configs (env-specific) | Set via -var, .tfvars, env vars (TF_VAR_x) |
locals {} | Intermediate expressions, avoid repetition | Not user-configurable; not exposed as inputs/outputs |
output "y" {} | Expose values to callers or operators | Shown after apply; readable via terraform output |
data "x" "y" {} | Read existing infra not managed here | Fetched during plan; use for VPCs, DNS zones, ACR images |
Variable precedence (highest to lowest): CLI `-var` / `-var-file` → `*.auto.tfvars` → `terraform.tfvars` → environment variables (`TF_VAR_*`) → `default` in the `variable {}` block.
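As referenced above, a sketch of `for_each` on a module block; the Git URL and the module's input names are hypothetical:

```hcl
variable "environments" {
  type = map(object({ node_type = string }))
  default = {
    dev  = { node_type = "Standard_DS3_v2" }
    prod = { node_type = "Standard_DS5_v2" }
  }
}

# One module instance per map entry: module.databricks_cluster["dev"], ["prod"]
module "databricks_cluster" {
  source   = "git::https://github.com/my-org/terraform-modules.git//databricks-cluster?ref=v1.3.0"
  for_each = var.environments

  # Hypothetical module inputs defined by the module's variable {} blocks
  environment = each.key
  node_type   = each.value.node_type
}
```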
Meta-arguments modify how Terraform manages resources:
- `count` / `for_each`: create multiple resource instances. Prefer `for_each` over `count` for maps/sets — removing an item in the middle of a `count` list causes all subsequent resources to be destroyed and recreated (index-based addressing).
- `depends_on`: explicit dependency for cases where references alone can't capture ordering (e.g., a resource that depends on a policy attachment but doesn't reference it).
- `lifecycle {}` (see the sketch after this list):
  - `create_before_destroy = true`: create the replacement before destroying the old resource. Essential for zero-downtime replacements.
  - `prevent_destroy = true`: block `terraform destroy` on critical resources (production databases). A safety net against accidents.
  - `ignore_changes = [tags]`: don't reconcile specific attributes that change out-of-band.
- `provisioner` (use sparingly): run scripts on resource creation/destruction. Fragile and breaks idempotency — prefer cloud-init, Ansible, or Docker images instead.
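A sketch of the `lifecycle` settings above, using a hypothetical production database (required server arguments trimmed for brevity):

```hcl
resource "azurerm_postgresql_flexible_server" "prod" {
  name                = "psql-dataplatform-prod" # hypothetical
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location

  lifecycle {
    prevent_destroy = true   # any plan that would destroy this server errors out
    ignore_changes  = [tags] # tags are stamped by a governance policy, not Terraform
  }
}
```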
A data platform team uses Terraform to provision Azure Databricks workspaces per environment (dev/staging/prod) from a shared module. Each workspace gets: a managed resource group, a VNet with workspace subnet injection, a Premium SKU, a service principal for CI/CD, a single-node dev cluster (auto-terminating in 30 min), and an all-purpose prod cluster. Environment-specific `.tfvars` control node types and cluster sizes. The same Terraform config is applied in a GitHub Actions workflow on PR merge to the main branch — no manual portal clicks.

An ADLS Gen2 storage account with hierarchical namespace is provisioned along with lifecycle management rules (move blobs to Cool after 30 days, Archive after 90 days, delete after 365 days). Storage containers for bronze/silver/gold zones are created. Azure AD groups are imported via data blocks and assigned appropriate RBAC roles (Storage Blob Data Contributor for ETL service principals, Storage Blob Data Reader for BI tools). All of this is in a reusable module with input variables for retention periods and principal IDs.
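A sketch of those tiering rules, assuming the `azurerm_storage_management_policy` resource applied to the `azurerm_storage_account.datalake` account defined in the worked example later in this guide:

```hcl
resource "azurerm_storage_management_policy" "datalake" {
  storage_account_id = azurerm_storage_account.datalake.id

  rule {
    name    = "tier-and-expire"
    enabled = true
    filters {
      blob_types = ["blockBlob"]
    }
    actions {
      base_blob {
        # Cool after 30 days, Archive after 90, delete after 365
        tier_to_cool_after_days_since_modification_greater_than    = 30
        tier_to_archive_after_days_since_modification_greater_than = 90
        delete_after_days_since_modification_greater_than          = 365
      }
    }
  }
}
```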
A Snowflake provider manages warehouses, databases, schemas, roles, and grants across dev/prod environments. The same Terraform module provisions an X-Small warehouse for dev and an X-Large (auto-suspend 60s, auto-resume) for prod. Role hierarchies and grants are codified — no ad-hoc GRANT statements in runbooks. Drift detection (terraform plan in CI) alerts the team when someone manually modified a warehouse size in the Snowflake console.
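A minimal sketch, assuming the `Snowflake-Labs/snowflake` provider (argument names reflect recent provider versions and may drift; check the registry docs):

```hcl
terraform {
  required_providers {
    snowflake = {
      source  = "Snowflake-Labs/snowflake"
      version = "~> 0.90" # illustrative pin
    }
  }
}

resource "snowflake_warehouse" "etl" {
  name           = var.environment == "prod" ? "WH_ETL_PROD" : "WH_ETL_DEV"
  warehouse_size = var.environment == "prod" ? "XLARGE" : "XSMALL"
  auto_suspend   = 60   # seconds of inactivity before suspending
  auto_resume    = true # boolean here; some provider releases expect a string
}
```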
Each feature branch in a data product repo provisions a short-lived Postgres + Airflow environment using Terraform + GitHub Actions. On PR close, a terraform destroy job cleans up all resources. The remote state is stored in an S3 bucket keyed by branch name. prevent_destroy is intentionally omitted on ephemeral environments. This pattern dramatically reduces cloud costs from long-lived development environments.
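One common way to key state by branch is partial backend configuration; the bucket name and key pattern here are hypothetical:

```hcl
# backend.tf — the key is supplied at init time, once per branch:
#   terraform init -backend-config="key=branches/${BRANCH_NAME}.tfstate"
terraform {
  backend "s3" {
    bucket = "my-company-tfstate" # hypothetical bucket
    region = "eu-west-1"
    # "key" is deliberately omitted and injected via -backend-config
  }
}
```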
```hcl
# versions.tf
terraform {
  required_version = "~> 1.7"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.90"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.40"
    }
  }

  # Remote backend — state stored in Azure Blob Storage
  backend "azurerm" {
    resource_group_name  = "tf-state-rg"
    storage_account_name = "mycompanytfstate"
    container_name       = "tfstate"
    key                  = "data-platform/prod.tfstate"
  }
}

provider "azurerm" {
  features {}
  # Uses ARM_CLIENT_ID / ARM_CLIENT_SECRET / ARM_TENANT_ID env vars,
  # or Managed Identity / OIDC when running in Azure DevOps / GitHub Actions
}

provider "databricks" {
  host = azurerm_databricks_workspace.main.workspace_url
  # ARM_CLIENT_ID / ARM_CLIENT_SECRET picked up from env vars for Azure auth
}
```
```hcl
# main.tf
resource "azurerm_resource_group" "main" {
  name     = "rg-${var.project}-${var.environment}-${local.location_short}"
  location = var.location
  tags     = local.common_tags
}

resource "azurerm_storage_account" "datalake" {
  name                     = "sa${var.project}${var.environment}" # 3-24 chars, lowercase
  resource_group_name      = azurerm_resource_group.main.name
  location                 = azurerm_resource_group.main.location
  account_tier             = "Standard"
  account_replication_type = "GRS"
  is_hns_enabled           = true # Hierarchical namespace = ADLS Gen2
  min_tls_version          = "TLS1_2"

  blob_properties {
    delete_retention_policy { days = 30 }
  }

  tags = local.common_tags
}

resource "azurerm_storage_data_lake_gen2_filesystem" "bronze" {
  name               = "bronze"
  storage_account_id = azurerm_storage_account.datalake.id
}

resource "azurerm_databricks_workspace" "main" {
  name                = "dbw-${var.project}-${var.environment}"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "premium"

  lifecycle {
    prevent_destroy = true # Never destroy the prod workspace accidentally
  }

  tags = local.common_tags
}
```
```hcl
# locals.tf
locals {
  common_tags = {
    project     = var.project
    environment = var.environment
    managed_by  = "terraform"
  }

  # Index a map literal by var.location to derive the short region code
  location_short = {
    "eastus"      = "eus"
    "westeurope"  = "weu"
    "northeurope" = "neu"
  }[var.location]
}
```
```hcl
# clusters.tf
variable "cluster_configs" {
  type = map(object({
    node_type     = string
    num_workers   = number
    autoterminate = number
  }))
  default = {
    dev = {
      node_type     = "Standard_DS3_v2"
      num_workers   = 1
      autoterminate = 30
    }
    prod = {
      node_type     = "Standard_DS5_v2"
      num_workers   = 8
      autoterminate = 120
    }
  }
}

resource "databricks_cluster" "env_clusters" {
  for_each = var.cluster_configs

  cluster_name            = "apc-${each.key}"
  spark_version           = "14.3.x-scala2.12"
  node_type_id            = each.value.node_type
  num_workers             = each.value.num_workers
  autotermination_minutes = each.value.autoterminate

  spark_conf = {
    "spark.databricks.delta.preview.enabled" = "true"
  }

  library {
    pypi { package = "dbt-databricks==1.8.0" }
  }
}

output "cluster_ids" {
  value = { for k, v in databricks_cluster.env_clusters : k => v.id }
}
```
```hcl
# Import an existing storage container that was created manually
# (Terraform 1.5+ declarative import block)
import {
  to = azurerm_storage_data_lake_gen2_filesystem.silver
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-myproject-prod-eus/providers/Microsoft.Storage/storageAccounts/samyprojectprod/blobServices/default/containers/silver"
}

resource "azurerm_storage_data_lake_gen2_filesystem" "silver" {
  name               = "silver"
  storage_account_id = azurerm_storage_account.datalake.id
}
```
---
```yaml
# .github/workflows/tf-drift.yml (simplified)
# Runs on a schedule to detect configuration drift
name: Terraform Drift Detection
on:
  schedule:
    - cron: '0 6 * * 1-5' # 6am UTC weekdays

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: 1.7.5 }
      - run: terraform init
      - run: terraform plan -detailed-exitcode
        id: plan
        continue-on-error: true # exit code 2 must not fail the job
      - if: steps.plan.outputs.exitcode == '2' # 2 = changes detected
        uses: slackapi/slack-github-action@v1
        with:
          payload: '{"text":"⚠️ Terraform drift detected in prod!"}'
```
| Feature | Terraform / OpenTofu | Pulumi | Bicep / ARM | AWS CDK | Ansible |
|---|---|---|---|---|---|
| Language | HCL (declarative) | Python / TypeScript / Go (imperative) | Bicep (Azure DSL) | TypeScript / Python | YAML + Jinja |
| Multi-cloud | ✅ 3000+ providers | ✅ Same providers | ❌ Azure only | ❌ AWS only | ✅ Limited |
| State management | Built-in .tfstate | Pulumi Cloud / S3 | ARM API (no local state) | CloudFormation stacks | No state |
| Learning curve | Low (HCL simple) | Medium (requires coding) | Low (Azure-native) | Medium | Low |
| IDE / test tooling | Good (Terraform Test, TFLint) | Excellent (type safety) | Good (VS Code extension) | Excellent (Jest/Pytest) | Molecule |
| Best for | Multi-cloud IaC, DE platform teams | Complex logic, developer-friendly IaC | Azure-only orgs with Microsoft support | AWS-native CDK teams | Config management, not infra provisioning |
Rule of thumb: Use Terraform (or OpenTofu) unless your team is Azure-only (Bicep is simpler) or has strong Python/TypeScript skills and needs loops/conditionals that HCL makes awkward (Pulumi). Avoid Ansible for idempotent cloud resource management — it lacks reliable state tracking.
Common pitfalls:

- Using `count` instead of `for_each` for maps. If you manage a list of resources with `count = length(var.buckets)` and remove the item at index 1, Terraform destroys and recreates all resources from index 1 onward because their numeric addresses change. Always use `for_each` with string keys for resources that may have items removed from the middle of the collection.
- Trusting `sensitive = true` to protect secrets. Any output value is stored in plaintext in `terraform.tfstate`. Mark sensitive variables with `sensitive = true` (suppresses CLI display), but remember the value is still in state. Prefer retrieving secrets from Key Vault / Secrets Manager at runtime rather than passing them through Terraform.
- Not committing `.terraform.lock.hcl`. The lock file pins exact provider versions and checksums. Without it, `terraform init` in CI may pull a newer provider version that breaks your configuration or changes resource behaviour silently. Always commit this file. Never `.gitignore` it.
- Running `terraform apply` without reviewing the plan. Always run `terraform plan -out=tfplan` first and review the output — especially lines with `# forces replacement`. Replacement means destroy + create, which causes downtime for live resources like RDS instances, Kafka clusters, or Databricks workspaces. Use `lifecycle { create_before_destroy = true }` where replacement is unavoidable.
- Sharing one directory across environments. Running `terraform apply` for dev and prod from the same directory (using `terraform workspace`) can lead to accidental prod applies. Use separate root modules (or directories with separate backends) for each environment rather than workspaces for isolation-critical separation. Terraform Cloud/Enterprise workspaces have stronger guardrails; OSS workspaces are just state-file namespaces.

Hands-on exercises:

- Parameterise one configuration for two environments with `dev.tfvars` and `prod.tfvars` files. Validate with `terraform validate` and `terraform plan`.
- Manually change a managed resource in the cloud console, run `terraform plan` and observe the detected drift. Reconcile it with `terraform apply`. Discuss when `ignore_changes` would be appropriate vs resolving the drift.
- Create a resource by hand in the portal, then write an `import {}` block (Terraform 1.5+) for it. Run `terraform plan` to confirm no changes are needed after import. Compare this approach with the older `terraform import` CLI command and explain the workflow difference.

In more detail on the first pitfall: with `count`, resources are addressed by index (e.g., `azurerm_storage_account.buckets[0]`). If you remove an item from the middle of the list, all resources with a higher index are destroyed and recreated because their addresses shift. With `for_each`, resources are addressed by a unique key (e.g., `azurerm_storage_account.buckets["bronze"]`). Removing one key only affects that specific resource, leaving all others untouched. Use `toset(var.names)` or a map for the `for_each` value.
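A sketch of the addressing difference, reusing the data-lake account from the worked example (zone names are illustrative):

```hcl
variable "zones" {
  type    = list(string)
  default = ["bronze", "silver", "gold"]
}

# count: addressed by index. Removing "silver" shifts "gold" from [2] to [1],
# so Terraform destroys and recreates it.
resource "azurerm_storage_data_lake_gen2_filesystem" "by_index" {
  count              = length(var.zones)
  name               = var.zones[count.index]
  storage_account_id = azurerm_storage_account.datalake.id
}

# for_each: addressed by string key. Removing "silver" affects only ["silver"].
resource "azurerm_storage_data_lake_gen2_filesystem" "by_key" {
  for_each           = toset(var.zones)
  name               = each.key
  storage_account_id = azurerm_storage_account.datalake.id
}
```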
The sensitive = true flag suppresses the value in CLI output (terraform plan / apply shows (sensitive value)). However, the plaintext value is still written to the state file (terraform.tfstate). The flag does NOT encrypt the state. You must separately encrypt the remote state backend (S3 SSE, Azure Blob at-rest encryption) and restrict access via IAM. For high-sensitivity secrets, prefer fetching them from Key Vault at runtime (in the application) rather than piping them through Terraform resources.
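A short sketch of the flag's mechanics (the variable, output, and connection string are hypothetical):

```hcl
variable "db_password" {
  type      = string
  sensitive = true # shown as (sensitive value) in plan/apply output
}

# Any output derived from a sensitive value must itself be marked sensitive,
# or terraform plan fails with an error. The value is STILL plaintext in state.
output "jdbc_url" {
  value     = "jdbc:postgresql://db.example.com:5432/analytics?password=${var.db_password}"
  sensitive = true
}
```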
terraform plan -detailed-exitcode returns exit code 0 (success, no changes), 1 (error), or 2 (success, changes are pending). In CI pipelines, exit code 2 is used to detect configuration drift: run terraform plan -detailed-exitcode on a schedule; if it exits with 2, alert the team (Slack, PagerDuty) that real infrastructure has diverged from the Terraform config. This is the foundation of drift detection workflows in mature IaC practices.
terraform destroy is a CLI command that destroys all resources managed by the current root module. It's useful for tearing down ephemeral environments (feature branches, staging). lifecycle { prevent_destroy = true } is a safeguard in HCL that causes terraform plan to error with "Instance cannot be destroyed" if Terraform's plan includes destroying that resource — protecting production databases, Databricks workspaces, or S3 buckets from accidental deletion via apply or destroy. Use prevent_destroy on any resource where accidental destruction would cause data loss or outages.
State drift occurs when real cloud infrastructure diverges from what Terraform's state file records. Common causes: manual changes in the cloud console, other automation tools (runbooks, scripts) modifying resources, cloud provider changes on managed attributes (auto-assigned IPs, tags added by policies). Detection: terraform plan compares state vs live API call results and reports differences. Remediation options: (1) terraform apply to bring infra back to the Terraform-defined state; (2) terraform state rm + re-import if the resource should remain as-is; (3) Update HCL to match the drifted state if the manual change was intentional; (4) Add ignore_changes for attributes managed out-of-band (e.g., tags managed by a cloud governance policy).
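For option (2), Terraform 1.7+ also offers a declarative alternative to `terraform state rm`; the address here is hypothetical:

```hcl
# Forget the resource from state WITHOUT destroying the real infrastructure;
# it can then be re-imported with an import block, or left unmanaged.
removed {
  from = azurerm_storage_data_lake_gen2_filesystem.silver
  lifecycle {
    destroy = false
  }
}
```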