Azure Data Factory (ADF)

Summary

Azure Data Factory is Microsoft's cloud-native ETL/ELT orchestration and data integration service. It provides a visual, low-code interface for building pipelines that move and transform data across 100+ connectors — from on-premises SQL Server to Azure Blob Storage, Snowflake, REST APIs, and more. ADF itself is not a general-purpose compute engine: Copy Activities handle data movement, while Mapping Data Flows delegate transformation to a managed Spark cluster under the hood. For data engineering interviews, understanding ADF's pipeline model, integration runtimes, parameterization patterns, and CI/CD workflows is essential. Note: ADF is distinct from Data Factory in Microsoft Fabric, a separate offering within the Fabric ecosystem.

Table of Contents

Core Concepts
Industry Use Cases
Code Examples
Comparison / When to Use
Gotchas & Anti-patterns
Exercises
Quiz
Further Reading

Core Concepts

1. Pipelines, Activities & Triggers

An ADF pipeline is a logical grouping of activities that together perform a data movement or transformation task. Activities fall into three categories: data movement (Copy Activity), data transformation (Mapping Data Flow, HDInsight Spark, Databricks Notebook, Stored Procedure), and control flow (If Condition, ForEach, Until, Set Variable, Web Activity). Pipelines are triggered by schedule triggers (cron-like), tumbling window triggers (non-overlapping, retryable intervals — ideal for incremental loads), event triggers (Blob storage events or custom events via Event Grid), and manual triggers. Tumbling window triggers maintain their own state and support dependency chaining — e.g., "today's window depends on yesterday's successful run."
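
A minimal sketch of a tumbling window trigger definition (trigger, pipeline, and parameter names are illustrative, and exact property availability can vary by API version) — note the retry policy, the self-dependency on the previous window, and the window start/end passed as pipeline parameters:

// Tumbling window trigger — daily, retryable, dependent on the previous window (illustrative names)
{
  "name": "DailyIncrementalWindow",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Hour",
      "interval": 24,
      "startTime": "2024-01-01T00:00:00Z",
      "delay": "00:15:00",
      "maxConcurrency": 1,
      "retryPolicy": { "count": 3, "intervalInSeconds": 300 },
      "dependsOn": [
        { "type": "SelfDependencyTumblingWindowTriggerReference", "offset": "-24:00:00", "size": "24:00:00" }
      ]
    },
    "pipeline": {
      "pipelineReference": { "referenceName": "IncrementalLoadPipeline", "type": "PipelineReference" },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}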

2. Integration Runtimes (IR)

The Integration Runtime is ADF's compute backbone. Three types: (1) Azure IR — fully managed, auto-scaling, used for cloud-to-cloud data movement and data flows. Supports managed virtual network for private endpoints (no public internet exposure). (2) Self-Hosted IR — installed on an on-premises machine or VM to access data behind firewalls (SQL Server, file shares, Oracle). Supports HA via multi-node clusters. (3) Azure-SSIS IR — managed SSIS runtime for lift-and-shift of legacy SSIS packages (increasingly rare in new designs). For interviews, know that IR choice directly impacts security posture, networking (VNet injection, private endpoints), and performance (IR region should match data source region to avoid cross-region bandwidth costs).
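
As a sketch of how the IR choice surfaces in configuration, a linked service points at a Self-Hosted IR through the connectVia property (the IR, linked service, and secret names below are illustrative):

// On-premises SQL Server linked service routed through a Self-Hosted IR (illustrative names)
{
  "name": "LS_OnPremSqlServer",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Data Source=onprem-sql01;Initial Catalog=Sales;Integrated Security=False;User ID=adf_reader",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "LS_KeyVault", "type": "LinkedServiceReference" },
        "secretName": "onprem-sql-password"
      }
    },
    "connectVia": { "referenceName": "SelfHostedIR-OnPrem", "type": "IntegrationRuntimeReference" }
  }
}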

3. Mapping Data Flows

Mapping Data Flows are ADF's visual, code-free transformation layer. Under the hood, they compile to Spark code running on a managed Spark cluster (allocated via "data flow debug" sessions in development or auto-provisioned at runtime). Key transformations: Source, Sink, Filter, Derived Column, Aggregate, Join, Lookup, Conditional Split, Flatten (for nested JSON/XML), Pivot/Unpivot, Window, Rank, Exists, Union, Alter Row (for upsert/delete semantics to sinks). Data flows support schema drift — they can handle columns not present at design time using pattern-based column mapping. Performance is controlled by core count (8, 16, 32, …, 256 cores) and compute type (General Purpose, Compute Optimized, Memory Optimized). Flows can be parameterized with pipeline-level parameters for reuse.
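
The core count and compute type mentioned above are set on the Execute Data Flow activity that runs the flow; a minimal sketch with illustrative names (the cluster sizing here is where most data flow cost is determined):

// Execute Data Flow activity — compute sizing controls cost and parallelism (illustrative names)
{
  "name": "TransformSales",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataFlow": { "referenceName": "df_transform_sales", "type": "DataFlowReference" },
    "compute": { "coreCount": 16, "computeType": "General" },
    "traceLevel": "Fine"
  }
}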

4. Linked Services & Datasets

Linked Services define the connection to an external data store or compute (connection string, auth method, IR). Think of them as JDBC connection objects. Datasets represent the structure of data within a linked service — table name, file path, schema. In modern ADF designs, datasets are increasingly optional: Copy Activities can use inline datasets to reduce the number of ADF objects. Auth options include: managed identity (preferred — no credential rotation), service principal, key vault references (for secrets), and SAS tokens. Always prefer managed identity + Key Vault integration for production pipelines.
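
A sketch of the preferred pattern — an ADLS Gen2 linked service with no stored credential at all, relying on the factory's managed identity (the account name is illustrative, and the identity still needs an RBAC role such as Storage Blob Data Contributor on the account):

// ADLS Gen2 linked service using managed identity — no secret to store or rotate (illustrative account)
{
  "name": "LS_ADLS_Lake",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://datalakeprod.dfs.core.windows.net"
    }
  }
}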

5. Parameterization & Expressions

ADF has a rich expression language for dynamic content. Pipeline parameters, system variables (@pipeline().RunId, @trigger().scheduledTime), and functions (@formatDateTime, @concat, @if, @coalesce) enable template-style pipelines. Common pattern: a single "generic" copy pipeline parameterized by source table, sink path, and watermark column — invoked via ForEach over a metadata table. Global parameters (factory-level) are useful for environment-specific values (e.g., storage account name) that differ between dev/staging/prod.
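
A sketch of a generic, parameterized dataset that a single copy pipeline can reuse for every table (names and container are illustrative); the folder and file are resolved at runtime from whatever the pipeline passes in per iteration:

// Generic Parquet dataset — folder and file name supplied per-iteration by the pipeline (illustrative names)
{
  "name": "DS_ParquetGeneric",
  "properties": {
    "type": "Parquet",
    "linkedServiceName": { "referenceName": "LS_ADLS_Lake", "type": "LinkedServiceReference" },
    "parameters": {
      "folderPath": { "type": "string" },
      "fileName": { "type": "string" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "raw",
        "folderPath": { "value": "@dataset().folderPath", "type": "Expression" },
        "fileName": { "value": "@dataset().fileName", "type": "Expression" }
      }
    }
  }
}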

6. CI/CD with Git Integration

ADF natively integrates with Azure DevOps Repos or GitHub for source control. The adf_publish branch contains auto-generated ARM templates (or Bicep) for deployment. The recommended CI/CD flow: develop in a feature branch → merge PR to main (collaboration branch) → publish generates ARM templates → a release pipeline deploys to staging/prod factories using pre- and post-deployment scripts to handle linked service parameter overrides, global parameter differences, and trigger stop/start. Use the ADF Utilities npm package (@microsoft/azure-data-factory-utilities) in CI to validate and export ARM templates without the UI's "Publish" button.
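
A sketch of the package.json used to run the utilities package in CI; the invocation follows the package's documented usage at the time of writing, so verify against the current docs, and the factory resource ID below is a placeholder:

// package.json for validating/exporting ARM templates in CI without the UI "Publish" button
{
  "scripts": {
    "build": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
  },
  "dependencies": {
    "@microsoft/azure-data-factory-utilities": "^1.0.0"
  }
}
// Typical invocations (root folder and factory resource ID are placeholders):
//   npm run build validate ./ /subscriptions/{sub-id}/resourceGroups/rg-data/providers/Microsoft.DataFactory/factories/adf-dev
//   npm run build export ./ /subscriptions/{sub-id}/resourceGroups/rg-data/providers/Microsoft.DataFactory/factories/adf-dev "ArmTemplate"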

7. Monitoring, Alerting & Cost Management

ADF integrates with Azure Monitor for operational visibility. Key monitoring strategies: (1) Diagnostic settings — route pipeline run logs, activity run logs, and trigger run logs to Log Analytics, Storage Account, or Event Hub. Use KQL queries in Log Analytics for custom dashboards (e.g., average pipeline duration trends, failure rates by pipeline). (2) Azure Monitor Alerts — create metric-based alerts on PipelineFailedRuns, ActivityFailedRuns, or TriggerFailedRuns. For proactive monitoring, alert on PipelineSucceededRuns absence (detect silent failures where a trigger didn't fire). (3) Cost management — ADF costs come from three components: activity runs (per-execution flat fee), data movement DIU-hours (Copy Activity throughput), and data flow cluster hours (managed Spark compute). Data Flows dominate costs — optimize by right-sizing core count, using TTL (time-to-live) for debug clusters to avoid cold starts, and batching small flows into fewer cluster sessions. (4) Pipeline-level tags — tag pipelines by team or project for cost allocation via Azure Cost Management.
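
A sketch of the diagnostic setting that routes the three run-log categories to Log Analytics (workspace ID and names are placeholders); routing to resource-specific ("Dedicated") tables is what makes the ADFPipelineRun/ADFActivityRun KQL queries in the Code Examples section work:

// Diagnostic setting — ship pipeline/activity/trigger run logs to Log Analytics (placeholder IDs)
{
  "type": "Microsoft.Insights/diagnosticSettings",
  "name": "adf-to-loganalytics",
  "properties": {
    "workspaceId": "/subscriptions/{sub-id}/resourceGroups/rg-monitor/providers/Microsoft.OperationalInsights/workspaces/law-data-prod",
    "logAnalyticsDestinationType": "Dedicated",
    "logs": [
      { "category": "PipelineRuns", "enabled": true },
      { "category": "ActivityRuns", "enabled": true },
      { "category": "TriggerRuns", "enabled": true }
    ],
    "metrics": [
      { "category": "AllMetrics", "enabled": true }
    ]
  }
}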

8. Managed VNet & Private Endpoints

For enterprises with strict network isolation requirements, ADF supports a Managed Virtual Network (VNet) associated with the Azure Integration Runtime. When enabled, all outbound connections from data flows and copy activities go through managed private endpoints — private links to Azure services (Storage, SQL, Synapse, Key Vault) that never traverse the public internet. Setup: (1) enable Managed VNet on the Azure IR, (2) create managed private endpoints for each linked service, (3) the target resource owner approves the private endpoint connection. Traffic flow: ADF → Managed VNet → Azure Private Link → Target Resource. Limitations: managed VNets add ~2-3 minutes to activity startup (network provisioning), and not all connectors support private endpoints (on-premises sources still require Self-Hosted IR). This is the recommended pattern for HIPAA, PCI-DSS, and SOC 2 compliant pipelines.
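
A sketch of a managed private endpoint defined inside the factory's managed VNet (resource IDs are placeholders); the target storage account owner still has to approve the connection before traffic flows:

// Managed private endpoint to an ADLS Gen2 account (groupId "dfs" targets the Data Lake endpoint)
{
  "name": "mpe-datalakeprod-dfs",
  "properties": {
    "privateLinkResourceId": "/subscriptions/{sub-id}/resourceGroups/rg-data/providers/Microsoft.Storage/storageAccounts/datalakeprod",
    "groupId": "dfs"
  }
}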

↑ Back to top

Industry Use Cases

1. Enterprise Data Warehouse Incremental Load

A retail company loads 50+ tables from on-premises SQL Server to Azure Synapse Analytics nightly. ADF uses a Self-Hosted IR to connect through a corporate firewall. A metadata-driven pipeline reads a control table listing each source table, its watermark column, and last-loaded value. A ForEach activity iterates over this list, executing a parameterized Copy Activity for each table (incremental via WHERE modified_date > @{item().watermark}). After copy, a Stored Procedure activity updates the watermark. Tumbling window triggers ensure each day's load is retryable independently.
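
The watermark update step from this use case might look like the sketch below — a Stored Procedure activity inside the ForEach that runs only after the copy succeeds (procedure, linked service, and parameter names are illustrative):

// Stored Procedure activity — advance the watermark only after a successful copy (illustrative names)
{
  "name": "UpdateWatermark",
  "type": "SqlServerStoredProcedure",
  "dependsOn": [{ "activity": "CopyTable", "dependencyConditions": ["Succeeded"] }],
  "linkedServiceName": { "referenceName": "LS_ControlDB", "type": "LinkedServiceReference" },
  "typeProperties": {
    "storedProcedureName": "dbo.usp_update_watermark",
    "storedProcedureParameters": {
      "table_name": { "value": "@{item().table_name}", "type": "String" },
      "new_watermark": { "value": "@{pipeline().parameters.windowEnd}", "type": "String" }
    }
  }
}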

2. Multi-Source Data Lake Ingestion

A healthcare organization ingests data from REST APIs (HL7 FHIR endpoints), SFTP (lab results CSV), and Azure SQL (patient demographics) into ADLS Gen2. ADF orchestrates: Web Activity authenticates to the FHIR API and paginates results, Copy Activity with SFTP connector pulls CSV files, and another Copy Activity extracts from Azure SQL. All land in a raw/bronze zone in ADLS, partitioned by ingestion date. Event triggers on the ADLS container kick off downstream Databricks notebooks for transformation.
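
The FHIR call could be a Web Activity that authenticates with the factory's managed identity instead of a stored secret — a sketch with a placeholder FHIR endpoint and parameter name:

// Web Activity — query a FHIR endpoint using managed identity auth (placeholder URL)
{
  "name": "GetUpdatedPatients",
  "type": "WebActivity",
  "typeProperties": {
    "url": "https://myfhirservice.azurehealthcareapis.com/Patient?_lastUpdated=gt@{pipeline().parameters.windowStart}",
    "method": "GET",
    "authentication": { "type": "MSI", "resource": "https://azurehealthcareapis.com" }
  }
}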

3. SaaS Data Integration with REST APIs

A marketing company pulls campaign data from Salesforce, HubSpot, and Google Analytics REST APIs into a central Azure SQL database. ADF's REST connector with pagination support handles each API's unique pagination pattern (offset-based, cursor-based, next-link). Pipelines run on a 4-hour schedule, with each run fetching only records modified since the last run (using @trigger().outputs.windowStartTime from a tumbling window trigger). Flattening nested JSON responses uses Mapping Data Flows with Flatten transformations.
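
A sketch of a REST-source Copy activity using pagination rules; the JSONPath to the next-page link is API-specific and shown here as a placeholder:

// Copy activity with a REST source — next-link pagination (placeholder JSONPath and names)
{
  "name": "CopyCampaignData",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "RestSource",
      "requestMethod": "GET",
      "httpRequestTimeout": "00:02:00",
      "paginationRules": { "AbsoluteUrl": "$.paging.next.link" }
    },
    "sink": {
      "type": "JsonSink",
      "storeSettings": { "type": "AzureBlobFSWriteSettings" }
    }
  }
}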

4. Event-Driven Data Pipeline with ADLS Triggers

A manufacturing company drops IoT sensor data as CSV files onto ADLS Gen2 every 5 minutes from edge devices. ADF uses a storage event trigger (via Event Grid) to detect new files in raw/iot-sensors/ and immediately launches a pipeline. The pipeline: (1) Copy Activity reads the new file, (2) Mapping Data Flow validates sensor readings (drop readings with temperature outside -40°C to 200°C range), (3) conditional split routes anomalous readings to a quarantine container and valid readings to a silver Delta table. An Azure Function activity sends real-time alerts to the plant operations dashboard when anomaly rates exceed 5%. This pattern achieves near-real-time ingestion (2-3 minute latency) without maintaining a dedicated streaming cluster.
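
The alerting step could be an Azure Function activity like the sketch below (the function app linked service, function name, and body shape are illustrative):

// Azure Function activity — push an anomaly alert to the operations dashboard (illustrative names)
{
  "name": "SendAnomalyAlert",
  "type": "AzureFunctionActivity",
  "linkedServiceName": { "referenceName": "LS_OpsFunctionApp", "type": "LinkedServiceReference" },
  "typeProperties": {
    "functionName": "NotifyPlantDashboard",
    "method": "POST",
    "body": "{ \"pipelineRunId\": \"@{pipeline().RunId}\", \"source\": \"iot-sensors\" }"
  }
}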

5. Cross-Cloud Data Migration

An enterprise migrating from AWS to Azure uses ADF as the data migration backbone. ADF's S3 connector (via Azure IR) copies 200+ tables from AWS RDS (via S3 export) and raw Parquet files from S3 buckets into ADLS Gen2. A metadata-driven pipeline reads a migration manifest CSV listing source paths, target paths, file formats, and validation rules. ForEach loops execute parallel Copy Activities with configurable DIU counts (higher for large tables). Post-copy, a Databricks notebook activity runs row count and checksum validation between S3 source and ADLS target. Tumbling window triggers process daily deltas after the initial bulk migration completes. The entire migration runs over 3 weeks with no custom code — only ADF configuration.
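
A sketch of the high-throughput copy settings described above — explicit DIUs and parallel copies on an S3-to-ADLS Parquet copy (paths and counts are illustrative):

// Copy activity — S3 Parquet source to ADLS sink with explicit DIU and parallelism (illustrative values)
{
  "name": "CopyS3ToADLS",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "ParquetSource",
      "storeSettings": { "type": "AmazonS3ReadSettings", "recursive": true, "wildcardFolderPath": "rds-exports/*" }
    },
    "sink": {
      "type": "ParquetSink",
      "storeSettings": { "type": "AzureBlobFSWriteSettings" }
    },
    "dataIntegrationUnits": 128,
    "parallelCopies": 16
  }
}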

↑ Back to top

Code Examples

1. Metadata-Driven Copy Pipeline (ARM Template / JSON)

// Pipeline JSON — ForEach over metadata table, executing parameterized copy
{
  "name": "MetadataDrivenCopy",
  "properties": {
    "activities": [
      {
        "name": "LookupControlTable",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT * FROM dbo.pipeline_control WHERE is_active = 1"
          },
          "dataset": { "referenceName": "ControlTableDS" },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "dependsOn": [{ "activity": "LookupControlTable", "dependencyConditions": ["Succeeded"] }],
        "typeProperties": {
          "items": { "value": "@activity('LookupControlTable').output.value" },
          "isSequential": false,
          "batchCount": 10,
          "activities": [
            {
              "name": "CopyTable",
              "type": "Copy",
              "typeProperties": {
                "source": {
                  "type": "SqlServerSource",
                  "sqlReaderQuery": "SELECT * FROM @{item().schema_name}.@{item().table_name} WHERE @{item().watermark_col} > '@{item().last_watermark}'"
                },
                "sink": {
                  "type": "ParquetSink",
                  "storeSettings": { "type": "AzureBlobFSWriteSettings" }
                }
              }
            }
          ]
        }
      }
    ],
    "parameters": {
      "windowStart": { "type": "string" }
    }
  }
}

2. Expression Language — Dynamic File Path & Watermark

// Sink file path with date partitioning (ADF expression)
@concat(
  'raw/',
  pipeline().parameters.schemaName, '/',
  pipeline().parameters.tableName, '/',
  'year=', formatDateTime(trigger().scheduledTime, 'yyyy'), '/',
  'month=', formatDateTime(trigger().scheduledTime, 'MM'), '/',
  'day=', formatDateTime(trigger().scheduledTime, 'dd'), '/',
  pipeline().parameters.tableName, '.parquet'
)

// Incremental watermark query
@concat(
  'SELECT * FROM ', item().schemaName, '.', item().tableName,
  ' WHERE ', item().watermarkColumn,
  ' > ''', item().lastWatermark, '''',
  ' AND ', item().watermarkColumn,
  ' <= ''', pipeline().parameters.windowEnd, ''''
)

3. Mapping Data Flow — SCD Type 1 Upsert Pattern

// Data Flow Script (DFS) — ADF Mapping Data Flow definition
// Implements SCD Type 1: overwrite matching rows, insert new ones

source(output(
    customer_id as integer,
    name as string,
    email as string,
    updated_at as timestamp
), allowSchemaDrift: true) ~> SourceStream

source(output(
    customer_id as integer,
    name as string,
    email as string,
    updated_at as timestamp
), allowSchemaDrift: true) ~> ExistingDim

// Lookup existing records by business key
SourceStream, ExistingDim lookup(
    SourceStream@customer_id == ExistingDim@customer_id,
    multiple: false, pickup: 'first',
    broadcast: 'auto'
) ~> LookupExisting

// Alter Row: upsert if exists, insert if new
LookupExisting alterRow(
    upsertIf(true())
) ~> AlterRow

// Sink to Azure SQL with upsert enabled
AlterRow sink(
    allowSchemaDrift: true,
    input(customer_id as integer, name as string, email as string, updated_at as timestamp),
    deletable: false, insertable: true,
    updateable: false, upsertable: true,
    keys: ['customer_id'],
    format: 'table'
) ~> DimCustomerSink

4. CI/CD — Azure DevOps Pipeline YAML for ADF Deployment

# azure-pipelines.yml — ADF CI/CD deployment
trigger:
  branches:
    include:
      - adf_publish   # auto-generated by ADF publish

pool:
  vmImage: 'ubuntu-latest'

stages:
  - stage: DeployToStaging
    jobs:
      - job: Deploy
        steps:
          - task: AzureResourceManagerTemplateDeployment@3
            inputs:
              azureResourceManagerConnection: 'AzureServiceConnection'
              resourceGroupName: 'rg-data-staging'
              location: 'East US'
              templateLocation: 'Linked artifact'
              # adf_publish nests the generated templates in a folder named after the dev factory (assumed "adf-dev" here)
              csmFile: '$(Build.SourcesDirectory)/adf-dev/ARMTemplateForFactory.json'
              csmParametersFile: '$(Build.SourcesDirectory)/adf-dev/ARMTemplateParametersForFactory.json'
              overrideParameters: >
                -factoryName "adf-staging"
                -LS_AzureSQL_connectionString "$(STAGING_SQL_CONN)"
                -LS_ADLS_url "https://stagingdatalake.dfs.core.windows.net"

5. PowerShell — Stop/Start Triggers During Deployment

# Pre-deployment: stop all triggers
$triggers = Get-AzDataFactoryV2Trigger `
    -ResourceGroupName "rg-data-prod" `
    -DataFactoryName "adf-prod"

foreach ($trigger in $triggers) {
    if ($trigger.RuntimeState -eq "Started") {
        Stop-AzDataFactoryV2Trigger `
            -ResourceGroupName "rg-data-prod" `
            -DataFactoryName "adf-prod" `
            -Name $trigger.Name -Force
    }
}

# ... ARM template deployment happens here ...

# Post-deployment: restart only the triggers that were running before deployment
foreach ($trigger in $triggers) {
    if ($trigger.RuntimeState -eq "Started") {
        Start-AzDataFactoryV2Trigger `
            -ResourceGroupName "rg-data-prod" `
            -DataFactoryName "adf-prod" `
            -Name $trigger.Name -Force
    }
}

6. Event Trigger — Blob Storage (ARM JSON)

// Storage event trigger — fires when new blobs land in a path
{
  "name": "IoT_File_Arrival_Trigger",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/raw/iot-sensors/blobs/",
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true,
      "scope": "/subscriptions/{sub-id}/resourceGroups/rg-data/providers/Microsoft.Storage/storageAccounts/datalakeprod",
      "events": ["Microsoft.Storage.BlobCreated"]
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "ProcessIoTSensorData", "type": "PipelineReference" },
        "parameters": {
          "fileName": "@triggerBody().fileName",
          "folderPath": "@triggerBody().folderPath"
        }
      }
    ]
  }
}

7. Monitoring — Log Analytics KQL Queries

// KQL: Pipeline failure summary — last 7 days
ADFPipelineRun
| where TimeGenerated > ago(7d)
| where Status == "Failed"
| summarize FailureCount = count(),
            LastFailure  = max(TimeGenerated)
    by PipelineName, FailureType
| order by FailureCount desc

// KQL: Copy Activity throughput — identify slow copies
ADFActivityRun
| where TimeGenerated > ago(24h)
| where ActivityType == "Copy" and Status == "Succeeded"
| extend DurationSec = datetime_diff("second", End, Start),
         RowsCopied  = tolong(parse_json(Output).rowsCopied),
         DataReadMB  = todouble(parse_json(Output).dataRead) / 1048576
| project PipelineName, ActivityName,
          DurationMin = round(DurationSec / 60.0, 1),
          RowsCopied, DataReadMB,
          ThroughputMBps = DataReadMB / DurationSec
| order by DurationMin desc

// KQL: Data Flow cluster utilization — cost optimization
ADFActivityRun
| where ActivityType == "ExecuteDataFlow"
| where TimeGenerated > ago(30d)
| extend ClusterMinutes = datetime_diff("minute", End, Start)
| summarize TotalClusterMin = sum(ClusterMinutes),
            AvgClusterMin   = avg(ClusterMinutes),
            Runs            = count()
    by PipelineName
| extend EstMonthlyCost = TotalClusterMin / 60.0 * 16 * 0.274  // 16 vCores x ~$0.274 per vCore-hour (General Purpose list price; verify current rates)
| order by EstMonthlyCost desc
↑ Back to top

Comparison / When to Use

| Feature | Azure Data Factory | Apache Airflow | AWS Glue | Fivetran / Airbyte |
| --- | --- | --- | --- | --- |
| Primary role | Cloud ETL/ELT orchestration | General workflow orchestration | Serverless ETL (Spark-based) | EL (extract-load) connectors |
| Compute | Managed Spark (data flows), Copy service | External (delegates to Spark, SQL, etc.) | Managed Spark | Managed connectors |
| Transformation | Mapping Data Flows (visual Spark) | Delegates to external engines | PySpark scripts | dbt integration / SQL |
| UI | Visual drag-and-drop pipeline designer | Grid/Graph/Gantt views | Visual editor + Jupyter notebooks | Dashboard-oriented |
| Connectors | 100+ native connectors | 80+ provider packages | AWS service-native + JDBC | 300+ pre-built connectors |
| CI/CD | Git integration + ARM templates | DAG files in Git (Python) | CloudFormation / CDK | API-based config |
| Pricing | Per activity run + DIU hours + data flow cluster hours | Self-managed (infra cost) or managed (Astronomer/MWAA) | DPU-hours | Per row / connector |
| Best for | Azure-centric data integration, visual ETL | Complex multi-system orchestration | AWS-native Spark ETL | Rapid SaaS data ingestion |

Choose ADF when: You're in the Azure ecosystem, need a visual pipeline designer, and want managed Spark for transformations without running your own cluster. Choose Airflow when: You need a code-first orchestrator with complex dependency logic across heterogeneous systems. Choose AWS Glue when: You're AWS-native and want serverless Spark ETL. Choose Fivetran/Airbyte when: You primarily need extract-load from SaaS sources and will transform in the warehouse (ELT).

↑ Back to top

Gotchas & Anti-patterns

  1. Data Flows for simple copy operations. Mapping Data Flows spin up a Spark cluster (~5-10 min cold start). If you're just copying a table from SQL Server to ADLS with no transformation, use a Copy Activity instead. Data Flows shine for complex transformations (joins, aggregations, pivots), not raw data movement.
  2. Ignoring Integration Runtime region. If your ADF is in East US and your storage account is in West Europe, every Copy Activity routes data across regions — increasing latency and egress costs. Match the Azure IR's auto-resolve region to your data source region, or create explicit IRs in specific regions.
  3. Hardcoded values instead of parameterization. Building separate pipelines for each table or environment is a maintenance nightmare. Use a metadata-driven approach: a control table lists all sources, and a single parameterized pipeline handles them all. Global parameters handle environment differences (dev/staging/prod).
  4. Not stopping triggers before deployment. Deploying ARM templates while triggers are running can cause concurrent pipeline runs with old and new definitions. Always stop triggers pre-deployment and restart them post-deployment, with the cleanup script handling triggers that were removed in the new version.
  5. Overusing ADF for complex logic. ADF's expression language is limited compared to Python. If you find yourself building deeply nested If/ForEach/Until loops with complex expressions, consider delegating to a Databricks notebook or Azure Function. ADF is best as an orchestrator, not a general-purpose programming environment.
  6. Not configuring Data Flow TTL (Time-to-Live). The Azure IR used for Data Flows has a configurable TTL. Without it, every Data Flow activity cold-starts a Spark cluster (~5-10 min). Setting TTL to 60-120 minutes keeps the cluster warm across consecutive pipeline runs, dramatically reducing execution time during business hours. But remember TTL clusters incur idle costs — align the TTL with your pipeline frequency (see the IR sketch after this list).
  7. Ignoring DIU tuning in Copy Activities. Copy Activities default to Auto DIU, but for large data transfers (multi-TB), manually setting DIU to 128-256 and enabling parallel copies can improve throughput 4-8x. Conversely, for small tables (<100 MB), high DIU wastes money. Profile your copy patterns and set the DIU count via a pipeline parameter based on expected data volume.
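
A sketch of an Azure IR with a data flow TTL configured, per gotcha 6 (names and values are illustrative):

// Azure Integration Runtime — keep the data flow cluster warm for 60 minutes between runs (illustrative)
{
  "name": "AzureIR-DataFlows",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "East US",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 16,
          "timeToLive": 60
        }
      }
    }
  }
}
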
↑ Back to top

Exercises

  1. Build a metadata-driven incremental pipeline. Create an Azure SQL control table with columns: source_schema, source_table, watermark_column, last_watermark, is_active. Build an ADF pipeline that: (1) reads the control table with a Lookup activity, (2) iterates over the active tables with ForEach, executing a parameterized Copy Activity whose dynamic SQL uses the watermark, (3) after each successful copy, updates the watermark via a Stored Procedure activity. Test with 3 tables, each with a different watermark column.
  2. Implement CI/CD for ADF. Set up Git integration (Azure DevOps or GitHub) for your ADF instance. Create a feature branch, make pipeline changes, submit a PR. After merge, use the ADF publish mechanism to generate ARM templates. Write an Azure DevOps release pipeline (YAML) that deploys to a staging factory with parameter overrides. Include pre/post-deployment scripts for trigger management.
  3. Build a Data Flow with SCD Type 2. Create a Mapping Data Flow that implements slowly changing dimension Type 2: new source records are inserted, changed records expire the old row (set end_date, is_current=false) and insert a new current row. Test with a customer dimension where email addresses change. Verify that historical records are preserved.
↑ Back to top

Quiz

Q1: What are the three types of Integration Runtimes in ADF and when would you use each?

(1) Azure IR — default, fully managed. Use for cloud-to-cloud data movement and Mapping Data Flows. Supports managed VNet for private connectivity. (2) Self-Hosted IR — installed on an on-premises machine or Azure VM. Use to access data behind corporate firewalls (SQL Server, Oracle, file shares). Supports HA with multi-node setup. (3) Azure-SSIS IR — managed runtime for executing legacy SSIS packages. Use only for lift-and-shift migration of existing SSIS workloads; not recommended for new development.

Q2: Explain the difference between a Schedule Trigger and a Tumbling Window Trigger.

Schedule Trigger fires at specified times (cron-like) and can trigger multiple pipelines. It's fire-and-forget — no built-in retry for missed windows. A schedule trigger passes the scheduled time as a parameter but doesn't track window state. Tumbling Window Trigger operates on fixed, non-overlapping time intervals. It maintains state per window (succeeded, failed, waiting), supports automatic retry of failed windows, allows dependency chaining (window N depends on window N-1), and can backfill historical windows. Choose tumbling window for incremental ETL where each time interval must be processed exactly once.

Q3: How does ADF handle schema drift in Mapping Data Flows?

When allowSchemaDrift: true is enabled on a source, ADF dynamically accepts columns not defined at design time. The Data Flow passes these drifted columns through transformations — you can reference them with pattern-based column rules (e.g., $$ to refer to the matched column's value, or a matching condition such as regexMatch(name, '.*_id')). The Derived Column transformation with byName() and byPosition() functions accesses drifted columns dynamically. Schema drift is essential for semi-structured data (JSON, CSV) where upstream sources evolve without notice.

Q4: What is the recommended approach for managing secrets (connection strings, passwords) in ADF?

Use Azure Key Vault linked service + Managed Identity. Grant the ADF managed identity "Key Vault Secrets User" role on your Key Vault. In linked services, reference secrets as Key Vault secrets rather than storing them directly. This provides: (1) no secrets in ARM templates or Git, (2) credentials can be rotated in Key Vault without editing or redeploying pipelines, (3) audit logging via Key Vault logs, (4) different Key Vaults per environment (dev/prod) with the same secret names. Avoid storing secrets as pipeline parameters or global parameters, as these appear in monitoring logs and ARM templates.

Q5: Why is it an anti-pattern to use Mapping Data Flows for simple data copy operations?

Mapping Data Flows require a Spark cluster, which takes 5-10 minutes to start (cold start) and incurs cluster-hour costs even for trivial operations. A Copy Activity, by contrast, uses the lightweight ADF data movement service — it starts in seconds, costs only by DIU-hours (typically cheaper), and can move data at very high throughput (multi-terabyte per run). Use Data Flows only when you need actual transformations (joins, aggregations, window functions, pivots). For extract-load tasks, Copy Activity is faster, cheaper, and simpler.

Q6: How does ADF's Managed Virtual Network (VNet) improve security, and what are its trade-offs?

A Managed VNet wraps the Azure Integration Runtime in a Microsoft-managed virtual network. All outbound connections use managed private endpoints — Private Link connections to Azure services (ADLS, SQL, Key Vault) that keep traffic off the public internet. This eliminates the need to whitelist ADF's IP ranges on firewalls. Security benefits: (1) no public internet exposure for data in transit, (2) compliance with HIPAA/PCI-DSS requirements, (3) simplified network architecture (no self-managed VMs or VNet peering). Trade-offs: (1) ~2-3 minute overhead per activity start (private endpoint provisioning), (2) not all connectors support private endpoints (on-prem still needs Self-Hosted IR), (3) Managed VNet IR must be in a fixed region (no auto-resolve). Best practice: use Managed VNet for production; use public Azure IR for dev/test to save time.

Q7: Explain the metadata-driven pipeline pattern in ADF and when you'd use it.

A metadata-driven pipeline uses a control table (Azure SQL, JSON config file, or Dataverse) that lists all data sources and their parameters: schema name, table name, watermark column, last watermark value, target path, DIU settings, and active flag. A single ADF pipeline reads this table with a Lookup activity, iterates with ForEach (parallel batch count = 10-20), and executes a parameterized Copy Activity per source. Benefits: (1) adding a new table requires only a control table row, not pipeline changes, (2) consistent patterns across 50-500+ tables, (3) centralized monitoring (one pipeline run covers all tables). Use this pattern when you have 10+ tables with similar copy/transform patterns. It's the most common production ADF architecture.

↑ Back to top

Further Reading

↑ Back to top