This guide covers three pillars of modern data engineering individually — Apache Kafka for real-time event streaming, Apache Airflow for workflow orchestration, and PySpark for large-scale data transformation — then shows how they compose into a production pipeline stack.
Among them, this trio powers a huge proportion of real-world data platforms: Kafka handles ingestion and event routing, Airflow schedules and monitors batch/streaming jobs, and PySpark performs heavy transformations. Understanding each tool and their integration points is essential for any data engineering role.
A Kafka topic is an append-only, immutable log of events. Topics are split into partitions — the unit of parallelism and data distribution. Each message within a partition gets a monotonically increasing offset. Ordering is guaranteed only within a single partition, not across partitions. When producing, keys determine partition assignment via a hash (default: murmur2). Choosing the right partition count is critical — too few limits consumer parallelism; too many increases broker metadata overhead and end-to-end latency.
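As a concrete illustration, here is a minimal sketch of creating a topic with an explicit partition count using confluent-kafka's AdminClient. The broker address, topic name, partition count, and config values are illustrative assumptions, not prescriptions:

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Hypothetical sizing: target ~60 MB/s with ~10 MB/s per consumer -> at least 6 partitions;
# we provision 12 for headroom so consumer parallelism can grow without repartitioning.
admin = AdminClient({"bootstrap.servers": "broker:9092"})

futures = admin.create_topics([
    NewTopic(
        "orders",
        num_partitions=12,
        replication_factor=3,
        config={"min.insync.replicas": "2"},  # pairs with acks=all on the producer
    )
])
for topic, future in futures.items():
    future.result()  # raises if creation failed
    print(f"Created topic {topic}")
```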
Consumers subscribe to topics as part of a consumer group. Kafka assigns each partition to exactly one consumer in the group, enabling horizontal scale-out. When consumers join or leave, a rebalance occurs. Older "eager" rebalancing stops all consumers during reassignment. Modern Kafka (2.4+) supports cooperative incremental rebalancing, which only revokes the specific partitions that need to move, minimizing downtime. The newer KIP-848 consumer group protocol (server-side assignment; early access in Kafka 3.7, GA in Kafka 4.0) further reduces rebalance latency.
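A minimal consumer sketch showing how cooperative rebalancing and rebalance callbacks fit together with the confluent-kafka client; the group ID, topic name, and callback logic are placeholders:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "order-processor",
    # Incremental cooperative rebalancing: only the partitions that move are revoked
    "partition.assignment.strategy": "cooperative-sticky",
    "enable.auto.commit": False,
})

def on_assign(consumer, partitions):
    print(f"Assigned partitions: {[p.partition for p in partitions]}")

def on_revoke(consumer, partitions):
    # Commit offsets for work already processed before these partitions move away
    try:
        consumer.commit(asynchronous=False)
    except Exception:
        pass  # nothing consumed yet, so there is nothing to commit

consumer.subscribe(["orders"], on_assign=on_assign, on_revoke=on_revoke)
```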
Kafka supports three delivery guarantees: at-most-once, at-least-once, and exactly-once. EOS relies on two features: idempotent producers (deduplication via producer ID + sequence number) and transactions (atomic writes across multiple partitions). For consume-transform-produce patterns, use isolation.level=read_committed so consumers only see committed transaction data. EOS adds ~3-5% latency overhead and requires careful transactional.id management.
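A sketch of the consume-transform-produce pattern with transactions, using the confluent-kafka producer's transaction methods (init_transactions, begin_transaction, send_offsets_to_transaction, commit_transaction). The transform() function, topic names, and transactional.id are illustrative assumptions:

```python
from confluent_kafka import Consumer, Producer, TopicPartition

producer = Producer({
    "bootstrap.servers": "broker:9092",
    "transactional.id": "order-enricher-0",   # stable per processing instance
    "enable.idempotence": True,
})
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "order-enricher",
    "enable.auto.commit": False,
    "isolation.level": "read_committed",      # hide aborted/uncommitted records
})
consumer.subscribe(["orders"])
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    # transform() is a placeholder for your business logic
    producer.produce("orders.enriched", key=msg.key(), value=transform(msg.value()))
    # Commit the input offset inside the same transaction (atomic with the output write)
    producer.send_offsets_to_transaction(
        [TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1)],
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()
```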
Each partition has a leader and zero or more follower replicas. The ISR set contains followers that are caught up with the leader within replica.lag.time.max.ms. A produce is acknowledged based on acks: acks=0 (fire-and-forget), acks=1 (leader only), acks=all (all ISR members). Setting min.insync.replicas=2 with acks=all and replication.factor=3 is the standard production durability configuration.
In production, a Schema Registry (e.g., Confluent Schema Registry) manages Avro, Protobuf, or JSON schemas for topics. It enforces compatibility rules — BACKWARD, FORWARD, FULL — preventing producers from publishing data that would break consumers. This decouples producer and consumer deployments. The schema ID is embedded in the first 5 bytes of each message (magic byte + 4-byte ID).
Kafka supports two retention strategies: time/size-based deletion and log compaction. Compaction keeps only the latest value for each key, making the topic a materializable table. This is the foundation for Kafka as a "streaming database" — changelog topics in Kafka Streams use compaction. Compaction runs as a background thread and never removes the active segment, so recent duplicates may temporarily exist.
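For example, a changelog-style topic can be created with cleanup.policy=compact; the topic name and tuning values below are illustrative:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker:9092"})
futures = admin.create_topics([
    NewTopic(
        "customer-profile-changelog",
        num_partitions=6,
        replication_factor=3,
        config={
            "cleanup.policy": "compact",           # keep only the latest value per key
            "min.cleanable.dirty.ratio": "0.5",    # how eagerly the log cleaner runs
            "delete.retention.ms": "86400000",     # keep tombstones for 1 day
        },
    )
])
for future in futures.values():
    future.result()
```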
E-commerce platforms use Kafka as the central nervous system between services. When a user places an order, the order service publishes to an orders topic. The inventory service, payment service, notification service, and analytics service each consume from this topic independently via their own consumer groups, enabling loose coupling and independent scaling.
Debezium connectors capture row-level changes from databases (PostgreSQL, MySQL, MongoDB) and publish them to Kafka topics. Downstream consumers replicate data to search indices (Elasticsearch), caches (Redis), or data warehouses in near-real-time. This eliminates fragile batch ETL scripts and provides a reliable, ordered change stream.
Financial services stream transaction events through Kafka. Kafka Streams or Flink applications window these events (e.g., 5-minute tumbling windows) to detect anomalous patterns — multiple high-value transactions from different locations, velocity checks — and publish alerts to a downstream topic that triggers blocking actions.
Organizations route application logs, metrics, and traces through Kafka before sinking them to Elasticsearch, Grafana Loki, or object storage. Kafka acts as a buffer that absorbs traffic spikes and decouples producers (applications) from consumers (observability stack), preventing backpressure from slowing services.
from confluent_kafka import Producer
from confluent_kafka.serialization import SerializationContext, MessageField
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
# Schema Registry setup
sr_client = SchemaRegistryClient({"url": "http://schema-registry:8081"})
avro_schema = """
{
"type": "record",
"name": "OrderEvent",
"fields": [
{"name": "order_id", "type": "string"},
{"name": "user_id", "type": "string"},
{"name": "amount", "type": "double"},
{"name": "ts", "type": "long", "logicalType": "timestamp-millis"}
]
}
"""
serializer = AvroSerializer(sr_client, avro_schema)
# Idempotent producer config — ensures no duplicates on retry
producer = Producer({
"bootstrap.servers": "broker:9092",
"enable.idempotence": True,
"acks": "all",
"retries": 5,
"max.in.flight.requests.per.connection": 5,
})
def send_order(order: dict):
producer.produce(
topic="orders",
key=order["order_id"],
value=serializer(order, SerializationContext("orders", MessageField.VALUE)),
on_delivery=lambda err, msg: print(f"Delivered to {msg.partition()} @ {msg.offset()}") if not err else print(f"FAIL: {err}"),
)
producer.flush()
from confluent_kafka import Consumer, KafkaError
consumer = Consumer({
"bootstrap.servers": "broker:9092",
"group.id": "order-processor",
"auto.offset.reset": "earliest",
"enable.auto.commit": False, # manual commit for at-least-once
"isolation.level": "read_committed", # only read committed txn messages
})
consumer.subscribe(["orders"])
try:
while True:
msg = consumer.poll(timeout=1.0)
if msg is None:
continue
if msg.error():
if msg.error().code() == KafkaError._PARTITION_EOF:
continue
raise Exception(msg.error())
# Process message — must be idempotent for at-least-once
process_order(msg.value())
# Commit only after successful processing
consumer.commit(asynchronous=False)
finally:
consumer.close()
# Conceptual: Kafka Streams windowed count (Java API, shown as pseudocode)
# Counts events per user_id in 5-minute tumbling windows
stream = builder.stream("clickstream") \
.group_by_key() \
.windowed_by(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))) \
.count(Materialized.as_("click-counts-store"))
# Output: KTable<Windowed<String>, Long>
# Can be forwarded to a topic or queried via interactive queries
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
# Schema Registry ensures backward-compatible evolution
schema_registry = SchemaRegistryClient({
"url": "http://schema-registry:8081"
})
# Avro schema (v2 — added "loyalty_tier" field with default)
CUSTOMER_SCHEMA = """{
"type": "record",
"name": "Customer",
"namespace": "com.example.events",
"fields": [
{"name": "customer_id", "type": "string"},
{"name": "name", "type": "string"},
{"name": "email", "type": "string"},
{"name": "loyalty_tier", "type": "string", "default": "bronze"}
]
}"""
avro_serializer = AvroSerializer(
schema_registry, CUSTOMER_SCHEMA,
conf={"auto.register.schemas": True}
)
producer = SerializingProducer({
"bootstrap.servers": "broker:9092",
"key.serializer": lambda k, _: k.encode("utf-8"),
"value.serializer": avro_serializer,
})
# Produce an event — Schema Registry validates compatibility
producer.produce(
topic="customers",
key="cust-1234",
value={"customer_id": "cust-1234", "name": "Alice",
"email": "alice@example.com", "loyalty_tier": "gold"},
)
producer.flush()
from confluent_kafka import Consumer, Producer
import json
consumer = Consumer({
"bootstrap.servers": "broker:9092",
"group.id": "order-processor",
"enable.auto.commit": False,
})
consumer.subscribe(["orders"])
dlq_producer = Producer({"bootstrap.servers": "broker:9092"})
MAX_RETRIES = 3
while True:
msg = consumer.poll(1.0)
if msg is None or msg.error():
continue
for attempt in range(MAX_RETRIES):
try:
order = json.loads(msg.value())
process_order(order)
consumer.commit(asynchronous=False)
break
except Exception as e:
if attempt == MAX_RETRIES - 1:
# All retries exhausted — send to DLQ for manual review
dlq_producer.produce(
topic="orders.dlq",
key=msg.key(),
value=msg.value(),
headers={
"error": str(e).encode(),
"original_topic": "orders".encode(),
"retry_count": str(MAX_RETRIES).encode(),
},
)
dlq_producer.flush()
consumer.commit(asynchronous=False)
| Feature | Apache Kafka | AWS Kinesis | Google Pub/Sub | Apache Pulsar |
|---|---|---|---|---|
| Ordering | Per-partition | Per-shard | Per-key (ordering keys) | Per-partition |
| Retention | Configurable (days/forever via compaction) | 1–365 days | Configurable up to 31 days (7-day default) | Configurable + tiered storage |
| Replay | Full offset-based replay | Shard iterator reset | Seek to timestamp | Full offset replay |
| Throughput | Millions msg/sec (scales with partitions) | 1 MB/s per shard | Auto-scales | Millions msg/sec |
| Exactly-once | Native (idempotent + txns) | At-least-once (dedup in app) | At-least-once (exactly-once delivery opt-in) | Native (txns) |
| Managed option | Confluent Cloud, MSK, Aiven | Fully managed (AWS) | Fully managed (GCP) | StreamNative |
| Ecosystem | Connect, Streams, ksqlDB, Schema Registry | Lambda integration | Dataflow integration | Functions, IO connectors |
| Best for | High-throughput, replay, multi-consumer | AWS-native simple streaming | GCP-native pub/sub | Multi-tenant, geo-replication |
Choose Kafka when: You need high-throughput event streaming with replay, exactly-once semantics, or a rich connector ecosystem. Choose Kinesis/Pub/Sub when: You're committed to one cloud and want zero ops overhead. Choose Pulsar when: You need native multi-tenancy or built-in tiered storage without Confluent.
Common Kafka pitfalls and sizing guidance:
- Partition sizing: start from partitions = max(target_throughput / producer_throughput_per_partition, target_throughput / consumer_throughput_per_partition) and scale up with headroom.
- Rebalance handling: use on_revoke callbacks to flush in-flight work and commit offsets before partitions move to another consumer.
- Auto-commit data loss: with enable.auto.commit=true, if your processing takes longer than auto.commit.interval.ms, offsets may be committed before processing finishes, causing data loss on failure. Prefer manual commits after processing.

What happens if a consumer group has more consumers than partitions? The extra consumers sit idle. Kafka assigns at most one consumer per partition within a consumer group. If you have 8 consumers and 6 partitions, 2 consumers will not receive any messages. This is by design — partition count sets the upper bound on consumer parallelism.
What is the difference between acks=1 and acks=all, and when would you choose each? With acks=1, the leader writes the message to its local log and responds immediately without waiting for followers. This is faster but risks data loss if the leader crashes before replication. With acks=all, the leader waits for all in-sync replicas (ISR) to acknowledge the write. Combined with min.insync.replicas=2, this ensures durability even if one broker fails. Choose acks=1 for low-latency, loss-tolerant use cases (metrics, logs). Choose acks=all for financial transactions, orders, or any data where loss is unacceptable.
Time-based retention deletes entire log segments older than retention.ms regardless of content. Log compaction retains the latest value for each key and removes older duplicates, keeping the topic as a compact key-value snapshot. Compaction never removes records from the active segment. A compacted topic can be used as a materialized view — consumers reading from offset 0 can rebuild the full current state. Time-based retention is for event streams; compaction is for changelog/state topics.
In a message queue (e.g., RabbitMQ), any consumer can pick up any message, and once acknowledged, it's removed. Kafka ties partitions to consumers 1:1 within a group, so you can't have fine-grained work-stealing. Messages persist until retention expires — they're not removed on consumption. Also, Kafka's rebalance on consumer failure is slower than queue-based redelivery. However, newer Kafka releases introduce share groups (KIP-932, early access in Kafka 4.0) that enable queue-like semantics where multiple consumers can process from the same partition.
What is the purpose of transactional.id in Kafka producers? The transactional.id enables exactly-once semantics across producer restarts. When a producer starts with a transactional.id, the broker "fences" (disables) any previous producer with the same ID, preventing zombie instances from writing duplicate data. It ties together idempotent production and transaction coordination. Each instance in a consume-transform-produce loop should have a unique, stable transactional.id (often derived from input partition assignment).
In Airflow 2.x+, the TaskFlow API (@task decorator) replaces verbose operator instantiation with Pythonic functions. Tasks decorated with @task automatically handle XCom serialization — the return value of one task is passed as an argument to the next. This eliminates manual xcom_push/pull calls. DAGs should be treated as configuration, not business logic: keep the DAG file thin and push heavy computation into the task's execute() or delegated services (Spark, SQL engines).
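A minimal TaskFlow sketch showing return values flowing between tasks via automatic XCom serialization. The imports use the Airflow 2.x airflow.decorators path and the DAG ID and task logic are illustrative:

```python
import pendulum
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def taskflow_demo():
    @task
    def extract() -> list[dict]:
        # The return value is serialized to XCom automatically
        return [{"order_id": "o-1", "amount": 42.0}]

    @task
    def total(orders: list[dict]) -> float:
        # The upstream return value arrives as a plain Python argument
        return sum(o["amount"] for o in orders)

    @task
    def report(amount: float):
        print(f"Daily total: {amount}")

    report(total(extract()))

taskflow_demo()
```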
Airflow 2.3+ supports dynamic task mapping via .expand(). Instead of hard-coding a fixed number of parallel tasks, you can fan-out at runtime based on the output of an upstream task. This is essential for ETL patterns where the number of files, partitions, or API pages is not known until execution. Under the hood, mapped tasks share the same operator definition but each gets a unique map_index. Combined with .partial() for static params, this replaces most SubDAG and TaskGroup-based fan-out hacks.
The executor determines how tasks run. LocalExecutor runs tasks as subprocesses on the scheduler node — fine for light workloads. CeleryExecutor distributes tasks across a pool of persistent workers via a message broker (Redis/RabbitMQ) — good for steady, large workloads. KubernetesExecutor spins up a new pod per task — optimal for heterogeneous resource requirements and auto-scaling, but with higher task startup latency (~10-30s). Choose based on scale, isolation needs, and infrastructure.
Sensors poll for an external condition (file exists, partition available). Classic sensors occupy a worker slot while waiting, which wastes resources. Airflow 2.2+ introduced deferrable operators (async) — they suspend the task, free the worker slot, and register a lightweight trigger that runs in the triggerer process. When the condition is met, the trigger resumes the task. This can reduce worker slot consumption by 10x for sensor-heavy DAGs.
Airflow stores credentials in Connections (host, login, password, extras JSON) and configuration in Variables. For production, use a secrets backend (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager) rather than the metadata DB. Critical best practice: never call Variable.get() at DAG parse-time (top-level code) — it triggers a DB query every scheduler heartbeat. Use Jinja templates ({{ var.value.my_var }}) or access variables inside task callables.
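A short sketch contrasting parse-time and execution-time access, assuming a hypothetical reporting_api_key variable:

```python
import pendulum
from airflow import DAG
from airflow.decorators import task
from airflow.models import Variable
from airflow.operators.bash import BashOperator

# BAD (don't do this): runs on every scheduler parse and hits the metadata DB each time
# API_KEY = Variable.get("reporting_api_key")

with DAG(
    dag_id="variable_access_demo",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
):
    # GOOD: Jinja template, resolved only when the task actually runs
    export = BashOperator(
        task_id="export_report",
        bash_command="python export.py --key '{{ var.value.reporting_api_key }}'",
    )

    # ALSO GOOD: fetched inside the task callable, on the worker at execution time
    @task
    def call_api():
        api_key = Variable.get("reporting_api_key")
        print(f"Got key of length {len(api_key)}")

    export >> call_api()
```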
Airflow 2.4+ supports datasets as first-class scheduling triggers. A producing DAG declares outlets=[Dataset("s3://bucket/table")] on a task. A consuming DAG uses schedule=[Dataset("s3://bucket/table")] so it runs when the upstream task completes. This creates event-driven, cross-DAG dependencies without explicit TriggerDagRunOperator calls. Datasets are logical URIs — Airflow doesn't validate the URI target, so naming conventions matter.
A retailer runs a nightly DAG that: (1) extracts incremental data from PostgreSQL and Shopify APIs, (2) loads raw data to S3/GCS, (3) triggers dbt transformations in Snowflake/BigQuery, (4) runs data quality checks (Great Expectations or dbt tests), (5) sends a Slack notification. Airflow orchestrates this multi-step pipeline, handling retries, alerting on failure, and ensuring tasks run in the correct order.
A fintech company uses Airflow to orchestrate feature computation for fraud detection models. A daily DAG: (1) pulls transaction data from Kafka-backed tables, (2) submits PySpark jobs to compute rolling aggregates (e.g., 7-day spend per merchant category), (3) writes features to an online feature store (Feast/Tecton), (4) triggers model retraining via an MLflow API call. Dynamic task mapping fans out PySpark jobs across feature groups in parallel.
An enterprise with data in AWS (S3, Redshift) and Azure (ADLS, Synapse) uses Airflow to synchronize reference data bidirectionally. Airflow's provider ecosystem offers operators for both clouds, and the DAG handles schema drift detection, incremental sync, and reconciliation checks. Connection management via secrets backends keeps credentials cloud-specific.
import pendulum
from airflow.sdk import DAG, task
with DAG(
dag_id="dynamic_elt_pipeline",
start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
schedule="@daily",
catchup=False,
tags=["elt", "dynamic"],
) as dag:
@task()
def list_source_tables() -> list[str]:
"""Discover tables to extract — could query a metadata DB."""
return ["orders", "customers", "products", "inventory"]
@task()
def extract_and_load(table_name: str, ds=None):
"""Extract from source DB, load to S3 as Parquet."""
from airflow.hooks.base import BaseHook
conn = BaseHook.get_connection("source_postgres")
# ... extract logic using conn, partition by ds ...
print(f"Extracted {table_name} for {ds}")
@task()
def run_dbt_models():
"""Trigger dbt run via BashOperator or dbt Cloud API."""
import subprocess
subprocess.run(["dbt", "run", "--select", "tag:daily"], check=True)
tables = list_source_tables()
# Dynamic fan-out: one task per table, determined at runtime
loaded = extract_and_load.expand(table_name=tables)
loaded >> run_dbt_models()
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
wait_for_export = S3KeySensor(
task_id="wait_for_export_file",
bucket_name="data-lake-raw",
bucket_key="exports/{{ ds }}/transactions.parquet",
aws_conn_id="aws_default",
deferrable=True, # frees worker slot while waiting
poke_interval=60, # check every 60s (in triggerer, not worker)
timeout=3600, # fail after 1 hour
)
from airflow import DAG, Dataset
from airflow.decorators import task
import pendulum
ORDERS_DATASET = Dataset("s3://warehouse/silver/orders")
# Producer DAG — declares outlet
with DAG(dag_id="ingest_orders", schedule="@daily",
start_date=pendulum.datetime(2024, 1, 1)):
@task(outlets=[ORDERS_DATASET])
def load_orders():
# ... load logic ...
pass
load_orders()
# Consumer DAG — triggers when dataset is updated
with DAG(dag_id="transform_orders", schedule=[ORDERS_DATASET],
start_date=pendulum.datetime(2024, 1, 1)):
@task()
def build_order_metrics():
# runs automatically after ingest_orders completes
pass
build_order_metrics()
| Feature | Apache Airflow | Prefect | Dagster | Databricks Workflows |
|---|---|---|---|---|
| Scheduling | Cron, timetables, datasets | Cron, event-driven | Cron, sensors, assets | Cron, file arrival triggers |
| Paradigm | DAG of tasks (operators) | Flows & tasks (Python-native) | Assets & ops (data-aware) | Jobs & tasks (Spark-centric) |
| Dynamic tasks | expand() in 2.3+ | Native (map) | Native (dynamic partitions) | For-each tasks |
| Executor model | Celery, K8s, Local | Agents, Work Pools | Executors (K8s, etc.) | Databricks clusters |
| UI | Grid, Graph, Gantt | Cloud UI (free tier) | Dagit (asset lineage) | Databricks Workflows UI |
| Ecosystem | 80+ provider packages | Growing integrations | Growing integrations | Databricks-native (Spark, Delta) |
| Learning curve | Moderate (many concepts) | Low (Pythonic) | Moderate (asset model) | Low if already on Databricks |
| Best for | Complex multi-system orchestration | Python-native workflows | Data asset-centric lineage | Spark-centric workloads |
Choose Airflow when: You need a battle-tested orchestrator with a massive provider ecosystem, complex dependency patterns, and your team is comfortable with its operational overhead. Choose Prefect/Dagster when: You want a more Pythonic developer experience or asset-based lineage. Choose Databricks Workflows when: Your entire workload is Spark/Delta and you're already on the Databricks platform.
Common Airflow pitfalls:
- Top-level DAG code outside task callables or execute() methods runs every time the scheduler parses the DAG file (every ~30s). Importing heavy libraries (pandas, torch) or calling APIs/DBs at the top level tanks scheduler performance. Move imports inside task callables.
- Non-idempotent tasks: with a plain INSERT INTO and no deduplication, retries create duplicates. Always use UPSERT/MERGE, partition-based overwrites, or write to temp locations with atomic rename.
- Confusing dates: Airflow 2.2+ shifted from execution_date to data_interval_start/end. Many legacy patterns still reference execution_date, which is confusing (it's the start of the interval, not when the DAG actually runs). Always use data_interval_start for partitioning logic.

Practice exercises:
- Build a DAG that discovers source tables, fans out one extract/load task per table with .expand(), and, (3) after all loads complete, triggers a dbt run. Add a sensor that waits for a daily flag file before starting.
- Build a dataset-driven chain of three DAGs: ingest_raw (produces Dataset A), transform_silver (triggered by Dataset A, produces Dataset B), and build_gold (triggered by Dataset B). Verify in the Airflow UI that the Dataset graph shows the correct dependency chain.

What is the difference between execution_date and data_interval_start in Airflow 2.x? They are the same value — execution_date is the legacy name, and data_interval_start is the modern name introduced in Airflow 2.2. Both represent the start of the data interval the DAG run covers. The actual wall-clock time the DAG runs is data_interval_end (or slightly after). For a daily DAG scheduled for 2024-01-15, data_interval_start = 2024-01-15T00:00:00 and data_interval_end = 2024-01-16T00:00:00, but the DAG actually executes after midnight on the 16th.
Why shouldn't you call Variable.get() at the top level of a DAG file? The Airflow scheduler re-parses DAG files at a regular interval (every ~30s by default). Top-level Variable.get() triggers a database query on every parse, which hammers the metadata DB and slows down scheduling for all DAGs. Instead, use Jinja templates ({{ var.value.my_var }}), which are resolved only at task execution time, or call Variable.get() inside a task callable.
How does dynamic task mapping (.expand()) differ from a SubDAG? SubDAGs are deprecated. They created a separate DAG run with its own scheduler budget, leading to deadlocks and performance issues. Dynamic task mapping creates tasks within the same DAG run — they share the same pool, scheduling, and UI. Mapped tasks are determined at runtime (based on upstream output), while SubDAGs were defined at parse time. Mapped tasks are first-class: they appear in the UI grid with individual status, logs, and retry controls.
Deferrable operators allow a task to suspend execution, free its worker slot, and register an async trigger (a lightweight Python coroutine). The triggerer is a separate Airflow process that runs these triggers using asyncio. When a trigger fires (e.g., a file appears in S3), the triggerer notifies the scheduler, which resumes the task on a worker. This architecture is critical for sensor-heavy DAGs where tasks spend most of their time waiting.
Airflow uses a scheduler loop that periodically parses DAG files and evaluates task dependencies. The scheduler heartbeat is typically 1-5 seconds, and task execution involves worker assignment, potential pod/container creation (K8s executor), and metadata DB state transitions. This overhead makes Airflow suitable for batch orchestration (minutes to hours) but not for sub-second, real-time triggering. For sub-second needs, use a streaming framework (Kafka Streams, Flink) and use Airflow only to orchestrate the deployment/lifecycle of those streaming jobs.
PySpark's performance relies on the Catalyst optimizer, which transforms logical plans through rule-based and cost-based optimization (predicate pushdown, column pruning, join reordering). The optimized logical plan is converted to physical plan stages. Tungsten provides off-heap memory management and whole-stage code generation (WSCG), compiling pipeline stages into optimized Java bytecode. Understanding the query plan (df.explain(True)) is critical for diagnosing performance issues — look for BroadcastHashJoin vs. SortMergeJoin, filter placement, and exchange (shuffle) nodes.
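A small sketch of inspecting a plan with explain(True); the input paths are illustrative, and the comments note what to look for in the physical plan:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("plan_inspection").getOrCreate()

orders = spark.read.parquet("s3a://data-lake/silver/orders/")
products = spark.read.parquet("s3a://warehouse/dim_products/")

query = (
    orders
    .where(F.col("order_date") >= "2024-06-01")   # expect this filter pushed down to the scan
    .join(F.broadcast(products), "product_id")    # expect BroadcastHashJoin, no Exchange on products
    .groupBy("category")
    .agg(F.sum("revenue").alias("revenue"))
)

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan
query.explain(True)
```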
Data in Spark is divided into partitions — the unit of parallelism. Transformations like groupBy, join, and repartition trigger a shuffle — data is serialized, written to disk, and redistributed across executors via the network. Shuffles are the most expensive operation in Spark. To minimize them: (1) use broadcast() for small dimension tables, (2) pre-partition data on common join keys, (3) use coalesce() instead of repartition() to reduce partitions, (4) leverage bucketing for repeated joins. Target partition sizes of 100-200 MB for optimal throughput.
When one side of a join fits in memory (default threshold: spark.sql.autoBroadcastJoinThreshold = 10 MB), Spark broadcasts the small table to all executors, avoiding a shuffle on the large table. This turns an O(n log n) sort-merge into an O(n) hash lookup. For dimensions up to ~1-2 GB, manually hint with F.broadcast(small_df). Sort-merge join is the default for two large tables — both sides are sorted on the join key and merged. If you're joining the same large tables repeatedly, consider bucketing them by the join key to eliminate the sort phase.
AQE (enabled by default since Spark 3.2) re-optimizes the query plan at runtime based on actual data statistics from completed map stages. Key features: (1) coalescing post-shuffle partitions — merges small partitions to reduce task overhead, (2) converting sort-merge join to broadcast join when runtime statistics show one side is small, (3) skew join optimization — splits skewed partitions into smaller ones. AQE significantly reduces the need for manual tuning. Verify it with spark.conf.get("spark.sql.adaptive.enabled"), or enable it explicitly with spark.conf.set("spark.sql.adaptive.enabled", "true").
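The main AQE knobs can be set on a SparkSession; the values below are illustrative, not tuned recommendations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe_demo").getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")                       # on by default since 3.2
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")    # merge tiny post-shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")              # split oversized skewed partitions
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")  # target post-coalesce partition size
print(spark.conf.get("spark.sql.adaptive.enabled"))
```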
PySpark's Structured Streaming uses the same DataFrame API for batch and stream processing. Key concepts: triggers (how often to process — processingTime, availableNow, continuous), output modes (append, complete, update), and checkpointing (stores offsets and state to durable storage for fault tolerance). The availableNow trigger (Spark 3.3+) processes all available data then stops — perfect for Airflow-triggered micro-batch patterns.
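A minimal Structured Streaming sketch tying these pieces together — a file source, append output mode, a checkpoint location, and the availableNow trigger; the paths and schema are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream_demo").getOrCreate()

# Stream over a directory of JSON files (illustrative path and schema)
events = (
    spark.readStream
    .schema("event_id string, user_id string, event_ts timestamp")
    .json("s3a://data-lake/landing/events/")
)

query = (
    events.writeStream
    .format("parquet")
    .outputMode("append")                                            # file sinks support append only
    .option("path", "s3a://data-lake/bronze/events/")
    .option("checkpointLocation", "s3a://checkpoints/demo/events")   # offsets + state live here
    .trigger(availableNow=True)                                      # drain everything available, then stop
    .start()
)
query.awaitTermination()
```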
Spark's unified memory model splits executor memory into execution memory (shuffles, joins, aggregations) and storage memory (cached data). They can borrow from each other. When memory is exhausted, Spark spills to disk — this is slow but prevents OOM. Common OOM causes: (1) large broadcast variables, (2) skewed partitions, (3) collecting large datasets to the driver (collect()). Monitor via Spark UI's Storage and Executor tabs. Set spark.executor.memory judiciously — overprovisioning wastes cluster resources; underprovisioning causes excessive spill.
A media streaming company processes 500 TB of daily event logs (plays, pauses, searches). PySpark reads Parquet files from S3, applies sessionization (grouping events by user within 30-min gaps), computes aggregates (daily active users, content popularity), and writes output to Delta Lake tables partitioned by date. Spark's distributed execution makes this feasible; a single-node tool would take days.
An insurance company uses PySpark to validate 200M+ claim records against business rules: coverage dates must precede claim dates, amounts must be positive, policy IDs must exist in a reference table. PySpark enables these checks as distributed joins and filters, outputting a quality report and quarantining bad records — all within 15 minutes.
A ride-sharing platform computes features for demand prediction: rolling 1-hour ride counts per geo-hex, driver supply ratios, weather-joined features. PySpark window functions (F.window) and geo-spatial UDFs process billions of ride events. Output features are written to a feature store (Feast) or Hive tables consumed by training jobs.
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder \
.appName("incremental_load") \
.config("spark.sql.adaptive.enabled", "true") \
.getOrCreate()
# Read only the partitions we need — predicate pushdown to Parquet
date_filter = "2024-06-15"
raw_df = spark.read.parquet("s3a://data-lake/raw/events/") \
.where(F.col("event_date") == date_filter)
# Deduplicate on event_id (at-least-once upstream guarantee)
deduped_df = raw_df.dropDuplicates(["event_id"])
# Transform: add derived columns
enriched_df = deduped_df.withColumn(
"event_hour", F.hour("event_timestamp")
).withColumn(
"is_weekend", F.dayofweek("event_timestamp").isin([1, 7])
)
# Write to Delta Lake with MERGE-style overwrite per partition
enriched_df.write \
.format("delta") \
.mode("overwrite") \
.option("replaceWhere", f"event_date = '{date_filter}'") \
.partitionBy("event_date") \
.save("s3a://data-lake/silver/events/")
from pyspark.sql import functions as F
# Problem: joining on user_id but 1% of users generate 50% of events
SALT_BUCKETS = 10
# Salt the large (skewed) side
events_salted = events_df.withColumn(
"salt", (F.rand() * SALT_BUCKETS).cast("int")
).withColumn(
"salted_key", F.concat(F.col("user_id"), F.lit("_"), F.col("salt"))
)
# Explode the small (dimension) side to match all salt values
users_exploded = users_df.crossJoin(
spark.range(0, SALT_BUCKETS).withColumnRenamed("id", "salt")
).withColumn(
"salted_key", F.concat(F.col("user_id"), F.lit("_"), F.col("salt"))
)
# Now join on salted_key — evenly distributed
joined_df = events_salted.join(
users_exploded, on="salted_key", how="inner"
).drop("salt", "salted_key")
from pyspark.sql import Window, functions as F
SESSION_GAP = 1800 # 30 minutes in seconds
window_spec = Window.partitionBy("user_id").orderBy("event_ts")
sessionized = events_df \
.withColumn("prev_ts", F.lag("event_ts").over(window_spec)) \
.withColumn("gap_seconds",
F.unix_timestamp("event_ts") - F.unix_timestamp("prev_ts")
) \
.withColumn("new_session",
F.when(
(F.col("gap_seconds") > SESSION_GAP) | F.col("prev_ts").isNull(), 1
).otherwise(0)
) \
.withColumn("session_id",
F.concat(
F.col("user_id"), F.lit("_"),
F.sum("new_session").over(window_spec)
)
)
from pyspark.sql import functions as F
# Small dimension table (~50 MB) — broadcast to avoid shuffle
product_dim = spark.read.parquet("s3a://warehouse/dim_products/")
# Large fact table (~500 GB)
sales_fact = spark.read.parquet("s3a://warehouse/fact_sales/") \
.where(F.col("sale_date") >= "2024-01-01")
# Explicit broadcast hint — overrides autoBroadcastJoinThreshold
result = sales_fact.join(
F.broadcast(product_dim),
on="product_id",
how="left"
).groupBy("category", "sale_date") \
.agg(
F.sum("revenue").alias("total_revenue"),
F.countDistinct("order_id").alias("order_count")
)
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, ArrayType
import pandas as pd
# BAD: Row-level Python UDF (10-100x slower)
# @udf(StringType())
# def clean_text(text):
# return text.lower().strip() if text else None
# GOOD: Pandas UDF — operates on Arrow batches (vectorized, near-native speed)
@F.pandas_udf(StringType())
def clean_text(series: pd.Series) -> pd.Series:
"""Vectorized text cleaning — runs on entire column batch."""
return series.str.lower().str.strip().str.replace(r'[^\w\s]', '', regex=True)
# Apply: Spark transfers data via Arrow → Pandas Series → Arrow
cleaned_df = raw_df.withColumn(
"clean_description", clean_text(F.col("product_description"))
)
# Grouped Map — apply a function per group via applyInPandas (Spark 3.x API;
# the older @pandas_udf(..., PandasUDFType.GROUPED_MAP) form is deprecated)
def compute_user_stats(pdf: pd.DataFrame) -> pd.DataFrame:
    """Compute per-user statistics using full Pandas power."""
    return pd.DataFrame({
        "user_id": [pdf["user_id"].iloc[0]],
        "event_count": [len(pdf)],
        "avg_duration": [pdf["duration_seconds"].mean()],
    })

user_stats = events_df.groupBy("user_id").applyInPandas(
    compute_user_stats,
    schema="user_id string, event_count long, avg_duration double",
)
| Feature | PySpark | Pandas / Polars | dbt (SQL) | Apache Flink |
|---|---|---|---|---|
| Scale | TB–PB (distributed) | GB (single-node) | Depends on warehouse | TB–PB (distributed) |
| Paradigm | DataFrame + SQL | DataFrame | SQL models | DataStream + Table API |
| Latency | Seconds–minutes (batch/micro-batch) | Milliseconds (local) | Depends on warehouse | Milliseconds (true streaming) |
| Streaming | Structured Streaming (micro-batch) | No | No | Native event-time streaming |
| Language | Python, Scala, SQL | Python, Rust (Polars) | SQL + Jinja | Java, Scala, Python, SQL |
| Ecosystem | MLlib, GraphX, Delta Lake | scikit-learn, NumPy | dbt packages, exposures | CDC connectors, Iceberg |
| Best for | Large-scale batch ETL, feature eng. | Small data, prototyping | Warehouse transformations | True low-latency streaming |
Choose PySpark when: Data exceeds single-node memory, you need distributed shuffles/joins, or you're already on a Spark platform (Databricks, EMR). Choose Pandas/Polars when: Data fits in memory and you want simpler development. Choose dbt when: Transformations are pure SQL against a warehouse. Choose Flink when: You need true event-time streaming with millisecond latency.
Common PySpark pitfalls:
- Calling collect() on large DataFrames: collect() pulls all data to the driver — instant OOM on large datasets. Use take(n), show(), or write to storage. If you need a count, use df.count() (distributed), not len(df.collect()).
- Row-level Python UDFs: prefer built-in functions from pyspark.sql.functions. If you must use UDFs, use Pandas UDFs (@pandas_udf), which operate on Arrow batches with near-native speed.
- Small-file explosion: coalesce() before write, or leverage Delta Lake's OPTIMIZE (auto-compaction) to merge small files.
- Data skew: a groupBy or join on a key with a skewed distribution (e.g., null keys, hot users) creates some tasks that process 100x more data than others. The Spark UI's "Task Duration" histogram reveals skew. Fix with AQE skew join hints, salting, or isolating skewed keys.
- Over-caching: df.cache() stores data in memory, but memory is limited. Caching a DataFrame you only use once wastes memory and can evict more valuable caches. Cache only when a DataFrame is reused multiple times in the same job. Always unpersist() when done.

Practice exercises:
- Create two DataFrames: big_events (10M rows, join key = user_id with 1% of values being a single "hot" user) and user_dim (100K rows). Join them naively. Use explain(True) to confirm it's a SortMergeJoin. Then: (a) apply a broadcast hint for user_dim, (b) verify the plan shows BroadcastHashJoin, (c) measure the execution time difference, (d) implement salting for the hot-key scenario and compare.
- Build a Structured Streaming job that reads events, deduplicates with dropDuplicates and a watermark, and writes output to a Parquet sink with checkpointing. Use trigger(availableNow=True) mode to simulate Airflow-triggered micro-batches.

What is the difference between repartition(n) and coalesce(n)? repartition(n) performs a full shuffle to create exactly n partitions — all data is redistributed. It can increase or decrease partition count. coalesce(n) reduces partition count without a full shuffle by combining adjacent partitions. It can only decrease the count and results in unevenly sized partitions. Use coalesce when reducing partitions (e.g., before writing) and repartition when you need evenly distributed data or are increasing partition count.
AQE re-optimizes the query plan at stage boundaries using runtime statistics from completed map stages. Three key optimizations: (1) Coalescing post-shuffle partitions — merges small partitions to avoid task overhead (reduces 200 tasks of 1 MB each to 10 tasks of 20 MB). (2) Converting sort-merge join to broadcast join — if a map output is smaller than expected, AQE switches to broadcast. (3) Skew join optimization — detects partition skew and splits oversized partitions into sub-partitions for parallel processing.
Python UDFs require row-by-row serialization between the JVM (Spark) and a Python process via a socket. Each row is serialized with pickle, sent to Python, processed, and sent back. This serialization overhead makes them 10-100x slower than native Spark functions. The alternative is Pandas UDFs (@pandas_udf) which use Apache Arrow for columnar transfer — entire batches of rows are transferred efficiently, and the UDF operates on a pandas.Series, leveraging vectorized NumPy operations.
Writing DataFrames with many partitions creates one file per partition per writer. A 1000-partition DataFrame written partitioned by date can create 1000 tiny files per date partition. This slows reads (high per-file overhead, especially on object storage with costly list operations). Fixes: (1) coalesce(N) before write where N targets 100-200 MB per file, (2) repartition(column) to consolidate data, (3) Delta Lake OPTIMIZE which compacts small files into larger ones, (4) Spark configuration spark.sql.files.maxRecordsPerFile to control row count per file.
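A short sketch of the write-side mitigations; the coalesce target, record cap, and paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact_writes").getOrCreate()
events_df = spark.read.parquet("s3a://data-lake/raw/events/")   # illustrative input

# Cap rows per output file (a rough proxy for file size)
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5_000_000)

(events_df
    .coalesce(32)                      # fewer, larger output files; avoids a full shuffle
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3a://data-lake/silver/events_compacted/"))

# On Delta Lake, small files can also be compacted after the fact:
# spark.sql("OPTIMIZE delta.`s3a://data-lake/silver/events_delta`")
```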
What does trigger(availableNow=True) do in Structured Streaming? availableNow (Spark 3.3+) processes all available data across potentially multiple batches, then terminates the query. It's ideal for micro-batch patterns where an external scheduler (like Airflow) triggers the streaming job on a schedule. Unlike trigger(once=True), which processes all data in a single batch (risking OOM), availableNow respects maxFilesPerTrigger and processes data in manageable increments while still draining everything available. This gives you streaming's exactly-once semantics with batch-schedule control.
The Kafka → Airflow → PySpark pipeline is the backbone of many modern data platforms. Kafka handles real-time ingestion and event routing. Airflow orchestrates scheduled batch/micro-batch jobs. PySpark performs the heavy transformations. Together, they implement a lambda-like architecture (streaming + batch) or, more commonly today, a micro-batch architecture where Airflow triggers PySpark Structured Streaming jobs to process accumulated Kafka data.
┌──────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────────┐ │
│ │ App APIs │ │ IoT │ │ CDC │ │ Third-Party │ │
│ └────┬─────┘ └────┬────┘ └────┬────┘ └──────┬──────┘ │
└───────┼──────────────┼───────────┼───────────────┼──────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ APACHE KAFKA CLUSTER │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ raw.orders │ │ raw.events │ │ cdc.products │ ... │
│ │ (topic) │ │ (topic) │ │ (topic) │ │
│ └──────┬──────┘ └──────┬───────┘ └───────┬───────┘ │
│ │ Schema Registry manages schemas │ │
└─────────┼────────────────┼───────────────────┼──────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ APACHE AIRFLOW (Orchestrator) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ DAG: daily_pipeline (schedule: @daily) │ │
│ │ │ │
│ │ [sensor] [spark_submit] [spark_submit] │ │
│ │ wait_for_data ──▶ bronze_to_silver ──▶ silver_to_gold │ │
│ │ │ │ │
│ │ [quality_check] │ │
│ │ │ │ │
│ │ [notify_slack] │ │
│ └────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ PYSPARK / DATABRICKS CLUSTER │
│ │
│ ┌────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Bronze │ │ Silver │ │ Gold │ │
│ │ (raw from │───▶│ (cleaned, │───▶│ (aggregated, │ │
│ │ Kafka) │ │ deduped) │ │ modeled) │ │
│ └────────────┘ └──────────────┘ └──────┬───────┘ │
│ │ │
│ Delta Lake / Iceberg Tables │ │
└─────────────────────────────────────────────────┼────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ CONSUMPTION LAYER │
│ ┌───────────┐ ┌──────────┐ ┌────────────┐ ┌────────────┐ │
│ │ BI / │ │ ML │ │ Reverse │ │ Real-time │ │
│ │ Dashboards│ │ Training │ │ ETL │ │ APIs │ │
│ └───────────┘ └──────────┘ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
Data flow: Sources produce events to Kafka topics (with Avro schemas). Airflow runs a daily DAG that (1) verifies data availability in Kafka (sensor), (2) submits a PySpark job to read from Kafka, deduplicate, and write to a Bronze Delta table, (3) submits a second PySpark job to clean, validate, and write to Silver, (4) submits a third job to aggregate into Gold dimension/fact tables, (5) runs quality checks, (6) notifies the team.
import pendulum
from airflow.sdk import DAG, task
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.slack.notifications.slack import send_slack_notification
with DAG(
dag_id="kafka_spark_pipeline",
start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
schedule="@daily",
catchup=False,
default_args={"retries": 2, "retry_delay": pendulum.duration(minutes=5)},
on_failure_callback=[send_slack_notification(
text="Pipeline FAILED: {{ dag.dag_id }} on {{ ds }}",
channel="#data-alerts",
slack_conn_id="slack_default",
)],
) as dag:
@task()
def check_kafka_lag(ds=None):
"""Verify Kafka consumer lag is within acceptable bounds."""
from confluent_kafka.admin import AdminClient
admin = AdminClient({"bootstrap.servers": "broker:9092"})
# Check consumer group lag for the spark consumer group
# Raise AirflowException if lag exceeds threshold
print(f"Kafka lag check passed for {ds}")
bronze_job = SparkSubmitOperator(
task_id="kafka_to_bronze",
application="s3://spark-jobs/kafka_to_bronze.py",
conn_id="spark_default",
application_args=["--date", "{{ ds }}"],
conf={
"spark.sql.adaptive.enabled": "true",
"spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0,io.delta:delta-spark_2.12:3.1.0",
},
executor_memory="4g",
executor_cores=4,
num_executors=8,
)
silver_job = SparkSubmitOperator(
task_id="bronze_to_silver",
application="s3://spark-jobs/bronze_to_silver.py",
conn_id="spark_default",
application_args=["--date", "{{ ds }}"],
conf={"spark.sql.adaptive.enabled": "true"},
executor_memory="8g",
executor_cores=4,
num_executors=12,
)
gold_job = SparkSubmitOperator(
task_id="silver_to_gold",
application="s3://spark-jobs/silver_to_gold.py",
conn_id="spark_default",
application_args=["--date", "{{ ds }}"],
)
@task()
def run_quality_checks(ds=None):
"""Run Great Expectations suite against Gold tables."""
import great_expectations as gx
context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="gold_daily_check")
if not result.success:
raise Exception("Data quality checks failed")
check_kafka_lag() >> bronze_job >> silver_job >> gold_job >> run_quality_checks()
import sys
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.avro.functions import from_avro
spark = SparkSession.builder.appName("kafka_to_bronze").getOrCreate()
run_date = sys.argv[sys.argv.index("--date") + 1]
# Read from Kafka — Structured Streaming with availableNow trigger
kafka_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "broker:9092") \
.option("subscribe", "raw.orders,raw.events") \
.option("startingOffsets", "earliest") \
.option("maxOffsetsPerTrigger", 1000000) \
.load()
# Parse Kafka message: key, value (Avro), topic, partition, offset, timestamp
parsed_df = kafka_df.select(
F.col("topic"),
F.col("partition"),
F.col("offset"),
F.col("timestamp").alias("kafka_timestamp"),
F.col("key").cast("string").alias("message_key"),
F.col("value").cast("string").alias("message_value"),
F.lit(run_date).alias("load_date"),
)
# Write to Delta Bronze table — streaming with checkpointing.
# The checkpoint path must be stable across runs: a per-run path (e.g. keyed by
# run_date) would make Spark re-read from startingOffsets and duplicate data.
query = parsed_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3a://checkpoints/bronze/kafka_events") \
.option("mergeSchema", "true") \
.trigger(availableNow=True) \
.partitionBy("topic", "load_date") \
.start("s3a://data-lake/bronze/kafka_events/")
query.awaitTermination()
spark.stop()
Common integration pitfalls:
- Unstable checkpoints: when using availableNow in an Airflow-triggered pattern, the checkpoint location must be stable (same path for the same logical job). Creating a new checkpoint per DAG run causes Spark to re-read from startingOffsets on each run, leading to duplicate processing. Use a persistent checkpoint and only reset it during disaster recovery.
- Monitoring with the wrong tool: don't use kafka-consumer-groups.sh to monitor Spark's progress — it won't show correct lag. Use Spark's streaming query progress listener or check the checkpoint's offsets/ directory.
- Schema drift: enable mergeSchema=true on Delta writes (for additive changes), and ensure downstream Spark jobs handle new columns gracefully (use try_cast or schema-on-read patterns).
- Empty runs: the availableNow trigger will simply process nothing if no new data exists. Make the Airflow task idempotent and let it complete with zero records processed as a valid outcome.

| Pattern | Kafka + Airflow + PySpark | Kafka + Flink | ADF + Synapse | Databricks end-to-end |
|---|---|---|---|---|
| Ingestion | Kafka (streaming) | Kafka (streaming) | ADF copy activity | Auto Loader / DLT |
| Orchestration | Airflow (flexible) | Flink jobs (self-contained) | ADF pipelines | Databricks Workflows |
| Transformation | PySpark (batch/micro-batch) | Flink (true streaming) | Mapping data flows / Synapse SQL | PySpark / SQL / DLT |
| Latency | Minutes (micro-batch) | Milliseconds to seconds | Minutes (batch) | Seconds–minutes |
| Complexity | High (3 systems to manage) | Medium (1 system for stream) | Low (Azure managed) | Medium (single platform) |
| Best for | Flexible multi-system, team knows Spark | Low-latency streaming at scale | Azure shops, less code | All-in on Databricks lakehouse |
Spark Structured Streaming tracks offsets in its own checkpoint directory (written to HDFS/S3), not in Kafka's __consumer_offsets topic. Spark reads from Kafka using assign mode internally (even when you use subscribe), bypassing the consumer group protocol. Therefore, Kafka CLI tools like kafka-consumer-groups.sh will show no committed offsets for Spark's group.id (or an incorrect/stale value). Monitor progress via query.lastProgress or the Spark UI's Structured Streaming tab.
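A small monitoring sketch using the StreamingQuery handle (here assumed to be the query started by the kafka_to_bronze job above) rather than Kafka's consumer-group tooling:

```python
import json

# Assuming `query` is the StreamingQuery returned by writeStream.start() in kafka_to_bronze
while query.isActive:
    progress = query.lastProgress            # None until the first micro-batch completes
    if progress:
        src = progress["sources"][0]
        print(f"batch={progress['batchId']} rows={progress['numInputRows']}")
        print(json.dumps({"start": src["startOffset"], "end": src["endOffset"]}, indent=2))
    query.awaitTermination(timeout=30)       # returns after 30s or when the query stops
```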
Splitting into separate tasks provides: (1) Independent retryability — if silver-to-gold fails, you retry only that step, not re-reading all data from Kafka. (2) Observability — Airflow shows per-stage success/failure, duration, and logs. (3) Resource optimization — each stage can request different executor memory/cores. (4) Checkpointing — Delta tables between stages act as durable checkpoints. The trade-off is increased overhead from multiple Spark context startups, but this is minor compared to reprocessing costs.
Use a multi-layer defense: (1) Schema Registry enforces BACKWARD compatibility — consumers with the old schema can still read new messages (the new field is ignored or has a default). (2) On the Spark side, enable mergeSchema=true on Delta writes so the new column is added automatically. (3) Downstream queries should use explicit SELECT lists (not SELECT *) or handle the new column with coalesce(new_col, default). (4) Update dbt models or Gold transformations to incorporate the new field.
What is the role of trigger(availableNow=True) in bridging streaming and batch orchestration? availableNow (Spark 3.3+) tells the streaming query to process all data that is currently available, across multiple micro-batches if needed, and then terminate. This allows Airflow to submit a Spark streaming job like a batch job — it runs, processes everything since the last checkpoint, and exits. It preserves streaming's checkpointing semantics (exactly-once offsets) while fitting into a batch orchestration model. Unlike trigger(once=True), it respects rate limits like maxFilesPerTrigger to avoid OOM on backfill.
Three-tool stack advantages: Vendor flexibility (run Kafka, Airflow, Spark on any cloud or on-prem), fine-grained control over each component, ability to swap individual pieces (e.g., replace Airflow with Prefect). Disadvantages: Higher operational burden (three systems to deploy, monitor, upgrade), more integration complexity (schema mismatches, checkpoint management), slower development loop. Databricks advantages: Single platform, integrated Delta Live Tables + Unity Catalog + Workflows, better developer experience for lakehouse patterns. Databricks disadvantages: Vendor lock-in, less flexibility for non-Spark workloads, higher cost at scale.