DE Study Guide

Kafka, Airflow & PySpark

The modern streaming-and-batch pipeline stack. Covers each tool individually, then how they compose into an end-to-end data pipeline from ingestion through transformation to warehouse sink.

Topics: Consumer groups, exactly-once semantics, DAG best practices, TaskFlow API, Spark partitioning, broadcast joins, pipeline orchestration patterns

Azure Data Factory

Microsoft's cloud-native ETL/ELT orchestration service. Covers pipelines, data flows, integration runtimes, and how ADF fits into broader Azure data architectures.

Topics: Mapping data flows, linked services, self-hosted IR, parameterized pipelines, CI/CD with ADF, monitoring, cost optimization

Databricks

The unified analytics platform built on Apache Spark. Covers Delta Lake, Unity Catalog, Workflows, and the lakehouse paradigm that merges warehouse and lake patterns.

Topics: Delta Lake internals, Z-ordering, liquid clustering, Unity Catalog governance, Structured Streaming, Databricks Workflows, Photon engine

Data Engineering Concepts

Foundational modeling and architecture patterns every data engineer must know. Covers Kimball, Data Vault 2.0, and modern lakehouse patterns in depth.

Topics: Star schema, SCD Types 1/2/3, hubs-links-satellites, medallion architecture, Delta Lake / Iceberg, OBT (one big table) trade-offs

dbt (Data Build Tool)

The standard transformation layer for modern ELT pipelines. Covers project structure, materializations, incremental models, testing, Jinja macros, the semantic layer, and model contracts.

Topics: ref() & source(), incremental strategies, snapshots (SCD2), unit tests, dbt packages, MetricFlow semantic layer, CI/CD with slim builds, model governance

Enterprise Data Engineering

The non-technical and semi-technical skills that separate senior enterprise data engineers from mid-level ones. Covers governance, FinOps, migrations, incident response, and stakeholder management.

Topics: Team models (hub-and-spoke, data mesh), GDPR/HIPAA/SOX compliance, PII handling, CI/CD for data, cost management, SLAs, runbooks, ADRs, migration patterns

Redis

The in-memory data structure store used for caching, rate limiting, real-time leaderboards, and stream processing. Covers data structures, persistence, clustering, and Streams.

Topics: String/Hash/Set/ZSet/Stream data types, RDB vs AOF persistence, eviction policies, Redis Sentinel vs Cluster, consumer groups, Lua scripts, cache-aside pattern

Postgres & pgvector

PostgreSQL as a production data platform — covering advanced indexing, partitioning, MVCC, JSONB, and the pgvector extension for vector similarity search and RAG pipelines.

Topics: EXPLAIN ANALYZE, index types (B-tree/GIN/BRIN/covering), HNSW vs IVFFlat, hybrid RRF search, range partitioning, MVCC/VACUUM, JSONB operators

Terraform

Infrastructure as Code for data platforms. Covers HCL fundamentals, state management, modules, provider ecosystems, and GitOps workflows for cloud data infrastructure.

Topics: HCL syntax, state files & backends, modules, workspaces, data sources, lifecycle meta-arguments, Terragrunt DRY patterns, CI/CD with Atlantis

LangChain

A widely used Python framework for composing LLM applications. Covers LCEL, RAG pipelines, ReAct agents, memory management, and structured output for data engineering workflows.

Topics: LCEL pipe operator, ChatPromptTemplate, RAG (Chroma/FAISS), RetrievalQA, MMR retrieval, ConversationSummaryMemory, ReAct agent pattern, Pydantic structured output

CrewAI

Multi-agent orchestration framework built on role-based personas. Covers agents, tasks, crews, sequential vs hierarchical execution, memory systems, and custom tool development.

Topics: Agent/Task/Crew primitives, Process.sequential vs hierarchical, long-term SQLite memory, BaseTool with Pydantic schema, delegation patterns, multi-agent data pipelines

Google ADK

Google's open-source Agent Development Kit (ADK) for building and deploying AI agents on GCP. Covers LlmAgent, tool development, SequentialAgent workflows, and Vertex AI Agent Engine deployment.

Topics: LlmAgent with function tools, SequentialAgent/ParallelAgent/LoopAgent, session state, output_key data flow, ADK eval framework, Vertex AI Agent Engine deployment

Model Context Protocol (MCP)

The open standard for connecting AI models to external data sources and tools. Covers the host/client/server architecture, Resources, Tools, Prompts, Sampling, and building data warehouse MCP servers.

Topics: JSON-RPC 2.0, stdio vs HTTP+SSE transport, FastMCP Python SDK, Resources vs Tools vs Prompts, Sampling primitive, prompt injection security, Claude Desktop config

Vertex AI

Google Cloud's unified ML platform covering the full MLOps lifecycle — from AutoML and custom training through Vertex Pipelines, Feature Store, batch prediction, and model monitoring.

Topics: CustomTrainingJob, KFP v2 pipeline components, batch prediction from BigQuery, Feature Store online/offline serving, Model Registry, training-serving skew, Model Monitoring

API Contracts

Formal agreements between API producers and consumers. Covers OpenAPI 3.1, Avro schema evolution, consumer-driven contract testing with Pact, AsyncAPI for event-driven systems, and breaking-change detection in CI.

Topics: Schema-first design, backward/forward compatibility, breaking vs non-breaking changes, versioning strategies, Pact CDCT, Confluent Schema Registry, oasdiff CI gates

LangGraph

Low-level graph orchestration library for stateful, multi-actor LLM applications. Covers the state graph model, checkpointers for persistence, human-in-the-loop interrupts, parallel fan-out with Send, and production deployment patterns.

Topics: StateGraph, reducers, conditional edges, MemorySaver/PostgresSaver, interrupt_before, Send fan-out, astream_events, ReAct loop, subgraphs

How to Use These Guides

Read the summary and index to orient yourself on each topic.
Work through Core Concepts — these map to the depth interviewers expect.
Study the Code Examples — they use real-world patterns, not toy demos.
Review Gotchas & Anti-patterns — these signal hands-on experience in interviews.
Try the Exercises — building something cements understanding better than reading.
Test yourself with the Quiz — answers are hidden behind toggles so you can self-assess.
Bookmark Further Reading for deeper dives after your initial review.

Prerequisites

These guides assume you are comfortable with Python and basic SQL. They skip "what is a variable" basics and jump straight into data-engineering-specific patterns, trade-offs, and production concerns. If you need a refresher on Python fundamentals, do that first.

Guide Structure

Every study guide follows the same structure so you can build a consistent study rhythm:

Summary — what it is and why it matters
Table of Contents — anchor-linked for quick navigation
Core Concepts (4–6 topics at intermediate depth)
Industry Use Cases (3+ real scenarios)
Code Examples (3–5 production-style snippets)
Comparison / When to Use (trade-off tables vs. alternatives)
Common Gotchas & Anti-patterns
Hands-on Exercises (3 tasks)
Interview Quiz (5 Q&A with hidden answers)
Further Reading (official docs, blogs, papers)

Data Engineering Interview Study Guides