Data Engineering Interview Study Guides

Fifteen focused study guides covering the most-asked topics in modern data engineering interviews, from core pipeline tools and cloud platforms to AI/ML infrastructure and LLM frameworks. Each guide is designed to be completed in roughly one day and targets intermediate-level practitioners who already know Python and SQL.

Kafka, Airflow & PySpark

The modern streaming-and-batch pipeline stack. Covers each tool individually, then how they compose into an end-to-end data pipeline from ingestion through transformation to warehouse sink.

Topics: Consumer groups, exactly-once semantics, DAG best practices, TaskFlow API, Spark partitioning, broadcast joins, pipeline orchestration patterns

Azure Data Factory

Microsoft's cloud-native ETL/ELT orchestration service. Covers pipelines, data flows, integration runtimes, and how ADF fits into broader Azure data architectures.

Topics: Mapping data flows, linked services, self-hosted IR, parameterized pipelines, CI/CD with ADF, monitoring, cost optimization

Databricks

The unified analytics platform built on Apache Spark. Covers Delta Lake, Unity Catalog, Workflows, and the lakehouse paradigm that merges warehouse and lake patterns.

Topics: Delta Lake internals, Z-ordering, liquid clustering, Unity Catalog governance, structured streaming, Databricks Workflows, Photon engine

Data Engineering Concepts

Foundational modeling and architecture patterns every data engineer must know. Covers Kimball, Data Vault 2.0, and modern lakehouse patterns in depth.

Topics: Star schema, SCD Types 1/2/3, hubs-links-satellites, medallion architecture, Delta Lake / Iceberg, OBT trade-offs
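
As a quick illustration of the Type 2 slowly changing dimension idea listed above, here is a minimal in-memory sketch. The row layout (`key`, `attrs`, `valid_from`, `valid_to`) and the function name `apply_scd2` are assumptions made for this example, not a schema prescribed by the guide.

```python
from datetime import date


def apply_scd2(dimension, key, new_attrs, effective):
    """Expire the current row for `key` if its attributes changed, then add a new current row."""
    # The current version of a key is the row whose valid_to is still open (None).
    current = next(
        (row for row in dimension if row["key"] == key and row["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension  # nothing changed, so history stays as it is
        current["valid_to"] = effective  # close out the old version
    dimension.append(
        {"key": key, "attrs": dict(new_attrs), "valid_from": effective, "valid_to": None}
    )
    return dimension


dim = []
apply_scd2(dim, "cust-1", {"city": "Oslo"}, date(2024, 1, 1))
apply_scd2(dim, "cust-1", {"city": "Bergen"}, date(2024, 6, 1))
```

After the second call, `dim` holds two rows for `cust-1`: the Oslo row closed on 2024-06-01 and an open Bergen row. A Type 1 change would instead overwrite the attributes in place and keep no history.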

dbt (data build tool)

The standard transformation layer for modern ELT pipelines. Covers project structure, materializations, incremental models, testing, Jinja macros, the semantic layer, and model contracts.

Topics: ref() & source(), incremental strategies, snapshots (SCD2), unit tests, dbt packages, MetricFlow semantic layer, CI/CD with slim builds, model governance

Enterprise Data Engineering

The non-technical and semi-technical skills that separate senior enterprise DEs from mid-level ones. Covers governance, FinOps, migrations, incident response, and stakeholder management.

Topics: Team models (hub-and-spoke, data mesh), GDPR/HIPAA/SOX compliance, PII handling, CI/CD for data, cost management, SLAs, runbooks, ADRs, migration patterns

Redis

The in-memory data structure store used for caching, rate limiting, real-time leaderboards, and stream processing. Covers data structures, persistence, clustering, and Streams.

Topics: String/Hash/Set/ZSet/Stream data types, RDB vs AOF persistence, eviction policies, Redis Sentinel vs Cluster, consumer groups, Lua scripts, cache-aside pattern
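
The cache-aside pattern named above can be sketched without a running Redis server. In this example a plain dictionary stands in for Redis, and `CacheAside`, `slow_lookup`, and the TTL value are hypothetical names chosen for illustration. With a real client the dictionary reads and writes would become GET and SET-with-expiry calls.

```python
import time


class CacheAside:
    """Read-through helper: check the cache first, fall back to the source on a miss."""

    def __init__(self, load_from_source, ttl_seconds=60.0):
        self._load = load_from_source
        self._ttl = ttl_seconds
        self._store = {}  # stand-in for Redis: key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]  # cache hit
        value = self._load(key)  # cache miss: read the source of truth
        self._store[key] = (value, time.monotonic() + self._ttl)
        return value

    def invalidate(self, key):
        # Call this after writing to the source so the next read reloads fresh data.
        self._store.pop(key, None)


calls = []


def slow_lookup(key):
    calls.append(key)  # record how often the source is actually hit
    return key.upper()


cache = CacheAside(slow_lookup)
first = cache.get("user:1")
second = cache.get("user:1")
```

Only the first `get` reaches `slow_lookup`; the second is served from the cache until the entry expires or is invalidated.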

Postgres & pgvector

PostgreSQL as a production data platform. Covers advanced indexing, partitioning, MVCC, JSONB, and the pgvector extension for vector similarity search and RAG pipelines.

Topics: EXPLAIN ANALYZE, index types (B-tree/GIN/BRIN/covering), HNSW vs IVFFlat, hybrid RRF search, range partitioning, MVCC/VACUUM, JSONB operators
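
For the hybrid RRF search topic, the fusion step itself is small enough to sketch in plain Python. The document IDs and the two input rankings below are made up for illustration, and `k=60` is a commonly used default rather than a value taken from the guide. In practice the two lists would come from a full-text query and a pgvector similarity query.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical full-text ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]   # hypothetical vector-similarity ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Documents that appear near the top of both lists rise to the top of the fused ranking, which is why `doc_b` and `doc_a` outrank the documents found by only one retriever.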

Terraform

Infrastructure as Code for data platforms. Covers HCL fundamentals, state management, modules, provider ecosystems, and GitOps workflows for cloud data infrastructure.

Topics: HCL syntax, state files & backends, modules, workspaces, data sources, lifecycle meta-arguments, Terragrunt DRY patterns, CI/CD with Atlantis

LangChain

A widely used Python framework for composing LLM applications. Covers LCEL, RAG pipelines, ReAct agents, memory management, and structured output for data engineering workflows.

Topics: LCEL pipe operator, ChatPromptTemplate, RAG (Chroma/FAISS), RetrievalQA, MMR retrieval, ConversationSummaryMemory, ReAct agent pattern, Pydantic structured output

CrewAI

Multi-agent orchestration framework built on role-based personas. Covers agents, tasks, crews, sequential vs hierarchical execution, memory systems, and custom tool development.

Topics: Agent/Task/Crew primitives, Process.sequential vs hierarchical, long-term SQLite memory, BaseTool with Pydantic schema, delegation patterns, multi-agent data pipelines

Google ADK

Google's open-source Agent Development Kit for building and deploying AI agents on GCP. Covers LlmAgent, tool development, SequentialAgent workflows, and Vertex AI Agent Engine deployment.

Topics: LlmAgent with function tools, SequentialAgent/ParallelAgent/LoopAgent, session state, output_key data flow, ADK eval framework, Vertex AI Agent Engine deployment

Model Context Protocol (MCP)

The open standard for connecting AI models to external data sources and tools. Covers the host/client/server architecture, Resources, Tools, Prompts, Sampling, and building data warehouse MCP servers.

Topics: JSON-RPC 2.0, stdio vs HTTP+SSE transport, FastMCP Python SDK, Resources vs Tools vs Prompts, Sampling primitive, prompt injection security, Claude Desktop config

Vertex AI

Google Cloud's unified ML platform. Covers the full MLOps lifecycle, from AutoML and custom training through Vertex Pipelines, Feature Store, batch prediction, and model monitoring.

Topics: CustomTrainingJob, KFP v2 pipeline components, batch prediction from BigQuery, Feature Store online/offline serving, Model Registry, training-serving skew, Model Monitoring

API Contracts

Formal agreements between API producers and consumers. Covers OpenAPI 3.1, Avro schema evolution, consumer-driven contract testing with Pact, AsyncAPI for event-driven systems, and breaking-change detection in CI.

Topics: Schema-first design, backward/forward compatibility, breaking vs non-breaking changes, versioning strategies, Pact CDCT, Confluent Schema Registry, oasdiff CI gates
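
To make the backward-compatibility topic concrete, here is a deliberately simplified checker in the spirit of Avro's rules: a new (reader) schema can read old data if every field it adds carries a default and no shared field changes type. The field-dictionary layout and the function name are assumptions for this sketch. Real Avro schema resolution also allows type promotions and aliases, which are ignored here.

```python
def backward_incompatibilities(old_fields, new_fields):
    """List reasons the new schema could not read records written with the old one."""
    problems = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old records lack this field, so the reader needs a default to fill in.
            if "default" not in spec:
                problems.append(f"added field '{name}' has no default")
        elif old_fields[name]["type"] != spec["type"]:
            problems.append(f"field '{name}' changed type")
    # Fields dropped from the new schema are simply ignored when reading old data.
    return problems


v1 = {"id": {"type": "long"}, "email": {"type": "string"}}
v2_ok = {**v1, "plan": {"type": "string", "default": "free"}}
v2_bad = {**v1, "plan": {"type": "string"}}

ok_report = backward_incompatibilities(v1, v2_ok)
bad_report = backward_incompatibilities(v1, v2_bad)
```

Here `ok_report` is empty, while `bad_report` flags the `plan` field. A check along these lines is the kind of gate a schema registry or CI step would apply before a producer ships a new schema version.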

How to Use These Guides

Prerequisites

These guides assume you are comfortable with Python and basic SQL. They skip "what is a variable" basics and jump straight into data-engineering-specific patterns, trade-offs, and production concerns. If you need a refresher on Python fundamentals, do that first.

Guide Structure

Every study guide follows the same structure so you can build a consistent study rhythm: