Fifteen focused study guides covering the most-asked topics in modern data engineering interviews — from core pipeline tools and cloud platforms to AI/ML infrastructure and LLM frameworks. Each guide is designed to be completed in roughly one day and targets intermediate-level practitioners who already know Python and SQL.
The modern streaming-and-batch pipeline stack. Covers each tool individually, then how they compose into an end-to-end data pipeline from ingestion through transformation to warehouse sink.
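For a taste of the ingestion end of that stack, here is a minimal consume-transform-sink loop, assuming kafka-python and a local broker; the topic and field names are illustrative:

```python
import json

from kafka import KafkaConsumer  # kafka-python

# Consume raw events, filter in flight, and hand off to a warehouse sink.
consumer = KafkaConsumer(
    "raw_events",                      # illustrative topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    if event.get("status") == "completed":  # trivial transform step
        print(event)                        # stand-in for the warehouse write
```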
Microsoft's cloud-native ETL/ELT orchestration service. Covers pipelines, data flows, integration runtimes, and how ADF fits into broader Azure data architectures.
The unified analytics platform built on Apache Spark. Covers Delta Lake, Unity Catalog, Workflows, and the lakehouse paradigm that merges warehouse and lake patterns.
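For orientation, a minimal Delta Lake round trip in PySpark — a sketch that assumes the delta-spark package is installed; the path and data are illustrative:

```python
from pyspark.sql import SparkSession

# Assumes delta-spark is installed so the Delta extension classes resolve.
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "signup"), (2, "purchase")], ["id", "event"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Delta gives ACID reads over the same files; time travel and MERGE build on this.
spark.read.format("delta").load("/tmp/delta/events").show()
```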
Foundational modeling and architecture patterns every data engineer must know. Covers Kimball, Data Vault 2.0, and modern lakehouse patterns in depth.
The standard transformation layer for modern ELT pipelines. Covers project structure, materializations, incremental models, testing, Jinja macros, the semantic layer, and model contracts.
The non-technical and semi-technical skills that separate senior enterprise DEs from mid-level ones. Covers governance, FinOps, migrations, incident response, and stakeholder management.
The in-memory data structure store used for caching, rate limiting, real-time leaderboards, and stream processing. Covers data structures, persistence, clustering, and Streams.
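A small sketch of two of those patterns, rate limiting and Streams, using redis-py against a local instance; key names and limits are illustrative:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def allow_request(user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    """Fixed-window rate limiter: INCR a per-user counter, expire it once."""
    key = f"rate:{user_id}"
    count = r.incr(key)           # atomic increment
    if count == 1:
        r.expire(key, window_s)   # start the window on the first hit
    return count <= limit

# Streams: append an entry, then read from the beginning of the stream.
r.xadd("events", {"type": "click", "user": "u42"})
entries = r.xread({"events": "0-0"}, count=10)
```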
PostgreSQL as a production data platform — covering advanced indexing, partitioning, MVCC, JSONB, and the pgvector extension for vector similarity search and RAG pipelines.
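As a flavor of the pgvector material, a nearest-neighbor lookup via psycopg2; the table, column, and embedding are illustrative, and a real RAG pipeline would produce the query vector with an embedding model:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
query_embedding = "[0.12, -0.03, 0.88]"  # illustrative; normally model output

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, content
        FROM documents
        ORDER BY embedding <-> %s::vector  -- <-> is L2 distance, <=> is cosine
        LIMIT 5
        """,
        (query_embedding,),
    )
    neighbors = cur.fetchall()
```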
Infrastructure as Code for data platforms. Covers HCL fundamentals, state management, modules, provider ecosystems, and GitOps workflows for cloud data infrastructure.
The leading Python framework for composing LLM applications. Covers LCEL, RAG pipelines, ReAct agents, memory management, and structured output for data engineering workflows.
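A minimal LCEL chain, the composition primitive that guide builds on; this sketch assumes langchain-openai and an API key in the environment, and the model name and prompt are illustrative:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# LCEL: the | operator pipes prompt -> model -> parser into one runnable.
prompt = ChatPromptTemplate.from_template(
    "Describe this table schema in one sentence: {schema}"
)
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

print(chain.invoke({"schema": "events(id BIGINT, ts TIMESTAMP, payload JSONB)"}))
```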
Multi-agent orchestration framework built on role-based personas. Covers agents, tasks, crews, sequential vs hierarchical execution, memory systems, and custom tool development.
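The shape of those core abstractions, as a sketch: one agent, one task, a sequential crew. Role text and inputs are illustrative:

```python
from crewai import Agent, Crew, Process, Task

analyst = Agent(
    role="Data Quality Analyst",
    goal="Flag anomalies in pipeline run summaries",
    backstory="A meticulous reviewer of daily load metrics.",
)

review = Task(
    description="Review this load summary and flag anomalies: {summary}",
    expected_output="A short list of anomalies, or 'none found'.",
    agent=analyst,
)

crew = Crew(agents=[analyst], tasks=[review], process=Process.sequential)
result = crew.kickoff(inputs={"summary": "orders: 0 rows loaded (avg 1.2M/day)"})
```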
Google's open-source Agent Development Kit for building and deploying AI agents on GCP. Covers LlmAgent, tool development, SequentialAgent workflows, and Vertex AI Agent Engine deployment.
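A minimal LlmAgent definition in the shape ADK's quickstart uses; the model string is illustrative, and `get_row_count` is a hypothetical plain Python function that ADK wraps as a tool:

```python
from google.adk.agents import LlmAgent

def get_row_count(table: str) -> dict:
    """Hypothetical tool: ADK exposes plain functions like this to the agent."""
    return {"table": table, "rows": 1_200_000}

root_agent = LlmAgent(
    name="warehouse_assistant",
    model="gemini-2.0-flash",  # illustrative model id
    instruction="Answer questions about warehouse tables using your tools.",
    tools=[get_row_count],
)
```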
The open standard for connecting AI models to external data sources and tools. Covers the host/client/server architecture, Resources, Tools, Prompts, Sampling, and building data warehouse MCP servers.
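A toy warehouse MCP server using the official Python SDK's FastMCP helper; the tool body is stubbed and the counts are fake:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("warehouse")

@mcp.tool()
def row_count(table: str) -> int:
    """Return the row count for a warehouse table (stubbed for illustration)."""
    fake_counts = {"orders": 1_200_000, "customers": 85_000}
    return fake_counts.get(table, 0)

if __name__ == "__main__":
    mcp.run()  # serves the tool to MCP clients over stdio by default
```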
Google Cloud's unified ML platform covering the full MLOps lifecycle — from AutoML and custom training through Vertex Pipelines, Feature Store, batch prediction, and model monitoring.
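One slice of that lifecycle, batch prediction against an already-registered model, sketched with the google-cloud-aiplatform SDK; the project, model ID, and bucket paths are illustrative:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Reference a registered model and score a folder of JSONL inputs.
model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890"
)
job = model.batch_predict(
    job_display_name="nightly-scoring",
    gcs_source="gs://my-bucket/inputs/*.jsonl",
    gcs_destination_prefix="gs://my-bucket/outputs/",
)
job.wait()
```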
Formal agreements between API producers and consumers. Covers OpenAPI 3.1, Avro schema evolution, consumer-driven contract testing with Pact, AsyncAPI for event-driven systems, and breaking-change detection in CI.
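To ground the Avro piece: adding a field with a default is the canonical backward-compatible evolution, because readers of old records fall back to the default. A sketch with fastavro; the record shape is illustrative:

```python
from fastavro import parse_schema

# v2 of an Order schema: the new field carries a default, so data written
# with v1 (no currency) can still be read with v2. Dropping the default
# would be a breaking change that contract checks in CI should catch.
schema_v2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}
parsed = parse_schema(schema_v2)
```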
These guides assume you are comfortable with Python and basic SQL. They skip "what is a variable" basics and jump straight into data-engineering-specific patterns, trade-offs, and production concerns. If you need a refresher on Python fundamentals, do that first.
Every study guide follows the same structure so you can build a consistent study rhythm: