Redis

Summary

Redis 7+ is an in-memory data structure server most commonly deployed as a cache, message broker, session store, and real-time analytics engine. Its single-threaded event loop delivers sub-millisecond latency at hundreds of thousands of operations per second on commodity hardware. For data engineers, Redis shows up in three primary roles: a cache layer in front of slow queries (BI dashboards, API responses), a rate-limiting and deduplication store in streaming pipelines, and a fast lookup table for enriching stream events. Redis Stack (built on Redis 7.x) additionally bundles modules for vector similarity search (RediSearch), JSON documents (RedisJSON), and probabilistic data structures (RedisBloom).

This guide targets Redis 7.4. All concepts also apply to Valkey 7.x (the community fork) and AWS ElastiCache / Azure Cache for Redis managed offerings.

Table of Contents

Core Concepts

Data Structures

Redis is not just a key-value store — each key maps to a typed data structure. Choosing the right type avoids O(n) operations and excess memory.

| Type | Key Operations | DE Use Case |
| --- | --- | --- |
| String | SET, GET, INCR, SETNX, GETSET | Cache serialized JSON, feature flags, counters |
| Hash | HSET, HGETALL, HINCRBY | Dimensional lookup tables (user profile, product metadata) |
| List | LPUSH/RPUSH, LRANGE, BRPOP | Simple task queues, recent-events ring buffers |
| Set | SADD, SCARD, SINTERSTORE | Deduplication (seen event IDs), unique visitor counts |
| Sorted Set (ZSet) | ZADD, ZRANGEBYSCORE, ZREVRANK | Leaderboards, sliding-window rate limits, scheduled jobs |
| Stream | XADD, XREADGROUP, XACK | Event bus with consumer groups, log aggregation |
| HyperLogLog | PFADD, PFCOUNT | Approximate unique count (DAU, distinct queries) at 12 KB/key |

Persistence: RDB vs AOF

Redis is in-memory by default — a restart loses all data unless persistence is configured. RDB writes point-in-time snapshots: compact files and fast restarts, but everything since the last snapshot can be lost. AOF logs every write command: durable to within about a second with appendfsync everysec, at the cost of larger files and slower restarts. Many production setups enable both, using the AOF for recovery and RDB for backups.
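A minimal redis.conf persistence setup might look like the following (illustrative values; tune snapshot intervals and the fsync policy for your workload):

```conf
# RDB: point-in-time snapshots (compact, fast restart, may lose recent writes)
save 900 1             # snapshot if >= 1 key changed in 900 s
save 300 10            # snapshot if >= 10 keys changed in 300 s
dbfilename dump.rdb

# AOF: append-only log of write commands (more durable, larger files)
appendonly yes
appendfsync everysec   # fsync once per second: at most ~1 s of writes lost
```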

Expiry & Eviction

Key expiry (EXPIRE key seconds / SET key value EX seconds) removes keys after a TTL. Redis uses a lazy expiry check (deleted on access) plus a background sampler that periodically evicts expired keys. With many expiring keys, set jitter on TTLs to prevent thundering-herd cache stampedes.
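One simple way to add that jitter is to randomize each TTL within a band (a sketch; the ±10% band is an arbitrary choice):

```python
import random

def jittered_ttl(base_seconds: int, jitter_pct: float = 0.10) -> int:
    """Spread TTLs across [base*(1-pct), base*(1+pct)] so hot keys don't expire together."""
    delta = int(base_seconds * jitter_pct)
    return base_seconds + random.randint(-delta, delta)

# e.g. r.set(key, value, ex=jittered_ttl(300))
```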

Eviction policies (set via maxmemory-policy) control what Redis deletes when maxmemory is reached:

- noeviction (default): new writes fail with an error; nothing is evicted
- allkeys-lru / volatile-lru: evict least-recently-used keys (all keys, or only keys with a TTL)
- allkeys-lfu / volatile-lfu: evict least-frequently-used keys
- allkeys-random / volatile-random: evict random keys
- volatile-ttl: evict keys with a TTL, shortest remaining TTL first

For a pure cache, allkeys-lru is the usual choice; for mixed cache-plus-state workloads, the volatile-* policies evict only keys you have explicitly marked expirable.

Replication, Sentinel & Cluster

Redis supports three deployment topologies:

- Standalone: one primary, optionally with read replicas (REPLICAOF); no automatic failover.
- Sentinel: separate Sentinel processes monitor the primary and, on failure, promote a replica by quorum vote; all data still lives on a single primary.
- Cluster: data sharded across 16,384 hash slots over multiple primaries, each with its own replicas; automatic failover plus horizontal write scaling.

Redis Streams

Streams (XADD / XREADGROUP) are an append-only log data structure — conceptually similar to a Kafka topic partition but entirely in-memory. Key concepts:

- Entry IDs: monotonically increasing <ms>-<seq> identifiers assigned by XADD (or supplied explicitly).
- Consumer groups: created with XGROUP CREATE; within a group, each entry is delivered to exactly one consumer.
- PEL (Pending Entries List): per-group bookkeeping of delivered-but-unacknowledged entries; XACK removes an entry from it.
- Trimming: MAXLEN / MINID keep the in-memory log bounded.

Pub/Sub vs Streams

| Feature | Pub/Sub | Streams |
| --- | --- | --- |
| Delivery | Fire-and-forget; offline subscribers miss messages | Persisted log; consumers can replay |
| Consumer groups | ❌ All subscribers receive all messages | ✅ Each consumer group processes each entry once |
| At-least-once delivery | ❌ | ✅ via ACK + PEL redelivery |
| Backpressure | None | MAXLEN trimming |
| Use when | Real-time broadcast (live dashboards, notifications) | Reliable event processing (DE pipelines, audit logs) |
↑ Back to top

Industry Use Cases

Cache-Aside for BI Dashboard Acceleration

A Spark-based data platform generates complex summary statistics on demand. Each query takes 3–8 seconds on the warehouse. A Redis cache with allkeys-lru eviction stores serialized JSON results keyed by a hash of the query parameters with a 5-minute TTL. Cache hit rate reaches 85% within 10 minutes of peak traffic, reducing warehouse compute costs by 60%. Cache stampedes on TTL expiry are mitigated by a Lua-based lock that lets only one caller refresh the value.
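One way to derive the query-parameter hash key mentioned above (a sketch; the sha256 choice, the 16-character truncation, and the `dash:` prefix are assumptions):

```python
import hashlib
import json

def query_cache_key(params: dict, prefix: str = "dash") -> str:
    """Deterministic cache key: the same params, in any order, yield the same key."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode()).hexdigest()[:16]
    return f"{prefix}:{digest}"
```

Canonicalizing with sort_keys=True matters: otherwise two requests with identical parameters in different dict order would miss each other's cache entries.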

Sliding-Window Rate Limiting for API Pipelines

A data ingestion API accepts events from thousands of IoT devices. To prevent abuse, per-device rate limiting is implemented using a Redis Sorted Set: device events are recorded as ZADD device:{id}:events NX <timestamp> <event_id>. The window is then trimmed (ZREMRANGEBYSCORE) and counted (ZCARD) in a single Lua script, ensuring atomicity without distributed locks.

Stream-Based Microservice Event Bus

Five microservices (ingest, validate, enrich, aggregate, export) form a pipeline that processes field sensor data. Each stage publishes to the next service's Redis Stream with consumer groups. Slow stages accumulate a backlog visible in the stream length. Dead-letter handling: entries unacknowledged for >30 seconds are moved to a dlq stream by a background reclaimer that uses XAUTOCLAIM. The total pipeline latency is under 200ms at 50K events/second.
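The reclaimer's routing decision can be factored into a pure policy function (a sketch; the thresholds and return labels are illustrative):

```python
def route_pending(idle_ms: int, delivery_count: int,
                  max_idle_ms: int = 30_000, max_retries: int = 3) -> str:
    """Decide the fate of a PEL entry reported by XPENDING / XAUTOCLAIM."""
    if idle_ms < max_idle_ms:
        return "leave"        # consumer may still be working on it
    if delivery_count >= max_retries:
        return "dead-letter"  # retries exhausted: move to the dlq stream
    return "reclaim"          # hand to a healthy consumer via XAUTOCLAIM
```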

Sorted Set Leaderboard for Real-Time Analytics

A gaming analytics dashboard shows "top 100 players by score in the last hour". When a game event arrives, ZINCRBY leaderboard:hourly <delta> <player_id> updates the score atomically. ZREVRANGE leaderboard:hourly 0 99 WITHSCORES retrieves the top 100 in O(log n + 100) time. A background job shifts the window by creating a new key every hour and using EXPIRE on the old one. This pattern requires no separate aggregation pass.

HyperLogLog for Approximate Unique Visitor Counting

A web analytics pipeline tracks unique page views per URL per day. Using PFADD pageviews:{date}:{url} {user_id} for each event and PFCOUNT pageviews:{date}:{url} for the count, each counter uses at most 12 KB regardless of cardinality. Error rate is <1%. Multiple HyperLogLogs can be merged with PFMERGE to compute monthly uniques without re-processing events. Memory savings vs exact counters are 99%+ for high-cardinality URLs.
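The memory-savings claim is easy to sanity-check (a rough estimate; 36 bytes assumes UUID-sized member strings and ignores per-member Set overhead, which only widens the gap):

```python
HLL_BYTES = 12 * 1024  # dense HyperLogLog upper bound per key

def hll_savings_ratio(n_uniques: int, avg_member_bytes: int = 36) -> float:
    """Fraction of memory saved vs storing every member in an exact Set."""
    exact_bytes = n_uniques * avg_member_bytes
    return 1 - HLL_BYTES / exact_bytes
```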

↑ Back to top

Code Examples

Example 1 — Cache-Aside with Lua Stampede Protection

import redis
import json, hashlib, time

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Lua script: atomic check-and-lock to prevent cache stampede
LOCK_SCRIPT = """
local val = redis.call('GET', KEYS[1])
if val then return val end
if redis.call('SET', KEYS[2], '1', 'NX', 'EX', '10') then
  return '__LOCK_ACQUIRED__'
end
return nil
"""

def cache_aside(query_key: str, expensive_fn, ttl: int = 300):
    lock_key = f"{query_key}:lock"
    script = r.register_script(LOCK_SCRIPT)

    for _ in range(5):
        res = script(keys=[query_key, lock_key])
        if res == "__LOCK_ACQUIRED__":
            result = expensive_fn()   # Only one caller reaches here
            r.set(query_key, json.dumps(result), ex=ttl)
            r.delete(lock_key)
            return result
        if res is not None:
            return json.loads(res)    # Cache hit
        time.sleep(0.05)              # Lock held by another caller; wait and retry
    return json.loads(r.get(query_key) or "{}")

Example 2 — Sliding-Window Rate Limiter (Atomic Lua)

RATE_LIMIT_SCRIPT = """
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
local uid    = ARGV[4]

redis.call('ZREMRANGEBYSCORE', key, '-inf', now - window)
local count = redis.call('ZCARD', key)
if count >= limit then
  return 0
end
redis.call('ZADD', key, 'NX', now, uid .. ':' .. now)
redis.call('PEXPIRE', key, window)  -- window is in milliseconds
return 1
"""

import time, uuid

rate_script = r.register_script(RATE_LIMIT_SCRIPT)

def is_allowed(device_id: str, window_seconds: int = 60, limit: int = 100) -> bool:
    now = int(time.time() * 1000)  # millisecond precision
    key = f"ratelimit:{device_id}"
    uid = str(uuid.uuid4())
    result = rate_script(
        keys=[key],
        args=[now, window_seconds * 1000, limit, uid]
    )
    return bool(result)

Example 3 — Redis Streams Consumer Group Pipeline

import redis, json, time, threading

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

STREAM = "sensor-events"
GROUP  = "enrichment-workers"

# Create consumer group (idempotent)
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.exceptions.ResponseError:
    pass  # Group already exists

def process(data: dict):
    """User-defined enrichment step (stub); replace with real logic."""
    ...

def worker(consumer_name: str):
    while True:
        # Block up to 1 s waiting for new messages ('>' = undelivered)
        entries = r.xreadgroup(
            GROUP, consumer_name,
            {STREAM: ">"},
            count=10, block=1000
        )
        if not entries:
            continue
        for _, messages in entries:
            for msg_id, data in messages:
                try:
                    process(data)         # user-defined enrichment
                    r.xack(STREAM, GROUP, msg_id)
                except Exception as e:
                    print(f"[WARN] {msg_id} failed: {e} — left in PEL for retry")

# Producer: ingest raw events
def produce(sensor_id: str, value: float):
    r.xadd(STREAM, {"sensor_id": sensor_id, "value": value},
           maxlen=500_000, approximate=True)

# Launch two consumers in parallel threads
for name in ["w1", "w2"]:
    threading.Thread(target=worker, args=(name,), daemon=True).start()

Example 4 — Sorted Set Leaderboard

LEADERBOARD = "scores:hourly"

def record_score(player_id: str, delta: float):
    r.zincrby(LEADERBOARD, delta, player_id)

def top_n(n: int = 10) -> list[dict]:
    entries = r.zrevrange(LEADERBOARD, 0, n - 1, withscores=True)
    return [
        {"player": pid, "score": int(score), "rank": i + 1}
        for i, (pid, score) in enumerate(entries)
    ]

def player_rank(player_id: str) -> int | None:
    rank = r.zrevrank(LEADERBOARD, player_id)
    return rank + 1 if rank is not None else None

# Rotate leaderboard every hour: NEW key + expire OLD key
import datetime

def rotate_leaderboard():
    hour_key = datetime.datetime.now(datetime.timezone.utc).strftime("scores:%Y%m%d%H")
    if r.exists(LEADERBOARD):              # RENAME errors if the source key is missing
        r.rename(LEADERBOARD, hour_key)
        r.expire(hour_key, 86400)          # keep 24 h of history
↑ Back to top

Comparison / When to Use

| Dimension | Redis | Memcached | DragonflyDB | Apache Kafka |
| --- | --- | --- | --- | --- |
| Data structures | Rich (10+ types) | String only | Redis-compatible (10+ types) | Byte arrays (topic records) |
| Persistence | ✅ RDB + AOF | ❌ In-memory only | ✅ Snapshot + journal | ✅ Durable log on disk |
| Consumer groups / at-least-once | ✅ Streams | ❌ | ✅ Streams | ✅ Consumer groups |
| Max throughput | ~1M ops/s (single node) | ~1M ops/s | ~4M ops/s (multi-threaded) | Millions msg/s (partitioned) |
| Retention / replay | Configurable MAXLEN | None | Configurable | Unlimited (log compaction) |
| Best for | Caching, rate limits, leaderboards, real-time enrichment | Simple session caching | Same as Redis, higher throughput | High-volume durable event streaming |

Rule of thumb: Use Redis for low-latency lookups and real-time structures (leaderboards, rate limits, session state). Use Kafka when you need durability, replay, and very high throughput at scale. Do not use Redis Streams as a Kafka replacement when message retention > 24 hours or when partition-level parallelism > ~8 is needed.

↑ Back to top

Gotchas & Anti-patterns

  1. Using Redis as a primary database without a durability strategy. Default Redis config has no persistence enabled. A restart loses all data. Always explicitly configure either AOF (appendonly yes, appendfsync everysec) or RDB snapshots before storing anything you can't afford to lose. Managed services (ElastiCache, Azure Cache) configure persistence separately from node provisioning; verify it is actually enabled.
  2. Storing large objects in a single key. A Redis Hash with 10,000 fields is fine; a String key holding a 10 MB serialized dataframe is not — it blocks the event loop during serialization/deserialization. Keep values under ~1 MB. For larger payloads, store keys in Redis and the actual data in object storage (S3 / ADLS), using Redis as a pointer cache.
  3. Using KEYS * in production. KEYS * scans the entire keyspace — O(N) — and blocks the single-threaded server. On a Redis instance with 10M keys, this can freeze responses for several seconds. Use SCAN (cursor-based, non-blocking) or query by key prefix patterns via RediSearch instead.
  4. Forgetting hash tag requirements in Redis Cluster. In Cluster mode, multi-key commands (MSET, KEYS, pipelining across keys) raise CROSSSLOT errors unless all keys hash to the same slot. Force co-location by using hash tags: wrap the shared part in {}, e.g., {user:123}:cart and {user:123}:wishlist always land on the same slot.
  5. Not setting maxmemory and eviction policy. Without maxmemory, Redis will consume all available RAM until the OOM killer terminates it, or the OS starts swapping (catastrophic latency). Always set maxmemory and an appropriate eviction policy. Monitor used_memory vs maxmemory with INFO memory.
↑ Back to top

Exercises

  1. Cache stampede lab: Implement the cache-aside pattern with and without the Lua lock. Simulate 50 concurrent requests hitting a cold cache key simultaneously (use threading.Thread). Measure how many times the "expensive function" is called in each scenario. Confirm the Lua lock reduces it to exactly 1 call. Then observe behavior when the lock holder crashes mid-refresh.
  2. Streams consumer group: Create a Redis Stream called orders. Write a producer that generates 1,000 fake order events. Create two consumer groups — fulfillment and analytics. Each group should have two consumers. Confirm each group independently receives all 1,000 events, and each consumer within a group receives roughly half. Implement a dead-letter handler using XAUTOCLAIM.
  3. Sorted set sliding leaderboard: Build a real-time leaderboard that maintains scores over a 1-hour rolling window. Use two sorted sets (current hour and previous hour) and a script that merges them with ZUNIONSTORE. Write a function that returns a player's rank and score across the merged window. Rotate the window every 60 seconds in a test environment and verify correctness.
↑ Back to top

Quiz

Q1: What is a cache stampede and how does Redis help prevent it?

A cache stampede (also called a thundering herd) occurs when a popular cache key expires while many concurrent requests want it, causing all of them to bypass the cache and hit the backing store at once. A common Redis-side mitigation is a lightweight atomic lock (SET lock_key 1 NX EX 10), often wrapped in a Lua script, taken before recomputing the value. Only the first request acquires the lock; subsequent requests poll until the value is repopulated. Alternatively, probabilistic early expiry can refresh keys slightly before actual expiry, spreading the recomputation load over time.
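The probabilistic early-expiry idea (known in the literature as XFetch) fits in a few lines; a sketch, where `beta` > 1 makes early refresh more aggressive:

```python
import math
import random

def should_refresh_early(ttl_remaining: float, compute_time: float,
                         beta: float = 1.0) -> bool:
    """Refresh before expiry, with probability rising as the TTL runs out."""
    # -ln(u) for u in (0, 1]: usually small, occasionally large
    gap = compute_time * beta * -math.log(1 - random.random())
    return ttl_remaining <= gap
```

Each caller rolls its own random gap, so refreshes trigger at slightly different times instead of all at the expiry instant.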

Q2: What is the difference between FLUSHDB and FLUSHALL and when should each be used?

FLUSHDB deletes all keys in the current logical database (default DB 0). FLUSHALL deletes all keys across all logical databases (16 by default). Both are blocking by default but accept the ASYNC modifier to delete in a background thread. In production, neither should ever be run without explicit confirmation — they are irreversible and instant. Use SCAN + DEL for selective key deletion. A common DE use case for FLUSHDB: clearing a test Redis instance before loading fresh fixture data.

Q3: A Redis Streams consumer group has a message in the PEL (Pending Entry List) that hasn't been acknowledged for 10 minutes. What does this mean and how do you handle it?

An entry in the PEL means it was delivered to a consumer but never acknowledged with XACK. This indicates the consumer either crashed mid-processing or encountered an error. Handle it with XAUTOCLAIM (Redis 6.2+): a background reclaimer periodically calls XAUTOCLAIM stream group reclaimer min-idle-time 0-0 COUNT 50 to transfer ownership of stale PEL entries to healthy consumers. After a configurable max-retry count, move the entry to a dead-letter stream for manual inspection. This is roughly analogous to detecting a stalled Kafka consumer that has stopped committing offsets.

Q4: Why should you never use KEYS * in a production Redis instance?

Redis is single-threaded for command processing. KEYS * performs a full keyspace scan — O(N) where N is the total number of keys — and blocks the entire server for its duration. On an instance with millions of keys, this can block for seconds, causing timeouts for all other clients. The production alternative is SCAN cursor MATCH pattern COUNT 100, which iterates in small batches without blocking. For structured key lookups, use RediSearch (part of Redis Stack) which maintains an inverted index.

Q5: What is the difference between Redis Sentinel and Redis Cluster?

Redis Sentinel provides high availability for a single-primary setup: Sentinel processes monitor the primary, detect failure using quorum voting, and promote a replica to primary. No data sharding — all data lives on one primary. Good for datasets that fit on a single server. Redis Cluster provides both high availability AND horizontal sharding across up to 1000 nodes. Data is distributed across 16,384 hash slots. It automatically handles failover. The tradeoff: multi-key commands require hash tags to ensure co-location, and not all commands are supported. Choose Sentinel for simpler HA; choose Cluster when your dataset outgrows a single node.

↑ Back to top

Further Reading

↑ Back to top