Building a High-Performance Cache Layer in Go
Your service is slow. You add Redis. It gets faster. Then Redis becomes the bottleneck -- every request still makes a network round-trip, serialization costs add up, and under load you start seeing latency spikes from connection pool contention. Sound familiar? In this article, we'll build a two-tier cache layer in Go that combines a local in-memory cache with Redis, prevent cache stampedes using singleflight, and discuss the production considerations that separate a toy cache from a battle-tested one.

Redis is excellent. But it's still a network hop away. For a typical service:
| Operation | Latency |
| --- | --- |
| Local memory read | ~50ns |
| Redis GET (same AZ) | ~0.5-1ms |
| PostgreSQL query | ~2-10ms |
That's a 10,000x difference between local memory and Redis. For hot keys that get read thousands of times per second, this matters. A local cache also gives you:

- Zero network overhead -- no serialization, no TCP, no connection pools
- Resilience -- your service still responds if Redis goes down briefly
- Reduced Redis load -- fewer commands means lower Redis CPU and network usage

The tradeoff? Local caches are per-instance and can serve stale data. We'll address both.

Let's start with a simple but effective local cache. We'll use sync.Map for concurrent access and a background goroutine for TTL eviction.

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value     any
	expiresAt time.Time
}

type LocalCache struct {
	data    sync.Map
	maxSize int
	size    int64
	mu      sync.Mutex // guards size
}

func NewLocalCache(maxSize int, evictInterval time.Duration) *LocalCache {
	c := &LocalCache{maxSize: maxSize}
	go c.evictLoop(evictInterval)
	return c
}

func (c *LocalCache) Get(key string) (any, bool) {
	raw, ok := c.data.Load(key)
	if !ok {
		return nil, false
	}
	e := raw.(*entry)
	if time.Now().After(e.expiresAt) {
		// LoadAndDelete ensures only one goroutine decrements the
		// counter when several hit the expired entry at once.
		if _, loaded := c.data.LoadAndDelete(key); loaded {
			c.decrSize()
		}
		return nil, false
	}
	return e.value, true
}

func (c *LocalCache) Set(key string, value any, ttl time.Duration) {
	e := &entry{value: value, expiresAt: time.Now().Add(ttl)}
	// Swap (Go 1.20+) stores the new entry and tells us whether
	// one already existed, in a single atomic operation.
	if _, loaded := c.data.Swap(key, e); !loaded {
		c.incrSize()
	}
}

func (c *LocalCache) Delete(key string) {
	if _, loaded := c.data.LoadAndDelete(key); loaded {
		c.decrSize()
	}
}

func (c *LocalCache) evictLoop(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		now := time.Now()
		c.data.Range(func(key, value any) bool {
			if now.After(value.(*entry).expiresAt) {
				if _, loaded := c.data.LoadAndDelete(key); loaded {
					c.decrSize()
				}
			}
			return true
		})
	}
}

func (c *LocalCache) incrSize() { c.mu.Lock(); c.size++; c.mu.Unlock() }
func (c *LocalCache) decrSize() { c.mu.Lock(); c.size--; c.mu.Unlock() }
```
This gives us O(1) reads and writes with lazy + periodic expiration. The sync.Map is optimized for the read-heavy, write-light pattern that caches typically exhibit.

Why not a regular map with sync.RWMutex? For read-dominated workloads with many goroutines, sync.Map avoids lock contention on the read path entirely. Under write-heavy loads, a sharded map with RWMutex can outperform it -- but caches are almost always read-heavy.

Now let's compose the local cache with Redis into a two-tier system. The lookup flow:

1. Check the local cache -> hit? Return immediately.
2. Check Redis -> hit? Backfill the local cache, return.
3. Call the loader (DB, API, etc.) -> populate both caches, return.
```go
package cache

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

type TieredCache struct {
	local    *LocalCache
	redis    *redis.Client
	localTTL time.Duration
	redisTTL time.Duration
}

func NewTieredCache(rc *redis.Client, localTTL, redisTTL time.Duration) *TieredCache {
	return &TieredCache{
		local:    NewLocalCache(10000, 30*time.Second),
		redis:    rc,
		localTTL: localTTL,
		redisTTL: redisTTL,
	}
}

func (tc *TieredCache) Get(ctx context.Context, key string) ([]byte, bool) {
	// Tier 1: local memory
	if val, ok := tc.local.Get(key); ok {
		return val.([]byte), true
	}
	// Tier 2: Redis
	val, err := tc.redis.Get(ctx, key).Bytes()
	if err == nil {
		tc.local.Set(key, val, tc.localTTL) // backfill L1
		return val, true
	}
	return nil, false
}

func (tc *TieredCache) Set(ctx context.Context, key string, value []byte) error {
	tc.local.Set(key, value, tc.localTTL)
	return tc.redis.Set(ctx, key, value, tc.redisTTL).Err()
}

// GetOrLoad implements the full cache-aside pattern.
func (tc *TieredCache) GetOrLoad(
	ctx context.Context,
	key string,
	loader func(ctx context.Context) ([]byte, error),
) ([]byte, error) {
	if val, ok := tc.Get(ctx, key); ok {
		return val, nil
	}
	val, err := loader(ctx)
	if err != nil {
		return nil, fmt.Errorf("loader for key %s: %w", key, err)
	}
	_ = tc.Set(ctx, key, val) // best-effort cache write
	return val, nil
}
```
Usage is clean:

```go
data, err := cache.GetOrLoad(ctx, "user:1234", func(ctx context.Context) ([]byte, error) {
	u, err := db.GetUser(ctx, 1234)
	if err != nil {
		return nil, err
	}
	return json.Marshal(u)
})
```
Important: keep the local TTL shorter than the Redis TTL. A good starting point is 10-30s locally and 5-15 minutes in Redis. This bounds cross-instance staleness while still absorbing the vast majority of reads locally.

There's a critical problem with GetOrLoad. When a popular key expires, hundreds of goroutines simultaneously discover the miss and all call the loader. This is a cache stampede -- it can flatten your database. Go's golang.org/x/sync/singleflight deduplicates concurrent calls for the same key so only one goroutine does the actual work:

```go
import "golang.org/x/sync/singleflight"

type TieredCache struct {
	local    *LocalCache
	redis    *redis.Client
	localTTL time.Duration
	redisTTL time.Duration
	sf       singleflight.Group
}

func (tc *TieredCache) GetOrLoad(
	ctx context.Context,
	key string,
	loader func(ctx context.Context) ([]byte, error),
) ([]byte, error) {
	if val, ok := tc.Get(ctx, key); ok {
		return val, nil
	}
	// Only one goroutine executes per key; others wait and share the result.
	result, err, shared := tc.sf.Do(key, func() (any, error) {
		// Double-check: another goroutine may have filled the cache
		// while we waited for the singleflight slot.
		if val, ok := tc.Get(ctx, key); ok {
			return val, nil
		}
		val, err := loader(ctx)
		if err != nil {
			return nil, err
		}
		_ = tc.Set(ctx, key, val)
		return val, nil
	})
	if err != nil {
		return nil, err
	}
	_ = shared // useful for metrics: high share rate = stampede prevention working
	return result.([]byte), nil
}
```
The double-check inside Do matters. Between the initial miss and acquiring the singleflight slot, another goroutine may have already populated the cache. Without this, you'd still make one redundant database call per stampede event.

Test setup: 8-core machine, 100 concurrent goroutines, 10K unique keys with Zipfian distribution (some keys much hotter than others, like real traffic).

```go
func BenchmarkCacheTiers(b *testing.B) {
	b.Run("redis-only", func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				rdb.Get(ctx, zipfKey())
			}
		})
	})
	b.Run("local-only", func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				local.Get(zipfKey())
			}
		})
	})
	b.Run("tiered", func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				tiered.Get(ctx, zipfKey())
			}
		})
	})
}
```
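The zipfKey helper isn't shown above; one way to generate Zipf-distributed keys is the standard library's rand.NewZipf. This is a sketch under stated assumptions -- newZipfKeyFunc is our own name, and the exponent 1.1 is an arbitrary skew choice, not from the original benchmark:

```go
package main

import (
	"fmt"
	"math/rand"
)

// newZipfKeyFunc returns a generator of Zipf-distributed keys over n keys.
// rand.NewZipf requires s > 1 and v >= 1; larger s skews harder toward
// the hot keys (low indices are the most frequent).
func newZipfKeyFunc(n uint64, seed int64) func() string {
	z := rand.NewZipf(rand.New(rand.NewSource(seed)), 1.1, 1, n-1)
	return func() string {
		return fmt.Sprintf("key:%d", z.Uint64())
	}
}

func main() {
	zipfKey := newZipfKeyFunc(10000, 42)
	fmt.Println(zipfKey())
}
```

Seeding the generator makes benchmark runs reproducible; in RunParallel you would give each goroutine its own generator, since *rand.Zipf is not safe for concurrent use.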
Results:

| Approach | ops/sec | p50 | p99 |
| --- | --- | --- | --- |
| Redis only | 85,000 | 0.6ms | 2.1ms |
| Local only | 12,000,000 | 48ns | 210ns |
| Tiered (warm) | 10,500,000 | 52ns | 380ns |
| Tiered (cold start) | 78,000 | 0.7ms | 2.4ms |
At steady state the tiered cache runs at near-local-only speed because hot keys live in L1. The extra local-miss check on cold paths adds only ~4ns of overhead before falling through to Redis. Stampede test: 1000 goroutines hitting the same expired key simultaneously:
| Without singleflight | With singleflight |
| --- | --- |
| 1000 DB calls | 1 DB call |
| p99: 850ms | p99: 12ms |
The difference is dramatic and gets worse under real load.

An unbounded local cache will OOM your process. Two approaches:

1. Max entry count -- simple and predictable. Evict the oldest entries when full. Add a size check in Set and use an LRU library like hashicorp/golang-lru/v2 when you need eviction ordering.
2. Max memory bytes -- more precise but harder. For []byte values you can sum lengths directly; for arbitrary types, estimation gets complex.
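To make the max-entry-count approach concrete, here's a minimal LRU sketch built on the standard library's container/list. This is our illustration, not the article's LocalCache -- it skips TTLs and locking so the eviction ordering stays visible:

```go
package main

import (
	"container/list"
	"fmt"
)

type lruItem struct {
	key   string
	value []byte
}

// LRU evicts the least recently used entry once maxEntries is exceeded.
// Not goroutine-safe; a real cache would add a mutex and TTLs.
type LRU struct {
	maxEntries int
	order      *list.List               // front = most recently used
	items      map[string]*list.Element // key -> element in order
}

func NewLRU(maxEntries int) *LRU {
	return &LRU{
		maxEntries: maxEntries,
		order:      list.New(),
		items:      make(map[string]*list.Element),
	}
}

func (c *LRU) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el) // mark as most recently used
	return el.Value.(*lruItem).value, true
}

func (c *LRU) Set(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		el.Value.(*lruItem).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&lruItem{key: key, value: value})
	if c.order.Len() > c.maxEntries {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*lruItem).key)
	}
}

func main() {
	c := NewLRU(2)
	c.Set("a", []byte("1"))
	c.Set("b", []byte("2"))
	c.Get("a")              // touch "a" so "b" becomes least recently used
	c.Set("c", []byte("3")) // over capacity: evicts "b"
	_, ok := c.Get("b")
	fmt.Println(ok) // false: "b" was evicted
}
```

In production you'd reach for a maintained library instead, but the list-plus-map structure above is essentially what those libraries do.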
Start with max entry count + a short TTL. Monitor via runtime.MemStats and adjust. TTL-based eviction is often sufficient. When you also need to cap size:

- LRU -- the default choice. Well understood, works for most access patterns.
- LFU -- better for heavily skewed workloads. More complex to implement correctly.
- Random -- surprisingly effective and nearly free. Consider it for unpredictable access patterns.

For most services, LRU + TTL hits the sweet spot.

Options for multi-instance consistency:

- Short local TTLs -- accept bounded staleness (10-30s). Simplest approach, often sufficient.
- Redis Pub/Sub -- publish invalidation events on write; instances subscribe and evict locally.
```go
func (tc *TieredCache) Invalidate(ctx context.Context, key string) error {
	tc.local.Delete(key)
	tc.redis.Del(ctx, key)
	return tc.redis.Publish(ctx, "cache:invalidate", key).Err()
}

// Each instance subscribes on startup:
func (tc *TieredCache) SubscribeInvalidations(ctx context.Context) {
	sub := tc.redis.Subscribe(ctx, "cache:invalidate")
	go func() {
		for msg := range sub.Channel() {
			tc.local.Delete(msg.Payload)
		}
	}()
}
```
Cache misses too. If a key doesn't exist in your database, store a sentinel to prevent repeated lookups:

```go
var sentinel = []byte("MISS")

// In the loader:
if errors.Is(err, ErrNotFound) {
	_ = cache.Set(ctx, key, sentinel) // short TTL
	return nil, ErrNotFound
}
```
Without this, a nonexistent key generates a database query on every request -- a pattern attackers can exploit.

Track these metrics (export to Prometheus, Datadog, etc.):

- Hit rate per tier -- local should be 80%+ for hot paths
- Singleflight share rate -- high = stampede prevention working
- Cache size -- entry count and estimated memory
- Loader latency -- what you're protecting the system from
```go
type Metrics struct {
	LocalHits   atomic.Int64
	LocalMisses atomic.Int64
	RedisHits   atomic.Int64
	RedisMisses atomic.Int64
	SFShared    atomic.Int64
}
```
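A hit-rate accessor on a struct like this might look as follows -- LocalHitRate is our addition for illustration, not part of the original code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type Metrics struct {
	LocalHits   atomic.Int64
	LocalMisses atomic.Int64
}

// LocalHitRate returns hits / (hits + misses), or 0 before any traffic.
// Loads are atomic individually, so the ratio is approximate under
// concurrent updates -- fine for dashboards.
func (m *Metrics) LocalHitRate() float64 {
	hits := m.LocalHits.Load()
	total := hits + m.LocalMisses.Load()
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

func main() {
	var m Metrics
	m.LocalHits.Add(90)
	m.LocalMisses.Add(10)
	fmt.Println(m.LocalHitRate()) // 0.9
}
```

The counters would be incremented inside Get and GetOrLoad at each tier's hit/miss branch, then scraped periodically by your metrics exporter.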
A dashboard showing per-tier hit rates will immediately tell you whether your cache is earning its complexity.

The full request flow:

```
Request -> Local Cache (L1, ~50ns)
               | miss
               v
           Redis (L2, ~0.5ms)
               | miss
               v
        singleflight dedup
               |
               v
         Database (~5ms)
               |
               v
        Populate L1 + L2
```
Key takeaways:

- Two tiers beat one -- local absorbs hot reads, Redis handles the long tail and cross-instance sharing.
- singleflight is non-negotiable -- without it, cache expiration under load becomes a database stampede.
- Short local TTLs -- 10-30s balances freshness against hit rate.
- Monitor everything -- hit rates, sizes, loader latency. Caches fail silently.

Start with the simple version. Measure. Then add complexity only where the numbers justify it.

This is part of the Production Backend Patterns series, where we tackle real infrastructure problems with practical Go code. Follow for the next post on rate limiting and backpressure. If this article helped you, consider buying me a coffee on Ko-fi!