Real-Time Recommendation as an Inference Architecture Problem

Recommendation systems were doing production ML inference at scale before "LLM serving" was a topic anyone discussed. Hyperscale ad-tech and content discovery platforms spent a decade solving the problem of running neural models in sub-50ms latency windows at millions of requests per second — and building the infrastructure patterns to do it reliably. Those patterns are not fully replicated in the LLM inference tooling ecosystem, and they're worth understanding.

When we invested in Shaped in 2024, part of the thesis was that the recommendation domain has generated infrastructure knowledge that hasn't fully diffused to the broader ML serving community. A team with real production recommendation experience carries that knowledge, and it's durable even as model architectures change.

The two-stage funnel as the foundational pattern

Industrial recommendation systems almost universally use a two-stage architecture: retrieval followed by ranking. The retrieval stage takes a user context and retrieves hundreds or thousands of candidate items from a corpus that may contain millions. The ranking stage scores those candidates more precisely, reordering them for final presentation.

The reason for the split is inference cost. Ranking models are expensive per item: they typically run a forward pass through a neural network for each candidate, modeling interactions between user features and item features (feature crosses or, in some architectures, cross-attention). Running a ranking model against the full corpus is computationally infeasible at real-time latency. Retrieval, typically approximate nearest neighbor (ANN) search against dense embeddings, is cheap per lookup and scales to a large corpus through specialized index structures such as HNSW graphs or inverted-file (IVF) indexes, available in libraries like FAISS.
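
To make the cost argument concrete, here is a minimal sketch of the funnel in plain Python. Everything in it is illustrative: exact brute-force search stands in for an ANN index, and `CountingRanker` is a hypothetical toy model that counts how many forward passes it is asked to run.

```python
import heapq
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec, item_vecs, k):
    # Stage 1: cheap similarity search over the whole corpus.
    # Exact brute force stands in for an ANN index here.
    return heapq.nlargest(k, item_vecs, key=lambda i: dot(user_vec, item_vecs[i]))


class CountingRanker:
    """Hypothetical stand-in for an expensive ranking model; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, user_vec, item_vec):
        self.calls += 1
        # Toy score: similarity minus a small distance penalty.
        penalty = sum(abs(u - v) for u, v in zip(user_vec, item_vec))
        return dot(user_vec, item_vec) - 0.1 * penalty


def rank(user_vec, candidates, item_vecs, model, n):
    # Stage 2: score only the retrieved candidates, then keep the top n.
    scored = sorted(((model(user_vec, item_vecs[i]), i) for i in candidates), reverse=True)
    return [i for _, i in scored[:n]]


rng = random.Random(0)
corpus = {i: [rng.uniform(-1, 1) for _ in range(8)] for i in range(1000)}
user = [rng.uniform(-1, 1) for _ in range(8)]

model = CountingRanker()
candidates = retrieve(user, corpus, k=50)
top = rank(user, candidates, corpus, model, n=5)
```

With a 1,000-item corpus and `k=50`, the ranking model runs 50 times instead of 1,000. The ratio is what matters: in production the corpus is millions of items and the candidate set stays in the hundreds or low thousands.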

The same reasoning increasingly applies to LLM-adjacent applications. A retrieval-augmented generation (RAG) system that retrieves relevant documents before feeding them to an LLM is following the same pattern: cheap retrieval to narrow the space, expensive generation only over relevant context. The infrastructure problems are similar: embedding index latency, the tradeoff between retrieval recall and latency, and staleness of the index relative to new data.

Feature freshness as a latency budget problem

One of the defining infrastructure challenges in real-time recommendation is feature freshness: user behavior from the last 30 seconds is more predictive for recommendation quality than user behavior from two hours ago, but computing fresh features in real time consumes latency budget. The tradeoff between feature recency and serving latency is a fundamental design axis that production recommendation teams manage explicitly.

Two patterns emerged. Slow-moving signals (historical preferences, demographic data) are precomputed and cached in a feature store with read latency under 1ms. Fast-moving signals (session behavior, recent clicks) are computed in a streaming pipeline and written to an in-memory store that the serving layer can read within the request window. The serving layer reads from both and assembles the feature vector before calling the ranking model.
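
A rough sketch of that assembly step, under stated assumptions: `FeatureStore` is a hypothetical dict-backed stand-in for a real store, the feature names are made up, and the budget check is deliberately simple (it looks at elapsed time before the real-time read and does not enforce a timeout on the read itself). An injectable clock keeps the example deterministic.

```python
import time


class FeatureStore:
    """Hypothetical key-value feature store; a dict stands in for the real backend."""

    def __init__(self, data, defaults):
        self.data = data
        self.defaults = defaults

    def get(self, user_id):
        return self.data.get(user_id, self.defaults)


def assemble_features(user_id, batch_store, realtime_store, budget_ms, clock=time.monotonic):
    start = clock()
    # Slow-moving, precomputed features are always read.
    features = dict(batch_store.get(user_id))
    elapsed_ms = (clock() - start) * 1000.0
    if elapsed_ms < budget_ms:
        # Budget remains: read the fresh session features.
        features.update(realtime_store.get(user_id))
        features["realtime_fresh"] = True
    else:
        # Budget spent: fall back to defaults and record that we did.
        features.update(realtime_store.defaults)
        features["realtime_fresh"] = False
    return features


def fake_clock(*ticks):
    # Returns the given timestamps (in seconds) one call at a time.
    it = iter(ticks)
    return lambda: next(it)


batch = FeatureStore({"u1": {"preferred_genre": "jazz"}}, {"preferred_genre": "unknown"})
realtime = FeatureStore({"u1": {"clicks_last_30s": 3}}, {"clicks_last_30s": 0})

fast = assemble_features("u1", batch, realtime, budget_ms=5.0, clock=fake_clock(0.0, 0.001))
slow = assemble_features("u1", batch, realtime, budget_ms=5.0, clock=fake_clock(0.0, 0.010))
```

Recording `realtime_fresh` alongside the features is the design choice worth noting: the model and the monitoring both get to know when a request was served with stale inputs.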

This architecture shows up, in modified form, in LLM serving systems doing context injection. Session context, retrieved documents, user-specific prompts — these are the "features" of the LLM request. The serving layer has to assemble them within the latency budget before triggering generation. Teams that have worked on recommendation feature stores understand this assembly problem intuitively; teams coming from pure NLP backgrounds sometimes treat it as an afterthought and pay for it in production latency.

Online-offline consistency: the silent killer

A recurring failure mode in recommendation systems: training-serving skew. The model is trained on batch-computed features from an offline pipeline. In production, the serving layer computes the same features in real time — but the computation logic diverges over time due to code changes, feature store version mismatches, or subtle differences in how null values and edge cases are handled. The model sees a different feature distribution in serving than it saw in training, and quality degrades in ways that don't correlate with any single obvious root cause.

Recommendation infrastructure teams also call this problem "feature skew", and they have built elaborate tooling to detect and prevent it: feature logging for post-hoc comparison, feature monitoring for distribution shift, and canary deployments that compare serving feature distributions against training feature distributions before full rollout.
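
As an illustration of the simplest form of that comparison, the sketch below checks logged serving values against a training sample for each feature, looking at null rate and mean shift. The thresholds, feature names, and the choice of these two statistics are assumptions made for the example; production systems generally use richer distribution tests.

```python
from statistics import mean, pstdev


def summarize(values):
    present = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(present) / len(values),
        "mean": mean(present) if present else 0.0,
        "std": pstdev(present) if present else 0.0,
    }


def skew_report(training, serving, max_null_delta=0.05, max_mean_shift=0.5):
    """Flag features whose serving values depart from the training sample."""
    flagged = {}
    for name, train_values in training.items():
        t = summarize(train_values)
        # A feature missing from the serving logs counts as entirely null.
        s = summarize(serving.get(name, [None]))
        null_delta = abs(s["null_rate"] - t["null_rate"])
        # Express the mean shift in units of training standard deviation.
        mean_shift = abs(s["mean"] - t["mean"]) / (t["std"] or 1.0)
        if null_delta > max_null_delta or mean_shift > max_mean_shift:
            flagged[name] = {
                "null_delta": round(null_delta, 3),
                "mean_shift": round(mean_shift, 3),
            }
    return flagged


training = {"age_days": [1, 2, 3, 4, 5], "ctr": [0.1, 0.2, 0.3, 0.2]}
serving = {"age_days": [1, 2, 3, 4, 5], "ctr": [0.1, None, None, 0.2]}

flagged = skew_report(training, serving)
```

Here `ctr` is flagged because half of its serving values are null while none were in training, which is the kind of null-handling divergence described above. `age_days` matches and passes.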

The analog in LLM systems is prompt template drift. The model is evaluated with one prompt format; production gradually drifts to a different format through undocumented changes. The semantic content is similar but the token distribution shifts enough to degrade performance. Mature recommendation infrastructure has well-established tooling for the equivalent problem; most LLM deployment setups do not. The teams building production ML serving infrastructure, Shaped among them, carry this institutional knowledge.
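
One lightweight guard, sketched here as an assumption rather than a description of any particular team's tooling, is to fingerprint the exact template used during evaluation and compare it with what production is serving. The function names and the sample template are hypothetical.

```python
import hashlib


def template_fingerprint(template: str) -> str:
    # Hash the exact bytes: even whitespace changes can alter tokenization.
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]


def check_template(serving_template: str, evaluated_fingerprint: str) -> dict:
    current = template_fingerprint(serving_template)
    return {
        "drifted": current != evaluated_fingerprint,
        "serving": current,
        "evaluated": evaluated_fingerprint,
    }


EVALUATED = "You are a helpful assistant.\nContext: {context}\nQuestion: {question}"
evaluated_fp = template_fingerprint(EVALUATED)

unchanged = check_template(EVALUATED, evaluated_fp)
edited = check_template(EVALUATED + " Answer briefly.", evaluated_fp)
```

A fingerprint only says that something changed, not whether the change matters. Its value is in forcing a re-evaluation whenever the serving template no longer matches the one the model was measured against.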

Embedding index management at scale

ANN retrieval over dense embeddings requires maintaining a searchable index that stays current as the item corpus changes. A content platform with a large item catalog adds new items continuously; the embedding index has to be updated to include them. Rebuilding the entire index on each update is computationally expensive and introduces serving gaps. Incremental index updates are fast but degrade retrieval quality over time as the index structure becomes fragmented.

The production pattern is a hot path that handles new items through a "fresh" index bucket, searched by brute force or a smaller-scale ANN index, combined with a background rebuild of the main index on a fixed cadence (daily, weekly) that keeps the main index well-structured. Retrieval at serving time combines results from both buckets. The quality tradeoff between freshness and structure is managed explicitly through bucket weighting.
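
The two-bucket pattern can be sketched as follows. This is a toy under loud assumptions: both buckets are plain dicts searched exactly (in production the main bucket would be an ANN index), `fresh_weight` is one hypothetical way to express bucket weighting, and `rebuild` simply folds the fresh items into the main bucket.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_bucket(query, bucket, k):
    # Exact top-k by dot product; stands in for an ANN lookup.
    return heapq.nlargest(k, ((dot(query, vec), item_id) for item_id, vec in bucket.items()))


def search(query, main_index, fresh_bucket, k, fresh_weight=1.0):
    main_hits = search_bucket(query, main_index, k)
    # Scale fresh-bucket scores to tune how strongly new items compete.
    fresh_hits = [(score * fresh_weight, i) for score, i in search_bucket(query, fresh_bucket, k)]
    best = {}
    for score, item_id in main_hits + fresh_hits:
        if item_id not in best or score > best[item_id]:
            best[item_id] = score
    top = heapq.nlargest(k, ((score, item_id) for item_id, score in best.items()))
    return [item_id for _, item_id in top]


def rebuild(main_index, fresh_bucket):
    # Background job: fold fresh items into the main index, then empty the fresh bucket.
    return {**main_index, **fresh_bucket}, {}


main_index = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
fresh_bucket = {"c": [0.9, 0.1]}
query = [1.0, 0.0]

neutral = search(query, main_index, fresh_bucket, k=2)
boosted = search(query, main_index, fresh_bucket, k=2, fresh_weight=1.2)
main_index, fresh_bucket = rebuild(main_index, fresh_bucket)
after_rebuild = search(query, main_index, fresh_bucket, k=2)
```

The new item `c` is retrievable as soon as it lands in the fresh bucket, and the weight decides whether it can outrank established items before the next rebuild.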

This design pattern reappears in any system doing retrieval over frequently updated content — knowledge bases for RAG, product catalogs for e-commerce search, document stores for enterprise AI applications. The teams that have operated recommendation infrastructure at scale already know the failure modes. The tooling for index lifecycle management in the LLM ecosystem is still catching up to what recommendation infrastructure teams built out of necessity a decade earlier.

What the LLM ecosystem is still missing

Watching the current generation of LLM serving infrastructure mature, a few gaps stand out relative to what production recommendation systems solved years ago.

Serving observability is underdeveloped. Recommendation systems have detailed telemetry on every serving event — which features were used, what the model scored, what was returned, what the user did afterward. This telemetry feeds continuous retraining pipelines and quality monitoring. LLM serving systems typically have request latency and error rate monitoring, but limited instrumentation of what actually happened inside the request: which retrieved documents were most influential, what context was used, where the token budget was spent.
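
For a sense of what per-request instrumentation could look like, here is a hypothetical event record. The field names and token categories are our own illustration, not any serving framework's schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ServingEvent:
    request_id: str
    prompt_template_id: str
    retrieved_doc_ids: list
    tokens: dict  # token counts per prompt section, plus generated output
    latency_ms: float

    def token_share(self):
        # Fraction of the token budget spent on each section.
        total = sum(self.tokens.values())
        return {name: count / total for name, count in self.tokens.items()} if total else {}

    def to_log_line(self):
        # One JSON object per line, ready for a log pipeline.
        return json.dumps(asdict(self), sort_keys=True)


event = ServingEvent(
    request_id="r-001",
    prompt_template_id="support-v3",
    retrieved_doc_ids=["doc-17", "doc-42"],
    tokens={"system": 100, "retrieved": 900, "user": 50, "output": 150},
    latency_ms=812.5,
)
share = event.token_share()
line = event.to_log_line()
```

In this example three quarters of the tokens went to retrieved context. That is the kind of fact latency and error-rate dashboards alone will not surface.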

Request-level quality signals are mostly absent. Recommendation systems tie serving events to downstream behavioral signals (click, dwell time, conversion) and use these to evaluate and retrain. LLM serving systems have almost no production quality signal: the model either responded or it errored. Building the feedback loop from user behavior to model quality evaluation is genuinely hard for open-ended generation, but it is not impossible, and the teams that build it will have a durable advantage over those that rely on benchmark performance alone.
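
The mechanical core of such a feedback loop is a join between serving events and whatever behavioral signals arrive later. The sketch below assumes hypothetical signal names ("accepted", "copied", "regenerated") and a made-up `template_id` field; which signals actually indicate quality is the hard, product-specific part.

```python
from collections import defaultdict


def join_feedback(events, feedback):
    # Attach every downstream signal to the serving event that produced it.
    signals = defaultdict(list)
    for fb in feedback:
        signals[fb["request_id"]].append(fb["signal"])
    return [{**e, "signals": signals.get(e["request_id"], [])} for e in events]


def positive_rate_by_template(joined, positive=("accepted", "copied")):
    # Share of requests per template that drew at least one positive signal.
    totals, hits = defaultdict(int), defaultdict(int)
    for row in joined:
        template = row["template_id"]
        totals[template] += 1
        if any(s in positive for s in row["signals"]):
            hits[template] += 1
    return {template: hits[template] / totals[template] for template in totals}


events = [
    {"request_id": "r1", "template_id": "v1"},
    {"request_id": "r2", "template_id": "v1"},
    {"request_id": "r3", "template_id": "v2"},
]
feedback = [
    {"request_id": "r1", "signal": "accepted"},
    {"request_id": "r3", "signal": "regenerated"},
]

joined = join_feedback(events, feedback)
rates = positive_rate_by_template(joined)
```

Requests with no feedback at all count as non-positive here. That is a simplifying assumption, and a real pipeline would likely track them separately.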

We're not saying recommendation infrastructure is a blueprint for LLM serving. The architectures are different and the problems aren't identical. But the discipline — close observation of serving behavior, tight feedback loops, explicit management of the training-serving boundary — transfers directly, and teams that carry it tend to build better systems faster than those that don't.