Multi-Modal Inference: The Next Frontier for Infrastructure Builders

The production AI inference landscape has been predominantly text-in, text-out for the past three years. The serving infrastructure, optimization tooling, and deployment patterns the community developed are optimized for this input-output shape. Multi-modal models — which process images, audio, and video alongside text — have a different compute profile, different memory requirements, and different serving challenges that the existing infrastructure handles poorly.

This matters for infrastructure investors because the shift to multi-modal production deployments creates a genuine infrastructure gap, and infrastructure gaps at this stage of an adoption curve tend to produce the next wave of infrastructure companies. The teams that fill this gap well won't just be "vLLM with image support" — they'll be building systems designed from the ground up for the multi-modal serving problem.

What's different about multi-modal inference compute

A vision-language model request adds two compute phases that don't exist in text-only serving: image encoding and cross-modal attention. Image encoding, which transforms raw pixel data into dense embeddings the language model can process, is typically done by a vision encoder (a ViT or similar) that runs separately from the language model itself. This step is compute-intensive, its cost grows with image resolution because higher resolution means more patches to encode, and in standard implementations it runs to completion before the language model's prefill starts rather than overlapping with it. Cross-modal attention, in architectures that use it, adds further work inside the language model each time text tokens attend to the image features.

The resulting compute profile: a multi-modal request pays image encoding time (variable, growing with image count and resolution) on top of the standard LLM prefill-decode pipeline. Under high throughput conditions, the image encoding step can become the bottleneck: the language model sits idle waiting for encoded image features instead of being limited by its own compute. Getting good throughput requires scheduling that accounts for this asymmetry and batches image encoding work efficiently alongside text processing.
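To make the scheduling point concrete, here is a minimal sketch of greedy encoder batching in Python. Everything in it is hypothetical: the `Request` type, the `plan_encoder_batches` and `ready_for_prefill` functions, and the patch budget are names we made up for illustration, not the API of any serving framework. It packs images from pending requests into encoder batches under a patch budget and lets text-only requests skip the encoder entirely.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    """Hypothetical pending request; not tied to any real framework."""
    request_id: str
    image_patch_counts: List[int]  # one entry per image; empty if text-only
    prompt_tokens: int


def plan_encoder_batches(
    requests: List[Request], max_patches_per_batch: int
) -> List[List[Tuple[str, int]]]:
    """Greedily pack images into encoder batches under a patch budget.

    Each batch is a list of (request_id, image_index) pairs. An image that
    alone exceeds the budget still gets a batch of its own.
    """
    batches: List[List[Tuple[str, int]]] = []
    current: List[Tuple[str, int]] = []
    current_patches = 0
    for req in requests:
        for idx, patches in enumerate(req.image_patch_counts):
            # Close the current batch if adding this image would overflow it.
            if current and current_patches + patches > max_patches_per_batch:
                batches.append(current)
                current, current_patches = [], 0
            current.append((req.request_id, idx))
            current_patches += patches
    if current:
        batches.append(current)
    return batches


def ready_for_prefill(requests: List[Request]) -> List[str]:
    """Text-only requests need no encoding and can go straight to prefill."""
    return [r.request_id for r in requests if not r.image_patch_counts]


pending = [
    Request("a", [576, 576], prompt_tokens=50),
    Request("b", [], prompt_tokens=20),
    Request("c", [576], prompt_tokens=10),
]
print(plan_encoder_batches(pending, max_patches_per_batch=1200))
print(ready_for_prefill(pending))
```

A production scheduler would also weigh queue age and decode-side pressure; this only shows the packing step that keeps the encoder busy without holding up text-only traffic.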

Audio processing adds another dimension: audio has to be chunked, transformed (typically via a spectrogram or learned audio encoder), and aligned temporally with text. For long audio clips, such as a 10-minute recording, the encoding latency can dominate total request latency if the encoding pipeline isn't designed for streaming processing.
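As a rough illustration of the chunking step, the sketch below splits a recording into fixed-length spans that a streaming encoder could consume one at a time. The 16 kHz sample rate, the 30-second chunk length, and the `chunk_spans` helper are all assumptions chosen for the example, not parameters of any particular audio model.

```python
from typing import Iterator, Tuple


def chunk_spans(
    total_samples: int, chunk_samples: int, overlap_samples: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample offsets covering the whole recording.

    Consecutive chunks share `overlap_samples` samples so that content at a
    chunk boundary is seen in full by at least one chunk.
    """
    step = chunk_samples - overlap_samples
    if step <= 0:
        raise ValueError("overlap must be smaller than the chunk length")
    start = 0
    while start < total_samples:
        end = min(start + chunk_samples, total_samples)
        yield (start, end)
        if end == total_samples:
            break
        start += step


SAMPLE_RATE = 16_000                 # assumed sample rate in Hz
recording = 10 * 60 * SAMPLE_RATE    # a 10-minute recording, in samples
spans = list(chunk_spans(recording, chunk_samples=30 * SAMPLE_RATE))
print(len(spans), spans[0], spans[-1])
```

Because the spans come from a generator, an encoder can start on the first chunk while later audio is still arriving, which is the property that keeps encoding latency from stacking up at the front of the request.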

The KV cache problem is qualitatively harder

Image tokens expand the effective context length dramatically. A single high-resolution image tokenized for a vision-language model can consume hundreds or thousands of context tokens. A multi-turn conversation with several images can easily push into the 10K-20K token range just from image content, before any text is added.
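A back-of-the-envelope calculation shows where those numbers come from. The sketch assumes a plain ViT-style encoder that emits one token per square patch with no pooling or token compression; the 14-pixel patch size and the image dimensions are illustrative, and real models often resize, tile, or compress images in ways that change the count.

```python
import math
from typing import List, Tuple


def vit_patch_tokens(width: int, height: int, patch_size: int = 14) -> int:
    """Tokens for one image, assuming one token per patch_size x patch_size patch."""
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)


def conversation_image_tokens(
    image_sizes: List[Tuple[int, int]], patch_size: int = 14
) -> int:
    """Total context tokens consumed by every image in a conversation."""
    return sum(vit_patch_tokens(w, h, patch_size) for w, h in image_sizes)


print(vit_patch_tokens(336, 336))                    # one small image
print(vit_patch_tokens(672, 672))                    # one higher-resolution image
print(conversation_image_tokens([(672, 672)] * 5))   # five images in one conversation
```

Under these assumptions a 336x336 image costs 576 tokens, a 672x672 image costs 2,304, and five of the latter consume 11,520 tokens before the user has typed a word.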

This interacts badly with current KV cache management approaches in two ways. First, the absolute memory requirements scale with image count, not just conversation length — a user who uploads three screenshots occupies dramatically more KV cache memory than one asking a text-only follow-up. This makes cache capacity planning harder and preemption decisions more frequent and costly.
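The memory gap is easy to estimate. The sketch below uses the standard per-token KV cache size (one key and one value vector per layer per KV head) with an illustrative 7B-class configuration: 32 layers, 32 KV heads, head dimension 128, and 16-bit values. The configuration, the 2,304 tokens per screenshot, and the 50-token text follow-up are all assumptions for the example; models with grouped-query attention would use fewer KV heads and less memory.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer per head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def request_kv_bytes(num_tokens: int, bytes_per_token: int) -> int:
    """KV cache footprint of a request holding num_tokens in context."""
    return num_tokens * bytes_per_token


MIB = 1024 * 1024
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Assumed workload: three screenshots at 2,304 tokens each vs. a 50-token follow-up.
screenshots = request_kv_bytes(3 * 2304, per_token)
text_follow_up = request_kv_bytes(50, per_token)

print(per_token / MIB, "MiB per token")
print(screenshots / MIB, "MiB for three screenshots")
print(text_follow_up / MIB, "MiB for a text follow-up")
```

With these numbers each token costs 0.5 MiB, so the three screenshots hold 3,456 MiB of cache against 25 MiB for the text follow-up, a difference of more than two orders of magnitude for what the user experiences as one message each.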

Second, prefix caching — one of the most effective cost-reduction techniques for text-only serving — has limited applicability to image content. Shared text prefixes (system prompts) cache cleanly; shared visual context is rare in practice and harder to detect efficiently. Teams that have relied heavily on prefix caching to reduce inference costs for text-heavy workloads will find their cache hit rates drop substantially when image content is introduced.
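A toy model of block-level prefix caching shows why hit rates fall. This is a generic sketch, not the implementation of any specific framework: the block size, the chained hashing scheme, and the `image_placeholder_tokens` helper (which stands in for image tokens using a hash of the image bytes) are all assumptions made for illustration.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 4  # tokens per cache block; kept tiny for illustration


def block_keys(token_ids: list, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Chain-hash each full block so a key depends on every token before it."""
    keys = []
    parent = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(digest)
        parent = digest
    return keys


def cached_prefix_blocks(cache: Set[bytes], token_ids: list) -> int:
    """Count leading blocks already in the cache; stop at the first miss."""
    hits = 0
    for key in block_keys(token_ids):
        if key not in cache:
            break
        hits += 1
    return hits


def image_placeholder_tokens(image_bytes: bytes, num_tokens: int) -> List[str]:
    """Hypothetical stand-in ids derived from image content, so only
    byte-identical images produce matching blocks."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    return [f"img:{digest}:{i}" for i in range(num_tokens)]


system_prompt = list(range(8))          # shared text prefix, two blocks
question = [100, 101, 102, 103]         # same text question in every request

first = system_prompt + image_placeholder_tokens(b"screenshot-1", 8) + question
same_image = system_prompt + image_placeholder_tokens(b"screenshot-1", 8) + question
new_image = system_prompt + image_placeholder_tokens(b"screenshot-2", 8) + question

cache = set(block_keys(first))
print(cached_prefix_blocks(cache, same_image))  # every block reused
print(cached_prefix_blocks(cache, new_image))   # only the system prompt reused
```

In this toy setup the request with a new image reuses only the two system-prompt blocks out of five, even though its text is identical: once the image differs, every block after it misses, including the repeated question.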

Model architecture heterogeneity and serving flexibility

The model architecture landscape for multi-modal models is less standardized than for pure-text LLMs. Text-only models have converged toward decoder-only transformer architectures with predictable parameter configurations. Multi-modal models use a variety of encoder-decoder patterns, cross-attention mechanisms, and fusion strategies that require more architectural flexibility from the serving layer.

Serving a model that uses a CLIP vision encoder fused with a Llama-style decoder is different from serving one that uses a ViT encoder with cross-attention at every transformer layer. The serving stack has to support different model topology configurations, different memory access patterns for the encoder and decoder components, and different batching strategies that account for the variable encoding cost based on input modality.

The current generation of LLM serving frameworks has added multi-modal support incrementally — typically by adding image encoding as a preprocessing step before passing tokens to the standard serving pipeline. This works for simple cases but doesn't generalize to architectures where encoder and decoder are tightly coupled or where streaming inference (generating while image encoding is still running) is required for acceptable latency.

Video inference: the next unsolved problem

Video adds a temporal dimension that image serving doesn't have: a 30-second video clip at 24fps is 720 frames, each of which has to be encoded and processed. Naive frame-by-frame processing doesn't scale — the context length and compute requirements would make per-video inference cost prohibitive for most applications.

The practical approaches under development: temporal sampling (only encoding keyframes or frames with significant visual change), temporal pooling (aggregating encoded features across time windows before language model attention), and streaming processing (processing video in chunks and maintaining temporal context across chunks). Each approach has different quality-cost tradeoffs and requires specific serving infrastructure support that doesn't exist in current text-focused serving stacks.
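The first two approaches can be sketched in a few lines of Python. The helpers below are illustrative only: the function names, the 576 tokens per frame, the per-frame "signature" vectors used to detect visual change, and the mean-absolute-difference metric are all assumptions, and streaming processing is left out because it depends heavily on the model.

```python
from typing import List


def video_context_tokens(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Context tokens if every frame is encoded (the naive baseline)."""
    return int(duration_s * fps) * tokens_per_frame


def uniform_sample_indices(num_frames: int, target_frames: int) -> List[int]:
    """Temporal sampling: pick evenly spaced frame indices."""
    if target_frames >= num_frames:
        return list(range(num_frames))
    stride = num_frames / target_frames
    return [int(i * stride) for i in range(target_frames)]


def mean_abs_diff(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def change_based_keyframes(signatures: List[List[float]], threshold: float) -> List[int]:
    """Temporal sampling: keep a frame only when it differs enough from the
    last kept frame. `signatures` are cheap per-frame descriptors."""
    if not signatures:
        return []
    kept = [0]
    for idx in range(1, len(signatures)):
        if mean_abs_diff(signatures[idx], signatures[kept[-1]]) > threshold:
            kept.append(idx)
    return kept


def temporal_mean_pool(features: List[List[float]], window: int) -> List[List[float]]:
    """Temporal pooling: average encoded features over non-overlapping windows."""
    pooled = []
    for start in range(0, len(features), window):
        group = features[start:start + window]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


print(video_context_tokens(30, 24, tokens_per_frame=576))  # naive: all 720 frames
print(len(uniform_sample_indices(720, 8)) * 576)           # after sampling 8 frames
```

Under the assumed 576 tokens per frame, encoding all 720 frames of the 30-second clip costs 414,720 context tokens, while sampling 8 frames costs 4,608. The quality question is what gets lost in the frames that were skipped or averaged away, which is why each approach lands at a different point on the quality-cost curve.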

Video-native applications, such as document processing from screen recordings, customer service interactions with video context, and manufacturing quality control, are early in production deployment but growing. The companies building serving infrastructure for video inference now are positioning for a market that we expect to be significantly larger in two years than it is today.

Where the infrastructure opportunity sits

We're not saying all the existing text-only serving infrastructure is obsolete — it isn't. Text inference is growing and will continue growing. The argument is that multi-modal inference is a meaningfully different technical problem that creates demand for purpose-built infrastructure, not adapted text infrastructure.

The teams best positioned to build this are the ones who understand both the encoder-decoder architecture patterns of multi-modal models and the serving systems engineering requirements — batching, memory management, scheduling — at production scale. That combination is rarer than either skill in isolation, which is why the gap between where the infrastructure is and where the demand is heading looks like a real investment opportunity to us.

For Fund II, this has become an active focus area in a way it wasn't when we wrote the initial thesis. The four investments we've made so far are all infrastructure-layer bets that address specific gaps in the production serving stack. Multi-modal serving infrastructure is the gap we're looking at most closely for our next investment.