The Case for Serverless GPU Infrastructure

Our investment in Modal in 2022 was a bet on a specific insight: the developer experience for running GPU workloads was broken in a way that incremental improvements to existing cloud primitives were not going to fix. The fundamental problem is that GPU instances are slow to start, expensive, and stateful: you provision one, wait several minutes for it to start, manage it for the duration of your workload, and pay for idle time between jobs. For ML teams running batch inference jobs, model evaluations, training experiments, or preprocessing pipelines, this model creates enormous operational overhead and wasted spend. Serverless GPU infrastructure eliminates that overhead by making GPU execution feel like a function call.

The cold start problem is not what you think

The obvious objection to serverless GPU infrastructure is cold start latency. Spinning up a new GPU container naively takes tens of seconds — long enough to be unacceptable for any user-facing workload. The conventional wisdom is that GPU workloads are inherently incompatible with the serverless execution model because you cannot amortize the cold start over a short-lived function invocation the way you can with CPU-based serverless.

This framing is wrong, and the teams building the next generation of serverless GPU infrastructure know it. The cold start problem is not a physics constraint — it is an engineering problem. The latency comes from three separable sources: container image pull, GPU driver initialization, and application code startup (which for ML workloads typically includes model weight loading). Each of these is attackable independently.

Container image pull can be reduced to near-zero through aggressive image layer caching and snapshot-based container startup — loading a filesystem snapshot from fast local storage rather than pulling layers from a registry. GPU driver initialization can be reduced by maintaining warm pools of pre-initialized GPU contexts. Model weight loading, the biggest bottleneck for large model inference, can be reduced by keeping model weights in high-bandwidth local storage (NVMe arrays or memory-mapped files) rather than fetching from object storage on each cold start. Combining these techniques pushes GPU cold start latencies from tens of seconds into the sub-second range, which changes the calculus on which workloads serverless GPU infrastructure can serve.
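
The weight-loading point can be illustrated with a minimal sketch, assuming weights sit in a flat file of float32 values on fast local disk. The file layout and the helper names (`write_fake_weights`, `read_tensor_slice`, `demo`) are hypothetical and chosen only for illustration; they are not any platform's actual format. The idea shown is that a memory-mapped file lets a process decode only the bytes it touches instead of copying the whole file up front.

```python
import mmap
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per float32


def write_fake_weights(path, count):
    """Write `count` float32 values (0.0, 1.0, 2.0, ...) as stand-in model weights."""
    values = [float(i) for i in range(count)]
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{count}f", *values))


def read_tensor_slice(path, start, length):
    """Memory-map the file and decode only `length` floats starting at index `start`."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = start * FLOAT_SIZE
            raw = mm[offset : offset + length * FLOAT_SIZE]
            return list(struct.unpack(f"<{length}f", raw))


def demo():
    """Create a small hypothetical weights file and read three values from the middle."""
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, "weights.bin")
    write_fake_weights(path, 1000)
    return read_tensor_slice(path, 500, 3)


if __name__ == "__main__":
    print(demo())
```

In a real serving stack the same pattern would apply to multi-gigabyte files, where the operating system pages data in on demand from local storage rather than the process fetching everything from object storage first.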

Who actually benefits from serverless GPU

The immediate addressable market for serverless GPU infrastructure is not the large enterprise running sustained inference at scale — that customer will buy reserved GPU capacity and run dedicated serving infrastructure because the economics favor it at high utilization. The immediate customer is the engineering team that needs GPU execution for tasks where utilization is inherently bursty or unpredictable.

ML model evaluation pipelines are a canonical example. A team fine-tuning or evaluating models may run hundreds of evaluation jobs over a two-day sprint and then nothing for a week. Maintaining dedicated GPU instances for that workload is expensive and wasteful. A serverless GPU platform that bills per second of actual compute changes the unit economics dramatically: the team pays for the 40 minutes of GPU time its evaluation jobs actually consume, not for the 168 instance-hours an always-on instance accrues that week.
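
The arithmetic behind that comparison can be sketched as follows. The hourly rate and the serverless price premium are assumptions picked purely for illustration, not quoted prices from any provider; only the 40 minutes and the 168 hours come from the example above. The sketch also shows why the dedicated option wins at high utilization: if serverless compute costs `premium` times the dedicated rate, the two options break even at a utilization of `1 / premium`.

```python
def always_on_cost(hourly_rate, hours):
    """Cost of keeping one dedicated instance running for `hours`."""
    return hourly_rate * hours


def per_second_cost(hourly_rate, busy_seconds, premium=1.0):
    """Cost when billed only for busy seconds, at `premium` times the dedicated rate."""
    return hourly_rate * premium * busy_seconds / 3600


def break_even_utilization(premium):
    """Fraction of the period the GPU must be busy before dedicated becomes cheaper."""
    return 1.0 / premium


if __name__ == "__main__":
    HOURLY_RATE = 2.00  # assumed dollars per GPU-hour, illustrative only
    PREMIUM = 3.0       # assumed serverless markup over the dedicated rate

    dedicated = always_on_cost(HOURLY_RATE, hours=168)
    serverless = per_second_cost(HOURLY_RATE, busy_seconds=40 * 60, premium=PREMIUM)

    print(f"dedicated week:  ${dedicated:.2f}")
    print(f"serverless week: ${serverless:.2f}")
    print(f"break-even utilization: {break_even_utilization(PREMIUM):.0%}")
```

At equal rates the ratio between the two bills depends only on utilization: 168 hours is 10,080 minutes, so 40 busy minutes means the always-on instance costs 252 times as much for the same work.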

Data preprocessing pipelines for training are another. Generating embeddings, running image augmentations, tokenizing large text corpora — these workloads are GPU-accelerated but highly parallelizable and naturally batch-oriented. They fit the serverless execution model well: fan out to many short-lived GPU workers, collect results, terminate. The alternative is managing a preprocessing cluster, which is infrastructure overhead that does not contribute to the core ML product.
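
The fan-out pattern described above can be sketched with the Python standard library. This is a minimal local illustration, assuming threads as a stand-in for remote GPU workers; `make_shards`, `process_shard`, and `fan_out` are hypothetical names, and the word-count body is a placeholder for real GPU work such as tokenization or embedding generation.

```python
from concurrent.futures import ThreadPoolExecutor


def make_shards(items, shard_size):
    """Split the input into fixed-size shards, one per worker invocation."""
    return [items[i : i + shard_size] for i in range(0, len(items), shard_size)]


def process_shard(shard):
    """Placeholder for GPU work on one shard; here it just counts words per text."""
    return [len(text.split()) for text in shard]


def fan_out(items, shard_size=2, max_workers=4):
    """Fan out shards to short-lived workers, then collect results in input order."""
    shards = make_shards(items, shard_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the input shards.
        shard_results = list(pool.map(process_shard, shards))
    # Leaving the `with` block shuts the workers down: the "terminate" step.
    return [value for result in shard_results for value in result]


if __name__ == "__main__":
    print(fan_out(["a b", "c", "d e f", "g h", "i"]))
```

On a serverless platform each shard would presumably become a separate remote invocation rather than a local thread, but the shape of the program (split, map, collect) stays the same.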

The less obvious but larger opportunity is the long tail of ML applications that have not yet been built because the infrastructure burden was too high. A developer who wants to add image generation to a web application, a startup building a voice assistant, a research team that wants to run inference over a large dataset: all of these teams currently face a choice between managed APIs (limited control, vendor lock-in, often expensive at scale) and self-managed GPU infrastructure (high operational burden, capital cost, and a need for dedicated ML infrastructure expertise). Serverless GPU infrastructure creates a third path: direct access to GPU compute with the infrastructure fully managed, at per-second pricing that makes experimentation economically viable.

The platform abstraction matters as much as the hardware access

The teams building serverless GPU infrastructure that we find most interesting are not primarily building a GPU rental business. They are building a developer platform where GPU execution is a primitive that composes with application code in the way that database calls or HTTP requests do — low-friction, directly from code, without a separate infrastructure layer to manage.

This means the developer experience of deploying a GPU workload should feel like writing a function with a decorator, not like provisioning infrastructure. The platform handles containerization, dependency management, scaling, scheduling, and billing transparently. From the developer's perspective, they write Python, specify compute requirements, and call the function — and it runs on GPUs without any infrastructure work.
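
To make the "function with a decorator" idea concrete, here is a minimal sketch of what such an interface could look like. Everything in it is hypothetical: `gpu_function`, `REGISTRY`, and the parameter names are invented for illustration and do not reproduce any specific platform's API. The decorator only records the declared compute requirements and runs the function locally, which is where a real platform would package the code and dispatch it to a GPU worker.

```python
import functools

REGISTRY = {}  # hypothetical: function name -> declared compute requirements


def gpu_function(gpu="any", memory_gb=16, timeout_s=300):
    """Hypothetical decorator that attaches compute requirements to a function."""

    def decorator(fn):
        requirements = {"gpu": gpu, "memory_gb": memory_gb, "timeout_s": timeout_s}
        REGISTRY[fn.__name__] = requirements

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # A real platform would containerize the code, schedule it on a
            # matching GPU worker, and return the result. This sketch simply
            # calls the function in-process.
            return fn(*args, **kwargs)

        wrapper.requirements = requirements
        return wrapper

    return decorator


@gpu_function(gpu="A100", memory_gb=40)
def classify(texts):
    """Placeholder workload standing in for model inference."""
    return [len(t) % 2 for t in texts]


if __name__ == "__main__":
    print(classify(["ab", "abc"]))
    print(classify.requirements)
```

The design point is that the compute requirements live next to the code they apply to, so the caller invokes `classify(...)` like any other Python function and never touches a provisioning step.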

Getting this developer experience right requires solving hard problems that are not obviously GPU problems: fast dependency installation and caching, secure multi-tenant execution, efficient job scheduling across a shared GPU pool, and a billing model that is transparent and predictable. The teams building this well are building a platform product, not just a compute abstraction.

Why this is the right time

Two structural trends make the timing of the serverless GPU market compelling. First, the volume of ML workloads that require GPU execution is growing rapidly as foundation models become embedded in more application categories. More teams are running inference, evaluation, and preprocessing workloads that need GPU access but do not need dedicated GPU infrastructure. Second, the cost of GPU compute is declining as the supply of inference-optimized silicon increases and competition in the GPU cloud market intensifies. Declining GPU unit costs expand the range of workloads where serverless GPU infrastructure is economically viable compared to dedicated instances.

The infrastructure layer that makes GPU execution accessible to the broad population of developers building ML-powered applications is being built now. The companies that get the developer experience and the operational efficiency right will serve a market that is substantially larger than the current generation of GPU cloud customers.