A Mental Model for the AI Infrastructure Stack

Every infrastructure category eventually develops a canonical mental model — a way of drawing the stack that lets practitioners and investors reason about where value sits, where competition will happen, and which layers are likely to commoditize versus accrete margin. Networking had its OSI layers. Cloud had the IaaS/PaaS/SaaS stack. AI infrastructure is still in the stage where the vocabulary is inconsistent and the layers blur into each other, which makes it hard to think clearly about where to build.

I want to offer the model we use at Firntal. It is not the only correct way to decompose the stack, but it is the one that has been most useful for our investment decision-making — and more importantly, for understanding which types of companies will be structurally defensible five years from now versus which ones are likely to be absorbed into adjacent layers.

The five layers

We decompose the AI infrastructure stack into five layers, ordered from physical substrate to application interface:

Layer 1 — Compute Hardware. GPUs, TPUs, custom ASICs, networking fabric. This is where NVIDIA sits. It is also where AMD, Intel, Groq, Cerebras, and a cluster of startups building custom inference silicon are competing. The dynamics here are well understood: high capital intensity, years of design-to-production lead time, winner-take-most in any given generation cycle. At Firntal we do not invest at this layer, not because it is unimportant but because its capital requirements and technical differentiation timelines are mismatched to seed-stage venture.

Layer 2 — Cloud Infrastructure and GPU Orchestration. The hyperscalers (AWS, Azure, GCP) plus the GPU cloud specialists (CoreWeave, Lambda, and newer entrants). The key competitive variable at this layer is GPU availability and networking configuration for distributed workloads. GPU cloud specialists can differentiate on price, access to newer hardware, and network topology — especially for training clusters and large-batch inference. We have one investment at the lower end of this layer and are watching the consolidation dynamics closely.

Layer 3 — Model Serving and Inference Infrastructure. This is where most of our attention sits. It covers the systems between a raw model artifact and a scalable endpoint: batching engines, scheduling layers, serving frameworks, quantization tooling, speculative decoding implementations, memory management for large models. The companies building here are solving hard systems problems: not ML problems, but distributed systems, operating systems, and compiler problems applied to the specific demands of neural network inference.

Layer 4 — Model Lifecycle and Developer Tooling. CI/CD for ML, experiment tracking, model registry, fine-tuning pipelines, evaluation frameworks, deployment automation. This layer was largely built for the training-focused ML workflow of 2017–2021. As inference workloads grow, it needs to be rebuilt for the inference-first reality — where the primary concern is not which model is most accurate in a lab but which model is fastest, cheapest, and most reliable at production serving latency.

Layer 5 — Application Enablement. APIs, SDKs, agent orchestration frameworks, RAG infrastructure, embedding services. This is the layer that lets application developers use models without understanding Layers 1–4. It is also the layer with the most competition and the most ambiguous long-term margin profile — it is adjacent to where the model providers themselves want to grow.

Where value accrues — the structural argument

The standard tech-stack logic is that value accrues to the layer with the strongest switching costs and the least commodity competition. In the AI stack, those properties are distributed non-uniformly:

Layer 1 (hardware) has enormous switching costs but also enormous capital barriers — value accrues there but so does risk. Layer 3 (inference infrastructure) has growing switching costs because optimized inference engines are deeply coupled to both hardware characteristics and model architectures — porting a production serving layer from one system to another is expensive. Layer 4 (developer tooling) has high coupling to engineering workflows, which creates stickiness, but also faces the perpetual threat of being absorbed by cloud providers packaging their own tooling.

Layer 5 is structurally the most contested. Application enablement companies face competition from below (cloud providers building managed APIs) and from above (application developers who eventually internalize the capability). We are not saying Layer 5 is a bad place to build — several large companies will emerge from it. We are saying that our operating knowledge compounds more at Layers 3 and 4, and that is where we focus.

The cross-layer tension: serving frameworks vs. cloud provider packaging

The most interesting tension in the current landscape is between standalone inference serving infrastructure (Layer 3 specialists) and the serving capabilities that major cloud providers are bundling into their managed AI platforms. AWS SageMaker, Azure ML, and GCP Vertex AI are all building managed inference endpoints. If these products become good enough, do the Layer 3 specialists get squeezed?

We have thought about this a lot and our current position is: no, not for the next several years, and perhaps not structurally at all for the leading infrastructure companies. The reason is that the performance gap between optimized inference and managed cloud serving is significant — 3x to 10x in throughput at equivalent hardware cost, depending on the workload and model size. That gap exists because the cloud providers are building for the 80th percentile of use cases, and the teams at Layer 3 infrastructure companies are building for the demands of production serving at the edge of what is possible.

Companies with latency-sensitive workloads (real-time recommendations, interactive chat, autonomous coding assistants, voice interfaces) cannot absorb a 3x throughput penalty: at a fixed hardware budget, lower throughput means either paying roughly 3x as much to serve the same load or letting requests queue, which shows up as latency. They will pay for optimized serving infrastructure rather than accept the managed cloud default. That is the market the Layer 3 companies are building for.
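
To make the arithmetic concrete, here is a minimal sketch of how a throughput multiple translates into cost per token at equal hardware spend. The GPU price and baseline throughput are hypothetical placeholders chosen for round numbers, not measurements; only the 3x and 10x multiples come from the range discussed above.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs, for illustration only.
GPU_HOURLY_USD = 4.00               # assumed hourly price of one GPU
BASELINE_TOKENS_PER_SECOND = 1_000  # assumed throughput of a managed default endpoint

baseline = cost_per_million_tokens(GPU_HOURLY_USD, BASELINE_TOKENS_PER_SECOND)

# Compare the baseline against the low and high end of the 3x to 10x range.
for speedup in (3, 10):
    optimized = cost_per_million_tokens(
        GPU_HOURLY_USD, BASELINE_TOKENS_PER_SECOND * speedup
    )
    print(f"{speedup}x throughput: ${baseline:.2f} -> ${optimized:.2f} per million tokens")
```

With these placeholder inputs the baseline works out to about $1.11 per million tokens, falling to about $0.37 at 3x and $0.11 at 10x. The absolute figures mean nothing; the point is that at equal hardware cost, cost per token scales inversely with throughput.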

How to use this model

When we evaluate a new company in the AI infrastructure space, the first question we ask is: which layer is this, and what are the competitive dynamics specific to that layer? A model routing company is Layer 4 (or the boundary between 3 and 4). A GPU cloud operator is Layer 2. A quantization library that gets bundled into serving frameworks is nominally Layer 3, since quantization tooling is listed there, but in practice it sits near the boundary between 3 and 4.

The second question: what is the company's relationship to the layers adjacent to it? Is it threatened by a cloud provider integrating upward from Layer 2? Is it building toward upward expansion into the application layer? Does its defensibility depend on coupling to a specific model architecture that could become obsolete?
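
One way to make the two screening questions concrete is to treat the layers as an ordered type and look up a company's neighbors. This is an illustrative sketch only: the `Layer` enum, the `Company` class, and the `adjacent_layers` helper are hypothetical names invented for this example, and the only classification taken from the text is that a GPU cloud operator is Layer 2.

```python
from dataclasses import dataclass
from enum import IntEnum


class Layer(IntEnum):
    """The five layers, numbered from physical substrate (1) to application interface (5)."""
    COMPUTE_HARDWARE = 1
    CLOUD_AND_GPU_ORCHESTRATION = 2
    SERVING_AND_INFERENCE = 3
    LIFECYCLE_AND_TOOLING = 4
    APPLICATION_ENABLEMENT = 5


@dataclass(frozen=True)
class Company:
    name: str     # hypothetical label, not a real company
    layer: Layer  # answer to the first question: which layer is this?


def adjacent_layers(layer: Layer) -> list[Layer]:
    """Return the layers directly below and above `layer`, where they exist."""
    lowest, highest = min(Layer), max(Layer)
    return [Layer(n) for n in (layer - 1, layer + 1) if lowest <= n <= highest]


# Second question: which neighboring layers could integrate into this one?
gpu_cloud = Company("example GPU cloud operator", Layer.CLOUD_AND_GPU_ORCHESTRATION)
for neighbor in adjacent_layers(gpu_cloud.layer):
    print(f"{gpu_cloud.name}: review pressure from Layer {neighbor.value} ({neighbor.name})")
```

The sketch deliberately forces a single layer per company. Boundary cases, such as the model routing example, would need a richer representation, which is itself a useful signal that the company's structural position deserves a closer look.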

These questions do not produce clean answers, but they force rigorous thinking about structural position, which is more useful than focusing on near-term traction metrics for companies this early in a technology cycle.

The model will need revision as the stack evolves. That is fine. What we want is a working vocabulary for the problems that matter, not a fixed taxonomy that we defend past its useful life.