Someone asks: “Should we self-host our AI, or use the cloud APIs?”
It sounds like one question. It’s actually three — and the answer to each can be different for the same organization. Running them together is how you end up with either an expensive on-premise cluster that underperforms, or sensitive data in an infrastructure you don’t fully control.
The Performance Gap Is Real. It’s Also Shrinking.
Blunt version: if you self-host today, you accept a performance ceiling.
The frontier models — GPT-4.1, Claude Opus 4, Gemini 2.5 Pro — are not available as open weights. You cannot run them on your own hardware, full stop. The best models available for self-hosting right now are Llama 4 Maverick, DeepSeek V3.2, and Qwen 3 235B. They are genuinely capable. On coding benchmarks, the open-weights models are within a few percentage points of proprietary frontier models. The gap that was enormous in 2023 has compressed significantly.
Compressed is not closed. Complex multi-step reasoning, ambiguous judgment calls, advanced legal or financial analysis — this remains frontier territory. If your core use case lives there, self-hosting will cost you in output quality in ways that are hard to compensate for with better prompting or more hardware.
The Task Decides More Than the Model Does
Before deciding where to run a model, decide what you’re asking it to do.
Simple tasks — classification, entity extraction, document Q&A with RAG, FAQ bots — can run on a Llama 4 Scout. It has 17 billion active parameters, fits on a single 80GB H100 GPU, and at self-hosted rates costs roughly $0.36 per million tokens. For workloads in this category, there is no meaningful quality argument for a frontier cloud model.
Complex tasks — multi-step reasoning, code generation across large codebases, synthesis across ambiguous or contradictory inputs — still favor cloud. Not because cloud providers engineer better models, but because the best-performing models simply aren’t available outside their APIs.
The practical rule: start with the smallest model that solves the problem at acceptable accuracy. If a local 8B model gets the job done, nothing larger is worth the overhead — cost, latency, or operational complexity.
The Cost Math Has a Crossover Point
Self-hosting has an economics problem that only resolves in your favor at scale.
Hardware: a production-grade DeepSeek V3.2 deployment requires 8× H100 GPUs. That’s a significant capital commitment before you factor in power, cooling, and the operational staff who know how to keep it running. At spot instance prices, that same deployment costs roughly $5.72 per million tokens. Llama 4 Scout is considerably cheaper — around $0.36 per million tokens self-hosted — but it’s also a smaller model.
The crossover where self-hosting becomes financially rational is approximately $12,000–19,000 per month in equivalent API spend. Below that threshold, cloud APIs win on cost. Above it, the math shifts meaningfully over a 12–18 month amortization horizon.
Most organizations are well below that threshold. The economics of self-hosting are often argued by people who haven’t run the numbers against their actual usage.
The Data Question Is Structural, Not Contractual
Performance and cost are calculable. Data control is a different kind of problem.
The major cloud providers — OpenAI Enterprise, Anthropic’s API, Google Cloud — have agreements that explicitly prohibit using your API data to train their models. That is a real contractual commitment, and for many workloads it’s sufficient.
But contractual protection and actual data control are two different things. When data leaves your network perimeter, what happens at the infrastructure level — caching, internal review processes, logging retention — is outside your visibility. Your contract reflects what a vendor promises today under today’s terms. What persists technically, or becomes legally accessible in a different jurisdiction, is a separate matter.
For most workloads — internal tools, customer support, code assistants, search — this is a manageable and widely accepted risk. For genuinely sensitive data — medical records, legal strategy, industrial IP, financial models — on-premise deployment with open-weights models is the only architecture where the guarantee is technically verifiable rather than contractually asserted.
That’s not an argument against cloud. It’s an argument for knowing precisely which data you’re processing and where.
The Three Questions
Before the next AI infrastructure decision, answer these separately:
- Task quality: Does the use case require frontier reasoning, or does a capable open model cover it?
- Volume economics: What is the actual monthly inference spend, and at what point does owned infrastructure beat per-token billing?
- Data sensitivity: Is contractual protection sufficient, or does the nature of the data require technical control of the perimeter?
Different workloads in the same organization will have different answers. The mistake is treating it as one question with one answer.
Self-hosted or cloud isn’t a strategy. It’s a conclusion you reach after answering the right questions first.

Leave a comment