🟠 Builder

LLMs on your own server: when it makes sense and when it doesn't

Self-hosted Llama, Qwen, DeepSeek on-prem. Honest analysis of cost, latency, quality, and total cost of ownership vs public API. When it's worth it, when it becomes a liability.

May 15, 2026 · 12 min · ai-engineering

The 2026 question

In workshops and consulting engagements, three questions repeat:

“Can we run AI inside our network without sending data outside?”
“How much would it cost to replace OpenAI/Anthropic with our own machines?”
“If Llama 4/Qwen 3 are good, why are we still paying for the public API?”

Short answers, before the detail:

Yes, but…
More than you expect.
Because you’re paying 80% for reliability, 20% for the model.

This article opens the “but” and the “more than you expect.”

When self-hosted makes sense

Case 1: hard regulatory compliance

If the sector requires that personal/health/financial/military data NEVER leaves the company perimeter, self-hosted is the only option in some cases. Examples:

Hospitals with patient data.
Banks with tier-1 transaction data.
National defense, critical infrastructure.
Law firms under confidential mandate.

Even here, note: Anthropic, OpenAI, Google, and AWS offer dedicated-tenant options in regional data centers with guaranteed data residency in 2026. Before declaring “we need self-host,” verify if a dedicated tenant already resolves your legal requirement.

Case 2: massive volume + stable use case

If you run millions of inferences/month on a stable use case (e.g. ticket classification, standardized call summarization, invoice field extraction), unit cost at scale inverts:

Public API: per-token cost, linear scaling. USD 6k/month becomes USD 60k/month at 10× volume.
Self-hosted on a dedicated GPU: fixed server + energy cost. USD 6k/month stays USD 6k/month at 10× volume (until the GPU saturates).

Typical break-even in 2026 for a full Llama 3 70B-class model: ~5-15 million tokens/day processed consistently. Below that, API is still cheaper.

Case 3: impossible network latency

An application needing < 100ms first-token even with 200 concurrent users may justify a dedicated GPU. But in 2026, provider latencies have approached that — verify before jumping to the conclusion.

Case 4: research requiring frequent fine-tuning

For teams training custom models weekly (rare outside Big Tech), self-hosted is part of the routine. Almost no mid-market company is in that scenario.

When self-hosted does NOT make sense (most cases)

Anti-case 1: “we want to save money”

Basic math for 2026:

A100 40GB-class GPU new: USD 10-15k.
A100 cloud-rented: USD 1.5-3/h. Running 24/7 = USD 1,100-2,200/month.
Server with 4× A100 + power + cooling + colocation: USD 4-8k/month fixed.
Maintenance team: 1 engineer with GPU expertise (rare globally, USD 5-10k/month when you find one).

Monthly total for self-host production: USD 9-18k minimum.

To match this in public-API consumption (Claude Sonnet, GPT-4.1, Gemini 2.5):

USD 9-18k = USD 9-18k in tokens
Sonnet 4.x: ~USD 3/M tokens input + ~USD 15/M tokens output
Considering a typical mix of 70% input / 30% output: ~USD 6.6/M tokens average
USD 9k = ~1.4 billion tokens/month

Does your company process 1.4 billion tokens/month? Probably not. The API is cheaper for 95% of companies.

Anti-case 2: “we want privacy”

“Privacy” as an isolated motivation rarely justifies self-host in 2026. Anthropic, OpenAI, Google have Data Processing Addendums compliant with GDPR/LGPD/HIPAA. Cloud in the right region (AWS São Paulo, AWS Frankfurt, Azure East US, etc.) is eligible for most cases.

When that’s NOT enough: regulated sectors with an explicit contractual obligation against data transit.

Anti-case 3: “we want the latest version”

Self-host means freezing the model. You run Llama 3.3 70B. When Llama 5 comes out, you redo the setup. Small open-weight models lag behind closed-source proprietaries by 6-18 months.

If your application depends on frontier capability, API is where the frontier is.

Anti-case 4: “we want our own personality”

You want a model that “talks like the company.” You don’t need self-host for that. You need:

Well-designed system prompt
Few-shot examples
Eventually fine-tune via API (OpenAI, Anthropic, Google offer fine-tune as a service)

Self-host for personality is killing a mosquito with a bazooka.

Realistic open-weight stack 2026

If you decided (with grounds) that self-host makes sense, here’s the current stack:

Models

Llama 4 70B-405B (Meta) — good general quality, EN > non-EN.
Qwen 3 / DeepSeek V3 — strong in code and math, decent multilingual.
Phi-3.5 (Microsoft) — small (3-14B) efficient model. Good for structured tasks.
Mistral / Mixtral — European, good efficiency.

For non-English specific use cases, consider fine-tuning on a local corpus (Maritaca AI has well-trained PT versions; equivalent specialty fine-tunes exist for ES, FR, JA, AR).

Inference runtime

vLLM — market standard for serving LLMs at scale. Multi-GPU, batching, tensor parallelism.
Ollama — good for local dev + POCs, not recommended in tier-1 production.
TGI (Text Generation Inference, HuggingFace) — robust alternative.
TensorRT-LLM (NVIDIA) — maximum performance on NVIDIA GPUs, high complexity.

Orchestration

vLLM + Kubernetes + GPU autoscaler — enterprise standard.
Ray Serve — alternative for teams already using Ray.
Modal / Replicate — managed self-host, intermediate between public API and pure on-prem.

Observability

Prompt + response logs in SQL/SQLite (same pattern as harness runtime governance).
Metrics: tokens/s, p50/p95/p99 latency, GPU utilization, OOM rate.
Alerting via Prometheus + Grafana or similar.

The hybrid pattern (recommended)

For 80% of companies that think they want self-host, the optimal pattern is hybrid:

Public API for generic cases (drafting, summarization, general classification).
Self-hosted small model for a specific high-volume task with sensitive data (e.g. PII extraction in internal logs).
Local pre-processing to mask sensitive data BEFORE sending to the public API (PII redaction with a small local model + a Claude/GPT call for the rest).

That hybrid pattern captures 80% of the self-host benefit (privacy where it matters) at 20% of the cost + complexity.

FAQ

How long to spin up a self-host POC? With Ollama on a laptop or small server: 1 day. For real production with vLLM + Kubernetes: 2-6 weeks of dedicated engineering.

Regional servers — do they support this? Yes. Tier-3+ data centers in major markets (São Paulo, Frankfurt, Dublin, Singapore, Mumbai) have capacity. Energy + cooling cost varies — weigh it into TCO.

Can we buy A100/H100 locally? Stock limited and premium price vs direct import in many markets. For low volume (1-4 GPUs), local resellers. For high volume, import via a specialized partner is cheaper.

Is Anthropic Claude self-hostable? Anthropic does not offer open weights. Neither does OpenAI. Self-host is exclusively open-weight territory.

What about regional models (Maritaca, Sabiá, Aya, Falcon)? They have competitive versions in their target language. Worth considering for language-only cases with medium scale.

Next steps

Apply the decision matrix above to your case. If you’re not in one of the 4 “when it makes sense” scenarios, it probably doesn’t.
OpenClaw is a third-party open-source multi-channel gateway we adopted internally for WhatsApp/Telegram/Instagram/Discord — worth evaluating if you have multi-channel requirements.
SkilLab AI Newsletter — applied engineering deep dive every Thursday. Sign up below.