LLMs on your own server: when it makes sense and when it doesn't
Self-hosted Llama, Qwen, DeepSeek on-prem. Honest analysis of cost, latency, quality, and total cost of ownership vs public API. When it's worth it, when it becomes a liability.
The 2026 question
In workshops and consulting engagements, three questions repeat:
- “Can we run AI inside our network without sending data outside?”
- “How much would it cost to replace OpenAI/Anthropic with our own machines?”
- “If Llama 4/Qwen 3 are good, why are we still paying for the public API?”
Short answers, before the detail:
- Yes, but…
- More than you expect.
- Because you’re paying 80% for reliability, 20% for the model.
This article opens the “but” and the “more than you expect.”
When self-hosted makes sense
Case 1: hard regulatory compliance
If the sector requires that personal/health/financial/military data NEVER leaves the company perimeter, self-hosted is the only option in some cases. Examples:
- Hospitals with patient data.
- Banks with tier-1 transaction data.
- National defense, critical infrastructure.
- Law firms under confidential mandate.
Even here, note: Anthropic, OpenAI, Google, and AWS offer dedicated-tenant options in regional data centers with guaranteed data residency in 2026. Before declaring “we need self-host,” verify if a dedicated tenant already resolves your legal requirement.
Case 2: massive volume + stable use case
If you run millions of inferences/month on a stable use case (e.g. ticket classification, standardized call summarization, invoice field extraction), unit cost at scale inverts:
- Public API: per-token cost, linear scaling. USD 6k/month becomes USD 60k/month at 10× volume.
- Self-hosted on a dedicated GPU: fixed server + energy cost. USD 6k/month stays USD 6k/month at 10× volume (until the GPU saturates).
Typical break-even in 2026 for a full Llama 3 70B-class model: ~5-15 million tokens/day processed consistently. Below that, API is still cheaper.
Case 3: impossible network latency
An application needing < 100ms first-token even with 200 concurrent users may justify a dedicated GPU. But in 2026, provider latencies have approached that — verify before jumping to the conclusion.
Case 4: research requiring frequent fine-tuning
For teams training custom models weekly (rare outside Big Tech), self-hosted is part of the routine. Almost no mid-market company is in that scenario.
When self-hosted does NOT make sense (most cases)
Anti-case 1: “we want to save money”
Basic math for 2026:
- A100 40GB-class GPU new: USD 10-15k.
- A100 cloud-rented: USD 1.5-3/h. Running 24/7 = USD 1,100-2,200/month.
- Server with 4× A100 + power + cooling + colocation: USD 4-8k/month fixed.
- Maintenance team: 1 engineer with GPU expertise (rare globally, USD 5-10k/month when you find one).
Monthly total for self-host production: USD 9-18k minimum.
To match this in public-API consumption (Claude Sonnet, GPT-4.1, Gemini 2.5):
- USD 9-18k = USD 9-18k in tokens
- Sonnet 4.x: ~USD 3/M tokens input + ~USD 15/M tokens output
- Considering a typical mix of 70% input / 30% output: ~USD 6.6/M tokens average
- USD 9k = ~1.4 billion tokens/month
Does your company process 1.4 billion tokens/month? Probably not. The API is cheaper for 95% of companies.
Anti-case 2: “we want privacy”
“Privacy” as an isolated motivation rarely justifies self-host in 2026. Anthropic, OpenAI, Google have Data Processing Addendums compliant with GDPR/LGPD/HIPAA. Cloud in the right region (AWS São Paulo, AWS Frankfurt, Azure East US, etc.) is eligible for most cases.
When that’s NOT enough: regulated sectors with an explicit contractual obligation against data transit.
Anti-case 3: “we want the latest version”
Self-host means freezing the model. You run Llama 3.3 70B. When Llama 5 comes out, you redo the setup. Small open-weight models lag behind closed-source proprietaries by 6-18 months.
If your application depends on frontier capability, API is where the frontier is.
Anti-case 4: “we want our own personality”
You want a model that “talks like the company.” You don’t need self-host for that. You need:
- Well-designed system prompt
- Few-shot examples
- Eventually fine-tune via API (OpenAI, Anthropic, Google offer fine-tune as a service)
Self-host for personality is killing a mosquito with a bazooka.
Realistic open-weight stack 2026
If you decided (with grounds) that self-host makes sense, here’s the current stack:
Models
- Llama 4 70B-405B (Meta) — good general quality, EN > non-EN.
- Qwen 3 / DeepSeek V3 — strong in code and math, decent multilingual.
- Phi-3.5 (Microsoft) — small (3-14B) efficient model. Good for structured tasks.
- Mistral / Mixtral — European, good efficiency.
For non-English specific use cases, consider fine-tuning on a local corpus (Maritaca AI has well-trained PT versions; equivalent specialty fine-tunes exist for ES, FR, JA, AR).
Inference runtime
- vLLM — market standard for serving LLMs at scale. Multi-GPU, batching, tensor parallelism.
- Ollama — good for local dev + POCs, not recommended in tier-1 production.
- TGI (Text Generation Inference, HuggingFace) — robust alternative.
- TensorRT-LLM (NVIDIA) — maximum performance on NVIDIA GPUs, high complexity.
Orchestration
- vLLM + Kubernetes + GPU autoscaler — enterprise standard.
- Ray Serve — alternative for teams already using Ray.
- Modal / Replicate — managed self-host, intermediate between public API and pure on-prem.
Observability
- Prompt + response logs in SQL/SQLite (same pattern as harness runtime governance).
- Metrics: tokens/s, p50/p95/p99 latency, GPU utilization, OOM rate.
- Alerting via Prometheus + Grafana or similar.
The hybrid pattern (recommended)
For 80% of companies that think they want self-host, the optimal pattern is hybrid:
- Public API for generic cases (drafting, summarization, general classification).
- Self-hosted small model for a specific high-volume task with sensitive data (e.g. PII extraction in internal logs).
- Local pre-processing to mask sensitive data BEFORE sending to the public API (PII redaction with a small local model + a Claude/GPT call for the rest).
That hybrid pattern captures 80% of the self-host benefit (privacy where it matters) at 20% of the cost + complexity.
FAQ
How long to spin up a self-host POC? With Ollama on a laptop or small server: 1 day. For real production with vLLM + Kubernetes: 2-6 weeks of dedicated engineering.
Regional servers — do they support this? Yes. Tier-3+ data centers in major markets (São Paulo, Frankfurt, Dublin, Singapore, Mumbai) have capacity. Energy + cooling cost varies — weigh it into TCO.
Can we buy A100/H100 locally? Stock limited and premium price vs direct import in many markets. For low volume (1-4 GPUs), local resellers. For high volume, import via a specialized partner is cheaper.
Is Anthropic Claude self-hostable? Anthropic does not offer open weights. Neither does OpenAI. Self-host is exclusively open-weight territory.
What about regional models (Maritaca, Sabiá, Aya, Falcon)? They have competitive versions in their target language. Worth considering for language-only cases with medium scale.
Next steps
- Apply the decision matrix above to your case. If you’re not in one of the 4 “when it makes sense” scenarios, it probably doesn’t.
- OpenClaw is a third-party open-source multi-channel gateway we adopted internally for WhatsApp/Telegram/Instagram/Discord — worth evaluating if you have multi-channel requirements.
- SkilLab AI Newsletter — applied engineering deep dive every Thursday. Sign up below.
Also read
- Harness Stack — 9 layers of runtime governance, applicable to any LLM (API or self-host)
- AI for business: the only decision matrix you need — when to delegate
By Ivan Prado · SkilLab AI · May 2026. Translated and adapted from the PT-BR original.