Self-hosting Llama vs Claude API: the real cost breakdown

When self-hosting an open-weight LLM beats the Claude API, when it doesn't, and the operational costs nobody includes in their comparison.

YAEL Engineering · 12 Dec 2025 · 8 min read · 1,597 words

For a SaaS sending under 10 million tokens per month, the Claude API is dramatically cheaper than self-hosting an equivalent open-weight model. For a SaaS sending over 100 million tokens per month with consistent load, self-hosting Llama 3 70B or similar is meaningfully cheaper — if you have the engineering team to operate it. The crossover sits in a range that depends mostly on your utilization profile. Spiky workloads favor the API forever. Steady high-throughput workloads cross over earlier. The cost-per-token comparison everyone runs is misleading because it ignores GPU idle time, ops cost, and the cost of being wrong about quality.

We've done both. This is the breakdown we use for customers asking the question.

The naive comparison (and why it's wrong)

Per the marketing pages:

Claude Sonnet 4.6: ~$3 / 1M input tokens, $15 / 1M output tokens
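Self-hosted Llama 3 70B on a $2/hr A100: ~$0.02 / 1M tokens is the headline figure at full utilization, which implies heavily batched throughput. At batch 1 (roughly 50-100 tokens/sec) the same GPU works out to about $5.50-11 / 1M tokens.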

Looks like a 100x win for self-hosting. It is not. Three things are missing.

Utilization. A $2/hr GPU costs about $1440/month (720 hours) whether you push 1 token through it or 100 million. If you average 20% utilization, your effective cost is 5x your "marketing" cost.
Quality. Llama 3 70B is not equivalent to Claude Sonnet. On many tasks it is materially worse. You may need a bigger model (405B) to match, or you may accept worse quality. Both options cost real money.
Operational overhead. Running GPU infrastructure is a real engineering job. vLLM crashes, models run out of memory, batch scheduling needs tuning, and someone has to own monitoring and on-call. Not free.
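
To make the utilization point concrete, here is a minimal sketch of the arithmetic, assuming a GPU billed by the hour. The function name and the way the inputs are combined are ours for illustration; the $2/hr price, the 100 tokens/sec batch-1 throughput, and the 20% utilization figure come from the comparison above.

```typescript
// Effective self-hosting cost per 1M tokens for a GPU billed by the hour.
// Inputs are illustrative; plug in your own measured throughput.
function costPerMillionTokens(
  gpuDollarsPerHour: number,
  tokensPerSecond: number, // aggregate throughput while the GPU is busy
  utilization: number, // fraction of paid hours spent serving, in (0, 1]
): number {
  if (tokensPerSecond <= 0 || utilization <= 0 || utilization > 1) {
    throw new RangeError("throughput must be > 0 and utilization in (0, 1]");
  }
  const tokensPerPaidHour = tokensPerSecond * 3600 * utilization;
  return (gpuDollarsPerHour / tokensPerPaidHour) * 1_000_000;
}

// $2/hr GPU at batch-1 speed (100 tokens/sec), busy 100% of the time
const batchOne = costPerMillionTokens(2, 100, 1.0);
// Same GPU and throughput, but busy only 20% of the time
const batchOneIdle = costPerMillionTokens(2, 100, 0.2);

console.log(batchOne.toFixed(2)); // 5.56
console.log(batchOneIdle.toFixed(2)); // 27.78
```

The only way to move the per-token number is to raise throughput (batching) or raise utilization; the hourly price is fixed either way.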

The honest cost model

Let me run real numbers for four scenarios.

Scenario A — startup, 5M tokens/month

Cost via Claude Sonnet: ~$15-50/month all-in, depending on input/output mix.

Cost to self-host Llama 3 70B with reasonable redundancy:

Two A100 instances at $1.50/hr each on a discount provider: $2160/month
Engineering time to keep it running: ~2 hours/week × $100/hr = $800/month
Total: ~$3000/month for lower quality than Claude Sonnet

Verdict: Claude API by roughly 60-200x, depending on the input/output mix. Self-hosting at this volume is paying for the privilege of having worse output.
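
The self-hosting total here is just instance-hours plus engineering hours. A rough sketch of that sum, assuming a 720-hour month and four working weeks per month (both simplifications of ours, as are the type and function names):

```typescript
interface SelfHostScenario {
  instances: number;
  dollarsPerInstanceHour: number;
  engineeringHoursPerWeek: number;
  engineeringDollarsPerHour: number;
}

const HOURS_PER_MONTH = 720; // assumption: 30-day month, always-on instances
const WEEKS_PER_MONTH = 4; // assumption: rounds down slightly

function selfHostMonthlyCost(s: SelfHostScenario): number {
  const gpu = s.instances * s.dollarsPerInstanceHour * HOURS_PER_MONTH;
  const engineering =
    s.engineeringHoursPerWeek * WEEKS_PER_MONTH * s.engineeringDollarsPerHour;
  return gpu + engineering;
}

// Scenario A: two A100s at $1.50/hr, 2 engineering hours/week at $100/hr
const scenarioA = selfHostMonthlyCost({
  instances: 2,
  dollarsPerInstanceHour: 1.5,
  engineeringHoursPerWeek: 2,
  engineeringDollarsPerHour: 100,
});

console.log(scenarioA); // 2960
```

The same function covers the larger scenarios by changing the inputs; note that nothing in it depends on token volume, which is the whole problem at low traffic.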

Scenario B — growing SaaS, 50M tokens/month

Cost via Claude Sonnet: ~$300-1000/month.

Cost to self-host Llama 3 70B:

Same two A100s: $2160/month
Engineering overhead: ~5 hours/week × $100/hr = $2000/month
Total: ~$4200/month

Verdict: Claude API still wins by roughly 4-14x. Crossover not yet reached.

Scenario C — high-volume, 500M tokens/month

Cost via Claude Sonnet: ~$3000-10000/month.

Cost to self-host with proper batching on H100s:

Two H100 instances at $3/hr each: $4320/month
Engineering overhead: ~10 hours/week × $100/hr = $4000/month
Total: ~$8300/month

Verdict: Self-hosting starts to compete. If your output mix is heavy and your utilization is high, this is where the crossover happens.

Scenario D — enterprise, 5B tokens/month, consistent load

Claude API: $30k-100k/month.

Self-hosted: cluster of 8-16 H100s with proper batching, ~$30-60k/month. Plus a dedicated infra engineer.

Verdict: Self-hosting wins materially, but only if the load is consistent. Spiky load drops utilization and the math reverts.
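
One way to sanity-check where the crossover lands is to divide a fixed self-hosting bill by a blended API price. The sketch below is ours and makes two loud assumptions: the API price is a simple mix of the $3 and $15 list prices quoted earlier, and the self-hosting bill stays flat as volume grows (true only while the same hardware can absorb the traffic).

```typescript
const INPUT_PRICE = 3; // $ per 1M input tokens, from the comparison above
const OUTPUT_PRICE = 15; // $ per 1M output tokens, from the comparison above

// outputShare is the fraction of all tokens that are output tokens (0..1).
function blendedPricePerMillion(outputShare: number): number {
  if (outputShare < 0 || outputShare > 1) {
    throw new RangeError("outputShare must be between 0 and 1");
  }
  return INPUT_PRICE * (1 - outputShare) + OUTPUT_PRICE * outputShare;
}

// Monthly volume (in millions of tokens) at which the API bill equals
// a fixed self-hosting bill.
function breakEvenMillionTokens(
  selfHostMonthlyDollars: number,
  outputShare: number,
): number {
  return selfHostMonthlyDollars / blendedPricePerMillion(outputShare);
}

// Scenario C's ~$8300/month self-hosting bill
const outputHeavy = breakEvenMillionTokens(8300, 1.0);
const inputHeavy = breakEvenMillionTokens(8300, 0.25);

console.log(Math.round(outputHeavy)); // 553
console.log(Math.round(inputHeavy)); // 1383
```

Under these assumptions an output-heavy workload breaks even near Scenario C's volume, while an input-heavy one needs well over twice as many tokens before self-hosting pays off.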

The utilization problem

The most underappreciated factor. A GPU you bought (or reserved) costs the same whether you use it or not. If your traffic is bursty, say 10x peak vs trough, and you size the fleet for peak, utilization falls toward 10% whenever traffic sits near the trough, and your effective cost approaches 10x the headline.

The API does not have this problem. Anthropic absorbs the burst by running a giant fleet shared across customers. You pay only for tokens used.
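
How bad the burst penalty is depends on how long you spend at the trough, not just on the peak-to-trough ratio. A small sketch, assuming you provision exactly for the peak hour; the hourly profile is made up for illustration:

```typescript
// Utilization of a fleet sized for the busiest hour: mean load / peak load.
function utilizationAtPeakProvisioning(hourlyLoad: number[]): number {
  if (hourlyLoad.length === 0) {
    throw new RangeError("need at least one load sample");
  }
  const peak = Math.max(...hourlyLoad);
  if (peak <= 0) {
    throw new RangeError("peak load must be positive");
  }
  const mean = hourlyLoad.reduce((sum, x) => sum + x, 0) / hourlyLoad.length;
  return mean / peak;
}

// Hypothetical day: 4 busy hours at 1000 req/hr, 20 quiet hours at 100 req/hr
const profile: number[] = Array.from({ length: 24 }, (_, hour) =>
  hour >= 9 && hour < 13 ? 1000 : 100,
);

const utilization = utilizationAtPeakProvisioning(profile);
const costMultiplier = 1 / utilization;

console.log(utilization); // 0.25
console.log(costMultiplier); // 4
```

With four busy hours the multiplier is 4x rather than 10x; shrink the busy window and it climbs toward the full peak-to-trough ratio.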

Mitigations for self-hosters:

Spot instances (cheaper but interruptible)
Multi-model serving (the GPU runs Llama for app A and Mistral for app B, raising utilization)
Batch scheduling (queue async jobs to fill idle time)
vLLM PagedAttention (lets one GPU serve many concurrent requests efficiently)

None of these are free. They are all real engineering.
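
As one example of what that engineering looks like, here is a toy sketch of batch scheduling: interactive traffic is served first and an async backlog fills whatever capacity is left. All names and numbers are hypothetical, and a real scheduler would also deal with request sizes, preemption, and deadlines.

```typescript
interface TickPlan {
  interactive: number; // tokens served for live traffic
  batch: number; // tokens served from the async backlog
  idle: number; // capacity left unused
}

// Serve interactive demand first, then fill leftover capacity from the backlog.
function planTick(capacity: number, interactiveDemand: number, backlog: number): TickPlan {
  const interactive = Math.min(capacity, interactiveDemand);
  const batch = Math.min(capacity - interactive, backlog);
  return { interactive, batch, idle: capacity - interactive - batch };
}

function simulate(capacity: number, demand: number[], initialBacklog: number) {
  if (demand.length === 0) {
    throw new RangeError("need at least one tick of demand");
  }
  let backlog = initialBacklog;
  let used = 0;
  const plans: TickPlan[] = [];
  for (const d of demand) {
    const plan = planTick(capacity, d, backlog);
    backlog -= plan.batch;
    used += plan.interactive + plan.batch;
    plans.push(plan);
  }
  return { plans, remainingBacklog: backlog, utilization: used / (capacity * demand.length) };
}

// Four ticks of bursty interactive demand against a capacity of 1000 per tick
const demand = [100, 100, 1000, 100];
const withoutBatch = simulate(1000, demand, 0);
const withBatch = simulate(1000, demand, 2000);

console.log(withoutBatch.utilization, withBatch.utilization);
```

In this toy run the backlog lifts utilization from about a third to over 80% without adding hardware, which is the effect the mitigation is after.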

When self-hosting genuinely wins

The cases:

Strict data residency / privacy. You cannot send data to a third party. Self-hosted is the only option.
Custom fine-tunes. You've actually fine-tuned a model on your data (see RAG vs fine-tuning) and need to serve the fine-tuned weights. Some APIs let you fine-tune; many don't.
High-volume, low-latency, predictable load. Real-time interactive applications at scale where the API's per-token cost adds up to millions per year.
Edge / on-device. Privacy-first apps that run the model locally. A smaller Llama variant on the user's phone.

If your situation is none of these, the API wins for most teams.

The quality gap

Llama 3 70B and Llama 3 405B are good but they are not Claude Sonnet 4.6. On most general-reasoning benchmarks, Llama 3 405B trails Sonnet by 5-15%, and 70B trails by more. For narrow tasks where you fine-tune the open model on your domain, the gap closes. For general agentic work, it doesn't.

The cost of being wrong about quality is rarely included in cost comparisons. A SaaS that ships Llama-powered responses that are 20% worse than the Claude version may lose 5% of customers to perceived quality drop. At $200/month per customer, that loss can exceed the entire infra savings.
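
The comparison is easy to put in numbers. A minimal sketch, using the 5% churn and $200/month figures from the paragraph above; the customer count and the infra savings are hypothetical inputs of ours:

```typescript
// Monthly revenue lost if a quality drop causes some customers to leave.
function monthlyRevenueLost(
  customers: number,
  churnRate: number, // fraction of customers lost, e.g. 0.05
  pricePerCustomer: number, // $ per customer per month
): number {
  return customers * churnRate * pricePerCustomer;
}

// Positive means switching models still pays off; negative means it does not.
function netMonthlyBenefit(infraSavings: number, revenueLost: number): number {
  return infraSavings - revenueLost;
}

// Hypothetical: 1,000 customers, and $5,000/month saved on inference
const lost = monthlyRevenueLost(1000, 0.05, 200);
const net = netMonthlyBenefit(5000, lost);

console.log(Math.round(lost)); // 10000
console.log(Math.round(net)); // -5000
```

In this made-up case the churn costs twice what the cheaper model saves, so the quality assumption deserves its own line in the spreadsheet.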

The hybrid pattern

What we actually ship for customers who care about both:

High-quality / customer-facing → Claude API (Sonnet for general, Opus for hardest reasoning)
High-volume / backend (classification, embeddings, batch summarization) → self-hosted smaller models
Embeddings → almost always self-hosted (open embedding models are competitive and the volume is high)

A customer-support agent might use Claude Sonnet for the user-visible response, an open embedding model for the RAG retrieval, and a small open classifier for the "should we escalate to a human" decision. Each step uses the right tool for its cost/quality profile.
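
The routing itself can be a lookup table. The sketch below is illustrative only: the task kinds, the `pickBackend` helper, and the model labels are placeholders we made up, not real model IDs or a description of any particular production setup.

```typescript
type TaskKind =
  | "customer_reply"
  | "hard_reasoning"
  | "escalation_check"
  | "batch_summary"
  | "embedding";

interface Backend {
  host: "claude-api" | "self-hosted";
  model: string; // placeholder label, not a real model ID
}

// One entry per task kind; the Record type forces every kind to be mapped.
const ROUTES: Record<TaskKind, Backend> = {
  customer_reply: { host: "claude-api", model: "sonnet" },
  hard_reasoning: { host: "claude-api", model: "opus" },
  escalation_check: { host: "self-hosted", model: "small-open-classifier" },
  batch_summary: { host: "self-hosted", model: "small-open-llm" },
  embedding: { host: "self-hosted", model: "open-embedding-model" },
};

function pickBackend(kind: TaskKind): Backend {
  return ROUTES[kind];
}

// The customer-support flow described above touches three backends.
const supportFlow: TaskKind[] = ["embedding", "escalation_check", "customer_reply"];
for (const step of supportFlow) {
  const { host, model } = pickBackend(step);
  console.log(`${step} -> ${host} (${model})`);
}
```

Keeping the mapping in one table makes it cheap to move a task between backends when the cost or quality picture changes.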

Inference stack — what we use for self-hosting

When self-hosting is the right call:

vLLM as the inference server. PagedAttention, continuous batching, broad model support. The default.
Modal or Replicate as the GPU host. Both are pay-per-second and avoid the "always-on GPU at 20% utilization" trap.
Banana or Lambda Labs for cheaper but less abstracted hosting.
Together.ai or Fireworks if you want "self-hosted API" — they run open-weight models on their infra and bill per token, often cheaper than direct hosting and without the ops.

That last point is the secret most teams miss. Together and Fireworks let you run Llama 3 70B at per-token pricing competitive with Claude Haiku, without operating any GPUs yourself. For most "I want open weights" needs, this is the right answer in 2026.

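```typescript
// Together.ai exposes an OpenAI-compatible API, so the official OpenAI SDK
// works once baseURL points at Together.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.TOGETHER_API_KEY, // set this in your environment
  baseURL: "https://api.together.xyz/v1",
});

// Top-level await: run this as an ES module.
const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.3-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "..." }],
});

// The assistant's reply is in the first choice.
console.log(response.choices[0].message.content);
```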

Engineering cost is the hidden line

The biggest item in every self-hosting comparison we run for customers is engineering time. A real production self-hosting setup has:

vLLM tuned for your traffic profile
Health-check monitoring + auto-restart
A model-loading pipeline (downloading and validating weights)
GPU OOM detection and graceful degradation
Multi-region failover if uptime matters
A monitoring dashboard with tokens/sec, queue depth, error rate

Each of those is real work. Cumulatively it's an engineer-month to set up and 10-20% of an engineer's ongoing time to maintain. At $150k/year all-in for a senior, that's $15-30k/year of overhead.
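
The overhead figure is simple arithmetic, sketched here with the numbers from the paragraph above. The function name and the first-year total are our additions.

```typescript
// Setup cost plus ongoing maintenance for one year of self-hosting.
function engineeringOverhead(
  annualCostDollars: number, // all-in cost of the engineer
  setupMonths: number, // one-off setup effort, in engineer-months
  ongoingPercent: number, // share of the engineer's time spent on upkeep
) {
  const setup = (annualCostDollars / 12) * setupMonths;
  const ongoingPerYear = (annualCostDollars * ongoingPercent) / 100;
  return { setup, ongoingPerYear, firstYear: setup + ongoingPerYear };
}

const low = engineeringOverhead(150_000, 1, 10);
const high = engineeringOverhead(150_000, 1, 20);

console.log(low.ongoingPerYear, high.ongoingPerYear); // 15000 30000
console.log(low.firstYear, high.firstYear); // 27500 42500
```

Adding the one-off setup month puts the first year at roughly $27-43k before any GPU is paid for.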

The API has none of this. You make HTTP calls. They handle the GPUs.

Need help deciding API vs self-hosted?

We've run the math for multiple customers and we'll run it for you — including realistic utilization and quality assumptions.


FAQ

What about cost of using Together / Fireworks for open-weight models?

Together's Llama 3.3 70B is ~$0.88/1M tokens (input). Compared to Claude Sonnet at $3/1M, that's a 3.4x discount with somewhat worse quality. For tasks where the quality gap doesn't matter, it's a real saving.

Does the math change for embedding workloads?

Yes, materially. Open embedding models (BGE, e5) are competitive on quality, as are hosted proprietary options such as Cohere Embed v3 via its API, and the volume is usually high. Self-host or use a hosted open model; both are cheaper than OpenAI's text-embedding-3-large at high volume.

What about Anthropic's batch API?

It gives a 50% discount on input and output tokens for non-time-sensitive jobs (results come back asynchronously, typically within 24 hours). For asynchronous workloads (overnight summarization, periodic data enrichment), this changes the math meaningfully: the crossover with self-hosting moves much higher.
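
A quick sketch of how much the bill moves as more traffic shifts to batch. The 50% discount is from the answer above; the $6 blended price per 1M tokens and the 500M monthly volume are hypothetical inputs, and the function is ours.

```typescript
const BATCH_DISCOUNT = 0.5; // from the text: 50% off input and output tokens

// batchShare is the fraction of tokens that can wait for the batch API (0..1).
function apiMonthlyCost(
  millionTokens: number,
  pricePerMillion: number,
  batchShare: number,
): number {
  if (batchShare < 0 || batchShare > 1) {
    throw new RangeError("batchShare must be between 0 and 1");
  }
  const standard = millionTokens * (1 - batchShare) * pricePerMillion;
  const batched =
    millionTokens * batchShare * pricePerMillion * (1 - BATCH_DISCOUNT);
  return standard + batched;
}

const allRealtime = apiMonthlyCost(500, 6, 0);
const halfBatch = apiMonthlyCost(500, 6, 0.5);
const allBatch = apiMonthlyCost(500, 6, 1);

console.log(allRealtime, halfBatch, allBatch); // 3000 2250 1500
```

If everything can wait, the API bill halves, so a fixed self-hosting bill needs twice the volume to break even.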

Does prompt caching change things?

Yes. Prompt caching cuts input-token cost on repeated context by ~90%. For agents with large static system prompts, this is huge — see building AI agents with Claude tool use. The API's cached pricing makes self-hosting less competitive at most scales.
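
To see what "~90%" does to the input price, here is a rough sketch. It takes the 90% figure from the answer above and the $3 / 1M input price quoted earlier, and it deliberately ignores any surcharge for writing to the cache; the function name and the 80% cached share are our assumptions.

```typescript
const CACHED_READ_MULTIPLIER = 0.1; // cached input billed at ~10% of base, per the text

// cachedShare is the fraction of input tokens served from the cache (0..1).
function effectiveInputPrice(
  basePricePerMillion: number,
  cachedShare: number,
): number {
  if (cachedShare < 0 || cachedShare > 1) {
    throw new RangeError("cachedShare must be between 0 and 1");
  }
  const uncached = 1 - cachedShare;
  return basePricePerMillion * (uncached + cachedShare * CACHED_READ_MULTIPLIER);
}

const noCache = effectiveInputPrice(3, 0);
const mostlyCached = effectiveInputPrice(3, 0.8);

console.log(noCache.toFixed(2)); // 3.00
console.log(mostlyCached.toFixed(2)); // 0.84
```

An agent whose prompt is 80% static context would pay under a third of list price on input, which is the kind of shift that pushes the self-hosting crossover further out.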

Can I run a model on a Mac M2?

Yes — llama.cpp runs decent-sized models on Apple Silicon. Useful for local dev or single-user tools. Not viable for a multi-user SaaS — you can't horizontally scale a single Mac.

What's the cheapest way to test an open-weight model?

Together.ai or Replicate. Together bills per token and Replicate bills per second of GPU time, so neither needs an upfront commitment. You can test Llama 3 70B for $0.50 and decide whether the quality meets your bar before committing to infrastructure.

Does latency differ?

API latency: first token ~300-800ms depending on region. Self-hosted: as low as 50ms first-token on warm GPUs. If your UX cares about TTFT below 200ms, self-hosting on a co-located GPU wins.

Is there a privacy gain from self-hosting?

Yes. Your prompts never leave your infrastructure. For regulated industries (healthcare, legal) this can be the deciding factor regardless of cost.

Keep reading

Building AI agents with Claude tool use in production. What changes when an AI agent moves from demo to production: tool-call loops, error recovery, observability, cost controls, and the failure modes that only appear at scale. (9 min read)

RAG vs fine-tuning: when to pick each (and when to pick both). A practical decision framework for retrieval-augmented generation vs fine-tuning vs prompt engineering, with cost, latency, and update-frequency trade-offs. (9 min read)

Choosing a vector database: pgvector vs Pinecone vs Qdrant. An honest comparison of the three serious choices for production vector search in 2026: what each one is good at, what they're not, and why pgvector wins more often than the marketing suggests. (9 min read)