RAG vs fine-tuning: when to pick each (and when to pick both)

A practical decision framework for retrieval-augmented generation vs fine-tuning vs prompt engineering — with cost, latency, and update-frequency trade-offs.

YAEL Engineering · 28 Mar 2026 · 9 min read · 1,713 words

RAG is the right answer when your knowledge changes faster than you can fine-tune. Fine-tuning is the right answer when you need the model to adopt a specific behavior — a tone, a format, a way of reasoning — that you can't get from prompting alone. They are not competing solutions. They sit at different layers of the stack and most production systems use both. The teams that get this wrong burn months trying to fine-tune a model to recall facts that change weekly.

Quick decision framework, then the longer take.

The 30-second decision tree

Need the model to know facts → RAG.
Facts change weekly or faster → RAG, always.
Need the model to write in a specific voice / format → fine-tune (or a strong system prompt first).
Need the model to use specific tools / make specific decisions → prompt + tools, fine-tune only if prompting plateaus.
Need both up-to-date facts AND custom behavior → RAG on top of a fine-tuned model.

The single most common mistake is picking fine-tuning to make the model "know our docs." Don't. Fine-tuning is bad at teaching facts. It is good at teaching format and tone.
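As a rough illustration, the decision tree above can be written as a small function. This is only a sketch: the `Needs` fields and the `recommend` name are hypothetical, and the "try a system prompt first" rule is encoded as "fine-tune only once prompting has plateaued".

```typescript
// Hypothetical input shape; the field names are illustrative, not from any library.
interface Needs {
  needsFacts: boolean;         // must the model know your facts?
  factsChangeWeekly: boolean;  // do those facts change weekly or faster?
  needsCustomVoice: boolean;   // specific voice or output format?
  needsTools: boolean;         // specific tools or decisions?
  promptingPlateaued: boolean; // have evals shown prompting stopped improving?
}

type Approach = "rag" | "fine-tune" | "prompt" | "prompt+tools";

function recommend(n: Needs): Approach[] {
  const out = new Set<Approach>();
  // Facts, especially fast-changing ones, always go through retrieval.
  if (n.needsFacts || n.factsChangeWeekly) out.add("rag");
  // Voice or format: strong system prompt first, fine-tune once that plateaus.
  if (n.needsCustomVoice) out.add(n.promptingPlateaued ? "fine-tune" : "prompt");
  // Tools and decisions: prompt + tools, fine-tune only if prompting plateaus.
  if (n.needsTools) {
    out.add("prompt+tools");
    if (n.promptingPlateaued) out.add("fine-tune");
  }
  if (out.size === 0) out.add("prompt");
  return Array.from(out);
}
```

A request that needs fresh facts and a custom voice after prompting has plateaued comes back as both "rag" and "fine-tune", which is the hybrid case.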

What RAG actually is

A pipeline that takes the user's question, retrieves the most relevant chunks of your knowledge corpus, and stuffs them into the prompt before generating an answer.

async function answerWithRag(question: string) {
  // 1. Embed the question
  const queryVector = await embed(question);

  // 2. Find top-K similar chunks
  const chunks = await vectorDB.query({
    vector: queryVector,
    topK: 8,
    filter: { type: "documentation" },
  });

  // 3. Build a prompt with the chunks as context
  const context = chunks.map((c, i) => `[${i}] ${c.text}`).join("\n\n");
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: `Answer using only the context provided. Cite chunks by [number]. If the context doesn't contain the answer, say so.`,
    messages: [{
      role: "user",
      content: `Context:\n${context}\n\nQuestion: ${question}`,
    }],
  });

  return response;
}

That's it. Three steps. Most production complexity lives in step 2 (chunking, reranking, hybrid search) and step 3 (citation handling, hallucination detection).
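To make step 2 less of a black box, here is a minimal in-memory sketch of what a top-K similarity query does, assuming cosine similarity over plain number arrays. `StoredChunk`, `cosineSimilarity`, and `topK` are hypothetical names for illustration; a real vector database uses an approximate index rather than scanning every row.

```typescript
interface StoredChunk {
  id: string;
  text: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force scan: score every chunk, sort descending, keep the best k.
function topK(query: number[], store: StoredChunk[], k: number): StoredChunk[] {
  return store
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```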

What fine-tuning actually is

You take an existing base model, show it a few thousand examples of input/output pairs in your desired format, and update the weights so it learns to imitate the pattern. The model walks away with a new disposition — it tends to respond in that style by default, even on inputs it hasn't seen.

Fine-tuning is not memorization. The model does not reliably recall facts from the training set. It learns the shape of what good answers look like, not the answers themselves.
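For a sense of what "input/output pairs" look like on disk, here is a sketch that serializes curated pairs into chat-style JSONL, one example per line. The `messages` / `role` / `content` layout is an assumption modeled on common chat fine-tuning formats; check the format your training toolchain expects. `TrainingPair` and `toJsonl` are hypothetical names.

```typescript
interface TrainingPair {
  input: string;  // what the user sends
  output: string; // the response you want the model to imitate
}

// One JSON object per line; the field layout is an assumed chat-style format.
function toJsonl(pairs: TrainingPair[], systemPrompt: string): string {
  return pairs
    .map((pair) =>
      JSON.stringify({
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: pair.input },
          { role: "assistant", content: pair.output },
        ],
      }),
    )
    .join("\n");
}
```

Note that every example teaches the same shape of answer. That is the point: the pairs carry format and tone, not a fact table.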

When fine-tuning beats prompting

The honest cases:

Strict format compliance. You need every response to be valid JSON with a specific schema, on every input, with no preamble. A fine-tune of 500-2000 examples will do this more reliably than any prompt.
Tone or voice. You want the model to write like your brand. A fine-tune of 200-500 well-curated examples nails this; a prompt approximates it.
Domain-specific reasoning patterns. Medical chart-to-summary, legal contract clause categorization. The reasoning shape is unusual enough that prompting requires very long instructions.
Cost reduction. A small fine-tuned model can sometimes match a larger model on a narrow task at 1/10th the inference cost.

If your need doesn't match one of these, don't fine-tune. Prompt better.

When RAG is the only sensible answer

Almost any "answer questions about our documentation / database / Notion / Confluence" use case. Three reasons:

Facts change. Your docs got updated this morning. The fine-tuned model was trained last quarter. RAG retrieves the new docs at query time. Fine-tuning requires a retrain.
Provenance. Users need to know where the answer came from. RAG gives you a natural citation surface — the retrieved chunks have URLs. Fine-tuned models can't tell you where their answer came from.
Recall reliability. Fine-tuned models hallucinate plausible-sounding facts that weren't in the training set. RAG can be configured to refuse answering when no good chunks were retrieved.

We've shipped RAG into customer-facing chat surfaces, internal Slackbots, and admin agents. Every one of them needed updateable knowledge. None of them got fine-tuned.

The hybrid: RAG on a fine-tuned model

The best of both, when warranted. Fine-tune a small model to respond in your format / voice. At inference, retrieve relevant docs and pass them in as context. The model handles formatting; RAG handles facts.

We do this rarely, usually for high-volume customer support agents where the per-message cost matters. For most production agents, RAG on a stock Claude Sonnet is the right starting point. See "Self-hosting Llama 3 vs Claude API: the cost breakdown" for when self-hosted + fine-tune crosses over economically.

RAG, hard mode — what production looks like

The naive RAG pipeline (chunk → embed → top-K → answer) works in demos. In production you need:

Chunking that respects structure

Chunking by raw token count destroys context. Chunking by document structure (sections, headings, paragraphs) preserves meaning.

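// Chunk and estimateTokens are assumed to be defined elsewhere in the codebase.
function chunkByMarkdown(doc: string): Chunk[] {
  const maxTokens = 800;
  // Split before every H1-H3 heading so each section starts with its own heading
  const sections = doc.split(/\n(?=#{1,3} )/);
  const chunks: Chunk[] = [];
  for (const section of sections) {
    const tokens = estimateTokens(section);
    if (tokens <= maxTokens) {
      if (section.trim()) chunks.push({ text: section, tokens });
      continue;
    }
    // Long section: split by paragraph but keep the heading attached to every chunk
    const heading = /^#{1,3} .*/.exec(section)?.[0] ?? "";
    const prefix = heading ? `${heading}\n\n` : "";
    // Strip the heading from the body so it is not emitted twice
    const body = section.slice(heading.length).trim();
    let current = prefix;
    for (const p of body.split(/\n\n+/)) {
      // Flush only when the chunk already holds body text, never a heading-only chunk
      if (current !== prefix && estimateTokens(current + p) > maxTokens) {
        const text = current.trimEnd();
        chunks.push({ text, tokens: estimateTokens(text) });
        current = prefix;
      }
      current += `${p}\n\n`;
    }
    if (current !== prefix) {
      const text = current.trimEnd();
      chunks.push({ text, tokens: estimateTokens(text) });
    }
  }
  return chunks;
}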

Reranking

Vector search returns the top-K candidates. A cross-encoder reranker (Cohere Rerank, BGE reranker) re-orders them by true relevance. This is the single biggest quality improvement available — it typically beats embedding model upgrades.
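The reranking step itself is small; the quality comes from the scoring model. The sketch below shows the re-ordering logic with the scorer injected as a function. `rerank`, `Candidate`, and `PairScorer` are hypothetical names, and `wordOverlapScore` is a toy stand-in, not a cross-encoder. In practice the scorer would be an async call to a reranking model or API.

```typescript
interface Candidate {
  id: string;
  text: string;
}

// Scores one (query, passage) pair; higher means more relevant.
type PairScorer = (query: string, passage: string) => number;

// Re-order the vector-search candidates by the scorer and keep the best topN.
function rerank(query: string, candidates: Candidate[], score: PairScorer, topN: number): Candidate[] {
  return candidates
    .map((candidate) => ({ candidate, relevance: score(query, candidate.text) }))
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, topN)
    .map((scored) => scored.candidate);
}

// Toy scorer for demonstration only: fraction of query words found in the passage.
function wordOverlapScore(query: string, passage: string): number {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) return 0;
  const passageWords = new Set(passage.toLowerCase().split(/\W+/));
  return words.filter((w) => passageWords.has(w)).length / words.length;
}
```

A common arrangement is to retrieve more candidates than you need (say, several times the final count) and let the reranker pick the few that go into the prompt.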

Hybrid search

Dense vectors miss keyword matches. Pure BM25 misses semantic matches. Hybrid (BM25 + dense, reciprocal rank fusion) consistently outperforms either alone by 5-15%. Postgres with pgvector + tsvector handles both natively. See "Choosing a vector database" for the trade-offs.
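Reciprocal rank fusion is simple enough to show in full. Each document gets 1 / (k + rank) from every ranked list it appears in, and the sums decide the final order. The sketch below assumes each retriever returns an ordered list of chunk IDs; the function name is hypothetical and k = 60 is the conventional default, not a tuned value.

```typescript
// Fuse several ranked ID lists (best first) into one ranking.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // index is 0-based, so rank = index + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Usage: one list from keyword search, one from dense vector search.
const bm25Ranking = ["a", "b", "c"];
const denseRanking = ["b", "c", "d"];
const fused = reciprocalRankFusion([bm25Ranking, denseRanking]);
// fused is ["b", "c", "a", "d"]: documents found by both retrievers rise to the top
```

Because only ranks are used, the fusion needs no score normalization between BM25 and cosine similarity.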

Citation surfaces

Users trust an answer with inline [3]-style citations 2-3x more than an uncited answer. Always render citations as clickable links. Always.
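One way to turn citation markers into links, sketched under the assumption that the model cites with zero-based [n] markers matching the chunk order sent in the prompt, and that each chunk carries a source URL. `CitedChunk`, `escapeHtml`, and `renderCitations` are hypothetical names.

```typescript
interface CitedChunk {
  url: string;
  title: string;
}

// Escape text before placing it inside an HTML attribute.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Replace each [n] marker with a link to chunk n.
function renderCitations(answer: string, chunks: CitedChunk[]): string {
  return answer.replace(/\[(\d+)\]/g, (marker: string, digits: string) => {
    const chunk = chunks[Number(digits)];
    // Markers that do not map to a retrieved chunk are left untouched.
    if (!chunk) return marker;
    return `<a href="${escapeHtml(chunk.url)}" title="${escapeHtml(chunk.title)}">${marker}</a>`;
  });
}
```

A marker that points at no retrieved chunk is worth logging: it usually means the model invented a citation.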

"I don't know" detection

When the retrieved chunks don't actually contain the answer, the model will often hallucinate one anyway. Mitigation: an explicit step that asks the model to score whether the context is sufficient before answering. If insufficient, return "I don't know" with a suggestion to refine the question.
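The gate can be sketched as three pure pieces: build the scoring prompt, parse the model's reply, and decide. The model call itself is left out. The 0-10 scale, the threshold of 7, and all names here are illustrative assumptions to tune against your own evals.

```typescript
// Prompt for a separate, cheap model call made before the answering call.
function buildSufficiencyPrompt(question: string, context: string): string {
  return [
    "Rate from 0 to 10 how completely the context answers the question.",
    "Reply with the number only.",
    "",
    `Context:\n${context}`,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Take the first number in the reply; treat anything unparseable as 0 (insufficient).
function parseSufficiencyScore(reply: string): number {
  const match = /\d+(\.\d+)?/.exec(reply);
  if (!match) return 0;
  return Math.min(10, Math.max(0, Number(match[0])));
}

const FALLBACK_MESSAGE = "I don't know. Try rephrasing the question or adding more detail.";

// Decide whether to run the answering call or return the fallback.
function gate(reply: string, threshold = 7): { proceed: boolean; message?: string } {
  const score = parseSufficiencyScore(reply);
  return score >= threshold ? { proceed: true } : { proceed: false, message: FALLBACK_MESSAGE };
}
```

Failing closed on an unparseable reply is deliberate: a wrong "I don't know" costs less than a confident hallucination.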

When fine-tuning is genuinely worth the effort

Honest accounting. Fine-tuning a model costs:

Engineering time to curate the training set (typically 2-6 weeks)
Compute to actually fine-tune (varies wildly)
A new evaluation pipeline because old prompt-eval rigs don't apply
Inference infrastructure if you're self-hosting

Versus prompting, which costs:

One engineering day to iterate on the prompt
Anthropic's API bill

The crossover where fine-tuning pays back: you're running >1M API calls per month, your prompt is hitting a quality ceiling, and you can curate clean training data. If any one of those three is missing, don't.
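The volume condition is plain arithmetic, and it is worth running with your own numbers before committing. The sketch below computes how many months it takes for per-call savings to repay the upfront fine-tuning cost. The function and parameter names are hypothetical, and the figures in the usage lines are placeholders, not measurements.

```typescript
interface CostInputs {
  upfrontCost: number;       // curation + training + eval pipeline, one-time
  callsPerMonth: number;
  apiCostPerCall: number;    // current cost per call on the hosted API
  tunedCostPerCall: number;  // marginal cost per call on the fine-tuned model
  tunedFixedMonthly: number; // hosting and ops for the fine-tuned model
}

// Months until cumulative savings cover the upfront cost; Infinity if they never do.
function breakEvenMonths(c: CostInputs): number {
  const monthlySavings =
    c.callsPerMonth * (c.apiCostPerCall - c.tunedCostPerCall) - c.tunedFixedMonthly;
  return monthlySavings > 0 ? c.upfrontCost / monthlySavings : Infinity;
}

// Placeholder numbers for illustration only.
const highVolume = breakEvenMonths({
  upfrontCost: 40_000,
  callsPerMonth: 1_000_000,
  apiCostPerCall: 0.01,
  tunedCostPerCall: 0.001,
  tunedFixedMonthly: 3_000,
});
const lowVolume = breakEvenMonths({
  upfrontCost: 40_000,
  callsPerMonth: 100_000,
  apiCostPerCall: 0.01,
  tunedCostPerCall: 0.001,
  tunedFixedMonthly: 3_000,
});
```

With these placeholder inputs the high-volume case pays back in well under a year, while the low-volume case never does because fixed hosting costs exceed the per-call savings.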

A real example

We built an internal agent for a customer that needed to answer support questions about their product docs. First pass — prompt-engineered Claude Sonnet with a static system prompt containing the FAQ. Result: 70% answer-accuracy, frequent hallucinations on edge cases.

Switched to RAG. Postgres with pgvector, 800-token chunks, top-8 retrieval, Cohere Rerank, citation rendering. Result: 91% answer-accuracy, hallucinations dropped to <2%, and updates to docs propagated within minutes.

We considered fine-tuning to match the customer's brand voice. Decided against — the marginal quality gain didn't justify the operational overhead. A 300-token style guide in the system prompt got 95% of the way there.

That decision is the right shape for most cases.

FAQ

Can I fine-tune Claude?

Anthropic offers fine-tuning only for some models in specific channels (for example, through Amazon Bedrock). For most teams the answer is no, but you can fine-tune open-weight models (Llama, Mistral) and serve them on your own inference infrastructure. We do this only when the volume justifies it.

Does RAG hallucinate?

Less than no-RAG, but still some. The model can confuse retrieved chunks, blend unrelated facts, or paraphrase incorrectly. Citation rendering plus an "is the context sufficient" gate cuts this dramatically.

How big should chunks be?

500-1000 tokens for most knowledge bases. Smaller for FAQ-style content, larger for technical docs where context matters. Always overlap chunks by 50-100 tokens so a relevant sentence isn't split across two retrievals.
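Overlap is a sliding window: each chunk starts `size - overlap` tokens after the previous one. A minimal sketch, assuming the text is already tokenized into an array (the function name is hypothetical, and whitespace splitting in the usage line is only a rough stand-in for a real tokenizer):

```typescript
// Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one.
function chunkWithOverlap(tokens: string[], size: number, overlap: number): string[][] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const chunks: string[][] = [];
  const step = size - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size));
    // Stop once a window has reached the end, to avoid a trailing overlap-only chunk.
    if (start + size >= tokens.length) break;
  }
  return chunks;
}

// Usage: 10 tokens, windows of 4 with 1 token of overlap.
const demoTokens = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9".split(" ");
const windows = chunkWithOverlap(demoTokens, 4, 1);
// windows: [t0..t3], [t3..t6], [t6..t9]
```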

What about graph RAG / hierarchical RAG?

Useful for specific domains (legal contracts, multi-document reasoning). Adds significant complexity. Don't reach for it until standard RAG plateaus.

Should I embed code differently from prose?

Yes if your domain is code-heavy. A code-specific embedding model (Voyage Code, Cohere Embed for code) outperforms general-purpose embeddings on programming-language content by 10-20%.

How do I evaluate a RAG system?

Two layers. Retrieval eval: given a question and a known relevant chunk, does top-K contain it? Generation eval: given a question and context, does the answer match a reference answer? Run both in CI on every change. The retrieval eval catches chunking regressions; the generation eval catches prompt regressions.
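The retrieval layer reduces to recall@K over a labeled set of questions. A minimal sketch, with the retriever injected so the same harness can run against any pipeline version. The names are hypothetical, and the retriever is synchronous here for brevity where a real one would be async.

```typescript
interface RetrievalCase {
  question: string;
  relevantChunkId: string; // the chunk a correct retrieval must surface
}

// Returns the IDs of the top-k chunks for a question, best first.
type Retriever = (question: string, k: number) => string[];

// Fraction of cases whose known-relevant chunk appears in the top k.
function recallAtK(cases: RetrievalCase[], retrieve: Retriever, k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    retrieve(c.question, k).some((id) => id === c.relevantChunkId),
  ).length;
  return hits / cases.length;
}
```

In CI, a check such as "recall@8 must not drop below the last released value" turns a chunking regression into a failed build instead of a support ticket.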

What's the cheapest production RAG stack?

Postgres + pgvector + Claude Haiku. Total cost under $20/month for low-volume internal tools. Quality is fine for most use cases.
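For orientation, the Postgres side of that stack can be as small as one table and one query. The sketch below builds both as strings; the table name, column names, and the 1024 dimension are assumptions (match the dimension to your embedding model). `<=>` is pgvector's cosine-distance operator, and the returned `{ text, values }` object is shaped like a node-postgres query config.

```typescript
// Assumed schema: one row per chunk. The dimension must match your embedding model.
const SCHEMA_SQL = `
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
  id bigserial PRIMARY KEY,
  content text NOT NULL,
  embedding vector(1024) NOT NULL
);`;

// pgvector accepts vectors as a text literal such as '[0.1,0.2,0.3]'.
function toVectorLiteral(v: number[]): string {
  return `[${v.join(",")}]`;
}

// Parameterized nearest-neighbor query ordered by cosine distance.
function buildSearchQuery(queryVector: number[], topK: number): { text: string; values: [string, number] } {
  return {
    text:
      "SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
      "FROM chunks ORDER BY embedding <=> $1::vector LIMIT $2",
    values: [toVectorLiteral(queryVector), topK],
  };
}
```

At low volume a sequential scan is often fast enough; an approximate index can be added later without changing the query.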

When does fine-tuning become necessary?

When prompting plateaus and your evals show a consistent quality gap. If your prompt is already 2000 tokens and the model is still misbehaving on edge cases, fine-tuning teaches the pattern more efficiently than longer prompts. This is rare. Most teams haven't actually plateaued; they've stopped iterating on the prompt.
