Fine-tuning vs RAG vs prompt engineering: how to choose without burning cash

The question that hits the AI committee every Tuesday: "are we training our own model?". The honest answer in almost every case is "not yet — and maybe never". Not because fine-tuning is bad. Because it's the most expensive, slowest, and riskiest of the three options to adapt an LLM to a specific business — and there's almost always a cheaper path that solves the problem before you get there.

This text is the decision framework between prompt engineering, RAG, and fine-tuning. It's not technical — it's managerial. The choice between the three defines whether the project delivers value in three weeks or in nine months.

What each one solves, in a sentence

Before the rule, you need clarity on what each technique does.

Prompt engineering is changing how you speak to the model. System instruction, few-shot examples, response structure. Cost: prompt-writer hours. Timeline: days. Risk: low.

RAG (Retrieval-Augmented Generation) is giving the model context it didn't have — fetching relevant snippets from a document base and injecting them at query time. As I argued about RAG in practice, the hard part isn't generating; it's retrieving. Cost: infra + corpus + retrieval. Timeline: 4–8 weeks to production. Risk: medium.

Fine-tuning is changing the model — retraining weights with your own data. Cost: curated training data + compute + iteration. Timeline: 2–6 months. Risk: high (the model can get worse at tasks it used to handle).

The difference isn't only technical. It's what you're willing to invest before you know it'll work. Prompt eng fails cheap. Fine-tuning fails expensive.

The right question is never "which technique is best". It's "what's the cheapest technique that solves the case to 80%". You climb the ladder only when you've exhausted the current rung.

The order that works

The rule we apply before any AI project with business adaptation. Always in this order.

Exhaust prompt engineering first. Before touching the corpus or the model, try solving with a better instruction. Well-chosen few-shot examples lift accuracy by 10–25% in almost every case. Forced response structure (JSON, numbered list) removes ambiguity. Explicit chain-of-thought improves reasoning. Whoever skips this step invests in RAG/fine-tuning to solve a prompt problem. (The same principle applies to prompt engineering for analytics pipelines, where LLM-generated SQL needs the same rigor.)
Climb to RAG when the model needs knowledge it doesn't have. Internal document, company policy, product base, customer history. If the question requires a fact the LLM doesn't know, RAG is the path. Not fine-tuning — fine-tuning teaches patterns, not facts.
Climb to fine-tuning when the problem is style, format, or very specific domain. When the model needs to write in your company's jargon, generate code in your internal standard, or respond in a rare structured format. Fine-tuning changes behavior; it doesn't change knowledge.

The most common mistake: using fine-tuning to solve a RAG problem ("the model doesn't know our rules"). It won't work. A fine-tuned model forgets half the rules in the next week or hallucinates plausible answers about rules that changed.

Real costs — the math nobody runs

The costs of each technique aren't just money. They're team time, operational risk, and difficulty of iterating. Worth cataloging.

Prompt engineering — costs. Writer hours (1–3 days per iteration). Eval set to measure before/after (1–2 weeks to assemble). Inference cost per token, ongoing but small in medium volume. Typical total: USD 1–5k to run a decent use case in pilot.

RAG — costs. Vector infra + indexing + retrieval (USD 100–1k/month depending on volume). Pipeline engineering (4–8 weeks of senior time). Corpus curation (underestimated, often half the effort). Index maintenance (corpus drift, freshness). Typical total: USD 20–60k to production, plus USD 2–7k/month.

Fine-tuning — costs. Curated training data (5–15k quality examples, usually humans labeling: USD 8–25k). Training compute (USD 1–10k per iteration, and you'll need 3–10 iterations). Rigorous eval (essential — without it, fine-tuning gets worse invisibly). Typical total: USD 50k–250k to model in production, and the model needs retraining every 6–12 months.

The ratio I see in practice: fine-tuning costs 5×–20× more than RAG, which costs 5×–15× more than prompt engineering. Skipping rungs jumps that ratio in the bill without warning.

The signals it's time to climb

Knowing when to stop at each rung is half the decision. Practical signals:

When to climb from prompt to RAG. When the model errs by lack of specific information — not by style. Ask: "would the model err less if I pasted the right document into the context?". If yes, RAG. If the answer errs by style, format, or reasoning, prompt eng still handles it.

When to climb from RAG to fine-tuning. Three combined signals: (a) you already have RAG retrieving well (recall@k > 80%); (b) the problem is in how the model writes after receiving context; (c) you have 5k+ quality labeled examples of the desired output. If any of the three is missing, fine-tuning won't fix it.

When not to climb to fine-tuning, even under pressure. When the problem is knowledge (RAG solves), when the use case changes month to month (a fine-tuned model becomes debt), when you don't have a serious evaluation protocol (without eval, fine-tuning is faith). These three contexts cover 80% of fine-tuning requests we receive.

The typical case that clarifies

The story that repeats in three out of five projects. Company wants to "train our ChatGPT" to answer internal questions. Tech team quotes fine-tuning: USD 100k + 5 months. Sponsor approves.

Three months in, the trained model answers the eval set well — and badly on almost everything else. Real diagnosis: the problem was knowledge (the model has no access to internal policy), not style. RAG over the documents would have solved it in 6 weeks for USD 20k, with higher quality. Fine-tuning now becomes maintenance liability.

This case is avoidable with the simple rule above. It's not lack of technical skill; it's lack of order.

The decision for whoever decides

If you're on a committee debating "fine-tuning or not", the right question to ask the technical team isn't "which is better". It's: have we exhausted prompt engineering? have we tried RAG?. In 80% of cases, the answer will be "not to the needed depth". Then you step back a rung, do it right, and most cases stop there — with 1/10 of the cost and 1/4 of the time.

Fine-tuning is the right tool for specific cases. It just isn't the default — and treating it as the default is the most expensive way to delay your company's AI value delivery. (When justifiable, the choice between self-hosted open source and proprietary changes the cost equation — worth calculating before committing.)

Questions that keep coming back

Before wrapping up, the questions that come up most often when this topic hits the table.

Is it worth training a custom model for my company?

In almost every case, not yet — and maybe never. Fine-tuning solves 5–10% of cases, the specific ones about style, format, or a very closed domain. Prompt engineering solves 60%, RAG solves another 30%, and most companies attempt fine-tuning before exhausting the previous two — and burn cash doing it.

The rule is to climb a rung only when the current one is exhausted: instructions and few-shot first; RAG when the model fails for lack of knowledge; fine-tuning only when the problem is how the model writes after receiving the right context, you have 5,000+ quality labeled examples, and a serious evaluation protocol. Without those three, fine-tuning is faith.

How much does each approach cost?

As an order of magnitude: prompt engineering runs R$ 5–20K to get a decent use case through pilot, in days. RAG typically costs R$ 80–250K to production, plus R$ 10–30K/month, over 4–8 weeks. Fine-tuning lands between R$ 200K and R$ 1 million to a production model, takes 2–6 months, and the model needs retraining every 6–12 months.

The practical ratio: fine-tuning costs 5–20× more than RAG, which costs 5–15× more than prompt engineering. And the difference isn't just money — it's how much you invest before knowing whether it works. Prompting fails cheap; fine-tuning fails expensive.

When should I use RAG instead of fine-tuning?

When the problem is knowledge, not behavior. If the model fails because it has no access to internal documents, company policy, or customer history, RAG is the path — fine-tuning teaches patterns, not facts. A fine-tuned model forgets half the rules or hallucinates about rules that changed.

The quick test: "would the model fail less if I pasted the right document into the context?". If yes, RAG. Fine-tuning only enters when retrieval already works well (recall@k above 80%) and what's missing is style, jargon, or format — and even then, only with training data and evaluation to match.

What each one solves, in a sentence

The order that works

Real costs — the math nobody runs

The signals it's time to climb

The typical case that clarifies

The decision for whoever decides

Questions that keep coming back

Is it worth training a custom model for my company?

How much does each approach cost?

When should I use RAG instead of fine-tuning?

Want to discuss this topic with a partner?

Further reading

Architecture of an MCP server: transport, authentication, and where it breaks

Model Context Protocol: what changes when every tool exposes an MCP server

Vector database is not mandatory in RAG — when the classic index wins