Prompt engineering for analytics: the forgotten pipeline between data and report

The new meeting-room scene in 2026: a director asks about a metric, the analyst asks ChatGPT how to fetch it, ChatGPT delivers SQL, analyst runs it on the warehouse, slide goes to the meeting. Total time from "question" to "slide": 15 minutes. In 2022 this flow took 2 days. Huge gain — and equal danger.

Because in at least half the cases, the SQL generated by the LLM has a subtle error: a wrong filter, an incomplete JOIN, a metric computed with logic that looks right and isn't. The number comes out looking official, becomes a decision, and nobody notices it's wrong for weeks. This text is about treating the LLM in analytics as part of the pipeline, with equivalent discipline. Without that, the speed gain becomes a silent liability.

The "ChatGPT writing SQL" problem

LLMs got good at generating SQL in 2025–2026. Good enough to seem useful always. Not good enough to be useful every time they seem useful. Three failure modes show up regularly:

Failure 1: subtle schema error. The LLM assumes orders.total exists when the column is actually orders.amount_total. If a similarly named column or table does exist, the SQL runs and returns a sensible number, but it measures the wrong thing.

Failure 2: business logic the LLM fills in on its own. The analyst asks "how many active customers?". The LLM generates SQL with its own definition of "active" (logged in within 30 days). The company defines active as "made a transaction within 90 days". The number comes out 3× too high.
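To make the divergence concrete, here is a minimal sketch that counts "active" customers under both definitions. The customer records, field names, and dates are invented for illustration; only the two definitions come from the text above.

```python
from datetime import date, timedelta

today = date(2026, 1, 31)  # fixed reference date so the example is reproducible

# Hypothetical activity data: last login and last transaction per customer.
customers = {
    "c1": {"last_login": date(2026, 1, 30), "last_transaction": date(2025, 6, 1)},
    "c2": {"last_login": date(2026, 1, 20), "last_transaction": date(2026, 1, 10)},
    "c3": {"last_login": date(2025, 10, 1), "last_transaction": date(2025, 12, 15)},
    "c4": {"last_login": date(2026, 1, 5), "last_transaction": date(2025, 8, 1)},
}

def count_active(field, window_days):
    """Count customers whose `field` date falls within the last `window_days`."""
    cutoff = today - timedelta(days=window_days)
    return sum(1 for c in customers.values() if c[field] >= cutoff)

llm_guess = count_active("last_login", 30)        # generic definition the LLM assumed
official = count_active("last_transaction", 90)   # the company's definition

print(f"LLM definition: {llm_guess}, official definition: {official}")
```

Both queries are valid and both numbers are plausible. Only the written-down definition tells you which one belongs on the slide.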

Failure 3: incorrect aggregation with JOIN. The LLM generates SQL with a JOIN between fact and dimension, but the dimension has duplicate rows. The aggregation inflates. Nobody notices because the number isn't absurd; it's just wrong by ~15%.
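A small reproduction of this fan-out effect, using Python's built-in sqlite3 module as a stand-in for the warehouse. The table contents are invented; the point is only that one duplicated dimension key is enough to inflate a SUM, and that a uniqueness check on the join key catches it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount_total REAL);
    CREATE TABLE customers (customer_id INTEGER, segment TEXT);
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 11, 200.0), (3, 12, 300.0);
    -- customer 11 appears twice: the dimension is not unique on its key
    INSERT INTO customers VALUES (10, 'smb'), (11, 'smb'), (11, 'smb'), (12, 'enterprise');
""")

def scalar(sql):
    """Run a query and return the first column of the first row."""
    return conn.execute(sql).fetchone()[0]

# Ground truth: revenue computed on the fact table alone.
true_revenue = scalar("SELECT SUM(amount_total) FROM orders")

# The kind of query an LLM might generate: the join duplicates order 2.
joined_revenue = scalar("""
    SELECT SUM(o.amount_total)
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
""")

# Cheap guard: count join keys that occur more than once in the dimension.
duplicate_keys = scalar("""
    SELECT COUNT(*) FROM (
        SELECT customer_id FROM customers
        GROUP BY customer_id HAVING COUNT(*) > 1
    )
""")

print(true_revenue, joined_revenue, duplicate_keys)
```

Comparing the aggregate before and after the join, or asserting that the dimension is unique on its key, is the kind of check that turns this failure from silent to visible.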

These three combined produce what I call "number looks right, decision comes out wrong". And because the LLM presents SQL with confidence, the analyst validates it less than they would SQL written by a peer. Synthetic confidence beats critical review.

In LLM-augmented analytics, speed comes with hidden risk: the SQL looks right because it was generated with confidence. Without validation discipline, "I'll ask ChatGPT" becomes "I'll decide based on an unverified number".

Five practices that separate gain from theater

This is the discipline a serious company adopts when incorporating an LLM into an analytical pipeline. Without these five, the speed gain becomes a silent liability.

Schema-aware context in the prompt. Don't throw the raw question at the LLM. Build the prompt with the warehouse schema, key-table descriptions, and official metric definitions. dbt docs, used as a semantic layer, feed this well. Without context, the LLM invents columns.
Business definitions injected in the system prompt. "Active customer = transaction within 90 days. Revenue = subtotal before tax and discount. Churn = no transaction within 90 days". Keep 5–10 core definitions as part of the fixed prompt. Without that, the LLM uses a generic definition and the number diverges.
Automatic validation of output before use. Generated SQL runs in a sandbox and is validated against a known eval set: "Does question X return a number Y between 1000 and 1500?". Without validation, drift in the LLM degrades output silently. Same principle as the eval set for evaluating agents.
Restriction to read-only queries with governance. The LLM doesn't write to production. The connection it uses is read-only, with permissions restricted to analytical tables. Without that, a malicious prompt or an error can cause real damage.
Log of every prompt → SQL → result interaction. For audit, for understanding drift, for debugging incidents: who used what, when, and with which response. Without a log, AI governance in analytics doesn't exist.
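The first two practices can be sketched as a function that assembles the fixed system prompt from a schema description and the list of official definitions. The table names, column descriptions, and prompt wording below are hypothetical; in a real setup the schema text would more likely be exported from dbt docs than typed by hand.

```python
# Hypothetical schema description: table -> {column: description}.
SCHEMA = {
    "orders": {
        "order_id": "primary key",
        "customer_id": "foreign key to customers.customer_id",
        "amount_total": "order amount (there is no column named total)",
        "created_at": "order timestamp",
    },
    "customers": {
        "customer_id": "primary key, one row per customer",
        "segment": "commercial segment",
    },
}

# The three definitions quoted in the text; a real list would hold 5-10.
DEFINITIONS = [
    "Active customer = transaction within 90 days.",
    "Revenue = subtotal before tax and discount.",
    "Churn = no transaction within 90 days.",
]

def build_system_prompt(schema, definitions):
    """Render schema and business definitions into one fixed system prompt."""
    lines = [
        "You write read-only SQL for the analytics warehouse.",
        "Use only the tables and columns listed below.",
        "",
        "Schema:",
    ]
    for table, columns in schema.items():
        lines.append(f"- {table}")
        for column, description in columns.items():
            lines.append(f"    {column}: {description}")
    lines += ["", "Official business definitions:"]
    lines += [f"- {definition}" for definition in definitions]
    return "\n".join(lines)

system_prompt = build_system_prompt(SCHEMA, DEFINITIONS)
print(system_prompt)
```

The user's question is then sent alongside this prompt rather than on its own, so the model has no reason to guess column names or metric definitions.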

Implementing the five turns "ChatGPT for SQL" into an augmented analytics pipeline. Without them, it's improvisation that becomes an incident in 3–6 months.
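As a rough illustration of practices three to five working together, the sketch below runs generated SQL on a sandbox connection, rejects anything that is not a read query, checks the result against an expected range from an eval set, and appends every attempt to a log. It uses sqlite3 as a stand-in for the warehouse; the function name, status labels, and sample data are assumptions, and the prefix check is deliberately coarse because the read-only connection is the real guard.

```python
import sqlite3
from datetime import datetime, timezone

def run_validated(conn, question, sql, expected_range, log):
    """Run generated SQL in the sandbox, range-check it, and log the outcome."""
    low, high = expected_range
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "sql": sql,
        "result": None,
    }
    # Coarse application-level guard: only SELECT / WITH statements proceed.
    if not sql.lstrip().lower().startswith(("select", "with")):
        entry["status"] = "rejected_not_read_only"
    else:
        try:
            row = conn.execute(sql).fetchone()
            entry["result"] = row[0] if row else None
            value = entry["result"]
            in_range = isinstance(value, (int, float)) and low <= value <= high
            entry["status"] = "pass" if in_range else "out_of_range"
        except sqlite3.Error as exc:
            # Covers invented columns, syntax errors, and blocked writes.
            entry["status"] = f"error: {exc}"
    log.append(entry)
    return entry

sandbox = sqlite3.connect(":memory:")
sandbox.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount_total REAL);
    INSERT INTO orders VALUES (1, 500.0), (2, 700.0);
""")
sandbox.execute("PRAGMA query_only = ON")  # connection-level read-only guard

audit_log = []
ok = run_validated(sandbox, "Total revenue?",
                   "SELECT SUM(amount_total) FROM orders", (1000, 1500), audit_log)
bad_column = run_validated(sandbox, "Total revenue?",
                           "SELECT SUM(total) FROM orders", (1000, 1500), audit_log)
write_attempt = run_validated(sandbox, "Clean up",
                              "DELETE FROM orders", (0, 0), audit_log)

for item in audit_log:
    print(item["status"], "|", item["sql"])
```

A production version would point at a warehouse role with read-only grants and persist the log somewhere durable, but the shape stays the same: nothing reaches a slide without a status and a log entry.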

Where LLM in analytics really accelerates

Don't misread the argument. There are three contexts where a well-implemented LLM saves a lot of time with controlled risk:

Translating business questions into SQL. A non-analyst phrases a question in natural language, the LLM generates SQL, the system executes it and returns the answer. As I argued about the LLM as an internal agent, this case is one of the most consistent in ROI. It works well with a schema-aware prompt plus validation.

Model documentation generation. A new dbt model needs descriptions for 30 columns? The LLM generates a first draft based on the SQL and sample data. The analyst reviews. 80% of the work is automated.

Quick exploratory analysis. A new dataset arrives and the team needs to understand its structure, distribution, and outliers. An LLM with Code Interpreter or equivalent does EDA in minutes. It doesn't replace serious analysis, but it accelerates understanding the terrain.

These three share a trait: error is tolerable and detectable, and the output is reviewed by a human before becoming a decision. Where output becomes a decision without review (as in the direct SQL-to-slide flow), the five-item discipline becomes mandatory.

The "it understands my business" trap

The most frequent mistake in teams adopting an LLM in analytics: after 2–3 questions the LLM answers well, the analyst stops validating. "It gets how we measure revenue". It doesn't. It gets how we measure revenue in that specific context, with that specific phrasing. Change the prompt, the table, or the quarter, and it could be wrong again.

The confidence built with an LLM in analytics differs from the confidence built with a colleague. A colleague learns from mistakes. The LLM doesn't learn from yours: it performs well on inputs similar to its training data and poorly at the boundaries. Trusting it with the same confidence produces the worst scenario: high speed plus low review.

How to measure whether it's paying back

Four metrics tell whether the LLM in analytics is being used well:

Rate of generated SQL needing manual correction. Above 30%, the schema-aware prompt is weak or the eval set is insufficient. Below 10%, the pipeline is mature.

Average validation time per query. If it exceeds 5 minutes, the tool has lost its purpose. Automated validation needs to cover 80% of cases to be worth it.

"Wrong number discovered later" incidents. Count cases where a decision was made on generated SQL that was later found to be wrong. Above 1/month, governance is broken.

Adoption by persona. Does the analyst use it? The director? Who uses which interface? If only data engineers use it, democratization didn't happen; it became specialized tooling.
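If the interaction log records a few extra fields, these four metrics fall out of a short aggregation. The sketch below assumes a hypothetical review log covering one month, with invented field names and sample rows; the thresholds are the ones stated above.

```python
from collections import Counter

# Hypothetical one-month review log; field names are assumptions.
reviews = [
    {"role": "analyst", "needed_correction": False, "validation_minutes": 2.0, "wrong_found_later": False},
    {"role": "analyst", "needed_correction": True, "validation_minutes": 6.0, "wrong_found_later": False},
    {"role": "director", "needed_correction": False, "validation_minutes": 1.0, "wrong_found_later": True},
    {"role": "analyst", "needed_correction": False, "validation_minutes": 3.0, "wrong_found_later": False},
    {"role": "data_engineer", "needed_correction": False, "validation_minutes": 3.0, "wrong_found_later": False},
]

def pipeline_metrics(rows):
    """Aggregate the four health metrics from a non-empty review log."""
    n = len(rows)
    return {
        "correction_rate": sum(r["needed_correction"] for r in rows) / n,
        "avg_validation_minutes": sum(r["validation_minutes"] for r in rows) / n,
        "late_incidents": sum(r["wrong_found_later"] for r in rows),
        "usage_by_role": dict(Counter(r["role"] for r in rows)),
    }

def warnings_for(metrics):
    """Apply the thresholds from the text to one month of metrics."""
    found = []
    if metrics["correction_rate"] > 0.30:
        found.append("weak schema-aware prompt or insufficient eval set")
    if metrics["avg_validation_minutes"] > 5:
        found.append("validation too slow for the tool to be worth it")
    if metrics["late_incidents"] > 1:
        found.append("governance is broken")
    return found

metrics = pipeline_metrics(reviews)
alerts = warnings_for(metrics)
print(metrics)
print(alerts)
```

The adoption metric has no threshold here on purpose: it needs a human to read the breakdown and judge whether usage has spread beyond data engineers.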

The decision for 2026

If your company has analysts using ChatGPT/Claude to generate SQL without governance, three moves:

Build a controlled interface. Not "open ChatGPT", but an internal tool with a schema-aware prompt, embedded business definitions, sandboxed execution, and automatic logging. Equivalent to "the company's ChatGPT for analytics". High cost at the start, clear ROI in 6 months.

Train the team to be skeptical. A 1-hour session showing the three failure modes (schema, definition, JOIN). When the team understands how the LLM errs, usage gets more careful.

Integrate with the semantic layer. A dbt mart or semantic layer defines the metrics; the LLM consults the layer, not the raw warehouse. That cuts definition errors by 80%.

LLM in analytics in 2026 is one of the clearest productivity opportunities — and one of the most dangerous without discipline. The difference between the two postures isn't in which model is chosen. It's in the pipeline built around it, with validation, context and log that treat the LLM as a critical tool — not as an assistant trusted by inertia.

Questions that keep coming back

To close, the three questions I hear most often when this topic comes up.

Can I trust the SQL that ChatGPT generates?

Not without validation — in at least half the cases, generated SQL carries a subtle error that goes unnoticed. The three most common failure modes: a column that doesn't exist in the schema (the LLM invents orders.total when it's orders.amount_total), a generic business definition that diverges from yours ("active" within 30 days vs. 90 days), and aggregation inflated by a JOIN against a duplicated dimension. None of these produces an absurd number — they produce a plausible, wrong one.

The behavioral aggravator: because the LLM presents SQL with confidence, analysts review it less than they would a colleague's SQL. The practical rule: generated SQL is a draft until it passes validation — sandbox, eval set, or at minimum a review against the official metric definitions.

What should I put in the prompt so the LLM gets SQL right more often?

Two things: the warehouse schema and your official business definitions — never the raw question alone. The prompt needs to carry the structure of the key tables with descriptions (dbt docs feeds this well) and the 5–10 core definitions as a fixed part of the system prompt: what counts as an active customer, how revenue is calculated, what counts as churn. Without schema, the LLM invents columns; without definitions, it uses generic ones and the number diverges.

The next step is pointing the LLM at the semantic layer (dbt mart or semantic layer) instead of the raw warehouse — that cuts definition errors by 80%. A good prompt isn't a magic phrase; it's structured context.

Is it worth building an internal tool instead of letting the team use ChatGPT directly?

Yes, if the team is already generating SQL with LLMs day to day. "Open ChatGPT" without governance is improvisation that becomes an incident in 3–6 months. A controlled interface packages what ad hoc use lacks: a schema-aware prompt, embedded business definitions, sandboxed read-only execution, and automatic logging of every interaction. High cost at the start, clear ROI in 6 months.

If you can't build it yet, start with the cheap moves: a read-only connection restricted to analytical tables, a 1-hour session training the team on the three failure modes, and a log of who generated what. Without a log, AI governance in analytics simply doesn't exist.
