

Why RAG Is Not Only Retrieval-Augmentation-Generation

Petro Lashyn Apr 18, 2026 12 min

RAG sounds like three steps, but production pipelines need classification, intent, graphs, reranking, summarization, and more. Here is how real systems differ from the tutorial version.

RAG stands for Retrieval-Augmented Generation. Three words. Three steps. Embed the query, find similar text, generate an answer. Sounds clean. And if you build it that way, it will work on your first ten test questions.

Then a real user asks something like "the bot doesn't respond after I set up the webhook" and your three-step pipeline retrieves chunks about webhook configuration, chunks about bot response templates, and chunks about Slack integration timeouts. All of them scored high on vector similarity. None of them, individually, contain the answer. The answer lives in the relationship between a webhook misconfiguration and a specific error handling behavior that silently drops messages.

Vector search found text that sounds like the question. It didn't find the answer.

This is where most RAG tutorials stop and real engineering begins.

The problems behind the acronym

I've been running a RAG pipeline in production for a while now, and every non-obvious failure I've hit falls into one of five categories. They're all problems that the basic three-step flow doesn't acknowledge.

1. The classification problem

Not every user message deserves retrieval. Someone types "thanks, that helped" and your system runs a full embedding, searches the vector store, retrieves chunks about gratitude from the documentation, and generates a paragraph about being helpful. Wasted tokens, wasted latency, and a weird user experience.

Before you retrieve anything, you need to decide whether retrieval should happen at all. In our pipeline, the first real decision is binary: is this a support query or small talk? Small talk skips the entire retrieval path and goes straight to generation. This single gate saves cost on roughly 30% of messages.
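The gate can be sketched in a few lines. The pattern list and function name here are illustrative, not from our pipeline; a production gate would typically be a small classifier model rather than regexes, but the control flow is the same:

```python
import re

# Hypothetical small-talk gate. A real system would usually call a
# cheap classifier model here; keyword patterns show the shape.
SMALL_TALK_PATTERNS = [
    r"^(thanks|thank you|thx)\b",
    r"^(hi|hello|hey)\b",
    r"^(ok|okay|got it|great)\b",
]

def needs_retrieval(message: str) -> bool:
    """Return False for small talk so the pipeline skips retrieval entirely."""
    text = message.strip().lower()
    return not any(re.search(p, text) for p in SMALL_TALK_PATTERNS)

# needs_retrieval("thanks, that helped")  -> False: straight to generation
# needs_retrieval("the bot doesn't respond")  -> True: full retrieval path
```

The important design point is that this check runs before any embedding call, so the saved cost includes the embedding itself, not just the vector search.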

2. The ambiguity problem

A user asks "how do I set up the integration?" Your product has six integrations. The system embeds the query, retrieves whichever integration's documentation had the closest vector, and answers confidently about the wrong one. The user doesn't know the answer is wrong because the system didn't hesitate.

This is a design failure, not a model failure. Before generating, you need to check: do the retrieved documents cluster around one topic or several? If several, the honest response is a clarifying question, not a guess.

We detect this by checking whether matched entities in the knowledge graph fall into disconnected clusters. If entity A (Slack integration) and entity B (email integration) have no path connecting them within four hops, the query is ambiguous. The system asks which one the user means.
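A minimal sketch of that connectivity check, assuming the graph is available as a plain adjacency map (entity name to set of neighbors). This is a generic bounded BFS, not our actual graph query:

```python
from collections import deque

def within_hops(graph: dict[str, set[str]], a: str, b: str, max_hops: int = 4) -> bool:
    """Breadth-first search: is there a path from a to b in at most max_hops edges?"""
    frontier, seen = deque([(a, 0)]), {a}
    while frontier:
        node, depth = frontier.popleft()
        if node == b:
            return True
        if depth == max_hops:
            continue  # don't expand past the hop budget
        for nbr in graph.get(node, set()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return False

def is_ambiguous(graph, matched_entities, max_hops=4):
    """Ambiguous if any pair of matched entities is disconnected within the budget."""
    ents = list(matched_entities)
    return any(
        not within_hops(graph, ents[i], ents[j], max_hops)
        for i in range(len(ents)) for j in range(i + 1, len(ents))
    )
```

In a graph database you would express the same check as a hop-limited path query instead of an in-memory BFS; the decision logic is identical.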

3. The context overflow problem

You retrieve 20 chunks. Total: 15,000 tokens. You stuff them all into the prompt. The model reads the first few, reads the last few, and ignores everything in the middle.

This is the "lost-in-the-middle" problem, documented by Liu et al. in "Lost in the Middle: How Language Models Use Long Contexts." The fix isn't retrieving fewer chunks (you lose recall). The fix is summarizing the retrieved context before passing it to generation. Compress 15,000 tokens of raw chunks into 3,000 tokens of focused summary. The model actually reads the whole thing.

We cache summaries by content hash so identical context sets don't get re-summarized on repeated questions. Simple, but it cuts redundant LLM calls.

4. The flat retrieval problem

This is the big one.

Vector similarity operates on text surfaces. It finds chunks that sound like the query. But knowledge isn't flat. Knowledge has structure: causes, consequences, prerequisites, alternatives.

"What happens when finding X is linked to corrective action Y?" requires understanding a relationship. There's no single chunk that contains this answer because the information about finding X is in one document and the corrective action Y is in another. They're connected by a relationship, not by textual similarity.

This is where graphs enter the picture.

5. The intent problem

A user asking "what is two-factor authentication?" and a user asking "2FA stopped working after the update" need fundamentally different retrieval strategies. The first is a concept question. The second is a problem report. Retrieving the same chunks for both produces mediocre answers for both.

Intent classification before retrieval lets you shape the search. A concept query should pull explanatory content and related topics. A problem query should pull symptoms, causes, and solutions. Same knowledge base, different traversal paths.
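As a rough sketch of the classification step, using the article's own example queries. The marker lists are illustrative; production intent detection would normally be an LLM call or a trained classifier:

```python
# Hypothetical heuristic intent detector; a real pipeline would use a
# classifier model, but the three-way output is the same.
PROBLEM_MARKERS = ("stopped working", "doesn't work", "error", "broken", "fails")

def classify_intent(query: str) -> str:
    q = query.lower().strip()
    if any(m in q for m in PROBLEM_MARKERS):
        return "Problem"   # symptom report: pull causes and solutions
    if q.startswith(("how do i", "how to", "how can i")):
        return "HowTo"     # task: pull step-by-step content
    return "Concept"       # default: pull explanatory content
```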

What a graph adds to RAG

A knowledge graph stores entities and the relationships between them. Not text. Structure.

The simplest way to think about it: vector search gives you "documents that talk about similar things." A graph gives you "things that are connected to the thing you asked about, and how."

In practice, the graph sits alongside the vector store, not instead of it. Vector search finds the entry points (seed chunks). The graph expands from those entry points to discover related knowledge that vector similarity alone would miss.

How graph retrieval works in practice

The flow is three phases:

Phase 1: Seed selection. Vector search returns the top chunks by cosine similarity. You take a subset of these (we use a configurable seed_limit) as starting points for graph traversal. The rest still contribute to the context, but only the seeds feed the graph.

Phase 2: Graph expansion. From each seed chunk, traverse the graph by a configurable number of hops. A chunk mentions entity A. Entity A is connected to entity B via a RESOLVED_BY relationship. Entity B has evidence in chunk C. Chunk C wasn't in the original vector results because it uses different language, but it contains the solution. Now it's in your context.

Phase 3: Reranking. Merge vector similarity scores with graph relationship scores. A chunk that is moderately similar by embedding but strongly connected through the graph to the query entities ranks higher than one that's just textually close. This is where "enriched" retrieval earns its name.
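The three phases can be sketched end to end. This is a minimal illustration, not our implementation: `vector_store.search` and `graph.expand` are assumed interfaces, and `alpha` is a hypothetical blend weight between the two signals:

```python
def retrieve(query_vec, vector_store, graph, seed_limit=5, hops=2, alpha=0.6):
    """Three-phase retrieval sketch: seeds -> graph expansion -> reranking."""
    # Phase 1: seed selection from vector search
    hits = vector_store.search(query_vec, top_k=20)   # [(chunk_id, similarity), ...]
    seeds = [cid for cid, _ in hits[:seed_limit]]

    # Phase 2: expand from the seeds by a bounded number of hops
    expanded = graph.expand(seeds, max_hops=hops)     # {chunk_id: graph_score}

    # Phase 3: rerank by blending vector and graph scores
    sim = dict(hits)
    candidates = set(sim) | set(expanded)
    scored = {
        cid: alpha * sim.get(cid, 0.0) + (1 - alpha) * expanded.get(cid, 0.0)
        for cid in candidates
    }
    return sorted(scored, key=scored.get, reverse=True)
```

Note how a chunk found only by expansion still competes in the final ranking, and a chunk that scores moderately on both signals can outrank one that is merely textually close. How much weight the graph signal gets is a tuning decision, not a constant.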

Designing a graph schema for your domain

Here's where it gets practical. The graph schema depends entirely on your domain. There's no universal schema. The entities you model and the relationships you define between them determine what the graph can tell you that vectors can't.

Let me walk through two examples.

Example 1: Customer support knowledge base

This is the domain I know best from building it.

Entities (nodes):

The core entity is SupportEntity with a kind field that distinguishes between types: Problem, Solution, Concept, Feature, Configuration.

A Problem is something that goes wrong ("Bot doesn't respond"). A Solution is how to fix it ("Check webhook URL configuration"). A Concept is explanatory knowledge ("How webhooks work"). Feature and Configuration are structural elements of the product.

Relationships (edges):

This is where the real design happens.

  • CAUSES connects a root cause to a symptom. "Invalid webhook URL" CAUSES "Bot doesn't respond." When a user reports the symptom, the graph traverses back to the cause.
  • RESOLVED_BY connects a problem to its solution. "Bot doesn't respond" RESOLVED_BY "Verify webhook endpoint is publicly accessible." One problem can have multiple solutions.
  • DIAGNOSED_BY connects a problem to a diagnostic question. "Bot doesn't respond" DIAGNOSED_BY "Does the webhook URL return 200 on GET requests?" This drives clarifying questions.
  • RELATED_TO is the general connection between any two entities. "Webhook configuration" RELATED_TO "API key setup." Bidirectional, used for concept exploration.
  • EVIDENCE_FOR and MENTIONS connect chunks to entities. A chunk of documentation is EVIDENCE_FOR a Solution, or MENTIONS a Feature. These edges are what link the vector world (chunks) to the graph world (entities).
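The schema above can be written down as plain data. The class and field names here are illustrative, not the production model:

```python
from dataclasses import dataclass
from enum import Enum

class Kind(Enum):
    PROBLEM = "Problem"
    SOLUTION = "Solution"
    CONCEPT = "Concept"
    FEATURE = "Feature"
    CONFIGURATION = "Configuration"

@dataclass(frozen=True)
class Entity:
    name: str
    kind: Kind

@dataclass(frozen=True)
class Edge:
    source: str
    relation: str   # CAUSES, RESOLVED_BY, DIAGNOSED_BY, RELATED_TO, ...
    target: str

# Two edges from the examples above:
schema_edges = [
    Edge("Invalid webhook URL", "CAUSES", "Bot doesn't respond"),
    Edge("Bot doesn't respond", "RESOLVED_BY",
         "Verify webhook endpoint is publicly accessible"),
]
```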

Intent-aware traversal:

The same graph traverses differently based on what the user is asking. When the intent is Problem, the query follows CAUSES backward to find symptoms and RESOLVED_BY forward to find solutions. When the intent is HowTo, it follows RESOLVED_BY to solutions and looks for step-by-step content. When the intent is Concept, it follows RELATED_TO broadly to build a map of connected knowledge.

Same graph. Same entities. Different traversal strategy per intent.
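The per-intent strategy reduces to a table of which edge types to follow, and in which direction. A minimal sketch, with edges as plain (source, relation, target) triples:

```python
# Which relations to follow per intent, and in which direction.
TRAVERSAL = {
    "Problem": [("CAUSES", "backward"), ("RESOLVED_BY", "forward")],
    "HowTo":   [("RESOLVED_BY", "forward")],
    "Concept": [("RELATED_TO", "both")],
}

def neighbors(triples, entity, intent):
    """Expand one entity, following only the relations allowed for this intent."""
    out = []
    for rel, direction in TRAVERSAL.get(intent, []):
        for src, r, dst in triples:
            if r != rel:
                continue
            if direction in ("forward", "both") and src == entity:
                out.append(dst)
            if direction in ("backward", "both") and dst == entity:
                out.append(src)
    return out
```

For a Problem-intent query about "Bot doesn't respond", this walks CAUSES edges backward to the root cause and RESOLVED_BY edges forward to the fix, while ignoring RELATED_TO entirely.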

Example 2: E-commerce product catalog

Different domain, different schema, same principle.

Entities:

Product, Category, Feature, Specification, Review, Brand.

Relationships:

  • BELONGS_TO connects Product to Category. "Running Shoe X" BELONGS_TO "Men's Athletic Footwear."
  • HAS_FEATURE connects Product to Feature. "Running Shoe X" HAS_FEATURE "Carbon fiber plate." When someone asks "shoes with carbon plates," vector search finds the phrase. But graph traversal also finds all other products sharing that feature, even if they describe it differently ("carbon-infused midsole").
  • COMPATIBLE_WITH connects products to each other. "Running Shoe X" COMPATIBLE_WITH "Performance Insole Y." This is knowledge that doesn't live in any single document. It lives in the relationship.
  • COMPARED_TO connects competing products. "Running Shoe X" COMPARED_TO "Running Shoe Z." When someone asks "which is better, X or Z?", the graph can surface the comparison context from both sides.
  • REVIEWED_IN connects Product to Review. This lets the system aggregate sentiment and specific feedback across reviews during retrieval.
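The HAS_FEATURE case is worth one more sketch, because it shows the graph doing what vector search can't: products that describe the same feature in different words all connect to one canonical feature node. The product names besides "Running Shoe X" are made up for the example:

```python
# Feature nodes are canonical, so "carbon plate" and "carbon-infused
# midsole" product copy both resolve to the same node at index time.
catalog_edges = [
    ("Running Shoe X", "HAS_FEATURE", "carbon plate"),
    ("Running Shoe Q", "HAS_FEATURE", "carbon plate"),
    ("Running Shoe X", "COMPATIBLE_WITH", "Performance Insole Y"),
]

def products_with_feature(edges, feature):
    """All products linked to a canonical feature node, regardless of wording."""
    return [s for s, rel, t in edges if rel == "HAS_FEATURE" and t == feature]
```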

The pattern:

Notice what both schemas share: entities represent things your users ask about, and relationships represent the connections between those things that no single chunk of text captures. The graph stores what falls between the documents.

How to know if you need a graph

Not every RAG system needs a graph. Adding one increases complexity in indexing, storage, and query processing. It's worth it when:

  • Your knowledge base has entities with meaningful relationships. If your data is a collection of independent articles with no cross-references, a graph adds nothing. If your data has products connected to features, problems connected to solutions, regulations connected to compliance requirements, the graph captures what vectors miss.
  • Users ask relational questions. "What products are compatible with X?" "What problems does this setting cause?" "How is concept A related to concept B?" These questions require traversing connections, not finding similar text.
  • Ambiguity is common. When users ask about entities that span multiple domains (a term that means different things in different contexts), graph cluster detection can identify ambiguity that vector search treats as a single topic.

If your use case is "search through 500 FAQ articles and return the closest one," vector search alone is sufficient. Don't add a graph because it sounds sophisticated. Add it because your users ask questions that require understanding relationships.

The pipeline is the product

RAG is not three steps. In production, the pipeline I run has sixteen stages. Each one exists because a specific failure mode in production required it. Classification to avoid unnecessary retrieval. Intent detection to shape the search strategy. Vector search for semantic matching. Graph expansion for relational knowledge. Reranking to merge scoring signals. Summarization to fit context windows. Ambiguity detection to avoid confident wrong answers. Model selection to balance cost and quality.

None of this is theoretical. Every stage maps to a real problem that a real user hit.

The three-letter acronym makes it sound simple. The engineering behind it isn't. And honestly, that's what makes it interesting.

Thanks for reading.

