By Gareth Hallberg, Lead Consultant at Nimble Approach
This article breaks down why naive, fixed-size chunking leads to poor retrieval, how semantic chunking solves the problem, and what tools and methods you can use to implement it effectively.
Retrieval-Augmented Generation (RAG) is a powerful technique for creating intelligent, context-aware AI systems. By grounding large language models (LLMs) in specific, private data, businesses can build applications that provide accurate and relevant answers. However, the success of any RAG system hinges on a crucial and often overlooked step: data preparation. The common approach of simply chopping documents into fixed-size chunks is a significant pitfall, and this article will demonstrate why a semantically informed strategy is essential.
The core of the issue lies in the oversimplification of picking an arbitrary chunk size – say, 200 or 500 characters – and expecting it to work for all data types. While this may suffice for long-form prose, it fails consistently when applied to structured or semi-structured documents such as menus, reports, or JSON files.
A Case Study: The Restaurant Menu
To illustrate this problem, consider a simple document: a restaurant menu. Now, imagine asking a RAG system, built on this menu, three basic questions:
- “What seafood pasta do you have?”
- “What pizzas do you have for less than £13?”
- “What dishes do you have that are suitable for vegetarians?”
A system using a fixed-size chunking strategy is likely to struggle with this test. Why? Because this method has no regard for the actual meaning or structure of the content. A 200-character chunk could easily split a single menu item in half, separating a dish’s name from its price, or its description from its dietary information (e.g., ‘Vegetarian’). This fragmentation of context makes it impossible for the retrieval system to find a complete, coherent piece of information, leading to inaccurate or incomplete answers from the LLM.
This is the fundamental flaw of naive chunking: it breaks the connections between related pieces of information, like a dish’s name and its price, within the data.
The Solution: From Arbitrary Splits to Semantic Understanding
The robust solution is semantic chunking – a method where the document is split based on its logical structure. Instead of counting characters, we identify the boundaries that define a complete thought or entry.
For our menu example, a practical implementation involves formatting the menu in Markdown and using a heading level (e.g., ####) to designate each menu item. By using this heading as a delimiter, every chunk becomes a single, self-contained menu item, maintaining the connection between the dish, its description, price, and attributes.
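As a minimal sketch of this idea, the snippet below splits a Markdown document just before each `####` heading, so every chunk is one complete menu item. The menu text here is illustrative, not taken from a real menu:

```python
import re

# A hypothetical menu in Markdown, with each item under a "####" heading.
MENU = """#### Linguine alle Vongole
Pasta sauteed with fresh clams, garlic, white wine, and parsley. £14.50

#### Margherita Pizza
Tomato, mozzarella, and basil. Vegetarian. £11.00
"""

def chunk_by_heading(markdown: str, level: str = "####") -> list[str]:
    """Split a Markdown document so each chunk starts at the given heading level."""
    # Zero-width lookahead keeps the heading inside the chunk that follows it.
    pattern = rf"(?m)^(?={re.escape(level)} )"
    return [c.strip() for c in re.split(pattern, markdown) if c.strip()]

chunks = chunk_by_heading(MENU)
# Each chunk now holds a complete menu item: name, description, diet info, price.
```

Because the heading is kept with its body, a query about clams retrieves the name, description, and price of "Linguine alle Vongole" in a single chunk.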
When this semantically chunked data is used, the RAG system’s performance is transformed. It can now easily retrieve the complete entry for “Linguine alle Vongole” (pasta sautéed with fresh clams, garlic, white wine, and parsley) and see that it contains seafood, or check the price of each pizza, because that information is contained within the same chunk. In short, it can answer all three questions that it previously struggled with.
There are excellent interactive demonstrations of this concept online that let you compare how different chunking strategies split the same document, and they are well worth exploring.
Choosing the Right Tools
Understanding the benefits of semantic chunking naturally leads to considering the tools best suited for its implementation.
While the Markdown heading approach is excellent for structured documents like menus, semantic chunking encompasses a broader range of techniques designed to split documents based on their inherent meaning and logical structure, rather than arbitrary character counts. The goal is always to ensure that each chunk represents a coherent and complete piece of information, maximising its utility for retrieval.
Here are some common and advanced semantic chunking methods:
1. Rule-Based or Delimiter-Based Chunking:
- How it works: It relies on predefined rules or delimiters within the document’s structure. Examples include:
- Headings: As demonstrated, using Markdown headings (e.g., ##, ###) or document section titles (e.g., in Word or Google Docs) to define chunk boundaries.
- Paragraph Breaks: Treating each paragraph as a distinct chunk, assuming paragraphs generally represent a single, coherent idea.
- Specific Keywords or Phrases: Identifying key phrases or patterns that signal the start of a new logical unit (e.g., “Conclusion,” “Introduction,” “Key Findings”).
- XML/JSON Tags: For structured data, using specific tags to delineate logical entities.
- Best for: Documents with clear, consistent internal structures like reports, manuals, articles with distinct sections, or structured data formats.
2. Sentence-Based Chunking:
- How it works: This is a more granular approach where each sentence is treated as a separate chunk. While seemingly simple, it can be powerful for maintaining very fine-grained semantic units.
- Best for: Documents where individual sentences carry significant, self-contained meaning, and where the context needed for retrieval is often limited to a single sentence. It’s also a good starting point for more advanced methods.
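A minimal, stdlib-only sketch of sentence-based chunking is shown below. The regex is a deliberately naive assumption; a production system would use a proper sentence tokeniser (e.g. from spaCy or NLTK) that handles abbreviations and decimal numbers:

```python
import re

def chunk_by_sentence(text: str) -> list[str]:
    """Naively split text into sentence chunks on ., ! or ? followed by
    whitespace. This regex mishandles abbreviations like "e.g." and
    decimals like "3.14", which a real tokeniser would get right."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

chunks = chunk_by_sentence(
    "Our pizzas are wood-fired. Gluten-free bases are available on request!"
)
```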
3. Paragraph-Based Chunking:
- How it works: Similar to sentence-based, but groups sentences into paragraphs. This is a common and often effective method for general prose.
- Best for: Most narrative texts, essays, and articles where paragraphs typically convey a single main idea.
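Paragraph-based chunking can be sketched in a few lines, under the assumption that paragraphs are separated by blank lines:

```python
def chunk_by_paragraph(text: str) -> list[str]:
    """Treat each blank-line-separated paragraph as one chunk."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

chunks = chunk_by_paragraph("First idea, developed fully.\n\nSecond idea, developed fully.")
```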
4. Recursive Chunking:
- How it works: This method involves splitting a document into larger chunks first, and then recursively splitting those larger chunks into smaller, more semantically coherent units if they exceed a certain size or complexity. It’s a hierarchical approach.
- Best for: Long, complex documents where different levels of granularity might be useful for retrieval. It allows for both broad and specific searches.
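The recursive idea can be sketched as follows. This is a simplified illustration: real implementations of this pattern also merge small adjacent pieces back together up to the size limit and usually keep the separators, both of which are omitted here for brevity:

```python
def chunk_recursively(text: str, max_len: int = 200,
                      separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Keep a chunk whole if it fits; otherwise split on the coarsest
    separator present and recurse into any piece that is still too long."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            chunks = []
            for piece in text.split(sep):
                chunks.extend(chunk_recursively(piece, max_len, separators))
            return [c for c in chunks if c.strip()]
    # No separator found: fall back to a hard character split.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

chunks = chunk_recursively("Sentence one. Sentence two.\n\nSentence three.", max_len=20)
```

Paragraph boundaries are tried first, so semantically related text stays together whenever it fits within the size limit.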
5. Content-Aware or NLP-Based Chunking:
- How it works: These methods leverage Natural Language Processing (NLP) techniques to understand the semantic flow of the text:
- Topic Modelling: Algorithms can identify distinct topics within a document and group sentences or paragraphs belonging to the same topic into a chunk.
- Coherence Scoring: Analysing the semantic similarity between sentences or paragraphs to identify natural breakpoints where the topic or focus shifts.
- Embedding Similarity: Using vector embeddings of sentences or paragraphs. When the similarity between consecutive units drops below a certain threshold, it indicates a potential chunk boundary.
- Best for: Less structured or free-form text where explicit delimiters are absent, or for achieving highly nuanced semantic divisions.
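The embedding-similarity approach can be illustrated with a toy sketch. The `toy_embed` function here is a stand-in for a real embedding model (e.g. a sentence-transformer), so the vocabulary and threshold are illustrative assumptions, not recommendations:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def chunk_by_similarity(sentences: list[str], embed, threshold: float = 0.5) -> list[str]:
    """Start a new chunk whenever the similarity between consecutive
    sentence embeddings drops below the threshold."""
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

def toy_embed(sentence: str) -> list[float]:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    vocab = ["pizza", "pasta", "clams", "tomato", "dessert", "gelato"]
    words = sentence.lower().split()
    return [float(words.count(w)) for w in vocab]

chunks = chunk_by_similarity(
    ["pasta with clams", "clams and pasta", "gelato dessert"], toy_embed
)
```

The two pasta sentences stay together while the dessert sentence, whose embedding is dissimilar to its predecessor, starts a new chunk.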
6. Hybrid Approaches:
- How it works: Often, the most effective strategy is to combine multiple methods. For example, a document might first be split by major headings (rule-based), and then within each section, paragraphs could be further chunked (paragraph-based), or even sentences if the content demands it.
- Best for: Almost all real-world applications, as documents rarely conform perfectly to a single chunking strategy.
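A hybrid strategy along these lines can be sketched as a rule-based heading split followed by a paragraph split of any oversized section. The menu text and size limit below are illustrative assumptions:

```python
import re

def hybrid_chunk(markdown: str, max_len: int = 300) -> list[str]:
    """Hybrid strategy: split on '##' headings first (rule-based), then
    break any oversized section into paragraph chunks (paragraph-based)."""
    sections = [s.strip() for s in re.split(r"(?m)^(?=## )", markdown) if s.strip()]
    chunks = []
    for section in sections:
        if len(section) <= max_len:
            chunks.append(section)  # Small section: keep heading and body together.
        else:
            chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
    return chunks

menu = (
    "## Starters\n"
    "Soup of the day with sourdough.\n\n"
    "## Mains\n"
    + "Linguine alle Vongole: pasta sauteed with fresh clams, garlic, white wine, and parsley. " * 4
    + "\n\nMargherita pizza: tomato, mozzarella, and basil."
)
chunks = hybrid_chunk(menu)
```

The short "Starters" section survives as a single chunk, while the long "Mains" section is broken down by paragraph.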
Key Considerations for Choosing a Method
With so many semantic chunking techniques available – from simple delimiter-based methods to advanced NLP-driven approaches – it’s important to step back and evaluate which strategy best fits your specific use case. Selecting the right approach isn’t just about how the document can be split, but how it should be split to support reliable retrieval and effective downstream LLM reasoning.
When deciding which chunking method to use, consider the following factors:
- Document Structure: How structured is your data? Does it have clear headings, sections, or other logical divisions?
- Query Patterns: What kind of questions will your RAG system be asked? Do users need very specific facts (smaller chunks) or broader contextual information (larger chunks)?
- Retrieval Granularity: How precise do your retrieval results need to be?
- Computational Cost: More advanced NLP-based methods can be more computationally intensive.
- Tooling: As mentioned, the available tools and frameworks (like LangChain’s diverse text splitters) will significantly influence your choices.
By thoughtfully selecting and implementing a semantic chunking strategy, you can dramatically improve the precision and recall of your RAG system, leading to more accurate and relevant responses from your LLM.
Selecting the Right Framework
This naturally raises an important practical question: which development framework can actually support the chunking strategy you’ve chosen? The effectiveness of your approach depends not only on understanding the theory behind semantic chunking but also on having tools capable of applying it correctly. When evaluating frameworks for building a RAG system, it’s crucial to look beyond surface-level features and closely examine their data processing capabilities.
For instance, the LangChain framework provides a variety of sophisticated text splitters, including a MarkdownTextSplitter, which is designed for precisely this kind of structured data. In contrast, other platforms may lack these specialised tools, making them less suitable for projects that involve anything other than simple text documents. This distinction is not minor – it can be the deciding factor in whether your application succeeds or fails.
Conclusion
The takeaway is clear: building a high-performing RAG system requires a thoughtful approach to data ingestion. Moving away from naive, fixed-size chunking and adopting a semantic strategy that respects the intrinsic structure of your data is paramount. By ensuring that each chunk represents a complete and coherent piece of information, you provide your RAG system with the high-quality foundation it needs to deliver truly accurate and intelligent results.