By Gareth Hallberg, Lead Consultant at Nimble Approach
This article breaks down why naive, fixed-size chunking leads to poor retrieval, how semantic chunking solves the problem, and what tools and methods you can use to implement it effectively.
Retrieval-Augmented Generation (RAG) is a powerful technique for creating intelligent, context-aware AI systems. By grounding large language models (LLMs) in specific, private data, businesses can build applications that provide accurate and relevant answers. However, the success of any RAG system hinges on a crucial and often overlooked step: data preparation. The common approach of simply chopping documents into fixed-size chunks is a significant pitfall, and this article will demonstrate why a semantically informed strategy is essential.
The core of the issue lies in the oversimplification of picking an arbitrary chunk size – say, 200 or 500 characters – and expecting it to work for all data types. While this may suffice for long-form prose, it fails consistently when applied to structured or semi-structured documents such as menus, reports, or JSON files.
A Case Study: The Restaurant Menu
To illustrate this problem, consider a simple document: a restaurant menu. Now, imagine asking a RAG system, built on this menu, three basic questions:
- “What seafood pasta do you have?”
- “What pizzas do you have for less than £13?”
- “What dishes do you have that are suitable for vegetarians?”
A system using a fixed-size chunking strategy is likely to struggle with this test. Why? Because this method has no regard for the actual meaning or structure of the content. A 200-character chunk could easily split a single menu item in half, separating a dish’s name from its price, or its description from its dietary information (e.g., ‘Vegetarian’). This fragmentation of context makes it impossible for the retrieval system to find a complete, coherent piece of information, leading to inaccurate or incomplete answers from the LLM.
This is the fundamental flaw of naive chunking: it breaks the connections between related pieces of information, like a dish’s name and its price, within the data.
The Solution: From Arbitrary Splits to Semantic Understanding
The robust solution is semantic chunking – a method where the document is split based on its logical structure. Instead of counting characters, we identify the boundaries that define a complete thought or entry.
For our menu example, a practical implementation involves formatting the menu in Markdown and using a heading level (e.g., ####) to designate each menu item. By using this heading as a delimiter, every chunk becomes a single, self-contained menu item, maintaining the connection between the dish, its description, price, and attributes.
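As a minimal sketch of this idea, the snippet below splits a Markdown document just before each `####` heading, so every chunk is one complete menu item. The menu text here is illustrative, not taken from a real menu:

```python
import re

# A hypothetical menu in Markdown, with each item under a "####" heading.
MENU = """#### Linguine alle Vongole
Pasta sauteed with fresh clams, garlic, white wine, and parsley. £14.50

#### Margherita Pizza
Tomato, mozzarella, and basil. Vegetarian. £11.00
"""

def chunk_by_heading(markdown: str, level: str = "####") -> list[str]:
    """Split a Markdown document so each chunk starts at the given heading level."""
    # Zero-width lookahead keeps the heading inside the chunk that follows it.
    pattern = rf"(?m)^(?={re.escape(level)} )"
    return [c.strip() for c in re.split(pattern, markdown) if c.strip()]

chunks = chunk_by_heading(MENU)
# Each chunk now holds a complete menu item: name, description, diet info, price.
```

Because the heading is kept with its body, a query about clams retrieves the name, description, and price of "Linguine alle Vongole" in a single chunk.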
When this semantically chunked data is used, the RAG system’s performance is transformed. It can now easily retrieve the complete entry for “Linguine alle Vongole” (pasta sautéed with fresh clams, garlic, white wine, and parsley) and see that it contains seafood, or check the price of each pizza, because that information is contained within the same chunk. In short, it can answer all three questions that it previously struggled with.
There are excellent interactive demonstrations of this concept online that let you compare how different chunking strategies split the same document, and they are well worth exploring.
Choosing the Right Tools
Understanding the benefits of semantic chunking naturally leads to considering the tools best suited for its implementation.
While the Markdown heading approach is excellent for structured documents like menus, semantic chunking encompasses a broader range of techniques designed to split documents based on their inherent meaning and logical structure, rather than arbitrary character counts. The goal is always to ensure that each chunk represents a coherent and complete piece of information, maximising its utility for retrieval.
Here are some common and advanced semantic chunking methods:
1. Rule-Based or Delimiter-Based Chunking:
- How it works: It relies on predefined rules or delimiters within the document’s structure. Examples include:
- Headings: As demonstrated, using Markdown headings (e.g., ##, ###) or document section titles (e.g., in Word or Google Docs) to define chunk boundaries.
- Paragraph Breaks: Treating each paragraph as a distinct chunk, assuming paragraphs generally represent a single, coherent idea.
- Specific Keywords or Phrases: Identifying key phrases or patterns that signal the start of a new logical unit (e.g., “Conclusion,” “Introduction,” “Key Findings”).
- XML/JSON Tags: For structured data, using specific tags to delineate logical entities.
- Best for: Documents with clear, consistent internal structures like reports, manuals, articles with distinct sections, or structured data formats.
2. Sentence-Based Chunking:
- How it works: This is a more granular approach where each sentence is treated as a separate chunk. While seemingly simple, it can be powerful for maintaining very fine-grained semantic units.
- Best for: Documents where individual sentences carry significant, self-contained meaning, and where the context needed for retrieval is often limited to a single sentence. It’s also a good starting point for more advanced methods.
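A minimal, stdlib-only sketch of sentence-based chunking is shown below. The regex is a deliberately naive assumption; a production system would use a proper sentence tokeniser (e.g. from spaCy or NLTK) that handles abbreviations and decimal numbers:

```python
import re

def chunk_by_sentence(text: str) -> list[str]:
    """Naively split text into sentence chunks on ., ! or ? followed by
    whitespace. This regex mishandles abbreviations like "e.g." and
    decimals like "3.14", which a real tokeniser would get right."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

chunks = chunk_by_sentence(
    "Our pizzas are wood-fired. Gluten-free bases are available on request!"
)
```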
3. Paragraph-Based Chunking:
- How it works: Similar to sentence-based, but groups sentences into paragraphs. This is a common and often effective method for general prose.
- Best for: Most narrative texts, essays, and articles where paragraphs typically convey a single main idea.
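Paragraph-based chunking can be sketched in a few lines, under the assumption that paragraphs are separated by blank lines:

```python
def chunk_by_paragraph(text: str) -> list[str]:
    """Treat each blank-line-separated paragraph as one chunk."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

chunks = chunk_by_paragraph("First idea, developed fully.\n\nSecond idea, developed fully.")
```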
4. Recursive Chunking:
- How it works: This method involves splitting a document into larger chunks first, and then recursively splitting those larger chunks into smaller, more semantically coherent units if they exceed a certain size or complexity. It’s a hierarchical approach.
- Best for: Long, complex documents where different levels of granularity might be useful for retrieval. It allows for both broad and specific searches.
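The recursive idea can be sketched as follows. This is a simplified illustration: real implementations of this pattern also merge small adjacent pieces back together up to the size limit and usually keep the separators, both of which are omitted here for brevity:

```python
def chunk_recursively(text: str, max_len: int = 200,
                      separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Keep a chunk whole if it fits; otherwise split on the coarsest
    separator present and recurse into any piece that is still too long."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            chunks = []
            for piece in text.split(sep):
                chunks.extend(chunk_recursively(piece, max_len, separators))
            return [c for c in chunks if c.strip()]
    # No separator found: fall back to a hard character split.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

chunks = chunk_recursively("Sentence one. Sentence two.\n\nSentence three.", max_len=20)
```

Paragraph boundaries are tried first, so semantically related text stays together whenever it fits within the size limit.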
5. Content-Aware or NLP-Based Chunking:
- How it works: These methods leverage Natural Language Processing (NLP) techniques to understand the semantic flow of the text:
- Topic Modelling: Algorithms can identify distinct topics within a document and group sentences or paragraphs belonging to the same topic into a chunk.
- Coherence Scoring: Analysing the semantic similarity between sentences or paragraphs to identify natural breakpoints where the topic or focus shifts.
- Embedding Similarity: Using vector embeddings of sentences or paragraphs. When the similarity between consecutive units drops below a certain threshold, it indicates a potential chunk boundary.
- Best for: Less structured or free-form text where explicit delimiters are absent, or for achieving highly nuanced semantic divisions.
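The embedding-similarity approach can be illustrated with a toy sketch. The `toy_embed` function here is a stand-in for a real embedding model (e.g. a sentence-transformer), so the vocabulary and threshold are illustrative assumptions, not recommendations:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def chunk_by_similarity(sentences: list[str], embed, threshold: float = 0.5) -> list[str]:
    """Start a new chunk whenever the similarity between consecutive
    sentence embeddings drops below the threshold."""
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

def toy_embed(sentence: str) -> list[float]:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    vocab = ["pizza", "pasta", "clams", "tomato", "dessert", "gelato"]
    words = sentence.lower().split()
    return [float(words.count(w)) for w in vocab]

chunks = chunk_by_similarity(
    ["pasta with clams", "clams and pasta", "gelato dessert"], toy_embed
)
```

The two pasta sentences stay together while the dessert sentence, whose embedding is dissimilar to its predecessor, starts a new chunk.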
6. Hybrid Approaches:
- How it works: Often, the most effective strategy is to combine multiple methods. For example, a document might first be split by major headings (rule-based), and then within each section, paragraphs could be further chunked (paragraph-based), or even sentences if the content demands it.
- Best for: Almost all real-world applications, as documents rarely conform perfectly to a single chunking strategy.
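A hybrid strategy along these lines can be sketched as a rule-based heading split followed by a paragraph split of any oversized section. The menu text and size limit below are illustrative assumptions:

```python
import re

def hybrid_chunk(markdown: str, max_len: int = 300) -> list[str]:
    """Hybrid strategy: split on '##' headings first (rule-based), then
    break any oversized section into paragraph chunks (paragraph-based)."""
    sections = [s.strip() for s in re.split(r"(?m)^(?=## )", markdown) if s.strip()]
    chunks = []
    for section in sections:
        if len(section) <= max_len:
            chunks.append(section)  # Small section: keep heading and body together.
        else:
            chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
    return chunks

menu = (
    "## Starters\n"
    "Soup of the day with sourdough.\n\n"
    "## Mains\n"
    + "Linguine alle Vongole: pasta sauteed with fresh clams, garlic, white wine, and parsley. " * 4
    + "\n\nMargherita pizza: tomato, mozzarella, and basil."
)
chunks = hybrid_chunk(menu)
```

The short "Starters" section survives as a single chunk, while the long "Mains" section is broken down by paragraph.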
Key Considerations for Choosing a Method
With so many semantic chunking techniques available – from simple delimiter-based methods to advanced NLP-driven approaches – it’s important to step back and evaluate which strategy best fits your specific use case. Selecting the right approach isn’t just about how the document can be split, but how it should be split to support reliable retrieval and effective downstream LLM reasoning.
When deciding which chunking method to use, consider the following factors:
- Document Structure: How structured is your data? Does it have clear headings, sections, or other logical divisions?
- Query Patterns: What kind of questions will your RAG system be asked? Do users need very specific facts (smaller chunks) or broader contextual information (larger chunks)?
- Retrieval Granularity: How precise do your retrieval results need to be?
- Computational Cost: More advanced NLP-based methods can be more computationally intensive.
- Tooling: As mentioned, the available tools and frameworks (like LangChain’s diverse text splitters) will significantly influence your choices.
By thoughtfully selecting and implementing a semantic chunking strategy, you can dramatically improve the precision and recall of your RAG system, leading to more accurate and relevant responses from your LLM.
Selecting the Right Framework
This naturally raises an important practical question: which development framework can actually support the chunking strategy you’ve chosen? The effectiveness of your approach depends not only on understanding the theory behind semantic chunking but also on having tools capable of applying it correctly. When evaluating frameworks for building a RAG system, it’s crucial to look beyond surface-level features and closely examine their data processing capabilities.
For instance, the LangChain framework provides a variety of sophisticated text splitters, including a MarkdownTextSplitter, which is designed for precisely this kind of structured data. In contrast, other platforms may lack these specialised tools, making them less suitable for projects that involve anything other than simple text documents. This distinction is not minor – it can be the deciding factor in whether your application succeeds or fails.
Conclusion
The takeaway is clear: building a high-performing RAG system requires a thoughtful approach to data ingestion. Moving away from naive, fixed-size chunking and adopting a semantic strategy that respects the intrinsic structure of your data is paramount. By ensuring that each chunk represents a complete and coherent piece of information, you provide your RAG system with the high-quality foundation it needs to deliver truly accurate and intelligent results.