By Richard Louden, Head of Technology (Data) at Nimble Approach
This blog examines the rapid rise in data generation and storage, why more data isn’t always better for training Large Language Models (LLMs), the importance of data curation in improving AI performance, and practical steps to support effective AI implementation.
The volume of data generated and stored by organisations continues to grow each day. An IDC whitepaper estimates that by the end of 2025, the world will be holding around 175 zettabytes of data – equivalent to 175 trillion 1GB USB sticks. The same whitepaper estimates that around 80% of this data is in an unstructured format – such as audio, video and text files – and that 90% of that unstructured data is never analysed.
This continual rise in volume is due to two factors:
- We now have many more methods of generating data.
- Organisations have access to cheap storage through data lake services such as Amazon’s S3, Microsoft’s ADLS, and Google’s GCS.
The days of having to tightly control what data you keep for fear of massive bills or capacity issues are gone, replaced with the ability to store a terabyte of information for ~£15. Market research estimates the global data lake market at roughly $16 billion in 2024, with projections reaching $60 billion by 2030 – highlighting how widely these services are adopted, even when much of the underlying data goes largely unused.
Why More Data Doesn’t Always Mean Better AI
However, with the meteoric advancement of LLMs, the accessibility and value of this data is now at an all-time high. Organisations are racing to adopt AI in the hope of boosting efficiency and driving profit, and many see their vast stores of data as the ideal way to provide valuable context to commercially available models. The logical next step is to take this data, transform and index it, store it in a way an LLM can easily access, and unlock a whole new set of insights. In theory this seems like the best approach – after all, wouldn’t access to more information always be better? Work by Anthropic (creators of the Claude model) and Chroma (creators of an open-source database for AI applications) challenges this assumption, showing instead that models are at their most effective when given smaller amounts of highly relevant context. Feeding them an expanse of likely conflicting information can seriously undermine the value of their outputs.
When you pair this with an MIT study reporting that 95% of generative AI pilots fail, largely due to the difficulty of giving models the right context, it’s clear why a high-quality, low-volume strategy is gaining traction.
Better Curation For Better Clarity
So if taking your entire data lake, running it through a Retrieval-Augmented Generation (RAG) pipeline – essentially transforming and indexing the data so LLMs and agentic applications can retrieve relevant information – and using it as context ultimately proves detrimental, what is the right approach? This is where data curation comes in, a term that has existed in the data architecture and engineering space for a long time. There, it has typically involved cleaning and transforming structured data to create something that is understandable by end users.
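To make the idea concrete, here is a minimal sketch of the transform-index-retrieve loop a RAG pipeline performs. It is deliberately dependency-free: the embed function is a toy bag-of-words stand-in for a real embedding model, and the names are illustrative rather than any particular library’s API.

```python
from dataclasses import dataclass

def embed(text: str) -> dict[str, float]:
    # Toy embedding: normalised word counts. A real pipeline would call an
    # embedding model here; the rest of the flow stays the same.
    words = text.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    # Cosine similarity between two sparse vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

@dataclass
class Chunk:
    source: str
    text: str
    vector: dict[str, float]

def index_documents(docs: dict[str, str], chunk_size: int = 50) -> list[Chunk]:
    """Transform and index: split each document into chunks and embed them."""
    chunks = []
    for source, text in docs.items():
        words = text.split()
        for i in range(0, len(words), chunk_size):
            piece = " ".join(words[i:i + chunk_size])
            chunks.append(Chunk(source, piece, embed(piece)))
    return chunks

def retrieve(query: str, index: list[Chunk], top_k: int = 3) -> list[Chunk]:
    """Retrieve: return the chunks most relevant to the query, passed to the LLM as context."""
    query_vec = embed(query)
    return sorted(index, key=lambda c: similarity(query_vec, c.vector), reverse=True)[:top_k]
```

In practice the toy embedding and in-memory list would be replaced by an embedding model and a vector store (such as Chroma), but the shape of the pipeline stays the same, and so does the point at which curation determines the quality of the retrieved context.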
Applying this to create organisational context for LLMs is no different, though it places a greater focus on cleansing, organising, and metadata creation. Consider someone asking an LLM agent to analyse active users in an organisation where multiple departments each carry their own definition of that term. If the agent draws context from uncurated sources (itself a potential security risk, since it may incorrectly access sensitive data), it will likely encounter conflicting information that it cannot properly reconcile. As a result, it may produce a response that sounds convincing but is ultimately inaccurate.
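To illustrate, here is a small sketch of the kind of metadata a curated source might carry so that an agent can resolve the right definition and avoid sensitive data. The field names, values, and clearance model are hypothetical examples, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class CuratedSource:
    name: str
    owner_domain: str      # which department owns this definition
    definition: str        # the agreed meaning of the term in that domain
    sensitivity: str = "internal"  # e.g. "public", "internal", "restricted"

# Hypothetical curated catalogue: each conflicting definition of "active users"
# is labelled with its owning domain, and sensitive data is marked as such.
sources = [
    CuratedSource("marketing_active_users", "marketing",
                  "Users who opened at least one campaign email in the last 30 days"),
    CuratedSource("product_active_users", "product",
                  "Users who logged in at least once in the last 7 days"),
    CuratedSource("finance_customer_billing", "finance",
                  "Per-customer billing records", sensitivity="restricted"),
]

SENSITIVITY_ORDER = {"public": 0, "internal": 1, "restricted": 2}

def sources_for(domain: str, clearance: str = "internal") -> list[CuratedSource]:
    """Only surface sources owned by the requesting domain and within the caller's clearance."""
    return [
        s for s in sources
        if s.owner_domain == domain
        and SENSITIVITY_ORDER[s.sensitivity] <= SENSITIVITY_ORDER[clearance]
    ]

# An agent answering a product question now sees a single, unambiguous
# definition and never touches the restricted finance records.
print(sources_for("product"))
```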
An example of the benefit of data curation comes from a recent project at Etsy, where an LLM was used to classify product data. Initially, they tried using data that had been mass-labelled by a third party, but the labellers’ lack of product knowledge and the level of human error significantly affected the results. They pivoted instead to using internal product experts to curate a smaller, much higher quality dataset, which led to significantly improved outputs.
Another example – this time in the related area of model fine-tuning rather than context provision – comes from work by Dataology. Comparing identical models trained on a curated 100-terabyte corpus and on a raw 1-petabyte corpus, they found that the curated set delivered 2.8× faster training and a 2.1× reduction in inference costs. With implementation costs for LLMs continuing to rise and ROI under increasing scrutiny, the case for properly curating your data is clear.
Where Do You Start With Data Curation?
Given there is clearly a case for curating data to enhance the value of LLMs, what are the practical steps an organisation can take?
- Know Your Data: The first step to data curation is understanding what your data relates to and how it connects to other sources. Document these elements across your organisational domains to build out your data architecture, then use it both to support the curation process and to provide models with further context.
- Understand The Process: The more complex the process, the more value you will see from introducing AI into it. However, such processes often have numerous steps, data sources, and pieces of hidden context that need to be understood and documented before adding AI into the loop. Without this step, you’ll see poor-quality outputs – just as you would if you gave the task to a new starter with no access to supporting information.
- Establish Reusable Curation Pipelines: Data curation is not a one-off exercise; you will need to keep doing it to bring AI into additional processes and to improve the outputs of those it already supports. Aim to make these pipelines as modular and reusable as possible, so that curation does not become a blocker when new information is needed to support an implementation. A minimal sketch of what such a pipeline might look like follows this list.
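As referenced above, here is a minimal sketch of what a modular, reusable curation pipeline could look like. Each step is a small function that takes and returns a list of records, so steps can be reordered, swapped, or reused for a new domain without rebuilding the whole process; the step names and plain-dict record shape are illustrative assumptions.

```python
from typing import Callable

Record = dict[str, str]
Step = Callable[[list[Record]], list[Record]]

def drop_empty(records: list[Record]) -> list[Record]:
    # Cleanse: remove records with no usable text.
    return [r for r in records if r.get("text", "").strip()]

def deduplicate(records: list[Record]) -> list[Record]:
    # Cleanse: keep only the first copy of each distinct piece of text.
    seen, out = set(), []
    for r in records:
        key = r["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def tag_domain(domain: str) -> Step:
    # Metadata creation: label every record with its owning domain.
    def step(records: list[Record]) -> list[Record]:
        return [{**r, "domain": domain} for r in records]
    return step

def run_pipeline(records: list[Record], steps: list[Step]) -> list[Record]:
    for step in steps:
        records = step(records)
    return records

# The same building blocks are reused for each new department; only the
# domain-specific steps change.
marketing_pipeline = [drop_empty, deduplicate, tag_domain("marketing")]
finance_pipeline = [drop_empty, deduplicate, tag_domain("finance")]
```

Bringing AI into a new process then means adding one or two small steps rather than standing up a new curation effort from scratch.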
Key Takeaways
The volume of data organisations store continues to grow, driven by falling storage costs, and recent advances in LLMs now offer a potential way to put that information to use. However, research in this area shows that these models actually suffer when given large amounts of unorganised context, as they struggle to reconcile conflicting sources. Instead, organisations should focus on understanding the data they hold, the processes it supports, and how to curate it effectively to give their AI initiatives the best chance of success.
We aim to help organisations navigate these challenges and realise sustainable, long-term value. If you’re unsure where to begin, or want support designing a practical, scalable curation strategy, reach out to our team and start building AI that works in the real world.