RAG and agentic systems can be much harder to test than traditional software systems. Through numerous conversations with a variety of enterprise customers, we’ve found testing to be one of the biggest unknowns in enterprise AI integration.
Simply put, some of the largest companies in the world don’t know how to definitively say which RAG and agentic technologies work best for their needs. This is unacceptable when deploying AI products at scale. At EyeLevel, we take a direct and pragmatic approach to testing, so that customers can be confident that GroundX is the right tool for their needs.
At EyeLevel, we see holistic testing with human evaluation as the only way to definitively say a RAG system is performant. Our approach is as follows:
We employ a variety of strategies to speed up this general process, from systems that expedite Q/A pair generation to LLM-as-a-judge systems that smoke-test results. While these systems help, we’ve found that automated Q/A pair generation can result in biased evaluation sets, and that LLM-as-a-judge scoring is skewed on difficult questions. Ultimately, to get actionable and accurate metrics, humans need to define the questions, define the answers, and score the results.
The major drawback of this approach is that it’s labor-intensive. To reduce that burden, we’ve created a collection of datasets designed to let users test GroundX.
Note: this list is in active development. More datasets will be added in the coming weeks.
This dataset consists of 1,000 pages of public-facing PDFs from the tax consultant Deloitte. These documents were selected for their rich information density and for the diverse, multimodal ways in which they present that information. This dataset is chiefly designed to test a RAG system’s ability to parse and accurately represent rich and complex multimodal data.
The dataset is paired with 92 Q/A pairs, each defining a question and its ground truth answer. By asking a RAG system these questions and comparing the generated answers to the ground truth answers, you can calculate the general performance of the RAG system.
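The comparison described above can be sketched as a small evaluation loop. This is a minimal illustration, not GroundX tooling: the `QAPair` and `Judgment` types, the `answer_fn` callable (standing in for whatever RAG system is under test), and the `judge_fn` callable (standing in for a human reviewer's verdict) are all hypothetical names introduced here, and the actual format of the Q/A pair files is not assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAPair:
    """One evaluation item: a question and its ground truth answer (hypothetical shape)."""
    question: str
    ground_truth: str


@dataclass
class Judgment:
    """The generated answer for one Q/A pair plus the reviewer's verdict."""
    question: str
    ground_truth: str
    generated: str
    correct: bool


def evaluate(
    pairs: Iterable[QAPair],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
) -> List[Judgment]:
    """Ask the system every question and record a verdict for each answer.

    answer_fn: hypothetical wrapper around the RAG system under test.
    judge_fn:  hypothetical hook that returns a human reviewer's verdict
               given (question, ground_truth, generated).
    """
    judgments = []
    for pair in pairs:
        generated = answer_fn(pair.question)
        correct = judge_fn(pair.question, pair.ground_truth, generated)
        judgments.append(
            Judgment(pair.question, pair.ground_truth, generated, correct)
        )
    return judgments


def accuracy(judgments: List[Judgment]) -> float:
    """Fraction of answers judged correct; 0.0 for an empty run."""
    if not judgments:
        return 0.0
    return sum(j.correct for j in judgments) / len(judgments)
```

In practice `judge_fn` would look up a score that a person entered after reading the generated answer next to the ground truth. Keeping it as a separate callable leaves room to swap in an automated judge for quick smoke tests while reserving human scoring for the reported numbers.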
We wrote a blog post describing how GroundX performed relative to other common RAG approaches on this dataset: link
This dataset is an extension of the Deloitte 1k dataset discussed previously. This dataset consists of four partitions:
In other words, this dataset poses the same core questions against the same set of documents as the Deloitte 1k dataset, but adds erroneous and irrelevant documents. It is designed to test how well a RAG system’s search capabilities deal with large datasets containing irrelevant information.
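Because the core questions are shared with the Deloitte 1k dataset, one way to read the results is to compare per-question verdicts with and without the added documents. The sketch below assumes, purely for illustration, that each run has been reduced to a mapping from a question id to a correct/incorrect verdict; the function names and that mapping are hypothetical and not part of any GroundX tooling.

```python
from typing import Dict, List


def accuracy(scores: Dict[str, bool]) -> float:
    """Fraction of questions judged correct in one run."""
    if not scores:
        raise ValueError("no scored questions")
    return sum(scores.values()) / len(scores)


def regressions(baseline: Dict[str, bool], noisy: Dict[str, bool]) -> List[str]:
    """Question ids answered correctly on the baseline run but incorrectly
    once irrelevant documents were added. Only ids present in both runs
    are compared."""
    shared = baseline.keys() & noisy.keys()
    return sorted(q for q in shared if baseline[q] and not noisy[q])


def accuracy_drop(baseline: Dict[str, bool], noisy: Dict[str, bool]) -> float:
    """Baseline accuracy minus accuracy with the added documents."""
    return accuracy(baseline) - accuracy(noisy)
```

The list returned by `regressions` is often more useful than the aggregate drop, since it points at the specific questions where retrieval was pulled off course by the extra documents.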
We wrote a blog post describing how GroundX performed relative to other common RAG approaches on this dataset: link
We plan on releasing tooling to make dataset construction and testing easier, but for now, we recommend our deep-dive on the subject:
RAG Evaluation: Almost Everything You Need to Know
From a high level, we recommend the following: