How We Approach Testing

Introduction

Unlike traditional software systems, it can be very hard to test RAG and agentic systems. Through numerous conversations with a variety of enterprise customers, we’ve found testing to be one of the biggest unknowns in enterprise AI integration.

Simply put, some of the largest companies in the world don’t know how to definitively say which RAG and agentic technologies work best for their needs. This is unacceptable when deploying AI products at scale. At EyeLevel, we take a direct and pragmatic approach to testing, so that customers can be confident that GroundX is the right tool for their needs.

Our Core Testing Philosophy

At EyeLevel, we see holistic testing with human evaluation as the only way to definitively say a RAG system is performant. Our approach is as follows:

Curate real-world, application-relevant documents
Define application-relevant questions and their corresponding real-world answers (a.k.a. “ground truth Q/A pairs”)
Run the questions against the RAG system being tested
Evaluate generated answers against ground truth answers, using human evaluation as the primary metric
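
The four steps above can be sketched as a small harness. This is a minimal sketch under stated assumptions, not EyeLevel tooling: `QAPair`, `collect_answers`, `write_grading_sheet`, and the `ask` callable (standing in for whatever RAG system is under test) are hypothetical names introduced for illustration. The score column is left blank on purpose so that a human reviewer fills it in.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Iterable, TextIO

FIELDS = ["question", "ground_truth", "generated", "human_score"]


@dataclass
class QAPair:
    """One human-written question and its ground truth answer (step 2)."""
    question: str
    ground_truth: str


def collect_answers(pairs: Iterable[QAPair], ask: Callable[[str], str]) -> list[dict]:
    """Run every question through the system under test (step 3)."""
    return [
        {
            "question": pair.question,
            "ground_truth": pair.ground_truth,
            "generated": ask(pair.question),
            "human_score": "",  # left blank: a human grader fills this in (step 4)
        }
        for pair in pairs
    ]


def write_grading_sheet(rows: list[dict], out: TextIO) -> None:
    """Write the rows as CSV so reviewers can score them in a spreadsheet."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def fake_rag(question: str) -> str:
    """Hypothetical stand-in for a call to a real RAG system."""
    return f"(generated answer to: {question})"


# Placeholder Q/A pair; real pairs come from subject matter experts.
pairs = [QAPair("Placeholder question?", "Placeholder ground truth answer")]

sheet = io.StringIO()
write_grading_sheet(collect_answers(pairs, fake_rag), sheet)
print(sheet.getvalue())
```

Passing the system under test as a plain callable keeps the harness independent of any particular RAG client, so the same Q/A pairs can be replayed against several systems.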

We use several strategies to speed up this process, from tooling that accelerates Q/A pair generation to LLM-as-a-judge systems that smoke-test results. While these help, we’ve found that automated Q/A pair generation can produce biased evaluation sets, and that LLM-as-a-judge evaluation gives skewed results on difficult questions. Ultimately, to get actionable and accurate metrics, humans need to define the questions, define the answers, and score the results.

The major drawback with this approach is that it’s labor intensive. Thus, we’ve created a collection of datasets which are designed to allow users to test GroundX.

Datasets

note: this list is currently in active development. More datasets will be added within the coming weeks.

Deloitte 1K

github link

This dataset consists of 1,000 pages of public-facing PDFs from the tax consultant Deloitte. These documents were selected for their rich information density and for the diverse, multimodal ways in which that information is presented. The dataset is chiefly designed to test a RAG system’s ability to parse and accurately represent rich and complex multimodal data.

The dataset is paired with 92 Q/A pairs which define questions and their ground truth answers. By asking a RAG system these questions, and comparing the generated result to the ground truth answers, the general performance of the RAG system can be calculated.
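
As a rough illustration of that calculation, the sketch below treats each human grade as correct or incorrect and reports the fraction correct. The `accuracy` helper is a hypothetical name, and the grades are made up purely to show the arithmetic; they are not results for any system.

```python
def accuracy(grades: list[bool]) -> float:
    """Fraction of questions a human grader marked as correct."""
    if not grades:
        raise ValueError("no grades to score")
    return sum(grades) / len(grades)


# Made-up grades for a 92-question run: 69 marked correct, 23 marked incorrect.
example_grades = [True] * 69 + [False] * 23
print(f"{accuracy(example_grades):.1%}")  # prints 75.0%
```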

We wrote a blog post describing how GroundX performed relative to other common RAG approaches on this dataset: link

Deloitte 100K

github link

This dataset is an extension of the Deloitte 1K dataset discussed previously. It consists of four partitions:

partition 0: the Deloitte 1K dataset. 1,000 pages of documents and the corresponding 92 Q/A pairs.
partition 1: partition 0, plus 9,000 pages of erroneous documents, bringing the partition to a total page count of 10,000. The same Q/A pairs are preserved.
partition 2: partition 0, plus erroneous documents, bringing the partition to a total page count of 50,000. The same Q/A pairs are preserved.
partition 3: partition 0, plus erroneous documents, bringing the partition to a total page count of 100,000. The same Q/A pairs are preserved.

Thus, this dataset consists of the same core questions posed to the same set of documents as the Deloitte 1K dataset, but with the addition of erroneous and irrelevant documents. This is designed to test how well a RAG system’s search capabilities deal with large datasets that contain mostly irrelevant information.
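
The page counts above imply how diluted the relevant material becomes in each partition. The sketch below works through that arithmetic using only the totals listed above. The names `CORE_PAGES`, `PARTITION_TOTAL_PAGES`, `distractor_pages`, and `relevant_share` are made up for illustration and are not part of the dataset or its tooling.

```python
CORE_PAGES = 1_000  # pages the 92 Q/A pairs were written against (partition 0)

# Total page count per partition, as described above.
PARTITION_TOTAL_PAGES = {0: 1_000, 1: 10_000, 2: 50_000, 3: 100_000}


def distractor_pages(partition: int) -> int:
    """Pages added on top of the core 1,000 pages."""
    return PARTITION_TOTAL_PAGES[partition] - CORE_PAGES


def relevant_share(partition: int) -> float:
    """Fraction of the partition's pages that belong to the core set."""
    return CORE_PAGES / PARTITION_TOTAL_PAGES[partition]


for partition in sorted(PARTITION_TOTAL_PAGES):
    print(
        f"partition {partition}: "
        f"{distractor_pages(partition):,} added pages, "
        f"{relevant_share(partition):.1%} of pages are from the core set"
    )
```

By partition 3, only about 1 page in 100 comes from the core set, which is why the larger partitions stress retrieval rather than parsing.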

We wrote a blog post describing how GroundX performed relative to other common RAG approaches on this dataset: link

Defining your Own Dataset

We plan on releasing tooling to make dataset construction and testing easier, but for now, we recommend our deep-dive on the subject:

RAG Evaluation: Almost Everything You Need to Know

At a high level, we recommend the following:

Start small. 100 pages of content and 20 question/answer pairs curated by subject matter experts can go a long way in testing a RAG system.
Most evaluation platforms offer some type of automation. This saves time, but we generally prefer humans in the loop. We’ve found LLM evaluation to be off base roughly 15%-20% of the time, producing both false positives and false negatives, though keeping the answers very short and simple lessens the issue.
Ensure diversity in the dataset. Observe the types of data (textual, tabular, graphical) and types of questions (entity extraction, summarization, aggregation) that are relevant to your use case and ensure they’re meaningfully present in your document set and Q/A pairs.
Scale up. Once you are in a good place with your initial small test, ramp it up. We usually jump to 100 questions and 1,000 pages. We often add more pages for enterprise clients. We’ve seen popular vector approaches degrade substantially in performance when exposed to even a modest 10,000-page dataset.
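
If an LLM judge is used alongside human graders, one way to see how far off it is on your own data is to compare its verdicts with the human ones question by question. The sketch below is an assumption about how that comparison could be done, not a description of EyeLevel’s process. `compare_judge_to_humans` is a hypothetical helper, and the sample verdicts are invented for the example.

```python
def compare_judge_to_humans(human: list[bool], judge: list[bool]) -> dict:
    """Count where an LLM judge disagrees with human grades.

    A false positive is an answer the judge accepted but the human rejected;
    a false negative is an answer the judge rejected but the human accepted.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge grades must cover the same questions")
    if not human:
        raise ValueError("no grades to compare")
    false_positives = sum(1 for h, j in zip(human, judge) if j and not h)
    false_negatives = sum(1 for h, j in zip(human, judge) if h and not j)
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "disagreement_rate": (false_positives + false_negatives) / len(human),
    }


# Invented verdicts for five questions, in the same question order.
human_grades = [True, True, False, False, True]
judge_grades = [True, False, True, False, True]
print(compare_judge_to_humans(human_grades, judge_grades))
```

Human grades are treated as the reference here, in line with the preference for humans in the loop described above.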