In-Depth Exploration of GroundX Document Ingest

Introduction

In this tutorial, we’ll cover how to add, or “ingest”, your files to GroundX.

With our proprietary ingest pipeline, your files undergo three critical processes:

First, an object detection model is applied to your document. This allows GroundX to understand the key components, visual information, and formatting.
Next, a variety of fine-tuned VLMs are used to convert your data into a grounded textual representation that LLMs can understand.
Finally, your data is passed through a contextualization pipeline to bake in contextual metadata about your document.

Unlike other RAG solutions that require you to convert your files into plain text, GroundX is compatible with a wide variety of file formats out of the box, allowing you to expose your document data directly to an LLM without custom configuration.

More information about document parsing can be found in our guide on GroundX Ingest for Parsing. In this article, we’ll focus on the ingest pipeline in general: how to ingest local files, remotely hosted files, directories, and so on.

Getting started

API Key

Go to the GroundX dashboard to get your API key.
GroundX can be installed for Python via pip install groundx.
GroundX can be installed for Node.js via npm i -s groundx.

Before we begin, make sure you have the following information:

The ID of the GroundX bucket in which you wish to store your file. If you don’t have a bucket, you can create one with the buckets.create endpoint or through the GroundX dashboard.
The local path or public URL of the file you want to upload.

You may also want to prepare the following optional values:

The file name you wish to give your file once it’s in the GroundX bucket. This can be the name of the file being uploaded or a different name.
The file type. The following file types are accepted:

bmp, csv, docx, gif, heif, hwp, ico, jpg, json, 
pdf, png, pptx, svg, tiff, tsv, txt, xlsx, webp

Example:

bucket_id = 6830
file_name = "aristotle-rhetoric.pdf"
file_type = "pdf"
upload_path = "documents/Aristotle-rhetoric.pdf"
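If you would rather derive file_type from the path than set it by hand, a small helper can check the extension against the accepted list above. This is a minimal sketch: ACCEPTED_FILE_TYPES and infer_file_type are hypothetical names, not part of the GroundX SDK, and the set simply mirrors the list in this article.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Mirrors the accepted file types listed in this article (not read from the SDK).
ACCEPTED_FILE_TYPES = {
    "bmp", "csv", "docx", "gif", "heif", "hwp", "ico", "jpg", "json",
    "pdf", "png", "pptx", "svg", "tiff", "tsv", "txt", "xlsx", "webp",
}


def infer_file_type(path_or_url: str) -> str:
    """Return the lowercase extension if it is in the accepted list."""
    # urlparse separates any query string or fragment from a URL's path;
    # a plain relative local path passes through unchanged.
    path = urlparse(path_or_url).path
    extension = PurePosixPath(path).suffix.lstrip(".").lower()
    if extension not in ACCEPTED_FILE_TYPES:
        raise ValueError(f"Unsupported file type: {extension!r}")
    return extension


file_type = infer_file_type("documents/Aristotle-rhetoric.pdf")
```

Failing early on an unsupported extension keeps the error close to its cause instead of surfacing later in the ingest request.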

Ingesting Individual Files

Now that we have a GroundX bucket we can upload content to, we can explore how ingest functions in GroundX. The simplest way to ingest content into GroundX is by uploading files one at a time.

First, you’ll need to set up authentication with the GroundX client.

from groundx import Document, GroundX

client = GroundX(
    api_key="YOUR_API_KEY",
)

Security Note

The “YOUR_API_KEY” placeholder represents your API key. We recommend storing your API key in an environment variable and reading it from there rather than hard-coding it. For this purpose, you can use libraries such as dotenv in Node.js or os in Python.
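As a rough illustration of that recommendation, the key can be read with Python's os module. The variable name GROUNDX_API_KEY and the helper load_api_key are assumptions made for this sketch; use whatever name your deployment already defines.

```python
import os


def load_api_key(env_var: str = "GROUNDX_API_KEY") -> str:
    """Read the API key from the environment, failing loudly if it is unset."""
    # env_var defaults to an assumed name; it is not mandated by GroundX.
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return api_key


# Usage (requires the groundx package):
#   client = GroundX(api_key=load_api_key())
```

Raising when the variable is missing avoids sending requests with an empty key and getting a less obvious authentication error back.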

Once you’ve authenticated your client, you can ingest a document into GroundX via the ingest endpoint.

response = client.ingest(
    documents=[
        Document(
            bucket_id=bucket_id,
            file_name=file_name,
            file_path=upload_path,
            file_type=file_type,
            # Optional; see "Adding extra search data".
            search_data=search_data,
        )
    ]
)

The file_path specified in the ingest endpoint can be either a local path or a public URL.
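Because the same field accepts both kinds of source, it can help to check a value before sending it. The helper below is a hypothetical pre-flight check, not something the SDK provides or requires: it treats http and https values as remote and verifies that anything else exists on disk.

```python
from pathlib import Path
from urllib.parse import urlparse


def describe_source(file_path: str) -> str:
    """Classify a file_path as "remote" or "local" before ingesting it."""
    scheme = urlparse(file_path).scheme.lower()
    if scheme in ("http", "https"):
        return "remote"
    # Anything else is treated as a local path and must exist on disk.
    if not Path(file_path).is_file():
        raise FileNotFoundError(f"No local file at {file_path!r}")
    return "local"
```

A check like this catches a mistyped local path before any request is made.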

After making the request, you should receive a response with processId and status. This response indicates that GroundX is uploading or ingesting your file into the indicated bucket.

{
    "ingest": {
        "processId": "23e782ac-3829-4833-965d-e77b4e289885",
        "status": "queued"
    }
}

The processId can be polled to get the most up-to-date upload status via the documents.get_processing_status_by_id endpoint.
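One way to structure that polling is a loop with an interval and a timeout. The sketch below is deliberately decoupled from the SDK: fetch_status is a hypothetical callable you would write to call documents.get_processing_status_by_id and return the status string, and the set of terminal statuses is an assumption to confirm against the API reference.

```python
import time
from typing import Callable

# Assumed terminal states; confirm the full list in the API reference.
TERMINAL_STATUSES = {"complete", "error", "cancelled"}


def wait_for_ingest(
    fetch_status: Callable[[], str],
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Call fetch_status until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Ingest still {status!r} after {timeout_seconds}s")
        time.sleep(interval_seconds)
```

Passing the status lookup in as a callable keeps the loop independent of the exact response shape, which may differ between SDK versions.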

Ingesting Directories

If you’re using the Python SDK, you can use the ingest_directory method to ingest the contents of a directory into a particular bucket.

from groundx import GroundX

client = GroundX(
    api_key="YOUR_API_KEY",
)
client.ingest_directory(
    bucket_id=1234,
    path="/path/to/directory",
)

This method asynchronously batch-uploads every document in the directory tree rooted at the specified path. It renders a tqdm progress bar and automatically polls for updates on the batch currently being uploaded.
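Before pointing ingest_directory at a large tree, it can be useful to preview which files have an accepted extension. The sketch below is only a local preview under stated assumptions: list_ingestable_files is a hypothetical helper, the extension set mirrors the list in this article, and ingest_directory may apply its own filtering.

```python
from pathlib import Path

# Mirrors the accepted file types listed in this article (not read from the SDK).
ACCEPTED_FILE_TYPES = {
    "bmp", "csv", "docx", "gif", "heif", "hwp", "ico", "jpg", "json",
    "pdf", "png", "pptx", "svg", "tiff", "tsv", "txt", "xlsx", "webp",
}


def list_ingestable_files(root: str) -> list[Path]:
    """Walk the tree under root and keep files with an accepted extension."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file()
        and path.suffix.lstrip(".").lower() in ACCEPTED_FILE_TYPES
    )
```

Comparing the length of this list with the number of documents that appear in the bucket afterwards gives a quick sanity check on the upload.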

Adding extra search data

GroundX automatically generates contextual search data for your files. However, you can add extra search data to take maximum advantage of GroundX’s search capabilities, help maintain document context in the search query responses, and add tags or notes indicating instructions on how to handle the search results.

Example:

search_data = {
    "title": "rhetoric",
    "author": "Aristotle",
    "keywords": ["Ethos", "Pathos", "Logos", "Rhetorical Triangle", "Persuasion"],
}
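When many files share some metadata, the dictionary can be assembled from a shared base plus per-file fields. The helper below is a hypothetical sketch; it also assumes that search data is sent to the API as JSON, so it checks that the result serializes before you pass it to Document.

```python
import json


def build_search_data(base: dict, **extra) -> dict:
    """Merge shared metadata with per-file fields and confirm it serializes."""
    search_data = {**base, **extra}
    # Assumption: the payload travels as JSON. json.dumps raises TypeError
    # for values such as sets or datetime objects.
    json.dumps(search_data)
    return search_data


search_data = build_search_data(
    {"author": "Aristotle"},
    title="rhetoric",
    keywords=["Ethos", "Pathos", "Logos", "Rhetorical Triangle", "Persuasion"],
)
```

Per-file fields override the shared base when both define the same key, because later entries win in a dictionary merge.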

Final details

Processing time depends on the size of your files. For upload restrictions like file and batch size, see the prompting and integration guide.

Once ingest completes, your content is ready for search and for automated response generation for your queries, without the manual preprocessing that other RAG solutions typically require.