Extract Data from Documents
Use GroundX when you want a document to come back as JSON your application can
use. For example, a utility statement can return statement.account_number,
statement.total_amount_due, and service.service_address.
This guide uses the GroundX Python SDK to create the workflow, upload a document, and read the extracted JSON.
What You’ll Do
- Write a YAML file that names the JSON keys you want back.
- Create a GroundX workflow from that YAML.
- Upload a document so GroundX can run the workflow.
- Call get_extract to read the extracted JSON.
1. Describe The JSON You Want Back
Create a YAML file. Names such as statement and service become top-level
objects in the returned JSON. Under each one, fields: is the list of values
GroundX should extract.
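For the utility statement example, a YAML file with statement and service groups tells GroundX to return JSON whose top-level keys are statement and service, with the listed fields nested under each.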
Keep this file focused on the JSON your application needs. Use names your
application will read, such as statement or service, not names that describe
how extraction runs.
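As a rough illustration of how names in the YAML map to the returned JSON, the sketch below models the utility statement example as a plain Python dictionary and derives the empty JSON skeleton from it. The dictionary layout (a fields list of names under each group) and the helper json_skeleton are assumptions made for illustration, not part of the GroundX SDK. The real YAML schema may carry more settings per field.

```python
import json

# Hypothetical stand-in for the contents of statement.yaml, using the
# group and field names from the utility statement example.
EXTRACTION_SPEC = {
    "statement": {"fields": ["account_number", "total_amount_due"]},
    "service": {"fields": ["service_address"]},
}


def json_skeleton(spec):
    """Return the JSON shape implied by the spec, with every value unset."""
    return {
        group: {field: None for field in settings["fields"]}
        for group, settings in spec.items()
    }


skeleton = json_skeleton(EXTRACTION_SPEC)
print(json.dumps(skeleton, indent=2))
```

Writing the skeleton down before tuning anything makes it easy to confirm that the names match what your application expects to read.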
2. Turn The YAML Into Workflow Settings
Use prepare_extraction_yaml to check the YAML and produce the setting you pass
as extract when you create the workflow.
You do not need to inspect prepared.workflow_groups in most applications. Pass
it to GroundX as the workflow’s extract setting.
3. Create And Assign The Workflow
Create the workflow with the settings from the previous step. Then assign the workflow to the bucket where you will upload documents.
Use client.workflows.add_to_account(...) instead when the workflow should be
the account default.
4. Upload A Document
Upload documents to the bucket that has the workflow assigned to it. Use
process_level="full" so GroundX runs the workflow during ingest.
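Because the next step starts only after ingest completes, an application usually needs some way to wait for that. The sketch below is a generic polling helper and makes no GroundX calls: wait_until and fake_ingest_finished are hypothetical names, and in a real application the check function would ask GroundX whether ingest has finished.

```python
import time


def wait_until(is_done, timeout_s=300.0, interval_s=5.0, sleep=time.sleep):
    """Call is_done() repeatedly until it returns True or the timeout passes.

    Returns True if is_done() succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example with a fake check that succeeds on the third call.
calls = {"count": 0}


def fake_ingest_finished():
    # Hypothetical stand-in: a real check would ask GroundX for ingest status.
    calls["count"] += 1
    return calls["count"] >= 3


finished = wait_until(fake_ingest_finished, timeout_s=1.0, interval_s=0.01)
```

Passing the check in as a function keeps the waiting logic separate from whichever status call your application uses.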
5. Get The JSON Back
After ingest completes, find the processed document and request its extracted JSON.
The result uses the same names from statement.yaml.
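Since the result reuses the YAML names, values can be read by dotted paths such as statement.account_number. The sketch below assumes the extracted JSON has already been loaded into a Python dictionary. The helper get_path and the sample values are made up for illustration and are not part of the GroundX SDK.

```python
def get_path(data, dotted_path, default=None):
    """Look up a value such as 'statement.account_number' in nested dicts."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical extraction result; the values are made up for illustration.
extracted = {
    "statement": {"account_number": "000-EXAMPLE", "total_amount_due": 0.0},
    "service": {"service_address": "1 Example Street"},
}

account_number = get_path(extracted, "statement.account_number")
missing = get_path(extracted, "statement.due_date", default="not extracted")
```

Returning a default instead of raising an error lets the application decide how to handle a value that GroundX did not extract.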
6. Improve The Results
When a value is missing or wrong, change the smallest part of the YAML that explains the miss.
- If one value is wrong, edit that value’s description, identifiers, or instructions.
- If several statement values are wrong, improve the prompt under statement:.
- If the document parsed poorly, inspect the document X-Ray before changing prompts.
- If the returned JSON uses the wrong names, update the YAML before tuning prompts.
Then prepare the YAML again, update the GroundX workflow, ingest another document, and read the result again.
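One way to decide which part of the YAML to change is to compare the names you expected with what came back. The sketch below is an assumption-laden illustration, not a GroundX feature: find_missing and suggest_edit are hypothetical helpers, "several" is interpreted as more than one field in a group, and the check only detects empty values, not values that are present but wrong.

```python
def find_missing(expected, extracted):
    """Map each group name to the expected fields that came back empty."""
    missing = {}
    for group, fields in expected.items():
        values = extracted.get(group) or {}
        empty = [f for f in fields if values.get(f) in (None, "")]
        if empty:
            missing[group] = empty
    return missing


def suggest_edit(missing):
    """Suggest the smallest part of the YAML to revisit for each group."""
    suggestions = {}
    for group, fields in missing.items():
        if len(fields) == 1:
            suggestions[group] = f"edit the settings for {fields[0]}"
        else:
            suggestions[group] = f"improve the prompt under {group}:"
    return suggestions


# Hypothetical expected names and extraction result for illustration.
expected = {
    "statement": ["account_number", "total_amount_due"],
    "service": ["service_address"],
}
extracted = {
    "statement": {"account_number": None, "total_amount_due": ""},
    "service": {"service_address": "1 Example Street"},
}

missing = find_missing(expected, extracted)
suggestions = suggest_edit(missing)
```

Running a check like this over a few documents shows whether a miss is isolated to one value or spread across a whole group.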

