Building reliable document extraction with GPT-4o and structured outputs

How we moved from fragile regex pipelines to a production-grade AI extraction system with evals and fallback logic.

The old pipeline: regex all the way down

Our client processed 800–1,200 scanned invoices per day from 60+ different vendors. Each vendor had slightly different layouts. The original pipeline was a chain of regex rules per vendor template — brittle, hard to maintain, and it broke every time a vendor changed their invoice format.

Maintenance was consuming 20% of one engineer's time. Accuracy sat at ~91%, meaning ~100 invoices per day required manual correction. The client wanted both problems solved.

Why structured outputs changed everything

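OpenAI's structured output mode (response_format with type: 'json_schema' and strict: true) lets you enforce a JSON schema, written in a supported subset of JSON Schema, on the model's response. Decoding is constrained at the token level rather than post-processed, so a completed response always matches the schema. The two exceptions are a refusal and a response cut off by the token limit, and the caller has to check for both.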

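This meant we could define a single Invoice schema (vendor name, invoice number, line items, totals, due date) and get back a structurally valid, typed object every time. No regex. No post-processing gymnastics to coerce the output into shape. Whether the extracted values are correct is a separate question, which the validation layer handles.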
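As a rough illustration of what such a schema and request might look like, here is a minimal sketch. The field names, the `Invoice` and `LineItem` interfaces, the system prompt wording, and the `buildExtractionRequest` and `parseInvoice` helpers are all assumptions made for this example, not the client's actual schema. The block only builds the request body for the Chat Completions endpoint and makes no network call.

```typescript
// Hypothetical invoice shape. Field names are illustrative only.
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
  total: number;
}

interface Invoice {
  vendorName: string;
  invoiceNumber: string;
  lineItems: LineItem[];
  total: number;
  dueDate: string; // assumed to be an ISO 8601 date string
}

// JSON Schema for strict mode: every property appears in `required`
// and `additionalProperties` is false at each object level.
const invoiceSchema = {
  type: "object",
  properties: {
    vendorName: { type: "string" },
    invoiceNumber: { type: "string" },
    lineItems: {
      type: "array",
      items: {
        type: "object",
        properties: {
          description: { type: "string" },
          quantity: { type: "number" },
          unitPrice: { type: "number" },
          total: { type: "number" },
        },
        required: ["description", "quantity", "unitPrice", "total"],
        additionalProperties: false,
      },
    },
    total: { type: "number" },
    dueDate: { type: "string" },
  },
  required: ["vendorName", "invoiceNumber", "lineItems", "total", "dueDate"],
  additionalProperties: false,
} as const;

// Builds the request body only. Sending it is left to whichever client you use.
function buildExtractionRequest(ocrText: string, model: string = "gpt-4o") {
  return {
    model,
    messages: [
      { role: "system", content: "Extract the invoice fields from the text." },
      { role: "user", content: ocrText },
    ],
    response_format: {
      type: "json_schema",
      json_schema: { name: "invoice", strict: true, schema: invoiceSchema },
    },
  };
}

// The message content comes back as a JSON string matching the schema.
function parseInvoice(content: string): Invoice {
  return JSON.parse(content) as Invoice;
}
```

Keeping the schema as a plain object next to the TypeScript interface makes it easy to review both in one place, at the cost of maintaining the two by hand.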

The extraction architecture

Our pipeline has three stages:

OCR layer: AWS Textract for scanned PDFs; direct text extraction for digital PDFs via pdf-parse. Output is normalised plain text with layout hints (table markers).
Extraction layer: GPT-4o with structured outputs. System prompt is short — basically "extract invoice fields" — and the schema does the heavy lifting. We pass the OCR text as the user message.
Validation layer: Zod schema validation on the parsed response. Line item totals must sum to the invoice total ± 0.01. If validation fails, we retry once with a correction prompt; if it fails again, the invoice goes to the manual review queue.
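The validate, retry once, then fall back flow from the validation stage can be sketched as follows. This is a simplified stand-in written in plain TypeScript rather than Zod so that it runs without dependencies. The `ExtractFn` type, the `Outcome` shape, the correction prompt wording, and the choice to compare in integer cents are all assumptions for illustration. The `extract` argument represents whatever function calls the model.

```typescript
interface LineItem {
  description: string;
  total: number;
}

interface Invoice {
  lineItems: LineItem[];
  total: number;
}

// Hypothetical signature: the second argument carries a correction prompt on retry.
type ExtractFn = (ocrText: string, correction?: string) => Promise<Invoice>;

type Outcome =
  | { status: "ok"; invoice: Invoice; attempts: number }
  | { status: "manual_review"; reason: string; attempts: number };

// Compare in integer cents so floating-point drift (0.1 + 0.2) does not
// cause false failures. A tolerance of 1 cent corresponds to ± 0.01.
function totalsMatch(invoice: Invoice, toleranceCents: number = 1): boolean {
  const toCents = (amount: number): number => Math.round(amount * 100);
  const lineSum = invoice.lineItems.reduce(
    (acc, item) => acc + toCents(item.total),
    0,
  );
  return Math.abs(lineSum - toCents(invoice.total)) <= toleranceCents;
}

// One normal attempt, one retry with a correction prompt, then manual review.
async function extractWithRetry(
  ocrText: string,
  extract: ExtractFn,
): Promise<Outcome> {
  const first = await extract(ocrText);
  if (totalsMatch(first)) {
    return { status: "ok", invoice: first, attempts: 1 };
  }

  const correction =
    "The line item totals did not sum to the invoice total. " +
    "Re-read the text and extract the fields again.";
  const second = await extract(ocrText, correction);
  if (totalsMatch(second)) {
    return { status: "ok", invoice: second, attempts: 2 };
  }

  return {
    status: "manual_review",
    reason: "line item totals do not match invoice total",
    attempts: 2,
  };
}
```

Passing the extraction function in as a parameter keeps the retry logic testable with a fake, without touching the model API.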

Evals: the part everyone skips

We built a small eval suite of 200 invoices with known-good extractions. Every prompt change runs against the suite before deployment. We track field-level accuracy, not just overall accuracy — it's easy to have high aggregate accuracy while one field (e.g., tax amount) is consistently wrong.
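To make the field-level point concrete, here is a minimal sketch of a per-field accuracy calculation. The `EvalCase` shape, the flat field map, and exact-match comparison are assumptions for this example. A real suite would likely need tolerance for amounts and normalisation for dates and vendor names.

```typescript
type FieldValues = Record<string, string | number>;

// One eval case: the known-good extraction and what the pipeline produced.
interface EvalCase {
  expected: FieldValues;
  actual: FieldValues;
}

// Returns accuracy per field name, using exact equality on each value.
function fieldAccuracy(cases: EvalCase[]): Map<string, number> {
  const tally = new Map<string, { correct: number; total: number }>();
  for (const { expected, actual } of cases) {
    for (const field of Object.keys(expected)) {
      const counts = tally.get(field) ?? { correct: 0, total: 0 };
      counts.total += 1;
      if (actual[field] === expected[field]) counts.correct += 1;
      tally.set(field, counts);
    }
  }
  const accuracy = new Map<string, number>();
  tally.forEach((counts, field) => {
    accuracy.set(field, counts.correct / counts.total);
  });
  return accuracy;
}

// Aggregate accuracy across all fields, for comparison with the per-field view.
function overallAccuracy(cases: EvalCase[]): number {
  let correct = 0;
  let total = 0;
  for (const { expected, actual } of cases) {
    for (const field of Object.keys(expected)) {
      total += 1;
      if (actual[field] === expected[field]) correct += 1;
    }
  }
  return total === 0 ? 0 : correct / total;
}

// Two invented cases: taxAmount is wrong in the second one.
const sampleCases: EvalCase[] = [
  {
    expected: { invoiceNumber: "INV-1", total: 120, taxAmount: 20 },
    actual: { invoiceNumber: "INV-1", total: 120, taxAmount: 20 },
  },
  {
    expected: { invoiceNumber: "INV-2", total: 60, taxAmount: 10 },
    actual: { invoiceNumber: "INV-2", total: 60, taxAmount: 0 },
  },
];

const perField = fieldAccuracy(sampleCases);
const aggregate = overallAccuracy(sampleCases);
```

On this invented data the aggregate is 5 correct fields out of 6, about 83%, while `taxAmount` on its own sits at 50%. That gap is exactly what an aggregate-only metric hides.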

The eval suite caught three regressions during development that we would have shipped without it. It adds maybe 30 minutes to each deployment cycle. Worth every second.

Fallback and cost management

GPT-4o is expensive at volume. We added a routing layer: simple invoices (single-page, known vendors with high confidence on first extract) are routed to gpt-4o-mini. Complex or multi-page invoices go to gpt-4o. This brought per-invoice cost down by ~55% with no measurable accuracy drop on the simple tier.
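A routing rule along these lines might look like the sketch below. The source does not describe how the "high confidence" signal is computed, so this example assumes it has already been reduced to an allowlist of vendor IDs whose extractions have proven reliable. The `InvoiceMeta` shape, the `chooseModel` name, and the vendor IDs are all hypothetical.

```typescript
type Model = "gpt-4o" | "gpt-4o-mini";

// Metadata assumed to be available before extraction, e.g. from the OCR stage.
interface InvoiceMeta {
  vendorId: string;
  pageCount: number;
}

// Hypothetical allowlist of vendors considered safe for the cheaper model.
const TRUSTED_VENDORS: ReadonlySet<string> = new Set<string>([
  "acme-supplies",
  "globex",
]);

// Single-page invoices from trusted vendors go to the cheaper model.
// Everything else, including unknown vendors, goes to the larger one.
function chooseModel(
  meta: InvoiceMeta,
  trustedVendors: ReadonlySet<string> = TRUSTED_VENDORS,
): Model {
  const isSinglePage = meta.pageCount === 1;
  const isTrustedVendor = trustedVendors.has(meta.vendorId);
  return isSinglePage && isTrustedVendor ? "gpt-4o-mini" : "gpt-4o";
}

const simpleChoice = chooseModel({ vendorId: "acme-supplies", pageCount: 1 });
const complexChoice = chooseModel({ vendorId: "acme-supplies", pageCount: 4 });
```

Defaulting to the larger model for anything that fails either check keeps the cheaper tier conservative: a new vendor has to earn its place on the allowlist before its invoices are downgraded.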

Results

Extraction accuracy on the eval suite: 98.7%, up from 91%. Manual corrections dropped from ~100/day to ~15/day. Maintenance burden fell from 20% of an engineer's time to near zero — prompt updates take minutes, not days. The client recovered the project cost in under three months through saved labour.