AI-Driven OCR and Field Extraction for Justice Court Forms

Mid-market counties handling thousands of justice court forms annually are caught between paper-based workflows and enterprise systems built for a different scale. The intake process (reading handwritten forms, extracting fields, and verifying data against the state system) takes hours per batch and introduces transcription errors that can affect bail decisions, release timing, and SB6/SB9 compliance reporting. Our capstone built a complete evaluation framework and a production-ready pipeline to address this problem.
Working with a leader in the legal technology space, we evaluated 9 systems across 80 handwritten prisoner registration forms and 900-plus test runs, all within FedRAMP-authorized AWS infrastructure. Off-the-shelf vision language models, combined with mandatory pre-processing, reached 87-88.5% accuracy without a single line of custom model training.
Three Phases, One Evaluation Framework
Each phase of this project answered a question the previous one could not. Rather than committing to a single model or approach upfront, we built an evaluation framework first — then let the evidence decide.
Phase 1: Traditional OCR. We tested 3 layout detection models against 5 OCR engines (15 combinations). Best result: 86.1% character accuracy with doctr and Surya. The ceiling was the finding: models could transcribe characters but had no concept of which field a character belonged to. High text accuracy produced zero structured output.
Phase 2: Vision Language Models. VLMs read layout and context simultaneously. By seeing "First Name" above a handwritten line, a model infers the field and extracts the value in a single pass, which adds field classification as a capability that traditional OCR lacked entirely. We compared 7 VLMs from Anthropic, Meta, Amazon, and Mistral via AWS Bedrock.
Phase 3: Ground Truth, Judge, and Alternatives. Testing used 80 carefully collected handwritten forms and an independent Claude Sonnet 4.6 judge that scored each field from 0 to 1. The arithmetic mean across the ~114 fields gave the overall score, making the 900-plus test runs directly comparable. Landing.ai ADE and AWS Textract were introduced as full alternative benchmarks.
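The scoring rule described in Phase 3 can be sketched in a few lines. This is a minimal illustration, not the project's harness: the function name `overall_score`, the field names, and the example scores are all hypothetical, and it assumes the judge returns one score between 0 and 1 per field.

```python
from statistics import mean


def overall_score(field_scores: dict[str, float]) -> float:
    """Arithmetic mean of per-field judge scores, each expected in [0, 1]."""
    if not field_scores:
        raise ValueError("no field scores to average")
    for name, score in field_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} out of range: {score}")
    return mean(list(field_scores.values()))


# Hypothetical judge output for three of the ~114 fields on one form.
example_scores = {"first_name": 1.0, "last_name": 0.5, "date_of_birth": 0.0}
print(f"overall score: {overall_score(example_scores):.3f}")
```

Because every run reduces to the same unweighted mean over the same field set, scores from different models and degradation conditions can be compared directly.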
The Ground Truth Problem
Prisoner registration forms contain sensitive PII that cannot leave the county's secure infrastructure, so we could not use real forms for evaluation. After two failed attempts, one crowd-sourced and one using AI-generated handwriting, we collected 80 synthetic forms from Fiverr freelancers: realistic, varied handwriting with per-field Excel ground truth verified by the team. The cost was $1.00 per page, with roughly three days of turnaround.
• Amazon Mechanical Turk: Dropped. Workers skipped fields, quality was inconsistent, and legal form specificity was impossible to enforce.
• Diffusion Pen (AI handwriting): Shelved. Output looked plausible but lacked naturalistic variability. Field placements were too uniform and checkboxes were inconsistent.
• Fiverr Freelancers: Final approach. Three synthetic personas, per-field Excel ground truth, 80 forms total.
Key Findings
Pre-processing is the architecture, not a feature
Even 1-2 degrees of skew drops accuracy 15-50%. A 180-degree rotation brings Amazon Nova Pro to 0% and Claude Haiku to 1%. The pipeline does not work without quality detection and auto-correction running before any OCR call. This was the most operationally significant finding.
Claude leads Bedrock; Landing.ai leads overall
Claude Sonnet 4.6 scores 87% on clean originals — highest in Bedrock. Landing.ai ADE scores 88.5% and holds above 86% across all degradation conditions where Bedrock models collapse. For real scanning environments, that resilience matters more than the accuracy gap on clean forms.
FedRAMP compliance determines the practical path
Claude Sonnet 4.5 is confirmed FedRAMP High on AWS GovCloud as of April 2026. Landing.ai via Snowflake Native App is technically viable but requires legal review before adoption. Until that review is complete, Claude on Bedrock is the only confirmed production path.
The Recommended Pipeline
A 6-step agentic workflow on AWS GovCloud. Pre-processing is mandatory before any OCR call.
• 01 Upload. PDF split to page images at 200 DPI, stored in S3.
• 02 Detect. Claude Haiku 4.5 returns rotation, skew, contrast, and a 0-to-1 quality score per page.
• 03 Correct. PIL applies rotation, deskew, contrast boost, and sharpening.
• 04 Gate. Operator reviews before-and-after. Pages below 0.6 quality require explicit approval.
• 05 OCR + Classify. Claude Sonnet 4.5 on Bedrock extracts and maps all ~114 fields to the form schema in a single pass.
• 06 Verify. Human review with bias detection flags. Operators edit and complete the session.
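The Detect, Correct, and Gate steps above can be sketched as plain decision logic. This is a hedged illustration only: `PageQuality`, `plan_corrections`, `requires_explicit_approval`, the field names, and the 0.5-degree skew tolerance are assumptions made for the example, and always sharpening is a simplification. Only the 0.6 quality threshold comes from the pipeline description.

```python
from dataclasses import dataclass

QUALITY_THRESHOLD = 0.6  # from the Gate step: pages below this need explicit approval


@dataclass(frozen=True)
class PageQuality:
    """Hypothetical shape of the per-page result returned by the Detect step."""
    rotation_degrees: int   # coarse rotation such as 0, 90, 180, 270
    skew_degrees: float     # small-angle skew
    low_contrast: bool
    quality_score: float    # 0 to 1


def plan_corrections(page: PageQuality, skew_tolerance: float = 0.5) -> list[str]:
    """List the corrections to apply, in order (skew_tolerance is an assumed value)."""
    steps: list[str] = []
    if page.rotation_degrees % 360 != 0:
        steps.append("rotate")
    if abs(page.skew_degrees) > skew_tolerance:
        steps.append("deskew")
    if page.low_contrast:
        steps.append("contrast_boost")
    steps.append("sharpen")  # assumption: sharpening is applied to every page
    return steps


def requires_explicit_approval(page: PageQuality) -> bool:
    """True when the operator must explicitly approve the page at the gate."""
    return page.quality_score < QUALITY_THRESHOLD


upside_down = PageQuality(rotation_degrees=180, skew_degrees=1.5,
                          low_contrast=True, quality_score=0.4)
print(plan_corrections(upside_down), requires_explicit_approval(upside_down))
```

Keeping the decisions separate from the image operations makes the gate easy to test without any scanned pages.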
At 500,000 forms annually (1,000,000 pages), the recommended production path costs $0.050-$0.070 per page — roughly $50,000-$70,000 total, a fraction of equivalent labor cost.
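The cost figure follows from simple arithmetic, reproduced here as a short check. The function name `annual_cost_range` is hypothetical, and the two-pages-per-form ratio is inferred from the 500,000 forms and 1,000,000 pages quoted above.

```python
PAGES_PER_FORM = 2  # inferred: 500,000 forms correspond to 1,000,000 pages


def annual_cost_range(forms_per_year: int,
                      low_per_page: float,
                      high_per_page: float) -> tuple[float, float]:
    """Return the (low, high) annual cost in dollars for a per-page price range."""
    pages = forms_per_year * PAGES_PER_FORM
    return pages * low_per_page, pages * high_per_page


low, high = annual_cost_range(500_000, 0.050, 0.070)
print(f"${low:,.0f} to ${high:,.0f}")  # prints "$50,000 to $70,000"
```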
Human Oversight Is a Design Principle
Automation bias — trusting AI outputs without scrutiny — is a documented risk when the stakes involve someone's freedom. The verification step exists because no automated system should make unilateral decisions about detention. Audit trails, confidence scores, and bias detection exist so human judgment remains accountable at scale.
• Audit trails. Every extraction, flag, and approval is recorded — practical protection against due process challenges.
• Bias monitoring. Model performance tracked across race, gender, and language groups before it scales.
• Prompt versioning. Pin model IDs. Version prompts. Re-test on every release, and treat any model or prompt change as a regression risk.
• Vendor disclosure. Commercial relationships and model error rates disclosed to all parties in proceedings.
• Human judgment stays primary. AI extracts. Humans approve or correct. Non-negotiable.
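The prompt versioning practice in the list above can be sketched as a small check that flags when re-evaluation is due. Everything here is illustrative: `ExtractionConfig`, `needs_reevaluation`, and the placeholder model ID and prompt text are hypothetical and are not taken from the project's codebase.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionConfig:
    """A pinned model ID plus the exact prompt text used for extraction."""
    model_id: str
    prompt_text: str

    @property
    def prompt_version(self) -> str:
        # A short content hash: any edit to the prompt changes the version.
        return hashlib.sha256(self.prompt_text.encode("utf-8")).hexdigest()[:12]


def needs_reevaluation(current: ExtractionConfig,
                       last_evaluated: ExtractionConfig) -> bool:
    """True when the model or the prompt differs from the last evaluated setup."""
    return (current.model_id != last_evaluated.model_id
            or current.prompt_version != last_evaluated.prompt_version)


# Placeholder values for illustration only.
baseline = ExtractionConfig("placeholder-model-v1", "Extract all form fields as JSON.")
candidate = ExtractionConfig("placeholder-model-v1", "Extract every form field as JSON.")
print(needs_reevaluation(candidate, baseline))  # True: the prompt text changed
```

Hashing the prompt text rather than relying on a hand-maintained version number means an unrecorded prompt edit still triggers the regression run.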
See It in Action
Three short video segments from the final capstone presentation.
Our Recommendation
Claude Sonnet 4.5 on AWS GovCloud — why it is the right call for a FedRAMP-compliant production deployment today, and what to watch for as the Landing.ai compliance review progresses.
Watch: https://youtu.be/xO1bB3_IsAc
Alternatives Compared
Landing.ai ADE, AWS Textract, and the open-source OCR baseline — what each brings, where each falls short, and how the comparison was structured to be fair across all nine systems.
Watch: https://youtu.be/nJoNf82AuMQ
Final Results
The full results — degradation heatmap, head-to-head judge scores, cost-at-scale analysis, and the single most important finding: pre-processing is the architecture, not a feature.
Watch: https://youtu.be/40ULFxvlCF8
Try the Live Demo
The demo application is deployed and available at:
https://2pyzxqkamq.us-west-2.awsapprunner.com/
No login required. Upload a court form PDF and walk through the full 6-step extraction workflow, including the before-and-after quality gate view and human verification with bias detection flags.
Why This Matters
More accurate intake data means fewer people detained due to administrative failures. It means state reporting under SB6/SB9 that reflects what actually happened. It means a justice system where paperwork errors are less likely to determine outcomes for the people who can least afford them.
"The evaluation harness, ground truth dataset, and LLM judge protocol we built ensure that model updates, form changes, and new degradation conditions trigger re-evaluation — not a scramble. The process is the durable artifact."
Full codebase: allllc/JusticeFormsOCR