
AI Document Processing: OCR, Classification & Workflow Automation

How to build a production AI document processing pipeline that ingests PDFs, runs OCR, classifies documents by type, extracts structured fields, and pushes data to your ERP.

Softotic Engineering · 18 February 2025 · 3 min read

Document-heavy industries — logistics, finance, legal, healthcare — spend significant human effort on data entry from PDFs, scanned forms, and images. AI document processing pipelines can automate 90%+ of this work. Here's how we build them at Softotic.

The Problem: Unstructured Documents at Scale

Manual document processing suffers from:

  • High error rates (4–8% is typical for manual data entry)
  • Slow throughput — humans process ~50–100 documents/hour
  • Inability to scale during peak periods
  • No audit trail of extracted values

Pipeline Architecture

A complete document processing pipeline has six stages, with validation and the ERP push handled together as the final stage:

```
[Ingestion] → [Pre-processing] → [OCR] → [Classification] → [Extraction] → [Validation & ERP Push]
```
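One way to wire the stages together is as a list of plain functions that each take and return a shared document record. The sketch below is illustrative only: `DocumentRecord`, `run_pipeline`, and the placeholder stage functions are hypothetical names, not part of any library, and a real pipeline would run each stage as a background job.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DocumentRecord:
    """Hypothetical container passed from stage to stage."""
    raw: bytes
    doc_type: Optional[str] = None
    fields: dict = field(default_factory=dict)
    needs_review: bool = False
    history: list = field(default_factory=list)


Stage = Callable[[DocumentRecord], DocumentRecord]


def run_pipeline(record: DocumentRecord, stages: list) -> DocumentRecord:
    """Run stages in order; stop early once a stage flags the record for review."""
    for stage in stages:
        record = stage(record)
        record.history.append(stage.__name__)
        if record.needs_review:
            break
    return record


# Placeholder stages, for illustration only
def classify(record: DocumentRecord) -> DocumentRecord:
    record.doc_type = "invoice"
    return record


def flag_for_review(record: DocumentRecord) -> DocumentRecord:
    record.needs_review = True
    return record


def extract(record: DocumentRecord) -> DocumentRecord:
    record.fields["invoice_number"] = "A-1001"
    return record


result = run_pipeline(DocumentRecord(raw=b"%PDF-"), [classify, flag_for_review, extract])
```

Because `flag_for_review` sets `needs_review`, the `extract` stage is never run and the record can be handed to the review queue as it stands.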

Stage 1: Document Ingestion

Documents arrive via:

  • Email attachments (via IMAP listener or email webhook)
  • API upload (POST /documents with multipart form)
  • FTP/SFTP directory watch
  • WhatsApp/messaging (webhook)

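Normalise to PDF: convert images (TIFF, JPEG) to PDF with img2pdf, and use pikepdf to repair or split existing PDFs. Neither library reads DOCX, so office documents need a separate converter such as LibreOffice in headless mode.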
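Before converting, the ingestion step needs to know what it has actually received, since file extensions on email attachments are unreliable. A minimal sketch, assuming detection by leading magic bytes; `sniff_format`, `choose_converter`, and the converter labels in `CONVERTERS` are hypothetical names used for illustration.

```python
def sniff_format(data: bytes) -> str:
    """Guess the file format from its leading magic bytes."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith((b"II*\x00", b"MM\x00*")):  # little- and big-endian TIFF
        return "tiff"
    if data.startswith(b"PK\x03\x04"):
        # DOCX is a ZIP container; a real check would also inspect its contents.
        return "zip-container"
    return "unknown"


# Hypothetical routing table: which conversion route each format takes.
CONVERTERS = {
    "pdf": "passthrough",
    "jpeg": "image-to-pdf",
    "tiff": "image-to-pdf",
    "zip-container": "office-to-pdf",
}


def choose_converter(data: bytes) -> str:
    """Return the conversion route for an upload, or 'reject' if unrecognised."""
    return CONVERTERS.get(sniff_format(data), "reject")
```

Rejecting unknown formats at this point keeps malformed uploads out of the more expensive OCR stages.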

Stage 2: Pre-Processing

  • Deskew and denoise scanned images using OpenCV
  • Split multi-page documents into individual page images
  • Resize to optimal resolution for OCR (300 DPI for thermal prints, 200 DPI for standard scans)

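```python
import cv2
import numpy as np


def preprocess(image: np.ndarray) -> np.ndarray:
    """Return a binarised copy of a BGR page image, ready for OCR."""
    # Thresholding needs a single-channel image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Non-local means denoising removes scanner speckle
    denoised = cv2.fastNlMeansDenoising(gray)
    # Otsu's method picks the global threshold automatically
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```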

Stage 3: OCR

Use a hybrid approach for best accuracy:

  • AWS Textract for structured forms and tables (handles 2-column layouts, checkboxes)
  • Tesseract as fallback (for offline or cost-sensitive pipelines)
  • Azure Form Recognizer (now Azure AI Document Intelligence) for specific form templates with pre-built models

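```python
import boto3

textract = boto3.client("textract")


def run_ocr(document_bytes: bytes) -> list[dict]:
    """Run synchronous Textract analysis and return the detected blocks."""
    response = textract.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )
    # "Blocks" is a list of dicts: pages, lines, words, key-value sets, table cells
    return response["Blocks"]
```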
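The blocks Textract returns still need to be turned into text for the classifier. The sketch below assumes each block is a dict with `BlockType`, `Text`, and `Confidence` keys (Textract reports confidence on a 0 to 100 scale); the function name `summarise_lines` and the cut-off of 80 are assumptions for illustration, not recommended values.

```python
def summarise_lines(blocks: list, min_confidence: float = 80.0) -> tuple:
    """Join all LINE blocks into page text and list the low-confidence lines."""
    lines, uncertain = [], []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and other block types
        text = block.get("Text", "")
        lines.append(text)
        if block.get("Confidence", 0.0) < min_confidence:
            uncertain.append(text)
    return "\n".join(lines), uncertain


# Hand-written sample in the shape of a Textract response, not real output
sample_blocks = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "INVOICE #A-1001", "Confidence": 99.2},
    {"BlockType": "WORD", "Text": "INVOICE", "Confidence": 99.5},
    {"BlockType": "LINE", "Text": "Tota1 Due: 1,250.00", "Confidence": 61.7},
]

text, uncertain = summarise_lines(sample_blocks)
```

Keeping the uncertain lines separate lets later stages decide whether a field extracted from one of them should go to human review.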

Stage 4: Document Classification

Train a multi-class classifier on document layout and keyword features:

  • Inputs: OCR text, page count, presence of keywords (e.g., "INVOICE", "BILL OF LADING")
  • Model: Fine-tuned DistilBERT or simpler TF-IDF + LogisticRegression for high-volume low-cost classification
  • Classes: Invoice, Delivery Note, Customs Declaration, Contract, ID Document, etc.

Confidence threshold: if the classifier's top-class probability is below 0.85, route the document to the human review queue.
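The routing rule itself is small. A minimal sketch, assuming the classifier exposes a probability per class; the function name `route` and the destination labels `"extraction"` and `"human_review"` are hypothetical.

```python
REVIEW_THRESHOLD = 0.85  # the threshold given in the text above


def route(probabilities: dict, threshold: float = REVIEW_THRESHOLD) -> tuple:
    """Return (predicted_class, destination) for one document."""
    label = max(probabilities, key=probabilities.get)
    if probabilities[label] >= threshold:
        return label, "extraction"
    return label, "human_review"


confident = route({"invoice": 0.93, "contract": 0.07})
ambiguous = route({"invoice": 0.60, "delivery_note": 0.40})
```

Here `confident` continues to extraction as an invoice, while `ambiguous` is sent to review with "invoice" as the suggested class.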

Stage 5: Field Extraction

Per document class, extract structured fields using:

  • Template matching: regex for IDs, amounts, dates in known positions
  • ML-based extraction: fine-tuned LayoutLM or Donut models for layouts that the templates do not cover

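```python
import re
from typing import Optional


def _first_group(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of a case-insensitive match, or None."""
    match = re.search(pattern, text, re.I)
    return match.group(1) if match else None


def extract_invoice_fields(text: str) -> dict:
    return {
        "invoice_number": _first_group(r"Invoice\s*#?\s*([A-Z0-9-]+)", text),
        "amount_due": _first_group(r"Total\s*Due\s*:?\s*[\$£]?([\d,]+\.?\d*)", text),
        "due_date": _first_group(r"Due\s*Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", text),
    }
```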

Stage 6: Validation & ERP Push

Validate extracted fields against business rules:

  • Amount is a valid number
  • Due date is not in the past (for invoices)
  • Supplier ID exists in ERP

If validation passes, push to ERP via REST webhook. If not, route to review UI.
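The three rules can be expressed as one function that returns a list of problems, with an empty list meaning the record is safe to push. This is a sketch under stated assumptions: the `amount_due` and `due_date` keys follow the extraction example, while `supplier_id`, the `known_suppliers` set standing in for an ERP lookup, the DD/MM/YYYY date format, and applying the date rule to the due date are all assumptions.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation


def validate_invoice(fields: dict, known_suppliers: set, today: date) -> list:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []

    # Rule 1: amount is a valid, finite, non-negative number
    try:
        amount = Decimal(str(fields.get("amount_due", "")).replace(",", ""))
        if not amount.is_finite() or amount < 0:
            errors.append("amount_due is out of range")
    except InvalidOperation:
        errors.append("amount_due is not a valid number")

    # Rule 2: due date parses (DD/MM/YYYY assumed) and is not in the past
    try:
        due = datetime.strptime(str(fields.get("due_date", "")), "%d/%m/%Y").date()
        if due < today:
            errors.append("due_date is in the past")
    except ValueError:
        errors.append("due_date is not a valid DD/MM/YYYY date")

    # Rule 3: supplier exists (a set stands in for the ERP lookup)
    if fields.get("supplier_id") not in known_suppliers:
        errors.append("supplier_id not found in ERP")

    return errors


good = validate_invoice(
    {"amount_due": "1,250.00", "due_date": "15/03/2025", "supplier_id": "SUP-001"},
    known_suppliers={"SUP-001"},
    today=date(2025, 2, 18),
)
bad = validate_invoice(
    {"amount_due": "12x", "due_date": "01/01/2025", "supplier_id": "SUP-999"},
    known_suppliers={"SUP-001"},
    today=date(2025, 2, 18),
)
```

Passing `today` in as a parameter keeps the check deterministic in tests, and returning every error at once gives the review UI a complete picture rather than only the first failure.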

Infrastructure Stack

  • FastAPI — async REST API for document ingestion
  • Redis Queue (RQ) — background job processing
  • PostgreSQL — document metadata and extracted fields storage
  • S3 — original document storage
  • Docker — containerised deployment

Handling the Review Queue

Low-confidence extractions go to a human review UI where operators:

  • See the original document alongside extracted fields
  • Correct any errors
  • Approve and push to ERP
Their corrections feed back into model fine-tuning.
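For corrections to be usable as training data, each review needs to record what the model produced and what the operator changed it to. A minimal sketch of that comparison; `diff_corrections` and the record keys are hypothetical, and how the resulting records are stored or used for fine-tuning is left open.

```python
def diff_corrections(extracted: dict, corrected: dict) -> list:
    """List every field the operator changed, as candidate training examples."""
    changes = []
    for name in sorted(set(extracted) | set(corrected)):
        before, after = extracted.get(name), corrected.get(name)
        if before != after:
            changes.append(
                {"field": name, "model_value": before, "operator_value": after}
            )
    return changes


changes = diff_corrections(
    extracted={"invoice_number": "A-1001", "amount_due": "1,25000"},
    corrected={"invoice_number": "A-1001", "amount_due": "1,250.00"},
)
```

Only `amount_due` differs here, so `changes` holds a single record pairing the model's value with the operator's fix.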

Conclusion

AI document processing can deliver ROI within months for high-volume operations. The key investment is in the extraction and validation layers: getting those right is what separates a 60% accurate prototype from a 96%+ production system.

Ready to automate your document workflows? Talk to Softotic's AI team.

