AI Document Processing: OCR, Classification & Workflow Automation

Document-heavy industries — logistics, finance, legal, healthcare — spend significant human effort on data entry from PDFs, scanned forms, and images. AI document processing pipelines can automate 90%+ of this work. Here's how we build them at Softotic.

The Problem: Unstructured Documents at Scale

Manual document processing suffers from:

High error rates (4–8% is typical for manual data entry)

Slow throughput — humans process ~50–100 documents/hour

Inability to scale during peak periods

No audit trail of extracted values

Pipeline Architecture

A complete document processing pipeline has six stages, the last of which covers both validation and the ERP push:


[Ingestion] → [Pre-processing] → [OCR] → [Classification] → [Extraction] → [Validation] → [ERP Push]
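
One rough way to picture the flow above is as a chain of functions that each take and return a shared record. The sketch below is illustrative only: `run_pipeline`, the `Doc` dictionary shape, and the stub stages are hypothetical names, not a description of any production code.

```python
from typing import Callable

# A document record passed from stage to stage (hypothetical shape).
Doc = dict
Stage = Callable[[Doc], Doc]


def run_pipeline(doc: Doc, stages: list[Stage]) -> Doc:
    """Run each stage in order and record which stages completed."""
    for stage in stages:
        doc = stage(doc)
        doc.setdefault("history", []).append(stage.__name__)
    return doc


# Stub stages standing in for the real ones described in this article.
def ingest(doc: Doc) -> Doc:
    return {**doc, "format": "pdf"}


def ocr(doc: Doc) -> Doc:
    return {**doc, "text": "INVOICE # INV-001"}


result = run_pipeline({"id": 1}, [ingest, ocr])
print(result["history"])  # ['ingest', 'ocr']
```

Keeping a per-document history like this is one simple way to see where a document stopped if a stage fails.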




Stage 1: Document Ingestion


Documents arrive via:

Email attachments (via IMAP listener or email webhook)
API upload (POST /documents with multipart form)
FTP/SFTP directory watch
WhatsApp/messaging (webhook)

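Normalise to PDF: convert TIFF and JPEG images with img2pdf, and DOCX files with a separate converter such as LibreOffice in headless mode (neither img2pdf nor pikepdf reads DOCX). pikepdf is useful afterwards for repairing or merging the resulting PDFs.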
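
Before converting anything, it helps to know what actually arrived, since file extensions on email attachments are not always reliable. A minimal sketch, assuming a hypothetical helper `sniff_format` that inspects the leading bytes of the upload:

```python
def sniff_format(data: bytes) -> str:
    """Guess a file format from its leading magic bytes."""
    signatures = [
        (b"%PDF-", "pdf"),
        (b"\xff\xd8\xff", "jpeg"),
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"II*\x00", "tiff"),    # little-endian TIFF
        (b"MM\x00*", "tiff"),    # big-endian TIFF
        (b"PK\x03\x04", "zip"),  # DOCX files are ZIP containers
    ]
    for magic, name in signatures:
        if data.startswith(magic):
            return name
    return "unknown"


print(sniff_format(b"%PDF-1.7\n"))         # pdf
print(sniff_format(b"\xff\xd8\xff\xe0"))   # jpeg
print(sniff_format(b"hello"))              # unknown
```

A `zip` result only says the file might be a DOCX; confirming that would need a look inside the archive, which this sketch does not do.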




Stage 2: Pre-Processing


Deskew and denoise scanned images using OpenCV
Split multi-page documents into individual page images
Resize to optimal resolution for OCR (300 DPI for thermal prints, 200 DPI for standard scans)

python
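import cv2
import numpy as np


def preprocess(image: np.ndarray) -> np.ndarray:
    # Convert the BGR page image to a single greyscale channel.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Remove scanner noise with non-local means denoising.
    denoised = cv2.fastNlMeansDenoising(gray)
    # Binarise; Otsu's method picks the threshold, so the 0 passed here is ignored.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary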
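
The `preprocess` function covers greyscale conversion, denoising, and binarisation, but not the resize step from the list above. One way to work out the target pixel size, assuming the scan's source DPI is known from file metadata (the helper name `target_size` is hypothetical):

```python
def target_size(width: int, height: int, source_dpi: int, target_dpi: int) -> tuple[int, int]:
    """Return the pixel dimensions that resample an image to target_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    scale = target_dpi / source_dpi
    return round(width * scale), round(height * scale)


# An A4 page scanned at 600 DPI, downsampled to 300 DPI.
print(target_size(4960, 7016, 600, 300))  # (2480, 3508)
```

The returned `(width, height)` pair matches the order `cv2.resize` expects for its `dsize` argument.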




Stage 3: OCR


Use a hybrid approach for best accuracy:

AWS Textract for structured forms and tables (handles 2-column layouts, checkboxes)
Tesseract as fallback (for offline or cost-sensitive pipelines)
Azure Form Recognizer (now Azure AI Document Intelligence) for specific form templates with pre-built models

python
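import boto3

# Region and credentials come from the standard AWS configuration.
textract = boto3.client("textract")


def run_ocr(document_bytes: bytes) -> list[dict]:
    # Synchronous call, suited to the single-page images produced in Stage 2.
    response = textract.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )
    # Blocks is a flat list of items such as PAGE, LINE, WORD, and TABLE.
    return response["Blocks"]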
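
Textract returns a flat list of blocks rather than plain text. Blocks whose `BlockType` is `LINE` carry a `Text` string and a `Confidence` score from 0 to 100. A minimal sketch of flattening them for the later stages follows; the helper name `blocks_to_text` and the sample blocks are made up for illustration.

```python
def blocks_to_text(blocks: list[dict]) -> tuple[str, float]:
    """Join LINE blocks into text and return it with the mean confidence."""
    lines = [b for b in blocks if b.get("BlockType") == "LINE"]
    if not lines:
        return "", 0.0
    text = "\n".join(b["Text"] for b in lines)
    mean_conf = sum(b["Confidence"] for b in lines) / len(lines)
    return text, mean_conf


# Hand-written sample in the shape of a Textract response.
sample = [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "INVOICE # INV-001", "Confidence": 99.0},
    {"BlockType": "LINE", "Text": "Total Due: $1,234.56", "Confidence": 97.0},
    {"BlockType": "WORD", "Text": "INVOICE", "Confidence": 99.5},
]
text, conf = blocks_to_text(sample)
print(conf)  # 98.0
```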




Stage 4: Document Classification


Train a multi-class classifier on document layout and keyword features:



Inputs: OCR text, page count, presence of keywords (e.g., "INVOICE", "BILL OF LADING")
Model: a fine-tuned DistilBERT, or a simpler TF-IDF + LogisticRegression pipeline for high-volume, low-cost classification
Classes: Invoice, Delivery Note, Customs Declaration, Contract, ID Document, etc.


Confidence threshold: if < 0.85, route to human review queue.
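
The threshold rule can be sketched independently of whichever model produces the probabilities. Here `route` and the destination names `extraction` and `human_review` are hypothetical; only the 0.85 cut-off comes from the text above.

```python
REVIEW_THRESHOLD = 0.85  # cut-off stated in the article


def route(probabilities: dict[str, float], threshold: float = REVIEW_THRESHOLD) -> tuple[str, str]:
    """Return (predicted_class, destination) for one document."""
    label = max(probabilities, key=probabilities.get)
    destination = "extraction" if probabilities[label] >= threshold else "human_review"
    return label, destination


print(route({"invoice": 0.93, "delivery_note": 0.05, "contract": 0.02}))
# ('invoice', 'extraction')
print(route({"invoice": 0.55, "delivery_note": 0.40, "contract": 0.05}))
# ('invoice', 'human_review')
```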



Stage 5: Field Extraction


Per document class, extract structured fields using:

Template matching: regex for IDs, amounts, dates in known positions
ML-based extraction: LayoutLM or Donut models, typically fine-tuned on labelled samples, to handle new templates without hand-written rules

python
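import re

# One pattern per field; each captures the value in group 1.
PATTERNS = {
    "invoice_number": r"Invoice\s*#?\s*([A-Z0-9-]+)",
    "amount_due": r"Total\s*Due\s*:?\s*[\$£]?([\d,]+\.?\d*)",
    "due_date": r"Due\s*Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
}


def extract_invoice_fields(text: str) -> dict:
    # Return the captured string for each field, or None when nothing matches.
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, re.I)
        fields[name] = match.group(1) if match else None
    return fields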
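
Regex captures are strings, so they usually need converting to typed values before any business rule can be applied. A minimal sketch, assuming hypothetical helpers `parse_amount` and `parse_date` and assuming day-first dates; a US-style month-first source would need a different format list.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(raw: Optional[str]) -> Optional[Decimal]:
    """Convert '1,234.56' to Decimal('1234.56'); return None if unparseable."""
    if not raw:
        return None
    try:
        return Decimal(raw.replace(",", ""))
    except InvalidOperation:
        return None


def parse_date(raw: Optional[str],
               formats=("%d/%m/%Y", "%d-%m-%Y", "%d/%m/%y")) -> Optional[date]:
    """Try each format in turn; day-first order is an assumption."""
    if not raw:
        return None
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None


print(parse_amount("1,234.56"))   # 1234.56
print(parse_date("15/03/2025"))   # 2025-03-15
```

Using `Decimal` rather than `float` avoids rounding surprises when amounts are later compared or summed.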

Stage 6: Validation & ERP Push

Validate extracted fields against business rules:

Amount is a valid number

Due date is not in the past (for invoices)

Supplier ID exists in ERP

If validation passes, push to ERP via REST webhook. If not, route to review UI.
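
The three rules above can be sketched as a function that returns a list of violations. `validate_invoice` is a hypothetical name, and the `known_suppliers` set stands in for a real ERP lookup, which this sketch does not attempt.

```python
from datetime import date
from decimal import Decimal


def validate_invoice(amount, due_date, supplier_id, known_suppliers, today=None):
    """Return a list of rule violations; an empty list means the invoice passes."""
    today = today or date.today()
    errors = []
    if not isinstance(amount, Decimal) or not amount.is_finite():
        errors.append("amount is not a valid number")
    if not isinstance(due_date, date) or due_date < today:
        errors.append("due date is missing or in the past")
    if supplier_id not in known_suppliers:
        errors.append("unknown supplier")
    return errors


suppliers = {"SUP-001", "SUP-002"}  # placeholder for an ERP query
ok = validate_invoice(Decimal("1234.56"), date(2025, 6, 30), "SUP-001",
                      suppliers, today=date(2025, 6, 1))
bad = validate_invoice(None, date(2025, 5, 1), "SUP-999",
                       suppliers, today=date(2025, 6, 1))
print(ok)        # []
print(len(bad))  # 3
```

An empty list would mean the document can be pushed to the ERP; anything else would send it to the review UI along with the reasons.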

Infrastructure Stack

FastAPI — async REST API for document ingestion

Redis Queue (RQ) — background job processing

PostgreSQL — document metadata and extracted fields storage

S3 — original document storage

Docker — containerised deployment

Handling the Review Queue

Low-confidence extractions go to a human review UI where operators:

See the original document alongside extracted fields

Correct any errors

Approve and push to ERP

Each correction is logged and fed back into model fine-tuning.
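
One way to make those corrections usable for fine-tuning, and to keep an audit trail of changed values, is to store one record per field the operator altered. The function name `diff_corrections` and the record fields below are assumptions for illustration, not a fixed schema.

```python
from datetime import datetime, timezone


def diff_corrections(doc_id: str, extracted: dict, corrected: dict) -> list[dict]:
    """Return one record per field whose value the operator changed."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        {
            "doc_id": doc_id,
            "field": name,
            "model_value": extracted.get(name),
            "human_value": value,
            "corrected_at": now,
        }
        for name, value in corrected.items()
        if extracted.get(name) != value
    ]


records = diff_corrections(
    "doc-42",
    {"invoice_number": "INV-001", "amount_due": "1,234.5"},
    {"invoice_number": "INV-001", "amount_due": "1,234.56"},
)
print([r["field"] for r in records])  # ['amount_due']
```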

Conclusion

AI document processing delivers ROI within months for high-volume operations. The key investment is in the extraction and validation layers — getting those right is what separates a 60% accurate prototype from a 96%+ production system.

Ready to automate your document workflows? Talk to Softotic's AI team.
