Document Extraction

Multi-provider OCR and document processing with a unified interface.

Overview

Pipelex covers the full range from simple text extraction to advanced document understanding. Basic PDF text extraction works out of the box (via pypdfium2), but real documents demand more: OCR for scanned pages, layout analysis for complex structures, image extraction, and VLM-powered understanding.

LLM APIs have partly standardized around OpenAI's completions API, but OCR providers have not: each exposes its own interface. Pipelex addresses this with a unified interface: swap providers by changing your PipeExtract config, with no code changes required.
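To make the idea of a config-driven provider swap concrete, here is a minimal sketch of the general pattern. It is not Pipelex's actual API: `PageResult`, `basic_extractor`, `ocr_extractor`, `PROVIDERS`, `extract`, and the `"provider"` config key are all hypothetical names chosen for illustration, and the "pages" are plain strings standing in for real document pages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PageResult:
    """Hypothetical per-page result: 1-based page number plus extracted text."""
    page_number: int
    text: str


# An extractor is any callable that maps raw pages to per-page results.
Extractor = Callable[[List[str]], List[PageResult]]


def basic_extractor(pages: List[str]) -> List[PageResult]:
    # Stand-in for a local backend that needs no AI inference.
    return [PageResult(i + 1, p.strip()) for i, p in enumerate(pages)]


def ocr_extractor(pages: List[str]) -> List[PageResult]:
    # Stand-in for a cloud OCR backend; the tag makes the swap visible.
    return [PageResult(i + 1, f"[ocr] {p.strip()}") for i, p in enumerate(pages)]


# Registry keyed by the provider name that appears in the config.
PROVIDERS: Dict[str, Extractor] = {
    "basic": basic_extractor,
    "ocr": ocr_extractor,
}


def extract(pages: List[str], config: Dict[str, str]) -> List[PageResult]:
    """Dispatch to whichever provider the config names."""
    provider = config["provider"]
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider: {provider!r}")
    return PROVIDERS[provider](pages)


if __name__ == "__main__":
    pages = [" Invoice 42 ", " Total: 10 EUR "]
    # The calling code is identical for both runs; only the config differs.
    for config in ({"provider": "basic"}, {"provider": "ocr"}):
        for result in extract(pages, config):
            print(config["provider"], result.page_number, result.text)
```

The calling code depends only on the shared result type and the `extract` entry point, so changing the backend is a one-line config change rather than a code change.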

Supported Providers

| Provider | Type | Description |
|---|---|---|
| pypdfium2 | Built-in | Basic PDF text and image extraction without AI inference; works out of the box with no API keys |
| Mistral OCR | Cloud API | Industry-leading document understanding for media, text, tables, and equations |
| docling | Local SDK | IBM's open-source extraction library with local CPU processing and optional GPU acceleration |
| Azure Document Intelligence | Gateway | Enterprise-grade OCR with high accuracy for complex layouts, tables, and handwriting |
| Deepseek-OCR | Gateway | Open-source model optimized for markdown extraction from images |

Key Capabilities

  • Page view generation — High-fidelity image rendering of extracted pages via pypdfium2
  • Embedded image extraction — Capture images found within documents
  • Layout analysis — Structured extraction of complex document layouts
  • Table recognition — Automatic table detection and extraction
  • Handwriting support — Via providers that support handwriting recognition (e.g., Azure Document Intelligence)
  • Multi-page processing — Batch processing of document pages with per-page results
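As a rough illustration of what per-page text extraction and page view generation look like at the library level, the sketch below calls pypdfium2 directly. It is not Pipelex's internal implementation: `PageContent` and `extract_pages` are hypothetical names, the `scale` default is an arbitrary choice, and the code assumes `pypdfium2` and Pillow are installed.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class PageContent:
    """Hypothetical container for one page's results."""
    page_number: int  # 1-based
    text: str         # text layer of the page (empty for pure scans)
    image: Any        # PIL.Image.Image rendering of the page


def extract_pages(pdf_path: str, scale: float = 2.0) -> List[PageContent]:
    """Return the text and a rendered view for every page of a PDF.

    scale=1.0 corresponds to 72 DPI, so 2.0 renders at 144 DPI.
    """
    # Third-party dependency, imported here so the module itself loads
    # even where pypdfium2 is not installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    results: List[PageContent] = []
    for index in range(len(pdf)):
        page = pdf[index]
        # Read the embedded text layer; no OCR is performed here.
        text = page.get_textpage().get_text_range()
        # Rasterize the page and convert the bitmap to a PIL image.
        image = page.render(scale=scale).to_pil()
        results.append(PageContent(index + 1, text, image))
    return results
```

Calling `extract_pages("report.pdf")` (the filename is a placeholder) yields one `PageContent` per page. A scanned page comes back with an image but little or no text, which is the case where an OCR provider is needed.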

Documents in LLM Prompts

Include PDFs directly in your prompts using @variable syntax. PipeLLM automatically handles document rendering — single documents, multiple documents, and mixed content combining text, images, and PDFs are all supported.