Document Extraction
Multi-provider OCR and document processing with a unified interface.
Overview
Pipelex covers the full range from simple text extraction to advanced document understanding. Basic PDF text extraction works out of the box (via pypdfium2), but real documents often need more: OCR for scanned pages, layout analysis for complex structures, image extraction, and VLM-powered understanding.
Unlike LLM APIs, which are partly standardized around OpenAI's completions API, the OCR landscape is fragmented. Pipelex addresses this with a unified interface: you swap providers by changing your PipeExtract configuration, with no code changes required.
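To make the idea of a unified interface concrete, here is a minimal sketch of config-driven provider dispatch in plain Python. It is not Pipelex's actual API: `PageResult`, `PROVIDERS`, `extract`, and the `"basic"` and `"fancy"` provider names are all hypothetical, and the two backends are trivial stand-ins for real extraction engines.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PageResult:
    """Per-page output. Field names are illustrative, not Pipelex's schema."""
    page_number: int
    text: str


# An extractor takes raw page payloads and returns one result per page.
Extractor = Callable[[List[str]], List[PageResult]]


def plain_text_extractor(pages: List[str]) -> List[PageResult]:
    # Stand-in for a basic backend that needs no inference.
    return [PageResult(i + 1, page.strip()) for i, page in enumerate(pages)]


def uppercase_extractor(pages: List[str]) -> List[PageResult]:
    # Stand-in for a second backend with different behavior.
    return [PageResult(i + 1, page.strip().upper()) for i, page in enumerate(pages)]


# Hypothetical registry: provider name found in config -> implementation.
PROVIDERS: Dict[str, Extractor] = {
    "basic": plain_text_extractor,
    "fancy": uppercase_extractor,
}


def extract(pages: List[str], config: Dict[str, str]) -> List[PageResult]:
    """Dispatch on config so calling code never names a backend directly."""
    provider = config["provider"]
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return PROVIDERS[provider](pages)
```

In this sketch, switching from `{"provider": "basic"}` to `{"provider": "fancy"}` changes the backend while every call site keeps calling `extract` unchanged, which is the property the unified interface is meant to provide.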
Supported Providers
| Provider | Type | Description |
|---|---|---|
| pypdfium2 | Built-in | Basic PDF text and image extraction without AI inference — works out of the box with no API keys |
| Mistral OCR | Cloud API | Industry-leading document understanding for media, text, tables, and equations |
| docling | Local SDK | IBM's open-source extraction library with local CPU processing and optional GPU acceleration |
| Azure Document Intelligence | Gateway | Enterprise-grade OCR with high accuracy for complex layouts, tables, and handwriting |
| DeepSeek-OCR | Gateway | Open-source model optimized for markdown extraction from images |
Key Capabilities
- Page view generation — High-fidelity image rendering of extracted pages via pypdfium2
- Embedded image extraction — Capture images found within documents
- Layout analysis — Structured extraction of complex document layouts
- Table recognition — Automatic table detection and extraction
- Handwriting support — Via providers that support handwriting recognition (e.g., Azure Document Intelligence)
- Multi-page processing — Batch processing of document pages with per-page results
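As an illustration of what per-page results might look like downstream, the sketch below merges a list of page records into a single markdown string. The `ExtractedPage` class and its fields are assumptions made for this example and may not match the structure Pipelex returns.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractedPage:
    # All field names here are hypothetical.
    number: int
    text: str
    images: List[str] = field(default_factory=list)  # names of embedded images
    page_view: Optional[str] = None                   # name of the rendered page image


def merge_pages(pages: List[ExtractedPage]) -> str:
    """Join per-page text in page order, listing each page's images after its text."""
    parts = []
    for page in sorted(pages, key=lambda p: p.number):
        section = [f"## Page {page.number}", page.text]
        section += [f"![{name}]({name})" for name in page.images]
        parts.append("\n\n".join(section))
    return "\n\n".join(parts)
```

Keeping results per page, rather than as one blob, lets a caller reorder, filter, or reprocess individual pages before merging.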
Documents in LLM Prompts
Include PDFs directly in your prompts using @variable syntax. PipeLLM handles document rendering automatically and supports single documents, multiple documents, and mixed content that combines text, images, and PDFs.
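The sketch below shows one way to check a prompt's @variable references against the inputs you plan to supply. It is a toy helper, not how PipeLLM resolves variables: the regular expression assumes simple identifier-style names, and `referenced_variables` and `missing_inputs` are hypothetical functions written for this example.

```python
import re
from typing import Dict, List

# Assumption: a variable is "@" followed by an identifier-like name.
VAR_PATTERN = re.compile(r"@([A-Za-z_][A-Za-z0-9_]*)")


def referenced_variables(prompt: str) -> List[str]:
    """Return variable names in order of first appearance, without duplicates."""
    seen: List[str] = []
    for name in VAR_PATTERN.findall(prompt):
        if name not in seen:
            seen.append(name)
    return seen


def missing_inputs(prompt: str, inputs: Dict[str, object]) -> List[str]:
    """List variables the prompt mentions that have no matching input."""
    return [name for name in referenced_variables(prompt) if name not in inputs]


prompt = "Summarize @contract and compare it with @invoice. Cite @contract."
print(referenced_variables(prompt))
print(missing_inputs(prompt, {"contract": b"%PDF-..."}))
```

Note that this simple pattern would also match the part of an email address after the @ sign, so a real implementation would need stricter rules.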
Related Documentation
- PipeExtract - Operator reference and MTHDS fields
- Generic Document Extraction Example - Extract markdown from complex PDFs using vision