PipeExtract
The PipeExtract operator extracts structured content from documents. For PDFs and images, it performs OCR to extract text, embedded images, and full-page renderings. For web pages, it fetches and extracts page content.
How it works
PipeExtract takes a single input, which must be a Document or an Image (or a concept that refines one of them). The document URL can be a file path, a storage URL, or a web page URL. It processes the input and produces a list of PageContent objects. Each PageContent object encapsulates all the information extracted from one page.
The output is always a list. Even if the input is a single image, the output is a list containing a single PageContent item.
The PageContent Structure
The PageContent object has the following structure:
- text_and_images: Contains the main extracted content.
  - text: The recognized text from the page as a TextContent object.
  - images: A list of any images found embedded within the page.
- page_view: An ImageContent object representing a full visual rendering of the page.
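The structure above can be sketched with plain Python dataclasses. This is an illustrative model only, not the framework's actual class definitions: the field names mirror the description above, but the concrete attributes of TextContent and ImageContent (such as `url` and `caption` here) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TextContent:
    text: str  # recognized text from the page

@dataclass
class ImageContent:
    url: str                       # where the rendered image is stored (assumed attribute)
    caption: Optional[str] = None  # only filled if the OCR provider generates captions

@dataclass
class TextAndImagesContent:
    text: TextContent
    images: List[ImageContent] = field(default_factory=list)  # embedded images

@dataclass
class PageContent:
    text_and_images: TextAndImagesContent
    page_view: Optional[ImageContent] = None  # only set when page_views = true

# Even a single-image input yields a list with one PageContent item:
pages = [
    PageContent(
        text_and_images=TextAndImagesContent(text=TextContent(text="Hello")),
        page_view=ImageContent(url="file:///tmp/page_1.png"),
    )
]
```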
Configuration
PipeExtract is configured in your pipeline's .mthds file.
Extraction Models and Backend System
PipeExtract uses the unified inference backend system to manage extraction models. This means you can:
- Use different extraction providers (Mistral OCR, Linkup Fetch, local PDF extraction via internal backend, etc.)
- Configure extraction models through the same backend system as LLMs and image generation models
- Use extraction presets for consistent configurations across your pipelines
- Route extraction requests to different backends based on your routing profile
Common extraction model handles:
- mistral-ocr: Mistral's OCR model for high-quality text and image extraction
- pypdfium2-extract-pdf: Local PDF text extraction (no API calls required)
- linkup-fetch: Web page content extraction from URLs
Common model aliases:
- @default-extract-web-page: Default for web page extraction (Linkup Fetch)
- @default-extract-document: Default for document extraction (Azure Document Intelligence)
- @default-text-from-pdf: Fast local PDF text extraction (pypdfium2)
Extraction presets are defined in your model deck configuration and can include parameters like max_nb_images and image_min_size.
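As a rough illustration, a preset carrying those parameters might look like the following fragment. This is a hedged sketch: the section name and preset key are assumptions, and only max_nb_images and image_min_size are parameters named in this document.

```toml
# Hypothetical model deck fragment -- layout and preset name are assumptions
[extract_presets.scan-heavy]
model = "mistral-ocr"
max_nb_images = 10   # cap on embedded images extracted per document
image_min_size = 64  # skip embedded images smaller than this (pixels)
```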
MTHDS Parameters
| Parameter | Type | Description | Required |
|---|---|---|---|
| `type` | string | The type of the pipe: `PipeExtract`. | Yes |
| `description` | string | A description of the extraction operation. | Yes |
| `inputs` | fixed | The value must be of concept `Document` or `Image` (or a concept that refines one of them). For web page extraction, the `Document` holds a web URL. | Yes |
| `output` | string | The output concept produced by the extraction operation. Use `Page[]`. | Yes |
| `max_page_images` | integer or null | Maximum number of images to extract from pages: `null` (or omit) for unlimited, `0` for no images, or a positive integer to limit. Defaults to the value in your Extract model preset. | No |
| `page_views` | boolean | If `true`, a high-fidelity image of each page is included in the `page_view` field. Defaults to `false`. | No |
| `page_views_dpi` | integer | The resolution (in dots per inch) for the generated page views when processing a PDF. Defaults to `150`. | No |
| `page_image_captions` | boolean | If `true`, the OCR service may attempt to generate captions for the extracted images. Note: this feature depends on the OCR provider. | No |
| `model` | string | The Extract model choice by name, setting, or preset (e.g., `"mistral-document-ai-2505"`, `"@default-extract-document"`). Defaults to the model specified in the global config. | No |
Example: Processing a PDF
This example defines a pipe that takes a PDF, extracts text and full-page images, and outputs them as a list of pages.
```toml
[concept]
ScannedDocument = "A document that has been scanned as a PDF"

[concept.ExtractedPages]
description = "A list of pages extracted from a document by OCR"
refines = "Page"

[pipe.extract_text_from_document]
type = "PipeExtract"
description = "Extract text from a scanned document"
inputs = { document = "ScannedDocument" }
output = "Page[]"
page_views = true
page_views_dpi = 200
```
The output of PipeExtract must be Page[].
To use this pipe, first load a PDF into the ScannedDocument concept. After the pipe runs, the output contains a list of PageContent objects, where each object has the extracted text and a 200 DPI image of the corresponding page.
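To make the shape of that result concrete, a consumer of the pipe's output might walk the list of pages as in the sketch below. The attribute names mirror the PageContent structure described earlier; the stand-in objects and the `url` attribute are assumptions, and the framework's actual loading/running API is omitted here.

```python
from types import SimpleNamespace

# Stand-ins for two PageContent results (real objects come from running the pipe)
pages = [
    SimpleNamespace(
        text_and_images=SimpleNamespace(text=SimpleNamespace(text="Page one text"), images=[]),
        page_view=SimpleNamespace(url="file:///tmp/page_1.png"),
    ),
    SimpleNamespace(
        text_and_images=SimpleNamespace(text=SimpleNamespace(text="Page two text"), images=[]),
        page_view=SimpleNamespace(url="file:///tmp/page_2.png"),
    ),
]

# Concatenate the recognized text and collect the page renderings
full_text = "\n\n".join(page.text_and_images.text.text for page in pages)
renderings = [page.page_view.url for page in pages if page.page_view is not None]
```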
Example: Extracting a Web Page
This example defines a pipe that extracts content from a web page URL.
```toml
[pipe.extract_web_article]
type = "PipeExtract"
description = "Extract content from a web page"
inputs = { article_url = "Document" }
output = "Page[]"
model = "@default-extract-web-page"
```
Pass a web URL as the document_uri when running the pipe. PipeExtract fetches the page and extracts its content into Page objects, following the same pattern as document extraction.
Related Documentation
- Generic Document Extraction Example - Extract markdown from complex PDFs using vision
- Invoice Extraction Example - Complete invoice processing pipeline
- Document Extraction Feature - Overview of document extraction capabilities