PipeExtract

The PipeExtract operator performs Optical Character Recognition (OCR) on images and PDF documents. It extracts text and embedded images, and can provide a full-page rendering of each page.

How it works

PipeExtract takes a single input, which must be either an Image or a Pdf. It processes the document page by page and produces a list of PageContent objects. Each PageContent object encapsulates all the information extracted from a single page.

The output is always a list, even if the input is a single image (in which case the list will contain just one item).

The PageContent Structure

The PageContent object has the following structure:

  • text_and_images: Contains the main extracted content.
    • text: The recognized text from the page as a TextContent.
    • images: A list of any images found embedded within the page.
  • page_view: An ImageContent object representing a full visual rendering of the page.

Configuration

PipeExtract is configured in your pipeline's .plx file.

OCR Models and Backend System

PipeExtract uses the unified inference backend system to manage OCR models. This means you can:

  • Use different OCR providers (Mistral OCR, local PDF extraction via internal backend, etc.)
  • Configure OCR models through the same backend system as LLMs and image generation models
  • Use OCR presets for consistent configurations across your pipelines
  • Route OCR requests to different backends based on your routing profile

Common OCR model handles:

  • mistral-ocr: Mistral's OCR model for high-quality text and image extraction
  • pypdfium2-extract-text: Local PDF text extraction (no API calls required)

OCR presets are defined in your model deck configuration and can include parameters like max_nb_images and image_min_size.
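
As an illustration, an OCR preset in a model deck might look like the sketch below. The parameter names max_nb_images and image_min_size, the mistral-ocr handle, and the extract_text_from_visuals name come from this page; the section path and the values are assumptions, so check your own deck schema for the exact layout.

[ocr_preset.extract_text_from_visuals]  # hypothetical section path; preset name from the example handles above
model = "mistral-ocr"                   # OCR model handle the preset routes to
max_nb_images = 10                      # illustrative cap on embedded images extracted per page
image_min_size = 50                     # illustrative minimum size for an extracted image to be kept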

PLX Parameters

  • type (string, required): The type of the pipe: PipeExtract.
  • description (string, required): A description of the OCR operation.
  • inputs (fixed, required): The pipe's single input; its concept must be either Image or Pdf.
  • output (string, required): The output concept produced by the OCR operation.
  • page_images (boolean, optional): If true, any images found within the document pages are extracted and included in the output. Defaults to false.
  • page_views (boolean, optional): If true, a high-fidelity image of each page is included in the page_view field. Defaults to false.
  • page_views_dpi (integer, optional): The resolution (in dots per inch) of the generated page views when processing a PDF. Defaults to 150.
  • page_image_captions (boolean, optional): If true, the OCR service may attempt to generate captions for the images it finds. Note: this feature depends on the OCR provider.
  • model (string, optional): The extraction model to use, selected by name, setting, or preset (e.g., "mistral-ocr", "extract_text_from_visuals"). Defaults to the model specified in the global config.

Example: Processing a PDF

This example defines a pipe that takes a PDF, extracts its text along with a full-page view of each page, and outputs the result as a list of pages.

[concept]
ScannedDocument = "A document that has been scanned as a PDF"

[concept.ExtractedPages]
description = "A list of pages extracted from a document by OCR"
refines = "Page"

[pipe.extract_text_from_document]
type = "PipeExtract"
description = "Extract text from a scanned document"
inputs = { document = "ScannedDocument" }
output = "Page"
page_views = true
page_views_dpi = 200

The output of a PipeExtract must be exactly the native Page concept.

To use this pipe, you first need to load a PDF into the ScannedDocument concept. After the pipe runs, the output is a list of PageContent objects, where each object holds the extracted text and a 200 DPI rendering of the corresponding page.
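
For a variant that also extracts the images embedded in each page and pins the OCR model explicitly, the pipe could be written as sketched below. It only uses parameters documented above; the pipe name is illustrative, the mistral-ocr handle must be available in your backend configuration, and whether captions are actually produced depends on the OCR provider.

[pipe.extract_document_with_images]
type = "PipeExtract"
description = "Extract text and embedded images from a scanned document"
inputs = { document = "ScannedDocument" }
output = "Page"
page_images = true
page_image_captions = true
model = "mistral-ocr"

As before, the result is a list of PageContent objects; here each object's images field holds the embedded images found on the corresponding page.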