PipeOcr

The PipeOcr operator performs Optical Character Recognition (OCR) on images and PDF documents. It extracts text and, optionally, embedded images and full-page renderings.

How it works

PipeOcr takes a single input, which must be either an Image or a Pdf. It processes the document page by page and produces a list of PageContent objects. Each PageContent object encapsulates all the information extracted from a single page.

The output is always a list, even if the input is a single image (in which case the list will contain just one item).

The PageContent Structure

The PageContent object has the following structure:

  • text_and_images: Contains the main extracted content.
    • text: The recognized text from the page, as a TextContent.
    • images: A list of the images embedded in the page. Populated only when should_include_images is true.
  • page_view: An ImageContent holding a full visual rendering of the page. Populated only when should_include_page_views is true.
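
The structure above can be pictured with a small Python sketch. The classes below are stand-ins written for illustration only, not the library's own definitions: the wrapper name TextAndImagesContent, the text attribute on TextContent, and the url attribute on ImageContent are all assumptions. The helper join_page_texts is likewise hypothetical; it shows one way a consumer might walk the list of pages.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextContent:
    text: str  # assumed attribute name for the recognized text


@dataclass
class ImageContent:
    url: str  # assumed attribute; the real class may store images differently


@dataclass
class TextAndImagesContent:  # hypothetical name for the text_and_images wrapper
    text: Optional[TextContent] = None
    images: List[ImageContent] = field(default_factory=list)


@dataclass
class PageContent:
    text_and_images: TextAndImagesContent
    page_view: Optional[ImageContent] = None


def join_page_texts(pages: List[PageContent]) -> str:
    """Concatenate the text of every page, separated by blank lines."""
    parts = [
        page.text_and_images.text.text
        for page in pages
        if page.text_and_images.text is not None
    ]
    return "\n\n".join(parts)


# A two-page result; only the second page carries a page view here.
pages = [
    PageContent(TextAndImagesContent(text=TextContent("Invoice 001"))),
    PageContent(
        TextAndImagesContent(text=TextContent("Total: 42 EUR")),
        page_view=ImageContent("page_2.png"),
    ),
]
full_text = join_page_texts(pages)
```

Because the output is always a list, the same loop works for a single image, where the list simply holds one PageContent.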

Configuration

PipeOcr is configured in your pipeline's .toml file.

TOML Parameters

| Parameter | Type | Description | Required |
| --- | --- | --- | --- |
| PipeOcr | string | A descriptive name for the OCR operation. | Yes |
| inputs | dictionary | The input concept(s) for the OCR operation, as a dictionary mapping input names to concept codes. | Yes |
| output | string | The output concept produced by the OCR operation. | Yes |
| should_include_images | boolean | If true, any images found within the document pages will be extracted and included in the output. Defaults to false. | No |
| should_include_page_views | boolean | If true, a high-fidelity image of each page will be included in the page_view field. Defaults to false. | No |
| page_views_dpi | integer | The resolution (in Dots Per Inch) for the generated page views when processing a PDF. Defaults to 150. | No |
| should_caption_images | boolean | If true, the OCR service may attempt to generate captions for the images found. Note: This feature depends on the OCR provider. | No |
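
To get a feel for what page_views_dpi implies, the pixel size of a rendered page is its physical size in inches multiplied by the DPI. The sketch below is plain arithmetic, not PipeOcr code: the function page_view_pixels is hypothetical, the US Letter page size (8.5 x 11 inches) is an assumed example, and the actual renderer may round differently.

```python
from typing import Tuple


def page_view_pixels(width_in: float, height_in: float, dpi: int = 150) -> Tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed example: a US Letter page (8.5 x 11 inches).
letter_default = page_view_pixels(8.5, 11.0)       # default of 150 DPI
letter_200 = page_view_pixels(8.5, 11.0, dpi=200)  # higher-resolution render
```

Under these assumptions the default of 150 DPI gives roughly 1275 x 1650 pixels, while 200 DPI gives roughly 1700 x 2200, so raising the DPI increases image size quadratically.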

Example: Processing a PDF

This example defines a pipe that takes a PDF, extracts text and full-page images, and outputs them as a list of pages.

[concept]
ScannedDocument = "A document that has been scanned as a PDF"
ExtractedPages = "A list of pages extracted from a document by OCR"

[pipe.extract_text_from_document]
PipeOcr = "Extract text from a scanned document"
inputs = { document = "ScannedDocument" }
output = "ExtractedPages"
should_include_page_views = true
page_views_dpi = 200

To use this pipe, first load a PDF into the ScannedDocument concept. After the pipe runs, the ExtractedPages concept contains a list of PageContent objects, each holding the extracted text and a 200 DPI image of the corresponding page.