Example: Invoice Extraction

This example processes PDF invoices through a multi-step pipeline: extract each page's contents and view, classify the invoice type, then extract structured data using vision.

What it demonstrates

Rich structured concept (Invoice with 14+ fields including nested InvoiceDetails)
Nested PipeSequence (the main pipeline batches over pages; the sub-pipeline analyzes each page, then extracts from it)
Vision-based extraction using page views and OCR text
Using shared method packages for page extraction

The Method: `bundle.mthds`

Invoice concept

[concept.Invoice]
description = "Invoice information extracted from text, supporting both formal bills and receipts"

[concept.Invoice.structure]
invoice_id        = { type = "text", description = "Unique identifier for the invoice" }
invoice_number    = { type = "text", description = "Invoice number as shown on the document" }
date              = { type = "date", description = "Date when the invoice was issued" }
amount_incl_tax   = { type = "number", description = "Total amount including taxes" }
amount_excl_tax   = { type = "number", description = "Net amount excluding taxes" }
vendor            = { type = "text", description = "Name of the vendor/seller" }
category          = { type = "concept", concept_ref = "invoice_extraction.InvoiceDetails", description = "Category or type of expense" }
# ... and more fields
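
For readers who think in Python types, the concept above can be pictured roughly as a pair of dataclasses. This is only a sketch of the shape, not code generated or required by Pipelex. The `invoice_type` field on `InvoiceDetails` is hypothetical (the excerpt does not show that concept's structure), and the sample values are invented for illustration.

```python
import datetime
from dataclasses import dataclass


@dataclass
class InvoiceDetails:
    # Hypothetical field: the excerpt does not list InvoiceDetails' structure.
    invoice_type: str


@dataclass
class Invoice:
    invoice_id: str
    invoice_number: str
    date: datetime.date
    amount_incl_tax: float
    amount_excl_tax: float
    vendor: str
    # Nested concept, referenced through concept_ref in the bundle.
    category: InvoiceDetails
    # ... and more fields, elided in the excerpt above


# Illustrative values only, not taken from a real invoice.
sample = Invoice(
    invoice_id="inv-001",
    invoice_number="2024-0042",
    date=datetime.date(2024, 3, 15),
    amount_incl_tax=120.0,
    amount_excl_tax=100.0,
    vendor="Example Vendor",
    category=InvoiceDetails(invoice_type="receipt"),
)
tax_amount = sample.amount_incl_tax - sample.amount_excl_tax
```

The nested `category` field is the point to notice: a field of type `concept` holds a whole structured object rather than a scalar.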

Pipeline

[pipe.process_invoice]
type = "PipeSequence"
inputs = { document = "Document" }
output = "Invoice[]"
steps = [
  { pipe = "github.com/Pipelex/methods/documents->documents.extract_page_contents_and_views", result = "invoice_pages" },
  { pipe = "extract_invoice", batch_over = "invoice_pages", batch_as = "invoice_page", result = "invoice" },
]

Each page goes through a sub-pipeline that first classifies the invoice type (bill vs. receipt), then extracts the full structured data using both the page view and OCR text:

[pipe.extract_invoice]
type = "PipeSequence"
inputs = { invoice_page = "Page" }
output = "Invoice"
steps = [
  { pipe = "analyze_invoice", result = "invoice_details" },
  { pipe = "extract_invoice_data", result = "invoice" },
]
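
The control flow of the two sequences can be mimicked in plain Python to show how the batching and nesting fit together. This is a minimal sketch under stated assumptions, not how Pipelex is implemented: `run_sequence` and the dictionary used as working memory are invented for illustration, and the three leaf functions are stand-ins that return dummy data instead of calling OCR or an LLM.

```python
from typing import Any, Callable

Memory = dict[str, Any]
Step = Callable[[Memory], Any]


def run_sequence(steps: list[tuple[str, Step]], memory: Memory) -> Any:
    """Run steps in order, store each result under its name, return the last."""
    result = None
    for result_name, step in steps:
        result = step(memory)
        memory[result_name] = result
    return result


# Stand-ins for the real pipes (hypothetical behavior, dummy data).
def extract_pages(memory: Memory) -> list[str]:
    return [f"page {i + 1} of {memory['document']}" for i in range(2)]


def analyze_invoice(memory: Memory) -> dict[str, str]:
    return {"invoice_type": "receipt"}


def extract_invoice_data(memory: Memory) -> dict[str, str]:
    # Assumed to read both the page and the earlier analysis result.
    return {"source": memory["invoice_page"], **memory["invoice_details"]}


def extract_invoice(memory: Memory) -> dict[str, str]:
    # Sub-pipeline: analyze first, then extract.
    return run_sequence(
        [("invoice_details", analyze_invoice), ("invoice", extract_invoice_data)],
        memory,
    )


def batch_extract(memory: Memory) -> list[dict[str, str]]:
    # Mirrors batch_over = "invoice_pages", batch_as = "invoice_page":
    # each page gets its own copy of the memory with the page bound by name.
    return [
        extract_invoice({**memory, "invoice_page": page})
        for page in memory["invoice_pages"]
    ]


def process_invoice(document: str) -> list[dict[str, str]]:
    memory: Memory = {"document": document}
    return run_sequence(
        [("invoice_pages", extract_pages), ("invoice", batch_extract)],
        memory,
    )


invoices = process_invoice("invoice.pdf")
```

Each step only sees what earlier steps stored under their `result` names, which is why `extract_invoice_data` can use `invoice_details` without it being declared as a pipeline input. The batch step yields one result per page, matching the `Invoice[]` output of `process_invoice`.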

How to run

pipelex run bundle examples/b_basics/document_extract/extract_invoice/bundle.mthds \
  -i examples/b_basics/document_extract/extract_invoice/inputs.json

Related documentation

PipeExtract Operator - Extract text and images from documents
PipeLLM Operator - The core operator for LLM interactions
PipeSequence Controller - Chain pipes into sequential workflows
Document Extraction - Overview of document extraction capabilities