Example: Invoice Extraction
This example processes PDF invoices through a multi-step pipeline: extract each page's contents and views, classify the invoice type, then extract structured data using vision.
What it demonstrates
- Rich structured concept (Invoice with 14+ fields including nested InvoiceDetails)
- Nested PipeSequence (main pipeline batches over pages, sub-pipeline does analyze then extract)
- Vision-based extraction using page views and OCR text
- Using shared method packages for page extraction
The Method: bundle.mthds
Invoice concept
[concept.Invoice]
description = "Invoice information extracted from text, supporting both formal bills and receipts"
[concept.Invoice.structure]
invoice_id = { type = "text", description = "Unique identifier for the invoice" }
invoice_number = { type = "text", description = "Invoice number as shown on the document" }
date = { type = "date", description = "Date when the invoice was issued" }
amount_incl_tax = { type = "number", description = "Total amount including taxes" }
amount_excl_tax = { type = "number", description = "Net amount excluding taxes" }
vendor = { type = "text", description = "Name of the vendor/seller" }
category = { type = "concept", concept_ref = "invoice_extraction.InvoiceDetails", description = "Category or type of expense" }
# ... and more fields
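To make the shape of the extracted data concrete, the fields listed above can be mirrored as plain Python dataclasses. This is only an illustrative sketch, not how Pipelex represents concepts internally: the mapping of "text", "date" and "number" to `str`, `datetime.date` and `float` is an assumption, the `invoice_type` field on `InvoiceDetails` is hypothetical (the bundle excerpt does not show that concept's structure), and the sample values are made up.

```python
import datetime
from dataclasses import asdict, dataclass


@dataclass
class InvoiceDetails:
    # Hypothetical field: the real InvoiceDetails structure is not shown in the excerpt.
    invoice_type: str


@dataclass
class Invoice:
    # Only the fields listed in the excerpt; the real concept has more.
    invoice_id: str
    invoice_number: str
    date: datetime.date
    amount_incl_tax: float
    amount_excl_tax: float
    vendor: str
    category: InvoiceDetails  # nested concept, referenced via concept_ref


# Made-up sample values, for illustration only.
sample = Invoice(
    invoice_id="inv-001",
    invoice_number="2024-0042",
    date=datetime.date(2024, 3, 15),
    amount_incl_tax=120.0,
    amount_excl_tax=100.0,
    vendor="Example Vendor",
    category=InvoiceDetails(invoice_type="receipt"),
)

# asdict() recurses into the nested dataclass, giving a nested dict.
sample_dict = asdict(sample)
```

The nested `category` entry shows how a "concept" field embeds one structured object inside another.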
Pipeline
[pipe.process_invoice]
type = "PipeSequence"
inputs = { document = "Document" }
output = "Invoice[]"
steps = [
{ pipe = "github.com/Pipelex/methods/documents->documents.extract_page_contents_and_views", result = "invoice_pages" },
{ pipe = "extract_invoice", batch_over = "invoice_pages", batch_as = "invoice_page", result = "invoice" },
]
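The second step above can be read as a map over pages: the sub-pipe runs once per element of `invoice_pages`, with each element bound to the name given in `batch_as`, and the per-page results are collected into a list. The following is a minimal sketch of that control flow under those assumptions. `run_batched` and `fake_extract_invoice` are hypothetical stand-ins, not Pipelex functions.

```python
from typing import Any, Callable, Dict, List


def run_batched(
    items: List[Any],
    batch_as: str,
    sub_pipe: Callable[[Dict[str, Any]], Any],
) -> List[Any]:
    """Call sub_pipe once per item, exposing the item under the name batch_as."""
    return [sub_pipe({batch_as: item}) for item in items]


def fake_extract_invoice(inputs: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the extract_invoice sub-pipeline: it just echoes its page.
    return {"source_page": inputs["invoice_page"]}


# Placeholder pages; in the real pipeline these come from the page-extraction step.
invoice_pages = ["page-1", "page-2", "page-3"]
invoices = run_batched(invoice_pages, "invoice_page", fake_extract_invoice)
```

One result is produced per page, which is why the pipeline's declared output is a list (`Invoice[]`) while the sub-pipe's output is a single `Invoice`.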
Each page goes through a sub-pipeline that first classifies the invoice type (bill vs. receipt), then extracts the full structured data using both the page view and OCR text:
[pipe.extract_invoice]
type = "PipeSequence"
inputs = { invoice_page = "Page" }
output = "Invoice"
steps = [
{ pipe = "analyze_invoice", result = "invoice_details" },
{ pipe = "extract_invoice_data", result = "invoice" },
]
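The two steps above run in order, and the second step can use what the first one produced. A minimal sketch of that behavior, assuming each step's output is stored under its `result` name in a shared memory that later steps can read. The function names and the `invoice_type` value are hypothetical, and this is not Pipelex's actual implementation.

```python
from typing import Any, Callable, Dict, List, Tuple

# A step is a (result_name, callable) pair; the callable reads from shared memory.
Step = Tuple[str, Callable[[Dict[str, Any]], Any]]


def run_sequence(inputs: Dict[str, Any], steps: List[Step]) -> Dict[str, Any]:
    """Run steps in order, storing each output under its result name."""
    memory = dict(inputs)
    for result_name, pipe in steps:
        memory[result_name] = pipe(memory)
    return memory


def fake_analyze_invoice(memory: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the classification step (bill vs. receipt).
    return {"invoice_type": "receipt"}


def fake_extract_invoice_data(memory: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the extraction step: reads the page and the earlier result.
    return {"page": memory["invoice_page"], "details": memory["invoice_details"]}


memory = run_sequence(
    {"invoice_page": "page-1"},
    [
        ("invoice_details", fake_analyze_invoice),
        ("invoice", fake_extract_invoice_data),
    ],
)
```

The final value stored under `invoice` corresponds to the sub-pipeline's declared output.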
How to run
pipelex run bundle examples/b_basics/document_extract/extract_invoice/bundle.mthds \
-i examples/b_basics/document_extract/extract_invoice/inputs.json
Related Documentation
- PipeExtract Operator - Extract text and images from documents
- PipeLLM Operator - The core operator for LLM interactions
- PipeSequence Controller - Chain pipes into sequential workflows
- Document Extraction - Overview of document extraction capabilities