Example: Generic Document Extraction
This example demonstrates a powerful and generic pipeline for extracting content from complex PDF documents. It can handle documents that contain both text and images, and it merges the extracted content into a single, coherent output.
Get the code
➡️ View on GitHub: examples/extract_generic.py
The Pipeline Explained
The power_extractor
pipeline is at the heart of this example. After its execution, a custom function merge_markdown_and_images
is used to combine the text (converted to Markdown) and the images from all pages.
async def extract_generic(pdf_url: str) -> TextAndImagesContent:
working_memory = WorkingMemoryFactory.make_from_pdf(
pdf_url=pdf_url,
concept_str="PDF",
name="pdf",
)
pipe_output, _ = await execute_pipeline(
pipe_code="power_extractor",
working_memory=working_memory,
)
working_memory = pipe_output.working_memory
markdown_and_images: TextAndImagesContent = merge_markdown_and_images(working_memory)
return markdown_and_images
The merge_markdown_and_images
function is a great example of how you can add your own Python code to a Pipelex workflow to perform custom processing.
def merge_markdown_and_images(working_memory: WorkingMemory) -> TextAndImagesContent:
# Pages extracted from the PDF by PipeOCR
page_contents_list = working_memory.get_stuff_as_list(item_type=PageContent, name="page_contents")
# Markdown text extracted from the Pages by PipeLLM
page_markdown_list = working_memory.get_stuff_as_list(item_type=TextContent, name="markdowns")
# ... (check for length equality)
# Concatenate the markdown text
concatenated_markdown_text: str = "\\n".join([page_markdown.text for page_markdown in page_markdown_list.items])
# Aggregate the images from the page contents
image_contents: List[ImageContent] = []
for page_content in page_contents_list.items:
if page_content.text_and_images.images:
image_contents.extend(page_content.text_and_images.images)
return TextAndImagesContent(
text=TextContent(text=concatenated_markdown_text),
images=image_contents,
)
This example shows the flexibility of Pipelex in handling complex, multi-modal documents and allowing for custom logic.