LLM-free, layout-aware PDF chunker in pure Rust
A new open-source PDF chunker written in pure Rust splits documents into layout-aware chunks (headings, paragraphs, tables, figures) without using LLMs. It preserves reading order, handles multi-column layouts, and outputs structured sections for downstream RAG or document processing pipelines.
Background
- This is an open-source tool that splits PDFs into smaller pieces ("chunks") for use with Retrieval-Augmented Generation (RAG) systems, a common pattern in which an AI model retrieves relevant passages from a knowledge base before answering.
- Unlike chunkers that rely on a large language model (LLM) to interpret the document, it uses the lopdf crate for low-level PDF parsing, plus its own layout detection, to find paragraphs, columns, and reading order.
- The "layout-aware" approach means it tries to preserve the logical structure of the page (headings, multi-column text, tables) rather than naively splitting by character count or page boundaries.
- It is written in pure Rust with no C++ dependencies, which makes it easy to compile and embed in other Rust projects. It aims to be faster and more deterministic than ML-based alternatives.
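To make the "layout-aware" idea concrete, here is a minimal sketch of column-aware reading order followed by heading-based chunking. It is not the tool's actual code: the `Block` and `Chunk` types, the `block`, `reading_order`, and `chunk_by_heading` functions, the two-column assumption, and the `is_heading` flag are all hypothetical. It uses only the standard library and assumes text blocks with bounding boxes have already been extracted from the page.

```rust
/// A text block with its position on the page (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Block {
    pub x: f32,     // left edge
    pub y: f32,     // top edge, measured downward from the top of the page
    pub width: f32, // horizontal extent
    pub text: String,
    pub is_heading: bool, // assumed to be decided by an earlier detection step
}

/// Convenience constructor used in the usage example.
pub fn block(x: f32, y: f32, width: f32, text: &str, is_heading: bool) -> Block {
    Block { x, y, width, text: text.to_string(), is_heading }
}

/// A chunk: an optional heading plus the body text that follows it.
#[derive(Debug, PartialEq)]
pub struct Chunk {
    pub heading: Option<String>,
    pub body: String,
}

/// Order blocks for a page with at most two columns (an assumption of this sketch).
/// Blocks that cross the page midline are treated as full-width and act as
/// separators: everything above them is emitted left column first, then right.
pub fn reading_order(mut blocks: Vec<Block>, page_width: f32) -> Vec<Block> {
    let mid = page_width / 2.0;
    // Stable sort, so blocks sharing a y coordinate keep their input order.
    blocks.sort_by(|a, b| a.y.total_cmp(&b.y));

    let mut ordered = Vec::with_capacity(blocks.len());
    let (mut left, mut right) = (Vec::new(), Vec::new());
    for b in blocks {
        if b.x + b.width <= mid {
            left.push(b);
        } else if b.x >= mid {
            right.push(b);
        } else {
            // Full-width block: flush both columns, then emit the block itself.
            ordered.append(&mut left);
            ordered.append(&mut right);
            ordered.push(b);
        }
    }
    ordered.append(&mut left);
    ordered.append(&mut right);
    ordered
}

/// Start a new chunk at every heading; body text is joined with single spaces.
pub fn chunk_by_heading(blocks: &[Block]) -> Vec<Chunk> {
    let mut chunks: Vec<Chunk> = Vec::new();
    for b in blocks {
        if b.is_heading {
            chunks.push(Chunk { heading: Some(b.text.clone()), body: String::new() });
            continue;
        }
        // Body text that appears before any heading gets a heading-less chunk.
        if chunks.is_empty() {
            chunks.push(Chunk { heading: None, body: String::new() });
        }
        let current = chunks.last_mut().expect("a chunk was pushed above");
        if !current.body.is_empty() {
            current.body.push(' ');
        }
        current.body.push_str(&b.text);
    }
    chunks
}
```

As a usage example, take a 600-unit-wide page with a full-width heading "Intro" at the top, left-column paragraphs "A1" and "A2", and right-column paragraphs "B1" and "B2" at the same heights. Sorting by vertical position alone would interleave the columns (A1, B1, A2, B2), whereas `reading_order` followed by `chunk_by_heading` yields a single chunk headed "Intro" with body "A1 A2 B1 B2". Because the procedure is a fixed sort and scan with no model in the loop, the same input always produces the same chunks.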