DocETL: Declarative and Agentic Map-Reduce
DocETL is a declarative, agentic map-reduce system for processing and transforming unstructured documents. Users define complex document-processing pipelines in a high-level specification, and AI agents carry out tasks such as extraction, summarization, and transformation across large document collections.
Background
- DocETL is an open-source tool (GitHub repo from UC Berkeley's EPIC Data Lab) that lets you process large documents using AI agents in a "declarative" way: you specify *what* you want (e.g., "extract all product names and prices"), and the system figures out the *how*.
- It's inspired by the classic "MapReduce" programming model (popularized by Google in 2004 for splitting big data tasks into smaller chunks, processing them in parallel, then combining the results), but replaces hand-written map and reduce code with LLM-powered agents.
- The key shift: instead of writing complex Python scripts to chunk documents, call an LLM API, and merge outputs, you write a simple YAML config. DocETL then automatically optimizes the pipeline (splitting, summarizing, joining) using AI.
- This matters because working with large unstructured text (legal contracts, research papers, internal wikis) via LLMs is currently expensive and error-prone. DocETL aims to make it cheaper, faster, and more reliable by automatically tuning how the agentic pipeline runs.
- The target audience is developers and data engineers building document-processing workflows, not end users.
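
To make the map-reduce idea concrete, here is a minimal sketch of the hand-written approach the Background list describes: chunk a document, process each chunk in parallel, then merge the partial outputs. It is not DocETL code and does not use DocETL's API. All function names (`split_into_chunks`, `extract_prices`, `merge_results`, `run_pipeline`) are hypothetical, and a regular expression stands in for the LLM call so the example runs without any external service.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(text: str, max_words: int) -> list[str]:
    """Split step: break the document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def extract_prices(chunk: str) -> list[str]:
    """Map step. ASSUMPTION: a regex stands in for a per-chunk LLM call."""
    return re.findall(r"\$\d+(?:\.\d{2})?", chunk)


def merge_results(partials: list[list[str]]) -> list[str]:
    """Reduce step: flatten partial results and drop duplicates, keeping order."""
    seen: dict[str, None] = {}
    for partial in partials:
        for item in partial:
            seen.setdefault(item, None)
    return list(seen)


def run_pipeline(text: str, max_words: int = 50) -> list[str]:
    """Chunk the text, map over the chunks in parallel, then reduce."""
    chunks = split_into_chunks(text, max_words)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves chunk order, so results stay in document order.
        partials = list(pool.map(extract_prices, chunks))
    return merge_results(partials)


document = "The widget costs $19.99. The gadget costs $5. Shipping for the widget is $5."
print(run_pipeline(document, max_words=5))  # → ['$19.99', '$5']
```

Even in this toy version, the chunk size, the per-chunk logic, and the merge rule are all decisions the author has to make and tune by hand. Those are the kinds of choices a declarative system aims to take over.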