Pulpie: Pareto-Optimal Models for Cleaning the Web

The article introduces Pulpie, a suite of Pareto-optimal models designed for web-scale data cleaning. It presents models that balance performance and computational cost to filter low-quality text from web datasets, improving the efficiency of training large language models.

Background

- **Feyn** is a startup working on mechanistic interpretability, that is, reverse-engineering neural networks to understand their internal logic.
- **Pulpie** is their new open-source tool for cleaning web text used in AI training. Instead of one big filter, it uses many small specialized models (each targeting a specific flaw like SEO spam, repetition, or toxicity) and combines their scores.
- "Pareto-optimal" means the system sits on the tradeoff frontier between competing goals (e.g., keeping diverse data while removing junk): no goal can be improved further without giving ground on another. This contrasts with maximizing a single metric.
- AI training data is often scraped from the open web (Common Crawl). Crude filters can remove useful content; Pulpie aims for more surgical quality control, which matters because data quality increasingly determines model performance more than architecture does.
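
To make the term "Pareto-optimal" concrete, here is a minimal sketch of how a Pareto front could be picked out of a set of candidate filters when trading quality against compute cost. The candidate names, scores, and costs are invented for illustration and are not taken from Pulpie; the `Candidate`, `dominates`, and `pareto_front` names are likewise hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical filter model with two competing objectives."""
    name: str
    quality: float  # higher is better (made-up score)
    cost: float     # lower is better (made-up relative compute cost)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on both objectives
    and strictly better on at least one."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost
    strictly_better = a.quality > b.quality or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative numbers only.
candidates = [
    Candidate("tiny", quality=0.70, cost=1.0),
    Candidate("small", quality=0.80, cost=2.0),
    Candidate("wasteful", quality=0.75, cost=3.0),  # worse than "small" on both axes
    Candidate("large", quality=0.90, cost=8.0),
]

front = pareto_front(candidates)
print([c.name for c in front])  # ['tiny', 'small', 'large']
```

In this toy data, "wasteful" is dropped because "small" is both cheaper and better, while the other three each represent a different but defensible point on the quality-versus-cost tradeoff.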