From Hacker News

Show HN: Quicktok, an exact BPE tokenizer 7x faster than tiktoken

Quicktok is a new exact BPE tokenizer that claims to run 7x faster than tiktoken while producing the same tokens, aimed at efficient tokenization in natural language processing tasks.

Background

- Quicktok is a new, open-source BPE (Byte-Pair Encoding) tokenizer. A tokenizer is the component that converts text into the numeric tokens that large language models (LLMs) actually process. Quicktok claims to be 7x faster than tiktoken, OpenAI's widely used reference tokenizer.
- BPE tokenizers split words into subword units (e.g., "hello" might become ["hel", "lo"]). Speed matters because tokenization can become a bottleneck in both training and inference, especially when processing large volumes of text.
- Tiktoken is OpenAI's official tokenizer, used by GPT-4 and GPT-3.5. Many third-party tools and APIs rely on it so that their token counts exactly match what the model expects; a mismatch leads to incorrect counts and downstream errors. Quicktok's key promise is exact compatibility with tiktoken (the same tokens for the same input) while being much faster.
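
To make the subword-splitting idea above concrete, here is a minimal sketch of rank-based BPE merging in Python. It is not Quicktok's or tiktoken's actual implementation: the rank table `TOY_RANKS` is invented so that "hello" splits the way the example above suggests, and the helper `outputs_match` is a hypothetical way to express what "exact" compatibility between two tokenizers would mean. Real tokenizers also apply a pre-tokenization step and use a far larger vocabulary.

```python
from typing import Callable, Dict, Iterable, List

# Toy rank table, invented for illustration only (lower rank = merged earlier).
# A token's rank doubles as its numeric token ID in this sketch.
TOY_RANKS: Dict[bytes, int] = {
    b"h": 0, b"e": 1, b"l": 2, b"o": 3,
    b"he": 4, b"lo": 5, b"hel": 6,
}


def bpe_encode(word: bytes, ranks: Dict[bytes, int]) -> List[bytes]:
    """Split `word` into BPE pieces by repeatedly merging the lowest-rank pair."""
    parts = [bytes([b]) for b in word]  # start from single bytes
    while len(parts) > 1:
        best_idx, best_rank = None, None
        # Scan adjacent pairs and remember the mergeable one with the lowest rank.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no adjacent pair is in the table, so we are done
        merged = parts[best_idx] + parts[best_idx + 1]
        parts[best_idx:best_idx + 2] = [merged]
    return parts


def outputs_match(
    encode_a: Callable[[bytes], List[int]],
    encode_b: Callable[[bytes], List[int]],
    samples: Iterable[bytes],
) -> bool:
    """Hypothetical exactness check: both encoders must agree on every sample."""
    return all(encode_a(s) == encode_b(s) for s in samples)


pieces = bpe_encode(b"hello", TOY_RANKS)
token_ids = [TOY_RANKS[p] for p in pieces]
print(pieces, token_ids)
```

The inner loop rescans every adjacent pair after each merge, which is simple but quadratic in the worst case. A faster tokenizer has to reduce that cost while still producing the same merges; how Quicktok does so is not described here.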