Translation

BaseRT: A fast inference runtime for local AI on Apple Silicon

BaseRT is a fast inference runtime designed to run local AI models on Apple Silicon hardware, offering performance improvements for on-device machine learning tasks without relying on cloud services.

Background

BaseRT is a new inference runtime (software that runs AI models locally) built specifically for Apple Silicon (M-series) chips. Its premise is that most popular AI runtimes, such as llama.cpp or Ollama, are tuned mainly for NVIDIA GPUs and do not fully exploit Apple's unified memory architecture, in which the CPU and GPU share a single pool of fast memory. BaseRT claims to deliver 2–5× faster token generation by building the entire software stack from scratch for that architecture rather than porting a CUDA-oriented one. The project is headed by a former principal engineer at Hugging Face who worked on the company's server-side inference engine. This matters because Apple Silicon is now widely used by developers and researchers, yet running large models locally on it has been noticeably slower than on NVIDIA hardware, so a purpose-built runtime could make local AI on Macs more practical. The initial release supports Meta's Llama family and Mistral models, with more planned.