Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding
This paper introduces Cassandra, a framework that uses self-speculative decoding to run reasoning LLMs efficiently on edge devices. Cassandra treats the model's own intermediate reasoning steps as draft predictions, which reduces computational overhead without sacrificing accuracy and makes advanced LLM reasoning accessible on resource-constrained hardware.
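To make the draft-and-verify idea concrete, here is a minimal Python sketch of generic greedy speculative decoding. It is not Cassandra's actual algorithm, which the summary above does not specify. The names `target_next`, `draft_next`, `speculative_decode`, and `draft_len` are hypothetical, and the two toy "models" are arithmetic stand-ins. In a self-speculative setting, the draft function would be some cheaper use of the same model rather than a separate network.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the full model's greedy next-token choice."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    """Toy stand-in for a cheap draft pass: agrees with the target
    except when the context length is a multiple of 5."""
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess


def greedy_decode(target: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one full-model step per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextTokenFn,
    draft: NextTokenFn,
    prompt: List[Token],
    n: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify loop.

    Returns the generated sequence and the number of verification rounds.
    Assumption: a real system checks all drafted positions in one batched
    forward pass, so each round is counted as one full-model pass here,
    even though this toy loop calls `target` position by position.
    """
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < n:
        # 1. Draft: cheaply propose draft_len tokens.
        proposal: List[Token] = []
        for _ in range(draft_len):
            proposal.append(draft(tokens + proposal))

        # 2. Verify: keep the longest prefix the target agrees with.
        rounds += 1
        accepted: List[Token] = []
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every drafted token matched, so the target yields one bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n], rounds


prompt = [1, 2, 3]
spec_tokens, rounds = speculative_decode(target_next, draft_next, prompt, n=20)
print(spec_tokens == greedy_decode(target_next, prompt, 20), rounds)
```

Because a drafted token is kept only when it matches the target's own greedy choice, the output is identical to plain greedy decoding. Any saving comes from needing fewer full-model passes than generated tokens, and that depends on how often the drafts are accepted.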