Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding
The paper introduces Cassandra, a framework that lets large reasoning language models run efficiently on edge devices through self-speculative decoding: the same model drafts candidate tokens and then verifies them, which reduces inference latency and computational cost.
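
To make the draft-then-verify idea concrete, the following is a minimal sketch of generic greedy speculative decoding, not Cassandra's actual implementation. All names here (`speculative_decode`, `greedy_decode`, `toy_target`, `toy_draft`) are hypothetical, and the "models" are toy functions over integer tokens. The draft is assumed to be a cheaper, occasionally wrong pass derived from the same model, which is what makes the scheme "self"-speculative; how Cassandra builds its draft is not stated in this summary.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to its next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # Draft phase: the cheap pass proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # Verify phase: a real system would score all k positions in a single
        # batched forward pass; the loop below emulates that pass step by step.
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # always keep the target's own token
            if correct != guess:
                break  # first mismatch: discard the rest of the proposal
        else:
            # Every guess matched, so the verify pass yields one bonus token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:limit]


# Toy stand-ins (assumptions for illustration only).
def toy_target(tokens: List[int]) -> int:
    return (sum(tokens) * 31 + len(tokens)) % 50


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 3.
    guess = toy_target(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(toy_target, toy_draft, prompt, max_new_tokens=20)
    slow = greedy_decode(toy_target, prompt, max_new_tokens=20)
    print(fast == slow)
```

Because every kept token comes from the target, the output is identical to plain greedy decoding with the target alone. Any speedup comes from verifying several drafted tokens per target pass, so it depends on how often the draft agrees with the target.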