Efficient collaborative decoding between edge-deployed small models and cloud-based large language models.
This project implements Token-Level Routing, a novel collaborative inference system in which a small on-device model performs most of the decoding and selectively routes critical tokens to a powerful cloud-based LLM.
This approach significantly reduces latency and cost while retaining output quality, making it ideal for edge scenarios such as mobile phones and IoT devices.
- ⚡ Efficient: >60% accuracy boost by routing only ~7% of tokens to the LLM.
- 🌐 Edge-Cloud Collaboration: Combines local lightweight models with cloud intelligence.
- 🧠 Token-Level Routing: Fine-grained, confidence-driven token control.
- 📱 Deployable: Lightweight ONNX Runtime works on laptops and mobile devices (see the loading sketch after this list).
- 🖥️ LLM Backend: Compatible with [SGLang] for LLM serving and KV-cache extension.
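For the on-device half, loading the SLM with ONNX Runtime might look like the sketch below. The model filename, the `input_ids` graph input, and the single-logits output are assumptions about the export, not this project's actual artifacts:

```python
import numpy as np
import onnxruntime as ort

# Placeholder model path; the real file comes from your export step.
sess = ort.InferenceSession("slm.onnx", providers=["CPUExecutionProvider"])

def slm_step(input_ids: np.ndarray) -> np.ndarray:
    """One forward pass; returns logits for the last position, shape (vocab,)."""
    # Assumes the exported graph takes "input_ids" and produces a single
    # logits output of shape (batch, seq_len, vocab).
    (logits,) = sess.run(None, {"input_ids": input_ids})
    return logits[0, -1]
```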
```
+-------------+           +-------------+           +-------------+
| User Input  |--Prompt-->| SLM (ONNX)  |--Tokens-->|   Router    |
+-------------+           +-------------+           +-------------+
                                                           |
                                          Tokens with low confidence
                                                           v
                                                 +------------------+
                                                 | LLM (Server-side)|
                                                 +------------------+
```
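The diagram boils down to a per-token decision: decode locally, and escalate only when the SLM is unsure. Below is a minimal sketch of that loop, where `slm_step` (local logits, e.g. the function above) and `llm_next_token` (cloud call) are placeholder entry points, and the softmax confidence with a 0.9 cutoff is an illustrative assumption rather than the project's tuned policy:

```python
import numpy as np

THRESHOLD = 0.9  # illustrative cutoff on the SLM's top-token probability

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def generate(prompt_ids, max_new_tokens, slm_step, llm_next_token, eos_id=None):
    """Decode with the SLM, routing low-confidence positions to the LLM."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = slm_step(np.array([ids], dtype=np.int64))  # shape (vocab,)
        probs = softmax(logits)
        token = int(probs.argmax())
        if probs[token] < THRESHOLD:
            # Router: the SLM is unsure here, so ask the cloud LLM instead.
            token = llm_next_token(ids)
        ids.append(token)
        if eos_id is not None and token == eos_id:
            break
    return ids
```

Raising the threshold routes more tokens to the LLM, trading cost for quality; the numbers above correspond to only ~7% of tokens being escalated.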
See Guideline.md for setup and usage instructions.
- ✅ macOS (Apple M1/M2/M3) is already supported
- 🚧 Android support is under development!
For questions or collaborations, feel free to open an issue or email us.

