Efficient collaborative decoding between edge-deployed small models and cloud-based large language models.
This project implements Token-Level Routing, a novel collaborative inference system in which a small on-device model performs most of the decoding and selectively routes critical tokens to a powerful cloud-based LLM.
This approach significantly reduces latency and cost while retaining output quality, making it ideal for edge scenarios such as mobile phones and IoT devices.
- ⚡ Efficient: >60% accuracy boost by routing only ~7% of tokens to the LLM.
- 🌐 Edge-Cloud Collaboration: Combines local lightweight models with cloud intelligence.
- 🧠 Token-Level Routing: Fine-grained, confidence-driven token control.
- 📱 Deployable: Lightweight ONNX Runtime works on laptops and mobile devices (see the loading sketch after this list).
- 🖥️ LLM Backend: Compatible with [SGLang] for LLM serving and KV-cache extension.
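For the on-device half, loading the SLM with ONNX Runtime might look like the sketch below. The model filename, the `input_ids` graph input, and the single-logits output are assumptions about the export, not this project's actual artifacts:

```python
import numpy as np
import onnxruntime as ort

# Placeholder model path; the real file comes from your export step.
sess = ort.InferenceSession("slm.onnx", providers=["CPUExecutionProvider"])

def slm_step(input_ids: np.ndarray) -> np.ndarray:
    """One forward pass; returns logits for the last position, shape (vocab,)."""
    # Assumes the exported graph takes "input_ids" and produces a single
    # logits output of shape (batch, seq_len, vocab).
    (logits,) = sess.run(None, {"input_ids": input_ids})
    return logits[0, -1]
```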
```
+-------------+           +-------------+           +-------------+
| User Input  |--Prompt-->| SLM (ONNX)  |--Tokens-->|   Router    |
+-------------+           +-------------+           +-------------+
                                                           |
                                          Tokens with low confidence
                                                           v
                                                 +------------------+
                                                 | LLM (Server-side)|
                                                 +------------------+
```
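The diagram boils down to a per-token decision: decode locally, and escalate only when the SLM is unsure. Below is a minimal sketch of that loop, where `slm_step` (local logits, e.g. the function above) and `llm_next_token` (cloud call) are placeholder entry points, and the softmax confidence with a 0.9 cutoff is an illustrative assumption rather than the project's tuned policy:

```python
import numpy as np

THRESHOLD = 0.9  # illustrative cutoff on the SLM's top-token probability

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def generate(prompt_ids, max_new_tokens, slm_step, llm_next_token, eos_id=None):
    """Decode with the SLM, routing low-confidence positions to the LLM."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = slm_step(np.array([ids], dtype=np.int64))  # shape (vocab,)
        probs = softmax(logits)
        token = int(probs.argmax())
        if probs[token] < THRESHOLD:
            # Router: the SLM is unsure here, so ask the cloud LLM instead.
            token = llm_next_token(ids)
        ids.append(token)
        if eos_id is not None and token == eos_id:
            break
    return ids
```

Raising the threshold routes more tokens to the LLM, trading cost for quality; the numbers above correspond to only ~7% of tokens being escalated.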
See Guideline.md for setup and usage instructions.
- ✅ macOS (Apple M1/M2/M3) is already supported
- 🚧 Android support is under development!
For questions or collaborations, feel free to open an issue or email us.

