Official implementation of "SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching" (COLM 2025).
SentenceKV is an efficient KV cache compression method that organizes the KV cache at the sentence level based on semantic similarity. Unlike token-level compression methods, SentenceKV leverages the natural structure of language by clustering semantically similar sentences and maintaining a compact cache that preserves important contextual information.
- Sentence-level Organization: Groups tokens into semantic sentences for more effective compression
- Semantic Clustering: Uses semantic similarity to determine which sentences to retain in the cache
- Efficient Inference: Maintains inference quality while significantly reducing memory usage
- Plug-and-play: Easy integration with existing transformer models
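At a high level, the method groups the prompt's tokens into sentences, represents each sentence with a single semantic vector, and at decoding time keeps KV entries only for the sentences most relevant to the current query. The sketch below illustrates that selection step conceptually; the function and tensor names are ours (not the repository's API), and the actual implementation in `models/sentencekv/` operates per attention head inside the model.

```python
import torch

def select_sentence_kv(keys, values, sentence_ids, query, budget):
    """Conceptual sketch of sentence-level KV selection (not the repo's API).

    keys, values:  [seq_len, head_dim] cached projections for one head
    sentence_ids:  [seq_len] sentence index per token (assumed 0..num_sent-1)
    query:         [head_dim] current decoding query
    budget:        max number of tokens to keep in the compressed cache
    """
    # One semantic vector per sentence: the mean of its token keys.
    num_sent = int(sentence_ids.max().item()) + 1
    sent_vecs = torch.stack(
        [keys[sentence_ids == s].mean(dim=0) for s in range(num_sent)]
    )

    # Rank sentences by cosine similarity to the current query.
    sims = torch.nn.functional.cosine_similarity(sent_vecs, query.unsqueeze(0), dim=-1)
    order = sims.argsort(descending=True)

    # Greedily keep whole sentences until the token budget is exhausted.
    keep = torch.zeros_like(sentence_ids, dtype=torch.bool)
    used = 0
    for s in order.tolist():
        mask = sentence_ids == s
        n = int(mask.sum().item())
        if used + n > budget:
            continue
        keep |= mask
        used += n
    return keys[keep], values[keep]
```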
Installation:

- Clone the repository:
```bash
git clone https://github.com/zzbright1998/SentenceKV.git
cd SentenceKV
```

- Option A: Using Conda (Recommended)
```bash
conda env create -f environment.yml
conda activate sentencekv
```

- Option B: Using pip
```bash
# Create a virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install PyTorch first (adjust CUDA version as needed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Install other dependencies
pip install -r requirements.txt
```

- Option C: Manual Installation
```bash
# Create a new conda environment
conda create -n sentencekv python=3.10
conda activate sentencekv

# Install PyTorch (adjust CUDA version as needed)
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia

# Install core dependencies
pip install "transformers>=4.44.0" "accelerate>=1.0.0" "datasets>=3.0.0"
pip install "flash-attn>=2.6.0" --no-build-isolation  # For efficient attention computation
pip install evaluate rouge-score nltk scikit-learn    # For evaluation metrics
pip install numpy pandas tqdm pyyaml requests         # Basic utilities
```

Requirements:
- Python 3.10+
- CUDA-compatible GPU (recommended)
- At least 16GB GPU memory for 7B models
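Optionally, a quick sanity check after installation confirms that PyTorch sees a CUDA GPU and that flash-attn imports cleanly (it does not exercise SentenceKV itself):

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB")

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not found; install it before using attn_implementation='flash_attention_2'")
```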
To run experiments on LongBench datasets:
- Navigate to the LongBench directory:
```bash
cd SentenceKV/LongBench
```

- Configure and run the evaluation:

```bash
./run_longbench.sh
```

The script supports multiple configurations:
- Models: various Llama models (e.g., Llama-3.2-3B, Llama-3.1-8B)
- Methods: SentenceKV, FullKV, SnapKV, H2O, Quest
- Datasets: narrativeqa, qasper, multifieldqa_en, hotpotqa, 2wikimqa, musique, gov_report, qmsum, multi_news, trec, triviaqa, samsum, passage_count, passage_retrieval_en, lcc, repobench-p
- Cache Capacities: Configurable KV cache sizes
- Evaluate results:
```bash
./metrics.sh path/to/results   # e.g. SentenceKV/LongBench/results_long_bench/MODEL_NAME
```

To run experiments on RULER benchmark:
- Navigate to the RULER directory:
```bash
cd SentenceKV/RULER
```

- Configure and run the evaluation:

```bash
./run_ruler.sh
```

The script supports:
- Context Lengths: 16K, 32K, 65K tokens
- Datasets: NIAH (single/multi), VT, FWE, QA tasks
- Methods: SentenceKV and baseline methods
- Evaluate results:
```bash
./metrics.sh path/to/results   # e.g. SentenceKV/RULER/results_ruler/MODEL_NAME
```

You can integrate SentenceKV into your own projects:
```python
import torch
from transformers import AutoTokenizer
from models.sentencekv.custom_llama_sentencekv import LlamaForCausalLM

# Load model and tokenizer
model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).cuda()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Configure SentenceKV parameters
model.config.max_capacity_prompts = 1024  # Cache capacity
model.config.semantic_factor = 3          # Semantic clustering factor

# Use the model for inference
inputs = tokenizer("Your input text here", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Configuration parameters:

- max_capacity_prompts: maximum number of tokens to keep in the KV cache
- semantic_factor: controls the granularity of semantic clustering
- attn_implementation: attention implementation ("flash_attention_2" recommended)
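As a rough guide when choosing `max_capacity_prompts`: the KV cache grows linearly with the number of retained tokens. The back-of-the-envelope estimate below assumes Llama-3.1-8B's shapes (32 layers, 8 KV heads with GQA, head dimension 128) and 2-byte bf16 storage; adjust the constants for other models.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

print(f"Full cache, 32K-token prompt: {kv_cache_bytes(32_768) / 1024**3:.1f} GiB")  # ~4.0 GiB
print(f"SentenceKV budget of 1024:    {kv_cache_bytes(1_024) / 1024**2:.0f} MiB")   # ~128 MiB
```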
Supported methods:

- SentenceKV: Our proposed sentence-level semantic KV caching
- FullKV: Full KV cache (baseline)
- SnapKV: Token-level KV cache compression
- H2O: Heavy-hitter oracle KV caching
- Quest: Query-aware sparsity for long-context inference
SentenceKV demonstrates significant improvements in:
- Memory Efficiency: Reduced KV cache memory usage
- Inference Speed: Faster generation with compressed cache
- Quality Preservation: Maintains performance on long-context tasks
Detailed results are available in our paper.
```
SentenceKV/
├── README.md
├── environment.yml
├── LongBench/               # LongBench evaluation
│   ├── run_longbench.py
│   ├── run_longbench.sh
│   ├── metrics.py
│   ├── metrics.sh
│   └── data/                # Evaluation datasets
├── RULER/                   # RULER evaluation
│   ├── run_ruler.py
│   ├── run_ruler.sh
│   ├── metrics.py
│   ├── metrics.sh
│   └── data/                # RULER datasets
├── NIAH/                    # Needle in a Haystack evaluation
└── models/
    ├── sentencekv/          # Core SentenceKV implementation
    ├── pyramidkv/           # PyramidKV baseline
    ├── quest/               # Quest baseline
    └── inf_llm/             # InfLLM baseline
```
If you find SentenceKV useful in your research, please cite our paper:
```bibtex
@article{zhu2025sentencekv,
  title={SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching},
  author={Zhu, Yuxuan and Falahati, Ali and Yang, David H and Amiri, Mohammad Mohammadi},
  journal={arXiv preprint arXiv:2504.00970},
  year={2025}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.


