- [2026/02/11]: 🚀 We released the official code and model checkpoints!
- [2026/02/06]: 🎉 WAVE has been accepted to ICLR 2026 as an Oral Presentation!
We introduce WAVE (Unified & Versatile Audio-Visual Embeddings), the first LLM-based embedding model that creates a unified representation space for text, audio, silent video, and synchronized audio-visual modalities.
Unlike previous methods that use separate encoders, WAVE is built upon Qwen2.5-Omni and employs a novel hierarchical feature fusion strategy. It features a dual-encoder design for audio (Speech + Environmental Sounds) and is trained via a joint multi-modal, multi-task approach.
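For intuition only, here is a minimal, hypothetical sketch of how a dual-audio-encoder fusion step could look. It is not the actual WAVE implementation; the module names, feature dimensions (including the assumed LLM hidden size), and projection scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyAudioVisualFusion(nn.Module):
    """Illustrative sketch only (not the WAVE source code).

    Two audio streams (speech + environmental sound) are concatenated and
    projected, visual features are projected separately, and the resulting
    token sequences are concatenated before being fed to the LLM backbone.
    """

    def __init__(self, d_speech=512, d_sound=512, d_visual=1024, d_llm=3584):
        super().__init__()
        # d_llm = 3584 is an assumed hidden size for a 7B backbone.
        self.audio_proj = nn.Linear(d_speech + d_sound, d_llm)
        self.visual_proj = nn.Linear(d_visual, d_llm)

    def forward(self, speech_feats, sound_feats, visual_feats):
        # speech_feats/sound_feats: (B, T_audio, d_*); visual_feats: (B, T_video, d_visual)
        audio = self.audio_proj(torch.cat([speech_feats, sound_feats], dim=-1))
        visual = self.visual_proj(visual_feats)
        # Concatenate along the sequence dimension to form the LLM input tokens.
        return torch.cat([audio, visual], dim=1)
```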
Key Capabilities:
- Any-to-Any Retrieval: Seamlessly retrieve across text, audio, and video modalities.
- Prompt-Aware Embeddings: Generate embeddings tailored to user instructions (e.g., specific questions), significantly boosting performance in multimodal QA.
- SOTA Performance: WAVE achieves state-of-the-art results on the MMEB-v2-video benchmark and excels in audio/audio-visual retrieval tasks.
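As a toy illustration of any-to-any retrieval with such embeddings, the sketch below ranks candidates by cosine similarity. The embedding dimension and the way query/candidate embeddings are produced are assumptions, not the WAVE API.

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_emb: torch.Tensor, candidate_embs: torch.Tensor, top_k: int = 5):
    """Rank candidate embeddings by cosine similarity to a single query embedding."""
    q = F.normalize(query_emb, dim=-1)       # (d,)
    c = F.normalize(candidate_embs, dim=-1)  # (N, d)
    return torch.topk(c @ q, k=top_k)        # (values, indices) of the top-k matches

# Toy usage with random tensors standing in for WAVE embeddings
# (e.g. a prompt-conditioned text query against 100 candidate video clips):
query = torch.randn(3584)
videos = torch.randn(100, 3584)
scores, indices = rank_candidates(query, videos)
```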
Overview of WAVE Architecture. It supports text-only, visual-only, audio-only, and audio-visual inputs.
WAVE achieves SOTA performance on MMEB-v2-video, outperforming industrial-grade models like Seed-1.6-Embedding.
| Model | Overall | Classification (CLS) | QA | Retrieval (RET) | Moment Retrieval (MRET) |
|---|---|---|---|---|---|
| LamRA 7B | 35.0 | 39.3 | 42.6 | 24.3 | 32.8 |
| GME 7B | 38.4 | 37.4 | 50.4 | 28.4 | 37.0 |
| CAFe 7B | 42.4 | 35.8 | 58.7 | 34.4 | 39.5 |
| Seed-1.6-Embedding | 55.3 | 55.0 | 60.9 | 51.3 | 53.5 |
| WAVE 7B (Ours) | 59.9 | 57.8 | 72.5 | 54.7 | 50.8 |
WAVE also demonstrates strong performance across audio and audio-visual modalities.
| Dataset | AudioCaps | Clotho | MMAU | MMAR | VGGSound | MusicCaps |
|---|---|---|---|---|---|---|
| Task | A-RET | A-RET | A-QA | A-QA | AV-RET | AV-RET |
| Reference Model | WavCaps | WavCaps | Qwen2.5-Omni | Qwen2.5-Omni | Encoder-only | Encoder-only |
| Reference Score | 42.2 | 21.5 | 71.5 | 56.7 | 10.3 | 8.6 |
| WAVE 7B | 44.2 | 25.6 | 76.6 | 68.1 | 25.0 | 20.4 |

(Tasks: A-RET = audio retrieval, A-QA = audio question answering, AV-RET = audio-visual retrieval.)
For key library versions, please refer to `requirements.txt`. Install via:

```bash
pip install -r requirements.txt
```
We provide the model checkpoint on Hugging Face.
| Model | Parameters | Backbone | Download |
|---|---|---|---|
| WAVE-7B | 7B | Qwen2.5-Omni | 🤗 HuggingFace |
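If you prefer fetching the checkpoint programmatically, a minimal sketch using `huggingface_hub` is shown below; the repository id is a placeholder, so substitute the actual WAVE-7B repo name.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: replace with the actual WAVE-7B repository on Hugging Face.
local_dir = snapshot_download(repo_id="<org>/WAVE-7B")
print(f"Checkpoint downloaded to: {local_dir}")
```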
To evaluate WAVE on your own datasets or benchmarks, follow these steps:

- **Data Preparation**: Organize your data into the sample JSON format. Refer to the templates provided in the `scripts/` directory, such as:
  - `scripts/ret_av2t.json` (video/audio-to-text retrieval)
  - `scripts/ret_mc.json` (QA)
  - `scripts/ret_v2a.json` (audio-visual retrieval)

  An illustrative sketch of preparing such a file is shown after these steps.

- **Inference**: Modify `scripts/direct_test.sh`, then run the script:

  ```bash
  bash scripts/direct_test.sh
  ```

  The results will be automatically saved in the `output/test` directory.
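Below is an illustrative sketch of preparing such a JSON file in Python. The field names (`query`, `candidates`, `label`) are assumptions for illustration only; consult the templates in `scripts/` for the authoritative schema.

```python
import json

# Illustrative only: field names below are assumptions, not the actual schema.
# Use the templates in scripts/ (e.g. scripts/ret_av2t.json) as the ground truth.
samples = [
    {
        "query": "videos/clip_0001.mp4",
        "candidates": [
            "a dog barks at a passing car",
            "a choir sings in a large hall",
        ],
        "label": 0,  # index of the correct candidate
    },
]

with open("my_benchmark.json", "w") as f:
    json.dump(samples, f, indent=2)
```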
If you find WAVE useful for your research, please consider citing our paper:
```bibtex
@inproceedings{tang2026wave,
  title={{WAVE}: Learning Unified \& Versatile Audio-Visual Embeddings with Multimodal {LLM}},
  author={Changli Tang and Qinfan Xiao and Ke Mei and Tianyi Wang and Fengyun Rao and Chao Zhang},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=MiV3WXDYJb}
}
```

License: Apache 2.0