Carrier-Agnostic Speaker Verification Evaluation
Why this exists: I wanted to build a system that indexes spoken conversations—automatically identifying speakers across Discord, phone calls, and in-person meetings. But I hit a wall: the same person produces different embeddings depending on how I encountered them. Current models degrade up to 19× on real-world audio. Humans don't have this problem—we recognize voices regardless of medium. This benchmark measures that gap. Read the full story →
State-of-the-art speaker verification models achieve <1% EER on clean benchmarks. But what happens when audio passes through a phone codec, is captured by a cheap webcam, or is replayed through a speaker and re-recorded?
The CASE Benchmark answers this question—and the results are eye-opening.
| Condition | Typical SOTA Performance |
|---|---|
| Clean Audio | 0.6-1.7% EER ✅ |
| Phone Codec | 2-4% EER |
| Laptop Microphone | 0.6-1.8% EER |
| Room Reverb | 5-8% EER |
| Playback Chain | 9-13% EER ❌ |
That's up to 19× worse performance under realistic conditions.
The hardest scenario: audio encoded, played through a speaker, and re-recorded:
Voice → [Codec] → [Speaker] → [Room Acoustics] → [Microphone] → Recording
This happens when:
- Voice messages are played back and re-recorded
- Conference calls go through speaker playback
- Smart speaker interactions are captured
- A voice replay attack is attempted
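For intuition, here is a minimal sketch of how such a replay chain can be approximated offline. It assumes a mono clean.wav and a room impulse response rir.wav are available locally, and uses μ-law companding as a rough stand-in for a telephony codec; it is an illustration, not the benchmark's generation pipeline.

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

# Placeholder inputs: a clean mono utterance and a measured room impulse response.
speech, sr = sf.read("clean.wav")
rir, _ = sf.read("rir.wav")

# 1) Codec stage: 8-bit mu-law companding as a crude telephony-codec stand-in.
mu = 255.0
companded = np.sign(speech) * np.log1p(mu * np.abs(speech)) / np.log1p(mu)
quantized = np.round(companded * 127.0) / 127.0
codec_out = np.sign(quantized) * np.expm1(np.abs(quantized) * np.log1p(mu)) / mu

# 2) Speaker + room stage: convolve with the room impulse response.
reverberant = fftconvolve(codec_out, rir)[: len(codec_out)]

# 3) Microphone stage: add a little sensor noise and renormalize.
recorded = reverberant + 1e-3 * np.random.randn(len(reverberant))
recorded /= np.max(np.abs(recorded)) + 1e-9

sf.write("replayed.wav", recorded, sr)
```

The benchmark reproduces conditions like this at scale. To run it, install the package: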
pip install case-benchmark
# Install with model support
pip install case-benchmark[speechbrain] # SpeechBrain ECAPA-TDNN
pip install case-benchmark[all-models]   # All supported models

case-benchmark download --output-dir ./benchmark_data

# Using built-in model wrappers
case-benchmark evaluate \
--model speechbrain \
--benchmark-dir ./benchmark_data \
--output-dir ./results
# Or programmatically
from case_benchmark import CASEBenchmark, load_model
benchmark = CASEBenchmark("./benchmark_data")
model = load_model("speechbrain")
results = benchmark.evaluate(model)
results.print_summary()
# Clean EER: 0.56%, Degradation: +2.49%

| Rank | Model | Absolute EER | Degradation | Clean EER |
|---|---|---|---|---|
| 1 | WeSpeaker ResNet34 | 3.01% | +2.43% | 0.58% |
| 2 | SpeechBrain ECAPA-TDNN | 3.05% | +2.49% | 0.56% |
| 3 | CASE HF v2-512 | 3.53% | +2.31% | 1.22% |
| 4 | NeMo TitaNet-L | 4.05% | +3.39% | 0.66% |
| 5 | pyannote Embedding | 4.47% | +2.79% | 1.68% |
| 6 | Resemblyzer | 10.49% | +5.65% | 4.84% |
Key Finding: The CASE HF model achieves the lowest degradation factor (+2.31%), validating its carrier-agnostic design.
For reference, current SOTA on VoxCeleb1-O (clean-clean only):
- ResNet293 + VoxBlink2: 0.17% EER (arXiv:2407.11510)
- ERes2NetV2: 0.61% EER (3D-Speaker)
Our benchmark targets production-ready models that are easy to obtain and run. The SOTA systems above require specialized training data (VoxBlink2, 100K+ speakers) and post-processing (AS-Norm, QMF) that are not typically used in deployment.
| Category | Avg EER | vs Clean |
|---|---|---|
| Clean | 0.58% | baseline |
| Codec | 1.73% | +1.15% |
| Mic | 0.59% | +0.01% |
| Noise | 0.73% | +0.15% |
| Reverb | 5.88% | +5.30% |
| Playback | 8.57% | +7.99% |
All models show significant degradation on playback scenarios (codec→speaker→room→mic chains), though carrier-aware training reduces this gap substantially.
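Each category row above is the mean EER of its member protocols, and the "vs Clean" column is that mean minus the clean baseline. A small sketch of the aggregation, assuming per-protocol EERs are available as a plain dict (the protocol names here are illustrative):

```python
# Illustrative per-protocol EERs in percent; real keys come from the benchmark results.
protocol_eers = {
    "clean_clean": 0.58,
    "codec_gsm": 2.4, "codec_opus_6k": 1.1,
    "playback_room_a": 8.2, "playback_room_b": 8.9,
}

# Group by category (taken from the protocol-name prefix in this sketch).
categories: dict[str, list[float]] = {}
for protocol, eer in protocol_eers.items():
    categories.setdefault(protocol.split("_", 1)[0], []).append(eer)

clean_avg = sum(categories["clean"]) / len(categories["clean"])
for cat, eers in sorted(categories.items()):
    avg = sum(eers) / len(eers)
    print(f"{cat:10s} avg EER {avg:5.2f}%   vs clean {avg - clean_avg:+5.2f}%")
```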
The benchmark includes 24 protocols across 6 categories:
| Category | Protocols | Description |
|---|---|---|
| Clean | 1 | Baseline (clean vs clean) |
| Codec | 7 | GSM, G.711 (μ-law, A-law), Opus (6k/12k/24k), MP3 |
| Mic | 7 | Webcam, laptop, phone, headset |
| Noise | 5 | SNR 5-25 dB |
| Reverb | 1 | Simulated room acoustics |
| Playback | 3 | Full codec→speaker→room→mic chain |
Each protocol has 10,000 trials (5,000 target + 5,000 impostor).
Two metrics together describe a model's carrier robustness:
Clean EER = EER on clean_clean protocol
- Measures baseline performance under ideal conditions
- Lower is better (e.g., 0.58% is excellent)
Degradation = Absolute EER − Clean EER
- Measures robustness: how much performance is lost due to carrier effects
- Lower is better (e.g., +2.31% means minimal degradation)
- Independent of baseline—directly measures carrier susceptibility
A model with low Clean EER and low Degradation is ideal. Some models (like CASE HF) trade baseline performance for better robustness.
Note: An earlier "CASE-Score v1" metric used normalized ratios (EER_degraded / EER_clean), but this can misleadingly reward models with poor baselines. See Metrics for full details.
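To make the relationship concrete, here is a minimal sketch that computes an EER from trial scores with scikit-learn's ROC utilities and then takes the difference as Degradation. The score arrays are synthetic placeholders, and "Absolute EER" is stood in for by the EER measured on degraded trials:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal Error Rate: where the false-accept and false-reject rates cross."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Synthetic scores: 5,000 target + 5,000 impostor trials, as in each protocol.
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(5000), np.zeros(5000)])
clean_scores = np.concatenate([rng.normal(0.70, 0.10, 5000), rng.normal(0.20, 0.10, 5000)])
degraded_scores = np.concatenate([rng.normal(0.60, 0.15, 5000), rng.normal(0.25, 0.15, 5000)])

clean_eer = compute_eer(clean_scores, labels) * 100
absolute_eer = compute_eer(degraded_scores, labels) * 100
print(f"Clean EER {clean_eer:.2f}%  Absolute EER {absolute_eer:.2f}%  "
      f"Degradation {absolute_eer - clean_eer:+.2f}%")
```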
Built-in wrappers for popular models:
| Model | Install | Status |
|---|---|---|
| SpeechBrain ECAPA-TDNN | pip install case-benchmark[speechbrain] | ✅ Supported |
| WeSpeaker ResNet34/CAM++ | pip install case-benchmark[wespeaker] | ✅ Supported |
| pyannote embedding | pip install case-benchmark[pyannote] | ✅ Supported |
| NVIDIA NeMo TitaNet | pip install case-benchmark[nemo] | ✅ Supported |
| Resemblyzer | pip install case-benchmark[resemblyzer] | ✅ Supported |
| CASE HF v2-512 | pip install case-benchmark[case-hf] | ✅ Supported |
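With several wrappers installed, models can be run back to back through the same `load_model` helper used in the quick-start example. Only the `"speechbrain"` name appears earlier in this README; the other registry names below are assumed to mirror the install extras:

```python
from case_benchmark import CASEBenchmark, load_model

benchmark = CASEBenchmark("./benchmark_data")

# "speechbrain" appears in the quick-start; the other names are assumed
# to follow the install extras listed above.
for name in ["speechbrain", "wespeaker", "resemblyzer"]:
    model = load_model(name)
    results = benchmark.evaluate(model)
    print(f"=== {name} ===")
    results.print_summary()
```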
To benchmark your own model, implement the EmbeddingModel interface:
from case_benchmark.models.base import EmbeddingModel
import numpy as np
from pathlib import Path
class MyModel(EmbeddingModel):
    def load(self, device: str = "cpu") -> None:
        # Load model weights onto the requested device.
        self.model = load_my_model(device)
        self._loaded = True

    def extract_embedding(self, audio_path: Path) -> np.ndarray:
        # Return a fixed-dimensional speaker embedding for one utterance.
        audio = load_audio(audio_path)
        return self.model.encode(audio).numpy()

    @property
    def embedding_dim(self) -> int:
        return 192

    @property
    def name(self) -> str:
        return "My Custom Model"

Benchmark data:
- VoxCeleb1-O: 40 speakers, ~400 utterances (official test set)
- LibriSpeech test-clean: 40 speakers, ~392 utterances
- Total: 80 speakers across both datasets
- Sample rate: 16kHz mono
- Codecs: GSM, G.711 (μ-law, A-law), Opus (6k/12k/24k), MP3
- Microphones: Simulated FIR filters for webcam, laptop, phone, etc.
- Noise: DEMAND corpus at various SNR levels
- Reverb: Real RIRs from OpenSLR-28 + BUT ReverbDB
- Playback: Full codec→speaker→room→mic chain
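As one concrete example of how these degradations are built, additive noise is typically scaled to hit a target SNR before mixing. A generic sketch (not the benchmark's exact pipeline; file paths are placeholders):

```python
import numpy as np
import soundfile as sf

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    noise = np.resize(noise, speech.shape)          # loop or trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

speech, sr = sf.read("utterance.wav")
noise, _ = sf.read("demand_noise.wav")
sf.write("noisy_snr10.wav", mix_at_snr(speech, noise, snr_db=10.0), sr)
```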
Important: The benchmark uses different data sources than typical training pipelines to ensure proper train/eval separation.
| Component | Benchmark Source | Recommended for Training |
|---|---|---|
| Noise | DEMAND | MUSAN ✓ |
| Reverb | OpenSLR-28 + BUT ReverbDB (real RIRs) | pyroomacoustics or OpenSLR-26 (simulated) ✓ |
If you train with MUSAN noise and pyroomacoustics/OpenSLR-26 RIRs, your training data is properly separated from the benchmark. See docs/methodology.md for details.
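For training-side augmentation that stays disjoint from the benchmark, a simulated RIR can be generated with pyroomacoustics and applied by convolution; MUSAN noise can then be mixed in the same way as the SNR sketch above. Room geometry and file paths here are illustrative:

```python
import soundfile as sf
import pyroomacoustics as pra
from scipy.signal import fftconvolve

speech, sr = sf.read("train_utterance.wav")

# Simulate a small shoebox room and compute its impulse response.
room = pra.ShoeBox([5.0, 4.0, 3.0], fs=sr, materials=pra.Material(0.3), max_order=15)
room.add_source([2.5, 2.0, 1.5])
room.add_microphone([1.0, 1.0, 1.2])
room.compute_rir()
rir = room.rir[0][0]                                # mic 0, source 0

# Convolve the clean training utterance with the simulated RIR.
reverberant = fftconvolve(speech, rir)[: len(speech)]
sf.write("train_utterance_reverb.wav", reverberant, sr)
```

The benchmark data itself is downloaded separately: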
# From HuggingFace
case-benchmark download --output-dir ./benchmark_data
# Or using huggingface_hub directly
from huggingface_hub import snapshot_download
snapshot_download("bigstorm/case-benchmark", local_dir="./benchmark_data")

Documentation:
- Why This Exists - The problem this benchmark is trying to solve
- Methodology - Benchmark design and technical approach
- Protocols - Detailed protocol descriptions
- Metrics - EER, Degradation Factor, and how to compare models
- Findings - Key results and analysis
- Submission Guide - How to submit to leaderboard
If you use the CASE Benchmark in your research, please cite:
@misc{gitter2026case,
title={CASE Benchmark: Carrier-Agnostic Speaker Verification Evaluation},
author={Gitter, Ben},
year={2026},
howpublished={\url{https://github.com/gittb/case-benchmark}}
}

- Code: MIT License
- Data: CC BY-NC 4.0 (non-commercial research only; contact Ben Gitter for a commercial license)
The benchmark audio is derived from VoxCeleb and LibriSpeech, which have their own license terms.
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
- Report issues on GitHub Issues
- Submit model results via Pull Request
- VoxCeleb for source audio data (VoxCeleb1-O test set)
- LibriSpeech for source audio data (test-clean subset)
- DEMAND for noise samples used in the benchmark
- OpenSLR-28 and BUT ReverbDB for real room impulse responses