---

**`rfcs/0000-template.md`** (new file, +71 lines)

---
# RFC Title

<!-- Replace "RFC Title" with a short, descriptive title for your proposal. -->

<!--
This template is a recommended starting point. Feel free to add, remove, or
reorganize sections to best convey your proposal.
-->

| | |
|---|---|
| **Status** | Draft <!-- Draft, InReview, Accepted, Rejected --> |
| **Authors** | Your Name |
| **Contributors** | <!-- Others who contributed to this RFC --> |
| **Created** | YYYY-MM-DD |
| **Updated** | YYYY-MM-DD |

## Summary

<!-- One paragraph describing the proposal at a high level. What are you building and why? -->

## Motivation

### Background

<!-- Context a reader needs to understand the problem. What exists today? -->

### Problem Statement

<!-- What specific problem does this RFC solve? Include quantitative data if available. -->

### Goals

<!-- Numbered list of concrete, measurable goals. -->

1. Goal one
2. Goal two

## Proposal

<!-- The core technical proposal. Include:
- Type definitions (structs, traits, type aliases)
- API signatures
- Module structure
- Algorithms / pseudocode where helpful

Use Rust code blocks for type/API definitions. -->

## Trade-offs

<!-- Describe the key design trade-offs and alternative approaches considered.
For each alternative, explain what it is, its pros/cons, and why it was
not chosen (or under what conditions it might be preferred). -->

## Benchmark Results

<!-- If applicable, include performance measurements.
State the configuration (hardware, dataset, parameters) and present results in tables. -->

## Future Work

<!-- Items explicitly deferred from this RFC. Use a checkbox list. -->

- [ ] Future item one
- [ ] Future item two

## References

<!-- Links to papers, prior art, related issues, or PRs. -->

1. [Reference title](URL)
---

**`rfcs/0001-multi-vector-distance-functions.md`** (new file, +299 lines)

---
# Multi-Vector Distance Functions

| | |
|---|---|
| **Status** | InReview |
| **Authors** | Suryansh Gupta |
| **Contributors** | Suryansh Gupta, Mark Hildebrand |
| **Created** | 2026-01-06 |
| **Updated** | 2026-02-06 |

## Summary

This RFC proposes a high-performance Chamfer distance implementation for multi-vector (ColBERT-style late interaction) representations in DiskANN. The design uses a **query-transposed tiling** approach that transposes queries into a block layout while keeping documents in row-major format, achieving up to **2.67x** speedup over the SIMD baseline. The implementation builds on existing types from `diskann-quantization` (`Mat`, `MaxSim`, `Chamfer`) and implements `DistanceFunctionMut` from `diskann-vector` for ecosystem compatibility.

## Motivation

### Background

Traditional vector search represents each document as a single embedding. Multi-vector representations (used in models like ColBERT) encode each document/query as a **bag of embeddings** — typically one per token. This enables:

- **Fine-grained matching**: Token-level similarity captures nuanced semantic relationships
- **Late interaction**: Document embeddings are pre-computed; only lightweight aggregation at query time
- **Better recall**: If any query token matches any document token well, the document scores high

### Problem Statement

Chamfer distance for multi-vector search requires O(Q × D × Dim) operations per query-document pair, where:

- Q = number of query tokens
- D = number of document tokens
- Dim = embedding dimensionality

For typical configurations (Q=32, D=128, Dim=384), this is ~1.5M floating-point operations per pair. Naive implementations become a bottleneck for large-scale search.
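As a sanity check on the figure above, counting one multiply-accumulate per (query token, document token, dimension) triple:

```rust
fn main() {
    let (q, d, dim): (u64, u64, u64) = (32, 128, 384);
    // One multiply-accumulate per (query token, doc token, dimension) triple.
    let macs = q * d * dim;
    assert_eq!(macs, 1_572_864); // ~1.5M MACs per query-document pair
}
```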

### Goals

1. Implement high-performance Chamfer distance starting with `f32` embeddings, with future support for `f16` and `u8` types
2. Achieve 2x+ speedup over baseline SIMD through memory layout optimization
3. Maintain compatibility with DiskANN's `DistanceFunctionMut` trait
4. Provide a clean API that enables standalone distance function usage without full index integration
5. Achieve performance within 10–20% of `faer` SGEMM-based Chamfer computation, when both our implementation and `faer` are restricted to AVX2 (no AVX-512 on either side)
> **Reviewer:** Why not use AVX-512 if available? We have support, and it could likely really benefit dense operations like this.
>
> **Author:** I would like to park this for now. We can pick it up in a follow-up with a more thorough design for dynamic dispatch that also takes the machine architecture into account.
>
> **Reviewer:** Deferring is fine, but I don't think there needs to be a super thorough design planning/review; we already have the infrastructure for dynamic dispatch and micro-architecture-specific code generation.
>
> **Author:** Got it. What's your suggestion? Should we include the brief design around dynamic dispatch in this RFC itself? I was thinking we could work on it in separate PRs and RFCs if required in the future.


## Proposal

### Approach: Query-Transposed Tiling

We propose the **query-transposed tiling** approach as the primary Chamfer distance implementation for DiskANN integration. This approach transposes the query into a block-transposed layout, keeps documents in row-major format, and processes pairs of document vectors together to amortize query memory loads. A pre-allocated scratch buffer tracks per-query max similarities and is reused across distance calls.

This is the recommended default because it preserves the existing document storage format (no index migration), while still achieving significant speedups through SIMD tiling.

### Chamfer Distance Definition

For query multi-vector Q and document multi-vector D:

```
Chamfer(Q, D) = Σᵢ minⱼ -IP(qᵢ, dⱼ)
```

Since `InnerProduct::evaluate` in `diskann-vector` returns the negated inner product (`-IP`), the kernel finds the minimum negated IP per query vector (equivalent to finding the maximum similarity), then sums across all query vectors. The result is a distance compatible with DiskANN's min-heap.
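For concreteness, the definition above can be written as a scalar reference implementation. This is an illustrative sketch over plain slices, not the DiskANN API; the function name is hypothetical:

```rust
/// Naive scalar Chamfer distance over row-major multi-vectors.
/// Each inner value is the negated inner product, matching the sign
/// convention of `InnerProduct::evaluate` described above.
fn chamfer_naive(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| {
                    // -IP(q, d): negated dot product of two token embeddings
                    -q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>()
                })
                .fold(f32::MAX, f32::min) // min over document tokens = best match
        })
        .sum() // sum over query tokens
}
```

A lower (more negative) result means a better match, which is what makes the value directly usable with DiskANN's min-heap.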

### Types

The design builds on the multi-vector matrix types already defined in `diskann_quantization::multi_vector`:

#### Query and Document (from `diskann-quantization`)

```rust
use diskann_quantization::multi_vector::{Mat, MatRef, Standard, QueryMatRef};

/// Owning row-major matrix: rows = tokens, cols = dimensions
type MultiVector = Mat<Standard<f32>>;

/// Borrowed view into a multi-vector
type MultiVectorRef<'a> = MatRef<'a, Standard<f32>>;
```

`Standard<f32>` provides contiguous row-major storage with `as_slice()` for BLAS compatibility and zero-copy views via `MatRef`. `QueryMatRef` (a newtype over `MatRef`) distinguishes query from document matrices for asymmetric distance functions.

#### TransposedMultiVector (new type)

A block-transposed representation for SIMD-optimized inner product computation. The full transposed type will be added in separate PRs; the following is a simplified depiction.

This will implement a `Repr` trait similar to `Standard<T>` defined in `diskann_quantization::multi_vector`, backed by a block-transposed layout analogous to `BlockTranspose<16>`:

```rust
/// Block-transposed layout for SIMD-parallel inner products.
/// Groups 16 vectors and stores dimensions contiguously:
///
/// Block 0: [v0_d0..v15_d0], [v0_d1..v15_d1], ...
/// Block 1: [v16_d0..v31_d0], ...
///
/// This enables loading 8 values (f32x8) from 8 different vectors
/// in a single SIMD load.
pub struct TransposedMultiVector { ... }

impl TransposedMultiVector {
pub fn from_view(view: MatRef<'_, Standard<f32>>) -> Self;
pub fn num_vectors(&self) -> usize;
pub fn vector_dim(&self) -> usize;
}

impl From<&MultiVector> for TransposedMultiVector { ... }
```
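To make the layout concrete, here is a hypothetical sketch of how such a block transpose can be computed, assuming a block size of 16 as in `BlockTranspose<16>`; the function name and slice-based signature are illustrative, not the crate's API:

```rust
const BLOCK: usize = 16;

/// Rearranges row-major vectors into a block-transposed buffer: vectors are
/// grouped 16 at a time, and within a block each dimension's 16 values are
/// stored contiguously. Tail blocks are zero-padded.
fn block_transpose(rows: &[Vec<f32>], dim: usize) -> Vec<f32> {
    let num_blocks = (rows.len() + BLOCK - 1) / BLOCK;
    let mut out = vec![0.0f32; num_blocks * dim * BLOCK];
    for (v, row) in rows.iter().enumerate() {
        let (block, lane) = (v / BLOCK, v % BLOCK);
        for d in 0..dim {
            // Element (vector v, dimension d) lands at:
            //   block * (dim * BLOCK) + d * BLOCK + lane
            out[block * dim * BLOCK + d * BLOCK + lane] = row[d];
        }
    }
    out
}
```

With this layout, one `f32x8` load reads the same dimension from 8 consecutive vectors, which is what enables the SIMD-parallel inner products described above.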

#### MaxSim (from `diskann-quantization`)

The existing `MaxSim` type in `diskann_quantization::multi_vector::distance` provides a mutable buffer for per-query-vector maximum similarities. The query-transposed tiling approach uses an analogous scratch pattern internally:

```rust
/// Per-query-vector maximum similarity scores.
pub struct MaxSim<'a> {
scores: &'a mut [f32],
}
```

#### Chamfer (from `diskann-quantization`)

The existing unit type `Chamfer` in `diskann_quantization::multi_vector::distance` identifies the Chamfer distance function:

```rust
/// Unit type identifying the Chamfer (MaxSim) distance function.
pub struct Chamfer;

impl PureDistanceFunction for Chamfer { ... }
```

`Chamfer` implements `PureDistanceFunction` from `diskann-vector`, enabling it to be used as a distance identifier throughout the DiskANN ecosystem without carrying computation state.

### Algorithm

The query-transposed tiling approach:

1. Transpose the query into block-transposed layout (once, before distance calls)
2. For each pair of document vectors `(d1, d2)`:
- Load each query block once, compute inner products against both documents simultaneously
- Update the per-query minimum negated IPs in the `MaxSim` struct's score array
3. Handle remainder if document count is odd
4. Sum scores to produce the final Chamfer distance

```
// `scores` corresponds to the scores array inside MaxSim
// InnerProduct::evaluate returns -IP(q, d), so we track min (= best match)
scores = [f32::MAX; num_queries] // reused across calls

for i in (0..num_docs step 2):
d1, d2 = doc[i], doc[i+1]
for each query block:
// Load query block ONCE, use for BOTH documents
for each query vector q:
scores[q] = min(scores[q], -IP(q, d1))
scores[q] = min(scores[q], -IP(q, d2))

if num_docs is odd:
for each query vector q:
scores[q] = min(scores[q], -IP(q, last_doc))

score = sum(scores[i] for i in 0..num_queries)
```
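The pseudocode above can be sketched in plain scalar (non-SIMD) Rust. The helper names and slice-based signatures are illustrative, not the proposed API; `scores` plays the role of the reusable `MaxSim` buffer:

```rust
/// Negated inner product of two token embeddings, matching the sign
/// convention of `InnerProduct::evaluate`.
fn neg_ip(q: &[f32], d: &[f32]) -> f32 {
    -q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>()
}

/// Scalar sketch of the paired-document-vector loop: each query vector is
/// read once per pair of document vectors, and `scores` is reused across
/// calls (zero allocation on the hot path).
fn chamfer_paired(query: &[Vec<f32>], doc: &[Vec<f32>], scores: &mut [f32]) -> f32 {
    scores.fill(f32::MAX);
    let mut pairs = doc.chunks_exact(2);
    for pair in &mut pairs {
        let (d1, d2) = (&pair[0], &pair[1]);
        for (qi, q) in query.iter().enumerate() {
            // One pass over q serves BOTH document vectors.
            scores[qi] = scores[qi].min(neg_ip(q, d1));
            scores[qi] = scores[qi].min(neg_ip(q, d2));
        }
    }
    // Remainder: odd document count leaves one unpaired vector.
    if let [last] = pairs.remainder() {
        for (qi, q) in query.iter().enumerate() {
            scores[qi] = scores[qi].min(neg_ip(q, last));
        }
    }
    scores.iter().sum()
}
```

The real kernel replaces the inner `neg_ip` calls with block-transposed SIMD loads, but the control structure (pairing, remainder handling, score reuse) is the same.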

**Key optimizations**:

- Query blocks loaded once, reused for 2 document vectors — reduces memory bandwidth by ~50%
- Scratch buffer stored in `MaxSim` struct and reused across distance calls — zero allocation on hot path
- 4 SIMD accumulators per document vector to hide FMA latency
- Uses `SIMDMinMax` from `diskann-wide` for hardware-accelerated SIMD reductions

### Trait Implementation

`MaxSim` implements `DistanceFunctionMut` to populate its per-query-vector max similarities. The caller then sums the scores to produce the Chamfer distance. We propose adding a `chamfer_score()` convenience method on `MaxSim` for this:

```rust
// DistanceFunctionMut populates the MaxSim scores buffer with per-query max similarities
impl DistanceFunctionMut<&TransposedMultiVector, MatRef<'_, Standard<f32>>, Result<(), MaxSimError>> for MaxSim<'_> {
fn evaluate(&mut self, query: &TransposedMultiVector, doc: MatRef<'_, Standard<f32>>) -> Result<(), MaxSimError>;
}

impl MaxSim<'_> {
/// Sums the per-query min negated IPs to produce the Chamfer distance.
/// Call after `evaluate()` has populated the scores buffer.
///
/// Returns: Σᵢ scores[i] (each score is already minⱼ -IP(qᵢ, dⱼ))
pub fn chamfer_score(&self) -> f32 {
self.scores.iter().sum()
}

/// Resets all scores to `f32::MAX` for the next document comparison.
pub fn reset(&mut self) {
self.scores.fill(f32::MAX);
}
}
```

**Thread safety**: `MaxSim` holds `&mut [f32]`, so a `DistanceFunctionMut` implementation borrowing it is `!Sync` by construction. For concurrent search, each thread should own its own `MaxSim` buffer.

### API Usage

The end-to-end flow converts a row-major query into block-transposed layout, then uses `MaxSim` for repeated distance computations against document vectors:

```rust
use diskann_quantization::multi_vector::{Mat, MatRef, Standard};
use diskann_quantization::multi_vector::distance::MaxSim;
use diskann_vector::DistanceFunctionMut;

// 1. Query arrives as row-major multi-vector (e.g., 32 tokens × 384 dims)
let query: Mat<Standard<f32>> = /* from encoder/upstream layers */;

// 2. Transpose query into SIMD-friendly block layout (once per query)
let transposed_query = TransposedMultiVector::from_view(query.as_ref());

// 3. Create MaxSim buffer (reused across all document comparisons for this query)
let mut buffer = vec![f32::MAX; query.num_vectors()];
let mut max_sim = MaxSim::new(&mut buffer).unwrap();

// 4. For each candidate document:
for doc in candidate_documents {
// Populate per-query-vector max similarities
max_sim.evaluate(&transposed_query, doc.as_ref());

// Sum to get final Chamfer distance
let score = max_sim.chamfer_score();

// Reset for next document
max_sim.reset();
}
```

The `TransposedMultiVector::from_view` conversion is O(Q × Dim) — a single pass that rearranges the query into block-transposed layout. This cost is amortized over all document comparisons for that query, making it negligible in practice (typically thousands of comparisons per query).

## Trade-offs

### Alternative Approaches

The experimental crate explored six approaches total. The query-transposed tiling approach was selected as the proposal, but the alternatives remain available and may be better for specific workloads.
> **Reviewer:** One thing not covered here is how this behaves under multi-threading. It's possible that single-threaded tests look fine, but if no effort is applied to restricting working-set sizes to the size of L1/L2 as much as possible, then multiple threads could stomp on one another's usage of L3 when running multiple computations in parallel.
>
> **Author:** That is a great point. While the distance kernel itself is single-threaded, I agree that minimizing the working set is crucial to prevent cache thrashing when multiple threads run parallel searches (e.g., in a Rayon thread pool). The query-transposed tiling approach tackles this fairly well:
>
> 1. **Reduced bandwidth pressure:** By holding a query block in vector registers (or L1) and iterating over multiple document vectors, we maximize arithmetic intensity. This reduces the frequency of fetches from L3/memory compared to the baseline, leaving more bandwidth available for other threads.
> 2. **Small working set:** The scratch buffer (`MaxSim`) and the query tiles are designed to stay hot in L1/L2. The only data streaming through L3 is the document vectors themselves, which is unavoidable.
>
> That said, theoretical scaling often differs from practice. I can add a throughput benchmark (running the kernel on all available cores simultaneously) to verify that we don't see sub-linear scaling due to resource contention. Does that sound like sufficient verification?


#### Document-Transposed Approaches

Instead of transposing the query, documents can be block-transposed at index time. This is implemented as `TransposedApproach` and `TransposedWithTilingApproach`.

| Aspect | Query-Transposed (Proposed) | Document-Transposed |
|--------|---------------------------|---------------------|
| **Document Layout** | Row-major (no change) | Block-transposed |
| **Query Layout** | Block-transposed (once per query) | Row-major |
| **Index Migration** | None | Required |
| **Hot Path Allocation** | None (scratch reused) | None |
| **Best For** | Many query tokens (≥16) | Few query tokens (≤8) |

**Backwards Compatibility**: Locking documents into a transposed layout creates backward compatibility obligations — if we later discover a better layout (e.g., for AVX-512), we're stuck supporting the legacy format. Query transpositions are ephemeral and not persisted, so the query-side layout can evolve freely.

#### SGEMM Approach

Uses BLAS matrix multiplication to compute the full Q×D similarity matrix, then SIMD row-max reduction. Dominates at large scale (≥32×128) with up to 2.16x speedup, but materializes the full similarity matrix (Q×D×4 bytes). Custom tiling approaches fuse max-reduction with dot product, avoiding this materialization, which makes them faster for small/medium Q×D.
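The materialize-then-reduce structure of this approach can be sketched in plain Rust, with nested loops standing in for the BLAS call; this is an illustrative sketch, not the `faer`-backed implementation:

```rust
/// SGEMM-style Chamfer: materialize the full Q x D matrix of negated inner
/// products (Q * D * 4 bytes), then reduce each row by min and sum.
fn chamfer_via_matrix(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    let (nq, nd) = (query.len(), doc.len());
    // Step 1: materialize the similarity matrix (this is the extra memory
    // cost that the fused tiling approaches avoid).
    let mut sim = vec![0.0f32; nq * nd];
    for (i, q) in query.iter().enumerate() {
        for (j, d) in doc.iter().enumerate() {
            sim[i * nd + j] = -q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>();
        }
    }
    // Step 2: per-row min reduction, then sum across query tokens.
    sim.chunks(nd)
        .map(|row| row.iter().cloned().fold(f32::MAX, f32::min))
        .sum()
}
```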

#### Baseline Approaches

- **NaiveApproach**: Scalar O(n²) implementation for correctness verification
- **SimdApproach**: SIMD-accelerated inner product via `diskann_vector::InnerProduct`, iterating documents sequentially (1.0x baseline)

## Benchmark Results

**Machine:** Intel Core i7-1365U, AVX2 supported, AVX-512 not supported, 32 GB RAM

Median over 50 measurements. Each measurement computes 10 consecutive distance evaluations across 100 points.

### Speedup vs SIMD Baseline (Median, Lower Latency = Better)

| Configuration | SIMD (µs) | transposed_simd | transposed_tiling | query_transposed_tiling | sgemm |
|--------------|-----------|-----------------|-------------------|------------------------|-------|
| dim=128, doc=32, query=8 | 2,237 | 1.34x | **1.75x** | 1.05x | 1.13x |
| dim=128, doc=64, query=16 | 9,224 | 1.42x | 2.07x | **2.35x** | 1.48x |
| dim=128, doc=128, query=32 | 47,882 | 1.32x | 1.86x | **2.64x** | 1.88x |
| dim=256, doc=32, query=8 | 4,654 | 1.26x | **1.69x** | 1.13x | 0.96x |
| dim=256, doc=64, query=16 | 25,809 | 1.56x | 1.94x | **2.40x** | 1.87x |
| dim=256, doc=128, query=32 | 101,093 | 1.41x | 1.71x | **2.67x** | 1.96x |
| dim=256, doc=16, query=32 | 8,239 | 1.22x | 1.77x | **2.02x** | 1.57x |
| dim=384, doc=32, query=8 | 8,412 | 1.41x | **1.65x** | 1.30x | 1.24x |
| dim=384, doc=64, query=16 | 38,162 | 1.30x | 1.47x | **1.70x** | 1.66x |
| dim=384, doc=128, query=32 | 171,431 | 1.53x | 1.94x | 2.04x | **2.16x** |
| **Average** | — | **1.38x** | **1.79x** | **1.93x** | **1.59x** |
> **Reviewer:** Can we also see situations where there are, say, 900+ documents?
>
> **Author:** I think this is a bit far from a real use case because the Chamfer distance is very costly, but I can still benchmark it if you would like. Let me know if you want me to do that.


### Analysis

- **query_transposed_tiling** is the best overall approach (avg **1.93x**), winning 7 of 10 configurations
- **transposed_tiling** wins for small query counts (≤8 tokens) with up to **1.75x** speedup
- **sgemm** catches up at large scale (dim=384, 32×128) where BLAS cache-blocking dominates (**2.16x**)
- **query_transposed_tiling** exceeds the **2x speedup** goal in nearly all configurations with ≥16 query tokens, the exception being dim=384 with 64 document tokens (1.70x)

## Future Work

- [ ] **FFI types**: View types for `MultiVector` / `TransposedMultiVector` for C/C++ callers
- [ ] **Integration into `diskann` crate**: Graduate from `experimental-` prefix into the main library with a clean API and appropriate module structure
- [ ] **Quantized types**: `f16`, `u8` support for memory efficiency
- [ ] **AVX-512 support**: Larger registers could enable tile size 4 (processing 4 queries simultaneously)
- [ ] **SIMD-accelerated horizontal MinMax**: Hardware-accelerated horizontal min/max reductions across SIMD lanes for faster per-query score aggregation

## References

1. [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction](https://arxiv.org/abs/2004.12832)
2. [Chamfer Distance](https://en.wikipedia.org/wiki/Chamfer_distance)
3. Experimental implementation: [experimental-multi-vector-bench crate (PR #730)](https://github.com/microsoft/DiskANN/pull/730)