[Draft, Not for review] Sync changes from CDB_DiskANN repo #748
base: main
Conversation
- Refactored recall utilities in diskann-benchmark
- Updated tokio utilities
- Added attribute and format parser improvements in label-filter
- Updated ground_truth utilities in diskann-tools
Pull request overview
Syncs changes aimed at integrating label-filtered search into the benchmarking/tooling flow, primarily by enhancing ground-truth label evaluation and aligning benchmark recall logic with diskann-benchmark-core.
Changes:
- Extend `diskann-tools` ground-truth label filtering to handle array-valued label structures and add diagnostics/logging.
- Refactor `diskann-benchmark` recall utilities to delegate to `diskann-benchmark-core`.
- Adjust `diskann-label-filter` JSON flattening and minor formatting/import changes; update `diskann-tools` dependencies.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| diskann-tools/src/utils/ground_truth.rs | Adds array-expansion support for label evaluation and extra tracing diagnostics during bitmap + ground-truth generation. |
| diskann-tools/Cargo.toml | Updates dependencies (adds serde_json, removes some unused deps) and adjusts dev-deps/features. |
| diskann-label-filter/src/utils/flatten_utils.rs | Modifies how arrays are flattened into pointer paths. |
| diskann-label-filter/src/parser/format.rs | Minor formatting-only change. |
| diskann-label-filter/src/attribute.rs | Adds an import (currently unused). |
| diskann-benchmark/src/utils/tokio.rs | Tweaks doc comment and removes unit tests. |
| diskann-benchmark/src/utils/recall.rs | Replaces local recall/AP implementations with diskann-benchmark-core wrappers/types. |
```toml
[package]
name = "diskann-tools"
edition = "2021"
version.workspace = true
authors.workspace = true
description.workspace = true
documentation.workspace = true
license.workspace = true
```
Copilot AI (Feb 10, 2026):
This crate no longer declares `license.workspace = true`, while the rest of the workspace crates do. This makes the crate metadata inconsistent and can break workspace-wide tooling (e.g., publish checks) that expects license metadata to be inherited from the workspace. Re-add `license.workspace = true` under `[package]` for consistency.
```rust
results: &dyn benchmark_core::recall::Rows<T>,
groundtruth: &dyn benchmark_core::recall::Rows<T>,
recall_k: &[usize],
recall_n: &[usize],
```
Copilot AI (Feb 10, 2026):
`compute_multiple_recalls` changed from taking `StridedView` + an `enhanced_metrics` flag to taking `&dyn benchmark_core::recall::Rows<T>` and no longer accepts `enhanced_metrics`. There are existing call sites in `diskann-benchmark/src/backend/exhaustive/*` that still call `compute_multiple_recalls(..., false)` and pass `StridedView` conversions, which will not compile. Either update those call sites in this PR, or provide a backwards-compatible wrapper/overload that preserves the old signature and adapts into `benchmark_core::recall::Rows`.
Suggested change:
```rust
recall_n: &[usize],
_enhanced_metrics: bool,
```
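If rewriting every `StridedView` call site is undesirable, another option is an adapter that implements the new row abstraction for the old view type. The sketch below is illustrative only: the `Rows` trait shape and the `StridedView` fields are stand-ins, not the actual `diskann-benchmark-core` definitions.

```rust
/// Illustrative stand-in for `benchmark_core::recall::Rows<T>`; the real
/// trait in diskann-benchmark-core may expose a different surface.
pub trait Rows<T> {
    fn num_rows(&self) -> usize;
    fn row(&self, i: usize) -> &[T];
}

/// Illustrative strided view over a flat buffer (hypothetical fields).
pub struct StridedView<'a, T> {
    data: &'a [T],
    stride: usize,
}

impl<'a, T> Rows<T> for StridedView<'a, T> {
    fn num_rows(&self) -> usize {
        self.data.len() / self.stride
    }

    fn row(&self, i: usize) -> &[T] {
        // Each logical row occupies one stride-sized window of the buffer.
        &self.data[i * self.stride..(i + 1) * self.stride]
    }
}
```

With an impl along these lines, the exhaustive-backend call sites could pass `&view` wherever `&dyn Rows<T>` is expected instead of converting eagerly.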
```rust
            average_precision: m.average_precision,
        }
    }
}
```
Copilot AI (Feb 10, 2026):
This module removed `APMetrics` and `compute_average_precision`, but there are existing call sites (e.g. `diskann-benchmark/src/backend/index/range_search.rs` and `.../result.rs`) that still reference `utils::recall::compute_average_precision` and `utils::recall::APMetrics`. As-is, the benchmark crate will not compile. Either restore these items as thin wrappers/type aliases around `benchmark_core::recall` equivalents, or update the call sites in this PR to use the new `AveragePrecisionMetrics` API.
Suggested change:
```rust
    }
}

/// Backwards-compatible alias for the core average precision metrics.
/// This keeps existing `utils::recall::APMetrics` call sites compiling.
pub(crate) type APMetrics = benchmark_core::recall::AveragePrecisionMetrics;

/// Re-export the core `compute_average_precision` function so existing
/// `utils::recall::compute_average_precision` call sites continue to work.
pub(crate) use benchmark_core::recall::compute_average_precision;
```
```rust
// Print per-query match counts
for (i, bitmap) in query_bitmaps.iter().enumerate() {
    if i < 10 || bitmap.is_empty() {
        tracing::debug!(
            "Query {}: {} base vectors matched the filter",
            i,
            bitmap.len()
        );
    }
}
```
Copilot AI (Feb 10, 2026):
The per-query debug logging loop will emit a `debug!` line for every query with an empty bitmap (`if i < 10 || bitmap.is_empty()`), which can explode log volume when filters are selective or misconfigured (especially if debug logging is enabled). Consider limiting this to the first N queries only, or logging aggregate statistics plus a small sample of empty/non-empty queries.
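One way to bound the volume along those lines is to log only a fixed prefix of queries and fold the empty-bitmap information into a single aggregate line. A minimal sketch against the loop above (the cap constant and summary wording are illustrative):

```rust
const MAX_LOGGED_QUERIES: usize = 10;

let mut empty_count = 0usize;
for (i, bitmap) in query_bitmaps.iter().enumerate() {
    if bitmap.is_empty() {
        empty_count += 1;
    }
    // Bound per-query logging regardless of how many bitmaps are empty.
    if i < MAX_LOGGED_QUERIES {
        tracing::debug!(
            "Query {}: {} base vectors matched the filter",
            i,
            bitmap.len()
        );
    }
}
// One summary line replaces a potentially unbounded stream of empty-bitmap lines.
if empty_count > 0 {
    tracing::warn!(
        "{} of {} queries matched no base vectors",
        empty_count,
        query_bitmaps.len()
    );
}
```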
```rust
/// Expands a JSON object with array-valued fields into multiple objects with scalar values.
/// For example: {"country": ["AU", "NZ"], "year": 2007}
/// becomes: [{"country": "AU", "year": 2007}, {"country": "NZ", "year": 2007}]
///
/// If multiple fields have arrays, all combinations are generated.
fn expand_array_fields(value: &Value) -> Vec<Value> {
    match value {
        Value::Object(map) => {
            // Start with a single empty object
            let mut results: Vec<Map<String, Value>> = vec![Map::new()];

            for (key, val) in map.iter() {
                if let Value::Array(arr) = val {
                    // Expand: for each existing result, create copies for each array element
                    let mut new_results: Vec<Map<String, Value>> = Vec::new();
                    for existing in results.iter() {
                        for item in arr.iter() {
                            let mut new_map: Map<String, Value> = existing.clone();
                            new_map.insert(key.clone(), item.clone());
                            new_results.push(new_map);
                        }
                    }
                    // If array is empty, keep existing results without this key
                    if !arr.is_empty() {
                        results = new_results;
                    }
                } else {
                    // Non-array field: add to all existing results
                    for existing in results.iter_mut() {
                        existing.insert(key.clone(), val.clone());
                    }
                }
            }

            results.into_iter().map(Value::Object).collect()
        }
        // If not an object, return as-is
        _ => vec![value.clone()],
    }
}

/// Evaluates a query expression against a label, expanding array fields first.
/// Returns true if any expanded variant matches the query.
fn eval_query_with_array_expansion(query_expr: &ASTExpr, label: &Value) -> bool {
    let expanded = expand_array_fields(label);
    expanded.iter().any(|item| eval_query_expr(query_expr, item))
}
```
Copilot AI (Feb 10, 2026):
`expand_array_fields` clones the full JSON object and generates a Cartesian product of array-valued fields, and it is invoked inside the inner loop over `base_labels` for every query. This can become extremely expensive (exponential in the number/size of array fields, multiplied by `num_queries * num_base_labels`). Consider avoiding full expansion/allocation by evaluating the predicate directly against arrays (e.g., treat equality as "any element matches" for array-valued fields), or precomputing any needed expansions once per base label rather than per query evaluation.
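For plain equality predicates, the expansion can often be avoided entirely by giving array-valued fields "any element matches" semantics at comparison time. A minimal sketch of such a helper (the function name is hypothetical; wiring it into `eval_query_expr` would replace the call to `eval_query_with_array_expansion`):

```rust
use serde_json::Value;

/// Hypothetical comparison helper: an array-valued label field matches a
/// literal if any element equals it, so no Cartesian-product expansion or
/// object cloning is needed.
fn field_matches(field_value: &Value, literal: &Value) -> bool {
    match field_value {
        Value::Array(items) => items.iter().any(|item| item == literal),
        other => other == literal,
    }
}
```

This keeps evaluation linear in the total number of array elements per label instead of exponential in the number of array-valued fields.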
```rust
use std::fmt::Display;
use std::hash::{Hash, Hasher};
use std::io::Write;
```
Copilot AI (Feb 10, 2026):
`use std::io::Write;` is unused in this module (no `Write` methods are called; the `write_u8`/`write_i64`/etc. calls are on `std::hash::Hasher`). This will trigger `unused_imports` warnings under typical lint settings; remove the import.
Suggested change: delete the `use std::io::Write;` line.
```diff
@@ -18,21 +18,3 @@ pub(crate) fn block_on<F: std::future::Future>(future: F) -> F::Output {
         .expect("current thread runtime initialization failed")
         .block_on(future)
 }
```
Copilot AI (Feb 10, 2026):
The runtime helper previously had a unit test validating that `runtime(num_threads)` creates the expected number of workers, but the entire test module was removed. If the old test was removed because tokio runtime metrics are unstable/unsupported in this build, consider replacing it with a stable assertion (or moving coverage to an integration test) so `runtime()`/`block_on()` behavior remains exercised by CI.
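If runtime metrics are the blocker, a replacement test could avoid them entirely and just assert that the helper drives a future to completion. A minimal sketch (test name illustrative):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn block_on_drives_future_to_completion() {
        // No runtime-metrics APIs involved: just verify the helper
        // executes an async block and returns its output.
        let value = block_on(async { 21 * 2 });
        assert_eq!(value, 42);
    }
}
```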
```rust
for (query_idx, npq) in ground_truth.iter().enumerate() {
    let neighbors: Vec<_> = npq.iter().collect();
    let neighbor_count = neighbors.len();

    if query_idx < 10 {
        // Print top K IDs and distances for first 10 queries
        let top_ids: Vec<u32> = neighbors.iter().take(10).map(|n| n.id).collect();
        let top_dists: Vec<f32> = neighbors.iter().take(10).map(|n| n.distance).collect();
        tracing::debug!(
```
Copilot AI (Feb 10, 2026):
The added ground-truth debug section allocates and clones neighbor data for every query (`let neighbors: Vec<_> = npq.iter().collect();`) even though detailed logging is only done for the first 10 queries. For large benchmarks this adds avoidable overhead. Prefer `npq.iter().count()`/`npq.len()` for counts, and only collect the top-N neighbors for the small set of queries you actually log.
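A cheaper shape would count without materializing and only allocate the bounded top-10 vectors for the logged prefix. A minimal sketch, assuming `npq.iter()` can be called more than once (otherwise a single bounded collect would be needed):

```rust
for (query_idx, npq) in ground_truth.iter().enumerate() {
    // Counting does not require cloning every neighbor into a Vec.
    let neighbor_count = npq.iter().count();

    if query_idx < 10 {
        // Only the first 10 queries pay for the bounded top-10 allocation.
        let top_ids: Vec<u32> = npq.iter().take(10).map(|n| n.id).collect();
        let top_dists: Vec<f32> = npq.iter().take(10).map(|n| n.distance).collect();
        tracing::debug!(
            "Query {}: {} neighbors, top ids {:?}, top distances {:?}",
            query_idx,
            neighbor_count,
            top_ids,
            top_dists
        );
    }
}
```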
```diff
 Value::Array(arr) => {
     for (i, item) in arr.iter().enumerate() {
-        flatten_recursive(item, stack.push(&i, separator), out, separator);
+        flatten_recursive(item, stack.push(&String::from(""), separator), out, separator);
```
Copilot AI (Feb 10, 2026):
Array flattening is now pushing an empty string segment for every array element, which drops the index from the generated path (e.g., `"/tags/0"` becomes something like `"/tags/"`) and can also create duplicate keys. This breaks the JSON Pointer semantics documented in this file and will fail the existing tests that assert `"/tags/0"` and `"/tags/1"`. Use the numeric index (`i`) when building JSON Pointer paths, or implement the existing `include_array_indices` config flag (currently unused) to conditionally include the index.
Suggested change:
```rust
let index = i.to_string();
flatten_recursive(item, stack.push(&index, separator), out, separator);
```
Integrating filter search with diskann-benchmark.