Remove centroid from most PQ interfaces.#1010

Open
hildebrandmw wants to merge 3 commits into main from mhildebr/remove-centroid

@hildebrandmw
Contributor

Remove the "centroid" from most PQ interfaces. This centroid was the dataset centroid, used to shift a vector before compressing it or populating the distance lookup table. It causes two problems:

  1. This kind of shifting only works when computing L2 distances: it does not preserve inner products or cosine similarity. That makes it a bit of a foot-gun. The shift was already being skipped when those distances were computed, which led to a confusing situation where a centroid could be passed to the FixedChunkPQTable constructor and then ignored.

  2. When L2 distances are used, the centroid can be applied implicitly by adding it to the PQ pivots rather than subtracting it from each vector that gets processed. That way, it is always applied.
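
The two points above can be checked numerically. The following is a minimal sketch, not code from this PR: the helper names `l2_squared`, `inner_product`, and `shifted` are hypothetical, and vectors are plain `f32` slices. It shows that subtracting the same centroid from both operands leaves the squared L2 distance unchanged but alters the inner product.

```rust
/// Squared Euclidean (L2) distance between two equal-length vectors.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Inner product of two equal-length vectors.
fn inner_product(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Returns `v - c` element-wise, i.e. the vector shifted by the centroid `c`.
/// (Hypothetical helper for illustration only.)
fn shifted(v: &[f32], c: &[f32]) -> Vec<f32> {
    v.iter().zip(c).map(|(x, y)| x - y).collect()
}
```

With q = [1, 2], p = [3, 5], and c = [0.5, 1], the squared L2 distance is 13 both before and after the shift, while the inner product changes from 13 to 5.25. Reading `shifted(p, c)` as a pivot trained on centered data, p itself is that pivot with the centroid folded back in, which is why comparing a raw query against folded pivots gives the same L2 distance as comparing a centered query against the legacy pivots.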

Centering may have a small effect on training by conditioning the data a little better, but when we've tested this empirically in the past, it has not made a difference.

Carrying around this centroid is making it awkward to cleanly move PQ to diskann-quantization, so this PR takes the first steps towards removing it.

Logical Flow/Review Order

  • diskann-providers/src/model/pq/generate_pivot_arguments.rs: translate_to_center is removed from GeneratePivotArguments.

  • diskann-providers/src/model/pq/pq_construction.rs: The function generate_pq_pivots gains a legacy_center_data boolean argument to preserve the behavior of diskann-disk. This is necessary because tests in diskann-disk check against hard-coded pivots. Not centering the data results in different PQ codes due to slightly different floating point calculations, breaking these tests.

    Data translation is completely removed from the _membuf APIs.

  • diskann-providers/src/model/pq/fixed_chunk_pq_table.rs: Remove centroids altogether from FixedChunkPQTable.

  • diskann-providers/src/storage/pq_storage.rs: This is where backwards compatibility comes in. We cannot change the layout of the storage file and we also need to support previously generated files. To that end, load_pq_pivots_bin is updated to check if the loaded centroid is non-zero. If so, it is accumulated into the pivots to restore the original numerical behavior. Several tests are added to ensure this behavior.

    For saving, when no centroid is available, callers can pass Option::None to write_pivot_data, which then writes zeros in the centroid's place in the saved file.

  • diskann-disk/src/storage/quant/pq/pq_dataset.rs and diskann-disk/src/search/pq/quantizer_preprocess.rs: Preprocessing can now be greatly simplified. Because the FixedChunkPQTable no longer has centroids, the TransposedTable (offering much faster pre-processing times) can always be used.
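
The storage behavior described in the pq_storage.rs item can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the names `fold_legacy_centroid` and `centroid_payload` are hypothetical, and the pivots are assumed to be a row-major `num_centers x dim` matrix of `f32`.

```rust
/// Adds `centroid` to every row of the row-major `pivots` matrix when the
/// stored centroid has at least one non-zero entry. Returns `true` if the
/// pivots were modified. (Hypothetical helper, not the PR's API.)
fn fold_legacy_centroid(pivots: &mut [f32], centroid: &[f32]) -> bool {
    let dim = centroid.len();
    assert!(
        dim > 0 && pivots.len() % dim == 0,
        "pivots must hold a whole number of rows of length dim"
    );
    // An all-zero centroid means the file was written without centering,
    // so there is nothing to restore.
    if centroid.iter().all(|&c| c == 0.0) {
        return false;
    }
    for row in pivots.chunks_exact_mut(dim) {
        for (p, c) in row.iter_mut().zip(centroid) {
            *p += *c;
        }
    }
    true
}

/// Centroid block to place in the unchanged file layout on save: the given
/// centroid, or `dim` zeros when the caller passes `None`.
/// (Hypothetical helper, not the PR's API.)
fn centroid_payload(centroid: Option<&[f32]>, dim: usize) -> Vec<f32> {
    match centroid {
        Some(c) => c.to_vec(),
        None => vec![0.0; dim],
    }
}
```

Under these assumptions, a file saved with `None` round-trips cleanly: the zero payload is detected on load and the pivots are left untouched, while a legacy file with a non-zero centroid has it added to each pivot row once.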

The rest of the changes involve updating the construction sites of FixedChunkPQTable and invoking the new saving interface.

Backwards Compatibility

There exist PQ schemas in the wild that were generated with centroids and we still need to correctly process those.

  • diskann-disk: I believe previously generated PQ schemas were saved with PQStorage. Thus, reloading them into a FixedChunkPQTable with PQStorage::load_pq_pivots_bin will merge the centroid into the PQ pivots.
  • Users of the _membuf API: Backwards compatibility is handled by the caller (the only real user I know of uses this via FFI, and I will take responsibility for updating that). Again, this can be done by simply folding the centroid into the pivot data or by centering manually, both of which are quite easy to do.
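
The two caller-side options for the _membuf API can be sketched as below. This is a simplified illustration with hypothetical names (`nearest_pivot`, `encode_with_manual_centering`, `fold_into_pivots`): it assumes a single PQ chunk spanning the full dimension and row-major pivots, which the real schema does not require.

```rust
/// Index of the row of `pivots` (each row has `dim` entries) that is
/// closest to `v` in squared L2 distance. Ties keep the earliest row.
fn nearest_pivot(v: &[f32], pivots: &[f32], dim: usize) -> usize {
    let mut best = 0;
    let mut best_dist = f32::INFINITY;
    for (i, p) in pivots.chunks_exact(dim).enumerate() {
        let d: f32 = v.iter().zip(p).map(|(x, y)| (x - y) * (x - y)).sum();
        if d < best_dist {
            best = i;
            best_dist = d;
        }
    }
    best
}

/// Option A: keep the legacy pivots and center each vector by hand
/// before encoding it.
fn encode_with_manual_centering(v: &[f32], legacy_pivots: &[f32], centroid: &[f32]) -> usize {
    let centered: Vec<f32> = v.iter().zip(centroid).map(|(x, c)| x - c).collect();
    nearest_pivot(&centered, legacy_pivots, centroid.len())
}

/// Option B: fold the centroid into the pivots once, then encode raw
/// vectors against the folded pivots.
fn fold_into_pivots(legacy_pivots: &[f32], centroid: &[f32]) -> Vec<f32> {
    legacy_pivots
        .chunks_exact(centroid.len())
        .flat_map(|row| row.iter().zip(centroid).map(|(p, c)| p + c))
        .collect()
}
```

For well-separated inputs both options select the same pivot. Because the two paths perform slightly different floating point operations, vectors that are nearly equidistant from two pivots could in principle round differently, which matches the note above about hard-coded test pivots.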

Contributor

Copilot AI left a comment

Pull request overview

This PR removes explicit dataset-centroid handling from most PQ interfaces, shifts compatibility to the storage/load path by folding legacy centroids into pivots, and simplifies downstream consumers so they can treat PQ tables as centroid-free. In the broader codebase, this moves PQ closer to the diskann-quantization abstractions and removes L2-only centering behavior from general-purpose interfaces.

Changes:

  • Removes centroid fields/parameters from FixedChunkPQTable, GeneratePivotArguments, and the _membuf PQ construction/compression APIs.
  • Updates PQ storage read/write paths to preserve the on-disk layout and recover legacy behavior by folding non-zero stored centroids into pivots on load.
  • Simplifies disk PQ consumers to always use TransposedTable, and updates construction/save call sites across providers, disk builder code, tools, and benchmarks.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.

File Description
diskann-tools/src/utils/build_pq.rs Updates tool-side PQ pivot generation call to the new generate_pq_pivots signature.
diskann-providers/src/storage/pq_storage.rs Adds optional-centroid write path, legacy-centroid folding on load, and storage round-trip tests.
diskann-providers/src/model/pq/pq_construction.rs Removes centroid handling from in-memory PQ APIs and adds legacy-centering toggle to file-based pivot generation.
diskann-providers/src/model/pq/generate_pivot_arguments.rs Removes centroid-centering state from PQ generation arguments.
diskann-providers/src/model/pq/fixed_chunk_pq_table.rs Drops centroid storage/processing from the PQ table type and updates related tests/helpers.
diskann-providers/src/model/pq/distance/test_utils.rs Updates PQ test helpers to construct centroid-free tables.
diskann-providers/src/model/pq/distance/l2.rs Stops query preprocessing from subtracting centroids before lookup-table generation.
diskann-providers/src/model/graph/provider/async_/memory_quant_vector_provider.rs Saves PQ pivots without a centroid payload and updates test construction.
diskann-providers/src/model/graph/provider/async_/inmem/provider.rs Refreshes public doc examples for the new FixedChunkPQTable::new shape.
diskann-providers/src/model/graph/provider/async_/fast_memory_quant_vector_provider.rs Switches async fast in-memory PQ provider save/test paths to centroid-free tables.
diskann-providers/src/model/graph/provider/async_/experimental/multi_pq_async.rs Updates experimental multi-PQ compression call sites to the new membuf API.
diskann-providers/src/model/graph/provider/async_/bf_tree/quant_vector_provider.rs Adjusts BF-tree quant provider tests to the centroid-free table constructor.
diskann-providers/src/model/graph/provider/async_/bf_tree/provider.rs Saves BF-tree PQ pivots without centroids and updates related tests.
diskann-providers/src/index/diskann_async.rs Removes centroid buffers from async index PQ training and test setup.
diskann-disk/src/storage/quant/pq/pq_generation.rs Threads the new legacy-centering flag through disk PQ generation.
diskann-disk/src/storage/quant/pq/pq_dataset.rs Replaces the PQTable enum with always-transposed storage in PQData.
diskann-disk/src/storage/quant/pq/mod.rs Removes the now-obsolete PQTable re-export.
diskann-disk/src/storage/quant/mod.rs Stops re-exporting PQTable from the public quant module.
diskann-disk/src/search/pq/quantizer_preprocess.rs Simplifies PQ preprocessing to a single TransposedTable path.
diskann-disk/src/build/builder/quantizer.rs Saves builder-generated PQ pivots without a centroid payload.
diskann-benchmark/src/backend/exhaustive/product.rs Updates benchmark-side PQ table construction to the centroid-free constructor.
Comments suppressed due to low confidence (4)

diskann-providers/src/model/pq/pq_construction.rs:166

  • This public _membuf entry point no longer exposes the centroid output buffer, which breaks existing callers that still use the 0.50.x signature. The PR description calls out caller-side migration, but the crate versioning policy says these packages obey SemVer, so this signature change needs a shim/deprecation path or a major release.
pub fn generate_pq_pivots_from_membuf<T: Copy + Into<f32>>(
    parameters: &GeneratePivotArguments,
    train_data_slice: &[T],
    offsets: &mut [usize],
    full_pivot_data: &mut [f32],
    rng: &mut (impl Rng + ?Sized),
    cancellation_token: &mut bool,
    pool: RayonThreadPoolRef<'_>,

diskann-providers/src/model/pq/pq_construction.rs:166

  • The doc comment for this public function still describes the removed make_zero_mean/centroid parameters, so it no longer matches the actual API. Anyone reading the generated docs will be told to pass buffers that this signature no longer accepts.
pub fn generate_pq_pivots_from_membuf<T: Copy + Into<f32>>(
    parameters: &GeneratePivotArguments,
    train_data_slice: &[T],
    offsets: &mut [usize],
    full_pivot_data: &mut [f32],
    rng: &mut (impl Rng + ?Sized),
    cancellation_token: &mut bool,
    pool: RayonThreadPoolRef<'_>,

diskann-providers/src/model/pq/pq_construction.rs:80

  • Adding the new legacy_center_data parameter changes the signature of a public, re-exported function. That is another SemVer break for diskann-providers consumers on 0.50.x, so this needs a compatibility wrapper or a major-version plan instead of replacing the existing API in place.
pub fn generate_pq_pivots<Storage, Random>(
    parameters: GeneratePivotArguments,
    legacy_center_data: bool,
    train_data: &mut [f32],
    pq_storage: &PQStorage,
    storage_provider: &Storage,
    random_provider: RandomProvider<Random>,
    pool: RayonThreadPoolRef<'_>,

diskann-providers/src/storage/pq_storage.rs:98

  • Changing write_pivot_data from taking &[f32] to Option<&[f32]> is a source-breaking public API change for diskann-providers consumers. Because the workspace is still on a SemVer-governed 0.50.x line, this needs a compatibility layer or a coordinated major-version bump rather than replacing the signature in place.
    pub fn write_pivot_data<Storage>(
        &self,
        full_pivot_data: &[f32],
        centroid: Option<&[f32]>,
        chunk_offsets: &[usize],
        num_centers: usize,
        dim: usize,
        storage_provider: &Storage,
    ) -> ANNResult<()>
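
One common shape for the compatibility layer mentioned in these comments is a deprecated wrapper that keeps the old signature and forwards to the new one. The sketch below is purely illustrative: `write_pivot_data_v2` and `write_pivot_data_compat` are hypothetical free functions that return the bytes-to-be-written as a `Vec<f32>`, and they do not reflect the actual `PQStorage` methods or any decision made in this PR.

```rust
/// Stand-in for the new-style writer: appends the centroid after the pivot
/// data, or `dim` zeros when no centroid is supplied.
/// (Hypothetical; the real method writes through a storage provider.)
fn write_pivot_data_v2(full_pivot_data: &[f32], centroid: Option<&[f32]>, dim: usize) -> Vec<f32> {
    let mut out = full_pivot_data.to_vec();
    match centroid {
        Some(c) => out.extend_from_slice(c),
        None => out.resize(out.len() + dim, 0.0),
    }
    out
}

/// Old-style entry point kept for existing callers: it takes the centroid
/// as a plain slice and forwards to the new function.
#[deprecated(note = "pass `Option<&[f32]>` to `write_pivot_data_v2` instead")]
fn write_pivot_data_compat(full_pivot_data: &[f32], centroid: &[f32], dim: usize) -> Vec<f32> {
    write_pivot_data_v2(full_pivot_data, Some(centroid), dim)
}
```

Existing callers keep compiling and see a deprecation warning, while new callers can pass `None` directly. Whether such a shim is worth carrying is a project-level choice.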

Comment on lines 71 to 77
pub fn new(
    num_train: usize,
    dim: usize,
    num_centers: usize,
    num_pq_chunks: usize,
    max_k_means_reps: usize,
    translate_to_center: bool,
) -> Result<Self, GeneratePivotArgumentsError> {
Comment on lines 541 to 546
pub fn generate_pq_data_from_pivots_from_membuf<T: Copy + Into<f32>>(
    vector_data: &[T],
    pivot_data: &[f32],
    num_pivots: usize,
    centroid: Option<&[f32]>,
    offsets: &[usize],
    pq_out: &mut [u8],
Comment on lines 39 to +46
  /// Get pq_table
- pub fn pq_table(&self) -> &PQTable {
+ pub fn pq_table(&self) -> &TransposedTable {
      &self.pq_pivot_table
  }

  /// Return the number of chunks in the underlying PQ schema.
  pub fn get_num_chunks(&self) -> usize {
-     match &self.pq_pivot_table {
-         PQTable::Transposed(table) => table.nchunks(),
-         PQTable::Fixed(table) => table.get_num_chunks(),
-     }
+     self.pq_pivot_table.nchunks()
Comment on lines 402 to 409
/// Returns an immutable reference to the `pq_table`.
pub fn get_pq_table(&self) -> &[f32] {
    self.table.view_pivots().into()
}

/// Returns an immutable reference to the `chunk_offsets`.
pub fn get_chunk_offsets(&self) -> &[usize] {
    self.table.view_offsets().into()
Comment on lines 541 to 546
pub fn generate_pq_data_from_pivots_from_membuf<T: Copy + Into<f32>>(
    vector_data: &[T],
    pivot_data: &[f32],
    num_pivots: usize,
    centroid: Option<&[f32]>,
    offsets: &[usize],
    pq_out: &mut [u8],
/// * `centroid` - Optional per-dimension centroid. Pass `None` for the standard
/// (non-legacy) code path; a zero vector of length `dim` is written to preserve
/// the on-disk file format. Pass `Some(centroid)` only when legacy centroid
/// centering is enabled (see [`GeneratePivotArguments::with_legacy_centering`]).
///
/// Panics under the following condition:
///
/// * `base_vec.length() != self.get_dim()`.
Comment on lines +132 to +140
  pub fn new(dim: usize, pq_table: Box<[f32]>, chunk_offsets: Box<[usize]>) -> ANNResult<Self> {
      let len = pq_table.len();
      let table = BasicTable::new(
          MatrixBase::try_from(pq_table, len / dim, dim).bridge_err()?,
          ChunkOffsetsBase::new(chunk_offsets).bridge_err()?,
      )
      .map_err(|err| ANNError::log_pq_error(diskann_quantization::error::format(&err)))?;

-     if centroids.len() != dim {
-         return Err(ANNError::log_pq_error(format_args!(
-             "centroids slice has length {} but the expected dim is {}",
-             centroids.len(),
-             dim
-         )));
-     }
-
-     Ok(Self { table, centroids })
+     Ok(Self { table })
Comment on lines 580 to 586
pub fn generate_pq_data_from_pivots_from_membuf_batch<T: Copy + Sync + Into<f32>>(
    parameters: &GeneratePivotArguments,
    vector_data: &[T],
    pivot_data: &[f32],
    centroid: &[f32],
    offsets: &[usize],
    pq_out: &mut [u8],
    pool: RayonThreadPoolRef<'_>,
  pub(crate) mod pq;
  pub use pq::pq_generation::{PQGeneration, PQGenerationContext};
- pub use pq::{PQData, PQTable};
+ pub use pq::PQData;
@codecov-commenter

codecov-commenter commented May 2, 2026

Codecov Report

❌ Patch coverage is 96.91358% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.49%. Comparing base (45428af) to head (b596601).

Files with missing lines Patch % Lines
diskann-providers/src/storage/pq_storage.rs 93.58% 5 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1010      +/-   ##
==========================================
- Coverage   89.49%   89.49%   -0.01%     
==========================================
  Files         448      448              
  Lines       84118    84021      -97     
==========================================
- Hits        75282    75193      -89     
+ Misses       8836     8828       -8     
Flag Coverage Δ
miri 89.49% <96.91%> (-0.01%) ⬇️
unittests 89.33% <96.91%> (-0.01%) ⬇️

Files with missing lines Coverage Δ
...iskann-benchmark/src/backend/exhaustive/product.rs 100.00% <ø> (ø)
diskann-disk/src/build/builder/quantizer.rs 90.52% <100.00%> (ø)
diskann-disk/src/search/pq/quantizer_preprocess.rs 97.29% <100.00%> (+9.29%) ⬆️
diskann-disk/src/storage/quant/pq/pq_dataset.rs 96.55% <100.00%> (-0.60%) ⬇️
diskann-disk/src/storage/quant/pq/pq_generation.rs 93.22% <100.00%> (ø)
diskann-providers/src/index/diskann_async.rs 95.97% <100.00%> (-0.02%) ⬇️
...aph/provider/async_/experimental/multi_pq_async.rs 96.38% <ø> (-0.02%) ⬇️
...ovider/async_/fast_memory_quant_vector_provider.rs 98.44% <100.00%> (-0.02%) ⬇️
.../src/model/graph/provider/async_/inmem/provider.rs 90.60% <ø> (ø)
...ph/provider/async_/memory_quant_vector_provider.rs 98.35% <100.00%> (-0.01%) ⬇️
... and 7 more

... and 1 file with indirect coverage changes

Contributor

@harsha-simhadri harsha-simhadri left a comment

OK with this. Could you please run on a few medium-size datasets and check recall before merging? Thanks.

4 participants