Remove centroid from most PQ interfaces.#1010
Conversation
There was a problem hiding this comment.
Pull request overview
This PR removes explicit dataset-centroid handling from most PQ interfaces, shifts compatibility to the storage/load path by folding legacy centroids into pivots, and simplifies downstream consumers so they can treat PQ tables as centroid-free. In the broader codebase, this moves PQ closer to the diskann-quantization abstractions and removes L2-only centering behavior from general-purpose interfaces.
Changes:
- Removes centroid fields/parameters from
FixedChunkPQTable,GeneratePivotArguments, and the_membufPQ construction/compression APIs. - Updates PQ storage read/write paths to preserve the on-disk layout and recover legacy behavior by folding non-zero stored centroids into pivots on load.
- Simplifies disk PQ consumers to always use
TransposedTable, and updates construction/save call sites across providers, disk builder code, tools, and benchmarks.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
diskann-tools/src/utils/build_pq.rs |
Updates tool-side PQ pivot generation call to the new generate_pq_pivots signature. |
diskann-providers/src/storage/pq_storage.rs |
Adds optional-centroid write path, legacy-centroid folding on load, and storage round-trip tests. |
diskann-providers/src/model/pq/pq_construction.rs |
Removes centroid handling from in-memory PQ APIs and adds legacy-centering toggle to file-based pivot generation. |
diskann-providers/src/model/pq/generate_pivot_arguments.rs |
Removes centroid-centering state from PQ generation arguments. |
diskann-providers/src/model/pq/fixed_chunk_pq_table.rs |
Drops centroid storage/processing from the PQ table type and updates related tests/helpers. |
diskann-providers/src/model/pq/distance/test_utils.rs |
Updates PQ test helpers to construct centroid-free tables. |
diskann-providers/src/model/pq/distance/l2.rs |
Stops query preprocessing from subtracting centroids before lookup-table generation. |
diskann-providers/src/model/graph/provider/async_/memory_quant_vector_provider.rs |
Saves PQ pivots without a centroid payload and updates test construction. |
diskann-providers/src/model/graph/provider/async_/inmem/provider.rs |
Refreshes public doc examples for the new FixedChunkPQTable::new shape. |
diskann-providers/src/model/graph/provider/async_/fast_memory_quant_vector_provider.rs |
Switches async fast in-memory PQ provider save/test paths to centroid-free tables. |
diskann-providers/src/model/graph/provider/async_/experimental/multi_pq_async.rs |
Updates experimental multi-PQ compression call sites to the new membuf API. |
diskann-providers/src/model/graph/provider/async_/bf_tree/quant_vector_provider.rs |
Adjusts BF-tree quant provider tests to the centroid-free table constructor. |
diskann-providers/src/model/graph/provider/async_/bf_tree/provider.rs |
Saves BF-tree PQ pivots without centroids and updates related tests. |
diskann-providers/src/index/diskann_async.rs |
Removes centroid buffers from async index PQ training and test setup. |
diskann-disk/src/storage/quant/pq/pq_generation.rs |
Threads the new legacy-centering flag through disk PQ generation. |
diskann-disk/src/storage/quant/pq/pq_dataset.rs |
Replaces the PQTable enum with always-transposed storage in PQData. |
diskann-disk/src/storage/quant/pq/mod.rs |
Removes the now-obsolete PQTable re-export. |
diskann-disk/src/storage/quant/mod.rs |
Stops re-exporting PQTable from the public quant module. |
diskann-disk/src/search/pq/quantizer_preprocess.rs |
Simplifies PQ preprocessing to a single TransposedTable path. |
diskann-disk/src/build/builder/quantizer.rs |
Saves builder-generated PQ pivots without a centroid payload. |
diskann-benchmark/src/backend/exhaustive/product.rs |
Updates benchmark-side PQ table construction to the centroid-free constructor. |
Comments suppressed due to low confidence (4)
diskann-providers/src/model/pq/pq_construction.rs:166
- This public
_membufentry point no longer exposes the centroid output buffer, which breaks existing callers that still use the 0.50.x signature. The PR description calls out caller-side migration, but the crate versioning policy says these packages obey SemVer, so this signature change needs a shim/deprecation path or a major release.
pub fn generate_pq_pivots_from_membuf<T: Copy + Into<f32>>(
parameters: &GeneratePivotArguments,
train_data_slice: &[T],
offsets: &mut [usize],
full_pivot_data: &mut [f32],
rng: &mut (impl Rng + ?Sized),
cancellation_token: &mut bool,
pool: RayonThreadPoolRef<'_>,
diskann-providers/src/model/pq/pq_construction.rs:166
- The doc comment for this public function still describes the removed
make_zero_mean/centroidparameters, so it no longer matches the actual API. Anyone reading the generated docs will be told to pass buffers that this signature no longer accepts.
pub fn generate_pq_pivots_from_membuf<T: Copy + Into<f32>>(
parameters: &GeneratePivotArguments,
train_data_slice: &[T],
offsets: &mut [usize],
full_pivot_data: &mut [f32],
rng: &mut (impl Rng + ?Sized),
cancellation_token: &mut bool,
pool: RayonThreadPoolRef<'_>,
diskann-providers/src/model/pq/pq_construction.rs:80
- Adding the new
legacy_center_dataparameter changes the signature of a public, re-exported function. That is another SemVer break fordiskann-providersconsumers on 0.50.x, so this needs a compatibility wrapper or a major-version plan instead of replacing the existing API in place.
pub fn generate_pq_pivots<Storage, Random>(
parameters: GeneratePivotArguments,
legacy_center_data: bool,
train_data: &mut [f32],
pq_storage: &PQStorage,
storage_provider: &Storage,
random_provider: RandomProvider<Random>,
pool: RayonThreadPoolRef<'_>,
diskann-providers/src/storage/pq_storage.rs:98
- Changing
write_pivot_datafrom taking&[f32]toOption<&[f32]>is a source-breaking public API change fordiskann-providersconsumers. Because the workspace is still on a SemVer-governed 0.50.x line, this needs a compatibility layer or a coordinated major-version bump rather than replacing the signature in place.
pub fn write_pivot_data<Storage>(
&self,
full_pivot_data: &[f32],
centroid: Option<&[f32]>,
chunk_offsets: &[usize],
num_centers: usize,
dim: usize,
storage_provider: &Storage,
) -> ANNResult<()>
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pub fn new( | ||
| num_train: usize, | ||
| dim: usize, | ||
| num_centers: usize, | ||
| num_pq_chunks: usize, | ||
| max_k_means_reps: usize, | ||
| translate_to_center: bool, | ||
| ) -> Result<Self, GeneratePivotArgumentsError> { |
| pub fn generate_pq_data_from_pivots_from_membuf<T: Copy + Into<f32>>( | ||
| vector_data: &[T], | ||
| pivot_data: &[f32], | ||
| num_pivots: usize, | ||
| centroid: Option<&[f32]>, | ||
| offsets: &[usize], | ||
| pq_out: &mut [u8], |
| /// Get pq_table | ||
| pub fn pq_table(&self) -> &PQTable { | ||
| pub fn pq_table(&self) -> &TransposedTable { | ||
| &self.pq_pivot_table | ||
| } | ||
|
|
||
| /// Return the number of chunks in the underlying PQ schema. | ||
| pub fn get_num_chunks(&self) -> usize { | ||
| match &self.pq_pivot_table { | ||
| PQTable::Transposed(table) => table.nchunks(), | ||
| PQTable::Fixed(table) => table.get_num_chunks(), | ||
| } | ||
| self.pq_pivot_table.nchunks() |
| /// Returns an immutable reference to the `pq_table`. | ||
| pub fn get_pq_table(&self) -> &[f32] { | ||
| self.table.view_pivots().into() | ||
| } | ||
|
|
||
| /// Returns an immutable reference to the `chunk_offsets`. | ||
| pub fn get_chunk_offsets(&self) -> &[usize] { | ||
| self.table.view_offsets().into() |
| pub fn generate_pq_data_from_pivots_from_membuf<T: Copy + Into<f32>>( | ||
| vector_data: &[T], | ||
| pivot_data: &[f32], | ||
| num_pivots: usize, | ||
| centroid: Option<&[f32]>, | ||
| offsets: &[usize], | ||
| pq_out: &mut [u8], |
| /// * `centroid` - Optional per-dimension centroid. Pass `None` for the standard | ||
| /// (non-legacy) code path; a zero vector of length `dim` is written to preserve | ||
| /// the on-disk file format. Pass `Some(centroid)` only when legacy centroid | ||
| /// centering is enabled (see [`GeneratePivotArguments::with_legacy_centering`]). |
| /// | ||
| /// Panics under the following condition: | ||
| /// | ||
| /// * `base_vec.length() != self.get_dim()`. |
| pub fn new(dim: usize, pq_table: Box<[f32]>, chunk_offsets: Box<[usize]>) -> ANNResult<Self> { | ||
| let len = pq_table.len(); | ||
| let table = BasicTable::new( | ||
| MatrixBase::try_from(pq_table, len / dim, dim).bridge_err()?, | ||
| ChunkOffsetsBase::new(chunk_offsets).bridge_err()?, | ||
| ) | ||
| .map_err(|err| ANNError::log_pq_error(diskann_quantization::error::format(&err)))?; | ||
|
|
||
| if centroids.len() != dim { | ||
| return Err(ANNError::log_pq_error(format_args!( | ||
| "centroids slice has length {} but the expected dim is {}", | ||
| centroids.len(), | ||
| dim | ||
| ))); | ||
| } | ||
|
|
||
| Ok(Self { table, centroids }) | ||
| Ok(Self { table }) |
| pub fn generate_pq_data_from_pivots_from_membuf_batch<T: Copy + Sync + Into<f32>>( | ||
| parameters: &GeneratePivotArguments, | ||
| vector_data: &[T], | ||
| pivot_data: &[f32], | ||
| centroid: &[f32], | ||
| offsets: &[usize], | ||
| pq_out: &mut [u8], | ||
| pool: RayonThreadPoolRef<'_>, |
| pub(crate) mod pq; | ||
| pub use pq::pq_generation::{PQGeneration, PQGenerationContext}; | ||
| pub use pq::{PQData, PQTable}; | ||
| pub use pq::PQData; |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1010 +/- ##
==========================================
- Coverage 89.49% 89.49% -0.01%
==========================================
Files 448 448
Lines 84118 84021 -97
==========================================
- Hits 75282 75193 -89
+ Misses 8836 8828 -8
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
harsha-simhadri
left a comment
There was a problem hiding this comment.
ok with this. could you please run on a few medium size datasets and check recall before merging. Thanks.
Remove the "centroid" from most PQ interfaces. This centroid is/was the dataset centroid and served to shift a vector before compressing or populating the distance lookup table. Here are the problems:
This kind of shifting only works when computing L2 distances. This kind of shift does not preserve inner products or cosine similarity. So it is a little bit of a foot-gun and was already not being applied when these distances were being computed, leading to a somewhat confusing scenario where a centroid can be provided to the
FixedChunkPQTableconstructor and then ignored.When L2 distances are being used - application of the centroid can be done implicitly by adding it to the PQ pivots rather than subtracting it from each vector that gets processed. In this way, it always gets applied.
It's possible this has a small effect on training by conditioning the data a little better, but when we've applied this empirically in the past, it has not made a difference.
Carrying around this centroid is making it awkward to cleanly move PQ to
diskann-quantization, so this PR takes the first steps towards removing it.Logical Flow/Review Order
diskann_providers/src/model/pq/generate_pivot_arguments.rs:translate_to_centeris removed fromGeneratePivotArguments.diskann_providers/src/model/pq/pq_construction.rs: The functiongenerate_pq_pivotsgains alegacy_center_databoolean argument to preserve the behavior ofdiskann-disk. This is necessary because tests indiskann-diskcheck against hard-coded pivots. Not centering the data results in different PQ codes due to slightly different floating point calculations, breaking these tests.Data translation is completely removed from the
_membufAPIs.diskann_providers/src/model/pq/fixed_chunk_pq_table.rs: Remove centroids altogether fromFixedChunkPQTable.diskann-providers/src/storage/pq_storage.rs: This is where backwards compatibility comes in. We cannot change the layout of the storage file and we also need to support previously generated files. To that end,load_pq_pivots_binis updated to check if the loaded centroid is non-zero. If so, it is accumulated into the pivots to restore the original numerical behavior. Several tests are added to ensure this behavior.For saving, when no centroid is available, it can be passed to
write_pivot_dataas anOption::Noneand will then be inserted as zeros into the saved file.diskann-disk/src/search/pq/pq_dataset.rsanddiskann-disk/src/search/pq/quantizer_preprocess.rs: Preprocessing can now be greatly simplified. Because theFixedChunkPQTableno longer has centroids, theTransposedTable(offering much faster pre-processing times) can always be used.The rest of the changes involve updating the construction sites of
FixedChunkPQTableand invoking the new saving interface.Backwards Compatibility
There exist PQ schemas in the wild that were generated with centroids and we still need to correctly process those.
diskann-disk: PQ schemas previously generated I believe have been saved withPQStorage. Thus, reloading into aFixedChunkPQTablewithPQStorage::load_pq_pivots_binwill merge the centroid with the PQ pivots._membufAPI: Backwards compatibility is handled by the caller (the only real user I know of uses this via FFI, and I will take responsibility for updating that). Again, this can be done by simply folding the centroid into the pivot data, or manually doing the centering. Both of which are quite easy to do.