Float8, Performance, Correctness, & Portability! #732
Merged
ashvardanian merged 68 commits into `main` on Apr 15, 2026
…ator_gt` (#608) The previous `total_allocated()` reverse-engineered the total by dividing `last_capacity_` by the capacity multiplier in a loop. This was inaccurate when `ceil2(extended_bytes)` produced larger-than-expected arena sizes, causing the reverse calculation to undercount. Add a `total_allocated_` member that is incremented on each arena allocation and returned directly, making the method O(1) and exact.
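The counter-based approach can be sketched as follows. This is a simplified, hypothetical arena allocator, not the library's class: the names `arena_allocator_sketch_t`, `min_capacity`, and `capacity_multiplier` are assumptions, and alignment handling is omitted. It only illustrates why a running `total_allocated_` stays exact even when `ceil2` rounds an arena up past the expected size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical, simplified arena allocator used only for illustration.
class arena_allocator_sketch_t {
    std::vector<void*> arenas_;
    std::size_t last_capacity_ = 0;
    std::size_t last_usage_ = 0;
    std::size_t total_allocated_ = 0; // running sum of every arena's capacity

    // Rounds up to the next power of two.
    static std::size_t ceil2(std::size_t n) {
        std::size_t p = 1;
        while (p < n) p <<= 1;
        return p;
    }

  public:
    static constexpr std::size_t min_capacity = 4096;      // assumed value
    static constexpr std::size_t capacity_multiplier = 2;  // assumed value

    arena_allocator_sketch_t() = default;
    arena_allocator_sketch_t(arena_allocator_sketch_t const&) = delete;
    arena_allocator_sketch_t& operator=(arena_allocator_sketch_t const&) = delete;
    ~arena_allocator_sketch_t() {
        for (void* arena : arenas_) std::free(arena);
    }

    void* allocate(std::size_t bytes) {
        assert(bytes > 0);
        if (arenas_.empty() || last_usage_ + bytes > last_capacity_) {
            std::size_t grown = last_capacity_ * capacity_multiplier;
            std::size_t wanted = grown > min_capacity ? grown : min_capacity;
            if (wanted < bytes) wanted = bytes;
            std::size_t capacity = ceil2(wanted); // may exceed `grown`
            void* arena = std::malloc(capacity);
            if (!arena) return nullptr;
            arenas_.push_back(arena);
            last_capacity_ = capacity;
            last_usage_ = 0;
            total_allocated_ += capacity; // counted once, at allocation time
        }
        void* result = static_cast<char*>(arenas_.back()) + last_usage_;
        last_usage_ += bytes;
        return result;
    }

    // O(1) and exact: no reverse-engineering from `last_capacity_`.
    std::size_t total_allocated() const { return total_allocated_; }
};
```

With these assumed constants, a 100-byte request followed by a 10000-byte request creates arenas of 4096 and 16384 bytes, so the counter reports 20480, a total that cannot be recovered by repeatedly dividing the last capacity by the multiplier.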
Add `memory_stats_t` to `index_dense_gt` returning separate allocated, wasted, and reserved byte counts for both the graph tape allocator (alignment 64) and the vectors tape allocator (alignment 8).
Add `MemoryStats` shared struct to the cxx bridge and wire it through `NativeIndex::memory_stats()` in C++ to `Index::memory_stats()` in Rust, giving Rust callers per-tape allocated/wasted/reserved breakdowns.
Pre-allocate the vector buffer before calling `typed_->add` or `typed_->update`. If allocation fails, return early with the same "Out of memory!" error used throughout the rest of the codebase, before any graph mutation occurs.
Co-authored-by: David Ivekovic <88043717+DavIvek@users.noreply.github.com>
* Fix: Avoid duplicate neighbor slots in HNSW reverse links
* Improve: Use `std::find` instead of `std::all_of` for neighbor presence check in `index_gt::form_reverse_links_`
* Fix: Use `std::find_if` for close_header neighbor check
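A presence check of this kind can be sketched as below. The function name, the `slot_sketch_t` alias, and the use of a `std::vector` for the neighbor list are all assumptions made for illustration; the actual `form_reverse_links_` works on the index's own neighbor storage and falls back to a selection heuristic when the list is full.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using slot_sketch_t = std::uint32_t; // hypothetical slot identifier

// Tries to add a reverse link from a close node back to `new_slot`.
// Returns true only if the link was appended.
inline bool add_reverse_link_sketch(std::vector<slot_sketch_t>& close_neighbors,
                                    slot_sketch_t new_slot,
                                    std::size_t max_neighbors) {
    assert(max_neighbors > 0);
    // Presence check: a duplicate entry would waste one of the limited slots.
    bool already_linked =
        std::find(close_neighbors.begin(), close_neighbors.end(), new_slot) != close_neighbors.end();
    if (already_linked) return false;
    if (close_neighbors.size() < max_neighbors) {
        close_neighbors.push_back(new_slot);
        return true;
    }
    // A real index would re-run its neighbor selection heuristic here.
    return false;
}
```

`std::find` states the intent ("is this slot already present?") directly, whereas `std::all_of` needs a negated predicate to express the same thing.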
Correct "Cosine Similarity" to "Cosine Distance" and fix mismatched parentheses in the formula.
Float16 is unavailable on x86_64 macOS regardless of OS version. The @available annotation added in PR #610 is insufficient because Float16 is a type-level absence on x86_64, not a runtime availability issue. The sugar file (USearchIndex+Sugar.swift) already has the correct #if arch(arm64) guard. This commit adds the same guard to the core USearchIndex.swift file for consistency. Fixes #589
The release archives contained files under build directory prefixes (build_release/, build_artifacts/, c/) instead of flat at the root. This broke the documented install instructions for Go and C users.
- macOS/Android: use `zip -j` to strip directory paths
- Windows: stage files into a flat `pkg/` dir before tarring
- WASM: copy header into build dir before archiving
Co-authored-by: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com>
Closes #715
Co-Authored-By: David Ivekovic <88043717+DavIvek@users.noreply.github.com>
Co-Authored-By: Björn Blissing <1640024+bjornblissing@users.noreply.github.com>
Co-Authored-By: Jude Payne <141360+judepayne@users.noreply.github.com>
Co-Authored-By: Anas Limem <160512789+anaslimem@users.noreply.github.com>
Co-Authored-By: Markus Graf <24669860+markusalbertgraf@users.noreply.github.com>
) Closes #697 The equal_range iterator in flat_hash_multi_set_gt did not check the deleted flag when advancing. After add/remove cycles that create tombstones with the same key, the iterator would yield deleted entries as live matches. In index_dense_gt::remove() this caused already-freed slots to be pushed to free_keys_ again, inflating free_keys_.size() and producing unsigned underflow in size(). Reproduces in both multi and non-multi mode (single-threaded) whenever two keys hash-collide and one tombstone is reused by a different key. Co-authored-by: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com> Co-Authored-By: David Ivekovic <88043717+DavIvek@users.noreply.github.com>
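The tombstone problem is easier to see in a toy model. The sketch below is a hypothetical linear-probing multi-set, not `flat_hash_multi_set_gt` itself; every name in it is an assumption. Its `equal_values` lookup skips slots whose `deleted` flag is set, which is the check the commit message describes as missing from the iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct slot_sketch_t {
    int key = 0;
    int value = 0;
    bool populated = false; // slot was used at least once
    bool deleted = false;   // slot is a tombstone; `key` may be stale
};

// Hypothetical open-addressing multi-set with linear probing.
class flat_multi_set_sketch_t {
    std::vector<slot_sketch_t> slots_;

    std::size_t home(int key) const { return static_cast<std::size_t>(key) % slots_.size(); }

  public:
    explicit flat_multi_set_sketch_t(std::size_t capacity) : slots_(capacity) { assert(capacity > 0); }

    bool insert(int key, int value) {
        for (std::size_t i = 0; i != slots_.size(); ++i) {
            slot_sketch_t& slot = slots_[(home(key) + i) % slots_.size()];
            if (slot.populated && !slot.deleted) continue; // live entry, keep probing
            slot.key = key, slot.value = value;            // empty slot or reusable tombstone
            slot.populated = true, slot.deleted = false;
            return true;
        }
        return false;
    }

    bool erase(int key, int value) {
        for (std::size_t i = 0; i != slots_.size(); ++i) {
            slot_sketch_t& slot = slots_[(home(key) + i) % slots_.size()];
            if (!slot.populated) return false; // never-used slot ends the probe chain
            if (!slot.deleted && slot.key == key && slot.value == value) {
                slot.deleted = true; // leave a tombstone; the stale key stays behind
                return true;
            }
        }
        return false;
    }

    std::vector<int> equal_values(int key) const {
        std::vector<int> values;
        for (std::size_t i = 0; i != slots_.size(); ++i) {
            slot_sketch_t const& slot = slots_[(home(key) + i) % slots_.size()];
            if (!slot.populated) break;
            if (slot.deleted) continue; // without this, tombstones match as live entries
            if (slot.key == key) values.push_back(slot.value);
        }
        return values;
    }
};
```

In this model, keys 1 and 9 collide in an 8-slot table. After erasing key 1, a lookup for it must return nothing even though its stale key is still stored in the tombstone, and a later insert of key 9 may reuse that slot.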
On Windows MSVC, `RAND_MAX` is 32767 while `INT_MAX` is ~2.1 billion. Dividing `std::rand()` by `INT_MAX` therefore produced values no larger than ~0.0000153, which is below the E4M3 FP8 smallest subnormal (1/512), causing all-zero vectors and a crash in HNSW graph construction (STATUS_STACK_BUFFER_OVERRUN).
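The arithmetic can be checked directly. The commit message does not spell out the replacement code, so the sketch below shows one portable option, assumed here for illustration: scale by `RAND_MAX` rather than `INT_MAX`. The helper names are hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>

// Upper bound of `std::rand() / INT_MAX` for a given `RAND_MAX`.
constexpr double buggy_upper_bound_sketch(double rand_max) {
    return rand_max / static_cast<double>(INT_MAX);
}

// Portable scaling: maps `std::rand()` onto [0, 1] whatever `RAND_MAX` is.
inline float random_unit_sketch() {
    return static_cast<float>(std::rand()) / static_cast<float>(RAND_MAX);
}
```

With `RAND_MAX` equal to 32767, the buggy bound is 32767 / 2147483647 ≈ 0.0000153, far below 1/512 ≈ 0.00195, so every component rounds to zero once quantized to E4M3.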
- The first fix (default: return false) prevents unsupported metrics like Sorensen from being dispatched through a wrong NumKong kernel (causing NaN).
- The second fix passes dimensions (bits) instead of bytes to NumKong for binary-set kernels, so Hamming/Tanimoto process the full vector.
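The effect of the bits-versus-bytes mix-up can be reproduced with a stand-in kernel. The function below is hypothetical and is not NumKong's API; it only assumes, as the commit message says, that the kernel expects its length argument in bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical Hamming kernel whose length argument is in bits, not bytes.
inline std::size_t hamming_bits_sketch(std::uint8_t const* a, std::uint8_t const* b, std::size_t bits) {
    std::size_t distance = 0;
    for (std::size_t i = 0; i != bits; ++i) {
        std::uint8_t diff = static_cast<std::uint8_t>(a[i / 8] ^ b[i / 8]);
        distance += (diff >> (i % 8)) & 1u; // count one differing bit at a time
    }
    return distance;
}
```

For two 4-byte vectors that differ in every bit, passing 32 (bits) yields a distance of 32, while passing 4 (the byte count) makes the kernel stop after the first 4 bits and report 4.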
NumKong's `types.h` uses `__has_builtin(...)` guarded by `defined(__has_builtin)`, which MSVC's legacy preprocessor fails to parse. The conforming preprocessor handles it correctly.
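For context, a widely used defensive form of this guard nests the two checks so that `__has_builtin(...)` is only evaluated when the macro exists. The sketch below uses hypothetical macro and function names and is not taken from NumKong; whether this form alone would satisfy MSVC's legacy preprocessor is not something the commit message states. On MSVC, the conforming preprocessor is enabled with `/Zc:preprocessor`.

```cpp
#include <cassert>
#include <cstdint>

// Nested guard: the inner `#if` is skipped entirely when `__has_builtin` is unknown.
#if defined(__has_builtin)
#if __has_builtin(__builtin_popcount)
#define SKETCH_HAS_BUILTIN_POPCOUNT 1
#endif
#endif
#if !defined(SKETCH_HAS_BUILTIN_POPCOUNT)
#define SKETCH_HAS_BUILTIN_POPCOUNT 0
#endif

inline unsigned popcount_sketch(std::uint32_t x) {
#if SKETCH_HAS_BUILTIN_POPCOUNT
    return static_cast<unsigned>(__builtin_popcount(x));
#else
    unsigned count = 0;
    for (; x; x &= x - 1) ++count; // clear the lowest set bit each iteration
    return count;
#endif
}
```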
Python 3.8 (EOL Oct 2024) and 3.9 (EOL Oct 2025) wheels were failing because NumKong requires >=3.9 and has no cp38 wheels on PyPI. Rather than patching around this, drop all EOL language versions across SDKs:
- Python: minimum 3.10, add `python_requires`, remove 38/39 from CI
- Node.js: engine >=20 (was ~10/>=12, both ancient EOL)
- Go: minimum 1.22 (was 1.19, EOL Aug 2023)
- Java: source/target 21 LTS (was 18 non-LTS, EOL Sep 2022)
- Ubuntu runners: all standardized to 24.04 (was mix of 22.04/latest)
- GCC: drop explicit gcc-12/g++-12 installs (24.04 ships GCC 14)
Co-Authored-By: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Adds three new scalar types across all language bindings:
- UInt8 `u8`: Full input type for unsigned 8-bit integer vectors, enabling correct benchmarking on BIGANN/SIFT datasets. Previously, `np.uint8` silently mapped to `b1x8` (bit-packed binary), producing nonsensical results. Now exposed as a first-class input type in C++, C, Rust, Go, Swift, and Objective-C. In Python, available via explicit `dtype="u8"` string; `np.uint8` auto-detection remains mapped to `B1` pending v3 breaking change. In JavaScript, Java, and C#, exposed as a quantization option.
- Float6 `e2m3` / `e3m2`: Internal quantization types from the OCP MX v1.0 spec (6-bit micro-floats stored in 8-bit containers). Supported as quantization targets in all language bindings, with SIMD-accelerated distance kernels delegated to NumKong on x86 (Haswell+, Ice Lake, Alder Lake) and ARM (NEON+SDOT). LUT-based upcasts to Float32 for all FP8 and FP6 types replace prior arithmetic fallbacks.
Relates to #595, #469, #527, #596, #633
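A LUT-based upcast for `e2m3` can be sketched as follows. The bit layout follows the OCP MX v1.0 definition of E2M3 (1 sign bit, 2 exponent bits, 3 mantissa bits, exponent bias 1, no infinities or NaNs). Placing the value in the low 6 bits of the 8-bit container is an assumption of this sketch, and the function names are hypothetical rather than the library's.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Arithmetic decode of one E2M3 value held in the low 6 bits of a byte.
inline float decode_e2m3_sketch(std::uint8_t byte) {
    int sign = (byte >> 5) & 1;
    int exponent = (byte >> 3) & 3;
    int mantissa = byte & 7;
    float magnitude = exponent == 0
                          ? mantissa / 8.0f                                   // subnormal: 2^(1-bias) * m/8
                          : std::ldexp(1.0f + mantissa / 8.0f, exponent - 1); // normal: (1 + m/8) * 2^(e-bias)
    return sign ? -magnitude : magnitude;
}

// Precomputes all 64 possible values once.
inline std::array<float, 64> build_e2m3_lut_sketch() {
    std::array<float, 64> lut{};
    for (int code = 0; code != 64; ++code)
        lut[static_cast<std::size_t>(code)] = decode_e2m3_sketch(static_cast<std::uint8_t>(code));
    return lut;
}

// Table lookup replaces the per-element arithmetic.
inline float upcast_e2m3_sketch(std::uint8_t byte) {
    static std::array<float, 64> const lut = build_e2m3_lut_sketch();
    return lut[byte & 0x3F];
}
```

Because a 6-bit type has only 64 codes, the whole table fits in 256 bytes, so each upcast costs one masked load instead of a branch and a scale.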
This change to bench.cpp makes memory-map and join benchmarks opt-in, rather than always-on. Also adds CLI parsing for new quantization modes and new 100M dataset listings.
Samoed reviewed on Apr 9, 2026
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
 * graph size, keeping the lock array comfortably within L2/L3 cache.
 */
template <typename allocator_at = std::allocator<byte_t>, std::size_t cache_line_ak = 128> //
class striped_locks_gt {
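Only the declaration of `striped_locks_gt` is visible in this excerpt. As a rough illustration of the general lock-striping technique, and not of the class's actual implementation, the sketch below maps many slots onto a small, fixed, power-of-two number of spin-locks, each padded to its own cache line. All names and the spin-lock design are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical striped spin-locks: many slots share a fixed set of locks.
template <std::size_t cache_line_ak = 128>
class striped_locks_sketch_t {
    // One lock per cache line to avoid false sharing between stripes.
    struct alignas(cache_line_ak) stripe_t {
        std::atomic<bool> locked{false};
    };

    std::unique_ptr<stripe_t[]> stripes_;
    std::size_t mask_;

  public:
    explicit striped_locks_sketch_t(std::size_t stripes_pow2)
        : stripes_(new stripe_t[stripes_pow2]), mask_(stripes_pow2 - 1) {
        assert(stripes_pow2 != 0 && (stripes_pow2 & mask_) == 0); // must be a power of two
    }

    // Returns true if the stripe covering `slot` was free and is now held.
    bool try_lock(std::size_t slot) noexcept {
        return !stripes_[slot & mask_].locked.exchange(true, std::memory_order_acquire);
    }

    void lock(std::size_t slot) noexcept {
        while (!try_lock(slot)) {
        }
    }

    void unlock(std::size_t slot) noexcept {
        stripes_[slot & mask_].locked.store(false, std::memory_order_release);
    }
};
```

Since the number of stripes does not grow with the graph, slots that differ only in their high bits contend for the same lock, which is the trade-off accepted for a small, cache-friendly lock array.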
This "minor" release is equally important for performance, correctness, and portability!