
Float8, Performance, Correctness, & Portability!#732

Merged
ashvardanian merged 68 commits into main from main-dev
Apr 15, 2026
Conversation

@ashvardanian (Contributor) commented Mar 14, 2026

This "minor" release is equally important for performance, correctness, and portability!

Crow-de-a-d and others added 30 commits February 19, 2026 14:03
…ator_gt` (#608)

The previous `total_allocated()` reverse-engineered the total by dividing
`last_capacity_` by the capacity multiplier in a loop. This was inaccurate
when `ceil2(extended_bytes)` produced larger-than-expected arena sizes,
causing the reverse calculation to undercount.

Add a `total_allocated_` member that is incremented on each arena
allocation and returned directly, making the method O(1) and exact.
Add `memory_stats_t` to `index_dense_gt` returning separate allocated,
wasted, and reserved byte counts for both the graph tape allocator
(alignment 64) and the vectors tape allocator (alignment 8).
Add `MemoryStats` shared struct to the cxx bridge and wire it through
`NativeIndex::memory_stats()` in C++ to `Index::memory_stats()` in Rust,
giving Rust callers per-tape allocated/wasted/reserved breakdowns.
Pre-allocate the vector buffer before calling `typed_->add` or `typed_->update`.
If allocation fails, return early with the same "Out of memory!" error used 
throughout the rest of the codebase, before any graph mutation occurs.

Co-authored-by: David Ivekovic <88043717+DavIvek@users.noreply.github.com>
* Fix: Avoid duplicate neighbor slots in HNSW reverse links
* Improve: Use `std::find` instead of `std::all_of` for neighbor presence check in `index_gt::form_reverse_links_`
* Fix: Use `std::find_if` for close_header neighbor check
Correct "Cosine Similarity" to "Cosine Distance" and fix mismatched
parentheses in the formula.
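For context, cosine distance is the complement of cosine similarity:

```latex
d_{\cos}(a, b) = 1 - \frac{a \cdot b}{\|a\| \, \|b\|}
```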
Float16 is unavailable on x86_64 macOS regardless of OS version.
The @available annotation added in PR #610 is insufficient because
Float16 is a type-level absence on x86_64, not a runtime
availability issue.

The sugar file (USearchIndex+Sugar.swift) already has the correct
#if arch(arm64) guard. This commit adds the same guard to the
core USearchIndex.swift file for consistency.

Fixes #589
The release archives contained files under build directory prefixes
(build_release/, build_artifacts/, c/) instead of flat at the root.
This broke the documented install instructions for Go and C users.

- macOS/Android: use `zip -j` to strip directory paths
- Windows: stage files into a flat `pkg/` dir before tarring
- WASM: copy header into build dir before archiving

Co-authored-by: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com>
Closes #715

Co-Authored-By: David Ivekovic <88043717+DavIvek@users.noreply.github.com>
Co-Authored-By: Björn Blissing <1640024+bjornblissing@users.noreply.github.com>
Co-Authored-By: Jude Payne <141360+judepayne@users.noreply.github.com>
Co-Authored-By: Anas Limem <160512789+anaslimem@users.noreply.github.com>
Co-Authored-By: Markus Graf <24669860+markusalbertgraf@users.noreply.github.com>

Closes #697

The equal_range iterator in flat_hash_multi_set_gt did not check the
deleted flag when advancing. After add/remove cycles that create
tombstones with the same key, the iterator would yield deleted entries
as live matches. In index_dense_gt::remove() this caused already-freed
slots to be pushed to free_keys_ again, inflating free_keys_.size()
and producing unsigned underflow in size().

Reproduces in both multi and non-multi mode (single-threaded) whenever
two keys hash-collide and one tombstone is reused by a different key.

Co-authored-by: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com>
Co-Authored-By: David Ivekovic <88043717+DavIvek@users.noreply.github.com>
On Windows MSVC, `RAND_MAX` is 32767 while `INT_MAX` is ~2.1 billion.
Dividing `std::rand()` by `INT_MAX` therefore produced values no larger
than ~1.5e-5, below the E4M3 FP8 smallest subnormal (1/512), causing
all-zero vectors and a crash in HNSW graph construction
(STATUS_STACK_BUFFER_OVERRUN).
- The first fix (default: return false) prevents unsupported metrics like
  Sorensen from being dispatched through a wrong NumKong kernel (causing NaN).
- The second fix passes dimensions (bits) instead of bytes to NumKong for
  binary-set kernels, so Hamming/Tanimoto process the full vector.
NumKong's `types.h` uses `__has_builtin(...)` guarded by
`defined(__has_builtin)`, which MSVC's legacy preprocessor
fails to parse. The conforming preprocessor handles it correctly.
ashvardanian and others added 14 commits April 6, 2026 12:12
Python 3.8 (EOL Oct 2024) and 3.9 (EOL Oct 2025) wheels were failing
because NumKong requires >=3.9 and has no cp38 wheels on PyPI. Rather
than patching around this, drop all EOL language versions across SDKs:

- Python: minimum 3.10, add `python_requires`, remove 38/39 from CI
- Node.js: engine >=20 (was ~10/>=12, both ancient EOL)
- Go: minimum 1.22 (was 1.19, EOL Aug 2023)
- Java: source/target 21 LTS (was 18 non-LTS, EOL Sep 2022)
- Ubuntu runners: all standardized to 24.04 (was mix of 22.04/latest)
- GCC: drop explicit gcc-12/g++-12 installs (24.04 ships GCC 14)
Co-Authored-By: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Adds three new scalar types across all language bindings:

- UInt8 `u8`: Full input type for unsigned 8-bit integer vectors,
  enabling correct benchmarking on BIGANN/SIFT datasets. Previously,
  `np.uint8` silently mapped to `b1x8` (bit-packed binary), producing
  nonsensical results. Now exposed as a first-class input type in C++,
  C, Rust, Go, Swift, and Objective-C. In Python, available via explicit
  `dtype="u8"` string; `np.uint8` auto-detection remains mapped to `B1`
  pending v3 breaking change. In JavaScript, Java, and C#, exposed as
  a quantization option.

- Float6 `e2m3` / `e3m2`: Internal quantization types from the OCP MX
  v1.0 spec (6-bit micro-floats stored in 8-bit containers). Supported
  as quantization targets in all language bindings, with SIMD-accelerated
  distance kernels delegated to NumKong on x86 (Haswell+, Ice Lake,
  Alder Lake) and ARM (NEON+SDOT). LUT-based upcasts to Float32 for
  all FP8 and FP6 types replace prior arithmetic fallbacks.

Relates to #595, #469, #527, #596, #633
This change to `bench.cpp` makes the memory-map and join benchmarks
opt-in rather than always-on. It also adds CLI parsing for the new
quantization modes and new 100M dataset listings.
Comment thread BENCHMARKS.md Outdated
@ashvardanian ashvardanian merged commit c68469e into main Apr 15, 2026
32 checks passed
Comment thread include/usearch/index.hpp
* graph size, keeping the lock array comfortably within L2/L3 cache.
*/
template <typename allocator_at = std::allocator<byte_t>, std::size_t cache_line_ak = 128> //
class striped_locks_gt {
Nice work. 👍
