
Float8, Performance, Correctness, & Portability!#732

Merged
ashvardanian merged 68 commits into main from main-dev
Apr 15, 2026
Conversation

@ashvardanian (Contributor) commented Mar 14, 2026

This "minor" release is equally important for performance, correctness, and portability!

Crow-de-a-d and others added 30 commits February 19, 2026 14:03
…ator_gt` (#608)

The previous `total_allocated()` reverse-engineered the total by dividing
`last_capacity_` by the capacity multiplier in a loop. This was inaccurate
when `ceil2(extended_bytes)` produced larger-than-expected arena sizes,
causing the reverse calculation to undercount.

Add a `total_allocated_` member that is incremented on each arena
allocation and returned directly, making the method O(1) and exact.
Add `memory_stats_t` to `index_dense_gt` returning separate allocated,
wasted, and reserved byte counts for both the graph tape allocator
(alignment 64) and the vectors tape allocator (alignment 8).
Add `MemoryStats` shared struct to the cxx bridge and wire it through
`NativeIndex::memory_stats()` in C++ to `Index::memory_stats()` in Rust,
giving Rust callers per-tape allocated/wasted/reserved breakdowns.
Pre-allocate the vector buffer before calling `typed_->add` or `typed_->update`.
If allocation fails, return early with the same "Out of memory!" error used 
throughout the rest of the codebase, before any graph mutation occurs.

Co-authored-by: David Ivekovic <88043717+DavIvek@users.noreply.github.com>
* Fix: Avoid duplicate neighbor slots in HNSW reverse links
* Improve: Use `std::find` instead of `std::all_of` for neighbor presence check in `index_gt::form_reverse_links_`
* Fix: Use `std::find_if` for close_header neighbor check
Correct "Cosine Similarity" to "Cosine Distance" and fix mismatched
parentheses in the formula.
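For context, cosine distance is the complement of cosine similarity:

```latex
d_{\cos}(a, b) = 1 - \frac{a \cdot b}{\|a\| \, \|b\|}
```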
Float16 is unavailable on x86_64 macOS regardless of OS version.
The @available annotation added in PR #610 is insufficient because
Float16 is a type-level absence on x86_64, not a runtime
availability issue.

The sugar file (USearchIndex+Sugar.swift) already has the correct
#if arch(arm64) guard. This commit adds the same guard to the
core USearchIndex.swift file for consistency.

Fixes #589
The release archives contained files under build directory prefixes
(build_release/, build_artifacts/, c/) instead of flat at the root.
This broke the documented install instructions for Go and C users.

- macOS/Android: use `zip -j` to strip directory paths
- Windows: stage files into a flat `pkg/` dir before tarring
- WASM: copy header into build dir before archiving

Co-authored-by: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com>
Closes #715

Co-Authored-By: David Ivekovic <88043717+DavIvek@users.noreply.github.com>
Co-Authored-By: Björn Blissing <1640024+bjornblissing@users.noreply.github.com>
Co-Authored-By: Jude Payne <141360+judepayne@users.noreply.github.com>
Co-Authored-By: Anas Limem <160512789+anaslimem@users.noreply.github.com>
Co-Authored-By: Markus Graf <24669860+markusalbertgraf@users.noreply.github.com>

Closes #697

The equal_range iterator in flat_hash_multi_set_gt did not check the
deleted flag when advancing. After add/remove cycles that create
tombstones with the same key, the iterator would yield deleted entries
as live matches. In index_dense_gt::remove() this caused already-freed
slots to be pushed to free_keys_ again, inflating free_keys_.size()
and producing unsigned underflow in size().

Reproduces in both multi and non-multi mode (single-threaded) whenever
two keys hash-collide and one tombstone is reused by a different key.

Co-authored-by: Ash Vardanian <1983160+ashvardanian@users.noreply.github.com>
Co-Authored-By: David Ivekovic <88043717+DavIvek@users.noreply.github.com>
On Windows MSVC, `RAND_MAX` is 32767 while `INT_MAX` is ~2.1 billion.
Dividing `std::rand()` by `INT_MAX` therefore produced values no larger
than ~1.5e-5, below the E4M3 FP8 smallest subnormal (1/512), causing
all-zero vectors and a crash in HNSW graph construction
(STATUS_STACK_BUFFER_OVERRUN).
- The first fix (default: return false) prevents unsupported metrics like
  Sorensen from being dispatched through a wrong NumKong kernel (causing NaN).
- The second fix passes dimensions (bits) instead of bytes to NumKong for
  binary-set kernels, so Hamming/Tanimoto process the full vector.
NumKong's `types.h` uses `__has_builtin(...)` guarded by
`defined(__has_builtin)`, which MSVC's legacy preprocessor
fails to parse. The conforming preprocessor handles it correctly.
ashvardanian and others added 14 commits April 6, 2026 12:12
Python 3.8 (EOL Oct 2024) and 3.9 (EOL Oct 2025) wheels were failing
because NumKong requires >=3.9 and has no cp38 wheels on PyPI. Rather
than patching around this, drop all EOL language versions across SDKs:

- Python: minimum 3.10, add `python_requires`, remove 38/39 from CI
- Node.js: engine >=20 (was ~10/>=12, both ancient EOL)
- Go: minimum 1.22 (was 1.19, EOL Aug 2023)
- Java: source/target 21 LTS (was 18 non-LTS, EOL Sep 2022)
- Ubuntu runners: all standardized to 24.04 (was mix of 22.04/latest)
- GCC: drop explicit gcc-12/g++-12 installs (24.04 ships GCC 14)
Co-Authored-By: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Adds three new scalar types across all language bindings:

- UInt8 `u8`: Full input type for unsigned 8-bit integer vectors,
  enabling correct benchmarking on BIGANN/SIFT datasets. Previously,
  `np.uint8` silently mapped to `b1x8` (bit-packed binary), producing
  nonsensical results. Now exposed as a first-class input type in C++,
  C, Rust, Go, Swift, and Objective-C. In Python, available via explicit
  `dtype="u8"` string; `np.uint8` auto-detection remains mapped to `B1`
  pending v3 breaking change. In JavaScript, Java, and C#, exposed as
  a quantization option.

- Float6 `e2m3` / `e3m2`: Internal quantization types from the OCP MX
  v1.0 spec (6-bit micro-floats stored in 8-bit containers). Supported
  as quantization targets in all language bindings, with SIMD-accelerated
  distance kernels delegated to NumKong on x86 (Haswell+, Ice Lake,
  Alder Lake) and ARM (NEON+SDOT). LUT-based upcasts to Float32 for
  all FP8 and FP6 types replace prior arithmetic fallbacks.

Relates to #595, #469, #527, #596, #633
This change to `bench.cpp` makes the memory-map and join benchmarks
opt-in rather than always-on. It also adds CLI parsing for the new
quantization modes and new 100M dataset listings.
Comment thread BENCHMARKS.md Outdated
@ashvardanian ashvardanian merged commit c68469e into main Apr 15, 2026
32 checks passed
Comment thread include/usearch/index.hpp
* graph size, keeping the lock array comfortably within L2/L3 cache.
*/
template <typename allocator_at = std::allocator<byte_t>, std::size_t cache_line_ak = 128> //
class striped_locks_gt {
Nice work. 👍
