
feat: add pcre2 as optional feature#1959

Open
wheynelau wants to merge 6 commits into huggingface:main from wheynelau:perf-pcre2

Conversation

@wheynelau
Contributor

@wheynelau wheynelau commented Mar 2, 2026

Motivation: While doing performance profiling, I noticed onig showing up in the profiles and tried swapping it for pcre2. Happy to get some feedback; I'm not deeply familiar with the tradeoffs.

I have validated that all tests pass, and the benchmarks show that it is faster for the GPT2 and Llama3 models:

| Benchmark | main (onig) | pcre2 |
| --- | --- | --- |
| bpe-encode/BPE GPT2 encode | 1705.7±19.00ms, 3.6 MB/sec | 1422.8±10.89ms, 4.3 MB/sec |
| llama3-encode/llama3-encode | 1912.2±21.94ms, 3.2 MB/sec | 1601.6±5.81ms, 3.9 MB/sec |
| bpe-encode/BPE GPT2 encode, no cache | 2.5±0.04s, 2.5 MB/sec | 2.1±0.02s, 2.9 MB/sec |
| llama3-encode/llama3-offsets | 257.9±7.03ms, 24.0 MB/sec | 240.8±3.05ms, 25.7 MB/sec |
| llama3-encode/llama3-batch | 340.9±5.99ms, 18.2 MB/sec | 319.3±3.05ms, 19.4 MB/sec |

Commands used:

```shell
cargo bench --no-default-features --features onig,progressbar,esaxx_fast -- --save-baseline main
cargo bench --no-default-features --features pcre2,progressbar,esaxx_fast -- --save-baseline pcre2
```
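For context, making a regex backend swappable like this usually comes down to optional dependencies plus mutually exclusive features in `Cargo.toml`. The sketch below is an illustration of that pattern, not the PR's actual diff; the version numbers and the exact feature list are assumptions:

```toml
[dependencies]
# Both regex backends are optional; the build picks one via feature flags.
onig = { version = "6", optional = true, default-features = false }
pcre2 = { version = "0.2", optional = true }

[features]
# "onig" stays the default backend; "pcre2" is opt-in.
default = ["onig", "progressbar", "esaxx_fast"]
onig = ["dep:onig"]
pcre2 = ["dep:pcre2"]
```

With a layout like this, the benchmark commands above select a backend by disabling default features and re-enabling everything except the other regex crate.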

Based on `perf`, these were my CPU samples:

| func | onig | pcre2 |
| --- | --- | --- |
| NormalizedString::split | 5.58% | 1.44% |

Contributor

@michaelfeil michaelfeil left a comment


This looks good to me; I would recommend adding it.

@michaelfeil
Contributor

Ran similar benchmarks in #1968 and noticed that the reason the speedup looks so small is that we did not add benchmarks that capture long context + parallel encoding. Perf should be a bit better on longer contexts.

@michaelfeil
Contributor

I also think this should be able to cross-compile via PyO3, but I'm not fully sure. Similar to other regex backend trade-offs.

Collaborator

@ArthurZucker ArthurZucker left a comment


LGTM, but can we add some doc / something so users know they can potentially get better perf with this? 🤗

I'll investigate the tradeoffs, as it might be worth defaulting to it!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@wheynelau
Contributor Author

@ArthurZucker I have added it to lib.rs so that it shows up on docs.rs; let me know if you'd like it documented elsewhere (e.g. the Python README).

I have also added it to the Python bindings for `maturin develop`, but let me know if that's out of scope here; I'm happy to revert or make any necessary changes.

