Conversation
Have you run the benchmarks for this?

sorry, new here, could you tell me how to run them? 😅

I think `cargo bench -p arrow-cast` should be sufficient.
run benchmark cast_kernels

Thank you @Rafferty97 and @aryan-212 -- I kicked off some benchmark runs to verify the performance implications of this change
alamb left a comment

Thank you for this PR @aryan-212 and (especially) for the help reviewing @Rafferty97
```diff
 impl Parser for Float16Type {
     fn parse(string: &str) -> Option<f16> {
-        lexical_core::parse(string.as_bytes())
+        lexical_core::parse(string.trim().as_bytes())
```
I wonder if there will be any performance implications 🤔
```rust
#[test]
fn test_parse_trimmed_whitespace() {
    // Float types
    assert_eq!(Float16Type::parse(" 1.5 "), Some(f16::from_f32(1.5)));
```
Can you please add some tests with only leading whitespace and some tests with only trailing whitespace?
🤖 Arrow criterion benchmark running (GKE) | trigger

🤖 Arrow criterion benchmark completed (GKE) | trigger | Details
Resource Usage: base (merge-base) vs branch
I don't think there is any significant perf regression, is there?

My reading of this result is that this code is 16% slower when parsing strings to floating point. I will rerun the benchmarks to see if this is reproducible
run benchmark cast_kernels

🤖 Arrow criterion benchmark running (GKE) | trigger

🤖 Arrow criterion benchmark completed (GKE) | trigger | Details
Resource Usage: base (merge-base) vs branch
🤔 same slowdown:
hmmm, should I make the change in some other layer where the performance hit wouldn't be as large? Any pointers if possible?

Well, we can see what the existing DataFusion CLI v52.3.0 does:

```
> select ' 4.5'::float;
Optimizer rule 'simplify_expressions' failed
caused by
Arrow error: Cast error: Cannot cast string ' 4.5' to value of Float32 type

> select trim(' 4.5')::float;
+----------------------+
| btrim(Utf8(" 4.5"))  |
+----------------------+
| 4.5                  |
+----------------------+
1 row(s) fetched.
Elapsed 0.022 seconds.
```

Interestingly, DuckDB does automatically trim:

```
andrewlamb@Andrews-MacBook-Pro-3:~/Software/influxdb_pro/ent$ duckdb
DuckDB v1.5.0 (Variegata)
Enter ".help" for usage hints.
D select ' 4.5'::float;
┌────────────────────────┐
│ CAST(' 4.5' AS FLOAT)  │
│         float          │
├────────────────────────┤
│          4.5           │
└────────────────────────┘
```

As does Postgres 🤔:

```
andrewlamb@Andrews-MacBook-Pro-3:~/Software/influxdb_pro/ent$ psql -h localhost -U postgres
psql (14.22 (Homebrew), server 11.16 (Debian 11.16-1.pgdg90+1))
Type "help" for help.
postgres=# select ' 4.5'::float;
 float8
--------
    4.5
(1 row)
```
Is it possible to improve the performance of trim? For example, trim based only on ASCII whitespace rather than having to do UTF-8 checking?
Which issue does this PR close?

Rationale for this change

The `Parser::parse` implementations for numeric types did not trim whitespace before parsing. This caused values like `" 42 "` or `" 1.5 "` to fail parsing and return `None`, even though they represent valid numbers.

What changes are included in this PR?

- Added `.trim()` calls before parsing in the float type `Parser` implementations.
- Added `string.trim()` at the top of the `parser_primitive!` macro, which covers all integer and duration types.

Are these changes tested?

Yes. Added `test_parse_trimmed_whitespace` covering:

Datafusion changes

For the following SQL:

in datafusion we used to get

now after these changes, we get

this behaviour is now aligned with Databricks

Are there any user-facing changes?

Yes. Numeric parsing now accepts strings with leading/trailing whitespace. This is a relaxation of the previous behaviour (previously `None`, now `Some(value)`), so it is not a breaking change.