[Feature](agg) add agg function entropy#60833
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
5182bf0 to
33d8b77
Compare
There was a problem hiding this comment.
Pull request overview
This pull request adds the entropy aggregate function from DuckDB to Apache Doris. The function calculates Shannon Entropy using a frequency map and computing the empirical distribution function, with entropy measured in bits (base-2 logarithm).
Changes:
- Added backend (C++) implementation for entropy calculation using hash maps for frequency tracking
- Added frontend (Java) function definition and registration in Nereids planner
- Added comprehensive regression tests and unit tests covering various data types and edge cases
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| be/src/vec/aggregate_functions/aggregate_function_entropy.h | Core implementation of entropy calculation logic with support for numeric, string, and generic data types |
| be/src/vec/aggregate_functions/aggregate_function_entropy.cpp | Factory registration for the entropy aggregate function with type dispatching |
| be/src/vec/aggregate_functions/aggregate_function_simple_factory.cpp | Registered entropy function in the aggregate function factory |
| be/test/vec/aggregate_functions/agg_entropy_test.cpp | Unit tests covering numeric, string, generic, nullable, and empty input cases |
| be/test/vec/aggregate_functions/agg_function_test.h | Fixed empty block handling in deserialization tests |
| be/test/testutil/column_helper.h | Enhanced helper to support creating blocks with different column types |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/agg/Entropy.java | Frontend function definition extending NullableAggregateFunction |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/AggregateFunctionVisitor.java | Added visitor method for entropy function |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinAggregateFunctions.java | Registered entropy in builtin aggregate functions catalog |
| regression-test/suites/query_p0/aggregate/aggregate_function_entropy.groovy | Comprehensive regression tests covering all data types, NULL handling, window functions, and edge cases |
| regression-test/data/query_p0/aggregate/aggregate_function_entropy.out | Expected output for regression tests |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/review |
There was a problem hiding this comment.
Code Review Summary: [Feature](agg) add agg function entropy
This PR adds a new aggregate function entropy() that computes Shannon Entropy (in bits, base-2 logarithm) over value distributions. The implementation spans BE (C++ aggregate function with three data paths: numeric, string-hash, generic-hash), FE (Nereids function registration with varargs support), unit tests, and regression tests.
Overall: The implementation is well-structured and follows existing codebase conventions (modeled after uniqExact/multi_distinct_count). The serialization, null handling, empty-input handling, and compile_check usage are all correct. I found a few issues to address:
Issues Found
-
[Medium] Missing
TYPE_IPV4/TYPE_IPV6in numeric type list (aggregate_function_entropy.cpp:37-41): IPV4 and IPV6 types fall through to the generic path (arena serialization + hash), but they are simple fixed-size numeric types with properPrimitiveTypeTraits. Other comparable functions liketopnandapprox_count_distinctinclude them in their numeric type lists. They should be added for efficiency. -
[Low] Unused
isSkewparameter inEntropy.java(Entropy.java:50-52): The private constructor accepts and passesisSkewto the superclass, butwithIsSkew()is never overridden. The baseAggregateFunction.withIsSkew()throws aRuntimeException. If the optimizer ever tries to set skew on an entropy aggregation, it will crash. Since entropy doesn't need skew optimization, consider removing theisSkewparameter (matchingAvg/Variance/Sempattern) or overridingwithIsSkew().
Critical Checkpoint Conclusions
- Goal and correctness: The goal is to add Shannon Entropy as an aggregate function. The implementation correctly computes
-sum(p * log2(p))over the empirical distribution. Tests cover numeric, string, generic (multi-column), nullable, empty, group-by, and window function scenarios. Pass. - Modification scope: Clean and focused — only adds the new function and minimal test infrastructure improvements. Pass.
- Concurrency: No new concurrency concerns. The aggregate function follows the standard single-threaded-per-state model with merge for parallel execution. Pass.
- Lifecycle management: No special lifecycle issues. Arena memory usage in the generic path is consistent with the established pattern (freed at operator close). Pass.
- Configuration items: None added. Pass (N/A).
- Incompatible changes: No format/symbol compatibility issues. New function only. Pass.
- Parallel code paths: The three data paths (numeric, string, generic) are correctly dispatched. Missing IPV4/IPV6 in numeric path noted above. Minor issue.
- Special conditional checks: The
argument_types.size() == 1branch for optimized single-column paths is clear and well-structured. Pass. - Test coverage: Comprehensive regression tests covering all data types, NULL handling, empty input, group-by, window functions, and multi-column usage. BE unit tests cover numeric, string, generic, nullable, and empty cases. FE unit tests are absent but this is consistent with existing aggregate function conventions. Pass.
- Observability: N/A for a simple aggregate function. Pass.
- Transaction/persistence: N/A. Pass.
- FE-BE variable passing: N/A — uses existing varargs aggregate function infrastructure. Pass.
- Performance: Hash-based counting with XXH128 for string/generic paths is the same approach used by
uniqExact. The Arena memory waste in the generic path (serialized data accumulated but only hashes retained) is a known limitation shared by other aggregate functions. No new performance concerns beyond the IPV4/IPV6 optimization opportunity noted. Pass with minor note. - Other issues: The
compile_check_begin.h/compile_check_end.hpairing is correct. No narrowing conversion risks. Serialization format follows the establishedwrite_var_uint/write_binaryconvention. Pass.
fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/agg/Entropy.java
Outdated
Show resolved
Hide resolved
69630db to
c88684f
Compare
c88684f to
be3a585
Compare
|
/review |
There was a problem hiding this comment.
Code Review Summary: Feature add agg function entropy
Overall this is a clean, well-structured addition of a new aggregate function. The implementation follows established patterns (modeled after aggregate_function_uniq.h) and includes good test coverage across BE unit tests, FE registration, and regression tests. Below are the findings organized by the critical checkpoints from AGENTS.md/code-review skill.
Critical Checkpoint Conclusions
1. Goal & Correctness: The PR adds a Shannon entropy aggregate function. The implementation correctly computes H = -Σ p_i * log2(p_i) from a frequency map. The empty-set case returns 0.0 (no divide-by-zero). The hash-only approach for strings/generic types is consistent with the pattern used by aggregate_function_uniq.h. Tests prove the function works.
2. Modification size & focus: The change is well-scoped — 11 files, all directly related to the new function. The column_helper.h and agg_function_test.h changes are reasonable utility improvements needed for the new tests.
3. Concurrency: No concurrency concerns — this is a standard aggregate function with no shared state.
4. Lifecycle management: No special lifecycle concerns. The flat_hash_map in the Data struct is value-owned and properly cleared on reset().
5. Configuration items: None added. N/A.
6. Incompatible changes: No storage format changes. The function is purely new, no compatibility concerns.
7. Parallel code paths: The three-tier data structure (numeric, string, generic) correctly covers all type paths. Single-arg complex types (ARRAY/MAP/STRUCT) properly fall through to the generic handler.
8. Test coverage: Good. BE unit tests cover numeric, string, generic (multi-column), nullable, and empty-set cases. Regression tests cover all data types including complex types, GROUP BY, window functions, and empty results. No FE unit test, but this is consistent with the codebase (no aggregate function in this package has dedicated FE unit tests).
9. Observability: N/A for a simple aggregate function.
10. Transaction/persistence: N/A.
11. Data writes: N/A.
12. FE-BE variable passing: N/A — no new thrift variables.
13. Performance: The implementation is efficient. Uses flat_hash_map (cache-friendly), CRC32 hash for numeric types, and avoids double-hashing for string types (XXH128 -> UInt128TrivialHash). get_result() iterates twice over the map (once for total_count, once for entropy calculation), which could be combined into a single pass, but this is a minor optimization. The reserve() calls in merge() and read() are good for avoiding rehashing.
14. Other issues: See inline comments for two minor issues found.
Issues Found
| # | Severity | File | Description |
|---|---|---|---|
| 1 | Minor (convention) | aggregate_function_entropy.h:144 |
Missing final keyword on AggregateFunctionEntropy class. Nearly all (46+) other aggregate function classes use final. |
| 2 | Suggestion | aggregate_function_entropy.h:86-97 |
get_result() iterates over frequency_map twice (once for total_count, once for entropy). Could be done in a single pass for better cache efficiency on large maps. |
regression-test/suites/query_p0/aggregate/aggregate_function_entropy.groovy
Outdated
Show resolved
Hide resolved
regression-test/suites/query_p0/aggregate/aggregate_function_entropy.groovy
Show resolved
Hide resolved
|
run buildall |
TPC-H: Total hot run time: 26981 ms |
TPC-DS: Total hot run time: 169384 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 26916 ms |
TPC-DS: Total hot run time: 167661 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
add new aggregate function entropy
Issue Number: #48203
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)