[Feature](agg) add agg function entropy #60833

Open
wrlcke wants to merge 6 commits into apache:master from wrlcke:functions/entropy

@wrlcke
Contributor

@wrlcke wrlcke commented Feb 25, 2026

What problem does this PR solve?

Add a new aggregate function, entropy.

Issue Number: #48203

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Feb 25, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull request overview

This pull request adds the entropy aggregate function, modeled on the one in DuckDB, to Apache Doris. The function calculates Shannon entropy by building a frequency map of the input values and deriving the empirical distribution from it, with entropy measured in bits (base-2 logarithm).
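
As a rough illustration of the calculation described above, the following is a minimal sketch and not the code in this PR. The function name `shannon_entropy_bits` and the use of `std::unordered_map` are assumptions made for the example; the PR's own implementation lives in `aggregate_function_entropy.h`.

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Shannon entropy (in bits) of the empirical distribution of `values`.
// Hypothetical helper for illustration; not the function added by this PR.
inline double shannon_entropy_bits(const std::vector<int64_t>& values) {
    // An empty input has no distribution; return 0.0 instead of dividing by zero.
    if (values.empty()) {
        return 0.0;
    }

    // Count how often each distinct value occurs.
    std::unordered_map<int64_t, uint64_t> frequency;
    for (int64_t v : values) {
        ++frequency[v];
    }

    // Turn counts into probabilities and accumulate -p * log2(p).
    const double total = static_cast<double>(values.size());
    double entropy = 0.0;
    for (const auto& entry : frequency) {
        const double p = static_cast<double>(entry.second) / total;
        entropy -= p * std::log2(p);
    }
    return entropy;
}
```

For the input `{1, 1, 2, 3}` the probabilities are 1/2, 1/4 and 1/4, so the sketch yields 1.5 bits; a column holding a single repeated value yields 0.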

Changes:

  • Added backend (C++) implementation for entropy calculation using hash maps for frequency tracking
  • Added frontend (Java) function definition and registration in Nereids planner
  • Added comprehensive regression tests and unit tests covering various data types and edge cases

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| be/src/vec/aggregate_functions/aggregate_function_entropy.h | Core implementation of entropy calculation logic with support for numeric, string, and generic data types |
| be/src/vec/aggregate_functions/aggregate_function_entropy.cpp | Factory registration for the entropy aggregate function with type dispatching |
| be/src/vec/aggregate_functions/aggregate_function_simple_factory.cpp | Registered entropy function in the aggregate function factory |
| be/test/vec/aggregate_functions/agg_entropy_test.cpp | Unit tests covering numeric, string, generic, nullable, and empty input cases |
| be/test/vec/aggregate_functions/agg_function_test.h | Fixed empty block handling in deserialization tests |
| be/test/testutil/column_helper.h | Enhanced helper to support creating blocks with different column types |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/functions/agg/Entropy.java | Frontend function definition extending NullableAggregateFunction |
| fe/fe-core/src/main/java/org/apache/doris/nereids/trees/expressions/visitor/AggregateFunctionVisitor.java | Added visitor method for entropy function |
| fe/fe-core/src/main/java/org/apache/doris/catalog/BuiltinAggregateFunctions.java | Registered entropy in builtin aggregate functions catalog |
| regression-test/suites/query_p0/aggregate/aggregate_function_entropy.groovy | Comprehensive regression tests covering all data types, NULL handling, window functions, and edge cases |
| regression-test/data/query_p0/aggregate/aggregate_function_entropy.out | Expected output for regression tests |

@zclllyybb
Contributor

/review

Contributor

@github-actions github-actions bot left a comment

Code Review Summary: [Feature](agg) add agg function entropy

This PR adds a new aggregate function entropy() that computes Shannon Entropy (in bits, base-2 logarithm) over value distributions. The implementation spans BE (C++ aggregate function with three data paths: numeric, string-hash, generic-hash), FE (Nereids function registration with varargs support), unit tests, and regression tests.

Overall: The implementation is well-structured and follows existing codebase conventions (modeled after uniqExact/multi_distinct_count). The serialization, null handling, empty-input handling, and compile_check usage are all correct. I found a few issues to address:


Issues Found

  1. [Medium] Missing TYPE_IPV4/TYPE_IPV6 in numeric type list (aggregate_function_entropy.cpp:37-41): IPV4 and IPV6 types fall through to the generic path (arena serialization + hash), but they are simple fixed-size numeric types with proper PrimitiveTypeTraits. Other comparable functions like topn and approx_count_distinct include them in their numeric type lists. They should be added for efficiency.

  2. [Low] Unused isSkew parameter in Entropy.java (Entropy.java:50-52): The private constructor accepts and passes isSkew to the superclass, but withIsSkew() is never overridden. The base AggregateFunction.withIsSkew() throws a RuntimeException. If the optimizer ever tries to set skew on an entropy aggregation, it will crash. Since entropy doesn't need skew optimization, consider removing the isSkew parameter (matching Avg/Variance/Sem pattern) or overriding withIsSkew().


Critical Checkpoint Conclusions

  • Goal and correctness: The goal is to add Shannon Entropy as an aggregate function. The implementation correctly computes -sum(p * log2(p)) over the empirical distribution. Tests cover numeric, string, generic (multi-column), nullable, empty, group-by, and window function scenarios. Pass.
  • Modification scope: Clean and focused — only adds the new function and minimal test infrastructure improvements. Pass.
  • Concurrency: No new concurrency concerns. The aggregate function follows the standard single-threaded-per-state model with merge for parallel execution. Pass.
  • Lifecycle management: No special lifecycle issues. Arena memory usage in the generic path is consistent with the established pattern (freed at operator close). Pass.
  • Configuration items: None added. Pass (N/A).
  • Incompatible changes: No format/symbol compatibility issues. New function only. Pass.
  • Parallel code paths: The three data paths (numeric, string, generic) are correctly dispatched. Missing IPV4/IPV6 in numeric path noted above. Minor issue.
  • Special conditional checks: The argument_types.size() == 1 branch for optimized single-column paths is clear and well-structured. Pass.
  • Test coverage: Comprehensive regression tests covering all data types, NULL handling, empty input, group-by, window functions, and multi-column usage. BE unit tests cover numeric, string, generic, nullable, and empty cases. FE unit tests are absent but this is consistent with existing aggregate function conventions. Pass.
  • Observability: N/A for a simple aggregate function. Pass.
  • Transaction/persistence: N/A. Pass.
  • FE-BE variable passing: N/A — uses existing varargs aggregate function infrastructure. Pass.
  • Performance: Hash-based counting with XXH128 for string/generic paths is the same approach used by uniqExact. The Arena memory waste in the generic path (serialized data accumulated but only hashes retained) is a known limitation shared by other aggregate functions. No new performance concerns beyond the IPV4/IPV6 optimization opportunity noted. Pass with minor note.
  • Other issues: The compile_check_begin.h/compile_check_end.h pairing is correct. No narrowing conversion risks. Serialization format follows the established write_var_uint/write_binary convention. Pass.
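
To make the hash-only counting and the merge step for parallel execution more concrete, here is a minimal sketch under stated assumptions. `EntropyState` and its members are hypothetical names, and `std::hash` stands in for the 128-bit XXH128 hash mentioned above only to keep the example dependency-free; a 64-bit hash carries a noticeably higher collision risk than a 128-bit one.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <unordered_map>

// Hypothetical per-group aggregation state; names are illustrative only.
struct EntropyState {
    // Maps the hash of a value to its occurrence count. Only the hash is kept,
    // so memory use does not depend on the length of the strings.
    std::unordered_map<std::size_t, uint64_t> counts;

    // Assumption: std::hash replaces XXH128 purely for the sake of the sketch.
    void add(std::string_view value) {
        ++counts[std::hash<std::string_view>{}(value)];
    }

    // Fold another partial state into this one by summing counts per key.
    void merge(const EntropyState& other) {
        counts.reserve(counts.size() + other.counts.size());
        for (const auto& entry : other.counts) {
            counts[entry.first] += entry.second;
        }
    }

    // Entropy in bits of the distribution accumulated so far.
    double result() const {
        uint64_t total = 0;
        for (const auto& entry : counts) {
            total += entry.second;
        }
        if (total == 0) {
            return 0.0;  // empty input
        }
        double entropy = 0.0;
        for (const auto& entry : counts) {
            const double p = static_cast<double>(entry.second) / static_cast<double>(total);
            entropy -= p * std::log2(p);
        }
        return entropy;
    }
};
```

Because `merge` only adds counts, the order in which partial states are combined does not change the final counts, which is the property a parallel aggregation relies on.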

@wrlcke wrlcke force-pushed the functions/entropy branch from 69630db to c88684f on March 9, 2026 22:12
@wrlcke wrlcke force-pushed the functions/entropy branch from c88684f to be3a585 on March 9, 2026 22:22
@zclllyybb
Contributor

/review

Contributor

@github-actions github-actions bot left a comment

Code Review Summary: [Feature](agg) add agg function entropy

Overall this is a clean, well-structured addition of a new aggregate function. The implementation follows established patterns (modeled after aggregate_function_uniq.h) and includes good test coverage across BE unit tests, FE registration, and regression tests. Below are the findings organized by the critical checkpoints from AGENTS.md/code-review skill.

Critical Checkpoint Conclusions

1. Goal & Correctness: The PR adds a Shannon entropy aggregate function. The implementation correctly computes H = -Σ p_i * log2(p_i) from a frequency map. The empty-set case returns 0.0 (no divide-by-zero). The hash-only approach for strings/generic types is consistent with the pattern used by aggregate_function_uniq.h. Tests prove the function works.

2. Modification size & focus: The change is well-scoped — 11 files, all directly related to the new function. The column_helper.h and agg_function_test.h changes are reasonable utility improvements needed for the new tests.

3. Concurrency: No concurrency concerns — this is a standard aggregate function with no shared state.

4. Lifecycle management: No special lifecycle concerns. The flat_hash_map in the Data struct is value-owned and properly cleared on reset().

5. Configuration items: None added. N/A.

6. Incompatible changes: No storage format changes. The function is purely new, no compatibility concerns.

7. Parallel code paths: The three-tier data structure (numeric, string, generic) correctly covers all type paths. Single-arg complex types (ARRAY/MAP/STRUCT) properly fall through to the generic handler.

8. Test coverage: Good. BE unit tests cover numeric, string, generic (multi-column), nullable, and empty-set cases. Regression tests cover all data types including complex types, GROUP BY, window functions, and empty results. No FE unit test, but this is consistent with the codebase (no aggregate function in this package has dedicated FE unit tests).

9. Observability: N/A for a simple aggregate function.

10. Transaction/persistence: N/A.

11. Data writes: N/A.

12. FE-BE variable passing: N/A — no new thrift variables.

13. Performance: The implementation is efficient. Uses flat_hash_map (cache-friendly), CRC32 hash for numeric types, and avoids double-hashing for string types (XXH128 -> UInt128TrivialHash). get_result() iterates twice over the map (once for total_count, once for entropy calculation), which could be combined into a single pass, but this is a minor optimization. The reserve() calls in merge() and read() are good for avoiding rehashing.

14. Other issues: See inline comments for two minor issues found.
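
For reference, the formula in checkpoint 1 can be rewritten in terms of the raw counts, which is what makes a single pass over the map possible. The symbols c_i (count of the i-th distinct value) and N (total row count) are introduced here for illustration and do not appear in the PR.

```latex
\begin{aligned}
H &= -\sum_i p_i \log_2 p_i, \qquad p_i = \frac{c_i}{N}, \quad N = \sum_i c_i \\
  &= -\frac{1}{N} \sum_i c_i \left( \log_2 c_i - \log_2 N \right) \\
  &= \log_2 N - \frac{1}{N} \sum_i c_i \log_2 c_i
\end{aligned}
```

As a worked check, counts (2, 1, 1) give N = 4 and H = log2(4) - (2*log2(2) + 1*log2(1) + 1*log2(1)) / 4 = 2 - 0.5 = 1.5 bits, the same value obtained from the probabilities 1/2, 1/4, 1/4 directly.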

Issues Found

| # | Severity | File | Description |
| --- | --- | --- | --- |
| 1 | Minor (convention) | aggregate_function_entropy.h:144 | Missing final keyword on AggregateFunctionEntropy class. Nearly all (46+) other aggregate function classes use final. |
| 2 | Suggestion | aggregate_function_entropy.h:86-97 | get_result() iterates over frequency_map twice (once for total_count, once for entropy). Could be done in a single pass for better cache efficiency on large maps. |
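
One way suggestion 2 could look is sketched below. It relies on the identity H = log2(N) - (1/N) * sum(c * log2(c)), where c is a per-value count and N the total. `entropy_single_pass` is a hypothetical free function over a `std::unordered_map`, not the PR's `get_result()`, and the clamp to zero is an added assumption to absorb floating-point rounding.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <unordered_map>

// Single-pass entropy (in bits) over a frequency map.
// Hypothetical sketch; the map type and function name are assumptions.
inline double entropy_single_pass(const std::unordered_map<int64_t, uint64_t>& frequency_map) {
    uint64_t total_count = 0;
    double weighted_log_sum = 0.0;  // accumulates c * log2(c)
    for (const auto& entry : frequency_map) {
        if (entry.second == 0) {
            continue;  // a zero count contributes nothing and would yield 0 * -inf
        }
        const double count = static_cast<double>(entry.second);
        total_count += entry.second;
        weighted_log_sum += count * std::log2(count);
    }
    if (total_count == 0) {
        return 0.0;  // empty input
    }
    const double total = static_cast<double>(total_count);
    // Rounding can leave a tiny negative value when there is one distinct key.
    return std::max(0.0, std::log2(total) - weighted_log_sum / total);
}
```

The single-pass form can differ from the two-pass form in the last few bits of the result, so it may be worth checking against the expected regression output before switching.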

@linrrzqqq
Contributor

run buildall

@doris-robot

TPC-H: Total hot run time: 26981 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4ef69af34e68c432ec507db3d432690b9cce55ed, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17654	4517	4300	4300
q2	q3	10646	762	508	508
q4	4676	369	253	253
q5	7551	1213	1018	1018
q6	176	178	150	150
q7	809	845	667	667
q8	9756	1471	1337	1337
q9	5107	4780	4663	4663
q10	6321	1885	1652	1652
q11	463	263	258	258
q12	740	573	482	482
q13	18043	2970	2182	2182
q14	232	230	223	223
q15	q16	746	731	670	670
q17	731	851	448	448
q18	5979	5411	5313	5313
q19	1249	991	598	598
q20	553	481	384	384
q21	4480	1848	1601	1601
q22	431	325	274	274
Total cold run time: 96343 ms
Total hot run time: 26981 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4759	4725	4676	4676
q2	q3	3904	4346	3856	3856
q4	884	1230	814	814
q5	4167	4437	4430	4430
q6	178	172	144	144
q7	1772	1632	1486	1486
q8	2512	2721	2636	2636
q9	7651	7055	7432	7055
q10	3834	4081	3635	3635
q11	494	451	417	417
q12	484	598	462	462
q13	2683	3298	2323	2323
q14	400	419	313	313
q15	q16	708	783	706	706
q17	1162	1410	1443	1410
q18	6985	6727	6573	6573
q19	922	872	924	872
q20	2057	2145	2012	2012
q21	4027	3620	3337	3337
q22	452	433	388	388
Total cold run time: 50035 ms
Total hot run time: 47545 ms

@doris-robot

TPC-DS: Total hot run time: 169384 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4ef69af34e68c432ec507db3d432690b9cce55ed, data reload: false

query5	4344	653	536	536
query6	325	241	221	221
query7	4238	477	272	272
query8	354	282	244	244
query9	8726	2775	2741	2741
query10	522	400	345	345
query11	6976	5118	4902	4902
query12	182	133	124	124
query13	1268	468	351	351
query14	5685	3700	3500	3500
query14_1	2869	2863	2907	2863
query15	203	196	177	177
query16	975	449	442	442
query17	913	747	639	639
query18	2449	460	356	356
query19	210	212	192	192
query20	136	134	131	131
query21	214	136	119	119
query22	13311	14186	14501	14186
query23	16311	15910	15774	15774
query23_1	16141	15790	16306	15790
query24	7267	1645	1257	1257
query24_1	1271	1267	1269	1267
query25	570	498	528	498
query26	1244	277	158	158
query27	2754	493	292	292
query28	4483	1866	1852	1852
query29	822	575	483	483
query30	296	224	190	190
query31	1040	945	900	900
query32	80	72	70	70
query33	514	331	281	281
query34	921	872	529	529
query35	642	679	609	609
query36	1074	1143	951	951
query37	140	102	89	89
query38	2963	2985	2877	2877
query39	864	823	817	817
query39_1	832	821	792	792
query40	226	153	135	135
query41	63	61	58	58
query42	265	256	260	256
query43	250	259	220	220
query44	
query45	202	191	182	182
query46	889	988	598	598
query47	2124	3023	2054	2054
query48	322	322	233	233
query49	634	462	378	378
query50	677	295	218	218
query51	4109	4120	3982	3982
query52	265	271	258	258
query53	295	330	294	294
query54	296	276	274	274
query55	95	83	86	83
query56	317	310	322	310
query57	1950	1743	1705	1705
query58	290	274	269	269
query59	2805	2951	2722	2722
query60	348	342	327	327
query61	152	148	152	148
query62	633	579	537	537
query63	313	287	283	283
query64	4938	1244	956	956
query65	
query66	1464	459	358	358
query67	24375	24586	24273	24273
query68	
query69	398	326	287	287
query70	992	990	943	943
query71	332	316	313	313
query72	2943	2684	2417	2417
query73	552	555	315	315
query74	9611	9542	9433	9433
query75	2863	2758	2477	2477
query76	2277	1037	691	691
query77	362	391	316	316
query78	10879	11186	10474	10474
query79	2357	779	577	577
query80	1747	629	559	559
query81	553	268	221	221
query82	1013	149	132	132
query83	334	260	249	249
query84	299	116	93	93
query85	903	500	446	446
query86	418	306	308	306
query87	3156	3166	3020	3020
query88	3595	2661	2645	2645
query89	432	375	345	345
query90	2012	186	186	186
query91	169	161	137	137
query92	82	75	71	71
query93	1029	877	503	503
query94	626	322	291	291
query95	584	341	389	341
query96	637	519	233	233
query97	2437	2471	2431	2431
query98	241	221	217	217
query99	1038	1009	902	902
Total cold run time: 251330 ms
Total hot run time: 169384 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.70% (19753/37479)
Line Coverage 36.26% (184396/508538)
Region Coverage 32.36% (142254/439594)
Branch Coverage 33.57% (62205/185281)

@zclllyybb
Contributor

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 5.56% (1/18) 🎉
Increment coverage report
Complete coverage report

@doris-robot

TPC-H: Total hot run time: 26916 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9b7d7b3a7fd662d34a1d23ade4ee79aeac35f263, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17594	4486	4299	4299
q2	q3	10640	786	528	528
q4	4680	352	249	249
q5	7560	1197	1021	1021
q6	188	183	146	146
q7	781	892	670	670
q8	9301	1480	1345	1345
q9	4827	4793	4729	4729
q10	6308	1911	1652	1652
q11	471	265	244	244
q12	729	593	474	474
q13	18033	2950	2181	2181
q14	226	236	210	210
q15	q16	768	749	683	683
q17	740	842	451	451
q18	6102	5414	5273	5273
q19	1121	986	628	628
q20	539	503	390	390
q21	4444	1842	1450	1450
q22	532	385	293	293
Total cold run time: 95584 ms
Total hot run time: 26916 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4672	4702	4561	4561
q2	q3	3903	4370	3866	3866
q4	890	1223	796	796
q5	4100	4383	4334	4334
q6	185	180	143	143
q7	1729	1669	1534	1534
q8	2505	2737	2566	2566
q9	7670	7446	7495	7446
q10	3774	4034	3588	3588
q11	533	433	410	410
q12	480	596	459	459
q13	2762	3161	2346	2346
q14	295	315	296	296
q15	q16	745	761	711	711
q17	1169	1367	1391	1367
q18	7113	6942	6727	6727
q19	926	906	908	906
q20	2077	2153	2007	2007
q21	4068	3462	3397	3397
q22	467	462	399	399
Total cold run time: 50063 ms
Total hot run time: 47859 ms

@doris-robot

TPC-DS: Total hot run time: 167661 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9b7d7b3a7fd662d34a1d23ade4ee79aeac35f263, data reload: false

query5	4329	637	498	498
query6	342	222	235	222
query7	4217	459	273	273
query8	336	230	229	229
query9	8716	2744	2751	2744
query10	523	405	344	344
query11	6982	5070	4874	4874
query12	183	133	130	130
query13	1275	481	351	351
query14	5702	3691	3437	3437
query14_1	2822	2802	2821	2802
query15	215	196	176	176
query16	975	460	447	447
query17	913	724	647	647
query18	2448	455	355	355
query19	214	212	188	188
query20	139	132	125	125
query21	214	135	108	108
query22	13220	13947	14690	13947
query23	16179	16003	15738	15738
query23_1	15622	15716	15302	15302
query24	7228	1613	1215	1215
query24_1	1218	1224	1253	1224
query25	537	459	397	397
query26	1236	265	153	153
query27	2781	473	306	306
query28	4489	1824	1837	1824
query29	880	564	474	474
query30	301	221	188	188
query31	998	942	871	871
query32	86	72	69	69
query33	506	323	288	288
query34	904	873	532	532
query35	650	685	595	595
query36	1078	1117	969	969
query37	126	95	88	88
query38	2943	2927	2867	2867
query39	857	841	799	799
query39_1	788	805	812	805
query40	231	160	132	132
query41	62	58	58	58
query42	267	260	250	250
query43	236	246	214	214
query44	
query45	198	191	181	181
query46	879	974	612	612
query47	2122	2151	2070	2070
query48	315	318	229	229
query49	660	461	382	382
query50	693	278	205	205
query51	4075	4064	3980	3980
query52	264	267	254	254
query53	282	332	286	286
query54	293	274	267	267
query55	102	85	85	85
query56	319	316	306	306
query57	1910	1793	1698	1698
query58	284	280	271	271
query59	2796	2943	2764	2764
query60	335	342	318	318
query61	155	149	148	148
query62	636	590	534	534
query63	304	274	270	270
query64	5142	1261	1006	1006
query65	
query66	1500	457	357	357
query67	24209	24325	24161	24161
query68	
query69	408	305	284	284
query70	951	962	946	946
query71	328	301	302	301
query72	2807	2665	2166	2166
query73	548	544	319	319
query74	9650	9582	9473	9473
query75	2861	2731	2457	2457
query76	2279	1028	700	700
query77	363	376	303	303
query78	11003	11297	10501	10501
query79	1085	836	580	580
query80	688	632	528	528
query81	486	269	232	232
query82	1365	151	118	118
query83	369	268	250	250
query84	297	114	95	95
query85	846	475	427	427
query86	370	308	332	308
query87	3151	3165	2968	2968
query88	3596	2663	2664	2663
query89	418	371	340	340
query90	2021	177	165	165
query91	173	157	136	136
query92	80	73	70	70
query93	904	878	483	483
query94	458	325	283	283
query95	598	410	320	320
query96	654	529	230	230
query97	2527	2492	2408	2408
query98	233	241	220	220
query99	1001	972	925	925
Total cold run time: 248275 ms
Total hot run time: 167661 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.68% (19742/37475)
Line Coverage 36.27% (184435/508558)
Region Coverage 32.36% (142178/439313)
Branch Coverage 33.56% (62175/185245)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.22% (26859/36681)
Line Coverage 56.61% (286912/506845)
Region Coverage 53.86% (238812/443362)
Branch Coverage 55.56% (103186/185727)

7 participants