[doc](agg) add doc for aggregate function entropy #3412
Open: wrlcke wants to merge 4 commits into apache:master from wrlcke:functions/entropy
121 changes: 121 additions & 0 deletions
docs/sql-manual/sql-functions/aggregate-functions/entropy.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,121 @@ | ||
| --- | ||
| { | ||
| "title": "ENTROPY", | ||
| "language": "en", | ||
| "description": "Calculate the Shannon entropy of all non-null values in the specified column or expression." | ||
| } | ||
| --- | ||
|
|
||
| ## Description | ||
|
|
||
| Computes the Shannon entropy of all non-null values in the specified column or expression. | ||
|
|
||
| Entropy measures the uncertainty or randomness of a distribution. This function builds an empirical frequency distribution of the input values and computes the entropy in bits, using the base-2 logarithm. | ||
|
|
||
| The Shannon entropy is defined as: | ||
|
|
||
| $ | ||
| Entropy(X) = -\sum_{i=1}^{k} p_i \log_2(p_i) | ||
| $ | ||
|
|
||
| Where: | ||
|
|
||
| - $k$ is the number of distinct non-null values | ||
| - $p_i = \frac{\text{count}(x_i)}{\text{total non-null count}}$ | ||
|
|
||
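The definition above can be sketched in a few lines of Python. This is only an illustration of the formula, not the Doris implementation: the function name `shannon_entropy` is hypothetical, and `None` stands in for SQL NULL.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Optional


def shannon_entropy(values: Iterable[Optional[Hashable]]) -> Optional[float]:
    """Return the Shannon entropy (in bits) of the non-null values.

    Returns None when there are no non-null values, mirroring the
    documented NULL result for empty or all-NULL input.
    """
    # Empirical frequency map of the non-null values.
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        return None
    # -sum(p * log2(p)) written as sum(p * log2(1 / p)); the two forms are
    # algebraically identical, and this one avoids a "-0.0" result.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(shannon_entropy([1, 1, 1, 2, None]))  # roughly 0.8113 bits
print(shannon_entropy([7, 7, 7]))           # 0.0, a single distinct value
print(shannon_entropy([None, None]))        # None, no non-null values
```

The first call uses the values 1, 1, 1, 2, NULL, which are the contents of column `c1` in the Examples section.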
| ## Syntax | ||
|
|
||
| ```sql | ||
| ENTROPY(<expr1> [, <expr2>, ... , <exprN>]) | ||
| ``` | ||
|
|
||
| ## Parameters | ||
|
|
||
| | Parameter | Description | | ||
| |----------|-------------| | ||
| | `<expr1> [, <expr2>, ...]` | One or more expressions or columns. Supported types: TinyInt, SmallInt, Integer, BigInt, LargeInt, Float, Double, Decimal, String, IPv4/IPv6, Array, Map, Struct. When multiple expressions are provided, the values in each row are serialized together into a single composite key, and entropy is computed over the frequency distribution of these composite keys. | | ||
|
|
||
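The composite-key behavior for multiple expressions can be illustrated with a short Python sketch. It makes two assumptions that are not stated in the parameter table: a Python tuple stands in for the serialized composite key, and a row is skipped when any of its values is NULL (this matches the `entropy(c1, c2)` example, where the row containing a NULL `c1` is not counted). The name `composite_entropy` is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Optional, Tuple


def composite_entropy(rows: Iterable[Tuple]) -> Optional[float]:
    """Entropy (in bits) over rows, treating each row as one composite key."""
    # Assumption: a row with any None (NULL) value contributes nothing.
    counts = Counter(row for row in rows if all(v is not None for v in row))
    total = sum(counts.values())
    if total == 0:
        return None
    # Same formula as the single-expression case, applied to tuple keys.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# (c1, c2) pairs from table t1 in the Examples section.
rows = [(1, "a"), (1, "a"), (1, "b"), (2, "a"), (None, "a")]
print(composite_entropy(rows))  # 1.5
```

Using the whole tuple as the key keeps `(1, "a")` and `(1, "b")` distinct even though they share the same `c1` value, which is why the composite entropy differs from `entropy(c1)` alone.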
|
||
| ## Return Value | ||
|
|
||
| Returns a DOUBLE representing the Shannon entropy in bits. | ||
|
|
||
| - Returns NULL if all values are NULL or the input is empty. | ||
| - Ignores NULL values during computation. | ||
|
|
||
| ## Examples | ||
|
|
||
| ```sql | ||
| CREATE TABLE t1 ( | ||
| id INT, | ||
| c1 INT, | ||
| c2 STRING | ||
| ) DISTRIBUTED BY HASH(id) BUCKETS 1 | ||
| PROPERTIES ("replication_num"="1"); | ||
|
|
||
| INSERT INTO t1 VALUES | ||
| (1, 1, "a"), | ||
| (2, 1, "a"), | ||
| (3, 1, "b"), | ||
| (4, 2, "a"), | ||
| (5, NULL, "a"); | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(c1) FROM t1; | ||
| ``` | ||
|
|
||
| Distribution (value → count): 1 → 3, 2 → 1. The NULL in row 5 is ignored. | ||
|
|
||
| $H = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{3}{4}\log_2\frac{3}{4}\right) \approx 0.811$ | ||
|
|
||
| ```text | ||
| +--------------------+ | ||
| | entropy(c1) | | ||
| +--------------------+ | ||
|
||
| | 0.8112781244591328 | | ||
| +--------------------+ | ||
| ``` | ||
|
|
||
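For readers who want to check the `entropy(c1)` result by hand, the arithmetic can be expanded as follows, using $-p\log_2 p = p\log_2\frac{1}{p}$ with the probabilities $\frac{1}{4}$ and $\frac{3}{4}$ from the distribution. The intermediate values are rounded to six decimal places.

$$
\begin{aligned}
H &= \tfrac{1}{4}\log_2 4 + \tfrac{3}{4}\log_2\tfrac{4}{3} \\
  &= \tfrac{1}{4}\cdot 2 + \tfrac{3}{4}\,(2 - \log_2 3) \\
  &\approx 0.5 + 0.75 \times 0.415037 \\
  &\approx 0.811278
\end{aligned}
$$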
| ```sql | ||
| SELECT entropy(c1, c2) FROM t1; | ||
| ``` | ||
|
|
||
| Distribution (composite key → count): (1, "a") → 2, (1, "b") → 1, (2, "a") → 1. The row (5, NULL, "a") does not contribute a key. | ||
|
|
||
| $H = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{2}{4}\log_2\frac{2}{4}+ \frac{1}{4}\log_2\frac{1}{4}\right)=1.5$ | ||
|
|
||
| ```text | ||
| +-----------------+ | ||
| | entropy(c1, c2) | | ||
| +-----------------+ | ||
| | 1.5 | | ||
| +-----------------+ | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(1); | ||
| ``` | ||
|
|
||
| A constant input has only one distinct value, so its entropy is 0. | ||
|
|
||
| ```text | ||
| +------------+ | ||
| | entropy(1) | | ||
| +------------+ | ||
| | 0 | | ||
| +------------+ | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(NULL) FROM t1; | ||
| ``` | ||
|
|
||
| Every input value is NULL, so the result is NULL. The same result is returned when the input is empty. | ||
|
|
||
| ```text | ||
| +---------------+ | ||
|
||
| | entropy(NULL) | | ||
| +---------------+ | ||
| | NULL | | ||
| +---------------+ | ||
| ``` | ||
121 changes: 121 additions & 0 deletions
...in-content-docs/current/sql-manual/sql-functions/aggregate-functions/entropy.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,121 @@ | ||
| --- | ||
| { | ||
| "title": "ENTROPY", | ||
| "language": "zh-CN", | ||
| "description": "Calculate the Shannon entropy of all non-NULL values in the specified column or expression." | ||
| } | ||
| --- | ||
|
|
||
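| ## Description | ||
|
|
||
| Computes the Shannon entropy of all non-NULL values in the specified column or expression. | ||
|
|
||
| Entropy measures the uncertainty or randomness of a distribution. This function builds an empirical frequency distribution of the input values and computes the entropy in bits, using the base-2 logarithm. | ||
|
|
||
| The Shannon entropy is defined as: | ||
|
|
||
| $ | ||
| Entropy(X) = -\sum_{i=1}^{k} p_i \log_2(p_i) | ||
| $ | ||
|
|
||
| Where: | ||
|
|
||
| - $k$ is the number of distinct non-NULL values | ||
| - $p_i = \frac{\text{count}(x_i)}{\text{total non-null count}}$ | ||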
|
|
||
| ## Syntax | ||
|
|
||
| ```sql | ||
| ENTROPY(<expr1> [, <expr2>, ... , <exprN>]) | ||
| ``` | ||
|
|
||
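| ## Parameters | ||
|
|
||
| | Parameter | Description | | ||
| |------|------| | ||
| | `<expr1> [, <expr2>, ...]` | One or more expressions or columns. Supported types include TinyInt, SmallInt, Integer, BigInt, LargeInt, Float, Double, Decimal, String, IPv4/IPv6, Array, Map, Struct, and others. When multiple columns are provided, the values in each row are serialized into a single composite key, and entropy is computed over the frequency distribution of these composite keys. | | ||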
|
|
||
| ## Return Value | ||
|
|
||
| Returns a DOUBLE representing the Shannon entropy in bits. | ||
|
|
||
| - Returns NULL if all values are NULL or the input is empty. | ||
| - Ignores NULL values during computation. | ||
|
|
||
| ## Examples | ||
|
|
||
| ```sql | ||
| CREATE TABLE t1 ( | ||
| id INT, | ||
| c1 INT, | ||
| c2 STRING | ||
| ) DISTRIBUTED BY HASH(id) BUCKETS 1 | ||
| PROPERTIES ("replication_num"="1"); | ||
|
|
||
| INSERT INTO t1 VALUES | ||
| (1, 1, "a"), | ||
| (2, 1, "a"), | ||
| (3, 1, "b"), | ||
| (4, 2, "a"), | ||
| (5, NULL, "a"); | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(c1) FROM t1; | ||
| ``` | ||
|
|
||
| Frequency distribution (value → count): 1 → 3, 2 → 1 | ||
|
|
||
| Entropy calculation: $H = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{3}{4}\log_2\frac{3}{4}\right) \approx 0.811$ | ||
|
|
||
| ```text | ||
| +--------------------+ | ||
| | entropy(c1) | | ||
| +--------------------+ | ||
|
||
| | 0.8112781244591328 | | ||
| +--------------------+ | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(c1, c2) FROM t1; | ||
| ``` | ||
|
|
||
| Frequency distribution (composite key → count): (1, "a") → 2, (1, "b") → 1, (2, "a") → 1 | ||
|
|
||
| Entropy calculation: $H = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{2}{4}\log_2\frac{2}{4}+ \frac{1}{4}\log_2\frac{1}{4}\right)=1.5$ | ||
|
|
||
| ```text | ||
| +-----------------+ | ||
| | entropy(c1, c2) | | ||
| +-----------------+ | ||
| | 1.5 | | ||
| +-----------------+ | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(1); | ||
| ``` | ||
|
|
||
| Only one distinct value → entropy = 0 | ||
|
|
||
| ```text | ||
| +------------+ | ||
| | entropy(1) | | ||
| +------------+ | ||
| | 0 | | ||
| +------------+ | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(NULL) FROM t1; | ||
| ``` | ||
|
|
||
| Returns NULL when all values are NULL or the input is empty. | ||
|
|
||
| ```text | ||
| +---------------+ | ||
| | entropy(NULL) | | ||
| +---------------+ | ||
| | NULL | | ||
| +---------------+ | ||
| ``` | ||
125 changes: 125 additions & 0 deletions
...ontent-docs/version-4.x/sql-manual/sql-functions/aggregate-functions/entropy.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,125 @@ | ||
| --- | ||
| { | ||
| "title": "ENTROPY", | ||
| "language": "zh-CN", | ||
| "description": "Calculate the Shannon entropy of all non-NULL values in the specified column or expression." | ||
| } | ||
| --- | ||
|
|
||
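| ## Description | ||
|
|
||
| Computes the Shannon entropy of all non-NULL values in the specified column or expression. | ||
|
|
||
| Entropy measures the uncertainty or randomness of a distribution. This function builds an empirical frequency distribution of the input values and computes the entropy in bits, using the base-2 logarithm. | ||
|
|
||
| The Shannon entropy is defined as: | ||
|
|
||
| $ | ||
| Entropy(X) = -\sum_{i=1}^{k} p_i \log_2(p_i) | ||
| $ | ||
|
|
||
| Where: | ||
|
|
||
| - $k$ is the number of distinct non-NULL values | ||
| - $p_i = \frac{\text{count}(x_i)}{\text{total non-null count}}$ | ||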
|
|
||
| :::info Note | ||
| This function is supported since Apache Doris 4.1.0. | ||
| ::: | ||
|
|
||
| ## Syntax | ||
|
|
||
| ```sql | ||
| ENTROPY(<expr1> [, <expr2>, ... , <exprN>]) | ||
| ``` | ||
|
|
||
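| ## Parameters | ||
|
|
||
| | Parameter | Description | | ||
| |------|------| | ||
| | `<expr1> [, <expr2>, ...]` | One or more expressions or columns. Supported types include TinyInt, SmallInt, Integer, BigInt, LargeInt, Float, Double, Decimal, String, IPv4/IPv6, Array, Map, Struct, and others. When multiple columns are provided, the values in each row are serialized into a single composite key, and entropy is computed over the frequency distribution of these composite keys. | | ||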
|
|
||
| ## Return Value | ||
|
|
||
| Returns a DOUBLE representing the Shannon entropy in bits. | ||
|
|
||
| - Returns NULL if all values are NULL or the input is empty. | ||
| - Ignores NULL values during computation. | ||
|
|
||
| ## Examples | ||
|
|
||
| ```sql | ||
| CREATE TABLE t1 ( | ||
| id INT, | ||
| c1 INT, | ||
| c2 STRING | ||
| ) DISTRIBUTED BY HASH(id) BUCKETS 1 | ||
| PROPERTIES ("replication_num"="1"); | ||
|
|
||
| INSERT INTO t1 VALUES | ||
| (1, 1, "a"), | ||
| (2, 1, "a"), | ||
| (3, 1, "b"), | ||
| (4, 2, "a"), | ||
| (5, NULL, "a"); | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(c1) FROM t1; | ||
| ``` | ||
|
|
||
| Frequency distribution (value → count): 1 → 3, 2 → 1 | ||
|
|
||
| Entropy calculation: $H = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{3}{4}\log_2\frac{3}{4}\right) \approx 0.811$ | ||
|
|
||
| ```text | ||
| +--------------------+ | ||
| | entropy(c1) | | ||
| +--------------------+ | ||
| | 0.8112781244591328 | | ||
| +--------------------+ | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(c1, c2) FROM t1; | ||
| ``` | ||
|
|
||
| Frequency distribution (composite key → count): (1, "a") → 2, (1, "b") → 1, (2, "a") → 1 | ||
|
|
||
| Entropy calculation: $H = -\left(\frac{1}{4}\log_2\frac{1}{4} + \frac{2}{4}\log_2\frac{2}{4}+ \frac{1}{4}\log_2\frac{1}{4}\right)=1.5$ | ||
|
|
||
| ```text | ||
| +-----------------+ | ||
| | entropy(c1, c2) | | ||
| +-----------------+ | ||
| | 1.5 | | ||
| +-----------------+ | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(1); | ||
| ``` | ||
|
|
||
| Only one distinct value → entropy = 0 | ||
|
|
||
| ```text | ||
| +------------+ | ||
| | entropy(1) | | ||
| +------------+ | ||
| | 0 | | ||
| +------------+ | ||
| ``` | ||
|
|
||
| ```sql | ||
| SELECT entropy(NULL) FROM t1; | ||
| ``` | ||
|
|
||
| Returns NULL when all values are NULL or the input is empty. | ||
|
|
||
| ```text | ||
| +---------------+ | ||
| | entropy(NULL) | | ||
| +---------------+ | ||
| | NULL | | ||
| +---------------+ | ||
| ``` |