ORC-2131: Set default of orc.stripe.size.check.ratio and orc.dictionary.max.size.bytes to 0#2580

Closed
QianyongY wants to merge 1 commit into apache:main from QianyongY:features/ORC-2131

Conversation

QianyongY (Contributor) commented Mar 17, 2026

What changes were proposed in this pull request?

Set default of orc.stripe.size.check.ratio and orc.dictionary.max.size.bytes to 0

Why are the changes needed?

After enabling the optimizations related to `orc.stripe.size.check.ratio` and `orc.dictionary.max.size.bytes`, we observed that ORC files written with the current defaults are about 10%–20% larger than before. For example, datasets that were previously ~1.0–1.1 TB grow to ~1.2 TB with the current defaults, causing a noticeable storage cost increase.

How was this patch tested?

Local test

With `orc.dictionary.max.size.bytes=16777216` or `orc.stripe.size.check.ratio=2.0`, the written ORC data grows to 1.2 TB (data inflation).

```shell
           1         6665      1300347279057 hdfs://ns/user/hive/warehouse/tmp_sandbox_xxx.db/tmp_test_123_2/d=2026-03-15
```

With `orc.dictionary.max.size.bytes=0` and `orc.stripe.size.check.ratio=0.0`, the data size remains at the expected 1.0 TB.

```shell
           1         6665      1143347882367 hdfs://ns/user/hive/warehouse/tmp_sandbox_xxx.db/tmp_test_123_1/d=2026-03-15
```
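As a quick sanity check, the growth implied by the two listings can be computed directly from the reported byte counts:

```shell
# Byte counts copied from the two HDFS listings above:
#   1300347279057 bytes with the current defaults,
#   1143347882367 bytes with both settings at 0.
awk 'BEGIN {
  inflated = 1300347279057
  baseline = 1143347882367
  printf "%.1f%% larger\n", (inflated - baseline) / baseline * 100
}'
# prints: 13.7% larger
```

About 13.7% growth on this dataset, consistent with the 10%–20% range reported above.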

Was this patch authored or co-authored using generative AI tooling?

No

dongjoon-hyun (Member) left a comment

Thank you for the feedback, @QianyongY. I understand that you want to disable this feature completely. However, may I ask whether this is a general observation rather than a corner case? In general, the default value is not a silver bullet, and you can tune this value for your workload.

> For example, datasets that were previously ~1.0–1.1 TB grow to ~1.2 TB with the current defaults, causing a noticeable storage cost increase.

cc @wankunde and @cxzl25 from the original PR.

cxzl25 (Contributor) commented Mar 18, 2026

When writing ORC data files using ORC-1986, we observed an increase in the size of some tables from 1.0 TB to 1.2 TB. A random inspection of one ORC file showed that the number of Stripes grew from the original 180 to 527. This resulted in a lower compression ratio and significantly slower read performance for downstream jobs, increasing the execution time from 1 hour to 2 hours and 20 minutes.

Therefore, this might be a regression. Setting the defaults to 0 avoids this problem, and users who need the optimization can still enable it by setting these parameters in their cluster defaults.
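For users who do want the optimization after this change, re-enabling it per session or per job could look roughly like the sketch below. The keys and values are the ones discussed in this PR; the Hive `SET` and Spark `spark.hadoop.*` passthrough mechanisms shown are standard, but verify the right integration point for your engine.

```shell
# Sketch only: re-enabling the optimizations once the defaults become 0.
# The values shown (16 MiB dictionary cap, 2.0 check ratio) mirror the
# values tested above.

# Hive (session-level):
#   SET orc.dictionary.max.size.bytes=16777216;
#   SET orc.stripe.size.check.ratio=2.0;

# Spark (passing the keys through as Hadoop configuration):
#   spark-submit \
#     --conf spark.hadoop.orc.dictionary.max.size.bytes=16777216 \
#     --conf spark.hadoop.orc.stripe.size.check.ratio=2.0 \
#     ...
```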

QianyongY (Contributor, Author) commented

@dongjoon-hyun We rolled out this optimization to 10% of our tables. Among them, 5 tables had partitions grow by more than 15% in size. The worst case was a single partition growing from 140 GB to over 660 GB, which is concerning. It's not a universal issue, but some tables were indeed affected.

dongjoon-hyun (Member) left a comment

+1, LGTM for Apache ORC 2.3.1.

dongjoon-hyun added this to the 2.3.1 milestone Mar 19, 2026
dongjoon-hyun (Member) commented

Feel free to merge and land to main/2.3, @cxzl25 ~

dongjoon-hyun (Member) commented

> When writing ORC data files using ORC-1986, we observed an increase in the size of some tables from 1.0 TB to 1.2 TB. A random inspection of one ORC file showed that the number of stripes grew from the original 180 to 527. This resulted in a lower compression ratio and significantly slower read performance for downstream jobs, increasing the execution time from 1 hour to 2 hours and 20 minutes.
>
> Therefore, this might be a regression issue. Setting it to 0 can avoid this problem, and users who need it can enable this parameter in the cluster by default.

Is there any reason not to merge this, @cxzl25? I thought you wanted to land this to fix your issues.

cxzl25 closed this in 016b076 Mar 25, 2026
cxzl25 pushed a commit that referenced this pull request Mar 25, 2026
…ry.max.size.bytes to 0

### What changes were proposed in this pull request?

Set default of `orc.stripe.size.check.ratio` and `orc.dictionary.max.size.bytes` to 0

### Why are the changes needed?

After enabling the optimizations related to `orc.stripe.size.check.ratio` and `orc.dictionary.max.size.bytes`, we observed that ORC files written with the current defaults are about 10%–20% larger than before. For example, datasets that were previously ~1.0–1.1 TB grow to ~1.2 TB with the current defaults, causing a noticeable storage cost increase.

### How was this patch tested?
Local test

With `orc.dictionary.max.size.bytes=16777216` or `orc.stripe.size.check.ratio=2.0`, the written ORC data grows to 1.2 TB (data inflation).

```shell
           1         6665      1300347279057 hdfs://ns/user/hive/warehouse/tmp_sandbox_xxx.db/tmp_test_123_2/d=2026-03-15
```

With `orc.dictionary.max.size.bytes=0` and `orc.stripe.size.check.ratio=0.0`, the data size remains at the expected 1.0 TB.
```shell
           1         6665      1143347882367 hdfs://ns/user/hive/warehouse/tmp_sandbox_xxx.db/tmp_test_123_1/d=2026-03-15
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #2580 from QianyongY/features/ORC-2131.

Authored-by: yongqian <yongqian@trip.com>
Signed-off-by: Shaoyun Chen <csy@apache.org>
(cherry picked from commit 016b076)
Signed-off-by: Shaoyun Chen <csy@apache.org>
cxzl25 (Contributor) commented Mar 25, 2026

> Is there any reason not to merge this

Sorry for the late merge. We have been running verification tests over the past few days.

cxzl25 (Contributor) commented Mar 25, 2026

@QianyongY Thank you.

Merged to main/2.3.
