ORC-2131: Set default of orc.stripe.size.check.ratio and orc.dictionary.max.size.bytes to 0#2580

Closed
QianyongY wants to merge 1 commit into apache:main from QianyongY:features/ORC-2131

Conversation

QianyongY (Contributor) commented Mar 17, 2026

What changes were proposed in this pull request?

Set default of orc.stripe.size.check.ratio and orc.dictionary.max.size.bytes to 0

Why are the changes needed?

After enabling the optimizations related to `orc.stripe.size.check.ratio` and `orc.dictionary.max.size.bytes`, we observed that ORC files written with the current defaults are about 10%–20% larger than before. For example, datasets that were previously ~1.0–1.1 TB grow to ~1.2 TB with the current defaults, causing a noticeable storage cost increase.

How was this patch tested?

Local test

With `orc.dictionary.max.size.bytes=16777216` or `orc.stripe.size.check.ratio=2.0`, the written ORC data grows to 1.2 TB (data inflation).

```shell
           1         6665      1300347279057 hdfs://ns/user/hive/warehouse/tmp_sandbox_xxx.db/tmp_test_123_2/d=2026-03-15
```

With `orc.dictionary.max.size.bytes=0` and `orc.stripe.size.check.ratio=0.0`, the data size remains at the expected 1.0 TB.

```shell
           1         6665      1143347882367 hdfs://ns/user/hive/warehouse/tmp_sandbox_xxx.db/tmp_test_123_1/d=2026-03-15
```
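As a quick sanity check, the growth implied by the two listings can be computed directly from the reported byte counts:

```shell
# Byte counts copied from the two HDFS listings above:
#   1300347279057 bytes with the current defaults,
#   1143347882367 bytes with both settings at 0.
awk 'BEGIN {
  inflated = 1300347279057
  baseline = 1143347882367
  printf "%.1f%% larger\n", (inflated - baseline) / baseline * 100
}'
# prints: 13.7% larger
```

About 13.7% growth on this dataset, consistent with the 10%–20% range reported above.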

Was this patch authored or co-authored using generative AI tooling?

No

dongjoon-hyun (Member) left a comment

Thank you for the feedback, @QianyongY. I understand that you want to disable this feature completely. However, may I ask whether this is a general observation rather than a corner case? In general, the default value is not a silver bullet, and you can tune this value for your workload.

> For example, datasets that were previously ~1.0–1.1 TB grow to ~1.2 TB with the current defaults, causing a noticeable storage cost increase.

cc @wankunde and @cxzl25 from the original PR.

cxzl25 (Contributor) commented Mar 18, 2026

When writing ORC data files using ORC-1986, we observed an increase in the size of some tables from 1.0 TB to 1.2 TB. A random inspection of one ORC file showed that the number of Stripes grew from the original 180 to 527. This resulted in a lower compression ratio and significantly slower read performance for downstream jobs, increasing the execution time from 1 hour to 2 hours and 20 minutes.

Therefore, this might be a regression. Setting the defaults to 0 avoids this problem, and users who need the optimization can still enable it by setting these parameters in their cluster defaults.
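For users who do want the optimization after this change, re-enabling it per session or per job could look roughly like the sketch below. The keys and values are the ones discussed in this PR; the Hive `SET` and Spark `spark.hadoop.*` passthrough mechanisms shown are standard, but verify the right integration point for your engine.

```shell
# Sketch only: re-enabling the optimizations once the defaults become 0.
# The values shown (16 MiB dictionary cap, 2.0 check ratio) mirror the
# values tested above.

# Hive (session-level):
#   SET orc.dictionary.max.size.bytes=16777216;
#   SET orc.stripe.size.check.ratio=2.0;

# Spark (passing the keys through as Hadoop configuration):
#   spark-submit \
#     --conf spark.hadoop.orc.dictionary.max.size.bytes=16777216 \
#     --conf spark.hadoop.orc.stripe.size.check.ratio=2.0 \
#     ...
```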

QianyongY (Contributor, Author) commented

@dongjoon-hyun We rolled out this optimization to 10% of our tables. Among them, 5 tables had partitions grow by more than 15% in size. The worst case was a single partition growing from 140 GB to over 660 GB, which is concerning. It's not a universal issue, but some tables were indeed affected.

dongjoon-hyun (Member) left a comment

+1, LGTM for Apache ORC 2.3.1.

dongjoon-hyun added this to the 2.3.1 milestone Mar 19, 2026
dongjoon-hyun (Member) commented

Feel free to merge and land to main/2.3, @cxzl25 ~

dongjoon-hyun (Member) commented

> When writing ORC data files using ORC-1986, we observed an increase in the size of some tables from 1.0 TB to 1.2 TB. A random inspection of one ORC file showed that the number of stripes grew from the original 180 to 527. This resulted in a lower compression ratio and significantly slower read performance for downstream jobs, increasing the execution time from 1 hour to 2 hours and 20 minutes.
>
> Therefore, this might be a regression issue. Setting it to 0 can avoid this problem, and users who need it can enable this parameter in the cluster by default.

Is there any reason not to merge this, @cxzl25? I thought you wanted to land this to fix your issues.

cxzl25 closed this in 016b076 Mar 25, 2026
cxzl25 pushed a commit that referenced this pull request Mar 25, 2026
…ry.max.size.bytes to 0

### What changes were proposed in this pull request?

Set default of `orc.stripe.size.check.ratio` and `orc.dictionary.max.size.bytes` to 0

### Why are the changes needed?

After enabling the optimizations related to `orc.stripe.size.check.ratio` and `orc.dictionary.max.size.bytes`, we observed that ORC files written with the current defaults are about 10%–20% larger than before. For example, datasets that were previously ~1.0–1.1 TB grow to ~1.2 TB with the current defaults, causing a noticeable storage cost increase.

### How was this patch tested?
Local test

With `orc.dictionary.max.size.bytes=16777216` or `orc.stripe.size.check.ratio=2.0`, the written ORC data grows to 1.2 TB (data inflation).

```shell
           1         6665      1300347279057 hdfs://ns/user/hive/warehouse/tmp_sandbox_xxx.db/tmp_test_123_2/d=2026-03-15
```

With `orc.dictionary.max.size.bytes=0` and `orc.stripe.size.check.ratio=0.0`, the data size remains at the expected 1.0 TB.
```shell
           1         6665      1143347882367 hdfs://ns/user/hive/warehouse/tmp_sandbox_xxx.db/tmp_test_123_1/d=2026-03-15
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #2580 from QianyongY/features/ORC-2131.

Authored-by: yongqian <yongqian@trip.com>
Signed-off-by: Shaoyun Chen <csy@apache.org>
(cherry picked from commit 016b076)
Signed-off-by: Shaoyun Chen <csy@apache.org>
cxzl25 (Contributor) commented Mar 25, 2026

> Is there any reason not to merge this

Sorry for the late merge. We have been running verification tests over the past few days.

cxzl25 (Contributor) commented Mar 25, 2026

@QianyongY Thank you.

Merged to main/2.3.
