Skip to content

refactor: adds StringColumnFormatSpec for string dimension configs#19258

Open
jaykanakiya wants to merge 6 commits intoapache:masterfrom
jaykanakiya:string-column-format-spec
Open

refactor: adds StringColumnFormatSpec for string dimension configs#19258
jaykanakiya wants to merge 6 commits intoapache:masterfrom
jaykanakiya:string-column-format-spec

Conversation

@jaykanakiya
Copy link
Copy Markdown
Contributor

@jaykanakiya jaykanakiya commented Apr 2, 2026

Description

Adds StringColumnFormatSpec to indexSpec, allowing string column format settings to be configured at the job level with per-column overrides via dimensionSchema.columnFormatSpec. Follows similar pattern as #17762

Release note


Key changed/added classes in this PR
  • StringColumnFormatSpec
  • StringDimensionSchema
  • IndexSpec
  • Sink

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@JsonProperty("multiValueHandling") MultiValueHandling multiValueHandling,
@JsonProperty("createBitmapIndex") Boolean createBitmapIndex,
@JsonProperty("maxStringLength") @Nullable Integer maxStringLength
@JsonProperty("maxStringLength") @Nullable Integer maxStringLength,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since maxStringLength hasn't been released, should we just remove this in favor of the columnFormatSpec?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed maxStringLength from StringDimensionSchema.


@JsonCreator
public StringColumnFormatSpec(
@JsonProperty("createBitmapIndex") @Nullable Boolean createBitmapIndex,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we would switch this to using something like BitmapIndexType we have for NestedCommonFormatColumnFormatSpec used to control which indexes we build for long and double json fields, since ideally we can introduce some flexibility here so that we can more easily support different kinds of indexes on string columns that specifying with a bunch of booleans would be tedious.

I think this implementation can be a lot simpler than BitmapIndexType for now, we can worry about how index building can be abstracted for strings in the dimension indexer/mergers/serializers later when we actually introduce such a thing. I think there are 2 options we need to add to model the current state, dictionaryEncodedValueIndex and none, so we could make an abstract StringIndexType or StringBitmapIndexType and have basically empty implementations for NoIndex and DictionaryEncodedValueIndex (only needs like equals/hashcode implemented i think and maybe a static final instantiation to share) to use as a placeholder for future enhancement.

Then we can wire this indexType with those values up to the existing boolean all the stuff uses.

Can you look into making this change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added StringBitmapIndexType class with dictionaryEncodedValueIndex and none options and replaced createBitmapIndex in columnFormatSpec.


private FireHydrant makeNewCurrIndex(long minTimestamp, DataSchema schema)
{
final DimensionsSpec dimensionsSpec = resolveEffectiveDimensionsSpec(schema.getDimensionsSpec());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, i guess we have to do this here now because we need to know stuff up front when making the column indexers, seems worth leaving a comment explaining why we do this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment.

Comment on lines +399 to +412
private DimensionsSpec resolveEffectiveDimensionsSpec(DimensionsSpec dimensionsSpec)
{
if (indexSpec == null) {
return dimensionsSpec;
}
final List<DimensionSchema> effectiveDimensions = dimensionsSpec.getDimensions()
.stream()
.map(dim -> dim.getEffectiveSchema(indexSpec))
.collect(Collectors.toList());
return DimensionsSpec.builder(dimensionsSpec)
.setDimensions(effectiveDimensions)
.build();
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think in the constructor we should call getEffectiveSpec on the indexSpec so that it is filled in with system defaults. We should also probably set it to IndexSpec.default() if it is null, so then we always resolve the effective schemas to fill in system default values

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added IndexSpec.getDefault() in the constructor as the default.

@jaykanakiya jaykanakiya requested a review from clintropolis April 6, 2026 15:55
@aho135
Copy link
Copy Markdown
Contributor

aho135 commented Apr 9, 2026

Thanks for the change @jaykanakiya!

Maybe we could do this as a follow up, but it looks like the indexSpec documentation is pretty outdated. It'd be helpful to see an example for stringColumnFormatSpec

StringColumnFormatSpec.getEffectiveFormatSpec(stringColumnFormatSpec, this)
);
} else if (defaultSpec.stringColumnFormatSpec != null) {
bob.withStringColumnFormatSpec(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? stringColumnFormatSpec is always null in the defaultSpec

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a similar pattern for other properties but I don't think it's actually needed in this case

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defautlSpec can be set in runtime.properties to set system level default overrides, so we want it to be like this i think

)
{
this(dimensionName, multiValueHandling, hasBitmapIndexes, hasSpatialIndexes, StringDimensionSchema.getDefaultMaxStringLength());
this(dimensionName, multiValueHandling, hasBitmapIndexes, hasSpatialIndexes, null);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this(dimensionName, multiValueHandling, hasBitmapIndexes, hasSpatialIndexes, null);
this(dimensionName, multiValueHandling, hasBitmapIndexes, hasSpatialIndexes, BuiltInTypesModule.getMaxStringLength());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or alternatively make BuiltInTypesModule.MAX_STRING_LENGTH public

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

)
{
this(multiValueHandling, hasBitmapIndexes, hasSpatialIndexes, StringDimensionSchema.getDefaultMaxStringLength());
this(multiValueHandling, hasBitmapIndexes, hasSpatialIndexes, null);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also use BuiltInTypesModule for readability here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use BuiltInTypesModule.

@Override
public DimensionHandler getDimensionHandler()
{
Integer maxStringLength = columnFormatSpec != null ? columnFormatSpec.getMaxStringLength() : null;
Copy link
Copy Markdown
Contributor

@aho135 aho135 Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a comment so I can revisit this tomorrow. Why do we only pass in maxStringLength here, but use parent defaults for createBitmapIndex/multiValueHandling?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I think we should get all of the things from the formatSpec if they are defined there, and fallback to the parent if not set

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this.

+ " \"multiValueHandling\" : \"SORTED_SET\",\n"
+ " \"createBitmapIndex\" : false,\n"
+ " \"columnFormatSpec\" : { \"maxStringLength\" : 200 }\n"
+ "}";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have a few test cases with different variations of columnFormatSpec. For example, tests where multiValueHandling is also present in columnFormatSpec to document the order of precedence

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests to verify precedence.

@clintropolis clintropolis added this to the 37.0.0 milestone Apr 9, 2026
Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this mostly lgtm now, thanks for making this change 👍

I actually think there is 1 more change that we want to do, but I think its also ok to do as a follow-up.

The change is that I want to switch string columns to actually store the StringColumnFormatSpec in the json of the DictionaryEncodedColumnPartSerde, and switch string column stuff to implementing ColumnFormat like the nested/auto column does instead of using the fallback CapabilitiesBasedFormat. That way, it can recreate a handler with the same parameters that were used to write it for things like compaction (by using the stored formatspec to create the format which is what drives what parameters are set on the indexer/merger/serializer stuff).

StringColumnFormatSpec.getEffectiveFormatSpec(stringColumnFormatSpec, this)
);
} else if (defaultSpec.stringColumnFormatSpec != null) {
bob.withStringColumnFormatSpec(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defautlSpec can be set in runtime.properties to set system level default overrides, so we want it to be like this i think

@Override
public DimensionHandler getDimensionHandler()
{
Integer maxStringLength = columnFormatSpec != null ? columnFormatSpec.getMaxStringLength() : null;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I think we should get all of the things from the formatSpec if they are defined there, and fallback to the parent if not set

MultiValueHandling mvh = columnFormatSpec != null && columnFormatSpec.getMultiValueHandling() != null
? columnFormatSpec.getMultiValueHandling()
: getMultiValueHandling();
boolean bitmap = columnFormatSpec != null && columnFormatSpec.getIndexType() != null
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we can do a small refactor to only check if columnFormatSpec is null one time

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to null check columnFormatSpec once

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants