feat: try support build index for nested path by ai #8042

Draft
fengys1996 wants to merge 2 commits into GreptimeTeam:main from fengys1996:feat/try-support-nested-filed-index-by-ai
Contributor

fengys1996 commented Apr 28, 2026

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

RUN CI ONLY, NO MERGE

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

github-actions bot added the size/M and docs-not-required (This change does not impact docs.) labels Apr 28, 2026
Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces support for indexing nested sub-fields within Struct or Json columns. It adds a SubField variant to IndexTarget, updates region options to configure these targets, and implements the extraction and casting logic in the InvertedIndexer. Feedback highlights that sub-field indexing currently misses primary key (tag) columns. Additionally, the extraction logic for JSON sub-fields is inefficient due to repeated parsing and contains a bug where intermediate objects or arrays cause traversal to fail. There is also an opportunity to optimize performance by grouping extractions for multiple sub-fields on the same column.

Comment on lines +379 to +384
for indexed_target in &self.indexed_targets {
    if matches!(indexed_target.target, IndexTarget::ColumnId(_)) {
        let col_id = match &indexed_target.target {
            IndexTarget::ColumnId(col_id) => *col_id,
            IndexTarget::SubField { .. } => unreachable!(),
        };

high

In do_update, SubField targets are only handled in the else block (starting at line 444), which exclusively uses batch.field_col_value(col_id). This implementation fails to support subfield indexing for columns that are part of the primary key (tags), as those are stored separately in the Batch and field_col_value will return None for them. This means nested fields within tag columns won't be indexed during flush or compaction.
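
The gap can be illustrated with a toy model. This is only a sketch under loud assumptions: `Row`, its two maps, and `col_value` are hypothetical stand-ins and not the real `Batch` API. In particular, the real tag values are stored separately in the `Batch` and would need their own lookup path, which this sketch does not model.

```rust
use std::collections::HashMap;

type ColumnId = u32;

/// Hypothetical, simplified row: tag (primary key) values and field
/// values live in separate containers, as described in the comment above.
struct Row {
    tags: HashMap<ColumnId, String>,
    fields: HashMap<ColumnId, String>,
}

impl Row {
    /// Mirrors the reported behavior: only field columns are visible,
    /// so a tag column id yields `None`.
    fn field_col_value(&self, col_id: ColumnId) -> Option<&str> {
        self.fields.get(&col_id).map(String::as_str)
    }

    /// Hypothetical fix: check the field columns first, then fall back
    /// to the tag columns so sub-field targets on tags are not skipped.
    fn col_value(&self, col_id: ColumnId) -> Option<&str> {
        self.field_col_value(col_id)
            .or_else(|| self.tags.get(&col_id).map(String::as_str))
    }
}
```

With a lookup shaped like `col_value`, a sub-field target whose root column is a tag would still reach the extraction step instead of being dropped.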

Comment on lines +580 to +598
fn extract_subfield_value(root: ValueRef<'_>, path: &[String]) -> Option<Value> {
    let mut current = Value::from(root);
    for segment in path {
        current = match current {
            Value::Struct(struct_value) => {
                let fields = struct_value.struct_type().fields();
                let index = fields.iter().position(|field| field.name() == segment)?;
                struct_value.items().get(index).cloned()?
            }
            Value::Json(_) => {
                let mut json: serde_json::Value = current.try_into().ok()?;
                let obj = json.as_object_mut()?;
                json_value_to_value(obj.remove(segment)?)?
            }
            _ => return None,
        };
    }
    Some(current)
}

high

The current implementation of extract_subfield_value has significant efficiency and correctness issues:

  1. Efficiency: Value::from(root) clones the entire structure (e.g., a large Struct) at the start. For Json columns, it parses the JSON string into a serde_json::Value at every segment of the path traversal. This will cause severe performance degradation during index building.
  2. Correctness: json_value_to_value (called on line 592) returns None for Object and Array. This means that if a path is a.b and a is an object within a JSON column, the traversal will stop at a and return None, failing to extract b. Nested paths in JSON columns are effectively unsupported by this logic.

It is recommended to use ValueRef for traversal to avoid cloning, and for JSON columns, parse the string once and traverse the serde_json::Value tree for all remaining segments.
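
A minimal sketch of the recommended shape: walk the already-parsed tree by reference and convert to an index value only at the leaf. To keep the example self-contained, the `Json` enum below is a hypothetical stand-in for a parsed document (a real implementation would likely traverse `serde_json::Value`), and `extract_path` and `leaf_to_index_key` are illustrative names that do not come from this PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a JSON document that was parsed once.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// Walks `path` through nested objects by reference, without cloning
/// or re-parsing. Returns `None` if a segment is missing or if an
/// intermediate node is not an object.
fn extract_path<'a>(root: &'a Json, path: &[&str]) -> Option<&'a Json> {
    let mut current = root;
    for segment in path {
        match current {
            // Intermediate objects are traversed, not converted.
            Json::Object(map) => current = map.get(*segment)?,
            _ => return None,
        }
    }
    Some(current)
}

/// Converts only the final leaf; objects, arrays, and nulls are not
/// indexable in this sketch.
fn leaf_to_index_key(leaf: &Json) -> Option<String> {
    match leaf {
        Json::Bool(b) => Some(b.to_string()),
        Json::Number(n) => Some(n.to_string()),
        Json::String(s) => Some(s.clone()),
        Json::Null | Json::Array(_) | Json::Object(_) => None,
    }
}
```

Because the scalar conversion runs only on the last node, a path such as `a.b` passes through the object `a` instead of stopping there.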

Comment on lines +463 to +486
for i in 0..n {
    self.value_buf.clear();
    let value = values.data.get_ref(i);
    let Some(value) = extract_subfield_value(value, path)
        .and_then(|value| cast_value_for_index_type(value, value_type))
    else {
        self.index_creator
            .push_with_name(target_key, None)
            .await
            .context(PushIndexValueSnafu)?;
        continue;
    };

    IndexValueCodec::encode_nonnull_value(
        value.as_value_ref(),
        &sort_field,
        &mut self.value_buf,
    )
    .context(EncodeSnafu)?;
    self.index_creator
        .push_with_name(target_key, Some(&self.value_buf))
        .await
        .context(PushIndexValueSnafu)?;
}

medium

Processing subfield targets row-by-row for each target independently leads to redundant work. If multiple subfield indexes are defined on the same column (especially a JSON column), the column value is extracted and processed (including JSON parsing) multiple times per row. Consider grouping subfield extractions by column_id to parse the JSON once per row per column.
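
One way to picture the suggested grouping, as a sketch only: `SubFieldTarget`, `group_by_column`, and `extract_row` are hypothetical names, and the `decode` and `lookup` closures stand in for whatever parsing and path lookup the indexer actually performs.

```rust
use std::collections::BTreeMap;

type ColumnId = u32;

/// Hypothetical sub-field target: a column plus a path inside it.
#[derive(Debug, Clone, PartialEq)]
struct SubFieldTarget {
    column_id: ColumnId,
    path: Vec<String>,
}

/// Groups targets by column so that all paths on the same column can
/// share a single decoded value.
fn group_by_column(targets: &[SubFieldTarget]) -> BTreeMap<ColumnId, Vec<&SubFieldTarget>> {
    let mut groups: BTreeMap<ColumnId, Vec<&SubFieldTarget>> = BTreeMap::new();
    for target in targets {
        groups.entry(target.column_id).or_default().push(target);
    }
    groups
}

/// Processes one row: each column is decoded once, then every path on
/// that column is looked up against the same decoded value.
fn extract_row<D, V>(
    groups: &BTreeMap<ColumnId, Vec<&SubFieldTarget>>,
    mut decode: impl FnMut(ColumnId) -> Option<D>,
    lookup: impl Fn(&D, &[String]) -> Option<V>,
) -> Vec<(ColumnId, Vec<String>, Option<V>)> {
    let mut out = Vec::new();
    for (column_id, targets) in groups {
        // One decode (for example, one JSON parse) per column per row.
        let decoded = decode(*column_id);
        for target in targets {
            let value = decoded.as_ref().and_then(|d| lookup(d, &target.path));
            out.push((*column_id, target.path.clone(), value));
        }
    }
    out
}
```

With two sub-field indexes on one JSON column, `decode` runs once for that column per row rather than twice, while each target still receives its own value or `None`.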

Labels

docs-not-required (This change does not impact docs.), size/M

1 participant