feat: try support build index for nested path by ai #8042
fengys1996 wants to merge 2 commits into GreptimeTeam:main from
Conversation
Code Review
This pull request introduces support for indexing nested sub-fields within Struct or Json columns. It adds a `SubField` variant to `IndexTarget`, updates region options to configure these targets, and implements the extraction and casting logic in the `InvertedIndexer`. Feedback highlights that sub-field indexing currently misses primary key (tag) columns. Additionally, the extraction logic for JSON sub-fields is inefficient due to repeated parsing, and contains a bug where intermediate objects or arrays cause traversal to fail. There is also an opportunity to optimize performance by grouping extractions for multiple sub-fields on the same column.
```rust
for indexed_target in &self.indexed_targets {
    if matches!(indexed_target.target, IndexTarget::ColumnId(_)) {
        let col_id = match &indexed_target.target {
            IndexTarget::ColumnId(col_id) => *col_id,
            IndexTarget::SubField { .. } => unreachable!(),
        };
```
In `do_update`, `SubField` targets are only handled in the `else` block (starting at line 444), which exclusively uses `batch.field_col_value(col_id)`. This implementation fails to support sub-field indexing for columns that are part of the primary key (tags), as those are stored separately in the `Batch`, and `field_col_value` will return `None` for them. This means nested fields within tag columns won't be indexed during flush or compaction.
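To make the failure mode concrete, here is a toy model of the problem. All names below (`Batch`, `tag_cols`, `field_cols`, `any_col_value`) are hypothetical simplifications, not the actual GreptimeDB API; the point is only that a lookup restricted to field columns misses values stored on the tag side, so the fix needs a combined lookup.

```rust
use std::collections::HashMap;

// Toy model (hypothetical names, not the real Batch API): tag (primary key)
// columns and field columns are stored separately, so a lookup that only
// consults field columns returns None for tag columns.
struct Batch {
    tag_cols: HashMap<u32, String>,
    field_cols: HashMap<u32, String>,
}

impl Batch {
    // Check field columns first, then fall back to tag columns, so that
    // sub-field extraction also covers primary key columns.
    fn any_col_value(&self, col_id: u32) -> Option<&String> {
        self.field_cols
            .get(&col_id)
            .or_else(|| self.tag_cols.get(&col_id))
    }
}

fn main() {
    let batch = Batch {
        tag_cols: HashMap::from([(1, "tag-json".to_string())]),
        field_cols: HashMap::from([(2, "field-json".to_string())]),
    };
    // A field-only lookup would miss column 1; the combined lookup finds it.
    assert_eq!(batch.any_col_value(1).map(String::as_str), Some("tag-json"));
    assert_eq!(batch.any_col_value(2).map(String::as_str), Some("field-json"));
}
```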
```rust
fn extract_subfield_value(root: ValueRef<'_>, path: &[String]) -> Option<Value> {
    let mut current = Value::from(root);
    for segment in path {
        current = match current {
            Value::Struct(struct_value) => {
                let fields = struct_value.struct_type().fields();
                let index = fields.iter().position(|field| field.name() == segment)?;
                struct_value.items().get(index).cloned()?
            }
            Value::Json(_) => {
                let mut json: serde_json::Value = current.try_into().ok()?;
                let obj = json.as_object_mut()?;
                json_value_to_value(obj.remove(segment)?)?
            }
            _ => return None,
        };
    }
    Some(current)
}
```
The current implementation of `extract_subfield_value` has significant efficiency and correctness issues:

- Efficiency: `Value::from(root)` clones the entire structure (e.g., a large `Struct`) at the start. For `Json` columns, it parses the JSON string into a `serde_json::Value` at every segment of the path traversal. This will cause severe performance degradation during index building.
- Correctness: `json_value_to_value` (called on line 592) returns `None` for `Object` and `Array`. This means that if a path is `a.b` and `a` is an object within a JSON column, the traversal will stop at `a` and return `None`, failing to extract `b`. Nested paths in JSON columns are effectively unsupported by this logic.

It is recommended to use `ValueRef` for traversal to avoid cloning, and for JSON columns, parse the string once and traverse the `serde_json::Value` tree for all remaining segments.
```rust
for i in 0..n {
    self.value_buf.clear();
    let value = values.data.get_ref(i);
    let Some(value) = extract_subfield_value(value, path)
        .and_then(|value| cast_value_for_index_type(value, value_type))
    else {
        self.index_creator
            .push_with_name(target_key, None)
            .await
            .context(PushIndexValueSnafu)?;
        continue;
    };

    IndexValueCodec::encode_nonnull_value(
        value.as_value_ref(),
        &sort_field,
        &mut self.value_buf,
    )
    .context(EncodeSnafu)?;
    self.index_creator
        .push_with_name(target_key, Some(&self.value_buf))
        .await
        .context(PushIndexValueSnafu)?;
}
```
Processing subfield targets row-by-row for each target independently leads to redundant work. If multiple subfield indexes are defined on the same column (especially a JSON column), the column value is extracted and processed (including JSON parsing) multiple times per row. Consider grouping subfield extractions by column_id to parse the JSON once per row per column.
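The grouping step can be sketched with simplified types (a target as a `(column_id, path)` pair; both the function name and the tuple shape are assumptions for illustration). Grouping by `column_id` means each column value is extracted, and a JSON value parsed, once per row, then probed for every path registered on that column:

```rust
use std::collections::HashMap;

// Sketch of the suggested optimization (simplified, hypothetical types):
// collect all sub-field paths per column id so the column value can be
// decoded once per row and reused for every path on that column.
fn group_targets_by_column(
    targets: &[(u32, Vec<String>)],
) -> HashMap<u32, Vec<&Vec<String>>> {
    let mut groups: HashMap<u32, Vec<&Vec<String>>> = HashMap::new();
    for (col_id, path) in targets {
        groups.entry(*col_id).or_default().push(path);
    }
    groups
}

fn main() {
    let targets = vec![
        (1, vec!["a".to_string(), "b".to_string()]),
        (1, vec!["c".to_string()]),
        (2, vec!["d".to_string()]),
    ];
    let groups = group_targets_by_column(&targets);
    // Column 1 carries two paths: its JSON value would be parsed once per
    // row and traversed twice, instead of parsed twice.
    assert_eq!(groups[&1].len(), 2);
    assert_eq!(groups[&2].len(), 1);
}
```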
I hereby agree to the terms of the GreptimeDB CLA.
Refer to a related PR or issue link (optional)
What's changed and what's your intention?
RUN CI ONLY, NO MERGE
PR Checklist
Please convert it to a draft if some of the following conditions are not met.