feat: support inverted index for nested path #8075

Draft
fengys1996 wants to merge 2 commits into GreptimeTeam:main from fengys1996:feat/inverted-index-for-nested-path

Conversation

@fengys1996
Contributor

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

github-actions bot added the size/S and docs-not-required (This change does not impact docs.) labels on May 7, 2026
Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces support for nested paths in inverted index targets by extending the IndexTarget enum and updating the InvertedIndexer to handle these paths. The review identifies several performance issues related to repeated string allocations and formatting within loops, recommending that target_key be pre-calculated. Additionally, the review points out a logic gap where nested paths are skipped in do_update and raises a design concern regarding potential parsing collisions due to the lack of escaping for delimiters in the encoded target string.

Comment on lines +174 to +186
+ for target in &self.indexed_targets {
+     let Some(column_meta) = self.metadata.column_by_id(target.column_id()) else {
          debug!(
              "Column {} not found in the metadata during building inverted index",
-             col_id
+             target.column_id()
          );
          continue;
      };
-     let column_name = &column_meta.column_schema.name;
-     if let Some(column_array) = batch.column_by_name(column_name) {
+     let column_name = if let Some(path) = target.path() {
+         format!("{}.{}", column_meta.column_schema.name, path.join("."))
+     } else {
+         column_meta.column_schema.name.clone()
+     };

high

Calculating target_key and column_name inside the row loop (or repeatedly in the target loop) is inefficient. target.to_string() involves string joining and formatting, which should be done once per target per batch. Ideally, these should be pre-calculated and stored in the InvertedIndexer struct, similar to how target_key was handled before this change.

Suggested change
- for target in &self.indexed_targets {
-     let Some(column_meta) = self.metadata.column_by_id(target.column_id()) else {
-         debug!(
-             "Column {} not found in the metadata during building inverted index",
-             target.column_id()
-         );
-         continue;
-     };
-     let column_name = if let Some(path) = target.path() {
-         format!("{}.{}", column_meta.column_schema.name, path.join("."))
-     } else {
-         column_meta.column_schema.name.clone()
-     };
+ for target in &self.indexed_targets {
+     let target_key = target.to_string();
+     let Some(column_meta) = self.metadata.column_by_id(target.column_id()) else {
+         debug!(
+             "Column {} not found in the metadata during building inverted index",
+             target.column_id()
+         );
+         continue;
+     };
+     let column_name = match target.path() {
+         Some(path) => format!("{}.{}", column_meta.column_schema.name, path.join(".")),
+         None => column_meta.column_schema.name.clone(),
+     };
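
The comment above also mentions storing the pre-calculated key in the InvertedIndexer struct. A minimal sketch of that idea follows, under stated assumptions: `IndexTarget`, `PreparedTarget`, and `Indexer` are hypothetical stand-ins rather than the PR's real types, the `col_id:a.b` key format is taken from the review's description of the encoding, and the `push` callback stands in for the index creator.

```rust
use std::fmt;

/// Hypothetical stand-in for the PR's `IndexTarget`; the real enum may differ.
#[derive(Debug, Clone, PartialEq)]
enum IndexTarget {
    ColumnId(u32),
    NestedPath { column_id: u32, path: Vec<String> },
}

impl fmt::Display for IndexTarget {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            IndexTarget::ColumnId(id) => write!(f, "{id}"),
            // Assumed encoding: "<column_id>:<seg>.<seg>".
            IndexTarget::NestedPath { column_id, path } => {
                write!(f, "{column_id}:{}", path.join("."))
            }
        }
    }
}

/// A target paired with its encoded key, formatted exactly once.
struct PreparedTarget {
    target: IndexTarget,
    target_key: String,
}

/// Hypothetical, trimmed-down indexer that keeps the prepared targets.
struct Indexer {
    targets: Vec<PreparedTarget>,
}

impl Indexer {
    /// Formats every key at construction time, not per batch or per row.
    fn new(targets: Vec<IndexTarget>) -> Self {
        let targets = targets
            .into_iter()
            .map(|target| PreparedTarget {
                target_key: target.to_string(),
                target,
            })
            .collect();
        Self { targets }
    }

    /// Calls `push` once per (target, row); the row loop only borrows the key.
    fn update<F>(&self, rows: &[Option<&str>], mut push: F)
    where
        F: FnMut(&str, Option<&str>),
    {
        for prepared in &self.targets {
            for row in rows {
                push(&prepared.target_key, *row);
            }
        }
    }
}
```

With this shape the row loop performs no string formatting; whether the key lives next to the target or in a parallel vector is a design choice for the PR author.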

  if value_ref.is_null() {
      self.index_creator
-         .push_with_name(target_key, None)
+         .push_with_name(&target.to_string(), None)

high

Use the pre-calculated target_key here to avoid repeated string allocations in the row loop.

Suggested change
- .push_with_name(&target.to_string(), None)
+ .push_with_name(&target_key, None)

      .context(EncodeSnafu)?;
  self.index_creator
-     .push_with_name(target_key, Some(&self.value_buf))
+     .push_with_name(&target.to_string(), Some(&self.value_buf))

high

Use the pre-calculated target_key here to avoid repeated string allocations in the row loop.

Suggested change
- .push_with_name(&target.to_string(), Some(&self.value_buf))
+ .push_with_name(&target_key, Some(&self.value_buf))


  self.index_creator
-     .push_with_name_n(target_key, elem, *count)
+     .push_with_name_n(&target.to_string(), elem, *count)

high

Use the pre-calculated target_key here.

Suggested change
- .push_with_name_n(&target.to_string(), elem, *count)
+ .push_with_name_n(&target_key, elem, *count)

Comment on lines +312 to +315
+ for target in &self.indexed_targets {
+     if target.path().is_some() {
+         continue;
+     }

high

Pre-calculate target_key here and note that nested paths are currently skipped in do_update. Since do_update is the primary method used for indexing during regular SST writing, this means nested indices will not be created for most SSTs. If mito2::read::Batch does not contain flattened columns, the indexer should be updated to extract nested values from the JSON vectors.

        for target in &self.indexed_targets {
            let target_key = target.to_string();
            if target.path().is_some() {
                continue;
            }
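
To illustrate what "extract nested values from the JSON vectors" could look like, here is a minimal sketch under stated assumptions. The `Json` enum is a hypothetical stand-in for whatever JSON representation the batch actually holds, and `extract` and `extract_column` are illustrative names, not functions from the PR or from mito2.

```rust
use std::collections::BTreeMap;

/// Minimal JSON-like value; a hypothetical stand-in for the batch's JSON type.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Str(String),
    Object(BTreeMap<String, Json>),
}

/// Follows `path` through nested objects. Returns `None` when a segment is
/// missing or when a non-object value is reached before the path ends.
fn extract<'a>(root: &'a Json, path: &[String]) -> Option<&'a Json> {
    let mut current = root;
    for segment in path {
        match current {
            Json::Object(fields) => current = fields.get(segment)?,
            _ => return None,
        }
    }
    Some(current)
}

/// Extracts one value per row. Missing paths and JSON nulls both become
/// `None`, which an indexer could push as a null entry.
fn extract_column<'a>(rows: &'a [Json], path: &[String]) -> Vec<Option<&'a Json>> {
    rows.iter()
        .map(|row| extract(row, path).filter(|value| !matches!(value, Json::Null)))
        .collect()
}
```

Treating a missing path the same as an explicit null is one possible choice; the PR may need to distinguish the two.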


  self.index_creator
-     .push_with_name_n(target_key, value, n)
+     .push_with_name_n(&target.to_string(), value, n)

high

Use the pre-calculated target_key here.

Suggested change
- .push_with_name_n(&target.to_string(), value, n)
+ .push_with_name_n(&target_key, value, n)

  if value.is_null() {
      self.index_creator
-         .push_with_name(target_key, None)
+         .push_with_name(&target.to_string(), None)

high

Use the pre-calculated target_key here.

Suggested change
- .push_with_name(&target.to_string(), None)
+ .push_with_name(&target_key, None)

      .context(EncodeSnafu)?;
  self.index_creator
-     .push_with_name(target_key, Some(&self.value_buf))
+     .push_with_name(&target.to_string(), Some(&self.value_buf))

high

Use the pre-calculated target_key here.

Suggested change
- .push_with_name(&target.to_string(), Some(&self.value_buf))
+ .push_with_name(&target_key, Some(&self.value_buf))

Comment thread src/index/src/target.rs

let nested_path_str = nested_path.trim();
ensure!(!nested_path_str.is_empty(), InvalidPathSnafu { key });
// FIXME(fys): do we need to handle special characters in here and encode method?

medium

The use of . and : as delimiters in the encoded IndexTarget string without escaping is problematic. JSON keys frequently contain these characters. For example, a nested path like ["a.b", "c"] would be encoded as col_id:a.b.c, which the decode method would incorrectly parse as ["a", "b", "c"]. Consider implementing an escaping mechanism or using a more robust serialization format for the target key to avoid collisions.
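
One possible escaping mechanism is sketched below, as an illustration rather than the PR's implementation. It assumes the `col_id:seg.seg` layout described above and a backslash escape character; `escape_segment`, `encode`, and `decode` are hypothetical free functions, not the methods in src/index/src/target.rs.

```rust
/// Escapes `\`, `.` and `:` in one path segment so they cannot be mistaken
/// for delimiters in the encoded key.
fn escape_segment(segment: &str) -> String {
    let mut out = String::with_capacity(segment.len());
    for ch in segment.chars() {
        if matches!(ch, '\\' | '.' | ':') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Encodes a target as `<column_id>:<seg>.<seg>...` with escaped segments.
/// Assumes `path` is non-empty.
fn encode(column_id: u32, path: &[String]) -> String {
    let escaped: Vec<String> = path.iter().map(|s| escape_segment(s)).collect();
    format!("{}:{}", column_id, escaped.join("."))
}

/// Decodes a key produced by `encode`. Returns `None` for a missing `:`,
/// a non-numeric column id, or a dangling trailing backslash.
fn decode(key: &str) -> Option<(u32, Vec<String>)> {
    // The column id contains only digits, so the first `:` ends it.
    let (id, rest) = key.split_once(':')?;
    let column_id = id.parse().ok()?;

    let mut path = Vec::new();
    let mut current = String::new();
    let mut chars = rest.chars();
    while let Some(ch) = chars.next() {
        match ch {
            // Take the next character literally, whatever it is.
            '\\' => current.push(chars.next()?),
            // An unescaped dot ends the current segment.
            '.' => path.push(std::mem::take(&mut current)),
            _ => current.push(ch),
        }
    }
    path.push(current);
    Some((column_id, path))
}
```

With this scheme, the path ["a.b", "c"] on column 7 encodes to `7:a\.b.c` and decodes back to two segments, while the unescaped key `7:a.b.c` still decodes to three. Changing the encoding would affect keys already written by the current format, so compatibility would need separate consideration.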

github-actions bot added the size/M label and removed the size/S label on May 7, 2026
Labels

docs-not-required (This change does not impact docs.), size/M
