feat: support inverted index for nested path #8075
fengys1996 wants to merge 2 commits into GreptimeTeam:main from
Code Review
This pull request introduces support for nested paths in inverted index targets by extending the IndexTarget enum and updating the InvertedIndexer to handle these paths. The review identifies several performance issues related to repeated string allocations and formatting within loops, recommending that target_key be pre-calculated. Additionally, the review points out a logic gap where nested paths are skipped in do_update and raises a design concern regarding potential parsing collisions due to the lack of escaping for delimiters in the encoded target string.
  for target in &self.indexed_targets {
      let Some(column_meta) = self.metadata.column_by_id(target.column_id()) else {
          debug!(
              "Column {} not found in the metadata during building inverted index",
-             col_id
+             target.column_id()
          );
          continue;
      };
-     let column_name = &column_meta.column_schema.name;
-     if let Some(column_array) = batch.column_by_name(column_name) {
+     let column_name = if let Some(path) = target.path() {
+         format!("{}.{}", column_meta.column_schema.name, path.join("."))
+     } else {
+         column_meta.column_schema.name.clone()
+     };
Calculating target_key and column_name inside the row loop (or repeatedly in the target loop) is inefficient. target.to_string() involves string joining and formatting, which should be done once per target per batch. Ideally, these should be pre-calculated and stored in the InvertedIndexer struct, similar to how target_key was handled before this change.
Suggested change:
for target in &self.indexed_targets {
    let target_key = target.to_string();
    let Some(column_meta) = self.metadata.column_by_id(target.column_id()) else {
        debug!(
            "Column {} not found in the metadata during building inverted index",
            target.column_id()
        );
        continue;
    };
    let column_name = match target.path() {
        Some(path) => format!("{}.{}", column_meta.column_schema.name, path.join(".")),
        None => column_meta.column_schema.name.clone(),
    };
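The comment's stronger recommendation, storing the key once rather than rebuilding it per row, could look roughly like the sketch below. `Target`, `PreparedTargets`, and `push_rows` are hypothetical stand-ins, not types from the PR; the `col_id:a.b.c` key layout is taken from the review discussion, and the `sink` vector only imitates a `push_with_name`-style call.

```rust
/// Hypothetical stand-in for the PR's `IndexTarget`; not the real type.
#[derive(Debug, Clone)]
struct Target {
    column_id: u32,
    path: Option<Vec<String>>,
}

impl Target {
    /// Builds the key: `<column_id>` or `<column_id>:<a.b.c>`.
    fn encode(&self) -> String {
        match &self.path {
            Some(path) => format!("{}:{}", self.column_id, path.join(".")),
            None => self.column_id.to_string(),
        }
    }
}

/// Pairs each target with its key, computed once at construction time.
struct PreparedTargets {
    entries: Vec<(Target, String)>,
}

impl PreparedTargets {
    fn new(targets: Vec<Target>) -> Self {
        let entries = targets
            .into_iter()
            .map(|t| {
                let key = t.encode();
                (t, key)
            })
            .collect();
        Self { entries }
    }

    /// Imitates the row loop: every push borrows the stored key as `&str`,
    /// so no key string is formatted or allocated per row.
    fn push_rows<'a>(&'a self, rows: usize, sink: &mut Vec<(&'a str, usize)>) {
        for (_target, key) in &self.entries {
            for row in 0..rows {
                sink.push((key.as_str(), row));
            }
        }
    }
}
```

With this shape the formatting cost is paid once per target when the indexer is built, instead of once per row or per batch.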
  if value_ref.is_null() {
      self.index_creator
-         .push_with_name(target_key, None)
+         .push_with_name(&target.to_string(), None)
          .context(EncodeSnafu)?;

      self.index_creator
-         .push_with_name(target_key, Some(&self.value_buf))
+         .push_with_name(&target.to_string(), Some(&self.value_buf))

      self.index_creator
-         .push_with_name_n(target_key, elem, *count)
+         .push_with_name_n(&target.to_string(), elem, *count)
  for target in &self.indexed_targets {
      if target.path().is_some() {
          continue;
      }
Pre-calculate target_key here and note that nested paths are currently skipped in do_update. Since do_update is the primary method used for indexing during regular SST writing, this means nested indices will not be created for most SSTs. If mito2::read::Batch does not contain flattened columns, the indexer should be updated to extract nested values from the JSON vectors.
Suggested change:
for target in &self.indexed_targets {
    let target_key = target.to_string();
    if target.path().is_some() {
        continue;
    }
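If nested values do have to be pulled out of JSON data inside `do_update`, the core step is a path walk like the one sketched here. The `Json` enum and `extract` function are hypothetical and use only the standard library; the real indexer would operate on the engine's own JSON vector type, which this sketch does not model.

```rust
use std::collections::BTreeMap;

/// Minimal JSON-like value (hypothetical; not the engine's JSON type).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Str(String),
    Num(f64),
    Object(BTreeMap<String, Json>),
}

/// Walks `path` through nested objects.
/// Returns `None` when a segment is missing or a non-object value is reached
/// before the path is exhausted.
fn extract<'a>(root: &'a Json, path: &[&str]) -> Option<&'a Json> {
    let mut current = root;
    for segment in path {
        match current {
            Json::Object(fields) => current = fields.get(*segment)?,
            _ => return None,
        }
    }
    Some(current)
}
```

A missing value would then map to the existing null branch (pushing `None` for that target), which is an assumption about the desired behavior rather than something the PR states.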
      self.index_creator
-         .push_with_name_n(target_key, value, n)
+         .push_with_name_n(&target.to_string(), value, n)

  if value.is_null() {
      self.index_creator
-         .push_with_name(target_key, None)
+         .push_with_name(&target.to_string(), None)
          .context(EncodeSnafu)?;

      self.index_creator
-         .push_with_name(target_key, Some(&self.value_buf))
+         .push_with_name(&target.to_string(), Some(&self.value_buf))
  let nested_path_str = nested_path.trim();
  ensure!(!nested_path_str.is_empty(), InvalidPathSnafu { key });
  // FIXME(fys): do we need to handle special characters in here and encode method?
The use of . and : as delimiters in the encoded IndexTarget string without escaping is problematic. JSON keys frequently contain these characters. For example, a nested path like ["a.b", "c"] would be encoded as col_id:a.b.c, which the decode method would incorrectly parse as ["a", "b", "c"]. Consider implementing an escaping mechanism or using a more robust serialization format for the target key to avoid collisions.
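One possible escaping scheme is sketched below: a backslash is placed before any `\`, `.` or `:` inside a segment, and the decoder splits only on unescaped dots. `encode_path` and `decode_path` are hypothetical helper names, not functions from the PR, and the sketch covers only the path part of the key (the `col_id:` prefix is left out). It also assumes a non-empty path, matching the `ensure!` check in the diff.

```rust
/// Escapes `\`, `.` and `:` inside each segment, then joins segments with `.`.
/// Assumes `segments` is non-empty (an empty slice does not round-trip).
fn encode_path(segments: &[&str]) -> String {
    let mut out = String::new();
    for (i, seg) in segments.iter().enumerate() {
        if i > 0 {
            out.push('.');
        }
        for ch in seg.chars() {
            if matches!(ch, '\\' | '.' | ':') {
                out.push('\\');
            }
            out.push(ch);
        }
    }
    out
}

/// Splits on unescaped `.` and drops the escape characters.
/// Returns `None` if the input ends with a dangling `\`.
fn decode_path(encoded: &str) -> Option<Vec<String>> {
    let mut segments = Vec::new();
    let mut current = String::new();
    let mut chars = encoded.chars();
    while let Some(ch) = chars.next() {
        match ch {
            // The character after a backslash is taken literally.
            '\\' => current.push(chars.next()?),
            // An unescaped dot ends the current segment.
            '.' => segments.push(std::mem::take(&mut current)),
            other => current.push(other),
        }
    }
    segments.push(current);
    Some(segments)
}
```

Under this scheme `["a.b", "c"]` and `["a", "b", "c"]` encode to different strings, so the collision described above no longer occurs. Keys written without escaping would still decode as before unless they contain a backslash, which matters if already persisted index targets need to stay readable.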
I hereby agree to the terms of the GreptimeDB CLA.