feat: support try-swapping-with-projection #8044
fengys1996 wants to merge 1 commit into GreptimeTeam:main
Code Review
This pull request introduces a projection_pushdown module designed to extract nested JSON paths from DataFusion projections, enabling the storage layer to optimize scans by fetching only required sub-fields. A significant issue was identified in the path collection logic: it currently only tracks paths within json_get functions and fails to account for columns accessed directly or in other contexts. This could result in the storage layer incorrectly omitting data when a full column is required alongside nested access.
fn collect_nested_paths_from_expr(
    expr: &dyn PhysicalExpr,
    nested_paths: &mut BTreeSet<NestedPath>,
) {
    if let Some(path) = extract_json_get_path(expr) {
        let _ = nested_paths.insert(path);
    }

    for child in expr.children() {
        collect_nested_paths_from_expr(child.as_ref(), nested_paths);
    }
}
The current implementation of collect_nested_paths_from_expr only collects paths from json_get functions but fails to account for columns that are accessed directly or used in other functions. If a column is used both in a json_get (which adds a nested path) and elsewhere in the projection (which requires the full column), the storage layer might incorrectly optimize the scan to only fetch the nested path, leading to incomplete data in the result set.
To fix this, the logic should explicitly track when a column is used outside of a pushable json_get and mark it as fully required by adding its root path to the set.
fn collect_nested_paths_from_expr(
    expr: &dyn PhysicalExpr,
    nested_paths: &mut BTreeSet<NestedPath>,
) {
    if let Some(path) = extract_json_get_path(expr) {
        nested_paths.insert(path);
        // Recurse into arguments except the first one (the column) to avoid marking it as fully required.
        for child in expr.children().iter().skip(1) {
            collect_nested_paths_from_expr(child.as_ref(), nested_paths);
        }
        return;
    }
    // If this is a Column expression and it wasn't handled by extract_json_get_path above,
    // it means the column is required in its entirety.
    if let Some(column) = expr.as_any().downcast_ref::<Column>() {
        nested_paths.insert(vec![column.name().to_string()]);
    }
    for child in expr.children() {
        collect_nested_paths_from_expr(child.as_ref(), nested_paths);
    }
}
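To make the difference concrete, here is a minimal, self-contained sketch that reproduces both traversals on a toy expression tree. It is not the PR's code and does not use DataFusion: the `Expr` enum, the helper constructors, the dotted-path format passed to `json_get`, the `length` function, and the `attrs` column are all assumptions made for illustration, and `NestedPath` is assumed to be a `Vec<String>` whose first element is the column name.

```rust
use std::collections::BTreeSet;

/// Assumed shape of a nested path: column name followed by JSON keys.
type NestedPath = Vec<String>;

/// Toy expression tree standing in for DataFusion's `PhysicalExpr` (hypothetical).
enum Expr {
    Column(String),
    Literal(String),
    Call { name: String, args: Vec<Expr> },
}

impl Expr {
    fn children(&self) -> &[Expr] {
        match self {
            Expr::Call { args, .. } => args.as_slice(),
            _ => &[],
        }
    }
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn lit(value: &str) -> Expr {
    Expr::Literal(value.to_string())
}

fn call(name: &str, args: Vec<Expr>) -> Expr {
    Expr::Call { name: name.to_string(), args }
}

/// Recognizes `json_get(<column>, '<dotted.path>')` and returns `[column, key, ...]`.
fn extract_json_get_path(expr: &Expr) -> Option<NestedPath> {
    let Expr::Call { name, args } = expr else { return None };
    if name.as_str() != "json_get" {
        return None;
    }
    match args.as_slice() {
        [Expr::Column(column), Expr::Literal(path)] => {
            let mut out = vec![column.clone()];
            out.extend(path.split('.').map(str::to_string));
            Some(out)
        }
        _ => None,
    }
}

/// Mirrors the traversal in the diff: only `json_get` calls contribute paths.
fn collect_original(expr: &Expr, paths: &mut BTreeSet<NestedPath>) {
    if let Some(path) = extract_json_get_path(expr) {
        paths.insert(path);
    }
    for child in expr.children() {
        collect_original(child, paths);
    }
}

/// Mirrors the suggested traversal: a column seen outside a pushable
/// `json_get` contributes its root path, meaning the full column is needed.
fn collect_suggested(expr: &Expr, paths: &mut BTreeSet<NestedPath>) {
    if let Some(path) = extract_json_get_path(expr) {
        paths.insert(path);
        // Skip the first argument (the column) so it is not marked as fully required.
        for child in expr.children().iter().skip(1) {
            collect_suggested(child, paths);
        }
        return;
    }
    if let Expr::Column(name) = expr {
        paths.insert(vec![name.clone()]);
    }
    for child in expr.children() {
        collect_suggested(child, paths);
    }
}

/// Runs one of the collectors over every expression in a projection.
fn collect_all(
    exprs: &[Expr],
    collect: fn(&Expr, &mut BTreeSet<NestedPath>),
) -> BTreeSet<NestedPath> {
    let mut paths = BTreeSet::new();
    for expr in exprs {
        collect(expr, &mut paths);
    }
    paths
}

/// Projection equivalent to `SELECT json_get(attrs, 'a.b'), length(attrs)`.
fn sample_projection() -> Vec<Expr> {
    vec![
        call("json_get", vec![col("attrs"), lit("a.b")]),
        call("length", vec![col("attrs")]),
    ]
}
```

In this toy model, `collect_all(&sample_projection(), collect_original)` yields only the nested path for `attrs.a.b`, so a reader of that set cannot tell that `length(attrs)` needs the whole column. The suggested traversal additionally records the root path `["attrs"]`. How the storage layer should interpret a root path alongside nested paths is outside this sketch.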
I hereby agree to the terms of the GreptimeDB CLA.