Skip to content

feat: support try-swapping-with-projection#8044

Draft
fengys1996 wants to merge 1 commit intoGreptimeTeam:mainfrom
fengys1996:feat/try-swapping-with-projection
Draft

feat: support try-swapping-with-projection#8044
fengys1996 wants to merge 1 commit intoGreptimeTeam:mainfrom
fengys1996:feat/try-swapping-with-projection

Conversation

@fengys1996
Copy link
Copy Markdown
Contributor

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

@github-actions github-actions Bot added size/M docs-not-required This change does not impact docs. labels Apr 28, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a projection_pushdown module designed to extract nested JSON paths from DataFusion projections, enabling the storage layer to optimize scans by fetching only required sub-fields. A significant issue was identified in the path collection logic: it currently only tracks paths within json_get functions and fails to account for columns accessed directly or in other contexts. This could result in the storage layer incorrectly omitting data when a full column is required alongside nested access.

Comment on lines +61 to +72
fn collect_nested_paths_from_expr(
expr: &dyn PhysicalExpr,
nested_paths: &mut BTreeSet<NestedPath>,
) {
if let Some(path) = extract_json_get_path(expr) {
let _ = nested_paths.insert(path);
}

for child in expr.children() {
collect_nested_paths_from_expr(child.as_ref(), nested_paths);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of collect_nested_paths_from_expr only collects paths from json_get functions but fails to account for columns that are accessed directly or used in other functions. If a column is used both in a json_get (which adds a nested path) and elsewhere in the projection (which requires the full column), the storage layer might incorrectly optimize the scan to only fetch the nested path, leading to incomplete data in the result set.

To fix this, the logic should explicitly track when a column is used outside of a pushable json_get and mark it as fully required by adding its root path to the set.

fn collect_nested_paths_from_expr(
    expr: &dyn PhysicalExpr,
    nested_paths: &mut BTreeSet<NestedPath>,
) {
    if let Some(path) = extract_json_get_path(expr) {
        nested_paths.insert(path);
        // Recurse into arguments except the first one (the column) to avoid marking it as fully required.
        for child in expr.children().iter().skip(1) {
            collect_nested_paths_from_expr(child.as_ref(), nested_paths);
        }
        return;
    }

    // If this is a Column expression and it wasn't handled by extract_json_get_path above,
    // it means the column is required in its entirety.
    if let Some(column) = expr.as_any().downcast_ref::<Column>() {
        nested_paths.insert(vec![column.name().to_string()]);
    }

    for child in expr.children() {
        collect_nested_paths_from_expr(child.as_ref(), nested_paths);
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs-not-required This change does not impact docs. size/M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant