perf(common): Parallelize metadata fetching during query planning #2017

Merged
fordN merged 2 commits into main from ford/optimizations/parallel-metadata-fetching
Mar 24, 2026

@fordN fordN commented Mar 24, 2026

Background

During query planning, Parquet file metadata (footers) is fetched one file at a time. On a cold cache with many files, this serializes hundreds of network round trips. In one large example query, 684 files are fetched, so at 50 ms per file it takes ~34 seconds just to fetch metadata before any data is read. Fetching in parallel with .buffered(32) brings the theoretical time for those 684 files down to ~1 second.

On a warm cache the impact is minimal, since cache lookups are sub-millisecond, but cold starts, new tables, and large time-range queries benefit significantly.
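
The estimate above can be reproduced with a small back-of-the-envelope model. This is only a sketch: it assumes every fetch takes exactly the same time and ignores scheduling and connection overhead, and the function names are made up for illustration.

```rust
/// Estimated wall-clock time (ms) when footers are fetched one at a time.
fn sequential_ms(files: u64, latency_ms: u64) -> u64 {
    files * latency_ms
}

/// Estimated wall-clock time (ms) with at most `concurrency` fetches in flight.
/// ASSUMPTION: uniform latency, so the work completes in ceil(files / concurrency) waves.
fn buffered_ms(files: u64, concurrency: u64, latency_ms: u64) -> u64 {
    assert!(concurrency > 0, "concurrency must be at least 1");
    // Integer ceiling division: number of waves needed to cover all files.
    let waves = (files + concurrency - 1) / concurrency;
    waves * latency_ms
}
```

For the example query, `sequential_ms(684, 50)` gives 34200 ms (~34 s) and `buffered_ms(684, 32, 50)` gives 22 waves × 50 ms = 1100 ms (~1 s).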

Changes

  • Replace sequential .then() with .buffered(N) in resolve_file_groups(), so up to N parquet metadata fetches run concurrently during query planning. Order is preserved to maintain deterministic round-robin partition assignment.
  • Add configurable metadata_fetch_concurrency (default: 32) via config file or AMP_CONFIG_METADATA_FETCH_CONCURRENCY env var
    • Config flows through: config file (metadata_fetch_concurrency) → Config → server/worker Config → ExecEnv → ExecContextBuilder → QueryableSnapshot → resolve_file_groups()
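
To illustrate the two properties the change relies on (a bounded number of fetches in flight, results returned in input order), here is a std-only sketch. It is not the `.buffered(N)` stream combinator used in the PR; it is a thread-based analogue, and `fetch_ordered` and the `fetch` closure are hypothetical stand-ins for the real footer fetch.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Runs `fetch` over `inputs` with at most `limit` calls in flight and
/// returns the results in input order, regardless of completion order.
/// ASSUMPTION: `fetch` is a blocking stand-in for the async footer fetch.
fn fetch_ordered<T, R, F>(inputs: &[T], limit: usize, fetch: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    assert!(limit > 0, "limit must be at least 1");
    // Index of the next input that a worker should claim.
    let next = AtomicUsize::new(0);
    // One result slot per input, so each result lands at its input's position.
    let slots: Vec<Mutex<Option<R>>> = inputs.iter().map(|_| Mutex::new(None)).collect();

    thread::scope(|s| {
        // Spawn at most `limit` workers; each claims indices until none remain.
        for _ in 0..limit.min(inputs.len()) {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= inputs.len() {
                    break;
                }
                let result = fetch(&inputs[i]);
                *slots[i].lock().unwrap() = Some(result);
            });
        }
    });

    // All workers have joined here, so every slot has been filled.
    slots
        .into_iter()
        .map(|slot| slot.into_inner().unwrap().expect("every slot is filled"))
        .collect()
}
```

Even if a later input finishes before an earlier one, the output vector keeps the input order, which is the same guarantee `.buffered()` gives over `.buffer_unordered()`.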

 Replace sequential `.then()` with `.buffered(32)` in
 resolve_file_groups() so up to 32 parquet metadata fetches run
 concurrently. Uses `.buffered()` to preserve ordering for
 deterministic round-robin partition assignment.
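
Why ordering matters for partition assignment can be shown with a tiny sketch. The PR does not show the assignment logic, so this assumes the simplest form of round-robin (file `i` goes to group `i % partitions`); `round_robin` is a hypothetical helper.

```rust
/// Assigns files to `partitions` groups round-robin.
/// ASSUMPTION: file at index i goes to group i % partitions.
fn round_robin<T: Clone>(files: &[T], partitions: usize) -> Vec<Vec<T>> {
    assert!(partitions > 0, "need at least one partition");
    let mut groups = vec![Vec::new(); partitions];
    for (i, file) in files.iter().enumerate() {
        groups[i % partitions].push(file.clone());
    }
    groups
}
```

Under this assumption, the same set of files arriving in a different order produces different groups, so an unordered combinator would make the plan depend on which fetch happened to finish first.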
@fordN fordN force-pushed the ford/optimizations/parallel-metadata-fetching branch from 76d9b1f to db3133e March 24, 2026 16:54
@fordN fordN requested review from JohnSwan1503 and LNSD March 24, 2026 16:54
@fordN fordN changed the title from "perf(common: Parallelize metadata fetching during query planning" to "perf(common): Parallelize metadata fetching during query planning" Mar 24, 2026
@fordN fordN force-pushed the ford/optimizations/parallel-metadata-fetching branch from db3133e to 8e4d432 March 24, 2026 17:06
Configurable via config file or the AMP_CONFIG_METADATA_FETCH_CONCURRENCY
env var (default: 32). Controls how many parquet footer fetches run
concurrently during query planning.
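
A minimal sketch of how such a setting might be resolved. The PR states the default (32) and the two sources, but not their precedence or how invalid values are handled, so the rules here (env var wins over the config file; zero or unparsable values are ignored) are assumptions, and `resolve_concurrency` is a hypothetical helper.

```rust
/// Default stated in the PR description.
const DEFAULT_METADATA_FETCH_CONCURRENCY: usize = 32;

/// Picks the effective concurrency from the optional env and config-file values.
/// ASSUMPTION: the env var overrides the config file, and zero or unparsable
/// values are skipped in favor of the next source.
fn resolve_concurrency(env_value: Option<&str>, file_value: Option<usize>) -> usize {
    let from_env = env_value
        .and_then(|raw| raw.trim().parse::<usize>().ok())
        .filter(|&n| n > 0);
    from_env
        .or(file_value.filter(|&n| n > 0))
        .unwrap_or(DEFAULT_METADATA_FETCH_CONCURRENCY)
}
```

A caller could pass `std::env::var("AMP_CONFIG_METADATA_FETCH_CONCURRENCY").ok().as_deref()` as the first argument and the parsed config-file field as the second.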
@fordN fordN force-pushed the ford/optimizations/parallel-metadata-fetching branch from 8e4d432 to b47416c March 24, 2026 17:15
@LNSD LNSD left a comment

LGTM ✅

@fordN fordN merged commit 4d815eb into main Mar 24, 2026
8 checks passed
@fordN fordN deleted the ford/optimizations/parallel-metadata-fetching branch March 24, 2026 17:20