feat(datasets): per-table & column bloom filter configuration via manifest by fordN · Pull Request #2021 · edgeandnode/amp

fordN · 2026-03-24T22:46:05Z

This PR adds per-table, per-column bloom filter configuration to dataset manifests.

Background

Parquet stores column statistics in the file footer to make data access more efficient. One that is often useful is the per-column min/max, which lets a query prune files and row groups whose min/max range cannot satisfy a predicate. However, min/max statistics are not useful for unordered hash columns such as topic0 and address.
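To make the limitation concrete, here is a minimal sketch of min/max pruning for an equality predicate. The `ColumnStats` type and both helper functions are hypothetical and exist only for illustration; they are not part of this PR or of the Parquet reader.

```rust
/// Min/max statistics for one column in one row group (hypothetical type).
struct ColumnStats {
    min: u64,
    max: u64,
}

/// Compute min/max over the values stored in a row group.
fn stats_for(values: &[u64]) -> ColumnStats {
    ColumnStats {
        min: *values.iter().min().expect("non-empty row group"),
        max: *values.iter().max().expect("non-empty row group"),
    }
}

/// An equality predicate `col = value` can skip a row group only when
/// the value lies outside the group's [min, max] range.
fn can_prune(stats: &ColumnStats, value: u64) -> bool {
    value < stats.min || value > stats.max
}
```

For an ordered column such as a block number, a row group holding `[100, 101, 102]` is pruned for a lookup of `500`. For a hash-like column, a handful of values typically spans almost the whole value space, so `can_prune` returns `false` for nearly every lookup, even when the value is absent from the row group.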

Bloom filters help in these cases, but their usefulness depends heavily on the data and the query patterns. The legacy configuration, a global toggle with a single hardcoded NDV, is not flexible enough to configure bloom filters properly across a variety of datasets.
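As a rough illustration of why a single hardcoded NDV is a poor fit, the sketch below uses the textbook sizing formula for a classic Bloom filter, bits = -n * ln(p) / (ln 2)^2. This is an approximation only: Parquet uses a split-block bloom filter whose exact sizing differs, and both the function name and the 1% false-positive probability used in the note below are assumptions, not values taken from this PR.

```rust
/// Textbook estimate of the number of bits a classic Bloom filter needs
/// to hold `ndv` distinct values at false-positive probability `fpp`.
/// Illustrative only; Parquet's split-block filter is sized differently.
fn estimated_bloom_bits(ndv: u64, fpp: f64) -> u64 {
    assert!(fpp > 0.0 && fpp < 1.0, "fpp must be in (0, 1)");
    let ln2 = std::f64::consts::LN_2;
    (-(ndv as f64) * fpp.ln() / (ln2 * ln2)).ceil() as u64
}
```

Under this approximation, an NDV of 5,000 at 1% needs roughly 48,000 bits (about 6 KB), and the estimate grows linearly with NDV. An NDV set too high wastes space on every row group, while one set too low raises the false-positive rate and weakens pruning.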

Changes

Add BloomFilterColumnConfig struct to datasets-common::manifest, holding a column name and its ndv (number of distinct values)
Add bloom_filter_columns field to the raw dataset manifest Table definition, allowing per-table bloom filter configuration in dataset manifests
Add bloom_filter_columns() method to the Table trait (with empty default), so the writer can access bloom filter config for any table
Update worker-core::build_parquet_writer_properties() to accept per-column bloom filter configs and set set_column_bloom_filter_enabled & set_column_bloom_filter_ndv per column
Remove the global bloom_filters toggle from server config and ParquetConfig
Wire through all extractors by applying manifest bloom filter config in RawDataset::new()
Update raw manifest JSON schema with new bloom_filter_columns field
Update ampctl manifest generate to include the new bloom_filter_columns field (defaults to empty)
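
The shape of the trait change can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: the field names follow the manifest JSON, while `PlainTable`, `ConfiguredTable`, the `name()` method, and `bloom_filter_plan` are hypothetical stand-ins. The real writer passes these values to the Parquet writer properties builder rather than returning a list.

```rust
/// Per-column bloom filter settings, mirroring the manifest entries.
/// Field names follow the manifest JSON; the real struct may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct BloomFilterColumnConfig {
    pub column: String,
    pub ndv: u64,
}

pub trait Table {
    fn name(&self) -> &str;

    /// Empty by default, so tables without config get no bloom filters.
    fn bloom_filter_columns(&self) -> &[BloomFilterColumnConfig] {
        &[]
    }
}

/// Hypothetical table with no bloom filter config (uses the default).
pub struct PlainTable;

impl Table for PlainTable {
    fn name(&self) -> &str {
        "blocks"
    }
}

/// Hypothetical table carrying the config read from its manifest.
pub struct ConfiguredTable {
    pub bloom_filters: Vec<BloomFilterColumnConfig>,
}

impl Table for ConfiguredTable {
    fn name(&self) -> &str {
        "logs"
    }

    fn bloom_filter_columns(&self) -> &[BloomFilterColumnConfig] {
        &self.bloom_filters
    }
}

/// Stand-in for the writer side: collect the (column, ndv) pairs that
/// would be enabled for this table. Hypothetical helper.
pub fn bloom_filter_plan(table: &dyn Table) -> Vec<(String, u64)> {
    table
        .bloom_filter_columns()
        .iter()
        .map(|c| (c.column.clone(), c.ndv))
        .collect()
}
```

The empty default is what makes the feature opt-in: existing `Table` implementations compile unchanged and produce no bloom filters.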

Usage

Dataset creators can now add the bloom_filter_columns property to their dataset manifest to configure bloom filters per-table:

{
  "tables": {
    "logs": {
      "bloom_filter_columns": [
        { "column": "topic0", "ndv": 5000 },
        { "column": "topic1", "ndv": 100000 },
        { "column": "address", "ndv": 50000 }
      ]
    }
  }
}

Test Results

Local testing on 10K blocks, comparing queries with and without bloom filters on topic0 and address:

Query	Row Groups Pruned	I/O Reduction	Scan Speedup
Transfer (common, ~47% of rows)	0%	None	None
Swap (medium, ~6% of rows)	0%	None	None
Rare event (1 row in 10K blocks)	83%	78%	~10x
Rare contract (0 rows in 10K blocks)	100%	97%	~2000x

Bloom filters are a no-op for common events (present in every row group) with negligible overhead. For rare events and specific contract addresses, they eliminate the majority of I/O by pruning row groups before any data is read.
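
The pattern in the table follows from how a bloom filter answers membership queries: it never reports a stored value as absent, so a value present in every row group can never be pruned. The sketch below shows this with a tiny classic Bloom filter. It is a simplification for illustration only (Parquet uses a split-block variant), and the `Bloom` type, the filter size, and the hash count are all assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Tiny classic Bloom filter, used only to illustrate pruning behavior.
struct Bloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl Bloom {
    fn new(num_bits: usize, hashes: u64) -> Self {
        Bloom { bits: vec![false; num_bits], hashes }
    }

    /// Bit position for `value` under the `seed`-th hash function.
    fn index(&self, value: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        value.hash(&mut h);
        (h.finish() % self.bits.len() as u64) as usize
    }

    fn insert(&mut self, value: &str) {
        for seed in 0..self.hashes {
            let i = self.index(value, seed);
            self.bits[i] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn might_contain(&self, value: &str) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.index(value, seed)])
    }
}

/// Build one filter per row group and count the groups that a lookup
/// of `needle` could skip without reading any column data.
fn prunable_groups(row_groups: &[Vec<&str>], needle: &str) -> usize {
    row_groups
        .iter()
        .filter(|group| {
            let mut bloom = Bloom::new(1024, 3);
            for v in group.iter() {
                bloom.insert(v);
            }
            !bloom.might_contain(needle)
        })
        .count()
}
```

If every row group contains `"Transfer"`, `prunable_groups` returns 0 for it, which matches the first two rows of the table. For a value stored in no row group, most or all groups are skipped, with the remainder being false positives.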

Move bloom filter configuration from a global writer toggle to per-table manifest entries, allowing each table to specify which columns get bloom filters and their NDV values. Tables without config get no bloom filters (opt-in). This replaces the global `bloom_filters` flag in ParquetConfig.

fordN self-assigned this Mar 24, 2026

fordN added the performance label Mar 24, 2026

fordN requested review from JohnSwan1503 and LNSD and removed request for LNSD March 24, 2026 22:47

fordN force-pushed the ford/optimizations/per-table-bloom-filters branch from 689a22e to 25184ea Compare March 24, 2026 23:32

fordN wants to merge 1 commit into main from ford/optimizations/per-table-bloom-filters
