feat(datasets): per-table & column bloom filter configuration via manifest#2021

Open
fordN wants to merge 1 commit into main from ford/optimizations/per-table-bloom-filters

@fordN fordN commented Mar 24, 2026

This PR introduces bloom filter configuration per dataset table and column, set through the dataset manifest.

Background

Parquet statistics stored in the file footer enable efficient data access. One particularly useful feature is the per-column min/max statistics, which let a reader prune files and row groups when a query predicate falls outside the min/max range. However, min/max statistics are not useful for unordered hash columns such as topic0 and address.
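To illustrate why min/max pruning works for ordered columns but not for hash columns, here is a minimal sketch. The `ColumnStats` type and both helper functions are hypothetical and only model the idea; a `u64` stands in for a hash prefix, and none of this is the reader's actual code.

```rust
/// Min/max statistics for one column in one row group (hypothetical type).
struct ColumnStats {
    min: u64,
    max: u64,
}

/// Compute the statistics a writer would record for a non-empty row group.
fn stats_of(values: &[u64]) -> ColumnStats {
    ColumnStats {
        min: *values.iter().min().expect("row group must not be empty"),
        max: *values.iter().max().expect("row group must not be empty"),
    }
}

/// A row group can be skipped for `column = value` only when the value
/// lies outside the recorded [min, max] range.
fn can_prune(stats: &ColumnStats, value: u64) -> bool {
    value < stats.min || value > stats.max
}
```

For an ordered column such as a block number, each row group covers a narrow range, so most equality predicates fall outside it. For hash-like values, a single row group tends to span almost the whole value space, so `can_prune` rarely returns true.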

Bloom filters help in these cases, but their usefulness depends heavily on the data and the query patterns. The legacy configuration options, a global toggle with a single hardcoded NDV, are not flexible enough to configure bloom filters properly across a variety of datasets.
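As a rough illustration of why a single hardcoded NDV is a poor fit, the sketch below uses the textbook sizing formula for a classic Bloom filter, m = -n ln(p) / (ln 2)^2. This is an approximation only: Parquet uses split-block Bloom filters with their own sizing and rounding, and the false-positive probability `fpp` is an assumed parameter that this PR does not mention.

```rust
/// Textbook estimate of the bits a classic Bloom filter needs to hold `ndv`
/// distinct values at false-positive probability `fpp`.
/// Assumption: this is NOT the Parquet writer's exact sizing logic.
fn estimated_bits(ndv: u64, fpp: f64) -> f64 {
    -(ndv as f64) * fpp.ln() / std::f64::consts::LN_2.powi(2)
}

/// The same estimate expressed in KiB, for easier comparison.
fn estimated_kib(ndv: u64, fpp: f64) -> f64 {
    estimated_bits(ndv, fpp) / 8.0 / 1024.0
}
```

Under this approximation the filter size grows linearly with NDV. A value chosen for a low-cardinality column undersizes the filter for a high-cardinality one (more false positives), while the reverse wastes space.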

Changes

  • Add BloomFilterColumnConfig struct to datasets-common::manifest, holding a column name and an ndv (number of distinct values) setting for each configured column
  • Add bloom_filter_columns field to the raw dataset manifest Table definition, allowing per-table bloom filter configuration in dataset manifests
  • Add bloom_filter_columns() method to the Table trait (with empty default), so the writer can access bloom filter config for any table
  • Update worker-core::build_parquet_writer_properties() to accept per-column bloom filter configs and call set_column_bloom_filter_enabled and set_column_bloom_filter_ndv for each configured column
  • Remove the global bloom_filters toggle from server config and ParquetConfig
  • Wire through all extractors by applying manifest bloom filter config in RawDataset::new()
  • Update raw manifest JSON schema with new bloom_filter_columns field
  • Update ampctl manifest generate to include the new bloom_filter_columns field (defaults to empty)
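The shape of the change can be sketched as below. This is a simplified, standard-library-only sketch, not the code in this PR. The field names `column` and `ndv` follow the manifest, but the field types, the `name()` method, the `LogsTable` and `BlocksTable` types, and the `planned_filters` helper are assumptions made for illustration. In the real writer, each entry would be applied to the Parquet writer properties instead of being collected into a list.

```rust
/// Per-column bloom filter settings, mirroring one manifest entry.
/// Assumption: field types are guesses; only the names come from the manifest.
#[derive(Debug, Clone, PartialEq)]
pub struct BloomFilterColumnConfig {
    pub column: String,
    pub ndv: u64,
}

pub trait Table {
    fn name(&self) -> &str;

    /// Opt-in: tables that do not override this get no bloom filters.
    fn bloom_filter_columns(&self) -> &[BloomFilterColumnConfig] {
        &[]
    }
}

/// Hypothetical table that carries bloom filter config from its manifest.
pub struct LogsTable {
    pub filters: Vec<BloomFilterColumnConfig>,
}

impl Table for LogsTable {
    fn name(&self) -> &str {
        "logs"
    }
    fn bloom_filter_columns(&self) -> &[BloomFilterColumnConfig] {
        &self.filters
    }
}

/// Hypothetical table that relies on the empty default.
pub struct BlocksTable;

impl Table for BlocksTable {
    fn name(&self) -> &str {
        "blocks"
    }
}

/// Stand-in for the writer: list the (column, ndv) pairs it would enable.
pub fn planned_filters(table: &dyn Table) -> Vec<(String, u64)> {
    table
        .bloom_filter_columns()
        .iter()
        .map(|c| (c.column.clone(), c.ndv))
        .collect()
}
```

With the default method returning an empty slice, existing `Table` implementations keep compiling and simply produce no bloom filters until they opt in.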

Usage

Dataset creators can now add the bloom_filter_columns property to their dataset manifest to configure bloom filters per-table:

{
  "tables": {
    "logs": {
      "bloom_filter_columns": [
        { "column": "topic0", "ndv": 5000 },
        { "column": "topic1", "ndv": 100000 },
        { "column": "address", "ndv": 50000 }
      ]
    }
  }
}

Test Results

Local testing on 10K blocks, comparing queries with and without bloom filters on topic0 and address:

| Query | Row Groups Pruned | I/O Reduction | Scan Speedup |
|---|---|---|---|
| Transfer (common, ~47% of rows) | 0% | None | None |
| Swap (medium, ~6% of rows) | 0% | None | None |
| Rare event (1 row in 10K blocks) | 83% | 78% | ~10x |
| Rare contract (0 rows in 10K blocks) | 100% | 97% | ~2000x |

Bloom filters are a no-op for common events (present in every row group) with negligible overhead. For rare events and specific contract addresses, they eliminate the majority of I/O by pruning row groups before any data is read.
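The pruning behavior behind these results can be modeled with a toy Bloom filter per row group. This is a sketch for intuition only: `ToyBloom` and `row_groups_to_scan` are hypothetical, the filter is a classic bit-array design built on the standard library hasher, and it is not the split-block filter that Parquet actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter: a bit array probed by several seeded hashes.
struct ToyBloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl ToyBloom {
    fn new(num_bits: usize, hashes: u64) -> Self {
        ToyBloom { bits: vec![false; num_bits], hashes }
    }

    /// Bit position for `value` under the hash function selected by `seed`.
    fn index(&self, value: &str, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        value.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    fn insert(&mut self, value: &str) {
        for seed in 0..self.hashes {
            let i = self.index(value, seed);
            self.bits[i] = true;
        }
    }

    /// False means "definitely absent"; true means "possibly present".
    fn might_contain(&self, value: &str) -> bool {
        (0..self.hashes).all(|seed| self.bits[self.index(value, seed)])
    }
}

/// Indices of row groups that must still be read for `column = value`.
fn row_groups_to_scan(filters: &[ToyBloom], value: &str) -> Vec<usize> {
    filters
        .iter()
        .enumerate()
        .filter(|(_, filter)| filter.might_contain(value))
        .map(|(i, _)| i)
        .collect()
}
```

A value present in every row group can never be pruned, because a Bloom filter has no false negatives. A value present in only one row group must keep that group, and each other group is skipped unless its filter reports a false positive.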

@fordN fordN self-assigned this Mar 24, 2026
@fordN fordN requested review from JohnSwan1503 and LNSD and removed request for LNSD March 24, 2026 22:47
Move bloom filter configuration from a global writer toggle to per-table
manifest entries, allowing each table to specify which columns get bloom
filters and their NDV values. Tables without config get no bloom filters
(opt-in). This replaces the global `bloom_filters` flag in ParquetConfig.
@fordN fordN force-pushed the ford/optimizations/per-table-bloom-filters branch from 689a22e to 25184ea Compare March 24, 2026 23:32