feat(datasets): per-table & column bloom filter configuration via manifest #2021
Open
Move bloom filter configuration from a global writer toggle to per-table manifest entries, allowing each table to specify which columns get bloom filters and their NDV values. Tables without config get no bloom filters (opt-in). This replaces the global `bloom_filters` flag in ParquetConfig.
This PR introduces bloom filter configuration options specific to a dataset table and column.
Background
Parquet statistics stored in the file footer enable efficient data access. One particularly useful feature is the per-column min/max statistics, which let the reader prune files and row groups when a query predicate falls outside the min/max range. However, min/max statistics are not useful for unordered hash columns such as `topic0` and `address`. Bloom filters help in these cases, but their usefulness depends heavily on the data and the query patterns. The legacy configuration, a global toggle with a single hardcoded NDV, is not flexible enough to configure bloom filters properly across a variety of datasets.
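To make the min/max limitation concrete, here is a minimal sketch in plain Rust of an equality predicate checked against per-row-group statistics. It is not the Parquet reader's actual code: the `RowGroupStats` type, the `can_prune` and `min_max` helpers, and all sample values are hypothetical and exist only for illustration.

```rust
/// Hypothetical per-row-group statistics for one column (illustration only).
struct RowGroupStats {
    min: u64,
    max: u64,
}

/// Min/max pruning for an equality predicate `col = needle`:
/// the row group can be skipped only if `needle` lies outside [min, max].
fn can_prune(stats: &RowGroupStats, needle: u64) -> bool {
    needle < stats.min || needle > stats.max
}

/// Compute the statistics a writer would record for a non-empty row group.
fn min_max(values: &[u64]) -> RowGroupStats {
    RowGroupStats {
        min: *values.iter().min().expect("non-empty row group"),
        max: *values.iter().max().expect("non-empty row group"),
    }
}

fn main() {
    // Ordered column (e.g. block numbers): each row group covers a narrow range.
    let block_numbers = min_max(&[1_000, 1_001, 1_002, 1_003]);
    // Hash-like column: values are scattered, so the recorded range is very wide.
    let hashes = min_max(&[0x0000_00ff_u64, 0x7a3c_91d2, 0xfffe_0001, 0x1234_5678]);

    // 5_000 is outside [1_000, 1_003], so the row group can be skipped.
    println!("ordered column pruned: {}", can_prune(&block_numbers, 5_000));
    // 0x4242_4242 is absent from the row group but falls inside the wide range,
    // so min/max statistics cannot rule it out.
    println!("hash column pruned: {}", can_prune(&hashes, 0x4242_4242));
}
```

For hash-like columns nearly every lookup lands inside the recorded range, which is why a membership structure such as a bloom filter is needed to skip row groups.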
Changes
- Add `BloomFilterColumnConfig` struct to `datasets-common::manifest` with per-column `column` name and `ndv` (number of distinct values) settings
- Add `bloom_filter_columns` field to the raw dataset manifest `Table` definition, allowing per-table bloom filter configuration in dataset manifests
- Add `bloom_filter_columns()` method to the `Table` trait (with empty default), so the writer can access bloom filter config for any table
- Update `worker-core::build_parquet_writer_properties()` to accept per-column bloom filter configs and call `set_column_bloom_filter_enabled` & `set_column_bloom_filter_ndv` per column
- Remove `bloom_filters` toggle from server config and `ParquetConfig`
- `RawDataset::new()` `bloom_filter_columns` field

Usage
Dataset creators can now add the `bloom_filter_columns` property to their dataset manifest to configure bloom filters per table:

```json
{
  "tables": {
    "logs": {
      "bloom_filter_columns": [
        { "column": "topic0", "ndv": 5000 },
        { "column": "topic1", "ndv": 100000 },
        { "column": "address", "ndv": 50000 }
      ]
    }
  }
}
```

Test Results
Local testing on 10K blocks compared queries with and without bloom filters on `topic0` and `address`.

For common events (present in every row group), bloom filters are a no-op with negligible overhead. For rare events and specific contract addresses, they eliminate the majority of I/O by pruning row groups before any data is read.
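The common-versus-rare behavior described above can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the PR's code and not Parquet's split-block bloom filter format: `ToyBloom`, `row_groups_to_read`, `build_filters`, the event names, and the filter size are all hypothetical. It builds one filter per row group and counts how many row groups a reader would still have to scan for an equality predicate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter (not Parquet's split-block format): a bit vector with `k` probes.
struct ToyBloom {
    bits: Vec<bool>,
    k: u64,
}

impl ToyBloom {
    fn new(num_bits: usize, k: u64) -> Self {
        ToyBloom { bits: vec![false; num_bits], k }
    }

    /// Bit position for `value` under the `seed`-th hash function.
    fn index(&self, value: &str, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        value.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    fn insert(&mut self, value: &str) {
        for seed in 0..self.k {
            let i = self.index(value, seed);
            self.bits[i] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn might_contain(&self, value: &str) -> bool {
        (0..self.k).all(|seed| self.bits[self.index(value, seed)])
    }
}

/// Count the row groups a reader would still have to scan for `col = needle`.
fn row_groups_to_read(filters: &[ToyBloom], needle: &str) -> usize {
    filters.iter().filter(|f| f.might_contain(needle)).count()
}

/// Build one filter per hypothetical row group.
fn build_filters() -> Vec<ToyBloom> {
    // "Transfer" appears in every row group; "RareEvent" appears in only one.
    let row_groups: [&[&str]; 4] = [
        &["Transfer", "Approval"],
        &["Transfer", "Swap"],
        &["Transfer", "RareEvent"],
        &["Transfer", "Mint"],
    ];
    row_groups
        .iter()
        .map(|values| {
            let mut filter = ToyBloom::new(1024, 3);
            for v in values.iter() {
                filter.insert(v);
            }
            filter
        })
        .collect()
}

fn main() {
    let filters = build_filters();
    println!("common value: read {} of 4 row groups", row_groups_to_read(&filters, "Transfer"));
    println!("rare value: read {} of 4 row groups", row_groups_to_read(&filters, "RareEvent"));
}
```

Because a bloom filter never produces false negatives, the common value keeps all four row groups in play, so the filter saves nothing there. The rare value is always reported for the row group that contains it, and for the others only when a false positive occurs, which is where the I/O savings come from.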