[FLINK-38930][checkpoint] Filtering record before processing without spilling strategy #27783
1996fanrui wants to merge 6 commits into apache:master
Conversation
Force-pushed 2b06750 to 997e3a3
pnowojski left a comment:
Thanks! I've left a couple of comments from the first review pass.
 * Deserializes records from {@code sourceBuffer}, applies the virtual channel's record
 * filter, and re-serializes the surviving records into new buffers.
 */
List<Buffer> filterAndRewrite(
Could you re-order the methods in this class? Public first; private either below all publics, or below the first usage?
/**
 * Filters a recovered buffer from the specified virtual channel, returning new buffers
 * containing only the records that belong to the current subtask.
 *
 * @return filtered buffers, possibly empty if all records were filtered out.
 */
public List<Buffer> filterAndRewrite(
        int gateIndex,
        int oldSubtaskIndex,
        int oldChannelIndex,
        Buffer sourceBuffer,
        BufferSupplier bufferSupplier)
Why does it return a List from one single sourceBuffer? Could you explain this in the javadoc? And how many Buffers can that be? If a lot, shouldn't this be an Iterator?
The code comment is updated.
The List return can contain more than one buffer when a spanning record completes in this buffer: the deserializer caches partial data from previous buffers, so the output may include data that is not present in the current source buffer.
This is uncommon but possible with any spanning record. If the network pool is insufficient in this case, it is covered by the spilling logic.
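To illustrate why a single source buffer can yield multiple output buffers, here is a minimal sketch of the spanning-record situation. All names and types below are simplified stand-ins, not Flink's actual deserializer or Buffer classes: a record that started in earlier buffers completes here, so the rewritten output also carries the prefix bytes cached from those earlier buffers.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanningRecordSketch {
    static final int BUFFER_SIZE = 4; // hypothetical output buffer capacity

    // Returns output "buffers" (fixed-size chunks) for the bytes that survive
    // filtering, including any prefix cached from previous input buffers.
    static List<byte[]> filterAndRewrite(byte[] cachedPrefix, byte[] sourceBuffer) {
        byte[] record = new byte[cachedPrefix.length + sourceBuffer.length];
        System.arraycopy(cachedPrefix, 0, record, 0, cachedPrefix.length);
        System.arraycopy(sourceBuffer, 0, record, cachedPrefix.length, sourceBuffer.length);

        List<byte[]> out = new ArrayList<>();
        for (int pos = 0; pos < record.length; pos += BUFFER_SIZE) {
            int len = Math.min(BUFFER_SIZE, record.length - pos);
            byte[] chunk = new byte[len];
            System.arraycopy(record, pos, chunk, 0, len);
            out.add(chunk);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] prefix = new byte[6]; // spanning-record data cached from earlier buffers
        byte[] source = new byte[3]; // the single source buffer passed in
        List<byte[]> buffers = filterAndRewrite(prefix, source);
        // 9 bytes total at 4 bytes per buffer: three output buffers from one input buffer
        System.out.println(buffers.size());
    }
}
```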
// Extra retain: filterAndRewrite consumes one ref, caller's finally releases another.
buffer.retainBuffer();
Nit: I think it would be slightly cleaner to call buffer.retainBuffer() from the outside; the contract would then be that this method always takes over ownership of this buffer.
Addressed together with the ownership concern in https://github.com/apache/flink/pull/27783/changes#r2996388666. Removed retainBuffer() and the catch block entirely. The buffer now has a single clean owner per path: in the filtering path, the deserializer recycles the buffer when it is consumed, and the finally block uses a defensive isRecycled() check only for the edge case where an exception occurs before the deserializer takes the buffer (e.g., a VirtualChannel lookup failure). Added a buffer lifecycle diagram to the javadoc covering all paths. No extra retain/recycle is needed.
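The single-owner pattern described above can be sketched as follows. This uses a hypothetical minimal Buffer stub rather than Flink's actual Buffer interface: the consumer recycles the buffer once it takes ownership, and the finally block only steps in when ownership was never handed over.

```java
public class BufferOwnershipSketch {
    // Hypothetical stand-in for a reference-counted network buffer.
    static class Buffer {
        private boolean recycled;
        boolean isRecycled() { return recycled; }
        void recycleBuffer() { recycled = true; }
    }

    // Stands in for the deserializer: consumes the buffer on the happy path.
    static void consume(Buffer buffer, boolean failBeforeHandover) {
        if (failBeforeHandover) {
            throw new IllegalStateException("virtual channel lookup failed");
        }
        buffer.recycleBuffer(); // consumer takes ownership and recycles when done
    }

    static boolean process(Buffer buffer, boolean failBeforeHandover) {
        try {
            consume(buffer, failBeforeHandover);
            return true;
        } catch (IllegalStateException e) {
            return false;
        } finally {
            // Defensive check: only recycle if no one else took ownership.
            if (!buffer.isRecycled()) {
                buffer.recycleBuffer();
            }
        }
    }

    public static void main(String[] args) {
        Buffer a = new Buffer();
        process(a, false);
        System.out.println(a.isRecycled()); // recycled exactly once, by the consumer

        Buffer b = new Buffer();
        process(b, true);
        System.out.println(b.isRecycled()); // recycled by the finally block
    }
}
```

Either way the buffer ends up recycled exactly once, which is the "single clean owner per path" contract.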
} catch (Throwable t) {
    // filterAndRewrite didn't consume the buffer, release the extra ref.
    buffer.recycleBuffer();
    throw t;
}
Hmm, that's a bit strange? It sounds like it's not clear who owns this buffer. There should be a clean owner that is always responsible for cleaning up, no matter what.
...untime/src/main/java/org/apache/flink/streaming/runtime/io/recovery/RecordFilterContext.java (resolved)
.../main/java/org/apache/flink/runtime/checkpoint/channel/SequentialChannelStateReaderImpl.java (outdated, resolved)
List<StreamElement> filteredElements = new ArrayList<>();

while (true) {
    DeserializationResult result = vc.getNextRecord(deserializationDelegate);
    if (result.isFullRecord()) {
        filteredElements.add(deserializationDelegate.getInstance());
    }
    if (result.isBufferConsumed()) {
        break;
    }
}

return serializeToBuffers(filteredElements, bufferSupplier);
Ditto about the List in List<StreamElement> filteredElements. It would be safer to be iterative. The current implementation risks OOMs if deserialized records use more memory than the serialized records. This is not very common, but it could happen.
    resultBuffers.add(currentBuffer.retainBuffer());
}
currentBuffer.recycleBuffer();
currentBuffer = bufferSupplier.requestBufferBlocking();
Is it safe to block here? 🤔 Can this lead to deadlocks? I think we were discussing this, but AFAIR this code works differently from what we were discussing offline (either using an unpooled buffer, creating two different pools, or filtering records in-place without requesting a new buffer)?
Good catch!
This is addressed in a follow-up commit in https://github.com/apache/flink/pull/27639/commits (FLINK-38544, f031ddf) by falling back to a heap buffer when the buffer pool is insufficient.
I think it would be better to squash that commit here, to avoid merging broken code, given that we already have a working fix for it.
Force-pushed 997e3a3 to db2565f
…ringRecoveryEnabled
…spilling strategy

Core filtering mechanism for recovered channel state buffers:
- ChannelStateFilteringHandler with per-gate GateFilterHandler
- RecordFilterContext with VirtualChannelRecordFilterFactory
- Partial data check in SequentialChannelStateReaderImpl
- Fix RecordFilterContext for Union downscale scenario
Force-pushed db2565f to b12a097
1996fanrui left a comment:
Thanks @pnowojski for the review.
All of the comments make sense to me, and I have addressed all of them.
- Add javadoc for filterAndRewrite explaining spanning record multi-buffer output
- Move retainBuffer call to caller for clearer buffer ownership contract
- Implement Closeable for ChannelStateFilteringHandler
- Use try-with-resources in SequentialChannelStateReaderImpl
…during recovery

When unaligned checkpointing during recovery is enabled, use a heap buffer as a fallback instead of blocking on the buffer pool, to avoid hanging if the buffer pool is not yet available. When the feature is disabled, the original blocking behavior is preserved.
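The fallback in the commit message above can be sketched as follows: try the pooled buffer first, and if the pool cannot serve one, allocate an unpooled heap buffer instead of blocking. The types here are simplified stand-ins, not Flink's actual BufferSupplier or network buffer API.

```java
import java.nio.ByteBuffer;
import java.util.Optional;

public class HeapFallbackSketch {
    // Hypothetical non-blocking view of a pooled buffer supplier.
    interface PooledSupplier {
        Optional<ByteBuffer> tryRequestBuffer(); // empty if the pool is exhausted
    }

    static ByteBuffer requestWithFallback(PooledSupplier pool, int size, boolean fallbackEnabled) {
        Optional<ByteBuffer> pooled = pool.tryRequestBuffer();
        if (pooled.isPresent()) {
            return pooled.get();
        }
        if (fallbackEnabled) {
            // Feature enabled: never block, allocate an unpooled heap buffer instead.
            return ByteBuffer.allocate(size);
        }
        // Feature disabled: the real code would keep the original blocking request here.
        throw new IllegalStateException("would block on the buffer pool");
    }

    public static void main(String[] args) {
        PooledSupplier exhausted = Optional::empty; // simulate an unavailable pool
        ByteBuffer b = requestWithFallback(exhausted, 1024, true);
        System.out.println(b.capacity());
    }
}
```

The trade-off is that heap buffers bypass the network memory accounting, which is why this path is limited to the recovery filtering case where the pool may not yet be available.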
Force-pushed b12a097 to 26602df
This PR depends on #27782
What is the purpose of the change
[FLINK-38930][checkpoint] Filtering record before processing without spilling strategy
Brief change log
Core filtering mechanism for recovered channel state buffers:
Verifying this change
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no

Documentation