This repository was archived by the owner on Feb 16, 2026. It is now read-only.

feat: add connection draining during shutdown #4442

Closed
kaiburjack wants to merge 1 commit into varnishcache:master from HBTGmbH:drain_connections_on_shutdown

kaiburjack (Contributor) commented Jan 29, 2026

This PR is an implementation addressing #4441.

This implements a new child CLI command, "drain", which puts the child process into "draining mode". In this mode the child keeps active and idle connections alive but always responds with "Connection: close" to HTTP/1.1 requests, so that clients close their connections after successfully receiving the full response.
The drain duration is limited by a new parameter, "drain_timeout". After this timeout expires, Varnish proceeds with its normal shutdown.

Ideally, this drain timeout should be set to the clients' keep-alive/idle timeout, so that either a client receives "Connection: close" on its next request or the client closes the connection itself after reaching its own idle timeout.

Varnish periodically checks whether the number of active sessions/client connections has reached zero. Once it has, drain mode exits and Varnish proceeds with its normal shutdown.
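The drain behavior described above can be sketched roughly as follows. This is an illustrative model in Python, not the code from the patch: the names `drain`, `active_sessions`, `poll_interval` and `response_headers` are assumptions made for this example, and the 1-second default poll interval is only borrowed from the polling note in this description.

```python
import time
from typing import Callable, Dict


def drain(active_sessions: Callable[[], int],
          drain_timeout: float,
          poll_interval: float = 1.0,
          clock: Callable[[], float] = time.monotonic,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Wait until no sessions remain or drain_timeout elapses.

    Returns True if every session closed in time, False on timeout.
    In both cases the caller continues with the normal shutdown.
    """
    deadline = clock() + drain_timeout
    while active_sessions() > 0:
        remaining = deadline - clock()
        if remaining <= 0:
            return False  # timeout expired, stop waiting
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
    return True


def response_headers(draining: bool) -> Dict[str, str]:
    """Hypothetical helper: extra headers for an HTTP/1.1 response."""
    # While draining, ask the client to close the connection after
    # it has received this response.
    return {"Connection": "close"} if draining else {}
```

The clock and sleep functions are injectable only so the loop can be exercised without real waiting.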

I have also implemented corresponding VTC tests, all of which pass.

One noteworthy change is in Pool_Work_Thread. There, the next poll interval is reduced to 1 second, because otherwise Varnish workers would wait 60 seconds before noticing that no VCL references are held anymore.

The manager also needs a check: when the Varnish child exits after entering drain mode during shutdown, the manager must not restart it. Instead, we let the child exit cleanly and then shut down the manager process.

I haven't gotten around to handling HTTP/2 connections yet. Ideally, that would use the "double-GOAWAY" graceful pattern: the server first sends one GOAWAY with streamId = 2^31-1 to indicate to the client that all streams have been processed fine and that it should leave the connection. Then, after a configurable timeout, the server sends a second GOAWAY with the actual last seen streamId. This is exactly the scheme Envoy proxy follows when draining listeners, and it is the safest way to avoid "aborted stream" errors in clients.
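As a rough illustration of the double-GOAWAY pattern, the sketch below encodes the two frames using the HTTP/2 frame layout (9-byte frame header, frame type 0x7, sent on stream 0, payload of last stream ID plus error code). It is not part of this patch; `goaway_frame`, `graceful_goaway`, `send` and `last_seen_stream_id` are names assumed for this example only.

```python
import struct
import time
from typing import Callable

GOAWAY_FRAME_TYPE = 0x7      # HTTP/2 GOAWAY frame type
NO_ERROR = 0x0               # HTTP/2 NO_ERROR error code
MAX_STREAM_ID = 2**31 - 1    # largest 31-bit stream identifier


def goaway_frame(last_stream_id: int, error_code: int = NO_ERROR) -> bytes:
    """Encode a GOAWAY frame without debug data."""
    if not 0 <= last_stream_id <= MAX_STREAM_ID:
        raise ValueError("stream id must fit in 31 bits")
    # Payload: reserved bit + 31-bit last stream id, 32-bit error code.
    payload = struct.pack(">II", last_stream_id, error_code)
    # Header: 24-bit length, type, flags (none), stream id 0.
    header = len(payload).to_bytes(3, "big") + struct.pack(
        ">BBI", GOAWAY_FRAME_TYPE, 0, 0)
    return header + payload


def graceful_goaway(send: Callable[[bytes], None],
                    last_seen_stream_id: Callable[[], int],
                    wait: float,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Send the two GOAWAY frames of the graceful pattern."""
    # First GOAWAY: announce shutdown without rejecting any stream.
    send(goaway_frame(MAX_STREAM_ID))
    # Give the client time to stop opening streams on this connection.
    sleep(wait)
    # Second GOAWAY: report the actual last seen stream id.
    send(goaway_frame(last_seen_stream_id()))
```

The stream ID for the second frame is read only after the wait, so streams opened in between are still counted.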

walid-git (Member) commented

Just for context, a similar command was implemented in #3959 (see traffic.refuse)

walid-git (Member) commented Feb 2, 2026

Thanks for your PR. This was discussed during today's bugwash:

Consensus is to not add a new CLI command, but instead add a flag to the existing stop command, like:

stop [-drain <timeout>]

When no -drain <timeout> is supplied, the command should use the drain_timeout parameter as a fallback, and should keep the current behavior (stop immediately) with the parameter's default value.

The current patch is not passing our test suite. Feel free to look at #3959 for inspiration.

EDIT: Forgot to mention, the CLI command should not hang; it should return immediately even when a timeout is supplied.
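The fallback rule for `stop [-drain <timeout>]` could be modelled as below. This is a sketch of the argument handling only, not Varnish's actual CLI parser: `effective_drain_timeout` is a hypothetical name, and treating a drain_timeout of 0 as "stop immediately" is an assumption about what the parameter's default would be.

```python
from typing import Sequence


def effective_drain_timeout(args: Sequence[str],
                            drain_timeout_param: float) -> float:
    """Resolve the drain timeout for `stop [-drain <timeout>]`.

    `args` holds the words that follow `stop` on the command line.
    Assumption: a result of 0 means "stop immediately".
    """
    if not args:
        # No flag given: fall back to the drain_timeout parameter.
        return drain_timeout_param
    if len(args) == 2 and args[0] == "-drain":
        timeout = float(args[1])  # raises ValueError if not a number
        if timeout < 0:
            raise ValueError("timeout must not be negative")
        return timeout
    raise ValueError("usage: stop [-drain <timeout>]")
```

For example, `effective_drain_timeout([], 0.0)` yields 0.0 (current behavior), while `effective_drain_timeout(["-drain", "5"], 30.0)` yields 5.0. Per the EDIT note, the command would only record this value and return, leaving the waiting to the child.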

kaiburjack (Contributor, Author) commented

@walid-git thanks for having a look at it. Reigniting the discussion about how this could be done is already a good outcome for this MR. ;)
Will close.
