fix: Improve Zookeeper initialization wait logic to support multi url configuration store #671
ganeshkalyank wants to merge 3 commits into apache:master
Pull request overview
Updates the Pulsar Helm chart’s cluster-initialization Job to wait for a multi-URL ZooKeeper configuration store using a ZooKeeper-aware command instead of DNS lookup, addressing init failures when configurationStore contains comma-separated hosts.
Changes:
- Replace nslookup-based waiting for configurationStore with bin/pulsar zookeeper-shell ... ls / polling.
- Set a smaller JVM heap (PULSAR_MEM) for the wait probe to reduce init-container memory usage.
export PULSAR_MEM="-Xmx128M";
until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
  echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
done;
wait-zk-cs-ready now uses bin/pulsar zookeeper-shell to probe ZooKeeper, but it doesn’t apply the chart’s ZooKeeper TLS client settings. When .Values.tls.enabled and .Values.tls.zookeeper.enabled are true, this probe will fail even if the configuration store is reachable over TLS, blocking initialization. Consider including pulsar.toolset.zookeeper.tls.settings before invoking bin/pulsar (and ensure the initContainer mounts the toolset cert/CA volumes so those settings work).
- export PULSAR_MEM="-Xmx128M";
- until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
-   echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
- done;
+ export PULSAR_MEM="-Xmx128M";
+ {{- include "pulsar.toolset.zookeeper.tls.settings" . | nindent 12 }}
+ until timeout 15 bin/pulsar zookeeper-shell -server {{ .Values.pulsar_metadata.configurationStore }} ls /; do
+   echo "configurationStore {{ .Values.pulsar_metadata.configurationStore }} is unreachable... check in 3 seconds ..." && sleep 3;
+ done;
+ volumeMounts:
+ {{- include "pulsar.toolset.certs.volumeMounts" . | nindent 8 }}
Please check Copilot's review comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@lhotari Addressed both comments. Also, I assumed that the configuration store deployment uses the same TLS settings as ZooKeeper.
Fixes #670
Motivation
When using a multi-URL configuration store (e.g., zk1:2181,zk2:2181), the wait-zk-cs-ready init container fails because nslookup cannot resolve comma-separated hostnames. This causes initialization to time out even when ZooKeeper is already accessible.
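To illustrate why a single DNS lookup cannot work here, the following is a minimal sketch that splits a connection string into its individual host names. The helper name `zk_hosts` is hypothetical and not part of the chart. It assumes the string is a comma-separated list of `host:port` entries, optionally followed by a chroot path.

```shell
# Hypothetical helper (not part of the chart): list the host names contained
# in a ZooKeeper connection string such as "zk1:2181,zk2:2181".
zk_hosts() {
  # Drop an optional chroot suffix ("/path"), split on commas,
  # then strip the trailing ":port" from each entry.
  printf '%s\n' "${1%%/*}" | tr ',' '\n' | sed 's/:[0-9]*$//'
}

# The whole string is a list of hosts, not one resolvable name.
zk_hosts "zk1:2181,zk2:2181"
```

Passing the unsplit string to nslookup treats `zk1:2181,zk2:2181` as one name, which is why the lookup never succeeds.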
Modifications
Replaced nslookup with bin/pulsar zookeeper-shell -server ls /, which supports the full ZooKeeper connection string including multi-URL formats.
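The polling pattern can be sketched as a small reusable function. This is only an illustration: the name `wait_until` and the attempt limit are assumptions added here, whereas the loop in this PR retries indefinitely with a fixed 3-second sleep.

```shell
# Hypothetical helper: retry a probe command until it succeeds or the
# attempt limit is reached. Usage: wait_until MAX_ATTEMPTS DELAY_SECONDS CMD...
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1  # give up: the probe never succeeded
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Example invocation inside a Pulsar image (connection string is illustrative):
# wait_until 100 3 timeout 15 bin/pulsar zookeeper-shell -server "zk1:2181,zk2:2181" ls /
```

Bounding the attempts lets the init container fail with a clear exit status instead of looping forever. Whether that is preferable to waiting indefinitely depends on how the Job's backoff is configured.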
Verifying this change