fix: update existing instances first in node update #368

Merged
jason-lynch merged 1 commit into main from fix/PLAT-566/add-instances-last
Apr 28, 2026

Conversation

@jason-lynch
Member

Summary

Modifies the node update process to ensure that existing instances are updated before adding any new replicas. This ensures that we apply configuration updates to the primary and other existing replicas first.

This fixes a bug on systemd where a new replica could fail to bootstrap from the primary because the primary's pg_hba.conf was missing the required entries.
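
As a rough illustration of the ordering described above, here is a minimal sketch. It assumes instances are identified by string IDs and that the current state can be reduced to a set of known IDs. `planInstanceOrder` and the sample IDs are hypothetical and are not taken from the PR's code.

```go
package main

import "fmt"

// planInstanceOrder is a hypothetical helper, not the project's real API.
// It returns the desired instance IDs reordered so that IDs already present
// in the current state come first and newly added IDs come last. The
// relative order within each group is preserved.
func planInstanceOrder(current map[string]bool, desired []string) []string {
	existing := make([]string, 0, len(desired))
	added := make([]string, 0, len(desired))
	for _, id := range desired {
		if current[id] {
			existing = append(existing, id)
		} else {
			added = append(added, id)
		}
	}
	return append(existing, added...)
}

func main() {
	// Illustrative IDs only: one instance exists, one replica is being added.
	current := map[string]bool{"n1-host-1": true}
	desired := []string{"n1-host-2", "n1-host-1"}
	fmt.Println(planInstanceOrder(current, desired))
}
```

With this ordering, configuration changes reach the primary and existing replicas before a new replica attempts to bootstrap.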

Testing

This issue was most noticeable on systemd. To run the E2E test:

# start the dev-lima environment in one terminal
make dev-lima-run

# in another terminal
# switch to the dev-lima environment
use-dev-lima

# run the new E2E test
make test-e2e E2E_RUN=TestAddReplica

To reproduce the issue by hand:

# start the dev-lima environment in one terminal
make dev-lima-run

# in another terminal, ensure the cluster is initialized
cp-init

# create a database with one node and one instance
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "patroni_port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1"] }
    ]
  }
}
EOF

# add a read replica.
# Before this fix, this would hang indefinitely because Patroni was
# blocked by the primary instance's pg_hba.conf. Keep in mind that it
# still takes a few minutes to add a replica because pg_basebackup
# needs to wait for a checkpoint.
cp1-req update-database storefront <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "port": 0,
    "patroni_port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1", "host-2"] }
    ]
  }
}
EOF
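
The two request bodies above differ only in the host_ids list for n1. A small sketch of that change, assuming a `Node` type that models just the two fields shown in the JSON (the type and `addReplicaHost` are illustrative, not part of the project):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Node mirrors the "nodes" entries in the request bodies above. Only the
// fields shown there are modeled; the type itself is illustrative.
type Node struct {
	Name    string   `json:"name"`
	HostIDs []string `json:"host_ids"`
}

// addReplicaHost returns a copy of nodes in which hostID has been appended
// to the named node's host list. It reports false if the node is not found.
func addReplicaHost(nodes []Node, nodeName, hostID string) ([]Node, bool) {
	out := make([]Node, len(nodes))
	found := false
	for i, n := range nodes {
		// Copy the host list so the caller's slice is never modified.
		hosts := append([]string(nil), n.HostIDs...)
		if n.Name == nodeName {
			hosts = append(hosts, hostID)
			found = true
		}
		out[i] = Node{Name: n.Name, HostIDs: hosts}
	}
	return out, found
}

func main() {
	before := []Node{{Name: "n1", HostIDs: []string{"host-1"}}}
	after, _ := addReplicaHost(before, "n1", "host-2")
	body, _ := json.Marshal(after)
	fmt.Println(string(body))
}
```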

Notes for Reviewers

The E2E test that I added includes a second node. That wasn't necessary to reproduce this particular issue, but including it made a better general test.

PLAT-566

@coderabbitai

coderabbitai Bot commented Apr 27, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 393c2d7e-8670-4314-84db-45a36f94ce77

📥 Commits

Reviewing files that changed from the base of the PR and between cfc62b5 and 95f6441.

📒 Files selected for processing (9)
  • e2e/add_replica_test.go
  • server/internal/database/operations/golden_test/TestUpdateDatabase/adding_a_replica.json
  • server/internal/database/operations/golden_test/TestUpdateDatabase/adding_a_replica_with_primary_update.json
  • server/internal/database/operations/golden_test/TestUpdateDatabase/adding_multiple_replicas_concurrent.json
  • server/internal/database/operations/golden_test/TestUpdateDatabase/adding_multiple_replicas_rolling.json
  • server/internal/database/operations/update_database.go
  • server/internal/database/operations/update_database_test.go
  • server/internal/database/operations/update_nodes.go
  • server/internal/database/operations/update_nodes_test.go
📝 Walkthrough

E2E test added for replica addition functionality. Core functions updated to accept current state and distinguish existing versus new replicas. Operation sequencing changed to update existing replicas first. Golden test fixtures updated to reflect new operation expectations. Node update strategy signatures modified to support state-aware replica handling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| E2E Test: e2e/add_replica_test.go | Introduces TestAddReplica test that creates a database with two initial nodes, inserts data, then expands one node's HostIds to include a third host and verifies replication works correctly. |
| Core Logic Updates: server/internal/database/operations/update_database.go, server/internal/database/operations/update_nodes.go | Updated function signatures to accept current start state. UpdateNode, RollingUpdateNodes, and ConcurrentUpdateNodes now distinguish between existing replicas to update and new replicas to add. Operation ordering changed to prioritize existing replica updates before new additions. |
| Unit Tests: server/internal/database/operations/update_database_test.go, server/internal/database/operations/update_nodes_test.go | Refactored tests to pass explicit start state along with desired node specs. Added new test cases for primary-first updates and multiple replica scenarios. Added validation for missing primary instance. |
| Golden Test Fixtures: server/internal/database/operations/golden_test/TestUpdateDatabase/*.json | Updated expected operation sequences for replica addition scenarios. Replaced explicit has_diff diffs for node instance_ids with dependency_updated operations. Reordered database.switchover and monitor.instance creation batches. Added fixture for primary update concurrent with replica addition. |
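
The summary mentions validation for a missing primary instance. The following is only a sketch of what such a check could look like, assuming each instance in the current state carries a flag marking it as primary. The `Instance` type, `splitPrimary`, and the error value are hypothetical. How the real code identifies the primary is not shown in this conversation.

```go
package main

import (
	"errors"
	"fmt"
)

// Instance is an illustrative stand-in for an instance in the current state.
type Instance struct {
	ID      string
	Primary bool
}

// errNoPrimary is a hypothetical error value used only in this sketch.
var errNoPrimary = errors.New("no primary instance found in current state")

// splitPrimary returns the first primary and the remaining instances,
// or errNoPrimary when the current state contains no primary at all.
func splitPrimary(instances []Instance) (Instance, []Instance, error) {
	var primary *Instance
	replicas := make([]Instance, 0, len(instances))
	for i := range instances {
		if instances[i].Primary && primary == nil {
			primary = &instances[i]
			continue
		}
		replicas = append(replicas, instances[i])
	}
	if primary == nil {
		return Instance{}, nil, errNoPrimary
	}
	return *primary, replicas, nil
}

func main() {
	// A state with only a replica should be rejected.
	_, _, err := splitPrimary([]Instance{{ID: "n1-a"}})
	fmt.Println(err)
}
```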

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: updating existing instances before adding new replicas in node updates. |
| Description check | ✅ Passed | The description includes all required template sections: a clear summary of the fix, detailed changes, comprehensive testing instructions with both E2E test and manual reproduction steps, and notes for reviewers. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@codacy-production

codacy-production Bot commented Apr 27, 2026

Up to standards ✅

🟢 Issues 1 medium

Results:
1 new issue

Category Results
Complexity 1 medium

🟢 Metrics 0 complexity · 0 duplication

Metric Results
Complexity 0
Duplication 0

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
server/internal/database/operations/update_nodes_test.go (1)

31-37: Nit: empty NodeResource{} in start state is unusual.

Other test cases populate Name/InstanceIDs on the start-state NodeResource. Here it's left empty (just to satisfy state structure). For consistency with the surrounding cases, consider populating it like the others — purely cosmetic.

♻️ Optional consistency tweak
 			start: makeState(t,
 				[]resource.Resource{
 					instance1.Instance,
-					&database.NodeResource{},
+					&database.NodeResource{
+						Name:        "n1",
+						InstanceIDs: []string{instance1.InstanceID()},
+					},
 				},
 				instance1.InstanceDependencies,
 			),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/database/operations/update_nodes_test.go` around lines 31 -
37, The start state for this test uses an empty &database.NodeResource{} which
is inconsistent with other cases; update the test's start state (the makeState
call that currently includes &database.NodeResource{}) to populate the
NodeResource's Name and InstanceIDs fields (matching the values used in
surrounding cases) so the initial state mirrors other tests and removes the
empty placeholder.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e/add_replica_test.go`:
- Around line 62-85: The call to db.Update in add_replica_test.go ignores the
returned error from DatabaseFixture.Update; change the call to capture the error
and fail the test on error (e.g., err := db.Update(...); require.NoError(t, err)
or if err != nil { t.Fatalf(...) }) so the test stops with a clear message when
the replica-add update fails; locate the db.Update call in add_replica_test.go
and replace the discarded return with an explicit error check using the test
helper (require.NoError or t.Fatalf) to surface the update failure immediately.
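
A minimal sketch of the suggested error check, using only the standard library. The `updater` interface is a hypothetical stand-in for the fixture's Update method, whose real signature is not shown in this conversation. `fataler` is the subset of `*testing.T` the helper needs, so a real test could pass `t` directly.

```go
package main

import (
	"errors"
	"fmt"
)

// updater is a hypothetical stand-in for the fixture's Update method.
type updater interface {
	Update(hostIDs []string) error
}

// fataler is the subset of *testing.T that the helper needs.
type fataler interface {
	Fatalf(format string, args ...any)
}

// mustUpdate fails the test immediately when the update returns an error,
// instead of discarding the returned value.
func mustUpdate(t fataler, db updater, hostIDs []string) {
	if err := db.Update(hostIDs); err != nil {
		t.Fatalf("add replica: update failed: %v", err)
	}
}

// failingDB and recorder exist only to exercise mustUpdate outside a test.
type failingDB struct{}

func (failingDB) Update([]string) error { return errors.New("boom") }

type recorder struct{ msg string }

func (r *recorder) Fatalf(format string, args ...any) {
	r.msg = fmt.Sprintf(format, args...)
}

func main() {
	r := &recorder{}
	mustUpdate(r, failingDB{}, []string{"host-1", "host-2", "host-3"})
	fmt.Println(r.msg)
}
```

In a test that already depends on testify, `require.NoError(t, err)` would serve the same purpose, as the comment above notes.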

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3a4319b2-e5d2-4e44-bc63-2eaa3bde69a5

📥 Commits

Reviewing files that changed from the base of the PR and between 46cacdc and cfc62b5.

Comment thread e2e/add_replica_test.go Outdated
Modifies the node update process to ensure that existing instances are
updated before adding any new replicas. This ensures that we apply
configuration updates to the primary and other existing replicas first.

This fixes a bug for systemd where the new replica could be unable to
bootstrap from the primary due to missing pg_hba.conf entries.

PLAT-566
@jason-lynch jason-lynch force-pushed the fix/PLAT-566/add-instances-last branch from cfc62b5 to 95f6441 Compare April 27, 2026 20:06
Contributor

@moizpgedge moizpgedge left a comment

Tested the exact ticket scenario on dev-lima: created a 3-node database, deleted n3, then reused the same host as a replica for n2.

Create 3-node database:

cp1-req create-database '{
  "spec": {
    "database_name": "plat566",
    "database_users": [{"username": "admin", "password": "password", "db_owner": true, "attributes": ["LOGIN", "SUPERUSER"]}],
    "port": 0, "patroni_port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1"] },
      { "name": "n2", "host_ids": ["host-2"] },
      { "name": "n3", "host_ids": ["host-3"] }
    ]
  }
}'

Delete n3:

cp1-req update-database 230fbbce-6644-465c-8e49-d2cb63fbd116 '{
  "spec": {
    "database_name": "plat566",
    "database_users": [{"username": "admin", "db_owner": true, "attributes": ["LOGIN", "SUPERUSER"]}],
    "port": 0, "patroni_port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1"] },
      { "name": "n2", "host_ids": ["host-2"] }
    ]
  }
}'

Confirmed n3 fully cleaned up via task log 019dd469-9309-7555-9b86-5d48e59201a5.

Reuse host-3 as n2 replica:

cp1-req update-database 230fbbce-6644-465c-8e49-d2cb63fbd116 '{
  "spec": {
    "database_name": "plat566",
    "database_users": [{"username": "admin", "db_owner": true, "attributes": ["LOGIN", "SUPERUSER"]}],
    "port": 0, "patroni_port": 0,
    "nodes": [
      { "name": "n1", "host_ids": ["host-1"] },
      { "name": "n2", "host_ids": ["host-2", "host-3"] }
    ]
  }
}'

Watch task:

cp-follow-task -s database -e 230fbbce-6644-465c-8e49-d2cb63fbd116 -t 019dd46e-7a97-7ef2-a4be-e2daf45b65aa

Verify replica:

curl -s http://localhost:3010/v1/databases/230fbbce-6644-465c-8e49-d2cb63fbd116 | jq '.instances[] | select(.node_name == "n2" and .host_id == "host-3")'

Replica 230fbbce-6644-465c-8e49-d2cb63fbd116-n2-ant97dj4 (host-3 reused) came up with:

"patroni_state": "running",
"role": "replica",
"state": "available"
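
The same check could be scripted in Go. This sketch assumes patroni_state, role, and state sit at the top level of each instance object, which the excerpt above does not confirm. The types and helper names are illustrative, and the sample payload is made up to match the quoted fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// instance models only the fields used in the check above.
type instance struct {
	NodeName     string `json:"node_name"`
	HostID       string `json:"host_id"`
	PatroniState string `json:"patroni_state"`
	Role         string `json:"role"`
	State        string `json:"state"`
}

// database models only the instances list of the response.
type database struct {
	Instances []instance `json:"instances"`
}

// findInstance returns the instance for the given node and host,
// mirroring the jq select filter.
func findInstance(body []byte, node, host string) (instance, bool, error) {
	var db database
	if err := json.Unmarshal(body, &db); err != nil {
		return instance{}, false, err
	}
	for _, in := range db.Instances {
		if in.NodeName == node && in.HostID == host {
			return in, true, nil
		}
	}
	return instance{}, false, nil
}

// healthyReplica reports whether the instance matches the three values
// quoted in the review comment.
func healthyReplica(in instance) bool {
	return in.PatroniState == "running" && in.Role == "replica" && in.State == "available"
}

func main() {
	// Sample payload with illustrative values, not a captured response.
	body := []byte(`{"instances":[{"node_name":"n2","host_id":"host-3","patroni_state":"running","role":"replica","state":"available"}]}`)
	in, ok, err := findInstance(body, "n2", "host-3")
	fmt.Println(ok, err, healthyReplica(in))
}
```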

Unit tests:

go test ./server/internal/database/operations/...

All pass, including the new adding_a_replica and adding_a_replica_with_primary_update cases.

APPROVED

@jason-lynch jason-lynch merged commit 88d83e3 into main Apr 28, 2026
3 checks passed
@jason-lynch jason-lynch deleted the fix/PLAT-566/add-instances-last branch April 28, 2026 14:39