fix(platfor-290): identity-access-token cleanup #6057
PrestigePvP wants to merge 1 commit into main from
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7710ffc3f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```ts
      });
    }
    isRetrying = numberOfRetryOnFailure > 0;
  } while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));
```
Bound retry loop by clearing stale delete results
If one batch deletes rows and a later batch throws (e.g., statement timeout or lock contention), deletedTokenIds keeps its previous non-empty value, so deletedTokenIds.length > 0 remains true and the loop never exits even after MAX_RETRY_ON_FAILURE is reached. In that failure mode the weekly worker can run forever, repeatedly logging errors and never releasing capacity for future scheduled runs.
Useful? React with 👍 / 👎.
```ts
} catch {
  logger.info(`${QueueName.FrequentResourceCleanUp}: another instance holds the lock, skipping this run`);
  return;
```
Re-throw non-contention lock acquisition failures
This catch treats every acquireLock exception as "another instance holds the lock" and returns successfully. When Redis is unavailable or returns an unexpected error, the cleanup job is silently skipped and BullMQ records success, so no retry/alert path runs and token cleanup can be missed for extended periods. Only lock-contention errors should be swallowed; other errors should be propagated.
Useful? React with 👍 / 👎.
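To illustrate the suggested fix, here is a minimal sketch of a contention-aware catch. The `LockAcquisitionError` class, the `runWithLock` helper, and its parameters are hypothetical names introduced for illustration only; the real lock library's error type may differ, so the `instanceof` check would need to match whatever `acquireLock` actually throws on contention.

```typescript
// Hypothetical error type standing in for whatever the real lock library
// throws when another instance already holds the lock.
class LockAcquisitionError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "LockAcquisitionError";
  }
}

// Sketch: swallow only lock contention; let infrastructure errors propagate
// so BullMQ marks the job failed and its retry/alert path runs.
async function runWithLock(
  acquireLock: () => Promise<{ release: () => Promise<void> }>,
  job: () => Promise<void>,
  log: (msg: string) => void
): Promise<void> {
  let lock;
  try {
    lock = await acquireLock();
  } catch (err) {
    if (err instanceof LockAcquisitionError) {
      // Contention: skip quietly; recording success here is intentional.
      log("another instance holds the lock, skipping this run");
      return;
    }
    // Redis unavailable or any unexpected error: surface it to the worker.
    throw err;
  }
  try {
    await job();
  } finally {
    await lock.release();
  }
}
```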
```ts
export async function down(knex: Knex): Promise<void> {
  await knex.raw(`
    DROP INDEX IF EXISTS idx_identity_access_tokens_idle
  `);
}
```
🔴 The down() migration uses plain DROP INDEX IF EXISTS rather than DROP INDEX CONCURRENTLY IF EXISTS, which acquires an AccessExclusiveLock that blocks all reads and writes on the identity_access_tokens table for the full duration of the drop. On the 64M-row throttled-EBS table this PR explicitly calls out, rolling back this migration in production would cause a write outage on identity access token operations; change line 36 to DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle to avoid the lock.
Extended reasoning...
What the bug is and how it manifests
PostgreSQL's plain DROP INDEX acquires an AccessExclusiveLock on the parent table, which blocks every concurrent read and write for the entire duration of the drop. This is in direct contrast to DROP INDEX CONCURRENTLY, which only takes brief metadata locks and allows normal DML to proceed throughout. On a large table, this distinction is critical.
The specific code path that triggers it
The bug is in the down() function of 20260416180000_add-identity-access-token-idle-index.ts (lines 35–39):

```ts
export async function down(knex: Knex): Promise<void> {
  await knex.raw(`
    DROP INDEX IF EXISTS idx_identity_access_tokens_idle
  `);
}
```

This runs whenever the migration is rolled back (e.g., knex migrate:down or a failed deployment rollback).
Why existing code doesn't prevent it
The up() migration correctly uses CREATE INDEX CONCURRENTLY — but the same care was not applied to the rollback path. The migration already exports config = { transaction: false }, which is the exact prerequisite PostgreSQL requires before it will accept DROP INDEX CONCURRENTLY (concurrent index operations cannot run inside a transaction block). So the infrastructure for a safe drop is already in place; only the keyword is missing.
What the impact would be
The PR description explicitly states this table has ~64M rows on a throttled EBS volume, which is the reason CONCURRENTLY was chosen for the forward migration. While dropping an index is faster than building one, on throttled I/O it can still take minutes. During that window, every request that touches identity_access_tokens — token validation, authentication, refresh — would be blocked, causing a production outage on all identity access token operations for the full duration.
How to fix it
Replace line 36 with:

```ts
await knex.raw(`DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle`);
```

No other changes are needed; `config = { transaction: false }` is already exported.
Step-by-step proof
- A deployment is rolled back and Knex runs `down()` on the `20260416180000` migration.
- PostgreSQL executes `DROP INDEX IF EXISTS idx_identity_access_tokens_idle`.
- PostgreSQL acquires `AccessExclusiveLock` on `identity_access_tokens`.
- Every concurrent `SELECT`, `INSERT`, `UPDATE`, or `DELETE` on `identity_access_tokens` blocks immediately — no reads, no writes proceed.
- On a 64M-row table on throttled EBS I/O, the drop could take several minutes.
- All API endpoints that validate or refresh identity access tokens return errors or time out for the full duration.
- Had `DROP INDEX CONCURRENTLY` been used, only a brief `ShareUpdateExclusiveLock` would be taken at the start and end, and all DML would continue uninterrupted throughout.
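Putting the pieces together, the fully corrected migration could look like the sketch below. The `COALESCE` index expression and the object-name details come from this PR's context; the local structural `Knex` type is a self-contained stand-in for the real `import type { Knex } from "knex"`, so treat this as an illustration rather than the PR's exact file.

```typescript
// Minimal structural stand-in for the knex type, so the sketch is self-contained.
type Knex = { raw: (sql: string) => Promise<unknown> };

// Required so PostgreSQL accepts CONCURRENTLY: concurrent index operations
// cannot run inside a transaction block.
export const config = { transaction: false };

export async function up(knex: Knex): Promise<void> {
  // Builds the expression index without blocking DML on the 64M-row table.
  await knex.raw(`
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_identity_access_tokens_idle
    ON identity_access_tokens (COALESCE("accessTokenLastUsedAt", "createdAt"))
  `);
}

export async function down(knex: Knex): Promise<void> {
  // CONCURRENTLY avoids the table-wide AccessExclusiveLock of a plain DROP INDEX.
  await knex.raw(`DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle`);
}
```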
```ts
        setTimeout(resolve, 500);
      });
    }
    isRetrying = numberOfRetryOnFailure > 0;
  } while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));

  if (numberOfRetryOnFailure >= MAX_RETRY_ON_FAILURE) {
    logger.error(
      `IdentityAccessTokenIdlePrune: Pruning failed and stopped after ${MAX_RETRY_ON_FAILURE} consecutive retries.`
    );
  }

  logger.info(
    `${QueueName.WeeklyResourceCleanUp}: remove idle access tokens completed. Deleted ${totalDeletedCount} tokens.`
  );
};

return { ...identityAccessTokenOrm, findOne, removeExpiredTokens, removeIdleTokens };
```
🔴 The do-while loop in removeIdleTokens can spin infinitely if any batch succeeds before a permanent DB failure. Once a batch returns N deleted IDs, deletedTokenIds remains non-empty forever because the catch block never resets it — so even after MAX_RETRY_ON_FAILURE consecutive failures the left-hand condition deletedTokenIds.length > 0 keeps the loop alive, bypassing the retry guard entirely and hammering the broken database every 500 ms. Fix: add deletedTokenIds = [] at the top of the do block (before the try). The identical pattern in removeExpiredTokens has the same pre-existing bug.
Extended reasoning...
What the bug is and how it manifests
removeIdleTokens (identity-access-token-dal.ts) uses a do-while loop controlled by:

```ts
while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));
```

deletedTokenIds is assigned inside the try block and never reset in the catch block. If any batch succeeds (sets deletedTokenIds to a non-empty array) and then the database becomes permanently unavailable, subsequent batches throw exceptions. The catch block increments numberOfRetryOnFailure but leaves deletedTokenIds holding the results of the last successful batch. Once numberOfRetryOnFailure >= MAX_RETRY_ON_FAILURE (3), the right-hand side of the OR becomes false — but the left side (deletedTokenIds.length > 0) remains true indefinitely. The loop never exits.
The specific code path that triggers it
- Batch 1 succeeds: `deletedTokenIds = [5000 items]`, `numberOfRetryOnFailure` reset to 0.
- DB goes down.
- Batch 2 fails: `numberOfRetryOnFailure = 1`. `deletedTokenIds` unchanged. Condition: `5000 > 0 || (true && 1 < 3)` → `true`.
- Batch 3 fails: `numberOfRetryOnFailure = 2`. Condition: `5000 > 0 || (true && 2 < 3)` → `true`.
- Batch 4 fails: `numberOfRetryOnFailure = 3`. Condition: `5000 > 0 || (true && 3 < 3)` → `5000 > 0 || false` → `true`.
- Every subsequent batch fails: condition stays `true` forever.
Why existing code does not prevent it
The MAX_RETRY_ON_FAILURE guard is only effective when deletedTokenIds is empty at the time retries are exhausted (i.e., if the very first batch fails). It provides zero protection after any prior batch has succeeded with deletions, because there is no code path that clears deletedTokenIds on failure or at the start of each iteration.
What the impact would be
In the failure scenario the weekly cleanup job enters an infinite busy loop — one DB round-trip every 500 ms — against an already broken or overloaded database, worsening the outage and keeping the BullMQ worker occupied indefinitely. The Redis distributed lock (3-hour TTL) will also be held, preventing any other instance from picking up the job.
How to fix it
Reset deletedTokenIds to [] at the top of the do block, before the try:

```ts
do {
  deletedTokenIds = []; // add this line
  try {
    // ...
  } catch (error) {
    // ...
  }
} while (...);
```

This ensures that on any iteration where the DB throws, deletedTokenIds is empty at loop evaluation time. After MAX_RETRY_ON_FAILURE consecutive failures the right-hand side becomes false and the left-hand side (0 > 0) is also false, so the loop terminates as intended.
The same bug exists in removeExpiredTokens (a pre-existing issue), but removeIdleTokens introduces it fresh in this PR.
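The corrected loop can be sketched end to end as follows. This is a simplified illustration, not the PR's exact code: `deleteBatch` stands in for the real Knex delete, `pruneWithBoundedRetries` is a hypothetical name, and the constants mirror the PR's `MAX_RETRY_ON_FAILURE` of 3 and 500 ms backoff.

```typescript
const MAX_RETRY_ON_FAILURE = 3;

// Runs deleteBatch until a batch deletes nothing, or until the retry
// budget is exhausted by consecutive failures. Returns the total deleted.
async function pruneWithBoundedRetries(
  deleteBatch: () => Promise<string[]>
): Promise<number> {
  let totalDeletedCount = 0;
  let numberOfRetryOnFailure = 0;
  let isRetrying = false;
  let deletedTokenIds: string[] = [];

  do {
    // Reset every iteration so a failed batch cannot keep the previous
    // batch's non-empty result alive and bypass the retry guard.
    deletedTokenIds = [];
    try {
      deletedTokenIds = await deleteBatch();
      totalDeletedCount += deletedTokenIds.length;
      numberOfRetryOnFailure = 0; // a success resets the consecutive-failure count
    } catch {
      numberOfRetryOnFailure += 1;
      await new Promise((resolve) => { setTimeout(resolve, 500); });
    }
    isRetrying = numberOfRetryOnFailure > 0;
  } while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));

  return totalDeletedCount;
}
```

With the reset in place, the worst case after any successful batch is exactly MAX_RETRY_ON_FAILURE extra round-trips before the loop exits.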
Context
Distributed lock on hourly cleanup — wraps the FrequentResourceCleanUp worker with a Redis lock (retryCount: 0) so overlapping runs across instances skip instead of compounding DB load. EU prod's hourly job was taking 2–3 min/batch; concurrent runs were multiplying lock contention on a throttled EBS volume.
Weekly idle-token drain — adds a new WeeklyResourceCleanUp queue (fires Sunday 03:00 UTC) that deletes tokens idle for 30+ days via COALESCE(accessTokenLastUsedAt, createdAt). A 1% sample showed ~54% of EU's ~64M token rows have never been used — these are unreachable by existing TTL/revoked/uses-exhausted predicates.
Expression index for idle predicate — new migration creates idx_identity_access_tokens_idle on COALESCE("accessTokenLastUsedAt", "createdAt") with a 4-hour statement timeout (CREATE INDEX CONCURRENTLY on 64M rows on a throttled DB needs the headroom).
Separate lock keys per job — hourly and weekly workers use distinct Redis lock keys so a long-running hourly run doesn't starve the weekly job for an entire week.
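The weekly idle-token drain described above can be sketched as a single batch-delete step. The table and column names and the `COALESCE` predicate come from this PR's context; the 5000 batch size, the `RawRunner` shape, and the function name are illustrative assumptions, not the PR's actual DAL code.

```typescript
// Minimal structural stand-in for a raw-SQL runner (e.g. a Knex instance).
type RawRunner = { raw: (sql: string) => Promise<{ rows: { id: string }[] }> };

const IDLE_DAYS = 30;
const BATCH_SIZE = 5000; // illustrative batch size

// Deletes one batch of tokens whose last activity (last use, else creation
// time, via COALESCE) is more than 30 days old, returning the deleted ids so
// the caller's loop can tell whether another batch is needed.
async function deleteIdleTokenBatch(db: RawRunner): Promise<string[]> {
  const result = await db.raw(`
    DELETE FROM identity_access_tokens
    WHERE id IN (
      SELECT id FROM identity_access_tokens
      WHERE COALESCE("accessTokenLastUsedAt", "createdAt") < NOW() - INTERVAL '${IDLE_DAYS} days'
      LIMIT ${BATCH_SIZE}
    )
    RETURNING id
  `);
  return result.rows.map((r) => r.id);
}
```

Because the predicate matches the expression index `idx_identity_access_tokens_idle` on `COALESCE("accessTokenLastUsedAt", "createdAt")`, each batch's inner SELECT can use an index scan instead of a sequential scan over the ~64M-row table.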
Steps to verify the change
Type
Checklist
type(scope): short description (scope is optional, e.g., fix: prevent crash on sync or fix(api): handle null response).