fix(platfor-290): identity-access-token cleanup #6057

Open
PrestigePvP wants to merge 1 commit into main from tre/platfor-290-token-access

Conversation

@PrestigePvP (Contributor) commented Apr 16, 2026

Context

  • Distributed lock on hourly cleanup — wraps the FrequentResourceCleanUp worker with a Redis lock (retryCount: 0) so overlapping runs across instances skip instead of compounding DB load. EU prod's hourly job was taking 2–3 min/batch; concurrent runs were multiplying lock contention on a throttled EBS volume.

  • Weekly idle-token drain — adds a new WeeklyResourceCleanUp queue (fires Sunday 03:00 UTC) that deletes tokens idle for 30+ days via COALESCE(accessTokenLastUsedAt, createdAt). A 1% sample showed ~54% of EU's ~64M token rows have never been used — these are unreachable by existing TTL/revoked/uses-exhausted predicates.

  • Expression index for idle predicate — new migration creates idx_identity_access_tokens_idle on COALESCE("accessTokenLastUsedAt", "createdAt") with a 4-hour statement timeout (CREATE INDEX CONCURRENTLY on 64M rows on a throttled DB needs the headroom).

  • Separate lock keys per job — hourly and weekly workers use distinct Redis lock keys so a long-running hourly run doesn't starve the weekly job for an entire week.
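
The idle predicate in the second bullet can be sketched in application terms. The helper below is a hypothetical illustration (names are not from the PR) mirroring the COALESCE("accessTokenLastUsedAt", "createdAt") semantics: a token counts as idle when its last activity, falling back to creation time, is 30+ days in the past.

```typescript
// Hypothetical sketch of the weekly drain's idle check. The nullish fallback
// mirrors the SQL COALESCE: never-used tokens (null accessTokenLastUsedAt)
// are judged by createdAt instead, which is what makes the ~54% of
// never-used rows reachable by this predicate.
function isIdleToken(
  accessTokenLastUsedAt: Date | null,
  createdAt: Date,
  now: Date,
  idleDays = 30
): boolean {
  const lastActivity = accessTokenLastUsedAt ?? createdAt; // COALESCE semantics
  const idleMs = idleDays * 24 * 60 * 60 * 1000;
  return now.getTime() - lastActivity.getTime() >= idleMs;
}
```

Because the predicate is an expression over two columns, it also motivates the expression index in the next bullet: a plain index on either column alone cannot serve it.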

Steps to verify the change

Type

  • Fix
  • Feature
  • Improvement
  • Breaking
  • Docs
  • Chore

Checklist

  • Title follows the conventional commit format: type(scope): short description (scope is optional, e.g., fix: prevent crash on sync or fix(api): handle null response).
  • Tested locally
  • Updated docs (if needed)
  • Updated CLAUDE.md files (if needed)
  • Read the contributing guide

linear Bot commented Apr 16, 2026

@maidul98 (Collaborator) commented

Snyk checks have passed. No issues have been found so far.

Scan Engine            Critical  High  Medium  Low  Total
Open Source Security   0         0     0       0    0 issues


@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7710ffc3f


      });
    }
    isRetrying = numberOfRetryOnFailure > 0;
  } while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));

P1 Badge Bound retry loop by clearing stale delete results

If one batch deletes rows and a later batch throws (e.g., statement timeout or lock contention), deletedTokenIds keeps its previous non-empty value, so deletedTokenIds.length > 0 remains true and the loop never exits even after MAX_RETRY_ON_FAILURE is reached. In that failure mode the weekly worker can run forever, repeatedly logging errors and never releasing capacity for future scheduled runs.


Comment on lines +130 to +132
  } catch {
    logger.info(`${QueueName.FrequentResourceCleanUp}: another instance holds the lock, skipping this run`);
    return;

P2 Badge Re-throw non-contention lock acquisition failures

This catch treats every acquireLock exception as "another instance holds the lock" and returns successfully. When Redis is unavailable or returns an unexpected error, the cleanup job is silently skipped and BullMQ records success, so no retry/alert path runs and token cleanup can be missed for extended periods. Only lock-contention errors should be swallowed; other errors should be propagated.
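
One way to implement this suggestion is sketched below. It is a minimal illustration, not the repo's actual API: the LockAcquisitionError class, the acquireLock/release shapes, and runExclusively are all assumed names.

```typescript
// Hypothetical contention error type; real code would check whatever error
// the Redis lock client throws when the lock is already held.
class LockAcquisitionError extends Error {}

// Sketch: swallow only lock contention, propagate everything else so BullMQ
// records a failure and its retry/alert path still runs.
// Returns true if the job ran, false if it was skipped due to contention.
async function runExclusively(
  acquireLock: () => Promise<{ release: () => Promise<void> }>,
  job: () => Promise<void>,
  log: (msg: string) => void
): Promise<boolean> {
  let lock: { release: () => Promise<void> } | null = null;
  try {
    lock = await acquireLock();
  } catch (err) {
    if (err instanceof LockAcquisitionError) {
      log("another instance holds the lock, skipping this run");
      return false; // skipped, but not a failure
    }
    throw err; // Redis outage etc. must surface as a job failure
  }
  try {
    await job();
    return true;
  } finally {
    await lock!.release(); // lock is always set on this path
  }
}
```

The design choice here is that "someone else is doing the work" is a success outcome, while "I could not even ask Redis" is an error the queue should see.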


Comment on lines +35 to +39
export async function down(knex: Knex): Promise<void> {
  await knex.raw(`
    DROP INDEX IF EXISTS idx_identity_access_tokens_idle
  `);
}

🔴 The down() migration uses plain DROP INDEX IF EXISTS rather than DROP INDEX CONCURRENTLY IF EXISTS. A plain drop acquires an AccessExclusiveLock that blocks all reads and writes on the identity_access_tokens table for the full duration of the drop. On the 64M-row throttled-EBS table this PR explicitly calls out, rolling back this migration in production would cause an outage on identity access token operations; change line 36 to DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle to avoid the lock.

Extended reasoning...

What the bug is and how it manifests

PostgreSQL's plain DROP INDEX acquires an AccessExclusiveLock on the parent table, which blocks every concurrent read and write for the entire duration of the drop. This is in direct contrast to DROP INDEX CONCURRENTLY, which only takes brief metadata locks and allows normal DML to proceed throughout. On a large table, this distinction is critical.

The specific code path that triggers it

The bug is in the down() function of 20260416180000_add-identity-access-token-idle-index.ts (lines 35–39):

export async function down(knex: Knex): Promise<void> {
  await knex.raw(`
      DROP INDEX IF EXISTS idx_identity_access_tokens_idle
    `);
}

This runs whenever the migration is rolled back (e.g., knex migrate:down or a failed deployment rollback).

Why existing code doesn't prevent it

The up() migration correctly uses CREATE INDEX CONCURRENTLY — but the same care was not applied to the rollback path. The migration already exports config = { transaction: false }, which is the exact prerequisite PostgreSQL requires before it will accept DROP INDEX CONCURRENTLY (concurrent index operations cannot run inside a transaction block). So the infrastructure for a safe drop is already in place; only the keyword is missing.

What the impact would be

The PR description explicitly states this table has ~64M rows on a throttled EBS volume, which is the reason CONCURRENTLY was chosen for the forward migration. While dropping an index is faster than building one, on throttled I/O it can still take minutes. During that window, every request that touches identity_access_tokens — token validation, authentication, refresh — would be blocked, causing a production outage on all identity access token operations for the full duration.

How to fix it

Replace line 36 with:

await knex.raw(`DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle`);

No other changes are needed; config = { transaction: false } is already exported.

Step-by-step proof

  1. A deployment is rolled back and Knex runs down() on the 20260416180000 migration.
  2. PostgreSQL executes DROP INDEX IF EXISTS idx_identity_access_tokens_idle.
  3. PostgreSQL acquires AccessExclusiveLock on identity_access_tokens.
  4. Every concurrent SELECT, INSERT, UPDATE, or DELETE on identity_access_tokens blocks immediately — no reads, no writes proceed.
  5. On a 64M-row table on throttled EBS I/O, the drop could take several minutes.
  6. All API endpoints that validate or refresh identity access tokens return errors or time out for the full duration.
  7. Had DROP INDEX CONCURRENTLY been used, only a brief ShareUpdateExclusiveLock would be taken at the start and end, and all DML would continue uninterrupted throughout.

Comment on lines +215 to +232
          setTimeout(resolve, 500);
        });
      }
      isRetrying = numberOfRetryOnFailure > 0;
    } while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));

    if (numberOfRetryOnFailure >= MAX_RETRY_ON_FAILURE) {
      logger.error(
        `IdentityAccessTokenIdlePrune: Pruning failed and stopped after ${MAX_RETRY_ON_FAILURE} consecutive retries.`
      );
    }

    logger.info(
      `${QueueName.WeeklyResourceCleanUp}: remove idle access tokens completed. Deleted ${totalDeletedCount} tokens.`
    );
  };

  return { ...identityAccessTokenOrm, findOne, removeExpiredTokens, removeIdleTokens };

🔴 The do-while loop in removeIdleTokens can spin infinitely if any batch succeeds before a permanent DB failure. Once a batch returns N deleted IDs, deletedTokenIds remains non-empty forever because the catch block never resets it — so even after MAX_RETRY_ON_FAILURE consecutive failures the left-hand condition deletedTokenIds.length > 0 keeps the loop alive, bypassing the retry guard entirely and hammering the broken database every 500 ms. Fix: add deletedTokenIds = [] at the top of the do block (before the try). The identical pattern in removeExpiredTokens has the same pre-existing bug.

Extended reasoning...

What the bug is and how it manifests

removeIdleTokens (identity-access-token-dal.ts) uses a do-while loop controlled by:

while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));

deletedTokenIds is assigned inside the try block and never reset in the catch block. If any batch succeeds (sets deletedTokenIds to a non-empty array) and then the database becomes permanently unavailable, subsequent batches throw exceptions. The catch block increments numberOfRetryOnFailure but leaves deletedTokenIds holding the results of the last successful batch. Once numberOfRetryOnFailure >= MAX_RETRY_ON_FAILURE (3), the right-hand side of the OR becomes false — but the left side (deletedTokenIds.length > 0) remains true indefinitely. The loop never exits.

The specific code path that triggers it

  1. Batch 1 succeeds: deletedTokenIds = [5000 items], numberOfRetryOnFailure reset to 0.
  2. DB goes down.
  3. Batch 2 fails: numberOfRetryOnFailure = 1. deletedTokenIds unchanged. Condition: 5000 > 0 || (true && 1 < 3) → true.
  4. Batch 3 fails: numberOfRetryOnFailure = 2. Condition: 5000 > 0 || (true && 2 < 3) → true.
  5. Batch 4 fails: numberOfRetryOnFailure = 3. Condition: 5000 > 0 || (true && 3 < 3) → 5000 > 0 || false → true.
  6. Every subsequent batch fails: condition stays true forever.

Why existing code does not prevent it

The MAX_RETRY_ON_FAILURE guard is only effective when deletedTokenIds is empty at the time retries are exhausted (i.e., if the very first batch fails). It provides zero protection after any prior batch has succeeded with deletions, because there is no code path that clears deletedTokenIds on failure or at the start of each iteration.

What the impact would be

In the failure scenario the weekly cleanup job enters an infinite busy loop — one DB round-trip every 500 ms — against an already broken or overloaded database, worsening the outage and keeping the BullMQ worker occupied indefinitely. The Redis distributed lock (3-hour TTL) will also be held, preventing any other instance from picking up the job.

How to fix it

Reset deletedTokenIds to [] at the top of the do block, before the try:

do {
  deletedTokenIds = []; // add this line
  try {
    // ...
  } catch (error) {
    // ...
  }
} while (...);

This ensures that on any iteration where the DB throws, deletedTokenIds is empty at loop evaluation time. After MAX_RETRY_ON_FAILURE consecutive failures the right-hand side becomes false and the left-hand side (0 > 0) is also false, so the loop terminates as intended.
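
A minimal, self-contained simulation of the corrected loop (hypothetical names, no DB — each "batch" either deletes N rows or fails) shows the guard now terminates after MAX_RETRY_ON_FAILURE consecutive failures even when an earlier batch succeeded:

```typescript
const MAX_RETRY_ON_FAILURE = 3;

// Simulates the corrected do-while from removeIdleTokens. Returns how many
// iterations ran before the loop exited. Without the reset on the first line
// of the do block, the [5000, "error", ...] scenario below would never exit.
function simulatePrune(batches: Array<number | "error">): number {
  let deletedTokenIds: number[] = [];
  let numberOfRetryOnFailure = 0;
  let isRetrying = false;
  let iterations = 0;
  do {
    deletedTokenIds = []; // the fix: clear stale results before each attempt
    const batch = batches[Math.min(iterations, batches.length - 1)];
    iterations += 1;
    if (batch === "error") {
      numberOfRetryOnFailure += 1; // what the catch block does
    } else {
      deletedTokenIds = new Array(batch).fill(0); // successful delete batch
      numberOfRetryOnFailure = 0;
    }
    isRetrying = numberOfRetryOnFailure > 0;
  } while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));
  return iterations;
}
```

With batches = [5000, "error"] (one success, then permanent failure), the loop runs the successful batch plus three failed retries and stops at four iterations, instead of spinning forever.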

The same bug exists in removeExpiredTokens (a pre-existing issue), but removeIdleTokens introduces it fresh in this PR.
