fix(platfor-290): identity-access-token cleanup #6057
PrestigePvP wants to merge 1 commit into main from
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7710ffc3f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```ts
      });
    }
    isRetrying = numberOfRetryOnFailure > 0;
  } while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));
```
Bound retry loop by clearing stale delete results
If one batch deletes rows and a later batch throws (e.g., statement timeout or lock contention), deletedTokenIds keeps its previous non-empty value, so deletedTokenIds.length > 0 remains true and the loop never exits even after MAX_RETRY_ON_FAILURE is reached. In that failure mode the weekly worker can run forever, repeatedly logging errors and never releasing capacity for future scheduled runs.
Useful? React with 👍 / 👎.
```ts
} catch {
  logger.info(`${QueueName.FrequentResourceCleanUp}: another instance holds the lock, skipping this run`);
  return;
```
Re-throw non-contention lock acquisition failures
This catch treats every acquireLock exception as "another instance holds the lock" and returns successfully. When Redis is unavailable or returns an unexpected error, the cleanup job is silently skipped and BullMQ records success, so no retry/alert path runs and token cleanup can be missed for extended periods. Only lock-contention errors should be swallowed; other errors should be propagated.
Useful? React with 👍 / 👎.
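To illustrate the suggested fix, here is a minimal sketch of a contention-aware catch. The `LockAcquisitionError` class, the `runWithLock` helper, and its parameters are hypothetical names introduced for illustration only; the real lock library's error type may differ, so the `instanceof` check would need to match whatever `acquireLock` actually throws on contention.

```typescript
// Hypothetical error type standing in for whatever the real lock library
// throws when another instance already holds the lock.
class LockAcquisitionError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "LockAcquisitionError";
  }
}

// Sketch: swallow only lock contention; let infrastructure errors propagate
// so BullMQ marks the job failed and its retry/alert path runs.
async function runWithLock(
  acquireLock: () => Promise<{ release: () => Promise<void> }>,
  job: () => Promise<void>,
  log: (msg: string) => void
): Promise<void> {
  let lock;
  try {
    lock = await acquireLock();
  } catch (err) {
    if (err instanceof LockAcquisitionError) {
      // Contention: skip quietly; recording success here is intentional.
      log("another instance holds the lock, skipping this run");
      return;
    }
    // Redis unavailable or any unexpected error: surface it to the worker.
    throw err;
  }
  try {
    await job();
  } finally {
    await lock.release();
  }
}
```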
```ts
export async function down(knex: Knex): Promise<void> {
  await knex.raw(`
    DROP INDEX IF EXISTS idx_identity_access_tokens_idle
  `);
}
```
🔴 The down() migration uses plain DROP INDEX IF EXISTS rather than DROP INDEX CONCURRENTLY IF EXISTS, which acquires an AccessExclusiveLock that blocks all reads and writes on the identity_access_tokens table for the full duration of the drop. On the 64M-row throttled-EBS table this PR explicitly calls out, rolling back this migration in production would cause a write outage on identity access token operations; change line 36 to DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle to avoid the lock.
Extended reasoning...
What the bug is and how it manifests
PostgreSQL's plain DROP INDEX acquires an AccessExclusiveLock on the parent table, which blocks every concurrent read and write for the entire duration of the drop. This is in direct contrast to DROP INDEX CONCURRENTLY, which only takes brief metadata locks and allows normal DML to proceed throughout. On a large table, this distinction is critical.
The specific code path that triggers it
The bug is in the down() function of 20260416180000_add-identity-access-token-idle-index.ts (lines 35–39):

```ts
export async function down(knex: Knex): Promise<void> {
  await knex.raw(`
    DROP INDEX IF EXISTS idx_identity_access_tokens_idle
  `);
}
```

This runs whenever the migration is rolled back (e.g., knex migrate:down or a failed deployment rollback).
Why existing code doesn't prevent it
The up() migration correctly uses CREATE INDEX CONCURRENTLY — but the same care was not applied to the rollback path. The migration already exports config = { transaction: false }, which is the exact prerequisite PostgreSQL requires before it will accept DROP INDEX CONCURRENTLY (concurrent index operations cannot run inside a transaction block). So the infrastructure for a safe drop is already in place; only the keyword is missing.
What the impact would be
The PR description explicitly states this table has ~64M rows on a throttled EBS volume, which is the reason CONCURRENTLY was chosen for the forward migration. While dropping an index is faster than building one, on throttled I/O it can still take minutes. During that window, every request that touches identity_access_tokens — token validation, authentication, refresh — would be blocked, causing a production outage on all identity access token operations for the full duration.
How to fix it
Replace line 36 with:

```ts
await knex.raw(`DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle`);
```

No other changes are needed; `config = { transaction: false }` is already exported.
Step-by-step proof
- A deployment is rolled back and Knex runs `down()` on the `20260416180000` migration.
- PostgreSQL executes `DROP INDEX IF EXISTS idx_identity_access_tokens_idle`.
- PostgreSQL acquires `AccessExclusiveLock` on `identity_access_tokens`.
- Every concurrent `SELECT`, `INSERT`, `UPDATE`, or `DELETE` on `identity_access_tokens` blocks immediately — no reads, no writes proceed.
- On a 64M-row table on throttled EBS I/O, the drop could take several minutes.
- All API endpoints that validate or refresh identity access tokens return errors or time out for the full duration.
- Had `DROP INDEX CONCURRENTLY` been used, only a brief `ShareUpdateExclusiveLock` would be taken at the start and end, and all DML would continue uninterrupted throughout.
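Putting the pieces together, the fully corrected migration could look like the sketch below. The `COALESCE` index expression and the object-name details come from this PR's context; the local structural `Knex` type is a self-contained stand-in for the real `import type { Knex } from "knex"`, so treat this as an illustration rather than the PR's exact file.

```typescript
// Minimal structural stand-in for the knex type, so the sketch is self-contained.
type Knex = { raw: (sql: string) => Promise<unknown> };

// Required so PostgreSQL accepts CONCURRENTLY: concurrent index operations
// cannot run inside a transaction block.
export const config = { transaction: false };

export async function up(knex: Knex): Promise<void> {
  // Builds the expression index without blocking DML on the 64M-row table.
  await knex.raw(`
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_identity_access_tokens_idle
    ON identity_access_tokens (COALESCE("accessTokenLastUsedAt", "createdAt"))
  `);
}

export async function down(knex: Knex): Promise<void> {
  // CONCURRENTLY avoids the table-wide AccessExclusiveLock of a plain DROP INDEX.
  await knex.raw(`DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle`);
}
```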
```ts
        setTimeout(resolve, 500);
      });
    }
    isRetrying = numberOfRetryOnFailure > 0;
  } while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));

  if (numberOfRetryOnFailure >= MAX_RETRY_ON_FAILURE) {
    logger.error(
      `IdentityAccessTokenIdlePrune: Pruning failed and stopped after ${MAX_RETRY_ON_FAILURE} consecutive retries.`
    );
  }

  logger.info(
    `${QueueName.WeeklyResourceCleanUp}: remove idle access tokens completed. Deleted ${totalDeletedCount} tokens.`
  );
};

return { ...identityAccessTokenOrm, findOne, removeExpiredTokens, removeIdleTokens };
```
🔴 The do-while loop in removeIdleTokens can spin infinitely if any batch succeeds before a permanent DB failure. Once a batch returns N deleted IDs, deletedTokenIds remains non-empty forever because the catch block never resets it — so even after MAX_RETRY_ON_FAILURE consecutive failures the left-hand condition deletedTokenIds.length > 0 keeps the loop alive, bypassing the retry guard entirely and hammering the broken database every 500 ms. Fix: add deletedTokenIds = [] at the top of the do block (before the try). The identical pattern in removeExpiredTokens has the same pre-existing bug.
Extended reasoning...
What the bug is and how it manifests
removeIdleTokens (identity-access-token-dal.ts) uses a do-while loop controlled by:

```ts
while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));
```

deletedTokenIds is assigned inside the try block and never reset in the catch block. If any batch succeeds (sets deletedTokenIds to a non-empty array) and then the database becomes permanently unavailable, subsequent batches throw exceptions. The catch block increments numberOfRetryOnFailure but leaves deletedTokenIds holding the results of the last successful batch. Once numberOfRetryOnFailure >= MAX_RETRY_ON_FAILURE (3), the right-hand side of the OR becomes false — but the left side (deletedTokenIds.length > 0) remains true indefinitely. The loop never exits.
The specific code path that triggers it
- Batch 1 succeeds: `deletedTokenIds = [5000 items]`, `numberOfRetryOnFailure` reset to 0.
- DB goes down.
- Batch 2 fails: `numberOfRetryOnFailure = 1`. `deletedTokenIds` unchanged. Condition: `5000 > 0 || (true && 1 < 3)` → `true`.
- Batch 3 fails: `numberOfRetryOnFailure = 2`. Condition: `5000 > 0 || (true && 2 < 3)` → `true`.
- Batch 4 fails: `numberOfRetryOnFailure = 3`. Condition: `5000 > 0 || (true && 3 < 3)` → `5000 > 0 || false` → `true`.
- Every subsequent batch fails: condition stays `true` forever.
Why existing code does not prevent it
The MAX_RETRY_ON_FAILURE guard is only effective when deletedTokenIds is empty at the time retries are exhausted (i.e., if the very first batch fails). It provides zero protection after any prior batch has succeeded with deletions, because there is no code path that clears deletedTokenIds on failure or at the start of each iteration.
What the impact would be
In the failure scenario the weekly cleanup job enters an infinite busy loop — one DB round-trip every 500 ms — against an already broken or overloaded database, worsening the outage and keeping the BullMQ worker occupied indefinitely. The Redis distributed lock (3-hour TTL) will also be held, preventing any other instance from picking up the job.
How to fix it
Reset deletedTokenIds to [] at the top of the do block, before the try:

```ts
do {
  deletedTokenIds = []; // add this line
  try {
    // ...
  } catch (error) {
    // ...
  }
} while (...);
```

This ensures that on any iteration where the DB throws, deletedTokenIds is empty at loop evaluation time. After MAX_RETRY_ON_FAILURE consecutive failures the right-hand side becomes false and the left-hand side (0 > 0) is also false, so the loop terminates as intended.
The same bug exists in removeExpiredTokens (a pre-existing issue), but removeIdleTokens introduces it fresh in this PR.
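The corrected loop can be sketched end to end as follows. This is a simplified illustration, not the PR's exact code: `deleteBatch` stands in for the real Knex delete, `pruneWithBoundedRetries` is a hypothetical name, and the constants mirror the PR's `MAX_RETRY_ON_FAILURE` of 3 and 500 ms backoff.

```typescript
const MAX_RETRY_ON_FAILURE = 3;

// Runs deleteBatch until a batch deletes nothing, or until the retry
// budget is exhausted by consecutive failures. Returns the total deleted.
async function pruneWithBoundedRetries(
  deleteBatch: () => Promise<string[]>
): Promise<number> {
  let totalDeletedCount = 0;
  let numberOfRetryOnFailure = 0;
  let isRetrying = false;
  let deletedTokenIds: string[] = [];

  do {
    // Reset every iteration so a failed batch cannot keep the previous
    // batch's non-empty result alive and bypass the retry guard.
    deletedTokenIds = [];
    try {
      deletedTokenIds = await deleteBatch();
      totalDeletedCount += deletedTokenIds.length;
      numberOfRetryOnFailure = 0; // a success resets the consecutive-failure count
    } catch {
      numberOfRetryOnFailure += 1;
      await new Promise((resolve) => { setTimeout(resolve, 500); });
    }
    isRetrying = numberOfRetryOnFailure > 0;
  } while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));

  return totalDeletedCount;
}
```

With the reset in place, the worst case after any successful batch is exactly MAX_RETRY_ON_FAILURE extra round-trips before the loop exits.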
Context
Distributed lock on hourly cleanup — wraps the FrequentResourceCleanUp worker with a Redis lock (retryCount: 0) so overlapping runs across instances skip instead of compounding DB load. EU prod's hourly job was taking 2–3 min/batch; concurrent runs were multiplying lock contention on a throttled EBS volume.
Weekly idle-token drain — adds a new WeeklyResourceCleanUp queue (fires Sunday 03:00 UTC) that deletes tokens idle for 30+ days via COALESCE(accessTokenLastUsedAt, createdAt). A 1% sample showed ~54% of EU's ~64M token rows have never been used — these are unreachable by existing TTL/revoked/uses-exhausted predicates.
Expression index for idle predicate — new migration creates idx_identity_access_tokens_idle on COALESCE("accessTokenLastUsedAt", "createdAt") with a 4-hour statement timeout (CREATE INDEX CONCURRENTLY on 64M rows on a throttled DB needs the headroom).
Separate lock keys per job — hourly and weekly workers use distinct Redis lock keys so a long-running hourly run doesn't starve the weekly job for an entire week.
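The weekly idle-token drain described above can be sketched as a single batch-delete step. The table and column names and the `COALESCE` predicate come from this PR's context; the 5000 batch size, the `RawRunner` shape, and the function name are illustrative assumptions, not the PR's actual DAL code.

```typescript
// Minimal structural stand-in for a raw-SQL runner (e.g. a Knex instance).
type RawRunner = { raw: (sql: string) => Promise<{ rows: { id: string }[] }> };

const IDLE_DAYS = 30;
const BATCH_SIZE = 5000; // illustrative batch size

// Deletes one batch of tokens whose last activity (last use, else creation
// time, via COALESCE) is more than 30 days old, returning the deleted ids so
// the caller's loop can tell whether another batch is needed.
async function deleteIdleTokenBatch(db: RawRunner): Promise<string[]> {
  const result = await db.raw(`
    DELETE FROM identity_access_tokens
    WHERE id IN (
      SELECT id FROM identity_access_tokens
      WHERE COALESCE("accessTokenLastUsedAt", "createdAt") < NOW() - INTERVAL '${IDLE_DAYS} days'
      LIMIT ${BATCH_SIZE}
    )
    RETURNING id
  `);
  return result.rows.map((r) => r.id);
}
```

Because the predicate matches the expression index `idx_identity_access_tokens_idle` on `COALESCE("accessTokenLastUsedAt", "createdAt")`, each batch's inner SELECT can use an index scan instead of a sequential scan over the ~64M-row table.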
Steps to verify the change
Type
Checklist
type(scope): short description (scope is optional, e.g., fix: prevent crash on sync or fix(api): handle null response).