fix(aws): improve verify - check all accounts and improve role assumption reliability#739
fix(aws): improve verify - check all accounts and improve role assumption reliability#739ShubhamRasal merged 4 commits intodevfrom
Conversation
…as parameter and improve error handling in Verify method - Updated createAssumedRoleSession to take accountId as an argument, allowing for more flexible role assumption. - Modified Verify method to iterate over AccountIds, capturing errors for each account and returning the last encountered error if all attempts fail.
…as parameter and improve error handling in Verify method - Updated createAssumedRoleSession to take accountId as an argument, allowing for more flexible role assumption. - Modified Verify method to iterate over AccountIds, capturing errors for each account and returning the last encountered error if all attempts fail.
…ating over AccountIds - Enhanced the New function to attempt region retrieval using multiple AccountIds. - Implemented error handling to continue attempts with subsequent accounts if the initial one fails. - Ensured that a comprehensive error message is returned if all attempts to retrieve regions fail.
Neo - PR Security ReviewNo security issues found Highlights
Hardening Notes
Comment |
WalkthroughThe AWS provider now supports multi-account role assumption: New iterates AccountIds to build assumed-role sessions for DescribeRegions when the base user lacks permission; createAssumedRoleSession accepts accountId to construct RoleArn; Verify runs parallel per-account verification and aggregates failures. Changes
Sequence DiagramsequenceDiagram
participant Client
participant Provider
participant BaseAWS as AWS (Base User)
participant AssumeRole as Assumed Role Session
participant TargetAWS as AWS (Target Account)
Client->>Provider: New(options)
Provider->>BaseAWS: DescribeRegions()
BaseAWS-->>Provider: Permission Denied
loop For each AccountId in options.AccountIds
Provider->>AssumeRole: createAssumedRoleSession(accountId)
AssumeRole->>AssumeRole: Build RoleArn using accountId
AssumeRole-->>Provider: Session created
Provider->>TargetAWS: DescribeRegions(assumed role)
alt Success
TargetAWS-->>Provider: Regions list
Provider->>Provider: Set regions & stop loop
else Failure
TargetAWS-->>Provider: Error
Provider->>Provider: Aggregate error & continue
end
end
alt Any account succeeded
Provider-->>Client: Initialized successfully
else All accounts failed
Provider-->>Client: Aggregated error returned
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/providers/aws/aws.go (1)
523-560:⚠️ Potential issue | 🔴 Critical
Verify()still skips account-role validation when the base session works.Because of the early return at Lines 524-526, this new fan-out only runs when
p.verify()fails on the source credentials. If the base creds can access EC2/S3/etc. but one or more target accounts are missing the role,Verify()still returnsniland those broken accounts stay undetected.💡 Suggested fix
- err := p.verify() - if err == nil { - return nil - } + baseErr := p.verify() if p.options.AssumeRoleName != "" && len(p.options.AccountIds) > 0 { var mu sync.Mutex var failedAccounts []string var wg sync.WaitGroup @@ if len(failedAccounts) > 0 { msg := fmt.Sprintf("failed to assume role %s in accounts: %s", p.options.AssumeRoleName, strings.Join(failedAccounts, ", ")) if p.options.OrgDiscoveryRoleArn != "" { msg += ". Add these to exclude_account_ids if they should not be part of discovery" } return errors.New(msg) } return nil } - return err + return baseErr🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/providers/aws/aws.go` around lines 523 - 560, The Verify() flow currently returns early when p.verify() succeeds so per-account assume-role checks never run; instead, always run the fan-out when p.options.AssumeRoleName != "" && len(p.options.AccountIds) > 0 regardless of the base verify result: call p.verify() and keep its error value, then iterate accounts and for each use createAssumedRoleSession (with p.session/p.session.Config and the account id), init services on the temporary Provider (tempProvider.initServices) and call tempProvider.verify(), collecting failedAccounts via sync.Mutex as done now; after the fan-out, if any failedAccounts exist return a constructed error including p.options.AssumeRoleName and p.options.OrgDiscoveryRoleArn guidance, otherwise return the original base verify error (which may be nil). Ensure identifiers referenced: Verify(), createAssumedRoleSession, Provider.initServices, tempProvider.verify(), p.options.AssumeRoleName, p.options.AccountIds, p.options.OrgDiscoveryRoleArn.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/providers/aws/aws.go`:
- Around line 252-266: The loop currently overwrites regionErr for each account
so callers only see the last failure; change the logic in the block using
regionErr/createAssumedRoleSession/tempRC.DescribeRegions and options.AccountIds
to collect per-account errors (e.g., map or slice of errors keyed by accountId)
and append or wrap each failure instead of replacing regionErr, then when all
accounts fail return a single error that aggregates the per-account failures
(include accountId and original error text for each) so the caller sees every
account-level failure rather than just the final one.
- Around line 529-553: The account verification loop spawns an unbounded
goroutine per p.options.AccountIds and ignores ctx, so add a bounded worker
semaphore (e.g., a buffered chan semaphore with a configurable maxWorkers like
10) and acquire it before launching each goroutine and release it at the end to
cap fan-out, and make the goroutine respect cancellation by selecting on
ctx.Done() at start and before appending to failedAccounts; also update calls to
createAssumedRoleSession and Provider.verify to accept and use ctx (or wrap them
so they return early if ctx is done) and keep using mu to protect failedAccounts
and wg for completion so cancellation stops further work and prevents STS
throttling.
---
Outside diff comments:
In `@pkg/providers/aws/aws.go`:
- Around line 523-560: The Verify() flow currently returns early when p.verify()
succeeds so per-account assume-role checks never run; instead, always run the
fan-out when p.options.AssumeRoleName != "" && len(p.options.AccountIds) > 0
regardless of the base verify result: call p.verify() and keep its error value,
then iterate accounts and for each use createAssumedRoleSession (with
p.session/p.session.Config and the account id), init services on the temporary
Provider (tempProvider.initServices) and call tempProvider.verify(), collecting
failedAccounts via sync.Mutex as done now; after the fan-out, if any
failedAccounts exist return a constructed error including
p.options.AssumeRoleName and p.options.OrgDiscoveryRoleArn guidance, otherwise
return the original base verify error (which may be nil). Ensure identifiers
referenced: Verify(), createAssumedRoleSession, Provider.initServices,
tempProvider.verify(), p.options.AssumeRoleName, p.options.AccountIds,
p.options.OrgDiscoveryRoleArn.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fa099f27-b4e6-4f7e-be88-4043ad3718d0
📒 Files selected for processing (1)
pkg/providers/aws/aws.go
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/providers/aws/aws.go (1)
529-556:⚠️ Potential issue | 🟠 MajorContext cancellation is still not honored in the parallel verification loop.
The semaphore limiting concurrency to 200 was added (good), but the loop and goroutines still don't respect
ctx:
- Line 536 (
sem <- struct{}{}) can block indefinitely without checkingctx.Done()- The goroutine body doesn't check for cancellation before performing work
If the caller cancels the context, all spawned goroutines will continue executing their STS
AssumeRolecalls.💡 Suggested fix to honor ctx
sem := make(chan struct{}, 200) for _, accountId := range p.options.AccountIds { + select { + case <-ctx.Done(): + break + default: + } wg.Add(1) - sem <- struct{}{} + select { + case sem <- struct{}{}: + case <-ctx.Done(): + wg.Done() + continue + } go func(id string) { defer wg.Done() defer func() { <-sem }() + + if ctx.Err() != nil { + return + } + tempSession, err := createAssumedRoleSession(p.options, p.session, p.session.Config, id)As per coding guidelines, "Always pass and honor context.Context in Resources() and long-running API calls to support cancellation".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/providers/aws/aws.go` around lines 529 - 556, The parallel verification loop must honor the incoming ctx: change the semaphore send (sem <- struct{}{}) to a ctx-aware select that returns when ctx.Done() is closed so you don't block forever, and update the goroutine to early-return if ctx is cancelled before doing work; also propagate ctx into long-running calls by calling createAssumedRoleSession and Provider.verify with ctx (or their ctx-aware variants) and check ctx.Done() between operations (after acquiring the semaphore and before initServices/verify) so that goroutines add to failedAccounts only when appropriate and exit promptly on cancellation (references: sem, wg, failedAccounts, createAssumedRoleSession, Provider.initServices, Provider.verify, p.options.AccountIds, ctx).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/providers/aws/aws.go`:
- Around line 529-556: The parallel verification loop must honor the incoming
ctx: change the semaphore send (sem <- struct{}{}) to a ctx-aware select that
returns when ctx.Done() is closed so you don't block forever, and update the
goroutine to early-return if ctx is cancelled before doing work; also propagate
ctx into long-running calls by calling createAssumedRoleSession and
Provider.verify with ctx (or their ctx-aware variants) and check ctx.Done()
between operations (after acquiring the semaphore and before
initServices/verify) so that goroutines add to failedAccounts only when
appropriate and exit promptly on cancellation (references: sem, wg,
failedAccounts, createAssumedRoleSession, Provider.initServices,
Provider.verify, p.options.AccountIds, ctx).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 51cc0213-3737-4cc4-94e3-5ae2f7f2bf33
📒 Files selected for processing (1)
pkg/providers/aws/aws.go
Summary
Reports all failed account IDs in a single actionable error message.
on the first account.
AccountIds[0].
Context
With org discovery, account ordering is arbitrary — the first account may not have the scanner role configured. Previously, verification only checked the first account in the list — if it passed, the entire verification was considered successful, even if other accounts lacked the scanner role. This meant broken accounts were silently skipped during enumeration. Now all accounts are verified in parallel, and the error message lists exactly which accounts failed, with guidance to add them to exclude_account_ids when using org discovery.
Changes
Summary by CodeRabbit