# Azure DevOps and Helix Reference

## Supported Repositories

The script defaults to `dotnet/diagnostics` but works with any dotnet repository that uses Azure DevOps and Helix:

| Repository | Common Pipelines |
|------------|-----------------|
| `dotnet/diagnostics` | dotnet-diagnostics, dotnet-diagnostics-official |
| `dotnet/runtime` | runtime, runtime-dev-innerloop, dotnet-linker-tests |
| `dotnet/sdk` | dotnet-sdk (mix of local and Helix tests) |
| `dotnet/aspnetcore` | aspnetcore-ci |
| `dotnet/roslyn` | roslyn-CI |

Use `-Repository` to specify a different target:
```powershell
./scripts/Get-CIStatus.ps1 -PRNumber 12345 -Repository "dotnet/runtime"
```

**Note:** The script auto-discovers builds for a PR, so you rarely need to know definition IDs.

## Azure DevOps Organizations

**Public builds (default):**
- Organization: `dnceng-public`
- Project GUID: `cbb18261-c48f-4abb-8651-8cdcb5474649`

**Internal/private builds:**
- Organization: `dnceng`
- Project GUID: Varies by pipeline

Override with:
```powershell
./scripts/Get-CIStatus.ps1 -BuildId 1276327 -Organization "dnceng" -Project "internal-project-guid"
```

## Common Pipeline Names (dotnet/diagnostics)

| Pipeline | Description |
|----------|-------------|
| `dotnet-diagnostics` | Main PR validation build |
| `dotnet-diagnostics-official` | Official/internal build |

The script discovers pipelines automatically from the PR.

## Useful Links

- [Helix Portal](https://helix.dot.net/): View Helix jobs and work items (all repos)
- [Helix API Documentation](https://helix.dot.net/swagger/): Swagger docs for Helix REST API
- [Build Analysis](https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/LandingPage.md): Known issues tracking (arcade infrastructure)
- [dnceng-public AzDO](https://dev.azure.com/dnceng-public/public/_build): Public builds for all dotnet repos

## Test Execution Types

### Helix Tests
Tests run on Helix distributed test infrastructure. The script extracts console log URLs and can fetch detailed failure info with `-ShowLogs`.

### Local Tests (Non-Helix)
Some tests run directly on the build agent. The script detects these and extracts Azure DevOps Test Run URLs.
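Both execution types are surfaced by the same entry point; a typical invocation (PR number illustrative) looks like:

```powershell
# Summarize CI status for a PR; -ShowLogs fetches detailed failure info
# for Helix tests (console logs) in addition to the pass/fail summary
./scripts/Get-CIStatus.ps1 -PRNumber 12345 -ShowLogs
```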

## Known Issue Labels

- `Known Build Error` - Used by Build Analysis across all dotnet repositories
- Search syntax: `repo:<owner>/<repo> is:issue is:open label:"Known Build Error" <test-name>`

> **Note:** The `dotnet/diagnostics` repository does not currently have Known Build Error tracking set up. The script will still search for known issues, but may not find matches. This feature is more useful when targeting repositories like `dotnet/runtime` that have active Build Analysis.

Example searches (use `search_issues` when the GitHub MCP server is available, otherwise the `gh` CLI):
```bash
# Search in diagnostics
gh issue list --repo dotnet/diagnostics --label "Known Build Error" --search "SOS"

# Search in runtime
gh issue list --repo dotnet/runtime --label "Known Build Error" --search "FileSystemWatcher"
```
# Deep Investigation with Azure CLI

The AzDO MCP tools (`azure-devops-pipelines_*`) handle most pipeline queries directly. When those tools are unavailable, or when they don't expose the endpoint you need (e.g., downloading build artifacts, inspecting internal pipeline definitions), fall back to the Azure CLI with the `azure-devops` extension — that fallback is what this reference covers.

> 💡 **Prefer `az pipelines` / `az devops` commands over raw REST API calls.** The CLI handles authentication, pagination, and JSON output formatting. Only fall back to manual `Invoke-RestMethod` calls when the CLI doesn't expose the endpoint you need (e.g., build timelines). The CLI's `--query` (JMESPath) and `-o table` flags are powerful for filtering without extra scripting.
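As a sketch of what `--query` can do beyond projection (pipeline ID, org, and project illustrative), a JMESPath filter expression narrows results before any scripting:

```powershell
# Show only the failed runs of a pipeline, as a table of id/result/finish time
# [?result=='failed'] filters; .{...} projects the remaining fields
az pipelines runs list --pipeline-ids 1330 --top 20 `
  --org "https://dev.azure.com/dnceng" -p "internal" `
  --query "[?result=='failed'].{id:id, result:result, finish:finishTime}" -o table
```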

## Checking Authentication

Before making AzDO API calls, verify the CLI is installed and authenticated:

```powershell
# Ensure az is on PATH (Windows may need a refresh after install)
$env:Path = [System.Environment]::GetEnvironmentVariable("Path", "Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path", "User")

# Check if az CLI is available
az --version 2>$null | Select-Object -First 1

# Check if logged in and get current account
az account show --query "{name:name, user:user.name}" -o table 2>$null

# If not logged in, prompt the user to authenticate:
# az login # Interactive browser login
# az login --use-device-code # Device code flow (for remote/headless)

# Get an AAD access token for AzDO REST API calls (only needed for raw REST)
$accessToken = (az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)
$headers = @{ "Authorization" = "Bearer $accessToken" }
```

> ⚠️ If `az` is not installed, use `winget install -e --id Microsoft.AzureCLI` (Windows). The `azure-devops` extension is also required — install or verify it with `az extension add --name azure-devops` (safe to run if already installed). Ask the user to authenticate if needed.

> ⚠️ **Do NOT use `az devops configure --defaults`** — it sets user-wide defaults that may not match the organization/project needed for dotnet repositories. Always pass `--org` and `--project` (or `-p`) explicitly on each command.

## Querying Pipeline Definitions and Builds

```powershell
$org = "https://dev.azure.com/dnceng"
$project = "internal"

# Find a pipeline definition by name
az pipelines list --name "dotnet-unified-build" --org $org -p $project --query "[].{id:id, name:name, path:path}" -o table

# Get pipeline definition details (shows YAML path, triggers, etc.)
az pipelines show --id 1330 --org $org -p $project --query "{id:id, name:name, yamlPath:process.yamlFilename, repo:repository.name}" -o table

# List recent builds for a pipeline (replace {TARGET_BRANCH} with the PR's base branch, e.g., main or release/9.0)
az pipelines runs list --pipeline-ids 1330 --branch "refs/heads/{TARGET_BRANCH}" --top 5 --org $org -p $project --query "[].{id:id, result:result, finish:finishTime}" -o table

# Get a specific build's details
az pipelines runs show --id $buildId --org $org -p $project --query "{id:id, result:result, sourceBranch:sourceBranch}" -o table

# List build artifacts
az pipelines runs artifact list --run-id $buildId --org $org -p $project --query "[].{name:name, type:resource.type}" -o table

# Download a build artifact
az pipelines runs artifact download --run-id $buildId --artifact-name "TestBuild_linux_x64" --path "$env:TEMP\artifact" --org $org -p $project
```

## REST API Fallback

Fall back to REST API only when the CLI doesn't expose what you need:

```powershell
# Get build timeline (stages, jobs, tasks with results and durations) — no CLI equivalent
$accessToken = (az account get-access-token --resource 499b84ac-1321-427f-aa17-267ca6975798 --query accessToken -o tsv)
$headers = @{ "Authorization" = "Bearer $accessToken" }
$timelineUrl = "https://dev.azure.com/dnceng/internal/_apis/build/builds/$buildId/timeline?api-version=7.1"
$timeline = (Invoke-RestMethod -Uri $timelineUrl -Headers $headers)
$timeline.records | Where-Object { $_.result -eq "failed" -and $_.type -eq "Job" }
```
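Each timeline record also carries a `log` object with a `url` for the raw task output. A follow-up sketch (record schema per the AzDO timeline API; reuses `$timeline` and `$headers` from above):

```powershell
# For each failed job, print its name and the tail of its plain-text log
$failed = $timeline.records | Where-Object { $_.result -eq "failed" -and $_.type -eq "Job" }
foreach ($job in $failed) {
    Write-Host "Failed job: $($job.name)"
    if ($job.log -and $job.log.url) {
        # Logs are plain text; the error is usually near the end
        (Invoke-RestMethod -Uri $job.log.url -Headers $headers) -split "`n" |
            Select-Object -Last 50
    }
}
```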

## Examining Pipeline YAML

All dotnet repos that use arcade put their pipeline definitions under `eng/pipelines/`. Use `az pipelines show` to find the YAML file path, then fetch it:

```powershell
# Find the YAML path for a pipeline
az pipelines show --id 1330 --org $org -p $project --query "{yamlPath:process.yamlFilename, repo:repository.name}" -o table

# Fetch the YAML from the repo (example: dotnet/runtime's runtime-official pipeline)
# github-mcp-server-get_file_contents owner:dotnet repo:runtime path:eng/pipelines/runtime-official.yml

# For VMR unified builds, the YAML is in dotnet/dotnet:
# github-mcp-server-get_file_contents owner:dotnet repo:dotnet path:eng/pipelines/unified-build.yml

# Templates are usually in eng/pipelines/common/ or eng/pipelines/templates/
```

This is especially useful when:
- A job name doesn't clearly indicate what it builds
- You need to understand stage dependencies (why a job was canceled)
- You want to find which template defines a specific step
- Investigating whether a pipeline change caused new failures
# Deep Investigation: Binlog Comparison

When a test **passes on the target branch but fails on a PR**, comparing MSBuild binlogs from both runs reveals the exact difference in task parameters without guessing.

## When to Use This Pattern

- Test assertion compares "expected vs actual" build outputs (e.g., CSC args, reference lists)
- A build succeeds on one branch but fails on another with different MSBuild behavior
- You need to find which MSBuild property/item change caused a specific task to behave differently

## The Pattern: Delegate to Subagents

> ⚠️ **Do NOT download, load, and parse binlogs in the main conversation context.** This burns 10+ turns on mechanical work. Delegate to subagents instead.

### Step 1: Identify the two work items to compare

Use `Get-CIStatus.ps1` to find the failing Helix job + work item, then find a corresponding passing build (recent PR merged to the target branch, or a CI run on that branch).

**Finding Helix job IDs from build artifacts (using binlogs to find binlogs):**
When the failing work item's Helix job ID isn't visible (e.g., the job was canceled, or you need to find a matching job from a passing build), the IDs are recorded inside the build's `SendToHelix.binlog`:

1. Download the build artifact with `az`:
   ```powershell
   az pipelines runs artifact list --run-id $buildId --org "https://dev.azure.com/dnceng-public" -p public --query "[].name" -o tsv
   az pipelines runs artifact download --run-id $buildId --artifact-name "TestBuild_linux_x64" --path "$env:TEMP\artifact" --org "https://dev.azure.com/dnceng-public" -p public
   ```
2. Load the binlog and search for job IDs:
   ```
   mcp-binlog-tool-load_binlog path:"$env:TEMP\artifact\...\SendToHelix.binlog"
   mcp-binlog-tool-search_binlog binlog_file:"..." query:"Sent Helix Job"
   ```
3. Query each Helix job GUID with the CI script:
   ```powershell
   ./scripts/Get-CIStatus.ps1 -HelixJob "{GUID}" -FindBinlogs
   ```

**For Helix work item binlogs (the common case):**
The CI script shows binlog URLs directly when you query a specific work item:
```
./scripts/Get-CIStatus.ps1 -HelixJob "{JOB_ID}" -WorkItem "{WORK_ITEM}"
# Output includes: 🔬 msbuild.binlog: https://helix...blob.core.windows.net/...
```

### Step 2: Dispatch parallel subagents for extraction

Launch two `task` subagents (can run in parallel), each with a prompt like:

```
Download the msbuild.binlog from Helix job {JOB_ID} work item {WORK_ITEM}.
Use the CI skill script to get the artifact URL:
./scripts/Get-CIStatus.ps1 -HelixJob "{JOB_ID}" -WorkItem "{WORK_ITEM}"
Download the binlog URL to $env:TEMP\{label}.binlog.
Load it with the binlog MCP server (mcp-binlog-tool-load_binlog).
Search for the {TASK_NAME} task (mcp-binlog-tool-search_tasks_by_name).
Get full task details (mcp-binlog-tool-list_tasks_in_target) for the target containing the task.
Extract the CommandLineArguments parameter value.
Normalize paths:
- Replace Helix work dirs (/datadisks/disk1/work/XXXXXXXX) with {W}
- Replace runfile hashes (Program-[a-f0-9]+) with Program-{H}
- Replace temp dir names (dotnetSdkTests.[a-zA-Z0-9]+) with dotnetSdkTests.{T}
Parse into individual args using regex: (?:"[^"]+"|/[^\s]+|[^\s]+)
Sort the list and return it.
Report the total arg count prominently.
```

**Important:** When diffing, look for **extra or missing args** (different count), not value differences in existing args. A Debug/Release difference in `/define:` is expected noise — an extra `/analyzerconfig:` or `/reference:` arg is the real signal.

### Step 3: Diff the results

With two normalized arg lists, `Compare-Object` instantly reveals the difference.
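A minimal sketch of the diff step, assuming each subagent wrote its sorted, normalized arg list to a text file (file names illustrative):

```powershell
# Load the two normalized, sorted arg lists produced by the subagents
$passArgs = Get-Content "$env:TEMP\pass-args.txt"
$failArgs = Get-Content "$env:TEMP\fail-args.txt"

# SideIndicator '=>' marks args only in the failing run; '<=' only in the passing run
Compare-Object -ReferenceObject $passArgs -DifferenceObject $failArgs |
    Sort-Object SideIndicator |
    Format-Table InputObject, SideIndicator -AutoSize
```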

## Useful Binlog MCP Queries

After loading a binlog with `mcp-binlog-tool-load_binlog`, use these queries (pass the loaded path as `binlog_file`):

```
# Find all invocations of a specific task
mcp-binlog-tool-search_tasks_by_name binlog_file:"$env:TEMP\my.binlog" taskName:"Csc"

# Search for a property value
mcp-binlog-tool-search_binlog binlog_file:"..." query:"analysislevel"

# Find what happened inside a specific target
mcp-binlog-tool-search_binlog binlog_file:"..." query:"under($target AddGlobalAnalyzerConfigForPackage_MicrosoftCodeAnalysisNetAnalyzers)"

# Get all properties matching a pattern
mcp-binlog-tool-search_binlog binlog_file:"..." query:"GlobalAnalyzerConfig"

# List tasks in a target (returns full parameter details including CommandLineArguments)
mcp-binlog-tool-list_tasks_in_target binlog_file:"..." projectId:22 targetId:167
```

## Path Normalization

Helix work items run on different machines with different paths. Normalize before comparing:

| Pattern | Replacement | Description |
|---------|-------------|-------------|
| `/datadisks/disk1/work/[A-F0-9]{8}` | `{W}` | Helix work directory (Linux) |
| `C:\h\w\[A-F0-9]{8}` | `{W}` | Helix work directory (Windows) |
| `Program-[a-f0-9]{64}` | `Program-{H}` | Runfile content hash |
| `dotnetSdkTests\.[a-zA-Z0-9]+` | `dotnetSdkTests.{T}` | Temp test directory |
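The table above can be applied mechanically before splitting. A sketch (function name is hypothetical; regexes are from the table, and the parse regex is the one given in Step 2):

```powershell
function Get-NormalizedArgs([string]$commandLine) {
    # Apply the path normalizations from the table above
    $normalized = $commandLine `
        -replace '/datadisks/disk1/work/[A-F0-9]{8}', '{W}' `
        -replace 'C:\\h\\w\\[A-F0-9]{8}', '{W}' `
        -replace 'Program-[a-f0-9]{64}', 'Program-{H}' `
        -replace 'dotnetSdkTests\.[a-zA-Z0-9]+', 'dotnetSdkTests.{T}'

    # Split into individual args (quoted, /flag-style, or bare), then sort
    # so the two lists line up for Compare-Object
    [regex]::Matches($normalized, '(?:"[^"]+"|/[^\s]+|[^\s]+)') |
        ForEach-Object { $_.Value } |
        Sort-Object
}
```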

### After normalizing paths, focus on structural differences

> ⚠️ **Ignore value-only differences in existing args** (e.g., Debug vs Release in `/define:`, different hash paths). These are expected configuration differences. Focus on **extra or missing args** — a different arg count indicates a real build behavior change.

## Example: CscArguments Investigation

A merge PR (release/10.0.3xx → main) had 208 CSC args vs 207 on main. The diff:

```
FAIL-ONLY: /analyzerconfig:{W}/p/d/sdk/11.0.100-ci/Sdks/Microsoft.NET.Sdk/analyzers/build/config/analysislevel_11_default.globalconfig
```

### What the binlog properties showed

Both builds had identical property resolution:
- `EffectiveAnalysisLevel = 11.0`
- `_GlobalAnalyzerConfigFileName = analysislevel_11_default.globalconfig`
- `_GlobalAnalyzerConfigFile = .../config/analysislevel_11_default.globalconfig`

### The actual root cause

The `AddGlobalAnalyzerConfigForPackage` target has an `Exists()` condition:
```xml
<ItemGroup Condition="Exists('$(_GlobalAnalyzerConfigFile_...)')">
<EditorConfigFiles Include="$(_GlobalAnalyzerConfigFile_...)" />
</ItemGroup>
```

The merge's SDK layout **shipped** `analysislevel_11_default.globalconfig` on disk (from a newer roslyn-analyzers that flowed from 10.0.3xx), while main's SDK didn't have that file yet. Same property values, different files on disk = different build behavior.

### Lesson learned

Identical MSBuild property resolution does not guarantee identical build behavior — always check what's actually on disk in the SDK layout, not just what the targets compute.

## Anti-Patterns

> ❌ **Don't manually split/parse CSC command lines in the main conversation.** CSC args have quoted paths, spaces, and complex structure. Regex parsing in PowerShell is fragile and burns turns on trial-and-error. Use a subagent.

> ❌ **Don't assume the MSBuild property diff explains the behavior diff.** Two branches can compute identical property values but produce different outputs because of different files on disk, different NuGet packages, or different task assemblies. Compare the actual task invocation.

> ❌ **Don't load large binlogs and browse them interactively in main context.** Use targeted searches: `mcp-binlog-tool-search_tasks_by_name` for a specific task, `mcp-binlog-tool-search_binlog` with a focused query. Get in, get the data, get out.