Merged (37 commits)
- 369e78b Add scan command for multi-repo discovery and consolidated reporting (lex0c, Apr 20, 2026)
- 02100f5 Detect bare repositories during scan discovery (lex0c, Apr 20, 2026)
- 16742da Strip repo prefix from suspect-detector suggestions (lex0c, Apr 20, 2026)
- 61f8e65 Honor negated ignore rules before pruning a directory (lex0c, Apr 20, 2026)
- 5950a26 Keep per-repo commits when SHAs overlap across repositories (lex0c, Apr 20, 2026)
- 9a6f5f6 Normalize repo prefix before suspect-pattern matching (lex0c, Apr 20, 2026)
- 98efb10 Use numeric value for repo breakdown bar widths (lex0c, Apr 20, 2026)
- 6cfe86d Skip ignored repos before recording discovery hits (lex0c, Apr 20, 2026)
- 26ac75e Account for globbed negations in reinclude checks (lex0c, Apr 20, 2026)
- e85768e Guarantee unique slugs after hashing duplicate basenames (lex0c, Apr 20, 2026)
- c4958bb Propagate context cancellation into Discover walk (lex0c, Apr 20, 2026)
- 6153c59 Harden tests for scan feature gaps (lex0c, Apr 20, 2026)
- 0947705 Document scan command and per-repo breakdown metric (lex0c, Apr 20, 2026)
- 00b7321 Validate --since before launching scan (lex0c, Apr 20, 2026)
- 8142695 Resolve symlink roots before walking discovery tree (lex0c, Apr 20, 2026)
- 5bd94ab Normalize slug uniqueness for case-insensitive filesystems (lex0c, Apr 20, 2026)
- 89df586 Fail scan when every repo's extract failed (lex0c, Apr 20, 2026)
- 66f4516 Reject missing explicit ignore file paths (lex0c, Apr 20, 2026)
- f62cdc5 Recover from worker panics in the scan pool (lex0c, Apr 20, 2026)
- ecce5ee Scope Per-Repository Breakdown to profile reports only (lex0c, Apr 20, 2026)
- a0a6251 Add --report-dir for per-repo HTML reports plus an index landing page (lex0c, Apr 20, 2026)
- fefced5 Fix report-dir pending accounting, downgrade render errors, reserve i… (lex0c, Apr 20, 2026)
- f2810b5 Align scan-index footer with the per-repo report footer (lex0c, Apr 20, 2026)
- c0bfccb Simplify scan-index heading and drop roots subtitle (lex0c, Apr 20, 2026)
- 3494e0e Trim redundant chrome from scan-index cards (lex0c, Apr 20, 2026)
- 49cd00d Drop green left-border accent from ok scan-index cards (lex0c, Apr 20, 2026)
- 2ae7961 Surface last-commit recency on scan index cards (lex0c, Apr 20, 2026)
- d59338e Address 7 minors from the recency-batch review (lex0c, Apr 20, 2026)
- 87d3e8b Drop scan-index H1; promote summary cards to page anchor (lex0c, Apr 20, 2026)
- 60b4c37 Unify scan-index summary-card CSS with the team/profile templates (lex0c, Apr 20, 2026)
- 7666e57 Order scan index as a triage view, not alphabetical (lex0c, Apr 20, 2026)
- 549775f Validate --email requires --report before launching scan (lex0c, Apr 20, 2026)
- 049cdb5 Guard future dates before truncating day differences (lex0c, Apr 20, 2026)
- 302d13f Validate .git entry content before treating a dir as a repo (lex0c, Apr 20, 2026)
- 9d61c47 Normalize skipped statuses before rendering scan index (lex0c, Apr 21, 2026)
- 13749d3 Use an aggregated label for multi-root profile reports (lex0c, Apr 21, 2026)
- 247cd16 Raise scan parallel default and skip the sort in dev-email aggregation (lex0c, Apr 21, 2026)
70 changes: 69 additions & 1 deletion README.md
@@ -401,6 +401,67 @@ gitcortex stats --input auth.jsonl --input payments.jsonl --stat coupling --top

Paths appear as `auth:src/main.go` and `payments:src/main.go`. Contributors are deduped by email across repos — the same developer contributing to both repos is counted once.

For workspaces containing many repos (an engineer's `~/work`, a platform team's service folder), `gitcortex scan` discovers every `.git` under one or more roots and extracts them in parallel — see below.

### Scan: discover and aggregate every repo under a root

Walk one or more directories, find every git repository (both working trees and bare clones are detected), extract them in parallel, and optionally generate a consolidated HTML report. The main use case is "show my manager every project I've touched in the last year" or "give a platform team one dashboard across N services" without scripting the multi-extract + multi-input pipeline manually.

```bash
# Discover and extract every repo under ~/work, one JSONL per repo
gitcortex scan --root ~/work --output ./scan-out

# Consolidated report across all discovered repos
gitcortex scan --root ~/work --output ./scan-out --report ./all.html

# Personal cross-repo profile: only MY commits, with per-repo breakdown
gitcortex scan --root ~/work --output ./scan-out \
--report ./me.html --email me@company.com --since 1y \
--include-commit-messages

# Multiple roots, higher parallelism, capped walk depth
gitcortex scan --root ~/work --root ~/personal --root ~/oss \
--parallel 8 --max-depth 4 \
--output ./scan-out --report ./all.html
```

The scan output directory holds:

| file | purpose |
|---|---|
| `<slug>.jsonl` | per-repo JSONL, one per discovered repo |
| `<slug>.state` | resume checkpoint (safe to re-run scan to continue) |
| `manifest.json` | discovery results, per-repo status (ok/failed/pending), timing |
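
A manifest might look roughly like this (the field names here are illustrative, not a schema guarantee; inspect your own `manifest.json` for the exact shape):

```json
{
  "roots": ["/home/me/work"],
  "repos": [
    {"path": "/home/me/work/api", "slug": "api", "bare": false, "status": "ok"},
    {"path": "/home/me/work/legacy", "slug": "legacy", "bare": true, "status": "failed"}
  ]
}
```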

Each repo's slug is derived from its directory basename; colliding basenames get a short SHA-1 suffix (the suffix lengthens automatically on the rare truncation collision, so `<slug>.state` is stable across runs).
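
A minimal sketch of that scheme (illustrative only; the real implementation lives in `internal/scan/discovery.go` and may differ in details):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// slugFor is a hypothetical sketch: use the basename when it is free,
// otherwise append a short SHA-1 of the full path, lengthening the
// suffix until the candidate is unique.
func slugFor(repoPath string, taken map[string]bool) string {
	base := filepath.Base(repoPath)
	if !taken[base] {
		taken[base] = true
		return base
	}
	sum := sha1.Sum([]byte(repoPath))
	h := hex.EncodeToString(sum[:])
	for n := 7; n <= len(h); n++ {
		cand := base + "-" + h[:n]
		if !taken[cand] {
			taken[cand] = true
			return cand
		}
	}
	// full 40-hex SHA-1 of distinct paths will not collide in practice
	return base + "-" + h
}

func main() {
	taken := map[string]bool{}
	fmt.Println(slugFor("/work/api", taken))     // api
	fmt.Println(slugFor("/personal/api", taken)) // api-<7+ hex chars>
}
```

Because the suffix is keyed to the repo's path rather than discovery order, re-running the scan maps each repo back to the same `<slug>.state` file.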

**Filtering discovery with `.gitcortex-ignore`.** Create a gitignore-style file at the scan root:

```
# skip heavy clones we don't want in the report
node_modules
chromium.git
linux.git

# skip vendored repos except the one we own
vendor/
!vendor/in-house-fork
```

Directory rules, globs, `**/foo`, and `!path` negations all work. Globbed negations like `!vendor*/keep` are honored — discovery descends into any dir where a negation rule could match a descendant. If `--ignore-file` is not set, scan looks for `.gitcortex-ignore` in the first `--root`.
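
The "descend if a negation could match a descendant" rule can be pictured like this (a deliberately simplified stand-in for the real matcher in `internal/scan/ignore.go`):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// couldReincludeUnder reports whether any negation rule might match a
// path below dir, in which case discovery must descend into dir even
// though an ignore rule matched it. Simplified: handles literal
// prefixes like "!a/b" and a leading glob segment like "!vendor*/keep".
func couldReincludeUnder(dir string, negations []string) bool {
	dir = strings.Trim(dir, "/")
	for _, n := range negations {
		n = strings.Trim(strings.TrimPrefix(n, "!"), "/")
		if strings.HasPrefix(n, dir+"/") {
			return true // e.g. !vendor/in-house-fork re-includes under vendor/
		}
		first := strings.SplitN(n, "/", 2)[0]
		if ok, _ := path.Match(first, dir); ok && strings.Contains(n, "/") {
			return true // e.g. !vendor*/keep could match under vendor-x/
		}
	}
	return false
}

func main() {
	neg := []string{"!vendor/in-house-fork"}
	fmt.Println(couldReincludeUnder("vendor", neg))       // true
	fmt.Println(couldReincludeUnder("node_modules", neg)) // false
}
```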

**Consolidated report extras.** When a scan produces more than one repo's data, both the team report and the `--email` profile report render a *Per-Repository Breakdown* section: commits, churn, files, active days, and the share-of-total for each repo. On an `--email` profile the counts are filtered to that developer's contributions (the files count reflects only files the dev touched).

**Flags worth knowing:**

- `--parallel N` — repos extracted concurrently (default 4). Git is I/O-bound, so values past NumCPU give diminishing returns.
- `--max-depth N` — stop descending past N levels. Useful when a root contains a monorepo with deeply nested internal repos you don't want enumerated.
- `--extract-ignore <glob>` (repeatable) — forwarded to each per-repo `extract --ignore`, e.g. `--extract-ignore 'package-lock.json' --extract-ignore 'dist/*'`.
- `--from / --to / --since` — time window applied to the consolidated report (same semantics as `report`).
- `--churn-half-life`, `--coupling-max-files`, `--coupling-min-changes`, `--network-min-files` — pass tuning to the consolidated report identical to `gitcortex report`.

Partial failures are non-fatal: the manifest records which repos failed, and the report is built from whichever JSONLs completed. `Ctrl+C` aborts both the discovery walk and any in-flight extracts; re-running picks up from each repo's state file.

### Diff: compare time periods

Compare stats between two time periods, or filter to a single period.
@@ -444,6 +505,8 @@ gitcortex report --input data.jsonl --email alice@company.com --output alice.htm

Includes: summary cards, activity heatmap (with table toggle), top contributors, file hotspots, churn risk (with full-dataset label distribution strip above the truncated table), bus factor, file coupling, working patterns heatmap, top commits, developer network, and developer profiles. A collapsible glossary at the top defines the terms (bus factor, churn, legacy-hotspot, specialization, etc.) for readers who are not already familiar. Typical size: 50-500KB depending on number of contributors.

When the input is multi-repo (from `gitcortex scan` or multiple `--input` files), both the team report and `--email` profile also render a *Per-Repository Breakdown* with commit/churn/files/active-days per repo and each repo's share of the total.

> The HTML activity heatmap is always monthly (year × 12 months grid). For day/week/year buckets, use `gitcortex stats --stat activity --granularity <unit>`.

### CI: quality gates for pipelines
@@ -483,9 +546,14 @@ internal/
parse.go Shared types (RawEntry, NumstatEntry)
discard.go Malformed entry tracking
extract/extract.go Extraction orchestration, state, JSONL writing
scan/
scan.go Multi-repo orchestration (worker pool over extract)
discovery.go Directory walk, bare-repo detection, slug uniqueness
ignore.go Gitignore-style matcher with negation support
stats/
reader.go Streaming JSONL aggregator (single-pass, multi-JSONL)
stats.go Stat computations (9 stats)
repo_breakdown.go Per-repository aggregate (scan consolidated report)
format.go Table/CSV/JSON output formatting
```

166 changes: 166 additions & 0 deletions cmd/gitcortex/main.go
@@ -16,6 +16,7 @@ import (
"github.com/lex0c/gitcortex/internal/extract"
"github.com/lex0c/gitcortex/internal/git"
reportpkg "github.com/lex0c/gitcortex/internal/report"
"github.com/lex0c/gitcortex/internal/scan"
"github.com/lex0c/gitcortex/internal/stats"

"github.com/spf13/cobra"
@@ -35,6 +36,7 @@ func main() {
rootCmd.AddCommand(diffCmd())
rootCmd.AddCommand(ciCmd())
rootCmd.AddCommand(reportCmd())
rootCmd.AddCommand(scanCmd())

if err := rootCmd.Execute(); err != nil {
os.Exit(1)
@@ -869,3 +871,167 @@ func reportCmd() *cobra.Command {

return cmd
}

// --- Scan ---

func scanCmd() *cobra.Command {
var (
roots []string
output string
ignoreFile string
maxDepth int
parallel int
email string
from string
to string
since string
reportPath string
topN int
extractIgnore []string
batchSize int
mailmap bool
firstParent bool
includeMessages bool
couplingMaxFiles int
couplingMinChanges int
churnHalfLife int
networkMinFiles int
)

cmd := &cobra.Command{
Use: "scan",
Short: "Discover git repositories under one or more roots and consolidate their history",
Long: `Walk the given root(s), find every git repository, and run extract on each
repository in parallel. Outputs one JSONL per repo plus a manifest in --output.
Optionally generates a consolidated HTML report including a per-repository
breakdown — handy for showing aggregated work across many repos.`,
RunE: func(cmd *cobra.Command, args []string) error {
if len(roots) == 0 {
return fmt.Errorf("--root is required (repeatable for multiple roots)")
}
if since != "" && (from != "" || to != "") {
return fmt.Errorf("--since cannot be combined with --from/--to")
}
if err := validateDate(from, "--from"); err != nil {
return err
}
if err := validateDate(to, "--to"); err != nil {
return err
}
if from != "" && to != "" && from > to {
return fmt.Errorf("--from (%s) must be on or before --to (%s)", from, to)
}
// Resolve --since up-front. If this fails we'd otherwise
// discover the typo only after a full multi-repo scan —
// minutes to hours on a large workspace, all thrown away
// because the user mistyped `--since 1yy`. Validate early,
// fail fast, keep the result for the report stage.
fromDate := from
if since != "" {
d, err := parseSince(since)
if err != nil {
return err
}
fromDate = d
}

cfg := scan.Config{
Roots: roots,
Output: output,
IgnoreFile: ignoreFile,
MaxDepth: maxDepth,
Parallel: parallel,
Extract: extract.Config{
BatchSize: batchSize,
IncludeMessages: includeMessages,
CommandTimeout: extract.DefaultCommandTimeout,
FirstParent: firstParent,
Mailmap: mailmap,
IgnorePatterns: extractIgnore,
StartOffset: -1,
},
}

ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

result, err := scan.Run(ctx, cfg)
// scan.Run returns a partial result alongside ctx.Err() on
// cancellation. Honor that — write whatever progress we made
// to disk and surface the error so the CLI exits non-zero.
if err != nil {
return err
}

if reportPath == "" {
fmt.Fprintf(os.Stderr, "Scan complete: %d JSONL file(s) in %s\n", len(result.JSONLs), result.OutputDir)
return nil
}
if len(result.JSONLs) == 0 {
return fmt.Errorf("no successful repos extracted; cannot build report")
}

ds, err := stats.LoadMultiJSONL(result.JSONLs, stats.LoadOptions{
From: fromDate,
To: to,
HalfLifeDays: churnHalfLife,
CoupMaxFiles: couplingMaxFiles,
})
if err != nil {
return fmt.Errorf("load consolidated dataset: %w", err)
}
fmt.Fprintf(os.Stderr, "Loaded %d commits across %d repo(s)\n", ds.CommitCount, len(result.JSONLs))

f, err := os.Create(reportPath)
if err != nil {
return fmt.Errorf("create report: %w", err)
}
defer f.Close()

// Label the report after the basename of the first --root
// (or the output dir as a fallback). "scan-scan-output" was
// the previous default; users find the root name far more
// recognizable as the H1 of the report.
repoLabel := filepath.Base(result.OutputDir)
if len(cfg.Roots) > 0 {
repoLabel = filepath.Base(absPath(cfg.Roots[0]))
}
sf := stats.StatsFlags{CouplingMinChanges: couplingMinChanges, NetworkMinFiles: networkMinFiles}
if email != "" {
if err := reportpkg.GenerateProfile(f, ds, repoLabel, email); err != nil {
return fmt.Errorf("generate profile report: %w", err)
}
fmt.Fprintf(os.Stderr, "Profile report for %s written to %s\n", email, fileURL(reportPath))
return nil
}
if err := reportpkg.Generate(f, ds, repoLabel, topN, sf); err != nil {
return fmt.Errorf("generate report: %w", err)
}
fmt.Fprintf(os.Stderr, "Consolidated report written to %s\n", fileURL(reportPath))
return nil
},
}

cmd.Flags().StringSliceVar(&roots, "root", nil, "Root directory to walk for repositories (repeatable)")
cmd.Flags().StringVar(&output, "output", "scan-output", "Directory to write per-repo JSONL files and the manifest")
cmd.Flags().StringVar(&ignoreFile, "ignore-file", "", "Gitignore-style file with directories to skip during discovery. When unset, only the first --root is searched for a .gitcortex-ignore; pass an explicit path to apply rules across all roots.")
cmd.Flags().IntVar(&maxDepth, "max-depth", 0, "Maximum directory depth to descend into when looking for repos (0 = unlimited)")
cmd.Flags().IntVar(&parallel, "parallel", 4, "Number of repositories to extract in parallel")
cmd.Flags().StringVar(&email, "email", "", "Generate a per-developer profile report (only when --report is set)")
cmd.Flags().StringVar(&from, "from", "", "Window start date YYYY-MM-DD (forwarded to the consolidated report)")
cmd.Flags().StringVar(&to, "to", "", "Window end date YYYY-MM-DD (forwarded to the consolidated report)")
cmd.Flags().StringVar(&since, "since", "", "Filter to recent period (e.g. 7d, 4w, 3m, 1y); mutually exclusive with --from/--to")
cmd.Flags().StringVar(&reportPath, "report", "", "If set, generate a consolidated HTML report at this path after the scan")
cmd.Flags().IntVar(&topN, "top", 20, "Top-N entries per section in the consolidated report")
cmd.Flags().StringSliceVar(&extractIgnore, "extract-ignore", nil, "Glob patterns forwarded to per-repo extract --ignore (e.g. package-lock.json)")
cmd.Flags().IntVar(&batchSize, "batch-size", 1000, "Per-repo extract checkpoint interval")
cmd.Flags().BoolVar(&mailmap, "mailmap", false, "Use .mailmap (per repo) to normalize identities")
cmd.Flags().BoolVar(&firstParent, "first-parent", false, "Restrict extracts to the first-parent chain")
cmd.Flags().BoolVar(&includeMessages, "include-commit-messages", false, "Include commit messages in JSONL (needed for Top Commits in the consolidated report)")
cmd.Flags().IntVar(&couplingMaxFiles, "coupling-max-files", 50, "Max files per commit for coupling analysis (consolidated report)")
cmd.Flags().IntVar(&couplingMinChanges, "coupling-min-changes", 5, "Min co-changes for coupling results (consolidated report)")
cmd.Flags().IntVar(&churnHalfLife, "churn-half-life", 90, "Half-life in days for churn decay (consolidated report)")
cmd.Flags().IntVar(&networkMinFiles, "network-min-files", 5, "Min shared files for dev-network edges (consolidated report)")

return cmd
}
40 changes: 40 additions & 0 deletions cmd/gitcortex/main_test.go
@@ -1,10 +1,50 @@
package main

import (
"bytes"
"strings"
"testing"
)

// scanCmd must validate --since BEFORE running the discovery walk
// and extract pool. Without the early check, an obvious typo like
// `--since 1yy` only surfaces after scan.Run has already walked
// every root and extracted every repo found — which can take
// minutes-to-hours on a large workspace and waste the work.
//
// Use an empty TempDir so scan.Run would fail fast with "no git
// repositories found" if we reached it — that error does not mention
// "since", so an error containing "since" proves the early
// validation fired first.
func TestScanCmd_ValidatesSinceBeforeScanning(t *testing.T) {
cmd := scanCmd()
cmd.SilenceUsage = true
cmd.SilenceErrors = true
cmd.SetArgs([]string{
"--root", t.TempDir(),
"--output", t.TempDir(),
"--since", "bogus",
})
// Swallow any stderr output cobra might emit so we don't pollute
// go-test logs on success.
cmd.SetOut(&bytes.Buffer{})
cmd.SetErr(&bytes.Buffer{})

err := cmd.Execute()
if err == nil {
t.Fatal("expected error for --since bogus, got nil")
}
if !strings.Contains(err.Error(), "since") {
t.Errorf("expected error to mention --since (proving early validation), got %q", err)
}
// Reaching scan.Run on an empty TempDir would produce the
// discovery error; asserting its absence confirms the walk was
// not started.
if strings.Contains(err.Error(), "no git repositories found") {
t.Errorf("scan.Run was reached before --since was validated — discovery ran on an invalid-flag input: %q", err)
}
}

func TestValidateDate(t *testing.T) {
cases := []struct {
in string
23 changes: 23 additions & 0 deletions docs/METRICS.md
@@ -294,6 +294,27 @@ A `tree(1)`-style view of the repository's directory layout, built from paths se

**When to use**: before drilling into hotspots or churn-risk, skim the structure to locate the modules those files live in. The tree is navigational context; ranked tables are where judgment happens.

## Per-Repository Breakdown

Cross-repo aggregation that appears only when the dataset was loaded from more than one JSONL (typically via `gitcortex scan` or multiple `--input` files). One row per repo:

| column | meaning |
|---|---|
| `Commits` | author-dated commit count in this repo |
| `% Commits` | share of total commits in the dataset |
| `Churn` | additions + deletions attributed to this repo |
| `% Churn` | share of total churn |
| `Files` | distinct files touched in this repo; when filtered by `--email`, counts only files that developer touched |
| `Active days` | distinct UTC author-dates |
| `Devs` | unique author emails in this repo |
| `First → Last` | earliest and latest author-date |

**How the repo label is derived**: `LoadMultiJSONL` prefixes every path in the dataset with `<filename-stem>:` (so `WordPress.git.jsonl` contributes paths like `WordPress.git:wp-includes/foo.php`). The breakdown groups by that prefix. If only a single JSONL is loaded, no prefix is emitted and the breakdown collapses to a single `(repo)` row, which the HTML report hides.
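
The grouping can be pictured as (an illustrative helper, not the library's actual API):

```go
package main

import (
	"fmt"
	"strings"
)

// repoOf splits a dataset path into its repo label and repo-relative path.
// Paths from a single-JSONL load carry no prefix and fall into "(repo)".
func repoOf(p string) (repo, rel string) {
	if i := strings.Index(p, ":"); i >= 0 {
		return p[:i], p[i+1:]
	}
	return "(repo)", p
}

func main() {
	fmt.Println(repoOf("WordPress.git:wp-includes/foo.php")) // WordPress.git wp-includes/foo.php
	fmt.Println(repoOf("src/main.go"))                       // (repo) src/main.go
}
```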

**Divergence between `% Commits` and `% Churn` is informative**: a repo dominating churn while holding modest commit share often signals large-content work (docs, data), while the reverse points to small-diff high-frequency repos (config, manifests).

**SHA collisions across repos** (forks, mirrors, cherry-picks between sibling projects) are preserved here — the breakdown tracks commits per repo via a dedicated slice populated at ingest, not the SHA-keyed commit map. Other file-level metrics (bus factor, coupling, dev network) still key by SHA and will collapse collided commits onto one record; if exact attribution matters for those, scan and aggregate the sibling repos separately.
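
The map-vs-slice distinction is easy to see in miniature (a sketch under an assumed ingest shape, not the real reader):

```go
package main

import "fmt"

type commitRow struct{ sha, repo string }

// countBoth contrasts the two ingest strategies: bySHA collapses a commit
// that appears in both a fork and its upstream onto one record, while
// perRepo counts every row under its own repo label.
func countBoth(rows []commitRow) (bySHA int, perRepo map[string]int) {
	seen := map[string]bool{}
	perRepo = map[string]int{}
	for _, c := range rows {
		if !seen[c.sha] {
			seen[c.sha] = true
			bySHA++
		}
		perRepo[c.repo]++
	}
	return
}

func main() {
	rows := []commitRow{{"abc123", "upstream"}, {"abc123", "fork"}}
	n, per := countBoth(rows)
	fmt.Println(n, per["upstream"], per["fork"]) // 1 1 1
}
```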

## Data Flow

```
@@ -364,6 +385,8 @@ Directory-segment heuristics (`vendor`, `node_modules`, `dist`, `build`, `third_

The warning is advisory. Nothing is auto-filtered; the user decides whether to re-extract. Matches do not affect computed stats in that run. JSON/CSV output paths skip the warning since they're typically piped.

On multi-JSONL loads (e.g. `stats --input a.jsonl --input b.jsonl` or a `gitcortex scan` dataset), paths carry a `<repo>:` prefix internally. The suspect detector strips that prefix before matching and suggesting, so root-level `vendor/`, `package-lock.json`, `go.sum`, etc. in any individual repo are detected and the emitted `--ignore` globs are repo-relative (drop-in for `extract --ignore` and `scan --extract-ignore`). Same-shape findings across repos collapse to one suggestion (`dist/*` applies everywhere rather than being listed once per repo).

Statistical heuristics (very high churn-per-commit, single-author bulk updates) are deliberately out of scope — their false-positive rate on hand-authored code is higher than the path-based list and we'd rather stay quiet than cry wolf.

### `--mailmap` off by default