feat: add ItemList structured data for comparison table pages #18477
alexleventer wants to merge 3 commits into master
Emits JSON-LD ItemList schema on /comparisons/ pages by parsing the feature comparison Markdown table, improving search-engine visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Docs Review
Scope: Reviewed all changes in this PR -- the new
Overall this is a well-structured addition that follows the existing collector patterns (matches
Issues
1. Bug -- Product/competitor label mismatch on 2 of 7 pages (
The title parsing assumes the order of names in the page title matches the order of columns in the table. This breaks on pages where the title order is reversed relative to the table columns. Affected pages:
For example, on the Terraform page, the
The other 5 pages with
Suggested fix: Extract column names from the table header row itself rather than parsing the page title. The header row (e.g.,
{{/* Extract column names from the table header row */}}
{{ $headerCells := split (index $allTableLines 0) "|" }}
{{ $productName := index $headerCells 2 | strings.TrimSpace }}
{{ $competitorName := index $headerCells 3 | strings.TrimSpace }}
This requires restructuring slightly (moving the header extraction before the name check), but it eliminates the title/column mismatch entirely and is more robust against future title changes.
2. Minor -- Trailing " IaC" in competitor name (
On the Terraform page (h1: "Terraform vs. Pulumi IaC"), the fallback title parser produces
Looks good
Mention me (@claude) if you'd like additional reviews or want me to help implement the fix.
Your site preview for commit 18795e9 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-18477-18795e94.s3-website.us-west-2.amazonaws.com
Lighthouse Performance Report
Commit: 18795e9
@claude - review again pls
Claude finished @alexleventer's task in 2m 36s
Docs Review
Re-reviewed the full diff (
Bug 1 (Critical, unchanged) -- Product/competitor labels are swapped on 2 of 7 pages
The title-parsing logic assigns
The root cause:
Suggested fix -- extract column names from the table header itself, which is always in the correct order:
{{/* Replace lines 7-18 with: */}}
{{ $allTableLines := findRE `(?m)^\|.+\|` $afterHeader }}
{{ if ge (len $allTableLines) 3 }}
{{ $headerCells := split (index $allTableLines 0) "|" }}
{{ $productName := index $headerCells 2 | strings.TrimSpace }}
{{ $competitorName := index $headerCells 3 | strings.TrimSpace }}
This ties the label directly to the column position and is immune to h1 wording.
Bug 2 (Moderate, new) -- Verbose competitor name on 2 more pages
For pages where the h1 reads "Pulumi vs. <Long Name>",
The data is attributed to the right product (columns aren't swapped), but the entity
Issue 3 (Minor, new) --
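The label swap and the header-row fix discussed in this review can be illustrated outside Hugo. This is a minimal Python sketch, not the template code from the PR: the function names are hypothetical, `names_from_title` is only an assumed stand-in for the title heuristic, and the exact spacing of the header row is an assumption based on the h1 and column order quoted in this thread for the Terraform page.

```python
import re


def names_from_title(title: str) -> tuple[str, str]:
    """Hypothetical stand-in for the title heuristic: split the h1 on 'vs'."""
    left, right = re.split(r"\s+vs\.?\s+", title, maxsplit=1)
    return left.strip(), right.strip()


def names_from_header(header_row: str) -> tuple[str, str]:
    """Read the product and competitor names from the table header row."""
    cells = header_row.split("|")
    # "| Feature | Pulumi | Terraform |" splits into
    # ["", " Feature ", " Pulumi ", " Terraform ", ""], so the two
    # value columns sit at indexes 2 and 3.
    return cells[2].strip(), cells[3].strip()


# Assumed sample inputs modeled on the Terraform page.
title = "Terraform vs. Pulumi IaC"
header = "| Feature | Pulumi | Terraform |"

print(names_from_title(title))    # ('Terraform', 'Pulumi IaC')
print(names_from_header(header))  # ('Pulumi', 'Terraform')
```

With the title-based names, the first table column (Pulumi's values) would be labeled "Terraform"; the header-based names follow the column order by construction.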
CamSoper
left a comment
Doing a human maintainer pass and landing on the same conclusions @claude did — including the column inversion, which I confirmed on the preview site. The Terraform page's JSON-LD currently reads:
"Terraform: Python, TypeScript, JavaScript, Go, C#, F#, Java, YAML | Pulumi IaC: HashiCorp Configuration Language (HCL)"
…which attributes Pulumi's languages to Terraform and HCL to Pulumi. Definitely want that fixed before this ships — structured data that's confidently wrong is worse than no structured data at all.
The header-row extraction approach @claude suggested resolves both the column swap and the verbose-name issue on aws-cdk / cdktf in one shot. Worth also picking up the hasPart wiring from WebPage to the ItemList while you're in there.
Ping me when it's ready for another look.
Claude finished @CamSoper's task in 2m 49s
Fixes implemented
What changed
{{/* Extract column names from the table header row (always in correct column order) */}}
{{ $headerCells := split (index $allTableLines 0) "|" }}
{{ $productName := index $headerCells 2 | strings.TrimSpace }}
{{ $competitorName := index $headerCells 3 | strings.TrimSpace }}
Column names now come from the actual table (
The two top-level guards (
{{ $comparisonTable = partial "schema/collectors/comparison-table-entity.html" . }}
{{ if and $comparisonTable (ne $comparisonTable (dict)) }}
{{ $webpage = merge $webpage (dict "hasPart" (dict "@id" "#comparison-table")) }}
{{ end }}
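The hasPart wiring above can be sketched as plain data manipulation. This Python version is only an approximation of the template logic under stated assumptions: the helper name is hypothetical, and the node fields other than the "#comparison-table" id and the hasPart key are illustrative placeholders rather than the site's actual schema output.

```python
import json


def link_comparison_table(webpage: dict, comparison_table: dict) -> dict:
    """Return the WebPage node, adding hasPart only when an ItemList exists."""
    if not comparison_table:
        return webpage
    # Merge into a new dict so the caller's node is left untouched.
    return {**webpage, "hasPart": {"@id": "#comparison-table"}}


# Illustrative nodes; field values here are assumptions.
webpage = {"@type": "WebPage", "name": "Terraform vs. Pulumi IaC"}
item_list = {"@type": "ItemList", "@id": "#comparison-table", "numberOfItems": 21}

graph = [link_comparison_table(webpage, item_list), item_list]
print(json.dumps({"@context": "https://schema.org", "@graph": graph}, indent=2))
```

Running the collector before the WebPage node is appended matters because the link can only be added once it is known whether an ItemList was produced for the page.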
- Replace title-parsing heuristic in comparison-table-entity.html with
header-row extraction: split the first table row on "|" and read cells
2 and 3 for productName/competitorName. This fixes the label swap on
terraform (h1 "Terraform vs. Pulumi IaC", cols "Pulumi | Terraform")
and crossplane (h1 "Crossplane vs Pulumi", cols "Pulumi | Crossplane"),
and also produces shorter canonical names for aws-cdk ("AWS CDK") and
cdktf ("CDKTF") instead of the verbose h1 forms.
- Restructure graph-builder.html so the comparison table collector runs
before $webpage is appended to $graph, allowing a hasPart link to be
wired into the WebPage node when an ItemList entity is present.
Co-authored-by: Cam Soper <CamSoper@users.noreply.github.com>
CamSoper
left a comment
Sorry, @alexleventer! It seems I accidentally triggered Claude and he just came along and fixed it. Feel free to revert that commit if you need to!
CamSoper
left a comment
Cool idea — verified the single-table pages render clean ItemLists (opentofu, terraform, terraform/opentofu, aws-cdk, cdktf, cloudformation all match their row counts). Two bugs to sort out before this ships though, both isolated to the new collector.
1. Multi-table pollution on the Crossplane page
findRE in comparison-table-entity.html line 11 captures every pipe-prefixed line in the document, not just rows of the Feature table. Crossplane has three more tables after the Feature one, so their header and data rows get pulled into the ItemList.
I built the site locally and looked at the generated JSON-LD for /docs/iac/comparisons/crossplane/:
- numberOfItems is 22, should be 5
- Position 6 is named "Aspect" with description "Pulumi: **Pulumi** | Crossplane: **Crossplane**" (that's the second table's header row being treated as a data row)
- Position 18 is "When to Choose Pulumi" -- another table header
- Positions 19-22 are rows from the "When to Choose" table, not features
Fix -- truncate to the first table block before extracting lines:
{{ $afterHeader := replaceRE `(?s)^.*?\| Feature \|` "| Feature |" .RawContent }}
{{ $firstTable := index (split $afterHeader "\n\n") 0 }}
{{ $allTableLines := findRE `(?m)^\|.+\|` $firstTable }}
2. Bold markdown leaks into ItemList names/descriptions
The existing replaceRE calls strip HTML tags and markdown links, but not **bold** / *italic* emphasis. Crossplane uses bold in its feature column, so the ItemList ends up with names like "**Multi-language Support**" verbatim -- the asterisks leak into the schema payload.
Fix -- add emphasis stripping to $feature, $val1, $val2 around lines 32-39:
{{ $feature = replaceRE `\*+([^*]+)\*+` "$1" $feature }}
{{ $val1 = replaceRE `\*+([^*]+)\*+` "$1" $val1 }}
{{ $val2 = replaceRE `\*+([^*]+)\*+` "$1" $val2 }}
Why this matters
On Crossplane specifically, the structured data is worse than having none: it feeds search engines and AI answer engines a 22-item "feature list" where several items are literally table-header text. Rich Results Test is likely to flag it, and LLM summaries of the page could surface the junk items as real features.
Everything else looks good. The header-row extraction fix in fd506f2 is clean, and the hasPart wiring in graph-builder.html resolves correctly in the emitted @graph.
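Both fixes from this review (truncating to the first table block and stripping emphasis markers) can be checked together in a small Python sketch. It mirrors the regular expressions quoted above but is not the Hugo partial itself; the function names are hypothetical, and the sample page content, including every cell value, is made up for illustration.

```python
import re


def first_table_rows(raw: str) -> list[str]:
    """Return the pipe-delimited lines of the first '| Feature |' table."""
    start = raw.find("| Feature |")
    if start == -1:
        return []
    # A blank line ends a Markdown table, so keep only the first block.
    first_block = raw[start:].split("\n\n")[0]
    return re.findall(r"(?m)^\|.+\|", first_block)


def strip_emphasis(cell: str) -> str:
    """Drop *italic* / **bold** markers, keeping the inner text."""
    return re.sub(r"\*+([^*]+)\*+", r"\1", cell).strip()


def item_list(raw: str) -> dict:
    rows = first_table_rows(raw)
    if len(rows) < 3:  # header, separator, and at least one data row
        return {}
    header = [c.strip() for c in rows[0].split("|")]
    product, competitor = header[2], header[3]
    items = []
    for position, row in enumerate(rows[2:], start=1):
        cells = [strip_emphasis(c) for c in row.split("|")]
        items.append({
            "@type": "ListItem",
            "position": position,
            "name": cells[1],
            "description": f"{product}: {cells[2]} | {competitor}: {cells[3]}",
        })
    return {"@type": "ItemList", "numberOfItems": len(items), "itemListElement": items}


# Made-up sample with a second table that must not leak into the ItemList.
sample = """# Crossplane vs Pulumi

| Feature | Pulumi | Crossplane |
|---------|--------|------------|
| **Multi-language Support** | Yes | YAML only |
| State | Managed service | Kubernetes etcd |

| Aspect | **Pulumi** | **Crossplane** |
|--------|------------|----------------|
| Scope | Any cloud | Kubernetes |
"""

result = item_list(sample)
print(result["numberOfItems"])               # 2
print(result["itemListElement"][0]["name"])  # Multi-language Support
```

Splitting on the first blank line relies on the table being followed by an empty line, which is how Markdown tables are normally terminated; a page that runs two tables together without one would need a different boundary check.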
…asis
- Truncate RawContent to the first table block before extracting rows so
  subsequent tables on the page (e.g., Crossplane's "When to Choose") stop
  leaking into the ItemList.
- Strip **bold** / *italic* from feature/value cells so the JSON-LD
  doesn't surface literal asterisks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the careful review, @CamSoper. Both fixes are applied in 18795e9:
Also re-checked the single-table pages (terraform: 21 items, opentofu: 19): row counts are unchanged, so the truncation doesn't affect pages with only one Feature table.
Summary
- comparison-table-entity.html) that parses Markdown feature-comparison tables on /comparisons/ pages and emits JSON-LD ItemList structured data
- graph-builder.html so it's included in the page's schema graph
Test plan
- make build succeeds
- ItemList JSON-LD block appears in the page source
🤖 Generated with Claude Code