fix(translate): prevent content loss in long-form translation by luandro · Pull Request #169 · digidem/comapeo-docs

luandro · 2026-03-19T20:00:40Z

Problem

Closes #166

Long-form Notion pages (troubleshooting, create/edit observation, view/edit track, understanding exchange, etc.) were silently dropping sections during automatic translation. The root cause: feeding very large chunks (up to 500 K chars) to the model saturated its effective attention window, causing it to omit headings and paragraphs without raising any error.

Solution

Two complementary mechanisms:

1. Proactive aggressive chunking

Lowered TRANSLATION_CHUNK_MAX_CHARS from 500,000 → 120,000 chars. Each translation request now stays well within the model's reliable attention range, reducing the chance of content being dropped in the first place.

2. Structural completeness validation + retry

After every translation call, the result is compared against the source using structural markers:

Heading count (any heading loss triggers a retry)
Fenced code blocks
Bullet / numbered lists (threshold: ≥ 3 items)
Table lines
Severe length shrinkage (< 55% of source length when source ≥ 4,000 chars)

If incompleteness is detected, the chunk limit is halved and the translation is retried (up to TRANSLATION_COMPLETENESS_MAX_RETRIES = 4 times, with a floor of TRANSLATION_MIN_CHUNK_MAX_CHARS = 8,000 chars). After exhausting retries a non-critical error is surfaced so the pipeline can log and continue.

Test plan

New: chunks long-form content proactively below model-derived maximums
New: retries with smaller chunks when a valid response omits a section
New: fails (non-critically) when repeated completeness retries still return incomplete content
New: treats heavy structural shrinkage as incomplete translation
New: preserves complete heading structures when chunking by sections
Existing: token overflow errors still classified as non-critical token_overflow
Existing: single-call fast path for small content
Existing: chunks large content and calls API once per chunk
All tests pass (bunx vitest run scripts/notion-translate/translateFrontMatter.test.ts)

Long-form Notion pages (troubleshooting, create/edit observation, etc.) were silently dropping sections during automatic translation. The likely cause: feeding very large chunks to the model saturated its effective attention window, causing it to omit headings and paragraphs without raising an error. Changes: - Lower proactive chunk ceiling from 500 K → 120 K chars so each translation request stays well within reliable model attention range - Add structural completeness validation after every translation call: checks heading count, fenced code blocks, bullet/numbered lists, table lines, and severe length shrinkage (< 55 % of source) - Retry with progressively smaller chunks (halved each attempt, floor 8 K chars) up to TRANSLATION_COMPLETENESS_MAX_RETRIES (2) times when incompleteness is detected, then surface a non-critical error Closes #166

github-actions · 2026-03-19T20:02:05Z

🐳 Docker Image Published

Your Docker image has been built and pushed for this PR.

Image Reference: docker.io/communityfirst/comapeo-docs-api:pr-169

Platforms: linux/amd64, linux/arm64

Testing

To test this image:

docker pull docker.io/communityfirst/comapeo-docs-api:pr-169
docker run -p 3001:3001 docker.io/communityfirst/comapeo-docs-api:pr-169

_{Built with commit 156021d}

luandro · 2026-03-19T20:03:26Z

@codex review

kilo-code-bot · 2026-03-19T20:04:31Z

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

This increment adds data URL image handling (routing through processing pipeline instead of keeping inline), removes batch-level timeout, and improves failed page accounting.

Severity	Count
CRITICAL	0
WARNING	1
SUGGESTION	0

Issue Details

File	Line	Issue
`generateBlocks.ts`	596	Inconsistent formatting: "Sample data: URLs:" has colon after "data" but line 589 uses "Sample S3 URLs:" (no colon between adjective and noun). Should be "Sample data URLs:" to match.

Files Reviewed (8 files)

scripts/constants.ts - CUSTOM_API_CHUNK_MAX_CHARS and CUSTOM_API_MAX_OUTPUT_TOKENS
scripts/notion-fetch/generateBlocks.ts - Post-write validation, batch timeout removal
scripts/notion-fetch/imageReplacer.ts - Data URL processing, diagnostics
scripts/notion-fetch/markdownRetryProcessor.ts - Data URL retry logic
scripts/notion-translate/translateBlocks.ts - Block translation with image placeholders
scripts/notion-translate/index.ts - Uses translateBlocks, resolveCanonicalDocsRelativePath
scripts/notion-translate/translateFrontMatter.ts - Data URL masking/unmasking
Test files - Updated tests for new functionality

Key Changes Verified

Data URL handling: Images with data: URLs are now processed through processAndReplaceImages instead of being left inline
Batch timeout removal: Removed timeoutMs: 600000 from generateBlocks - comments explain why (would race against valid retries)
Post-write diagnostics: Now checks both S3 and data URL matches
Image placeholder flow: translateBlocks converts image blocks to placeholder paragraphs before translation
Page metadata cache: New resolveCanonicalDocsRelativePath function extracts docs-relative paths

Previously Resolved Issues (carried forward)

File	Line	Issue	Status
`translateFrontMatter.ts`	~310	Fenced code regex handles indented fences	✅
`translateFrontMatter.ts`	~739	Heading loss check uses strict comparison	✅
`translateFrontMatter.ts`	~893	JSON truncation handled with finish_reason check	✅
`translateCodeJson.ts`	92-94	max_tokens added for custom backends	✅

_{Reviewed by minimax-m2.5-20260211 · 1,573,167 tokens}

github-actions · 2026-03-19T20:04:38Z

🚀 Preview Deployment

Your documentation preview is ready!

Preview URL: https://pr-169.comapeo-docs.pages.dev

🔄 Content: Regenerated 5 pages from Notion (script changes detected)

💡 Tip: Add label fetch-all-pages to test with full content, or fetch-10-pages for broader coverage.

This preview will update automatically when you push new commits to this PR.

_{Built with commit 156021d}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33c1581bba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

luandro · 2026-03-19T20:15:09Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33c1581bba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

… overhead P1: TRANSLATION_COMPLETENESS_MAX_RETRIES was 2, so halving from 120k only reached 60k → 30k before giving up. Reaching the 8k floor requires 4 halvings (120k→60k→30k→15k→8k), so raise the constant to 4. P2: getChunkContentBudget was flooring the *content* budget at TRANSLATION_MIN_CHUNK_MAX_CHARS (8k), ignoring prompt overhead (~2.6k). This made the actual request larger than the documented 8k minimum. Fix: subtract overhead from the total limit and floor the content budget at 1; the 8k total-request floor is already enforced by the retry caller. Update the "preserves heading structures" test to use a chunkLimit that reflects a realistic total-request budget (3_200 chars) rather than a raw content size (500 chars), which the old incorrect floor had masked.

luandro · 2026-03-19T20:29:49Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd3b4d24b7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

collectMarkdownStructureMetrics only matched ATX headings (# Heading). CommonMark/Docusaurus also accept setext headings (underline with === or ---). If the model reformats a heading into setext style the count would drop and translateText would incorrectly treat the translation as incomplete. Add a multiline regex for setext headings and include them in headingCount.

luandro · 2026-03-19T20:39:22Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c36dfe9b16

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The previous regex /^.+\n[=\-]{2,}\s*/gm matched any non-empty line followed by --- or ===, which also matches list items before thematic breaks (e.g. "- Item\n---"). This caused isSuspiciouslyIncompleteTranslation to count spurious headings in the source and falsely flag complete translations as incomplete. Fix: only match === underlines (setext H1). The = character has no other CommonMark meaning, so this is unambiguous. Setext H2 (--- underline) is skipped because it cannot be distinguished from a thematic break without a full parser. Notion content uses ATX headings exclusively anyway.

luandro · 2026-03-19T20:46:45Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4459a152a0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

… check Setext H2 headings (Heading\n---): re-introduce detection with a negative lookahead that excludes lines starting with list markers or block-level prefixes, which avoids the thematic-break false-positive while still catching genuine section headings. Admonitions (:::type … :::): Docusaurus callout blocks can be silently dropped by the model without triggering any of the existing checks. Count opening+closing ::: pairs (like fenced code blocks) and treat a drop in admonitionCount as an incompleteness signal.

luandro · 2026-03-19T20:57:34Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ecdf12235a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…ection P1 — table detection: the previous regex only matched GFM table rows with outer pipes (| A | B |). Models sometimes emit pipeless form (A | B | C). Switch to matching GFM table *separator* rows instead — these are the unambiguous per-spec indicator of a table and work regardless of outer-pipe style. Threshold lowered to 1 separator (from 2 data lines). Regex uses a simple character-class + .filter() to avoid ReDoS-unsafe nested quantifiers. P2 — fenced code content: structural markers inside fenced code blocks (headings, list items, table rows, admonitions) were counted as real document structure. Strip fenced block interiors before running all regex checks so that code samples do not inflate source counts and cause false-positive incompleteness failures.

luandro · 2026-03-19T21:09:32Z

@codex review

Fixes a TypeScript 'includes does not exist on type never' error and allows up to 3 spaces of optional indentation for fenced code blocks in stripFencedCodeContent. Co-authored-by: Junie <junie@jetbrains.com>

capy-ai

Added 2 comments

Template literal contained unescaped backticks that broke the TypeScript parser. Also added explicit constraints forbidding JSX attribute wrapping and HTML/JSX tag modification to reduce future LLM corruption.

Replace flat per-file stats with a dynamic box-drawing metrics table (headings, images, lines, chars). Consolidate checks into a single human-readable block with inline timing and failure details. Trim logs section to three essential lines.

…calls Tests that style={{...}}, className, and HTML block elements survive translation intact across pt-BR and es. Skipped automatically when OPENAI_API_KEY is absent. Loads .env for local runs. Serves as acceptance criterion for the planned jsxPlaceholders protection layer.

Add three new test cases to `parseFrontmatterKeys` and integration tests: 1. parseFrontmatterKeys ignores body-level horizontal-rule blocks without anchor keys 2. parseFrontmatterKeys keeps real frontmatter and ignores later Video blocks 3. succeeds when a later chunk contains a body-level Video block These tests validate that Video blocks appearing after markdown separators in the document body are correctly handled and do not interfere with frontmatter parsing or multi-chunk translation workflows.

capy-ai

Added 1 comment

…g i18n files Docusaurus 3.7+ validates image paths during MDX compilation before remark plugins run, causing build failures when __locale_ref__ placeholders reach the i18n files on disk. Add decodeLocaleImagePlaceholderPaths() (parallel to the existing remote variant) and call it in saveTranslatedContentToDisk() so translated files always contain canonical /images/... paths, never __locale_ref__ placeholders. Also fix the two existing i18n files (creating-a-new-observation.md in pt and es) that were written by a previous pipeline run with unresolved placeholders.

…ests - Gate JSX integration tests with explicit RUN_LIVE_TRANSLATION_TESTS=1 flag instead of implicit OPENAI_API_KEY presence, preventing unintended live API calls in CI or developer machines with .env files - Fix heading chunking test: increase chunkLimit from 3200 to 3600 to account for ~3.05K TRANSLATION_PROMPT overhead, replace fragile toHaveBeenCalledTimes(3) with behavioral assertions on translated headings and section-level chunking structure - Fix Video/frontmatter test: increase chunkLimit from 3000 to 6000 to avoid budget=1 which caused splitByLines to drop newlines during character-level force-splitting, corrupting frontmatter in reassembled markdown Refs: #166

socket-security · 2026-04-04T16:16:25Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	eslint-plugin-prettier@5.5.4

View full report

…ion test The splitByLines function used text.split("\n") which consumed newline characters. When lines didn't fit and current was flushed, the "\n" separator between them was lost, making reassembly via join("") corrupt frontmatter structure (e.g. "---\ntitle" became "---title"). Fix: use split(/(\n)/) to keep newlines as explicit tokens in the output array, so join("") reconstructs the original text exactly. This mirrors the approach already used by splitByParagraphs. Add a regression test that uses the original chunkLimit: 3_000 which produces a content budget of ~1 char, forcing character-level splitting. The test asserts that frontmatter keys, Video blocks, and body headings all survive the round-trip through many tiny chunks. Refs: #166

…ocks - remark-fix-image-paths: use Node from @types/unist via a local UnistNode = Node & Record<string, unknown> intersection instead of any - generateBlocks.test.ts: narrow 5 new mock-call lambdas from any[] to unknown[], which TypeScript accepts since the bodies already guard with typeof

When loadCanonicalEnglishMarkdown returns null (e.g. docs/ not yet generated in CI), getSourceMarkdownForTranslation now logs the fallback to the Notion API instead of silently proceeding.

capy-ai

Added 1 comment

eval/ scripts use Bun-only .ts-extension dynamic imports and globalThis.fetch assignment that tsc rejects. They're standalone Bun scripts, not part of the compiled output.

…abels ensureTranslatedFrontmatter was replacing title, sidebar_label, and pagination_label all with the same effectiveTitle, discarding any distinct values the LLM correctly translated. Add extractFrontmatterValue helper that reads sidebar_label and pagination_label from the LLM-translated content (translatedContent), falling back to effectiveTitle only when the key is absent or blank. title always uses effectiveTitle to guard against LLM hallucination. Add 5 regression tests covering: distinct label preservation, no- frontmatter fallback, blank-value fallback, title hallucination guard, and no-key-injection when the English canonical lacks the key.

…fallback In the fallback path where canonical English markdown is unavailable and content is fetched from Notion directly, S3 image URLs have no local file equivalents. Switch imageHandling from placeholder-path to placeholder-text so the LLM receives [Image: alt] tokens instead of unresolvable /images/__remote_ref__/... paths. Update the imageStabilization test to use a real S3 hostname and assert the text-placeholder behaviour for this code path.

capy-ai

Added 1 comment

capy-ai

Added 1 comment

…body content recoverMissingTranslatedHeadings only checked aggregate completeness metrics before accepting a recovered document, so a heading could be reinserted with zero body content beneath it (content merged into an adjacent section) and the recovery would silently pass. After the aggregate completeness guard, re-parse the recovered document and verify that every actually-inserted heading has non-empty sectionContentLength. Track actually-inserted indexes in a separate set to exclude headings skipped by hasHeadingNearInsertionPoint (cached but never inserted), and apply toLocaleLowerCase on both sides of the lookup to match the normalization used elsewhere in the function. Also add security/detect-object-injection ESLint override for scripts/notion-translate/ (same rationale as the existing override for scripts/notion-fetch/generateBlocks.ts: controlled property access, not user input).

github-actions · 2026-04-05T09:20:15Z

🧹 Preview Deployment Cleanup

The preview deployment for this PR has been cleaned up.

Preview URL was: https://pr-169.comapeo-docs.pages.dev

_{Note: Cloudflare Pages deployments follow automatic retention policies. Old previews are cleaned up automatically.}

This comment was marked as off-topic.

Sign in to view

chatgpt-codex-connector bot reviewed Mar 19, 2026

View reviewed changes

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated

chatgpt-codex-connector bot reviewed Mar 19, 2026

View reviewed changes

Comment thread scripts/constants.ts Outdated

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated

This comment was marked as off-topic.

Sign in to view

chatgpt-codex-connector bot reviewed Mar 19, 2026

View reviewed changes

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated

This comment was marked as off-topic.

Sign in to view

chatgpt-codex-connector bot reviewed Mar 19, 2026

View reviewed changes

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated

This comment was marked as off-topic.

Sign in to view

chatgpt-codex-connector bot reviewed Mar 19, 2026

View reviewed changes

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated

This comment was marked as off-topic.

Sign in to view

chatgpt-codex-connector bot reviewed Mar 19, 2026

View reviewed changes

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated

This comment was marked as off-topic.

Sign in to view

fix(scripts): resolve typescript compilation and markdown parsing bugs

ece8b76

Fixes a TypeScript 'includes does not exist on type never' error and allows up to 3 spaces of optional indentation for fenced code blocks in stripFencedCodeContent. Co-authored-by: Junie <junie@jetbrains.com>

capy-ai bot reviewed Apr 3, 2026

View reviewed changes

Comment thread scripts/notion-translate/index.ts

Comment thread scripts/run-single-page-translation-flow.sh Outdated

luandro added 6 commits April 3, 2026 02:52

fix(notion-translate): preserve fallback image references

68f9b47

fix(scripts): normalize single-page fallback docs path

01b41eb

fix(notion-translate): escape backticks and add JSX constraint to prompt

d554adf

Template literal contained unescaped backticks that broke the TypeScript parser. Also added explicit constraints forbidding JSX attribute wrapping and HTML/JSX tag modification to reduce future LLM corruption.

capy-ai bot reviewed Apr 3, 2026

View reviewed changes

Comment thread scripts/notion-translate/index.ts

luandro and others added 3 commits April 3, 2026 12:34

fix(notion-translate): detect S3 URLs behind remote placeholders

e9bd1b1

CoMapeo Content Bot and others added 4 commits April 4, 2026 13:24

fix(translate): log when canonical English markdown is missing in docs/

3297cea

When loadCanonicalEnglishMarkdown returns null (e.g. docs/ not yet generated in CI), getSourceMarkdownForTranslation now logs the fallback to the Notion API instead of silently proceeding.

chore: add MIT license

e8b976a

capy-ai bot reviewed Apr 4, 2026

View reviewed changes

Comment thread scripts/notion-translate/index.ts

luandro added 3 commits April 4, 2026 14:16

chore(tsconfig): exclude scripts/eval from tsc compilation

249d819

eval/ scripts use Bun-only .ts-extension dynamic imports and globalThis.fetch assignment that tsc rejects. They're standalone Bun scripts, not part of the compiled output.

capy-ai bot reviewed Apr 4, 2026

View reviewed changes

Comment thread scripts/notion-translate/index.ts Outdated

CoMapeo Content Bot and others added 2 commits April 4, 2026 14:44

revert(i18n): exclude i18n locale changes from PR 169

c44e3d5

fix(notion-translate): avoid url fallback in image placeholders

981e4d7

luandro mentioned this pull request Apr 5, 2026

feat(api): add OpenAPI spec, workflow doc, and GH Actions trigger #178

Open

3 tasks

capy-ai bot reviewed Apr 5, 2026

View reviewed changes

Comment thread scripts/notion-translate/translateFrontMatter.ts

luandro merged commit 9b374e2 into main Apr 5, 2026
7 checks passed

luandro deleted the issue-166 branch April 5, 2026 09:20

tomasciccola mentioned this pull request Apr 9, 2026

Auto translation is failing for new content #180

Closed

Conversation

luandro commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

1. Proactive aggressive chunking

2. Structural completeness validation + retry

Test plan

Uh oh!

This comment was marked as off-topic.

Uh oh!

github-actions bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🐳 Docker Image Published

Testing

Uh oh!

luandro commented Mar 19, 2026

Uh oh!

kilo-code-bot bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Summary

Overview

Uh oh!

github-actions bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🚀 Preview Deployment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

luandro commented Mar 19, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

This comment was marked as off-topic.

Uh oh!

luandro commented Mar 19, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

This comment was marked as off-topic.

Uh oh!

luandro commented Mar 19, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

This comment was marked as off-topic.

Uh oh!

luandro commented Mar 19, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

This comment was marked as off-topic.

Uh oh!

luandro commented Mar 19, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

This comment was marked as off-topic.

luandro commented Mar 19, 2026 •

edited

Loading

github-actions bot commented Mar 19, 2026 •

edited

Loading

kilo-code-bot bot commented Mar 19, 2026 •

edited

Loading

github-actions bot commented Mar 19, 2026 •

edited

Loading

socket-security bot commented Apr 4, 2026 •

edited

Loading