Skip to content

fix(translate): prevent content loss in long-form translation#169

Merged
luandro merged 61 commits intomainfrom
issue-166
Apr 5, 2026
Merged

fix(translate): prevent content loss in long-form translation#169
luandro merged 61 commits intomainfrom
issue-166

Conversation

@luandro
Copy link
Copy Markdown
Contributor

@luandro luandro commented Mar 19, 2026

Problem

Closes #166

Long-form Notion pages (troubleshooting, create/edit observation, view/edit track, understanding exchange, etc.) were silently dropping sections during automatic translation. The root cause: feeding very large chunks (up to 500 K chars) to the model saturated its effective attention window, causing it to omit headings and paragraphs without raising any error.

Solution

Two complementary mechanisms:

1. Proactive aggressive chunking

Lowered TRANSLATION_CHUNK_MAX_CHARS from 500,000 → 120,000 chars. Each translation request now stays well within the model's reliable attention range, reducing the chance of content being dropped in the first place.

2. Structural completeness validation + retry

After every translation call, the result is compared against the source using structural markers:

  • Heading count (any heading loss triggers a retry)
  • Fenced code blocks
  • Bullet / numbered lists (threshold: ≥ 3 items)
  • Table lines
  • Severe length shrinkage (< 55% of source length when source ≥ 4,000 chars)

If incompleteness is detected, the chunk limit is halved and the translation is retried (up to TRANSLATION_COMPLETENESS_MAX_RETRIES = 4 times, with a floor of TRANSLATION_MIN_CHUNK_MAX_CHARS = 8,000 chars). After exhausting retries a non-critical error is surfaced so the pipeline can log and continue.

Test plan

  • New: chunks long-form content proactively below model-derived maximums
  • New: retries with smaller chunks when a valid response omits a section
  • New: fails (non-critically) when repeated completeness retries still return incomplete content
  • New: treats heavy structural shrinkage as incomplete translation
  • New: preserves complete heading structures when chunking by sections
  • Existing: token overflow errors still classified as non-critical token_overflow
  • Existing: single-call fast path for small content
  • Existing: chunks large content and calls API once per chunk
  • All tests pass (bunx vitest run scripts/notion-translate/translateFrontMatter.test.ts)

Long-form Notion pages (troubleshooting, create/edit observation, etc.)
were silently dropping sections during automatic translation. The likely
cause: feeding very large chunks to the model saturated its effective
attention window, causing it to omit headings and paragraphs without
raising an error.

Changes:
- Lower proactive chunk ceiling from 500 K → 120 K chars so each
  translation request stays well within reliable model attention range
- Add structural completeness validation after every translation call:
  checks heading count, fenced code blocks, bullet/numbered lists,
  table lines, and severe length shrinkage (< 55 % of source)
- Retry with progressively smaller chunks (halved each attempt, floor
  8 K chars) up to TRANSLATION_COMPLETENESS_MAX_RETRIES (2) times when
  incompleteness is detected, then surface a non-critical error

Closes #166
greptile-apps[bot]

This comment was marked as off-topic.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 19, 2026

🐳 Docker Image Published

Your Docker image has been built and pushed for this PR.

Image Reference: docker.io/communityfirst/comapeo-docs-api:pr-169

Platforms: linux/amd64, linux/arm64

Testing

To test this image:

docker pull docker.io/communityfirst/comapeo-docs-api:pr-169
docker run -p 3001:3001 docker.io/communityfirst/comapeo-docs-api:pr-169

Built with commit 156021d

@luandro
Copy link
Copy Markdown
Contributor Author

luandro commented Mar 19, 2026

@codex review

@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot bot commented Mar 19, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

This increment adds data URL image handling (routing through processing pipeline instead of keeping inline), removes batch-level timeout, and improves failed page accounting.

Severity Count
CRITICAL 0
WARNING 1
SUGGESTION 0
Issue Details
File Line Issue
generateBlocks.ts 596 Inconsistent formatting: "Sample data: URLs:" has colon after "data" but line 589 uses "Sample S3 URLs:" (no colon between adjective and noun). Should be "Sample data URLs:" to match.
Files Reviewed (8 files)
  • scripts/constants.ts - CUSTOM_API_CHUNK_MAX_CHARS and CUSTOM_API_MAX_OUTPUT_TOKENS
  • scripts/notion-fetch/generateBlocks.ts - Post-write validation, batch timeout removal
  • scripts/notion-fetch/imageReplacer.ts - Data URL processing, diagnostics
  • scripts/notion-fetch/markdownRetryProcessor.ts - Data URL retry logic
  • scripts/notion-translate/translateBlocks.ts - Block translation with image placeholders
  • scripts/notion-translate/index.ts - Uses translateBlocks, resolveCanonicalDocsRelativePath
  • scripts/notion-translate/translateFrontMatter.ts - Data URL masking/unmasking
  • Test files - Updated tests for new functionality
Key Changes Verified
  1. Data URL handling: Images with data: URLs are now processed through processAndReplaceImages instead of being left inline
  2. Batch timeout removal: Removed timeoutMs: 600000 from generateBlocks - comments explain why (would race against valid retries)
  3. Post-write diagnostics: Now checks both S3 and data URL matches
  4. Image placeholder flow: translateBlocks converts image blocks to placeholder paragraphs before translation
  5. Page metadata cache: New resolveCanonicalDocsRelativePath function extracts docs-relative paths
Previously Resolved Issues (carried forward)
File Line Issue Status
translateFrontMatter.ts ~310 Fenced code regex handles indented fences
translateFrontMatter.ts ~739 Heading loss check uses strict comparison
translateFrontMatter.ts ~893 JSON truncation handled with finish_reason check
translateCodeJson.ts 92-94 max_tokens added for custom backends

Reviewed by minimax-m2.5-20260211 · 1,573,167 tokens

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 19, 2026

🚀 Preview Deployment

Your documentation preview is ready!

Preview URL: https://pr-169.comapeo-docs.pages.dev

🔄 Content: Regenerated 5 pages from Notion (script changes detected)

💡 Tip: Add label fetch-all-pages to test with full content, or fetch-10-pages for broader coverage.

This preview will update automatically when you push new commits to this PR.


Built with commit 156021d

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33c1581bba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated
Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated
@luandro
Copy link
Copy Markdown
Contributor Author

luandro commented Mar 19, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33c1581bba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/constants.ts Outdated
Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated
… overhead

P1: TRANSLATION_COMPLETENESS_MAX_RETRIES was 2, so halving from 120k only
    reached 60k → 30k before giving up. Reaching the 8k floor requires 4
    halvings (120k→60k→30k→15k→8k), so raise the constant to 4.

P2: getChunkContentBudget was flooring the *content* budget at
    TRANSLATION_MIN_CHUNK_MAX_CHARS (8k), ignoring prompt overhead (~2.6k).
    This made the actual request larger than the documented 8k minimum.
    Fix: subtract overhead from the total limit and floor the content budget
    at 1; the 8k total-request floor is already enforced by the retry caller.

Update the "preserves heading structures" test to use a chunkLimit that
reflects a realistic total-request budget (3_200 chars) rather than a
raw content size (500 chars), which the old incorrect floor had masked.
greptile-apps[bot]

This comment was marked as off-topic.

@luandro
Copy link
Copy Markdown
Contributor Author

luandro commented Mar 19, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd3b4d24b7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated
collectMarkdownStructureMetrics only matched ATX headings (# Heading).
CommonMark/Docusaurus also accept setext headings (underline with === or ---).
If the model reformats a heading into setext style the count would drop and
translateText would incorrectly treat the translation as incomplete.

Add a multiline regex for setext headings and include them in headingCount.
greptile-apps[bot]

This comment was marked as off-topic.

@luandro
Copy link
Copy Markdown
Contributor Author

luandro commented Mar 19, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c36dfe9b16

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated
The previous regex /^.+\n[=\-]{2,}\s*/gm matched any non-empty line
followed by --- or ===, which also matches list items before thematic
breaks (e.g. "- Item\n---"). This caused isSuspiciouslyIncompleteTranslation
to count spurious headings in the source and falsely flag complete
translations as incomplete.

Fix: only match === underlines (setext H1). The = character has no other
CommonMark meaning, so this is unambiguous. Setext H2 (--- underline) is
skipped because it cannot be distinguished from a thematic break without
a full parser. Notion content uses ATX headings exclusively anyway.
greptile-apps[bot]

This comment was marked as off-topic.

@luandro
Copy link
Copy Markdown
Contributor Author

luandro commented Mar 19, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4459a152a0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated
Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated
… check

Setext H2 headings (Heading\n---): re-introduce detection with a negative
lookahead that excludes lines starting with list markers or block-level
prefixes, which avoids the thematic-break false-positive while still
catching genuine section headings.

Admonitions (:::type … :::): Docusaurus callout blocks can be silently
dropped by the model without triggering any of the existing checks.
Count opening+closing ::: pairs (like fenced code blocks) and treat
a drop in admonitionCount as an incompleteness signal.
greptile-apps[bot]

This comment was marked as off-topic.

@luandro
Copy link
Copy Markdown
Contributor Author

luandro commented Mar 19, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ecdf12235a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated
Comment thread scripts/notion-translate/translateFrontMatter.ts Outdated
…ection

P1 — table detection: the previous regex only matched GFM table rows with
outer pipes (| A | B |). Models sometimes emit pipeless form (A | B | C).
Switch to matching GFM table *separator* rows instead — these are the
unambiguous per-spec indicator of a table and work regardless of outer-pipe
style. Threshold lowered to 1 separator (from 2 data lines). Regex uses a
simple character-class + .filter() to avoid ReDoS-unsafe nested quantifiers.

P2 — fenced code content: structural markers inside fenced code blocks
(headings, list items, table rows, admonitions) were counted as real
document structure. Strip fenced block interiors before running all
regex checks so that code samples do not inflate source counts and
cause false-positive incompleteness failures.
greptile-apps[bot]

This comment was marked as off-topic.

@luandro
Copy link
Copy Markdown
Contributor Author

luandro commented Mar 19, 2026

@codex review

@chatgpt-codex-connector

This comment was marked as off-topic.

Fixes a TypeScript 'includes does not exist on type never' error and allows up to 3 spaces of optional indentation for fenced code blocks in stripFencedCodeContent.

Co-authored-by: Junie <junie@jetbrains.com>
Copy link
Copy Markdown

@capy-ai capy-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 2 comments

Comment thread scripts/notion-translate/index.ts
Comment thread scripts/run-single-page-translation-flow.sh Outdated
luandro added 6 commits April 3, 2026 02:52
Template literal contained unescaped backticks that broke the TypeScript
parser. Also added explicit constraints forbidding JSX attribute wrapping
and HTML/JSX tag modification to reduce future LLM corruption.
Replace flat per-file stats with a dynamic box-drawing metrics table
(headings, images, lines, chars). Consolidate checks into a single
human-readable block with inline timing and failure details. Trim logs
section to three essential lines.
…calls

Tests that style={{...}}, className, and HTML block elements survive
translation intact across pt-BR and es. Skipped automatically when
OPENAI_API_KEY is absent. Loads .env for local runs. Serves as
acceptance criterion for the planned jsxPlaceholders protection layer.
Add three new test cases to `parseFrontmatterKeys` and integration tests:

1. parseFrontmatterKeys ignores body-level horizontal-rule blocks without anchor keys
2. parseFrontmatterKeys keeps real frontmatter and ignores later Video blocks
3. succeeds when a later chunk contains a body-level Video block

These tests validate that Video blocks appearing after markdown separators
in the document body are correctly handled and do not interfere with
frontmatter parsing or multi-chunk translation workflows.
Copy link
Copy Markdown

@capy-ai capy-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 1 comment

Comment thread scripts/notion-translate/index.ts
luandro and others added 3 commits April 3, 2026 12:34
…g i18n files

Docusaurus 3.7+ validates image paths during MDX compilation before remark
plugins run, causing build failures when __locale_ref__ placeholders reach
the i18n files on disk.

Add decodeLocaleImagePlaceholderPaths() (parallel to the existing remote
variant) and call it in saveTranslatedContentToDisk() so translated files
always contain canonical /images/... paths, never __locale_ref__ placeholders.

Also fix the two existing i18n files (creating-a-new-observation.md in pt
and es) that were written by a previous pipeline run with unresolved placeholders.
…ests

- Gate JSX integration tests with explicit RUN_LIVE_TRANSLATION_TESTS=1
  flag instead of implicit OPENAI_API_KEY presence, preventing unintended
  live API calls in CI or developer machines with .env files
- Fix heading chunking test: increase chunkLimit from 3200 to 3600 to
  account for ~3.05K TRANSLATION_PROMPT overhead, replace fragile
  toHaveBeenCalledTimes(3) with behavioral assertions on translated
  headings and section-level chunking structure
- Fix Video/frontmatter test: increase chunkLimit from 3000 to 6000
  to avoid budget=1 which caused splitByLines to drop newlines during
  character-level force-splitting, corrupting frontmatter in reassembled
  markdown

Refs: #166
@socket-security
Copy link
Copy Markdown

socket-security bot commented Apr 4, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedeslint-plugin-prettier@​5.5.49910010090100

View full report

CoMapeo Content Bot and others added 4 commits April 4, 2026 13:24
…ion test

The splitByLines function used text.split("\n") which consumed newline
characters. When lines didn't fit and current was flushed, the "\n"
separator between them was lost, making reassembly via join("") corrupt
frontmatter structure (e.g. "---\ntitle" became "---title").

Fix: use split(/(\n)/) to keep newlines as explicit tokens in the output
array, so join("") reconstructs the original text exactly. This mirrors
the approach already used by splitByParagraphs.

Add a regression test that uses the original chunkLimit: 3_000 which
produces a content budget of ~1 char, forcing character-level splitting.
The test asserts that frontmatter keys, Video blocks, and body headings
all survive the round-trip through many tiny chunks.

Refs: #166
…ocks

- remark-fix-image-paths: use Node from @types/unist via a local
  UnistNode = Node & Record<string, unknown> intersection instead of any
- generateBlocks.test.ts: narrow 5 new mock-call lambdas from any[] to
  unknown[], which TypeScript accepts since the bodies already guard with typeof
When loadCanonicalEnglishMarkdown returns null (e.g. docs/ not yet
generated in CI), getSourceMarkdownForTranslation now logs the fallback
to the Notion API instead of silently proceeding.
Copy link
Copy Markdown

@capy-ai capy-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 1 comment

Comment thread scripts/notion-translate/index.ts
luandro added 3 commits April 4, 2026 14:16
eval/ scripts use Bun-only .ts-extension dynamic imports and
globalThis.fetch assignment that tsc rejects. They're standalone
Bun scripts, not part of the compiled output.
…abels

ensureTranslatedFrontmatter was replacing title, sidebar_label, and
pagination_label all with the same effectiveTitle, discarding any
distinct values the LLM correctly translated.

Add extractFrontmatterValue helper that reads sidebar_label and
pagination_label from the LLM-translated content (translatedContent),
falling back to effectiveTitle only when the key is absent or blank.
title always uses effectiveTitle to guard against LLM hallucination.

Add 5 regression tests covering: distinct label preservation, no-
frontmatter fallback, blank-value fallback, title hallucination guard,
and no-key-injection when the English canonical lacks the key.
…fallback

In the fallback path where canonical English markdown is unavailable and
content is fetched from Notion directly, S3 image URLs have no local
file equivalents. Switch imageHandling from placeholder-path to
placeholder-text so the LLM receives [Image: alt] tokens instead of
unresolvable /images/__remote_ref__/... paths.

Update the imageStabilization test to use a real S3 hostname and assert
the text-placeholder behaviour for this code path.
Copy link
Copy Markdown

@capy-ai capy-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 1 comment

Comment thread scripts/notion-translate/index.ts Outdated
Copy link
Copy Markdown

@capy-ai capy-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 1 comment

Comment thread scripts/notion-translate/translateFrontMatter.ts
…body content

recoverMissingTranslatedHeadings only checked aggregate completeness
metrics before accepting a recovered document, so a heading could be
reinserted with zero body content beneath it (content merged into an
adjacent section) and the recovery would silently pass.

After the aggregate completeness guard, re-parse the recovered document
and verify that every actually-inserted heading has non-empty
sectionContentLength. Track actually-inserted indexes in a separate set
to exclude headings skipped by hasHeadingNearInsertionPoint (cached but
never inserted), and apply toLocaleLowerCase on both sides of the lookup
to match the normalization used elsewhere in the function.

Also add security/detect-object-injection ESLint override for
scripts/notion-translate/ (same rationale as the existing override for
scripts/notion-fetch/generateBlocks.ts: controlled property access,
not user input).
@luandro luandro merged commit 9b374e2 into main Apr 5, 2026
7 checks passed
@luandro luandro deleted the issue-166 branch April 5, 2026 09:20
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 5, 2026

🧹 Preview Deployment Cleanup

The preview deployment for this PR has been cleaned up.

Preview URL was: https://pr-169.comapeo-docs.pages.dev


Note: Cloudflare Pages deployments follow automatic retention policies. Old previews are cleaned up automatically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Translation: Automatic translation is skipping chunks in long form content

2 participants