Skip to content

Fix whitespace preservation 8895408090764671849#901

Open
Azadron228 wants to merge 3 commits intowagtail:mainfrom
Azadron228:fix-whitespace-preservation-8895408090764671849
Open

Fix whitespace preservation 8895408090764671849#901
Azadron228 wants to merge 3 commits intowagtail:mainfrom
Azadron228:fix-whitespace-preservation-8895408090764671849

Conversation

@Azadron228
Copy link
Copy Markdown

No description provided.

google-labs-jules bot and others added 3 commits February 16, 2026 10:06
This change addresses a regression where multiple whitespaces between words
or tags were collapsed during extraction and restoration of translatable
strings.

The root cause was BeautifulSoup's default behavior with the html.parser
engine, which collapses whitespace at the root level of a fragment. By
adding '[document]' to the 'preserve_whitespace_tags' list, we instruct
the parser to keep all whitespaces across the fragment.

Key changes:
- Introduced a `get_soup` helper in `wagtail_localize/strings.py` that
  correctly configures BeautifulSoup.
- Consistently used `get_soup` in all string extraction and restoration
  paths.
- Added regression tests in `wagtail_localize/tests/test_strings.py`.

Co-authored-by: Azadron228 <106530452+Azadron228@users.noreply.github.com>
- Configured BeautifulSoup to preserve whitespaces by including '[document]'
  in the preserve_whitespace_tags parameter.
- Consistently used a get_soup helper in all string extraction and
  restoration paths.
- Suppressed the MarkupResemblesLocatorWarning in get_soup to avoid
  warnings when parsing strings that look like URLs or file paths.
- Added regression tests in wagtail_localize/tests/test_strings.py.

Co-authored-by: Azadron228 <106530452+Azadron228@users.noreply.github.com>
Double whitespaces between words or tags were being lost during
translation synchronization because BeautifulSoup's html.parser collapses
whitespace-only nodes in certain contexts.

This change introduces a more robust way to preserve all whitespace
by configuring the BeautifulSoup builder with a custom
preserve_whitespace_tags class that always returns True.

Modified wagtail_localize/strings.py to centralize soup creation
in a get_soup helper and updated all call sites. Also updated the
dummy machine translator for consistency.

Added regression tests in wagtail_localize/tests/test_strings.py.

Co-authored-by: Azadron228 <106530452+Azadron228@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@Marvinrose Marvinrose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the review below

Copy link
Copy Markdown
Contributor

@Marvinrose Marvinrose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution! Centralising BeautifulSoup calls into get_soup() is a clean improvement and the whitespace preservation approach is sensible.
However, the CI failure needs fixing. The pre-commit checks failed on ruff formatting and linting. You can fix this by running ' ruff check --fix . ' and ' ruff format . ' locally, then pushing the updated commit.
Suggestion: Whilst this is a great contribution, it'd be nice if you added a PR description. a description explaining what problem this solves, what caused it, or how to reproduce it. It would really help reviewers (and future contributors reading git history) to know: what specific scenario was losing whitespace, and which part of the pipeline was collapsing it?

INLINE_TAGS = ["a", "abbr", "acronym", "b", "code", "em", "i", "strong", "br"]


class PreserveWhitespaceTags:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class returns True for every tag, meaning BeautifulSoup will now preserve whitespace inside all HTML elements including block-level ones like

and

. The existing INLINE_TAGS list suggests the codebase already distinguishes inline from block elements. Was it intentional to override this for all tags, or should PreserveWhitespaceTags only apply to inline tags?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants