Fix whitespace preservation 8895408090764671849#901
Azadron228 wants to merge 3 commits into wagtail:main
Conversation
This change addresses a regression where runs of multiple whitespace characters between words or tags were collapsed during extraction and restoration of translatable strings. The root cause was BeautifulSoup's default behavior with the html.parser engine, which collapses whitespace at the root level of a fragment. Adding '[document]' to the 'preserve_whitespace_tags' list instructs the parser to keep all whitespace across the fragment.

Key changes:
- Introduced a `get_soup` helper in `wagtail_localize/strings.py` that correctly configures BeautifulSoup.
- Used `get_soup` consistently in all string extraction and restoration paths.
- Added regression tests in `wagtail_localize/tests/test_strings.py`.

Co-authored-by: Azadron228 <106530452+Azadron228@users.noreply.github.com>
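A minimal sketch of the idea in this commit, assuming `get_soup` simply wraps the `BeautifulSoup` constructor. The real body in `wagtail_localize/strings.py` is not shown in this conversation, so the signature and the exact tag list below (including keeping `pre` and `textarea`) are assumptions. It compares default parsing with parsing that lists '[document]', the name BeautifulSoup gives to the root object.

```python
from bs4 import BeautifulSoup


def get_soup(html: str) -> BeautifulSoup:
    """Hypothetical helper: parse a fragment without collapsing whitespace.

    Assumption: "pre" and "textarea" are kept so the usual defaults still
    apply; "[document]" is the name of the root BeautifulSoup object.
    """
    return BeautifulSoup(
        html,
        "html.parser",
        preserve_whitespace_tags=["pre", "textarea", "[document]"],
    )


fragment = "<b>Hello</b>  <i>world</i>"  # two spaces between the tags

default = BeautifulSoup(fragment, "html.parser")
preserved = get_soup(fragment)

print(repr(str(default)))    # whitespace-only node between the tags is collapsed
print(repr(str(preserved)))  # both spaces survive the round trip
```

Only whitespace-only text nodes seem to be affected by the default behavior; text such as `Hello  world` inside a single node should round-trip unchanged either way.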
- Configured BeautifulSoup to preserve whitespace by including '[document]' in the preserve_whitespace_tags parameter.
- Used a get_soup helper consistently in all string extraction and restoration paths.
- Suppressed the MarkupResemblesLocatorWarning in get_soup to avoid warnings when parsing strings that look like URLs or file paths.
- Added regression tests in wagtail_localize/tests/test_strings.py.

Co-authored-by: Azadron228 <106530452+Azadron228@users.noreply.github.com>
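The warning suppression described in this commit could look roughly like the sketch below. It assumes the filter is applied locally with `warnings.catch_warnings()` so that other code is unaffected; the PR may do this differently, and the body of `get_soup` here is a guess rather than the code from the diff.

```python
import warnings

from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning


def get_soup(html: str) -> BeautifulSoup:
    """Hypothetical helper: parse `html` without locator warnings."""
    with warnings.catch_warnings():
        # Translatable strings are often bare URLs or paths, which
        # BeautifulSoup flags as "looks like a locator, not markup".
        warnings.simplefilter("ignore", MarkupResemblesLocatorWarning)
        return BeautifulSoup(
            html,
            "html.parser",
            preserve_whitespace_tags=["pre", "textarea", "[document]"],
        )


# Record any warnings that escape the helper while parsing a URL-like string.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    soup = get_soup("https://example.com/page")

locator_warnings = [
    w for w in caught if issubclass(w.category, MarkupResemblesLocatorWarning)
]
print(len(locator_warnings), soup.get_text())
```

Scoping the filter to the helper keeps the warning active for any other BeautifulSoup calls in the process.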
Double whitespace between words or tags was being lost during translation synchronization because BeautifulSoup's html.parser collapses whitespace-only nodes in certain contexts. This change introduces a more robust way to preserve all whitespace: the BeautifulSoup builder is configured with a custom preserve_whitespace_tags class that always returns True.

- Modified wagtail_localize/strings.py to centralize soup creation in a get_soup helper and updated all call sites.
- Updated the dummy machine translator for consistency.
- Added regression tests in wagtail_localize/tests/test_strings.py.

Co-authored-by: Azadron228 <106530452+Azadron228@users.noreply.github.com>
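For illustration, a class that "always returns True" might be as small as the sketch below. Only the class name appears in the diff; the body is an assumption, based on the guess that BeautifulSoup tests tag names against `preserve_whitespace_tags` with the `in` operator.

```python
class PreserveWhitespaceTags:
    """Hypothetical stand-in: acts like a collection containing every tag name."""

    def __contains__(self, tag_name: object) -> bool:
        # Every membership test succeeds, so any tag (and the
        # "[document]" root) counts as whitespace-preserving.
        return True


preserve_all = PreserveWhitespaceTags()

for name in ("pre", "p", "div", "[document]"):
    print(name, name in preserve_all)
```

An instance would then presumably be passed as `preserve_whitespace_tags=PreserveWhitespaceTags()` when constructing the soup, in place of a fixed list of tag names.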
Marvinrose left a comment
Thanks for this contribution! Centralising BeautifulSoup calls into get_soup() is a clean improvement and the whitespace preservation approach is sensible.
However, the CI failure needs fixing. The pre-commit checks failed on ruff formatting and linting. You can fix this by running `ruff check --fix .` and `ruff format .` locally, then pushing the updated commit.
Suggestion: Whilst this is a great contribution, it'd be nice if you added a PR description explaining what problem this solves, what caused it, and how to reproduce it. It would really help reviewers (and future contributors reading git history) to know: what specific scenario was losing whitespace, and which part of the pipeline was collapsing it?
    INLINE_TAGS = ["a", "abbr", "acronym", "b", "code", "em", "i", "strong", "br"]


    class PreserveWhitespaceTags:
This class returns True for every tag, meaning BeautifulSoup will now preserve whitespace inside all HTML elements, including block-level ones. The existing INLINE_TAGS list suggests the codebase already distinguishes inline from block elements. Was it intentional to override this for all tags, or should PreserveWhitespaceTags only apply to inline tags?
No description provided.