fix(validate): URL_REGEX no longer captures trailing prose punctuation#208
Open
d123d wants to merge 1 commit intoJuliusBrussee:mainfrom
Open
fix(validate): URL_REGEX no longer captures trailing prose punctuation#208d123d wants to merge 1 commit intoJuliusBrussee:mainfrom
d123d wants to merge 1 commit intoJuliusBrussee:mainfrom
Conversation
Before: 'https?://[^\s)]+' greedily consumes trailing '.', ',', ';', '>',
etc. when URLs appear in prose.
Symptom: when the LLM correctly strips trailing punctuation during
compression, validator flags 'URL missing' because the originally-extracted
URL had punctuation attached that no longer appears in the output.
Example failing case before this fix:
original: 'See https://example.com. Next sentence.'
extracted: 'https://example.com.' (period included)
compressed: 'https://example.com' (LLM correctly dropped period)
validator: 'URL missing: https://example.com.' <-- false positive
After: non-greedy body + lookahead that permits zero-or-more prose-punct
characters followed by whitespace, closing paren, or end-of-string. Internal
'.' in domains and paths preserved; trailing prose-punct excluded.
Test cases (all pass):
'see https://example.com.' -> https://example.com
'visit https://a.co/path, then' -> https://a.co/path
'docs: https://foo.bar/x?q=1; more' -> https://foo.bar/x?q=1
'(see https://example.com)' -> https://example.com
'https://github.com/org/repo/issues/42' -> https://github.com/org/repo/issues/42
'http://localhost:8080/api?a=1&b=2 ws' -> http://localhost:8080/api?a=1&b=2
'path file https://a.co/b.html' -> https://a.co/b.html (internal . kept)
'end of string https://api/v1' -> https://api/v1
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
`validate.py:5` pattern `https?://[^\s)]+` greedily consumes trailing prose punctuation:
```python
When the LLM correctly strips trailing prose punctuation during compression, the validator flags `URL missing` because the extracted URL-with-period no longer matches:
```
Compressing with Claude...
Validation failed:
```
This triggers unnecessary retries (and sometimes unrecoverable failures when the LLM doesn't reinsert the period because it's not part of the URL).
Fix
Non-greedy body + lookahead that permits zero-or-more prose-punctuation chars followed by whitespace, closing paren, or end-of-string:
```python
URL_REGEX = re.compile(r"https?://[^\s)]+?(?=[.,;:!?>]*(?:[\s)]|\Z))")
```
Internal `.` in domains and paths preserved (lookahead requires a terminator after any stripped punctuation).
Test cases
All URLs without trailing prose punctuation are unchanged. Only the prose-punctuation cases improve.
Verify
```
python -m py_compile caveman-compress/scripts/validate.py
```
Passes.