fix(validate): URL_REGEX no longer captures trailing prose punctuation #208

Open
d123d wants to merge 1 commit into JuliusBrussee:main from d123d:fix/validate-url-regex-trailing-punct

d123d commented Apr 17, 2026

Problem

The pattern at `validate.py:5`, `https?://[^\s)]+`, greedily consumes trailing prose punctuation:

```python
import re

# Old pattern: the greedy character class swallows the sentence-ending period.
re.search(r'https?://[^\s)]+', 'See https://example.com. Next sentence.').group()
# 'https://example.com.'  (includes the period)
```

When the LLM correctly strips trailing prose punctuation during compression, the validator flags `URL missing` because the extracted URL-with-period no longer matches:

```
Compressing with Claude...
Validation failed:
```

This triggers unnecessary retries (and sometimes unrecoverable failures when the LLM doesn't reinsert the period because it's not part of the URL).
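
The false positive can be reproduced in isolation. The sketch below assumes the validator compares the URLs extracted from the original text against those extracted from the compressed text; `missing_urls` is a hypothetical helper written for illustration, not the actual logic in `validate.py`.

```python
import re

# Pattern before the fix, as quoted in this PR.
OLD_URL_REGEX = re.compile(r"https?://[^\s)]+")


def missing_urls(original, compressed):
    """Hypothetical check: URLs found in the original but not in the compressed text."""
    before = set(OLD_URL_REGEX.findall(original))
    after = set(OLD_URL_REGEX.findall(compressed))
    return before - after


original = "See https://example.com. Next sentence."
compressed = "See https://example.com"  # the LLM dropped the trailing period

print(missing_urls(original, compressed))  # {'https://example.com.'}
```

The URL is present in both texts, but the period captured from the original makes the two extracted strings differ.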

Fix

Non-greedy body + lookahead that permits zero-or-more prose-punctuation chars followed by whitespace, closing paren, or end-of-string:

```python
URL_REGEX = re.compile(r"https?://[^\s)]+?(?=[.,;:!?>]*(?:[\s)]|\Z))")
```

Internal `.` in domains and paths preserved (lookahead requires a terminator after any stripped punctuation).
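
To make the lookahead's behaviour concrete, here is a small sketch using the proposed pattern. The sample strings are made up for illustration and are not taken from the PR's test cases.

```python
import re

# Pattern proposed in this PR.
URL_REGEX = re.compile(r"https?://[^\s)]+?(?=[.,;:!?>]*(?:[\s)]|\Z))")

# Illustrative inputs (assumptions, not from the PR).
samples = [
    "Docs at https://a.co/b.html.",   # internal '.' kept, final '.' dropped
    "Really https://a.co/path?!",     # a run of trailing punctuation is dropped
    "(see https://example.com).",     # ')' still terminates the URL
]

results = [URL_REGEX.search(s).group() for s in samples]
for sample, url in zip(samples, results):
    print(f"{sample!r} -> {url}")
```

The lazy body stops at the first position where only prose punctuation separates it from whitespace, `)`, or the end of the string. A `.` followed by more URL characters fails the lookahead, so matching continues past it.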

Test cases

| Input | Before | After |
|---|---|---|
| `see https://example.com.` | `https://example.com.` | `https://example.com` |
| `visit https://a.co/path, then` | `https://a.co/path,` | `https://a.co/path` |
| `docs: https://foo.bar/x?q=1; more` | `https://foo.bar/x?q=1;` | `https://foo.bar/x?q=1` |
| `(see https://example.com)` | `https://example.com` | `https://example.com` (unchanged) |
| `https://github.com/org/repo/issues/42` | full match | full match (unchanged) |
| `http://localhost:8080/api?a=1&b=2 works` | full match | full match (unchanged) |
| `path file https://a.co/b.html` | full match | full match (internal `.` kept) |
| `https://api/v1` (end of string) | full match | full match (unchanged) |

All URLs without trailing prose punctuation are unchanged. Only the prose-punctuation cases improve.

Verify

```
python -m py_compile caveman-compress/scripts/validate.py
```

Passes.
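
`py_compile` only confirms that the file compiles, so a behavioural check is a useful complement. The sketch below replays the inputs from the test table against the proposed pattern; `CASES` and `failures` are illustrative names, not part of `validate.py`.

```python
import re

# Pattern proposed in this PR.
URL_REGEX = re.compile(r"https?://[^\s)]+?(?=[.,;:!?>]*(?:[\s)]|\Z))")

# (input text, expected extracted URL), taken from the test table.
CASES = [
    ("see https://example.com.", "https://example.com"),
    ("visit https://a.co/path, then", "https://a.co/path"),
    ("docs: https://foo.bar/x?q=1; more", "https://foo.bar/x?q=1"),
    ("(see https://example.com)", "https://example.com"),
    ("https://github.com/org/repo/issues/42", "https://github.com/org/repo/issues/42"),
    ("http://localhost:8080/api?a=1&b=2 works", "http://localhost:8080/api?a=1&b=2"),
    ("path file https://a.co/b.html", "https://a.co/b.html"),
    ("https://api/v1", "https://api/v1"),
]


def failures():
    """Return (text, actual matches) for every case that does not yield exactly the expected URL."""
    bad = []
    for text, expected in CASES:
        actual = URL_REGEX.findall(text)
        if actual != [expected]:
            bad.append((text, actual))
    return bad


if __name__ == "__main__":
    print(failures() or "all cases pass")
```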

Before: 'https?://[^\s)]+' greedily consumes trailing '.', ',', ';', '>',
etc. when URLs appear in prose.

Symptom: when the LLM correctly strips trailing punctuation during
compression, validator flags 'URL missing' because the originally-extracted
URL had punctuation attached that no longer appears in the output.

Example failing case before this fix:

    original:   'See https://example.com. Next sentence.'
    extracted:  'https://example.com.'  (period included)
    compressed: 'https://example.com' (LLM correctly dropped period)
    validator:  'URL missing: https://example.com.'  <-- false positive

After: non-greedy body + lookahead that permits zero-or-more prose-punct
characters followed by whitespace, closing paren, or end-of-string. Internal
'.' in domains and paths preserved; trailing prose-punct excluded.

Test cases (all pass):
  'see https://example.com.'              -> https://example.com
  'visit https://a.co/path, then'         -> https://a.co/path
  'docs: https://foo.bar/x?q=1; more'     -> https://foo.bar/x?q=1
  '(see https://example.com)'             -> https://example.com
  'https://github.com/org/repo/issues/42' -> https://github.com/org/repo/issues/42
  'http://localhost:8080/api?a=1&b=2 ws'  -> http://localhost:8080/api?a=1&b=2
  'path file https://a.co/b.html'         -> https://a.co/b.html  (internal . kept)
  'end of string https://api/v1'          -> https://api/v1