fix(validate): URL_REGEX no longer captures trailing prose punctuation by d123d · Pull Request #208 · JuliusBrussee/caveman

d123d · 2026-04-17T02:12:16Z

Problem

`validate.py:5` pattern `https?://[^\s)]+` greedily consumes trailing prose punctuation:

```python

re.search(r'https?://[^\s)]+', 'See https://example.com. Next sentence.').group()
'https://example.com.' # includes period
```

When the LLM correctly strips trailing prose punctuation during compression, the validator flags `URL missing` because the extracted URL-with-period no longer matches:

```
Compressing with Claude...
Validation failed:

URL missing: https://example.com.
```

This triggers unnecessary retries (and sometimes unrecoverable failures when the LLM doesn't reinsert the period because it's not part of the URL).

Fix

Non-greedy body + lookahead that permits zero-or-more prose-punctuation chars followed by whitespace, closing paren, or end-of-string:

```python
URL_REGEX = re.compile(r"https?://[^\s)]+?(?=[.,;:!?>]*(?:[\s)]|\Z))")
```

Internal `.` in domains and paths preserved (lookahead requires a terminator after any stripped punctuation).

Test cases

Input	Before	After
`see https://example.com.\`	`https://example.com.\`	`https://example.com\`
`visit https://a.co/path, then`	`https://a.co/path,\`	`https://a.co/path\`
`docs: https://foo.bar/x?q=1; more`	`https://foo.bar/x?q=1;\`	`https://foo.bar/x?q=1\`
`(see https://example.com)\`	`https://example.com\`	`https://example.com\` (unchanged)
`https://github.com/org/repo/issues/42\`	full match	full match (unchanged)
`http://localhost:8080/api?a=1&b=2 works`	full match	full match (unchanged)
`path file https://a.co/b.html\`	full match	full match (internal . kept)
`https://api/v1\` (end of string)	full match	full match (unchanged)

All URLs without trailing prose punctuation are unchanged. Only the prose-punctuation cases improve.

Verify

```
python -m py_compile caveman-compress/scripts/validate.py
```

Passes.

Before: 'https?://[^\s)]+' greedily consumes trailing '.', ',', ';', '>', etc. when URLs appear in prose. Symptom: when the LLM correctly strips trailing punctuation during compression, validator flags 'URL missing' because the originally-extracted URL had punctuation attached that no longer appears in the output. Example failing case before this fix: original: 'See https://example.com. Next sentence.' extracted: 'https://example.com.' (period included) compressed: 'https://example.com' (LLM correctly dropped period) validator: 'URL missing: https://example.com.' <-- false positive After: non-greedy body + lookahead that permits zero-or-more prose-punct characters followed by whitespace, closing paren, or end-of-string. Internal '.' in domains and paths preserved; trailing prose-punct excluded. Test cases (all pass): 'see https://example.com.' -> https://example.com 'visit https://a.co/path, then' -> https://a.co/path 'docs: https://foo.bar/x?q=1; more' -> https://foo.bar/x?q=1 '(see https://example.com)' -> https://example.com 'https://github.com/org/repo/issues/42' -> https://github.com/org/repo/issues/42 'http://localhost:8080/api?a=1&b=2 ws' -> http://localhost:8080/api?a=1&b=2 'path file https://a.co/b.html' -> https://a.co/b.html (internal . kept) 'end of string https://api/v1' -> https://api/v1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(validate): URL_REGEX no longer captures trailing prose punctuation#208

fix(validate): URL_REGEX no longer captures trailing prose punctuation#208
d123d wants to merge 1 commit intoJuliusBrussee:mainfrom
d123d:fix/validate-url-regex-trailing-punct

d123d commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

d123d commented Apr 17, 2026

Problem

Fix

Test cases

Verify

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant