fix: deduplicate search sources by title to prevent duplicate citations by octo-patch · Pull Request #1115 · ItzCrazyKns/Vane

octo-patch · 2026-04-17T01:07:07Z

Problem

When using balanced or quality search modes, the same academic paper often appears across multiple platforms (PubMed, ScienceDirect, ResearchGate, Google Scholar) with different URLs. The existing URL-based deduplication (seenUrls) did not catch these cross-platform duplicates, so the writer LLM received the same paper indexed as 4 separate sources and cited all of them — leading to reports with 100+ citations where the majority were duplicates of the same underlying paper.

Solution

Add a normalised-title Set (seenTitles) alongside the existing URL Map in Researcher.research(). Before adding a new result to the final source list, its title is lower-cased and trimmed; if that normalised title has already been seen, the result is dropped. The first (highest-priority) occurrence is kept; subsequent cross-platform duplicates are silently discarded.

- Results without a metadata.title are unaffected by the new check.
- The existing URL-based merge logic (appending content of same-URL duplicates) is unchanged.
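
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `SearchResult` shape, the `dedupeSources` function name, the order of the URL and title checks, and the example URLs are all assumptions made for the sketch. Only the ideas of a `seenUrls` Map, a `seenTitles` Set, and lower-casing plus trimming the title come from the description.

```typescript
// Hypothetical result shape; the real type in the repository may differ.
interface SearchResult {
  content: string;
  metadata: { url: string; title?: string };
}

// Hypothetical helper illustrating URL merge plus normalised-title dedup.
function dedupeSources(results: SearchResult[]): SearchResult[] {
  const seenUrls = new Map<string, SearchResult>(); // URL -> kept result
  const seenTitles = new Set<string>(); // lower-cased, trimmed titles
  const unique: SearchResult[] = [];

  for (const result of results) {
    const url = result.metadata.url;

    // Same URL seen before: append the content to the kept result.
    const kept = seenUrls.get(url);
    if (kept !== undefined) {
      kept.content += `\n\n${result.content}`;
      continue;
    }

    // New URL: drop the result if its normalised title was already seen.
    // Results with a missing (or empty) title skip this check entirely.
    const title = result.metadata.title?.toLowerCase().trim();
    if (title) {
      if (seenTitles.has(title)) continue;
      seenTitles.add(title);
    }

    // Copy the result so the caller's input objects are not mutated later.
    const copy: SearchResult = { ...result, metadata: { ...result.metadata } };
    seenUrls.set(url, copy);
    unique.push(copy);
  }
  return unique;
}

// Usage: one paper on three platforms (made-up URLs) plus an untitled result.
const deduped = dedupeSources([
  { content: "a", metadata: { url: "https://pubmed.example/1", title: "Deep Learning in Radiology" } },
  { content: "b", metadata: { url: "https://sciencedirect.example/1", title: " deep learning in radiology " } },
  { content: "c", metadata: { url: "https://researchgate.example/1", title: "DEEP LEARNING IN RADIOLOGY" } },
  { content: "d", metadata: { url: "https://other.example/1" } },
]);
console.log(deduped.map((r) => r.metadata.url));
// prints the PubMed URL and the untitled result's URL only
```

Keeping the first occurrence means the outcome depends on the order in which results arrive, which matches the "first (highest-priority) occurrence is kept" wording above.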

Testing

Manually traced through the deduplication logic with a hypothetical set of results for the same paper appearing on PubMed, ScienceDirect, ResearchGate, and Google Scholar — only the first result is retained, the rest are filtered out before being numbered and passed to the writer.

Summary by cubic

Deduplicate search results by normalized title across platforms to prevent duplicate citations in balanced/quality modes. Keeps the first occurrence and fixes #1109.

Bug Fixes
- Added a seenTitles Set (lowercased, trimmed) in Researcher.research() to drop cross-platform duplicates before numbering.
- Kept existing URL-merge behavior; results without metadata.title are unchanged.

Written for commit 7ea7c2a. Summary will update on new commits.

The same academic paper often appears on multiple platforms (PubMed, ScienceDirect, ResearchGate, Google Scholar) with different URLs. The existing URL-based deduplication did not catch these, leading to 100+ citations in quality/balanced mode where most were duplicates of the same underlying paper.

Add a normalised-title Set alongside the existing URL Map so that a result whose title (case-folded, trimmed) has already been seen is dropped before being numbered and passed to the writer LLM. This keeps the first (highest-priority) occurrence while silently discarding cross-platform duplicates. Results without a title are unaffected.

Fixes ItzCrazyKns#1109

cubic-dev-ai

No issues found across 1 file

cubic-dev-ai bot reviewed Apr 17, 2026

octo-patch wants to merge 1 commit into ItzCrazyKns:master from octo-patch:fix/issue-1109-deduplicate-citations-by-title
