fix: deduplicate search sources by title to prevent duplicate citations #1115

Open
octo-patch wants to merge 1 commit into ItzCrazyKns:master from octo-patch:fix/issue-1109-deduplicate-citations-by-title

@octo-patch octo-patch commented Apr 17, 2026

Fixes #1109

Problem

When using balanced or quality search modes, the same academic paper often appears across multiple platforms (PubMed, ScienceDirect, ResearchGate, Google Scholar) with different URLs. The existing URL-based deduplication (seenUrls) did not catch these cross-platform duplicates, so the writer LLM received the same paper indexed as 4 separate sources and cited all of them — leading to reports with 100+ citations where the majority were duplicates of the same underlying paper.
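
To illustrate the failure mode, here is a minimal sketch of URL-only deduplication. It is a simplified stand-in, not the project's code: the real logic uses a `seenUrls` Map and merges content, while `Source`, `dedupeByUrl`, the sample title, and the `.example` URLs below are all hypothetical.

```typescript
// Hypothetical, simplified source shape (not the project's actual type).
interface Source {
  url: string;
  title: string;
}

// URL-only deduplication: keeps the first source seen for each distinct URL.
function dedupeByUrl(sources: Source[]): Source[] {
  const seenUrls = new Set<string>();
  return sources.filter((source) => {
    if (seenUrls.has(source.url)) return false;
    seenUrls.add(source.url);
    return true;
  });
}

// The same paper as indexed by four platforms (placeholder URLs).
const samePaper: Source[] = [
  { url: "https://pubmed.example/paper-1", title: "A Study of X" },
  { url: "https://sciencedirect.example/paper-1", title: "A Study of X" },
  { url: "https://researchgate.example/paper-1", title: "A Study of X" },
  { url: "https://scholar.example/paper-1", title: "A Study of X" },
];

const afterUrlDedupe = dedupeByUrl(samePaper);
console.log(afterUrlDedupe.length); // 4: every URL is distinct, so nothing is dropped
```

Because each platform serves the paper under its own URL, a URL key alone cannot tell that the four entries describe the same work.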

Solution

Add a normalised-title Set (seenTitles) alongside the existing URL Map in Researcher.research(). Before adding a new result to the final source list, its title is lower-cased and trimmed; if that normalised title has already been seen, the result is dropped. The first (highest-priority) occurrence is kept; subsequent cross-platform duplicates are silently discarded.

  • Results without a metadata.title are unaffected by the new check.
  • The existing URL-based merge logic (appending content of same-URL duplicates) is unchanged.
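
The approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the diff itself: `SearchResult`, `dedupeSources`, and the sample data are hypothetical, the URL check is assumed to run before the title check, and an empty title is assumed to count as "no title".

```typescript
// Hypothetical result shape; the real type in the codebase may differ.
interface SearchResult {
  content: string;
  metadata: { url?: string; title?: string };
}

function dedupeSources(results: SearchResult[]): SearchResult[] {
  const seenUrls = new Map<string, SearchResult>(); // URL -> kept result
  const seenTitles = new Set<string>(); // normalised titles already kept
  const kept: SearchResult[] = [];

  for (const result of results) {
    const { url, title } = result.metadata;

    // Existing behaviour: a repeated URL has its content appended to the kept entry.
    const existing = url ? seenUrls.get(url) : undefined;
    if (existing) {
      existing.content += `\n\n${result.content}`;
      continue;
    }

    // New behaviour: drop a result whose lower-cased, trimmed title was already seen.
    // Results with a missing (or, by assumption, empty) title skip this check.
    const normalisedTitle = title?.toLowerCase().trim();
    if (normalisedTitle) {
      if (seenTitles.has(normalisedTitle)) continue;
      seenTitles.add(normalisedTitle);
    }

    // Copy so that merging content never mutates the caller's input.
    const copy: SearchResult = { ...result, metadata: { ...result.metadata } };
    if (url) seenUrls.set(url, copy);
    kept.push(copy);
  }
  return kept;
}

// Placeholder data: one paper on three platforms, one same-URL repeat, two untitled results.
const deduped = dedupeSources([
  { content: "A", metadata: { url: "https://pubmed.example/1", title: "A Study of X" } },
  { content: "B", metadata: { url: "https://sciencedirect.example/2", title: "  a study of x " } },
  { content: "C", metadata: { url: "https://researchgate.example/3", title: "A STUDY OF X" } },
  { content: "D", metadata: { url: "https://pubmed.example/1", title: "A Study of X" } },
  { content: "E", metadata: { url: "https://a.example/4" } },
  { content: "F", metadata: { url: "https://b.example/5" } },
]);

// Three sources remain: the PubMed entry (with "D" merged in) and the two untitled results.
console.log(deduped.map((r) => r.content));
```

Note that matching on a lower-cased, trimmed title is exact: titles that differ in punctuation or carry a platform-specific suffix would still be treated as distinct.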

Testing

Manually traced through the deduplication logic with a hypothetical set of results for the same paper appearing on PubMed, ScienceDirect, ResearchGate, and Google Scholar — only the first result is retained, the rest are filtered out before being numbered and passed to the writer.


Summary by cubic

Deduplicate search results by normalized title across platforms to prevent duplicate citations in balanced/quality modes. Keeps the first occurrence and fixes #1109.

  • Bug Fixes
    • Added a seenTitles Set (lowercased, trimmed) in Researcher.research() to drop cross-platform duplicates before numbering.
    • Kept existing URL-merge behavior; results without metadata.title are unchanged.

Written for commit 7ea7c2a. Summary will update on new commits.

The same academic paper often appears on multiple platforms (PubMed,
ScienceDirect, ResearchGate, Google Scholar) with different URLs. The
existing URL-based deduplication did not catch these, leading to 100+
citations in quality/balanced mode where most were duplicates of the
same underlying paper.

Add a normalised-title Set alongside the existing URL Map so that a
result whose title (case-folded, trimmed) has already been seen is
dropped before being numbered and passed to the writer LLM. This keeps
the first (highest-priority) occurrence while silently discarding
cross-platform duplicates. Results without a title are unaffected.

Fixes ItzCrazyKns#1109

@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 1 file

Development

Successfully merging this pull request may close these issues.

Duplicate citations
