fix: deduplicate search sources by title to prevent duplicate citations #1115
Open
octo-patch wants to merge 1 commit into ItzCrazyKns:master from
The same academic paper often appears on multiple platforms (PubMed, ScienceDirect, ResearchGate, Google Scholar) with different URLs. The existing URL-based deduplication did not catch these, leading to 100+ citations in quality/balanced mode where most were duplicates of the same underlying paper.
Add a normalised-title Set alongside the existing URL Map so that a result whose title (case-folded, trimmed) has already been seen is dropped before being numbered and passed to the writer LLM. This keeps the first (highest-priority) occurrence while silently discarding cross-platform duplicates. Results without a title are unaffected.
Fixes ItzCrazyKns#1109
Fixes #1109
Problem
When using balanced or quality search modes, the same academic paper often appears across multiple platforms (PubMed, ScienceDirect, ResearchGate, Google Scholar) with different URLs. The existing URL-based deduplication (seenUrls) did not catch these cross-platform duplicates, so the writer LLM received the same paper indexed as 4 separate sources and cited all of them, leading to reports with 100+ citations where the majority were duplicates of the same underlying paper.
Solution
Add a normalised-title Set (seenTitles) alongside the existing URL Map in Researcher.research(). Before adding a new result to the final source list, its title is lower-cased and trimmed; if that normalised title has already been seen, the result is dropped. The first (highest-priority) occurrence is kept; subsequent cross-platform duplicates are silently discarded.
Results without a metadata.title are unaffected by the new check.
Testing
Manually traced through the deduplication logic with a hypothetical set of results for the same paper appearing on PubMed, ScienceDirect, ResearchGate, and Google Scholar — only the first result is retained, the rest are filtered out before being numbered and passed to the writer.
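For reviewers, the check described under Solution can be sketched roughly as follows. This is a simplified, standalone illustration and not the code in this commit: the SearchResult shape, the dedupeResults helper, and the sample URLs are all hypothetical, and the real Researcher.research() may handle repeated URLs differently (for example by merging content instead of skipping).

```typescript
// Hypothetical result shape for illustration; the project's real type may differ.
interface SearchResult {
  content: string;
  metadata: { url: string; title?: string };
}

// Keeps the first occurrence of each URL and of each normalised title.
// Results with a missing or blank title skip the title check entirely.
function dedupeResults(results: SearchResult[]): SearchResult[] {
  const seenUrls = new Map<string, number>(); // URL -> index in `unique`
  const seenTitles = new Set<string>(); // lower-cased, trimmed titles
  const unique: SearchResult[] = [];

  for (const result of results) {
    const url = result.metadata.url;
    if (seenUrls.has(url)) continue; // existing URL-based check (simplified)

    const title = result.metadata.title?.toLowerCase().trim();
    if (title) {
      if (seenTitles.has(title)) continue; // cross-platform duplicate
      seenTitles.add(title);
    }

    seenUrls.set(url, unique.length);
    unique.push(result);
  }
  return unique;
}

// Made-up sample: one paper on four platforms, plus one untitled result.
const sampleResults: SearchResult[] = [
  { content: "a", metadata: { url: "https://pubmed.example/paper-1", title: "Sleep and Memory" } },
  { content: "b", metadata: { url: "https://sciencedirect.example/paper-1", title: "  sleep and memory " } },
  { content: "c", metadata: { url: "https://researchgate.example/paper-1", title: "SLEEP AND MEMORY" } },
  { content: "d", metadata: { url: "https://scholar.example/paper-1", title: "Sleep and memory" } },
  { content: "e", metadata: { url: "https://other.example/untitled" } },
];

const deduped = dedupeResults(sampleResults);
console.log(deduped.map((r) => r.metadata.url));
```

With this sample, only the first titled result and the untitled result survive, which matches the manual trace above. One design consequence worth noting: two genuinely different documents that share an identical title would also collapse into one source under this check.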
Summary by cubic
Deduplicate search results by normalized title across platforms to prevent duplicate citations in balanced/quality modes. Keeps the first occurrence and fixes #1109.
Adds a seenTitles Set (lowercased, trimmed) in Researcher.research() to drop cross-platform duplicates before numbering.
Results without metadata.title are unchanged.
Written for commit 7ea7c2a. Summary will update on new commits.