Adapt search pipeline to prefer semantic results #26771

Open
lautel wants to merge 7 commits into main from improve-search-recall

@lautel lautel commented Mar 25, 2026

  • Remove the dataAssetEmbeddings alias from entities that have no embeddings: tag, container, file, worksheet, spreadsheet, directory.
  • Change the default search pipeline for hybrid search to k=30 and weights=[0.4, 0.6], so that semantic results carry more weight.
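To make the effect of the new weights concrete, here is a minimal sketch of weighted reciprocal rank fusion. It rests on two assumptions that the description implies but does not state: that each weight multiplies its sub-query's reciprocal-rank term, and that the order is [keyword, semantic]. The class name, method name, and document names are hypothetical and are not taken from the OpenMetadata code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WeightedRrfSketch {
    // Assumed fusion rule: score(doc) = sum over lists of weight_i / (k + rank_i(doc)),
    // where ranks start at 1 and a doc missing from a list contributes nothing for it.
    public static Map<String, Double> fuse(List<List<String>> rankings, double[] weights, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (int i = 0; i < rankings.size(); i++) {
            List<String> ranking = rankings.get(i);
            for (int pos = 0; pos < ranking.size(); pos++) {
                double contribution = weights[i] / (k + pos + 1);
                scores.merge(ranking.get(pos), contribution, Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical result lists; index 0 is keyword, index 1 is semantic (assumed order).
        List<String> keyword = List.of("orders_table", "orders_dashboard");
        List<String> semantic = List.of("sales_pipeline", "orders_table");
        Map<String, Double> fused =
            fuse(List.of(keyword, semantic), new double[] {0.4, 0.6}, 30);
        fused.forEach((doc, score) -> System.out.printf("%s %.5f%n", doc, score));
    }
}
```

Under these assumptions, a document that appears only at the top of the semantic list scores 0.6/31, which beats a document that appears only at the top of the keyword list (0.4/31).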

@lautel lautel added the "safe to test" label ("Add this label to run secure Github workflows on PRs") Mar 25, 2026
pmbrull previously approved these changes Mar 25, 2026

github-actions bot commented Mar 25, 2026

OpenMetadata Service New-Code Coverage

PASS. Required changed-line coverage: 90.00% overall and per touched production file.

  • Overall executable changed lines: 3/3 covered (100.00%)
  • Missed executable changed lines: 0
  • Non-executable changed lines ignored by JaCoCo: 0
  • Changed production files: 2
| File | Covered | Missed | Executable | Non-exec | Coverage | Uncovered lines |
| --- | --- | --- | --- | --- | --- | --- |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java | 2 | 0 | 2 | 0 | 100.00% | - |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java | 1 | 0 | 1 | 0 | 100.00% | - |

Only changed executable lines under openmetadata-service/src/main/java are counted. Test files, comments, imports, and non-executable lines are excluded.

github-actions bot commented Mar 25, 2026

🟡 Playwright Results — all passed (14 flaky)

✅ 3402 passed · ❌ 0 failed · 🟡 14 flaky · ⏭️ 216 skipped

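| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🟡 Shard 1 | 453 | 0 | 2 | 2 |
| 🟡 Shard 2 | 602 | 0 | 2 | 32 |
| 🟡 Shard 3 | 606 | 0 | 3 | 27 |
| 🟡 Shard 4 | 601 | 0 | 2 | 47 |
| 🟡 Shard 5 | 586 | 0 | 1 | 67 |
| 🟡 Shard 6 | 554 | 0 | 4 | 41 |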
🟡 14 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › API Endpoint - customization should work (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DomainTierCertificationVoting.spec.ts › DataProduct - Certification assign, update, and remove (shard 2, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Features/UserProfileOnlineStatus.spec.ts › Should show "Active recently" for users active within last hour (shard 3, 1 retry)
  • Flow/ExploreDiscovery.spec.ts › Should display deleted assets when showDeleted is checked and deleted is not present in queryFilter (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Table (shard 4, 1 retry)
  • Pages/EntityDataSteward.spec.ts › Glossary Term Add, Update and Remove (shard 5, 1 retry)
  • Pages/ExploreTree.spec.ts › Verify Database and Database Schema available in explore tree (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)
  • VersionPages/EntityVersionPages.spec.ts › Directory (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

  .createObjectNode()
  .put("technique", "rrf")
- .put("rank_constant", 60)
+ .put("rank_constant", 30)
Contributor

What is the motivation behind reducing the denominator constant? I'm curious about this variable in general; 60 as the default is also intriguing.

Contributor Author

60 is the default rank constant; wherever you see RRF, the default is 60. The problem is that it makes scores quite uniform and masks very high-ranking items. I halved it to keep clearer score differences between rank positions.
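The point about uniform scores can be checked with a little arithmetic. The sketch below assumes the standard unweighted RRF term, 1 / (k + rank) with ranks starting at 1, and compares how much the top position stands out from rank 10 under each constant. The class and method names are illustrative only.

```java
public class RankConstantDemo {
    // Standard RRF contribution of one ranked list: 1 / (k + rank), with rank starting at 1.
    public static double contribution(int k, int rank) {
        return 1.0 / (k + rank);
    }

    // How many times larger the rank-1 contribution is than the contribution at `rank`.
    public static double topToRankRatio(int k, int rank) {
        return contribution(k, 1) / contribution(k, rank);
    }

    public static void main(String[] args) {
        for (int k : new int[] {60, 30}) {
            System.out.printf("k=%d rank1=%.5f rank10=%.5f ratio=%.3f%n",
                k, contribution(k, 1), contribution(k, 10), topToRankRatio(k, 10));
        }
    }
}
```

With k=60 the rank-1 term is 70/61 (about 1.15) times the rank-10 term, while with k=30 it is 40/31 (about 1.29) times, so the smaller constant spreads the scores of adjacent ranks further apart.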

* Update CLAUDE.md with environment setup and worktree instructions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address PR review feedback on CLAUDE.md environment setup

- Fix Python version range to 3.10-3.11 (matches CI matrix and noxfile)
- Fix "claustre" typo to "Claude Code"
- Remove hardcoded ~/Code/OpenMetadata/env paths, use generic references
- Reorder commands: install_dev_env before make generate (Makefile requires it)
- Soften environment-specific assertions about system Python

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

gitar-bot bot commented Mar 27, 2026

Code Review ✅ Approved

Adapts the search pipeline to prioritize semantic results over keyword-based matches, improving search recall. No issues found.
