
Fixes #25933 added-pipeline-lineage#26786

Open
varun-lakhyani wants to merge 8 commits into open-metadata:main from varun-lakhyani:databricks-lineage-pipeline

Conversation

@varun-lakhyani
Member

@varun-lakhyani varun-lakhyani commented Mar 26, 2026

Describe your changes:

Fixes #25933

Earlier, only jobs were covered in ingestion and lineage, while pipelines were added only during ingestion. This change extends lineage to pipelines and refactors naming from Job to Entity, since the code now handles both jobs and pipelines.


Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Naming refactor:
    • Renamed DATABRICKS_GET_TABLE_LINEAGE_FOR_JOB to DATABRICKS_GET_TABLE_LINEAGE for broader applicability
    • Renamed DATABRICKS_GET_COLUMN_LINEAGE_FOR_JOB to DATABRICKS_GET_COLUMN_LINEAGE
    • Updated variable names from job_* to entity_* across DatabricksClient class
  • Query expansion:
    • Modified SQL queries to include both JOB and PIPELINE entity types in WHERE clause
  • Pipeline lineage support:
    • Updated DatabrickspipelineSource.yield_pipeline_lineage_details() to use pipeline_id fallback when job_id unavailable
  • Tests:
    • Added test_databricks_dlt_pipeline_lineage() covering DLT pipeline lineage with and without column lineage
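The pipeline_id fallback described in the summary can be sketched roughly as follows. This is a hypothetical illustration of the behavior, not the actual `DatabrickspipelineSource` code; the function name `resolve_entity_id` and its signature are invented for this sketch:

```python
from typing import Optional


def resolve_entity_id(job_id: Optional[str], pipeline_id: Optional[str]) -> Optional[str]:
    """Prefer job_id; fall back to pipeline_id for DLT pipelines without one.

    Hypothetical helper mirroring the fallback described in
    yield_pipeline_lineage_details(): lineage rows for Databricks DLT
    pipelines carry a pipeline_id rather than a job_id.
    """
    return job_id if job_id is not None else pipeline_id


# A job run keeps its job_id; a DLT pipeline run falls back to pipeline_id.
assert resolve_entity_id("job-123", "pipe-abc") == "job-123"
assert resolve_entity_id(None, "pipe-abc") == "pipe-abc"
```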

This will update automatically on new commits.

@varun-lakhyani varun-lakhyani requested a review from a team as a code owner March 26, 2026 04:26
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@varun-lakhyani varun-lakhyani added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Mar 26, 2026
@github-actions
Contributor

github-actions bot commented Mar 26, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.12)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpam-modules CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-modules-bin CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-runtime CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam0g CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (39)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.12.7 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.13.4 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.15.2 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.16.1 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.spark:spark-core_2.12 CVE-2025-54920 🚨 HIGH 3.5.6 3.5.7
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
Authlib CVE-2026-27962 🔥 CRITICAL 1.6.6 1.6.9
Authlib CVE-2026-28490 🚨 HIGH 1.6.6 1.6.9
Authlib CVE-2026-28498 🚨 HIGH 1.6.6 1.6.9
Authlib CVE-2026-28802 🚨 HIGH 1.6.6 1.6.7
PyJWT CVE-2026-32597 🚨 HIGH 2.10.1 2.12.0
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiohttp CVE-2025-69223 🚨 HIGH 3.12.12 3.13.3
aiohttp CVE-2025-69223 🚨 HIGH 3.13.2 3.13.3
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6, 2.11.1
apache-airflow CVE-2026-26929 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-28779 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-30911 🚨 HIGH 3.1.5 3.1.8
apache-airflow-providers-http CVE-2025-69219 🚨 HIGH 5.6.0 6.0.0
azure-core CVE-2026-21226 🚨 HIGH 1.37.0 1.38.0
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
google-cloud-aiplatform CVE-2026-2472 🚨 HIGH 1.130.0 1.131.0
google-cloud-aiplatform CVE-2026-2473 🚨 HIGH 1.130.0 1.133.0
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
protobuf CVE-2026-0994 🚨 HIGH 4.25.8 6.33.5, 5.29.6
pyOpenSSL CVE-2026-27459 🚨 HIGH 24.1.0 26.0.0
pyasn1 CVE-2026-23490 🚨 HIGH 0.6.1 0.6.2
pyasn1 CVE-2026-30922 🚨 HIGH 0.6.1 0.6.3
python-multipart CVE-2026-24486 🚨 HIGH 0.0.20 0.0.22
ray CVE-2025-62593 🔥 CRITICAL 2.47.1 2.52.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
tornado CVE-2026-31958 🚨 HIGH 6.5.3 6.5.5
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: usr/bin/docker

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
stdlib CVE-2025-68121 🔥 CRITICAL v1.25.5 1.24.13, 1.25.7, 1.26.0-rc.3
stdlib CVE-2025-61726 🚨 HIGH v1.25.5 1.24.12, 1.25.6
stdlib CVE-2025-61728 🚨 HIGH v1.25.5 1.24.12, 1.25.6
stdlib CVE-2026-25679 🚨 HIGH v1.25.5 1.25.8, 1.26.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

@github-actions
Contributor

github-actions bot commented Mar 26, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.13)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (38)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.12.7 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.13.4 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-core GHSA-72hv-8253-57qq 🚨 HIGH 2.15.2 2.18.6, 2.21.1, 3.1.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.spark:spark-core_2.12 CVE-2025-54920 🚨 HIGH 3.5.6 3.5.7
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (15)

Package Vulnerability ID Severity Installed Version Fixed Version
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6, 2.11.1
apache-airflow CVE-2026-26929 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-28779 🚨 HIGH 3.1.5 3.1.8
apache-airflow CVE-2026-30911 🚨 HIGH 3.1.5 3.1.8
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
pyOpenSSL CVE-2026-27459 🚨 HIGH 24.1.0 26.0.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data_aut.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage_aut.yaml

No Vulnerabilities Found

@github-actions
Contributor

github-actions bot commented Mar 26, 2026

🟡 Playwright Results — all passed (29 flaky)

✅ 3410 passed · ❌ 0 failed · 🟡 29 flaky · ⏭️ 216 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 452 0 3 2
🟡 Shard 2 602 0 6 32
🟡 Shard 3 607 0 7 27
🟡 Shard 4 609 0 8 47
🟡 Shard 5 586 0 1 67
🟡 Shard 6 554 0 4 41
🟡 29 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Container - customization should work (shard 1, 1 retry)
  • Pages/AuditLogs.spec.ts › should apply both User and EntityType filters simultaneously (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataQuality/TestCaseImportExportE2eFlow.spec.ts › Admin: Complete export-import-validate flow (shard 2, 1 retry)
  • Features/DataQuality/TestCaseIncidentPermissions.spec.ts › User with only VIEW cannot PATCH incidents (shard 2, 1 retry)
  • Features/DataQuality/TestCaseResultPermissions.spec.ts › User with TEST_CASE.EDIT_ALL can see edit action on test case (shard 2, 1 retry)
  • Features/EntitySummaryPanel.spec.ts › should display summary panel for table (shard 2, 1 retry)
  • Features/EntitySummaryPanel.spec.ts › should display summary panel for tableColumn (shard 2, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Features/QueryEntity.spec.ts › Query Entity (shard 3, 1 retry)
  • Features/TestSuiteMultiPipeline.spec.ts › TestSuite multi pipeline support (shard 3, 1 retry)
  • Flow/CustomizeWidgets.spec.ts › Following Assets Widget (shard 3, 2 retries)
  • Flow/ExploreDiscovery.spec.ts › Should display deleted assets when showDeleted is checked and deleted is not present in queryFilter (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Flow/SchemaTable.spec.ts › schema table test (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContractInheritance.spec.ts › Full Contract Inheritance - Asset inherits full contract from Data Product (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for SearchIndex (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Verify domain tags and glossary terms (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with subdomains attached verifies subdomain accessibility (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with tags and glossary terms preserves associations (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Domain Rbac (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Glossary Term Add, Update and Remove (shard 4, 1 retry)
  • Pages/EntityDataSteward.spec.ts › User as Owner Add, Update and Remove (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)
  • VersionPages/EntityVersionPages.spec.ts › Directory (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Member

@ulixius9 ulixius9 left a comment


@varun-lakhyani can you add unit tests for this

@varun-lakhyani
Member Author

@varun-lakhyani can you add unit tests for this

Added now

@varun-lakhyani
Member Author

💡 Bug: for/else misuse: debug log triggers on normal loop completion
In yield_pipeline_lineage_details (metadata.py line 1331), there is a for/else construct on the for table_lineage in table_lineage_list loop. The else block logs "No source or target table full name found for {entity_id}", but Python's for/else triggers the else when the loop completes without a break — i.e., on every normal iteration through all items. This means the debug message fires after successfully processing all lineage entries, which is misleading. This is a pre-existing pattern, but the diff changes the variable used (entity_id replacing job_id), so worth noting.
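The for/else semantics behind this finding can be demonstrated in isolation (the loop body here is illustrative, not the actual metadata.py code):

```python
# Minimal demonstration of Python's for/else: the else clause runs whenever
# the loop finishes WITHOUT hitting break -- including normal, successful
# completion of every iteration.
def process_all(items):
    handled = []
    for item in items:
        if item is None:
            break  # only a break suppresses the else clause
        handled.append(item)
    else:
        # Reached on normal completion too, which is why using it for a
        # "nothing found" debug log fires even when every entry succeeded.
        handled.append("else-ran")
    return handled


assert process_all([1, 2]) == [1, 2, "else-ran"]  # else fires on success
assert process_all([1, None, 2]) == [1]           # break skips the else
```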


@gitar-bot Fixed in 3b8731d - the for/else block has been removed entirely.

@gitar-bot

gitar-bot bot commented Mar 26, 2026

Thanks for confirming! The for/else block was removed in 3b8731d, so the misleading debug log no longer fires on normal loop completion.

@varun-lakhyani varun-lakhyani requested a review from ulixius9 March 27, 2026 02:14
@varun-lakhyani
Member Author

Reviewing your code

Code Review ⚠️ Changes requested 3 resolved / 6 findings
Adds Databricks lineage pipeline support with some incomplete fixes: certification tags are silently lost during CSV import due to overly broad tag deletion, pagination may stop early if API responses lack total_entries field, and DAG detail building logic is duplicated across AirflowApiClient and MWAAClient.

⚠️ Bug: CSV import silently loses certification tags from tag_usage
updateTagsForImport() calls deleteTagsByTarget(fqn) which deletes ALL tags for the entity, including certification tags. When updateCertification() runs later, it compares the in-memory original and updated certification objects — if they are equal (certification unchanged), it returns without re-inserting the cert tag. The net result is that any entity with an existing certification will silently lose it during CSV bulk import.

The non-import updateTags() correctly filters out certification tags from addedTags/deletedTags (line 6767-6771), but updateTagsForImport() has no such guard because it does a blanket DELETE + re-INSERT.

To fix: either modify updateTagsForImport to preserve cert tags by using a more targeted delete that excludes certification prefix, or ensure updateCertification() always re-applies when the entity has a certification (regardless of whether it changed).

Suggested fix

protected void updateTagsForImport(
    String fqn, String fieldName, List<TagLabel> origTags, List<TagLabel> updatedTags) {
  origTags = listOrEmpty(origTags);
  updatedTags = Optional.ofNullable(updatedTags).orElse(new ArrayList<>());
  if (origTags.isEmpty() && updatedTags.isEmpty()) {
    return;
  }
  // Delete all tags except certification (handled by updateCertification)
  String certClassification = getCertificationClassification();
  if (certClassification != null) {
    daoCollection.tagUsageDAO().deleteNonCertTagsByTarget(fqn, certClassification);
  } else {
    daoCollection.tagUsageDAO().deleteTagsByTarget(fqn);
  }
  // ... rest unchanged

💡 Bug: Pagination stops early if API omits total_entries field
In both AirflowApiClient._paginate() and MWAAClient._paginate(), total is initialized to limit (e.g. 100). If the API response does not include total_entries, the fallback total = response.get("total_entries", len(result)) sets total to the number of items seen so far. When the first page is full (100 items), the offset advances to 100 and the loop exits, potentially missing remaining pages.

Airflow's stable API always returns total_entries, so this is mainly a risk with future API changes or MWAA quirks. Consider using a sentinel or checking if len(page) == limit as an additional continuation condition.
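The suggested continuation check can be sketched as a generic offset-based paginator. Note the client classes and the `total_entries` field come from the review finding itself (the PR author later disputes that they exist in this codebase), so this is a hedged illustration of the general pattern, not verified project code:

```python
# Offset-based pagination that terminates on an authoritative total when
# present, and otherwise falls back to "stop on the first short page" --
# the extra continuation condition suggested in the finding.
def paginate(fetch_page, limit=100):
    """Yield items across pages from fetch_page(offset=..., limit=...)."""
    offset = 0
    while True:
        response = fetch_page(offset=offset, limit=limit)
        page = response.get("items", [])
        yield from page
        total = response.get("total_entries")
        if total is not None and offset + len(page) >= total:
            break  # authoritative total says we are done
        if len(page) < limit:
            break  # short (or empty) page is the fallback stop signal
        offset += limit


# Even with total_entries absent, a 250-item source paginates fully instead
# of stopping after the first full page of 100.
data = list(range(250))
fake_api = lambda offset, limit: {"items": data[offset : offset + limit]}
assert list(paginate(fake_api)) == data
```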

💡 Quality: Duplicate DAG detail building logic in AirflowApiClient and MWAAClient
build_dag_details(), get_dag_runs(), get_task_instances_for_run(), _paginate(), and tag-parsing logic are nearly identical between AirflowApiClient and MWAAClient. This duplication makes it easy for bug fixes to be applied in one place but not the other (the pagination issue above is a good example). Consider extracting shared parsing/model-building logic into a common base class or utility module, keeping only the HTTP transport layer different.

✅ 3 resolved
✅ Quality: Incomplete rename: query constants and alias still say 'job'

📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:90 📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:92 📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:103 📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:105 📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:351 📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:379
The renaming from job-specific to entity-generic was applied to the client's instance variables and method parameters, but the SQL query constants are still named DATABRICKS_GET_TABLE_LINEAGE_FOR_JOB / DATABRICKS_GET_COLUMN_LINEAGE_FOR_JOB, and the column alias is still entity_id AS job_id. The cache_lineage method also still accesses row.job_id (line 351, 379). While this works functionally (because the alias ensures the attribute name), it's inconsistent with the rest of the renaming effort and will confuse future readers who see 'job' in the query but 'entity' in the client code.

✅ Bug: Log message still says "all jobs" instead of "all entities"

📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:289
At line 289 of client.py, the log message reads "performing bulk lineage fetch for all jobs" but this method now fetches lineage for both jobs and pipelines. This is a minor inconsistency with the rename from job-specific to entity-generic terminology applied elsewhere in the PR.

✅ Bug: for/else misuse: debug log triggers on normal loop completion

📄 ingestion/src/metadata/ingestion/source/pipeline/databrickspipeline/metadata.py:1331-1334
In yield_pipeline_lineage_details (metadata.py line 1331), there is a for/else construct on the for table_lineage in table_lineage_list loop. The else block logs "No source or target table full name found for {entity_id}", but Python's for/else triggers the else when the loop completes without a break — i.e., on every normal iteration through all items. This means the debug message fires after successfully processing all lineage entries, which is misleading. This is a pre-existing pattern, but the diff changes the variable used (entity_id replacing job_id), so worth noting.

🤖 Prompt for agents

Code Review: Adds Databricks lineage pipeline support with some incomplete fixes: certification tags are silently lost during CSV import due to overly broad tag deletion, pagination may stop early if API responses lack total_entries field, and DAG detail building logic is duplicated across AirflowApiClient and MWAAClient.

1. ⚠️ Bug: CSV import silently loses certification tags from tag_usage

   `updateTagsForImport()` calls `deleteTagsByTarget(fqn)` which deletes ALL tags for the entity, including certification tags. When `updateCertification()` runs later, it compares the in-memory original and updated certification objects — if they are equal (certification unchanged), it returns without re-inserting the cert tag. The net result is that any entity with an existing certification will silently lose it during CSV bulk import.
   
   The non-import `updateTags()` correctly filters out certification tags from addedTags/deletedTags (line 6767-6771), but `updateTagsForImport()` has no such guard because it does a blanket DELETE + re-INSERT.
   
   To fix: either modify `updateTagsForImport` to preserve cert tags by using a more targeted delete that excludes certification prefix, or ensure `updateCertification()` always re-applies when the entity has a certification (regardless of whether it changed).

   Suggested fix:
   protected void updateTagsForImport(
       String fqn, String fieldName, List<TagLabel> origTags, List<TagLabel> updatedTags) {
     origTags = listOrEmpty(origTags);
     updatedTags = Optional.ofNullable(updatedTags).orElse(new ArrayList<>());
     if (origTags.isEmpty() && updatedTags.isEmpty()) {
       return;
     }
     // Delete all tags except certification (handled by updateCertification)
     String certClassification = getCertificationClassification();
     if (certClassification != null) {
       daoCollection.tagUsageDAO().deleteNonCertTagsByTarget(fqn, certClassification);
     } else {
       daoCollection.tagUsageDAO().deleteTagsByTarget(fqn);
     }
     // ... rest unchanged

2. 💡 Bug: Pagination stops early if API omits total_entries field

   In both `AirflowApiClient._paginate()` and `MWAAClient._paginate()`, `total` is initialized to `limit` (e.g. 100). If the API response does not include `total_entries`, the fallback `total = response.get("total_entries", len(result))` sets total to the number of items seen so far. When the first page is full (100 items), the offset advances to 100 and the loop exits, potentially missing remaining pages.
   
   Airflow's stable API always returns `total_entries`, so this is mainly a risk with future API changes or MWAA quirks. Consider using a sentinel or checking if `len(page) == limit` as an additional continuation condition.

3. 💡 Quality: Duplicate DAG detail building logic in AirflowApiClient and MWAAClient

   `build_dag_details()`, `get_dag_runs()`, `get_task_instances_for_run()`, `_paginate()`, and tag-parsing logic are nearly identical between `AirflowApiClient` and `MWAAClient`. This duplication makes it easy for bug fixes to be applied in one place but not the other (the pagination issue above is a good example). Consider extracting shared parsing/model-building logic into a common base class or utility module, keeping only the HTTP transport layer different.


@gitar-bot All three findings are false positives: AirflowApiClient/MWAAClient don't exist in this codebase, total_entries appears nowhere in the code, and certification tags are stored in entity JSON (not tag_usage), so deleteTagsByTarget doesn't affect them

@varun-lakhyani
Member Author

Code Review ⚠️ Changes requested 3 resolved / 8 findings
Adds Databricks lineage pipeline support but has three open issues: CSV import silently loses certification tags due to deleteTagsByTarget() removing all tags, pagination stops early when API omits total_entries field, and duplicate DAG detail building logic across AirflowApiClient and MWAAClient. Two items were resolved (incomplete rename and log message).

⚠️ Bug: CSV import silently loses certification tags from tag_usage
updateTagsForImport() calls deleteTagsByTarget(fqn) which deletes ALL tags for the entity, including certification tags. When updateCertification() runs later, it compares the in-memory original and updated certification objects — if they are equal (certification unchanged), it returns without re-inserting the cert tag. The net result is that any entity with an existing certification will silently lose it during CSV bulk import.

The non-import updateTags() correctly filters out certification tags from addedTags/deletedTags (line 6767-6771), but updateTagsForImport() has no such guard because it does a blanket DELETE + re-INSERT.

To fix: either modify updateTagsForImport to preserve cert tags by using a more targeted delete that excludes certification prefix, or ensure updateCertification() always re-applies when the entity has a certification (regardless of whether it changed).

Suggested fix

protected void updateTagsForImport(
    String fqn, String fieldName, List<TagLabel> origTags, List<TagLabel> updatedTags) {
  origTags = listOrEmpty(origTags);
  updatedTags = Optional.ofNullable(updatedTags).orElse(new ArrayList<>());
  if (origTags.isEmpty() && updatedTags.isEmpty()) {
    return;
  }
  // Delete all tags except certification (handled by updateCertification)
  String certClassification = getCertificationClassification();
  if (certClassification != null) {
    daoCollection.tagUsageDAO().deleteNonCertTagsByTarget(fqn, certClassification);
  } else {
    daoCollection.tagUsageDAO().deleteTagsByTarget(fqn);
  }
  // ... rest unchanged

⚠️ Bug: SLF4J migration drops exception message from log text
Several String.format → SLF4J conversions have one fewer {} placeholder than arguments. When the last argument is a Throwable, SLF4J treats it as the exception for stack-trace logging and does NOT substitute it into the message. This means the exception's message text is silently dropped from the log line.

Affected locations:

  • OpenMetadataConnectionBuilder.java:80: message ends with "due to " but has no {} for the exception. Original format: "Could not initialize bot for pipeline [%s] due to [%s]".
  • v150/MigrationUtil.java:108: "Error updating automator [{}] due to " — missing second {} for ex.
  • v159/MigrationUtil.java:65: "Error migrating app extension [{}] due to " — same issue.

In each case the exception stack trace is still printed (because SLF4J auto-detects the trailing Throwable), but the human-readable message loses the exception text, making quick log scanning harder.

Suggested fix

// OpenMetadataConnectionBuilder.java:80
LOG.warn(
    "Could not initialize bot for pipeline [{}] due to [{}]",
    ingestionPipeline.getPipelineType(),
    e.getMessage(),
    e);

// v150/MigrationUtil.java:108
LOG.warn("Error updating automator [{}] due to [{}]", row, ex.getMessage(), ex);

// v159/MigrationUtil.java:65
LOG.warn("Error migrating app extension [{}] due to [{}]", row, ex.getMessage(), ex);

💡 Bug: Pagination stops early if API omits total_entries field
In both AirflowApiClient._paginate() and MWAAClient._paginate(), total is initialized to limit (e.g. 100). If the API response does not include total_entries, the fallback total = response.get("total_entries", len(result)) sets total to the number of items seen so far. When the first page is full (100 items), the offset advances to 100 and the loop exits, potentially missing remaining pages.

Airflow's stable API always returns total_entries, so this is mainly a risk with future API changes or MWAA quirks. Consider using a sentinel or checking if len(page) == limit as an additional continuation condition.

💡 Quality: Duplicate DAG detail building logic in AirflowApiClient and MWAAClient
build_dag_details(), get_dag_runs(), get_task_instances_for_run(), _paginate(), and tag-parsing logic are nearly identical between AirflowApiClient and MWAAClient. This duplication makes it easy for bug fixes to be applied in one place but not the other (the pagination issue above is a good example). Consider extracting shared parsing/model-building logic into a common base class or utility module, keeping only the HTTP transport layer different.
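One way to factor the shared logic is an abstract base holding the parsing and pagination, with subclasses supplying only transport. The class and endpoint names below are illustrative, not the actual OpenMetadata classes:

```python
from abc import ABC, abstractmethod


class BaseAirflowRestClient(ABC):
    """Shared pagination/parsing; subclasses provide only the HTTP layer."""

    @abstractmethod
    def _get(self, endpoint: str, params: dict) -> dict:
        """Perform one GET request and return the decoded JSON body."""

    def _paginate(self, endpoint: str, key: str, limit: int = 100) -> list:
        result, offset = [], 0
        while True:
            response = self._get(endpoint, {"limit": limit, "offset": offset})
            page = response.get(key, [])
            result.extend(page)
            offset += len(page)
            total = response.get("total_entries")
            if not page or (total is not None and offset >= total):
                break
            if total is None and len(page) < limit:
                break
        return result

    def get_dag_runs(self, dag_id: str) -> list:
        # Shared model-building would live here, written once
        return self._paginate(f"dags/{dag_id}/dagRuns", key="dag_runs")


class InMemoryClient(BaseAirflowRestClient):
    """Toy transport standing in for AirflowApiClient/MWAAClient."""

    def __init__(self, data):
        self.data = data

    def _get(self, endpoint, params):
        items = self.data[endpoint]
        page = items[params["offset"]: params["offset"] + params["limit"]]
        return {"dag_runs": page, "total_entries": len(items)}
```

A fix to `_paginate` then lands in both clients at once, instead of needing to be applied twice.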

💡 Quality: Log messages missing separator before placeholder
Two LOG.error calls concatenate the message text directly with the exception message with no space, colon, or other separator before {}:

  • "Error in sending invite to User{}" → renders as "Error in sending invite to UserNullPointerException"
  • "Error in sending mail for reset password{}" → similar.

This was inherited from the original string-concatenation code, but the SLF4J migration was an opportunity to fix it.

Suggested fix

LOG.error("Error in sending invite to User: {}", ex.getMessage());
LOG.error("Error in sending mail for reset password: {}", ex.getMessage());

✅ 3 resolved
✅ Quality: Incomplete rename: query constants and alias still say 'job'

📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:90 📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:92 📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:103 📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:105 📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:351 📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:379
The renaming from job-specific to entity-generic was applied to the client's instance variables and method parameters, but the SQL query constants are still named DATABRICKS_GET_TABLE_LINEAGE_FOR_JOB / DATABRICKS_GET_COLUMN_LINEAGE_FOR_JOB, and the column alias is still entity_id AS job_id. The cache_lineage method also still accesses row.job_id (lines 351 and 379). This works functionally, because the alias preserves the attribute name, but it is inconsistent with the rest of the renaming effort and will confuse future readers who see 'job' in the query while the client code says 'entity'.

✅ Bug: Log message still says "all jobs" instead of "all entities"

📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:289
At line 289 of client.py, the log message reads "performing bulk lineage fetch for all jobs" but this method now fetches lineage for both jobs and pipelines. This is a minor inconsistency with the rename from job-specific to entity-generic terminology applied elsewhere in the PR.

✅ Bug: for/else misuse: debug log triggers on normal loop completion

📄 ingestion/src/metadata/ingestion/source/pipeline/databrickspipeline/metadata.py:1331-1334
In yield_pipeline_lineage_details (metadata.py line 1331), a for/else construct wraps the for table_lineage in table_lineage_list loop. The else block logs "No source or target table full name found for {entity_id}", but Python's for/else runs the else whenever the loop finishes without a break — that is, after every normal pass through all items. As a result, the debug message fires even after all lineage entries are processed successfully, which is misleading. This is a pre-existing pattern, but the diff changes the variable used (entity_id replacing job_id), so it is worth noting.
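A minimal illustration of the pitfall and one fix, replacing for/else with an explicit flag. The function and variable names are illustrative, not the actual metadata.py code:

```python
def process_lineage(table_lineage_list, entity_id, log):
    """Yield (source, target) pairs; log once only if none were found."""
    found = False
    results = []
    for table_lineage in table_lineage_list:
        source = table_lineage.get("source")
        target = table_lineage.get("target")
        if source and target:
            found = True
            results.append((source, target))
    if not found:
        # A for/else `else` would run here after EVERY normal completion;
        # the flag restricts the message to the genuinely-empty case.
        log(f"No source or target table full name found for {entity_id}")
    return results
```

With for/else, the `else` body is skipped only when the loop exits via `break`, so a loop that never breaks always logs the "not found" message.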



@gitar-bot these are all unrelated issues caused by Update Branch, not pr changes

@gitar-bot

gitar-bot bot commented Mar 30, 2026

Code Review ⚠️ Changes requested 3 resolved / 8 findings

Databricks lineage pipeline addition addresses the incomplete rename and log message issues, but three critical problems remain: CSV import silently drops certification tags during update, pagination stops early when total_entries is omitted, and SLF4J migration loses exception messages in logs.

⚠️ Bug: CSV import silently loses certification tags from tag_usage

updateTagsForImport() calls deleteTagsByTarget(fqn) which deletes ALL tags for the entity, including certification tags. When updateCertification() runs later, it compares the in-memory original and updated certification objects — if they are equal (certification unchanged), it returns without re-inserting the cert tag. The net result is that any entity with an existing certification will silently lose it during CSV bulk import.

The non-import updateTags() correctly filters out certification tags from addedTags/deletedTags (line 6767-6771), but updateTagsForImport() has no such guard because it does a blanket DELETE + re-INSERT.

To fix: either modify updateTagsForImport to preserve cert tags by using a more targeted delete that excludes certification prefix, or ensure updateCertification() always re-applies when the entity has a certification (regardless of whether it changed).

Suggested fix
protected void updateTagsForImport(
    String fqn, String fieldName, List<TagLabel> origTags, List<TagLabel> updatedTags) {
  origTags = listOrEmpty(origTags);
  updatedTags = Optional.ofNullable(updatedTags).orElse(new ArrayList<>());
  if (origTags.isEmpty() && updatedTags.isEmpty()) {
    return;
  }
  // Delete all tags except certification (handled by updateCertification)
  String certClassification = getCertificationClassification();
  if (certClassification != null) {
    daoCollection.tagUsageDAO().deleteNonCertTagsByTarget(fqn, certClassification);
  } else {
    daoCollection.tagUsageDAO().deleteTagsByTarget(fqn);
  }
  // ... rest unchanged
⚠️ Bug: SLF4J migration drops exception message from log text

Several String.format → SLF4J conversions have one fewer {} placeholder than arguments. When the last argument is a Throwable, SLF4J treats it as the exception for stack-trace logging and does NOT substitute it into the message. This means the exception's message text is silently dropped from the log line.

Affected locations:

  • OpenMetadataConnectionBuilder.java:80: message ends with "due to " but has no {} for the exception. Original format: "Could not initialize bot for pipeline [%s] due to [%s]".
  • v150/MigrationUtil.java:108: "Error updating automator [{}] due to " — missing second {} for ex.
  • v159/MigrationUtil.java:65: "Error migrating app extension [{}] due to " — same issue.

In each case the exception stack trace is still printed (because SLF4J auto-detects the trailing Throwable), but the human-readable message loses the exception text, making quick log scanning harder.

Suggested fix
// OpenMetadataConnectionBuilder.java:80
LOG.warn(
    "Could not initialize bot for pipeline [{}] due to [{}]",
    ingestionPipeline.getPipelineType(),
    e.getMessage(),
    e);

// v150/MigrationUtil.java:108
LOG.warn("Error updating automator [{}] due to [{}]", row, ex.getMessage(), ex);

// v159/MigrationUtil.java:65
LOG.warn("Error migrating app extension [{}] due to [{}]", row, ex.getMessage(), ex);
💡 Bug: Pagination stops early if API omits total_entries field

In both AirflowApiClient._paginate() and MWAAClient._paginate(), total is initialized to limit (e.g. 100). If the API response does not include total_entries, the fallback total = response.get("total_entries", len(result)) sets total to the number of items seen so far. When the first page is full (100 items), the offset advances to 100 and the loop exits, potentially missing remaining pages.

Airflow's stable API always returns total_entries, so this is mainly a risk with future API changes or MWAA quirks. Consider using a sentinel or checking if len(page) == limit as an additional continuation condition.

💡 Quality: Duplicate DAG detail building logic in AirflowApiClient and MWAAClient

build_dag_details(), get_dag_runs(), get_task_instances_for_run(), _paginate(), and tag-parsing logic are nearly identical between AirflowApiClient and MWAAClient. This duplication makes it easy for bug fixes to be applied in one place but not the other (the pagination issue above is a good example). Consider extracting shared parsing/model-building logic into a common base class or utility module, keeping only the HTTP transport layer different.

💡 Quality: Log messages missing separator before placeholder

Two LOG.error calls concatenate the message text directly with the exception message with no space, colon, or other separator before {}:

  • "Error in sending invite to User{}" → renders as "Error in sending invite to UserNullPointerException"
  • "Error in sending mail for reset password{}" → similar.

This was inherited from the original string-concatenation code, but the SLF4J migration was an opportunity to fix it.

Suggested fix
LOG.error("Error in sending invite to User: {}", ex.getMessage());
LOG.error("Error in sending mail for reset password: {}", ex.getMessage());
✅ 3 resolved
Quality: Incomplete rename: query constants and alias still say 'job'

📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:90 📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:92 📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:103 📄 ingestion/src/metadata/ingestion/source/database/databricks/queries.py:105 📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:351 📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:379
The renaming from job-specific to entity-generic was applied to the client's instance variables and method parameters, but the SQL query constants are still named DATABRICKS_GET_TABLE_LINEAGE_FOR_JOB / DATABRICKS_GET_COLUMN_LINEAGE_FOR_JOB, and the column alias is still entity_id AS job_id. The cache_lineage method also still accesses row.job_id (lines 351 and 379). This works functionally, because the alias preserves the attribute name, but it is inconsistent with the rest of the renaming effort and will confuse future readers who see 'job' in the query while the client code says 'entity'.

Bug: Log message still says "all jobs" instead of "all entities"

📄 ingestion/src/metadata/ingestion/source/database/databricks/client.py:289
At line 289 of client.py, the log message reads "performing bulk lineage fetch for all jobs" but this method now fetches lineage for both jobs and pipelines. This is a minor inconsistency with the rename from job-specific to entity-generic terminology applied elsewhere in the PR.

Bug: for/else misuse: debug log triggers on normal loop completion

📄 ingestion/src/metadata/ingestion/source/pipeline/databrickspipeline/metadata.py:1331-1334
In yield_pipeline_lineage_details (metadata.py line 1331), a for/else construct wraps the for table_lineage in table_lineage_list loop. The else block logs "No source or target table full name found for {entity_id}", but Python's for/else runs the else whenever the loop finishes without a break — that is, after every normal pass through all items. As a result, the debug message fires even after all lineage entries are processed successfully, which is misleading. This is a pre-existing pattern, but the diff changes the variable used (entity_id replacing job_id), so it is worth noting.



@sonarqubecloud


Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Databricks Pipeline Lineage Missing

2 participants