Skip to content

feat(incident): add IncidentTcrsSyncHandler for Task→TCRS sync#26759

Merged
IceS2 merged 41 commits intofeat/incident-lifecycle-workflowfrom
feat/ilw-item2-incident-tcrs-sync-hook
Mar 31, 2026
Merged

feat(incident): add IncidentTcrsSyncHandler for Task→TCRS sync#26759
IceS2 merged 41 commits intofeat/incident-lifecycle-workflowfrom
feat/ilw-item2-incident-tcrs-sync-hook

Conversation

@IceS2
Copy link
Copy Markdown
Contributor

@IceS2 IceS2 commented Mar 25, 2026

Summary

  • Adds aboutEntityLink field to Task schema (EntityLink format encoding testCase FQN + incident stateId), backed by generated DB columns and index in MySQL/PostgreSQL
  • Implements IncidentTcrsSyncHandler — an async EntityLifecycleEventHandler that syncs Task lifecycle events to TCRS records (New on creation, Ack on InProgress, Assigned on assignee change, Resolved on terminal status)
  • Adds TCRS guard in openOrAssignTask() to prevent duplicate Task creation when a workflow-managed Task already exists
  • SetupImpl builds aboutEntityLink for IncidentResolution/TestCaseResolution task types
  • Adds testCaseStatus to WorkflowTriggerFields enum (matches the actual field name in entity changeDescription)

Test plan

  • IncidentTcrsSyncHandlerTest — 10 unit tests covering isIncidentTask() boundary conditions, handler properties, and EntityLink parsing
  • ManualTaskOutboxIT — E2E integration test verifying full pipeline: workflow trigger → Task creation with aboutEntityLink → TCRS(Ack) on InProgress → TCRS(Resolved) on Completed → workflow FINISHED

IceS2 and others added 30 commits March 16, 2026 14:41
Adds a generic, configurable-status human task node for governance
workflows. The node creates an OM Task, waits for status transitions
via IntermediateCatchEvent messages, and routes based on terminal vs
non-terminal statuses.

Key components:
- ManualTask.java: BPMN subprocess builder (setup → wait → route → close)
- SetupDelegate/SetupImpl: Task creation, idempotent on cycle re-entry
- CheckTerminalDelegate: Validates status against template
- CloseTaskDelegate/CloseTaskImpl: Closes task on terminal status
- SetResultDelegate: Propagates status to parent for edge routing
- ManualTaskTemplateResolver: Template-based status configuration
- ManualTaskDefinition JSON schema + nodeType/nodeSubType registration

The node is domain-agnostic — incident/approval behavior lives in the
workflow graph around the node, not inside it.
Remove inputNamespaceMapExpr and configMapExpr from BaseDelegate.
Each delegate now declares its own Expression fields, preventing
NullPointerExceptions in delegates that don't use these fields
(e.g., SetResultDelegate, CheckTerminalDelegate, CloseTaskDelegate).
- Fix: isAlreadyClosed now only checks task.getResolution() != null.
  Previously it also checked terminalStatuses.contains(currentStatus),
  which always returned true when CloseTask runs (the PATCH already set
  the terminal status), leaving tasks permanently unresolved.
- Remove unused terminalStatuses parameter from closeTask/CloseTaskDelegate
- Rename taskCreated → taskAlreadyExists for clarity: the variable means
  "should we enter the message-waiting phase" (true on re-entry, false
  on first creation)
Implements the bridge that connects Task status changes to the
ManualTask workflow node via Flowable message delivery.

Bridge chain:
  TaskRepository.postUpdate() detects status change
  → TaskWorkflowHandler.transitionManualTaskStatus() (sends updatedBy)
  → WorkflowHandler.sendManualTaskMessage() with async exponential retry

Key design decisions:
- postUpdate wrapped in try-catch: workflow failures never break PATCH
- Async retry via ScheduledExecutorService + resilience4j IntervalFunction:
  500ms → 1s → 2s → 4s → 5s cap (~12.5s total coverage)
- First attempt synchronous (fast path), retries non-blocking
- CloseTaskImpl uses actual user from PATCH, falls back to governance-bot

Also includes:
- WorkflowDefinitionRepository/WorkflowInstanceStateRepository updates
- CollectionDAO, ListFilter, EntityResource supporting changes
- SQL migration (2.0.0) for stageResult generated column
- ManualTaskWorkflowTest: full E2E lifecycle test
- Fix: catch FlowableOptimisticLockingException in tryDeliverMessage and
  return false to trigger retry (concurrent modification means another
  thread may have consumed the subscription)
- Fix: nonTerminalReachable BFS now iterates the full edges list at every
  step, not just the unfiltered outgoingEdges map. Prevents following
  terminal-condition edges from intermediate nodes.
- Refactor: hoist IntervalFunction to a static final constant instead of
  recreating on each retry. Made interval fields final.
- Fix: remove IF NOT EXISTS from PostgreSQL migration for consistency
  with MySQL pattern (Flyway handles migration idempotency)
…tConsumer

Add Entity.TASK to validEntityTypes, detect workflow-managed task status
changes via isWorkflowManagedTaskStatusChange, and enqueue them to the
outbox table via enqueueTaskMessage before the existing signal broadcast.
Add 4 reflection-based unit tests for isWorkflowManagedTaskStatusChange
covering early-return conditions: non-update events, non-task entity types,
missing changeDescription, and non-status field changes.
Remove the TaskRepository.postUpdate override that synchronously called
TransitionManualTaskStatus, the TaskWorkflowHandler.transitionManualTaskStatus
method it depended on, and the WorkflowHandler async retry infrastructure
(sendManualTaskMessage, scheduleMessageRetry, tryDeliverMessage, and their
backing constants and ScheduledExecutorService). Task status transitions are
now delivered exclusively via the Transactional Outbox pattern.
…andler

Start the drainer after the process engine is built in the constructor.
Restart it when initializeNewProcessEngine() rebuilds the engine at
runtime. Shut it down gracefully via WorkflowHandler.shutdown(), which
is called from ManagedShutdown.stop() in OpenMetadataApplication.
…tency

The E2E test must tolerate up to 10s CE poll + 30s drainer poll plus
margin. Raise all Awaitility atMost() values to 90 seconds.
…roadcast disruption

A DB failure during outbox INSERT should not prevent the signal broadcast
path from executing. Log the error and continue.
…an older row

SKIP LOCKED skips individual locked rows, not entire task groups. Without
this guard, Worker B could grab a newer status while Worker A still holds
the oldest — violating per-task ordering. The fix queries the absolute
oldest createdAt (no lock) and skips the task if the locked row is newer.
…ycle

Collapse findDistinctPendingTaskIds + per-task findAndLockOldestPending +
per-task findOldestPendingCreatedAt into a single findAndLockAllOldestPending
query using MIN(createdAt) JOIN with FOR UPDATE SKIP LOCKED. Per-task
ordering is preserved naturally: if the oldest row for a task is locked by
another worker, the JOIN produces no match for that task.
C2: Replace MIN(createdAt) JOIN with ROW_NUMBER() PARTITION BY taskId
    to guarantee exactly one row per task even with identical timestamps.

C3: Split bulk-lock transaction into bulk-read (no lock) + per-entry
    transactions. Row locks now held only during single-entry delivery,
    not the entire batch. Flowable API calls no longer inside a DB
    transaction holding locks on other rows.

I2: Add MAX_ATTEMPTS=100. Entries exceeding this are excluded from the
    drain query and effectively dead-lettered for investigation.

I3: Call cleanupDelivered() at end of each drain cycle with 7-day
    retention to prevent unbounded table growth.

I4: Extract workflowInstanceId from ChangeEvent entity payload instead
    of fetching from DB. Eliminates extra round trip per task status
    change event.

I5: Move signal broadcast before outbox enqueue so it always fires.
    Wrap enqueue in resilience4j retry (3 attempts) for transient DB
    errors. Unhandled failure propagates to event publisher for retry.
Add LIMIT 500 with ORDER BY attempts ASC, createdAt ASC to prevent
unbounded result sets and prioritize fresh messages over stuck ones.
Separate cleanup into its own try-catch for cleaner error diagnostics.
…OutboxIT

Move E2E test to integration-tests module where the full application stack
is running (CE pipeline, schedulers, drainer). The test verifies the
complete outbox delivery pipeline through observable outcomes:

1. Deploy workflow → create table → workflow triggers → task created
2. PATCH task InProgress → PATCH task Completed
3. Poll workflow instance until FINISHED status
4. Assert stage results contain expected status transitions
@github-actions
Copy link
Copy Markdown
Contributor

✅ TypeScript Types Auto-Updated

The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

…ncident-tcrs-sync-hook

Resolve conflicts in SetupImpl.java and ManualTaskOutboxIT.java by
keeping HEAD (aboutEntityLink + TCRS sync additions).
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 30, 2026

❌ Lint Check Failed — ESLint + Prettier (core-components)

The following files have style issues that need to be fixed:

Fix locally (fast — only for changed files in the branch):

make ui-checkstyle-core-components-changed

Or to fix all files:

make ui-checkstyle-core-components

@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.


-- aboutEntityLink: hierarchical entity identity for lifecycle handler lookups
ALTER TABLE task_entity ADD COLUMN IF NOT EXISTS aboutEntityLink varchar(1024)
GENERATED ALWAYS AS (json_unquote(json_extract(`json`, _utf8mb4'$.aboutEntityLink'))) STORED;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is mysql not utf8 by default?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be, but it does not hurt to be defensive, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In sql files okay, in run time for other cases when we know the value is so and so, it's better to code that way rather than being defensive, makes debugging a pain. As this is in sql file, it's okay!

String testCaseFqn = link.getEntityFQN();
UUID stateId = UUID.fromString(link.getArrayFieldName());

if (tcrRecordExists(stateId)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we name the functions in detail, I had to recall what tcr is for 1-2 mins which is not helping in debugging

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fair, will do it!

…timize BFS

- Rename tcrRecordExists/insertTcrsRecord/mapTaskChangeToTcrsType to
  descriptive names (incidentResolutionStatusExists, etc.) per reviewer
  feedback on abbreviation clarity
- Guard extractStringValue against single-char edge case
- Use outgoingEdges adjacency list in nonTerminalReachable BFS instead
  of scanning all edges per node (O(N+E) vs O(N*E))
@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Mar 31, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Adds IncidentTcrsSyncHandler for syncing Tasks to TCRS with proper error handling for string edge cases and optimized graph traversal. All findings resolved.

✅ 2 resolved
Edge Case: extractStringValue crashes on single-quote string edge case

📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/handlers/IncidentTcrsSyncHandler.java:226-228
In IncidentTcrsSyncHandler.extractStringValue(), the expression s.substring(1, s.length() - 1) is called when s.startsWith("""). If s is exactly " (length 1), this becomes substring(1, 0) which throws StringIndexOutOfBoundsException. While unlikely in practice (FieldChange status values are well-formed), this is a latent crash in the lifecycle event handler.

Performance: nonTerminalReachable BFS iterates all edges per node (O(N*E))

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/WorkflowDefinitionRepository.java:420-430
In WorkflowDefinitionRepository.nonTerminalReachable(), the BFS inner loop iterates over ALL edges for each node in the queue to find outgoing edges: for (EdgeDefinition edge : edges). Since outgoingEdges adjacency list is already built at line 250-267 and passed to buildCycleCheckGraph, it could be reused here instead of scanning all edges. This is O(N*E) instead of O(N+E). Only impacts validation time, not a hot path, but easy to fix.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@github-actions
Copy link
Copy Markdown
Contributor

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@IceS2 IceS2 merged commit 6b6a1a7 into feat/incident-lifecycle-workflow Mar 31, 2026
21 of 50 checks passed
@IceS2 IceS2 deleted the feat/ilw-item2-incident-tcrs-sync-hook branch March 31, 2026 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants