feat(incident): add IncidentTcrsSyncHandler for Task→TCRS sync#26759
Conversation
Adds a generic, configurable-status human task node for governance workflows. The node creates an OM Task, waits for status transitions via IntermediateCatchEvent messages, and routes based on terminal vs non-terminal statuses. Key components: - ManualTask.java: BPMN subprocess builder (setup → wait → route → close) - SetupDelegate/SetupImpl: Task creation, idempotent on cycle re-entry - CheckTerminalDelegate: Validates status against template - CloseTaskDelegate/CloseTaskImpl: Closes task on terminal status - SetResultDelegate: Propagates status to parent for edge routing - ManualTaskTemplateResolver: Template-based status configuration - ManualTaskDefinition JSON schema + nodeType/nodeSubType registration The node is domain-agnostic — incident/approval behavior lives in the workflow graph around the node, not inside it.
Remove inputNamespaceMapExpr and configMapExpr from BaseDelegate. Each delegate now declares its own Expression fields, preventing NullPointerExceptions in delegates that don't use these fields (e.g., SetResultDelegate, CheckTerminalDelegate, CloseTaskDelegate).
- Fix: isAlreadyClosed now only checks task.getResolution() != null. Previously it also checked terminalStatuses.contains(currentStatus), which always returned true when CloseTask runs (the PATCH already set the terminal status), leaving tasks permanently unresolved. - Remove unused terminalStatuses parameter from closeTask/CloseTaskDelegate - Rename taskCreated → taskAlreadyExists for clarity: the variable means "should we enter the message-waiting phase" (true on re-entry, false on first creation)
Implements the bridge that connects Task status changes to the ManualTask workflow node via Flowable message delivery. Bridge chain: TaskRepository.postUpdate() detects status change → TaskWorkflowHandler.transitionManualTaskStatus() (sends updatedBy) → WorkflowHandler.sendManualTaskMessage() with async exponential retry Key design decisions: - postUpdate wrapped in try-catch: workflow failures never break PATCH - Async retry via ScheduledExecutorService + resilience4j IntervalFunction: 500ms → 1s → 2s → 4s → 5s cap (~12.5s total coverage) - First attempt synchronous (fast path), retries non-blocking - CloseTaskImpl uses actual user from PATCH, falls back to governance-bot Also includes: - WorkflowDefinitionRepository/WorkflowInstanceStateRepository updates - CollectionDAO, ListFilter, EntityResource supporting changes - SQL migration (2.0.0) for stageResult generated column - ManualTaskWorkflowTest: full E2E lifecycle test
- Fix: catch FlowableOptimisticLockingException in tryDeliverMessage and return false to trigger retry (concurrent modification means another thread may have consumed the subscription) - Fix: nonTerminalReachable BFS now iterates the full edges list at every step, not just the unfiltered outgoingEdges map. Prevents following terminal-condition edges from intermediate nodes. - Refactor: hoist IntervalFunction to a static final constant instead of recreating on each retry. Made interval fields final. - Fix: remove IF NOT EXISTS from PostgreSQL migration for consistency with MySQL pattern (Flyway handles migration idempotency)
…tConsumer Add Entity.TASK to validEntityTypes, detect workflow-managed task status changes via isWorkflowManagedTaskStatusChange, and enqueue them to the outbox table via enqueueTaskMessage before the existing signal broadcast.
Add 4 reflection-based unit tests for isWorkflowManagedTaskStatusChange covering early-return conditions: non-update events, non-task entity types, missing changeDescription, and non-status field changes.
Remove the TaskRepository.postUpdate override that synchronously called TransitionManualTaskStatus, the TaskWorkflowHandler.transitionManualTaskStatus method it depended on, and the WorkflowHandler async retry infrastructure (sendManualTaskMessage, scheduleMessageRetry, tryDeliverMessage, and their backing constants and ScheduledExecutorService). Task status transitions are now delivered exclusively via the Transactional Outbox pattern.
…andler Start the drainer after the process engine is built in the constructor. Restart it when initializeNewProcessEngine() rebuilds the engine at runtime. Shut it down gracefully via WorkflowHandler.shutdown(), which is called from ManagedShutdown.stop() in OpenMetadataApplication.
…tency The E2E test must tolerate up to 10s CE poll + 30s drainer poll plus margin. Raise all Awaitility atMost() values to 90 seconds.
…roadcast disruption A DB failure during outbox INSERT should not prevent the signal broadcast path from executing. Log the error and continue.
…an older row SKIP LOCKED skips individual locked rows, not entire task groups. Without this guard, Worker B could grab a newer status while Worker A still holds the oldest — violating per-task ordering. The fix queries the absolute oldest createdAt (no lock) and skips the task if the locked row is newer.
…ycle Collapse findDistinctPendingTaskIds + per-task findAndLockOldestPending + per-task findOldestPendingCreatedAt into a single findAndLockAllOldestPending query using MIN(createdAt) JOIN with FOR UPDATE SKIP LOCKED. Per-task ordering is preserved naturally: if the oldest row for a task is locked by another worker, the JOIN produces no match for that task.
C2: Replace MIN(createdAt) JOIN with ROW_NUMBER() PARTITION BY taskId
to guarantee exactly one row per task even with identical timestamps.
C3: Split bulk-lock transaction into bulk-read (no lock) + per-entry
transactions. Row locks now held only during single-entry delivery,
not the entire batch. Flowable API calls no longer inside a DB
transaction holding locks on other rows.
I2: Add MAX_ATTEMPTS=100. Entries exceeding this are excluded from the
drain query and effectively dead-lettered for investigation.
I3: Call cleanupDelivered() at end of each drain cycle with 7-day
retention to prevent unbounded table growth.
I4: Extract workflowInstanceId from ChangeEvent entity payload instead
of fetching from DB. Eliminates extra round trip per task status
change event.
I5: Move signal broadcast before outbox enqueue so it always fires.
Wrap enqueue in resilience4j retry (3 attempts) for transient DB
errors. Unhandled failure propagates to event publisher for retry.
Add LIMIT 500 with ORDER BY attempts ASC, createdAt ASC to prevent unbounded result sets and prioritize fresh messages over stuck ones. Separate cleanup into its own try-catch for cleaner error diagnostics.
…OutboxIT Move E2E test to integration-tests module where the full application stack is running (CE pipeline, schedulers, drainer). The test verifies the complete outbox delivery pipeline through observable outcomes: 1. Deploy workflow → create table → workflow triggers → task created 2. PATCH task InProgress → PATCH task Completed 3. Poll workflow instance until FINISHED status 4. Assert stage results contain expected status transitions
…incident-tcrs-sync-hook
…pen-metadata/OpenMetadata into feat/ilw-item2-incident-tcrs-sync-hook
✅ TypeScript Types Auto-UpdatedThe generated TypeScript types have been automatically updated based on JSON schema changes in this PR. |
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
…ncident-tcrs-sync-hook Resolve conflicts in SetupImpl.java and ManualTaskOutboxIT.java by keeping HEAD (aboutEntityLink + TCRS sync additions).
❌ Lint Check Failed — ESLint + Prettier (core-components)The following files have style issues that need to be fixed: Fix locally (fast — only for changed files in the branch): make ui-checkstyle-core-components-changedOr to fix all files: make ui-checkstyle-core-components |
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
|
|
||
| -- aboutEntityLink: hierarchical entity identity for lifecycle handler lookups | ||
| ALTER TABLE task_entity ADD COLUMN IF NOT EXISTS aboutEntityLink varchar(1024) | ||
| GENERATED ALWAYS AS (json_unquote(json_extract(`json`, _utf8mb4'$.aboutEntityLink'))) STORED; |
There was a problem hiding this comment.
is mysql not utf8 by default?
There was a problem hiding this comment.
I think it should be, but it does not hurt to be defensive, right?
There was a problem hiding this comment.
In sql files okay, in run time for other cases when we know the value is so and so, it's better to code that way rather than being defensive, makes debugging a pain. As this is in sql file, it's okay!
| String testCaseFqn = link.getEntityFQN(); | ||
| UUID stateId = UUID.fromString(link.getArrayFieldName()); | ||
|
|
||
| if (tcrRecordExists(stateId)) { |
There was a problem hiding this comment.
nit: can we name the functions in detail, I had to recall what tcr is for 1-2 mins which is not helping in debugging
There was a problem hiding this comment.
This is fair, will do it!
...rc/main/java/org/openmetadata/service/events/lifecycle/handlers/IncidentTcrsSyncHandler.java
Outdated
Show resolved
Hide resolved
…timize BFS - Rename tcrRecordExists/insertTcrsRecord/mapTaskChangeToTcrsType to descriptive names (incidentResolutionStatusExists, etc.) per reviewer feedback on abbreviation clarity - Guard extractStringValue against single-char edge case - Use outgoingEdges adjacency list in nonTerminalReachable BFS instead of scanning all edges per node (O(N+E) vs O(N*E))
Code Review ✅ Approved 2 resolved / 2 findingsAdds IncidentTcrsSyncHandler for syncing Tasks to TCRS with proper error handling for string edge cases and optimized graph traversal. All findings resolved. ✅ 2 resolved✅ Edge Case: extractStringValue crashes on single-quote string edge case
✅ Performance: nonTerminalReachable BFS iterates all edges per node (O(N*E))
OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
6b6a1a7
into
feat/incident-lifecycle-workflow
Summary
aboutEntityLinkfield to Task schema (EntityLink format encoding testCase FQN + incident stateId), backed by generated DB columns and index in MySQL/PostgreSQLIncidentTcrsSyncHandler— an async EntityLifecycleEventHandler that syncs Task lifecycle events to TCRS records (New on creation, Ack on InProgress, Assigned on assignee change, Resolved on terminal status)openOrAssignTask()to prevent duplicate Task creation when a workflow-managed Task already existsSetupImplbuildsaboutEntityLinkfor IncidentResolution/TestCaseResolution task typestestCaseStatustoWorkflowTriggerFieldsenum (matches the actual field name in entity changeDescription)Test plan
IncidentTcrsSyncHandlerTest— 10 unit tests coveringisIncidentTask()boundary conditions, handler properties, and EntityLink parsingManualTaskOutboxIT— E2E integration test verifying full pipeline: workflow trigger → Task creation with aboutEntityLink → TCRS(Ack) on InProgress → TCRS(Resolved) on Completed → workflow FINISHED