
feat(pipeline): widen conflicting columns to String instead of erroring in identity pipeline #7991

Open

teochenglim wants to merge 1 commit into GreptimeTeam:main from teochenglim:main

Conversation

@teochenglim

feat(pipeline): add coerce_on_conflict option to gracefully handle type conflicts in identity pipeline

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

When ingesting JSON logs via the greptime_identity pipeline, a type conflict on
any field (e.g. user_id: 123 in one document, user_id: "unknown" in another)
causes the entire batch to be rejected with IdentifyPipelineColumnTypeMismatch.
For semi-structured or externally-sourced logs with evolving schemas, this makes
zero-config ingestion unusable without writing a full custom pipeline that
explicitly enumerates every potentially-conflicting field.

This PR adds an opt-in coerce_on_conflict flag (default false) to
GreptimePipelineParams, passed via the existing HTTP header:

x-greptime-pipeline-params: coerce_on_conflict=true

How it works:

  • In resolve_schema(), when a type mismatch is detected and the flag is set,
    the conflicting column is widened to ConcreteDataType::String instead of
    returning an error. A WARN is emitted once per widened column.
  • The column index is recorded in a new SchemaInfo.coerced_to_string set.
  • In resolve_value(), after constructing the ValueData, any value whose
    column is in coerced_to_string is stringified via a new
    value_data_to_string() helper (sketched just after this list).
  • In identity_pipeline_inner(), after all rows in the batch are processed, a
    fixup pass retroactively stringifies values that were written in earlier rows
    before the conflict was first detected.
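
As a concrete illustration, here is a minimal sketch of what the value_data_to_string() helper could look like. It assumes the greptime-proto ValueData enum and the jsonb crate; the import path and the set of arms shown are assumptions, not the exact upstream code, and the jsonb decoding follows the review fix further down.

use api::v1::value::ValueData; // assumed import path for the proto enum

// Sketch: convert a typed ValueData into its string representation.
// Only representative arms are shown; the real helper covers every variant.
fn value_data_to_string(value_data: ValueData) -> ValueData {
    match value_data {
        // Already a string: pass through unchanged.
        s @ ValueData::StringValue(_) => s,
        ValueData::I64Value(v) => ValueData::StringValue(v.to_string()),
        ValueData::U64Value(v) => ValueData::StringValue(v.to_string()),
        ValueData::F64Value(v) => ValueData::StringValue(v.to_string()),
        ValueData::BoolValue(v) => ValueData::StringValue(v.to_string()),
        // jsonb-encoded payloads are decoded before stringifying
        // (see the review fix below).
        ValueData::BinaryValue(v) | ValueData::JsonValue(v) => {
            let s = jsonb::from_slice(&v)
                .map(|jv| jv.to_string())
                .unwrap_or_else(|_| String::from_utf8_lossy(&v).into_owned());
            ValueData::StringValue(s)
        }
        // Remaining variants elided in this sketch.
        other => other,
    }
}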

Backward compatibility:

The flag defaults to false. Existing pipelines are completely unaffected.
No API or schema changes are made to the default ingestion path.

Limitations:

  • Column widening is one-way within a batch: once widened to String, the
    column stays String for all subsequent rows in that batch.
  • Previously ingested rows (from earlier batches) are not retroactively altered;
    they retain their original typed values. Schema widening across batches
    depends on the underlying table's ALTER TABLE behaviour.

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

feat(pipeline): Add global coerce_on_conflict flag for identity pipeline type conflict resolution

Context

GreptimeDB's greptime_identity pipeline is a zero-config ingestion path that automatically flattens JSON and infers column schemas from the first document seen. It is the default pipeline for log ingestion and requires no user-defined schema.

The schema-merging logic lives in resolve_schema() (greptime.rs:426). When a field appears with a different type than what was previously recorded, the pipeline raises IdentifyPipelineColumnTypeMismatch and rejects the entire batch.

This strict behavior is intentional for structured, schema-stable data. However, it creates a brittle experience for semi-structured or evolving log sources where field types are not guaranteed to be consistent across documents.


Problem

When ingesting JSON logs from external sources (e.g., application traces, third-party webhooks, or heterogeneous systems), the same field may appear with different types across documents:

{ "user_id": 123,       "status": "ok" }
{ "user_id": "unknown", "status": "degraded" }

The second document fails with:

Column datatype mismatch. For column: user_id, expected datatype: Int64, actual datatype: String

This means:

  • Any document with an unexpected field type is dropped — no partial ingestion.
  • Users must enumerate every potentially-conflicting field in a custom pipeline with on_failure: default — impractical for unknown or highly dynamic schemas.
  • There is no opt-in "lenient mode" that keeps data flowing while still indexing conflicting fields as searchable strings.

The existing on_failure mechanism only handles value coercion failures (e.g., parsing "-" as an integer within a declared transform). It does not apply to schema-level type conflicts detected in resolve_schema().


How Elasticsearch Handles This

Elasticsearch faces the same problem — dynamic mapping locks a field's type on first write, and subsequent type conflicts reject the document. It exposes several mechanisms to work around this. Two concrete cases:

ES Case 1: ignore_malformed — accept the document, drop the conflicting field

Set at the index level, this tells Elasticsearch to skip indexing a field that doesn't match its mapped type, but still accept the rest of the document.

PUT my-index
{
  "settings": {
    "index.mapping.ignore_malformed": true
  },
  "mappings": {
    "properties": {
      "user_id": { "type": "integer" }
    }
  }
}

Behavior with conflicting input:

// Document 1 — indexed normally
{ "user_id": 123, "event": "login" }

// Document 2 — user_id is dropped from index; event is indexed
{ "user_id": "unknown", "event": "auth_failure" }

Document 2 is accepted, but user_id: "unknown" is silently excluded from the index. It still appears in _source (the raw JSON), but you cannot search or filter on it. It is invisible to queries.

Why this doesn't solve our problem: Data is lost for query purposes. WHERE user_id = 'unknown' returns nothing. For a time-series log database where every field should be queryable, this is unacceptable.


ES Case 2: dynamic_templates — pre-empt conflicts by pattern-matching field names

Dynamic templates let you override type detection before conflicts occur. You define rules keyed on field name patterns or detected JSON types:

PUT my-index
{
  "mappings": {
    "dynamic_templates": [
      {
        "user_fields_always_string": {
          "match": "user_*",
          "match_mapping_type": "*",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}

Behavior:

// Both documents map user_id to keyword regardless of value type
{ "user_id": 123 }       →  user_id: "123"  (keyword)
{ "user_id": "unknown" } →  user_id: "unknown"  (keyword)

This works — but only if you know the field names (or name patterns) in advance. You must author the template with explicit match rules before ingestion begins.

Why this doesn't solve our problem: It requires anticipating which fields will conflict. For completely unknown schemas (third-party logs, forwarded events from external systems), you cannot enumerate fields you have never seen. This is the exact scenario we are trying to handle.


Options I Considered

Option 1: Per-field on_failure: default in a custom pipeline

Write a custom pipeline that declares each field with type: string and on_failure: default. This is GreptimeDB's existing escape hatch for type conflicts.

processors:
  - type: cast
    fields:
      - name: user_id
        type: string
        on_failure: default

Why it falls short: Same limitation as ES dynamic templates — you must enumerate every field that might conflict. Completely unworkable for unknown or evolving schemas. You are essentially writing a schema for a system whose value is being schema-free.


Option 2: Pre-ingestion transformation

Stringify all values before they reach GreptimeDB — using Vector, Logstash, or a custom Rust adapter that converts every serde_json::Value to its string representation before sending to the write API.
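
A minimal sketch of such an adapter, stringifying every JSON leaf before the document reaches the write API (serde_json only; the function name is hypothetical):

use serde_json::{json, Map, Value};

// Recursively replace every leaf value with its string representation.
fn stringify_all(value: Value) -> Value {
    match value {
        Value::Object(map) => Value::Object(
            map.into_iter()
                .map(|(k, v)| (k, stringify_all(v)))
                .collect::<Map<_, _>>(),
        ),
        Value::Array(items) => Value::Array(items.into_iter().map(stringify_all).collect()),
        s @ Value::String(_) => s,
        // Numbers, booleans, and nulls all lose their native type here,
        // which is exactly the drawback described below.
        other => Value::String(other.to_string()),
    }
}

fn main() {
    let doc = json!({ "user_id": 123, "score": 98.5 });
    assert_eq!(stringify_all(doc), json!({ "user_id": "123", "score": "98.5" }));
}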

Why it falls short: Adds an external component to the ingestion path, increases operational complexity, and destroys type information permanently — even for fields that are consistently typed. You lose WHERE score > 90 forever just because score occasionally arrives as a string.


Option 3: ignore_malformed-style — accept document, drop conflicting field

Analogous to ES Case 1 above. When a conflict is detected, null out the conflicting field and continue ingestion. The document survives, but the conflicting value is lost.

Why it falls short: For a time-series query database, silent data loss is worse than a visible error. A user querying user_id = "unknown" would get zero results with no indication that data was dropped. ES accepts this trade-off because _source preserves the raw JSON; GreptimeDB has no equivalent raw storage layer.


Option 4: Global coerce_on_conflict — this PR ✓

Add a pipeline-level opt-in flag. When a type conflict is detected, widen the column to String and coerce both the incoming value and all previously written values in the current batch to their string representations. The field remains stored and fully queryable. A warning is logged once per widened column. No document is rejected.

This is the gap that neither Elasticsearch nor any comparable system fills natively:

| Mechanism | Scope | On conflict | Queryable? |
|---|---|---|---|
| ES ignore_malformed | Index-level | Drop field, keep doc | No — only in _source |
| ES dynamic_templates | Pattern-based | Preempt conflict | Yes — but requires prior knowledge |
| GreptimeDB per-field on_failure | Per-field | Use default value | Yes — but requires field enumeration |
| coerce_on_conflict (this PR) | Pipeline-level global | Widen to String, keep doc | Yes — fully searchable |

Solution

What this PR does

Adds coerce_on_conflict: bool to GreptimePipelineParams, passed via the existing x-greptime-pipeline-params HTTP header:

x-greptime-pipeline-params: coerce_on_conflict=true
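
For illustration only, a tiny sketch of how a header value like this could be parsed into the flag, assuming comma-separated key=value pairs (the real GreptimePipelineParams parser may differ):

// Hypothetical parser sketch; not the actual GreptimePipelineParams code.
fn coerce_on_conflict_enabled(header_value: &str) -> bool {
    header_value
        .split(',')
        .filter_map(|kv| kv.split_once('='))
        .any(|(k, v)| k.trim() == "coerce_on_conflict" && v.trim() == "true")
}

fn main() {
    assert!(coerce_on_conflict_enabled("coerce_on_conflict=true"));
    assert!(!coerce_on_conflict_enabled("some_other_param=1"));
}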

When coerce_on_conflict = false (default):

  • Behavior is identical to today. Zero impact on existing pipelines.

When coerce_on_conflict = true:

  • In resolve_schema(), when a type mismatch is detected:
    • The column schema is widened to ConcreteDataType::String.
    • The column index is recorded in SchemaInfo.coerced_to_string.
    • A WARN is emitted once per widened column: "coerce_on_conflict: widening column 'user_id' from Int64 to String".
    • Ingestion continues — no error is returned (this branch is modeled in the sketch after this list).
  • In resolve_value(), after constructing the ValueData, any column tracked in coerced_to_string has its value stringified via value_data_to_string().
  • In identity_pipeline_inner(), after all rows are processed, a fixup pass retroactively stringifies values written in earlier rows before the conflict was detected.
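
To make the conflict branch concrete, here is a self-contained model of the decision in resolve_schema(). ColumnType stands in for ConcreteDataType, and the function signature is illustrative rather than the upstream one:

use std::collections::HashSet;

// Stand-in for ConcreteDataType.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType { Int64, String }

struct SchemaInfo {
    types: Vec<ColumnType>,
    coerced_to_string: HashSet<usize>, // columns widened in this batch
}

fn resolve_conflict(
    schema: &mut SchemaInfo,
    idx: usize,
    incoming: ColumnType,
    coerce_on_conflict: bool,
) -> Result<(), String> {
    if schema.types[idx] == incoming {
        return Ok(()); // no conflict, nothing to do
    }
    if coerce_on_conflict {
        // insert() returns true only the first time, which is what
        // keeps the WARN to a single line per widened column.
        if schema.coerced_to_string.insert(idx) {
            eprintln!(
                "WARN coerce_on_conflict: widening column {idx} from {:?} to String",
                schema.types[idx]
            );
        }
        schema.types[idx] = ColumnType::String;
        Ok(())
    } else {
        // Default path: reject the batch, as today.
        Err(format!(
            "Column datatype mismatch: expected {:?}, actual {:?}",
            schema.types[idx], incoming
        ))
    }
}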

End-to-end example

Input batch (3 documents):

[
  { "user_id": 123,       "event": "login" },
  { "user_id": "unknown", "event": "auth_failure" },
  { "user_id": 456,       "event": "logout" }
]

Without this PR (default behaviour):

Row 1 ingested: user_id = 123 (Int64)
Row 2 ERROR: Column datatype mismatch. For column: user_id, expected: Int64, actual: String
Batch aborted.

With coerce_on_conflict=true:

Row 1 ingested: user_id = "123" (String, retroactively widened)
Row 2 ingested: user_id = "unknown" (String)   ← WARN emitted here
Row 3 ingested: user_id = "456" (String)

All three rows land. user_id is queryable as a string column. WHERE user_id = 'unknown' returns row 2 as expected.
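
The retroactive part can be modeled the same way. This toy version replays the three-row example above: a Cell enum stands in for ValueData, and fixup_coerced_columns() plays the role of the fixup pass in identity_pipeline_inner() (all names here are illustrative):

use std::collections::HashSet;

// Stand-in for ValueData.
#[derive(Debug, PartialEq)]
enum Cell { Int(i64), Str(String) }

fn cell_to_string(cell: Cell) -> Cell {
    match cell {
        Cell::Int(v) => Cell::Str(v.to_string()),
        s @ Cell::Str(_) => s,
    }
}

// Second pass over the batch: stringify widened columns in rows that
// were written before the conflict was detected.
fn fixup_coerced_columns(rows: &mut [Vec<Cell>], coerced: &HashSet<usize>) {
    for row in rows.iter_mut() {
        for &col in coerced {
            let old = std::mem::replace(&mut row[col], Cell::Str(String::new()));
            row[col] = cell_to_string(old);
        }
    }
}

fn main() {
    let mut rows = vec![
        vec![Cell::Int(123)],              // row 1: written before the conflict
        vec![Cell::Str("unknown".into())], // row 2: triggered the widening
        vec![Cell::Str("456".into())],     // row 3: already stringified by resolve_value
    ];
    let coerced: HashSet<usize> = [0].into_iter().collect();
    fixup_coerced_columns(&mut rows, &coerced);
    assert_eq!(rows[0][0], Cell::Str("123".into()));
    assert_eq!(rows[2][0], Cell::Str("456".into()));
}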

Code changes

File: src/pipeline/src/etl/transform/transformer/greptime.rs

Changes:

  • coerce_on_conflict field and accessor on GreptimePipelineParams
  • coerced_to_string: HashSet<usize> on SchemaInfo
  • coercion logic in resolve_schema()
  • value_data_to_string() helper
  • post-construction coercion in resolve_value()
  • retroactive fixup pass in identity_pipeline_inner()

Trade-offs

| | Default off (false) | Opt-in (true) |
|---|---|---|
| Backward compatibility | Full — no change | N/A, opt-in |
| Type safety | Strict, first-write-wins | Relaxed, conflict → String |
| Ingestion robustness | Fails on conflict | Always proceeds |
| Query behavior on conflicting field | N/A (batch rejected) | Fully searchable as String |
| Performance | Baseline | Minor overhead: type check + optional coercion + fixup pass |
| Data loss | Document lost on conflict | No data loss |

Because coerce_on_conflict defaults to false, this PR has zero impact on existing pipelines.

@teochenglim teochenglim requested a review from a team as a code owner April 18, 2026 05:24
@github-actions github-actions bot added the size/S and docs-not-required labels Apr 18, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a coerce_on_conflict configuration for the Greptime identity pipeline, which resolves type mismatches by widening columns to String and retroactively converting existing values in the batch. Review feedback highlights a bug in the string conversion of jsonb data (stored in BinaryValue and JsonValue) and identifies redundant logic in the value resolution process for newly added columns.

Comment on lines +638 to +643
        ValueData::BinaryValue(v) => {
            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
        }
        ValueData::JsonValue(v) => {
            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
        }

Severity: high

The BinaryValue and JsonValue variants in this pipeline are used to store JSON data encoded in the jsonb binary format (as seen in lines 775 and 786). Using String::from_utf8_lossy directly on these bytes will not produce a valid JSON string and will instead result in a string containing the raw binary representation (likely garbage characters). You should decode the jsonb data using the jsonb crate before converting it to a string.

Suggested change
-        ValueData::BinaryValue(v) => {
-            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
-        }
-        ValueData::JsonValue(v) => {
-            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
-        }
+        ValueData::BinaryValue(v) | ValueData::JsonValue(v) => {
+            let s = jsonb::from_slice(&v)
+                .map(|jv| jv.to_string())
+                .unwrap_or_else(|_| String::from_utf8_lossy(&v).into_owned());
+            ValueData::StringValue(s)
+        }

Comment on lines +793 to +807
    let value_data = if let Some(idx) = index {
        if schema_info.coerced_to_string.contains(&idx) {
            value_data.map(value_data_to_string)
        } else {
            value_data
        }
    } else {
        // Newly added column: check by name in case it was just inserted.
        let new_idx = schema_info.index.get(&column_name).copied();
        if new_idx.is_some_and(|i| schema_info.coerced_to_string.contains(&i)) {
            value_data.map(value_data_to_string)
        } else {
            value_data
        }
    };

Severity: medium

The else block for newly added columns is redundant. If index was None at the start of resolve_value, it means the column was not present in the schema for the current batch. While resolve_schema might add the column to the schema, it only inserts into coerced_to_string when a type mismatch is detected on an existing column (where index is Some). Therefore, a newly added column can never be in coerced_to_string during the same resolve_value call, making the extra lookup and check unnecessary.

    let value_data = if let Some(idx) = index {
        if schema_info.coerced_to_string.contains(&idx) {
            value_data.map(value_data_to_string)
        } else {
            value_data
        }
    } else {
        value_data
    };

feat(pipeline): widen conflicting columns to String instead of erroring in identity pipeline

Signed-off-by: Teo Cheng Lim <teochenglim@gmail.com>
@teochenglim
Author

Addressing Gemini code review

Gemini Code Review — Response & Patches

Review Summary

Gemini flagged two issues on the coerce_on_conflict implementation:

  1. [High] BinaryValue/JsonValue conversion in value_data_to_string() produced garbage output because jsonb binary data was naively decoded as UTF-8.
  2. [Medium] The else branch in resolve_value() for newly added columns was redundant dead code.

Patch 1 — Fix jsonb binary → String conversion (High)

File: src/pipeline/src/etl/transform/transformer/greptime.rs

Root cause: BinaryValue and JsonValue store data in the jsonb binary wire format (produced by jsonb::Value::to_vec()). Treating those bytes as UTF-8 yields corrupt strings. The fix decodes through the jsonb crate first.

-        ValueData::BinaryValue(v) => {
-            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
-        }
-        ValueData::JsonValue(v) => {
-            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
-        }
+        ValueData::BinaryValue(v) | ValueData::JsonValue(v) => {
+            let s = jsonb::from_slice(&v)
+                .map(|jv| jv.to_string())
+                .unwrap_or_else(|_| String::from_utf8_lossy(&v).into_owned());
+            ValueData::StringValue(s)
+        }

Patch 2 — Remove redundant else branch in resolve_value() (Medium)

File: src/pipeline/src/etl/transform/transformer/greptime.rs

Root cause: coerced_to_string is only populated inside resolve_schema() when index is Some (i.e. the column already existed and had a type conflict). When index is None, the column is being seen for the first time — it cannot be in coerced_to_string. The extra name-lookup check was unreachable logic.

-    } else {
-        // Newly added column: check by name in case it was just inserted.
-        let new_idx = schema_info.index.get(&column_name).copied();
-        if new_idx.is_some_and(|i| schema_info.coerced_to_string.contains(&i)) {
-            value_data.map(value_data_to_string)
-        } else {
-            value_data
-        }
-    };
+    } else {
+        value_data
+    };

@killme2008 killme2008 requested a review from shuiyisong April 18, 2026 06:27
