
feat(pipeline): widen conflicting columns to String instead of erroring in identity pipeline #7991

Open

teochenglim wants to merge 1 commit into GreptimeTeam:main from teochenglim:main

Conversation

@teochenglim

feat(pipeline): add coerce_on_conflict option to gracefully handle type conflicts in identity pipeline

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

When ingesting JSON logs via the greptime_identity pipeline, a type conflict on
any field (e.g. user_id: 123 in one document, user_id: "unknown" in another)
causes the entire batch to be rejected with IdentifyPipelineColumnTypeMismatch.
For semi-structured or externally-sourced logs with evolving schemas, this makes
zero-config ingestion unusable without writing a full custom pipeline that
explicitly enumerates every potentially-conflicting field.

This PR adds an opt-in coerce_on_conflict flag (default false) to
GreptimePipelineParams, passed via the existing HTTP header:

x-greptime-pipeline-params: coerce_on_conflict=true

How it works:

  • In resolve_schema(), when a type mismatch is detected and the flag is set,
    the conflicting column is widened to ConcreteDataType::String instead of
    returning an error. A WARN is emitted once per widened column.
  • The column index is recorded in a new SchemaInfo.coerced_to_string set.
  • In resolve_value(), after constructing the ValueData, any value whose
    column is in coerced_to_string is stringified via a new
    value_data_to_string() helper (sketched just after this list).
  • In identity_pipeline_inner(), after all rows in the batch are processed, a
    fixup pass retroactively stringifies values that were written in earlier rows
    before the conflict was first detected.
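
As a concrete illustration, here is a minimal sketch of what the value_data_to_string() helper could look like. It assumes the greptime-proto ValueData enum and the jsonb crate; the import path and the set of arms shown are assumptions, not the exact upstream code, and the jsonb decoding follows the review fix further down.

use api::v1::value::ValueData; // assumed import path for the proto enum

// Sketch: convert a typed ValueData into its string representation.
// Only representative arms are shown; the real helper covers every variant.
fn value_data_to_string(value_data: ValueData) -> ValueData {
    match value_data {
        // Already a string: pass through unchanged.
        s @ ValueData::StringValue(_) => s,
        ValueData::I64Value(v) => ValueData::StringValue(v.to_string()),
        ValueData::U64Value(v) => ValueData::StringValue(v.to_string()),
        ValueData::F64Value(v) => ValueData::StringValue(v.to_string()),
        ValueData::BoolValue(v) => ValueData::StringValue(v.to_string()),
        // jsonb-encoded payloads are decoded before stringifying
        // (see the review fix below).
        ValueData::BinaryValue(v) | ValueData::JsonValue(v) => {
            let s = jsonb::from_slice(&v)
                .map(|jv| jv.to_string())
                .unwrap_or_else(|_| String::from_utf8_lossy(&v).into_owned());
            ValueData::StringValue(s)
        }
        // Remaining variants elided in this sketch.
        other => other,
    }
}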

Backward compatibility:

The flag defaults to false. Existing pipelines are completely unaffected.
No API or schema changes are made to the default ingestion path.

Limitations:

  • Column widening is one-way within a batch: once widened to String, the
    column stays String for all subsequent rows in that batch.
  • Previously ingested rows (from earlier batches) are not retroactively altered;
    they retain their original typed values. Schema widening across batches
    depends on the underlying table's ALTER TABLE behaviour.

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

feat(pipeline): Add global coerce_on_conflict flag for identity pipeline type conflict resolution

Context

GreptimeDB's greptime_identity pipeline is a zero-config ingestion path that automatically flattens JSON and infers column schemas from the first document seen. It is the default pipeline for log ingestion and requires no user-defined schema.

The schema-merging logic lives in resolve_schema() (greptime.rs:426). When a field appears with a different type than what was previously recorded, the pipeline raises IdentifyPipelineColumnTypeMismatch and rejects the entire batch.

This strict behavior is intentional for structured, schema-stable data. However, it creates a brittle experience for semi-structured or evolving log sources where field types are not guaranteed to be consistent across documents.


Problem

When ingesting JSON logs from external sources (e.g., application traces, third-party webhooks, or heterogeneous systems), the same field may appear with different types across documents:

{ "user_id": 123,       "status": "ok" }
{ "user_id": "unknown", "status": "degraded" }

The second document fails with:

Column datatype mismatch. For column: user_id, expected datatype: Int64, actual datatype: String

This means:

  • Any document with an unexpected field type is dropped — no partial ingestion.
  • Users must enumerate every potentially-conflicting field in a custom pipeline with on_failure: default — impractical for unknown or highly dynamic schemas.
  • There is no opt-in "lenient mode" that keeps data flowing while still indexing conflicting fields as searchable strings.

The existing on_failure mechanism only handles value coercion failures (e.g., parsing "-" as an integer within a declared transform). It does not apply to schema-level type conflicts detected in resolve_schema().


How Elasticsearch Handles This

Elasticsearch faces the same problem — dynamic mapping locks a field's type on first write, and subsequent type conflicts reject the document. It exposes several mechanisms to work around this. Two concrete cases:

ES Case 1: ignore_malformed — accept the document, drop the conflicting field

Set at the index level, this tells Elasticsearch to skip indexing a field that doesn't match its mapped type, but still accept the rest of the document.

PUT my-index
{
  "settings": {
    "index.mapping.ignore_malformed": true
  },
  "mappings": {
    "properties": {
      "user_id": { "type": "integer" }
    }
  }
}

Behavior with conflicting input:

// Document 1 — indexed normally
{ "user_id": 123, "event": "login" }

// Document 2 — user_id is dropped from index; event is indexed
{ "user_id": "unknown", "event": "auth_failure" }

Document 2 is accepted, but user_id: "unknown" is silently excluded from the index. It still appears in _source (the raw JSON), but you cannot search or filter on it. It is invisible to queries.

Why this doesn't solve our problem: Data is lost for query purposes. WHERE user_id = 'unknown' returns nothing. For a time-series log database where every field should be queryable, this is unacceptable.


ES Case 2: dynamic_templates — pre-empt conflicts by pattern-matching field names

Dynamic templates let you override type detection before conflicts occur. You define rules keyed on field name patterns or detected JSON types:

PUT my-index
{
  "mappings": {
    "dynamic_templates": [
      {
        "user_fields_always_string": {
          "match": "user_*",
          "match_mapping_type": "*",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}

Behavior:

// Both documents map user_id to keyword regardless of value type
{ "user_id": 123 }       →  user_id: "123"  (keyword)
{ "user_id": "unknown" } →  user_id: "unknown"  (keyword)

This works — but only if you know the field names (or name patterns) in advance. You must author the template with explicit match rules before ingestion begins.

Why this doesn't solve our problem: It requires anticipating which fields will conflict. For completely unknown schemas (third-party logs, forwarded events from external systems), you cannot enumerate fields you have never seen. This is the exact scenario we are trying to handle.


Options I Considered

Option 1: Per-field on_failure: default in a custom pipeline

Write a custom pipeline that declares each field with type: string and on_failure: default. This is GreptimeDB's existing escape hatch for type conflicts.

processors:
  - type: cast
    fields:
      - name: user_id
        type: string
        on_failure: default

Why it falls short: Same limitation as ES dynamic templates — you must enumerate every field that might conflict. Completely unworkable for unknown or evolving schemas. You are essentially writing a schema for a system whose value is being schema-free.


Option 2: Pre-ingestion transformation

Stringify all values before they reach GreptimeDB — using Vector, Logstash, or a custom Rust adapter that converts every serde_json::Value to its string representation before sending to the write API.
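
A minimal sketch of such an adapter, stringifying every JSON leaf before the document reaches the write API (serde_json only; the function name is hypothetical):

use serde_json::{json, Map, Value};

// Recursively replace every leaf value with its string representation.
fn stringify_all(value: Value) -> Value {
    match value {
        Value::Object(map) => Value::Object(
            map.into_iter()
                .map(|(k, v)| (k, stringify_all(v)))
                .collect::<Map<_, _>>(),
        ),
        Value::Array(items) => Value::Array(items.into_iter().map(stringify_all).collect()),
        s @ Value::String(_) => s,
        // Numbers, booleans, and nulls all lose their native type here,
        // which is exactly the drawback described below.
        other => Value::String(other.to_string()),
    }
}

fn main() {
    let doc = json!({ "user_id": 123, "score": 98.5 });
    assert_eq!(stringify_all(doc), json!({ "user_id": "123", "score": "98.5" }));
}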

Why it falls short: Adds an external component to the ingestion path, increases operational complexity, and destroys type information permanently — even for fields that are consistently typed. You lose WHERE score > 90 forever just because score occasionally arrives as a string.


Option 3: ignore_malformed-style — accept document, drop conflicting field

Analogous to ES Case 1 above. When a conflict is detected, null out the conflicting field and continue ingestion. The document survives, but the conflicting value is lost.

Why it falls short: For a time-series query database, silent data loss is worse than a visible error. A user querying user_id = "unknown" would get zero results with no indication that data was dropped. ES accepts this trade-off because _source preserves the raw JSON; GreptimeDB has no equivalent raw storage layer.


Option 4: Global coerce_on_conflict — this PR ✓

Add a pipeline-level opt-in flag. When a type conflict is detected, widen the column to String and coerce both the incoming value and all previously written values in the current batch to their string representations. The field remains stored and fully queryable. A warning is logged once per widened column. No document is rejected.

This is the gap that neither Elasticsearch nor any comparable system fills natively:

| Mechanism | Scope | On conflict | Queryable? |
|---|---|---|---|
| ES ignore_malformed | Index-level | Drop field, keep doc | No — only in _source |
| ES dynamic_templates | Pattern-based | Preempt conflict | Yes — but requires prior knowledge |
| GreptimeDB per-field on_failure | Per-field | Use default value | Yes — but requires field enumeration |
| coerce_on_conflict (this PR) | Pipeline-level global | Widen to String, keep doc | Yes — fully searchable |

Solution

What this PR does

Adds coerce_on_conflict: bool to GreptimePipelineParams, passed via the existing x-greptime-pipeline-params HTTP header:

x-greptime-pipeline-params: coerce_on_conflict=true
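
For illustration only, a tiny sketch of how a header value like this could be parsed into the flag, assuming comma-separated key=value pairs (the real GreptimePipelineParams parser may differ):

// Hypothetical parser sketch; not the actual GreptimePipelineParams code.
fn coerce_on_conflict_enabled(header_value: &str) -> bool {
    header_value
        .split(',')
        .filter_map(|kv| kv.split_once('='))
        .any(|(k, v)| k.trim() == "coerce_on_conflict" && v.trim() == "true")
}

fn main() {
    assert!(coerce_on_conflict_enabled("coerce_on_conflict=true"));
    assert!(!coerce_on_conflict_enabled("some_other_param=1"));
}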

When coerce_on_conflict = false (default):

  • Behavior is identical to today. Zero impact on existing pipelines.

When coerce_on_conflict = true:

  • In resolve_schema(), when a type mismatch is detected:
    • The column schema is widened to ConcreteDataType::String.
    • The column index is recorded in SchemaInfo.coerced_to_string.
    • A WARN is emitted once per widened column: "coerce_on_conflict: widening column 'user_id' from Int64 to String".
    • Ingestion continues — no error is returned (this branch is modeled in the sketch after this list).
  • In resolve_value(), after constructing the ValueData, any column tracked in coerced_to_string has its value stringified via value_data_to_string().
  • In identity_pipeline_inner(), after all rows are processed, a fixup pass retroactively stringifies values written in earlier rows before the conflict was detected.
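
To make the conflict branch concrete, here is a self-contained model of the decision in resolve_schema(). ColumnType stands in for ConcreteDataType, and the function signature is illustrative rather than the upstream one:

use std::collections::HashSet;

// Stand-in for ConcreteDataType.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType { Int64, String }

struct SchemaInfo {
    types: Vec<ColumnType>,
    coerced_to_string: HashSet<usize>, // columns widened in this batch
}

fn resolve_conflict(
    schema: &mut SchemaInfo,
    idx: usize,
    incoming: ColumnType,
    coerce_on_conflict: bool,
) -> Result<(), String> {
    if schema.types[idx] == incoming {
        return Ok(()); // no conflict, nothing to do
    }
    if coerce_on_conflict {
        // insert() returns true only the first time, which is what
        // keeps the WARN to a single line per widened column.
        if schema.coerced_to_string.insert(idx) {
            eprintln!(
                "WARN coerce_on_conflict: widening column {idx} from {:?} to String",
                schema.types[idx]
            );
        }
        schema.types[idx] = ColumnType::String;
        Ok(())
    } else {
        // Default path: reject the batch, as today.
        Err(format!(
            "Column datatype mismatch: expected {:?}, actual {:?}",
            schema.types[idx], incoming
        ))
    }
}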

End-to-end example

Input batch (3 documents):

[
  { "user_id": 123,       "event": "login" },
  { "user_id": "unknown", "event": "auth_failure" },
  { "user_id": 456,       "event": "logout" }
]

Without this PR (default behaviour):

Row 1 ingested: user_id = 123 (Int64)
Row 2 ERROR: Column datatype mismatch. For column: user_id, expected: Int64, actual: String
Batch aborted.

With coerce_on_conflict=true:

Row 1 ingested: user_id = "123" (String, retroactively widened)
Row 2 ingested: user_id = "unknown" (String)   ← WARN emitted here
Row 3 ingested: user_id = "456" (String)

All three rows land. user_id is queryable as a string column. WHERE user_id = 'unknown' returns row 2 as expected.
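
The retroactive part can be modeled the same way. This toy version replays the three-row example above: a Cell enum stands in for ValueData, and fixup_coerced_columns() plays the role of the fixup pass in identity_pipeline_inner() (all names here are illustrative):

use std::collections::HashSet;

// Stand-in for ValueData.
#[derive(Debug, PartialEq)]
enum Cell { Int(i64), Str(String) }

fn cell_to_string(cell: Cell) -> Cell {
    match cell {
        Cell::Int(v) => Cell::Str(v.to_string()),
        s @ Cell::Str(_) => s,
    }
}

// Second pass over the batch: stringify widened columns in rows that
// were written before the conflict was detected.
fn fixup_coerced_columns(rows: &mut [Vec<Cell>], coerced: &HashSet<usize>) {
    for row in rows.iter_mut() {
        for &col in coerced {
            let old = std::mem::replace(&mut row[col], Cell::Str(String::new()));
            row[col] = cell_to_string(old);
        }
    }
}

fn main() {
    let mut rows = vec![
        vec![Cell::Int(123)],              // row 1: written before the conflict
        vec![Cell::Str("unknown".into())], // row 2: triggered the widening
        vec![Cell::Str("456".into())],     // row 3: already stringified by resolve_value
    ];
    let coerced: HashSet<usize> = [0].into_iter().collect();
    fixup_coerced_columns(&mut rows, &coerced);
    assert_eq!(rows[0][0], Cell::Str("123".into()));
    assert_eq!(rows[2][0], Cell::Str("456".into()));
}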

Code changes

File: src/pipeline/src/etl/transform/transformer/greptime.rs

Changes:

  • coerce_on_conflict field and accessor on GreptimePipelineParams
  • coerced_to_string: HashSet<usize> on SchemaInfo
  • coercion logic in resolve_schema()
  • value_data_to_string() helper
  • post-construction coercion in resolve_value()
  • retroactive fixup pass in identity_pipeline_inner()

Trade-offs

| | Default off (false) | Opt-in (true) |
|---|---|---|
| Backward compatibility | Full — no change | N/A, opt-in |
| Type safety | Strict, first-write-wins | Relaxed, conflict → String |
| Ingestion robustness | Fails on conflict | Always proceeds |
| Query behavior on conflicting field | N/A (batch rejected) | Fully searchable as String |
| Performance | Baseline | Minor overhead: type check + optional coercion + fixup pass |
| Data loss | Document lost on conflict | No data loss |

Because coerce_on_conflict defaults to false, this PR has zero impact on existing pipelines.

@teochenglim teochenglim requested a review from a team as a code owner April 18, 2026 05:24
@github-actions github-actions bot added the size/S and docs-not-required labels Apr 18, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a coerce_on_conflict configuration for the Greptime identity pipeline, which resolves type mismatches by widening columns to String and retroactively converting existing values in the batch. Review feedback highlights a bug in the string conversion of jsonb data (stored in BinaryValue and JsonValue) and identifies redundant logic in the value resolution process for newly added columns.

Comment on lines +638 to +643
        ValueData::BinaryValue(v) => {
            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
        }
        ValueData::JsonValue(v) => {
            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
        }

Severity: high

The BinaryValue and JsonValue variants in this pipeline are used to store JSON data encoded in the jsonb binary format (as seen in lines 775 and 786). Using String::from_utf8_lossy directly on these bytes will not produce a valid JSON string and will instead result in a string containing the raw binary representation (likely garbage characters). You should decode the jsonb data using the jsonb crate before converting it to a string.

Suggested change
-        ValueData::BinaryValue(v) => {
-            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
-        }
-        ValueData::JsonValue(v) => {
-            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
-        }
+        ValueData::BinaryValue(v) | ValueData::JsonValue(v) => {
+            let s = jsonb::from_slice(&v)
+                .map(|jv| jv.to_string())
+                .unwrap_or_else(|_| String::from_utf8_lossy(&v).into_owned());
+            ValueData::StringValue(s)
+        }

Comment on lines +793 to +807
    let value_data = if let Some(idx) = index {
        if schema_info.coerced_to_string.contains(&idx) {
            value_data.map(value_data_to_string)
        } else {
            value_data
        }
    } else {
        // Newly added column: check by name in case it was just inserted.
        let new_idx = schema_info.index.get(&column_name).copied();
        if new_idx.is_some_and(|i| schema_info.coerced_to_string.contains(&i)) {
            value_data.map(value_data_to_string)
        } else {
            value_data
        }
    };

Severity: medium

The else block for newly added columns is redundant. If index was None at the start of resolve_value, it means the column was not present in the schema for the current batch. While resolve_schema might add the column to the schema, it only inserts into coerced_to_string when a type mismatch is detected on an existing column (where index is Some). Therefore, a newly added column can never be in coerced_to_string during the same resolve_value call, making the extra lookup and check unnecessary.

    let value_data = if let Some(idx) = index {
        if schema_info.coerced_to_string.contains(&idx) {
            value_data.map(value_data_to_string)
        } else {
            value_data
        }
    } else {
        value_data
    };

feat(pipeline): widen conflicting columns to String instead of erroring in identity pipeline

Signed-off-by: Teo Cheng Lim <teochenglim@gmail.com>
@teochenglim
Author

Addressing Gemini code review

Gemini Code Review — Response & Patches

Review Summary

Gemini flagged two issues on the coerce_on_conflict implementation:

  1. [High] BinaryValue/JsonValue conversion in value_data_to_string() produced garbage output because jsonb binary data was naively decoded as UTF-8.
  2. [Medium] The else branch in resolve_value() for newly added columns was redundant dead code.

Patch 1 — Fix jsonb binary → String conversion (High)

File: src/pipeline/src/etl/transform/transformer/greptime.rs

Root cause: BinaryValue and JsonValue store data in the jsonb binary wire format (produced by jsonb::Value::to_vec()). Treating those bytes as UTF-8 yields corrupt strings. The fix decodes through the jsonb crate first.

-        ValueData::BinaryValue(v) => {
-            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
-        }
-        ValueData::JsonValue(v) => {
-            ValueData::StringValue(String::from_utf8_lossy(&v).into_owned())
-        }
+        ValueData::BinaryValue(v) | ValueData::JsonValue(v) => {
+            let s = jsonb::from_slice(&v)
+                .map(|jv| jv.to_string())
+                .unwrap_or_else(|_| String::from_utf8_lossy(&v).into_owned());
+            ValueData::StringValue(s)
+        }

Patch 2 — Remove redundant else branch in resolve_value() (Medium)

File: src/pipeline/src/etl/transform/transformer/greptime.rs

Root cause: coerced_to_string is only populated inside resolve_schema() when index is Some (i.e. the column already existed and had a type conflict). When index is None, the column is being seen for the first time — it cannot be in coerced_to_string. The extra name-lookup check was unreachable logic.

-    } else {
-        // Newly added column: check by name in case it was just inserted.
-        let new_idx = schema_info.index.get(&column_name).copied();
-        if new_idx.is_some_and(|i| schema_info.coerced_to_string.contains(&i)) {
-            value_data.map(value_data_to_string)
-        } else {
-            value_data
-        }
-    };
+    } else {
+        value_data
+    };

@killme2008 killme2008 requested a review from shuiyisong April 18, 2026 06:27
