
Fix literal promotion#4826

Open
ivarusic-amd wants to merge 8 commits into develop from fix_literal_promotion

Conversation

@ivarusic-amd
Contributor

Motivation

When parsing ONNX models with half-precision (float16) weights, scalar float32 constants (e.g. the epsilon in LayerNorm) were causing unintended type promotion of entire computation paths from float16 to float32. This produced a type mismatch when the promoted float32 tensor reached a downstream operation (e.g. convolution or dot) whose weights remained float16, resulting in this error:

same_type: convolution: Types do not match

Technical Details

Before applying the common type, inputs that are scalar literals (single-element and evaluable) are excluded from the type decision. If the remaining tensor inputs share a type, that type takes precedence over the scalar's type, and the scalar is cast down to match.
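The exclusion rule described above can be sketched in isolation. The struct and helper below are illustrative stand-ins for MIGraphX's shape machinery, not the actual API; the names and the fallback behavior for mixed tensor types are assumptions:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an elementwise-op input: its element type and
// whether it is a scalar literal (single-element and compile-time evaluable).
struct input_desc
{
    std::string type;    // e.g. "half" or "float"
    bool scalar_literal; // excluded from the common-type decision if true
};

// Pick the common type while ignoring scalar literals, so a float32 epsilon
// cannot widen an otherwise float16 computation path. Scalar literals are
// then cast down to the chosen type by the caller.
std::string common_type(const std::vector<input_desc>& inputs)
{
    std::string result;
    for(const auto& in : inputs)
    {
        if(in.scalar_literal)
            continue; // scalar literals do not vote on the type
        if(result.empty())
            result = in.type;
        else if(result != in.type)
            return {}; // mixed tensor types: defer to the usual promotion rules
    }
    // All inputs were scalar literals: keep the first literal's own type.
    if(result.empty() and not inputs.empty())
        result = inputs.front().type;
    return result;
}
```

With this rule, a half tensor plus a float scalar literal resolves to half, and the scalar is converted down instead of the tensor being promoted up.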

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • [x] Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

Collaborator

@CharlieL7 CharlieL7 left a comment


This could affect accuracy. Another way to handle this that could preserve accuracy would be to promote the parameter to the wider format and then convert down the output. It's not clear which way would be more reasonable. Feels like a bug in the input model.

Comment thread src/onnx/parse_gru.cpp
Comment on lines +169 to +174
if(not args[5]->is_undefined() and
args[5]->get_shape().type() != args[1]->get_shape().type())
{
args[5] = info.add_instruction(
make_op("convert", {{"target_type", args[1]->get_shape().type()}}), args[5]);
}
Collaborator


Why is there a specific change for GRU?

Contributor Author


So, for this specific model the literal change alone only helped with the migraphx driver read command; compile was still failing with "types do not match". The fix is in parse_gru.cpp because the type mismatch is introduced by the rewrite_rnn pass, which decomposes the GRU op into dot operations using m.insert_instruction(make_op("dot"), ...) directly, bypassing add_common_op and its type reconciliation. I agree this is odd behaviour; I will send you the model via Teams so you can get a better understanding of it.

Comment thread src/common.cpp
else
{
auto common = common_shape(to_shapes(inputs));
if(options.common_type)
Collaborator


Add a comment explaining the reasoning for this added code block

Contributor Author


Added

@ivarusic-amd
Contributor Author

This could affect accuracy. Another way to handle this that could preserve accuracy would be to promote the parameter to the wider format and then convert down the output. It's not clear which way would be more reasonable. Feels like a bug in the input model.

Agree, but the accuracy impact should be limited to large scalar values; all other cases should be fine. On the model bug I also agree it's a little odd, especially since DML also has problems with this model, but the ONNX format says it's fine. I can share the model with you via Teams. There are a few models failing with similar errors (type mismatches).
