Latte and Invalid Unicode #212

@rkocman

Version: 2.6.0 (and all previous?)

Bug Description

If the content of a page is gathered from various sources, it may contain invalid Unicode. When Latte encounters invalid Unicode, its behavior depends on the escaping context and is hard to predict, which is unfortunate for a tool that is supposed to help with safe output.

Steps To Reproduce

Consider the following demo:

composer require tracy/tracy
composer require latte/latte

index.php:

<?php
require_once __DIR__.'/vendor/autoload.php';
Tracy\Debugger::enable();
$template = __DIR__.'/test1.latte';
//$template = __DIR__.'/test2.latte';
//$template = __DIR__.'/test3.latte';
$latte = new Latte\Engine;
$latte->render($template);

and the following templates:
test1.latte:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Latte Test</title>
  </head>
  <body>
    Welcome!<br>
    {var $validC="Valid codepoint: \u{30A2}."}
    {var $valid8="Valid UTF-8: \xE3\x82\xA2."}
    {var $invalidC="Invalid codepoint: \u{D800}."}
    {var $invalid8="Invalid UTF-8: \xE3\x80\x22."}
    1. {$validC}<br>
    2. {$valid8}<br>
    3. {$invalidC}<br>
    4. {$invalid8}<br>
    5. {$invalidC|noescape}<br>
    6. {$invalid8|noescape}<br>
    End.
  </body>
</html>

test2.latte:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Latte Test</title>
  </head>
  <body>
    Welcome!<br>
    {var $validC="Valid codepoint: \u{30A2}."}
    {var $valid8="Valid UTF-8: \xE3\x82\xA2."}
    {var $invalidC="Invalid codepoint: \u{D800}."}
    {var $invalid8="Invalid UTF-8: \xE3\x80\x22."}
    <script>
    //1. {$validC}<br>
    //2. {$valid8}<br>
    //3. {$invalidC}<br>
    //4. {$invalid8}<br>
    //5. {$invalidC|noescape}<br>
    //6. {$invalid8|noescape}<br>
    </script>
    End.
  </body>
</html>

test3.latte:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Latte Test</title>
  </head>
  <body>
    Welcome!<br>
    {var $validC="Valid codepoint: \u{30A2}."}
    {var $valid8="Valid UTF-8: \xE3\x82\xA2."}
    {var $invalidC="Invalid codepoint: \u{D800}."}
    {var $invalid8="Invalid UTF-8: \xE3\x80\x22."}
    <style>
    /*1. {$validC}<br>*/
    /*2. {$valid8}<br>*/
    /*3. {$invalidC}<br>*/
    /*4. {$invalid8}<br>*/
    /*5. {$invalidC|noescape}<br>*/
    /*6. {$invalid8|noescape}<br>*/
    </style>
    End.
  </body>
</html>

In test1, the output of 3. and 4. is discarded entirely; nothing is printed for those lines.
In test2, a runtime exception "Malformed UTF-8 characters" is raised.
In test3, everything is printed as is, without any modification.
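
The first two symptoms can be reproduced in plain PHP, without Latte. This is only a sketch: it assumes the HTML and JavaScript contexts end up in htmlspecialchars and json_encode without any substitution flags, which is a guess about Latte's internals rather than a confirmed fact. The byte string used for $invalidC is what PHP produces for "\u{D800}".

```php
<?php
// Same invalid inputs as in the templates, written as raw bytes.
$invalidC = "Invalid codepoint: \xED\xA0\x80."; // encoded lone surrogate U+D800
$invalid8 = "Invalid UTF-8: \xE3\x80\x22.";     // truncated 3-byte sequence, then '"'

foreach ([$invalidC, $invalid8] as $s) {
    // Flags are passed explicitly because PHP 8.1+ includes ENT_SUBSTITUTE by default.
    $html = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    var_dump($html === '');        // true: the whole string is dropped, as in test1

    $json = json_encode($s);
    var_dump($json === false);     // true: encoding fails, as in test2
    echo json_last_error_msg(), "\n";
}
```

The message reported by json_last_error_msg() starts with the same "Malformed UTF-8 characters" text as the exception seen in test2.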

Expected Behavior

Some examples of how (not) to handle invalid Unicode can be found here:
http://unicode.org/reports/tr36/#Deletion_of_Noncharacters
http://unicode.org/reports/tr36/#Illegal_Input_Byte_Sequences
http://unicode.org/reports/tr36/#Some_Output_For_All_Input
Most prominently, web browsers replace illegal byte sequences with the replacement character U+FFFD.
From my point of view, it is undesirable to discard the whole string or to raise an exception.

Possible Solution

The problematic part of the code seems to be located in Latte/Runtime/Filters, where htmlspecialchars and json_encode are used.
If you look at the documentation:
https://www.php.net/manual/en/function.htmlspecialchars.php
https://www.php.net/manual/en/function.json-encode.php
there are flags ENT_SUBSTITUTE and JSON_INVALID_UTF8_SUBSTITUTE that replace invalid sequences with U+FFFD instead of returning an empty string or failing.
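
A minimal sketch of what escaping with these flags could look like. The function names escapeHtmlLenient and encodeJsLenient are hypothetical and are not Latte's actual filter names; JSON_INVALID_UTF8_SUBSTITUTE requires PHP 7.2 or newer.

```php
<?php
// Hypothetical helper: HTML-escape, replacing invalid UTF-8 with U+FFFD.
function escapeHtmlLenient(string $s): string
{
    return htmlspecialchars($s, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Hypothetical helper: JSON-encode, replacing invalid UTF-8 with U+FFFD.
function encodeJsLenient($value): string
{
    $json = json_encode($value, JSON_INVALID_UTF8_SUBSTITUTE);
    if ($json === false) {
        // Other failures (e.g. excessive nesting depth) are still reported.
        throw new RuntimeException(json_last_error_msg());
    }
    return $json;
}

$invalid8 = "Invalid UTF-8: \xE3\x80\x22.";

echo escapeHtmlLenient($invalid8), "\n"; // non-empty, valid UTF-8 containing U+FFFD
echo encodeJsLenient($invalid8), "\n";   // JSON string literal containing \ufffd
```

With both helpers the surrounding valid text survives and only the broken bytes are replaced, which matches what browsers do with illegal sequences.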

As a side note, in Tracy/Helpers the HTML content is already being escaped with ENT_SUBSTITUTE.
