Version: 2.6.0 (and all previous?)
Bug Description
If the content of a page is gathered from various sources, it may contain invalid Unicode. When Latte encounters invalid Unicode, its behavior depends on the output context and is hard to predict, which is surprising for a tool meant to help with safe output printing.
Steps To Reproduce
Consider the following demo:
composer require tracy/tracy
composer require latte/latte
index.php:
<?php
require_once __DIR__.'/vendor/autoload.php';
Tracy\Debugger::enable();
$template = __DIR__.'/test1.latte';
//$template = __DIR__.'/test2.latte';
//$template = __DIR__.'/test3.latte';
$latte = new Latte\Engine;
$latte->render($template);
and the following templates:
test1.latte:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Latte Test</title>
</head>
<body>
Welcome!<br>
{var $validC="Valid codepoint: \u{30A2}."}
{var $valid8="Valid UTF-8: \xE3\x82\xA2."}
{var $invalidC="Invalid codepoint: \u{D800}."}
{var $invalid8="Invalid UTF-8: \xE3\x80\x22."}
1. {$validC}<br>
2. {$valid8}<br>
3. {$invalidC}<br>
4. {$invalid8}<br>
5. {$invalidC|noescape}<br>
6. {$invalid8|noescape}<br>
End.
</body>
</html>
test2.latte:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Latte Test</title>
</head>
<body>
Welcome!<br>
{var $validC="Valid codepoint: \u{30A2}."}
{var $valid8="Valid UTF-8: \xE3\x82\xA2."}
{var $invalidC="Invalid codepoint: \u{D800}."}
{var $invalid8="Invalid UTF-8: \xE3\x80\x22."}
<script>
//1. {$validC}<br>
//2. {$valid8}<br>
//3. {$invalidC}<br>
//4. {$invalid8}<br>
//5. {$invalidC|noescape}<br>
//6. {$invalid8|noescape}<br>
</script>
End.
</body>
</html>
test3.latte:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Latte Test</title>
</head>
<body>
Welcome!<br>
{var $validC="Valid codepoint: \u{30A2}."}
{var $valid8="Valid UTF-8: \xE3\x82\xA2."}
{var $invalidC="Invalid codepoint: \u{D800}."}
{var $invalid8="Invalid UTF-8: \xE3\x80\x22."}
<style>
/*1. {$validC}<br>*/
/*2. {$valid8}<br>*/
/*3. {$invalidC}<br>*/
/*4. {$invalid8}<br>*/
/*5. {$invalidC|noescape}<br>*/
/*6. {$invalid8|noescape}<br>*/
</style>
End.
</body>
</html>
In test1, the output of items 3 and 4 is discarded entirely; nothing is printed for them.
In test2, a runtime exception "Malformed UTF-8 characters" is raised.
In test3, everything is printed as is, without any modification.
Expected Behavior
Some examples of how (not) to handle invalid Unicode can be found here:
http://unicode.org/reports/tr36/#Deletion_of_Noncharacters
http://unicode.org/reports/tr36/#Illegal_Input_Byte_Sequences
http://unicode.org/reports/tr36/#Some_Output_For_All_Input
Most prominently, web browsers replace illegal byte sequences with the replacement character U+FFFD.
From my point of view, it is undesirable to discard the whole string or to raise an exception.
Possible Solution
The problematic code seems to be in Latte/Runtime/Filters, where htmlspecialchars and json_encode are used.
If you look at the documentation:
https://www.php.net/manual/en/function.htmlspecialchars.php
https://www.php.net/manual/en/function.json-encode.php
there are flags, ENT_SUBSTITUTE and JSON_INVALID_UTF8_SUBSTITUTE, that handle this situation by replacing invalid sequences with U+FFFD.
As a side note, Tracy/Helpers already escapes HTML content with ENT_SUBSTITUTE.
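
To illustrate the difference the two flags make, here is a minimal standalone sketch. It uses plain PHP only and does not call Latte, so it shows the behavior of the underlying functions rather than the actual code in Latte/Runtime/Filters. The variable names are made up for the demo, the byte sequence is the same one used for $invalid8 in the templates, and JSON_INVALID_UTF8_SUBSTITUTE requires PHP 7.2 or newer.

```php
<?php
// Standalone sketch: plain PHP, no Latte involved.
// Same invalid byte sequence as $invalid8 in the templates above.
$invalid8 = "Invalid UTF-8: \xE3\x80\x22.";

// Flags passed explicitly without ENT_SUBSTITUTE: htmlspecialchars()
// returns an empty string when the input is not valid UTF-8.
$htmlDefault = htmlspecialchars($invalid8, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE the invalid sequence is replaced by U+FFFD
// and the rest of the string is preserved.
$htmlSubstituted = htmlspecialchars($invalid8, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// Without extra flags json_encode() fails and returns false.
$jsonDefault = json_encode($invalid8);
$jsonError = json_last_error_msg();

// With JSON_INVALID_UTF8_SUBSTITUTE (PHP 7.2+) the invalid sequence
// is emitted as the escape \ufffd and encoding succeeds.
$jsonSubstituted = json_encode($invalid8, JSON_INVALID_UTF8_SUBSTITUTE);

var_dump($htmlDefault === '');                               // whole string lost
var_dump(strpos($htmlSubstituted, "\u{FFFD}") !== false);    // replacement present
var_dump($jsonDefault === false, $jsonError);                // failure and its message
var_dump(strpos($jsonSubstituted, '\ufffd') !== false);      // replacement present
```

The first and third checks correspond to what test1 and test2 show (discarded output and a "Malformed UTF-8" error), assuming the filters call these functions without the substitute flags.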