Latte and Invalid Unicode

Version: 2.6.0 (and possibly all earlier versions)

### Bug Description
When the content of a page is gathered from various sources, it may contain invalid Unicode. When Latte encounters such input, its behavior is inconsistent and depends on the escaping context, which is surprising for a tool that is supposed to help with printing output.

### Steps To Reproduce
Consider the following demo:
```bash
composer require tracy/tracy
composer require latte/latte
```
index.php:
```php
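<?php
require_once __DIR__ . '/vendor/autoload.php';

Tracy\Debugger::enable();

// Pick one template per run; each exercises a different escaping context.
$template = __DIR__ . '/test1.latte'; // HTML text
//$template = __DIR__ . '/test2.latte'; // inside <script>
//$template = __DIR__ . '/test3.latte'; // inside <style>

$latte = new Latte\Engine;
$latte->render($template);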
```
and the following templates:
test1.latte:
```latte
<!DOCTYPE html>
<html lang="en">
 <head>
 <meta charset="utf-8">
 <title>Latte Test</title>
 </head>
 <body>
 Welcome! 
 {var $validC="Valid codepoint: \u{30A2}."}
 {var $valid8="Valid UTF-8: \xE3\x82\xA2."}
 {var $invalidC="Invalid codepoint: \u{D800}."}
 {var $invalid8="Invalid UTF-8: \xE3\x80\x22."}
 1. {$validC} 
 2. {$valid8} 
 3. {$invalidC} 
 4. {$invalid8} 
 5. {$invalidC|noescape} 
 6. {$invalid8|noescape} 
 End.
 </body>
</html>
```
test2.latte:
```latte
<!DOCTYPE html>
<html lang="en">
 <head>
 <meta charset="utf-8">
 <title>Latte Test</title>
 </head>
 <body>
 Welcome! 
 {var $validC="Valid codepoint: \u{30A2}."}
 {var $valid8="Valid UTF-8: \xE3\x82\xA2."}
 {var $invalidC="Invalid codepoint: \u{D800}."}
 {var $invalid8="Invalid UTF-8: \xE3\x80\x22."}
 <script>
 //1. {$validC} 
 //2. {$valid8} 
 //3. {$invalidC} 
 //4. {$invalid8} 
 //5. {$invalidC|noescape} 
 //6. {$invalid8|noescape} 
 </script>
 End.
 </body>
</html>
```
test3.latte:
```latte
<!DOCTYPE html>
<html lang="en">
 <head>
 <meta charset="utf-8">
 <title>Latte Test</title>
 </head>
 <body>
 Welcome! 
 {var $validC="Valid codepoint: \u{30A2}."}
 {var $valid8="Valid UTF-8: \xE3\x82\xA2."}
 {var $invalidC="Invalid codepoint: \u{D800}."}
 {var $invalid8="Invalid UTF-8: \xE3\x80\x22."}
 <style>
 /*1. {$validC} */
 /*2. {$valid8} */
 /*3. {$invalidC} */
 /*4. {$invalid8} */
 /*5. {$invalidC|noescape} */
 /*6. {$invalid8|noescape} */
 </style>
 End.
 </body>
</html>
```
In test1 (HTML context), the output of items 3 and 4 is discarded entirely; nothing is printed for them.
In test2 (`<script>` context), a runtime exception "Malformed UTF-8 characters" is raised.
In test3 (`<style>` context), everything is printed as is, without any modification.

### Expected Behavior
Some examples of how (not) to handle invalid Unicode can be found here:
http://unicode.org/reports/tr36/#Deletion_of_Noncharacters
http://unicode.org/reports/tr36/#Illegal_Input_Byte_Sequences
http://unicode.org/reports/tr36/#Some_Output_For_All_Input
Most prominently, web browsers replace the illegal byte sequences with the replacement character U+FFFD.
From my point of view, it is undesirable to discard the whole string or to raise an exception.

### Possible Solution
The problematic part of the code seems to be located in `Latte/Runtime/Filters`, where `htmlspecialchars` and `json_encode` are used.
If you look at the documentation:
https://www.php.net/manual/en/function.htmlspecialchars.php
https://www.php.net/manual/en/function.json-encode.php
there are flags `ENT_SUBSTITUTE` and `JSON_INVALID_UTF8_SUBSTITUTE` that can handle the situation.
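
To illustrate the difference these flags make, here is a minimal standalone sketch that calls the two PHP functions directly. It does not reproduce Latte's actual filter code; the variable names are only illustrative, and the `ENT_QUOTES` flag and `'UTF-8'` charset are assumptions about how the escaping might be invoked. `JSON_INVALID_UTF8_SUBSTITUTE` requires PHP 7.2 or newer.

```php
<?php
// Same malformed byte sequence as $invalid8 in the templates above.
$invalid8 = "Invalid UTF-8: \xE3\x80\x22.";

// Explicit flags without ENT_SUBSTITUTE: invalid input yields an empty string.
$plain = htmlspecialchars($invalid8, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE: the invalid sequence is replaced by U+FFFD instead.
$substituted = htmlspecialchars($invalid8, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// Without the substitute flag: json_encode() fails and returns false.
$jsonPlain = json_encode($invalid8);
$jsonError = json_last_error_msg();

// With JSON_INVALID_UTF8_SUBSTITUTE: the invalid bytes are encoded as \ufffd.
$jsonSubstituted = json_encode($invalid8, JSON_INVALID_UTF8_SUBSTITUTE);

var_dump($plain === '');        // bool(true)
var_dump($jsonPlain === false); // bool(true)
echo $jsonError, "\n";          // Malformed UTF-8 characters, possibly incorrectly encoded
echo $substituted, "\n";
echo $jsonSubstituted, "\n";
```

The empty string returned by `htmlspecialchars` matches the symptom seen in test1, and the `json_encode` error message matches the exception text seen in test2, which is consistent with the guess that these two calls are involved.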

As a side note, in `Tracy/Helpers` the HTML content is already being escaped with `ENT_SUBSTITUTE`.

Latte and Invalid Unicode #212
