Dealing with mixed encodings in PHP
During the development of Phast, we ran into an issue where a PHP app was composed of files in different encodings.
Most files in this app are UTF-8, but some are Windows-1252 (the most common flavour of ISO-8859-1, also known as Latin-1). This kind of mess is quite common in legacy apps.
This is a problem for us, since we use PHP's DOMDocument to process the HTML, and its underlying parser (libxml2) expects a document to use a single encoding, so it does not handle mixed input well.
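To make the problem concrete, here is a small sketch. The sample string is made up for illustration: it contains "é" once as UTF-8 and once as a single Windows-1252 byte. It shows that the string as a whole is not valid UTF-8, and that simply converting the whole string from Windows-1252 damages the part that was already correct.

```php
<?php
// "é" once as UTF-8 (C3 A9) and once as a single Windows-1252 byte (E9).
$mixed = "caf\xC3\xA9 caf\xE9";

// The lone E9 byte makes the string as a whole invalid UTF-8.
$isValid = mb_check_encoding($mixed, 'UTF-8'); // false

// Treating the whole string as Windows-1252 fixes the E9 byte, but it
// double-encodes the sequence that was already UTF-8.
$naive = mb_convert_encoding($mixed, 'UTF-8', 'Windows-1252');
// $naive is "caf\xC3\x83\xC2\xA9 caf\xC3\xA9", which displays as "cafÃ© café"
```

So a repair has to leave valid UTF-8 sequences alone and convert only the bytes that do not fit.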
The following function converts a mixed-encoding string into UTF-8, assuming that everything that is not valid UTF-8 is Windows-1252.
// Returns $data as UTF-8: valid UTF-8 sequences are kept as they are,
// every other byte is interpreted as Windows-1252 and converted.
function repair_mixed_encoding($data) {
    return preg_replace_callback(
        '~
            [\x09\x0A\x0D\x20-\x7E]++            # ASCII
          | [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
          | \xE0[\xA0-\xBF][\x80-\xBF]           # excluding overlongs
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
          | \xED[\x80-\x9F][\x80-\xBF]           # excluding surrogates
          | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
          | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
          | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
          | (.)                                  # anything else: a single byte that is not valid UTF-8
        ~xs',
        function ($match) {
            // Group 1 is only set when the fallback branch matched.
            if (isset($match[1]) && strlen($match[1])) {
                return mb_convert_encoding($match[1], 'UTF-8', 'Windows-1252');
            } else {
                // A valid UTF-8 sequence (or ASCII run): keep it unchanged.
                return $match[0];
            }
        },
        $data
    );
}
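A quick way to check that the function behaves as described is a small smoke test. The helper below is hypothetical and not part of Phast. It takes the repair function as a callable, so it makes no assumption about how the function is loaded, and the sample inputs are invented for illustration.

```php
<?php
// Hypothetical smoke test: $repair is any callable with the same contract
// as repair_mixed_encoding(). Returns the labels of the cases that failed.
function failing_repair_cases(callable $repair): array {
    $cases = [
        // label => [input bytes, expected UTF-8 bytes]
        'ASCII is untouched'            => ["plain text\n", "plain text\n"],
        'valid UTF-8 is untouched'      => ["caf\xC3\xA9", "caf\xC3\xA9"],
        'lone Windows-1252 byte'        => ["caf\xE9", "caf\xC3\xA9"],
        'Windows-1252 euro sign (0x80)' => ["\x80 5", "\xE2\x82\xAC 5"],
        'both encodings in one string'  => ["\xC3\xA9\xE9", "\xC3\xA9\xC3\xA9"],
    ];
    $failed = [];
    foreach ($cases as $label => [$input, $expected]) {
        if ($repair($input) !== $expected) {
            $failed[] = $label;
        }
    }
    return $failed;
}

// Usage, once repair_mixed_encoding() is defined:
// var_dump(failing_repair_cases('repair_mixed_encoding'));
```

The euro sign case is there because byte 0x80 is where Windows-1252 and strict ISO-8859-1 differ: converting it as ISO-8859-1 would yield a control character instead of "€".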