Dealing with mixed encodings in PHP

During the development of Phast, we ran into an issue where a PHP app was composed of files in different encodings.

Most files in this app are UTF-8, but some are Windows-1252 (the most common flavour of the ISO-8859-1, or Latin-1, encoding). This kind of mess is quite common in legacy apps.

This is a problem for us, since we use PHP's DOMDocument to process the HTML, and PHP's XML parser doesn't deal with this situation well.
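
To find out which parts of an input are affected, one option is to test each line with mb_check_encoding. The sketch below is only a diagnostic aid and not part of Phast; the helper name find_non_utf8_lines is made up for this example, and it assumes the mbstring extension is loaded.

```php
<?php
// Requires the mbstring extension.

// Hypothetical helper: returns the 1-based numbers of the lines in $data
// that are not valid UTF-8.
function find_non_utf8_lines(string $data): array {
    $bad = [];
    foreach (explode("\n", $data) as $i => $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            $bad[] = $i + 1;
        }
    }
    return $bad;
}

// Line 1 is valid UTF-8 ("café"); line 2 contains a lone Windows-1252
// byte (0xEF, the "ï" in "naïve").
$sample = "caf\xC3\xA9\nna\xEFve\n";
print_r(find_non_utf8_lines($sample)); // reports line 2 only
```

A check like this only tells us that something is wrong. It cannot say which encoding the offending bytes were meant to be in, so that part remains an assumption.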

The following function converts a mixed-encoding string into UTF-8, assuming that every byte that is not part of a valid UTF-8 sequence is Windows-1252.

function repair_mixed_encoding($data) {
    return preg_replace_callback(
        '~
            [\x09\x0A\x0D\x20-\x7E]++          # printable ASCII plus tab, LF, CR
          | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
          |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
          | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
          |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
          |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
          | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
          |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
          | (.)                                # any other byte: assumed Windows-1252
        ~xs',
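        function($match) {
            // Group 1 is only set when the fallback branch matched a stray byte.
            if (isset($match[1]) && strlen($match[1])) {
                return mb_convert_encoding($match[1], 'UTF-8', 'Windows-1252');
            } else {
                // ASCII run or valid UTF-8 sequence: keep it unchanged.
                return $match[0];
            }
        },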
        $data
    );
}
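
The fallback branch hands each stray byte to mb_convert_encoding. As a small illustration, assuming the mbstring extension is available, the snippet below shows what happens to two such bytes, and why it matters that we assume Windows-1252 rather than plain ISO-8859-1: the two differ in the 0x80-0x9F range. The variable names are only for this example.

```php
<?php
// Requires the mbstring extension.

// A lone 0xE9 byte is not valid UTF-8; read as Windows-1252 it is "é".
$e_acute = mb_convert_encoding("\xE9", 'UTF-8', 'Windows-1252');

// 0x80 is where the two encodings differ: Windows-1252 maps it to the
// euro sign, while ISO-8859-1 maps it to the C1 control character U+0080.
$euro    = mb_convert_encoding("\x80", 'UTF-8', 'Windows-1252');
$control = mb_convert_encoding("\x80", 'UTF-8', 'ISO-8859-1');

var_dump(bin2hex($e_acute)); // string(4) "c3a9"
var_dump(bin2hex($euro));    // string(6) "e282ac"
var_dump(bin2hex($control)); // string(4) "c280"
```

Since text typed on Windows often contains curly quotes and euro signs from exactly that range, treating the leftovers as Windows-1252 is usually the safer guess for legacy files.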