Does anyone know a good heuristic for detecting UTF-8 badly decoded as Latin-1 text?

Question

I'm receiving weather alerts from a meteorological service. Although the HTTP response claims to be UTF-8, it clearly contains text like this:

SuÃ°austan 13-20 m/s og snjÃ³koma meÃ° lÃ©legu skyggni og versnandi akstursskilyrÃ°um.

...which should look like this:

Suðaustan 13-20 m/s og snjókoma með lélegu skyggni og versnandi akstursskilyrðum.

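...but before it ever reached me it had already been decoded with the wrong encoding and then re-encoded as UTF-8. Most of us have probably seen this kind of "mojibake" garbage before, and at least visually it tends to share a lot of traits, such as plenty of Ã characters, ¢ symbols, and so on.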
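For illustration, the corruption described above can be reproduced in a few lines. This is a minimal sketch that assumes Node.js (for `Buffer`); the variable names are hypothetical and not part of the service's code.

```javascript
// Reproduce the corruption: UTF-8 bytes misread as Latin-1.
const original = 'Suðaustan';
const utf8Bytes = Buffer.from(original, 'utf8');  // ð becomes the two bytes C3 B0
const mojibake = utf8Bytes.toString('latin1');    // each byte is read as its own character
console.log(mojibake);                            // SuÃ°austan

// Undo the mistake: map each character back to one byte, then decode as UTF-8.
const repaired = Buffer.from(mojibake, 'latin1').toString('utf8');
console.log(repaired === original);               // true
```

Every non-ASCII character in the original turns into two or more Latin-1 characters, which is why the damaged text is always longer than the repaired text.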

I'm currently fixing it with this code:

  // Check for UTF-8 wrongly decoded as Latin-1
  if (/[\x80-\xC5]/.test(result)) {
    const bytes = Buffer.from(result, 'latin1');
    const altText = bytes.toString('utf8');

    if (altText.length < result.length)
      result = altText;
  }

...which works for now, but it isn't a very sophisticated test.

Does anyone know of a better method?

Answer 1

Anyone know of a better method?

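I'm not sure what you'd count as better. I've just written this function to do the conversion directly on strings.

I don't know whether it's any better than going through a Buffer.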

function utf8_decode(str) {
  // Assumes each character of str holds one byte of a UTF-8 sequence (Latin-1 mojibake).
  // Anything that does not look like a valid multi-byte sequence is left unchanged.
  return str.replace(
    /[\u00c0-\u00df][\u0080-\u00bf]|([\u00e0-\u00ef][\u0080-\u00bf]{2})|([\u00f0-\u00f7][\u0080-\u00bf]{3})/g,
    // Callback arguments: whole match, then capture group 1 (3-byte) and group 2 (4-byte).
    (two, three, four) => String.fromCodePoint(
      // 4-byte sequence: 3 + 6 + 6 + 6 payload bits
      four ? (four.charCodeAt(0) & 7) << 18 | (four.charCodeAt(1) & 63) << 12 | (four.charCodeAt(2) & 63) << 6 | (four.charCodeAt(3) & 63) :
      // 3-byte sequence: 4 + 6 + 6 payload bits
      three ? (three.charCodeAt(0) & 15) << 12 | (three.charCodeAt(1) & 63) << 6 | (three.charCodeAt(2) & 63) :
      // 2-byte sequence: 5 + 6 payload bits
      (two.charCodeAt(0) & 31) << 6 | (two.charCodeAt(1) & 63)
    )
  );
}

console.log(utf8_decode("SuÃ°austan 13-20 m/s og snjÃ³koma meÃ° lÃ©legu skyggni og versnandi akstursskilyrÃ°um."));

console.log(utf8_decode("ð\x9F\x98\x8B"));

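The regex is more precise than yours: it only matches runs of characters shaped like complete UTF-8 multi-byte sequences.

There's no need to check afterwards whether the conversion actually changed anything.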
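One possible stricter heuristic, sketched here as an alternative to both snippets above, is to attempt the repair only when the whole string round-trips as well-formed UTF-8. The helper name `fixMojibake` is hypothetical; the sketch assumes Node.js, where `Buffer` and `TextDecoder` are globals, and relies on the `fatal: true` option making `decode()` throw on malformed input.

```javascript
// Hypothetical helper, not taken from the question or the answer.
function fixMojibake(text) {
  // Characters above U+00FF cannot come from a Latin-1 decode, so leave such text alone.
  if (/[^\u0000-\u00ff]/.test(text)) return text;
  // Pure ASCII has nothing to repair.
  if (!/[\u0080-\u00ff]/.test(text)) return text;

  // Map each character back to the single byte it was decoded from.
  const bytes = Buffer.from(text, 'latin1');
  try {
    // fatal: true makes decode() throw on any malformed UTF-8 sequence.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch {
    // Not well-formed UTF-8, so the text was probably genuine Latin-1.
    return text;
  }
}

console.log(fixMojibake('snjÃ³koma meÃ°'));  // snjókoma með
console.log(fixMojibake('snjókoma með'));    // snjókoma með (unchanged)
```

This is still only a heuristic. A short, genuine Latin-1 string that happens to form valid UTF-8 would be converted by mistake, and text that was wrongly decoded as Windows-1252 rather than Latin-1 can contain characters above U+00FF (such as €), which this sketch deliberately leaves untouched.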

javascript

http

internationalization

character-encoding