PHP: recovering a broken non-English string (ISO 8859-1) as UTF-8

Question

My own answer is at the end of this post. It works well, at least for me. There is also a repo: https://github.com/jihuichoi/correct-broken-korean-iso8859-1-to-utf8

======

I have broken Korean strings. I want to recover them as UTF-8 strings.

$str = '"3234", "ºÎ»êÀü´ÜÁö ¹èÆ÷»ç¿ø ¸ðÁý.  2¿ù6ÀÏºÎÅÍ ¤ý»ó¼¼³»¿ëÈ®ÀÎ", "2017-03-02 11:12:34';

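The string above is part of a full line in a file. The file is saved as UTF-8 and also contains intact (Korean) characters. Only some of the strings contain broken characters.

Attempt 1. mb_convert_encoding and iconv do not work, because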

print_r(mb_detect_encoding($str));
result : UTF-8

Attempt 2. Tried splitting the string and converting it character by character.

$result = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
var_dump($result);

result : 
array(52) {
  [0]=>
  string(2) "º"
  [1]=>
  string(2) "Î"
  [2]=>
  string(2) "»"
  [3]=>
  string(2) "ê"
  [4]=>
  string(2) "À"
  [5]=>
......

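Attempt 3. I had no idea what was going on, so I tried working backwards.

The string above is actually "부산전단지 배포사원 모집.2월6일부터 ㆍ상세내용확인" (I recovered it on an online conversion site; the URL and details are at the bottom of this post).

Then I figured out that every 2 broken characters make up one correct character. So I checked the hex codes of each broken character and of the target (correct) character, and did some calculation.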

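$str = 'ºÎ'; //부
// Split the UTF-8 string into individual characters.
$var = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
var_dump($var);

// _uniord() and hexToStr() are helper functions that are not shown in this post
// (presumably: code point of a character, and hex string to raw bytes).
$tmp_str = ''; $result = '';
for($i = 0; $i < count($var); $i++)
{
    if(($i+1)%2 == 1) {
        // First character of a pair: remember its hex code.
        $tmp_str .= dechex(_uniord($var[$i]));
    } else {
        // Second character of a pair: append its hex code, then add a fixed offset.
        $tmp_str .= dechex(_uniord($var[$i]));
        $uni2 = dechex(hexdec($tmp_str) + hexdec('EAFBB2'));
        $result .= hexToStr($uni2);
        $tmp_str = '';
    }
}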

echo $result;

result : 부

It worked! But only for "부". For every other Korean character I would have to add a different hex number instead of EAFBB2.
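A quick way to see why a single constant cannot work is to compute the difference between the EUC-KR byte value and the UTF-8 byte value for more than one character. The sketch below assumes the mbstring extension is available; `euckr_to_utf8_offset()` is a hypothetical helper written for illustration, not something from this post.

<?php
// Hypothetical helper: numeric difference between the UTF-8 bytes and the
// EUC-KR bytes of one Korean character, both read as big-endian integers.
function euckr_to_utf8_offset(string $ch): int
{
    $euckr = hexdec(bin2hex(mb_convert_encoding($ch, 'EUC-KR', 'UTF-8')));
    $utf8  = hexdec(bin2hex($ch));
    return $utf8 - $euckr;
}

foreach (['부', '산'] as $ch) {
    printf("%s: offset %X\n", $ch, euckr_to_utf8_offset($ch));
}
// The first line prints "부: offset EAFBB2", the constant used above.
// The second character yields a different offset.

Since the offset changes from character to character, a fixed addition can only ever repair the one character it was derived from.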

Attempt 4

In Java,

new String(XXX.getBytes("8859_1"), "euc-kr")

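seems to fit my purpose well. But I don't know Java. http://egloos.zum.com/ndba/v/2831611

Attempt 5. Tried to write an equivalent of Java's getBytes, but this is very hard. Each broken character is 2 bytes. Two broken characters make up one correct character. Yet the correct character is 3 bytes (probably because it is UTF-8).

Does that mean I have to turn 2+2 => 3 ????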

$str = 'ºÎ'; //부
$bytes = [];
for($i = 0; $i < strlen($str); $i++){
    $bytes[] = ord($str[$i]);
}
print_r($bytes);

Array
(
    [0] => 194
    [1] => 186
    [2] => 195
    [3] => 142
)

$str = '부'; //부
$bytes = []; // reset, otherwise the bytes from the previous run are kept
for($i = 0; $i < strlen($str); $i++){
    $bytes[] = ord($str[$i]);
}
print_r($bytes);

Array
(
    [0] => 235
    [1] => 182
    [2] => 128
)
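The two byte dumps above can be connected in two steps rather than by one "2+2 => 3" rule: the four bytes are the UTF-8 form of two Latin-1 characters, those two characters stand for two raw bytes, and the two raw bytes are one EUC-KR character whose UTF-8 form happens to be three bytes. A minimal trace of this, assuming the mbstring extension is available (the variable names are only illustrative):

<?php
$broken = 'ºÎ';                       // stored as UTF-8 in this file
echo bin2hex($broken), "\n";          // c2bac38e  (194 186 195 142)

// Step 1: UTF-8 -> ISO-8859-1 collapses each broken character to one byte.
$raw = mb_convert_encoding($broken, 'ISO-8859-1', 'UTF-8');
echo bin2hex($raw), "\n";             // bace

// Step 2: read those two bytes as EUC-KR and re-encode as UTF-8.
$fixed = mb_convert_encoding($raw, 'UTF-8', 'EUC-KR');
echo bin2hex($fixed), "\n";           // ebb680  (235 182 128)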

Please help me. I have many broken strings and I need to recover them.

The online conversion site (http://string-functions.com/encodedecode.aspx) says:

Here, you can simulate what happens if you encode a text file with one encoding and then decode the text with a different encoding. Try e.g. to encode the Swedish characters åäö with utf-8 and then decode them with iso-8859-1, or try to encode 明伯 (simplified Chinese meaning 'to understand') with utf-8 and decode with GB 18030. That will yield the characters: 鏄庝集, which I really can't understand.

This is exactly what I want, and the site recovers my broken strings well (iso-8859-1 to euc-kr). But I want to do the same process in PHP.

Answer 1

mb_convert_encoding() should do this for you:

<?php
$line = 'ºÎ»êÀü´ÜÁö ¹èÆ÷»ç¿ø ¸ðÁý.  2¿ù6ÀÏºÎÅÍ ¤ý»ó¼¼³»¿ëÈ®ÀÎ';
$line = mb_convert_encoding($line, "UTF-8", "EUC-KR");
echo "$line\n";

My result, when I save this PHP in an ISO-8859-1 file, is:

부산전단지 배포사원 모집.  2월6일부터 ㆍ상세내용확인

When I save the PHP source as UTF-8, I get this:

쨘?쨩챗?체쨈??철 쨔챔?첨쨩챌쩔첩 쨍챨?첵.  2쩔첫6??쨘??? 짚첵쨩처쩌쩌쨀쨩쩔챘?짰??
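The difference between the two results suggests that, when the broken text is stored as UTF-8, it first has to be turned back into its original single bytes before it is read as EUC-KR. Below is a minimal sketch of that two-step conversion, assuming the mbstring extension is available. `fix_euckr_mojibake()` and its regular expression are hypothetical, not part of this answer. The pattern only touches even-length runs of characters in U+00A1..U+00FE, so intact Korean text in the same file is left alone, but genuine Latin-1 text made of such characters would be rewritten too.

<?php
// Hypothetical helper: repairs runs of Latin-1 mojibake inside a UTF-8 string.
function fix_euckr_mojibake(string $utf8): string
{
    $fixed = preg_replace_callback(
        // Even-length runs of U+00A1..U+00FE, the range of both EUC-KR bytes.
        '/(?:[\x{00A1}-\x{00FE}]{2})+/u',
        function (array $m): string {
            // Step 1: UTF-8 -> ISO-8859-1 recovers the original raw bytes.
            $raw = mb_convert_encoding($m[0], 'ISO-8859-1', 'UTF-8');
            // Step 2: reinterpret those bytes as EUC-KR and emit UTF-8.
            return mb_convert_encoding($raw, 'UTF-8', 'EUC-KR');
        },
        $utf8
    );
    // preg_replace_callback() returns null on failure; fall back to the input.
    return $fixed ?? $utf8;
}

echo fix_euckr_mojibake('"3234", "ºÎ»êÀü´ÜÁö ¹èÆ÷»ç¿ø ¸ðÁý."'), "\n";

Restricting the conversion to matched runs, instead of converting the whole line, is a design choice for files that mix broken and intact strings, as described in the question.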

Answer 2

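Self-answer

The broken characters are in iso-8859-1, but not exactly. They should be converted to bytes and then converted again to ksc5601. For this, I simply use a mapping table, because ksc5601 does not follow any rule; it uses its own mapping table.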

https://github.com/jihuichoi/correct-broken-korean-iso8859-1-to-utf8

Answer 3

In MySQL terms, this is latin1 to euckr. For example:

ÀÏºÎÅÍ¤ý is hex C0 CF BA CE C5 CD A4 FD,
일부터ㆍ is hex C0CF BACE C5CD A4FD

You should aim for utf8: hex EC9DBC EBB680 ED84B0 E3868D
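These hex values can be checked from PHP as well. The following is a small sketch, assuming the mbstring extension is available; the variable names are only illustrative.

<?php
$target = '일부터ㆍ';                                   // correct text, UTF-8
$euckr  = mb_convert_encoding($target, 'EUC-KR', 'UTF-8');

echo strtoupper(bin2hex($euckr)), "\n";   // C0CFBACEC5CDA4FD
echo strtoupper(bin2hex($target)), "\n";  // EC9DBCEBB680ED84B0E3868D

// Reading the same EUC-KR bytes as ISO-8859-1 reproduces the broken text.
$broken = mb_convert_encoding($euckr, 'UTF-8', 'ISO-8859-1');
echo $broken, "\n";                       // ÀÏºÎÅÍ¤ý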
