Replacing special characters from different encodings in R
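I have a corrupted file in which Windows special characters have been replaced by their UTF-8 "equivalents". I tried to write a function that replaces the special characters according to this table: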
utf2win <- function(x){
  # soll: the characters as they should appear
  soll <- c("À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë",
            "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", "Ø",
            "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä", "å",
            "æ", "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò",
            "ó", "ô", "õ", "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ"
  )
  # ist: the same characters as they appear in the corrupted file
  ist <- c("Ã€", "Ã", "Ã‚", "Ãƒ", "Ã„", "Ã…", "Ã†", "Ã‡", "Ãˆ", "Ã‰",
           "ÃŠ", "Ã‹", "ÃŒ", "Ã", "ÃŽ", "Ã", "Ã", "Ã‘", "Ã’", "Ã“", "Ã”",
           "Ã•", "Ã–", "Ã—", "Ã˜", "Ã™", "Ãš", "Ã›", "Ãœ", "Ã", "Ãž", "ÃŸ",
           "Ã", "Ã¡", "Ã¢", "Ã£", "Ã¤", "Ã¥", "Ã¦", "Ã§", "Ã¨", "Ã©", "Ãª",
           "Ã«", "Ã¬", "Ã­", "Ã®", "Ã¯", "Ã°", "Ã±", "Ã²", "Ã³", "Ã´", "Ãµ",
           "Ã¶", "Ã·", "Ã¸", "Ã¹", "Ãº", "Ã»", "Ã¼", "Ã½", "Ã¾", "Ã¿")
  # replace each corrupted sequence with its intended character
  for (i in seq_along(ist)) {
    x <- gsub(ist[i], soll[i], x)
  }
  return(x)
}
Now for the test:
a <- "Geidorf: GrabengÃ¼rtel"
utf2win(a)
Nothing happens... I guess the problem is that the character "Ã" is not recognized correctly. Do you have a solution for my problem?
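This is an encoding problem. You may be able to fix it, but it is hard to know without the file. readBin is a good option if you cannot force the correct encoding. Here is a summary of what I found: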
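As a rough illustration of the readBin route: read the file as raw bytes, then declare the encoding yourself instead of relying on a guess. The temporary file and its contents are made up for this example, and the real file may need a different encoding declaration.

```r
# Hypothetical stand-in for the real file: the UTF-8 bytes of "Grabengürtel"
path <- tempfile(fileext = ".txt")
writeBin(charToRaw("Grabeng\u00fcrtel"), path)

# Read the whole file as raw bytes, with no decoding at all
bytes <- readBin(path, what = "raw", n = file.size(path))

# Inspecting the bytes shows what is really stored: c3 bc is "ü" in UTF-8
print(bytes[8:9])

# Turn the bytes into a string and declare (rather than guess) the encoding
txt <- rawToChar(bytes)
Encoding(txt) <- "UTF-8"
print(txt)

unlink(path)
```

Looking at the raw bytes first makes it easier to tell whether the file itself is damaged or whether it is only being read with the wrong encoding.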
I tried iconv on the sample string:
iconv(a, "UTF-8", "WINDOWS-1252")
#[1] "Geidorf: Grabengürtel"
It works, but you are right, there is a problem with "Ã":
iconv("Geidorf: GrabengÃ¼rtel Ã", "UTF-8", "WINDOWS-1252")
#[1] NA
We can see which letters are problematic:
ist[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))]
[1] "Ã" "Ã" "Ã" "Ã" "Ã" "Ã"
# corresponding characters
paste(soll[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))])
[1] "Á" "Í" "Ï" "Ð" "Ý" "à"
The website you linked to has a related page that explains what the problem is:
Encoding Problem: Double Mis-Conversion
Symptom
With this particular double conversion, most characters display
correctly. Only characters with a second UTF-8 byte of 0x81, 0x8D,
0x8F, 0x90, 0x9D fail. In Windows-1252, the following characters with
the Unicode code points: U+00C1, U+00CD, U+00CF, U+00D0, and U+00DD
will show the problem. If you look at the I18nQA Encoding Debug Table
you can see that these characters in UTF-8 have second bytes ending in
one of the Unassigned Windows code points.
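Á Í Ï Ð Ý
"à" is a different case. You have it mapped to "Ã" when it should be "Ã\u00A0" or "Ã\xA0" or "Ã " (note that the space is not a normal space; it is a non-breaking space). So fixing that in ist takes care of one letter.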
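Both observations can be checked at the byte level. The sketch below uses only base R and writes the characters as \u escapes so it does not depend on how this page is encoded. The variable names and the sample word are made up for illustration.

```r
# Á Í Ï Ð Ý (from the quoted page) plus à
chars <- c("\u00c1", "\u00cd", "\u00cf", "\u00d0", "\u00dd", "\u00e0")

# UTF-8 bytes of each character; the first byte is 0xC3 for all of them
bytes <- lapply(chars, function(ch) charToRaw(enc2utf8(ch)))
second_byte <- as.raw(vapply(bytes, function(b) as.integer(b[2]), integer(1)))
print(second_byte)
# 81 8d 8f 90 9d are the unassigned Windows-1252 positions named in the quote;
# a0 is the non-breaking space in Windows-1252

# For "à" the pattern therefore has to be "Ã" followed by U+00A0,
# not "Ã" followed by a regular space
x <- "voil\u00c3\u00a0"   # made-up sample: "voilà" after the mis-conversion
repaired  <- gsub("\u00c3\u00a0", "\u00e0", x, fixed = TRUE)
unchanged <- gsub("\u00c3 ", "\u00e0", x, fixed = TRUE)
print(repaired)
```

Writing the pattern with an explicit escape avoids the invisible difference between the two kinds of space when the code is copied around.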
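As for the other characters ("Á", "Í", "Ï", "Ð", and "Ý"), they are all mapped to "Ã" in ist, and as long as that is true you will never be able to make the proper substitutions.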
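Since a lookup table cannot tell those five characters apart, one alternative is to undo the mis-conversion as a whole, building on the iconv call shown earlier. This is only a sketch under assumptions: fix_mojibake is a hypothetical helper name, it assumes the text is UTF-8 that was wrongly read as Windows-1252, and it leaves a string unchanged whenever the conversion fails or does not produce valid UTF-8.

```r
# Hypothetical helper: undo "UTF-8 read as Windows-1252" where possible
fix_mojibake <- function(x) {
  # Map each character back to the single Windows-1252 byte it came from
  bytes <- iconv(x, from = "UTF-8", to = "WINDOWS-1252")
  # Those bytes are assumed to be the original UTF-8 text, so mark them as such
  Encoding(bytes) <- "UTF-8"
  # Accept the result only if iconv succeeded and the bytes are valid UTF-8
  ok <- !is.na(bytes) & validUTF8(bytes)
  out <- x
  out[ok] <- bytes[ok]
  out
}

a <- "Geidorf: Grabeng\u00c3\u00bcrtel"   # the sample string, written with escapes
print(fix_mojibake(a))
print(fix_mojibake("caf\u00e9"))           # already correct, so left as it is
```

Strings containing one of the five problematic characters may still make iconv return NA, as in the output above. The helper then returns the input untouched, so those cases still have to be repaired from the raw bytes of the file.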