iconv() returns NA when given a string with a specific special character

Question

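I am trying to convert some strings from an input file from UTF-8 to ASCII. For most of the strings I pass in, the conversion with iconv() works fine. For some of them, however, it returns NA. Fixing the problem by hand in the file would seem the simplest option, but unfortunately that is not available to me at the moment.

I have made a reproducible example of my problem, but let's assume I have to find a way for iconv() to somehow convert the string in s1 instead of returning NA.

Here is the reproducible example: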

s1 <- "Besançon" #as read from an input file I cannot modify
s2 <- "Paris"
s3 <- "Linköping"
s4 <- "Besançon" #Manual input for testing

s1 <- iconv(s1, to='ASCII//TRANSLIT')
s2 <- iconv(s2, to='ASCII//TRANSLIT')
s3 <- iconv(s3, to='ASCII//TRANSLIT')
s4 <- iconv(s4, to='ASCII//TRANSLIT')

I get the following output:

> s1
[1] NA
> s2
[1] "Paris"
> s3
[1] "Link\"oping"
> s4
[1] "Besancon"

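After digging into the code, I figured out that there is something wrong with the entry "Besançon" as copied exactly from the input file. When I type it in manually myself, the problem goes away. Since I cannot modify the input file at all, what do you think the exact problem is? Do you know how to solve it?

Thanks in advance,

Edit:

On closer inspection, there is something odd in the first line. It seems to get stripped by SO's formatting. To reproduce it, the best I can give are two images describing it. In the first image my cursor is placed just before the #. The second image is after pressing the delete key, which should have removed the white space... but it removed the " instead. So there is definitely something strange there.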

Answer 1

Your presumed UTF-8 file may contain latin1 (or some other encoding) characters. For example:

> latin=iconv('Besançon','utf8','latin1')
> iconv(latin,to='ascii//translit')
[1] NA
> iconv(latin,'utf8','ascii//translit')
[1] NA
> iconv(latin,'latin1','ascii//translit')
[1] "Besancon"
> iconv(latin,'Windows-1250','ascii//translit')
[1] "Besancon"
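
One way to check for this without guessing is base R's validUTF8(), which looks at the raw bytes of each string. This is only a minimal sketch: latin is built the same way as in the example above, and the name is_utf8 is just for illustration.

```r
# Build a latin1-encoded copy of the string, as in the example above
latin <- iconv("Besan\u00e7on", from = "UTF-8", to = "latin1")

# validUTF8() inspects the bytes and ignores any declared encoding;
# the latin1 byte 0xE7 followed by "o" is not a valid UTF-8 sequence
is_utf8 <- validUTF8(c(latin, "Paris"))
print(is_utf8)
```

Strings flagged FALSE here are the ones that probably need a different 'from' encoding.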

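You could, for example, create a new vector or data column holding the result for each charset encoding in your data, falling back to the next one whenever a result is NA, e.g.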

utf8 = iconv(x,'utf8','ascii//translit')
latin1 = iconv(x,'latin1','ascii//translit')
win1250 = iconv(x,'Windows-1250','ascii//translit')
result = ifelse(
  is.na(utf8),
  ifelse(
    is.na(latin1),
    win1250,
    latin1
  ),
  utf8
)

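If those encodings don't work, make a file containing only the problem words, then use the unix/linux file command to detect the encoding, or try some likely encodings.

In the past I have simply listed all the encodings iconv supports, tried them all with lapply, and used whatever result was valid for each string. However, some "from" encodings will return a non-NA but incorrect result, so it is best to try this on each unique character in your data to work out which subset of iconv encodings to use and in what order.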
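
A rough sketch of that brute-force approach, using sapply rather than lapply for a more compact result. The candidate list and the names problem and attempts are hypothetical; in practice the candidates could be a subset of iconvlist(). The tryCatch is there because an encoding name the platform does not support raises an error rather than returning NA.

```r
# Hypothetical candidate list; in practice this could be a subset of iconvlist()
candidates <- c("UTF-8", "latin1", "Windows-1250")

# A latin1-encoded test string standing in for a problem word from the file
problem <- iconv("Besan\u00e7on", from = "UTF-8", to = "latin1")

# Try each candidate as the 'from' encoding; unsupported names become NA instead of an error
attempts <- sapply(candidates, function(enc) {
  tryCatch(
    iconv(problem, from = enc, to = "ASCII//TRANSLIT"),
    error = function(e) NA_character_
  )
})

# Non-NA results still need a visual check, since a wrong 'from' can succeed silently
print(attempts[!is.na(attempts)])
```

The result is a character vector named by encoding, so it is easy to compare the candidates side by side before settling on an order.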

Answer 2

It turns out that using sub='' actually solves the problem, although I am not sure why.

iconv(s1, to='ASCII//TRANSLIT', sub='')

From the documentation for sub:

character string. If not NA it is used to replace any non-convertible bytes in the input. (This would normally be a single character, but can be more.) If "byte", the indication is "<xx>" with the hex code of the byte. If "Unicode" and converting from UTF-8, the Unicode point in the form "<U+xxxx>".

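So I eventually figured out that there is a character in the string that I cannot convert (or even see), and using sub is one way to get rid of it. I am still not sure what that character is. The problem is solved, though.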
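
To find out what the invisible character is, one option is to look at the code points directly. A minimal sketch, assuming purely for illustration that the stray character is a no-break space (U+00A0); the actual character in the file is unknown, and the names s, codes and non_ascii are made up for this example.

```r
# Hypothetical input: "Besançon" followed by a no-break space (U+00A0)
s <- "Besan\u00e7on\u00a0"

# utf8ToInt() returns one Unicode code point per character
codes <- utf8ToInt(s)

# Keep only the non-ASCII code points and format them as U+XXXX
non_ascii <- sprintf("U+%04X", codes[codes > 127])
print(non_ascii)
```

On the real data, s would be the string as read from the file. Any unexpected code point in the output is a likely candidate for the character that sub='' is dropping.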
