R string encoding best practices
I ran into the same problem as described in this post:
That is, I have vectors of strings that look exactly the same, but comparing them with == returns FALSE.
An example:
> allowed_stock_exchanges
[1] "Australian Securities Exchange" "Borsa Italiana SpA"
[3] "Canadian Securities Exchange" "Euronext Amsterdam"
[5] "Euronext Brussels" "Euronext Lisbon"
[7] "Euronext Paris" "Frankfurt"
[9] "Irish Stock Exchange" "London Stock Exchange"
[11] "Mercado Continuo Espanol (SIBE)" "NASDAQ"
[13] "NASDAQ OMX Stockholm" "NYSE"
[15] "NYSE MKT LLC" "OMX Nordic Copenhagen"
[17] "OMX Nordic Helsinki" "Oslo Bors"
[19] "OTC" "Swiss SIX Exchange"
[21] "Toronto" "Vienna Stock Exchange"
[23] "XETRA"
> available_stock_exchanges
[1] "NYSE" "NASDAQ"
[3] "OTC" "NYSE MKT LLC"
[5] "London Stock Exchange" "TSX Venture Exchange"
[7] "Philippine Stock Exchange" "Toronto"
[9] "Australian Securities Exchange" "Korea Stock Exchange"
[11] "Kuala Lumpur" "New Zealand Exchange Ltd"
[13] "Singapore" "XETRA"
[15] "Vienna Stock Exchange" "Canadian Securities Exchange"
[17] "Frankfurt" "NSX Australia"
[19] "NASDAQ OMX Stockholm" "Mercado Continuo Espanol (SIBE)"
[21] "Euronext Paris" "Euronext Brussels"
[23] "OMX Nordic Copenhagen" "Swiss SIX Exchange"
[25] "Euronext Amsterdam" "Borsa Italiana SpA"
[27] "OMX Nordic Helsinki" "Oslo Bors"
[29] "Euronext Lisbon" "Dusseldorf"
[31] "Irish Stock Exchange" "Hamburg Stock Exchange"
[33] "Luxembourg" "OMX Nordic Iceland"
[35] "Warsaw Stock Exchange" "Norwegian OTC Market"
[37] "Buenos Aires" "Berlin"
[39] "Hong Kong" "Berne Stock Exchange"
[41] "Johannesburg" "Nordic Growth Market"
[43] "Athens Stock Exchange"
> allowed_stock_exchanges[1] == available_stock_exchanges[9]
[1] FALSE
> charToRaw(allowed_stock_exchanges[1])
[1] 41 75 73 74 72 61 6c 69 61 6e c2 a0 53 65 63 75 72 69 74 69 65 73 c2 a0 45 78 63 68 61 6e 67
[32] 65
> charToRaw(available_stock_exchanges[9])
[1] 41 75 73 74 72 61 6c 69 61 6e 20 53 65 63 75 72 69 74 69 65 73 20 45 78 63 68 61 6e 67 65
> Encoding(allowed_stock_exchanges)
[1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "unknown" "UTF-8"
[10] "UTF-8" "UTF-8" "unknown" "UTF-8" "unknown" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
[19] "unknown" "UTF-8" "unknown" "UTF-8" "unknown"
> Encoding(available_stock_exchanges)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[10] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[19] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[28] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[37] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
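My questions in this mess:
What is the best practice for avoiding this kind of problem? For example, what should I check or adapt when reading string data from different sources?
How can I homogenize these two vectors now, so that comparing them for equality really tells me they are equal?
Any suggestions?
Edit: I found out that the "unknown" entries are ASCII. Unfortunately, I still don't know how to "homogenize" everything.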
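The charToRaw() call is very helpful. One of your strings encodes the space as the raw byte 20 (an ASCII space), while the other encodes it as C2 A0, a "non-breaking space". R treats these as different characters, so the strings do not match.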
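To make the mismatch concrete, here is a minimal sketch that rebuilds the two strings by hand and locates the characters that differ. The variables a and b are hypothetical stand-ins for allowed_stock_exchanges[1] and available_stock_exchanges[9]; they are not read from the original data.

```r
# Hypothetical stand-ins for one element of each vector from the question.
a <- "Australian\u00a0Securities\u00a0Exchange"  # non-breaking spaces (U+00A0)
b <- "Australian Securities Exchange"            # plain ASCII spaces (U+0020)

a == b                  # FALSE, although both print the same
nchar(a) == nchar(b)    # TRUE: same number of characters ...
length(charToRaw(a))    # ... but 32 bytes here
length(charToRaw(b))    # versus 30 bytes here

# Code points at the positions where the two strings differ.
cp_a <- utf8ToInt(a)
cp_b <- utf8ToInt(b)
diff_pos <- which(cp_a != cp_b)
diff_pos                           # 11 22
sprintf("U+%04X", cp_a[diff_pos])  # "U+00A0" "U+00A0"
```

Comparing code points position by position only works like this when both strings have the same number of characters, as they do here.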
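To fix this, you could write a small function that converts everything to a consistent format. For example, to handle the non-breaking space problem, you could use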
allowed_stock_exchanges <- gsub("\u00a0", " ", allowed_stock_exchanges)
where \u00a0 is the Unicode code point of that character; in UTF-8 it is encoded as the bytes C2 A0. (See sites such as https://www.utf8-chartable.de/ for these relationships.)
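The "small function" idea could look roughly like the sketch below. The name normalize_strings and the sample vectors are hypothetical. The enc2utf8() and trimws() steps are my own additions on top of the gsub() call, on the assumption that you also want a uniform declared encoding and no stray leading or trailing whitespace.

```r
# Hypothetical helper: bring strings from different sources into one form.
normalize_strings <- function(x) {
  x <- enc2utf8(x)                           # same declared encoding everywhere
  x <- gsub("\u00a0", " ", x, fixed = TRUE)  # non-breaking space -> plain space
  trimws(x)                                  # assumption: edge whitespace is noise
}

# Small made-up samples in the spirit of the vectors from the question.
allowed   <- c("Australian\u00a0Securities\u00a0Exchange", "NYSE")
available <- c("NYSE", "Australian Securities Exchange", "Berlin")

allowed %in% available                                        # FALSE TRUE
normalize_strings(allowed) %in% normalize_strings(available)  # TRUE TRUE
```

Applying the same function to both vectors right after reading the data keeps every later comparison (==, %in%, match(), merge()) working on the same representation.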
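The tools::showNonASCII() function helps to detect other potential problems. It will flag things like the non-breaking space. What it will not do is limit itself to cases where strings look the same but are encoded differently, so it will also flag accented characters.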
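As a rough illustration, the sketch below runs tools::showNonASCII() on a made-up vector and adds a hypothetical helper, has_non_ascii(), that returns a logical index instead of printing. The accented "Español" entry is invented to show that legitimate accented characters are flagged too.

```r
# Made-up sample: a non-breaking space, a clean string, an accented character.
x <- c("Australian\u00a0Securities\u00a0Exchange",
       "NYSE",
       "Mercado Continuo Espa\u00f1ol")

# Prints the offending elements, with non-ASCII bytes shown as <xx> escapes.
tools::showNonASCII(x)

# Hypothetical helper: TRUE for every element holding a code point above 127.
has_non_ascii <- function(x) {
  vapply(enc2utf8(x), function(s) any(utf8ToInt(s) > 127L),
         logical(1), USE.NAMES = FALSE)
}

has_non_ascii(x)     # TRUE FALSE TRUE
x[has_non_ascii(x)]  # the elements worth inspecting by hand
```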
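Another possibility is to use
allowed_stock_exchanges <- iconv(allowed_stock_exchanges, to = "ASCII//TRANSLIT")
which tries to convert to ASCII, making substitutions where necessary. This turns accented characters into somewhat odd ASCII versions (I saw "é" converted to "'e"), so you need to apply it to both strings before comparing them. Depending on your system, the conversion may differ from what I get.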
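A minimal sketch of applying the transliteration to both sides before comparing is shown below. The wrapper to_ascii() and the sample vectors are hypothetical. Which replacement is chosen for each character, and whether "//TRANSLIT" is supported at all, depends on the platform's iconv implementation, so check the converted values rather than assuming them.

```r
# Hypothetical wrapper: transliterate to ASCII from a known source encoding.
to_ascii <- function(x) {
  iconv(enc2utf8(x), from = "UTF-8", to = "ASCII//TRANSLIT")
}

# Made-up samples in the spirit of the vectors from the question.
allowed   <- c("Australian\u00a0Securities\u00a0Exchange", "NYSE")
available <- c("NYSE", "Australian Securities Exchange")

allowed_ascii   <- to_ascii(allowed)
available_ascii <- to_ascii(available)

allowed_ascii         # inspect: the substitution for U+00A0 is platform dependent
anyNA(allowed_ascii)  # TRUE would mean some element could not be converted

# Compare only after both sides went through the same conversion.
match(allowed_ascii, available_ascii)
```

Strings that are already pure ASCII should pass through unchanged, so the conversion is safe to apply to a whole vector rather than only to the suspicious elements.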