Language dependent sorting with R

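1) How do I sort correctly?

The task is to sort the abbreviated US state names according to the English alphabet. But I noticed that R sorts lists based on some operating system language or regional setting. For example, in my language (Lithuanian), even the Latin (non-Lithuanian) letters are ordered differently than in the English alphabet. Compare the order of just the non-Lithuanian letters in both alphabets: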

"ABCDEFGHI Y JKLMNOPRSTUVZ"

sort(LETTERS)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "Y" "J" "K" "L" "M" "N"
[16] "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Z"

versus

"ABCDEFGHIJKLMNOPQRSTUVWX Y Z"

LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
[16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

As a result, the state abbreviations also sort in a different order (note the last two; they should be "WV" and then "WY"):

sort(state.abb)
 [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA"
[13] "ID" "IL" "IN" "KY" "KS" "LA" "MA" "MD" "ME" "MI" "MN" "MO"
[25] "MS" "MT" "NC" "ND" "NE" "NH" "NY" "NJ" "NM" "NV" "OH" "OK"
[37] "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA" "VT" "WA" "WI"
[49] "WY" "WV"

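I tried Sys.setlocale("LC_TIME","English_United States.1252"). It helps to get English weekday names in plots, charts and figures.

Now I need help sorting correctly, the "English" way.

2) What other important language-related R settings should a beginner R user be aware of?

If you know of places where R's behavior depends on the language, and how to deal with them, please list them.

LC_TIME only controls the language used for date/time formatting. Sorting is governed by the collation category, LC_COLLATE. For your purposes, LC_ALL (which sets every category, including LC_COLLATE) should do the trick: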

Sys.setlocale('LC_ALL', 'English_United States.1252')
sort(letters)

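However, note that these settings are OS-specific. For example, the above does not work on a typical Unix system. There, the string 'en_US.UTF-8' is usually a good setting, but under Windows it may itself cause problems, because R's Unicode support on Windows is sketchy.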
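Because locale names differ between operating systems, a portable alternative is to sort in the "C" locale, which orders strings by byte value and therefore gives plain A-Z order for uppercase ASCII strings such as the state abbreviations. The following is a minimal sketch, not the only way to do it; the helper name sort_in_c_locale is hypothetical, and it assumes all strings are ASCII in a single case.

```r
# Hypothetical helper: sort with LC_COLLATE temporarily set to "C",
# then restore the previous collation locale on exit.
sort_in_c_locale <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

res <- sort_in_c_locale(state.abb)

# The radix method is documented to use C-locale ordering for character
# vectors regardless of the current locale (R >= 3.3.0).
res_radix <- sort(state.abb, method = "radix")

tail(res, 2)
tail(res_radix, 2)
```

Note that byte order puts all uppercase letters before all lowercase ones, so this approach is not suitable for mixed-case or accented text.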

I am not familiar with R, but it seems to have the same problem as many other programming languages: a lack of native Unicode support in the standard library. By "Unicode support" I mean an implementation of Chapter 3 of the Unicode Standard (http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf), of the annexes to the Unicode Standard (especially the one that deals with collation, http://unicode.org/reports/tr10/) and of up-to-date versions of CLDR (http://cldr.unicode.org/). Essentially, the rules for sorting are ambiguous and cannot be standardized without picking some "true" method and neglecting cultural differences. This has been partially mitigated by allowing multiple collation levels that neglect certain details (like diacritic marks), by creating the case-folding algorithm (in some cases toLower(toUpper(str)) != toLower(str)), and by defining collation rules through the CLDR database, but the problem remains. There are also issues like context-dependent comparison (http://unicode.org/reports/tr10/#Contextual_Sensitivity), which require you to use a mature solution that conforms to the Unicode Standard if you want to do 'correct' string comparison.

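There is a well-known library called ICU (International Components for Unicode) which, compared to other libraries, implements a large part of the Unicode Standard. It has implementations in C/C++ and Java (all open source, with a BSD-like license), and there are bindings of the C version to other languages, including R (https://cran.r-project.org/web/packages/stringi/, http://site.icu-project.org/related). So you can use the 'stringi' package to process text with ICU locales and collation facilities.

Update: To use the ICU collation methods, you need to obtain ICU4C (how depends on your operating system) and then install the R package: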

install.packages('stringi')

Then you should load it:

library(stringi)

After that you can use functions of this kind (http://docs.rexamine.com/R-man/stringi/stri_compare.html). You can pass additional parameters to the collator being created at the end of these functions (http://docs.rexamine.com/R-man/stringi/stri_opts_collator.html), which affect how the comparison is performed.

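# Is "WV" less than "WY"? FALSE in Lithuanian, where Y sorts before V
stri_cmp_lt("WV", "WY", locale="lt_LT")
# TRUE in US English
stri_cmp_lt("WV", "WY", locale="en_US")
# Three-way comparison (-1, 0 or 1) at collation strength 1; equal strings give 0
stri_compare("WV", "WV", locale="en_US", strength=1)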

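For example, the 'strength' parameter above sets the so-called 'collation level' (http://unicode.org/reports/tr10/#Notation). The locale is specified by Language and Country Codes as specified here (http://userguide.icu-project.org/locale). You can use these functions to implement a custom sort function (e.g. a quicksort that uses them for comparison), since the built-in functions do not seem to offer any way to change the sort predicate.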
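To make the collation levels more concrete, here is a small sketch of how the strength setting changes what counts as "equal". It assumes stringi is installed and uses made-up example strings; only the zero versus non-zero results are relied on here.

```r
library(stringi)

# Strength 1 (primary): only base letters matter, so both case and
# accents are ignored and the strings compare as equal (result 0).
primary <- stri_compare("resume", "R\u00e9sum\u00e9",
                        locale = "en_US", strength = 1)

# Strength 2 (secondary): accents become significant, case is still ignored.
secondary_case   <- stri_compare("resume", "RESUME",
                                 locale = "en_US", strength = 2)
secondary_accent <- stri_compare("resume", "r\u00e9sum\u00e9",
                                 locale = "en_US", strength = 2)

# Strength 3 (tertiary): case differences are significant as well.
tertiary <- stri_compare("resume", "RESUME",
                         locale = "en_US", strength = 3)

c(primary = primary, secondary_case = secondary_case,
  secondary_accent = secondary_accent, tertiary = tertiary)
```

A lower strength is useful for lenient matching (for example, searching while ignoring accents), while the default tertiary level is usually what you want for sorting.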

Update 2: Or, even better than implementing your own sort, just use the stri_sort function, which lets you specify a custom ICU collator (http://docs.rexamine.com/R-man/stringi/stri_order.html), like this:

stri_sort(state.abb, locale="en_US")
stri_sort(state.abb, locale="lt_LT")

[1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN"
[16] "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH"
[31] "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA"
[46] "VT" "WA" "WI" "WV" "WY"
 [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN"
[16] "KY" "KS" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH"
[31] "NY" "NJ" "NM" "NV" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA"
[46] "VT" "WA" "WI" "WY" "WV"

Note that WV and WY now end up in different positions depending on the locale.
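If the strings are a column of a data frame rather than a bare vector, stri_order (from the same documentation page) returns the ordering permutation instead of the sorted values. A minimal sketch, assuming stringi is installed; the data frame and its column names are made up for illustration:

```r
library(stringi)

# Illustrative data frame built from R's built-in state datasets
states <- data.frame(abb = state.abb, name = state.name,
                     stringsAsFactors = FALSE)

# Row order according to the ICU collator for US English
ord <- stri_order(states$abb, locale = "en_US")
states_sorted <- states[ord, ]

tail(states_sorted, 2)
```

This keeps the result independent of the session's LC_COLLATE setting, because the locale is passed explicitly to the collator.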