Most efficient way to write a mapping function given a large CSV of recoded data

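Suppose I have a data frame loaded from a large CSV that someone gave me, containing mapping/recode data that I want to apply to other datasets. Here is a small reproducible example of what the CSV might contain: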

library(wakefield)
csv_mapping <- data.frame(
  from = as.character(name(30)),
  to = as.character(likert_7(30))  
)

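What is the fastest way to create a mapping function from this data frame that does not depend on the CSV data source? I usually do it by running: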

dput(csv_mapping$from)
dput(csv_mapping$to)

in my console, and then I copy and paste the resulting vectors into a function that uses plyr::mapvalues(), like this:

mapping_fn <- function(x) {

  fromvec <- c("Kameira", "Sanavi", "Avangelene", "Maryonna", "Wyvonna", "Enam", 
               "Yain", "Tyonna", "Shekira", "Eleanna", "Azriela", "Saajida", 
               "Chantee", "Julieanne", "Genisha", "Delesha", "Macenzi", "Alyasia", 
               "Latonga", "Josuhe", "Arter", "Stone", "Ramaj", "Lilinoe", "Zacharie", 
               "Joshuamichael", "Desseray", "Colorado", "Jaidn", "Verline")

  tovec <- c("Agree", "Somewhat Disagree", "Agree", "Agree", "Neutral", 
          "Somewhat Disagree", "Neutral", "Strongly Agree", "Somewhat Disagree", 
          "Disagree", "Strongly Disagree", "Disagree", "Somewhat Agree", 
          "Strongly Disagree", "Strongly Disagree", "Somewhat Agree", "Strongly Agree", 
          "Somewhat Agree", "Disagree", "Disagree", "Strongly Agree", "Strongly Disagree", 
          "Disagree", "Somewhat Agree", "Strongly Disagree", "Strongly Disagree", 
          "Neutral", "Somewhat Agree", "Agree", "Disagree")

  # Values of x not found in fromvec are returned unchanged
  plyr::mapvalues(x, from = fromvec, to = tovec, warn_missing = FALSE)

}

Given that plyr is now considered retired, is there a smarter or quicker way to do this without using mapvalues?

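A natural approach is to use a join. This is especially useful if your data are already in a data frame, but you can adapt it if you really only want the vector of mapped values.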

Suppose we have a mapping defined by the CSV, as follows:

csv_mapping <- data.frame(from = c("Kameira", "Sanavi", "Avangelene", 
                                   "Maryonna", "Wyvonna"),
                          to = c("Agree", "Somewhat Disagree", "Agree",
                                 "Agree", "Neutral"))

csv_mapping
#>         from                to
#> 1    Kameira             Agree
#> 2     Sanavi Somewhat Disagree
#> 3 Avangelene             Agree
#> 4   Maryonna             Agree
#> 5    Wyvonna           Neutral

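Then suppose we have a data frame df whose column x holds the values we want to map to new values. Note that df can contain other columns as well; here we add some random values for demonstration.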

df <- data.frame(x = c("Sanavi", "Maryonna", "Maryonna", "Wyvonna",
                       "Kameira","Avangelene", "Sanavi", "Wyvonna"),
                 vals = rnorm(8))

df
#>            x        vals
#> 1     Sanavi -0.95005745
#> 2   Maryonna -0.20650715
#> 3   Maryonna -0.07755789
#> 4    Wyvonna  1.72379970
#> 5    Kameira -1.36642679
#> 6 Avangelene -1.48638577
#> 7     Sanavi  0.16987157
#> 8    Wyvonna -0.55194346

We can then use dplyr's left_join to bring the mapped values into the data frame. (You can read more here.)

dplyr::left_join(df, csv_mapping, by = c("x" = "from"))
#>            x        vals                to
#> 1     Sanavi -0.95005745 Somewhat Disagree
#> 2   Maryonna -0.20650715             Agree
#> 3   Maryonna -0.07755789             Agree
#> 4    Wyvonna  1.72379970           Neutral
#> 5    Kameira -1.36642679             Agree
#> 6 Avangelene -1.48638577             Agree
#> 7     Sanavi  0.16987157 Somewhat Disagree
#> 8    Wyvonna -0.55194346           Neutral

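At this point you have, for each value of x, the corresponding to value from the given mapping. If you only want those to values, you can simply pull the to column out of the data frame.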
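As a minimal sketch of that last step, the join can be wrapped in a small helper that takes a plain character vector and returns only the mapped values. The name map_by_join is hypothetical (it is not part of dplyr), and the sketch assumes the mapping has from and to columns with no duplicated from values, since duplicates would make the join return extra rows.

```r
# Mapping table with the same columns as the CSV example above
csv_mapping <- data.frame(
  from = c("Kameira", "Sanavi", "Avangelene", "Maryonna", "Wyvonna"),
  to   = c("Agree", "Somewhat Disagree", "Agree", "Agree", "Neutral"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up each element of x in the mapping via a left join
map_by_join <- function(x, mapping) {
  joined <- dplyr::left_join(
    data.frame(x = x, stringsAsFactors = FALSE),  # one row per input value
    mapping,
    by = c("x" = "from")                          # match x against the from column
  )
  joined$to  # left_join keeps the row order of x; unmatched values become NA
}

map_by_join(c("Sanavi", "Wyvonna", "Unknown"), csv_mapping)
#> [1] "Somewhat Disagree" "Neutral"           NA
```

Unlike plyr::mapvalues(), which leaves unmatched values unchanged, this returns NA for values missing from the mapping. If the original value should be kept instead, something like dplyr::coalesce(joined$to, x) could replace the last line of the helper.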

Created on 2020-06-03 by the reprex package (v0.3.0)

A very simple solution is to use recode from the dplyr package:

level_key <- setNames(csv_mapping$to, csv_mapping$from)
dplyr::recode(csv_mapping$from, !!!level_key)

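Basically, we create a named vector level_key that holds the key-value pairs (names are the old values, elements are the new ones), and then splice it into the recode call with the unquote-splice operator !!!.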
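To make the splicing concrete, here is a small sketch showing that !!!level_key behaves as if each name-value pair had been typed out as a separate argument to recode. The two-entry key and the vector x are made up for illustration.

```r
# A tiny named vector: names are the old values, elements are the new values
level_key <- c(Kameira = "Agree", Sanavi = "Somewhat Disagree")
x <- c("Sanavi", "Kameira", "Sanavi")

# Splicing the named vector into the call ...
spliced <- dplyr::recode(x, !!!level_key)

# ... is equivalent to writing every pair out by hand
explicit <- dplyr::recode(x, Kameira = "Agree", Sanavi = "Somewhat Disagree")

identical(spliced, explicit)
#> [1] TRUE
```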


Example

library(wakefield)
set.seed(42)
csv_mapping <- data.frame(
  from = as.character(name(5)),
  to = as.character(likert_7(5))  
)
csv_mapping

#       from                to
# 1 Merrissa Strongly Disagree
# 2  Lilbert           Neutral
# 3  Rudelle    Strongly Agree
# 4  Kaymani Somewhat Disagree
# 5   Kenadi          Disagree

level_key <- setNames(csv_mapping$to, csv_mapping$from)
dplyr::recode(csv_mapping$from, !!!level_key)
# [1] "Strongly Disagree" "Neutral"           "Strongly Agree"    "Somewhat Disagree" "Disagree"
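The example above recodes the mapping's own from column. As a hedged sketch of applying the same key to a different vector, the snippet below uses a made-up vector other that contains one value absent from the key. For character input, recode leaves unmatched values unchanged unless .default is supplied.

```r
# Two entries taken from the mapping printed above
level_key <- c(Merrissa = "Strongly Disagree", Lilbert = "Neutral")

# Made-up input; "Someone Else" does not appear in the key
other <- c("Lilbert", "Someone Else", "Merrissa")

# Unmatched values pass through unchanged
dplyr::recode(other, !!!level_key)
#> [1] "Neutral"           "Someone Else"      "Strongly Disagree"

# Supplying .default turns unmatched values into NA instead
dplyr::recode(other, !!!level_key, .default = NA_character_)
#> [1] "Neutral"           NA                  "Strongly Disagree"
```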

So, based on Ric S's answer above, I can keep my original approach but use dplyr instead of plyr, like so:

mapping_fn <- function(x) {

    fromvec <-  c("Kameira", "Sanavi", "Avangelene", "Maryonna", "Wyvonna", "Enam", 
                 "Yain", "Tyonna", "Shekira", "Eleanna", "Azriela", "Saajida", 
                 "Chantee", "Julieanne", "Genisha", "Delesha", "Macenzi", "Alyasia", 
                 "Latonga", "Josuhe", "Arter", "Stone", "Ramaj", "Lilinoe", "Zacharie", 
                 "Joshuamichael", "Desseray", "Colorado", "Jaidn", "Verline")
    
    tovec <- c("Agree", "Somewhat Disagree", "Agree", "Agree", "Neutral", 
               "Somewhat Disagree", "Neutral", "Strongly Agree", "Somewhat Disagree", 
               "Disagree", "Strongly Disagree", "Disagree", "Somewhat Agree", 
               "Strongly Disagree", "Strongly Disagree", "Somewhat Agree", "Strongly Agree", 
               "Somewhat Agree", "Disagree", "Disagree", "Strongly Agree", "Strongly Disagree", 
               "Disagree", "Somewhat Agree", "Strongly Disagree", "Strongly Disagree", 
               "Neutral", "Somewhat Agree", "Agree", "Disagree")

  
    level_key <- setNames(tovec, fromvec)
    dplyr::recode(x, !!!level_key)
    
}
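One possible shortcut, sketched here under the assumption that the goal is still a hard-coded function with no dependency on the CSV: calling dput() on the named vector itself prints a single c(name = "value", ...) expression, so only one object needs to be pasted instead of two parallel vectors. The five-row mapping below is a shortened stand-in for the real data.

```r
# Shortened stand-in for the mapping loaded from the CSV
csv_mapping <- data.frame(
  from = c("Kameira", "Sanavi", "Avangelene", "Maryonna", "Wyvonna"),
  to   = c("Agree", "Somewhat Disagree", "Agree", "Agree", "Neutral"),
  stringsAsFactors = FALSE
)

# Build the named vector once, then print it in a form that can be pasted
level_key <- setNames(csv_mapping$to, csv_mapping$from)
dput(level_key)

# The printed expression pasted into a self-contained function
mapping_fn <- function(x) {
  level_key <- c(Kameira = "Agree", Sanavi = "Somewhat Disagree",
                 Avangelene = "Agree", Maryonna = "Agree", Wyvonna = "Neutral")
  dplyr::recode(x, !!!level_key)
}

mapping_fn(c("Sanavi", "Wyvonna"))
#> [1] "Somewhat Disagree" "Neutral"
```

This keeps each old value next to its replacement in the source, which makes it harder for the two vectors to drift out of alignment when the function is edited by hand.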